adjust Silver Job Runs module configuration
enable auto-optimized shuffle for module 2011
originally implemented for Spark 3.1.2 in commit https://github.com/databrickslabs/overwatch/commit/d751d5fc75c939892b73f877cb0e5542eb2cc030 on branch 1228-silver-job-runs-spark312-r0812 as part of #1253.
This PR removes all of the new utilities and transformation refactoring that were only aids to development and testing. They did not impact performance in any significant way.
The essential change brought to this branch (1228-optimization-only) is entirely expressed in commit https://github.com/databrickslabs/overwatch/commit/8c9ee79d20a4904ecd5aa2908715179c58e615e1. The new code introduced is here:
https://github.com/databrickslabs/overwatch/blob/8c9ee79d20a4904ecd5aa2908715179c58e615e1/src/main/scala/com/databricks/labs/overwatch/pipeline/Silver.scala#L271-L274
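In rough terms, the change amounts to attaching a per-module Spark override rather than flipping the flag globally. The sketch below is illustrative only (the literal patch is at the link above); the `withSparkOverrides` helper and the `Module(...)` argument list shown here are assumptions, not verified signatures:

```scala
// Hypothetical sketch of enabling Databricks auto-optimized shuffle (AOS)
// for module 2011 only, via a module-scoped Spark override instead of the
// global setting that #790 removed. Names and signatures are illustrative.
lazy private[overwatch] val jobRunsModule =
  Module(2011, "Silver_JobsRuns", this, Array(1004, 2010, 2014), 12.0)
    .withSparkOverrides(Map(
      // DBR-specific AQE feature; scoping it here avoids the regressions
      // other modules saw when it was enabled globally
      "spark.databricks.adaptive.autoOptimizeShuffle.enabled" -> "true"
    ))
```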
The background and analysis of the optimization presented in the description of #1253 remain representative of the performance improvements realized by this change.
proof notebook (IN PROGRESS)
Corresponding job runs for before/after comparison of this change:
| 0.8.1.2 | 0.8.2.0-SNAPSHOT |
|---|---|
| Run 647559332994892 (210402 rows in 19.2 mins) | Run 455980391763738 (210402 rows in 8.13 mins) |
| Run 265434635290698 (483230 rows in 20.53 mins) | Run 635176874378143 (483230 rows in 8.9 mins) |
Quality Gate passed

Issues
- 0 New issues
- 0 Accepted issues

Measures
- 0 Security Hotspots
- 0.0% Coverage on New Code
- 0.0% Duplication on New Code
Added row counts and timings for the second set of comparison runs to the table in the description. ☝️
[ The following content is copy-pasted from #1253. ]
Introduction
A typical run of Overwatch module 2011 (JobsRuns_Silver, "JR") using release 0.8.1.2 to process 10 days of data in workspace e2-demo-west-ws produces this console output:
```
. . .
COMPLETED: 2011-Silver_JobsRuns --> Workspace ID: 2556758628403379
TIME RANGE: 2024-05-09 00:00:00 -> 2024-05-19 00:00:00
SUCCESS! Silver_JobsRuns
OUTPUT ROWS: 210402
RUNTIME MINS: 16.25 --> Workspace ID: 2556758628403379
. . .
```
The corresponding run of the code from this feature branch (1228-silver-job-runs-spark312-r0812) in an otherwise identical environment against the same upstream data goes like this:
```
. . .
COMPLETED: 2011-Silver_JobsRuns --> Workspace ID: 2556758628403379
TIME RANGE: 2024-05-09 00:00:00 -> 2024-05-19 00:00:00
SUCCESS! Silver_JobsRuns
OUTPUT ROWS: 210402
RUNTIME MINS: 5.25 --> Workspace ID: 2556758628403379
. . .
```
This ~67% reduction in runtime minutes is characteristic of the changes introduced here. Below are some details on how this was achieved, but first a little background.
Background
Databricks AQE shuffle auto-optimization
As part of Overwatch release 0.7.1.1, #675 introduced the following Spark configuration globally for all modules:
https://github.com/databrickslabs/overwatch/blob/ae56406ea5aef1e1e8a40b3653a9e6df70922458/src/main/scala/com/databricks/labs/overwatch/utils/SparkSessionWrapper.scala#L20
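The linked line boils down to setting a single Databricks-specific Spark configuration on the shared session. A minimal sketch, assuming it is applied once on the session used by all modules:

```scala
// Sketch of the global setting introduced in #675 (and later reverted in
// #790): enables Databricks' AQE auto-optimized shuffle for every module
// sharing this SparkSession.
spark.conf.set("spark.databricks.adaptive.autoOptimizeShuffle.enabled", "true")
```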
This is a component of a larger set of features called Adaptive Query Execution (AQE) that was released with Databricks Runtime (DBR) 7.0 and Spark 3.0.0. See this Databricks Engineering Blog article for more details on AQE in general.
Soon thereafter performance degradation in certain Overwatch modules was reported in https://github.com/databrickslabs/overwatch/issues/794#issue-1607557972:
> In an attempt to increase performance -- a regression was made that results in the larger modules (i.e. spark table merges) to have WAY too few tasks resulting in extremely long -- never finishing runtimes.
>
> Solution -- disable this flag.
. . . and the autoOptimizeShuffle feature was disabled in #790 for release 0.7.1.2:
https://github.com/databrickslabs/overwatch/blob/f9c8dd088ea0e81d4a311368a5c47b1a4f9ad375/src/main/scala/com/databricks/labs/overwatch/utils/SparkSessionWrapper.scala#L20
JR shuffle factor
Prior to the AQE shuffle auto-optimization described in the previous section, Overwatch release 0.7.1.0 (#563, #661) added the shuffleFactor parameter to the definition of jobRunsModule:
https://github.com/databrickslabs/overwatch/blob/bc136594b1e9d08c77339d6a31f7c87bfe2e7202/src/main/scala/com/databricks/labs/overwatch/pipeline/Silver.scala#L246
Based on this shuffleFactor value of 12.0 and the job cluster configuration used for these tests, Module.optimizeShufflePartitions() currently sets spark.sql.shuffle.partitions to 9600:
https://github.com/databrickslabs/overwatch/blob/3f4cdca4d8438550b13c38fec43eea60bdbee877/src/main/scala/com/databricks/labs/overwatch/pipeline/Module.scala#L79-L102
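The relationship between `shuffleFactor` and the resulting partition count can be illustrated with a simplified model. The actual logic lives in `Module.optimizeShufflePartitions()` at the link above; the formula and the 800-core figure below are assumptions for illustration only:

```scala
// Simplified, illustrative model: scale the cluster's core count by the
// module's shuffleFactor, never dropping below one partition per core.
// The real Module.optimizeShufflePartitions() weighs additional cluster state.
def targetShufflePartitions(clusterCoreCount: Int, shuffleFactor: Double): Int =
  math.max((clusterCoreCount * shuffleFactor).toInt, clusterCoreCount)

// Under this model, shuffleFactor = 12.0 on a cluster presenting 800 cores
// would yield 800 * 12.0 = 9600 partitions, matching the value observed
// for these test runs.
```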
Results
With AQE shuffle auto-optimization disabled in 0.8.1.2, many Spark stages for JR are forced to run 9600 tasks or more. This results in extended stage durations (grey verticals mark minutes) and the recruitment of additional executors (i.e. cluster auto-scaling, blue verticals), as seen in the following chart from the Spark UI:
This PR introduced the following change in the Silver JR module's configuration in commit https://github.com/databrickslabs/overwatch/commit/d751d5fc75c939892b73f877cb0e5542eb2cc030: https://github.com/databrickslabs/overwatch/blob/d751d5fc75c939892b73f877cb0e5542eb2cc030/src/main/scala/com/databricks/labs/overwatch/pipeline/Silver.scala#L271-L274
When 0.8.1.2 runs with the enhancements in this PR, not only are the corresponding Spark job durations decreased, but the recruitment of additional executors is deferred to later phases of the module:
Note that the differences in Spark job labels in these screenshots are due to other provisional features, unrelated to Spark configuration or performance, that were included in the Overwatch snapshot JAR used for these test runs. Those special labels are aids to development, testing, and troubleshooting to be introduced in #1223. Despite other minor refactoring in the JR module logic, the overall contours of the progression of Spark jobs are recognizable because these runs were restricted to only this module for the same date range. For example, Spark job 64, running from 20:28 until 20:36 in the upper timeline, corresponds to jobRunsAppendMeta (label truncated) at 20:23. This phase in particular seems to be where the overall duration of the module decreased most significantly.
Closer examination of the task statistics for corresponding jobs shows that many tasks created in the absence of shuffle auto-optimization are effectively doing nothing!
With shuffle auto-optimization relaxed we see a much more favorable distribution of task durations and bytes processed:
This trend is similarly apparent in other phases of the JR module.
Conclusion
This PR has produced demonstrable performance improvements for the following dataframe transformations:
- `SilverTransforms.filterAuditLog`
- `WorkflowsTransforms.jobRunsDeriveRunsBase`
- `SilverTransforms.jobRunsAppendJobMeta`
Next steps
Further gains in resource utilization and time efficiency may be possible in the subsequent phases of the JR module (2011).

Closes #1228