[KYUUBI #7245] Fix arrow batch converter error
Why are the changes needed?
Control the amount of data per batch to prevent out-of-memory errors and to speed up delivery of the initial results.
When `kyuubi.operation.result.format=arrow`, `spark.connect.grpc.arrow.maxBatchSize` does not work as expected.
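(For context: `maxEstimatedBatchSize` in the converter appears to be derived from this setting, whose default in Spark is 4 MiB, i.e. 4194304 bytes; that value shows up in the post-fix logs below.)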
Reproduction:
You can debug `KyuubiArrowConverters`, or add the following log statement at line 300 of `KyuubiArrowConverters`:
logInfo(s"Total limit: ${limit}, rowCount: ${rowCount}, " +
s"rowCountInLastBatch:${rowCountInLastBatch}," +
s"estimatedBatchSize: ${estimatedBatchSize}," +
s"maxEstimatedBatchSize: ${maxEstimatedBatchSize}," +
s"maxRecordsPerBatch:${maxRecordsPerBatch}")
Test data: 1.6 million rows, 30 columns per row. Command executed:
```bash
bin/beeline \
  -u 'jdbc:hive2://10.168.X.X:XX/default;thrift.client.max.message.size=2000000000' \
  --hiveconf kyuubi.operation.result.format=arrow \
  -n test -p 'testpass' \
  --outputformat=csv2 -e "select * from db.table" > /tmp/test.csv
```
Log output:
```
25/11/13 13:52:57 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 200000, lastBatchRowCount: 200000, estimatedBatchSize: 145600000, maxEstimatedBatchSize: 4, maxRecordsPerBatch: 10000
25/11/13 13:52:57 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 200000, lastBatchRowCount: 200000, estimatedBatchSize: 145600000
```
Original code:
```scala
while (rowIter.hasNext && (
    rowCountInLastBatch == 0 && maxEstimatedBatchSize > 0 ||
    estimatedBatchSize <= 0 ||
    estimatedBatchSize < maxEstimatedBatchSize ||
    maxRecordsPerBatch <= 0 ||
    rowCountInLastBatch < maxRecordsPerBatch ||
    rowCount < limit ||
    limit < 0))
```
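Plugging the pre-fix log values into this condition makes the bug visible. A minimal, self-contained check (values taken from the log line above, variable names from the snippet):

```scala
// Minimal standalone check: with limit = -1, the OR-chain above is always
// true, no matter how large the current batch already is.
// Values are taken from the pre-fix log line above.
val limit = -1L
val rowCount = 200000L
val rowCountInLastBatch = 200000L
val estimatedBatchSize = 145600000L // ~145 MB already buffered in one batch
val maxEstimatedBatchSize = 4L      // threshold as logged before the fix
val maxRecordsPerBatch = 10000L

val keepGoing =
  rowCountInLastBatch == 0 && maxEstimatedBatchSize > 0 ||
    estimatedBatchSize <= 0 ||
    estimatedBatchSize < maxEstimatedBatchSize ||
    maxRecordsPerBatch <= 0 ||
    rowCountInLastBatch < maxRecordsPerBatch ||
    rowCount < limit ||
    limit < 0

assert(keepGoing) // true, solely because limit < 0
```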
Because these checks are joined with `||`, the clause `limit < 0` is always true when no limit is set (limit = -1), so the loop keeps appending rows until the input is exhausted and all data is retrieved in a single batch. If the row count is too large, three problems follow:
(1) Driver/executor OOM.
(2) Array OOM, because the buffer outgrows the maximum array length.
(3) Slow data transfer, since nothing is sent until the single huge batch is complete.
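The change in this PR follows SPARK-44657's approach of regrouping these conditions. A sketch of that grouping (an illustration of the idea, not a verbatim quote of either the Spark or the Kyuubi patch):

```scala
// Sketch: the per-batch size/row checks are OR'ed among themselves and then
// AND'ed with the total-limit check, so an unset limit (-1) can no longer
// disable the per-batch cutoffs. Not a verbatim quote of either patch.
while (rowIter.hasNext &&
  (
    // always write at least the first row of each batch
    rowCountInLastBatch == 0 ||
      // byte-size cutoff; a non-positive threshold means "no size limit"
      ((maxEstimatedBatchSize <= 0 || estimatedBatchSize < maxEstimatedBatchSize) &&
        // row-count cutoff; a non-positive value means "no row limit"
        (maxRecordsPerBatch <= 0 || rowCountInLastBatch < maxRecordsPerBatch))
  ) &&
  // total-limit cutoff; a negative limit means "no total limit"
  (limit < 0 || rowCount < limit)) {
  // ... write rowIter.next() into the current Arrow batch ...
}
```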
After updating the code, the log output is as follows:
```
25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 5762, rowCountInLastBatch: 5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch: 10000
25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 11524, rowCountInLastBatch: 5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch: 10000
25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 17286, rowCountInLastBatch: 5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch: 10000
```
The estimatedBatchSize is now only slightly larger than maxEstimatedBatchSize (the row that crosses the threshold is still written before the batch closes), and data is written in batches as expected.
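A quick consistency check on these numbers (my arithmetic, not stated in the PR): the pre-fix log estimates 200000 rows at 145600000 bytes, i.e. 728 bytes per row; 5762 × 728 = 4194736 bytes is the first multiple of 728 at or above the 4194304-byte threshold, which matches both the logged batch size and the 5762-row batch boundaries.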
Fix #7245.
How was this patch tested?
Test data: 1.6 million rows, 30 columns per row. Log output after the change:
```
25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 5762, rowCountInLastBatch: 5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch: 10000
25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 11524, rowCountInLastBatch: 5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch: 10000
25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 17286, rowCountInLastBatch: 5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch: 10000
```
Was this patch authored or co-authored using generative AI tooling?
No
@echo567, please keep the PR template and fill it in carefully, especially "Was this patch authored or co-authored using generative AI tooling?"; it matters for legal purposes.
Codecov Report
:x: Patch coverage is 0% with 8 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 0.00%. Comparing base (fba1f94) to head (6ef4ef1).
| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...rk/sql/execution/arrow/KyuubiArrowConverters.scala | 0.00% | 6 Missing :warning: |
| ...g/apache/spark/sql/kyuubi/SparkDatasetHelper.scala | 0.00% | 2 Missing :warning: |
Additional details and impacted files
```diff
@@           Coverage Diff            @@
##           master    #7246    +/-  ##
========================================
  Coverage    0.00%    0.00%
========================================
  Files         696      696
  Lines       43530    43528     -2
  Branches     5883     5881     -2
========================================
+ Misses      43530    43528     -2
```
Sorry, the changes have been made.
The code is copied from Spark; it seems it was changed in SPARK-44657. Can we just follow that?
Okay, I made the modifications based on that Spark issue.
Hi, I've merged the latest master branch into this PR. Is there anything else I need to change?