
[KYUUBI #7245] Fix arrow batch converter error

Open echo567 opened this issue 2 months ago • 6 comments

Why are the changes needed?

Control the amount of data per batch to prevent memory overflow and to speed up the initial response.

When kyuubi.operation.result.format=arrow, spark.connect.grpc.arrow.maxBatchSize does not work as expected.

Reproduction: debug KyuubiArrowConverters, or add the following log statement at line 300 of KyuubiArrowConverters:

logInfo(s"Total limit: $limit, rowCount: $rowCount, " +
  s"rowCountInLastBatch: $rowCountInLastBatch, " +
  s"estimatedBatchSize: $estimatedBatchSize, " +
  s"maxEstimatedBatchSize: $maxEstimatedBatchSize, " +
  s"maxRecordsPerBatch: $maxRecordsPerBatch")

Test data: 1.6 million rows, 30 columns per row. Command executed:

bin/beeline \
  -u 'jdbc:hive2://10.168.X.X:XX/default;thrift.client.max.message.size=2000000000' \
  --hiveconf kyuubi.operation.result.format=arrow \
  -n test -p 'testpass' \
  --outputformat=csv2 -e "select * from db.table" > /tmp/test.csv

Log output

25/11/13 13:52:57 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 200000, lastBatchRowCount:200000, estimatedBatchSize: 145600000 maxEstimatedBatchSize: 4,maxRecordsPerBatch:10000
25/11/13 13:52:57 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 200000, lastBatchRowCount:200000, estimatedBatchSize: 145600000

Original Code

while (rowIter.hasNext && (
    rowCountInLastBatch == 0 && maxEstimatedBatchSize > 0 ||
    estimatedBatchSize <= 0 ||
    estimatedBatchSize < maxEstimatedBatchSize ||
    maxRecordsPerBatch <= 0 ||
    rowCountInLastBatch < maxRecordsPerBatch ||
    rowCount < limit ||
    limit < 0))

When no limit is set (i.e., limit = -1), the last disjunct is always true, so all data is retrieved in a single batch. If the row count is too large, three problems follow: (1) driver/executor OOM; (2) array OOM, because a single batch exceeds the maximum array length; (3) slow data transfer.
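The failure mode can be reproduced outside Kyuubi. The following is a standalone sketch (not the actual Kyuubi code): it simulates the first batch using the constants from the logs above (4 MiB cap, ~728 bytes/row derived from 145600000 bytes / 200000 rows) and compares the original OR-chained predicate with an AND-style predicate in the spirit of SPARK-44657.

```scala
// Standalone simulation, NOT the real converter: constants come from the logs above.
val limit = -1L                               // no LIMIT clause
val maxEstimatedBatchSize = 4L * 1024 * 1024  // 4 MiB
val maxRecordsPerBatch = 10000L
val rowSize = 728L                            // 145600000 bytes / 200000 rows

// Original: one satisfied disjunct keeps the loop going; with limit == -1 the
// last disjunct is always true, so the batch never closes on its own.
def keepGoingOr(rowCount: Long, rowsInBatch: Long, batchBytes: Long): Boolean =
  rowsInBatch == 0 && maxEstimatedBatchSize > 0 ||
    batchBytes <= 0 ||
    batchBytes < maxEstimatedBatchSize ||
    maxRecordsPerBatch <= 0 ||
    rowsInBatch < maxRecordsPerBatch ||
    rowCount < limit ||
    limit < 0

// AND-style: every guard must hold (the first row of a batch is always written).
def keepGoingAnd(rowCount: Long, rowsInBatch: Long, batchBytes: Long): Boolean =
  rowsInBatch == 0 ||
    ((maxEstimatedBatchSize <= 0 || batchBytes < maxEstimatedBatchSize) &&
      (maxRecordsPerBatch <= 0 || rowsInBatch < maxRecordsPerBatch) &&
      (limit < 0 || rowCount < limit))

// Count rows consumed before the first batch closes, capped at the table size.
def rowsInFirstBatch(pred: (Long, Long, Long) => Boolean): Long = {
  var rows = 0L
  while (rows < 1600000L && pred(rows, rows, rows * rowSize)) rows += 1
  rows
}

println(s"OR predicate:  ${rowsInFirstBatch(keepGoingOr)} rows")   // 1600000: the whole table
println(s"AND predicate: ${rowsInFirstBatch(keepGoingAnd)} rows")  // 5762, matching the fixed log
```

The AND-style predicate closes the first batch at 5762 rows, which is exactly the per-batch row count in the post-fix log below.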

After updating the code, the log output is as follows:

25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 5762, rowCountInLastBatch:5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch:10000
25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 11524, rowCountInLastBatch: 5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch: 10000 
25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 17286, rowCountInLastBatch: 5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch: 10000

The estimatedBatchSize ends up only slightly larger than maxEstimatedBatchSize, and data is written in batches as expected.
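The slight overshoot is expected: the row that crosses the size cap is still written. A back-of-the-envelope check, assuming the ~728-byte average row size implied by the earlier pre-fix log (145600000 bytes / 200000 rows):

```scala
// Rough arithmetic check of the figures in the log above; the row size is an
// assumption derived from the earlier (pre-fix) log line.
val rowSize = 145600000L / 200000L   // 728 bytes per row
val maxBatch = 4L * 1024L * 1024L    // maxEstimatedBatchSize = 4194304 bytes
val underCap = maxBatch / rowSize    // 5761 rows fit strictly under the cap
val batchRows = underCap + 1         // the crossing row is still written: 5762
val batchBytes = batchRows * rowSize // 4194736 bytes, slightly above the cap
```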

Fix #7245.

How was this patch tested?

Test data: 1.6 million rows, 30 columns per row.

25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 5762, rowCountInLastBatch:5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch:10000
25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 11524, rowCountInLastBatch: 5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch: 10000 
25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 17286, rowCountInLastBatch: 5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch: 10000

Was this patch authored or co-authored using generative AI tooling?

No

echo567 avatar Nov 16 '25 11:11 echo567

@echo567, please keep the PR template and fill in seriously, especially "Was this patch authored or co-authored using generative AI tooling?", it does matter for legal purposes.

pan3793 avatar Nov 17 '25 12:11 pan3793

Codecov Report

:x: Patch coverage is 0% with 8 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 0.00%. Comparing base (fba1f94) to head (6ef4ef1).

Files with missing lines Patch % Lines
...rk/sql/execution/arrow/KyuubiArrowConverters.scala 0.00% 6 Missing :warning:
...g/apache/spark/sql/kyuubi/SparkDatasetHelper.scala 0.00% 2 Missing :warning:
Additional details and impacted files
@@          Coverage Diff           @@
##           master   #7246   +/-   ##
======================================
  Coverage    0.00%   0.00%           
======================================
  Files         696     696           
  Lines       43530   43528    -2     
  Branches     5883    5881    -2     
======================================
+ Misses      43530   43528    -2     

:umbrella: View full report in Codecov by Sentry.


codecov-commenter avatar Nov 17 '25 14:11 codecov-commenter

@echo567, please keep the PR template and fill in seriously, especially "Was this patch authored or co-authored using generative AI tooling?", it does matter for legal purposes.

Sorry, the changes have been made.

echo567 avatar Nov 19 '25 02:11 echo567

The code is copied from Spark; it seems it was changed in SPARK-44657. Can we just follow that?

pan3793 avatar Nov 19 '25 03:11 pan3793

The code is copied from Spark; it seems it was changed in SPARK-44657. Can we just follow that?

Okay, I made modifications based on that Spark issue.

echo567 avatar Nov 26 '25 12:11 echo567

Hi, I've merged the code from the latest master branch. Is there anything else I need to change?

echo567 avatar Dec 08 '25 13:12 echo567