Concurrent data file fetching and parallel RecordBatch processing

Open sdd opened this issue 1 year ago • 1 comments

This brings some big performance gains vs the previous sequential batch processing. On my 12-core Ryzen 9 5900X, I see all 12 cores hitting about 50% utilization.

In my perf testing branch for this, a full table scan retrieving all the data returned 84 million rows in 7s, or over 11M rows/sec. Real-world throughput could be quite a bit higher, since 50% of the CPU usage was Minio serving up the data files.

As with the concurrent file plan PR, the concurrency settings default to values that performed well when I tested a range of values, and they can still be user-configured.
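To illustrate the general shape of a bounded, user-configurable concurrency limit, here is a minimal sketch using only the standard library. It is not the code in this PR: it uses OS threads rather than async streams, and `ScanConfig`, `data_file_concurrency`, `Batch`, `fetch_and_decode` and `scan` are all hypothetical names made up for the example, with the fetch step faked.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a decoded record batch: only a row count.
struct Batch {
    rows: usize,
}

/// Hypothetical config knob: how many data files are fetched at once.
struct ScanConfig {
    data_file_concurrency: usize,
}

impl Default for ScanConfig {
    fn default() -> Self {
        // Assumption for this sketch: default to the number of available cores.
        let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
        Self { data_file_concurrency: cores }
    }
}

/// Fake "fetch + decode": every file yields two batches of 1,000 rows.
fn fetch_and_decode(_file: &str) -> Vec<Batch> {
    (0..2).map(|_| Batch { rows: 1_000 }).collect()
}

/// Scans all files with at most `data_file_concurrency` fetches in flight
/// and returns the total number of rows seen.
fn scan(files: Vec<String>, config: &ScanConfig) -> usize {
    let workers = config.data_file_concurrency.max(1);
    // Shared work queue: each worker pulls the next file when it is free.
    let queue = Arc::new(Mutex::new(files.into_iter()));
    let (tx, rx) = mpsc::channel::<Batch>();

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let queue = Arc::clone(&queue);
            let tx = tx.clone();
            thread::spawn(move || loop {
                // Hold the lock only long enough to take one file.
                let next = queue.lock().unwrap().next();
                match next {
                    Some(file) => {
                        for batch in fetch_and_decode(&file) {
                            if tx.send(batch).is_err() {
                                return; // consumer went away
                            }
                        }
                    }
                    None => return, // no files left
                }
            })
        })
        .collect();
    // Drop the original sender so the receiver ends once all workers finish.
    drop(tx);

    // Batches are consumed as they arrive, in whatever order workers produce them.
    let total: usize = rx.iter().map(|b| b.rows).sum();
    for handle in handles {
        handle.join().unwrap();
    }
    total
}
```

Calling `scan(files, &ScanConfig::default())` uses the default limit, while `scan(files, &ScanConfig { data_file_concurrency: 4 })` overrides it. The point of the shared queue is that the limit caps in-flight fetches regardless of how many files the scan covers.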

Performance test results, generated using the tests in https://github.com/apache/iceberg-rust/pull/497:

Jul 31 '24 20:07 sdd

If I run this directly against locally hosted Minio, cutting out the HAProxy container in the stack (which is used to introduce latency and bandwidth constraints to simulate real-world usage), I can process the same request in just under 3s, at a rate of almost 30M rows/sec.

Jul 31 '24 21:07 sdd