
Investigate performance tradeoff in compressing spill files

Open ding-young opened this issue 8 months ago • 2 comments

Is your feature request related to a problem or challenge?

Part of https://github.com/apache/datafusion/issues/16065 , Related to https://github.com/apache/datafusion/issues/14078

Background

PR https://github.com/apache/datafusion/pull/16268 will introduce a compression option for writing spill files to disk. The options are limited to the codecs that the Arrow IPC Stream Writer supports: zstd and lz4_frame (the compression level is not configurable).

To further enhance spilling execution, we need to investigate:

  • The CPU / I/O tradeoff when zstd or lz4_frame compression is enabled, i.e. the compression ratio and the extra latency spent on compression
  • The current Arrow IPC stream writer always writes one batch at a time in append_batch; it is not yet clear how much a single batch can benefit from compression
  • Whether we need a separate Writer or Reader implementation instead of the IPC Stream Writer
  • How to introduce some form of adaptiveness
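To make the first bullet concrete, a micro-benchmark could take roughly the shape below. This is an illustrative Python sketch, not DataFusion code: the stdlib zlib codec stands in for zstd / lz4_frame, and the synthetic batch layout is an assumption.

```python
# Illustrative sketch (not DataFusion code): measure compression ratio and
# latency on spill-like data. zlib stands in for zstd / lz4_frame.
import random
import time
import zlib

random.seed(42)

# Simulate an 8192-row batch of 64-bit integers drawn from a small domain;
# sorted intermediate data is often this repetitive.
batch = b"".join(random.randrange(100).to_bytes(8, "little") for _ in range(8192))

for level in (1, 6, 9):
    start = time.perf_counter()
    compressed = zlib.compress(batch, level)
    elapsed = time.perf_counter() - start
    ratio = len(batch) / len(compressed)
    print(f"level={level} ratio={ratio:.1f}x time={elapsed * 1e3:.2f} ms")
```

A real benchmark would sweep codec, level, and batch shape, and record both dimensions of the tradeoff (bytes saved vs. CPU time spent).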

Describe the solution you'd like

First, we need to track (or update) how many bytes are written to spill files. DataFusion currently tracks spilled_bytes as part of SpillMetrics, but it is calculated from the in-memory array size, which can differ from the actual spill file size, especially when spill files are compressed.
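One way to report the actual on-disk size is to wrap the spill sink in a byte-counting writer and record that value alongside the in-memory estimate. The sketch below is a hypothetical Python analogue of that idea (CountingWriter is an invented name; the real SpillMetrics plumbing is Rust inside DataFusion):

```python
# Hypothetical sketch: count bytes actually hitting "disk" so metrics can
# report post-compression size instead of in-memory array size.
import io
import zlib

class CountingWriter:
    """File-like wrapper that counts bytes actually written to the inner sink."""
    def __init__(self, inner):
        self.inner = inner
        self.bytes_written = 0

    def write(self, data: bytes) -> int:
        n = self.inner.write(data)
        self.bytes_written += n
        return n

in_memory = b"spill batch " * 1024           # stand-in for an in-memory batch
sink = CountingWriter(io.BytesIO())
sink.write(zlib.compress(in_memory))         # stand-in for a compressed IPC write

print(f"in-memory size:  {len(in_memory)}")
print(f"bytes on 'disk': {sink.bytes_written}")
```

With compression enabled the two numbers diverge, which is exactly the gap the current spilled_bytes metric misses.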

Second, update the existing benchmarks or write separate benchmarks to observe the performance characteristics. One option is to write spill-related metrics to output.json when running benchmarks such as tpch with the debug option. Another is to generate spill files for a microbenchmark that exercises only the spill write/read path.

Describe alternatives you've considered

No response

Additional context

No response

ding-young avatar Jun 11 '25 05:06 ding-young

take

ding-young avatar Jun 11 '25 05:06 ding-young

  • The CPU / I/O tradeoff when zstd or lz4_frame compression is enabled, i.e. the compression ratio and the extra latency spent on compression

It's worth doing some micro-benchmarks to see how those compression codecs work on Arrow arrays. We can configure several possible shapes of to-spill intermediate data (a single column of different types, very wide batches, etc.; perhaps just use the TPC-H tables) and test how the different compression types perform in terms of speed and compression ratio. It would be interesting to also test how vortex performs.
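As a rough illustration of why data shape matters, the sketch below (Python stdlib, zlib as a stand-in codec, synthetic buffers rather than Arrow arrays) compares a low-cardinality column buffer against high-entropy data of the same size:

```python
# Illustrative sketch: compression ratio depends heavily on column content.
import random
import zlib

random.seed(0)

# Two equally sized spill-like buffers: a low-cardinality "dictionary-ish"
# column vs. high-entropy data (think random floats or hashes).
low_card = b"".join(random.choice([b"US", b"EU", b"AP"]) * 4 for _ in range(4096))
high_entropy = bytes(random.getrandbits(8) for _ in range(len(low_card)))

for name, buf in (("low-cardinality", low_card), ("high-entropy", high_entropy)):
    ratio = len(buf) / len(zlib.compress(buf))
    print(f"{name}: {ratio:.2f}x")  # low-cardinality compresses far better
```

A benchmark matrix over such shapes (plus wide multi-column batches) would show where compression pays for itself and where it only burns CPU.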

  • The current Arrow IPC stream writer always writes one batch at a time in append_batch; it is not yet clear how much a single batch can benefit from compression

This does not look ideal, especially for compressing 'thin' batches with primitive types. If we can confirm there is significant overhead here, we should consider compressing multiple batches at once.
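Whether compressing several batches together helps can be checked with a small experiment along these lines (illustrative Python, zlib standing in for the IPC codecs; each separate compress call loosely models one append_batch):

```python
# Illustrative sketch: compressing many thin batches one at a time vs.
# compressing them together. Each separate zlib stream pays its own
# framing and statistics overhead, roughly like per-batch IPC compression.
import random
import zlib

random.seed(1)

# Eight "thin" batches drawn from the same value distribution.
batches = [bytes(random.choice(b"abcdef") for _ in range(1024)) for _ in range(8)]

per_batch = sum(len(zlib.compress(b)) for b in batches)
combined = len(zlib.compress(b"".join(batches)))

print(f"per-batch total: {per_batch} bytes")
print(f"combined:        {combined} bytes")
```

The per-stream overhead shrinks as batches grow, so the effect should be most visible for thin batches, matching the concern above.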

  • Whether we need a separate Writer or Reader implementation instead of the IPC Stream Writer
  • How to introduce some form of adaptiveness

Describe the solution you'd like

First, we need to track (or update) how many bytes are written to spill files. DataFusion currently tracks spilled_bytes as part of SpillMetrics, but it is calculated from the in-memory array size, which can differ from the actual spill file size, especially when spill files are compressed.

Second, update the existing benchmarks or write separate benchmarks to observe the performance characteristics. One option is to write spill-related metrics to output.json when running benchmarks such as tpch with the debug option. Another is to generate spill files for a microbenchmark that exercises only the spill write/read path.

🤔 Yes, spilled_bytes is currently not computed correctly when compression happens; it needs a follow-up fix. Adding more micro-benchmarks sounds great.

2010YOUY01 avatar Jun 12 '25 02:06 2010YOUY01

Need for a Custom Batch Writer?

1. concat_batches before writing?

I tried a quick local test where, instead of writing one batch at a time with the current IPCStreamWriter, I concatenated multiple batches using concat_batches before writing them. In my local environment, this didn't make a noticeable difference in compression ratio. Maybe that's because compression happens at the buffer level for each column (i.e., values of the same column are already grouped together), or because each record batch already consists of 8192 rows, which roughly matches the compression window size. Still, even if concatenating batches introduces some memory-copy overhead, it might improve I/O bandwidth or reduce the number of system calls, so I think it's worth investigating further.
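The window-size intuition above can be probed by sweeping the chunk size fed to the codec. In the hedged stdlib sketch below (zlib stand-in, synthetic data, not Arrow buffers), the total compressed size improves as chunks grow but the gains flatten out once chunks are large enough:

```python
# Illustrative sketch: total compressed size as a function of chunk size.
# Once chunks exceed the codec's effective window, concatenating further
# batches buys little extra ratio.
import random
import zlib

random.seed(7)

data = bytes(random.choice(b"xyz01") for _ in range(1 << 16))  # 64 KiB, 5 symbols

for chunk in (256, 4096, 65536):
    total = sum(
        len(zlib.compress(data[i:i + chunk]))
        for i in range(0, len(data), chunk)
    )
    print(f"chunk={chunk:>6}: total={total} bytes")
```

If 8192-row batches already sit past that knee for typical spill data, that would explain why concat_batches did not change the ratio in the test above.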

2. Comet's implementation

I looked into why Comet introduced a custom batch writer and reviewed the related PR. The main reasons their implementation improved performance were:

(a) Their previous approach duplicated the schema for each batch, which the new implementation avoided.
(b) They didn’t use FlatBuffers encoding, so there was no alignment or metadata overhead.

In our case, though, since IPCStreamWriter already writes the schema only once when the writer is created, we probably won’t see the same benefits from (a).

I haven’t had a chance to look closely at the Vortex side yet. If I come across any interesting experimental results or ideas worth sharing, I’ll follow up later.

ding-young avatar Jun 24 '25 12:06 ding-young