spark-perf icon indicating copy to clipboard operation
spark-perf copied to clipboard

Use of DefaultCodec for HDFS DataGenerator is really bad

Open aarondav opened this issue 11 years ago • 2 comments

When DefaultCodec = ZLib, we end up serializing all data generation, because the native implementation of ZLib actually locks on a static class object during deserialization.

Changing to SnappyCodec enables actual parallelism.

aarondav avatar Dec 09 '14 20:12 aarondav

@JoshRosen @aarondav Do you guys have a rough idea of how big of a deal it would be to fix this in terms of test run time?

nchammas avatar Feb 10 '15 04:02 nchammas

Not certain, but if at the time I cared enough to file this issue, we probably weren't even nearly saturating any of our disk/memory/CPU, so I would expect at least 2x improvement in data generation time (more or less depending on nodes used). Was probably using the r3.8xlarges with 32 cores, so 1/32 parallelism does seem bad.

aarondav avatar Feb 10 '15 07:02 aarondav