Use of DefaultCodec for HDFS DataGenerator is really bad
When DefaultCodec = ZLib, we end up serializing all data generation, because the native implementation of ZLib actually locks on a static class object during deserialization.
Changing to SnappyCodec enables actual parallelism.
@JoshRosen @aarondav Do you guys have a rough idea of how big of a deal it would be to fix this in terms of test run time?
Not certain, but if at the time I cared enough to file this issue, we probably weren't even nearly saturating any of our disk/memory/CPU, so I would expect at least 2x improvement in data generation time (more or less depending on nodes used). Was probably using the r3.8xlarges with 32 cores, so 1/32 parallelism does seem bad.