RemoteShuffleService write amplification

I'm noticing that some Spark apps that produce 11TB of shuffle data on the external shuffle service produce closer to 18TB of shuffle data on Remote Shuffle Service. Is some write amplification expected?

May 23 '22 21:05 cpd85

It may depend on how these metrics are calculated. Remote Shuffle Service does write some extra data for each shuffle record, such as the task attempt ID and partition ID, to track the record. But sometimes the metrics may also be off a little bit due to serialization/compression.

May 26 '22 04:05 hiboyang
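
For a rough sense of how a fixed per-record overhead could turn into the amplification reported above, here is a minimal back-of-the-envelope sketch. The function names and the byte counts in the example are hypothetical and are not taken from the Remote Shuffle Service wire format; it only assumes that every record carries a constant number of extra bytes.

```python
def amplification(avg_record_bytes: float, overhead_bytes_per_record: float) -> float:
    """Ratio of bytes written with per-record overhead to bytes written without it."""
    return (avg_record_bytes + overhead_bytes_per_record) / avg_record_bytes


def implied_overhead_per_record(base_total: float, observed_total: float,
                                avg_record_bytes: float) -> float:
    """Per-record overhead (bytes) that would explain the observed growth.

    base_total and observed_total only need to share a unit (e.g. TB),
    because only their ratio is used.
    """
    return (observed_total / base_total - 1.0) * avg_record_bytes


if __name__ == "__main__":
    # Numbers from the thread: 11TB on external shuffle service, ~18TB on RSS.
    # The 22-byte average record size is an assumption chosen for illustration.
    overhead = implied_overhead_per_record(11, 18, avg_record_bytes=22)
    print(f"implied overhead per record: {overhead:.1f} bytes")
    print(f"amplification at that overhead: {amplification(22, overhead):.2f}x")
```

The point of the sketch is that the smaller the average shuffle record, the more a fixed per-record header dominates, so workloads with many tiny records would see the largest gap between the two services.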

Got it. It looks like compression isn't supported on the server side at the moment? My workloads tend to stress the SSD rather than the CPU, so I think they could benefit from compression. I see this class https://github.com/uber/RemoteShuffleService/blob/7220c23694e0175e01719621707680a2718173cf/src/main/java/com/uber/rss/common/Compression.java but as far as I can tell it isn't actually used or configurable.

May 31 '22 15:05 cpd85