YutingWang98
@mayurdb Hi mayurdb! We also have this server down/restart issue quite frequently. Do you mind sharing your progress on the stage retry and new server list picking, or how you...
@hiboyang Hi, I found the bug and fixed it in a pull request.
Thank you for the suggestions, @hiboyang! Does this mean the shuffle data written to the server will be doubled if I set 'spark.shuffle.rss.replicas' to 2? If so, this will...
Hi, @hiboyang. If 'spark.shuffle.rss.replicas' does write double the amount of data to the servers, we unfortunately won't be able to use this for large jobs with 400+ TB of shuffle data. So...
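For context, a minimal sketch of how the replica setting discussed above might be applied (the shuffle-manager class name below is an assumption; check the RSS documentation for the value your deployment uses):

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: enabling the remote shuffle service with two replicas.
// With replicas = 2, each shuffle block is written to two servers, which is
// why the total shuffle bytes written would roughly double.
val spark = SparkSession.builder()
  .appName("rss-replica-sketch")
  // Assumed class name; verify against your RSS build/docs.
  .config("spark.shuffle.manager", "org.apache.spark.shuffle.RssShuffleManager")
  .config("spark.shuffle.rss.replicas", "2")
  .getOrCreate()
```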
Thanks for the reply! Will see what I can do to improve this.
@hiboyang Hi! I attempted to contribute stage retry support, but ran into some difficulty due to the Rss implementation. Wondering if I can have some insights...
Hi @mayurdb, thank you for the reply and for sharing your implementation! I have a question here: if the Spark stages are cascading, then one stage may depend on the previous...
> @mayurdb Thank you for sharing it, will take a look!
Hi @mayurdb, we have also been experiencing memory and map stage latency issues using Rss. We plan to test and work on this implementation as well. Wondering if you have...