RemoteShuffleService
Remote shuffle service for Apache Spark to store shuffle data on remote servers.
Hi, we have been seeing zstd corruption errors during shuffle read recently.
```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 300 in stage 7.0 failed 4 times, most recent...
```
The current integration of RSS with the AQE framework does not provide performant APIs for AQE skew join optimization. To explain further, when AQE detects a skewed partition it tries to divide...
Pass the startMapIndex and endMapIndex to getReaderForRange. With this change, the AQE-related issues in https://github.com/uber/RemoteShuffleService/issues/99 are fixed.
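For context, here is a minimal sketch of what honoring the mapper range means. The names below are hypothetical, not the project's actual classes; the only fact taken from Spark is that `getReaderForRange` receives a half-open mapper range `[startMapIndex, endMapIndex)` that AQE computes when it splits a skewed partition.

```scala
// Hypothetical sketch: a range-aware reader keeps only the map outputs that fall
// inside the [startMapIndex, endMapIndex) range AQE asked for, instead of
// fetching the whole reduce partition from the RSS servers.
case class MapOutputRef(mapIndex: Int, serverHost: String, sizeBytes: Long)

def mapOutputsForRange(
    allOutputs: Seq[MapOutputRef],
    startMapIndex: Int,
    endMapIndex: Int): Seq[MapOutputRef] =
  allOutputs.filter(o => o.mapIndex >= startMapIndex && o.mapIndex < endMapIndex)

// Example: AQE splits one skewed partition into two reads over mappers [0, 50) and
// [50, 100); each sub-task calls mapOutputsForRange with its own range.
```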
Hi, I ran the [SparkSqlOptimizeSkewedJoinTest](https://github.com/uber/RemoteShuffleService/blob/spark30/src/test/scala/org/apache/spark/shuffle/SparkSqlOptimzeSkewedJoinTest.scala#L79) and [SparkSqlOptimizeLocalShuffleReaderTest](https://github.com/uber/RemoteShuffleService/blob/spark30/src/test/scala/org/apache/spark/shuffle/SparkSqlOptimizeLocalShuffleReaderTest.scala#L69) using Spark 3.1 and Spark 3.2, and both RSS tests failed with assertion errors caused by duplicate output rows. For example, the expected output of SparkSqlOptimizeLocalShuffleReaderTest...
Hi, I am wondering: Q1. Will `RssInvalidServerVersionException` occur when RSS-i is restarted by a shell script as soon as it crashes for some reason, while some applications are...
When I use a Spark image I built myself with JDK 8, I hit the following error:  I found a reasonable answer on StackOverflow.  So is there a way to...
### What changes were proposed in this pull request? 1. Add curly braces to the if statement to pass checkstyle, although they are not strictly necessary. 2. Add a space after the `if` token....
**Key traits**
- Stores the map output data in serialized form
- Buffers the data in memory as much as possible; chunks the data before sending it to RSS servers (see the sketch below)....
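As an illustration of the buffering-and-chunking trait above, here is a minimal, self-contained sketch; the class name, chunk threshold, and `send` callback are placeholders rather than the project's actual writer API.

```scala
import java.io.ByteArrayOutputStream

// Hypothetical sketch: serialized map output is accumulated in memory and flushed
// to an RSS server in chunks once the buffer crosses a size threshold.
class ChunkingBufferSketch(chunkSizeBytes: Int, send: Array[Byte] => Unit) {
  private val buffer = new ByteArrayOutputStream()

  def addRecord(serializedRecord: Array[Byte]): Unit = {
    buffer.write(serializedRecord)
    if (buffer.size() >= chunkSizeBytes) flush()
  }

  // Called whenever the buffer is full, and once more at the end of the map task.
  def flush(): Unit = {
    if (buffer.size() > 0) {
      send(buffer.toByteArray) // one network send per chunk
      buffer.reset()
    }
  }
}
```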
Adds fault tolerance in RSS for one or more servers going away. This is how the functionality works:
- Node/server goes away
- Task reading/writing data from that server...
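A minimal sketch of the failover idea described above, assuming the task simply retries against the next candidate server; the names and the flat retry loop are placeholders, not the project's actual recovery logic.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical sketch: when the server a task is reading from or writing to goes
// away, the attempt fails and the task moves on to the next candidate server.
object ServerFailoverSketch {
  @annotation.tailrec
  def withFailover[T](servers: List[String], attempt: String => T): T =
    servers match {
      case Nil => throw new RuntimeException("all candidate RSS servers failed")
      case server :: rest =>
        Try(attempt(server)) match {
          case Success(result) => result
          case Failure(_)      => withFailover(rest, attempt) // server went away, try the next one
        }
    }
}
```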
Hi, I just found out my Spark job got killed with this error:
```
Caused by: com.uber.rss.exceptions.RssException: Failed to get node data for zookeeper node: /spark_rss/{cluster}/default/nodes/{server_host_name}
    at com.uber.rss.metadata.ZooKeeperServiceRegistry.getServerInfo(ZooKeeperServiceRegistry.java:231)
    ... at...
```
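If it helps with debugging, here is a rough diagnostic sketch for checking the registration node directly, assuming a ZooKeeper client such as Apache Curator; the connection string and node path are placeholders mirroring the redacted path in the stack trace above, and this is not the project's ZooKeeperServiceRegistry code.

```scala
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry

// Hypothetical sketch: the exception comes from reading a server's registration
// node, so a first check is whether that node still exists and holds any data.
object ZkNodeCheck {
  def main(args: Array[String]): Unit = {
    val zkConnect = "zk-host:2181"                                  // placeholder
    val nodePath  = "/spark_rss/cluster/default/nodes/server_host"  // placeholder
    val client = CuratorFrameworkFactory.newClient(zkConnect, new ExponentialBackoffRetry(1000, 3))
    client.start()
    try {
      val stat = client.checkExists().forPath(nodePath)
      if (stat == null)
        println(s"$nodePath does not exist; the server has likely dropped out of the registry")
      else
        println(s"$nodePath holds ${client.getData().forPath(nodePath).length} bytes")
    } finally client.close()
  }
}
```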