RemoteShuffleService

Which branch should be used for building the jar and image for Remote Shuffle Service in a K8s environment?

Open · roligupt opened this issue 3 years ago • 6 comments

I see there are two branches, K8 and rss-k8. Which one should be used for building the jar and image for Remote Shuffle Service in a K8s environment?

roligupt avatar Feb 18 '22 03:02 roligupt

I have a fork and made Remote Shuffle Service work on k8s. I also removed the dependency on ZooKeeper. The fork is here: https://github.com/datapunchorg/RemoteShuffleService/tree/k8s-spark-3.1

hiboyang avatar Feb 18 '22 05:02 hiboyang

Thanks for your quick response! I will try it out.

roligupt avatar Feb 18 '22 06:02 roligupt

@hiboyang one quick question about Spark with the client jar: I want to build my own Spark image with the jar. I am not building the Spark distribution from scratch, but using the Spark binary (spark-3.1.1-bin-hadoop3.2.tgz) provided on the Apache Spark download site. How do I go about building the client jar to include in the Spark image?

roligupt avatar Feb 18 '22 06:02 roligupt

You need to put the Remote Shuffle Service client jar inside the jars folder of the Spark image.

You can download the Remote Shuffle Service client jar from Maven:

    <dependency>
        <groupId>org.datapunch</groupId>
        <artifactId>remote-shuffle-service-client-spark31</artifactId>
        <version>0.0.12</version>
    </dependency>

If you download that Spark binary (spark-3.1.1-bin-hadoop3.2.tgz), you can unzip it, add the Remote Shuffle Service client jar to the jars folder, and then run a command like the following to build your image:

    ./dev/make-distribution.sh --name spark-with-remote-shuffle-service-client --pip --tgz -Phive -Phive-thriftserver -Pkubernetes -Phadoop-3.2 -Phadoop-cloud

Please note that if you use the remote-shuffle-service-client-spark31 jar here, you need to use the Remote Shuffle Service server from this branch as well: https://github.com/datapunchorg/RemoteShuffleService/tree/k8s-spark-3.1
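
For context, dev/make-distribution.sh ships with the Spark source tree rather than with the pre-built tgz, so the make-distribution route would roughly look like the sketch below (untested, version tag shown for illustration):

    # make-distribution.sh is part of the Spark source repo, not the binary distribution,
    # so this path assumes a source checkout
    git clone https://github.com/apache/spark.git
    cd spark
    git checkout v3.1.1
    ./dev/make-distribution.sh --name spark-with-remote-shuffle-service-client --pip --tgz \
        -Phive -Phive-thriftserver -Pkubernetes -Phadoop-3.2 -Phadoop-cloud
    # then copy the Remote Shuffle Service client jar into jars/ of the resulting
    # distribution before building the Spark image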

hiboyang avatar Feb 19 '22 17:02 hiboyang

@hiboyang I understand everything except that spark-3.1.1-bin-hadoop3.2.tgz is already a distribution package that comes with the jar files, and as far as I understand, ./dev/make-distribution.sh creates the distribution package. If I already have the Spark binaries in spark-3.1.1-bin-hadoop3.2.tgz, I don't need to run ./dev/make-distribution.sh; I can simply copy the jar and build the image.

roligupt avatar Feb 24 '22 01:02 roligupt

Yes, simply copying the jar and building the image should work as well.
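
For illustration, a minimal sketch of that "copy the jar and build the image" flow, using the docker-image-tool.sh script bundled with the binary distribution (registry and tag names are placeholders):

    # unpack the pre-built Spark distribution and drop the RSS client jar into jars/
    tar -xzf spark-3.1.1-bin-hadoop3.2.tgz
    cp remote-shuffle-service-client-spark31-0.0.12.jar spark-3.1.1-bin-hadoop3.2/jars/
    # build the Spark image with the docker-image-tool.sh that ships in the distribution
    cd spark-3.1.1-bin-hadoop3.2
    ./bin/docker-image-tool.sh -r my-registry.example.com -t spark-3.1.1-rss build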

hiboyang avatar Feb 24 '22 02:02 hiboyang