Spark 2.0.0 support
I'm working on this and will submit a pull request once done. We face NoSuchMethodError problems as soon as you try to run anything but scheduling-throughput.
The fix for that is to modify spark-tests/project/SparkTestsBuild.scala: use 2.0.0-preview for the org.apache.spark dependency version and Scala 2.11.8. Specifically, this resolves

```
NoSuchMethodError: org/apache/spark/SparkContext.rddToPairRDDFunctions(Lorg/apache/spark/rdd/RDD;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;Lscala/math/Ordering;)Lorg/apache/spark/rdd/PairRDDFunctions;
    at spark.perf.AggregateByKey.runTest(KVDataTest.scala:137)
```
which is triggered by
```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class AggregateByKey(sc: SparkContext) extends KVDataTest(sc) {
  override def runTest(rdd: RDD[_], reduceTasks: Int) {
    // reduceByKey is pulled in here via the implicit rddToPairRDDFunctions
    // conversion, which is exactly what the NoSuchMethodError points at.
    rdd.asInstanceOf[RDD[(String, String)]]
      .map { case (k, v) => (k, v.toInt) }.reduceByKey(_ + _, reduceTasks).count()
  }
}
```
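For reference, a minimal sketch of what that SparkTestsBuild.scala change looks like; only the two version bumps are shown, and the rest of the build definition (other dependencies, assembly settings, and so on) is elided:

```scala
import sbt._
import Keys._

object SparkTestsBuild extends Build {
  lazy val root = Project("spark-tests", file("."))
    .settings(
      scalaVersion := "2.11.8",
      libraryDependencies ++= Seq(
        // Bumped from the 1.x line to the 2.0.0-preview artifact:
        "org.apache.spark" %% "spark-core" % "2.0.0-preview" % "provided"
      )
    )
}
```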
With only the above change we get
```
16/06/14 12:52:44 INFO ContextCleaner: Cleaned shuffle 9
Exception in thread "main" java.lang.NoSuchMethodError: org/json4s/jackson/JsonMethods$.render$default$2(Lorg/json4s/JsonAST$JValue;)Lorg/json4s/Formats;
    at spark.perf.TestRunner$.main(TestRunner.scala:47)
    at spark.perf.TestRunner.main(TestRunner.scala)
```
By removing the call to render we can now build and run all of SparkPerf with Spark 2.0.0 (there's probably a better fix; I played around with the json4s import versions but without success). The files to change are as follows, with a sketch of the render change after the list:
```
modified:   lib/sparkperf/testsuites.py
modified:   mllib-tests/project/MLlibTestsBuild.scala
modified:   spark-tests/project/SparkTestsBuild.scala
modified:   streaming-tests/project/StreamingTestsBuild.scala
modified:   spark-tests/src/main/scala/spark/perf/TestRunner.scala
```
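To illustrate the render removal in TestRunner.scala, here is a minimal self-contained sketch (the JSON value is made up; only the compact call matters). The likely cause of the error is that the code was compiled against a json4s version where render takes a default implicit Formats argument, while the json4s on the classpath under Spark 2.0.0 lacks that generated helper method:

```scala
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods.compact

object RenderFixSketch {
  def main(args: Array[String]): Unit = {
    // Illustrative result record; the real one is built in TestRunner.scala.
    val results = ("testName" -> "aggregate-by-key") ~ ("time" -> 1.01)

    // Before: compact(render(results)) -- render's default-argument helper
    // (render$default$2) is missing at runtime, hence the NoSuchMethodError.
    // After: jackson's compact accepts the JValue directly, so render can
    // simply be dropped:
    println(compact(results))
  }
}
```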
Pull request to follow
All modules* built OK; code changes are currently at https://github.com/a-roberts/spark-perf/commit/5f090fc2f1c272b839cee8965c77293d018c18d1
I'll sanity check this by running all of the tests before contributing. I noticed a few API changes we need to handle, and I've also changed the configuration file to look for $SPARK_HOME instead of /root by default.
* Still working on MLlib, actually: in my commit nothing for this module is built (duration 0s!)
I've updated my commit to use the new APIs available in the latest Spark 2 code. I think we should either create a new branch for 2.0 or simply provide different defaults if we detect that the user has specified Spark 2 (e.g. Scala 2.11.8 rather than Scala 2.10.x); see the sketch below. I've verified that all ML tests now function as expected.
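One hypothetical way the build could pick those defaults, shown purely as a sketch (spark-perf does not do this today, and the spark.version property name is made up):

```scala
// Hypothetical: derive the default Scala version from a spark.version
// system property passed in by the harness.
val sparkVersion = sys.props.getOrElse("spark.version", "2.0.0")
val defaultScalaVersion =
  if (sparkVersion.startsWith("2.")) "2.11.8" else "2.10.5"
```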
This currently relies on us having the jars from a recently built Spark 2 in the lib folder for all spark-perf projects. That's because the APIs have changed since the spark-2.0.0-preview artifact that's in Maven Central; the requirement will be removed once spark-2.0.0 artifacts are available.
Would appreciate having this reviewed, you can easily view my changes at https://github.com/databricks/spark-perf/compare/master...a-roberts:master
We've noticed a 30% geomean regression for Spark 2 with this SparkPerf versus Spark 1.5.2 with "normal" SparkPerf (i.e. before this changeset), running with a low scale factor and the configuration below.
Either my changes are a real disaster or we've found a significant performance regression. We can gather a 1.6.2 comparison, but I'd first like my changes to the benchmark itself to be checked so we can rule out problems there.
@pwendell as a top contributor to this project can you or anybody else familiar with the new Spark 2 APIs please review this changeset?
Configuration used where we see the big regression:
- spark-perf/config/config.py: SCALE_FACTOR = 0.05; workers: 1; executors per worker: 1; executor memory: 18G; driver memory: 8G; serializer: Kryo
- $SPARK_HOME/conf/spark-defaults.conf: executor Java options: -Xdisableexplicitgc -Xcompressedrefs
Main changes I made (a before/after sketch follows the list):
- Use Scala 2.11.8 and Spark 2.0.0 RC2 on our local filesystem
- MLAlgorithmTests use Vectors.fromML
- For streaming-tests HdfsRecoveryTest we use wordStream.foreachRDD not wordStream.foreach
- KVDataTest uses awaitTerminationOrTimeout on the StreamingContext instead of awaitTermination
- Trivial: we use compact(json) rather than compact(render(json)) for outputting JSON
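A rough before/after sketch of those API moves (illustrative only; wordStream, ssc, and timeoutMs stand in for the identifiers used in the test sources):

```scala
import org.apache.spark.ml.linalg.{Vectors => MLVectors}
import org.apache.spark.mllib.linalg.{Vectors => MLlibVectors}

object ApiMigrationSketch {
  def main(args: Array[String]): Unit = {
    // MLAlgorithmTests: 2.0's data generators hand back the new
    // org.apache.spark.ml.linalg.Vector, so convert it back to the
    // mllib type that the existing tests are written against:
    val mlVec = MLVectors.dense(1.0, 2.0, 3.0)
    val mllibVec = MLlibVectors.fromML(mlVec)
    println(mllibVec)

    // Streaming changes, shown as comments because they need a live
    // StreamingContext (ssc) and DStream (wordStream):
    //   old: wordStream.foreach { rdd => ... }          // removed in 2.0
    //   new: wordStream.foreachRDD { rdd => ... }
    //   old: ssc.awaitTermination()
    //   new: ssc.awaitTerminationOrTimeout(timeoutMs)   // returns Boolean
  }
}
```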
In Spark 2.0 the top five methods where we spend our time are as follows; the percentage is how much of the overall processing time was spent in that particular method:
- AppendOnlyMap.changeValue 44%
- SortShuffleWriter.write 19%
- SizeTracker.estimateSize 7.5%
- SizeEstimator.estimate 5.36%
- Range.foreach 3.6%
and in 1.5.2 the top five methods are:
- AppendOnlyMap.changeValue 38%
- ExternalSorter.insertAll 33%
- Range.foreach 4%
- SizeEstimator.estimate 2%
- SizeEstimator.visitSingleObject 2%
I see the following scores (test name, then the 1.5.2 time, then the 2.0.0 time):

| Test | 1.5.2 | 2.0.0 |
| --- | --- | --- |
| scheduling throughput | 5.2s | 7.08s |
| agg-by-key | 0.72s | 1.01s |
| agg-by-key-int | 0.93s | 1.19s |
| agg-by-key-naive | 1.88s | 2.02s |
| sort-by-key | 0.64s | 0.8s |
| sort-by-key-int | 0.59s | 0.64s |
| scala-count | 0.09s | 0.08s |
| scala-count-w-fltr | 0.31s | 0.47s |
This is only running the Spark core tests (scheduling throughput through scala-count-w-fltr, including everything in between).
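For reference, the slowdown over just these core tests can be summarised as a geometric mean of the per-test time ratios. The sketch below uses the numbers from the table; note the 30% figure quoted earlier came from a larger run, so it won't reproduce exactly here:

```scala
object GeomeanSketch {
  def main(args: Array[String]): Unit = {
    // (test, 1.5.2 seconds, 2.0.0 seconds), copied from the table above.
    val timings = Seq(
      ("scheduling-throughput", 5.2, 7.08),
      ("agg-by-key", 0.72, 1.01),
      ("agg-by-key-int", 0.93, 1.19),
      ("agg-by-key-naive", 1.88, 2.02),
      ("sort-by-key", 0.64, 0.8),
      ("sort-by-key-int", 0.59, 0.64),
      ("scala-count", 0.09, 0.08),
      ("scala-count-w-fltr", 0.31, 0.47)
    )
    val ratios = timings.map { case (_, v152, v200) => v200 / v152 }
    // Geometric mean: exponentiate the mean of the log ratios.
    val geomean = math.exp(ratios.map(math.log).sum / ratios.size)
    println(f"geomean slowdown over these tests: ${(geomean - 1) * 100}%.1f%%")
  }
}
```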
I'll mention this on the mailing list as part of a general performance regression thread, so this particular item stays focused on the Spark 2.0.0 changes I have made for SparkPerf; the goal is to have something stable to compare Spark releases with.
I'm updating this to work with Spark 2 now that it's available, so we no longer need to use a snapshot or build against an included version.
So now we need to clone and build the new spark-perf to work with Spark 2.0? And which modules of spark-perf will work with Spark 2.0?
All modules, my PR is at https://github.com/databricks/spark-perf/pull/115
But when I went to https://github.com/databricks/spark-perf.git and tried to clone master, I didn't find any commit for 2.0.
That's because my change is a pull request that hasn't been merged yet. I'm working on a small Spark-version issue with the mllib project now, as I see the Travis CI integration build failed. It would be much appreciated if you could clone my changes and see if you find any problems.
Hi, I have cloned your changes, integrated them with Spark 2.0, and run the Spark tests, and I got proper results with no errors. The only change I needed to make was in config.py: in place of MLLIB_SPARK_VERSION = 2.0.0 you need to keep MLLIB_SPARK_VERSION = 2.0.
Where can I clone your changes?
Any update on the issues in this project?
Maybe the spark-perf 2.0 port just replaces some of the packages and doesn't take advantage of the Dataset API?