Spark 2.0.0 support
I'm working on this and will submit a pull request once done. We face NoSuchMethodError problems as soon as you try to run anything but scheduling-throughput.
The fix for that is to modify spark-tests/project/SparkTestsBuild.scala: use 2.0.0-preview for the org.apache.spark dependency version and Scala 2.11.8. Specifically, this resolves

```
NoSuchMethodError: org/apache/spark/SparkContext.rddToPairRDDFunctions(Lorg/apache/spark/rdd/RDD;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;Lscala/math/Ordering;)Lorg/apache/spark/rdd/PairRDDFunctions;
    at spark.perf.AggregateByKey.runTest(KVDataTest.scala:137)
```
which is triggered by
```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class AggregateByKey(sc: SparkContext) extends KVDataTest(sc) {
  override def runTest(rdd: RDD[_], reduceTasks: Int) {
    // reduceByKey is pulled in here via the implicit rddToPairRDDFunctions
    // conversion, which is exactly what the NoSuchMethodError points at.
    rdd.asInstanceOf[RDD[(String, String)]]
      .map { case (k, v) => (k, v.toInt) }.reduceByKey(_ + _, reduceTasks).count()
  }
}
```
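For reference, a minimal sketch of what that SparkTestsBuild.scala change looks like; only the two version bumps are shown, and the rest of the build definition (other dependencies, assembly settings, and so on) is elided:

```scala
import sbt._
import Keys._

object SparkTestsBuild extends Build {
  lazy val root = Project("spark-tests", file("."))
    .settings(
      scalaVersion := "2.11.8",
      libraryDependencies ++= Seq(
        // Bumped from the 1.x line to the 2.0.0-preview artifact:
        "org.apache.spark" %% "spark-core" % "2.0.0-preview" % "provided"
      )
    )
}
```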
With only the above change we get
```
16/06/14 12:52:44 INFO ContextCleaner: Cleaned shuffle 9
Exception in thread "main" java.lang.NoSuchMethodError: org/json4s/jackson/JsonMethods$.render$default$2(Lorg/json4s/JsonAST$JValue;)Lorg/json4s/Formats;
    at spark.perf.TestRunner$.main(TestRunner.scala:47)
    at spark.perf.TestRunner.main(TestRunner.scala)
```
By removing the call to render we can now build and run all of SparkPerf with Spark 2.0.0 (there's probably a better fix; I played around with the json4s import versions but without success). The files to change are as follows, with a sketch of the render change after the list:
```
modified:   lib/sparkperf/testsuites.py
modified:   mllib-tests/project/MLlibTestsBuild.scala
modified:   spark-tests/project/SparkTestsBuild.scala
modified:   streaming-tests/project/StreamingTestsBuild.scala
modified:   spark-tests/src/main/scala/spark/perf/TestRunner.scala
```
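To illustrate the render removal in TestRunner.scala, here is a minimal self-contained sketch (the JSON value is made up; only the compact call matters). The likely cause of the error is that the code was compiled against a json4s version where render takes a default implicit Formats argument, while the json4s on the classpath under Spark 2.0.0 lacks that generated helper method:

```scala
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods.compact

object RenderFixSketch {
  def main(args: Array[String]): Unit = {
    // Illustrative result record; the real one is built in TestRunner.scala.
    val results = ("testName" -> "aggregate-by-key") ~ ("time" -> 1.01)

    // Before: compact(render(results)) -- render's default-argument helper
    // (render$default$2) is missing at runtime, hence the NoSuchMethodError.
    // After: jackson's compact accepts the JValue directly, so render can
    // simply be dropped:
    println(compact(results))
  }
}
```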
Pull request to follow
All modules* built OK; code changes are currently at https://github.com/a-roberts/spark-perf/commit/5f090fc2f1c272b839cee8965c77293d018c18d1
I'll sanity check this by running all of the tests before contributing. I noticed a few API changes we need to handle, and I've also changed the configuration file to look for $SPARK_HOME instead of /root by default.
* Still working on MLlib, actually: in my commit nothing for this module is built (duration 0s!)
I've updated my commit to use the new APIs available in the latest Spark 2 code. I think we should either create a new branch for 2.0 or simply provide different defaults if we detect that the user has specified Spark 2 (e.g. Scala 2.11.8 rather than Scala 2.10.x); see the sketch below. I've verified that all ML tests now function as expected.
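One hypothetical way the build could pick those defaults, shown purely as a sketch (spark-perf does not do this today, and the spark.version property name is made up):

```scala
// Hypothetical: derive the default Scala version from a spark.version
// system property passed in by the harness.
val sparkVersion = sys.props.getOrElse("spark.version", "2.0.0")
val defaultScalaVersion =
  if (sparkVersion.startsWith("2.")) "2.11.8" else "2.10.5"
```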
This currently relies on us having the jars from a recently built Spark 2 in the lib folder for all spark-perf projects. That's because the APIs have changed since the spark-2.0.0-preview artifact that's in Maven Central; the requirement will be removed once spark-2.0.0 artifacts are available.
Would appreciate having this reviewed, you can easily view my changes at https://github.com/databricks/spark-perf/compare/master...a-roberts:master
We've noticed a 30% geomean regression for Spark 2 with this SparkPerf versus Spark 1.5.2 with "normal" SparkPerf (i.e. before this changeset), running with a low scale factor and the configuration below.
Either my changes are a real disaster or we've found a significant performance regression. We can gather a 1.6.2 comparison, but I'd first like my changes to the benchmark itself to be checked so we can rule out problems there.
@pwendell as a top contributor to this project can you or anybody else familiar with the new Spark 2 APIs please review this changeset?
Configuration used where we see the big regression:
- spark-perf/config/config.py: SCALE_FACTOR = 0.05; workers: 1; executors per worker: 1; executor memory: 18G; driver memory: 8G; serializer: Kryo
- $SPARK_HOME/conf/spark-defaults.conf: executor Java options: -Xdisableexplicitgc -Xcompressedrefs
Main changes I made (a before/after sketch follows the list):
- Use Scala 2.11.8 and Spark 2.0.0 RC2 on our local filesystem
- MLAlgorithmTests use Vectors.fromML
- For streaming-tests HdfsRecoveryTest we use wordStream.foreachRDD not wordStream.foreach
- KVDataTest uses awaitTerminationOrTimeout on the StreamingContext instead of awaitTermination
- Trivial: we use compact(json) rather than compact(render(json)) for outputting JSON
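A rough before/after sketch of those API moves (illustrative only; wordStream, ssc, and timeoutMs stand in for the identifiers used in the test sources):

```scala
import org.apache.spark.ml.linalg.{Vectors => MLVectors}
import org.apache.spark.mllib.linalg.{Vectors => MLlibVectors}

object ApiMigrationSketch {
  def main(args: Array[String]): Unit = {
    // MLAlgorithmTests: 2.0's data generators hand back the new
    // org.apache.spark.ml.linalg.Vector, so convert it back to the
    // mllib type that the existing tests are written against:
    val mlVec = MLVectors.dense(1.0, 2.0, 3.0)
    val mllibVec = MLlibVectors.fromML(mlVec)
    println(mllibVec)

    // Streaming changes, shown as comments because they need a live
    // StreamingContext (ssc) and DStream (wordStream):
    //   old: wordStream.foreach { rdd => ... }          // removed in 2.0
    //   new: wordStream.foreachRDD { rdd => ... }
    //   old: ssc.awaitTermination()
    //   new: ssc.awaitTerminationOrTimeout(timeoutMs)   // returns Boolean
  }
}
```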
In Spark 2.0 the top five methods where we spend our time are as follows; the percentage is how much of the overall processing time was spent in that particular method:
- AppendOnlyMap.changeValue 44%
- SortShuffleWriter.write 19%
- SizeTracker.estimateSize 7.5%
- SizeEstimator.estimate 5.36%
- Range.foreach 3.6%
and in 1.5.2 the top five methods are:
- AppendOnlyMap.changeValue 38%
- ExternalSorter.insertAll 33%
- Range.foreach 4%
- SizeEstimator.estimate 2%
- SizeEstimator.visitSingleObject 2%
I see the following scores (test name, then the 1.5.2 time, then the 2.0.0 time):

| Test | 1.5.2 | 2.0.0 |
| --- | --- | --- |
| scheduling throughput | 5.2s | 7.08s |
| agg-by-key | 0.72s | 1.01s |
| agg-by-key-int | 0.93s | 1.19s |
| agg-by-key-naive | 1.88s | 2.02s |
| sort-by-key | 0.64s | 0.8s |
| sort-by-key-int | 0.59s | 0.64s |
| scala-count | 0.09s | 0.08s |
| scala-count-w-fltr | 0.31s | 0.47s |
This is only running the Spark core tests (scheduling throughput through scala-count-w-fltr, including everything in between).
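For reference, the slowdown over just these core tests can be summarised as a geometric mean of the per-test time ratios. The sketch below uses the numbers from the table; note the 30% figure quoted earlier came from a larger run, so it won't reproduce exactly here:

```scala
object GeomeanSketch {
  def main(args: Array[String]): Unit = {
    // (test, 1.5.2 seconds, 2.0.0 seconds), copied from the table above.
    val timings = Seq(
      ("scheduling-throughput", 5.2, 7.08),
      ("agg-by-key", 0.72, 1.01),
      ("agg-by-key-int", 0.93, 1.19),
      ("agg-by-key-naive", 1.88, 2.02),
      ("sort-by-key", 0.64, 0.8),
      ("sort-by-key-int", 0.59, 0.64),
      ("scala-count", 0.09, 0.08),
      ("scala-count-w-fltr", 0.31, 0.47)
    )
    val ratios = timings.map { case (_, v152, v200) => v200 / v152 }
    // Geometric mean: exponentiate the mean of the log ratios.
    val geomean = math.exp(ratios.map(math.log).sum / ratios.size)
    println(f"geomean slowdown over these tests: ${(geomean - 1) * 100}%.1f%%")
  }
}
```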
I'll mention this on the mailing list as part of a general performance regression thread, so this particular item stays focused on the Spark 2.0.0 changes I have made for SparkPerf; the goal is to have something stable to compare Spark releases with.
I'm updating this to work with Spark 2 now that it's available, so we no longer need to use a snapshot or build against an included version.
So now we need to clone and build the new spark-perf to work with Spark 2.0? And which modules of spark-perf will work with Spark 2.0?
All modules, my PR is at https://github.com/databricks/spark-perf/pull/115
But when I went to https://github.com/databricks/spark-perf.git and tried to clone master, I didn't find any commit for 2.0.
That's because my change is a pull request that hasn't been merged yet. I'm working on a small Spark-version issue with the mllib project now, as I see the Travis CI integration build failed. It would be much appreciated if you could clone my changes and see if you find any problems.
Hi, I have cloned your changes, integrated them with Spark 2.0, and run the Spark tests, and I got proper results with no errors. The only change I needed to make was in config.py: in place of MLLIB_SPARK_VERSION = 2.0.0 you need to keep MLLIB_SPARK_VERSION = 2.0.
Where can I clone your changes?
Any update on the issues in this project?
Maybe the spark-perf 2.0 port just replaces some of the packages and doesn't take advantage of the Dataset API?