spark
Lightning-fast cluster computing in Java, Scala and Python.
Often when mapping some RDD, you want to do a bit of setup before processing each partition, followed by cleanup at the end of the partition; this PR adds utility functions...
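A minimal sketch of the pattern this would encapsulate, written directly against `mapPartitions` (the helpers the PR actually adds may look different; the `Connection` resource here is hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SetupCleanupSketch {
  // Hypothetical stand-in for an expensive per-partition resource.
  class Connection {
    def lookup(x: Int): Int = x * 2
    def close(): Unit = ()
  }

  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("setup-cleanup").setMaster("local[*]"))
    val rdd = sc.parallelize(1 to 100, numSlices = 4)

    val mapped = rdd.mapPartitions { iter =>
      val conn = new Connection()              // setup, once per partition
      val out  = iter.map(conn.lookup).toList  // force the lazy iterator so cleanup is safe
      conn.close()                             // cleanup, once per partition
      out.iterator
    }

    println(mapped.count())
    sc.stop()
  }
}
```

Materializing with `toList` keeps the sketch simple; utility functions like the ones described would presumably run the cleanup step without buffering the whole partition.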
In Spark, it is common practice (and usually preferred) to launch a large number of small tasks, which unfortunately can create an even larger number of very small shuffle files...
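For intuition about why the file count explodes, here is the arithmetic with invented numbers (paste into the Scala REPL): each map task writes one shuffle file per reducer, so the total is M × R; consolidating output per core instead of per map task would cut it to cores × R.

```scala
// Illustrative arithmetic only; these numbers are made up.
val mapTasks = 10000L  // M
val reducers = 10000L  // R
println(s"one file per (map task, reducer): ${mapTasks * reducers}")  // 100,000,000

// Consolidating per executor core instead of per map task:
val cores = 400L
println(s"one file per (core, reducer):     ${cores * reducers}")     // 4,000,000
```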
In many applications (especially graph computation and machine learning), we iteratively join model parameters (vertices) with data (edges). In these cases it can be beneficial to pre-organize the records...
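A sketch of that pre-organization using the RDD API (the key/value types and the update rule are hypothetical): hash-partition both sides once with the same partitioner and cache them, so each iteration's join is partition-local instead of re-shuffling.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PrePartitionedJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("prepartition").setMaster("local[*]"))
    val partitioner = new HashPartitioner(8)

    // Model parameters ("vertices"), partitioned once and cached.
    var params = sc.parallelize(Seq((1L, 0.0), (2L, 0.0)))
      .partitionBy(partitioner).cache()

    // Data ("edges"), partitioned the same way and cached.
    val data = sc.parallelize(Seq((1L, 1.5), (2L, -0.5)))
      .partitionBy(partitioner).cache()

    for (_ <- 1 to 5) {
      // Co-partitioned join: neither side is shuffled on any iteration.
      params = params.join(data)
        .mapValues { case (w, x) => w + 0.1 * x }  // hypothetical update rule
    }
    params.collect().foreach(println)
    sc.stop()
  }
}
```

`mapValues` preserves the partitioner, so the co-partitioning survives across iterations of the loop.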
This isn't ready to be merged yet (needs tests & docs), but I wanted to get some feedback. The point of this PR is to provide a really high-level metric on...
I added some fixes to reduce heap usage and prevent a memory leak we found after running a load test (thousands of jobs). There may be more leaks, but this fixes a big one.
[SPARK-872] I think we should call reviveOffers in the statusUpdate function to request resources. Currently, scheduler.statusUpdate calls reviveOffers only for TASK_LOST and TASK_FAILED, so it also needs to handle the TASK_FINISHED scenario,...
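A simplified, self-contained sketch of the proposed change (the types below are stand-ins, not Spark's real scheduler classes):

```scala
object StatusUpdateSketch {
  sealed trait TaskState
  case object TaskFinished extends TaskState
  case object TaskFailed   extends TaskState
  case object TaskLost     extends TaskState
  case object TaskRunning  extends TaskState

  trait SchedulerBackend { def reviveOffers(): Unit }

  def statusUpdate(state: TaskState, backend: SchedulerBackend): Unit = state match {
    // Before: only TASK_LOST and TASK_FAILED revived offers.
    // Proposed: TASK_FINISHED also revives, so the freed slot is re-offered promptly.
    case TaskFinished | TaskFailed | TaskLost => backend.reviveOffers()
    case TaskRunning                          => () // still running: nothing to do
  }
}
```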
I've been using `spark.examples.SparkLR` for performance testing. Just passing along some enhancements to the class.
I'm not completely certain that this is ready to be merged yet, but I think it is ready to gather some comments. Outside of DAGScheduler.scala, the changes are almost entirely...
One of the scalability problems we saw is that when processing huge data sets, we need a large number of reduce splits, which makes the memory overhead of the shuffle writers become...
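A rough back-of-the-envelope for why this hurts (all numbers invented for illustration): each concurrently running map task holds one buffered writer per reduce split.

```scala
// Made-up numbers, purely to show the scaling.
val reduceSplits       = 10000L  // R
val bufferKbPerWriter  = 100L    // per open shuffle file
val runningMapsPerNode = 8L      // concurrent map tasks on one machine

val mbPerNode = reduceSplits * bufferKbPerWriter * runningMapsPerNode / 1024
println(s"~$mbPerNode MB of writer buffers per node")  // ~7812 MB
```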
In the current YARN mode approach, the application runs inside the Application Master as a user program, so the whole Spark context lives remotely. This approach won't support application...
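For contrast, a sketch of the client-side alternative this is driving at, where the SparkContext stays in the user's own JVM on the gateway machine (the "yarn-client" master URL shown is the historical client-mode value; treat the details as an assumption):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object YarnClientSketch {
  def main(args: Array[String]): Unit = {
    // The driver (and its SparkContext) stay local; only executors run in
    // YARN containers, which is what makes interactive use possible.
    val conf = new SparkConf()
      .setAppName("interactive-session")
      .setMaster("yarn-client")  // historical client-mode master URL
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 100).sum())
    sc.stop()
  }
}
```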