spark
Lightning-fast cluster computing in Java, Scala and Python.
Often when mapping some RDD, you want to do a bit of setup before processing each partition, followed by cleanup at the end of the partition; this PR adds utility functions...
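A minimal sketch of the pattern this would encapsulate, written directly against `mapPartitions` (the helpers the PR actually adds may look different; the `Connection` resource here is hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SetupCleanupSketch {
  // Hypothetical stand-in for an expensive per-partition resource.
  class Connection {
    def lookup(x: Int): Int = x * 2
    def close(): Unit = ()
  }

  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("setup-cleanup").setMaster("local[*]"))
    val rdd = sc.parallelize(1 to 100, numSlices = 4)

    val mapped = rdd.mapPartitions { iter =>
      val conn = new Connection()              // setup, once per partition
      val out  = iter.map(conn.lookup).toList  // force the lazy iterator so cleanup is safe
      conn.close()                             // cleanup, once per partition
      out.iterator
    }

    println(mapped.count())
    sc.stop()
  }
}
```

Materializing with `toList` keeps the sketch simple; utility functions like the ones described would presumably run the cleanup step without buffering the whole partition.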
In Spark, it is common practice (and usually preferred) to launch a large number of small tasks, which unfortunately can create an even larger number of very small shuffle files...
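For intuition about why the file count explodes, here is the arithmetic with invented numbers (paste into the Scala REPL): each map task writes one shuffle file per reducer, so the total is M × R; consolidating output per core instead of per map task would cut it to cores × R.

```scala
// Illustrative arithmetic only; these numbers are made up.
val mapTasks = 10000L  // M
val reducers = 10000L  // R
println(s"one file per (map task, reducer): ${mapTasks * reducers}")  // 100,000,000

// Consolidating per executor core instead of per map task:
val cores = 400L
println(s"one file per (core, reducer):     ${cores * reducers}")     // 4,000,000
```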
In many applications (especially graph computation and machine learning), we iteratively join model parameters (vertices) with data (edges). In these cases it can be beneficial to pre-organize the records...
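A sketch of that pre-organization using the RDD API (the key/value types and the update rule are hypothetical): hash-partition both sides once with the same partitioner and cache them, so each iteration's join is partition-local instead of re-shuffling.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PrePartitionedJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("prepartition").setMaster("local[*]"))
    val partitioner = new HashPartitioner(8)

    // Model parameters ("vertices"), partitioned once and cached.
    var params = sc.parallelize(Seq((1L, 0.0), (2L, 0.0)))
      .partitionBy(partitioner).cache()

    // Data ("edges"), partitioned the same way and cached.
    val data = sc.parallelize(Seq((1L, 1.5), (2L, -0.5)))
      .partitionBy(partitioner).cache()

    for (_ <- 1 to 5) {
      // Co-partitioned join: neither side is shuffled on any iteration.
      params = params.join(data)
        .mapValues { case (w, x) => w + 0.1 * x }  // hypothetical update rule
    }
    params.collect().foreach(println)
    sc.stop()
  }
}
```

`mapValues` preserves the partitioner, so the co-partitioning survives across iterations of the loop.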
This isn't ready to be merged yet (needs tests & docs), but I wanted to get some feedback. The point of this PR is to provide a really high-level metric on...
I added some fixes to reduce heap usage and prevent a memory leak we found after running a load test (thousands of jobs). There may be more leaks, but this fixes a big one.
[SPARK-872] I think we should call reviveOffers in the statusUpdate function to request resources. Currently, scheduler.statusUpdate calls reviveOffers only for TASK_LOST and TASK_FAILED, so it also needs to handle the TASK_FINISHED scenario,...
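A simplified, self-contained sketch of the proposed change (the types below are stand-ins, not Spark's real scheduler classes):

```scala
object StatusUpdateSketch {
  sealed trait TaskState
  case object TaskFinished extends TaskState
  case object TaskFailed   extends TaskState
  case object TaskLost     extends TaskState
  case object TaskRunning  extends TaskState

  trait SchedulerBackend { def reviveOffers(): Unit }

  def statusUpdate(state: TaskState, backend: SchedulerBackend): Unit = state match {
    // Before: only TASK_LOST and TASK_FAILED revived offers.
    // Proposed: TASK_FINISHED also revives, so the freed slot is re-offered promptly.
    case TaskFinished | TaskFailed | TaskLost => backend.reviveOffers()
    case TaskRunning                          => () // still running: nothing to do
  }
}
```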
I've been using `spark.examples.SparkLR` for performance testing. Just passing along some enhancements to the class.
I'm not completely certain that this is ready to be merged yet, but I think it is ready to gather some comments. Outside of DAGScheduler.scala, the changes are almost entirely...
One of the scalability problems we saw is that when processing huge data sets, we need a large number of reduce splits, which makes the memory overhead of the shuffle writers become...
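A rough back-of-the-envelope for why this hurts (all numbers invented for illustration): each concurrently running map task holds one buffered writer per reduce split.

```scala
// Made-up numbers, purely to show the scaling.
val reduceSplits       = 10000L  // R
val bufferKbPerWriter  = 100L    // per open shuffle file
val runningMapsPerNode = 8L      // concurrent map tasks on one machine

val mbPerNode = reduceSplits * bufferKbPerWriter * runningMapsPerNode / 1024
println(s"~$mbPerNode MB of writer buffers per node")  // ~7812 MB
```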
In the current YARN mode approach, the application runs inside the Application Master as a user program, so the whole Spark context lives remotely. This approach won't support application...
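For contrast, a sketch of the client-side alternative this is driving at, where the SparkContext stays in the user's own JVM on the gateway machine (the "yarn-client" master URL shown is the historical client-mode value; treat the details as an assumption):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object YarnClientSketch {
  def main(args: Array[String]): Unit = {
    // The driver (and its SparkContext) stay local; only executors run in
    // YARN containers, which is what makes interactive use possible.
    val conf = new SparkConf()
      .setAppName("interactive-session")
      .setMaster("yarn-client")  // historical client-mode master URL
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 100).sum())
    sc.stop()
  }
}
```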