SparkInternals
Notes on the design and implementation of Apache Spark
Fixed a typo in 4-shuffleDetails.md.
The link https://www.gitbook.com/download/epub/book/yourtion/sparkinternals keeps redirecting to the registration page, and the epub cannot be downloaded even after logging in. Could you upload the file directly instead?
From the paper [Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing](https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf) - narrow dependencies, where each partition of the parent RDD is used by at most one partition...
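As a quick illustration of the two dependency kinds the paper distinguishes, here is a minimal sketch (assuming a running spark-shell with the usual `sc`): `map` keeps a 1:1 parent-to-child partition mapping, while `groupByKey` forces a shuffle, and `rdd.dependencies` shows which kind was created.

```scala
val pairs = sc.parallelize(1 to 100).map(i => (i % 10, i))

val mapped  = pairs.map { case (k, v) => (k, v * 2) }   // narrow: each child partition reads one parent partition
val grouped = pairs.groupByKey()                        // wide: every child partition may read every parent partition

println(mapped.dependencies)   // List(org.apache.spark.OneToOneDependency@...)
println(grouped.dependencies)  // List(org.apache.spark.ShuffleDependency@...)
```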
This would be a great addition to the Thai books list at [Free-Programming-Books](https://github.com/EbookFoundation/free-programming-books/pull/6434). But the link proposed there makes the Thai resource hard to find, and a link directly...
Removed a redundant "blog" word.
(As the title says.)
I have been reading the CoGroupedRDD implementation, but I don't understand how a NarrowDependency or ShuffleDependency affects the partitions of the CoGroupedRDD. If I call a.cogroup(b), with a partitioned by a RangePartitioner or a HashPartitioner, does the intermediate CoGroupedRDD end up with the same number of partitions as RDD a? It seems the cogroup operator cannot be given a numPartitions argument. In the JobLogicalPlan chapter you divide dependencies into four types (or two broad categories), yet CoGroupedRDD's handling of dependencies doesn't look that complicated and seems to ignore the so-called N:1 NarrowDependency entirely.

```scala
override def compute(s: Partition, context: TaskContext): Iterator[(K, Array[Iterable[_]])] = {
  val sparkConf = SparkEnv.get.conf
  val externalSorting = sparkConf.getBoolean("spark.shuffle.spill", true)...
```
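To probe the question above, here is a small spark-shell sketch (names and partition counts are illustrative): CoGroupedRDD gives a parent whose partitioner matches the result's a OneToOneDependency, and any other parent a ShuffleDependency; without an explicit argument, cogroup reuses the first parent's partitioner. Note also that cogroup does have overloads accepting numPartitions or a Partitioner.

```scala
import org.apache.spark.HashPartitioner

val a = sc.parallelize(1 to 100).map(i => (i, i)).partitionBy(new HashPartitioner(4))
val b = sc.parallelize(1 to 100).map(i => (i, -i))   // no partitioner

val c = a.cogroup(b)          // overloads also accept numPartitions or a Partitioner
println(c.partitions.length)  // 4: defaults to a's existing partitioner

// a's partitioner matches the result's -> narrow 1:1 dependency;
// b has no partitioner -> its data must be shuffled
c.dependencies.foreach(println)
// org.apache.spark.OneToOneDependency@...
// org.apache.spark.ShuffleDependency@...
```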
In Hadoop, data distributed via DistributedCache is shared. Before each task starts, an initialization step runs that includes downloading the DistributedCache data. On any given node this can only happen sequentially, so multiple tasks have to wait for one another. If the data has already been downloaded, it is not downloaded again.
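For comparison, a minimal sketch of Spark's nearest analogue (the file path here is hypothetical): SparkContext.addFile ships a file to each node once, and tasks resolve the node-local copy via SparkFiles.get, so it is likewise not re-downloaded per task.

```scala
import org.apache.spark.SparkFiles

sc.addFile("hdfs:///data/lookup.txt")   // hypothetical path; fetched once per node

val lineCounts = sc.parallelize(1 to 4, 4).map { _ =>
  val localPath = SparkFiles.get("lookup.txt")   // node-local copy, no re-download
  scala.io.Source.fromFile(localPath).getLines().length
}.collect()
```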
Thanks so much for your great effort, highly appreciated! I guess the latest content is based on Spark 1.0; any plans for Spark 2.X?