SparkInternals

Why is the definition of dependencies different from the RDD paper?

Open endersuu opened this issue 3 years ago • 0 comments

From the paper "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing":

  • narrow dependencies, where each partition of the parent RDD is used by at most one partition of the child RDD
  • wide dependencies, where multiple child partitions may depend on it

However, the definition of dependencies in the chapter JobLogicalPlan is different:

  • NarrowDependency: each partition of the child RDD fully depends on a small number of partitions of its parent RDD. "Fully depends" (i.e., FullDependency) means that a child partition depends on the entire parent partition.

  • ShuffleDependency: multiple child partitions partially depend on a parent partition. "Partially depends" (i.e., PartialDependency) means that each child partition depends on only a part of the parent partition.
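
To make the comparison concrete, here is a minimal sketch of the paper's criterion only. It is not Spark code: the function `is_narrow_per_paper` and the partition maps below are hypothetical, made up for illustration. Each map lists, for every child partition, which parent partitions it reads.

```python
from collections import Counter

def is_narrow_per_paper(deps):
    """Check the RDD paper's criterion on a hypothetical dependency map.

    deps maps each child partition id to the set of parent partition ids
    it reads. The dependency counts as narrow if every parent partition
    is used by at most one child partition.
    """
    uses = Counter(p for parents in deps.values() for p in parents)
    return all(count <= 1 for count in uses.values())

# Illustrative patterns (assumed shapes, not taken from Spark itself):
one_to_one = {0: {0}, 1: {1}, 2: {2}}            # child i reads only parent i
many_to_one = {0: {0, 1}, 1: {2, 3}}             # each child reads a few parents
all_to_all = {0: {0, 1, 2}, 1: {0, 1, 2}}        # every child reads every parent

print(is_narrow_per_paper(one_to_one))   # True
print(is_narrow_per_paper(many_to_one))  # True
print(is_narrow_per_paper(all_to_all))   # False
```

This map only records which parent partitions a child touches. It cannot express whether a child reads the entire parent partition or just part of it, which is exactly the full/partial distinction the chapter's definition relies on.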

This makes me really confused. Are ShuffleDependency and wide dependency the same thing?

endersuu, Oct 09 '22 03:10