Lazy evaluation of join and map operations

Open jegonzal opened this issue 12 years ago • 1 comments

The VertexSetRDD[VD] stores the vertex attributes as an IndexedSeq[VD]. When a VertexSetRDD is first constructed from an RDD[(Vid,VD)], the attributes are stored in an Array[VD]. When mapValues is invoked on a VertexSetRDD[VD], a new array is created and populated with the result of the map operation.

https://github.com/amplab/graphx/blob/master/graph/src/main/scala/org/apache/spark/graph/VertexSetRDD.scala#L129

However when leftJoin is invoked an IndexedSeqView is created:

https://github.com/amplab/graphx/blob/master/graph/src/main/scala/org/apache/spark/graph/VertexSetRDD.scala#L192

Should both be implemented using views, or should both be implemented using actual storage? The tradeoffs are the following:

Using views means that long chains of computation might be invoked repeatedly.
Using Arrays could lead to many long-lived allocations.
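
To make the tradeoff concrete, here is a minimal sketch using plain Scala collections rather than VertexSetRDD itself. The object ViewVsArray, the function expensive, and the call counter are hypothetical names introduced only for illustration; the point is that a mapped array runs the function once per element up front, while a mapped view reruns it on every read.

```scala
object ViewVsArray {
  // Counts how many times the mapped function actually runs.
  var calls = 0

  // Hypothetical stand-in for a user function passed to a map operation.
  def expensive(x: Int): Int = { calls += 1; x * 2 }

  // Strict: allocate a new array and evaluate every element exactly once.
  def strictReads(attrs: Array[Int], reads: Int): Int = {
    calls = 0
    val mapped = attrs.map(expensive)
    var sink = 0
    var i = 0
    while (i < reads) { sink += mapped(0); i += 1 }
    calls
  }

  // Lazy: no new array, but each element access re-invokes the function.
  def lazyReads(attrs: Array[Int], reads: Int): Int = {
    calls = 0
    val mapped = attrs.view.map(expensive)
    var sink = 0
    var i = 0
    while (i < reads) { sink += mapped(0); i += 1 }
    calls
  }
}
```

With a three-element array and five reads of the same index, strictReads reports 3 calls and lazyReads reports 5. The strict cost is fixed by the array size, while the lazy cost grows with the number of reads and with the length of the chained view.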

I suspect all the operations should be implemented using views, but I am not sure what the implications are for caching.

Oct 19 '13 19:10 jegonzal

The current justification for two separate strategies is that the join operation is lightweight, so recomputing it would not be costly. The mapValues operation, by contrast, could be arbitrarily expensive.
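
The mixed strategy described above can be sketched with plain Scala collections. This is an illustration under assumptions, not the VertexSetRDD code: MixedStrategy, expensiveMap, and cheapJoin are hypothetical names, and the join is modeled as an index-aligned pairing of two arrays. The map result is materialized in an array, and the join is layered on top as a view.

```scala
object MixedStrategy {
  var mapCalls = 0
  var joinCalls = 0

  // Hypothetical expensive user function, as might be passed to mapValues.
  def expensiveMap(x: Int): Int = { mapCalls += 1; x * x }

  // Hypothetical cheap join function pairing an attribute with an optional value.
  def cheapJoin(a: Int, b: Option[String]): (Int, Option[String]) = {
    joinCalls += 1
    (a, b)
  }

  // Returns (mapCalls, joinCalls) after reading one joined element `reads` times.
  def run(reads: Int): (Int, Int) = {
    mapCalls = 0
    joinCalls = 0
    val attrs = Array(1, 2, 3)
    val other: Array[Option[String]] = Array(Some("a"), None, Some("c"))

    // Strict: the expensive function runs once per element, up front.
    val mapped = attrs.map(expensiveMap)

    // Lazy: the join is a view over the materialized array, recomputed per read.
    val joined = attrs.indices.view.map(i => cheapJoin(mapped(i), other(i)))

    var sink = 0
    var i = 0
    while (i < reads) { sink += joined(1)._1; i += 1 }
    (mapCalls, joinCalls)
  }
}
```

Here run(4) yields 3 map calls and 4 join calls: repeated reads rerun only the cheap join function, because the view sits on top of an array rather than on top of another view.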

Oct 19 '13 19:10 jegonzal