Lazy evaluation of join and map operations
The VertexSetRDD[VD] stores the vertex attributes as an IndexedSeq[VD]. When a VertexSetRDD is first constructed from an RDD[(Vid,VD)] the attributes are stored in an Array[VD]. When mapValues is in invoked on a VertexSetRDD[VD] a new array is created and populated with the result of the map operation.
https://github.com/amplab/graphx/blob/master/graph/src/main/scala/org/apache/spark/graph/VertexSetRDD.scala#L129
However when leftJoin is invoked an IndexedSeqView is created:
https://github.com/amplab/graphx/blob/master/graph/src/main/scala/org/apache/spark/graph/VertexSetRDD.scala#L192
Should both be implemented using views or should both be implemented using actual storage. The tradeoffs are the following:
- Using views means that long chains of computation might be invoked repeatedly.
- Using Arrays could lead to many long-lived allocations.
I suspect all the operations should be implemented using the view but I am not sure what the implications are for caching.
The current justification for two separate strategies is that the join operation is "light weight" and so recomputing it would not be costly. Alternatively, the mapValues operation could be arbitrarily expensive.