Something wrong with the random walk in node2vec_spark?
val edge2attr = graph.triplets.map { edgeTriplet => (s"${edgeTriplet.srcId}${edgeTriplet.dstId}", edgeTriplet.attr) }.repartition(200).cache
(s"${prevNodeId}${currentNodeId}", (srcNodeId, pathBuffer)) }.join(edge2attr).map { case (edge, ((srcNodeId, pathBuffer), attr)) =>
In the code, the join key is generated by s"${edgeTriplet.srcId}${edgeTriplet.dstId}". Do we need a separator between the two elements?
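Actually, yes. You should use s"${edgeTriplet.srcId}\t${edgeTriplet.dstId}" instead. The key on the other side of the join, s"${prevNodeId}${currentNodeId}", needs the same separator, otherwise the two sides will no longer match.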
Yes. If you don't add a separator, the edge between node #1 and node #1111 gets the same key as the edge between node #11 and node #111: both become '11111'. With a separator such as \t, the keys are 1\t1111 and 11\t111, which are distinct.
And I think that's why you got bad results when your data is very big: the more data you use, the higher the chance of joining against the wrong edges.
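To make the collision described above concrete, here is a minimal sketch in plain Scala (no Spark needed). The object `EdgeKeyDemo` and the helpers `concatKey` and `tabKey` are hypothetical names made up for this illustration; they are not part of node2vec_spark.

```scala
// Hypothetical helpers for illustration only; not part of node2vec_spark.
object EdgeKeyDemo {
  // Key built the way the quoted code does it: ids concatenated with no separator.
  def concatKey(src: Long, dst: Long): String = s"$src$dst"

  // Key with a tab between the ids, as suggested in the replies.
  def tabKey(src: Long, dst: Long): String = s"$src\t$dst"

  def main(args: Array[String]): Unit = {
    val edgeA = (1L, 1111L)  // edge from node 1 to node 1111
    val edgeB = (11L, 111L)  // edge from node 11 to node 111

    // Without a separator the two different edges share one key.
    println(concatKey(edgeA._1, edgeA._2) == concatKey(edgeB._1, edgeB._2))

    // With a separator the keys differ.
    println(tabKey(edgeA._1, edgeA._2) == tabKey(edgeB._1, edgeB._2))
  }
}
```

The first comparison is true and the second is false. An alternative that avoids string encoding altogether would be to key both sides of the join on the tuple (srcId, dstId), since tuples of Long compare by value; that is a suggestion, not something the project is known to do.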