networkit icon indicating copy to clipboard operation
networkit copied to clipboard

Profiling displays wrong results with deleted nodes

Open michitux opened this issue 8 years ago • 8 comments

The profiling module considers deleted nodes as nodes with centrality 0 in many cases which leads to wrong plots. The notebook at http://nbviewer.jupyter.org/gist/michitux/c70f61c350b7de81a7e0e937ea31cb96 shows a very simple example. The graphs G and H are identical apart from 1000 deleted nodes (added and removed again). However, their profiles are very different. While for these graphs the problem is quite obvious, for graphs with skewed degree distributions the problem might go unnoticed.

michitux avatar Jul 24 '17 09:07 michitux

I ran into the same problem. I would find it helpful if this behavior was either documented or changed.

The problem results from having sparse indices. I think it could even be fixed easily, by changing or having another version of parallelForNodes. This function could hold a second counter only for valid nodes. So the handle function would get two indices: One to address the node it is meant for (the current v), and one as a dense counter for that node, which could be used to store it correctly in the centrality measures. E.g. DegreeCentrality could then just use a vector of the actual size of nodes and store the degree at the correct position with the new counter. (I can prepare a PR if wished, but not in the next 3 weeks).

@michitux For my case I could fix it in Python by simply reindexing all nodes like this:

from networkit import *
# g is some graph, possibly with deleted nodes
g = graph.GraphTools.getCompactedGraph(g, graph.GraphTools.getContinuousNodeIds(g))

This circumvents the whole issue and is a fine quickfix imo, but it would be nice to have it in the documentation.

jstriebel avatar Aug 29 '17 18:08 jstriebel

This is a severe problem we should do something about, preferably before the next bugfix release.

hmeyerhenke avatar Jan 26 '18 07:01 hmeyerhenke

Instead of changing parallelForNodes(), I would prefer to update the centrality computations to properly take deleted nodes into account.

Letting everything that uses parallelForNodes() deal with node indices that can change across invocations seems to be more complex than just handling deleted nodes correctly.

I can prepare patches to the metrics that are part of the profile.

avdgrinten avatar Jan 26 '18 10:01 avdgrinten

The centrality computations are correct afaik. The problem are the plotting libraries which are external. They just get the array that is indexed by node id and contains 0 for missing nodes (but existing nodes might get 0, too). Using NaN for missing nodes in centrality results might be worth a try but I'm not sure if it actually works.

michitux avatar Jan 26 '18 11:01 michitux

I also found the centrality computations to be correct when assuming that the node indices are dense. I don't see a way to fix it for sparse indices without the change I proposed.

This affects not only plotting, but also statistics accessible via getStat.

A remark in the documentation and maybe a runtime warning would be very helpful as a first step. This behavior could simply lead to wrong insights, which is indeed a severe problem, as @hmeyerhenke already noted.

/cc @maxkatzmann @thobl

jstriebel avatar Mar 01 '18 11:03 jstriebel

I plan to prepare a fix for this issue for the next NetworKit release. As discussed above I do not plan to change parallelForNodes() but integrate the fix into the metric computations individually.

avdgrinten avatar Mar 01 '18 11:03 avdgrinten

@avdgrinten I checked this long-open issue and it seems, that with the current release (tested on binder-hub) the problems still remain.

fabratu avatar Oct 25 '21 07:10 fabratu

This should probably be re-evaluted and moved to a new issue ("Deleted nodes are not handled correctly in all algorithms" or so).

avdgrinten avatar Oct 25 '21 09:10 avdgrinten