Performance improvements and bugfix to Tarjan's algorithm
This is the same as this PR in the old lightgraphs repository.
From substantial benchmarking, this is a significant speedup in most cases and an asymptotic improvement from quadratic to linear for star graphs, with a performance regression of a few percent only in the limit of extremely sparse graphs. Furthermore, it opens up the possibility for future speedups in other functions provided by this package, since it also computes data that can be directly used in those other functions.
This PR does not change the API for the function itself, but it does add a couple of performance and type hint functions that other library types that inherit from AbstractGraphs can optionally overload to improve/tune performance, for example if they store adjacency lists in a linked list and so have different performance characteristics than array representations.
Codecov Report
Merging #32 (b1018bf) into master (11f54ad) will increase coverage by
2.07%. The diff coverage is88.33%.
:exclamation: Current head b1018bf differs from pull request most recent head 83cda6f. Consider uploading reports for the commit 83cda6f to get more accurate results
@@ Coverage Diff @@
## master #32 +/- ##
==========================================
+ Coverage 97.26% 99.33% +2.07%
==========================================
Files 114 106 -8
Lines 6579 5571 -1008
==========================================
- Hits 6399 5534 -865
+ Misses 180 37 -143
@saolof While I can't review this in great detail, I am wondering what the level of testing we have is. If it is reasonable, then passing tests would make it easier to quickly merge.
@saolof While I can't review this in great detail, I am wondering what the level of testing we have is. If it is reasonable, then passing tests would make it easier to quickly merge.
Tests are reasonably all-encompassing. One gotcha is that there are a few probabilistic tests in the codebase that will fail in ~1% of test runs, and can cause tests to fail for unrelated code changes. So in those cases it is worth checking which test failed.
For this particular PR, it passes the CI tests, and in addition to that the original PR thread also has a ton of benchmarks which act as additional tests on a wide range of different random graphs to check that the output is the same as on stable.
@jpfairbanks Thoughts?
I also haven't reviewed in detail, but I don't see anything crazy. @simonschoelly, you were reviewing this on the old repo. Can you take a look before we merge?
Bump. Is this good to merge? Do we have enough tests to give us confidence in merging?
@saolof can you add a few more tests to this so the new codepaths are fully tested? It looks like this PR currently decreases test coverage. Other than that, this is looks like a really nice improvement.
Sure. I'll add tests for infer_nb_iterstate_type by creating a mock graph type for it to infer the node iterator type of.
Bump, has anyone been able to review this in detail?
not yet.
Presumably fails because of a bug in Aqua: https://github.com/JuliaTesting/Aqua.jl/issues/139
I just fixed a similar quadratic hit in https://github.com/JuliaGraphs/Graphs.jl/pull/266. Rather than pushing neighbors on a separate stack (here, depending on number of neighbors), I did it by always pushing the neighbors directly on the dfs stack, and using a more refined coloring scheme than just visited / unvisited. Maybe it is worth doing the same here ?
I just fixed a similar quadratic hit in #266. Rather than pushing neighbors on a separate stack (here, depending on number of neighbors), I did it by always pushing the neighbors directly on the dfs stack, and using a more refined coloring scheme than just visited / unvisited. Maybe it is worth doing the same here ?
I don't push neighbours onto a stack at all. That would be quadratic memory. The algorithm only has a dfs stack for visited stack, and the iteration index for neighbours when iterating over nodes that have a lot of neighbours.
For me, this is pushing on a stack ?
if is_large_vertex(g, s)
push!(largev_iterstate_stack, iterate(outneighbors(g, s)))
end
This is not quadratic memory, this is linear in the number of edges. Edit: Oh, you just push the iteration state
For me, this is pushing on a stack ?
if is_large_vertex(g, s) push!(largev_iterstate_stack, iterate(outneighbors(g, s))) endThis is not quadratic memory, this is linear in the number of edges.
I am pushing the current node on the stack, not the neighbours. In the Julia iteration protocol, iterate returns a tuple of the current iterated object and the iteration index.
In a worst case memory scenario like a tournament graph where all nodes have many neighbours (most of which are already visited) that pushes all nodes but not all edges, and in that case this PR gives a 4x improvement in runtime compared to the previous implementation.
Tarjan's algorithm is implicitly worst case linear memory in the number of nodes, and among other things this PR improves the constant factor compared to the previous implementation.
Ok, I think I have the solution: We need to use Stateful iterators. This boils down to the same, but at least, this is part of the julia API, and it provides a cleaner implementation.