Graphs.jl Performance improvements and bugfix to Tarjan's algorithm

This is the same as this PR in the old lightgraphs repository.

From substantial benchmarking, this is a significant speedup in most cases and an asymptotic improvement from quadratic to linear for star graphs, with a performance regression of a few percent only in the limit of extremely sparse graphs. Furthermore, it opens up the possibility for future speedups in other functions provided by this package, since it also computes data that can be directly used in those other functions.

This PR does not change the API for the function itself, but it does add a couple of performance and type hint functions that other library types that inherit from AbstractGraphs can optionally overload to improve/tune performance, for example if they store adjacency lists in a linked list and so have different performance characteristics than array representations.

Oct 31 '21 10:10 saolof

Codecov Report

Merging #32 (b1018bf) into master (11f54ad) will increase coverage by 2.07%. The diff coverage is 88.33%.

:exclamation: Current head b1018bf differs from pull request most recent head 83cda6f. Consider uploading reports for the commit 83cda6f to get more accurate results

@@            Coverage Diff             @@
##           master      #32      +/-   ##
==========================================
+ Coverage   97.26%   99.33%   +2.07%     
==========================================
  Files         114      106       -8     
  Lines        6579     5571    -1008     
==========================================
- Hits         6399     5534     -865     
+ Misses        180       37     -143

Nov 07 '21 22:11 codecov[bot]

@saolof While I can't review this in great detail, I am wondering what the level of testing we have is. If it is reasonable, then passing tests would make it easier to quickly merge.

Nov 08 '21 04:11 ViralBShah

@saolof While I can't review this in great detail, I am wondering what the level of testing we have is. If it is reasonable, then passing tests would make it easier to quickly merge.

Tests are reasonably all-encompassing. One gotcha is that there are a few probabilistic tests in the codebase that will fail in ~1% of test runs, and can cause tests to fail for unrelated code changes. So in those cases it is worth checking which test failed.

For this particular PR, it passes the CI tests, and in addition to that the original PR thread also has a ton of benchmarks which act as additional tests on a wide range of different random graphs to check that the output is the same as on stable.

Nov 11 '21 22:11 saolof

@jpfairbanks Thoughts?

Nov 11 '21 23:11 ViralBShah

I also haven't reviewed in detail, but I don't see anything crazy. @simonschoelly, you were reviewing this on the old repo. Can you take a look before we merge?

Nov 12 '21 00:11 jpfairbanks

Bump. Is this good to merge? Do we have enough tests to give us confidence in merging?

Jan 07 '22 05:01 ViralBShah

@saolof can you add a few more tests to this so the new codepaths are fully tested? It looks like this PR currently decreases test coverage. Other than that, this is looks like a really nice improvement.

Feb 01 '22 01:02 oscardssmith

Sure. I'll add tests for infer_nb_iterstate_type by creating a mock graph type for it to infer the node iterator type of.

Feb 02 '22 14:02 saolof

Bump, has anyone been able to review this in detail?

Mar 10 '22 08:03 gdalle

not yet.

Mar 10 '22 13:03 oscardssmith

Presumably fails because of a bug in Aqua: https://github.com/JuliaTesting/Aqua.jl/issues/139

Jun 16 '23 14:06 gdalle

I just fixed a similar quadratic hit in https://github.com/JuliaGraphs/Graphs.jl/pull/266. Rather than pushing neighbors on a separate stack (here, depending on number of neighbors), I did it by always pushing the neighbors directly on the dfs stack, and using a more refined coloring scheme than just visited / unvisited. Maybe it is worth doing the same here ?

Jul 05 '23 08:07 etiennedeg

I just fixed a similar quadratic hit in #266. Rather than pushing neighbors on a separate stack (here, depending on number of neighbors), I did it by always pushing the neighbors directly on the dfs stack, and using a more refined coloring scheme than just visited / unvisited. Maybe it is worth doing the same here ?

I don't push neighbours onto a stack at all. That would be quadratic memory. The algorithm only has a dfs stack for visited stack, and the iteration index for neighbours when iterating over nodes that have a lot of neighbours.

Jul 05 '23 09:07 saolof

For me, this is pushing on a stack ?

    if is_large_vertex(g, s)
        push!(largev_iterstate_stack, iterate(outneighbors(g, s)))
    end

This is not quadratic memory, this is linear in the number of edges. Edit: Oh, you just push the iteration state

Jul 05 '23 09:07 etiennedeg

For me, this is pushing on a stack ?
    if is_large_vertex(g, s)
        push!(largev_iterstate_stack, iterate(outneighbors(g, s)))
    end
This is not quadratic memory, this is linear in the number of edges.

I am pushing the current node on the stack, not the neighbours. In the Julia iteration protocol, iterate returns a tuple of the current iterated object and the iteration index.

In a worst case memory scenario like a tournament graph where all nodes have many neighbours (most of which are already visited) that pushes all nodes but not all edges, and in that case this PR gives a 4x improvement in runtime compared to the previous implementation.

Tarjan's algorithm is implicitly worst case linear memory in the number of nodes, and among other things this PR improves the constant factor compared to the previous implementation.

Jul 05 '23 09:07 saolof

Ok, I think I have the solution: We need to use Stateful iterators. This boils down to the same, but at least, this is part of the julia API, and it provides a cleaner implementation.

Apr 29 '24 13:04 etiennedeg