AMGX AmgX not deterministic on large matrices even with determinism_flag=1

I have a large matrix (approximately 8 million rows) for which AmgX doesn't produce deterministic results. Running the exact same programme twice with the exact same input produces slight differences in the solution vector, on the order of 1e-14. Nevertheless, with determinism_flag=1 the solution should be entirely deterministic and give exactly the same results, right? So far I've only done detailed testing on a single GPU, but some other tests I did suggest this also occurs for multi-GPU runs.
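For anyone wanting to check this kind of report, a minimal sketch of a bitwise comparison between the solution vectors of two runs is given below. It is not part of AmgX; the function names and the tiny sample vectors are hypothetical, and the real vectors would be read back from the solver after each run.

```python
import struct


def bit_pattern(x: float) -> int:
    """Return the raw 64-bit IEEE 754 pattern of a double."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def first_bitwise_mismatch(run_a, run_b):
    """Return the index of the first entry whose bits differ, or None."""
    if len(run_a) != len(run_b):
        raise ValueError("solution vectors have different lengths")
    for i, (a, b) in enumerate(zip(run_a, run_b)):
        if bit_pattern(a) != bit_pattern(b):
            return i
    return None


# Hypothetical stand-ins for the solution vectors of two runs.
x_run1 = [1.0, 0.5, 0.1 + 0.2]
x_run2 = [1.0, 0.5, 0.3]

mismatch = first_bitwise_mismatch(x_run1, x_run2)
max_abs_diff = max(abs(a - b) for a, b in zip(x_run1, x_run2))
print("first mismatching index:", mismatch)
print("largest absolute difference:", max_abs_diff)
```

A tolerance-based comparison would hide differences of this size, which is why the sketch compares bit patterns instead.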

It looks like the non-determinism occurs when constructing the multi-grid hierarchy: when I print the grid stats, I can see slight differences between runs in the number of rows and non-zero entries in some of the grid levels.

I'm happy to provide a minimum working example and the matrix I'm using if you want me to. Just let me know.

Nov 01 '20 16:11 joconnor22

Hi @joconnor22,

Can you share the config you are using? Providing the matrix would help identify the issue too.

Thanks!

Nov 03 '20 15:11 marsaev

Hi @marsaev, thanks for your reply.

The solver config I'm using is:

{
    "config_version": 2,
    "verbosity_level": 0,
    "determinism_flag": 1,
    "communicator": "MPI",
    "solver": {
        "solver": "GMRES",
        "print_solve_stats": 1,
        "obtain_timings": 0,
        "monitor_residual": 1,
        "convergence": "RELATIVE_INI_CORE",
        "tolerance": 1e-12,
        "max_iters": 100,
        "preconditioner": {
            "solver": "AMG",
            "algorithm": "CLASSICAL",
            "print_grid_stats": 1,
            "cycle": "V",
            "selector": "PMIS",
            "interpolator": "D2",
            "smoother": "BLOCK_JACOBI",
            "coarse_solver": "DENSE_LU_SOLVER",
            "dense_lu_num_rows": 4,
            "presweeps": 2,
            "postsweeps": 2,
            "max_iters": 1
        }
    }
}

What's the best way to provide the matrix? I can only reproduce this problem on large matrices (e.g. 8 million rows) so even after compression it's still too big to upload here.

Nov 06 '20 10:11 joconnor22

Just to add some additional insight here. The determinism_flag applies to aggregators and matrix coloring (so specific to the aggregation based AMG rather than classical). Still, I think we should be observing run to run reproducibility.

Am I correct in understanding that you observe run to run variability in the level structure of the preconditioner? I definitely would not expect this to happen, so this is possibly a bug somewhere.

Out of interest, did you try the v2.1.x branch? We are trialing a significant number of optimisations and fixes in this development branch.

Feb 01 '21 11:02 mattmartineau

Hi, yes with print_grid_stats=1 I get a slightly different output between runs (e.g. some levels have different numbers of rows/entries between runs). Actually, it was more like there were two/three possible scenarios and the output of each run would always be one of those two/three outcomes. This only seemed to occur when the matrix was relatively large (towards a million rows or so). For smaller matrices it seemed to always be reproducible between runs.
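One way to quantify "one of two/three outcomes" is to capture the grid-stats text from several runs and count the distinct versions. The sketch below assumes the text has already been captured as strings; the sample lines are made up for illustration and do not reproduce AmgX's real output format.

```python
import hashlib
from collections import Counter


def count_distinct_outcomes(outputs):
    """Map a short digest of each captured output to how often it occurred."""
    digests = (
        hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        for text in outputs
    )
    return Counter(digests)


# Hypothetical captured grid-stats text from four runs (made-up layout).
captured_runs = [
    "level 1 rows 1000 nnz 5000",
    "level 1 rows 1002 nnz 5013",
    "level 1 rows 1000 nnz 5000",
    "level 1 rows 1002 nnz 5013",
]

outcomes = count_distinct_outcomes(captured_runs)
for digest, count in outcomes.items():
    print(digest, "seen", count, "time(s)")
```

A fully reproducible setup phase would give a single digest regardless of how many runs are collected.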

It's been a while now since I looked at this so I can't remember exactly, but I'm pretty sure I did test the v2.1 branch as well.

Feb 03 '21 10:02 joconnor22

@joconnor22, I'm looking into the system you uploaded right now. I will try it with the dev branch and let you know if it reproduces.

Feb 03 '21 10:02 marsaev

Great, thanks.

Feb 03 '21 10:02 joconnor22

@joconnor22 I found the reason for the non-determinism in your case: the ordering of atomics when a large number of them hit the same memory address, combined with floating-point addition, affected the weights for the classical selector. I will submit a deterministic version soon.
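The underlying effect is that floating-point addition is not associative, so the order in which contributions reach an accumulator can change the result. The sketch below illustrates this on the CPU in plain Python; it is not AmgX code, the values are chosen only to make the effect visible, and the actual fix is not shown in this thread.

```python
import math


def accumulate(values, order):
    """Add values into one accumulator in the given index order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


# Illustrative contributions to a single memory address.
values = [1e16, 1.0, -1e16, 1.0]

# Two arrival orders, standing in for two schedules of atomic adds.
forward = accumulate(values, [0, 1, 2, 3])    # the first 1.0 is absorbed by 1e16
reordered = accumulate(values, [0, 2, 1, 3])  # the large terms cancel first

print("forward order:  ", forward)
print("reordered:      ", reordered)
print("correctly rounded sum:", math.fsum(values))
```

When such a sum feeds a discrete decision, such as comparing weights during coarse-point selection, a difference in the last bits can flip the decision and change the resulting hierarchy. Summing in a fixed order, or using an order-independent reduction, removes that dependence on scheduling.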

Feb 13 '21 22:02 marsaev

OK great, thanks a lot for taking the time to look at it!

Feb 15 '21 09:02 joconnor22

Tracking internally: AMGX-45

Apr 07 '21 18:04 marsaev
