AmgX not deterministic on large matrices even with determinism_flag=1

Open joconnor22 opened this issue 5 years ago • 9 comments

I have a large matrix (approximately 8 million rows) for which AmgX doesn't produce deterministic results. Running the exact same programme twice with the exact same input produces slight differences in the solution vector, on the order of 1e-14. Nevertheless, with determinism_flag=1 the solution should be entirely deterministic and give exactly the same results, right? So far I've only done detailed testing on a single GPU, but some other tests I did suggest this also occurs for multi-GPU runs.

It looks like the non-determinism occurs when constructing the multigrid hierarchy: when I print the grid stats, I can see slight differences in the number of rows and non-zero entries in some of the grid levels between runs.

I'm happy to provide a minimum working example and the matrix I'm using if you want me to. Just let me know.
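For reference, here is a minimal sketch of how two runs could be checked for bitwise reproducibility rather than agreement within a tolerance. The names `bit_pattern`, `compare_runs`, `run_a` and `run_b` are hypothetical, and the short lists stand in for solution vectors that would really be read back from the solver; none of this is part of the AmgX API.

```python
import struct


def bit_pattern(x: float) -> int:
    """Return the 64-bit IEEE 754 pattern of a Python float as an integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def compare_runs(a, b):
    """Return (bitwise_identical, max_abs_diff) for two solution vectors."""
    if len(a) != len(b):
        raise ValueError("solution vectors differ in length")
    identical = all(bit_pattern(x) == bit_pattern(y) for x, y in zip(a, b))
    max_diff = max((abs(x - y) for x, y in zip(a, b)), default=0.0)
    return identical, max_diff


# Hypothetical stand-ins for the solution vectors of two separate runs.
run_a = [1.0, 0.5, 0.1 + 0.2]
run_b = [1.0, 0.5, 0.3]

identical, max_diff = compare_runs(run_a, run_b)
print("bitwise identical:", identical)
print("max abs difference:", max_diff)
```

A tolerance-based check would treat these two vectors as equal, whereas the bit-pattern check flags any difference at all, which is the behaviour that matters when testing determinism.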

joconnor22 avatar Nov 01 '20 16:11 joconnor22

Hi @joconnor22 ,

Can you share the config you are using? Providing the matrix would help identify the issue too.

Thanks!

marsaev avatar Nov 03 '20 15:11 marsaev

Hi @marsaev, thanks for your reply.

The solver config I'm using is:

{
    "config_version": 2,
    "verbosity_level": 0,
    "determinism_flag": 1,
    "communicator": "MPI",
    "solver": {
        "solver": "GMRES",
        "print_solve_stats": 1,
        "obtain_timings": 0,
        "monitor_residual": 1,
        "convergence": "RELATIVE_INI_CORE",
        "tolerance": 1e-12,
        "max_iters": 100,
        "preconditioner": {
            "solver": "AMG",
            "algorithm": "CLASSICAL",
            "print_grid_stats": 1,
            "cycle": "V",
            "selector": "PMIS",
            "interpolator": "D2",
            "smoother": "BLOCK_JACOBI",
            "coarse_solver": "DENSE_LU_SOLVER",
            "dense_lu_num_rows": 4,
            "presweeps": 2,
            "postsweeps": 2,
            "max_iters": 1
        }
    }
}

What's the best way to provide the matrix? I can only reproduce this problem on large matrices (e.g. 8 million rows) so even after compression it's still too big to upload here.

joconnor22 avatar Nov 06 '20 10:11 joconnor22

Just to add some additional insight here: the determinism_flag applies to aggregators and matrix coloring (so it is specific to aggregation-based AMG rather than classical). Still, I think we should be observing run-to-run reproducibility.

Am I correct in understanding that you observe run-to-run variability in the level structure of the preconditioner? I definitely would not expect this to happen, so it is possibly a bug somewhere.

Out of interest, did you try the v2.1.x branch? We are trialing a significant number of optimisations and fixes in this development branch.

mattmartineau avatar Feb 01 '21 11:02 mattmartineau

Hi, yes with print_grid_stats=1 I get a slightly different output between runs (e.g. some levels have different numbers of rows/entries between runs). Actually, it was more like there were two/three possible scenarios and the output of each run would always be one of those two/three outcomes. This only seemed to occur when the matrix was relatively large (towards a million rows or so). For smaller matrices it seemed to always be reproducible between runs.

It's been a while now since I looked at this so I can't remember exactly but I'm pretty sure I did test the v2.1 branch as well.

joconnor22 avatar Feb 03 '21 10:02 joconnor22

@joconnor22, I'm looking into the system you uploaded right now. I will try it with the dev branch and let you know if it reproduces.

marsaev avatar Feb 03 '21 10:02 marsaev

Great, thanks.

joconnor22 avatar Feb 03 '21 10:02 joconnor22

@joconnor22 I found the reason for the non-determinism in your case: the ordering of atomic operations when a large number of them hit the same memory address, combined with floating-point addition, affected the weights for the classical selector. I will submit a deterministic version soon.
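To illustrate the general effect (this is a sketch in Python, not the AmgX code or the actual fix): floating-point addition is not associative, so accumulating the same values in a different arrival order, as can happen with atomic adds, may give a different sum. The helper `sum_in_order` and the sample values are hypothetical. `math.fsum` is used here only as one example of an order-independent reduction.

```python
import math


def sum_in_order(values, order):
    """Accumulate values[i] for i in order, like atomics arriving in that order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


# Hypothetical contributions to a single accumulated weight.
values = [1e16, 1.0, -1e16, 1.0]

# Two possible arrival orders of the same four contributions.
order_a = [0, 1, 2, 3]  # ((1e16 + 1) - 1e16) + 1
order_b = [0, 2, 1, 3]  # ((1e16 - 1e16) + 1) + 1

sum_a = sum_in_order(values, order_a)
sum_b = sum_in_order(values, order_b)
print("order A:", sum_a)
print("order B:", sum_b)
print("same result:", sum_a == sum_b)

# An exactly rounded sum does not depend on the order of its inputs.
exact_a = math.fsum(values[i] for i in order_a)
exact_b = math.fsum(values[i] for i in order_b)
print("fsum order-independent:", exact_a == exact_b)
```

If such a sum feeds a comparison, a difference in the last bits can change which branch is taken, which would be consistent with the small number of distinct hierarchies seen between runs.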

marsaev avatar Feb 13 '21 22:02 marsaev

OK great, thanks a lot for taking the time to look at it!

joconnor22 avatar Feb 15 '21 09:02 joconnor22

Tracking internally: AMGX-45

marsaev avatar Apr 07 '21 18:04 marsaev