Bugs: the results of different parallel schemes vary greatly for LCAO calculations

Open WHUweiqingzhou opened this issue 1 year ago • 3 comments

Describe the bug

While testing issue #4058, I found that different parallel settings give totally different results for the same INPUT:

OMP_NUM_THREADS=1 mpirun -np 16 abacus | tee out.log
OMP_NUM_THREADS=2 mpirun -np 16 abacus | tee out.log
OMP_NUM_THREADS=2 mpirun -np 8 abacus | tee out.log
OMP_NUM_THREADS=4 mpirun -np 4 abacus | tee out.log

image

See more in the link.

Expected behavior

No response

To Reproduce

No response

Environment

No response

Additional Context

No response

Task list for Issue attackers (only for developers)

  • [ ] Verify the issue is not a duplicate.
  • [ ] Describe the bug.
  • [ ] Steps to reproduce.
  • [ ] Expected behavior.
  • [ ] Error message.
  • [ ] Environment details.
  • [ ] Additional context.
  • [ ] Assign a priority level (low, medium, high, urgent).
  • [ ] Assign the issue to a team member.
  • [ ] Label the issue with relevant tags.
  • [ ] Identify possible related issues.
  • [ ] Create a unit test or automated test to reproduce the bug (if applicable).
  • [ ] Fix the bug.
  • [ ] Test the fix.
  • [ ] Update documentation (if necessary).
  • [ ] Close the issue and inform the reporter (if applicable).

WHUweiqingzhou avatar May 08 '24 10:05 WHUweiqingzhou

I also ran tests with the GNU image, @dyzheng. The calculations are still unstable, though better than with the Intel image: image

For an unconverged INPUT, however, the calculations are even less stable: image

See more in link.

WHUweiqingzhou avatar May 09 '24 06:05 WHUweiqingzhou

As for different versions, see the link.

For v3.3.2, the results of STRU1 and STRU2 are different:

image

For v3.4.0, the results of STRU1 and STRU2 with different MPI settings are almost the same:

image

For v3.5.0, the results of STRU1 and STRU2 with different MPI settings differ:

image

For v3.6.0, the result is the same as for v3.5.0:

image

It looks like v3.4.0 behaves well, so something changed between v3.4.0 and v3.5.0.

WHUweiqingzhou avatar May 09 '24 10:05 WHUweiqingzhou

I picked several commits to test; see the link.

For 38766b4a, 2023/9/28: image

For 2ffa3d4e, 2023/10/9 (it looks like drho changes after this commit): image

For 77f178d0, 2023/10/26: image

For 57c903ae, 2023/11/03: image

For fd76546b, 2023/11/23: image

@Qianruipku, could you have a look?

WHUweiqingzhou avatar May 11 '24 09:05 WHUweiqingzhou

I tried commit a5abaea0, which is just before 2ffa3d4: image

I can confirm that this change happens at 2ffa3d4; see the link.

WHUweiqingzhou avatar May 13 '24 10:05 WHUweiqingzhou

I tried mixing_type = pulay and mixing_ndim = 21 at a5abaea and got the following result. It looks like the old pulay (now called broyden) is not stable in this case? image

link

WHUweiqingzhou avatar May 14 '24 07:05 WHUweiqingzhou

@Qianruipku I tried different settings, mixing_gg0=0 and scf_thr_type=1, at 2ffa3d4e and found that the result is the same as the Broyden result of a5abaea0; see the link. For a5abaea0:

START CHARGE      : atomic
 DONE(1.32678    SEC) : INIT SCF
 ITER   TMAG      AMAG      ETOT(eV)       EDIFF(eV)      DRHO       TIME(s)    
 GE1    3.13e+01  3.21e+01  -2.012837e+05  0.000000e+00   4.623e-02  2.619e+01  
 GE2    3.62e+01  3.73e+01  -2.012882e+05  -4.475781e+00  1.512e-02  2.198e+01  
 GE3    3.47e+01  3.63e+01  -2.012885e+05  -3.216326e-01  1.092e-02  2.198e+01  
 GE4    3.46e+01  3.70e+01  -2.012879e+05  5.839711e-01   1.606e-02  2.201e+01  
 GE5    3.44e+01  3.67e+01  -2.012886e+05  -6.875509e-01  2.846e-03  2.197e+01  
 GE6    3.58e+01  3.81e+01  -2.012850e+05  3.610166e+00   3.639e-02  2.200e+01  
 GE7    3.43e+01  3.68e+01  -2.012886e+05  -3.587337e+00  4.259e-03  2.197e+01  
 GE8    3.42e+01  3.68e+01  -2.012886e+05  -4.320307e-02  1.462e-03  2.201e+01  
 GE9    3.43e+01  3.68e+01  -2.012886e+05  5.644168e-02   4.966e-03  2.200e+01  
 GE10   3.42e+01  3.68e+01  -2.012886e+05  -4.094097e-02  3.658e-03  2.201e+01  
 GE11   3.41e+01  3.69e+01  -2.012887e+05  -1.928839e-02  1.539e-03  2.202e+01  
 GE12   3.41e+01  3.69e+01  -2.012887e+05  -2.543738e-03  1.574e-03  2.202e+01  
 GE13   3.41e+01  3.69e+01  -2.012887e+05  -6.717234e-03  4.667e-04  2.203e+01  
 GE14   3.41e+01  3.69e+01  -2.012887e+05  2.690787e-03   1.217e-03  2.203e+01  
 GE15   3.41e+01  3.69e+01  -2.012887e+05  -3.728993e-03  4.753e-04  2.204e+01  
 GE16   3.41e+01  3.69e+01  -2.012887e+05  -6.213324e-04  3.090e-04  2.205e+01  
 GE17   3.41e+01  3.69e+01  -2.012887e+05  1.019319e-03   6.257e-04  2.205e+01  
 GE18   3.41e+01  3.69e+01  -2.012887e+05  -1.727669e-03  3.054e-04  2.206e+01  
 GE19   3.41e+01  3.69e+01  -2.012887e+05  -2.660938e-04  1.692e-04  2.212e+01  
 GE20   3.41e+01  3.69e+01  -2.012887e+05  -4.791429e-05  1.023e-04  2.209e+01  
 GE21   3.41e+01  3.69e+01  -2.012887e+05  -2.845928e-05  1.066e-04  2.212e+01  
 GE22   3.41e+01  3.69e+01  -2.012887e+05  -4.190659e-06  7.938e-05  2.217e+01  
 GE23   3.41e+01  3.69e+01  -2.012887e+05  -1.090570e-05  5.489e-05  2.213e+01  
 GE24   3.41e+01  3.69e+01  -2.012887e+05  7.023658e-08   6.898e-05  2.212e+01  
 GE25   3.41e+01  3.68e+01  -2.012887e+05  -1.937866e-06  5.804e-05  2.213e+01  
 GE26   3.41e+01  3.68e+01  -2.012887e+05  -1.122038e-05  2.331e-05  2.214e+01  
 GE27   3.41e+01  3.68e+01  -2.012887e+05  -7.857934e-07  2.666e-05  2.216e+01  
 GE28   3.41e+01  3.68e+01  -2.012887e+05  2.993593e-07   2.932e-05  2.215e+01  
 GE29   3.41e+01  3.68e+01  -2.012887e+05  -2.109869e-06  1.792e-05  2.213e+01  
 GE30   3.41e+01  3.68e+01  -2.012887e+05  3.184652e-07   2.027e-05  2.217e+01  
 GE31   3.41e+01  3.68e+01  -2.012887e+05  1.038596e-05   6.569e-05  2.214e+01  
 GE32   3.41e+01  3.68e+01  -2.012887e+05  -1.034819e-05  2.186e-05  2.214e+01  
 GE33   3.41e+01  3.68e+01  -2.012887e+05  4.226644e-06   4.674e-05  2.217e+01  
 GE34   3.41e+01  3.68e+01  -2.012887e+05  -1.963234e-06  3.550e-05  2.217e+01  
 GE35   3.41e+01  3.68e+01  -2.012887e+05  -2.668124e-06  2.238e-05  2.217e+01  
 GE36   3.41e+01  3.68e+01  -2.012887e+05  -7.664895e-07  1.254e-05  2.217e+01  
 GE37   3.41e+01  3.68e+01  -2.012887e+05  2.685720e-07   1.899e-05  2.217e+01  
 GE38   3.41e+01  3.68e+01  -2.012887e+05  5.085099e-07   2.371e-05  2.215e+01  
 GE39   3.41e+01  3.68e+01  -2.012887e+05  -1.756162e-07  2.297e-05  2.217e+01  
 GE40   3.41e+01  3.68e+01  -2.012887e+05  -1.341152e-06  1.120e-05  2.217e+01  
 GE41   3.41e+01  3.68e+01  -2.012887e+05  4.999221e-09   7.053e-06  2.215e+01  
 GE42   3.41e+01  3.68e+01  -2.012887e+05  6.138648e-07   1.840e-05  2.215e+01  
 GE43   3.41e+01  3.68e+01  -2.012887e+05  -8.628854e-07  9.731e-06  2.217e+01  
 GE44   3.41e+01  3.68e+01  -2.012887e+05  -1.153533e-07  6.617e-06  2.218e+01  
 GE45   3.41e+01  3.68e+01  -2.012887e+05  -4.853204e-08  5.367e-06  2.218e+01  
 GE46   3.41e+01  3.68e+01  -2.012887e+05  -2.341219e-08  5.924e-06  2.220e+01  

For 2ffa3d4e:

ITER   TMAG      AMAG      ETOT(eV)       EDIFF(eV)      DRHO       TIME(s)    
 GE1    3.13e+01  3.21e+01  -2.012837e+05  0.000000e+00   2.314e+00  2.637e+01  
 GE2    3.62e+01  3.73e+01  -2.012882e+05  -4.475781e+00  2.857e-01  2.267e+01  
 GE3    3.47e+01  3.63e+01  -2.012885e+05  -3.216326e-01  8.493e-02  2.270e+01  
 GE4    3.46e+01  3.70e+01  -2.012879e+05  5.839711e-01   5.170e+00  2.271e+01  
 GE5    3.44e+01  3.67e+01  -2.012886e+05  -6.873429e-01  1.088e-02  2.270e+01  
 GE6    3.58e+01  3.81e+01  -2.012850e+05  3.617410e+00   3.629e+02  2.273e+01  
 GE7    3.43e+01  3.68e+01  -2.012886e+05  -3.594388e+00  1.362e+00  2.271e+01  
 GE8    3.42e+01  3.68e+01  -2.012886e+05  -4.305293e-02  2.881e-02  2.273e+01  
 GE9    3.42e+01  3.68e+01  -2.012886e+05  3.112351e-02   3.512e+00  2.277e+01  
 GE10   3.42e+01  3.68e+01  -2.012886e+05  -1.827247e-02  4.005e-01  2.271e+01  
 GE11   3.41e+01  3.69e+01  -2.012887e+05  -1.670464e-02  1.139e-01  2.261e+01  
 GE12   3.41e+01  3.69e+01  -2.012887e+05  -2.902616e-03  1.525e-01  2.242e+01  
 GE13   3.41e+01  3.69e+01  -2.012887e+05  -6.588764e-03  1.453e-03  2.240e+01  
 GE14   3.41e+01  3.69e+01  -2.012887e+05  3.300182e-04   1.390e-02  2.240e+01  
 GE15   3.41e+01  3.69e+01  -2.012887e+05  4.628865e-03   8.823e-02  2.238e+01  
 GE16   3.41e+01  3.69e+01  -2.012887e+05  -6.824514e-03  5.390e-04  2.241e+01  
 GE17   3.41e+01  3.69e+01  -2.012887e+05  4.037616e-04   3.347e-03  2.226e+01  
 GE18   3.41e+01  3.69e+01  -2.012887e+05  -1.141057e-03  1.147e-03  2.222e+01  
 GE19   3.41e+01  3.69e+01  -2.012887e+05  -2.858993e-04  6.974e-05  2.222e+01  
 GE20   3.41e+01  3.69e+01  -2.012887e+05  -4.719985e-05  2.939e-05  2.225e+01  
 GE21   3.41e+01  3.69e+01  -2.012887e+05  -2.902679e-05  4.334e-05  2.225e+01  
 GE22   3.41e+01  3.69e+01  -2.012887e+05  -3.342697e-06  3.658e-05  2.226e+01  
 GE23   3.41e+01  3.69e+01  -2.012887e+05  -1.117724e-05  1.266e-05  2.224e+01  
 GE24   3.41e+01  3.69e+01  -2.012887e+05  -1.517585e-07  4.617e-05  2.225e+01  
 GE25   3.41e+01  3.68e+01  -2.012887e+05  -6.274518e-07  6.574e-05  2.228e+01  
 GE26   3.41e+01  3.68e+01  -2.012887e+05  -1.256247e-05  6.464e-06  2.228e+01  
 GE27   3.41e+01  3.68e+01  -2.012887e+05  -1.080055e-06  7.928e-06  2.229e+01  
 GE28   3.41e+01  3.68e+01  -2.012887e+05  2.439966e-07   2.441e-05  2.229e+01  
 GE29   3.41e+01  3.68e+01  -2.012887e+05  -1.845282e-06  1.931e-05  2.228e+01  
 GE30   3.41e+01  3.68e+01  -2.012887e+05  3.163369e-07   1.136e-05  2.228e+01  
 GE31   3.41e+01  3.68e+01  -2.012887e+05  4.611435e-06   5.006e-04  2.226e+01  
 GE32   3.41e+01  3.68e+01  -2.012887e+05  -4.753171e-06  7.248e-05  2.231e+01  
 GE33   3.41e+01  3.68e+01  -2.012887e+05  3.359972e-06   8.001e-05  2.231e+01  
 GE34   3.41e+01  3.68e+01  -2.012887e+05  -2.829436e-06  4.059e-05  2.228e+01  
 GE35   3.41e+01  3.68e+01  -2.012887e+05  -6.033466e-07  2.800e-05  2.229e+01  
 GE36   3.41e+01  3.68e+01  -2.012887e+05  -1.081985e-06  3.135e-06  2.230e+01  
 GE37   3.41e+01  3.68e+01  -2.012887e+05  8.878815e-07   1.789e-05  2.227e+01  
 GE38   3.41e+01  3.68e+01  -2.012887e+05  6.261401e-08   2.357e-05  2.226e+01  
 GE39   3.41e+01  3.68e+01  -2.012887e+05  -5.403119e-07  1.372e-05  2.225e+01  
 GE40   3.41e+01  3.68e+01  -2.012887e+05  -8.631824e-07  2.357e-06  2.227e+01  
 GE41   3.41e+01  3.68e+01  -2.012887e+05  5.617541e-06   4.904e-07  2.251e+01
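
To compare two SCF tables like the ones above step by step, something along these lines might help. It is only a sketch: it assumes every iteration row starts with `GE` followed by the step number and has the seven columns shown in these logs, and the names `ScfStep`, `parse_scf` and `first_divergence` are hypothetical helpers, not part of ABACUS.

```python
import math
import re
from typing import List, NamedTuple, Optional


class ScfStep(NamedTuple):
    iteration: int
    etot: float   # ETOT(eV)
    ediff: float  # EDIFF(eV)
    drho: float   # DRHO


# Assumed row layout (taken from the logs in this thread):
#  GE5    3.44e+01  3.67e+01  -2.012886e+05  -6.875509e-01  2.846e-03  2.197e+01
ROW = re.compile(r"^\s*GE(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s*$")


def parse_scf(text: str) -> List[ScfStep]:
    """Collect ITER, ETOT, EDIFF and DRHO from every GE<n> row; skip other lines."""
    steps = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            steps.append(
                ScfStep(int(m.group(1)), float(m.group(4)),
                        float(m.group(5)), float(m.group(6)))
            )
    return steps


def first_divergence(a: List[ScfStep], b: List[ScfStep],
                     field: str = "ediff", rel_tol: float = 1e-6) -> Optional[int]:
    """Return the first iteration where `field` differs beyond rel_tol, else None.

    Only the iterations present in both runs are compared.
    """
    for sa, sb in zip(a, b):
        if not math.isclose(getattr(sa, field), getattr(sb, field), rel_tol=rel_tol):
            return sa.iteration
    return None
```

Calling `first_divergence(parse_scf(log_a), parse_scf(log_b), "ediff")` on the two tables above should report GE5 as the first step where EDIFF differs, whereas DRHO already differs at GE1. EDIFF is compared by default because ETOT is printed with only seven significant digits, so small differences do not show up in that column.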

WHUweiqingzhou avatar May 15 '24 05:05 WHUweiqingzhou

I've got two questions:

  1. It was shown in #2997 that, even if the parallelization scheme is the same, LCAO calculations may still be unstable for some systems. Are calculations in this PR stable from run to run? [We conjectured that #2997 might result from a nearly singular overlap matrix, but so far this is not confirmed and we do not have a solution in the near term.]
  2. If calculations in this PR are stable on their own, is it possible to narrow the problem down further to MPI or OpenMP (or both)?
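
One possible way to set up both checks is sketched here. It only generates command lines in the same form as in the issue description: each setting is repeated to test run-to-run stability, MPI ranks are varied with a single OpenMP thread, and OpenMP threads are varied with a single MPI rank. The function `build_commands`, the log-file naming and the rank/thread counts are assumptions for illustration, and a single-rank run may not be practical for this system.

```python
from itertools import product
from typing import List, Sequence


def build_commands(mpi_counts: Sequence[int], omp_counts: Sequence[int],
                   repeats: int = 2) -> List[str]:
    """Return one shell command per (MPI ranks, OpenMP threads, repeat)."""
    cmds = []
    for n_mpi, n_omp, rep in product(mpi_counts, omp_counts, range(1, repeats + 1)):
        # Hypothetical naming scheme: a separate log per run, so nothing is overwritten.
        log = f"out_np{n_mpi}_omp{n_omp}_run{rep}.log"
        cmds.append(f"OMP_NUM_THREADS={n_omp} mpirun -np {n_mpi} abacus | tee {log}")
    return cmds


# Vary MPI ranks with OpenMP fixed to one thread ...
mpi_only = build_commands([1, 2, 4, 8, 16], [1])
# ... then vary OpenMP threads with a single MPI rank.
omp_only = build_commands([1], [1, 2, 4, 8, 16])

for cmd in mpi_only + omp_only:
    print(cmd)
```

If the repeated runs of one setting already disagree, the problem is closer to #2997; if they agree but only one of the two sweeps shows variation, that would point to MPI or OpenMP respectively.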

jinzx10 avatar May 23 '24 08:05 jinzx10