
Memory leak in elpa diagonalization

Open LiuXiaohui123321 opened this issue 2 years ago • 14 comments

Describe the bug

When using ABACUS for a structure relaxation calculation, I find that the amount of memory used gradually increases with the number of calculation steps. I used the top command to watch memory usage over time (screenshots: relax-01 at moment 1, relax-02 at moment 2), and I also used Grafana to plot memory usage over time (relax-mem).

To determine whether the memory leak comes from the SCF loop or from the relaxation itself, I ran another somewhat "extreme" test with

scf_thr 1e-20
scf_nmax 10000

and again used the top command and Grafana (screenshots: scf-01 at moment 1, scf-02 at moment 2, and the Grafana plot scf-mem).

It appears that the memory leak occurs at least in the SCF process; the relaxation process needs more tests.

Expected behavior

No response

To Reproduce

memLeak.tar.gz

  1. ABACUS: version 3.2.2
  2. Build: intel-2019.update5, intelmpi-2019.update5, mkl-2019.update5, gcc-9.2.0; cmake
  3. Run: mpirun -n 40 abacus

Environment

  • gcc-9.2.0
  • intel-2019.update5
  • intelmpi-2019.update5
  • mkl-2019.update5

Additional Context

No response

Task list for Issue attackers

  • [X] Verify the issue is not a duplicate.
  • [X] Describe the bug.
  • [X] Steps to reproduce.
  • [ ] Expected behavior.
  • [ ] Error message.
  • [ ] Environment details.
  • [ ] Additional context.
  • [ ] Assign a priority level (low, medium, high, urgent).
  • [ ] Assign the issue to a team member.
  • [ ] Label the issue with relevant tags.
  • [ ] Identify possible related issues.
  • [ ] Create a unit test or automated test to reproduce the bug (if applicable).
  • [ ] Fix the bug.
  • [ ] Test the fix.
  • [ ] Update documentation (if necessary).
  • [ ] Close the issue and inform the reporter (if applicable).

LiuXiaohui123321 avatar Sep 15 '23 10:09 LiuXiaohui123321

@LiuXiaohui123321 We have an issue about a possible SCF memory leak in #2935. Would you please help test that one?

hongriTianqi avatar Oct 18 '23 10:10 hongriTianqi

@LiuXiaohui123321 We have an issue about a possible SCF memory leak in #2935. Would you please help test that one?

@hongriTianqi I have tested the example in issue #2935, and have attached the results below that issue.

LiuXiaohui123321 avatar Nov 06 '23 03:11 LiuXiaohui123321

This issue has been solved thanks to insights from @dyzheng. When we felt there was nowhere left to look, @dyzheng realized it might be caused by an MKL inconsistency between libraries.

hongriTianqi avatar Nov 21 '23 01:11 hongriTianqi

I noticed that the memory leak bug was reported as fixed by PR #3472 in release version 3.5.1. I then re-tested the example here, but it looks like the bug is not completely resolved and the memory leak is still there.

The test environment used is as follows:

  1. elpa-2021.05.002/2021.11.002
  • module purge
  • module load intel/2019.update5 intelmpi/2019.update5 mkl/2019.update5 gcc/9.2.0
  • FC=mpiifort CC=mpiicc ./configure --prefix=... SCALAPACK_LDFLAGS="-L$MKLROOT/lib/intel64 -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_lp64 -lpthread -lm -Wl,-rpath,$MKLROOT/lib/intel64" SCALAPACK_FCFLAGS="-L$MKLROOT/lib/intel64 -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_lp64 -lpthread -lm -I$MKLROOT/include/intel64/lp64"
  2. cereal-1.3.0
  • tar zxvf cereal-1.3.0.tar.gz -C ./
  3. abacus-develop-3.5.1
  • module purge
  • module load cmake/3.19.0 intel/2019.update5 intelmpi/2019.update5 mkl/2019.update5 gcc/9.2.0
  • CC=icc CXX=icpc cmake -B build -DCereal_INCLUDE_DIR=.../cereal-1.3.0/include -DELPA_LIBRARY=.../lib/libelpa.so -DELPA_INCLUDE_DIR=.../include/elpa-2021.xx.002

LiuXiaohui123321 avatar Jan 16 '24 06:01 LiuXiaohui123321

Thanks for your update, we will retest the case with v3.5.1.

dyzheng avatar Jan 16 '24 06:01 dyzheng

Here are the input files: 2960_input-files.tar.gz

Memory usage is recorded at each call to DiagoElpa::diag() during the iteration:

==> DiagoElpa::diag 152.761 GB 24.008 s
==> DiagoElpa::diag 148.648582458 GB 52.2605142593 s
==> DiagoElpa::diag 148.210006714 GB 78.6172110289 s
==> DiagoElpa::diag 147.836853027 GB 104.926947951 s
==> DiagoElpa::diag 147.556583405 GB 131.186710715 s
==> DiagoElpa::diag 147.277137756 GB 157.465421528 s
==> DiagoElpa::diag 146.908161163 GB 183.724356905 s
==> DiagoElpa::diag 146.597167969 GB 210.000731617 s
==> DiagoElpa::diag 146.396087646 GB 236.282636881 s
==> DiagoElpa::diag 146.320915222 GB 262.553348139 s
==> DiagoElpa::diag 146.269115448 GB 288.850581214 s
==> DiagoElpa::diag 146.199905396 GB 315.123554945 s
==> DiagoElpa::diag 146.132167816 GB 341.371098965 s
==> DiagoElpa::diag 146.083370209 GB 367.664950594 s
==> DiagoElpa::diag 146.016197205 GB 393.920230776 s
...
==> DiagoElpa::diag 118.571239471 GB 11613.5364821 s
==> DiagoElpa::diag 118.489513397 GB 11639.7941203 s
==> DiagoElpa::diag 118.408836365 GB 11666.0673717 s
==> DiagoElpa::diag 118.355426788 GB 11692.3510467 s
==> DiagoElpa::diag 118.298965454 GB 11718.6152955 s
==> DiagoElpa::diag 118.220375061 GB 11744.8890342 s
==> DiagoElpa::diag 118.143379211 GB 11771.1769129 s
==> DiagoElpa::diag 118.080581665 GB 11797.4658343 s
==> DiagoElpa::diag 118.068721771 GB 11823.7530197 s
==> DiagoElpa::diag 118.022708893 GB 11850.0350929 s
==> DiagoElpa::diag 117.957267761 GB 11876.3323782 s
==> DiagoElpa::diag 117.887569427 GB 11902.6097547 s
==> DiagoElpa::diag 117.837528229 GB 11928.8735333 s
==> DiagoElpa::diag 117.771236420 GB 11955.1298046 s
==> DiagoElpa::diag 117.696727753 GB 11981.4391402 s
==> DiagoElpa::diag 117.612930298 GB 12007.7014688 s

And here is the output of the top command; the red boxes in the attached screenshot mark the processes whose memory keeps increasing.

LiuXiaohui123321 avatar Jan 16 '24 07:01 LiuXiaohui123321

This bug was probably introduced in release version 2.3.5, corresponding to https://github.com/deepmodeling/abacus-develop/pull/1213.

Currently it is traced to the following code, and more tests are needed: https://github.com/deepmodeling/abacus-develop/blob/3c3859639b980cfcb48bfc70182fa36c11b2ff4a/source/module_hsolver/genelpa/elpa_new_real.cpp#L16

LiuXiaohui123321 avatar Jan 16 '24 07:01 LiuXiaohui123321

@LiuXiaohui123321, Is the memory leak still there for ABACUS v3.5.2?

WHUweiqingzhou avatar Jan 30 '24 10:01 WHUweiqingzhou

@LiuXiaohui123321, Is the memory leak still there for ABACUS v3.5.2?

Hi Weiqing! I checked the release log of v3.5.2, and there are no memory-leak fixes in this version. I tested v3.5.1 before, and the memory leak was still there!

LiuXiaohui123321 avatar Feb 07 '24 03:02 LiuXiaohui123321

I will try to reuse the elpa_handle in DiagoElpa.

dyzheng avatar Feb 22 '24 07:02 dyzheng

Hi @LiuXiaohui123321, we recently fixed a memory problem in PR #3637, which has been shown to resolve issues #3634 and #3652. Could you try again with the latest version?

WHUweiqingzhou avatar Mar 01 '24 03:03 WHUweiqingzhou

Hi @LiuXiaohui123321, we recently fixed a memory problem in PR #3637, which has been shown to resolve issues #3634 and #3652. Could you try again with the latest version?

I tested this on two different system platforms using v3.5.4, and the memory leak is still there. The good news is that it only appears on one of the platforms, the one compiled with Intel 2019.

LiuXiaohui123321 avatar Mar 08 '24 07:03 LiuXiaohui123321

@LiuXiaohui123321 Thanks for your reply. Do you think this issue can be closed, or does it need more discussion?

dyzheng avatar Mar 08 '24 07:03 dyzheng

@LiuXiaohui123321 Thanks for your reply. Do you think this issue can be closed, or does it need more discussion?

Since the problem is still there in some cases, I think we can leave it open for now.

LiuXiaohui123321 avatar Mar 08 '24 07:03 LiuXiaohui123321

When I try to reproduce the issue using the latest Intel image, there is a segfault:

[ 396.130777] abacus[2368]: segfault at 7fff8d837608 ip 00007f22e5270f9a sp 00007fff8d837610 error 6 in libelpa_openmp.so.19.0.0[7f22e51b2000+25c000]

Command used: docker run -v `pwd`:/wd -w /wd registry.dp.tech/deepmodeling/abacus-intel abacus

caic99 avatar Aug 07 '24 07:08 caic99