
My LCAO HSE calculations are slow outside of the SCF iterations.

Open pJahad opened this issue 7 months ago • 2 comments

Details

I'm running calculations using the input files from the ABACUS Test Report for HSE (#6294), but everything except the SCF iterations is very slow. For example, with Si, the LOCAL POTENTIAL step is extremely slow, as shown here:

 Initial plane wave basis and FFT box
 ---------------------------------------------------------
 DONE(0.260526     SEC) : INIT PLANEWAVE
 DONE(30.8853     SEC) : LOCAL POTENTIAL

Additionally, it takes a long time for files to be output after SCF convergence is achieved. I'm also including the TIME STATISTICS from the stdout:

----------------------------------------------------------------------
      CLASS_NAME             NAME         TIME/s  CALLS   AVG/s  PER/%  
----------------------------------------------------------------------
                     total              214.96 13        16.54  100.00 
 Driver              atomic_world       214.96 1         214.96 100.00 
 ESolver_KS_LCAO     before_all_runners 33.02  1         33.02  15.36  
 NOrbital_Lm         extra_uniform      6.49   1875      0.00   3.02   
 Mathzone_Add1       Uni_Deriv_Phi      6.33   1875      0.00   2.94   
 Exx_LRI             init               32.38  1         32.38  15.06  
 Matrix_Orbs21       init               6.05   2         3.03   2.82   
 Matrix_Orbs21       init_radial_table  18.78  2         9.39   8.74   
 Center2_Orb         cal_ST_Phi12_R     15.67  3439      0.00   7.29   
 LRI_CV              set_orbitals       19.27  1         19.27  8.96   
 Matrix_Orbs11       init_radial_table  5.77   1         5.77   2.69   
 Ions                opt_ions           181.84 1         181.84 84.59  
 ESolver_KS          runner             128.16 1         128.16 59.62  
 ESolver_KS_LCAO     before_scf         3.01   1         3.01   1.40   
 Exx_LRI             cal_exx_ions       2.95   1         2.95   1.37   
 Potential           cal_veff           2.22   35        0.06   1.03   
 PotXC               cal_veff           2.19   35        0.06   1.02   
 XC_Functional       v_xc               52.51  22        2.39   24.43  
 HSolverLCAO         solve              7.82   34        0.23   3.64   
 HamiltLCAO          updateHk           2.31   3536      0.00   1.07   
 HSolverLCAO         hamiltSolvePsiK    3.87   3536      0.00   1.80   
 DiagoElpa           elpa_solve         3.10   3536      0.00   1.44   
 RI_2D_Comm          split_m2D_ktoR     100.91 7         14.42  46.94  
 Exx_LRI             cal_exx_elec       12.43  7         1.78   5.78   
 XC_Functional_Libxc v_xc_libxc         2.16   27        0.08   1.01   
 ESolver_KS_LCAO     cal_force          53.68  1         53.68  24.97  
 Force_Stress_LCAO   getForceStress     53.68  1         53.68  24.97  
 Exx_LRI             cal_exx_force      17.47  1         17.47  8.13   
 Exx_LRI             cal_exx_stress     36.11  1         36.11  16.80  
----------------------------------------------------------------------
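To make the table above easier to read, here is a minimal Python sketch (not part of ABACUS) that parses rows in this layout and ranks them by TIME/s. The names `TIMING_TABLE`, `parse_rows`, and `top_hotspots` are hypothetical, and only a subset of the rows is pasted in. The timers are nested (for example, `cal_exx_stress` runs inside `cal_force`), so the percentages overlap and should not be summed.

```python
# Subset of rows copied from the TIME STATISTICS table above.
TIMING_TABLE = """\
                     total              214.96 13        16.54  100.00
 Driver              atomic_world       214.96 1         214.96 100.00
 ESolver_KS_LCAO     before_all_runners 33.02  1         33.02  15.36
 Exx_LRI             init               32.38  1         32.38  15.06
 XC_Functional       v_xc               52.51  22        2.39   24.43
 RI_2D_Comm          split_m2D_ktoR     100.91 7         14.42  46.94
 Exx_LRI             cal_exx_elec       12.43  7         1.78   5.78
 Exx_LRI             cal_exx_force      17.47  1         17.47  8.13
 Exx_LRI             cal_exx_stress     36.11  1         36.11  16.80
"""


def parse_rows(text):
    """Parse whitespace-separated timing rows into dictionaries.

    Assumes each row ends with TIME/s, CALLS, AVG/s, PER/% and that
    CLASS_NAME may be missing (as it is for the 'total' row).
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank lines and separators
        try:
            time_s, calls = float(parts[-4]), int(parts[-3])
            avg_s, percent = float(parts[-2]), float(parts[-1])
        except ValueError:
            continue  # skip the header line
        rows.append({
            "class": parts[-6] if len(parts) >= 6 else "",
            "name": parts[-5],
            "time_s": time_s,
            "calls": calls,
            "avg_s": avg_s,
            "percent": percent,
        })
    return rows


def top_hotspots(rows, n=5):
    """Return the n slowest timers, ignoring wrappers that span the whole run."""
    inner = [r for r in rows if r["percent"] < 100.0]
    return sorted(inner, key=lambda r: r["time_s"], reverse=True)[:n]


if __name__ == "__main__":
    for row in top_hotspots(parse_rows(TIMING_TABLE), n=3):
        print(f'{row["class"]:>16} {row["name"]:<20} {row["time_s"]:8.2f} s')
```

On this subset the slowest timer is `RI_2D_Comm split_m2D_ktoR` at 100.91 s, followed by `XC_Functional v_xc` and `Exx_LRI cal_exx_stress`.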

My execution environment is as follows:

  • ABACUS version: v3.9.0.7
  • Compilation: Dockerfile.intel with intel-oneapi-mkl set to 2025.1
  • Command: OMP_NUM_THREADS=16 mpirun -np 2 abacus
  • CPU: 32 cores of Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz

Is this kind of speed expected? Do you have any advice?

Task list for Issue attackers (only for developers)

  • [ ] Reproduce the performance issue on a similar system or environment.
  • [ ] Identify the specific section of the code causing the performance issue.
  • [ ] Investigate the issue and determine the root cause.
  • [ ] Research best practices and potential solutions for the identified performance issue.
  • [ ] Implement the chosen solution to address the performance issue.
  • [ ] Test the implemented solution to ensure it improves performance without introducing new issues.
  • [ ] Optimize the solution if necessary, considering trade-offs between performance and other factors (e.g., code complexity, readability, maintainability).
  • [ ] Review and incorporate any relevant feedback from users or developers.
  • [ ] Merge the improved solution into the main codebase and notify the issue reporter.

pJahad, Jun 18 '25 15:06

Hi @pJahad, the time cost of an HSE calculation is indeed much higher than that of a PBE calculation. The time consumption of the Si example you mentioned above is reasonable.

pxlxingliang, Jun 19 '25 01:06

@pxlxingliang Thank you for the comment.

> The time consumption of the Si example you mentioned above is reasonable.

That's a relief to hear.

However, my main concern is the time spent on the LOCAL POTENTIAL calculation. With PBE, this part is very fast. I was wondering what the difference is in the underlying processes that causes this.

Also, I have confirmed that this section speeds up as I increase the number of OpenMP threads. Would using a GPU accelerate it as well?

pJahad, Jun 23 '25 11:06