
CP2K jobs slower with higher number of cores per worker

Open · svandenhaute opened this issue 1 year ago · 9 comments

Discussed in https://github.com/molmod/psiflow/discussions/26

Originally posted by b-mazur on May 17, 2024: I'm trying to reproduce the mof_phase_transition.py example and I'm facing an issue where, with an increasing number of cores per worker, my calculations get prohibitively slow. In all cases, max_walltime: 20 results in AssertionError: atomic energy calculation of O failed because none of the CP2K tasks for oxygen complete within 20 minutes.

I played a bit with different numbers of cores per worker; here are the SCF steps reached within 20 minutes for the oxygen task with multiplicity 5:

| cores per worker | SCF steps |
| --- | --- |
| 1 | 33 |
| 2 | 31 |
| 4 | 20 |
| 16 | 3 |

Finally, I was able to finish this part by increasing max_walltime to 180 minutes and using only 1 core per worker, but this will create another issue when ReferenceEvaluation is used for the whole MOF in the next steps.

I've never used CP2K before, but I feel that 180 minutes is far too long for a single-point calculation of a single atom. I also observe surprisingly low CPU utilization of the slurm tasks, at levels below 10%. I checked the timings in the CP2K output, but the MPI time doesn't seem to be that large (however, as I said, I have no experience, so maybe I'm misreading something). Here is an example:

 -------------------------------------------------------------------------------
 -                                                                             -
 -                                T I M I N G                                  -
 -                                                                             -
 -------------------------------------------------------------------------------
 SUBROUTINE                       CALLS  ASD         SELF TIME        TOTAL TIME
                                MAXIMUM       AVERAGE  MAXIMUM  AVERAGE  MAXIMUM
 CP2K                                 1  1.0    12.43    12.51 11438.24 11438.32
 qs_forces                            1  2.0     0.00     0.00 11307.17 11307.28
 qs_energies                          1  3.0     0.00     0.00 11272.36 11272.43
 scf_env_do_scf                       1  4.0     0.00     0.00 11209.10 11209.25
 scf_env_do_scf_inner_loop          125  5.0     0.00     0.01 10924.29 10924.59
 qs_scf_new_mos                     125  6.0     0.00     0.00  7315.99  7321.60
 qs_scf_loop_do_ot                  125  7.0     0.00     0.00  7315.99  7321.60
 ot_scf_mini                        125  8.0     0.00     0.00  6917.97  6923.00
 dbcsr_multiply_generic            3738 10.8     0.14     0.15  5700.36  5708.79
 ot_mini                            125  9.0     0.00     0.00  3021.17  3021.32
 mp_sum_l                         18209 11.7  2898.88  2915.89  2898.88  2915.89
 qs_ot_get_p                        256  9.0     0.00     0.00  2869.85  2871.50
 qs_ot_get_derivative               126 10.0     0.00     0.00  2737.75  2738.66
 rs_pw_transfer                    1905 10.0     0.02     0.02  2014.01  2029.50
 qs_ks_update_qs_env                128  6.0     0.00     0.00  1921.34  1922.16
 qs_ot_p2m_diag                     145 10.0     0.00     0.00  1907.22  1918.97
 rebuild_ks_matrix                  126  7.0     0.00     0.00  1867.10  1869.47
 qs_ks_build_kohn_sham_matrix       126  8.0     0.01     0.01  1867.10  1869.47
 qs_rho_update_rho_low              126  6.0     0.00     0.00  1717.53  1723.18
 calculate_rho_elec                 252  7.0     2.66    19.36  1717.53  1723.18
 density_rs2pw                      252  8.0     0.01     0.01  1667.59  1684.20
 cp_dbcsr_syevd                     145 11.0     0.01     0.01  1442.29  1459.78
 pw_transfer                       3737 10.9     0.13     0.17  1190.61  1208.75
 fft_wrap_pw1pw2                   3485 11.9     0.02     0.02  1190.36  1208.51
 fft3d_ps                          3485 13.9    69.25    75.81  1183.31  1202.91
 cp_fm_syevd                        145 12.0     0.00     0.00  1164.55  1177.19
 mp_alltoall_z22v                  3485 15.9  1095.45  1118.30  1095.45  1118.30
 qs_ot_get_derivative_diag           69 11.0     0.00     0.00  1104.83  1105.37
 mp_sum_b                          6644 12.1  1081.35  1095.09  1081.35  1095.09
 mp_waitall_1                    154259 15.1  1026.71  1085.45  1026.71  1085.45
 multiply_cannon                   3738 11.8     0.18     0.21  1069.89  1079.22
 fft_wrap_pw1pw2_500               1965 13.7     1.43     2.06  1032.01  1051.67
 qs_ot_get_derivative_taylor         57 11.0     0.00     0.00   978.10   978.50
 mp_waitany                        4620 12.0   869.01   955.24   869.01   955.24
 qs_vxc_create                      126  9.0     0.00     0.00   879.87   886.00
 qs_ot_get_orbitals                 250  9.0     0.00     0.00   799.42   800.84
 sum_up_and_integrate                64  9.0     0.08     0.08   790.31   799.75
 integrate_v_rspace                 128 10.0     0.00     0.00   790.23   799.67
 potential_pw2rs                    128 11.0     0.01     0.01   787.91   789.91
 make_m2s                          7476 11.8     0.07     0.07   704.01   711.12
 make_images                       7476 12.8     0.12     0.13   703.73   710.85
 make_images_sizes                 7476 13.8     0.01     0.01   703.28   710.42
 mp_alltoall_i44                   7476 14.8   703.27   710.41   703.27   710.41
 rs_pw_transfer_RS2PW_500           254 10.0     0.57     0.64   676.24   691.60
 xc_pw_derive                      1140 12.0     0.01     0.01   654.68   664.73
 mp_sendrecv_dv                    7056 11.0   659.52   660.69   659.52   660.69
 xc_rho_set_and_dset_create         126 11.0     2.11    10.54   491.93   604.77
 cp_fm_redistribute_start           145 13.0   443.54   480.41   587.12   600.87
 x_to_yz                           1712 15.9     1.63     1.75   590.30   599.08
 cp_fm_redistribute_end             145 13.0   417.14   571.08   430.35   590.60
 xc_vxc_pw_create                    64 10.0     1.70     8.42   581.92   584.35
 mp_sum_d                          3471 10.2   485.51   568.31   485.51   568.31
 multiply_cannon_loop              3738 12.8     0.07     0.08   543.02   555.38
 yz_to_x                           1773 14.1    16.93    19.20   523.71   539.79
 mp_allgather_i34                  3738 12.8   526.54   538.30   526.54   538.30
 multiply_cannon_metrocomm3       14952 13.8     0.03     0.04   487.40   508.68
 rs_pw_transfer_RS2PW_170           252 10.0     0.27     0.31   424.39   428.13
 calculate_dm_sparse                252  8.0     0.00     0.00   401.52   402.48
 rs_pw_transfer_PW2RS_500           131 12.9     0.27     0.29   336.32   337.31
 qs_ot_p2m_taylor                   111  9.9     0.00     0.00   320.88   328.70
 xc_pw_divergence                   128 11.0     0.00     0.00   315.44   324.80
 dbcsr_complete_redistribute        394 12.1     0.01     0.02   299.29   314.94
 copy_dbcsr_to_fm                   198 11.1     0.00     0.00   294.59   303.78
 xc_exc_calc                         62 10.0     0.26     0.77   297.95   301.64
 cp_fm_syevd_base                   145 13.0   147.08   300.70   147.08   300.70
 init_scf_loop                        3  5.0     0.00     0.00   281.84   282.02
 ot_new_cg_direction                 63 10.0     0.00     0.00   264.48   265.62
 mp_sum_dv                         2372 13.6   164.02   262.33   164.02   262.33
 arnoldi_normal_ev                  262 10.9     0.00     0.00   244.27   260.42
 arnoldi_extremal                   256 10.0     0.00     0.00   236.76   252.71
 -------------------------------------------------------------------------------

I'm using psiflow 3.0.4 and container oras://ghcr.io/molmod/psiflow:3.0.4_python3.10_cuda.

Any idea what I could check to find where the problem is? Also, wouldn't it be better to tabulate the atomic energies for all elements in the psiflow source files? Thanks in advance for any help!

svandenhaute · May 23 '24 09:05

Hi @b-mazur ,

The energy of the isolated atoms depends on the specific pseudopotential used as well as on the functional, so tabulating those values would be quite an amount of work. In normal scenarios, these atomic energy calculations finish quite quickly (a few seconds to a few minutes), so it's usually easier to do them on the fly.

This is a bug that I've encountered once before, on a very specific cluster here in Belgium, and I haven't quite figured out what causes it. Heuristically, I've found that adding or removing a few MPI flags brings CP2K performance back to normal, but I don't quite understand why that is the case, given that everything is executed within a container.

What are the host OS and the host container runtime (singularity/apptainer version)? Did you modify the default MPI command in the .yaml?

svandenhaute · May 23 '24 09:05

It's one of the things we're fixing in the next release. For CP2K in particular, it's currently still necessary to put cpu_affinity: none in the .yaml. Perhaps that could fix your problem?
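For reference, this is roughly what that looks like in the reference section of the config. This is a minimal sketch only: the surrounding keys (cores_per_worker, max_walltime) simply mirror the values discussed in this thread, and the exact section layout may differ between psiflow versions.

```yaml
# sketch of the relevant part of the psiflow .yaml config; key names as used in this thread
ReferenceEvaluation:
  cores_per_worker: 4      # the parameter being varied in the table above
  max_walltime: 20         # minutes, as in the original report
  cpu_affinity: none       # workaround for the CP2K slowdown discussed here
```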

svandenhaute · May 23 '24 09:05

Hi @svandenhaute, apologies for the long silence.

I'm still facing this problem. I've tried to calculate a single point with the cp2k container (oras://ghcr.io/molmod/cp2k:2023.2) and the calculation finished in ~1 min (so the good news is that you fixed it in the new release). I've already tried different options for the mpi_command parameter, but even with the most basic mpi_command: 'mpirun -np {}' the calculations take orders of magnitude longer. I'm also using cpu_affinity: none. Do you remember which MPI flags helped in your case?

My host OS is

NAME="AlmaLinux"
VERSION="8.8 (Sapphire Caracal)"
ID="almalinux"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="AlmaLinux 8.8 (Sapphire Caracal)"
ANSI_COLOR="0;34"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:almalinux:almalinux:8::baseos"
HOME_URL="https://almalinux.org/"
DOCUMENTATION_URL="https://wiki.almalinux.org/"
BUG_REPORT_URL="https://bugs.almalinux.org/"

ALMALINUX_MANTISBT_PROJECT="AlmaLinux-8"
ALMALINUX_MANTISBT_PROJECT_VERSION="8.8"
REDHAT_SUPPORT_PRODUCT="AlmaLinux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.8"

and apptainer version 1.2.5-1.el8.

I was also thinking of moving to psiflow 3.0.4 since these problems do not occur, but I am currently interested in incremental learning to create an MLP for the phase transition in a MOF. I see that learning.py has changed significantly, hence my questions: is incremental learning still possible, and do you plan a tutorial similar to mof_phase_transition.py in the near future? If not, do you have any tips on how to quickly modify mof_phase_transition.py to make it work with psiflow 3.0.4?

bartoszmazurwro · Jul 02 '24 10:07

I was also wondering about using the new CP2K container in psiflow 3.0.3. Is this possible?

bartoszmazurwro · Jul 02 '24 11:07

I honestly don't know. If you have already tried both mpirun -np X and mpirun -np X -bind-to core -rmk user -launcher fork, then I wouldn't know what else to suggest off-hand. The MPI in the container is MPICH, so you could check its manual to see if there are additional flags to try. What about -map-by core?
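Spelled out, the variants to try in the .yaml would look something like the sketch below, one at a time. Note that the {} placeholder is presumably filled in with the number of cores per worker, and whether a given flag is accepted depends on the MPICH build inside the container.

```yaml
# sketch only; uncomment one mpi_command at a time and re-run the atomic energy calculation
ReferenceEvaluation:
  cpu_affinity: none
  mpi_command: 'mpirun -np {}'                                           # most basic form
  # mpi_command: 'mpirun -np {} -bind-to core -rmk user -launcher fork'  # binding/launcher variant
  # mpi_command: 'mpirun -np {} -map-by core'                            # mapping variant
```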

> I was also thinking of moving to psiflow 3.0.4 since these problems do not occur, but I am currently interested in incremental learning to create an MLP for the phase transition in a MOF. I see that learning.py has changed significantly, hence my questions: is incremental learning still possible, and do you plan a tutorial similar to mof_phase_transition.py in the near future?

Yes, we are actually in the final stages here. The tentative timeline is to create a new release (including working examples of the incremental learning scheme) by this Sunday.

> I was also wondering about using the new CP2K container in psiflow 3.0.3. Is this possible?

No, they are not compatible. The new CP2K container is built with OpenMPI instead of MPICH, and it also does not contain psiflow or its dependencies (which are required for compatibility with 3.x).

If possible, I'd strongly suggest waiting until the new release is out. Aside from this one, it should fix a bunch of other issues!

svandenhaute · Jul 02 '24 11:07

Great to hear! I'll wait for the next release then. Thanks a lot for your help and quick reply.

bartoszmazurwro · Jul 02 '24 13:07

@b-mazur the first release candidate for v4.0.0 is out, in case you want to try again.

svandenhaute · Jul 30 '24 10:07

At first glance, I can't find any mention of incremental learning in the examples or documentation. Does this mean that it is not yet available, or can I do this using the active learning of the Learning class instead?

bartoszmazurwro · Jul 31 '24 16:07

Exactly, the active_learning method on the new Learning class can be used to recreate the incremental learning scheme.

What was previously pretraining (i.e. applying random perturbations and training on those) is now much improved by using one of the MACE foundation models in a passive_learning run, as in the water online learning example.

To create walkers with metadynamics, check out the proton jump example.

svandenhaute · Jul 31 '24 21:07