[BUG] DPA-2 training is super slow because of "DEEPMD WARNING Data loading buffer is empty or nearly empty"
Bug summary
When running DPA-2 (dp_tf) training with DeePMD-kit v3.0.0a0, we always get the following warning:
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: trn: rmse = 9.27e+00, rmse_e = 5.18e-02, rmse_f = 3.20e-01, lr = 1.68e-04
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: val: rmse = 2.10e+01, rmse_e = 2.09e-01, rmse_f = 7.22e-01
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: total wall time = 867.78 s
[2024-07-05 11:04:32,527] DEEPMD WARNING Data loading buffer is empty or nearly empty. This may indicate a data loading bottleneck, and increasing the number of workers (--num-workers) may help.
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: trn: rmse = 8.23e+00, rmse_e = 1.09e-01, rmse_f = 2.84e-01, lr = 1.68e-04
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: val: rmse = 8.26e+00, rmse_e = 2.24e-01, rmse_f = 2.84e-01
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: total wall time = 948.59 s
This suggests that data loading takes up a large share of the training time, which makes my training very slow. Is there a way to fix this problem?
By the way, the keyword "stat_file": "./dpa2 was used.
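To illustrate what the warning is hinting at, here is a toy model of a loader feeding a trainer. It is only a sketch under stated assumptions, not DeePMD-kit's actual data pipeline: the function name `simulate`, the timings, and the idea that each worker needs a fixed time per batch are all made up for illustration.

```python
def simulate(num_workers: int, load_ms: int, step_ms: int, n_steps: int) -> tuple[int, int]:
    """Toy model: return (stalled_steps, total_ms) for n_steps training steps.

    Assumptions (hypothetical, for illustration only):
      * every worker needs load_ms to prepare one batch,
      * workers run fully in parallel and the buffer is unbounded,
      * one training step takes step_ms once its batch is available.
    """
    t = 0        # current time seen by the training loop, in ms
    stalls = 0   # steps that found the buffer empty and had to wait
    for k in range(n_steps):
        # Batch k is produced in "round" k // num_workers; a round lasts load_ms.
        ready = (k // num_workers + 1) * load_ms
        if ready > t:
            stalls += 1
            t = ready  # the trainer idles until the batch arrives
        t += step_ms
    return stalls, t


one_worker = simulate(num_workers=1, load_ms=400, step_ms=100, n_steps=10)
four_workers = simulate(num_workers=4, load_ms=400, step_ms=100, n_steps=10)
print("1 worker :", one_worker)
print("4 workers:", four_workers)
```

In this model the trainer waits on every step while loading is slower than training, and adding workers removes the stalls only once their combined throughput reaches the training rate. Whether that holds for a real run depends on where the loading time is actually spent.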
DeePMD-kit Version
V3.0.0a0
Backend and its version
deepmd-kit.3.0_cuda123/lib/python3.11
How did you download the software?
Offline packages
Input Files, Running Commands, Error Log, etc.
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: trn: rmse = 9.27e+00, rmse_e = 5.18e-02, rmse_f = 3.20e-01, lr = 1.68e-04
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: val: rmse = 2.10e+01, rmse_e = 2.09e-01, rmse_f = 7.22e-01
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: total wall time = 867.78 s
[2024-07-05 11:04:32,527] DEEPMD WARNING Data loading buffer is empty or nearly empty. This may indicate a data loading bottleneck, and increasing the number of workers (--num-workers) may help.
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: trn: rmse = 8.23e+00, rmse_e = 1.09e-01, rmse_f = 2.84e-01, lr = 1.68e-04
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: val: rmse = 8.26e+00, rmse_e = 2.24e-01, rmse_f = 2.84e-01
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: total wall time = 948.59 s
[2024-07-05 11:19:32,806] DEEPMD WARNING Data loading buffer is empty or nearly empty. This may indicate a data loading bottleneck, and increasing the number of workers (--num-workers) may help.
[2024-07-05 11:30:11,391] DEEPMD INFO batch 24000: trn: rmse = 1.21e+01, rmse_e = 1.30e-01, rmse_f = 4.21e-01, lr = 1.65e-04
[2024-07-05 11:30:11,392] DEEPMD INFO batch 24000: val: rmse = 9.12e+00, rmse_e = 1.42e-01, rmse_f = 3.18e-01
[2024-07-05 11:30:11,392] DEEPMD INFO batch 24000: total wall time = 863.67 s
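To put a number on "slow", the log above can be turned into seconds per batch. The sketch below is a small helper written for this issue, not part of DeePMD-kit; the regular expression and the names `parse` and `seconds_per_batch` are assumptions that match only the log format shown here.

```python
import re
from datetime import datetime

# The three "total wall time" lines copied from the log above.
LOG = """\
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: total wall time = 867.78 s
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: total wall time = 948.59 s
[2024-07-05 11:30:11,392] DEEPMD INFO batch 24000: total wall time = 863.67 s
"""

PATTERN = re.compile(
    r"\[(?P<ts>[^\]]+)\] DEEPMD INFO batch\s+(?P<batch>\d+): "
    r"total wall time = (?P<wall>[\d.]+) s"
)


def parse(log: str):
    """Return a list of (timestamp, batch, reported_wall_seconds)."""
    rows = []
    for m in PATTERN.finditer(log):
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S,%f")
        rows.append((ts, int(m["batch"]), float(m["wall"])))
    return rows


def seconds_per_batch(rows):
    """Compare timestamp gaps with the reported wall time for each interval."""
    out = []
    for (t0, b0, _), (t1, b1, wall) in zip(rows, rows[1:]):
        elapsed = (t1 - t0).total_seconds()
        out.append((b1, elapsed, wall, elapsed / (b1 - b0)))
    return out


for batch, elapsed, wall, per_batch in seconds_per_batch(parse(LOG)):
    print(f"batch {batch}: elapsed {elapsed:.2f} s, reported {wall:.2f} s, "
          f"{per_batch:.3f} s/batch")
```

For these lines the gap between timestamps agrees with the reported wall time to within rounding, which suggests the reported value covers the 2000 batches since the previous report, i.e. a bit under half a second per batch.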
{
  "_comment": "that's all",
  "model": {
    "type_map": ["H", "C", "N", "O"],
    "descriptor": {
      "type": "dpa2",
      "tebd_dim": 8,
      "repinit_rcut": 7.0,
      "repinit_rcut_smth": 6.0,
      "repinit_nsel": 100,
      "repformer_rcut": 4.0,
      "repformer_rcut_smth": 3.5,
      "repformer_nsel": 40,
      "repinit_neuron": [25, 50, 100],
      "repinit_axis_neuron": 12,
      "repinit_activation": "tanh",
      "repformer_nlayers": 12,
      "repformer_g1_dim": 128,
      "repformer_g2_dim": 32,
      "repformer_attn2_hidden": 32,
      "repformer_attn2_nhead": 4,
      "repformer_attn1_hidden": 128,
      "repformer_attn1_nhead": 4,
      "repformer_axis_dim": 4,
      "repformer_update_h2": false,
      "repformer_update_g1_has_conv": true,
      "repformer_update_g1_has_grrg": true,
      "repformer_update_g1_has_drrd": true,
      "repformer_update_g1_has_attn": true,
      "repformer_update_g2_has_g1g1": true,
      "repformer_update_g2_has_attn": true,
      "repformer_attn2_has_gate": true,
      "repformer_add_type_ebd_to_seq": false
    },
Steps to Reproduce
mpirun -np 4 dp --pt train --skip-neighbor-stat --mpi-log=master input.json
Further Information, Files, and Links
No response
I don't think MPI training has been supported by the PyTorch backend. Please read the documentation.
To clarify, MPI is supported by the PyTorch DDP, but PyTorch needs to be compiled with MPI. Also, the nccl backend is hard-coded here.
https://github.com/deepmodeling/deepmd-kit/blob/63e4a25e264ff204820ce8a12036c5ef44b89bdf/deepmd/pt/entrypoints/main.py#L110
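As a rough sketch of what "not hard-coded" could look like, the helper below picks a backend name from availability flags instead of always returning nccl. This is not the code at the link above; `pick_backend` and its `override` argument are hypothetical. In a real script the flags would presumably come from `torch.distributed.is_nccl_available()`, `is_mpi_available()` and `is_gloo_available()`, and the result would be passed to `torch.distributed.init_process_group(backend=...)`. PyTorch is deliberately not imported so the snippet runs anywhere.

```python
from typing import Optional


def pick_backend(
    nccl_available: bool,
    mpi_available: bool,
    gloo_available: bool = True,
    override: Optional[str] = None,
) -> str:
    """Choose a torch.distributed backend name (hypothetical helper).

    Preference order nccl > mpi > gloo is an assumption of this sketch.
    `override` lets the user force a backend, e.g. "mpi" when PyTorch
    was compiled with MPI support.
    """
    available = {
        "nccl": nccl_available,
        "mpi": mpi_available,
        "gloo": gloo_available,
    }
    if override is not None:
        if not available.get(override, False):
            raise ValueError(f"requested backend {override!r} is not available")
        return override
    for name in ("nccl", "mpi", "gloo"):
        if available[name]:
            return name
    raise RuntimeError("no distributed backend is available")


# Example: a PyTorch build with both NCCL and MPI, user asks for MPI.
print(pick_backend(nccl_available=True, mpi_available=True, override="mpi"))
```

Failing early on an unavailable backend keeps the error close to the configuration mistake rather than surfacing later during process-group initialization.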