
[BUG] DPA-2 training is super slow because of "DEEPMD WARNING Data loading buffer is empty or nearly empty"

Open · Manyi-Yang opened this issue 1 year ago • 2 comments

Bug summary

When running a DPA-2 (dp_tf) training calculation with DeePMD-kit v3.0.0a0, we always get the following warnings:

[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: trn: rmse = 9.27e+00, rmse_e = 5.18e-02, rmse_f = 3.20e-01, lr = 1.68e-04
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: val: rmse = 2.10e+01, rmse_e = 2.09e-01, rmse_f = 7.22e-01
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: total wall time = 867.78 s
[2024-07-05 11:04:32,527] DEEPMD WARNING Data loading buffer is empty or nearly empty. This may indicate a data loading bottleneck, and increasing the number of workers (--num-workers) may help.
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: trn: rmse = 8.23e+00, rmse_e = 1.09e-01, rmse_f = 2.84e-01, lr = 1.68e-04
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: val: rmse = 8.26e+00, rmse_e = 2.24e-01, rmse_f = 2.84e-01
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: total wall time = 948.59 s

This means that data loading takes up a lot of time during training, which makes the training very slow. I would like to know whether there is a solution to this problem.
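For context on what the warning is pointing at: the PyTorch backend feeds training frames to the GPU through background data-loading workers, and the buffer runs dry when those workers cannot read and preprocess frames as fast as the training loop consumes them. The sketch below is a minimal, generic PyTorch example of the knobs involved (worker count and prefetch depth); it is not deepmd-kit's actual loader, and MyFrameDataset is a made-up placeholder.

import torch
from torch.utils.data import DataLoader, Dataset

class MyFrameDataset(Dataset):
    """Hypothetical stand-in for a dataset that reads one training frame per index."""
    def __len__(self):
        return 1000
    def __getitem__(self, idx):
        # A real dataset would read coordinates/energies/forces from disk here,
        # which is the step that can starve the training loop.
        return torch.randn(10, 3), torch.randn(1)

if __name__ == "__main__":
    loader = DataLoader(
        MyFrameDataset(),
        batch_size=1,
        num_workers=4,      # more worker processes keep the prefetch buffer filled
        prefetch_factor=2,  # batches fetched ahead per worker (requires num_workers > 0)
        pin_memory=True,    # faster host-to-GPU copies
    )
    for coords, energy in loader:
        pass  # the training step would consume the batch here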

By the way, the keyword "stat_file": "./dpa2 was used.

DeePMD-kit Version

V3.0.0a0

Backend and its version

deepmd-kit.3.0_cuda123/lib/python3.11

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

Error log:

[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: trn: rmse = 9.27e+00, rmse_e = 5.18e-02, rmse_f = 3.20e-01, lr = 1.68e-04
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: val: rmse = 2.10e+01, rmse_e = 2.09e-01, rmse_f = 7.22e-01
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: total wall time = 867.78 s
[2024-07-05 11:04:32,527] DEEPMD WARNING Data loading buffer is empty or nearly empty. This may indicate a data loading bottleneck, and increasing the number of workers (--num-workers) may help.
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: trn: rmse = 8.23e+00, rmse_e = 1.09e-01, rmse_f = 2.84e-01, lr = 1.68e-04
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: val: rmse = 8.26e+00, rmse_e = 2.24e-01, rmse_f = 2.84e-01
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: total wall time = 948.59 s
[2024-07-05 11:19:32,806] DEEPMD WARNING Data loading buffer is empty or nearly empty. This may indicate a data loading bottleneck, and increasing the number of workers (--num-workers) may help.
[2024-07-05 11:30:11,391] DEEPMD INFO batch 24000: trn: rmse = 1.21e+01, rmse_e = 1.30e-01, rmse_f = 4.21e-01, lr = 1.65e-04
[2024-07-05 11:30:11,392] DEEPMD INFO batch 24000: val: rmse = 9.12e+00, rmse_e = 1.42e-01, rmse_f = 3.18e-01
[2024-07-05 11:30:11,392] DEEPMD INFO batch 24000: total wall time = 863.67 s

input.json (truncated):

{
    "_comment": "that's all",
    "model": {
        "type_map": [
            "H",
            "C",
            "N",
            "O"
        ],
        "descriptor": {
            "type": "dpa2",
            "tebd_dim": 8,
            "repinit_rcut": 7.0,
            "repinit_rcut_smth": 6.0,
            "repinit_nsel": 100,
            "repformer_rcut": 4.0,
            "repformer_rcut_smth": 3.5,
            "repformer_nsel": 40,
            "repinit_neuron": [
                25,
                50,
                100
            ],
            "repinit_axis_neuron": 12,
            "repinit_activation": "tanh",
            "repformer_nlayers": 12,
            "repformer_g1_dim": 128,
            "repformer_g2_dim": 32,
            "repformer_attn2_hidden": 32,
            "repformer_attn2_nhead": 4,
            "repformer_attn1_hidden": 128,
            "repformer_attn1_nhead": 4,
            "repformer_axis_dim": 4,
            "repformer_update_h2": false,
            "repformer_update_g1_has_conv": true,
            "repformer_update_g1_has_grrg": true,
            "repformer_update_g1_has_drrd": true,
            "repformer_update_g1_has_attn": true,
            "repformer_update_g2_has_g1g1": true,
            "repformer_update_g2_has_attn": true,
            "repformer_attn2_has_gate": true,
            "repformer_add_type_ebd_to_seq": false
        },

Steps to Reproduce

mpirun -np 4 dp --pt train --skip-neighbor-stat --mpi-log=master input.json

Further Information, Files, and Links

No response

Manyi-Yang avatar Jul 05 '24 09:07 Manyi-Yang

I don't think MPI training is supported by the PyTorch backend. Please read the documentation.

njzjz avatar Jul 05 '24 23:07 njzjz

I don't think MPI training is supported by the PyTorch backend. Please read the documentation.

To clarify, MPI is supported by PyTorch DDP, but PyTorch needs to be compiled with MPI support. Also, the nccl backend is hard-coded here:

https://github.com/deepmodeling/deepmd-kit/blob/63e4a25e264ff204820ce8a12036c5ef44b89bdf/deepmd/pt/entrypoints/main.py#L110
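As a small illustration of both points (the mpi backend only works if PyTorch was built with MPI, and the process-group backend at the linked line is currently fixed to nccl), here is a sketch using only standard torch.distributed calls. It is not the deepmd-kit code itself, and DP_DIST_BACKEND is a hypothetical variable name used only for this example.

import os
import torch.distributed as dist

# Check which distributed backends the installed PyTorch build actually supports.
print("MPI backend available: ", dist.is_mpi_available())   # False unless PyTorch was compiled with MPI
print("NCCL backend available:", dist.is_nccl_available())  # True for typical CUDA builds
print("Gloo backend available:", dist.is_gloo_available())  # CPU backend, usually available

# A hard-coded choice pins the process group to nccl regardless of the build:
#     dist.init_process_group(backend="nccl")
# A configurable alternative reads the backend from the environment instead
# (DP_DIST_BACKEND is a hypothetical name used only in this sketch):
backend = os.environ.get("DP_DIST_BACKEND", "nccl")
print("Would initialize the process group with backend:", backend)
# dist.init_process_group(backend=backend)  # requires MASTER_ADDR/MASTER_PORT/rank setup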

njzjz avatar Aug 21 '24 20:08 njzjz