[BUG] Using DPGEN2 to combine the DPA2 model with first-principles calculations (VASP) to label alloy data: iteration 0 is normal, but iteration 1 is abnormal
Bug summary
Following the tutorial, the workflow combines DPGEN2 with DPA2 and uses the 'fp' flag for sampling. Starting from the pre-trained model + alloy_domains, an initial model is trained, and lammps is then used to generate trajectories. DPGEN2 selects configurations for FP calculation, the FP calculation produces labeled samples, and a new model is retrained. Up to this point everything runs normally, without any exceptions. However, when this new model is used with lammps again, a problem appears. It likely occurs in this round of the loop because the newly generated model hits some interface issue with lammps, which causes an abnormal temperature during the NVT simulation and leads to atom loss. Lowering the temperature (from 1273 K to 873 K) still does not work.
The error information:
ERROR:root:lmp failed
command was: lmp -var restart 0 -i in.lammps -log log.lammps
out msg: LAMMPS (2 Aug 2023 - Development - patch_2Aug2023-221-g759825bdc7)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task
Reading data file ...
  triclinic box = (0 0 0) to (9.8782598 10.781243 9.5195809) with tilt (-1.0142034 0.41050131 -0.56833045)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  108 atoms
  read_data CPU = 0.008 seconds
Traceback (most recent call last):
  File "/home/input_lbg-12166-11617325/tmp/inputs/artifacts/dflow_python_packages/opt/mamba/lib/python3.10/site-packages/dflow/python/utils.py", line 327, in try_to_execute
    output = op_obj.execute(input)
  File "/home/input_lbg-12166-11617325/tmp/inputs/artifacts/dflow_python_packages/opt/mamba/lib/python3.10/site-packages/dflow/python/op.py", line 136, in wrapper_exec
    op_out = func(self, op_in)
  File "/home/input_lbg-12166-11617325/tmp/inputs/artifacts/dflow_python_packages/opt/mamba/lib/python3.10/site-packages/dpgen2/op/run_lmp.py", line 192, in execute
    raise TransientError("lmp failed")
dflow.python.python_op_template.TransientError: lmp failed
ERROR:root:lmp failed
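To help confirm the abnormal temperature described above, here is a minimal sketch that scans LAMMPS thermo output for steps where the temperature strays far from the NVT target. It assumes the log contains a thermo header line with `Step` and `Temp` columns and that each thermo block ends with a `Loop time` line. The function name `find_temp_excursions`, the `rel_tol` threshold, and the sample log text are all made up for illustration; they are not taken from this workflow's files.

```python
def find_temp_excursions(log_text, target_temp, rel_tol=0.5):
    """Return (step, temp) pairs whose temperature deviates from
    target_temp by more than rel_tol * target_temp (NaN is also flagged)."""
    excursions = []
    columns = None  # thermo column names while inside a thermo block
    for line in log_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # A thermo block starts with a header containing Step and Temp.
        if fields[0] == "Step" and "Temp" in fields:
            columns = fields
            continue
        # "Loop time of ..." marks the end of a run's thermo block.
        if fields[0] == "Loop":
            columns = None
            continue
        if columns is None or len(fields) != len(columns):
            continue
        try:
            values = [float(f) for f in fields]
        except ValueError:
            continue  # warning or other non-numeric line inside the block
        step = int(values[columns.index("Step")])
        temp = values[columns.index("Temp")]
        # Written as "not <=" so that NaN temperatures are reported too.
        if not abs(temp - target_temp) <= rel_tol * target_temp:
            excursions.append((step, temp))
    return excursions


# Hypothetical thermo output, invented for this example.
sample = """\
Step Temp PotEng
0 1273.0 -500.0
10 1301.5 -499.2
20 98234.7 -120.4
Loop time of 0.1 on 1 procs
"""

print(find_temp_excursions(sample, target_temp=1273.0))
```

Running a check like this on the log.lammps from iteration 0 and from iteration 1 would show whether the temperature blows up from the first steps, which would point at the model rather than at the MD settings.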
DP-GEN Version
DPGEN v0.1.dev278+g356b9e3
Platform, Python Version, Remote Platform, etc
Bohrium platform, Python 3.10.6
Input Files, Running Commands, Error Log, etc.
https://workflows.deepmodeling.com/workflows/argo/sampling-titaalcrfenico-hk4e5 Bohrium_output.zip (input.json)
Steps to Reproduce
https://workflows.deepmodeling.com/workflows/argo/sampling-titaalcrfenico-hk4e5
Further Information, Files, and Links
No response
The tutorial is https://nb.bohrium.dp.tech/detail/18475433825, in the section "DP-Gen based on a DPA-2 pretrained model".
The submit command is dpgen2 submit input.json. All input files in the directory are listed below; if you need any of them, please let me know.
drwxr-xr-x 4 root root 4.0K Apr 6 15:41 valid_predict/
drwxr-xr-x 4 root root 4.0K Apr 6 15:01 valid_data/
drwxr-xr-x 214 root root 4.0K Apr 5 18:56 train_predict/
lrwxrwxrwx 1 root root 70 Apr 5 16:06 valid -> /personal/dpa2_hea/version5_20240223_add_more_fcc/sampling/valid_data//
lrwxrwxrwx 1 root root 70 Apr 5 16:06 train -> /personal/dpa2_hea/version5_20240223_add_more_fcc/sampling/train_data//
lrwxrwxrwx 1 root root 15 Apr 5 16:01 teacher_model.pt -> model_300000.pt
-rw-r--r-- 1 root root 3.4K Apr 5 15:59 DPPTPredict.py
-rw-r--r-- 1 root root 530 Apr 5 15:59 MD_exp_ini_conf.py
drwxr-xr-x 15 root root 4.0K Apr 5 15:58 train_data/
-rw-r--r-- 1 root root 11K Mar 22 10:04 input.json
-rw-r--r-- 1 root root 5.5K Mar 21 22:04 train.json
-rw-r--r-- 1 root root 179 Mar 20 13:17 INCAR
drwxr-xr-x 6 root root 4.0K Mar 20 12:52 sampling_back_up/
drwxr-xr-x 6 root root 4.0K Mar 20 12:50 ../
lrwxrwxrwx 1 root root 67 Mar 20 10:44 pretrained_model.pt -> /personal/dpa2_hea/version5_20240223_add_more_fcc/sampling/model.pt
drwxr-xr-x 3 root root 4.0K Mar 20 09:46 init/
-rw-r--r-- 1 root root 770 Mar 20 09:45 gen_init.py
-rw-r--r-- 1 root root 116M Mar 20 09:38 model_300000.pt
-rw-r--r-- 1 root root 196M Mar 20 09:38 model.pt
lrwxrwxrwx 1 root root 22 Mar 15 16:32 PBE -> /personal/dpa2_hea/PBE/
-rw-r--r-- 1 root root 3.8K Feb 24 00:12 template.lammps
https://github.com/deepmodeling/deepmd-kit/issues/3751
The issue is expected to be fixed on the latest devel branch of deepmd-kit. You can test whether it works. If there are no further questions, I will close this issue.
Feel free to reopen it if the bug is still there.