dpgen2 [BUG] Using DPGEN2 with a DPA2 model and first-principles calculations (VASP) to label alloy data: iteration 0 is normal, but iteration 1 is abnormal

Bug summary

The workflow combines DPGEN2 with DPA2 and the 'fp' flag for sampling, following the tutorial. Starting from the pre-trained model plus alloy_domains, an initial model is trained and LAMMPS is used to generate trajectories. DPGEN2 then selects configurations for FP calculation, the FP calculation produces labeled samples, and a new model is retrained. Up to this point everything runs normally, without any exceptions. However, when this new model is used with LAMMPS again, the run fails. The problem appears in this round of the loop and is likely an interface issue between the newly generated model and LAMMPS: the temperature becomes abnormal during the NVT simulation, which leads to atom loss. Decreasing the temperature (from 1273 K to 873 K) still does not work.
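One way to pin down where the temperature starts to run away is to scan the thermo output in log.lammps. The following is only a minimal sketch, not part of DPGEN2 or LAMMPS: it assumes the thermo header line starts with "Step" and contains a "Temp" column, and that each thermo block ends with a "Loop time" line. The helper names `read_thermo` and `first_runaway`, and the threshold `factor`, are hypothetical.

```python
import math


def read_thermo(log_text):
    """Collect (step, temp) pairs from the thermo blocks of a LAMMPS log.

    Assumes the thermo header starts with "Step" and has a "Temp" column.
    """
    rows, header = [], None
    for line in log_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Step" and "Temp" in fields:
            header = fields          # start of a thermo block
            continue
        if header is None:
            continue                 # not inside a thermo block
        if fields[0] == "Loop":
            header = None            # "Loop time of ..." closes the block
            continue
        if len(fields) != len(header):
            continue                 # e.g. a WARNING line inside the block
        try:
            values = [float(x) for x in fields]
        except ValueError:
            continue                 # non-numeric line, skip it
        rows.append((int(values[0]), values[header.index("Temp")]))
    return rows


def first_runaway(rows, target, factor=3.0):
    """Return the first step whose temperature is non-finite or above
    factor * target, or None if no such step exists."""
    for step, temp in rows:
        if not math.isfinite(temp) or temp > factor * target:
            return step
    return None


# Made-up thermo block, for illustration only (not taken from the failing run).
sample = """\
   Step          Temp          PotEng
         0   1273.0   -500.0
       100   1310.5   -498.2
       200   9800.0   -120.0
Loop time of 1.0 on 1 procs
"""
print(first_runaway(read_thermo(sample), target=1273.0))
```

For a real run, the text of log.lammps would be passed to `read_thermo` in place of `sample`, with `target` set to the thermostat temperature.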

The error information:

ERROR:root:lmp failed
command was: lmp -var restart 0 -i in.lammps -log log.lammpsout msg: LAMMPS (2 Aug 2023 - Development - patch_2Aug2023-221-g759825bdc7)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task
Reading data file ...
  triclinic box = (0 0 0) to (9.8782598 10.781243 9.5195809) with tilt (-1.0142034 0.41050131 -0.56833045)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  108 atoms
  read_data CPU = 0.008 seconds
Traceback (most recent call last):
  File "/home/input_lbg-12166-11617325/tmp/inputs/artifacts/dflow_python_packages/opt/mamba/lib/python3.10/site-packages/dflow/python/utils.py", line 327, in try_to_execute
    output = op_obj.execute(input)
  File "/home/input_lbg-12166-11617325/tmp/inputs/artifacts/dflow_python_packages/opt/mamba/lib/python3.10/site-packages/dflow/python/op.py", line 136, in wrapper_exec
    op_out = func(self, op_in)
  File "/home/input_lbg-12166-11617325/tmp/inputs/artifacts/dflow_python_packages/opt/mamba/lib/python3.10/site-packages/dpgen2/op/run_lmp.py", line 192, in execute
    raise TransientError("lmp failed")
dflow.python.python_op_template.TransientError: lmp failed
ERROR:root:lmp failed

DP-GEN Version

DPGEN v0.1.dev278+g356b9e3

Platform, Python Version, Remote Platform, etc

Bohrium platform, Python 3.10.6

Input Files, Running Commands, Error Log, etc.

https://workflows.deepmodeling.com/workflows/argo/sampling-titaalcrfenico-hk4e5 Bohrium_output.zip (input.json)

Steps to Reproduce

https://workflows.deepmodeling.com/workflows/argo/sampling-titaalcrfenico-hk4e5

Further Information, Files, and Links

No response

Apr 06 '24 07:04 Jeremy1189

The tutorial is from https://nb.bohrium.dp.tech/detail/18475433825, in the section "DP-Gen based on a DPA-2 pretrained model".

Apr 06 '24 08:04 Jeremy1189

The submit command is dpgen2 submit input.json. All the input files in the directory are listed below; if you need any of them, please let me know.

drwxr-xr-x 4 root root 4.0K Apr 6 15:41 valid_predict/
drwxr-xr-x 4 root root 4.0K Apr 6 15:01 valid_data/
drwxr-xr-x 214 root root 4.0K Apr 5 18:56 train_predict/
lrwxrwxrwx 1 root root 70 Apr 5 16:06 valid -> /personal/dpa2_hea/version5_20240223_add_more_fcc/sampling/valid_data//
lrwxrwxrwx 1 root root 70 Apr 5 16:06 train -> /personal/dpa2_hea/version5_20240223_add_more_fcc/sampling/train_data//
lrwxrwxrwx 1 root root 15 Apr 5 16:01 teacher_model.pt -> model_300000.pt
-rw-r--r-- 1 root root 3.4K Apr 5 15:59 DPPTPredict.py
-rw-r--r-- 1 root root 530 Apr 5 15:59 MD_exp_ini_conf.py
drwxr-xr-x 15 root root 4.0K Apr 5 15:58 train_data/
-rw-r--r-- 1 root root 11K Mar 22 10:04 input.json
-rw-r--r-- 1 root root 5.5K Mar 21 22:04 train.json
-rw-r--r-- 1 root root 179 Mar 20 13:17 INCAR
drwxr-xr-x 6 root root 4.0K Mar 20 12:52 sampling_back_up/
drwxr-xr-x 6 root root 4.0K Mar 20 12:50 ../
lrwxrwxrwx 1 root root 67 Mar 20 10:44 pretrained_model.pt -> /personal/dpa2_hea/version5_20240223_add_more_fcc/sampling/model.pt
drwxr-xr-x 3 root root 4.0K Mar 20 09:46 init/
-rw-r--r-- 1 root root 770 Mar 20 09:45 gen_init.py
-rw-r--r-- 1 root root 116M Mar 20 09:38 model_300000.pt
-rw-r--r-- 1 root root 196M Mar 20 09:38 model.pt
lrwxrwxrwx 1 root root 22 Mar 15 16:32 PBE -> /personal/dpa2_hea/PBE/
-rw-r--r-- 1 root root 3.8K Feb 24 00:12 template.lammps

Apr 06 '24 08:04 Jeremy1189

https://github.com/deepmodeling/deepmd-kit/issues/3751 The issue is expected to be solved on the latest devel branch of deepmd-kit. You can test whether it works. If there are no further questions, I will close this issue.

Jul 01 '24 03:07 zjgemi

Feel free to reopen it if the bug is still there.

Jul 05 '24 06:07 zjgemi