deepmd-kit
[BUG] lmp raises "assert mapping is not None" with dpa2 model
Bug summary
I trained a dpa2 model with deepmd-kit-3.0.0a0/examples/water/dpa2/input_torch.json on the training data in deepmd-kit-3.0.0a0/examples/water/data, froze it, and used the resulting frozen_model.pth to run DPMD with the input files in deepmd-kit-3.0.0a0/examples/water/lmp. The example files were used as they are, apart from the necessary changes to the running steps.
All the data and input files needed to reproduce the problem are provided in water_test_inputs.zip
The LAMMPS error (water_test_inputs/lmp/slurm-9441.out) is:
OMP: Info #172: KMP_AFFINITY: OS proc 0 maps to socket 0 core 0 thread 0
OMP: Info #254: KMP_AFFINITY: pid 13722 tid 13722 thread 0 bound to OS proc set 0
4046
Exception: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/__torch__/deepmd/pt/model/model/ener_model.py", line 59, in forward_lower
aparam: Optional[Tensor]=None,
do_atomic_virial: bool=False) -> Dict[str, Tensor]:
model_ret = (self).forward_common_lower(extended_coord, extended_atype, nlist, mapping, fparam, aparam, do_atomic_virial, )
~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
model_predict = annotate(Dict[str, Tensor], {})
torch._set_item(model_predict, "atom_energy", model_ret["energy"])
File "code/__torch__/deepmd/pt/model/model/ener_model.py", line 200, in forward_common_lower
_31 = (self).input_type_cast(extended_coord0, None, fparam, aparam, )
cc_ext, _32, fp, ap, input_prec, = _31
atomic_ret = (self).forward_common_atomic(cc_ext, extended_atype, nlist0, mapping, fp, ap, )
~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
model_predict = _29(atomic_ret, (self).atomic_output_def(), cc_ext, do_atomic_virial, )
model_predict1 = (self).output_type_cast(model_predict, input_prec, )
File "code/__torch__/deepmd/pt/model/model/ener_model.py", line 264, in forward_common_atomic
fparam: Optional[Tensor]=None,
aparam: Optional[Tensor]=None) -> Dict[str, Tensor]:
ret_dict = (self).forward_atomic(extended_coord, extended_atype, nlist, mapping, fparam, aparam, )
~~~~~~~~~~~~~~~~~~~~ <--- HERE
return ret_dict
def forward_atomic(self: __torch__.deepmd.pt.model.model.ener_model.EnergyModel,
File "code/__torch__/deepmd/pt/model/model/ener_model.py", line 284, in forward_atomic
pass
descriptor = self.descriptor
_43 = (descriptor).forward(extended_coord, extended_atype, nlist, mapping, )
~~~~~~~~~~~~~~~~~~~ <--- HERE
descriptor0, rot_mat, g2, h2, sw, = _43
fitting_net = self.fitting_net
File "code/__torch__/deepmd/pt/model/descriptor/dpa2.py", line 54, in forward
mapping0 = unchecked_cast(Tensor, mapping)
else:
ops.prim.RaiseException("AssertionError: ")
~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
mapping0 = _2
_15 = torch.view(mapping0, [nframes, nall])
Traceback of TorchScript, original code (most recent call last):
File "/data/home/changxiaoju/software/deepmd-kit-3.0.0a0-cuda123/lib/python3.11/site-packages/deepmd/pt/model/model/ener_model.py", line 73, in forward_lower
do_atomic_virial: bool = False,
):
model_ret = self.forward_common_lower(
~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
extended_coord,
extended_atype,
File "/data/home/changxiaoju/software/deepmd-kit-3.0.0a0-cuda123/lib/python3.11/site-packages/deepmd/pt/model/model/make_model.py", line 206, in forward_common_lower
)
del extended_coord, fparam, aparam
atomic_ret = self.forward_common_atomic(
~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
cc_ext,
extended_atype,
File "/data/home/changxiaoju/software/deepmd-kit-3.0.0a0-cuda123/lib/python3.11/site-packages/deepmd/pt/model/atomic_model/base_atomic_model.py", line 103, in forward_common_atomic
nlist = torch.where(pair_mask == 1, nlist, -1)
ret_dict = self.forward_atomic(
~~~~~~~~~~~~~~~~~~~ <--- HERE
extended_coord,
extended_atype,
File "/data/home/changxiaoju/software/deepmd-kit-3.0.0a0-cuda123/lib/python3.11/site-packages/deepmd/pt/model/atomic_model/dp_atomic_model.py", line 164, in forward_atomic
if self.do_grad_r() or self.do_grad_c():
extended_coord.requires_grad_(True)
descriptor, rot_mat, g2, h2, sw = self.descriptor(
~~~~~~~~~~~~~~~ <--- HERE
extended_coord,
extended_atype,
File "/data/home/changxiaoju/software/deepmd-kit-3.0.0a0-cuda123/lib/python3.11/site-packages/deepmd/pt/model/descriptor/dpa2.py", line 443, in forward
g1 = self.g1_shape_tranform(g1)
# mapping g1
assert mapping is not None
~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
mapping_ext = (
mapping.view(nframes, nall).unsqueeze(-1).expand(-1, -1, g1.shape[-1])
RuntimeError: AssertionError:
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
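To make the failing step easier to discuss, here is a minimal pure-Python sketch of what the guarded code in dpa2.py appears to do. The traceback only shows the assert and the construction of mapping_ext; the gather that uses it is my assumption, and the function name gather_extended and the toy data are hypothetical, not part of deepmd-kit. The point is only that the descriptor cannot copy features from local atoms to ghost atoms when mapping arrives as None, which is what happens in the LAMMPS run above.

```python
from typing import List, Optional


def gather_extended(
    g1_local: List[List[float]], mapping: Optional[List[int]]
) -> List[List[float]]:
    """Copy per-atom features from local atoms to all extended atoms.

    mapping[i] is the index of the local atom that extended atom i
    (local or ghost) is an image of. Hypothetical helper, not deepmd-kit API.
    """
    # Same guard as in the traceback: without a mapping there is no way to
    # tell which local atom each ghost atom corresponds to.
    assert mapping is not None
    # Assumed equivalent of gathering g1 along the atom axis with mapping_ext.
    return [list(g1_local[j]) for j in mapping]


# Toy system (made up): 2 local atoms, 4 extended atoms (2 of them ghosts).
g1_local = [[0.1, 0.2], [0.3, 0.4]]
mapping = [0, 1, 1, 0]
g1_ext = gather_extended(g1_local, mapping)

# Passing None reproduces the kind of failure seen in the log.
try:
    gather_extended(g1_local, None)
    failed = False
except AssertionError:
    failed = True
```

With a valid mapping the call returns one feature row per extended atom; with None it stops at the assert, as in the log.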
DeePMD-kit Version
DeePMD-kit v3.0.0a0
TensorFlow Version
torch Version: 2.1.2.post300
How did you download the software?
Offline packages
Input Files, Running Commands, Error Log, etc.
Steps to Reproduce
(base) [juju@mgt workdir]$ cd water_test_inputs/dpa2/
(base) [juju@mgt dpa2]$ sbatch job.sbatch
Submitted batch job 9442
(base) [juju@mgt dpa2]$ sbatch freeze.sbatch
Submitted batch job 9443
(base) [juju@mgt dpa2]$ cp frozen_model.pth ../lmp
(base) [juju@mgt dpa2]$ cd ../lmp/
(base) [juju@mgt lmp]$ sbatch job.sbatch
Submitted batch job 9444
Further Information, Files, and Links
No response