
[BUG] lmp raises "assert mapping is not None" with dpa2 model


Bug summary

I trained a dpa2 model using deepmd-kit-3.0.0a0/examples/water/dpa2/input_torch.json with the training data in deepmd-kit-3.0.0a0/examples/water/data, froze it, and ran DPMD with the resulting frozen_model.pth using the input files in deepmd-kit-3.0.0a0/examples/water/lmp. Everything above was done with only the necessary run-step changes to the example files.

All the data and input files needed to reproduce the issue are provided in water_test_inputs.zip.

The LAMMPS error (water_test_inputs/lmp/slurm-9441.out) is:

OMP: Info #172: KMP_AFFINITY: OS proc 0 maps to socket 0 core 0 thread 0 
OMP: Info #254: KMP_AFFINITY: pid 13722 tid 13722 thread 0 bound to OS proc set 0
4046
Exception: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/deepmd/pt/model/model/ener_model.py", line 59, in forward_lower
    aparam: Optional[Tensor]=None,
    do_atomic_virial: bool=False) -> Dict[str, Tensor]:
    model_ret = (self).forward_common_lower(extended_coord, extended_atype, nlist, mapping, fparam, aparam, do_atomic_virial, )
                 ~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    model_predict = annotate(Dict[str, Tensor], {})
    torch._set_item(model_predict, "atom_energy", model_ret["energy"])
  File "code/__torch__/deepmd/pt/model/model/ener_model.py", line 200, in forward_common_lower
    _31 = (self).input_type_cast(extended_coord0, None, fparam, aparam, )
    cc_ext, _32, fp, ap, input_prec, = _31
    atomic_ret = (self).forward_common_atomic(cc_ext, extended_atype, nlist0, mapping, fp, ap, )
                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    model_predict = _29(atomic_ret, (self).atomic_output_def(), cc_ext, do_atomic_virial, )
    model_predict1 = (self).output_type_cast(model_predict, input_prec, )
  File "code/__torch__/deepmd/pt/model/model/ener_model.py", line 264, in forward_common_atomic
    fparam: Optional[Tensor]=None,
    aparam: Optional[Tensor]=None) -> Dict[str, Tensor]:
    ret_dict = (self).forward_atomic(extended_coord, extended_atype, nlist, mapping, fparam, aparam, )
                ~~~~~~~~~~~~~~~~~~~~ <--- HERE
    return ret_dict
  def forward_atomic(self: __torch__.deepmd.pt.model.model.ener_model.EnergyModel,
  File "code/__torch__/deepmd/pt/model/model/ener_model.py", line 284, in forward_atomic
      pass
    descriptor = self.descriptor
    _43 = (descriptor).forward(extended_coord, extended_atype, nlist, mapping, )
           ~~~~~~~~~~~~~~~~~~~ <--- HERE
    descriptor0, rot_mat, g2, h2, sw, = _43
    fitting_net = self.fitting_net
  File "code/__torch__/deepmd/pt/model/descriptor/dpa2.py", line 54, in forward
      mapping0 = unchecked_cast(Tensor, mapping)
    else:
      ops.prim.RaiseException("AssertionError: ")
      ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
      mapping0 = _2
    _15 = torch.view(mapping0, [nframes, nall])

Traceback of TorchScript, original code (most recent call last):
  File "/data/home/changxiaoju/software/deepmd-kit-3.0.0a0-cuda123/lib/python3.11/site-packages/deepmd/pt/model/model/ener_model.py", line 73, in forward_lower
        do_atomic_virial: bool = False,
    ):
        model_ret = self.forward_common_lower(
                    ~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
            extended_coord,
            extended_atype,
  File "/data/home/changxiaoju/software/deepmd-kit-3.0.0a0-cuda123/lib/python3.11/site-packages/deepmd/pt/model/model/make_model.py", line 206, in forward_common_lower
            )
            del extended_coord, fparam, aparam
            atomic_ret = self.forward_common_atomic(
                         ~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
                cc_ext,
                extended_atype,
  File "/data/home/changxiaoju/software/deepmd-kit-3.0.0a0-cuda123/lib/python3.11/site-packages/deepmd/pt/model/atomic_model/base_atomic_model.py", line 103, in forward_common_atomic
            nlist = torch.where(pair_mask == 1, nlist, -1)
    
        ret_dict = self.forward_atomic(
                   ~~~~~~~~~~~~~~~~~~~ <--- HERE
            extended_coord,
            extended_atype,
  File "/data/home/changxiaoju/software/deepmd-kit-3.0.0a0-cuda123/lib/python3.11/site-packages/deepmd/pt/model/atomic_model/dp_atomic_model.py", line 164, in forward_atomic
        if self.do_grad_r() or self.do_grad_c():
            extended_coord.requires_grad_(True)
        descriptor, rot_mat, g2, h2, sw = self.descriptor(
                                          ~~~~~~~~~~~~~~~ <--- HERE
            extended_coord,
            extended_atype,
  File "/data/home/changxiaoju/software/deepmd-kit-3.0.0a0-cuda123/lib/python3.11/site-packages/deepmd/pt/model/descriptor/dpa2.py", line 443, in forward
        g1 = self.g1_shape_tranform(g1)
        # mapping g1
        assert mapping is not None
        ~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        mapping_ext = (
            mapping.view(nframes, nall).unsqueeze(-1).expand(-1, -1, g1.shape[-1])
RuntimeError: AssertionError: 

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
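
For context, the assertion guards the step in the DPA-2 descriptor that expands the per-local-atom representation g1 to all extended (ghost) atoms; without the local-to-extended mapping that is expected to be passed into forward_lower, that lookup cannot be done. Below is a toy sketch of the operation the traceback cuts off at; the torch.gather step and all tensor sizes are illustrative assumptions, not code taken from deepmd-kit:

import torch

# Toy sizes: 1 frame, 3 local atoms, 5 extended atoms, 4 features per atom.
nframes, nloc, nall, nfeat = 1, 3, 5, 4
g1 = torch.arange(nframes * nloc * nfeat, dtype=torch.float32).view(nframes, nloc, nfeat)
# mapping[i] says which local atom the i-th extended atom is a periodic copy of.
mapping = torch.tensor([[0, 1, 2, 0, 1]])

# These two lines mirror the "mapping g1" block of dpa2.py in the traceback.
mapping_ext = (
    mapping.view(nframes, nall).unsqueeze(-1).expand(-1, -1, g1.shape[-1])
)
g1_ext = torch.gather(g1, dim=1, index=mapping_ext)  # assumed follow-up gather
print(g1_ext.shape)  # torch.Size([1, 5, 4])

# With mapping=None this lookup is impossible, which is what the
# "assert mapping is not None" at dpa2.py line 443 reports.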

DeePMD-kit Version

DeePMD-kit v3.0.0a0

TensorFlow Version

torch Version: 2.1.2.post300

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

water_test_inputs.zip

Steps to Reproduce

(base) [juju@mgt workdir]$ cd water_test_inputs/dpa2/
(base) [juju@mgt dpa2]$ sbatch job.sbatch 
Submitted batch job 9442
(base) [juju@mgt dpa2]$ sbatch freeze.sbatch 
Submitted batch job 9443
(base) [juju@mgt dpa2]$ cp frozen_model.pth ../lmp
(base) [juju@mgt dpa2]$ cd ../lmp/
(base) [juju@mgt lmp]$ sbatch job.sbatch 
Submitted batch job 9444
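
The same assertion can in principle be triggered without LAMMPS by calling the frozen model's forward_lower directly and leaving mapping unset, which is effectively what happens in the run above. The snippet below is an unverified sketch: the coordinates, atom types, and neighbor list are toy values, and only the forward_lower signature is taken from the serialized code in the traceback.

import torch

model = torch.jit.load("frozen_model.pth")

# Toy system: 1 frame, 2 atoms of type 0, each seeing the other as its single
# neighbor; shapes follow the [nframes, nall, 3] / [nframes, nall] /
# [nframes, nloc, nnei] convention suggested by the traceback.
extended_coord = torch.zeros(1, 2, 3, dtype=torch.float64)
extended_coord[0, 1, 0] = 1.0
extended_atype = torch.zeros(1, 2, dtype=torch.long)
nlist = torch.tensor([[[1], [0]]], dtype=torch.long)

# mapping is left at its default (None), so this should stop at
# "assert mapping is not None" in the DPA-2 descriptor.
model.forward_lower(extended_coord, extended_atype, nlist)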

Further Information, Files, and Links

No response

changxiaoju · Mar 07 '24 11:03