DeepSpeed [BUG] DeepCompile in ZeRO-3 fails to do the forward pass

Describe the bug It looks like the parameters are still partitioned when doing the forward pass, with DeepCompile.

Error logs:

  File "torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "deepspeed/utils/nvtx.py", line 20, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "deepspeed/runtime/engine.py", line 2054, in forward
    loss = self.module(*inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/nn/modules/module.py", line 1737, in _wrapped_call_impl
    return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/_dynamo/eval_frame.py", line 574, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "openrlhf/models/model.py", line 244, in forward
    def forward(
  File "openrlhf/models/model.py", line 257, in torch_dynamo_resume_in_forward_at_257
    input_ids, position_ids, _, ring_attn_pad_len, indices = unpad_and_slice_tensor(
  File "torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "transformers/utils/generic.py", line 965, in wrapper
    output = func(self, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "transformers/models/llama/modeling_llama.py", line 527, in forward
    inputs_embeds = self.embed_tokens(input_ids)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/nn/modules/sparse.py", line 190, in forward
    return F.embedding(
           ^^^^^^^^^^^^
  File "torch/nn/functional.py", line 2551, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: 'weight' must be 2-D

To Reproduce I'm using OpenRLHF with DeepCompile, but I think this should not be directly related to code in OpenRLHF and should be general enough to reproduce.

Basically, you may be able to reconstruct a reproduction script that, during training, run a forward pass when doing .eval() in ZeRO-3 with DeepCompile, for a huggingface model (meta-llama/Llama-3.2-1B-Instruct) with flash attention enabled (transformers==4.51.1, flash_attn==2.7.4.post1).

Expected behavior No error thrown

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
dc ..................... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
 [WARNING]  FP Quantizer is using an untested triton version (3.2.0), only 2.3.(0, 1) and 3.0.0 are known to be compatible with these kernels
fp_quantizer ........... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
 [WARNING]  gds requires the dev libaio .so object and headers but these were not found.
 [WARNING]  gds: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
gds .................... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.6
 [WARNING]  using untested triton version (3.2.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch version .................... 2.6.0+cu124
deepspeed info ................... 0.16.6, unknown, unknown
torch cuda version ............... 12.4
torch hip version ................ None
nvcc version ..................... 12.4
deepspeed wheel compiled w. ...... torch 2.6, cuda 12.4
shared memory (/dev/shm) size .... 251.75 GB

System info (please complete the following information):

OS: RHEL 9.5
1 machines with x4 A100s each
transformers==4.51.1
flash_attn==2.7.4.post1
Python version: 3.12.9
OpenRLHF version: 0.7.1.post1

Apr 17 '25 20:04 HollowMan6

With flash attention disabled, we no longer have the unpartitioned parameters error above, but instead same as https://github.com/deepspeedai/DeepSpeed/issues/7229#issuecomment-2814147263

Apr 17 '25 22:04 HollowMan6

Is there any progress ?

Jun 11 '25 11:06 hijkzzz

@HollowMan6 @hijkzzz

#7386 resolved the issue on my environment as long as you use SDPA and padding across multiple ranks. I added the benchmark of ZeRO3 in the Wiki page.

The behavior of the compiler can change with a subtle difference - Please let us know if you still see any issue. As the (re)compilation and optimization take a long time, we will need to implement a feature to save/load the optimized graph.

Jun 25 '25 05:06 tohtana

I believe this issue is not fully solved. It appears again as soon as there is any graph break.

Jul 08 '25 16:07 bsochack