DeepSpeed [BUG] RuntimeError: still have inflight params [<bound method Init._convert_to_deepspeed_param.<locals>.ds

Describe the bug

What could be causing the error below?

  File "/home/xihe/xinhe/distNAS/DeepspeedNAS/train.py", line 200, in train_zero
    engine.backward(loss)
  File "/home/xihe/xinhe/deepspeed/DeepSpeed/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/xihe/xinhe/deepspeed/DeepSpeed/deepspeed/runtime/engine.py", line 1980, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/home/xihe/xinhe/deepspeed/DeepSpeed/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/xihe/xinhe/deepspeed/DeepSpeed/deepspeed/runtime/zero/stage3.py", line 2088, in backward
    self._get_param_coordinator(training=True).reset_step()
  File "/home/xihe/xinhe/deepspeed/DeepSpeed/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 185, in reset_step
    raise RuntimeError(
RuntimeError: still have inflight params [<bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter containing:

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
--------------------------------------------------
No CUDA runtime is found, using CUDA_HOME='/cm/extra/Utils/CUDA/11.1.0.0_455.23.05'
DeepSpeed general environment info:
torch install path ............... ['/datasets/xihe/miniconda3/envs/colossal/lib/python3.9/site-packages/torch']
torch version .................... 1.12.1+cu113
deepspeed install path ........... ['/home/xihe/xinhe/deepspeed/DeepSpeed/deepspeed']
deepspeed info ................... 0.8.3+3667758, 3667758, master
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.1
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3

Apr 06 '23 15:04 marsggbo

@marsggbo, thanks for reporting this issue. Can you please provide more details to enable reproducing this problem?

Apr 06 '23 22:04 tjruwase

I use a model with a dynamic forward pass. Below is an example:

import torch
import torch.nn as nn


class ToyNASModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = torch.nn.Conv2d(3, 512, kernel_size=3, stride=1, padding=1, bias=False)
        # conv2 and conv3 are alternative branches; only one of them runs per forward pass.
        self.conv2 = torch.nn.Conv2d(512, 1024, kernel_size=3, stride=1, padding=1, bias=False)
        self.conv3 = torch.nn.Conv2d(512, 1024, kernel_size=5, stride=1, padding=2, bias=False)
        self.gavg = torch.nn.AdaptiveAvgPool2d((1, 1))
        self.fc = torch.nn.Linear(1024, 1000, bias=False)
        self.count = 0  # number of forward passes so far

    def forward(self, x):
        out = self.conv1(x)
        # Alternate between the two branches on successive calls.
        if self.count % 2 == 0:
            out = self.conv2(out)
        else:
            out = self.conv3(out)
        self.count += 1
        out = self.gavg(out).view(out.size(0), -1)
        out = self.fc(out)
        return out
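
For context, ZeRO stage 3 records the order in which submodules run so that it can prefetch their parameters (the "Invalidate trace cache" message quoted later in this thread comes from that mechanism). The pure-Python sketch below is not DeepSpeed code: `ToyDynamicModel` and `TraceCache` are hypothetical stand-ins that only illustrate how alternating between conv2 and conv3 makes every other step disagree with the order recorded on the first step.

```python
class ToyDynamicModel:
    """Stand-in for ToyNASModel: reports which submodules run, in order."""

    def __init__(self):
        self.count = 0

    def forward(self, on_module):
        on_module("conv1")
        # Same alternation as in the toy model above.
        on_module("conv2" if self.count % 2 == 0 else "conv3")
        self.count += 1
        on_module("gavg")
        on_module("fc")


class TraceCache:
    """Hypothetical trace: records the first step, compares later steps to it."""

    def __init__(self):
        self.recorded = None
        self.current = []

    def on_module(self, name):
        self.current.append(name)

    def end_step(self):
        """Return True if this step recorded the trace or matched it."""
        if self.recorded is None:
            self.recorded = list(self.current)
            matched = True
        else:
            matched = self.current == self.recorded
        self.current = []
        return matched


def run_steps(num_steps):
    model, cache = ToyDynamicModel(), TraceCache()
    results = []
    for _ in range(num_steps):
        model.forward(cache.on_module)
        results.append(cache.end_step())
    return results


print(run_steps(4))  # [True, False, True, False]
```

In this toy setting, any parameter prefetched according to the recorded order (for example conv2's weights) would go unused on the steps that take the conv3 branch.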

The ds_config is as follows:

{
    "train_micro_batch_size_per_gpu": 32,
    "gradient_accumulation_steps": 1,
    "steps_per_print": 1,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001,
            "betas": [
                0.8,
                0.999
            ],
            "eps": 1e-08,
            "weight_decay": 3e-07
        }
    },
    "deepspeed": {
        "num_gpus": 1
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 0.001,
            "warmup_num_steps": 1000
        }
    },
    "zero_optimization": {
        "stage": 3,
        "allgather_partitions": true,
        "reduce_scatter": true,
        "allgather_bucket_size": 50000000,
        "reduce_bucket_size": 50000000,
        "overlap_comm": true,
        "contiguous_gradients": true
    },
    "zero_allow_untested_optimizer": true,
    "fp16": {
        "enabled": true,
        "fp16_master_weights_and_grads": false,
        "loss_scale": 0,
        "loss_scale_window": 500,
        "hysteresis": 2,
        "min_loss_scale": 1,
        "initial_scale_power": 15
    }
}

Apr 07 '23 01:04 marsggbo

I have the same issue (using the PyTorch Lightning implementation and DeepSpeed v0.9.0). Any pointers on where this originates from would be splendid.

Apr 19 '23 13:04 tobideusser

@marsggbo, thanks for sharing a toy model. However, we need more than this to reproduce the issue. Can you please share the script, code, data, and command line needed to reproduce the issue?

@tobideusser, can you share repro details?

Apr 19 '23 13:04 tjruwase

I'm sorry for taking so long to reply, I was trying to figure out where that error came from.

To start with, this is basically my validation_step:

def validation_step(self, batch: Dict, batch_idx: int) -> Dict:
    predictions = self.model.generate(
        input_ids=batch["prompt_ids"],
        num_beams=1,
        do_sample=False,
        max_new_tokens=54,
    )
    batch["predictions"] = predictions
    batch = self._detach_tensors_in_dict(batch)
    return batch

This is what I found out:

The error appears after the first validation_step of PyTorch Lightning, but before the second one.
In that step, I use the generation function of a language model imported from HF (https://huggingface.co/docs/transformers/main_classes/text_generation).
After investigating where this error is raised, I found that reset_step simply checks whether there are still any in-flight parameters (see: https://github.com/microsoft/DeepSpeed/blob/39b429d56ef12b3dc82fc177e2f0f801db744a3d/deepspeed/runtime/zero/partitioned_param_coordinator.py#L183), so someone has probably encountered this before and wrote an exception for it, I guess?
If I disable the sanity checking of PyTorch Lightning (https://lightning.ai/docs/pytorch/latest/common/trainer.html?highlight=num_sanity_val_steps#num-sanity-val-steps), the first validation step actually runs through but prints "Invalidate trace cache @ step 271 and module 2: cache has only 271 modules" (which is printed here: https://github.com/microsoft/DeepSpeed/blob/39b429d56ef12b3dc82fc177e2f0f801db744a3d/deepspeed/runtime/zero/partitioned_param_coordinator.py#L152).
Interestingly, if I create a dummy dataset (and reproducible code), this error does not appear.

Therefore, my toy example actually runs through and is not helpful so far.

Could you shed some light on what an "Inflight Parameter" actually is? Is it possible to somehow detach them? I tried simply detaching every tensor in the result after model.generate in the validation step, but this does not change the behaviour.
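
On the question of what an in-flight parameter is: the logs in this thread show parameter status values such as INFLIGHT and AVAILABLE, which suggests that a parameter is "in flight" while its gather has been started but not yet waited on. The following is a simplified, hypothetical model of that bookkeeping (the class and method names are illustrative and are not DeepSpeed's API). It shows how a parameter that was prefetched but never consumed would trip a reset_step-style check. Since the status belongs to the parameter fetch and not to autograd, detaching output tensors would not be expected to clear it, which matches the observation above.

```python
from enum import Enum


class ParamStatus(Enum):
    NOT_AVAILABLE = 1  # only the local shard is held
    INFLIGHT = 2       # gather started, result not yet waited on
    AVAILABLE = 3      # full parameter is ready to use


class ToyParam:
    def __init__(self, name):
        self.name = name
        self.status = ParamStatus.NOT_AVAILABLE


class ToyCoordinator:
    """Hypothetical coordinator that tracks which params are in flight."""

    def __init__(self):
        self.inflight = []

    def prefetch(self, param):
        # Start an (imaginary) asynchronous gather.
        param.status = ParamStatus.INFLIGHT
        self.inflight.append(param)

    def wait(self, param):
        # The module that owns the param is about to run: finish the gather.
        param.status = ParamStatus.AVAILABLE
        self.inflight.remove(param)

    def release(self, param):
        # The module has finished: drop the full tensor again.
        param.status = ParamStatus.NOT_AVAILABLE

    def reset_step(self):
        # End-of-step sanity check, modelled on the error in this thread.
        if self.inflight:
            names = [p.name for p in self.inflight]
            raise RuntimeError(f"still have inflight params {names}")


def demo():
    coord = ToyCoordinator()
    used, unused = ToyParam("conv2.weight"), ToyParam("conv3.weight")
    coord.prefetch(used)
    coord.prefetch(unused)  # prefetched, but its module never runs this step
    coord.wait(used)
    coord.release(used)
    try:
        coord.reset_step()
    except RuntimeError as err:
        return str(err)
    return "ok"


print(demo())  # still have inflight params ['conv3.weight']
```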

Apr 24 '23 08:04 tobideusser

I have the same problem using zero3 in pytorch lightning and using the generate function. Is there a solution?

May 04 '23 12:05 justHungryMan

Hello @justHungryMan @tobideusser. This issue has been fixed by a collaborative effort with the Lightning team. Please update DeepSpeed and Lightning to apply the fix. Thank you.

@marsggbo, if the error is still there even with the latest DeepSpeed, please feel free to reopen this issue with a reproduction script.

May 19 '23 18:05 HeyangQin

@HeyangQin I also ran into this problem when running DeepSpeed-Chat.

The DeepSpeed version is 0.9.5.

Jun 15 '23 10:06 ZJXNEFU

Hello @ZJXNEFU. Could you provide a reproduction script for us to investigate this issue? Thank you

Jun 15 '23 20:06 HeyangQin

# deepspeed-chat step3 single_node run_rl.sh

#!/bin/bash

ACTOR_MODEL_PATH=/model/path   # a model fine-tuned on the bigcode/starcoder
CRITIC_MODEL_PATH=/reward_model_path  # a reward model trained on bigcode/tiny_starcoder_py  by step 2 scripts
ACTOR_ZERO_STAGE=$3
CRITIC_ZERO_STAGE=$4
OUTPUT=$5
if [ "$OUTPUT" == "" ]; then
    OUTPUT=./rl_output
fi
if [ "$ACTOR_ZERO_STAGE" == "" ]; then
    ACTOR_ZERO_STAGE=3
fi
if [ "$CRITIC_ZERO_STAGE" == "" ]; then
    CRITIC_ZERO_STAGE=3
fi
mkdir -p $OUTPUT

Num_Padding_at_Beginning=0 # this is model related

Actor_Lr=5e-4
Critic_Lr=5e-6

GPU_OPTION=" --include localhost:0,1,2,3,4,5,6,7 "

deepspeed --master_port 12346 ${GPU_OPTION} main.py \
   --data_path data/ \
   --data_split 2,4,4 \
   --data_output_path cache/ \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --critic_model_name_or_path $CRITIC_MODEL_PATH \
   --num_padding_at_beginning 0 \
   --per_device_train_batch_size 2 \
   --per_device_mini_train_batch_size 2 \
   --generation_batch_numbers 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 1024 \
   --max_prompt_seq_len 1024 \
   --actor_learning_rate ${Actor_Lr} \
   --critic_learning_rate ${Critic_Lr} \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --num_warmup_steps 100 \
   --deepspeed --seed 1234 \
   --enable_hybrid_engine \
   --inference_tp_size 1 \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --critic_zero_stage $CRITIC_ZERO_STAGE \
   --actor_gradient_checkpointing \
   --disable_actor_dropout \
   --actor_lora_dim 128 \
   --actor_lora_module_name decoder.layers. \
   --output_dir $OUTPUT

The following is the DeepSpeed config:

zero_opt_dict = {
    "stage": 3,
    "offload_param": {
        "device": "none",
    },
    "offload_optimizer": {
        "device": "none",
    },
    "stage3_param_persistence_threshold": 1e4,
    "stage3_max_live_parameters": 1e7,
    "stage3_prefetch_bucket_size": 1e7,
    "memory_efficient_linear": False,
    "reduce_bucket_size": 1e7,
    "allgather_bucket_size": 1e7,
    "stage3_max_reuse_distance": 1e7
}

all_config = {
    "train_batch_size": GLOBAL_BATCH_SIZE,
    "train_micro_batch_size_per_gpu": MICRO_BATCH_SIZE,
    "steps_per_print": 10,
    "zero_optimization": zero_opt_dict,
    "fp16": {
        "enabled": True
    },
    "gradient_clipping": 1.0,
    "prescale_gradients": False,
    "wall_clock_breakdown": False
}

ds_report

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/root/miniconda3/envs/dschat/lib/python3.9/site-packages/torch']
torch version .................... 1.12.1+cu116
deepspeed install path ........... ['/root/miniconda3/envs/dschat/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.9.5+b692d236, b692d236, master
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.6
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.6

@HeyangQin

Jun 16 '23 01:06 ZJXNEFU

+1, please tell me how to address this issue. "RuntimeError: still have inflight params [<bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter containing"

Jun 27 '23 12:06 Fhujinwu

Still have similar inflight params issue with 0.10.0+f8551b43 when running deepspeedchat.

Jun 29 '23 08:06 dearchill

One of our recent fixes https://github.com/microsoft/DeepSpeed/pull/3819 should have fixed this issue. It is not included in the pypi release yet so you need to install deepspeed from source to apply this fix. Please let us know if you still see this inflight issue even with the fix.
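
To check whether a source build is actually the one being picked up, one option is to look at the installed version string. The ds_report outputs in this thread show source builds as a release number plus a commit hash (for example 0.10.0+f8551b43). The helper below is a small sketch that assumes the installed package metadata carries the same suffix; `split_local_version` and `installed_deepspeed_build` are hypothetical names, not part of DeepSpeed.

```python
from importlib import metadata


def split_local_version(version):
    """Split '0.10.0+f8551b43' into ('0.10.0', 'f8551b43').

    A version without a '+' suffix returns None for the second element.
    """
    release, _, local = version.partition("+")
    return release, local or None


def installed_deepspeed_build():
    """Return (release, commit) for the installed deepspeed, or None if absent."""
    try:
        return split_local_version(metadata.version("deepspeed"))
    except metadata.PackageNotFoundError:
        return None


print(split_local_version("0.10.0+f8551b43"))  # ('0.10.0', 'f8551b43')
print(split_local_version("0.9.5"))            # ('0.9.5', None)
print(installed_deepspeed_build())
```

If the second element is None, the environment is probably using a released wheel and not a source checkout, so a fix that exists only on master would not be present.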

Jun 29 '23 18:06 HeyangQin

Still have similar inflight params issue with deepspeed_0.10.0+fd1d2c64 when running deepspeedchat. https://github.com/microsoft/DeepSpeedExamples/issues/616

Jun 30 '23 09:06 Fhujinwu

Yes, I installed 0.10.0+f8551b43 from source, but the issue still remained. For my case, the branch "HeyangQin/fix_issue_3156" solved it. Thanks a lot anyway. You can also try the branch "HeyangQin/fix_issue_3156". @Fhujinwu

Jun 30 '23 09:06 dearchill

Thank you, the branch "HeyangQin/fix_issue_3156" solved the issue.

Jul 01 '23 01:07 Fhujinwu

I came across the same error (RuntimeError: still have inflight params). Checking out and installing the branch HeyangQin/fix_issue_3156 resolved it, but using the current main did not. Unfortunately, the branch is at version 0.9. Any chance it will be merged into main so that the problem is fixed on version 0.10? @HeyangQin

Happy to provide any additional info if it can help 😃

Aug 18 '23 00:08 vittorio-perera

I came across the same error too (RuntimeError: still have inflight params). When I use DeepSpeed to train RL on 8 V100 GPUs, this bug still exists. I also switched to the branch HeyangQin/fix_issue_3156, but it doesn't work. Happy to provide any additional info if it can help.

My DeepSpeed version is 0.10.1. @HeyangQin

This is the log output:

  File "/xxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 259, in train_rlhf
    self.critic_model.backward(critic_loss)
  File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1890, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2031, in backward
    self._get_param_coordinator(training=True)
  File "/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 199, in reset_step
    raise RuntimeError(f"still have inflight params "

RuntimeError: RuntimeError: still have inflight params [{'id': 4548, 'status': 'INFLIGHT', 'numel': 412139520, 'ds_numel': 412139520, 'shape': (80496, 5120), 'ds_shape': (80496, 5120), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([103034880])}, {'id': 4035, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1638400, 'shape': (0,), 'ds_shape': (1280, 1280), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([409600])}, {'id': 4027, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 328960, 'shape': (0,), 'ds_shape': (257, 1280), 'requires_grad': False...........................................
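
The summaries above carry some arithmetic worth checking: ds_numel is the full parameter size and ds_tensor.shape is the size of the local shard. Assuming ZeRO-3 splits each parameter evenly across ranks and rounds up (an assumption, not something stated in the logs), the number of ranks can be inferred from those two numbers. `shard_numel` and `infer_world_size` are hypothetical helpers written for this check.

```python
def shard_numel(ds_numel, world_size):
    """Per-rank shard size, assuming an even split rounded up."""
    return -(-ds_numel // world_size)  # integer ceiling division


def infer_world_size(ds_numel, shard, max_ranks=64):
    """Return every rank count in 1..max_ranks consistent with the shard size."""
    return [w for w in range(1, max_ranks + 1) if shard_numel(ds_numel, w) == shard]


# Values copied from the RuntimeError summaries above.
print(infer_world_size(412139520, 103034880))  # [4]
print(infer_world_size(1638400, 409600))       # [4]

# Values from the AssertionError summary posted later in this thread.
print(infer_world_size(5120, 1707))            # [3]
```

Under that assumption, the shard sizes in the first log are consistent with 4 ranks, and the later log with 3 ranks. A summary showing 'numel': 0 alongside a non-zero 'ds_numel' would then mean the full tensor is not currently materialized on that rank.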

This is my ds_report:

Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1
deepspeed install path ........... ['/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed-0.9.3+f2d600ba-py3.10.egg/deepspeed']
deepspeed info ................... 0.9.3+f2d600ba, f2d600ba, HeyangQin/fix_issue_3156
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 10.1
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7

Aug 19 '23 10:08 iamsile

Hello @iamsile @vittorio-perera. Could you provide a reproduction script for us to better investigate this issue? Thank you

Aug 24 '23 05:08 HeyangQin

@HeyangQin I'll try to put together a script as soon as possible. thanks for getting back on this!

Aug 24 '23 12:08 vittorio-perera

@HeyangQin This is a full record: https://github.com/microsoft/DeepSpeed/issues/4175. ~~I used HeyangQin/fix_issue_3156 to fix it~~. ~~But I found this branch didn't merge into master.~~ Looking forward to your reply.

@HeyangQin In my latest test, I found that copying HeyangQin/fix_issue_3156 into master doesn't work. In RL training, it only works at step 0; after that it always crashes. This is the full report:

reward score --> step=0, rank=2, tensor([0.2354], device='cuda:2', dtype=torch.bfloat16)
reward score --> step=0, rank=0, tensor([0.4492], device='cuda:0', dtype=torch.bfloat16)
reward score --> step=0, rank=1, tensor([0.0601], device='cuda:1', dtype=torch.bfloat16)
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
Epoch: 0 | Step: 0 | PPO Epoch: 1 | Actor Loss: -0.55859375 | Critic Loss: 0.216796875 | Unsupervised Loss: 0.0
Average reward score: 0.2470703125

Traceback (most recent call last):
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 672, in <module>
    main()
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 541, in main
    out = trainer.generate_experience(batch_prompt['images'],
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 153, in generate_experience
    seq = self._generate_sequence(images, prompts, mask, step)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 95, in _generate_sequence
    seq = self.actor_model.module.forward(images,
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/peft/peft_model.py", line 296, in forward
    return self.get_base_model()(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/modeling_mplug_owl.py", line 1672, in forward
    outputs = self.language_model.generate(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 1538, in generate
    return self.greedy_search(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 2362, in greedy_search
    outputs = self(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 794, in forward
    outputs = self.model(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 677, in forward
    layer_outputs = decoder_layer(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 372, in forward
    hidden_states = self.input_layernorm(hidden_states)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl
    result = hook(self, input)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 383, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 495, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=True)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 328, in fetch_sub_module
    assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 918, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 5120, 'shape': (0,), 'ds_shape': (5120,), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': {578}, 'ds_tensor.shape': torch.Size([1707])}
Traceback (most recent call last):
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 672, in <module>
    main()
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 541, in main
    out = trainer.generate_experience(batch_prompt['images'],
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 153, in generate_experience
    seq = self._generate_sequence(images, prompts, mask, step)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 95, in _generate_sequence
    seq = self.actor_model.module.forward(images,
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/peft/peft_model.py", line 296, in forward
    return self.get_base_model()(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/modeling_mplug_owl.py", line 1672, in forward
    outputs = self.language_model.generate(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 1538, in generate
    return self.greedy_search(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 2362, in greedy_search
    outputs = self(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 794, in forward
    outputs = self.model(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 677, in forward
    layer_outputs = decoder_layer(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 372, in forward
    hidden_states = self.input_layernorm(hidden_states)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl
    result = hook(self, input)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 383, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 495, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=True)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 328, in fetch_sub_module
    assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 918, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 5120, 'shape': (0,), 'ds_shape': (5120,), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': {578}, 'ds_tensor.shape': torch.Size([1707])}
Traceback (most recent call last):
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 672, in <module>
    main()
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 541, in main
    out = trainer.generate_experience(batch_prompt['images'],
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 153, in generate_experience
    seq = self._generate_sequence(images, prompts, mask, step)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 95, in _generate_sequence
    seq = self.actor_model.module.forward(images,
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/peft/peft_model.py", line 296, in forward
    return self.get_base_model()(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/modeling_mplug_owl.py", line 1672, in forward
    outputs = self.language_model.generate(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 1538, in generate
    return self.greedy_search(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 2362, in greedy_search
    outputs = self(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 794, in forward
    outputs = self.model(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 677, in forward
    layer_outputs = decoder_layer(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 372, in forward
    hidden_states = self.input_layernorm(hidden_states)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl
    result = hook(self, input)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 383, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 495, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=True)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 328, in fetch_sub_module
    assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 918, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 5120, 'shape': (0,), 'ds_shape': (5120,), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': {578}, 'ds_tensor.shape': torch.Size([1707])}
[2023-08-28 07:34:58,470] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 46301
[2023-08-28 07:34:58,470] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 46302
[2023-08-28 07:34:58,476] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 46303

Aug 25 '23 09:08 iamsile

Hello @iamsile @vittorio-perera. Could you provide a reproduction script for us to better investigate this issue? Thank you

@HeyangQin This is a full record: #4175. ~I used HeyangQin/fix_issue_3156 to fix it~. ~But I found this branch wasn't merged into master.~ Looking forward to your reply.

@HeyangQin In my latest test, I found that copying HeyangQin/fix_issue_3156 into master doesn't work. In RL training it only works at step 0; after that it always crashes. This is the full report:

reward score --> step=0, rank=2, tensor([0.2354], device='cuda:2', dtype=torch.bfloat16)
reward score --> step=0, rank=0, tensor([0.4492], device='cuda:0', dtype=torch.bfloat16)
reward score --> step=0, rank=1, tensor([0.0601], device='cuda:1', dtype=torch.bfloat16)
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
/xxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
Epoch: 0 | Step: 0 | PPO Epoch: 1 | Actor Loss: -0.55859375 | Critic Loss: 0.216796875 | Unsupervised Loss: 0.0
Average reward score: 0.2470703125

Traceback (most recent call last):
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 672, in <module>
    main()
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py", line 541, in main
    out = trainer.generate_experience(batch_prompt['images'],
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 153, in generate_experience
    seq = self._generate_sequence(images, prompts, mask, step)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 95, in _generate_sequence
    seq = self.actor_model.module.forward(images,
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/peft/peft_model.py", line 296, in forward
    return self.get_base_model()(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/modeling_mplug_owl.py", line 1672, in forward
    outputs = self.language_model.generate(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 1538, in generate
    return self.greedy_search(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/transformers/generation/utils.py", line 2362, in greedy_search
    outputs = self(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 794, in forward
    outputs = self.model(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 677, in forward
    layer_outputs = decoder_layer(
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/xxxxxx/rlhf_mutil_modal_mplug_2/alpaca_rlhf/deepspeed_chat/training/utils/model/llama/modeling_llama.py", line 372, in forward
    hidden_states = self.input_layernorm(hidden_states)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl
    result = hook(self, input)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 383, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 495, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=True)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/xxxxxx/env/mplug_owl_2/mplug_owl/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 328, in fetch_sub_module
    assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 918, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 5120, 'shape': (0,), 'ds_shape': (5120,), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': {578}, 'ds_tensor.shape': torch.Size([1707])}

(The same traceback and AssertionError are printed three times in total, once per rank.)

[2023-08-28 07:34:58,470] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 46301
[2023-08-28 07:34:58,470] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 46302
[2023-08-28 07:34:58,476] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 46303

Same here. It works at step 0 and then crashes with:

    raise RuntimeError(f"{param.ds_summary()} already in registry")
RuntimeError: {'id': 0, 'status': 'INFLIGHT', 'numel': 262144000, 'ds_numel': 262144000, 'shape': (64000, 4096), 'ds_shape': (64000, 4096), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([65536000])} already in registry

Sep 05 '23 04:09 jiahuanluo
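
For anyone comparing these reports: every `ds_summary()` dict in the pasted errors begins with `'id'` and `'status'`, so the statuses can be tallied straight from the raw text. This is only a rough diagnostic sketch. The helper names `tally_statuses` and `ids_with_status` are made up for illustration and are not part of DeepSpeed, and the regex assumes the field order seen in the messages quoted in this thread.

```python
import re
from collections import Counter

# Assumption: each summary dict starts with 'id' followed by 'status',
# as in the errors pasted in this thread. A regex over the raw text also
# works on truncated pastes, where parsing the full dict would fail.
SUMMARY_FIELDS = re.compile(r"'id': (\d+), 'status': '(\w+)'")


def tally_statuses(error_text: str) -> Counter:
    """Count how many reported parameters carry each ZeRO status."""
    return Counter(status for _pid, status in SUMMARY_FIELDS.findall(error_text))


def ids_with_status(error_text: str, wanted: str) -> list:
    """Return the reported parameter ids that have the given status, in order."""
    return [int(pid) for pid, status in SUMMARY_FIELDS.findall(error_text)
            if status == wanted]


# Two messages copied from this thread.
pasted = (
    "RuntimeError: {'id': 0, 'status': 'INFLIGHT', 'numel': 262144000, "
    "'ds_numel': 262144000, 'shape': (64000, 4096), 'ds_shape': (64000, 4096), "
    "'requires_grad': False, 'grad_shape': None, 'persist': False, "
    "'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([65536000])} "
    "already in registry\n"
    "AssertionError: {'id': 918, 'status': 'INFLIGHT', 'numel': 0, "
    "'ds_numel': 5120, 'shape': (0,), 'ds_shape': (5120,), "
    "'requires_grad': False, 'grad_shape': None, 'persist': True, "
    "'active_sub_modules': {578}, 'ds_tensor.shape': torch.Size([1707])}"
)

print(tally_statuses(pasted))
print(ids_with_status(pasted, "INFLIGHT"))
```

Running it on a whole log makes it easy to see whether the listed parameters are really `INFLIGHT` or are reported under some other status.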

I have fixed it and am closing this issue.

Sep 11 '23 08:09 iamsile

@iamsile Hi, could you please tell me how you fixed this? Many thanks.

Sep 12 '23 06:09 jiahuanluo

@iamsile Has your case been resolved with the latest DeepSpeed version? I observed similar issues recently, typically with a BERT model and some linear layers under ZeRO-3. The training process starts with some "Invalidate trace cache @ step xx: expected module xx, but got module xxx" messages, and then after about 10 steps it aborts with RuntimeError: still have inflight params [{'id': 1143, 'status': 'AVAILABLE', 'numel': 1982464, 'ds_numel': 1982464, 'shape': (1408, 1408), 'ds_shape': (1408, 1408), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([247808])}, {'id': 1145, 'status': 'AVAILABLE', 'numel': 5767168, 'ds_numel': 5767168, 'shape': (4096, 1408), 'ds_shape': (4096, 1408), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([720896])}]

Mar 21 '24 23:03 XenonLamb

I've also hit this issue, @XenonLamb. Mine happens about 230 steps in, at the end of a batch, when I'm returning a dummy loss value. Not sure what the deal is 🤨

[rank6]: RuntimeError: still have inflight params [{'id': 6, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1572864, 'shape': (0,), 'ds_shape': (1024, 512, 3), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([196608])}, {'id': 10, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 3145728, 'shape': (0,), 'ds_shape': (3072, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([393216])}, {'id': 18, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 16, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (4096, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 22, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (1024, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 23, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 28, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 3145728, 'shape': (0,), 'ds_shape': (3072, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([393216])}, {'id': 36, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 34, 'status': 
'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (4096, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 37, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (4096, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 4, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 393216, 'shape': (0,), 'ds_shape': (512, 256, 3), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([49152])}, {'id': 12, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 8, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 3145728, 'shape': (0,), 'ds_shape': (1024, 1024, 3), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([393216])}, {'id': 19, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (4096, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 20, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (1024, 4096), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 30, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 40, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (1024, 
4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 38, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (1024, 4096), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 46, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 3145728, 'shape': (0,), 'ds_shape': (3072, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([393216])}, {'id': 55, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (4096, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 77, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 89, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 524288, 'shape': (0,), 'ds_shape': (512, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([65536])}, {'id': 58, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (1024, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 66, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 73, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (4096, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 
'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 82, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 262144, 'shape': (0,), 'ds_shape': (512, 512), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([32768])}, {'id': 56, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (1024, 4096), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 70, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (4096, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 76, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (1024, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 88, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 524288, 'shape': (0,), 'ds_shape': (1024, 512), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([65536])}, {'id': 64, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 3145728, 'shape': (0,), 'ds_shape': (3072, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([393216])}, {'id': 85, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 524288, 'shape': (0,), 'ds_shape': (512, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([65536])}, {'id': 86, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 262144, 'shape': (0,), 'ds_shape': (512, 512), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([32768])}, {'id': 48, 'status': 
'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 72, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 54, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 59, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 74, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (1024, 4096), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 41, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}, {'id': 52, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 4194304, 'shape': (0,), 'ds_shape': (4096, 1024), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([524288])}, {'id': 84, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 524288, 'shape': (0,), 'ds_shape': (1024, 512), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([65536])}, {'id': 92, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 
'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([131072])}]

Apr 29 '24 00:04 dzagardo
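
Dumps like the one above are hard to read by eye. The following is a minimal sketch, not part of DeepSpeed, of a helper that extracts the `id`, `status` and `ds_shape` fields from such an error message and counts the statuses. The function name `summarize_inflight` and the shortened `SAMPLE` string are assumptions for illustration; only the field names are taken from the dumps in this thread.

```python
import re
from collections import Counter

# Matches the start of one ds_summary dict and its ds_shape tuple.
# re.DOTALL and \s* tolerate the line wrapping seen in pasted logs.
_ENTRY = re.compile(
    r"\{'id':\s*(?P<id>\d+),\s*'status':\s*'(?P<status>\w+)'"
    r".*?'ds_shape':\s*\((?P<shape>[^)]*)\)",
    re.DOTALL,
)

# Shortened sample built from the first two entries of the dump above.
SAMPLE = (
    "RuntimeError: still have inflight params ["
    "{'id': 6, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 1572864, "
    "'shape': (0,), 'ds_shape': (1024, 512, 3), 'requires_grad': False}, "
    "{'id': 10, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 3145728, "
    "'shape': (0,), 'ds_shape': (3072, 1024), 'requires_grad': False}]"
)


def summarize_inflight(message: str):
    """Return ([(id, status, ds_shape), ...], Counter of statuses)."""
    entries = []
    for match in _ENTRY.finditer(message):
        # "1024, 512, 3" -> (1024, 512, 3); "5120," -> (5120,)
        shape = tuple(
            int(dim) for dim in match["shape"].split(",") if dim.strip()
        )
        entries.append((int(match["id"]), match["status"], shape))
    return entries, Counter(status for _, status, _ in entries)


if __name__ == "__main__":
    entries, counts = summarize_inflight(SAMPLE)
    for pid, status, shape in entries:
        print(pid, status, shape)
    print(dict(counts))
```

Counting statuses this way makes it easier to see whether the listed parameters are still `INFLIGHT`, as in the dump above, or already `AVAILABLE`, as in the earlier comment.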

I have the same issue when I use ZeRO stage 3 with the latest DeepSpeed version. It occurs after the first evaluation step. How can I fix it?

May 14 '24 01:05 daehuikim

Try the latest driver and CUDA versions.

May 14 '24 07:05 jiahuanluo

Currently facing the same issue with ZeRO stage 3 at the first validation step in PyTorch Lightning.

May 27 '24 15:05 cyrildiagne

Facing the same issue:

  File "/root/miniconda3/envs/open/lib/python3.11/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 351, in _end_of_forward_hook
    self.get_param_coordinator(training=False).reset_step()
  File "/root/miniconda3/envs/open/lib/python3.11/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 204, in reset_step
    raise RuntimeError(f"still have inflight params "
RuntimeError: still have inflight params [{'id': 723, 'status': 'AVAILABLE', 'numel': 4194304, 'ds_numel': 4194304, 'shape': (512, 8192), 'ds_shape': (512, 8192)

Jun 03 '24 07:06 karthik-nexusflow
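
For readers trying to interpret the tracebacks in this thread, here is a toy model of the kind of check that raises the error. It is a simplified sketch and not DeepSpeed's implementation: `Status`, `ToyParam` and `ToyCoordinator` are hypothetical names, and only the `reset_step` method name, the status names and the error text are taken from the tracebacks above. The assumed idea is that a parameter fetch is started asynchronously, and the step may only be reset once every started fetch has been waited on.

```python
from enum import Enum


class Status(Enum):
    # Status names as they appear in the ds_summary dumps in this thread.
    NOT_AVAILABLE = 1
    INFLIGHT = 2
    AVAILABLE = 3


class ToyParam:
    def __init__(self, pid: int):
        self.pid = pid
        self.status = Status.NOT_AVAILABLE


class ToyCoordinator:
    """Hypothetical stand-in for a parameter coordinator (assumption)."""

    def __init__(self):
        self._inflight = {}  # pid -> ToyParam whose fetch was started

    def fetch(self, param: ToyParam) -> None:
        # Start a pretend asynchronous gather of the parameter.
        param.status = Status.INFLIGHT
        self._inflight[param.pid] = param

    def wait(self, param: ToyParam) -> None:
        # Finish the gather; the parameter is now usable.
        param.status = Status.AVAILABLE
        self._inflight.pop(param.pid, None)

    def reset_step(self) -> None:
        # Refuse to end the step while any fetch is still outstanding.
        if self._inflight:
            raise RuntimeError(
                f"still have inflight params {sorted(self._inflight)}"
            )


if __name__ == "__main__":
    coordinator = ToyCoordinator()
    first, second = ToyParam(1), ToyParam(2)
    coordinator.fetch(first)
    coordinator.fetch(second)
    coordinator.wait(first)  # second is fetched but never consumed
    try:
        coordinator.reset_step()
    except RuntimeError as err:
        print(err)
```

In this toy, the error appears whenever a parameter is fetched but the module that needs it never runs before the step ends. Whether that matches any particular report here is not established by the thread; note also that some dumps above list parameters as `AVAILABLE`, which this simplified model does not reproduce.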

[BUG] RuntimeError: still have inflight params [<bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter containing:

Setting ds_accelerator to cuda (auto detect)

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja ninja .................. [OKAY]

op name ................ installed .. compatible