[BUG] In ZeRO-3 mode, how to set `requires_grad` on part of an `nn.Linear` weight (so that some parameters are updated and others are not).
**Describe the bug**
Under ZeRO-3, all parameters are partitioned, so the weight of `nn.Linear` has shape 0 on each rank, and the gradient cannot be enabled for only some dimensions (rows) of the weight.
**To Reproduce**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL")
# Intent: enable gradient updates for only the last row of lm_head.weight
model.lm_head.weight.data[-1, :].requires_grad_(True)
```

Run with:

```bash
torchrun --nproc_per_node 8 code.py --deepspeed_config ds_config_zero3.json
```
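The `ds_config_zero3.json` file is not included in the report; a minimal stage-3 config along the following lines would be enough to trigger parameter partitioning (an assumed example, not the reporter's actual file):

```json
{
  "train_batch_size": 8,
  "zero_optimization": {
    "stage": 3
  }
}
```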
**Expected behavior**
The code runs successfully in DeepSpeed ZeRO-3 mode.
I tried https://github.com/microsoft/DeepSpeed/blob/master/docs/code-docs/source/zero3.rst and found that the `ds_tensor` of the weight has only one dimension. I don't know how to index into the original dimensions in this case.
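For context, here is a minimal sketch of what a partitioned parameter looks like under ZeRO-3 (run under a distributed launcher such as torchrun; the `ds_*` attributes are what `partition_parameters` currently attaches, but they are internal and may change between DeepSpeed versions):

```python
import torch
import deepspeed

# Sketch: construct a layer under zero.Init so ZeRO-3 partitions it immediately.
# Run under a launcher, e.g. `torchrun --nproc_per_node 1 inspect.py`.
with deepspeed.zero.Init():
    layer = torch.nn.Linear(4, 8, bias=False)

w = layer.weight
print(w.shape)            # torch.Size([0]) -- the local view is empty
print(w.ds_shape)         # the original 2-D shape, torch.Size([8, 4])
print(w.ds_tensor.shape)  # 1-D flat slice of the parameter owned by this rank
```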
Maybe you can refer to `partition_parameters`.
@mklf Many thanks for your reply. I tried it and found that the code below can't set the gradient as I want: `xxx.weight` still does not require grad, let alone enabling it for only some dimensions.
```python
import deepspeed

with deepspeed.zero.GatheredParameters(xxx.weight, modifier_rank=0):
    if deepspeed.comm.get_rank() == 0:
        xxx.weight.requires_grad_(True)
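```

(A simpler variant to try, sketched here, would be to flip the flag on every rank without gathering, since `requires_grad` is an attribute of the parameter object itself rather than of the partitioned data; the ZeRO-3 example further down in this thread does essentially this.)

```python
# Sketch: toggle requires_grad directly on the partitioned parameter, on every rank.
xxx.weight.requires_grad_(True)
print(deepspeed.comm.get_rank(), xxx.weight.requires_grad)
```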
I have the same problem; any progress on your end?
@zhongshsh, @Littleor apologies for the delayed response on this.
I am confused about the reported problem because I don't believe setting `requires_grad` on a `.data` slice is supported by torch: a parameter/tensor either requires a gradient or it does not. Perhaps you can explain a bit more. Also, it would be helpful to show the expected behavior using an example without ZeRO-3.
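Note that the usual PyTorch-level workaround for updating only some rows of a weight is to keep `requires_grad=True` on the whole parameter and mask its gradient, e.g. with a tensor hook. A minimal sketch without DeepSpeed (the layer size and row index are placeholders):

```python
import torch
import torch.nn as nn

lm_head = nn.Linear(16, 32, bias=False)  # placeholder layer

# Train only the last row of lm_head.weight by zeroing the gradient
# of every other row after backward.
trainable_rows = torch.zeros(lm_head.weight.shape[0], 1)
trainable_rows[-1] = 1.0

def mask_grad(grad):
    # Receives the full gradient of lm_head.weight; rows with mask 0 stay frozen.
    return grad * trainable_rows

lm_head.weight.register_hook(mask_grad)

x = torch.randn(4, 16)
loss = lm_head(x).sum()
loss.backward()
print(lm_head.weight.grad.abs().sum(dim=1))  # only the last entry is non-zero
```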
I have written out the logic to toggle parameter `requires_grad` for both ZeRO-3 and non-DeepSpeed scenarios; the code and outputs are below. Please share your observations. Thanks!
**Without DeepSpeed or ZeRO-3**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "microsoft/Phi-3.5-mini-instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)

print(f'initial - {model.lm_head.weight.requires_grad=}')
model.lm_head.weight.requires_grad_(False)
print(f'after set to False - {model.lm_head.weight.requires_grad=}')
model.lm_head.weight.requires_grad_(True)
print(f'after set to True - {model.lm_head.weight.requires_grad=}')
```
```
$ torchrun --nproc_per_node 1 test_set_grad.py
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.18it/s]
initial - model.lm_head.weight.requires_grad=True
after set to False - model.lm_head.weight.requires_grad=False
after set to True - model.lm_head.weight.requires_grad=True
```
**With ZeRO-3**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.deepspeed import HfDeepSpeedConfig

ds_config = {"train_batch_size": 1, "zero_optimization": {"stage": 3}}
hfds = HfDeepSpeedConfig(ds_config)

model_id = "microsoft/Phi-3.5-mini-instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)

print(f'initial - {model.lm_head.weight.requires_grad=}')
model.lm_head.weight.requires_grad_(False)
print(f'after set to False - {model.lm_head.weight.requires_grad=}')
model.lm_head.weight.requires_grad_(True)
print(f'after set to True - {model.lm_head.weight.requires_grad=}')
```
```
$ torchrun --nproc_per_node 1 test_set_grad_z3.py
/home/deepspeed/.local/lib/python3.10/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
[2024-09-18 14:43:06,230] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-18 14:43:06,617] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-18 14:43:07,439] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
[2024-09-18 14:43:07,440] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-09-18 14:43:07,440] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-09-18 14:43:08,390] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 195, num_elems = 3.82B
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00, 1.94s/it]
initial - model.lm_head.weight.requires_grad=True
after set to False - model.lm_head.weight.requires_grad=False
after set to True - model.lm_head.weight.requires_grad=True
```