
NotImplementedError: Cannot copy out of meta tensor; no data!

Open aresa7796 opened this issue 2 years ago • 19 comments

Hello, I used QLoRA to train, but I get an error: NotImplementedError: Cannot copy out of meta tensor; no data!

requirements.txt:

peft @ file:///root/peft
torch==1.13.1+cu116
torchaudio==0.13.1+cu116
torchvision==0.14.1+cu116
transformers==4.28.1
deepspeed==0.9.4
flash-attn==0.2.0

This is my training command:

CUDA_VISIBLE_DEVICES=0 deepspeed fastchat/train/train_lora.py \
    --model_name_or_path ../vicuna-7b  \
    --lora_r 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --data_path ./data/dummy_conversation.json  \
    --bf16 True \
    --output_dir ./checkpoints \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 100 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --q_lora True \
    --deepspeed playground/deepspeed_config_s2.json 

Can you help me, please? @merrymercy

aresa7796 avatar Jun 20 '23 05:06 aresa7796

error log:

(vicuna) root@1dc5cd5794cc:~/FastChat# CUDA_VISIBLE_DEVICES=0 deepspeed fastchat/train/train_lora.py \
>     --model_name_or_path ../vicuna-7b  \
>     --lora_r 8 \
>     --lora_alpha 16 \
>     --lora_dropout 0.05 \
>     --data_path ./data/dummy_conversation.json  \
>     --bf16 True \
>     --output_dir ./checkpoints \
>     --num_train_epochs 3 \
>     --per_device_train_batch_size 4 \
>     --per_device_eval_batch_size 4 \
>     --gradient_accumulation_steps 1 \
>     --evaluation_strategy "no" \
>     --save_strategy "steps" \
>     --save_steps 1200 \
>     --save_total_limit 100 \
>     --learning_rate 2e-5 \
>     --weight_decay 0. \
>     --warmup_ratio 0.03 \
>     --lr_scheduler_type "cosine" \
>     --logging_steps 1 \
>     --tf32 True \
>     --model_max_length 2048 \
>     --q_lora True \
>     --deepspeed playground/deepspeed_config_s2.json
[2023-06-20 11:57:19,339] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-06-20 11:57:21,837] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=0: setting --include=localhost:0
[2023-06-20 11:57:21,850] [INFO] [runner.py:555:main] cmd = /root/miniconda3/envs/vicuna/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None fastchat/train/train_lora.py --model_name_or_path ../vicuna-7b --lora_r 8 --lora_alpha 16 --lora_dropout 0.05 --data_path ./data/dummy_conversation.json --bf16 True --output_dir ./checkpoints --num_train_epochs 3 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy steps --save_steps 1200 --save_total_limit 100 --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --tf32 True --model_max_length 2048 --q_lora True --deepspeed playground/deepspeed_config_s2.json
[2023-06-20 11:57:23,357] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-06-20 11:57:25,431] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0]}
[2023-06-20 11:57:25,432] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-06-20 11:57:25,432] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-06-20 11:57:25,432] [INFO] [launch.py:163:main] dist_world_size=1
[2023-06-20 11:57:25,432] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-06-20 11:57:26,936] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-06-20 11:57:29,050] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-06-20 11:57:29,051] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-06-20 11:57:29,051] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████| 2/2 [00:52<00:00, 26.18s/it]
trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /root/FastChat/fastchat/train/train_lora.py:201 in <module>                                      │
│                                                                                                  │
│   198                                                                                            │
│   199                                                                                            │
│   200 if __name__ == "__main__":                                                                 │
│ ❱ 201 │   train()                                                                                │
│   202                                                                                            │
│                                                                                                  │
│ /root/FastChat/fastchat/train/train_lora.py:177 in train                                         │
│                                                                                                  │
│   174 │   if list(pathlib.Path(training_args.output_dir).glob("checkpoint-*")):                  │
│   175 │   │   trainer.train(resume_from_checkpoint=True)                                         │
│   176 │   else:                                                                                  │
│ ❱ 177 │   │   trainer.train()                                                                    │
│   178 │   trainer.save_state()                                                                   │
│   179 │                                                                                          │
│   180 │   # check if zero3 mode enabled                                                          │
│                                                                                                  │
│ /root/miniconda3/envs/vicuna/lib/python3.9/site-packages/transformers/trainer.py:1662 in train   │
│                                                                                                  │
│   1659 │   │   inner_training_loop = find_executable_batch_size(                                 │
│   1660 │   │   │   self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size  │
│   1661 │   │   )                                                                                 │
│ ❱ 1662 │   │   return inner_training_loop(                                                       │
│   1663 │   │   │   args=args,                                                                    │
│   1664 │   │   │   resume_from_checkpoint=resume_from_checkpoint,                                │
│   1665 │   │   │   trial=trial,                                                                  │
│                                                                                                  │
│ /root/miniconda3/envs/vicuna/lib/python3.9/site-packages/transformers/trainer.py:1731 in         │
│ _inner_training_loop                                                                             │
│                                                                                                  │
│   1728 │   │   │   or self.fsdp is not None                                                      │
│   1729 │   │   )                                                                                 │
│   1730 │   │   if args.deepspeed:                                                                │
│ ❱ 1731 │   │   │   deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(                   │
│   1732 │   │   │   │   self, num_training_steps=max_steps, resume_from_checkpoint=resume_from_c  │
│   1733 │   │   │   )                                                                             │
│   1734 │   │   │   self.model = deepspeed_engine.module                                          │
│                                                                                                  │
│ /root/miniconda3/envs/vicuna/lib/python3.9/site-packages/transformers/deepspeed.py:378 in        │
│ deepspeed_init                                                                                   │
│                                                                                                  │
│   375 │   │   "lr_scheduler": lr_scheduler,                                                      │
│   376 │   }                                                                                      │
│   377 │                                                                                          │
│ ❱ 378 │   deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)          │
│   379 │                                                                                          │
│   380 │   if resume_from_checkpoint is not None:                                                 │
│   381 │   │   # it's possible that the user is trying to resume from model_path, which doesn't   │
│                                                                                                  │
│ /root/miniconda3/envs/vicuna/lib/python3.9/site-packages/deepspeed/__init__.py:165 in initialize │
│                                                                                                  │
│   162 │   │   │   │   │   │   │   │   │   │      config=config,                                  │
│   163 │   │   │   │   │   │   │   │   │   │      config_class=config_class)                      │
│   164 │   │   else:                                                                              │
│ ❱ 165 │   │   │   engine = DeepSpeedEngine(args=args,                                            │
│   166 │   │   │   │   │   │   │   │   │    model=model,                                          │
│   167 │   │   │   │   │   │   │   │   │    optimizer=optimizer,                                  │
│   168 │   │   │   │   │   │   │   │   │    model_parameters=model_parameters,                    │
│                                                                                                  │
│ /root/miniconda3/envs/vicuna/lib/python3.9/site-packages/deepspeed/runtime/engine.py:267 in      │
│ __init__                                                                                         │
│                                                                                                  │
│    264 │   │   self.pipeline_parallelism = isinstance(model, PipelineModule)                     │
│    265 │   │                                                                                     │
│    266 │   │   # Configure distributed model                                                     │
│ ❱  267 │   │   self._configure_distributed_model(model)                                          │
│    268 │   │                                                                                     │
│    269 │   │   self._get_model_parameters()                                                      │
│    270                                                                                           │
│                                                                                                  │
│ /root/miniconda3/envs/vicuna/lib/python3.9/site-packages/deepspeed/runtime/engine.py:1049 in     │
│ _configure_distributed_model                                                                     │
│                                                                                                  │
│   1046 │   │                                                                                     │
│   1047 │   │   # zero.Init() handles device placement of model                                   │
│   1048 │   │   if not (self.dont_change_device or is_zero3_model):                               │
│ ❱ 1049 │   │   │   self.module.to(self.device)                                                   │
│   1050 │   │                                                                                     │
│   1051 │   │   # MoE related initialization                                                      │
│   1052 │   │   for _, module in self.module.named_modules():                                     │
│                                                                                                  │
│ /root/miniconda3/envs/vicuna/lib/python3.9/site-packages/torch/nn/modules/module.py:989 in to    │
│                                                                                                  │
│    986 │   │   │   │   │   │   │   non_blocking, memory_format=convert_to_format)                │
│    987 │   │   │   return t.to(device, dtype if t.is_floating_point() or t.is_complex() else No  │
│    988 │   │                                                                                     │
│ ❱  989 │   │   return self._apply(convert)                                                       │
│    990 │                                                                                         │
│    991 │   def register_backward_hook(                                                           │
│    992 │   │   self, hook: Callable[['Module', _grad_t, _grad_t], Union[None, Tensor]]           │
│                                                                                                  │
│ /root/miniconda3/envs/vicuna/lib/python3.9/site-packages/torch/nn/modules/module.py:641 in       │
│ _apply                                                                                           │
│                                                                                                  │
│    638 │                                                                                         │
│    639 │   def _apply(self, fn):                                                                 │
│    640 │   │   for module in self.children():                                                    │
│ ❱  641 │   │   │   module._apply(fn)                                                             │
│    642 │   │                                                                                     │
│    643 │   │   def compute_should_use_set_data(tensor, tensor_applied):                          │
│    644 │   │   │   if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):           │
│                                                                                                  │
│ ... (the torch/nn/modules/module.py:641 _apply frame above repeats for each nested submodule) ... │
│                                                                                                  │
│ /root/miniconda3/envs/vicuna/lib/python3.9/site-packages/torch/nn/modules/module.py:664 in       │
│ _apply                                                                                           │
│                                                                                                  │
│    661 │   │   │   # track autograd history of `param_applied`, so we have to use                │
│    662 │   │   │   # `with torch.no_grad():`                                                     │
│    663 │   │   │   with torch.no_grad():                                                         │
│ ❱  664 │   │   │   │   param_applied = fn(param)                                                 │
│    665 │   │   │   should_use_set_data = compute_should_use_set_data(param, param_applied)       │
│    666 │   │   │   if should_use_set_data:                                                       │
│    667 │   │   │   │   param.data = param_applied                                                │
│                                                                                                  │
│ /root/miniconda3/envs/vicuna/lib/python3.9/site-packages/torch/nn/modules/module.py:987 in       │
│ convert                                                                                          │
│                                                                                                  │
│    984 │   │   │   if convert_to_format is not None and t.dim() in (4, 5):                       │
│    985 │   │   │   │   return t.to(device, dtype if t.is_floating_point() or t.is_complex() els  │
│    986 │   │   │   │   │   │   │   non_blocking, memory_format=convert_to_format)                │
│ ❱  987 │   │   │   return t.to(device, dtype if t.is_floating_point() or t.is_complex() else No  │
│    988 │   │                                                                                     │
│    989 │   │   return self._apply(convert)                                                       │
│    990                                                                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
NotImplementedError: Cannot copy out of meta tensor; no data!
[2023-06-20 11:58:47,819] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 44380
[2023-06-20 11:58:47,835] [ERROR] [launch.py:321:sigkill_handler] ['/root/miniconda3/envs/vicuna/bin/python', '-u', 'fastchat/train/train_lora.py', '--local_rank=0', '--model_name_or_path', '../vicuna-7b', '--lora_r', '8', '--lora_alpha', '16', '--lora_dropout', '0.05', '--data_path', './data/dummy_conversation.json', '--bf16', 'True', '--output_dir', './checkpoints', '--num_train_epochs', '3', '--per_device_train_batch_size', '4', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '1', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '1200', '--save_total_limit', '100', '--learning_rate', '2e-5', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '2048', '--q_lora', 'True', '--deepspeed', 'playground/deepspeed_config_s2.json'] exits with return code = 1

aresa7796 avatar Jun 20 '23 10:06 aresa7796

I have the same trouble. peft v0.4.0.dev0

ycat3 avatar Jun 20 '23 13:06 ycat3

@ycat3 What version of deepspeed are you using?

aresa7796 avatar Jun 20 '23 13:06 aresa7796

Version: 0.9.2

ycat3 avatar Jun 21 '23 00:06 ycat3

Facing the same issue! deepspeed 0.9.5, transformers 4.29.2, peft 0.4.0.dev0, CUDA Version: 11.4

limbo92 avatar Jun 29 '23 10:06 limbo92

cc @BabyChouSr

merrymercy avatar Jul 01 '23 14:07 merrymercy

I got this error when I downgraded the transformers version from 4.30.2 to 4.29.2 to fix another issue (Found optimizer configured in the DeepSpeed config, but no scheduler). When I reverted transformers to 4.30.2 and used other methods to solve that issue, this one was also fixed.

limbo92 avatar Jul 03 '23 05:07 limbo92

How did you solve the other issue with the scheduler?

mmkamani7 avatar Jul 05 '23 23:07 mmkamani7

@limbo92 how did you solve the issue Found optimizer configured in the DeepSpeed config, but no scheduler with transformers version 4.30.2? I ran into this problem as well. Can you give us a clue?

kanslor avatar Jul 06 '23 07:07 kanslor

Adding the scheduler properties to the DeepSpeed config file should work. For instance, in my case I added this entry and it seems to be working: "scheduler":{"lr_scheduler_type":"cosine"}
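
For illustration, a minimal sketch of writing such a config from Python. The file name and every key except the "scheduler" entry are placeholders (not FastChat's actual deepspeed_config_s2.json), and whether this particular scheduler key is honored depends on your transformers/DeepSpeed versions:

import json

# Hypothetical ZeRO-2 style config; only the "scheduler" entry reflects the
# suggestion above, everything else is a placeholder to make the sketch complete.
ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {"stage": 2},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "scheduler": {"lr_scheduler_type": "cosine"},
}

# Write the config so it can be passed via --deepspeed my_deepspeed_config.json
with open("my_deepspeed_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)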

mmkamani7 avatar Jul 06 '23 07:07 mmkamani7

@mmkamani7 thanks a lot, I'll try it.

kanslor avatar Jul 06 '23 07:07 kanslor

@mmkamani7 hello, I tried it with the "scheduler":{"lr_scheduler_type":"cosine"} config. The Found optimizer configured in the DeepSpeed config, but no scheduler error disappeared, but now I get CUDA out of memory. I run this LoRA training on an RTX 3090 with 24GB, torch==1.13.1+cu117, and the training command is the same as aresa7796's. Do you have any idea about this?

kanslor avatar Jul 06 '23 08:07 kanslor

@kanslor I added this to the DeepSpeed config file:

"scheduler": {
    "type": "WarmupLR",
    "params": {
        "warmup_min_lr": "auto",
        "warmup_max_lr": "auto",
        "warmup_num_steps": "auto"
    }
}

I also faced the OOM error, so I gave up on the official fine-tuning method. Instead I used this git repo https://github.com/hiyouga/LLaMA-Efficient-Tuning/tree/main#ppo-training-rlhf to successfully fine-tune Vicuna-7B. This is my LoRA weight; it only needs about 16GB of GPU memory to fine-tune.

This is my command line for fine-tuning with the new git repo:

CUDA_VISIBLE_DEVICES=0 python src/train_sft.py \
    --model_name_or_path your_model_path \
    --do_train \
    --dev_ratio 0.1 \
    --dataset dummy_conversation_for_LETM \
    --finetuning_type lora \
    --output_dir output_path_you_want \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16

Web Demo

python src/cli_demo.py \
    --model_name_or_path vicuna_model_path \
    --checkpoint_dir fine_tuned_lora_path

python src/web_demo.py \
    --model_name_or_path vicuna_model_path \
    --checkpoint_dir fine_tuned_lora_path

Combine Vicuna7B and LoRA weight / Export model

python src/export_model.py \
    --model_name_or_path vicuna_model_path \
    --checkpoint_dir fine_tuned_lora_path \
    --output_dir combine_model_path

limbo92 avatar Jul 07 '23 07:07 limbo92

@limbo92 your suggestion worked for me as well. I am not sure what is wrong with FastChat that gives OOM to everyone!

mmkamani7 avatar Jul 07 '23 20:07 mmkamani7

To fix the "found optimizer configured but no scheduler" issue, simply remove the optimizer from the DeepSpeed config. This was a change introduced in the new transformers version (4.30.2). Check out this issue for more information regarding supported combinations.

I believe the old transformers version (4.28.1) I was using also had the OOM issues that you are seeing. After installing 4.30.2, this was resolved, with LLaMA-7B taking about 5GB of VRAM and fine-tuning with LoRA taking about 10GB of VRAM.
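
As a rough illustration of that setup (not the exact FastChat config; the dict below is a trimmed placeholder), a ZeRO-2 config with no "optimizer" or "scheduler" blocks can be passed to the HF Trainer, which then builds both from the usual command-line arguments:

import transformers

# Placeholder ZeRO-2 config with no "optimizer"/"scheduler" sections, so
# transformers 4.30.x creates them from learning_rate / lr_scheduler_type.
ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {"stage": 2},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = transformers.TrainingArguments(
    output_dir="./checkpoints",
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    deepspeed=ds_config,  # a dict is accepted as well as a path to a JSON file
)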

BabyChouSr avatar Jul 08 '23 17:07 BabyChouSr

@BabyChouSr I am still getting GPU OOM, even after removing both the optimizer and scheduler configs. The transformers version is 4.30.2 and peft is 0.4.0.dev0. Could you please share your deepspeed config file as well as the training script?

mmkamani7 avatar Jul 11 '23 22:07 mmkamani7

My deepspeed config and training script are the same as listed here

BabyChouSr avatar Jul 11 '23 23:07 BabyChouSr

Found this answer on Stack Overflow; not sure if it helps, but the problem seems to be with the accelerate library and its auto-offloading functionality. Check the link if you want to know more:

https://stackoverflow.com/questions/77547377/notimplementederror-cannot-copy-out-of-meta-tensor-no-data

elmondhir avatar Jan 31 '24 16:01 elmondhir

Setting the device explicitly solved my issues too!

import torch

# Pick a concrete device instead of letting the library decide placement.
def get_device_map() -> str:
    return 'cuda' if torch.cuda.is_available() else 'cpu'

device = get_device_map()
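
For completeness, a hedged sketch of how that device could then be used when loading the model, instead of relying on automatic ("auto") placement; the model path and dtype here are placeholders, not the issue author's exact settings:

import torch
from transformers import AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"  # same idea as get_device_map() above

# Load the weights directly onto one device rather than letting accelerate
# offload them, which is what can leave parameters on the meta device.
model = AutoModelForCausalLM.from_pretrained(
    "../vicuna-7b",            # placeholder path
    torch_dtype=torch.float16,
    device_map={"": device},   # explicit placement; alternatively load normally and call model.to(device)
)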

My3VM avatar Apr 07 '24 03:04 My3VM