RuntimeError: 'weight' must be 2-D
Describe the bug
When I run the text_to_image example (train_text_to_image.py), I get the error shown in the logs below. I'm fairly sure I configured and ran everything as the README.md requires.
Reproduction
https://github.com/huggingface/diffusers/tree/main/examples/text_to_image/train_text_to_image.py
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export dataset_name="lambdalabs/pokemon-blip-captions"

accelerate launch train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$dataset_name \
  --use_ema \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --mixed_precision="fp16" \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="sd-pokemon-model"
Logs
Traceback (most recent call last):
File "train_text_to_image.py", line 630, in <module>
main()
File "train_text_to_image.py", line 569, in main
print(text_encoder(batch["input_ids"]))
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/models/clip/modeling_clip.py", line 733, in forward
return self.text_model(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/models/clip/modeling_clip.py", line 636, in forward
hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/models/clip/modeling_clip.py", line 165, in forward
inputs_embeds = self.token_embedding(input_ids)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/sparse.py", line 158, in forward
return F.embedding(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 2199, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: 'weight' must be 2-D
System Info
diffusers==0.5.1
torch==1.12.0+cu113
accelerate==0.13.2
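For reference, the message in the logs comes from torch.embedding, which requires the embedding weight to be a 2-D (num_embeddings x embedding_dim) matrix. A minimal standalone sketch (hypothetical shapes, just to illustrate the failure mode) that reproduces the same error:

import torch
import torch.nn.functional as F

ids = torch.tensor([1, 2, 3])
weight = torch.randn(10, 4)            # 2-D weight: (num_embeddings, embedding_dim)
print(F.embedding(ids, weight).shape)  # works: torch.Size([3, 4])

flat = weight.flatten()                # 1-D, like an uninitialized/partitioned placeholder
F.embedding(ids, flat)                 # RuntimeError: 'weight' must be 2-D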
cc @patil-suraj, can you take a look here?
Thanks a lot for reporting, will try this and let you know.
I just tried your command and it works fine for me; I couldn't reproduce it. Could you maybe try again and let us know if the issue persists?
I also created a new container to build the environment from scratch, downloaded the code, and installed the dependencies, but I still encountered this error in the end: RuntimeError: 'weight' must be 2-D
accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU [4] MPS): 2
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Do you want to use DeepSpeed? [yes/NO]: yes
Do you want to specify a json file to a DeepSpeed config? [yes/NO]: no
What should be your DeepSpeed's ZeRO optimization stage (0, 1, 2, 3)? [2]: 3
Where to offload optimizer states? [none/cpu/nvme]: cpu
Where to offload parameters? [none/cpu/nvme]: cpu
How many gradient accumulation steps you're passing in your script? [1]: 4
Do you want to use gradient clipping? [yes/NO]: yes
What is the gradient clipping value? [1.0]: 1
Do you want to save 16-bit model weights when using ZeRO Stage-3? [yes/NO]: yes
Do you want to enable deepspeed.zero.Init when using ZeRO Stage-3 for constructing massive models? [yes/NO]: ye4s
Please enter yes or no.
Do you want to enable deepspeed.zero.Init when using ZeRO Stage-3 for constructing massive models? [yes/NO]: yes
How many GPU(s) should be used for distributed training? [1]:2
Do you wish to use FP16 or BF16 (mixed precision)? [NO/fp16/bf16]: fp16
Ah, you are using DeepSpeed; I tried without DeepSpeed. Please make sure to post the detailed command when opening issues so we can reproduce it quickly :)
I haven't tried this with ZeRO stage 3, but it should work with stage 2. Stage 3 is not really required for Stable Diffusion: it's not a huge model, so it does not need the parameter partitioning that stage 3 offers.
Also note that --train_text_encoder is not supported with DeepSpeed for now.
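Under ZeRO stage 3, DeepSpeed partitions the model parameters across ranks and leaves flattened placeholders behind, which would explain the non-2-D weight the text encoder's token embedding sees here. A rough sketch of the accelerate config answers for stage 2 instead (all other answers as in your transcript; exact prompts may differ slightly between accelerate versions):

Do you want to use DeepSpeed? [yes/NO]: yes
What should be your DeepSpeed's ZeRO optimization stage (0, 1, 2, 3)? [2]: 2
Where to offload optimizer states? [none/cpu/nvme]: cpu

With stage 2 there is no parameter partitioning (and hence no parameter-offload prompt), so the frozen text encoder keeps its full 2-D embedding matrix.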
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.