diffusers

accelerate + FSDP + T2I train saving ckpt error

Open Forainest opened this issue 2 years ago • 8 comments

Describe the bug

I used /examples/text_to_image/train_text_to_image_sdxl.py to fine-tune SDXL with accelerate 0.25.0 + FSDP. When saving a checkpoint, the run gets stuck and never writes a complete checkpoint. I also tried DeepSpeed, and it gets stuck in the same way. I didn't change any code in train_text_to_image_sdxl.py.

Reproduction

accelerate config is

compute_environment: LOCAL_MACHINE
debug: true
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
    fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
    fsdp_backward_prefetch_policy: BACKWARD_PRE
    fsdp_cpu_ram_efficient_loading: true
    fsdp_forward_prefetch: true
    fsdp_offload_params: true
    fsdp_sharding_strategy: 1
    fsdp_state_dict_type: FULL_STATE_DICT
    fsdp_sync_module_state: true
    fsdp_transformer_layer_cls_to_wrap: UNet2DConditionModel, DownBlock2D, CrossAttnDownBlock2D, UpBlock2D, CrossAttnUpBlock2D
    fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false 

Code: /examples/text_to_image/train_text_to_image_sdxl.py (unmodified). Pretrained model and dataset: exactly as described in the README.

Logs

No response

System Info

Linux localhost.localdomain 4.14.0-115.el7a.0.1.aarch64

Who can help?

@yiyixuxu @sayakpaul

Forainest avatar Jan 25 '24 01:01 Forainest

When using DeepSpeed, if I delete the accelerator.is_main_process check, the checkpoint saves normally.

Forainest avatar Jan 25 '24 01:01 Forainest

For DeepSpeed, you need to follow what's done in: https://github.com/huggingface/diffusers/pull/6628. Could you try that and see if it works?
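A minimal sketch of the kind of save guard being discussed, based on the observation above that removing the main-process check lets DeepSpeed save. The helper name should_call_save and the choice to treat FSDP like DeepSpeed are assumptions for illustration, not code taken from the linked PR or from the training script; whether this resolves the FSDP hang is not confirmed in this thread.

```python
def should_call_save(distributed_type: str, is_main_process: bool) -> bool:
    """Decide whether this rank should enter the checkpoint-saving call.

    Hypothetical helper: with sharded backends every rank holds part of the
    state, so every rank has to take part in the save. Otherwise only the
    main process writes the checkpoint.
    """
    # Assumption: both backends shard state across ranks.
    sharded_backends = {"DEEPSPEED", "FSDP"}
    return distributed_type in sharded_backends or is_main_process


# Intended use inside a training loop (not executed here, needs accelerate):
#   if should_call_save(accelerator.distributed_type.value,
#                       accelerator.is_main_process):
#       accelerator.save_state(save_path)
print(should_call_save("FSDP", is_main_process=False))
print(should_call_save("MULTI_GPU", is_main_process=False))
```

With plain multi-GPU training the guard still restricts saving to the main process, so the behavior changes only for the sharded backends.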

sayakpaul avatar Jan 25 '24 03:01 sayakpaul

Thanks for the DeepSpeed pointer! How about FSDP? Is there any way to save a checkpoint successfully?

Forainest avatar Jan 25 '24 11:01 Forainest

I am no FSDP expert. Can you post the error trace?

sayakpaul avatar Jan 25 '24 11:01 sayakpaul

There is no error trace. It simply hangs when using accelerate + FSDP to save a checkpoint or when training finishes, like this issue: https://github.com/huggingface/diffusers/issues/2816
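One plausible explanation for a silent hang (an assumption, not something confirmed in this thread) is that gathering a full state dict from sharded parameters is a collective operation, so it blocks when only the main process enters it. The toy simulation below uses a threading barrier as a stand-in for the collective; simulate_collective_save is a hypothetical name, and nothing here calls FSDP or accelerate.

```python
import threading


def simulate_collective_save(world_size: int, ranks_that_call: set,
                             timeout: float = 0.5) -> dict:
    """Model a collective as a barrier that completes only if all ranks join."""
    barrier = threading.Barrier(world_size)
    results = {}

    def worker(rank: int) -> None:
        if rank not in ranks_that_call:
            results[rank] = "skipped"  # e.g. filtered out by a main-process check
            return
        try:
            barrier.wait(timeout=timeout)  # stand-in for the gather
            results[rank] = "saved"
        except threading.BrokenBarrierError:
            results[rank] = "hung"  # a real job would wait here with no error

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(simulate_collective_save(4, {0}))           # only rank 0 enters
print(simulate_collective_save(4, {0, 1, 2, 3}))  # every rank enters
```

In the simulation the timeout turns the stall into a visible result; a real distributed job has no such timeout by default at this point, which would match a hang with no error trace.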

Forainest avatar Jan 25 '24 11:01 Forainest

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Feb 24 '24 15:02 github-actions[bot]

@Forainest When using DeepSpeed, you can install apex, which DeepSpeed will then use automatically. That works for me.

git clone https://github.com/NVIDIA/apex.git
cd apex
git checkout 741bdf50825a97664db08574981962d66436d16a
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./ --global-option="--cuda_ext" --global-option="--cpp_ext"
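After the build finishes, a quick check like the following can confirm that the package is importable from the same Python environment used for training. This is only a sketch: is_installed is a hypothetical helper, and the check says nothing about whether DeepSpeed actually picks apex up.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


print("apex available:", is_installed("apex"))
```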

AoqunJin avatar Apr 07 '24 16:04 AoqunJin
