
ValueError, subprocess.CalledProcessError, 训练失败

Open MushroomLyn opened this issue 1 year ago • 2 comments

I'm running on the recommended PAI-DSW. I already edited line 1033 of ./facechain/train_text_to_image_lora.py, adding the gradient step after `train_loss += avg_loss.item() / args.gradient_accumulation_steps`, but it still fails. Opening the link works and uploading images works, but training fails with the error below:

2024-03-19 16:35:16,756 - modelscope - INFO - Use user-specified model revision: v2.0
{'clip_sample_range', 'rescale_betas_zero_snr', 'variance_type', 'thresholding', 'dynamic_thresholding_ratio', 'sample_max_value'} was not found in config. Values will be initialized to default values.
/opt/conda/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
{'force_upcast'} was not found in config. Values will be initialized to default values.
{'dropout', 'attention_type', 'reverse_transformer_layers_per_block'} was not found in config. Values will be initialized to default values.
Generating train split: 3 examples [00:00, 670.37 examples/s]
03/19/2024 16:35:22 - INFO - __main__ - ***** Running training *****
03/19/2024 16:35:22 - INFO - __main__ -   Num examples = 3
03/19/2024 16:35:22 - INFO - __main__ -   Num Epochs = 200
03/19/2024 16:35:22 - INFO - __main__ -   Instantaneous batch size per device = 1
03/19/2024 16:35:22 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
03/19/2024 16:35:22 - INFO - __main__ -   Gradient Accumulation steps = 1
03/19/2024 16:35:22 - INFO - __main__ -   Total optimization steps = 600
2024-03-19 16:35:22,393 - modelscope - INFO - Use user-specified model revision: v1.0.0
Resuming from checkpoint /mnt/workspace/.cache/modelscope/damo/face_frombase_c4/face_frombase_c4.bin
Steps:   0%|                                                                                                           | 0/600 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/mnt/workspace/facechain/facechain/train_text_to_image_lora.py", line 1225, in <module>
    main()
  File "/mnt/workspace/facechain/facechain/train_text_to_image_lora.py", line 1028, in main
    model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/diffusers/models/unets/unet_2d_condition.py", line 1121, in forward
    sample, res_samples = downsample_block(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/diffusers/models/unets/unet_2d_blocks.py", line 1199, in forward
    hidden_states = attn(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/diffusers/models/transformers/transformer_2d.py", line 391, in forward
    hidden_states = block(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/diffusers/models/attention.py", line 329, in forward
    attn_output = self.attn1(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/diffusers/models/attention_processor.py", line 512, in forward
    return self.processor(
  File "/opt/conda/lib/python3.10/site-packages/diffusers/models/attention_processor.py", line 1856, in __call__
    deprecate(
  File "/opt/conda/lib/python3.10/site-packages/diffusers/utils/deprecation_utils.py", line 18, in deprecate
    raise ValueError(
ValueError: The deprecation tuple ('LoRAAttnProcessor', '0.26.0', 'Make sure use AttnProcessor instead by settingLoRA layers to `self.{to_q,to_k,to_v,to_out[0]}.lora_layer` respectively. This will be done automatically when using `LoraLoaderMixin.load_lora_weights`') should be removed since diffusers' version 0.26.0 is >= 0.26.0
Steps:   0%|                                                                                                           | 0/600 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/opt/conda/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1023, in launch_command
    simple_launcher(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 643, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '/mnt/workspace/facechain/facechain/train_text_to_image_lora.py', '--pretrained_model_name_or_path=ly261666/cv_portrait_model', '--revision=v2.0', '--sub_path=film/film', '--output_dataset_name=/mnt/workspace/facechain/worker_data/qw/training_data/ly261666/cv_portrait_model/person2', '--caption_column=text', '--resolution=512', '--random_flip', '--train_batch_size=1', '--num_train_epochs=200', '--checkpointing_steps=5000', '--learning_rate=1.5e-04', '--lr_scheduler=cosine', '--lr_warmup_steps=0', '--seed=42', '--output_dir=/mnt/workspace/facechain/worker_data/qw/ly261666/cv_portrait_model/person2', '--lora_r=4', '--lora_alpha=32', '--lora_text_encoder_r=32', '--lora_text_encoder_alpha=32', '--resume_from_checkpoint=fromfacecommon']' returned non-zero exit status 1.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/gradio/queueing.py", line 407, in call_prediction
    output = await route_utils.call_process_api(
  File "/opt/conda/lib/python3.10/site-packages/gradio/route_utils.py", line 226, in call_process_api
    output = await app.get_blocks().process_api(
  File "/opt/conda/lib/python3.10/site-packages/gradio/blocks.py", line 1550, in process_api
    result = await self.call_function(
  File "/opt/conda/lib/python3.10/site-packages/gradio/blocks.py", line 1185, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/opt/conda/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "/opt/conda/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2134, in run_sync_in_worker_thread
    return await future
  File "/opt/conda/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 851, in run
    result = context.run(func, *args)
  File "/opt/conda/lib/python3.10/site-packages/gradio/utils.py", line 661, in wrapper
    response = f(*args, **kwargs)
  File "/mnt/workspace/facechain/app.py", line 804, in run
    train_lora_fn(base_model_path=base_model_path,
  File "/mnt/workspace/facechain/app.py", line 207, in train_lora_fn
    raise gr.Error("训练失败 (Training failed)")
gradio.exceptions.Error: '训练失败 (Training failed)'

Can anyone help with this? Thanks!

MushroomLyn avatar Mar 19 '24 08:03 MushroomLyn

Same error here; it just won't run.

clearlove88 avatar Mar 21 '24 09:03 clearlove88

Judging from the error message, diffusers probably must not be higher than 0.26.0. Someone also mentioned that commenting out some code inside diffusers works. Search the existing issues.
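The version ceiling suggested above matches what the traceback shows: diffusers' `deprecate` helper raises a `ValueError` once the installed version reaches the removal version named in the deprecation tuple (here `('LoRAAttnProcessor', '0.26.0', …)`). A minimal sketch of that comparison (the helper name `should_remove` is mine for illustration, not diffusers' API):

```python
from packaging import version

# Sketch of the check behind the ValueError in deprecation_utils.py:
# once the installed diffusers version is >= the version scheduled for
# removal, the deprecated code path raises instead of merely warning.
def should_remove(installed: str, removal: str) -> bool:
    return version.parse(installed) >= version.parse(removal)

# With diffusers 0.26.0 installed, the LoRAAttnProcessor path
# (scheduled for removal in 0.26.0) trips the check and training aborts.
print(should_remove("0.26.0", "0.26.0"))  # True  -> ValueError raised
print(should_remove("0.25.1", "0.26.0"))  # False -> deprecation warning only
```

In practice, pinning an older release, e.g. `pip install "diffusers<0.26.0"`, should keep the deprecated `LoRAAttnProcessor` path on the warning branch.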

datuizhuang avatar Mar 26 '24 06:03 datuizhuang

Please try out the newest train-free version with 10s inference, facechain-fact.

sunbaigui avatar Jun 04 '24 09:06 sunbaigui