
Dreambooth training fails with two or more GPUs on Ubuntu

Open changqingla opened this issue 1 year ago • 2 comments

steps:   0%|          | 0/234 [00:00<?, ?it/s]
epoch 1/3
[rank1]: Traceback (most recent call last):
[rank1]:   File "/workspace/lora-scripts/./scripts/stable/train_db.py", line 529, in <module>
[rank1]:     train(args)
[rank1]:   File "/workspace/lora-scripts/./scripts/stable/train_db.py", line 343, in train
[rank1]:     encoder_hidden_states = train_util.get_hidden_states(
[rank1]:                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/workspace/lora-scripts/scripts/stable/library/train_util.py", line 4427, in get_hidden_states
[rank1]:     encoder_hidden_states = text_encoder.text_model.final_layer_norm(encoder_hidden_states)
[rank1]:                             ^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1729, in __getattr__
[rank1]:     raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
[rank1]: AttributeError: 'DistributedDataParallel' object has no attribute 'text_model'
[rank0]: Traceback (most recent call last):
[rank0]:   File "/workspace/lora-scripts/./scripts/stable/train_db.py", line 529, in <module>
[rank0]:     train(args)
[rank0]:   File "/workspace/lora-scripts/./scripts/stable/train_db.py", line 343, in train
[rank0]:     encoder_hidden_states = train_util.get_hidden_states(
[rank0]:                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/lora-scripts/scripts/stable/library/train_util.py", line 4427, in get_hidden_states
[rank0]:     encoder_hidden_states = text_encoder.text_model.final_layer_norm(encoder_hidden_states)
[rank0]:                             ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1729, in __getattr__
[rank0]:     raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
[rank0]: AttributeError: 'DistributedDataParallel' object has no attribute 'text_model'
steps:   0%|          | 0/234 [00:01<?, ?it/s]
W1010 02:41:57.596000 135260147754816 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3147 closing signal SIGTERM
E1010 02:41:57.711000 135260147754816 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 3148) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/root/miniconda3/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1116, in <module>
    main()
  File "/root/miniconda3/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1112, in main
    launch_command(args)
  File "/root/miniconda3/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1097, in launch_command
    multi_gpu_launcher(args)
  File "/root/miniconda3/lib/python3.12/site-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./scripts/stable/train_db.py FAILED

Failures: <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time       : 2024-10-10_02:41:57
  host       : ubuntu-Super-Server
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 3148)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

02:41:58-146594 ERROR Training failed / 训练失败
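For context, the AttributeError above appears because, in a multi-GPU launch, accelerate wraps the text encoder in torch's DistributedDataParallel, and the DDP wrapper does not expose submodules such as .text_model directly; they live under .module. Below is a minimal sketch of the usual workaround, unwrapping the model before touching its submodules. The helper and function names here are illustrative, not the repository's actual code:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP


def unwrap_model(model: torch.nn.Module) -> torch.nn.Module:
    # DDP keeps the original network under .module; plain modules pass through.
    return model.module if isinstance(model, DDP) else model


def final_layer_norm_states(text_encoder, encoder_hidden_states):
    # Hypothetical stand-in for the failing line in get_hidden_states():
    # go through the unwrapped encoder so .text_model resolves even when
    # text_encoder is a DistributedDataParallel wrapper.
    encoder = unwrap_model(text_encoder)
    return encoder.text_model.final_layer_norm(encoder_hidden_states)
```

When the script is driven by accelerate, accelerator.unwrap_model(text_encoder) achieves the same effect; single-GPU runs do not hit this path because no DDP wrapper is involved.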

changqingla · Oct 10 '24 02:10

Following this issue.

hben35096 · Oct 19 '24 09:10

Have you solved this?

asizk · Dec 25 '24 03:12