Train stopped at 0%

Open adinoolfi opened this issue 1 year ago • 1 comments

Describe the bug

I used the LoRA training script with the command-line parameters from this page: https://huggingface.co/docs/diffusers/training/lora. However, training stops at 0% without any error in cmd. By adding some print() calls, I found that the script stalls in the first iteration of the for loop at line 741, more precisely at line 744, after which nothing more is printed.
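As an alternative to scattering print() calls, a hang like this can be located with Python's standard faulthandler module, which dumps the stack of every thread after a timeout. This is only a minimal sketch: the helper names watch_for_hang and stop_watching and the 300-second default are assumptions for illustration and are not part of the training script.

```python
import faulthandler
import sys


def watch_for_hang(timeout_s: float = 300.0, file=None) -> None:
    """Dump the stack of every thread each `timeout_s` seconds until cancelled.

    `file` must be backed by a real file descriptor; it defaults to stderr.
    """
    target = sys.stderr if file is None else file
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)


def stop_watching() -> None:
    """Cancel the periodic dump started by watch_for_hang()."""
    faulthandler.cancel_dump_traceback_later()
```

Calling watch_for_hang() once before the training loop and stop_watching() after it would print the exact line where each thread is waiting if an iteration takes longer than the timeout.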

Reproduction

    for epoch in range(first_epoch, args.num_train_epochs):
        unet.train()
        train_loss = 0.0
        for step, batch in enumerate(train_dataloader):
            with accelerator.accumulate(unet):
                # Convert images to latent space
                latents = vae.encode(batch["pixel_values"].to(dtype=weight_dtype)).latent_dist.sample()
                latents = latents * vae.config.scaling_factor

Logs

No response

System Info

    absl-py==2.1.0
    accelerate==0.30.0
    aiohttp==3.9.5
    aiosignal==1.3.1
    attrs==23.2.0
    certifi==2024.2.2
    charset-normalizer==3.3.2
    click==8.1.7
    colorama==0.4.6
    datasets==2.19.1
    diffusers @ file:///C:/Users/PNP/Desktop/Nappi/tirocinio-venv/Scripts/diffusers
    dill==0.3.8
    docker-pycreds==0.4.0
    filelock==3.14.0
    frozenlist==1.4.1
    fsspec==2024.3.1
    ftfy==6.2.0
    gitdb==4.0.11
    GitPython==3.1.43
    grpcio==1.63.0
    huggingface-hub==0.23.0
    idna==3.7
    importlib_metadata==7.1.0
    intel-openmp==2021.4.0
    Jinja2==3.1.4
    Markdown==3.6
    MarkupSafe==2.1.5
    mkl==2021.4.0
    mpmath==1.3.0
    multidict==6.0.5
    multiprocess==0.70.16
    networkx==3.3
    numpy==1.26.4
    packaging==24.0
    pandas==2.2.2
    peft==0.7.0
    pillow==10.3.0
    platformdirs==4.2.1
    protobuf==4.25.3
    psutil==5.9.8
    pyarrow==16.1.0
    pyarrow-hotfix==0.6
    python-dateutil==2.9.0.post0
    pytz==2024.1
    PyYAML==6.0.1
    regex==2024.5.10
    requests==2.31.0
    safetensors==0.4.3
    sentry-sdk==2.1.1
    setproctitle==1.3.3
    six==1.16.0
    smmap==5.0.1
    sympy==1.12
    tbb==2021.12.0
    tensorboard==2.16.2
    tensorboard-data-server==0.7.2
    tokenizers==0.19.1
    torch==2.3.0+cu118
    torchaudio==2.3.0+cu118
    torchvision==0.18.0
    tqdm==4.66.4
    transformers==4.41.1
    typing_extensions==4.11.0
    tzdata==2024.1
    urllib3==2.2.1
    wandb==0.17.0
    wcwidth==0.2.13
    Werkzeug==3.0.3
    xxhash==3.4.1
    yarl==1.9.4
    zipp==3.18.1

For diffusers, I cloned the repository from GitHub.

Who can help?

No response

adinoolfi avatar May 23 '24 16:05 adinoolfi

Hi, so you don't get any error message? It just stops and returns to the console?

Your diffusers entry doesn't show a version. Is it installed from source? Is it up to date?

Also, it would be good to have more details about your environment. Please run diffusers-cli env and post the output.

asomoza avatar May 23 '24 17:05 asomoza

Yes, when I run it from cmd it doesn't give me errors and remains stuck for hours at 0% of the steps. I also tried typing diffusers-cli env in the virtual Python environment from which I am running the script, and it gives me this error:

    Traceback (most recent call last):
      File "<frozen runpy>", line 198, in _run_module_as_main
      File "<frozen runpy>", line 88, in _run_code
      File "C:\Users\PNP\Desktop\Nappi\tirocinio-env\Scripts\diffusers-cli.exe\__main__.py", line 7, in <module>
      File "C:\Users\PNP\Desktop\Nappi\tirocinio-env\Lib\site-packages\diffusers\commands\diffusers_cli.py", line 39, in main
        service.run()
      File "C:\Users\PNP\Desktop\Nappi\tirocinio-env\Lib\site-packages\diffusers\commands\env.py", line 110, in run
        platform_info = f"{platform.freedesktop_os_release().get('PRETTY_NAME', None)} - {platform.platform()}"
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "C:\Users\PNP\AppData\Local\Programs\Python\Python312\Lib\platform.py", line 1342, in freedesktop_os_release
        raise OSError(
    FileNotFoundError: [Errno 2] Unable to read files /etc/os-release, /usr/lib/os-release

adinoolfi avatar May 28 '24 08:05 adinoolfi

@adinoolfi Could you open your C:\Users\PNP\Desktop\Nappi\tirocinio-env\Lib\site-packages\diffusers\commands\env.py file and change line 110 like this:

-            platform_info = f"{platform.freedesktop_os_release().get('PRETTY_NAME', None)} - {platform.platform()}"
+            platform_info = f"{platform.platform()}"

Then try again with diffusers-cli env.
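The traceback arises because platform.freedesktop_os_release() reads /etc/os-release, which does not exist on Windows. A more defensive variant of the same line could look roughly like the sketch below. It is not the actual diffusers fix, and the helper name describe_platform is hypothetical.

```python
import platform


def describe_platform() -> str:
    """Return a platform string, adding the distro name only when it is available."""
    base = platform.platform()
    try:
        # Raises OSError where no os-release file exists (e.g. Windows, macOS);
        # raises AttributeError on Python versions older than 3.10.
        pretty = platform.freedesktop_os_release().get("PRETTY_NAME")
    except (OSError, AttributeError):
        pretty = None
    return f"{pretty} - {base}" if pretty else base


if __name__ == "__main__":
    print(describe_platform())
```

On Windows this falls back to the plain platform.platform() string, which matches what the one-line patch above produces.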

tolgacangoz avatar May 28 '24 08:05 tolgacangoz

OK, the command works and gives me this:

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

  • 🤗 Diffusers version: 0.28.0.dev0
  • Platform: Windows-11-10.0.22621-SP0
  • Running on a notebook?: No
  • Running on Google Colab?: No
  • Python version: 3.12.0
  • PyTorch version (GPU?): 2.3.0+cpu (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Huggingface_hub version: 0.23.2
  • Transformers version: 4.41.1
  • Accelerate version: 0.30.1
  • PEFT version: 0.7.0
  • Bitsandbytes version: not installed
  • Safetensors version: 0.4.3
  • xFormers version: not installed
  • Accelerator: NVIDIA GeForce RTX 2080 SUPER, 8192 MiB VRAM
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

adinoolfi avatar May 28 '24 08:05 adinoolfi

[Image: Screenshot 2024-05-21 100847]

This is the problem...

adinoolfi avatar May 28 '24 09:05 adinoolfi

I see two issues. The first is that since you're training a LoRA and using peft, you'll need to install bitsandbytes.

Second, since you're using Windows with an 8 GB VRAM card, you're probably running out of memory and the system is falling back to RAM. This is really slow; check the Task Manager and see whether the VRAM of your card is fully used.

I have the impression that your training isn't stopping but is just running really slowly.

asomoza avatar May 28 '24 15:05 asomoza

  • PyTorch version (GPU?): 2.3.0+cpu (False)

Also, the environment uses the CPU version of PyTorch.

tolgacangoz avatar May 28 '24 15:05 tolgacangoz

Oh right, that explains it all.

@adinoolfi if you installed PyTorch like this: pip install torch torchvision, it should have installed the CUDA version of PyTorch, but you have the CPU version. This suggests your system is not detecting an installed CUDA. This is outside the scope of diffusers, but I suggest you reinstall everything.

As @tolgacangoz suggested, you need to have PyTorch with CUDA support.
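A quick way to confirm which PyTorch build an environment is actually using is to look at the local tag in its version string (the part after "+", such as cpu or cu118) and at torch.cuda.is_available(). The following is a minimal sketch; the helper names build_tag and describe_torch are assumptions for illustration.

```python
import importlib.util


def build_tag(version: str) -> str:
    """Return the local build tag of a version string, e.g. 'cpu' for '2.3.0+cpu'."""
    _, _, tag = version.partition("+")
    return tag  # empty string when the version has no '+' suffix


def describe_torch() -> str:
    """Summarise the installed PyTorch build, if there is one."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"
    import torch

    version = str(torch.__version__)
    tag = build_tag(version) or "none"
    return f"torch {version} (build tag: {tag}), CUDA available: {torch.cuda.is_available()}"


if __name__ == "__main__":
    print(describe_torch())
```

Running this inside the same virtual environment as the training script shows whether that interpreter sees a +cpu or a CUDA build, which is worth checking when pip freeze and diffusers-cli env appear to disagree.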

asomoza avatar May 28 '24 15:05 asomoza