Job hangs forever, Windows process exits
After starting the job, at some point it stops producing output and hangs. The new window running the process stops trailing output. In the UI the job is still shown as Running, and it can't be stopped or deleted.
```
Job ID: "37e195e0-1d73-44bf-b440-14bc5e0b0bb3"
#############################################
Running job: my_first_lora_v1
#############################################
Running 1 process
Loading Wan model
Loading transformer 1
Loading checkpoint shards: 100%|#########################################################| 3/3 [00:00<00:00, 6.86it/s]
Quantizing Transformer 1
- quantizing 40 transformer blocks 100%|##################################################################################| 40/40 [00:37<00:00, 1.06it/s]
- quantizing extras
Moving transformer 1 to CPU
Loading transformer 2
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
```
And the logs from Node:

```
[UI] Job 37e195e0-1d73-44bf-b440-14bc5e0b0bb3 exited with code 0 after 0.014 seconds.
```
I have the same problem on my "low end" hardware (mobile 4060, 8 GB VRAM; 32 GB RAM). Training (well, trying to train :) ) a "text-slider" on hardware this modest, on WAN2.2-14B:
On Windows, RAM fills up quickly and the system starts swapping to disk, and at some point the process just stops (not always at the same place).
For me, reducing "Quantization" => "Transformer" to "3 bit" works. Anything higher causes the same problem you describe: the process stops without any error, and I have to force "Mark as stopped". (FYI: the resulting "text-slider" LoRA doesn't seem to be well trained; the effect just isn't there.)
I have a similar issue: mine also gets stuck at loading transformer 2. I tried lowering the quantization to 3 bit, and even the text encoder to 3 bit, but it still hangs at loading transformer 2. I also have a low-end card (desktop 4060). Thank you for your help.
Hi. I got it working (with any quantization option) by freeing up a lot of HDD space (more than 150 GB free). With less free space it doesn't work and hangs like yours does (it still hangs with 120 GB free; I have 32 GB of RAM). After that I hit another error (a self-explanatory one this time) about my system pagefile being too small: I had to manually set the pagefile's maximum size to more than my free HDD space. I can now train a "text-slider" without problems (it takes a while with 8 GB of VRAM, but it works).
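If it helps anyone check this up front, here is a small Python sketch (standard library only) that verifies free disk space before launching a job. The 150 GB threshold is just the figure that happened to work on my machine, not a documented requirement of the toolkit:

```python
import shutil

# Threshold observed to work on my setup (hangs disappeared above ~150 GB
# free); this is an assumption from one machine, not an official minimum.
MIN_FREE_GB = 150

def enough_disk(path=".", min_free_gb=MIN_FREE_GB):
    """Return (free_gb, ok) for the drive containing `path`."""
    free_bytes = shutil.disk_usage(path).free
    free_gb = free_bytes / (1024 ** 3)
    return free_gb, free_gb >= min_free_gb

free_gb, ok = enough_disk()
print(f"Free space: {free_gb:.1f} GB - {'OK' if ok else 'free up more space'}")
```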
Edit: I forgot to mention that I also had to reduce the dataset images (I have 4 random images) to 256 px on the longest side, otherwise I got an OOM. I should try again now with a higher resolution to see if anything changes and whether it can run at higher resolutions.
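For the resizing step, the target dimensions (longest side capped at 256 px, aspect ratio preserved) can be computed like this. This is a hypothetical helper of my own, not part of the toolkit; plug the result into whatever image library you use:

```python
def fit_longest_side(width, height, max_side=256):
    """Scale (width, height) so the longest side is at most max_side,
    preserving aspect ratio. Images already small enough are unchanged."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # round() keeps the result close to the true aspect ratio;
    # max(1, ...) guards against degenerate zero-pixel dimensions.
    return max(1, round(width * scale)), max(1, round(height * scale))

print(fit_longest_side(1024, 768))  # -> (256, 192)
```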
Same problem... I want to try FLUX2-dev: it crashes instantly after "loading transformers". I have to start it about 10 times before it works (no crash log, no errors!), and then:
```
Loading Mistral
Loading checkpoint shards:   0%| | 0/10 [00:00<?, ?it/s]
Loading checkpoint shards:  10%|# | 1/10 [00:00<00:00, 9.35it/s]
Loading checkpoint shards:  50%|##### | 5/10 [00:00<00:00, 20.76it/s]
```

and then it crashes / the cmd window closes... no errors.
RTX 5090 (32 GB VRAM), 64 GB RAM, 256 GB pagefile.
Sad... WAN2.2 works fine, FLUX2 does not.