ai-toolkit

Job hangs forever, windows process exits

Open maciej-wolny opened this issue 6 months ago • 4 comments

After starting the job, at some point it stops producing output and hangs. The new window with the process stops trailing output. In the UI the job is still shown as Running, and it can't be stopped or deleted.

```
Job ID: "37e195e0-1d73-44bf-b440-14bc5e0b0bb3"
#############################################
Running job: my_first_lora_v1
#############################################
Running 1 process
Loading Wan model
Loading transformer 1
Loading checkpoint shards: 100%|#########################################################| 3/3 [00:00<00:00, 6.86it/s]
Quantizing Transformer 1
  - quantizing 40 transformer blocks
100%|##################################################################################| 40/40 [00:37<00:00, 1.06it/s]
  - quantizing extras
Moving transformer 1 to CPU
Loading transformer 2
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
```

And the log from Node: `[UI] Job 37e195e0-1d73-44bf-b440-14bc5e0b0bb3 exited with code 0 after 0.014 seconds.`

maciej-wolny Oct 18 '25 18:10

I have the same problem with my "low-end" hardware (mobile 4060 with 8 GB VRAM; 32 GB RAM). I am training (trying to train :) ) a "text-slider" on WAN2.2-14B with hardware as small as this:

On Windows, it seems to swap RAM, which quickly fills up, to disk, and then stops at some point (not always at the same place).

For me it works if I reduce "Quantization" => "Transformer" to "3 bit". Anything higher causes the same problem as yours: the process stops without any error, and I have to force "Mark as stopped". (FYI: the resulting "text-slider" LoRA does not seem to be well trained. I mean the expected result is not there.)

ZeTofZone Oct 19 '25 23:10

I have a similar issue: mine also just stopped at loading transformer 2. I tried lowering the quantization to 3 bit, and even the text encoder to 3 bit, but it still gets stuck at loading transformer 2. I also have a low-end card, a desktop 4060. Thank you for your help.

bohaman1 Oct 20 '25 14:10

Hi. I made it work (with any quantization option) by freeing a lot of HDD space (more than 150 GB free). It doesn't work (it hangs like yours) with less: it hangs with 120 GB of free space. I have 32 GB of RAM. After that I got another, self-explanatory error about the pagefile on my system being too small. I had to manually set the pagefile max size to more than my free HDD space. I can now train a "text-slider" without problems (it takes a while with 8 GB of VRAM, but it works).
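
A quick way to check free disk space before starting a job is sketched below, using only Python's standard library. The 150 GB threshold is simply the figure reported in this comment, not a documented requirement of ai-toolkit, and the function names and the drive path are hypothetical choices for illustration.

```python
import shutil

# Assumption: threshold taken from the report above (worked with more than
# 150 GB free, hung with 120 GB). One user's observation, not an official limit.
REQUIRED_FREE_GB = 150


def free_space_gb(path: str) -> float:
    """Return the free space, in GiB, on the drive that contains `path`."""
    return shutil.disk_usage(path).free / 1024**3


def has_enough_free_space(path: str, required_gb: float = REQUIRED_FREE_GB) -> bool:
    """Return True if the drive containing `path` has at least `required_gb` GiB free."""
    return free_space_gb(path) >= required_gb


if __name__ == "__main__":
    # Assumption: "." stands in for the drive that hosts the pagefile (e.g. "C:\\").
    drive = "."
    print(f"Free space on {drive!r}: {free_space_gb(drive):.1f} GiB")
    if not has_enough_free_space(drive):
        print("Less free space than the threshold reported in this thread.")
```

This only reports disk space; it does not read or change the Windows pagefile settings, which still have to be adjusted by hand as described above.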

Edit: I forgot to mention that I had to reduce the size of the dataset images (I have 4 random images) to 256px on the largest side. Otherwise I got an OOM. But I should try again now with a higher resolution to see whether it changes anything and whether I can run it at higher resolutions.
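
The downscaling mentioned in the edit can be sketched as a small helper that computes the target dimensions while keeping the aspect ratio. It assumes that "256px" refers to the longest side of each image; the function name is hypothetical, and the actual resizing would still be done with an image tool or library of your choice.

```python
def fit_longest_side(width: int, height: int, max_side: int = 256) -> tuple[int, int]:
    """Scale (width, height) so the longest side is at most `max_side`.

    Images that are already small enough are returned unchanged.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Round to whole pixels and never let a side collapse to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    for size in [(1024, 768), (768, 1024), (200, 100)]:
        print(size, "->", fit_longest_side(*size))
```

For example, a 1024x768 image maps to 256x192, while a 200x100 image is left as it is.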

ZeTofZone Oct 20 '25 16:10

Same problem... I wanted to try FLUX2-dev: it crashes instantly after "loading transformers". I have to start it 10 times before it works (no crash log / errors!), and then

```
Loading Mistral
Loading checkpoint shards: 0%| | 0/10 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/10 [00:00<?, ?it/s]
Loading checkpoint shards: 10%|# | 1/10 [00:00<00:00, 9.35it/s]
Loading checkpoint shards: 10%|# | 1/10 [00:00<00:00, 9.35it/s]
Loading checkpoint shards: 50%|##### | 5/10 [00:00<00:00, 20.76it/s]
Loading checkpoint shards: 50%|##### | 5/10 [00:00<00:00, 20.76it/s]
```

and then it crashes / the cmd window closes... no errors

RTX 5090, 32 GB VRAM, 64 GB RAM, 256 GB pagefile

Sad... WAN2.2 works fine, FLUX2 does not.

syscore64-cmyk Dec 02 '25 17:12