
Training speed degrades significantly while GPU power draw and temperature drop

Open meknidirta opened this issue 3 months ago • 18 comments

This is for bugs only

Did you already ask in the discord?

Yes

You verified that this is a bug and not a feature request or question by asking in the discord?

Yes

Describe the bug

Configuration:

  • Windows 11 25H2
  • Python 3.12.10
  • PyTorch 2.8.0
  • Latest version of ai-toolkit (commit 9b89bab)

Hardware:

  • 1x RTX 3060 12 GB (Driver Version: 581.15, CUDA Version: 13.0)
  • 64 GB DDR4 RAM

Training config:

job: extension
config:
  name: test128
  process:
  - type: diffusion_trainer
    training_folder: C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\output
    sqlite_db_path: C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\aitk_db.db
    device: cuda
    trigger_word: null
    performance_log_every: 10
    network:
      type: lora
      linear: 128
      linear_alpha: 128
      conv: 16
      conv_alpha: 16
      lokr_full_rank: true
      lokr_factor: -1
      network_kwargs:
        ignore_if_contains: []
    save:
      dtype: bf16
      save_every: 250
      max_step_saves_to_keep: 4
      save_format: diffusers
      push_to_hub: false
    datasets:
    - folder_path: C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\datasets/target
      mask_path: null
      mask_min_value: 0.1
      default_caption: put design on the shirt
      caption_ext: txt
      caption_dropout_rate: 0.05
      cache_latents_to_disk: false
      is_reg: false
      network_weight: 1
      resolution:
      - 512
      - 768
      - 1024
      controls: []
      shrink_video_to_frames: true
      num_frames: 1
      do_i2v: true
      flip_x: false
      flip_y: false
      control_path_1: C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\datasets/control
      control_path_2: null
      control_path_3: null
    train:
      batch_size: 1
      bypass_guidance_embedding: false
      steps: 7000
      gradient_accumulation: 1
      train_unet: true
      train_text_encoder: false
      gradient_checkpointing: true
      noise_scheduler: flowmatch
      optimizer: adamw8bit
      timestep_type: weighted
      content_or_style: balanced
      optimizer_params:
        weight_decay: 0.0001
      unload_text_encoder: false
      cache_text_embeddings: true
      lr: 0.0001
      ema_config:
        use_ema: false
        ema_decay: 0.99
      skip_first_sample: true
      force_first_sample: false
      disable_sampling: true
      dtype: bf16
      diff_output_preservation: false
      diff_output_preservation_multiplier: 1
      diff_output_preservation_class: person
      switch_boundary_every: 1
      loss_type: mse
    model:
      name_or_path: Qwen/Qwen-Image-Edit-2509
      quantize: true
      qtype: uint3|ostris/accuracy_recovery_adapters/qwen_image_edit_2509_torchao_uint3.safetensors
      quantize_te: true
      qtype_te: qfloat8
      arch: qwen_image_edit_plus
      low_vram: true
      model_kwargs:
        match_target_res: true
      layer_offloading: true
      layer_offloading_text_encoder_percent: 1
      layer_offloading_transformer_percent: 1
    sample:
      sampler: flowmatch
      sample_every: 250
      width: 1024
      height: 1024
      samples: []
      neg: ''
      seed: 42
      walk_seed: true
      guidance_scale: 4
      sample_steps: 25
      num_frames: 1
      fps: 1
meta:
  name: test128
  version: '1.0'

Issue: When training a rank 128 Qwen Edit 2509 LoRA, the training speed becomes highly inconsistent and slows down dramatically after a few epochs. Initially it runs at around 56 seconds per iteration, but after some time it degrades to ~200 seconds per iteration or more. A fresh install doesn't fix it. Training a LoRA with rank 64 or less works fine.

There are:

  • No out-of-memory (OOM) errors
  • No SSD swap usage
  • GPU load remains at around 100%
  • VRAM and RAM usage stay constant

However, during this slowdown:

  • GPU power draw and temperature drop sharply, indicating reduced actual compute utilization even though usage metrics report around 100%
  • Training speed slows down dramatically, making progress extremely inefficient

Typical training behavior:

  • ~55 °C GPU temperature
  • ~115 W power draw

Problematic behavior:

  • Noticeable drop in both temperature and power draw
  • GPU still reports around 100% utilization
  • Dramatic slowdown in training performance
Image
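For reference, a minimal way to log these counters over time alongside the step timings (a sketch assuming the pynvml package is installed and GPU index 0; not part of ai-toolkit) would be:

# Logs power draw, temperature, utilization and VRAM usage every few seconds.
# Assumes pynvml (pip install nvidia-ml-py); adjust the device index if needed.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0

try:
    while True:
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # reported in milliwatts
        temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu    # percent
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"{time.strftime('%H:%M:%S')}  {power_w:6.1f} W  {temp_c:3d} C  "
              f"util {util:3d}%  vram {mem.used / 1024**3:5.1f} GiB")
        time.sleep(5)
finally:
    pynvml.nvmlShutdown()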

Related issues: https://github.com/ostris/ai-toolkit/issues/390

meknidirta avatar Nov 09 '25 18:11 meknidirta

Did some additional testing with rank 96. Same problem.

Image Image

meknidirta avatar Nov 09 '25 20:11 meknidirta

I suspect the problem is available memory. Maybe swap is being used at some moments; you need to check that, because both RAM and VRAM are almost maxed out. Do an experiment: select only the 1024 resolution. I think it would be a steady, high s/it number, and it would be on the low side with 512.
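To check that quickly, a small sketch (assuming the psutil package is installed) that logs RAM and swap/pagefile usage while the training runs could look like this:

# Logs system RAM and swap (pagefile on Windows) usage every few seconds.
# Assumes psutil is installed (pip install psutil).
import time
import psutil

while True:
    vm = psutil.virtual_memory()
    sw = psutil.swap_memory()
    print(f"RAM {vm.used / 1024**3:5.1f}/{vm.total / 1024**3:5.1f} GiB  "
          f"swap {sw.used / 1024**3:5.1f} GiB ({sw.percent:.0f}%)")
    time.sleep(5)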

Dorithur avatar Nov 10 '25 06:11 Dorithur

I have noticed this on Flux. I start with 30 GB of VRAM used and, over the time between generations, it can increase to 32 or even 33 GB, causing a spillover to RAM and drastically slowing training down (from 2-3 s/it to over 10 s/it) [RTX 5090; 128 GB RAM], or from something like 80 s/it to more than 5 minutes/it on my 12 GB GPU [RTX 4080 Laptop; 64 GB RAM].

mcDandy avatar Nov 10 '25 07:11 mcDandy

I suspect the problem is available memory. Maybe swap is being used at some moments; you need to check that, because both RAM and VRAM are almost maxed out. Do an experiment: select only the 1024 resolution. I think it would be a steady, high s/it number, and it would be on the low side with 512.

Testing with only 1024 and Match Target Resolution turned off, I get this error:

test128_1024:   0%|          | 0/7000 [03:31<?, ?it/s, lr: 1.0e-04 loss: 9.839e-02]Error running job: UintxTensor dispatch: attempting to run unimplemented operator/function: func=<OpOverload(op='aten.t', overload='default')>, types=(<class 'torchao.dtypes.uintx.uintx_layout.UintxTensor'>,), arg_types=(<class 'torchao.dtypes.uintx.uintx_layout.UintxTensor'>,), kwarg_types={}
========================================
Result:
 - 0 completed jobs
 - 1 failure
========================================
Traceback (most recent call last):
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\run.py", line 120, in <module>
    main()
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\run.py", line 108, in main
    raise e
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\run.py", line 96, in main
    job.run()
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\jobs\ExtensionJob.py", line 22, in run
    process.run()
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\jobs\process\BaseSDTrainProcess.py", line 2162, in run
    loss_dict = self.hook_train_loop(batch_list)
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\extensions_built_in\sd_trainer\SDTrainer.py", line 2051, in hook_train_loop
    loss = self.train_single_accumulation(batch)
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\extensions_built_in\sd_trainer\SDTrainer.py", line 2026, in train_single_accumulation
    self.accelerator.backward(loss)
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\venv\Lib\site-packages\accelerate\accelerator.py", line 2740, in backward
    loss.backward(**kwargs)
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\venv\Lib\site-packages\torch\_tensor.py", line 647, in backward
    torch.autograd.backward(
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\venv\Lib\site-packages\torch\autograd\__init__.py", line 354, in backward
    _engine_run_backward(
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\venv\Lib\site-packages\torch\autograd\graph.py", line 829, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\venv\Lib\site-packages\torch\autograd\function.py", line 311, in apply
    return user_fn(self, *args)
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\toolkit\memory_management\manager_modules.py", line 264, in backward
    grad_input = grad_out.to(dtype=target_dtype) @ w_bwd_buffers[idx]
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\venv\Lib\site-packages\torchao\utils.py", line 425, in _dispatch__torch_function__
    return func(*args, **kwargs)
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\venv\Lib\site-packages\torchao\utils.py", line 440, in _dispatch__torch_dispatch__
    return cls._ATEN_OP_OR_TORCH_FN_TABLE[func](func, types, args, kwargs)
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\venv\Lib\site-packages\torchao\utils.py", line 401, in wrapper
    return func(f, types, args, kwargs)
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\venv\Lib\site-packages\torchao\dtypes\affine_quantized_tensor_ops.py", line 346, in _
    weight_tensor = weight_tensor.t()
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\venv\Lib\site-packages\torchao\utils.py", line 440, in _dispatch__torch_dispatch__
    return cls._ATEN_OP_OR_TORCH_FN_TABLE[func](func, types, args, kwargs)
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\venv\Lib\site-packages\torchao\utils.py", line 401, in wrapper
    return func(f, types, args, kwargs)
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\venv\Lib\site-packages\torchao\dtypes\affine_quantized_tensor_ops.py", line 410, in _
    tensor.tensor_impl.t(),
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\venv\Lib\site-packages\torchao\dtypes\uintx\plain_layout.py", line 153, in __torch_dispatch__
    tensor.int_data.t(), tensor.scale, tensor.zero_point, tensor._layout
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\venv\Lib\site-packages\torchao\utils.py", line 444, in _dispatch__torch_dispatch__
    raise NotImplementedError(
NotImplementedError: UintxTensor dispatch: attempting to run unimplemented operator/function: func=<OpOverload(op='aten.t', overload='default')>, types=(<class 'torchao.dtypes.uintx.uintx_layout.UintxTensor'>,), arg_types=(<class 'torchao.dtypes.uintx.uintx_layout.UintxTensor'>,), kwarg_types={}
test128_1024:   0%|          | 0/7000 [07:47<?, ?it/s, lr: 1.0e-04 loss: 9.839e-02]

meknidirta avatar Nov 10 '25 15:11 meknidirta

Update:

I conducted additional testing using Match Target Resolution with three selected resolutions (512, 768, and 1024). Below are the results:

Test 1: Rank 64 Configuration (Working Setup)

  • Setup: This configuration has been verified to work well in the past (I posted it here before).
  • Note: My Windows is set to my native language, so the Yandex OCR of the screenshot might not be perfect, but the translation is accurate.

Image

Test 2: Rank 72 Configuration

  • Setup: I increased the rank to 72 while keeping the other settings identical to Test 1.
  • Observation: Although there is still some unused shared GPU memory, the training step time increased significantly, from 21.03 s/iteration to 249.81 s/iteration.

Image


Even with rank 72, there was available shared GPU memory. However, the GPU's power draw and temperature dropped significantly during training, despite the reported 100% utilization. The training speed drastically slowed down:

  • Before: Around 21 seconds per iteration. Image

  • After: Increased to over 249 seconds per iteration, despite there being no apparent memory overflow or VRAM issues. Image

meknidirta avatar Nov 10 '25 17:11 meknidirta

Image

Here is what I get with your configuration and 36 images. The power draw is not at its limit; usually I get values around 250 W. It's a sign of a VRAM shortage. It gives me 8 s/it on average.

By the way, why do you need such a high LR?

Dorithur avatar Nov 10 '25 18:11 Dorithur

I used the default settings and changed only what I needed to for testing purposes. Isn't the RamTorch implementation supposed to help with that? I have offloading for both the transformer and the text encoder set to 100%.
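For reference, these are the offloading-related keys from the config posted above:

    model:
      low_vram: true
      layer_offloading: true
      layer_offloading_text_encoder_percent: 1
      layer_offloading_transformer_percent: 1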

meknidirta avatar Nov 10 '25 20:11 meknidirta

I used the default settings and changed only what I needed to for testing purposes. Isn't the RamTorch implementation supposed to help with that? I have offloading for both the transformer and the text encoder set to 100%.

Unfortunately I tried this and the same thing happens. I was at 25% RamTorch and everything was going smoothly until step 2500, then it slowed right down after that. I guess I'll try with a higher amount in RamTorch.

BadSuda avatar Nov 24 '25 18:11 BadSuda

I have the same bug: every step was 30-40 s/it, but right after step 1800 I got 500+ s/it. I tried to wait, and it keeps calculating these steps in the range of 300-600 s/it.

Image

So I don't think it's a memory issue. Sadly, my last checkpoint was 300+ steps away... I am training a WAN 2.1 LoRA, if that's important.

sawk1 avatar Nov 27 '25 15:11 sawk1

Stopping training and resuming from the last checkpoint works, so make checkpoints more often.
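For example, in the save section of the config (the value here is just an illustration):

    save:
      save_every: 100            # e.g. every 100 steps instead of 250
      max_step_saves_to_keep: 4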

sawk1 avatar Nov 27 '25 15:11 sawk1

Hello, I can confirm the same issue with an RTX 4070 with 12 GB VRAM. After around 1300 out of 5000 steps, the speed decreased from 5.xx s/it to 30.xx s/it. The power draw reaches at most 129 W / 215 W, and most of the time it stays at 67 W / 215 W. I don't get any out-of-memory errors or any other error. Even stopping and restarting the training doesn't help.

Skybeat avatar Dec 02 '25 05:12 Skybeat

Stopping training and resuming from the last checkpoint works, so make checkpoints more often.

It doesn't help for me. I tried everything: stopping the job and resuming, as well as restarting the system and resuming. At some point the speed decreases rapidly and you can't do anything about it.

Skybeat avatar Dec 05 '25 08:12 Skybeat

I did some tests to make sure my system doesn't have an issue. If I use ComfyUI with z-image-turbo, I get ~4.80 s/it. If I stop ComfyUI, start AI-Toolkit and resume my LoRA training with z-image-turbo, this is the actual s/it rate: 40%|#### | 2002/5000 [05:39<55:38:34, 66.82s/it, lr: 8.0e-05 loss: 4.740e-01]

Even more interesting is that AI-Toolkit does not use the full power of my RTX 4070, as the screenshot below shows. If I use ComfyUI, the power draw is constantly around 190 W during image creation.

Image

It really looks like something is going wrong in AI-Toolkit at some point during a training job.

Skybeat avatar Dec 05 '25 11:12 Skybeat

I'm having the same problem, training Z-Image Turbo.

elen07zz avatar Dec 08 '25 04:12 elen07zz

[ 64GB RAM, 3090 (24GB VRAM), Windows 10 ]

This has been my first time using AI Toolkit, training a photoreal character LoRA of myself on Wan 2.1, and I seem to be experiencing odd slowdown (and speedup) symptoms. The overall drop in speed is making the training run take much longer than initially predicted - about two days for 2000 iterations at 512px.

Training began at 71 s/it and quickly sped up in chunks, settling into a steadier rate of improvement to reach 12.8 s/it by iteration 249.

At this point in the training, an epoch was saved and two sample images were generated.

Then the training rate progressively slowed, reaching 53 s/it at iteration 1249, before training died from an OOM on the GPU during sample image generation.

After restarting the training at iteration 1250, it ran at 469 s/it and sped up in chunks, settling into a more linear trend to reach 152 s/it by iteration 1539.

I soon closed all processes and browser windows, then started ai-toolkit and restarted the same partially complete training job. It did not help.

Here are my settings for this training, in case any of them might offer a clue to the behavior described.

job: extension
config:
  name: testtrig01
  process:
  - type: diffusion_trainer
    training_folder: C:\ai-toolkit\output
    sqlite_db_path: C:\ai-toolkit\aitk_db.db
    device: cuda
    trigger_word: testtrig
    performance_log_every: 10
    network:
      type: lora
      linear: 64
      linear_alpha: 32
      conv: 16
      conv_alpha: 16
      lokr_full_rank: true
      lokr_factor: -1
      network_kwargs:
        ignore_if_contains: []
    save:
      dtype: bf16
      save_every: 250
      max_step_saves_to_keep: 4
      save_format: diffusers
      push_to_hub: false
    datasets:
    - folder_path: C:\ai-toolkit\datasets/testtrig
      mask_path: null
      mask_min_value: 0.1
      default_caption: ''
      caption_ext: txt
      caption_dropout_rate: 0.05
      cache_latents_to_disk: true
      is_reg: false
      network_weight: 1
      resolution:
      - 512
      controls: []
      shrink_video_to_frames: true
      num_frames: 1
      do_i2v: true
      flip_x: false
      flip_y: false
    train:
      batch_size: 1
      bypass_guidance_embedding: false
      steps: 2000
      gradient_accumulation: 1
      train_unet: true
      train_text_encoder: false
      gradient_checkpointing: true
      noise_scheduler: flowmatch
      optimizer: adamw8bit
      timestep_type: sigmoid
      content_or_style: balanced
      optimizer_params:
        weight_decay: 0.0001
      unload_text_encoder: false
      cache_text_embeddings: false
      lr: 0.0001
      ema_config:
        use_ema: true
        ema_decay: 0.99
      skip_first_sample: false
      force_first_sample: false
      disable_sampling: false
      dtype: bf16
      diff_output_preservation: true
      diff_output_preservation_multiplier: 1
      diff_output_preservation_class: man
      switch_boundary_every: 1
      loss_type: mse
    logging:
      log_every: 1
      use_ui_logger: true
    model:
      name_or_path: Wan-AI/Wan2.1-T2V-14B-Diffusers
      quantize: true
      qtype: qfloat8
      quantize_te: true
      qtype_te: qfloat8
      arch: wan21:14b
      low_vram: true
      model_kwargs: {}
    sample:
      sampler: flowmatch
      sample_every: 250
      width: 512
      height: 512
      samples:
      - prompt: testtrig, man, A man stands in a side profile view and smiles toward the camera. He is wearing a grey business suit, white dress shirt with a blue striped tie. The background includes a dark-colored wall with white trim.
      neg: ''
      seed: 42
      walk_seed: true
      guidance_scale: 5
      sample_steps: 30
      num_frames: 33
      fps: 16
meta:
  name: testtrig
  version: '1.0'

ooofest avatar Dec 21 '25 06:12 ooofest

Just got this issue about 30 minutes ago and now it's fine. What I did:

  • Run ComfyUI
  • Right-click - Clear VRAM
  • Close ComfyUI
  • Run ai-toolkit
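To double-check that the VRAM actually ended up free before launching ai-toolkit, a quick check from the ai-toolkit venv (just a sketch) is:

# Prints free vs. total memory on the current CUDA device, in GiB.
import torch

free, total = torch.cuda.mem_get_info()
print(f"free: {free / 1024**3:.1f} GiB / total: {total / 1024**3:.1f} GiB")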

wa1ker885 avatar Dec 21 '25 15:12 wa1ker885

The VRAM (on my second 3090) was not being used between ai-toolkit runs, so using ComfyUI to attempt clearing it did not help, unfortunately.

I rebooted the system, but the slow training continued after restarting the prior training run. As others have said, the Power Draw is markedly lower than before, despite showing GPU Load = 100%.

Image

ooofest avatar Dec 21 '25 17:12 ooofest

And here's a curious symptom: previously, even with degraded GPU speed for training, the generation of sample images ran at "full" speed. However, I recently saw the first sample (of two) generated at the degraded speed, and then the second sample generated at the normal GPU rate.

After stopping the AI Toolkit processes and running ComfyUI, ComfyUI continues to operate at full speed.

I feel this continues to point at something in AI Toolkit (or its dependencies) as the cause of the slowdown.

Generating Images:   0%|          | 0/2 [00:00<?, ?it/s]Unloading vae
Unloading text encoder
  0%|          | 0/30 [00:00<?, ?it/s]
  3%|3         | 1/30 [01:55<55:46, 115.39s/it]
  7%|6         | 2/30 [03:50<53:38, 114.95s/it]
 10%|#         | 3/30 [05:44<51:38, 114.77s/it]
 13%|#3        | 4/30 [07:39<49:41, 114.69s/it]
 17%|#6        | 5/30 [09:33<47:45, 114.63s/it]
 20%|##        | 6/30 [11:28<45:50, 114.61s/it]
 23%|##3       | 7/30 [13:22<43:55, 114.60s/it]
 27%|##6       | 8/30 [15:17<42:00, 114.59s/it]
 30%|###       | 9/30 [17:12<40:06, 114.61s/it]
 33%|###3      | 10/30 [19:06<38:11, 114.59s/it]
 37%|###6      | 11/30 [21:01<36:17, 114.60s/it]
 40%|####      | 12/30 [22:55<34:22, 114.58s/it]
 43%|####3     | 13/30 [24:50<32:27, 114.56s/it]
 47%|####6     | 14/30 [26:44<30:33, 114.59s/it]
 50%|#####     | 15/30 [28:39<28:38, 114.56s/it]
 53%|#####3    | 16/30 [30:34<26:44, 114.58s/it]
 57%|#####6    | 17/30 [32:28<24:49, 114.56s/it]
 60%|######    | 18/30 [34:23<22:54, 114.57s/it]
 63%|######3   | 19/30 [36:17<21:00, 114.60s/it]
 67%|######6   | 20/30 [38:12<19:06, 114.64s/it]
 70%|#######   | 21/30 [40:07<17:11, 114.63s/it]
 73%|#######3  | 22/30 [42:01<15:16, 114.58s/it]
 77%|#######6  | 23/30 [43:56<13:22, 114.63s/it]
 80%|########  | 24/30 [45:50<11:27, 114.59s/it]
 83%|########3 | 25/30 [47:45<09:32, 114.58s/it]
 87%|########6 | 26/30 [49:39<07:38, 114.56s/it]
 90%|######### | 27/30 [51:34<05:43, 114.55s/it]
 93%|#########3| 28/30 [53:28<03:49, 114.55s/it]
 97%|#########6| 29/30 [55:23<01:54, 114.56s/it]
100%|##########| 30/30 [57:18<00:00, 114.60s/it]

Loading Vae
Generating Images:  50%|#####     | 1/2 [58:32<58:32, 3512.67s/it]Unloading vae

Unloading text encoder
  0%|          | 0/30 [00:00<?, ?it/s]
  3%|3         | 1/30 [00:12<06:15, 12.94s/it]
  7%|6         | 2/30 [00:25<06:02, 12.96s/it]
 10%|#         | 3/30 [00:38<05:50, 12.99s/it]
 13%|#3        | 4/30 [00:51<05:38, 13.01s/it]
 17%|#6        | 5/30 [01:05<05:25, 13.02s/it]
 20%|##        | 6/30 [01:18<05:12, 13.02s/it]
 23%|##3       | 7/30 [01:31<04:59, 13.03s/it]
 27%|##6       | 8/30 [01:44<04:46, 13.04s/it]
 30%|###       | 9/30 [01:57<04:33, 13.04s/it]
 33%|###3      | 10/30 [02:10<04:20, 13.05s/it]
 37%|###6      | 11/30 [02:23<04:07, 13.05s/it]
 40%|####      | 12/30 [02:36<03:54, 13.05s/it]
 43%|####3     | 13/30 [02:49<03:41, 13.05s/it]
 47%|####6     | 14/30 [03:02<03:28, 13.05s/it]
 50%|#####     | 15/30 [03:15<03:15, 13.05s/it]
 53%|#####3    | 16/30 [03:28<03:02, 13.05s/it]
 57%|#####6    | 17/30 [03:41<02:49, 13.05s/it]
 60%|######    | 18/30 [03:54<02:36, 13.06s/it]
 63%|######3   | 19/30 [04:07<02:23, 13.06s/it]
 67%|######6   | 20/30 [04:20<02:10, 13.06s/it]
 70%|#######   | 21/30 [04:33<01:57, 13.06s/it]
 73%|#######3  | 22/30 [04:46<01:44, 13.06s/it]
 77%|#######6  | 23/30 [04:59<01:31, 13.06s/it]
 80%|########  | 24/30 [05:13<01:18, 13.06s/it]
 83%|########3 | 25/30 [05:26<01:05, 13.06s/it]
 87%|########6 | 26/30 [05:39<00:52, 13.06s/it]
 90%|######### | 27/30 [05:52<00:39, 13.06s/it]
 93%|#########3| 28/30 [06:05<00:26, 13.06s/it]
 97%|#########6| 29/30 [06:18<00:13, 13.06s/it]
100%|##########| 30/30 [06:31<00:00, 13.05s/it]

ooofest avatar Dec 23 '25 19:12 ooofest