Training speed degrades significantly while GPU power draw and temperature drop
This is for bugs only
Did you already ask in the discord?
Yes
You verified that this is a bug and not a feature request or question by asking in the discord?
Yes
Describe the bug
Configuration: Windows 11 25H2, Python 3.12.10, PyTorch 2.8.0, running the latest version of ai-toolkit (commit 9b89bab)
Hardware: 1x RTX 3060 12 GB (driver version 581.15, CUDA version 13.0), 64 GB DDR4 RAM
Training config:
job: extension
config:
  name: test128
  process:
    - type: diffusion_trainer
      training_folder: C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\output
      sqlite_db_path: C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\aitk_db.db
      device: cuda
      trigger_word: null
      performance_log_every: 10
      network:
        type: lora
        linear: 128
        linear_alpha: 128
        conv: 16
        conv_alpha: 16
        lokr_full_rank: true
        lokr_factor: -1
        network_kwargs:
          ignore_if_contains: []
      save:
        dtype: bf16
        save_every: 250
        max_step_saves_to_keep: 4
        save_format: diffusers
        push_to_hub: false
      datasets:
        - folder_path: C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\datasets/target
          mask_path: null
          mask_min_value: 0.1
          default_caption: put design on the shirt
          caption_ext: txt
          caption_dropout_rate: 0.05
          cache_latents_to_disk: false
          is_reg: false
          network_weight: 1
          resolution:
            - 512
            - 768
            - 1024
          controls: []
          shrink_video_to_frames: true
          num_frames: 1
          do_i2v: true
          flip_x: false
          flip_y: false
          control_path_1: C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\datasets/control
          control_path_2: null
          control_path_3: null
      train:
        batch_size: 1
        bypass_guidance_embedding: false
        steps: 7000
        gradient_accumulation: 1
        train_unet: true
        train_text_encoder: false
        gradient_checkpointing: true
        noise_scheduler: flowmatch
        optimizer: adamw8bit
        timestep_type: weighted
        content_or_style: balanced
        optimizer_params:
          weight_decay: 0.0001
        unload_text_encoder: false
        cache_text_embeddings: true
        lr: 0.0001
        ema_config:
          use_ema: false
          ema_decay: 0.99
        skip_first_sample: true
        force_first_sample: false
        disable_sampling: true
        dtype: bf16
        diff_output_preservation: false
        diff_output_preservation_multiplier: 1
        diff_output_preservation_class: person
        switch_boundary_every: 1
        loss_type: mse
      model:
        name_or_path: Qwen/Qwen-Image-Edit-2509
        quantize: true
        qtype: uint3|ostris/accuracy_recovery_adapters/qwen_image_edit_2509_torchao_uint3.safetensors
        quantize_te: true
        qtype_te: qfloat8
        arch: qwen_image_edit_plus
        low_vram: true
        model_kwargs:
          match_target_res: true
        layer_offloading: true
        layer_offloading_text_encoder_percent: 1
        layer_offloading_transformer_percent: 1
      sample:
        sampler: flowmatch
        sample_every: 250
        width: 1024
        height: 1024
        samples: []
        neg: ''
        seed: 42
        walk_seed: true
        guidance_scale: 4
        sample_steps: 25
        num_frames: 1
        fps: 1
meta:
  name: test128
  version: '1.0'
Issue: When training a rank 128 Qwen Image Edit 2509 LoRA, training speed becomes highly inconsistent and slows down dramatically after a few epochs. Initially it runs at around 56 seconds per iteration, but after some time it degrades to ~200 seconds per iteration or more. A fresh install doesn't fix it. Training a LoRA at rank 64 or lower works fine.
There are:
- No out-of-memory (OOM) errors
- No SSD swap usage
- GPU load remains at around 100%
- VRAM and RAM usage stay constant
However, during this slowdown:
- GPU power draw and temperature drop sharply, indicating reduced actual compute even though the utilization metric still reports around 100% (see the monitoring sketch after the lists below)
- Training speed slows down dramatically, making progress extremely inefficient
Typical training behavior:
- ~55 °C GPU temperature
- ~115 W power draw
Problematic behavior:
- Noticeable drop in both temperature and power draw
- GPU still reports around 100% utilization
- Dramatic slowdown in training performance
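For anyone wanting to confirm that the 100% utilization reading is masking a stall rather than real compute, here is a minimal telemetry sketch. It assumes the third-party pynvml package, which is not an ai-toolkit dependency; running nvidia-smi with --query-gpu periodically would show the same fields. Run it in a second terminal during training: a memory-bound stall shows up as high "GPU util" with falling power draw and SM clocks.

```python
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

try:
    while True:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)        # .gpu / .memory in %
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # NVML reports milliwatts
        temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        sm_mhz = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
        print(f"util={util.gpu:3d}%  mem_util={util.memory:3d}%  "
              f"power={power_w:6.1f}W  temp={temp_c}C  sm_clock={sm_mhz}MHz")
        time.sleep(10)
finally:
    pynvml.nvmlShutdown()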
Related issues: https://github.com/ostris/ai-toolkit/issues/390
I did some additional testing with rank 96; the same problem occurs.
I suspect the problem is available memory. Swap may be getting used at some moments; you should check it, because both RAM and VRAM are almost maxed out. Try an experiment: select only the 1024 resolution. I think the s/it would then be steadily high, and it would be on the low side with 512 only.
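For reference, a small sketch of how that swap check could be done while training runs (it assumes the third-party psutil package; the 10-second interval is arbitrary). If swap usage climbs exactly when the s/it numbers degrade, that would support the hypothesis above.

```python
import time
import psutil

while True:
    vm = psutil.virtual_memory()
    sw = psutil.swap_memory()
    print(f"RAM used: {vm.percent:5.1f}%  available: {vm.available / 2**30:6.1f} GiB  "
          f"swap used: {sw.used / 2**30:6.1f} GiB ({sw.percent:.1f}%)")
    time.sleep(10)
```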
I have noticed this on Flux. I start with 30 GB of VRAM used, and in the time between sample generations it can grow to 32 or even 33 GB, spilling over into RAM and drastically slowing training down (from 2-3 s/it to over 10 s/it) [RTX 5090; 128 GB RAM], or from roughly 80 s/it to more than 5 minutes/it on my 12 GB GPU [RTX 4080 Laptop; 64 GB RAM].
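To see whether that growth comes from the PyTorch allocator itself or from the driver backing allocations with shared system memory, a per-step check like the following sketch could help. It is purely illustrative; log_vram and the step argument are hypothetical names to be called from wherever fits your setup. On Windows, once reserved memory approaches the card's total VRAM, the driver can silently spill into shared system memory, which matches the gradual slowdown described above.

```python
import torch

def log_vram(step: int) -> None:
    free_b, total_b = torch.cuda.mem_get_info()        # driver's view of the device
    allocated = torch.cuda.memory_allocated() / 2**30  # tensors currently held by PyTorch
    reserved = torch.cuda.memory_reserved() / 2**30    # PyTorch caching-allocator pool
    print(f"step {step}: allocated={allocated:.2f} GiB  reserved={reserved:.2f} GiB  "
          f"free={free_b / 2**30:.2f}/{total_b / 2**30:.2f} GiB")
```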
I suspect the problem is available memory. Swap may be getting used at some moments; you should check it, because both RAM and VRAM are almost maxed out. Try an experiment: select only the 1024 resolution. I think the s/it would then be steadily high, and it would be on the low side with 512 only.
Testing with only the 1024 resolution and Match Target Resolution turned off, I get this error:
test128_1024: 0%| | 0/7000 [03:31<?, ?it/s, lr: 1.0e-04 loss: 9.839e-02]Error running job: UintxTensor dispatch: attempting to run unimplemented operator/function: func=<OpOverload(op='aten.t', overload='default')>, types=(<class 'torchao.dtypes.uintx.uintx_layout.UintxTensor'>,), arg_types=(<class 'torchao.dtypes.uintx.uintx_layout.UintxTensor'>,), kwarg_types={}
========================================
Result:
- 0 completed jobs
- 1 failure
========================================
Traceback (most recent call last):
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\run.py", line 120, in <module>
    main()
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\run.py", line 108, in main
    raise e
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\run.py", line 96, in main
    job.run()
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\jobs\ExtensionJob.py", line 22, in run
    process.run()
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\jobs\process\BaseSDTrainProcess.py", line 2162, in run
    loss_dict = self.hook_train_loop(batch_list)
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\extensions_built_in\sd_trainer\SDTrainer.py", line 2051, in hook_train_loop
    loss = self.train_single_accumulation(batch)
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\extensions_built_in\sd_trainer\SDTrainer.py", line 2026, in train_single_accumulation
    self.accelerator.backward(loss)
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\venv\Lib\site-packages\accelerate\accelerator.py", line 2740, in backward
    loss.backward(**kwargs)
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\venv\Lib\site-packages\torch\_tensor.py", line 647, in backward
    torch.autograd.backward(
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\venv\Lib\site-packages\torch\autograd\__init__.py", line 354, in backward
    _engine_run_backward(
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\venv\Lib\site-packages\torch\autograd\graph.py", line 829, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\venv\Lib\site-packages\torch\autograd\function.py", line 311, in apply
    return user_fn(self, *args)
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\toolkit\memory_management\manager_modules.py", line 264, in backward
    grad_input = grad_out.to(dtype=target_dtype) @ w_bwd_buffers[idx]
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\venv\Lib\site-packages\torchao\utils.py", line 425, in _dispatch__torch_function__
    return func(*args, **kwargs)
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\venv\Lib\site-packages\torchao\utils.py", line 440, in _dispatch__torch_dispatch__
    return cls._ATEN_OP_OR_TORCH_FN_TABLE[func](func, types, args, kwargs)
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\venv\Lib\site-packages\torchao\utils.py", line 401, in wrapper
    return func(f, types, args, kwargs)
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\venv\Lib\site-packages\torchao\dtypes\affine_quantized_tensor_ops.py", line 346, in _
    weight_tensor = weight_tensor.t()
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\venv\Lib\site-packages\torchao\utils.py", line 440, in _dispatch__torch_dispatch__
    return cls._ATEN_OP_OR_TORCH_FN_TABLE[func](func, types, args, kwargs)
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\venv\Lib\site-packages\torchao\utils.py", line 401, in wrapper
    return func(f, types, args, kwargs)
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\venv\Lib\site-packages\torchao\dtypes\affine_quantized_tensor_ops.py", line 410, in _
    tensor.tensor_impl.t(),
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\venv\Lib\site-packages\torchao\dtypes\uintx\plain_layout.py", line 153, in __torch_dispatch__
    tensor.int_data.t(), tensor.scale, tensor.zero_point, tensor._layout
  File "C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\venv\Lib\site-packages\torchao\utils.py", line 444, in _dispatch__torch_dispatch__
    raise NotImplementedError(
NotImplementedError: UintxTensor dispatch: attempting to run unimplemented operator/function: func=<OpOverload(op='aten.t', overload='default')>, types=(<class 'torchao.dtypes.uintx.uintx_layout.UintxTensor'>,), arg_types=(<class 'torchao.dtypes.uintx.uintx_layout.UintxTensor'>,), kwarg_types={}
test128_1024: 0%| | 0/7000 [07:47<?, ?it/s, lr: 1.0e-04 loss: 9.839e-02]
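For context on what the traceback shows: the backward matmul needs the quantized weight transposed, and torchao's uintx plain layout has no aten.t implementation, so the dispatch falls through to the NotImplementedError. The following is a purely hypothetical workaround sketch (not the ai-toolkit code and not a confirmed fix): materialize a dequantized copy of the weight before the backward matmul so no transpose is ever dispatched on the uintx tensor subclass.

```python
import torch

def backward_matmul(grad_out: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Hypothetical sketch of the failing pattern from the traceback above:
    # grad_input = grad_out @ W, where the matmul internally transposes W.
    # Workaround idea (an assumption, not the project's fix): torchao's
    # AffineQuantizedTensor exposes dequantize(), so a plain high-precision
    # copy can stand in for the quantized weight during backward.
    if hasattr(w, "dequantize"):
        w = w.dequantize()
    return grad_out.to(dtype=w.dtype) @ w
```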
Update:
I conducted additional testing using Match Target Resolution with three selected resolutions (512, 768, and 1024). Below are the results:
Test 1: Rank 64 Configuration (Working Setup)
- Setup: A configuration that has been verified to work well in the past (I posted it here before).
- Note: My Windows UI is set to my native language, so the Yandex OCR of the screenshot might not be perfect, but the translation is accurate.
Test 2: Rank 72 Configuration
- Setup: I increased the rank to 72 while keeping the other settings identical to Test 1.
- Observation: Although there was still some unused shared GPU memory, the training step time increased significantly, from 21.03 s/iteration to 249.81 s/iteration.
Even with rank 72 there was still available shared GPU memory, yet the GPU's power draw and temperature dropped significantly during training despite the reported 100% utilization, and training slowed drastically (see the rough memory estimate below):
- Before: around 21 seconds per iteration.
- After: over 249 seconds per iteration, despite no apparent memory overflow or VRAM issues.
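As a rough sanity check on why rank alone can tip this over, here is an illustrative back-of-the-envelope sketch. The layer size, layer count, and per-parameter byte costs are assumptions for illustration, not measured ai-toolkit numbers.

```python
# LoRA adds rank * (d_in + d_out) parameters per adapted linear layer, so going
# from rank 64 to 128 doubles the trainable weights, their gradients, and the
# optimizer state that must stay resident alongside the base model.
def lora_extra_bytes(d_in: int, d_out: int, rank: int,
                     weight_bytes: int = 2,   # bf16 weights
                     grad_bytes: int = 2,     # bf16 gradients
                     optim_bytes: int = 2):   # two 8-bit adamw moments
    params = rank * (d_in + d_out)
    return params * (weight_bytes + grad_bytes + optim_bytes)

# e.g. a hypothetical 3072x3072 projection adapted in 360 places:
print(lora_extra_bytes(3072, 3072, 64) * 360 / 2**30, "GiB at rank 64")    # ~0.79 GiB
print(lora_extra_bytes(3072, 3072, 128) * 360 / 2**30, "GiB at rank 128")  # ~1.58 GiB
```

Whether that extra gigabyte or so is what pushes a 12 GB card into shared-memory territory is exactly what the monitoring sketches earlier in the thread should reveal.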
Here is what I get with your configuration and 36 images. Power draw is not at its limit; I usually see values around 250 W. That's a sign of VRAM shortage. It gives me 8 s/it on average.
By the way, why do you need such a high LR?
I used the default settings and changed only what I needed to for testing purposes. Isn't the RamTorch implementation supposed to help with that? I have offloading for both the transformer and the text encoder set to 100%.
I used the default settings and changed only what I needed to for testing purposes. Isn't the RamTorch implementation supposed to help with that? I have offloading for both the transformer and the text encoder set to 100%.
Unfortunately I tried this and the same thing happens. I was at 25% RamTorch offloading and everything was going smoothly until step 2500, then it slowed right down after that. I guess I'll try a higher RamTorch amount.
I have the same bug: every step was 30-40 s/it, but right after step 1800 I got 500+ s/it. I tried to wait, and it keeps computing steps in the 300-600 s/it range, so I don't think it's a memory issue. Sadly, my last checkpoint was 300+ steps away... I am training a WAN 2.1 LoRA, if that's important.
Stopping training and resuming from the last checkpoint works, so make checkpoints more often.
Hello, I can confirm the same issue with an RTX 4070 (12 GB VRAM). After around 1300 out of 5000 steps, it decreased from 5.xx s/it to 30.xx s/it. The power draw reaches at most 129 W out of 215 W, and most of the time it stays at 67 W / 215 W. I don't get any out-of-memory errors, nor any other error. Even stopping and restarting the training doesn't help.
Stopping training and resuming from the last checkpoint works, so make checkpoints more often.
It doesn't help for me. I tried everything: stopping the job and resuming, as well as restarting the system and resuming. At some point the speed decreases rapidly and you can't do anything about it.
I did some tests to make sure my system doesn't have an issue. If I use ComfyUI with z-image-turbo I get ~4.80 s/it. If I stop ComfyUI, start AI-Toolkit, and resume my LoRA training with z-image-turbo, this is the actual s/it rate: 40%|#### | 2002/5000 [05:39<55:38:34, 66.82s/it, lr: 8.0e-05 loss: 4.740e-01]
Even more interesting is that AI-Toolkit does not use the full power of my RTX 4070, as the screenshot below shows. If I use ComfyUI, the power draw stays at around 190 W during image creation.
It really looks like something goes wrong in AI-Toolkit at some point during a training job.
I'm having the same problem training Z-Image Turbo.
[64 GB RAM, RTX 3090 (24 GB VRAM), Windows 10]
This has been my first time using AI Toolkit, training a photoreal character LoRA of myself on Wan 2.1, and I seem to be experiencing odd slowdown (and speedup) symptoms. The overall drop in speed is making the training run take much longer than initially predicted - about two days for 2000 iterations at 512px.
Training began at 71 s/it and quickly sped up in chunks, settling into a more linear rate of improvement to reach 12.8 s/it by iteration 249.
At this point in the training, an epoch was saved and two sample images were generated.
Then the training rate progressively slowed, reaching 53 s/it at iteration 1249, before training died from a GPU OOM during sample image generation.
After restarting training at iteration 1250, it ran at 469 s/it and sped up in chunks, settling into a more linear trend to reach 152 s/it by iteration 1539.
I soon closed all processes and browser windows, then started ai-toolkit and restarted the same partially complete training job. It did not help.
Here are my settings for this training, in case any of them might offer a clue to the behavior described.
job: extension
config:
  name: testtrig01
  process:
    - type: diffusion_trainer
      training_folder: C:\ai-toolkit\output
      sqlite_db_path: C:\ai-toolkit\aitk_db.db
      device: cuda
      trigger_word: testtrig
      performance_log_every: 10
      network:
        type: lora
        linear: 64
        linear_alpha: 32
        conv: 16
        conv_alpha: 16
        lokr_full_rank: true
        lokr_factor: -1
        network_kwargs:
          ignore_if_contains: []
      save:
        dtype: bf16
        save_every: 250
        max_step_saves_to_keep: 4
        save_format: diffusers
        push_to_hub: false
      datasets:
        - folder_path: C:\ai-toolkit\datasets/testtrig
          mask_path: null
          mask_min_value: 0.1
          default_caption: ''
          caption_ext: txt
          caption_dropout_rate: 0.05
          cache_latents_to_disk: true
          is_reg: false
          network_weight: 1
          resolution:
            - 512
          controls: []
          shrink_video_to_frames: true
          num_frames: 1
          do_i2v: true
          flip_x: false
          flip_y: false
      train:
        batch_size: 1
        bypass_guidance_embedding: false
        steps: 2000
        gradient_accumulation: 1
        train_unet: true
        train_text_encoder: false
        gradient_checkpointing: true
        noise_scheduler: flowmatch
        optimizer: adamw8bit
        timestep_type: sigmoid
        content_or_style: balanced
        optimizer_params:
          weight_decay: 0.0001
        unload_text_encoder: false
        cache_text_embeddings: false
        lr: 0.0001
        ema_config:
          use_ema: true
          ema_decay: 0.99
        skip_first_sample: false
        force_first_sample: false
        disable_sampling: false
        dtype: bf16
        diff_output_preservation: true
        diff_output_preservation_multiplier: 1
        diff_output_preservation_class: man
        switch_boundary_every: 1
        loss_type: mse
      logging:
        log_every: 1
        use_ui_logger: true
      model:
        name_or_path: Wan-AI/Wan2.1-T2V-14B-Diffusers
        quantize: true
        qtype: qfloat8
        quantize_te: true
        qtype_te: qfloat8
        arch: wan21:14b
        low_vram: true
        model_kwargs: {}
      sample:
        sampler: flowmatch
        sample_every: 250
        width: 512
        height: 512
        samples:
          - prompt: testtrig, man, A man stands in a side profile view and smiles toward the camera. He is wearing a grey business suit, white dress shirt with a blue striped tie. The background includes a dark-colored wall with white trim.
        neg: ''
        seed: 42
        walk_seed: true
        guidance_scale: 5
        sample_steps: 30
        num_frames: 33
        fps: 16
meta:
  name: testtrig
  version: '1.0'
Just got this issue 30 minutes ago and now it's fine.
Run ComfyUI, right-click and choose Clear VRAM, close ComfyUI, then run ai-toolkit.
The VRAM (on my second 3090) was not being used between ai-toolkit runs, so using ComfyUI to attempt clearing it did not help, unfortunately.
I rebooted the system, but the slow training continued after restarting the prior training run. As others have said, the Power Draw is markedly lower than before, despite showing GPU Load = 100%.
And here's a curious symptom: previously, even with degraded GPU speed for training, the generation of sample images ran at "full" speed. However, I recently saw the first sample (of two) generated at the degraded speed, then the second sample generated at the normal GPU rate.
If I stop the AI Toolkit processes and run ComfyUI, it continues to operate at full speed.
I feel this continues to point at something in AI Toolkit (or its dependencies) as the cause of the slowdown.
Generating Images: 0%| | 0/2 [00:00<?, ?it/s]Unloading vae
Unloading text encoder
0%| | 0/30 [00:00<?, ?it/s]
3%|3 | 1/30 [01:55<55:46, 115.39s/it]
7%|6 | 2/30 [03:50<53:38, 114.95s/it]
10%|# | 3/30 [05:44<51:38, 114.77s/it]
13%|#3 | 4/30 [07:39<49:41, 114.69s/it]
17%|#6 | 5/30 [09:33<47:45, 114.63s/it]
20%|## | 6/30 [11:28<45:50, 114.61s/it]
23%|##3 | 7/30 [13:22<43:55, 114.60s/it]
27%|##6 | 8/30 [15:17<42:00, 114.59s/it]
30%|### | 9/30 [17:12<40:06, 114.61s/it]
33%|###3 | 10/30 [19:06<38:11, 114.59s/it]
37%|###6 | 11/30 [21:01<36:17, 114.60s/it]
40%|#### | 12/30 [22:55<34:22, 114.58s/it]
43%|####3 | 13/30 [24:50<32:27, 114.56s/it]
47%|####6 | 14/30 [26:44<30:33, 114.59s/it]
50%|##### | 15/30 [28:39<28:38, 114.56s/it]
53%|#####3 | 16/30 [30:34<26:44, 114.58s/it]
57%|#####6 | 17/30 [32:28<24:49, 114.56s/it]
60%|###### | 18/30 [34:23<22:54, 114.57s/it]
63%|######3 | 19/30 [36:17<21:00, 114.60s/it]
67%|######6 | 20/30 [38:12<19:06, 114.64s/it]
70%|####### | 21/30 [40:07<17:11, 114.63s/it]
73%|#######3 | 22/30 [42:01<15:16, 114.58s/it]
77%|#######6 | 23/30 [43:56<13:22, 114.63s/it]
80%|######## | 24/30 [45:50<11:27, 114.59s/it]
83%|########3 | 25/30 [47:45<09:32, 114.58s/it]
87%|########6 | 26/30 [49:39<07:38, 114.56s/it]
90%|######### | 27/30 [51:34<05:43, 114.55s/it]
93%|#########3| 28/30 [53:28<03:49, 114.55s/it]
97%|#########6| 29/30 [55:23<01:54, 114.56s/it]
100%|##########| 30/30 [57:18<00:00, 114.60s/it]
Loading Vae
Generating Images: 50%|##### | 1/2 [58:32<58:32, 3512.67s/it]Unloading vae
Unloading text encoder
0%| | 0/30 [00:00<?, ?it/s]
3%|3 | 1/30 [00:12<06:15, 12.94s/it]
7%|6 | 2/30 [00:25<06:02, 12.96s/it]
10%|# | 3/30 [00:38<05:50, 12.99s/it]
13%|#3 | 4/30 [00:51<05:38, 13.01s/it]
17%|#6 | 5/30 [01:05<05:25, 13.02s/it]
20%|## | 6/30 [01:18<05:12, 13.02s/it]
23%|##3 | 7/30 [01:31<04:59, 13.03s/it]
27%|##6 | 8/30 [01:44<04:46, 13.04s/it]
30%|### | 9/30 [01:57<04:33, 13.04s/it]
33%|###3 | 10/30 [02:10<04:20, 13.05s/it]
37%|###6 | 11/30 [02:23<04:07, 13.05s/it]
40%|#### | 12/30 [02:36<03:54, 13.05s/it]
43%|####3 | 13/30 [02:49<03:41, 13.05s/it]
47%|####6 | 14/30 [03:02<03:28, 13.05s/it]
50%|##### | 15/30 [03:15<03:15, 13.05s/it]
53%|#####3 | 16/30 [03:28<03:02, 13.05s/it]
57%|#####6 | 17/30 [03:41<02:49, 13.05s/it]
60%|###### | 18/30 [03:54<02:36, 13.06s/it]
63%|######3 | 19/30 [04:07<02:23, 13.06s/it]
67%|######6 | 20/30 [04:20<02:10, 13.06s/it]
70%|####### | 21/30 [04:33<01:57, 13.06s/it]
73%|#######3 | 22/30 [04:46<01:44, 13.06s/it]
77%|#######6 | 23/30 [04:59<01:31, 13.06s/it]
80%|######## | 24/30 [05:13<01:18, 13.06s/it]
83%|########3 | 25/30 [05:26<01:05, 13.06s/it]
87%|########6 | 26/30 [05:39<00:52, 13.06s/it]
90%|######### | 27/30 [05:52<00:39, 13.06s/it]
93%|#########3| 28/30 [06:05<00:26, 13.06s/it]
97%|#########6| 29/30 [06:18<00:13, 13.06s/it]
100%|##########| 30/30 [06:31<00:00, 13.05s/it]