Possible way to mitigate hard crashes on AMD gpus when using ROCm backend
Hi everyone, I stumbled across a possible way of mitigating the random hard crashes my RX 6800 AMD GPU experiences when running more complex workflows in ComfyUI.
Hardware:
- Ryzen 7 5800X3D
- 32GB DDR4 3200 MT/s RAM
- RX 6800 (non-XT) 16GB VRAM (Navi21, gfx1030)
Software:
- Nobara 39 @ Kernel 6.6.9
- ROCm 5.6
- Python 3.11.5 (separate venv managed via conda)
- pytorch 2.1.2+rocm5.6
- ComfyUI Revision: 1869 [66831eb6]
ComfyUI used to run fine with SD 1.5 and SDXL models; even larger image dimensions and bigger batch sizes were no problem as long as they fit into the 16GB of VRAM.
During more complex workflows, however, I would experience random hard crashes of the GPU that forced a system reboot. The crashes happened especially during subsequent VAE encoding/decoding steps or additional processing nodes like face detailer or upscaling. I was seemingly able to reduce the occurrence using --disable-smart-memory, but ultimately it still kept crashing randomly every other prompt run (some ran fine, others crashed at random steps in the workflow). I was able to completely circumvent the crashes by splitting long workflows into separate segments using node bypassing and running them sequentially, step by step.
Then I recently came across an old reddit post where a RX6800XT user experienced similar things when running Automatic1111: https://www.reddit.com/r/StableDiffusion/comments/12faj1y/amd_gpu_forced_to_reboot_on_linuxauto1111/
This user attributed the issues to how the ROCm backend handles garbage collection in the PyTorch module. I then adopted the ROCm environment variable mentioned in that post:
PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:6144
Since then I have not experienced any hard crashes when rerunning the same workflows that used to crash randomly before. I have no deeper knowledge of the ROCm backend settings and have simply adopted the values from the Reddit thread without any tinkering of my own on my 16GB VRAM card, but it alleviated most if not all of my hard crash issues running ComfyUI.
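For convenience, the variable can also be exported from a small wrapper script instead of prefixing every launch. This is only a sketch; it assumes a ComfyUI checkout in ~/ComfyUI and a conda environment named comfyui, both of which are placeholders for your own setup:
#!/usr/bin/env bash
# run_comfyui.sh - launch ComfyUI with the allocator settings from this post
# (checkout path and environment name are assumptions, adjust to your setup)
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate comfyui
export PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:6144
cd ~/ComfyUI
python main.py "$@"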
TL;DR
Running ComfyUI using ROCm with additional garbage collection parameters via
PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:6144 python main.py
on my 16GB RX6800 fixed my hard crashes in most if not all situations.
The steps mentioned above might also help with other AMD cards, but I would assume the allocator parameters have to be adjusted according to VRAM size.
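Purely as an illustration of how the split size might be scaled with card size; these values are untested guesses, not recommendations. As far as I understand, garbage_collection_threshold is a fraction of the memory reserved by PyTorch, so it probably does not need to change with VRAM size:
# 16GB card (the value from this post)
export PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:6144
# 12GB card (untested guess, scaled roughly with VRAM)
export PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:4608
# 8GB card (untested guess, scaled roughly with VRAM)
export PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:3072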
If this fix ends up helping other AMD users, it might be worth mentioning as a possible troubleshooting step in the AMD section of the installation guide.
If anyone more knowledgeable about the ROCm backend can provide additional insight into the different settings, it would be appreciated and could help make the fix more universal.
If there are any outstanding questions I will try to provide as much information as possible.
I've had some soft locks as well. The only way to recover is to SSH into the machine and restart lightdm.
I was running ROCm 6.2 and adding the parameters you specified didn't help. However, if I start ComfyUI with --lowvram it is more stable. Before, I couldn't generate anything over 512x512 with the Flux model in fp8.
Have you had any success running Flux.1 GGUF on AMD?
I still get
Error occurred when executing FluxSamplerParams+:
HIP out of memory. Tried to allocate 64.00 MiB. GPU 0 has a total capacty of 512.00 MiB of which 2.00 MiB is free. Of the allocated memory 277.65 MiB is allocated by PyTorch, and 114.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF
File "/root/ComfyUI/execution.py", line 316, in execute
output_data, output_ui, has_subgraph = get_output_data(obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
File "/root/ComfyUI/execution.py", line 191, in get_output_data
return_values = _map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
File "/root/ComfyUI/execution.py", line 168, in _map_node_over_list
process_inputs(input_dict, i)
File "/root/ComfyUI/execution.py", line 157, in process_inputs
results.append(getattr(obj, func)(**inputs))
File "/root/ComfyUI/custom_nodes/ComfyUI_essentials/sampling.py", line 398, in execute
latent = samplercustomadvanced.sample(randnoise, guider, samplerobj, sigmas, latent_image)[1]
File "/root/ComfyUI/comfy_extras/nodes_custom_sampler.py", line 612, in sample
samples = guider.sample(noise.generate_noise(latent), latent_image, sampler, sigmas, denoise_mask=noise_mask, callback=callback, disable_pbar=disable_pbar, seed=noise.seed)
File "/root/ComfyUI/comfy/samplers.py", line 716, in sample
output = self.inner_sample(noise, latent_image, device, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
File "/root/ComfyUI/comfy/samplers.py", line 695, in inner_sample
samples = sampler.sample(self, sigmas, extra_args, callback, noise, latent_image, denoise_mask, disable_pbar)
File "/root/ComfyUI/comfy/samplers.py", line 600, in sample
samples = self.sampler_function(model_k, noise, sigmas, extra_args=extra_args, callback=k_callback, disable=disable_pbar, **self.extra_options)
File "/usr/local/lib64/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/ComfyUI/comfy/k_diffusion/sampling.py", line 144, in sample_euler
denoised = model(x, sigma_hat * s_in, **extra_args)
File "/root/ComfyUI/comfy/samplers.py", line 299, in __call__
out = self.inner_model(x, sigma, model_options=model_options, seed=seed)
File "/root/ComfyUI/comfy/samplers.py", line 682, in __call__
return self.predict_noise(*args, **kwargs)
File "/root/ComfyUI/comfy/samplers.py", line 685, in predict_noise
return sampling_function(self.inner_model, x, timestep, self.conds.get("negative", None), self.conds.get("positive", None), self.cfg, model_options=model_options, seed=seed)
File "/root/ComfyUI/comfy/samplers.py", line 279, in sampling_function
out = calc_cond_batch(model, conds, x, timestep, model_options)
File "/root/ComfyUI/comfy/samplers.py", line 228, in calc_cond_batch
output = model.apply_model(input_x, timestep_, **c).chunk(batch_chunks)
File "/root/ComfyUI/custom_nodes/ComfyUI-Advanced-ControlNet/adv_control/utils.py", line 68, in apply_model_uncond_cleanup_wrapper
return orig_apply_model(self, *args, **kwargs)
File "/root/ComfyUI/comfy/model_base.py", line 142, in apply_model
model_output = self.diffusion_model(xc, t, context=context, control=control, transformer_options=transformer_options, **extra_conds).float()
File "/usr/local/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/root/ComfyUI/comfy/ldm/flux/model.py", line 159, in forward
out = self.forward_orig(img, img_ids, context, txt_ids, timestep, y, guidance, control)
File "/root/ComfyUI/comfy/ldm/flux/model.py", line 130, in forward_orig
img = block(img, vec=vec, pe=pe)
File "/usr/local/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/root/ComfyUI/comfy/ldm/flux/layers.py", line 225, in forward
qkv, mlp = torch.split(self.linear1(x_mod), [3 * self.hidden_size, self.mlp_hidden_dim], dim=-1)
File "/usr/local/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/root/ComfyUI/custom_nodes/ComfyUI-GGUF/ops.py", line 146, in forward
weight, bias = self.get_weights(x.dtype)
File "/root/ComfyUI/custom_nodes/ComfyUI-GGUF/ops.py", line 125, in get_weights
weight = self.get_weight(self.weight, dtype)
File "/root/ComfyUI/custom_nodes/ComfyUI-GGUF/ops.py", line 117, in get_weight
weight = dequantize_tensor(tensor, dtype)
File "/root/ComfyUI/custom_nodes/ComfyUI-GGUF/dequant.py", line 18, in dequantize_tensor
out = dequantize(data, qtype, oshape, dtype=None)
File "/root/ComfyUI/custom_nodes/ComfyUI-GGUF/dequant.py", line 39, in dequantize
blocks = dequantize_blocks(blocks, block_size, type_size, dtype)
File "/root/ComfyUI/custom_nodes/ComfyUI-GGUF/dequant.py", line 110, in dequantize_blocks_Q4_0
qs = (qs & 0x0F).reshape((n_blocks, -1)).to(torch.int8) - 8
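The error message above points at allocator fragmentation ("reserved by PyTorch but unallocated"). A quick way to inspect the allocator state from the same Python environment is PyTorch's standard memory introspection, which should also work on the ROCm/HIP build (sketch only):
# prints (free bytes, total bytes) followed by the caching allocator summary
python -c "import torch; print(torch.cuda.mem_get_info()); print(torch.cuda.memory_summary())"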
I have used GGUF Q5 and it uses significantly less VRAM now. I still get occasional crashes, but I have also added these flags to ComfyUI so it doesn't happen as often:
--lowvram --reserve-vram 5
I have plenty of system RAM and I don't understand why it doesn't try to use that instead of just crashing and burning.
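For reference, those flags can be combined with the allocator variable from the original post into a single launch command; this exact combination is only a sketch and untested on my side:
PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:6144 python main.py --lowvram --reserve-vram 5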
@grigio It ran fine for me, although it seems a bit slower than the 'native' FP8 safetensors version (with minimal testing, this might be run-to-run variance at this point). On the other hand, it does seem to be less taxing on RAM overall.
My current setup
### Loading: ComfyUI-Manager (V2.50.3)
### ComfyUI Revision: 2641 [f1c23016] | Released on '2024-09-01'
Torch version: 2.5.0.dev20240804+rocm6.1
OS: Fedora Linux 40 (KDE Plasma) x86_64
Kernel: Linux 6.10.6-200.fc40.x86_64
CPU: AMD Ryzen 7 5800X3D (16) @ 3.40 GHz
GPU: AMD Radeon RX 6800 [Discrete] 16GB VRAM
Vulkan: 1.3.278 - radv [Mesa 24.1.6]
Memory: 12.22 GiB / 31.24 GiB (39%)
got prompt
100%|█████████████████████████████████████████████| 4/4 [01:16<00:00, 19.05s/it]
Requested to load AutoencodingEngine
Loading 1 new model
loaded completely 0.0 319.7467155456543 True
Prompt executed in 77.93 seconds
I have a ticket open with ROCm and they suggested some flags that reduce the crashes. The crashes still happen sometimes, but they are greatly reduced.
https://github.com/ROCm/ROCm/issues/3580#issuecomment-2315830998
Running ComfyUI once with --reserve-vram seems to have a more profound impact than I anticipated. In my limited testing it kept the PyTorch process from spilling from VRAM into GTT and thus eliminated swapping between VRAM and RAM.
This resulted in a speedup of up to 15% for the whole prompt process (CLIP to output) when using the native fp8 safetensors workflow (GGUF does not seem to improve significantly, though).
Weirdly, this effect persists even if subsequent runs are started without the --reserve-vram parameter at all. I'm not sure if it sets some kind of flag in the ROCm or PyTorch environment; sadly, I couldn't really find any info on the parameter in the documentation. The sluggishness and slowdowns due to swapping might return if other processes occupy more of the VRAM (since I run SD on the same GPU that also drives my display).
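To see whether the process spills from VRAM into GTT during a run, the memory counters can be watched in a second terminal; this assumes the rocm-smi tool that ships with ROCm is installed:
# refresh VRAM and GTT usage once per second while a prompt runs
watch -n 1 'rocm-smi --showmeminfo vram gtt'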
Hi, thanks for those tips! Do you still need to use those variables or parameters with current versions of PyTorch? I sometimes get those crashes, but they're too random to reproduce reliably.