Possible way to mitigate hard crashes on AMD gpus when using ROCm backend
Hi everyone, I stumbled across a possible way of mitigating the random hard crashes my RX 6800 AMD GPU experiences when running more complex workflows in ComfyUI.
Hardware:
- Ryzen 7 5800X3D
- 32GB DDR4 3200 MT/s RAM
- RX 6800 (non-XT) 16GB VRAM (Navi21, gfx1030)
Software:
- Nobara 39 @ Kernel 6.6.9
- ROCm 5.6
- Python 3.11.5 (separate venv managed via conda)
- pytorch 2.1.2+rocm5.6
- ComfyUI Revision: 1869 [66831eb6]
ComfyUI used to run fine with SD 1.5 and SDXL models; even larger image dimensions and bigger batch sizes were no problem as long as they fit into the 16GB of VRAM.
During more complex workflows, however, I would experience random hard crashes of the GPU that forced a system reboot. The crashes happened especially during subsequent VAE encoding/decoding steps or additional processing nodes like face detailer or upscaling. I was seemingly able to reduce the occurrence using --disable-smart-memory, but ultimately it still kept crashing randomly every other prompt run (some ran fine, others crashed at random steps in the workflow). I was able to completely circumvent the crashes by splitting long workflows into separate segments using node bypassing and running them sequentially, step by step.
Then I recently came across an old reddit post where a RX6800XT user experienced similar things when running Automatic1111: https://www.reddit.com/r/StableDiffusion/comments/12faj1y/amd_gpu_forced_to_reboot_on_linuxauto1111/
This user attributed the issues to how the ROCm backend handles garbage collection in the PyTorch module. I then adopted the ROCm environment variable mentioned in that post:
PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:6144
Since then I have not experienced any hard crashes when rerunning the same workflows that used to crash randomly before. I have no deeper knowledge of the ROCm backend settings and have simply adopted the values from the Reddit thread without any tinkering of my own on my 16GB VRAM card, but it alleviated most if not all of my hard crash issues running ComfyUI.
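For convenience, the variable can also be exported from a small wrapper script instead of prefixing every launch. This is only a sketch; it assumes a ComfyUI checkout in ~/ComfyUI and a conda environment named comfyui, both of which are placeholders for your own setup:
#!/usr/bin/env bash
# run_comfyui.sh - launch ComfyUI with the allocator settings from this post
# (checkout path and environment name are assumptions, adjust to your setup)
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate comfyui
export PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:6144
cd ~/ComfyUI
python main.py "$@"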
TL;DR
Running ComfyUI using ROCm with additional garbage collection parameters via
PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:6144 python main.py
on my 16GB RX6800 fixed my hard crashes in most if not all situations.
The steps mentioned above might also help with other AMD cards, but I would assume the allocator parameters have to be adjusted according to VRAM size.
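Purely as an illustration of how the split size might be scaled with card size; these values are untested guesses, not recommendations. As far as I understand, garbage_collection_threshold is a fraction of the memory reserved by PyTorch, so it probably does not need to change with VRAM size:
# 16GB card (the value from this post)
export PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:6144
# 12GB card (untested guess, scaled roughly with VRAM)
export PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:4608
# 8GB card (untested guess, scaled roughly with VRAM)
export PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:3072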
If this fix ends up helping other AMD users, it might be worth mentioning as a possible troubleshooting step in the AMD section of the installation guide.
If anyone more knowledgeable about the ROCm backend can provide additional insight into the different settings, it would be appreciated and could help make the fix more universal.
If there are any outstanding questions I will try to provide as much information as possible.
I've had some soft locks as well. The only way to recover is to SSH into the machine and restart lightdm.
I was running ROCm 6.2 and adding the parameters you specified didn't help. However, if I start ComfyUI with --lowvram it is more stable. Before, I couldn't generate anything over 512x512 with the Flux model in fp8.
Have you had any success running Flux.1 GGUF on AMD?
I still get
Error occurred when executing FluxSamplerParams+:
HIP out of memory. Tried to allocate 64.00 MiB. GPU 0 has a total capacty of 512.00 MiB of which 2.00 MiB is free. Of the allocated memory 277.65 MiB is allocated by PyTorch, and 114.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF
File "/root/ComfyUI/execution.py", line 316, in execute
output_data, output_ui, has_subgraph = get_output_data(obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
File "/root/ComfyUI/execution.py", line 191, in get_output_data
return_values = _map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
File "/root/ComfyUI/execution.py", line 168, in _map_node_over_list
process_inputs(input_dict, i)
File "/root/ComfyUI/execution.py", line 157, in process_inputs
results.append(getattr(obj, func)(**inputs))
File "/root/ComfyUI/custom_nodes/ComfyUI_essentials/sampling.py", line 398, in execute
latent = samplercustomadvanced.sample(randnoise, guider, samplerobj, sigmas, latent_image)[1]
File "/root/ComfyUI/comfy_extras/nodes_custom_sampler.py", line 612, in sample
samples = guider.sample(noise.generate_noise(latent), latent_image, sampler, sigmas, denoise_mask=noise_mask, callback=callback, disable_pbar=disable_pbar, seed=noise.seed)
File "/root/ComfyUI/comfy/samplers.py", line 716, in sample
output = self.inner_sample(noise, latent_image, device, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
File "/root/ComfyUI/comfy/samplers.py", line 695, in inner_sample
samples = sampler.sample(self, sigmas, extra_args, callback, noise, latent_image, denoise_mask, disable_pbar)
File "/root/ComfyUI/comfy/samplers.py", line 600, in sample
samples = self.sampler_function(model_k, noise, sigmas, extra_args=extra_args, callback=k_callback, disable=disable_pbar, **self.extra_options)
File "/usr/local/lib64/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/ComfyUI/comfy/k_diffusion/sampling.py", line 144, in sample_euler
denoised = model(x, sigma_hat * s_in, **extra_args)
File "/root/ComfyUI/comfy/samplers.py", line 299, in __call__
out = self.inner_model(x, sigma, model_options=model_options, seed=seed)
File "/root/ComfyUI/comfy/samplers.py", line 682, in __call__
return self.predict_noise(*args, **kwargs)
File "/root/ComfyUI/comfy/samplers.py", line 685, in predict_noise
return sampling_function(self.inner_model, x, timestep, self.conds.get("negative", None), self.conds.get("positive", None), self.cfg, model_options=model_options, seed=seed)
File "/root/ComfyUI/comfy/samplers.py", line 279, in sampling_function
out = calc_cond_batch(model, conds, x, timestep, model_options)
File "/root/ComfyUI/comfy/samplers.py", line 228, in calc_cond_batch
output = model.apply_model(input_x, timestep_, **c).chunk(batch_chunks)
File "/root/ComfyUI/custom_nodes/ComfyUI-Advanced-ControlNet/adv_control/utils.py", line 68, in apply_model_uncond_cleanup_wrapper
return orig_apply_model(self, *args, **kwargs)
File "/root/ComfyUI/comfy/model_base.py", line 142, in apply_model
model_output = self.diffusion_model(xc, t, context=context, control=control, transformer_options=transformer_options, **extra_conds).float()
File "/usr/local/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/root/ComfyUI/comfy/ldm/flux/model.py", line 159, in forward
out = self.forward_orig(img, img_ids, context, txt_ids, timestep, y, guidance, control)
File "/root/ComfyUI/comfy/ldm/flux/model.py", line 130, in forward_orig
img = block(img, vec=vec, pe=pe)
File "/usr/local/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/root/ComfyUI/comfy/ldm/flux/layers.py", line 225, in forward
qkv, mlp = torch.split(self.linear1(x_mod), [3 * self.hidden_size, self.mlp_hidden_dim], dim=-1)
File "/usr/local/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/root/ComfyUI/custom_nodes/ComfyUI-GGUF/ops.py", line 146, in forward
weight, bias = self.get_weights(x.dtype)
File "/root/ComfyUI/custom_nodes/ComfyUI-GGUF/ops.py", line 125, in get_weights
weight = self.get_weight(self.weight, dtype)
File "/root/ComfyUI/custom_nodes/ComfyUI-GGUF/ops.py", line 117, in get_weight
weight = dequantize_tensor(tensor, dtype)
File "/root/ComfyUI/custom_nodes/ComfyUI-GGUF/dequant.py", line 18, in dequantize_tensor
out = dequantize(data, qtype, oshape, dtype=None)
File "/root/ComfyUI/custom_nodes/ComfyUI-GGUF/dequant.py", line 39, in dequantize
blocks = dequantize_blocks(blocks, block_size, type_size, dtype)
File "/root/ComfyUI/custom_nodes/ComfyUI-GGUF/dequant.py", line 110, in dequantize_blocks_Q4_0
qs = (qs & 0x0F).reshape((n_blocks, -1)).to(torch.int8) - 8
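The error message above points at allocator fragmentation ("reserved by PyTorch but unallocated"). A quick way to inspect the allocator state from the same Python environment is PyTorch's standard memory introspection, which should also work on the ROCm/HIP build (sketch only):
# prints (free bytes, total bytes) followed by the caching allocator summary
python -c "import torch; print(torch.cuda.mem_get_info()); print(torch.cuda.memory_summary())"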
I have used GGUF Q5 and it uses significantly less VRAM now. I still get occasional crashes, but I have also added these flags to ComfyUI so it doesn't happen as often:
--lowvram --reserve-vram 5
I have plenty of system RAM and I don't understand why it doesn't try to use that instead of just crashing and burning.
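For reference, those flags can be combined with the allocator variable from the original post into a single launch command; this exact combination is only a sketch and untested on my side:
PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:6144 python main.py --lowvram --reserve-vram 5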
@grigio It ran fine for me, although it seems a bit slower than the 'native' FP8 safetensors version (with minimal testing, this might be run-to-run variance at this point). On the other hand, it does seem to be less taxing on RAM overall.
My current setup
### Loading: ComfyUI-Manager (V2.50.3)
### ComfyUI Revision: 2641 [f1c23016] | Released on '2024-09-01'
Torch version: 2.5.0.dev20240804+rocm6.1
OS: Fedora Linux 40 (KDE Plasma) x86_64
Kernel: Linux 6.10.6-200.fc40.x86_64
CPU: AMD Ryzen 7 5800X3D (16) @ 3.40 GHz
GPU: AMD Radeon RX 6800 [Discrete] 16GB VRAM
Vulkan: 1.3.278 - radv [Mesa 24.1.6]
Memory: 12.22 GiB / 31.24 GiB (39%)
got prompt
100%|█████████████████████████████████████████████| 4/4 [01:16<00:00, 19.05s/it]
Requested to load AutoencodingEngine
Loading 1 new model
loaded completely 0.0 319.7467155456543 True
Prompt executed in 77.93 seconds
I have a ticket open with ROCm and they suggested some flags that reduce the crashes. The crashes still happen sometimes, but they are greatly reduced.
https://github.com/ROCm/ROCm/issues/3580#issuecomment-2315830998
Running ComfyUI once with --reserve-vram seems to have a more profound impact than I anticipated. In my limited testing it kept the PyTorch process from spilling from VRAM into GTT and thus eliminated swapping between VRAM and RAM.
This resulted in a speedup of up to 15% for the whole prompt process (CLIP to output) when using the native fp8 safetensors workflow (GGUF does not seem to improve significantly, though).
Weirdly, this effect persists even if subsequent runs are started without the --reserve-vram parameter at all. I'm not sure if it sets some kind of flag in the ROCm or PyTorch environment; sadly, I couldn't really find any info on the parameter in the documentation. The sluggishness and slowdowns due to swapping might return if other processes occupy more of the VRAM (since I run SD on the same GPU that also drives my display).
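To see whether the process spills from VRAM into GTT during a run, the memory counters can be watched in a second terminal; this assumes the rocm-smi tool that ships with ROCm is installed:
# refresh VRAM and GTT usage once per second while a prompt runs
watch -n 1 'rocm-smi --showmeminfo vram gtt'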
Hi, thanks for those tips! Do you still need to use those variables or parameters with current versions of PyTorch? I sometimes get those crashes, but they're too random to reproduce reliably.