
Finish task with error: CUDA out of memory

Open · EvgeniyWis opened this issue 8 months ago · 1 comment

My endpoint on RunPod works correctly, but after some time and several generations I often start getting this error:

qw2kmmdraszo23 [error] [2025-05-22 08:36:54] ERROR [Task Queue] Finish task with error: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 79.25 GiB of which 960.00 KiB is free. Process 1074130 has 79.24 GiB memory in use. Of the allocated memory 77.18 GiB is allocated by PyTorch, and 1.55 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables), job_id=3aad4185-32e6-4517-bb77-09a442b3c3f1

Moreover, the amount of GPU memory only affects how long it takes for this error to appear. I found an issue with a similar error and tried setting Allowed CUDA Versions to 12.1, but that also only delayed the problem rather than solving it (the same goes for switching to 12.8). Any ideas on how to fix this "once and for all"? Or can someone tell me where to put PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True?

EvgeniyWis · May 22 '25 14:05

If an 80GB card is not enough for Fooocus, something is definitely off. Have you tried the other solutions from that issue?

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True can be set in the Dockerfile as an environment variable. But that only helps with allocator efficiency (fragmentation), not with the total amount of memory used.
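
For reference, a minimal sketch of how that could look, assuming you build a custom image on top of this repo's Dockerfile (the actual base image and surrounding instructions will differ):

```dockerfile
# Hypothetical excerpt; adjust to the actual Dockerfile of your worker image.
# The variable has to be present before the Python process imports torch,
# so an image-level ENV is the simplest place for it.
ENV PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```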

More likely, you're gradually filling VRAM with model data over multiple requests on a warm worker, which eventually results in the OOM error. Still, you would have to load quite a lot to fill 80 GiB of VRAM. Are you dynamically switching or stacking many models or ControlNets? I would start with the memory-related Fooocus flags. If that doesn't help, we can try monitoring the memory or explicitly freeing tensors at the end of the FastAPI request lifecycle.
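
As an illustration of that last idea, here is a rough sketch of per-request VRAM monitoring and cleanup, assuming a FastAPI app with PyTorch available in the worker; the middleware itself is hypothetical and not code from this repo:

```python
import gc

import torch
from fastapi import FastAPI, Request

app = FastAPI()


@app.middleware("http")
async def log_and_free_vram(request: Request, call_next):
    # Run the actual generation request first.
    response = await call_next(request)

    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**3
        reserved = torch.cuda.memory_reserved() / 1024**3
        print(f"[VRAM] allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")

        # Drop unreferenced Python objects and return cached blocks to the driver.
        # This does not unload models that Fooocus intentionally keeps resident.
        gc.collect()
        torch.cuda.empty_cache()

    return response
```

Note that empty_cache() only releases cached, unreferenced blocks; it won't unload models Fooocus keeps in VRAM on purpose, so this mainly helps confirm whether usage really keeps growing across requests.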

davefojtik · May 23 '25 06:05

Well, I'm still at the testing stage, but I changed the flags in start.sh ("--always-low-vram" was added and "--always-gpu" was removed):

python main.py --skip-pip --disable-in-browser --disable-offload-from-vram --always-low-vram &

So far I have not received the CUDA error again, so I think the issue can be considered closed for now.

EvgeniyWis · May 24 '25 22:05

Well, it wasn't long before I started getting this error from the workers:

WARN: very high memory utilization: 46.57GiB / 46.57GiB (100 %)
WARN: container is unhealthy: triggered memory limits (OOM)
WARN: container is unhealthy: triggered memory limits (OOM)
WARN: container is unhealthy: triggered memory limits (OOM)
WARN: container is unhealthy: triggered memory limits (OOM)
stop container 48482287e2a36af3cadda7a801cceb3662d0a14a6f99186665ff15f7a1ab8980

EvgeniyWis · May 29 '25 11:05