[Issue]: UI out-of-sync using Runpod via Cloudflare reverse proxy
Issue Description
When sdnext is accessed through a Cloudflare tunnel reverse proxy, as with Runpod, long-running jobs can cause the UI to lose sync with the backend, for example leaving it stuck on the "Generate" task.
As mentioned by vlad, the root cause is Runpod's use of Cloudflare as a reverse proxy: it enforces a maximum connection duration and then hard-disconnects. As a result, any long-running job can run into this (higher batch sizes mean longer durations and thus a higher chance of hitting it).
A possible workaround is a direct connection rather than a Cloudflare tunnel; I will test whether this circumvents the issue completely.
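For context on why only the UI goes stale: the job keeps running on the backend (as the log below shows) and its state stays queryable over the REST API, so it is only the browser's long-lived request that the tunnel cuts. A minimal polling sketch along those lines, assuming the A1111-compatible `/sdapi/v1/progress` endpoint and its usual response fields (my assumption, not taken from the log); the URL is a placeholder:

```python
import time

import requests

BASE = "https://<pod-id>-7860.proxy.runpod.net"  # placeholder: your pod's proxied URL

while True:
    try:
        # The progress endpoint reports the running job independently of the
        # browser session, so it still answers after the tunnel drops a request.
        state = requests.get(f"{BASE}/sdapi/v1/progress", timeout=10).json()
    except requests.RequestException:
        # This particular request was cut by the proxy; back off and re-poll
        # instead of trusting whatever state the UI last displayed.
        time.sleep(5)
        continue
    print(f"progress={state.get('progress')} eta={state.get('eta_relative')}")
    if not state.get("state", {}).get("job"):  # nothing running on the backend anymore
        break
    time.sleep(5)
```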
Version Platform Description
Runpod tunnels, Cloudflare tunnels.
Relevant log output
22:45:49-015456 INFO Processed: images=1 its=1.43 time=6.99 timers={'hires': 48.89, 'pipeline': 29.45, 'prompt': 24.74, 'preview': 22.51, 'decode': 14.31, 'offload': 11.56, 'init': 4.15, 'move': 4.12, 'gc': 0.57, 'post': 0.23, 'validate': 0.07} memory={'ram': {'used': 7.59, 'total': 46.57}, 'job': 'Inpaint', 'gpu': {'used': 7.12, 'total': 23.67}, 'active': 1.78, 'peak': 17.05, 'retries': 0, 'oom': 0, 'swap': 0}
22:45:49-176212 TRACE Server: alive=True requests=308 memory=7.58/46.57 status='running' task='Generate' timestamp='20250513224315' current='task(9vh7kw3snqhwuyd)' id='5e983102b43748d' job=0 jobs=0 total=1 step=0 steps=0 queued=0 uptime=375 elapsed=153.76 eta=None progress=0
22:45:50-134777 INFO Save: image="outputs/control/00526-2025-05-13-Illustrious - animeScreenshotMerge_v30_1631000.png" type=PNG width=1824 height=1248 size=2447332
22:45:50-226681 INFO Save: image="outputs/grids/00065-2025-05-13-Illustrious - animeScreenshotMerge_v30_1631000-grid.jpg" type=JPEG width=3648 height=1248 size=487510
22:45:50-230651 INFO Processed: images=3 its=0.59 time=101.73 timers={'hires': 48.89, 'pipeline': 29.45, 'prompt': 24.74, 'preview': 22.51, 'decode': 14.31, 'offload': 11.56, 'init': 4.15, 'move': 4.12, 'post': 1.45, 'gc': 0.57, 'validate': 0.07} memory={'ram': {'used': 7.61, 'total': 46.57}, 'job':
22:47:49-198330 TRACE Server: alive=True requests=343 memory=7.49/46.57 status='running' task='Generate' timestamp='20250513224315' current='' id='5e983102b43748d' job=0 jobs=0 total=1 step=0 steps=0 queued=0 uptime=495 elapsed=273.78 eta=None progress=0
22:49:49-224386 TRACE Server: alive=True requests=367 memory=7.5/46.57 status='running' task='Generate' timestamp='20250513224315' current='' id='5e983102b43748d' job=0 jobs=0 total=1 step=0 steps=0 queued=0 uptime=615 elapsed=393.81 eta=None progress=0
22:51:49-246756 TRACE Server: alive=True requests=391 memory=7.5/46.57 status='running' task='Generate' timestamp='20250513224315' current='' id='5e983102b43748d' job=0 jobs=0 total=1 step=0 steps=0 queued=0 uptime=735 elapsed=513.83 eta=None progress=0
22:53:49-270137 TRACE Server: alive=True requests=415 memory=7.5/46.57 status='running' task='Generate' timestamp='20250513224315' current='' id='5e983102b43748d' job=0 jobs=0 total=1 step=0 steps=0 queued=0 uptime=855 elapsed=633.85 eta=None progress=0
22:55:49-297153 TRACE Server: alive=True requests=439 memory=7.5/46.57 status='running' task='Generate' timestamp='20250513224315' current='' id='5e983102b43748d' job=0 jobs=0 total=1 step=0 steps=0 queued=0 uptime=975 elapsed=753.88 eta=None progress=0
22:57:49-319113 TRACE Server: alive=True requests=463 memory=7.5/46.57 status='running' task='Generate' timestamp='20250513224315' current='' id='5e983102b43748d' job=0 jobs=0 total=1 step=0 steps=0 queued=0 uptime=1095 elapsed=873.9 eta=None progress=0
22:59:49-340236 TRACE Server: alive=True requests=487 memory=7.5/46.57 status='running' task='Generate' timestamp='20250513224315' current='' id='5e983102b43748d' job=0 jobs=0 total=1 step=0 steps=0 queued=0 uptime=1215 elapsed=993.92 eta=None progress=0
Backend
Diffusers
Compute
nVidia CUDA
UI
ModernUI
Branch
Dev
Model
StableDiffusion XL
Acknowledgements
- [x] I have read the above and searched for existing issues
- [x] I confirm that this is classified correctly and it's not an extension issue
Right now, Cloudflare hard-enforces a 100-second timeout, which means any job that takes longer than 100 seconds will hit that timeout and the UI will get disconnected, causing client and server to go out-of-sync. The workaround is to use a direct IP connection instead of the HTTP proxy when/if the provider offers that (Runpod does, some others don't).
There are similar issues with other providers.
I'm looking into whether there is a way to work around this, so keeping this issue open for now.
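One way to confirm it is the ~100-second cutoff rather than anything in sdnext itself is to send the same long-running request once through the proxied URL and once to the direct IP/port, and time how long each connection survives. A rough sketch, assuming the A1111-compatible `/sdapi/v1/txt2img` endpoint; URLs and payload values are placeholders chosen only so that generation comfortably exceeds 100 seconds:

```python
import time

import requests

# Placeholder endpoints: the Cloudflare-proxied URL vs the pod's direct IP:port.
ENDPOINTS = [
    "https://<pod-id>-7860.proxy.runpod.net",
    "http://<direct-ip>:<direct-port>",
]
# Payload only needs to be heavy enough that generation takes well over 100 s.
PAYLOAD = {"prompt": "test", "steps": 50, "batch_size": 8}

for base in ENDPOINTS:
    start = time.monotonic()
    try:
        resp = requests.post(f"{base}/sdapi/v1/txt2img", json=PAYLOAD, timeout=600)
        print(f"{base}: HTTP {resp.status_code} after {time.monotonic() - start:.0f}s")
    except requests.RequestException as err:
        # On the proxied URL this tends to fire at roughly the 100 s mark.
        print(f"{base}: dropped after {time.monotonic() - start:.0f}s ({err.__class__.__name__})")
```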
I'm not using Paperspace with Cloudflare, but I do reverse-proxy things through a tunnel, and sometimes websocket connections get stuck. Note that it happens even on a LAN if the connection is reverse-proxied. Tested with nginx and Caddy: same problems. For the record, dropping the tunnel (or resetting the connection) is enough to "unstick" it.
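A quick way to observe that behaviour from the outside is to open a websocket through the proxy and simply wait for it to drop. A small probe sketch; the `websockets` package and the URL are my choices for illustration, not anything from sdnext, so point it at whichever websocket endpoint your proxy exposes:

```python
import asyncio
import time

import websockets  # pip install websockets


async def probe(url: str) -> None:
    """Open a websocket through the reverse proxy and report how long it lives."""
    start = time.monotonic()
    async with websockets.connect(url) as ws:
        # The client pings every 20 s by default, so an idle but healthy
        # connection stays open; a proxy-enforced cutoff closes it anyway.
        await ws.wait_closed()
    print(f"connection closed after {time.monotonic() - start:.0f}s")


if __name__ == "__main__":
    # Placeholder URL: substitute the websocket endpoint behind your tunnel.
    asyncio.run(probe("wss://<your-proxied-host>/<websocket-path>"))
```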