[Issue]: UI out-of-sync using Runpod via Cloudflare reverse proxy
Issue Description
When sdnext is accessed through a Cloudflare tunnel reverse proxy, as with Runpod, long-running jobs can cause the UI to lose sync with the backend, for example leaving it stuck on the "Generate" task.
As mentioned by vlad, the root cause is Runpod's use of Cloudflare as a reverse proxy: it enforces a maximum connection duration and then hard-disconnects. As a result, any long-running job can run into this (higher batch sizes mean longer durations and thus a higher chance of hitting it).
A possible workaround is a direct connection rather than a Cloudflare tunnel; I will test whether this circumvents the issue completely.
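For context on why only the UI goes stale: the job keeps running on the backend (as the log below shows) and its state stays queryable over the REST API, so it is only the browser's long-lived request that the tunnel cuts. A minimal polling sketch along those lines, assuming the A1111-compatible `/sdapi/v1/progress` endpoint and its usual response fields (my assumption, not taken from the log); the URL is a placeholder:

```python
import time

import requests

BASE = "https://<pod-id>-7860.proxy.runpod.net"  # placeholder: your pod's proxied URL

while True:
    try:
        # The progress endpoint reports the running job independently of the
        # browser session, so it still answers after the tunnel drops a request.
        state = requests.get(f"{BASE}/sdapi/v1/progress", timeout=10).json()
    except requests.RequestException:
        # This particular request was cut by the proxy; back off and re-poll
        # instead of trusting whatever state the UI last displayed.
        time.sleep(5)
        continue
    print(f"progress={state.get('progress')} eta={state.get('eta_relative')}")
    if not state.get("state", {}).get("job"):  # nothing running on the backend anymore
        break
    time.sleep(5)
```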
Version Platform Description
Runpod tunnels, Cloudflare tunnels.
Relevant log output
22:45:49-015456 INFO Processed: images=1 its=1.43 time=6.99 timers={'hires': 48.89, 'pipeline': 29.45, 'prompt': 24.74, 'preview': 22.51, 'decode': 14.31, 'offload': 11.56, 'init': 4.15, 'move': 4.12, 'gc': 0.57, 'post': 0.23, 'validate': 0.07} memory={'ram': {'used': 7.59, 'total': 46.57}, 'job': 'Inpaint', 'gpu': {'used': 7.12, 'total': 23.67}, 'active': 1.78, 'peak': 17.05, 'retries': 0, 'oom': 0, 'swap': 0}
22:45:49-176212 TRACE Server: alive=True requests=308 memory=7.58/46.57 status='running' task='Generate' timestamp='20250513224315' current='task(9vh7kw3snqhwuyd)' id='5e983102b43748d' job=0 jobs=0 total=1 step=0 steps=0 queued=0 uptime=375 elapsed=153.76 eta=None progress=0
22:45:50-134777 INFO Save: image="outputs/control/00526-2025-05-13-Illustrious - animeScreenshotMerge_v30_1631000.png" type=PNG width=1824 height=1248 size=2447332
22:45:50-226681 INFO Save: image="outputs/grids/00065-2025-05-13-Illustrious - animeScreenshotMerge_v30_1631000-grid.jpg" type=JPEG width=3648 height=1248 size=487510
22:45:50-230651 INFO Processed: images=3 its=0.59 time=101.73 timers={'hires': 48.89, 'pipeline': 29.45, 'prompt': 24.74, 'preview': 22.51, 'decode': 14.31, 'offload': 11.56, 'init': 4.15, 'move': 4.12, 'post': 1.45, 'gc': 0.57, 'validate': 0.07} memory={'ram': {'used': 7.61, 'total': 46.57}, 'job':
22:47:49-198330 TRACE Server: alive=True requests=343 memory=7.49/46.57 status='running' task='Generate' timestamp='20250513224315' current='' id='5e983102b43748d' job=0 jobs=0 total=1 step=0 steps=0 queued=0 uptime=495 elapsed=273.78 eta=None progress=0
22:49:49-224386 TRACE Server: alive=True requests=367 memory=7.5/46.57 status='running' task='Generate' timestamp='20250513224315' current='' id='5e983102b43748d' job=0 jobs=0 total=1 step=0 steps=0 queued=0 uptime=615 elapsed=393.81 eta=None progress=0
22:51:49-246756 TRACE Server: alive=True requests=391 memory=7.5/46.57 status='running' task='Generate' timestamp='20250513224315' current='' id='5e983102b43748d' job=0 jobs=0 total=1 step=0 steps=0 queued=0 uptime=735 elapsed=513.83 eta=None progress=0
22:53:49-270137 TRACE Server: alive=True requests=415 memory=7.5/46.57 status='running' task='Generate' timestamp='20250513224315' current='' id='5e983102b43748d' job=0 jobs=0 total=1 step=0 steps=0 queued=0 uptime=855 elapsed=633.85 eta=None progress=0
22:55:49-297153 TRACE Server: alive=True requests=439 memory=7.5/46.57 status='running' task='Generate' timestamp='20250513224315' current='' id='5e983102b43748d' job=0 jobs=0 total=1 step=0 steps=0 queued=0 uptime=975 elapsed=753.88 eta=None progress=0
22:57:49-319113 TRACE Server: alive=True requests=463 memory=7.5/46.57 status='running' task='Generate' timestamp='20250513224315' current='' id='5e983102b43748d' job=0 jobs=0 total=1 step=0 steps=0 queued=0 uptime=1095 elapsed=873.9 eta=None progress=0
22:59:49-340236 TRACE Server: alive=True requests=487 memory=7.5/46.57 status='running' task='Generate' timestamp='20250513224315' current='' id='5e983102b43748d' job=0 jobs=0 total=1 step=0 steps=0 queued=0 uptime=1215 elapsed=993.92 eta=None progress=0
Backend
Diffusers
Compute
nVidia CUDA
UI
ModernUI
Branch
Dev
Model
StableDiffusion XL
Acknowledgements
- [x] I have read the above and searched for existing issues
- [x] I confirm that this is classified correctly and it's not an extension issue
Right now, Cloudflare hard-enforces a 100-second timeout, which means any job that takes longer than 100 seconds will hit that timeout and the UI will get disconnected, causing client and server to go out-of-sync. The workaround is to use a direct IP connection instead of the HTTP proxy when/if the provider offers that (Runpod does, some others don't).
There are similar issues with other providers.
I'm looking into whether there is a way to work around this, so keeping this issue open for now.
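One way to confirm it is the ~100-second cutoff rather than anything in sdnext itself is to send the same long-running request once through the proxied URL and once to the direct IP/port, and time how long each connection survives. A rough sketch, assuming the A1111-compatible `/sdapi/v1/txt2img` endpoint; URLs and payload values are placeholders chosen only so that generation comfortably exceeds 100 seconds:

```python
import time

import requests

# Placeholder endpoints: the Cloudflare-proxied URL vs the pod's direct IP:port.
ENDPOINTS = [
    "https://<pod-id>-7860.proxy.runpod.net",
    "http://<direct-ip>:<direct-port>",
]
# Payload only needs to be heavy enough that generation takes well over 100 s.
PAYLOAD = {"prompt": "test", "steps": 50, "batch_size": 8}

for base in ENDPOINTS:
    start = time.monotonic()
    try:
        resp = requests.post(f"{base}/sdapi/v1/txt2img", json=PAYLOAD, timeout=600)
        print(f"{base}: HTTP {resp.status_code} after {time.monotonic() - start:.0f}s")
    except requests.RequestException as err:
        # On the proxied URL this tends to fire at roughly the 100 s mark.
        print(f"{base}: dropped after {time.monotonic() - start:.0f}s ({err.__class__.__name__})")
```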
I'm not using Paperspace with Cloudflare, but I do reverse-proxy things through a tunnel, and sometimes websocket connections get stuck. Note that it happens even on a LAN if the connection is reverse-proxied. Tested with nginx and Caddy: same problems. For the record, dropping the tunnel (or resetting the connection) is enough to "unstick" it.
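A quick way to observe that behaviour from the outside is to open a websocket through the proxy and simply wait for it to drop. A small probe sketch; the `websockets` package and the URL are my choices for illustration, not anything from sdnext, so point it at whichever websocket endpoint your proxy exposes:

```python
import asyncio
import time

import websockets  # pip install websockets


async def probe(url: str) -> None:
    """Open a websocket through the reverse proxy and report how long it lives."""
    start = time.monotonic()
    async with websockets.connect(url) as ws:
        # The client pings every 20 s by default, so an idle but healthy
        # connection stays open; a proxy-enforced cutoff closes it anyway.
        await ws.wait_closed()
    print(f"connection closed after {time.monotonic() - start:.0f}s")


if __name__ == "__main__":
    # Placeholder URL: substitute the websocket endpoint behind your tunnel.
    asyncio.run(probe("wss://<your-proxied-host>/<websocket-path>"))
```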