Gateway Timeout on Docker Swarm worker replicas
To Reproduce
- Create a simple Dokploy Docker Swarm configuration with 1 manager and 1 worker.
- Create an app with https://github.com/Dokploy/swarm-test
- Set more than 1 replica in the Swarm config.
- Verify that the deployed replicas are spread across the two instances.
Manager
Worker
Current vs. Expected behavior
I expect all Docker Swarm containers to work normally and independently, whether a request is routed to the manager or to the worker instance. Instead, when a request goes to the worker instance I get a Gateway Timeout; when it goes to the manager, it works.
- Example 1: Request > Swarm routes to the manager > Works
- Example 2: Request > Swarm routes to the worker > Gateway Timeout
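One way to put numbers on the two cases is to send a batch of requests and count the status codes that come back. The sketch below is only an illustration: `APP_URL` is a placeholder for whatever domain is assigned to the app, and the 60-second cap per request is an arbitrary choice meant to outlast the proxy's own timeout.

```shell
#!/bin/sh
# Placeholder URL (assumption): replace with the domain assigned to the app.
APP_URL="${APP_URL:-http://swarm-test.example.com}"

# Read one HTTP status code per line on stdin and print "<code>: <count>".
tally() {
  sort | uniq -c | awk '{ print $2 ": " $1 }'
}

# Send N requests (default 20) and summarize the status codes.
# curl reports 000 when no HTTP response arrives within the 60 s cap.
probe() {
  n="${1:-20}"
  i=0
  while [ "$i" -lt "$n" ]; do
    curl -s -o /dev/null -m 60 -w '%{http_code}\n' "$APP_URL"
    i=$((i + 1))
  done | tally
}
```

Calling `probe 20` with replicas on both nodes should show a mix of `200` and `504` (Gateway Timeout) if the problem is present, and only `200` if requests to the worker succeed.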
Provide environment information
Operating System:
OS: Canonical-Ubuntu-22.04-aarch64-2024.06.26-0
Arch: arm64
Dokploy version: v0.10.3
VPS Provider: Oracle Cloud
What application/services are you trying to deploy?: Simple Node.js app
Which area(s) are affected? (Select all that apply)
Application, Docker Compose, Traefik, Docker
Additional context
To rule out a network issue between the instances, I created a rule in the security list that opens all ports. By the way, I'm using this project to boot the instances: https://github.com/statickidz/dokploy-oci-free/
Hmm, I think something is needed at the Traefik level so that it can route to the worker container.
Is this related to my environment, or were you able to make it work before?
@statickidz Yes, this already worked for me some time ago. However, I haven't tried it since we upgraded Traefik to version 3, so there was surely some change.
I recently tested this and it works for me. I used this Docker image
I don't have any containers running on the Dokploy server
The worker is running 6 instances
The domain I used
and when you open it you will see this
If you reload after a couple of minutes, the information should change, since the response is served from another private IP, so load balancing is working fine.
@Siumauricio I see! I just created new Dokploy instances (manager and worker) on AWS to check whether this was related to OCI, but I'm getting the same result, which is quite weird. As before, all ports are open and there are no issues joining the Swarm cluster, but when a request is routed to the worker I get the Gateway Timeout. At this point I'm not sure what the cause could be.
Did you make a custom installation, or did you install with the official script?
For the main instance, the official script; for the workers, the commands provided by the "Add Node" button.
https://github.com/statickidz/dokploy-oci-free/blob/main/bin/dokploy-main.sh https://github.com/statickidz/dokploy-oci-free/blob/main/bin/dokploy-worker.sh
Have you checked in the Dokploy dashboard whether the worker is associated in the cluster section?
I see you are leaving Docker Swarm on the worker. How did you then link the worker to the manager? Did you follow the steps from the Add Node button manually?
I would recommend first trying the standard way that Dokploy provides, that is, linking the workers manually. If that works, I think the problem is in your infrastructure setup.
Is your infrastructure running on Oracle OCI? I encountered the same problem, but it runs normally if executed on the same node where Traefik is located.
Yep, the worker is displayed correctly in the cluster section of the dashboard.
I get the same result whether I pre-install Docker and leave the swarm beforehand (as in the script) or follow the Dokploy quick steps to install it.
For example, this is the latest test on a fresh worker node using the Dokploy steps; the result is always Gateway Timeout:
@Siumauricio this is a test environment, so if you want to debug this in depth, reach out to me and I can give you access to the instances.
Yes, I found it on Oracle OCI. It works well if I point all the instances to the manager like this, as you say.
But I feel this is not OCI related, because I created a couple of instances on AWS to try and the result was the same: https://github.com/Dokploy/dokploy/issues/592#issuecomment-2447020784
I also doubt this is OCI related: I just tried it on my Azure server and the same issue occurred.
@Siumauricio Could we try Traefik's own load balancing, like this:
[tcp.services]
[tcp.services.app]
[[tcp.services.app.weighted.services]]
name = "appv1"
weight = 3
[[tcp.services.app.weighted.services]]
name = "appv2"
weight = 1
[tcp.services.appv1]
[tcp.services.appv1.loadBalancer]
[[tcp.services.appv1.loadBalancer.servers]]
address = "private-ip-server-1/:8080"
[tcp.services.appv2]
[tcp.services.appv2.loadBalancer]
[[tcp.services.appv2.loadBalancer.servers]]
address = "private-ip-server-2/:8080"
instead of pointing directly at the service itself, like this:
services:
  animeapi-core-409c00-service-11:
    loadBalancer:
      servers:
        - url: http://animeapi-core-409c00:8000
Visiting xxx.traefik.me returns a Bad Gateway error on version v0.17.9.
@aliuq which cloud are you using to run your instances?
I get the same error. Swarm is not working; every service I deploy results in a Bad Gateway error.
In this example I used the same service as @Siumauricio
Swarm is pretty useless this way. I have tried to debug it but can't figure it out. My guess is that the networks are getting messed up somehow; I am pretty sure Traefik fails to connect to the worker container, but I don't know how to debug that.
I added level: DEBUG to traefik.yml
This is the error raised in Traefik:
2025-03-13T18:22:30Z DBG github.com/traefik/traefik/v3/pkg/server/service/proxy.go:100 > 502 Bad Gateway error="dial tcp: lookup test-whoami-tegwkb on 127.0.0.11:53: no such host"
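The log line shows Docker's embedded DNS (127.0.0.11) failing to resolve the service name from inside the Traefik container, which often means the two do not share a network. A rough way to narrow that down is sketched below. It assumes the Traefik container name contains `dokploy-traefik` (the name used elsewhere in this thread) and that the Traefik image ships `nslookup`; neither is guaranteed for every setup.

```shell
#!/bin/sh
# Pull the unresolved hostname out of a Traefik "no such host" log line read from stdin.
failed_host() {
  sed -n 's/.*lookup \([^ ]*\) on .*no such host.*/\1/p'
}

# Show which networks a Swarm service is attached to and where its tasks run.
# Run on a manager node.
inspect_service() {
  svc="$1"
  docker service inspect --format '{{ json .Spec.TaskTemplate.Networks }}' "$svc"
  docker service ps --format '{{ .Node }} {{ .CurrentState }}' "$svc"
}

# Try to resolve the service name from inside the Traefik container.
# Assumption: the container name contains "dokploy-traefik" and nslookup is available.
resolve_from_traefik() {
  svc="$1"
  cid="$(docker ps -q -f name=dokploy-traefik | head -n 1)"
  docker exec "$cid" nslookup "$svc"
}
```

For the error above, `inspect_service test-whoami-tegwkb` followed by `resolve_from_traefik test-whoami-tegwkb` would show whether the service is attached to the network Traefik uses and whether its name resolves there.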
I see the same error on Oracle Cloud; it has been bugging me for a couple of days now.
Access works fine when the app is deployed with Compose on the manager.
When deployed with Stack on the worker, it reports multiple errors
After many days of trying, I found a temporary solution, but I don't know the cause of the problem. My solution is:
1. Remove Dokploy
docker service rm dokploy dokploy-traefik dokploy-postgres dokploy-redis
docker volume rm -f dokploy-postgres-database redis-data-volume
docker network rm -f dokploy-network
sudo rm -rf /etc/dokploy
2. Manually install Docker Swarm
Manager:
docker swarm leave --force
docker swarm init --advertise-addr <MANAGER_IP>
Worker:
docker swarm leave --force
docker swarm join --token <TOKEN> --advertise-addr <WORKER_IP> <MANAGER_IP>:2377
3. Manually install Dokploy
docker network create --driver overlay --attachable dokploy-network
mkdir -p /etc/dokploy
chmod -R 777 /etc/dokploy
docker pull dokploy/dokploy:latest
# Installation
docker service create \
--name dokploy \
--replicas 1 \
--network dokploy-network \
--mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock \
--mount type=bind,source=/etc/dokploy,target=/etc/dokploy \
--publish published=3000,target=3000,mode=host \
--update-parallelism 1 \
--update-order stop-first \
dokploy/dokploy:latest
Hi, same issue on my end, is there any update on this?
I'm also having the same issue
Maybe we should run a Traefik instance on each node in the swarm, reverse-proxy from the entrypoint to the Traefik on that node, and then let that Traefik reverse-proxy to the instance on the same node.
No, the cluster should have only one web entry point. Check out my manual installation scenario; it works now.
To a certain extent, Dokploy is still not that stable, but I really like its design, which is similar to a cloud service. Some features are still missing, such as Swarm management, where you can only see containers on the manager node. So I also installed Portainer.
Can I restore Dokploy from a backup after a manual installation? From what I have observed, restoring an instance from a backup resets the Docker network.
Possible solution to the problem
I have encountered a similar problem and have experience solving it. It is enough to add a label so that Traefik correctly looks up the IP of the container.
deploy:
  labels:
    - traefik.docker.network=dokploy-network
Example with all labels:
deploy:
  labels:
    - traefik.http.routers.local-dns-bvremm-4-web.rule=Host(`pdns.rt-home.local`)
    - traefik.http.routers.local-dns-bvremm-4-web.entrypoints=web
    - traefik.http.services.local-dns-bvremm-4-web.loadbalancer.server.port=8080
    - traefik.http.routers.local-dns-bvremm-4-web.service=local-dns-bvremm-4-web
    - traefik.enable=true
    - traefik.docker.network=dokploy-network
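To check whether the label is actually in place, something like the following may help. It is a sketch, not part of Dokploy: both helper names are made up for illustration, the first only greps a compose file, and the second reads the labels of an already deployed Swarm service on a manager node.

```shell
#!/bin/sh
# Succeed if the given compose file sets the traefik.docker.network label anywhere.
has_network_label() {
  grep -q 'traefik\.docker\.network=' "$1"
}

# Print the label value on a deployed Swarm service (empty output if unset).
# Run on a manager node.
service_network_label() {
  docker service inspect \
    --format '{{ index .Spec.Labels "traefik.docker.network" }}' "$1"
}
```

For a quick experiment on a running service, `docker service update --label-add traefik.docker.network=dokploy-network <service>` adds the label without editing the compose file, though a later redeploy from the compose file will likely replace it.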
This issue doesn't seem to be resolved; I'm still getting a gateway timeout error in my deployment after the Dokploy update.
I've considered a few more possible causes. I think this may be because Traefik does not use the overlay network. I'll try to check how it works with Dokploy, but I think that's where the problem lies.
I had the same problem a couple of weeks ago. My application sporadically received a 502 bad gateway error. From what I could determine, the request never reached the container; it remained in Traefik. I spent several days investigating and applying Traefik configurations but couldn't solve it. In the end, I deleted the Traefik container and replaced it with Nginx Proxy Manager, and the problem never returned.
Dokploy can run without Traefik.
It's a good idea, but now that Dokploy is deeply integrated with Traefik, it might be difficult to migrate to a different solution.
Currently Traefik is only a single part of Dokploy. There's a feature request for Caddy; I can consider working on Caddy support, which I think shouldn't be too much work.
I like Traefik better, to a certain extent. Reverse proxying and controlling routes by means of labels is really cool. After I installed Docker Swarm and Dokploy manually, the problem never happened again, so there must be a hidden bug! I understand it's not easy to fix.
Would you mind sharing the steps you used to install manually? Maybe there's a bug in the installation script.
Someone on Discord told me we should use a private IP instead of the public IP: https://github.com/Dokploy/website/blob/f886a19b446f2cf8686b7309318f13178699afb5/apps/website/public/install.sh#L87
He told me it worked, but I haven't had time to test it.
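A quick way to see which kind of address a node currently uses in the swarm is sketched below. `is_private_ipv4` is a hypothetical helper that only covers the RFC 1918 ranges, and whether a public advertise address is really the cause here is still untested.

```shell
#!/bin/sh
# Succeed if the IPv4 address falls in an RFC 1918 private range.
is_private_ipv4() {
  case "$1" in
    10.*|192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the address this node uses within the swarm.
swarm_node_addr() {
  docker info --format '{{ .Swarm.NodeAddr }}'
}
```

Running `is_private_ipv4 "$(swarm_node_addr)" || echo "node advertises a non-private address"` on each node shows which case applies. Swarm also needs 2377/tcp, 7946/tcp and udp, and 4789/udp open between nodes on whichever addresses they advertise.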
I feel it is something with Docker Swarm itself, like how we configure it in the installation script.