Gateway Timeout on Docker Swarm worker replicas

Open statickidz opened this issue 1 year ago • 14 comments

To Reproduce

  1. Create a simple Dokploy Docker Swarm configuration with 1 manager and 1 worker.

    image

  2. Create an app with https://github.com/Dokploy/swarm-test

    image

  3. Put more than 1 replica in the Swarm config

    image

  4. Verify that the deployed replicas are spread across the two instances

    Manager image

    Worker image

Current vs. Expected behavior

I expect all the Docker Swarm containers to work normally, regardless of whether the request goes to the manager or the worker instance. However, when the request goes to the worker instance I get a Gateway Timeout; when it goes to the manager, it works.

  • Example 1: Request > Swarm routes it to the manager > Works
  • Example 2: Request > Swarm routes it to the worker > Gateway Timeout

Provide environment information

Operating System:
  OS: Canonical-Ubuntu-22.04-aarch64-2024.06.26-0
  Arch : arm64
Dokploy version: v0.10.3
VPS Provider: Oracle Cloud
What application/services are you trying to deploy?: Simple Nodejs app

Which area(s) are affected? (Select all that apply)

Application, Docker Compose, Traefik, Docker

Additional context

To check that it's not a network issue between the instances, I created a rule that opens all the ports in the security list. By the way, I'm using this project to boot the instances: https://github.com/statickidz/dokploy-oci-free/

statickidz avatar Oct 25 '24 06:10 statickidz

hmm I think something is needed at the traefik level to make it able to route to the worker container.

Siumauricio avatar Oct 25 '24 21:10 Siumauricio

hmm I think something is needed at the traefik level to make it able to route to the worker container.

Is this related to my environment, or were you able to make it work before?

statickidz avatar Oct 28 '24 10:10 statickidz

@statickidz Yes, this already worked for me some time ago. However, I haven't tried it since we upgraded Traefik to version 3, so surely something changed.

Siumauricio avatar Oct 30 '24 05:10 Siumauricio

I recently tested it and it is working for me. I used this Docker image:

Screenshot 2024-10-29 at 11 24 18 PM

Screenshot 2024-10-29 at 11 24 33 PM

Screenshot 2024-10-29 at 11 24 39 PM

I don't have any containers running on the Dokploy server:

image

The worker is running 6 instances: Screenshot 2024-10-29 at 11 25 27 PM

The domain I used: Screenshot 2024-10-29 at 11 26 02 PM

When you open it you will see this: Screenshot 2024-10-29 at 11 26 22 PM

If you reload after a couple of minutes the information should change, since it is using another private IP and so on, so the load balancing is working fine: Screenshot 2024-10-29 at 11 26 30 PM

Siumauricio avatar Oct 30 '24 05:10 Siumauricio

@Siumauricio I see! I just created new Dokploy instances (manager and worker) in AWS to check whether it was something related to OCI, but I'm getting the same result, which is quite weird. As before, all ports are open and there are no issues joining the Swarm cluster, but when the request lands on the worker I get the Gateway Timeout. At this point I'm not sure what it could be.

statickidz avatar Oct 30 '24 12:10 statickidz

Did you do a custom installation, or did you install with the official script?

Siumauricio avatar Oct 31 '24 03:10 Siumauricio

Did you do a custom installation, or did you install with the official script?

For the main instance, the official script; for the workers, the commands provided by the "Add Node" button.

https://github.com/statickidz/dokploy-oci-free/blob/main/bin/dokploy-main.sh https://github.com/statickidz/dokploy-oci-free/blob/main/bin/dokploy-worker.sh

statickidz avatar Oct 31 '24 08:10 statickidz

Have you checked in the Dokploy dashboard whether the worker appears in the cluster section?

Siumauricio avatar Nov 07 '24 04:11 Siumauricio

I see you are leaving the Docker Swarm on the worker. How did you then link the worker to the manager? Did you follow the steps from the Add Node button manually?

I would recommend you first try the standard way that Dokploy provides, that is, linking the workers manually. If that works, I think the problem is in your infrastructure setup.

Siumauricio avatar Nov 07 '24 04:11 Siumauricio

Is your infrastructure running on Oracle OCI? I encountered the same problem, but it runs normally if executed on the same node where Traefik is located.

binaryYuki avatar Nov 07 '24 09:11 binaryYuki

Have you checked in the Dokploy dashboard whether the worker appears in the cluster section?

Yep, it's been displayed correctly

image

I see you are leaving the Docker Swarm on the worker. How did you then link the worker to the manager? Did you follow the steps from the Add Node button manually?

I would recommend you first try the standard way that Dokploy provides, that is, linking the workers manually. If that works, I think the problem is in your infrastructure setup.

I get the same result whether I pre-install Docker and leave the swarm beforehand (like in the script) or follow the Dokploy quick steps to install it.

For example, this is the latest test on a fresh worker node with the Dokploy steps; the result is always Gateway Timeout:

image

@Siumauricio this is a test environment, so if you want to debug this in depth, reach out to me and I can give you access to the instances.

Is your infrastructure running on Oracle OCI? I encountered the same problem, but it runs normally if executed on the same node where Traefik is located.

I found it on Oracle OCI. It works well if I point all the instances to the manager with this, like you say:

image

But I feel this is not OCI related, because I created a couple of instances on AWS to try and the result was the same https://github.com/Dokploy/dokploy/issues/592#issuecomment-2447020784

statickidz avatar Nov 07 '24 10:11 statickidz

But I feel this is not OCI related, because I created a couple of instances on AWS to try and the result was the same

I just tried it on my Azure server and the same issue occurred.

@Siumauricio Can we try Traefik's load balancing, like this:

[tcp.services]
  [tcp.services.app]
    [[tcp.services.app.weighted.services]]
      name = "appv1"
      weight = 3
    [[tcp.services.app.weighted.services]]
      name = "appv2"
      weight = 1

  [tcp.services.appv1]
    [tcp.services.appv1.loadBalancer]
      [[tcp.services.appv1.loadBalancer.servers]]
        address = "private-ip-server-1:8080"

  [tcp.services.appv2]
    [tcp.services.appv2.loadBalancer]
      [[tcp.services.appv2.loadBalancer.servers]]
        address = "private-ip-server-2:8080"

instead of pointing them directly to the service itself like

  services:
    animeapi-core-409c00-service-11:
      loadBalancer:
        servers:
          - url: http://animeapi-core-409c00:8000

binaryYuki avatar Nov 07 '24 13:11 binaryYuki

I recently tested it and it is working for me. I used this Docker image

Visiting xxx.traefik.me gives a Bad Gateway error with version v0.17.9:

Image

Image

Image

aliuq avatar Feb 02 '25 16:02 aliuq

@aliuq which cloud are you using to run your instances?

statickidz avatar Feb 03 '25 16:02 statickidz

I get the same error. Swarm is not working; every service I deploy results in a bad gateway error.

In this example I used the same service as @Siumauricio

Image

Image

Image

Swarm is pretty useless this way. I have tried to debug but can't figure it out. My guess is that the networks are getting messed up somehow; I am pretty sure Traefik fails to connect to the worker container, but I don't know how to debug that.

I added level: DEBUG to traefik.yml. This is the error raised in Traefik:

2025-03-13T18:22:30Z DBG github.com/traefik/traefik/v3/pkg/server/service/proxy.go:100 > 502 Bad Gateway error="dial tcp: lookup test-whoami-tegwkb on 127.0.0.11:53: no such host"
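
A note on reading this error, offered as a sketch rather than a confirmed diagnosis: 127.0.0.11 is Docker's embedded DNS resolver, so "no such host" means the name test-whoami-tegwkb could not be resolved from inside the Traefik container. One thing worth checking is whether that service is attached to dokploy-network at all. The helper names failed_lookup_host and service_networks are made up for illustration, the example assumes Traefik runs as the dokploy-traefik service named elsewhere in this thread, and the docker commands need a manager node.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Dokploy or Traefik.

# Read Traefik debug log lines on stdin and print the name that failed to resolve.
# Assumes the "lookup <name> on <resolver>: no such host" wording shown above.
failed_lookup_host() {
  sed -n 's/.*lookup \([^ ]*\) on [^ ]*: no such host.*/\1/p'
}

# Print the networks a Swarm service is attached to (manager node only).
# The Target field is typically the network ID.
service_networks() {
  docker service inspect \
    --format '{{range .Spec.TaskTemplate.Networks}}{{.Target}} {{end}}' "$1"
}

# Example (not run here; service names are taken from this thread):
#   docker service logs dokploy-traefik 2>&1 | failed_lookup_host | sort -u
#   service_networks test-whoami-tegwkb
#   docker network inspect dokploy-network --format '{{.Id}} {{.Driver}} {{.Scope}}'
```

If the ID printed by docker network inspect is missing from the service's list, Traefik and the service do not share that network, which would be consistent with the failed lookup.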

codeit-ninja avatar Mar 13 '25 14:03 codeit-ninja

I've hit the same error on Oracle Cloud, and it has been bugging me for a couple of days now. Access works fine when deployed with Compose on the manager. When deployed with Stack on the worker, it reports multiple errors: Image Image

After many days of trying, I found a temporary solution, but I don't know the cause of the problem. My solution is:

  1. Remove Dokploy
docker service rm dokploy dokploy-traefik dokploy-postgres dokploy-redis
docker volume rm -f dokploy-postgres-database redis-data-volume
docker network rm -f dokploy-network
sudo rm -rf /etc/dokploy

  2. Manually install Docker Swarm

Manager:
docker swarm leave --force
docker swarm init --advertise-addr <MANAGER_IP>
Worker:
docker swarm leave --force
docker swarm join --token <TOKEN>  --advertise-addr <WORKER_IP> <MANAGER_IP>:2377

  3. Manually install Dokploy

docker network create --driver overlay --attachable dokploy-network
mkdir -p /etc/dokploy
chmod -R 777 /etc/dokploy
docker pull dokploy/dokploy:latest
# Installation
docker service create \
  --name dokploy \
  --replicas 1 \
  --network dokploy-network \
  --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock \
  --mount type=bind,source=/etc/dokploy,target=/etc/dokploy \
  --publish published=3000,target=3000,mode=host \
  --update-parallelism 1 \
  --update-order stop-first \
  dokploy/dokploy:latest

sundakai avatar Mar 21 '25 02:03 sundakai

Hi, same issue on my end, is there any update on this?

agustints avatar Apr 02 '25 04:04 agustints

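I get the same error. Swarm is not working; every service I deploy results in a bad gateway error.

In this example I used the same service as @Siumauricio

Image

Image

Image

Swarm is pretty useless this way. I have tried to debug but can't figure it out. My guess is that the networks are getting messed up somehow; I am pretty sure Traefik fails to connect to the worker container, but I don't know how to debug that.

I added level: DEBUG to traefik.yml. This is the error raised in Traefik: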

2025-03-13T18:22:30Z DBG github.com/traefik/traefik/v3/pkg/server/service/proxy.go:100 > 502 Bad Gateway error="dial tcp: lookup test-whoami-tegwkb on 127.0.0.11:53: no such host"

I'm also having the same issue

Hoshino-Yumetsuki avatar Apr 10 '25 03:04 Hoshino-Yumetsuki

Maybe we should run a Traefik instance on each node in the swarm. The entrypoint would reverse proxy to the Traefik on the target node, and that Traefik would then reverse proxy to the instance on its own node.

Hoshino-Yumetsuki avatar Apr 10 '25 03:04 Hoshino-Yumetsuki

Maybe we should run a traefik on each node in the swarm, and then reverse proxy to the traefik on the node through the entrypoint, and then reverse proxy to the instance through the traefik in the node

No, the cluster should have only one web portal. Check out my manual installation scenario; it works now.

Image Image Image

To a certain extent, Dokploy is still not that stable, but I really like its design, which is similar to a cloud service. Some features are still missing, such as swarm management, where you can only see containers on the master node, so I also installed Portainer.

sundakai avatar Apr 10 '25 05:04 sundakai

Maybe we should run a traefik on each node in the swarm, and then reverse proxy to the traefik on the node through the entrypoint, and then reverse proxy to the instance through the traefik in the node

No, the cluster should have only one web portal. Check out my manual installation scenario; it works now.

Image Image Image

To a certain extent, Dokploy is still not that stable, but I really like its design, which is similar to a cloud service. Some features are still missing, such as swarm management, where you can only see containers on the master node, so I also installed Portainer.

Can I restore Dokploy from a backup after a manual installation? From what I have observed, restoring an instance from a backup will reset the docker network

Hoshino-Yumetsuki avatar Apr 10 '25 06:04 Hoshino-Yumetsuki

Possible solution to the problem

I have encountered a similar problem and have experience solving it. It is enough to add a label so that Traefik looks up the container's IP correctly.

    deploy:
      labels:
        - traefik.docker.network=dokploy-network

Example with all labels:

    deploy:
      labels:
        - traefik.http.routers.local-dns-bvremm-4-web.rule=Host(`pdns.rt-home.local`)
        - traefik.http.routers.local-dns-bvremm-4-web.entrypoints=web
        - traefik.http.services.local-dns-bvremm-4-web.loadbalancer.server.port=8080
        - traefik.http.routers.local-dns-bvremm-4-web.service=local-dns-bvremm-4-web
        - traefik.enable=true
        - traefik.docker.network=dokploy-network
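
To see whether a deployed service already carries this label, its labels can be inspected on a manager node. The sketch below assumes the labels are set under deploy, so they land on the Swarm service rather than on the container. The helper name has_traefik_network_label is made up, and <service> is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the JSON label map on stdin sets
# traefik.docker.network to the network name given as $1.
# Assumes the compact JSON printed by docker's {{json ...}} template function.
has_traefik_network_label() {
  grep -qF "\"traefik.docker.network\":\"$1\""
}

# Example (not run here; <service> is a placeholder):
#   docker service inspect --format '{{json .Spec.Labels}}' <service> \
#     | has_traefik_network_label dokploy-network && echo "label present"
```

The check only confirms that the label is present on the service; it does not show whether Traefik acts on it.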

ron-tayler avatar Apr 12 '25 18:04 ron-tayler

This issue doesn't seem to be resolved and I'm still getting a gateway timeout error in my deployment after the Dokploy update

Hoshino-Yumetsuki avatar Apr 13 '25 09:04 Hoshino-Yumetsuki

I've considered a few more scenarios for the problem. I think this may be because Traefik does not use the overlay network. I'll try to check how it works with Dokploy, but I think that's where the problem lies.

ron-tayler avatar Apr 13 '25 12:04 ron-tayler

I had the same problem a couple of weeks ago. My application sporadically received a 502 bad gateway error. From what I could determine, the request never reached the container; it remained in Traefik. I spent several days investigating and applying Traefik configurations but couldn't solve it. In the end, I deleted the Traefik container and replaced it with Nginx Proxy Manager, and the problem never returned.

Dokploy can run without Traefik.

mgyugcha avatar Apr 14 '25 18:04 mgyugcha

I had the same problem a couple of weeks ago. My application sporadically received a 502 bad gateway error. From what I could determine, the request never reached the container; it remained in Traefik. I spent several days investigating and applying Traefik configurations but couldn't solve it. In the end, I deleted the Traefik container and replaced it with Nginx Proxy Manager, and the problem never returned.

Dokploy can run without Traefik.

It's a good idea, but now that Dokploy is deeply integrated with Traefik, it might be difficult to migrate to a different solution

Hoshino-Yumetsuki avatar Apr 16 '25 01:04 Hoshino-Yumetsuki

Currently Traefik is only a single part of Dokploy. There's a feature request for Caddy; I can consider working on Caddy support, which I think shouldn't be too much work.

Siumauricio avatar Apr 16 '25 01:04 Siumauricio

Currently Traefik is only a single part of Dokploy. There's a feature request for Caddy; I can consider working on Caddy support, which I think shouldn't be too much work.

I like Traefik better to a certain extent. Reverse proxying and controlling routes by means of labels is really cool. After I installed Docker Swarm and Dokploy manually, the problem never happened again, so there must be a hidden bug! I understand it's not easy to fix.

sundakai avatar Apr 16 '25 02:04 sundakai

Would you mind sharing the steps you used to install manually? Maybe there's a bug in the installation script.

Siumauricio avatar Apr 16 '25 02:04 Siumauricio

Someone on Discord told me we should use a private IP instead of the public IP: https://github.com/Dokploy/website/blob/f886a19b446f2cf8686b7309318f13178699afb5/apps/website/public/install.sh#L87

He told me it worked, but I haven't had time to test it.

I feel it is something with Docker Swarm itself, like how we configure it in the installation script.
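
The private-IP suggestion can be checked on each node before changing the installation script. This is a minimal sketch, assuming IPv4 and the RFC 1918 private ranges. The helper name is_private_ipv4 is made up, and whether a private advertise address actually fixes this issue remains untested, as noted above.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if $1 is in an RFC 1918 private IPv4 range
# (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16). Does not validate the address.
is_private_ipv4() {
  case "$1" in
    10.*|192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[0-1].*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (not run here): check the address this node advertises to the swarm.
#   addr=$(docker info --format '{{.Swarm.NodeAddr}}')
#   is_private_ipv4 "$addr" && echo "private: $addr" || echo "public: $addr"
```

Running the example on both the manager and the worker shows which kind of address each node joined the swarm with.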

Siumauricio avatar Apr 16 '25 02:04 Siumauricio