
[BUG] GPU not working when running as root

Open · kikoncuo opened this issue 11 months ago · 1 comment

To Reproduce

I installed everything required to configure the GPU as per https://gist.github.com/tomlankhorst/33da3c4b9edbde5c83fc1244f010815c and https://github.com/Dokploy/dokploy/issues/816

including manually updating the service:

docker service update \
  --replicas 1 \
  --mount-add type=bind,source=/usr/bin/nvidia-container-runtime,target=/usr/bin/nvidia-container-runtime,readonly \
  --mount-add type=bind,source=/etc/docker/daemon.json,target=/etc/docker/daemon.json,readonly \
  --mount-add type=bind,source=/etc/dokploy,target=/etc/dokploy \
  --mount-add type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock \
  --mount-add type=volume,source=dokploy-docker-config,target=/root/.docker \
  --publish-add published=3000,target=3000,mode=host \
  --update-parallelism 1 \
  --update-order stop-first \
  --constraint-add 'node.role == manager' \
  --generic-resource-add "gpu=1" \
  --env-add ADVERTISE_ADDR=$MYIP \
  --env-add NVIDIA_VISIBLE_DEVICES=all \
  dokploy

(Without this, none of the options showed green, even though I was already running Docker containers with the GPU manually.)

Current vs. Expected behavior

The current behavior is an error reporting that the GPU can't be configured:

Migration complete
Setting up cron jobs....
Server Started: 3000
Starting Deployment Worker
GPU Setup Error: Error: Failed to configure GPU support. Please ensure you have sudo privileges and try again.
at f (.next/server/chunks/153.js:346:53)
at async l (.next/server/chunks/153.js:339:11127)
at async (.next/server/chunks/1463.js:4:9463)

But I'm running all commands as the root user, and I even tried updating the service to run as root:

docker service update \
  --user root \
  dokploy

I've checked that it got updated correctly

root@Ubuntu-2204-jammy-amd64-base ~ # docker service inspect hq --pretty

ID:             hqjrklhp1bh546k8hup9o6aoo
Name:           dokploy
Service Mode:   Replicated
 Replicas:      1
UpdateStatus:
 State:         completed
 Started:       6 minutes ago
 Completed:     5 minutes ago
 Message:       update completed
Placement:
 Constraints:   [node.role == manager node.role == manager]
UpdateConfig:
 Parallelism:   1
 On failure:    pause
 Monitoring Period: 5s
 Max failure ratio: 0
 Update order:      stop-first
RollbackConfig:
 Parallelism:   1
 On failure:    pause
 Monitoring Period: 5s
 Max failure ratio: 0
 Rollback order:    stop-first
ContainerSpec:
 Image:         dokploy/dokploy:latest@sha256:7ce688a60fd5ff1d582e27003327d15685082291747f19984c4125e1aaff72de
 Env:           ADVERTISE_ADDR=HEREISMYIP NVIDIA_VISIBLE_DEVICES=all
 Init:          false
 User: root
Mounts:
 Target:        /etc/docker/daemon.json
  Source:       /etc/docker/daemon.json
  ReadOnly:     true
  Type:         bind
 Target:        /etc/dokploy
  Source:       /etc/dokploy
  ReadOnly:     false
  Type:         bind
 Target:        /usr/bin/nvidia-container-runtime
  Source:       /usr/bin/nvidia-container-runtime
  ReadOnly:     true
  Type:         bind
 Target:        /var/run/docker.sock
  Source:       /var/run/docker.sock
  ReadOnly:     false
  Type:         bind
 Target:        /root/.docker
  Source:       dokploy-docker-config
  ReadOnly:     false
  Type:         volume
Resources:
Networks: dokploy-network
Endpoint Mode:  vip
Ports:
 PublishedPort = 3000
  Protocol = tcp
  TargetPort = 3000
  PublishMode = host

I expected the GPU to be enabled successfully.

Provide environment information

Ubuntu 22.04

Which area(s) are affected? (Select all that apply)

Application

Are you deploying the applications where Dokploy is installed or on a remote server?

Same server where Dokploy is installed

Additional context

Hetzner GPU server

Will you send a PR to fix it?

No

kikoncuo avatar Apr 03 '25 13:04 kikoncuo

@kikoncuo Install the NVIDIA drivers and the NVIDIA Container Toolkit, then check the UI; it should detect them and show green. Use the refresh button to fetch the updates. Note:

  • Swarm GPU support will not turn green based on manual configuration alone; the UI might not detect what you configured yourself.
  • That doesn't mean you can't deploy GPU-based apps. Try your services: if they work, you're fine; if not, your manual config is incorrect. Check both of these docs: https://gist.github.com/tomlankhorst/33da3c4b9edbde5c83fc1244f010815c and https://gist.github.com/coltonbh/374c415517dbeb4a6aa92f462b9eb287

I suggest not waiting for the UI to detect your manual config; try out your services.

vishalkadam47 avatar May 18 '25 06:05 vishalkadam47

OK, no response and no one else has experienced the same issue, so we can close this issue for now.

Siumauricio avatar Jul 05 '25 23:07 Siumauricio

@kikoncuo Same issue. Could you please tell me how to fix it?

sheiy avatar Sep 04 '25 03:09 sheiy

cat /etc/nvidia-container-runtime/config.toml
swarm-resource = "DOCKER_RESOURCE_GPU"
cat /etc/docker/daemon.json 
{
    "data-root": "/data/docker-data",
    "default-runtime": "nvidia",
    "node-generic-resources": [
        "gpu=GPU-13efa0bb-0379-f661-b81f-38e2b3c005c0",
        "gpu=GPU-6d5cfac8-cd1c-53ee-c314-9d5bb877cb28",
        "gpu=GPU-26751515-5e3e-fa16-024e-ee26f754eae5",    
        "gpu=GPU-c4b0146e-7914-0a81-0f9d-faf480945d13",
        "gpu=GPU-a9c877b3-3a30-af26-a499-f90888ce83ef"
    ],
    "registry-mirrors": [
        "https://docker.m.daocloud.io"
    ],
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
 docker service create \
      --name dokploy \
      --replicas 1 \
      --network dokploy-network \
      --mount type=bind,source=/usr/bin/nvidia-container-runtime,target=/usr/bin/nvidia-container-runtime,readonly \
      --mount type=bind,source=/etc/docker/daemon.json,target=/etc/docker/daemon.json,readonly \
      --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock \
      --mount type=bind,source=/etc/dokploy,target=/etc/dokploy \
      --mount type=bind,source=/data/dokploy/dokploy-config,target=/root/.docker \
      --publish published=30000,target=3000,mode=host \
      --update-parallelism 1 \
      --update-order stop-first \
      --constraint 'node.role == manager' \
      --generic-resource "gpu=1" \
      -e ADVERTISE_ADDR=$advertise_addr \
      -e NVIDIA_VISIBLE_DEVICES=all \
      dokploy/dokploy:v0.24.12

Running apt install sudo inside the dokploy container does not fix GPU Setup Error: Error: Failed to configure GPU support. Please ensure you have sudo privileges and try again.

sheiy avatar Sep 04 '25 03:09 sheiy

@vishalkadam47 Could you please help me fix this?

sheiy avatar Sep 04 '25 03:09 sheiy

Although I have the drivers installed and nvidia-smi works fine, I cannot get it running:

# docker service update --generic-resource-add "gpu=1" --env-add NVIDIA_VISIBLE_DEVICES=all np1owfg4eh5h
overall progress: 0 out of 1 tasks
1/1: no suitable node (insufficient resources on 1 node)


docker service ps np1owfg4eh5h
ID             NAME            IMAGE                    NODE         DESIRED STATE   CURRENT STATE             ERROR                              PORTS
gpbtpahmqncf   dokploy.1       dokploy/dokploy:latest                Running         Pending 26 seconds ago    "no suitable node (insufficient resources on 1 node)"

eximius313 avatar Sep 17 '25 00:09 eximius313

> Although I have the drivers installed and nvidia-smi works fine, I cannot get it running

Ok, I've solved part of this problem:

  1. Using:
nvidia-smi -L | awk -F'UUID: ' '{print $2}' | awk -F')' '{print $1}'

returns GPU-xxy-zz-yy-aa-bb.

  2. Then nano /etc/docker/daemon.json:
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    },
    "default-runtime": "nvidia",
    "node-generic-resources": [
        "GPU=GPU-xxy-zz-yy-aa-bb" <--- HERE
    ]
}
  3. And then systemctl restart docker.
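The UUID extraction in step 1 can be sketched as a small helper. This is a sketch only: the sample line and the uuid_to_resources function are hypothetical stand-ins mirroring the nvidia-smi -L output format, not real driver output.

```shell
# Hypothetical helper mirroring step 1: turn `nvidia-smi -L` lines into
# node-generic-resources entries for daemon.json. The sample line below
# stands in for real `nvidia-smi -L` output.
uuid_to_resources() {
  awk -F'UUID: ' '{print $2}' | awk -F')' '{print "\"GPU=" $1 "\","}'
}

sample='GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-13efa0bb-0379-f661-b81f-38e2b3c005c0)'
echo "$sample" | uuid_to_resources
# prints: "GPU=GPU-13efa0bb-0379-f661-b81f-38e2b3c005c0",
```

Piping the full nvidia-smi -L output through the same helper would produce one entry per GPU, ready to paste into the node-generic-resources array.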

Now # docker service update --env-add NVIDIA_VISIBLE_DEVICES=all dokploy gives:

dokploy
overall progress: 1 out of 1 tasks
1/1: running   [==================================================>]
verify: Service dokploy converged

and:

(screenshot)

But still:

  • NVIDIA Container Runtime is not visible although swarm-resource = "DOCKER_RESOURCE_GPU" is enabled in /etc/nvidia-container-runtime/config.toml
  • Swarm GPU Support is missing
  • and # docker service update --generic-resource-add "gpu=1" dokploy gives
dokploy
overall progress: 0 out of 1 tasks
1/1: no suitable node (insufficient resources on 1 node)

eximius313 avatar Sep 17 '25 20:09 eximius313

Ok, next update:

# docker service update --mount-add type=bind,source=/usr/bin/nvidia-container-runtime,target=/usr/bin/nvidia-container-runtime,readonly dokploy gives

dokploy
overall progress: 1 out of 1 tasks
1/1: running   [==================================================>]
verify: Service dokploy converged

and finally I have the same result as @sheiy:

(screenshot)

but still Swarm GPU Support is missing and docker service update --generic-resource-add "gpu=1" dokploy gives:

dokploy
overall progress: 0 out of 1 tasks
1/1: no suitable node (insufficient resources on 1 node)

and when I try to enable the GPU, I get exactly the same error as @sheiy (docker service logs dokploy):

dokploy.1    | GPU Setup Error: Error: Failed to configure GPU support. Please ensure you have sudo privileges and try again.
dokploy.1    |     at f (.next/server/chunks/8512.js:8:50)
dokploy.1    |     at async l (.next/server/chunks/8512.js:1:2650)
dokploy.1    |     at async (.next/server/chunks/1515.js:9:23924)
dokploy.1    |   severity_local: 'NOTICE',
dokploy.1    |   severity: 'NOTICE',
dokploy.1    |   code: '42P07',
dokploy.1    |   message: 'relation "__drizzle_migrations" already exists, skipping',
dokploy.1    |   file: 'parse_utilcmd.c',
dokploy.1    |   line: '207',
dokploy.1    |   routine: 'transformCreateStmt'
dokploy.1    | }

eximius313 avatar Sep 17 '25 20:09 eximius313

For a single GPU, this /etc/docker/daemon.json works:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    },
    "node-generic-resources": [
        "GPU=1"
    ]
}

vovka93 avatar Oct 05 '25 20:10 vovka93

Do you have it green?

(screenshot)

eximius313 avatar Oct 05 '25 21:10 eximius313

Okay, so I just ran into this and kept circling back to it over a few weeks. I finally got heads-down on it and found a couple of things that seem to be unique to Dokploy and Swarm.

  1. The Docker daemon needs the generic resource to be set to "gpu=1", all lowercase. This is what Dokploy looks for. I tried the default NVIDIA-GPU and then GPU, and neither showed up in Dokploy until I made that change.
  2. If the NVIDIA server and container drivers are installed and the test passes, you still need to update the dokploy Docker service with the nvidia runtime mount, daemon config, and environment variables.

I used GPT to help put together a troubleshooting guide from everything I did to get it working. It's up to you whether to use the generic ID or the full UUID; I went with generic. Hope this helps!
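Point 1 (the lowercase gpu resource name) can be sanity-checked before touching anything else. This is a sketch under stated assumptions: check_gpu_resource is a hypothetical helper, and the temp file stands in for /etc/docker/daemon.json.

```shell
# Sketch: check whether daemon.json advertises a lowercase "gpu" generic
# resource, which is the name Dokploy looks for (per this thread).
# check_gpu_resource is a hypothetical helper; the temp file below stands
# in for /etc/docker/daemon.json.
check_gpu_resource() {
  if grep -q '"gpu=' "$1"; then
    echo "ok: lowercase gpu resource found"
  else
    echo "missing: add \"gpu=1\" (or \"gpu=<UUID>\") to node-generic-resources"
  fi
}

tmp=$(mktemp)
printf '%s\n' '{ "node-generic-resources": [ "gpu=1" ] }' > "$tmp"
check_gpu_resource "$tmp"
# prints: ok: lowercase gpu resource found
rm -f "$tmp"
```

Run against the real file (check_gpu_resource /etc/docker/daemon.json) it will flag configs that only advertise NVIDIA-GPU or GPU, since grep is case-sensitive.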


🧠 Dokploy GPU Setup (Ubuntu 24.04 LTS + Docker Swarm)

This guide walks through enabling GPU support in Dokploy using Docker Swarm on Ubuntu 24.04 LTS and deploying Ollama as a GPU-enabled app.


βš™οΈ Step 1 – Check GPU + Driver

lspci | grep -i nvidia
nvidia-smi

✅ Confirm your GPU appears and nvidia-smi outputs normal driver/CUDA information.


🧩 Step 2 – Install NVIDIA Container Toolkit

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

🧰 Step 3 – Make NVIDIA the Default Docker Runtime

sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
sudo systemctl restart docker

Verify:

docker info | grep -A2 "Runtimes"
# Should show: Runtimes: … nvidia …  and  Default Runtime: nvidia

🧾 Step 4 – Confirm Runtime Config

Check /etc/nvidia-container-runtime/config.toml includes:

disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
swarm-resource = "DOCKER_RESOURCE_GPU"

[nvidia-container-runtime]
log-level = "info"

[nvidia-container-cli]
ldconfig = "@/sbin/ldconfig.real"

💡 Step 5 – Advertise the GPU to Docker Swarm

Edit /etc/docker/daemon.json:

{
  "default-runtime": "nvidia",
  "runtimes": { "nvidia": { "path": "nvidia-container-runtime", "runtimeArgs": [] } },
  "node-generic-resources": [ "gpu=1" ]
}

Restart Docker:

sudo systemctl restart docker
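To confirm the daemon picked it up, inspect the node's advertised resources with docker node inspect self -f '{{json .Description.Resources.GenericResources}}'. As a sketch of what to look for, the JSON below is a hand-written stand-in for that command's output, and list_resource_kinds is a hypothetical helper:

```shell
# Sketch: extract the resource kinds from `docker node inspect` JSON.
# The inspect_json string below is a hand-written stand-in for real
# output of: docker node inspect self -f '{{json .Description.Resources.GenericResources}}'
list_resource_kinds() {
  grep -o '"Kind":"[^"]*"' | cut -d'"' -f4
}

inspect_json='[{"DiscreteResourceSpec":{"Kind":"gpu","Value":1}}]'
echo "$inspect_json" | list_resource_kinds
# prints: gpu
```

If the real inspect output is empty or the kind is not the lowercase gpu, Swarm scheduling with --generic-resource "gpu=1" will fail with "no suitable node", as seen earlier in this thread.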

🧪 Step 6 – Verify GPU Access with Plain Docker

docker run --rm --gpus all nvidia/cuda:12.5.1-base-ubuntu24.04 nvidia-smi

✅ You should see the familiar nvidia-smi output table.


🐝 Step 7 – Initialize Docker Swarm and Label GPU Node

docker swarm init        # only once
docker node update --label-add gpu=true $(hostname)

🧫 Step 8 – Run a Swarm GPU Test

docker service create --name gpu-test \
  --generic-resource "gpu=1" \
  --constraint 'node.labels.gpu==true' \
  --restart-condition none \
  nvidia/cuda:12.5.1-base-ubuntu24.04 nvidia-smi

docker service logs -f gpu-test

✅ The GPU info should print successfully inside the logs.


🖥 Step 9 – Make Dokploy Detect the GPU

Give the running Dokploy service visibility into the GPU runtime:

docker service update dokploy \
  --mount-add type=bind,source=/usr/bin/nvidia-container-runtime,target=/usr/bin/nvidia-container-runtime,readonly \
  --mount-add type=bind,source=/etc/docker/daemon.json,target=/etc/docker/daemon.json,readonly \
  --env-add NVIDIA_VISIBLE_DEVICES=all \
  --generic-resource-add "gpu=1"

Then go to Dokploy → Server GPU Setup → Refresh. All checks should turn 🟢 green.


🚀 Step 10 – Deploy Ollama with GPU (Example Stack)

This follows Dokploy's [Compose conventions](https://docs.dokploy.com/docs/core/docker-compose): using ../files for persistent storage and running in Stack mode.

version: "3.9"

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"                         # Ollama API
    volumes:
      # Dokploy-managed persistent data path
      - "../files/ollama:/root/.ollama"
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      NVIDIA_DRIVER_CAPABILITIES: compute,utility
      # Optional: OLLAMA_HOST: "0.0.0.0:11434"
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.labels.gpu==true             # schedule on GPU node
      resources:
        reservations:
          generic_resources:
            - discrete_resource_spec:
                kind: gpu
                value: 1                      # matches daemon.json "gpu=1"
      restart_policy:
        condition: on-failure
      update_config:
        parallelism: 1
        order: start-first

Deploy this as a Stack inside Dokploy; your Ollama service will have GPU acceleration and persistent model storage under ../files/ollama.


Hopefully yours looks something like this now:

(screenshot)

🧭 Quick Troubleshooting

| Problem | Fix |
| --- | --- |
| Dokploy still shows red | Ensure `gpu` (not `NVIDIA-GPU`) in daemon.json; restart Docker; rerun the Dokploy service update |
| Swarm service won't schedule | Run `docker node inspect self -f '{{json .Description.Resources.GenericResources}}' \| jq .` and verify it lists `"gpu": 1` |
| No logs from Swarm service | Use `docker service logs`, not `docker logs` |
| GPU test fails | Restart Docker, recheck the runtime (`docker info`), rerun the test container |

QuinnGT avatar Oct 06 '25 23:10 QuinnGT

Can this be added to the Dokploy docs?

vishalkadam47 avatar Oct 09 '25 16:10 vishalkadam47

THANKS, it works!! I modified the runtime for the 550 driver and an earlier CUDA version, but I managed to make it work with a GTX 1050 (with PCI passthrough on Proxmox). Thx @QuinnGT!

steffpro avatar Oct 14 '25 16:10 steffpro

For me, adding "gpu=1" did the trick (it works fine alongside the GPU UUID entry):

nano /etc/docker/daemon.json:

{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    },
    "default-runtime": "nvidia",
    "node-generic-resources": [
        "GPU=GPU-xxy-zz-yy-aa-bb",
        "gpu=1"
    ]
}

Thanks @vovka93 & @QuinnGT

eximius313 avatar Oct 14 '25 20:10 eximius313

@QuinnGT thanks for the troubleshooting steps you shared. I'll add a few more checks in the UI soon, plus the ability to detect manual configuration.

vishalkadam47 avatar Oct 15 '25 03:10 vishalkadam47


@Siumauricio please add this to the Dokploy docs. I'll update the code soon with more proper checks and add detection for manual changes. Thanks @QuinnGT.

vishalkadam47 avatar Nov 18 '25 00:11 vishalkadam47