
[Feat]: Add Dockerfile for local Windows 11 + CUDA 13 + RTX 5090 setup

Cybernetic-Ransomware opened this issue 5 months ago • 4 comments

Describe your use-case.

Summary

I created and tested a new Dockerfile that allows running OneTrainer locally on:

  • Windows 11 host (via Docker Desktop / WSL2),
  • CUDA 13,
  • NVIDIA RTX 5090,
  • with optional S3 storage mounting for training data.

Motivation

Current Docker resources seem focused on cloud providers (RunPod, Vast, etc.). This Dockerfile provides a tested setup for local Windows users with modern NVIDIA GPUs.

File

FROM nvidia/cuda:13.0.0-cudnn-runtime-ubuntu24.04
LABEL authors="aleksander.marszalki"
LABEL name="trainer-s3"

# Note: the GPU driver is provided by the host through the NVIDIA Container
# Toolkit, so no nvidia-driver package is installed in the image.
RUN apt-get update && apt-get install -y \
    software-properties-common \
    build-essential apt-utils \
    wget curl vim git ca-certificates kmod \
    python3 python3-pip python3.12-venv \
    libgl1 libglib2.0-0 s3fs \
    && rm -rf /var/lib/apt/lists/*

RUN ln -sf /usr/bin/python3 /usr/bin/python && \
    ln -sf /usr/bin/pip3 /usr/bin/pip

RUN pip install --no-cache-dir --break-system-packages torch torchvision --index-url https://download.pytorch.org/whl/cu129 && \
    pip install --no-cache-dir --break-system-packages tensorflow && \
    pip install --no-cache-dir --break-system-packages -U "huggingface_hub[cli]"


WORKDIR /opt/onetrainer
RUN git clone https://github.com/Nerogar/OneTrainer.git --single-branch --branch master --depth 1
RUN mkdir -p instance training_concepts training_samples training_data

WORKDIR /opt/onetrainer/OneTrainer
RUN ./install.sh


CMD hf auth login --token ${HF_TOKEN} && \
    hf download black-forest-labs/FLUX.1-dev --local-dir ~/.cache/huggingface/flux && \
    mkdir -p ${S3_MOUNT} && \
    s3fs ${S3_BUCKET} ${S3_MOUNT} -o allow_other && \
    mkdir -p /opt/onetrainer/local_training_data && \
    cp -r ${S3_MOUNT}/training_data* /opt/onetrainer/local_training_data/ && \
    nvidia-smi && \
    ulimit -a && \
    ./venv/bin/python scripts/train.py --config-path ${S3_MOUNT}/OneTrainer/training_presets/flux_lora_backup_af_epoch_sierpien_JDTAG_S3.json
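
For anyone trying this locally, a build-and-run invocation might look like the sketch below. The image tag, token, bucket name, and mount path are placeholders; the extra flags are there because s3fs needs FUSE access inside the container:

```shell
# Build the image (tag is a placeholder)
docker build -t trainer-s3 .

# Run with GPU access; s3fs requires FUSE, hence SYS_ADMIN and /dev/fuse
docker run --rm --gpus all \
    --cap-add SYS_ADMIN --device /dev/fuse \
    -e HF_TOKEN=hf_xxx \
    -e S3_BUCKET=my-training-bucket \
    -e S3_MOUNT=/mnt/s3 \
    trainer-s3
```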

What would you like to see as a solution?

Proposal

  • Generalize and contribute my Dockerfile as an additional resource (e.g. resources/docker/Dockerfile.windows-cuda13)
  • Add short build/run instructions in a README
  • Optionally provide a compose.yaml example for mounting S3 storage
  • Workflows tested: docker build, docker run, training script startup
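
As a rough sketch of the compose.yaml idea, something along these lines could cover both the GPU reservation and an optional MinIO service for local S3-style testing (service names, ports, and credentials below are made-up placeholders):

```yaml
services:
  onetrainer:
    build: .
    environment:
      - HF_TOKEN=${HF_TOKEN}
      - S3_BUCKET=training-bucket
      - S3_MOUNT=/mnt/s3
    cap_add:
      - SYS_ADMIN          # s3fs needs FUSE inside the container
    devices:
      - /dev/fuse
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # Optional: local S3-compatible storage for testing
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    environment:
      - MINIO_ROOT_USER=minioadmin        # placeholder credentials
      - MINIO_ROOT_PASSWORD=minioadmin
    ports:
      - "9000:9000"
      - "9001:9001"
    volumes:
      - minio-data:/data

volumes:
  minio-data:
```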

Questions

  • Would you like me to open a PR with this Dockerfile (and make it more general-purpose)?
  • Should this file live in resources/docker/, or do you prefer a different structure?
  • Do you also want a compose.yaml example for S3 and MinIO mounting?

Have you considered alternatives? List them here.

No response

Cybernetic-Ransomware avatar Sep 15 '25 08:09 Cybernetic-Ransomware

Please see here: https://github.com/Nerogar/OneTrainer/pull/963

Note it’s a draft.

O-J1 avatar Sep 15 '25 10:09 O-J1

Hi @Cybernetic-Ransomware. We're already working on improving and unifying Docker support for all platforms and some cloud providers. However, if you could test my PR, as mentioned by O-J1, and port some of your changes to fix any issues you encounter, we'll get closer to a tested and ready local Docker stack.

Fwiw, I'll probably merge the local changes first, since 1. I believe that's how most users wish to use OneTrainer, 2. I feel the delivery is due and ready, and 3. it will get some user feedback to help me with the cloud stack revamp.

(cc @O-J1 @dxqb since you're watching #963)

bbergeron0 avatar Sep 21 '25 02:09 bbergeron0

@bbergeron0 I just ran your PR on Windows using VcXsrv with a few minor changes in the docker-compose file:

    runtime: nvidia
    deploy:
        resources:
            reservations:
                devices:
                    - driver: nvidia
                      count: 1
                      capabilities: [gpu]
    environment:
        # Allow X11 access
        - NVIDIA_VISIBLE_DEVICES=all
        - NVIDIA_DRIVER_CAPABILITIES=all
        - DISPLAY=host.docker.internal:0.0
        - UID=1000
        - GID=1000

Surprisingly, it worked without the usual PyTorch exceptions, and the image size is about three times smaller compared to my previous Dockerfile based on NVIDIA's distribution.

I think a major improvement could be to properly map volumes for configuration presets and training data. I’ve spent quite some time exploring container paths and am still halfway through. One option I’m considering is mounting a separate filebrowser container via Compose to simplify local data access.

I’ll continue exploring your Dockerfile and will share any useful findings from my previous setup to help improve the local Docker stack.
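
As a side note, one quick way to confirm a compose service actually sees the GPU with these overrides (assuming the service is named `onetrainer`, which is a guess on my part) is:

```shell
docker compose run --rm onetrainer nvidia-smi
```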

Cybernetic-Ransomware avatar Sep 22 '25 15:09 Cybernetic-Ransomware

@Cybernetic-Ransomware Thanks for the feedback! I'd like a bit more information to help me integrate your changes into my PR. First, could you share the errors you encountered that led to these changes? Second, what happens if you revert all your changes except DISPLAY=host.docker.internal:0.0?

I think a major improvement could be to properly map volumes for configuration presets and training data

What do you mean? Creating a volume for each writable directory?

simplify local data access

In my next commit, I'll mount each writable directory directly to its local counterpart. I hope that will make accessing data easier.
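
For what it's worth, per-directory bind mounts like that might look roughly like this in compose, reusing the directory names from the Dockerfile in this issue (host paths are placeholders):

```yaml
    volumes:
      - ./training_concepts:/opt/onetrainer/training_concepts
      - ./training_samples:/opt/onetrainer/training_samples
      - ./training_data:/opt/onetrainer/training_data
```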

PS: You can reply in #963, it's easier for me to track a single thread '^^

bbergeron0 avatar Sep 28 '25 16:09 bbergeron0