
Improve Docker Stack

Open bbergeron0 opened this issue 5 months ago • 15 comments

This PR will introduce a series of commits to incrementally improve OneTrainer's Docker stack in the hope of making it easier to maintain and keep up to date. These improvements include better documentation, the application of Dockerfile best practices, the removal of unnecessary steps and dependencies, and the unification of installation instructions where possible.

Linked discussion: https://github.com/Nerogar/OneTrainer/discussions/946

Roadmap

  • [x] Update local Docker image - Done, testing
  • [ ] Update VastAI Docker image - In Progress
  • [ ] Update RunPod Docker image - To do
  • [ ] (Somehow) unify the Docker stacks - To do

Testing

Local

Volunteer testing would be greatly appreciated. If you’d like to provide feedback, please include your OS, hardware, and engine (Docker, Podman, etc.).

  • OS: Gentoo
  • Hardware: RTX 3060
  • Engine: Docker (Buildkit)

Cloud

I'll need volunteers to help test the new VastAI and RunPod images.

bbergeron0 avatar Sep 01 '25 15:09 bbergeron0

Just did some basic research, take the following with a large grain of salt:

  • Do we need to worry about the tag version being moved or changed? If this is a real possibility, can we use a hash of some kind?
  • Are multi-stage builds relevant to this? Seems like they might significantly reduce build size.
  • I remember my boss saying every RUN = a new image layer; can we consolidate any of them?
  • For local, can we use a cache mount https://docs.docker.com/build/cache/optimize/#use-cache-mounts ?
  • Line 20 of .dockerignore: shouldn't it be /training_presets in both? 🤔
  • Shouldn't we use apt-get instead of apt, given apt seems to be for full GUI systems?
  • Do we want or need a HEALTHCHECK?
  • We get msg="The \"HOME\" variable is not set. Defaulting to a blank string." Do we need to fix this?

Apologies if I am talking out of my behind on this, just trying to provide some semblance of feedback 😅 Testing it now.

Test results:

With zero changes, no X server installed, just WSL2 with Ubuntu as the backend and Docker Desktop on Windows 11, I get this:

 ✔ onetrainer-onetrainer  Built                                                                                                                                           0.0s 
 ✔ Container onetrainer   Recreated                                                                                                                                       0.1s 
onetrainer  | [entrypoint] Setting user UID and GID...
onetrainer  | [entrypoint] Changing /data ownership...
onetrainer  | [entrypoint] Adding user to GPU device groups...
onetrainer  | ls: cannot access '/dev/nvidia*': No such file or directory                                                                                                      
onetrainer  | [entrypoint] Running command...                                                                                                                                  
onetrainer  | font_manager.py     :1639 2025-09-03 14:43:39,668 generated new fontManager
onetrainer  | Traceback (most recent call last):
onetrainer  |   File "/OneTrainer/./scripts/train_ui.py", line 14, in <module>
onetrainer  |     main()                                                                                                                                                       
onetrainer  |   File "/OneTrainer/./scripts/train_ui.py", line 9, in main                                                                                                      
onetrainer  |     ui = TrainUI()                                                                                                                                               
onetrainer  |          ^^^^^^^^^                                                                                                                                               
onetrainer  |   File "/OneTrainer/modules/ui/TrainUI.py", line 66, in __init__                                                                                                 
onetrainer  |     super().__init__()
onetrainer  |   File "/usr/local/lib/python3.12/site-packages/customtkinter/windows/ctk_tk.py", line 40, in __init__                                                           
onetrainer  |     CTK_PARENT_CLASS.__init__(self, **pop_from_dict_by_set(kwargs, self._valid_tk_constructor_arguments))                                                        
onetrainer  |   File "/usr/local/lib/python3.12/tkinter/__init__.py", line 2346, in __init__                                                                                   
onetrainer  |     self.tk = _tkinter.create(screenName, baseName, className, interactive, wantobjects, useTk, sync, use)                                                       
onetrainer  |               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                       
onetrainer  | _tkinter.TclError: couldn't connect to display ""
onetrainer exited with code 1

After installing VcXsrv, disabling access control (I have no idea if this is wise), selecting full screen, and starting the client, I successfully connect to the X server (but the size is wrong). However, the GPU still isn't added, and trying to train results in this error.

[entrypoint] Setting user UID and GID...
[entrypoint] Changing /data ownership...
[entrypoint] Adding user to GPU device groups...
ls: cannot access '/dev/nvidia*': No such file or directory
[entrypoint] Running command...
font_manager.py     :1639 2025-09-03 14:59:31,134 generated new fontManager
Traceback (most recent call last):
  File "/OneTrainer/modules/ui/TrainUI.py", line 626, in __training_thread_function
    trainer.start()
  File "/OneTrainer/modules/trainer/GenericTrainer.py", line 120, in start
    self.model = self.model_loader.load(
                 ^^^^^^^^^^^^^^^^^^^^^^^
  File "/OneTrainer/modules/modelLoader/StableDiffusionXLLoRAModelLoader.py", line 48, in load
    base_model_loader.load(model, model_type, model_names, weight_dtypes)
  File "/OneTrainer/modules/modelLoader/stableDiffusionXL/StableDiffusionXLModelLoader.py", line 269, in load
    raise Exception("could not load model: " + model_names.base_model)
Exception: could not load model: E:/stable-diffusion-webui-forge/models/Stable-diffusion/sd_xl_base_1.0_0.9vae.safetensors
Exception in thread Thread-2 (__training_thread_function):
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.12/threading.py", line 1012, in run
    self._target(*self._args, **self._kwargs)
  File "/OneTrainer/modules/ui/TrainUI.py", line 636, in __training_thread_function
    trainer.end()
  File "/OneTrainer/modules/trainer/GenericTrainer.py", line 802, in end
    self.model.to(self.temp_device)
    ^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'to'
/usr/local/lib/python3.12/site-packages/tensorboard/default.py:30: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
TensorFlow installation not found - running with reduced feature set.

NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784

Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.19.0 at http://localhost:6006/ (Press CTRL+C to quit)

O-J1 avatar Sep 03 '25 14:09 O-J1

Hi, thanks for the feedback!

Research

Do we need to worry about the tag version being moved or changed? If this is a real possibility, can we use a hash of some kind?

It is a possibility, and we can lock a base image to a digest rather than a tag. However, whether we should do it is a matter of opinion. While it guarantees reproducibility, I believe that tags are more explicit and are often moved for good reasons, such as minor version bumps or security/hotfix revisions. I'm also not convinced of the benefits of reproducibility when OneTrainer is already running well on multiple Python versions (and OSes).
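For illustration, here is what pinning a base image to a digest (rather than a tag) looks like in a Dockerfile; the digest is a placeholder, not a real python:3.12 digest:

```dockerfile
# Option A - tag: human-readable, but the publisher can move it to a new image
FROM python:3.12

# Option B - digest: immutable; the build always resolves to the exact same image
# (placeholder digest for illustration only)
FROM python@sha256:<digest>
```

A middle ground is `python:3.12@sha256:<digest>`, which keeps the tag visible for readers while still pinning the exact image.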

Are multi-stage builds relevant to this? Seems like it might significantly reduce build size.

AFAIK, multi-stage builds are beneficial when some steps must be done in a heavier environment, but the environment becomes unnecessary after that step. In our case, we're already starting with a minimal environment (python:3.12), and we keep the size minimal by installing only OneTrainer and its dependencies.

I remember my boss saying every RUN = new image layer. Can we consolidate any of them?

That's true, but it's not a big deal. Layers are compressed and built using OverlayFS, so a layer actually only contains the compressed diff, which is highly size-efficient. Also, since only diffs are tracked, two distinct mutations cost as much done together as done separately, ignoring the minuscule overhead of layering and negligible benefits of single-compression. On the other hand, merging RUN commands can make a Dockerfile harder to read and maintain, which is the opposite of what I'm aiming for with this PR. The Dockerfile is already minimal, so no RUN command can really be merged with noticeable storage savings in terms of diff count.

For local, can we use a cache mount https://docs.docker.com/build/cache/optimize/#use-cache-mounts ?

I've taken some time to learn about cache mounts, and I think we would absolutely benefit from it for the "pip install" step. I'll implement it in a future commit.
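As a sketch, a cache mount on the pip step could look roughly like this (the requirements file name and cache path are assumptions about the actual Dockerfile):

```dockerfile
# syntax=docker/dockerfile:1
# Persist pip's download/wheel cache across builds via a BuildKit cache mount
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt
```

Unlike a normal layer, the cache mount's contents are not baked into the image, so the image stays small while rebuilds skip re-downloading wheels.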

Line 20 of .dockerignore, shouldn't it be /training_presets in both? 🤔

Why so? training_presets contains some default presets that I want to copy into the image, so I don't want to ignore it.

Shouldn't we use apt-get instead of apt, given apt seems to be for full GUI systems?

Good point. I recently made that change in the ComfyUI PR; it’s coming here soon as well.

Do we want or need a HEALTHCHECK?

I don't see what we could healthcheck against. Either OneTrainer is running and the container is OK, or OneTrainer has crashed and the container has automatically shut down. Also, HEALTHCHECKs are better suited for orchestrating and monitoring long-lived daemons, which OneTrainer isn't.

msg="The \"HOME\" variable is not set. Defaulting to a blank string." do we need to fix this?

Ah, I guess some environments don't expose $HOME (annoyingly). I think I can replace that variable expansion with "~"; this should solve the issue.

Apologies if I am talking out of my behind on this, just trying to provide some semblance of feedback 😅 Testing it now.

Since you're a prominent contributor to this project, it’s only right to ask as many questions as necessary. I won't be around forever after that PR. ;)

Testing

onetrainer | ls: cannot access '/dev/nvidia*': No such file or directory

How strange, someone else had this error too on the ComfyUI PR. I'll do some digging and circle back to that one.

onetrainer | _tkinter.TclError: couldn't connect to display ""

WSL2 doesn’t come with X11, so that checks out. Installing some kind of X server will be a hard requirement for WSL users.

Exception: could not load model: E:/stable-diffusion-webui-forge/models/Stable-diffusion/sd_xl_base_1.0_0.9vae.safetensors

The error doesn’t explain why the model couldn't be loaded, but I'd say it’s because E:/ is not a recognized path inside the container. Can you edit compose.yaml and add - "E:/:/mnt/e" to the volumes section, and then change your model path to
/mnt/e/stable-diffusion-webui-forge/models/Stable-diffusion/sd_xl_base_1.0_0.9vae.safetensors?
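For clarity, the suggested compose.yaml change would look roughly like this (the service name and surrounding structure are assumptions about the actual file):

```yaml
services:
  onetrainer:
    volumes:
      # Map the Windows E: drive into the container at /mnt/e
      - "E:/:/mnt/e"
```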

bbergeron0 avatar Sep 05 '25 04:09 bbergeron0

Modifying compose.yaml to add "E:/:/mnt/e" works as expected; training worked without issue.

I forgot to mention I had to set DISPLAY=host.docker.internal:0.0 to get display to work.

O-J1 avatar Sep 07 '25 09:09 O-J1

I'm glad that solved the issue! I had time today to make a few changes. There are still a few things I want to change on the local stack (mostly how user-generated data is handled), and then I will start looking at VastAI and RunPod.

I forgot to mention I had to set DISPLAY=host.docker.internal:0.0 to get display to work.

The unfortunate issue is that Docker runs Linux containers, which means we have to use X11, which won't provide the best experience on WSL. End users will need to install additional packages and tweak their host environment. Documenting some recommended configuration steps is really the best we can do for WSL users. I'll try to setup a Windows 11 VM to test WSL.
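Those documented steps could include the DISPLAY setting O-J1 mentioned; as a sketch (the service name is an assumption about the actual file), the compose-side half might look like:

```yaml
services:
  onetrainer:
    environment:
      # Point X11 clients at an X server running on the Windows host
      # (e.g. VcXsrv); host.docker.internal resolves to the host machine
      - DISPLAY=host.docker.internal:0.0
```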

bbergeron0 avatar Sep 09 '25 01:09 bbergeron0

Do we need to set PYTHONUNBUFFERED=1? I can see this in the old image, from someone who worked in the PyTorch group.

O-J1 avatar Sep 23 '25 07:09 O-J1

@O-J1 I found this SO thread to understand what this option does: https://stackoverflow.com/questions/59812009/what-is-the-use-of-pythonunbuffered-in-docker-file. Essentially, it forces the stdout and stderr streams to be unbuffered (when buffered, the process doesn't print what is written to stdout and stderr until a newline character is sent, which saves on costly IO operations: https://stackoverflow.com/questions/19990589/stderr-and-stdout-not-buffered-vs-buffered).

Example:

# Buffered
print('Oh,', end='') # Print nothing
print(' hi', end='') # Print nothing
print(' there!') # Print "Oh, hi there!<newline>"

# Unbuffered
print('Oh,', end='') # Print "Oh,"
print(' hi', end='') # Print " hi"
print(' there!') # Print " there!<newline>"

There are two reasons to enable PYTHONUNBUFFERED:

  1. You print many non-terminated (end='') strings without flushing the buffer often, but still want to see the logs immediately.
  2. You are concerned that your application might crash between a non-terminated print statement and a flush trigger, and you do not want to lose the buffered data.

I do not believe it would be beneficial in our case. In fact, I find it rarely useful to unbuffer console I/O, but that is just my opinion. Someone who knows OneTrainer better, and how things are logged (e.g., not me), will have a more informed opinion on that question.
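To make the effect observable, here is a small self-contained sketch (not part of the OneTrainer codebase) that spawns a child interpreter with stdout piped and checks its `sys.stdout.write_through` flag, which CPython enables when PYTHONUNBUFFERED is set:

```python
import os
import subprocess
import sys

def child_stdout_write_through(unbuffered):
    """Spawn a child interpreter with stdout piped and report whether its
    text stdout layer is in write-through (unbuffered) mode."""
    # Start from a clean environment so an inherited PYTHONUNBUFFERED
    # doesn't skew the result
    env = {k: v for k, v in os.environ.items() if k != "PYTHONUNBUFFERED"}
    if unbuffered:
        env["PYTHONUNBUFFERED"] = "1"
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.write_through)"],
        capture_output=True, text=True, env=env,
    )
    return result.stdout.strip()

print(child_stdout_write_through(False))  # "False": piped stdout is block-buffered
print(child_stdout_write_through(True))   # "True": PYTHONUNBUFFERED disables buffering
```

Note that an explicit `print(..., flush=True)` at the call site achieves the same immediacy without making every write unbuffered.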

bbergeron0 avatar Sep 28 '25 16:09 bbergeron0

I added a new commit to address some of the more opinionated issues that have been brought up, such as reducing the instruction count in the Dockerfile and simplifying the volume mapping setup. I'll take another round of feedback, and then I'll mark this PR as ready to merge. As I mentioned in another thread, I'll start by merging local Docker support first and will look at cloud images afterward.

bbergeron0 avatar Sep 28 '25 20:09 bbergeron0

I've tried to give this a run because I needed an isolated container. It repeatedly got stuck at:

 => [ 2/11] RUN apt-get update &&     apt-get install -y --no-install-recommends         cmake         libgl-dev &&     apt-get clean &&     rm -rf /var/lib/apt/lists/*                                                                                                                                              166.7s
 => => # Hit:1 http://deb.debian.org/debian trixie InRelease                                                                                                                                                                                                                                                                
 => => # Get:2 http://deb.debian.org/debian trixie-updates InRelease [47.3 kB]                                                                                                                                                                                                                                              
 => => # Get:3 http://deb.debian.org/debian-security trixie-security InRelease [43.4 kB]                                                                                                                                                                                                                                    
 => => # Get:4 http://deb.debian.org/debian trixie/main amd64 Packages [9669 kB]        

with no network traffic occurring while it waited.

This might be an issue with my docker installation, I don't know (even though this is the same machine that I've used to create the RunPod docker images).

However, it also highlights what I've said above: what I really want is to run an image, not build an image. The average user will not want to deal with any issues caused by building Docker images.

Someone has to build the image - and for this we need docker build files - but as a user who wants to run a container I shouldn't have to deal with that.

dxqb avatar Nov 11 '25 19:11 dxqb

I've tried to give this a run because I needed an isolated container. It repeatedly got stuck at:

...

Someone has to build the image - and for this we need docker build files - but as a user who wants to run a container I shouldn't have to deal with that.

I had no issues building, and I am on Windows of all things; I think the issue might be on your side, but let's see.

Remember, though, that the sole intended purpose is for a local user to run it locally, to isolate their system during training. The other images he mentioned will be for remote use/spinning up a machine.

O-J1 avatar Nov 14 '25 08:11 O-J1

I've tried to give this a run because I needed an isolated container. It repeatedly got stuck at: ... Someone has to build the image - and for this we need docker build files - but as a user who wants to run a container I shouldn't have to deal with that.

I had no issues building and I am on windows of all things, I think the issue might be on your side but lets see.

I agree, but my point is a different one: why do I have to build a container to run a container? Now you could counter that nothing stops me from using a container that somebody else has built from these build files and uploaded. But the Docker files are designed to build a container from my own local OneTrainer copy (see the discussion above about files and directories). So it's designed for exactly what I, as a regular user in this situation, don't want to do.

dxqb avatar Nov 14 '25 08:11 dxqb

So that you get all your files, since it's meant to be equivalent to running bare metal on your local system, whilst keeping the bare-metal copy immutable.

O-J1 avatar Nov 14 '25 08:11 O-J1

It should build a container that is universal, not dependent on my local files, so it can be shared and uploaded and not every user has to build their own.

It should then provide a way to use my local data in the container, for example by mounting some directories.

dxqb avatar Nov 14 '25 08:11 dxqb

It should build a container that is universal, not dependant on my local files, so it can be shared and uploaded and not every user has to build their own.

That is the point of the remote images that will come later. The use case of this one is strictly local (hence the name), and there have been many users requesting this.

O-J1 avatar Nov 14 '25 08:11 O-J1

I don't see it as a separate use case. The universal container that can be downloaded can also be built locally and used in the same way.

dxqb avatar Nov 14 '25 08:11 dxqb

Hey, sorry for going silent since September; work and family life are a time-consuming and arduous blessing.

@dxqb To run a Docker container, you need a Docker image. To get a Docker image, you must either build it locally or pull a pre-built image from an image registry.

For most end users, pulling a pre-built image is the intended way to obtain Docker images. However, doing so requires the project owners to create a Docker Hub account so they can push pre-built images, which the project owners must first build locally.

When you say "I shouldn't have to deal with that [as an end user]", you're right, but you're barking up the wrong tree. The scope of this PR is to lay the groundwork for developers so they can build, and eventually publish, images, while also supporting users who don't mind building the images locally in the meantime.

It should build a container that is universal, not dependent on my local files.

No, it shouldn’t. Docker builds from local files. Developers want to test their local files. A CI system should generate artifacts according to local files.

When building from local files (i.e., importing files from the host into the image using COPY), those files become part of the image, and the resulting image is "universal": you can send it to anyone, and it will behave the same as long as the same runtime options are used.

I really have no idea what the issue or solution is here. What are your expectations for a Dockerfile? How do you want you and other users to obtain the image? What do you mean by "universal"? How would you build an image without using local files?

bbergeron0 avatar Nov 14 '25 20:11 bbergeron0