[Bug]: Backups fail due to insane RAM usage
What happened?
Not really a bug in the sense of "not working", but insane resource consumption.
When doing SDXL fine-tune training with EMA enabled and EMA running on the CPU, the process takes about 18-19 GB of system RAM (CPU, not VRAM!). Independent of the selected optimizer and other settings, this goes up by an insane amount when storing backups during training. I imposed a memory limit of 28 GB on the process (in order not to cause core dumps due to OOM on system RAM), and the process crashes due to out of memory. Hence, saving the backup takes at least about 10 GB of additional system RAM. Given that the whole model is less than 8 GB when saved, this sounds insane.
When EMA is completely disabled, system memory consumption with the very same settings is at 15-17 GB and goes up "only" by about 6-7 GB when storing the backup (I saw a maximum of 24 GB of system RAM consumed). My guess is that all data written during the backup process is first copied/stored in RAM and only then written to disk. I am also guessing that when EMA is enabled, this amount at least doubles.
The key point is: given that all the data needed should already be in RAM/VRAM at that point in time, it should be possible to just "stream" it to disk with nearly no additional system memory being consumed during the backup/save process.
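To illustrate what I mean by "streaming" (just a sketch of the principle, assuming the state dict is accessible; it is not how OneTrainer or safetensors actually serializes, since a real file format also needs a header with tensor names, shapes and offsets):

```python
import torch

def stream_state_dict(state_dict, path, out_dtype=torch.float16):
    # Convert and write one tensor at a time, so the peak extra RAM is
    # roughly one tensor instead of a full converted copy of the model.
    with open(path, "wb") as f:
        for tensor in state_dict.values():
            chunk = tensor.detach().to("cpu", dtype=out_dtype).contiguous()
            chunk.numpy().tofile(f)  # raw bytes only, header omitted for brevity
            del chunk                # freed before the next tensor is converted
```

The file format is beside the point here; what matters is that the peak extra allocation stays at roughly one tensor.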
Why is this important? Well, this does not haunt people with >32 GB of system RAM. But since we go a long way to make the training process consume as few resources as possible (mostly VRAM on the GPU), it seems strange that for some people training will fail after it has actually completed, namely when a backup is stored or the final model is written to disk. Essentially it makes SDXL training with EMA enabled impossible for people with only "small" amounts of system RAM (32 GB are not enough), independent of the amount of available VRAM, and it probably also causes a lot of trouble in case one has "only" 16 GB of memory.
What did you expect would happen?
RAM consumption should only increase slightly when writing backups/saving the final model or intermediate states.
Relevant log output
No response
Output of pip freeze
No response
Thank you for making this, I was so confused as to what my issue was since I wasn't getting OOM errors. Luckily for me I had an extra 16 GB RAM stick lying around and added it to my PC. When caching I get to about 36 GB of RAM; I had 32 GB, so that extra 16 GB made it fit comfortably into RAM while caching.
I also noticed that, at least on Linux, there seem to be some memory leaks (both VRAM and RAM). Hence, my workaround is to close & reopen OneTrainer before starting the next training session. Otherwise some memory allocations are never released and I end up getting either a VRAM OOM or segmentation faults due to OOM in system RAM. It helps a bit, but does not solve the problem described above (it just underlines the behavior we currently see concerning all kinds of memory).
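(For reference, the generic in-process cleanup one would normally try between runs is plain PyTorch housekeeping, sketched below and not OneTrainer-specific; restarting the process goes further because it also releases everything the allocator and the CUDA context still hold.)

```python
import gc
import torch

def best_effort_cleanup():
    # The caller must drop its own references first (del model, optimizer, EMA copy);
    # this only cleans up what is already unreachable.
    gc.collect()              # collect unreachable Python objects
    torch.cuda.empty_cache()  # hand cached, unused CUDA blocks back to the driver
```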
@Thomas2419 @gilga2024 Can you both run update.bat and confirm if this still occurs?
I updated to the latest state of the master branch and can confirm that the problem still occurs, although the behavior changed a bit. It now fails when saving the model instead of when saving the backup:
- it trains at about 18 GB RAM
- when writing the backup, memory consumption increases to about 24-25 GB RAM
- it fails because of an OOM for RAM when trying to save the model (when trying to reserve even more RAM)
One also cannot train an SDXL fine tune by simply disabling backups; it will then produce an OOM for RAM when saving the model.
Also interesting: When manually stopping the training, the backup & model are successfully saved without going into OOM for RAM.
Dang. Could you please provide more system details (OS, total VRAM, total RAM) and the config JSON? Given that other Linux users have not encountered this, I suspect these details will be needed to investigate.
OS is Debian/Linux, VRAM is 12 GB and RAM is 32 GB (physical; a bit less in practice). There are many *.json files, but no config.json in the whole directory (including subdirectories).
You export your config.json by hitting the big export button in the UI (bottom right). Ensure the preset you used to train (and that fails) is active.
Sorry, I never used that button again after learning that it does not export "everything", and implemented my own solution for that. I will do it after the current training run is complete (which may take a few days).
It exports a config of the settings used, to allow recreation of the training run; that's all it's intended for, as far as I know. It also helps with troubleshooting.
Sorry for taking so long, but I had to finish some other stuff. I just recreated the problem using the attached config. The process was limited to a memory usage of 27,000 MB and failed while saving a model (version) during training, after writing a backup.
related: https://github.com/huggingface/safetensors/issues/415
One small update to this: I got it working (actually by accident) by using fp16 instead of fp32 for the 'output data type' setting on the model tab. Maybe that helps to pinpoint it further. If I use fp32 there, it always crashes when saving the model (saving the backup works).
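My guess as to why, as a back-of-the-envelope estimate (assuming the save path materialises a dtype-converted CPU copy of the weights, and using a rough SDXL parameter count; both are my own assumptions):

```python
import torch

def copy_size_gib(num_params: int, out_dtype: torch.dtype) -> float:
    # Size of a temporary converted copy of the weights, in GiB.
    return num_params * (torch.finfo(out_dtype).bits // 8) / 1024**3

sdxl_params = 3_500_000_000  # rough total for UNet + TE1 + TE2 + VAE (estimate)
print(f"fp16 copy: ~{copy_size_gib(sdxl_params, torch.float16):.1f} GiB")  # ~6.5 GiB
print(f"fp32 copy: ~{copy_size_gib(sdxl_params, torch.float32):.1f} GiB")  # ~13.0 GiB
```

That would roughly match the <8 GB the saved fp16 model takes on disk, and it would explain why fp32 output pushes the process over the limit while fp16 output just fits.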
One question related to this: the base model itself is fp16, so I chose fp16 for the weight data type and the UNet data type. I also use fp16 for the TE1 and TE2 data types, since I guess they are also fp16. The VAE is fp32. When doing fp32 training, i.e. fp32 for the train data type and the fallback data type, does it hurt to use fp16 instead of fp32 for the output data type?