How much VRAM is needed at a minimum to fine-tune the 3.6B parameter model C?
Thank you for releasing the new model so promptly; I’m very excited about fine-tuning it. Could you please tell me how much VRAM is needed at a minimum to fine-tune the 3.6B parameter model C? Even on a local card with 48GB of VRAM, fine-tuning at a resolution of 768 runs out of memory. When I train a LoRA on the 3.6B model C with a batch size of 4 and a resolution of 768, VRAM usage sits at about 45GB. These results are with FSDP and EMA turned off. Is this level of VRAM usage normal, or is it because optimizations like xformers have not been implemented yet? Additionally, what is the optimal resolution for training? The default appears to be 768, but is training at a resolution of 1024 not recommended, as is done with SDXL?
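For context, here is the rough back-of-envelope I have been working from (a sketch only: it assumes bf16 weights and gradients plus fp32 AdamW master weights and moments for full fine-tuning, a frozen bf16 base for LoRA, and a made-up adapter size; the actual trainer may account for memory differently):

```python
# Rough back-of-envelope for model-state memory only (activations come on top
# and dominate at batch 4 / 768px). Assumes bf16 weights + bf16 grads + fp32
# AdamW master weights and moments; the actual trainer may differ.
params = 3.6e9

# Full fine-tune: 2 (weights) + 2 (grads) + 4 (master) + 4 (exp_avg) + 4 (exp_avg_sq) bytes/param
bytes_full = params * (2 + 2 + 4 + 4 + 4)
print(f"full fine-tune model states: {bytes_full / 2**30:.1f} GiB")  # ~53.6 GiB

# LoRA: base weights frozen in bf16; only the adapter carries grads/optimizer state.
lora_params = 50e6  # hypothetical adapter size, just for scale
bytes_lora = params * 2 + lora_params * (2 + 2 + 4 + 4 + 4)
print(f"LoRA model states: {bytes_lora / 2**30:.1f} GiB")  # ~7.5 GiB
```

If that accounting is roughly right, full fine-tuning of the 3.6B model would not fit in 48GB even before activations, and the ~45GB I see for LoRA at batch size 4 / 768 would mostly be activations on top of the frozen base.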
Additionally, for everyone's reference, I have switched to fine-tuning the 1B version of model C. At a resolution of 1024, the baseline VRAM usage is 30GB; with a 48GB RTX 6000 Ada it is possible to reach a batch size of 6. The training dataset for the test model contains 740 images, and 12,500 steps take about 3.5 hours.
That's super helpful -- thanks a lot, @stinbuaa!
On an A100 with 40 GB of VRAM, batch_size: 40 with image_size: 768 seems to work, using model_version: 1B (and generator_checkpoint_path: models/stage_c_lite_bf16.safetensors).
Would love some more clarity on this issue as well. I tried with an A6000 and was getting OOM on the 3.6B model C even at batch size 1. Is this the kind of thing that will need multiple A100s/H100s?
For what it's worth, with the 3.6B model C and a batch size of 1, it seems to be peaking at 45,149 MiB of VRAM according to nvidia-smi on an A100 with 80 GB of VRAM. A batch size of 6 (with grad_accum_steps: 1) just barely fits, using 78,063 MiB.
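In case it helps others compare numbers, here is a minimal sketch for logging peak allocator usage from inside the training script (the helper name and where to call it are my own; nvidia-smi will always read somewhat higher than these figures because it also counts the CUDA context and allocator overhead):

```python
import torch

def log_peak_vram(tag: str) -> None:
    """Print peak PyTorch allocator stats; nvidia-smi reports a bit more
    (CUDA context, cuDNN workspaces, allocator fragmentation)."""
    alloc = torch.cuda.max_memory_allocated() / 2**20     # MiB actually allocated to tensors
    reserved = torch.cuda.max_memory_reserved() / 2**20   # MiB held by the caching allocator
    print(f"[{tag}] peak allocated: {alloc:,.0f} MiB | peak reserved: {reserved:,.0f} MiB")

# e.g. call torch.cuda.reset_peak_memory_stats() before the first step,
# then log_peak_vram("step 100") inside the training loop.
```

Resetting the peak stats right before the first step makes the numbers comparable across runs with different batch sizes.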