Time cost of 7B model training compared to AdamW
Does this mean LOMO is 11 times faster than AdamW?
Our experiments focus on fine-tuning large language models on consumer GPUs such as the RTX 3090. Under those conditions the result is indeed as stated (the 7B model on RTX 3090 GPUs, with a batch size of 1024 tokens and ZeRO-3 as the parallel strategy). As mentioned in the paper, this is because LOMO does not require inter-GPU communication in that setting. In the scenario where model parallelism is necessary even for LOMO (the 13B model), inter-GPU communication is required, and LOMO is no longer faster than SGD by a factor of 11. For the 13B/30B/65B models, we cannot claim that LOMO is 11 times faster than AdamW, because these models cannot be trained with AdamW on 8 RTX 3090 GPUs at all.
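For reference, here is a minimal sketch of the kind of ZeRO-3 configuration described above, assuming DeepSpeed is used for the parallel strategy. The specific values (micro-batch size, precision) are illustrative assumptions, not the exact settings used in the paper or in this repo's scripts:

```python
# Hypothetical DeepSpeed ZeRO-3 config approximating the setup described above
# (7B model on 8x RTX 3090). All concrete values below are assumptions.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # assumption: one ~1024-token sequence per GPU per step
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},             # assumption: bf16 mixed precision on Ampere GPUs
    "zero_optimization": {
        "stage": 3,                        # shard parameters, gradients, and optimizer state across GPUs
    },
}
```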
Based on the information provided, we consider this issue resolved. If you have any further questions or concerns, please reopen this issue and provide additional details.