Time cost of 7B model training compared to AdamW
Does this mean LOMO is 11 times faster than AdamW?
Our experiments focus on fine-tuning large language models on consumer GPUs such as the RTX 3090. Under those conditions the result is indeed as stated (the 7B model on RTX 3090 GPUs, with a batch size of 1024 tokens and ZeRO-3 as the parallel strategy). As mentioned in the paper, this is because LOMO does not require inter-GPU communication in that setting. In the scenario where model parallelism is necessary even for LOMO (the 13B model), inter-GPU communication is required, and LOMO is no longer faster than SGD by a factor of 11. For the 13B/30B/65B models, we cannot claim that LOMO is 11 times faster than AdamW, because these models cannot be trained with AdamW on 8 RTX 3090 GPUs at all.
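For reference, here is a minimal sketch of the kind of ZeRO-3 configuration described above, assuming DeepSpeed is used for the parallel strategy. The specific values (micro-batch size, precision) are illustrative assumptions, not the exact settings used in the paper or in this repo's scripts:

```python
# Hypothetical DeepSpeed ZeRO-3 config approximating the setup described above
# (7B model on 8x RTX 3090). All concrete values below are assumptions.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # assumption: one ~1024-token sequence per GPU per step
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},             # assumption: bf16 mixed precision on Ampere GPUs
    "zero_optimization": {
        "stage": 3,                        # shard parameters, gradients, and optimizer state across GPUs
    },
}
```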
Based on the information provided, we consider this issue resolved. If you have any further questions or concerns, please reopen this issue and provide additional details.