[BUG] Scaling from 2 to 3 GPUs falls short of expectations; NVIDIA vs. AMD performance; flash-attn not supported on AMD GPUs
I have encountered some challenges when using DeepSpeed that I hope to address with your expertise.
- During fine-tuning of LLama-7b-chat-hf and LLama-13b-chat-hf with multiple GPUs, I observed the following throughput: 1 GPU (60 tokens/s), 2 GPUs (178 tokens/s), 3 GPUs (230 tokens/s), and 4 GPUs (300 tokens/s). Surprisingly, throughput did not increase proportionally beyond two GPUs; 3 GPUs fall short of expectations compared with 2 GPUs (see the quick arithmetic below). Is there a technical explanation for this?
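For reference, here is the scaling arithmetic behind the question (nothing beyond the throughputs I measured above):

```python
# Speedup relative to 1 GPU and per-GPU parallel efficiency,
# computed from the measured throughputs above.
throughput = {1: 60, 2: 178, 3: 230, 4: 300}  # tokens/s

base = throughput[1]
for n, tps in throughput.items():
    speedup = tps / base
    efficiency = speedup / n
    print(f"{n} GPU(s): {tps:4d} tok/s  speedup={speedup:.2f}x  efficiency={efficiency:.0%}")

# 1 GPU(s):   60 tok/s  speedup=1.00x  efficiency=100%
# 2 GPU(s):  178 tok/s  speedup=2.97x  efficiency=148%
# 3 GPU(s):  230 tok/s  speedup=3.83x  efficiency=128%
# 4 GPU(s):  300 tok/s  speedup=5.00x  efficiency=125%
```

Going from 1 to 2 GPUs adds 118 tokens/s, but going from 2 to 3 adds only 52 tokens/s; that drop-off is what I am asking about.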
- Under identical conditions on a TRX50 motherboard, we compared the performance of two configurations:
  - Case 1: NVIDIA RTX 4090 x 2 cards
  - Case 2: AMD Radeon Pro W7900 x 2 cards

  Two months ago, the AMD Radeon Pro W7900 outperformed the NVIDIA RTX 4090 in speed (tokens/s) on the LLama-7b-chat-hf and LLama-13b-chat-hf models. In my recent tests, however, the NVIDIA RTX 4090 surpassed the AMD Radeon Pro W7900, both with and without flash-attn enabled.
I would appreciate your insights on these issues. Is there an explanation for these performance fluctuations? Are certain versions of DeepSpeed optimized for specific GPU types, such as the NVIDIA RTX 4090 or the AMD W7900? (The environment check I run before benchmarking is included below.)
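In case the software stack matters, this is roughly the environment report I run on both machines before each benchmark (a minimal sketch; `torch.version.hip` is only populated on ROCm builds of PyTorch):

```python
# Minimal environment report, run on both the NVIDIA and AMD machines,
# since a stack change between test runs could explain the flip in results.
import torch
import deepspeed

print("torch     :", torch.__version__)
print("deepspeed :", deepspeed.__version__)
print("CUDA      :", torch.version.cuda)                    # None on ROCm builds
print("ROCm/HIP  :", getattr(torch.version, "hip", None))   # None on CUDA builds
print("device    :", torch.cuda.get_device_name(0))         # works for HIP devices too
```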
- I would also like to ask why flash-attn does not support AMD GPUs (Radeon Pro W7800, W7900). As a stopgap I have been testing PyTorch's built-in attention kernel, sketched below.
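This is the workaround I currently use on the AMD cards. It is only a sketch (the shapes are illustrative, and I am assuming PyTorch >= 2.0 for `scaled_dot_product_attention`), not a full replacement for the flash-attn API:

```python
# Fallback on the AMD cards: PyTorch's fused SDPA kernel instead of flash-attn.
# Tensor layout: (batch, heads, seq_len, head_dim); values here are illustrative.
import torch
import torch.nn.functional as F

q = torch.randn(1, 32, 1024, 128, device="cuda", dtype=torch.float16)
k = torch.randn(1, 32, 1024, 128, device="cuda", dtype=torch.float16)
v = torch.randn(1, 32, 1024, 128, device="cuda", dtype=torch.float16)

# Uses a fused memory-efficient kernel where the backend provides one.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 32, 1024, 128])
```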
Thank you! Le