[BUG] Significant difference between using DeepSpeed and not using DeepSpeed
**Describe the bug**
The goal is to train an LLM adapter (the LLM is frozen and only the adapter is trained). Training on a single GPU without DeepSpeed reaches 79.24% test accuracy, while training with DeepSpeed reaches only 69.56% on a single GPU and 67.94% on 7 GPUs.
The adapter takes a standard torch_geometric graph input, i.e. node embeddings together with an edge index describing the graph structure. I am using ZeRO stage 2 optimization.
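For context, a minimal sketch of the kind of setup described above (the adapter architecture, `GraphAdapter`, the dimensions, and the use of gpt2 as a stand-in LLM are placeholders, not the actual private code):

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv
from transformers import AutoModelForCausalLM

class GraphAdapter(nn.Module):
    """Hypothetical adapter: maps node embeddings into the LLM's hidden space."""
    def __init__(self, in_dim: int, hidden_dim: int, llm_dim: int):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, llm_dim)

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # x: [num_nodes, in_dim] node embeddings, edge_index: [2, num_edges]
        h = self.conv1(x, edge_index).relu()
        return self.conv2(h, edge_index)

llm = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for the actual LLM
for p in llm.parameters():                          # freeze the LLM
    p.requires_grad = False

adapter = GraphAdapter(in_dim=128, hidden_dim=256, llm_dim=llm.config.hidden_size)
# Only the adapter's parameters remain trainable.
trainable = [n for n, p in list(llm.named_parameters()) + list(adapter.named_parameters())
             if p.requires_grad]
```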
**To Reproduce**
Please see the ZeRO stage 2 configuration below:
{ "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": "auto" }, "train_micro_batch_size_per_gpu": "auto", "train_batch_size": "auto", "gradient_accumulation_steps": "auto", "zero_optimization": { "stage": 2, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto" } }
I am afraid the exact code is still in a private repo, but the training follows the standard transformers Trainer workflow.
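To give a concrete picture of what "standard transformers Trainer workflow" means here, a minimal sketch of how the config above is typically hooked up (`model`, `train_dataset`, the config filename, and all hyperparameter values below are placeholders, not the actual code):

```python
from transformers import Trainer, TrainingArguments

# `model` wraps the frozen LLM plus the trainable adapter, and `train_dataset`
# yields the graph-conditioned training examples -- both are placeholders here.
args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=8,     # illustrative values only
    gradient_accumulation_steps=1,
    learning_rate=2e-4,
    num_train_epochs=3,
    bf16=True,                         # resolves the "auto" fp16/bf16 entries in the config
    deepspeed="ds_config_zero2.json",  # the ZeRO stage 2 config shown above
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```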
**Expected behavior**
The model trained with DeepSpeed (on a single GPU or multiple GPUs) should achieve performance similar to the model trained without DeepSpeed.
**ds_report output**
**System info (please complete the following information):**
- Server: single node with 8 A100 GPUs.
- Python: 3.10.9
- DeepSpeed: 0.11.1
- Transformers: 4.31.0