[BUG] Significant difference between using DeepSpeed and not using DeepSpeed
**Describe the bug**
The goal is to train an LLM adapter (the LLM is frozen and only the adapter is trained). Training on a single GPU without DeepSpeed reaches 79.24% test accuracy, while training with DeepSpeed reaches only 69.56% on a single GPU and 67.94% on 7 GPUs.
The adapter takes a standard torch_geometric graph input, i.e. node embeddings together with an edge index describing the graph structure. I am using ZeRO stage 2 optimization.
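For context, a minimal sketch of the kind of setup described above (the adapter architecture, `GraphAdapter`, the dimensions, and the use of gpt2 as a stand-in LLM are placeholders, not the actual private code):

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv
from transformers import AutoModelForCausalLM

class GraphAdapter(nn.Module):
    """Hypothetical adapter: maps node embeddings into the LLM's hidden space."""
    def __init__(self, in_dim: int, hidden_dim: int, llm_dim: int):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, llm_dim)

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # x: [num_nodes, in_dim] node embeddings, edge_index: [2, num_edges]
        h = self.conv1(x, edge_index).relu()
        return self.conv2(h, edge_index)

llm = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for the actual LLM
for p in llm.parameters():                          # freeze the LLM
    p.requires_grad = False

adapter = GraphAdapter(in_dim=128, hidden_dim=256, llm_dim=llm.config.hidden_size)
# Only the adapter's parameters remain trainable.
trainable = [n for n, p in list(llm.named_parameters()) + list(adapter.named_parameters())
             if p.requires_grad]
```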
**To Reproduce**
Please see the ZeRO stage 2 configuration below:
{ "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": "auto" }, "train_micro_batch_size_per_gpu": "auto", "train_batch_size": "auto", "gradient_accumulation_steps": "auto", "zero_optimization": { "stage": 2, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto" } }
I am afraid the exact code is still in a private repo, but the training follows the standard transformers Trainer workflow.
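To give a concrete picture of what "standard transformers Trainer workflow" means here, a minimal sketch of how the config above is typically hooked up (`model`, `train_dataset`, the config filename, and all hyperparameter values below are placeholders, not the actual code):

```python
from transformers import Trainer, TrainingArguments

# `model` wraps the frozen LLM plus the trainable adapter, and `train_dataset`
# yields the graph-conditioned training examples -- both are placeholders here.
args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=8,     # illustrative values only
    gradient_accumulation_steps=1,
    learning_rate=2e-4,
    num_train_epochs=3,
    bf16=True,                         # resolves the "auto" fp16/bf16 entries in the config
    deepspeed="ds_config_zero2.json",  # the ZeRO stage 2 config shown above
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```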
**Expected behavior**
The model trained with DeepSpeed (on a single GPU or multiple GPUs) should achieve performance similar to the model trained without DeepSpeed.
**ds_report output**
**System info (please complete the following information):**
- Server: single node with 8 A100 GPUs.
- Python: 3.10.9
- DeepSpeed: 0.11.1
- Transformers: 4.31.0