[BUG] Profiler records values that differ from transformers
Describe the bug
When comparing ZeRO-1 and ZeRO-2, I noticed that the DeepSpeed Flops Profiler and the training speed metrics reported by transformers disagree, and the two lead to opposite conclusions about which stage is faster. I would like to understand why this happens. Below is my statistical table (a quick arithmetic cross-check follows it):
| method | samples/sec (DeepSpeed profiler) | samples/sec (HF Transformers) | throughput (HF Transformers) |
|---|---|---|---|
| zero-1 | 33.68 | 39.669 | 13540 |
| zero-2 | 33.77 | 35.527 | 12170 |
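A quick arithmetic reading of the table, taking the columns at face value (the ratios below are computed directly from the numbers above): the two sources imply opposite conclusions about ZeRO-2 vs ZeRO-1.

```python
# ZeRO-2 vs ZeRO-1 throughput ratio implied by each source in the table above.
profiler_ratio = 33.77 / 33.68        # ~1.003: the flops profiler says ZeRO-2 is marginally faster
transformers_ratio = 35.527 / 39.669  # ~0.896: transformers says ZeRO-2 is ~10% slower
print(f"profiler: {profiler_ratio:.3f}  transformers: {transformers_ratio:.3f}")
```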
To Reproduce
pass
Expected behavior
The profiler's samples/sec for ZeRO-1 should be higher than for ZeRO-2, and both values should be close to what transformers reports.
ds_report output
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
torch version .................... 2.0.1
deepspeed info ................... 0.13.1, unknown, unknown
torch cuda version ............... 11.8
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8
Screenshots
pass
System info (please complete the following information):
- OS: debian 12
- GPUs: 8x A100
- Python 3.10
Launcher context
torchrun $DISTRIBUTED_ARGS pretraining.py \
--model_config_path $MODEL \
--data_dir $DATA \
--do_train true \
--do_eval false \
--fp16 true \
--output_dir output_test \
--num_train_epochs 1 \
--per_device_train_batch_size 32 \
--gradient_accumulation_steps 4 \
--evaluation_strategy "no" \
--save_strategy "epoch" \
--save_total_limit 10 \
--learning_rate 1e-3 \
--weight_decay 0.1 \
--adam_beta1 0.9 \
--adam_beta2 0.98 \
--logging_steps 1 \
--report_to "none" \
--model_max_length 2048 \
--include_tokens_per_second true \
--include_num_input_tokens_seen true \
--gradient_checkpointing \
--deepspeed ${DS_CONFIG_PATH}
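The contents of ${DS_CONFIG_PATH} are not included above. For reference, here is a minimal sketch of what a ZeRO-1 config with the flops profiler enabled might look like; the values are illustrative assumptions, not the config actually used in this run (the keys are standard DeepSpeed config options, and "auto" lets the HF Trainer fill in its own arguments):

```python
# Hypothetical ds_config.json contents (illustrative only), written from Python for readability.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {"enabled": "auto"},
    "zero_optimization": {"stage": 1},  # set to 2 for the ZeRO-2 run
    "flops_profiler": {
        "enabled": True,
        "profile_step": 10,   # which training step to profile
        "module_depth": -1,
        "top_modules": 1,
        "detailed": True,
    },
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```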
Docker context
pass
Additional context
Here is the DeepSpeed flops profiler summary text:
zero-1
world size: 6
data parallel size: 6
model parallel size: 1
batch size per GPU: 32
params per GPU: 1.34 B
params of model = params per GPU * mp_size: 1.34 B
fwd MACs per GPU: 78.96 TMACs
fwd flops per GPU: 157.93 T
fwd flops of model = fwd flops per GPU * mp_size: 157.93 T
fwd latency: 1.2 s
fwd FLOPS per GPU = fwd flops per GPU / fwd latency: 131.12 TFLOPS
bwd latency: 3.98 s
bwd FLOPS per GPU = 2 * fwd flops per GPU / bwd latency: 79.35 TFLOPS
fwd+bwd FLOPS per GPU = 3 * fwd flops per GPU / (fwd+bwd latency): 91.38 TFLOPS
step latency: 515.57 ms
iter latency: 5.7 s
FLOPS per GPU = 3 * fwd flops per GPU / iter latency: 83.11 TFLOPS
samples/second: 33.68
zero-2
world size: 6
data parallel size: 6
model parallel size: 1
batch size per GPU: 32
params per GPU: 1.34 B
params of model = params per GPU * mp_size: 1.34 B
fwd MACs per GPU: 78.96 TMACs
fwd flops per GPU: 157.93 T
fwd flops of model = fwd flops per GPU * mp_size: 157.93 T
fwd latency: 1.2 s
fwd FLOPS per GPU = fwd flops per GPU / fwd latency: 131.35 TFLOPS
bwd latency: 3.97 s
bwd FLOPS per GPU = 2 * fwd flops per GPU / bwd latency: 79.56 TFLOPS
fwd+bwd FLOPS per GPU = 3 * fwd flops per GPU / (fwd+bwd latency): 91.6 TFLOPS
step latency: 512.26 ms
iter latency: 5.68 s
FLOPS per GPU = 3 * fwd flops per GPU / iter latency: 83.34 TFLOPS
samples/second: 33.77
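For what it's worth, the profiler's samples/second above appears to come from a single profiled iteration: world size × batch size per GPU ÷ iter latency roughly reproduces both values (a quick check below, using only the numbers reported above):

```python
# Cross-check: samples/second ~= world_size * batch_per_gpu / iter_latency for the profiled step.
world_size = 6
batch_per_gpu = 32
for label, iter_latency_s in [("zero-1", 5.70), ("zero-2", 5.68)]:
    print(label, round(world_size * batch_per_gpu / iter_latency_s, 2))
# zero-1 33.68, zero-2 33.8 -- in line with the 33.68 / 33.77 reported above
```

By contrast, the transformers train_samples_per_second divides the total number of training samples by the overall train_runtime, so data loading, logging, and the profiler's own hook overhead are averaged in over the whole run; that difference in what is being measured could account for at least part of the disagreement.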
Hi @ziliwang - sorry for getting to this so late, are you still hitting this with the latest DeepSpeed release?
Hi @ziliwang, just wondering, could you please share the change you made to enable the TFLOPS profile? According to the official docs, user code changes can be avoided only when training runs through the DeepSpeed runtime. However, I found that in /usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py, even though self.flops_profiler.start_profile(ignore_list=None) and self.flops_profiler.stop_profile() are indeed invoked in forward(), no flops profiler summary text is printed.
The root cause is that step(self, lr_kwargs=None), which calls self.flops_profiler.print_model_profile(), is never invoked, so no summary text is produced. Thanks for any suggestions or hints on how to trigger the step() function.
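In case it helps: when training actually runs through deepspeed.initialize (which the HF Trainer does once --deepspeed is passed), engine.step() should be invoked on every optimizer step, so enabling the flops_profiler block in the JSON config should be enough to get the summary at the configured profile_step. If that path does not fire in your setup, the profiler can also be driven manually; below is a minimal sketch using the public FlopsProfiler API, where model and batch stand for your own module and inputs:

```python
# Manually driving the flops profiler, independent of engine.step() (sketch).
import torch
from deepspeed.profiling.flops_profiler import FlopsProfiler

def profile_one_step(model: torch.nn.Module, batch: dict):
    prof = FlopsProfiler(model)
    prof.start_profile()                       # attach the counting hooks
    loss = model(**batch).loss                 # run the forward pass to be measured (HF-style model assumed)
    prof.stop_profile()                        # stop timing/counting
    prof.print_model_profile(profile_step=1)   # prints the summary text shown in this issue
    prof.end_profile()                         # remove the hooks
    return loss
```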