[BUG] Profiler records values that differ from transformers
Describe the bug
When comparing ZeRO-1 and ZeRO-2, I noticed that the DeepSpeed Flops Profiler and the training speed metrics reported by transformers disagree, and the two lead to opposite conclusions about which stage is faster. I would like to understand why this happens. Below is my statistical table (a quick arithmetic cross-check follows it):
| method | samples/sec (DeepSpeed profiler) | samples/sec (HF Transformers) | throughput (HF Transformers) |
|---|---|---|---|
| zero-1 | 33.68 | 39.669 | 13540 |
| zero-2 | 33.77 | 35.527 | 12170 |
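A quick arithmetic reading of the table, taking the columns at face value (the ratios below are computed directly from the numbers above): the two sources imply opposite conclusions about ZeRO-2 vs ZeRO-1.

```python
# ZeRO-2 vs ZeRO-1 throughput ratio implied by each source in the table above.
profiler_ratio = 33.77 / 33.68        # ~1.003: the flops profiler says ZeRO-2 is marginally faster
transformers_ratio = 35.527 / 39.669  # ~0.896: transformers says ZeRO-2 is ~10% slower
print(f"profiler: {profiler_ratio:.3f}  transformers: {transformers_ratio:.3f}")
```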
To Reproduce
pass
Expected behavior
The profiler's samples/sec for ZeRO-1 should be higher than for ZeRO-2, and both values should be close to what transformers reports.
ds_report output
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
torch version .................... 2.0.1
deepspeed info ................... 0.13.1, unknown, unknown
torch cuda version ............... 11.8
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8
Screenshots
pass
System info (please complete the following information):
- OS: debian 12
- GPUs: 8x A100
- Python 3.10
Launcher context
torchrun $DISTRIBUTED_ARGS pretraining.py \
--model_config_path $MODEL \
--data_dir $DATA \
--do_train true \
--do_eval false \
--fp16 true \
--output_dir output_test \
--num_train_epochs 1 \
--per_device_train_batch_size 32 \
--gradient_accumulation_steps 4 \
--evaluation_strategy "no" \
--save_strategy "epoch" \
--save_total_limit 10 \
--learning_rate 1e-3 \
--weight_decay 0.1 \
--adam_beta1 0.9 \
--adam_beta2 0.98 \
--logging_steps 1 \
--report_to "none" \
--model_max_length 2048 \
--include_tokens_per_second true \
--include_num_input_tokens_seen true \
--gradient_checkpointing \
--deepspeed ${DS_CONFIG_PATH}
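The contents of ${DS_CONFIG_PATH} are not included above. For reference, here is a minimal sketch of what a ZeRO-1 config with the flops profiler enabled might look like; the values are illustrative assumptions, not the config actually used in this run (the keys are standard DeepSpeed config options, and "auto" lets the HF Trainer fill in its own arguments):

```python
# Hypothetical ds_config.json contents (illustrative only), written from Python for readability.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {"enabled": "auto"},
    "zero_optimization": {"stage": 1},  # set to 2 for the ZeRO-2 run
    "flops_profiler": {
        "enabled": True,
        "profile_step": 10,   # which training step to profile
        "module_depth": -1,
        "top_modules": 1,
        "detailed": True,
    },
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```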
Docker context
pass
Additional context
Here is the DeepSpeed flops profiler summary text:
zero-1
world size: 6
data parallel size: 6
model parallel size: 1
batch size per GPU: 32
params per GPU: 1.34 B
params of model = params per GPU * mp_size: 1.34 B
fwd MACs per GPU: 78.96 TMACs
fwd flops per GPU: 157.93 T
fwd flops of model = fwd flops per GPU * mp_size: 157.93 T
fwd latency: 1.2 s
fwd FLOPS per GPU = fwd flops per GPU / fwd latency: 131.12 TFLOPS
bwd latency: 3.98 s
bwd FLOPS per GPU = 2 * fwd flops per GPU / bwd latency: 79.35 TFLOPS
fwd+bwd FLOPS per GPU = 3 * fwd flops per GPU / (fwd+bwd latency): 91.38 TFLOPS
step latency: 515.57 ms
iter latency: 5.7 s
FLOPS per GPU = 3 * fwd flops per GPU / iter latency: 83.11 TFLOPS
samples/second: 33.68
zero-2
world size: 6
data parallel size: 6
model parallel size: 1
batch size per GPU: 32
params per GPU: 1.34 B
params of model = params per GPU * mp_size: 1.34 B
fwd MACs per GPU: 78.96 TMACs
fwd flops per GPU: 157.93 T
fwd flops of model = fwd flops per GPU * mp_size: 157.93 T
fwd latency: 1.2 s
fwd FLOPS per GPU = fwd flops per GPU / fwd latency: 131.35 TFLOPS
bwd latency: 3.97 s
bwd FLOPS per GPU = 2 * fwd flops per GPU / bwd latency: 79.56 TFLOPS
fwd+bwd FLOPS per GPU = 3 * fwd flops per GPU / (fwd+bwd latency): 91.6 TFLOPS
step latency: 512.26 ms
iter latency: 5.68 s
FLOPS per GPU = 3 * fwd flops per GPU / iter latency: 83.34 TFLOPS
samples/second: 33.77
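For what it's worth, the profiler's samples/second above appears to come from a single profiled iteration: world size × batch size per GPU ÷ iter latency roughly reproduces both values (a quick check below, using only the numbers reported above):

```python
# Cross-check: samples/second ~= world_size * batch_per_gpu / iter_latency for the profiled step.
world_size = 6
batch_per_gpu = 32
for label, iter_latency_s in [("zero-1", 5.70), ("zero-2", 5.68)]:
    print(label, round(world_size * batch_per_gpu / iter_latency_s, 2))
# zero-1 33.68, zero-2 33.8 -- in line with the 33.68 / 33.77 reported above
```

By contrast, the transformers train_samples_per_second divides the total number of training samples by the overall train_runtime, so data loading, logging, and the profiler's own hook overhead are averaged in over the whole run; that difference in what is being measured could account for at least part of the disagreement.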
Hi @ziliwang - sorry for getting to this so late, are you still hitting this with the latest DeepSpeed release?
Hi @ziliwang, just wondering, could you please share the change you made to enable the TFLOPS profile? According to the official docs, user code changes can be avoided only when training runs through the DeepSpeed runtime. However, I found that in /usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py, even though self.flops_profiler.start_profile(ignore_list=None) and self.flops_profiler.stop_profile() are indeed invoked in forward(), no flops profiler summary text is printed.
The root cause is that step(self, lr_kwargs=None), which calls self.flops_profiler.print_model_profile(), is never invoked, so no summary text is produced. Thanks for any suggestions or hints on how to trigger the step() function.
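In case it helps: when training actually runs through deepspeed.initialize (which the HF Trainer does once --deepspeed is passed), engine.step() should be invoked on every optimizer step, so enabling the flops_profiler block in the JSON config should be enough to get the summary at the configured profile_step. If that path does not fire in your setup, the profiler can also be driven manually; below is a minimal sketch using the public FlopsProfiler API, where model and batch stand for your own module and inputs:

```python
# Manually driving the flops profiler, independent of engine.step() (sketch).
import torch
from deepspeed.profiling.flops_profiler import FlopsProfiler

def profile_one_step(model: torch.nn.Module, batch: dict):
    prof = FlopsProfiler(model)
    prof.start_profile()                       # attach the counting hooks
    loss = model(**batch).loss                 # run the forward pass to be measured (HF-style model assumed)
    prof.stop_profile()                        # stop timing/counting
    prof.print_model_profile(profile_step=1)   # prints the summary text shown in this issue
    prof.end_profile()                         # remove the hooks
    return loss
```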