liuyq47
Hi, thanks for adding the NVIDIA dataset support. After trying it out, I sometimes see spikes in step time during training, like the one I've shown...
I was comparing the times to the steps above and below the highlighted section. Normally the backward pass takes around ~400ms and the backward_allreduce step takes around ~229ms, but this...
I'm using 8 DGX-1 nodes (64 V100-SXM2 GPUs), PyTorch 1.5.0, and CUDA 10.1. [deepspeed_bsz64k_lamb_config_seq128.json.txt](https://github.com/microsoft/DeepSpeedExamples/files/4985009/deepspeed_bsz64k_lamb_config_seq128.json.txt) [bert_large_lamb_nvidia_data.json.txt](https://github.com/microsoft/DeepSpeedExamples/files/4985006/bert_large_lamb_nvidia_data.json.txt)
I've seen the spikes too with gradient accumulation (8 nodes with a batch size of 64 and gradient accumulation of 16) and with a higher number of nodes (64 DGX-1). Normal all-reduce time is...
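For anyone trying to narrow down which phase the spikes land in, here is a minimal stdlib-only sketch of per-phase timing with spike detection against a rolling median. The class name `PhaseTimer` and the window/threshold values are my own choices, not part of DeepSpeed; note that wall-clock timing around GPU work is only meaningful after a device sync (e.g. `torch.cuda.synchronize()`), which is omitted here to keep the sketch dependency-free.

```python
import statistics
import time
from collections import defaultdict, deque
from contextlib import contextmanager


class PhaseTimer:
    """Hypothetical helper: record wall-clock time per training phase
    (e.g. "backward", "backward_allreduce") and flag steps that exceed
    a multiple of the rolling median. Not a DeepSpeed API."""

    def __init__(self, window=50, spike_factor=2.0):
        self.spike_factor = spike_factor
        # Keep only the last `window` samples per phase.
        self.history = defaultdict(lambda: deque(maxlen=window))

    @contextmanager
    def phase(self, name):
        # Caveat: call torch.cuda.synchronize() before/after the timed
        # region when measuring CUDA kernels, or the times are misleading.
        start = time.perf_counter()
        yield
        self.record(name, time.perf_counter() - start)

    def record(self, name, elapsed):
        self.history[name].append(elapsed)

    def is_spike(self, name, elapsed):
        hist = self.history[name]
        if len(hist) < 5:  # too little history to judge
            return False
        return elapsed > self.spike_factor * statistics.median(hist)
```

In a training loop you would wrap the backward pass and the all-reduce in `timer.phase("backward")` / `timer.phase("backward_allreduce")` blocks and log the step index whenever `is_spike` fires, which makes it easy to correlate spikes with gradient-accumulation boundaries.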