Unable to understand throughput calculation
Within run_squad.py we have two cases for the throughput calculation, based on args.max_steps:
https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT/run_squad.py#L1198
I just want to understand what we are trying to do in the else case, because I have tried 3 experiments:
- with max_steps as -1
- with max_steps as 1000
- with max_steps as 10000, keeping the number of nodes at 14 and the number of GPUs per node at 1, with a batch size per GPU of 64

I have seen a large difference between these three configurations. Can someone help me understand this? Thanks!
The amount of training is determined by the parameters max_steps and num_train_epochs, whichever limit is reached first. The former defaults to -1. The throughput computation accounts for these parameters in the if/else clause.
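Roughly, the logic of the two branches is the following (a simplified sketch, not a verbatim excerpt from run_squad.py; variable names such as train_features, time_to_train, world_size and grad_accum_steps are approximations):

```python
# Simplified sketch of the training_sequences_per_second computation.
def training_sequences_per_second(args, train_features, time_to_train,
                                  world_size, grad_accum_steps):
    if args.max_steps == -1:
        # Epoch-limited run: count sequences from the dataset size.
        num_sequences = len(train_features) * args.num_train_epochs
    else:
        # Step-limited run: count sequences from the step budget.
        num_sequences = (args.train_batch_size * grad_accum_steps
                         * args.max_steps * world_size)
    return num_sequences / time_to_train
```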
Your throughput should reach a steady state after running a minimum number of steps (> 100). If max_steps is higher than the number of steps determined by num_train_epochs, your throughput will be artificially inflated by the max_steps you set, which is a bug. I can fix this soon; feel free to send in a PR.
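To make the inflation concrete, here is a back-of-the-envelope example with your configuration (14 nodes x 1 GPU, batch size 64 per GPU); the feature count and epoch count are assumed purely for illustration:

```python
per_gpu_batch = 64
world_size = 14                      # 14 nodes x 1 GPU each
num_train_features = 88_000          # assumed number of training features
num_train_epochs = 2                 # assumed epoch count

steps_per_epoch = num_train_features // (per_gpu_batch * world_size)  # 98
actual_steps = steps_per_epoch * num_train_epochs                     # 196

# Sequences actually processed (training stops at the epoch limit,
# long before max_steps=10000 is reached):
actual_sequences = actual_steps * per_gpu_batch * world_size           # 175,616

# Sequences the else branch credits when --max_steps 10000 is passed:
reported_sequences = 10_000 * per_gpu_batch * world_size               # 8,960,000

# Both are divided by the same wall-clock time, so the logged throughput
# is inflated by roughly this factor (~51x with these assumed numbers):
print(reported_sequences / actual_sequences)
```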
Need to add args.max_steps = min(args.max_steps, len(train_features) * args.num_train_epochs) before line 1203 to fix the discrepancy you are seeing.
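For clarity, the proposed cap would go just before the throughput is computed, roughly like this (placement sketched against the simplified code above, not against the exact lines of run_squad.py):

```python
# Cap max_steps by the epoch-determined amount of work before it is used
# in the throughput computation (proposed fix; surrounding code paraphrased).
args.max_steps = min(args.max_steps, len(train_features) * args.num_train_epochs)
throughput = training_sequences_per_second(args, train_features, time_to_train,
                                           world_size, grad_accum_steps)
```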
Thanks @sharathts, will raise a PR with the above change.