
Unable to understand throughput calculation

Open Druva24 opened this issue 2 years ago • 3 comments

Within run_squad.py there are two cases based on args.max_steps: https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT/run_squad.py#L1198. I want to understand what the else case is doing, because I have run three experiments:

  1. max_steps = -1
  2. max_steps = 1000
  3. max_steps = 10000

all with 14 nodes, 1 GPU per node, and a per-GPU batch size of 64. I see a large throughput difference between these three configurations. Can someone help me understand this? Thanks!

Druva24 avatar Aug 10 '23 21:08 Druva24

The amount of training is determined by the parameters max_steps and num_train_epochs, whichever yields fewer steps. max_steps defaults to -1. The throughput computation accounts for these parameters in the if/else clause.

Your throughput should reach a steady state after running a minimum number of steps (> 100). If max_steps is higher than the step count implied by num_train_epochs, your reported throughput will be artificially inflated by the max_steps value, which is a bug. I can fix this soon; feel free to send in a PR.
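To illustrate the inflation: if the reported throughput is computed as (steps × batch size) / elapsed time, but the step count fed into that formula comes from args.max_steps rather than the steps actually executed, the number grows with max_steps even though wall time does not. The sketch below uses hypothetical names and made-up values; it is not the code from run_squad.py.

```python
# Hypothetical sketch of how an uncapped max_steps inflates throughput.
# `steps_in_formula` is whatever step count the script plugs into the
# throughput formula; elapsed wall time stays the same either way.

def reported_throughput(steps_in_formula, batch_size, elapsed_seconds):
    """Sequences per second as the script would report it."""
    return steps_in_formula * batch_size / elapsed_seconds

actual_steps = 196    # steps the epoch-limited run actually executed (made up)
elapsed = 100.0       # seconds of wall time (made up)

honest = reported_throughput(actual_steps, 64, elapsed)   # uses real step count
inflated = reported_throughput(10000, 64, elapsed)        # uses max_steps=10000
```

With the same wall time, the second call reports a throughput roughly 50x higher, which is the discrepancy described above.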

sharathts avatar Aug 12 '23 01:08 sharathts

Adding `args.max_steps = min(args.max_steps, len(train_features) * args.num_train_epochs)` before line 1203 should fix the discrepancy you are seeing.
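One way to sketch the capping idea, counting in optimizer steps rather than raw examples (all names and the example numbers below are hypothetical, not taken from run_squad.py):

```python
def effective_max_steps(max_steps, num_train_features, num_train_epochs,
                        batch_size_per_gpu, num_gpus):
    """Cap max_steps by the steps the dataset can actually supply.

    A global step consumes batch_size_per_gpu * num_gpus examples, so the
    dataset supports only steps_per_epoch * num_train_epochs steps; a larger
    max_steps can never be reached and would skew any steps-based metric.
    """
    steps_per_epoch = num_train_features // (batch_size_per_gpu * num_gpus)
    total_steps = steps_per_epoch * num_train_epochs
    if max_steps <= 0:          # -1 means "train for num_train_epochs"
        return total_steps
    return min(max_steps, total_steps)

# Roughly SQuAD-sized: 88000 features, 2 epochs, batch 64 on 14 GPUs
# gives 98 steps/epoch, so 196 steps total; max_steps=10000 is capped.
capped = effective_max_steps(10000, 88000, 2, 64, 14)
```

Whether the cap is expressed in steps (as here) or in examples (as in the one-liner above) depends on the units the throughput formula uses; the point is the same: never let max_steps exceed what the data can supply.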

sharathts avatar Aug 12 '23 02:08 sharathts

Thanks @sharathts, will raise a PR with the above change.

Druva24 avatar Aug 17 '23 02:08 Druva24