deep-learning-containers icon indicating copy to clipboard operation
deep-learning-containers copied to clipboard

[bug] TF imagenet performance script does not respect timeout

Open tejaschumbalkar opened this issue 3 years ago • 0 comments

Checklist

  • [ ] I've prepended issue tag with type of change: [bug]
  • [ ] (If applicable) I've attached the script to reproduce the bug
  • [ ] (If applicable) I've documented below the DLC image/dockerfile this relates to
  • [ ] (If applicable) I've documented below the tests I've run on the DLC image
  • [ ] I'm using an existing DLC image listed here: https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html
  • [ ] I've built my own container based off DLC (and I've attached the code used to build my own image)

Concise Description: The TF imagenet performance script run_tensorflow_training_performance_gpu_imagenet does not respect the timeout

timeout "$TRAINING_TIME" python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py .......

Even if the time goes over the supplied $TRAINING_TIME, the script is not able to exit. This leads to timeout on the test build in PR context.

Expected behavior:

The TF imagenet performance script run_tensorflow_training_performance_gpu_imagenet respects the timeout.

tejaschumbalkar avatar May 24 '22 20:05 tejaschumbalkar