deep-learning-containers

[feature-request] EKS test failures due to timeouts should report on cluster state and provide more info

Open kace opened this issue 3 years ago • 0 comments

Checklist

  • [x] I've prepended issue tag with type of change: [feature]
  • [ ] (If applicable) I've documented below the DLC image/dockerfile this relates to
  • [ ] (If applicable) I've documented the tests I've run on the DLC image
  • [ ] I'm using an existing DLC image listed here: https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html
  • [ ] I've built my own container based off DLC (and I've attached the code used to build my own image)

Concise Description: Timeout failures in EKS tests are ambiguous and always require manual investigation, even when the root cause is a basic error such as a version mismatch or an invalid resource state. To save debugging time, EKS tests that fail on a timeout should report the cluster state along with the failure.
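
To make the request concrete, here is a minimal sketch of what such reporting could look like. It is not taken from the DLC test code: `collect_cluster_state`, `wait_for`, and `EKSTestTimeout` are hypothetical names, and the set of diagnostics is an assumption. The only external calls are standard `kubectl` subcommands (`version`, `get nodes`, `get pods`, `get events`), and the sketch assumes `kubectl` is already configured for the test cluster.

```python
import subprocess
import time
from typing import Callable, Dict, List

# Assumed set of diagnostics; adjust to whatever the tests actually need.
DIAGNOSTIC_COMMANDS: Dict[str, List[str]] = {
    "versions": ["kubectl", "version"],
    "nodes": ["kubectl", "get", "nodes", "-o", "wide"],
    "pods": ["kubectl", "get", "pods", "--all-namespaces", "-o", "wide"],
    "events": ["kubectl", "get", "events", "--all-namespaces",
               "--sort-by=.lastTimestamp"],
}


class EKSTestTimeout(Exception):
    """Hypothetical exception whose message carries a cluster state report."""


def run_command(cmd: List[str]) -> str:
    """Run one diagnostic command and return its combined output."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Diagnostics must never mask the original failure.
        return f"<failed to run: {exc}>"
    return result.stdout + result.stderr


def collect_cluster_state(runner: Callable[[List[str]], str] = run_command) -> str:
    """Build a plain-text report with one section per diagnostic command."""
    sections = []
    for name, cmd in DIAGNOSTIC_COMMANDS.items():
        sections.append(f"== {name}: {' '.join(cmd)} ==\n{runner(cmd)}")
    return "\n".join(sections)


def wait_for(condition: Callable[[], bool], timeout_s: float, poll_s: float = 5.0,
             runner: Callable[[List[str]], str] = run_command) -> None:
    """Poll `condition` until it is true; on timeout, raise with cluster state."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(poll_s)
    raise EKSTestTimeout(
        f"condition not met within {timeout_s}s\n{collect_cluster_state(runner)}"
    )
```

A test would wrap its existing polling loop in something like `wait_for(job_is_complete, timeout_s=600)`, where `job_is_complete` is a placeholder for the test's own check. On timeout, the failure message would then include the client and server versions, node status, pod status, and recent events, which should be enough to spot version mismatches and invalid resource states without a separate debugging session.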

kace · Sep 01 '22 22:09