deep-learning-containers
[bug] test_cpu_nccl SM local test failure on PyTorch CPU image
Checklist
- [x] I've prepended issue tag with type of change: [bug]
- [ ] (If applicable) I've attached the script to reproduce the bug
- [ ] (If applicable) I've documented below the DLC image/dockerfile this relates to
- [ ] (If applicable) I've documented below the tests I've run on the DLC image
- [x] I'm using an existing DLC image listed here: https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html
- [ ] I've built my own container based off DLC (and I've attached the code used to build my own image)
Concise Description:
The test_cpu_nccl SM local test is reported as green by the pytest run, but the test actually fails with:
RuntimeError: Distributed package doesn't have NCCL built in
We do not install NCCL on the PyTorch CPU images. Also, the test is taken from sagemaker-pytorch-training-toolkit.
We need to confirm whether the test is valid in the first place.
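To check whether the PyTorch build in a given image includes NCCL, the `torch.distributed` availability helpers can be queried directly. The snippet below is a minimal sketch, not part of the DLC test suite: `probe_backends` is a hypothetical helper name, and it assumes it is run with the Python interpreter inside the image under test.

```python
def probe_backends():
    """Return (nccl_available, gloo_available) for the local PyTorch build.

    Hypothetical helper for illustration only. Returns (False, False) when
    torch is not installed or was built without distributed support.
    """
    try:
        import torch.distributed as dist
    except ImportError:
        return False, False
    # The backend-specific checks are only meaningful when the distributed
    # package itself was compiled in.
    if not dist.is_available():
        return False, False
    return bool(dist.is_nccl_available()), bool(dist.is_gloo_available())


if __name__ == "__main__":
    nccl, gloo = probe_backends()
    print("NCCL available:", nccl)
    print("Gloo available:", gloo)
```

If this reports that NCCL is unavailable on a CPU image, the RuntimeError above is the expected outcome of requesting the nccl backend, which is relevant to deciding whether the test is valid.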
DLC image/dockerfile:
All PyTorch CPU images