awsome-distributed-training
awsome-distributed-training copied to clipboard
install_dcgm_exporter.sh fails intermitendly on hyperpod slurm using u22.04 DLAMI
docker: Error response from daemon: Get "https://nvcr.io/v2/": dial tcp: lookup [nvcr.io](http://nvcr.io/) on 127.0.0.53:53: server misbehaving
docker: Error response from daemon: Get "https://nvcr.io/v2/": dial tcp: lookup [nvcr.io](http://nvcr.io/) on 127.0.0.53:53: server misbehaving
Suggested customer clone the latest version of LCS with exponential backoff:
https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils/install_dcgm_exporter.sh
resolved with pr 683