awsome-distributed-training

install_dcgm_exporter.sh fails intermittently on HyperPod Slurm using the Ubuntu 22.04 DLAMI

Open nghtm opened this issue 11 months ago • 1 comments

docker: Error response from daemon: Get "https://nvcr.io/v2/": dial tcp: lookup nvcr.io on 127.0.0.53:53: server misbehaving

nghtm, May 15 '25 18:05

Suggested that the customer clone the latest version of the lifecycle scripts (LCS), where install_dcgm_exporter.sh retries with exponential backoff:

https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils/install_dcgm_exporter.sh
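For readers unfamiliar with the pattern, here is a minimal sketch of retrying a command with exponential backoff, which is how a transient DNS failure like the one above can be absorbed. It is not the code from the linked script: the function name `retry_with_backoff`, its arguments, and the attempt and delay values are all hypothetical, and the image tag in the usage comment is a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration only; not taken from install_dcgm_exporter.sh.
# Usage: retry_with_backoff <max_attempts> <initial_delay_seconds> <command> [args...]
retry_with_backoff() {
  local max_attempts="$1"
  local delay="$2"
  shift 2
  local attempt=1

  while true; do
    # Run the command; return immediately on success.
    if "$@"; then
      return 0
    fi

    # Give up once the attempt budget is exhausted.
    if (( attempt >= max_attempts )); then
      echo "Command failed after ${attempt} attempts: $*" >&2
      return 1
    fi

    echo "Attempt ${attempt} failed; retrying in ${delay}s..." >&2
    sleep "${delay}"

    # Double the wait before the next attempt.
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}

# Example usage (assumed values; replace <tag> with the image tag you need):
# retry_with_backoff 5 2 docker pull "nvcr.io/nvidia/k8s/dcgm-exporter:<tag>"
```

With these assumed values, the pull would be attempted up to five times, waiting 2, 4, 8, and 16 seconds between attempts, which gives the local resolver time to recover during node bring-up.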

nghtm, May 15 '25 18:05

Resolved with PR 683.

nghtm, May 21 '25 14:05