Matthew Nightingale
### Ask your question
Hi, I am hoping to understand the difference between the version reported by `dcgmi -v` and the version of `dcgm exporter` that should be used. I want to...
When running the script [3.test_cases/10.FSDP/1.distributed-training.sbatch](https://github.com/aws-samples/awsome-distributed-training/blob/main/3.test_cases/10.FSDP/1.distributed-training.sbatch) on 2 p5 nodes, the job fails at the validation step after 500 batches. [slurm-47.log](https://github.com/aws-samples/awsome-distributed-training/files/15371088/slurm-47.log) ``` 0: OSError: [Errno 12] Cannot allocate memory ``` **Configuration:**...
``` 7: [rank80]: urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10) ``` Running the FSDP example on 16 p5 nodes. The example worked with 8 nodes.
Issue not encountered. Proposed solution: add exponential backoff to the `install_docker.sh` script. [install_docker.sh.txt](https://github.com/user-attachments/files/20232931/install_docker.sh.txt)
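A minimal sketch of the exponential backoff idea, written in Python for illustration rather than as a patch to `install_docker.sh` (which is a shell script). The names `retry_with_backoff` and `pull_image`, and the default attempt count and delays, are assumptions made for this example and are not taken from the attached script.

```python
import random
import subprocess
import time


def retry_with_backoff(action, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep, jitter=random.uniform):
    """Call action() until it succeeds or max_attempts is exhausted.

    The wait doubles after each failure (base_delay, 2x, 4x, ...), is capped
    at max_delay, and gets a random jitter of up to half the delay added so
    that many nodes do not retry at the same instant.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + jitter(0, delay / 2))


def pull_image(image):
    """Hypothetical network-dependent step; fails on a non-zero exit code."""
    subprocess.run(["docker", "pull", image], check=True)
```

A call such as `retry_with_backoff(lambda: pull_image(image_name))` would then retry a transient DNS or registry failure a few times before giving up, where `image_name` is whatever image the script needs.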
``` docker: Error response from daemon: Get "https://nvcr.io/v2/": dial tcp: lookup nvcr.io on 127.0.0.53:53: server misbehaving docker: Error response from daemon: Get "https://nvcr.io/v2/": dial tcp: lookup nvcr.io on 127.0.0.53:53: server...
``` curl: (6) Could not resolve host: github.com "Traceback (most recent call last): File "/tmp/air-voiceforce-hyperpod-lifecycle-2/src/lifecycle_script.py", line 211, in main(args) File "/tmp/air-voiceforce-hyperpod-lifecycle-2/src/lifecycle_script.py", line 184, in main ExecuteBashScript("./utils/install_enroot_pyxis.sh").run(node_type) File "/tmp/air-voiceforce-hyperpod-lifecycle-2/src/lifecycle_script.py", line 31,...
The EFA cheatsheet should be updated with information about p5e and p5en, similar to [this section on p5](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/efa-cheatsheet.md#2-a-word-on-p548xlarge-instances): https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/efa-cheatsheet.md
When creating HyperPod clusters with 2 ml.g5.8xlarge instances, we see errors when trying to run containers with Pyxis + Enroot. ``` srun: unrecognized option '--container-image' ``` CloudWatch does not show...
Let's please consider deprecating the [update_neuron_sdk.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils/update_neuron_sdk.sh) LCS, given that the libraries are out of date, or consider updating to the most recent libraries: I would prefer deprecation to simplify LCS unless we...
# Investigate ENROOT_RUNTIME_PATH and data-root mismatch
## Description
We need to clarify whether there is still a mismatch between ENROOT_RUNTIME_PATH and data-root; based on the current LCS, install_docker.sh and install_enroot_pyxis.sh are using...