sagemaker-training-toolkit
Train machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.
*Issue #, if available:* ORTE has lost communication with a remote daemon *Description of changes:* - Made changes to the non-leader nodes to sleep rather than wait on `orted` process....
**Describe the bug** The `custom_mpi_options` flag in the sagemaker training toolkit doesn't override the MPI command; instead, it just appends the flags. Logic: https://github.com/aws/sagemaker-training-toolkit/blob/c357433d6fdbc43a896b25bd126c46f689ddb73c/src/sagemaker_training/mpi.py#L185-L188 **To reproduce** ``` mpi_options = '-verbose -x...
Hey there! I'm having some trouble getting my SageMaker TensorFlow code to work after moving my script to another directory. Previously, I had the following directory structure: submit_notebook.ipynb train.py setup.py...
* add segfault error attribution * increased failure reason limit to 8k (WIP at SM side, no total limit on SMTT side) * limit error message part of failure reason...
The library cannot be used in Python 3.10 Here's the error when trying to run `train` in a SageMaker training image: ``` Traceback (most recent call last): File "/opt/venv/bin/train", line...
feature: shlex quote asyncio run *Issue #, if available:* #128 *Description of changes:* Add shlex.quote to asyncio run. Arguments are not escaped so JSON hyperparameters are not passed correctly. *Testing...
**Describe the bug** Hyperparameters with spaces get passed as separate command line tokens **To reproduce** Create a hyperparameter like "key" set it to "a b". Toolkit will invoke user script...
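The quoting problem in the two entries above can be illustrated with a minimal sketch. It assumes the command is rendered as a single shell string before being run; `build_command` is a hypothetical helper for illustration, not a function from the toolkit. Only the standard-library `shlex` module is used.

```python
import shlex


def build_command(script, hyperparameters):
    """Hypothetical helper: render hyperparameters as a shell-safe command string."""
    args = ["python", script]
    for key, value in hyperparameters.items():
        args.extend([f"--{key}", str(value)])
    # Quote every token so values containing spaces or JSON survive shell parsing.
    return " ".join(shlex.quote(arg) for arg in args)


# Unquoted: the value "a b" is split into two separate tokens.
naive = "python train.py --key a b"

# Quoted: the value stays a single token after shell-style parsing.
quoted = build_command("train.py", {"key": "a b"})

print(shlex.split(naive))
print(shlex.split(quoted))
```

Parsing the two strings with `shlex.split` shows the difference: the unquoted form yields five tokens, while the quoted form keeps `a b` as one argument.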
feature: Pass SIGTERM to training subprocess fix: #125 *Issue #, if available:* 125 *Description of changes:* Install SIGTERM handler and send_signal to subprocess *Testing done:* "Stop" button in console triggers...
**Describe the bug** SIGTERM from StopTrainingJob doesn't appear to be passed to the training subprocess. **To reproduce** Add a SIGTERM handler to a training script, start a training job, then...
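The signal-forwarding idea in the two entries above (install a SIGTERM handler, then `send_signal` to the subprocess) can be sketched as follows. This is an illustration under assumptions, not the toolkit's code: `run_with_sigterm_forwarding` is a hypothetical name, the sketch uses a blocking `subprocess.Popen` rather than asyncio, and it targets POSIX systems.

```python
import signal
import subprocess
import sys


def run_with_sigterm_forwarding(cmd):
    """Hypothetical wrapper: run cmd and relay SIGTERM from this process to it."""
    proc = subprocess.Popen(cmd)

    def forward(signum, frame):
        # Pass the signal on so the child's own handler (if any) can clean up.
        proc.send_signal(signum)

    previous = signal.signal(signal.SIGTERM, forward)
    try:
        # Returns the child's exit code; negative means it was killed by a signal.
        return proc.wait()
    finally:
        # Restore whatever handler was installed before this call.
        signal.signal(signal.SIGTERM, previous)
```

A call such as `run_with_sigterm_forwarding([sys.executable, "train.py"])` would then let a SIGTERM handler inside `train.py` run when the parent process is asked to stop.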
*Issue #, if available:* *Description of changes:* - Add native pytorch DDP support - Add support for py39 - Connected PRs https://github.com/aws/sagemaker-python-sdk/pull/2705 https://github.com/aws/sagemaker-pytorch-training-toolkit/pull/231 - Rename `NCCL_MIN_NRINGS` to `NCCL_MIN_NCHANNELS` - Make...