sagemaker-training-toolkit
Pass SIGTERM to training script to stop training
Describe the bug
SIGTERM sent by StopTrainingJob doesn't appear to be passed to the training subprocess.
To reproduce
1. Add a SIGTERM handler to a training script.
2. Start a training job.
3. Click "Stop".
The signal handler does not fire.
Expected behavior
The signal handler should fire when StopTrainingJob is called.
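The expected behavior amounts to the parent process relaying SIGTERM to the training subprocess. The toolkit's own launch code is not shown in this report, so the following is only a hypothetical sketch, assuming the training script runs as a direct child started with `subprocess.Popen`; `run_with_sigterm_forwarding` is an illustrative name, not a toolkit API. The forwarding handler is installed before the child starts so a SIGTERM arriving during startup is not lost, and the previous handler is restored afterwards.

```python
import signal
import subprocess


def run_with_sigterm_forwarding(cmd):
    """Run cmd as a child process, relaying SIGTERM to it; return its exit code."""
    child = None

    def forward(signum, frame):
        # Pass the signal on so the child's own SIGTERM handler can run.
        if child is not None and child.poll() is None:
            child.send_signal(signum)

    previous = signal.signal(signal.SIGTERM, forward)
    try:
        child = subprocess.Popen(cmd)
        # wait() resumes after the handler runs (PEP 475), so we still
        # collect the child's real exit code once it shuts down.
        return child.wait()
    finally:
        signal.signal(signal.SIGTERM, previous)
```

A launcher would call it with something like `[sys.executable, "train.py"]`, where `train.py` is a hypothetical entry point. Sending SIGTERM to the launcher then reaches the script's handler instead of being dropped.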