sagemaker-training-toolkit

Pass SIGTERM to training script to stop training

Open bstriner opened this issue 3 years ago • 0 comments

Describe the bug
The SIGTERM sent by StopTrainingJob does not appear to be passed to the training subprocess.

To reproduce
Add a SIGTERM handler to a training script, start a training job, then click "Stop". The signal handler does not fire.
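
A handler of the kind described in the reproduction steps might look like the sketch below. It is an illustration only, not the script from the original report: `install_sigterm_handler`, `train`, and the step counts are hypothetical names and values chosen for the example.

```python
import signal
import threading
import time

# Set by the handler so the training loop can stop at a safe point.
stop_event = threading.Event()


def handle_sigterm(signum, frame):
    """Log the signal and ask the training loop to stop."""
    print(f"Received signal {signum}, stopping training", flush=True)
    stop_event.set()


def install_sigterm_handler():
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def train(max_steps, step_seconds):
    """Hypothetical training loop; returns the number of completed steps."""
    completed = 0
    while completed < max_steps and not stop_event.is_set():
        time.sleep(step_seconds)  # stand-in for one training step
        completed += 1
    return completed


if __name__ == "__main__":
    install_sigterm_handler()
    # Short run for illustration; a real job would run far longer.
    steps = train(max_steps=50, step_seconds=0.02)
    print(f"Completed {steps} steps", flush=True)
```

If the signal reached the script, the "Received signal" line would show up in the job logs after clicking "Stop". In the behavior reported here, that line never appears.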

Expected behavior
The signal handler should fire when StopTrainingJob is called.
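
One way to picture the expected behavior is a parent process that relays SIGTERM to the training subprocess it launched. The sketch below uses only the Python standard library and assumes a POSIX system. It is not the toolkit's actual code, and `run_and_forward_sigterm` is a hypothetical name.

```python
import signal
import subprocess


def run_and_forward_sigterm(cmd):
    """Run cmd as a child process, relay SIGTERM to it, return its exit code.

    Must be called from the main thread, because signal.signal() only
    works there.
    """
    child = subprocess.Popen(cmd)

    def forward(signum, frame):
        # Popen.send_signal does nothing if the child has already exited.
        child.send_signal(signum)

    previous = signal.signal(signal.SIGTERM, forward)
    try:
        # wait() resumes after the handler has run and returns once the
        # child exits.
        return child.wait()
    finally:
        # Restore whatever SIGTERM handler was installed before.
        signal.signal(signal.SIGTERM, previous)


# Example call (train.py is a hypothetical script name):
#   exit_code = run_and_forward_sigterm([sys.executable, "train.py"])
```

With a relay like this in place, a SIGTERM handler registered in the child script would run when the parent receives the signal, which is the behavior this issue asks for.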

bstriner · May 19 '22 20:05