deep-learning-containers

[bug] Signals don't get propagated to the training script

Open shiftan opened this issue 5 years ago • 3 comments

Checklist

  • [V] I've prepended issue tag with type of change: [bug]
  • [x] (If applicable) I've attached the script to reproduce the bug
  • [V ] (If applicable) I've documented below the DLC image/dockerfile this relates to
  • [x] (If applicable) I've documented below the tests I've run on the DLC image
  • [V ] I'm using an existing DLC image listed here: https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html

Concise Description: The images (I've tested the PyTorch one) aren't built as suggested here: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-dockerfile.html. Specifically, the entrypoint is set to a bash script instead of directly to the Python code, so the Python code doesn't run as PID 1 and signals don't get propagated to it.

DLC image/dockerfile: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.6.0-cpu-py3

Current behavior: A handler registered for SIGTERM with "signal.signal(sigName, handler)" in a training script doesn't get called, e.g. when setting max_run to 60 and waiting long enough. Also, running "ps -elf" via subprocess.run("ps -elf", shell=True) from a training script shows the following:

4 S root 1 0 0 80 0 - 4941 - 07:57 ? 00:00:00 bash -m start_with_right_hostname.sh train
4 S root 15 1 2 80 0 - 56741 - 07:57 ? 00:00:00 /opt/conda/bin/python /opt/conda/bin/train
4 S root 26 15 0 80 0 - 7630 - 07:57 ? 00:00:00 /opt/conda/bin/python shell_launcher.py --SSM_CMD_LINE ps -elf
0 S root 27 26 0 80 0 - 1641 - 07:57 ? 00:00:00 /bin/sh -c ps -elf
0 R root 28 27 0 80 0 - 9041 - 07:57 ? 00:00:00 ps -elf

As you can see, the Python process isn't PID 1.

Expected behavior: Signals get propagated to the training script and the Python script runs as PID 1 (unless signals can be propagated some other way).
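One way signals could be propagated without the training script being PID 1 is for the parent process to relay them to its child. The following is only a minimal sketch of that idea, not what the DLC images or the SageMaker training toolkit actually do; `launch_with_forwarding` and `FORWARDED` are hypothetical names introduced for illustration.

```python
import signal
import subprocess
import sys

# Signals the parent relays to the child (same set as the repro script).
FORWARDED = (signal.SIGTERM, signal.SIGHUP, signal.SIGINT)


def launch_with_forwarding(cmd, **popen_kwargs):
    """Start cmd as a child process and relay FORWARDED signals to it.

    Hypothetical helper: the parent stays alive and passes each signal
    on, so the child's own handlers get a chance to run.
    """
    child = subprocess.Popen(cmd, **popen_kwargs)

    def forward(signum, frame):
        # Popen.send_signal does nothing if the child was already reaped.
        child.send_signal(signum)

    for sig in FORWARDED:
        signal.signal(sig, forward)
    return child
```

A launcher built this way would call `launch_with_forwarding([sys.executable, "sig_test.py"])`, then `child.wait()`, and exit with the child's return code. Note that a signal arriving between `Popen` and handler registration would not be relayed in this sketch.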

Additional context: Using the script below as the training script, either running with a small value of max_run or stopping the training job from the console reproduces the problem.

import signal
import sys
import time

def handler(signum, frame):
    print("Signal handler called with signal", signum)
    print(frame)
    sys.exit(0)

for sigName in [signal.SIGTERM, signal.SIGHUP, signal.SIGINT]:
    signal.signal(sigName, handler)

print("Waiting for a signal...")
while True:
    time.sleep(1)

shiftan avatar Sep 27 '20 08:09 shiftan

Good evening, and thank you for your question. Here is my answer:

  1. Are you using BYO container or a pre-built PyTorch container?
  • BYO container: you should build your own Dockerfile and follow the instructions here: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html. With this approach, you can define the command you want to run when the container starts.

  • pre-built PyTorch container: your training script ONLY needs to define the PyTorch-related training steps, for example: https://github.com/aws/sagemaker-python-sdk/blob/master/tests/data/pytorch_mnist/mnist.py, because by default the pre-built container runs this (https://github.com/aws/sagemaker-pytorch-training-toolkit/blob/master/src/sagemaker_pytorch_container/training.py#L112) rather than a command you define.

  2. As for the script you noted: it is used in the PyTorch training container to set the hostname to algo-1, algo-2, ... instead of the AWS one, so that NCCL and MPI know about each host when there are multiple hosts; this is the behavior of socket.gethostname (https://github.com/aws/sagemaker-pytorch-training-toolkit/blob/master/src/sagemaker_pytorch_container/training.py#L80). In the single-host case, the hostname will be algo-1.

SergTogul avatar Oct 20 '20 00:10 SergTogul

I'm using a pre-built PyTorch image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.6.0-cpu-py3
As I understand it, registering for signals is the way to know the instance is about to be stopped, e.g. to save state beforehand. @SergTogul Are you saying this isn't supported with the pre-built images? Is there any other way?

I've just tested it again with the code below and max_run=60, and got this output:


Invoking script with the following command:

/opt/conda/bin/python sig_test.py


Listing processes
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
4 S root         1     0  0  80   0 -  4941 -      07:09 ?        00:00:00 bash -m start_with_right_hostname.sh train
4 S root        14     1  1  80   0 - 56767 -      07:09 ?        00:00:00 /opt/conda/bin/python /opt/conda/bin/train
0 S root        25    14  0  80   0 -  6869 -      07:10 ?        00:00:00 /opt/conda/bin/python sig_test.py
0 S root        26    25  0  80   0 -  1641 -      07:10 ?        00:00:00 /bin/sh -c ps -elf
0 R root        27    26  0  80   0 -  9041 -      07:10 ?        00:00:00 ps -elf
Waiting for a signal...

2020-10-20 07:13:55 Stopping - Stopping the training job
2020-10-20 07:16:16 Uploading - Uploading generated training model
2020-10-20 07:16:16 MaxRuntimeExceeded - Training job runtime exceeded MaxRuntimeInSeconds provided
Training seconds: 440
Billable seconds: 440
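The process listing above can also be checked mechanically. The sketch below parses that `ps -elf` output and walks from the training script (PID 25) up to PID 1; `parse_ps` and `ancestry` are hypothetical helper names, and the column positions assume the `ps -elf` layout shown in the listing.

```python
PS_OUTPUT = """\
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
4 S root         1     0  0  80   0 -  4941 -      07:09 ?        00:00:00 bash -m start_with_right_hostname.sh train
4 S root        14     1  1  80   0 - 56767 -      07:09 ?        00:00:00 /opt/conda/bin/python /opt/conda/bin/train
0 S root        25    14  0  80   0 -  6869 -      07:10 ?        00:00:00 /opt/conda/bin/python sig_test.py
0 S root        26    25  0  80   0 -  1641 -      07:10 ?        00:00:00 /bin/sh -c ps -elf
0 R root        27    26  0  80   0 -  9041 -      07:10 ?        00:00:00 ps -elf
"""


def parse_ps(text):
    """Map PID -> (PPID, command) from `ps -elf` output (header skipped)."""
    procs = {}
    for line in text.strip().splitlines()[1:]:
        # 15 columns; the last one (CMD) may itself contain spaces.
        fields = line.split(None, 14)
        procs[int(fields[3])] = (int(fields[4]), fields[14])
    return procs


def ancestry(procs, pid):
    """Return [(pid, command), ...] from pid up to the topmost listed parent."""
    chain = []
    while pid in procs:
        ppid, cmd = procs[pid]
        chain.append((pid, cmd))
        pid = ppid
    return chain


for pid, cmd in ancestry(parse_ps(PS_OUTPUT), 25):
    print(pid, cmd)
```

For this listing the chain is 25 (sig_test.py), then 14 (/opt/conda/bin/train), then 1 (bash), so the training script sits two levels below the bash entrypoint that runs as PID 1.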

The python code:

import signal
import sys
import time
import subprocess

def handler(signum, frame):
    print("Signal handler called with signal", signum)
    print(frame)
    sys.exit(0)

for sigName in [signal.SIGTERM, signal.SIGHUP, signal.SIGINT]:
    signal.signal(sigName, handler)

print("Listing processes")
subprocess.run("ps -elf", shell=True)
print("Waiting for a signal...")
while True:
    time.sleep(1)
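For the "save the state before stopping" use case, a common pattern is to have the handler only set a flag and let the training loop write its checkpoint at the next step boundary, rather than calling sys.exit inside the handler. A minimal sketch, assuming the signal actually reaches the script; `StopFlag`, `train`, `do_step`, and `save_state` are hypothetical names and not part of any SageMaker API.

```python
import signal


class StopFlag:
    """Remembers that a termination signal arrived (hypothetical helper)."""

    def __init__(self, signals=(signal.SIGTERM, signal.SIGINT)):
        self.signum = None
        for sig in signals:
            signal.signal(sig, self._handle)

    def _handle(self, signum, frame):
        # Keep the handler tiny: just record which signal was received.
        self.signum = signum

    def __bool__(self):
        return self.signum is not None


def train(num_steps, do_step, stop, save_state):
    """Run up to num_steps steps; checkpoint and stop early once flagged.

    Returns the number of the last step that was executed.
    """
    step = 0
    for step in range(1, num_steps + 1):
        do_step(step)
        if stop:
            save_state(step)  # persist progress before exiting
            break
    return step
```

Here `do_step` stands in for one unit of training work and `save_state` for whatever checkpointing the job uses; both are supplied by the caller.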

shiftan avatar Oct 20 '20 07:10 shiftan

Is there any update on this? I am running into the same issue

theo-rogers avatar Aug 26 '24 19:08 theo-rogers

This issue has been automatically marked as stale due to 60 days of inactivity. Please comment or remove the stale label to keep it open. It will be closed in 7 days if no further activity occurs.

github-actions[bot] avatar Nov 09 '25 19:11 github-actions[bot]

Closing this issue after 7 additional days of inactivity since being marked stale.

github-actions[bot] avatar Nov 16 '25 19:11 github-actions[bot]