[bug] Signals don't get propagated to the training script
Checklist
- [x] I've prepended issue tag with type of change: [bug]
- [x] (If applicable) I've attached the script to reproduce the bug
- [x] (If applicable) I've documented below the DLC image/dockerfile this relates to
- [x] (If applicable) I've documented below the tests I've run on the DLC image
- [x] I'm using an existing DLC image listed here: https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html
Concise Description: Images (I've tested the PyTorch one) aren't built as suggested here: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-dockerfile.html. Specifically, the entrypoint is set to a bash script instead of directly to the Python code, so the Python code doesn't run as PID 1 and signals don't get propagated to it.
DLC image/dockerfile: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.6.0-cpu-py3
Current behavior: A handler registered for SIGTERM with `signal.signal(sigName, handler)` in a training script never gets called, e.g. when setting max_run to 60 and waiting long enough. Also, running `ps -elf` via `subprocess.run("ps -elf", shell=True)` from a training script shows the following:

```
4 S root  1  0 0 80 0 -  4941 - 07:57 ? 00:00:00 bash -m start_with_right_hostname.sh train
4 S root 15  1 2 80 0 - 56741 - 07:57 ? 00:00:00 /opt/conda/bin/python /opt/conda/bin/train
4 S root 26 15 0 80 0 -  7630 - 07:57 ? 00:00:00 /opt/conda/bin/python shell_launcher.py --SSM_CMD_LINE ps -elf
0 S root 27 26 0 80 0 -  1641 - 07:57 ? 00:00:00 /bin/sh -c ps -elf
0 R root 28 27 0 80 0 -  9041 - 07:57 ? 00:00:00 ps -elf
```

As you can see, the Python process isn't PID 1.
Expected behavior: Signals get propagated and the Python script runs as PID 1 (unless signals can be propagated some other way).
Additional context: Running the script below as a training script with a small value of max_run, or just stopping the training job from the console, demonstrates the problem.
```python
import signal
import sys
import time

def handler(signum, frame):
    print("Signal handler called with signal", signum)
    print(frame)
    sys.exit(0)

for sigName in [signal.SIGTERM, signal.SIGHUP, signal.SIGINT]:
    signal.signal(sigName, handler)

print("Waiting for a signal...")
while True:
    time.sleep(1)
```
Good evening, and thank you for your question. A few points:
- Are you using a BYO container or a pre-built PyTorch container?
  - BYO container: you build your own Dockerfile following the instructions here: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html. With this, you can define the command that runs when the container starts.
  - Pre-built PyTorch container: your training script ONLY needs to define the PyTorch-related training steps, for example: https://github.com/aws/sagemaker-python-sdk/blob/master/tests/data/pytorch_mnist/mnist.py, because by default the pre-built container runs this instead: https://github.com/aws/sagemaker-pytorch-training-toolkit/blob/master/src/sagemaker_pytorch_container/training.py#L112
- As for the script you noted, it is used in the PyTorch training container to update the hostname to algo-1, algo-2, ... so that NCCL and MPI know when there are multiple hosts, per the behavior of socket.gethostname (https://github.com/aws/sagemaker-pytorch-training-toolkit/blob/master/src/sagemaker_pytorch_container/training.py#L80). In the single-host case, the hostname will simply be algo-1.
I'm using a pre-built PyTorch image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.6.0-cpu-py3. As I understand it, registering for signals is the way to know the instance is about to be stopped, e.g. to save state beforehand. @SergTogul Are you saying this isn't supported with the pre-built images? Is there any other way?
I've just tested it again with the code below and max_run=60 and got that output:
```
Invoking script with the following command:

/opt/conda/bin/python sig_test.py

Listing processes
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
4 S root         1     0  0  80   0 -  4941 -      07:09 ?        00:00:00 bash -m start_with_right_hostname.sh train
4 S root        14     1  1  80   0 - 56767 -      07:09 ?        00:00:00 /opt/conda/bin/python /opt/conda/bin/train
0 S root        25    14  0  80   0 -  6869 -      07:10 ?        00:00:00 /opt/conda/bin/python sig_test.py
0 S root        26    25  0  80   0 -  1641 -      07:10 ?        00:00:00 /bin/sh -c ps -elf
0 R root        27    26  0  80   0 -  9041 -      07:10 ?        00:00:00 ps -elf
Waiting for a signal...

2020-10-20 07:13:55 Stopping - Stopping the training job
2020-10-20 07:16:16 Uploading - Uploading generated training model
2020-10-20 07:16:16 MaxRuntimeExceeded - Training job runtime exceeded MaxRuntimeInSeconds provided
Training seconds: 440
Billable seconds: 440
```
The Python code:

```python
import signal
import sys
import time
import subprocess

def handler(signum, frame):
    print("Signal handler called with signal", signum)
    print(frame)
    sys.exit(0)

for sigName in [signal.SIGTERM, signal.SIGHUP, signal.SIGINT]:
    signal.signal(sigName, handler)

print("Listing processes")
subprocess.run("ps -elf", shell=True)

print("Waiting for a signal...")
while True:
    time.sleep(1)
```
Is there any update on this? I'm running into the same issue.
This issue has been automatically marked as stale due to 60 days of inactivity. Please comment or remove the stale label to keep it open. It will be closed in 7 days if no further activity occurs.
Closing this issue after 7 additional days of inactivity since being marked stale.