Running the AdaptDL training process as something other than Process 1 causes checkpointing to fail.

Open rmfan opened this issue 4 years ago • 0 comments

Right now we checkpoint for rescaling by installing a SIGINT/SIGTERM handler, which catches the SIGTERM that Kubernetes sends when the AdaptDL scheduler decides to terminate the worker pods. However, that signal is delivered to PID 1 in the container. If the training process is not running as PID 1, it may never receive the SIGTERM (a wrapping shell does not necessarily forward it), and checkpointing will not occur.
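To make the mechanism concrete, here is a minimal sketch of the handler pattern described above. It is not AdaptDL's actual implementation: `GracefulExit`, `train`, and `save_checkpoint` are hypothetical names used only for illustration.

```python
import signal


class GracefulExit:
    """Records that SIGINT or SIGTERM arrived (hypothetical helper)."""

    def __init__(self):
        self.requested = False

    def install(self):
        # signal.signal must be called from the main thread.
        for sig in (signal.SIGINT, signal.SIGTERM):
            signal.signal(sig, self._handle)

    def _handle(self, signum, frame):
        # Only set a flag here; the training loop does the real work.
        self.requested = True


def train(num_steps, guard, save_checkpoint):
    """Run up to num_steps steps, checkpointing and stopping on a signal."""
    for step in range(num_steps):
        if guard.requested:
            save_checkpoint(step)  # assumed user-supplied callback
            return step
        # ... one training step would run here ...
    return num_steps
```

The handler only ever fires in the process that receives the signal, which is why the process's position in the container matters.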

This means that the AdaptDL training process must be the main command run in the container (i.e., not wrapped in a shell command).

Won't work: `/bin/sh -c "python3 adaptdl_training_code.py"`

Will work: `python3 adaptdl_training_code.py`
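A training script could guard against this misconfiguration with a startup check. The sketch below is an assumption about how such a check might look and is not part of AdaptDL; `is_container_init` and `warn_if_not_pid1` are hypothetical names. The check is only meaningful inside a container, since a process started on an ordinary host will never be PID 1.

```python
import os
import warnings


def is_container_init(pid=None):
    """Return True if the given (or current) process is PID 1."""
    pid = os.getpid() if pid is None else pid
    return pid == 1


def warn_if_not_pid1(pid=None):
    """Warn when the process is not PID 1; return True if a warning was issued."""
    if is_container_init(pid):
        return False
    warnings.warn(
        "Training process is not PID 1; SIGTERM from Kubernetes may not "
        "reach it, so the rescaling checkpoint may be skipped."
    )
    return True
```

If a shell wrapper cannot be avoided, prefixing the command with `exec` (for example `/bin/sh -c "exec python3 adaptdl_training_code.py"`) makes the shell replace itself with the Python process, which should then hold PID 1.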

See https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination

rmfan · Oct 29 '21 17:10