Richard Fan
Results: 2 issues of Richard Fan
Running the AdaptDL training process as something other than Process 1 causes checkpointing to fail.
Right now we checkpoint for rescaling by installing a SIGINT/SIGTERM handler, which catches the SIGTERM that Kubernetes sends when the AdaptDL scheduler decides to terminate the worker pods....
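To make the mechanism concrete, here is a minimal sketch of signal-driven checkpointing in plain Python. It is not AdaptDL's actual code: the names `_exit_requested`, `install_handlers`, `train`, `step_fn`, and `save_checkpoint` are hypothetical and exist only for illustration. The handler just sets a flag, and the training loop writes the checkpoint at the next step boundary.

```python
import os
import signal

# Hypothetical module-level flag; AdaptDL's real internals may differ.
_exit_requested = False


def _request_exit(signum, frame):
    """Record that a termination signal arrived; keep the handler cheap."""
    global _exit_requested
    _exit_requested = True


def install_handlers():
    """Route SIGINT and SIGTERM to the flag-setting handler."""
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, _request_exit)


def train(num_steps, step_fn, save_checkpoint):
    """Run up to num_steps steps; checkpoint and stop once a signal is seen.

    step_fn and save_checkpoint are caller-supplied callables (assumptions
    of this sketch). Returns the index of the last completed step.
    """
    install_handlers()
    last_step = -1
    for step in range(num_steps):
        step_fn(step)
        last_step = step
        if _exit_requested:
            # Save at a step boundary so the checkpoint is consistent.
            save_checkpoint(step)
            break
    return last_step
```

A handler like this only fires if the signal actually reaches the training process. On pod termination, the signal is delivered to the container's main process (PID 1), so when a wrapper such as a shell script holds PID 1, the training process may never see the SIGTERM unless the wrapper forwards it or launches the trainer with `exec`.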