Endpoint failing after initially passing ping health check
Describe the bug I am trying to deploy an MLflow model to a new SageMaker endpoint using a custom Docker container. Initial creation proceeds without any problems, and the endpoint initially passes the ping health check. After a little while it stops responding and I get the error: 'The primary container for production variant xxxxx did not pass the ping health check'. I have previously deployed multiple other models without running into this problem, and the model itself loads and scores without issues locally.
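For context, the deployment looks roughly like the following MLflow SageMaker deployment-client call (a minimal sketch rather than my exact code; the endpoint name, image URI, role ARN, and instance settings are placeholders):

```python
# Rough shape of the deployment call (all names/ARNs below are placeholders).
from mlflow.deployments import get_deploy_client

client = get_deploy_client("sagemaker")  # uses the default boto3 credentials/region
client.create_deployment(
    name="my-endpoint",                 # placeholder endpoint name
    model_uri="models:/MyModel/1",      # model version from the MLflow registry
    config={
        "image_url": "<account>.dkr.ecr.<region>.amazonaws.com/my-mlflow-image:latest",
        "execution_role_arn": "arn:aws:iam::<account>:role/MySageMakerRole",
        "instance_type": "ml.m5.large",
        "instance_count": 1,
        "region_name": "us-east-1",
    },
)
```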
To reproduce Deploy an MLflow model to a new SageMaker endpoint using a custom Docker container image (see the deployment sketch above). The endpoint is created successfully and initially responds to /ping, then stops responding after a short while.
Expected behavior I would expect either a successful deployment or a specific error if the deployment fails.
Screenshots or logs I've added the logs from CloudWatch below. Unfortunately, they aren't particularly informative:
[2023-09-11 16:29:10 +0000] [17720] [INFO] Starting gunicorn 20.1.0
[2023-09-11 16:29:10 +0000] [17720] [INFO] Listening at: http://127.0.0.1:8000 (17720)
[2023-09-11 16:29:10 +0000] [17720] [INFO] Using worker: gevent
[2023-09-11 16:29:10 +0000] [17728] [INFO] Booting worker with pid: 17728
[2023-09-11 16:29:10 +0000] [17729] [INFO] Booting worker with pid: 17729
[2023-09-11 16:29:10 +0000] [17730] [INFO] Booting worker with pid: 17730
[2023-09-11 16:29:10 +0000] [17731] [INFO] Booting worker with pid: 17731
10.32.0.2 - - [11/Sep/2023:16:29:15 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:29:18 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:29:23 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:29:28 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:29:33 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:29:38 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:29:43 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:29:48 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:29:53 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:29:58 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:30:03 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:30:08 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:30:13 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:30:18 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:30:23 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:30:28 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
10.32.0.2 - - [11/Sep/2023:16:30:33 +0000] "GET /ping HTTP/1.1" 200 1 "-" "AHC/2.0"
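One way to check whether the container itself stops answering (rather than something on the SageMaker side) is to run the image locally and poll /ping on the same roughly 5-second cadence seen in the access log above. A rough sketch (the image tag and the 8080 port mapping are assumptions about how the container is started locally):

```python
# Poll /ping every 5s against a locally running copy of the serving container,
# e.g. one started with: docker run -p 8080:8080 my-mlflow-image:latest serve
# (the image tag and port mapping are placeholders). Ctrl+C to stop.
import time
import requests

while True:
    try:
        r = requests.get("http://localhost:8080/ping", timeout=2)
        print(time.strftime("%H:%M:%S"), r.status_code)
    except requests.RequestException as exc:
        print(time.strftime("%H:%M:%S"), "ping failed:", exc)
    time.sleep(5)
```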
System information
- SageMaker Python SDK version: 1.24.0
- Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): MLflow/CatBoost
- Framework version: 2.3.2/1.1.1
- Python version: 3.10.4
- CPU or GPU: CPU
- Custom Docker image (Y/N): Y
If you're building locally, it could be that the artifact you're trying to run on MLflow was built for a different CPU architecture (e.g., suppose you are building and deploying from a Mac). If that were the case, I believe the container would exit/fail before it logged anything.
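One quick way to rule that out is to check the architecture recorded in the image metadata (the image tag below is a placeholder):

```python
# Check which CPU architecture/OS the image was built for
# ("my-mlflow-image:latest" is a placeholder tag).
import json
import subprocess

out = subprocess.run(
    ["docker", "inspect", "my-mlflow-image:latest"],
    capture_output=True, text=True, check=True,
)
meta = json.loads(out.stdout)[0]
# Standard (non-Graviton) SageMaker CPU instances are x86_64,
# so this should print "amd64 linux".
print(meta["Architecture"], meta["Os"])
```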
@evankozliner Thank you for the reply. I'm building models using MLflow, registering them to the MLflow model registry, and then pulling the artifacts from S3 to deploy in a Docker container on SageMaker. I've tried building and registering these models both on a local machine (Windows) and on a dedicated EC2 instance (Linux) before deploying to SageMaker (so the container image is already in ECR and the container itself is built on SageMaker). I've had luck with some models but not others, so if this were the cause, I don't understand why it would only be an issue sometimes. Wouldn't you expect them all to fail? There's no information in the logs that would let me identify the cause, so troubleshooting has been guesswork. I was wondering if it could be related to the environment's contents or size but, again, without more information in the logs, everything is a guess. I'd appreciate any further insight or suggestions you can offer. It's very possible I'm missing something or misunderstanding what you're trying to say.
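For reference, the local check I mentioned amounts to something like this (the model name/version and the sample input are placeholders):

```python
# Local sanity check: load the registered model and score a sample row
# (model name/version and the sample input frame are placeholders).
import mlflow.pyfunc
import pandas as pd

# Resolves the registry URI and downloads the artifacts from S3.
model = mlflow.pyfunc.load_model("models:/MyModel/1")
sample = pd.DataFrame({"feature_a": [1.0], "feature_b": [2.0]})
print(model.predict(sample))
```

This runs without errors locally, which is why the endpoint failing after it has already served /ping successfully is so confusing.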