
Can't deploy pretrained model even after following the documentation

Open bhattbhuwan13 opened this issue 2 years ago • 5 comments

Discussed in https://github.com/aws/sagemaker-python-sdk/discussions/3638

Originally posted by monika-prajapati February 6, 2023

I have a model that I want to deploy as a SageMaker endpoint. I followed this documentation and did the following:

  • Created an inference.py script with model_fn, input_fn, predict_fn, and output_fn, using this as a reference
  • Created the file/folder structure according to the documentation and built the model.tar.gz file

```
.
├── code
│   ├── inference.py
│   └── requirements.txt
└── model.pth
```

I created model.tar.gz with . as the root, while in the directory containing the code folder.
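For reference, the packaging step above can be sketched as follows (the dummy-file setup only recreates the layout from the tree above for illustration; the key point is that model.pth and code/ sit at the archive root):

```shell
set -e
# Recreate the layout from the issue (dummy files, for illustration only)
mkdir -p example/code
touch example/model.pth example/code/inference.py example/code/requirements.txt

# Package model.tar.gz with model.pth and code/ at the archive root
cd example
tar -czf ../model.tar.gz model.pth code/
cd ..

# Inspect the archive contents
tar -tzf model.tar.gz
```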

My code in the SageMaker notebook looks like this:

```python
import boto3
import sagemaker
from sagemaker.pytorch import PyTorchModel

session = boto3.Session()
sagemaker_client = session.client('sagemaker')
role = sagemaker.get_execution_role()

# Define the model data location in S3
model_data = 's3://speech2textmodel/model.tar.gz'

# Define the model architecture
model1 = PyTorchModel(model_data=model_data,
                      role=role,
                      entry_point='inference.py',
                      framework_version='1.6.0',
                      py_version='py3')

predictor = model1.deploy(instance_type='ml.m5.xlarge', initial_instance_count=1)
```

I got this error:

```
UnexpectedStatusException: Error hosting endpoint pytorch-inference-2023-02-06-09-28-21-891: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..
```

This is the error in CloudWatch:

```
ERROR - /.sagemaker/ts/models/model.mar already exists.
```

bhattbhuwan13 avatar Feb 07 '23 04:02 bhattbhuwan13

Do you have a solution for this? I am facing the same problem.

KennyTC avatar Sep 20 '23 01:09 KennyTC

@KennyTC Nope.

bhattbhuwan13 avatar Oct 02 '23 02:10 bhattbhuwan13

I am facing the same issue... @KennyTC @bhattbhuwan13 Have you fixed this?

Ruotian-Zhang avatar Oct 16 '23 09:10 Ruotian-Zhang

I faced the same issue with the same error; the error message does not seem meaningful. In my case, requirements.txt pinned library versions that weren't compatible with the Python version I chose for the container image. I realized that from the beginning of the CloudWatch log for that particular deploy execution. After I fixed the requirements, I was able to deploy my PyTorchModel and get the endpoint created and running.

mdmonaco89 avatar Nov 10 '23 16:11 mdmonaco89

I was able to resolve this by ensuring the PyTorch image version specified matched my custom requirements.txt and Python version, e.g.:

```python
pytorch_model = PyTorchModel(model_data=fname,
                             role=role,
                             entry_point='inference.py',
                             framework_version='2.1.0',
                             py_version='py310')
```

requirements.txt:

```
boto3==1.33.3
botocore==1.33.3
torch==2.0.0
```
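This kind of mismatch can be caught before uploading. The sketch below is not SDK functionality, just an illustrative check under the assumption that the pinned torch should be no newer than the container's PyTorch (as in the working config above, where torch==2.0.0 runs in a 2.1.0 container):

```python
# Sketch only (not part of the SageMaker SDK): check that the torch pin
# in requirements.txt is not newer than the container's framework_version.
framework_version = "2.1.0"  # value passed to PyTorchModel above

requirements = """\
boto3==1.33.3
botocore==1.33.3
torch==2.0.0
"""

def version_tuple(version):
    """Turn '2.1.0' into (2, 1, 0) for a simple component-wise comparison."""
    return tuple(int(part) for part in version.split("."))

# Parse 'name==version' pins into a dict
pins = dict(line.split("==") for line in requirements.strip().splitlines())

torch_ok = version_tuple(pins["torch"]) <= version_tuple(framework_version)
print("torch pin compatible with container:", torch_ok)  # True
```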

evankozliner avatar Mar 26 '24 17:03 evankozliner