
Can't deploy pretrained model even after following the documentation

Open bhattbhuwan13 opened this issue 2 years ago • 5 comments

Discussed in https://github.com/aws/sagemaker-python-sdk/discussions/3638

Originally posted by monika-prajapati February 6, 2023

I have a model that I want to deploy as a SageMaker endpoint. I followed this documentation and did the following:

  • Created an inference.py script with model_fn, input_fn, predict_fn, and output_fn, using this as a reference
  • Created the file/folder structure according to the documentation and built the model.tar.gz file

```
.
├── code
│   ├── inference.py
│   └── requirements.txt
└── model.pth
```

I created model.tar.gz with . as the root, while in the directory containing the code folder.
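For reference, the packaging step above can be sketched as follows (the dummy-file setup only recreates the layout from the tree above for illustration; the key point is that model.pth and code/ sit at the archive root):

```shell
set -e
# Recreate the layout from the issue (dummy files, for illustration only)
mkdir -p example/code
touch example/model.pth example/code/inference.py example/code/requirements.txt

# Package model.tar.gz with model.pth and code/ at the archive root
cd example
tar -czf ../model.tar.gz model.pth code/
cd ..

# Inspect the archive contents
tar -tzf model.tar.gz
```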

My code in the SageMaker notebook looks like this:

```python
import boto3
import sagemaker
from sagemaker.pytorch import PyTorchModel

session = boto3.Session()
sagemaker_client = session.client('sagemaker')
role = sagemaker.get_execution_role()

# Define the model data location in S3
model_data = 's3://speech2textmodel/model.tar.gz'

# Define the model architecture
model1 = PyTorchModel(model_data=model_data,
                      role=role,
                      entry_point='inference.py',
                      framework_version='1.6.0',
                      py_version='py3')

predictor = model1.deploy(instance_type='ml.m5.xlarge', initial_instance_count=1)
```

I got this error:

```
UnexpectedStatusException: Error hosting endpoint pytorch-inference-2023-02-06-09-28-21-891: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..
```

This is the error in CloudWatch:

```
ERROR - /.sagemaker/ts/models/model.mar already exists.
```

bhattbhuwan13 avatar Feb 07 '23 04:02 bhattbhuwan13

Do you have a solution for this? I am facing the same problem.

KennyTC avatar Sep 20 '23 01:09 KennyTC

@KennyTC Nope.

bhattbhuwan13 avatar Oct 02 '23 02:10 bhattbhuwan13

I am facing the same issue... @KennyTC @bhattbhuwan13 Have you fixed this?

Ruotian-Zhang avatar Oct 16 '23 09:10 Ruotian-Zhang

I faced the same issue with the same error; the error message does not seem meaningful. In my case, requirements.txt pinned library versions that weren't compatible with the Python version I chose for the container image. I realized that from the beginning of the CloudWatch log for that particular deploy execution. After I fixed the requirements, I was able to deploy my PyTorchModel and get the endpoint created and running.

mdmonaco89 avatar Nov 10 '23 16:11 mdmonaco89

I was able to resolve this by ensuring the PyTorch image version specified matched my custom requirements.txt and Python version, e.g.:

```python
pytorch_model = PyTorchModel(model_data=fname,
                             role=role,
                             entry_point='inference.py',
                             framework_version='2.1.0',
                             py_version='py310')
```

requirements.txt:

```
boto3==1.33.3
botocore==1.33.3
torch==2.0.0
```
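This kind of mismatch can be caught before uploading. The sketch below is not SDK functionality, just an illustrative check under the assumption that the pinned torch should be no newer than the container's PyTorch (as in the working config above, where torch==2.0.0 runs in a 2.1.0 container):

```python
# Sketch only (not part of the SageMaker SDK): check that the torch pin
# in requirements.txt is not newer than the container's framework_version.
framework_version = "2.1.0"  # value passed to PyTorchModel above

requirements = """\
boto3==1.33.3
botocore==1.33.3
torch==2.0.0
"""

def version_tuple(version):
    """Turn '2.1.0' into (2, 1, 0) for a simple component-wise comparison."""
    return tuple(int(part) for part in version.split("."))

# Parse 'name==version' pins into a dict
pins = dict(line.split("==") for line in requirements.strip().splitlines())

torch_ok = version_tuple(pins["torch"]) <= version_tuple(framework_version)
print("torch pin compatible with container:", torch_ok)  # True
```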

evankozliner avatar Mar 26 '24 17:03 evankozliner