Deploying model trained by script mode failed
System Information
- Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): TensorFlow / the one used in the char-rnn-tensorflow example
- Framework Version: 1.11
- Python Version: 3
- CPU or GPU: CPU
- Python SDK Version: latest
- Are you using a custom image: No
Describe the problem
I'm following the script mode QuickStart notebook in SageMaker. Everything worked fine until I tried to deploy the trained model to an endpoint. According to the log, model training completed and the model artifacts were uploaded successfully. However, deployment failed with an error about the model data not being found:
ClientError: An error occurred (ValidationException) when calling the CreateModel operation: Could not find model data at s3://sagemaker-us-east-1-MY-AWS-ACCOUNT/sagemaker-tensorflow-scriptmode-2018-12-27-17-14-30-624/output/model.tar.gz
Looking at the results from the example notebook, local mode produced a model.tar.gz outside the model folder in S3, while non-local mode didn't produce the compressed file in the model folder at all.
Is the deployment process incompatible with script mode? Does it only work with legacy mode?
Minimal repro / logs
Please provide any logs and a bare minimum reproducible test case, as this will be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
To reproduce the error, add the single command listed under "Exact command to reproduce" and run the whole notebook.
INFO:sagemaker:Creating model with name: sagemaker-tensorflow-scriptmode-2019-01-21-18-35-13-608
---------------------------------------------------------------------------
ClientError Traceback (most recent call last)
<ipython-input-13-7ba0576fcc7e> in <module>()
----> 1 predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/estimator.py in deploy(self, initial_instance_count, instance_type, accelerator_type, endpoint_name, use_compiled_model, **kwargs)
367 initial_instance_count=initial_instance_count,
368 accelerator_type=accelerator_type,
--> 369 endpoint_name=endpoint_name)
370
371 @property
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, accelerator_type, endpoint_name, tags)
237 self.name += compiled_model_suffix
238
--> 239 self._create_sagemaker_model(instance_type, accelerator_type)
240 production_variant = sagemaker.production_variant(self.name, instance_type, initial_instance_count,
241 accelerator_type=accelerator_type)
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/model.py in _create_sagemaker_model(self, instance_type, accelerator_type)
107 self.sagemaker_session.create_model(self.name, self.role,
108 container_def, vpc_config=self.vpc_config,
--> 109 enable_network_isolation=enable_network_isolation)
110
111 def _framework(self):
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in create_model(self, name, role, container_defs, vpc_config, enable_network_isolation, primary_container)
578
579 try:
--> 580 self.sagemaker_client.create_model(**create_model_request)
581 except ClientError as e:
582 error_code = e.response['Error']['Code']
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
318 "%s() only accepts keyword arguments." % py_operation_name)
319 # The "self" in this scope is referring to the BaseClient.
--> 320 return self._make_api_call(operation_name, kwargs)
321
322 _api_call.__name__ = str(py_operation_name)
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
622 error_code = parsed_response.get("Error", {}).get("Code")
623 error_class = self.exceptions.from_code(error_code)
--> 624 raise error_class(parsed_response, operation_name)
625 else:
626 return parsed_response
ClientError: An error occurred (ValidationException) when calling the CreateModel operation: Could not find model data at s3://sagemaker-us-east-1-MY-AWS-ACCOUNT/sagemaker-tensorflow-scriptmode-2019-01-21-18-35-13-608/output/model.tar.gz.
Exact command to reproduce:
I added only one line at the end of the notebook, for deployment:
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
Hi @JingJZ160
The notebook is intended for training only. The training script cloned from GitHub in the repo only saves checkpoints; it does not save a final model inside the container after training, so SageMaker has nothing to upload to S3.
We have two options and still need to decide which to take:
- Replace this script with one that saves the trained model.
- Keep the example as-is but update the documentation to state that no model will be uploaded to S3 afterwards.
If you want to deploy after training, make the training script save the final model to '/opt/ml/model' (the value of the SM_MODEL_DIR environment variable). SageMaker will then automatically upload the model from that local directory to S3.
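The change above can be sketched as follows. This is an illustrative snippet, not part of the SageMaker SDK: the `export_dir` helper is hypothetical, and the commented-out `saver.save` call assumes a TensorFlow 1.x `Saver`/session named as in the char-rnn example.

```python
import os
import tempfile

# SageMaker exposes the model output directory as the SM_MODEL_DIR
# environment variable (inside the training container it is
# /opt/ml/model). Anything the training script writes there is
# archived to model.tar.gz and uploaded to S3 when the job finishes.
def export_dir(fallback="/opt/ml/model"):
    return os.environ.get("SM_MODEL_DIR", fallback)

# In the actual training script you would save the final model here,
# e.g. with a TensorFlow 1.x Saver (names are illustrative):
#   saver.save(sess, os.path.join(export_dir(), "model.ckpt"))

# Demonstration outside a container, using a temp dir as the fallback:
demo_dir = export_dir(fallback=tempfile.mkdtemp())
with open(os.path.join(demo_dir, "final-model.txt"), "w") as f:
    f.write("model weights placeholder")
print(sorted(os.listdir(demo_dir)))
```

With the model written under SM_MODEL_DIR, the `estimator.deploy(...)` call from the notebook should find model.tar.gz at the expected S3 location.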
Also, if the log claims a model was uploaded to S3 when nothing was actually uploaded, the logging needs to be fixed as well.
Thanks for reporting the problem! We will post an update once we decide how to proceed.
@mvsusp @yangaws: any resolution for this issue?
@mohanraj1311, unfortunately neither @yangaws nor I are on this project anymore. Can you check with the current team?
@mvsusp: thanks for the reply. Could you please share whom to tag?