Deploying model trained by script mode failed
System Information
- Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): TensorFlow / the one used in the char-rnn-tensorflow example
- Framework Version: 1.11
- Python Version: 3
- CPU or GPU: CPU
- Python SDK Version: latest
- Are you using a custom image: No
Describe the problem
I'm following the script mode QuickStart notebook in SageMaker. Everything worked fine until I tried to deploy the trained model to an endpoint. According to the log, model training completed and the model artifacts were uploaded successfully. However, deployment failed with an error about the model data not being found:
ClientError: An error occurred (ValidationException) when calling the CreateModel operation: Could not find model data at s3://sagemaker-us-east-1-MY-AWS-ACCOUNT/sagemaker-tensorflow-scriptmode-2018-12-27-17-14-30-624/output/model.tar.gz
Looking at the results from the example notebook, local mode produced a model.tar.gz outside the model folder in S3, while non-local mode didn't produce the compressed file in the model folder at all.
Is the deployment process incompatible with script mode? Does it only work with legacy mode?
Minimal repro / logs
Please provide any logs and a bare minimum reproducible test case, as this will be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
To reproduce the error, add the single command listed under "Exact command to reproduce" and run the whole notebook.
INFO:sagemaker:Creating model with name: sagemaker-tensorflow-scriptmode-2019-01-21-18-35-13-608
---------------------------------------------------------------------------
ClientError Traceback (most recent call last)
<ipython-input-13-7ba0576fcc7e> in <module>()
----> 1 predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/estimator.py in deploy(self, initial_instance_count, instance_type, accelerator_type, endpoint_name, use_compiled_model, **kwargs)
367 initial_instance_count=initial_instance_count,
368 accelerator_type=accelerator_type,
--> 369 endpoint_name=endpoint_name)
370
371 @property
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, accelerator_type, endpoint_name, tags)
237 self.name += compiled_model_suffix
238
--> 239 self._create_sagemaker_model(instance_type, accelerator_type)
240 production_variant = sagemaker.production_variant(self.name, instance_type, initial_instance_count,
241 accelerator_type=accelerator_type)
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/model.py in _create_sagemaker_model(self, instance_type, accelerator_type)
107 self.sagemaker_session.create_model(self.name, self.role,
108 container_def, vpc_config=self.vpc_config,
--> 109 enable_network_isolation=enable_network_isolation)
110
111 def _framework(self):
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in create_model(self, name, role, container_defs, vpc_config, enable_network_isolation, primary_container)
578
579 try:
--> 580 self.sagemaker_client.create_model(**create_model_request)
581 except ClientError as e:
582 error_code = e.response['Error']['Code']
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
318 "%s() only accepts keyword arguments." % py_operation_name)
319 # The "self" in this scope is referring to the BaseClient.
--> 320 return self._make_api_call(operation_name, kwargs)
321
322 _api_call.__name__ = str(py_operation_name)
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
622 error_code = parsed_response.get("Error", {}).get("Code")
623 error_class = self.exceptions.from_code(error_code)
--> 624 raise error_class(parsed_response, operation_name)
625 else:
626 return parsed_response
ClientError: An error occurred (ValidationException) when calling the CreateModel operation: Could not find model data at s3://sagemaker-us-east-1-MY-AWS-ACCOUNT/sagemaker-tensorflow-scriptmode-2019-01-21-18-35-13-608/output/model.tar.gz.
Exact command to reproduce:
I added only one line at the end of the notebook, for deployment:
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
Hi @JingJZ160
The notebook is intended for training only. The training script cloned from GitHub in the repo only saves checkpoints; it does not save a final model inside the container after training, so SageMaker has nothing to upload to S3.
We have two options and still need to decide which to take:
- Replace this script with one that saves the trained model.
- Keep the example as-is but update the documentation to state that no model will be uploaded to S3 afterwards.
If you want to deploy after training, make the training script save the final model to '/opt/ml/model' (the value of the SM_MODEL_DIR environment variable). SageMaker will then automatically upload the model from that local directory to S3.
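The change above can be sketched as follows. This is an illustrative snippet, not part of the SageMaker SDK: the `export_dir` helper is hypothetical, and the commented-out `saver.save` call assumes a TensorFlow 1.x `Saver`/session named as in the char-rnn example.

```python
import os
import tempfile

# SageMaker exposes the model output directory as the SM_MODEL_DIR
# environment variable (inside the training container it is
# /opt/ml/model). Anything the training script writes there is
# archived to model.tar.gz and uploaded to S3 when the job finishes.
def export_dir(fallback="/opt/ml/model"):
    return os.environ.get("SM_MODEL_DIR", fallback)

# In the actual training script you would save the final model here,
# e.g. with a TensorFlow 1.x Saver (names are illustrative):
#   saver.save(sess, os.path.join(export_dir(), "model.ckpt"))

# Demonstration outside a container, using a temp dir as the fallback:
demo_dir = export_dir(fallback=tempfile.mkdtemp())
with open(os.path.join(demo_dir, "final-model.txt"), "w") as f:
    f.write("model weights placeholder")
print(sorted(os.listdir(demo_dir)))
```

With the model written under SM_MODEL_DIR, the `estimator.deploy(...)` call from the notebook should find model.tar.gz at the expected S3 location.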
Also, if the log claims a model was uploaded to S3 when nothing was actually uploaded, the logging needs to be fixed as well.
Thanks for reporting the problem! We will post an update once we decide how to proceed.
@mvsusp @yangaws: any resolution for this issue?
@mohanraj1311, unfortunately neither @yangaws nor I are on this project anymore. Can you check with the current team?
@mvsusp: thanks for the reply. Could you please share whom to tag?