
v2.102 crashes when launching a PyTorch estimator job

Open rahul003 opened this issue 3 years ago • 0 comments

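Describe the bug Creating the PyTorch estimator on v2.102 raises a ClientError: validate_distribution_instance calls the EC2 DescribeInstanceTypes API, which the credentials in use are not authorized to perform. Full traceback: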

ClientError                               Traceback (most recent call last)
<ipython-input-15-919fc4f433e5> in <module>
     50         disable_profiler=True,
     51         base_job_name=base_job_name,
---> 52         **kwargs
     53     )

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/sagemaker/pytorch/estimator.py in __init__(self, entry_point, framework_version, py_version, source_dir, hyperparameters, image_uri, distribution, **kwargs)
    226             if instance_type[:3] == "ml.":
    227                 instance_type = instance_type[3:]
--> 228             validate_distribution_instance(self.sagemaker_session, distribution, instance_type)
    229 
    230             distribution = validate_distribution(

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/sagemaker/fw_utils.py in validate_distribution_instance(sagemaker_session, distribution, instance_type)
    873 
    874     instance_desc = sagemaker_session.boto_session.client("ec2").describe_instance_types(
--> 875         InstanceTypes=[f"{instance_type}"]
    876     )
    877     if "GpuInfo" not in instance_desc["InstanceTypes"][0]:

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    506                 )
    507             # The "self" in this scope is referring to the BaseClient.
--> 508             return self._make_api_call(operation_name, kwargs)
    509 
    510         _api_call.__name__ = str(py_operation_name)

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
    913             error_code = parsed_response.get("Error", {}).get("Code")
    914             error_class = self.exceptions.from_code(error_code)
--> 915             raise error_class(parsed_response, operation_name)
    916         else:
    917             return parsed_response

ClientError: An error occurred (UnauthorizedOperation) when calling the DescribeInstanceTypes operation: You are not authorized to perform this operation.
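The failing step in the traceback can be checked in isolation. The snippet below is a minimal sketch, not the SDK's own code: `strip_ml_prefix` and `can_describe_instance_type` are hypothetical helper names, and the function only mirrors the two things visible in the traceback (dropping the `ml.` prefix, then calling `describe_instance_types` on an EC2 client). It assumes the raised exception carries a botocore-style `response` dict with an `Error`/`Code` entry, as `ClientError` does.

```python
def strip_ml_prefix(instance_type):
    """Drop the leading "ml." as estimator.py does before the EC2 call."""
    return instance_type[3:] if instance_type.startswith("ml.") else instance_type


def can_describe_instance_type(ec2_client, instance_type):
    """Hypothetical helper: report whether the credentials behind
    `ec2_client` may call DescribeInstanceTypes for `instance_type`.

    Returns True on success and False on UnauthorizedOperation;
    any other error is re-raised unchanged.
    """
    try:
        ec2_client.describe_instance_types(
            InstanceTypes=[strip_ml_prefix(instance_type)]
        )
    except Exception as exc:
        # botocore's ClientError exposes the parsed error as exc.response.
        response = getattr(exc, "response", None) or {}
        if response.get("Error", {}).get("Code") == "UnauthorizedOperation":
            return False
        raise
    return True


# Usage (requires boto3 and AWS credentials, so it is left as a comment):
#   import boto3
#   can_describe_instance_type(boto3.client("ec2"), "ml.p4d.24xlarge")
```

If this returns False for the role running the notebook, the role is missing permission for `ec2:DescribeInstanceTypes`, which matches the error message above.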

To reproduce Run the notebook at https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/gpt2/smp-train-gpt-simple.ipynb

Expected behavior With v2.100 the same code works fine and launches the job.

Screenshots or logs See the traceback above.

System information

  • SageMaker Python SDK version: 2.102
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): PyTorch
  • Framework version: 1.11
  • Python version: 3.8
  • CPU or GPU: GPU
  • Custom Docker image (Y/N): N

rahul003 · Aug 04 '22 17:08