Model.compile ignores framework_version when compiling for ml_inf1
Describe the bug
When compiling a model that was trained / fine-tuned with PyTorch 1.9.1 for ml_inf1 via the SDK, the framework_version argument is ignored, resulting in a version mismatch between the version used at training time and the one chosen automatically by SageMaker.
To reproduce
import json

from sagemaker.pytorch.model import PyTorchModel
from sagemaker.predictor import Predictor

sm_model = PyTorchModel(
    model_data=traced_model_url,
    predictor_cls=Predictor,
    framework_version="1.9",
    role="<role_arn>",
    entry_point="inference.py",
    source_dir="code",
    py_version="py3",
    name="<name>",
)

compiled_inf_model = sm_model.compile(
    target_instance_family="ml_inf1",
    input_shape=<input_shape>,
    job_name="<job_name>",
    role="<role_arn>",
    framework="pytorch",
    framework_version="1.9",
    output_path="<output_path>",
    compiler_options=json.dumps("--dtype int64"),
    compile_max_run=1000,
)
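To confirm that the version is dropped before the job even starts, the submitted configuration can be inspected with boto3 (a minimal sketch; <job_name> is the same placeholder as above):

import boto3

# Sketch: inspect what the SDK actually submitted for the compilation job.
sm_client = boto3.client("sagemaker")
job = sm_client.describe_compilation_job(CompilationJobName="<job_name>")
# For ml_inf1 targets, "FrameworkVersion" is absent from InputConfig,
# even though framework_version="1.9" was passed to compile().
print(job["InputConfig"])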
Expected behavior
The compilation job should also show the "Framework version" when opened in the AWS Console. However, only the PYTORCH framework value is present, and the compilation fails after 5 minutes with the error message:
ClientError: InputConfiguration: Unable to load PyTorch model:
Unknown type name 'NoneType':
Serialized File "code/__torch__/torch/nn/modules/activation/___torch_mangle_8258.py", line 7
  _is_full_backward_hook : Optional[bool]
  def forward(self: __torch__.torch.nn.modules.activation.___torch_mangle_8258.Tanh,
      argument_1: Tensor) -> NoneType:
                             ~~~~~~~~ <--- HERE
    return None
For further troubleshooting common failures please visit: https://docs.aws.amazon.com/sagemaker/latest/dg/neo-troubleshooting-compilation.html

If, however, I clone the failed job in the AWS Console and just add the 1.9 "Framework version" manually, the job runs to completion.
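As a stopgap, that same manual fix can be scripted by submitting the compilation job directly through boto3, where FrameworkVersion can be set explicitly. A sketch using the placeholders from above (field names are from the CreateCompilationJob API):

import json

import boto3

# Sketch of the manual workaround: bypass Model.compile and create the
# compilation job directly, setting FrameworkVersion explicitly.
sm_client = boto3.client("sagemaker")
sm_client.create_compilation_job(
    CompilationJobName="<job_name>",
    RoleArn="<role_arn>",
    InputConfig={
        "S3Uri": traced_model_url,
        "DataInputConfig": json.dumps(<input_shape>),
        "Framework": "PYTORCH",
        "FrameworkVersion": "1.9",
    },
    OutputConfig={
        "S3OutputLocation": "<output_path>",
        "TargetDevice": "ml_inf1",
        "CompilerOptions": json.dumps("--dtype int64"),
    },
    StoppingCondition={"MaxRuntimeInSeconds": 1000},
)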
Screenshots or logs
localhost compiler-container-Primary[5078]: Traceback (most recent call last):
localhost compiler-container-Primary[5078]: File "/opt/amazon/lib/python3.6/site-packages/neo_inferentia_compiler/pytorch_framework.py", line 107, in compile_model
localhost compiler-container-Primary[5078]: model = torch.jit.load(self.model_file)
localhost compiler-container-Primary[5078]: File "/opt/amazon/lib/python3.6/site-packages/torch_neuron/jit_load_wrapper.py", line 13, in wrapper
localhost compiler-container-Primary[5078]: script_module = jit_load(*args, **kwargs)
localhost compiler-container-Primary[5078]: File "/opt/amazon/lib/python3.6/site-packages/torch/jit/_serialization.py", line 161, in load
localhost compiler-container-Primary[5078]: cpp_module = torch._C.import_ir_module(cu, f, map_location, _extra_files)
localhost compiler-container-Primary[5078]: RuntimeError:
localhost compiler-container-Primary[5078]: Unknown type name 'NoneType':
localhost compiler-container-Primary[5078]: Serialized File "code/__torch__/torch/nn/modules/activation/___torch_mangle_8258.py", line 7
localhost compiler-container-Primary[5078]: _is_full_backward_hook : Optional[bool]
localhost compiler-container-Primary[5078]: def forward(self: __torch__.torch.nn.modules.activation.___torch_mangle_8258.Tanh,
localhost compiler-container-Primary[5078]: argument_1: Tensor) -> NoneType:
localhost compiler-container-Primary[5078]: ~~~~~~~~ <--- HERE
localhost compiler-container-Primary[5078]: return None
localhost compiler-container-Primary[5078]: During handling of the above exception, another exception occurred:
localhost compiler-container-Primary[5078]: Traceback (most recent call last):
localhost compiler-container-Primary[5078]: File "/opt/amazon/bin/neo_main.py", line 101, in <module>
localhost compiler-container-Primary[5078]: compile()
localhost compiler-container-Primary[5078]: File "/opt/amazon/bin/neo_main.py", line 74, in compile
localhost compiler-container-Primary[5078]: compiler_options
localhost compiler-container-Primary[5078]: File "/opt/amazon/bin/neo_main.py", line 32, in compile_model
localhost compiler-container-Primary[5078]: return framework_instance.compile_model()
localhost compiler-container-Primary[5078]: File "/opt/amazon/lib/python3.6/site-packages/neo_inferentia_compiler/pytorch_framework.py", line 109, in compile_model
localhost compiler-container-Primary[5078]: raise RuntimeError("InputConfiguration: Unable to load PyTorch model:", str(e))
localhost compiler-container-Primary[5078]: RuntimeError: ('InputConfiguration: Unable to load PyTorch model:', ' Unknown type name \'NoneType\': Serialized File "code/__torch__/torch/nn/modules/activation/___torch_mangle_8258.py", line 7 _is_full_backward_hook : Optional[bool] def forward(self: __torch__.torch.nn.modules.activation.___torch_mangle_8258.Tanh, argument_1: Tensor) -> NoneType: ~~~~~~~~ <--- HERE return None ')
System information
- SageMaker Python SDK version: 2.97.0
- Framework name (eg. PyTorch) or algorithm (eg. KMeans): PyTorch
- Framework version: 1.9.1
- Python version: 3.8
- CPU or GPU: CPU/Inf
- Custom Docker image (Y/N): -
Additional context
The problem may lie in the negative lookahead regex group (?!ml_inf) at this line: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/model.py#L735.
Is this condition still applicable?
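For illustration, here is how that lookahead behaves in isolation (a standalone sketch; the pattern is reconstructed from the linked line and may differ from the current source):

import re

# Reconstruction of the suspected guard: only targets matching this
# pattern get a FrameworkVersion attached; ml_inf* targets never match.
pattern = r"(?!ml_inf)ml_.*"

print(bool(re.match(pattern, "ml_c5")))    # True  -> FrameworkVersion kept
print(bool(re.match(pattern, "ml_inf1")))  # False -> FrameworkVersion dropped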
I've got something similar. My code:
import json

from sagemaker.pytorch.model import PyTorchModel
from sagemaker.predictor import Predictor
from sagemaker.utils import name_from_base

compiled_sm_model = PyTorchModel(
    model_data=traced_model_url,
    predictor_cls=Predictor,
    framework_version="1.5",
    role=role,
    sagemaker_session=sess,
    entry_point="inference_inf1.py",
    source_dir="aux",
    py_version="py3",
    env={"MMS_DEFAULT_RESPONSE_TIMEOUT": "500"},
)

hardware = "inf1"
compilation_job_name = name_from_base("godel")

compiled_inf1_model = compiled_sm_model.compile(
    target_instance_family=f"ml_{hardware}",
    input_shape={"input_ids": [1, 512], "attention_mask": [1, 512]},
    job_name=compilation_job_name,
    role=role,
    framework="pytorch",
    framework_version="1.5",
    output_path=f"s3://{sagemaker_session_bucket}/{prefix}/compiled_model",
    compiler_options=json.dumps("--dtype int64"),
    compile_max_run=900,
)
The error is consistently:
localhost compiler-container-Primary[5145]: 2022-09-27 15:00:27,229 INFO root Successfully download and extract model into /tmp/models
localhost compiler-container-Primary[5145]: 2022-09-27 15:00:28,305 INFO neo_inferentia_compiler.pytorch_framework Neuron Compilation Inputs: model_file /tmp/models/./model.pth output_file /opt/ml/model/compiled/model.pth
localhost compiler-container-Primary[5145]: Traceback (most recent call last):
localhost compiler-container-Primary[5145]: File "/opt/amazon/lib/python3.6/site-packages/neo_inferentia_compiler/pytorch_framework.py", line 107, in compile_model
localhost compiler-container-Primary[5145]: model = torch.jit.load(self.model_file)
localhost compiler-container-Primary[5145]: File "/opt/amazon/lib/python3.6/site-packages/torch_neuron/jit_load_wrapper.py", line 13, in wrapper
localhost compiler-container-Primary[5145]: script_module = jit_load(*args, **kwargs)
localhost compiler-container-Primary[5145]: File "/opt/amazon/lib/python3.6/site-packages/torch/jit/_serialization.py", line 161, in load
localhost compiler-container-Primary[5145]: cpp_module = torch._C.import_ir_module(cu, f, map_location, _extra_files)
localhost compiler-container-Primary[5145]: RuntimeError: [enforce fail at inline_container.cc:222] . file not found: model/version
localhost compiler-container-Primary[5145]: During handling of the above exception, another exception occurred:
localhost compiler-container-Primary[5145]: Traceback (most recent call last):
localhost compiler-container-Primary[5145]: File "/opt/amazon/bin/neo_main.py", line 101, in <module>
localhost compiler-container-Primary[5145]: compile()
localhost compiler-container-Primary[5145]: File "/opt/amazon/bin/neo_main.py", line 74, in compile
localhost compiler-container-Primary[5145]: compiler_options
localhost compiler-container-Primary[5145]: File "/opt/amazon/bin/neo_main.py", line 32, in compile_model
localhost compiler-container-Primary[5145]: return framework_instance.compile_model()
localhost compiler-container-Primary[5145]: File "/opt/amazon/lib/python3.6/site-packages/neo_inferentia_compiler/pytorch_framework.py", line 109, in compile_model
localhost compiler-container-Primary[5145]: raise RuntimeError("InputConfiguration: Unable to load PyTorch model:", str(e))
localhost compiler-container-Primary[5145]: RuntimeError: ('InputConfiguration: Unable to load PyTorch model:', '[enforce fail at inline_container.cc:222] . file not found: model/version')
I'm just trying to figure out whether this is my mistake or something is wrong with SageMaker. Thanks.
PS: the torch version is 1.12.1, sagemaker 2.107.0.
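One sanity check that may help narrow this down: load the traced artifact locally with torch.jit.load before compiling. The "file not found: model/version" error can point at an artifact that was not written by torch.jit.save, or at a serialization-format mismatch between the tracing environment (torch 1.12.1) and the 1.5-based compiler container. A minimal sketch, assuming model.pth is the traced module packaged into model.tar.gz:

import torch

# Sanity check (sketch): load the traced artifact locally before compiling.
# If this fails under the torch version passed as framework_version, the
# Neo container will fail the same way.
model = torch.jit.load("model.pth")
print(torch.__version__)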