ModelError while deploying FlanT5-xl
System Info
transformers_version==4.17.0, Platform = SageMaker Notebook, python==3.9.0
Who can help?
@ArthurZucker @younesbelkada
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Amazon SageMaker deployment script in AWS for flan-t5-xl:
from sagemaker.huggingface import HuggingFaceModel
import sagemaker
role = sagemaker.get_execution_role()
# Hub Model configuration. https://huggingface.co/models
hub = {
'HF_MODEL_ID':'google/flan-t5-xl',
'HF_TASK':'text2text-generation'
}
# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
transformers_version='4.17.0',
pytorch_version='1.10.2',
py_version='py38',
env=hub,
role=role,
)
# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
initial_instance_count=1, # number of instances
instance_type='ml.m5.xlarge' # ec2 instance type
)
predictor.predict({
'inputs': "The answer to the universe is"
})
Results in
---------------------------------------------------------------------------
ModelError Traceback (most recent call last)
/tmp/ipykernel_20116/1338286066.py in <cell line: 26>()
24 )
25
---> 26 predictor.predict({
27 'inputs': "The answer to the universe is"
28 })
~/anaconda3/envs/python3/lib/python3.10/site-packages/sagemaker/predictor.py in predict(self, data, initial_args, target_model, target_variant, inference_id)
159 data, initial_args, target_model, target_variant, inference_id
160 )
--> 161 response = self.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(**request_args)
162 return self._handle_response(response)
163
~/anaconda3/envs/python3/lib/python3.10/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
528 )
529 # The "self" in this scope is referring to the BaseClient.
--> 530 return self._make_api_call(operation_name, kwargs)
531
532 _api_call.__name__ = str(py_operation_name)
~/anaconda3/envs/python3/lib/python3.10/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
958 error_code = parsed_response.get("Error", {}).get("Code")
959 error_class = self.exceptions.from_code(error_code)
--> 960 raise error_class(parsed_response, operation_name)
961 else:
962 return parsed_response
ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
"code": 400,
"type": "InternalServerException",
"message": "Could not load model /.sagemaker/mms/models/google__flan-t5-xl with any of the following classes: (\u003cclass \u0027transformers.models.auto.modeling_auto.AutoModelForSeq2SeqLM\u0027\u003e, \u003cclass \u0027transformers.models.t5.modeling_t5.T5ForConditionalGeneration\u0027\u003e)."
}
"
From an existing issue, I suspected this might be due to the use of transformers==4.17.0; however, when I use the exact same script to deploy the flan-t5-large model, it works without any issues.
Expected behavior
The model should get deployed on AWS Sagemaker without any issues.
Hello @RonLek
Thanks for the issue!
Note that starting from flan-t5-xl, the weights of the model are sharded.
Loading sharded weights is only supported in releases after transformers==4.17.0 (it was introduced in transformers==4.18.0: https://github.com/huggingface/transformers/releases/tag/v4.18.0), so I think the fix should be updating the transformers version to a more recent one, e.g. 4.26.0 or 4.25.0.
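For reference, a minimal sketch of what the updated deployment could look like, assuming an up-to-date sagemaker SDK whose Hugging Face inference DLCs include a transformers release >= 4.18 (the exact transformers/pytorch/py version combination and the instance type below are assumptions, adjust them to what your SDK and region actually offer):

```python
from sagemaker.huggingface import HuggingFaceModel
import sagemaker

role = sagemaker.get_execution_role()

# Hub model configuration, unchanged from the original script
hub = {"HF_MODEL_ID": "google/flan-t5-xl", "HF_TASK": "text2text-generation"}

# assumption: this version combination is available as a Hugging Face
# inference DLC in your sagemaker SDK/region
huggingface_model = HuggingFaceModel(
    transformers_version="4.26.0",
    pytorch_version="1.13.1",
    py_version="py39",
    env=hub,
    role=role,
)

# flan-t5-xl is considerably larger than flan-t5-large, so a bigger
# (ideally GPU) instance than ml.m5.xlarge is likely needed
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
)

print(predictor.predict({"inputs": "The answer to the universe is"}))
```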
Hi @younesbelkada and @RonLek ! I have the same issue deploying google/flan-t5-xxl on SageMaker.
I've tried to update to transformers==4.26.0 by providing code/requirements.txt through s3://sagemaker-eu-north-1-***/model.tar.gz:
# Hub Model configuration. https://huggingface.co/models
hub: dict = {"HF_MODEL_ID": "google/flan-t5-xxl", "HF_TASK": "text2text-generation"}
# Create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
transformers_version="4.17.0",
pytorch_version="1.10.2",
py_version="py38",
model_data="s3://sagemaker-eu-north-1-***/model.tar.gz",
env=hub,
role=role,
)
Observing the AWS logs I can see that transformers==4.26.0 was installed:
This is an experimental beta features, which allows downloading model from the Hugging Face Hub on start up. It loads the model defined in the env var `HF_MODEL_ID`
/opt/conda/lib/python3.8/site-packages/huggingface_hub/file_download.py:588: FutureWarning: `cached_download` is the legacy way to download files from the HF hub, please consider upgrading to `hf_hub_download` warnings.warn(
Downloading: 100%|██████████| 11.0k/11.0k [00:00<00:00, 5.49MB/s]
Downloading: 100%|██████████| 674/674 [00:00<00:00, 663kB/s]
Downloading: 100%|██████████| 2.20k/2.20k [00:00<00:00, 2.24MB/s]
Downloading: 100%|██████████| 792k/792k [00:00<00:00, 43.5MB/s]
Downloading: 100%|██████████| 2.42M/2.42M [00:00<00:00, 3.12MB/s]
Downloading: 100%|██████████| 2.54k/2.54k [00:00<00:00, 2.62MB/s]
WARNING - Overwriting /.sagemaker/mms/models/google__flan-t5-xxl ...
Collecting transformers==4.26.0 Downloading transformers-4.26.0-py3-none-any.whl (6.3 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 65.9 MB/s eta 0:00:00
Requirement already satisfied: requests in /opt/conda/lib/python3.8/site-packages (from transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (2.28.1)
Collecting huggingface-hub<1.0,>=0.11.0 Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 190.3/190.3 kB 46.0 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.17 in /opt/conda/lib/python3.8/site-packages (from transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (1.23.3)
Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /opt/conda/lib/python3.8/site-packages (from transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (0.13.0)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.8/site-packages (from transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (21.3)
Requirement already satisfied: tqdm>=4.27 in /opt/conda/lib/python3.8/site-packages (from transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (4.64.1)
Requirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.8/site-packages (from transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (6.0)
Requirement already satisfied: filelock in /opt/conda/lib/python3.8/site-packages (from transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (3.8.0)
Requirement already satisfied: regex!=2019.12.17 in /opt/conda/lib/python3.8/site-packages (from transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (2022.9.13)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /opt/conda/lib/python3.8/site-packages (from huggingface-hub<1.0,>=0.11.0->transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (4.3.0)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/lib/python3.8/site-packages (from packaging>=20.0->transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (3.0.9)
Requirement already satisfied: charset-normalizer<3,>=2 in /opt/conda/lib/python3.8/site-packages (from requests->transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (2.0.12)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.8/site-packages (from requests->transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (3.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.8/site-packages (from requests->transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (1.26.11)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.8/site-packages (from requests->transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (2022.9.24)
Installing collected packages: huggingface-hub, transformers
  Attempting uninstall: huggingface-hub
    Found existing installation: huggingface-hub 0.10.0
    Uninstalling huggingface-hub-0.10.0:
      Successfully uninstalled huggingface-hub-0.10.0
  Attempting uninstall: transformers
    Found existing installation: transformers 4.17.0
    Uninstalling transformers-4.17.0:
      Successfully uninstalled transformers-4.17.0
Successfully installed huggingface-hub-0.12.0 transformers-4.26.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[notice] A new release of pip available: 22.2.2 -> 23.0
[notice] To update, run: pip install --upgrade pip
Warning: MMS is using non-default JVM parameters: -XX:-UseContainerSupport
2023-02-01T15:46:06,090 [INFO ] main com.amazonaws.ml.mms.ModelServer -
MMS Home: /opt/conda/lib/python3.8/site-packages
Current directory: /
Temp directory: /home/model-server/tmp
Number of GPUs: 0
Number of CPUs: 4
Max heap size: 3461 M
Python executable: /opt/conda/bin/python3.8
Config file: /etc/sagemaker-mms.properties
Inference address: http://0.0.0.0:8080
Management address: http://0.0.0.0:8080
Model Store: /.sagemaker/mms/models
Initial Models: ALL
Log dir: null
Metrics dir: null
Netty threads: 0
Netty client threads: 0
Default workers per model: 4
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Preload model: false
Prefer direct buffer: false
2023-02-01T15:46:06,140 [WARN ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-9000-google__flan-t5-xxl
2023-02-01T15:46:06,204 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - model_service_worker started with args: --sock-type unix --sock-name /home/model-server/tmp/.mms.sock.9000 --handler sagemaker_huggingface_inference_toolkit.handler_service --model-path /.sagemaker/mms/models/google__flan-t5-xxl --model-name google__flan-t5-xxl --preload-model false --tmp-dir /home/model-server/tmp
2023-02-01T15:46:06,205 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Listening on port: /home/model-server/tmp/.mms.sock.9000
2023-02-01T15:46:06,205 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - [PID] 47
2023-02-01T15:46:06,206 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - MMS worker started.
2023-02-01T15:46:06,206 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Python runtime: 3.8.10
2023-02-01T15:46:06,206 [INFO ] main com.amazonaws.ml.mms.wlm.ModelManager - Model google__flan-t5-xxl loaded.
2023-02-01T15:46:06,210 [INFO ] main com.amazonaws.ml.mms.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2023-02-01T15:46:06,218 [INFO ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.mms.sock.9000
2023-02-01T15:46:06,218 [INFO ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.mms.sock.9000
2023-02-01T15:46:06,219 [INFO ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.mms.sock.9000
2023-02-01T15:46:06,226 [INFO ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.mms.sock.9000
2023-02-01T15:46:06,278 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9000.
2023-02-01T15:46:06,281 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9000.
2023-02-01T15:46:06,284 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9000.
2023-02-01T15:46:06,290 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9000.
2023-02-01T15:46:06,298 [INFO ] main com.amazonaws.ml.mms.ModelServer - Inference API bind to: http://0.0.0.0:8080
Model server started.
2023-02-01T15:46:06,302 [WARN ] pool-3-thread-1 com.amazonaws.ml.mms.metrics.MetricCollector - worker pid is not available yet.
2023-02-01T15:46:08,478 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Model google__flan-t5-xxl loaded io_fd=3abd6afffe6261f4-0000001d-00000000-084f36d4c5a81b10-639dfd41
2023-02-01T15:46:08,491 [INFO ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 2081
2023-02-01T15:46:08,493 [WARN ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-google__flan-t5-xxl-1
2023-02-01T15:46:08,499 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Model google__flan-t5-xxl loaded io_fd=3abd6afffe6261f4-0000001d-00000001-c96df6d4c5a81b10-276a10eb
2023-02-01T15:46:08,500 [INFO ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 2089
2023-02-01T15:46:08,500 [WARN ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-google__flan-t5-xxl-3
2023-02-01T15:46:08,512 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Model google__flan-t5-xxl loaded io_fd=3abd6afffe6261f4-0000001d-00000004-12e7f154c5a81b12-fe262c46
2023-02-01T15:46:08,512 [INFO ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 2101
2023-02-01T15:46:08,513 [WARN ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-google__flan-t5-xxl-4
2023-02-01T15:46:08,561 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Model google__flan-t5-xxl loaded io_fd=3abd6afffe6261f4-0000001d-00000003-6582f154c5a81b12-273338b8
2023-02-01T15:46:08,561 [INFO ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 2150
2023-02-01T15:46:08,561 [WARN ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-google__flan-t5-xxl-2
2023-02-01T15:46:10,450 [INFO ] pool-2-thread-6 ACCESS_LOG - /169.254.178.2:59002 "GET /ping HTTP/1.1" 200 7
2023-02-01T15:46:15,412 [INFO ] pool-2-thread-6 ACCESS_LOG - /169.254.178.2:59002 "GET /ping HTTP/1.1" 200 0
2023-02-01T15:46:20,411 [INFO ] pool-2-thread-6 ACCESS_LOG - /169.254.178.2:59002 "GET /ping HTTP/1.1" 200 0
But I got the same error when trying to run inference:
botocore.errorfactory.ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
"code": 400,
"type": "InternalServerException",
"message": "Could not load model /.sagemaker/mms/models/google__flan-t5-xxl with any of the following classes: (\u003cclass \u0027transformers.models.auto.modeling_auto.AutoModelForSeq2SeqLM\u0027\u003e, \u003cclass \u0027transformers.models.t5.modeling_t5.T5ForConditionalGeneration\u0027\u003e)."
}
AWS logs:
2023-02-01T15:49:59,831 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Prediction error
2023-02-01T15:49:59,832 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Traceback (most recent call last):
2023-02-01T15:49:59,832 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 219, in handle
2023-02-01T15:49:59,832 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - self.initialize(context)
2023-02-01T15:49:59,832 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 77, in initialize
2023-02-01T15:49:59,832 [INFO ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 1
2023-02-01T15:49:59,833 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - self.model = self.load(self.model_dir)
2023-02-01T15:49:59,833 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 104, in load
2023-02-01T15:49:59,833 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - hf_pipeline = get_pipeline(task=os.environ["HF_TASK"], model_dir=model_dir, device=self.device)
2023-02-01T15:49:59,833 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/transformers_utils.py", line 272, in get_pipeline
2023-02-01T15:49:59,833 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - hf_pipeline = pipeline(task=task, model=model_dir, device=device, **kwargs)
2023-02-01T15:49:59,834 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.8/site-packages/transformers/pipelines/__init__.py", line 754, in pipeline
2023-02-01T15:49:59,834 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - framework, model = infer_framework_load_model(
2023-02-01T15:49:59,834 [INFO ] W-9000-google__flan-t5-xxl ACCESS_LOG - /169.254.178.2:59002 "POST /invocations HTTP/1.1" 400 13
2023-02-01T15:49:59,834 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.8/site-packages/transformers/pipelines/base.py", line 266, in infer_framework_load_model
2023-02-01T15:49:59,834 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - raise ValueError(f"Could not load model {model} with any of the following classes: {class_tuple}.")
2023-02-01T15:49:59,835 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ValueError: Could not load model /.sagemaker/mms/models/google__flan-t5-xxl with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForSeq2SeqLM'>, <class 'transformers.models.t5.modeling_t5.T5ForConditionalGeneration'>).
2023-02-01T15:49:59,835 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -
2023-02-01T15:49:59,835 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - During handling of the above exception, another exception occurred:
2023-02-01T15:49:59,835 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -
2023-02-01T15:49:59,836 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Traceback (most recent call last):
2023-02-01T15:49:59,836 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.8/site-packages/mms/service.py", line 108, in predict
2023-02-01T15:49:59,836 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ret = self._entry_point(input_batch, self.context)
2023-02-01T15:49:59,836 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 243, in handle
2023-02-01T15:49:59,836 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - raise PredictionException(str(e), 400)
2023-02-01T15:49:59,837 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - mms.service.PredictionException: Could not load model /.sagemaker/mms/models/google__flan-t5-xxl with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForSeq2SeqLM'>, <class 'transformers.models.t5.modeling_t5.T5ForConditionalGeneration'>). : 400
Hello @valentinboyanov
I can see in your script that:
HuggingFaceModel(
transformers_version="4.17.0",
pytorch_version="1.10.2",
py_version="py38",
model_data="s3://sagemaker-eu-north-1-***/model.tar.gz",
env=hub,
role=role,
)
Can you update transformers_version with the correct value? I suspect this is causing the issue.
@younesbelkada if I change it, I'm unable to deploy at all:
raise ValueError(
ValueError: Unsupported huggingface version: 4.26.0. You may need to upgrade your SDK version (pip install -U sagemaker) for newer huggingface versions. Supported huggingface version(s): 4.6.1, 4.10.2, 4.11.0, 4.12.3, 4.17.0, 4.6, 4.10, 4.11, 4.12, 4.17.
This is why I've followed the instructions by Heiko Hotz (marshmellow77) in this comment to provide a requirements.txt file that lets me specify the dependencies I want installed in the container.
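One way to check whether the installed sagemaker SDK can resolve a newer Hugging Face inference image at all is a quick image-URI lookup; a hedged sketch, where the version, region and instance values are illustrative and retrieve() raises if the combination isn't known to your SDK:

```python
from sagemaker import image_uris

# Ask the installed SDK for a Hugging Face *inference* image URI.
# If this raises, the SDK is too old for that transformers version
# and needs to be upgraded (pip install -U sagemaker) first.
uri = image_uris.retrieve(
    framework="huggingface",
    region="eu-north-1",
    version="4.26.0",                      # transformers version (illustrative)
    base_framework_version="pytorch1.13.1",
    py_version="py39",
    instance_type="ml.m5.xlarge",
    image_scope="inference",
)
print(uri)
```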
@valentinboyanov What is the content of your model_data="s3://sagemaker-eu-north-1-***/model.tar.gz"? Could you please share the folder structure?
@philschmid yes, here it goes:
➜ model tree .
.
└── code
└── requirements.txt
1 directory, 1 file
➜ model cat code/requirements.txt
transformers==4.26.0
When you provide the model_data keyword you also have to include an inference.py and the model weights.
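For reference, a minimal sketch of what such an inference.py could look like; this is an illustration only, assuming the standard model_fn/predict_fn hooks of the SageMaker Hugging Face Inference Toolkit and that the model weights are packed next to it in the archive:

```python
# code/inference.py -- illustrative sketch, not an official example
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


def model_fn(model_dir):
    # model_dir is the directory where SageMaker extracted model.tar.gz
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)
    return model, tokenizer


def predict_fn(data, model_and_tokenizer):
    # data is the deserialized request payload, e.g. {"inputs": "..."}
    model, tokenizer = model_and_tokenizer
    inputs = tokenizer(data["inputs"], return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=64)
    return {"generated_text": tokenizer.decode(output_ids[0], skip_special_tokens=True)}
```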
@philschmid what should be the contents of the inference.py in case of the flan-t5-xl model? Can this be an empty file if I don't intend to change anything from the hub model? There doesn't seem to be such a file included within the Hugging Face repository.
@valentinboyanov I can confirm I'm getting the same as well. From the CloudWatch logs it seems that 4.17.0 is uninstalled and replaced with the newer version specified in the requirements.txt file.
I'm having the same "Could not load model ... with any of the following classes" error (AutoModelForSeq2SeqLM and T5ForConditionalGeneration) when using Docker for inference with a flan-t5-xxl-sharded-fp16 model.
The code works without Docker, but if I build and run docker run --gpus all -p 7080:7080 flan-t5-xxl-sharded-fp16:latest, the error is the following:
[2023-02-05 21:33:53 +0000] [1] [INFO] Starting gunicorn 20.1.0
[2023-02-05 21:33:53 +0000] [1] [INFO] Listening at: http://0.0.0.0:7080 (1)
[2023-02-05 21:33:53 +0000] [1] [INFO] Using worker: uvicorn.workers.UvicornWorker
[2023-02-05 21:33:53 +0000] [7] [INFO] Booting worker with pid: 7
[2023-02-05 21:34:01 +0000] [7] [INFO] Is CUDA available: True
[2023-02-05 21:34:01 +0000] [7] [INFO] CUDA device: NVIDIA A100-SXM4-40GB
[2023-02-05 21:34:01 +0000] [7] [INFO] Loading model
[2023-02-05 21:34:02 +0000] [7] [ERROR] Exception in worker process
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/gunicorn/arbiter.py", line 589, in spawn_worker
worker.init_process()
File "/usr/local/lib/python3.9/site-packages/uvicorn/workers.py", line 66, in init_process
super(UvicornWorker, self).init_process()
File "/usr/local/lib/python3.9/site-packages/gunicorn/workers/base.py", line 134, in init_process
self.load_wsgi()
File "/usr/local/lib/python3.9/site-packages/gunicorn/workers/base.py", line 146, in load_wsgi
self.wsgi = self.app.wsgi()
File "/usr/local/lib/python3.9/site-packages/gunicorn/app/base.py", line 67, in wsgi
self.callable = self.load()
File "/usr/local/lib/python3.9/site-packages/gunicorn/app/wsgiapp.py", line 58, in load
return self.load_wsgiapp()
File "/usr/local/lib/python3.9/site-packages/gunicorn/app/wsgiapp.py", line 48, in load_wsgiapp
return util.import_app(self.app_uri)
File "/usr/local/lib/python3.9/site-packages/gunicorn/util.py", line 359, in import_app
mod = importlib.import_module(module)
File "/usr/local/lib/python3.9/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 850, in exec_module
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "/app/main.py", line 29, in <module>
pipe_flan = pipeline("text2text-generation", model="../flan-t5-xxl-sharded-fp16", model_kwargs={"load_in_8bit":True, "device_map": "auto"})
File "/usr/local/lib/python3.9/site-packages/transformers/pipelines/__init__.py", line 754, in pipeline
framework, model = infer_framework_load_model(
File "/usr/local/lib/python3.9/site-packages/transformers/pipelines/base.py", line 266, in infer_framework_load_model
raise ValueError(f"Could not load model {model} with any of the following classes: {class_tuple}.")
ValueError: Could not load model ../flan-t5-xxl-sharded-fp16 with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForSeq2SeqLM'>, <class 'transformers.models.t5.modeling_t5.T5ForConditionalGeneration'>).
[2023-02-05 21:34:02 +0000] [7] [INFO] Worker exiting (pid: 7)
[2023-02-05 21:34:04 +0000] [1] [INFO] Shutting down: Master
[2023-02-05 21:34:04 +0000] [1] [INFO] Reason: Worker failed to boot.
The Dockerfile is the following:
FROM tiangolo/uvicorn-gunicorn-fastapi:python3.9
# install dependencies
RUN python3 -m pip install --upgrade pip
RUN pip3 install torch==1.13.0 transformers==4.26.0 sentencepiece torchvision torchaudio accelerate==0.15.0 bitsandbytes-cuda113
COPY ./app /app
COPY ./flan-t5-xxl-sharded-fp16/ /flan-t5-xxl-sharded-fp16
EXPOSE 7080
# Start the app
CMD ["gunicorn", "-b", "0.0.0.0:7080", "main:app","--workers","1","--timeout","180","-k","uvicorn.workers.UvicornWorker"]
The code of app/main.py is the following:
from fastapi import FastAPI, Request
from fastapi.logger import logger
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, T5ForConditionalGeneration
import json
import logging
import numpy as np
import os
import torch
from transformers import pipeline
app = FastAPI()
gunicorn_logger = logging.getLogger('gunicorn.error')
logger.handlers = gunicorn_logger.handlers
if __name__ != "main":
logger.setLevel(gunicorn_logger.level)
else:
logger.setLevel(logging.INFO)
logger.info(f"Is CUDA available: {torch.cuda.is_available()}")
logger.info(f"CUDA device: {torch.cuda.get_device_name(torch.cuda.current_device())}")
logger.info("Loading model")
# error is in this line
pipe_flan = pipeline("text2text-generation", model="../flan-t5-xxl-sharded-fp16", model_kwargs={"load_in_8bit":True, "device_map": "auto"})
# extra code removed
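As a way to narrow this down inside the container, one could try loading the model directly rather than through pipeline(), so the real loading error surfaces instead of the generic "Could not load model" ValueError; a minimal sketch, assuming the same local path and the same 8-bit settings:

```python
# Hypothetical debugging snippet: load the sharded checkpoint directly.
# Any missing-file, bitsandbytes, or CUDA problem will show up here as the
# original exception rather than being swallowed by pipeline().
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(
    "../flan-t5-xxl-sharded-fp16",
    device_map="auto",
    load_in_8bit=True,
)
print(model.config.model_type)
```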
@philschmid @younesbelkada just wanted to follow up on this.
@RonLek I am planning to create an example. I'll post it here once it is ready.
@RonLek done: https://www.philschmid.de/deploy-flan-t5-sagemaker
This works! Thanks a ton @philschmid for the prompt response :rocket:
@philschmid just curious. Would there be a similar sharded model repo for flan-t5-xl?
If you check this blog post: https://www.philschmid.de/deploy-t5-11b there is a code snippet on how to do this for t5-11b:
import torch
from transformers import AutoModelWithLMHead
from huggingface_hub import HfApi
# load model as float16
model = AutoModelWithLMHead.from_pretrained("t5-11b", torch_dtype=torch.float16, low_cpu_mem_usage=True)
# shard the model and save it locally (pushing to the Hub is a separate step)
model.save_pretrained("sharded", max_shard_size="2000MB")
Thanks! This worked :fire:
@philschmid thanks for the guidance here. While deploying your solution on SageMaker I noticed that it works great on g5 instances but not on p3 instances (p3.8xlarge). Also, do we know when the direct deploy from the HF hub would work out of the box? Error below:
The model fails to load because the required bitsandbytes library reports "The installed version of bitsandbytes was compiled without GPU support." on the p3 instance, which leads to the following error when you invoke the model:
2023-02-25T01:24:28,714 [INFO ] W-model-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - mms.service.PredictionException: 'NoneType' object has no attribute 'cget_col_row_stats' : 400
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.