ModelError while deploying FlanT5-xl
System Info
transformers_version==4.17.0, Platform = SageMaker Notebook, python==3.9.0
Who can help?
@ArthurZucker @younesbelkada
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Amazon SageMaker deployment script in AWS for flan-t5-xl:
from sagemaker.huggingface import HuggingFaceModel
import sagemaker
role = sagemaker.get_execution_role()
# Hub Model configuration. https://huggingface.co/models
hub = {
'HF_MODEL_ID':'google/flan-t5-xl',
'HF_TASK':'text2text-generation'
}
# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
transformers_version='4.17.0',
pytorch_version='1.10.2',
py_version='py38',
env=hub,
role=role,
)
# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
initial_instance_count=1, # number of instances
instance_type='ml.m5.xlarge' # ec2 instance type
)
predictor.predict({
'inputs': "The answer to the universe is"
})
Results in
---------------------------------------------------------------------------
ModelError Traceback (most recent call last)
/tmp/ipykernel_20116/1338286066.py in <cell line: 26>()
24 )
25
---> 26 predictor.predict({
27 'inputs': "The answer to the universe is"
28 })
~/anaconda3/envs/python3/lib/python3.10/site-packages/sagemaker/predictor.py in predict(self, data, initial_args, target_model, target_variant, inference_id)
159 data, initial_args, target_model, target_variant, inference_id
160 )
--> 161 response = self.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(**request_args)
162 return self._handle_response(response)
163
~/anaconda3/envs/python3/lib/python3.10/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
528 )
529 # The "self" in this scope is referring to the BaseClient.
--> 530 return self._make_api_call(operation_name, kwargs)
531
532 _api_call.__name__ = str(py_operation_name)
~/anaconda3/envs/python3/lib/python3.10/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
958 error_code = parsed_response.get("Error", {}).get("Code")
959 error_class = self.exceptions.from_code(error_code)
--> 960 raise error_class(parsed_response, operation_name)
961 else:
962 return parsed_response
ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
"code": 400,
"type": "InternalServerException",
"message": "Could not load model /.sagemaker/mms/models/google__flan-t5-xl with any of the following classes: (\u003cclass \u0027transformers.models.auto.modeling_auto.AutoModelForSeq2SeqLM\u0027\u003e, \u003cclass \u0027transformers.models.t5.modeling_t5.T5ForConditionalGeneration\u0027\u003e)."
}
"
From an existing issue, I suspected this might be due to the use of transformers==4.17.0; however, when I use the exact same script to deploy the flan-t5-large model, it works without any issues.
Expected behavior
The model should get deployed on AWS Sagemaker without any issues.
Hello @RonLek
Thanks for the issue!
Note that starting from flan-t5-xl, the weights of the model are sharded.
Loading sharded weights is only supported in releases after transformers==4.17.0 (it was introduced in transformers==4.18.0: https://github.com/huggingface/transformers/releases/tag/v4.18.0), so I think the fix should be updating the transformers version to a more recent one, e.g. 4.26.0 or 4.25.0.
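For reference, a minimal sketch of what the updated deployment could look like, assuming an up-to-date sagemaker SDK whose Hugging Face inference DLCs include a transformers release >= 4.18 (the exact transformers/pytorch/py version combination and the instance type below are assumptions, adjust them to what your SDK and region actually offer):

```python
from sagemaker.huggingface import HuggingFaceModel
import sagemaker

role = sagemaker.get_execution_role()

# Hub model configuration, unchanged from the original script
hub = {"HF_MODEL_ID": "google/flan-t5-xl", "HF_TASK": "text2text-generation"}

# assumption: this version combination is available as a Hugging Face
# inference DLC in your sagemaker SDK/region
huggingface_model = HuggingFaceModel(
    transformers_version="4.26.0",
    pytorch_version="1.13.1",
    py_version="py39",
    env=hub,
    role=role,
)

# flan-t5-xl is considerably larger than flan-t5-large, so a bigger
# (ideally GPU) instance than ml.m5.xlarge is likely needed
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
)

print(predictor.predict({"inputs": "The answer to the universe is"}))
```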
Hi @younesbelkada and @RonLek ! I have the same issue deploying google/flan-t5-xxl on SageMaker.
I've tried to update to transformers==4.26.0 by providing code/requirements.txt through s3://sagemaker-eu-north-1-***/model.tar.gz:
# Hub Model configuration. https://huggingface.co/models
hub: dict = {"HF_MODEL_ID": "google/flan-t5-xxl", "HF_TASK": "text2text-generation"}
# Create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
transformers_version="4.17.0",
pytorch_version="1.10.2",
py_version="py38",
model_data="s3://sagemaker-eu-north-1-***/model.tar.gz",
env=hub,
role=role,
)
Observing the AWS logs I can see that transformers==4.26.0 was installed:
This is an experimental beta features, which allows downloading model from the Hugging Face Hub on start up. It loads the model defined in the env var `HF_MODEL_ID`
/opt/conda/lib/python3.8/site-packages/huggingface_hub/file_download.py:588: FutureWarning: `cached_download` is the legacy way to download files from the HF hub, please consider upgrading to `hf_hub_download` warnings.warn(
Downloading: 100%|██████████| 11.0k/11.0k [00:00<00:00, 5.49MB/s]
Downloading: 100%|██████████| 674/674 [00:00<00:00, 663kB/s]
Downloading: 100%|██████████| 2.20k/2.20k [00:00<00:00, 2.24MB/s]
Downloading: 100%|██████████| 792k/792k [00:00<00:00, 43.5MB/s]
Downloading: 100%|██████████| 2.42M/2.42M [00:00<00:00, 3.12MB/s]
Downloading: 100%|██████████| 2.54k/2.54k [00:00<00:00, 2.62MB/s]
WARNING - Overwriting /.sagemaker/mms/models/google__flan-t5-xxl ...
Collecting transformers==4.26.0 Downloading transformers-4.26.0-py3-none-any.whl (6.3 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 65.9 MB/s eta 0:00:00
Requirement already satisfied: requests in /opt/conda/lib/python3.8/site-packages (from transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (2.28.1)
Collecting huggingface-hub<1.0,>=0.11.0 Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 190.3/190.3 kB 46.0 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.17 in /opt/conda/lib/python3.8/site-packages (from transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (1.23.3)
Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /opt/conda/lib/python3.8/site-packages (from transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (0.13.0)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.8/site-packages (from transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (21.3)
Requirement already satisfied: tqdm>=4.27 in /opt/conda/lib/python3.8/site-packages (from transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (4.64.1)
Requirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.8/site-packages (from transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (6.0)
Requirement already satisfied: filelock in /opt/conda/lib/python3.8/site-packages (from transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (3.8.0)
Requirement already satisfied: regex!=2019.12.17 in /opt/conda/lib/python3.8/site-packages (from transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (2022.9.13)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /opt/conda/lib/python3.8/site-packages (from huggingface-hub<1.0,>=0.11.0->transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (4.3.0)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/lib/python3.8/site-packages (from packaging>=20.0->transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (3.0.9)
Requirement already satisfied: charset-normalizer<3,>=2 in /opt/conda/lib/python3.8/site-packages (from requests->transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (2.0.12)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.8/site-packages (from requests->transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (3.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.8/site-packages (from requests->transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (1.26.11)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.8/site-packages (from requests->transformers==4.26.0->-r /opt/ml/model/code/requirements.txt (line 1)) (2022.9.24)
Installing collected packages: huggingface-hub, transformers
  Attempting uninstall: huggingface-hub
    Found existing installation: huggingface-hub 0.10.0
    Uninstalling huggingface-hub-0.10.0:
      Successfully uninstalled huggingface-hub-0.10.0
  Attempting uninstall: transformers
    Found existing installation: transformers 4.17.0
    Uninstalling transformers-4.17.0:
      Successfully uninstalled transformers-4.17.0
Successfully installed huggingface-hub-0.12.0 transformers-4.26.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[notice] A new release of pip available: 22.2.2 -> 23.0
[notice] To update, run: pip install --upgrade pip
Warning: MMS is using non-default JVM parameters: -XX:-UseContainerSupport
2023-02-01T15:46:06,090 [INFO ] main com.amazonaws.ml.mms.ModelServer -
MMS Home: /opt/conda/lib/python3.8/site-packages
Current directory: /
Temp directory: /home/model-server/tmp
Number of GPUs: 0
Number of CPUs: 4
Max heap size: 3461 M
Python executable: /opt/conda/bin/python3.8
Config file: /etc/sagemaker-mms.properties
Inference address: http://0.0.0.0:8080
Management address: http://0.0.0.0:8080
Model Store: /.sagemaker/mms/models
Initial Models: ALL
Log dir: null
Metrics dir: null
Netty threads: 0
Netty client threads: 0
Default workers per model: 4
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Preload model: false
Prefer direct buffer: false
2023-02-01T15:46:06,140 [WARN ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-9000-google__flan-t5-xxl
2023-02-01T15:46:06,204 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - model_service_worker started with args: --sock-type unix --sock-name /home/model-server/tmp/.mms.sock.9000 --handler sagemaker_huggingface_inference_toolkit.handler_service --model-path /.sagemaker/mms/models/google__flan-t5-xxl --model-name google__flan-t5-xxl --preload-model false --tmp-dir /home/model-server/tmp
2023-02-01T15:46:06,205 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Listening on port: /home/model-server/tmp/.mms.sock.9000
2023-02-01T15:46:06,205 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - [PID] 47
2023-02-01T15:46:06,206 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - MMS worker started.
2023-02-01T15:46:06,206 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Python runtime: 3.8.10
2023-02-01T15:46:06,206 [INFO ] main com.amazonaws.ml.mms.wlm.ModelManager - Model google__flan-t5-xxl loaded.
2023-02-01T15:46:06,210 [INFO ] main com.amazonaws.ml.mms.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2023-02-01T15:46:06,218 [INFO ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.mms.sock.9000
2023-02-01T15:46:06,218 [INFO ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.mms.sock.9000
2023-02-01T15:46:06,219 [INFO ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.mms.sock.9000
2023-02-01T15:46:06,226 [INFO ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.mms.sock.9000
2023-02-01T15:46:06,278 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9000.
2023-02-01T15:46:06,281 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9000.
2023-02-01T15:46:06,284 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9000.
2023-02-01T15:46:06,290 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9000.
2023-02-01T15:46:06,298 [INFO ] main com.amazonaws.ml.mms.ModelServer - Inference API bind to: http://0.0.0.0:8080
Model server started.
2023-02-01T15:46:06,302 [WARN ] pool-3-thread-1 com.amazonaws.ml.mms.metrics.MetricCollector - worker pid is not available yet.
2023-02-01T15:46:08,478 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Model google__flan-t5-xxl loaded io_fd=3abd6afffe6261f4-0000001d-00000000-084f36d4c5a81b10-639dfd41
2023-02-01T15:46:08,491 [INFO ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 2081
2023-02-01T15:46:08,493 [WARN ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-google__flan-t5-xxl-1
2023-02-01T15:46:08,499 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Model google__flan-t5-xxl loaded io_fd=3abd6afffe6261f4-0000001d-00000001-c96df6d4c5a81b10-276a10eb
2023-02-01T15:46:08,500 [INFO ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 2089
2023-02-01T15:46:08,500 [WARN ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-google__flan-t5-xxl-3
2023-02-01T15:46:08,512 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Model google__flan-t5-xxl loaded io_fd=3abd6afffe6261f4-0000001d-00000004-12e7f154c5a81b12-fe262c46
2023-02-01T15:46:08,512 [INFO ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 2101
2023-02-01T15:46:08,513 [WARN ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-google__flan-t5-xxl-4
2023-02-01T15:46:08,561 [INFO ] W-9000-google__flan-t5-xxl-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Model google__flan-t5-xxl loaded io_fd=3abd6afffe6261f4-0000001d-00000003-6582f154c5a81b12-273338b8
2023-02-01T15:46:08,561 [INFO ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 2150
2023-02-01T15:46:08,561 [WARN ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-google__flan-t5-xxl-2
2023-02-01T15:46:10,450 [INFO ] pool-2-thread-6 ACCESS_LOG - /169.254.178.2:59002 "GET /ping HTTP/1.1" 200 7
2023-02-01T15:46:15,412 [INFO ] pool-2-thread-6 ACCESS_LOG - /169.254.178.2:59002 "GET /ping HTTP/1.1" 200 0
2023-02-01T15:46:20,411 [INFO ] pool-2-thread-6 ACCESS_LOG - /169.254.178.2:59002 "GET /ping HTTP/1.1" 200 0
But I got the same error when trying to run inference:
botocore.errorfactory.ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
"code": 400,
"type": "InternalServerException",
"message": "Could not load model /.sagemaker/mms/models/google__flan-t5-xxl with any of the following classes: (\u003cclass \u0027transformers.models.auto.modeling_auto.AutoModelForSeq2SeqLM\u0027\u003e, \u003cclass \u0027transformers.models.t5.modeling_t5.T5ForConditionalGeneration\u0027\u003e)."
}
AWS logs:
2023-02-01T15:49:59,831 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Prediction error
2023-02-01T15:49:59,832 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Traceback (most recent call last):
2023-02-01T15:49:59,832 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 219, in handle
2023-02-01T15:49:59,832 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - self.initialize(context)
2023-02-01T15:49:59,832 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 77, in initialize
2023-02-01T15:49:59,832 [INFO ] W-9000-google__flan-t5-xxl com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 1
2023-02-01T15:49:59,833 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - self.model = self.load(self.model_dir)
2023-02-01T15:49:59,833 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 104, in load
2023-02-01T15:49:59,833 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - hf_pipeline = get_pipeline(task=os.environ["HF_TASK"], model_dir=model_dir, device=self.device)
2023-02-01T15:49:59,833 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/transformers_utils.py", line 272, in get_pipeline
2023-02-01T15:49:59,833 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - hf_pipeline = pipeline(task=task, model=model_dir, device=device, **kwargs)
2023-02-01T15:49:59,834 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.8/site-packages/transformers/pipelines/__init__.py", line 754, in pipeline
2023-02-01T15:49:59,834 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - framework, model = infer_framework_load_model(
2023-02-01T15:49:59,834 [INFO ] W-9000-google__flan-t5-xxl ACCESS_LOG - /169.254.178.2:59002 "POST /invocations HTTP/1.1" 400 13
2023-02-01T15:49:59,834 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.8/site-packages/transformers/pipelines/base.py", line 266, in infer_framework_load_model
2023-02-01T15:49:59,834 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - raise ValueError(f"Could not load model {model} with any of the following classes: {class_tuple}.")
2023-02-01T15:49:59,835 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ValueError: Could not load model /.sagemaker/mms/models/google__flan-t5-xxl with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForSeq2SeqLM'>, <class 'transformers.models.t5.modeling_t5.T5ForConditionalGeneration'>).
2023-02-01T15:49:59,835 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -
2023-02-01T15:49:59,835 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - During handling of the above exception, another exception occurred:
2023-02-01T15:49:59,835 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -
2023-02-01T15:49:59,836 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Traceback (most recent call last):
2023-02-01T15:49:59,836 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.8/site-packages/mms/service.py", line 108, in predict
2023-02-01T15:49:59,836 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ret = self._entry_point(input_batch, self.context)
2023-02-01T15:49:59,836 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 243, in handle
2023-02-01T15:49:59,836 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - raise PredictionException(str(e), 400)
2023-02-01T15:49:59,837 [INFO ] W-google__flan-t5-xxl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - mms.service.PredictionException: Could not load model /.sagemaker/mms/models/google__flan-t5-xxl with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForSeq2SeqLM'>, <class 'transformers.models.t5.modeling_t5.T5ForConditionalGeneration'>). : 400
Hello @valentinboyanov
I can see in your script that:
HuggingFaceModel(
transformers_version="4.17.0",
pytorch_version="1.10.2",
py_version="py38",
model_data="s3://sagemaker-eu-north-1-***/model.tar.gz",
env=hub,
role=role,
)
Can you update transformers_version with the correct value? I suspect this is causing the issue.
@younesbelkada if I change it, I'm unable to deploy at all:
raise ValueError(
ValueError: Unsupported huggingface version: 4.26.0. You may need to upgrade your SDK version (pip install -U sagemaker) for newer huggingface versions. Supported huggingface version(s): 4.6.1, 4.10.2, 4.11.0, 4.12.3, 4.17.0, 4.6, 4.10, 4.11, 4.12, 4.17.
This is why I've followed the instructions by Heiko Hotz (marshmellow77) in this comment to provide a requirements.txt file that lets me specify the dependencies I want installed in the container.
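One way to check whether the installed sagemaker SDK can resolve a newer Hugging Face inference image at all is a quick image-URI lookup; a hedged sketch, where the version, region and instance values are illustrative and retrieve() raises if the combination isn't known to your SDK:

```python
from sagemaker import image_uris

# Ask the installed SDK for a Hugging Face *inference* image URI.
# If this raises, the SDK is too old for that transformers version
# and needs to be upgraded (pip install -U sagemaker) first.
uri = image_uris.retrieve(
    framework="huggingface",
    region="eu-north-1",
    version="4.26.0",                      # transformers version (illustrative)
    base_framework_version="pytorch1.13.1",
    py_version="py39",
    instance_type="ml.m5.xlarge",
    image_scope="inference",
)
print(uri)
```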
@valentinboyanov What is the content of your model_data="s3://sagemaker-eu-north-1-***/model.tar.gz"? Could you please share the folder structure?
@philschmid yes, here it goes:
➜ model tree .
.
└── code
└── requirements.txt
1 directory, 1 file
➜ model cat code/requirements.txt
transformers==4.26.0
When you provide the model_data keyword you also have to include an inference.py and the model weights.
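For reference, a minimal sketch of what such an inference.py could look like; this is an illustration only, assuming the standard model_fn/predict_fn hooks of the SageMaker Hugging Face Inference Toolkit and that the model weights are packed next to it in the archive:

```python
# code/inference.py -- illustrative sketch, not an official example
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


def model_fn(model_dir):
    # model_dir is the directory where SageMaker extracted model.tar.gz
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)
    return model, tokenizer


def predict_fn(data, model_and_tokenizer):
    # data is the deserialized request payload, e.g. {"inputs": "..."}
    model, tokenizer = model_and_tokenizer
    inputs = tokenizer(data["inputs"], return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=64)
    return {"generated_text": tokenizer.decode(output_ids[0], skip_special_tokens=True)}
```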
@philschmid what should be the contents of the inference.py in case of the flan-t5-xl model? Can this be an empty file if I don't intend to change anything from the hub model? There doesn't seem to be such a file included within the Hugging Face repository.
@valentinboyanov I can confirm I'm getting the same as well. From the CloudWatch logs it seems that 4.17.0 is uninstalled and replaced with the newer version specified in the requirements.txt file.
I'm having the same "Could not load model ... with any of the following classes" error (AutoModelForSeq2SeqLM and T5ForConditionalGeneration) when using Docker for inference with a flan-t5-xxl-sharded-fp16 model.
The code works without Docker, but if I build and run docker run --gpus all -p 7080:7080 flan-t5-xxl-sharded-fp16:latest, the error is the following:
[2023-02-05 21:33:53 +0000] [1] [INFO] Starting gunicorn 20.1.0
[2023-02-05 21:33:53 +0000] [1] [INFO] Listening at: http://0.0.0.0:7080 (1)
[2023-02-05 21:33:53 +0000] [1] [INFO] Using worker: uvicorn.workers.UvicornWorker
[2023-02-05 21:33:53 +0000] [7] [INFO] Booting worker with pid: 7
[2023-02-05 21:34:01 +0000] [7] [INFO] Is CUDA available: True
[2023-02-05 21:34:01 +0000] [7] [INFO] CUDA device: NVIDIA A100-SXM4-40GB
[2023-02-05 21:34:01 +0000] [7] [INFO] Loading model
[2023-02-05 21:34:02 +0000] [7] [ERROR] Exception in worker process
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/gunicorn/arbiter.py", line 589, in spawn_worker
worker.init_process()
File "/usr/local/lib/python3.9/site-packages/uvicorn/workers.py", line 66, in init_process
super(UvicornWorker, self).init_process()
File "/usr/local/lib/python3.9/site-packages/gunicorn/workers/base.py", line 134, in init_process
self.load_wsgi()
File "/usr/local/lib/python3.9/site-packages/gunicorn/workers/base.py", line 146, in load_wsgi
self.wsgi = self.app.wsgi()
File "/usr/local/lib/python3.9/site-packages/gunicorn/app/base.py", line 67, in wsgi
self.callable = self.load()
File "/usr/local/lib/python3.9/site-packages/gunicorn/app/wsgiapp.py", line 58, in load
return self.load_wsgiapp()
File "/usr/local/lib/python3.9/site-packages/gunicorn/app/wsgiapp.py", line 48, in load_wsgiapp
return util.import_app(self.app_uri)
File "/usr/local/lib/python3.9/site-packages/gunicorn/util.py", line 359, in import_app
mod = importlib.import_module(module)
File "/usr/local/lib/python3.9/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 850, in exec_module
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "/app/main.py", line 29, in <module>
pipe_flan = pipeline("text2text-generation", model="../flan-t5-xxl-sharded-fp16", model_kwargs={"load_in_8bit":True, "device_map": "auto"})
File "/usr/local/lib/python3.9/site-packages/transformers/pipelines/__init__.py", line 754, in pipeline
framework, model = infer_framework_load_model(
File "/usr/local/lib/python3.9/site-packages/transformers/pipelines/base.py", line 266, in infer_framework_load_model
raise ValueError(f"Could not load model {model} with any of the following classes: {class_tuple}.")
ValueError: Could not load model ../flan-t5-xxl-sharded-fp16 with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForSeq2SeqLM'>, <class 'transformers.models.t5.modeling_t5.T5ForConditionalGeneration'>).
[2023-02-05 21:34:02 +0000] [7] [INFO] Worker exiting (pid: 7)
[2023-02-05 21:34:04 +0000] [1] [INFO] Shutting down: Master
[2023-02-05 21:34:04 +0000] [1] [INFO] Reason: Worker failed to boot.
The Dockerfile is the following:
FROM tiangolo/uvicorn-gunicorn-fastapi:python3.9
# install dependencies
RUN python3 -m pip install --upgrade pip
RUN pip3 install torch==1.13.0 transformers==4.26.0 sentencepiece torchvision torchaudio accelerate==0.15.0 bitsandbytes-cuda113
COPY ./app /app
COPY ./flan-t5-xxl-sharded-fp16/ /flan-t5-xxl-sharded-fp16
EXPOSE 7080
# Start the app
CMD ["gunicorn", "-b", "0.0.0.0:7080", "main:app","--workers","1","--timeout","180","-k","uvicorn.workers.UvicornWorker"]
The code of app/main.py is the following:
from fastapi import FastAPI, Request
from fastapi.logger import logger
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, T5ForConditionalGeneration
import json
import logging
import numpy as np
import os
import torch
from transformers import pipeline
app = FastAPI()
gunicorn_logger = logging.getLogger('gunicorn.error')
logger.handlers = gunicorn_logger.handlers
if __name__ != "main":
logger.setLevel(gunicorn_logger.level)
else:
logger.setLevel(logging.INFO)
logger.info(f"Is CUDA available: {torch.cuda.is_available()}")
logger.info(f"CUDA device: {torch.cuda.get_device_name(torch.cuda.current_device())}")
logger.info("Loading model")
# error is in this line
pipe_flan = pipeline("text2text-generation", model="../flan-t5-xxl-sharded-fp16", model_kwargs={"load_in_8bit":True, "device_map": "auto"})
# extra code removed
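As a way to narrow this down inside the container, one could try loading the model directly rather than through pipeline(), so the real loading error surfaces instead of the generic "Could not load model" ValueError; a minimal sketch, assuming the same local path and the same 8-bit settings:

```python
# Hypothetical debugging snippet: load the sharded checkpoint directly.
# Any missing-file, bitsandbytes, or CUDA problem will show up here as the
# original exception rather than being swallowed by pipeline().
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(
    "../flan-t5-xxl-sharded-fp16",
    device_map="auto",
    load_in_8bit=True,
)
print(model.config.model_type)
```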
@philschmid @younesbelkada just wanted to follow up on this.
@RonLek I am planning to create an example. I'll post it here once it is ready.
@RonLek done: https://www.philschmid.de/deploy-flan-t5-sagemaker
This works! Thanks a ton @philschmid for the prompt response :rocket:
@philschmid just curious. Would there be a similar sharded model repo for flan-t5-xl?
If you check this blog post: https://www.philschmid.de/deploy-t5-11b there is a code snippet on how to do this for t5-11b:
import torch
from transformers import AutoModelWithLMHead
from huggingface_hub import HfApi
# load model as float16
model = AutoModelWithLMHead.from_pretrained("t5-11b", torch_dtype=torch.float16, low_cpu_mem_usage=True)
# shard the model and save it locally (pushing to the Hub is a separate step)
model.save_pretrained("sharded", max_shard_size="2000MB")
Thanks! This worked :fire:
@philschmid thanks for the guidance here. While deploying your solution on SageMaker I noticed that it works great on g5 instances but not on p3 instances (p3.8xlarge). Also, do we know when the direct deploy from the HF hub would work out of the box? Error below:
The model fails to load because the required bitsandbytes library reports "The installed version of bitsandbytes was compiled without GPU support." on the p3 instance, which leads to the following error when you invoke the model:
2023-02-25T01:24:28,714 [INFO ] W-model-3-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - mms.service.PredictionException: 'NoneType' object has no attribute 'cget_col_row_stats' : 400
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.