Initial fetch for `config.json` ignores `--revision`?
If I set `CMD_ADDITIONAL_ARGUMENTS` to `--model turboderp/Mistral-7B-instruct-exl2 --revision 4.0bpw`, then I get this error:
2024-03-13T14:03:42.164428603Z + exec python3 -m aphrodite.endpoints.openai.api_server --host 0.0.0.0 --port 5000 --download-dir /app/tmp/hub --max-model-len 4096 --quantization exl2 --enforce-eager --model turboderp/Mistral-7B-instruct-exl2 --revision 4.0bpw --download-dir /volume/hub
2024-03-13T14:03:44.082470629Z WARNING: exl2 quantization is not fully optimized yet. The speed can be slower
2024-03-13T14:03:44.082490019Z than non-quantized models.
2024-03-13T14:03:44.084028269Z INFO: Initializing the Aphrodite Engine (v0.5.0) with the following config:
2024-03-13T14:03:44.084035559Z INFO: Model = 'turboderp/Mistral-7B-instruct-exl2'
2024-03-13T14:03:44.084039269Z INFO: DataType = torch.bfloat16
2024-03-13T14:03:44.084042909Z INFO: Model Load Format = auto
2024-03-13T14:03:44.084045799Z INFO: Number of GPUs = 1
2024-03-13T14:03:44.084048349Z INFO: Disable Custom All-Reduce = False
2024-03-13T14:03:44.084050519Z INFO: Quantization Format = exl2
2024-03-13T14:03:44.084052649Z INFO: Context Length = 4096
2024-03-13T14:03:44.084057519Z INFO: Enforce Eager Mode = True
2024-03-13T14:03:44.084059709Z INFO: KV Cache Data Type = auto
2024-03-13T14:03:44.084061789Z INFO: KV Cache Params Path = None
2024-03-13T14:03:44.084063869Z INFO: Device = cuda
2024-03-13T14:03:44.492961433Z Traceback (most recent call last):
2024-03-13T14:03:44.492985083Z File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py", line 304, in hf_raise_for_status
2024-03-13T14:03:44.492988443Z response.raise_for_status()
2024-03-13T14:03:44.492991203Z File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 1021, in raise_for_status
2024-03-13T14:03:44.492993893Z raise HTTPError(http_error_msg, response=self)
2024-03-13T14:03:44.492996533Z requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/turboderp/Mistral-7B-instruct-exl2/resolve/main/config.json
2024-03-13T14:03:44.492999293Z
2024-03-13T14:03:44.493001403Z The above exception was the direct cause of the following exception:
2024-03-13T14:03:44.493003813Z
2024-03-13T14:03:44.493005773Z Traceback (most recent call last):
2024-03-13T14:03:44.493008093Z File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 398, in cached_file
2024-03-13T14:03:44.493010223Z resolved_file = hf_hub_download(
2024-03-13T14:03:44.493012273Z File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
2024-03-13T14:03:44.493014363Z return fn(*args, **kwargs)
2024-03-13T14:03:44.493016513Z File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1261, in hf_hub_download
2024-03-13T14:03:44.493018643Z metadata = get_hf_file_metadata(
2024-03-13T14:03:44.493020723Z File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
2024-03-13T14:03:44.493022793Z return fn(*args, **kwargs)
2024-03-13T14:03:44.493024903Z File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1667, in get_hf_file_metadata
2024-03-13T14:03:44.493026983Z r = _request_wrapper(
2024-03-13T14:03:44.493029103Z File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 385, in _request_wrapper
2024-03-13T14:03:44.493031173Z response = _request_wrapper(
2024-03-13T14:03:44.493033263Z File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 409, in _request_wrapper
2024-03-13T14:03:44.493035313Z hf_raise_for_status(response)
2024-03-13T14:03:44.493041563Z huggingface_hub.utils._errors.EntryNotFoundError: 404 Client Error. (Request ID: Root=1-65f1b240-7d5d7d3b668248e21867e88e;d37da62a-3494-4c58-91fd-28dda5419afb)
2024-03-13T14:03:44.493043843Z
2024-03-13T14:03:44.493045873Z Entry Not Found for url: https://huggingface.co/turboderp/Mistral-7B-instruct-exl2/resolve/main/config.json.
2024-03-13T14:03:44.493062953Z
2024-03-13T14:03:44.493066373Z The above exception was the direct cause of the following exception:
2024-03-13T14:03:44.493068993Z
2024-03-13T14:03:44.493071043Z Traceback (most recent call last):
2024-03-13T14:03:44.493073083Z File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
2024-03-13T14:03:44.493075173Z return _run_code(code, main_globals, None,
2024-03-13T14:03:44.493077243Z File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
2024-03-13T14:03:44.493079313Z exec(code, run_globals)
2024-03-13T14:03:44.493081353Z File "/app/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 561, in <module>
2024-03-13T14:03:44.493083673Z engine = AsyncAphrodite.from_engine_args(engine_args)
2024-03-13T14:03:44.493085783Z File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 676, in from_engine_args
2024-03-13T14:03:44.493087773Z engine = cls(parallel_config.worker_use_ray,
2024-03-13T14:03:44.493089813Z File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 341, in __init__
2024-03-13T14:03:44.493091913Z self.engine = self._init_engine(*args, **kwargs)
2024-03-13T14:03:44.493093943Z File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 410, in _init_engine
2024-03-13T14:03:44.493095973Z return engine_class(*args, **kwargs)
2024-03-13T14:03:44.493098053Z File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 102, in __init__
2024-03-13T14:03:44.493100183Z self._init_tokenizer()
2024-03-13T14:03:44.493102283Z File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 166, in _init_tokenizer
2024-03-13T14:03:44.493104343Z self.tokenizer: TokenizerGroup = TokenizerGroup(
2024-03-13T14:03:44.493106503Z File "/app/aphrodite-engine/aphrodite/transformers_utils/tokenizer.py", line 157, in __init__
2024-03-13T14:03:44.493108583Z self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
2024-03-13T14:03:44.493110623Z File "/app/aphrodite-engine/aphrodite/transformers_utils/tokenizer.py", line 87, in get_tokenizer
2024-03-13T14:03:44.493112653Z tokenizer = AutoTokenizer.from_pretrained(
2024-03-13T14:03:44.493114713Z File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 782, in from_pretrained
2024-03-13T14:03:44.493116783Z config = AutoConfig.from_pretrained(
2024-03-13T14:03:44.493118833Z File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 1111, in from_pretrained
2024-03-13T14:03:44.493120903Z config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
2024-03-13T14:03:44.493122953Z File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 633, in get_config_dict
2024-03-13T14:03:44.493125233Z config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
2024-03-13T14:03:44.493127343Z File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 688, in _get_config_dict
2024-03-13T14:03:44.493129363Z resolved_config_file = cached_file(
2024-03-13T14:03:44.493131423Z File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 452, in cached_file
2024-03-13T14:03:44.493133483Z raise EnvironmentError(
2024-03-13T14:03:44.493135593Z OSError: turboderp/Mistral-7B-instruct-exl2 does not appear to have a file named config.json. Checkout 'https://huggingface.co/turboderp/Mistral-7B-instruct-exl2/main' for avail
I know what's happening. Will fix soon.
(Wondering if there's any workaround in the meantime with the official RunPod image? 👉👈 I tried the `REVISION` env var too.)
Any update?
Still an issue... a workaround would be really nice, since this makes it pretty difficult to use with HF models that keep different quantization levels on different branches.
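The closest thing to a workaround I can think of (untested here, and assuming Aphrodite accepts a local directory for `--model` the same way it accepts a Hub ID) is to pre-download the branch you want with huggingface_hub and point the server at the local copy, so no revision lookup happens at all:

# Hypothetical workaround sketch: pull the 4.0bpw branch into a local folder,
# then start the server with --model pointing at that folder.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="turboderp/Mistral-7B-instruct-exl2",
    revision="4.0bpw",                          # the quantization branch
    local_dir="/volume/hub/mistral-7b-4.0bpw",  # example path
)
print(local_path)

and then launch with `--model /volume/hub/mistral-7b-4.0bpw` instead of the repo ID.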
NOTE: It may just be an issue in the Google Colab, since I see this was recently reported as fixed in #246. Or maybe the fix just hasn't made it into a release yet; I'm not sure how your release schedule works, so my apologies if this is about to be fixed anyway.
> It may just be an issue in the Google Colab

It's not just Google Colab, since I tested on RunPod using the official image. Though I haven't tested with the latest release.
There are bugs in the code. You can use this patch against v0.5.2; I haven't re-applied it to the current HEAD:
diff --git a/aphrodite/endpoints/openai/api_server.py b/aphrodite/endpoints/openai/api_server.py
index 3b3b6ed..02da554 100644
--- a/aphrodite/endpoints/openai/api_server.py
+++ b/aphrodite/endpoints/openai/api_server.py
@@ -565,6 +565,7 @@ if __name__ == "__main__":
engine_args.tokenizer,
tokenizer_mode=engine_args.tokenizer_mode,
trust_remote_code=engine_args.trust_remote_code,
+ revision=engine_args.revision,
)
chat_template = args.chat_template
diff --git a/aphrodite/endpoints/openai/serving_engine.py b/aphrodite/endpoints/openai/serving_engine.py
index c98b332..b8d4e07 100644
--- a/aphrodite/endpoints/openai/serving_engine.py
+++ b/aphrodite/endpoints/openai/serving_engine.py
@@ -63,7 +63,8 @@ class OpenAIServing:
self.tokenizer = get_tokenizer(
engine_model_config.tokenizer,
tokenizer_mode=engine_model_config.tokenizer_mode,
- trust_remote_code=engine_model_config.trust_remote_code)
+ trust_remote_code=engine_model_config.trust_remote_code,
+ revision=engine_model_config.revision,)
async def show_available_models(self) -> ModelList:
"""Show available models. Right now we only have one model."""
diff --git a/aphrodite/engine/aphrodite_engine.py b/aphrodite/engine/aphrodite_engine.py
index b811bfe..11baf74 100644
--- a/aphrodite/engine/aphrodite_engine.py
+++ b/aphrodite/engine/aphrodite_engine.py
@@ -163,7 +163,7 @@ class AphroditeEngine:
max_input_length=None,
tokenizer_mode=self.model_config.tokenizer_mode,
trust_remote_code=self.model_config.trust_remote_code,
- revision=self.model_config.tokenizer_revision)
+ revision=self.model_config.revision)
init_kwargs.update(tokenizer_init_kwargs)
self.tokenizer: TokenizerGroup = TokenizerGroup(
self.model_config.tokenizer, **init_kwargs)
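All three hunks do the same thing: the model config already carries the value from `--revision`, but the tokenizer code paths either dropped it or used `tokenizer_revision` (which is None unless `--tokenizer-revision` is given), so transformers fell back to main. With the patch, the tokenizer load effectively boils down to something like this (a simplified sketch, not the literal Aphrodite code):

# Illustration only: what the patched get_tokenizer() call amounts to.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "turboderp/Mistral-7B-instruct-exl2",
    revision="4.0bpw",        # now comes from model_config.revision (--revision)
    trust_remote_code=False,
)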
I have the same problem. Is there any chance this patch could make it into a release, please? RunPod only pulls the latest release of Aphrodite. Thanks.
You need to use `--tokenizer-revision` as well, for example:
python -m aphrodite.endpoints.openai.api_server --model turboderp/Llama-3-8B-exl2 --revision 6.0bpw --tokenizer-revision 6.0bpw
Otherwise you'll get the missing-config.json error, because the code tries to find it on the main branch and fails.
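To make that concrete, the failure is reproducible with plain transformers outside Aphrodite (a quick sketch, using the repo and branch from the command above):

# exl2 repos like this one only have config.json on the per-bitrate branches, not on main.
from transformers import AutoConfig

# Fails with "does not appear to have a file named config.json",
# because it resolves .../resolve/main/config.json:
# AutoConfig.from_pretrained("turboderp/Llama-3-8B-exl2")

# Works, because it resolves the file on the 6.0bpw branch:
cfg = AutoConfig.from_pretrained("turboderp/Llama-3-8B-exl2", revision="6.0bpw")
print(cfg.model_type)  # "llama"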
Have you tested this on the dev branch after the PR was merged?
Well, in that case the PR wasn't enough. If @AlpinDale can confirm this, I'm happy to make another PR to add `tokenizer_revision=self.model_config.tokenizer_revision` as well.
The patch I posted was enough for me to be able to pull things from HuggingFace and use revisions
> Have you tested this on the dev branch after the PR was merged?
>
> Well, in that case the PR wasn't enough. If @AlpinDale can confirm this, I'm happy to make another PR to add `tokenizer_revision=self.model_config.tokenizer_revision` as well.
Oh, my bad, I was using 0.5.1 from PyPI, but it works after adding `--tokenizer-revision`. I will test the dev branch.
Alright, I'm also getting that error (I got it initially at an earlier stage too and fixed that by using `--tokenizer-revision`; now this happens):
WARNING: exl2 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO: Initializing the Aphrodite Engine (v0.5.1) with the following config:
INFO: Model = 'turboderp/Llama-3-8B-exl2'
INFO: DataType = torch.bfloat16
INFO: Model Load Format = auto
INFO: Number of GPUs = 1
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = exl2
INFO: Context Length = 7040
INFO: Enforce Eager Mode = False
INFO: KV Cache Data Type = auto
INFO: KV Cache Params Path = None
INFO: Device = cuda
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING: Model is quantized. Forcing float16 datatype.
INFO: Downloading model weights ['*.safetensors']
INFO: Model weights loaded. Memory usage: 6.26 GiB x 1 = 6.26 GiB
INFO: # GPU blocks: 566, # CPU blocks: 2048
INFO: Minimum concurrency: 1.29x
INFO: Maximum sequence length allowed in the cache: 9056
INFO: Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager
mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
WARNING: CUDA graphs can take additional 1~3 GiB of memory per GPU. If you are running out of memory, consider decreasing
`gpu_memory_utilization` or enforcing eager mode.
Capturing graph... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 35/35 0:00:00
INFO: Graph capturing finished in 8 secs.
/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 304, in hf_raise_for_status
response.raise_for_status()
File "/usr/local/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/turboderp/Llama-3-8B-exl2/resolve/main/config.json
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/transformers/utils/hub.py", line 398, in cached_file
resolved_file = hf_hub_download(
File "/usr/local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1221, in hf_hub_download
return _hf_hub_download_to_cache_dir(
File "/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1282, in _hf_hub_download_to_cache_dir
(url_to_download, etag, commit_hash, expected_size, head_call_error) = _get_metadata_or_catch_error(
File "/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1722, in _get_metadata_or_catch_error
metadata = get_hf_file_metadata(url=url, proxies=proxies, timeout=etag_timeout, headers=headers)
File "/usr/local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1645, in get_hf_file_metadata
r = _request_wrapper(
File "/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 372, in _request_wrapper
response = _request_wrapper(
File "/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 396, in _request_wrapper
hf_raise_for_status(response)
File "/usr/local/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 315, in hf_raise_for_status
raise EntryNotFoundError(message, response) from e
huggingface_hub.utils._errors.EntryNotFoundError: 404 Client Error. (Request ID: Root=1-663bb63c-xx)
Entry Not Found for url: https://huggingface.co/turboderp/Llama-3-8B-exl2/resolve/main/config.json.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/site-packages/aphrodite/endpoints/openai/api_server.py", line 564, in <module>
tokenizer = get_tokenizer(
File "/usr/local/lib/python3.10/site-packages/aphrodite/transformers_utils/tokenizer.py", line 87, in get_tokenizer
tokenizer = AutoTokenizer.from_pretrained(
File "/usr/local/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 819, in from_pretrained
config = AutoConfig.from_pretrained(
File "/usr/local/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 928, in from_pretrained
config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/usr/local/lib/python3.10/site-packages/transformers/configuration_utils.py", line 631, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/usr/local/lib/python3.10/site-packages/transformers/configuration_utils.py", line 686, in _get_config_dict
resolved_config_file = cached_file(
File "/usr/local/lib/python3.10/site-packages/transformers/utils/hub.py", line 452, in cached_file
raise EnvironmentError(
OSError: turboderp/Llama-3-8B-exl2 does not appear to have a file named config.json. Checkout 'https://huggingface.co/turboderp/Llama-3-8B-exl2/main' for available files.
I'm using this Dockerfile (mostly to apply my previous patch). I had to switch back to the root user, as I was running into permission issues applying the patch.
FROM alpindale/aphrodite-engine
USER 0:0
COPY tokenizer-revision.patch .
RUN git apply tokenizer-revision.patch
with this .env file:
QUANTIZATION=exl2
MODEL_NAME=turboderp/Llama-3-8B-Instruct-exl2
REVISION="4.0bpw"
NUMBA_CACHE_DIR=/tmp/numba_cache
You need `NUMBA_CACHE_DIR` due to #323.
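For reference, I build and run it roughly like this (the image tag and port mapping are just examples; adjust to however you normally launch the container):

docker build -t aphrodite-patched .
docker run --gpus all --env-file .env -p 2242:2242 aphrodite-patched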