Initial fetch for `config.json` ignores `--revision`?
If I set `CMD_ADDITIONAL_ARGUMENTS` to `--model turboderp/Mistral-7B-instruct-exl2 --revision 4.0bpw`, then I get this error:
2024-03-13T14:03:42.164428603Z + exec python3 -m aphrodite.endpoints.openai.api_server --host 0.0.0.0 --port 5000 --download-dir /app/tmp/hub --max-model-len 4096 --quantization exl2 --enforce-eager --model turboderp/Mistral-7B-instruct-exl2 --revision 4.0bpw --download-dir /volume/hub
2024-03-13T14:03:44.082470629Z WARNING: exl2 quantization is not fully optimized yet. The speed can be slower
2024-03-13T14:03:44.082490019Z than non-quantized models.
2024-03-13T14:03:44.084028269Z INFO: Initializing the Aphrodite Engine (v0.5.0) with the following config:
2024-03-13T14:03:44.084035559Z INFO: Model = 'turboderp/Mistral-7B-instruct-exl2'
2024-03-13T14:03:44.084039269Z INFO: DataType = torch.bfloat16
2024-03-13T14:03:44.084042909Z INFO: Model Load Format = auto
2024-03-13T14:03:44.084045799Z INFO: Number of GPUs = 1
2024-03-13T14:03:44.084048349Z INFO: Disable Custom All-Reduce = False
2024-03-13T14:03:44.084050519Z INFO: Quantization Format = exl2
2024-03-13T14:03:44.084052649Z INFO: Context Length = 4096
2024-03-13T14:03:44.084057519Z INFO: Enforce Eager Mode = True
2024-03-13T14:03:44.084059709Z INFO: KV Cache Data Type = auto
2024-03-13T14:03:44.084061789Z INFO: KV Cache Params Path = None
2024-03-13T14:03:44.084063869Z INFO: Device = cuda
2024-03-13T14:03:44.492961433Z Traceback (most recent call last):
2024-03-13T14:03:44.492985083Z File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py", line 304, in hf_raise_for_status
2024-03-13T14:03:44.492988443Z response.raise_for_status()
2024-03-13T14:03:44.492991203Z File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 1021, in raise_for_status
2024-03-13T14:03:44.492993893Z raise HTTPError(http_error_msg, response=self)
2024-03-13T14:03:44.492996533Z requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/turboderp/Mistral-7B-instruct-exl2/resolve/main/config.json
2024-03-13T14:03:44.492999293Z
2024-03-13T14:03:44.493001403Z The above exception was the direct cause of the following exception:
2024-03-13T14:03:44.493003813Z
2024-03-13T14:03:44.493005773Z Traceback (most recent call last):
2024-03-13T14:03:44.493008093Z File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 398, in cached_file
2024-03-13T14:03:44.493010223Z resolved_file = hf_hub_download(
2024-03-13T14:03:44.493012273Z File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
2024-03-13T14:03:44.493014363Z return fn(*args, **kwargs)
2024-03-13T14:03:44.493016513Z File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1261, in hf_hub_download
2024-03-13T14:03:44.493018643Z metadata = get_hf_file_metadata(
2024-03-13T14:03:44.493020723Z File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
2024-03-13T14:03:44.493022793Z return fn(*args, **kwargs)
2024-03-13T14:03:44.493024903Z File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1667, in get_hf_file_metadata
2024-03-13T14:03:44.493026983Z r = _request_wrapper(
2024-03-13T14:03:44.493029103Z File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 385, in _request_wrapper
2024-03-13T14:03:44.493031173Z response = _request_wrapper(
2024-03-13T14:03:44.493033263Z File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 409, in _request_wrapper
2024-03-13T14:03:44.493035313Z hf_raise_for_status(response)
2024-03-13T14:03:44.493041563Z huggingface_hub.utils._errors.EntryNotFoundError: 404 Client Error. (Request ID: Root=1-65f1b240-7d5d7d3b668248e21867e88e;d37da62a-3494-4c58-91fd-28dda5419afb)
2024-03-13T14:03:44.493043843Z
2024-03-13T14:03:44.493045873Z Entry Not Found for url: https://huggingface.co/turboderp/Mistral-7B-instruct-exl2/resolve/main/config.json.
2024-03-13T14:03:44.493062953Z
2024-03-13T14:03:44.493066373Z The above exception was the direct cause of the following exception:
2024-03-13T14:03:44.493068993Z
2024-03-13T14:03:44.493071043Z Traceback (most recent call last):
2024-03-13T14:03:44.493073083Z File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
2024-03-13T14:03:44.493075173Z return _run_code(code, main_globals, None,
2024-03-13T14:03:44.493077243Z File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
2024-03-13T14:03:44.493079313Z exec(code, run_globals)
2024-03-13T14:03:44.493081353Z File "/app/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 561, in <module>
2024-03-13T14:03:44.493083673Z engine = AsyncAphrodite.from_engine_args(engine_args)
2024-03-13T14:03:44.493085783Z File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 676, in from_engine_args
2024-03-13T14:03:44.493087773Z engine = cls(parallel_config.worker_use_ray,
2024-03-13T14:03:44.493089813Z File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 341, in __init__
2024-03-13T14:03:44.493091913Z self.engine = self._init_engine(*args, **kwargs)
2024-03-13T14:03:44.493093943Z File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 410, in _init_engine
2024-03-13T14:03:44.493095973Z return engine_class(*args, **kwargs)
2024-03-13T14:03:44.493098053Z File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 102, in __init__
2024-03-13T14:03:44.493100183Z self._init_tokenizer()
2024-03-13T14:03:44.493102283Z File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 166, in _init_tokenizer
2024-03-13T14:03:44.493104343Z self.tokenizer: TokenizerGroup = TokenizerGroup(
2024-03-13T14:03:44.493106503Z File "/app/aphrodite-engine/aphrodite/transformers_utils/tokenizer.py", line 157, in __init__
2024-03-13T14:03:44.493108583Z self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
2024-03-13T14:03:44.493110623Z File "/app/aphrodite-engine/aphrodite/transformers_utils/tokenizer.py", line 87, in get_tokenizer
2024-03-13T14:03:44.493112653Z tokenizer = AutoTokenizer.from_pretrained(
2024-03-13T14:03:44.493114713Z File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 782, in from_pretrained
2024-03-13T14:03:44.493116783Z config = AutoConfig.from_pretrained(
2024-03-13T14:03:44.493118833Z File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 1111, in from_pretrained
2024-03-13T14:03:44.493120903Z config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
2024-03-13T14:03:44.493122953Z File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 633, in get_config_dict
2024-03-13T14:03:44.493125233Z config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
2024-03-13T14:03:44.493127343Z File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 688, in _get_config_dict
2024-03-13T14:03:44.493129363Z resolved_config_file = cached_file(
2024-03-13T14:03:44.493131423Z File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 452, in cached_file
2024-03-13T14:03:44.493133483Z raise EnvironmentError(
2024-03-13T14:03:44.493135593Z OSError: turboderp/Mistral-7B-instruct-exl2 does not appear to have a file named config.json. Checkout 'https://huggingface.co/turboderp/Mistral-7B-instruct-exl2/main' for avail
I know what's happening. Will fix soon.
(Wondering if there's any workaround in the meantime with the official RunPod image? 👉👈 I tried the `REVISION` env var too.)
Any update?
Still an issue... a workaround would be really nice, since this makes it pretty difficult to use with HF models that keep different quantization levels on different branches.
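The closest thing to a workaround I can think of (untested here, and assuming Aphrodite accepts a local directory for `--model` the same way it accepts a Hub ID) is to pre-download the branch you want with huggingface_hub and point the server at the local copy, so no revision lookup happens at all:

# Hypothetical workaround sketch: pull the 4.0bpw branch into a local folder,
# then start the server with --model pointing at that folder.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="turboderp/Mistral-7B-instruct-exl2",
    revision="4.0bpw",                          # the quantization branch
    local_dir="/volume/hub/mistral-7b-4.0bpw",  # example path
)
print(local_path)

and then launch with `--model /volume/hub/mistral-7b-4.0bpw` instead of the repo ID.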
NOTE: It may just be an issue in the Google Colab, since I see this was recently reported as fixed in #246. Or maybe the fix just hasn't made it into a release yet; I'm not sure how your release schedule works, so my apologies if this is about to be fixed anyway.
> It may just be an issue in the Google Colab

It's not just Google Colab, since I tested on RunPod using the official image. Though I haven't tested with the latest release.
There are bugs in the code. You can use this patch against v0.5.2; I haven't re-applied it to the current HEAD:
diff --git a/aphrodite/endpoints/openai/api_server.py b/aphrodite/endpoints/openai/api_server.py
index 3b3b6ed..02da554 100644
--- a/aphrodite/endpoints/openai/api_server.py
+++ b/aphrodite/endpoints/openai/api_server.py
@@ -565,6 +565,7 @@ if __name__ == "__main__":
engine_args.tokenizer,
tokenizer_mode=engine_args.tokenizer_mode,
trust_remote_code=engine_args.trust_remote_code,
+ revision=engine_args.revision,
)
chat_template = args.chat_template
diff --git a/aphrodite/endpoints/openai/serving_engine.py b/aphrodite/endpoints/openai/serving_engine.py
index c98b332..b8d4e07 100644
--- a/aphrodite/endpoints/openai/serving_engine.py
+++ b/aphrodite/endpoints/openai/serving_engine.py
@@ -63,7 +63,8 @@ class OpenAIServing:
self.tokenizer = get_tokenizer(
engine_model_config.tokenizer,
tokenizer_mode=engine_model_config.tokenizer_mode,
- trust_remote_code=engine_model_config.trust_remote_code)
+ trust_remote_code=engine_model_config.trust_remote_code,
+ revision=engine_model_config.revision,)
async def show_available_models(self) -> ModelList:
"""Show available models. Right now we only have one model."""
diff --git a/aphrodite/engine/aphrodite_engine.py b/aphrodite/engine/aphrodite_engine.py
index b811bfe..11baf74 100644
--- a/aphrodite/engine/aphrodite_engine.py
+++ b/aphrodite/engine/aphrodite_engine.py
@@ -163,7 +163,7 @@ class AphroditeEngine:
max_input_length=None,
tokenizer_mode=self.model_config.tokenizer_mode,
trust_remote_code=self.model_config.trust_remote_code,
- revision=self.model_config.tokenizer_revision)
+ revision=self.model_config.revision)
init_kwargs.update(tokenizer_init_kwargs)
self.tokenizer: TokenizerGroup = TokenizerGroup(
self.model_config.tokenizer, **init_kwargs)
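All three hunks do the same thing: the model config already carries the value from `--revision`, but the tokenizer code paths either dropped it or used `tokenizer_revision` (which is None unless `--tokenizer-revision` is given), so transformers fell back to main. With the patch, the tokenizer load effectively boils down to something like this (a simplified sketch, not the literal Aphrodite code):

# Illustration only: what the patched get_tokenizer() call amounts to.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "turboderp/Mistral-7B-instruct-exl2",
    revision="4.0bpw",        # now comes from model_config.revision (--revision)
    trust_remote_code=False,
)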
I have the same problem. Is there any chance this patch could make it into a release, please? RunPod only pulls the latest release of Aphrodite. Thanks.
You need to use `--tokenizer-revision` as well, for example:
python -m aphrodite.endpoints.openai.api_server --model turboderp/Llama-3-8B-exl2 --revision 6.0bpw --tokenizer-revision 6.0bpw
Otherwise you'll get the missing-config.json error, because the code tries to find it on the main branch and fails.
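To make that concrete, the failure is reproducible with plain transformers outside Aphrodite (a quick sketch, using the repo and branch from the command above):

# exl2 repos like this one only have config.json on the per-bitrate branches, not on main.
from transformers import AutoConfig

# Fails with "does not appear to have a file named config.json",
# because it resolves .../resolve/main/config.json:
# AutoConfig.from_pretrained("turboderp/Llama-3-8B-exl2")

# Works, because it resolves the file on the 6.0bpw branch:
cfg = AutoConfig.from_pretrained("turboderp/Llama-3-8B-exl2", revision="6.0bpw")
print(cfg.model_type)  # "llama"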
Have you tested this on the dev branch after the PR was merged?
Well, in that case the PR wasn't enough. If @AlpinDale can confirm this, I'm happy to make another PR to add `tokenizer_revision=self.model_config.tokenizer_revision` as well.
The patch I posted was enough for me to be able to pull things from HuggingFace and use revisions
> Have you tested this on the dev branch after the PR was merged?
>
> Well, in that case the PR wasn't enough. If @AlpinDale can confirm this, I'm happy to make another PR to add `tokenizer_revision=self.model_config.tokenizer_revision` as well.
Oh, my bad, I was using 0.5.1 from PyPI, but it works after adding `--tokenizer-revision`. I will test the dev branch.
Alright, I'm also getting that error (I got it initially at an earlier stage too and fixed that by using `--tokenizer-revision`; now this happens):
WARNING: exl2 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO: Initializing the Aphrodite Engine (v0.5.1) with the following config:
INFO: Model = 'turboderp/Llama-3-8B-exl2'
INFO: DataType = torch.bfloat16
INFO: Model Load Format = auto
INFO: Number of GPUs = 1
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = exl2
INFO: Context Length = 7040
INFO: Enforce Eager Mode = False
INFO: KV Cache Data Type = auto
INFO: KV Cache Params Path = None
INFO: Device = cuda
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING: Model is quantized. Forcing float16 datatype.
INFO: Downloading model weights ['*.safetensors']
INFO: Model weights loaded. Memory usage: 6.26 GiB x 1 = 6.26 GiB
INFO: # GPU blocks: 566, # CPU blocks: 2048
INFO: Minimum concurrency: 1.29x
INFO: Maximum sequence length allowed in the cache: 9056
INFO: Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager
mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
WARNING: CUDA graphs can take additional 1~3 GiB of memory per GPU. If you are running out of memory, consider decreasing
`gpu_memory_utilization` or enforcing eager mode.
Capturing graph... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 35/35 0:00:00
INFO: Graph capturing finished in 8 secs.
/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 304, in hf_raise_for_status
response.raise_for_status()
File "/usr/local/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/turboderp/Llama-3-8B-exl2/resolve/main/config.json
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/transformers/utils/hub.py", line 398, in cached_file
resolved_file = hf_hub_download(
File "/usr/local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1221, in hf_hub_download
return _hf_hub_download_to_cache_dir(
File "/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1282, in _hf_hub_download_to_cache_dir
(url_to_download, etag, commit_hash, expected_size, head_call_error) = _get_metadata_or_catch_error(
File "/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1722, in _get_metadata_or_catch_error
metadata = get_hf_file_metadata(url=url, proxies=proxies, timeout=etag_timeout, headers=headers)
File "/usr/local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1645, in get_hf_file_metadata
r = _request_wrapper(
File "/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 372, in _request_wrapper
response = _request_wrapper(
File "/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 396, in _request_wrapper
hf_raise_for_status(response)
File "/usr/local/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 315, in hf_raise_for_status
raise EntryNotFoundError(message, response) from e
huggingface_hub.utils._errors.EntryNotFoundError: 404 Client Error. (Request ID: Root=1-663bb63c-xx)
Entry Not Found for url: https://huggingface.co/turboderp/Llama-3-8B-exl2/resolve/main/config.json.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/site-packages/aphrodite/endpoints/openai/api_server.py", line 564, in <module>
tokenizer = get_tokenizer(
File "/usr/local/lib/python3.10/site-packages/aphrodite/transformers_utils/tokenizer.py", line 87, in get_tokenizer
tokenizer = AutoTokenizer.from_pretrained(
File "/usr/local/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 819, in from_pretrained
config = AutoConfig.from_pretrained(
File "/usr/local/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 928, in from_pretrained
config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/usr/local/lib/python3.10/site-packages/transformers/configuration_utils.py", line 631, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/usr/local/lib/python3.10/site-packages/transformers/configuration_utils.py", line 686, in _get_config_dict
resolved_config_file = cached_file(
File "/usr/local/lib/python3.10/site-packages/transformers/utils/hub.py", line 452, in cached_file
raise EnvironmentError(
OSError: turboderp/Llama-3-8B-exl2 does not appear to have a file named config.json. Checkout 'https://huggingface.co/turboderp/Llama-3-8B-exl2/main' for available files.
I'm using this Dockerfile (mostly to apply my previous patch). I had to switch back to the root user, as I was running into permission issues applying the patch.
FROM alpindale/aphrodite-engine
USER 0:0
COPY tokenizer-revision.patch .
RUN git apply tokenizer-revision.patch
with this .env file:
QUANTIZATION=exl2
MODEL_NAME=turboderp/Llama-3-8B-Instruct-exl2
REVISION="4.0bpw"
NUMBA_CACHE_DIR=/tmp/numba_cache
You need `NUMBA_CACHE_DIR` due to #323.
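For reference, I build and run it roughly like this (the image tag and port mapping are just examples; adjust to however you normally launch the container):

docker build -t aphrodite-patched .
docker run --gpus all --env-file .env -p 2242:2242 aphrodite-patched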