Unable to use BLEURT in offline mode
Describe the bug
Trying to use BLEURT in offline mode fails. The script and model weights are cached to disk fine (when in online mode). In offline mode, it loads the script from the cache fine, but when trying to load the cached model weights, it throws an error.
It looks like the bug is somewhere in the get_from_cache function, as the error is thrown from here:
https://github.com/huggingface/datasets/blob/f96547708a889c09ca8a02ed7aadd8c5690503c5/src/datasets/utils/file_utils.py#L530
I know the metrics within datasets are deprecated. However, this exact error is thrown by evaluate as well.
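To make the expectation concrete, here is a minimal sketch of the cache-first behaviour I have in mind. It is not the actual get_from_cache implementation: OfflineError, cache_filename and get_cached_or_fail are hypothetical names, and naming the cache file after a SHA-256 hash of the URL is an assumption made only for this sketch.

```python
import hashlib
import os


class OfflineError(ConnectionError):
    """Hypothetical stand-in for an offline-mode exception."""


def cache_filename(url: str) -> str:
    # Assumption for this sketch: the cache file name depends on the URL only.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def get_cached_or_fail(url: str, cache_dir: str, offline: bool) -> str:
    path = os.path.join(cache_dir, cache_filename(url))
    # Cache hit: return the local file without touching the network.
    if os.path.exists(path):
        return path
    # Cache miss while offline: there is nothing left to try.
    if offline:
        raise OfflineError(f"Offline mode is enabled. Tried to reach {url}")
    # Cache miss while online: a real implementation would download here.
    raise NotImplementedError("download step omitted in this sketch")
```

With this ordering, offline mode only fails when the file was never cached in the first place.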
Steps to reproduce the bug
Steps to reproduce the behaviour:
import os
# Set the flag before importing datasets so it is picked up when the library loads its config.
os.environ["HF_DATASETS_OFFLINE"] = "1"
from datasets import load_metric
bleurt = load_metric("bleurt")
This gives the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/johnmg/mediqa/lib/python3.10/site-packages/datasets/utils/deprecation_utils.py", line 46, in wrapper
return deprecated_function(*args, **kwargs)
File "/home/johnmg/mediqa/lib/python3.10/site-packages/datasets/load.py", line 1397, in load_metric
metric.download_and_prepare(download_config=download_config)
File "/home/johnmg/mediqa/lib/python3.10/site-packages/datasets/metric.py", line 625, in download_and_prepare
self._download_and_prepare(dl_manager)
File "/home/johnmg/.cache/huggingface/modules/datasets_modules/metrics/bleurt/89f7c298fa543e9cee6749e6ed198069d7c10fc8e99c0ff37a843dbc0eea88d7/bleurt.py", line 117, in _download_and_prepare
model_path = dl_manager.download_and_extract(CHECKPOINT_URLS[checkpoint_name])
File "/home/johnmg/mediqa/lib/python3.10/site-packages/datasets/download/download_manager.py", line 564, in download_and_extract
return self.extract(self.download(url_or_urls))
File "/home/johnmg/mediqa/lib/python3.10/site-packages/datasets/download/download_manager.py", line 427, in download
downloaded_path_or_paths = map_nested(
File "/home/johnmg/mediqa/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 436, in map_nested
return function(data_struct)
File "/home/johnmg/mediqa/lib/python3.10/site-packages/datasets/download/download_manager.py", line 453, in _download
return cached_path(url_or_filename, download_config=download_config)
File "/home/johnmg/mediqa/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 182, in cached_path
output_path = get_from_cache(
File "/home/johnmg/mediqa/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 530, in get_from_cache
_raise_if_offline_mode_is_enabled(f"Tried to reach {url}")
File "/home/johnmg/mediqa/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 260, in _raise_if_offline_mode_is_enabled
raise OfflineModeIsEnabled(
datasets.utils.file_utils.OfflineModeIsEnabled: Offline mode is enabled. Tried to reach https://storage.googleapis.com/bleurt-oss/bleurt-base-128.zip
Expected behavior
I would expect that, after loading the metric with bleurt = load_metric("bleurt") while online, it is cached locally and can be loaded from that cache afterwards without an internet connection. I also tried manually specifying the cached model filepath like so:
bleurt = load_metric("bleurt", "/home/johnmg/.cache/huggingface/metrics/bleurt/default/downloads/extracted/4686726448df12b97ad0880ca1f80735f419854eb56f1878cf550dcbd717fb20/bleurt-base-128")
but this doesn't work either:
KeyError: "/home/johnmg/.cache/huggingface/metrics/bleurt/default/downloads/extracted/4686726448df12b97ad0880ca1f80735f419854eb56f1878cf550dcbd717fb20/bleurt-base-128 model not found. You should supply the name of a model checkpoint for bleurt in dict_keys(['bleurt-tiny-128', 'bleurt-tiny-512', 'bleurt-base-128', 'bleurt-base-512', 'bleurt-large-128', 'bleurt-large-512', 'BLEURT-20-D3', 'BLEURT-20-D6', 'BLEURT-20-D12', 'BLEURT-20'])"
as the metric loading scripts expect the model checkpoint to be one of:
https://github.com/huggingface/datasets/blob/f96547708a889c09ca8a02ed7aadd8c5690503c5/metrics/bleurt/bleurt.py#L64-L75
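The KeyError comes from a plain name lookup: the script maps a checkpoint name to a download URL, so a filesystem path can never be a valid key. A minimal sketch of that kind of lookup follows. The dictionary is a hypothetical, abbreviated stand-in (only the bleurt-base-128 URL is taken from the traceback above), and resolve_checkpoint is an illustrative helper, not a function from the metric script.

```python
# Hypothetical, abbreviated stand-in for the script's name-to-URL table.
CHECKPOINT_URLS = {
    "bleurt-base-128": "https://storage.googleapis.com/bleurt-oss/bleurt-base-128.zip",
}


def resolve_checkpoint(name: str) -> str:
    # Only known checkpoint names are accepted; a local path is just another unknown key.
    if name not in CHECKPOINT_URLS:
        raise KeyError(
            f"{name} model not found. You should supply the name of a model "
            f"checkpoint for bleurt in {list(CHECKPOINT_URLS)}"
        )
    return CHECKPOINT_URLS[name]
```

Because the lookup always yields a URL, the script still ends up calling the download manager, which is where the offline error is raised.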
Environment info
I installed datasets from main with pip install git+https://github.com/huggingface/datasets.git
Hi! Metric-related issues should be posted in the evaluate repository - happy to help from there ;)
Could you try passing download_config=DownloadConfig(use_etag=False) to datasets.load_metric() or evaluate.load()?
You might have this issue because it tried to reach the URL to get the file ETag used by the cache.
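To illustrate why the ETag could matter: if the cache file name is derived from both the URL and the ETag, the ETag has to be fetched from the server before the cached file can even be located. Below is a rough sketch of such a naming scheme, assuming SHA-256 hashes joined by a dot. cache_name is a hypothetical helper rather than the library's own function, and the ETag value in the usage lines is made up.

```python
import hashlib
from typing import Optional


def cache_name(url: str, etag: Optional[str] = None) -> str:
    # The URL hash is always part of the file name.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()
    # With an ETag, the name changes, so the ETag must be known to find the file.
    if etag:
        name += "." + hashlib.sha256(etag.encode("utf-8")).hexdigest()
    return name


url = "https://storage.googleapis.com/bleurt-oss/bleurt-base-128.zip"
without_etag = cache_name(url)
with_etag = cache_name(url, etag='"made-up-etag"')  # made-up value for illustration
```

Under this assumption, a lookup without the ETag computes a name that can be resolved from the URL alone, which is what use_etag=False is meant to allow.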
No dice, it seems. I tried the following, but it hung and eventually failed in offline mode.
While online:
import evaluate
from datasets import DownloadConfig
from transformers.utils import is_offline_mode
assert not is_offline_mode()
bleurt = evaluate.load("bleurt", "BLEURT-20")
While offline:
import evaluate
from datasets import DownloadConfig
from transformers.utils import is_offline_mode
assert is_offline_mode()
bleurt = evaluate.load("bleurt", "BLEURT-20", download_config=DownloadConfig(use_etag=False))
# Variant with the offline flags set explicitly inside the script
import os
# Set the flags before the library imports so they are picked up at import time.
os.environ["HF_DATASETS_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"
import evaluate
from datasets import DownloadConfig
from transformers.utils import is_offline_mode
assert is_offline_mode()
bleurt = evaluate.load("bleurt", "BLEURT-20", download_config=DownloadConfig(use_etag=False))
Any help would be appreciated @lhoestq 😅
Same here.
Same here.