`load_dataset` uses out-of-date cache instead of re-downloading a changed dataset
Describe the bug
When a dataset is updated on the Hub, `load_dataset` loads the locally cached dataset instead of re-downloading the updated one.
Steps to reproduce the bug
Here is a minimal example script to
- create an initial dataset and upload it
- download it so it is stored in the cache
- change the dataset and re-upload it
- re-download it
import time
from datasets import Dataset, DatasetDict, DownloadMode, load_dataset
username = "YOUR_USERNAME_HERE"
initial = Dataset.from_dict({"foo": [1, 2, 3]})
print(f"Initial {initial['foo']}")
initial_ds = DatasetDict({"train": initial})
initial_ds.push_to_hub("test")
time.sleep(1)
download = load_dataset(f"{username}/test", split="train")
changed = download.map(lambda x: {"foo": x["foo"] + 1})
print(f"Changed {changed['foo']}")
changed.push_to_hub("test")
time.sleep(1)
download_again = load_dataset(f"{username}/test", split="train")
print(f"Download Changed {download_again['foo']}")
# >>> gives the out-dated [1,2,3] when it should be changed [2,3,4]
The re-downloaded dataset should be the changed dataset, but it is actually the cached, initial dataset. Force-redownloading gives the correct dataset:
download_again_force = load_dataset(f"{username}/test", split="train", download_mode=DownloadMode.FORCE_REDOWNLOAD)
print(f"Force Download Changed {download_again_force['foo']}")
# >>> [2,3,4]
Expected behavior
I expected some sort of hashing that checks for changes in the dataset and re-downloads if the hashes don't match.
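The expected behavior above can be sketched as follows. This is a minimal illustration of the hashing idea, not the actual `datasets` caching implementation: record a content fingerprint when caching, and re-fetch whenever the remote fingerprint no longer matches the cached one.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Stable content hash used as a cache key."""
    return hashlib.sha256(data).hexdigest()

def fetch(remote: bytes, cache: dict) -> bytes:
    """Return cached bytes unless the remote content's hash has changed."""
    key = fingerprint(remote)
    if cache.get("fingerprint") != key:
        # Cache miss or stale cache: refresh from the remote content.
        cache["fingerprint"] = key
        cache["data"] = remote
    return cache["data"]

cache: dict = {}
fetch(b"foo: [1, 2, 3]", cache)            # first download populates the cache
updated = fetch(b"foo: [2, 3, 4]", cache)  # content changed upstream -> re-fetched
print(updated)  # b'foo: [2, 3, 4]'
```

With this scheme, the stale `[1, 2, 3]` cache entry would be invalidated automatically once the Hub copy changes, without needing `DownloadMode.FORCE_REDOWNLOAD`.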
Environment info
- `datasets` version: 2.15.0
- Platform: Linux-5.15.0-1028-nvidia-x86_64-with-glibc2.17
- Python version: 3.8.17
- `huggingface_hub` version: 0.19.4
- PyArrow version: 13.0.0
- Pandas version: 2.0.3
- `fsspec` version: 2023.6.0
Hi, thanks for reporting! https://github.com/huggingface/datasets/pull/6459 will fix this.
I ran into a similar problem when using loading scripts. I have to set download_mode='force_redownload' to load the latest script.