`load_dataset` uses out-of-date cache instead of re-downloading a changed dataset

Open · mnoukhov opened this issue 2 years ago · 2 comments

Describe the bug

When a dataset is updated on the Hub, `load_dataset` loads the locally cached dataset instead of re-downloading the updated one.

Steps to reproduce the bug

Here is a minimal example script to:

  1. create an initial dataset and upload it
  2. download it so it is stored in the cache
  3. change the dataset and re-upload it
  4. re-download it
import time

from datasets import Dataset, DatasetDict, DownloadMode, load_dataset

username = "YOUR_USERNAME_HERE"

# 1. create an initial dataset and push it to the Hub
initial = Dataset.from_dict({"foo": [1, 2, 3]})
print(f"Initial {initial['foo']}")
initial_ds = DatasetDict({"train": initial})
initial_ds.push_to_hub("test")

time.sleep(1)

# 2. download it so it is stored in the cache
download = load_dataset(f"{username}/test", split="train")

# 3. change the dataset and re-upload it
changed = download.map(lambda x: {"foo": x["foo"] + 1})
print(f"Changed {changed['foo']}")
changed.push_to_hub("test")

time.sleep(1)

# 4. re-download it
download_again = load_dataset(f"{username}/test", split="train")
print(f"Download Changed {download_again['foo']}")
# >>> gives the out-dated [1, 2, 3] when it should be the changed [2, 3, 4]

The re-downloaded dataset should be the changed dataset, but it is actually the cached, initial dataset. Force-redownloading gives the correct dataset:

download_again_force = load_dataset(f"{username}/test", split="train", download_mode=DownloadMode.FORCE_REDOWNLOAD)
print(f"Force Download Changed {download_again_force['foo']}")
# >>> [2, 3, 4]

Expected behavior

I assumed there would be some sort of hashing to check for changes in the dataset and re-download when the hashes don't match.
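
In the meantime, a workaround is to ask the Hub for the dataset's current commit hash and force a re-download only when it has changed since the last download. A minimal sketch, assuming `huggingface_hub` is installed; the repo id and the `last_seen_sha.txt` file are placeholders:

from pathlib import Path

from datasets import DownloadMode, load_dataset
from huggingface_hub import HfApi

repo_id = "YOUR_USERNAME_HERE/test"  # placeholder repo id
sha_file = Path("last_seen_sha.txt")  # hypothetical local record of the last revision seen

# ask the Hub for the current head commit of the dataset repo
remote_sha = HfApi().dataset_info(repo_id).sha

# force a re-download only when the remote revision has changed
last_sha = sha_file.read_text() if sha_file.exists() else None
mode = (
    DownloadMode.FORCE_REDOWNLOAD
    if remote_sha != last_sha
    else DownloadMode.REUSE_DATASET_IF_EXISTS
)

ds = load_dataset(repo_id, split="train", download_mode=mode)
sha_file.write_text(remote_sha)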

Environment info

  • datasets version: 2.15.0
  • Platform: Linux-5.15.0-1028-nvidia-x86_64-with-glibc2.17
  • Python version: 3.8.17
  • huggingface_hub version: 0.19.4
  • PyArrow version: 13.0.0
  • Pandas version: 2.0.3
  • fsspec version: 2023.6.0

mnoukhov · Dec 02 '23 21:12

Hi, thanks for reporting! https://github.com/huggingface/datasets/pull/6459 will fix this.

mariosasko · Dec 04 '23 16:12

I met a similar problem when using loading scripts. I have to set `download_mode='force_redownload'` to load the latest script.
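
A minimal sketch of that workaround, where `user/my_script_dataset` is a placeholder for a script-based dataset repo:

from datasets import load_dataset

# the string form of download_mode is equivalent to DownloadMode.FORCE_REDOWNLOAD
ds = load_dataset("user/my_script_dataset", download_mode="force_redownload")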

ljw20180420 · Aug 20 '24 08:08