Many checkpoints are outdated (torch.save'd with torch < 1.6) and don't support mmap
System Info
- `transformers` version: 4.36.0
- Platform: Linux-6.5.0-15-generic-x86_64-with-glibc2.31
- Python version: 3.11.5
- Huggingface_hub version: 0.19.4
- Safetensors version: 0.4.0
- Accelerate version: 0.24.1
- Accelerate config: not found
- PyTorch version (GPU?): 2.3.0a0+git78a84f1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help?
No response
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
The older checkpoint format (PyTorch < 1.6) didn't support mmap, which is recommended for faster loading and, especially, for loading large models without exhausting memory.
```python
import os

import torch
from huggingface_hub import snapshot_download
from torch.serialization import _is_zipfile, _open_file_like


def is_outdated_torch_load(f):
    # Checkpoints saved with torch >= 1.6 use the zipfile format
    with _open_file_like(f, "rb") as opened_file:
        return not _is_zipfile(opened_file)


def update_torch_checkpoint(f):
    # Re-save in the current (zipfile) format
    res = torch.load(f)
    torch.save(res, f)


# Download a known outdated checkpoint
model_name = "sshleifer/tiny-gpt2"
checkpoint_name = "pytorch_model.bin"
ret = snapshot_download(
    repo_id=model_name,
    allow_patterns=checkpoint_name,
    local_dir_use_symlinks=True,
    local_dir="./",
)

# Load and assert it is outdated (i.e. torch.save was used from PyTorch <= 1.5)
checkpoint_path = os.path.join(ret, checkpoint_name)
assert is_outdated_torch_load(checkpoint_path), f"{checkpoint_path} is not outdated!"

# Refresh the checkpoint with PyTorch >= 1.6 and assert it is no longer outdated
update_torch_checkpoint(checkpoint_path)
assert not is_outdated_torch_load(checkpoint_path), f"{checkpoint_path} is still outdated!"
```
Expected behavior
In order to support mmap, which is faster and allows loading larger models without OOM, all checkpoints should be refreshed with torch.save from PyTorch >= 1.6.
Hi @thiagocrepaldi, thanks for raising this issue!
Unfortunately, it's simply not possible for us to convert all checkpoints to be compatible. There are currently more than 800k models listed on the hub, as well as many private models and models which haven't been uploaded. Backwards compatibility is important in the library and although our currently supported version of pytorch is >= 1.11, enforcing this would likely break many things for many users.
One option would be to open PRs on models on the hub with the converted weights and an explanation of the advantages. It would then be up to the repo's owner whether or not they would like to update the checkpoints. Care would need to be taken to make sure the conversions are correct and to avoid spamming users.
I'd suggest instead just doing this conversion on the fly, as and when you need it.
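For reference, a hedged sketch of what such an on-the-fly conversion could look like, reusing the private `_is_zipfile`/`_open_file_like` helpers from the reproduction above (they are internal to `torch.serialization` and subject to change); `load_with_refresh` is a hypothetical helper name:

```python
import torch
from torch.serialization import _is_zipfile, _open_file_like  # private helpers


def load_with_refresh(path):
    """Load a checkpoint; if it is in the legacy (pre-1.6) format,
    rewrite it in place in the zipfile format so later loads can mmap it."""
    with _open_file_like(path, "rb") as f:
        legacy = not _is_zipfile(f)
    state = torch.load(path)
    if legacy:
        torch.save(state, path)  # one-time refresh into the modern format
    return state
```

After the first call the file is in the new format, so subsequent loads of the same path could pass `mmap=True`.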
Note: the default serialization of weights for models is now safetensors, and we use the safetensors library to open.
cc @Narsil in case I got any of the facts wrong here or there's anything else to add.
Hi @amyeroberts,
- Indeed the number of models might be an issue for such a conversion, but if that is something that the Hugging Face servers can handle, backward compatibility wouldn't be a problem.
- A new file with a different name could be created, say `pytorch_model.bin` -> `pytorch_model_mmap.bin`, for example.
- The numbers would match because we would essentially do `torch.save(torch.load(f), new_file)`, without any possibility of changing the content of the file.
- Doing the conversion on the fly is possible, but defeats the purpose of having a hub with pretrained weights :)
@thiagocrepaldi
> Indeed the number of models might be an issue for such conversion, but if that is something that huggingface servers can handle, Backward Compatibility wouldn't be a problem.

That unfortunately isn't the case. As mentioned in my previous comment, there are private models on the hub which we don't have access to, and models which aren't hosted on the hub at all, which would break.
Thank you. How about the publicly accessible ones? Would it be reasonable to update or add an updated checkpoint alongside the outdated ones?
@thiagocrepaldi As this isn't something we've had requested or mentioned before, I don't think it's worth setting off a massive-scale conversion of weights on the hub, especially as many weights will already be compatible (saved with torch >= 1.6). If this issue gets a lot of attention from the community, then we can reconsider targeting models with a threshold number of downloads.
In the meantime, you are welcome to open PRs on affected models on the hub, updating the weights and explaining the advantages of the conversion. This way the repo owners can decide if this is something they want. You could also open an issue with a list of known affected checkpoints and get other members of the community to help in the effort of opening PRs.
Thank you, I will try proposing individual PRs. It would be nice to establish minimum PyTorch versions for newer models to guarantee performance and compatibility with the newer features provided by PyTorch 2.x.
Feel free to close this issue, if needed