Many checkpoints are outdated (torch.save'd with torch < 1.6) and don't support mmap
System Info
- `transformers` version: 4.36.0
- Platform: Linux-6.5.0-15-generic-x86_64-with-glibc2.31
- Python version: 3.11.5
- Huggingface_hub version: 0.19.4
- Safetensors version: 0.4.0
- Accelerate version: 0.24.1
- Accelerate config: not found
- PyTorch version (GPU?): 2.3.0a0+git78a84f1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help?
No response
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
The older checkpoint format (PyTorch < 1.6) didn't support mmap, which is recommended for faster loading and, especially, for loading large models without exhausting memory.
```python
import os

import torch
from huggingface_hub import snapshot_download
from torch.serialization import _is_zipfile, _open_file_like


def is_outdated_torch_load(f):
    # Checkpoints saved with torch >= 1.6 use the zipfile format
    with _open_file_like(f, "rb") as opened_file:
        return not _is_zipfile(opened_file)


def update_torch_checkpoint(f):
    # Re-save in the current (zipfile) format
    res = torch.load(f)
    torch.save(res, f)


# Download a known outdated checkpoint
model_name = "sshleifer/tiny-gpt2"
checkpoint_name = "pytorch_model.bin"
ret = snapshot_download(
    repo_id=model_name,
    allow_patterns=checkpoint_name,
    local_dir_use_symlinks=True,
    local_dir="./",
)

# Load and assert it is outdated (i.e. torch.save was used from PyTorch <= 1.5)
checkpoint_path = os.path.join(ret, checkpoint_name)
assert is_outdated_torch_load(checkpoint_path), f"{checkpoint_path} is not outdated!"

# Refresh the checkpoint with PyTorch >= 1.6 and assert it is no longer outdated
update_torch_checkpoint(checkpoint_path)
assert not is_outdated_torch_load(checkpoint_path), f"{checkpoint_path} is still outdated!"
```
Expected behavior
In order to support mmap, which is faster and allows loading larger models without OOM, all checkpoints should be refreshed with torch.save from PyTorch >= 1.6.
Hi @thiagocrepaldi, thanks for raising this issue!
Unfortunately, it's simply not possible for us to convert all checkpoints to be compatible. There are currently more than 800k models listed on the hub, as well as many private models and models which haven't been uploaded. Backwards compatibility is important in the library and although our currently supported version of pytorch is >= 1.11, enforcing this would likely break many things for many users.
One option would be to open PRs on models on the hub with the converted weights and an explanation of the advantages. It would then be up to the repo's owner whether or not they would like to update the checkpoints. Care would need to be taken to make sure the conversions are correct and to avoid spamming users.
I'd suggest instead just doing this conversion on the fly, as and when you need it.
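For reference, a hedged sketch of what such an on-the-fly conversion could look like, reusing the private `_is_zipfile`/`_open_file_like` helpers from the reproduction above (they are internal to `torch.serialization` and subject to change); `load_with_refresh` is a hypothetical helper name:

```python
import torch
from torch.serialization import _is_zipfile, _open_file_like  # private helpers


def load_with_refresh(path):
    """Load a checkpoint; if it is in the legacy (pre-1.6) format,
    rewrite it in place in the zipfile format so later loads can mmap it."""
    with _open_file_like(path, "rb") as f:
        legacy = not _is_zipfile(f)
    state = torch.load(path)
    if legacy:
        torch.save(state, path)  # one-time refresh into the modern format
    return state
```

After the first call the file is in the new format, so subsequent loads of the same path could pass `mmap=True`.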
Note: the default serialization of weights for models is now safetensors, and we use the safetensors library to open.
cc @Narsil in case I got any of the facts wrong here or there's anything else to add.
Hi @amyeroberts,
- Indeed the number of models might be an issue for such a conversion, but if that is something that the Hugging Face servers can handle, backward compatibility wouldn't be a problem.
- A new file with a different name could be created, say `pytorch_model.bin` -> `pytorch_model_mmap.bin`, for example.
- The numbers would match because we would essentially do `torch.save(torch.load(f), new_file)`, without any possibility of changing the content of the file.
- Doing the conversion on the fly is possible, but defeats the purpose of having a hub with pretrained weights :)
@thiagocrepaldi
> Indeed the number of models might be an issue for such conversion, but if that is something that huggingface servers can handle, Backward Compatibility wouldn't be a problem.

That unfortunately isn't the case. As mentioned in my previous comment, there are private models on the hub which we don't have access to, and models which aren't hosted on the hub at all, which would break.
Thank you. How about the publicly accessible ones? Would it be reasonable to update or add an updated checkpoint alongside the outdated ones?
@thiagocrepaldi As this isn't something we've had requested or mentioned before, I don't think it's worth setting off a massive-scale conversion of weights on the hub, especially as many weights will already be compatible (saved with torch >= 1.6). If this issue gets a lot of attention from the community, then we can reconsider targeting models with a threshold number of downloads.
In the meantime, you are welcome to open PRs on affected models on the hub, updating the weights and explaining the advantages of the conversion. This way the repo owners can decide if this is something they want. You could also open an issue with a list of known affected checkpoints and get other members of the community to help in the effort of opening PRs.
Thank you, I will try proposing individual PRs. It would be nice to establish minimum PyTorch versions for newer models to guarantee performance and compatibility with the newer features provided by PyTorch 2.x.
Feel free to close this issue, if needed