Failed to convert 65B LLaMA to HF weights
System Info
- transformers version: 4.28.0.dev0
- Platform: Linux-5.15.0-69-generic-x86_64-with-glibc2.31
- Python version: 3.9.16
- Huggingface_hub version: 0.13.3
- Safetensors version: not installed
- PyTorch version (GPU?): 2.0.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help?
No response
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
I tried to execute this command to convert the 65B LLaMA weights to the HF format:
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir /directory_contains_a_65B_weights_folder/ \
    --model_size 65B --output_dir /target_directory/65B/
I got a RuntimeError during execution. The weights loaded successfully, but saving failed. I found a similar error message here, but it has no answer. I have checked my disk, and it should have enough space to save the model (223 GB available).
Fetching all parameters from the checkpoint at /scratch/users/xxxxx/65B.
Loading the checkpoint in a Llama model.
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████| 81/81 [03:52<00:00, 2.88s/it]
Saving in the Transformers format.
Traceback (most recent call last):
File "/users/xxxxx/anaconda3/envs/llama/lib/python3.9/site-packages/torch/serialization.py", line 441, in save
_save(obj, opened_zipfile, pickle_module, pickle_protocol)
File "/users/xxxxx/anaconda3/envs/llama/lib/python3.9/site-packages/torch/serialization.py", line 668, in _save
zip_file.write_record(name, storage.data_ptr(), num_bytes)
RuntimeError: [enforce fail at inline_container.cc:471] . PytorchStreamWriter failed writing file data/59: file write failed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/users/xxxxx/transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py", line 279, in <module>
main()
File "/users/xxxxx/transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py", line 267, in main
write_model(
File "/users/xxxxx/transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py", line 230, in write_model
model.save_pretrained(model_path)
File "/users/xxxxx/anaconda3/envs/llama/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1755, in save_pretrained
save_function(shard, os.path.join(save_directory, shard_file))
File "/users/xxxxx/anaconda3/envs/llama/lib/python3.9/site-packages/torch/serialization.py", line 442, in save
return
File "/users/xxxxx/anaconda3/envs/llama/lib/python3.9/site-packages/torch/serialization.py", line 291, in __exit__
self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:337] . unexpected pos 8497872128 vs 8497872024
Expected behavior
I had no issue converting the 7B and 13B models with the same process.
The error comes directly from torch.save, so we can't really help on our side. I have never seen it either :-/
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I also ran into this problem and solved it: it was caused by insufficient hard disk space.
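For future readers: before running the conversion, it is worth verifying that the output directory actually has enough free space. A minimal sketch using Python's `shutil.disk_usage` (the helper name, the 2-bytes-per-parameter fp16 assumption, and the 10% overhead factor are my own, not part of the conversion script):

```python
import shutil

def enough_space(path, n_params, bytes_per_param=2, overhead=1.1):
    """Return True if `path` has room for a checkpoint of n_params
    parameters at bytes_per_param each, plus ~10% extra for
    serialization metadata and shard index files."""
    required = int(n_params * bytes_per_param * overhead)
    free = shutil.disk_usage(path).free
    return free >= required

# 65B parameters in fp16 -> roughly 130 GB before overhead
print(enough_space(".", 65_000_000_000))
```

Note that 223 GB of free space can still be too little if the target filesystem also holds temporary shards during saving, or if quotas apply below the reported free space.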