Failed to convert 65B LLaMA to HF weights
System Info
- transformers version: 4.28.0.dev0
- Platform: Linux-5.15.0-69-generic-x86_64-with-glibc2.31
- Python version: 3.9.16
- Huggingface_hub version: 0.13.3
- Safetensors version: not installed
- PyTorch version (GPU?): 2.0.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help?
No response
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
I tried to execute this command to convert the 65B LLaMA weights to the HF format:
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir /directory_contains_a_65B_weights_folder/ \
    --model_size 65B --output_dir /target_directory/65B/
I got a RuntimeError during execution. The weights loaded successfully, but saving failed. I found a similar error message here, but it has no answer. I have checked my disk, and it should have enough space to save the model (223 GB available).
Fetching all parameters from the checkpoint at /scratch/users/xxxxx/65B.
Loading the checkpoint in a Llama model.
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████| 81/81 [03:52<00:00, 2.88s/it]
Saving in the Transformers format.
Traceback (most recent call last):
File "/users/xxxxx/anaconda3/envs/llama/lib/python3.9/site-packages/torch/serialization.py", line 441, in save
_save(obj, opened_zipfile, pickle_module, pickle_protocol)
File "/users/xxxxx/anaconda3/envs/llama/lib/python3.9/site-packages/torch/serialization.py", line 668, in _save
zip_file.write_record(name, storage.data_ptr(), num_bytes)
RuntimeError: [enforce fail at inline_container.cc:471] . PytorchStreamWriter failed writing file data/59: file write failed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/users/xxxxx/transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py", line 279, in <module>
main()
File "/users/xxxxx/transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py", line 267, in main
write_model(
File "/users/xxxxx/transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py", line 230, in write_model
model.save_pretrained(model_path)
File "/users/xxxxx/anaconda3/envs/llama/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1755, in save_pretrained
save_function(shard, os.path.join(save_directory, shard_file))
File "/users/xxxxx/anaconda3/envs/llama/lib/python3.9/site-packages/torch/serialization.py", line 442, in save
return
File "/users/xxxxx/anaconda3/envs/llama/lib/python3.9/site-packages/torch/serialization.py", line 291, in __exit__
self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:337] . unexpected pos 8497872128 vs 8497872024
Expected behavior
I had no issue converting the 7B and 13B models with the same process.
The error comes directly from torch.save, so we can't really help on our side. I have never seen it either :-/
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I also ran into this problem and solved it: it was caused by insufficient hard disk space.
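For future readers: before running the conversion, it is worth verifying that the output directory actually has enough free space. A minimal sketch using Python's `shutil.disk_usage` (the helper name, the 2-bytes-per-parameter fp16 assumption, and the 10% overhead factor are my own, not part of the conversion script):

```python
import shutil

def enough_space(path, n_params, bytes_per_param=2, overhead=1.1):
    """Return True if `path` has room for a checkpoint of n_params
    parameters at bytes_per_param each, plus ~10% extra for
    serialization metadata and shard index files."""
    required = int(n_params * bytes_per_param * overhead)
    free = shutil.disk_usage(path).free
    return free >= required

# 65B parameters in fp16 -> roughly 130 GB before overhead
print(enough_space(".", 65_000_000_000))
```

Note that 223 GB of free space can still be too little if the target filesystem also holds temporary shards during saving, or if quotas apply below the reported free space.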