DeepSpeed
TorchCheckpointEngine: torch.save using pickle protocol 4
to allow serialization of large tensors (> 4 GiB).
The failure can be reproduced by running the attached files:
- Put both files in the same directory.
- Rename test_large_tensor_save_cp.txt to test_large_tensor_save_cp.py.
- Run the script from that directory:
python test_large_tensor_save_cp.py
The expected error message is:
OverflowError: serializing a string larger than 4 GiB requires pickle protocol 4 or higher
Attached files: deepspeed_vllm_config.json, test_large_tensor_save_cp.txt
Note: The error could not be reproduced with the CPU backend, and I could not check on GPU. It does happen on HPU (Intel Gaudi2). I assume it is related to backends that go through the following flow in PyTorch, which converts the tensor into NumPy format: https://github.com/pytorch/pytorch/blob/4bfaa6bc250f5ff5702703ea237f578a15bbe3b6/torch/_tensor.py#L247
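
For illustration only, here is a minimal sketch of what selecting the pickle protocol changes. It uses the standard-library pickle module as a stand-in for torch.save, which accepts a pickle_protocol keyword, so the analogous call would be torch.save(state_dict, path, pickle_protocol=4). The helper names dump and stream_protocol are hypothetical, and the payload is a small placeholder rather than a multi-GiB tensor, so this does not trigger the OverflowError itself.

```python
import io
import pickle
import pickletools


def dump(obj, protocol):
    """Pickle obj into memory with an explicit protocol (stand-in for torch.save)."""
    buf = io.BytesIO()
    pickle.dump(obj, buf, protocol=protocol)
    return buf.getvalue()


def stream_protocol(data):
    """Return the version recorded by the leading PROTO opcode, or None if absent."""
    opcode, arg, _pos = next(pickletools.genops(data))
    return arg if opcode.name == "PROTO" else None


# Small placeholder for a tensor buffer; the real failure needs more than 4 GiB.
payload = {"weights": bytes(1024)}

old = dump(payload, protocol=2)  # protocol without 8-byte length fields
new = dump(payload, protocol=4)  # protocol that supports objects larger than 4 GiB

print(stream_protocol(old), stream_protocol(new))
```

Protocol 4 adds 8-byte length fields for large objects, which is why the error message asks for protocol 4 or higher once a single serialized object exceeds 4 GiB.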