TorchCheckpointEngine: torch.save using pickle protocol 4

Open nelyahu opened this issue 2 years ago • 0 comments

TorchCheckpointEngine should call torch.save with pickle protocol 4 to allow serialization of large tensors (> 4 GiB). The problem can be reproduced by running the attached files:

  1. Put both files in the same directory.
  2. Rename test_large_tensor_save_cp.txt to test_large_tensor_save_cp.py.
  3. From that directory, run: python test_large_tensor_save_cp.py
  4. Expect the following error message: OverflowError: serializing a string larger than 4 GiB requires pickle protocol 4 or higher

Attached files: deepspeed_vllm_config.json, test_large_tensor_save_cp.txt

Note: This cannot be reproduced with the CPU backend, and I could not check on GPU. It does happen on HPU (Intel Gaudi2). I assume it is related to backends that go through the following flow in PyTorch, which converts the tensor to NumPy format: https://github.com/pytorch/pytorch/blob/4bfaa6bc250f5ff5702703ea237f578a15bbe3b6/torch/_tensor.py#L247
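
For illustration, here is a minimal sketch of the requested change. It is not the actual TorchCheckpointEngine code: save_checkpoint and pickle_protocol_of are hypothetical helper names. It assumes that torch.save keeps defaulting to pickle protocol 2 unless pickle_protocol is passed, and that protocols below 4 store str/bytes lengths in 4 bytes, which would explain the OverflowError above. The stdlib part only shows how to check which protocol a pickle stream was written with; it does not allocate a 4 GiB object.

```python
import pickle

# Assumption: protocols below 4 use 4-byte length fields for str/bytes,
# so a single object larger than 4 GiB cannot be encoded with them.
SAFE_PROTOCOL = 4  # protocol 4 adds opcodes with 8-byte lengths


def pickle_protocol_of(payload: bytes) -> int:
    """Return the protocol number recorded in a pickle stream (protocol >= 2)."""
    # Streams written with protocol >= 2 start with the PROTO opcode (0x80)
    # followed by one byte holding the protocol number.
    if payload[:1] != b"\x80":
        raise ValueError("no PROTO opcode; stream uses protocol 0 or 1")
    return payload[1]


def save_checkpoint(state_dict, path, protocol=SAFE_PROTOCOL):
    """Hypothetical helper: forward an explicit pickle protocol to torch.save."""
    import torch  # imported lazily so the rest of the sketch runs without PyTorch

    torch.save(state_dict, path, pickle_protocol=protocol)


small = b"x" * 16
print(pickle_protocol_of(pickle.dumps(small, protocol=2)))
print(pickle_protocol_of(pickle.dumps(small, protocol=SAFE_PROTOCOL)))
```

With such a helper, a call like save_checkpoint(model.state_dict(), "ckpt.pt") would request protocol 4 explicitly; whether that is enough on the HPU path described above would still need to be verified with the attached reproducer.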

nelyahu, Dec 28 '23 13:12