[BUG] No `universal_checkpoint_info` in the Accelerate+Deepspeed Checkpoint
I trained a model using Accelerate + DeepSpeed ZeRO-2 and got a ZeRO-2 checkpoint. The checkpoint structure is listed below, and this is the Google Drive link to my checkpoint.
checkpoint-3/
├── config.json
├── generation_config.json
├── global_step3
│ ├── bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
│ ├── bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
│ ├── bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt
│ ├── bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt
│ └── mp_rank_00_model_states.pt
├── latest
├── model.safetensors
├── rng_state_0.pth
├── rng_state_1.pth
├── rng_state_2.pth
├── rng_state_3.pth
├── scheduler.pt
├── trainer_state.json
├── training_args.bin
└── zero_to_fp32.py
I tried to convert this ZeRO-2 checkpoint to the universal format using ds_to_universal.py but encountered errors:
args = Namespace(input_folder='experiment_ckpts/tinyllama_expanded_frez_embed-2024-04-16-010251/checkpoint-3', output_folder='experiment_ckpts/tinyllama_expanded_frez_embed-2024-04-16-010251/checkpoint-3_universal', num_extract_workers=10, num_merge_workers=10, keep_temp_folder=False, strict=True)
Convert DeepSpeed Checkpoint to Universal Checkpoint
Converting DeepSpeed checkpoint in experiment_ckpts/tinyllama_expanded_frez_embed-2024-04-16-010251/checkpoint-3 to Universal checkpoint in experiment_ckpts/tinyllama_expanded_frez_embed-2024-04-16-010251/checkpoint-3_universal
Traceback (most recent call last):
File "dist_env_tools/ds_to_universal.py", line 363, in <module>
main(args)
File "dist_env_tools/ds_to_universal.py", line 320, in main
_check_for_required_state(ds_checkpoint)
File "dist_env_tools/ds_to_universal.py", line 311, in _check_for_required_state
assert universal_checkpoint_info is not None, f'Required {UNIVERSAL_CHECKPOINT_INFO} state is missing in checkpoint. Verify that client creates this state.'
AssertionError: Required universal_checkpoint_info state is missing in checkpoint. Verify that client creates this state.
It seems the checkpoint structure is a bit different from the Universal Checkpoint examples in Megatron-DeepSpeed.
May I ask how I can find the universal_checkpoint_info in my checkpoint?
@Orion-Zheng, this is expected because universal checkpointing requires some metadata to be saved by the client in the checkpoint. At this time, we have only modified the Megatron-DeepSpeed client to save the required metadata. Similar changes need to be applied to the HF trainer checkpoint save logic. If you have bandwidth to work on this, I think it would have a great impact by enabling universal checkpointing for HF training.
Thank you. I also think this would be very impactful work because so many people use the Hugging Face Trainer now. 😃 After this month I think I will have some bandwidth to do this. I am familiar with the Trainer's save logic but currently not very familiar with DeepSpeed's and Megatron's. I will try to read the code by myself first and ask you if I still run into barriers.
I've encountered the same error for a checkpoint saved with PyTorch Lightning + DeepSpeed, so this ds_to_universal.py script doesn't support PyTorch Lightning either?
Hi @tjruwase, I tried adding the UNIVERSAL_CHECKPOINT_INFO to the client_state, and ds_to_universal.py works fine:
{
    'universal_checkpoint_info': {
        'universal_checkpoint_version': 0.2
    }
}
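For reference, if you save directly through a DeepSpeed engine, I believe this metadata can be attached via the client_state argument of save_checkpoint. A minimal sketch, assuming model_engine is the engine returned by deepspeed.initialize and the save_dir/tag values match the checkpoint layout above:
# Minimal sketch: pass the dict above as client_state at save time.
# `model_engine` and the save_dir/tag values are placeholders from my setup.
universal_info = {'universal_checkpoint_info': {'universal_checkpoint_version': 0.2}}
model_engine.save_checkpoint('checkpoint-3', tag='global_step3', client_state=universal_info)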
Then how do I load this universal folder into the model? I find that when using Megatron-DeepSpeed, there's a flag called universal-checkpoint, and the only usage of it I've found in Megatron-DeepSpeed is to set ds_config_dict["checkpoint"] = {"load_universal": True}.
However, I'm still confused about how to load the universal checkpoint folder.
Any hint or instruction is welcome!
Thank you for your attention! Looking forward to your reply!
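In the meantime, my best guess, which I haven't been able to verify, is that loading goes through the normal DeepSpeed load path once that flag is set in the config, roughly like below; ds_config, model, and universal_ckpt_dir are placeholders from my setup.
# Untested guess at the loading side, based only on the load_universal flag above.
import deepspeed
ds_config["checkpoint"] = {"load_universal": True}
model_engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config)
# I'm not sure whether the tag or folder layout needs adjusting for the converted checkpoint.
model_engine.load_checkpoint(universal_ckpt_dir, tag='global_step3')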
@Orion-Zheng Could you provide the scripts you used for training? I would be happy to help solve the issue.
I think the point is not the training scripts; ds_to_universal.py checks whether there is a universal_checkpoint_info key in the checkpoint:
https://github.com/microsoft/DeepSpeed/blob/b22706a7211366abf2df98a0d118ea1d3a837e21/deepspeed/checkpoint/ds_to_universal.py#L347-L349
Aware of this, I fooled the script by adding the key (without any real content) to the checkpoint, and it works, as the comment above says.
However, since I couldn't check the Megatron-DeepSpeed repo due to an environment installation failure, I don't know what the exact value of the universal_checkpoint_info key should be, or whether this hack could affect performance.
And more importantly, I've got a directory that I don't know how to load :(
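In case it's useful to others, here is a sketch of one way to inject the key into an already-saved checkpoint; this assumes ds_to_universal.py reads it from mp_rank_00_model_states.pt, which seems to be where client_state ends up, and the path is from the layout above.
import torch
# Workaround sketch: inject a minimal universal_checkpoint_info into an existing checkpoint.
path = 'checkpoint-3/global_step3/mp_rank_00_model_states.pt'
sd = torch.load(path, map_location='cpu')
sd['universal_checkpoint_info'] = {'universal_checkpoint_version': 0.2}
torch.save(sd, path)
With the key present, the assertion linked above should pass.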
@Orion-Zheng This PR should fix the issue you mentioned (universal checkpoint does not support HF trainer). Feel free to ping me if you have any questions or suggestions on this PR.
Wow, great! I will try it later and get back to you. 😃 Many thanks for your work!
Hello @xylian86, I was previously using the HF Trainer. Why doesn't the universal checkpoint support the HF Trainer? Is there any way to load the universal checkpoint? Do I have to switch training frameworks to DeepSpeed?
Edit: I am using the HF LR scheduler + DS optimizer for training. I've managed to load the universal checkpoint by forcing load_universal_checkpoint to return True, but the training loop exits silently after the first iteration.
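For what it's worth, here is roughly what I mean by forcing it, plus the config-based alternative mentioned earlier in this thread; I can't confirm either is the intended route for the HF Trainer.
# Rough sketch; `ds_config` is a placeholder for the dict passed via TrainingArguments(deepspeed=...).
import deepspeed
from transformers import TrainingArguments
# 1) Monkey-patch the engine so load_universal_checkpoint always reports True.
deepspeed.DeepSpeedEngine.load_universal_checkpoint = lambda self: True
# 2) The config flag mentioned earlier in the thread (untested on my side).
ds_config["checkpoint"] = {"load_universal": True}
training_args = TrainingArguments(output_dir="out", deepspeed=ds_config)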