[BUG] can not initialize DeepSpeed-Inference engine with deepspeed.init_inference()

Open Jirigesi opened this issue 3 years ago • 2 comments

Hello, I am a new user of DeepSpeed (DS) and I successfully trained checkpoints using DS. However, I ran into an issue when trying to use a checkpoint for inference. I am trying to follow the tutorial, but whether I point checkpoint.json at the folder containing the *.pt files or at a .pt file directly, I always get this error:

Traceback (most recent call last):
  File "deepspeed_infer2.py", line 28, in <module>
    ds_engine = deepspeed.init_inference(model,
  File "/home/jirigesi/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/__init__.py", line 288, in init_inference
    engine = InferenceEngine(model,
  File "/home/jirigesi/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 134, in __init__
    self._apply_injection_policy(
  File "/home/jirigesi/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 316, in _apply_injection_policy
    checkpoint = SDLoaderFactory.get_sd_loader_json(
  File "/home/jirigesi/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/state_dict_factory.py", line 23, in get_sd_loader_json
    ckpt_list = data['checkpoints']
KeyError: 'checkpoints'
[2022-07-27 22:48:51,258] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 93887
[2022-07-27 22:48:51,258] [ERROR] [launch.py:184:sigkill_handler] ['/home/jirigesi/anaconda3/envs/deepspeed/bin/python', '-u', 'deepspeed_infer2.py', '--local_rank=0'] exits with return code = 1

This is my checkpoint.json:

{
    "type": "DeepSpeed",
    "version": 0.3,
    "checkpoint_path": "./ds_models/global_step1/mp_rank_00_model_states.pt"
}

This is the code I used to get the inference engine:

# Initialize the DeepSpeed-Inference engine
ds_engine = deepspeed.init_inference(model,
                                     dtype=torch.half,
                                     checkpoint="checkpoint.json",
                                     replace_method='auto',
                                     replace_with_kernel_inject=True)

I can use another approach to load the checkpoint:

# Initialize the DeepSpeed engine (deepspeed.initialize, not init_inference)
model_engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config
)

# Load the checkpoint
load_dir = '../results/ds_models/global_step226'
_, client_sd = model_engine.load_checkpoint(load_dir)

and use this new model_engine for inference. I am not sure what the difference between the two methods is, or why the first approach is not working.

Jirigesi, Jul 27 '22 22:07

@Jirigesi Thanks for using DeepSpeed! I believe the problem when using init_inference is that your checkpoint.json is missing the 'checkpoints' key, which is what the traceback reports:

KeyError: 'checkpoints'

Try replacing checkpoint_path with checkpoints, and make its value a list of checkpoint files:

{
    "type": "DeepSpeed",
    "version": 0.3,
    "checkpoints": ["./ds_models/global_step1/mp_rank_00_model_states.pt"]
}
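A quick way to catch this before launching is to check the file yourself. The following is a minimal sketch, not part of DeepSpeed: the helper names checkpoint_files and load_checkpoint_json are hypothetical, and the only thing it assumes about the loader is what the traceback shows, namely that it reads data['checkpoints'].

```python
import json

# The key the traceback shows being read (ckpt_list = data['checkpoints']).
REQUIRED_KEY = "checkpoints"


def checkpoint_files(config):
    """Return the checkpoint file list from a parsed checkpoint.json dict.

    Hypothetical helper: raises a descriptive error instead of the bare
    KeyError seen in the traceback.
    """
    if REQUIRED_KEY not in config:
        raise KeyError(
            f"checkpoint.json has no '{REQUIRED_KEY}' key; "
            f"found keys: {sorted(config)}"
        )
    files = config[REQUIRED_KEY]
    if not isinstance(files, list) or not files:
        raise ValueError(f"'{REQUIRED_KEY}' should be a non-empty list of paths")
    return files


def load_checkpoint_json(path):
    """Read checkpoint.json from disk and validate it."""
    with open(path) as f:
        return checkpoint_files(json.load(f))
```

Calling load_checkpoint_json("checkpoint.json") on the original file from this issue would raise the KeyError with the list of keys actually present, which makes the checkpoint_path vs. checkpoints mix-up easier to spot.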

mrwyattii, Aug 01 '22 17:08

Hello @Jirigesi,

Apologies for the delayed follow-up to your issue. The inference tutorial is slightly out of date with the code. For checkpoint loading to work using a checkpoint.json as described in the tutorial, replace_with_kernel_inject must be False due to this check in the InferenceEngine: https://github.com/microsoft/DeepSpeed/blob/58a4a4d4c19bda86d489ac171fa10f3ddb27c9d6/deepspeed/inference/engine.py#L95. This check was added in GH-2083 along with the Meta Tensors feature, which uses "meta tensors" to initialize the model, then loads the weights after module replacement.
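In terms of the call from the original post, the workaround described above amounts to passing replace_with_kernel_inject=False alongside the checkpoint argument. A minimal sketch follows; the helper name inference_kwargs is hypothetical, it only uses arguments that already appear in the original call, and it has not been checked against every DeepSpeed version.

```python
def inference_kwargs(checkpoint_json):
    """Build keyword arguments for deepspeed.init_inference.

    Hypothetical helper for illustration only.
    """
    return {
        "checkpoint": checkpoint_json,
        # Per the explanation above, loading from checkpoint.json
        # requires kernel injection to be disabled.
        "replace_with_kernel_inject": False,
    }


# Usage (assumes deepspeed and torch are installed and `model` is built):
#   import deepspeed, torch
#   ds_engine = deepspeed.init_inference(
#       model, dtype=torch.half, **inference_kwargs("checkpoint.json"))
```

Keeping the arguments in a small function makes it easy to flip the flag back once the check in the InferenceEngine changes.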

The GH-2940 draft PR changes the InferenceEngine check linked above to test more explicitly for meta tensor usage, allowing checkpoints to be loaded as described in the tutorial. We're also looking to update the tutorial to reflect the current state of checkpoint loading.

Please let us know if you have any additional questions!

Thanks, Lev

lekurile, Mar 08 '23 19:03