
[REQUEST] Allow pre-sharding of models using a single process

Open molohov opened this issue 2 years ago • 2 comments

Is your feature request related to a problem? Please describe. When using multiple GPUs for tensor-parallel inference, DeepSpeed loads the model into host memory N times, where N is the number of GPUs. This can quickly exhaust system memory when the model is large. For example, I cannot load opt-66b onto 8 GPUs with tp_size = 8 on a DGX A100 system with 1 TiB of system memory.
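
To make the failure mode concrete: every rank launched by deepspeed runs the same script, so each of the N processes materializes a full copy of the weights in host RAM before DeepSpeed ever shards anything. A minimal sketch of the usual pattern (model name and dtype are illustrative):

# Launched as: deepspeed --num_gpus 8 run.py
import deepspeed
import torch
from transformers import AutoModelForCausalLM

# Each of the 8 ranks executes this line, so the ~130 GB of fp16
# opt-66b weights are held in host memory 8 times (~1 TiB total).
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-66b", torch_dtype=torch.float16
)

# Only after the redundant loads does DeepSpeed shard across the GPUs.
model = deepspeed.init_inference(model, tensor_parallel={"tp_size": 8})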

A solution is to pre-shard the model so that each process only has to load its own shard. However, producing that pre-sharded checkpoint today STILL requires loading the full model into memory N times.

Describe the solution you'd like Allow pre-sharding of a model across tp_size devices using a SINGLE process, so as to avoid loading the model redundantly into memory N times.
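
Conceptually, a single process only needs the weights in memory once: iterate over the state dict and slice each tensor along its tensor-parallel dimension, writing one file per rank. A naive sketch (the helper below is hypothetical, not a DeepSpeed API; real TP sharding also depends on the layer type, splitting column-parallel weights on dim 0, row-parallel weights on dim 1, and replicating norms and biases, which this deliberately ignores):

import os
import torch

def preshard_state_dict(state_dict, tp_size, out_dir):
    # Hypothetical helper: holds the full model in memory exactly
    # once and writes tp_size shard files, one per TP rank.
    os.makedirs(out_dir, exist_ok=True)
    for rank in range(tp_size):
        shard = {}
        for name, tensor in state_dict.items():
            if tensor.dim() == 2 and tensor.size(0) % tp_size == 0:
                rows = tensor.size(0) // tp_size
                shard[name] = tensor[rank * rows:(rank + 1) * rows].clone()
            else:
                shard[name] = tensor  # replicate 1-D / non-divisible params
        torch.save(shard, os.path.join(out_dir, f"tp_rank_{rank:02d}.pt"))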

Additional context If you try to do this today, with this kind of setup:

deepspeed --num_gpus 1 SCRIPT.py --tp_size 8 --checkpoint_path <PATH>

import deepspeed

tp_config = deepspeed.inference.config.DeepSpeedTPConfig(
    enabled=True,
    tp_size=args.tp_size,
)
deepspeed.init_inference(
    model,  # the loaded model to shard (init_inference requires it)
    tensor_parallel=tp_config,
    save_mp_checkpoint_path=args.checkpoint_path,
    ....
)

You'll get this error:

RuntimeError: the new group's world size should be less or equal to the world size set by init_process_group
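
The RuntimeError is expected with this launch: --num_gpus 1 initializes torch.distributed with a world size of 1, and DeepSpeed then calls dist.new_group() over tp_size = 8 ranks, which cannot exceed the existing world size. Until single-process pre-sharding exists, the closest workaround appears to be launching once with num_gpus equal to tp_size and bounding host memory with meta-device loading while save_mp_checkpoint_path writes the sharded copy. A sketch, assuming the model supports kernel injection and that checkpoints.json describes the original weight files:

# Launched as: deepspeed --num_gpus 8 shard.py   (world size == tp_size)
import deepspeed
import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("facebook/opt-66b")

# Meta-device construction allocates no real weight storage, so each
# rank's host-memory footprint stays small; DeepSpeed streams the real
# weights in from the checkpoint while sharding.
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(config)

model = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": 8},
    checkpoint="checkpoints.json",  # assumed: JSON listing the original weight files
    save_mp_checkpoint_path="/path/to/sharded",  # writes the tp-sharded copy once
    replace_with_kernel_inject=True,
)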

molohov · May 17 '23, 21:05

+1

ArlanCooper · Jan 17 '24, 11:01

+1. What if I need to use two GPUs to run inference on the model? How should I write the code?

dsj96 · Apr 24 '24, 08:04
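
For the two-GPU question: plain tensor-parallel inference (without pre-sharding) just needs one process per GPU and tp_size = 2. A minimal sketch, assuming a Hugging Face causal LM (the model name is a placeholder):

# Launched as: deepspeed --num_gpus 2 infer.py
import os
import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "facebook/opt-1.3b"  # placeholder; substitute your model
local_rank = int(os.getenv("LOCAL_RANK", "0"))

tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)

# tp_size=2 splits each weight matrix across the two GPUs. Note that
# the full model is still loaded into host memory once per rank first,
# which is exactly what this issue asks to avoid.
model = deepspeed.init_inference(model, tensor_parallel={"tp_size": 2})

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to(f"cuda:{local_rank}")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))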