[REQUEST] Allow pre-sharding of models using a single process
Is your feature request related to a problem? Please describe. When using multiple GPUs for tensor-parallel inference, DeepSpeed loads the model into memory N times, where N is the number of GPUs. This can quickly exhaust system memory when the model is large. For example, I cannot load opt-66b onto 8 GPUs with tp_size = 8 on a DGX A100 system with 1 TiB of system memory.
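(For scale: opt-66b in fp16 is roughly 66e9 parameters x 2 bytes, about 132 GB per copy, so 8 processes each holding a full copy need on the order of 1.05 TB of host RAM, which is essentially the whole 1 TiB before counting the OS, CUDA contexts, and any temporary buffers created while loading the checkpoint.)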
A solution is to pre-shard the model so that each process doesn't need to load the entire model into memory, but in order to do this, one STILL has to load the model into memory N times.
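For reference, the existing pre-sharding flow would be launched with one process per GPU, e.g. something like this (assuming the same SCRIPT.py shown under "Additional context" below):
deepspeed --num_gpus 8 SCRIPT.py --tp_size 8 --checkpoint_path <PATH>
and each of those 8 processes loads the full model before save_mp_checkpoint_path writes out the shards.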
Describe the solution you'd like Allow pre-sharding of a model across tp_size devices using a SINGLE process, so as to avoid loading the model redundantly into memory N times.
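To make the request concrete, here is a minimal single-process sketch of the idea in plain PyTorch. This is not DeepSpeed's actual API or mp-checkpoint format; preshard_state_dict, the naive column-only split, and the output file naming are all illustrative assumptions. The point is that only one full copy of the model is ever resident while the tp_size shard files are produced:

import torch
import torch.nn as nn

def preshard_state_dict(model: nn.Module, tp_size: int, out_prefix: str) -> None:
    """Split a model's weights into tp_size shards from a single process.

    Hypothetical illustration only: a real tensor-parallel layout also needs
    row-parallel splits and special handling for embeddings, fused QKV
    weights, biases, etc.
    """
    shards = [{} for _ in range(tp_size)]
    for name, param in model.state_dict().items():
        if param.dim() == 2:
            # Naive column-parallel split along the output dimension.
            for rank, piece in enumerate(torch.chunk(param, tp_size, dim=0)):
                shards[rank][name] = piece.clone()
        else:
            # Replicate everything that is not sharded (norms, 1-D biases, ...).
            for rank in range(tp_size):
                shards[rank][name] = param.clone()
    for rank, shard in enumerate(shards):
        # One file per tensor-parallel rank; each inference process would
        # later load only its own file.
        torch.save(shard, f"{out_prefix}_tp{rank:02d}.pt")

if __name__ == "__main__":
    # Toy model standing in for opt-66b; the real use case would load the
    # full checkpoint here exactly once.
    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
    preshard_state_dict(model, tp_size=8, out_prefix="toy_model")

With something along these lines, peak host memory is roughly one full copy of the model plus one shard being written, instead of tp_size full copies.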
Additional context If you try to do this today, with this kind of setup:
deepspeed --num_gpus 1 SCRIPT.py --tp_size 8 --checkpoint_path <PATH>
import deepspeed

# Tensor-parallel config: shard across tp_size GPUs
tp_config = deepspeed.inference.config.DeepSpeedTPConfig(
    enabled=True,
    tp_size=args.tp_size,
)
# Ask DeepSpeed to write out a pre-sharded (mp) checkpoint
deepspeed.init_inference(
    tensor_parallel=tp_config,
    save_mp_checkpoint_path=args.checkpoint_path,
    ....
)
You'll get this error:
RuntimeError: the new group's world size should be less or equal to the world size set by init_process_group
This happens because the launcher only started a single process (world size 1), while init_inference with tp_size = 8 tries to create a tensor-parallel process group of 8 ranks, which torch.distributed rejects.
+1
+1. What if I need to use two GPUs to run inference on the model? How should I write the code?