[REQUEST] Code sample to use DeepSpeed inference without having to run deepspeed cmd line in production set
Is your feature request related to a problem? Please describe. The current examples for DeepSpeed inference use the 'deepspeed' command line, which internally uses DeepSpeed's launcher modules to initialize the NCCL/MPI backends and discover the ranks, world size, etc. Running inference from the command line is convenient for a developer, but it is not practical in production. Most production code serves inference in real time through a serving stack and initializes frameworks like DeepSpeed via Python packages; it cannot rely on launching a command-line executable for each inference request.
Describe the solution you'd like Provide a clear example of using DeepSpeed inference without the 'deepspeed' command line. Show in that example how to initialize the backend with deepspeed.init_distributed(), call deepspeed.init_inference(), and avoid the launcher entirely. I called deepspeed.init_distributed() in my code, but the backend always fails to initialize even though I set all the right environment variables (RANK, LOCAL_RANK, WORLD_SIZE, etc.).
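Something along the following lines is what I have in mind. This is only a sketch, not working code: the model name 'gpt2', the port, and the mp_size / replace_with_kernel_inject arguments are placeholders/assumptions (the init_inference signature differs across DeepSpeed releases) and may need adjusting.

```python
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

# Environment the `deepspeed` launcher would normally export; for a
# single-GPU serving process these values can be set by hand.
os.environ.setdefault("RANK", "0")
os.environ.setdefault("LOCAL_RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")  # placeholder port

# Initialize the NCCL process group without the launcher.
deepspeed.init_distributed(dist_backend="nccl")

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# 'gpt2' is only a placeholder; load whatever model you actually serve.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Wrap the model with the DeepSpeed inference engine.
engine = deepspeed.init_inference(
    model,
    mp_size=int(os.environ["WORLD_SIZE"]),  # tensor-parallel degree
    dtype=torch.half,
    replace_with_kernel_inject=True,
)

# The wrapped module can then be called from the serving code path.
inputs = tokenizer("DeepSpeed inference without the launcher", return_tensors="pt").to(f"cuda:{local_rank}")
outputs = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```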
Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered.
Additional context Add any other context or screenshots about the feature request here.
I also agree with you. @dhawalkp, have you found any solution for this?
@tahercoolguy @dhawalkp I'm also looking for a way to initialise an inference engine without using the command line. Also, were you able to distribute the model across multiple GPUs?
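For the multi-GPU part, I've been experimenting with something like the sketch below: one process per GPU via torch.multiprocessing, with mp_size set to the number of ranks. The placeholder 'gpt2' model, the port, and the mp_size argument are my assumptions (newer DeepSpeed releases expose the tensor-parallel degree differently), so I'm not sure this is the intended approach.

```python
import os
import torch
import torch.multiprocessing as mp
import deepspeed
from transformers import AutoModelForCausalLM

def worker(local_rank: int, world_size: int):
    # Per-process environment the `deepspeed` launcher would otherwise set up.
    os.environ["RANK"] = str(local_rank)
    os.environ["LOCAL_RANK"] = str(local_rank)
    os.environ["WORLD_SIZE"] = str(world_size)
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"  # placeholder port

    deepspeed.init_distributed(dist_backend="nccl")
    torch.cuda.set_device(local_rank)

    model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model
    # mp_size > 1 asks DeepSpeed to shard the model (tensor parallelism)
    # across the participating ranks.
    engine = deepspeed.init_inference(
        model,
        mp_size=world_size,
        dtype=torch.half,
        replace_with_kernel_inject=True,
    )
    # ... run generation on engine.module from the serving code here ...

if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    mp.spawn(worker, args=(n_gpus,), nprocs=n_gpus)
```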
When I pass --num-gpus
@dhawalkp @sanxchep Did you ever find a solution for this?
@ChrisStormm, @dhawalkp, @sanxchep how about the following HF integration example? https://github.com/huggingface/transformers-bloom-inference/blob/main/bloom-inference-scripts/bloom-ds-inference.py