
[Feature Request] Support for token streaming

Open • bharatv007 opened this issue 2 years ago • 4 comments

I could not find this in the docs. Adding token streaming support during generation for GPT models would be great.

bharatv007 avatar Apr 27 '23 19:04 bharatv007

Is there any update on this feature?

yaliqin avatar Nov 09 '23 00:11 yaliqin

@yaliqin, after the release of DeepSpeed-FastGen and MII v0.1, we are working on adding this feature. You can expect it in the coming 1-2 weeks. Thanks!

mrwyattii avatar Nov 09 '23 17:11 mrwyattii

@mrwyattii Thank you for the quick update. I am a little confused about the DeepSpeed family of projects, like DeepSpeed-ZeRO, DeepSpeed-MII, and DeepSpeed-FastGen. Is there any document I can refer to in order to understand the relationships and the evolution roadmap?

yaliqin avatar Nov 09 '23 17:11 yaliqin

You can refer to the current MII landing page and the legacy MII landing page. I will also try to provide a concise summary here:

With the announcement of DeepSpeed-FastGen, DeepSpeed-MII now has two APIs. This is because FastGen involved a complete re-design of our inference engine (and MII). For the time being, we are still supporting both because FastGen is heavily focused on text-generation while the legacy APIs provide support for a wider range of models and tasks. This may change in the future as we continue to develop DeepSpeed-MII. In general, DeepSpeed-MII relies on the DeepSpeed inference engine, which provides many features (e.g., custom high-performance CUDA kernels) to accelerate your models for inference.
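To make the split concrete, here is a rough sketch of the two entry points, adapted from the MII READMEs. The model names are just examples, and depending on your MII version the legacy API may live under a separate namespace, so treat this as illustrative rather than copy-paste ready:

```python
import mii

# --- New API (MII >= 0.1, backed by the DeepSpeed-FastGen engine) ---
# A non-persistent pipeline for quick experimentation:
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
response = pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
print(response)

# --- Legacy API (wider model/task coverage) ---
# Deploy a persistent server, then query it through a handle:
mii.deploy(task="text-generation",
           model="bigscience/bloom-560m",
           deployment_name="bloom560m_deployment")
generator = mii.mii_query_handle("bloom560m_deployment")
result = generator.query({"query": ["DeepSpeed is", "Seattle is"]},
                         max_new_tokens=30)
print(result)
```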

DeepSpeed-ZeRO is the technology we use to reduce memory consumption by sharding optimizer states, gradients, and model parameters across multiple GPUs (or even offloading them to CPU). ZeRO is heavily used in training applications, but we also offer ZeRO-Inference through the legacy MII APIs. ZeRO-Inference does not provide the same performance benefits as our inference engine, but it does allow users to fit very large models in very limited GPU memory.
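For reference, a minimal ZeRO-Inference sketch through the legacy API might look like the following. The `enable_deepspeed`/`enable_zero`/`ds_config` knobs are taken from the legacy MII examples, and the exact config fields may differ between versions:

```python
import mii

# ZeRO-Inference: run generation through ZeRO stage 3 with parameters
# offloaded to CPU, instead of the fused DeepSpeed inference kernels.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu"},  # fit large models in small GPU memory
    },
    "train_micro_batch_size_per_gpu": 1,  # required by the config schema, unused at inference
}

mii.deploy(task="text-generation",
           model="bigscience/bloom-560m",
           deployment_name="bloom_zero_inference",
           enable_deepspeed=False,  # skip the DeepSpeed inference kernels...
           enable_zero=True,        # ...and use ZeRO-Inference instead
           ds_config=ds_config)

generator = mii.mii_query_handle("bloom_zero_inference")
print(generator.query({"query": ["DeepSpeed is"]}, max_new_tokens=30))
```

The trade-off is the one described above: throughput and latency are worse than with the inference engine, but the CPU offload lets a model far larger than a single GPU's memory still run.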

Please let me know if you have any questions! We have certainly developed many features in the DeepSpeed ecosystem over the past few years, and I realize it can be a little confusing :)

mrwyattii avatar Nov 09 '23 18:11 mrwyattii