# ScaleLLM Roadmap
This roadmap presents the features we're currently working on and planning to support. Your feedback is highly valued, so please don't hesitate to comment or reach out if there's anything you'd like to add or discuss. We're committed to delivering the best possible experience with ScaleLLM.
## Q1-Q2 2024
### Efficiency
- [x] Adding flash decoding with paged KV cache support [Done]
- [ ] Introducing an attention kernel capable of supporting speculative decoding [Ongoing]
- [ ] Exploring the feasibility of adopting the flashinfer library [Ongoing]
- [x] Implementing speculative decoding (see the sketch after this list) [Done]
- [x] Enabling CUDA graph for decoding to improve performance [Done]
- [x] Implementing dynamic split-fuse to reduce latency [Done]
- [ ] Exploring lookahead decoding support
- [ ] Implementing fused FFN (Feed-Forward Network) to enhance efficiency
- [ ] Introducing a ring attention mechanism for handling long contexts
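To make the speculative decoding item above concrete: a cheap draft model proposes a few tokens ahead, and the target model verifies them, keeping the longest agreed prefix. Below is a minimal sketch of the greedy variant; `draft_next` and `target_next` are hypothetical callables, not ScaleLLM's internals, and a real engine verifies all proposals in a single batched forward pass.

```python
from typing import Callable, List

def speculative_step(
    prefix: List[int],
    draft_next: Callable[[List[int]], int],   # hypothetical: cheap draft model (greedy)
    target_next: Callable[[List[int]], int],  # hypothetical: target model (greedy)
    k: int = 4,
) -> List[int]:
    """One round of draft-then-verify speculative decoding (greedy variant)."""
    # 1) Draft phase: propose k tokens autoregressively with the cheap model.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2) Verify phase: keep the longest prefix the target model agrees with.
    #    A real engine scores all k positions in ONE batched forward pass;
    #    we loop here only for clarity.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target_next(ctx)
        if expected != tok:
            accepted.append(expected)  # target's own token replaces the first miss
            break
        accepted.append(tok)
        ctx.append(tok)
    else:
        # All k proposals accepted: the target still yields one bonus token.
        accepted.append(target_next(ctx))
    return accepted

# Toy demo with deterministic stand-in "models".
def toy_target(ctx: List[int]) -> int:
    return (len(ctx) * 7) % 13

def toy_draft(ctx: List[int]) -> int:
    return toy_target(ctx) if len(ctx) % 2 == 0 else 0  # agrees only sometimes

print(speculative_step([1, 2, 3], toy_draft, toy_target, k=4))  # >= 1 token per round
```

Every round emits at least one token, so the worst case matches plain decoding while agreement between the two models yields up to k+1 tokens per target pass.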
### Cache
- [x] Implementing stateful conversation to avoid recomputing for chat sessions [Done]
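The idea behind the stateful-conversation item is to key the prompt-prefix cache by session, so a follow-up chat turn only recomputes the newly appended tokens. A toy sketch of the bookkeeping, assuming a hypothetical `compute_kv` stand-in for one token's KV computation; real engines store paged KV blocks on the GPU.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

def compute_kv(token: int) -> str:
    """Hypothetical stand-in for computing one token's attention KV entry."""
    return f"kv({token})"

@dataclass
class SessionCache:
    """Toy stateful-conversation cache: per-session token prefix plus its 'KV'."""
    sessions: Dict[str, Tuple[List[int], List[str]]] = field(default_factory=dict)

    def extend(self, session_id: str, tokens: List[int]) -> int:
        """Append this turn's tokens; return how many were served from cache."""
        cached_tokens, kv = self.sessions.get(session_id, ([], []))
        n = len(cached_tokens)
        # Reuse the cache only if the request still starts with the cached
        # prefix (i.e., the conversation history was not edited).
        if tokens[:n] != cached_tokens:
            cached_tokens, kv, n = [], [], 0
        for tok in tokens[n:]:  # recompute only the new suffix
            kv.append(compute_kv(tok))
        self.sessions[session_id] = (list(tokens), kv)
        return n

cache = SessionCache()
print(cache.extend("chat-1", [1, 2, 3]))        # 0 cached, computes 3 tokens
print(cache.extend("chat-1", [1, 2, 3, 4, 5]))  # 3 cached, computes only 2
```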
### New Models
- [x] Integrating Google Gemma [Done]
- [x] Integrating Llama3 [Done]
- [ ] Incorporating the Mixtral MoE model [Ongoing]
- [ ] Implementing MoE (Mixture of Experts) kernels (routing sketch after this list)
- [ ] Introducing the Mamba model
- [ ] Introducing multi-modal models [Ongoing]
  - [ ] LLaVA model
- [ ] LoRA & QLoRA
  - [ ] S-LoRA: Serving thousands of LoRA adapters
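For the MoE kernels item above, the core math is top-k expert routing: a learned gate scores experts per token, the top k are run, and their outputs are combined with renormalized gate weights. A minimal PyTorch sketch of that routing follows; a fused kernel would avoid the Python loop over experts, and nothing here is ScaleLLM's actual kernel code.

```python
import torch

def moe_forward(x, router_w, experts, k=2):
    """Minimal top-k MoE layer: route each token to k experts and combine
    their outputs, weighted by the renormalized gate scores."""
    gate_logits = x @ router_w                      # [tokens, num_experts]
    gate_probs = torch.softmax(gate_logits, dim=-1)
    weights, idx = torch.topk(gate_probs, k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over top-k
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):            # fused kernels avoid this loop
        mask = idx == e                             # [tokens, k]
        if mask.any():
            routed = mask.any(dim=-1)               # tokens routed to expert e
            w = (weights * mask).sum(dim=-1)[routed]  # gate weight per routed token
            out[routed] += w.unsqueeze(-1) * expert(x[routed])
    return out

torch.manual_seed(0)
d, num_experts = 8, 4
experts = [torch.nn.Linear(d, d) for _ in range(num_experts)]
x = torch.randn(5, d)
router_w = torch.randn(d, num_experts)
print(moe_forward(x, router_w, experts).shape)  # torch.Size([5, 8])
```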
### New Devices
- [ ] Adding support for Apple chips
- [ ] Exploring other chips, such as TPUs
### Usability
- [x] Developing a Python wrapper for easier integration (usage example after this list) [Done]
- [ ] Enhancing documentation for improved usability [Ongoing]
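For the Python wrapper item above, the simplest end-to-end example goes through the server's OpenAI-compatible HTTP endpoint. This is a hedged sketch: the host, port, and model name below are placeholders, so check the ScaleLLM docs for the values matching your deployment.

```python
import requests

# Assumes a running ScaleLLM server exposing an OpenAI-compatible endpoint
# on localhost:8080; host, port, and model name are placeholders.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32,
        "temperature": 0.7,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```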
### New GPU Architecture
- [ ] Turing architecture (sm75)
### Structured Decoding
- [ ] Function Calling
### Quantization
- [ ] Supporting FP8 for both models and KV caches
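FP8 support largely comes down to per-tensor scaling: pick a scale so the largest magnitude maps to e4m3's maximum finite value (448), cast down, and multiply the scale back at use time. A sketch of that math, assuming a PyTorch build with `torch.float8_e4m3fn` (>= 2.1); this is not ScaleLLM's kernel path.

```python
import torch  # float8 dtypes require a recent PyTorch (>= 2.1)

E4M3_MAX = 448.0  # largest finite magnitude representable in float8 e4m3

def fp8_quantize(t: torch.Tensor):
    """Per-tensor FP8 (e4m3) quantization: choose a scale so the largest
    |value| maps to E4M3_MAX, then scale and cast down."""
    scale = t.abs().max().clamp(min=1e-12) / E4M3_MAX
    q = (t / scale).to(torch.float8_e4m3fn)
    return q, scale

def fp8_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

torch.manual_seed(0)
w = torch.randn(4, 4) * 3.0
q, s = fp8_quantize(w)
err = (w - fp8_dequantize(q, s)).abs().max()
print(f"max abs error: {err.item():.4f}")  # small but nonzero: e4m3 has ~3 mantissa bits
```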
### Supported Operating Systems
- [ ] Extending support to macOS and Windows platforms
### Misc
- [ ] Benchmarking performance against other open-source projects [Ongoing]
- [ ] Adding more benchmarks and unit tests for kernels and dependencies [Ongoing]
- [ ] Adding more Prometheus metrics and creating a Grafana dashboard for monitoring (sketch after this list)
- [ ] Loosening the coupling with PyTorch for easier deployment
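For the Prometheus item in the list above, the usual pattern is a counter or histogram per serving event plus an HTTP scrape target. A sketch using the Python `prometheus_client`; the metric names are illustrative, not ScaleLLM's actual metric set.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names, not ScaleLLM's actual metrics.
REQUESTS = Counter("llm_requests_total", "Completed generation requests")
TOKENS = Counter("llm_generated_tokens_total", "Tokens generated")
LATENCY = Histogram("llm_request_latency_seconds", "End-to-end request latency")

def handle_request() -> None:
    with LATENCY.time():                  # observes elapsed time on exit
        n_tokens = random.randint(8, 64)  # stand-in for real generation work
        time.sleep(n_tokens * 0.001)
        TOKENS.inc(n_tokens)
    REQUESTS.inc()

if __name__ == "__main__":
    start_http_server(9090)  # Prometheus scrape target at :9090/metrics
    while True:
        handle_request()
```

A Grafana dashboard would then graph rates over these series, e.g. tokens per second from `rate(llm_generated_tokens_total[1m])`.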
I think LLaMA 3 should be added as well, and probably should be high priority.
Yes, Llama3 is already supported; please check the latest release: https://github.com/vectorch-ai/ScaleLLM/releases/tag/v0.0.8
Wow @guocuimi, thank you for your quick update! You guys rock!