# ScaleLLM Roadmap
This roadmap presents the features we're currently working on and planning to support. Your feedback is highly valued, so please don't hesitate to comment or reach out if there's anything you'd like to add or discuss. We're committed to delivering the best possible experience with ScaleLLM.
## Q1-Q2 2024
### Efficiency
- [x] Adding flash decoding with paged KV cache support [Done]
- [ ] Introducing an attention kernel capable of supporting speculative decoding [Ongoing]
- [ ] Exploring the feasibility of adopting the flashinfer library [Ongoing]
- [x] Implementing speculative decoding (see the sketch after this list) [Done]
- [x] Enabling CUDA graph for decoding to improve performance [Done]
- [x] Implementing dynamic split-fuse to reduce latency [Done]
- [ ] Exploring lookahead decoding support
- [ ] Implementing fused FFN (Feed-Forward Network) to enhance efficiency
- [ ] Introducing a ring attention mechanism for handling long contexts
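To make the speculative decoding item above concrete: a cheap draft model proposes a few tokens ahead, and the target model verifies them, keeping the longest agreed prefix. Below is a minimal sketch of the greedy variant; `draft_next` and `target_next` are hypothetical callables, not ScaleLLM's internals, and a real engine verifies all proposals in a single batched forward pass.

```python
from typing import Callable, List

def speculative_step(
    prefix: List[int],
    draft_next: Callable[[List[int]], int],   # hypothetical: cheap draft model (greedy)
    target_next: Callable[[List[int]], int],  # hypothetical: target model (greedy)
    k: int = 4,
) -> List[int]:
    """One round of draft-then-verify speculative decoding (greedy variant)."""
    # 1) Draft phase: propose k tokens autoregressively with the cheap model.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2) Verify phase: keep the longest prefix the target model agrees with.
    #    A real engine scores all k positions in ONE batched forward pass;
    #    we loop here only for clarity.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target_next(ctx)
        if expected != tok:
            accepted.append(expected)  # target's own token replaces the first miss
            break
        accepted.append(tok)
        ctx.append(tok)
    else:
        # All k proposals accepted: the target still yields one bonus token.
        accepted.append(target_next(ctx))
    return accepted

# Toy demo with deterministic stand-in "models".
def toy_target(ctx: List[int]) -> int:
    return (len(ctx) * 7) % 13

def toy_draft(ctx: List[int]) -> int:
    return toy_target(ctx) if len(ctx) % 2 == 0 else 0  # agrees only sometimes

print(speculative_step([1, 2, 3], toy_draft, toy_target, k=4))  # >= 1 token per round
```

Every round emits at least one token, so the worst case matches plain decoding while agreement between the two models yields up to k+1 tokens per target pass.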
### Cache
- [x] Implementing stateful conversation to avoid recomputing for chat sessions [Done]
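The idea behind the stateful-conversation item is to key the prompt-prefix cache by session, so a follow-up chat turn only recomputes the newly appended tokens. A toy sketch of the bookkeeping, assuming a hypothetical `compute_kv` stand-in for one token's KV computation; real engines store paged KV blocks on the GPU.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

def compute_kv(token: int) -> str:
    """Hypothetical stand-in for computing one token's attention KV entry."""
    return f"kv({token})"

@dataclass
class SessionCache:
    """Toy stateful-conversation cache: per-session token prefix plus its 'KV'."""
    sessions: Dict[str, Tuple[List[int], List[str]]] = field(default_factory=dict)

    def extend(self, session_id: str, tokens: List[int]) -> int:
        """Append this turn's tokens; return how many were served from cache."""
        cached_tokens, kv = self.sessions.get(session_id, ([], []))
        n = len(cached_tokens)
        # Reuse the cache only if the request still starts with the cached
        # prefix (i.e., the conversation history was not edited).
        if tokens[:n] != cached_tokens:
            cached_tokens, kv, n = [], [], 0
        for tok in tokens[n:]:  # recompute only the new suffix
            kv.append(compute_kv(tok))
        self.sessions[session_id] = (list(tokens), kv)
        return n

cache = SessionCache()
print(cache.extend("chat-1", [1, 2, 3]))        # 0 cached, computes 3 tokens
print(cache.extend("chat-1", [1, 2, 3, 4, 5]))  # 3 cached, computes only 2
```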
### New Models
- [x] Integrating Google Gemma [Done]
- [x] Integrating Llama3 [Done]
- [ ] Incorporating the Mixtral MoE model [Ongoing]
- [ ] Implementing MoE (Mixture of Experts) kernels (routing sketch after this list)
- [ ] Introducing the Mamba model
- [ ] Introducing multi-modal models [Ongoing]
  - [ ] LLaVA model
- [ ] LoRA & QLoRA
  - [ ] S-LoRA: Serving thousands of LoRA adapters
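For the MoE kernels item above, the core math is top-k expert routing: a learned gate scores experts per token, the top k are run, and their outputs are combined with renormalized gate weights. A minimal PyTorch sketch of that routing follows; a fused kernel would avoid the Python loop over experts, and nothing here is ScaleLLM's actual kernel code.

```python
import torch

def moe_forward(x, router_w, experts, k=2):
    """Minimal top-k MoE layer: route each token to k experts and combine
    their outputs, weighted by the renormalized gate scores."""
    gate_logits = x @ router_w                      # [tokens, num_experts]
    gate_probs = torch.softmax(gate_logits, dim=-1)
    weights, idx = torch.topk(gate_probs, k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over top-k
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):            # fused kernels avoid this loop
        mask = idx == e                             # [tokens, k]
        if mask.any():
            routed = mask.any(dim=-1)               # tokens routed to expert e
            w = (weights * mask).sum(dim=-1)[routed]  # gate weight per routed token
            out[routed] += w.unsqueeze(-1) * expert(x[routed])
    return out

torch.manual_seed(0)
d, num_experts = 8, 4
experts = [torch.nn.Linear(d, d) for _ in range(num_experts)]
x = torch.randn(5, d)
router_w = torch.randn(d, num_experts)
print(moe_forward(x, router_w, experts).shape)  # torch.Size([5, 8])
```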
### New Devices
- [ ] Adding support for Apple chips
- [ ] Exploring other chips, such as TPUs
### Usability
- [x] Developing a Python wrapper for easier integration (usage example after this list) [Done]
- [ ] Enhancing documentation for improved usability [Ongoing]
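For the Python wrapper item above, the simplest end-to-end example goes through the server's OpenAI-compatible HTTP endpoint. This is a hedged sketch: the host, port, and model name below are placeholders, so check the ScaleLLM docs for the values matching your deployment.

```python
import requests

# Assumes a running ScaleLLM server exposing an OpenAI-compatible endpoint
# on localhost:8080; host, port, and model name are placeholders.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32,
        "temperature": 0.7,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```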
### New GPU Architecture
- [ ] Turing architecture (sm75)
### Structured Decoding
- [ ] Function Calling
### Quantization
- [ ] Supporting FP8 for both models and KV caches
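FP8 support largely comes down to per-tensor scaling: pick a scale so the largest magnitude maps to e4m3's maximum finite value (448), cast down, and multiply the scale back at use time. A sketch of that math, assuming a PyTorch build with `torch.float8_e4m3fn` (>= 2.1); this is not ScaleLLM's kernel path.

```python
import torch  # float8 dtypes require a recent PyTorch (>= 2.1)

E4M3_MAX = 448.0  # largest finite magnitude representable in float8 e4m3

def fp8_quantize(t: torch.Tensor):
    """Per-tensor FP8 (e4m3) quantization: choose a scale so the largest
    |value| maps to E4M3_MAX, then scale and cast down."""
    scale = t.abs().max().clamp(min=1e-12) / E4M3_MAX
    q = (t / scale).to(torch.float8_e4m3fn)
    return q, scale

def fp8_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

torch.manual_seed(0)
w = torch.randn(4, 4) * 3.0
q, s = fp8_quantize(w)
err = (w - fp8_dequantize(q, s)).abs().max()
print(f"max abs error: {err.item():.4f}")  # small but nonzero: e4m3 has ~3 mantissa bits
```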
### Supported Operating Systems
- [ ] Extending support to macOS and Windows platforms
### Misc
- [ ] Benchmarking performance against other open-source projects [Ongoing]
- [ ] Adding more benchmarks and unit tests for kernels and dependencies [Ongoing]
- [ ] Adding more Prometheus metrics and creating a Grafana dashboard for monitoring (sketch after this list)
- [ ] Loosening the coupling with PyTorch for easier deployment
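For the Prometheus item in the list above, the usual pattern is a counter or histogram per serving event plus an HTTP scrape target. A sketch using the Python `prometheus_client`; the metric names are illustrative, not ScaleLLM's actual metric set.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names, not ScaleLLM's actual metrics.
REQUESTS = Counter("llm_requests_total", "Completed generation requests")
TOKENS = Counter("llm_generated_tokens_total", "Tokens generated")
LATENCY = Histogram("llm_request_latency_seconds", "End-to-end request latency")

def handle_request() -> None:
    with LATENCY.time():                  # observes elapsed time on exit
        n_tokens = random.randint(8, 64)  # stand-in for real generation work
        time.sleep(n_tokens * 0.001)
        TOKENS.inc(n_tokens)
    REQUESTS.inc()

if __name__ == "__main__":
    start_http_server(9090)  # Prometheus scrape target at :9090/metrics
    while True:
        handle_request()
```

A Grafana dashboard would then graph rates over these series, e.g. tokens per second from `rate(llm_generated_tokens_total[1m])`.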
I think LLaMA 3 should be added as well, and probably should be high priority.
Yes, Llama3 is already supported; please check the latest release: https://github.com/vectorch-ai/ScaleLLM/releases/tag/v0.0.8
Wow @guocuimi, thank you for your quick update! You guys rock!