KavioYu
migrate static api for sensitive demo
add semi-structured pruning demo
I want to develop some features based on SGLang to improve the performance of srt. 1. A new scheduler for ControllerMulti that can more accurately identify the resource utilization of...
## Motivation

Accelerate model inference with speculative inference (EAGLE2).

## Modifications

To be provided soon.

## Checklist

- [ ] Format your code according to the [Contributor Guide](https://github.com/sgl-project/sglang/blob/main/docs/en/contributor_guide.md)....
## Motivation

Implement a better dispatch scheduler for DP mode that dispatches new requests according to the remaining resources of each inference process. It could help the server get...
I have developed a Triton-based implementation of [Native Sparse Attention](https://arxiv.org/pdf/2502.11089) on [GitHub](https://github.com/yukavio/nsa) to optimize long-context attention computation. I now want to migrate this implementation to Flash Attention v3 to improve...
This PR adds an implementation of Compressed Attention and Selected Attention from [Native Sparse Attention](https://arxiv.org/pdf/2502.11089). The hyperparameters of the selected and compressed attention kernels are set for good performance on...