Chen Zhang
Trying another implementation of #12655
As kv_caches is no longer needed by the model, we can now remove it from the model runner and drop the complex bind_kv_cache function. However, as self.kv_caches is still used by tpu...
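A minimal sketch of that direction, with illustrative names (`AttentionLayer` and `init_kv_cache` are not vLLM's API): each attention layer owns its cache tensor directly, so the runner needs neither a `self.kv_caches` list nor a separate binding pass:

```python
from dataclasses import dataclass

import torch


@dataclass
class AttentionLayer:
    layer_name: str
    # Owned by the layer itself, not collected in a runner-level list.
    kv_cache: torch.Tensor | None = None


def init_kv_cache(layers: dict[str, AttentionLayer],
                  num_blocks: int, block_size: int,
                  num_kv_heads: int, head_size: int) -> None:
    """Allocate one KV cache tensor per layer and attach it directly."""
    for layer in layers.values():
        # Common layout: index 0 is the key cache, index 1 the value cache.
        layer.kv_cache = torch.zeros(
            2, num_blocks, block_size, num_kv_heads, head_size)
```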
Built on top of https://github.com/vllm-project/vllm/pull/14079; should be merged after it. This PR supports "real" sliding window in v1: 1. Support dropping blocks outside the sliding window. 2. For prefix caching, only...
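A minimal sketch of point 1, assuming a hypothetical helper: any block whose last token falls before `num_computed_tokens - sliding_window` can no longer be attended to and is safe to free:

```python
def blocks_outside_window(num_computed_tokens: int,
                          sliding_window: int,
                          block_size: int) -> range:
    """Indices of blocks fully outside the sliding window (freeable)."""
    first_useful_token = max(num_computed_tokens - sliding_window, 0)
    # A block is droppable only if its *last* token precedes the window start.
    return range(first_useful_token // block_size)


# Example: 10 computed tokens, window 4, block_size 2 -> tokens 0-5 are out
# of the window, so blocks 0-2 (covering tokens 0-5) can be dropped.
assert list(blocks_outside_window(10, 4, 2)) == [0, 1, 2]
```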
WIP https://github.com/vllm-project/vllm/issues/11382
Should be merged after https://github.com/vllm-project/vllm/pull/17398. To prepare for the hybrid allocator, this PR moves logic that needs to run for each specialized manager from KVCacheManager to SpecializedManager. As the `SpecializedManager` not...
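A sketch of the refactoring shape, with assumed signatures (only the class names come from the PR description): per-attention-type decisions, such as which blocks can be skipped, move into `SpecializedManager` subclasses so KVCacheManager only orchestrates:

```python
from abc import ABC, abstractmethod


class SpecializedManager(ABC):
    """Per-attention-type logic, one subclass per kv cache group type."""

    def __init__(self, block_size: int) -> None:
        self.block_size = block_size

    @abstractmethod
    def remove_skipped_blocks(self, block_ids: list[int],
                              num_computed_tokens: int) -> list[int]:
        """Return ids of blocks this attention type no longer needs."""


class FullAttentionManager(SpecializedManager):
    def remove_skipped_blocks(self, block_ids, num_computed_tokens):
        return []  # full attention attends to everything; nothing is skipped


class SlidingWindowManager(SpecializedManager):
    def __init__(self, block_size: int, sliding_window: int) -> None:
        super().__init__(block_size)
        self.sliding_window = sliding_window

    def remove_skipped_blocks(self, block_ids, num_computed_tokens):
        first_useful = max(num_computed_tokens - self.sliding_window, 0)
        # Blocks entirely before the window start are no longer needed.
        return block_ids[:first_useful // self.block_size]
```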
Should be merged after https://github.com/vllm-project/vllm/pull/17193. This PR changes ForwardContext.attn_metadata from a global object to dict[layer_name, AttentionMetadata], to prepare for the hybrid allocator, which allocates different block tables to sliding window layers...
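A minimal sketch of the data-structure change, with stand-in types (`Any` substitutes for `AttentionMetadata`): the forward context maps each layer name to its own metadata, so layers in different kv cache groups can read different block tables:

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ForwardContext:
    # Before: a single AttentionMetadata shared by every layer.
    # After: one entry per attention layer, keyed by layer name.
    attn_metadata: dict[str, Any]


def attention_forward(ctx: ForwardContext, layer_name: str) -> Any:
    # Each layer looks up its own metadata, so a sliding window layer can
    # carry a different block table than a full attention layer.
    return ctx.attn_metadata[layer_name]
```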
https://github.com/vllm-project/vllm/pull/17137 drops the last matched block to support EAGLE. This strategy is not correct for sliding window layers. When the sliding window size is 4 and block_size is 2, we need...
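A small worked check of the failure mode, using a hypothetical helper: with window 4 and block_size 2, the attention window of the latest token always covers the last matched block, so that block cannot simply be dropped for sliding window layers:

```python
def window_block_span(pos: int, sliding_window: int, block_size: int) -> range:
    """Block indices touched by the attention window of the token at `pos`."""
    first = max(pos - sliding_window + 1, 0)
    return range(first // block_size, pos // block_size + 1)


# Token 7 with window 4 attends to tokens 4..7, i.e. blocks 2 and 3: the
# very last block is still inside the window, so dropping it loses tokens
# the next query must attend to.
assert list(window_block_span(7, 4, 2)) == [2, 3]
```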
In the future hybrid allocator, the KVCacheManager output would be list[list[KVCacheBlock]], which is much more complex than the current list[KVCacheBlock]. To hide the complexity, this PR introduces `KVCacheBlocks` to save...
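A sketch of the wrapper idea under assumed fields: callers hold an opaque `KVCacheBlocks` value, so the manager can later switch the inner representation to the nested hybrid-allocator form without touching every call site:

```python
from dataclasses import dataclass


@dataclass
class KVCacheBlock:
    block_id: int


@dataclass
class KVCacheBlocks:
    # Today effectively a flat list; with the hybrid allocator this can
    # become a nested per-group structure without changing the public type.
    blocks: list[KVCacheBlock]

    def get_block_ids(self) -> list[int]:
        return [b.block_id for b in self.blocks]

    def __add__(self, other: "KVCacheBlocks") -> "KVCacheBlocks":
        return KVCacheBlocks(self.blocks + other.blocks)
```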
Should be merged after https://github.com/vllm-project/vllm/pull/17394. The hybrid allocator will need to build attention metadata for each kv cache group, because different kv cache groups may have a different attention type and block_table. To...
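A minimal sketch of per-group metadata building, with hypothetical names throughout: one builder per kv cache group produces that group's metadata, which is then fanned out to the group's layers, matching the `dict[layer_name, AttentionMetadata]` shape above:

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class GroupMetadataBuilder:
    attn_type: str  # e.g. "full" or "sliding_window"

    def build(self, block_table: list[list[int]]) -> dict[str, Any]:
        # Real builders would produce AttentionMetadata; a dict stands in.
        return {"attn_type": self.attn_type, "block_table": block_table}


def build_per_layer_metadata(
    groups: list[tuple[GroupMetadataBuilder, list[str], list[list[int]]]],
) -> dict[str, Any]:
    """Build one metadata object per kv cache group, then fan it out."""
    attn_metadata: dict[str, Any] = {}
    for builder, layer_names, block_table in groups:
        metadata = builder.build(block_table)
        for name in layer_names:
            attn_metadata[name] = metadata
    return attn_metadata
```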