harrywu

Results 9 issues of harrywu

Excuse me, how can I get the implementation of the baseline?

### metrics
- time_queue_requests
- time_inference_requests
- time_prefill_requests
- time_decode_requests
- max_num_generation_tokens_requests
```
max_num_generation_tokens_requests.append(
    max(seq.get_output_len() for seq in seq_group.get_seqs())
)
```
---
PR Checklist (Click to Expand) Thank you for...

### Feature request
Hello! This is a bit like [#26553](https://github.com/huggingface/transformers/issues/26553), which implemented `SinkCache`. I would love to see a KV cache sparsity method such as **H2O** implemented, as proposed in...

### Anything you want to discuss about vllm.
https://github.com/vllm-project/vllm/blob/99caa4910651754f3f68de518ca42349c8c424d1/vllm/attention/backends/flash_attn.py#L282
I noticed that in the flash-attn backend, `forward_prefix` and `forward_decode` seem to be executed serially. Does `forward_decode` wait for `forward_prefix` to finish...

misc

I couldn't find the code for online serving. Has this part not been open-sourced yet? I would like to reproduce `Figure 11. Comparison of the online latency.`

Here is my env. The version of `transformers` meets the requirements in `monkeypatch.py`:
```
torch==2.2.0
transformers==4.37.0
```
The traceback is as follows:
traceback >> python pred_snap.py --model llama2-7b-chat-4k --compress_args_path...

https://github.com/FasterDecoding/SnapKV/blob/ea655b18061313e088879bd2b4a3e3c0c2dc2e21/snapkv_utils.py#L50 In the `update_kv` function, the `attention_mask` argument is never used; the variable is overridden inside the function instead.

Just a guess: what would happen if **H2O** also used **Clustering via Pooling** in the comparison? It seems that Clustering via Pooling can improve the effectiveness of such token-dropping methods.

### Which component has the problem?
CuTe DSL
### Bug Report
**Steps/Code to reproduce bug**
```
import torch
import cutlass
import cutlass.cute as cute
from cutlass.cute.runtime import from_dlpack

@cute.jit
def...
```

bug
? - Needs Triage
inactive-30d
CuTe DSL