TriForce
[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
Hi, thanks for your great work on the LLM decoding process. I tested the code and got the expected decoding speedup for llama2-7B, but it seems that the end-to-end time cost...
Nice work! In the paper I saw a batched result, but examples like https://github.com/Infini-AI-Lab/TriForce/blob/main/test/on_chip.py only use batch size = 1. Does the code support batched speculative inference?
```
CUDA_VISIBLE_DEVICES=0 python test/on_chip.py --prefill 124928 --budget 4096 \
    --chunk_size 8 --top_p 0.9 --temp 0.6 --gamma 6
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03
```
Have you considered incorporating this work into an open source inference framework, such as vLLM?
Hi, I would like to ask why the attention mask is not used in the prefill stage. I want to output the attention score matrix in the prefill stage. Is the...
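For reference, fused/flash-attention kernels often apply causal masking via a flag rather than an explicit mask tensor, which may be why no mask appears in the prefill code. Below is a minimal sketch (not TriForce's actual code; the function name is hypothetical) of how the attention score matrix can be recomputed from q/k with an explicit causal mask:

```python
# Illustrative sketch only: recompute prefill attention scores with an
# explicit causal mask so the full score matrix can be inspected.
import torch

def prefill_attention_scores(q, k):
    # q, k: [batch, heads, seq_len, head_dim]
    seq_len = q.shape[-2]
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # [b, h, s, s]
    causal = torch.triu(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device),
        diagonal=1,
    )
    scores = scores.masked_fill(causal, float("-inf"))  # hide future tokens
    return torch.softmax(scores, dim=-1)                # attention probabilities
```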
Thanks for your excellent work! But I ran into some issues when trying to use your framework. I tried to run `offloading.py` and `offloading_TP.py` on a 4x RTX 4090 machine. As...
Hi authors, in `models/cache.py` (lines 154–159), the code computes the mean of the key vectors in each chunk and then selects the top-k chunks based on the dot product between...
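For context, a minimal sketch of the retrieval step the question describes, assuming each chunk of cached keys is summarized by its mean and chunks are ranked by the dot product with the current query (names and shapes are illustrative, not the actual `models/cache.py` implementation):

```python
# Illustrative sketch only: summarize each key chunk by its mean vector,
# score chunks against the current query, and keep the top-k chunks.
import torch

def select_topk_chunks(key_cache, query, chunk_size, k_chunks):
    # key_cache: [seq_len, head_dim], query: [head_dim]
    n_chunks = key_cache.shape[0] // chunk_size
    chunks = key_cache[: n_chunks * chunk_size].view(n_chunks, chunk_size, -1)
    chunk_means = chunks.mean(dim=1)            # [n_chunks, head_dim]
    scores = chunk_means @ query                # dot product per chunk
    top = torch.topk(scores, k_chunks).indices  # indices of retained chunks
    return chunks[top].reshape(-1, key_cache.shape[-1])  # gathered keys
```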
Why aren't top-k/top-p taken into account? And shouldn't a softmax at least be computed here? https://github.com/Infini-AI-Lab/TriForce/blob/164c8c0131cf49951eefdea89a3fbcccb8ca326b/utils/sampling.py#L64
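For comparison, a standard temperature softmax followed by top-p (nucleus) truncation looks roughly like the sketch below; this is a generic reference implementation, not the code at the linked line:

```python
# Illustrative sketch only: temperature-scaled softmax, then restrict to the
# smallest token set whose cumulative probability exceeds top_p, renormalize,
# and sample from that nucleus.
import torch

def sample_top_p(logits, top_p=0.9, temperature=0.6):
    # logits: [vocab_size]
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Zero out tokens beyond the nucleus, always keeping the top token.
    sorted_probs[cumulative - sorted_probs > top_p] = 0.0
    sorted_probs /= sorted_probs.sum()
    return sorted_idx[torch.multinomial(sorted_probs, num_samples=1)]
```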