
[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

Results: 9 TriForce issues

Hi, thanks for your great work on the LLM decoding process. I tested the code and got the expected decoding speedup for Llama-2-7B, but it seems that the end-to-end time cost...

good first issue

Nice work! In the paper I saw a batched result, but examples like https://github.com/Infini-AI-Lab/TriForce/blob/main/test/on_chip.py only use batch size 1. Does the code support batched speculative inference?

good first issue

```
CUDA_VISIBLE_DEVICES=0 python test/on_chip.py --prefill 124928 --budget 4096 \
    --chunk_size 8 --top_p 0.9 --temp 0.6 --gamma 6
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03
```

good first issue
version issue

It seems like `--gamma` sets γ2, but how do you change γ1?

good first issue

Have you considered incorporating this work into an open source inference framework, such as vLLM?

enhancement
good first issue

Hi, I would like to ask why the attention mask is not used in the prefill stage. I want to output the attention score matrix in the prefill stage. Is the...

Thanks for your excellent work! But I ran into some issues when trying to use your framework. I tried to run `offloading.py` and `offloading_TP.py` on a machine with 4× RTX 4090 GPUs. As...

good first issue

Hi authors, in `models/cache.py` (lines 154–159), the code computes the mean of the key vectors in each chunk and then selects the top-k chunks based on the dot product between...
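The selection step this issue describes — represent each KV chunk by the mean of its key vectors, then keep the chunks whose mean scores highest against the query — can be sketched as follows. This is a hypothetical re-implementation for illustration, not the actual code in `models/cache.py`; the function name and plain-list representation are invented here:

```python
def select_topk_chunks(keys, query, chunk_size, k):
    """Split `keys` (a list of d-dim vectors) into fixed-size chunks,
    represent each chunk by the mean of its key vectors, and keep the
    k chunks whose mean has the largest dot product with `query`."""
    # Group consecutive key vectors into chunks.
    chunks = [keys[i:i + chunk_size] for i in range(0, len(keys), chunk_size)]
    # Mean key vector per chunk (column-wise average).
    means = [[sum(col) / len(ch) for col in zip(*ch)] for ch in chunks]
    # Score each chunk by <mean_key, query>.
    scores = [sum(m * q for m, q in zip(mean, query)) for mean in means]
    # Indices of the top-k scoring chunks, returned in positional order.
    top = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)

keys = [[1.0, 0.0], [1.0, 0.0],   # chunk 0: mean (1.0, 0.0)
        [0.0, 1.0], [0.0, 1.0],   # chunk 1: mean (0.0, 1.0)
        [0.5, 0.5], [0.5, 0.5]]   # chunk 2: mean (0.5, 0.5)
print(select_topk_chunks(keys, query=[0.0, 1.0], chunk_size=2, k=2))  # → [1, 2]
```

With the query aligned to the second axis, chunks 1 and 2 score highest (1.0 and 0.5), so chunk 0 is evicted from the retrieval budget.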

Aren't top-k and top-p considered here? And shouldn't a softmax at least be computed? https://github.com/Infini-AI-Lab/TriForce/blob/164c8c0131cf49951eefdea89a3fbcccb8ca326b/utils/sampling.py#L64
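The sampling the comment asks about — a temperature-scaled softmax followed by nucleus (top-p) filtering — could look like the minimal sketch below. This is a hypothetical illustration, not the code in `utils/sampling.py`; the function name and fixed seed are invented here:

```python
import math
import random

def top_p_sample(logits, top_p=0.9, temp=1.0, rng=None):
    """Softmax the temperature-scaled logits, keep the smallest set of
    tokens whose cumulative probability reaches top_p, renormalize over
    that nucleus, and sample a token index from it."""
    rng = rng or random.Random(0)
    scaled = [l / temp for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]  # numerically stable softmax
    z = sum(exps)
    probs = [e / z for e in exps]
    # Walk tokens in descending probability until the nucleus covers top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Sample from the renormalized nucleus.
    mass = sum(probs[i] for i in kept)
    r, acc = rng.random() * mass, 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]
```

With a sharply peaked distribution (e.g. `logits = [10.0, 0.0, 0.0]` and `top_p = 0.5`) the nucleus collapses to the single top token, so the sample is deterministic.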