TriForce
[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
Hi, thanks for your great work on the LLM decoding process. I tested the code and got the expected decoding speedup for llama2-7B, but it seems that the end-to-end time cost...
Nice work! In the paper I saw a batched result, but examples like https://github.com/Infini-AI-Lab/TriForce/blob/main/test/on_chip.py only use batch size = 1. Does the code support batched speculative inference?
```
CUDA_VISIBLE_DEVICES=0 python test/on_chip.py --prefill 124928 --budget 4096 \
    --chunk_size 8 --top_p 0.9 --temp 0.6 --gamma 6
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03
```
Have you considered incorporating this work into an open source inference framework, such as vLLM?
Hi, I would like to ask why the attention mask is not used in the prefill stage. I want to output the attention score matrix in the prefill stage. Is the...
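For reference, fused/flash-attention kernels often apply causal masking via a flag rather than an explicit mask tensor, which may be why no mask appears in the prefill code. Below is a minimal sketch (not TriForce's actual code; the function name is hypothetical) of how the attention score matrix can be recomputed from q/k with an explicit causal mask:

```python
# Illustrative sketch only: recompute prefill attention scores with an
# explicit causal mask so the full score matrix can be inspected.
import torch

def prefill_attention_scores(q, k):
    # q, k: [batch, heads, seq_len, head_dim]
    seq_len = q.shape[-2]
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # [b, h, s, s]
    causal = torch.triu(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device),
        diagonal=1,
    )
    scores = scores.masked_fill(causal, float("-inf"))  # hide future tokens
    return torch.softmax(scores, dim=-1)                # attention probabilities
```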
Thanks for your excellent work! But I ran into some issues when trying to use your framework. I tried to run `offloading.py` and `offloading_TP.py` on a 4x RTX 4090 machine. As...
Hi authors, in `models/cache.py` (lines 154–159), the code computes the mean of the key vectors in each chunk and then selects the top-k chunks based on the dot product between...
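For context, a minimal sketch of the retrieval step the question describes, assuming each chunk of cached keys is summarized by its mean and chunks are ranked by the dot product with the current query (names and shapes are illustrative, not the actual `models/cache.py` implementation):

```python
# Illustrative sketch only: summarize each key chunk by its mean vector,
# score chunks against the current query, and keep the top-k chunks.
import torch

def select_topk_chunks(key_cache, query, chunk_size, k_chunks):
    # key_cache: [seq_len, head_dim], query: [head_dim]
    n_chunks = key_cache.shape[0] // chunk_size
    chunks = key_cache[: n_chunks * chunk_size].view(n_chunks, chunk_size, -1)
    chunk_means = chunks.mean(dim=1)            # [n_chunks, head_dim]
    scores = chunk_means @ query                # dot product per chunk
    top = torch.topk(scores, k_chunks).indices  # indices of retained chunks
    return chunks[top].reshape(-1, key_cache.shape[-1])  # gathered keys
```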
Why aren't top-k/top-p taken into account? And shouldn't a softmax at least be computed here? https://github.com/Infini-AI-Lab/TriForce/blob/164c8c0131cf49951eefdea89a3fbcccb8ca326b/utils/sampling.py#L64
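For comparison, a standard temperature softmax followed by top-p (nucleus) truncation looks roughly like the sketch below; this is a generic reference implementation, not the code at the linked line:

```python
# Illustrative sketch only: temperature-scaled softmax, then restrict to the
# smallest token set whose cumulative probability exceeds top_p, renormalize,
# and sample from that nucleus.
import torch

def sample_top_p(logits, top_p=0.9, temperature=0.6):
    # logits: [vocab_size]
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Zero out tokens beyond the nucleus, always keeping the top token.
    sorted_probs[cumulative - sorted_probs > top_p] = 0.0
    sorted_probs /= sorted_probs.sum()
    return sorted_idx[torch.multinomial(sorted_probs, num_samples=1)]
```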