dreaming-panda

Results: 10 comments of dreaming-panda

Currently we have not added support for vLLM; we are working on building a tensor parallelism system. With batch size > 1, we need to solve some additional problems, such...

We plan to refactor the repo and will share the files. Thank you!

> +1 How to benchmark the speed up? I ran the example codes and didn't see obvious acceleration. How to reproduce 4.04x accelerate of Llama2-7b on A100? To run Sequoia:...

"Decoding steps" means how many tokens are generated; "large model steps" means how many times the large model performs verification. Decoding steps / large model steps reflects how many tokens are...
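As a minimal sketch (with made-up counts, not numbers from the repo), the ratio can be computed like this:

```python
# Hypothetical counts for illustration only.
decoding_steps = 512      # total tokens generated in the run
large_model_steps = 128   # verification passes of the large model

# Average tokens accepted per verification pass; for plain
# autoregressive decoding this ratio is exactly 1, and anything
# above 1 is the gain from speculative decoding.
avg_accepted = decoding_steps / large_model_steps
print(f"tokens per large-model step: {avg_accepted:.2f}")  # 4.00
```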

Your understanding is correct. We only allow the baseline to generate 32 tokens because in some experiments, such as Vicuna-33B, running the baseline can take a lot of time. You can...

Can you send me your demo-config.json? PS: the original demo-config.json is just a demo. You need to modify its content to generate the tree you want.

{ "acceptance_rate_vector": "acceptance-rate-vector.pt", "max_depth": 15, "max_budget": 128, "draft_time": 0.0003, "valid_budget": [1, 2, 4, 8, 16, 32, 64, 128], "target_time":[0.025, 0.025, 0.025, 0.025, 0.025, 0.027, 0.030, 0.035], "dst": "demo_tree.pt" } p...

Yes, the generated tree sizes can only be numbers from "valid_budget".
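A minimal sketch of enforcing that constraint when picking a tree size (a hypothetical check, not part of the repo):

```python
import json

# Hypothetical sanity check: confirm that a requested tree size is one
# of the budgets listed in demo-config.json, since generated trees can
# only have sizes drawn from "valid_budget".
with open("demo-config.json") as f:
    config = json.load(f)

requested_size = 64  # hypothetical target tree size
if requested_size not in config["valid_budget"]:
    raise ValueError(
        f"tree size {requested_size} is not in valid_budget "
        f"{config['valid_budget']}"
    )
```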

I do not think this can be used on a CPU. A CPU does not have the high FLOPS that speculative decoding needs, so even if the code can run on a CPU, no...
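A back-of-envelope sketch of that compute gap, using ballpark peak-throughput figures that are assumptions rather than measurements:

```python
# Ballpark peak-compute figures (assumptions, not measurements).
gpu_flops = 312e12  # A100 dense BF16 tensor-core peak, approx.
cpu_flops = 2e12    # high-end many-core server CPU, approx.

# Speculative decoding verifies a whole token tree in one forward pass,
# which is cheap only when the hardware has a large compute surplus.
# With a roughly 150x smaller budget, a CPU gains little from batching
# the tree into one pass.
print(f"approximate compute gap: {gpu_flops / cpu_flops:.0f}x")
```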