dreaming-panda

Results: 10 comments of dreaming-panda

Currently we have not added support for vLLM; we are working on building a tensor parallelism system. With batch size > 1, we need to solve some additional problems, such...

We plan to refactor the repo and will share the files. Thank you!

> +1 How to benchmark the speed up? I ran the example codes and didn't see obvious acceleration. How to reproduce 4.04x accelerate of Llama2-7b on A100? To run Sequoia:...

"Decoding steps" means how many tokens are generated; "large model steps" means how many times the large model performs verification. Decoding steps / large model steps reflects how many tokens are...
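As a minimal sketch (with made-up counts, not numbers from the repo), the ratio can be computed like this:

```python
# Hypothetical counts for illustration only.
decoding_steps = 512      # total tokens generated in the run
large_model_steps = 128   # verification passes of the large model

# Average tokens accepted per verification pass; for plain
# autoregressive decoding this ratio is exactly 1, and anything
# above 1 is the gain from speculative decoding.
avg_accepted = decoding_steps / large_model_steps
print(f"tokens per large-model step: {avg_accepted:.2f}")  # 4.00
```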

Your understanding is correct. We only allow the baseline to generate 32 tokens because in some experiments, such as Vicuna-33B, running the baseline can take a lot of time. You can...

Can you send me your demo-config.json? PS: the original demo-config.json is just a demo. You need to modify its content to generate the tree you want.

{ "acceptance_rate_vector": "acceptance-rate-vector.pt", "max_depth": 15, "max_budget": 128, "draft_time": 0.0003, "valid_budget": [1, 2, 4, 8, 16, 32, 64, 128], "target_time":[0.025, 0.025, 0.025, 0.025, 0.025, 0.027, 0.030, 0.035], "dst": "demo_tree.pt" } p...

Yes, the generated tree sizes can only be numbers from "valid_budget".
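A minimal sketch of enforcing that constraint when picking a tree size (a hypothetical check, not part of the repo):

```python
import json

# Hypothetical sanity check: confirm that a requested tree size is one
# of the budgets listed in demo-config.json, since generated trees can
# only have sizes drawn from "valid_budget".
with open("demo-config.json") as f:
    config = json.load(f)

requested_size = 64  # hypothetical target tree size
if requested_size not in config["valid_budget"]:
    raise ValueError(
        f"tree size {requested_size} is not in valid_budget "
        f"{config['valid_budget']}"
    )
```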

I do not think this can be used on a CPU. A CPU does not have the high FLOPS that speculative decoding needs, so even if the code can run on a CPU, no...
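A back-of-envelope sketch of that compute gap, using ballpark peak-throughput figures that are assumptions rather than measurements:

```python
# Ballpark peak-compute figures (assumptions, not measurements).
gpu_flops = 312e12  # A100 dense BF16 tensor-core peak, approx.
cpu_flops = 2e12    # high-end many-core server CPU, approx.

# Speculative decoding verifies a whole token tree in one forward pass,
# which is cheap only when the hardware has a large compute surplus.
# With a roughly 150x smaller budget, a CPU gains little from batching
# the tree into one pass.
print(f"approximate compute gap: {gpu_flops / cpu_flops:.0f}x")
```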