bethalianovike
Hi @ChaojifeixiaDazhuang, I also ran into this problem recently; do you have a solution yet?
Thank you @sunzj! Yes, I am also stuck on that step... Have you already tried running `mlc_llm chat` on that EAGLE-llama2-chat-7B? When I try, it gives me a tokenizer...
@sunzj Got it, thanks! Actually, looking at the GitHub repo, there is another script we can use to run speculative decoding: https://github.com/mlc-ai/mlc-llm/blob/main/tests/python/serve/test_serve_engine_spec.py It includes the code...
@sunzj It runs perfectly, thanks! Can we get the decode time or decode rate for this speculative decoding result?
@sunzj Yes, that works, thanks! On my device (NVIDIA GeForce RTX 4090) the decode time does seem to improve, but the answer without speculative decoding does not match the one with speculative decoding... Is...
Hi @pianogGG To find the decode time, you need to change the URL from `http://127.0.0.1:8000/v1/chat/completions` to `http://127.0.0.1:8000/metrics`.
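Not part of the original reply, but here is a rough sketch of how one might turn the metrics output into a decode rate. The field names `decode_tokens` and `decode_time` are assumptions for illustration; inspect the actual payload returned by `/metrics` on your server to find the real keys:

```python
import json
import urllib.request

def decode_rate(decode_tokens: int, decode_time_s: float) -> float:
    """Tokens generated per second during the decode phase."""
    return decode_tokens / decode_time_s

# Example with made-up numbers: 512 tokens decoded in 4.0 s -> 128.0 tok/s
print(decode_rate(512, 4.0))

# Against a running server (key names below are hypothetical; check the
# real /metrics response before using them):
# with urllib.request.urlopen("http://127.0.0.1:8000/metrics") as resp:
#     metrics = json.load(resp)
# print(decode_rate(metrics["decode_tokens"], metrics["decode_time"]))
```

Comparing this number with and without the speculative model loaded gives a quick way to check whether EAGLE is actually helping.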
Hi @pianogGG, To display the chat completion, use this command:

```
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": [your_question]}
    ]
  }' \
  http://127.0.0.1:8001/v1/chat/completions
```

...
Hi @pianogGG, Actually, I've been testing speculative decoding performance with EAGLE on an RTX 4090, and I'm not seeing any improvement. I'm using Llama2-7b-chat-hf-q0f16 as my main model and EAGLE-llama2-chat-7B-q4f16_1 as an additional...