bethalianovike
Hi @ChaojifeixiaDazhuang, I also ran into this problem recently; do you have a solution yet?
Thank you @sunzj! Yes, I am also stuck on that step... Have you already tried running `mlc_llm chat` on that EAGLE-llama2-chat-7B? When I try, it gives me a tokenizer...
@sunzj Got it, thanks! Actually, looking at the GitHub repo, there is another script we can use to run speculative decoding: https://github.com/mlc-ai/mlc-llm/blob/main/tests/python/serve/test_serve_engine_spec.py It includes the code...
@sunzj It runs perfectly, thanks! Can we get the decode time or decode rate for this speculative decoding result?
@sunzj Yes, that works, thanks! On my device (NVIDIA GeForce RTX 4090) the decode time does seem to improve, but the answer without speculative decoding does not match the one with speculative decoding... Is...
Hi @pianogGG To find the decode time, you need to change the URL from `http://127.0.0.1:8000/v1/chat/completions` to `http://127.0.0.1:8000/metrics`.
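Not part of the original reply, but here is a rough sketch of how one might turn the metrics output into a decode rate. The field names `decode_tokens` and `decode_time` are assumptions for illustration; inspect the actual payload returned by `/metrics` on your server to find the real keys:

```python
import json
import urllib.request

def decode_rate(decode_tokens: int, decode_time_s: float) -> float:
    """Tokens generated per second during the decode phase."""
    return decode_tokens / decode_time_s

# Example with made-up numbers: 512 tokens decoded in 4.0 s -> 128.0 tok/s
print(decode_rate(512, 4.0))

# Against a running server (key names below are hypothetical; check the
# real /metrics response before using them):
# with urllib.request.urlopen("http://127.0.0.1:8000/metrics") as resp:
#     metrics = json.load(resp)
# print(decode_rate(metrics["decode_tokens"], metrics["decode_time"]))
```

Comparing this number with and without the speculative model loaded gives a quick way to check whether EAGLE is actually helping.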
Hi @pianogGG, To display the chat completion, use this command:

```
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": [your_question]}
    ]
  }' \
  http://127.0.0.1:8001/v1/chat/completions
```

...
Hi @pianogGG, Actually, I've been testing speculative decoding performance with EAGLE on an RTX 4090, and I'm not seeing any improvement. I'm using Llama2-7b-chat-hf-q0f16 as my main model and EAGLE-llama2-chat-7B-q4f16_1 as an additional...