KavioYu

29 comments by KavioYu

I'm very interested in implementing tree attention for speculative decoding. @simon-mo

> After obtaining the result file, you can run the _[eagle/evaluation/alpha.py](https://github.com/SafeAILab/EAGLE/blob/main/eagle/evaluation/alpha.py)_ file to get the acceptance rate.

[eagle/evaluation/gen_ea_alpha_llama2chat.py](https://github.com/SafeAILab/EAGLE/blob/main/eagle/evaluation/gen_ea_alpha_llama2chat.py) couldn't be executed. It seems to be because the ea_model forward interface...

> 1 and 3 are interesting to us. 2 has been implemented here
>
> https://github.com/sgl-project/sglang/blob/5ff25cdf5b1310e83d9e595142b39ae4d7b561e9/python/sglang/srt/server_args.py#L426-L430
>
> , although there is still room for improvement.
> Please join our...

@jjjjohnson During the RL training process, model weights are constantly changing, so we cannot train a specialized weight-based draft model (such as Eagle) for the model. Therefore, lookahead, a statistics-based...
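The idea of a statistics-based drafter can be illustrated with a minimal sketch. This is not the code from the PR (which is not shown here); it is a generic n-gram lookup drafter with greedy verification, and the names `NgramDrafter`, `verify_greedy`, and `target_next` are hypothetical. It shows why no draft weights need retraining when the target model changes: the drafter only counts which token followed each recent context.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Sequence, Tuple


class NgramDrafter:
    """Hypothetical statistics-based drafter: it holds counts, not trained weights."""

    def __init__(self, n: int = 2) -> None:
        self.n = n  # number of preceding tokens used as the lookup key
        self.table: Dict[Tuple[int, ...], Counter] = defaultdict(Counter)

    def update(self, tokens: Sequence[int]) -> None:
        # Record which token followed each length-n context in the observed text.
        for i in range(len(tokens) - self.n):
            ctx = tuple(tokens[i:i + self.n])
            self.table[ctx][tokens[i + self.n]] += 1

    def propose(self, tokens: Sequence[int], k: int = 4) -> List[int]:
        # Extend the sequence with the most frequent follower, up to k draft tokens.
        out = list(tokens)
        draft: List[int] = []
        for _ in range(k):
            followers = self.table.get(tuple(out[-self.n:]))
            if not followers:
                break  # unseen context: stop drafting
            nxt = followers.most_common(1)[0][0]
            draft.append(nxt)
            out.append(nxt)
        return draft


def verify_greedy(
    prefix: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Keep draft tokens while they match the target's greedy choice.

    `target_next` is a stand-in for the target model (an assumption of this
    sketch). A real system would score all draft positions in one forward
    pass; it is called once per position here only for clarity.
    """
    accepted: List[int] = []
    for tok in draft:
        expected = target_next(list(prefix) + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
    # Every draft token matched: the target still contributes one more token.
    accepted.append(target_next(list(prefix) + accepted))
    return accepted
```

Because the table is rebuilt from tokens the current model has just produced, the drafter follows the model as its weights change, at the cost of lower acceptance on text it has no statistics for.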

> Hello, does this code support the multiple-request spec?

Yes, I will support it.

> Hi @yukavio Is there any recent progress or plan for this? Do you plan to support deepseek-v2?

I have implemented the draft and verify stages and tested them on...

> Also, I have another question: in the PR, model_runner.py initializes the KV cache twice in different tpworker, which results in OOM on the GPU. If we merge the draft and...

> @yukavio Hi yukavio, recently SGLang has undergone some refactoring work. You need to merge the latest main to resolve the corresponding conflicts. Thanks!

OK, I am fixing some bugs...

> @yukavio Hi, when is this PR expected to be merged? I've trained a draft model and am eager to try it out.

If all goes well I will finish...

I have updated the code. The new implementation supports draft model inference with CUDA graph and batched speculative inference. It can be tested by running `python examples/runtime/engine/offline_batch_inference.py`. TODO:...