Medusa
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
Hi @ctlllll, I am trying to use Medusa on a LLaMA model and run some Medusa-head experiments. When `base_model_config.medusa_num_heads` in `from_pretrained` (`medusa_model.py`) is set to 2 or 3, an error will...
Hello, I want to fine-tune the Medusa heads for Llama 2 70B. On an A100-80G, if I do not want to use a quantized model, the model cannot fit on a single A100....
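A rough back-of-the-envelope check (an illustrative sketch, not from the thread; it ignores activations, KV cache, and the Medusa heads themselves) shows why the unquantized base model does not fit: 70B parameters at 2 bytes each already exceed 80 GB, while an 8-bit frozen base would just fit.

```python
# Illustrative weight-memory estimate for a 70B base model on one 80 GB GPU.
# Numbers are approximate: only base-model weights are counted.

def base_model_gb(num_params_billion: float, bytes_per_param: float) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes) for the base model."""
    return num_params_billion * 1e9 * bytes_per_param / 1e9

fp16_gb = base_model_gb(70, 2)   # fp16/bf16 weights
int8_gb = base_model_gb(70, 1)   # 8-bit quantized weights

print(f"fp16 weights: {fp16_gb:.0f} GB")  # 140 GB -> does not fit in 80 GB
print(f"int8 weights: {int8_gb:.0f} GB")  # 70 GB  -> fits, barely
```

This is why the thread asks about quantization: with the base model frozen and quantized, only the small Medusa heads need gradients and optimizer state.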
OSError
```shell
python gen_model_answer_baseline.py --model-path /data/transformers/vicuna-7b-v1.3 --model-id vicuna-7b-v1.3-0
python gen_model_answer_medusa.py --model-path /data/transformers/medusa_vicuna-7b-v1.3 --model-id medusa-vicuna-7b-v1.3-0
```

My vicuna-7b-v1.3 download comes from: https://huggingface.co/FasterDecoding/medusa-vicuna-7b-v1.3/tree/main
My medusa-vicuna-7b-v1.3 download comes from: https://huggingface.co/FasterDecoding/medusa-vicuna-7b-v1.3/tree/main
I used this command to add the local...
Roadmap
# Roadmap

## Functionality

- [x] #36
- [x] #39
- [ ] Distill from any model without access to the original training data
- [ ] Batched inference
- ...
Hi, I'm not an expert, so this might be a naive question, but I'm confused about the heads-warmup part of the Medusa paper. In that part it...
vLLM support
https://github.com/mlc-ai/mlc-llm https://github.com/mlc-ai/llm-perf-bench
We are currently running out of bandwidth. Contributors to help integrate Medusa into llama.cpp would be greatly appreciated :)
Can Medusa use FasterTransformer in the future?
The Qwen 7B/14B models look strong. I understand we don't have access to their dataset, but it would still be extremely useful to have Medusa heads fine-tuned on a smaller Chinese/English dataset.