halexan
I raised this issue before as well, and the author simply closed it without discussion. It seems the author hasn't been updating the project recently.
Hoping support for DeepSeek-V2 and DeepSeek-Coder-V2 can be added. vLLM 0.5.1 already supports DeepSeek-V2; see the [vllm 0.5.1 release](https://github.com/vllm-project/vllm/releases/tag/v0.5.1).
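For reference, a minimal vLLM sketch for loading DeepSeek-V2-Chat, assuming vllm >= 0.5.1 and enough GPU memory; the tensor-parallel size and context length below are illustrative, not a tuned configuration:

```python
# Minimal sketch, not an official recipe: run DeepSeek-V2-Chat with vLLM >= 0.5.1.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Chat",  # or a local checkpoint path
    tensor_parallel_size=8,                # adjust to the number of GPUs available
    trust_remote_code=True,                # DeepSeek-V2 ships custom modeling code
    max_model_len=8192,                    # shrink if the KV cache does not fit
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Write a quicksort function in Python."], params)
print(outputs[0].outputs[0].text)
```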
Looking forward to it!
Any updates on DeepSeek-V2?
> In order to efficiently deploy DeepSeek-V2 for service, we first convert its parameters into the precision of FP8. In addition, we also perform KV cache quantization (Hooper et al.,...
Tested DeepSeek-V2-Chat-0628 on 8*A800, served with:

```bash
python3 -m sglang.launch_server \
  --model-path /data/model-cache/deepseek-ai/DeepSeek-V2-Chat-0628 \
  --served-model-name deepseek-chat \
  --tp 8 \
  --enable-mla \
  --disable-radix-cache \
  --mem-fraction-static 0.87 \
  --schedule-conservativeness 0.1 \
  --chunked-prefill-size...
```
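Once the server is up, it exposes an OpenAI-compatible API; a minimal client sketch, assuming SGLang's default port 30000 and the `--served-model-name` used above:

```python
# Minimal client sketch: query the SGLang server started above.
# Assumes the default port 30000; pass --port to launch_server to change it.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="deepseek-chat",  # matches --served-model-name
    messages=[{"role": "user", "content": "Hello, which model are you?"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```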
@Xu-Chen Does your 8*A800 setup have NVLink?
> vLLM doesn't support MoE FP8 models on Ampere. This is because vLLM uses Triton for its FusedMoE kernel, which doesn't support the FP8 Marlin mixed-precision gemm. See https://huggingface.co/neuralmagic/DeepSeek-Coder-V2-Instruct-FP8/discussions/1 >...
> @Xu-Chen So can we use sglang to run deepseek v2 232B? Thanks

Yes, you can, without quantization.