fastllm 不期望的停止 DeepSeek-R1-0528-INT4

已经设置max_token为32768

每次contextLen接近10000都会中止，显而易见模型应该继续输出的。

请问哪个参数控制max contextLen？

alive = 1, pending = 0, contextLen = 9730, Speed: 3.199408 tokens / s. alive = 1, pending = 0, contextLen = 9730, Speed: 3.096603 tokens / s. alive = 1, pending = 0, contextLen = 9730, Speed: 3.186505 tokens / s. alive = 1, pending = 0, contextLen = 9730, Speed: 3.197784 tokens / s. alive = 1, pending = 0, contextLen = 9730, Speed: 3.188518 tokens / s.

Jun 24 '25 15:06 shalene847

我这边没有这个问题，使用ftllm server部署的吗？前端有设置max_new_token吗

Jun 27 '25 10:06 ztxz16

我遇到类似的问题。我是通过 docker 形式部署，用的 ftllm server，具体命令 export FASTLLM_USE_NUMA=ON && export FASTLLM_NUMA_THREADS=32 && ftllm server /models/deepseek/DeepSeek-R1-0528-INT4 --model_name ft_deepseek --port 8080 --device cuda --moe_device numa -t 1 通过 openwebui 调用。会出现如下情况，token输出大概到2000的时候，单个对话的session就基本不响应了。启动服务的时候没有设置 max_new_token ，代码中没见着 server 有这个参数。我看底层中输出长度由 output_token_limit 控制。

Jul 02 '25 02:07 icm-ai

找到了open webui 的页面设置为最大 131072 后，发现还是到接近3000token时候出现Long Prefill ... (0%)，然后服务就没什么反应了，查了代码可能是输入序列长度超过预设阈值出现的，不过为什么会长时间没反应？新开对话窗口也是没有反应。

Jul 02 '25 08:07 icm-ai

I have same issue. No matter what value for max_token, the ftllm will get into this issue if contextlen is greater than 10000.

Jul 18 '25 14:07 artisankat2000

我这边看到的情况是显存吃完了之后就不响应了

Jul 20 '25 08:07 luyufan498