Tommy Yang

5 comments by Tommy Yang

I'm facing a similar issue when running inference with the Qwen-72B model. The build parameters used for TensorRT are:

```bash
python build.py --hf_model_dir ./Qwen-72B-chat/ \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    ...
```

Same issue here. As a temporary workaround in the streaming scenario, I changed the parsing logic in qwen2d5_parser.py to wait until `` is present before parsing, and made api_server simply `continue` when the tool message is None. That resolved the problem for now.
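A minimal sketch of the workaround described above — this is not the actual lmdeploy code, and the delimiter names (`<tool_call>` / `</tool_call>`) are assumptions, since the real marker is elided in the comment. The idea: in streaming mode, buffer incoming chunks and only attempt to parse a tool call once its closing marker has fully arrived, instead of parsing partial output.

```python
# Hypothetical sketch (assumed delimiters, not the real qwen2d5_parser.py logic):
# defer parsing until the closing tool-call marker exists in the buffer.

TOOL_CALL_START = "<tool_call>"   # assumed start marker
TOOL_CALL_END = "</tool_call>"    # assumed end marker; the actual one is elided above


def stream_tool_calls(chunks):
    """Yield tool-call payloads only after the end marker has been seen."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Wait for the closing marker before parsing, so a half-streamed
        # tool call never reaches the parser.
        while TOOL_CALL_END in buffer:
            payload, _, buffer = buffer.partition(TOOL_CALL_END)
            start = payload.rfind(TOOL_CALL_START)
            if start != -1:
                yield payload[start + len(TOOL_CALL_START):].strip()


# The marker arrives split across two streamed chunks:
calls = list(stream_tool_calls(
    ['<tool_call>{"name": "get_w', 'eather"}</tool_call>']
))
print(calls)  # ['{"name": "get_weather"}']
```

The second half of the workaround (skipping the request in api_server when the tool message is None) would just be an early `continue` in the server's message loop before this parsing step runs.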

Hi @RunningLeon, I just submitted a PR for this issue; please review.

> > @RunningLeon Is it fixed when you supported interns1 reasoning parser?
>
> The first problem in the below should be fixed.

@ywx217 hi, as for the second one, ...