高凌霄
### Is there an existing issue for this?

- [X] I have searched the existing issues

### Current Behavior

By reading the GLM-related papers, I have summarized the differences between GLM and GLM-130B:

| Model | PE | Normalization |
| ------------- |-------------|...
From this [article](https://www.anyscale.com/blog/continuous-batching-llm-inference), I learned that continuous batching and PagedAttention greatly improve the inference performance of large models. I would like to know whether FasterTransformer has plans to support these...
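For intuition only, here is a toy Python sketch of the scheduling idea behind continuous batching (nothing here is FasterTransformer code; the request names, decode-step counts, and the `max_batch` slot limit are all made up): instead of waiting for an entire static batch to finish, a waiting request is admitted the moment any running request completes.

```python
from collections import deque

def continuous_batching(requests, max_batch=3):
    """Toy scheduler: each request is (name, decode_steps).

    A new request is admitted as soon as a batch slot frees up,
    rather than waiting for the whole batch to drain (static batching).
    Returns (name, step_at_which_it_finished) in completion order.
    """
    pending = deque(requests)
    running = {}            # name -> remaining decode steps
    finished, step = [], 0
    while pending or running:
        # admit new requests into any free batch slots
        while pending and len(running) < max_batch:
            name, steps = pending.popleft()
            running[name] = steps
        # run one decode step for every in-flight request
        for name in list(running):
            running[name] -= 1
            if running[name] == 0:
                del running[name]
                finished.append((name, step))
        step += 1
    return finished

reqs = [("a", 2), ("b", 5), ("c", 1), ("d", 3)]
print(continuous_batching(reqs))
```

With static batching, "d" could not start until the slowest request in the first batch ("b") finished; here it starts as soon as "c" frees a slot.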
### Description

```shell
I start the Triton server with '--model-control-mode poll'. A segmentation fault occurs when the model directory is modified.
```

### Reproduced Steps

```shell
1. CUDA_VISIBLE_DEVICES=3,4,5,6 /opt/tritonserver/bin/tritonserver --model-repository=/ft_workspace/all_models/t5/ --http-port 8008 --model-control-mode poll...
```
Issue reproduction steps: when quantizing the Qwen2.5-VL model, setting the n_parallel_calib_samples parameter causes a shape-mismatch error in the transformers library while computing RoPE. The cause is that Qwen2.5-VL uses mrope, a 3-dimensional RoPE variant: the position_embedding argument it passes carries three different frequency streams, so the shape of position_embedding changes, and the original method cannot accommodate this variation. Note: this sample is only a simple implementation; a maintainer should integrate it more cleanly...
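To make the shape mismatch concrete, here is a minimal NumPy sketch (the function names and the `[1, 1, 2]` section split are illustrative, not the model's actual configuration): standard RoPE produces a cos table of shape `(seq_len, head_dim)`, while an mrope-style table stacks three position streams on a new leading axis, giving `(3, seq_len, head_dim)`, which code written for the 2-D layout cannot consume directly.

```python
import numpy as np

def rope_cos(position_ids, head_dim):
    """cos table for standard RoPE: (seq_len,) positions -> (seq_len, head_dim)."""
    inv_freq = 1.0 / (10000.0 ** (np.arange(0, head_dim, 2) / head_dim))
    freqs = position_ids[:, None] * inv_freq[None, :]
    emb = np.concatenate([freqs, freqs], axis=-1)  # duplicate halves, RoPE-style
    return np.cos(emb)

def mrope_cos(position_ids_3d, head_dim):
    """mrope-style table: three position streams (e.g. temporal/height/width)
    stacked on a new leading axis -> (3, seq_len, head_dim)."""
    return np.stack([rope_cos(p, head_dim) for p in position_ids_3d], axis=0)

def flatten_mrope(cos, mrope_section):
    """Collapse (3, seq_len, head_dim) back to (seq_len, head_dim) by giving
    each stream its own slice of rotary channels. Sections are in half-dim
    units, so each slice is section*2 wide after the cos duplication above."""
    widths = [s * 2 for s in mrope_section]
    splits = np.split(cos, np.cumsum(widths)[:-1], axis=-1)
    return np.concatenate([chunk[i % 3] for i, chunk in enumerate(splits)], axis=-1)

seq_len, head_dim = 4, 8
pos = np.arange(seq_len, dtype=float)
cos2d = rope_cos(pos, head_dim)                         # (4, 8): standard layout
cos3d = mrope_cos(np.stack([pos, pos, pos]), head_dim)  # (3, 4, 8): mrope layout
merged = flatten_mrope(cos3d, mrope_section=[1, 1, 2])  # (4, 8) again
```

When all three position streams are identical (as in the sketch), the merged table equals the standard one, so the shim only changes behavior where the streams actually diverge.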