hezeli123
The logic of repetition_penalty in FT is not the same as in the OpenAI description. How should it be used? OpenAI: https://platform.openai.com/docs/guides/gpt/managing-tokens

mu[j] -> mu[j] - c[j] * alpha_frequency - float(c[j] > 0) * alpha_presence
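For illustration, here is a minimal sketch (hypothetical helper names, not FT's actual implementation) contrasting the two behaviors: OpenAI applies additive penalties that scale with the occurrence count c[j], while FT's repetition_penalty is a multiplicative, CTRL-style penalty applied once per token that has already appeared.

```python
from collections import Counter

def openai_style(logits, generated, alpha_frequency=0.5, alpha_presence=0.5):
    # Additive: mu[j] -> mu[j] - c[j]*alpha_frequency - float(c[j] > 0)*alpha_presence
    counts = Counter(generated)
    out = list(logits)
    for tok, c in counts.items():
        out[tok] -= c * alpha_frequency + alpha_presence
    return out

def ft_style(logits, generated, repetition_penalty=1.2):
    # Multiplicative (CTRL-style): divide positive logits by the penalty,
    # multiply negative ones, once per distinct token seen so far.
    out = list(logits)
    for tok in set(generated):
        out[tok] = out[tok] / repetition_penalty if out[tok] > 0 else out[tok] * repetition_penalty
    return out

print(openai_style([2.0, -1.0, 0.5], [0, 0, 2]))  # token 0 penalized twice plus presence
print(ft_style([2.0, -1.0, 0.5], [0, 0, 2]))      # token 0 penalized once, multiplicatively
```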
> Hi @calico-niko @bnuzhanyu The ViT is offloaded to TRT, and its FP32 accuracy on TRT 9.3 is aligned with PyTorch. And you can also change the version of...
The current ViT differences have a large impact and lead to many bad cases, so I now run ViT at FP32 precision.
> Hi @hezeli123, you said that this works for you when not using pipeline parallelism. I assume you just omitted `--pp_size` or set it to `1` when you built...
> Could you share the content of your `/tensorrtllm_backend/all_models/bls/` folder?

This issue may be the same problem as https://github.com/triton-inference-server/tensorrtllm_backend/issues/354. The pre/post-processing model files were taken from: https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/all_models/inflight_batcher_llm
| Concurrency | norm tokens/s | awq tokens/s |
| -- | -- | -- |
| 1 | 40.93 | 42.06 |
| 2 | 62 | 60.52 |
| 4 | 79.08 | 73.32 |
| 8 | 94.4... | |
> Could you share the benchmark scripts?

The script is simple: take a batch of external image URLs (e.g. http://img1.baidu.com/it/u=3682444617,1983875605&fm=253&app=138&f=JPEG?w=1067&h=800), call the OpenAI-compatible interface with synchronous round-robin requests, and record the throughput at the bottleneck. @lvhan028 you could also check performance with your internal tooling; for smaller models, the speedup from quantization seems poor.
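For reference, a rough sketch of that benchmark loop, assuming an OpenAI-compatible endpoint such as lmdeploy's api_server; the base URL, model name, worker count, and prompt are placeholders:

```python
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:23333/v1", api_key="none")
urls = ["http://img1.baidu.com/it/u=3682444617,1983875605&fm=253&app=138&f=JPEG?w=1067&h=800"]

def one_request(url):
    # One synchronous chat-completion call with an image URL attached.
    resp = client.chat.completions.create(
        model="placeholder-model",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": url}},
        ]}],
    )
    return resp.usage.completion_tokens

start = time.time()
with ThreadPoolExecutor(max_workers=8) as pool:  # 8 = concurrency level under test
    total_tokens = sum(pool.map(one_request, urls * 16))
print(f"{total_tokens / (time.time() - start):.2f} tokens/s")
```

Varying `max_workers` reproduces the concurrency column in the table above.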
No stop request was sent during this period.
The logs are as follows; there is image-download information after the request was received, but no subsequent logs related to LLM inference.

2024-07-11 19:59:48,123 - lmdeploy - INFO - async_collect_pil_images latency: 98.4154 ms
2024-07-11 19:59:48,123 - lmdeploy - INFO - ImageEncoder received 1 images, left 1 images.
2024-07-11 19:59:48,123 - lmdeploy...
After compiling with oneflow_compile, the generated images are all black; with torch.compile(unet) there is no problem and the generated images are normal.
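A minimal repro sketch, assuming a diffusers Stable Diffusion pipeline and onediff's oneflow_compile entry point; the model id and prompt are placeholders:

```python
import torch
from diffusers import StableDiffusionPipeline
from onediff.infer_compiler import oneflow_compile

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Path 1: compile the UNet with oneflow_compile -- images come out all black.
pipe.unet = oneflow_compile(pipe.unet)

# Path 2 (workaround): torch.compile produces normal images.
# pipe.unet = torch.compile(pipe.unet)

image = pipe("a photo of an astronaut riding a horse").images[0]
image.save("out.png")
```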