Yulei Qian
@dskhudia Thank you for your explanation. However, I still have a question about this description. "LLaMa-2-70B being a dense model reaches compute bound regime earlier and afterwards doubling of concurrency...
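The "compute bound regime" claim in that quote can be checked with a standard roofline back-of-envelope. This sketch is my own illustration, not from the thread: the A100 specs are public figures, and the model ignores KV-cache traffic and attention FLOPs for simplicity.

```python
# Roofline sketch: a dense decoder streams all its weights once per decode
# step, so decoding is memory-bound until the batch is large enough that
# matmul compute time exceeds the weight-load time. Past that crossover,
# doubling concurrency no longer doubles throughput.

def crossover_batch(peak_flops: float, mem_bw: float) -> float:
    """Batch size where decode flips from memory- to compute-bound.

    Per generated token: ~2*P FLOPs of matmul work and ~2*P bytes of
    fp16 weights, where P is the active parameter count.
      compute time = batch * 2P / peak_flops
      weight time  =         2P / mem_bw   (weights read once per step)
    These cross when batch ~= peak_flops / mem_bw (the machine balance),
    independent of P.
    """
    return peak_flops / mem_bw

# Assumed A100-80G specs: ~312 TFLOP/s dense fp16, ~2.0 TB/s HBM.
b = crossover_batch(312e12, 2.0e12)
print(round(b))  # ~156 concurrent sequences per GPU
```

A model with fewer active parameters does the same per-token work faster once compute-bound, which is one reading of why a dense 70B model "reaches compute bound regime earlier" and stops scaling with concurrency sooner.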
> @dskhudia Thank you for your swift response. Our test was done on an 8x-A100-80G system with our proprietary inference engine, which also has continuous batching and split-fuse. According to our...
> @JadeRay , @FC-Li > > I dug a bit deeper into it. In the continuous batching setting, there is more latency in the iteration where trt-llm removes a request and...
> For example, DBRX is both higher quality than LLaMA2-70B and - thanks to having about half as many active parameters - DBRX inference throughput is up to 2x faster...
@dskhudia We have benchmarked DBRX and LLaMA2-70B layer by layer, and we find that the TTFT benefit comes from per-token FLOPs and the TPOT benefit comes from communication, as...
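The layer-by-layer observation above fits a simple two-term cost model: prefill (TTFT) is compute-bound and scales with active-parameter FLOPs, while decode (TPOT) is bandwidth-bound plus communication. A minimal sketch, assuming public parameter counts (DBRX ~36B active, LLaMA2-70B ~70B) and ideal 8-GPU scaling; the split into compute and bandwidth terms is my simplification, not the thread's measured data.

```python
def ttft_estimate(prompt_tokens: int, active_params: float,
                  peak_flops: float) -> float:
    """Prefill is compute-bound: ~2 FLOPs per active parameter per
    prompt token, divided by aggregate compute throughput."""
    return 2 * active_params * prompt_tokens / peak_flops

def tpot_estimate(active_params: float, mem_bw: float,
                  comm_overhead_s: float = 0.0) -> float:
    """Decode is bandwidth-bound: stream ~2 bytes (fp16) per active
    parameter per token, plus per-step all-reduce communication time
    (the term the thread attributes the TPOT difference to)."""
    return 2 * active_params / mem_bw + comm_overhead_s

# Assumed: 8x A100 at ~312 TFLOP/s fp16 and ~2.0 TB/s HBM each,
# with ideal tensor-parallel scaling (a simplification).
AGG_FLOPS = 8 * 312e12
AGG_BW = 8 * 2.0e12

llama_ttft = ttft_estimate(1000, 70e9, AGG_FLOPS)
dbrx_ttft = ttft_estimate(1000, 36e9, AGG_FLOPS)
print(llama_ttft / dbrx_ttft)  # ~1.9x: roughly half the active FLOPs
```

Under this model, halving active parameters halves both prefill FLOPs and per-token weight traffic, while communication overhead depends on the parallelism layout rather than the parameter count, which is consistent with TPOT gains showing up in the communication term.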