Yulei Qian

5 comments of Yulei Qian

@dskhudia Thank you for your explanation. However, I still have a question about this description: “LLaMa-2-70B being a dense model reaches compute bound regime earlier and afterwards doubling of concurrency...
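For context on the compute-bound claim, here is a minimal roofline sketch. The numbers are assumptions, not from the thread: BF16 weights, weight reads dominating memory traffic, and nominal A100-80G specs (312 TFLOP/s BF16 peak, ~2.0 TB/s HBM bandwidth).

```python
# Roofline sketch: estimate the decode batch size at which a dense model's
# GEMMs cross from bandwidth-bound to compute-bound on one A100-80G.
# All hardware numbers below are assumed nominal specs, not measurements.

PEAK_FLOPS = 312e12   # A100 BF16 tensor-core peak, FLOP/s (assumed)
HBM_BW = 2.0e12       # A100-80G HBM bandwidth, bytes/s (assumed)
BYTES_PER_PARAM = 2   # BF16 weights

def arithmetic_intensity(batch: int) -> float:
    """FLOPs per byte for a decode GEMM of shape (batch, k) x (k, n).

    FLOPs = 2 * batch * k * n, weight bytes = BYTES_PER_PARAM * k * n,
    so intensity = 2 * batch / BYTES_PER_PARAM, independent of k and n.
    """
    return 2 * batch / BYTES_PER_PARAM

ridge = PEAK_FLOPS / HBM_BW              # intensity where compute and bandwidth balance
crossover = ridge * BYTES_PER_PARAM / 2  # batch at which GEMMs turn compute-bound
print(f"ridge point: {ridge:.0f} FLOP/byte")
print(f"dense model turns compute-bound near batch ~{crossover:.0f}")

# An MoE routes each token to a subset of experts, so each expert GEMM sees
# only a fraction of the batch and stays bandwidth-bound up to larger
# concurrency; a dense model like LLaMA-2-70B hits the compute-bound
# regime earlier, as the quoted description says.
```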

> @dskhudia Thank you for your swift response. Our test was done on an 8x A100-80G system with our proprietary inference engine, which also has continuous batching and split-fuse. According to our...

> @JadeRay, @FC-Li I dug a bit deeper into it. In a continuous batching setting, there is more latency in the iteration where trt-llm removes a request and...
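One way to surface this effect (not from the thread) is to time each scheduler iteration and flag the steps where batch membership changes. `engine.step()` and `engine.active_request_ids()` below are hypothetical stand-ins for whatever your inference engine actually exposes.

```python
# Instrumentation sketch: flag the per-iteration latency spike that can occur
# when a continuous-batching scheduler removes or inserts a request.
import time

def profile_iterations(engine, num_iters: int) -> None:
    prev_ids = set(engine.active_request_ids())   # hypothetical engine API
    for it in range(num_iters):
        t0 = time.perf_counter()
        engine.step()                             # one decode iteration
        dt_ms = (time.perf_counter() - t0) * 1e3
        ids = set(engine.active_request_ids())
        changed = ids != prev_ids                 # a request was added/removed
        prev_ids = ids
        print(f"iter {it:4d}  {dt_ms:7.2f} ms  batch={len(ids):3d}"
              + ("  <- membership changed" if changed else ""))
```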

> For example, DBRX is both higher quality than LLaMA2-70B and - thanks to having about half as many active parameters - DBRX inference throughput is up to 2x faster...

@dskhudia We have benchmarked DBRX and LLaMA2-70B layer by layer, and we find that the TTFT benefit comes from per-token FLOPs and the TPOT benefit comes from communication, as...
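A back-of-the-envelope sketch of the per-token FLOPs side of this argument, using the published parameter counts (DBRX: 132B total, ~36B active; LLaMA2-70B: 70B dense) and the standard ~2 FLOPs per active parameter per token approximation, ignoring attention FLOPs:

```python
# Per-token forward FLOPs ~= 2 * active parameters (multiply-accumulate = 2
# FLOPs); attention FLOPs, which grow with context length, are ignored here.
MODELS = {
    "LLaMA2-70B": {"total_params": 70e9, "active_params": 70e9},
    "DBRX":       {"total_params": 132e9, "active_params": 36e9},
}

for name, p in MODELS.items():
    flops_per_token = 2 * p["active_params"]
    print(f"{name:11s}  ~{flops_per_token / 1e9:.0f} GFLOPs/token")

# Prefill (TTFT) is compute-bound, so it roughly tracks active-parameter FLOPs:
ratio = MODELS["LLaMA2-70B"]["active_params"] / MODELS["DBRX"]["active_params"]
print(f"expected prefill speedup from FLOPs alone: ~{ratio:.1f}x")

# Decode (TPOT) at small batch is dominated by reading the *total* weights and
# by inter-GPU traffic rather than by FLOPs, which is consistent with the
# observation above that the TPOT benefit shows up in communication.
```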