
Questions about Llama benchmarks

Open ljk3210 opened this issue 2 years ago • 2 comments

Really exciting to see progress on LLM benchmarking in the loadgen codebase. A few questions:

a) Is first-token latency going to be the only metric? Sometimes we might also need per-token latency, or other metrics.
b) Regarding the "significant overhead" mentioned in https://github.com/mlcommons/inference/blob/master/language/llama2-70b/SUT.py#L333 : is it because adding FirstTokenStreamer to generate slows it down significantly?
c) Is there an expected release date for mlperf-loadgen 4.0?

Thanks for the great work :)

ljk3210 avatar Jan 19 '24 05:01 ljk3210

@pgmpablo157321 to answer

mrmhodak avatar Jan 23 '24 16:01 mrmhodak

@ljk3210 For the Server scenario, first-token latency is one of the constraints that have to be met; the other is time-per-output-token. Statistics for both are reported during the run, but the actual metric of interest is the target QPS, which determines the frequency at which queries are dispatched to the client system.
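To make the two constraints concrete, here is a small sketch of how first-token latency (TTFT) and time-per-output-token (TPOT) could be derived from per-token arrival timestamps. The helper name and signature are hypothetical, purely for illustration; this is not loadgen's actual API:

```python
def ttft_and_tpot(token_timestamps, issue_time):
    """Hypothetical helper: derive first-token latency (TTFT) and
    time-per-output-token (TPOT) for one query.

    token_timestamps: arrival times (seconds) of each output token.
    issue_time: when the query was dispatched to the SUT.
    """
    # TTFT: delay from query dispatch to the first output token.
    ttft = token_timestamps[0] - issue_time
    if len(token_timestamps) > 1:
        # TPOT: average gap between successive tokens after the first.
        tpot = (token_timestamps[-1] - token_timestamps[0]) / (len(token_timestamps) - 1)
    else:
        tpot = 0.0
    return ttft, tpot
```

In the Server scenario both values would be checked against their latency targets per query, while the run as a whole is driven at the target QPS.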

Regarding b): yes, the overhead is a perceived one, since the implementation is not optimized. The token streamer runs in a live thread.
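For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of a streamer consumed on a live thread. It is loosely modeled on the FirstTokenStreamer idea in SUT.py but is not the actual implementation; `fake_generate` stands in for a real `model.generate(..., streamer=...)` call:

```python
import queue
import threading
import time

class FirstTokenStreamer:
    """Sketch of a streamer that records when the first token arrives.
    Illustrative only; not the SUT.py implementation."""

    def __init__(self):
        self._queue = queue.Queue()
        self.first_token_time = None

    def put(self, token_id):
        # Record the arrival time of the very first token.
        if self.first_token_time is None:
            self.first_token_time = time.time()
        self._queue.put(token_id)

    def end(self):
        # Sentinel marking the end of generation.
        self._queue.put(None)

    def __iter__(self):
        while True:
            tok = self._queue.get()
            if tok is None:
                return
            yield tok

def fake_generate(streamer, tokens):
    # Stand-in for the generation call: emits tokens, then signals the end.
    for t in tokens:
        streamer.put(t)
    streamer.end()

streamer = FirstTokenStreamer()
# The consumer iterates the streamer while generation runs on a worker
# thread; keeping this extra thread alive per query is the kind of
# unoptimized machinery the "significant overhead" comment refers to.
worker = threading.Thread(target=fake_generate, args=(streamer, [101, 7592, 102]))
worker.start()
collected = list(streamer)
worker.join()
```

After the run, `collected` holds the streamed token ids and `streamer.first_token_time` marks when the first one arrived.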

@pgmpablo157321 can comment on the release date for loadgen 4.0.

attafosu avatar Jan 23 '24 17:01 attafosu