
Benchmarking Question List #1

Agalakdak opened this issue 1 year ago • 3 comments

Hello everyone. I have been using the MLPerf benchmarks for some time, and I have a small list of questions about them. I am asking them here because I have not found answers in other sources.

  1. I have several video cards in my system. Can I explicitly set the number of video cards for the test?

  2. This question follows from the question above. Do all tests use all available GPUs?

  3. Many tests have different profiles, like "edge" and "datacenter". What is the difference between them?

  4. Since the space on my SSD is limited, how can I tell the benchmarks to use a different directory to store the cache?

  5. The tests (in the profiles that I used) do not always use 100% of the video memory. Are there any scenarios for which all the video memory will be used, or is this not necessary?

  6. Perhaps there are more fine-grained benchmark settings; is there any user guide?

Agalakdak • Sep 27 '24

Hi @Agalakdak. Some of your questions depend on the benchmark implementation: we currently have Nvidia, Intel, and reference implementations for most or all of the benchmarks, and other vendor implementations are available for some of them.

  1. "no" for most of the reference implementations except some like for llama2. "yes" for Nvidia implementation though it uses all the GPUs by default.
  2. For Nvidia implementation - "yes". For reference implementation, it uses 1 GPU by default and in some benchmark implementations it supports multiple GPUs.
  3. Those are two different submission categories. The scenarios required to be run differ between them; "Offline" is the only scenario common to both.
  4. export CM_REPOS=<NEW_PATH> can be used to do this, or you can create a symlink for any folder inside the $HOME/CM/repos/local/cache path (both options are illustrated after this list).
  5. Many small inference models do not need a large amount of GPU memory. The parameter size given here is usually a good guide to the required GPU memory (a rough worked example follows this list).
  6. Unfortunately not much currently, as most implementations by default only support the systems on which the MLPerf results were submitted. We are trying to extend this, but it is a work in progress, and the implementations and results change every 6 months.
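
For the GPU-count question, one common workaround when an implementation has no explicit device-count option is to restrict which GPUs the process can see. A minimal sketch, assuming the implementation enumerates CUDA devices and therefore honors CUDA_VISIBLE_DEVICES (the benchmark command below is a placeholder, not a real entry point):

```bash
# Expose only GPUs 0 and 1 to the benchmark process;
# an implementation that enumerates CUDA devices will then see just these two.
export CUDA_VISIBLE_DEVICES=0,1

# Placeholder invocation; substitute your implementation's actual entry point.
python3 run_benchmark.py --scenario Offline
```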
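
For the cache location, here is a sketch of the two options from item 4; the /mnt/bigdisk paths are placeholders for wherever you have free space:

```bash
# Option A: point CM at a different repos directory entirely.
export CM_REPOS=/mnt/bigdisk/CM

# Option B: move the cache (or any large folder inside it) to the bigger
# disk and leave a symlink behind, so the original path keeps working.
mv "$HOME/CM/repos/local/cache" /mnt/bigdisk/cm-cache
ln -s /mnt/bigdisk/cm-cache "$HOME/CM/repos/local/cache"
```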
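
And for item 5, a back-of-the-envelope estimate (a rule of thumb, not an MLPerf-defined formula): the weights alone need roughly parameter count times bytes per parameter.

```bash
# e.g. a 7B-parameter model stored in fp16 (2 bytes per parameter):
# 7 (billion params) x 2 (bytes/param) ≈ 14 GB for the weights,
# before activations or KV cache.
echo "$((7 * 2)) GB (weights only, fp16)"
```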

arjunsuresh • Oct 02 '24

@arjunsuresh, thank you! And the last two questions for now.

  1. How can benchmarking results be interpreted? Is this some abstract metric, or can the data be read as "Model A can process x requests per second"?
  2. How can I donate $10?

Agalakdak • Oct 03 '24

  1. For the Offline scenario, samples per second is the usual metric. "Requests per second" or "queries per second" may not be correct, as a single request or query can contain multiple samples. For LLMs, though, the metric is often tokens per second (see the example after this list for where to read it).
  2. I don't think MLCommons is taking donations, but I might be wrong. You can contact the right people here.
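
On reading the metric: after a run, LoadGen writes a summary log, and the headline Offline throughput appears there. A minimal sketch, assuming the usual mlperf_log_summary.txt file name and metric label (check your implementation's output directory for the exact location):

```bash
# Pull the headline Offline throughput out of the LoadGen summary log.
grep "Samples per second" mlperf_log_summary.txt
```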

arjunsuresh • Oct 04 '24