
batch_size, computation_time, AU, MB vs MiB

Open captn-picard opened this issue 10 months ago • 3 comments

Good day,

A few questions regarding the storage benchmarks.

-I am wondering: are we "allowed" to change the batch_size when running the storage benchmark? Changing the batch_size also changes the AU/throughput results, but does it somehow make the benchmark invalid? I am asking because there is no --param for batch_size when launching ./benchmark.sh.

-If we are allowed to change the benchmark, must the computation_time be adjusted proportionally? From my understanding, the computation_time is used to simulate the "think time" of an accelerator per-batch. So, when running a benchmark, if we change the batch_size to something other than the default, must the computation_time also change proportionally? For example, if we double the batch_size in cosmoflow_a100.yaml from 1 (the default value) to 2, must the computation_time double as well (from 0.00551 to 0.01102)?

Image

-I'm trying to understand the point of the AU score. From my understanding, the AU score needs to pass for the throughputs to be considered valid? In the results that are published online I do not see the AU scores (just # Simulated Accelerators, Dataset Size, and Throughput), so do we just assume the AU passed?

Image

-For the throughput, in the results online it is written MiB/s whereas in the generated results it is written MB/second. Which one is it?

Image

Much appreciated!

captn-picard avatar Mar 13 '25 17:03 captn-picard

Hi,

  • A few questions regarding the storage benchmarks.

  • -When running a benchmark, if we change the batch_size to something other than the default, must the computation_time change proportionally? For example, if we change the cosmoflow_a100.yaml's batch_size to 2 (the default was 1), must the computation_time go from 0.00551 to 0.01102?

There is a whole process for showing "validated" results that includes a peer review among all the other submitters. You'll need to join the working group (https://mlcommons.org/working-groups/benchmarks/storage/) to participate in that, plus sign various legal documents such as a Trademark License Agreement. To talk about your "validated" results, you would need to include the name "MLPerf Storage", which is a trademark of MLCommons; that trademark is the hook that forces people to participate in the peer review process rather than just publish their results on their own.

The peer review process is a quality-control mechanism: "validated" results are much more trustworthy to the general public than unreviewed results, because all of the other submitters have reviewed them and are implicitly saying that there was no cheating in that result.

The peer review process only happens periodically. We're coming up on version 2 of the benchmark, and there will be a peer review of results for version 2.

There is a way to publish "unverified" results, but that still requires signing the Trademark License Agreement and there are things you cannot say or do with those unverified results.

You're welcome to join the working group and talk with all of the other people running the benchmark if you'd like to.

  • -I'm trying to understand the point of the AU score. From my understanding, the AU score needs to pass for the throughputs to be considered valid? In the results that are published online I do not see the AU scores (just # Simulated Accelerators, Dataset Size, and Throughput), so do we just assume the AU passed?

The "computation time" is, as a general description, how long it takes for a specific model of NVIDIA GPU to process a batch of the given workload. In the case you're asking about, it takes 323ms for an NVIDIA H100 to process a batch when running a 3D-Unet workload. That's a constant we cannot change.

The purpose of the benchmark is to determine how many H100 GPUs running a 3D-Unet workload a given vendor's storage system (and model) can support. We define "support" as "keep that GPU at least 90% busy", meaning that the GPU does not stall waiting for data to arrive from the storage system for more than 10% of the total benchmark run time. That's what the term "AU%" means: "accelerator utilization", where "accelerator" is a general term for GPUs and other silicon that processes neural networks.
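To make the AU idea concrete, here is a minimal sketch (an illustration of the concept only; the benchmark's actual accounting may differ in its details):

    # Minimal illustration of accelerator utilization (AU) -- not the benchmark's
    # actual formula. Assumption: the simulated accelerator is "busy" for
    # `computation_time` seconds per batch and idle whenever it waits on storage.

    def accelerator_utilization(num_batches, computation_time, total_runtime):
        busy_time = num_batches * computation_time
        return busy_time / total_runtime

    # Example: 10,000 batches at 0.323 s each (H100, 3D-Unet) in a 3,500 s run
    au = accelerator_utilization(10_000, 0.323, 3_500)
    print(f"AU = {au:.1%}")   # ~92.3%, which clears the 90% threshold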

  • -For the throughput, in the results online it is written MiB/s whereas in the generated results it is written MB/second. Which one is it?

The results are actually in MiB/s. The generated results in the log files are imprecise in their wording because the log files are mostly intended for use by the benchmark submitters and reviewers, while the web pages are intended for the public who wants to see the results.
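For anyone converting between the two, the standard definitions are 1 MB = 10^6 bytes and 1 MiB = 2^20 bytes, so a quick helper looks like this (a generic sketch, not something from the benchmark tooling):

    # Unit conversion between SI megabytes and binary mebibytes.
    MB = 10**6     # 1 MB  = 1,000,000 bytes (SI)
    MiB = 2**20    # 1 MiB = 1,048,576 bytes (binary)

    def mb_per_s_to_mib_per_s(rate_mb_per_s):
        return rate_mb_per_s * MB / MiB

    print(mb_per_s_to_mib_per_s(1000))   # 1000 MB/s is roughly 953.7 MiB/s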

Please let me know if you have more questions...

Thanks,

Curtis


FileSystemGuy avatar Mar 14 '25 14:03 FileSystemGuy

Hi Curtis,

Thank you for your quick response!

For now, our results are "unverified" as those benchmarks would mostly be used to obtain a reference point for our system as it evolves in the future. Of course we will disclose the fact that our results are "unverified" as per your Messaging Guidelines.

To come back to the computation_time, you mentioned it must remain constant. Is that the case even if we change the batch_size? Since the computation_time is based on the batch_size, I would think that if we were to double the batch_size, the computation_time should follow? Would this be a "correct way" to do things?

I am asking because increasing the batch_size and computation_time actually gives me better results (higher AU and higher throughput), and I would like to find the configuration that gives the best results for our current storage system.

With regards!

captn-picard avatar Mar 14 '25 19:03 captn-picard

Hi,

Sorry for the length of this, but there's a lot to cover.

Publishing MLPerf benchmark results may not work the way you think it does, so please let me go through how we operate while I'm responding to your questions.

MLPerf benchmarks are not like "fio" or "mdtest", open source programs that you can run however you think will show your best results. There are processes and requirements that MLCommons imposes because MLCommons believes doing so makes the benchmark results more trustworthy and more open. I say "MLCommons" in the third person because I, like nearly everyone who helped build these benchmarks, am a volunteer.

  • For now, our results are "unverified" as those benchmarks would mostly be used to obtain a reference point for our system as it evolves in the future. Of course we will disclose the fact that our results are "unverified" as per your Messaging Guidelines.

Caveat: I'm not MLCommons' lawyer, or any type of lawyer. "MLPerf" and "MLPerf Storage" are registered trademarks of MLCommons. You (or someone who can commit your organization to a contract) would need to sign MLCommons' Trademark License Agreement before you could use those trademarks, i.e., you cannot even say "unverified MLPerf Storage results" without agreeing to be bound by some of MLCommons' processes. The rationale behind that restriction is that MLCommons has worked hard to build up a reputation for MLPerf benchmarks as being trustworthy and accurate, and it does not want to let anyone leverage that reputation without some guardrails on what they say. So even with the "unverified" wording, there are limits on what you can say about MLCommons benchmark results; e.g., you cannot disparage another submission, you can pretty much only tout the value of your own. You've seen the "Messaging Guidelines" document already.

  • To come back to the computation_time, you mentioned it must remain constant. Is that the case even if we change the batch_size? Since the computation_time is based on the batch_size, I would think that if we were to double the batch_size, the computation_time should follow? Would this be a "correct way" to do things?

There are two "classes" of submission of MLPerf Storage, CLOSED and OPEN. CLOSED sacrifices essentially all forms of change to the benchmark in order to offer "comparability" across multiple submissions, a "level playing field" where all the storage systems were asked to support the same workload as best they could. OPEN sacrifices essentially all forms of comparability in order to offer the submitter wide (but not unlimited) freedoms to change the workload or the operating environment for the test, a "best case" scenario where the customer is willing to change what and/or how they do their AI/ML in order to get the benefits (higher performance) you offer if those changes were to be made. In a CLOSED submission, changing the batch size would be rejected, while in an OPEN submission, you would need to convince your "peers" that the submission was done fairly and accurately according to the Rules for that submission round, and that the changes represented a "valid" way to do AI/ML training. MLCommons uses a "peer review" process when a benchmark submission is made. All of the other submitters get a chance to review every aspect of every other submission, and that peer review group needs to come to (rough) consensus that each submission was done according to the Rules and is something that doesn't break the needed AI/ML aspects. Submission cycles only happen when new versions of the benchmark are released; happily, we're coming up on the v2.0 submission cycle. Between submission cycles, you can publish MLPerf Storage results, but only with the "unverified" wording, because they have not gone through the peer review process. Note the "not unlimited" wording from 2 paragraphs above. For example, we've had people ask if they could eliminate the weight-exchanges in data-parallel training in an effort to show how much faster AI/ML training would be without that. The weight-exchanges are simulated in the benchmark with an MPI global barrier that forces all instances of the benchmark code to stop and wait for all the others to come to the same synchronization point. Obviously, that reduces performance, so the question is a valid one. Unfortunately, data parallel AI/ML training without weight exchanges isn't AI/ML at all, so that request was denied. Batch sizes are determined with a delicate balancing of the DRAM capacity of the GPU and the convergence rate of the NN architecture based upon batch sizes, number of epochs, number of GPUs being used, etc, so bigger is not always better. A simple doubling of the batch sizes we use will likely result in lower quality of training (lower convergence rates), ie: the batch sizes are already optimal, and would not likely fit into the DRAM of the GPU in any case (if you attempted that doubling on real GPUs). All of the above is a consequence of MLCommons' focus on representing the reality of AI/ML, all the MLPerf benchmarks are tied as close as possible to actual AI/ML operations and workflows. So, changing a parameter of the test to make storage go faster may be fair game in OPEN but is not in CLOSED.

  • I am asking because increasing the batch_size and computation_time actually gives me better results (higher AU and higher throughput), and I would like to find the configuration that gives the best results for our current storage system.

As a consequence of doing things in an AI/ML-realistic fashion, data samples must be read from storage in a random order, even though that defeats nearly all forms of caching that storage products like to use. If you were training a NN to recognize cats versus dogs, and you showed the NN all the cat pictures and then all the dog pictures, the NN would have very low recognition accuracy. That's an example of needing to do something that is clearly much more stressful on storage, but the AI/ML algorithms require it (a small sketch of that access pattern appears at the end of this reply).

The best way to understand what is required and how these processes work is to join the working group (https://mlcommons.org/working-groups/benchmarks/storage/). The WG is currently open to anyone, at no cost; you just need to "register" so that we can set the Access Control Lists on the various documents so you can access them (we're hosted by Google Groups).

I hope that all the above wasn't too much detail. I tried to explain what our processes are and why we have them, as well as directly respond to your questions. Please let me know if anything doesn't make sense and/or you'd like more information on anything having to do with the benchmark.
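Here is that small sketch of the per-epoch shuffling (an illustration of the access pattern only, not the benchmark's data loader):

    # Per-epoch random access order -- illustration only, not the benchmark's loader.
    import random

    num_samples = 1_000              # hypothetical dataset size
    for epoch in range(3):
        order = list(range(num_samples))
        random.shuffle(order)        # a fresh random order each epoch defeats simple
                                     # read-ahead and sequential caching in storage
        for idx in order:
            pass                     # ... read sample `idx` from storage here ...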

Thanks,

Curtis


FileSystemGuy avatar Mar 14 '25 21:03 FileSystemGuy

This is moot (no longer relevant), so it is being closed.

FileSystemGuy avatar Jun 17 '25 21:06 FileSystemGuy