
Specify the kernel name or nvtx range for benchmarking.

neoblizz opened this issue 3 years ago • 9 comments

Right now it seems the intended method for benchmarking is simply encapsulating the kernel like so:

void my_benchmark(nvbench::state& state) {
  state.exec([](nvbench::launch& launch) { 
    my_kernel<<<num_blocks, 256, 0, launch.get_stream()>>>();
  });
}
NVBENCH_BENCH(my_benchmark);

What I have is more of an application built with multiple kernels (2 or more), so the benchmark looks more like this:

void run_my_app(cudaStream_t stream) {
  my_kernel1<<<num_blocks, 256, 0, stream>>>();
  cudaDeviceSynchronize();
  my_kernel2<<<num_blocks, 256, 0, stream>>>();
}

void my_benchmark(nvbench::state& state) {
  state.exec([](nvbench::launch& launch) {
    run_my_app(launch.get_stream());
  });
}
NVBENCH_BENCH(my_benchmark);

What I would like to ask/request is (1) whether there is an option to specify which kernel to benchmark within the app (either by kernel name or NVTX range). And (2) since I currently have the wrapper around the whole application, is it aggregating the per-launch metrics provided by CUPTI? When I see the following output, which kernel is it reporting for (or is it aggregated across all launches)?

HBWPeak  LoadEff  StoreEff  L1HitRate  L2HitRate  Samples  Samples  CPU Time    Noise   GPU Time    Noise
0.03%    64.10%   85.46%    19.56%     47.72%     3x       1024x    563.111 us  41.55%  558.423 us  41.54%

neoblizz avatar Jul 01 '22 23:07 neoblizz

NVBench is meant to benchmark single kernels, so the ideal solution would be to write separate benchmarks for my_kernel1 and my_kernel2.

That said, you can still execute multiple kernels in a single benchmark; the timings will reflect the runtime of the entire critical section passed to state.exec and will not be broken down by kernel. I believe the CUPTI metrics will also be aggregated across all kernels (@senior-zero, can you confirm?). The only thing to be careful of is synchronization -- you'll need to pass the nvbench::exec_tag::sync tag to the state.exec call if your benchmark synchronizes internally.
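Concretely, a multi-kernel benchmark with internal synchronization could look like the sketch below. The exec_tag::sync overload of state.exec is the real NVBench API; my_kernel1, my_kernel2, and num_blocks are stand-ins borrowed from the snippets above and assumed to be defined elsewhere.

```cuda
#include <nvbench/nvbench.cuh>

// my_kernel1/my_kernel2 and num_blocks are assumed to be defined elsewhere.
void my_benchmark(nvbench::state& state) {
  // exec_tag::sync tells NVBench that the body synchronizes internally,
  // so it can measure the critical section correctly.
  state.exec(nvbench::exec_tag::sync, [](nvbench::launch& launch) {
    my_kernel1<<<num_blocks, 256, 0, launch.get_stream()>>>();
    cudaStreamSynchronize(launch.get_stream());
    my_kernel2<<<num_blocks, 256, 0, launch.get_stream()>>>();
  });
}
NVBENCH_BENCH(my_benchmark);
```

The reported times and CUPTI metrics would cover both launches and the synchronization as one aggregate.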

The NSight tools are much better for end-to-end profiling / tuning, while NVBench is intended for regression testing kernel performance, so splitting up the kernels hasn't been a priority. That said, I have gotten other requests for better NVTX integration / awareness in NVBench in the past -- I think @jrhemstad had some thoughts on this?

alliepiper avatar Jul 25 '22 21:07 alliepiper

CUPTI metrics are reported on a per-range basis. That is, a single range is created around all of the kernels within your benchmark, and that range is what gets measured.

gevtushenko avatar Jul 25 '22 23:07 gevtushenko

I believe the CUPTI metrics will also be aggregated across all kernels (@senior-zero, can you confirm?).

Yeah, it seems like they're aggregated for the entire NVBench range, which is the critical section passed in by the user. I have a ton of use cases for keeping all profiling C++-based in the GPU libraries I work on, and it would be super neat to have CUPTI profiling set to "Auto" range mode, with each metric reported as a vector of per-kernel values instead of a single aggregate.

I am playing around with this change; the auto-range profiling works great within nvbench. It just needs a bit more work to take the gathered metrics and output them nicely (right now it shows only one of the many kernels).

NVBench is meant to benchmark single kernels, so the ideal solution would be to write separate benchmarks for my_kernel1 and my_kernel2.

That will be extremely hard to apply in cases where we are regression testing an application rather than a kernel: my_kernel1's output is consumed by my_kernel2, and together they make up one performance profile. The two kernels also perform drastically differently on different datasets.

I understand this may not be the priority, but just a +1 for me if this is ever on the roadmap! Thank you all! 😃

neoblizz avatar Jul 25 '22 23:07 neoblizz

NVBench is meant to benchmark single kernels

Just to qualify this statement a bit, NVBench works just fine with generic functions/applications that may contain many kernel calls.

void my_benchmark(nvbench::state& state) {
  state.exec([](nvbench::launch& launch) { 
    foo(); 
  });
}
NVBENCH_BENCH(my_benchmark);

NVBench is designed for answering the question "Is function foo() faster or slower?" where foo() may be a single kernel or an arbitrary function with many kernels/synchronizations.

As a result, timing and CUPTI metrics will be aggregated for anything within the benchmark range (as @senior-zero said).

Going beyond this scope by attempting to enable profiling of specific kernels/ranges inside foo() treads into the territory of reinventing functionality better provided by Nsight Systems or Nsight Compute.

That said, we can keep the issue open as I myself am guilty of requesting the very same functionality via NVTX ranges.

In my case, I was interested in benchmarking a function that had to read data from disk, and so the relevant kernel performance was tiny compared to disk read and it was impossible to see any performance impacts on the kernel. Furthermore, it was impractical/inconvenient to try and extract the kernel in question.

jrhemstad avatar Jul 25 '22 23:07 jrhemstad

NVBench is designed for answering the question "Is function foo() faster or slower?" where foo() may be a single kernel or an arbitrary function with many kernels/synchronizations.

That is exactly how I am using it right now. Nsight Systems/Compute are great too, but I have been enjoying nvbench because I can keep my entire benchmarking environment in C++, including the reports it outputs (excluding the graphs I later generate using Python).

For more context: https://github.com/NVIDIA/nvbench/issues/92#issuecomment-1194785278 I think profiling the entire range is great, but there's also a lot of value in finding out where the performance is going (for example, which kernel is the bottleneck). We have a few "operations" with 4-7 different kernel implementations each, with each implementation suited to a different type of dataset. nvbench could be so powerful in that environment if it could output per-kernel metrics.

neoblizz avatar Jul 26 '22 00:07 neoblizz

nvbench can be so powerful in helping that environment if it could individually output kernel metrics.

I don't disagree. Like I said above, I wished for exactly the same functionality.

The challenge is scope creep.

If we were to pursue this kind of functionality to its logical conclusion, we would end up reinventing Nsight Systems/Compute and probably do a worse job :slightly_smiling_face:.

Redundancy isn't the end of the world. The problem is that it would be a significant amount of work to implement the ability to specify kernel names/NVTX ranges within a benchmark, especially since we want to avoid skewing the actual runtime.

The only way I can think of doing that would be to hook into the NVTX/CUPTI callbacks to keep a log of events like nvtxRangePush/Pop, kernel launches, memcpys, and synchronizations. Then you'd need some sophisticated post-processing to correlate any asynchronous activity initiated during an NVTX range (like a kernel launch) and project the range's duration from when the last such activity completes.
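For reference, annotating the regions themselves is the easy part -- the NVTX v3 push/pop calls below are the real API, and a hypothetical future NVBench feature could key on range names like these. The hard part described above (correlating async work back to the enclosing range) is not shown; my_kernel1, my_kernel2, and num_blocks are stand-ins from the earlier snippets.

```cuda
#include <nvtx3/nvToolsExt.h>

// Stand-in application from earlier in the thread, annotated with NVTX
// ranges. Timing these ranges correctly still requires the callback
// logging and post-processing described above.
void run_my_app(cudaStream_t stream) {
  nvtxRangePushA("kernel1");
  my_kernel1<<<num_blocks, 256, 0, stream>>>();
  cudaStreamSynchronize(stream);
  nvtxRangePop();

  nvtxRangePushA("kernel2");
  my_kernel2<<<num_blocks, 256, 0, stream>>>();
  nvtxRangePop();
}
```

Note that "kernel2" here closes before the kernel finishes, since the launch is asynchronous -- exactly the correlation problem mentioned above.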

I know this because it is exactly what Nsight Systems does :slightly_smiling_face:. Hence why I keep alluding to reinventing the Nsight tools :)

We'll definitely keep thinking on if there are easier, lighter weight ways to give you the ability to get a more detailed breakdown of where time is spent in an arbitrary benchmark.

jrhemstad avatar Jul 26 '22 16:07 jrhemstad

I know this because it is exactly what Nsight Systems does 🙂. Hence why I keep alluding to reinventing the Nsight tools :)

We'll definitely keep thinking on if there are easier, lighter weight ways to give you the ability to get a more detailed breakdown of where time is spent in an arbitrary benchmark.

Makes sense! 🙂 In the meantime I'll play around with CUPTI's Auto Range profiling within nvbench to see if something lightweight can be engineered without too much post-processing. Looking forward to future features in this library as well; I truly appreciate the work you all have put into it, it is awesome!

neoblizz avatar Jul 26 '22 17:07 neoblizz

Huh, I actually wasn't aware of the CUPTI range profiling APIs (I don't think they existed last time I looked at CUPTI :wink:).

At first glance, this appears to do much of the heavy lifting for us.

This looks like a very promising potential path forward! Keep us updated here with what you find.

jrhemstad avatar Jul 26 '22 18:07 jrhemstad

This looks like a very promising potential path forward! Keep us updated here with what you find.

🎉

Will do that, and yeah, it is new! It seems to do all of the "match this kernel to these metrics" work for us, which is the challenging part. Then all that's left is modifying nvbench to output vectors of metrics instead of a single aggregated metric.

neoblizz avatar Jul 26 '22 18:07 neoblizz