Per-pipeline-invocation profiling
This is a new sampling profiler as discussed in #5796. When it samples, it samples all simultaneously running Halide pipelines, and all simultaneously running instances of the same pipeline, tracking their stats separately (they have separate sampling tokens, instead of fighting over a single global one). At pipeline exit these stats are accumulated into global pipeline stats and static per-pipeline stats.
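To make the mechanism concrete, here is a minimal sketch of the idea that each running instance keeps its own counters, which get folded into global per-pipeline stats when the instance exits, with bounds queries skipped. All names (`PipelineStats`, `InstanceState`, `instance_exit`) are hypothetical and this is not the actual runtime code:

```cpp
// Sketch only: per-instance counters accumulated into global per-pipeline
// stats at instance exit. Names are hypothetical, not the Halide runtime's.
#include <atomic>
#include <cstdint>
#include <map>
#include <mutex>
#include <string>

struct PipelineStats {
    uint64_t samples = 0;        // total samples attributed to this pipeline
    uint64_t runs = 0;           // completed (non-bounds-query) invocations
    uint64_t wall_time_us = 0;   // total wall-clock time across invocations
};

struct InstanceState {
    std::string pipeline_name;
    std::atomic<uint64_t> samples{0};  // the sampler bumps this while the instance runs
    uint64_t start_us = 0;
};

static std::mutex global_mutex;
static std::map<std::string, PipelineStats> global_stats;

// Called when a pipeline instance finishes; folds its private counters
// into the global per-pipeline entry under a lock.
void instance_exit(InstanceState &inst, uint64_t end_us, bool was_bounds_query) {
    if (was_bounds_query) return;  // bounds queries are ignored entirely
    std::lock_guard<std::mutex> lock(global_mutex);
    PipelineStats &p = global_stats[inst.pipeline_name];
    p.samples += inst.samples.load();
    p.runs += 1;
    p.wall_time_us += end_us - inst.start_us;
}
```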
The big differences from the current sampling profiler are that:

- Pipeline invocations are measured as total wall-clock time, rather than wall-clock time minus time spent in any other Halide pipeline running at the same time. The old behavior didn't really work when pipelines could use different numbers of threads, or when simultaneously running pipelines set the sampling token at different rates.
- In the per-pipeline results, time the thread pool spends busy doing work on some other Halide pipeline is tracked as its own entry (see the sketch after this list). This conveniently also measures time spent in the ragged end of a too-coarse parallel for loop, or time spent trying to grab thread pool locks in a too-fine parallel for loop, so it's also useful for standalone micro-benchmarking.
- Pipeline invocations that are just bounds queries get ignored. Previously these could misleadingly drag the average runtime way down.
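As a rough illustration of the second point above, here is a hedged sketch of a sampling loop that charges each sample either to the func an instance is currently in, or to a reserved "thread pool busy on some other pipeline" bucket. The names (`kOtherPipelineToken`, `Instance`, `sampling_loop`) are hypothetical, not the runtime's actual identifiers:

```cpp
// Sketch only: one sampling thread attributes samples per running instance,
// with a reserved bucket for "thread pool busy on some other pipeline"
// (which also catches ragged ends of parallel loops and lock contention).
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

// Reserved token meaning "this instance is waiting while the shared thread
// pool works on a different pipeline".
constexpr int kOtherPipelineToken = -1;

struct Instance {
    std::atomic<int> current_func{0};        // which func this instance is in right now
    std::vector<uint64_t> samples_per_func;  // one slot per func
    uint64_t other_pipeline_samples = 0;     // samples charged to the reserved bucket
};

void sampling_loop(std::vector<Instance *> &live_instances,
                   std::atomic<bool> &shutdown,
                   std::chrono::microseconds period) {
    while (!shutdown.load()) {
        for (Instance *inst : live_instances) {
            int tok = inst->current_func.load();
            if (tok == kOtherPipelineToken) {
                inst->other_pipeline_samples++;
            } else if (tok >= 0 && (size_t)tok < inst->samples_per_func.size()) {
                inst->samples_per_func[tok]++;
            }
        }
        std::this_thread::sleep_for(period);
    }
}
```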
I'm going to need some help testing (and probably fixing) the hvx changes. At the very least the remote runtime needs to be rebuilt.
Fixes #5796
May I suggest adding the ability to do microsecond-level sleeps to set the sampling rate, with a user-facing API to set it? I'd maybe like to take 5 or 10 samples per millisecond, instead of just 1. Additionally, now that we measure per-pipeline and per-instance, an accurate rdtsc-based wall-clock measurement would be nice (if available on the system). Thoughts on this?
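Something along these lines is what I mean by an rdtsc-based reading, just as a sketch (calibrating `__rdtsc` against the steady clock once at startup), not a concrete API proposal:

```cpp
// Sketch only: rdtsc-based timing on x86, calibrated against the steady clock.
#include <chrono>
#include <cstdint>
#include <thread>
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

// Estimate the TSC frequency by measuring ticks over a short steady-clock window.
inline double ticks_per_second() {
    using namespace std::chrono;
    uint64_t t0 = __rdtsc();
    auto c0 = steady_clock::now();
    std::this_thread::sleep_for(milliseconds(50));
    uint64_t t1 = __rdtsc();
    auto c1 = steady_clock::now();
    double secs = duration<double>(c1 - c0).count();
    return (t1 - t0) / secs;
}

// Convert a TSC tick delta to seconds using the calibrated frequency.
inline double ticks_to_seconds(uint64_t ticks, double tps) {
    return ticks / tps;
}
```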
I changed things to microseconds (except on Windows, which now just does a yield if the requested sleep time is under a millisecond). Regarding rdtsc, I think that would be a separate change that provides a different implementation for one of the *_clock.cpp runtime modules to use on x86.
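For clarity, the sleep behavior is roughly this shape. A minimal sketch with a hypothetical name (`profiler_sleep`), not the literal runtime code, which lives in the runtime's clock/thread modules:

```cpp
// Sketch only: microsecond sleeps where the platform supports them,
// falling back to a plain yield on Windows for sub-millisecond requests.
#include <chrono>
#include <thread>

void profiler_sleep(std::chrono::microseconds period) {
#ifdef _WIN32
    if (period < std::chrono::milliseconds(1)) {
        // Windows timers are not reliably finer than a millisecond, so just yield.
        std::this_thread::yield();
    } else {
        std::this_thread::sleep_for(period);
    }
#else
    std::this_thread::sleep_for(period);
#endif
}
```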
@abadams I'm testing the PR for hvx. Will update you soon.
@abadams @mcourteaux I tested the PR on simulator (host-hvx) and device (arm-hvx). Profiling worked with a few minor changes on the simulator but I see a crash on device. I'm trying to debug the device crash. I see the crash on device with the main branch too, so the crash might not be related to this PR. I'll be OOO starting tomorrow till April 7. @prasmish will work on resolving the issue in the meantime.
Ready to land?
Waiting on the hvx issue.
@abadams These are the changes needed to get this working on hexagon: https://github.com/halide/Halide/pull/8187. There is still a failure on device, but the error is not related to this PR; we can reproduce it on the main branch too. I don't wanna hold up this PR because of the crash.
Can be merged. The HVX issue is unrelated to this, it seems.
It looks like there's other stuff in #8187 which is needed though. Only the missing sampling token issue was broken on main.
Ready to land (pending green)?