Implausible cycle count when PMU muxing is active
It's not uncommon for me to get a reading like this:
julia> @pstats "cpu-cycles,instructions,branch-instructions,branch-misses,cache-misses,cache-references" rand(1000,1000)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 8.60e+04 62.9% # 0.0 cycles per ns
╶ instructions 7.89e+06 100.0% # 91.8 insns per cycle
╶ branch-instructions 1.29e+05 100.0% # 1.6% of insns
╶ branch-misses 9.09e+02 100.0% # 0.7% of branch insns
╶ cache-misses 3.44e+03 100.0% # 0.5% of cache refs
╶ cache-references 6.77e+05 37.5%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
roughly 10-30% of the time, compared to a more reasonable measurement:
julia> @pstats "cpu-cycles,instructions,branch-instructions,branch-misses,cache-misses,cache-references" rand(1000,1000)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 3.59e+07 82.6% # 1.6 cycles per ns
╶ instructions 4.35e+07 82.6% # 1.2 insns per cycle
╶ branch-instructions 7.04e+06 82.6% # 16.2% of insns
╶ branch-misses 1.93e+05 82.6% # 2.7% of branch insns
╶ cache-misses 5.19e+05 82.6% # 11.7% of cache refs
╶ cache-references 4.46e+06 87.2%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
~100 instructions per cycle seems... a little high
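For reference, that IPC figure appears to be just the ratio of the two reported counts, so the bogus number comes straight from the implausibly small cycle reading:
# ratio of the reported counts from the bad measurement above
7.89e6 / 8.60e4   # ≈ 91.7 "instructions per cycle", vs. a plausible ~1-4 for this workload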
The above measurements are with the diff from https://github.com/JuliaPerf/LinuxPerf.jl/issues/50#issuecomment-2554247942 applied, but this also happens using disable_all! / enable_all!.
Here's a strange measurement from current master:
julia> @pstats "cpu-cycles,instructions,branch-instructions,branch-misses,cache-misses,cache-references" rand(1000,1000)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 1.43e+04 59.4% # 0.0 cycles per ns
╶ instructions 1.26e+07 100.0% # 877.8 insns per cycle
╶ branch-instructions 1.07e+06 100.0% # 8.5% of insns
╶ branch-misses 1.19e+04 100.0% # 1.1% of branch insns
╶ cache-misses 1.09e+05 100.0% # 8.2% of cache refs
╶ cache-references 1.33e+06 40.9%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
I haven't been able to get perf stat to make the same mistake, so LinuxPerf may be doing something wrong here:
$ perf stat -e cycles,instructions,branches,branch-misses,cache-references,cache-misses echo
Performance counter stats for 'echo':
1,417,515 cycles (81.44%)
1,048,902 instructions # 0.74 insn per cycle
240,734 branches
31,285 branch-misses # 13.00% of all branches
87,314 cache-references
25,054 cache-misses # 28.694 % of all cache refs (18.56%)
0.001116123 seconds time elapsed
0.000000000 seconds user
0.001277000 seconds sys
Seems to be affected by the number of events we're creating for threading (I'm running on a 64-core / 128-hyperthread machine). This is the worst outlier with threads=false, which is still orders of magnitude better:
julia> @pstats "cpu-cycles,instructions,branch-instructions,branch-misses,cache-misses,cache-references" threads=false rand(1000,1000)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 4.27e+06 73.0% # 1.2 cycles per ns
╶ instructions 1.03e+07 100.0% # 2.4 insns per cycle
╶ branch-instructions 6.23e+05 100.0% # 6.0% of insns
╶ branch-misses 4.94e+03 100.0% # 0.8% of branch insns
╶ cache-misses 8.85e+04 100.0% # 7.6% of cache refs
╶ cache-references 1.17e+06 27.4%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Using OPENBLAS_NUM_THREADS=1 also dramatically reduces the noise, since most of the threads we're registering perf counters for are OpenBLAS's.
Weirdly enough, the situation also improves dramatically if I use -t auto (with threads=true).
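For context on the fd count: with threads=true we register counters per OS thread of the process, so a quick way to see roughly how many per-event fds are in play is to count the entries in /proc/self/task. A tiny sketch, nothing LinuxPerf-specific:
# Each entry in /proc/self/task is an OS thread of this process
# (Julia threads, OpenBLAS threads, GC threads, ...); with threads=true
# we end up with roughly this many perf fds per event.
length(readdir("/proc/self/task"))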
If I group the cpu-cycles together with the instructions then both estimates seem to have these very low readings:
julia> @pstats "(cpu-cycles,instructions),branch-instructions,branch-misses,cache-misses,cache-references" rand(1000,1000)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles 8.71e+04 40.7% # 0.0 cycles per ns
└ instructions 2.60e+04 40.7% # 0.3 insns per cycle
╶ branch-instructions 1.01e+06 100.0% # 3900.4% of insns
╶ branch-misses 1.33e+04 100.0% # 1.3% of branch insns
╶ cache-misses 1.04e+05 100.0% # 11.7% of cache refs
╶ cache-references 8.91e+05 59.6%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Somehow these counters were active ~40% of the time, but the reading was ~100x lower than expected.
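Back-of-the-envelope, using the reasonable measurement further up as a baseline and accounting for the group only being scheduled ~40.7% of the time, the shortfall is indeed in the ~100x ballpark:
good_cycles = 3.59e7                     # cpu-cycles from the sane measurement above
sched_frac  = 0.407                      # fraction of time this group was on the PMU
expected    = good_cycles * sched_frac   # ≈ 1.5e7 counts we'd expect to see
observed    = 8.71e4
expected / observed                      # ≈ 170x low, i.e. roughly two orders of magnitude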
All the issues you've had have been on a single machine, right?
Yeah, that's right. I'm going to try to take some measurements tomorrow on other HW.
I suspected this might be processor C-states, so I disabled the lower-power states by writing 0 to /dev/cpu_dma_latency and confirmed via /sys/devices/system/cpu/cpu0/cpuidle/state*/usage that the processor was locked to C-0. Unfortunately this doesn't seem to have much effect on the problem.
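For reference, the /dev/cpu_dma_latency PM QoS interface holds the latency request only while the file stays open. A minimal sketch of that mechanism (requires root; the helper name is made up and not part of LinuxPerf):
# Minimal sketch: request 0 µs wakeup latency via the PM QoS interface,
# which keeps the CPU out of deeper C-states while the fd stays open.
function with_cstates_locked(f)              # hypothetical helper
    io = open("/dev/cpu_dma_latency", "w")   # requires root
    try
        write(io, Int32(0))                  # 32-bit latency value, in µs
        flush(io)
        f()
    finally
        close(io)                            # closing releases the constraint
    end
end

with_cstates_locked() do
    # run the @pstats measurements here
end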
In fact, on the machine I'm testing on, we can even get 0 readings despite a non-zero running time:
julia> @pstats "cpu-cycles,instructions,branch-instructions,branch-misses,cache-misses,cache-references,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-loads" rand(1000,1000)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 0.00e+00 31.0% # 0.0 cycles per ns
╶ instructions 2.38e+06 100.0% # Inf insns per cycle
╶ branch-instructions 1.25e+05 100.0% # 5.3% of insns
╶ branch-misses 4.50e+01 100.0% # 0.0% of branch insns
╶ cache-misses 1.21e+05 100.0% # 96.1% of cache refs
╶ cache-references 1.25e+05 100.0%
╶ L1-dcache-loads 3.01e+02 100.0%
╶ L1-dcache-load-misses 3.55e+02 100.0% # 117.9% of dcache loads
╶ L1-dcache-stores 1.25e+05 100.0%
╶ L1-dcache-loads 4.28e+02 70.3%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
There are also still measurements that are ≥1 order of magnitude smaller than they should be:
julia> @pstats "cpu-cycles,instructions,branch-instructions,branch-misses,cache-misses,cache-references,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-loads" rand(1000,1000)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 3.27e+05 48.1% # 0.2 cycles per ns
...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Compared to a normal measurement:
julia> @pstats "cpu-cycles,instructions,branch-instructions,branch-misses,cache-misses,cache-references,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-loads" rand(1000,1000)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 3.19e+06 62.1% # 0.9 cycles per ns
...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
This is an Intel Xeon system, vs. the Zen3 hardware tested above.
OK, it looks like there may be two sources of noise here...
Above I had C-/P-states disabled (processor locked to C-0) and used enable_all! / disable_all!.
On the other hand, if I enable normal C-/P-states and use enable! / disable! (instead of disable_all!):
julia> @pstats "cpu-cycles,instructions,branch-instructions,branch-misses,cache-misses,cache-references,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-loads" rand(1000,1000)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 3.73e+04 47.2% # 0.0 cycles per ns
...
So we get bad readings for both:
- enable! / disable! with lower-energy C-/P-states enabled
- enable_all! / disable_all! with the processor locked to P-0/C-0
The effect of having C-/P-states enabled on cpu-cycles is generally much bigger (e.g. the measurement above), but using disable_all! seems to cause bad / zero readings (less often, but pretty reliably).
Both of these only seem to happen with threads=true (the default). I've tried to create a C-reproducer that shows the same problem, but so far I can't get it to show a problem nearly as severe.
OK I think I have a root cause.
As it turns out, our kernel overhead for enabling / disabling perf events is much bigger than I thought. Across the several hundred perf fds that we open (primarily for the ~32 OpenBLAS threads on my machine), it takes 1M instructions / 10M cycles / 7 milliseconds to enable and disable the perf events once.
If threads=false, that overhead drops to 20k instructions / ~0.2 milliseconds, so it seems to scale roughly linearly with the number of fds we have open, and the overhead is pretty much the same regardless of whether we use disable! or disable_all!.
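For anyone who wants to sanity-check this on their own machine, a crude wall-clock comparison gives a rough upper bound on the per-measurement overhead (it also includes the perf_event_open setup / teardown, not just enable / disable):
using LinuxPerf

work() = rand(1000, 1000)
work()                      # warm up the workload itself

t_bare  = @elapsed work()
t_true  = @elapsed @pstats "cpu-cycles,instructions" threads=true  work()
t_false = @elapsed @pstats "cpu-cycles,instructions" threads=false work()
# (repeat the lines above once and use the second set of numbers, so the
#  first-call compilation of @pstats isn't counted as overhead)

((t_true - t_bare) * 1e3, (t_false - t_bare) * 1e3)   # overhead in ms, per mode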
So the time we shaved off in https://github.com/JuliaPerf/LinuxPerf.jl/pull/38 is all on the user side; the in-kernel overhead strongly dominates. That would generally be OK, because perf is supposed to ignore time spent in the kernel (if you set exclude_kernel, as we do). However, it turns out this is not true for time_running / time_enabled (see this patch to the manpages): those fields always include time spent in the kernel. So the basic problem is that we are dominated by time spent in the kernel, which skews our running-time estimates and makes our scaled counts much less reliable when exclude_kernel is being used.
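To spell out the arithmetic with made-up numbers (the exact scaling LinuxPerf applies may differ in detail): the usual multiplexing correction scales a raw count by time_enabled / time_running, and both of those clocks keep ticking while we sit in the enable / disable ioctls, even though exclude_kernel means nothing is being counted there.
# Hypothetical figures for a ~3 ms workload wrapped in ~7 ms of in-kernel
# enable/disable overhead (times in ns):
time_enabled = 10_000_000      # the event existed for the whole ~10 ms window
time_running =  6_290_000      # it was scheduled on the PMU 62.9% of that window
raw_count    = 86_000          # user-only cycles actually counted

scaled = raw_count * time_enabled / time_running   # ≈ 1.4e5, still ~100x too low

# Rescaling can't recover the workload's cycles: most of time_running was
# kernel time where exclude_kernel suppressed counting, so raw_count only
# reflects whatever sliver of user time the event happened to be scheduled for.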
So that's the explanation. I'm not sure what we should do to adapt / work around it yet... any thoughts @vchuravy or @Zentrik?
Do other tools like perf deal with this in any way?
> Do other tools like perf deal with this in any way?
Not to my knowledge - when used externally, you need a relatively short run (~a few milliseconds) with significant kernel overhead to notice this, so it'd be pretty rare.
It's unfortunately quite common for in-process monitoring, since enabling / disabling the events guarantees a significant amount of kernel overhead and you're more likely to be measuring a short, ~few-ms event.
Perhaps we should run with threads=false by default?