[zstd][cli] Add performance counters support to bench mode
**NOT FOR LANDING**
Add an extra parameter (-y) to benchmark mode that collects processor performance counters, making it possible to report performance stats per operation (i.e. compression vs. decompression).
We can collect the following performance counters using the Linux perf API: CPU cycles, instructions, branch misses, cache hits and cache misses.
One advantage of leveraging the Linux perf API is that it should work on any processor that runs Linux, so it should work on x86-64 (Intel and AMD), Arm (arm32/aarch64) and RISC-V.
The counters make it possible to derive new stats such as cycles/byte, a measure that helps compare different CPU microarchitectures and has the benefit of being independent of clock speed.
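As a rough illustration of the cycles/byte metric, here is a minimal sketch. The helper name `cycles_per_byte` is hypothetical and not part of this patch; it only shows the arithmetic once a cycle count and an input size are known.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the patch): cycles spent per input byte.
 * Returns 0.0 when no bytes were processed, to avoid dividing by zero. */
static double cycles_per_byte(uint64_t cycles, size_t bytes)
{
    return bytes ? (double)cycles / (double)bytes : 0.0;
}
```

For example, 30,000,000 cycles spent on 10,000,000 input bytes gives 3.0 cycles/byte. Two machines with the same core design but different clock speeds should report roughly the same value, even though their MB/s figures differ.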
In addition, cycles spent on I/O (e.g. reading files from disk), which a regular 'perf stat' run would include, are not counted, since counters are only captured during the main benchmark loop.
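To make the "count only around the benchmark loop" idea concrete, here is a minimal sketch built on `perf_event_open(2)`. It is not the code in this patch: `perf_counter_open`, `perf_count_region` and `example_workload` are hypothetical names, and the sketch assumes a Linux system with kernel headers installed. Opening the counter can fail (for example with a restrictive `perf_event_paranoid` setting, or inside a VM or container without PMU access), so the helpers report that instead of aborting.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open one hardware counter for the calling thread,
 * initially disabled. Returns a file descriptor, or -1 on failure. */
static int perf_counter_open(uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;      /* e.g. PERF_COUNT_HW_CPU_CYCLES */
    attr.disabled = 1;         /* start stopped; enabled around the region */
    attr.exclude_kernel = 1;   /* count user-space work only */
    attr.exclude_hv = 1;
    /* pid = 0, cpu = -1: this thread on any CPU; no group leader, no flags. */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
}

/* Hypothetical helper: count `config` events only while fn(arg) runs.
 * Returns 0 and stores the count in *count, or -1 if the counter is
 * unavailable (in which case fn is not called). */
static int perf_count_region(uint64_t config, void (*fn)(void *), void *arg,
                             uint64_t *count)
{
    int fd = perf_counter_open(config);
    if (fd < 0)
        return -1;
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn(arg);                                  /* the measured region */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    ssize_t n = read(fd, count, sizeof(*count));
    close(fd);
    return (n == (ssize_t)sizeof(*count)) ? 0 : -1;
}

/* Stand-in for the benchmark loop: sums the integers 0..999999. */
static void example_workload(void *arg)
{
    volatile uint64_t *sum = arg;
    for (uint64_t i = 0; i < 1000000; i++)
        *sum += i;
}
```

A caller would do any file loading first and then invoke `perf_count_region(PERF_COUNT_HW_CPU_CYCLES, example_workload, &sum, &cycles)`, so that only the work inside the callback contributes to the count. Other events such as `PERF_COUNT_HW_INSTRUCTIONS` or `PERF_COUNT_HW_BRANCH_MISSES` can be passed the same way.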
This patch is still in its early stages: the idea is to gather feedback and address its current shortcomings, progressing towards a contribution that can land in zstd.
Running with the help flag should print this:
adenilson@aquario:~/compression/my-fork-zstd$ ./programs/zstd --help
*** Zstandard CLI (64-bit) v1.5.8, by Yann Collet ***
Compress or decompress the INPUT file(s); reads from STDIN if INPUT is - or not provided.
Usage: zstd [OPTIONS...] [INPUT... | -] [-o OUTPUT]
...
Benchmark options:
-b# Perform benchmarking with compression level #. [Default: 3]
-e# Test all compression levels up to #; starting level is -b#. [Default: 1]
-i# Set the minimum evaluation to time # seconds. [Default: 3]
-y# Collect CPU counters.
Two examples when the flag is enabled:
a) Synthetic:
adenilson@aquario:~/compression/my-fork-zstd$ ./programs/zstd -b1y
Perf cycles: 326893971910 -> 3239077 (x3.087), 487.4 MB/s, 2636.7 MB/s
1#
b) With file input:
adenilson@aquario:~/compression/my-fork-zstd$ ./programs/zstd -b1y ~/corpus/linux-5.6-rc3.tar
Perf cycles: 427627890230 -> 190860020 (x5.017), 851.1 MB/s, 2906.7 MB/s
1#
The basic idea is to give benchmark mode a way to report CPU stats more precisely per operation (e.g. compression vs. decompression), to remove cycles spent on I/O from the equation, and to allow extra stats to be calculated (e.g. cycles/byte).
If this is a feature that could be helpful to zstd, I can further develop the patch to get it into a "land-able" state.
This is just an early draft with the basic idea, a PoC (Proof of Concept).
I considered using the RDPMC instruction, but its behavior is different between x86-64 implementations (i.e. Intel vs AMD), plus it would be x86-64 only.
On the other hand, RDPMC might make it possible to collect some extra counters that are not available through the Linux perf API.
@Cyan4973 thoughts?
I believe this is a good topic.
Benchmark mode is indeed useful to measure performance differences,
and adding counters to this stage is contributing to this objective.
I would just note that current -b already removes I/O operations, so it's purely a buffer-to-buffer operation.
There are also many kinds of counters that could be collected, so I guess the implementation still has a lot of choices to make.
Given it's an advanced feature, not enabled by default, I'm fine with non-portable counters that only exist on some platforms but not others.