[zstd][cli] Add performance counters support to bench mode

Open Adenilson opened this issue 10 months ago • 6 comments

**NOT FOR LANDING**

This adds an extra parameter (-y) to benchmark mode that collects processor performance counters, which makes it possible to report performance stats per operation (i.e. compression vs. decompression).

We can collect the following performance counters using the Linux perf API: CPU cycles, instructions, branch misses, cache hits and cache misses.

One advantage of leveraging the Linux perf API is that it should work on any processor that runs Linux, so it should work fine on x86-64 (Intel and AMD), Arm (arm32/aarch64) and RISC-V.
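As a rough illustration of what collecting these counters through the Linux perf API can look like, here is a minimal sketch built on the `perf_event_open` system call. The helper name `open_hw_counter` and the table `k_hw_events` are hypothetical and are not taken from the patch. The sketch also assumes that "cache hits" would be approximated with the generic cache-references event, since the generic hardware event set has no direct hit counter.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Generic hardware events roughly matching the counters listed above.
 * Assumption: cache references stand in for "cache hits". */
static const uint64_t k_hw_events[] = {
    PERF_COUNT_HW_CPU_CYCLES,
    PERF_COUNT_HW_INSTRUCTIONS,
    PERF_COUNT_HW_BRANCH_MISSES,
    PERF_COUNT_HW_CACHE_REFERENCES,
    PERF_COUNT_HW_CACHE_MISSES,
};

/* Hypothetical helper: open one counter for the calling thread, on any CPU.
 * The counter starts disabled and ignores kernel and hypervisor activity.
 * Returns a file descriptor, or -1 with errno set (for example when perf
 * access is restricted or the machine exposes no hardware PMU). */
static int open_hw_counter(uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    /* pid = 0: this thread, cpu = -1: any CPU, no group, no flags. */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}
```

Because the same generic event identifiers are used on every architecture, a sketch like this does not need per-CPU code paths, but a real implementation would still have to handle the case where the call fails.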

The counters make it possible to derive new stats such as cycles/byte, a measure that is useful for comparing CPU microarchitectures because it is independent of clock speed.
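The cycles/byte figure is just the cycle count divided by the number of bytes processed. A small sketch follows, with hypothetical helper names (`cycles_per_byte`, `implied_mb_per_s`) that are not part of the patch. The second helper shows why the metric is clock-independent: throughput only appears once a clock frequency is multiplied back in.

```c
#include <stdint.h>

/* Cycles spent per input byte; returns 0.0 for an empty input. */
static double cycles_per_byte(uint64_t cycles, uint64_t bytes)
{
    return bytes ? (double)cycles / (double)bytes : 0.0;
}

/* Throughput in MB/s (1 MB = 1e6 bytes) implied by a cycles/byte figure
 * at a given clock frequency in Hz. */
static double implied_mb_per_s(double cpb, double clock_hz)
{
    return cpb > 0.0 ? clock_hz / cpb / 1e6 : 0.0;
}
```

For example, with illustrative numbers only, 3000 cycles over 1000 bytes gives 3 cycles/byte, which at a 3 GHz clock corresponds to 1000 MB/s. If the benchmark loop runs several iterations, the byte count presumably needs to be the input size multiplied by the number of iterations.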

In addition, cycles spent on I/O (e.g. reading files from disk), which a regular 'perf stat' run would include, are not counted, since we only capture counters during the main benchmark loop.
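Scoping the measurement to the benchmark loop could be done by resetting and enabling the counter right before the loop and disabling it right after. The following is a sketch under that assumption, not the code from the patch. `count_cycles` and `spin` are hypothetical names, and `spin` is only a stand-in for the real benchmark loop.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: count user-space CPU cycles spent inside work(arg).
 * Returns 0 and stores the count in *cycles on success. Returns -1 if the
 * counter cannot be opened (work is then not run) or cannot be read. */
static int count_cycles(void (*work)(void *), void *arg, uint64_t *cycles)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;       /* start stopped; enabled explicitly below */
    attr.exclude_kernel = 1; /* user-space cycles only */
    attr.exclude_hv = 1;

    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) return -1;

    /* Anything done before this point (file reads, setup) is not counted. */
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work(arg);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    ssize_t n = read(fd, cycles, sizeof(*cycles));
    close(fd);
    return n == (ssize_t)sizeof(*cycles) ? 0 : -1;
}

/* Stand-in workload for the benchmark loop: sums 0..999999 into *arg. */
static void spin(void *arg)
{
    volatile uint64_t *acc = arg;
    for (uint64_t i = 0; i < 1000000; i++) *acc += i;
}
```

Calling `count_cycles(spin, &acc, &cycles)` would then report only the cycles consumed by the workload itself, which is the property described above.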

This patch is still in its early stages. The idea is to gather feedback and address its current shortcomings, working towards a contribution that can be landed in zstd.

Adenilson avatar Mar 29 '25 23:03 Adenilson

Running with the help flag should print this:

adenilson@aquario:~/compression/my-fork-zstd$ ./programs/zstd --help
*** Zstandard CLI (64-bit) v1.5.8, by Yann Collet ***

Compress or decompress the INPUT file(s); reads from STDIN if INPUT is - or not provided.

Usage: zstd [OPTIONS...] [INPUT... | -] [-o OUTPUT]
...
Benchmark options:
  -b#  Perform benchmarking with compression level #. [Default: 3]
  -e#  Test all compression levels up to #; starting level is -b#. [Default: 1]
  -i#  Set the minimum evaluation to time # seconds. [Default: 3]
  -y#  Collect CPU counters.

Adenilson avatar Mar 29 '25 23:03 Adenilson

Two examples when the flag is enabled:

a) Synthetic:

adenilson@aquario:~/compression/my-fork-zstd$ ./programs/zstd -b1y

Perf cycles: 326893971910 -> 3239077 (x3.087), 487.4 MB/s, 2636.7 MB/s

1#

b) With file input:

adenilson@aquario:~/compression/my-fork-zstd$ ./programs/zstd -b1y ~/corpus/linux-5.6-rc3.tar

Perf cycles: 427627890230 -> 190860020 (x5.017), 851.1 MB/s, 2906.7 MB/s

1#

Adenilson avatar Mar 29 '25 23:03 Adenilson

The basic idea is to give benchmark mode a way to report CPU stats per operation more precisely (e.g. compression vs. decompression), to take cycles spent on I/O out of the equation, and to allow some extra stats to be calculated (e.g. cycles/byte).

Adenilson avatar Mar 29 '25 23:03 Adenilson

If this is a feature that could be helpful to zstd, I can develop the patch further to get it into a "land-able" state.

This is just an early draft with the basic idea, a PoC (Proof of Concept).

Adenilson avatar Mar 29 '25 23:03 Adenilson

I considered using the RDPMC instruction, but its behavior differs between x86-64 implementations (i.e. Intel vs. AMD), and it would be x86-64 only.

On the other hand, it may be possible to collect some extra counters not available using the Linux perf API.

@Cyan4973 thoughts?

Adenilson avatar Mar 29 '25 23:03 Adenilson

I believe this is a good topic. Benchmark mode is indeed useful to measure performance differences, and adding counters to this stage contributes to this objective. I would just note that the current -b already removes I/O operations, so it's purely a buffer-to-buffer operation. There are also many kinds of counters that could be collected, so I guess the implementation still has a lot of choices to make. Given it's an advanced feature, not enabled by default, I'm fine with non-portable counters that only exist on some platforms but not others.

Cyan4973 avatar Mar 30 '25 01:03 Cyan4973