
small regression in compressed size of tiny chunks from commit 6004c11

Open marmoute opened this issue 6 months ago • 14 comments

Describe the bug

Running hg clone on some repositories takes about +150% runtime after this specific change (e.g. from 3 minutes to 7 minutes).

The runtime increase doesn't affect the zstd compression stage itself as much as all the other data-processing operations around it, such as sha1 computation, zstd decompression and binary-diff computation. So we suspect that the new code badly impacts the memory caches or some other hardware subsystem.

To Reproduce

Full reproduction was a bit hairy as the source of a couple of projects had to be adapted while bisecting this. I'll try to build two simple changesets in Mercurial to reproduce this. However, I want to report this quickly, as zstd's core developers might have useful quick insight about this issue.

To give a bit more context: Mercurial uses zstd from two sources: a vendored python-zstandard version, and the zstd-rs Rust crate. The vendored version was silently very outdated (zstd 1.4.4) until a few months ago, when we upgraded it to python-zstandard 0.23.0 after some security concerns pointed out the issue. The zstd-rs version has always used a newer version of zstd and has always been unexpectedly slower for some specific clone workloads. These clones mostly involve compression of a large number of very small data chunks (deltas from one version to another).

The non-Rust version became much slower on these same workloads, matching the Rust performance. Profiling pointed to a huge inflation of the time spent doing sha1, bdiff and zstd decompression, and performance analysis points at much higher traffic with the L2 cache and RAM.

Bisecting this slowness pointed very clearly to our upgrade of python-zstandard. Bisecting the python-zstandard code itself pointed to the zstd 1.4.5 → 1.4.8 upgrade. Bisecting zstd itself pointed very clearly to 6004c11.

As said before, I can work to provide you with a simple way to run these operations with the two different versions of the zstd code on the specific workload, but I wanted to file this report first.

Expected behavior

We would be happy not to have Mercurial become more than twice as slow on some workloads :-)

And it seems suspicious to see such a performance degradation of side operations from a commit that is expected to improve performance.

Screenshots and charts

We can also provide heaptrack and perf data for before and after, but they total > 100MB.

Desktop (please complete the following information):

  • OS: Linux (various kernels)
  • Version: from 6004c11 onward
  • Compiler: both gcc and clang, since the Rust version seems affected too
  • Flags: nothing special (unless the library adds some)
  • Other relevant hardware specs: reproduced on both Intel and AMD machines, both large desktops and laptops
  • Build system: see library

Additional context

Let me know if you need any other details. I'll try to prepare an easy reproduction for you to run by the end of the week.

marmoute avatar Jul 23 '25 16:07 marmoute

Something doesn't add up.

The flame graph shows a large increase in cpu time for bdiff_splitlines and sha1_compression_states, which are both outside the scope of the libzstd library.

Now I understand that a library can over-use shared resources, leading to other parts of the software having a harder time, but even then, it's hard to justify a +150% runtime increase solely based on heavier usage of the cache; this would be a quite extreme impact. Furthermore, the changes readable in this commit seem way too limited to have any impact on cache usage.

When updating the version of the Python binding, there are plenty of things that get changed too, on top of the upstream C library. So it can be a bit messy to find one cause. For example, one of my prime suspects in such scenarios is the boundary between C and Python, and how many times it's being crossed, since there is a sort of "penalty cost" for each traversal. Subtle differences on the binding side can lead to substantial differences in the number of border crossings for the same scenario.

Yet you seem certain that this commit, precisely, is to blame for the +150% runtime increase. Did you make some kind of A/B comparison test, with a version just before the targeted commit and then the targeted commit, everything else remaining equivalent?

cc @terrelln

Cyan4973 avatar Jul 23 '25 20:07 Cyan4973

I know this is surprising, but I am certain that it is this commit, precisely, that results in the performance regression. I manually built installations to strip everything down to just this change, everything else being equivalent, and this reliably reproduces the slowdown. I'll try to provide you with two easy "ready to use" setups so you can confirm this on your side (and look into more details).

We also tried various other, more obvious leads before narrowing it to this change, and none of them made a difference. So it seems like something in this changeset triggers some pathological case in some situations.

marmoute avatar Jul 23 '25 21:07 marmoute

Which compression settings are you using (if you know them) ? Which compression level ?

Cyan4973 avatar Jul 23 '25 22:07 Cyan4973

Which compression settings are you using (if you know them) ? Which compression level ?

I don't think we alter the default python-zstandard settings, but I might be mistaken.
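
For reference, here is a minimal sketch of what a single one-shot compression call looks like at the libzstd level under that assumption (python-zstandard's default level 3 matches ZSTD_CLEVEL_DEFAULT; the exact parameters the binding ends up setting are not confirmed here):

/* Minimal sketch, assuming the binding compresses each chunk in one shot
 * at the default level; the real call pattern inside python-zstandard is
 * not verified here. */
#include <stdio.h>
#include <zstd.h>

int main(void) {
    const char chunk[] = "a small delta, a few hundred bytes in practice";
    char dst[1024];

    size_t const csize = ZSTD_compress(dst, sizeof(dst), chunk, sizeof(chunk),
                                       ZSTD_CLEVEL_DEFAULT);
    if (ZSTD_isError(csize)) {
        fprintf(stderr, "compression error: %s\n", ZSTD_getErrorName(csize));
        return 1;
    }
    printf("input %zu bytes -> compressed %zu bytes\n", sizeof(chunk), csize);
    return 0;
}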

marmoute avatar Jul 24 '25 09:07 marmoute

I built two extra Mercurial commits: one with the zstd commit right before the change, and one on top of it with just 6004c11 added.

I also uploaded the repository we use to reproduce the issue. I moved the operation from a clone to an unbundle to further narrow the operation to the pathological part.

You should be able to fetch, build and reproduce the issue with the following command sequence.

(
    mkdir /tmp/zstd-4446;
    cd /tmp/zstd-4446 ;
    python3 -m venv /tmp/zstd-4446/util-hg ;
    /tmp/zstd-4446/util-hg/bin/pip install mercurial ;
    export HGRCPATH= ;
    export HGPLAIN=1 ;
    unset HGWITHRUSTEXT ;
    /tmp/zstd-4446/util-hg/bin/hg clone --rev f24be20621a6 https://foss.heptapod.net/mercurial/mercurial-devel mercurial-dev ;
    cd mercurial-dev ;
    echo '### installing a version right before the change ###' ;
    python3 -m venv /tmp/zstd-4446/fast ;
    /tmp/zstd-4446/util-hg/bin/hg update f24be20621a6~1 ;
    /tmp/zstd-4446/util-hg/bin/hg log -r . ;

    # make sure there is nothing laying around
    /tmp/zstd-4446/util-hg/bin/hg purge --all --no-confirm ;
    make local ; # older packaging quirk
    /tmp/zstd-4446/fast/bin/pip install . ;

    ### install the "slow" versions
    echo '### installing a version with the change ###' ;
    python3 -m venv /tmp/zstd-4446/slow ;
    /tmp/zstd-4446/util-hg/bin/hg update f24be20621a6 ;
    /tmp/zstd-4446/util-hg/bin/hg log -r . ;
    # make sure there is nothing laying around
    /tmp/zstd-4446/util-hg/bin/hg purge --all --no-confirm ;
    make local ; # older packaging quirk
    /tmp/zstd-4446/slow/bin/pip install . ;
    cd .. ;

    # get a repository reproducing the issue ;
    echo '### getting a bundle for netbeans ####' ;
    curl -O https://www.mercurial-scm.org/temp-static-folder/zstd-4446/netbeans-none.hg ;

    echo '### unbundling pre change ####' ;
    /tmp/zstd-4446/fast/bin/hg init repo-fast ;
    time /tmp/zstd-4446/fast/bin/hg -R repo-fast unbundle netbeans-none.hg ;
    rm -rf repo-fast ;
    echo '### unbundling post change ###' ;
    /tmp/zstd-4446/slow/bin/hg init repo-slow ;
    time /tmp/zstd-4446/slow/bin/hg -R repo-slow unbundle netbeans-none.hg ;
    rm -rf repo-slow ;
)

I did a quick test on two different machines to validate it worked fine, and they both showed a regression.

One with an i9-9900K:

real	2m17.798s → 6m5.155s
user	2m9.937s → 5m56.629s
sys	0m7.775s → 0m8.397s

One with an i7-7700K:

real	3m0.348s → 7m24.184s
user	2m35.540s → 6m59.535s
sys	0m17.822s → 0m18.671s

marmoute avatar Jul 24 '25 10:07 marmoute

The proposed script works as intended on a standard linux configuration.

However, I'm unsure how to proceed from there.

My current guess is that this workload is dominated by small to very small inputs, in the kilobyte range. But on top of that, I would need to understand the call pattern, which is not trivial because there are additional layers between the application and the library.

Ideally, I would like a short test, that can be iterated upon quickly and locally. Unfortunately, the proposed test takes several minutes to run, and I don't yet know how to iterate with it, i.e. change the source code, compile, measure.

A simpler scenario would be a test employing libzstd directly, but there's a risk that the issue is so specific that it's not reproducible with a test focused solely on libzstd.
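
If it helps as a starting point, here is a hypothetical micro-benchmark sketch along those lines: it compresses many small chunks in a loop so it can be built once against a8006264 and once against 6004c111 and compared. The synthetic input below is an assumption, not Mercurial's real delta stream, so it may well miss the pathological pattern:

/* Sketch of a small-block compression loop; build against each libzstd
 * revision and compare total compressed size and wall time. The generated
 * input is a placeholder, not Mercurial's actual deltas. */
#include <stdio.h>
#include <time.h>
#include <zstd.h>

int main(void) {
    enum { CHUNK = 512, ROUNDS = 200000 };
    char src[CHUNK], dst[CHUNK * 2];
    size_t total = 0;

    /* crude, somewhat compressible filler */
    for (int i = 0; i < CHUNK; i++) src[i] = (char)('a' + (i * 7) % 16);

    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    clock_t const start = clock();
    for (int r = 0; r < ROUNDS; r++) {
        src[r % CHUNK] ^= (char)r;   /* vary the input slightly each round */
        size_t const csize = ZSTD_compressCCtx(cctx, dst, sizeof(dst),
                                               src, sizeof(src),
                                               ZSTD_CLEVEL_DEFAULT);
        if (ZSTD_isError(csize)) { fprintf(stderr, "compression error\n"); return 1; }
        total += csize;
    }
    double const secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("%d chunks of %d bytes -> %zu compressed bytes in %.2f s\n",
           ROUNDS, CHUNK, total, secs);
    ZSTD_freeCCtx(cctx);
    return 0;
}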

Cyan4973 avatar Jul 25 '25 20:07 Cyan4973

Looking at the directories fast and slow within the test zstd-4446, the only mention of zstd I can find is zstd.cpython-312-x86_64-linux-gnu.so.

So I made a test: I copied the zstd.cpython-312-x86_64-linux-gnu.so from slow into fast, and ran the test to observe the difference. If the speed difference is due to zstd, the speed should plummet to the slow level.

But there was no difference. The test remained "fast".

This makes me think that the speed difference between the scenarios slow and fast is not solely determined by libzstd.

Looking at perf traces, I note that the main contributors to cpu cost are completely different, making me wonder if we are comparing the same scenarios:

  Perf report - "fast" 

  23.84%  hg       python3.12                                [.] _PyEval_EvalFrameDefault                                                                                                                           
   7.67%  hg       sha1dc.cpython-312-x86_64-linux-gnu.so    [.] sha1_compression_states     
   4.08%  hg       sha1dc.cpython-312-x86_64-linux-gnu.so    [.] ubc_check                              
   2.29%  hg       python3.12                                [.] PyObject_Malloc                                           
   1.71%  hg       python3.12                                [.] PyObject_Free                                               
   1.10%  hg       zstd.cpython-312-x86_64-linux-gnu.so      [.] ZSTD_decompressSequences_bmi2.constprop.0                                                                                                          
   1.06%  hg       [kernel.kallsyms]                         [k] 0xffffffff888001c6                                                                                                                                 
   1.03%  hg       libc.so.6                                 [.] __memmove_avx_unaligned_erms                                                                                                               
   1.00%  hg       parsers.cpython-312-x86_64-linux-gnu.so   [.] nt_find                                                                                                                                 
   0.91%  hg       sha1dc.cpython-312-x86_64-linux-gnu.so    [.] sha1_process                                                                                                                     
   0.80%  hg       libc.so.6                                 [.] __memset_avx2_unaligned_erms


  Perf report - "slow" 
  
  21.60%  hg       bdiff.cpython-312-x86_64-linux-gnu.so     [.] bdiff_splitlines
  18.18%  hg       sha1dc.cpython-312-x86_64-linux-gnu.so    [.] sha1_compression_states
  10.71%  hg       python3.12                                [.] _PyEval_EvalFrameDefault
   9.55%  hg       sha1dc.cpython-312-x86_64-linux-gnu.so    [.] ubc_check
   2.70%  hg       zstd.cpython-312-x86_64-linux-gnu.so      [.] ZSTD_compressBlock_doubleFast
   2.38%  hg       libc.so.6                                 [.] __memmove_avx_unaligned_erms
   2.08%  hg       sha1dc.cpython-312-x86_64-linux-gnu.so    [.] sha1_process
   2.07%  hg       bdiff.cpython-312-x86_64-linux-gnu.so     [.] recurse
   1.15%  hg       bdiff.cpython-312-x86_64-linux-gnu.so     [.] bdiff_diff
   0.98%  hg       zstd.cpython-312-x86_64-linux-gnu.so      [.] ZSTD_decompressSequences_bmi2.constprop.0
   0.90%  hg       python3.12                                [.] PyObject_Malloc
   0.79%  hg       zstd.cpython-312-x86_64-linux-gnu.so      [.] HUF_buildCTable_wksp
   0.74%  hg       libc.so.6                                 [.] __memset_avx2_unaligned_erms
   0.71%  hg       libc.so.6                                 [.] __memcmp_avx2_movbe
   0.66%  hg       zstd.cpython-312-x86_64-linux-gnu.so      [.] ZSTD_compressBlock_doubleFast_extDict_generic
   0.61%  hg       python3.12                                [.] PyObject_Free 

In the fast scenario, only ZSTD_decompressSequences emerges, and it uses just 1% of cpu.

In the slow scenario, ZSTD_compressBlock is much more visible, at 2.70%. If the workload was identical, it should not "disappear" from the traces of the fast scenario.

Cyan4973 avatar Jul 25 '25 23:07 Cyan4973

The proposed script works as intended on a standard linux configuration.

However, I'm unsure how to proceed from there.

My current guess is that this workload is dominated by small to very small inputs, in the kilobyte range. But on top of that, I would need to understand the call pattern, which is not trivial because there are additional layers between the application and the library.

The workload interleaves compression of (mostly) small data (possibly very small) with operations on much larger data.

My (very uneducated) guess is that the new small-block compression code makes much more aggressive use of the CPU caches, possibly some orders of magnitude more than it used to for such small inputs. This would result in the other processing suffering many more cache misses, slowing down the overall process.
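
One rough way I could imagine sanity-checking that hypothesis is to compare the compression-context footprint across the two builds, keeping in mind that this only covers the context allocation, not the transient table and stack usage inside the block compressors:

/* Assumption-laden sketch: print estimated and actual CCtx sizes for the
 * default level, to compare across the two libzstd builds. These two
 * functions live behind ZSTD_STATIC_LINKING_ONLY. */
#define ZSTD_STATIC_LINKING_ONLY
#include <stdio.h>
#include <zstd.h>

int main(void) {
    printf("estimated CCtx size (level %d): %zu bytes\n",
           ZSTD_CLEVEL_DEFAULT, ZSTD_estimateCCtxSize(ZSTD_CLEVEL_DEFAULT));

    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    char src[128] = {0}, dst[256];
    ZSTD_compressCCtx(cctx, dst, sizeof(dst), src, sizeof(src), ZSTD_CLEVEL_DEFAULT);
    printf("CCtx size after one small compression: %zu bytes\n",
           ZSTD_sizeof_CCtx(cctx));
    ZSTD_freeCCtx(cctx);
    return 0;
}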

Next week, I'll look into gathering more information on the workload patterns, i.e. the sequencing of all operations and the size of their inputs. This would hopefully help narrow down the pattern.

Ideally, I would like a short test, that can be iterated upon quickly and locally. Unfortunately, the proposed test takes several minutes to run, and I don't yet know how to iterate with it, i.e. change the source code, compile, measure.

Regarding being able to hack on the source: the script created a clone of Mercurial in /tmp/zstd-4446/mercurial-dev. In that clone, the libzstd source lives in contrib/python-zstandard/zstd/.

You can create a Python virtual environment, where you'll be able to install your modified Mercurial, using python3 -m venv /tmp/zstd-4446/custom.

Then you can install a modified Mercurial using /tmp/zstd-4446/custom/bin/pip install /tmp/zstd-4446/mercurial-dev

And then run it using /tmp/zstd-4446/custom/bin/hg

/tmp/zstd-4446/custom/bin/hg init /tmp/zstd-4446/repo-tmp ; time /tmp/zstd-4446/custom/bin/hg -R /tmp/zstd-4446/repo-tmp unbundle /tmp/zstd-4446/netbeans-none.hg ; rm -rf /tmp/zstd-4446/repo-tmp ;

However, that still takes a few minutes. I unfortunately don't have a way to provide a shorter test for now, as the issue appears in quite specific scenarios.

A simpler scenario would be a test employing libzstd directly, but there's a risk that the issue is so specific that it's not reproducible with a test focused solely on libzstd.

Everything I have seen so far points to libzstd operations leading to degraded performance in other code, so much so that more than just libzstd will probably need to be involved.

marmoute avatar Jul 26 '25 00:07 marmoute

Looking at the directories fast and slow within the test zstd-4446, the only mention of zstd I can find is zstd.cpython-312-x86_64-linux-gnu.so.

So I made a test: I copied the zstd.cpython-312-x86_64-linux-gnu.so from slow into fast, and ran the test to observe the difference. If the speed difference is due to zstd, the speed should plummet to the slow level.

But there was no difference. The test remained "fast".

This makes me think that the speed difference between the scenarios slow and fast is not solely determined by libzstd.

I am not sure how you did your test, but I cannot reproduce what you describe; moving the .so around produces the expected result for me (in both directions).

Looking at perf traces, I note that the main contributors to cpu cost are completely different, making me wonder if we are comparing the same scenarios:

In the fast scenario, only ZSTD_decompressSequences emerges, and it uses just 1% of cpu.

In the slow scenario, ZSTD_compressBlock is much more visible, at 2.70%. If the workload was identical, it should not "disappear" from the traces of the fast scenario.

Some of the main contributors to the CPU cost remain present in both cases: _PyEval_EvalFrameDefault, sha1_compression_states, ubc_check. However, the sha1 computation becomes much slower.

Interestingly, the ZSTD_compressBlock_doubleFast function starts to appear while it wasn't there before, so it seems to also be affected by the regression.

The ZSTD_decompressSequences function, however, seems unaffected, and its percentage reduction matches the overall timing increase.

marmoute avatar Jul 26 '25 00:07 marmoute

I am not sure how you did your test, but I cannot reproduce what you describe; moving the .so around produces the expected result for me (in both directions).

I hope we talk about the same scenarios.

I confirm that, when running the fast scenario in which the libzstd library has been overwritten by the slow scenario one, the fast scenario remains fast (~2mn40). I checked with md5sum to be sure.

Conversely, if I overwrite the libzstd library from the slow scenario with the version from the fast scenario, the scenario still remains slow (~6mn50). Also double checked with md5sum.

For the slow scenario, I noticed that the manifest part of the test was noticeably slower, maybe enough to justify a large portion of the difference.

Cyan4973 avatar Jul 26 '25 03:07 Cyan4973

I am not sure how you did your test, but I cannot reproduce what you describe; moving the .so around produces the expected result for me (in both directions).

I hope we talk about the same scenarios.

I confirm that, when running the fast scenario in which the libzstd library has been overwritten by the slow scenario one, the fast scenario remains fast (~2mn40). I checked with md5sum to be sure.

Conversely, if I overwrite the libzstd library from the slow scenario with the version from the fast scenario, the scenario still remains slow (~6mn50). Also double checked with md5sum.

I did test that on my side and I confirm the opposite: moving the zstd.cpython-313-x86_64-linux-gnu.so from an install built from zstd a8006264cfb4 (fast) into an install built from zstd 6004c1117f1f (slow) makes the slow install fast, and moving things the other way makes a fast install slow.

So I am not sure what's going on on your side. I'll look into providing you with a small script reproducing the operation, to make sure we do the same things.

For the slow scenario, I noticed that the manifest part of the test was noticeably slower, maybe enough to justify a large portion of the difference.

Yes, most of the difference comes from there. Here, the manifest processing on this repository handles large content (going through sha1, bdiff, etc…) that results in small "deltas" (which go through compression). That pattern seems to be the one affected.

marmoute avatar Jul 26 '25 07:07 marmoute

TL;DR: a slight upward change in some compressed sizes seems to send the application completely south, toward a problematic performance behavior we already know about.

My next step was to confirm that both sides were performing the same work, and what kind of timing change we could see in that work.

The two attached files contain a log of relevant events in the form date (in ns) duration (in ns) operation-type, followed by some information about the input and output sizes.

zstd-4446-events-a8006264cfb4629e5ec23534fac1330609c99ebb.txt

zstd-4446-events-6004c1117f1f79a2b44363f05571eb19ccc00817.txt

So at the start, the operations performed are the same, except that the result size of some compressed blocks varies slightly (one or two bytes more). At some point this seems to throw the Mercurial code onto a much more expensive slow path, snowballing for the rest of the processing. I am going to keep digging a bit to confirm that this is the only thing at play. If it is "just" this, the good news is that we happen to be currently working on avoiding this kind of performance derailment in general, so we might have a solution to limit its impact.

I'll get back to you when I have the full story narrowed down.

marmoute avatar Jul 28 '25 14:07 marmoute

I dug further to confirm the latest lead:

  • 6004c11 results in frequent, small (~1%) but noticeable degradations of compression for small inputs (< 1k).
  • That change in output size triggers some pathological behavior in Mercurial, confirming that the main source of the slowdown is indeed not directly related to zstd.

Sorry for raising the alarm too loudly. The bisection pointed so clearly to zstd that I forgot to check for the very problem I have had my nose on for the last couple of months.

While perfectly acceptable, the loss in compression for small blocks is slightly unfortunate. We could live with it, but I wonder if we could do something about it. Here are some quick statistics I gathered from the trace (a sketch for regenerating this kind of per-chunk accounting follows the lists below):

  • for the 82909 chunks with size ≤ 1024: we lose 64937 bytes of compression quality (average of 0.78 bytes per chunk)
  • for the 58395 chunks with size ≤ 512: we lose 53128 bytes of compression quality (average of 0.90 bytes per chunk)
  • for the 32350 chunks with size ≤ 256: we lose 30469 bytes of compression quality (average of 0.95 bytes per chunk)

For very small chunks, this still applies when comparing the previous (default-level) compression with maximum (level 22) compression:

  • The ≤256 blocks still lose 13092 bytes (average of 0.31 bytes per chunk).

And comparing level-22 compression to level-22 compression gives similar results:

  • chunk ≤ 1024 bytes; 83457 items; 89898 bytes extra; average 1.07 bytes / item
  • chunk ≤ 512 bytes; 58854 items; 61199 bytes extra; average 1.03 bytes / item
  • chunk ≤ 256 bytes; 32851 items; 31899 bytes extra; average 0.97 bytes / item
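
For completeness, this is roughly how such per-chunk numbers can be regenerated: a sketch that compresses each chunk file at the default level and at the maximum level, so that running it against the two libzstd builds and diffing the outputs gives the per-chunk byte differences. The file-per-chunk input is an assumption; my actual numbers were gathered from instrumentation inside Mercurial.

/* Sketch: print, for each chunk file given on the command line, its size
 * and its compressed sizes at the default level and at the maximum level.
 * Run against both libzstd builds and diff the outputs. */
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

int main(int argc, char** argv) {
    for (int i = 1; i < argc; i++) {
        FILE* const f = fopen(argv[i], "rb");
        if (!f) { perror(argv[i]); continue; }
        fseek(f, 0, SEEK_END);
        long const srcSize = ftell(f);
        fseek(f, 0, SEEK_SET);
        char* const src = malloc((size_t)srcSize);
        if (!src || fread(src, 1, (size_t)srcSize, f) != (size_t)srcSize) {
            fprintf(stderr, "read error: %s\n", argv[i]);
            fclose(f); free(src); continue;
        }
        fclose(f);

        size_t const bound = ZSTD_compressBound((size_t)srcSize);
        char* const dst = malloc(bound);
        if (!dst) { free(src); continue; }
        size_t const cDef = ZSTD_compress(dst, bound, src, (size_t)srcSize,
                                          ZSTD_CLEVEL_DEFAULT);
        size_t const cMax = ZSTD_compress(dst, bound, src, (size_t)srcSize,
                                          ZSTD_maxCLevel());
        if (ZSTD_isError(cDef) || ZSTD_isError(cMax)) {
            fprintf(stderr, "compression error: %s\n", argv[i]);
        } else {
            printf("%s\t%ld\t%zu\t%zu\n", argv[i], srcSize, cDef, cMax);
        }
        free(src); free(dst);
    }
    return 0;
}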

marmoute avatar Jul 28 '25 20:07 marmoute

This change was designed to increase decompression speed of small blocks. It makes table construction simpler and faster, but as a consequence loses a little bit of accuracy, hence a small compression ratio loss. So it's a trade-off.

I don't know offhand if it's selectable or hard-wired. cc @terrelln.

Imho, this is a reasonable tradeoff. We are talking about ~0.1% ratio loss, in exchange for a more perceptible ~10% decompression speed gain, since table construction is more or less a fixed cost, so it takes a larger share of cpu on small data.

Cyan4973 avatar Jul 28 '25 20:07 Cyan4973