Unexpected performance degradation when running experiment
Dear OpenBLAS Team,
I'm currently working on improving NumPy's matmul for the strided case, and I ran a large grid search with different BLAS frameworks, see
https://github.com/numpy/numpy/pull/23752#issuecomment-2629521597
Here is a repost of the plots:
The plots show the performance improvement of the respective BLAS framework (plus copying) over naïve matrix multiplication.
For OpenBLAS I actually ran two experiments: one where the iteration order over the search space is completely shuffled across all "pixels" of the experiment, and one where the "pixels" are visited sequentially (the obvious choice, called noshuffle here; see the sketch below). In the latter noshuffle case, an unexpected performance degradation is visible as a red triangle in the top right corner, e.g. for n=20 and batch_size=1. That degradation is the reason I introduced shuffling in the first place. Other frameworks are not affected by the iteration order (graphs not included, but I can provide them on request).
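To make the distinction concrete, here is a rough sketch of the two iteration orders (the parameter values and shapes are illustrative only, not the actual benchmark code):

```python
import itertools
import random
import time

import numpy as np

# Illustrative parameter grid; the real experiment sweeps a much larger space.
grid = list(itertools.product(
    [1, 8],              # batch_size
    [2, 5, 10, 20],      # n (square matrix dimension)
    [1, 10, 100],        # m (length of the strided/batched axis)
))

def run_pixel(batch_size, n, m):
    """Time one 'pixel' of the experiment: a strided batched matmul."""
    a = np.random.rand(m, batch_size, n, n)
    b = np.random.rand(m, batch_size, n, n)
    t0 = time.perf_counter()
    np.matmul(a, b)
    return time.perf_counter() - t0

# "noshuffle": visit the pixels in their natural, sequential order
for params in grid:
    run_pixel(*params)

# "shuffle": visit exactly the same pixels, but in random order
shuffled = grid.copy()
random.shuffle(shuffled)
for params in shuffled:
    run_pixel(*params)
```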
I wonder whether this performance artefact can be fixed with the help of these plots. I can run more benchmarks and produce more plots like these if there is interest, and also provide some code.
Best from Berlin, Michael
Very interesting (and an odd effect), thanks for sharing @xor2k. I don't see the code for generating the plots. Would you be able to share the code for just one of the noshuffle plots with the performance dip, or even a subset of that (e.g., for the `n=20, batch_size=1` graph, moving left to right along `m` for constant `p`)? The easier/faster it is to reproduce the performance dip, the more likely it is we can get to the bottom of this.
Thanks for having a look. I've put everything into a gist, see
https://gist.github.com/xor2k/b2b7d1d5e87bfe8a8a30c2d0c7e12f9e
The code might not be as polished as it could be, but the approach is very generic. I think something like this could be interesting to have in OpenBLAS or NumPy for benchmarking in the future, especially the automated switching between environments, the integrated plotting, and the multidimensional parameter space that simply exhausts all meaningful options.
Feel free to ask if something does not work out of the box. The full simulation takes a very long time, so for a quick test it might make sense to (drastically) reduce the search space size in test.py.
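Roughly, such a reduced slice could look like this (treating m, n, p as the matmul dimensions (m×n)·(n×p) is only an illustration here; the exact shapes and timing logic are in the gist):

```python
import time

import numpy as np

# Reduced slice: batch_size=1, n=20, constant p, sweeping m from left to right.
n, p = 20, 20
for m in (1, 2, 5, 10, 20, 50, 100, 200, 500, 1000):
    a = np.random.rand(m, n)
    b = np.random.rand(n, p)
    t0 = time.perf_counter()
    for _ in range(50):        # repeat to get a stable timing
        a @ b
    dt = (time.perf_counter() - t0) / 50
    print(f"m={m:5d}  {dt * 1e6:8.1f} µs per matmul")
```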
Nice! I see some mention of conda envs. In case it's helpful to you, this Pixi dev setup allows runtime switching of BLAS libraries without any rebuilds: https://github.com/rgommers/pixi-dev-scipystack/tree/main/scipy.
Curious. Offhand, the only systemic "memory" of a previous problem size that I can think of is which (or at least how many) CPU cores just got utilized, so interleaving "small" and "large" problems might help to keep a thermally constrained system cooler. But that would be expected to affect all "competing" software equally (and not to happen at all on any decently designed modern hardware)... I assume you ran the tests with 0.3.29? Yesterday's #5133 may have improved overall performance a bit.
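One way to probe the "how many cores were just used" hypothesis might be to pin the BLAS thread count and rerun the sequential sweep. A small sketch using the threadpoolctl package (just a suggestion; the package and the shapes below are not part of the original benchmark):

```python
# If the dip disappears with the thread count pinned, the "memory" of the
# previous problem size is likely in the threading heuristics rather than
# in the compute kernels themselves.
# Alternative without extra packages: export OPENBLAS_NUM_THREADS=1.
import time

import numpy as np
from threadpoolctl import threadpool_limits

def time_matmul(m, n, p, reps=20):
    a = np.random.rand(m, n)
    b = np.random.rand(n, p)
    t0 = time.perf_counter()
    for _ in range(reps):
        a @ b
    return (time.perf_counter() - t0) / reps

with threadpool_limits(limits=1, user_api="blas"):
    # Interleave small and large sizes to mimic the sequential sweep.
    for m in (16, 4096, 16, 4096, 16):
        print(m, time_matmul(m, 20, 20))
```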
I can indeed rerun it with the most recent version of OpenBLAS. Is there a nightly Debian package or something similar?
Unfortunately I am only (somewhat) aware of their weekly CD/DVD image builds, the last of which appears to date from last Monday. That date might just be new enough to catch Sunday's PR, but unfortunately their file list only has it as "dev_0.3.29+ds-2" with no indication of a git hash, making it look like the release version. Sadly I'm not familiar with what it takes to build a .deb from source ...
Okay, then I'll create a fake package by patching the original one against the current git.
I got an error while building the OpenBLAS Debian package: "gemv.c:92:14: error: implicit declaration of function 'num_cpu_avail'". Would version 0.3.29 from January 12 be recent enough?
Jan 12 would not include the change made on Feb 16... but if you could take 0.3.29 and apply #5133's trivial change by hand, that would be great.
That error is a bit suspect: num_cpu_avail() is declared in common_thread.h, which gets included by common.h (which gemv.c and pretty much everything else includes) in multithreaded builds (which should also be the only ones to reference num_cpu_avail). Possibly one of the recent additions of thread thresholds in develop inadvertently removed an #ifdef SMP during the code reshuffle. I'll look into that tomorrow.
Thanks for the quick reply! I guess it's easy to fix; I'll retry then.