No need to set OMP num_threads
No need to set `num_threads(num)`, as it causes extra thread-creation overhead in some scenarios. In OpenMP, if the number of threads you request is larger than the number used by the previous parallel region, new threads are created. With this patch, pts/rbenchmark-1.0.3 improves by about 2.35x (0.385 sec vs. 0.164 sec) on an Ice Lake server under CentOS 8.
Here are the steps to run rbenchmark on CentOS 8:

- Install the R package:
  ```
  $ sudo dnf install R
  ```
- Build OpenBLAS:
  ```
  $ make TARGET=CORE2 USE_THREAD=1 USE_OPENMP=1 FC=gfortran CC=gcc LIBPREFIX="libopenblas" INTERFACE64=0
  ```
- Download and run the R benchmark:
  ```
  $ wget http://www.phoronix-test-suite.com/benchmark-files/rbenchmarks-20160105.tar.bz2
  $ tar -xf rbenchmarks-20160105.tar.bz2
  $ cd rbenchmarks
  $ export LD_LIBRARY_PATH=<Your openblas source root dir>
  $ Rscript R-benchmark-25/R-benchmark-25.R
  ```
  The benchmark result looks like: "Overall mean (sum of I, II and III trimmed means/3)_ (sec): 0.166433631462761".
Reason given for this change in #2775 (@Guobing-Chen) was "In current code, no matter what number of threads specified, all available CPU count is used when invoking OMP, which leads to very bad performance if the workload is small while all available CPUs are big. Lots of time are wasted on inter-thread sync. Fix this issue by really using the number specified by the variable 'num' from calling API." So I am a bit sceptical; you may just be comparing different situations/workloads.
Yes, we are using different workloads. Rbenchmark calculates the eigenvalues of a 640x640 random matrix. The "eigen" function repeatedly calls "dgeev_", which in turn calls "exec_blas", and a new `num` is computed on each call. With "num_threads(num)", suppose the hardware has 112 logical cores in total: if the first computed `num` is 50, OpenMP creates 50 threads to handle the workload; if the second computed `num` is 112, OpenMP creates 112 new threads to handle it, and does not reuse the old ones. Without "num_threads(num)", the OpenMP thread pool holds 112 threads and reuses them for both operations. Creating new threads causes a lot of overhead, and that is the root cause of rbenchmark's very poor performance without this patch.
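To make the mechanism concrete, here is a minimal standalone sketch of the two dispatch patterns being compared (not the actual OpenBLAS code; `do_work` and the `nums` sequence are illustrative placeholders):

```c
#include <omp.h>

/* Placeholder for one slice of a BLAS workload. */
static void do_work(int chunk) { (void)chunk; }

int main(void) {
    /* Per-call thread counts as they might be computed by successive dgeev_ calls. */
    int nums[] = {50, 112, 50, 112};

    for (int i = 0; i < 4; i++) {
        int num = nums[i];

        /* Without num_threads(num): the team created for the first region
           (sized by OMP_NUM_THREADS) is reused every time. */
        #pragma omp parallel for
        for (int j = 0; j < num; j++)
            do_work(j);

        /* With num_threads(num): whenever num grows past the previous team
           size, the OpenMP runtime has to spawn new threads for this region. */
        #pragma omp parallel for num_threads(num)
        for (int j = 0; j < num; j++)
            do_work(j);
    }
    return 0;
}
```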
That is a rather ancient benchmark script; please re-check with benchmarks/scripts/R/deig.R. Notably, the old script permits gc() inside the metered section. How does the pthread version perform?
I tried "benchmarks/scripts/R/deig.R" and here are the openMP version (USE_THREAD=1 USE_OPENMP=1) results. Without the patch SIZE Flops Time 128x128 : 1927.93 MFlops 0.029000 sec 256x256 : 570.51 MFlops 0.784000 sec 384x384 : 721.25 MFlops 2.093000 sec 512x512 : 378.53 MFlops 9.453000 sec 640x640 : 409.15 MFlops 17.081000 sec 768x768 : 475.10 MFlops 25.419000 sec 896x896 : 582.22 MFlops 32.938000 sec 1024x1024 : 679.14 MFlops 42.150000 sec 1152x1152 : 784.43 MFlops 51.959000 sec 1280x1280 : 902.08 MFlops 61.979000 sec 1408x1408 : 1038.00 MFlops 71.692000 sec 1536x1536 : 1170.07 MFlops 82.570000 sec 1664x1664 : 1223.62 MFlops 100.386000 sec 1792x1792 : 1371.79 MFlops 111.837000 sec 1920x1920 : 1530.71 MFlops 123.274000 sec 2048x2048 : 1683.94 MFlops 135.995000 sec
With the patch:

```
SIZE             Flops            Time
 128x128  :    1189.58 MFlops    0.047000 sec
 256x256  :    4758.30 MFlops    0.094000 sec
 384x384  :    7987.15 MFlops    0.189000 sec
 512x512  :    8321.50 MFlops    0.430000 sec
 640x640  :   11805.34 MFlops    0.592000 sec
 768x768  :   14946.26 MFlops    0.808000 sec
 896x896  :   18801.13 MFlops    1.020000 sec
1024x1024 :   15634.06 MFlops    1.831000 sec
1152x1152 :   17147.01 MFlops    2.377000 sec
1280x1280 :   15217.77 MFlops    3.674000 sec
1408x1408 :   24641.16 MFlops    3.020000 sec
1536x1536 :   32376.88 MFlops    2.984000 sec
1664x1664 :   22427.32 MFlops    5.477000 sec
1792x1792 :   29691.74 MFlops    5.167000 sec
1920x1920 :   27672.17 MFlops    6.819000 sec
2048x2048 :   37297.66 MFlops    6.140000 sec
```
> 640x640 : 409.15 MFlops 17.081000 sec
> 640x640 : 11805.34 MFlops 0.592000 sec
Impressive indeed.
How does it compare to the pthread version, i.e. building without the USE_OPENMP parameter? Not picking on you, just curious; over the course of the day I will measure it myself.
The pthread version is built with USE_THREAD=1 USE_OPENMP=0; here are its results.

Without the patch:

```
SIZE             Flops            Time
 128x128  :    1747.19 MFlops    0.032000 sec
 256x256  :    6675.83 MFlops    0.067000 sec
 384x384  :   11265.46 MFlops    0.134000 sec
 512x512  :   10194.43 MFlops    0.351000 sec
 640x640  :   14147.29 MFlops    0.494000 sec
 768x768  :   17733.59 MFlops    0.681000 sec
 896x896  :   23049.46 MFlops    0.832000 sec
1024x1024 :   25355.14 MFlops    1.129000 sec
1152x1152 :   32169.25 MFlops    1.267000 sec
1280x1280 :   33987.89 MFlops    1.645000 sec
1408x1408 :   37832.39 MFlops    1.967000 sec
1536x1536 :   36293.24 MFlops    2.662000 sec
1664x1664 :   40050.35 MFlops    3.067000 sec
1792x1792 :   41009.69 MFlops    3.741000 sec
1920x1920 :   45175.12 MFlops    4.177000 sec
2048x2048 :   43463.21 MFlops    5.269000 sec
```

With the patch:

```
SIZE             Flops            Time
 128x128  :    1694.24 MFlops    0.033000 sec
 256x256  :    7214.20 MFlops    0.062000 sec
 384x384  :   11182.01 MFlops    0.135000 sec
 512x512  :   11848.49 MFlops    0.302000 sec
 640x640  :   18991.19 MFlops    0.368000 sec
 768x768  :   24153.15 MFlops    0.500000 sec
 896x896  :   30247.88 MFlops    0.634000 sec
1024x1024 :   30260.00 MFlops    0.946000 sec
1152x1152 :   42412.53 MFlops    0.961000 sec
1280x1280 :   42324.05 MFlops    1.321000 sec
1408x1408 :   41924.68 MFlops    1.775000 sec
1536x1536 :   35467.18 MFlops    2.724000 sec
1664x1664 :   40931.17 MFlops    3.001000 sec
1792x1792 :   41318.94 MFlops    3.713000 sec
1920x1920 :   45755.70 MFlops    4.124000 sec
2048x2048 :   42813.17 MFlops    5.349000 sec
```
Looking deeper at the benchmark script:
- an additional chol() (DGETRF+DPOTRF) is there, but otherwise it is the same O(n^3) drill-down as solve/eig.
- apart from those few BLAS/LAPACK functions (interrupted by the single-threaded gc()), the rest is single-threaded, which was likely the best option back then. The summary result will be worse on a 30-core 1 GHz CPU than on a 10-core 3 GHz CPU.
I wonder if this can be fixed without bringing back the original problem, perhaps by making it conditional on the fraction of the total threads available (e.g. run with num_threads(num) only if num is less than half the CPU count)? I guess this would introduce some weird new crossover points where performance suddenly changes for no apparent reason...
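As a rough illustration of that idea (a hypothetical sketch, not OpenBLAS code; `exec_threads` and the 50% threshold are made up for the example):

```c
#include <omp.h>

/* Only pin the team size when the request is well below the machine total:
   small workloads then avoid full-width synchronization overhead, while
   near-full requests keep reusing the existing thread pool. */
static void exec_threads(int num, void (*fn)(int)) {
    if (num < omp_get_max_threads() / 2) {
        #pragma omp parallel for num_threads(num)
        for (int i = 0; i < num; i++)
            fn(i);
    } else {
        #pragma omp parallel for
        for (int i = 0; i < num; i++)
            fn(i);
    }
}
```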
That is the other problem, with threading thresholds. Yes, lots of time is wasted when the heuristics there fail and spin up too many threads. You can actually see that the 128x128 sample is slightly slower after the change; if you set OPENBLAS_NUM_THREADS=1 it gets faster, roughly the same in OMP and PTH, fix applied or not. This fix just emphasizes another long-hidden problem, in a different place.
A characteristic part, measured on a Sandy Bridge 2-core non-HT NUC, pre-fix, pthread build, dgemm.R, recent versions:

```
SIZE            Flops            Time
   128x128 :   4194.30 MFlops    0.001000 sec
!! 256x256 :   3050.40 MFlops    0.011000 sec
   384x384 :  28311.55 MFlops    0.004000 sec
```

With thr=1:

```
   128x128 :   4194.30 MFlops    0.001000 sec
   256x256 :  16777.22 MFlops    0.002000 sec
   384x384 :  18874.37 MFlops    0.006000 sec
```
EDIT: I doubted the initial measurements as they were only a summary, not raw data, so I asked for a known, consistent benchmark to be used. The more precise tool turned up an even more favourable result than the initial assessment suggested.
@martin-frbg Apart from benchmarks/scripts/R/deig.R, which other benchmarks do I need to verify? Since deig.R does not seem stable enough on its own, I increased the loop count to 20 and collected the following data on Ice Lake (From=128, To=384, Step=128, Loops=20).

Without the patch (Flops in MFlops, Time in seconds):
| SIZE | round 0 Flops | round 0 Time | round 1 Flops | round 1 Time | round 2 Flops | round 2 Time | round 3 Flops | round 3 Time | round 4 Flops | round 4 Time | mean Flops | mean Time | rsd Flops | rsd Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 128x128 | 2167.06 | 0.516 | 2188.26 | 0.511 | 2129.91 | 0.525 | 2101.88 | 0.532 | 2146.26 | 0.521 | 2146.674 | 0.521 | 1.55% | 1.55% |
| 256x256 | 457.25 | 19.564 | 432.09 | 20.703 | 458.37 | 19.516 | 438.19 | 20.415 | 456.18 | 19.61 | 448.416 | 19.9616 | 2.75% | 2.78% |
| 384x384 | 422.31 | 71.492 | 403.81 | 74.766 | 426.01 | 70.871 | 406.3 | 74.308 | 423.49 | 71.292 | 416.384 | 72.5458 | 2.51% | 2.53% |
With the patch (Flops in MFlops, Time in seconds):

| SIZE | round 0 Flops | round 0 Time | round 1 Flops | round 1 Time | round 2 Flops | round 2 Time | round 3 Flops | round 3 Time | round 4 Flops | round 4 Time | mean Flops | mean Time | rsd Flops | rsd Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 128x128 | 2733.99 | 0.409 | 2675.12 | 0.418 | 2463 | 0.454 | 2564.68 | 0.436 | 2733.99 | 0.409 | 2774.69 | 0.403 | 4.25% | 4.84% |
| 256x256 | 5850.63 | 1.529 | 5591.01 | 1.6 | 5305.82 | 1.686 | 5580.54 | 1.603 | 5756.51 | 1.554 | 5893.02 | 1.518 | 3.53% | 3.95% |
| 384x384 | 8569.81 | 3.523 | 8890.29 | 3.396 | 7995.61 | 3.776 | 8521.43 | 3.543 | 8562.52 | 3.526 | 8398.17 | 3.595 | 3.84% | 3.84% |
With the patch vs. without the patch (Flops speedup):

| SIZE | Speedup |
|---|---|
| 128x128 | 1.29x |
| 256x256 | 13.14x |
| 384x384 | 20.17x |
Without this patch, the larger the size, the longer it takes to complete, so I only picked three sizes. The patch clearly improves this benchmark. I am not sure which other benchmarks I need to verify with the patch.
We found a similar slowdown in the EasyBuild community, using numpy and svd (https://gist.githubusercontent.com/markus-beuckelmann/8bc25531b11158431a5b09a45abd6276/raw/660904cb770197c3c841ab9b7084657b1aea5f32/numpy-benchmark.py).
In the end, the problem is that both #2775 and this fix are right, depending on the use case and the compiler used.
When you do an SVD via dgesdd, a lot of BLAS functions are called with wildly varying numbers of threads. If you vary num_threads across parallel regions, there is huge overhead, particularly with libgomp (GCC).
Here's an example for DGESDD (via https://github.com/mpimd-csc/flexiblas/issues/7#issuecomment-712203015)
https://gist.github.com/0a2e1783e68b5aca8b69e0947c833082
From bad to worse:

```
$ gfortran -lopenblas -O2 -fopenmp test_dgesdd.f90 -o test_dgesdd
$ OMP_NUM_THREADS=2 OMP_PROC_BIND=true ./test_dgesdd
Time = 1.1620110740000000
$ OMP_NUM_THREADS=64 OMP_PROC_BIND=true ./test_dgesdd
Time = 7.8668489140000002
$ OMP_NUM_THREADS=64 OMP_PROC_BIND=false ./test_dgesdd
Time = 42.826767384000000
```
Using a little `printf("%ld ", num);` above the `#pragma omp parallel for...` I found that this single dgesdd call invokes the parallel region 16228 times and switches the number of threads 7900 times. A small sample looks like `1 3 1 4 1 4 1 4 1 4 1 5 2 5 2 5 2 5 2 6 2 6 2 6` and `64 64 8 64 8 64 64 64 64 8 64 8 64` (https://gist.github.com/c367d0bf460ed385b2d994fcee5723e6).
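The instrumentation amounts to something like the following (a sketch; the real dispatch loop lives in OpenBLAS's blas_server_omp.c, and the names here are illustrative):

```c
#include <stdio.h>

/* Sketch of the instrumented dispatch: log the requested team size just
   before each parallel region is entered, so changes become visible. */
static void exec_region(long num, void (*routine)(long)) {
    printf("%ld ", num);   /* trace how often the team size switches */

    #pragma omp parallel for num_threads(num)
    for (long i = 0; i < num; i++)
        routine(i);
}
```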
After a Google search, and inspired by https://stackoverflow.com/questions/24440118/openmp-parallell-region-overhead-increase-when-num-threads-varies, I adapted that test (https://gist.github.com/cb7b050f6d1f5a3893df4a1352714668), ran it on a node with 64 cores, and watched the huge overhead:
```
$ gcc -O2 -fopenmp test.c -o test
$ OMP_PROC_BIND=true ./test
2 threads             137.097901
64 threads           4427.660024
2/64 alternating   175142.989028
```
Intel (or clang, for that matter) doesn't have this issue to the same extent, but it also gets no speedup for 2 threads:
```
$ icc -O2 -fopenmp test.c -o test
$ OMP_PROC_BIND=true ./test
2 threads            4647.016525
64 threads           4935.026169
2/64 alternating    10691.881180
```
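For reference, the adapted test is essentially the following (my reconstruction; the iteration counts and the timed body are illustrative, and the linked gist is the authoritative version):

```c
#include <omp.h>
#include <stdio.h>

#define ITERS 10000
static volatile double sink;   /* keeps the compiler from removing the work */

/* Time ITERS parallel regions whose team size cycles through 'teams'. */
static double bench_us(const int *teams, int n) {
    double t0 = omp_get_wtime();
    for (int it = 0; it < ITERS; it++) {
        double s = 0.0;
        #pragma omp parallel for reduction(+:s) num_threads(teams[it % n])
        for (int i = 0; i < 1000; i++)
            s += i * 0.5;
        sink = s;
    }
    return (omp_get_wtime() - t0) * 1e6;   /* microseconds */
}

int main(void) {
    int two[] = {2}, full[] = {64}, alt[] = {2, 64};
    printf("2 threads        %f\n", bench_us(two, 1));
    printf("64 threads       %f\n", bench_us(full, 1));
    printf("2/64 alternating %f\n", bench_us(alt, 2));
    return 0;
}
```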
I'm not quite sure what the best solution is. Clearly, if the number of threads stays low, there is a performance benefit (with libgomp): in this case about 32x, which happens to be exactly 64/2, so roughly linear in the number of threads.
But switching too often is catastrophic, by a factor of 40 or so.
Perhaps a heuristic that could work is to keep track of, say, the latest 32 OpenMP regions and set num_threads to the maximum `num` among those regions. Then, if you get 32 low-threaded calls in a row, you can still take advantage of the performance benefit.
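A minimal sketch of that heuristic (hypothetical, not OpenBLAS code; it assumes the dispatcher is called from a single thread):

```c
/* Remember the last WINDOW requested team sizes and size the next parallel
   region to the maximum of the window, so the team only shrinks after a
   sustained run of small requests. */
#define WINDOW 32

static int history[WINDOW];   /* zero-initialized */
static int pos;

static int adaptive_num_threads(int requested) {
    history[pos] = requested;
    pos = (pos + 1) % WINDOW;

    int max = 1;
    for (int i = 0; i < WINDOW; i++)
        if (history[i] > max)
            max = history[i];
    return max;
}
```

The obvious downside is the one noted above: a single wide call keeps the team pinned at full width for the next 32 regions.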
Thanks. I have not found a solution I like; keeping track of past behaviour adds its own overhead and is not particularly good at predicting the future unless the program really does the same computation over and over. Perhaps something as trivial as introducing a new environment variable "OPENBLAS_ADAPTIVE" to choose between pre- and post-#2775 behaviour at startup would already help?
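Something like this at library initialization (a sketch of the proposal; "OPENBLAS_ADAPTIVE" is the name suggested above, and whatever was eventually merged in #3703 may differ):

```c
#include <stdlib.h>
#include <string.h>

/* Choose the dispatch mode once at startup: pre-#2775 (reuse the full
   OpenMP pool) or post-#2775 (pin the team size per call). */
static int use_num_threads_clause;   /* 0 = pre-#2775, 1 = post-#2775 */

static void init_thread_mode(void) {
    const char *env = getenv("OPENBLAS_ADAPTIVE");
    use_num_threads_clause = (env != NULL && strcmp(env, "1") == 0);
}
```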
Attempting to supersede this with #3703, using a new environment variable to choose between the two modes of operation.
Since #3703 has been merged, can this PR now be closed?