No need to set OMP num_threads
No need to set `num_threads(num)`, as it causes extra thread-creation overhead in some scenarios. In OpenMP, if the number of threads you request is larger than the number used by the previous parallel region, new threads are created. With this patch, pts/rbenchmark-1.0.3 improves by about 2.35x (0.385 sec vs. 0.164 sec) on an Ice Lake server under CentOS 8.
Here are the steps to run rbenchmark on CentOS 8:

- Install the R package:
  ```
  $ sudo dnf install R
  ```
- Build OpenBLAS:
  ```
  $ make TARGET=CORE2 USE_THREAD=1 USE_OPENMP=1 FC=gfortran CC=gcc LIBPREFIX="libopenblas" INTERFACE64=0
  ```
- Download and run the R benchmark:
  ```
  $ wget http://www.phoronix-test-suite.com/benchmark-files/rbenchmarks-20160105.tar.bz2
  $ tar -xf rbenchmarks-20160105.tar.bz2
  $ cd rbenchmarks
  $ export LD_LIBRARY_PATH=<Your openblas source root dir>
  $ Rscript R-benchmark-25/R-benchmark-25.R
  ```
  The benchmark result looks like: "Overall mean (sum of I, II and III trimmed means/3)_ (sec): 0.166433631462761".
Reason given for this change in #2775 (@Guobing-Chen) was "In current code, no matter what number of threads specified, all available CPU count is used when invoking OMP, which leads to very bad performance if the workload is small while all available CPUs are big. Lots of time are wasted on inter-thread sync. Fix this issue by really using the number specified by the variable 'num' from calling API." So I am a bit sceptical; you may just be comparing different situations/workloads.
Yes, we are using different workloads. Rbenchmark calculates the eigenvalues of a 640x640 random matrix. The "eigen" function repeatedly calls "dgeev_", which in turn calls "exec_blas", and a new `num` is computed on each call. With "num_threads(num)", suppose the hardware has 112 logical cores in total: if the first computed `num` is 50, OpenMP creates 50 threads to handle the workload; if the second computed `num` is 112, OpenMP creates 112 new threads to handle it, and does not reuse the old ones. Without "num_threads(num)", the OpenMP thread pool holds 112 threads and reuses them for both operations. Creating new threads causes a lot of overhead, and that is the root cause of rbenchmark's very poor performance without this patch.
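To make the mechanism concrete, here is a minimal standalone sketch of the two dispatch patterns being compared (not the actual OpenBLAS code; `do_work` and the `nums` sequence are illustrative placeholders):

```c
#include <omp.h>

/* Placeholder for one slice of a BLAS workload. */
static void do_work(int chunk) { (void)chunk; }

int main(void) {
    /* Per-call thread counts as they might be computed by successive dgeev_ calls. */
    int nums[] = {50, 112, 50, 112};

    for (int i = 0; i < 4; i++) {
        int num = nums[i];

        /* Without num_threads(num): the team created for the first region
           (sized by OMP_NUM_THREADS) is reused every time. */
        #pragma omp parallel for
        for (int j = 0; j < num; j++)
            do_work(j);

        /* With num_threads(num): whenever num grows past the previous team
           size, the OpenMP runtime has to spawn new threads for this region. */
        #pragma omp parallel for num_threads(num)
        for (int j = 0; j < num; j++)
            do_work(j);
    }
    return 0;
}
```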
That is a rather ancient benchmark script; please re-check with benchmarks/scripts/R/deig.R. Notably, the old script permits gc() inside the metered section. How does the pthread version perform?
I tried "benchmarks/scripts/R/deig.R" and here are the openMP version (USE_THREAD=1 USE_OPENMP=1) results. Without the patch SIZE Flops Time 128x128 : 1927.93 MFlops 0.029000 sec 256x256 : 570.51 MFlops 0.784000 sec 384x384 : 721.25 MFlops 2.093000 sec 512x512 : 378.53 MFlops 9.453000 sec 640x640 : 409.15 MFlops 17.081000 sec 768x768 : 475.10 MFlops 25.419000 sec 896x896 : 582.22 MFlops 32.938000 sec 1024x1024 : 679.14 MFlops 42.150000 sec 1152x1152 : 784.43 MFlops 51.959000 sec 1280x1280 : 902.08 MFlops 61.979000 sec 1408x1408 : 1038.00 MFlops 71.692000 sec 1536x1536 : 1170.07 MFlops 82.570000 sec 1664x1664 : 1223.62 MFlops 100.386000 sec 1792x1792 : 1371.79 MFlops 111.837000 sec 1920x1920 : 1530.71 MFlops 123.274000 sec 2048x2048 : 1683.94 MFlops 135.995000 sec
With the patch:

```
SIZE             Flops            Time
 128x128  :    1189.58 MFlops    0.047000 sec
 256x256  :    4758.30 MFlops    0.094000 sec
 384x384  :    7987.15 MFlops    0.189000 sec
 512x512  :    8321.50 MFlops    0.430000 sec
 640x640  :   11805.34 MFlops    0.592000 sec
 768x768  :   14946.26 MFlops    0.808000 sec
 896x896  :   18801.13 MFlops    1.020000 sec
1024x1024 :   15634.06 MFlops    1.831000 sec
1152x1152 :   17147.01 MFlops    2.377000 sec
1280x1280 :   15217.77 MFlops    3.674000 sec
1408x1408 :   24641.16 MFlops    3.020000 sec
1536x1536 :   32376.88 MFlops    2.984000 sec
1664x1664 :   22427.32 MFlops    5.477000 sec
1792x1792 :   29691.74 MFlops    5.167000 sec
1920x1920 :   27672.17 MFlops    6.819000 sec
2048x2048 :   37297.66 MFlops    6.140000 sec
```
> 640x640 : 409.15 MFlops 17.081000 sec
> 640x640 : 11805.34 MFlops 0.592000 sec
Impressive indeed.
How does it compare to the pthread version, i.e. building without the USE_OPENMP parameter? Not picking on you, just curious; over the course of the day I will measure it myself.
The pthread version is built with USE_THREAD=1 USE_OPENMP=0; here are its results.

Without the patch:

```
SIZE             Flops            Time
 128x128  :    1747.19 MFlops    0.032000 sec
 256x256  :    6675.83 MFlops    0.067000 sec
 384x384  :   11265.46 MFlops    0.134000 sec
 512x512  :   10194.43 MFlops    0.351000 sec
 640x640  :   14147.29 MFlops    0.494000 sec
 768x768  :   17733.59 MFlops    0.681000 sec
 896x896  :   23049.46 MFlops    0.832000 sec
1024x1024 :   25355.14 MFlops    1.129000 sec
1152x1152 :   32169.25 MFlops    1.267000 sec
1280x1280 :   33987.89 MFlops    1.645000 sec
1408x1408 :   37832.39 MFlops    1.967000 sec
1536x1536 :   36293.24 MFlops    2.662000 sec
1664x1664 :   40050.35 MFlops    3.067000 sec
1792x1792 :   41009.69 MFlops    3.741000 sec
1920x1920 :   45175.12 MFlops    4.177000 sec
2048x2048 :   43463.21 MFlops    5.269000 sec
```

With the patch:

```
SIZE             Flops            Time
 128x128  :    1694.24 MFlops    0.033000 sec
 256x256  :    7214.20 MFlops    0.062000 sec
 384x384  :   11182.01 MFlops    0.135000 sec
 512x512  :   11848.49 MFlops    0.302000 sec
 640x640  :   18991.19 MFlops    0.368000 sec
 768x768  :   24153.15 MFlops    0.500000 sec
 896x896  :   30247.88 MFlops    0.634000 sec
1024x1024 :   30260.00 MFlops    0.946000 sec
1152x1152 :   42412.53 MFlops    0.961000 sec
1280x1280 :   42324.05 MFlops    1.321000 sec
1408x1408 :   41924.68 MFlops    1.775000 sec
1536x1536 :   35467.18 MFlops    2.724000 sec
1664x1664 :   40931.17 MFlops    3.001000 sec
1792x1792 :   41318.94 MFlops    3.713000 sec
1920x1920 :   45755.70 MFlops    4.124000 sec
2048x2048 :   42813.17 MFlops    5.349000 sec
```
Looking deeper at the benchmark script:
- an additional chol() (DGETRF+DPOTRF) is there, but otherwise it is the same O(n^3) drill-down as solve/eig.
- apart from those few BLAS/LAPACK functions (interrupted by the single-threaded gc()), the rest is single-threaded, which was likely the best option back then. The summary result will be worse on a 30-core 1 GHz CPU than on a 10-core 3 GHz CPU.
I wonder if this can be fixed without bringing back the original problem, perhaps by making it conditional on the fraction of the total threads available (e.g. run with num_threads(num) only if num is less than half the CPU count)? I guess this would introduce some weird new crossover points where performance suddenly changes for no apparent reason...
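As a rough illustration of that idea (a hypothetical sketch, not OpenBLAS code; `exec_threads` and the 50% threshold are made up for the example):

```c
#include <omp.h>

/* Only pin the team size when the request is well below the machine total:
   small workloads then avoid full-width synchronization overhead, while
   near-full requests keep reusing the existing thread pool. */
static void exec_threads(int num, void (*fn)(int)) {
    if (num < omp_get_max_threads() / 2) {
        #pragma omp parallel for num_threads(num)
        for (int i = 0; i < num; i++)
            fn(i);
    } else {
        #pragma omp parallel for
        for (int i = 0; i < num; i++)
            fn(i);
    }
}
```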
That is the other problem, with threading thresholds. Yes, lots of time is wasted when the heuristics there fail and spin up too many threads. You can actually see that the 128x128 sample is slightly slower after the change; if you set OPENBLAS_NUM_THREADS=1 it gets faster, roughly the same in OMP and PTH, fix applied or not. This fix just emphasizes another long-hidden problem, in a different place.
A characteristic part, measured on a Sandy Bridge 2-core non-HT NUC, pre-fix, pthread build, dgemm.R, recent versions:

```
SIZE            Flops            Time
   128x128 :   4194.30 MFlops    0.001000 sec
!! 256x256 :   3050.40 MFlops    0.011000 sec
   384x384 :  28311.55 MFlops    0.004000 sec
```

With thr=1:

```
   128x128 :   4194.30 MFlops    0.001000 sec
   256x256 :  16777.22 MFlops    0.002000 sec
   384x384 :  18874.37 MFlops    0.006000 sec
```
EDIT: I doubted the initial measurements as they were only a summary, not raw data, so I asked for a known, consistent benchmark to be used. The more precise tool turned up an even more favourable result than the initial assessment suggested.
@martin-frbg Apart from benchmarks/scripts/R/deig.R, which other benchmarks do I need to verify? Since deig.R does not seem stable enough on its own, I increased the loop count to 20 and collected the following data on Ice Lake (From=128, To=384, Step=128, Loops=20).

Without the patch (Flops in MFlops, Time in seconds):
| SIZE | round 0 Flops | round 0 Time | round 1 Flops | round 1 Time | round 2 Flops | round 2 Time | round 3 Flops | round 3 Time | round 4 Flops | round 4 Time | mean Flops | mean Time | rsd Flops | rsd Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 128x128 | 2167.06 | 0.516 | 2188.26 | 0.511 | 2129.91 | 0.525 | 2101.88 | 0.532 | 2146.26 | 0.521 | 2146.674 | 0.521 | 1.55% | 1.55% |
| 256x256 | 457.25 | 19.564 | 432.09 | 20.703 | 458.37 | 19.516 | 438.19 | 20.415 | 456.18 | 19.61 | 448.416 | 19.9616 | 2.75% | 2.78% |
| 384x384 | 422.31 | 71.492 | 403.81 | 74.766 | 426.01 | 70.871 | 406.3 | 74.308 | 423.49 | 71.292 | 416.384 | 72.5458 | 2.51% | 2.53% |
With the patch (Flops in MFlops, Time in seconds):

| SIZE | round 0 Flops | round 0 Time | round 1 Flops | round 1 Time | round 2 Flops | round 2 Time | round 3 Flops | round 3 Time | round 4 Flops | round 4 Time | mean Flops | mean Time | rsd Flops | rsd Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 128x128 | 2733.99 | 0.409 | 2675.12 | 0.418 | 2463 | 0.454 | 2564.68 | 0.436 | 2733.99 | 0.409 | 2774.69 | 0.403 | 4.25% | 4.84% |
| 256x256 | 5850.63 | 1.529 | 5591.01 | 1.6 | 5305.82 | 1.686 | 5580.54 | 1.603 | 5756.51 | 1.554 | 5893.02 | 1.518 | 3.53% | 3.95% |
| 384x384 | 8569.81 | 3.523 | 8890.29 | 3.396 | 7995.61 | 3.776 | 8521.43 | 3.543 | 8562.52 | 3.526 | 8398.17 | 3.595 | 3.84% | 3.84% |
With the patch vs. without the patch (Flops speedup):

| SIZE | Speedup |
|---|---|
| 128x128 | 1.29x |
| 256x256 | 13.14x |
| 384x384 | 20.17x |
Without this patch, the larger the size, the longer it takes to complete, so I only picked three sizes. The patch clearly improves this benchmark. I am not sure which other benchmarks I need to verify with the patch.
We found a similar slowdown in the EasyBuild community, using numpy and svd (https://gist.githubusercontent.com/markus-beuckelmann/8bc25531b11158431a5b09a45abd6276/raw/660904cb770197c3c841ab9b7084657b1aea5f32/numpy-benchmark.py).
In the end, the problem is that both #2775 and this fix are right, depending on the use case and the compiler used.
When you do an SVD via dgesdd, a lot of BLAS functions are called with wildly varying numbers of threads. If you vary num_threads across parallel regions, there is huge overhead, particularly with libgomp (GCC).
Here's an example for DGESDD (via https://github.com/mpimd-csc/flexiblas/issues/7#issuecomment-712203015)
https://gist.github.com/0a2e1783e68b5aca8b69e0947c833082
From bad to worse:

```
$ gfortran -lopenblas -O2 -fopenmp test_dgesdd.f90 -o test_dgesdd
$ OMP_NUM_THREADS=2 OMP_PROC_BIND=true ./test_dgesdd
Time = 1.1620110740000000
$ OMP_NUM_THREADS=64 OMP_PROC_BIND=true ./test_dgesdd
Time = 7.8668489140000002
$ OMP_NUM_THREADS=64 OMP_PROC_BIND=false ./test_dgesdd
Time = 42.826767384000000
```
Using a little `printf("%ld ", num);` above the `#pragma omp parallel for...` I found that this single dgesdd call invokes the parallel region 16228 times and switches the number of threads 7900 times. A small sample looks like `1 3 1 4 1 4 1 4 1 4 1 5 2 5 2 5 2 5 2 6 2 6 2 6` and `64 64 8 64 8 64 64 64 64 8 64 8 64` (https://gist.github.com/c367d0bf460ed385b2d994fcee5723e6).
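The instrumentation amounts to something like the following (a sketch; the real dispatch loop lives in OpenBLAS's blas_server_omp.c, and the names here are illustrative):

```c
#include <stdio.h>

/* Sketch of the instrumented dispatch: log the requested team size just
   before each parallel region is entered, so changes become visible. */
static void exec_region(long num, void (*routine)(long)) {
    printf("%ld ", num);   /* trace how often the team size switches */

    #pragma omp parallel for num_threads(num)
    for (long i = 0; i < num; i++)
        routine(i);
}
```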
After a Google search, and inspired by https://stackoverflow.com/questions/24440118/openmp-parallell-region-overhead-increase-when-num-threads-varies, I adapted that test (https://gist.github.com/cb7b050f6d1f5a3893df4a1352714668), ran it on a node with 64 cores, and watched the huge overhead:
```
$ gcc -O2 -fopenmp test.c -o test
$ OMP_PROC_BIND=true ./test
2 threads             137.097901
64 threads           4427.660024
2/64 alternating   175142.989028
```
Intel (or clang, for that matter) doesn't have this issue to the same extent, but it also gets no speedup for 2 threads:
```
$ icc -O2 -fopenmp test.c -o test
$ OMP_PROC_BIND=true ./test
2 threads            4647.016525
64 threads           4935.026169
2/64 alternating    10691.881180
```
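For reference, the adapted test is essentially the following (my reconstruction; the iteration counts and the timed body are illustrative, and the linked gist is the authoritative version):

```c
#include <omp.h>
#include <stdio.h>

#define ITERS 10000
static volatile double sink;   /* keeps the compiler from removing the work */

/* Time ITERS parallel regions whose team size cycles through 'teams'. */
static double bench_us(const int *teams, int n) {
    double t0 = omp_get_wtime();
    for (int it = 0; it < ITERS; it++) {
        double s = 0.0;
        #pragma omp parallel for reduction(+:s) num_threads(teams[it % n])
        for (int i = 0; i < 1000; i++)
            s += i * 0.5;
        sink = s;
    }
    return (omp_get_wtime() - t0) * 1e6;   /* microseconds */
}

int main(void) {
    int two[] = {2}, full[] = {64}, alt[] = {2, 64};
    printf("2 threads        %f\n", bench_us(two, 1));
    printf("64 threads       %f\n", bench_us(full, 1));
    printf("2/64 alternating %f\n", bench_us(alt, 2));
    return 0;
}
```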
I'm not quite sure what the best solution is. Clearly, if the number of threads stays low, there is a performance benefit (with libgomp): in this case about 32x, which happens to be exactly 64/2, so roughly linear in the number of threads.
But switching too often is catastrophic, by a factor of 40 or so.
Perhaps a heuristic that could work is to keep track of, say, the latest 32 OpenMP regions and set num_threads to the maximum `num` among those regions. Then, if you get 32 low-threaded calls in a row, you can still take advantage of the performance benefit.
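A minimal sketch of that heuristic (hypothetical, not OpenBLAS code; it assumes the dispatcher is called from a single thread):

```c
/* Remember the last WINDOW requested team sizes and size the next parallel
   region to the maximum of the window, so the team only shrinks after a
   sustained run of small requests. */
#define WINDOW 32

static int history[WINDOW];   /* zero-initialized */
static int pos;

static int adaptive_num_threads(int requested) {
    history[pos] = requested;
    pos = (pos + 1) % WINDOW;

    int max = 1;
    for (int i = 0; i < WINDOW; i++)
        if (history[i] > max)
            max = history[i];
    return max;
}
```

The obvious downside is the one noted above: a single wide call keeps the team pinned at full width for the next 32 regions.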
Thanks. I have not found a solution I like; keeping track of past behaviour adds its own overhead and is not particularly good at predicting the future unless the program really does the same computation over and over. Perhaps something as trivial as introducing a new environment variable "OPENBLAS_ADAPTIVE" to choose between pre- and post-#2775 behaviour at startup would already help?
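Something like this at library initialization (a sketch of the proposal; "OPENBLAS_ADAPTIVE" is the name suggested above, and whatever was eventually merged in #3703 may differ):

```c
#include <stdlib.h>
#include <string.h>

/* Choose the dispatch mode once at startup: pre-#2775 (reuse the full
   OpenMP pool) or post-#2775 (pin the team size per call). */
static int use_num_threads_clause;   /* 0 = pre-#2775, 1 = post-#2775 */

static void init_thread_mode(void) {
    const char *env = getenv("OPENBLAS_ADAPTIVE");
    use_num_threads_clause = (env != NULL && strcmp(env, "1") == 0);
}
```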
Attempting to supersede this with #3703, using a new environment variable to choose between the two modes of operation.
Since #3703 has been merged, can this PR now be closed?