Incorrect result with `cblas_dgemv` vs reference netlib and other libraries
We recently switched to testing with OpenBLAS on a project and are noticing some test-case failures due to a matrix-vector multiplication returning an incorrect result.
This issue has been observed on a variety of platforms (Ubuntu 22.04, RHEL7, RHEL9, MSYS2 mingw), a variety of compilers (clang-15, mingw-13, gcc-12, gcc-11, gcc-9), a variety of OpenBLAS versions (0.3.3, 0.3.20, 0.3.21, 0.3.24), and a variety of CPUs:
- Intel(R) Xeon(R) CPU E5-4650 0 @ 2.70GHz
- Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz
- Intel(R) Core(TM) i7-9850H CPU @ 2.60GHz
Reproduction
I have attached a minimal reproducible example (in C++) showing the problem.
Reproduction Code
#include <cblas.h>
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fstream>

constexpr static size_t size = 16;

// Read a size*size matrix A followed by a size-element vector b from a raw binary file.
void get_values(double* A, double* b, const char* binfile) {
    std::fstream binaryReader;
    binaryReader.open(binfile, std::ios::in | std::ios::binary);
    assert(binaryReader.is_open());
    for (size_t i = 0; i < (size * size); i++) {
        binaryReader.read(reinterpret_cast<char*>(&A[i]), sizeof(double));
    }
    for (size_t i = 0; i < size; i++) {
        binaryReader.read(reinterpret_cast<char*>(&b[i]), sizeof(double));
    }
}

// Naive reference implementation: c += A * b (row-major).
void matrixMultiply(const double* A, const double* b, double* c) {
    const size_t N = size;
    for (size_t i = 0; i < N; ++i) {
        for (size_t j = 0; j < N; ++j) {
            c[i] += A[(i * N) + j] * b[j];
        }
    }
}

// BLAS equivalent: c = 1.0 * A * b + 1.0 * c.
void CBlasMatrixMultiply(const double* a, const double* b, double* c) {
    int32_t M = size;
    int32_t N = size;
    cblas_dgemv(CblasRowMajor,
                CblasNoTrans,
                M, N,
                1.0, a, N,
                b, 1,
                1.0, c, 1);
}

int main(int argc, char* argv[]) {
    auto* A = new double[256]();
    auto* b = new double[16]();
    const char* binfile = argc > 1 ? argv[1] : BIN_FILE;
    get_values(A, b, binfile);

    auto* c_blas = new double[16]();
    CBlasMatrixMultiply(A, b, c_blas);

    auto* c_mat = new double[16]();
    matrixMultiply(A, b, c_mat);

    printf(" BLAS MAT\n");
    for (size_t i = 0; i < size; i++) {
        printf("%0.18e\t%0.18e\n", c_blas[i], c_mat[i]);
    }

    // Cross-check the last row with a plain dot product.
    auto c_ddot = cblas_ddot(size, A + (size * 15), 1, b, 1);
    printf("ddot: %0.18e\n", c_ddot);
    printf("dgemv: %0.18e\n", c_blas[15]);
    printf("ddot == dgemv? %s\n", c_ddot == c_blas[15] ? "YES" : "NO");
    printf("dgemv is 0.0? %s\n", c_blas[15] == 0.0 ? "YES" : "NO");

    delete[] A;
    delete[] b;
    delete[] c_blas;
    delete[] c_mat;
    return 0;
}
Compile this code with:
g++ -o blas_test blas_test.cpp -DBIN_FILE=\"path/to/bin\" $(pkg-config --libs --cflags openblas)
And observe the following output:
OpenBLAS result
BLAS MAT
1.175201193643801378e+00 1.175201193643801822e+00
1.103638323514327002e+00 1.103638323514327224e+00
3.578143506473725477e-01 3.578143506473724922e-01
7.045563366848892062e-02 7.045563366848883735e-02
9.965128148869371871e-03 9.965128148869309421e-03
1.099586127207577424e-03 1.099586127207556390e-03
9.945433911373591229e-05 9.945433911360671605e-05
7.620541308983597162e-06 7.620541308896932986e-06
5.064719745540013918e-07 5.064719744437437483e-07
2.971814122565419325e-08 2.971814140421127963e-08
1.560886642160141946e-09 1.560886564391797127e-09
7.419920233786569952e-11 7.419902813631537848e-11
3.221201083647429186e-12 3.221406076314455686e-12
1.292299600663682213e-13 1.289813102624490508e-13
4.440892098500626162e-15 4.440892098500626162e-15
0.000000000000000000e+00 -4.440892098500626162e-16
ddot: 0.000000000000000000e+00
dgemv: 0.000000000000000000e+00
ddot == dgemv? YES
dgemv is 0.0? YES
It is worth noting that only the last value differs by more than acceptable numerical precision; every other value agrees to within 1e-16. Furthermore, a value of exactly 0.0 is, in itself, suspicious, as there is no realistic way the true result could come out to exactly zero.
Change the compile command to:
# cblas here is netlib
g++ -o blas_test blas_test.cpp -DBIN_FILE=\"path/to/bin\" $(pkg-config --libs --cflags cblas)
and observe this result:
netlib result
BLAS MAT
1.175201193643801822e+00 1.175201193643801822e+00
1.103638323514327224e+00 1.103638323514327224e+00
3.578143506473724922e-01 3.578143506473724922e-01
7.045563366848883735e-02 7.045563366848883735e-02
9.965128148869309421e-03 9.965128148869309421e-03
1.099586127207556390e-03 1.099586127207556390e-03
9.945433911360671605e-05 9.945433911360671605e-05
7.620541308896932986e-06 7.620541308896932986e-06
5.064719744437437483e-07 5.064719744437437483e-07
2.971814140421127963e-08 2.971814140421127963e-08
1.560886564391797127e-09 1.560886564391797127e-09
7.419902813631537848e-11 7.419902813631537848e-11
3.221406076314455686e-12 3.221406076314455686e-12
1.289813102624490508e-13 1.289813102624490508e-13
4.440892098500626162e-15 4.440892098500626162e-15
-4.440892098500626162e-16 -4.440892098500626162e-16
ddot: -4.440892098500626162e-16
dgemv: -4.440892098500626162e-16
ddot == dgemv? YES
dgemv is 0.0? NO
Here is the binary file that contains a 16x16 matrix and a 16x1 vector: (NOTE: This is a binary data file, extension changed to make github happy) reproduction.txt
Other Notes
We have done extensive testing in other BLAS-like environments, and they all produce a result close to the expected -4e-16 and pass our test. Both MATLAB (2023a) and numpy (1.26 w/ MKL) return a result very close to what we expect. And, obviously, our naive matrix multiplication in the reproduction code gives the expected result as well.
The matrix in question is not overly ill-conditioned; it has a condition number of ~10.
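In case it helps, here is a minimal sketch of how that condition number can be checked, assuming a LAPACKE header and an OpenBLAS build with LAPACK support are available (that availability is an assumption; this is not part of our actual test suite):

// Hypothetical standalone check: estimate the 1-norm condition number of the
// 16x16 matrix stored at the start of the reproduction file.
#include <lapacke.h>
#include <cstdio>
#include <fstream>
#include <vector>

int main(int argc, char* argv[]) {
    if (argc < 2) return 1;
    const lapack_int n = 16;
    std::vector<double> A(n * n);
    std::ifstream in(argv[1], std::ios::binary);            // reproduction file
    in.read(reinterpret_cast<char*>(A.data()), sizeof(double) * A.size());

    // ||A||_1 of the original matrix, needed by dgecon.
    double anorm = LAPACKE_dlange(LAPACK_ROW_MAJOR, '1', n, n, A.data(), n);

    // dgecon works on the LU factorization, so factorize in place first.
    std::vector<lapack_int> ipiv(n);
    if (LAPACKE_dgetrf(LAPACK_ROW_MAJOR, n, n, A.data(), n, ipiv.data()) != 0) return 1;

    double rcond = 0.0;  // estimate of 1 / (||A||_1 * ||inv(A)||_1)
    if (LAPACKE_dgecon(LAPACK_ROW_MAJOR, '1', n, A.data(), n, anorm, &rcond) != 0) return 1;

    printf("estimated condition number: %g\n", 1.0 / rcond);
    return 0;
}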
Somewhat surprising, as the three CPU generations would be using different optimized implementations of the GEMV BLAS kernel.
At first glance this appears to be some FMA-related effect from letting the compiler use AVX instructions. It is possible to obtain the netlib result by building for TARGET=GENERIC, but if I reconfigure any TARGET to use the same unoptimized, plain-C GEMV kernel without changing the compiler options in Makefile.x86_64, I end up with an "intermediate" result, although there is no other BLAS function in the call graph. (Unfortunately, removing -mavx breaks some code that uses intrinsics, so it will take more time to confirm that suspicion.)
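To see what such an FMA effect looks like in isolation, here is a small standalone snippet (not OpenBLAS code, just an illustration of the rounding behaviour): fma(a, b, c) rounds a*b + c once, while the unfused expression rounds the product and then the sum, so the two can differ in the lowest bits of the significand.

#include <cmath>
#include <cstdio>

int main() {
    double a = 1.0 + 1e-8;   // chosen so that a * b lands between two
    double b = 1.0 - 1e-8;   // representable doubles near 1.0
    double c = -1.0;

    double fused   = std::fma(a, b, c);  // one rounding of a*b + c
    double unfused = a * b + c;          // a*b rounded, then the sum rounded

    printf("fused   : %.18e\n", fused);
    printf("unfused : %.18e\n", unfused);
    return 0;
}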
It is one bit of precision off, a very normal occurrence when computing in a different order.
I don't think that's right: the expected value is roughly -2^-51, and machine precision should be 2^-53. Even if that were "one bit off", the result here is exactly 0.0. Surely that's a bug.
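For what it's worth, the expected netlib value is exactly -2^-51, i.e. -2 * DBL_EPSILON:

#include <cfloat>
#include <cmath>
#include <cstdio>

int main() {
    // Both lines print -4.440892098500626162e-16, the netlib/naive result.
    printf("-2^-51       : %.18e\n", -std::ldexp(1.0, -51));
    printf("-2 * epsilon : %.18e\n", -2.0 * DBL_EPSILON);
    return 0;
}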
If you do:
auto* c_blas = new double[16]();
c_blas[15] = 1.0e-18;
CBlasMatrixMultiply(A, b, c_blas);
printf("%s", c_blas[15] == 1.0e-18 ? "YES ": "NO");
Then it will return YES. It's not even changing the value. Our initial thought was that this was some sort of memory alignment issue due to the matrix size. We couldn't fix it with manual alignment to 8 bytes though.
You are exercising machine rounding 32 times (or 16 times with FMA); allow for about 5 bits of slack in your check.
It is not a magic symbolic-computation soup that gives an exact result every time.
I do not expect an identical result, but this result is exactly the starting value of c_blas[15], every time, for every value. 32 (or 16) floating point operations and they all cancel out? Every time? For any starting value? Exactly? On multiple CPUs? This is the puzzle here, not why it's not exactly matching netlib/MKL.
The result is rounded to output precision when stored into a register after every one or two FLOPs, going up or down with roughly equal probability, and that lottery continues until the end of the computation. Yes, workload splitting affects the result.
Just a guess: Intel probably uses generic code for small inputs, then gradually jumps to vector code and adds CPU threads as the input grows. OpenBLAS always uses vector code and switches to all CPUs at one point.
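To make the order-of-accumulation point concrete, here is a standalone sketch (not OpenBLAS code, arbitrary data): a single running sum versus four interleaved partial sums combined at the end, roughly what a 4-wide SIMD kernel does. The two dot products are usually equal to within a couple of ulps, but they need not be bit-identical.

#include <cstdio>
#include <cstdlib>

int main() {
    double a[16], b[16];
    srand(42);                                  // arbitrary but reproducible data
    for (int i = 0; i < 16; ++i) {
        a[i] = (double)rand() / RAND_MAX - 0.5;
        b[i] = (double)rand() / RAND_MAX - 0.5;
    }

    double serial = 0.0;                        // one accumulator, left to right
    for (int i = 0; i < 16; ++i)
        serial += a[i] * b[i];

    double lane[4] = {0.0, 0.0, 0.0, 0.0};      // four interleaved accumulators
    for (int i = 0; i < 16; ++i)
        lane[i % 4] += a[i] * b[i];
    double vectorised = (lane[0] + lane[1]) + (lane[2] + lane[3]);

    printf("serial     : %.18e\n", serial);
    printf("vectorised : %.18e\n", vectorised);
    printf("identical? : %s\n", serial == vectorised ? "YES" : "NO");
    return 0;
}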
> Then it will return YES. It's not even changing the value. Our initial thought was that this was some sort of memory alignment issue due to the matrix size. We couldn't fix it with manual alignment to 8 bytes though.
As far as I can tell, all terms cancel with the "right" evaluation order, and y[15] evaluates to "exact" zero within the limits of precision. As this is added to what the c_blas array initially contained (the "beta times y" part of the GEMV equation, beta being one in your case), you see no change. There appears to be loss of precision in the AVX2 "microkernel" used for Haswell and newer, due to operand size limitations in the instruction set. Certainly not ideal, but not a catastrophic failure either (that would certainly have shown up in the testsuites during the almost ten years this code has been in place).
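For reference, DGEMV computes

$$y \leftarrow \alpha A x + \beta y$$

and the reproduction code calls it with alpha = beta = 1.0, so whatever the kernel computes for row 15 of A*x is simply added to whatever c_blas[15] already contained. If that contribution rounds to exactly 0.0, the entry is left bit-for-bit unchanged, which is consistent with the 1.0e-18 experiment above.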
> There appears to be loss of precision in the AVX2 "microkernel" used for Haswell and newer due to operand size limitations in the instruction set. Certainly not ideal, but not a catastrophic failure either (which would certainly have shown up in testsuites during the almost ten years this code has been in place)
I can't understand this conclusion given that I can reproduce this with an OPENBLAS_CORETYPE env that doesn't seem to use AVX2. Am I misunderstanding what this env does? Is AVX2 always checked even when an older CPU is manually set?
Here's what I'm testing on:
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 6
On-line CPU(s) list: 0-5
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz
CPU family: 6
Model: 85
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 6
Stepping: 0
BogoMIPS: 5786.40
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xsaves arat pku ospke md_clear flush_l1d arch_capabilities
Virtualization features:
Hypervisor vendor: VMware
Virtualization type: full
Caches (sum of all):
L1d: 192 KiB (6 instances)
L1i: 192 KiB (6 instances)
L2: 6 MiB (6 instances)
L3: 132 MiB (6 instances)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-5
Vulnerabilities:
Gather data sampling: Unknown: Dependent on hypervisor status
Itlb multihit: KVM: Mitigation: VMX unsupported
L1tf: Mitigation; PTE Inversion
Mds: Mitigation; Clear CPU buffers; SMT Host state unknown
Meltdown: Mitigation; PTI
Mmio stale data: Mitigation; Clear CPU buffers; SMT Host state unknown
Retbleed: Mitigation; IBRS
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer
sanitization
Spectre v2: Mitigation; IBRS, IBPB conditional, STIBP disabled, RSB
filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
This is on Ubuntu 22.04, with the libopenblas-pthread-dev package. I can also recreate this with OpenBLAS 0.3.24 on MSYS2, but that machine is Coffee Lake and can't use SkylakeX instructions.
I'm using OPENBLAS_VERBOSE=5 to print the detected core type at the top of each run.
OPENBLAS_CORETYPE=Nehalem (No AVX2)
Core: Nehalem
BLAS MAT
1.175201193643801378e+00 1.175201193643801822e+00
1.103638323514327002e+00 1.103638323514327224e+00
3.578143506473726032e-01 3.578143506473724922e-01
7.045563366848886511e-02 7.045563366848883735e-02
9.965128148869364932e-03 9.965128148869309421e-03
1.099586127207521913e-03 1.099586127207556390e-03
9.945433911356243994e-05 9.945433911360671605e-05
7.620541308817063708e-06 7.620541308896932986e-06
5.064719744429790893e-07 5.064719744437437483e-07
2.971814136443207133e-08 2.971814140421127963e-08
1.560886614404566330e-09 1.560886564391797127e-09
7.419920233786569952e-11 7.419902813631537848e-11
3.221423128252354218e-12 3.221406076314455686e-12
1.287858708565181587e-13 1.289813102624490508e-13
4.440892098500626162e-15 4.440892098500626162e-15
0.000000000000000000e+00 -4.440892098500626162e-16
ddot: 0.000000000000000000e+00
dgemv: 0.000000000000000000e+00
ddot == dgemv? YES
dgemv is 0.0? YES
OPENBLAS_CORETYPE=SandyBridge (No AVX2)
Core: Sandybridge
BLAS MAT
1.175201193643801378e+00 1.175201193643801822e+00
1.103638323514327002e+00 1.103638323514327224e+00
3.578143506473726032e-01 3.578143506473724922e-01
7.045563366848886511e-02 7.045563366848883735e-02
9.965128148869364932e-03 9.965128148869309421e-03
1.099586127207521913e-03 1.099586127207556390e-03
9.945433911356243994e-05 9.945433911360671605e-05
7.620541308817063708e-06 7.620541308896932986e-06
5.064719744429790893e-07 5.064719744437437483e-07
2.971814136443207133e-08 2.971814140421127963e-08
1.560886614404566330e-09 1.560886564391797127e-09
7.419920233786569952e-11 7.419902813631537848e-11
3.221423128252354218e-12 3.221406076314455686e-12
1.287858708565181587e-13 1.289813102624490508e-13
4.440892098500626162e-15 4.440892098500626162e-15
0.000000000000000000e+00 -4.440892098500626162e-16
ddot: 0.000000000000000000e+00
dgemv: 0.000000000000000000e+00
ddot == dgemv? YES
dgemv is 0.0? YES
OPENBLAS_CORETYPE=Haswell
Core: Haswell
BLAS MAT
1.175201193643801378e+00 1.175201193643801822e+00
1.103638323514327002e+00 1.103638323514327224e+00
3.578143506473725477e-01 3.578143506473724922e-01
7.045563366848892062e-02 7.045563366848883735e-02
9.965128148869371871e-03 9.965128148869309421e-03
1.099586127207577424e-03 1.099586127207556390e-03
9.945433911373591229e-05 9.945433911360671605e-05
7.620541308983597162e-06 7.620541308896932986e-06
5.064719745540013918e-07 5.064719744437437483e-07
2.971814122565419325e-08 2.971814140421127963e-08
1.560886642160141946e-09 1.560886564391797127e-09
7.419920233786569952e-11 7.419902813631537848e-11
3.221201083647429186e-12 3.221406076314455686e-12
1.292299600663682213e-13 1.289813102624490508e-13
4.440892098500626162e-15 4.440892098500626162e-15
0.000000000000000000e+00 -4.440892098500626162e-16
ddot: 0.000000000000000000e+00
dgemv: 0.000000000000000000e+00
ddot == dgemv? YES
dgemv is 0.0? YES
OPENBLAS_CORETYPE=SkylakeX (AVX512)
Core: SkylakeX
BLAS MAT
1.175201193643801378e+00 1.175201193643801822e+00
1.103638323514327002e+00 1.103638323514327224e+00
3.578143506473725477e-01 3.578143506473724922e-01
7.045563366848892062e-02 7.045563366848883735e-02
9.965128148869371871e-03 9.965128148869309421e-03
1.099586127207577424e-03 1.099586127207556390e-03
9.945433911373591229e-05 9.945433911360671605e-05
7.620541308983597162e-06 7.620541308896932986e-06
5.064719745540013918e-07 5.064719744437437483e-07
2.971814122565419325e-08 2.971814140421127963e-08
1.560886642160141946e-09 1.560886564391797127e-09
7.419920233786569952e-11 7.419902813631537848e-11
3.221201083647429186e-12 3.221406076314455686e-12
1.292299600663682213e-13 1.289813102624490508e-13
4.440892098500626162e-15 4.440892098500626162e-15
0.000000000000000000e+00 -4.440892098500626162e-16
ddot: -8.881784197001252323e-16
dgemv: 0.000000000000000000e+00
ddot == dgemv? NO
dgemv is 0.0? YES
In my testing, I'm seeing that neither AVX nor AVX2 matters, as Nehalem and SandyBridge give identical results. AVX2 on Haswell gives a slightly different result, but one that is well within machine precision. The biggest difference I see is with SkylakeX and, presumably, AVX-512, which gives a different result for ddot only. That value is "correct" to us, as it is within our test tolerance and matches very closely what other BLAS libraries give.
Please help me understand how the AVX2 microkernel is the issue here. If OPENBLAS_CORETYPE does not actually control this, does OpenBLAS provide a way to disable AVX2 at runtime?
It will fall back to older compute kernels if AVX2 is not present in CPUID. The difference is in the 1-2 lowest bits of the significand and is expected. If you want the same result every time, you have to use the unoptimized netlib build.