Incorrect result with `cblas_dgemv` vs reference netlib and other libraries
We recently switched to testing with OpenBLAS on a project and are noticing some test-case failures due to a matrix-vector multiplication returning an incorrect result.
This issue has been observed on a variety of platforms (Ubuntu 22.04, RHEL7, RHEL9, MSYS2 mingw), a variety of compilers (clang-15, mingw-13, gcc-12, gcc-11, gcc-9), a variety of OpenBLAS versions (0.3.3, 0.3.20, 0.3.21, 0.3.24), and a variety of CPUs:
- Intel(R) Xeon(R) CPU E5-4650 0 @ 2.70GHz
- Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz
- Intel(R) Core(TM) i7-9850H CPU @ 2.60GHz
Reproduction
I have attached a minimal reproducible example (in C++) showing the problem.
Reproduction Code
#include <cblas.h>
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fstream>

constexpr static size_t size = 16;

// Read a size*size matrix A followed by a size-element vector b from a raw binary file.
void get_values(double* A, double* b, const char* binfile) {
    std::fstream binaryReader;
    binaryReader.open(binfile, std::ios::in | std::ios::binary);
    assert(binaryReader.is_open());
    for (size_t i = 0; i < (size * size); i++) {
        binaryReader.read(reinterpret_cast<char*>(&A[i]), sizeof(double));
    }
    for (size_t i = 0; i < size; i++) {
        binaryReader.read(reinterpret_cast<char*>(&b[i]), sizeof(double));
    }
}

// Naive reference implementation: c += A * b (row-major).
void matrixMultiply(const double* A, const double* b, double* c) {
    const size_t N = size;
    for (size_t i = 0; i < N; ++i) {
        for (size_t j = 0; j < N; ++j) {
            c[i] += A[(i * N) + j] * b[j];
        }
    }
}

// BLAS equivalent: c = 1.0 * A * b + 1.0 * c.
void CBlasMatrixMultiply(const double* a, const double* b, double* c) {
    int32_t M = size;
    int32_t N = size;
    cblas_dgemv(CblasRowMajor,
                CblasNoTrans,
                M, N,
                1.0, a, N,
                b, 1,
                1.0, c, 1);
}

int main(int argc, char* argv[]) {
    auto* A = new double[256]();
    auto* b = new double[16]();
    const char* binfile = argc > 1 ? argv[1] : BIN_FILE;
    get_values(A, b, binfile);

    auto* c_blas = new double[16]();
    CBlasMatrixMultiply(A, b, c_blas);

    auto* c_mat = new double[16]();
    matrixMultiply(A, b, c_mat);

    printf(" BLAS MAT\n");
    for (size_t i = 0; i < size; i++) {
        printf("%0.18e\t%0.18e\n", c_blas[i], c_mat[i]);
    }

    // Cross-check the last row with a plain dot product.
    auto c_ddot = cblas_ddot(size, A + (size * 15), 1, b, 1);
    printf("ddot: %0.18e\n", c_ddot);
    printf("dgemv: %0.18e\n", c_blas[15]);
    printf("ddot == dgemv? %s\n", c_ddot == c_blas[15] ? "YES" : "NO");
    printf("dgemv is 0.0? %s\n", c_blas[15] == 0.0 ? "YES" : "NO");

    delete[] A;
    delete[] b;
    delete[] c_blas;
    delete[] c_mat;
    return 0;
}
Compile this code with:
g++ -o blas_test blas_test.cpp -DBIN_FILE=\"path/to/bin\" $(pkg-config --libs --cflags openblas)
And observe the following output:
OpenBLAS result
BLAS MAT
1.175201193643801378e+00 1.175201193643801822e+00
1.103638323514327002e+00 1.103638323514327224e+00
3.578143506473725477e-01 3.578143506473724922e-01
7.045563366848892062e-02 7.045563366848883735e-02
9.965128148869371871e-03 9.965128148869309421e-03
1.099586127207577424e-03 1.099586127207556390e-03
9.945433911373591229e-05 9.945433911360671605e-05
7.620541308983597162e-06 7.620541308896932986e-06
5.064719745540013918e-07 5.064719744437437483e-07
2.971814122565419325e-08 2.971814140421127963e-08
1.560886642160141946e-09 1.560886564391797127e-09
7.419920233786569952e-11 7.419902813631537848e-11
3.221201083647429186e-12 3.221406076314455686e-12
1.292299600663682213e-13 1.289813102624490508e-13
4.440892098500626162e-15 4.440892098500626162e-15
0.000000000000000000e+00 -4.440892098500626162e-16
ddot: 0.000000000000000000e+00
dgemv: 0.000000000000000000e+00
ddot == dgemv? YES
dgemv is 0.0? YES
It is worth noting that only the last value differs by more than acceptable numerical precision; every other value agrees to within 1e-16. Furthermore, a value of exactly 0.0 is, in itself, suspicious, as there is no realistic way the true result could come out to exactly zero.
Change the compile command to:
# cblas here is netlib
g++ -o blas_test blas_test.cpp -DBIN_FILE=\"path/to/bin\" $(pkg-config --libs --cflags cblas)
and observe this result:
netlib result
BLAS MAT
1.175201193643801822e+00 1.175201193643801822e+00
1.103638323514327224e+00 1.103638323514327224e+00
3.578143506473724922e-01 3.578143506473724922e-01
7.045563366848883735e-02 7.045563366848883735e-02
9.965128148869309421e-03 9.965128148869309421e-03
1.099586127207556390e-03 1.099586127207556390e-03
9.945433911360671605e-05 9.945433911360671605e-05
7.620541308896932986e-06 7.620541308896932986e-06
5.064719744437437483e-07 5.064719744437437483e-07
2.971814140421127963e-08 2.971814140421127963e-08
1.560886564391797127e-09 1.560886564391797127e-09
7.419902813631537848e-11 7.419902813631537848e-11
3.221406076314455686e-12 3.221406076314455686e-12
1.289813102624490508e-13 1.289813102624490508e-13
4.440892098500626162e-15 4.440892098500626162e-15
-4.440892098500626162e-16 -4.440892098500626162e-16
ddot: -4.440892098500626162e-16
dgemv: -4.440892098500626162e-16
ddot == dgemv? YES
dgemv is 0.0? NO
Here is the binary file that contains a 16x16 matrix and a 16x1 vector: (NOTE: This is a binary data file, extension changed to make github happy) reproduction.txt
Other Notes
We have done extensive testing in other BLAS-like environments, and they all produce a result close to the expected -4e-16 and pass our test. Both MATLAB (2023a) and numpy (1.26 w/ MKL) return a result very close to what we expect. And, obviously, our naive matrix multiplication in the reproduction code gives the expected result as well.
The matrix in question is not overly ill-conditioned; it has a condition number of ~10.
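In case it helps, here is a minimal sketch of how that condition number can be checked, assuming a LAPACKE header and an OpenBLAS build with LAPACK support are available (that availability is an assumption; this is not part of our actual test suite):

// Hypothetical standalone check: estimate the 1-norm condition number of the
// 16x16 matrix stored at the start of the reproduction file.
#include <lapacke.h>
#include <cstdio>
#include <fstream>
#include <vector>

int main(int argc, char* argv[]) {
    if (argc < 2) return 1;
    const lapack_int n = 16;
    std::vector<double> A(n * n);
    std::ifstream in(argv[1], std::ios::binary);            // reproduction file
    in.read(reinterpret_cast<char*>(A.data()), sizeof(double) * A.size());

    // ||A||_1 of the original matrix, needed by dgecon.
    double anorm = LAPACKE_dlange(LAPACK_ROW_MAJOR, '1', n, n, A.data(), n);

    // dgecon works on the LU factorization, so factorize in place first.
    std::vector<lapack_int> ipiv(n);
    if (LAPACKE_dgetrf(LAPACK_ROW_MAJOR, n, n, A.data(), n, ipiv.data()) != 0) return 1;

    double rcond = 0.0;  // estimate of 1 / (||A||_1 * ||inv(A)||_1)
    if (LAPACKE_dgecon(LAPACK_ROW_MAJOR, '1', n, A.data(), n, anorm, &rcond) != 0) return 1;

    printf("estimated condition number: %g\n", 1.0 / rcond);
    return 0;
}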
Somewhat surprising, as the three CPU generations would be using different optimized implementations of the GEMV BLAS kernel.
At first glance this appears to be some FMA-related effect from letting the compiler use AVX instructions. It is possible to obtain the netlib result by building for TARGET=GENERIC, but if I reconfigure any TARGET to use the same unoptimized, plain-C GEMV kernel without changing the compiler options in Makefile.x86_64, I end up with an "intermediate" result, although there is no other BLAS function in the call graph. (Unfortunately, removing -mavx breaks some code that uses intrinsics, so it will take more time to confirm that suspicion.)
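To see what such an FMA effect looks like in isolation, here is a small standalone snippet (not OpenBLAS code, just an illustration of the rounding behaviour): fma(a, b, c) rounds a*b + c once, while the unfused expression rounds the product and then the sum, so the two can differ in the lowest bits of the significand.

#include <cmath>
#include <cstdio>

int main() {
    double a = 1.0 + 1e-8;   // chosen so that a * b lands between two
    double b = 1.0 - 1e-8;   // representable doubles near 1.0
    double c = -1.0;

    double fused   = std::fma(a, b, c);  // one rounding of a*b + c
    double unfused = a * b + c;          // a*b rounded, then the sum rounded

    printf("fused   : %.18e\n", fused);
    printf("unfused : %.18e\n", unfused);
    return 0;
}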
It is one bit of precision off, a very normal occurrence when computing in a different order.
I don't think that's right: the expected value is roughly -2^-51, and machine precision should be 2^-53. Even if that were "one bit off", the result here is exactly 0.0. Surely that's a bug.
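For what it's worth, the expected netlib value is exactly -2^-51, i.e. -2 * DBL_EPSILON:

#include <cfloat>
#include <cmath>
#include <cstdio>

int main() {
    // Both lines print -4.440892098500626162e-16, the netlib/naive result.
    printf("-2^-51       : %.18e\n", -std::ldexp(1.0, -51));
    printf("-2 * epsilon : %.18e\n", -2.0 * DBL_EPSILON);
    return 0;
}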
If you do:
auto* c_blas = new double[16]();
c_blas[15] = 1.0e-18;
CBlasMatrixMultiply(A, b, c_blas);
printf("%s", c_blas[15] == 1.0e-18 ? "YES ": "NO");
Then it will return YES. It's not even changing the value. Our initial thought was that this was some sort of memory alignment issue due to the matrix size. We couldn't fix it with manual alignment to 8 bytes though.
You are exercising machine rounding 32 times (or 16 times with FMA); allow for about 5 bits of slack in your check.
It is not a magic symbolic-computation soup that gives an exact result every time.
I do not expect an identical result, but this result is exactly the starting value of c_blas[15], every time, for every value. 32 (or 16) floating point operations and they all cancel out? Every time? For any starting value? Exactly? On multiple CPUs? This is the puzzle here, not why it's not exactly matching netlib/MKL.
The result is rounded to output precision when stored into a register after every one or two FLOPs, going up or down with roughly equal probability, and that lottery continues until the end of the computation. Yes, workload splitting affects the result.
Just a guess: Intel probably uses generic code for small inputs, then gradually jumps to vector code and adds CPU threads as the input grows. OpenBLAS always uses vector code and switches to all CPUs at one point.
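To make the order-of-accumulation point concrete, here is a standalone sketch (not OpenBLAS code, arbitrary data): a single running sum versus four interleaved partial sums combined at the end, roughly what a 4-wide SIMD kernel does. The two dot products are usually equal to within a couple of ulps, but they need not be bit-identical.

#include <cstdio>
#include <cstdlib>

int main() {
    double a[16], b[16];
    srand(42);                                  // arbitrary but reproducible data
    for (int i = 0; i < 16; ++i) {
        a[i] = (double)rand() / RAND_MAX - 0.5;
        b[i] = (double)rand() / RAND_MAX - 0.5;
    }

    double serial = 0.0;                        // one accumulator, left to right
    for (int i = 0; i < 16; ++i)
        serial += a[i] * b[i];

    double lane[4] = {0.0, 0.0, 0.0, 0.0};      // four interleaved accumulators
    for (int i = 0; i < 16; ++i)
        lane[i % 4] += a[i] * b[i];
    double vectorised = (lane[0] + lane[1]) + (lane[2] + lane[3]);

    printf("serial     : %.18e\n", serial);
    printf("vectorised : %.18e\n", vectorised);
    printf("identical? : %s\n", serial == vectorised ? "YES" : "NO");
    return 0;
}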
> Then it will return YES. It's not even changing the value. Our initial thought was that this was some sort of memory alignment issue due to the matrix size. We couldn't fix it with manual alignment to 8 bytes though.
As far as I can tell, all terms cancel with the "right" evaluation order, and y[15] evaluates to "exact" zero within the limits of precision. As this is added to what the c_blas array initially contained (the "beta times y" part of the GEMV equation, beta being one in your case), you see no change. There appears to be loss of precision in the AVX2 "microkernel" used for Haswell and newer, due to operand size limitations in the instruction set. Certainly not ideal, but not a catastrophic failure either (that would certainly have shown up in the testsuites during the almost ten years this code has been in place).
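For reference, DGEMV computes

$$y \leftarrow \alpha A x + \beta y$$

and the reproduction code calls it with alpha = beta = 1.0, so whatever the kernel computes for row 15 of A*x is simply added to whatever c_blas[15] already contained. If that contribution rounds to exactly 0.0, the entry is left bit-for-bit unchanged, which is consistent with the 1.0e-18 experiment above.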
> There appears to be loss of precision in the AVX2 "microkernel" used for Haswell and newer due to operand size limitations in the instruction set. Certainly not ideal, but not a catastrophic failure either (which would certainly have shown up in testsuites during the almost ten years this code has been in place)
I can't understand this conclusion given that I can reproduce this with an OPENBLAS_CORETYPE env that doesn't seem to use AVX2. Am I misunderstanding what this env does? Is AVX2 always checked even when an older CPU is manually set?
Here's what I'm testing on:
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 6
On-line CPU(s) list: 0-5
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz
CPU family: 6
Model: 85
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 6
Stepping: 0
BogoMIPS: 5786.40
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xsaves arat pku ospke md_clear flush_l1d arch_capabilities
Virtualization features:
Hypervisor vendor: VMware
Virtualization type: full
Caches (sum of all):
L1d: 192 KiB (6 instances)
L1i: 192 KiB (6 instances)
L2: 6 MiB (6 instances)
L3: 132 MiB (6 instances)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-5
Vulnerabilities:
Gather data sampling: Unknown: Dependent on hypervisor status
Itlb multihit: KVM: Mitigation: VMX unsupported
L1tf: Mitigation; PTE Inversion
Mds: Mitigation; Clear CPU buffers; SMT Host state unknown
Meltdown: Mitigation; PTI
Mmio stale data: Mitigation; Clear CPU buffers; SMT Host state unknown
Retbleed: Mitigation; IBRS
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer
sanitization
Spectre v2: Mitigation; IBRS, IBPB conditional, STIBP disabled, RSB
filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
This is on Ubuntu 22.04, with the libopenblas-pthread-dev package. I can also recreate this with OpenBLAS 0.3.24 on MSYS2, but that machine is Coffee Lake and can't use SkylakeX instructions.
I'm using OPENBLAS_VERBOSE=5 to print the detected core type at the top of each run.
OPENBLAS_CORETYPE=Nehalem (No AVX2)
Core: Nehalem
BLAS MAT
1.175201193643801378e+00 1.175201193643801822e+00
1.103638323514327002e+00 1.103638323514327224e+00
3.578143506473726032e-01 3.578143506473724922e-01
7.045563366848886511e-02 7.045563366848883735e-02
9.965128148869364932e-03 9.965128148869309421e-03
1.099586127207521913e-03 1.099586127207556390e-03
9.945433911356243994e-05 9.945433911360671605e-05
7.620541308817063708e-06 7.620541308896932986e-06
5.064719744429790893e-07 5.064719744437437483e-07
2.971814136443207133e-08 2.971814140421127963e-08
1.560886614404566330e-09 1.560886564391797127e-09
7.419920233786569952e-11 7.419902813631537848e-11
3.221423128252354218e-12 3.221406076314455686e-12
1.287858708565181587e-13 1.289813102624490508e-13
4.440892098500626162e-15 4.440892098500626162e-15
0.000000000000000000e+00 -4.440892098500626162e-16
ddot: 0.000000000000000000e+00
dgemv: 0.000000000000000000e+00
ddot == dgemv? YES
dgemv is 0.0? YES
OPENBLAS_CORETYPE=SandyBridge (No AVX2)
Core: Sandybridge
BLAS MAT
1.175201193643801378e+00 1.175201193643801822e+00
1.103638323514327002e+00 1.103638323514327224e+00
3.578143506473726032e-01 3.578143506473724922e-01
7.045563366848886511e-02 7.045563366848883735e-02
9.965128148869364932e-03 9.965128148869309421e-03
1.099586127207521913e-03 1.099586127207556390e-03
9.945433911356243994e-05 9.945433911360671605e-05
7.620541308817063708e-06 7.620541308896932986e-06
5.064719744429790893e-07 5.064719744437437483e-07
2.971814136443207133e-08 2.971814140421127963e-08
1.560886614404566330e-09 1.560886564391797127e-09
7.419920233786569952e-11 7.419902813631537848e-11
3.221423128252354218e-12 3.221406076314455686e-12
1.287858708565181587e-13 1.289813102624490508e-13
4.440892098500626162e-15 4.440892098500626162e-15
0.000000000000000000e+00 -4.440892098500626162e-16
ddot: 0.000000000000000000e+00
dgemv: 0.000000000000000000e+00
ddot == dgemv? YES
dgemv is 0.0? YES
OPENBLAS_CORETYPE=Haswell
Core: Haswell
BLAS MAT
1.175201193643801378e+00 1.175201193643801822e+00
1.103638323514327002e+00 1.103638323514327224e+00
3.578143506473725477e-01 3.578143506473724922e-01
7.045563366848892062e-02 7.045563366848883735e-02
9.965128148869371871e-03 9.965128148869309421e-03
1.099586127207577424e-03 1.099586127207556390e-03
9.945433911373591229e-05 9.945433911360671605e-05
7.620541308983597162e-06 7.620541308896932986e-06
5.064719745540013918e-07 5.064719744437437483e-07
2.971814122565419325e-08 2.971814140421127963e-08
1.560886642160141946e-09 1.560886564391797127e-09
7.419920233786569952e-11 7.419902813631537848e-11
3.221201083647429186e-12 3.221406076314455686e-12
1.292299600663682213e-13 1.289813102624490508e-13
4.440892098500626162e-15 4.440892098500626162e-15
0.000000000000000000e+00 -4.440892098500626162e-16
ddot: 0.000000000000000000e+00
dgemv: 0.000000000000000000e+00
ddot == dgemv? YES
dgemv is 0.0? YES
OPENBLAS_CORETYPE=SkylakeX (AVX512)
Core: SkylakeX
BLAS MAT
1.175201193643801378e+00 1.175201193643801822e+00
1.103638323514327002e+00 1.103638323514327224e+00
3.578143506473725477e-01 3.578143506473724922e-01
7.045563366848892062e-02 7.045563366848883735e-02
9.965128148869371871e-03 9.965128148869309421e-03
1.099586127207577424e-03 1.099586127207556390e-03
9.945433911373591229e-05 9.945433911360671605e-05
7.620541308983597162e-06 7.620541308896932986e-06
5.064719745540013918e-07 5.064719744437437483e-07
2.971814122565419325e-08 2.971814140421127963e-08
1.560886642160141946e-09 1.560886564391797127e-09
7.419920233786569952e-11 7.419902813631537848e-11
3.221201083647429186e-12 3.221406076314455686e-12
1.292299600663682213e-13 1.289813102624490508e-13
4.440892098500626162e-15 4.440892098500626162e-15
0.000000000000000000e+00 -4.440892098500626162e-16
ddot: -8.881784197001252323e-16
dgemv: 0.000000000000000000e+00
ddot == dgemv? NO
dgemv is 0.0? YES
In my testing, I'm seeing that neither AVX nor AVX2 matters, as Nehalem and SandyBridge give identical results. AVX2 on Haswell gives a slightly different result, but one that is well within machine precision. The biggest difference I see is with SkylakeX and, presumably, AVX-512, which gives a different result for ddot only. That value is "correct" to us, as it is within our test tolerance and matches very closely what other BLAS libraries give.
Please help me understand how the AVX2 microkernel is the issue here. If OPENBLAS_CORETYPE does not actually control this, does OpenBLAS provide a way to disable AVX2 at runtime?
It will fall back to older compute kernels if AVX2 is not present in CPUID. The difference is in the 1-2 lowest bits of the significand and is expected. If you want the same result every time, you have to use the unoptimized netlib build.