
Optimising `ggml_vec_dot_i2_i8_s` with AVX-512 VNNI

Open HJLebbink opened this issue 9 months ago • 1 comment

While profiling BitNet inference (single-threaded run_inference), I observed that the function ggml_vec_dot_i2_i8_s (which performs a multiply-accumulate over 1.58-bit (ternary) weights and 8-bit activations) dominates the runtime (I don't recall the exact figure, but it was about 80% of total CPU time).

I rewrote ggml_vec_dot_i2_i8_s using AVX-512 VNNI intrinsics, collapsing the original 85-line implementation into one short loop. AVX-512 VNNI (Vector Neural Network Instructions) provides fused dot-product instructions for integer data, e.g. VPDPBUSD, which multiplies unsigned 8-bit values by signed 8-bit values and accumulates the products into 32-bit lanes.

#include <immintrin.h>  // AVX-512F/BW and AVX-512 VNNI intrinsics

// Dot product of n packed 2-bit (ternary) weights x with n int8 activations y.
// Each byte of x carries four 2-bit fields; n is assumed to be a multiple of 128.
// hsum_i32_16 (helper in the linked file) horizontally sums the 16 int32 lanes.
inline float ggml_vec_dot_i2_i8_s_avx512_vnni(int n, const uint8_t* x, const int8_t* y) {
    const int number_of_blocks = n / 128;

    const __m512i mask = _mm512_set1_epi8(0x03);
    __m512i accu_1 = _mm512_setzero_si512();
    __m512i accu_2 = _mm512_setzero_si512();

    for (int j = 0; j < number_of_blocks; ++j) {
        // Load 32 bytes = 128 packed 2-bit weights.
        const __m256i xq = _mm256_loadu_si256((const __m256i*)(x + (j * 32)));
        // Widen to 512 bits: lower half pre-shifted right by 2, upper half unshifted,
        // so the two masked extractions below recover all four 2-bit fields per byte.
        const __m512i xqa = _mm512_inserti64x4(_mm512_castsi256_si512(_mm256_srli_epi16(xq, 2)), xq, 1);

        const __m512i x0 = _mm512_and_si512(_mm512_srli_epi16(xqa, 4), mask);
        const __m512i x2 = _mm512_and_si512(xqa, mask);

        // 128 int8 activations, laid out to match the extracted weight order.
        const __m512i y0 = _mm512_loadu_si512((const __m512i*)(y + j * 128 + (0 * 64)));
        const __m512i y2 = _mm512_loadu_si512((const __m512i*)(y + j * 128 + (1 * 64)));

        // VPDPBUSD: unsigned bytes (weights, 0..3) times signed bytes (activations),
        // four products accumulated into each 32-bit lane.
        accu_1 = _mm512_dpbusd_epi32(accu_1, x0, y0);
        accu_2 = _mm512_dpbusd_epi32(accu_2, x2, y2);
    }
    return static_cast<float>(hsum_i32_16(_mm512_add_epi32(accu_1, accu_2)));
}

See https://github.com/HJLebbink/BitNet/blob/5e2feac8f8a8061be6ad39815515be164184b0bb/src/ggml-bitnet-mad.cpp#L229
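
For reference, here is a scalar model of what a single _mm512_dpbusd_epi32 (VPDPBUSD) call computes; this is purely illustrative and not part of BitNet:

#include <cstdint>

// Scalar model of _mm512_dpbusd_epi32 (VPDPBUSD): for each of the 16 int32 lanes,
// accumulate four products of an unsigned byte from `a` with a signed byte from `b`.
static void dpbusd_model(int32_t acc[16], const uint8_t a[64], const int8_t b[64]) {
    for (int lane = 0; lane < 16; ++lane)
        for (int k = 0; k < 4; ++k)
            acc[lane] += int32_t(a[4 * lane + k]) * int32_t(b[4 * lane + k]);
}

Because the weights enter the instruction as unsigned 0..3 codes, any mapping back to the ternary values {-1, 0, +1} has to be handled by the encoding or by a correction term elsewhere in the kernel.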

In some naive tests with run_inference.py, this rewrite yielded roughly a 40% speedup (same prompt and settings).
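
As a rough sanity check (taking the ~80% profile share above at face value): if a fraction p of the runtime is sped up by a factor s, the overall speedup is 1 / (1 - p + p/s). With p ≈ 0.8, an overall 1.4x speedup corresponds to s ≈ 1.55 for the kernel itself, which seems plausible for this kind of rewrite.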

Surprisingly, the standard end-to-end benchmark script (e2e_benchmark.py) shows no performance improvement after the changes. For example, running:

python utils/e2e_benchmark.py -m /path/to/model.gguf -n 200 -p 256 -t 1

This suggests the bottleneck in this benchmark lies elsewhere: thread synchronisation, spinlock contention, or other overhead could be masking the gains.
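
One way to separate the kernel from the rest of the pipeline is to time it in isolation. A minimal stand-alone micro-benchmark sketch (buffer sizes and iteration count are arbitrary; it assumes the AVX-512 function and hsum_i32_16 above are available in the same translation unit and a VNNI-capable CPU, e.g. built with g++ -O2 -march=native):

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    const int n = 4096;                    // weights per dot product (multiple of 128)
    std::vector<uint8_t> x(n / 4, 0x5A);   // packed 2-bit weights, four per byte
    std::vector<int8_t>  y(n, 1);          // int8 activations
    const int iters = 1000000;

    volatile float sink = 0.0f;            // keep the calls from being optimised away
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        sink = sink + ggml_vec_dot_i2_i8_s_avx512_vnni(n, x.data(), y.data());
    const auto t1 = std::chrono::steady_clock::now();

    const double total_ns =
        (double)std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    const double ns_per_call = total_ns / iters;
    std::printf("%.1f ns per call, ~%.1f GB/s of int8 activations\n",
                ns_per_call, n / ns_per_call);   // bytes per ns == GB/s
    return 0;
}

If the per-call numbers confirm the kernel itself got faster, the flat e2e_benchmark result would point at scheduling, synchronisation, or memory-bandwidth effects rather than at the dot product.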

Questions:

  • Has anyone else experimented with rewriting ggml_vec_dot_i2_i8_s using newer AVX instructions in BitNet? What speedups did you observe?

  • Does anyone have insight into why e2e_benchmark might not reflect these low-level optimisations? Could the threading model, locking or other overhead in BitNet be limiting throughput in the end-to-end benchmark?

HJLebbink avatar May 11 '25 11:05 HJLebbink

Or, even faster: an implementation that uses Galois fields to extract the two-bit fields in one go. https://github.com/HJLebbink/BitNet/blob/b20fa484a3d59d815ddf2523d3b050716e00fa15/src/ggml-bitnet-mad.cpp#L229
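
For anyone comparing the two approaches, this is the unpacking step being optimised; a scalar sketch of pulling the four 2-bit fields out of one packed byte (the mapping of fields to weight indices here is illustrative, the actual order is fixed by the i2_s layout):

#include <cstdint>

// Scalar sketch: split one packed byte into its four 2-bit fields.
static inline void unpack_i2(uint8_t packed, uint8_t out[4]) {
    out[0] = (packed >> 6) & 0x03;
    out[1] = (packed >> 4) & 0x03;
    out[2] = (packed >> 2) & 0x03;
    out[3] =  packed       & 0x03;
}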

HJLebbink avatar May 12 '25 11:05 HJLebbink