[WIP] common, xe: adjust stochastic rounding bias alignment
Testing with [1] shows that the f32->f8 down-convert with stochastic rounding almost always rounds down. This is because there are 4-5 bits between the MSB of the rounding bias and the LSB used in the down-conversion from f32, and all of them must be 1 for the bias to have any chance of carrying into a round-up.
This change uses 16 bits for the f32->f16 down-convert and 8 bits for the f32->f8 down-convert (under the assumption that the original src was f16). To that end, the rounding bias is aligned to the f16 mantissa for the f8 down-convert.
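For context, here is a minimal, self-contained sketch of the general stochastic-rounding scheme (illustrative only, not oneDNN's `stochastic_round_fwd`; the function name and bit counts below are assumptions for f32->f8_e5m2). It shows why the bias alignment matters: the random bias must span exactly the discarded mantissa bits, otherwise a round-up additionally requires the top discarded bits of the input to all be 1.

```c++
// Sketch of stochastic rounding f32 -> f8_e5m2 by adding a random bias to the
// discarded mantissa bits and truncating. Hypothetical helper, not the oneDNN code.
#include <cstdint>
#include <cstring>

float sr_f32_to_e5m2_sketch(float x, uint32_t rnd) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    const uint32_t dropped = 23 - 2; // f32 keeps 23 mantissa bits, e5m2 keeps 2
    const uint32_t mask = (1u << dropped) - 1;
    // Random bias aligned to the discarded bits: the chance of rounding up is
    // proportional to the discarded fraction. If the bias MSB sat several bits
    // below the kept LSB, nearly every value would round down.
    bits += rnd & mask;
    bits &= ~mask; // truncate to the e5m2 grid (result still stored as f32)
    float y;
    std::memcpy(&y, &bits, sizeof(y));
    return y;
}
```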
[1]
// Needs the oneDNN internal headers that provide utils::bit_cast,
// math::stochastic_round_fwd and the float8 types (exact paths depend on the tree).
#include <cstdint>
#include <iostream>
#include <random>

int test_e5m2(int trials) {
    // 1.125 * 2^-14 is the smallest normal in f8_e5m2 + 1/2 ULP.
    // We can then check if we rounded up/down by testing the LSB.
    // Up/down rounding should be split 50-50.
    float e5m2_f = dnnl::impl::utils::bit_cast<float>(0x38900000);
    std::random_device r;
    std::mt19937 engine(r());
    std::uniform_int_distribution<uint32_t> distribution(0);
    int result = 0;
    for (int i = 0; i < trials; ++i) {
        uint32_t seed = distribution(engine);
        float sr = dnnl::impl::math::stochastic_round_fwd(
                e5m2_f, 0, seed, dnnl::impl::data_type::f8_e5m2);
        dnnl::impl::float8_e5m2_t test = sr;
        result += (test.raw_bits_ & 1) ? 1 : -1;
    }
    return result;
}

int test_e4m3(int trials) {
    // 1.0625 * 2^-6 is the smallest normal in f8_e4m3 + 1/2 ULP.
    // We can then check if we rounded up/down by testing the LSB.
    // Up/down rounding should be split 50-50.
    float e4m3_f = dnnl::impl::utils::bit_cast<float>(0x3C880000);
    std::random_device r;
    std::mt19937 engine(r());
    std::uniform_int_distribution<uint32_t> distribution(0);
    int result = 0;
    for (int i = 0; i < trials; ++i) {
        uint32_t seed = distribution(engine);
        float sr = dnnl::impl::math::stochastic_round_fwd(
                e4m3_f, 0, seed, dnnl::impl::data_type::f8_e4m3);
        dnnl::impl::float8_e4m3_t test = sr;
        result += (test.raw_bits_ & 1) ? 1 : -1;
    }
    return result;
}

int main(void) {
    // Output is piped to python.
    std::cout << "import numpy\n";
    std::cout << "a = [\n";
    for (int i = 0; i < 1000; ++i)
        std::cout << " " << test_e5m2(100) << ",\n";
    std::cout << "]\n";
    std::cout << "print(numpy.histogram(a, bins=50, range=[-100, 100]))\n";
    std::cout << "print(numpy.histogram(a, bins=2, range=[-100, 100]))\n";
    std::cout << "a = [\n";
    for (int i = 0; i < 1000; ++i)
        std::cout << " " << test_e4m3(100) << ",\n";
    std::cout << "]\n";
    std::cout << "print(numpy.histogram(a, bins=50, range=[-100, 100]))\n";
    std::cout << "print(numpy.histogram(a, bins=2, range=[-100, 100]))\n";
    return 0;
}
- [ ] Do all unit and benchdnn tests (`make test` and `make test_benchdnn_*`) pass locally for each commit?
- [x] Have you formatted the code using clang-format?
The result shows an ~3% bias toward rounding up, likely related to the distribution of bits generated by the Philox engine or the seeds generated by the Mersenne Twister engine.
This might be from numpy.histogram(): its documentation says the bin ranges are half-open, like [l, r), so with bins=2 over [-100, 100] the bins are [-100, 0) and [0, 100], and balanced cases (when result is 0) are assigned to the upper bin rather than split between the lower and upper bins as we would like.
Thanks for looking into that. It looks like you're right; removing the 0s for the 2-bin histogram tends to give more balanced histograms.
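As a hedged illustration (not part of the patch), the 2-bin check in [1] could exclude exact ties before binning by emitting a filtered list from main(); the line below is a hypothetical drop-in replacement for the corresponding `std::cout` statement.

```c++
// Hypothetical tweak to main() in [1]: drop ties (result == 0) so the
// half-open numpy bins do not count them as "rounded up".
std::cout << "print(numpy.histogram([x for x in a if x != 0], "
             "bins=2, range=[-100, 100]))\n";
```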
make test disable test_device_cpu disable benchdnn_all enable benchdnn_reorder enable benchdnn_matmul set test_scope=NIGHTLY
make test disable benchdnn_all enable benchdnn_reorder enable benchdnn_matmul set test_scope=NIGHTLY
@mgouicem @vpirogov does it make sense to merge this (+ backport?) to have stochastic rounding in a working state, while a decision is made about what sort of support we will need?