[WIP] common, xe: adjust stochastic rounding bias alignment
Testing with [1] shows that the f32->f8 down-convert with stochastic rounding almost always rounds down. This is because there are 4-5 bits between the MSB of the rounding bias and the LSB used in the down-conversion from f32, and all of them must be 1 for the bias to have any chance of carrying into a round-up.
This change uses 16 bits for the f32->f16 down-convert and 8 bits for the f32->f8 down-convert (under the assumption that the original src was f16). To that end, the rounding bias is aligned to the f16 mantissa for the f8 down-convert.
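For context, here is a minimal, self-contained sketch of the general stochastic-rounding scheme (illustrative only, not oneDNN's `stochastic_round_fwd`; the function name and bit counts below are assumptions for f32->f8_e5m2). It shows why the bias alignment matters: the random bias must span exactly the discarded mantissa bits, otherwise a round-up additionally requires the top discarded bits of the input to all be 1.

```c++
// Sketch of stochastic rounding f32 -> f8_e5m2 by adding a random bias to the
// discarded mantissa bits and truncating. Hypothetical helper, not the oneDNN code.
#include <cstdint>
#include <cstring>

float sr_f32_to_e5m2_sketch(float x, uint32_t rnd) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    const uint32_t dropped = 23 - 2; // f32 keeps 23 mantissa bits, e5m2 keeps 2
    const uint32_t mask = (1u << dropped) - 1;
    // Random bias aligned to the discarded bits: the chance of rounding up is
    // proportional to the discarded fraction. If the bias MSB sat several bits
    // below the kept LSB, nearly every value would round down.
    bits += rnd & mask;
    bits &= ~mask; // truncate to the e5m2 grid (result still stored as f32)
    float y;
    std::memcpy(&y, &bits, sizeof(y));
    return y;
}
```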
[1]
// Needs the oneDNN internal headers that provide utils::bit_cast,
// math::stochastic_round_fwd and the float8 types (exact paths depend on the tree).
#include <cstdint>
#include <iostream>
#include <random>

int test_e5m2(int trials) {
    // 1.125 * 2^-14 is the smallest normal in f8_e5m2 + 1/2 ULP.
    // We can then check if we rounded up/down by testing the LSB.
    // Up/down rounding should be split 50-50.
    float e5m2_f = dnnl::impl::utils::bit_cast<float>(0x38900000);
    std::random_device r;
    std::mt19937 engine(r());
    std::uniform_int_distribution<uint32_t> distribution(0);
    int result = 0;
    for (int i = 0; i < trials; ++i) {
        uint32_t seed = distribution(engine);
        float sr = dnnl::impl::math::stochastic_round_fwd(
                e5m2_f, 0, seed, dnnl::impl::data_type::f8_e5m2);
        dnnl::impl::float8_e5m2_t test = sr;
        result += (test.raw_bits_ & 1) ? 1 : -1;
    }
    return result;
}

int test_e4m3(int trials) {
    // 1.0625 * 2^-6 is the smallest normal in f8_e4m3 + 1/2 ULP.
    // We can then check if we rounded up/down by testing the LSB.
    // Up/down rounding should be split 50-50.
    float e4m3_f = dnnl::impl::utils::bit_cast<float>(0x3C880000);
    std::random_device r;
    std::mt19937 engine(r());
    std::uniform_int_distribution<uint32_t> distribution(0);
    int result = 0;
    for (int i = 0; i < trials; ++i) {
        uint32_t seed = distribution(engine);
        float sr = dnnl::impl::math::stochastic_round_fwd(
                e4m3_f, 0, seed, dnnl::impl::data_type::f8_e4m3);
        dnnl::impl::float8_e4m3_t test = sr;
        result += (test.raw_bits_ & 1) ? 1 : -1;
    }
    return result;
}

int main(void) {
    // Output is piped to python.
    std::cout << "import numpy\n";
    std::cout << "a = [\n";
    for (int i = 0; i < 1000; ++i)
        std::cout << " " << test_e5m2(100) << ",\n";
    std::cout << "]\n";
    std::cout << "print(numpy.histogram(a, bins=50, range=[-100, 100]))\n";
    std::cout << "print(numpy.histogram(a, bins=2, range=[-100, 100]))\n";
    std::cout << "a = [\n";
    for (int i = 0; i < 1000; ++i)
        std::cout << " " << test_e4m3(100) << ",\n";
    std::cout << "]\n";
    std::cout << "print(numpy.histogram(a, bins=50, range=[-100, 100]))\n";
    std::cout << "print(numpy.histogram(a, bins=2, range=[-100, 100]))\n";
    return 0;
}
- [ ] Do all unit and benchdnn tests (`make test` and `make test_benchdnn_*`) pass locally for each commit?
- [x] Have you formatted the code using clang-format?
The result shows an ~3% bias toward rounding up, likely related to the distribution of bits generated by the Philox engine or the seeds generated by the Mersenne Twister engine.
This might be from numpy.histogram(): its documentation says the bin ranges are half-open, like [l, r), so with bins=2 over [-100, 100] the bins are [-100, 0) and [0, 100], and balanced cases (when result is 0) are assigned to the upper bin rather than split between the lower and upper bins as we would like.
Thanks for looking into that. It looks like you're right; removing the 0s for the 2-bin histogram tends to give more balanced histograms.
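As a hedged illustration (not part of the patch), the 2-bin check in [1] could exclude exact ties before binning by emitting a filtered list from main(); the line below is a hypothetical drop-in replacement for the corresponding `std::cout` statement.

```c++
// Hypothetical tweak to main() in [1]: drop ties (result == 0) so the
// half-open numpy bins do not count them as "rounded up".
std::cout << "print(numpy.histogram([x for x in a if x != 0], "
             "bins=2, range=[-100, 100]))\n";
```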
make test disable test_device_cpu disable benchdnn_all enable benchdnn_reorder enable benchdnn_matmul set test_scope=NIGHTLY
make test disable benchdnn_all enable benchdnn_reorder enable benchdnn_matmul set test_scope=NIGHTLY
@mgouicem @vpirogov does it make sense to merge this (+ backport?) to have stochastic rounding in a working state, while a decision is made about what sort of support we will need?