
Introduce a BF16 DType

Open • AdamGS opened this issue 1 year ago • 0 comments

bf16 (aka bfloat16) is a floating-point number format introduced by Google to improve storage utilization and computation speed for machine learning models. It has roughly the same range as a standard IEEE 754 float32 but much lower precision (8 bits of significand instead of 24, counting the implicit leading bit in both cases). Due to its popularity, more and more hardware vendors now support specialized instructions for it, including recent AVX extensions on CPUs and current GPUs.
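To make the range/precision trade-off concrete, here is a minimal sketch in plain Python of how a float32 maps to bf16: the bf16 bit pattern is the upper 16 bits of the float32 pattern, here with round-to-nearest-even applied to the dropped bits. The function names are hypothetical and not part of vortex or Arrow, and the rounding mode is an assumption; some implementations simply truncate.

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit bf16 pattern for x (hypothetical helper).

    x is first narrowed to float32; struct raises OverflowError if a finite x
    does not fit in float32.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    is_nan = (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0
    if is_nan:
        # Force a mantissa bit so dropping the low bits cannot turn NaN into infinity.
        return (bits >> 16) | 0x0040
    # Round to nearest, ties to even, on the 16 bits that are dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF

def bf16_bits_to_f32(b: int) -> float:
    """Widen a bf16 bit pattern back to float32 by zero-filling the low 16 bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x

def round_trip(x: float) -> float:
    """Value of x after being stored as bf16 and read back."""
    return bf16_bits_to_f32(f32_to_bf16_bits(x))

if __name__ == "__main__":
    print(hex(f32_to_bf16_bits(1.0)))   # same sign and exponent bits as float32
    print(round_trip(3.14159265))       # only a few significant decimal digits survive
    print(bf16_bits_to_f32(0x7F7F))     # largest finite bf16, close to the float32 maximum
```

Because the sign and exponent fields are unchanged, widening back to float32 is a plain 16-bit shift, which is part of why the format is cheap to support.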

Open questions in my mind are:

  • [ ] Is it an extension dtype?
  • [ ] How does it canonicalize into Arrow? Seems like there were efforts to introduce it as an "official" extension but nothing materialized as far as I can tell.

AdamGS, Dec 20 '24 11:12