introduce new float types; remove `@setFloatMode`

Open andrewrk opened this issue 10 months ago • 30 comments

Rather than a setting, there will be new float types:

  • real16
  • real32
  • real64
  • real128

These types guarantee @sizeOf but otherwise do not have a well-defined memory layout. @bitCast is not allowed on these types, and they are not allowed in extern or packed structs. This is a conservative choice that may or may not be relaxed in a future proposal.

These types are different from their f16, f32, f64, and f128 counterparts in that:

  • They cannot store NaN, -inf, or +inf.
    • No operations can produce +inf or -inf. If they would, checked illegal behavior occurs instead.
    • No operations can produce NaN. If they would, checked illegal behavior occurs instead.
    • Saturating arithmetic can be used to avoid potential illegal behavior.
  • Cannot distinguish between positive and negative zero.
  • Arithmetic operations do not follow IEEE float semantics. They may:
    • Produce different results on different targets
    • Produce different results in different optimization modes
    • Use the reciprocal of an argument rather than perform division.
    • Perform floating-point contraction (e.g. fusing a multiply followed by an addition into a fused multiply-add).
    • Perform algebraically equivalent transformations that may change results in floating point (e.g. reassociate)
    • Any operation may use any rounding mode.

Reals implicitly coerce to floats; however, @realFromFloat is required to convert a float to a real, and it invokes checked illegal behavior if the value is NaN, -inf, or +inf.

@setFloatMode is to be eliminated.

Users would be encouraged to choose real types unless deterministic IEEE floating point semantics are required. Therefore, the floating point types would be renamed to float16, float32, float64, and float128.
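
For illustration, usage under the proposed semantics might look like this hypothetical sketch (real32, float32, and @realFromFloat are the proposed names, not status-quo Zig; it also assumes the existing saturating operators such as +| would extend to reals):

fn doubledReal(x: float32) real32 {
    // Checked conversion: illegal behavior here if x is NaN, -inf, or +inf.
    const r: real32 = @realFromFloat(x);
    // Saturating addition clamps at the finite extremes instead of producing inf.
    return r +| r;
}

fn asFloat(r: real32) float32 {
    return r; // reals implicitly coerce to floats
}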

andrewrk avatar Mar 10 '25 00:03 andrewrk

In status-quo, comptime_float is "guaranteed to have the same precision and operations of [f128]." I think with this proposal, a comptime_real type would make sense: Afaik status-quo float literals are already all reals. If so, do we need/want comptime_float anymore? To me it would make sense to default to comptime_real, and for users to explicitly specify float128 where they need it.

The only footgun of defaulting to comptime_real might be (-0.0) == 0.0 being unexpected for some users. (Maybe we could add an early-pipeline compile error for this, similar to the whitespace-around-binary-operators rule.)
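
For reference, a status-quo snippet showing that footgun: the two zeros already compare equal under IEEE semantics, so what a comptime_real default would additionally lose is only the observable sign bit (this uses existing std.math APIs):

const std = @import("std");

test "negative zero under status-quo f32" {
    const z: f32 = -0.0;
    try std.testing.expect(z == 0.0); // IEEE comparison already treats the zeros as equal
    try std.testing.expect(std.math.signbit(z)); // but the sign remains observable
}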

rohlem avatar Mar 10 '25 17:03 rohlem

(bike-shedding question.. but since nobody has asked it) since they aren't real numbers as in maths, have other names been considered?

e.g. renaming f32 -> ieee32 and naming the new type f32 or float32, or something which can't be confused with real numbers as in maths?

emidoots avatar Mar 10 '25 22:03 emidoots

What makes them not real numbers as in maths? The fact that they are limited in precision?

andrewrk avatar Mar 11 '25 00:03 andrewrk

This sounds good but I'm also concerned about the name. They have limited precision and limited range, unlike real numbers (obviously those limits can't be overcome in a fixed-size type, but still).

Knowing the limitations of IEEE-754, I would expect that changing from float to real would make numbers behave closer to mathematical reality, but in this proposal it sounds like they would behave less like reality. Or there are at least fewer guarantees about how they behave.

190n avatar Mar 11 '25 03:03 190n

Wouldn't this concern also apply to the word "integer"? Fixed-size integers also cannot represent all the mathematical integers. They are limited in precision in much the same way.

andrewrk avatar Mar 11 '25 03:03 andrewrk

This proposal does not include a problem statement, so I'll ask: which need exactly is this attempting to address? Is it the need for safety-checked finite floating-point arithmetic that treats infinities/NaNs as illegal? Or is it the need for fast optimized "sloppy" arithmetic, as a replacement for @setFloatMode? Currently, it appears to address both of these needs as one and the same. I think it might be useful to treat them as two distinct concerns.

Encoding options like -ffast-math for optimized float arithmetic using the type system makes a lot more sense to me than scoped @setFloatMode statements or compiler flags, so :+1: to that idea. It will make it very explicit in code when the program transitions from deterministic, well-defined semantics to "I don't care about accuracy or reproducibility, just crunch the numbers as fast as possible".

However, there are domains where reproducibility is very important, but where it might still be useful to treat infinity/NaN results as program bugs and safety-check operations on floats. One such domain that is important to me personally is game dev, specifically in the areas of networking and input replay systems, where it is critical that the same series of operations yield the same reproducible results regardless of the system running the code or even which compiler you used.

So I'm wondering if in addition to

  • strict IEEE 754 semantics (equivalent to fN today)
  • optimized, trading accuracy for speed (the proposed realN)

there might also be a case for a third type/mode that is the "finite subset of IEEE 754", which treats infinity/NaN results as safety-checked illegal behavior but is otherwise compliant with IEEE 754.
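
A rough sketch of what that third mode amounts to in status-quo user code, assuming nothing beyond existing std.math APIs (the helper name is illustrative): strict IEEE arithmetic followed by a finiteness check.

const std = @import("std");

/// Strict IEEE addition followed by a finiteness check, approximating a
/// "finite subset of IEEE 754" mode: reproducible results, but inf/NaN
/// outcomes are treated as program bugs.
fn finiteAdd(a: f64, b: f64) f64 {
    const r = a + b; // strict, reproducible IEEE addition
    std.debug.assert(std.math.isFinite(r));
    return r;
}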

comptime_float is not named in the proposal, but since it was brought up above I'll quickly state that for the aforementioned domains that value reproducibility, it is important that expressions that involve comptime_float/floating-point literals have very strict, predictable and well-defined semantics. It would be very bad and footgun-prone if the compiler were allowed to rearrange comptime_float expressions in ways that may subtly change which final values get written to the program binary.

castholm avatar Mar 11 '25 06:03 castholm

> it is important that expressions that involve comptime_float/floating-point literals have very strict, predictable and well-defined semantics.

Just wanted to clarify that my suggestion above was to make float literals comptime_real, but afaiu status-quo behavior would still be available unchanged by wrapping every literal in @as(float128, ) (which just makes the code more explicit about the semantics and bit size to use).

rohlem avatar Mar 11 '25 11:03 rohlem

> Wouldn't this concern also apply to the word "integer"? Fixed-size integers also cannot represent all the mathematical integers. They are limited in precision in much the same way.

The set of integers modulo N (Z_n) is a well defined and commonly used mathematical concept. The machine integer types correspond to them directly.

Calling a finite-precision floating-point number "real" is misleading because it doesn't have the defining properties of real numbers.

pdoane avatar Mar 11 '25 14:03 pdoane

Consider 3 different non-integer representations:

pub const A = struct {
    numer: i16,
    denom: i16,
};

pub const B = struct {
    int_bits: i16,
    frac_bits: i16,
};

pub const C = struct {
    sign: i1,
    exp: i8,
    mantissa: i23,
};

If one were to say A, B, and C are all approximations of real numbers and could be used as a type definition for Real, then the proposed naming makes sense to me.

If instead someone said A is a rational number, B is a fixed-point number, and C is a real number, I'd find that to be inconsistent and suggest they call it a floating-point number.

pdoane avatar Mar 11 '25 15:03 pdoane

Is there a reason we're choosing the prefixes real and float as opposed to r and f? I think it's inconsistent with i and u for signed and unsigned integers, respectively.

amarz45 avatar Mar 11 '25 16:03 amarz45

"Real" types have an (admittedly old) history in computer science as floating point types. I would not call it misleading at all. Examples include Pascal and Structured Text.

r32, r64 for "real" is reasonable to me.

Other options could be:

  • rf32 "restricted float"
  • sf32 "safe float"

if you would like the proposed type to explicitly reference its floating-point nature.

jeffective avatar Mar 11 '25 19:03 jeffective

the proposal might be helped with a little clarity of purpose: is the goal to beat IEEE floats on speed? on accuracy? on distribution of realizable values within a specific range? there are tradeoffs involved in any of these goals.

for example, reading the proposal from the perspective of asking the question "would i want to use them in the computation step of a realtime audio project," i find myself unable to say "yes" or "no".

i'll also argue against "real" as the type name. i don't find it inappropriate, but i do think it would be better to have a type name which describes accurately the semantics it supports—"float" is good at this; "posit" is only okay. the defining property of real numbers (the existence of a "supremum / least upper bound") is computationally impossible, and my bias is to avoid overloading the term with another meaning.

robbielyman avatar Mar 11 '25 20:03 robbielyman

> the proposal might be helped with a little clarity of purpose: is the goal to beat IEEE floats on speed? on accuracy? on distribution of realizable values within a specific range? there are tradeoffs involved in any of these goals.

IEEE floats restrict the compiler in many ways. I think this proposal’s goal is to introduce new types without these restrictions so that the compiler is free to do more optimizations. The real types proposed would be similar to GCC’s -ffast-math. There’s a good article about what it does here.


I really like this proposal, because I use @setFloatMode(.optimized) a few times in my code. This essentially does the same thing, except as a distinct type with checks for illegal behaviour. The only thing I would change is using the prefixes f for the new types and ieeef for the IEEE floating-point types.

amarz45 avatar Mar 12 '25 00:03 amarz45

> The set of integers modulo N (Z_n) is a well defined and commonly used mathematical concept. The machine integer types correspond to them directly.

First of all, this statement is wrong. The integers Z_n modulo a power of 2 (n = 2^k) do not always define division and do not correspond to the integer arithmetic used in programming. Example: 3/2 mod 16 does not exist, yet @as(u4, 3/2) gives you 1.

Secondly, this discussion is a distraction. In software, real and float are synonyms.

slonik-az avatar Mar 12 '25 00:03 slonik-az

While real is not a bad name, under this proposal both real and float are floating-point types, which are used to represent real numbers. Thus, it's not immediately clear that real means "floating-point representation" and float means "IEEE-754-compliant floating-point representation." This is why I think the prefixes f and ieeef would be more clear.

amarz45 avatar Mar 12 '25 01:03 amarz45

No matter what convention is chosen one needs to consult documentation for proper differences between float and real.

slonik-az avatar Mar 12 '25 01:03 slonik-az

> Secondly, this discussion is a distraction. In software, real and float are synonyms.

I apologize for muddying the waters. Clearly it is true that established mathematical convention has no bearing on what programmers writ large or small should do with it. It's possible a second try on my part will fail to communicate the point I am trying to make, but I still want to reiterate—this statement is not accurate as a representation of the state of the world, nor is it useful as a framework from which to make a decision.

Here is my argument: representing real numbers to a computer necessitates being clear about what aspects of the number system are important to you. A calculator program, for instance, should probably not use IEEE floats without significant further thought, since the choices made in the spec harm precision and reproducibility (in the sense that if a calculation is altered to give a mathematically equal result, the IEEE floating point value may change), reasonable concerns for a calculator program.

Alternatives to IEEE floats like posits, other "unum" systems, fixed-point decimals, etc., exist because they represent alternate prioritizations of which properties of real numbers are important to mimic. Since tradeoffs are inevitable, it feels like choosing "real" for the type name will perpetuate this exact misunderstanding of the state of play.

robbielyman avatar Mar 12 '25 01:03 robbielyman

> > The set of integers modulo N (Z_n) is a well defined and commonly used mathematical concept. The machine integer types correspond to them directly.

> First of all, this statement is wrong. The integers Z_n modulo a power of 2 (n = 2^k) do not always define division and do not correspond to the integer arithmetic used in programming. Example: 3/2 mod 16 does not exist, yet @as(u4, 3/2) gives you 1.

Note that I said "set of integers" which does not imply a particular algebraic structure. It's in direct reference to an earlier assertion that integers are infinite like real numbers. I agree that there are many algebraic structures on Z_N that do not define division, and you can define algebraic structures with operators that do match typical CPU operations.

pdoane avatar Mar 12 '25 02:03 pdoane

Personally, I like Andrew's proposal of realN and do not find any problems with it. Fortran has been using real for floating-point numbers since the 1960s and there has been no confusion.

slonik-az avatar Mar 12 '25 08:03 slonik-az

> Fortran has been using real for floating-point numbers since the 1960s and there has been no confusion.

From what I can see, Fortran doesn't have float as a data type in the first place, so it's not surprising. Some languages use real as an alias for float (typically the single-precision variant), so having both names in use definitely could cause some confusion. The short variants fNN/rNN should probably be fine; tbh, going with the intended use, prior art (fast-math), and/or status quo (FloatMode), it may as well be called "fast float" (ffNN), or "optimized float" and "strict float" (ofNN/sfNN).

nissarin avatar Mar 13 '25 11:03 nissarin

This sounds great.

I currently use f32s for my math code with strict float mode enabled, but turn it off in specific scopes where this results in useful optimizations, like hardware support for approximate inverse square root, which wasn't mentioned above but I assume is meant to be included alongside anything else optimized float mode currently allows. I have to be sure to check for inf/NaN/etc. before doing this.

This works fine. However, I have to remember to place these checks in the right places. If I want to play with where I draw the line, I need to make sure to update the checks.
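
For context, a minimal status-quo sketch of that pattern (function names are illustrative; the pre-checks are the part that is easy to misplace):

const std = @import("std");

fn invSqrtOptimized(x: f32) f32 {
    @setFloatMode(.optimized); // relax IEEE semantics within this scope
    return 1.0 / @sqrt(x); // may lower to a hardware approximate rsqrt
}

fn invSqrtChecked(x: f32) f32 {
    // The checks live under strict semantics, outside the optimized scope,
    // so the optimizer cannot assume them away.
    std.debug.assert(x > 0 and std.math.isFinite(x));
    return invSqrtOptimized(x);
}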

With something like real32, I would presumably just use std.math.cast to do these checks, and the type system would make it impossible to leave them off without explicitly calling @realFromFloat.

This would also give me the option to switch to just using them everywhere in a self documenting way if I'm willing to give up nan/inf/such.

MasonRemaley avatar Mar 14 '25 03:03 MasonRemaley

re: naming, pdoane's argument above I find particularly compelling.

emidoots avatar Mar 14 '25 19:03 emidoots

I like where this is going. In my own code I've never wanted silent NaNs, negative zero, or any of the other IEEE-754 baggage. I'd even support giving f32/f64 the semantics in the proposal, while putting strict floats behind something obscure/ugly like @Float(.{ .semantics = .ieee }).

whatisaphone avatar Mar 16 '25 20:03 whatisaphone

I think that real makes sense as a type, although I'm not super sure about throwing away inf and NaNs. I find both to be quite useful for debugging computations. That use-case would be covered by the checked illegal behaviour in the proposal (if I can turn it on in release mode).

I like the name, it's analogous to how Fortran does things. In Fortran, a real number has type REAL, which is agnostic of the specific representation chosen (fixed-point, floating-point, 32bits, 64bits - whatever the compiler decides). You can control the representation with a KIND parameter, which is specific to a particular compiler.

> What makes them not real numbers as in maths? The fact that they are limited in precision?

As far as I'm concerned, the most important difference is that real arithmetic is distributive and associative, whereas FP arithmetic is neither associative nor distributive due to rounding. This really matters when you're working with numbers across a couple orders of magnitude (which you should try to avoid at all costs anyway).
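
For example, a status-quo test demonstrating the non-associativity (plain f64, using the well-known 0.1/0.2/0.3 case):

const std = @import("std");

test "f64 addition is not associative" {
    const a: f64 = 0.1;
    const b: f64 = 0.2;
    const c: f64 = 0.3;
    // (0.1 + 0.2) + 0.3 rounds to 0.6000000000000001,
    // while 0.1 + (0.2 + 0.3) rounds to 0.6.
    try std.testing.expect((a + b) + c != a + (b + c));
}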

smups avatar Mar 19 '25 10:03 smups

On a different matter, I don't think it should be up to the library developer to choose if the user wants reproducibility in floating point operations.

-ffast-math is the status quo and moves the problem to the end user, which is better in my opinion.

@setFloatMode also works well in the sense that it allows a user to specify more constraints to a type, while leaving most of the decisions to the user of the library.

I must say that I do not see the value of this proposal, so, to ask again what was asked above but not answered:

> This proposal does not include a problem statement, so I'll ask: which need exactly is this attempting to address? Is it the need for safety-checked finite floating-point arithmetic that treats infinities/NaNs as illegal? Or is it the need for fast optimized "sloppy" arithmetic, as a replacement for @setFloatMode?

ecstrema avatar Mar 20 '25 22:03 ecstrema

> On a different matter, I don't think it should be up to the library developer to choose if the user wants reproducibility in floating point operations.

> -ffast-math is the status quo and moves the problem to the end user, which is better in my opinion.

> @setFloatMode also works well in the sense that it allows a user to specify more constraints to a type, while leaving most of the decisions to the user of the library.

Different parts of the code may have different requirements. Some code relies on reproducible results, because they are chaotic systems that must match between multiple machines / implementations, or because they rely on floating point comparisons somewhere. But other systems don't have such requirements and instead prioritise best performance.

Using -ffast-math can be too coarse-grained for some projects, and often it ends up being a compromise. Using @setFloatMode is more fine-grained and probably fine for most projects. But actually expressing it in a type is the best option imo. It instantly signals which values require strict IEEE conformance, and which values only need to be able to express real numbers, without any other restrictions.

For my own projects this would probably mean using reals everywhere, except for types that are shared with the GPU, where IEEE layout is required.

I think this change makes sense, as long as the difference between a real32 and an f32 is correctly described in the docs. As a game dev, predicting the types of operations a function produces is often very important. If real32 uses an IEEE 754 float under the hood, but with a ton of compiler optimizations around not producing NaNs etc., then great, that's exactly what I want. If instead it gives the compiler the freedom to implement it as a fixed-point number, then I think that would be less useful, as it would reduce how predictable the output is.

absurd3 avatar Mar 27 '25 11:03 absurd3

I found this article while learning more about this topic:

https://simonbyrne.github.io/notes/fastmath/

This still leaves open the exact question of what the semantics should be: if you combine a regular + and a fast-math +, can they reassociate? What should the scoping rules be, and how should it interact with things like inter-procedural optimization?

nathany avatar Jun 25 '25 13:06 nathany

I like that this would be useful for SPIR-V, where actual IEEE-compliant floating point semantics are not supported in the base feature set.

I don’t like that this provides no guarantees about what the semantics of the real type actually are. Can the compiler just replace all computations resulting in real types with zeros? Probably not. Can it replace 3.14159274 with 3.1415927? Maybe. Where does it actually draw the line, and how can I, as the user of this type, reason about where it draws the line and what range of outputs I can get for my inputs? While with floats this may be non-trivial, it is at least well defined. You could make the argument that you should only use reals when you don’t actually care, but the thing is, you always care, otherwise you wouldn’t be doing the arithmetic. The compiler just guesses how much you care.

I would prefer explicit types that are something like:

  • ieeefloatn, which behaves as the current types do.
  • <something>floatn, where that something specifies the minimum arithmetic precision. The compiler can perform any operations that are guaranteed to have at least that precision. Note that many optimizations (e.g., some reassociations, fma, etc.) actually increase precision, so they would always be legal to perform.

Though I also wonder whether these should be different operators (similar to how saturating/wrapping integer arithmetic are different operators) rather than different types.

ashpil avatar Oct 05 '25 03:10 ashpil

This proposal does not clearly articulate the benefit it plans to gain in return for the huge amount of work and downsides it is going to incur.

The most obvious problem here is that everybody is going to want a different combination of features:

  • I don't want my numerical simulation code to explode because it created a +Inf after chugging along for 3 days--but it probably does want fused multiply-add.
  • A whole lot of video game graphics code is fine with popping a NaN or Inf on a pixel, while missing a frame boundary would be terrible.
  • Switching rounding modes as well as precise traps or exceptions may be super expensive depending upon the processor.
  • Someone building a JavaScript interpreter with NaN-boxing would want the new types for performance but couldn't use them.
  • And what about the weird, bizarre floating point types that all the machine learning folks are using (and can't agree on)?
  • Having an opaque representation is something that sounds good in theory, while in reality everybody needs to understand the detailed semantics to do even something as basic as atof() or ftoa().

To top it off, the fact that nobody here has even mentioned the quiet problems from denormals or subnormals (both of which, IMO, are a bigger problem than Inf and NaN which announce their presence with blaring sirens) is a significant red flag that a whole lot of issues are being glossed over.

IEEE-754/854 should not be regarded as some immutable proclamation from God. However, a lot of man-hours were spent on it by some very, very smart people, and silicon back then was excessively expensive so they didn't just throw random features into it. Before you throw a chunk of IEEE 754 out, you need to understand precisely what you gain in return for what you lose.

I would start with "What Every Computer Scientist Should Know About Floating-Point Arithmetic" by David Goldberg (https://dl.acm.org/doi/pdf/10.1145/103162.103163).

As for concrete stuff, I would argue the proposal is self-contradictory:

> They cannot store NaN, -inf, or +inf.

> Arithmetic operations do not follow IEEE float semantics. They may:

This is almost precisely backwards. Dropping the IEEE semantics in return for speed-at-all-costs things like FMA (fused multiply-add) practically demands that you allow NaN and +/-Inf to propagate properly. Doing precise traps is dramatically easier with IEEE semantics and may not even be possible without them (the poster child for this was the Alpha architecture).

> Saturating arithmetic can be used to avoid potential illegal behavior.

Please do study the published literature--people have been trying this for two decades and failing. The problem is that saturating arithmetic assumes uncorrelated errors, which very quickly go to [-Inf, +Inf] due to things like catastrophic cancellation on subtraction, while in reality the numerical errors are sufficiently correlated for things like matrix operations that most error bounds stay within a tractable range that can be managed.

If people still, for some reason, believe that this proposal is worth spending time on as opposed to the vast number of things Zig still needs fixed, I at least recommend that they go over the current literature, starting with the proceedings from the IEEE International Symposium on Computer Arithmetic.

Overall, the Arith 2025 proceedings are a very good microcosm of what is currently going on in the field: https://www.arith2025.org/proceedings/

buzmeg avatar Nov 06 '25 04:11 buzmeg

I do not think the proposal involves removing support for f32/f64/etc. real32/real64 are simply type-level representations of what is already achievable with @setFloatMode. Instead of setting a compile option in code, which changes the meaning of a built-in type, you have a separate type that makes it clear just from looking at a function's signature whether it is valid to pass in a non-finite float. It also allows you to mix-and-match the optimized and IEEE-compatible floats within a single scope. IEEE was absolutely made by smart people, which is why it's still relevant today, and for most cases developers will still use IEEE floats. This is simply changing the way that the developer specifies -ffast-math-style semantics, and in my opinion it's a much more sensible approach.
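
To illustrate that signature-level signal, a hypothetical sketch (again using the proposed type names, which do not exist today):

// The parameter types alone document the contract: lerp's caller must
// supply finite values, while accumulate accepts any IEEE float and
// promises reproducible results.
fn lerp(a: real32, b: real32, t: real32) real32 {
    return a + (b - a) * t; // compiler is free to fuse or reassociate
}

fn accumulate(total: float32, x: float32) float32 {
    return total + x; // strict, reproducible IEEE addition
}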

eira-fransham avatar Nov 21 '25 10:11 eira-fransham