NaN canonicalization
Some popular source languages don't have the ability to preserve NaN bit patterns. For example, JS has only a single NaN value, and makes no guarantees about preserving specific bit patterns. NaN bit patterns could become an API feature that some implementation source languages support that others don't. Consequently, interface types should consider canonicalizing NaNs in f32 and f64 values at interface boundaries.
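As a rough, purely illustrative sketch of what "canonicalizing at the boundary" could mean for a single f64 (the bit pattern 0x7ff8000000000000 here is just one common choice of canonical quiet NaN; nothing in this issue fixes a specific choice):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical boundary step: replace any NaN's payload bits with one fixed
   quiet-NaN pattern, leaving all non-NaN values untouched. */
static double canonicalize_f64(double x) {
    if (x != x) {                                  /* true only for NaNs */
        uint64_t canon = UINT64_C(0x7ff8000000000000);
        memcpy(&x, &canon, sizeof x);              /* drop the payload bits */
    }
    return x;
}
```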
Good observation! I agree.
I can think of two design principles that would motivate this:
- Determinism/predictability: Languages that only deal in single, canonical NaNs should not have to deal with receiving non-canonical NaNs across interface boundaries. (Although what problems could this cause? Could those languages just treat canonical and non-canonical NaNs uniformly?)
- Declaration of intent: Semantically, floats are not supposed to express arbitrary bits, so interface declarations are simpler and clearer if NaN boxing cannot possibly be part of the interface and if documented interfaces cannot possibly try to promise that the NaN payloads will be preserved.
Are those the principles motivating this suggestion, or am I off base? Regardless, are the underlying design principles documented anywhere?
I'm thinking about this in the context of this part of the interface-types motivation:
As part of this maximum-reuse profile, the language and toolchain that a module author uses should be kept an encapsulated module implementation detail allowing module clients to independently choose their own language and toolchain.
To promote maximal reuse, we should discourage APIs that rely on NaN bits to convey meaningful information, because such APIs wouldn't be accessible from JS or other languages with a single NaN.
The reasoning seems to be that interface types should restrict communication between components to information that all languages can retain in their obvious representation without loss. That seems to add yet more scope creep to interface types, and fairly fuzzy scope creep at that.
Also, there's nothing that says that an interface-types f64 has to lower to a double in JS—a JS program could specify a non-canonical lowerer that lowers it to a BigInt via its bits. Sure, that causes boxing, but so does lowering i32 to Int32 in OCaml; I imagine we aren't planning to limit i32 to 31-bit integers, but such a limitation seems to be advocated by the same reasoning. For OCaml, I would expect the tooling for generating a converter from an interface type to OCaml data will, by default, lower i32 to int, but will also have an option for lowering to Int32.
There also might be a cost to this. I imagine some day we will have bulk operations for lifting and lowering list f64. Fusing a bulk lift with a bulk lower could result in simply a memcpy. But the restriction here would require canonicalization as well.
To be sure, I'm still exploring the space here.
One option would be to say that it's nondeterministic whether NaNs are canonicalized at boundaries. That would let implementations skip the overhead, but still signal the intent of f32 and f64 and allow JS and others to participate with their main number types.
Another option would be to observe that NaN canonicalization is SIMD-optimizable, so we may be able to make it fairly fast. It still wouldn't be free though, especially for small-to-medium arrays.
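For a sense of what the bulk case could look like, here is a minimal sketch of an elementwise canonicalization pass over a list of f64 values (the function name and the canonical bit pattern are assumptions; whether an engine actually vectorizes such a loop is an implementation detail):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Canonicalize every element while copying; the per-element branch typically
   lowers to a select, so compilers can often auto-vectorize this loop. */
static void canonicalize_f64_list(double *dst, const double *src, size_t n) {
    const uint64_t canon = UINT64_C(0x7ff8000000000000);
    for (size_t i = 0; i < n; i++) {
        double x = src[i];
        uint64_t bits;
        memcpy(&bits, &x, sizeof bits);
        if (x != x)              /* NaN: substitute the canonical pattern */
            bits = canon;
        memcpy(&dst[i], &bits, sizeof bits);
    }
}
```

Without the canonicalization, the same fused lift/lower could have been a plain memcpy, which is exactly the cost being weighed above.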
Does OCaml have a way of representing a value which is unboxed if it's in the 31-bit integer range, and boxed otherwise? Could it use its tagged integer representation for this? If so, it wouldn't have to box in the common case, and it wouldn't have to fail on values that other languages accept.
That seems to add yet more scope creep to interface types, and fairly fuzzy scope creep at that.
Regardless of what the answer to this technical question is, as of the vote on the scope yesterday, this sort of question seems squarely within the scope of interface types (and the component model), specifically under promoting cross-language interoperability. This isn't the first time this sort of question has popped up and it won't be the last. In all these sorts of questions, there is an inherent tension between expressing every last thing a particular language might want and defining language-agnostic interfaces that are widely implementable.
By way of comparison, I'd be surprised if COM or protobufs makes any guarantees about preserving NaN payloads across interfaces. That would mean that they've implicitly chosen the "non-deterministically canonicalize NaNs" route @sunfishcode suggested is another option above. Given wasm's emphasis on determinism, it makes sense for us to ask if we should choose differently.
I think there's a third motivation in addition to the two @tlively gave: if f64 carries non-canonical NaN payloads, it may force a toolchain (which knows nothing of the specific workloads it is being used to compile) to conservatively decide that it must preserve non-canonical payloads (b/c some other toolchain does, thus parity) incurring extra cost (not just at the boundary, but potentially within the body of computation as well) -- concretely I'm thinking about this harming NaN-boxing engines. If the cost of NaN-canonicalization was significant, then this would be a tough tradeoff, but as @sunfishcode says, it's quite cheap these days, and I expect insignificant in the context of a cross-component call.
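To make the NaN-boxing concern concrete, here is a rough, illustrative sketch of the technique (the tag region, the 48-bit pointer assumption, and all names are made up for this sketch; real engines differ in the details):

```c
#include <stdint.h>
#include <string.h>

/* NaN-boxing: 64-bit words either hold a real double or hide a pointer in
   the payload bits of a high NaN range. */
#define BOX_BASE UINT64_C(0xfff8000000000000)      /* assumed tag region */

static uint64_t box_pointer(void *p) {
    /* assumes pointers fit in the low 48 bits */
    return BOX_BASE | (uint64_t)(uintptr_t)p;
}

static uint64_t box_double(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    if (d != d)                                    /* canonicalize NaNs so a  */
        bits = UINT64_C(0x7ff8000000000000);       /* real double can't alias */
    return bits;                                   /* a boxed pointer         */
}

static int is_boxed_pointer(uint64_t v) { return v >= BOX_BASE; }
```

An engine like this can't hold a payload-carrying NaN that collides with its boxed range in this value representation at all; it would have to heap-box such values or canonicalize them on entry, which is the tension referred to above.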
This isn't the first time this sort of question has popped up and it won't be the last. In all these sorts of questions, there is an inherent tension between expressing every last thing a particular language might want and defining language-agnostic interfaces that are widely implementable.
Yes, these sorts of questions will come up regularly. One way to resolve them is to have everyone fight over an answer, pick a winner, and then apply that solution to everyone. That strategy results in lots of fights and comes out with winners and losers.
Another way to resolve them is to find a way to accommodate everyone's needs. Adapters (even simple ones) and fusion provide a great way to make this possible. For example, a producer of an API for which NaNs are supposed to be insignificant can use an f64.lift_canonicalize_nan operation. This ensures consumers do not depend on insignificant NaN information, and it also informs tooling that NaNs are not significant and can be optimized appropriately. Similarly, a consumer of such an API can use an f64.lower_canonicalize_nan operation. The fuser, when it matches lift_canonicalize_nan with lower_canonicalize_nan, can easily avoid canonicalizing twice.
Meanwhile, an API intended for efficient numeric programs (and which has little interest in JS programs) can still use interface types as a means of efficient shared-nothing data transfer. Their needs are not bottlenecked by others' irrelevant needs.
As a bonus, if the numeric program using the API happens to rely on NaN canonicalization for its own purposes, it can use lower_canonicalize_nan to make the data conform to the program's internal invariants as it is transferred. If it has a specific canonicalization it relies on, it can specify the bit pattern as an argument to lower_canonicalize_nan.
I would rather interface types offer options to people rather than impose choices on people (of course, so long as it also provides sufficient isolation so that others' choice of the available options does not interfere with one's own choice).
This is perfectly reasonable: to have more than one coercion operation. That is one of the fundamental merits of the adapter fusion approach. There is, of course, a counter argument: the existence of lower_canonicalize_nan embodies the possibility of additional semantics that are not adequately modeled in the type signature. To maximize interoperability, especially across ownership boundaries, we should have as 'universal' an interpretation as possible.
I'm not sure I follow the counter argument. The consumer is generally free to do whatever it wants with the data. If you didn't provide lower_canonicalize_nan, it could still canonicalize the NaN itself. By providing a specialized adapter for it, you enable optimizations that the consumer cannot employ on its own: eliminating the canonicalization if the lifter already did so, or using hardware acceleration to perform the canonicalization during the transfer. So, semantically speaking we're not adding anything that wasn't already possible—it's just a performance optimization.
Should it be possible to declare an interface with an f64 argument where the NaN bits are a significant part of the interface contract?
Choices so far include:
- Yes, so some source languages can't map f64 to their main number types.
- Yes, some source languages just can't be used to implement some interfaces.
- No, NaNs are always canonicalized at interfaces.
- No, it's nondeterministic whether NaNs are canonicalized at interfaces.
The choice I suggested was:
- Yes, but some programs in some languages will need to not use the default mapping for f64 and instead map to some more complete representation (e.g. BigInt) in order to access/provide the full range of functionality permitted/required by the interface.
In fact, the (non-default) lowerer for JS could lower to the number type for non-NaNs and lower to BigInt for NaNs.
It seems like a reasonable balance that also allows for incremental progress would be
- no canonicalization in interface types itself
- for the canonical ABI and JS, interface type floats turn into Numbers, and NaNs get canonicalized just like they do with the regular Wasm -> JS value semantics
- when we eventually have adapter functions, JS programs can do whatever they want, such as turn NaNs into BigInts
Depends on what the goal is. If maximum interoperability is the goal, then canonicalization seems essential.
When I was helping with the early days of Kotlin, they wanted to interop with Java but to also be null-safe, which posed an obvious problem. The solution strategy I developed for them at a high level was to have an ergonomic default—the type-checker would treat Java values as non-null but automatically insert checks where such assumptions were made—but also a more manual fallback—the type-checker would also recognize that Java values were potentially null and still permit programmers to test them for nullness before automatically inserting checks. That interop strategy was quite successful and is analogous to what I am suggesting here.
The central question is whether the abstract set of values admitted by an f64 includes NaNs with payloads; that is independent of the lifting/lowering instructions used to produce or consume those values. If non-canonical NaNs are included in the set of valid f64 values, then every constituent language/toolchain in the ecosystem is forced to deal with them (one way or another); I don't think we can sidestep this fact by trying to make this a canonical ABI or lifting/lowering-instruction option.
The concrete experience we have from years of IEEE754 usage in the wild is that non-canonical NaNs aren't useful in practice (other than for NaN-boxing, which wouldn't be a cross-component thing) and mostly only serve to cause language/toolchain vendors to waste time worrying about them, so if indeed the runtime cost is negligible, then I don't see why we wouldn't take the opportunity to (slightly) improve the robustness and simplicity of the ecosystem.
That is no longer an interoperability argument. That's fine; I'm just pointing out that the argument has moved to cleaning up the past (in a non-opt-in fashion).
Regardless of what we decide here, languages and tooling will have to worry about NaNs. I don't see how canonicalizing over the boundary will help with that. Even for languages that rely on NaN-canonicalization (e.g. for branch-free NaN-insensitive equality comparisons or hashing), if here you choose a different canonical NaN than the one they chose for their runtime, then they'll have to recanonicalize everything anyways.
I worry that extending the scope of interface types beyond efficient transfer/communication/interaction to the point that we have to try to anticipate/review all programs' needs to come to an answer makes for an infeasible and contentious goal.
e.g. for branch-free NaN-insensitive equality comparisons or hashing
That's not the goal; see NaN-boxing
if here you choose a different canonical NaN than the one they chose for their runtime, then they'll have to recanonicalize everything anyways
In practice (e.g., JS engines today), the canonical NaN bit pattern is an arbitrary global configurable constant, so as long as it is standardized, it can be #ifdef'd.
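A tiny sketch of that point, with a made-up configuration macro (real engines pick their own names and defaults):

```c
#include <stdint.h>
#include <string.h>

/* The canonical NaN is just one global constant, so a standardized choice
   could be adopted (or overridden at build time) in one place. */
#ifdef ENGINE_USE_NEGATIVE_QNAN
#  define CANONICAL_NAN_BITS UINT64_C(0xfff8000000000000)
#else
#  define CANONICAL_NAN_BITS UINT64_C(0x7ff8000000000000)
#endif

static double canonical_nan(void) {
    double d;
    uint64_t bits = CANONICAL_NAN_BITS;
    memcpy(&d, &bits, sizeof d);
    return d;
}
```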
I worry that extending the scope of interface types beyond efficient transfer/communication/interaction to the point that we have to try to anticipate/review all programs' needs to come to an answer makes for an infeasible and contentious goal.
The context here is the Component Model, and the goals and use cases (recently confirmed by the CG) now definitely extend past simply "transfer/communication/interaction" but, rather, robust composition of components implemented in different languages. Maybe it's a bad idea and we'll fail -- but I believe that's scope for this proposal.
The context here is the Component Model
So a component I would expect to be supported by such a model is floating-point libraries (e.g. ones providing functions like acos). Some such libraries are specifically used to favor cross-platform consistency over optimal performance. Common among the specs for these is the requirement that, given a NaN input, the output result is that exact NaN. Of course, it is impossible to satisfy such a spec if NaNs are canonicalized. So canonicalizing NaNs conflicts with existing expectations for cross-platform systems.
If a goal of the component model is to be able to make libraries like libc into components
rather, robust composition of components implemented in different languages
There are existing multi-language systems, some of which have formal guarantees about how components from different languages compose and interact. The norm in these systems is, when composing components from different languages, to insert coercions that are necessary for the two specific languages at hand at the boundary point. (Typically these coercions are auto-generated from the types, though sometimes they can be explicitly specified by the person doing the composing.) So, if you're composing two components of the same language, then you insert identity coercions. But if you're converting between different languages with different representations of the same concept, then you insert a coercion that maps between those representations as faithfully as possible (e.g. mapping arbitrary 64-bit doubles to [non-NaN doubles + NaN] in the obvious fashion, and then selecting a specific 64-bit NaN representation in the reverse direction). The relevant composition theorems still hold with such coercions.
One common composition theorem multi-language systems strive for is that composing two components of language L within language L results in the same behavior as composing two components of language L each as part of the multi-language system. That makes it possible for a program to, say, #include <foo.h> and not worry about whether the (C) component providing foo.h was linked using C semantics or using interface-types semantics. But, clearly, if you're canonicalizing NaNs then you're changing the semantics of linking to the above cross-platform libraries. To satisfy this theorem, it's necessary that the coercions inserted when linking programs from the same language are simply the identity functions.
On the other hand, having a "global" bottleneck representation hinders, rather than aids, multi-language systems. It limits what languages you can add to the system because you've required that their natural coercions be at least as expressive as the bottleneck (more formally speaking, that they have a semi-invertible surjection to the bottleneck representation). Or, it means that as you consider more languages you'll have to narrow your bottleneck further. This is the issue I raised with 31-bit integers being the "natural/ergonomic" counterpart in OCaml.
So my understanding of multi-language systems suggests that f64 in interface types should preserve NaNs, rather than canonicalize them, because the languages that can observe the difference will care whereas the languages that can't observe the difference won't.
If a goal of the component model is to be able to make libraries like libc into components
Libraries like libc are specific examples given of what would stay as shared-everything modules (possibly imported and reused inside a component). Components are not meant to subsume intra-language modules and linking -- even when components exist, you'll naturally want to use language-native modules and ABIs for rich, deep language integration.
The norm in these systems is [...]
Which systems are you talking about, concretely? Because, if you look at COM or microservices-composed-with-gRPC (which you could roughly summarize as the two large industrial endeavors in this multi-language-composition space in the past), none of the things you're saying hold. It's possible you're thinking about systems that compose, say, within the .NET universe, or within the JVM universe, and those systems will have a natural inclination to unify the world on the runtime's native object/string/etc concepts, but with both wasm and the component model we're both talking about a very different setting.
Fundamentally, the problem in having canonicalization be a pairwise detail is that you lose any abstract semantics for a component in isolation. That is, if I give you a component A, you can't tell me, semantically, what it precisely does without knowing which component B uses it and which languages the two components were implemented in. That's the opposite of "black box reuse", which is one of the fundamental goals of components.
Hmm, one side effect of rebasing interface types on top of the component model that I hadn't thought of before is that we no longer have a roadmap for intra-component FFI. Previously we had been punting all issues of inter-language communication to IT, even for boundaries between modules in the same trust domain. For that fully trusted FFI use case, preserving NaN bits is a perfectly reasonable thing to want to do, and I think that's what @RossTate is getting at above. Do we have a story for creating polyglot components?
One way to sidestep the whole problem would be to have both canonicalized and non-canonicalized float interface types, but that seems like a big hammer. Would it also be possible to lift an f64 with NaN boxing into a sum type and lower it back to a NaN boxing f64 on the other side as a runtime no-op?
My operating assumption is: regardless of same-language vs. different-language, all the code inside a single component is either generated from the same toolchain or adhering to the same ABI (like this one). Because of that, there's no strong need for interface types or shared-nothing linking: everyone just follows the same ABI using the same linear memory and trusts that everyone else does as well. (Interface Types are for when you don't want to assume those things.)
Do we have a story for creating polyglot components?
Yes: use core wasm and use the module-linking features to link those core modules together (noting that module-linking features don't require interface types and are happy to link up core modules as shown).
Thanks, @tlively, for effectively rephrasing one of my concerns.
My connection to libc was bad; clearly importing things like malloc only makes sense in a shared-everything model. Sorry for getting my systems crossed.
However, the floating-point libraries I mentioned still seem like perfect examples of what a (multi-language) component model should support. They are easily shared-nothing: they provide solely "pure" input-output functions that have no state and don't even need a shadow stack. And they are regularly used in multi-language settings: Java programs using java.lang.StrictMath link with fdlibm, and many Julia programs link with OpenLibm. Due to the pure nature of these libraries, multiple (separate) Java programs should be able to link to the same fdlibm component, and likewise for Julia and OpenLibm, rather than needlessly duplicating this code. In other words, these shared-nothing libraries are effectively a service that programs needing consistent cross-platform behavior can link to.
But I think the higher-level issue here is agreeing upon what a Multi-Language Component Model is. From your presentation, I understood a component to be a piece of software that implements some API and is self-contained/isolated (i.e. shared-nothing) except through explicit data exchanges (via interface types). To me, the above floating-point libraries match that description. I suspect that we roughly agree on that high-level description of components—where we disagree is on what multi-language means.
From the discussion above, my sense is that the interpretation of multi-language y'all are arguing for is that all expressible component APIs can be conveniently implemented and utilized to their full extent by "any" language. But to me, multi-language means that a component implementing an API can be implemented by "any" language capable of implementing that API. So if the API is to maintain a stateful counter, then "pure" languages like Haskell (or, even more constraining, Coq) are probably not what you're going to implement your component with. And if the API requires preserving NaNs, then JavaScript is probably not what you're going to implement your component with. And if that API offers more precision than what some other components (or languages) need, then those other components simply won't utilize the full benefits of the services your component offers (and no one is hurt by that).
I consider my interpretation to be additive—the more languages you include in "any", the more APIs you can (conveniently) support—whereas the other interpretation seems to be subtractive—the more languages you include in "any", the fewer APIs you're allowed to support. I don't see the value of the subtractive interpretation (are we going to restrict all components to be pure functions so that Haskell/Coq can call them conveniently?), but I do see value in the additive interpretation.
An example that comes to mind is decimal values. C# and .NET offer direct support for 128-bit decimal floating-point values, including syntactic integration into the C# language and hardware acceleration in the runtime. This is extremely valuable to financial programs.
With the subtractive interpretation, we wouldn't add something like d128 to interface types. Until every language has built-in support for decimal floating-point values, no API can use them.
With the additive interpretation, we would add something like d128 to interface types. Sure, many languages won't have built-in support for them, but they can choose how best to fit the concept into their system. One way would be to simply approximate them as binary floating-point values, i.e. f64. But another way would be to make a new library available within that language (say Java) that provides a Decimal class that simply stores the bits of a d128 as two i64s and implements the various operations/methods of Decimal by simply calling out to the C#-implemented library and lifting those two i64 values as a d128. Or, in the case of Python and Julia, you could simply lower d128 into the existing decimal or Decimals libraries.
I would like interface types to provide a system where people can deliberately write different components of a program in different languages according to the strengths of those languages and then conveniently compose those components in order to collectively construct programs that no one programming language could write well. That's what a multi-language component model means to me, and to me that means that interface types should broaden rather than restrict interactions.
I don't think the general question you're asking can be answered definitively in the abstract with either of the extreme positions you're suggesting ("only if all languages" vs. "only if any language"). It's easy to think of examples where either extreme position will lead to bad outcomes, and thus I don't think we can simply argue for one or the other in the abstract.
Rather, as I think is usual with standards design, we have to consider the practical reality of use cases and weigh pros vs. cons, case by case. There are real practical downsides (listed above) with allowing non-canonical NaNs to cross boundaries and I think all the use cases for supporting non-canonical NaNs are hypothetical. Moreover, in line with what @tlively suggested, if real use cases did emerge, the right way to support them would be to add a second type; this would be a clear and useful signal to all implementations involved that they should take the extra pains to preserve the NaN payload. (E.g., a JS binding could then produce something other than a Number for this new type while still getting to use Number for the majority-case f64.)
Okay, so we're back to not treating this as a problem about multi-language interop, but rather specifically about floating-point.
There are real practical downsides (listed above) with allowing non-canonical NaNs to cross boundaries and I think all the use cases for supporting non-canonical NaNs are hypothetical.
I gave real existing libraries that real existing languages currently link to in a cross-language shared-nothing manner, and the specifications of those APIs explicitly state requirements (in line with IEEE 754-2019 recommendations) that cannot be supported with NaN canonicalization. As many language runtimes link against a foreign-implemented library for these floating-point functions (and expect them to preserve NaN payloads per IEEE 754 recommendations), such a component would provide a service that could be shared by many programs implemented across many languages. Could you articulate why you believe this is not a viable use case for interface types?
While you've listed some hypothetical downsides, they did not seem to me to be articulated in sufficient depth to assert that they were real and practical. It would help me understand your perspective better if you were to elaborate on (one of) them further. For example, you mention tooling, but I don't see how NaN canonicalization would affect tools like LLVM/Binaryen—you have to know how the rest of the program handles (or ignores) NaNs, which only the programmer knows and hence there already exist various compiler flags to indicate how much flexibility to grant the compiler with respect to floating-point values. Maybe you have something else in mind, but without an elaboration on what that something is I have a hard time seeing how NaN canonicalization would have a real practical benefit for tooling.
Late to this party, and I am not an expert in this area so feel free to ignore, but my gut reaction is that defaulting to canonicalization would not be desirable. My thinking:
- An f64 is just a bucket of 64 bits, each combination of which is assigned meaning by the IEEE standard. Saying that we only support a subset of combinations seems counter to the low-level nature of Wasm.
- The existing f64 already is specced to support all combinations, so we'd need the interface type to not be named f64 if we wanted to be very strict about what kind of data type this is.
- A language like JS can read serialized floats from all sorts of sources and already has to deal with canonicalization itself. If you're writing data processing in JS that must roundtrip, then you have to avoid this somehow. To me, this wouldn't be any different if JS code deals with data coming from IT.
- Generally, IT has the philosophy that if two languages agree on data format there should be no additional overhead, so C sending data to Rust should not be subject to unnecessary conversions.
So my solution would be for f64 to continue to mean that all bits have meaning in all combinations, but that adapters for individual languages can decide to either canonicalize or use raw bits storage. It'd be up to programmers to know if a language is suitable for implementing a particular interface, much like they already must know that they can't roundtrip serialized data using certain data types.
That, or introduce a c64 type :P
I gave real existing libraries that real existing languages currently link to in a cross-language shared-nothing manner, and the specifications of those APIs explicitly state requirements (in line with IEEE 754-2019 recommendations) that cannot be supported with NaN canonicalization.
If we were to compile that OpenLibm acos code to WebAssembly, today, we'd get a function which can return nondeterministic NaN bits. This is because core wasm's own arithmetic operators don't guarantee to propagate NaN payloads. Consequently, that OpenLibm acos example already doesn't satisfy use cases that require NaN propagation.
It would be easy to modify the code to do what .NET does to ensure IEEE compliance. Right now, the functions in those libraries have typically just one line that relies on the fact that either x+x or (x-x)/(x-x) preserves NaN payloads on IEEE-754-2019-compliant hardware, so you could easily make the change in the C source code without having to do anything specifically in WebAssembly (and without meaningfully changing performance since these lines are off the hot path). This would be in line with changes these libraries have made to accommodate buggy compilers and buggy hardware.
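As a hedged sketch of that kind of source-level change (the function name is made up and the real argument-reduction code is elided), the NaN case can simply return its input, which preserves the payload without relying on how any particular compiler or platform treats NaNs in arithmetic:

```c
#include <math.h>

double my_acos(double x) {
    if (isnan(x))
        return x;                  /* explicit, payload-preserving guard */
    if (x < -1.0 || x > 1.0)
        return (x - x) / (x - x);  /* domain error: 0/0 -> NaN, raises invalid */
    /* ... the usual argument reduction and polynomial evaluation ... */
    return acos(x);                /* placeholder for the real computation */
}
```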
More generally, on a platform where add, sub, mul and sqrt don't propagate NaN payloads, how important is it for acos to propagate NaN payloads?
Two thoughts:
- While those instructions are not guaranteed to propagate NaN payloads, they are generally implemented with common instructions/libraries, and so based on these libraries' experiences I wouldn't be surprised if in practice wasm propagates NaN payloads fairly reliably (though I do know there have been some weird cases identified in SIMD). You could probably compile the above libraries to wasm without change and see them still preserve NaNs despite the lack of guarantees.
- That aside, it's easy to have a compiler generate NaN-guards around these instructions. I suspect any program really wanting to ensure cross-platform determinism will do so.
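A minimal sketch of such a NaN-guard as a toolchain might emit around a single operation (the wrapper name is illustrative; the guard just returns a NaN operand unchanged instead of letting it flow through the payload-nondeterministic arithmetic):

```c
#include <math.h>

static inline double guarded_add(double a, double b) {
    if (isnan(a)) return a;   /* propagate the left operand's payload */
    if (isnan(b)) return b;   /* otherwise the right operand's */
    return a + b;             /* non-NaN case: the plain f64.add */
}
```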
I get why add and such are not guaranteed to propagate NaN payloads in wasm (though it might be reasonable to require that they do when all inputs have the same payload): due to the variation of hardware, we'd have to go out of our way to make this deterministic and take a performance hit for it. But NaN canonicalization in interface types would be going out of our way to remove existing functionality while adding additional complexity (because now lowering has to specify a canonical NaN—I've found existing systems with different choices of canonical NaNs), and I don't (yet) see the benefit of that.