Alternatives to i31ref wrt compiling parametric polymorphism on uniformly-represented values (OCaml)
As i31ref doesn't seem to be an unanimously-agreed-on (see https://github.com/WebAssembly/gc/issues/53) part of the GC MVP spec, I am very interested in discussing what the concrete alternatives to it are in the context of parametric polymorphism on uniformly-represented values. (I would appreciate if the answer doesn't immediately read as "use your own GC on linear memory".)
To give some (historical) context: Why does OCaml use 31-bit integers in the first place? Generally, it is possible, to have a model of uniform values where every value is "boxed" (i.e. lives in its own, individually allocated, heap block). Then, every value is represented by a pointer to the heap and can be passed in a single register when calling a function. A heap block always consists of a header (for the GC), and a sequence of machine words (values). From an expressiveness standpoint, this is fine. However, when even simple values such as integers are always boxed (i.e. require a memory access to "unbox" them), performance suffers. Design constraints for the representation of unboxed integers were: a) need to be able to pass unboxed integer values in a single register, and b) need a means for the GC to distinguish (when crawling the heap) whether a value in a heap block represents an unboxed integer or a pointer to another heap block, c) being as simple as possible for the sake of maintainability. In OCaml, the compromise between performance and simplicity that was chosen is to unbox integer values by shifting them left by one bit and adding one. Since pointers are always word-aligned, this made it trivial to distinguish unboxed integers from values that live behind heap pointers. While this is not the best-performing solution (because all integer arithmetic has to operate on tagged values), it is a simple one.
Note that there exist compilation targets of OCaml that use 32-bit integer arithmetic, and the OCaml ecosystem largely accounts for that. Having libraries also consider the case where integers have 64-bits seems feasible. Some code will get faster if we can use native 32-bit integer arithmetic.
Ideally, for the sake of simplicity, we would like to emit one type Value to WebAssembly, which represents an OCaml value, which is either:
- a reference to a heap block
- an unboxed value that fits into a machine register
- a reference to some opaque value from the outside world (traditionally, for OCaml, this is C, but in WebAssembly, this could be either an
anyrefor some address in the linear memory of another module)
A heap block of OCaml traditionally consists of
- a header word that holds a) a tag that gives some information about the block, b) GC color bits (obsolete with WASM GC), and c) the size in machine words of the block
- a sequence of machine words, length of it being the size specified in the header.
The most trivial representation (i.e. the one matching most closely the existing one) that I see when I look at the MVP spec is an anyref array that holds both references to other heap blocks and i31ref values. So, from the viewpoint of having to do as little work as possible in order to compile to WebAssembly and keeping the implementation simple, i31ref is certainly looking very attractive for an OCaml-to-WASM compiler MVP.
In https://github.com/WebAssembly/gc/issues/53#issuecomment-546252669, @rossberg summarized:
For polymorphic languages, there are these [heap representations]:
- Pointer tagging, unboxing small scalars
- Type passing, unboxing native scalars, runtime type dispatch
- Type passing, unboxing native scalars, runtime code specialisation
- Boxing everything
- Static code specialisation
From OCaml's perspective, I think that (2) and (4) don't seem acceptable as a long-term solution in terms of performance. Here, compiling to the WASM linear memory and shipping our own GC seems a more attractive choice.
So, that leaves (3) and (5).
(3) seems fairly complex. If the WebAssembly engine would do the runtime code specialization, or if we could reuse some infrastructure from another, similar language, it could be worthwhile for us to work with that. It currently seems unlikely that OCaml in general will switch to (3) in the foreseeable future, unless we can come up with a simple model of runtime code specialization. I expect that implementing runtime code specialization in a WebAssembly engine goes way beyond a MVP, so it seems unlikely this will happen.
(5) is simpler than (3) in the sense that we do not have to ship a nontrivial runtime. If we analyze the whole program in order to emit precise types (struct instead of anyref array) for our heap blocks on WebAssembly, we wouldn't need to use i31ref and we can reap the other benefits of whole-program optimization (e.g. dead-code elimination, operating with native unboxed values, no awkward 31-bit arithmetic). Still, this will be a sizeable amount of work (possible too much to do it right away).
I also can't say how bad the size of the emitted code will be in terms of all the types we need to emit. Instead of emitting a single Value type, we need to emit one struct type for every "shape" of heap block that can occur in the program. To keep this manageable, we need to unify all the types whose heap block representations have the same shape.
Then, static code specialization kills one nice feature of OCaml: separate compilation of modules. However, instead of doing static code specialization before emitting WebAssembly, maybe it is possible to implement a linker for our emitted WebAssembly modules that does code specialization at link time if we emit some additional information to the compiled WebAssembly modules? This kind of linker could possibly be interesting to other languages that are in a similar position as us, as well. Obviously, link time will be slower than we are used to. I haven't thought this through in detail at all.
It seems likely that these issues are manageable, if enough effort is put into them.
Edit: while the previous paragraph sounds fairly optimistic, looking into whole-program monomorphization (turning one polymorphic function into several non-polymorphic ones) more closely, it is definitely not trivial to implement. Types that we would need for this are no longer present at the lower compilation stages. When I look at the MLton compiler (a whole-program-optimizing compiler for Standard ML), it seems that it is a good idea to monomorphize early, in order to be able to optimize based on the types of parameters. Features like GADTs, and the ability to store heterogenous values or polymorphic functions in hash maps (or other data types) do not make it simpler. It looks to me like this would mean almost a full rewrite of the existing compiler and it is not obvious whether every function can be monomorphized (without resorting to runtime type dispatch).
Are we missing something here, are there other techniques that we have been overlooking so far? Feel free to drop pointers to good papers on the topics in general, if you know some.
Also, I am very interested what perspective other languages with similar value representations have on this and whether there is interest in collaborating on a code-specializing and dead-code eliminating linker.
Although you've ruled it out, I would actually suggest (4) for OCaml, albeit with some optimization.
First, the optimization I have in mind is to have OCaml function definitions operating on int compile to wasm function definitions operating on i32. For polymorphic uses of these functions, have another definition that works on whatever "uniform" representation of OCaml values you use, simply calling the optimized function with appropriate (un)boxing. Or more generally, when a value can be statically determined to be an int, represent it as i32.
There are a few reasons I suggest this:
- A number of other languages have found this approach effective enough. See various JVM languages like Java, Kotlin, and Scala that all take this approach.
- It's still possible the MVP will have
i31ref. It's also possible it'll instead have something likei32ref, which would generally be boxed on 32-bit machines and unboxed on 64-bit machines. - Even if the MVP doesn't have these things, the MVP is just the first release. WebAssembly will advance, possibly in ways that would make specifically 31-bit integers often (if not always) be unboxed for modules that specifically request it.
So in short, I suspect the short-term costs of this approach are more acceptable than you might expect (with some local/composable optimization techniques), and I suspect it is more likely to adapt well to integrate the improvements that will be made to WebAssembly over time.
There's also a meta-point to consider: having WebAssembly modules compiled from OCaml in this manner will give the CG realistic programs to experiment with and analyze, which can be used to determine if it's worth the effort (at all levels of the process) to give modules more control over their low-level representations. Right now we can only hypothesize.
That all said, you know your specific needs much better than I, and even if this approach might work well for the OCaml community broadly, it still might not be well-suited to more specialized concerns you might have.
Fair enough, (4) is close enough to (1) to look like the next best solution in terms of effort needed and we could use i32 arithmetic. I'll look into this in more detail to see if it is indeed so straightforward.
I'm wondering if I can do this by first compiling to "WASM GC MVP with i31ref" and then translating that code to "WASM GC MVP without i31ref". At least, this way, I will not have to go all the way back to unbox all these integers again when a feature that is equivalent to 31ref arrives. Has anyone done that?
Even with i31ref, you have to coerce integers to and from i31ref using instructions i31.new and i31.get_u/i31/get_s. So, if the design of i31ref changes, it should be pretty trivial to change your compiler to substitute in the appropriate coercion instructions.
The more difficult thing to change would be using 31-bit vs. 32-bit arithmetic. It sounds like putting in the effort for 31-bit arithmetic is only worth it if i31ref exists. Since that's not guaranteed, I'd use easier 32-bit arithmetic for now, and then put in the effort to switch to 31-bit arithmetic only if/when i31ref has been finalized.
@sabine:
Then, static code specialization kills one nice feature of OCaml: separate compilation of modules.
That would be the smallest problem. More importantly, as you note in your edit, static specialisation kills any form of first-class polymorphism, of which OCaml has plenty (polymorphic methods and record fields, first-class modules, GADTs, ...). So I think this approach is simply impossible to use. The same is true for any other language with a rich enough type system.
@RossTate:
For polymorphic uses of these functions, have another definition that works on whatever "uniform" representation of OCaml values you use, simply calling the optimized function with appropriate (un)boxing.
That is not a scalable approach. In languages like OCaml, you have polymorphic type inference, and by design that results in way more polymorphism than in traditional languages. You also use lots of abstract types. In particular, many functions are polymorphic in multiple independent types, and refer to multiple abstract types. To be efficient, the approach you suggest would require an exponential number of specialisations to be generated upfront for each such function.
There also is the practical problem of porting. If the GC proposal does not support established implementation techniques, but instead requires a major rewrite of a pre-existing compiler and runtime, then it will not be a plausible target for many existing languages.
The advice I gave here is specific to OCaml. There is only one type we were talking about doing this for: int. You can do these coercions per use site, and you can consolidate coercions if you want. These are conceptually coercions between monomorphic and polymorphic calling conventions. There will need to be coercions between curried and uncurried calling conventions anyways. That is, an int -> int -> int can and should be called with two arguments (i.e. uncurried), but it can also be given to of type (int -> 'a) -> int -> 'a, in which it must be called with one argument (i.e. curried). That polymorphic function will then return a function using a curried calling convention and so that result often needs to be coerced to an uncurried calling convention. That is, unless you are suggesting that all functions always use the curried calling convention, since WebAssembly does not currently support the sort of variadic arguments that would be necessary to eliminate these coercions. So my suggestion just hooks into infrastructure that would need to be in an OCaml-to-WebAssembly compiler anyways.
There has been a good amount of research work on type-directed or shape-directed unboxing, supported by code specialization. For a recent example, see the work on call-graph-based specialization of (boxed) polymorphic functions as part of the 2017 thesis of Dmitry Petrashko on Scala. But these work typically require a fair amount of sophistication. In contrast, having an efficient way to "box" immediate scalars without going through the heap is easy for backends to allow, and fairly easy to compile into for types that fit the smaller width. The two approaches are complementary (clever specialization optimizations has other benefits), and yes, boxing everything is of course possible if the low-level substrate does not provide better, but getting reasonable performances requires much more sophistication on the user side.
Many of the successful language implementations work by keeping a tight check on their complexity budget: they go for the 20% of sophistication that brings 80% of the benefits in as many areas as possible. OCaml was faster than most ML, or Chicken Scheme than most Schemes, not through extremely optimizations, but rather thanks a few very-important optimizations (static function calls) and good data-representation choices that ensured that typical programs were fast by default.
It is also possible to build fast systems that are slow by default but fast through excellent, sophisticated engineering (good JVMs or CLR implementations, GHC, Graal+Truffle, etc.), but I agree with @rossberg that it is important to ensure that the 20%-80% approaches are available first.
The issue is "alternatives to i31ref", and the concerns are fairly specific to OCaml due its use of specifically 31-bit arithmetic. (Haskell, for example, only guarantees 30-bit precision because they have found reserving 2 bits for GC to be useful.) There is already another very long thread, #53, about i31ref. Let's not pull this thread into an argument about i31ref, an argument that cannot be conducted well if we have not even considered the alternatives, which is what this thread is about.
Consider a function like this one from the Map interface:
val mapi : (key -> 'a -> 'b) -> 'a t -> 'b t
This is polymorphic in two type variables and two abstract types. Each of them could either be an integer or something else. So even with just int, that makes 2^4 = 16 different ways in which you'd have to specialise the code according to your suggestion (8 if the compilation scheme allows to fix the export type t).
Or you reduce the number of specialisations, lump several dimensions together, and pay the extra cost of frequent boxing/unboxing conversions for the rest. Many variations of this have been investigated in highly polymorphic languages, and there are reasons few are used in practice. Leroy (among others) actually had a line of papers researching unboxing for polymorphic languages, yet OCaml, like many similar languages, uses a uniform representation instead. He observed that the conversion overhead often is higher than the benefit of unboxing, and that a minimal unboxing scheme produces better results for typical profiles.
I'm not intimately familiar with how OCaml's native code compiler handles currying, but the byte code compiler handles it through clever dynamic stack techniques that are unrelated to compile-time unboxing.
the concerns are fairly specific to OCaml due its use of specifically 31-bit arithmetic.
No, they're not, and it's getting a bit boring at this point that I have to keep refuting that. So again: plain ints are just one particular use case. The general purpose is unboxing small scalars of any sort, for which there are many, many use cases, in many, many languages. (And of course, 30 bit ints would fit just fine as well.)
As a bystander who follows these threads I wanted to share some feedback that this aggressiveness is quite off-putting.
but the byte code compiler handles it through clever dynamic stack techniques that are unrelated to compile-time unboxing
Unfortunately, WebAssembly does not support these techniques. So I believe coercions will be necessary instead. @sabine, have you considered this problem with currying?
To clarify, I was not suggesting specializing polymorphic functions based on their type arguments. I was suggesting coercing monomorphic functions that work specifically on integers to polymorphic functions operating on the uniform representation (when used polymorphically). This could be done with an augmentation of the coercion system I was suggesting above for dealing with currying.
To clarify, I was not suggesting specializing polymorphic functions based on their type arguments. I was suggesting coercing monomorphic functions that work specifically on integers to polymorphic functions
Oh, okay, but that's probably even worse. You would have to box/unbox in all first-order contexts where you pass a value from polymorphic to monomorphic functions and vice versa, which is like all the time. And you'd still have the same problem: for a function of type int->int->int, there are 8 different possibilities of a polymorphic context to which it could be passed in a higher-order fashion. All would have a different calling convention under the approach you suggest. So either your function has 8 different specialisations, or you're creating wrapper functions at each polymorphic use site.
@sabine thanks for summarizing things so clearly.
First, until we invent JIT and other features, the current Wasm, even when including the current GC proposal, is an entirely static, dare I say "closed world" system, with AOT compilation to a single .wasm, with concrete types at the boundary. By definition, all forms of polymorphism, modularity, separate compilation etc will be resolved by the time that .wasm gets baked, no matter what the language likes to present beforehand. This alone should make (5) by far the most realistic option for most languages that care about speed, again, in the current way Wasm works.
There seems to be a lot of default fear of (5), that it going to result in unreasonable code bloat. If you generate specializations under a closed world assumption, you only generate the ones that are actually used. This causes more in-lining of single-use functions (not further increasing code size) followed by much more aggressive optimisation of inlined code working directly on specific naked (scalar) types, often shrinking the code back down significantly. It turns indirect calls into static calls, helping greatly with dead-code elimination. LLVM and binaryen can do endlessly more on code like this than code that stays generic. In my personal experience working with (5) a lot, the amount of code bloat can be remarkably small, and the resulting speedups due to code "collapsing" impressive.
But @rossberg is correct that anyone that prefers (5) can make this happen, regardless of the presence of i31ref. So why would we want i31ref additionally?
I'd like to see an example of polymorphism that is impossible to compile down to specialized code using (5), and requires a struct of all anyref to work. I think, by definition, that such an example does not exist (given that we don't have any dynamic code features in Wasm that would make this statically impossible).
There will be people who have code size as their #1 concern, and don't believe my story above that it can actually help reduce code size, when done right. That's fine, because it is hard to make claims about this in the absence of specific languages with specific compilers and specific applications.
Finally, are there downsides to having i31ref when you don't use it? Under what circumstances does a Wasm implementation have to check "is this thing an i31ref?" before accessing a pointer? If anyref is always a pointer, always pointing to a heap object where the initial bytes always have the same layout (e.g. a type id), I can imagine that would lead the speedups. If indeed there are frequent checks required for i31ref, that to me would be the strongest reason to not wanting to have it.
@aardappel, repeating what I pointed out above, and I believe many times before: static type specialisation does not work in a language with first-class polymorphism or polymorphic recursion.
Simplest possible example in OCaml:
type poly = {f : 'a. 'a -> unit}
let g {f} = f 1; f "boo"
There is no way you can statically specialise anything here, since you don't know what polymorphic definition you are calling. It's completely dynamic. A client of g could pass anything for f, it could be taken from a list, or any dynamically computed data structure. Inversely, the site defining an f generally has no way of knowing where it flows and what types it will be used at.
The same is true for generic methods in a language like C#, which is why the CIL performs jit-time type specialisation. And it is the reason why C++ does not allow "template virtual methods", because its simplistic compilation model cannot support them.
Static specialisation also does not work in a language with polymorphic recursion, because the set of instantiations can be statically unbounded.
If I could make a wish, then that folks in these discussions could accept the fact that there are very good reasons why certain classes of languages are implemented in certain ways, that there has been decades of engineering and research into it, that we are not smarter than all the many folks who did this, and that we need to support a mapping for those compilation techniques instead of hoping for some yet undiscovered trick to make these requirements go away. Please?
First, until we invent JIT and other features, the current Wasm, even when including the current GC proposal, is an entirely static, dare I say "closed world" system, with AOT compilation to a single .wasm, with concrete types at the boundary. By definition, all forms of polymorphism, modularity, separate compilation etc will be resolved by the time that .wasm gets baked, no matter what the language likes to present beforehand
I'm confused, doesn't wasm allow producing modules and linking them together? I would expect compilers to use those to produce modules at the natural separate-compilation granularity of their source language. In particular, I would expect to compile each module/crate/package separately, without making assumptions on how other modules are going to use it. Of course you can still generate WASM code only after a pass of link-time-optimization, but I would naively expect that it can lead to code-distribution issues (if the generated WASM for my library depends on the clients it's compiled with, code caching etc. is going to be more delicate to handle).
I'd like to see an example of polymorphism that is impossible to compile down to specialized code using (5), and requires a struct of all anyref to work.
It's easy to have examples where specialization leads to impressive blowups in code size. I mentioned the Scala specialization work by Dmitry Petrashko earlier; it starts by evaluating previous 2014/2015 work on call-graph-based type specialization in Scala on the codebase of the Dotty compiler, and found out that on average each method would be specialized 22 times.
Then there is the example of polymorphic recursion, where the set of different type instantiations depends on runtime data (it particular it may be arbitrarily large). (In most cases I'm familiar with, the number of different "shapes" may be bounded, especially if you only distinguish immediates vs. pointers.)
But then what about union types? Many language runtimes use representations that are either an immediate word or a pointer, with tagging to distinguish the two. This is ubiquitous for example in implementations of ML, Haskell, Scheme and what not. You can very easily have tuples or arrays of such immediate-or-pointers values, where the immediate-or-pointerness of any element is only known at runtime and can change on mutation. How would one operate on this data if functions are expected to know statically whether they take an immediate or a pointer value?
Under what circumstances does a Wasm implementation have to check "is this thing an i31ref?" before accessing a pointer? If anyref is always a pointer, always pointing to a heap object where the initial bytes always have the same layout (e.g. a type id), I can imagine that would lead the speedups.
I'm not very familiar with Wasm (I was drawn into this discussion by chance), but I believe that:
- programs using
int31refwould need to check for it when they know it can happen; this does not affect the performance of programs not using this type - the Wasm GC would need to check
int31refbefore following a reference (because in this case it's not a valid pointer); checking one bit of a word is practically free compared to dereferencing the word as a pointer - someone willing to propose runtime operations operating on any wasm value (serialization, etc.) would do similarly need to check for
int31refonanyrefvalues; again this is extremely cheap
If I understand correctly, Wasm implementations try to be safe from undefined behavior, so in particular they are careful to check that pointers go into allowed memory before dereferencing them. This monitoring discipline, and in general wasm sandboxing/CFI features, are going to completely drown out the cost of checking bitpatterns on a word (even if you decided to do checks or masking on each dereference).
@rossberg your example is essentially a function pointer to a generic function if I'm not mistaken, which is indeed not possible in languages that rely entirely on monomorphic specialization. I'd expect a compiler to choose to force boxing on args for such functions (causing an extra heap alloc for f 1).
I don't see why generic functions that are not used in this very dynamic fashion (which I would certainly hope are the vast majority) should suffer from the mere presence of this possibly very dynamic use, and certainly not why Wasm as an eco system should pay for this cost (if the mere presence of int31ref introduces conditionals). I'd hope the cost to only be born by functions that use this functionality.
I am well aware that there are endless people who spend longer time looking at this than me, but just an "appeal to authority" is not going to be sufficient to stop me from at least questioning this direction.
If I could make a wish, then that folks in these discussions could accept the fact that there are very good reasons why certain classes of languages are implemented in certain ways, that there has been decades of engineering and research into it, that we are not smarter than all the many folks who did this, and that we need to support a mapping for those compilation techniques instead of hoping for some yet undiscovered trick to make these requirements go away. Please?
The Glasgow Haskell Compiler, which presumably is considered a production compiler within this class of languages, does not unbox Int even though the Haskell semantics specifically permits Int to have only 30 bits (a number they arrived at because other runtimes that do unbox Int found it useful to reserve two bits for GC). From what I understand, they did so because they found that algebraic operations were so common that it was more useful to embed the algebraic-constructor tag in the low bits of the pointer. So the experience of Haskell developers is that either custom pointer tagging should be preferred over i31ref for better performance or i30ref should be preferred over i31ref for better GC strategies. Having looked through a number of language implementations, this sort of gap seems pretty common. Unfortunately for OCaml, it is an outlier because I believe it is the only language anchored specifically around 31-bit integers. This is why I thought it would be useful to focus specifically on advice for OCaml.
@gasche
I'm confused, doesn't wasm allow producing modules and linking them together? I would expect compilers to use those to produce modules at the natural separate-compilation granularity of their source language.
Yes, but the types at the edges are currently very basic types that may well not make it possible to express the full set of type feature of a language like ML, meaning separate compilation would only be "safe" if compiled from one version of the program, essentially not making it separate compilation anymore. This may change in the future but is TBD. Also, current linking of multiple Wasm modules is runtime-only, though static linking is planned.
and found out that on average each method would be specialized 22 times.
Average? Each method in the entire compiler? Meaning many methods would be specialized hundreds of times? (to account for the methods only used once). I find that hard to understand how that is possible. Does this account for optimization and DCE of the now much more static blown up code?
Would be more useful to know how much a codebase would blow up in total, for example the ratio of AST nodes of a large program before and after specialization. In my tests for example, in cases with heavy nesting of higher order functions, that ratio was at most 2x (after optimization). All very anecdotal of course.
If I understand correctly, Wasm implementations try to be safe from undefined behavior, so in particular they are careful to check that pointers go into allowed memory before dereferencing them. This monitoring discipline, and in general wasm sandboxing/CFI features, are going to completely drown out the cost of checking bitpatterns on a word (even if you decided to do checks or masking on each dereference).
An anyref is entirely under the control of the Wasm implementation, thus typically would need no checking of any kind before dereferencing as a pointer. Checking its bits, and potentially branching (and maybe missing) would be an entirely additional cost int31ref would bring. So no, not quite for free.
(sandboxing is only done for linear memory pointers, but even there it has no cost typically due to use of memory mapping features)
@aardappel:
I don't see why generic functions that are not used in this very dynamic fashion (which I would certainly hope are the vast majority) should suffer from the mere presence of this possibly very dynamic use
You don't know at its definition site which polymorphic function ends up being used that way, and with what degree of polymorphism. The composition can happen somewhere else entirely, e.g. in a different module. And it can be partially instantiated, i.e., you plug in a polymorphic function somewhere where a less polymorphic (but not monomorphic) type is expected. There are many degrees of freedom, and you generally need a compilation scheme that is both efficient and compositional for all of them.
In the limit, the possibilities for polymorphic composition are almost similar to those in a dynamic language, which is why they make similar representation choices. (Though an important difference is that there usually is no observable type dispatch, because polymorphism is fully parametric.)
@RossTate:
As I have stated many times before, whether it's i31 or i30 is largely immaterial. This is an internal representation type. For the vast majority of use cases its max size doesn't matter. However, we can't easily make extra pointer bits available to user space across engines anyway, so AFAICS, we can just as well provide 31 bit range for tagged ints. But if there are reasons to pick another width, then that's totally fine as well. I just haven't heard them yet.
There are many languages that use (wordsize-1) bit integers in their implementations one way or the other. Whether a language implementation exposes that to users is a completely separate question. Some do (not just OCaml), some also expose values like 30, 29, or 28. But where they do, the language definition typically does not guarantee an actual size. For example, OCaml explicitly states that int_size can have other values. So again, size doesn't matter.
@aardappel:
Yes, but the types at the edges are currently very basic types that may well not make it possible to express the full set of type feature of a language like ML, meaning separate compilation would only be "safe" if compiled from one version of the program, essentially not making it separate compilation anymore.
Huh? I'd fully expect that a language with a proper module system can (and should!) compile its modules to Wasm modules separately, and without any loss of generality. Wasm's module system allows you to import/export anything, so such a compilation scheme should be perfectly possible.
An anyref is entirely under the control of the Wasm implementation, thus typically would need no checking of any kind before dereferencing as a pointer. Checking its bits, and potentially branching (and maybe missing) would be an entirely additional cost int31ref would bring. So no, not quite for free.
You can't deref an anyref. You first need to cast it to something concrete. That necessarily involves a check, with or without i31ref.
Checking its bits, and potentially branching (and maybe missing) would be an entirely additional cost int31ref would bring. So no, not quite for free.
You can't deref an anyref. You first need to cast it to something concrete. That necessarily involves a check, with or without i31ref.
That is indeed the one place where adding i31ref has a non-zero cost: when ref.test or ref.cast or br_on_cast operate on a value that's statically typed as anyref or eqref, then before loading and comparing that value's type descriptor, they first have to check the pointer's tag bit. Given that the type descriptor check (which needs to happen with or without i31ref's existence) involves a load from memory, and the tag bit check doesn't, I'm confident that the additional overhead is negligible.
Any code that has more specific static knowledge than eqref is entirely unaffected by the existence of i31ref. Code that passes around anyref values (without downcasting them) is just as unaffected.
As a bystander who follows these threads I wanted to share some feedback that this aggressiveness is quite off-putting.
I don't think anyone contributing here wants to come off as aggressive (I certainly don't want to :smile:). This i31ref part of the spec is a rather controversial issue in both a technical and political sense:
(a) WASM is this great thing that everyone wants to be part of and that, by aiming for strong safety guarantees, sets the bar quite a bit higher for everyone, compared to the commonly-used traditional architectures that we are used to.
(b) so, all kinds of different people are coming together. Many people (language implementers, engine implementers) are coming here with existing codebases. When looking at compiling to WASM or implementing WASM, we discover choices in our codebases that make it difficult to work with WASM. Some things that were before considered acceptable in the existing codebase now present as technical debt. Some choices are very fundamental to the codebase and it is not immediately obvious whether or even how we can undo and rework the codebase at a reasonable cost. So we all try to understand how to work with the spec in a way that is maintainable in the long run.
(c) all involved here care deeply about performance. With WASM being a melting pot for all kinds of languages, none of the language implementers can expect to get a WASM that caters optimally to all concerns of their language (or, for that matter, a WASM that is trivial to compile to). There are so incredibly many constraints and requirements coming from all the languages overall. Still, we consider reasonable tradeoffs in performance worth it in the overall scheme, the premise of being able to run code anywhere by compiling to WASM. One important task of WASM engine implementers is to watch over the specification and look for features that introduce unnecessary performance costs, while all the compiler implementers try to lobby for and ask for features that they can make use of (ideally, while keeping in mind the overall goal of making WASM an efficient compilation target for many languages). The goal here for WASM GC is to strike a balance between WASM GC being small and fast and providing the expressivity that makes it a reasonable compilation target.
(d) resource constraints matter for all of us. Decisions need to be made and we all need to put only so much on our plates that we can still handle that and get something done. I appreciate very much that WASM is not some huge gigantic beast, but a rather compact language whose specification can be read and mostly understood in a very short time.
I'm trying to summarize and sometimes comment on some of the points brought up. If you feel that I missed your point or something needs clarification, do speak up.
@rossberg says that first-class polymorphism, of which OCaml has plenty (polymorphic methods and record fields, first-class modules, GADTs, ...) makes it impossible to do static code specialization to the point WASM GC without i31ref requires. OCaml compiler implementers that I asked seem to share this idea and provided some hard-to-monomorphize examples. I know too little about all the features of OCaml at this point (I have been working with the compiler only since September part time and I still have a lot to learn.) to say whether full-program monomorphization is possible or not - I do not have a proof for either of these hypotheses. @rossberg do you know good papers or theses on this topic?
@gasche says that type-directed or shape-directed unboxing, supported by code specialization is a fairly sophisticated technique for compilers to implement, while, in contrast, having an efficient way to "box" immediate scalars without going through the heap is easy for backends to allow, and fairly easy to compile into for types that fit the smaller width. I agree with your assessment, that OCaml, in particular, chose to go with not many optimizations by choosing a memory model that brings 80% of the performance to be expected to the table. The implicit assumption that the memory works this way, that you can put an unboxed integer into a memory cell in a heap block is being relied on in large parts of the compiler (in particular those parts where enough optimization has been done in order to make them a worthwhile starting point for compiling to WASM).
@aardappel says There seems to be a lot of default fear of (5), that it going to result in unreasonable code bloat. Looking into MLton, I believe now that code size with monomorphization is likely not the major problem, as the bloat is also offset by the ability to do dead-code analysis in a whole-program compiler.
@gasche says Many language runtimes use representations that are either an immediate word or a pointer, with tagging to distinguish the two. Indeed, we have the same representation in OCaml. type t = X | Y of int is a type that has two variants, X and Y. The variant X, which has no additional data attached, is represented as an integer, while Y is represented by a heap block with a field (tag) that says it's Y and another field that holds the attached integer value. This representation is introduced so early in the compiler that we do not have the information to undo it at a later in the compilation pipeline where we could branch off to WASM with reasonable implementation cost. I could call this either "technical debt" or a natural consequence of the choice of memory model permeating the whole compiler. There are people from the OCaml compiler team who believe that it should be possible to rewrite the compiler to introduce a different memory representation for variant types, but it will require a lot of effort.
@rossberg says that WASM should support a mapping for fundamental compilation techniques of polymorphic languages. As someone who wants to bring OCaml to WASM, I'm obviously very much in favor of that. In contrast, @aardappel sees WASM GC MVP as a target aimed at languages that can compile via static code specialization. Which, by looking at all the parts of the spec apart from i31ref seems mostly true.
Concerning the topic of separate compilation of modules: So far, we are not aware of any problems with separate compilation of modules for OCaml, in the presence of i31ref. Obviously, in a MVP compiler backend, to a MVP language running on MVP engines, only modules compiled with the same compiler version will be compatible. That's perfectly fine. No complaints whatsoever here. :smile: Users will understand.
To get back to the topic of alternatives to i31ref: When I think of an alternative to i31ref, I think of something that enables by some means a memory model that allows us to, with reasonable efficiency,
(1) store heap pointers and integers in the same array and read them back out
(2) in a way that we can, at runtime, check whether a value is an integer or a heap pointer
so that we can write a compiler backend for the OCaml compiler that compiles to the WASM GC MVP with the resources we have at hand.
Traditional hardware architectures have a memory model that lets us fulfill these requirements by pointer tagging integers at the cost of losing one bit on the integer representation - but there may be other reasonable alternatives that I don't see right now, that work for us, and that other languages could make better use of than i31ref. Possibly the alternative can live outside of WASM GC MVP, but I am not sure about that at all. If there was some external tool with a memory model that fulfills (1) and (2) and that compiles to WASM GC MVP, with a long-term expectation of reasonable performance, we could work with that. I don't know how that could work, though.
I believe that these two requirements are all that we need in order to compile all the crazy polymorphism in OCaml. I think @rossberg is right that, OCaml, in the long term, can work with 30-bit integers or 31-bit integers, or even, crazy as it may seem, 29-bit integers.
Requirement (2) can possibly be lifted after a major refactor of the OCaml compiler (which is not 100% guaranteed to succeed, and that we don't have resources for right now, but at least there are people optimistic that it could work).
Having both requirements (1) and (2), it looks to me like I need to double-box integers in a MVP without i31ref. I'll elaborate on that later.
I'm very interested in other languages who would use i31ref to compile polymorphism (no matter whether the language is dynamically or statically typed). Even more so, I am interested in the perspective of languages who need to compile extensive polymorphism but do not see i31ref as useful, needing something else. Please comment and describe your constraints for the memory model.
However, I expect that as soon as i31ref makes it to the implementation stage, people will start asking for i63ref.
Thanks for the great post, @sabine!
I think @rossberg is right that, OCaml, in the long term, can work with 30-bit integers or 31-bit integers, or even, crazy as it may seem, 29-bit integers.
I was just about to ask you this question, so thanks for beating me to it! This is extremely useful, as it's the kind of perspective we can only get from language implementers.
Can I ask you for some more of your perspective along this line? Something I've been wondering is that, if we do change the number of bits for guaranteed-unboxed integers, what's a non-arbitrary number to change it too? The number that comes to mind is 24 bits because that's 3 bytes (so just enough for RGB and a little more than enough for some Unicode encodings). So I'm interested to hear specifically your thoughts on whether that would work for OCaml? That is, do you know of any OCaml applications or implementation strategies that would suggest a useful lower-bound on the number of bits needed for unboxed integers?
JS implementations faced a similar dilemma with the need to optimize simple integers. The dominant technique there seems to be nan-boxing; which in the language of i31ref would be equivalent to i52ref. TLDR: the discussion becomes moot when we migrate (which we no doubt will) to 64bit wasm.
@rossberg
You don't know at its definition site which polymorphic function ends up being used that way, and with what degree of polymorphism. The composition can happen somewhere else entirely, e.g. in a different module.
It is my assumption that if you compile down to a single .wasm, there comes a moment where with static specialization you can know the vast majority of these cases. If your default way of working involves completely independent compilation to separate .wasm files, then yes, you more often cannot, but then again you're kind of bringing this upon yourself.
You can't deref an anyref. You first need to cast it to something concrete. That necessarily involves a check, with or without i31ref.
I was thinking of engine code, like the GC traversing objects. And checks can be more or less expensive.
@jakobkummerow
That is indeed the one place where adding i31ref has a non-zero cost: when ref.test or ref.cast or br_on_cast operate on a value that's statically typed as anyref or eqref, then before loading and comparing that value's type descriptor, they first have to check the pointer's tag bit. Given that the type descriptor check (which needs to happen with or without i31ref's existence) involves a load from memory, and the tag bit check doesn't, I'm confident that the additional overhead is negligible.
That bit check may involve a missed branch, but yes, for code that doesn't use i31ref (which is what I'm concerned about) you'd never get a branch miss. Conversely, one could regard loading the type field to have no cost if the subsequent code is going to touch all the following fields anyway.
I'll take your word for it for now that cost will be negligible, though I hope at some point we'll have some numbers, particularly a real world GC benchmark (not containing any i31ref values) with bit checks turned on and off. I'm a big fan of the "you pay for what you use" principle, in Wasm, and all of computing.
@fgmccabe
TLDR: the discussion becomes moot when we migrate (which we no doubt will) to 64bit wasm.
It's not just about how many bits we can fit in a pointer, it is also about whether we want to force the need to check these bits upon languages that don't need it (or languages that may need it, for code where it can be statically known that it's not needed).
64-bit wasm is mostly about 64-bit operands to linear memory load/stores, and thus larger linear memories, not much else. The size of an anyref would still be entirely an implementation choice, and in theory, an implementation could be using 32-bits to represent anyref even while linear memory is running in 64-bit addressing mode.
Conversely, does the GC proposal put some expected upper bound on the amount of GC objects that can be addressed? If this can be >2GB, then likely it must use 64-bit pointers internally, and we might as well use i63ref ? ;) That would not be great for any remaining 32-bit CPU architectures we want to run on, but it is certainly is strange (and wasteful?) that in practice, almost all i31refs will sit inside a 64-bit pointer.
Actually, even an i63ref on a 32-bit architecture is not that expensive, since if the bit says its a pointer, it can simply truncate without a further check. It doesn't even need to load the 2nd half from memory. So maybe we should go for i63ref, if we believe 32-bit platforms are becoming less relevant for Wasm as time goes on.
@aardappel
Yes, but the types at the edges are currently very basic types that may well not make it possible to express the full set of type feature of a language like ML, meaning separate compilation would only be "safe" if compiled from one version of the program, essentially not making it separate compilation anymore.
One could make the same argument about separate compilation when linking is done at the assembly level, or LLVM level, or JVM bytecode level, yet those are things that implementations do in practice, because whole-program compilation is extremely inconvenient in various ways. When you generate code for a library into a module, you don't know what client modules you will be linked against, so it is hard to tell what specialized instances you need to generate. It is possible of course to design whole-program three-shakers to reduce code size by propagating whole-programmation about usage, and a common practice in some communities, but that does not remove the importance in practice of also having an open-world compilation model.
@RossTate
From what I understand, [GHC does pointer tagging] because they found that algebraic operations were so common that it was more useful to embed the algebraic-constructor tag in the low bits of the pointer.
Haskell is in an exceptional situation due to the pervasiveness of laziness: many values are thunks, and GHC's evaluation model performs an indirect jump on inspection, even to find out that a value is already evaluated. The benefit of pointer tagging is not to avoid a dereference (which you need to do in most cases to get the constructor arguments anyway), but to avoid this indirect jump. The way they avoid the cost of excessive boxing is through very aggressive inlining and optimizations in general; again the "sufficiently smart compiler" approach, which I don't think should be held as a goal for language implementors.
Note that there is a lot more Haskell code written in the strict subset these days, so it is not completely clear to me that this pointer-tagging choice is still the right design. At the time it was introduced, pointer tagging provided a 5% performance improvement over a strategy without pointer tagging. At the risk of going completely into guesswork: this suggests that it is reasonable to consider implementing a Haskell runtime without pointer tagging, and get realistic performance. (You may even see interesting gains by using tagged integers, which may offset the absence of tagging.) On the other hand, when implementing a strict language, pointer tagging does not help, and not having tagged immediate values makes you fall off a performance cliff.
So the experience of Haskell developers is that either custom pointer tagging should be preferred over i31ref for better performance
I am not aware of a performance comparison between pointer tagging and immediate-integers tagging for Haskell programs.
or i30ref should be preferred over i31ref for better GC strategies.
Chicken Scheme uses more than one bit of tag for value shapes, but only immediate integers use the least bit set to 1. In this model you can have i31ref, and still there are extra tag bits available in the 0 case. (Not sure whether this would work for the Haskell runtimes you are mentioning.)
I believe that in practice, if you used much less than 31 bits for immediate values, people would not use this for an integer type (24 is too small for your main type of machine-length integers), only for other immediate values. This might be a workable compromise (in particular, maybe people want 60-plus integers nowadays anyway), but it is not completely clear what the benefits are.
@sabine
I know too little about all the features of OCaml at this point (I have been working with the compiler only since September part time and I still have a lot to learn.) to say whether full-program monomorphization is possible or not - I do not have a proof for either of these hypotheses.
Ahead-of-time full-program monomorphization is known to be impossible for languages that support polymorphic recursion, including OCaml and Haskell (but not SML, see MLton), Agda, Idris, etc. If you have a JIT you can monomorphize "on the fly", this is what the dotnet runtime does. This point was made in this earlier comment of @rossberg: https://github.com/WebAssembly/gc/issues/53#issuecomment-545889014
@RossTate here is another argument that you may find interesting, in terms of finding a "principled" argument for choosing one size or the other.
We are talking about splits in the data space of values that can fit one word. You may want to have tag bits/patterns for either pointers (GHC pointer tagging, or other tricks used by other implementations) or scalars (non-pointers). For example, the Chez Scheme runtime has a bitpattern for fixed-sized numbers but also for booleans, pairs (the "cons tag" mentioned in the discussion), symbols, vectors, etc. The GC only needs to know the fine-grained structure of the pointers, so from the GC perspective we may need many categories of pointers, but only one category of values (which language implementations can the subdivide with more tagging). Given that both sides may need tag space depending on the language, it makes sense to divide the space evenly between pointers and non-pointers: this is exactly the int31ref design.
@RossTate
if we do change the number of bits for guaranteed-unboxed integers, what's a non-arbitrary number to change it too?
I don't think that a non-arbitrary number of bits per se exists. In an ideal world, we would have garbage collection in hardware, and we wouldn't need to chop off bits from a hardware word in order to implement efficient garbage collection in software. However, a specification for garbage collection in hardware mostly faces the same challenges as the WASM GC spec: there being a lot of different ideas of how efficient GC should work like, with nearly every language bringing both some acquired tastes and some fundamental invariants to the table.
I agree with @gasche that, if the number of bits gets too low, we will not use this to implement integer types: the better trade-off in that situation is to implement the integer types via boxed values instead of these unboxed integers, work on optimizing that representation, and enjoying native 32-bit arithmetic.
Note, though, that the implementation of integer types is only one of the situations where guaranteed-unboxed integers are used in the data representation of OCaml. My impression is that the OCaml compiler can make good use of guaranteed-unboxed integers with less bits in most, if not all of these other situations. (I need to check back with people tomorrow to make a list of all the situations where boxing the unboxed is considered to be particularly bad and confirm if this assessment is correct.)
@gasche
Ahead-of-time full-program monomorphization is known to be impossible for languages that support polymorphic recursion, including OCaml and Haskell (but not SML, see MLton), Agda, Idris, etc. If you have a JIT you can monomorphize "on the fly", this is what the dotnet runtime does. This point was made in this earlier comment
Are there any resources that show how a simple JIT compiler that can monomorphize code that contains polymorphic recursion looks like? I do believe Andreas Rossberg when he says it's not possible to do ahead-of-time full-program monomorphization because of polymorphic recursion. I would like to understand the argument in detail, why this is the case, where exactly things cannot work when trying to monomorphize ahead of time.
Here would be an artificial example:
(* a fairly inefficient way to compute 2^n,
by creating a full tree of depth n and counting its leaves *)
let pow2 n =
let rec loop : 'a . int -> 'a -> ('a -> int) -> int =
fun n v count ->
if n = 0 then count v
else
(* call ourselves on values of type ('a * 'a) *)
loop (n - 1) (v, v) (fun (v1, v2) -> count v1 + count v2)
in
loop n () (fun _ -> 1)
let () =
(* The type of [v] in the last call to [loop] in the code below
is determined dynamically by the integer read at runtime.
For example if we read 3, the last iteration takes a value of type
(((unit * unit) * (unit * unit)) * ((unit * unit) * (unit * unit)))
and returns 8. *)
print_int (pow2 (read_int ()))