lean4 Module system enabling separate compilation

Currently import truly imports everything from a module, including private declarations and transitively imported declarations. Thus if we make any change at all that is reflected in the .olean output file, we have no choice but to recompile every downstream module.

This issue usually is solved by use of a "proper" module system that allows sufficient information hiding, e.g. by exporting only the signatures of declarations (and not even that for private declarations). In the simplest case, this can be done literally in a "signature" .olean file, in which case the build system can forego rebuilding downstream modules (that in fact import only the signature file) if the file is unchanged. However, note that this does not a priori prevent rebuilds on other kinds of changes such as adding a new (public) declaration, which would be much harder to prevent.

A module system with proper information hiding would not only avoid rebuilds, but also follow general good software practices and enable further ones such as semantic versioning. However, some aspects of Lean (and similar ITPs) complicate the design of such a system:

Definitional unfolding is inherently anti-modular. If we want to prove something about a definition from another module, that definition must either be marked to export its body (which abbrev, [reducible], and similar should automatically do), or we have to import the full .olean file after all. Any library aiming to make the most out of the module system should strive to prove all relevant properties about a definition inside the definition's module so that downstream modules can rely on these theorems instead of having to unfold the definition. However, if a module is split into multiple parts merely for code organization, then importing the full .olean files within that group should be fine, retaining access to the definition bodies within the group and separate compilation outside of it. If a definition is sufficiently simple/stable that changes to it are not expected, it can still be fully exported using some marker as mentioned above, of course.
Meta-programs are even worse: if we want to run an imported function at compile time, we necessarily depend on its body and transitively on the bodies of all referenced functions. If we want separate compilation in the presence of metaprograms, I don't see any way but to put them into separate (output) files. Manually doing so is not realistic since every single notation, which we definitely want to use interspersed with non-meta code, is a metaprogram. Thus it seems we need a staging approach where metaprograms are marked as such (Racket would call this the "phase" of the declaration) and exported into different output files from regular declarations. Declarations can then be imported from specific phases into specific phases (so that metaprograms can make use of regular functions) and the build system creates file dependencies accordingly, though I'm not sure we'd need Racket's full, very general Int-indexed phase system for our use cases.
Further (meta)data has to be exported for compilation. In the case of [inline] and [specialize], there is no way around exporting the definition bodies (or rather their IR) if we want to these attributes to work cross module. Then there are some cases where the body affects the ABI, e.g. in type synonyms such as def MyUnboxedUInt := UInt16. Finally, some cross-module information we are currently using such as lifted closed terms should probably be limited/disabled, at least by default, so it does not defeat separate compilation.
Finally, some parts of the existing implementation would have to be refined to make the most out of separate compilation. For example, adding a new private declaration could still trigger downstream rebuilds if any (exported) autogenerated names (such as from macros) change. @leodemoura and I have already discussed basing macro scopes not on a module-local counter but a declaration-local one based on a (very short) hash of the declaration name (note that they are already based on the module name to make them globally unique). In the unlikely case of a (module-local) conflict between these hashes, we would increment the second one, which could in theory again lead to more rebuilds than absolutely necessary, but should hardly matter in practice.

Apr 22 '21 12:04 Kha

It may be worth looking at Musa Al-hassy's "Do-it-Yourself Module Systems" (PhD thesis and defense slides). It targets Agda, and so much of it is likely applicable to Lean as well, while providing a powerful model of how the functionality of module systems can be achieved without introducing a second "module language" with all of its downsides - while also providing powerful tools for composition and reinterpretation of interfaces.

Aug 23 '21 23:08 eternaleye

@eternaleye This issue is about second-class, unparameterized modules in the sense of compilation units, so I'm not sure that work is applicable here. We certainly don't want to represent module signatures (shallowly) as Lean types.

Aug 30 '21 14:08 Kha

Would this be a relevant pointer to Racket's phase separation?

Feb 11 '22 10:02 ydewit

@ydewit That, and https://dl.acm.org/doi/abs/10.1145/583852.581486

Feb 11 '22 12:02 Kha

~~What is the meaning of a "signature" (.olean) described above?~~

I found the answer to my own question so removing to reduce the noise on the issue.

Jun 28 '22 20:06 ydewit