semantic Generate datatypes of kind (* -> *) -> (* -> *) from Tree-Sitter syntax definitions

@dcreager suggested I add this issue, after a long in-person discussion about what work could be shared between Semantic and the Cubix framework ( http://cubix-framework.com ), as they use strikingly similar technology for different goals.

He told me about @aymannadeem's work using Template Haskell to generate datatypes of kind * from Tree-Sitter definitions. Given that, it should be a small amount of work to additionally generate definitions of kind (* -> *) -> (* -> *).

Why are these significant? Semantic has hitherto been based on datatypes of kind (* -> *). These represent unsorted signature fragments: datatypes that can be modularly combined into different languages and dialects, but where each node is essentially untyped / dynamically-typed. Lifting these to sort (* -> *) -> (* -> *) adds sort information to each node, making it possible to have datatypes-a-la-carte style definitions of programming languages with 0 information loss.

This is the approach of multi-sorted compositional data types ( https://www.researchgate.net/publication/228841104_Compositional_data_types ). The Cubix paper ( http://www.jameskoppel.com/files/papers/oopsla18main-p221-p.pdf ) describes a transformation from datatypes of kind * to ones of kind (* -> *) -> (* -> *), which is implemented in the comptrans library ( http://hackage.haskell.org/package/comptrans ).

I've argued in my paper that any frameworks which (a) enables building multi-language tools, and (b) preserves enough information to do transformation, will need to be written in a similar way.

Aug 02 '19 07:08 jkoppel

NB: Edited the above to quote kinds since asterisks are interpreted by Markdown.

Aug 02 '19 16:08 robrix

@jkoppel: Thank you for the insightful note!

We’re definitely interested in extending the TH-generated datatypes to support more use cases; in particular, we’re thinking about how to represent diffs over these types, and it’s occurred to us that a higher-order functor might be suitable.

That said, it sounds like you’re specifically interested in sharing syntax datatypes between languages, and that seems fundamentally at odds with deriving the datatypes representing a language from its grammar, given that grammars typically do not share much structure. And generating these datatypes from the grammar is important for our use cases, since it enables us to automate the mapping of tree-sitter’s untyped parse trees into strongly-typed Haskell ASTs, which is otherwise a tedious and error-prone process (which we’ve hitherto termed Assignment for no particularly good reason).

Could you elaborate on the use cases you have in mind? In the meantime I’ll give the Cubix paper a read (it’s been in my queue for a bit but I’ll bump it up 😊)

As an aside, this kind also hints at interoperability with bound & fused-effects, which is exciting!

Aug 02 '19 17:08 robrix

Ah, I see the confusion.

We're not interested in sharing syntax datatypes between languages, except to support dialects (e.g.: Hack vs. PHP, C vs. C++) and polyglot programming (e.g.: Jinja2, ERB, files containing PHP+HTML+JS+SQL). The big departure Cubix takes from older DLC approaches is to, instead of translating an entire syntax to generic components (which is very painful, as you've doubtless experienced; it was very easy for me to find problems with Assignment, several of which I shared with Josh), to instead use a mixture of generic and language-specific components. This is why having DLC-style datatype definitions for a language-specific fragment, in either single-sorted or multi-sorted form, is useful.

Aug 03 '19 00:08 jkoppel

Ahhhh thank you very much for clarifying! That makes a lot of sense.

Could you provide us with a couple of examples? e.g. right now if we generated datatypes like:

data Statement = IfStatement If | ForStatement For
data If = If { condition :: Expression, consequence :: [Statement], alternative :: [Statement] }

then higher-kinded datatypes would look more like:

data Statement f a = IfStatement (If f a) | ForStatement (For f a)
data If f a = If { condition :: …, consequence :: […], alternative :: […] }

but then what roles do f and a play? E.g. if this were syntax in bound, f would be the fixpoint & a the type of free variables, suggesting that we’d use f a in each of the … positions, but in so doing lose the sort information (expressions vs. statements), which makes me think that we’d actually be treating a as an index? But would that imply that we’d have to define If as a GADT? Examples would help us a lot 😊

Aug 03 '19 16:08 robrix

Here are the examples for Statement and If:

data StatementL
data Statement e l where
  IfStatement :: e IfL -> Statement e StatementL
  ForStatement :: e ForL -> Statement e StatementL

data IfL
data If e l where
   If :: e <sort of condition> -> e <sort of consequence> -> e <sort of alternative> -> If e IfL

Aug 03 '19 16:08 jkoppel