Generating the AST
This is a fun one.
Even with #[derive(FromPest)], the initial creation of the AST structures is fairly rote.
It would be really cool if we could take a .pest file and create a working (though not necessarily ideal) module of AST structures.
Basic shapes
- rule = { a ~ b ~ c } becomes
#[derive(Debug, FromPest)]
#[pest_ast(rule(Rule::rule))]
pub struct rule {
    pub a: a,
    pub b: b,
    pub c: c,
}
- rule = { a | b | c } becomes
#[derive(Debug, FromPest)]
#[pest_ast(rule(Rule::rule))]
pub enum rule {
    a(a),
    b(b),
    c(c),
}
- a* becomes Vec<a>
- a+ becomes Vec<a>
- a? becomes Option<a>
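To make the mapping above concrete, here is a minimal sketch of what the core of such a generator could look like. The `Expr` type and the `field_type`, `base_name`, and `generate` functions are hypothetical names invented for this illustration; they are not part of pest, pest-ast, or from-pest, and a real tool would start from pest's own parsed grammar. Nested groups such as `a ~ (b | c)` are deliberately left unsupported.

```rust
/// A tiny, hypothetical model of the right-hand side of a pest rule.
/// (Not pest's own grammar representation; just enough for the mapping.)
#[derive(Debug, Clone)]
pub enum Expr {
    Ident(String),     // reference to another rule, e.g. `a`
    Seq(Vec<Expr>),    // a ~ b ~ c
    Choice(Vec<Expr>), // a | b | c
    Star(Box<Expr>),   // a*
    Plus(Box<Expr>),   // a+
    Opt(Box<Expr>),    // a?
}

/// Rust type for an expression in field position.
/// Returns None for nested sequences and choices, which would need
/// generated helper types that this sketch does not attempt.
pub fn field_type(expr: &Expr) -> Option<String> {
    match expr {
        Expr::Ident(name) => Some(name.clone()),
        Expr::Star(inner) | Expr::Plus(inner) => Some(format!("Vec<{}>", field_type(inner)?)),
        Expr::Opt(inner) => Some(format!("Option<{}>", field_type(inner)?)),
        Expr::Seq(_) | Expr::Choice(_) => None,
    }
}

/// Field or variant name: the name of the innermost referenced rule.
/// Note: `a ~ a` would produce duplicate field names; a real tool must handle that.
fn base_name(expr: &Expr) -> Option<&str> {
    match expr {
        Expr::Ident(name) => Some(name.as_str()),
        Expr::Star(inner) | Expr::Plus(inner) | Expr::Opt(inner) => base_name(inner),
        Expr::Seq(_) | Expr::Choice(_) => None,
    }
}

/// Emit the "obvious" struct or enum for one rule as Rust source text.
pub fn generate(rule: &str, expr: &Expr) -> Option<String> {
    let mut out = String::from("#[derive(Debug, FromPest)]\n");
    out.push_str(&format!("#[pest_ast(rule(Rule::{}))]\n", rule));
    match expr {
        Expr::Choice(alts) => {
            out.push_str(&format!("pub enum {} {{\n", rule));
            for alt in alts {
                out.push_str(&format!("    {}({}),\n", base_name(alt)?, field_type(alt)?));
            }
        }
        Expr::Seq(items) => {
            out.push_str(&format!("pub struct {} {{\n", rule));
            for item in items {
                out.push_str(&format!("    pub {}: {},\n", base_name(item)?, field_type(item)?));
            }
        }
        single => {
            // A rule that is just `a`, `a*`, `a+` or `a?` becomes a one-field struct.
            out.push_str(&format!("pub struct {} {{\n", rule));
            out.push_str(&format!("    pub {}: {},\n", base_name(single)?, field_type(single)?));
        }
    }
    out.push_str("}\n");
    Some(out)
}
```

Emitting plain source text fits the idea that the output is committed and then maintained by hand: the user can rename fields and restructure types afterwards without the generator being involved again.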
Please, ping me on Gitter or on Discord if you're interested in attacking this. It'll be fun, but somewhat involved.
I think it's best to structure this in a way that it's expected that the user take the output and commit it to their repository to be maintained by hand in the future. I don't think the goal should ever be to make a good, semantic AST, but rather to make the "obvious" translation and provide a quick starting point for projects that have designed their grammar already.
What are your thoughts on translating rule? to an Option and rule* to a Vec?
This should already work manually, and is definitely what I'd consider the "canonical" translation that such a tool should generate. They're powered solely by provided implementations in from-pest:
I have more questions :)
- What is the most basic structure that a rule could desugar to? My assumption is a wrapper around a single char, like this:
one_char = { ANY }
becomes
struct one_char {
content: char,
}
Am I right about this?
- Based on the previous point, a rule
number = { ASCII_DIGIT* }
would turn into
struct number {
digit: Vec<char>,
}
I probably intended number to be parsed as an actual number, say, a u16. Would I first parse it into the AST and then convert it to the type I actually want later on?
Generally, I think that a sensible default for atoms would be to store a pest::Span, or potentially a custom type implementing From<pest::Span> (which could be an owning version instead of a borrowing one). This is one of the questions I'm not sure how to answer currently.
The problem with
// number = @{ ASCII_DIGIT* }
struct number(u16);
is that the rule matches numbers far too large for a u16, so converting means either panicking or manually plumbing a FatalError through. Alternatively, you could use an arbitrary-precision integer type (a BigInteger).
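A small sketch of the failure modes, using only the standard library. The `NumberError` type and the `number_from_matched_text` function are hypothetical names for this illustration and do not come from pest or from-pest. Besides overflow, note that ASCII_DIGIT* also matches the empty string, which cannot be converted to a number either.

```rust
use std::num::IntErrorKind;

/// Hypothetical error type for converting matched text into a u16.
#[derive(Debug, PartialEq)]
pub enum NumberError {
    Empty,    // `ASCII_DIGIT*` matched zero digits
    TooLarge, // the digits do not fit into a u16
    Other,    // anything else the standard parser rejects
}

/// Convert the text matched by `number` into a u16 without panicking.
pub fn number_from_matched_text(text: &str) -> Result<u16, NumberError> {
    text.parse::<u16>().map_err(|e| match e.kind() {
        IntErrorKind::Empty => NumberError::Empty,
        IntErrorKind::PosOverflow => NumberError::TooLarge,
        _ => NumberError::Other,
    })
}
```

Returning a Result keeps the decision with the caller: it can be surfaced as a parse error, or the node can keep the raw text and defer conversion to a later pass.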
So, this is a lot less "this is the obvious way to handle it" than the other proposals for the generated AST, but a potential sketch:
- All AST nodes take a lifetime <'pest>.
- All AST nodes have a member #[pest_ast(outer)] span: pest::Span<'pest>
- Other members are added based on presence of non-builtin productions in the rule definition.
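As a rough illustration of the sketch in the list above, here is a std-only version of a span-carrying atom together with an owning alternative built through From. The `Span` type below is a simplified stand-in written for this example, not pest::Span itself, and `Number` and `OwnedNumber` are hypothetical names.

```rust
/// Simplified stand-in for pest::Span: the borrowed input plus byte offsets.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Span<'pest> {
    input: &'pest str,
    start: usize,
    end: usize,
}

impl<'pest> Span<'pest> {
    /// Returns None if the range is out of bounds or splits a UTF-8 character.
    pub fn new(input: &'pest str, start: usize, end: usize) -> Option<Self> {
        input.get(start..end).map(|_| Span { input, start, end })
    }

    /// The text this span covers.
    pub fn as_str(&self) -> &'pest str {
        &self.input[self.start..self.end]
    }
}

/// Borrowing atom: stores only where in the input it matched.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Number<'pest> {
    pub span: Span<'pest>,
}

/// Owning atom: copies the matched text, so it carries no lifetime.
#[derive(Debug, Clone, PartialEq)]
pub struct OwnedNumber {
    pub text: String,
}

impl<'pest> From<Span<'pest>> for OwnedNumber {
    fn from(span: Span<'pest>) -> Self {
        OwnedNumber { text: span.as_str().to_owned() }
    }
}
```

The borrowing form is cheap and keeps position information; the owning form trades an allocation per atom for freedom from the <'pest> lifetime.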
Here's a potential by-hand translation of a small number of rules:
Grammar:
a = { "a" }
b = { "b" }
c = { "c" }
number = @{ ASCII_DIGIT* }
any = { ANY }
seq = { a ~ b ~ c }
choice = { a | b | c }
compound_seq = { a ~ (b | c) }
compound_choice = { (a ~ b) | (b ~ c) }
assign = { (a|b|c) ~ "=" ~ number }
assigns = { (assign ~ ",")* ~ assign ~ ","? }
AST: (tuple structs entirely to sidestep the issue of generating member names) (and abusing fake nested struct syntax)
struct a<'pest>(
#[pest_ast(outer)] Span<'pest>,
);
struct b<'pest>(
#[pest_ast(outer)] Span<'pest>,
);
struct c<'pest>(
#[pest_ast(outer)] Span<'pest>,
);
struct number<'pest>(
#[pest_ast(outer)] Span<'pest>,
);
struct any<'pest>(
#[pest_ast(outer)] Span<'pest>,
);
struct seq<'pest>(
#[pest_ast(outer)] Span<'pest>,
a<'pest>,
b<'pest>,
c<'pest>,
);
enum choice<'pest>{
struct _1(a<'pest>),
struct _2(b<'pest>),
struct _3(c<'pest>),
}
struct compound_seq<'pest>(
#[pest_ast(outer)] Span<'pest>,
a<'pest>,
enum _2 {
struct _1(b<'pest>),
struct _2(c<'pest>),
},
);
enum compound_choice<'pest>{
struct _1(
#[pest_ast(outer)] Span<'pest>,
a<'pest>,
b<'pest>,
),
struct _2(
#[pest_ast(outer)] Span<'pest>,
b<'pest>,
c<'pest>,
),
}
struct assign<'pest>(
#[pest_ast(outer)] Span<'pest>,
enum _1 {
struct _1(a<'pest>),
struct _2(b<'pest>),
struct _3(c<'pest>),
},
number<'pest>,
);
struct assigns<'pest>(
#[pest_ast(outer)] Span<'pest>,
Vec<struct _1(assign<'pest>)>,
assign<'pest>,
);
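Since the nested struct and enum syntax above is not valid Rust, a generator would have to hoist each anonymous group into a named top-level type. Below is one possible shape for compound_seq, written so that it compiles without pest. The `Span` newtype is a stand-in for pest::Span, the CamelCase names (`CompoundSeq`, `CompoundSeq2`, `V1`, `V2`) are made up for this example, the #[pest_ast(...)] attributes are omitted, and `build` is a hand-written placeholder for whatever conversion code a derive would provide.

```rust
/// Stand-in for pest::Span so the sketch compiles without pest.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Span<'pest>(pub &'pest str);

#[derive(Debug, PartialEq)]
pub struct A<'pest>(pub Span<'pest>);
#[derive(Debug, PartialEq)]
pub struct B<'pest>(pub Span<'pest>);
#[derive(Debug, PartialEq)]
pub struct C<'pest>(pub Span<'pest>);

/// The anonymous group `(b | c)` from `compound_seq = { a ~ (b | c) }`,
/// hoisted into its own enum and named after its position in the rule.
#[derive(Debug, PartialEq)]
pub enum CompoundSeq2<'pest> {
    V1(B<'pest>),
    V2(C<'pest>),
}

/// compound_seq = { a ~ (b | c) }
#[derive(Debug, PartialEq)]
pub struct CompoundSeq<'pest>(
    pub Span<'pest>, // outer span of the whole rule
    pub A<'pest>,
    pub CompoundSeq2<'pest>,
);

/// Hand-written placeholder for derived conversion code: accepts "ab" or "ac".
pub fn build(input: &str) -> Option<CompoundSeq<'_>> {
    let first = input.get(0..1)?;
    let second = input.get(1..2)?;
    if first != "a" || input.len() != 2 {
        return None;
    }
    let group = match second {
        "b" => CompoundSeq2::V1(B(Span(second))),
        "c" => CompoundSeq2::V2(C(Span(second))),
        _ => return None,
    };
    Some(CompoundSeq(Span(input), A(Span(first)), group))
}
```

Positional names like `CompoundSeq2` are ugly but deterministic, which matters less if the generated module is meant to be renamed by hand after the first commit.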
I think having the AST take a lifetime at all is a flawed approach, since it adds a lot of complication in handling it. The first version of pest was designed with an owned Rc<*some input*> shared between Spans and Pairs, and it was easier to work with. That said, I don't really have a good alternative either. One could have two separate APIs for owned and borrowed inputs, but this would put a pretty big burden on the maintainers of the project.
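For comparison, here is a minimal sketch of the shared-ownership idea, where each span holds a reference-counted handle to the input instead of borrowing it. `OwnedSpan`, `Number`, and `number_at` are hypothetical names for this illustration; this is not the API of any pest version.

```rust
use std::ops::Range;
use std::rc::Rc;

/// A span that shares ownership of the input, so it needs no lifetime parameter.
#[derive(Debug, Clone)]
pub struct OwnedSpan {
    input: Rc<str>,
    range: Range<usize>,
}

impl OwnedSpan {
    /// Returns None if the range is out of bounds or splits a UTF-8 character.
    pub fn new(input: Rc<str>, range: Range<usize>) -> Option<Self> {
        input.get(range.clone())?;
        Some(OwnedSpan { input, range })
    }

    /// The text this span covers.
    pub fn as_str(&self) -> &str {
        &self.input[self.range.clone()]
    }
}

/// An AST node without a lifetime: it can outlive the original input handle.
#[derive(Debug, Clone)]
pub struct Number {
    pub span: OwnedSpan,
}

/// Build a node over `range`, bumping the reference count of the input.
pub fn number_at(input: &Rc<str>, range: Range<usize>) -> Option<Number> {
    OwnedSpan::new(Rc::clone(input), range).map(|span| Number { span })
}
```

The cost is a reference-count increment per node and the loss of Send and Sync that comes with Rc, in exchange for AST types that can be stored and returned without lifetime annotations.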
Hey folks, I started https://github.com/killercup/pest-ast-generator for fun a few days ago and just now saw this issue. The approach I took is very simple -- the goal was to get rid of a bunch of structs I needed to write by hand, not to support every edge case.