Morphology support
Right now terminal tokens have to be separate words. Treebender should be able to support morphological rules:
V[ stem: t ] -> walk
V[ stem: t ] -> talk
// stem: f to block walkedededededededed...
V[ tense: past, stem: f ] -> V[ stem: t ] ++ ed // syntax TBD
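To make the role of the stem flag concrete, here is a minimal Python sketch (not Treebender code; `Entry`, `past_rule`, and `close_lexicon` are hypothetical names invented for illustration). It assumes the `++ ed` rule is applied repeatedly until no new entries appear, and shows that marking the output `stem: f` is what makes that process terminate.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Entry:
    form: str                    # surface string
    stem: bool                   # mirrors the `stem: t` / `stem: f` feature
    tense: Optional[str] = None  # mirrors the `tense` feature

def past_rule(entry):
    """Sketch of: V[ tense: past, stem: f ] -> V[ stem: t ] ++ ed"""
    if not entry.stem:           # only bare stems may take the suffix
        return None
    return Entry(entry.form + "ed", stem=False, tense="past")

def close_lexicon(stems):
    """Apply the rule until no new entries are produced."""
    entries = set(stems)
    frontier = list(stems)
    while frontier:
        derived = past_rule(frontier.pop())
        if derived is not None and derived not in entries:
            entries.add(derived)
            frontier.append(derived)
    return entries

lexicon = close_lexicon([Entry("walk", True), Entry("talk", True)])
forms = sorted(e.form for e in lexicon)
print(forms)
```

Without the `stem` check in `past_rule`, the loop would keep deriving "walkeded", "walkededed", and so on, and never reach a fixed point.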
Questions:
- What scope do we want here? Are we only supporting basic concatenative morphology (prefixes and suffixes), or will we also try to support allomorphy, sound changes / ablaut, Semitic roots, ...?
- It's tempting to focus on English, support only concatenative morphology, and let the user fall back to listing irregular forms explicitly, controlled by a flag:
V[ can_inflect: y ] -> walk
V[ can_inflect: n ] -> buy
V[ tense: past, can_inflect: n ] -> V[ can_inflect: y ] ++ ed
V[ tense: past, can_inflect: n ] -> bought
However, lots of common English words have spelling changes at the morpheme boundary, like bake ~ baked (not *bakeed). There's no real way to support that without a more sophisticated tool or tons of duplicate rules.
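The limitation is easy to reproduce. The following Python sketch (illustrative only; `STEMS`, `IRREGULAR_PAST`, and `past_forms` are hypothetical names, and "bake" is deliberately marked as inflectable) applies plain concatenation to flagged stems and looks up everything else in an explicit list:

```python
# Hypothetical lexicon: stem -> can_inflect flag.
STEMS = {"walk": True, "buy": False, "bake": True}

# Irregular past forms are listed explicitly, like the `-> bought` rule.
IRREGULAR_PAST = {"buy": "bought"}

def past_forms(stems, irregular):
    """Build past-tense forms by concatenation or explicit listing."""
    out = {}
    for stem, can_inflect in stems.items():
        if can_inflect:
            out[stem] = stem + "ed"        # V[ can_inflect: y ] ++ ed
        elif stem in irregular:
            out[stem] = irregular[stem]    # explicitly listed form
    return out

past = past_forms(STEMS, IRREGULAR_PAST)
print(past)
```

"walk" and "buy" come out right, but "bake" becomes "bakeed": pure concatenation has no place to express the e-deletion, so "bake" would have to be flagged `can_inflect: n` and given its own listed past form.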
Todo:
- Remind myself of how the LKB does this
One way to approach this would be to let grammar files define a token-splitting step that runs before parsing.
Something like:
$splitters = [
/(.+)ed/ => [\1, -ed]
/(.+)d/ => [\1, -ed] // for words like "baked"
/(.+)s/ => [\1, -s]
/(.+)es/ => [\1, -s]
]
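As a rough Python sketch of what applying such a table to a single word might look like (the grammar-file syntax above is still undecided, so `SPLITTERS` and `split_word` are hypothetical, and the assumption here is that each pattern must match the whole token):

```python
import re

# Each splitter: (pattern matched against the whole token, suffix token).
SPLITTERS = [
    (re.compile(r"(.+)ed"), "-ed"),
    (re.compile(r"(.+)d"), "-ed"),   # for words like "baked"
    (re.compile(r"(.+)s"), "-s"),
    (re.compile(r"(.+)es"), "-s"),
]

def split_word(word):
    """Return every candidate tokenization of one word."""
    options = [[word]]                     # the word left unsplit
    for pattern, suffix in SPLITTERS:
        m = pattern.fullmatch(word)        # assumption: whole-token match
        if m:
            options.append([m.group(1), suffix])
    return options

print(split_word("walked"))  # [['walked'], ['walk', '-ed'], ['walke', '-ed']]
print(split_word("dogs"))    # [['dogs'], ['dog', '-s']]
print(split_word("and"))     # [['and'], ['an', '-ed']]
```

The last call shows that `/(.+)d/` as written also fires on function words such as "and", which is worth keeping in mind when estimating how many candidates a sentence produces.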
Every splitter whose pattern matches a word would apply, along with an implicit "no expansion" splitter that leaves the word intact, so a sentence expands into the set of all possible morphological derivations (the intended one is marked with ==>):
"The dogs walked to the beach and baked"
"The dogs walk -ed to the beach and baked"
"The dogs walke -ed to the beach and baked"
"The dog -s walked to the beach and baked"
"The dog -s walk -ed to the beach and baked"
"The dog -s walke -ed to the beach and baked"
"The dogs walked to the beach and bak -ed"
"The dogs walk -ed to the beach and bak -ed"
"The dogs walke -ed to the beach and bak -ed"
"The dog -s walked to the beach and bak -ed"
"The dog -s walk -ed to the beach and bak -ed"
"The dog -s walke -ed to the beach and bak -ed"
"The dogs walked to the beach and bake -ed"
"The dogs walk -ed to the beach and bake -ed"
"The dogs walke -ed to the beach and bake -ed"
"The dog -s walked to the beach and bake -ed"
==> "The dog -s walk -ed to the beach and bake -ed"
"The dog -s walke -ed to the beach and bake -ed"
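Obviously this has the potential to blow up, since the number of derivations is the product of the number of options for each word. But we could also fail fast by discarding a split as soon as it generates a token that doesn't match anything in the grammar.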
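Here is a hedged Python sketch of both the blow-up and the fail-fast idea. None of this is existing Treebender behaviour: `RULES`, `options`, `derivations`, and the `KNOWN` token set are hypothetical, and `KNOWN` stands in for "tokens the grammar has a rule for". Patterns are assumed to match the whole token.

```python
import re
from itertools import product

RULES = [(r"(.+)ed", "-ed"), (r"(.+)d", "-ed"), (r"(.+)s", "-s"), (r"(.+)es", "-s")]

def options(word, known=None):
    """Candidate splits for one word, optionally pruned against `known`."""
    opts = [[word]]                               # the word left unsplit
    for pattern, suffix in RULES:
        m = re.fullmatch(pattern, word)
        if m:
            opts.append([m.group(1), suffix])
    if known is not None:
        # Fail fast: drop any split that contains an unknown token.
        opts = [o for o in opts if all(tok in known for tok in o)]
    return opts

def derivations(sentence, known=None):
    """Cartesian product of the per-word options, joined back into strings."""
    per_word = [options(w, known) for w in sentence.split()]
    return [" ".join(tok for opt in combo for tok in opt)
            for combo in product(*per_word)]

SENTENCE = "The dogs walked to the beach and baked"
# Stand-in for the tokens the grammar can actually consume (an assumption).
KNOWN = {"The", "the", "dog", "walk", "to", "beach", "and", "bake", "-ed", "-s"}

print(len(derivations(SENTENCE)))     # every combination, unpruned
print(derivations(SENTENCE, KNOWN))   # only splits made of known tokens
```

Unpruned, this yields 36 candidates rather than 18, because `/(.+)d/` also splits "and" into "an -ed". With pruning, each word keeps a single viable option and only the intended derivation survives. If any word ends up with no viable option, the product is empty and the sentence can be rejected before parsing starts.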