odyCy
Roadmap for cleaning Perseus
Perseus idiosyncrasies
Lemmas: Root and suffix are sometimes dash-separated. This happens only for VERBs. Learn the reason for this and remove the dashes (possibly also the suffixes, if that makes sense).
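A minimal sketch of the dash removal, assuming each token is available as a lemma plus its UPOS tag. The function name `clean_verb_lemma` is hypothetical, and whether plain dash removal is the right normalisation still depends on finding out why the dashes are there.

```python
def clean_verb_lemma(lemma: str, upos: str) -> str:
    """Remove internal dashes from a VERB lemma; leave everything else untouched."""
    # According to the note above, only VERB lemmas are affected.
    if upos != "VERB":
        return lemma
    # A lemma consisting only of dashes is probably punctuation, so keep it.
    if not lemma.strip("-"):
        return lemma
    # Assumption: joining root and suffix gives the intended lemma.
    return lemma.replace("-", "")
```

Dropping the suffix as well would need a different rule (for example keeping only the part before the first dash), which is deliberately not attempted here.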
Beta code encoding errors
Make sure that every character is a valid Ancient Greek letter.
For example, ἀλλ̓ should be ἀλλ’
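The check and the fix from the example can be sketched with the standard `unicodedata` module. This assumes the stray mark is U+0313 (COMBINING COMMA ABOVE) attached to a consonant, and that U+2019 is the desired elision mark. The names `fix_elision` and `suspicious_chars` are hypothetical, and the "valid character" test is a rough one based on the two Greek Unicode blocks.

```python
import unicodedata

COMBINING_PSILI = "\u0313"  # COMBINING COMMA ABOVE (smooth breathing)
APOSTROPHE = "\u2019"       # RIGHT SINGLE QUOTATION MARK, used for elision
GREEK_VOWELS = set("αεηιουω")


def fix_elision(text: str) -> str:
    """Turn a smooth breathing that sits on a consonant into an apostrophe."""
    out = []
    last_base = ""
    # Work on the decomposed form so every diacritic is a separate code point.
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) == 0:
            last_base = ch.lower()
            out.append(ch)
        elif (
            ch == COMBINING_PSILI
            and last_base.isalpha()
            and last_base not in GREEK_VOWELS
            and last_base != "ρ"  # rho can legitimately carry a breathing
        ):
            out.append(APOSTROPHE)
        else:
            out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))


def suspicious_chars(text: str) -> list[str]:
    """List characters outside the Greek blocks (after NFC composition)."""
    bad = []
    for ch in unicodedata.normalize("NFC", text):
        cp = ord(ch)
        in_greek = 0x0370 <= cp <= 0x03FF or 0x1F00 <= cp <= 0x1FFF
        if not (in_greek or ch == APOSTROPHE or ch.isspace()):
            bad.append(ch)
    return bad
```

Running `suspicious_chars` over the whole treebank first gives an inventory of problem characters, so each one can be handled deliberately instead of being silently dropped. Punctuation other than the apostrophe is also reported by this rough check and would need to be whitelisted.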
Compatibility with Proiel
XPOS: XPOS tags just contain the combination of UPOS and FEATS; no new information is introduced. Our model should not learn them or use them as labels.
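One way to keep XPOS out of training is to blank the column before conversion. The sketch below assumes the treebanks are stored as CoNLL-U, where XPOS is the fifth of ten tab-separated columns; `blank_xpos` is a hypothetical helper name.

```python
XPOS_COLUMN = 4  # CoNLL-U: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC


def blank_xpos(conllu_line: str) -> str:
    """Replace the XPOS field of a token line with '_'."""
    # Comment lines and blank sentence separators pass through unchanged.
    if not conllu_line.strip() or conllu_line.startswith("#"):
        return conllu_line
    newline = "\n" if conllu_line.endswith("\n") else ""
    fields = conllu_line.rstrip("\n").split("\t")
    # Anything that is not a well-formed 10-column token line is left alone.
    if len(fields) != 10:
        return conllu_line
    fields[XPOS_COLUMN] = "_"
    return "\t".join(fields) + newline
```

Applied line by line, this leaves UPOS and FEATS intact, so no information is lost if XPOS really is just their combination.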
Morphological features: Proiel has both more features and more possible values. How do we avoid learning Proiel-specific features when training on its data?
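One possible answer, sketched under the assumption that FEATS strings follow the CoNLL-U `Name=Value|Name=Value` convention: collect the feature inventory from Perseus and drop everything outside it from Proiel. Both function names are hypothetical, and multi-valued features such as `Case=Acc,Nom` are not handled.

```python
def collect_inventory(feats_column):
    """Build {feature name: set of values} from an iterable of FEATS strings."""
    inventory = {}
    for feats in feats_column:
        if feats == "_":
            continue
        for pair in feats.split("|"):
            name, _, value = pair.partition("=")
            inventory.setdefault(name, set()).add(value)
    return inventory


def filter_feats(feats, allowed):
    """Keep only the Name=Value pairs that appear in the allowed inventory."""
    if feats == "_":
        return feats
    kept = []
    for pair in feats.split("|"):
        name, _, value = pair.partition("=")
        if value in allowed.get(name, set()):
            kept.append(pair)
    # CoNLL-U uses '_' for an empty feature set.
    return "|".join(kept) if kept else "_"
```

Usage would be `allowed = collect_inventory(perseus_feats)` followed by `filter_feats(f, allowed)` for each Proiel token, where `perseus_feats` stands for whatever iterable of FEATS strings is read from the Perseus files.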
POS
Perseus doesn't have PROPN, while Proiel doesn't have PUNCT.
- Does training on Proiel make our senter worse?
- Should we convert PROPN to NOUN? Or are we learning PROPN well enough to keep it?
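To make the tag mismatch concrete and to try the PROPN-to-NOUN option, a small sketch follows. It assumes the UPOS tags of each treebank have already been read into plain lists; `tags_missing_from` and `merge_propn` are hypothetical names, and merging is only one of the two options raised above.

```python
from collections import Counter


def tags_missing_from(target_tags, reference_tags):
    """Return the UPOS tags that occur in reference_tags but never in target_tags."""
    return sorted(set(reference_tags) - set(target_tags))


def merge_propn(upos: str) -> str:
    """Collapse PROPN into NOUN; every other tag is returned unchanged."""
    return "NOUN" if upos == "PROPN" else upos


def upos_counts(tags):
    """Count how often each UPOS tag occurs."""
    return Counter(tags)
```

Comparing `upos_counts` before and after `merge_propn`, together with per-tag accuracy on a held-out set, is one way to judge whether PROPN is learned well enough to keep.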