
Roadmap for cleaning Perseus

Open jankounchained opened this issue 3 years ago • 0 comments

Perseus idiosyncrasies

Lemmas: Root and suffix are sometimes dash-separated. This happens only for VERBs. Find out why, then remove the dashes (and possibly the suffixes too, if that makes sense).
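A minimal sketch of the dash removal, assuming lemmas and UPOS tags are available as plain strings. The function name `strip_lemma_dash` and the example lemma are hypothetical, and the sketch only joins the parts; whether the suffix should be dropped instead is still open.

```python
def strip_lemma_dash(lemma: str, upos: str) -> str:
    """Join dash-separated parts of a VERB lemma; leave everything else untouched."""
    if upos != "VERB" or "-" not in lemma:
        return lemma
    joined = lemma.replace("-", "")
    # Keep the original if the lemma consisted of dashes only.
    return joined or lemma


# Hypothetical lemma, used only to illustrate the shape of the input.
print(strip_lemma_dash("λυ-ω", "VERB"))
```

Restricting the change to VERB keeps the fix aligned with the observation that only verbs are affected, so a dash in any other lemma is left for manual inspection.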

Beta code encoding errors: Make sure that every character is a valid Ancient Greek letter. For example, ἀλλ̓ should be ἀλλ’.
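One way to detect and repair this kind of error, as a sketch. It assumes the stray mark is U+0313 (COMBINING COMMA ABOVE) attached to a word-final consonant and that the intended apostrophe is U+2019; both should be checked against the actual data. The names `fix_elision` and `suspect_chars` are hypothetical.

```python
import re
import unicodedata

CONSONANTS = "βγδζθκλμνξπρστφχψ"
# A combining psili after a consonant at the end of a word (assumed to mark elision).
STRAY_PSILI = re.compile(rf"(?<=[{CONSONANTS}])\u0313(?!\w)")


def fix_elision(token: str) -> str:
    """Replace a word-final combining psili on a consonant with U+2019."""
    decomposed = unicodedata.normalize("NFD", token)
    fixed = STRAY_PSILI.sub("\u2019", decomposed)
    return unicodedata.normalize("NFC", fixed)


def suspect_chars(text: str) -> list[str]:
    """List letters and combining marks whose Unicode name does not contain GREEK."""
    composed = unicodedata.normalize("NFC", text)
    return sorted(
        {
            ch
            for ch in composed
            if (ch.isalpha() or unicodedata.category(ch).startswith("M"))
            and "GREEK" not in unicodedata.name(ch, "")
        }
    )
```

Running `suspect_chars` over the whole corpus first gives an inventory of the unexpected characters, which helps decide whether further repair rules beyond the elision case are needed.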

Compatibility with Proiel

XPOS: XPOS tags just contain the combination of UPOS & FEATS; no new information is introduced. Our model should not learn them or use them as labels.
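If the training data is in CoNLL-U format, one option is to blank the XPOS column before conversion, so that no component can pick it up as a label. A sketch, assuming the standard ten tab-separated CoNLL-U columns (XPOS is the fifth); `blank_xpos` is a hypothetical helper name.

```python
def blank_xpos(conllu_line: str) -> str:
    """Replace the XPOS column of a CoNLL-U token line with '_'."""
    if not conllu_line.strip() or conllu_line.startswith("#"):
        return conllu_line  # blank line or sentence-level comment
    newline = "\n" if conllu_line.endswith("\n") else ""
    cols = conllu_line.rstrip("\n").split("\t")
    if len(cols) != 10:
        return conllu_line  # not a well-formed token line, leave as-is
    cols[4] = "_"  # columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    return "\t".join(cols) + newline
```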

Morphological features: Proiel has both more features and more possible values. How do we avoid learning Proiel-specific features when training on their data?
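One possible answer is to restrict the FEATS column to a shared inventory before training. The sketch below assumes CoNLL-U style FEATS strings (`Name=Value` pairs joined by `|`) and an `allowed` mapping from feature name to permitted values, which would have to be built from the Perseus data. The helper names and the example inventory are hypothetical, and multi-valued features such as `Case=Acc,Nom` are not handled.

```python
def parse_feats(feats: str) -> dict[str, str]:
    """Parse a CoNLL-U FEATS string into a dict; '_' means no features."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def restrict_feats(feats: str, allowed: dict[str, set[str]]) -> str:
    """Drop every feature or value that is not in the allowed inventory."""
    kept = {
        name: value
        for name, value in parse_feats(feats).items()
        if value in allowed.get(name, set())
    }
    # CoNLL-U expects features sorted alphabetically by name.
    pairs = sorted(kept.items(), key=lambda item: item[0].lower())
    return "|".join(f"{name}={value}" for name, value in pairs) or "_"


# Hypothetical inventory, for illustration only.
allowed = {"Case": {"Nom", "Gen"}, "Number": {"Sing", "Plur"}}
print(restrict_feats("Case=Nom|Number=Sing|Polarity=Neg", allowed))
```

Filtering at the data level keeps the label set identical across both treebanks, at the cost of discarding information that Proiel annotates and Perseus does not.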

POS: Perseus doesn't have PROPN, while Proiel doesn't have PUNCT.

  • Does training on Proiel make our senter worse?
  • Should we convert PROPN to NOUN? Or are we learning PROPN well enough to keep it?
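To help answer the PROPN question, the UPOS distribution of each treebank can be compared with and without the merge. A sketch, assuming CoNLL-U input held in a string; `upos_counts` and its `merge_propn` flag are hypothetical names.

```python
from collections import Counter


def upos_counts(conllu_text: str, merge_propn: bool = False) -> Counter:
    """Count UPOS tags in CoNLL-U text, optionally mapping PROPN to NOUN."""
    counts: Counter = Counter()
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 10:
            continue  # comment, blank, or malformed line
        if "-" in cols[0] or "." in cols[0]:
            continue  # multiword token range or empty node
        upos = cols[3]
        if merge_propn and upos == "PROPN":
            upos = "NOUN"
        counts[upos] += 1
    return counts
```

Comparing the counts for both treebanks shows how much PROPN data there is to learn from, which is one input to the decision; the per-tag accuracy of a trained model would be the other.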

jankounchained · Feb 07 '23 14:02