Improve lemmatization
Currently, the model uses a lookup-based lemmatizer built from the training set. This could be improved by adapting the lemmy package to spaCy v3.
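For reference, a minimal sketch of lemmy's standalone API (a spaCy v3 adaptation would wrap this in a pipeline component); the example word is illustrative:

```python
# Requires: pip install lemmy
import lemmy

# Load the pretrained Danish lemmatizer bundled with the package.
lemmatizer = lemmy.load("da")

# lemmatize(pos, form) returns a list of candidate lemmas.
print(lemmatizer.lemmatize("NOUN", "akvariernes"))  # ['akvarium']
```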
Another potential solution is the LSTM-based lemmatizer from Stanza, which should be accessible via the spaCy integration (spacy-stanza). However, it might not perform as well out of distribution.
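A minimal sketch of what that would look like, assuming the spacy-stanza package and the Danish Stanza model are installed:

```python
# Requires: pip install spacy-stanza
import stanza
import spacy_stanza

stanza.download("da")  # one-time download of the Danish Stanza model
nlp = spacy_stanza.load_pipeline("da")  # wraps Stanza in a spaCy pipeline

doc = nlp("Hunden løb gennem skoven.")
for token in doc:
    print(token.text, token.lemma_)
```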
If it is of any use, there are now close to 50,000 representation-lemma relationships in Wikidata: https://w.wiki/457J
An alternative approach is to use the new neural edit trees.
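A minimal sketch of setting this up, assuming spaCy v3.3+, where edit trees are exposed as the `trainable_lemmatizer` factory (training config and corpus omitted):

```python
import spacy

# Start from a blank Danish pipeline and add the edit-tree lemmatizer.
nlp = spacy.blank("da")
nlp.add_pipe("trainable_lemmatizer", name="lemmatizer")

# The component is trainable: it must be trained on annotated lemmas
# (e.g. via `spacy train`) before it produces useful predictions.
```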
There are now 115,204 representation-lemma relationships in Wikidata: https://w.wiki/457J
Wonderful, thanks @fnielsen. The latest version (which will probably be released today or tomorrow) improves the lemmatisation by quite a margin (all the way up to ~95% accuracy) using neural edit trees.
However, I know that the Greek model OdyCy uses a hybrid approach; that might be worth trying next.
Edit: The new 0.2.0 models are now up!
Actually @fnielsen, if I were to integrate with the Danish word registries, would you recommend doing so via Wikidata?
Some considerations I have:
- Would there be things I couldn't do using Wikidata? I.e. is there relevant metadata that I might want to include?
- How well would it transfer to e.g. Norwegian and Swedish (to allow DaCy to generalize)?
It is fine if you don't know the answer; you simply seem to have more expertise on this than I do. I would generally prefer using Wikidata, but I am unsure what the tradeoffs are. Hoping you can help me.
Bokmål and Swedish are currently larger than Danish wrt. Wikidata lexemes, see https://ordia.toolforge.org/language/ : 40,862 Swedish lemmas, 32,431 Bokmål lemmas, 21,583 Danish lemmas, and 15,036 Nynorsk lemmas. Perhaps that is not sufficient for good coverage. Wrt. forms, there are, e.g., 282,378 Swedish forms in Wikidata.
I plan to copy most forms from Det Centrale Ordregister (COR), https://ordregister.dk/, to Wikidata, so the number of Danish lexemes on Wikidata should then be around 100,000.
I should think that any information in COR would also be in Wikidata, and there would be further metadata besides.
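For anyone wanting to pull these pairs programmatically, a minimal sketch against the Wikidata Query Service (the query follows the standard lexeme model; `wd:Q9035` is Danish, and the user agent string is illustrative):

```python
import requests

# Fetch form/lemma pairs for Danish lexemes from the Wikidata Query Service.
SPARQL = """
SELECT ?form ?lemma WHERE {
  ?lexeme dct:language wd:Q9035 ;                      # Danish
          wikibase:lemma ?lemma ;
          ontolex:lexicalForm/ontolex:representation ?form .
}
LIMIT 10
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": SPARQL, "format": "json"},
    headers={"User-Agent": "dacy-lemma-example/0.1 (illustrative)"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["form"]["value"], "->", row["lemma"]["value"])
```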
One issue, though, is what a lemma is. Take "understimuleret": is that an adjective (lemma "understimuleret") or a verb form (lemma "understimulere")? See https://openreview.net/pdf?id=kvEmQxxAab. I would tend to see "understimuleret" as an adjective.
Thanks @fnielsen. Coverage isn't too much of a problem - we can use a fallback strategy: first do the lookup, and if that fails (e.g. for new words), fall back to the neural edit trees.
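A minimal sketch of that fallback as a spaCy component, assuming a pipeline that already contains a trained (edit-tree) lemmatizer; the lookup table, component name, and pipeline name here are illustrative:

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Toy lookup table mapping (lowercased form, POS) to a lemma.
LOOKUP = {("hunden", "NOUN"): "hund"}

@Language.component("lookup_lemma_override")
def lookup_lemma_override(doc: Doc) -> Doc:
    # Runs after the neural lemmatizer: whenever the lookup table has an
    # entry, it takes precedence; otherwise the neural prediction stands.
    for token in doc:
        lemma = LOOKUP.get((token.lower_, token.pos_))
        if lemma is not None:
            token.lemma_ = lemma
    return doc

nlp = spacy.load("da_core_news_sm")  # any Danish pipeline with a lemmatizer
nlp.add_pipe("lookup_lemma_override", last=True)

doc = nlp("Hunden løb.")
print([(t.text, t.lemma_) for t in doc])
```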