Improve lemmatization
Currently, the model uses a lookup-based lemmatizer built from the training set. This could be improved by adapting the lemmy package to spaCy v3.
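For reference, a minimal sketch of lemmy's standalone API (a spaCy v3 adaptation would wrap this in a pipeline component); the example word is illustrative:

```python
# Requires: pip install lemmy
import lemmy

# Load the pretrained Danish lemmatizer bundled with the package.
lemmatizer = lemmy.load("da")

# lemmatize(pos, form) returns a list of candidate lemmas.
print(lemmatizer.lemmatize("NOUN", "akvariernes"))  # ['akvarium']
```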
Another potential solution is the LSTM-based lemmatizer from Stanza, which should be accessible via the spaCy integration (spacy-stanza). However, it might not perform as well out of distribution.
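A minimal sketch of what that would look like, assuming the spacy-stanza package and the Danish Stanza model are installed:

```python
# Requires: pip install spacy-stanza
import stanza
import spacy_stanza

stanza.download("da")  # one-time download of the Danish Stanza model
nlp = spacy_stanza.load_pipeline("da")  # wraps Stanza in a spaCy pipeline

doc = nlp("Hunden løb gennem skoven.")
for token in doc:
    print(token.text, token.lemma_)
```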
If it is of any use, there are now close to 50,000 representation-lemma relationships in Wikidata: https://w.wiki/457J
An alternative approach is to use the new neural edit trees.
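A minimal sketch of setting this up, assuming spaCy v3.3+, where edit trees are exposed as the `trainable_lemmatizer` factory (training config and corpus omitted):

```python
import spacy

# Start from a blank Danish pipeline and add the edit-tree lemmatizer.
nlp = spacy.blank("da")
nlp.add_pipe("trainable_lemmatizer", name="lemmatizer")

# The component is trainable: it must be trained on annotated lemmas
# (e.g. via `spacy train`) before it produces useful predictions.
```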
There are now 115,204 representation-lemma relationships in Wikidata: https://w.wiki/457J
Wonderful, thanks @fnielsen. The latest version (which will probably be released today or tomorrow) improves the lemmatisation by quite a margin (all the way up to ~95% accuracy) using neural edit trees.
However, I know that the Greek model OdyCy uses a hybrid approach; that might be worth trying next.
Edit: The new 0.2.0 models are now up!
Actually @fnielsen, if I were to integrate with the Danish word registries, would you recommend doing so via Wikidata?
Some considerations I have:
- Would there be things I couldn't do using Wikidata? I.e. is there relevant metadata that I might want to include?
- How well would it transfer to e.g. Norwegian and Swedish (to allow DaCy to generalize)?
It is fine if you don't know the answer; you simply seem to have more expertise on this than I do. I would generally prefer using Wikidata, but I am unsure what the tradeoffs are. Hoping you can help me.
Bokmål and Swedish are currently larger than Danish wrt. Wikidata lexemes, see https://ordia.toolforge.org/language/ : 40,862 Swedish lemmas, 32,431 Bokmål lemmas, 21,583 Danish lemmas, and 15,036 Nynorsk lemmas. Perhaps that is not sufficient for good coverage. Wrt. forms, there are, e.g., 282,378 Swedish forms in Wikidata.
I plan to copy most forms from Det Centrale Ordregister (COR), https://ordregister.dk/, to Wikidata, so the number of Danish lexemes on Wikidata should then be around 100,000.
I should think that any information in COR would also be in Wikidata, and there would be further metadata besides.
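For anyone wanting to pull these pairs programmatically, a minimal sketch against the Wikidata Query Service (the query follows the standard lexeme model; `wd:Q9035` is Danish, and the user agent string is illustrative):

```python
import requests

# Fetch form/lemma pairs for Danish lexemes from the Wikidata Query Service.
SPARQL = """
SELECT ?form ?lemma WHERE {
  ?lexeme dct:language wd:Q9035 ;                      # Danish
          wikibase:lemma ?lemma ;
          ontolex:lexicalForm/ontolex:representation ?form .
}
LIMIT 10
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": SPARQL, "format": "json"},
    headers={"User-Agent": "dacy-lemma-example/0.1 (illustrative)"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["form"]["value"], "->", row["lemma"]["value"])
```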
One issue, though, is what a lemma is. Take "understimuleret": is that an adjective (lemma "understimuleret") or a verb form (lemma "understimulere")? See https://openreview.net/pdf?id=kvEmQxxAab. I would tend to see "understimuleret" as an adjective.
Thanks @fnielsen. Coverage isn't too much of a problem - we can use a fallback strategy: first do the lookup, and if that fails (e.g. for new words), fall back to the neural edit trees.
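A minimal sketch of that fallback as a spaCy component, assuming a pipeline that already contains a trained (edit-tree) lemmatizer; the lookup table, component name, and pipeline name here are illustrative:

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Toy lookup table mapping (lowercased form, POS) to a lemma.
LOOKUP = {("hunden", "NOUN"): "hund"}

@Language.component("lookup_lemma_override")
def lookup_lemma_override(doc: Doc) -> Doc:
    # Runs after the neural lemmatizer: whenever the lookup table has an
    # entry, it takes precedence; otherwise the neural prediction stands.
    for token in doc:
        lemma = LOOKUP.get((token.lower_, token.pos_))
        if lemma is not None:
            token.lemma_ = lemma
    return doc

nlp = spacy.load("da_core_news_sm")  # any Danish pipeline with a lemmatizer
nlp.add_pipe("lookup_lemma_override", last=True)

doc = nlp("Hunden løb.")
print([(t.text, t.lemma_) for t in doc])
```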