unilex issues

Please update: SPDX-License-Identifier: Unicode-3.0

Please update the SPDX identifier in data files to `SPDX-License-Identifier: Unicode-3.0`. Relates to: #10, #19, https://github.com/unicode-org/.github/issues/15
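The requested change could be scripted roughly as follows. This is only a sketch: it assumes each data file carries the identifier on a line of its own, optionally behind a comment marker such as `#`. The helper name `update_spdx` and the sample header (including the old identifier) are hypothetical, not taken from the repository.

```python
import re

# Assumption: the identifier sits on its own line, optionally preceded by a
# comment marker such as "#". Group 1 keeps that prefix untouched.
SPDX_LINE = re.compile(
    r"^([^\w\n]*SPDX-License-Identifier:[ \t]*)\S.*?[ \t]*$",
    re.MULTILINE,
)


def update_spdx(text, new_license="Unicode-3.0"):
    """Return text with every SPDX identifier line set to new_license."""
    return SPDX_LINE.sub(lambda m: m.group(1) + new_license, text)


# Hypothetical sample header; the old identifier is illustrative only.
sample = "# SPDX-License-Identifier: Unicode-DFS-2016\n\nabc\t12\n"
print(update_spdx(sample))
```

Text without an SPDX line is returned unchanged, so the function can be run over all data files without checking each one first.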

srl295

Publish under dual Unicode AND Open License

10

Your data is an impressive work which could help many, many minority and rare languages to get stronger online representation. The Wikimedia Foundation, Wikipedia, Wikidata, and @Lingua-Libre movements would love...

hugolpz

Investigate article "1000 MT translators"

https://arxiv.org/abs/2205.03983 ![Screenshot_2022-06-22-00-35-49-05_40deb401b9ffe8e1df2f1cc5ba480b12](https://user-images.githubusercontent.com/1420189/174908171-c909060e-14f3-451d-bd10-2da4fe4a7411.jpg)

hugolpz

Levantine Arabic

Add frequency data. Source: https://portal.sina.birzeit.edu/curras
Please assign this issue to me.

hugolpz

Tatar word frequency

Possibility to create or request a resource for Tatar:
* Corpus: Corpus of Written Tatar (Saykhunov 2021), see [here](https://www.corpus.tatar/en)
* [Wordlist](https://corpus.tatar/stat_en.htm): "[Frequency list of Tatar wordforms (case-sensitive)](https://www.corpus.tatar/stat/tatcorpus3.words_frequency_case-sensitive.bz2)"
* Corpus: [Leipzig](https://cls.corpora.uni-leipzig.de/en/tat_web_2019/)...

hugolpz

Run process again to include missing files

1

There is a [`crawl_ca_valencia.py`](https://github.com/google/corpuscrawler/blob/master/Lib/corpuscrawler/crawl_ca_valencia.py) script [within the google/corpuscrawler project](https://github.com/google/corpuscrawler/search?q=valencia), which produces a file listed in their [README.md](https://raw.githubusercontent.com/google/corpuscrawler/master/README.md). Surprisingly, this frequency file didn't make it into UNILEX. As renowned Twitter expert...

hugolpz

Tok Pisin

1

Worthwhile project, but the corpus contains a lot of plain English (both American and British/Australian). The source materials probably contain texts in English as well as Tok Pisin and the...

evali1

Comparing languages of LinguaLibre vs UNILEX

Just for reference, since I am hand-comparing the language lists of LinguaLibre vs UNILEX: I observed that the following languages are not in UNILEX, possibly for various reasons. I am conscious this issue...

hugolpz

Griko part-of-speech tags

1

The [Griko language resources](https://bitbucket.org/antonis/grikoresource) include manually assigned [part of speech tags](https://bitbucket.org/antonis/grikoresource/src/897cb9d9526901e0905ef0c8330267b896a5eb15/data/projected_tags/train.projected_tags.txt?at=master&fileviewer=file-view-default). @antonisa, would you perhaps be interested in contributing this data to the Unilex project? If you’re interested, would you...

brawer

Language code for Griko

1

@antonisa, thanks again for your data submission! For now, I’ve tagged it as `el-Latn-u-sd-it75` which means “Greek in the Latin writing system as used in Apulia”. Is your data actually...
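To illustrate how the tag above breaks down, here is a minimal sketch that splits it into its language, script, and `-u-` extension parts. It is not a full BCP 47 parser: it only handles tags of this shape (language, optional script, optional Unicode extension keywords), and the function name `split_language_tag` is hypothetical.

```python
def split_language_tag(tag):
    """Split a simple language tag into language, script and -u- keywords.

    Simplified sketch: region, variant and other extension subtags
    are not interpreted.
    """
    subtags = tag.split("-")
    parsed = {"language": subtags[0].lower(), "script": None, "u": {}}
    rest = subtags[1:]

    # A four-letter alphabetic subtag right after the language is a script.
    if rest and len(rest[0]) == 4 and rest[0].isalpha():
        parsed["script"] = rest[0].title()
        rest = rest[1:]

    lowered = [s.lower() for s in rest]
    if "u" in lowered:
        key = None
        for sub in lowered[lowered.index("u") + 1:]:
            if len(sub) == 1:
                break  # the next singleton starts a different extension
            if len(sub) == 2:
                key = sub  # two-character subtags are keyword keys
                parsed["u"][key] = ""
            elif key is not None:
                # longer subtags are values of the most recent key
                parsed["u"][key] = (parsed["u"][key] + "-" + sub).lstrip("-")
    return parsed


print(split_language_tag("el-Latn-u-sd-it75"))
```

Here `el` is Greek, `Latn` is the Latin script, and the `sd` keyword carries the subdivision `it75`, which corresponds to ISO 3166-2 code IT-75 (Apulia).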

brawer
