python-wordsegment

Support for Other Languages

Open ykhatami opened this issue 4 years ago • 2 comments

The LDC has published the Web 1T 5-gram, 10 European Languages dataset at https://catalog.ldc.upenn.edu/LDC2009T25

Is there any plan to support these languages? If not, can I jump in and contribute? Would it be enough to parse the above data and get the unigram/bigram counts?
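For illustration, here is a minimal sketch of what "parse the data and get the unigram/bigram counts" could look like. It assumes each input line is an n-gram followed by a tab and an integer count; that layout is an assumption about the files, not something confirmed in this thread. The helper name `parse_counts` and the sample lines are hypothetical, and the thread does not confirm that counts alone would be enough to add a language.

```python
import io
from collections import Counter


def parse_counts(lines, n):
    """Sum counts per lowercased n-gram from 'w1 ... wn<TAB>count' lines.

    Hypothetical helper: the tab-separated layout is an assumption.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Split on the last tab so the n-gram text stays intact.
        ngram, _, count = line.rpartition("\t")
        if not ngram or not count.isdigit():
            continue  # skip blank or malformed lines
        tokens = ngram.split(" ")
        if len(tokens) != n:
            continue  # skip n-grams of the wrong order
        # Lowercase so that case variants are merged into one entry.
        counts[" ".join(tokens).lower()] += int(count)
    return counts


# Made-up sample data standing in for real corpus files.
sample_unigrams = io.StringIO("Haus\t120\nhaus\t30\nTür\t45\n")
sample_bigrams = io.StringIO("das Haus\t60\ndie Tür\t25\nmalformed line\n")

unigrams = parse_counts(sample_unigrams, 1)
bigrams = parse_counts(sample_bigrams, 2)
```

With real files, the `io.StringIO` objects would be replaced by file handles opened with `encoding="utf-8"`.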

ykhatami, Mar 05 '21 07:03

No, I don’t have plans to ship those corpora at this time. The linked datasets do not appear to be redistributable for free. Under “View Fees”, the cost is $150 for non-members.

grantjenks, Mar 05 '21 15:03

Not sure if this is of any use, but this may be handy for this task: https://github.com/Poio-NLP/poio-corpus (they used it to build a prediction engine, pressagio).

willwade, Jan 18 '24 00:01