
Use available corpora for opensubtitles (63 languages)

Open hugolpz opened this issue 4 years ago • 3 comments

Research

  • J. Tiedemann, 2016, Finding Alternative Translations in a Large Corpus of Movie Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)

Gain

The closest available thing to natural oral corpora.

Links

  • Portal
    • bre.txt.gz -- Breton corpus.
    • 60+ languages available.
    • List: af,ar,bg,bn,br,bs,ca,cs,da,de,el,en,eo,es,et,eu,fa,fi,fr,gl,he,hi,hr,hu,hy,id,is,it,ja,ka,kk,ko,lt,lv,mk,ml,ms,nl,no,pl,pt,pt_br,ro,ru,si,sk,sl,sq,sr,sv,ta,te,th,tl,tr,uk,ur,vi,ze_en,ze_zh,zh_cn,zh_tw
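As a rough illustration of how such a monolingual file could be consumed, here is a minimal Python sketch. It assumes the downloaded file (for example `bre.txt.gz`) is gzip-compressed UTF-8 text with one sentence per line; that layout is an assumption and has not been checked against the portal. The names `count_tokens` and `demo` are hypothetical helpers for this example and are not part of CorpusCrawler.

```python
import gzip
import os
import tempfile
from collections import Counter


def count_tokens(path, limit=None):
    """Count lowercase whitespace-separated tokens in a gzipped text file.

    Assumes UTF-8 text with one sentence per line (not verified for the
    OpenSubtitles downloads). `limit` caps the number of lines read.
    """
    counts = Counter()
    with gzip.open(path, mode="rt", encoding="utf-8") as lines:
        for i, line in enumerate(lines):
            if limit is not None and i >= limit:
                break
            counts.update(line.lower().split())
    return counts


def demo():
    """Write a tiny stand-in corpus to a temp file and count its tokens."""
    sample = "Demat dit\nMat eo\nDemat d'an holl\n"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt.gz")
        with gzip.open(path, mode="wt", encoding="utf-8") as out:
            out.write(sample)
        return count_tokens(path)
```

With a real download, `count_tokens("bre.txt.gz", limit=100000)` would give a quick frequency list without reading the whole file; proper tokenization would need more than `str.split`.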

There are ready-to-download open licence Wikipedia corpora available.

  • Project introduction: OpenSubtitles 2016/2018
  • Type: Subtitles; Parallel sentences; Monolingual sentences
  • Languages (2024): 75
  • Portal all: Portal
  • Language specific: br&en
  • Download link: bre (mono)
  • Comments:
    • Source: P. Lison and J. Tiedemann (2016), "OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles", http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf
    • Licence: unclear. "The corpora is made freely available to the research community on the OPUS website" (Lison and Tiedemann, 2016).

hugolpz avatar Feb 25 '21 17:02 hugolpz

Sounds great. Send a pull request?

brawer avatar Feb 25 '21 17:02 brawer

Hello Sascha / @brawer, my Python skills are near zero so far, so I do my best to help with the knowledge and know-how I have:

  • multilingual corpora literature review → sharing
  • Wikimedia's API, ecosystems, resources → sharing
  • documenting open-source projects positively to increase engagement
  • clarifying roadmaps
  • networking for stronger projects¹

The project also lacks meaningful documentation (#80). It would be inefficient to put a total Python newbie on Python copy-engineering. I will be more productive on other linguistic diversity issues, here and on @Lingua-libre projects.

Given how central this CLDR/UNILEX/Unicode/Google CorpusCrawler repository is to linguistic diversity on the web, is there an email contact that I, Wikimedia France, and/or the Wikimedia Foundation could write to in order to ask for more solid support for CorpusCrawler? Volunteering can do a lot but is too irregular. A dedicated, versatile, paid maintainer supervising ~20² of Google's open-source projects, unblocking the key bottlenecks through 4-hour coding sprints and community support, would quickly provide a positive ROI. 2020 opened access to skilled workers all around the world. There is surely a long list of open-source projects that would benefit from such small yet skilled pushes past their bottlenecks.

I would be interested in coordinating such an email with Wikimedia France and the US Wikimedia Foundation, to gather a handful of names behind it (provided there is a reasonable chance, above 5~10%, of achieving the intended goal of a skilled, paid maintainer here for 4 hrs/week within the next 2 years).

1: see text above
2: depending on project activity, could be fewer or more. The current project has about 1 issue / month.

hugolpz avatar Feb 26 '21 12:02 hugolpz

Thanks for the chat, @Brawer. Our online chat will help me better plan the next phases of Lingualibre and its collaboration with the crawler.

hugolpz avatar Mar 04 '21 16:03 hugolpz