mtdata
A tool that locates, downloads, and extracts machine translation corpora
List of available pairs:
- English-Turkish
- English-Bulgarian
- English-Croatian
- English-Slovene
- English-Macedonian
- English-Icelandic
- English-Maltese

English-Spanish and English-Dutch are Paracrawl 9 enriched DSI (domain) data, so there's...
http://lepage-lab.ips.waseda.ac.jp/en/projects/meiteilon-manipuri-language-resources/
https://github.com/sgongora27/giossa-gongora-guarani-2021/tree/main/ParallelSet The format is rather non-standard unfortunately.
http://catalog.elra.info/en-us/repository/browse/ELRA-W0320/# License: CC-BY-SA-3.0. Not sure why there isn't a download link on the main page; I guess somebody with an ELRA login needs to go in, get it, and rehost it.
## Change log
* CLI arg `--log-level` with default set to `WARNING`
* progressbar can be disabled from CLI `--no-pbar`; default is enabled `--pbar`

```
python -m mtdata -h
usage:...
```
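As an illustration of the two options in the change log, here is a minimal `argparse` sketch. Only the flag names and defaults (`--log-level` defaulting to `WARNING`, `--pbar` / `--no-pbar` defaulting to enabled) come from the change log; the parser name `mtdata-sketch`, the logger name, and the `parse` helper are hypothetical, and this is not mtdata's actual implementation.

```python
import argparse
import logging


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: only the flag names and defaults mirror the change log.
    parser = argparse.ArgumentParser(prog="mtdata-sketch")
    parser.add_argument("--log-level", default="WARNING",
                        help="logging verbosity (default: WARNING)")
    # Paired flags writing to the same destination; the progress bar is on by default.
    parser.add_argument("--pbar", dest="pbar", action="store_true", default=True,
                        help="show progress bar (default)")
    parser.add_argument("--no-pbar", dest="pbar", action="store_false",
                        help="hide progress bar")
    return parser


def parse(argv):
    """Parse argv and apply the requested log level to a sketch logger."""
    args = build_parser().parse_args(argv)
    # Logger.setLevel accepts standard level names such as "WARNING" or "DEBUG".
    logging.getLogger("mtdata_sketch").setLevel(args.log_level)
    return args
```

Calling `parse(["--no-pbar", "--log-level", "DEBUG"])` would yield `pbar=False` and `log_level="DEBUG"`, while `parse([])` keeps both defaults.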
https://camel.abudhabi.nyu.edu/arabacquis/ https://camel.abudhabi.nyu.edu/madar-parallel-corpus/
If there's an error, I expect mtdata to return a non-zero code. Example:

1. Add
```
en ru 5183 SciPar_Ukraine SciPar UK-EN-RU https://elrc-share.eu/repository/browse/scipar-uk-en-ru/f635552ab06011ec9c1a00155d0267061ce92362f8af4c0b9d4f64d017c2df3f/ https://elrc-share.eu/repository/download/f635552ab06011ec9c1a00155d0267061ce92362f8af4c0b9d4f64d017c2df3f/ CC-BY-NC-SA-4.0 tmx/en-ru.tmx
```
to mtdata/index/elrc_share.tsv
2....
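The behaviour requested above (a non-zero exit code whenever something fails) can be sketched generically as follows. This is an assumption-laden illustration, not mtdata's code: `run` and `failing_download` are hypothetical names, and the exit code `1` is just the conventional choice for a generic failure.

```python
import sys


def run(task) -> int:
    """Run a callable and map its outcome to a process exit code."""
    try:
        task()
    except Exception as err:
        # Report the failure on stderr and signal it through the exit code.
        print(f"error: {err}", file=sys.stderr)
        return 1
    return 0


def failing_download():
    # Hypothetical stand-in for a download step that fails.
    raise RuntimeError("download failed")
```

A command-line entry point would then end with `sys.exit(run(some_task))`, so that shell scripts and CI jobs can detect the failure.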
Here is my mtdata.recipes.wmt22-constrained.yaml config:

```yaml
- id: wmt22-zhen-t
  langs: zho-eng
  desc: WMT 22 General MT
  url: https://www.statmt.org/wmt22/translation-task.html
  dev:
  test:
  - Statmt-newstest_enzh-2021-eng-zho
  train:
```

When downloading the test set using...
The WMT22 general MT (news) task uses this link to download the dataset: https://elrc-share.eu/repository/browse/eu-acts-in-ukrainian/71205868ae7011ec9c1a00155d026706d86232eb1bba43b691bdb6e8a8ec3ccf/
This link https://storage.googleapis.com/samanantar-public/benchmarks.zip is not working as of now.