OpusTools
I tried to use `opus_get`. I tried the simplest command from the README: ``` $ opus_get --directory RF --source en --target sv Downloading 3 file(s) with the total size of 121...
Bumps [numpy](https://github.com/numpy/numpy) from 1.16.4 to 1.22.0. Release notes Sourced from numpy's releases. v1.22.0 NumPy 1.22.0 Release Notes NumPy 1.22.0 is a big release featuring the work of 153 contributors spread...
There has been a slight change in the yaml files in OPUS: the item 'latest release' is now renamed to 'latest_release' (with underscore instead of space). This also affects the...
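A minimal sketch of how a client could tolerate both spellings of the key after this rename. It assumes the yaml file has already been parsed into a plain Python dict; the helper name `get_latest_release` and the sample values are hypothetical and are not taken from OpusTools or from the OPUS yaml files.

```python
def get_latest_release(info):
    """Return the latest-release value from a parsed yaml mapping.

    Checks the new key 'latest_release' first, then falls back to the
    old key 'latest release'. Returns None if neither key is present.
    """
    for key in ("latest_release", "latest release"):
        if key in info:
            return info[key]
    return None


# Placeholder mappings (made-up values) in the new and the old style.
new_style = {"latest_release": "v1"}
old_style = {"latest release": "v1"}

print(get_latest_release(new_style), get_latest_release(old_style))
```

Checking the new key first means files that somehow carry both keys resolve to the current spelling.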
Apostrophes, commas, question marks, etc., are all printed with a leading space. Is this by design? I couldn't see any options to modify the behaviour. ``` (src)="8"> She 's calling...
**Motivation.** I want `OpusRead.printPairs` to be a generator for a downstream task. Specifically, I intend to share Opus as a Hugging Face dataset (see: `DatasetBuilder._generate_examples` in [link](https://huggingface.co/docs/datasets/dataset_script#generate-samples)). **Change.** Added `yield_tuple` write mode...
I tried to extract the aligned sentence pairs from CCMatrix, previously downloaded using `opus_express`. The command I used was ``` opus_read --source en --target fi --directory CCMatrix --preprocess xml --leave_non_alignments_out...
Make it possible to search with 3-letter language IDs (like in mtdata) - integrate the ID conversion tools implemented in mtdata (https://github.com/thammegowda/mtdata). Even better is to support extensions like script...
The provided TMX files contain tokenized text, and I wonder which tokenizer is used for languages like Thai, Chinese, etc. Are there any docs where I can find this? Thanks!
There seem to be 417 language varieties represented in https://opus.nlpl.eu/JW300.php. This would imply 417C2 = 86,736 undirected language pairs. However, I only count 54,376 of them, and the paper confirms...
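For reference, the pair count quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic, using the two figures stated in the issue (417 language varieties and 54,376 counted pairs); the variable names are illustrative.

```python
import math

n_languages = 417        # language varieties listed for JW300 (from the issue)
observed_pairs = 54_376  # pairs counted by the reporter (from the issue)

# Number of unordered pairs that can be formed from n_languages items.
possible_pairs = math.comb(n_languages, 2)

# Pairs that would exist if every variety were aligned with every other one,
# but that were not found in the count.
missing_pairs = possible_pairs - observed_pairs

print(possible_pairs, missing_pairs)
```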
I have tried to obtain bitext from the JW300 corpus in plain text format. The webpage http://opus.nlpl.eu/JW300-v1.php gives the instruction to use opus-tools to extract bitext from the alignment XML...