Where are the missing language pairs?

Open icaswell opened this issue 4 years ago • 1 comments

There seem to be 417 language varieties represented in https://opus.nlpl.eu/JW300.php. This would imply C(417, 2) = 417 × 416 / 2 = 86,736 undirected language pairs. However, I only count 54,376 of them, and the paper confirms this number. Do you know where the missing 32,360 language pairs are, and would you be willing to provide them?
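For reference, the arithmetic above can be reproduced with a few lines of Python. This is only a sanity check of the numbers quoted in this issue; the variable names are mine and nothing here reads the actual corpus.

```python
from math import comb

# Figures quoted in this issue (not computed from the corpus itself).
n_languages = 417          # language varieties listed on the OPUS JW300 page
n_observed_pairs = 54_376  # language pairs I count, matching the paper

# Number of undirected pairs if every language were paired with every other.
n_possible_pairs = comb(n_languages, 2)
n_missing_pairs = n_possible_pairs - n_observed_pairs

print(f"possible pairs: {n_possible_pairs:,}")
print(f"missing pairs:  {n_missing_pairs:,}")
```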

I notice that the adjacency matrix seems to form a single connected component, so every language can be reached from every other through a chain of existing pairs. For example, although ady has no parallel data with en, it has parallel data with "jw_rmv", which has parallel data with en. So it seems likely that ady and en can be aligned. Just to demonstrate that it's conceptually possible, I found these two pairs in the respective corpora:

jw_rmv: Пала со амэ подаса дума андэ авэр статья ?
ady: Сыда къыкІэлъыкІорэ статьям щызэхэтфыщтыр ?

jw_rmv: Пала со амэ подаса дума андэ авэр статья ?
en: What will we consider in the following article ?

Implication: the following is a sentence pair between English and Adyghe:

ady: Сыда къыкІэлъыкІорэ статьям щызэхэтфыщтыр ?
en: What will we consider in the following article ?
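The pivoting step above could be sketched roughly as follows. This is a minimal illustration, not how JW300 was built: it assumes each bitext is available as a list of (pivot sentence, other sentence) tuples and that pivot sentences match exactly as strings. The function name `pivot_align` and the variable names are hypothetical.

```python
def pivot_align(pivot_to_a, pivot_to_b):
    """Join two bitexts on identical pivot-language sentences.

    Each argument is a list of (pivot_sentence, other_sentence) tuples.
    Returns a list of (a_sentence, b_sentence) pairs.
    """
    # Index the second bitext by its pivot sentence.
    b_by_pivot = {}
    for pivot_sent, b_sent in pivot_to_b:
        b_by_pivot.setdefault(pivot_sent, []).append(b_sent)

    # Every sentence in A whose pivot side also occurs in the second
    # bitext yields one pair per matching B sentence.
    pairs = []
    for pivot_sent, a_sent in pivot_to_a:
        for b_sent in b_by_pivot.get(pivot_sent, []):
            pairs.append((a_sent, b_sent))
    return pairs


# The two sentence pairs quoted above, with jw_rmv as the pivot.
rmv_ady = [("Пала со амэ подаса дума андэ авэр статья ?",
            "Сыда къыкІэлъыкІорэ статьям щызэхэтфыщтыр ?")]
rmv_en = [("Пала со амэ подаса дума андэ авэр статья ?",
           "What will we consider in the following article ?")]

ady_en = pivot_align(rmv_ady, rmv_en)
for ady_sent, en_sent in ady_en:
    print("ady:", ady_sent)
    print("en: ", en_sent)
```

Exact string matching is the simplest possible join. A pivot sentence that recurs in different contexts could produce spurious pairs, so joining on document and sentence position would probably be safer if that information is available.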

(Interestingly, jw_rmv, which actually seems to be Vlax Romany in Cyrillic script, is the one language that is aligned with the most other languages -- more than English!)

Mar 16 '21 01:03 icaswell

Useful theorem: every language in JW300 is parallel with English, Assamese, or both.
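A claim like this can be checked mechanically from the list of available language pairs. Below is a minimal sketch, assuming the pairs are given as 2-tuples of language codes and that Assamese uses the code "as". The function name and the toy data (placeholder codes "xx", "yy", "zz") are made up for illustration and are not taken from JW300.

```python
def uncovered_languages(pairs, hubs):
    """Return languages that are neither a hub nor directly paired with one."""
    # Build an undirected adjacency map from the pair list.
    neighbors = {}
    for a, b in pairs:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return {
        lang
        for lang, adjacent in neighbors.items()
        if lang not in hubs and not (adjacent & hubs)
    }


# Toy data with placeholder codes; NOT the real JW300 pair list.
toy_pairs = [("en", "xx"), ("as", "yy"), ("xx", "yy"), ("yy", "zz")]
hubs = {"en", "as"}

print(uncovered_languages(toy_pairs, hubs))                   # "zz" has no hub partner
print(uncovered_languages(toy_pairs + [("as", "zz")], hubs))  # empty once "zz" is paired with "as"
```

On the real pair list, the statement above corresponds to this function returning an empty set for the hubs {"en", "as"}.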

Mar 25 '21 22:03 icaswell