Cannot replicate pre-computed syntactic distance
Hi, thank you for your work!
I wanted to ask about computing the syntactic distance between languages.
If I understood correctly, the pre-computed syntactic distance returned by
l2v.distance("syntactic", [l1, l2])
is the cosine distance between the two languages' feature vectors, so it should be reproducible with
import lang2vec.lang2vec as l2v
from scipy.spatial.distance import cosine
# Feature vectors for the two languages; entries are "--" where a feature is missing
a = l2v.get_features(l1, "syntax_wals")[l1]
b = l2v.get_features(l2, "syntax_wals")[l2]
cosine(a, b)
For features that are missing in a or b (those with -- as their value), I followed the approach described here: https://github.com/antonisa/lang2vec/issues/7#issuecomment-730548622.
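A minimal sketch of one way to handle the missing values, assuming the intended handling is to drop every feature that is -- in either language before taking the cosine distance. That reading is an assumption, and `masked_cosine_distance` is a hypothetical helper, not part of lang2vec:

```python
import math

MISSING = "--"  # marker used for unknown feature values

def masked_cosine_distance(a, b, missing=MISSING):
    """Cosine distance over the features that are known in both vectors.

    Assumption: a feature is dropped if it is missing in either language.
    """
    pairs = [(float(x), float(y)) for x, y in zip(a, b)
             if x != missing and y != missing]
    if not pairs:
        return float("nan")  # no shared features, so the distance is undefined
    dot = sum(x * y for x, y in pairs)
    norm_a = math.sqrt(sum(x * x for x, _ in pairs))
    norm_b = math.sqrt(sum(y * y for _, y in pairs))
    if norm_a == 0.0 or norm_b == 0.0:
        return float("nan")  # an all-zero vector has no defined cosine
    return 1.0 - dot / (norm_a * norm_b)
```

With the vectors from the snippet above, this would be called as masked_cosine_distance(a, b). Note that it returns NaN when the two languages share no known features.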
However, the results do not match the pre-computed distances. I also tried syntax_knn instead of syntax_wals, but they still do not match.
Also, some of the languages that appear in the pre-computed distances have -- for every feature, so a distance to other languages cannot actually be computed from them. (For example, the syntactic distance between frr and dan is given as an example in the README, but l2v.get_features("frr", "syntax_wals") returns a list of "--"s.)
Below are the Pearson correlation coefficients and p-values between the pre-computed and manually computed distances, averaged over languages.
- manually computed with syntax_wals & pre-computed: coef 0.6325433738084123 / p-value 0.13051253893837714
- manually computed with syntax_knn & pre-computed: coef 0.6257979297204636 / p-value 0.1392174749552544
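For clarity, the comparison above can be sketched roughly as follows. This assumes "average" means that one Pearson coefficient is computed per language (over its distances to the other languages) and the coefficients are then averaged. The helper names and the input layout are hypothetical, and p-values are left out here (scipy.stats.pearsonr would return them alongside the coefficient):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

def mean_pearson(precomputed, manual):
    """Average the per-language coefficients.

    Both arguments map a language code to its list of distances to the
    same ordered set of other languages (hypothetical layout).
    """
    coefs = [pearson(precomputed[lang], manual[lang]) for lang in precomputed]
    return sum(coefs) / len(coefs)
```

The function raises ZeroDivisionError if either list of distances is constant, so languages whose manual distances are all identical would need to be skipped first.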
I would really appreciate it if you could provide more details on computing the distance if I missed something here!
Thank you so much :)