lang2vec

Cannot replicate pre-computed syntactic distance

Open • letme-hj opened this issue 1 year ago • 1 comment

Hi, thank you for your work!

I have a question about computing the syntactic distance between languages.

If I understand correctly, the pre-computed syntactic distance obtained by

lang2vec.distance("syntactic", [l1, l2])

is the cosine distance between the feature vectors of the two languages, so it should be reproducible with

from scipy.spatial.distance import cosine

a = lang2vec.get_features(l1, "syntax_wals")[l1]
b = lang2vec.get_features(l2, "syntax_wals")[l2]
cosine(a, b)

For features that are missing in a or b (those whose value is "--"), I followed the approach described here: https://github.com/antonisa/lang2vec/issues/7#issuecomment-730548622.
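To make the setup concrete, here is a minimal sketch of the kind of masked cosine distance I mean. It assumes that positions where either vector is "--" are simply dropped before computing the distance, which may not be exactly what the linked comment prescribes. `masked_cosine_distance` is a hypothetical helper (not part of lang2vec), and the toy vectors stand in for the output of get_features.

```python
import math

MISSING = "--"  # marker that appears in feature vectors for unavailable values


def masked_cosine_distance(a, b, missing=MISSING):
    """Cosine distance over the positions where both vectors have a value.

    Assumption: missing positions are dropped pairwise. This is one possible
    convention, not necessarily the one used for the pre-computed distances.
    """
    pairs = [
        (float(x), float(y))
        for x, y in zip(a, b)
        if x != missing and y != missing
    ]
    if not pairs:
        return float("nan")  # no shared features, e.g. an all-"--" vector
    dot = sum(x * y for x, y in pairs)
    norm_a = math.sqrt(sum(x * x for x, _ in pairs))
    norm_b = math.sqrt(sum(y * y for _, y in pairs))
    if norm_a == 0.0 or norm_b == 0.0:
        return float("nan")  # cosine is undefined for a zero vector
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors (hypothetical values, not real lang2vec features)
a = [1.0, 0.0, "--", 1.0]
b = [1.0, 1.0, 0.0, "--"]
d = masked_cosine_distance(a, b)
```

With this convention, a language whose features are all "--" yields NaN against every other language, because no shared positions remain.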

However, the results do not match. I also tried syntax_knn instead of syntax_wals, but they still do not match. In addition, some of the languages that appear in the pre-computed distances have "--" for every feature, so their distances to other languages cannot actually be computed from the features. (For example, the syntactic distance between frr and dan is provided, as shown in the README example, but l2v.get_features("frr", "syntax_wals") returns a list of "--" values.)

Below are the Pearson correlation coefficients and p-values between the pre-computed and manually computed distances, computed per language and then averaged.

  • manually computed with syntax_wals vs. pre-computed: coefficient = 0.6325433738084123, p-value = 0.13051253893837714
  • manually computed with syntax_knn vs. pre-computed: coefficient = 0.6257979297204636, p-value = 0.1392174749552544
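A sketch of one way such per-language averages can be computed, assuming each set of distances is stored as a nested dict mapping a language to its distances to other languages. The helper names and the toy numbers are hypothetical and are not taken from the actual experiment. Only the coefficient is computed here; the p-value is left out to keep the sketch dependency-free.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return float("nan")  # correlation is undefined for constant input
    return cov / math.sqrt(var_x * var_y)


def mean_per_language_pearson(precomputed, manual):
    """Average, over languages, of the correlation between two distance rows.

    Both arguments map language -> {other language -> distance}.
    Languages with fewer than 3 shared neighbours or an undefined
    correlation are skipped.
    """
    coefs = []
    for lang, row in precomputed.items():
        other_row = manual.get(lang, {})
        shared = sorted(o for o in row if o in other_row)
        if len(shared) < 3:
            continue
        r = pearson([row[o] for o in shared], [other_row[o] for o in shared])
        if not math.isnan(r):
            coefs.append(r)
    return sum(coefs) / len(coefs) if coefs else float("nan")


# Toy distances (hypothetical values; "aaa".."ddd" are placeholder codes)
pre = {"aaa": {"bbb": 0.1, "ccc": 0.2, "ddd": 0.3}}
man = {"aaa": {"bbb": 0.2, "ccc": 0.4, "ddd": 0.6}}
avg_r = mean_per_language_pearson(pre, man)
```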

I would really appreciate it if you could provide more details on computing the distance if I missed something here!

Thank you so much :)

letme-hj, Jan 07 '25 20:01