sent2vec

Does it support Chinese?

Open zfxSteven opened this issue 4 years ago • 2 comments

zfxSteven avatar Apr 12 '21 12:04 zfxSteven

sent2vec is a wrapper around BERT and Word2Vec models, so as long as the underlying model supports Chinese, sent2vec will work with it.

pdrm83 avatar Apr 12 '21 17:04 pdrm83

sent2vec uses "distilbert-base-uncased" as the default model. For other languages, you need to use the "bert-base-multilingual-cased" model. You can find the documentation here:

https://huggingface.co/bert-base-multilingual-cased

Sorry for the bad translation (I used Google Translate), but this is how you can apply sent2vec to another language:

from sent2vec.vectorizer import Vectorizer

sentences = [
    "这是一本学习 NLP 的好书",
    "DistilBERT 是一个了不起的 NLP 模型",
    "我们可以交替使用嵌入、编码或矢量化。",
]

vectorizer = Vectorizer()
vectorizer.bert(sentences, pretrained_weights='bert-base-multilingual-cased')
vectors = vectorizer.vectors

from scipy import spatial

dist_1 = spatial.distance.cosine(vectors[0], vectors[1])
dist_2 = spatial.distance.cosine(vectors[0], vectors[2])
print('dist_1: {0}, dist_2: {1}'.format(dist_1, dist_2))
assert dist_1 < dist_2

That returns the following result:

dist_1: 0.019039809703826904, dist_2: 0.029676854610443115
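
In case it helps with reading these numbers: `spatial.distance.cosine` returns one minus the cosine similarity, so a smaller value means the two vectors point in more similar directions. Here is a minimal pure-Python sketch of that calculation. The function name `cosine_distance` and the toy vectors are my own illustration, not part of sent2vec, and the vectors are not real embeddings.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy 2-D vectors, made up for illustration only
a = [1.0, 0.0]
b = [1.0, 1.0]  # 45 degrees away from a
c = [0.0, 1.0]  # 90 degrees away from a

d_ab = cosine_distance(a, b)
d_ac = cosine_distance(a, c)
print('d_ab: {0}, d_ac: {1}'.format(d_ab, d_ac))
```

Here `d_ab` comes out smaller than `d_ac` because `b` is closer in direction to `a` than `c` is, which is the same comparison the `assert` in my snippet makes on the sentence vectors.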

I hope this helps.

almarengo avatar Jan 26 '22 05:01 almarengo