sent2vec

glove path error

Open yassmine-lam opened this issue 4 years ago • 1 comments

Hi,

I tried to use the word2vec code with the GloVe embeddings file glove.6B.300d.txt, but I got this error:

ValueError: invalid literal for int() with base 10: 'the'

Could someone help, please?

Thank you.

yassmine-lam avatar Feb 19 '21 07:02 yassmine-lam

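The error happens because GloVe text files do not start with the header line (vocabulary size and vector dimension) that the word2vec text format expects, so the loader tries to parse the first token, 'the', as an integer. Converting the file with gensim fixes this. GloVe pretrained vectors come in several sizes:

  • glove.6B: 50d, 100d, 200d, and 300d vectors
  • glove.42B.300d vectors
  • glove.840B.300d vectors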

Here is the GloVe Official Page

Download the file from the website above. You can then substitute the file name in glove_file with the path to the file that you have downloaded.
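
If you are unsure whether a given file still needs converting, a quick check like the one below may help. It is only a sketch: has_word2vec_header is a hypothetical helper name, not part of gensim or sent2vec, and it assumes that a word2vec text file begins with a line of exactly two integers (vocabulary size and vector dimension), whereas a raw GloVe file begins directly with a word followed by its vector.

```python
def has_word2vec_header(path):
    """Return True if the first line looks like '<vocab_size> <vector_size>'.

    Hypothetical helper for illustration; not part of gensim or sent2vec.
    """
    with open(path, encoding="utf-8") as f:
        first_tokens = f.readline().split()
    # A word2vec header has exactly two tokens; a GloVe line has word + N floats.
    if len(first_tokens) != 2:
        return False
    try:
        int(first_tokens[0])
        int(first_tokens[1])
    except ValueError:
        # Same failure the loader hits: int('the') raises ValueError.
        return False
    return True
```

If this returns False for the downloaded file, the file is most likely in raw GloVe format and needs the conversion step.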

This is how you would implement it within sent2vec:

from sent2vec.vectorizer import Vectorizer
from sent2vec.splitter import Splitter

from gensim.test.utils import get_tmpfile
from gensim.scripts.glove2word2vec import glove2word2vec

sentences = [
    "Alice is in the Wonderland.",
    "Alice is not in the Wonderland.",
]

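# Path to the GloVe file you downloaded
glove_file = 'glove.6B.300d.txt'
# Convert to word2vec text format (adds the "<vocab_size> <dimensions>" header line)
word2vec_glove_file = get_tmpfile("glove.6B.300d.word2vec.txt")
glove2word2vec(glove_file, word2vec_glove_file)

# Split the sentences into words, keeping 'not' out of the stop-word list
splitter = Splitter()
splitter.sent2words(sentences=sentences, remove_stop_words=['not'], add_stop_words=[])
# Vectorize using the converted file instead of the raw GloVe file
vectorizer = Vectorizer()
vectorizer.word2vec(splitter.words, pretrained_vectors_path=word2vec_glove_file)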
vectors = vectorizer.vectors
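
In case it is useful to see what the conversion step roughly amounts to, here is a plain-Python sketch. It is not gensim's implementation, and glove_to_word2vec is a hypothetical name; it assumes a space-separated GloVe text file with one word per line and no blank lines, and simply writes a copy with a "<vocab_size> <dimensions>" line in front.

```python
def glove_to_word2vec(glove_path, out_path):
    """Write a copy of a GloVe text file prefixed with a word2vec header.

    Hypothetical helper for illustration; not gensim's glove2word2vec.
    Returns (vocab_size, vector_size).
    """
    vocab_size = 0
    vector_size = 0
    # First pass: count the lines and read the dimension from the first one.
    with open(glove_path, encoding="utf-8") as src:
        for line in src:
            if vocab_size == 0:
                # Each line is: word followed by vector_size floats.
                vector_size = len(line.rstrip().split(" ")) - 1
            vocab_size += 1
    # Second pass: write the header, then copy the vectors unchanged.
    with open(glove_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        dst.write(f"{vocab_size} {vector_size}\n")
        for line in src:
            dst.write(line)
    return vocab_size, vector_size
```

The resulting out_path could then be passed as pretrained_vectors_path in the same way as word2vec_glove_file above. Two passes are used so that the whole file never has to sit in memory.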

I hope it helps.

almarengo avatar Jan 26 '22 06:01 almarengo