sent2vec
glove path error
Hi,
I tried to use the word2vec code with the GloVe embeddings glove.6B.300d.txt, but I got this error:
ValueError: invalid literal for int() with base 10: 'the'
Could someone help, please?
Thank you.
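The error occurs because glove.6B.300d.txt is in GloVe's text format, which has no header line. The word2vec text format starts with a "vocab_size dimensions" line, so the loader tries to parse the first token ('the') as an integer. With gensim you can convert the GloVe pretrained vectors to word2vec format and then load them. They come in several sizes:
- glove.6B: 50d, 100d, 200d, and 300d vectors
- glove.42B.300d: 300d vectors
- glove.840B.300d: 300d vectors
They are all available from the official GloVe page.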
Download the file from the website above, then set glove_file to the path of the file that you have downloaded.
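If you want to confirm which format a file is in before loading it, you can inspect its first line. The following is only a minimal sketch using the standard library: has_word2vec_header and the two tiny demo files are hypothetical names made up for illustration, and the check assumes a word2vec text file begins with exactly two integers (vocabulary size and vector dimension).

```python
import os
import tempfile


def has_word2vec_header(path):
    """Return True if the first line looks like '<vocab_size> <dimensions>'."""
    with open(path, encoding="utf-8") as f:
        first = f.readline().split()
    if len(first) != 2:
        return False
    try:
        int(first[0])
        int(first[1])
    except ValueError:
        # Same failure the loader hits when the first token is a word like 'the'
        return False
    return True


# Tiny stand-in files (hypothetical data, not real GloVe vectors)
demo_dir = tempfile.mkdtemp()
glove_like = os.path.join(demo_dir, "tiny.glove.txt")
word2vec_like = os.path.join(demo_dir, "tiny.word2vec.txt")

with open(glove_like, "w", encoding="utf-8") as f:
    f.write("the 0.1 0.2 0.3\nalice 0.4 0.5 0.6\n")
with open(word2vec_like, "w", encoding="utf-8") as f:
    f.write("2 3\nthe 0.1 0.2 0.3\nalice 0.4 0.5 0.6\n")

print(has_word2vec_header(glove_like))     # False
print(has_word2vec_header(word2vec_like))  # True
```

A file that fails this check needs to be converted before it is passed as a word2vec path.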
This is how you would implement the conversion and loading within sent2vec:
from sent2vec.vectorizer import Vectorizer
from sent2vec.splitter import Splitter
from gensim.test.utils import get_tmpfile
from gensim.scripts.glove2word2vec import glove2word2vec

sentences = [
    "Alice is in the Wonderland.",
    "Alice is not in the Wonderland.",
]

# Path to the downloaded GloVe file
glove_file = 'glove.6B.300d.txt'
# Convert the GloVe file to word2vec format and write it to a temporary file
word2vec_glove_file = get_tmpfile("glove.6B.300d.word2vec.txt")
glove2word2vec(glove_file, word2vec_glove_file)

# Split the sentences into words; 'not' is taken out of the stop-word list so it is kept
splitter = Splitter()
splitter.sent2words(sentences=sentences, remove_stop_words=['not'], add_stop_words=[])

# Build the sentence vectors from the converted GloVe vectors
vectorizer = Vectorizer()
vectorizer.word2vec(splitter.words, pretrained_vectors_path=word2vec_glove_file)
vectors = vectorizer.vectors
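For reference, the conversion step essentially amounts to prepending a header line with the vocabulary size and the vector dimension. The sketch below illustrates that idea with the standard library only. It is not gensim's actual implementation; add_word2vec_header and the tiny demo file are hypothetical names used for illustration, and it assumes one space-separated vector per line with no blank lines.

```python
import os
import tempfile


def add_word2vec_header(glove_path, output_path):
    """Copy a GloVe text file to output_path, prefixed by '<vocab_size> <dimensions>'."""
    vocab_size = 0
    dimensions = 0
    # First pass: count the rows and read the dimension from the first row
    with open(glove_path, encoding="utf-8") as src:
        for line in src:
            if vocab_size == 0:
                dimensions = len(line.rstrip().split(" ")) - 1
            vocab_size += 1
    # Second pass: write the header, then copy the vectors unchanged
    with open(output_path, "w", encoding="utf-8") as dst:
        dst.write(f"{vocab_size} {dimensions}\n")
        with open(glove_path, encoding="utf-8") as src:
            for line in src:
                dst.write(line)
    return vocab_size, dimensions


# Tiny stand-in file (hypothetical data, not real GloVe vectors)
demo_dir = tempfile.mkdtemp()
tiny_glove = os.path.join(demo_dir, "tiny.glove.txt")
tiny_converted = os.path.join(demo_dir, "tiny.word2vec.txt")
with open(tiny_glove, "w", encoding="utf-8") as f:
    f.write("the 0.1 0.2 0.3\nalice 0.4 0.5 0.6\n")

result = add_word2vec_header(tiny_glove, tiny_converted)
print(result)  # (2, 3)
```

In practice glove2word2vec is the simpler choice, since it handles this for you; the sketch is only meant to show why the converted file no longer triggers the ValueError.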
I hope it helps.