fasttextB_embeddings_300d.npy file
Hi, I am running the basic learner ./run_main.sh 0 DeepXML EURLex-4k 0 108 and everything is going fine, except I don't have the fastText embeddings file.
The error output is "Embedding File not found. Check path or set 'init' to null." Where/how was the .npy embeddings file created? Is it from the pretrained word vectors on fastText's website?
Would appreciate any info to illuminate this issue! Thanks.
Hi,
Thanks for trying out DeepXML. In general, the embedding files are created using the pre-trained model available on fastText's website.
You can use the following link to download the embedding file for EURLex-4K: https://owncloud.iitd.ac.in/nextcloud/index.php/s/5XsZAKLbHfbpfZA
Please let me know if you need anything else.
Thanks for your insight! Does that mean that there are different embeddings for every dataset?
The way I tried to generate the embedding files was to read in, for example, wiki.en.vec as a zero-dimensional NumPy array and then save that to a .npy file. It didn't give the same results as the embedding file that you shared; what did you do differently?
The embedding file in our case contains a V x D matrix, where V is the vocabulary size and D is the embedding dimensionality. In other words, there is a vector for each token in the dataset, so the embedding file will be different for each dataset as the vocabulary will be different.
We use the fastText model to compute an embedding for each token in the vocabulary, which is then passed to our model.
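For concreteness, a minimal sketch of that process (not our exact script; the model, vocabulary, and output file names below are placeholders):

```python
# Sketch: build a V x D embedding matrix for a dataset vocabulary
# from a pre-trained fastText model. File names are placeholders.
import numpy as np
import fasttext

model = fasttext.load_model("wiki.en.bin")  # pre-trained binary from fasttext.cc

# Dataset vocabulary, one token per line.
with open("vocab.txt", encoding="utf-8") as f:
    vocab = [line.strip() for line in f]

# get_word_vector also handles tokens unseen during pre-training,
# since fastText composes vectors from character n-grams.
embeddings = np.stack([model.get_word_vector(w) for w in vocab])
print(embeddings.shape)  # (V, D)
np.save("fasttextB_embeddings_300d.npy", embeddings)
```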
Thanks for the clarification. So what I ended up doing was something like:
model = fasttext.train_unsupervised(corpus_file, dim=dim)
and then, using vocab = model.words, creating a NumPy array of V (len(vocab)) x D where each row is
wordvec = model.get_word_vector(word)
for every word in the vocabulary.
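Put together, it looks roughly like this (corpus_file and dim stand in for my actual inputs):

```python
import numpy as np
import fasttext

corpus_file = "corpus.txt"  # placeholder: my raw text corpus
dim = 300                   # placeholder: embedding dimensionality

# Train an unsupervised fastText model on the corpus.
model = fasttext.train_unsupervised(corpus_file, dim=dim)

# Stack one vector per vocabulary token into a V x D matrix.
embeddings = np.stack([model.get_word_vector(w) for w in model.words])
np.save("fasttextB_embeddings_300d.npy", embeddings)
```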
Am I understanding your process correctly?
Hi,
I have added an example here which computes embeddings from a pre-trained fastText model. You are free to train your own model, provided your corpus is: (i) large enough, and (ii) general English or relevant to the task.
Hey, by any chance is the .npy file still available with any of you? @kunaldahiya @cairomo The file is missing from the link above.
Hi,
You can follow this example to get embeddings for a given vocabulary: https://github.com/kunaldahiya/pyxclib/blob/master/xclib/examples/get_ftx_embeddings.py
I tried to do this, but EURLex only had BoW feature files and not the raw text corpus.