fasttextB_embeddings_300d.npy file
Hi, I am running the basic learner ./run_main.sh 0 DeepXML EURLex-4k 0 108 and everything is going fine, except I don't have the fastText embeddings file.
The error output is "Embedding File not found. Check path or set 'init' to null." Where/how was the .npy embeddings file created? Is it from the pretrained word vectors on fastText's website?
Would appreciate any info to illuminate this issue! Thanks.
Hi,
Thanks for trying out DeepXML. In general, the embedding files are created using the pre-trained model available on fastText's website.
You can use the following link to download the embedding file for EURLex-4K: https://owncloud.iitd.ac.in/nextcloud/index.php/s/5XsZAKLbHfbpfZA
Please let me know if you need anything else.
Thanks for your insight! Does that mean that there are different embeddings for every dataset?
The way I tried to generate the embedding files was to read in, for example, wiki.en.vec as a zero-dimensional NumPy array and then save that to a .npy file. It didn't give the same results as the embedding file that you shared; what did you do differently?
The embedding file in our case contains a V x D matrix, where V is the vocabulary size and D is the embedding dimensionality. In other words, there is a vector for each token in the dataset, so the embedding file will be different for each dataset as the vocabulary will be different.
We use the fastText model to compute an embedding for each token in the vocabulary, which is then passed to our model.
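For concreteness, a minimal sketch of that process (not our exact script; the model, vocabulary, and output file names below are placeholders):

```python
# Sketch: build a V x D embedding matrix for a dataset vocabulary
# from a pre-trained fastText model. File names are placeholders.
import numpy as np
import fasttext

model = fasttext.load_model("wiki.en.bin")  # pre-trained binary from fasttext.cc

# Dataset vocabulary, one token per line.
with open("vocab.txt", encoding="utf-8") as f:
    vocab = [line.strip() for line in f]

# get_word_vector also handles tokens unseen during pre-training,
# since fastText composes vectors from character n-grams.
embeddings = np.stack([model.get_word_vector(w) for w in vocab])
print(embeddings.shape)  # (V, D)
np.save("fasttextB_embeddings_300d.npy", embeddings)
```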
Thanks for the clarification. So what I ended up doing was something like:
model = fasttext.train_unsupervised(corpus_file, dim=dim)
and then, using vocab = model.words, creating a NumPy array of V (len(vocab)) x D where each row is
wordvec = model.get_word_vector(word)
for every word in the vocabulary.
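Put together, it looks roughly like this (corpus_file and dim stand in for my actual inputs):

```python
import numpy as np
import fasttext

corpus_file = "corpus.txt"  # placeholder: my raw text corpus
dim = 300                   # placeholder: embedding dimensionality

# Train an unsupervised fastText model on the corpus.
model = fasttext.train_unsupervised(corpus_file, dim=dim)

# Stack one vector per vocabulary token into a V x D matrix.
embeddings = np.stack([model.get_word_vector(w) for w in model.words])
np.save("fasttextB_embeddings_300d.npy", embeddings)
```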
Am I understanding your process correctly?
Hi,
I have added an example here which computes embeddings from a pre-trained fastText model. You are free to train your own model, provided your corpus is: (i) large enough, and (ii) general English or relevant to the task.
Hey, by any chance is the .npy file still available with any of you? @kunaldahiya @cairomo The file is missing from the link above.
Hi,
You can follow this example to get embeddings for a given vocabulary: https://github.com/kunaldahiya/pyxclib/blob/master/xclib/examples/get_ftx_embeddings.py
I tried to do this, but EURLex only had BoW feature files and not the raw text corpus.