@pprablanc would you be interested in testing out this package providing document vectors?
Sure, I can test this package. Are there other implementations of PV-DM/PV-DBOW you'd like to compare with yours?
I don't think there are any others in R. Maybe just gensim. But mainly, comparing to your other examples where you added an SVM/NB on top of a set of embeddings (I saw you did averaged embeddings, SIF, and weighting by tf-idf or BM25) to classify something seems a good test. I still sometimes get crashes due to C stack overflow, but I'm working on finding the cause. Feel free to put comments here.
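A minimal sketch of the kind of test described above (document embeddings fed to an SVM); note that `df` (a data frame with `doc_id`, `text` and `label` columns) is a placeholder, and the hyperparameters are just illustrative:

```r
## Sketch: classify documents using doc2vec embeddings as features.
## Assumes a data frame `df` with columns doc_id, text, label (placeholder).
library(doc2vec)
library(e1071)

model <- paragraph2vec(x = df[, c("doc_id", "text")], type = "PV-DBOW",
                       dim = 100, iter = 20, min_count = 5, threads = 2)
emb <- as.matrix(model, which = "docs")

## Align embeddings with labels, then do a simple train/test split
labels <- df$label[match(rownames(emb), df$doc_id)]
idx    <- sample(seq_len(nrow(emb)), size = 0.8 * nrow(emb))
fit    <- svm(x = emb[idx, ], y = as.factor(labels[idx]))
pred   <- predict(fit, emb[-idx, ])
mean(pred == labels[-idx])  ## test-set accuracy
```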
I've pushed the package to CRAN today. Maybe you are interested as well in building this https://github.com/ddangelov/Top2Vec by
- Tokenising text using sentencepiece or tokenizers.bpe
- Embedding this tokenised text using doc2vec
- Clustering the resulting embeddings with uwot and dbscan
- Weighting topics a bit with traditional tf-idf
Great idea. I gave it a very quick and rough go last night and all the pieces seem to be more or less in place:
library(doc2vec)
library(uwot)
library(dbscan)
library(dplyr)
library(tibble)

## Train PV-DBOW document embeddings
model <- paragraph2vec(x = x, type = "PV-DBOW", dim = 300, iter = 40, hs = TRUE,
                       window = 15, negative = 0, sample = 0.00001,
                       min_count = 50, lr = 0.05, threads = 4)
embeddings_docs  <- as.matrix(model, which = "docs")
embeddings_words <- as.matrix(model, which = "words")

## Reduce dimensionality and cluster the document vectors
docs_umap <- uwot::umap(embeddings_docs, n_neighbors = 15, n_components = 5, metric = "cosine")
cl <- dbscan::hdbscan(docs_umap, minPts = 15)

## Create topic vectors as centroids of the document vectors in each cluster
centroids <- cbind(embeddings_docs, cl$cluster) %>%
  as_tibble(rownames = "id") %>%
  rename(cluster = `V301`) %>%
  mutate(cluster = as.character(cluster)) %>%
  group_by(cluster) %>%
  summarise_if(is.numeric, mean)

## Label topic k with its most similar words, based on cosine similarity
## of the topic vector and the word vectors
k <- 30
topic <- centroids[k, ] %>%
  select(-cluster) %>%
  as.numeric() %>%
  matrix(ncol = 300, nrow = 1)
rownames(topic) <- deframe(centroids[k, 1])
paragraph2vec_similarity(y = embeddings_words, x = topic, top_n = 10)
Excuse the messy code; I wrote it up in a rush with dplyr. I just wanted to support the notion that top2vec in R is well within reach. The code is also reasonably fast, on par with the Python implementation.
Great. Thanks for testing out.
- Does it provide sensible topics for your case?
- I thought there was some extra postprocessing step involving some extra tfidf weighting but maybe I was wrong? I'll test this out on some real data soon as well.
- Yes, the topics are decent in my use case. There are some garbage topics as well but 1) I get those also with the original Python module and 2) some of my data is of poor quality and needs to be pre-processed more. So I think this is worth pursuing. Would be interested to know whether it works for you.
- As far as I can see, the pre-processing in Top2Vec is quite light. It uses gensim.utils.simple_preprocess, so lowercasing, min_char = 2, max_char = 15, de-accenting and tokenization. I cannot find any use of tf-idf, but it might be worth trying as a potential improvement.
- I thought the tf-idf was at the end only when extracting the most relevant words for each topic. Whereby more weight on the similarity word / topic is used for words with high tfidf. But I'm not sure on that.
- Also wondering how much the outliers from step 4 would influence the meaningfulness of the topics.
- Ideally the user would be able to choose whether to use tf-idf weighting or not.
- In my limited experience with Top2Vec, the topic words are usually very good and informative. But if outliers were to be a serious problem in some datasets, we could have another option to use medoids instead of centroids in the calculation of the topic vectors.
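The medoid option mentioned above could be sketched as follows (a base-R sketch, not part of any package; `embeddings_docs` and the cluster assignment are as in the earlier code):

```r
## Sketch: topic vectors as medoids instead of centroids, to reduce the
## influence of outliers. `embeddings_docs` is the document-embedding matrix,
## `clusters` the per-document cluster assignment (e.g. cl$cluster).
topic_medoids <- function(embeddings_docs, clusters) {
  t(sapply(split(seq_len(nrow(embeddings_docs)), clusters), function(idx) {
    sub <- embeddings_docs[idx, , drop = FALSE]
    ## medoid = the document minimising total distance to the others
    d <- as.matrix(dist(sub))
    sub[which.min(rowSums(d)), ]
  }))
}
```

Unlike the centroid, the medoid is always an actual document vector, so a single far-away outlier cannot drag the topic vector off the data manifold.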
@michalovadek are you planning to create an R package implementing top2vec?
yes, I will start pushing some very initial code to this repo (hopefully soon): https://github.com/michalovadek/top2vecr
I am still thinking about the API, namely how to give the user control over the various parameters (both of your functions as well as umap and hdbscan) without completely overwhelming them
I will also probably start with some inefficient code and only optimize it later, depending on time available
Ah great. I'll follow your repository.
I pushed a very early implementation to the repo mentioned. Haven't had as much time to work on this but will see in the future. Test it out if you can.
I think all the various components of the main top2vecr function (doc2vec, umap, hdbscan, centroids/medoids, similarity) should be compartmentalized as separate functions, but I am not yet sure whether it makes sense to expose them to the user as well. The main function should in the future also return more data about the process how the topics were obtained. Any suggestions are welcome.
It should be possible to further apply hierarchical clustering to the default hdbscan output, so that the user can basically fix a K number of topics that are to be returned instead of the "optimal" K returned by hdbscan (with default presets). This would then resemble other topic modelling techniques like LDA where K needs to be chosen upfront.
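That fixed-K idea could be sketched as below (base R only; `centroids_mat` is assumed to be a topics x dimensions matrix whose rownames match the hdbscan cluster labels, which is an assumption on my part):

```r
## Sketch: reduce the hdbscan topics to a user-chosen K by hierarchically
## clustering the topic vectors and merging topics accordingly.
## Assumes rownames(centroids_mat) match the cluster labels in `clusters`.
merge_topics <- function(centroids_mat, clusters, K) {
  hc <- hclust(dist(centroids_mat), method = "ward.D2")
  mapping <- cutree(hc, k = K)        ## original topic -> merged topic
  mapping[as.character(clusters)]     ## re-label the documents
}
```

This would let the user emulate the LDA-style "choose K upfront" workflow while keeping hdbscan's density-based initial segmentation.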
Thanks for sharing. Will address remarks at the repository instead of here.
You can find a little test comparing your PV-DM and PV-DBOW with average embeddings on a classification task here --> https://github.com/pprablanc/test_doc2vec I didn't have any crashes. PV-DBOW works well, but there's something odd with PV-DM; the results are pretty low. I don't think there should be a great difference between PV-DM and PV-DBOW, should there? What do you think?
Interesting dataset. So a graph dataset where the nodes have text alongside them, to eventually classify. That made me wonder if I could use this as well alongside the R package https://github.com/jwijffels/deepwalker to get graph embeddings too. But that's another story.
- Regarding possible crashes: I fixed some memory leaks in the 0.1.1 release, which was put on CRAN on 2021-01-21, so those should normally not happen any more.
- My first remark when looking at your code is that I had initially thought of comparing to a naive Bayes model with the words as features, not the embeddings.
- I also noted, as you did, that PV-DM seems to underperform on my own datasets, although the paper by Mikolov indicated it was the best. This made me wonder if I did something incorrect in wrapping the code (like switching PV-DM with PV-DBOW). I discussed this at https://groups.google.com/forum/embed/#!topic/gensim/qcOPPpVvcDs which made me believe the issue was not caused by my implementation. Note that the original implementation by Mikolov used the following, corresponding to PV-DBOW:
time ./word2vec -train ../alldata-id.txt -output vectors.txt -cbow 0 -size 100 -window 10 -negative 5 -hs 0 -sample 1e-4 -threads 40 -binary 0 -iter 20 -min-count 1 -sentence-vectors 1
- I also noted on my own datasets that, with some hyperparameter tweaking, I could lift the PV-DM approach in a classification setting close to PV-DBOW, but not completely. Maybe my data was too small (about 500000 texts).
- @michalovadek used some other settings here: https://github.com/bnosac/doc2vec/issues/14
- I also did some testing, available at https://www.bnosac.be/index.php/blog/103-doc2vec-in-r, on a smaller dataset, and there PV-DM also did not work as well as PV-DBOW, but again, small data. Not sure what the reason is; I still need to compare with what comes out of Gensim to validate.
- But all of this is the reason why the default is PV-DBOW in the R package.
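For reference, a sketch of how one might train both architectures side by side with matching hyperparameters, mirroring the word2vec call quoted above (hierarchical softmax off, 5 negative samples, window 10, sample 1e-4, min_count 1); `x` stands for a doc_id/text data frame as elsewhere in this thread, and the choice of dim = 100 follows the `-size 100` flag:

```r
## Sketch: train PV-DBOW and PV-DM with identical settings so their
## embeddings can be compared on the same downstream classifier.
library(doc2vec)
m_dbow <- paragraph2vec(x = x, type = "PV-DBOW", dim = 100, window = 10,
                        iter = 20, negative = 5, hs = FALSE, sample = 1e-4,
                        min_count = 1, threads = 4)
m_dm   <- paragraph2vec(x = x, type = "PV-DM",   dim = 100, window = 10,
                        iter = 20, negative = 5, hs = FALSE, sample = 1e-4,
                        min_count = 1, threads = 4)
```

Keeping everything except `type` fixed should make it easier to attribute any performance gap to the architecture rather than to the hyperparameters.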
@michalovadek I've been testing out your implementation of top2vecr and it gives really nice results (as in semantically coherent topics). The only issue I've encountered is that when calling hdbscan, it uses dist on the result of umap, and that fails for larger data.
> cl <- dbscan::hdbscan(head(docs_umap, 50000), minPts = 15L)
Error in dist(x, method = "euclidean") :
negative length vectors are not allowed
> cl <- dbscan::hdbscan(head(docs_umap, 10000), minPts = 15L)
> str(cl$cluster)
num [1:10000] 96 50 0 0 0 56 74 0 0 0 ...
Thanks for testing; this is a pretty important limitation, as I imagine in many situations the embeddings can really benefit from large corpora. Let's see whether the hdbscan maintainers shed some light on this. I will consider an alternative clustering method in the meantime, but I doubt we would achieve the same quality with another method.
Yes, that's exactly what I thought as well. There is also currently no predict.hdbscan https://github.com/mhahsler/dbscan/issues/32 if we want to be able to assign new documents to topics. Hopefully the dbscan authors can help. Note there is a predict.hdbscan in this pull request: https://github.com/mhahsler/dbscan/pull/33/files
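Until predict.hdbscan lands, a possible stand-in would be to embed new documents with the trained paragraph2vec model and assign each to the nearest topic vector; a hedged sketch, where `model` and a topics x dimensions matrix `centroids_mat` come from the earlier pipeline, and the exact shape expected for `newdata` (here, a named list of token vectors) is an assumption about the predict method:

```r
## Sketch: assign new documents to existing topics via nearest topic vector,
## as a stand-in for the missing predict.hdbscan. `model` is the trained
## paragraph2vec model, `centroids_mat` a topics x dim matrix of topic vectors.
library(doc2vec)
newdocs <- setNames(strsplit(tolower(new_texts), " "), new_doc_ids)  ## tokenised new docs (placeholder)
emb_new <- predict(model, newdata = newdocs, type = "embedding", which = "docs")
## Most similar topic per new document
paragraph2vec_similarity(x = emb_new, y = centroids_mat, top_n = 1)
```

This sidesteps umap/hdbscan entirely at inference time, at the cost of ignoring hdbscan's notion of noise points.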