
The top words are very similar after 5-6 epochs

Open yg37 opened this issue 9 years ago • 12 comments

[Screenshot: topic-word distribution after 1 epoch (2016-06-16)]

I was rerunning the 20_newsgroups script and this is the topic-word distribution after 1 epoch. From the picture, you can see that the top words for each topic are very similar. Is this normal, or did I implement something wrong? I ran into the same issue on other corpora: after 10 epochs, the top words across topics were still almost identical, mostly stop words like "the", "a", etc.

yg37 avatar Jun 16 '16 19:06 yg37

I have the same problem. I ran the 20_newsgroups script, both on the original corpus and on one of my own, and after just one epoch the topics' top words are identical. I tried changing every hyperparameter, but the results were the same.

```
Top words in topic 0  invoke out_of_vocabulary out_of_vocabulary <SKIP> the . to , a i
Top words in topic 1  invoke out_of_vocabulary out_of_vocabulary <SKIP> the . to , a i
Top words in topic 2  invoke out_of_vocabulary out_of_vocabulary <SKIP> the . to , a i
Top words in topic 3  invoke out_of_vocabulary out_of_vocabulary <SKIP> the . to , a i
...
Top words in topic 19 invoke out_of_vocabulary out_of_vocabulary <SKIP> the . , to a i
```

cprevosteau avatar Jun 17 '16 10:06 cprevosteau

I had it running on the server last night and the top words diverged after around 20 epochs. I'm not sure why the initial topic-word distribution behaves that way; maybe it has something to do with the prior?

yg37 avatar Jun 17 '16 17:06 yg37

I consistently get out_of_vocabulary as the top word across all topics; any suggestions on what I should look for? This happens even when I set the min and max vocab count thresholds to None.

agtsai-i avatar Jul 12 '16 21:07 agtsai-i

Hi all,

In my experience, you can set a more aggressive down-sampling threshold to remove out_of_vocabulary and similarly uninformative tokens from at least some of the topics, if not all. I lowered the down-sampling threshold on my dataset and the stop words largely disappeared from the top topic word lists. An alternative I haven't tried is to clean the data before you feed the tokens to the model; that way you remove the out_of_vocabulary token, as well as other meaningless tokens, from the modelling entirely. Data cleaning could well improve the results (it does for plain LDA, at least), although I don't know the maths behind lda2vec well enough to make a strong case for that.
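A minimal sketch of the kind of pre-cleaning I mean, in plain Python; the stop list and the out_of_vocabulary/<SKIP> tokens are taken from the output pasted above, and the count threshold is just an example, not anything from lda2vec's own API:

```python
# Hypothetical pre-cleaning pass run before building the lda2vec corpus.
# The stop list and the "out_of_vocabulary"/"<SKIP>" entries mirror the
# output above; none of this is lda2vec's own API.
import collections

STOP = {"the", "a", "i", "to", ",", ".", "out_of_vocabulary", "<SKIP>"}

def clean_tokens(docs, min_count=20):
    """Drop stop/meta tokens and rare words before feeding the model."""
    counts = collections.Counter(tok for doc in docs for tok in doc)
    return [[tok for tok in doc
             if tok.lower() not in STOP and counts[tok] >= min_count]
            for doc in docs]

# docs = [["the", "rocket", "launch", "was", "delayed"], ...]
# docs = clean_tokens(docs)
```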

I personally gave up on lda2vec in the end because every time you use it, the model takes a lot of time to fine-tune before the topic results look reasonable. Standard word2vec or text2vec with some form of unsupervised semantic clustering is probably a less time-consuming alternative: it works regardless of the dataset or the machine you run it on, and the optimisation itself tends to be faster. Moreover, lda2vec was a real pain to install on Windows a couple of months ago. lda2vec may be useful, but you should have very specific reasons for using it.
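For reference, a rough sketch of the word2vec-plus-clustering route I have in mind, using gensim and scikit-learn; parameter names follow gensim 4.x (older versions use size instead of vector_size), and the toy corpus and cluster count are placeholders:

```python
# Rough sketch: train word2vec, then cluster the word vectors into "topics".
# gensim 4.x parameter names; the toy corpus and n_clusters are placeholders.
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

docs = [["the", "rocket", "launch", "was", "delayed"],
        ["the", "team", "won", "the", "game", "last", "night"]]

w2v = Word2Vec(docs, vector_size=100, window=5, min_count=1, epochs=10)
vectors = w2v.wv.vectors                      # shape: (vocab_size, 100)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)

for k in range(km.n_clusters):                # words grouped per cluster
    words = [w2v.wv.index_to_key[i]
             for i, label in enumerate(km.labels_) if label == k]
    print(k, words[:10])
```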

radekrepo avatar Jul 13 '16 06:07 radekrepo

Thanks @nanader! I'll play with the down-sampling threshold. I believe I had already removed the out_of_vocabulary tokens entirely by setting the vocab count thresholds to None (at least, that's what my reading of the code so far tells me should happen), so I was surprised to still see them pop up.

So far I've tried doc2vec and word2vec + earth mover's distance, but haven't had stellar results with either. I like the approach used here for documents (in principle) more than the other two, and of course the given examples look amazing. I'd really like lda2vec to work out with the data I have.
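For anyone finding this later, the word2vec + earth mover's distance combination I mean is roughly what gensim exposes as Word Mover's Distance; a minimal sketch, where the toy documents and training settings are placeholders and the WMD dependency (pyemd or POT, depending on the gensim version) needs to be installed:

```python
# Minimal sketch of word2vec + earth mover's distance via gensim's
# Word Mover's Distance. Toy documents and settings are placeholders.
from gensim.models import Word2Vec

docs = [["the", "president", "greets", "the", "press", "in", "chicago"],
        ["obama", "speaks", "to", "the", "media", "in", "illinois"]]

w2v = Word2Vec(docs, vector_size=50, min_count=1, epochs=20)
distance = w2v.wv.wmdistance(docs[0], docs[1])   # smaller = more similar
print(distance)
```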

I installed lda2vec on an AWS GPU instance, and that wasn't too horrible.

agtsai-i avatar Jul 13 '16 17:07 agtsai-i

I recently tried topic2vec as an alternative:
http://arxiv.org/abs/1506.08422
https://github.com/scavallari/Topic2Vec/blob/master/Topic2Vec_20newsgroups.ipynb
I tried it on Simple Wikipedia data and it performed very well.

yg37 avatar Jul 13 '16 17:07 yg37

Oh interesting, thank you!

agtsai-i avatar Jul 13 '16 17:07 agtsai-i

Ah, by the way, agtsai-i: you can also use the vector space to label topics with the nearest token vectors by cosine distance, instead of relying on the most common topic-word assignments; the lda2vec model results allow for it. That way you could skip tuning the topic model results entirely and get as many or as few topics as you want. It depends on what you want to achieve, really. I hope that helps.
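Something like this, assuming you have already pulled the trained topic and word embedding matrices out of the model as numpy arrays; the variable names below are placeholders, not lda2vec attribute names:

```python
# Label each topic with its nearest word vectors by cosine similarity.
# `topic_vectors`, `word_vectors`, and `vocab` are placeholders for whatever
# you extract from the trained model, not lda2vec attribute names.
import numpy as np

def label_topics(topic_vectors, word_vectors, vocab, top_n=10):
    """Return the top_n cosine-nearest words for every topic vector."""
    t = topic_vectors / np.linalg.norm(topic_vectors, axis=1, keepdims=True)
    w = word_vectors / np.linalg.norm(word_vectors, axis=1, keepdims=True)
    sims = t @ w.T                               # (n_topics, vocab_size)
    top = np.argsort(-sims, axis=1)[:, :top_n]
    return [[vocab[i] for i in row] for row in top]

# for k, words in enumerate(label_topics(topic_vectors, word_vectors, vocab)):
#     print("topic", k, words)
```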

radekrepo avatar Jul 14 '16 10:07 radekrepo

True, but if I did that I would be discarding a lot of the novelty of lda2vec and essentially just be using word2vec, right?

*Never mind, I see what you're saying. Much appreciated

agtsai-i avatar Jul 14 '16 17:07 agtsai-i

Hi @agtsai-i & @yg37 did you resolve this issue in the end? Could you kindly share the solution if any? Thanks a lot.

gracegcy avatar Mar 22 '17 17:03 gracegcy

@gracegcy Keep running the epochs and the top words will diverge

yg37 avatar Mar 23 '17 16:03 yg37

@radekrepo @agtsai-i @yg37 Have you noticed that the result of the down-sampling step is never actually used? No wonder I kept getting lots of stop words (out_of_vocabulary, punctuation, etc.) no matter how much I lowered the threshold. I put my efforts at solving this here: https://github.com/cemoody/lda2vec/issues/92
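To make the pattern concrete, here is a generic illustration; the subsample_frequent function below is a stand-in I wrote for this comment (word2vec-style down-sampling), not lda2vec's actual code, and the real code paths are discussed in issue #92:

```python
# Generic illustration of the bug pattern: the down-sampled token array is
# computed but the original array keeps flowing into training.
# `subsample_frequent` here is a stand-in, not lda2vec's implementation.
import numpy as np

def subsample_frequent(tokens, threshold=1e-5):
    """Randomly drop very frequent tokens (word2vec-style down-sampling)."""
    counts = np.bincount(tokens)
    freq = counts[tokens] / counts.sum()
    keep_prob = np.minimum(1.0, np.sqrt(threshold / freq))
    return tokens[np.random.rand(len(tokens)) < keep_prob]

tokens = np.random.randint(0, 100, size=10_000)   # toy compact token ids
clean = subsample_frequent(tokens)
print(len(tokens), len(clean))

# Buggy pattern: `tokens` (not `clean`) gets fed to the trainer, so the
# down-sampling has no effect. The fix is simply to train on `clean`.
```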

ghost avatar Feb 17 '19 21:02 ghost