Is it true that a lot of repeated topics appear?

Open • sharon-gao opened this issue 5 years ago • 7 comments

Hi,

Thanks for your interesting paper and this repository!

I tried training ETM on both 20ng and my own dataset with num_topics = 50.

Among the 50 topics I found some repeated topics, such as ['writes', 'article', 'good', 'people', 'make', 'read', 'thing', 'time', 'lot'] (repeated 4 times) and ['time', 'good', 'problem', 'work', 'back', 'problems', 'ago', 'thing', 'couple'] (repeated 2 times).

Has anyone else observed the same phenomenon?
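
To put a number on the repetition, here is a minimal sketch that counts exact duplicates and computes a topic-diversity score (the fraction of unique words among the top words of all topics). The helper names `count_repeated_topics` and `topic_diversity` are hypothetical and not part of the ETM repository, and the input contains only the two topic lists quoted above with the reported repeat counts, not the full set of 50 topics.

```python
from collections import Counter


def count_repeated_topics(topics):
    """Return {top-word tuple: count} for every topic that occurs more than once."""
    counts = Counter(tuple(t) for t in topics)
    return {words: n for words, n in counts.items() if n > 1}


def topic_diversity(topics, top_k=25):
    """Fraction of unique words among the top_k words of all topics.

    1.0 means no word is shared between topics; values near 0 mean heavy repetition.
    """
    words = [w for t in topics for w in t[:top_k]]
    return len(set(words)) / len(words) if words else 0.0


# Only the two repeated topics reported in this issue (assumption: the
# remaining 44 topics are not shown, so they are left out here).
topic_a = ['writes', 'article', 'good', 'people', 'make', 'read', 'thing', 'time', 'lot']
topic_b = ['time', 'good', 'problem', 'work', 'back', 'problems', 'ago', 'thing', 'couple']
topics = [topic_a] * 4 + [topic_b] * 2

repeated = count_repeated_topics(topics)
diversity = topic_diversity(topics)

for words, n in repeated.items():
    print(n, list(words))
print('topic diversity:', round(diversity, 3))
```

Running the same two functions on the full list of 50 topics would show how much of the model's output is affected.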

sharon-gao • Nov 27 '20 18:11

Hi @ShuangNYU,

Nice that you managed to extract the main topics of your own dataset.

Could you please share your code with us?

A lot of others and I haven't managed to get the output topic vector (see #19, #4, #5).

RoelTim • Nov 29 '20 23:11

Hi @RoelTim ,

Glad to hear from you. I created my own formatted data using the code in 'scripts/data_nyt.py'. You can change data_file to the path of your own dataset:

    # Read data
    print('reading text file...')
    data_file = 'raw/new_york_times_text/nyt_docs.txt'
    with open(data_file, 'r') as f:
        docs = f.readlines()

Then just run this file. If there is any error, please tell me and perhaps I can help.

Also, after you finish this and run the topic model, could you share whether your results contain a lot of repeated topics as well?

sharon-gao • Dec 03 '20 14:12

Hi, @ShuangNYU

I have recently been trying ETM on my own dataset (a Twitter dataset, with each row as a document) and ran into the same problem as you.

Sample topics I get:

Topic 7: ['government', 'stop', 'back', 'cari', 'great', 'coronavirusoutbreak', 'shit', 'hope', 'read']
Topic 8: ['back', 'stop', 'government', 'coronavirusoutbreak', 'cari', 'shit', 'ya', 'good', 'hai']
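
These two topics are not identical, but they share most of their words. A minimal sketch for flagging such near-duplicates uses the Jaccard overlap of the top-word sets. The function names `jaccard` and `near_duplicate_topics` and the 0.5 threshold are assumptions chosen for illustration, not something taken from the ETM code.

```python
from itertools import combinations


def jaccard(words_a, words_b):
    """Jaccard similarity of two word lists: |intersection| / |union|."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def near_duplicate_topics(topics, threshold=0.5):
    """Return (i, j, score) for every pair of topics whose overlap is >= threshold.

    The threshold is an arbitrary illustrative choice; tune it for your data.
    """
    pairs = []
    for (i, t_i), (j, t_j) in combinations(enumerate(topics), 2):
        score = jaccard(t_i, t_j)
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


# The two sample topics quoted above.
topic_7 = ['government', 'stop', 'back', 'cari', 'great',
           'coronavirusoutbreak', 'shit', 'hope', 'read']
topic_8 = ['back', 'stop', 'government', 'coronavirusoutbreak', 'cari',
           'shit', 'ya', 'good', 'hai']

overlap = jaccard(topic_7, topic_8)
flagged = near_duplicate_topics([topic_7, topic_8], threshold=0.5)
print('Jaccard overlap:', overlap)
print('flagged pairs:', flagged)
```

The indices in the flagged pairs are positions in the input list, so passing the full list of topics keeps them aligned with the topic numbers.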

Is there any suggested solution? I tried changing the number of topics, but the result is still the same.

EJ0917 • May 06 '21 13:05

Hi, @ShuangNYU

I managed to use data_nyt to create my own formatted data, but failed to run it. I guess I have some bugs. I would appreciate it if you could share your code. It seems they changed main.py recently.

lw081701019 • Sep 07 '21 19:09

Hi, @EJ0917

I managed to use data_nyt to create my own formatted data, but failed to run it. I guess I have some bugs. I would appreciate it if you could share your code. It seems they changed main.py recently.

lw081701019 • Sep 07 '21 19:09

Same question! All of my topics came out identical. Is there any suggested solution? @ShuangNYU

liuh236 • Jun 01 '22 15:06

@ShuangNYU If you have access to the NYT Annotated Corpus, could you give me access to this dataset? I also need it, but it is not freely available and I don't have much of a budget to get access to it. Thanks.

asma-ui • Feb 10 '23 16:02