For issue 1, to use the English tokenizer you need to change the default argument `lang='zh'` to `lang='en'` (dataset.py line 23: the `__init__` of the `DocDataset` class). This change was made...
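If you'd rather not edit the default, you can also pass the language at the call site. A minimal sketch, assuming `DocDataset` accepts `lang` as a keyword argument (the rest of the signature here is a guess, so check dataset.py):

```python
# Hedged sketch: select the English tokenizer when constructing the
# dataset instead of editing the default in dataset.py.
# The exact DocDataset signature is an assumption.
from dataset import DocDataset

docset = DocDataset('my_english_corpus', lang='en')  # default is lang='zh'
```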
For issue 2, this happened because, after stopwords are filtered out, some documents are left with no words and become empty, and those empty documents are not counted as 'processed'. That is...
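For illustration, here is a self-contained toy sketch (not the repo's actual code) of why the 'processed' count can be smaller than the raw document count:

```python
# Toy example: documents that become empty after stopword filtering
# are dropped, so they never show up in the 'processed' count.
stopwords = {'the', 'a', 'is'}
raw_docs = [['the', 'a'], ['topic', 'model', 'is', 'fun']]
processed = []
for doc in raw_docs:
    kept = [w for w in doc if w not in stopwords]
    if kept:  # an all-stopword document is skipped here
        processed.append(kept)
print(len(raw_docs), len(processed))  # prints: 2 1
```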
I am not sure what your situation is. By 'topic names', do you mean the 'topic words' that are displayed during training? If you use the provided tokenizer,...
Improving the filtering strategy to make the models more robust seems a valuable idea. I will fix that.
Two options. One is to convert the genetic data into text, e.g. list all the protein names of one transcription factor on one line, separated by spaces. Customize the tokenizer...
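As a rough sketch of the first option (the mapping, entries, and file name below are made up for illustration):

```python
# Hedged sketch: write one space-separated "document" per transcription
# factor, listing its associated protein names on a single line.
tf_to_proteins = {
    'TP53': ['MDM2', 'BAX', 'CDKN1A'],   # made-up example entries
    'MYC':  ['MAX', 'CDK4'],
}
with open('genes_as_text.txt', 'w') as f:
    for tf, proteins in tf_to_proteins.items():
        f.write(' '.join(proteins) + '\n')  # one line = one document
```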
I've tried to run the GSM model on your data, and the preprocessing step works fine (although it hit an OOM error on my laptop due to the overly large vocabulary...
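One way to work around the OOM is to shrink the vocabulary before training. A sketch using gensim's `Dictionary.filter_extremes` (whether the repo's preprocessing uses gensim this way is an assumption on my part; `corpus.txt` is a placeholder):

```python
# Hedged sketch: prune rare and overly common tokens, and cap the
# vocabulary size, to keep memory usage manageable.
from gensim.corpora import Dictionary

docs = [line.split() for line in open('corpus.txt', encoding='utf-8')]
dictionary = Dictionary(docs)
# keep tokens in >=5 docs and <=1.3% of docs, at most 50k tokens total
dictionary.filter_extremes(no_below=5, no_above=0.013, keep_n=50000)
```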
What do you mean by "latent vector (after processing)"? Do you mean the "topic distribution of a document" or "a document's latent representation"? Yes, you can calculate cosine similarities between...
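In case it helps, a minimal sketch of the cosine-similarity computation (the `theta` vectors are placeholders for whichever representation you mean):

```python
# Cosine similarity between two documents' topic distributions.
import numpy as np

def cosine_sim(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

theta1 = [0.7, 0.2, 0.1]  # placeholder topic distribution of doc 1
theta2 = [0.6, 0.3, 0.1]  # placeholder topic distribution of doc 2
print(cosine_sim(theta1, theta2))
```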
zhdd was produced by first deduplicating DailyDialog and then machine-translating it with the Baidu API. A few entries were lost at the time due to network issues, which is why the total falls short of 13,118. The corrected aligned data is available here: [dailydialog_zh_en.json](https://github.com/zll17/Neural_Topic_Models/blob/master/data/dailydialog_zh_en.json). **Note**: According to the original [license](http://yanran.li/dailydialog), this transformed corpus is also licensed under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).
Thanks for the feedback. I'll look into it and reply later.
Hi, I tested the GSM, WLDA, and WTM models in an environment freshly configured from a clean pull of the repository. GSM was run three times; WLDA and WTM-GMM were each run once. The results are roughly normal, but GSM does show some instability: with identical parameters, TD fluctuated by about 0.2 between two runs (0.423 and 0.626). Here are my commands and results:

| exp_id           | Main parameter differences     | TD    |
| ---------------- | ------------------------------ | ----- |
| gsm_exp0_manual  | --no_above 0.0134 --no_below 5 | 0.423 |
| gsm_exp1_autoadj | --autoadj                      | ...
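For reference, one common definition of the TD metric is the fraction of unique words among the top-k words of all topics; whether this repo computes TD exactly this way is an assumption:

```python
# Hedged sketch of topic diversity (TD): unique top-k words / total
# top-k words across all topics. Higher means more diverse topics.
def topic_diversity(topics, topk=25):
    top_words = [w for topic in topics for w in topic[:topk]]
    return len(set(top_words)) / len(top_words)

topics = [['gene', 'protein', 'cell'],
          ['market', 'stock', 'cell']]  # toy topics
print(topic_diversity(topics, topk=3))  # 5 unique / 6 total ≈ 0.833
```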