Slava Jankin
Re a non-trivial corpus with a large number of documents: how about the UN General Debate corpus? It's publicly available from [Harvard Dataverse](https://doi.org/10.7910/DVN/0TJX8Y) as "UNGDC 1970-2017.zip". Direct link [here](https://www.dropbox.com/sh/c5z9ye4hsxgzps6/AABLOKHEpIekv6mo6jVo01jda?dl=0). It covers country statements in...
Are you thinking about adding a correspondence analysis (CA) option as well? Arguably, CA could be tapping into underlying linguistic properties a bit better than PCA.
You can set it up on the DFM (document-feature matrix) directly. That's how it's implemented in [quanteda](https://github.com/kbenoit/quanteda)'s [textmodel_ca](http://finzi.psych.upenn.edu/library/quanteda/html/textmodel_ca.html) function, which calls the [ca](https://cran.r-project.org/web/packages/ca/ca.pdf) package. Another option is the [vegan](https://cran.r-project.org/web/packages/vegan/vegan.pdf) package. Vegan is widely used in...
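To make "CA directly on the DFM" a bit more concrete, here is a rough pure-Python sketch of the first CA axis. It is not how quanteda, ca, or vegan implement it; it only illustrates the standard recipe (standardized residuals of the count matrix, then the leading singular vector, found here by power iteration). The function name `ca_first_axis` and the toy counts in `toy_dfm` are made up for illustration, and the sketch assumes no empty documents or features.

```python
import math


def ca_first_axis(dfm, n_iter=200):
    """First correspondence-analysis axis of a count matrix (rows = documents).

    Hypothetical helper for illustration only. Assumes every row and column
    has a non-zero total. Returns (row_scores, col_scores, singular_value),
    with scores in principal coordinates.
    """
    n_rows, n_cols = len(dfm), len(dfm[0])
    total = float(sum(sum(row) for row in dfm))

    # Row and column masses (marginal proportions).
    r = [sum(row) / total for row in dfm]
    c = [sum(dfm[i][j] for i in range(n_rows)) / total for j in range(n_cols)]

    # Standardized residuals: (observed - expected) / sqrt(expected).
    S = [[(dfm[i][j] / total - r[i] * c[j]) / math.sqrt(r[i] * c[j])
          for j in range(n_cols)] for i in range(n_rows)]

    # Power iteration on S^T S for the leading right singular vector.
    # The fixed start vector is an arbitrary choice for this sketch.
    v = [float(j + 1) for j in range(n_cols)]
    for _ in range(n_iter):
        u = [sum(S[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        w = [sum(S[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no association left to extract
        v = [x / norm for x in w]

    # S v equals sigma times the leading left singular vector.
    u = [sum(S[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in u))

    # Principal coordinates: rescale by the square roots of the masses.
    row_scores = [u[i] / math.sqrt(r[i]) for i in range(n_rows)]
    col_scores = [sigma * v[j] / math.sqrt(c[j]) for j in range(n_cols)]
    return row_scores, col_scores, sigma


# Toy DFM (made-up counts): documents 0-1 favour the first two features,
# documents 2-3 favour the last two.
toy_dfm = [
    [4, 3, 0, 1],
    [5, 2, 1, 0],
    [0, 1, 4, 5],
    [1, 0, 3, 4],
]

row_scores, col_scores, sigma = ca_first_axis(toy_dfm)
print("document scores:", [round(s, 3) for s in row_scores])
print("feature scores: ", [round(s, 3) for s in col_scores])
```

On the toy matrix the first axis should put the first two documents on one side and the last two on the other. For a real corpus you would want a sparse SVD rather than dense Python lists; this version is only meant to show the steps.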
I think that sounds really good. And the combination with text2vec is great. Looking forward to seeing the development.