lightlda icon indicating copy to clipboard operation
lightlda copied to clipboard

Distributed LDA, takes raw text as input and outputs topic word table.

Light LDA

Modified based on MSR's Light LDA, added preprocessing scripts.

Usage (Suppose you are in lightlda/):

make
cd datasets
tar zxf 20news-train.tgz
python scripts/pipeline.py etc/params.config

Note: parameters are defined in etc/params.config. The result is put in output/model/${timestamp}/snapshot.word_topic_table.${iteration}${client_id}. By using python scripts/parse_word_topic_table.py a visualization can be obtained. The <word-id-file> is in output/datablocks/${timestamp}/word_tf.txt.

Note2: The machine file defined in etc/params.config only works on cogito. And the whole pipeline assumes a shared filesystem.