ToModAPI
LFTM and empty lines
LFTM uses GloVe embeddings, when available, and strips out any word that is not included in the preprocessed embedding.
When a line does not contain any word from the GloVe dictionary, it appears as an empty line in the LFLDA.glove file.
During training, such a line is simply skipped (rather than treated as an "empty" document): https://github.com/datquocnguyen/LFTM/blob/master/src/models/LFLDA.java#L173
As a result, the corpus has more lines than there are predictions. This affects ground truth evaluation metrics, because predictions and labels no longer line up one to one.
Note about a possible workaround:
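import os
# Read the preprocessed corpus; lines with no GloVe word are empty.
with open(os.path.join(model_path, 'LFLDA.glove'), 'r') as f:
    glove_corpus = [x.strip() for x in f.readlines()]
# Indices of the documents that LFTM skipped during training.
empty_docs = [i for i, x in enumerate(glove_corpus) if len(x) < 1]
# Insert in ascending order so each index matches the original corpus.
for i in empty_docs:
    preds.insert(i, [(0, 0)])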
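
The same idea can be sketched as a standalone function that builds a new list instead of mutating preds in place, and that fails loudly if the counts do not match. The function name realign_predictions and its placeholder parameter are hypothetical, not part of LFTM or ToModAPI. It assumes preds holds one entry per non-empty line of LFLDA.glove, in file order, and reuses the [(0, 0)] placeholder from the snippet above.

```python
def realign_predictions(glove_lines, preds, placeholder=((0, 0),)):
    """Return one prediction per corpus line, padding the skipped lines.

    glove_lines: lines of LFLDA.glove (empty where no GloVe word survived).
    preds: predictions for the non-empty lines only, in file order (assumed).
    placeholder: value used for skipped lines (hypothetical default).
    """
    non_empty = sum(1 for line in glove_lines if line.strip())
    if non_empty != len(preds):
        raise ValueError(
            f"expected {non_empty} predictions, got {len(preds)}"
        )

    remaining = iter(preds)
    aligned = []
    for line in glove_lines:
        if line.strip():
            # A document that was actually trained on: take its prediction.
            aligned.append(next(remaining))
        else:
            # A document LFTM skipped: insert a fresh placeholder copy.
            aligned.append(list(placeholder))
    return aligned
```

The output then has the same length as the corpus, so it can be compared with the ground truth line by line. Whether the padded documents should count in the metrics or be masked out is a separate decision.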