Get top n words that are nearest to cluster centroid

Open fkolokathi opened this issue 8 years ago • 4 comments

I don't understand how taking the indices of the words with the highest tf-idf weight in each cluster center gives you the top words nearest to the cluster centroid. Also, is the cluster centroid simply the center of each cluster?

fkolokathi avatar Nov 22 '17 11:11 fkolokathi

I have a related question. If you are clustering synopses (and therefore films), the centroid should represent a "fake" film, not a fake word. The points closest to the center should be the closest films, not the closest words to the film, right?

PabloRR100 avatar Nov 21 '19 22:11 PabloRR100

@fkolokathi @PabloRR100 apologies, I haven't had a chance to look back at this in quite some time. Regarding @fkolokathi's question: I'm not sure what, beyond words, would comprise the cluster centroid. As @PabloRR100 points out, the centroid is really a "fake film synopsis", not a fake word.
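To make the "fake synopsis" reading concrete, here is a minimal sketch, not necessarily what the repo does. It assumes each synopsis is a row of tf-idf weights and that a centroid is the mean of the rows assigned to its cluster. The toy vocabulary, the weights, the cluster labels, and the helper names `cluster_centroids` and `top_terms` are all made up for illustration. Sorting a centroid's coordinates in descending order gives the terms with the largest average weight in that cluster, which is a ranking of the centroid's own components rather than a distance between words and the centroid.

```python
import numpy as np

# Hypothetical toy data: 4 synopses over a 5-term vocabulary.
# Rows are documents, columns are tf-idf weights (values invented for illustration).
terms = ["alien", "love", "ship", "war", "wedding"]
tfidf = np.array([
    [0.9, 0.0, 0.4, 0.1, 0.0],
    [0.7, 0.0, 0.6, 0.3, 0.0],
    [0.0, 0.8, 0.0, 0.0, 0.6],
    [0.0, 0.6, 0.0, 0.1, 0.7],
])
labels = np.array([0, 0, 1, 1])  # assumed cluster assignment per synopsis


def cluster_centroids(matrix, labels):
    """Return one centroid per cluster: the mean of the rows assigned to it."""
    return np.vstack([matrix[labels == c].mean(axis=0) for c in np.unique(labels)])


def top_terms(centroids, terms, n):
    """For each centroid, list the n terms with the largest centroid weight."""
    order = np.argsort(centroids, axis=1)[:, ::-1]  # column indices, largest weight first
    return [[terms[i] for i in row[:n]] for row in order]


centroids = cluster_centroids(tfidf, labels)  # each row is a "fake synopsis"
top = top_terms(centroids, terms, 2)
print(top)
```

Each row of `centroids` lives in the same term space as the synopses, so the nearest points to it are films, while its largest coordinates are the terms that characterize the cluster on average.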

@PabloRR100 I think you're correct if my memory serves. Do you have any suggestions for how things could be improved for clarity?

brandomr avatar Nov 21 '19 22:11 brandomr

Thank you so much for replying, @brandomr. I am trying to wrap my head around this because I have a bunch of documents that I want to cluster, and then plot a word cloud of the most relevant words for each cluster. So it is essentially the same use case. Previously I was using this "closeness" to the center as the importance of each word in the word cloud.

What do you think about using the k words with the highest IDF as the most important ones for the word cloud, restricted to the words that appear in the documents of the cluster? Alternatively, the importance could be some metric that combines the average TF across documents with the IDF.
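A rough sketch of the second variant, under several assumptions: whitespace tokenization, TF as the term count divided by document length, IDF computed over the whole corpus as log(N / df) + 1, and the weight defined as mean TF within the cluster times IDF. The function name `cluster_word_weights` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def cluster_word_weights(docs, labels, cluster, k):
    """Return the k (term, weight) pairs with the largest mean-TF * IDF in one cluster."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency and IDF over the whole corpus (assumed form: log(N / df) + 1).
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / count) + 1.0 for term, count in df.items()}

    # Mean term frequency over the documents that belong to the chosen cluster.
    members = [tokens for tokens, label in zip(tokenized, labels) if label == cluster]
    mean_tf = Counter()
    for tokens in members:
        for term, count in Counter(tokens).items():
            mean_tf[term] += count / len(tokens) / len(members)

    weights = {term: tf * idf[term] for term, tf in mean_tf.items()}
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical documents and cluster labels, for illustration only.
docs = [
    "alien ship attacks the city",
    "alien war on the ship",
    "wedding in the city",
    "love and wedding in the spring",
]
labels = [0, 0, 1, 1]
top = cluster_word_weights(docs, labels, cluster=0, k=2)
print(top)
```

The resulting term-to-weight pairs could then be turned into a dict and used as the word cloud frequencies. Note that a word like "the", which appears in every document, gets a low weight here even though its TF in the cluster is high.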

PabloRR100 avatar Nov 22 '19 08:11 PabloRR100

@PabloRR100 I think that makes sense. I'd definitely spot check things to ensure that the results you are seeing are actually logical.

You might check out this paper on vennclouds and the associated repo that automatically generates dynamic word clouds comparing documents. That methodology might be useful for you.

brandomr avatar Nov 22 '19 17:11 brandomr