
Best-performing embedding models?

Open raphael-milliere opened this issue 1 year ago • 2 comments

I've been looking for up-to-date information about how various pre-trained models compare for clustering and topic modeling with BERTopic, rather than for semantic search, which is all the rage these days with RAG pipelines.

According to the official pre-trained model evaluations, all-mpnet-base-v2 is best overall, while sentence-t5-xxl is best for sentence similarity. However, both of these models are quite old. Surely there are better pre-trained models available for similarity/clustering?

Looking at the MTEB leaderboard, mxbai-embed-large-v1 currently appears to be the leading open-weights model. Should I expect this model to be superior to all-mpnet-base-v2 or sentence-t5-xxl for BERTopic? I've done some informal tests, but I'm not convinced it results in better topics.

raphael-milliere avatar Apr 17 '24 16:04 raphael-milliere

I would indeed advise looking at the MTEB leaderboard, and specifically at the clustering metric, since clustering is what BERTopic mostly relies on. In my experience, the clusters are formed a bit better when using a model that scores higher on the leaderboard.
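For anyone who wants to try this comparison themselves, here is a minimal sketch of swapping embedding models in BERTopic. It assumes `bertopic` and `sentence-transformers` are installed, that `docs` is your own list of strings (large enough for UMAP and HDBSCAN to work with), and that `mixedbread-ai/mxbai-embed-large-v1` is the Hugging Face ID of the model mentioned above. The helper names `fit_topics` and `summarize` are made up for illustration.

```python
# Candidate models to compare; the second ID is assumed to be the
# Hugging Face name of mxbai-embed-large-v1.
CANDIDATE_MODELS = [
    "all-mpnet-base-v2",
    "mixedbread-ai/mxbai-embed-large-v1",
]


def fit_topics(docs, model_name):
    """Fit BERTopic on `docs` with the given sentence-transformers model."""
    # Imported lazily so the lightweight helper below works without them.
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer(model_name)
    topic_model = BERTopic(embedding_model=embedder)
    topics, _ = topic_model.fit_transform(docs)
    return topic_model, topics


def summarize(topics):
    """Rough summary of a topic assignment; -1 is BERTopic's outlier topic."""
    n_topics = len(set(topics) - {-1})
    outlier_share = list(topics).count(-1) / len(topics)
    return {"n_topics": n_topics, "outlier_share": outlier_share}


# Usage (not run here, since it downloads the models):
# for name in CANDIDATE_MODELS:
#     model, topics = fit_topics(docs, name)
#     print(name, summarize(topics))
#     print(model.get_topic_info().head(10))
```

The number of topics and the outlier share are only a coarse sanity check. Reading the top topics from `get_topic_info()` side by side is probably still the most informative comparison.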

However, do note that small differences in clusters might not affect the topic representations much if you have a relatively big dataset. You might see differences in smaller clusters, but it is unlikely to affect the larger clusters that already have good representations.
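To illustrate that point, here is a toy sketch in plain Python. It uses a simplified class-based TF-IDF score (term frequency within a cluster times `log(1 + A / f_t)`, where `A` is the average number of words per cluster and `f_t` the term's total frequency). This is not BERTopic's actual implementation, and the documents are invented. One borderline document is moved from a large cluster to a small one, and the top terms are compared.

```python
import math
from collections import Counter


def top_terms(clusters, k=2):
    """Top-k terms per cluster under a simplified c-TF-IDF-style score.

    `clusters` maps a cluster name to a list of tokenized documents.
    """
    class_counts = {
        name: Counter(tok for doc in docs for tok in doc)
        for name, docs in clusters.items()
    }
    total = Counter()
    for counts in class_counts.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(class_counts)

    result = {}
    for name, counts in class_counts.items():
        n_words = sum(counts.values())
        scores = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result[name] = [term for term, _ in ranked[:k]]
    return result


# Invented toy data: a large cluster, a small one, and one borderline document.
big = [["match", "match", "team"]] * 40
small = [["loan", "bank"], ["loan", "rate"], ["loan", "bank"]]
borderline = ["team", "bank", "bank"]

before = top_terms({"big": big + [borderline], "small": small})
after = top_terms({"big": big, "small": small + [borderline]})

print(before)  # {'big': ['match', 'team'], 'small': ['loan', 'bank']}
print(after)   # {'big': ['match', 'team'], 'small': ['bank', 'loan']}
```

Moving a single document reorders the small cluster's top terms but leaves the large cluster's representation unchanged, which is the effect described above.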

MaartenGr avatar Apr 18 '24 13:04 MaartenGr

@raphael-milliere did you find anything?

aramis-it avatar May 17 '24 06:05 aramis-it