KeyBERT

Get embeddings for the key words?

Open jricheimer opened this issue 3 years ago • 3 comments

Hi, is there a way to get access to the embeddings for just the keywords output by extract_keywords()?

jricheimer avatar Feb 02 '23 21:02 jricheimer

You can extract the embeddings with .extract_embeddings as described here.
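As a rough sketch of how that could be narrowed down to just the extracted keywords: assuming the rows of word_embeddings follow the order of the fitted vectorizer's vocabulary (get_feature_names_out()), each keyword can be looked up by its vocabulary index. The helper name keyword_embeddings and the toy data below are made up for illustration and are not part of KeyBERT.

```python
def keyword_embeddings(keywords, vocabulary, word_embeddings):
    """Map each extracted keyword to its embedding row.

    keywords:        list of (keyword, score) tuples for one document,
                     in the shape extract_keywords() returns.
    vocabulary:      candidate words, in the same order as the rows of
                     word_embeddings (assumed to match the vectorizer's
                     get_feature_names_out()).
    word_embeddings: any indexable sequence of vectors (list or array).
    """
    row_of = {word: i for i, word in enumerate(vocabulary)}
    # Keywords missing from the vocabulary are skipped rather than raising.
    return {kw: word_embeddings[row_of[kw]] for kw, _score in keywords if kw in row_of}


# Toy stand-ins for real KeyBERT output (hypothetical values).
vocabulary = ["embedding", "keyword", "model"]
word_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
keywords = [("keyword", 0.71), ("model", 0.42)]

result = keyword_embeddings(keywords, vocabulary, word_embeddings)
```

With real data, vocabulary would presumably come from the same vectorizer passed to extract_embeddings, and both calls need identical vectorizer settings for the rows to line up.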

MaartenGr avatar Feb 03 '23 09:02 MaartenGr

Note that this workflow fits the vectorizer twice, once in extract_embeddings and again in extract_keywords. With something slow like KeyphraseCountVectorizer, you'll probably notice a slowdown.

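# Assumed setup: TicToc is taken to be pytictoc's timer; docs is a list of strings defined elsewhere.
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer
from pytictoc import TicToc

kw_model = KeyBERT()
vectorizer = KeyphraseCountVectorizer()
with TicToc():
    doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs, vectorizer=vectorizer)

with TicToc():
    topics = kw_model.extract_keywords(docs, vectorizer=vectorizer, top_n=20, doc_embeddings=doc_embeddings, word_embeddings=word_embeddings)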

Elapsed time is 3.111434 seconds.
Elapsed time is 2.751747 seconds.

I tried monkeypatching vectorizer.fit to return the already-fitted vectorizer and saw a substantial speedup in the second call, with no change in results:

# Fit once up front; fit() returns the fitted vectorizer itself.
with TicToc():
    counts = vectorizer.fit(docs)
# Replace fit on this instance so later calls reuse the fitted vectorizer instead of refitting.
vectorizer.fit = lambda *args, **kwargs: counts
with TicToc():
    doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs, vectorizer=vectorizer)

with TicToc():
    topics = kw_model.extract_keywords(docs, vectorizer=vectorizer, top_n=20, doc_embeddings=doc_embeddings, word_embeddings=word_embeddings)

Elapsed time is 2.567713 seconds.
Elapsed time is 0.555116 seconds.
Elapsed time is 0.053710 seconds.
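If the patch should not outlive the two calls, one option is to wrap it in a context manager that restores the original fit afterwards. This is only a sketch: cached_fit is a hypothetical helper, not something KeyBERT provides, and CountingVectorizer is a stand-in used here so the example runs without any model.

```python
from contextlib import contextmanager


@contextmanager
def cached_fit(vectorizer, docs):
    """Fit `vectorizer` once on `docs`, then skip refitting inside the block.

    Hypothetical helper. Only valid when every call inside the block
    would have fitted on the same `docs`.
    """
    fitted = vectorizer.fit(docs)  # scikit-learn style: fit() returns the fitted object
    # Shadow the class method with an instance attribute that returns the cached result.
    vectorizer.fit = lambda *args, **kwargs: fitted
    try:
        yield vectorizer
    finally:
        # Drop the instance attribute so the original method is visible again.
        del vectorizer.fit


class CountingVectorizer:
    """Stand-in for a slow vectorizer; counts how often fit() really runs."""

    def __init__(self):
        self.fit_calls = 0

    def fit(self, docs):
        self.fit_calls += 1
        self.vocabulary_ = sorted({w for d in docs for w in d.split()})
        return self


docs = ["slow vectorizer", "fit twice"]
vec = CountingVectorizer()
with cached_fit(vec, docs) as v:
    v.fit(docs)  # stands in for extract_embeddings
    v.fit(docs)  # stands in for extract_keywords
calls_inside = vec.fit_calls

vec.fit(docs)  # outside the block, fit() runs for real again
calls_after = vec.fit_calls
```

In the real workflow, the two fit calls inside the block would be replaced by the extract_embeddings and extract_keywords calls with vectorizer=v.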

mbarnathan avatar Mar 05 '23 04:03 mbarnathan

That's quite a large difference, thanks for sharing! Also, I really like the elegant solution of simply adding two lines.

MaartenGr avatar Mar 05 '23 07:03 MaartenGr