Get embeddings for the keywords?
Hi, is there a way to get access to the embeddings for just the keywords output by extract_keywords()?
You can extract the embeddings with .extract_embeddings as described here.
Note that this workflow fits the vectorizer twice (once in extract_embeddings and once again in extract_keywords). With something slow like KeyphraseCountVectorizer, you'll probably notice a slowdown.
from pytictoc import TicToc
from keyphrase_vectorizers import KeyphraseCountVectorizer

vectorizer = KeyphraseCountVectorizer()
with TicToc():  # fits the vectorizer internally
    doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs, vectorizer=vectorizer)
with TicToc():  # fits the vectorizer again
    topics = kw_model.extract_keywords(docs, vectorizer=vectorizer, top_n=20, doc_embeddings=doc_embeddings, word_embeddings=word_embeddings)
Elapsed time is 3.111434 seconds.
Elapsed time is 2.751747 seconds.
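To get back to the original question: the keyword embeddings themselves aren't returned directly, but they can be looked up in word_embeddings. A minimal sketch, assuming the rows of word_embeddings line up with vectorizer.get_feature_names_out() (which appears to be how extract_embeddings builds them) and that docs is a list, so topics is a list of lists:

# Sketch: look up the embedding row for each returned keyword.
# Assumes word_embeddings rows line up with vectorizer.get_feature_names_out().
words = list(vectorizer.get_feature_names_out())
row = {word: i for i, word in enumerate(words)}
keyword_embeddings = {
    keyword: word_embeddings[row[keyword]]
    for keyword, _score in topics[0]  # keywords of the first document
}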
I tried monkeypatching the vectorizer so that fit() returns a cached, already-fitted result, and noticed a substantial speedup for the extract_keywords call without any change in results:
with TicToc():
    # fit() returns the fitted vectorizer itself, so cache it once...
    counts = vectorizer.fit(docs)
    # ...and make every later fit() a no-op that returns the already-fitted instance
    vectorizer.fit = lambda *args, **kwargs: counts
with TicToc():
    doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs, vectorizer=vectorizer)
with TicToc():
    topics = kw_model.extract_keywords(docs, vectorizer=vectorizer, top_n=20, doc_embeddings=doc_embeddings, word_embeddings=word_embeddings)
Elapsed time is 2.567713 seconds.
Elapsed time is 0.555116 seconds.
Elapsed time is 0.053710 seconds.
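If the vectorizer needs to behave normally again afterwards, the patch can also be undone once the keywords are extracted. A minimal sketch of the same idea with the patched fit removed at the end (same names as above):

# Sketch: same single-fit trick, but the patched fit is removed afterwards.
fitted = vectorizer.fit(docs)                    # fit once; fit() returns the fitted vectorizer
vectorizer.fit = lambda *args, **kwargs: fitted  # later fit() calls become no-ops
try:
    doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs, vectorizer=vectorizer)
    topics = kw_model.extract_keywords(
        docs, vectorizer=vectorizer, top_n=20,
        doc_embeddings=doc_embeddings, word_embeddings=word_embeddings,
    )
finally:
    del vectorizer.fit  # drop the instance attribute so the original class method applies again

Deleting the instance attribute in the finally block lets attribute lookup fall back to the class's own fit, so the vectorizer can still be refit on new documents later.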
That's quite a large difference, thanks for sharing! Also, I really like the elegant solution of simply adding two lines.