KeyBERT Matching with Synonyms using KeyLLM OR KeyBERT

I have been playing with KeyBERT and KeyLLM for a while now. And here is something I would like to achieve.

If have a text "CO2 emissions are high these days" and a list of candidate words, which might contain the word Carbondioxide and not CO2 will KeyBERT or KeyLLM find Carbondioxide as a match?

Text = "CO2 emissions are high these days" candidate keyword list have the word ["Carbon dioxide"] and not "CO2"

Expected output = ["Carbon dioxide"]

Jul 29 '24 09:07 ChettakattuA

If have a text "CO2 emissions are high these days" and a list of candidate words, which might contain the word Carbondioxide and not CO2 will KeyBERT or KeyLLM find Carbondioxide as a match?

I think it should be possible if you use it as a candidate word. Have you tried it out?

Jul 30 '24 13:07 MaartenGr

In this result the acronym and synonyms are not identified by KeyBERT

acronym used = CO2 -> carbon dioxide
synonym used = emission -> release
Plural = emission -> emissions

The code used

from keybert import KeyBERT 
kw_model = KeyBERT() 
text = "CO2 emissions are high these days"
can = ["carbon dioxide", "emissions","release","emission","co2"]
Keywords = kw_model.extract_keywords(text,candidates=can)

Is there some way to resolve this?

Aug 08 '24 14:08 ChettakattuA

Ah right, that's because the candidates should appear in the original document in order to find them. Instead, you might want to use the seed_keywords parameter which allows you to steer the model towards certain words. Note that you might have to use the global perspective here.

Aug 10 '24 06:08 MaartenGr

But do you know why its require the word itself to appear in the text? What I understood from the documentation is it uses embeddings and cosine similarity. Aint it enough to understand similar words or synonyms from the text and candidates?

Aug 13 '24 13:08 ChettakattuA

@ChettakattuA That depends on what you want. Generally, keywords are derived directly from the article that was written for SEO reasons. In KeyBERT candidates are passed to the CountVectorizer as a vocabulary, which means they should appear in the original documents (as they are fitted on the original documents):

https://github.com/MaartenGr/KeyBERT/blob/f0f96a6d524ad1403bd847b05c8345cf099ed060/keybert/_model.py#L163-L182

Aug 17 '24 14:08 MaartenGr