
Matching with Synonyms using KeyLLM OR KeyBERT

Open ChettakattuA opened this issue 1 year ago • 5 comments

I have been experimenting with KeyBERT and KeyLLM for a while now, and here is something I would like to achieve.

If I have the text "CO2 emissions are high these days" and a list of candidate words that contains "carbon dioxide" but not "CO2", will KeyBERT or KeyLLM return "carbon dioxide" as a match?

Text = "CO2 emissions are high these days"
Candidate keyword list = ["Carbon dioxide"] (it does not contain "CO2")

Expected output = ["Carbon dioxide"]

ChettakattuA avatar Jul 29 '24 09:07 ChettakattuA

If I have the text "CO2 emissions are high these days" and a list of candidate words that contains "carbon dioxide" but not "CO2", will KeyBERT or KeyLLM return "carbon dioxide" as a match?

I think it should be possible if you use it as a candidate word. Have you tried it out?

MaartenGr avatar Jul 30 '24 13:07 MaartenGr

image

In this result, KeyBERT does not identify the acronym or the synonym.

Acronym used: CO2 -> carbon dioxide
Synonym used: emission -> release
Plural used: emission -> emissions

The code used:

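from keybert import KeyBERT

kw_model = KeyBERT()
text = "CO2 emissions are high these days"
# Candidates cover the acronym expansion, a synonym, and singular/plural forms
candidates = ["carbon dioxide", "emissions", "release", "emission", "co2"]
keywords = kw_model.extract_keywords(text, candidates=candidates)
print(keywords)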

Is there some way to resolve this?

ChettakattuA avatar Aug 08 '24 14:08 ChettakattuA

Ah right, that's because the candidates have to appear in the original document in order to be found. Instead, you might want to use the seed_keywords parameter, which allows you to steer the model towards certain words. Note that you might have to use the global perspective here.

MaartenGr avatar Aug 10 '24 06:08 MaartenGr

But do you know why it requires the word itself to appear in the text? From what I understood from the documentation, it uses embeddings and cosine similarity. Isn't that enough to pick up similar words or synonyms between the text and the candidates?

ChettakattuA avatar Aug 13 '24 13:08 ChettakattuA

@ChettakattuA That depends on what you want. Generally, for SEO reasons, keywords are derived directly from the article that was written. In KeyBERT, candidates are passed to the CountVectorizer as its vocabulary, which means they have to appear in the original documents (the vectorizer is fitted on those documents):

https://github.com/MaartenGr/KeyBERT/blob/f0f96a6d524ad1403bd847b05c8345cf099ed060/keybert/_model.py#L163-L182
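
A rough pure-Python imitation of that filtering step may make the effect concrete. This is only a sketch, not KeyBERT's code: candidates_in_document is a hypothetical helper, the regex approximates scikit-learn's default token pattern, stop-word removal is left out, and candidates are assumed to be lowercase already.

```python
import re


def candidates_in_document(doc, candidates, ngram_range=(1, 1)):
    """Keep only the candidates that literally occur in the document as n-grams."""
    # Lowercase the document and keep tokens of two or more word characters.
    tokens = re.findall(r"\b\w\w+\b", doc.lower())
    low, high = ngram_range
    ngrams = set()
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.add(" ".join(tokens[i:i + n]))
    # A candidate with no occurrence in the document is dropped here,
    # before any embedding or cosine similarity would be computed.
    return [c for c in candidates if c in ngrams]


text = "CO2 emissions are high these days"
candidates = ["carbon dioxide", "emissions", "release", "emission", "co2"]
surviving = candidates_in_document(text, candidates)
print(surviving)  # ['emissions', 'co2']
```

Only the candidates that literally occur in the text survive, which is consistent with the result reported above: the acronym expansion, the synonym, and the singular form are dropped before similarity plays any role. Note also that with a unigram range of (1, 1), a two-word candidate such as "carbon dioxide" cannot match even when the phrase is present in the text.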

MaartenGr avatar Aug 17 '24 14:08 MaartenGr
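
For the goal stated at the top of the thread, scoring candidates that never occur in the text, one option is to skip the vocabulary step and compare embeddings directly. The following is a minimal sketch under stated assumptions, not a KeyBERT feature: rank_candidates and toy_embed are hypothetical names, and the two-dimensional vectors are made up purely so the example runs without a model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(doc, candidates, embed, top_n=5):
    """Score every candidate against the document, whether or not it occurs in it."""
    doc_vector = embed(doc)
    scored = [(c, cosine(embed(c), doc_vector)) for c in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Made-up vectors standing in for a real embedding model (illustration only).
TOY_VECTORS = {
    "CO2 emissions are high these days": (0.9, 0.1),
    "carbon dioxide": (0.8, 0.2),
    "weather": (0.1, 0.9),
}


def toy_embed(text):
    return TOY_VECTORS[text]


ranked = rank_candidates(
    "CO2 emissions are high these days",
    ["weather", "carbon dioxide"],
    toy_embed,
)
print(ranked[0][0])  # carbon dioxide
```

In practice, embed would wrap a real embedding model, for example the encode method of a sentence-transformers model. Whether "carbon dioxide" then scores highly against a text that only mentions "CO2" depends on the model and should be checked rather than assumed.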