python-crfsuite

Handling unigram and bigram features at the same time in word2features

Open AbhishekBose opened this issue 4 years ago • 0 comments

Hello, I am trying to perform an NER experiment on a custom dataset containing a lot of food items. I have labels for certain unigrams and bigrams for my training data.

My label corpus contains "green chilli" = "vegetable". I don't have "chilli" on its own as a label. I am using this label list to annotate sentences for NER.

For example:

A sentence might contain a bigram such as "green chilli" with its associated label = "vegetable".

Currently, while generating the features, I am marking both "green" and "chilli" as "vegetable". My annotation pipeline is as follows:

  • Split sentence into unigrams
  • Check if the unigram exists in the label list -> if it does, mark the unigram with that label
  • Form a bigram from the token and a neighbor: token + sentence[idx+1] or sentence[idx-1] + token
  • Check if the bigram exists in the label list -> if it does, mark both the token and its neighbor (sentence[idx+1] or sentence[idx-1]) with that label

As a result of point number 4, both "green" and "chilli" get marked as "vegetable".
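To make the pipeline concrete, here is a minimal sketch of the steps listed above. It is not my exact code: the names `LABELS` and `annotate`, the extra "rice" entry, and the "O" tag for unlabelled tokens are assumptions made up for illustration.

```python
# Hypothetical label list mapping unigrams and bigrams to labels.
# "green chilli" is from my data; "rice" is an invented extra entry.
LABELS = {"green chilli": "vegetable", "rice": "grain"}


def annotate(sentence, labels=LABELS, default="O"):
    """Tag each token using unigram and bigram lookups in the label list."""
    tokens = sentence.lower().split()      # step 1: split into unigrams
    tags = [default] * len(tokens)         # "O" for unlabelled tokens (assumed)

    # Step 2: mark every unigram that appears in the label list.
    for idx, token in enumerate(tokens):
        if token in labels:
            tags[idx] = labels[token]

    # Steps 3-4: build each bigram (token, next token) and, if it is in the
    # label list, mark BOTH tokens with its label. Looking ahead from every
    # position also covers the "previous token" case.
    for idx in range(len(tokens) - 1):
        bigram = f"{tokens[idx]} {tokens[idx + 1]}"
        if bigram in labels:
            tags[idx] = tags[idx + 1] = labels[bigram]

    return list(zip(tokens, tags))


print(annotate("add green chilli to rice"))
```

With these assumed labels, "green" and "chilli" each receive their own "vegetable" tag, which is exactly the duplication I am asking about.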

So when I train my model and run inference on a test sentence containing "green chilli", I get "vegetable", "vegetable": one label per token rather than a single label for the bigram.

What would be the best way to annotate this using word2features?

AbhishekBose, Dec 24 '21 08:12