tokenizer.add_tokens() interferes with downstream NER task
Goal: Add clinical-domain tokens to the tokenizer so that it stops splitting them into sub-word pieces.
I am using the following code to add a few tokens to the tokenizer:
tokenizer.add_tokens(["MV", "AV"])
model.resize_token_embeddings(len(tokenizer))
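For context, a minimal self-contained version of that setup looks roughly like the sketch below, assuming the Hugging Face transformers API with a token-classification head; the checkpoint name and num_labels are placeholders, not necessarily the exact values from my run:

# Sketch only: checkpoint name and num_labels are placeholders.
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = "dmis-lab/biobert-base-cased-v1.1"  # example BioBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=5)

# Register the clinical abbreviations as whole tokens so they are not split.
num_added = tokenizer.add_tokens(["MV", "AV"])

# Grow the embedding matrix so the new token ids get (randomly initialised) vectors.
model.resize_token_embeddings(len(tokenizer))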
After fine-tuning, the tokenizer no longer splits these tokens into single characters and keeps each of them as a single token. However, the model no longer assigns the correct NER tag to these tokens. I have double-checked my training data and there are no issues there, so this seems to be an error in the biobert code.
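As a diagnostic, I looked at how the added tokens come out of the tokenizer and how they map back to words, since NER label alignment usually relies on word_ids(). This is just a sketch, assuming a fast tokenizer; the example sentence is made up:

# Diagnostic sketch: check that "MV" stays one sub-token and aligns to one word.
sentence = "MV replacement was performed"
encoding = tokenizer(sentence.split(), is_split_into_words=True)

# Inspect the sub-tokens and the word index each one maps to; the label-alignment
# step in NER preprocessing typically keys off these word ids.
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
print(encoding.word_ids())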