tokenizer.add_tokens() interferes with downstream NER task
Goal: Add clinical-domain tokens to the tokenizer so that it stops splitting them into sub-word pieces.
I am using the following code to add a few tokens to the tokenizer:
tokenizer.add_tokens(["MV", "AV"])
model.resize_token_embeddings(len(tokenizer))
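For context, a minimal self-contained version of that setup looks roughly like the sketch below, assuming the Hugging Face transformers API with a token-classification head; the checkpoint name and num_labels are placeholders, not necessarily the exact values from my run:

# Sketch only: checkpoint name and num_labels are placeholders.
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = "dmis-lab/biobert-base-cased-v1.1"  # example BioBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=5)

# Register the clinical abbreviations as whole tokens so they are not split.
num_added = tokenizer.add_tokens(["MV", "AV"])

# Grow the embedding matrix so the new token ids get (randomly initialised) vectors.
model.resize_token_embeddings(len(tokenizer))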
After fine-tuning, the tokenizer no longer splits these tokens into single characters and keeps each of them as a single token. However, the model no longer assigns the correct NER tag to these tokens. I have double-checked my training data and there are no issues there, so this seems to be an error in the biobert code.
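As a diagnostic, I looked at how the added tokens come out of the tokenizer and how they map back to words, since NER label alignment usually relies on word_ids(). This is just a sketch, assuming a fast tokenizer; the example sentence is made up:

# Diagnostic sketch: check that "MV" stays one sub-token and aligns to one word.
sentence = "MV replacement was performed"
encoding = tokenizer(sentence.split(), is_split_into_words=True)

# Inspect the sub-tokens and the word index each one maps to; the label-alignment
# step in NER preprocessing typically keys off these word ids.
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
print(encoding.word_ids())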