deidentify
Handle empty documents in FlairTagger
The FlairTagger (and possibly CRFTagger) silently drops empty documents, so the number of output documents does not match the number of input documents.
We should either allow empty documents, or raise a warning stating that empty strings should not be passed.
Reproducible example
from pprint import pprint

from deidentify.base import Document
from deidentify.taggers import FlairTagger
from deidentify.tokenizer import TokenizerFactory

documents = [
    Document(name="doc_01", text=""),
    Document(name="doc_02", text="Stukje tekst met de naam Jan Jansen."),
    Document(name="doc_03", text=""),
]

tokenizer = TokenizerFactory().tokenizer(corpus="ons", disable=("tagger", "ner"))
tagger = FlairTagger(
    model="model_bilstmcrf_ons_fast-v0.2.0", tokenizer=tokenizer, verbose=False
)

annotated_docs = tagger.annotate(documents)

print(f"len(documents) = {len(documents)}")
print(f"len(annotated_docs) = {len(annotated_docs)}")
pprint(annotated_docs)
Actual:
len(documents) = 3
len(annotated_docs) = 1
[Document(name=doc_02). Chars: 36, Annotations: 1]
Expected:
len(documents) = 3
len(annotated_docs) = 3
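Possible workaround (sketch):
Until this is handled inside the tagger, a caller-side wrapper could keep the output aligned with the input. The snippet below is only a rough sketch and not deidentify code: `Doc`, `annotate_keep_empty` and `dropping_annotate` are hypothetical names made up for illustration. `Doc` stands in for `deidentify.base.Document`, and `dropping_annotate` only mimics the behaviour reported above. It assumes the annotate callable returns results in input order. In real use, something like `tagger.annotate` would be passed in instead.

```python
import warnings
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Doc:
    """Hypothetical stand-in for deidentify.base.Document."""
    name: str
    text: str
    annotations: list = field(default_factory=list)


def annotate_keep_empty(
    annotate: Callable[[List[Doc]], List[Doc]], documents: List[Doc]
) -> List[Doc]:
    """Annotate only non-empty documents, but return one result per input."""
    non_empty_idx = [i for i, doc in enumerate(documents) if doc.text]
    n_empty = len(documents) - len(non_empty_idx)
    if n_empty:
        warnings.warn(f"{n_empty} empty document(s) were not sent to the tagger")

    annotated = annotate([documents[i] for i in non_empty_idx])
    if len(annotated) != len(non_empty_idx):
        # Fail loudly instead of returning a misaligned list.
        raise RuntimeError("tagger returned an unexpected number of documents")

    result = list(documents)  # empty documents pass through unchanged
    for i, doc in zip(non_empty_idx, annotated):
        result[i] = doc
    return result


def dropping_annotate(docs: List[Doc]) -> List[Doc]:
    """Mimics the reported behaviour: empty documents are silently dropped."""
    return [Doc(d.name, d.text, ["<annotation>"]) for d in docs if d.text]


documents = [
    Doc(name="doc_01", text=""),
    Doc(name="doc_02", text="Stukje tekst met de naam Jan Jansen."),
    Doc(name="doc_03", text=""),
]
annotated_docs = annotate_keep_empty(dropping_annotate, documents)
print(f"len(documents) = {len(documents)}")
print(f"len(annotated_docs) = {len(annotated_docs)}")
```

The wrapper covers both options from the description: empty documents are allowed (they come back unannotated in their original position), and a warning is raised so the caller knows they were skipped.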