Handle empty documents in FlairTagger

Open jantrienes opened this issue 5 years ago • 0 comments

The FlairTagger (and possibly the CRFTagger) silently drops empty documents, so the number of output documents does not match the number of input documents.

We should either allow empty documents or raise a warning stating that empty strings must not be passed.
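Until one of these is implemented, a caller could work around the mismatch by annotating only the non-empty documents and putting the empty ones back in their original positions. The following is a minimal sketch, not part of the library: `Doc` is a hypothetical stand-in for `deidentify.base.Document`, `annotate_keep_empty` is a hypothetical helper, and it assumes the tagger returns exactly one output per non-empty input, in input order.

```python
import warnings
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Doc:
    """Hypothetical stand-in for deidentify.base.Document."""
    name: str
    text: str
    annotations: list = field(default_factory=list)


def annotate_keep_empty(
    annotate: Callable[[List[Doc]], List[Doc]], documents: List[Doc]
) -> List[Doc]:
    """Annotate non-empty documents and keep empty ones in place.

    Assumes `annotate` returns one document per non-empty input,
    in the same order as the inputs.
    """
    non_empty = [doc for doc in documents if doc.text]
    n_empty = len(documents) - len(non_empty)
    if n_empty:
        warnings.warn(f"{n_empty} empty document(s) passed; returned unannotated.")

    annotated = iter(annotate(non_empty))
    # Walk the original list so the output has the same length and order.
    return [next(annotated) if doc.text else doc for doc in documents]
```

With the real tagger, this would be called as `annotate_keep_empty(tagger.annotate, documents)`, provided the ordering assumption above holds.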

Reproducible example

from pprint import pprint

from deidentify.base import Document
from deidentify.taggers import FlairTagger
from deidentify.tokenizer import TokenizerFactory

documents = [
    Document(name="doc_01", text=""),
    Document(name="doc_02", text="Stukje tekst met de naam Jan Jansen."),
    Document(name="doc_03", text=""),
]


tokenizer = TokenizerFactory().tokenizer(corpus="ons", disable=("tagger", "ner"))
tagger = FlairTagger(
    model="model_bilstmcrf_ons_fast-v0.2.0", tokenizer=tokenizer, verbose=False
)

annotated_docs = tagger.annotate(documents)
print(f"len(documents) = {len(documents)}")
print(f"len(annotated_docs) = {len(annotated_docs)}")

pprint(annotated_docs)

Actual:

len(documents) = 3
len(annotated_docs) = 1
[Document(name=doc_02). Chars: 36, Annotations: 1]

Expected:

len(documents) = 3
len(annotated_docs) = 3

jantrienes, Jan 20 '21 07:01