Transformers based NLP engine

Open omri374 opened this issue 3 years ago • 0 comments

Background

Presidio currently leverages spaCy for NER. It is possible to switch to a stanza model, or to create additional NER recognizers using third-party packages such as Flair and Transformers (see examples here and here). However, it is not possible to completely replace a spaCy/stanza model with a transformers model, except when using the trf models that ship with spaCy, such as en_core_web_trf.

Proposed solution

Create a new type of NlpEngine called TransformersNlpEngine which runs a transformers NER model instead of spaCy's NER. The user could decide which transformers model to plug in (either pretrained or custom).

Since Presidio leverages spaCy's pipeline for different features such as lemmas and tokens, which aren't supported by a transformers NER model, we propose to create a spaCy pipeline without NER, and add a new component which runs a transformers NER model.

Instead of this pipeline which we currently have: image

We would have this pipeline: image

This pipeline would not be trainable, but one can train a transformers model separately and inject it into this pipeline.
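
To make the idea concrete, here is a minimal sketch of such a component. It is not the proposed implementation. The helper `add_transformers_ner`, the component name `hf_ner`, and the `fake_ner` stand-in are hypothetical names used for illustration only. The sketch assumes the NER model is exposed as a callable returning dicts with `entity_group`, `start` and `end` keys, which is the shape produced by a Hugging Face token-classification pipeline with `aggregation_strategy="simple"`. To stay runnable without downloading a model, it uses `spacy.blank("en")` (tokenizer only); a real setup would more likely load a full spaCy pipeline with `exclude=["ner"]` so that lemmas remain available.

```python
from typing import Callable, Dict, List

import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical type alias: any callable that maps text to a list of
# predictions with "entity_group", "start" and "end" keys.
NerFn = Callable[[str], List[Dict]]


def add_transformers_ner(nlp: Language, ner_fn: NerFn, name: str = "hf_ner") -> Language:
    """Append a component that fills doc.ents from an external NER callable."""

    @Language.component(name)
    def external_ner(doc: Doc) -> Doc:
        spans = []
        for pred in ner_fn(doc.text):
            # Map character offsets onto spaCy tokens; "expand" widens the span
            # to full tokens when the model's offsets fall inside a token.
            span = doc.char_span(
                pred["start"],
                pred["end"],
                label=pred["entity_group"],
                alignment_mode="expand",
            )
            if span is not None:
                spans.append(span)
        # doc.ents must not contain overlapping spans, so keep the longest ones.
        doc.ents = spacy.util.filter_spans(spans)
        return doc

    nlp.add_pipe(name)
    return nlp


def fake_ner(text: str) -> List[Dict]:
    """Stand-in for a real model: tags the first occurrence of "Alice"."""
    start = text.find("Alice")
    if start < 0:
        return []
    return [{"entity_group": "PER", "start": start, "end": start + len("Alice")}]


# Assumption: a blank pipeline stands in for a spaCy model loaded without NER.
nlp = add_transformers_ner(spacy.blank("en"), fake_ner)
doc = nlp("Alice lives in Paris.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

In place of `fake_ner`, one could pass the object returned by `transformers.pipeline("token-classification", model=..., aggregation_strategy="simple")`, since it is callable on a string and returns predictions in the assumed shape. The model stays outside spaCy's training loop, which matches the note above that the pipeline itself would not be trainable.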

omri374, Jun 29 '22 06:06