
Batch analysis and anonymization doesn't contain logic for detecting potential PII columns

Open omri374 opened this issue 4 years ago • 0 comments

In the existing sample for batch analysis, the logic iterates over all the columns and looks for PII in their values. We'd like to extend it with logic that also evaluates how likely a column is to contain PII based on its name.

For example, if a column is named Age and its values are [19, 55, 2, 39], the column name could help determine that the column contains PII rather than just non-sensitive numbers.

We could leverage things like:

  1. Lists of potential PII column names (or substrings)
  2. Existing context words in each recognizer
  3. Sampling from the actual values and seeing if any PII is detected and at what confidence.
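To make the idea concrete, here is a minimal sketch of how a column-name signal and a value-sampling signal could be combined into a single score. Everything in it is an assumption for illustration: `PII_NAME_HINTS`, `name_score`, `value_score`, `column_pii_score`, and the equal weighting are hypothetical and not part of Presidio. The `detect` callable stands in for whatever returns a PII confidence in [0, 1] for one value (for instance, a wrapper around the analyzer), and the hint list could instead be gathered from the recognizers' context words.

```python
import random
from typing import Callable, Iterable, Sequence

# Hypothetical name fragments hinting at PII; not taken from Presidio.
PII_NAME_HINTS = ("age", "name", "email", "phone", "ssn", "address", "birth")


def name_score(column_name: str, hints: Sequence[str] = PII_NAME_HINTS) -> float:
    """Return 1.0 if any hint is a substring of the column name, else 0.0.

    Substring matching is crude: "age" also matches "page" or "message".
    """
    lowered = column_name.lower()
    return 1.0 if any(hint in lowered for hint in hints) else 0.0


def value_score(
    values: Iterable[object],
    detect: Callable[[str], float],
    sample_size: int = 100,
    seed: int = 0,
) -> float:
    """Mean detection confidence over a random sample of the column values."""
    values = list(values)
    if not values:
        return 0.0
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = rng.sample(values, min(sample_size, len(values)))
    return sum(detect(str(value)) for value in sample) / len(sample)


def column_pii_score(
    column_name: str,
    values: Iterable[object],
    detect: Callable[[str], float],
    name_weight: float = 0.5,
) -> float:
    """Weighted blend of the name-based and value-based signals."""
    return (
        name_weight * name_score(column_name)
        + (1.0 - name_weight) * value_score(values, detect)
    )


def no_pii_detector(text: str) -> float:
    """Stand-in detector that finds nothing, as might happen for bare numbers."""
    return 0.0


age_score = column_pii_score("Age", [19, 55, 2, 39], no_pii_detector)
quantity_score = column_pii_score("Quantity", [19, 55, 2, 39], no_pii_detector)
```

With this stand-in detector, the Age column scores 0.5 from its name alone, while a Quantity column with the same values scores 0.0. The weighting and any decision threshold would need tuning against real data.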

omri374 • Oct 24 '21 07:10