presidio
Batch analysis and anonymization don't contain logic for detecting potential PII columns
In the existing sample for batch analysis, the logic iterates over all the columns and looks for PII in their values. We'd like to extend this with logic that also evaluates how likely a column is to contain PII based on its name.
For example, if a column is named Age and its values are [19,55,2,39], the column name could help determine that the column contains PII rather than non-sensitive numbers.
We could leverage things like:
- Lists of potential PII column names (or substrings)
- Existing context words in each recognizer
- Sampling from the actual values and seeing if any PII is detected and at what confidence.
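
As a starting point for discussion, here is a minimal sketch of how a column-name signal and a value-sampling signal could be combined into one score. Everything in it is hypothetical and not part of Presidio: the `PII_NAME_HINTS` list, the function names, the equal weighting, and the `detector` callable, which stands in for whatever returns a confidence for a single value. Plain substring matching is also naive (for example, "age" matches "page"), so a real implementation would likely need tokenization or the recognizers' own context words.

```python
import random
from typing import Any, Callable, Iterable, Sequence

# Hypothetical hint list: substrings that often appear in PII column names.
PII_NAME_HINTS = ("name", "age", "email", "phone", "ssn", "address", "birth")


def name_score(column_name: str, hints: Iterable[str] = PII_NAME_HINTS) -> float:
    """Return 1.0 if any hint is a substring of the normalized name, else 0.0."""
    normalized = column_name.lower().replace("_", " ")
    return 1.0 if any(hint in normalized for hint in hints) else 0.0


def sample_score(
    values: Sequence[Any],
    detector: Callable[[str], float],
    sample_size: int = 100,
    seed: int = 0,
) -> float:
    """Average detector confidence over a random sample of the column values."""
    if not values:
        return 0.0
    if len(values) > sample_size:
        # Seeded so repeated runs on the same column give the same score.
        values = random.Random(seed).sample(list(values), sample_size)
    scores = [detector(str(value)) for value in values]
    return sum(scores) / len(scores)


def column_pii_likelihood(
    column_name: str,
    values: Sequence[Any],
    detector: Callable[[str], float],
    name_weight: float = 0.5,
) -> float:
    """Blend the name-based and value-based signals into a score in [0, 1]."""
    value_weight = 1.0 - name_weight
    return (
        name_weight * name_score(column_name)
        + value_weight * sample_score(values, detector)
    )
```

With a detector that finds nothing in bare numbers, `column_pii_likelihood("Age", [19, 55, 2, 39], detector)` still yields a non-zero score because of the column name, which is the behavior this issue asks for. In practice the detector could wrap the analyzer and return the highest score among its results for a value, or 0.0 when nothing is detected.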