Privacy-Preserving Process Mining with PM4Py

Open henrikkirchmann opened this issue 3 years ago • 1 comments

As discussed with @fit-alessandro-berti and @s-j-v-zelst, we implemented privacy-preserving process mining techniques in PM4Py. We support two anonymization steps:

  • The anonymization of control-flow information with SaCoFa[^1] and the Laplacian mechanism[^2], in the pm4py.algo.anonymization.trace_variant_query package.
  • The anonymization of contextual information with PRIPEL[^3], in the pm4py.algo.anonymization.pripel package. PRIPEL requires the anonymization of the control-flow as a first step.

We protect private data with differential privacy by introducing noise into the event logs. For PRIPEL we use the Diffprivlib library[^4]. For now, we handle this library requirement the same way the earth mover's distance handles its requirement of the pyemd library in https://github.com/pm4py/pm4py-core/blob/release/pm4py/evaluation/init.py.
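
A minimal sketch of such an optional-dependency check, assuming the goal is only to detect whether diffprivlib is installed before PRIPEL is used. The helper names are hypothetical and this is not the code from the PM4Py repository; it relies only on importlib.util.find_spec from the standard library.

```python
import importlib.util


def has_optional_dependency(module_name: str) -> bool:
    """Return True if a top-level module is installed, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def require_optional_dependency(module_name: str, feature: str) -> None:
    """Raise a descriptive ImportError if an optional package is missing.

    Both helper names are hypothetical and only illustrate the idea.
    """
    if not has_optional_dependency(module_name):
        raise ImportError(
            f"{feature} requires the optional package '{module_name}'; "
            f"install it to use this feature."
        )


if __name__ == "__main__":
    # Report availability instead of failing at import time.
    print("diffprivlib available:", has_optional_dependency("diffprivlib"))
```

Checking with find_spec keeps the rest of the package importable when the optional library is absent, and the error only surfaces when the feature that needs it is actually requested.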

Here is a small example of how to use the implemented techniques to anonymize the control-flow and the contextual information of an event log:

```python
import pm4py
from pm4py.algo.anonymization.trace_variant_query import algorithm as trace_variant_query
from pm4py.algo.anonymization.pripel import algorithm as pripel

log = pm4py.read_xes("EventLogName.xes")
epsilon = 0.5
# Step 1: anonymize the control-flow (trace-variant counts) with SaCoFa.
sacofa_result = trace_variant_query.apply(log=log, variant=trace_variant_query.Variants.SACOFA,
                                          parameters={"epsilon": epsilon, "k": 30, "p": 4})
# Step 2: enrich the anonymized control-flow with contextual information using PRIPEL.
anonymized_log = pripel.apply(log=log, trace_variant_query=sacofa_result, epsilon=epsilon)
```

To anonymize the control-flow, SaCoFa and the Laplacian mechanism insert noise into the trace-variant counts through the stepwise construction of a prefix tree. To use the Laplacian mechanism, set variant=trace_variant_query.Variants.LAPLACE. Given an event log, both algorithms are configured with the following parameters:

  • epsilon: The strength of the differential privacy guarantee. The smaller the value of epsilon, the stronger the privacy guarantee that is provided.
  • $\boldsymbol{k}$: The maximal length of considered traces in the prefix tree. This parameter governs the runtime complexity of both algorithms, which is $\mathcal{O}(|A|^k)$, with $A$ being the set of activities for which events have been recorded in the log. We recommend setting $k$ so that roughly 80% of all traces from the original event log are covered. Setting $k$ to the maximum prefix length in the original log, by contrast, might lead to overfitting towards long traces.
  • $\boldsymbol{p}$: The pruning parameter, which denotes the minimum count a prefix must have in order not to be discarded. Pruning mitigates the exponential, $k$-dependent runtime of the algorithms.
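
How epsilon, $k$, and $p$ interact can be illustrated with a toy prefix tree over plain tuples of activity names. This is a sketch under stated assumptions, not the SaCoFa or PM4Py implementation: the even split of the privacy budget across the $k$ levels, the sensitivity of 1, and the function names choose_k and noisy_prefix_counts are all hypothetical.

```python
import math
import random
from collections import Counter


def choose_k(traces, coverage=0.8):
    """Smallest length k such that at least `coverage` of the traces have length <= k."""
    if not traces:
        raise ValueError("at least one trace is required")
    lengths = sorted(len(trace) for trace in traces)
    index = max(0, math.ceil(coverage * len(lengths)) - 1)
    return lengths[index]


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_prefix_counts(traces, epsilon, k, p, seed=None):
    """Build a prefix tree level by level, adding noise to counts and pruning below p."""
    rng = random.Random(seed)
    activities = sorted({activity for trace in traces for activity in trace})
    # Assumption: budget split evenly over k levels, each count has sensitivity 1.
    scale = k / epsilon
    kept = {(): float(len(traces))}
    frontier = [()]
    for depth in range(1, k + 1):
        true_counts = Counter(tuple(t[:depth]) for t in traces if len(t) >= depth)
        next_frontier = []
        for prefix in frontier:
            # Every possible extension is a candidate, hence the |A|^k worst case.
            for activity in activities:
                candidate = prefix + (activity,)
                noisy = true_counts.get(candidate, 0) + laplace_noise(scale, rng)
                if noisy >= p:  # pruning: discard prefixes with a low noisy count
                    kept[candidate] = noisy
                    next_frontier.append(candidate)
        frontier = next_frontier
    return kept


if __name__ == "__main__":
    toy_log = [("register", "check", "pay"), ("register", "pay"), ("register", "check", "pay")]
    k = choose_k(toy_log)
    print(k, noisy_prefix_counts(toy_log, epsilon=0.5, k=k, p=1, seed=42))
```

A smaller epsilon increases the noise scale, so more spurious prefixes pass the threshold and more real ones are lost. A larger $p$ shrinks the frontier at each level, which is what keeps the exponential growth in check.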

To anonymize contextual information, such as timestamps and resources, PRIPEL enriches a control-flow anonymized event log with contextual information while still achieving differential privacy. PRIPEL requires the original event log and the corresponding result of the control-flow anonymization as input. The approach is fine-tuned by setting the following parameters:

  • epsilon: The strength of the differential privacy guarantee. The epsilon value for PRIPEL and the epsilon value for the adopted control-flow anonymization should be the same.
  • Blocklist: Some event logs contain attributes that are equivalent to a case ID. For privacy reasons, such attributes must be deleted from the anonymized log. We handle such attributes with this list. For example, in a hospital, the case ID could be based on a patient visit. The patient ID, however, could equally serve as a case ID and should therefore be omitted.
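
The effect of a blocklist can be sketched on a log represented as plain Python lists of event dictionaries. The attribute names, such as patient_id, and the function name are hypothetical, and the sketch only illustrates the idea of dropping case-ID-equivalent attributes; it does not show how PRIPEL in PM4Py accepts or applies its blocklist.

```python
def drop_blocklisted_attributes(traces, blocklist):
    """Return a copy of the log without the blocklisted event attributes.

    `traces` is a list of traces, and each trace is a list of event dicts.
    The input log is left unchanged.
    """
    blocked = set(blocklist)
    return [
        [{key: value for key, value in event.items() if key not in blocked} for event in trace]
        for trace in traces
    ]


if __name__ == "__main__":
    # Hypothetical hospital log: the case is a patient visit, but patient_id
    # identifies the same person across events and acts like a case ID.
    hospital_log = [
        [
            {"concept:name": "admission", "org:resource": "nurse_1", "patient_id": "P-17"},
            {"concept:name": "treatment", "org:resource": "doctor_3", "patient_id": "P-17"},
        ]
    ]
    print(drop_blocklisted_attributes(hospital_log, ["patient_id"]))
```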

[^1]: Fahrenkrog-Petersen, S. A., Kabierski, M., Rösel, F., van der Aa, H. and Weidlich, M. SaCoFa: Semantics-aware Control-flow Anonymization for Process Mining. 3rd International Conference on Process Mining (ICPM), 72-79 (2021). https://doi.org/10.1109/ICPM53251.2021.9576857

[^2]: Mannhardt, F., Koschmider, A., Baracaldo, N. et al. Privacy-Preserving Process Mining. Bus Inf Syst Eng 61, 595–614 (2019). https://doi.org/10.1007/s12599-019-00613-3

[^3]: Fahrenkrog-Petersen, S. A., van der Aa, H. and Weidlich, M. PRIPEL: Privacy-Preserving Event Log Publishing Including Contextual Information. Business Process Management. BPM 2020. Lecture Notes in Computer Science, vol 12168. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58666-9_7

[^4]: https://github.com/IBM/differential-privacy-library

henrikkirchmann avatar Oct 13 '22 20:10 henrikkirchmann

Dear @henrikkirchmann

Thanks for your contribution. Your packaging looks fine! We will consider integration for the next major PM4Py release (2.3.x)

fit-alessandro-berti avatar Oct 14 '22 06:10 fit-alessandro-berti

Hello @fit-alessandro-berti, this is great to hear! Do you already have an estimate of when PM4Py 2.3 will be released? Please let me know if I can help with the documentation for this package on the https://pm4py.fit.fraunhofer.de/documentation site :)

henrikkirchmann avatar Oct 23 '22 07:10 henrikkirchmann

Ehi @henrikkirchmann we do not know yet with certainty. We are waiting for a couple of modules to be fully ready.

Should be weeks rather than months though :)

fit-alessandro-berti avatar Oct 23 '22 07:10 fit-alessandro-berti