Privacy-Preserving Process Mining with PM4Py
As discussed with @fit-alessandro-berti and @s-j-v-zelst, we implemented privacy-preserving process mining techniques in PM4Py. We support two anonymization steps:
- The anonymization of control-flow information with SaCoFa[^1] and the Laplacian mechanism[^2], in the `pm4py.algo.anonymization.trace_variant_query` package.
- The anonymization of contextual information with PRIPEL[^3], which requires the anonymization of the control flow as a first step, in the `pm4py.algo.anonymization.pripel` package.
We protect private data with differential privacy by introducing noise into the event logs. For PRIPEL we use the Diffprivlib library[^4]. For now, we handle this library requirement the same way the earth mover's distance handles its pyemd requirement in https://github.com/pm4py/pm4py-core/blob/release/pm4py/evaluation/init.py.
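The dependency handling mentioned above can be sketched as an import guard. This is a minimal illustration that assumes the check is a simple "is the module installed?" test; the helper name `is_installed` and the flag `PRIPEL_AVAILABLE` are hypothetical, and the actual check in PM4Py may differ.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Hypothetical flag: only expose PRIPEL when Diffprivlib is available.
PRIPEL_AVAILABLE = is_installed("diffprivlib")

if PRIPEL_AVAILABLE:
    print("diffprivlib found, PRIPEL can be used")
else:
    print("diffprivlib missing, PRIPEL is skipped")
```

With a guard like this, the rest of the package stays importable for users who have not installed the optional library.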
Here is a small example of how to use the implemented techniques to anonymize the control flow and the contextual information of an event log:
```python
import pm4py
from pm4py.algo.anonymization.trace_variant_query import algorithm as trace_variant_query
from pm4py.algo.anonymization.pripel import algorithm as pripel

log = pm4py.read_xes("EventLogName.xes")
epsilon = 0.5

# Step 1: anonymize the control flow with SaCoFa.
sacofa_result = trace_variant_query.apply(log=log, variant=trace_variant_query.Variants.SACOFA,
                                          parameters={"epsilon": epsilon, "k": 30, "p": 4})

# Step 2: add contextual information with PRIPEL, using the same epsilon.
anonymized_log = pripel.apply(log=log, trace_variant_query=sacofa_result, epsilon=epsilon)
```
To anonymize the control flow, SaCoFa and the Laplacian mechanism insert noise into the trace-variant counts through the stepwise construction of a prefix tree. To use the Laplacian mechanism, set `variant=trace_variant_query.Variants.LAPLACE`. Given an event log, the algorithms are configured with the following parameters:
- epsilon: The strength of the differential privacy guarantee. The smaller the value of epsilon, the stronger the privacy guarantee that is provided.
- $\boldsymbol{k}$: The maximal length of the traces considered in the prefix tree. This parameter governs the runtime complexity of both algorithms, which is $\mathcal{O}(|A|^k)$, with $A$ being the set of activities for which events have been recorded in the log. We recommend setting $k$ so that roughly 80% of all traces from the original event log are covered. Setting $k$ to the maximum prefix length in the original log can lead to overfitting towards long traces.
- $\boldsymbol{p}$: The pruning parameter, which denotes the minimum count a prefix must have in order not to be discarded. Pruning mitigates the $k$-dependent exponential runtime of the algorithms.
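The roles of the three parameters can be illustrated with a small standalone sketch. It is not the SaCoFa implementation and does not use PM4Py: it only picks $k$ from trace lengths so that about 80% of traces are covered, adds Laplace noise with scale $1/\epsilon$ to some prefix counts (assuming a sensitivity of 1), and discards prefixes whose noisy count falls below $p$. All function names and the toy data are hypothetical.

```python
import math
import random


def choose_k(trace_lengths, coverage=0.8):
    """Smallest k such that at least `coverage` of the traces have length <= k."""
    if not trace_lengths:
        raise ValueError("trace_lengths must not be empty")
    ordered = sorted(trace_lengths)
    # Index of the trace that completes the requested share of the log.
    index = math.ceil(coverage * len(ordered)) - 1
    return ordered[max(index, 0)]


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_pruned_counts(prefix_counts, epsilon, p, rng):
    """Add Laplace noise to each count and keep prefixes with noisy count >= p."""
    scale = 1.0 / epsilon  # assumption: each count has sensitivity 1
    kept = {}
    for prefix, count in prefix_counts.items():
        noisy = round(count + laplace_noise(scale, rng))
        if noisy >= p:
            kept[prefix] = noisy
    return kept


# Toy data: one very long trace should not dictate k.
lengths = [3, 4, 4, 5, 5, 6, 7, 9, 12, 40]
k = choose_k(lengths, coverage=0.8)

# Toy prefix counts for prefixes of length 2.
counts = {("register", "triage"): 25, ("register", "discharge"): 2, ("register", "x-ray"): 9}
rng = random.Random(42)
kept = noisy_pruned_counts(counts, epsilon=0.5, p=4, rng=rng)
print(k, kept)
```

A smaller `epsilon` increases the noise scale, so more of the small counts end up on the wrong side of the pruning threshold; that is the utility cost of a stronger guarantee.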
To anonymize contextual information, such as timestamps and resources, PRIPEL enriches a control-flow-anonymized event log with contextual information while still achieving differential privacy. PRIPEL requires the original event log and the corresponding result of the control-flow anonymization as input. The approach is fine-tuned by setting the following parameters:
- epsilon: The strength of the differential privacy guarantee. The epsilon value for PRIPEL and the epsilon value for the adopted control-flow anonymization should be the same.
- Blocklist: Some event logs contain attributes that are equivalent to a case ID. For privacy reasons, such attributes must be deleted from the anonymized log, and this list names them. For example, in a hospital the case ID could be based on a patient visit. The patient ID could then serve as an equivalent case ID and should therefore be omitted.
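The effect of the blocklist can be sketched in plain Python. This only illustrates the idea of dropping case-identifying attributes from events; it is not how PRIPEL is implemented, and the helper `drop_blocklisted`, the attribute `patient_id`, and the sample events are hypothetical.

```python
def drop_blocklisted(events, blocklist):
    """Return copies of the events without any attribute named in the blocklist."""
    blocked = set(blocklist)
    return [
        {key: value for key, value in event.items() if key not in blocked}
        for event in events
    ]


# Toy events of one hospital visit; "patient_id" identifies the case indirectly.
events = [
    {"concept:name": "Register", "org:resource": "nurse_1", "patient_id": "P-17"},
    {"concept:name": "Triage", "org:resource": "nurse_2", "patient_id": "P-17"},
]

cleaned = drop_blocklisted(events, blocklist=["patient_id"])
print(cleaned)
```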
[^1]: Fahrenkrog-Petersen, S. A., Kabierski, M., Rösel, F., van der Aa, H. and Weidlich, M. SaCoFa: Semantics-aware Control-flow Anonymization for Process Mining. 3rd International Conference on Process Mining (ICPM), 72-79 (2021). https://doi.org/10.1109/ICPM53251.2021.9576857
[^2]: Mannhardt, F., Koschmider, A., Baracaldo, N. et al. Privacy-Preserving Process Mining. Bus Inf Syst Eng 61, 595–614 (2019). https://doi.org/10.1007/s12599-019-00613-3
[^3]: Fahrenkrog-Petersen, S. A., van der Aa, H. and Weidlich, M. PRIPEL: Privacy-Preserving Event Log Publishing Including Contextual Information. Business Process Management. BPM 2020. Lecture Notes in Computer Science, vol 12168. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58666-9_7
[^4]: https://github.com/IBM/differential-privacy-library
Dear @henrikkirchmann
Thanks for your contribution. Your packaging looks fine! We will consider integration for the next major PM4Py release (2.3.x)
Hello @fit-alessandro-berti, this is great to hear! Do you already have an estimate of when PM4Py 2.3 will be released? Please let me know if I can help with the documentation on the https://pm4py.fit.fraunhofer.de/documentation site for this package :)
Hey @henrikkirchmann, we do not know yet with certainty. We are waiting for a couple of modules to be fully ready.
Should be weeks rather than months though :)