Scalable Process Discovery with Guarantees

Ref:

Sander Leemans, Dirk Fahland, Wil van der Aalst: Scalable process discovery and conformance checking, Software & Systems Modeling.

Summary (partial)

Scalable $\implies$ run time of process discovery algorithm is independent from number of traces in the log

Independence of the number of activities cannot be achieved, as activities are part of the output of any technique. For such cases, it is better to adopt for decomposing process discovery

Scalable algorithms: (that does not provide soundness guarantee)

alpha miner and it's derivatives:

requires just a single pass over the event log.

Guarantees rediscoverability

But they do not guarantee sound models.

Non scalable algorithms: (provides guarantees, but not scalable)

Genetic process mining and Evolutionary tree miners :

Requires multiple pass over event log

Guarantees soundness

No rediscoverability guarantee

Inductive miner:

Requires multiple pass over event log

Guarantees both soundess and rediscoverability

The technique described in the paper -- Inductive miner directly-follows($IM_D$) is scalable and gaurentees both soundness and rediscoverability

Adapts Inductive miner framework to recurse on directly-follows graph rather than event logs

Directly-follows graphs can be computed in a single pass over the event log, and their computation can even be parallelised.

The size of a directly-follows graph only depends quadratically on the number of activities in a log, and is independent of the number of traces or events

Therefore, this adaptation combines the scalability of directly-follows graphs, as used by the Heuristics Miner (HM) and the α-algorithm (α), with guarantees such as sound- ness and rediscoverability as provided by the IM framework

$IM_D$ Process discovery

Let the petrinet be $N = (P,T,F)$

Where

$P $: set of places

$T $: set of transitions

Arcs $F \subseteq (P \times T) \cup (T \times P)$

Process discovery involves finding all the places $P$ and arcs $F$. In the $IM_D$ framework, the following steps are involved

1) Finding the directly follows graph $D$

Let graph $D = (\mathbf{\Sigma}, E)$ be obtained by identifying directly follows relations that occur more than $t_{freq}$ times in a log $L$

Where

$\mathbf{\Sigma} : $ Set of all activities observed in log $L$

$E = \{(a,b) \in \mathbf{\Sigma} \times \mathbf{\Sigma} \ \mid \ \#(a \to b) \geq t_{freq} \}$

$D \in \mathcal{G} \qquad$ where $\mathcal{G}$ is a space of all directly follows graph

2) Directly follows graph is recursively partitioned by identifying cuts, until a base case is reached. Lets first define the terms,

Partition : Non-overlapping division of activities in $D$

Cut ($C$) : Partition combined with one of the following process tree operator. $C \in \{\times, \to, \wedge,\circlearrowright \}$

$\times$ : Exclusive choice, where the partitions are disconnected

$\to$ : Sequential behaviour, where the partitions are sequentially connected

$\wedge$ : Parallel behaviour, where the partitions can occur concurrently

$\circlearrowright$ : Each partition should contain a clear set of start and end activities and the loop between partitions should go through these activities

$IM_D$ searches for cuts that will partition the graph $D$ in accordance with one of the process tree operators.

Cut detection ($CD$) has been implemented using standard graph algorithms (connected components, strongly connected components)

Ref : Leemans, S., Fahland, D., van der Aalst, W.: Discovering block-structured process models from event logs - a constructive approach. In: Petri Nets 2013. LNCS, vol. 7927, pp. 311–329. Springer (2013)

Cut detection $CD : D \to (C, \{D_1, D_2, ..., D_K\})$

where

$C$ is a cut (one of the process tree operator)

$D_i \quad i \in [1,K] \quad$ are directly-follows graphs containing partitioned activites

$D, D_i \in \mathcal{G}$

$IM_D$ recurses on each of the new directly-follows graph $D_i$ until one of the following base case is reached:

1) $D_i$ contains only one activity

2) Fall-through : No cuts are possible. Any behaviours are possible consisting of activities in this partition.

Each of these recursions returns a process tree that is inserted as a child.

Places $P$ and arcs $F$ of a petri net can be derived from the process tree

Thus we have all the parameters required to build a process model

Complexity of $IM_D$

For $\mid \mathbf{\Sigma} \mid = n$,

cut detection is $O(n^2)$

Each recursion removes atleast one activity from the graph $\implies IM_D$ is $O(n^3)$

Directly follows graph requires just a single pass over event log $L$ and can be parallelized

Guarantees of $IM_D$

Process trees are inherently sound

$IM_D$ guarantees rediscoverability on the same class of models, i.e. assuming that the model is representable by a process tree without using duplicate activities, and it is not possible to start loops with an activity they can also end with

Ref : Leemans, S., Fahland, D., van der Aalst, W.: Discovering block-structured process models from event logs - a constructive approach. In: Petri Nets 2013. LNCS, vol. 7927, pp. 311–329. Springer (2013)

Handling infrequent behaviour

Infrequent behaviours may introduce edges which may make it impossible to detect an otherwise strong cut

Therefore, whenever fall-through occurs, edges with frequency less than certain threshold $t_{freq}$ are removed and cuts are computed again.

$t_{freq}(a) = h \times max_c(\#(a \to c)) \qquad$ where $h \in (0,1)$

i.e, we keep edges $(a \to b)$, which occur frequently enough compared to most occuring outgoing edge of $a$

Handling incomplete behaviour

TODO

Evaluation

TODO