Process Mining on Databases: Extracting Event Data from Real-Life Data Sources
Total Page:16
File Type:pdf, Size:1020Kb
Process mining on databases Citation for published version (APA): Gonzalez Lopez de Murillas, E. (2019). Process mining on databases: extracting event data from real-life data sources. Technische Universiteit Eindhoven. Document status and date: Published: 27/02/2019 Document Version: Publisher’s PDF, also known as Version of Record (includes final page, issue and volume numbers) Please check the document version of this publication: • A submitted manuscript is the version of the article upon submission and before peer-review. There can be important differences between the submitted version and the official published version of record. People interested in the research are advised to contact the author for the final version of the publication, or visit the DOI to the publisher's website. • The final author version and the galley proof are versions of the publication after peer review. • The final published version features the final layout of the paper including the volume, issue and page numbers. Link to publication General rights Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights. • Users may download and print one copy of any publication from the public portal for the purpose of private study or research. • You may not further distribute the material or use it for any profit-making activity or commercial gain • You may freely distribute the URL identifying the publication in the public portal. If the publication is distributed under the terms of Article 25fa of the Dutch Copyright Act, indicated by the “Taverne” license above, please follow below link for the End User Agreement: www.tue.nl/taverne Take down policy If you believe that this document breaches copyright please contact us at: [email protected] providing details and we will investigate your claim. Download date: 09. Oct. 2021 Process Mining on Databases Extracting Event Data from Real-Life Data Sources E. González López de Murillas Process Mining on Databases: Extracting Event Data from Real-Life Data Sources E. González López de Murillas Copyright © 2019 by E. González López de Murillas. All Rights Reserved. CIP-DATA LIBRARY TECHNISCHE UNIVERSITEIT EINDHOVEN González López de Murillas, E. Process Mining on Databases: Extracting Event Data from Real-Life Data Sources by E. González López de Murillas. Eindhoven: Technische Universiteit Eindhoven, 2018. Proefschrift. A catalogue record is available from the Eindhoven University of Technology Library ISBN 978-90-386-4704-3 Keywords: Process mining, Databases, Data extraction, Case notion discovery, Case notion recommendation, Event log building, Event data querying SIKS Dissertation Series No. 2019-03 The research reported in this thesis has been carried out under the auspices of SIKS, the Dutch Research School for Information and Knowledge Systems. Process Mining on Databases: Extracting Event Data from Real-Life Data Sources PROEFSCHRIFT ter verkrijging van de graad van doctor aan de Technische Universiteit Eindhoven, op gezag van de rector magnificus prof.dr.ir. F.P.T. Baaijens, voor een commissie aangewezen door het College voor Promoties, in het openbaar te verdedigen op woensdag 27 februari 2019 om 16:00 uur door Eduardo González López de Murillas geboren te Pamplona, Spanje Dit proefschrift is goedgekeurd door de promotoren en de samenstelling van de pro- motiecommissie is als volgt: voorzitter: prof.dr. O.J. Boxma 1e promotor: prof.dr.ir. H.A. Reijers 2e-promotor: prof.prof.h.c.dr.h.c.dr.ir. W.M.P. van der Aalst (RWTH Aachen) leden: prof.dr. A. Gal (Israel Institute of Technology) dr. M. Montali (Free University of Bozen-Bolzano) dr. G.H.L. Fletcher prof.dr. J. Ezpeleta Mateo (Universidad de Zaragoza) Het onderzoek of ontwerp dat in dit proefschrift wordt beschreven is uitgevoerd in overeenstemming met de TU/e Gedragscode Wetenschapsbeoefening. Dedicated to my little nephew, Mateo. Abstract Process mining as a discipline is gaining importance in recent years. Its focus is on the analysis of business processes by means of process execution data. It is a common misconception to think that a process mining project starts as soon as the data are available. We tend to underestimate the time, knowledge, and number of iterations needed to gather relevant data and build an event log suitable for analysis. In many cases, the data extraction and preparation phase can take up to 80% of the project duration. Improving this time-consuming initial phase will have a large impact on the whole process mining project in terms of time, cost, and quality of insights. In this thesis, we take on several challenges related to data extraction, multi- perspective event log building, and event data querying. We consider environments in which many business processes coexist, sharing data both in a centralized and distributed storage. The aim of this work is twofold. First, to ease the path for practitioners to get started with a process mining project in unknown environments when domain knowledge is not always available. Second, to provide the tools needed for event data querying and to build specialized event logs for a more effective process mining analysis. We propose several techniques that, together, constitute a pipeline connecting databases with existing process mining techniques: • A meta-model for process mining on databases (in Chapter 3). This meta-model aims to standardize the way to capture and structure both data and process aspects of an organization. • Data extraction adapters for several scenarios (in Chapter 4). These adapters, developed for some mainstream scenarios, allow us to extract and transform data from databases to comply with the proposed meta-model structure. • A technique to discover and recommend case notions in order to build event logs for process mining (in Chapter 5). This is achieved by exploring possible views on the data, correlating events by means of the underlying relations extracted from the original database, and assessing their “interestingness” with respect to some metrics. • A data-aware process oriented query language (in Chapter 7). This query lan- guage provides a compact and easy way to express relevant queries in the busi- ness context by means of constructs that exploit the underlying meta-model. The techniques presented in this thesis have been evaluated using representative sample datasets that resemble the most relevant challenges that can be found in real- viii life environments, as well as case studies with real databases. The results show that these techniques succeed at capturing a more comprehensive picture of the systems under study, improving the quality of the generated event logs, and supporting the user in the process of data extraction and log building. Moreover, they are a step ahead towards the goal of automatic data extraction and process mining analysis of complex IT systems. Implementations of our techniques are publicly available and licensed as open source. Preface Honeybees are fascinating creatures. These small insects are incredibly cooperative. They organize themselves according to a very well defined societal structure, divided mainly into reproductive and non-reproductive groups. Individually, they would not be able to survive. However, as a team they can build complicated housing structures called hives where they live, breed, store food, and protect themselves from external agents. These incredible animals do not only produce honey, but also wax, their main construction material. Both products are highly valuable for us humans due to the nutritive value of honey, and the versatility of the wax as a material in many manu- facturing processes. However, the shape in which beehives appear in nature does not allow human beekeepers to take advantage of the honey and wax production without partially or entirely destroying the hive. Nowadays, most organizations produce data as a byproduct of their business ac- tivity, similarly to the hard-working bees producing wax and honey for their survival. In many cases, the produced data is crucial for the functioning of the organization, while at the same time it acts as a record of the execution of their business processes. However, many of such organizations struggle to get the maximal potential from their data. One of the reasons is the disconnect between the data structure that supports their business processes and the data structure that would be suitable for the analysis of their process data. We believe that business process execution and analysis can coexist, sharing com- mon data sources in a sustainable manner. In order to achieve such an equilibrium, data structures need to be reshaped, similarly to the way that beekeepers provide a structure to beehives. When used responsibly, artificial beehives allow the beekeeper to benefit from the production of honey without disrupting the colony. This thesis proposes a way to structure data in order to facilitate data extraction and to enable the analysis of business processes in real-life environments. Contents List of Figures xv List of Tables xxi 1 Introduction 1 1.1 Event Data and Process Mining ........................ 2 1.1.1 Event Data ................................ 2 1.1.2 Process Mining ............................. 4 1.2 Process Mining on Databases ......................... 7 1.3 Challenges in Process Mining on Databases ................. 8 1.3.1 Challenge 1: Finding, Merging, and Cleaning Event Data .... 9 1.3.2 Challenge 2: Dealing with Complex Event Logs Having Diverse Characteristics .............................. 11 1.3.3 Challenge 3: Cross-Organizational Mining ............. 11 1.3.4 Challenge 4: Multi-Perspective Event Log Building ........ 11 1.3.5 Challenge 5: Improve Usability for Non-Experts ......... 12 1.3.6 Challenge 6: Fill the Domain Knowledge Gap in Event Log Extraction ................................ 12 1.3.7 Challenge 7: Question-Driven Log Extraction ........... 12 1.4 Contributions and Structure of this Thesis ................. 13 1.4.1 Thesis Contributions .........................