Profiling Data and Beyond: Gaining Insights from Metadata
Total Page:16
File Type:pdf, Size:1020Kb
Profiling Data and Beyond: Gaining Insights from Metadata Inauguraldissertation zur Erlangung des akademischen Grades eines Doktors der Wirtschaswissenschaen durch die Wirtschaswissenschaliche Fakultät der Westfälischen Wilhelms-Universität Münster Vorgelegt von Fabian Schomm-von Auenmüller, MSc aus Viersen Mai 2018 1 Dekanin Prof. Dr. Theresia Theurl Erster Gutachter Prof. Dr. Gofried Vossen Zweiter Gutachter Prof. Dr. Stephan Meisel Mündliche Prüfung 06.07.2018 2 Contents 1 Introduction 1 1.1 Aims of this Thesis . 2 1.2 Structure of this Thesis . 2 1.3 Running Scenario: The CloudHost Case . 3 I State of the Art 5 2 Profiling Fundamentals 6 2.1 Definition . 6 2.2 Application Areas . 7 2.2.1 Offender Profiling . 8 2.2.2 Crime Prevention Profiling . 9 2.2.3 Racial Profiling . 11 2.2.4 Customer Profiling . 12 2.2.5 DNA Profiling . 13 2.2.6 Software Profiling . 14 2.2.7 Data Profiling . 16 2.3 A Model for Profiling Activities . 16 2.3.1 Profile Creation . 17 2.3.2 Profile Usage . 18 2.4 From Description to Prescription: A Classification . 19 3 Problem Identification and Characterization 22 3.1 Unfamiliar Datasets . 23 3.1.1 Relational Data Model . 24 3.1.2 Key-Value Data Model . 25 3.1.3 Document-Oriented Data Model . 26 3.1.4 Graph-Oriented Data Model . 26 3.2 Typical Tasks and Objectives . 27 3.3 Information Overload . 29 3.4 Data Comprehension . 31 3.5 User Characterization . 32 3.5.1 Skill dimensions . 32 3.5.2 Multiple Users . 33 3.6 Common Solution Strategies . 34 i Contents 4 Data Profiling — State of the Art 36 4.1 Definition . 36 4.2 Metadata Typology . 39 4.2.1 Common Types of Metadata . 40 4.2.2 Classifying Types of Metadata . 59 4.2.3 Metadata Extraction Process . 65 4.3 Three Challenges of Data Profiling . 65 4.3.1 Managing the Input . 66 4.3.2 Performing the Computations . 66 4.3.3 Interpreting the Output . 67 4.4 Software Tools and Research Projects . 68 4.4.1 Dedicated Tools . 68 4.4.2 Integrated Tools . 71 4.4.3 Research Projects . 71 5 Data Profiling Variants 74 5.1 Data-Oriented Data Profiling . 75 5.2 Goal-Oriented Data Profiling . 78 5.3 Multi-Source Data Profiling . 80 5.4 Data Profile Validation . 83 5.5 Visual Data Exploration . 86 5.6 Data Discovery . 91 5.7 Summary . 94 II Beyond Data Profiling 96 6 Broadening the Scope of Metadata 97 6.1 Distinguishing Intrinsic and Extrinsic Metadata . 97 6.2 Revisiting the CloudHost Case . 98 6.3 Types of Extrinsic Metadata . 100 6.4 Sources of Extrinsic Metadata and Their Extraction . 109 6.5 Bringing Intrinsic and Extrinsic Metadata Together . 114 7 Profiling More Than Data 116 7.1 The Generic Application Framework . 116 7.2 Profiling the GAF Levels . 117 7.2.1 Process Profiling . 117 7.2.2 Application Profiling . 120 7.3 Proposed Appproaches . 122 7.3.1 Data-First Approach . 122 7.3.2 Application-First Approach . 124 7.3.3 Process-First Approach . 127 7.3.4 Combined Approaches . 129 ii Contents 8 Conclusion 130 8.1 Summary . 130 8.2 Outlook . 132 Bibliography 135 List of Web Pages 149 iii List of Figures 2.1 Examples of different types of profiles . 7 2.2 Satirical depiction of “hard” racial profiling . 11 2.3 Generic model for profiling activities . 17 2.4 Classification of profiling activities according to main purpose . 21 3.1 Example excerpt of two CloudHost tables . 25 3.2 CloudHost employee example as key-value pairs . 26 3.3 CloudHost employee example in JSON . 26 3.4 CloudHost employee example as a graph . 27 3.5 Classification of users according to skill level . 33 4.1 Profiling activities model applied to data profiling . 36 4.2 Classification of data profiling tasks according to Abedjan et al. 61 4.3 Data profiling outputs according to Olson ................... 63 4.4 Proposed process for the extraction of metadata . 65 4.5 Column analysis for the department column using Talend . 69 4.6 Default profile of the employee dataset using Datamartist . 70 4.7 Architecture of the Metanome profiling platform . 72 5.1 Characterization dimensions for data profiling variants . 75 5.2 Data-oriented data profiling . 75 5.3 Goal-oriented data profiling . 78 5.4 Multi-source data profiling . 80 5.5 Data profile validation . 84 5.6 Visual data exploration . 86 5.7 CityPlot example . 89 5.8 Zoomed out spreadsheet view of the CloudHost portfolio dataset . 90 5.9 Data discovery . 91 5.10 Data profiling super process . 95 6.1 Use case diagram for the sales example . 99 6.2 Model of the data flow in an exemplary project seminar . 111 6.3 Extrinsic metadata extraction process . 113 6.4 Combined metadata extraction process . 114 6.5 Conceptual overview of all discussed data profile types . 115 7.1 Generic Application Framework . 117 7.2 Data-first approach . 123 iv List of Figures 7.3 Application-first approach . 125 7.4 Process-first approach . 128 7.5 Customer journey . 129 8.1 Relation between ERM and data models . 134 v List of Tables 4.1 Sample CloudHost employee dataset: “emp” . 41 4.2 Value Frequency Table of the department Column . 52 4.3 List of single field metadata types . 54 4.4 CloudHost customers dataset . 57 4.5 List of multi field metadata types . 58 5.1 Data profile for the data discovery of the portfolio dataset . 93 6.1 Excerpt of the CloudHost sales data . 99 6.2 List of permissible privileges in MySQL . 101 6.3 List of extrinsic metadata types . 109 vi 1 Introduction The rate at which data is generated increases every day [35], and at the same time, the cost for data storage is continually decreasing [15]. This development allows many organizations to store vast amounts of data with little regard as to what should be done with it. The raw data however has no inherent value by itself, but it represents a source of untapped potential. It has been said that “data is the new oil” [24]. This analogy does not refer to the scarcity or the non-renewability of the resource, but rather to the fact that it needs to be processed and refined to gain value from.