Business Intelligence on Non-Conventional Data
Total Page:16
File Type:pdf, Size:1020Kb
Alma Mater Studiorum - Universit`adi Bologna Dottorato di ricerca in Computer Science and Engineering Ciclo XXIX Settore concorsuale di afferenza: 09/H1 - SISTEMI DI ELABORAZIONE DELLE INFORMAZIONI Settore scientifico disciplinare: ING-INF/05 - SISTEMI DI ELABORAZIONE DELLE INFORMAZIONI Business Intelligence on Non-Conventional Data Enrico Gallinucci Coordinatore Dottorato Relatore Prof. Paolo Ciaccia Prof. Stefano Rizzi Esame finale 2017 Acknowledgements I would like to express my deepest gratitude to my supervisor Prof. Stefano Rizzi and to my advisor Prof. Matteo Golfarelli for their continuous support, patience and motivation during these three research years. Their mentoring has provided an invaluable contribution to both my professional and personal growth. My sincere thanks also goes to Prof. Alberto Abelló and Prof. Oscar Romero for their collaboration and support during my research period in Spain at Universitat Politècnica de Catalunya, Barcelona. I would like to thank Prof. Enrico Denti, Prof. Claudio Sartori, Prof. Robert Wrembel and Prof. Esteban Zimanyi for reviewing this thesis. I wish to thank my dear colleagues in Cesena and Barcelona for the effective teamwork, the stimulating discussions and the strong camaraderie. Finally, I would like to thank my family and my closest friends for making this journey easier with their unquestioned love and support. Abstract The revolution in digital communications witnessed over the last decade had a significant impact on the world of Business Intelligence (BI). In the big data era, the amount and diversity of data that can be collected and analyzed for the decision-making process transcends the restricted and structured set of internal data that BI systems are conventionally limited to. This thesis investigates the unique challenges imposed by three specific categories of non-conventional data: social data, linked data and schemaless data. Social data comprises the user-generated contents published through websites and social media, which can provide a fresh and timely perception about people’s tastes and opinions. In Social BI (SBI), the analysis focuses on topics, meant as specific concepts of interest within the subject area. In this context, this thesis proposes meta-star, an alternative strategy to the traditional star-schema for modeling hierarchies of topics to enable OLAP analyses. The thesis also presents an architectural framework of a real SBI project and a cross-disciplinary benchmark for SBI. Linked data employ the Resource Description Framework (RDF) to provide a public network of interlinked, structured, cross-domain knowledge. In this context, this thesis proposes an interactive and collaborative approach to build aggregation hierarchies from linked data. Schemaless data refers to the storage of data in NoSQL databases that do not force a predefined schema, but let database instances embed their own local schemata. In this context, this thesis proposes an approach to determine the schema profile of a document-based database; the goal is to facilitate users in a schema-on-read analysis process by understanding the rules that drove the usage of the different schemata. A final and complementary contribution of this thesis is an innovative technique inthe field of recommendation systems to overcome user disorientation in the analysis ofa large and heterogeneous wealth of data. Table of contents List of figures xi List of tables xv 1 Introduction1 1.1 Business Intelligence . .1 1.2 Motivations and Contributions . .4 1.2.1 Social BI . .6 1.2.2 Exploratory BI . .7 1.2.3 Pervasive BI . .7 2 Background9 2.1 Data Warehouse . .9 2.2 DW Architecture . 12 2.3 OLAP Analysis . 14 2.4 Multidimensional Model . 14 2.5 Logical modeling: the star schema . 16 3 Meta-star: a modeling approach for social data aggregation 19 3.1 Introduction . 19 3.1.1 Motivation . 21 3.1.2 Goal . 22 3.2 Related Literature . 24 3.3 Meta-Stars . 27 3.3.1 Slowly-Changing Topics and Levels . 33 3.4 Querying Meta-Stars . 34 3.4.1 Translating Group-by Components into SQL . 38 3.4.2 Translating Semantic Filters into SQL . 38 3.4.3 Translating Selections into SQL . 39 viii Table of contents 3.4.4 Impact of Static Levels . 40 3.5 Query Execution Plans and Cost Model for Meta-Stars . 41 3.6 Evaluation . 43 3.6.1 Time (for Accessing Topics) . 45 3.6.2 Time (for Accessing Facts) . 46 3.6.3 Queries with Semantic Topic Aggregation . 47 3.6.4 Space . 48 3.7 Conclusions . 48 4 An architectural and methodological framework for Social BI 51 4.1 Introduction . 51 4.2 Related Literature . 52 4.3 Architectural and Methodological Framework . 53 4.4 A Case Study on EU Politics . 56 4.5 Architectural Options . 57 4.5.1 Analysis . 57 4.5.2 ODS . 58 4.5.3 Crawling . 59 4.5.4 Semantic Enrichment . 60 4.6 Case Study Analysis . 61 4.6.1 Effectiveness . 62 4.6.2 Efficiency . 64 4.6.3 Sustainability . 65 4.7 Conclusions . 66 5 SABINE: a modular benchmark for Social BI 69 5.1 Introduction . 69 5.2 The Content of SABINE . 71 5.2.1 Topics and Mappings . 71 5.2.2 Clips and Annotations . 73 5.2.3 Multidimensional Cubes . 74 5.3 SABINE Construction Techniques . 75 5.3.1 Ontology Design . 76 5.3.2 Crawling . 76 5.3.3 Text Analysis . 77 5.3.4 Topic Search . 77 5.3.5 Sentiment Analysis . 78 Table of contents ix 5.3.6 Data Linking . 80 5.4 Research Tasks . 82 5.4.1 Content Analysis Tasks . 82 5.4.2 Semantic Analysis Tasks . 85 5.4.3 SBI Analytics Tasks . 86 5.5 Conclusions . 87 6 iMOLD: a collaborative approach for Exploratory BI on linked data 89 6.1 Introduction . 89 6.2 Related Literature . 91 6.3 Background . 93 6.3.1 Multidimensional modeling . 93 6.3.2 Linked data modeling . 94 6.4 Approach Overview . 95 6.5 Aggregation Patterns in Ontologies . 99 6.5.1 Association-Based Patterns . 99 6.5.2 Generalization-Based Patterns . 102 6.6 Acquisition . 104 6.6.1 Acquisition of Association-Based Patterns . 106 6.6.2 Acquisition of Generalization-Based Patterns . 109 6.6.3 Mixed patterns . 114 6.6.4 User Experience . 115 6.6.5 Collaboration and Reuse . 118 6.7 Case Study and User Evaluation . 119 6.7.1 Case Study . 120 6.7.2 User Evaluation . 122 6.8 Conclusions . 125 7 Profiling hidden schemata on schemaless data 127 7.1 Introduction . 127 7.2 Related Literature . 130 7.3 Requirements for Schema Profiling . 133 7.4 Formal Background . 134 7.5 Evaluating Schema Profiles . 137 7.5.1 Explicativeness . 137 7.5.2 Precision . 138 7.5.3 Conciseness . 140 x Table of contents 7.6 Building Schema Profiles . 142 7.7 Experimental Results . 146 7.7.1 Effectiveness . 147 7.7.2 Efficiency . 153 7.8 Conclusions . 155 8 Cubeload: a benchmark generator of OLAP sessions 157 8.1 Introduction . 157 8.2 Related Literature . 159 8.3 Overview . 160 8.4 The Workload Model . 161 8.5 Session Templates . 163 8.6 Experiments . 165 8.7 Conclusions . ..