Experimenting with a Big Data Framework for Scaling a Data Quality Query System
A THESIS SUBMITTED TO THE UNIVERSITY OF MANCHESTER FOR THE DEGREE OF MASTER OF PHILOSOPHY IN THE FACULTY OF SCIENCE AND ENGINEERING

2016

By Sonia Cisneros Cabrera
School of Computer Science

Contents

Abstract 13
Declaration 14
Copyright 15
Dedication 16
Acknowledgements 17

1 Introduction 18
1.1 Motivation 18
1.2 Research Goals and Contributions 20
1.3 Research Method 21
1.4 Thesis Overview 25

2 Background and Related Work 26
2.1 Data Quality: Wrangling and the Big Data Life Cycle 26
2.2 Big Data and Parallelism 31
2.3 Big Data Frameworks: Tools and Projects 33
2.3.1 MapReduce, Hadoop and Spark 35
2.4 Scalability of Data Management Operators 37

3 Designing a Scalable DQ2S 42
3.1 The Data Quality Query System (DQ2S) 42
3.1.1 The Framework 43
3.1.2 The Query 44
3.1.3 The Datasets 46
3.1.4 The Algorithms Included in the DQ2S Database Algebra Implementation 48
3.2 The Artifact 55
3.2.1 The Java DQ2S 55
3.2.2 The Python DQ2S Implementation 58
3.2.3 The Optimised Python DQ2S Implementation 60
3.2.4 The PySpark DQ2S Implementation 62
The Apache Spark Generals 62
Spark Python API 67
3.3 Summary 70

4 Experiment Design 71
4.1 Objectives and Scope of the Experiment 72
4.2 Data and Tools 74
4.3 Setup: Spark Cluster and Application Settings 76
4.3.1 The Apache Spark Local Mode 76
4.3.2 The Apache Spark Standalone Mode 77
4.4 Experiment Procedure 79
4.5 Algorithm Comparison 90
4.6 Experiment Evaluation 92
4.6.1 Performance Metrics 92
4.6.2 Validity Evaluation 94
4.7 Summary 99

5 Evaluation of Experimental Results 100
5.1 Experiment Results 100
5.2 Algorithm Comparison 118
5.3 Summary 139

6 Conclusions and Future Work 140
6.1 Research Summary and Key Results 140
6.2 Research Contributions and Impact 141
6.3 Future Work 142

Bibliography 146

A DQ2S Source Code 162
A.1 The Java DQ2S 162
A.2 The Python DQ2S 175
A.3 The Optimised Python DQ2S 186
A.4 The PySpark DQ2S 191

B Apache Spark Set-up 198
B.1 Configuration of Spark's Basic Elements 198
B.2 Single Machine Deployment 199
B.3 PySpark Submission to the DPSF Cluster 201

C Testing Results 205
C.1 Raw Results 205
C.2 Performance Metrics Results 224

D Apache Spark User Level Proposal 227

Word Count: 37 242

List of Tables

1.1 The Design Science Research guidelines 23
2.1 Big data life cycle processes: definitions 30
2.2 Big data frameworks available for data quality tasks 34
2.3 Summary of the most relevant state-of-the-art related studies 41
3.1 Operators comprising the DQ2S Timeliness query 50
3.2 Java DQ2S technical description 58
3.3 Python DQ2S technical description 60
3.4 Optimised Python DQ2S technical description 62
3.5 PySpark DQ2S technical description 70
4.1 Research Question's solution plan 73
4.2 Number of rows and size of each input dataset utilised in the experimentation 74
4.3 Hardware specification of the desktop machine (PC) and the cluster (DPSF) utilised in the experimentation 75
4.4 Tools used in the empirical evaluation of the research 76
4.5 Spark configuration locations 79
4.6 General overview of the experiment's aim 79
4.7 Number of runtimes and averages collected per experiment 80
4.8 Conclusion validity threats applicable to the research presented: details and addressing approach 96
4.9 Internal validity threats applicable to the research presented: details and addressing approach 97
4.10 Construct validity threats applicable to the research presented: details and addressing approach 98
4.11 External validity threats applicable to the research presented: details and addressing approach 99
5.1 Relation of the dataset sizes and results when executing the Optimised Python DQ2S on a single machine for Experiment 4 104
5.2 Size in bytes of the first dataset that caused a failure (12 000 000 rows), and the maximum number of rows processed by the Optimised Python DQ2S (11 500 000) 104
5.3 Reduction in seconds achieved by different configurations using 1 worker (W) and 1 to 4 cores (C) with the Standalone mode on a single machine 108
5.4 Times Slower: relation between the runtime results of the PySpark instance with 1 thread (sequential) and 8 threads (best time), where local[1] < local[8] indicates the PySpark DQ2S in local mode with 1 thread is slower than the same implementation with 8 threads as many times as indicated 111
5.5 Size in GB of the datasets processed on the cluster 115
5.6 Relation of the dataset sizes and results when executing the Optimised Python DQ2S on the cluster for Experiment 9 117
5.7 Times Slower: relation between the original Java and Python runtime results, where Python < Java indicates the Python DQ2S is slower than the Java implementation as many times as indicated 118
5.8 Times Slower: relation between the Python and Optimised Python runtime results, where Python < Optimised Python indicates the Python DQ2S is slower than the Optimised Python implementation as many times as indicated 122
5.9 Times Slower: relation between the original Java and Optimised Python runtime results, where Java < Optimised Python indicates the Java DQ2S is slower than the Optimised Python implementation as many times as indicated 122
5.10 Times Slower: relation between the runtime results of the Optimised Python and PySpark with 1 thread, where local[1] < Optimised Python indicates the PySpark DQ2S in local mode with 1 thread is slower than the Optimised Python implementation as many times as indicated 124
5.11 Times Slower: relation between the runtime results of the Optimised Python and PySpark with 8 threads, where local[8] < Optimised Python indicates the PySpark DQ2S in local mode with 8 threads is slower than the Optimised Python implementation as many times as indicated 124
5.12 Times Slower: relation between the runtime results of the Optimised Python and PySpark with 1 and 8 threads for 11 500 000 rows, where the notation implementation < implementation indicates the left implementation is slower than the implementation on the right as many times as indicated 126
B.1 Apache Spark processes' default values for Standalone mode, where NA stands for "Not Applicable" 198
B.2 Apache Spark processes' general configuration parameters and location, where NA stands for "Not Applicable" 199
C.1 Original Java DQ2S execution times 209
C.2 Python DQ2S execution times 210
C.3 Optimised Python DQ2S execution times 210
C.4 Optimised Python DQ2S execution times in nanoseconds and seconds for 11 500 000 rows 211
C.5 Execution times obtained from PySpark DQ2S Standalone mode on a single machine with 1 worker and 1 core 211
C.6 Execution times obtained from PySpark DQ2S Standalone mode on a single machine with 1 worker and 2 cores 212
C.7 Execution times obtained from PySpark DQ2S Standalone mode on a single machine with 1 worker and 3 cores 212
C.8 Execution times obtained from PySpark DQ2S Standalone mode on a single machine with 1 worker and 4 cores 213
C.9 Execution times obtained from PySpark DQ2S Standalone mode on a single machine with 2 workers and 1 core each 213
C.10 Execution times obtained from PySpark DQ2S Standalone mode on a single machine with 2 workers and 2 cores each 214
C.11 Execution times obtained from PySpark DQ2S Standalone mode on a single machine with 3 workers and 1 core each 214
C.12 Execution times obtained from PySpark DQ2S Standalone mode on a single machine with 4 workers and 1 core each 215
C.13 Execution times obtained from PySpark DQ2S Local mode using 1 core on a single machine 215
C.14 Execution times obtained from PySpark DQ2S Local mode using 2 cores on a single machine 216
C.15 Execution times obtained from PySpark DQ2S Local mode using 3 cores on a single machine 216
C.16 Execution times obtained from PySpark DQ2S Local mode using 4 cores on a single machine 217
C.17 Execution times obtained from PySpark DQ2S Local mode using 5 cores on a single machine 218
C.18 Execution times obtained from PySpark DQ2S Local mode using 6 cores on a single machine 218
C.19 Execution times obtained from PySpark DQ2S Local mode using 7 cores on a single machine 219
C.20 Execution times obtained from PySpark DQ2S Local mode using 8 cores on a single machine 219
C.21 Execution times obtained from PySpark DQ2S Local mode using 1 to 6 cores on a single machine for 11 500 000 rows 220
C.22 Execution times obtained from PySpark DQ2S Local mode using 7 to 10 cores on a single machine for 11 500 000 rows 220
C.23 Execution times obtained from PySpark DQ2S Local mode using 1 to 6 cores on a single machine for 35 000 000 rows 221
C.24 Execution times obtained from PySpark DQ2S Local mode using 7 to 10, 12, and 16 cores on a single machine for 35 000 000 rows 221
C.25 Execution times obtained from PySpark DQ2S Standalone mode processing 35 000 000 rows with 1 to 4 worker nodes on a cluster
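The "Times Slower" captions above all use the same convention: writing A < B means implementation A is slower than implementation B by the reported factor, i.e. the ratio of A's runtime to B's. As a minimal sketch of that calculation (the function name and the runtime values are illustrative, not the thesis's code or measured results):

```python
def times_slower(slow_runtime_s: float, fast_runtime_s: float) -> float:
    """Factor by which `slow_runtime_s` exceeds `fast_runtime_s`.

    Matches the tables' 'A < B' notation: A (the slower implementation)
    took `times_slower(A, B)` times as long as B.
    """
    if fast_runtime_s <= 0:
        raise ValueError("runtimes must be positive")
    return slow_runtime_s / fast_runtime_s

# Illustrative values only: if local[1] took 120 s and local[8] took 30 s,
# then local[1] < local[8] with a factor of 4.0.
print(times_slower(120.0, 30.0))
```

The same ratio is what the tables report for each pairing (e.g. Python vs. Java, local[1] vs. Optimised Python), computed per dataset size.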