Scaling Big Data Cleansing

Dissertation by

Zuhair Yarub Khayyat

In Partial Fulfillment of the Requirements

For the Degree of

Doctor of Philosophy

King Abdullah University of Science and Technology

Thuwal, Kingdom of Saudi Arabia

May, 2017

EXAMINATION COMMITTEE PAGE

The dissertation of Zuhair Yarub Khayyat is approved by the examination committee.

Committee Chairperson: Panos Kalnis, Professor
Committee Members: Mohamed-Slim Alouini, Professor; Basem Shihada, Associate Professor; Marco Canini, Assistant Professor; Christoph Koch, Professor

©May, 2017 Zuhair Yarub Khayyat

All Rights Reserved

ABSTRACT

Scaling Big Data Cleansing

Zuhair Yarub Khayyat

Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big data. This presents a serious impediment, since identifying and repairing dirty data often involves processing huge input datasets, handling sophisticated error discovery approaches and managing huge numbers of arbitrary errors. With large datasets, error detection becomes overly expensive and complicated, especially when considering user-defined functions. Furthermore, a specialized algorithm is desired to optimize inequality joins in sophisticated error discovery rather than naïvely parallelizing them. Also, when repairing large sets of errors, their skewed distribution may obstruct effective error repairs. In this dissertation, I present solutions to overcome the above three problems in scaling data cleansing.

First, I present BigDansing, a general system that tackles the efficiency, scalability, and ease-of-use issues of data cleansing for Big Data. It automatically parallelizes the user's code on top of general-purpose distributed platforms. Its programming interface allows users to express rules independently of the requirements of parallel and distributed environments. Without sacrificing their quality, BigDansing also enables parallel execution of serial repair algorithms by exploiting the graph representation of discovered errors. The experimental results show that BigDansing outperforms existing baselines by up to more than two orders of magnitude.

Although BigDansing scales cleansing jobs, it still lacks the ability to handle sophisticated error discovery that requires inequality joins. Therefore, I developed IEJoin, an algorithm for fast inequality joins. It is based on sorted arrays and space-efficient bit-arrays to reduce the problem's search space. By comparing IEJoin against well-known optimizations, I show that it is more scalable and several orders of magnitude faster.

BigDansing depends on vertex-centric graph systems, e.g., Pregel, to efficiently store and process discovered errors. Although Pregel scales general-purpose graph computations, it cannot handle skewed workloads efficiently. Therefore, I introduce Mizan, a Pregel system that transparently balances the workload during runtime to adapt to changes in computing needs. Mizan is general; it does not assume any a priori knowledge of the graph structure or the algorithm behavior.

Through extensive evaluations, I show that Mizan provides up to 84% improvement over techniques leveraging static graph pre-partitioning.

ACKNOWLEDGEMENTS

To my parents, Yarub and Naida. I simply cannot thank you enough; I would never have been able to succeed in my Ph.D. without your unconditional love, encouragement and faith. I will forever be indebted for your efforts in raising me and the education you provided me. I still remember my first programming experience with Logo in the 90s, which was the cornerstone of my passion for Computer Science.

To my charming wife, Manal. I am grateful that you shared my journey toward my Ph.D.; it has been a unique experience for both of us. I am eternally grateful for your continuous love and support during this rough period. Having you as my wife is the best thing that ever happened to me while at KAUST.

To my lovely kids, Mazin and Ziyad. Your enthusiasm for learning and doing new things always inspires me. Your energy has consistently fed my determination and kept me up all night. I am thankful to have you in my life; my Ph.D. journey would never have been as exciting without you.

To my advisor, Panos Kalnis. Thank you for your confidence in me and your exceptional support throughout my Ph.D. With your teachings, I was able to compete with students from top universities and publish my work in high-quality venues.

To my mentor, Hani Jamjoom. Thank you for your patience and valuable guidance in my early Ph.D. years. Your knowledge and experience in scientific writing helped me throughout my whole Ph.D.

To my academic family, the InfoCloud members. Thank you for your good company, positive influence, and support through my rough times. Portions of the research in this dissertation used the MDC dataset made available by the Idiap Research Institute, Switzerland, and owned by Nokia.

TABLE OF CONTENTS

Examination Committee Page
Copyright
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables

1 Introduction
  1.1 Dealing with Big Datasets
  1.2 Sophisticated Cleaning Jobs
  1.3 Managing Large Random Violation Graphs
  1.4 Contributions and Dissertation Organization

2 Related Work
  2.1 Data Cleaning
  2.2 Scaling Inequality Joins
  2.3 General-purpose Scalable Graph Processing

3 BigDansing: A System for Big Data Cleansing
  3.1 Fundamentals and an Overview
    3.1.1 Data Cleansing Semantics
    3.1.2 Architecture
  3.2 Rule Specification
    3.2.1 Logical Operators
    3.2.2 BigDansing Example Job
    3.2.3 The Planning Process
  3.3 Building Physical Plans
    3.3.1 Physical Operators
    3.3.2 From Logical to Physical Plan
    3.3.3 Fast Joins with Ordering Comparisons
    3.3.4 Translation to Spark Execution Plans
    3.3.5 Translation to MapReduce Execution Plans
    3.3.6 Physical Data Access
  3.4 Distributed Repair Algorithms
    3.4.1 Scaling Data Repair as a Black Box
    3.4.2 Scalable Equivalence Class Algorithm
  3.5 Experimental Study
    3.5.1 Setup
    3.5.2 Single-Node Experiments
    3.5.3 Multi-Node Experiments
    3.5.4 Scaling BigDansing Out
    3.5.5 Deduplication in BigDansing
    3.5.6 BigDansing In-Depth
  3.6 Use Case: Violation Detection in RDF
    3.6.1 Source of Errors
    3.6.2 Integrity Violation Example
    3.6.3 Violation Detection Plan in BigDansing
    3.6.4 Experiments

4 Fast and Scalable Inequality Joins
  4.1 Solution Overview
  4.2 Centralized Algorithms
    4.2.1 IEJoin
    4.2.2 IESelfJoin
    4.2.3 Enhancements
  4.3 Query Optimization
    4.3.1 Selectivity Estimation
    4.3.2 Join Optimization with Multiple Predicates
    4.3.3 Multi-way Join Optimization
  4.4 Incremental Inequality Joins
  4.5 Scalable Inequality Joins
    4.5.1 Scalable IEJoin
    4.5.2 Multithreaded and Distributed IEJoin
  4.6 Integration into Existing Systems
    4.6.1 PostgreSQL
    4.6.2 Spark SQL
    4.6.3 BigDansing
    4.6.4 Rheem
  4.7 Experimental Study
    4.7.1 Datasets
    4.7.2 Algorithms
    4.7.3 Queries
    4.7.4 Setup
    4.7.5 Parameter Setting
    4.7.6 Single-node Experiments
    4.7.7 IEJoin Optimizer Experiments
    4.7.8 Incremental IEJoin Experiments
    4.7.9 Multi-node Experiments

5 Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing
  5.1 Dynamic Behavior of Algorithms
    5.1.1 Example Algorithms
  5.2 Mizan
    5.2.1 Monitoring
    5.2.2 Migration Planning
  5.3 Implementation
    5.3.1 Vertex Ownership
    5.3.2 DHT Updates After Vertex Migration
    5.3.3 Migrating Vertices with Large Message Size
  5.4 Evaluation
    5.4.1 Giraph vs. Mizan
    5.4.2 Effectiveness of Dynamic Vertex Migration
    5.4.3 Overhead of Vertex Migration
    5.4.4 Scalability of Mizan

6 Concluding Remarks
  6.1 Summary of Contributions
  6.2 Future Research Directions

References

Appendices

LIST OF FIGURES

1.1 The violation graph of φF and φD. Undirected edges represent commutative violations, while edge labels represent the violated rule and its attributes.
1.2 East-Coast and West-Coast transactions
2.1 Example of hash-based partitioning with three partitions and seven edge cuts
2.2 Example of three-way min-cuts partitioning using METIS with four edge cuts
3.1 BigDansing architecture
3.2 Logical operators execution for the FD rule φF
3.3 Planner execution flow
3.4 Example of a logical plan
3.5 Plans for the DC in rule 3.1
3.6 Example plans with CoBlock
3.7 A summary of BigDansing's logical, physical and execution operators for Spark
3.8 Data repair as a black box
3.9 Data cleansing times
3.10 Single-node experiments with ϕ1, ϕ2, and ϕ3
3.11 Multi-node experiments with ϕ1, ϕ2 and ϕ3
3.12 Results for the experiments on (a) scale-out, (b) deduplication, and (c) physical optimizations
3.13 (a) Abstraction and (b) scalable repair
3.14 An example RDF graph
3.15 An integrity constraint query graph in RDF
3.16 The logical plan of the UDF detection rule in Figure 3.15
3.17 Data transformations within the logical operators in Figure 3.16
3.18 Violation detection performance on dirty RDF data
4.1 IEJoin process for query Qp
4.2 IEJoin process for query Qt
4.3 Example of using a bitmap
4.4 Incremental IESelfJoin process for query Qp
4.5 An example of the pre-processing stage with two relations R and S
4.6 Spark's DAG for the pre-processing phase
4.7 Spark's DAG for the IEJoin post-processing phase
4.8 A summary of the modified BigDansing logical, physical and execution operators for Spark with IEJoin
4.9 IEJoin vs. baseline systems (centralized)
4.10 IEJoin vs. BTree and GiST (centralized)
4.11 Q5 runtime with different Female distributions, and Q5, Q6 and Q7 runtimes with 10% Female distribution
4.12 Q1 runtime for data that fits in caches
4.13 Result size and runtime for Q1′
4.14 Runtime of ∆Inc, B-∆Inc, non-incremental, and P-∆Inc on small update sizes
4.15 Runtime and output size of non-incremental and B-∆Inc IEJoin on big update sizes
4.16 Memory and I/O overhead of B-∆Inc, and runtime and output size of B-∆Inc and σInc (30M cached input)
4.17 Distributed IEJoin (100M tuples, 6 nodes/threads)
4.18 Distributed IEJoin, 6B tuples, 16 nodes
4.19 Runtime of IEJoin (c = 10)
4.20 Without result materialization (c = 5, 10)
4.21 Speedup (8B rows) and scaleup on Q1
5.1 Factors that can affect the runtime in the Pregel framework
5.2 The difference in execution time when processing PageRank on different graphs using hash-based, range-based and min-cuts partitioning. Because of its size, arabic-2005 cannot be partitioned using min-cuts (ParMETIS) in the local cluster.
5.3 The difference between stationary and non-stationary graph algorithms with respect to incoming messages. Total represents the sum across all workers and Max represents the maximum amount (on a single worker) across all workers.
5.4 The statistics monitored by Mizan
5.5 Mizan's BSP flow with migration planning
5.6 Summary of Mizan's migration planner
5.7 Matching senders (workers with vertices to migrate) with receivers using their summary statistics
5.8 Architecture of Mizan in light of the proposed graph processing system
5.9 Migrating vertex vi from Worker 1 to Worker 2, while updating its DHT home worker (Worker 3)
5.10 Delayed vertex migration
5.11 Comparing Static Mizan vs. Giraph using PageRank on social network and random graphs
5.12 Comparing Mizan vs. Giraph using PageRank on regular random graphs; the graphs are uniformly distributed, each with around 17M edges
5.13 Comparing Static Mizan and Work Stealing (Pregel clone) vs. Mizan using PageRank on a social graph (LiveJournal1). The shaded part of each column represents the algorithm runtime, while the unshaded part represents the initial partitioning cost of the input graph.
5.14 Comparing Mizan's migration cost with the algorithm's runtime at each superstep using PageRank on a social graph (LiveJournal1) starting with range-based partitioning. The points on the superstep runtime trend represent Mizan's migration objective at that specific superstep.
5.15 Comparing both the total migrated vertices and the maximum vertices migrated by a single worker for PageRank on LiveJournal1 starting with range-based partitioning. The migration cost at each superstep is also shown.
5.16 Comparing Static Mizan and Work Stealing (Pregel clone) vs. Mizan using Top-K PageRank on a social graph (LiveJournal1). The shaded part of each column represents the algorithm runtime, while the unshaded part represents the initial partitioning cost of the input graph.
5.17 Comparing Work Stealing (Pregel clone) vs. Mizan using DMST and advertisement propagation simulation on a METIS-partitioned social graph (LiveJournal1)
5.18 Comparing both the total migrated vertices and the maximum vertices migrated by a single worker for advertisement propagation on LiveJournal1 starting with METIS partitioning. The migration cost at each superstep is also shown.
5.19 Comparing Static Mizan and Work Stealing (Pregel clone) vs. Mizan using advertisement propagation on a social graph (LiveJournal1) starting with METIS partitioning. The points on Mizan's trend represent Mizan's migration objective at that specific superstep.
5.20 Speedup on a 16-machine Linux cluster using PageRank on the hollywood-2011 dataset
5.21 Speedup on the IBM Blue Gene/P supercomputer using PageRank on the arabic-2005 dataset
A.1 Example of a bushy plan

LIST OF TABLES

1.1 Dataset D with tax data records
3.1 Statistics of the datasets
3.2 Integrity constraints used for testing
3.3 Repair quality using the HAI and TaxB datasets
4.1 Secondary key sorting order for Y/X in op1/op2
4.2 Permutation array creation for self-join Qp
4.3 Size of the datasets
4.4 Bitmaps on 10M tuples (Events data)
4.5 Bitmaps on 10M tuples with max scan index optimization (Events data)
4.6 Cache statistics on 10M tuples (Events data)
4.7 IEJoin time analysis on 100M tuples and 6 workers
4.8 Runtime and memory usage (PG-IEJoin)
4.9 Time breakdown on 50M tuples, all times in seconds
4.10 Q1′ selectivity estimation on the 50M-tuple Employees dataset with a low-selectivity age attribute
4.11 Q1′ selectivity estimation on the 50M-tuple Employees dataset with a high-selectivity age attribute
4.12 Relation sizes, selectivity estimations and actual output for individual joins in Qmw
4.13 Runtime of different multi-way join plans for Qmw
5.1 Datasets (N and E denote nodes and edges, respectively; graphs with prefix kg are synthetic)
5.2 Overhead of Mizan's migration process compared to the total runtime, using range partitioning with PageRank and METIS partitioning with advertisement propagation simulation on the LiveJournal1 graph
5.3 Scalability of Mizan on a 16-machine Linux cluster (hollywood-2011 dataset) and the IBM Blue Gene/P supercomputer (arabic-2005 dataset)

Chapter 1

Introduction

Data quality is a major concern at scale: more than 25% of the critical data of the world's top companies is flawed [1]. In fact, cleansing dirty data is a fundamental challenge that has been studied for decades [2]. Nevertheless, data quality is gaining even more attention in the era of Big Data due to its increasing complexity. It has been reported that data scientists spend 60% of their time cleaning and organizing data [3]. This is because data cleansing is hard; data errors arise in different forms, e.g., typos, duplicates, non-compliance with business rules, outdated data, and missing values.

Example 1: Table 1.1 shows a sample tax dataset D in which each record represents an individual's information. Suppose that the following three data quality rules need to hold on D: (r1) a zipcode uniquely determines a city; (r2) given two distinct individuals, the one earning a lower salary should have a lower tax rate; and (r3) two tuples refer to the same individual if they have similar names and their cities are inside the same county. I define these rules as follows:

(r1) φF : D(zipcode → city)

(r2) φD : ∀ti, tj ∈ D, ¬(ti.salary < tj.salary ∧ ti.rate > tj.rate)

(r3) φU : ∀ti, tj ∈ D, ¬(simF(ti.name, tj.name) ∧ getCounty(ti.city) = getCounty(tj.city))

Table 1.1: Dataset D with tax data records

        name     zipcode   city   state   salary   rate
   t1   Annie    10001     NY     NY      24000    15
   t2   Laure    90210     LA     CA      25000    10
   t3   John     60601     CH     IL      40000    25
   t4   Mark     90210     SF     CA      88000    28
   t5   Robert   60827     CH     IL      15000    15
   t6   Mary     90210     LA     CA      81000    28

Note that φF , φD, and φU can respectively be expressed as a functional dependency (FD), a denial constraint (DC), and a user-defined function (UDF) written in a procedural language. φU requires an ad-hoc similarity function and access to a mapping table to obtain the county information. Tuples t2 and t4 form an error w.r.t. φF , as do t4 and t6, because they have the same zipcode but different city values. Tuples t1 and t2 violate φD, because t1 has a lower salary than t2 but pays a higher tax rate; so do t5 and t2. Rule φU detects no errors, i.e., there are no duplicates in D w.r.t. φU .
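To make the detection step concrete, the following minimal sketch (illustrative only, not code from any system in this dissertation) enumerates all tuple pairs of Table 1.1 and reports the violations of φF and φD discussed above; column and rule names are simply transliterations of the example.

    # Tuples of Table 1.1, keyed by their tuple ids.
    tuples = {
        "t1": {"name": "Annie",  "zipcode": "10001", "city": "NY", "salary": 24000, "rate": 15},
        "t2": {"name": "Laure",  "zipcode": "90210", "city": "LA", "salary": 25000, "rate": 10},
        "t3": {"name": "John",   "zipcode": "60601", "city": "CH", "salary": 40000, "rate": 25},
        "t4": {"name": "Mark",   "zipcode": "90210", "city": "SF", "salary": 88000, "rate": 28},
        "t5": {"name": "Robert", "zipcode": "60827", "city": "CH", "salary": 15000, "rate": 15},
        "t6": {"name": "Mary",   "zipcode": "90210", "city": "LA", "salary": 81000, "rate": 28},
    }

    def violations(tuples):
        """Enumerate all tuple pairs and report violations of phi_F and phi_D."""
        out = []
        ids = list(tuples)
        for i in ids:
            for j in ids:
                if i == j:
                    continue
                ti, tj = tuples[i], tuples[j]
                # phi_F: the same zipcode must imply the same city (checked once per unordered pair).
                if i < j and ti["zipcode"] == tj["zipcode"] and ti["city"] != tj["city"]:
                    out.append(("phi_F", i, j))
                # phi_D: a lower salary must not come with a higher tax rate.
                if ti["salary"] < tj["salary"] and ti["rate"] > tj["rate"]:
                    out.append(("phi_D", i, j))
        return out

    print(violations(tuples))
    # [('phi_D', 't1', 't2'), ('phi_F', 't2', 't4'), ('phi_F', 't4', 't6'), ('phi_D', 't5', 't2')]

Even this toy example performs a quadratic number of pair comparisons, which is exactly the cost that becomes prohibitive at scale.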

Cleansing the above dataset typically consists of three steps: (1) specifying quality rules; (2) detecting data errors w.r.t. the stated rules; and (3) repairing the detected errors by removing records, modifying their content, or adding new records. Generally speaking, after step (1), the data cleansing process iteratively runs steps (2) and (3) until it obtains an instance of the data (a.k.a. a repair) that satisfies the specified rules. These three steps have been extensively studied, by both academia and industry, in single-node settings [4–11]. However, the primary focus has been on repairing errors with the highest accuracy, without addressing the issue of efficient and scalable detection and repair. With big data, three scaling problems arise from processing the above rules (φF , φD, and φU ) and cleaning their errors. The first issue is the magnitude of the data: the detection process becomes too expensive when it must enumerate all pairs of tuples in a massive dataset. Furthermore, the set of discovered errors grows with larger datasets, which imposes a huge processing overhead on the user's serial repair algorithms. It is not practical to implement scalable code for every error identification technique. Similarly, it might not be possible to parallelize repair algorithms due to their complexity and strict quality guarantees. The second problem stems from complex error discovery approaches that require joins based on inequality conditions, i.e., (ti.salary < tj.salary) and (ti.rate > tj.rate) in φD. This type of join is extremely expensive even for small datasets; no system or algorithm in the literature can process it efficiently. The third problem is related to the complexity of storing, handling and pre-processing the discovered errors (a.k.a. violations) before passing them to the repair algorithm. Many researchers use a graph representation of violations (a.k.a. a violation graph) to store the rich content of discovered errors, improve the user's perception of the invalid relationships in the data, and achieve optimal repair decisions. Figure 1.1 shows the violation graph obtained when applying rules φF and φD to the input data in Table 1.1. With more errors, scalable graph systems, such as Pregel, can be used to process large violation graphs with lower overheads. The distribution of the detected errors, however, is usually random and unpredictable; it causes the violation graph to have an irregular structure and an unequal distribution. These types of issues prevent distributed graph systems from running efficiently by creating an unbalanced workload. In the following chapters, I discuss each scalability problem in more detail.

Figure 1.1: The violation graph of φF and φD. Undirected edges represent commutative violations, while edge labels represent the violated rule and its attributes.
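As a small illustration of such a violation graph (cf. Figure 1.1), the sketch below builds an adjacency-list representation from the violations of Example 1, with directed edges for φD and undirected (commutative) edges for φF. It is illustrative only and not tied to any particular system.

    from collections import defaultdict

    violations = [
        ("phi_D", "t1", "t2"), ("phi_D", "t5", "t2"),   # directed: t_i violates w.r.t. t_j
        ("phi_F", "t2", "t4"), ("phi_F", "t4", "t6"),   # commutative: order does not matter
    ]

    graph = defaultdict(list)          # node -> list of (neighbor, violated rule)
    for rule, u, v in violations:
        graph[u].append((v, rule))
        if rule == "phi_F":            # undirected edge: also add the reverse direction
            graph[v].append((u, rule))

    for node, edges in graph.items():
        print(node, "->", edges)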

1.1 Dealing with Big Datasets

The first issue in handling big datasets is that none of the existing systems can efficiently scale violation detection. This is because detecting errors, e.g., with φF and φD, or duplicates, e.g., with φU , is a combinatorial problem that quickly becomes very expensive at large sizes. More specifically, if |D| is the number of tuples in a dataset D and n is the number of tuples a given rule is defined on (e.g., n = 2 for the above rules), the time complexity of detecting errors is O(|D|^n). This high complexity leads to costly computations over large datasets, limiting the applicability of data cleansing systems. For instance, memory-based approaches [5, 6] report performance numbers of up to 100K tuples, while a disk-based approach [7] reports numbers of up to 1M records only. Apart from the problem of scaling, implementing such rules on a distributed processing platform requires expertise both in understanding the quality rules and in tuning their implementation on the platform. Some research efforts have targeted the usability of data cleansing systems (e.g., NADEEF [5]), but at the expense of performance and scalability, since it is not practical to implement scalable code for every possible declarative and procedural rule. Similarly, it is challenging to parallelize repair algorithms, because repair algorithms are complex; their primary goal is to achieve optimal and deterministic repairs at the cost of high data dependency and inseparable computations. Therefore, designing a distributed data cleansing system that deals with big data faces two main challenges:

(1) Scalability. Quality rules are varied and complex: they might compare multiple tuples, use similarity comparisons, or contain inequality comparisons. Detecting errors in a distributed setting may lead to shuffling large amounts of data through the network, with negative effects on performance. Moreover, existing repair algorithms were not designed for distributed settings. For example, they often represent the detected errors as a graph, where the graph nodes are data values and the graph edges indicate data errors among the nodes. Devising algorithms on such a graph in distributed settings is not well addressed in the literature.

(2) Abstraction. In addition to supporting traditional quality rules (e.g., φF and φD), users also need the flexibility to define rules using procedural programs (UDFs), such as φU . However, effective parallelization is hard to achieve with UDFs, since they are handled as black boxes. To enable scalability, finer-granularity constructs for specifying the rules are needed. An abstraction for this specification is challenging, as it would normally come at the expense of generality of rule specification.

1.2 Sophisticated Cleaning Jobs

Consider an example of a cleaning job that requires analyzing revenue and utilization trends across different regions of an international provider of cloud services. In particular, the job requires finding all those transactions from the West-Coast that last longer and produce smaller revenues than any transaction in the East-Coast. This can be considered a business rule that flags a violation if a customer from the West-Coast rented a virtual machine for more hours than a customer from the East-Coast but paid less. Figure 1.2 illustrates a data instance for both tables. Assuming the customer data is stored in a database, the following query finds the violations of the business rule:

Qt : SELECT east.id, west.t_id
     FROM east, west
     WHERE east.dur < west.time AND east.rev > west.cost;

Although the join is by far the most important and most studied operator in relational algebra [12], running Qt over 200K transactions in PostgreSQL requires more than two hours to compute the result. When analyzing the execution plan for Qt in PostgreSQL, it turns out that Qt was processed as a Cartesian product followed by a selection predicate. Since the Cartesian product is the most expensive algebra operation [12], the evaluation of Qt becomes problematic due to the huge number of unnecessary intermediate results.
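To make the cost concrete, a minimal sketch of that Cartesian-product-plus-filter plan over the sample data of Figure 1.2 (values transcribed from the figure) is shown below; it returns a single qualifying pair but always performs |east| x |west| comparisons, which is what becomes prohibitive at 200K transactions.

    # Naive nested-loop evaluation of Qt: compare every east tuple with every west tuple.
    east = [  # (id, dur, rev, cores)
        (100, 140, 9, 2), (101, 100, 12, 8), (102, 90, 5, 4),
    ]
    west = [  # (t_id, time, cost, cores)
        (404, 100, 6, 4), (498, 140, 11, 2), (676, 80, 10, 1), (742, 90, 5, 4),
    ]

    result = [(e[0], w[0]) for e in east for w in west
              if e[1] < w[1] and e[2] > w[2]]   # east.dur < west.time AND east.rev > west.cost
    print(result)   # [(101, 498)] -- a single qualifying pair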

Even running Qt on a distributed database will not solve the complexity problem of such join queries. Moreover, despite the prevalence of this kind of query in applications such as temporal and spatial databases, no off-the-shelf efficient solutions exist.

Figure 1.2: East-Coast and West-Coast transactions

   west:  t_id   time   cost   cores        east:  id    dur   rev   cores
     s1   404    100    6      4              r1   100   140   9     2
     s2   498    140    11     2              r2   101   100   12    8
     s3   676    80     10     1              r3   102   90    5     4
     s4   742    90     5      4

There have been countless techniques to optimize the different flavors of joins in various settings [13]. In the general case of a theta join, these techniques are mostly based on two assumptions: (i) one of the relations is small enough to fit in memory, and (ii) queries contain selection predicates or an equijoin with a high selectivity, which would reduce the size of the relations to be fed to the inequality join. The first assumption does not hold when joining two big relations. The second assumption is not necessarily true with low-selectivity predicates, such as gender or region, where the obtained relations are still very large. Furthermore, similar to Qt and φD, there is a large spectrum of applications where the above two assumptions do not necessarily hold. For example, for data analytics in a temporal database, a typical query would be to find all employees and managers that overlapped while working in a certain company [14]. Common ways of optimizing such queries include sort-merge joins [15] and interval-based indexing [16–18]. A sort-merge join reduces the search space by sorting the data based on the joining attributes and merging them. However, it still has quadratic complexity for queries with inequality joins only. Interval-based indexing reduces the search space of such queries even further by using bitmap interval indexing [17]. However, such indices require large memory space [19] and long index-building time. Moreover, a user would have to create multiple indices to cover all the attributes referenced in a query workload. Such indices can be built at query time, but their long construction time renders them impractical.

As a result, it is important to develop a new algorithm to efficiently handle inequality joins for φD and Qt on both small and big datasets. The main challenge is to avoid the quadratic complexity of the join while preserving the correctness of the algorithm. Moreover, the algorithm should be general and easy to implement in both databases and data cleaning systems.

1.3 Managing Large Random Violation Graphs

With the increasing emphasis on big data, new platforms are being proposed to better exploit the structure of both the data and the algorithms (e.g., Pregel [20], HADI [21], PEGASUS [22] and X-RIME [23]). Recently, Pregel [20] was introduced as a message-passing-based programming model specifically targeting the mining of large graphs. For a wide array of popular graph mining algorithms (including PageRank, shortest-path problems, bipartite matching, and semi-clustering), Pregel was shown to improve the overall performance by 1-2 orders of magnitude [24] over a traditional MapReduce [25] implementation. Pregel builds on the Bulk Synchronous Parallel (BSP) [26] programming model. It operates on graph data consisting of vertices and edges. Each vertex runs an algorithm and can send messages, asynchronously, to any other vertex. The computation is divided into several supersteps (iterations), each separated by a global synchronization barrier. During each superstep, vertices run in parallel across a distributed infrastructure. Each vertex processes incoming messages from the previous superstep. This process continues until all vertices have no messages to send, thereby becoming inactive (a minimal sketch of this execution loop is given at the end of this section). Graphs have been used in data cleansing research [6, 9, 27, 28] to represent the complex relationships of detected errors. They help to combine the heterogeneous results of different data cleansing rules into a single instance, to enable a holistic view of the errors, and to help users achieve higher-quality repairs. The nodes of this graph represent data entities (i.e., tuples), while an edge between two nodes represents a specific inconsistency between two data entities. The direction of the edge explains the nature of the violation. For example, as shown in Figure 1.1, the directed edge

between t1 and t2 indicates that t1 is in violation with t2 for rule φD, but the opposite is

not true. On the other hand, the undirected edge between t2 and t4 means that the violation order does not matter: t2 is in violation with t4 for φF and t4 is also in violation with t2 for the same rule. Since the number of errors grows proportionally with the size of the dirty data, it is natural to use Pregel to pre-process large violation graphs before passing them to the user's repair algorithm. However, violation graphs are random; their structure cannot be predicted a priori, and a skewed distribution of the graph may cause computation imbalance in Pregel. Not surprisingly, balanced computation and communication is fundamental to the efficiency of a Pregel system. To this end, existing implementations of Pregel (including Giraph [29], GoldenOrb [30], Hama [31] and Surfer [32]) primarily focus on efficient partitioning of the input data as a pre-processing step. They take one or more of the following five approaches to achieving a balanced workload: (1) provide simple graph partitioning schemes, like hash- or range-based partitioning (e.g., Giraph); (2) allow developers to set their own partitioning scheme or pre-partition the graph data (e.g., Pregel); (3) provide more sophisticated partitioning techniques (e.g., GraphLab, GoldenOrb and Surfer use min-cuts [33]); (4) utilize distributed data stores and graph indexing on vertices and edges (e.g., GoldenOrb and Hama); and (5) perform coarse-grained load balancing (e.g., Pregel). Existing workload balancing approaches suffer from poor adaptation to variability (or shifts) in computation needs. Even the dynamic partition assignment that was proposed in Pregel reacts poorly to highly dynamic algorithms. I am, thus, interested in building a system that is (1) adaptive, (2) agnostic to the graph structure, and (3) requires no a priori knowledge of the behavior of the algorithm. At the same time, I am interested in improving the overall performance of the system.
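As a minimal illustration of the BSP execution loop described above, the following single-process sketch runs a vertex-centric computation (minimum-label propagation, used here only as an example algorithm) in supersteps separated by a synchronization barrier; it is illustrative only and is not code from Pregel, Giraph or Mizan.

    def pregel_like(vertices, edges, max_supersteps=30):
        value = {v: v for v in vertices}          # each vertex starts with its own id as label
        inbox = {v: [v] for v in vertices}        # an initial "message" to itself
        for superstep in range(max_supersteps):
            outbox = {v: [] for v in vertices}
            active = False
            for v in vertices:                    # in a real system this loop is distributed
                if not inbox[v]:
                    continue                      # a vertex without messages stays inactive
                new_value = min(value[v], min(inbox[v]))
                if new_value < value[v] or superstep == 0:
                    value[v] = new_value
                    for u in edges.get(v, []):    # send the new label to all neighbors
                        outbox[u].append(new_value)
                    active = True
            inbox = outbox                        # global synchronization barrier
            if not active:
                break
        return value

    edges = {1: [2], 2: [1, 3], 3: [2], 7: [8], 8: [7]}
    print(pregel_like({1, 2, 3, 7, 8}, edges))
    # vertices 1, 2, 3 converge to label 1 and vertices 7, 8 to label 7 (two components)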

1.4 Contributions and Dissertation Organization

The dissertation is organized as follows. In Chapter 2, I describe the related work on data cleaning, inequality joins and general-purpose scalable graph processing. After that, the specific contributions of my work are introduced as follows:

Chapter 3 presents BigDansing, a scalable data cleansing system. It abstracts the violation detection process as a logical plan composed of a set of logical operators. This abstraction allows users to detect errors and find possible fixes w.r.t. a large variety of rules that cannot be expressed in declarative languages. Once finalized, BigDansing optimizes the logical plan for efficient execution and transforms it depending on the underlying general-purpose distributed system. As a result, BigDansing allows users to focus on the logic of their rules, declaratively and procedurally, rather than on the details of efficiently parallelizing them. The abstraction enables several optimizations. I present techniques to translate a logical plan into an optimized physical plan, including: (i) removing data redundancy in input rules and reducing the number of operator calls; (ii) specialized operators that speed up the cleansing process; and (iii) a new join algorithm, based on partitioning and sorting the data as an initial step, to efficiently perform distributed inequality self-joins. Since the violation detection phase is parallelized, the scalability bottleneck of big data cleansing shifts to centralized violation repairs. Therefore, I show how BigDansing reduces the overhead of centralized violation repairs by utilizing violation graphs to run them in parallel without modifying the repair algorithm.
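As a toy illustration of what such finer-grained rule logic can look like, the sketch below expresses the detection side of φF with three hypothetical operators (the names scope, block_key and detect are used only for illustration and are not taken from the text above): project the needed attributes, group tuples by zipcode so that only tuples in the same block are compared, and run the pairwise check per block. Blocking is what makes the pairwise detection parallelizable, since each block can be processed independently.

    from collections import defaultdict

    def scope(t):
        # keep only the attributes phi_F needs
        return {"id": t["id"], "zipcode": t["zipcode"], "city": t["city"]}

    def block_key(t):
        # tuples sharing a zipcode can violate phi_F together
        return t["zipcode"]

    def detect(block):
        # pairwise check inside one block only
        return [(a["id"], b["id"]) for i, a in enumerate(block)
                for b in block[i + 1:] if a["city"] != b["city"]]

    def run(rule_input):
        blocks = defaultdict(list)
        for t in rule_input:
            s = scope(t)
            blocks[block_key(s)].append(s)
        return [v for b in blocks.values() for v in detect(b)]

    data = [
        {"id": "t2", "zipcode": "90210", "city": "LA"},
        {"id": "t4", "zipcode": "90210", "city": "SF"},
        {"id": "t6", "zipcode": "90210", "city": "LA"},
        {"id": "t1", "zipcode": "10001", "city": "NY"},
    ]
    print(run(data))   # [('t2', 't4'), ('t4', 't6')]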

Chapter 4 introduces IEJoin, a new algorithm that utilizes bit-arrays and positional permutation arrays to achieve fast inequality joins. Given the inherent quadratic complexity of inequality joins, IEJoin follows the "RAM locality is King" principle coined by Jim Gray. Its use of memory-contiguous data structures with a small footprint leads to orders-of-magnitude performance improvements over the prior art. The basic idea of my proposal is to create a sorted array of tuples for each inequality comparison and compute their intersection. The prohibitive cost of the intersection operation is alleviated by using a permutation array to encode the positions of tuples in one sorted array w.r.t. the other sorted arrays. A bit-array is then used to emit the join results. I extend my study of IEJoin by proposing a selectivity estimation approach to efficiently process multiple-predicate queries and multi-way inequality joins. I also present an incremental IEJoin to handle streaming and continuous inequality joins.
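To make the idea concrete, here is a small, unoptimized sketch of a self-join variant of this approach under simplifying assumptions: it uses one sort order per inequality attribute, a permutation array to map between them, and a bit-array of visited tuples. Ties are handled with an explicit re-check on emission, whereas the actual IEJoin resolves them through careful tie-aware sort orders. Applied to the salary and rate attributes of Table 1.1, it returns exactly the two φD violations.

    def ie_self_join(rows, a, b):
        """Find pairs (i, j) with rows[i][a] < rows[j][a] and rows[i][b] > rows[j][b]."""
        n = len(rows)
        # L1: tuple ids sorted by attribute a (ascending).
        L1 = sorted(range(n), key=lambda t: rows[t][a])
        pos_in_L1 = {t: p for p, t in enumerate(L1)}      # position of each tuple in L1
        # L2: visit order, sorted by attribute b (descending).
        L2 = sorted(range(n), key=lambda t: -rows[t][b])
        # Permutation array: L1 position of each tuple, listed in L2 visit order.
        perm = [pos_in_L1[t] for t in L2]
        bitmap = [False] * n       # marks already-visited tuples by their L1 position
        results = []
        for j, p in zip(L2, perm):
            # Set bits below position p belong to visited tuples with a-values no larger
            # than rows[j][a]; re-check both predicates to handle ties in this sketch.
            for k in range(p):
                if bitmap[k]:
                    i = L1[k]
                    if rows[i][a] < rows[j][a] and rows[i][b] > rows[j][b]:
                        results.append((i, j))
            bitmap[p] = True
        return results

    salaries_rates = [(24000, 15), (25000, 10), (40000, 25), (88000, 28), (15000, 15), (81000, 28)]
    rows = [{"salary": s, "rate": r} for s, r in salaries_rates]
    print(ie_self_join(rows, "salary", "rate"))
    # [(4, 1), (0, 1)]  ->  (t5, t2) and (t1, t2), the two phi_D violations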

Chapter 5 introduces Mizan, a Pregel-based graph system that uses dynamic vertex migration, based on runtime monitoring of vertices, to optimize the end-to-end computation. It monitors the runtime characteristics of all vertices (i.e., their execution time and their incoming and outgoing messages). Using these measurements at the end of every superstep, the system constructs a migration plan that minimizes the variations across workers by identifying which vertices to migrate and where to migrate them to. This plan is deliberately simple; it does not promise optimal graph partitioning or any convergence guarantees, so as to avoid overloading the synchronization barrier with costly computations. Finally, Mizan uses the simple migration plan to perform low-cost and efficient vertex migration across workers, leveraging a distributed hash table (DHT)-based location service to track the movement of vertices as they migrate.
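The following is a deliberately simplified sketch of what such a migration-planning step could look like, assuming only a per-worker load summary: it pairs the most overloaded workers with the most underloaded ones and moves each pair toward the mean. It is illustrative only and omits the per-vertex statistics, matching details and DHT updates of the actual system.

    def plan_migrations(worker_load):
        """worker_load: dict mapping a worker to the summed load of its vertices."""
        mean = sum(worker_load.values()) / len(worker_load)
        senders = sorted((w for w in worker_load if worker_load[w] > mean),
                         key=lambda w: worker_load[w], reverse=True)
        receivers = sorted((w for w in worker_load if worker_load[w] < mean),
                           key=lambda w: worker_load[w])
        plan = []
        for s, r in zip(senders, receivers):
            surplus = worker_load[s] - mean
            deficit = mean - worker_load[r]
            plan.append((s, r, min(surplus, deficit)))   # amount of load to migrate
        return plan

    print(plan_migrations({"w1": 120, "w2": 80, "w3": 40}))
    # [('w1', 'w3', 40.0)] -- move 40 units of load from the overloaded to the underloaded worker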

Mizan is adaptive and agnostic to the graph structure, which allows it to be efficient when processing random, skewed graphs, and it does not require a priori knowledge of the algorithm's behavior, which makes it suitable for processing violation graphs as well as other applications. The contributions of my dissertation are published in highly ranked venues. More specifically, the contribution of Chapter 3 (BigDansing) is published in the Proceedings of the ACM International Conference on Management of Data (SIGMOD 2015) [34]. The work described in Chapter 4 (IEJoin) is published in the Proceedings of the International Conference on Very Large Data Bases (VLDB 2015) [35] and in the Very Large Data Bases Journal (VLDBJ 2017) [36]. Finally, my contribution in Chapter 5 (Mizan) is published in the Proceedings of the ACM European Conference on Computer Systems (EuroSys 2013) [37].

Chapter 2

Related Work

In this chapter, I present a selection of research relevant to my dissertation. Chapter 2.1 discusses the data cleansing problem, while Chapter 2.2 analyzes existing techniques to efficiently scale inequality joins. Chapter 2.3, on the other hand, describes general-purpose graph processing systems and their ability to provide a balanced workload on a random graph.

2.1 Data Cleaning

Data cleansing, also called data cleaning or scrubbing, has been a topic of research in both academia and industry for decades, e.g., [4–10, 38–44]. Given some "dirty" dataset, the goal is to find the most likely errors and a possible way to repair them in order to obtain a "clean" dataset. Although data cleansing is critical, end-to-end automatic error detection and repair systems are rarely studied in the literature.

Error detection. The work in industry, e.g., IBM QualityStage, SAP BusinessObjects, Oracle Enterprise Data Quality, and OpenRefine (previously Google Refine), has mainly focused on the use of low-level ETL (Extract, Transform and Load) rules [45]. These systems do not support UDF-based quality rules in a scalable fashion. Closer to my work, on the other hand, is NADEEF [5]. It is a generalized data cleansing system that detects violations of various data quality rules, e.g., FDs, DCs and UDFs, in a unified programming interface. However, NADEEF is a single-machine solution; it does not scale violation detection and repair to very large datasets. Similar to NADEEF, SampleClean [46] is another system that aims at improving the accuracy of aggregate queries by performing cleansing over small samples of the source data. SampleClean focuses on obtaining unbiased query results with confidence intervals, while my dissertation focuses on providing a scalable framework for data cleansing. In fact, one cannot use SampleClean in traditional query processing where the entire input dataset is required.

The query optimization of declarative languages, i.e., FDs, CFDs and DCs, and their transformation into executable code have been heavily studied in the literature [12]. On the logical level, algebraic equivalences are used to transform first-order logic into a much more efficient variation [47–49]. On the physical level, physical access and buffer management techniques are utilized to optimize queries [50]. These optimizations are thoroughly discussed in research assuming a relational database on a single machine with limited memory. Unlike the work in this dissertation, they do not consider emerging technologies based on parallel or distributed environments. One way to provide scalable violation detection and repair is to use scalable data processing platforms, such as MapReduce [51] or Spark [52]. However, coding the violation detection process on top of these platforms is a tedious task that requires technical expertise. Dedoop [53], for example, is a system that detects duplicates in relational data using Hadoop. It exploits data blocking and parallel execution to improve its performance, and it maintains its own data partitioning and distribution across workers. Although Dedoop works well for detecting duplicates, it neither provides support for scalable validation of UDFs nor scales repair algorithms. One could also use a declarative system (e.g., Hive [54], Pig [55], or Shark [56]) on top of one of these platforms and re-implement the data quality rules using its query language. However, many common rules, such as rule φU , go beyond declarative formalisms. Moreover, these frameworks do not natively support inequality joins. Theta-joins, which have been largely studied in the database community [15, 57], can be used in scalable data processing platforms to support inequality joins. Studies vary from low-level techniques, such as minimizing disk accesses by choosing partitioning elements through sampling [15], to mapping arbitrary join conditions to Map and Reduce functions [57]. These proposals are orthogonal to the approach used to handle inequality joins in BigDansing (OCJoin in Chapter 3.3.3). In fact, in my system they belong at the executor level: if they are available in the underlying data processing platform, they can be exploited when translating the OCJoin operator at the physical level.

Error repair. Recent research on data repair includes cleaning algorithms for DCs [6, 7] and other fragments, such as FDs and CFDs [4, 5, 7, 58, 59]. These proposals also focus on specific logical constraints in a centralized setting, without much regard to scalability and flexibility. Another class of repair algorithms uses machine learning tools to clean the data; examples include SCARE [60] and ERACER [61]. There are also several efforts to include users (experts or the crowd) in the cleaning process [62, 63]. Both lines of research are orthogonal to the work of my dissertation. In database management systems, integrity constraints and dependencies (e.g., inclusion dependencies and functional dependencies) were developed to add a semantic property to the relational model [12]. They are supposed to be satisfied in all instances of the database schema, and violating transactions are automatically rejected to maintain database consistency. The repair process is unchallenging when dealing with a single detection rule. However, when dealing with multiple rules that are logically implicated or transitive, i.e., that share attributes, it becomes a challenging problem, since fixing an error for one dependency might introduce new errors for other dependencies. It is the job of the repair algorithm to determine the logical implication among detection rules in order to apply smarter repairs. In NADEEF [5], errors are only fixed once; they are marked as immutable after the first fix to prevent logically implicated rules from continuously generating new errors. This approach is simple but does not guarantee repair optimality. In database systems, the Chase procedure [64, 65] is used to determine the logical implication among dependencies, optimize conjunctive queries and reason about the consistency of a data design. This procedure has been picked up by data cleaning research [2, 7, 8, 66–68] to compute minimal repairs for FDs, CFDs and matching dependencies. The work of Kolahi et al. [9] and Chu et al. [6] also uses a similar approach to minimize the number of repairs for FDs and DCs. They first generate a conflict hypergraph to model the associations between the violations, and then utilize a minimum vertex cover on the generated hypergraph to pinpoint the errors with the highest impact. By only fixing those errors, they succeed in reducing the number of required changes to the input dataset. Representing detected violations as a graph has already been addressed in research. Other than the work of Kolahi et al. [9] and Chu et al. [6], Ebaid et al. [27] and Elmagarmid et al. [28] used violation graphs in NADEEF to visualize and allow user interaction with violations related to FDs, CFDs, matching dependencies (MDs) and entity resolution. None of those systems utilized the properties of the violation graph, as discussed in Chapter 3.4, to scale data repairing algorithms in distributed systems.

2.2 Scaling Inequality Joins

Several cases of inequality joins have been studied in the literature; these include band joins, interval joins and, more generally, spatial joins. IEJoin is specially optimized for joins with at least two predicates in {"<", ">", "≤", "≥", "≠"}. A band join [15] of two relations R and S has a join predicate that requires the join attribute of S to be within some range of the join attribute of R. The join condition is expressed as R.A − c1 ≤ S.B ∧ S.B ≤ R.A + c2, where c1 and c2 are constants. The band-join algorithm [15] partitions the data from relations R and S into partitions Ri and Si respectively, such that for every tuple r ∈ R, all tuples of S that join with r appear in Si. It assumes that Ri fits into memory. Contrary to IEJoin, the band join is limited to a single type of inequality condition, involving one single attribute from each relation. IEJoin works for any inequality conditions and attributes from the two relations. While band join queries can be processed using my algorithm, not all IEJoin queries can run with a band join algorithm. Interval joins are frequently used on temporal and spatial data. The work in [16] proposes the use of the relational Interval Tree to optimize joins over interval data. Each interval intersection is represented by two inequality conditions, where the lower and upper times of any two tuples are compared to check for overlaps. This work optimizes non-equijoins on interval intersections, representing each interval as a multi-value attribute. Compared to my work, it only focuses on improving interval intersection queries and cannot process general-purpose inequality joins. Spatial indexing is widely used in several applications with multidimensional datasets, for example Bitmap indices [69, 70], R-trees [71] and space-filling curves [72]. In PostgreSQL, support for spatial indexing algorithms is provided through a single interface known as the Generalized Search Tree [18] (GiST). From this collection of indices, the Bitmap index is the most suitable technique to optimize multiple-attribute queries that can be represented as 2-dimensional data. Examples of 2-dimensional datasets are intervals (e.g., start and end time in Q2), GPS coordinates (e.g., Q3), and any two numerical attributes that represent a point in an XY plot (e.g., salary and tax in Q1). The main disadvantage of the Bitmap index is that it requires a large memory footprint to store all unique values of the composite attributes [17, 19]. The Bitmap index is a natural baseline for my algorithm but, unlike IEJoin, it does not perform well with high-cardinality attributes, as demonstrated in Figure 4.9. R-trees, on the other hand, are not suitable because an inequality join corresponds to window queries that are unbounded on two sides, and consequently intersect with a large number of internal nodes of the R-tree, generating unnecessary disk accesses. The patent in [73] also presents an algorithm to optimize the Cartesian product when joining two tables based on a single inequality condition. The algorithm partitions the input relations into smaller blocks based on the value distribution and the min and max values of the join predicate. It then applies a Cartesian product on a subset of the input partitions, eliminating unnecessary partitions depending on the join condition. Compared with my approach, this algorithm optimizes the Cartesian product through partitioning based on only one single inequality join predicate. Several other proposals have been made to speed up join execution in MapReduce (e.g., [74]). However, they focus on joins with equalities, thus requiring massive data shuffling to be able to compare each tuple with every other tuple. There have been few attempts to devise an efficient implementation of theta-joins in MapReduce [57, 75]. [57] focuses on pair-wise theta-join queries. It partitions the Cartesian product output space into rectangular regions of bounded sizes. Each partition is mapped to one reducer. The proposed partitioning guarantees correctness and workload balance among the reducers while minimizing the overall response time. [75] further extends [57] to solve multi-way theta-joins. It proposes an I/O- and network-cost-aware model for MapReduce jobs to estimate the minimum execution costs of all possible decomposition plans for a given query, and selects the best plan given a limited number of computing units and a pool of possible jobs. I propose a new algorithm that performs the actual inequality join based on sorting, permutation arrays, and bit-arrays. The focus of these previous proposals is on efficiently partitioning the output space and on providing a cost model for selecting the best combination of MapReduce jobs to minimize response time. In both proposals, the join is performed with existing algorithms, where inequality conditions correspond to a Cartesian product followed by a selection. Fagin's Algorithm [76] has a similar data access pattern to the bit-array scanning of IEJoin (Chapter 4.1). Both algorithms sort the database objects' (or tuples') attributes into multiple lists, where each list is sorted on a distinct attribute. After that, the content of an object's attribute is extracted from the attribute-sorted lists by randomly accessing the lists based on the object's key. Although Fagin's Algorithm and IEJoin share the same data access pattern, they have different objectives. Fagin's Algorithm extracts k objects from the sorted attribute lists. Each extracted object is assigned a score by applying a monotonic aggregation function to its attributes. The algorithm then returns a sorted list, based on the objects' scores, that represents the top-k objects in the database for some query. IEJoin, on the other hand, produces a list of object pairs that satisfy some inequality conditions between two sets of attributes. It utilizes a bit-array on top of the two attribute-sorted lists to minimize the cost of pair-wise comparisons. Regarding selectivity estimation, there are many approaches focusing on this problem. However, existing work on selectivity estimation is mostly focused on equijoins [77–81]. There have been few proposals for the general case of theta-joins [75] and spatial joins [82]. Nevertheless, most of these proposals estimate the selectivity of the inequality join to be O(n^2), since it is evaluated as a Cartesian product. IEJoin significantly differs from the related work, as it is not treated as a Cartesian product. It uses a simple, yet highly efficient, selectivity estimation technique that computes the number of overlapping sorted blocks obtained from a sample of the input relations.

2.3 General-purpose Scalable Graph Processing

Graph processing is a challenging problem; Lumsdaine et al. [83] show that the high data-access-to-computation ratio of graphs makes the performance and scalability of parallel graph processing systems vulnerable to graph irregularity and poor data locality. In the past few years, a number of large-scale graph processing systems have been proposed to address these challenges. A common thread across the majority of existing work is developing high-level, expressive graph processing abstractions and parallelizing them, while exploring various types of static graph partitioning to balance computation across distributed workers. However, almost none of the graph processing systems touch on the problem of applying dynamic load balancing through runtime graph partitioning, which Bruce et al. [84] assert to be either too expensive or hard to parallelize.

MapReduce-based graph processing systems. MapReduce [25] is a programming abstraction proposed by Google to simplify efficient large-scale data processing. It provides automatic parallelization, failure recovery and load balancing to users who do not have any parallel programming experience. Hadoop [85], the open-source version of MapReduce, has been used by both PEGASUS [22] and X-Rime [23, 86] to provide large-scale graph processing systems through a fixed set of highly optimized graph algorithm implementations on top of the MapReduce programming model. Since a handful of graph algorithms can be represented as a matrix-vector computation, PEGASUS implemented a generalization of matrix-vector multiplication on top of Hadoop to facilitate graph processing. On the other hand, X-Rime implemented an optimized graph data store on top of HDFS, and utilized the scalability of Hadoop to distribute the graph data store among a large number of Hadoop workers to improve large-scale graph processing. The main advantage of PEGASUS and X-Rime is the utilization of existing scalable and reliable general data processing platforms to provide graph processing systems, instead of incurring the overhead of implementing them from scratch. However, the dependency on Hadoop comes at a price; both systems are affected by Hadoop's inefficiency in iterative computations and its limitations in processing skewed data. Moreover, graph algorithms on top of PEGASUS and X-Rime require careful implementation with lots of optimization tricks, which complicates programming for general users. Improving graph processing implemented on top of MapReduce requires improving the workflow of MapReduce itself, which is a well-studied problem in the literature [87]. Regardless of the possible MapReduce optimizations, its programming abstraction is too generic and does not fit the nature of graph algorithms.


Figure 2.1: Example of hash-based partitioning with three partitions and seven edge cuts


Pregel and its clones. Pregel [20] is a message-passing graph processing system introduced by Google to overcome the limitations of graph processing on top of MapReduce. Its programming abstraction is built on top of the Bulk Synchronous Parallel (BSP) [26] programming model. Kajdanowicz et al. [88] surveyed the differences between MapReduce- and BSP-based graph processing systems. In Pregel, each vertex runs a user-defined algorithm independently and can send messages asynchronously to any other vertex. Computation is divided into a number of supersteps (iterations), each separated by a global synchronization barrier. During each superstep, vertices run in parallel across a distributed infrastructure. Each vertex processes incoming messages from the previous superstep. This process continues until all vertices have no messages to send, thereby becoming inactive. Pregel overlaps vertex computation and communication to provide efficient parallel execution for large distributed graphs. The default partitioning scheme for the input graph used by Pregel is hash-based, where it assigns more than one subgraph (partition) to a worker. Figure 2.1 shows an example of hash-based partitioning with three partitions and seven edge cuts.


Figure 2.2: Example of three-way min-cuts partitioning using METIS with four edge cuts

Pregel claims that its programming model is expressive enough to represent a wide spectrum of graph algorithms while the user's code is inherently free of deadlocks. Fault tolerance in Pregel is provided by synchronously checkpointing the graph's data and the vertices' messages frequently. Pregel is considered a scalable abstraction for graph processing, but it lacks runtime load balancing. Giraph [29] is an open-source clone of Pregel that is built on top of Hadoop. It supports Pregel's programming model in addition to a set of other graph processing features developed by the community. Giraph supports hash-based and range-based partitioning schemes; users can also implement their own custom graph partitioning scheme. Giraph, for the time being, inherits all the limitations of Pregel, besides the overhead imposed by Java's garbage collector. GoldenOrb [30] is another clone of Pregel that uses Hadoop's file system (HDFS) as its storage system and employs hash-based partitioning. Its authors claim that they are able to dynamically increase storage and processing during runtime, but they do not provide any technical details about this feature. Similar to Giraph, GoldenOrb also inherits the limitations of Pregel. An optimization introduced in [89] uses a fast approximate min-cuts graph partitioning algorithm known as METIS [33], instead of GoldenOrb's hash-based partitioning, to maintain a balanced workload among the partitions and minimize communication across workers. Figure 2.2 shows an example of applying three-way METIS partitioning on a sample graph, which reduces the edge cuts by 40% compared to the hash-based partitioning in Figure 2.1. Additionally, they propose to utilize a distributed data store with replication to improve the data access speed of the frequently accessed vertices in the distributed graph processing system. Similarly, Surfer [32] utilizes min-cut graph partitioning to improve the performance of Pregel's computations. However, Surfer focuses more on providing a bandwidth-aware variant of the multilevel graph partitioning algorithm (integrated into a Pregel implementation) to improve the performance of graph partitioning in the cloud. These systems focus on providing static graph partitioning techniques to optimize graph processing and reduce the distributed communication cost, but they lack the dynamic runtime optimizations required for algorithms with unpredictable runtime behavior. Such algorithms can unbalance the workers' computation and message traffic during runtime, which leads to slower algorithm execution. They also lack support for efficient processing of highly skewed power-law graphs. A recent work by Shang et al. [90] optimizes Pregel's runtime by looking into the dynamic behavior of algorithms. They constructed an optimization model called Wind, implemented on top of an open-source bulk synchronous parallel computing framework known as HAMA [31], based on the fact that graph algorithms do not access all vertices in every superstep. Wind classifies graph algorithms into three categories: Always-Active-Style, Traversal-Style, and Multi-Phase-Style. It also monitors the set of active vertices that the graph algorithm needs to run on and applies fine-grained vertex migrations to minimize communication while maintaining workload balance.
This work is relatively similar to my work on Mizan [37], which is discussed in Chapter 5, but the two diverge in how they define the optimization objective of vertex migrations.

Shared memory graph processing systems. GraphLab [91] is another programming abstraction proposed to overcome the limitations of MapReduce. It provides efficient parallel processing for graph and machine learning algorithms. Unlike BSP frameworks, GraphLab supports both asynchronous and synchronous processing, where the computation is executed on the most recently available data. That is, each vertex in GraphLab runs the user code and modifies its own data or the data of other vertices asynchronously through a shared memory abstraction. GraphLab ensures that its asynchronous multi-threaded computations are consistent and can produce results equivalent to a serial execution by providing a set of consistency models that eliminate the race conditions caused by overlapping computations. Moreover, users of GraphLab can trade off the performance of their algorithms against the cost of computation consistency. Distributed GraphLab [92] extends the original multi-core GraphLab implementation to a cloud-based distributed setting; it applies a two-phase partitioning scheme that uses ParMetis [93] to minimize the edges among graph partitions and allows for fast graph repartitioning. Failure recovery in GraphLab is provided through asynchronous fixed-interval graph snapshots. Similar to GraphLab, HipG [94] is designed to operate on distributed memory machines while providing transparent memory sharing to support hierarchical parallel graph algorithms. HipG also processes graph algorithms using the BSP model. Unlike the global synchronous barriers of Pregel, HipG applies more fine-grained barriers, called synchronizers, during algorithm execution. The user has to be involved in writing the synchronizers, which adds complexity to the user's code. GraphChi [95] is a disk-based graph processing system based on GraphLab and designed to run on a single machine with limited memory. It requires only a small number of non-sequential disk accesses to process subgraphs from disk, while allowing for asynchronous iterative computations. These systems neither support dynamic load balancing techniques nor support processing graphs with power-law degree distributions.

Power-law optimized graph processing systems. Stanford GPS (GPS) [96] is another clone of Pregel with three additional features. First, GPS extends Pregel's API to handle graph algorithms that perform both vertex-centric and global computations. Second, GPS supports runtime dynamic graph repartitioning, with the objective of minimizing communication from any given initial graph partitioning. Third, GPS provides an optimization scheme called large adjacency list partitioning, where any high-degree vertex is replicated across workers and its out-degree list is partitioned and stored in a distributed fashion by the replicas to mitigate the effect of graphs with power-law distributions. The runtime dynamic graph repartitioning in GPS only monitors outgoing messages to balance the network I/O across workers. However, it does not fully capture the dynamic behavior of different algorithms, which can be caused by factors other than outgoing messages. Moreover, GPS lacks the flexibility to migrate vertices to other workers, since it depends on vertex hashing to locate vertices in the distributed cluster and requires system-wide vertex relabeling for migrated vertices. PowerGraph [97], on the other hand, is a distributed graph system derived from GraphLab that overcomes the challenges of processing natural graphs with extreme power-law distributions. PowerGraph introduces vertex-based partitioning with replication to distribute the graph's data based on its edges. It also provides a programming abstraction (based on gather, apply and scatter operators) to distribute the user's algorithm workload on its graph processing system. PowerGraph focuses on evaluating static algorithms on power-law graphs. It is thus unclear how the system performs across the broader class of algorithms with variable behaviors and graphs (e.g., non-power-law). The programming abstraction of PowerGraph that improves power-law graph processing supports only commutative and associative graph algorithms, or graph algorithms that support message combiners, which can be more restrictive than Pregel's and adds extra overhead for graph algorithm developers.

Specialized graph systems. Kineograph [98] is a distributed system that targets the processing of fast-changing graphs. It captures the relationships in the input data, reflects them on the graph structure, and creates regular snapshots of the data. Data consistency is ensured in Kineograph by separating graph processing and graph update routines. This system is mainly developed to support algorithms that mine the incremental changes in the graph structure over a series of graph snapshots. The user can also run an offline iterative analysis over a copy of the graph, similar to Pregel. The Little Engine(s) [99] is also a distributed system designed to scale with online social networks. This system utilizes graph repartitioning with one-hop replication to rebalance any changes to the graph structure, localize graph data, and reduce the network cost associated with query processing. The Little Engine(s) repartitioning algorithm aims to find min-cuts in a graph under the condition of minimizing the replication overhead at each worker. This system works as a middle-box between an application and its database systems; it is not intended for offline analytics or batch data processing like Pregel or MapReduce. Ligra [100] is another parallel shared-memory graph processing system that provides programming abstractions highly customized for graph traversal algorithms on large multicore machines. It requires users to provide two functions that are mapped on subsets of vertices and edges, respectively and independently. Compared to Pregel and GraphLab, Ligra is a domain-specific graph processing system and does not target general graph processing problems. Trinity [101] is a distributed in-memory key-value storage based graph processing system that supports both offline analytics and online query answering. Trinity provides a computing model that is similar to, but more restrictive than, Pregel's for offline graph analytics. The key advantage of Trinity is its low-latency graph storage design, where the data of each vertex is stored in a binary format to eliminate decoding and encoding costs. Similar to Trinity, the STAPL parallel graph library (SGL) [102] is an in-memory distributed graph processing system that abstracts data storage and parallelism complexity from users. The main contribution of SGL is providing universal access to the distributed in-memory graph data through a shared object in the user's code. SGL also supports multiple graph processing abstractions, including synchronous and asynchronous coarse-grained paradigms such as PBGL [103], and fine-grained paradigms such as Pregel and GraphLab. Unlike Trinity, SGL supports graph redistribution and vertex migrations asynchronously during runtime to rebalance the shared data object through a user-defined cost function. While these systems address specific design challenges in processing large graphs, their focus is orthogonal to my work in Chapter 5.

Chapter 3

BigDansing: A System for Big Data Cleansing

In this chapter, I present BigDansing, a Big Data Cleansing system that is easy to use and highly scalable. It provides automatic task parallelization for cleansing jobs, translating the user's input rules into execution plans on top of general-purpose scalable

data processing platforms. I start by defining BigDansing’s cleansing semantics and architecture (Chapter 3.1). After that, I introduce my rule specification abstraction and its transformation to logical plans (Chapter 3.2). This abstraction allows users to focus on the logic of their rules without worrying about the details of parallelism or runtime efficiency. I then describe techniques to translate each logical plan into an optimized physical plan, and then to execution plans on both Spark and MapReduce, which ensures efficient parallel execution (Chapter 3.3). To support scalable repair

in BigDansing, I consider two approaches that enable existing repair algorithms to run in distributed settings (Chapter 3.4). Finally, I evaluate BigDansing on both real-world and synthetic datasets with various rules to comprehensively validate its efficiency (Chapter 3.5).

3.1 Fundamentals and An Overview

I discuss the data cleansing semantics expressed by BigDansing and then give an overview of the system.

3.1.1 Data Cleansing Semantics

In BigDansing, the input data is defined as a set of data units, where each data unit is the smallest unit of input datasets. Each unit can have multiple associated elements that are identified by model-specific functions. For example, the database tuples are the data units for the relational model and the attributes identify their elements, while

triples are the data units for RDF data (see Chapter 3.6). BigDansing provides a set of parsers for producing such data units and elements from input datasets.

BigDansing adopts UDFs as the basis to define quality rules. Each rule has two fundamental abstract functions, namely Detect and GenFix. Detect takes one or multiple data units as input, and outputs a violation, i.e., elements in the input units that together are considered as erroneous w.r.t. the rule:

Detect(data units) → violation

GenFix takes a violation as input and computes alternative, possible updates to resolve this violation: GenFix(violation) → possible fixes

The language of the possible fixes is determined by the capabilities of the repair

algorithm. With the supported algorithms, a possible fix in BigDansing is an expression of the form x op y, where x is an element, op is in {=, ≠, <, >, ≥, ≤}, and

y is either an element or a constant. In addition, BigDansing has new functions to enable distributed and scalable execution of the entire cleansing process. I defer the details to Chapter 3.2.
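To make the two signatures concrete, the following is a minimal Java sketch of what a UDF-based rule could look like. It is illustrative only and not BigDansing's actual API; QualityRule and the placeholder types below are stand-ins for the richer types the system provides.

import java.util.List;

// Placeholder types; the real system supplies richer versions of these.
class Tuple {}
class Violation {}
class Fix {}

// Sketch of the two abstract functions that every rule provides.
interface QualityRule {
    // Detect(data units) -> violations caused jointly by the input units.
    List<Violation> detect(List<Tuple> units);

    // GenFix(violation) -> possible fixes, each an expression "x op y".
    List<Fix> genFix(Violation violation);
}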

Consider Example 1 in Chapter 1: the Detect function of φF (zipcode → city) takes two tuples (i.e., two data units) as input, and identifies a violation whenever the same zipcode value appears in the two tuples but with a different city. Thus, t2(90210, LA) and t4(90210, SF) are a violation. The GenFix function could enforce either t2[city] and t4[city] to be the same, or at least one element between t2[zipcode] and t4[zipcode] to be different from 90210. Rule φU (∀t1, t2 ∈ D, ¬(simF(t1.name, t2.name) ∧ getCounty(t1.city) = getCounty(t2.city))) is more general as it requires special processing. Detect takes two tuples as input, and outputs a violation whenever it finds similar name values and obtains the same county values from a mapping table. GenFix could propose to assign the same values to both tuples so that one of them is removed under set semantics. By using a UDF-based approach, I can support a large variety of traditional quality rules with a parser that automatically implements the abstract functions, e.g., CFDs [4] and DCs [6], but also more procedural rules that are provided by the user. These latter rules can implement any detection and repair method expressible with procedural code, such as Java, as long as they implement the signatures of the two above functions, as demonstrated in systems such as NADEEF [5]. Note that one known limitation of UDF-based systems is that, when treating UDFs as black boxes, it is hard to perform static analysis, such as consistency and implication checking, for the given rules.

BigDansing targets the following data cleansing problem: given a dirty dataset D and a set of rules Σ, compute a repair, i.e., an updated instance D′ such that there are no violations, or there are only violations with no possible fixes. Among the many possible solutions to the cleansing problem, it is common to define a notion of minimality based on the cost of a repair. A popular cost function [59] for relational data is the following: Σ_{t∈D, t′∈Dr, A∈A_R} dis_A(D(t[A]), D(t′[A])), where t′ is the fix for a specific tuple t, and dis_A(D(t[A]), D(t′[A])) is a distance between their values for attribute A (an exact match returns 0). This models the intuition that the higher the sum of the distances between the original values and their fixes, the more expensive the repair. Computing such minimum repairs is NP-hard, even with FDs only [9, 59]. Thus, data repair algorithms are mostly heuristics (see Chapter 3.4).
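As a small worked illustration (assuming a unit distance, i.e., dis_A returns 1 whenever a value is changed and 0 otherwise), a repair of the violation between t2 and t4 of Example 1 that updates only t4[city] from SF to LA has cost

\[
\sum_{t \in D,\; t' \in D_r,\; A \in \mathcal{A}_R} dis_A\big(D(t[A]), D(t'[A])\big) \;=\; dis_{city}(\mathrm{SF}, \mathrm{LA}) \;=\; 1,
\]

whereas a repair that changes both t2[zipcode] and t4[zipcode] away from 90210 has cost 2; a minimum repair therefore prefers the single-cell update.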

Figure 3.1: BigDansing architecture

3.1.2 Architecture

The architecture of BigDansing is illustrated in Figure 3.1. BigDansing receives a data quality rule together with a dirty dataset from users (1) and outputs a clean dataset (7). BigDansing consists of two main components: the RuleEngine and the RepairAlgorithm.

The RuleEngine receives a quality rule either in a UDF-based form (a BigDansing job) or in a declarative form (a declarative rule). A job (i.e., a script) defines the users' operations as well as the sequence in which users want to run them (see Chapter 3.2.2). A declarative rule is written using traditional integrity constraints. In the latter case, the RuleEngine automatically translates the declarative rule, through the RuleParser, into a job to be executed on a parallel data processing framework. The RuleParser is a static rule-based optimizer that supports well-known integrity

constraints: FD, CFD and DC. However, the generic form of first-order logic may also

be supported in BigDansing, which only requires modifying the RuleParser. This job outputs the set of violations and the possible fixes for each violation. The RuleEngine has three layers: the logical, physical, and execution layers. This architecture allows BigDansing to (i) support a large variety of data quality rules by abstracting the rule specification process, (ii) achieve high efficiency when cleansing datasets by performing a number of physical optimizations, and (iii) scale to big datasets by fully leveraging the scalability of existing parallel data processing frameworks. Notice that, unlike a DBMS, the RuleEngine also has an execution abstraction,

which allows BigDansing to run on top of general-purpose data processing frameworks ranging from MapReduce-like systems to databases.

(1) Logical layer. A major goal of BigDansing is to allow users to express a variety of data quality rules in a simple way. This means that users should only care about the logic of their quality rules, without worrying about how to make the code distributed.

To this end, BigDansing provides five logical operators, namely Scope, Block, Iterate, Detect, and GenFix, to express a data quality rule: Scope defines the relevant data for the rule; Block defines the group of data units among which a violation may occur; Iterate enumerates the candidate violations; Detect determines whether a candidate violation is indeed a violation; and GenFix generates a set of possible fixes for each violation. Users define these logical operators, as well as the sequence in which

BigDansing has to run them, in their jobs. Alternatively, users provide a declarative rule and BigDansing translates it into a job having these five logical operators. Notice that a job represents the logical plan of a given input quality rule.

(2) Physical layer. In this layer, BigDansing receives a logical plan and transforms it into an optimized physical plan of physical operators. Like in DBMSs, a physical plan specifies how a logical plan is implemented. For example, Block could be implemented by either hash-based or range-based methods. A physical operator in

BigDansing also contains extra information, such as the input dataset and the degree of parallelism. Overall, BigDansing processes a logical plan through two main optimization steps, namely plan consolidation and data access optimization, through the use of specialized join and data access operators.

(3) Execution layer. In this layer, BigDansing determines how a physical plan will actually be executed on the underlying parallel data processing framework. It transforms a physical plan into an execution plan, which consists of a set of system-dependent operations, e.g., a Spark or MapReduce job. BigDansing runs the generated execution plan on the underlying system. Then, it collects all the violations and possible fixes produced by this execution. As a result, users get the benefits of parallel data processing frameworks by providing just a few lines of code for the logical operators.

Once BigDansing has collected the set of violations and possible fixes, it proceeds to repair (cleanse) the input dirty dataset. At this point, the repair process is independent from the number of rules and their semantics, as the repair algorithm considers only the set of violations and their possible fixes. The way the final fixes are chosen among all the possible ones strongly depends on the repair algorithm itself. Instead of proposing a new repair algorithm, I present in Chapter 3.4 two different approaches to implement existing algorithms in a distributed setting. Correctness and termination properties of the original algorithms are preserved in my extensions. Alternatively, expert users can plug in their own repair algorithms. In the algorithms that I extend, each repair step greedily eliminates violations with possible fixes while minimizing the cost function in Chapter 3.1.1. The iterative process, i.e., detection and repair, terminates if there are no more violations or there are only violations with no corresponding possible fixes. The repair step may introduce new violations on previously fixed data units, and a new step may decide to again update these units. To ensure termination, the algorithm places a special variable on such units after a fixed number of iterations (a user-defined parameter), thus eliminating the possibility of future violations on the same data.
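The interplay just described can be pictured with the following schematic driver loop. This is a sketch only, not BigDansing's actual control flow; all type names are stand-ins introduced for illustration.

import java.util.List;

// Schematic sketch of the detect-repair iteration (not the real driver).
class CleansingDriver {
    interface Dataset {}
    interface Violation {}
    interface Fix {}
    interface RuleEngine {
        List<Violation> detect(Dataset d);
        List<Fix> genFix(List<Violation> v);
    }
    interface RepairAlgorithm {
        Dataset apply(Dataset d, List<Fix> fixes);
    }

    static Dataset cleanse(Dataset d, RuleEngine engine, RepairAlgorithm repair, int maxIters) {
        for (int i = 0; i < maxIters; i++) {
            List<Violation> violations = engine.detect(d);
            List<Fix> fixes = engine.genFix(violations);
            // Stop when the data is clean or only unfixable violations remain.
            if (violations.isEmpty() || fixes.isEmpty()) break;
            d = repair.apply(d, fixes);   // repairing may introduce new violations
        }
        return d;
    }
}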

3.2 Rule Specification

The BigDansing abstraction consists of five logical operators: Scope, Block, Iterate, Detect, and GenFix, which are powerful enough to express a large spectrum of cleansing tasks. While Detect and GenFix are the two general operators that model the data cleansing process, Scope, Block, and Iterate enable the efficient and scalable execution of that process. Generally speaking, Scope reduces the amount of data that has to be treated, Block reduces the search space for candidate violation generation, and Iterate efficiently traverses the reduced search space to generate all candidate violations. It is worth noting that these five operators do not model the repair process itself (i.e., a repair algorithm). The system translates these operators, along with a generated or user-provided BigDansing job, into a logical plan.

3.2.1 Logical Operators

As mentioned earlier, BigDansing defines the input data through data units Us on which the logical operators operate. While such a fine-granular model might seem to incur high cost as BigDansing calls an operator for each U, it in fact allows application of an operator in a highly parallel fashion.

In the following, I define the five operators provided by BigDansing and illustrate them in Figure 3.2 using the dataset and FD rule φF (zipcode → city) of Example 1. Notice that the following listings are automatically generated by the system for declarative rules. Users can either modify this code or provide their own in the UDF case.

(1) Scope removes irrelevant data units from a dataset. For each data unit U, Scope outputs a set of filtered data units, which can be an empty set.

Scope(U) → list⟨U′⟩

Notice that Scope outputs a list of Us as it allows data units to be replicated. This

operator is important as it allows BigDansing to focus only on data units that are relevant to a given rule. For instance, in Figure 3.2, Scope projects on attributes zipcode and city. Listing 3.1 shows the lines of code for Scope in rule φF .

public Tuple scope(Tuple in) {
  Tuple t = new Tuple(in.getID());
  t.addCell(in.getCellValue(1)); // zipcode
  t.addCell(in.getCellValue(2)); // city
  return t;
}
Listing 3.1: Code example for the Scope operator.

(2) Block groups data units sharing the same blocking key. For each data unit U, Block outputs a blocking key. Block(U) → key

The Block operator is crucial for BigDansing’s scalability as it narrows down the number of data units on which a violation might occur. For example, in Figure 3.2, Block groups tuples on attribute zipcode, resulting in three blocks, with each block having a distinct zipcode. Violations might occur inside these blocks only and not across blocks. Listing 3.2 shows the single line of code required by this operator for

rule φF .

public String block(Tuple t) {
  return t.getCellValue(0); // zipcode
}
Listing 3.2: Code example for the Block operator.

(3) Iterate defines how to combine data units Us to generate candidate violations.

This operator takes as input a list of lists of data units Us (because it might take the output of several previous operators) and outputs a single U, a pair of Us, or a list of Us.

Iterate(list⟨list⟨U⟩⟩) → U′ | ⟨Ui, Uj⟩ | list⟨U″⟩

This operator allows BigDansing to avoid the quadratic complexity of generating candidate violations. For instance, in Figure 3.2, Iterate passes each unique combination of two tuples inside each block, producing four pairs only (instead of 13 pairs):

(t3, t5) from B1, and (t2, t4), (t2, t6), and (t4, t6) from B3. Listing 3.3 shows the code

required by this Iterate operator for rule φF .

public Iterable Iterate(ListTupleList in) {
  ArrayList tp = new ArrayList();
  List inList = in.getIterator().next().getValue();
  for (int i = 0; i < inList.size(); i++) {
    for (int j = i + 1; j < inList.size(); j++) {
      tp.add(new TuplePair(inList.get(i), inList.get(j)));
    }
  }
  return tp;
}
Listing 3.3: Code example for the Iterate operator.

(4) Detect takes a single U, a pair-U, or a list of Us as input and outputs a list of violations, possibly empty. The single U indicates that the Detect function should check for violations within the same data unit. The pair-U is a special case of a list of Us; both input types require Detect to check for violations that are collaboratively caused by the input data units, for example, to check whether the input list of Us are all duplicates.


Figure 3.2: Logical operators execution for rule FD φF

Detect(U | ⟨Ui, Uj⟩ | list⟨U′⟩) → {list⟨violation⟩}

Considering three types of inputs for Detect allows me to achieve better parallelization by distinguishing between different granularities of the input. For example, having 1K Us as input, rather than a single list of Us, would allow me to run 1K parallel instances of Detect (instead of a single Detect instance). In Figure 3.2, Detect outputs two violations, v1 = (t2, t4) and v2 = (t6, t4), as they have different values for city; it requires the lines of code in Listing 3.4.

public ArrayList detect(TuplePair in) {
  ArrayList lst = new ArrayList();
  if (!in.getLeft().getCellValue(1).equals(in.getRight().getCellValue(1))) {
    Violation v = new Violation("zipcode => city");
    v.addTuple(in.getLeft());
    v.addTuple(in.getRight());
    lst.add(v);
  }
  return lst;
}
Listing 3.4: Code example for the Detect operator.

(5) GenFix computes a set of possible fixes for a given violation.

GenFix(violation) → {list⟨PossibleFixes⟩}

For instance, assuming that only right-hand side values can be modified, GenFix

produces one possible repair for each detected violation (Figure 3.2): t2[city] = t4[city] and t6[city] = t4[city]. Listing 3.5 shows the code for this GenFix.

public ArrayList GenFix(Violation v) {
  ArrayList result = new ArrayList();
  Tuple t1 = v.getLeft();
  Tuple t2 = v.getRight();
  Cell c1 = new Cell(t1.getID(), "City", t1.getCellValue(1));
  Cell c2 = new Cell(t2.getID(), "City", t2.getCellValue(1));
  result.add(new Fix(c1, "=", c2));
  return result;
}
Listing 3.5: Code example for the GenFix operator.

Additionally, to better specify the data flow among the different operators, I introduce a label to stamp a data item and track how it is being processed.

In contrast to DBMS operators, BigDansing's operators are UDFs, which allow users to plug in any logic. As a result, I am not restricted to a specific data model. However, for ease of explanation, all of my examples assume relational data. I report an RDF data cleansing example in Chapter 3.6. I also report in Appendix B the code required to write the same rule in a distributed environment, such as Spark. The proposed templates for the operators allow users to obtain distributed detection without any expertise in Spark, only by providing from 3 to 16 lines of Java code. The benefit of the abstraction should be apparent at this point: (i) ease of use for non-expert users and (ii) better scalability.

3.2.2 BigDansing Example Job

I explain with an example job how users can specify to BigDansing the sequence in which to run their logical operators. Users can fully control the execution flow of their logical operators via data labels, which represent the data

flow of a specific input dataset. For example, users would write the BigDansing job in Listing 3.6 to generate the logical plan in Figure 3.4. First of all, users create a job instance of BigDansing to specify their requirements (Line 1). Users specify the input datasets they want to process by optionally providing their schema (Line 2) and their path (D1 and D2 in Lines 3 & 4). Additionally, users label the input datasets to define the number of data flows they desire to have (S and T for D1, and W for D2). Notice that, in this example, the job creates a copy of D1. Then, users specify the sequence of logical operations they want to perform on each data flow (Lines 5-

11). BigDansing respects the order in which users specify their logical operators, e.g., BigDansing will first perform Scope and then Block1. Users can also specify an arbitrary number of inputs and outputs. For instance, Iterate1 gets data flows S and T as input and outputs a single data flow M. As a last step, users run their job as in Line 12.

3.2.3 The Planning Process

The logical layer of BigDansing takes as input a set of labeled logical operators together with a job and outputs a logical plan. The system starts by validating whether the provided job is correct by checking that all referenced logical operators are defined and at least one Detect is specified. Recall that for declarative rules, such as CFDs and DCs, users do not need to provide any job, as BigDansing automatically generates a job along with the logical operators.

After validating the provided job, BigDansing generates a logical plan (Figure 3.3) that: (i) must have at least one input dataset Di; (ii) may have one or more Scope operators; (iii) may have one or more Block operators linked to an Iterate operator; (iv) must have at least one Detect operator to possibly produce a set of violations; and (v) may have one GenFix operator for each Detect operator to generate possible fixes for each violation.

public void main(String args[]) {
1   BigDansing job = new BigDansing("Example Job");
2   String schema = "name,zipcode,city,state,salary,rate";
3   job.addInputPath(schema, "D1", "S", "T");
4   job.addInputPath(schema, "D2", "W");
5   job.addScope(Scope, "S");
6   job.addBlock(Block1, "S");
7   job.addBlock(Block2, "T");
8   job.addIterate("M", Iterate1, "S", "T");
9   job.addIterate("V", Iterate2, "W", "M");
10  job.addDetect(Detect, "V");
11  job.addGenFix(GenFix, "V");
12  job.run();
}
Listing 3.6: Example of a user's BigDansing job.

Figure 3.3: Planner execution flow

BigDansing looks for a corresponding Iterate operator for each Detect. If Iterate is not specified, BigDansing generates one according to the input required by the Detect operator. Then, it looks for Block and Scope operators that match the input label of the Detect operator. If an Iterate operator is specified, BigDansing identifies all Block operators whose input labels match the input label of the Iterate. Then, it uses the input labels of the Iterate operator to find other possible Iterate operators in reverse order. Each newly detected Iterate operator is added to the plan with the Block operators related to its input labels. Once it has processed all Iterate operators, it finally looks for Scope operators.

Figure 3.4: Example of a logical plan

In case the Scope or Block operators are missing, BigDansing pushes the input dataset to the next operator in the logical plan. If no GenFix operator is provided, the output of the Detect operator is written to disk. For example, let's consider the logical plan in Figure 3.4, which is generated by BigDansing when a user provides the job in Chapter 3.2.2. I observe that dataset D1 is sent directly to a Scope and a Block operator. Similarly, dataset D2 is sent directly to an Iterate operator. I also observe that one can iterate over the output of previous Iterate operators

(e.g., over output DM ). This flexibility allows users to express complex data quality rules, such as in the form of “bushy” plans as shown in Appendix A.

3.3 Building Physical Plans

When translating a logical plan, BigDansing exploits two main opportunities to derive an optimized physical plan: (i) static analysis of the logical plan and (ii) alternative translations for each logical operator. Since quality rules may involve joins over ordering comparisons, I also introduce an efficient algorithm for these cases. I consider these aspects in this chapter. In addition, I discuss the translation of physical plans to Spark and MapReduce while describing the data storage and access issues of BigDansing.

3.3.1 Physical Operators

There are two kinds of physical operators: wrappers and enhancers. A wrapper simply invokes a logical operator. Enhancers replace wrappers to take advantage of different optimization opportunities.

A wrapper invokes a logical operator together with the corresponding physical details, e.g., input dataset and schema details (if available). For clarity, I discard such physical details in all the definitions below. In contrast to the logical layer,

BigDansing invokes a physical operator for a set D of data units, rather than for a single unit. This enables the processing of multiple units in a single function call. For each logical operator, I define a corresponding wrapper.

(1) PScope applies a user-defined selection and projection over a set of data units D and outputs a dataset D′ ⊂ D.

PScope(D) → {D′}

(2) PBlock takes a dataset D as input and outputs a list of key-value pairs defined by users. PBlock(D) → map⟨key, list⟨U⟩⟩

(3) PIterate takes a list of lists of Us as input and outputs their cross product or a user-defined combination.

PIterate(list⟨list⟨U⟩⟩) → list⟨U⟩ | list⟨Pair⟨U⟩⟩

(4) PDetect receives either a list of data units Us or a list of data unit pairs (U-pairs) and produces a list of violations.

PDetect(list⟨U⟩ | list⟨Pair⟨U⟩⟩) → list⟨Violation⟩

(5) PGenFix receives a list of violations as input and outputs a list of sets of possible fixes, where each set of fixes belongs to a different input violation.

PGenFix(list⟨Violation⟩) → list⟨{PossibleFixes}⟩

With declarative rules, the operations over a dataset are known, which enables algorithmic opportunities to improve performance. Notice that one could also discover such operations with code analysis over the UDFs [104]. However, I leave this extension to future work. BigDansing exploits such optimization opportunities via three new enhancer operators: CoBlock, UCrossProduct, and OCJoin. CoBlock is a physical operator that allows grouping multiple datasets by a given key. UCrossProduct and OCJoin are two alternative implementations of the PIterate operator.

3.3.2 From Logical to Physical Plan

As mentioned earlier, optimizing a logical plan is performed by static analysis (plan consolidation) and by plugging enhancers (operators translation) whenever possible.

Plan Consolidation. Whenever logical operators use a different label for the same

dataset, BigDansing translates them into distinct physical operators. BigDansing has to create multiple copies of the same dataset, which it might broadcast to multiple nodes. Thus, both the memory footprint at compute nodes and the network traffic

are increased. To address this problem, BigDansing consolidates redundant logical operators into a single logical operator. Hence, by applying the same logical operator several times on the same set of data units using shared scans, BigDansing is able to increase data locality and reduce I/O overhead.

Algorithm 1: Logical plan consolidation
input : LogicalPlan lp
output: LogicalPlan clp
 1  PlanBuilder lpb = new PlanBuilder();
 2  for logical operator lopi ∈ lp do
 3      lopj ← findMatchingLO(lopi, lp);
 4      DS1 ← getSourceDS(lopi);
 5      DS2 ← getSourceDS(lopj);
 6      if DS1 == DS2 then
 7          lopc ← getLabelsFuncs(lopi, lopj);
 8          lopc.setInput(DS1, DS2);
 9          lpb.add(lopc);
10          lp.remove(lopi, lopj);
11  if lpb.hasConsolidatedOps then
12      lpb.add(lp.getOperators());
13      return lpb.generateConsolidatedLP();
14  else
15      return lp;

Algorithm 1 details the consolidation process for an input logical plan lp. For

each operator lopi, the algorithm looks for a matching operator lopj (Lines 2-3). If

lopj has the same input dataset as lopi, BigDansing consolidates them into lopc

(Lines 4-6). The newly consolidated operator lopc takes the labels, functions, and

datasets from lopi and lopj (Lines 7-8). Next, BigDansing adds lopc into a logical

plan builder lpb and removes lopi and lopj from lp (Lines 9-10). At last, if any operator was consolidated, it adds the non-consolidated operators to lpb and returns the consolidated logical plan (Lines 11-13). Otherwise, it returns lp (Line 15). Let me now illustrate the consolidation process with an example. Consider a DC on the TPC-H database stating that if a customer and a supplier have the same name and phone, they must be in the same city. Formally,

DC (3.1): ∀t1, t2 ∈ D1, ¬(t1.c_name = t2.s_name ∧ t1.c_phone = t2.s_phone ∧ t1.c_city ≠ t2.s_city)

Figure 3.5: Plans for DC in rule 3.1: (a) logical plan, (b) consolidated logical plan, and (c) physical plan

For this DC, BigDansing generates the logical plan in Figure 3.5(a) with operators Scope and Block applied twice over the same input dataset. It then consolidates redundant logical operators into a single one (Figure 3.5(b)), thereby reducing the overhead of reading an input dataset multiple times. The logical plan consolidation is only applied when it does not affect the original labeling of the operators.

Operators Translation. Once a logical plan has been consolidated, BigDansing translates the consolidated logical plan into a physical plan. It maps each logical operator to its corresponding wrapper, which in turn maps to one or more physical operators. For example, it produces the physical plan in Figure 3.5(c) for the consolidated logical plan in Figure 3.5(b). For enhancers, BigDansing exploits some particular information from the data cleansing process. Below, I detail these three enhancer operators, as well as the cases in which they are used by my system.

• CoBlock takes multiple input datasets D and applies a group-by on a given key. This limits the comparisons required by the rule to only blocks with the same key from the different datasets. An example is shown in Figure A.1 (Appendix A). Similar to the CoGroup operator defined in [55], CoBlock collects all keys from both inputs into bags. The output of this operator is a map from a key value to the list of data units sharing that key value. I formally define this operator as:

CoBlock(D) → map⟨key, list⟨list⟨U⟩⟩⟩

If two non-consolidated Block operators’ outputs go to the same Iterate, and then to a single PDetect, BigDansing translates them into a single CoBlock. Using CoBlock allows me to reduce the number of candidate violations for Detect. This is because Iterate generates candidates only inside and not across CoBlocks (see Figure 3.6).

• UCrossProduct receives a single input dataset D and applies a self cross product over it. This operation is usually performed in cleansing processes that would output the same violations, irrespective of the order of the input of Detect. UCrossProduct avoids

redundant comparisons, reducing the number of comparisons from n² to n×(n−1)/2, with n being the number of units Us in the input. For example, the output of the logical operator Iterate in Figure 3.2 is the result of UCrossProduct within each block; there are four pairs instead of thirteen, since the operator avoids three comparisons for the elements in block B1 and six for the ones in B3 of Figure 3.2. Formally:

UCrossProduct(D) → list⟨Pair⟨U⟩⟩

If, for a single dataset, the declarative rules contain only symmetric comparisons, e.g., = and ≠, then the order in which the tuples are passed to PDetect (or to the next logical operator, if any) does not matter. In this case, BigDansing uses UCrossProduct to avoid materializing many unnecessary pairs of data units, such as for the Iterate operator in Figure 3.2. It also uses UCrossProduct when: (i) users do not provide a matching Block operator for the Iterate operator; or (ii) users do not provide any Iterate or Block operator.

Figure 3.6: Example plans with CoBlock: (a) logical plan with non-consolidated Block operators and (b) the corresponding physical plan

• OCJoin performs a self join on one or more ordering comparisons (i.e., <, >,

≥, ≤). This is a very common operation in rules such as DCs. Thus, BigDansing provides OCJoin, an efficient operator to handle joins with ordering comparisons. It takes an input dataset D and applies a number of join conditions, returning a list of joining U-pairs. I formally define this operator as follows:

OCJoin(D) → list⟨Pair⟨U⟩⟩

Every time BigDansing recognizes join conditions defined with ordering comparisons in PDetect, e.g., φD, it translates Iterate into an OCJoin implementation. Then, it passes the OCJoin output to a PDetect operator (or to the next logical operator, if any).

3.3.3 Fast Joins with Ordering Comparisons

Existing systems handle joins over ordering comparisons using a cross product and

a post-selection predicate, leading to poor performance. BigDansing provides an efficient ad-hoc join operator, referred to as OCJoin, to handle these cases. The main goal of OCJoin is to increase the ability to process joins over ordering comparisons in parallel, and to reduce its complexity by reducing the algorithm’s search space. In a nutshell, OCJoin first range partitions a set of data units and sorts each of the resulting partitions in order to validate the inequality join conditions in a distributed fashion. OCJoin works in four main phases: partitioning, sorting, pruning, and joining (see Algorithm 2).

Partitioning. OCJoin first selects the attribute, PartAtt, on which to partition the input D (line 1). I assume that all join conditions have the same output cardinality. This can be improved using cardinality estimation techniques [15, 57], but it is beyond the scope of this chapter. OCJoin chooses the first attribute involved in the first condition. For instance, considering again φD (Example 1), OCJoin sets PartAtt to the rate attribute. Then, OCJoin partitions the input dataset D into nbParts range partitions based on PartAtt (line 2). As part of this partitioning, OCJoin distributes the resulting partitions across all available computing nodes. Notice that OCJoin runs the range partitioning in parallel. Next, OCJoin forks a parallel process for each range partition ki to run the three remaining phases (lines 3-14).

Sorting. For each partition, OCJoin creates as many sorted lists (Sorts) as there are inequality conditions in a rule (lines 4-5). For example, OCJoin creates two sorted lists for

φD: one sorted on rate and the other sorted on salary. Each list contains the attribute values on which the list is sorted, together with the tuple identifiers. Note that OCJoin only performs local sorting in this phase and hence does not require any data transfer across nodes. Since multiple copies of a partition may exist on multiple computing nodes, I apply sorting before the pruning and joining phases to ensure that each partition is sorted at most once.

Algorithm 2: OCJoin
input : Dataset D, Condition conds[ ], Integer nbParts
output: List⟨Tuple⟩ Tuples
    // Partitioning Phase
 1  PartAtt ← getPrimaryAtt(conds[].getAttribute());
 2  K ← RangePartition(D, PartAtt, nbParts);
 3  for each ki ∈ K do
 4      for each cj ∈ conds[] do             // Sorting
 5          Sorts[j] ← sort(ki, cj.getAttribute());
 6      for each kl ∈ {ki+1 ... k|K|} do
 7          if overlap(ki, kl, PartAtt) then // Pruning
 8              tuples = ∅;
 9              for each cj ∈ conds[] do     // Joining
10                  tuples ← join(ki, kl, Sorts[j], tuples);
11                  if tuples == ∅ then
12                      break;
13              if tuples != ∅ then
14                  Tuples.add(tuples);

Pruning. Once all partitions ki ∈ K are internally sorted, OCJoin can start joining each of these partitions based on the inequality join conditions. However, this would require transferring large amounts of data from one node to another. To circumvent such an overhead, OCJoin inspects the min and max values of each partition to avoid joining partitions that do not overlap in their min and max range (the pruning phase, line 7). Non-overlapping partitions do not produce any join result. If the selectivity values for the different inequality conditions are known, OCJoin can order the different joins accordingly.
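The pruning test reduces to an interval-overlap check on the partitioning attribute, roughly as in the following sketch. The Partition class with precomputed min/max values is a hypothetical helper, not part of BigDansing.

// Hypothetical metadata kept per range partition: min/max of PartAtt.
class Partition {
    final double min, max;
    Partition(double min, double max) { this.min = min; this.max = max; }
}

class OCJoinPruning {
    // Partitions whose [min, max] ranges on PartAtt do not overlap are skipped,
    // since they produce no join result (line 7 of Algorithm 2).
    static boolean overlap(Partition a, Partition b) {
        return a.min <= b.max && b.min <= a.max;
    }
}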

Joining. OCJoin finally proceeds to join the overlapping partitions and outputs the join results (lines 9-14). For this, it applies a distributed sort-merge join over the sorted lists, where some partitions are broadcast to other machines while keeping the rest locally. Through pruning, OCJoin tells the underlying distributed processing platform which partitions to join. It is up to that platform to select the best approach to minimize the number of broadcast partitions.

Figure 3.7: A summary of BigDansing's logical, physical and execution operators for Spark.

3.3.4 Translation to Spark Execution Plans

Spark represents datasets as a set of Resilient Distributed Datasets (RDDs), where each RDD stores all Us of an input dataset in sequence. Thus, the Executor represents each physical operator as an RDD. I implement the full stack translation for all physical operators in Spark since it is my main production system.

I show in Figure 3.7 a summary of BigDansing’s operators for Spark.

Spark-PScope. The Executor receives a set of RDDs as well as a set of PScope operators. It links each RDD with one or more PScope operators, according to their labels. Then, it simply translates each PScope to a map() Spark operation over its RDD. Spark, in turn, takes care of automatically parallelizing the map() operation over all input Us. As a PScope might output a null or empty U, the Executor applies a filter() Spark operation to remove null and empty Us before passing them to the next operator.

Spark-PBlock. The Executor applies one Spark groupBy() operation for each PBlock operator over a single RDD. BigDansing automatically extracts the key from each U in parallel and passes it to Spark, which in turn uses the extracted key for its groupBy operation. As a result, Spark generates an RDDPair (a key-value pair data structure) containing a grouping key (the key in the RDDPair) together with the list of all Us sharing the same key (the value in the RDDPair).
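The two translations above can be sketched with Spark's Java API as follows. This is an illustrative sketch, not the Executor's actual code; the PScope, PBlock, and Tuple types below are stand-ins (in practice the UDF objects must also be serializable).

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

// Sketch of the Spark-PScope and Spark-PBlock translations.
class SparkTranslationSketch {
    interface Tuple  { boolean isEmpty(); }
    interface PScope { Tuple apply(Tuple t); }
    interface PBlock { String key(Tuple t); }

    static JavaRDD<Tuple> applyPScope(JavaRDD<Tuple> input, PScope scope) {
        return input.map(scope::apply)                       // run the Scope UDF on every unit
                    .filter(t -> t != null && !t.isEmpty()); // drop null/empty outputs
    }

    static JavaPairRDD<String, Iterable<Tuple>> applyPBlock(JavaRDD<Tuple> input, PBlock block) {
        return input.groupBy(block::key);                    // group units by the blocking key
    }
}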

Spark-CoBlock. The Executor receives a set of RDDs and a set of PBlock operators with matching labels. Similar to the Spark-PBlock, the Spark-CoBlock groups each input RDD (with groupBy()) using its corresponding PBlock. In addition, it performs a join() Spark operation on the keys of the output produced by groupBy(). Spark- CoBlock also outputs an RDDPair, but in contrast to Spark-PBlock, the produced value is a set of lists of Us from all input RDDs sharing the same extracted key.

Spark-CrossProduct & -UCrossProduct. The Executor receives two input RDDs and outputs an RDDPair of the resulting cross product. Notice that I extended Spark’s Scala code with a new function selfCartesian() in order to efficiently support the UCrossProduct operator. Basically, selfCartesian() computes all the possible combinations of pair-tuples in the input RDDs.

Spark-OCJoin. The Executor receives two RDDs and a set of inequality join conditions as input. The Executor applies the OCJoin operator on top of Spark as follows. First, it extracts PartAtt (the attribute on which it has to partition the two input RDDs) from both RDDs by using the keyBy() Spark function. Then, the Executor uses the sortByKey() Spark function to perform a range partitioning of both RDDs. As a result, the Executor produces a single RDD containing several data blocks using the mapPartitions() Spark function. Each data block provides as many lists as inequality join conditions, each containing all Us sorted on a different attribute involved in the join conditions. Finally, the Executor uses the selfCartesian() Spark function to generate unique sets of paired data blocks.

Spark-PDetect. This operator receives a PIterate operator, a PDetect operator, and a single RDD as input. The Executor first applies the PIterate operator and then the PDetect operator on the output. The Executor implements this operator using the map() Spark function.

Spark-PGenFix. The Executor applies a PGenFix on each input RDD using Spark's map() function. When processing multiple rules on the same input dataset, the Executor generates an independent RDD of fixes for each rule. After that, it combines all RDDs of possible repairs into a single RDD and passes it to BigDansing's repair algorithm. This operator is also implemented by the Executor inside the Detect operator for performance optimization purposes.

3.3.5 Translation to MapReduce Execution Plans

I now briefly describe how the Executor runs the five wrapper physical operators on

MapReduce. Since MapReduce is only a proof of concept for BigDansing, I am not considering in this chapter the translation of enhancers.

MR-PScope. The Executor translates the PScope operator into a Map task whose map function applies the received PScope. Null and empty U are discarded within the same Map task before passing them to the next operator.
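For illustration, an MR-PScope Map task could be sketched with Hadoop's Mapper API as follows. This is a sketch under stated assumptions: MyPScope and its comma-separated record format are hypothetical stand-ins, not BigDansing classes.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: parse a record, apply the Scope UDF, and drop null/empty outputs.
public class PScopeMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private final MyPScope scope = new MyPScope();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String scoped = scope.apply(value.toString());   // run the Scope UDF
        if (scoped != null && !scoped.isEmpty()) {        // discard null/empty units
            context.write(NullWritable.get(), new Text(scoped));
        }
    }

    static class MyPScope {
        String apply(String rawTuple) {
            // keep only the attributes relevant to the rule (e.g., zipcode, city)
            String[] cells = rawTuple.split(",");
            return cells.length > 2 ? cells[1] + "," + cells[2] : null;
        }
    }
}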

MR-PBlock. The Executor translates the PBlock operator into a Map task whose partitioner function applies the received PBlock to set the intermediate key. The MapReduce framework automatically groups all Us that share the same key. The Executor does the same for the CoBlock operator, but it also labels each intermediate key-value pair with the input dataset label for identification at Reduce tasks.

MR-PIterate. The Executor translates PIterate into a Reduce task whose reduce function applies the received PIterate.

MR-PDetect. The Executor translates the PDetect operator into a Reduce task whose reduce function applies the received PDetect. The Executor might also apply the received PDetect in the reduce function of a Combine task.

MR-PGenFix. The Executor translates the PGenFix operator into a Map task whose map function applies the received PGenFix. The Executor might also apply the received PGenFix at the reduce function of PDetect.

3.3.6 Physical Data Access

BigDansing applies three different data storage optimizations to the Spark and MapReduce input data: (i) data partitioning to avoid shuffling large amounts of data; (ii) data replication to efficiently support a large variety of data quality rules; and (iii) data layouts to improve I/O operations. I describe them below.

(1) Partitioning. Typically, distributed data storage systems split data files into smaller chunks based on size. In contrast, BigDansing partitions a dataset based on its content, i.e., based on attribute values. Such a logical partitioning allows co-locating data based on a given blocking key. As a result, BigDansing can push down the Block operator to the storage manager. This avoids having to co-locate datasets while detecting violations and hence significantly reduces the network costs.

(2) Replication. A single data partitioning, however, might not be useful for multiple data cleansing tasks. In practice, I may need to run several data cleansing jobs that do not share the same blocking key. To handle such a case, I replicate a dataset in a heterogeneous manner. In other words, BigDansing logically partitions (i.e., based on values) each replica on a different attribute. As a result, I can again push down the Block operator for multiple data cleansing tasks.

(3) Layout. BigDansing converts a dataset to a binary format when storing it in the underlying data storage framework. This helps avoid expensive string parsing

translates BigDansing data access operations (including the operator pushdowns) to three basic HDFS data access operations: (i) Path Filter, to filter the input file paths; (ii) Input Format, to assign input files to workers; and (iii) Record Reader, to parse the files into tuples. In Spark, this means that I specify these three UDFs when creating RDDs. As a result, I can manipulate the data access right from HDFS so that this data access is invisible and completely non-invasive to Spark. To leverage

all the data storage optimizations done by BigDansing, I indeed need to know how the data was uploaded in the first place, e.g., in which layout and sort order the

data is stored. To allow BigDansing to know so, in addition to datasets, I store the upload plan of each uploaded dataset, which is essentially the upload metadata.

At query time, BigDansing uses this metadata to decide how to access an input dataset, e.g., if it performs a full scan or an index scan, using the right UDF (path filter, input format, record reader) implementation.

3.4 Distributed Repair Algorithms

Most of the existing repair techniques [4–7, 9, 10, 59] are centralized. I present two

approaches to implement a repair algorithm in BigDansing. First, I show how my system can run a centralized data repair algorithm in parallel, without changing the

algorithm. In other words, BigDansing treats that algorithm as a black box. Second, I design a distributed version of the widely used equivalence class algorithm [4,5,7,59]. 70

CC1

c1 v1 c2 c1 v1 c2 v2 Repair Fixes v2 c4 c4

CC2

c3 v3 c5 c3 v3 c5 Repair Fixes

Hypergraph Connected components of the hypergraph

Figure 3.8: Data repair as a black box.

3.4.1 Scaling Data Repair as a Black Box

Overall, I divide a repair task into independent smaller repair tasks. For this, I repre- sent the violation graph as a hypergraph containing the violations and their possible fixes. The nodes represent the elements and each hyperedge covers a set of elements that together violate a rule, along with possible repairs. I then divide the hypergraph into smaller independent subgraphs, i.e., connected components, and I pass each con- nected component to an independent data repair instance. Note that the hypergraph

representation allows BigDansing to model logically implicated violations when pro- cessing multiple rules simultaneously. However, it is the responsibility of the repair algorithm to handle them.

Connected components. Given an input set of violations, from at least one rule,

BigDansing first creates their hypergraph representation of such possible fixes in a distributed manner. It then uses the Bulk Synchronous Parallel (BSP) graph pro- cessing model [20] to find all connected components in the hypergraph in parallel.

As a result, BigDansing obtains a connected component ID for each independent 71 hyperedge. It then groups hyperedges by the connected component ID. Figure 3.8 shows an example of connected components in a hypergraph containing three viola- tions v1, v2, and v3. Note that violations v1 and v2 can be caused by different rules. Violations v1 and v2 are grouped in a single connected component CC1, because they share element c2. In contrast, v3 is assigned to a different connected component CC2, because it does not share any element with v1 or v2.

Processing framework for hypergraphs. The selection of the distributed graph processing system depends on the size of the violation hypergraph. By default, Big- Dansing uses GraphX to find all connected components of the violation hypergraph in parallel. GraphX is designed to act as a distributed graph processing system on top of Spark. As a result, BigDansing is able to exchange the hypergraph’s infor- mation with GraphX seamlessly by utilizing the unified data model of Spark. If the hypergraph is too big (has a large number of vertices and edges) or too dense (highly connected hypergraph), GraphX fails to process the connected components. This is because the higher the data dependency in the graph, the larger the amount of data shuffling is required, which timeouts the connections of the workers in GraphX. In this case, BigDansing utilizes Giraph [29] as an alternative solution, where Big- Dansing exchanges the violation hypergraph and the result of the connected compo- nents with Giraph through HDFS. Giraph is a much more efficient graph processing system compared to GraphX [106], which allows BigDansing to process larger vi- olation hypergraph. However, it adds a huge overhead to BigDansing, where the time required for reading and writing the violation hypergraph into HDFS is huge. There exist other technologies that are faster than HDFS, such as HDFS on RAM disks (tmpfs or ramfs), in memory distributed file system (Alluxio [107], formerly Tachyon [108]), or by using the cross-platform data flow pipes of Rheem [109, 110]. However, in our implementation, we only use HDFS since such technologies are still experimental and out of the scope of this dissertation. 72 Independent data repair instance. Once all connected components are computed,

BigDansing assigns each of them to an independent data repair instance and runs such repair instances in a distributed manner (right-side of Figure 3.8). When all data repair instances generate the required fixes, BigDansing updates the input dataset and passes it to the RuleEngine to detect potential violations introduced by the RepairAlgorithm. The number of iterations required to fix all violations depends on the input rules, the dataset, and the repair algorithm.

Dealing with big connected components. If a connected component does not

fit in memory, BigDansing uses a k-way multilevel hypergraph partitioning algo- rithm [111] to divide it into k equal parts and run them on distinct machines. Unfor- tunately, na¨ıvely performing this process can lead to inconsistencies and contradictory choices in the repair. Moreover, it can fix the same violation independently in two ma- chines, thus introducing unnecessary changes to the repair. I illustrate this problem with an example.

Example 2: Consider a relational schema D with 3 attributes A, B, C and 2 FDs A → B and C → B. Given the instance [t1](a1, b1, c1), [t2](a1, b2, c1), all data values are in violation: t1.A, t1.B, t2.A, t2.B for the first FD and t1.B, t1.C, t2.B, t2.C for the second one. Assuming the compute nodes have enough memory only for five values, I need to solve the violations by executing two instances of the algorithm on two different nodes. Regardless of the selected tuple, suppose the first compute node repairs a value on attribute A, and the second one a value on attribute C. When I put the two repairs together and check their consistency, the updated instance is a valid solution, but the repair is not minimal because a single update on attribute B would have solved both violations. However, if the first node fixes t1.B by assigning value "b2" and the second one fixes t2.B with "b1", not only are there two changes, but the final instance is also inconsistent. □

I tackle the above problem by assigning the role of master to one machine and the role of slave to the rest. Every machine applies a repair in isolation, but I introduce an extra test in the union of the results. For the violations that are solved by the master, I mark its changes as immutable, which prevents further changes to a repaired element. If a change proposed by a slave contradicts a possible repair that involves a master's change, the slave repair is undone and a new iteration is triggered. As a result, the algorithm always reaches a fixpoint and produces a clean dataset, because an updated value cannot change in the following iterations.

BigDansing currently provides two repair algorithms using this approach: the equivalence class algorithm and a general hypergraph-based algorithm [6, 9]. Users can also implement their own repair algorithm if it is compatible with BigDansing’s repair interface.

3.4.2 Scalable Equivalence Class Algorithm

The idea of the equivalence class based algorithm [59] is to first group all elements that should be equivalent together, and to then decide how to assign values to each group. An equivalence class consists of pairs of the form (t, A), where t is a data unit and A is an element. In a dataset D, each unit t and each element A in t have an associated equivalence class, denoted by eq(t, A). In a repair, a unique target value is assigned to each equivalence class E, denoted by targ(E). That is, for all (t, A) ∈ E, t[A] has the same value targ(E). The algorithm selects the target value for each equivalence class to obtain a repair with the minimum overall cost. I extend the equivalence class algorithm to a distributed setting by modeling it as a distributed word counting algorithm based on map and reduce functions. However, in contrast to a standard word count algorithm, I use two map-reduce sequences. The first map function maps the violations' possible fixes for each connected component into key-value pairs of the form ⟨⟨ccID, value⟩, count⟩. The key ⟨ccID, value⟩ is a composite key that contains the connected component ID and the element value for each possible fix. The value count represents the frequency of the element value, which I initialize to 1. The first reduce function counts the occurrences of the key-value pairs that share the same connected component ID and element value. It outputs key-value pairs of the form ⟨⟨ccID, value⟩, count⟩. Note that if an element exists in multiple fixes, I only count its value once. After this first map-reduce sequence, another map function takes the output of the reduce function to create new key-value pairs of the form ⟨ccID, ⟨value, count⟩⟩. The key is the connected component ID and the value is the frequency of each element value. The last reduce selects the element value with the highest frequency to be assigned to all the elements in the connected component ccID.
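As a concrete, simplified rendering of these two map-reduce passes, the following Spark sketch counts each candidate value once per element, then per connected component, and finally keeps the most frequent value per component. The Fix record and its field names are illustrative assumptions, not BigDansing's actual repair interface.

    import org.apache.spark.rdd.RDD

    // Hypothetical possible-fix record: an element of a connected component
    // together with a candidate value for it.
    case class Fix(ccId: Long, elementId: Long, value: String)

    object EquivalenceClassRepair {
      // Returns, for every connected component, the value assigned to all its elements.
      def targetValues(fixes: RDD[Fix]): RDD[(Long, String)] = {
        // Pass 1: count each candidate value once per element, then per component.
        val counts: RDD[((Long, String), Int)] = fixes
          .map(f => (f.ccId, f.elementId, f.value)).distinct()   // a value counts once per element
          .map { case (cc, _, v) => ((cc, v), 1) }
          .reduceByKey(_ + _)

        // Pass 2: re-key by component and keep the most frequent value.
        counts
          .map { case ((cc, v), c) => (cc, (v, c)) }
          .reduceByKey((a, b) => if (a._2 >= b._2) a else b)
          .mapValues(_._1)
      }
    }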

3.5 Experimental Study

I evaluate BigDansing using both real and synthetic datasets with various rules. I consider a variety of scenarios to evaluate the system and answer the following questions: (i) how well does it perform compared with baseline systems in a single node setting? (ii) how well does it scale to different dataset sizes compared to the state-of-the-art distributed systems? (iii) how well does it scale in terms of the number of nodes? (iv) how well does its abstraction support a variety of data cleansing tasks, e.g., for deduplication? and (v) how do its different techniques improve performance and what is its repair accuracy?

3.5.1 Setup

Table 3.1 summarizes the datasets and Table 3.2 shows the rules I use for my experiments.

(1) TaxA. Represents personal tax information in the US [4]. Each row contains a person's name, contact information, and tax information. For this dataset, I use the FD rule ϕ1 in Table 3.2. I introduced errors by adding random text to attributes City and State at a 10% rate.

Dataset         Rows            Dataset     Rows
TaxA1–TaxA5     100K – 40M      customer2   32M
TaxB1–TaxB3     100K – 3M       NCVoter     9M
TPCH1–TPCH10    100K – 1907M    HAI         166K
customer1       19M

Table 3.1: Statistics of the datasets

Identifier   Rule
ϕ1 (FD):     Zipcode → City
ϕ2 (DC):     ∀t1, t2 ∈ TaxB, ¬(t1.Salary > t2.Salary ∧ t1.Rate < t2.Rate)
ϕ3 (FD):     o_custkey → c_address
ϕ4 (UDF):    Two rows in Customer are duplicates
ϕ5 (UDF):    Two rows in NCVoter are duplicates
ϕ6 (FD):     Zipcode → State
ϕ7 (FD):     PhoneNumber → Zipcode
ϕ8 (FD):     ProviderID → City, PhoneNumber

Table 3.2: Integrity constraints used for testing

(2) TaxB. I generate TaxB by adding 10% numerical random errors on the Rate attribute of TaxA. My goal is to validate the efficiency with rules that have inequality conditions only, such as DC ϕ2 in Table 3.2.

(3) TPCH. From the TPC-H benchmark data [112], I joined the lineitem and customer tables and applied 10% random errors on the address. I use this dataset to test FD rule ϕ3.

(4) Customer. In my deduplication experiment, I use TPC-H customer with 4.5 million rows to generate tables customer1 with 3x exact duplicates and customer2 with 5x exact duplicates. Then, I randomly select 20% of the total number of tuples, in both relations, and duplicate them with random edits on attributes name and phone.

(5) NCVoter. This is a real dataset that contains North Carolina voter information. I added 20% random duplicate rows with random edits in the name and phone attributes.

(6) Healthcare Associated Infections (HAI) (http://www.hospitalcompare.hhs.gov). This real dataset contains hospital information and statistical measurements for infections developed during treatment. I added 10% random errors on the attributes covered by the FDs and tested four combinations of rules (ϕ6 – ϕ8 from Table 3.2). Each rule combination has its own dirty dataset.

To my knowledge, there exists only one full-fledged data cleansing system that can support all the rules in Table 3.2:

(1) NADEEF [5]: An open-source single-node platform supporting both declarative and user-defined quality rules.

(2) PostgreSQL v9.3: Since declarative quality rules can be represented as SQL queries, I also compare to PostgreSQL for violation detection. To maximize benefits from large main memory, I configured it using pgtune [113].

The two declarative constraints, DC ϕ2 and FD ϕ3 in Table 3.2, are translated to SQL as shown below.

ϕ2: SELECT a.Salary, b.Salary, a.Rate, b.Rate
    FROM TaxB a, TaxB b
    WHERE a.Salary > b.Salary AND a.Rate < b.Rate;

ϕ3: SELECT a.o_custkey, a.c_address, b.c_address
    FROM TPCH a JOIN TPCH b ON a.o_custkey = b.o_custkey
    WHERE a.c_address <> b.c_address;

I also consider two parallel data processing frameworks:

(3) Shark 0.8.0 [114]: This is a scalable engine for fast SQL processing on Spark. I selected Shark due to its scalability advantage over existing distributed SQL engines.

(4) Spark SQL v1.0.2: An experimental extension of Spark v1.0 that allows users to run relational queries natively on Spark using SQL or HiveQL. I selected this system because, like BigDansing, it runs natively on top of Spark.

Figure 3.9: Data cleansing times. (a) Rules ϕ1, ϕ2, and ϕ3 (runtime in seconds vs. number of rows, for BigDansing and NADEEF); (b) Rule ϕ1 (violation detection and data repair time in seconds, by violation percentage).

I ran my experiments in two different settings: (i) a single-node setting using a Dell Precision T7500 with two 64-bit quad-core Intel Xeon X5550 (8 physical cores and 16 CPU threads) and 58GB RAM and (ii) a multi-node setting using a compute cluster of 17 Shuttle SH55J2 machines (1 master with 16 workers) equipped with Intel i5 processors with 16GB RAM constructed as a star network.

3.5.2 Single-Node Experiments

For these experiments, I start by comparing BigDansing with NADEEF on the execution times of the whole cleansing process (i.e., detection and repair). Figure 3.9(a) shows the performance of both systems using the TaxA and TPCH datasets with 100K and 1M (200K for ϕ2) rows. I observe that BigDansing is more than three orders of magnitude faster than NADEEF for rules ϕ1 (1M), ϕ2 (200K), and ϕ3 (1M). In fact, NADEEF is only "competitive" for rule ϕ1 (100K), where BigDansing is only twice as fast. The high superiority of BigDansing comes from two main reasons: (i) in contrast to NADEEF, it provides a finer-granularity abstraction allowing users to specify rules more efficiently; and (ii) it evaluates rules with inequality conditions in an efficient way (using OCJoin). In addition, NADEEF issues thousands of SQL queries to the underlying DBMS for detecting violations.

Figure 3.10: Single-node experiments with ϕ1, ϕ2, and ϕ3. (a) TaxA data with ϕ1; (b) TaxB data with ϕ2; (c) TPCH data with ϕ3 (runtime in seconds vs. dataset size in rows, for BigDansing, NADEEF, PostgreSQL, Spark SQL, and Shark).

I do not report results for larger datasets because NADEEF was not able to run the repair process for more than 1M rows (300K for ϕ2). I also observed that violation detection dominates the entire data cleansing process. Thus, I ran an experiment for rule ϕ1 in TaxA with 1M rows while varying the error rate. Violation detection takes more than 90% of the time, regardless of the error rate (Figure 3.9(b)). In particular, I observe that even for a very high error rate (50%), the violation detection phase still dominates the cleansing process. Notice that for more complex rules, such as rule ϕ2, the dominance of the violation detection is even amplified.

Figure 3.11: Multi-node experiments with ϕ1, ϕ2, and ϕ3. (a) TaxA data with ϕ1; (b) TaxB data with ϕ2; (c) Large TPCH on ϕ3 (time in seconds vs. dataset size in rows, for BigDansing-Spark, BigDansing-Hadoop, Spark SQL, and Shark).

Therefore, to be able to extensively evaluate BigDansing, I continue my experiments focusing on the violation detection phase only, unless stated otherwise. Figures 3.10(a), 3.10(b), and 3.10(c) show the violation detection performance of all systems for the TaxA, TaxB, and TPCH datasets. TaxA and TPCH have 100K, 1M, and 10M rows, and TaxB has 100K, 200K, and 300K rows.

With the equality-based rules ϕ1 and ϕ3, I observe that PostgreSQL is always faster than all other systems on the small datasets (100K rows). However, once I increase the size by one order of magnitude (1M rows), I notice the advantage of BigDansing: it is at least twice as fast as PostgreSQL and more than one order of magnitude faster than NADEEF. For the largest dataset (10M rows), BigDansing is almost two orders of magnitude faster than PostgreSQL for the FD in TaxA and one order of magnitude faster than PostgreSQL for the DC in TaxA. For the FD in TPCH, BigDansing is twice as fast as PostgreSQL. It is more than three orders of magnitude faster than NADEEF for all rules. Overall, I observe that BigDansing performs similarly to Spark SQL; the small difference is that Spark SQL uses multithreading better than BigDansing. I can explain the superiority of BigDansing over the other systems with two reasons: (i) BigDansing reads the input dataset only once, while PostgreSQL and Shark read it twice because of the self joins; and (ii) BigDansing does not generate duplicate violations, while SQL engines do when comparing tuples using self-joins, such as for TaxA and TPCH's FD. Concerning the inequality-based rule ϕ2, I limited the runtime to four hours for all systems. BigDansing is one order of magnitude faster than all other systems for 100K rows. For 200K and 300K rows, it is at least two orders of magnitude faster than all baseline systems. Such performance superiority is achieved by leveraging the inequality join optimization, which is not supported by the baseline systems.

3.5.3 Multi-Node Experiments

I now compare BigDansing with Spark SQL and Shark in the multi-node setting. I also implemented a lighter version of BigDansing on top of Hadoop MapReduce to show BigDansing's independence w.r.t. the underlying framework. I set the size of TaxA to 10M, 20M, and 40M rows. Moreover, I tested the inequality DC ϕ2 on the TaxB dataset with sizes of 1M, 2M, and 3M rows. I limited the runtime to 40 hours for all systems.

BigDansing-Spark is slightly faster than Spark SQL for the equality rules in Figure 3.11(a). Even though BigDansing-Spark and Shark are both implemented on top of Spark, BigDansing-Spark is up to three orders of magnitude faster than Shark.

Figure 3.12: Results for the experiments on (a) scale-out (500M rows TPCH on ϕ3, runtime vs. number of workers for BigDansing and Spark SQL), (b) deduplication in BigDansing (NCVoter, cust1, and cust2 datasets, BigDansing vs. Shark), and (c) physical optimizations (OCJoin, UCrossProduct, and cross product, runtime vs. data size).

Even BigDansing-Hadoop does better than Shark (Figure 3.11(a)). This is because Shark does not process joins efficiently. The performance advantage of BigDansing over the baseline systems is magnified when dealing with inequalities. I observe that BigDansing-Spark is at least two orders of magnitude faster than both Spark SQL and Shark (Figure 3.11(b)). I had to stop the Spark SQL and Shark executions after 40 hours of runtime; both Spark SQL and Shark are unable to process the inequality DC efficiently. I also tested large TPCH datasets of sizes 150GB, 200GB, 250GB, and 300GB (959M, 1271M, 1583M, and 1907M rows resp.) producing between 6.9B and 13B violations. I excluded Shark as it could not run on these larger datasets.

BigDansing-Spark is 16 to 22 times faster than BigDansing-Hadoop and 6 to 8 times faster than Spark SQL (Figure 3.11(c)). The performance difference between BigDansing-Spark and BigDansing-Hadoop stems from Spark being generally faster than Hadoop; Spark is an in-memory data processing system while Hadoop is disk-based. Furthermore, BigDansing-Spark significantly outperforms Spark SQL because BigDansing has a better execution plan for ϕ3 compared to Spark SQL. In particular, the execution plan of BigDansing eliminates duplicate results earlier than Spark SQL's plan, which reduces the memory and network overheads of BigDansing-Spark. Moreover, unlike Spark SQL, BigDansing-Spark has a lower I/O complexity due to its optimized physical data access.

3.5.4 Scaling BigDansing Out

I compared the speedup of BigDansing-Spark to Spark SQL when increasing the number of workers on a dataset of 500M rows. I observe that BigDansing-Spark is at least 3 times faster than Spark SQL, from a single worker up to 16 workers (Figure 3.12(a)). I also notice that BigDansing-Spark is about 1.5 times faster than Spark SQL while using only half the number of workers used by Spark SQL. Although both BigDansing-Spark and Spark SQL generally show good scalability, BigDansing-Spark performs better than Spark SQL on large input datasets because the execution plan of BigDansing-Spark does not copy the input data twice.

3.5.5 Deduplication in BigDansing

I show that one can run a deduplication task with BigDansing. I use the cust1, cust2, and NCVoter datasets on my compute cluster. I implemented a Java version of the Levenshtein distance and use it as a UDF in both BigDansing and Shark. Note that I do not consider Spark SQL in this experiment since UDFs cannot be implemented directly within Spark SQL.

Figure 3.13: (a) Abstraction (runtime for BigDansing using the full API vs. Detect only); (b) scalable repair (runtime of BigDansing vs. BigDansing with serial repair, by violation percentage).

That is, to implement a UDF in Spark SQL, the user has to either use a Hive interface or apply a post-processing step on the query result.

Figure 3.12(b) shows the results. I observe that BigDansing outperforms Shark on both small and large datasets. In particular, I see that for cust2 BigDansing outperforms Shark by an improvement factor of up to 67. These results not only show the generality of BigDansing in supporting a deduplication task, but also the high efficiency of my system.

3.5.6 BigDansing In-Depth

Physical optimizations. I first focus on showing the performance benefits of the UCrossProduct and OCJoin operators. I use the inequality DC ϕ2 in Table 3.2 with the TaxB dataset on my multi-node cluster. Figure 3.12(c) reports the results of this experiment. I notice that the UCrossProduct operator has a slight performance advantage over the CrossProduct operator. This performance difference increases with the dataset size. However, by using the OCJoin operator, BigDansing becomes more than two orders of magnitude faster than both cross product operators (up to an improvement factor of 655).

Abstraction advantage. I now study the benefits of BigDansing's abstraction. I consider the deduplication scenario of Chapter 3.5.5, with the smallest TaxA dataset on my single-node machine. I compare the performance difference between BigDansing using its full API and BigDansing using only the Detect operator. I see in Figure 3.13(a) that running a UDF using the full BigDansing API makes BigDansing three orders of magnitude faster compared to using Detect only. This clearly shows the benefits of the five logical operators, even in single-node settings.

Scalable data repair. I study the runtime efficiency of the repair algorithms used by BigDansing. I ran an experiment for rule ϕ1 in TaxA with 1M rows by varying the error rate and considering two versions of BigDansing: one with the parallel data repair, and a baseline with a centralized data repair, such as in NADEEF. Figure 3.13(b) shows the results. The parallel version outperforms the centralized one, except when the error rate is very small (1%). For higher rates, BigDansing is clearly faster, since the number of connected components to deal with in the repair process increases with the number of violations and thus the parallelization provides a stronger boost. Naturally, my system scales much better with the number of violations.

Repair accuracy. I evaluate the accuracy of BigDansing using the traditional precision and recall measures: precision is the ratio of correctly updated attributes (exact matches) to the total number of updates; recall is the ratio of correctly updated attributes to the total number of errors.
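Stated as formulas, with U denoting the set of updated attribute values, C ⊆ U the correctly updated ones, and E the set of erroneous attribute values in the dirty data (notation introduced here only for clarity):

    precision = |C| / |U|,        recall = |C| / |E|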

I test BigDansing with the equivalence class algorithm using HAI on the following rule combinations: (a) FD ϕ6; (b) FD ϕ6 and FD ϕ7; (c) FD ϕ6, FD ϕ7, and FD ϕ8. Notice that BigDansing runs combination (a) as a single rule, while for combinations (b) and (c) it runs the rules concurrently.

Table 3.3 shows the results for the equivalence class algorithm in BigDansing and NADEEF. I observe that BigDansing achieves accuracy and recall similar to those obtained in a centralized system, i.e., NADEEF. In particular, BigDansing requires the same number of iterations as NADEEF to repair all data violations, even when running multiple rules concurrently. Notice that both BigDansing and NADEEF might require more than a single iteration when running multiple rules, because repairing some violations might cause new violations. I also tested the equivalence class algorithm in BigDansing using both the black box and the scalable repair implementations, and both achieved similar results.

Rule(s)         NADEEF               BigDansing           Iterations
                precision  recall    precision  recall
ϕ6              0.713      0.776     0.714      0.777     1
ϕ6 & ϕ7         0.861      0.875     0.861      0.875     2
ϕ6, ϕ7 & ϕ8     0.923      0.928     0.924      0.929     3
                |R,G|/e    |R,G|     |R,G|/e    |R,G|     Iterations
φD              17.1       8183      17.1       8221      5

Table 3.3: Repair quality using HAI and TaxB datasets.

I also test BigDansing with the hypergraph algorithm using DC rule φD on a TaxB dataset. As the search space of the possible solutions for φD is huge, the hypergraph algorithm uses quadratic programming to approximate the repairs for φD [6]. Thus, I use the Euclidean distance to measure the repair accuracy, comparing the repaired data attributes to the attributes of the ground truth.

Table 3.3 shows the results for the hypergraph algorithm in BigDansing and NADEEF. Overall, I observe that BigDansing achieves the same average distance (|R,G|/e) and a similar total distance (|R,G|) between the repaired data R and the ground truth G. Again, BigDansing requires the same number of iterations as NADEEF to completely repair the input dataset. These results confirm that BigDansing achieves the same data repair quality as in a single-node setting, while providing better data cleansing runtimes and scalability than the baseline systems.

3.6 Use Case: Violation Detection in RDF

Resource Description Framework (RDF) is a fast-growing data model that is used by large public knowledge bases, such as Bio2RDF [115], DBpedia [116], Probase [117], PubChemRDF [118], and UniProt [119], to express and interchange facts on the semantic web. These independent datasets are automatically extracted and heavily interlinked, which makes them prone to errors and inconsistencies. In this chapter, I explain the types of mistakes that RDF data may contain and show how BigDansing works for RDF data cleansing using an example error and a dirty dataset. Note that I do not intend to present BigDansing as an efficient RDF query engine, such as AdPart [120, 121], S2RDF [122], and Trinity.RDF [123]. Instead, I only show the generality of BigDansing and its ability to handle schema-less data, such as RDF, efficiently.

3.6.1 Source of Errors

Each RDF dataset consists of triples of the form ⟨subject, predicate, object⟩. Each triple represents a directed labeled edge in the RDF graph from a subject vertex to an object vertex. The predicate is the edge label that describes the relationship between the subject and the object in the RDF triple. There are six types of errors in RDF data [124], which may affect the nodes or the edge label individually, or influence the whole RDF triple. (i) Typos; these are errors in the subjects, objects, or predicates that are caused by human error or by an invalid automatic extraction. The connectivity of the RDF graph may change if typos affect a subject or an object. (ii) Polymorphism, which is caused by having multiple representations of the same entity. These are similar to typos; however, they are not considered wrong or misspelled entries. For example, ⟨Robert, type, professor⟩ and ⟨William, type, prof.⟩ are both of the same type, i.e., they have the same RDF predicate and object, but the object is represented differently.

Figure 3.14: An example RDF graph

(iii) Missing data; these are either missing values within the RDF data, i.e., a Null value in the object, or missing crucial triples. (iv) Entities mismatching; these exist when similar RDF entities contain conflicting values, e.g., two RDF triples for the same person that have types Student and Professor simultaneously. (v) Outdated entities; these are RDF triples that carry old or logically incorrect information, such as disconnected phone numbers or invalid addresses for individuals. (vi) Integrity violations; these are caused by inconsistencies in RDF data that violate a business rule or a pre-defined integrity constraint. Such rules may restrict some values of RDF triples or limit the number of triples allowed for identical subjects.

3.6.2 Integrity Violation Example

Consider an RDF dataset, based on the LUBM [125] benchmark, containing students, departments, professors, and universities, where each student is enrolled in one university and has one professor as an advisor. I show in Figure 3.14 an example of this dataset, where John, Sally, and Paul are students of either Yale or UCLA, while Professor William is their advisor. Let me assume an integrity constraint that states that there cannot exist two graduate students who are in two different universities and have the same professor as an advisor. According to this rule, there are two violations in this RDF dataset: (Paul, John) and (Paul, Sally). The query graph of this rule is shown in Figure 3.15, and its translation into SPARQL [126], the query language for RDF data, is as follows:

PREFIX rdf:
PREFIX ub:
SELECT ?stud1 ?stud2
WHERE {
  ?stud1 rdf:type ub:GraduateStudent .
  ?stud2 rdf:type ub:GraduateStudent .
  ?dept1 rdf:type ub:Department .
  ?dept2 rdf:type ub:Department .
  ?univ1 rdf:type ub:University .
  ?univ2 rdf:type ub:University .
  ?stud1 ub:advisor ?prof .
  ?stud2 ub:advisor ?prof .
  ?stud1 ub:memberOf ?dept1 .
  ?stud2 ub:memberOf ?dept2 .
  ?dept1 ub:subOrganizationOf ?univ1 .
  ?dept2 ub:subOrganizationOf ?univ2 .
}

3.6.3 Violation Detection Plan in BigDansing

Traditional declarative rules, i.e., FDs and DCs, cannot represent the above integrity constraint, shown in Figure 3.15, for RDF data. Instead, it may be implemented in BigDansing using a set of 9 UDFs: 1 Scope operator, 3 Block operators, 3 Iterate operators, 1 Detect operator, and 1 GenFix operator. I show in Figure 3.16 the generated logical plan for the UDFs, while Figure 3.17 shows the role of each UDF. The plan starts with the Scope operator, which removes redundant RDF triples. That is, it only passes to the next operator triples with advisor, memberOf, or subOrgOf as their predicate value, or of type Graduate Student or Department. After that, I apply the first Block operator to group triples based on the subject value (i.e., John, Sally, Paul, U/CS, and Y/CS), followed by the first Iterate operator to join the triples of each group. In this step, as shown in Figure 3.17, triples of type Graduate Student get the advisor name (as item 2) and the department name (as item 3), while universities are joined with the department name (i.e., UCLA is joined with U/CS). The results are then passed to the second Block and Iterate operators to group the incoming tuples based on the department name (item 3), where students get the name of their university (i.e., John gets UCLA). After that, the third Block and Iterate operators are applied on the output to group students by the name of their advisor (i.e., William). This allows the Detect operator to generate violations based on students having the same advisor but studying in different universities. Finally, the GenFix operator suggests fixing the university names in the incoming violations.

Figure 3.15: An integrity constraint query graph in RDF

Figure 3.16: The logical plan of the UDF detection rule in Figure 3.15

Figure 3.17: Data transformations within the logical operators in Figure 3.16, tracing the example RDF data through (1) Scope (predicates advisor, memberOf, subOrgOf; objects Graduate Student, Department), (2) Block 1 (Subject), (3) Iterate 1, (4) Block 2 (Item 3), (5) Iterate 2, (6) Block 3 (Item 2), (7) Iterate 3, (8) Detect, and (9) GenFix.
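To make the flow in Figures 3.16 and 3.17 easier to follow, the plain-Scala sketch below reproduces the same sequence of steps on in-memory triples. It is only an illustration of what each operator computes, not BigDansing's operator API; the Triple class, helper names, and the single-node setting are assumptions.

    // Single-node rendering of the plan in Figure 3.16 (Scope, Block/Iterate x3, Detect).
    case class Triple(s: String, p: String, o: String)

    object AdvisorRuleSketch {
      def violations(triples: Seq[Triple]): Seq[(String, String)] = {
        // Scope: keep only the predicates/objects the rule needs.
        val scoped = triples.filter(t =>
          Set("advisor", "memberOf", "subOrgOf").contains(t.p) ||
          (t.p == "type" && Set("Graduate Student", "Department").contains(t.o)))

        // Block 1 + Iterate 1 (group by subject):
        //   students    -> (student, advisor, department)
        //   departments -> department -> university
        val bySubject = scoped.groupBy(_.s)
        val students = bySubject.collect {
          case (s, ts) if ts.exists(t => t.p == "type" && t.o == "Graduate Student") =>
            (s, ts.find(_.p == "advisor").map(_.o).getOrElse(""),
                ts.find(_.p == "memberOf").map(_.o).getOrElse(""))
        }.toSeq
        val deptToUniv = bySubject.collect {
          case (d, ts) if ts.exists(t => t.p == "type" && t.o == "Department") =>
            (d, ts.find(_.p == "subOrgOf").map(_.o).getOrElse(""))
        }.toMap

        // Block 2 + Iterate 2 (group by department): student -> (advisor, university)
        val withUniv = students.map { case (st, adv, dept) =>
          (st, adv, deptToUniv.getOrElse(dept, "")) }

        // Block 3 + Iterate 3 + Detect: same advisor, different universities.
        withUniv.groupBy(_._2).values.toSeq.flatMap { grp =>
          for {
            (s1, _, u1) <- grp
            (s2, _, u2) <- grp
            if s1 < s2 && u1 != u2
          } yield (s1, s2)
        }
      }
    }

On the example of Figure 3.14 this yields the pairs involving Paul with John and Sally, matching the two violations described above.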

Figure 3.18: Violation detection performance on dirty RDF data (pre-processing and violation detection runtime in seconds for BigDansing and S2RDF, on 21M, 42M, 85M, and 170M RDF triples).

3.6.4 Experiments

I generate LUBM-based dirty RDF data that partially violates the integrity constraint defined in Figure 3.15. To achieve this, I shuffle the advisor of 10% of the Graduate Student triples in the RDF datasets generated by LUBM. I execute in BigDansing the logical plan of the integrity constraint, described in Figure 3.16, on different sizes of the dirty dataset and compare BigDansing's performance to S2RDF [122], a state-of-the-art Spark-based RDF engine. The results of this experiment are shown in Figure 3.18, where I report the pre-processing time and the violation detection time for both systems.

BigDansing does not require any data pre-processing. S2RDF, on the other hand, requires RDF partitioning to decompose the input into smaller tables to minimize the overhead of query evaluation. Comparing the total runtime of both systems, I find that BigDansing is at least an order of magnitude faster than the total time of S2RDF. In other words, BigDansing may run ten iterations of violation detection on the input RDF before S2RDF runs a single query. When excluding the pre-processing time of S2RDF, BigDansing is only 1.5 times slower than S2RDF. The results of this experiment show that BigDansing is able to handle schema-less datasets as efficiently as a specialized state-of-the-art system that runs on the same general-purpose distributed data processing system.

Chapter 4

Fast and Scalable Inequality Joins

In this chapter, I present IEJoin, a novel, fast, and space-efficient inequality join algorithm to handle sophisticated error discovery in BigDansing. First, I explain how to achieve fast inequality joins through sorted arrays, bit-arrays, and permutation arrays (Chapter 4.1). After that, I illustrate the centralized version of IEJoin and discuss related optimization techniques to significantly speed up the computation (Chapter 4.2). I also propose a selectivity estimation algorithm for IEJoin to improve the performance of multi-predicate queries and multi-way joins (Chapter 4.3). Moreover, I introduce an incremental version of IEJoin to handle fast data updates on the join result for streaming or dynamic applications (Chapter 4.4). To perform IEJoin within BigDansing, I develop a scalable implementation of IEJoin that is compatible with state-of-the-art general-purpose distributed data processing systems (Chapter 4.5). Finally, I implement IEJoin on PostgreSQL, Spark SQL, BigDansing, and Rheem (Chapter 4.6), and then conduct extensive experiments to study the performance of my algorithm against well-known optimization techniques. The results in this chapter show that my proposed solution is more general, scalable, and orders of magnitude faster than the state-of-the-art solutions (Chapter 4.7).

4.1 Solution Overview

In this section, the discussion is restricted to queries with inequality predicates only.

Each predicate is of the form Ai op Bi, where Ai (resp. Bi) is an attribute in a relation R (resp. S) and op is an inequality operator in the set {<, >, ≤, ≥}.

Example 3: [Single predicate] Consider the west table in Figure 1.2 and an inequality self-join query Qs:

Qs: SELECT s1.t_id, s2.t_id FROM west s1, west s2
    WHERE s1.time > s2.time;

Query Qs returns a set of pairs {(si, sj)} where si takes more time than sj; the result is {(s1, s3), (s1, s4), (s2, s1), (s2, s3), (s2, s4), (s4, s3)}. □

A natural idea to handle an inequality join on one attribute is to leverage a sorted array. For instance, the west tuples are sorted on time in ascending order in array L1 = ⟨s3, s4, s1, s2⟩. L[i] denotes the i-th element in array L, and L[i, j] its sub-array from position i to position j. Given a tuple s at position i in L1, any tuple at L1[k] (k ∈ [1, i − 1]) has a time value less than L1[i]. Consider Example 3: tuple s1 at position L1[3] joins with the tuples at positions L1[1, 2], namely s3 and s4.

in position L1[3] joins with tuples in positions L1[1, 2], namely s3 and s4.

Example 4: [Two predicates] Let me now consider a self-join with two inequality conditions:

Qp: SELECT s1.t_id, s2.t_id FROM west s1, west s2
    WHERE s1.time > s2.time AND s1.cost < s2.cost;

Qp returns pairs (si, sj) where si takes more time but pays less than sj; the result is {(s1, s3), (s4, s3)}. □

Similar to attribute time in Example 3, attribute cost is sorted in ascending order into array L2 = ⟨s4, s1, s3, s2⟩, as shown below.

L1: s3(80)  s4(90)  s1(100)  s2(140)   (sorted ↑ on time)
L2: s4(5)   s1(6)   s3(10)   s2(11)    (sorted ↑ on cost)

Thus, given a tuple s whose position in L2 is j, any tuple L2[l] (l ∈ [j + 1, n]) has a higher cost than s, where n is the size of the input relation. My finding here is as follows. For any tuple s′, to form a join result (s, s′) with tuple s, the following two conditions must be satisfied: (i) s′ is on the left of s in L1, i.e., s has a larger value for time than s′, and (ii) s′ is on the right of s in L2, i.e., s has a smaller value for cost than s′. Thus, all tuples in the intersection of L1[1, i − 1] and L2[j + 1, n] satisfy these two conditions and belong to the join result. For example, s4's position in L1 (resp. L2) is 2 (resp. 1). Hence, L1[1, 2 − 1] = ⟨s3⟩ and L2[1 + 1, 4] = ⟨s1, s3, s2⟩, and their intersection is {s3}, producing (s4, s3). To obtain the final result, I just have to repeat the above process for each tuple.

The challenge is how to perform the intersection operation mentioned above in an efficient manner. There already exist several indices, such as the R-tree and the B+-tree, that might help. The R-tree is ideal for supporting two- or higher-dimensional range queries. However, the non-clustered nature of R-trees makes them inadequate for inequality joins; it is impossible to avoid random I/O access when retrieving join results. The B+-tree is a clustered index; the advantage is that for each tuple, only a sequential disk scan is required to retrieve relevant tuples. However, this has to be repeated n times, where n is the number of tuples, which is prohibitively expensive. When confronted with such problems, one common practice is to use space-efficient and CPU-friendly indices; in this chapter, I make use of bit-arrays.

As discussed earlier, the idea of handling an inequality join on one attribute is to leverage a sorted array, as shown in Example 3. When two different attributes appear in the join, to leverage a similar idea, a natural solution is to use a permutation array between the two sorted arrays L1 and L2. Given the i-th element in L1, a permutation array can tell its corresponding position in L2 in constant time. Moreover, when visiting items in L1, it is important to keep track of the items seen so far, for which I use a bit-array to make such a connection.

(1) Initialization
L1: s3(80)  s4(90)  s1(100)  s2(140)   (sorted ↑ on time)
L2: s4(5)   s1(6)   s3(10)   s2(11)    (sorted ↑ on cost)
P:  2  3  1  4                         (permutation array)
B:  0  0  0  0                         (bit-array, positions 1–4 for s3, s4, s1, s2)

(2) Visit tuples w.r.t. L2
(a) visit s4: B 0 0 0 0 ⇒ 0 1 0 0, output: –
(b) visit s1: B 0 1 0 0 ⇒ 0 1 1 0, output: –
(c) visit s3: B 0 1 1 0 ⇒ 1 1 1 0, output: (s4, s3), (s1, s3)
(d) visit s2: B 1 1 1 0 ⇒ 1 1 1 1, output: –

Figure 4.1: IEJoin process for query Qp

Generally speaking, the proposed inequality join sorts relation west on time and cost, creates a permutation array for cost w.r.t. time, and leverages a bit-array to produce join results. The algorithm is briefly presented below, while detailed discussions are deferred to Chapter 4.2. Figure 4.1 describes the inequality join process.

S1. [Initialization] Sort both time and cost values in ascending order, as depicted by L1 and L2, respectively. While sorting, compute a permutation (reordering) array of the elements of L2 in L1, as shown by P. For example, the first element of L2 (i.e., s4) corresponds to position 2 in L1. Hence, P[1] = 2. Initialize a bit-array B with length n and set all bits to 0, as shown by B, with array indices reported above the cells and corresponding tuples reported below them.

S2. [Visit tuples in the order of L2] Scan the permutation array P and operate on the bit-array.

(a) Visit P[1]. First visit tuple s4 (the 1st element in L2) and check in P the position of s4 in L1 (i.e., position 2). Then go to B[2] and scan all bits in positions higher than 2. As all B[i] = 0 for i > 2, there is no tuple that satisfies the join condition of Qp w.r.t. s4. Finish this visit by setting B[2] = 1, which indicates that tuple s4 has been visited.

(b) Visit P [2]. This is for tuple s1. It processes s1 in a similar manner as s4, without emitting any result.

(c) Visit P [3]. This visit corresponds to tuple s3. Each non-zero bit on the right of s3 (highlighted by grey cells) corresponds to a join result, because each marked cell corresponds to a tuple that pays less cost (i.e., being visited first) but takes more time (i.e., on the right side of its position). It thus outputs (s4, s3) and (s1, s3).

(d) Visit P [4]. This visit corresponds to tuple s2 and does not return any result.

The final result of Qp is the union of all the intermediate results from the above steps, i.e., {(s4, s3), (s1, s3)}. A few observations make the discussed solution appealing. First, there are many efficient techniques for sorting large arrays, e.g., GPUTeraSort [127]. In addition, after getting the permutation array, I only need to sequentially scan it once. Hence, the permutation array can be stored on disk, in case there is not enough memory. Only the bit-array is required to stay in memory, to avoid random disk I/Os. Thus, to execute queries Qs and Qp on 1 billion tuples, theoretically, the bit-array only needs 1 billion bits (i.e., 125 MB) of memory.
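The steps S1 and S2 can be summarized in a few lines of code. The sketch below is a simplified, single-threaded rendering of the idea for the two strict predicates of Qp; it ignores the duplicate-value tie-breaking discussed in Chapter 4.2.2, and the function and variable names are mine, not part of IEJoin's actual implementation.

    // Self-join on (time, cost) with rows(i).time > rows(j).time and rows(i).cost < rows(j).cost,
    // using a sorted array, a permutation array, and a bit-array (here a Boolean array).
    object PermutationBitArrayJoin {
      def selfJoin(rows: Array[(Int, Int)]): Seq[(Int, Int)] = {
        val n  = rows.length
        val l1 = (0 until n).sortBy(i => rows(i)._1)          // L1: ascending on time
        val l2 = (0 until n).sortBy(i => rows(i)._2)          // L2: ascending on cost
        val posInL1 = Array.ofDim[Int](n)                     // tuple id -> position in L1
        l1.zipWithIndex.foreach { case (id, p) => posInL1(id) = p }

        val bits = Array.fill(n)(false)
        val out  = Seq.newBuilder[(Int, Int)]
        for (id <- l2) {                                      // visit tuples in cost order
          val pos = posInL1(id)
          bits(pos) = true
          // every already-visited tuple to the right of pos pays less and takes longer
          for (k <- pos + 1 until n; if bits(k)) out += ((l1(k), id))
        }
        out.result()
      }
    }

    // For the west table above, e.g. rows = Array((100, 6), (140, 11), (80, 10), (90, 5))
    // for s1..s4, this returns the index pairs corresponding to (s4, s3) and (s1, s3).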

4.2 Centralized Algorithms

In this section, I describe IEJoin, which is the implementation of the inequality join algorithm using permutation arrays and bit-arrays. I start the discussion by presenting the case of two-way joins with two operators from {<, >, ≤, ≥}, followed by the case of the not-equal operator (i.e., ≠) and outer joins (Chapter 4.2.1). After that, I discuss the special case of self-joins (Chapter 4.2.2), and conclude by describing optimization techniques for the algorithm's data structures (Chapter 4.2.3).

Algorithm 3: IEJoin
input : query Q with 2 join predicates t1.X op1 t2.X′ and t1.Y op2 t2.Y′, tables T, T′ of sizes m and n resp.
output: a list of tuple pairs (ti, tj)
1  let L1 (resp. L2) be the array of X (resp. Y) in T
2  let L1′ (resp. L2′) be the array of X′ (resp. Y′) in T′
3  if (op1 ∈ {>, ≥}) sort L1, L1′ in descending order
4  else if (op1 ∈ {<, ≤}) sort L1, L1′ in ascending order
5  if (op2 ∈ {>, ≥}) sort L2, L2′ in ascending order
6  else if (op2 ∈ {<, ≤}) sort L2, L2′ in descending order
7  compute the permutation array P of L2 w.r.t. L1
8  compute the permutation array P′ of L2′ w.r.t. L1′
9  compute the offset array O1 of L1 w.r.t. L1′
10 compute the offset array O2 of L2 w.r.t. L2′
11 initialize bit-array B′ (|B′| = n), and set all bits to 0
12 initialize join result as an empty list of tuple pairs
13 if (op1 ∈ {≤, ≥} and op2 ∈ {≤, ≥}) eqOff = 0
14 else eqOff = 1
15 for (i ← 1 to m) do
16   off2 ← O2[i]
17   for (j ← 1 to min(off2, size(L2′))) do
18     B′[P′[j]] ← 1
19   off1 ← O1[P[i]]
20   for (k ← off1 + eqOff to n) do
21     if B′[k] = 1 then
22       add tuples w.r.t. (L2[i], L2′[k]) to join result
23 return join result

4.2.1 IEJoin

I describe in this section the standard form of IEJoin, which joins two relations using two inequality predicates.

Algorithm. The algorithm, IEJoin, is shown in Algorithm 3. It takes a query Q with two inequality join conditions as input and returns a set of result pairs. It first sorts the attribute values to be joined (lines 3-6), and computes the permutation arrays (lines 7-8) and the two offset arrays (lines 9-10). The details of computing the permutation arrays are deferred to Chapter 4.6. Each element of an offset array records the relative position of an element of L1 (resp. L2) in L1′ (resp. L2′). An offset array is computed by a linear scan of both sorted arrays (e.g., L1 and L1′). The algorithm also sets up the bit-array (line 11) as well as the result set (line 12). In addition, it sets an offset variable to distinguish between the inequality operators with or without equality conditions (lines 13-14). It then visits the values in L2 in the appropriate order, which is to sequentially scan the permutation array from left to right (lines 15-22). For each tuple visited in L2, it first sets all the bits for those tuples t in T′ whose Y′ values are smaller than the Y value of the current tuple in T (lines 16-18), i.e., those tuples in T′ that satisfy the second join condition. It then uses the other offset array to find those tuples in T′ that also satisfy the first join condition (lines 19-22). It finally returns all join results (line 23).

(1) Initialization

east:
  L1: r3(90)  r2(100)  r1(140)           (sorted ↑ on dur)
  L2: r3(5)   r1(9)    r2(12)            (sorted ↑ on rev)
  P:  1  3  2                            (permutation array)
  O1: 2  3  4                            (offset of L1 w.r.t. L1′)
  O2: 1  3  5                            (offset of L2 w.r.t. L2′)

west:
  L1′: s3(80)  s4(90)  s1(100)  s2(140)  (sorted ↑ on time)
  L2′: s4(5)   s1(6)   s3(10)   s2(11)   (sorted ↑ on cost)
  P′:  2  3  1  4                        (permutation array)
  B′:  0  0  0  0                        (bit-array over s3, s4, s1, s2)

(2) Visit tuples w.r.t. L2
(a) visit L2[1] for r3: (i) O2[1] = 1; (ii) P′[1] = 2; (iii) O1[P[1]] = 2; (iv) B′: 0 1 0 0; (v) output: –
(b) visit L2[2] for r1: (i) O2[2] = 3; (ii) P′[2] = 3, P′[3] = 1; (iii) O1[P[2]] = 4; (iv) B′: 1 1 1 0; (v) output: –
(c) visit L2[3] for r2: (i) O2[3] = 5; (ii) P′[4] = 4; (iii) O1[P[3]] = 3; (iv) B′: 1 1 1 1; (v) output: (r2, s2)

Figure 4.2: IEJoin process for query Qt

Qt: SELECT east.id, west.t_id FROM east, west
    WHERE east.dur < west.time AND east.rev > west.cost;

Example 5: Figure 4.2 shows how Algorithm 3 processes Qt from above. In the initialization (step 1), L1, L2, L1′ and L2′ are sorted; P (resp. P′) is the permutation array between L1 and L2 (resp. L1′ and L2′), where the details of its computation are given in the PostgreSQL implementation details in Chapter 4.6.1 (Table 4.2); O1 (resp. O2) is the offset array of L1 relative to L1′ (resp. L2 relative to L2′), e.g., O1[1] = 2 means that the relative position of value L1[1] = 90 in L1′ is 2. Clearly, O1 (resp. O2) can be computed by sequentially scanning L1 and L1′ (resp. L2 and L2′); and B′ is the bit-array with all bits initialized to 0.

After the initialization, the algorithm starts visiting tuples w.r.t. L2 (step (2)). For example, when visiting the first item in L2 (r3) in step (2)(a), it first finds its relative position in L2′ at step (2)(a)(i). Then it visits all tuples in L2′ whose cost values are no larger than r3[rev] at step (2)(a)(ii). Afterwards, it uses the relative position of r3[dur] in L1′ (step (2)(a)(iii)) to populate all join results (step (2)(a)(iv)). The same process applies to r1 (step (2)(b)) and r2 (step (2)(c)), and the only result is returned at step (2)(c)(v). □

Correctness. The algorithm terminates and the results satisfy the join condition.

For any tuple pair (ri, sj) that should be a result, sj will be visited first and its corresponding bit is set to 1 (lines 17-18). Afterwards, ri will be visited and the result (ri, sj) will be identified (lines 20-22) by the algorithm.

Complexity. Sorting the arrays and computing their permutation arrays together take O(m · log m + n · log n) time, where m and n are the sizes of the two input relations (lines 3-8). Computing the offset arrays takes linear time using a sort-merge (lines 9-10). The outer loop takes O(m · n) time (lines 15-22). Hence, the total time complexity of the algorithm is O(m · log m + n · log n + m · n). It is straightforward to see that the total space complexity is O(m + n).

Not-equal operator. In the case of the not-equal "≠" operator, I simply rewrite the query as two queries, with the (>) and (<) operators, respectively. I then merge the results of these two queries through the UNION ALL SQL command.

Example 6: Consider the following query with one ≠ condition:

Qk: SELECT r.id, s.id FROM Events r, Events s
    WHERE r.start ≤ s.end AND r.end ≠ s.start;

I translate Qk into the union of two queries as follows:

Qk′: SELECT r.id, s.id FROM Events r, Events s
     WHERE r.start ≤ s.end AND r.end < s.start
     UNION ALL
     SELECT r.id, s.id FROM Events r, Events s
     WHERE r.start ≤ s.end AND r.end > s.start;

□

The performance of IEJoin for queries with a "≠" operator strongly depends on the attribute domain itself. Attributes with smaller ranges will typically lead to higher selectivity, while larger ranges lead to lower selectivity. Note that since more selective attributes produce fewer results than less selective ones, they have a faster IEJoin runtime.

Outer joins. IEJoin can also support left, right, and full outer joins. For left outer joins, any tuple that does not find any match while scanning the bit-array (lines 20-22 in Algorithm 3) is paired with a Null value. The right outer join is processed by flipping the order of the input relations, executing a normal left outer join, and then reversing the order of the join result. The IEJoin output should return (L2′[k], L2[i]) or (NULL, L2[i]) instead of the original order (line 22 in Algorithm 3) to generate the correct result for the right outer join. The full outer join is translated into one left outer join that includes the normal IEJoin output, and one right outer join that only emits results with null values.

Algorithm 4: IESelfJoin
input : query Q with 2 join predicates t1.X op1 t2.X and t1.Y op2 t2.Y, table T of size n
output: a list of tuple pairs (ti, tj)
1  let L1 (resp. L2) be the array of column X (resp. Y)
2  if (op1 ∈ {>, ≥}) sort L1 in ascending order
3  else if (op1 ∈ {<, ≤}) sort L1 in descending order
4  if (op2 ∈ {>, ≥}) sort L2 in descending order
5  else if (op2 ∈ {<, ≤}) sort L2 in ascending order
6  compute the permutation array P of L2 w.r.t. L1
7  initialize bit-array B (|B| = n), and set all bits to 0
8  initialize join result as an empty list of tuple pairs
9  if (op1 ∈ {≤, ≥} and op2 ∈ {≤, ≥}) eqOff = 0
10 else eqOff = 1
11 for (i ← 1 to n) do
12   pos ← P[i]
13   B[pos] ← 1
14   for (j ← pos + eqOff to n) do
15     if B[j] = 1 then
16       add tuples w.r.t. (L1[j], L1[P[i]]) to join result
17 return join result

4.2.2 IESelfJoin

In this section, I present the algorithm for self-join queries with two inequality operators.

Algorithm. IESelfJoin (Algorithm 4) takes a self-join inequality query Q and returns a set of result pairs. The algorithm first sorts the two lists of attributes to be joined (lines 2-5), computes the permutation array (line 6), and sets up the bit-array (line 7) as well as the result set (line 8). It also sets an offset variable to distinguish inequality operators with or without equality (lines 9-10). It then visits the values in L2 in the desired order, i.e., by sequentially scanning the permutation array from left to right (lines 11-16). For each tuple visited in L2, it needs to find all tuples whose X values satisfy the join condition. This is performed by first locating the tuple's corresponding position in L1 by looking up the permutation array (line 12) and marking it in the bit-array (line 13). Since the bit-array and L1 have a one-to-one positional correspondence, the tuples on the right of pos satisfy the join condition on X (lines 14-16), and these tuples also satisfy the join condition on Y if they have been visited before (line 15). Such tuples are joined with the tuple currently being visited (line 16). The algorithm finally returns all join results (line 17).

Note that the different sorting orders, i.e., ascending or descending for attributes X and Y in lines 2-5, are chosen to satisfy the various inequality operators. One may observe that if the database contains duplicate values, then when sorting one attribute X, its corresponding value in attribute Y should be considered, and vice versa, in order to preserve both orders for a correct join result. Hence, in IESelfJoin, when sorting X, I use an algorithm that also takes Y as the secondary key. Specifically, when some X values are equal, their sorting order is decided by their Y values (lines 2-3), and similarly the other way around (lines 4-5). Note that if values are duplicated in both attributes, I make sure that the sorting of the attributes is consistent by relying on the internal record (tuple) ids. In particular, when the values are equal in both X and Y, I sort them in increasing order according to their internal id.

Moreover, I show in Table 4.1 the sorting orders for op1's secondary key (Y) and op2's secondary key (X) when the dataset contains duplicate values. For example, if op1's condition is (≤) and op2's condition is (>), then according to Table 4.1, equal values for op1 are sorted in ascending order of their Y values, while op2's equal values are sorted in descending order of their X values. Please refer to the example in Chapter 4.1 for query Qp using IESelfJoin.

                          op2 sort order
                 <          >          ≤          ≥
op1      <   Asc/Des    Des/Des    Asc/Asc    Des/Asc
sort     >   Asc/Asc    Des/Asc    Asc/Des    Des/Des
order    ≤   Des/Des    Asc/Des    Des/Asc    Asc/Asc
         ≥   Des/Asc    Asc/Asc    Des/Des    Asc/Des

Table 4.1: Secondary key sorting order for Y/X in op1/op2

Correctness. It is easy to check that the algorithm terminates and that each result in join result satisfies the join condition. For completeness, observe the following. For any tuple pair (t1, t2) that should be in the result, t2 is visited first and its corresponding bit is set to 1 (line 13). Afterwards, t1 is visited and the result (t1, t2) is identified (lines 15-16) by IESelfJoin.

Complexity. Sorting the two arrays and computing their permutation array is in O(n · log n) time (lines 2-8). Scanning the permutation array and scanning the bit-array for each visited tuple run in O(n²) time (lines 11-16). Hence, in total, the time complexity of IESelfJoin is O(n²). It is easy to see that the space complexity of IESelfJoin is O(n).

4.2.3 Enhancements

I discuss two techniques to improve performance: (i) Indices to improve the lookup performance for the bit-array. (ii) Union arrays to improve data locality and reduce the data to be loaded into the cache.

Bitmap index to improve bit-array scans. An analysis of both IEJoin and IESelfJoin shows that for each visited value (i.e., lines 20-22 in Algorithm 3 and lines 14-16 in Algorithm 4), I need to scan all the bits on the right of the current position. When the query selectivity is low, this is unavoidable for producing the correct results. However, when the query selectivity is high, iteratively scanning a long sequence of 0's becomes a performance bottleneck. I thus adopt a bitmap to guide which parts of the bit-array should be visited. Given a bit-array B of size n and a predefined chunk size c, my bitmap is a bit-array of size ⌈n/c⌉ where each bit corresponds to a chunk in B, with 1 indicating that the chunk contains at least one 1, and 0 otherwise.

bitmap:  0   1   0   0     (the max scan index points at C2)
chunks:  C1  C2  C3  C4
B:       0 0 0 0 | 0 1 0 1 | 0 0 0 0 | 0 0 0 0
              (i) pos 6   (ii) pos 9

Figure 4.3: Example of using a bitmap

Example 7: Consider the bit-array B in Figure 4.3. Assume that the chunk size c = 4. The bit-array B will be partitioned into four chunks C1–C4. Its bitmap is shown above B in the figure and consists of 4 bits. I consider two cases.

Case 1: visit B[6], in which case I need to find all the 1's in B[i] for i > 6. The bitmap indicates that only chunk 2 needs to be checked, and it is safe to ignore chunks 3 and 4.

Case 2: visit B[9]; the bitmap indicates that there is no need to scan B, since there cannot be any B[j] with B[j] = 1 and j > 9. □

To further improve the bitmap, I avoid unnecessary scanning by stopping at the maximum modified position in the bitmap index. For example, in Figure 4.3, I maintain a max scan index variable (MaxIndex) with value 2 to prevent the bitmap from unnecessarily scanning bitmap indices 3 and 4. The initial value of this variable is 0, which means that both the bitmap and the bit-array are empty. As I iteratively modify the bit-array (line 18 in Algorithm 3 and line 13 in Algorithm 4) and edit the bitmap, I make sure to update the max scan index variable if the currently updated filter index is larger than the last recorded max filter index.
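A minimal sketch of this chunked bitmap, including the max scan index, is shown below; the class name, API, and default chunk size are illustrative choices, not the actual IEJoin data structure.

    // Bit-array guarded by a per-chunk bitmap: `chunk(i)` is true iff the i-th
    // c-bit chunk contains at least one set bit; `maxChunk` is the right-most
    // chunk ever touched, so scans can stop early.
    final class GuardedBitArray(n: Int, c: Int = 64) {
      private val bits     = Array.fill(n)(false)
      private val chunk    = Array.fill((n + c - 1) / c)(false)
      private var maxChunk = -1                      // no bit set yet

      def set(pos: Int): Unit = {
        bits(pos) = true
        val ch = pos / c
        chunk(ch) = true
        if (ch > maxChunk) maxChunk = ch
      }

      // Visit every set bit at a position >= from, skipping chunks known to be empty.
      def foreachSetFrom(from: Int)(f: Int => Unit): Unit = {
        var ch = from / c
        while (ch <= maxChunk) {
          if (chunk(ch)) {
            val lo = math.max(from, ch * c)
            val hi = math.min(n, (ch + 1) * c)
            var i = lo
            while (i < hi) { if (bits(i)) f(i); i += 1 }
          }
          ch += 1
        }
      }
    }

In Algorithm 4, for instance, lines 14-16 would become a call to foreachSetFrom(pos + eqOff) that emits one join result per visited set bit.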

Union arrays on join attributes. In testing Algorithm 3, I found that there are several cache loads and stores. A deeper analysis shows that the extra cache loads and stores may be caused by cache misses when sequentially visiting different arrays. Take Figure 4.2 for example. In step (2)(a), I visit arrays L2, O2, P′, P, and O1 in sequence, with each causing at least one cache miss. Steps (2)(b) and (2)(c) show a similar behavior. An intuitive solution is to merge the arrays on the join attributes and sort them together. Again, consider Figure 4.2: I can merge L1 and L1′ into one array and sort them, which results in ⟨s3(80), r3(90), s4(90), r2(100), s1(100), r1(140), s2(140)⟩. Similarly, I can merge L2 and L2′, and P and P′. Also, O1 and O2 are not needed in this case, and B′ needs to be extended to align with the merged arrays. This solution is similar to IESelfJoin (Chapter 4.2.2). However, I need to prune join results for tuples that come from the same table. This can be easily done using a Boolean flag for each position, where 0 (resp. 1) denotes that the corresponding value is from the first (resp. the second) table. My experiments (Chapter 4.7.5) show that this simple union can significantly reduce the number of cache misses, and thus improve execution time.

4.3 Query Optimization

I introduce an approach to estimate the selectivity of inequality join predicates that is then used to optimize inequality join queries. In particular, I tackle the cases of selecting the best join predicate when joining two relations on more than two join predicates and selecting the best join order for a multi-way IEJoin. Note that my goal is not to build a full-fledged query optimizer and implement it in an existing system, but rather to propose a simple, yet efficient, approach for optimizing joins with inequality conditions.

Algorithm 5: IEJoin Selectivity Estimation
input : join predicates t1.X op1 t2.X′ and t1.Y op2 t2.Y′, sample tables Ts & Ts′, block size n
output: number of overlapping blocks
1  overlappingBlocks ← 0
2  if (op1 ∈ {<, ≤}) then
3    sort Ts & Ts′ in descending order on X and X′
4  else if (op1 ∈ {>, ≥}) then
5    sort Ts & Ts′ in ascending order on X and X′
6  partition Ts & Ts′ into B & B′ blocks of size n
7  for (i ← 1 to B) do
8    for (j ← 1 to B′) do
9      if Tsi ∩ Ts′j ≠ ∅ then
10       increment overlappingBlocks by 1
11 return overlappingBlocks

4.3.1 Selectivity Estimation

To estimate the selectivity of an inequality join predicate, I simply count the number of overlapping sorted blocks obtained from a sample of the input tables. I assume that the join has two inequality predicates, as in Algorithm 3. I first obtain a sample from the two tables to join, using uniform random sampling (I experimentally show in Chapter 4.7.7 that my algorithm requires only 1% of the input data to be accurate). I give the samples as input to Algorithm 5. I then sort the two samples based on the attributes involved in one of the join predicates (lines 2-5), while using the remaining join attributes to sort duplicate tuples. Next, I divide each sample into smaller blocks (line 6) and test all combinations of block pairs for potential overlaps based on the min/max values of all of the join attributes in each block (lines 7-10). The idea is that overlapping blocks may generate results for a given query, while non-overlapping blocks cannot generate results at all. Finally, I return the number of overlapping blocks as the selectivity estimation of the inequality join (line 11). The lower the number of overlapping blocks, the higher the selectivity of the inequality join condition. While I cannot determine the exact size of the join output in overlapping blocks before execution, sorting helps produce good estimates. This is because sorting blocks naturally increases the locality of the input relations under inequality join conditions. Thus, the number of overlapping blocks is a good approximation of the selectivity of an inequality join condition.
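The following is a minimal Python sketch of the block-overlap counting in Algorithm 5, under the assumption of two numeric join attributes per tuple; the function and parameter names are mine, not the dissertation's code.

```python
def estimate_selectivity(sample1, sample2, op1, block_size):
    """Sketch of Algorithm 5: count overlapping sorted blocks between two table samples.

    sample1/sample2: lists of (x, y) values drawn by uniform random sampling.
    op1: comparison operator of the first join predicate ('<', '<=', '>', '>=').
    Returns the number of block pairs whose [min, max] ranges intersect on both attributes.
    """
    descending = op1 in ('<', '<=')              # lines 2-5: sort both samples on X (ties on Y)
    s1 = sorted(sample1, reverse=descending)
    s2 = sorted(sample2, reverse=descending)

    def to_blocks(sample):                       # line 6: partition into blocks of size n
        blocks = []
        for i in range(0, len(sample), block_size):
            chunk = sample[i:i + block_size]
            blocks.append((min(v[0] for v in chunk), max(v[0] for v in chunk),
                           min(v[1] for v in chunk), max(v[1] for v in chunk)))
        return blocks

    blocks2 = to_blocks(s2)
    overlapping = 0                              # lines 7-10: test every block pair
    for bx_lo, bx_hi, by_lo, by_hi in to_blocks(s1):
        for cx_lo, cx_hi, cy_lo, cy_hi in blocks2:
            if bx_lo <= cx_hi and cx_lo <= bx_hi and by_lo <= cy_hi and cy_lo <= by_hi:
                overlapping += 1
    return overlapping                           # line 11
```

The two samples would be the 1% uniform random samples mentioned above; the smaller the returned count, the more selective the predicate pair.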

4.3.2 Join Optimization with Multiple Predicates

Given an inequality join query on two relations with more than two join predicates, I need to determine which two predicates should be joined first to minimize the cost. The remaining predicates will be evaluated on the result of the join. To this end, I need to estimate the selectivity of all pair combinations of the join predicates in the query. For example, I compare the selectivity estimation of three pair combinations (⋈1–⋈3) for Qw,

Qw : SELECT s1.t_id, s2.t_id FROM west s1, west s2
     WHERE s1.time > s2.time AND s1.cost < s2.cost
     AND s1.totalUsers < s2.totalUsers;

as follows:

⋈1: s1.time > s2.time & s1.cost < s2.cost

⋈2: s1.time > s2.time & s1.totalUsers < s2.totalUsers

⋈3: s1.cost < s2.cost & s1.totalUsers < s2.totalUsers

I need to test N(N−1)/2 (i.e., N choose 2) combinations to choose the best join predicates for a given query with N input join predicates. Clearly, a large N would increase the overhead of the selectivity estimation. To lower this overhead, I reuse the sorted input relations in multiple instances of Algorithm 5 that share a similar join predicate. For example, IEJoin can sort on attribute time (or totalUsers) to reuse it in join predicate pairs ⟨time, totalUsers⟩ and ⟨time, cost⟩ (resp. ⟨time, totalUsers⟩ and ⟨cost, totalUsers⟩), as they both share attribute time (resp. totalUsers). Note that sorting based on one attribute or the other does not impact the selectivity estimation of the pair combination.
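Under the same assumptions, a hedged sketch of the pair-selection step could look as follows: it projects a sample onto every candidate pair and keeps the pair with the fewest overlapping blocks. For simplicity it does not implement the sorted-input reuse described above, and the estimator is passed in as a parameter (e.g., the Algorithm 5 sketch).

```python
from itertools import combinations

def best_predicate_pair(sample, predicates, estimate, block_size=64):
    """Pick the most selective pair of inequality predicates for a self-join query.

    sample:     list of row dicts, e.g. {'time': ..., 'cost': ..., 'totalUsers': ...}
    predicates: list of (attribute, operator) pairs, e.g. for Qw:
                [('time', '>'), ('cost', '<'), ('totalUsers', '<')]
    estimate:   a selectivity estimator returning the number of overlapping blocks
                (fewer = more selective), such as the Algorithm 5 sketch above
    """
    best = None
    for (a1, op1), (a2, op2) in combinations(predicates, 2):   # the N(N-1)/2 candidate pairs
        projected = [(row[a1], row[a2]) for row in sample]     # project the sample onto the pair
        score = estimate(projected, projected, op1, block_size)
        if best is None or score < best[0]:
            best = (score, (a1, op1), (a2, op2))
    return best
```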

4.3.3 Multi-way Join Optimization

For multi-way inequality joins, I follow a common approach for optimizing such joins, as in [82]. Generally speaking, I execute a multi-way inequality join as a series of two-way inequality joins organized as a left-deep plan. I adopt a greedy approach where I choose the order of the two-way joins based on their estimated selectivity, i.e., the two-way joins with higher selectivity (lowest number of overlapping blocks as computed by Algorithm 5) are pushed down in the plan. Algorithm 6 builds the execution plan for a multi-way join query. I first estimate the selectivity of all possible combinations of two-way joins using Algorithm 5 (lines 3-6). I discard Cartesian products. At level one of the left-deep plan, I pick the two relations with the most selective inequality join (lines 8-11). I then proceed by selecting the one relation, among the remaining ones, that would deliver the most selective inequality join if joined with the previous level in the plan (lines 12-18). This is performed by selecting the relation that has the smallest number of overlapping blocks when joined with a relation from the previous level. I only consider joins that share an inequality predicate with one relation in the previous levels (lines 14-16). In each level, I append the most selective two-way join to the final execution plan (lines 17-18) and remove it from the pool of available joins (line 19). By repeating this process until there are no more relations to join, I compute a full IEJoin order. Algorithm 6 returns an array that describes the left-deep plan. The plan is obtained by joining the relations in the order they appear in the array, i.e., first join plan[1] with plan[2], then the result with plan[3], and so on.

Algorithm 6: Multi-way IEJoin planner
input: input relations R
output: join plan
1  n ← |R|
2  Est ← empty set
3  for (i ← 1 to n) do
4      for (j ← i + 1 to n) do
5          if Ri & Rj share IEJoin predicates then
6              Est ← selectivity estimation of Ri ⋈ Rj
7  for (planIndex ← 1 to n) do
8      if planIndex = 1 then
9          min ← smallest estimation ∈ Est
10         plan[1] ← Ri, where i ∈ min
11         plan[2] ← Rj, where j ∈ min
12     else if planIndex > 2 then
13         Est′ ← empty set
14         foreach Ri ⋈ Rj estimation ∈ Est do
15             if (Ri|Rj) ∈ plan[1, planIndex − 1] and (Ri ∧ Rj) ∉ plan[1, planIndex − 1] then
16                 Est′ ← selectivity estimation of Ri ⋈ Rj
17     min ← smallest estimation ∈ Est′
18     plan[planIndex] ← Ri (or Rj), where (i ∧ j) ∈ min and Ri (or Rj) ∉ plan[1, planIndex − 1]
19     remove min from Est
20 return plan
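For illustration, here is a hedged Python sketch of the greedy left-deep planning in the spirit of Algorithm 6; the helpers shares_predicate and estimate are assumptions standing in for the predicate check and for Algorithm 5, respectively.

```python
def multiway_plan(relations, shares_predicate, estimate):
    """Greedy left-deep planner in the spirit of Algorithm 6.

    relations:              list of relation identifiers, e.g. ['R', 'S', 'T', 'V', 'W']
    shares_predicate(r, s): True if r and s share an IEJoin predicate (no Cartesian products)
    estimate(r, s):         estimated selectivity as a number of overlapping blocks
                            (lower = more selective), e.g. Algorithm 5 on 1% samples
    """
    # lines 3-6: estimate all joinable two-way combinations
    est = {(r, s): estimate(r, s)
           for i, r in enumerate(relations)
           for s in relations[i + 1:]
           if shares_predicate(r, s)}

    # lines 8-11: start with the most selective two-way join
    (r, s), _ = min(est.items(), key=lambda kv: kv[1])
    plan = [r, s]
    remaining = [x for x in relations if x not in (r, s)]

    # lines 12-19: repeatedly append the relation giving the most selective join
    # with a relation that is already in the plan
    while remaining:
        best = None
        for (a, b), cost in est.items():
            inside, outside = (a, b) if a in plan else (b, a)
            if inside in plan and outside in remaining and (best is None or cost < best[0]):
                best = (cost, outside)
        if best is None:
            break                      # no remaining relation shares a predicate with the plan
        plan.append(best[1])
        remaining.remove(best[1])
    return plan                        # join plan[0] with plan[1], then plan[2], and so on
```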

4.4 Incremental Inequality Joins

In this section, I present algorithms for incremental inequality joins when the data keeps changing with insertions and deletions. I compute Q(D ⊕ ∆D), where Q is an inequality join on D that contains either one or two relations, and ∆D is the data updates, which are either insertions ∆D+ or deletions ∆D−. Adapting my algorithms for incremental computation faces two challenges: (i) maintaining the sorted arrays; and (ii) only computing results w.r.t. ∆D.

Maintain a sorted array. Given a sorted array L and an unsorted array ∆L, the problem is to compute a sorted array for elements in L ⊕ ∆L.

(1) Sort-merge [128]. The straightforward solution is to first sort ∆L and then merge two sorted arrays L and ∆L in linear time. This approach is appropriate for batch updates.

(2) Packed-memory array [129]. A widely adopted approach for maintaining a dynamic set of N elements in sorted order in a Θ(N)-sized array is the packed-memory array (PMA). The idea is to insert Θ(N) empty spaces among the elements of the array such that only a small number of elements need to be shifted around per update. Thus, the number of element moves per update is only O(log² N). This approach is ideal for continuous query answering in a streaming fashion.

Note that PMA leaves empty spaces in the array. For instance, the arrays L1 and L2 in Figure 4.4 are stored using PMAs, where the "|"s denote the empty spaces. PMAs also handle deletions [129]. Although I use PMAs to maintain the sorted arrays by leaving gaps (i.e., the "|"s) for future updates, my algorithms are unchanged: they simply ignore these gaps. Let me start by discussing the incremental inequality join on one relation with insertions. I will then discuss the case with deletions, and then the case with two relations.
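As a small illustration of option (1), and of the per-update cost that the PMA of option (2) reduces, consider the following Python sketch (illustrative only; the PMA itself is not implemented here).

```python
import bisect

def merge_sorted(L, delta):
    """Sort-merge maintenance, option (1): sort the update batch, then merge in linear time."""
    delta = sorted(delta)
    out, i, j = [], 0, 0
    while i < len(L) and j < len(delta):
        if L[i] <= delta[j]:
            out.append(L[i]); i += 1
        else:
            out.append(delta[j]); j += 1
    out.extend(L[i:])
    out.extend(delta[j:])
    return out

def insert_one(L, x):
    """Streaming insertions: bisect.insort keeps L sorted but moves O(N) elements per update.
    A packed-memory array, option (2), brings this down to O(log^2 N) moves by leaving
    gaps ("|") in the array."""
    bisect.insort(L, x)
```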

Incremental IESelfJoin. I illustrate the idea of computing incremental results on one relation with insertions by an example.

Example 8: Consider query Qp from Example 4 and the data T used in Figure 4.1.

Computing Qp(T) is the same as described in Figure 4.1, shown as step (1) in Figure 4.4. Consider a new tuple insertion ∆D+ = {s5(95, 8)}. As shown in Figure 4.4, step (2), it finds the right positions in both sorted arrays, i.e., 95 in position 4 of L1′ and 8 in position 4 of L2′, and the permutation array is updated.

Similar to the process described in Chapter 4.1, it visits the tuples in L2′ from left to right (Figure 4.4, step (3)).

(a) For all tuples whose cost values are less than that of s5 (that is, 8), set their corresponding bits to 1, since they satisfy one join condition, on attribute cost.

(b) Visit s5 and output all results (s5, si) where si is to the right of s5 in B (i.e., satisfying both join conditions), which is {(s5, s1)}.

(c) For the other tuples to the right of s5 in L2′, output a new join result only if it contains the new tuple s5. When visiting s3, output {(s3, s5)}.

(d) When visiting s2, the process is similar to (c), with an empty output.

From the above example, I can see that only new results are produced, i.e., those coming from ∆D+. The incremental algorithm using the PMA is a simple adaptation of Algorithm 4 and is thus omitted here. Moreover, when the update contains a set of insertions, the procedure for each one is the same as described in Example 8. For deletions ∆D−, I use a similar methodology to compute the updated results. The difference is that, instead of adding these new results, I remove them from the old results.
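As a sanity check of this delta semantics, the following naive Python reference (not the PMA-based incremental algorithm) enumerates exactly the qualifying ordered pairs that involve at least one inserted tuple; pred is a stand-in for Qp's inequality conditions.

```python
def delta_self_join(old_rows, new_rows, pred):
    """Naive reference for the incremental semantics: for an insertion batch, the new
    results are exactly the qualifying ordered pairs that involve at least one new tuple."""
    delta = []
    for a in new_rows:
        for b in old_rows + new_rows:       # pairs (new, old) and (new, new)
            if a is not b and pred(a, b):
                delta.append((a, b))
        for b in old_rows:                  # pairs (old, new)
            if pred(b, a):
                delta.append((b, a))
    return delta
```

For a deletion batch, the same pair set is computed for ∆D− and subtracted from the cached results instead of being added.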

Incremental IEJoin. I now discuss how to extend Algorithm 3 on two relations R and S to support incremental processing. I will focus on insertions only (∆R+ and ∆S+), since deletions are similar to insertions except that results are removed rather than added. Taking Figure 4.2 as a reference, I perform the following three steps to run IEJoin: (1) maintain the sorted lists w.r.t. the insertions ∆R+ and ∆S+; (2) maintain the offsets of R ⊕ ∆R+ relative to S ⊕ ∆S+; and (3) compute the new results. Step (1) is the same as discussed for IESelfJoin. Step (2) can be performed in a way similar to the sort-merge join, by linearly scanning both sorted arrays in R ⊕ ∆R+ and S ⊕ ∆S+ to set the corresponding offsets. Step (3) is to run IEJoin while only outputting results related to ∆R+ or ∆S+, i.e., (∆R+, S), (R, ∆S+) or (∆R+, ∆S+). If the insertions are only in ∆R+, IEJoin ends after visiting the last element of ∆R+ in the bit-array. Otherwise, it continues until all tuples in R ⊕ ∆R+ are visited. When processing both insertions and deletions, results from (∆R+, ∆S−) or (∆R−, ∆S+) are ignored.

Figure 4.4: Incremental IESelfJoin process for query Qp: (1) compute Qp(T) as in Figure 4.1; (2) insert s5(95, 8) and maintain L1′, L2′ and the permutation array P′; (3) visit tuples w.r.t. L2′: (a) set the bits of the tuples visited before the new position (s4 and s1), (b) visit the new insertion s5 and output (s5, s1), (c) visit s3 and output (s3, s5), (d) visit s2 with no output.

4.5 Scalable Inequality Joins

I present a scalable version of IEJoin along the same lines as state-of-the-art general-purpose distributed data processing systems, such as Hadoop's MapReduce [51] and Spark [52]. My goal is twofold: (i) scale my algorithm to very large input relations that do not fit into the main memory of a single machine and (ii) minimize processing overheads to improve the efficiency even further. A simple approach for scaling IEJoin is to (i) construct k data blocks of each input relation, (ii) apply a Cartesian product (or a self-Cartesian product for a single input relation) on the data blocks, and (iii) run IEJoin (either on a single-table or a two-table input) on all resulting data block pairs (up to k²). This approach suffers from a high processing overhead caused by excessive block replication. In particular, excessive data replication increases the CPU overhead by scheduling a large number of tasks and redundantly processing replicated blocks (i.e., sorting identical blocks in different threads). It also causes high memory overheads, where different tasks maintain identical copies of the same data. In distributed settings, this approach additionally causes large network overheads as it transfers identical copies of the data to different workers. A naïve solution would be to reduce the number of blocks by increasing their size. However, very large blocks introduce work imbalance and require larger memory for each worker.

4.5.1 Scalable IEJoin

I solve the above challenges through efficient pre-processing and post-processing phases that reduce data replication by minimizing the number of required data block pairs. I achieve this by pruning unnecessary data block pairs early, before wasting any resources. The pre-processing phase generates space-efficient data blocks for the input relation(s), predicts which pairs of data blocks may report query results, and only materializes useful pairs of data blocks. IEJoin, in its scalable version, returns the join results as pairs of tupleIDs instead of returning the actual tuples. It is the responsibility of the post-processing phase to materialize the final results by resolving the tupleIDs into actual tuples. I use the internal tupleIDs of existing systems to uniquely identify different tuples. I summarize in Algorithm 7 the implementation of the scalable algorithm when processing two input tables.

Pre-processing. After assigning unique tupleIDs to each input tuple (lines 2-3), the pre-processing step globally sorts each relation on a common attribute of one of the IEJoin predicates (i.e., salary in Q1). Then, it partitions each sorted relation into k = ⌈M/b⌉ equally-sized partitions, where M is the relation input size and b is the default block size (lines 4-5). Note that global sorting before partitioning maximizes data locality within partitions, which in turn decreases the overall runtime. This is because global sorting partially answers one of the inequality join conditions, where it physically moves tuples closer to their candidate pairs. In other words, global sorting increases the efficiency of block pairs that generate results, while block pairs that do not produce results can be filtered out before actually processing them. After that, for each sorted partition, I generate a single data block that stores, in a list, only the attribute values referenced in the join conditions. Following the semi-join principle, these data blocks do not store the entire tuples, thus allowing reduction of the memory, disk I/O, and network overheads. I also extract, from each data block, metadata that contain the block ID and the min/max values of each referenced attribute (lines 6-11). Then, I create the n × m virtual block combinations MT1 × MT2 and filter out block combinations with non-intersecting min/max values, since they do not produce results (lines 12-17). Figure 4.5 illustrates the pre-processing of two relations, R and S. It starts by sorting and partitioning R into three blocks and S into two blocks. It then generates the metadata blocks and executes a Cartesian product on all metadata blocks. Next, it removes all non-overlapping blocks. Finally, it recovers the original content for all overlapping metadata blocks, i.e., µR1 recovers its data from block R1.

Algorithm 7: Scalable IEJoin
input: Query Q with 2 join predicates t1.X op1 t2.X′ and t1.Y op2 t2.Y′, Table t1, Table t2
output: Table tout
1  // Pre-processing
2  foreach tuple r ∈ t1 and t2 do
3      r ← global unique ID
4  DistT1 ← sort t1 on t1.X and partition into n blocks
5  DistT2 ← sort t2 on t2.X′ and partition into m blocks
6  for (i ← 1 to n) do
7      D1i ← all X and Y values in DistT1i
8      MT1i ← min and max of X and Y in D1i
9  for (j ← 1 to m) do
10     D2j ← all X′ and Y′ values in DistT2j
11     MT2j ← min and max of X′ and Y′ in D2j
12 Virt ← MT1 × MT2
13 forall (MT1i, MT2j) pairs ∈ Virt do
14     if MT1i ∩ MT2j ≠ ∅ then
15         (MT1i, MT2j) ← (D1i, D2j)
16     else
17         Remove (MT1i, MT2j) from Virt
18 // IEJoin function
19 forall block pairs (D1i, D2j) ∈ Virt do
20     TupleIDResult ← IEJoin(Q, D1i, D2j)
21 // Post-processing
22 forall tupleID pairs (v, w) ∈ TupleIDResult do
23     tuplev ← tuple with id v in DistT1
24     tuplew ← tuple with id w in DistT2
25     tout ← (tuplev, tuplew)

Figure 4.5: An example of the pre-processing stage with two relations R and S

IEJoin. I now have a list of overlapping block pairs. I simply run an independent IEJoin instance on each of these block pairs in parallel. Specifically, I merge the attribute values in D1 and D2 and run IEJoin over the merged block. The generation of the permutation and bit arrays is similar to the centralized version. However, the scalable IEJoin does not have access to the actual relation tuples. Therefore, each parallel IEJoin instance outputs a pair of tupleIDs that represents the joined tuples (lines 19-20).

Post-processing. In the final step, I materialize the result pairs by matching each tupleID pair from the output of the distributed IEJoin with the tupleIDs of DistT1 and DistT2 (lines 22-25). I run this post-processing phase in parallel, as a scalable hash join based on the tupleIDs, to speed up the materialization of the final join results.
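A minimal sketch of this materialization step, written as a single-threaded Python hash join on the tuple identifiers (the distributed version runs the same logic as a parallel hash join):

```python
def materialize(tupleid_pairs, dist_t1, dist_t2):
    """Resolve (tupleID, tupleID) join results back into full tuples.

    dist_t1/dist_t2: iterables of (tuple_id, full_tuple) for the two input relations.
    """
    lookup1 = dict(dist_t1)            # build side: tupleID -> tuple
    lookup2 = dict(dist_t2)
    for v, w in tupleid_pairs:         # probe side: the IEJoin output
        yield (lookup1[v], lookup2[w])
```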

4.5.2 Multithreaded and Distributed IEJoin

The scalable solution I just described can be adapted to a multithreaded setting as follows. For the pre-processing phase, I sort and partition the inputs in parallel. If the input does not fit into memory, I apply external sorting. Next, block metadata are extracted by all threads, while each thread examines an independent set of block metadata pairs to eliminate non-overlapping blocks in parallel. The remaining block metadata pairs are then materialized, from either a cached copy in memory or from disk, into block pairs. Each thread then applies the IEJoin algorithm on a different block pair to generate partial join results. Once the complete result is computed, I materialize the output using a multi-threaded hash-based join. In a distributed setting, I follow the same process, but I use compute nodes as the main processing units instead of threads. I discuss in Chapter 4.6.2 my distributed version on top of Spark SQL.
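The task structure of the multithreaded phase can be sketched as follows; iejoin_block_pair is a hypothetical stand-in for the per-pair join of Algorithm 7 (lines 19-20). Note that the dissertation's implementation is multithreaded C++; in CPython, real CPU parallelism would require a process pool rather than threads, so this is only an illustration of the task layout.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_block_join(block_pairs, iejoin_block_pair, num_threads=8):
    """Run an independent per-pair join on every overlapping block pair and
    collect the partial (tupleID, tupleID) results."""
    results = []
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(iejoin_block_pair, d1, d2) for d1, d2 in block_pairs]
        for f in futures:
            results.extend(f.result())
    return results
```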

4.6 Integration into Existing Systems

I describe the integration of my algorithms into four existing systems: PostgreSQL, a popular open-source DBMS; Spark SQL, a popular SQL-like engine on top of Spark; BigDansing, my big data cleansing system; and Rheem [109, 110, 130], a cross-platform system for big data analytics.

4.6.1 PostgreSQL

PostgreSQL processes queries in three stages: parsing, planning, and execution. Parsing extracts relations and predicates and creates query parse trees. Planning creates query plans and invokes the query optimizer to select a plan with the smallest estimated cost. Execution runs the selected plan and emits the output.

Parsing and Planning. PostgreSQL uses merge and hash join operators for equi-joins and a naïve nested loop for inequality joins. PostgreSQL looks for the most suitable join operator for each join predicate. I extend this check to verify that a join is IEJoin-able by checking whether a predicate contains a scalar inequality operator. If it is the case, I save the operator's oid in the data structure associated with the predicate. For each operator and ordered pair of relations, I create a list of predicates that the operator can handle. For example, two equality predicates over the same pair of relations are associated with one hash join. In the presence of inequality joins, I make sure that the PostgreSQL optimizer chooses my IEJoin algorithm, while optimizations for the other operators are performed by the PostgreSQL optimizer on top of the optimized IEJoin. Next, the planner estimates the execution cost of the possible join plans. In the presence of both equality and inequality joins, the optimizer delays all inequality joins, as they are usually less selective than equality joins. More specifically, every node in the plan has a base cost, which is the cost of executing the previous nodes, plus the cost of the actual node. I added a cost function for my operator; it is evaluated as the sum of the cost of sorting the inner and outer relations, the CPU cost of evaluating all output tuples (approximated based on PostgreSQL's default inequality join estimation), and the cost of evaluating additional predicates for each tuple (i.e., the ones that are not involved in the actual join). Finally, PostgreSQL selects the plan with the lowest cost.

Execution. The executor stores incoming tuples from the outer and inner relations into arrays of type TupleTableSlot, PostgreSQL's default data structure for storing relation tuples. These copies of the tuples are required because PostgreSQL may not have the content of a tuple at the same pointer location when the tuple is sent for the final projection. This step is a platform-specific overhead that is required to produce the output. The outer relation (of size N) is parsed first, followed by the inner relation (of size M). If the inner join data is identical to the corresponding outer join data (self-join), I drop the inner join data and the data structure has size N instead of 2N. If there are more than two IEJoin predicates, then I follow the procedure explained in Chapter 4.3.2, i.e., I pass to the algorithm the pair of predicates with the highest selectivity.

Sort1   idx   time   cost   pos      Sort2   idx   time   cost   pos
s1      3     80     10     1        s1      4     90     5      2
s2      4     90     5      2        s2      1     100    6      3
s3      1     100    6      3        s3      3     80     10     1
s4      2     140    11     4        s4      2     140    11     4

Table 4.2: Permutation array creation for self-join Qp

I illustrate in Table 4.2 the data structure and the permutation array computation with an example for the self-join Qp. I initialize the data structure with an index (idx) and a copy of the attributes of interest (time and cost for Qp). Next, I sort the data on the first predicate (time) using the system function qsort with special comparators (defined in Algorithm 3) to handle cases where two values of a predicate are equal. The result of the first sort is reported on the left-hand side of Table 4.2. The last column (pos) is now filled with the ordering of the tuples according to this sorting. As a result, I create a new array to store the index values for the first predicate. I use this array to select tuple IDs at the time of projecting tuples. The tuples are then ordered again according to the second predicate (cost), as reported on the right-hand side of Table 4.2. After the second sort, the new values in pos are the values of the permutation array. Finally, I create and traverse a bit-array B of size (N + M) (or N in the case of a self-join), along with a bitmap, as discussed in Chapter 4.2.3. If the traversal finds a set bit, the corresponding tuples are sent for projection. Predicates not selected by the optimizer, in the case of a multi-predicate IEJoin, are evaluated at this stage and, if the conditions are satisfied, the tuples are projected.
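The permutation-array construction of Table 4.2 can be sketched in a few lines of Python (illustrative names; ties are broken here with a simple secondary key, whereas the implementation uses the comparators of Algorithm 3).

```python
def permutation_array(tuples):
    """tuples: list of (idx, time, cost) for the Qp self-join example."""
    rows = [{"idx": i, "time": t, "cost": c} for i, t, c in tuples]
    rows.sort(key=lambda r: (r["time"], r["cost"]))   # first sort: predicate on time
    for rank, r in enumerate(rows, start=1):
        r["pos"] = rank                               # position w.r.t. the time order
    sorted_on_time = [r["idx"] for r in rows]         # index array used when projecting tuple IDs
    rows.sort(key=lambda r: (r["cost"], r["time"]))   # second sort: predicate on cost
    perm = [r["pos"] for r in rows]                   # reading off pos gives the permutation array
    return sorted_on_time, perm
```

Running it on the Qp data of Table 4.2, permutation_array([(1, 100, 6), (2, 140, 11), (3, 80, 10), (4, 90, 5)]) yields the permutation array [2, 3, 1, 4], matching the pos column on the right-hand side of the table.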

4.6.2 Spark SQL

Spark SQL [131] allows users to query structured data on top of Spark [52]. It stores the input data as a set of in-memory Resilient Distributed Datasets (RDDs). Each RDD is partitioned into smaller cacheable blocks, where each block fits in the memory of a single machine. Spark SQL takes as input the dataset location(s) in HDFS and an SQL query, and outputs an RDD that contains the query result. The default join operation in Spark SQL is the inner join. When passing a join query to Spark SQL, the optimizer searches for equality join predicates that can be used to evaluate the inner join operator as a hash-based physical join operator. If there are no equality join predicates, the optimizer translates the inner join physically into a Cartesian product followed by a selection predicate. I implemented the distributed version of IEJoin as a new Spark SQL physical join operator. To make the optimizer aware of the new operator, I added a new rule to recognize inequality conditions. The rule passes all inequality conditions to the IEJoin operator. If the operator receives more than two inequality join conditions, it deploys the IEJoin optimizer to find the two most selective inequality conditions. It executes the join using these two conditions and evaluates the rest of the join conditions as a post-selection operation on the output. Similar to the PostgreSQL case, in the presence of both equality and inequality joins, it orders inequality joins after all equality joins. The distributed operator utilizes Spark RDD operators to run both IEJoin and its optimizer. As a result, distributed IEJoin depends on Spark's default memory management to partition and store the user's input relation. If the result does not fit in the memory of a single machine, I temporarily store it in HDFS. After all IEJoin instances finish writing into HDFS, the distributed operator passes the HDFS file to Spark, which constructs a new RDD of the result and passes it to Spark SQL.

Figures 4.6 and 4.7 show how the distributed IEJoin is processed in Spark. First, I globally sort the two relations using the Spark RDD sort (Figure 4.6(a)). Next, I generate a set of distributed data blocks for each relation through the RDD mapPartitionsWithIndex() function (Figure 4.6(b)). As described in Chapter 4.5, the block transformation does not store the actual tuples; it only stores the attributes in the join predicates. I then transform the data blocks into metadata blocks using their statistics (Figure 4.6(c)). Afterwards, I apply a Cartesian product on the block metadata of R and S and remove non-overlapping block metadata through the RDD filter() operator (Figure 4.6(d)). Next, I join the remaining blocks' metadata with the original data blocks to recover their content (Figure 4.6(e)) through two RDD join() operators, one for each input relation. I then apply an independent instance of IEJoin on every block pair in parallel by using the RDD flatMapToPair() operator (Figure 4.7(a)). Finally, I join the IEJoin result with the original relations R and S, using two RDD join() operators, to recover the full attribute information of the result (Figure 4.7(b)).

Figure 4.6: Spark's DAG for the pre-processing phase

Figure 4.7: Spark's DAG for IEJoin and the post-processing phase
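For reference, here is a hedged PySpark sketch of the pre-processing dataflow of Figure 4.6. It uses generic column names and a simplified layout in which each block carries its own values; the actual operator keeps metadata and data blocks separate and re-joins them, as in Figure 4.6(e), before running one IEJoin instance per surviving pair as in Figure 4.7(a).

```python
from pyspark import SparkContext

def preprocess(sc, rows, num_blocks):
    """rows: iterable of (tuple_id, x, y); the join predicates are assumed to be on x and y."""
    rdd = sc.parallelize(rows, num_blocks) \
            .sortBy(lambda t: t[1], numPartitions=num_blocks)        # (a) global sort on X

    def to_block(index, it):                                         # (b) + (c): one block per
        vals = [(tid, x, y) for tid, x, y in it]                     # partition, join attrs only
        if not vals:
            return iter([])
        meta = (min(x for _, x, _ in vals), max(x for _, x, _ in vals),
                min(y for _, _, y in vals), max(y for _, _, y in vals))
        return iter([(index, meta, vals)])

    return rdd.mapPartitionsWithIndex(to_block)

def overlapping_block_pairs(blocks_r, blocks_s):
    """(d): build the virtual block combinations and keep only those whose min/max
    ranges intersect on both join attributes; the rest cannot produce results."""
    def overlaps(pair):
        (_, (rx0, rx1, ry0, ry1), _), (_, (sx0, sx1, sy0, sy1), _) = pair
        return rx0 <= sx1 and sx0 <= rx1 and ry0 <= sy1 and sy0 <= ry1
    return blocks_r.cartesian(blocks_s).filter(overlaps)
```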

4.6.3 BigDansing

In BigDansing, OCJoin (Chapter 3.3.3) is used to detect violations when dealing with rules that require inequality joins. For example, a rule may state that, given a dataset of employees in the same state, for every two distinct individuals the one earning a lower salary should have a lower tax rate. OCJoin is a variation of the distributed sort-merge join that performs better than the distributed cross product in BigDansing; however, it is only efficient for very small datasets (on the order of 100K tuples).

Figure 4.8: A summary of BigDansing's modified logical, physical and execution operators for Spark with IEJoin.

I integrate IEJoin within BigDansing to further improve the detection process for such rules and scale to much larger datasets. To achieve this, I implement the distributed IEJoin on Spark, as a set of RDD transformations, and add it to BigDansing as a new enhancer and execution operator for Spark. After that, I modify BigDansing's logical plan translator to consider the new enhancer whenever the input rule requires inequality joins. Notice that the IEJoin execution operator is similar to its implementation in Spark SQL, discussed in Chapter 4.6.2. I show in Figure 4.8 a summary of BigDansing's modified operators for Spark with IEJoin.

4.6.4 Rheem

Rheem is an open-source big data processing framework that provides independence from and interoperability among existing data processing platforms. I implement IEJoin in Rheem to provide IEJoin as a service for applications that require it, without the overhead of porting their application to either PostgreSQL or Spark SQL. More specifically, (i) I extend the Rheem framework in order to expose the new join as a physical operator for users, (ii) I add the centralized IEJoin as an execution operator for Java, and (iii) I implement the distributed version of IEJoin as an execution operator for Spark. As a result, Rheem users can transparently switch between the centralized IEJoin and the distributed IEJoin in their application without changing their code.

Dataset      Number of tuples   Size
Employees    10K – 500M         300KB – 17GB
Employees2   1B – 8B            34GB – 287GB
Events       10K – 500M         322KB – 14GB
Events2      1B – 6B            32GB – 202GB
MDC          24M                2.4GB
Cloud        470M               28.8GB
Grades       100K               1.7MB

Table 4.3: Size of the datasets

4.7 Experimental Study

I evaluate IEJoin along several dimensions: (i) effect of sorting and caching; (ii) IEJoin in a centralized environment; (iii) query optimization techniques; (iv) incremental algorithms; and (v) IEJoin in a distributed environment.

4.7.1 Datasets

I used both synthetic and real-world data (summarized in Table 4.3) to evaluate my algorithms.

(1) Employees contains employees’ salary and tax information [68] with eight attributes: state, married, dependents, salary, tax, age, and three others for notes. The relation has been populated with real-life data for tax rates, income brackets, and exemptions, which were then used to generate synthetic tax records. This dataset is closely related to the TaxA and TaxB datasets used by BigDansing in Chapter 3.5.1. Employees2 in Table 4.3 is a group of larger input datasets with up to 6 Billion records, but with only 0.001% random changes to tax values. I lower the percentage of changes to test the distributed algorithm on large input files while avoiding extremely large output files.

(2) Events is a synthetic dataset that contains start and end time information for a set of independent events. Each event contains the name of the event, the event ID, the number of attending people, and the sponsor ID. To make sure I generate output for a given query, I extended the end values of 10% randomly chosen events. Events2 contains larger datasets with up to 6 Billion records and 0.001% extended random events.

(3) Mobile Data Challenge (MDC) is a 50GB real dataset [132, 133] that contains behavioral data of nearly 200 individuals collected by Nokia Research (https://www.idiap.ch/dataset/mdc). The dataset contains physical locations, social interactions, and phone logs of the participating individuals.

(4) Cloud [57] is a real dataset that contains cloud reports from 1951 to 2009, through land and ship stations (ftp://cdiac.ornl.gov/pub3/ndp026c/).

(5) Grades is a synthetic dataset for testing the not equal (≠) join predicate. It is composed of a list of students with attributes id, name, gender, grade, and age. I encode gender values using numeric values to be able to compare them with the conditions less than (<) and greater than (>). The dataset size is fixed (100K tuples), but I vary the ratio of female students between 0.01% and 50% to diversify the selectivity of the query.

4.7.2 Algorithms

I compare my algorithms with several centralized as well as distributed systems.

Centralized Systems. For my centralized setting, I use the following systems:

(1) C++ IEJoin. This is a standalone C++ implementation of IEJoin used to run experiments that cannot be executed within a DBMS. For example, I use this implementation to evaluate the incremental experiments of IEJoin (Chapter 4.7.8). The implementation uses optimized data structures from the Boost library (http://www.boost.org/) to maximize the performance gain of IEJoin through faster array scanning.

(2) PG-IEJoin. I implemented IEJoin inside PostgreSQL v9.4, as discussed in Chapter 4.6.1. I compare it against the baseline systems below.

(3) PG-Original. I used PostgreSQL v9.4 as a baseline. I tuned it with pgtune to maximize the benefit from the large main memory.

(4) PG-BTree & PG-GiST. I also used two variants of PostgreSQL using indices. PG-BTree uses a B-tree index for each attribute in a query. PG-GiST uses the GiST access method in PostgreSQL, which considers arbitrary indexing schemes and automatically selects the best technique for the input relation.

(5) MonetDB. I used MonetDB Database Server Toolkit v1.1 (Oct2014-SP2), an open-source column-oriented database, in a disk partition of size 669GB.

(6) DBMS-X. I used a leading commercial centralized relational database.

Distributed Systems. I used the following systems:

(1) Spark SQL-IEJoin. I implemented IEJoin inside Spark SQL v1.0.2 (https://spark.apache.org/sql/) as detailed in Chapter 4.6.2. I evaluated the performance of IEJoin against the baseline systems below.

(2) Spark SQL & Spark SQL-SM. Spark SQL is the default join implementation in Spark SQL. Spark SQL-SM [34] is an optimized version based on a distributed sort-merge join (Chapter 4.6.3), which is identical to BigDansing's OCJoin in Chapter 3.3.3. I also improve this method by pruning the non-overlapping partitions to be joined.


(3) DPG-BTree & DPG-GiST. I used a commercial version of PostgreSQL with distributed query processing. This allows me to compare Spark SQL-IEJoin to a distributed version of PG-BTree and PG-GiST.

4.7.3 Queries

I evaluate my algorithms from different perspectives and using several queries with inequality join conditions. It is worth noting that my main goal in the experiments is to show the value of optimizing inequality queries using my approach irrespective of the presence of other operators. For my first experiment, I used the following self-join query over Employees to find violations of a data quality rule [6]:

Q1 : SELECT r.id, s.id FROM Employees r, Employees s WHERE r.salary < s.salary AND r.tax > s.tax;

The query returns a set of employee pairs, where one employee earns a higher salary than the other, but pays less tax. To make sure that I generate output for Q1, I selected 10% random tuples and increased their tax values. I also used a self-join query that collects pairs of overlapping events (in the Events dataset):

Q2 : SELECT r.id, s.id FROM Events r, Events s WHERE r.start ≤ s.end AND r.end ≥ s.start AND r.id ≠ s.id;

I extended the end values of 10% random events to make sure I generate output for Q2. Throughout all of my experiments, I either use these two queries or slightly modified versions thereof. For example, I added a third join predicate to Q1 over age to get Q1′, and extended Q1 and Q2 to get Qmw, to study how my algorithms deal with multi-predicate conditions and with multi-way joins, respectively.

Q1′ : SELECT r.id, s.id FROM Employees r, Employees s WHERE r.salary < s.salary AND r.tax > s.tax AND r.age > s.age;

Qmw : SELECT count(*) FROM R r, S s, T t, V v, W w WHERE r.salary > s.salary AND r.tax < s.tax AND s.start ≤ t.end AND s.end ≥ t.start AND t.salary > v.salary AND t.tax < v.tax AND v.start ≤ w.end AND v.end ≥ w.start;

I also used slightly modified versions of Q1 and Q2 for my comparison with baselines using indexes. This is because, although Q1 and Q2 appear to be similar, they require different data representations to be indexed with GiST. The inequality attributes in Q1 are independent, as each condition forms a single open interval, while in Q2 they are dependent, as together they form a single closed interval. Thus, I convert the salary and tax attributes into a single geometric point data type SalTax to get Q1i. Similarly for Q2, I convert the start and end attributes into a single range data type StartEnd to get Q2i.

Q1i : SELECT r.id, s.id FROM Employees r, Employees s WHERE r.SalTax >^ s.SalTax AND r.SalTax >> s.SalTax;

Q2i : SELECT r.id, s.id FROM Events r, Events s WHERE r.StartEnd && s.StartEnd AND r.id ≠ s.id;

In the rewriting of these queries for PG-GiST, operator ">^" corresponds to "is above?", operator ">>" means "is strictly right of?", and operator "&&" indicates "overlap?". For geometric and range types, GiST uses a Bitmap index to optimize its data access with large datasets.

In addition to the above two main queries, I used Q3 to evaluate my algorithms when producing different output sizes. This query looks for all persons that are close to a shop, up to a distance c along the x-axis (xloc) and the y-axis (yloc):

Q3 : SELECT s.name, p.name FROM Shops s, Persons p WHERE s.xloc − c < p.xloc AND s.xloc + c > p.xloc AND s.yloc − c < p.yloc AND s.yloc + c > p.yloc;

I also used a self-join query Q4, similar to Q3, to compute all stations within distance c = 10 for every station. Since the runtime in Q3 and Q4 is dominated by the output size, I mostly used them for scalability analysis in the distributed case.

Furthermore, I used query Q5 for my not equal join predicates experiment.

Q5 : SELECT r.name, s.name FROM Grades r, Grades s WHERE r.gender ≠ s.gender AND r.grade > s.grade;

Notice that PostgreSQL executes Q5 with nested loops. In my solution, the query is rewritten to Q5′.

Q5′ : SELECT r.name, s.name FROM Grades r, Grades s WHERE r.gender < s.gender AND r.grade > s.grade
      UNION ALL
      SELECT r.name, s.name FROM Grades r, Grades s WHERE r.gender > s.gender AND r.grade > s.grade;

As attribute gender in Q5 has only two distinct values, it might not provide the complete picture of the performance of (≠) IEJoins. Thus, I additionally used Q6 and Q7 to analyze the effect of non-binary attributes on IEJoin with (≠) as the join predicate.

Q6 : SELECT r.name, s.name FROM Grades r, Grades s WHERE r.age ≠ s.age AND r.grade > s.grade;

Q7 : SELECT r.name, s.name FROM Grades r, Grades s WHERE r.gender ≠ s.gender AND r.grade ≠ s.grade;

Chunk size (bits)   C++ (Sec)   PostgreSQL (Sec)   Spark SQL (Sec)
1                   > 1 day     > 1 day            > 1 day
64                  2333        3139               1623
256                 558         817                896
1024                158         242                296
4096                81          117                158
16384               139         142                232

Table 4.4: Bitmaps on 10M tuples (Events data)

4.7.4 Setup

For the centralized evaluation, I used a Dell Precision T7500 with two 64-bit quad-core Intel Xeon X5550 processors and 58GB of RAM. For my distributed experiments, I used a cluster of 17 Shuttle SH55J2 machines (1 master and 16 workers) with Intel i5 processors and 16GB of RAM each, connected to a high-end 1 Gigabit switch. For both settings, all arrays are stored in memory.

4.7.5 Parameter Setting

I start my experimental evaluation by showing the effect of the two optimizations (Chapter 4.2.3), as well as the effect of global sorting (Chapter 4.5).

Bitmap. Bitmaps are used when scanning a big array is expensive. The chunk size is an optimization parameter that is machine-dependent. I run query Q2 on 10M tuples (322MB) to show the performance gain of using a bitmap, based on three different implementations of IEJoin: the centralized C++, the C implementation in PostgreSQL, and the Java implementation in Spark SQL. For Spark SQL, I only ran a single instance of IEJoin in a centralized setting. Results are shown in Table 4.4. Intuitively, the larger the chunk size, the better. However, a very large chunk size defeats the purpose of using bitmaps to reduce the bit-array scanning overhead. The experiment shows that the performance gain is 3X between 256 bits and 1,024 bits and around 1.8X between 1,024 bits and 4,096 bits. Larger chunk sizes show worse performance, as seen with the chunk size of 16,384 bits. The experiment in Table 4.5 shows the performance of the three implementations of IEJoin when adding the max scan index optimization to the bitmap. This index optimization improves the performance of IEJoin by around 3.5 times compared to the best results achieved by the bitmap alone in Table 4.4.

Max Index?   C++ (Sec)   PostgreSQL (Sec)   Spark SQL (Sec)
No           81          117                158
Yes          23          47                 46

Table 4.5: Bitmaps on 10M tuples with max scan index optimization (Events data)

Union arrays. To study the impact of the union optimization, I run IEJoin with and without the union array using 10M tuples of the Events dataset. I collect the following statistics with Spark SQL-IEJoin, as shown in Table 4.6: (i) L1 data cache (dcache), (ii) last level cache (LLC), and (iii) data translation lookaside buffer (dTLB). The optimized algorithm with union arrays is 2.6 times faster than the original one. The performance gain in the optimized version is due to the lower number of cache loads and stores (L1-dcache-loads, L1-dcache-stores, dTLB-loads and dTLB-stores), which is 2.7 to 3 times smaller than in the original algorithm. This behavior is expected since the optimized IEJoin has fewer arrays w.r.t. the original version.

Parameter (M/sec)             IEJoin (union)   IEJoin
cache-references              6.5              8.4
cache-references-misses       3.9              4.8
L1-dcache-loads               459.9            1,240.6
L1-dcache-load-misses         8.7              10.9
L1-dcache-stores              186.8            567.5
L1-dcache-store-misses        1.9              1.9
L1-dcache-prefetches          4.9              7.0
L1-dcache-prefetches-misses   2.2              2.7
LLC-loads                     5.1              6.6
LLC-load-misses               2.9              3.7
LLC-stores                    3.8              3.7
LLC-store-misses              1.1              1.2
LLC-prefetches                3.1              4.1
LLC-prefetch-misses           2.2              2.9
dTLB-loads                    544.4            1,527.2
dTLB-load-misses              0.9              1.6
dTLB-stores                   212.7            592.6
dTLB-store-misses             0.1              0.1
Total time (sec)              125              325

Table 4.6: Cache statistics on 10M tuples (Events data)

Global sorting on distributed IEJoin. As presented in Algorithm 7, the distributed version of my algorithm applies global sorting in the pre-processing phase (lines 6-7). I report in Table 4.7 a detailed performance comparison for Q1 and Q2 with and without global sorting on 100M tuples from the Employees and Events datasets. The pre-processing time includes data loading from HDFS, global sorting, partitioning, and block-pair materialization. One may think that global sorting impairs the performance of distributed IEJoin as it shuffles data through the network. However, global sorting improves the performance of the distributed algorithm by 2.4 to 2.9 times. More specifically, the runtime of the pre-processing phase with global sorting is at least 30% faster compared to the case without global sorting. Moreover, I also note that the time required by IEJoin is one order of magnitude smaller when using global sorting. This is because global sorting enables filtering out block-pair combinations that do not generate results. This greatly reduces the network overhead and increases memory locality in the block combinations that are passed to my algorithm. Based on the above experiments, in the following tests I used 1,024 bits as the default chunk size, the max scan index optimization, union arrays, and global sorting for distributed IEJoin.

Query   Pre-process   IEJoin   Post-process   Total
With global sorting (Seconds)
Q1      632           162      519            1,313
Q2      901           84       391            1,376
Without global sorting (Seconds)
Q1      1,025         1,714    426            3,165
Q2      1,182         1,864    349            3,395

Table 4.7: IEJoin time analysis on 100M tuples and 6 workers

4.7.6 Single-node Experiments

In this set of experiments, I study the efficiency of IEJoin on datasets that fit the main memory of a single compute node and compare its performance with alternative centralized systems.

IEJoin vs. baseline systems. Figure 4.9 shows the results for query Q1 on the Employees dataset and Q2 on the Events dataset in a centralized environment using 10K, 50K and 100K tuples. The x-axis represents the input size in terms of the number of tuples and the y-axis represents the corresponding running time in seconds. The figure reports that PG-IEJoin outperforms all baseline systems by more than one order of magnitude for both queries, and for every reported dataset input size. In particular, PG-IEJoin is up to more than three (resp., two) orders of magnitude faster than PG-Original and MonetDB (resp., DBMS-X). Clearly, the baseline systems cannot compete with PG-IEJoin since they all use the classic Cartesian product followed by a selection predicate. In fact, this is the main reason why they cannot run on bigger datasets.

Figure 4.9: IEJoin vs. baseline systems (centralized); (a) Q1, (b) Q2

IEJoin vs. indexing. I now consider the two variants of PostgreSQL, PG-BTree and PG-GiST, to evaluate the efficiency of my algorithm on bigger datasets. I again run Q1 and Q2 on 10M and 50M records using both the Employees and Events datasets. Figure 4.10 shows the results. In both experiments, IEJoin is more than one order of magnitude faster than PG-GiST. In fact, IEJoin is more than three times faster than the GiST indexing time alone. I stopped PG-BTree after 24 hours of runtime. My algorithm performs better than these two baseline indices because it better utilizes memory locality.

Memory consumption. The memory consumption of MonetDB increases exponentially with the input size. For example, MonetDB uses 419GB for an input dataset with only 200K records. In contrast to MonetDB, IEJoin makes better use of the memory. Table 4.8 shows that IEJoin uses around 200MB for Q1 and Q2 on an input dataset of 200K records, while MonetDB requires two orders of magnitude more memory. In Table 4.8, I also report the overall memory used by the sorted attribute arrays, the permutation arrays, and the bit-array. Moreover, although IEJoin requires 8.2GB of memory for an input dataset of 10M records, it runs to completion in about 8 hours (28,928 seconds), producing more than 50 billion output records.

Figure 4.10: IEJoin vs. BTree and GiST (centralized); (a) 10M tuples, (b) 50M tuples

Time breakdown. I further analyze the time breakdown of IEJoin on the Employees and Events datasets with 50M tuples. Table 4.9 shows that, excluding the time required to load the dataset into memory, scanning the bit-array takes only 10% of the overall execution time in C++ IEJoin, 5% in PG-IEJoin, and 40% in Spark SQL-IEJoin, while the rest is mainly spent on sorting. This shows the high efficiency of my algorithm.

IEJoin with the not equal (≠) join predicate. I tested the performance of IEJoin with the not equal join predicate using Q5 on the Grades dataset. PG-Original runs the query using nested loops, while PG-IEJoin uses two instances of IEJoin to process Q5′. As shown in Figure 4.11(a), PG-IEJoin is from four times to two orders of magnitude faster than PG-Original. Note that the query output becomes larger as the Female frequency in the dataset increases.

I also tested the effect of a non-binary attribute by using queries Q5, Q6, and Q7. For PG-IEJoin, the queries were transformed into a union of inequality joins as discussed before. Figure 4.11(b) shows that running PG-IEJoin on Q5 and Q7 is an order of magnitude faster than PG-Original, while on Q6 it is four times faster. This is because the join attribute in Q5 and one of the join attributes in Q7 generate far fewer results than Q6. In fact, the output of Q6 is five times larger than that of Q5 and three times larger than that of Q7. This difference for Q6 is expected since both attributes of its not equal join predicates are not binary.

Query   Input   Output   Time (secs)   Mem (GB)
Q1      100K    9K       0.30          0.1
Q1      200K    1.1K     0.5           0.2
Q1      1M      29K      2.79          0.9
Q1      10M     3M       27.64         8.8
Q2      100K    0.2K     0.34          0.1
Q2      200K    0.8K     0.65          0.2
Q2      1M      20K      3.38          0.9
Q2      10M     2M       59.6          9.7
Q4      100K    6M       2.8           0.1
Q4      200K    25M      10.6          0.2
Q4      1M      0.4B     186           0.9
Q4      10M     50.5B    28,928        8.2

Table 4.8: Runtime and memory usage (PG-IEJoin)

Query   Data reading   Data sorting   Bit-array scanning   Total time
C++ IEJoin
Q1      82             31             4                    117
Q2      85             31             3                    119
PG-IEJoin
Q1      46             94             4                    146
Q2      48             238            24                   310
Spark SQL-IEJoin (Single Node)
Q1      158            240            165                  563
Q2      319            332            215                  866

Table 4.9: Time breakdown on 50M tuples, all times in seconds

IEJoin vs. cache-efficient Cartesian product. I further push my evaluation to better highlight the memory-locality efficiency on Q1 using the Employees dataset. I compare the performance of C++ IEJoin with both naïve and cache-efficient Cartesian product joins for Q1 on datasets that fit the L1 cache (256 KB), the L2 cache (1 MB), and the L3 cache (8 MB) of the Intel Xeon processor. I used 10K tuples for the L1 cache, 40K tuples for the L2 cache, and 350K tuples for the L3 cache. I do not report results for query Q2 because they are similar to those of Q1. In Figure 4.12, I see that when the dataset fits in the L1 cache, IEJoin is two orders of magnitude faster than both the cache-efficient Cartesian product and the naïve Cartesian product. Furthermore, as I increase the dataset size for Q1 to fit in the L2 and L3 caches, IEJoin becomes almost three and four orders of magnitude faster than the Cartesian product, respectively. This is because of the delays of the L2 and L3 caches and the complexity of the Cartesian product.

Figure 4.11: Q5 runtime with different Female distributions, and Q5, Q6 and Q7 runtimes with a 10% Female distribution; (a) Q5, (b) Q5, Q6 and Q7

Figure 4.12: Q1 runtime for data that fits in the caches; (a) L1 cache, (b) L2 cache, (c) L3 cache

Single-node summary. IEJoin outperforms existing baselines by at least an order of magnitude for two main reasons: it avoids the use of the Cartesian product and it exploits memory locality by using memory-contiguous data structures with a small footprint. In other words, my algorithm avoids, as much as possible, going to memory, to fully exploit the CPU speed.

Join Predicates   Estimation on a % sample (number of overlapping blocks)
                  1%        5%      10%     20%
salary & tax      501       2535    5130    10K
salary & age      125250    3.1M    12M     50M
tax & age         125250    3.1M    12M     50M

Table 4.10: Q1′'s selectivity estimation on the 50M-tuple Employees dataset with a low selectivity age attribute

Join Predicates   Estimation on a % sample (number of overlapping blocks)
                  1%        5%      10%     20%
salary & tax      500       2545    5144    10K
salary & age      503       2559    5237    11K
tax & age         124750    3M      12M     50M

Table 4.11: Q1′'s selectivity estimation on the 50M-tuple Employees dataset with a high selectivity age attribute

4.7.7 IEJoin Optimizer Experiments

In this section, I evaluate the accuracy of the IEJoin optimizer when dealing with multi-predicate and multi-way queries. I use Q1′ for the multi-predicate IEJoin and Qmw for the multi-way IEJoin. For the multi-predicate experiment, I evaluate the performance of IEJoin when using the most selective predicate pair, found by Algorithm 5, compared to using the other predicate pairs. For the multi-way IEJoin, I compare the runtime of the plan generated by Algorithm 6 against the other plans.

Multi-predicate IEJoin. I evaluate the accuracy of the selectivity estimation algorithm on query Q1′ using C++ IEJoin. I first use Algorithm 5 to calculate the selectivity estimation for all three predicate pairs over the join attributes salary, tax, and age. To evaluate the estimation algorithm, I generated two different distributions of attribute age for query Q1′: a low selectivity distribution that generates a large output with the salary and tax attributes, and a high selectivity distribution that generates a small output with attribute salary. The low selectivity distribution injects random noise on the age attribute of each tuple, generating a large output for the IEJoin on (r.salary < s.salary AND r.age > s.age) or on (r.tax > s.tax AND r.age > s.age). For the high selectivity distribution, however, I carefully assign a value proportional to the salary attribute to generate a small output for the IEJoin on (r.salary < s.salary AND r.age > s.age). I show the selectivity estimation for both distributions of the age attribute in Tables 4.10 and 4.11. The tables report selectivity estimations when using input samples of size 1%, 5%, 10%, and 20%. I determine the best join predicate pair for Q1′ by selecting the predicate pair with the minimum number of overlapping blocks. According to Tables 4.10 and 4.11, the (salary, tax) join predicate pair has the highest selectivity in all input samples. Although the difference between predicate pairs (salary, tax) and (salary, age) in Table 4.11 is relatively small, I notice that it gets larger as I increase the size of the input sample. To validate this observation, I execute IEJoin on each join predicate pair and compare the output sizes and runtimes. I show in Figure 4.13(a) that choosing (salary, tax) is the right decision. In fact, the output size of (salary, tax) with the low selectivity age attribute is at least two orders of magnitude lower than the results with the other predicates, and it is 50% lower than (salary, age) with the high selectivity age attribute. In Figure 4.13(b), I show the runtime difference among predicate pairs with the high selectivity age attribute. Join predicate pair (salary, tax) is orders of magnitude faster than predicate pair (tax, age), but only a couple of seconds faster than predicate pair (salary, age). Although the performance difference between the (salary, tax) and (salary, age) pairs is small in the dataset with the high selectivity age attribute, the IEJoin query optimizer was able to detect that using (salary, tax) is faster than using (salary, age). I also notice that the runtime for the low selectivity age dataset is orders of magnitude faster with predicate pair (salary, tax) compared with the other combinations.

Figure 4.13: Result size and runtime for Q1′; (a) result size, (b) runtime

Multi-way IEJoin. I test the multi-way IEJoin optimization in Algorithm 6 with the five relations in Qmw. Each relation in Qmw contains a random number of tuples from both the Employees and Events datasets. I summarize in Table 4.12 the size of each relation, the selectivity estimation (computed with Algorithm 5 on a 1% sample size), and the join result size based on the inequality conditions in Qmw. Table 4.12 shows that the selectivity estimation is consistent with the actual join result size; the lower the estimation, the smaller the output size, and the higher the estimation, the larger the output size. Based on Algorithm 6, the optimal plan for Qmw is ((((R ⋈ S) ⋈ T) ⋈ V) ⋈ W), according to the selectivity estimations in Table 4.12. To evaluate the quality of the optimal plan, I evaluate the performance of all multi-way IEJoin plans for Qmw with PG-IEJoin and report the results in Table 4.13. Plan ((((R ⋈ S) ⋈ T) ⋈ V) ⋈ W), the one generated by Algorithm 6, is indeed the fastest. Although (R ⋈ S) has higher selectivity than (T ⋈ V), evaluating (T ⋈ V) is slightly faster than (R ⋈ S). Indeed, (R ⋈ S) produces fewer results than (T ⋈ V) (Table 4.12), but since sorting dominates the performance of IEJoin, as shown in Table 4.9, (R ⋈ S) becomes slightly slower with a larger number of tuples to sort (40M tuples) compared to (T ⋈ V) (38M tuples). Nevertheless, ((((R ⋈ S) ⋈ T) ⋈ V) ⋈ W) remains the fastest plan because it eliminates more tuples earlier compared to the rest.

        Number of tuples   Join    Selectivity Estimation   Join Result size
R       2M                 R ⋈ S   388                      576961
S       38M                S ⋈ T   6.7M                     31.5M
T       28M                T ⋈ V   1999                     876513
V       10M                V ⋈ W   0.5M                     12.5M
W       22M                –       –                        –

Table 4.12: Relation sizes, selectivity estimations and actual output for individual joins in Qmw

1st Join        2nd Join      3rd Join      4th Join     Total
R ⋈ S  106s     ⋈ T   77s     ⋈ V   21s     ⋈ W   53s    257s
S ⋈ T  390s     ⋈ R   41s     ⋈ V   21s     ⋈ W   51s    503s
                ⋈ V   67s     ⋈ R    4s     ⋈ W   51s    512s
                              ⋈ W   55s     ⋈ R    4s    516s
T ⋈ V  103s     ⋈ S  108s     ⋈ R    4s     ⋈ W   53s    268s
                              ⋈ W   58s     ⋈ R    4s    273s
                ⋈ W   57s     ⋈ S  106s     ⋈ R    3s    269s
V ⋈ W  319s     ⋈ T   91s     ⋈ S  108s     ⋈ R    4s    522s

Table 4.13: Runtime of different multi-way join plans for Qmw

Optimizer experiments summary. With only 1% of the data, the selectivity estimation in Algorithm 5 is able to accurately distinguish between high and low selectivity IEJoins. Note that the 1% sample size may not be applicable to other datasets, since the minimum sample size depends on the distribution of the join attributes. The performance degradation of not selecting the optimal query execution plan for multi-predicate and multi-way IEJoins varies between 5% and an orders-of-magnitude performance drop. With negligible overhead, my selectivity estimation allows a DBMS to tune its query optimizer to select the optimal plan for multi-predicate and multi-way IEJoins.

4.7.8 Incremental IEJoin Experiments

I developed the incremental algorithm of Chapter 4.4 on top of the C++ IEJoin implementation and tested it on Q1. Again, I do not report results for Q2 since they were similar to those of Q1. I used five different implementations for the experiments in this section: (1) non-incremental (my original implementation), (2) incremental non-batched sort-merge (∆Inc), (3) incremental batched sort-merge (B-∆Inc), (4) incremental with the packed-memory array (PMA) data structure to dynamically maintain the cached input (P-∆Inc), and (5) σInc, which is similar to B-∆Inc but returns the full output. The three incremental variations (∆Inc, B-∆Inc, and P-∆Inc) return results that correspond only to the updates, while σInc generates results from both the updates and the cached input.
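The batched sort-merge variant can be pictured as keeping the cached input sorted, sorting each batch of updates, and merging the two sorted runs before the join indices are rebuilt on the combined input. The following is only a minimal sketch of that caching-and-merging step with assumed tuple fields; the actual B-∆Inc algorithm is defined in Chapter 4.4.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Minimal sketch of the caching step behind a batched sort-merge incremental
// join: the cached input stays sorted on the ordering attribute, and each
// update batch is sorted and merged into it. Rebuilding or patching the
// IEJoin permutation/bit arrays on the merged input is omitted here.
struct Tuple {
  long id;
  double salary;  // assumed ordering/join attribute (Q1-style predicates)
  double tax;
};

class BatchedCache {
 public:
  void ApplyBatch(std::vector<Tuple> batch) {
    std::sort(batch.begin(), batch.end(), BySalary);  // sort the update batch
    std::vector<Tuple> merged;
    merged.reserve(cached_.size() + batch.size());
    std::merge(cached_.begin(), cached_.end(), batch.begin(), batch.end(),
               std::back_inserter(merged), BySalary);  // merge two sorted runs
    cached_.swap(merged);
  }
  const std::vector<Tuple>& cached() const { return cached_; }

 private:
  static bool BySalary(const Tuple& a, const Tuple& b) {
    return a.salary < b.salary;
  }
  std::vector<Tuple> cached_;  // always kept sorted by salary
};
```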

Incremental IEJoin with small updates. I first study the advantage of my P-∆Inc algorithm on small updates. In Figure 4.14, I compare the runtime of ∆Inc, B-∆Inc, non-incremental, and P-∆Inc using 80M tuples as cached input. I consider ∆Inc as the baseline for incremental IEJoin. In Figure 4.14, ∆Inc is twice as fast as the non-incremental IEJoin for an update of size 1. As the size of the update increases, ∆Inc slows down because it processes each update individually. To avoid this overhead, I use P-∆Inc to increase the efficiency of individual updates and B-∆Inc to process updates in a batched fashion. Since the update sizes are relatively small, the runtime of both B-∆Inc and the non-incremental algorithm remains constant across different update sizes. Figure 4.14 shows that B-∆Inc is always twice as fast as the non-incremental algorithm, while P-∆Inc is 30% better than B-∆Inc and three times faster than the non-incremental algorithm with an update of size 1. The processing overhead of P-∆Inc is directly proportional to the update size, and its performance declines as the update size increases. Although P-∆Inc is up to two orders of magnitude faster than ∆Inc on larger update sizes, B-∆Inc is better than P-∆Inc on updates larger than 50. The limitation of P-∆Inc on updates larger than 50 is inherited from the design of the PMA, which works better with individual updates. Note that P-∆Inc has an extra 60% memory overhead, compared to the non-incremental IEJoin, due to the extra empty spaces maintained by the PMA.

Figure 4.14: Runtime of ∆Inc, B-∆Inc, non-incremental, and P-∆Inc on small update sizes

Incremental IEJoin with large updates. In these experiments, I focus on B-∆Inc and σInc since both ∆Inc and P-∆Inc do not work well with large updates. I show in Figures 4.15(a), 4.15(b), and 4.15(c) a comparison between the runtime (without data loading) of B-∆Inc and non-incremental IEJoin with different update and cached input sizes. In Figure 4.15(a), I start with 10M tuples as cached input. B-∆Inc is 40% faster on an update of size 1M tuples and has five times smaller output compared to the non-incremental algorithm. For updates larger than 1M, however, I notice that the performance of B-∆Inc drops significantly as the difference in the output sizes between B-∆Inc and the non-incremental algorithm becomes insignificant. Similar behavior can be observed in Figures 4.15(b) and 4.15(c), with a cached input of 20M and 30M tuples, respectively. B-∆Inc is 50% and 30% faster than the non-incremental algorithm on updates of 1M and 5M tuples, respectively, while its performance drops on updates larger than 5M tuples in both cases. I also notice that B-∆Inc on the 1M-tuple update has a 10 times smaller output in Figure 4.15(b) and a 15 times smaller output in Figure 4.15(c) compared to the non-incremental algorithm.

Figure 4.15: Runtime and output size of non-incremental and B-∆Inc IEJoin on big update sizes; panels: (a) 10M cached input, (b) 20M cached input, (c) 30M cached input

From the above three figures, B-∆Inc clearly shows significant improvement over the non-incremental algorithm (up to 50%) with update sizes that generate significantly smaller output. In this experimental setup, an efficient update size for B-∆Inc does not exceed 25% of the cached input size. I further tested the effect of update sizes on disk reading and memory consumption of B-∆Inc when using 30M tuples as cached input. Based on Figure 4.16(a), B-∆Inc gains up to a 90% improvement in disk reading time compared to the non-incremental algorithm with small update sizes. The downside of the incremental algorithm is that it has 60% higher memory overhead compared to the non-incremental one, caused by the data structures required to enable fast IEJoin updates. I also compare the performance difference between B-∆Inc and σInc in Figure 4.16(b) using 30M tuples as cached input, where the only difference between them is the size of the output. σInc is at most 20% slower than B-∆Inc when the output difference between them is large on the 1M-tuple update. However, the performance gap between B-∆Inc and σInc becomes negligible as the output difference gets smaller on larger update sizes.

Incremental experiments summary. When the update size is smaller than 50, the PMA-based incremental algorithm performs 30% better than the sort-merge incremental algorithm and three times better than the original IEJoin. As I increase the size of the updates, the batched sort-merge-based incremental algorithm becomes more efficient than the PMA-based one. For update sizes that do not exceed 25% of the cached input size, the batched sort-merge-based incremental algorithm is twice as fast as the original algorithm.

4.7.9 Multi-node Experiments

I now evaluate my proposal in a distributed environment using larger datasets.

Scalable IEJoin vs. baseline systems. I should note that I had to run these experiments on a cluster of only 6 compute nodes due to the limit imposed by the free version of the distributed PostgreSQL system. Additionally, in these experiments, I stopped the execution of any system that exceeded 24 hours. I test the scalable IEJoin using the parallel IEJoin (with 6 threads on a single machine while enabling disk caching) and the distributed IEJoin (on 6 compute nodes). Figure 4.17 shows the results of all distributed systems I consider for queries Q1 and Q2. This figure shows again that both versions of my algorithm significantly outperform all baselines; they are on average more than one order of magnitude faster. In particular, I observe that only DPG-GiST could terminate before 24 hours for Q2. The distributed IEJoin completes in half the time required to run GiST indexing alone. Moreover, distributed IEJoin is, as expected, faster than the parallel multi-threaded version. This is because the multi-threaded version has a higher processing overhead due to resource contention. These results show the clear superiority of my algorithm over all baseline systems.

Figure 4.16: Memory and I/O overhead of B-∆Inc, and runtime and output size of B-∆Inc and σInc (30M cached input); panels: (a) memory & I/O overhead, (b) B-∆Inc vs. σInc

Scaling input size. I further push the evaluation of efficiency in a distributed environment with bigger input datasets: from 100M to 500M records with a large result size (Employees and Events), and from 1B to 6B records with a smaller result size (Employees2 and Events2). As I now consider IEJoin only, I run this experiment on my entire cluster of 16 compute nodes. Figure 4.18 shows the runtime results as well as the output sizes. I observe that IEJoin scales gracefully with the input dataset size in both scenarios. I also observe in Figure 4.18(a) that, when the output size is large, the runtime increases accordingly as it is dominated by the materialization of the results. In Figure 4.18(a), Q1 is slower than Q2 as its output is three orders of magnitude larger. When the output size is relatively small, both Q1 and Q2 scale well with increasing input size (see Figure 4.18(b)). Below, I study in more detail the impact of the output size on performance.

Figure 4.17: Distributed IEJoin (100M tuples, 6 nodes/threads); panels: (a) Q1, (b) Q2

Figure 4.18: Distributed IEJoin, 6B tuples, 16 nodes; panels: (a) Employees & Events, (b) Employees2 & Events2

Scaling dataset output size. I test my system's scalability in terms of the output size using two real datasets (MDC and Cloud), as shown in Figure 4.19. To have full control over this experiment, I explicitly limit the output size from 4.3M to 430M for MDC, and from 20.8M to 2050M for Cloud. The figures clearly show that the output size affects runtime; the larger the output size, the longer it takes to produce the results. They also show that materializing a large number of results is costly.

Figure 4.19: Runtime of IEJoin (c = 10); panels: (a) MDC - Q3, (b) Cloud - Q4

Take Figure 4.19(a) for example: when the output size is small (i.e., 4.3M), materializing the results or not makes little difference in runtime. However, when the output size is large (i.e., 430M), materializing the results takes almost two thirds of the entire running time. To run another set of experiments with a much larger output size, I created two variants of Q3 for the MDC dataset by keeping only two of the four predicates (lower selectivity). Figure 4.20 shows the scalability results of these experiments without materialization of the results.

For Q3a, IEJoin produced more than 1,000B records in less than 3,000 seconds.

For Q3b, I stopped the execution after 2 hours with more than 5,000B tuples in the temporary result. This demonstrates the good scalability of my solution.

Speedup and scaleup. I also test the speedup and scaleup efficiency of the distributed IEJoin using the Employees2 dataset and query Q1. Figure 4.21(a) shows that my algorithm achieves near-ideal speedup thanks to the scalability optimizations: IEJoin was only 4%, 3% and 16% slower than the ideal speedup when processing 8B rows on 4, 8 and 16 workers, respectively. Figure 4.21(b) shows the scaleup efficiency of IEJoin as I proportionally increase the cluster size and the input size.

Figure 4.20: Without result materialization (c = 5, 10); panels: (a) Runtime, (b) Output size

Figure 4.21: Speedup (8B rows) & Scaleup on Q1; panels: (a) Speedup, (b) Scaleup

I observe that distributed IEJoin also has good scaleup: on 4 workers (2B rows) and 8 workers (4B rows), it was only 5% and 20% slower than the ideal scaleup. However, due to the increase in dataset size, the sorting overhead in IEJoin becomes larger. This explains why scalable IEJoin, on 16 workers with an 8B-row input, is 46% slower than the ideal scaleup.

Multi-node summary. Similarly to the centralized case, IEJoin outperforms existing baselines by at least one order of magnitude. In particular, I observe that it scales gracefully in terms of input size (up to 6B tuples). This is because my algorithm first joins the metadata, which is orders of magnitude smaller than the actual data. As a result, it shuffles only those data partitions that can potentially produce join results. Typically, IEJoin processes a small number of data partitions.

Chapter 5

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

In this chapter, I present Mizan as a solution to achieve dynamic load balancing in distributed graph computations. First, I analyze different graph algorithm characteristics that can contribute to an imbalanced computation in a Pregel system (Chapter 5.1). After that, I propose a dynamic vertex migration model based on runtime monitoring of vertices to optimize the end-to-end computation (Chapter 5.2). The proposed migration model is fully implemented in C++ as an optimized Pregel system (Mizan) that supports dynamic load balancing and efficient vertex migration (Chapter 5.3). Finally, I deploy Mizan on a local Linux cluster (21 machines) and evaluate its efficacy on a representative number of datasets. I also show the linear scalability of my design by running Mizan on a 1024-CPU IBM Blue Gene/P (Chapter 5.4). Mizan is not intended exclusively to empower BigDansing; it is an independent, general system that adapts to various graph algorithms and input structures without affecting the BSP programming model of Pregel.

5.1 Dynamic Behavior of Algorithms

In the Pregel (BSP) computing model, several factors can affect the runtime performance of the underlying system (shown in Figure 5.1). During a superstep, each vertex is either in an active or inactive state. In an active state, a vertex may be computing, sending messages, or processing received messages.

Figure 5.1: Factors that can affect the runtime in the Pregel framework (vertex response time, time to receive incoming messages, time to send outgoing messages)

Naturally, vertices can experience variable execution times depending on the algorithm they implement or their degree of connectivity. A densely connected vertex, for example, will likely spend more time sending and processing incoming messages than a sparsely connected vertex. The BSP programming model, however, masks this variation by (1) overlapping communication and computation, and (2) running many vertices on the same compute node. Intuitively, as long as the workloads of vertices are roughly balanced across compute nodes, the overall computation time will be minimized.

Counter to intuition, achieving a balanced workload is not straightforward. There are two sources of imbalance: one originating from the graph structure and another from the algorithm behavior. Different applications generate various graph structures, and not all user algorithms behave similarly. In both cases, vertices on some compute nodes can spend a disproportionate amount of time computing, sending or processing messages. In some cases, these vertices can run out of input buffer capacity and start paging, further exacerbating the imbalance.

Existing implementations of Pregel (including Giraph [29], GoldenOrb [30], Hama [31]) focus on providing multiple alternatives for partitioning the graph data. The three common approaches to partitioning the data are hash-based, range-based, or min-cut [33]. Hash- and range-based partitioning methods divide a dataset using a simple heuristic: evenly distribute vertices across compute nodes, irrespective of their edge connectivity. Min-cut based partitioning, on the other hand, considers vertex connectivity and partitions the data such that it places strongly connected vertices close to each other (i.e., on the same cluster). The resulting performance of these partitioning approaches is, however, graph dependent. To demonstrate this variability, I ran a simple and highly predictable PageRank algorithm on different datasets (summarized in Table 5.1) using the three popular partitioning methods. Figure 5.2 shows that none of the partitioning methods consistently outperforms the rest, noting that ParMETIS [93] partitioning cannot be performed on the arabic-2005 graph due to memory limitations.

In addition to the graph structure, the running algorithm can also affect the workload balance across compute nodes. Broadly speaking, graph algorithms can be divided into two categories (based on their communication characteristics across supersteps): stationary and non-stationary.
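Before turning to these two classes, the hash- and range-based heuristics mentioned above can be made concrete with a small sketch. The function names and ID types are illustrative; min-cut (METIS/ParMETIS) partitioning requires a global partitioner and is not shown.

```cpp
#include <cstdint>
#include <functional>

// Illustrative sketch of the two simple partitioning heuristics. Real systems
// (Giraph, Mizan) wrap equivalent logic in their own partitioner interfaces.

// Hash-based: spread vertex IDs across workers irrespective of connectivity.
int HashPartition(std::uint64_t vertex_id, int num_workers) {
  return static_cast<int>(std::hash<std::uint64_t>{}(vertex_id) % num_workers);
}

// Range-based: assign contiguous ID ranges to workers, assuming vertex IDs
// are roughly uniform in [0, max_vertex_id].
int RangePartition(std::uint64_t vertex_id, std::uint64_t max_vertex_id,
                   int num_workers) {
  std::uint64_t range = max_vertex_id / num_workers + 1;
  return static_cast<int>(vertex_id / range);
}
```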

Stationary Graph Algorithms. An algorithm is stationary if its active vertices send and receive the same distribution of messages across supersteps. At the end of a stationary algorithm, all active vertices become inactive (terminate) during the same superstep. Graph algorithms that can be represented as a matrix-vector multiplication (see footnote 1) are usually stationary, including PageRank, diameter estimation and finding weakly connected components.

Footnote 1: The matrix represents the graph adjacency matrix and the vector represents the vertices' values.

Non-stationary Graph Algorithms. A graph algorithm is non-stationary if the destination or size of its outgoing messages changes across supersteps. Such variations can create workload imbalances across supersteps. Examples of non-stationary algorithms include distributed minimal spanning tree construction (DMST), graph queries, graph coloring, and various simulations on social network graphs (e.g., advertisement propagation).

Figure 5.2: The difference in execution time when processing PageRank on different graphs using hash-based, range-based and min-cuts partitioning. Because of its size, arabic-2005 cannot be partitioned using min-cuts (ParMETIS) in the local cluster.

To illustrate the differences between the two classes of algorithms, I compared the runtime behavior of PageRank (stationary) against DMST (non-stationary) when processing the same dataset (LiveJournal1) on a cluster of 21 machines. The input graph was partitioned using a hash function. Figure 5.3 shows the variability in the incoming messages per superstep for all workers. In particular, the variability can span over five orders of magnitude for non-stationary algorithms, and the distribution of incoming messages across machines within a single superstep becomes highly variable. The example in Figure 5.3 shows that a single machine received more than 80% of the total messages received by all machines on supersteps 49, 52 and 55. The remainder of the section describes four popular graph mining algorithms that I use throughout the chapter. They cover both stationary and non-stationary algorithms.


Figure 5.3: The difference between stationary and non-stationary graph algorithms with respect to the incoming messages. Total represents the sum across all workers and Max represents the maximum amount (on a single worker) across all workers.

5.1.1 Example Algorithms

PageRank. PageRank [134] is a stationary algorithm that uses matrix-vector multiplications to calculate the eigenvalues of the graph's adjacency matrix at each iteration (see footnote 2). The algorithm terminates when the PageRank values of all nodes change by less than a defined error during an iteration. At each superstep, all vertices are active; every vertex always receives messages from its in-edges and sends messages to all of its out-edges. All messages have a fixed size (8 bytes), and the computation complexity at each vertex is linear in the number of messages.

Footnote 2: Formally, each iteration calculates v^(k+1) = c A^T v^(k) + (1 − c)/|N|, where v^(k) is the eigenvector of iteration k, c is the damping factor used for normalization, and A is a row-normalized adjacency matrix.
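As an illustration of how such a stationary algorithm looks in a Pregel-style vertex program, here is a hedged sketch of PageRank. The class layout, method names and signatures are illustrative stand-ins inspired by the API names listed in Chapter 5.3 (Compute, SendMessageTo, getValue); the real Mizan/Pregel interfaces differ in their details.

```cpp
#include <vector>

// Hedged sketch of a Pregel-style PageRank vertex program; not Mizan's
// actual interface.
class PageRankVertex {
 public:
  void Compute(const std::vector<double>& incoming, int superstep,
               int total_vertices, int num_out_edges) {
    if (superstep >= 1) {
      double sum = 0.0;
      for (double m : incoming) sum += m;           // ranks from in-edges
      value_ = 0.15 / total_vertices + 0.85 * sum;  // damping factor c = 0.85
    }
    if (superstep < kMaxSupersteps) {
      // Every out-edge receives an equal share of this vertex's current rank.
      SendMessageToAllNeighbors(value_ / num_out_edges);
    } else {
      VoteToHalt();  // all vertices terminate in the same superstep
    }
  }
  double getValue() const { return value_; }

 private:
  static constexpr int kMaxSupersteps = 30;  // or stop on a convergence error
  // Stubs standing in for the framework's messaging and halting primitives.
  void SendMessageToAllNeighbors(double /*rank_share*/) { /* enqueue messages */ }
  void VoteToHalt() { halted_ = true; }
  double value_ = 1.0;  // often initialized to 1/|N| in practice
  bool halted_ = false;
};
```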

Top-K Ranks in PageRank. It is often desirable to perform graph queries, such as using PageRank to find the top k vertex ranks reachable to a vertex after y supersteps. In this case, PageRank runs for x supersteps; at superstep x + 1, each vertex sends its rank to its direct out-neighbors, receives other ranks from its in-neighbors, and stores the highest k received ranks. At supersteps x + 2 to x + y, each vertex sends the top-k ranks stored locally to its direct out-neighbors, and again stores the highest k ranks received from its in-neighbors. The message size generated by each vertex can be between 1 and k, depending on the number of ranks stored at the vertex. If the highest k values maintained by a vertex do not change at a superstep, that vertex votes to halt and becomes inactive until it receives messages from other neighbors. The size of the message sent between two vertices varies depending on the actual ranks. I classify this algorithm as non-stationary because of the variability in the size and number of messages sent and received.

Distributed Minimal Spanning Tree (DMST). Implementing the GHS algorithm of Gallager, Humblet and Spira [135] in the Pregel model requires that, given a weighted graph, every vertex classifies the state of its edges as (1) branch (i.e., belongs to the minimal spanning tree (MST)), (2) rejected (i.e., does not belong to the MST), or (3) basic (i.e., unclassified), in each superstep. The algorithm starts with each vertex as a fragment, then joins fragments until there is only one fragment left (i.e., the MST). Each fragment has an ID and a level. Each vertex updates its fragment ID and level by exchanging messages with its neighbors. The vertex has two states: active and inactive. There is also a list of auto-activated vertices defined by the user. At the early stage of the algorithm, active vertices send messages over the minimum weighted edges (i.e., the edge that has the lowest weight among the other in-edges and out-edges). As the algorithm progresses, more messages are sent across many edges according to the number of fragments identified during the MST computations at each vertex. At the last superstep, each vertex knows which of its edges belong to the minimum weighted spanning tree. The computational complexity for each vertex is quadratic in the number of incoming messages. I classify this algorithm as non-stationary because the flow of messages between vertices is unpredictable and depends on the state of the edges at each vertex.

Simulating Advertisements on Social Networks. To simulate advertisements on a social network graph, each vertex represents the profile of a person containing a list of his/her interests. A small set of selected vertices is identified as sources; these sources send different advertisements to their direct neighbors at each superstep for a predefined number of supersteps. When a vertex receives an advertisement, it is either forwarded to the vertex's neighbors (based on the vertex's interest matrix) or ignored. This algorithm has a computation complexity that depends on the number of active vertices at a superstep and on whether the active vertices communicate with their neighbors or not. It is, thus, a non-stationary algorithm. I have also implemented and evaluated a number of other stationary and non-stationary algorithms, including diameter estimation and finding weakly connected components, and they showed behavior consistent with the results in this chapter.

5.2 Mizan

Mizan is a BSP-based graph processing system that is similar to Pregel, but focuses on efficient dynamic load balancing of both computation and communication across all worker (compute) nodes. Like Pregel, Mizan first reads and partitions the graph data across workers. The system then proceeds as a series of supersteps, each separated by a global synchronization barrier. During each superstep, each vertex processes incoming messages from the previous superstep and sends messages to neighboring vertices (which are processed in the following superstep). Unlike Pregel, Mizan balances its workload by moving selected vertices across workers. Vertex migration is performed when all workers reach a superstep synchronization barrier to avoid violating the computation integrity, isolation and correctness of the BSP compute model. This section describes two aspects of the design of Mizan related to vertex migration: the distributed runtime monitoring of workload characteristics and a distributed migration planner that decides which vertices to migrate. Other design aspects that are not related to vertex migration (e.g., fault tolerance) follow approaches similar to Pregel [20], Giraph [29], GraphLab [92] and GPS [96]; they are omitted for space considerations. Implementation details are described in Chapter 5.3.

Figure 5.4: The statistics monitored by Mizan

5.2.1 Monitoring

Mizan monitors three key metrics for each vertex and maintains a high-level summary of these metrics for each worker node; summaries are broadcast to the other workers at the end of every superstep. As shown in Figure 5.4, the key metrics for every vertex are (1) the number of outgoing messages to other (remote) workers, (2) the total number of incoming messages, and (3) the response time (execution time) during the current superstep:

Outgoing Messages. Only outgoing messages to other vertices in remote workers are counted since the local outgoing messages are rerouted to the same worker and do not incur any actual network cost.

Incoming Messages. All incoming messages are monitored, both those received from remote vertices and those generated locally. This is because the queue size can affect the performance of the vertex (i.e., when its buffer capacity is exhausted, paging to disk is required).

Response Time. The response time of each vertex is measured: it is the time from when a vertex starts processing its incoming messages until it finishes.
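A minimal sketch of the bookkeeping this monitoring implies is shown below; the field names, types and summary layout are chosen purely for illustration and are not Mizan's actual internal data structures.

```cpp
#include <cstdint>
#include <unordered_map>

// Illustrative per-vertex statistics gathered during a superstep and the
// per-worker summary broadcast at the synchronization barrier.
struct VertexStats {
  std::uint64_t remote_outgoing_msgs = 0;  // messages sent to remote workers
  std::uint64_t incoming_msgs = 0;         // local + remote incoming messages
  double response_time_sec = 0.0;          // time spent processing messages
};

struct WorkerSummary {
  std::uint64_t total_remote_outgoing = 0;
  std::uint64_t total_incoming = 0;
  double total_response_time_sec = 0.0;
};

WorkerSummary Summarize(
    const std::unordered_map<std::uint64_t, VertexStats>& per_vertex) {
  WorkerSummary s;
  for (const auto& kv : per_vertex) {
    const VertexStats& v = kv.second;
    s.total_remote_outgoing += v.remote_outgoing_msgs;
    s.total_incoming += v.incoming_msgs;
    s.total_response_time_sec += v.response_time_sec;
  }
  return s;
}
```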

Figure 5.5: Mizan's BSP flow with migration planning

5.2.2 Migration Planning

High values for any of the three metrics above may indicate poor vertex placement, which leads to workload imbalance. As with many constrained optimization problems, optimizing against three objectives is non-trivial and costly. To reduce the optimization search space, Mizan's migration planner finds the strongest cause of workload imbalance among the three metrics and plans the vertex migration accordingly. Since distinct supersteps do not necessarily share the same cause of workload imbalance, Mizan applies the best possible migration plan to each superstep independently and achieves its optimal state incrementally. By design, all workers create and execute the migration plan in parallel, without requiring any centralized coordination. The migration planner starts on every worker at the end of each superstep (i.e., when all workers reach the synchronization barrier), after it receives the summary statistics (described in Chapter 5.2.1) from all other workers. Additionally, the execution of the migration planner is enclosed by a second synchronization barrier, as shown in Figure 5.5, which is necessary to ensure the correctness of the BSP model. Mizan's migration planner executes the following five steps, summarized in Figure 5.6:

1. Identify the source of imbalance.

2. Select the migration objective (i.e., optimize for outgoing messages, incoming messages, or response time).

3. Pair over-utilized workers with under-utilized ones.

4. Select vertices to migrate.

5. Migrate vertices.

STEP 1: Identify the source of imbalance. Mizan detects imbalances across supersteps by comparing the summary statistics of all workers against a normal random distribution and flagging outliers. Specifically, at the end of a superstep, Mizan computes the z-score (see footnote 3) for all workers. If any worker has a z-score greater than z_def, Mizan's migration planner flags the superstep as imbalanced. I found that z_def = 1.96, the commonly recommended value [136], allows for natural workload fluctuations across workers. I have experimented with different values of z_def and validated the robustness of my choice.
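A minimal sketch of this check follows. It uses the z-score definition from footnote 3; the standard-deviation denominator there is reconstructed from the analogous footnotes 4 and 6, so treat that detail as an assumption rather than Mizan's verbatim formula.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Sketch of STEP 1: flag a superstep as imbalanced when some worker's runtime
// deviates too far from the maximum runtime, using z_def = 1.96 as in the text.
bool SuperstepImbalanced(const std::vector<double>& worker_runtimes,
                         double z_def = 1.96) {
  const std::size_t n = worker_runtimes.size();
  if (n < 2) return false;
  const double max_rt =
      *std::max_element(worker_runtimes.begin(), worker_runtimes.end());
  double mean = 0.0;
  for (double x : worker_runtimes) mean += x;
  mean /= n;
  double var = 0.0;
  for (double x : worker_runtimes) var += (x - mean) * (x - mean);
  const double stddev = std::sqrt(var / n);
  if (stddev == 0.0) return false;  // perfectly balanced superstep
  for (double x : worker_runtimes) {
    if (std::fabs(x - max_rt) / stddev > z_def) return true;  // outlier found
  }
  return false;
}
```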

STEP 2: Select the migration objective. Mizan compares the effect of sending messages on execution time against the effect of processing incoming messages on execution time to find the dominating factor that affects the computation balance. That is, each worker (identically) uses the summary statistics to compute the correlation between outgoing messages and computation time, and also the correlation between incoming messages and computation time. The correlation scores are used to select the objective to optimize for: balance outgoing messages, balance incoming messages, or balance computation time. The larger the correlation between the outgoing messages and the computation time, the higher the probability that outgoing messages are the dominating factor; the same holds for the incoming messages. The default objective is to balance the computation time.

Footnote 3: The z-score Wz_i = |x_i − max(x)| / (standard deviation), where x_i is the runtime of worker i and max(x) is the largest runtime across all workers.

Figure 5.6: Summary of Mizan's Migration Planner

If the outgoing or incoming messages are highly correlated with the response time, then Mizan chooses the objective with the highest correlation score. The correlation scores are calculated using two tests: clustering and sorting. In the clustering test CT, elements from the set of worker response times S_Time and the statistics set S_Stat (either incoming or outgoing message statistics) are divided into two categories based on their z-scores (see footnote 4): positive and negative clusters. Elements from a set S are categorized as positive if their z-scores are greater than some z_def. Elements categorized as negative have a z-score less than −z_def. Elements in the range (−z_def, z_def) are ignored. The score of the clustering test is calculated based on the total count of elements that belong to the same worker in the positive and negative sets, as shown by the following equation:

CT_score = [ count(S^+_Time ∩ S^+_Stat) + count(S^−_Time ∩ S^−_Stat) ] / [ count(S_Time) + count(S_Stat) ]

The sorting test ST consists of sorting both sets, S_Time and S_Stat, and calculating the z-scores of all elements. Elements with a z-score in the range (−z_def, z_def) are excluded from both sorted sets. For the two sorted sets S_Time and S_Stat of size n and i ∈ [0..n − 1], a mismatch is defined if the i-th elements from both sets are related to different workers. That is, if S^i_Time is the i-th element of S_Time and S^i_Stat is the i-th element of S_Stat, a mismatch occurs when S^i_Time ∈ Worker_x while S^i_Stat ∉ Worker_x. A match, on the other hand, occurs when S^i_Time ∈ Worker_x and S^i_Stat ∈ Worker_x. The score of the sorting test is given by:

ST_score = count(matches) / ( count(matches) + count(mismatches) )

The correlation of the sets S_Time and S_Stat is reported as strongly correlated if (CT_score + ST_score)/2 ≥ 1, weakly correlated if 1 > (CT_score + ST_score)/2 ≥ 0.5, and uncorrelated otherwise.

Footnote 4: The z-score of an element of S_Time or S_Stat is |x_i − x_avg| / (standard deviation), where x_i is the runtime of worker i or its statistic, respectively.

STEP 3: Pair over-utilized workers with under-utilized ones. Each over-utilized worker that needs to migrate vertices out is paired with a single, unique under-utilized worker. In other words, an under-utilized worker is paired with at most one over-utilized worker. Only workers with a z-score greater than z_def are marked as over-utilized. While complex pairings are possible, I chose a design that is efficient to execute, especially since the exact number of vertices that different workers plan to migrate is not globally known. Similar to the previous two steps in the migration plan, this step is executed by each worker without explicit global synchronization. Using the summary statistics for the chosen migration objective (in Step 2), each worker creates an ordered list of all workers. For example, if the objective is to balance outgoing messages, then the list orders workers from highest to lowest outgoing messages. The resulting list thus places over-utilized workers at the top and the least utilized workers at the bottom. The pairing function then successively matches workers from opposite ends of the ordered list. As depicted in Figure 5.7, if the list contains n elements (one for each worker), then the worker at position i is paired with the worker at position n − i. In cases where a worker does not have enough memory to receive any vertices or has no match with an over-utilized one, the worker is marked as unavailable in the list.
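A compact sketch of this pairing step is shown below, under the assumption that the list is 0-indexed (so position i is matched with position n − 1 − i) and that the filter restricting senders to workers whose z-score exceeds z_def is applied beforehand; worker IDs and the availability flag are illustrative.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Sketch of STEP 3: order workers by the chosen objective metric (descending)
// and successively pair opposite ends of the list.
struct WorkerLoad {
  int worker_id;
  double metric;   // e.g., outgoing messages or response time
  bool available;  // false if this worker cannot receive vertices
};

std::vector<std::pair<int, int>> PairWorkers(std::vector<WorkerLoad> workers) {
  std::sort(workers.begin(), workers.end(),
            [](const WorkerLoad& a, const WorkerLoad& b) {
              return a.metric > b.metric;  // most loaded workers first
            });
  std::vector<std::pair<int, int>> pairs;  // (sender, receiver) worker IDs
  std::size_t i = 0;
  std::size_t j = workers.empty() ? 0 : workers.size() - 1;
  while (i < j) {
    if (!workers[j].available) { --j; continue; }  // skip unavailable receivers
    pairs.emplace_back(workers[i].worker_id, workers[j].worker_id);
    ++i;
    --j;
  }
  return pairs;
}
```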

STEP 4: Select vertices to migrate. The number of vertices to be selected from an over-utilized worker depends on the difference between the selected migration objective statistics of that worker and of its paired worker. Assume that w_x is a worker that needs to migrate out a number of vertices and is paired with the receiver w_y. The load that should be migrated to the under-utilized worker is defined as ∆_xy, which equals half the difference in the statistics of the migration objective between the two workers. The selection criterion for the vertices depends on the distribution of the statistics of the migration objective, where the statistic of each vertex is compared against a normal distribution. A vertex is selected if it is an outlier (see footnote 5), i.e., if its z-score V^i_zstat is greater than z_def (see footnote 6). For example, if the migration objective is to balance the number of remote outgoing messages, vertices with many remote outgoing messages are selected to migrate to the under-utilized worker. The sum of the statistics of the selected vertices is denoted by ΣV_stat, and the selection should minimize |∆_xy − ΣV_stat| to ensure balance between w_x and w_y in the next superstep. If not enough outlier vertices are found, a random set of vertices is selected to minimize |∆_xy − ΣV_stat|. This step reduces the migration communication and processing cost by minimizing the number of vertices required to migrate from over-utilized workers to under-utilized ones. The number of migrating vertices depends on the workload distribution across vertices, where the maximum number of migrating vertices, in the worst-case scenario, is at most half the number of vertices in the over-utilized worker.

Footnote 5: An outlier vertex requires more execution time compared to the average vertex on the same worker.
Footnote 6: The z-score V^i_zstat = |x_i − avg(x)| / (standard deviation), where x_i is the statistic of the migration objective of vertex i; a vertex is selected as an outlier when this z-score is greater than z_def.

Figure 5.7: Matching senders (workers with vertices to migrate) with receivers using their summary statistics
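A sketch of the Step 4 selection idea: consider outlier vertices first and stop once the accumulated statistic gets as close as possible to ∆_xy. The field names and the greedy stopping rule are illustrative simplifications (the cap at half the worker's vertices is omitted); this is not Mizan's exact procedure.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative sketch of STEP 4: greedily pick vertices whose accumulated
// objective statistic approaches delta_xy (half the load difference between
// the paired workers), preferring outlier vertices.
struct VertexLoad {
  std::uint64_t vertex_id;
  double stat;  // statistic of the chosen migration objective
};

std::vector<std::uint64_t> SelectVerticesToMigrate(std::vector<VertexLoad> v,
                                                   double z_def,
                                                   double delta_xy) {
  if (v.empty()) return {};
  double mean = 0.0;
  for (const VertexLoad& x : v) mean += x.stat;
  mean /= v.size();
  double var = 0.0;
  for (const VertexLoad& x : v) var += (x.stat - mean) * (x.stat - mean);
  const double sd = std::sqrt(var / v.size());

  // Outliers (z-score above z_def) first, heavier vertices before lighter ones.
  std::stable_sort(v.begin(), v.end(),
                   [&](const VertexLoad& a, const VertexLoad& b) {
                     const bool oa = sd > 0 && (a.stat - mean) / sd > z_def;
                     const bool ob = sd > 0 && (b.stat - mean) / sd > z_def;
                     if (oa != ob) return oa;
                     return a.stat > b.stat;
                   });

  std::vector<std::uint64_t> selected;
  double accumulated = 0.0;
  for (const VertexLoad& x : v) {
    // Stop if adding this vertex would move us farther from delta_xy.
    if (std::fabs(delta_xy - (accumulated + x.stat)) >=
        std::fabs(delta_xy - accumulated)) {
      break;
    }
    accumulated += x.stat;
    selected.push_back(x.vertex_id);
  }
  return selected;
}
```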

STEP 5: Migrate vertices. After the vertex selection process, the migrating workers start sending the selected vertices while other workers wait at the migration barrier. A migrating worker starts sending the selected set of vertices to its unique target worker, where each vertex is encoded into a stream that includes the vertex ID, state, edge information and the received messages it will process. Once a vertex stream is successfully sent, the sending worker deletes the sent vertices so that it does not run them in the next superstep. The receiving worker, on the other hand, receives vertices (together with their messages) and prepares to run them in the next superstep. The next superstep is started once all workers finish migrating vertices and reach the migration barrier. The complexity of the migration process is directly related to the size of vertices being migrated.

5.3 Implementation

Mizan consists of four modules that compose the main building blocks of the graph processing system proposed in this thesis, shown in Figure 5.8: the BSP Graph Processor, Storage Manager, Communicator, and Migration Planner. The BSP Graph Processor implements the Pregel APIs, consisting primarily of the Compute class, and the SendMessageTo, GetOutEdgeIterator and getValue methods. The BSP Processor operates on the data structures of the graph and executes the user's algorithm. It also performs barrier synchronization with other workers at the end of each superstep. The Storage Manager module maintains access atomicity and correctness of the graph's data, and maintains the data structures for the message queues. Graph data can be read from and written to either HDFS or local disks, depending on how Mizan is deployed. HDFS is also used in Mizan as a communication channel to BigDansing for exchanging violation graphs. The Communicator module uses MPI to enable communication between workers; it also maintains distributed vertex ownership information. Finally, the Migration Planner transparently maintains the dynamic workload balance across superstep barriers.

Mizan allows the user's code to manipulate the graph connectivity by adding and removing vertices and edges at any superstep. It also guarantees that all graph mutation commands issued at superstep x are executed at the end of the same superstep and before the BSP barrier, as illustrated in Figure 5.5. Therefore, vertex migrations performed by Mizan do not conflict with the user's graph mutations, and Mizan always considers the most recent graph structure for migration planning. When implementing Mizan, I wanted to avoid having a centralized controller. Overall, the BSP (Pregel) model naturally lends itself to a decentralized implementation. There were, however, three key challenges in implementing a distributed control plane that supports fine-grained vertex migration. The first challenge was maintaining vertex ownership so that vertices can be freely migrated across workers. This is different from existing approaches (e.g., Pregel, Giraph, and GoldenOrb), which operate on a much coarser granularity (clusters of vertices) to enable scalability. The second challenge was allowing fast updates to the vertex ownership information as vertices get migrated. The third challenge was minimizing the cost of migrating vertices with large data structures. In this section, I discuss the implementation details around these three challenges, which allow Mizan to achieve its effectiveness and scalability.


Figure 5.8: Architecture of Mizan in light of the proposed graph processing system

5.3.1 Vertex Ownership

With huge datasets (graphs at the scale of a billion vertices), Mizan workers cannot maintain the management information for all vertices in the graph. Management information includes the collected statistics for each vertex (described in Chapter 5.2.1) and the location (ownership) of each vertex. While per-vertex monitoring statistics are only used locally by the worker, vertex ownership information is needed by all workers. When vertices send messages, workers need to know the destination address of each message. With frequent vertex migration, updating the location of the vertices across all workers can easily create a communication bottleneck. To overcome this challenge, I use a distributed hash table (DHT) [137] to implement a distributed lookup service that scales to thousands of workers. The DHT implementation allows Mizan to distribute the overhead of looking up and updating vertex locations across all workers. The DHT stores a set of (key, value) pairs, where the key represents a vertex ID and the value represents its current physical location. Each vertex is assigned a home worker. The role of the home worker is to maintain the current location of the vertex. A vertex can physically exist on any worker, including its home worker. The DHT uses a globally defined hash function that maps keys to their associated home workers, such that home_worker = location_hash(key).


Figure 5.9: Migrating vertex vi from Worker 1 to Worker 2, while updating its DHT home worker (Worker 3)

During a superstep, when a (source) vertex sends a message to another (target) vertex, the message is passed to the Communicator. If the target vertex is located on the same worker, the message is rerouted back to the appropriate local queue. Otherwise, the source worker uses the location hash function to locate and query the home worker of the target vertex. The home worker responds with the actual physical location of the target vertex. The source worker finally sends the queued message to the current worker of the target vertex. It also caches the physical location to minimize future lookups.
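A minimal sketch of this home-worker scheme follows; the hash function, the cache, and the remote-query stub are illustrative placeholders (the real Communicator performs this exchange over MPI).

```cpp
#include <cstdint>
#include <unordered_map>

// Sketch of the DHT-based vertex location service: a globally known hash
// function maps a vertex ID to its home worker, which stores the vertex's
// current physical location.
class VertexLocator {
 public:
  VertexLocator(int num_workers, int my_worker_id)
      : num_workers_(num_workers), my_worker_id_(my_worker_id) {}

  // Globally defined mapping: home_worker = location_hash(vertex_id).
  int HomeWorker(std::uint64_t vertex_id) const {
    return static_cast<int>(vertex_id % num_workers_);
  }

  // Resolve where a vertex currently lives, caching the answer locally.
  int Locate(std::uint64_t vertex_id) {
    auto it = cache_.find(vertex_id);
    if (it != cache_.end()) return it->second;
    const int home = HomeWorker(vertex_id);
    const int physical =
        (home == my_worker_id_)
            ? local_table_.at(vertex_id)          // I am the home worker
            : AskHomeWorker(home, vertex_id);     // remote query (stubbed)
    cache_[vertex_id] = physical;
    return physical;
  }

  // Called on the home worker when a vertex migrates to a new physical worker.
  void UpdateLocation(std::uint64_t vertex_id, int new_worker) {
    local_table_[vertex_id] = new_worker;
  }

 private:
  int AskHomeWorker(int /*home_worker*/, std::uint64_t vertex_id) {
    return HomeWorker(vertex_id);  // stub: real code performs an MPI exchange
  }
  int num_workers_;
  int my_worker_id_;
  std::unordered_map<std::uint64_t, int> local_table_;  // home-worker entries
  std::unordered_map<std::uint64_t, int> cache_;        // cached lookups
};
```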

5.3.2 DHT Updates After Vertex Migration

Figure 5.9 depicts the vertex migration process. When a vertex v migrates between two workers, the receiving worker sends the new location of v to the home worker of v. The home worker, in turn, sends an update message to all workers that have previously asked for (and, thus, potentially cached) the location of v. Since Mizan migrates vertices in the barrier between two supersteps, all workers that have cached the location of the migrating vertex will receive the updated physical location from the home worker before the start of the new superstep. If for any reason a worker did not receive the updated location, the messages will be sent to the last known physical location of v. The receiving worker, which no longer owns the vertex, will simply buffer the incorrectly routed messages, ask for the new location of v, and reroute the messages to the correct worker.

Figure 5.10: Delayed vertex migration

5.3.3 Migrating Vertices with Large Message Size

Migrating a vertex to another worker requires moving its queued messages and its entire state (which includes its ID, value, and neighbors). Especially when processing large graphs, a vertex can have a significant number of queued messages, which are costly to migrate. To minimize this cost, Mizan migrates large vertices using a delayed migration process that spreads the migration over two supersteps. Instead of physically moving the vertex with its large message queue, delayed migration only moves the vertex's information and the ownership to the new worker. Assuming w_old is the old owner and w_new is the new one, w_old continues to process the migrating vertex v in the next superstep, SS_t+1. w_new receives the messages for v, which will be processed at the following superstep, SS_t+2. At the end of SS_t+1, w_old sends the new value of v, calculated at SS_t+1, to w_new and completes the delayed migration.

G(N,E)           |N|          |E|
kg1              1,048,576    5,360,368
kg4m68m          4,194,304    68,671,566
web-Google       875,713      5,105,039
LiveJournal1     4,847,571    68,993,773
hollywood-2011   2,180,759    228,985,632
arabic-2005      22,744,080   639,999,458

Table 5.1: Datasets. N and E denote nodes and edges, respectively; graphs with the prefix kg are synthetic.

Note that migration planning is disabled for superstep SS_t+1 after applying delayed migration to ensure the consistency of the migration plans. An example is shown in Figure 5.10, where Vertex 7 is migrated using delayed migration. As Vertex 7 migrates, the ownership of Vertex 7 is moved to Worker_new, and messages sent to Vertex 7 at SS_t+1 are addressed to Worker_new. At the barrier of SS_t+1, Worker_old sends the edge information and state of v to Worker_new and starts SS_t+2 with v fully migrated to Worker_new. With delayed migration, the consistency of the computation is maintained without introducing additional network cost for vertices with large message queues. Since delayed migration uses two supersteps to complete, Mizan may need more supersteps before it converges to a balanced state.

5.4 Evaluation

I implemented Mizan using C++ and MPI and compared it against Giraph [29], a Pregel clone implemented in Java on top of Hadoop. I ran my experiments on a local cluster of 21 machines equipped with a mix of i5 and i7 processors and 16GB RAM per machine. I also used an IBM Blue Gene/P supercomputer with 1024 PowerPC-450 CPUs, each with 4 cores at 850MHz and 4GB RAM. I downloaded publicly available datasets from the Stanford Network Analysis Project (see footnote 7) and from The Laboratory for Web Algorithmics (LAW) [138, 139]. I also generated synthetic datasets using the Kronecker [140] generator that models the structure of real-life networks. The details are shown in Table 5.1. To better isolate the effects of dynamic migration on system performance, I implemented three variations of Mizan: Static, Work Stealing (WS), and Mizan. Static Mizan disables any dynamic migration and uses either hash-based, range-based or min-cuts graph pre-partitioning (rendering it similar to Giraph). Work Stealing (WS) Mizan is my attempt to emulate Pregel's coarse-grained dynamic load balancing behavior (described in [20]). Finally, Mizan is my framework that supports dynamic migration as described in Chapters 5.2 and 5.3.

Footnote 7: http://snap.stanford.edu

I evaluate Mizan across three dimensions: partitioning scheme, graph structure, and variable runtime behavior using stationary and non-stationary algorithms. I partition the datasets using three schemes: hash-based, range-based, and METIS partitioning. I run my experiments on both social and random graphs. Finally, I experiment with the algorithms described in Chapter 5.1.1.

5.4.1 Giraph vs. Mizan

I picked Giraph for its popularity as an open-source Pregel framework with broad adoption. I only compared Giraph to Static Mizan, which allows me to evaluate the base (non-dynamic) implementation; Mizan's dynamic migration is evaluated later in this section. To further equalize Giraph and Static Mizan, I disabled the fault tolerance feature of Giraph to eliminate any additional overhead resulting from frequent snapshots during runtime. In this section, I report results using PageRank (a stationary and well-balanced algorithm) on different graph structures. In Figure 5.11, I ran both frameworks on social network and random graphs.

Figure 5.11: Comparing Static Mizan vs. Giraph using PageRank on social network and random graphs

As shown in Figure 5.11, Static Mizan outperforms Giraph by a large margin, up to four times faster than Giraph on kg4m68m, which contains around 70M edges. I also compared both systems when increasing the number of nodes in random-structure graphs. As shown in Figure 5.12, Static Mizan consistently outperforms Giraph on all datasets and is up to three times faster with 16 million vertices. While the execution time of both frameworks increases linearly with graph size, the rate of increase (the slope of the curve) for Giraph (0.318) is steeper than that for Mizan (0.09), indicating that Mizan also achieves better scalability. The experiments in Figures 5.11 and 5.12 show that Giraph's implementation is inefficient. It is a non-trivial task to discover the source of inefficiency in Giraph since it is tightly coupled with Hadoop. I suspect that part of the inefficiency is due to the initialization cost of the Hadoop jobs and the high overhead of communication. Other factors, like the choice of internal data structures and the memory footprint, might also play a role in this inefficiency.

Figure 5.12: Comparing Mizan vs. Giraph using PageRank on regular random graphs; the graphs are uniformly distributed, each with around 17M edges

Figure 5.13: Comparing Static Mizan and Work Stealing (Pregel clone) vs. Mizan using PageRank on a social graph (LiveJournal1). The shaded part of each column represents the algorithm runtime, while the unshaded part represents the initial partitioning cost of the input graph.

5.4.2 Effectiveness of Dynamic Vertex Migration

Given the large performance difference between Static Mizan and Giraph, I exclude Giraph from further experiments and focus on isolating the effects of dynamic migration on the overall performance of the system.

Figure 5.14: Comparing Mizan's migration cost with the algorithm's runtime at each superstep using PageRank on a social graph (LiveJournal1) starting with range-based partitioning. The points on the superstep runtime trend represent Mizan's migration objective at that specific superstep.

Figure 5.13 shows the results for running PageRank on a social graph with various partitioning schemes. I notice that both hash-based and METIS partitioning achieve a balanced workload for PageRank, such that dynamic migration did not improve the results. In comparison, range-based partitioning resulted in poor graph partitioning. In this case, I observe that Mizan (with dynamic migration) was able to reduce the execution time by approximately 40% compared to the static version. I have also evaluated the diameter estimation algorithm, which behaves the same as PageRank but exchanges larger messages. Mizan exhibited similar behavior with diameter estimation; the results are omitted for space considerations.

Figure 5.14 shows how Mizan's dynamic migration was able to optimize running PageRank starting with range-based partitioning. The figure shows that Mizan's migration reduced both the variance in the workers' runtimes and the superstep runtime. The same figure also shows that Mizan's migration alternated the optimization objective between the number of outgoing messages and the vertex response time, illustrated as points on the superstep runtime trend.

Figure 5.15: Comparing both the total migrated vertices and the maximum migrated vertices by a single worker for PageRank on LiveJournal1 starting with range-based partitioning. The migration cost at each superstep is also shown.

By looking at both Figures 5.14 and 5.15, I observe that the convergence of Mizan's dynamic migration is correlated with the algorithm's runtime reduction. For the PageRank algorithm with range-based partitioning, Mizan requires 13 supersteps to reach an acceptably balanced workload. Since PageRank is a stationary algorithm, Mizan's migration converged quickly; I expect that it would require more supersteps to converge for other algorithms and datasets. In general, Mizan requires multiple supersteps before it balances the workload. This also explains why running Mizan with range-based partitioning is less efficient than running Mizan with METIS or hash-based partitioning.

As described in Chapter 5.1.1, Top-K PageRank adds variability to the communication among the graph nodes. As shown in Figure 5.16, such variability in the messages exchanged leads to only minor variation in the hash-based and METIS execution times. Similarly, with range partitioning, Mizan had better performance than the static version. The slight performance improvement arises from the fact that the base algorithm (PageRank) dominates the execution time. If a large number of queries were performed, the improvement would be more significant.

To study the effect of algorithms with highly variable messaging patterns, I evaluated Mizan using two algorithms: DMST and the advertisement propagation simulation. In both cases, I used METIS to pre-partition the graph data. METIS partitioning groups strongly connected subgraphs into clusters, thus minimizing the global communication between clusters.

Figure 5.16: Comparing Static Mizan and Work Stealing (Pregel clone) vs. Mizan using Top-K PageRank on a social graph (LiveJournal1). The shaded part of each column represents the algorithm runtime, while the unshaded part represents the initial partitioning cost of the input graph.

In DMST, as discussed in Chapter 5.1.1, the computation complexity increases with the vertex connectivity degree. Because of the quadratic complexity of the computation as a function of the connectivity degree, some workers suffer from an extensive workload while others have a light workload. Such imbalance in the workload leads to the results shown in Figure 5.17. Mizan was able to reduce the imbalance in the workload, resulting in a significant drop in execution time (a two orders of magnitude improvement). Even when using my version of Pregel's load balancing approach (called work stealing), Mizan is roughly eight times faster.

Similar to DMST, the behavior of the advertisement propagation simulation algorithm varies across supersteps. In this algorithm, a dynamic set of graph vertices communicates heavily with each other, while the others have little or no communication. In every superstep, the communication behavior differs depending on the state of the vertex. Therefore, it creates an imbalance across the workers in every superstep. Even METIS partitioning is ill-suited in such cases since the workers' load changes dynamically at runtime. As shown in Figure 5.17, similar to DMST, Mizan is able to reduce such an imbalance, resulting in approximately a 200% speedup when compared to the work stealing and static versions.

In Figures 5.18 and 5.19, I show the behavior of Mizan's dynamic migration on a non-stationary workload with bursts, using the advertisement propagation algorithm with different user parameters. Due to the variability of the implemented algorithm, Mizan's migration planner was not able to converge to a balanced state. Instead, Mizan predicted the runtime imbalance caused by workload bursts (such as on supersteps 10 and 19) and migrated vertices ahead of time to minimize the effect of the bursts. In this experiment, Mizan was able to reduce the workload by around 20% compared to the static version and 30% compared to the Work Stealing version.

300 Static 250 Work Stealing Mizan 200

150

100 Runtime (Min) Runtime 50

0 AdvertismentDMST

Figure 5.17: Comparing Work stealing (Pregel clone) vs. Mizan using DMST and advertisement propagation Simulation on a metis partitioned social graph (LiveJournal1) 178

5 1400 Migration Cost Total Migrated Vertices 1200 4 Max Vertices Migrated 1000 by Single Worker 3 800 600 2 400 1 200 Migration Cost (Minutes) Migrated Vertices (x1000) Migrated Vertices 0 0 5 10 15 20 25 30 Supersteps

Figure 5.18: Comparing both the total migrated vertices and the maximum number of vertices migrated by a single worker for advertisement propagation on LiveJournal1, starting with METIS partitioning. The migration cost at each superstep is also shown.


Figure 5.19: Comparing static Mizan and Work Stealing (Pregel clone) vs. Mizan using advertisement propagation on a social graph (LiveJournal1), starting with METIS partitioning. The points on Mizan's trend represent Mizan's migration objective at that specific superstep.

5.4.3 Overhead of Vertex Migration

To analyze migration cost, I measured the time for various performance metrics of Mizan. I used the PageRank algorithm with range-based partitioning, and the advertisement propagation simulation with METIS partitioning, on the LiveJournal1 dataset using 21 workers. I chose range-based partitioning for the stationary PageRank algorithm because it provides the worst data distribution according to the previous experiments and therefore triggers frequent vertex migrations to balance the workload.

Algorithm                                PageRank      Ad Propagation
Total runtime (s)                        707           7472
Data read time to memory (s)             72            68
Migrated vertices                        1,062,559     7,248,644
Total migration time (s)                 75            511
Average migrate time per vertex (µs)     70.5

Table 5.2: Overhead of Mizan's migration process when compared to the total runtime, using range partitioning with PageRank and METIS partitioning with the advertisement propagation simulation on the LiveJournal1 graph.

Table 5.2 reports the average cost of migrating a vertex as 70.5 µs. On the LiveJournal1 dataset, Mizan paid at most a 9% penalty of the total runtime to balance the workload of both PageRank and advertisement propagation, transferring over 1M vertices with the former algorithm and 7M vertices with the latter. As shown earlier in Figure 5.13, PageRank resulted in a 40% saving in computation time compared to Static Mizan. Moreover, Figures 5.14 and 5.15 compare the algorithm runtime and the migration cost at each superstep; the migration cost is at most 13% (at superstep 2) and on average 6% for all supersteps that included a migration phase. Figures 5.18 and 5.19 compare the algorithm runtime and the migration cost at each superstep for advertisement propagation. Figure 5.18 shows that only a small number of vertices were active during execution (between supersteps 1 and 7) before bursting at superstep 10, which explains the huge cost difference between the algorithm runtime and the migration overhead shown in Figure 5.19. Despite the huge migration cost, Mizan was able to reduce the end-to-end computation runtime by around 20% compared to Static Mizan.
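The 70.5 µs figure follows from plain arithmetic over the values reported in Table 5.2. The short Scala sketch below (not Mizan instrumentation code, just the table's numbers) recovers it from the total migration time and the number of migrated vertices across both runs.

object MigrationCostSketch {
  def main(args: Array[String]): Unit = {
    // values taken from Table 5.2: PageRank run, then ad propagation run
    val migratedVertices = Seq(1062559L, 7248644L)
    val migrationTimeSec = Seq(75.0, 511.0)
    val avgMicrosPerVertex = migrationTimeSec.sum * 1e6 / migratedVertices.sum
    println(f"average migration cost = $avgMicrosPerVertex%.1f microseconds per vertex")
  }
}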

5.4.4 Scalability of Mizan

I tested the scalability of Mizan on the Linux cluster as shown in Table 5.3. I used two compute nodes as my base reference, as a single node was too small when running the dataset (hollywood-2011), causing significant paging activity.

Linux Cluster (hollywood-2011)        Blue Gene/P (arabic-2005)
Processors    Runtime (m)             Processors    Runtime (m)
2             154                     64            144.7
4             79.1                    128           74.6
8             40.4                    256           37.9
16            21.5                    512           21.5
                                      1024          17.5

Table 5.3: Scalability of Mizan on a 16-machine Linux cluster (hollywood-2011 dataset) and on the IBM Blue Gene/P supercomputer (arabic-2005 dataset).
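The speedup curves in Figures 5.20 and 5.21 follow directly from the runtimes in Table 5.3. The Scala sketch below is plain arithmetic over those values, assuming speedup is measured against each platform's smallest configuration (2 nodes on the Linux cluster, 64 nodes on Blue Gene/P).

object SpeedupSketch {
  // speedup relative to the first (smallest) configuration in each series
  def speedups(runtimes: Seq[(Int, Double)]): Seq[(Int, Double)] = {
    val base = runtimes.head._2
    runtimes.map { case (nodes, minutes) => (nodes, base / minutes) }
  }
  def main(args: Array[String]): Unit = {
    val linux    = Seq(2 -> 154.0, 4 -> 79.1, 8 -> 40.4, 16 -> 21.5)
    val blueGene = Seq(64 -> 144.7, 128 -> 74.6, 256 -> 37.9, 512 -> 21.5, 1024 -> 17.5)
    println(speedups(linux))     // 16 nodes: about 7.2x over the 2-node baseline
    println(speedups(blueGene))  // 1024 nodes: about 8.3x over the 64-node baseline
  }
}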


Figure 5.20: Speedup on a 16-machine Linux cluster using PageRank on the hollywood-2011 dataset.

As Figure 5.20 shows, Mizan scales linearly with the number of workers. I was also interested in performing large scale-out experiments, well beyond what can be achieved on public clouds. Since Mizan is written in C++ and uses MPI for message passing, it was easily ported to IBM's Blue Gene/P supercomputer. Once ported, I natively ran Mizan on 1024 Blue Gene/P compute nodes. The results are shown in Table 5.3. I ran the PageRank algorithm on a huge graph (arabic-2005) that contains 639M edges. As shown in Figure 5.21, Mizan scales linearly from 64 to 512 compute nodes and then starts to flatten out as I increase to 1024 compute nodes.


Figure 5.21: Speedup on the IBM Blue Gene/P supercomputer using PageRank on the arabic-2005 dataset.

The flattening was expected: with an increased number of cores, compute nodes spend more time communicating than computing. I expect that as the number of CPUs continues to increase, most of the time will be spent communicating, which effectively breaks the BSP model of overlapping communication and computation.

Chapter 6

Concluding Remarks

In this chapter, I summarize my contributions in this dissertation and outline prospective future research directions.

6.1 Summary of Contributions

Data cleansing is a big data problem in both research and industry, where the non-automated effort required to fix dirty data slows down their advancement. A 2016 survey by Forbes [3] indicates that data scientists spend 60% of their time cleaning and organizing data. When processing even larger datasets, handling sophisticated mechanisms to discover errors, or managing large arbitrary errors, the overhead of data cleaning may reach up to 98% of the data scientists' time. To solve these problems, I proposed in this dissertation techniques and optimizations to scale error detection and repair.

First, I developed BigDansing, a system for fast and scalable big data cleansing. It enables transparent scaling of error detection through a set of user-friendly logical operators. In other words, users may define their cleaning jobs in BigDansing without worrying about their efficiency. BigDansing takes these logical operators as input and transforms them into a physical execution plan that runs on top of general-purpose scalable systems. To maximize the effectiveness of the execution plans, BigDansing extensively optimizes them through rule-based transformations using a set of wrappers and enhancers. When errors are detected, they are transformed into a violation graph (or a violation hypergraph), and a distributed graph processing system is then used to efficiently split it into independent subgraphs.

Each subgraph is the input to a unique instance of the repair algorithm, and BigDansing runs the multiple instances of the repair algorithm in parallel to increase repair efficiency. My experiments demonstrated the superiority of BigDansing over baseline systems, for different rules on both real and synthetic datasets, with up to two orders of magnitude improvement in execution time. Moreover, BigDansing is scalable: it can detect violations on 300GB of data (1907M rows) and produce 1.2TB (13 billion) of violations in a few hours.

Second, I proposed IEJoin, a novel algorithm designed to efficiently handle sophisticated error discovery approaches that require inequality joins. It relies on auxiliary data structures to exploit data locality in inequality join computations with a small memory footprint. As a result, IEJoin achieves orders of magnitude computation speedup in comparison to traditional approaches. I also introduced selectivity estimation for this algorithm to support multi-predicate and multi-way inequality join queries. Moreover, I devised incremental versions of IEJoin to deal with continuous queries on changing data. I also successfully implemented IEJoin in PostgreSQL and Spark SQL as a new join operator, and in BigDansing as a new enhancer that replaces OCJoin. My experiments demonstrated that IEJoin is superior to baseline systems: it is 1.5 to 3 orders of magnitude faster than commercial and open-source centralized databases, and at least 2 orders of magnitude faster than the original Spark SQL and OCJoin. While the algorithm does not break the theoretical quadratic time bound, my experiments show performance that is proportional to the size of the output.
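For intuition only, the following self-contained Scala sketch shows the sorted-order plus bit-array idea on a single machine for the self-join predicate pair s.x < r.x and s.y > r.y. It assumes distinct x and y values and omits IEJoin's handling of ties, multi-way joins, selectivity estimation, and distributed execution; it is not the full algorithm.

object InequalitySelfJoinSketch {
  final case class Row(id: Int, x: Double, y: Double)

  def join(rows: IndexedSeq[Row]): Seq[(Int, Int)] = {
    val n = rows.length
    val byX = rows.indices.sortBy(i => rows(i).x)   // process tuples in increasing x
    val byY = rows.indices.sortBy(i => rows(i).y)   // rank positions by y
    val yRank = Array.ofDim[Int](n)                 // tuple index -> rank in y order
    for ((idx, r) <- byY.zipWithIndex) yRank(idx) = r
    val seen = new java.util.BitSet(n)              // y ranks of tuples with smaller x
    val out = scala.collection.mutable.ArrayBuffer.empty[(Int, Int)]
    for (j <- byX) {
      // every already-seen tuple with a higher y rank satisfies s.x < r.x
      // (it was inserted earlier) and s.y > r.y (its rank is higher)
      var b = seen.nextSetBit(yRank(j) + 1)
      while (b >= 0) { out += ((rows(byY(b)).id, rows(j).id)); b = seen.nextSetBit(b + 1) }
      seen.set(yRank(j))
    }
    out.toSeq
  }

  def main(args: Array[String]): Unit = {
    val rows = Vector(Row(1, 1.0, 5.0), Row(2, 2.0, 3.0), Row(3, 3.0, 4.0))
    println(join(rows))   // pairs (s.id, r.id): (1,2) and (1,3)
  }
}

The sketch conveys why sorted arrays and a space-efficient bit-array shrink the search space: candidates are enumerated only from set bits beyond the current tuple's rank instead of comparing every pair.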

Finally, I presented Mizan, a Pregel system that uses fine-grained vertex migration to load balance computation and communication across supersteps. Using distributed measurements of the performance characteristics of all vertices, Mizan identifies the cause of workload imbalance and constructs a vertex migration plan without requiring centralized coordination. As a result, Mizan is able to optimize graph computations regardless of the structure of the input graph, which enables it to process BigDansing's violation graph efficiently. Using a representative set of datasets and algorithms, I have shown both the efficacy and robustness of my design against varying workload conditions. I have also shown the linear scalability of Mizan, scaling to 1024 CPUs. Mizan is a general graph processing system; its optimizations are independent of the graph algorithm, which makes it a suitable solution for other applications.
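As a toy illustration of turning per-worker load summaries into a migration decision, consider the sketch below. It is not Mizan's actual planner, which works from distributed per-vertex statistics without a central coordinator; the sketch only conveys the idea of pairing an overloaded worker with an underloaded one.

object MigrationPlanSketch {
  final case class Plan(from: Int, to: Int, verticesToMove: Long)

  def plan(load: Map[Int, Double], avgVertexCost: Double): Option[Plan] = {
    val (overW, overL)   = load.maxBy(_._2)     // most loaded worker
    val (underW, underL) = load.minBy(_._2)     // least loaded worker
    val surplus = (overL - underL) / 2          // move enough load to meet in the middle
    if (surplus <= 0) None
    else Some(Plan(overW, underW, (surplus / avgVertexCost).toLong))
  }

  def main(args: Array[String]): Unit = {
    // hypothetical per-worker load, e.g. seconds of compute in the last superstep
    val load = Map(0 -> 120.0, 1 -> 60.0, 2 -> 58.0)
    println(plan(load, avgVertexCost = 0.001))  // Some(Plan(0,2,31000))
  }
}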

6.2 Future Research Directions

There are several future directions for BigDansing. The first direction relates to abstracting the repairing process through logical operators, similar to violation detection. This abstraction would allow better scalability of the repair algorithms in comparison to BigDansing's current approach. However, abstracting the repairs is a very challenging problem because most of the existing repair algorithms use different heuristics to find optimal repairs. As a result, it becomes very difficult to find common operators for unrelated repair algorithms. Moreover, the impact of such a direction might be small because data scientists are keener to use directed or interactive error repairs than to depend on automated repair algorithms. Another direction is to exploit opportunities for multiple data quality rule optimizations, where the execution plans of different repair jobs on the same dataset are merged into a single enhanced plan. This optimization would reduce the memory footprint of BigDansing, by sharing the output of logical operators, and would improve the parallelism of violation detection even further. BigDansing may also utilize RDD caching in Spark to store the results of specific execution operators, based on the history of previous error detections, to improve the performance of future cleaning jobs.

A more promising direction is to derive more usable graph algorithms for BigDansing to improve the repair performance while utilizing Mizan to process such algorithms efficiently. In other words, one can implement the minimum vertex cover (MVC) algorithm on Mizan to automatically provide a parallel version of the hypergraph repair algorithm [6,9] for BigDansing. Moreover, one can also use a transitive reduction algorithm to compress the connected components and stream them to the repair function without losing the violations. This is useful because it also allows a faster transfer of connected components from Mizan to BigDansing while avoiding overloading the memory of single machines with very large connected components.
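As a starting point for such a port, the sketch below gives the classic sequential greedy vertex-cover heuristic over a set of violation edges; the types and the example graph are illustrative placeholders, not BigDansing or Mizan code. A vertex-centric version would let each superstep pick locally dominant vertices instead of a global maximum.

object GreedyVertexCoverSketch {
  type Edge = (Int, Int)

  // repeatedly pick the vertex that covers the most still-uncovered edges
  def cover(edges: Set[Edge]): Set[Int] = {
    var uncovered = edges
    var chosen = Set.empty[Int]
    while (uncovered.nonEmpty) {
      val counts = uncovered.toSeq
        .flatMap { case (u, v) => Seq(u, v) }
        .groupBy(identity)
        .map { case (vertex, occurrences) => vertex -> occurrences.size }
      val best = counts.maxBy(_._2)._1
      chosen += best
      uncovered = uncovered.filterNot { case (u, v) => u == best || v == best }
    }
    chosen
  }

  def main(args: Array[String]): Unit = {
    // a toy violation graph; any returned set touches every edge
    println(cover(Set((1, 2), (1, 3), (2, 3), (3, 4))))
  }
}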

REFERENCES

[1] N. Swartz, “Gartner Warns Firms of ‘Dirty Data’,” Information Management Journal, vol. 41, no. 3, 2007. [Online]. Available: http://www.gartner.com/ newsroom/id/501733 [2] W. Fan and F. Geerts, Foundations of Data Quality Management. Morgan & Claypool, 2012. [3] G. Press, “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Task, Survey Says,” March 2016, accessed: 2017-3-13. [Online]. Available: https://goo.gl/IdvEnB [4] W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis, “Conditional Functional Dependencies for Capturing Data Inconsistencies,” ACM Transactions on Database Systems (TODS), vol. 33, no. 2, pp. 6:1–6:48, 2008. [5] M. Dallachiesa, A. Ebaid, A. Eldawy, A. Elmagarmid, I. F. Ilyas, M. Ouzzani, and N. Tang, “NADEEF: A Commodity Data Cleaning System,” in Proceedings of the ACM International Conference on Management of Data, ser. SIGMOD, 2013, pp. 541–552. [6] X. Chu, I. F. Ilyas, and P. Papotti, “Holistic Data Cleaning: Putting Violations into Context,” in Proceedings of the IEEE International Conference on Data Engineering, ser. ICDE, 2013, pp. 458–469. [7] F. Geerts, G. Mecca, P. Papotti, and D. Santoro, “The LLUNATIC Data- Cleaning Framework,” Proceedings of the VLDB Endowment (PVLDB), vol. 6, no. 9, pp. 625–636, July 2013. [8] ——, “Mapping and Cleaning,” in Proceedings of the IEEE International Con- ference on Data Engineering, ser. ICDE, 2014, pp. 232–243. [9] S. Kolahi and L. V. S. Lakshmanan, “On Approximating Optimum Repairs for Functional Dependency Violations,” in Proceedings of the ACM International Conference on Database Theory, ser. ICDT, 2009, pp. 53–62. [10] M. Volkovs, F. Chiang, J. Szlichta, and R. J. Miller, “Continuous Data Clean- ing,” in Proceedings of the IEEE International Conference on Data Engineering, ser. ICDE, 2014, pp. 244–255. 187 [11] T. Friedman, “Magic Quadrant for Data Quality Tools,” http://www.gartner. com/, 2013. [12] S. Abiteboul, R. Hull, and V. Vianu, Foundations of Databases. Addison- Wesley, 1995. [13] H. Garcia-Molina, J. D. Ullman, and J. Widom, Database Systems. Pearson Education, 2009. [14] D. Gao, C. S. Jensen, R. T. Snodgrass, and M. D. Soo, “Join Operations in Temporal Databases,” The VLDB Journal, vol. 14, no. 1, pp. 2–29, March 2005. [15] D. J. DeWitt, J. F. Naughton, and D. A. Schneider, “An Evaluation of Non- Equijoin Algorithms,” in Proceedings of the International Conference on Very Large Data Bases, ser. VLDB, 1991, pp. 443–452. [16] J. Enderle, M. Hampel, and T. Seidl, “Joining Interval Data in Relational Databases,” in Proceedings of the ACM International Conference on Manage- ment of Data, ser. SIGMOD, 2004, pp. 683–694. [17] C.-Y. Chan and Y. E. Ioannidis, “An Efficient Bitmap Encoding Scheme for Selection Queries,” in Proceedings of the ACM International Conference on Management of Data, ser. SIGMOD, 1999, pp. 215–226. [18] J. M. Hellerstein, J. F. Naughton, and A. Pfeffer, “Generalized Search Trees for Database Systems,” in Proceedings of the International Conference on Very Large Data Bases, ser. VLDB, 1995, pp. 562–573. [19] K. Stockinger and K. Wu, “Bitmap Indices for Data Warehouses,” Data Ware- houses and OLAP: Concepts, Architectures and Solutions, vol. 5, pp. 157–178, 2007. [20] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: A System for Large-Scale Graph Processing,” in Proceedings of the ACM International Conference on Management of Data, ser. SIGMOD, 2010, pp. 135–146. [21] U. Kang, C. Tsourakakis, A. P. Appel, C. Faloutsos, and J. 
Leskovec, “HADI: Fast Diameter Estimation and Mining in Massive Graphs with Hadoop,” ACM Trasactions on Knowledge Discovery from Data (TKDD), vol. 5, no. 2, pp. 8:1–8:24, 2011. [22] U. Kang, C. E. Tsourakakis, and C. Faloutsos, “PEGASUS: A Peta-Scale Graph Mining System Implementation and Observations,” in Proceedings of the IEEE International Conference on , ser. ICDM, 2009, pp. 229–238. 188 [23] W. Xue, J. Shi, and B. Yang, “X-RIME: Cloud-Based Large Scale Social Net- work Analysis,” in Proceedings of the IEEE International Conference on Ser- vices Computing, ser. SCC, 2010, pp. 506–513. [24] Z. Khayyat, K. Awara, H. Jamjoom, and P. Kalnis, “Mizan: Optimizing Graph Mining in Large Parallel Systems,” King Abdullah University of Science and Technology, Technical Report, 2012. [25] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” in Proceedings of the USENIX Symposium on Opearting Systems Design and Implementation, ser. OSDI, 2004, pp. 137–150. [26] L. G. Valiant, “A Bridging Model for Parallel Computation,” Communications of the ACM, vol. 33, pp. 103–111, 1990. [27] A. Ebaid, A. Elmagarmid, I. F. Ilyas, M. Ouzzani, J.-A. Quiane-Ruiz, N. Tang, and S. Yin, “NADEEF: A Generalized Data Cleaning System,” Proceedings of the VLDB Endowment (PVLDB), vol. 6, no. 12, pp. 1218–1221, August 2013. [28] A. K. Elmagarmid, I. F. Ilyas, M. Ouzzani, J. Quian´e-Ruiz,N. Tang, and S. Yin, “NADEEF/ER: Generic and Interactive Entity Resolution,” in Proceedings of the ACM International Conference on Management of Data, ser. SIGMOD, 2014, pp. 1071–1074. [29] “Apache Incubator Giraph,” http://incubator.apache.org/giraph, 2012. [30] “A Cloud-based Open Source Project for Massive-Scale Graph Analysis,” http: //goldenorbos.org/, 2012. [31] S. Seo, E. Yoon, J. Kim, S. Jin, J.-S. Kim, and S. Maeng, “HAMA: An Efficient Matrix Computation with the MapReduce Framework,” in Proceedings of the IEEE International Conference on Cloud Computing Technology and Science, ser. CloudCom, 2010, pp. 721–726. [32] R. Chen, M. Yang, X. Weng, B. Choi, B. He, and X. Li, “Improving Large Graph Processing on Partitioned Graphs in the Cloud,” in Proceedings of ACM Symposium on Cloud Computing, ser. SOCC, 2012, pp. 3:1–3:13. [33] G. Karypis and V. Kumar, “Multilevel k-way Partitioning Scheme for Irregular Graphs,” Journal of Parallel and Distributed Computing, vol. 48, no. 1, pp. 96–129, 1998. [34] Z. Khayyat, I. F. Ilyas, A. Jindal, S. Madden, M. Ouzzani, P. Papotti, J.-A. Quian´e-Ruiz,N. Tang, and S. Yin, “BigDansing: A System for Big Data Cleans- 189 ing,” in Proceedings of the ACM International Conference on Management of Data, ser. SIGMOD, 2015, pp. 1215–1230. [35] Z. Khayyat, W. Lucia, M. Singh, M. Ouzzani, P. Papotti, J.-A. Quian´e-Ruiz, N. Tang, and P. Kalnis, “Lightning Fast and Space Efficient Inequality Joins,” Proceedings of the VLDB Endowment (PVLDB), vol. 8, no. 13, pp. 2074–2085, 2015. [36] ——, “Fast and Scalable Inequality Joins,” The VLDB Journal, vol. 26, no. 1, pp. 125–150, February 2017. [37] Z. Khayyat, K. Awara, A. Alonazi, H. Jamjoom, D. Williams, and P. Kalnis, “Mizan: A System for Dynamic Load Balancing in Large-scale Graph process- ing,” in Proceedings of the ACM European Conference on Computer Systems, ser. EuroSys, 2013, pp. 169–182. [38] I. Fellegi and D. Holt, “A Systematic Approach to Automatic Edit and Im- putation,” Journal of the American Statistical Association, vol. 71, no. 353, 1976. [39] W. Fan, F. Geerts, N. Tang, and W. 
Yu, “Conflict Resolution with Data Cur- rency and Consistency,” Journal of Data and Information Quality (JDIQ), vol. 5, no. 1-2, p. 6, 2014. [40] W. Fan, J. Li, N. Tang, and W. Yu, “Incremental Detection of Inconsistencies in Distributed Data,” in Proceedings of the IEEE International Conference on Data Engineering, ser. ICDE, 2012, pp. 318–329. [41] E. Rahm and H. H. Do, “Data Cleaning: Problems and Current Approaches,” IEEE Data Engineering Bulletin, vol. 23, no. 4, pp. 3–13, 2000. [42] W. Fan, J. Li, S. Ma, N. Tang, and W. Yu, “Interaction Between Record Match- ing and Data Repairing,” in Proceedings of the ACM International Conference on Management of Data, ser. SIGMOD, 2011, pp. 469–480. [43] J. Wang and N. Tang, “Towards Dependable Data Repairing with Fixing Rules,” in Proceedings of the ACM International Conference on Management of Data, ser. SIGMOD, 2014, pp. 457–468. [44] M. Interlandi and N. Tang, “Proof Positive and Negative in Data Cleaning,” in Proceedings of the IEEE International Conference on Data Engineering, ser. ICDE, 2015, pp. 18–29. [45] C. Batini and M. Scannapieco, Data Quality: Concepts, Methodologies and Techniques. Springer, 2006. 190 [46] J. Wang, S. Krishnan, M. J. Franklin, K. Goldberg, T. Kraska, and T. Milo, “A Sample-and-Clean Framework for Fast and Accurate Query Processing on Dirty Data,” in Proceedings of the ACM International Conference on Management of Data, ser. SIGMOD, 2014, pp. 469–480. [47] M. Jarke and J. Koch, “Query Optimization in Database Systems,” ACM Com- puting Surveys (CSUR), vol. 16, no. 2, pp. 111–152, 1984. [48] A. Silberschatz and H. F. Korth, Database System Concepts, 2nd Edition. McGraw-Hill Book Company, 1991. [49] J. D. Ullman, Principles of Database and Knowledge-Base Systems, Volume II. Computer Science Press, 1989. [50] G. Graefe, “Query Evaluation Techniques for Large Databases,” ACM Com- puting Surveys, vol. 25, no. 2, pp. 73–169, June 1993. [51] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, January 2008. [52] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster Computing with Working Sets,” in Proceedings of the USENIX Confer- ence on Hot Topics in Cloud Computing, ser. HotCloud, 2010, pp. 10:1–10:7. [53] L. Kolb, A. Thor, and E. Rahm, “Dedoop: Efficient Deduplication with Hadoop,” Proceedings of the VLDB Endowment (PVLDB), vol. 5, no. 12, pp. 1878–1881, August 2012. [54] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy, “Hive: A Warehousing Solution over a Map-reduce Framework,” Proceedings of the VLDB Endowment (PVLDB), vol. 2, no. 2, pp. 1626–1629, August 2009. [55] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, “Pig Latin: A Not-so-foreign Language for Data Processing,” in Proceedings of the ACM International Conference on Management of Data, ser. SIGMOD, 2008, pp. 1099–1110. [56] “Shark (Hive on Spark),” https://github.com/amplab/shark. [57] A. Okcan and M. Riedewald, “Processing Theta-Joins Using MapReduce,” in Proceedings of the ACM International Conference on Management of Data, ser. SIGMOD, 2011, pp. 949–960. 191 [58] G. Beskales, I. F. Ilyas, and L. Golab, “Sampling the Repairs of Functional Dependency Violations Under Hard Constraints,” Proceedings of the VLDB Endowment (PVLDB), vol. 3, no. 1-2, pp. 197–207, September 2010. [59] P. Bohannon, M. Flaster, W. Fan, and R. 
Rastogi, “A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification,” in Pro- ceedings of the ACM International Conference on Management of Data, ser. SIGMOD, 2005, pp. 143–154. [60] M. Yakout, L. Berti-Equille, and A. K. Elmagarmid, “Don’t be SCAREd: use SCalable Automatic REpairing with maximal likelihood and bounded changes,” in Proceedings of the ACM International Conference on Management of Data, ser. SIGMOD, 2013, pp. 553–564. [61] C. Mayfield, J. Neville, and S. Prabhakar, “ERACER: a Database Approach for Statistical Inference and Data Cleaning,” in Proceedings of the ACM Inter- national Conference on Management of Data, ser. SIGMOD, 2010, pp. 75–86. [62] Y. Tong, C. C. Cao, C. J. Zhang, Y. Li, and L. Chen, “CrowdCleaner: Data Cleaning for Multi-version Data on the Web via Crowdsourcing,” in Proceedings of the IEEE International Conference on Data Engineering, ser. ICDE, 2014, pp. 1182–1185. [63] M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas, “Guided Data Repair,” Proceedings of the VLDB Endowment (PVLDB), vol. 4, no. 5, pp. 279–289, February 2011. [64] D. Maier, A. O. Mendelzon, and Y. Sagiv, “Testing Implications of Data De- pendencies,” ACM Transactions on Database Systems (TODS), vol. 4, no. 4, pp. 455–469, December 1979. [65] C. Beeri and M. Y. Vardi, “A Proof Procedure for Data Dependencies,” Journal of the ACM (JACM), vol. 31, no. 4, pp. 718–741, September 1984. [66] L. E. Bertossi, Database Repairing and Consistent Query Answering. Morgan & Claypool Publishers, 2011. [67] L. Bertossi, S. Kolahi, and L. V. S. Lakshmanan, “Data Cleaning and Query Answering with Matching Dependencies and Matching Functions,” in Proceed- ings of the 14th International Conference on Database Theory, ser. ICDT, 2011, pp. 268–279. 192 [68] P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis, “Conditional Functional Dependencies for Data Cleaning,” in Proceedings of the IEEE Inter- national Conference on Data Engineering, ser. ICDE, 2007, pp. 746–755. [69] T. L. Lopes Siqueira, R. R. Ciferri, V. C. Times, and C. D. de Aguiar Ciferri, “A Spatial Bitmap-based Index for Geographical Data Warehouses,” in Proceedings of the ACM Symposium on Applied Computing, ser. SAC, 2009, pp. 1336–1342. [70] C.-Y. Chan and Y. E. Ioannidis, “Bitmap Index Design and Evaluation,” in Proceedings of the ACM International Conference on Management of Data, ser. SIGMOD, 1998, pp. 355–366. [71] A. Guttman, “R-trees: A Dynamic Index Structure for Spatial Searching,” in Proceedings of the ACM International Conference on Management of Data, ser. SIGMOD, 1984, pp. 47–57. [72] C. B¨ohm,G. Klump, and H.-P. Kriegel, “XZ-Ordering: A Space-Filling Curve for Objects with Spatial Extension,” in Proceedings of the International Sym- posium on Advances in Spatial Databases, ser. SSD, 1999, pp. 75–90. [73] J. Morris and B. Ramesh, “Dynamic Partition Enhanced Inequality Joining Using a Value-count Index,” Patent US7 873 629B1, 1 18, 2011. [Online]. Available: http://www.google.com/patents/US7873629 [74] J. Dittrich, J. Quian´e-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad, “Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing),” Proceedings of the VLDB Endowment (PVLDB), vol. 3, no. 1, pp. 515–529, September 2010. [75] X. Zhang, L. Chen, and M. Wang, “Efficient Multi-way Theta-Join Processing Using MapReduce,” Proceedings of the VLDB Endowment (PVLDB), vol. 5, no. 11, pp. 1184–1195, July 2012. [76] R. 
Fagin, “Combining Fuzzy Information from Multiple Systems (Extended Abstract),” in Proceedings of the ACM Symposium on Principles of Database Systems, ser. PODS, 1996, pp. 216–226. [77] G. Lohman, C. Mohan, L. Haas, D. Daniels, B. Lindsay, P. Selinger, and P. Wilms, “Query Processing in R*,” in Query Processing in Database Sys- tems, 1985, pp. 31–47. [78] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price, “Access Path Selection in a Relational Database Management System,” 193 in Proceedings of the ACM International Conference on Management of Data, ser. SIGMOD, 1979, pp. 23–34. [79] D. A. Schneider and D. J. DeWitt, “A Performance Evaluation of Four Parallel Join Algorithms in a Shared-nothing Multiprocessor Environment,” in Pro- ceedings of the ACM International Conference on Management of Data, ser. SIGMOD, 1989, pp. 110–121. [80] A. Kemper, D. Kossmann, and C. Wiesner, “Generalised Hash Teams for Join and Group-by,” in Proceedings of the International Conference on Very Large Data Bases, ser. VLDB, 1999, pp. 30–41. [81] F. N. Afrati and J. D. Ullman, “Optimizing Joins in a Map-reduce Envi- ronment,” in Proceedings of the ACM International Conference on Extending Database Technology, ser. EDBT, 2010, pp. 99–110. [82] N. Mamoulis and D. Papadias, “Multiway Spatial Joins,” ACM Transactions on Database Systems (TODS), vol. 26, no. 4, pp. 424–475, December 2001. [83] A. Lumsdaine, D. Gregor, B. Hendrickson, and J. W. Berry, “Challenges in Parallel Graph Processing,” Parallel Processing Letter, vol. 17, no. 1, pp. 5–20, 2007. [84] B. Hendrickson and K. Devine, “Dynamic Load Balancing in Computational Mechanics,” Computer Methods in Applied Mechanics and Engineering, vol. 184, no. 2–4, pp. 485–500, 2000. [85] “Hadoop MapReduce Programming Model, Documentation and Implementa- tion,” http://hadoop.apache.org/mapreduce. [86] K. Shuang, Y. Yang, B. Cai, and Z. Xiang, “X-RIME: Hadoop-based Large-scale Social Network Analysis,” in Proceedings of the IEEE International Conference on Broadband Network and Multimedia Technology, ser. IC-BNMT, 2010, pp. 901–906. [87] S. Sakr, A. Liu, and A. G. Fayoumi, “The Family of Mapreduce and Large-scale Data Processing Systems,” ACM Computing Surveys (CSUR), vol. 46, no. 1, pp. 11:1–11:44, October 2013. [88] T. Kajdanowicz, W. Indyk, P. Kazienko, and J. Kukul, “Comparison of the Efficiency of MapReduce and Bulk Synchronous Parallel Approaches to Large Network Processing,” in Proceedings of the IEEE International Conference on Data Mining Workshops, ser. ICDMW, 2012, pp. 218–225. 194 [89] L.-Y. Ho, J.-J. Wu, and P. Liu, “Distributed Graph Database for Large-Scale Social Computing,” in Proceedings of the IEEE International Conference in Cloud Computing, ser. CLOUD, 2012, pp. 455–462. [90] Z. Shang and J. X. Yu, “Catch the Wind: Graph Workload Balancing on Cloud,” in Proceedings of the IEEE International Conference on Data Engi- neering, ser. ICDE, 2013, pp. 553–564. [91] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein, “GraphLab: A New Parallel Framework for Machine Learning,” in Proceedings of the Uncertainty in Artificial Intelligence, ser. UAI, 2010, pp. 340–349. [92] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. Hellerstein, “Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud,” Proceedings of the VLDB Endowment (PVLDB), vol. 5, no. 8, pp. 716–727, April 2012. [93] G. Karypis and V. 
Kumar, “Parallel Multilevel K-Way Partitioning Scheme for Irregular Graphs,” in Proceedings of the ACM/IEEE Conference on Supercom- puting, ser. Supercomputing, 1996, pp. 35–35. [94] E. Krepska, T. Kielmann, W. Fokkink, and H. Bal, “HipG: Parallel Process- ing of Large-Scale Graphs,” ACM SIGOPS Operating Systems Review, vol. 45, no. 2, pp. 3–13, July 2011. [95] A. Kyrola, G. Blelloch, and C. Guestrin, “GraphChi: Large-Scale Graph com- putation on Just a PC,” in Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, ser. OSDI, 2012, pp. 31–46. [96] S. Salihoglu and J. Widom, “GPS: A Graph Processing System,” in Proceedings of the ACM International Conference on Scientific and Statistical Database Management, ser. SSDBM, 2013, pp. 22:1–22:12. [97] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, “PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs,” in Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, ser. OSDI, 2012, pp. 17–30. [98] R. Cheng, J. Hong, A. Kyrola, Y. Miao, X. Weng, M. Wu, F. Yang, L. Zhou, F. Zhao, and E. Chen, “Kineograph: Taking the Pulse of a Fast-Changing and Connected World,” in Proceedings of the ACM European Conference on Computer Systems, ser. EuroSys, 2012, pp. 85–98. 195 [99] J. M. Pujol, V. Erramilli, G. Siganos, X. Yang, N. Laoutaris, P. Chhabra, and P. Rodriguez, “The Little Engine(s) That Could: Scaling Online Social Networks,” IEEE/ACM Transactions on Networking (TON), vol. 20, no. 4, pp. 1162–1175, August 2012. [100] J. Shun and G. E. Blelloch, “Ligra: A Lightweight Graph Processing Framework for Shared Memory,” in Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP, 2013, pp. 135– 146. [101] B. Shao, H. Wang, and Y. Li, “Trinity: A Distributed Graph Engine on a Memory Cloud,” in Proceedings of the ACM International Conference on Man- agement of Data, ser. SIGMOD, 2013, pp. 505–516. [102] Harshvardhan, A. Fidel, N. Amato, and L. Rauchwerger, “The STAPL Parallel Graph Library,” in Languages and Compilers for Parallel Computing, ser. Lec- ture Notes in Computer Science, H. Kasahara and K. Kimura, Eds. Springer Berlin Heidelberg, 2013, vol. 7760, pp. 46–60. [103] D. Gregor and A. Lumsdaine, “The Parallel BGL: A Generic Library for Dis- tributed Graph Computations,” in Proceedings of the Parallel Object-Oriented Scientific Computing, ser. POOSC, 2005. [104] E. Jahani, M. J. Cafarella, and C. R´e,“Automatic Optimization for MapReduce Programs,” Proceedings of the VLDB Endowment (PVLDB), vol. 4, no. 6, pp. 385–396, March 2011. [105] A. Jindal, J.-A. Quian´e-Ruiz,and S. Madden, “Cartilage: Adding Flexibility to the Hadoop Skeleton,” in Proceedings of the ACM International Conference on Management of Data, ser. SIGMOD, 2013, pp. 1057–1060. [106] A. Ching, S. Edunov, D. Logothetis, and M. Kabiljo, “A Comparison of State-of-the-art Graph Processing Systems,” August 2016, accessed: 2017- 7-13. [Online]. Available: https://code.facebook.com/posts/319004238457019/ a-comparison-of-state-of-the-art-graph-processing-systems [107] “Alluxio: Open Source Memory Speed Virtual Distributed Storage.” [108] H. Li, A. Ghodsi, M. Zaharia, S. Shenker, and I. Stoica, “Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks,” in Proceedings of the ACM Symposium on Cloud Computing, ser. SOCC, 2014, pp. 6:1–6:15. [109] D. Agrawal, S. Chawla, A. K. Elmagarmid, Z. K. M. Ouzzani, P. Papotti, J. Quian´e-Ruiz,N. Tang, and M. J. 
Zaki, “Rheem: Road to Freedom in Data 196 Analytics,” in Proceedings of the ACM International Conference on Extending Database Technology, ser. EDBT, 2016, pp. 2069–2072. [110] D. Agrawal, L. Ba, L. Berti-Equille, S. Chawla, A. Elmagarmid, H. Hammady, Y. Idris, Z. Kaoudi, Z. Khayyat, S. Kruse, M. Ouzzani, P. Papotti, J.-A. Quiane- Ruiz, N. Tang, and M. J. Zaki, “Rheem: Enabling Multi-Platform Task Exe- cution,” in Proceedings of the ACM International Conference on Management of Data, ser. SIGMOD, 2016, pp. 2069–2072. [111] G. Karypis and V. Kumar, “Multilevel K-way Hypergraph Partitioning,” in Proceedings of the ACM/IEEE Design Automation Conference, ser. DAC, 1999, pp. 343–348. [112] “TPC-H benchmark version 2.14.4,” http://www.tpc.org/tpch/. [113] G. Smith, PostgreSQL 9.0 High Performance: Accelerate your PostgreSQL Sys- tem and Avoid the Common Pitfalls that Can Slow it Down. Packt Publishing, 2010. [114] R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica, “Shark: SQL and Rich Analytics at Scale,” in Proceedings of the ACM Inter- national Conference on Management of Data, ser. SIGMOD, 2013, pp. 13–24. [115] A. Callahan, J. Cruz-Toledo, P. Ansell, and M. Dumontier, “Bio2RDF Release 2: Improved Coverage, Interoperability and Provenance of Life Science Linked Data,” in Proceedings of The Semantic Web: Semantics and Big Data, ser. ESWC, 2013, pp. 200–212. [116] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives, “DB- pedia: A Nucleus for a Web of Open Data,” in Proceedings of the Interna- tional Semantic Web Conference and the Asian Semantic Web Conference, ser. ISWC/ASWC, 2007, pp. 722–735. [117] W. Wu, H. Li, H. Wang, and K. Q. Zhu, “Probase: A Probabilistic Taxonomy for Text Understanding,” in Proceedings of the ACM International Conference on Management of Data, ser. SIGMOD, 2012, pp. 481–492. [118] “PubChemRDF Release Notes,” http://pubchem.ncbi.nlm.nih.gov/rdf/. [119] T. U. Consortium, “Activities at the Universal Protein Resource (UniProt),” Nucleic Acids Research, vol. 42, no. D1, pp. D191–D198, 2014. [120] Harbi, Razen and Abdelaziz, Ibrahim and Kalnis, Panos and Mamoulis, Nikos and Ebrahim, Yasser and Sahli, Majed, “Accelerating SPARQL Queries by Ex- 197 ploiting Hash-based Locality and Adaptive Partitioning,” The VLDB Journal, vol. 25, no. 3, pp. 355–380, June 2016. [121] Harbi, Razen and Abdelaziz, Ibrahim and Kalnis, Panos and Mamoulis, Nikos, “Evaluating SPARQL Queries on Massive RDF Datasets,” Proceedings of the VLDB Endowment (PVLDB), vol. 8, no. 12, pp. 1848–1851, August 2015. [122] A. Sch¨atzle,M. Przyjaciel-Zablocki, S. Skilevic, and G. Lausen, “S2RDF: RDF Querying with SPARQL on Spark,” Proceedings of the VLDB Endowment (PVLDB), vol. 9, no. 10, pp. 804–815, June 2016. [123] Zeng, Kai and Yang, Jiacheng and Wang, Haixun and Shao, Bin and Wang, Zhongyuan, “A distributed graph engine for web scale RDF data,” Proceedings of the VLDB Endowment (PVLDB), vol. 6, no. 4, pp. 265–276, February 2013. [124] N. Tang, “Big RDF Data Cleaning,” in Proceedings of the IEEE International Conference on Data Engineering Workshops, ser. ICDEW, 2015, pp. 77–79. [125] “The Lehigh University Benchmark (LUBM),” http://swat.cse.lehigh.edu/ projects/lubm/. [126] E. Prudhommeaux and A. Seaborne, “SPARQL Query Language for RDF,” https://www.w3.org/TR/rdf-sparql-query/, 2014. [127] N. K. Govindaraju, J. Gray, R. Kumar, and D. 
Manocha, “GPUTeraSort: High Performance Graphics Co-processor Sorting for Large Database Management,” in Proceedings of the ACM International Conference on Management of Data, ser. SIGMOD, 2006, pp. 325–336. [128] D. E. Knuth, The Art of Computer Programming, Volume III: Sorting and Searching. Addison-Wesley, 1973. [129] M. A. Bender and H. Hu, “An Adaptive Packed-memory Array,” ACM Trans- actions on Database Systems (TODS), vol. 32, no. 4, pp. 26:1– 26:43, November 2007. [130] “Rheem: Turning the Zoo of Data Processing Systems into a Circus,” https: //github.com/daqcri/rheem. [131] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia, “Spark SQL: Relational Data Processing in Spark,” in Proceedings of the ACM International Conference on Management of Data, ser. SIGMOD, 2015, pp. 1383–1394. 198 [132] N. Kiukkonen, B. J., O. Dousse, D. Gatica-Perez, and L. J., “Towards Rich Mobile Phone Datasets: Lausanne Campaign,” in Proceedings of the International Conference on Pervasive Services, ser. ICPS, 2010. [133] J. K. Laurila, D. Gatica-Perez, I. Aad, B. J., O. Bornet, T.-M.-T. Do, O. Dousse, J. Eberle, and M. Miettinen, “The Mobile Data Challenge: Big Data for Mobile Computing Research,” in Pervasive Computing, 2012. [134] L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank Citation Ranking: Bringing Order to the Web,” http://ilpubs.stanford.edu:8090/422/, 2001. [135] R. G. Gallager, P. A. Humblet, and P. M. Spira, “A Distributed Algorithm for Minimum-Weight Spanning Trees,” ACM Transactions on Programming Lan- guages and Systems (TOPLAS), vol. 5, no. 1, pp. 66–77, January 1983. [136] D. Rees, Foundations of Statistics. London New York: Chapman and Hall, 1987. [137] H. Balakrishnan, M. F. Kaashoek, D. Karger, R. Morris, and I. Stoica, “Looking Up Data in P2P Systems,” Communications of the ACM, vol. 46, no. 2, pp. 43–48, 2003. [138] P. Boldi, B. Codenotti, M. Santini, and S. Vigna, “UbiCrawler: A Scalable Fully Distributed Web Crawler,” Software: Practice & Experience, vol. 34, no. 8, pp. 711–726, 2004. [139] “The Laboratory for Web Algorithmics,” http://law.di.unimi.it/index.php, 2012. [140] J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani, “Kronecker Graphs: An Approach to Modeling Networks,” Journal of Machine Learning Research, vol. 11, pp. 985–1042, March 2010. 199

APPENDICES

A Bushy Plan Example

I show a bushy plan example in Figure A.1 based on the following two tables and DC rules from [6]:

• Table Global (G): GID, FN, LN, Role, City, AC, ST, SAL

• Table Local (L): LID, FN, LN, RNK, DO, Y, City, MID, SAL

• (c1) : ∀t1, t2 ∈ G, ¬(t1.City = t2.City ∧ t1.ST ≠ t2.ST)

• (c2) : ∀t1, t2 ∈ G, ¬(t1.Role = t2.Role ∧ t1.City = ”NYC” ∧ t2.City ≠ ”NYC” ∧ t2.SAL > t1.SAL)

• (c3) : ∀t1, t2 ∈ L, t3 ∈ G, ¬(t1.LID ≠ t2.LID ∧ t1.LID = t2.MID ∧ t1.FN ≈ t3.FN ∧ t1.LN ≈ t3.LN ∧ t1.City = t3.City ∧ t3.Role ≠ ”M”)

The plan starts with applying the Scope operator. Instead of calling Scope for each rule, I only invoke Scope once for each relation. Next, I apply the Block operator as follows: block on “City” for c1, on “Role” for c2, and on “LID” and “MID” for c3. Thereafter, for c1 and c2, I iterate over candidate tuples with violations (Iterate) and feed them to the Detect and GenFix operators, respectively. For c3, I iterate over all employees who are managers, combine them with data units from the global table G, and finally feed them to the Detect and GenFix operators.

Figure A.1: Example of bushy plan

The key thing to note in the above bushy data cleaning plan is that while each rule has its own Detect/GenFix operator, the plan shares many of the other operators in order to reduce: (1) the number of times data is read from the base relations, and (2) the number of duplicate data units generated and processed in the dataflow.
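To make the sharing concrete, the following Scala/Spark sketch computes the Scope output over G once, caches it, and reuses it for the Block steps of c1 and c2. The tuple type and the rule UDFs (scopeG, detectC1, detectC2) are illustrative placeholders, not BigDansing's actual operators.

import org.apache.spark.rdd.RDD

object SharedPlanSketch {
  // simplified schema of table G; a real Scope would project only the needed columns
  final case class GTuple(gid: String, role: String, city: String, st: String, sal: Double)

  def detectViolations(g: RDD[GTuple]): (RDD[(GTuple, GTuple)], RDD[(GTuple, GTuple)]) = {
    val scoped = g.map(scopeG).cache()                  // Scope once per relation, shared by c1 and c2
    val blockC1 = scoped.groupBy(_.city)                // Block on City for c1
    val blockC2 = scoped.groupBy(_.role)                // Block on Role for c2
    val violationsC1 = blockC1.flatMap { case (_, ts) => detectC1(ts.toSeq) }  // Iterate + Detect
    val violationsC2 = blockC2.flatMap { case (_, ts) => detectC2(ts.toSeq) }
    (violationsC1, violationsC2)
  }

  // placeholder UDFs standing in for the per-rule logic
  def scopeG(t: GTuple): GTuple = t
  def detectC1(ts: Seq[GTuple]): Seq[(GTuple, GTuple)] =
    for (a <- ts; b <- ts if a.st != b.st) yield (a, b)  // same City block, different ST
  def detectC2(ts: Seq[GTuple]): Seq[(GTuple, GTuple)] =
    for (a <- ts; b <- ts if a.city == "NYC" && b.city != "NYC" && b.sal > a.sal)
      yield (a, b)                                       // same Role block, c2's salary condition
}

Because scoped is cached, an action on either violation RDD materializes the shared Scope stage only once, which is precisely the saving the bushy plan aims for.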

B Spark Code for φF in Example 1

I show in Listing B.2 an example of the physical translation of the logical plan for rule φF in Spark. Lines 3 to 7 in Listing B.2 handle the processing of the UDF operators in Listings 3.1, 3.2, 3.3, 3.4 and 3.5. Lines 8 to 16 implement the repair algorithm in Chapter 3.4.2. The repair algorithm invokes Listing B.1 in line 9 to build the graph of possible fixes and finds the graph's connected components through GraphX. Finally, lines 17 to 19 in Listing B.2 apply the selected candidate fixes to the input dataset. Note that the Spark plan translation in Listing B.2 is specific to rule φF, where lines 3 to 16 change according to the input rule or UDF. The repair algorithm in Listing B.2 seems far more complex than the rule engine. However, since the number of violations in a dataset does not usually exceed 10% of the input dataset size, the rule engine dominates the execution runtime because it has more data to process than the repair algorithm.

1 class RDDGraphBuilder(var edgeRDD: RDD[Edge[Fix]]) {
2   def buildGraphRDD: JavaRDD[EdgeTriplet[VertexId, Fix]] = {
3     var g: Graph[Integer, Fix] = Graph.fromEdges(edgeRDD, 0)
4     new JavaRDD[EdgeTriplet[VertexId, Fix]](g.connectedComponents().triplets) } }

Listing B.1: Scala code for GraphX to find connected components of fixes

//Input
1  JavaPairRDD inputData = sc.hadoopFile(InputPath,
       org.apache.hadoop.mapred.TextInputFormat.class, LongWritable.class,
       Text.class, minPartitions);
2  JavaRDD tupleData = tmpLogData.map(new StringToTuples());
//---- Rule Engine ----
3  JavaRDD scopeData = tupleData.map(new fdScope());
4  JavaPairRDD blockData = scopedData.groupByKey(new fdBlock());
5  JavaRDD iterateData = blockingData.map(new fdIterate());
6  JavaRDD detectData = iterateData.map(new fdDetect());
7  JavaRDD genFixData = detectData.map(new fdGenFix());
//---- Repair Algorithm ----
8  JavaRDD edges = genFixData.map(new extractEdges());
9  JavaRDD ccRDD = new RDDGraphBuilder(edges.rdd()).buildGraphRDD();
10 JavaPairRDD groupedFix = ccRDD.mapToPair(new extractCC());
11 JavaPairRDD stringFixUniqueKeys = groupedFix
       .flatMapToPair(new extractStringFixUniqueKey());
12 JavaPairRDD countStringFixes = stringFixUniqueKeys
       .reduceByKey(new countFixes());
13 JavaPairRDD newUniqueKeysStringFix = countStringFixes
       .mapToPair(new extractReducedCellValuesKey());
14 JavaPairRDD reducedStringFixes = newUniqueKeysStringFix
       .reduceByKey(new reduceStringFixes());
15 JavaPairRDD uniqueKeysFix = groupedFix.flatMapToPair(new extractFixUniqueKey());
16 JavaRDD candidateFixes = uniqueKeysFix.join(reducedStringFixes).values()
       .flatMap(new getFixValues());
//---- Apply results to input ----
17 JavaPairRDD fixRDD = candidateFixes.keyBy(new getFixTupleID()).groupByKey();
18 JavaPairRDD dataRDD = tupleData.keyBy(new getTupleID());
19 JavaRDD newtupleData = dataRDD.leftOuterJoin(fixRDD).map(new ApplyFixes());

Listing B.2: Spark code for rule φF

C Papers Published, Submitted and Under Preparation

• Ibrahim Abdelaziz, Razen Harbi, Zuhair Khayyat, Panos Kalnis, “A Survey and Experimental Comparison of Distributed SPARQL Engines for Very Large RDF Data”, Under Review in PVLDB.

• Zuhair Khayyat, William Lucia, Meghna Singh, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Panos Kalnis, “Fast and Scalable Inequality Joins”, The VLDB Journal (VLDBJ) special issue: Best Papers of VLDB 2015, February 2017.

• Ehab Abdelhamid, Ibrahim Abdelaziz, Panos Kalnis, Zuhair Khayyat, Fuad Jamour, “ScaleMine: Scalable Parallel Frequent Subgraph Mining in a Single Large Graph”, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), November 2016.

• Divy Agrawal, Lamine Ba, Laure Berti-Equille, Sanjay Chawla, Ahmed Elmagarmid, Hossam Hammady, Yasser Idris, Zoi Kaoudi, Zuhair Khayyat, Sebastian Kruse, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Mohammed J. Zaki, “Rheem: Enabling Multi-Platform Task Execution”, in Proceedings of the International Conference on Management of Data (SIGMOD), June 2016.

• Zuhair Khayyat, William Lucia, Meghna Singh, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Panos Kalnis, “Lightning Fast and Space Efficient Inequality Joins”, in Proceedings of Very Large Databases Endowment (PVLDB), September 2015.

• Zuhair Khayyat, Ihab F. Ilyas, Alekh Jindal, Samuel Madden, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Si Yin, “BigDansing: A System for Big Data Cleansing”, in Proceedings of the International Conference on Management of Data (SIGMOD), May 2015.

• Zuhair Khayyat, Karim Awara, Amani Alonazi, Hani Jamjoom, Dan Williams, Panos Kalnis, “Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing”, in Proceedings of the European Conference on Computer Systems (EuroSys), April 2013.