
Analyzing Large-Scale, Distributed and Uncertain Data

Thesis submitted in partial fulfillment of the requirements for the degree of "DOCTOR OF PHILOSOPHY"

By Yaron Gonen

Submitted to the Senate of Ben-Gurion University of the Negev

Approved by the advisor: ______
Approved by the Dean of the Kreitman School of Advanced Graduate Studies: ______

July, 2017 Beer-Sheva

This work was carried out under the supervision of Professor Ehud Gudes, in the Department of Computer Science, Faculty of Natural Sciences.

Acknowledgments

I am grateful to the people who made this thesis possible. First, I would like to express my deepest gratitude to my advisor, Prof. Ehud Gudes, for being there for me all the way through with his support and attentive manner. Apart from learning how to conduct research, I learned a lot from Prof. Gudes’s extensive knowledge and wide perspective in every aspect of life. Dear Ehud, I thank you for encouraging and guiding me, I thank you for keeping your door always open, I thank you for being such a pleasant person to work with and I thank you for believing in me.

Throughout my years of study I have gained not only knowledge, but friends. I thank the WWH team: Amnon Meiseles, Benny Lutati, Vadim Levit and Zohar Komarovsky, for being great teammates and good friends. I thank the staff members of the "Principles of Programming Languages", with a special thanks to Mira Balaban, for teaching me the art of teaching. A special thanks goes to Nurit Gal-Oz, a dear friend and a brilliant researcher. We worked together over the years and I have learned a lot from her, both on the academic side and on personal issues.

I would like to thank my beloved parents, Tova and Israel, who taught me the value of education and always tried to protect me from stress and overload. My love and appreciation goes to my son and daughter, Rotem and Tal, who learned to understand and accept my long hours of working at home.

Last but not least, I wish to thank my partner, Ariella, for encouraging me and backing me up at home. Thank you for listening and encouraging me. Thank you for your love and for being everything I could ever ask for in a partner.

Contents

Contents v

List of Figures ix

List of Algorithms xi

Abstract xiii

1 Introduction 1

1.1 The Age of Data ...... 1

1.2 Making Sense of It ...... 2

1.3 More Data vs. Better Algorithms ...... 2

1.4 MapReduce ...... 3

1.5 Thesis ...... 4

1.5.1 Main Contributions ...... 4

1.5.2 Outline ...... 4

2 Background & Related Work 5

2.1 MapReduce ...... 5

2.1.1 Hadoop Architecture ...... 5

2.2 Probabilistic Database ...... 6

2.3 Closed Frequent Itemsets ...... 6

3 MapReduce 9

3.1 The Problem ...... 9

3.2 The Solution ...... 10

3.3 The Origins of MapReduce: map and fold ...... 11

3.4 MapReduce Basics ...... 12

3.5 Mappers and Reducers ...... 13

3.6 Partitioners and Combiners ...... 15

3.7 Communication-Cost Model ...... 17

3.8 Short Review of Algorithms for MapReduce ...... 17

3.8.1 Minimum, Maximum, and Count of Groups ...... 17

3.8.2 Filtering and Joins ...... 17

3.8.3 Frequent Itemsets and Closed Frequent Itemsets ...... 18

3.9 Hadoop Architecture ...... 18

3.10 Summary ...... 18

4 Mining Closed Frequent Itemsets Using MapReduce 19

4.1 Introduction ...... 19

4.2 Related Work ...... 20

4.2.1 State of the Art ...... 20

4.3 Problem Definition ...... 21

4.3.1 Example ...... 22

4.4 The Algorithm ...... 22

4.4.1 Overview ...... 22

4.4.2 Definitions ...... 23

4.4.3 Map Phase ...... 23

4.4.4 Combiner Phase ...... 24

4.4.5 Reduce Phase ...... 24

4.4.6 Run Example ...... 25

4.4.7 Soundness ...... 28

4.4.8 Completeness ...... 28

4.4.9 Duplication Elimination ...... 29

4.5 Experiments ...... 29

4.5.1 Data ...... 29

4.5.2 Setup ...... 30

4.5.3 Measurement ...... 30

4.5.4 Experiments ...... 30

4.5.5 Results ...... 30

4.6 Conclusion ...... 31

5 Query Evaluation on Distributed Probabilistic Databases Using MapReduce 33

5.1 An Introduction to Probabilistic Databases ...... 33

5.1.1 Definitions ...... 33

5.1.2 Query Semantics on a Probabilistic Database ...... 34

5.1.3 The Dalvi-Suciu Dichotomy ...... 34

5.1.4 Safe-Plans, Projections and Functional Dependencies ...... 34

5.2 Relational Algebra using MapReduce ...... 36

5.2.1 Selection ...... 36

5.2.2 Projection ...... 37

5.2.3 Natural Join ...... 38

5.2.4 Select-Project ...... 40

5.2.5 Select-Join ...... 41

5.3 Example ...... 42

5.3.1 A Simple Query ...... 42

5.3.2 A non-Safe Execution Plan ...... 43

5.3.3 A Safe Plan ...... 45

5.3.4 First Step: Projections ...... 45

5.4 Communication Cost and Optimization of Safe and Non-Safe Execution Plans 46

5.4.1 Communication Cost of Plans ...... 46

5.4.2 Comparing the Communication Costs ...... 48

5.5 Safe Plan Optimization ...... 48

5.5.1 Motivation ...... 48

5.5.2 The MRQL Optimization is not Always Safe ...... 48

5.5.3 Safe-Preserving Algebraic Laws ...... 49

5.5.4 Estimating Cost of a Plan ...... 51

5.5.5 Plan Cost Estimation Algorithm ...... 53

5.5.6 Improving a Safe Plan ...... 53

5.5.7 Plan Optimization Algorithm ...... 54

5.5.8 Sure Rules ...... 55

5.5.9 Rules that need to be Checked ...... 56

5.5.10 Finding the Optimized Plan ...... 56

5.6 Experiments ...... 57

5.6.1 Data ...... 57

5.6.2 Setup ...... 58

5.6.3 Measurement ...... 58

5.6.4 The Query ...... 58

5.6.5 Control Group ...... 59

5.6.6 The Experiments ...... 59

5.6.7 Results ...... 59

5.7 Conclusion ...... 60

6 Universal Hadoop Cache 61

6.1 The Hadoop Distributed File System ...... 61

6.1.1 HDFS Concepts ...... 61

6.1.2 Hadoop Job Concepts ...... 63

6.1.3 Iterative Algorithms Efficiency Problem ...... 64

6.1.4 Contribution ...... 65

6.2 Related Work ...... 65

6.2.1 Data Locality ...... 66

6.2.2 Caching Layer ...... 67

6.3 Design ...... 67

6.3.1 Input Independence ...... 67

6.3.2 RAM based ...... 67

6.3.3 Current Hadoop Design ...... 68

6.3.4 Shared Memory ...... 68

6.3.5 Input Format ...... 69

6.3.6 Cache Manager ...... 69

6.4 Experiments ...... 70

6.4.1 Setup ...... 70

6.4.2 Results ...... 71

6.5 Conclusion and Discussion ...... 73

7 Conclusion and Future Work 75

Appendices 77

A List of Popular Big Data Repositories 79

Bibliography 81

List of Figures

3.1 The MapReduce data flow ...... 14

4.1 Comparing our proposed algorithm with MapReduce adaptations of Closet and AFOPT on the synthetic data. The lines represent the communication cost of the three algorithms for each minimum support. The bars present the number of closed frequent itemsets found for each minimum support ...... 31

4.2 Comparing our proposed algorithm with MapReduce adaptations of Closet and AFOPT on the real data. The lines represent the communication cost of the three algorithms for each minimum support. The bars present the number of closed frequent itemsets found for each minimum support ...... 32

4.3 Comparing the execution time of our proposed algorithm with MapReduce adaptations of Closet and AFOPT on the real data ...... 32

5.1 SQL formalization of the example query, qe. Notice the DISTINCT keyword: the query should retrieve only the different combinations of department id and rank id. If two employees in the same department have the same rank, then only a single tuple should appear in the result. Also note that although in a non-probabilistic database there would be no need to do the join of the three relations, because the information is contained in the Emp relation, here we need to do the join to make sure the probabilities are correct ...... 44

5.2 Comparing the execution time of the three plans on the synthetic and NELL (real) data ...... 59

6.1 Comparing the execution time of the 1st iteration on each of the data sources. 71

6.2 Comparing the execution time of the whole job on each of the data sources. . 72

6.3 Comparing the execution time of different numbers of iterations on the Twitter API data source ...... 72

List of Algorithms

3.1 MapReduce algorithm for counting the number of occurrences of each word in a dataset of documents ...... 14

3.2 The Mapper of the word-count algorithm...... 15

3.3 The Reducer of the word-count algorithm...... 15

4.1 Main: Mine Closed Frequent Itemsets ...... 25

4.2 Mapper ...... 25

4.3 Combiner ...... 26

4.4 Reducer ...... 26

5.1 The Mapper of the selection operation...... 37

5.2 The Mapper of the projection operation...... 38

5.3 The Reducer of the projection operation...... 38

5.4 The Mapper of Natural-join ...... 39

5.5 The Reducer of Natural-join ...... 39

5.6 The Mapper of the selection-projection operation...... 40

5.7 The Reducer of the selection-projection operation...... 40

5.8 The Mapper of the selection-join operation...... 41

5.9 The Reducer of the selection-join operation...... 42

5.10 Function optPhyPlan ...... 54

5.11 Function safePlan(q)...... 55

5.12 Procedure generateAllEquivalent(P)...... 55

5.13 Function findBestPlan(q)...... 57

Abstract

The exponential growth of data in current times and the demand to gain information and knowledge from the data present new challenges for database researchers. Known database systems and algorithms are no longer capable of effectively handling such large data sets.

MapReduce is a novel programming paradigm for processing distributable problems over large-scale data using a computer cluster. The idea that lies at the core of the paradigm is to use many low-priced commodity machines, rather than a few high-end, high-priced computers and storage systems. The key feature of MapReduce is its simplicity: the programmer only needs to implement two functions, map and reduce, and all the rest is handled automatically.

In this work we explore the MapReduce paradigm from three different angles.

We begin by examining a well-known problem in the field of data mining: mining closed frequent itemsets over a large dataset. By harnessing the power of MapReduce, we present a novel algorithm for mining closed frequent itemsets that outperforms existing algorithms.

Next, we explore one of the fundamental implications of "Big Data": the data is not known with complete certainty. A probabilistic database is a relational database with the addendum that each tuple is associated with a probability of its existence. A natural development of MapReduce is a distributed relational database management system, where relational algebra has been reduced to a combination of map and reduce functions. We take this development a step further by proposing a query optimizer over a distributed, probabilistic database.

Finally, we analyze the best-known implementation of MapReduce, called Hadoop, aiming to overcome one of its major drawbacks. Hadoop is designed for data processing in a single pass with one MapReduce job. It does not directly support the explicit specification of data that is repeatedly processed throughout different jobs. In Hadoop, the input of a job is always loaded from its original location (be it a distributed file system, a database or any user-specified source). When applying different jobs to the same data, the repeatedly processed input needs to be loaded for each job - creating a considerable disk and network I/O performance overhead. Many data-mining algorithms, such as clustering and association-rule mining, require iterative computation: the same data are processed again and again until the computation converges or a stopping condition is satisfied. From Hadoop's perspective, iterative programs are different jobs executed one after the other. We propose a modification to Hadoop such that it supports efficient access to the same data across different jobs. We have implemented it and demonstrated a large performance increase.

Keywords: Data Mining, Databases

Chapter 1

Introduction

There were 5 exabytes of information created between the dawn of civilization

through 2003, but that much information is now created every 2 days.

Eric Schmidt, CEO of Google, 2010

1.1 The Age of Data

This era is the era of digital data. Data is everywhere. People constantly generate data: customers buy products at retailers and the transactions are stored, smart-phone holders film videos and upload them to various video storage providers, people take pictures and text their friends, people update their Facebook status, tweet their thoughts, leave comments and talk-backs around the web, click on ads and unceasingly generate more and more data. Machines, as well, generate and store more and more digital data: surveillance cameras record hundreds of thousands of video hours, the Internet-of-Things and other sensor networks produce endless amounts of data, satellites collect unthinkable amounts of information, cars store information about the road, the status of the engine and so on.

Consider the following: 1

• Google Photos [24], a service which recently celebrated its one-year anniversary, holds 13.7 petabytes of photos that users have uploaded and 2 trillion labels.

• Facebook hosts more than 240 billion photos, growing at 7 petabytes per month.

• Every second, on average, around 6,000 tweets are tweeted on Twitter, which corresponds to over 350,000 tweets sent per minute, 500 million tweets per day and around 200 billion tweets per year.

The data trends are clear: (1) every person’s data footprint, which is quite large at current times, will grow significantly in the coming years, and (2) the amount of data generated by machines will be even greater than the data generated by people. Machine logs, sensor networks, retail transactions - all of these contribute to the growing pile of data.

1 All the statistics are taken from [40].

1.2 Making Sense of It

In parallel, the amount of data made available for the public increases every year. Retailers, hospitals, universities and other organizations understand that publishing some of their data is to their advantage. Success in the future will be largely dictated by the ability to extract value from other organizations’ data.

Initiatives such as Public Data Sets on Amazon Web Services [3] and Kaggle [32] exist to accommodate the "information commons," where data can be freely (or, in the case of AWS, for a modest price) shared for anyone to download and analyze. In addition, Kaggle also hosts competitions: it connects organizations that do not have access to advanced machine learning with data scientists and statisticians who crave real-world data on which to develop their techniques. A list of interesting Big Data sources can be found in Appendix A.

Mashups between different information sources make for unexpected and unimaginable applications that are now within reach. For example, the Astrometry.net [34] project watches the Astrometry group on Flickr [20] (a free photo hosting service) for new photos of the night sky. It analyzes each image and identifies which part of the sky it is from, as well as any interesting celestial bodies, such as stars or galaxies. This project shows the kind of ideas that are possible when data (such as tagged photographic images) is made available and used for something (image analysis) that was not anticipated by its creator.

1.3 More Data vs. Better Algorithms

Today, most researchers agree that "more data usually beats better algorithm" [17], which is to say that for some problems (such as recommending movies or music based on past preferences), no matter how sophisticated your algorithms are, they can often be beaten simply by having more data (and a less sophisticated algorithm). On the one hand, Big Data is here, but on the other hand, we are struggling to store and analyze it.

The problem is quite simple: while the storage capacities of hard drives have increased significantly over the years, access speeds - the rate at which data can be read from drives

- have not kept up. One typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around five minutes. Over 20 years later, one terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk.

This is a long time to read all data on a single drive - and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once. Imagine if we had

100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in less than two minutes.
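As a quick sanity check on this arithmetic (a sketch only; the figures are the rounded ones quoted above), a few lines of Python reproduce the read times:

# Rough read-time arithmetic for the drive figures quoted above.
drive_1990_mb, speed_1990_mbps = 1370, 4.4        # ~1990 drive: 1,370 MB at 4.4 MB/s
drive_now_mb, speed_now_mbps = 1_000_000, 100.0   # ~1 TB drive at ~100 MB/s

print(drive_1990_mb / speed_1990_mbps / 60)       # ~5.2 minutes for the 1990 drive
print(drive_now_mb / speed_now_mbps / 3600)       # ~2.8 hours for the 1 TB drive
print(drive_now_mb / 100 / speed_now_mbps / 60)   # 100 drives in parallel: ~1.7 minutes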

Only using one hundredth of a disk may seem wasteful. However, we can store one hundred datasets, each of which is one terabyte, and provide shared access to them. We can imagine that the users of such a system would be happy to share access in return for shorter analysis times, and, statistically, that their analysis jobs would likely be spread over time, so they would not interfere with each other.

There is more to being able to read and write data in parallel to or from multiple disks, though.

The first problem to solve is hardware failure: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high. A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that in the event of failure, there is another copy available. This is how RAID works.

The second problem is that most analysis tasks need to be able to combine the data in some way; data read from one disk may need to be combined with the data from any of the other 99 disks. Various distributed systems allow data to be combined from multiple sources, but doing this correctly is notoriously challenging.

1.4 MapReduce

MapReduce [14] provides a programming model that abstracts the problem from disk reads and writes, transforming it into a computation over sets of keys and values. We will look at the details of this model in Chapter 3, but the important point for the present discussion is that there are two parts to the computation, the map and the reduce, and it is the interface between the two where the "combination" occurs.

Google was the first to publicize MapReduce, a system they had used to scale their data processing needs. This system raised a lot of interest because many other businesses were facing similar scaling challenges, and it was not feasible for everyone to reinvent their own proprietary tool. Doug Cutting, the originator of the acclaimed Lucene [6], and an employee of Yahoo! at that time, saw an opportunity and led the charge to develop an open source version of this MapReduce system called Hadoop. Soon after, Yahoo and others rallied around to support this effort. Today, Hadoop is a core part of the computing infrastructure for many web companies, such as Yahoo, Facebook, LinkedIn, and Twitter. Many more traditional businesses, such as media and telecom, are beginning to adopt this system as well.

1.5 Thesis

1.5.1 Main Contributions

This thesis makes the following three contributions:

• A novel algorithm for mining closed frequent itemsets over large data using the MapReduce programming paradigm. Our experiments show that this algorithm outperforms adaptations of existing algorithms to the MapReduce framework. A paper describing this algorithm was published in SWSTE 2016 [23].

• A query optimizer for relational algebra queries over a probabilistic and distributed relational database. The query execution plan, which is the output of the optimizer, is a sequence of MapReduce jobs. We have tested our optimizer against two optimizers: one that takes into account query safety and one that does not, and our optimizer outperforms the two.

• A design and implementation of a caching mechanism for Hadoop, the most popular MapReduce implementation to date. Our experiments show that using the cache mechanism improves the performance of iterative MapReduce jobs.

1.5.2 Outline

This thesis is structured as follows: Chapter 2 provides general background to the MapReduce programming paradigm, probabilistic databases and the closed frequent itemsets problem; Chapter 3 reviews in depth the MapReduce programming paradigm: it describes its origins and components, gives examples of algorithms using it and briefly touches on how the open-source project Hadoop has implemented those components; in Chapter 4 we introduce the closed frequent itemsets problem and present a novel algorithm for mining closed frequent itemsets using the MapReduce paradigm and the Hadoop implementation. In Chapter 5 we describe probabilistic databases with a focus on the Dalvi-Suciu dichotomy of safe and non-safe queries, and present an optimizer of safe queries over a distributed setting using MapReduce. In Chapter 6 we describe a cache mechanism for Hadoop that solves an efficiency problem that arises in iterative MapReduce algorithms. Finally, Chapter 7 comprises our conclusions and a discussion of future work.

Chapter 2

Background & Related Work

It is a capital mistake to theorize before one has data. Insensibly, one begins to

twist the facts to suit theories, instead of theories to suit facts.

A Scandal in Bohemia, Sherlock Holmes

This chapter presents a brief background of the thesis. A more elaborate background and related work are provided in the appropriate chapters.

2.1 MapReduce

MapReduce [14] is a programming paradigm for parallel processing of large-scale, distributed data using a computer cluster. This paradigm was developed to answer the rising need for big data processing due to the exponential growth in stored digital data. The idea that lies at the core of the model is to use many low-priced, commodity machines, rather than a few high-end, high-priced computers and storage systems. The paradigm is designed in a way that relieves the programmer of the tedious job of synchronizing threads and handling processes on nodes: the programmer only needs to implement two functions, map and reduce, and the framework takes care of all the rest. The most popular implementation of the MapReduce model is Hadoop [21], developed by Yahoo! Labs and now maintained by Apache.

Since MapReduce is one of the corner-stones of this work, a much deeper explanation of the paradigm is presented in Chapter 3.

2.1.1 Hadoop Architecture

In brief, Hadoop comprises two components: (a) a distributed file system called HDFS, and (b) a framework for executing distributed MapReduce jobs. Hadoop is deployed over a computer cluster such that each node (computer) in the cluster takes part in both of the roles above. The two components complement one another: the file system usually holds the input for the processing job, and therefore the system tries to assign the node holding the data to perform the actual processing.

A deeper coverage is presented in Chapter 6.

2.2 Probabilistic Database

Every relational database has a hidden assumption: if a tuple is in a database, then the existence of this tuple is certain. However, most real databases contain data whose correctness is uncertain. In order to work with such data, there is a need to quantify the integrity of the data. This is achieved by using probabilistic databases.

A probabilistic database is an uncertain relational database in which the possible worlds have associated probabilities. Probabilistic database management systems are currently an active area of research.

Probabilistic databases distinguish between the logical data model and the physical representation of the data, much like relational databases do in the ANSI-SPARC [31] Architecture. In probabilistic databases this is even more crucial, since such databases have to represent very large numbers of possible worlds, often exponential in the size of one world (a classical database), succinctly.
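As a toy illustration (not taken from the thesis), consider a tuple-independent relation with two uncertain tuples; every subset of the tuples is a possible world, and its probability is the product of the per-tuple existence choices. A minimal Python sketch, with made-up names R, t1 and t2:

from itertools import combinations

# A tuple-independent probabilistic relation: each tuple exists
# independently with its own marginal probability (illustrative values).
R = {"t1": 0.7, "t2": 0.4}

# Enumerate all possible worlds (subsets of tuples) and their probabilities.
worlds = []
for k in range(len(R) + 1):
    for world in combinations(R, k):
        p = 1.0
        for t, pt in R.items():
            p *= pt if t in world else (1.0 - pt)
        worlds.append((set(world), p))

for world, p in worlds:
    print(world or "{}", round(p, 3))

# The probabilities of all possible worlds sum to 1.
assert abs(sum(p for _, p in worlds) - 1.0) < 1e-9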

An in-depth background is presented in Chapter 5.

2.3 Closed Frequent Itemsets

We are given a large set of items (e.g. stuff sold in a supermarket) and a large set of baskets, where each basket is a small subset of the items (e.g. the stuff one customer buys together). The task is easy to explain: find the sets of items that appear frequently in the baskets. How does one define frequent? First we define the notion of support: the support (sometimes referred to as the support count) of an itemset e is the number of baskets that contain e. Support can be an integer, representing the actual number of baskets containing the itemset, or a percentage, representing the ratio of the baskets containing the itemset out of the total number of baskets. Given a minimum support threshold minSup, an itemset having a support higher than minSup is a frequent itemset.

Agrawal and Srikant introduced the first frequent pattern mining algorithm, A-priori [2], in 1994. This algorithm is based on the a-priori property, from which it derives its name. This property states that if a pattern e1 is not frequent then any pattern e2 that contains e1 cannot be frequent. In other words: all non-empty subsets of a frequent itemset must also be frequent [26].

The number of frequent itemsets can be exponential with respect to the size of the input database. To overcome this difficulty, the concept of closed frequent itemsets was introduced by Pasquier et al in [42]. An itemset e is closed in a dataset D if there exists no proper super-itemset e′ such that e′ has the same support as e in D. An itemset e is a closed frequent itemset in D if e is both closed and frequent in D.

Let C be the set of closed frequent itemsets in a dataset D for a given minSup. Suppose that for every itemset in C we are given its support count. C and the given support information can be used to derive all the frequent itemsets. Therefore, we can say that C contains the complete information regarding frequent itemsets.
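To make the last claim concrete, the following small Python sketch (my own illustration; the supports are toy values that happen to match the example database used later in Chapter 4) derives the support of any frequent itemset from the closed frequent itemsets alone: sup(x) is the maximum support of a closed itemset containing x.

# Recovering the support of any frequent itemset from the closed ones:
# sup(x) = max{ sup(c) : c is a closed frequent itemset and x ⊆ c }.
closed = {
    frozenset("a"): 3,
    frozenset("e"): 4,
    frozenset("cf"): 4,
    frozenset("cef"): 3,
}

def support(x):
    # Support of itemset x, derived only from the closed frequent itemsets.
    sups = [s for c, s in closed.items() if x <= c]
    return max(sups) if sups else 0   # 0: x is not frequent

print(support(frozenset("c")))    # 4 -- same support as its closure {c, f}
print(support(frozenset("ce")))   # 3 -- same support as its closure {c, e, f}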

In their paper [42], Pasquier et al presented an A-priori-based algorithm called A-Close. Pei et al proposed an efficient closed itemset mining algorithm based on the frequent pattern growth method, named CLOSET [43], which was further refined as CLOSET+ [51].

Chapter 3

MapReduce

"In pioneer days they used to use oxen for heavy pulling, and when one ox couldn’t

budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger

computers, but for more systems of computers."

Grace Hopper, United States Navy Rear Admiral

3.1 The Problem

The only practical approach for dealing with Big Data problems today, with the current limitations of RAM and computing power, is the "divide-and-conquer" approach. Divide-and-conquer is not a new idea, but rather a fundamental concept in computer science that is introduced very early in undergraduate studies. The basic idea is to decompose a large problem into smaller sub-problems, solve each of them, and then combine all the sub-solutions into a single solution for the large problem. To the extent that the sub-problems are independent (i.e. the process of solving each sub-problem is self-contained), they can be dealt with concurrently by different worker threads or processes, which can be extended to many machines in a computer cluster or even to many clusters. The sub-solutions (later referred to as "intermediate results") from each individual worker are then combined to produce the final result.

The principles behind the divide-and-conquer approach are practical for a wide range of problems in many different application domains. However, the details of their implementations are varied and complex. For example, the following are just some of the issues that need to be tackled:

• Decomposition: How to decompose a problem into independent, parallel-executable, smaller tasks?

• Resource Management: How to keep track of free and busy resources in a computer cluster?

• Task Assignment: How to assign tasks to workers distributed across a computer cluster, ship them to the workers, and start them?

• Failure Handling: How to know that a task has failed? How to handle failure of workers? Should the task be given to another worker, or given another try on the failed one?

• Job Management: How to orchestrate the entire process?

In classic parallel or distributed programming environments, the developer needs to explicitly handle many of the above issues. In shared memory programming, the developer needs to explicitly coordinate access to shared data structures through synchronization primitives such as mutexes, to explicitly handle process synchronization through devices such as barriers, and to remain ever vigilant for common problems such as deadlocks and race conditions. Language extensions, like OpenMP for shared memory parallelism, or libraries implementing the Message Passing Interface (MPI) for cluster-level parallelism, provide logical abstractions that hide details of operating system synchronization and communication primitives. However, even with these extensions, developers are still burdened with keeping track of how resources are made available to workers. Additionally, these frameworks are mostly designed to tackle processor-intensive problems and have only rudimentary support for dealing with very large amounts of input data. When using existing parallel computing approaches for large-data computation, the programmer must devote a significant amount of attention to low-level system details, which detracts from higher-level problem solving.

3.2 The Solution

MapReduce provides the developer with an abstraction layer over the low-level details mentioned before. Therefore, a developer can focus on the algorithms, or computations that need to be performed, as opposed to how those computations are actually carried out or how to get the data to the processes that depend on them.

However, organizing and coordinating large amounts of computation is only part of the challenge. Big data processing requires bringing data and code together for the computation to take place. MapReduce addresses this challenge by providing a simple abstraction for the developer, transparently handling most of the details behind the scenes in a scalable, robust, and efficient manner. Instead of moving large amounts of data around, it is far more efficient to move the code to the data. This is operationally realized by spreading data across the local disks of nodes in a cluster and running processes on nodes that hold the data. The complex task of managing storage in such a processing environment is typically handled by a distributed file system that sits underneath MapReduce.

3.3 The Origins of MapReduce: map and fold

The origins of MapReduce lie in functional programming 1. One of the key features of functional programming is the concept of higher-order functions, or, in other words, functions being first-class citizens. It means that functions are values, just like any other primitive values (such as integers, floats, booleans etc.) or complex values (such as lists, class instances etc.), and can be passed as arguments to other functions. Two of the most familiar higher-order functions are map and fold. map takes as arguments a function f (that takes a single argument) and a list l, and applies f to all the elements in l one by one. fold takes as arguments a function g (that takes two arguments), an initial value i and a list l. At first, g is applied to i and the first item in l, and the result of this application is stored in an intermediate variable. This intermediate variable and the next item in l serve as the arguments to a second application of g, the result of which is stored in the intermediate variable (overwriting the previous value). This process repeats until all items in l have been processed; fold then returns the final value of the intermediate variable. Typically, map and fold are used in combination. For example, to compute the sum of squares of a list of integers, one could map a function that squares its argument (i.e., λx.x²) over the input list, and then fold the resulting list with the addition function (more precisely, λxy.x + y) using an initial value of zero.

The following Scheme code demonstrates the map function:

(map (lambda (x) (* x x)) (list 1 2 3 4))

The output would be:

(1 4 9 16)

The following Scheme code demonstrates the fold function 2:

(fold-left (lambda (x y) (+ x y)) 0 (list 1 2 3 4))

The output would be:

10

We can think of map as a concise way to represent the transformation of elements in a dataset (as defined by the function f). Keeping a similar line of thought, we can think of fold as an aggregation operation, as defined by the function g. One immediate observation

1 Functional programming is a programming paradigm that treats computation as the evaluation of mathematical functions and avoids changing state and mutable data. Some of the renowned functional programming languages are Common Lisp, Scheme, Haskell and F#. All modern programming languages support functional programming: Java (from version 8), Python and Javascript.
2 In Scheme, there are two fold functions: fold-right and fold-left. In fold-right the input lists are traversed from right to left, and in fold-left from left to right.

is that the application of f to each element in a list (or more generally, to elements in a large dataset) can be parallelized in a straightforward manner, since each functional application happens independently of other elements. In a cluster, these operations can be distributed across many different machines.

The fold operation, on the other hand, has more technical restrictions on the input data: elements in the list must be "downloaded" before the function g can be applied. However, many real-world applications do not require g to be applied to all elements of the list. To the extent that elements in the list can be divided into groups, the fold aggregations can also proceed in parallel. Furthermore, for operations that are commutative and associative, significant efficiencies can be gained in the fold operation through local aggregation and appropriate reordering.
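A minimal sketch of that last point (illustrative only; the function names are mine): for a commutative and associative operation such as addition, the fold can be computed as local folds over chunks of the list followed by a fold of the partial results.

from functools import reduce

def parallel_fold(op, init, data, chunks=4):
    # Fold `op` over `data` by folding each chunk locally and then folding
    # the partial results; valid when `op` is commutative and associative
    # (e.g., addition), mirroring combiner-style local aggregation.
    size = max(1, len(data) // chunks)
    parts = [data[i:i + size] for i in range(0, len(data), size)]
    partials = [reduce(op, part, init) for part in parts]   # could run in parallel
    return reduce(op, partials, init)

squares = list(map(lambda x: x * x, [1, 2, 3, 4]))
print(parallel_fold(lambda x, y: x + y, 0, squares))  # 30, same as a sequential fold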

3.4 MapReduce Basics

In a nutshell, the above is a description of MapReduce. As will be elaborated later, the MapReduce paradigm is composed of two phases: map and reduce. The map phase corresponds to the map operation in functional programming, whereas the reduce phase roughly corresponds to the fold operation in functional programming. The MapReduce execution framework orchestrates the map and reduce phases of processing large amounts of data on large clusters of commodity machines.

From a developer perspective, MapReduce provides a generic platform for processing large datasets that consists of two steps. In the first step, a user-defined computation is applied over all input records in a dataset. These operations occur concurrently and produce intermediate output that is later aggregated by another user-defined computation. The programmer defines these two types of computations, and the execution framework coordinates the actual processing. Although such a two-stage processing structure may appear to be very restrictive, many interesting algorithms can be expressed quite concisely, especially if one decomposes complex algorithms into a sequence of MapReduce jobs.

So what exactly is MapReduce? MapReduce can be associated with two different but related concepts. First, MapReduce is the programming model discussed above. Second, it may refer to the software implementation of this programming model.

The MapReduce model was originally developed by Google researchers Dean and Ghemawat [14] in 2004 in an internal white paper. Since its presentation, MapReduce has enjoyed widespread implementations (Dryad [30] by Microsoft, Disco [41] by Nokia, MapReduce-MPI [44] by Sandia National Laboratories, Phoenix [45] by Stanford and many more), but the most prominent one is Hadoop [21]. Hadoop, which is written completely in Java, acts both as a distributed file system (called HDFS) and a MapReduce processing framework (called simply Hadoop). It was developed by Yahoo and is now an open-source Apache project. A vibrant software community has sprung up around Hadoop, with significant activity in both industry and academia.

3.5 Mappers and Reducers

The data structure that stands at the base of all processing in MapReduce is the key-value pair. Keys and values may be primitive types, such as integers, floating point values, strings, and raw bytes, or they may be arbitrarily complex structures (lists, tuples, associative arrays, etc.). Programmers typically need to define their own custom data types, although a number of libraries, such as Apache Avro [5], simplify the task.

Since key-value pairs play a major role in the MapReduce model, the input datasets need to be described as such. For example, for a dataset of documents, the keys may be document ids, and the values may be the actual text of the documents. For a dataset of Twitter tweets, the keys may be a direct URL for the tweet, and the value may be the text of the tweet. In a relational database, the key may be a tuple's identifier and the value the tuple itself.

As mentioned earlier, a MapReduce computation task is composed of two sub-tasks (or steps): the mapper step and the reducer step. At the core of these two steps lie two user-defined functions: map and reduce. The signatures of the two functions are as follows:

map :(k1, v1) → [(k2, v2), ...]

reduce :(k2, [v2, ...]) → [(k3, v3), ...]

The input to a MapReduce job is a collection of key-value pairs, of types k1 and v1 respectively, read from files that reside on some distributed file system (it is worth emphasizing here that the concept of a key here is not that of a database key, in the sense that it can be duplicated).

The map function is applied to every input key-value pair to generate an arbitrary number of intermediate key-value pairs. The reduce function is applied to all values associated with the same intermediate key to generate output key-value pairs. The transition between the two steps is called the shuffle phase. It is performed via a master controller that receives all the key-value pairs from all the mappers, sorts them by key and divides them among the reducers so that all key-value pairs with the same key end up at the same reducer. However, no ordering relationship is guaranteed for keys across different reducers. Output key-value pairs from each reducer are written persistently back onto the distributed file system (whereas intermediate key-value pairs are transient and not preserved). The output ends up in r files on the distributed file system, where r is the number of reducers. For the most part, there is no need to merge reducer output, since the r files often serve as input to

yet another MapReduce job. (Additional details regarding the distributed file system can be found in Chapter 6.)

Figure 3.1: The MapReduce data flow.

A schematic of a MapReduce computation flow can be seen in Figure 3.1.

As an example, Algorithm 3.1 is a simple word count algorithm in MapReduce. This algorithm counts the number of occurrences of every word in a text corpus. The input is key-value pairs in the form of (documentId, document), where documentId is a unique identifier of the document and document is the document itself (a bag of words). The map task gets as input a single key-value pair of the above form, iterates over all the words in the document, and for every word emits a key-value pair as follows: the word itself takes the role of the key, and the integer 1 takes the role of the value. The MapReduce framework makes sure that all pairs with the same key arrive at the same reducer. Therefore, the reducer only needs to sum up all the ones associated with each word, and it emits final key-value pairs with the word as the key and the count as the value.

Algorithm 3.1: MapReduce algorithm for counting the number of occurrences of each word in a dataset of documents.
Input: corpus, a dataset of documents.
Result: All the words that appear in any of the documents, with their occurrence counts.
  intermediate_results ← φ
  for doc ∈ corpus do
      intermediate_results.add(map(doc))
  end
  group_by ← shuffle(intermediate_results)
  for (word, countList) ∈ group_by do
      reduce(word, countList)
  end

Algorithm 3.2: The Mapper of the word-count algorithm.
Input: doc, the document itself (a sequence of words). This is the value part of the key-value pair; in this example, the key part does not play any role.
Result: Key-value pairs, where the key is a word from the document and the value is the number 1.
  for word ∈ doc do
      emit (word, 1)
  end

Algorithm 3.3: The Reducer of the word-count algorithm.
Input: word, the word to be counted; [c1, c2, ..., cn], a sequence of numbers.
Result: The word with its count.
  sum ← 0
  for c ∈ [c1, c2, ..., cn] do
      sum ← sum + c
  end
  emit (word, sum)

MapReduce allows for distributed processing of the map and reduce operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel - though in practice it is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of reducers can perform the reduction phase - provided all outputs of the map operation that share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly large datasets. For example, a server farm can use MapReduce to sort a petabyte of data in less than an hour [48]. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled - assuming the input data is still available.
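For concreteness, the following self-contained Python sketch (my own illustration, not the thesis code) simulates the same word-count job end to end: a map step over each document, a shuffle that groups intermediate pairs by key, and a reduce step per key, mirroring Algorithms 3.1-3.3.

from collections import defaultdict

def map_fn(doc_id, doc):
    # Mapper: emit (word, 1) for every word in the document.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Group all intermediate values by key, as the framework would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(word, counts):
    # Reducer: sum all the 1s associated with a word.
    return (word, sum(counts))

corpus = {"d1": "to be or not to be", "d2": "to map and to reduce"}
intermediate = [pair for doc_id, doc in corpus.items() for pair in map_fn(doc_id, doc)]
results = [reduce_fn(word, counts) for word, counts in shuffle(intermediate).items()]
print(sorted(results))  # e.g. [('and', 1), ('be', 2), ..., ('to', 4)]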

In addition to the "classic" MapReduce processing flow, other variations are also possible. One variation, for example, is MapReduce programs that do not contain reducers, in which case mapper output is directly written to disk (one file per mapper). This pattern fits problems that do not require aggregation, e.g., parsing a large text collection or independently analyzing a large number of images. The converse is possible, with the map function being the identity function and simply passing input key-value pairs to the reducers. This has the effect of sorting and regrouping the input for reduce-side processing. Similarly, in some cases it is useful for the reducer to implement the identity function, in which case the program simply sorts and groups mapper output. Finally, running identity mappers and reducers has the effect of regrouping and resorting the input data (which is sometimes useful).

3.6 Partitioners and Combiners

We have thus far presented a simplified view of MapReduce. There are two additional components that complete the programming model: partitioners and combiners.

Partitioners are responsible for dividing up the intermediate key space and assigning intermediate key-value pairs to reducers. In other words, the partitioner specifies the reducer task to which an intermediate key-value pair must be copied. Within each reducer, keys are processed in sorted order. The simplest partitioner involves computing the hash value of the key and then taking the mod of that value with the number of reducers, which can be configured. This method assigns approximately the same number of keys to each reducer (dependent on the quality of the hash function) and therefore creates a balance in the workloads. Note, however, that the partitioner only considers the key and ignores the value - therefore, a roughly even partitioning of the key space may nevertheless yield large differences in the number of key-value pairs sent to each reducer (since different keys may have different numbers of associated values). This imbalance in the amount of data associated with each key is relatively common in many text processing applications due to the Zipfian distribution of word occurrences.
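A minimal sketch of such a hash partitioner (illustrative only; Hadoop's default partitioner similarly applies the key's hash code modulo the number of reduce tasks):

def partition(key, num_reducers):
    # Assign an intermediate key to a reducer by hashing it and taking
    # the remainder modulo the number of reducers.
    return hash(key) % num_reducers

# All pairs sharing a key land on the same reducer; different keys are
# spread roughly evenly (subject to the quality of the hash function).
for key in ["apple", "banana", "apple", "cherry"]:
    print(key, "->", partition(key, 4))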

Combiners are an optimization in MapReduce that allow for mapper-local aggregation before the shuffle and sort phase. We can motivate the need for combiners by considering the word count algorithm given earlier, which emits a key-value pair for each word in the collection. Furthermore, all these key-value pairs need to be copied across the network, and so the amount of intermediate data will be larger than the input collection itself. This is clearly inefficient. One solution is to perform local aggregation on the output of each mapper, i.e., to compute a local count for a word over all the documents processed by the mapper. With this modification (assuming the maximum amount of local aggregation possible), the number of intermediate key-value pairs will be at most the number of unique words in the collection times the number of mappers (and typically far smaller because each mapper may not encounter every word).

The combiner in MapReduce supports such an optimization. One can think of combiners as "local reducers" that operate on the output of the mappers, prior to the shuffle and sort phase. Each combiner operates in isolation and therefore does not have access to intermediate output from other mappers. The signature of the combiner function is as follows:

combine :(k2, [v2, ...]) → (k2, v2)

The combiner is provided with keys and values associated with each key (the same types as the mapper output keys and values). Critically, one cannot assume that a combiner will have the opportunity to process all values associated with the same key. The combiner can emit any number of key-value pairs, but the keys and values must be of the same type as the mapper output (same as the reducer input). In cases where an operation is both associative and commutative (e.g., addition or multiplication), reducers can directly serve as combiners.

In general, however, reducers and combiners are not interchangeable.
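As an illustration (a sketch, not framework code), a combiner for the word-count job can pre-aggregate each mapper's output locally, shrinking the number of intermediate pairs shipped across the network:

from collections import Counter

def combine(mapper_output):
    # Local, per-mapper aggregation of (word, 1) pairs into (word, partial_count).
    # Valid here because addition is commutative and associative.
    partial = Counter()
    for word, count in mapper_output:
        partial[word] += count
    return list(partial.items())

mapper_output = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
print(combine(mapper_output))   # [('to', 2), ('be', 2), ('or', 1), ('not', 1)]
print(len(mapper_output), "->", len(combine(mapper_output)))  # 6 -> 4 pairs shipped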

In many cases, proper use of combiners can make the difference between an impractical algorithm and an efficient algorithm. A combiner can significantly reduce the amount of data that needs to be copied over the network, resulting in much faster algorithms. The output of the mappers is processed by the combiners, which perform local aggregation to cut down on the number of intermediate key-value pairs. The partitioner determines which reducer will be responsible for processing a particular key, and the execution framework uses this information to copy the data to the right location during the shuffle and sort phase.

Therefore, a complete MapReduce job consists of code for the mapper, reducer, combiner, and partitioner, along with job configuration parameters. The execution framework handles everything else.

3.7 Communication-Cost Model

In general, there may be several performance measures for evaluating the performance of an algorithm in the MapReduce model, such as clock time or the number of flops. In this study we follow the communication-cost model as described by Afrati and Ullman in [1]: a task is a single map or reduce process, executed by a single computer in the network. The communication cost of a task is the size of the input to this task. Note that the initial input to a map task, meaning the input that resides in a file, is also counted as input. Also note that we do not distinguish map tasks from reduce tasks in this respect. The total communication cost is the sum of the communication costs of all the tasks in the MapReduce job.
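In symbols, the total communication cost of a MapReduce job J can be written as the sum of the input sizes of all of its map and reduce tasks:

cost(J) = Σ_{t ∈ tasks(J)} |input(t)|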

3.8 Short Review of Algorithms for MapReduce

We now review some common MapReduce algorithms.

3.8.1 Minimum, Maximum, and Count of Groups

This algorithm calculates the minimum, maximum, and count of a field per specific value of another field (similar to a GROUP BY in SQL). The mapper processes input values by extracting the required fields from each input record. The input key is ignored. The output key is the group-by field, and the output value is the field we wish to aggregate.

After the grouping operation, the reducer simply iterates through all the values associated with the group, finds the minimum and maximum, and counts the number of members in the key grouping.
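A sketch of this pattern in Python (the field names department and salary are made up for the example):

from collections import defaultdict

def map_fn(record):
    # Emit (group-by field, field to aggregate); the input key is ignored.
    return (record["department"], record["salary"])

def reduce_fn(group_key, values):
    # Compute the minimum, maximum and count of all values sharing the key.
    return (group_key, {"min": min(values), "max": max(values), "count": len(values)})

records = [
    {"department": "sales", "salary": 100},
    {"department": "sales", "salary": 150},
    {"department": "r&d", "salary": 120},
]
groups = defaultdict(list)
for rec in records:
    key, value = map_fn(rec)
    groups[key].append(value)
print([reduce_fn(k, vs) for k, vs in groups.items()])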

3.8.2 Filtering and Joins

A more elaborate explanation regarding filtering (selection) and joins appears in Chapter 5.

17 3.8.3 Frequent Itemsets and Closed Frequent Itemsets

One of the contributions of this thesis is a closed frequent itemsets algorithm for MapReduce.

This algorithm, along with a MapReduce variation of the classic A-priori algorithm [42], is detailed in Chapter 4.

3.9 Hadoop Architecture

As mentioned earlier, Hadoop comes out of the box with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem. HDFS is designed for data processing in a single pass: the input data is read once before the execution of the map function and is disposed of at the end of the execution of all the reducers. Therefore, repeated reads of the same data, as required by some iterative algorithms, create an enormous load on the network. This problem is the major motivation for creating a cache mechanism for iterative MapReduce algorithms, described in more detail in Chapter 6.

3.10 Summary

This chapter provides an overview of the MapReduce programming model, starting with its roots in functional programming and continuing with a description of mappers, reducers, partitioners, and combiners. Attention is also given to the performance measure defined for evaluating the performance of a MapReduce application.

Chapter 4

Mining Closed Frequent Itemsets Using MapReduce

Without data you’re just another person with an opinion.

W. Edwards Deming

4.1 Introduction

Mining of frequent itemsets for association rules has been studied thoroughly in centralized static datasets [25] and data streams [12] settings. A major branch in this research field is mining closed frequent itemsets instead of frequent itemsets for discovering non-redundant association rules. A set of closed frequent itemsets is proven to be a complete yet compact representation of the set of all frequent itemsets [42]. Mining closed frequent itemsets instead of frequent itemsets saves computation time and memory usage, and produces a more compact output. Many algorithms, like Closet, Closet+, CHARM and FP-Close [25], have been presented for mining closed frequent itemsets in centralized datasets.

Mining closed frequent itemsets in big, distributed data is more challenging than mining centralized data in the following aspects: (1) The distributed setting is a shared-nothing environment (one can, of course, share data, but it is very expensive in terms of communication), meaning that assumptions like shared memory and shared storage, which lie at the base of most algorithms, no longer apply. (2) Data transfer is more expensive than data processing, meaning that performance measurements change. (3) The data is huge and cannot reside on a single node.

MT-Closed [38] and D-Closed [37] are two parallel algorithms for mining closed frequent itemsets; however, both suffer from a few drawbacks (see Section 4.2) and, as will be shown, our scheme overcomes these drawbacks.

19 Our Contribution. In this chapter, we present a novel algorithm for mining closed frequent itemsets in big, distributed data settings, using the MapReduce paradigm. Using

MapReduce makes our algorithm very pragmatic and relatively easy to implement, maintain and execute. In addition, our algorithm does not require a duplication elimination step, which is common to most known algorithms (it makes both the mapper and reducer more complicated, but it gives better performance).

4.2 Related Work

The first algorithm for mining closed frequent itemsets, A-Close, was introduced in [42]. It presented the concept of a generator - a set of items that generates a single closed frequent itemset. A-Close implements an iterative generation-and-test method for finding closed frequent itemsets. In each iteration, generators are tested for frequency, and non-frequent generators are removed. An important step is duplication elimination: generators that create an already existing itemset are also removed. The surviving generators are used to generate the next candidate generators. A-Close was not designed to work in a distributed setting.

4.2.1 State of the Art

MT-Closed [38] is a parallel algorithm for mining closed frequent itemsets. It uses a divide-and-conquer approach on the input data to reduce the amount of data to be processed during each iteration. However, its parallelism is limited. MT-Closed is a multi-threaded algorithm designed for a multi-core architecture. Though superior to a single-core architecture, a multi-core architecture is still limited in its number of cores, and its memory is limited in size and must be shared among the threads. In addition, the input data is not distributed, and an index-building phase is required.

D-Closed [37] is a shared-nothing, distributed algorithm for mining closed frequent itemsets. It is similar to MT-Closed in the sense that it recursively explores a subtree of the search space: in every iteration a candidate is generated by adding items to a previously found closure, and the dataset is projected by all the candidates. It differs from MT-Closed in providing a clever method to detect duplicate generators: it introduces the concepts of pro-order and anti-order, and proves that among all candidates that produce the same closed itemset, only one will have no common items with its anti-order set. However, there are a few drawbacks to D-Closed: (1) it requires a pre-processing phase that scans the data and builds an index that needs to be shared among all the nodes, (2) the set of all possible items also needs to be shared among all the nodes, and (3) the input data to each recursive call is different, meaning that iteration-wise optimizations, like caching, cannot be used.

Wang et al [52] have proposed a parallelized AFOPT-close algorithm [36] and have implemented it using MapReduce. Like the previous algorithms, it also works in a divide-and-conquer way: first a global list of frequent items is built, then a parallel mining of local closed frequent itemsets is performed, and finally, non-global closed frequent itemsets are filtered out, leaving only the global closed frequent itemsets. However, they still require that final step of checking the globally closed frequent itemsets, which might be very heavy, depending on the number of local results.

4.3 Problem Definition

Let I = {i1, i2, ..., im} be a set of items with a lexicographic order. An itemset x is a set of items such that x ⊆ I. A transactional database D = {t1, t2, ..., tn} is a set of itemsets, each called a transaction. Each transaction in D is uniquely identified with a transaction identifier (TID), and is assumed to be sorted lexicographically. The difference between a transaction and an itemset is that an itemset is an arbitrary subset of I, while a transaction is a subset of I that exists in D and is identified by its TID. The support of an itemset x in D, denoted supD(x), or simply sup(x) when D is clear from the context, is the number of transactions in D that contain x (sometimes it is given as the percentage of transactions).

Given a user-defined minimum support denoted minSup, an itemset x is called frequent if sup(x) ≥ minSup.

Let T ⊆ D be a subset of transactions from D and let x be an itemset. We define the following two functions f and g:

f(T ) = {i ∈ I|∀t ∈ T , i ∈ t}

g(x) = {t ∈ D|∀i ∈ x, i ∈ t}

Function f returns the intersection of all the transactions in T , and function g returns the set of all the transactions in D that contain x. Notice that g is antitone, meaning that for two itemsets x1 and x2: x1 ⊆ x2 ⇒ g(x1) ⊇ g(x2). It is trivial to see that sup(x) = |g(x)|. The function h = f ◦ g is called the Galois operator or closure operator.

An itemset x is closed in D iff h(x) = x. Equivalently, an itemset x is closed in D iff there exists no itemset that is a proper superset of x and has the same support in D.

Given a database D and a minimum support minSup, the mining closed frequent itemsets problem is finding all frequent and closed itemsets in D.

TID Transaction
t1 {a, c, d, e, f}
t2 {a, b, e}
t3 {c, e, f}
t4 {a, c, d, f}
t5 {c, e, f}

Table 4.1: Example database. TID is the transaction identifier.

4.3.1 Example

Let I = {a, b, c, d, e, f}, let minSup = 3 and let D be the transaction database presented in Table 4.1. Consider the itemset {c}. It is a subset of transactions t1, t3, t4 and t5, meaning that sup(c) = 4, which is greater than minSup. However {c, f}, which is a proper superset of {c}, is also a subset of the same transactions. Thus {c} is not a closed itemset, since sup({c, f}) = sup({c}). The list of all closed frequent itemsets is: {a}, {c, f}, {e} and {c, e, f}.
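To make the operators f, g and h concrete, the following is a small illustrative Java sketch (ours, not part of the thesis implementation) that evaluates them over the example database of Table 4.1 and confirms that {c} is not closed while {c, f} is:

import java.util.*;

// Illustrative sketch of the Galois/closure operator h = f o g
// on the example database of Table 4.1.
public class ClosureExample {
    static Set<String> set(String... items) { return new TreeSet<>(Arrays.asList(items)); }

    // The example database D: one set of items per transaction (t1, ..., t5).
    static final List<Set<String>> D = Arrays.asList(
            set("a", "c", "d", "e", "f"),   // t1
            set("a", "b", "e"),             // t2
            set("c", "e", "f"),             // t3
            set("a", "c", "d", "f"),        // t4
            set("c", "e", "f"));            // t5

    // g(x): all transactions in D that contain the itemset x.
    static List<Set<String>> g(Set<String> x) {
        List<Set<String>> result = new ArrayList<>();
        for (Set<String> t : D)
            if (t.containsAll(x)) result.add(t);
        return result;
    }

    // f(T): the intersection of all transactions in T.
    static Set<String> f(List<Set<String>> T) {
        Set<String> inter = new TreeSet<>(T.get(0));
        for (Set<String> t : T) inter.retainAll(t);
        return inter;
    }

    // h(x) = f(g(x)); x is closed iff h(x) = x.
    static Set<String> h(Set<String> x) { return f(g(x)); }

    public static void main(String[] args) {
        System.out.println("sup({c})  = " + g(set("c")).size());  // 4
        System.out.println("h({c})    = " + h(set("c")));         // [c, f]: {c} is not closed
        System.out.println("h({c,f})  = " + h(set("c", "f")));    // [c, f]: {c, f} is closed
    }
}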

We now present an algorithm for mining frequent closed itemsets in a distributed setting using the MapReduce paradigm.

4.4 The Algorithm

4.4.1 Overview

Our algorithm is iterative, where each iteration is a MapReduce job. The inputs for iteration i are

1. D, the transaction database and

2. Ci−1, the set of the closed frequent itemsets found in the previous iteration (C0, the input for the first iteration, contains only the empty itemset).

The output of iteration i is Ci, a set of closed frequent itemsets that have a generator of length i. If Ci ≠ ∅ then another iteration, i + 1, is performed; otherwise the algorithm stops. As mentioned earlier, each iteration is a MapReduce job, comprised of a map phase and a reduce phase. The map phase, which is equivalent to the g function, emits sets of items called closure generators (or simply generators). The reduce phase, which is equivalent to the f function, finds the closure that each generator produces and decides whether or not it should be added to Ci. Each set added to Ci is paired with its generator; the generator is needed for the next iteration.

The output of the algorithm, which is the set of all closed frequent itemsets, is the union of all the Ci's. Before the iterations begin, we run a preprocessing phase that finds the frequent items; we have found that this greatly improves performance, even though it requires another MapReduce job and its output must be shared among all mapper tasks. This MapReduce job simply counts the support of all items and keeps only the frequent ones.

Pseudo-code of the algorithm is presented in Algorithm 4.1. We provide explanations of the important steps in the algorithm.

4.4.2 Definitions

To better understand the algorithm, we need some definitions:

Definition 4.4.1 Let p be an itemset, and let c be a closed itemset such that h(p) = c, then p is called a generator of c.

Note that a closed itemset might have more than one generator: In the example above, both {c} and {f} are the generators of {c, f}.

Definition 4.4.2 An execution of a map function on a single transaction is called a map task.

Definition 4.4.3 An execution of a reduce function on a specific key is called a reduce task.

4.4.3 Map Phase

A map task in iteration i gets 3 parameters as input: (1) a set of all the closed frequent itemsets (with their generator) found in the previous iteration, denoted Ci−1 (which is shared among all the mappers in the same iteration), (2) a single transaction denoted t, and

(3) the set of all frequent items in D (again, this set is also shared among all the mappers in the same iteration and in all iterations). (Note that in the Hadoop implementation the mapper gets a set of transactions called split and the mapper object calls the map function for each transaction in its own split only).

For each c ∈ Ci−1, if c ⊆ t then t holds the potential of producing a new closed frequent itemset, and we use it to form a new generator. For each item ∈ (t \ c) we check whether item is frequent. If so, we concatenate item to the generator of c (denoted c.generator), thus creating g (we denote the added item as g.item), a potential new generator of other closed frequent itemsets. The function emits a message where g is the key and the tuple (t, 1) is the value. (The “1” is later summed up and used to count the support of the itemset.)

Notice that g is not only a generator but is always a minimal generator: concatenating an item that is not in the closure guarantees reaching another minimal generator. More precisely, the map step generates all minimal generators that extend c.generator by one additional item and that are supported by t. Since all transactions are processed, every minimal generator with a support of at least 1 is emitted at some point. A theorem is proved in Section 4.4.8.

Pseudo code of the map function is presented in Algorithm 4.2.

4.4.4 Combiner Phase

A combiner is not a part of the MapReduce programming paradigm but a Hadoop implementation detail that minimizes the data transferred between map and reduce tasks. Hadoop gives the user the option of providing a combiner function that is run on the map output, on the same machine running the mapper; the output of the combiner function is then the input for the reduce function.

In our implementation we have used a combiner, which is quite similar to the reducer but much simpler. The input to the combiner is a key and a collection of values: the key is the generator g (which is an itemset), and the values are tuples, each composed of a transaction that contains g and a number s indicating the support of that tuple.

Since the combiner is “local” by nature, it makes no use of the minimum support parameter, which must be applied from a global point of view.

The combiner sums the support of the input tuples, stores it in the variable sum and then performs an intersection on the tuples, to get t0.

The combiner emits a message where g is the key and the tuple (t0,sum) is the value.

Pseudo code of the combiner function is presented in Algorithm 4.3.

4.4.5 Reduce Phase

A reduce task gets as input a key, a collection of values, and the minimum support. The key is the generator g (which is an itemset); the values are a collection (t1, s1), ..., (tn, sn) of n tuples, each composed of a transaction ti that contains g and a number si indicating the support of that tuple. In addition, it gets as a parameter the user-given minimum support, minSup. At first, the frequency property is checked: sup(g) = Σi=1..n si ≥ minSup. If it holds, then an intersection of t1, ..., tn is performed and a closure, denoted c, is produced. If the item that was added in the map step is lexicographically greater than the first item in c \ g, then c is a duplicate and can be discarded. Otherwise a new closed frequent itemset is discovered and is added to Ci. In other words, if the test in line 7 passes, then it is guaranteed that the same closure c is found (and kept) in another reduce task: the one that will receive c from its first minimal generator in the lexicographical order. A theorem is proved in Section 4.4.9.

Pseudo code of the reduce function is presented in Algorithm 4.4.

In line 5 of the algorithm we perform the f function, which is simply an intersection of all the transactions in T. Notice that we do not need to read all of T and store it in RAM: T can be treated as a stream, reading transactions one at a time and performing the intersection incrementally.
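As an illustration of this streaming computation of f, the following Java sketch (ours, not the actual Hadoop reducer code) folds the transactions into their intersection one at a time, never holding more than the current transaction and the running intersection in memory:

import java.util.*;

// Illustrative sketch: computing f(T) by streaming over the transactions
// one at a time, as described above, instead of loading all of T into RAM.
public class StreamingIntersection {
    static Set<String> intersect(Iterator<Set<String>> transactions) {
        Set<String> closure = null;
        while (transactions.hasNext()) {
            Set<String> t = transactions.next();
            if (closure == null) closure = new TreeSet<>(t);  // first transaction seen
            else closure.retainAll(t);                        // running intersection
        }
        return closure == null ? Collections.emptySet() : closure;
    }

    public static void main(String[] args) {
        // The transactions containing {c} in the example database.
        List<Set<String>> values = Arrays.asList(
                new TreeSet<>(Arrays.asList("a", "c", "d", "e", "f")),  // t1
                new TreeSet<>(Arrays.asList("c", "e", "f")),            // t3
                new TreeSet<>(Arrays.asList("a", "c", "d", "f")),       // t4
                new TreeSet<>(Arrays.asList("c", "e", "f")));           // t5
        System.out.println(intersect(values.iterator()));  // prints [c, f]
    }
}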

Algorithm 4.1: Main: Mine Closed Frequent Itemsets
Input: minSup: the user-given minimum support; D: the database of transactions
Output: all closed frequent itemsets
1  f_items ← findFrequentItems(D, minSup);
2  C0 ← {∅};
3  i ← 0;
4  repeat
5      i ← i + 1;
6      Ci ← MapReduceJob(D, minSup, Ci−1, f_items);
7  until Ci = ∅;
8  return the union of Cj for j = 1..i;

Algorithm 4.2: Mapper
Input: Ci−1: the set of closed frequent itemsets, with their generators, found in the previous iteration
       t: a single transaction from D
       f_items: the set of all frequent items
Output: Key: a potentially new generator
        Value: a transaction that contains the generator, and its support
1  foreach c ∈ Ci−1 do
2      if c ⊆ t then
3          t' ← t \ c;
4          foreach item ∈ t' do
5              if item ∈ f_items then
6                  g ← concat(c.generator, item);
7                  emit(⟨g, (t, 1)⟩);
8              end
9          end
10     end
11 end

4.4.6 Run Example

Consider the example database D in Table 4.1 with a minimum support of 2 transactions (i.e., 40% of the transactions). To simulate a distributed setting we assume that each transaction ti resides on a different machine (mapper node) in the network, denoted mi.

1st Map Phase. We track node m1. Its input is the transaction t1, and since this is the first iteration, Ci−1 = C0 = {∅}. For each frequent item in the input transaction we emit a message containing the item as the key and the transaction as the value.

Algorithm 4.3: Combiner
Input: Key: g, a generator
       Values: (t1, s1), ..., (tn, sn): a collection of n tuples such that ti is a transaction and si is its support
Output: Key: a potentially new generator
        Value: a transaction that contains the generator, and its support
1  sum ← Σi=1..n si;
   /* Apply the f function on t1, ..., tn, i.e. intersect t1, ..., tn: */
2  t' ← f(t1, ..., tn);
3  emit(⟨g, (t', sum)⟩);

Algorithm 4.4: Reducer
Input: Key: g, a generator
       Values: (t1, s1), ..., (tn, sn): a collection of n tuples such that ti is a transaction and si is its support
       minSup: the user-given minimum support
Output: A closed frequent itemset, if found
1  supp ← Σi=1..n si;
2  if supp < minSup then
3      return;
4  end
   /* Apply the f function on t1, ..., tn, i.e. intersect t1, ..., tn: */
5  c ← f(t1, ..., tn);
6  c.generator ← g;
   /* Duplicate elimination: */
7  if g.item > min(c \ g) then
8      return;
9  end
10 emit(c);

The messages that m1 emits are the following: ⟨{a}, {a, c, d, e, f}⟩, ⟨{c}, {a, c, d, e, f}⟩, ⟨{d}, {a, c, d, e, f}⟩, ⟨{e}, {a, c, d, e, f}⟩, and ⟨{f}, {a, c, d, e, f}⟩.

1st Reduce Phase. According to the MapReduce paradigm, a reducer task is assigned to every key. We follow the reducer tasks assigned for keys {a}, {c} and {f}, denoted Ra, Rc and Rf respectively.

First, consider Ra. According to the MapReduce paradigm, this reduce task receives, in addition to the key {a}, all the transactions in D that contain that key: t1, t2 and t4. First, we must test for frequency: there are 3 transactions containing the key, and since minSup = 2 we pass the frequency test and can go on. Next, we intersect all the transactions, producing the closure {a}. The final check is the duplicate-elimination test of line 7: since the closure equals the generator, c \ g is empty and the closure is kept, so we add {a} to C1.

Next, consider Rc. This reduce task receives the key {c} and the transactions t1, t3, t4 and t5. Since the number of messages is 4, we pass the frequency test. The intersection of the transactions is the closure {c, f}. Finally, the added item c is lexicographically smaller than f, the first item of {c, f} \ {c}, so we add {c, f} to C1.

Finally, consider Rf. The transactions that contain the set {f} are t1, t3, t4 and t5. We pass the frequency test, but the intersection is {c, f}, just like in reduce task Rc, so we have a duplicate result. However, the added item f is lexicographically greater than c, the first item of {c, f} \ {f}, so this closure is discarded.

The final set of all closed frequent itemsets found in the first iteration is: C1 = {{a : a}, {c, f : c}, {e : e}} (the itemset after the colon is the generator of the closure).

2nd Map Phase. As before, we follow node m1. This time the set of closed frequent itemsets is not empty, so according to the algorithm, we iterate over all c ∈ C1. If the input transaction t contains c, we add to the generator of c each of the items in t \ c, one at a time, and emit the result. So the messages that m1 emits are the following: ⟨{a, c}, {a, c, d, e, f}⟩, ⟨{a, d}, {a, c, d, e, f}⟩, ⟨{a, e}, {a, c, d, e, f}⟩, ⟨{a, f}, {a, c, d, e, f}⟩, ⟨{c, d}, {a, c, d, e, f}⟩, ⟨{c, e}, {a, c, d, e, f}⟩, ⟨{d, e}, {a, c, d, e, f}⟩, ⟨{e, f}, {a, c, d, e, f}⟩.

2nd Reduce Phase. Consider reduce task Rac. According to the MapReduce paradigm, this reduce task receives all the messages containing the key {a, c}, i.e., transactions t1 and t4. Since minSup = 2 we pass the frequency test. Next, we consider the key {a, c} as a generator and intersect all the transactions, getting the closure {a, c, d, f}. The final check is whether the added item (c) is lexicographically larger than the first item of the closure minus the generator. In our case it is not, so we add {a, c, d, f} to the set of closed frequent itemsets.

Closed Itemset   Generator   Support
{a}              {a}         3
{c, f}           {c}         4
{e}              {e}         4
{a, c, d, f}     {a, c}      2
{a, e}           {a, e}      2
{c, e, f}        {c, e}      3

Table 4.2: Complete set of closed frequent itemsets in the example database for a minimum support of 2 transactions, with their generators and support.

The full set of closed frequent itemsets is shown in table 4.2.

Next we prove the soundness and completeness of the algorithm.

4.4.7 Soundness

The map phase ensures that the input to each reducer is a key p, which is a set of items, together with the set of all transactions that contain p, denoted T. By definition T = g(p). The reducer first checks that sup(p) ≥ minSup by verifying that the summed support, which equals |T| = |g(p)|, is at least minSup, and then performs an intersection of all the transactions in T, which by definition is the result of the function f(T), and outputs the result. So, by definition, every output is the result of f ◦ g applied to a frequent itemset, which is a closed frequent itemset.

4.4.8 Completeness

We need to show that the algorithm outputs all the closed frequent itemsets. Consider a closed frequent itemset c = {i1, ..., in} (which we must show is produced by the algorithm). First, suppose that c has no proper subset that is a closed frequent itemset. Then for every item ij ∈ c, sup(ij) = sup(c) and g(ij) = g(c), and therefore h(ij) = h(c) = c. Since h(ij) = c, each ij is a generator of c, and the algorithm outputs c in the first iteration.

Now suppose that c has one or more proper subsets that are closed frequent itemsets. We examine the largest one and denote it l, and we denote its generator gl, so that g(l) = g(gl). Since g is antitone and gl ⊆ l ⊂ c, we have g(gl) ⊃ g(c) (the containment is strict because l and c are distinct closed itemsets, so sup(l) > sup(c)). What we show next is that by adding one of the items of c that is not in l to gl we generate c. Consider an item i such that i ∈ c \ l, and let gl' = gl ∪ {i}. Then g(gl') = g(gl) ∩ g(i) = g(l) ∩ g(i). Assume that g(gl') ⊃ g(c). This would imply that gl' is a generator of a closed itemset h(gl') that properly contains l and is a proper subset of c, in contradiction to l being the largest closed proper subset of c. Therefore g(gl') = g(c), meaning that c will be found by the mapper by adding an item to gl (see lines 3-4 in Algorithm 4.2).

4.4.9 Duplication Elimination

As we saw in the run example in Section 4.4.6, a closed itemset can have more than one generator, meaning that two different reduce tasks can produce the same closed itemset. Furthermore, these two reduce tasks can belong to two different iterations. We therefore have to identify duplicate closed itemsets and eliminate them. The naive way to eliminate duplicates is to submit another MapReduce job that sends all identical closed itemsets to the same reducer; however, an extra MapReduce job per iteration would greatly damage performance. Line 7 in Algorithm 4.4 takes care of duplicates without the need for another MapReduce round. In the run example we have already seen how it works when the duplication happens in the same round.

What we would like to show is that the duplication elimination step does not "lose" any closed itemset.

We now explain the method.

Consider a closed frequent itemset c = {i1, i2, ..., in} and one of its generators g = {i_{g_1}, i_{g_2}, ..., i_{g_m}}, m < n, such that h(g) = c. According to our algorithm, g was created by adding an item to the generator of a previously found closed itemset. We denote that previous generator p and the added item i_{g_j}, so that g = p ∪ {i_{g_j}}. Suppose that i_{g_j} is lexicographically greater than the first item of c \ g. In the context of the algorithm this means that c would be eliminated by the test in line 7. We must show that c can still be produced from a different generator. Consider ik, the smallest item in c \ g. Since ik ∈ c it is frequent, and since ik ∉ g then surely ik ∉ p, meaning that the algorithm will also add it to p, creating g' = p ∪ {ik}. It is possible that h(g') ⊂ c; however, if we keep growing the generator with the smallest missing items, we will eventually reach c.
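The duplicate-elimination test of line 7 in Algorithm 4.4 can be captured by a small predicate. The following Java sketch (ours, with hypothetical names) returns true exactly when a reduce task should discard the closure, and reproduces the decisions made for Rc and Rf in the run example:

import java.util.*;

// Illustrative sketch of the duplicate-elimination test (line 7 of Algorithm 4.4):
// discard the closure c if the item added in the map step is lexicographically
// greater than the first item of c \ g.
public class DuplicateCheck {
    static boolean isDuplicate(String addedItem, SortedSet<String> closure, Set<String> generator) {
        SortedSet<String> rest = new TreeSet<>(closure);
        rest.removeAll(generator);                                   // c \ g
        return !rest.isEmpty() && addedItem.compareTo(rest.first()) > 0;
    }

    public static void main(String[] args) {
        SortedSet<String> cf = new TreeSet<>(Arrays.asList("c", "f"));
        // Generator {c} (added item "c"): kept, since c < f.
        System.out.println(isDuplicate("c", cf, new TreeSet<>(Collections.singleton("c"))));  // false
        // Generator {f} (added item "f"): discarded, since f > c.
        System.out.println(isDuplicate("f", cf, new TreeSet<>(Collections.singleton("f"))));  // true
    }
}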

4.5 Experiments

We have performed several experiments in order to verify the efficiency of our algorithm and to compare it with other renowned algorithms.

4.5.1 Data

We tested our algorithm on both real and synthetic datasets. The real dataset was downloaded from the FIMI repository [22] and is called “webdocs”. It contains close to 1.7 million transactions (each transaction is a web document) with 5.3 million distinct items (each item is a word). The maximal length of a transaction is about 71 thousand items. The size of the dataset is 1.4 Gigabytes. A detailed description of the “webdocs” dataset, including various statistics, can be found in [39].

The synthetic dataset was generated using the IBM data generator [28]. We generated six million transactions with an average of ten items per transaction, drawn from a total of 100k distinct items. The total size of the input data is 600 MB.

4.5.2 Setup

We ran all the experiments on the Amazon Elastic MapReduce [4] infrastructure: each run was executed on sixteen machines, each with SSD-based instance storage for fast I/O performance, a quad-core CPU and 15 GiB of memory. All machines ran Hadoop version 2.6.0 with Java 8.

4.5.3 Measurement

We used communication cost (see Section 3.7) as the main measurement for comparing the performance of the different algorithms. The input records to each map task and reduce task were simply counted and summed up at the end of the execution. This count is performed on each machine in a distributed manner. Hadoop provides an internal input-records counter that makes the counting and summing task extremely easy.

Communication cost is an infrastructure-free measurement, meaning that it is not affected by weaker or stronger hardware or by temporary network overloads, making it our measurement of choice. However, we also measured the execution time: we ran each experiment three times and report the average time.

4.5.4 Experiments

We have implemented the following algorithms: (i) a naive adaptation of Closet to MapReduce, (ii) the AFOPT-close adaptation to MapReduce, and (iii) our proposed algorithm. All algorithms were implemented in Java 8, taking advantage of its new support for lambda expressions.

We ran the algorithms on the two datasets with different minimum supports, and mea- sured the communication cost and execution time for each run.

4.5.5 Results

The first batch of runs was conducted on the synthetic dataset. The results can be seen in Figure 4.1: the lines represent the communication cost of the three algorithms for each minimum support, and the bars present the number of closed frequent itemsets found for each minimum support. As can be seen, our algorithm outperforms the others in terms of communication cost for all minimum supports.

Figure 4.1: Comparing our proposed algorithm with MapReduce adaptations of Closet and AFOPT on the synthetic data. The lines represent the communication cost (in millions) of the three algorithms for each minimum support (0.0025 down to 0.001); the bars present the number of closed frequent itemsets found for each minimum support.

In the second batch of runs we ran the implemented algorithms on the real dataset with four different minimum supports, and measured the communication cost and execution time for each run.

The results can be seen in Figures 4.2 and 4.3, which show the communication cost and the execution time, respectively, for the different minimum supports. As can be seen, our algorithm outperforms the existing algorithms.

4.6 Conclusion

We have presented a new, distributed and parallel algorithm for mining closed frequent itemsets using the popular MapReduce programming paradigm. Besides its novelty, using MapReduce makes this algorithm easy to implement, relieving the programmer of the tedious work of handling concurrency, synchronization and node management that comes with a distributed environment, and allowing them to focus on the algorithm itself.

In addition, as mentioned in section 4.4.1, one of the input parameters to each iteration of the algorithm is D, the database. This parameter is the dominant one in terms of size. Since this input is static (does not change from one iteration to the next) we might use a caching mechanism, further increasing the efficiency. We still need to share the set of closed frequent itemsets found in the current iteration, Ci, among all the nodes, however this data is significantly smaller than D.

Figure 4.2: Comparing our proposed algorithm with MapReduce adaptations of Closet and AFOPT on the real data. The lines represent the communication cost (in millions) of the three algorithms for each minimum support (0.5 down to 0.2); the bars present the number of closed frequent itemsets found for each minimum support.

Figure 4.3: Comparing the execution time (in minutes) of our proposed algorithm with MapReduce adaptations of Closet and AFOPT on the real data, for minimum supports from 0.5 down to 0.2.

Chapter 5

Query Evaluation on Distributed Probabilistic Databases Using MapReduce

In God we trust. All others must bring data.

W. Edwards Deming

5.1 An Introduction to Probabilistic Databases

5.1.1 Definitions

A probabilistic database [11], denoted Dp, is a relational database, meaning that it is composed of relations, each containing tuples, with the addition that each tuple (in all relations) is associated with a probability: a number between 0 and 1. In this model we assume that each tuple is probabilistically independent of all other tuples. An instance, or possible world, of a probabilistic database is a subset of its tuples without the probabilities, so an instance is in fact a regular relational database. Each such instance is associated with a probability of its existence, computed by multiplying the probabilities of the tuples it contains and the complement probabilities 1 − p(t) of the tuples it does not contain. The number of possible instances of a probabilistic database Dp is 2^|Dp|, where |Dp| is the number of tuples in Dp. For a probabilistic database Dp, the set of all possible worlds is denoted pwd(Dp). The sum of the probabilities of all possible worlds is 1.0.
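To make the possible-world semantics concrete, the following illustrative Java sketch (ours) computes the probability of a possible world as the product of p(t) for the tuples it contains and 1 − p(t) for the tuples it leaves out, using the tuple probabilities of the example database introduced later in Section 5.3 (Table 5.1):

import java.util.*;

// Illustrative sketch: the probability of a possible world is the product of
// p(t) for tuples in the world and (1 - p(t)) for tuples left out.
public class PossibleWorldProbability {
    static double worldProbability(Map<String, Double> tupleProb, Set<String> world) {
        double p = 1.0;
        for (Map.Entry<String, Double> e : tupleProb.entrySet())
            p *= world.contains(e.getKey()) ? e.getValue() : 1.0 - e.getValue();
        return p;
    }

    public static void main(String[] args) {
        Map<String, Double> prob = new LinkedHashMap<>();
        prob.put("e1", 0.6); prob.put("e2", 0.9); prob.put("e3", 0.8);
        prob.put("d1", 0.5); prob.put("d2", 0.2);
        prob.put("r1", 0.5); prob.put("r2", 0.1);
        // The possible world D2 of Table 5.2: every tuple except r2.
        Set<String> d2 = new TreeSet<>(prob.keySet());
        d2.remove("r2");
        System.out.println(worldProbability(prob, d2));  // about 0.01944 (0.019 in Table 5.2)
    }
}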

5.1.2 Query Semantics on a Probabilistic Database

On a deterministic relational database D, a query, denoted q, produces a new relation as its result. This new relation is comprised of all the tuples in D that satisfy the conditions in q, and is denoted q(D).

The result of a query on a probabilistic relational database Dp is a new relation consisting of pairs (t, p), such that t is a possible tuple, meaning that it is in the query's answer in at least one of the possible worlds in pwd(Dp), and p is the probability that t is in q(w), where w is chosen randomly from pwd(Dp). In other words, p represents the marginal probability of the probabilistic event "the tuple t is in the result relation produced by the query q" over the space of possible worlds. In practice, q returns a set of pairs (t1, p1), (t2, p2), ..., (tn, pn), where n is the number of tuples in the result relation, ordered by the probability pi.

5.1.3 The Dalvi-Suciu Dichotomy

Following the query semantics, given a query q and a probabilistic database Dp, the calculation of the probability of each tuple in the result relation would require executing q over all the possible worlds (i.e., pwd(Dp)) and summing up the probabilities of all the possible worlds that contain this tuple. The complexity of such a procedure is linear in the number of possible worlds, which is exponential in the number of tuples in Dp.

In Dalvi and Suciu’s acclaimed paper [13], the authors show that some queries can be calculated efficiently (i.e. not exponential) by building an execution plan, tailored for each query, that produces the result relation with the right probability. Such a query that can be calculated efficiently is called a safe query, and its execution plan is called a safe plan.

Other queries cannot have a safe plan (and are called non-safe queries), and the only way to compute them is by iterating over all possible worlds. This distinction between safe and non-safe queries is called the Dalvi-Suciu dichotomy. In addition, for queries that have a safe plan, the authors have introduced an algorithm that produces such a plan.

5.1.4 Safe-Plans, Projections and Functional Dependencies

What makes a plan non-safe? In simple words, an execution plan becomes non-safe when a projection operation merges two or more probability-dependent tuples into one.

As mentioned earlier, when two or more tuples t1, ..., tn are merged into a single tuple tm, the probability of tm is calculated as follows: p(tm) = 1 − (1 − p(t1)) · ... · (1 − p(tn)). This calculation is valid under the assumption that all the tuples t1, ..., tn are independent of each other. If they are not, then this calculation is not valid and therefore the plan is not safe.
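A minimal Java sketch (ours) of this merge rule for independent tuples:

// Illustrative sketch: probability of a tuple obtained by merging independent
// tuples t1, ..., tn in a projection: p(tm) = 1 - (1 - p(t1)) * ... * (1 - p(tn)).
public class MergeProbability {
    static double merge(double... probabilities) {
        double noneAppears = 1.0;
        for (double p : probabilities) noneAppears *= (1.0 - p);
        return 1.0 - noneAppears;
    }

    public static void main(String[] args) {
        // Example: merging two independent tuples with probabilities 0.6 and 0.9.
        System.out.println(merge(0.6, 0.9));  // 0.96
    }
}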

Since we assume that the tuples in all relations are independent, the case of merging dependent tuples can occur only after a join. However, not all joins cause a dependency between tuples; whether a join may cause it is determined by the set of functional dependencies of the database schema.

Definitions

Consider a probabilistic database schema R, its set of functional dependencies Γp and a conjunctive query q.

Definition 5.1.1 Rels(q) = {R1, ..., Rk},Ri ∈ R is the set of all relations occurring in q.

Definition 5.1.2 Let R be a probabilistic relation. R.E is the attribute that represents the probabilistic event, or provenance.

Definition 5.1.3 Attr(q) is the set of all attributes of all the relations in q.

Definition 5.1.4 Head(q) is the set of attributes that are in the output of query q.

Definition 5.1.5 Γp(q) is the induced functional dependency on Attr(q).

• Every FD ∈ Γp is also in Γp(q).

• For every join predicate Ri.A = Rj.B in q, both Ri.A → Rj.B and Rj.B → Ri.A are in Γp(q).

• For every selection predicate Ri.A = c in q, ∅ → Ri.A is in Γp(q).

The main theorem on the safety of execution plans is as follows:

Theorem 5.1.6 The project operator ΠA1,...,Ak(q) is safe iff for every R ∈ Rels(q) the following can be inferred from Γp(q): A1, ..., Ak, R.E → Head(q).

In simple words, Theorem 5.1.6 means that a projection operation is safe as long as no probabilistically dependent tuples are merged in the result relation. This property is checked by examining the functional dependencies of the relation being projected: if the projected attributes together with the probabilistic event attribute form a key of the projected relation, it means one of the following:

1. The attributes themselves are not keys, but the probabilistic events are independent. This is usually the case when projecting a base relation (i.e., not the result relation of some sub-query), or

2. The attributes are keys.

Proof of Theorem 5.1.6 can be found in [13].

As will be discussed later, this property is used in our optimization algorithm when we try to optimize project operations.

5.2 Relational Algebra using MapReduce

The common relational algebra operations can easily be translated into combinations of map and reduce functions, which makes MapReduce applicable to a distributed relational database management system.

In this section a short review of the MapReduce adaptation of the common relational algebra operators will be presented.

While most of the material is known, the presentation in this context and its communication-cost analysis is new.

5.2.1 Selection

Definition

A selection is a unary operation that specifies, by a condition, the tuples to retain from a relation. A selection can have the form σaθb(R) or σaθc(R), where:

• R is a relation

• a and b are attribute names of R

• θ is a binary comparison operator, specifically one of: <, ≤, =, ≠, ≥, >

• c is a constant value

The selection operation σaθb(R) returns all the tuples in R such that the condition θ holds between the attributes a and b. The selection operation σaθc(R) returns all the tuples in R such that the condition θ holds between the attributes a and the constant value c. The schema of the result relation is the same schema of the input relation, R.

More formally, the semantics of the selection operation is defined as follows:

σaθb(R) = {t : t ∈ R, θ(t[a], t[b]) = true}

σaθc(R) = {t : t ∈ R, θ(t[a], c) = true}

MapReduce Adaptation

The selection operation can quite simply be implemented in the mapper task in the following way. Let the selection be σaθb(R). For each tuple t in R, if θ(t[a], t[b]) is true, then emit the key-value pair (_, t) (the key does not matter, since it will not be used); if the condition does not hold, emit nothing. Let the selection be σaθc(R). For each tuple t in R, if θ(t[a], c) is true, then emit the key-value pair (_, t); as before, if the condition does not hold, emit nothing. The reducer is oblivious to this operator and can be omitted or simply be the identity function.

Pseudo-code for the MapReduce adaptation of the selection operation is presented in Algorithm 5.1.

Algorithm 5.1: The Mapper of the selection operation.
Input: t: a single tuple from the relation R. a, b: attributes of R. θ: a binary comparison operator.
Result: A key-value pair, where the key does not matter, and the value is the input tuple.
1 if θ(t[a], t[b]) then
2     emit (_, t);
3 end

Communication cost

Each chunk of R is fed to one map task, so the sum of the communication costs for all the map tasks is the size of R, denoted T (R). The sum of outputs of the map tasks is somewhere between T (R), in the case that all tuples satisfy the condition, and 0 in the case that none does. So, we can say that the communication-cost of a selection is

O(T (R))

5.2.2 Projection

Definition

A projection is a unary operation that extracts columns from a relation. A projection has the

form of Πa1,...,an (R) where:

• a1, ..., an are attribute names

• R is a relation

The result of such projection is the set of tuples obtained from R by restricting the attributes of the tuples to a1, ..., an. An important notion is that the result is defined as a set, meaning that if two tuples, t1 and t2, are not equal, i.e. t1 6= t2, but the suppressed tuples are equal, i.e. t1[a1, ..., an] = t2[a1, ..., an], then the result relation will contain only one instance of the restricted tuples.

Formally, the semantics of the projection operation is defined as follows:

Πa1,...,an(R) = {t[a1, ..., an] | t ∈ R}

MapReduce Adaptation

The projection itself can be performed by the mapper; since, as mentioned earlier, this operation may create duplicate tuples, the duplicate-elimination process is easily done by the reducer. Let πA(R) be the projection operator, where A = a1, ..., an is a list of attributes. For each tuple t in R, let t[A] be the sub-tuple of t that contains only the attributes in A. Emit the key-value pair (t[A], _) (the value part of the pair is of no importance, and will not be used).

The reducer function eliminates all the duplicate tuples. In other words the reducer takes the input of (t[A], [_, ..., _]), discards the list of don’t-cares, then emits only the key t[A].

Pseudo-codes of the mapper and the reducer are presented in Algorithms 5.2 and 5.3 respectively.

Algorithm 5.2: The Mapper of the projection operation.
Input: t: a single tuple from the relation R. A = a1, ..., an: attributes of R.
Result: A key-value pair, where the key is the restricted tuple and the value is of no importance.
1 emit (t[A], _);

Algorithm 5.3: The Reducer of the projection operation.
Input: t[A]: the restricted tuple. [_, ..., _]: a sequence of don't-care values.
Result: A restricted tuple.
1 emit (t[A], _);

Communication cost

Each chunk of tuples of R is fed to one map task, so the sum of the communication costs for all the map tasks is T (R). The sum of output of the map task is the same as their input. Each output key-value pair is sent to one reduce task. So, the communication cost of a projection is: O(T (R))

5.2.3 Natural Join

Definition

A natural-join is a binary operator that combines tuples from two relations that agree on their common attribute names. A natural-join has the form R ⋈ S, where R and S are relations. The result is a relation consisting of all combinations of tuples in R and S whose values on the common attribute names are equal. The result is guaranteed not to have two attributes with the same name.

38 Formally, the semantics of the natural-join operation is defined as follows:

R ⋈ S = {r ∪ s | r ∈ R ∧ s ∈ S ∧ r[A] = s[A]}

Where A is the set of common attributes.

MapReduce Adaptation

For the natural-join task, we will look at the simple case of R(A, B) ⋈ S(B, C). The join task must find all pairs of tuples from the two relations that have the same value in the B attribute and merge them. In the map phase, the B value will be used as the key of the key-value pair. The value will be the complete tuple tagged with the name of its relation, so the reducer can know the original relation of each tuple. The reducer then combines the tuples and emits them.

For each tuple tr in R, the mapper emits the key-value pair (tr[B], (R, tr)), and for each tuple ts in S it emits (ts[B], (S, ts)). For all the tuples where tr[B] = ts[B], this common value is denoted b, and a single reducer receives the key b together with the list of tagged tuples [(R, tr1), (R, tr2), ..., (R, trn), (S, ts1), (S, ts2), ..., (S, tsm)]. The reducer iterates over the list, combines all the tuples tagged with R with all the tuples tagged with S, and emits the key b with the list of combined tuples: (b, [(tr1 ∪ ts1), (tr1 ∪ ts2), ..., (trn ∪ tsm)]).

Pseudo-codes of the mapper and reducer for natural-join are presented in Algorithms 5.4 and 5.5 respectively.

Algorithm 5.4: The Mapper of the natural-join operation.
Input: t: a tuple, either from R or from S.
1 if t ∈ R then
2     emit (t[B], (R, t));
3 else
4     emit (t[B], (S, t));
5 end

Algorithm 5.5: The Reducer of the natural-join operation.
Input: Key: b. Values: the list [(R, tr1), ..., (R, trn), (S, ts1), ..., (S, tsm)].
1 forall (R, tri) ∈ [(R, tr1), ..., (R, trn)] do
2     forall (S, tsj) ∈ [(S, ts1), ..., (S, tsm)] do
3         emit (b, tri ∪ tsj);
4     end
5 end
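To make the shuffle-and-combine step concrete, the following self-contained Java sketch (ours, a plain in-memory simulation rather than the Hadoop implementation) mimics Algorithms 5.4 and 5.5 for R(A, B) ⋈ S(B, C): tuples are tagged with their relation, grouped by the join value b, and combined in the reducer.

import java.util.*;

// Illustrative simulation of the MapReduce natural join R(A,B) join S(B,C):
// the "map" phase tags each tuple with its relation and keys it by the join value b;
// the "reduce" phase combines every R-tuple with every S-tuple sharing that b.
public class NaturalJoinSimulation {
    public static void main(String[] args) {
        List<String[]> R = Arrays.asList(new String[]{"a1", "b1"}, new String[]{"a2", "b2"});
        List<String[]> S = Arrays.asList(new String[]{"b1", "c1"}, new String[]{"b1", "c2"});

        // Map phase: key = join attribute value, value = (relation tag, tuple).
        Map<String, List<Object[]>> shuffled = new HashMap<>();
        for (String[] r : R)
            shuffled.computeIfAbsent(r[1], k -> new ArrayList<>()).add(new Object[]{"R", r});
        for (String[] s : S)
            shuffled.computeIfAbsent(s[0], k -> new ArrayList<>()).add(new Object[]{"S", s});

        // Reduce phase: for each key b, emit every R-tuple combined with every S-tuple.
        for (Map.Entry<String, List<Object[]>> group : shuffled.entrySet()) {
            for (Object[] left : group.getValue()) {
                if (!left[0].equals("R")) continue;
                for (Object[] right : group.getValue()) {
                    if (!right[0].equals("S")) continue;
                    String[] r = (String[]) left[1], s = (String[]) right[1];
                    System.out.println("(" + r[0] + ", " + group.getKey() + ", " + s[1] + ")");
                }
            }
        }
    }
}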

Communication Cost

Each chunk of tuples of the relations R and S is fed into one map task, so the sum of the communication costs for all the map tasks is T(R) + T(S). As before, the sum of the outputs of the map tasks is roughly as large as their input. So, the communication cost of a join is:

O(T (R) + T (S))

It is worth mentioning that Afrati and Ullman have presented another algorithm for n-way joins in their 2010 paper [1].

5.2.4 Select-Project

MapReduce Adaptation

Since a selection operation is performed in the mapper task only, we can combine it with a projection operation easily. Let the select-project be ΠA(σaθb(R)). For each tuple t in R, the mapper checks whether θ(t[a], t[b]) is true. If so, let t[A] be the sub-tuple of t that contains only the attributes in A; the mapper emits the key-value pair (t[A], _). The reducer task acts the same as the reducer for a regular projection: it eliminates all the duplicate restricted tuples. In other words, the reducer takes its input (t[A], [_, ..., _]), discards the list of don't-cares, and emits a single pair for the restricted tuple t[A].

The pseudo-code for the map and reduce functions for select-project operation are pre- sented in Algorithms 5.6 and 5.7 respectively.

Algorithm 5.6: The Mapper of the selection-projection operation.
Input: t: a single tuple from the relation R. a, b: attributes of R. θ: a binary comparison operator. A = a1, ..., an: the projected attributes.
Result: A key-value pair, where the key is the restricted tuple t[A] and the value is of no importance.
1 if θ(t[a], t[b]) then
2     emit (t[A], _);
3 end

Algorithm 5.7: The Reducer of the selection-projection operation.
Input: t[A]: the restricted tuple. [_, ..., _]: a sequence of don't-care values.
Result: A restricted tuple.
1 emit (t[A], _);

Communication cost

Each chunk of R is fed to one map task, so the sum of the communication costs for all the map tasks is the size of R, denoted T(R). The sum of the outputs of the map tasks is somewhere between T(R), in the case that all tuples satisfy the condition, and 0 in the case that none does. So, we can say that the communication cost of a select-project is

O(T (R))

5.2.5 Select-Join

Again, since a selection operation is performed in the mapper task only, we can combine it with a join operation easily. We will look at the case of σC(R(A, B)) ⋈ S(B, D). While a mapper task that handles tuples from relation S acts like the mapper in a simple join, a mapper task that loads tuples from relation R acts a little differently: first it needs to test each tuple against the condition C, and for each tuple that passes the test it continues acting as a regular join mapper task. For each tuple (a, b) in R, the mapper first checks that C((a, b)) is true; if so, it emits the key-value pair (b, (R, a)). For each tuple (b, d) in S it emits (b, (S, d)). The reducer acts the same as in a regular join. It receives a key b together with the list of pairs [(R, a1), (R, a2), ..., (R, an), (S, d1), (S, d2), ..., (S, dm)]. The reducer iterates over the list and joins all the tuples from R with all the tuples from S, and emits the key b with the list of joined tuples: (b, [(a1, b, d1), (a1, b, d2), ..., (an, b, dm)]).

The pseudo-code for the map and reduce functions for select-join operation are presented in Algorithms 5.8 and 5.9 respectively.

Algorithm 5.8: The Mapper of the selection-join operation.
Input: t: a single tuple, either from R(A, B) or from S(B, D). C: the selection condition on R.
Result: A key-value pair, where the key is the join value t[B] and the value is the tagged tuple.
1 if t ∈ R then
2     if C(t) then
3         emit (t[B], (R, t));
4     end
5 else
6     emit (t[B], (S, t));
7 end

Communication cost

Each chunk of the relations R and S is fed into one map task, so the sum of the communication costs for all the map tasks is T(R) + T(S). As before, the sum of the outputs of the map tasks is roughly as large as their input.

Algorithm 5.9: The Reducer of the selection-join operation.
Input: Key: b. Values: the list [(R, tr1), ..., (R, trn), (S, ts1), ..., (S, tsm)].
1 forall (R, tri) ∈ [(R, tr1), ..., (R, trn)] do
2     forall (S, tsj) ∈ [(S, ts1), ..., (S, tsm)] do
3         emit (b, tri ∪ tsj);
4     end
5 end

So, the communication cost of a select-join is

O(T (R) + T (S))

The above cost estimates will be used to compute the costs of the various plans.

5.3 Example

The following is an example of a distributed probabilistic database, denoted Dp_e. This example will be used throughout this chapter. The schema of the database is comprised of three relations: employees, departments and ranks, denoted Emp, Dept, and Rank respectively. We assume that each relation resides on a different node, thus making the database distributed. Each tuple is assigned a tuple-id. This tuple-id is used for easy reference to the tuple, but more importantly it is used as the random variable representing the probabilistic event associated with the tuple. Also, each tuple t is associated with a probability, denoted p(t). Table 5.1 depicts the database.

Some of the possible worlds of Dp_e, denoted pwd(Dp_e), are shown in Table 5.2.

5.3.1 A Simple Query

Consider the following simple query, denoted qe, over Dp_e:

"Retrieve all department ids and rank ids of all the employees, such that you take into

consideration the probability of the existence of the departments, ranks and employees".

The formalization of the query in SQL can be viewed in Figure 5.1.

The query is a natural join between all three relations in Dp_e, and the result relation can be viewed in Table 5.3. Note that in a non-probabilistic database there would be no need to join the three relations, because the information is contained in the Emp relation; here we need the join to make sure the probabilities are correct.

The probability associated with each tuple in the result relation is the sum of the probabilities of all the possible worlds that return that tuple when queried with qe. Notice that this result is calculated directly from the semantics of the query; an execution plan is not yet being considered.

Emp:
      eid   name   did   rid
e1    1     john   1     1     0.6
e2    2     mary   1     1     0.9
e3    3     joe    2     1     0.8

Dept:
      did   dname
d1    1     R&D          0.5
d2    2     Sales        0.2

Rank:
      rid   name
r1    1     A            0.5
r2    2     B            0.1

Table 5.1: The three tables that comprise the example probabilistic database Dp_e. The column on the left that is outside the frame contains the tuple ids (e.g., e1, e2), which represent the probabilistic event, or provenance. The column on the right that is outside the frame contains the probabilities of the tuples, e.g., p(e1) = 0.6, p(e2) = 0.9.

5.3.2 A non-Safe Execution Plan

We now compute the result of qe over a distributed setting using the MapReduce relational algebra operations not by applying it to each possible world, but by employing the following execution plan:

P1 = Π did,rid((Rank × Dept) ⋈ did,rid Emp)

First Step: Join

The first step in the execution plan is a join between the Rank and Dept tables. We start with this join, rather than a join that includes Emp, for efficiency reasons: this is a star join, and it is expected that Emp is very large while Rank and Dept are significantly smaller. We denote the resulting relation R'. Notice that T(R') = T(Rank) · T(Dept). R' is presented in Table 5.4a.

Since this step is a join, and we assume the tuples in the relations are probabilistically independent, the probability of each tuple in the result relation is the product of the probabilities of the original tuples from which it was derived.

world                                  prob
D1 = {e1, e2, e3, d1, d2, r1, r2}      0.002
D2 = {e1, e2, e3, d1, d2, r1}          0.019
D3 = {e1, e2, e3, d1, d2, r2}          0.002
...
D127 = {r2}                            0.000096
D128 = ∅                               0.00144

Table 5.2: Some of the 2^7 = 128 possible worlds of Dp_e. Every possible world is denoted Di, and it contains some of the tuples, denoted by their tuple ids. The probability associated with each possible world is the product of the probabilities of its elements. For example, D1 contains all the tuples, so its probability is simply the product of the probabilities associated with all the tuples. D2 does not contain tuple r2, so its probability is the product of the probabilities of all the tuples except r2, multiplied by 1 − p(r2).

SELECT DISTINCT e.did AS 'Dept ID', e.rid AS 'Rank ID'
FROM Emp AS e, Dept AS d, Rank AS r
WHERE e.did = d.did AND e.rid = r.rid

Figure 5.1: SQL formalization of the example query qe. Notice the DISTINCT keyword: the query should retrieve only the distinct combinations of department id and rank id; if two employees in the same department have the same rank, then only a single row should appear in the result relation. Also note that in a non-probabilistic database there would be no need to join the three relations, because the information is contained in the Emp relation; here we need the join to make sure the probabilities are correct.

did   rid   p
1     1     0.24
2     1     0.08

Table 5.3: Result relation of executing the example query qe on the example database Dp_e. For the first row, we sum the probabilities of all the possible worlds where either e1 or e2 appears, and both d1 and r1 appear.

Second Step: Natural Join

The next step is a join between R' and Emp. Since this step is, again, a join, the probability of each tuple in the result relation is the product of the probabilities of the original tuples from which it was derived. The resulting relation is denoted R''. Notice that T(R'') = T(Emp). R'' is presented in Table 5.4b.

Third Step: Projection

The final step is a projection of attributes did and rid on R”. In a projection operation, tuples that have identical values in the projected attributes are merged into a single tuple.

In this case, the probability of the result tuple is the probability of the event that at least one of the original tuples appears. For example, consider two tuples t1 and t2. If these two tuples are merged in a projection operation, the probability of the result tuple is: 1 − (1 − p(t1)) · (1 − p(t2)).

The result is presented in Table 5.4c.

The Plan is Wrong

When examining the result in Table 5.4c, it can be seen that the probability of the tuple (1, 1) is different from the probability calculated by applying the query on every possible world, which means that the execution plan is faulty, or non-safe.
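To see this numerically, the projection in P1 merges the two R'' tuples of Table 5.4b with (did, rid) = (1, 1) as if they were independent, giving

1 − (1 − 0.150)(1 − 0.225) = 1 − 0.85 · 0.775 = 0.34125,

whereas both tuples depend on the same d1 and r1 tuples. The possible-world semantics (Table 5.3) gives

p(d1) · p(r1) · (1 − (1 − p(e1))(1 − p(e2))) = 0.5 · 0.5 · 0.96 = 0.24.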

5.3.3 A Safe Plan

We now again execute query qe, but this time we follow an execution plan that was devised using the Dalvi-Suciu algorithm [13]:

P2 = (Π did,rid(Emp)) ⋈ did (Π did(Dept)) ⋈ rid (Π rid(Rank))

5.3.4 First Step: Projections

First, we perform three projections on the three tables. The order of the projections is of no importance since they are independent of each other. Without loss of generality, we first perform Π did,rid(Emp). We denote the output R1, and T(R1) = V(Emp, [did, rid]). The following projection is Π did(Dept); the result relation is denoted R2. The last projection is Π rid(Rank); we denote the result relation R3.

(a) The result relation of Rank ⋈ Dept, denoted R':

did   dname   rid   rname   probability
1     R&D     1     A       0.25
1     R&D     2     B       0.05
2     Sales   1     A       0.1
2     Sales   2     B       0.02

(b) The result relation of R' ⋈ did,rid Emp, denoted R'':

eid   name   did   dname   rid   rname   probability
1     john   1     R&D     1     A       0.150
2     mary   1     R&D     1     A       0.225
3     joe    2     Sales   1     A       0.08

(c) The result relation of Π did,rid R'', which is the final result of the query:

did   rid   probability
1     1     0.34125
2     1     0.08

Table 5.4: Step-by-step execution of the non-safe plan P1.

Second Step: Join

Next, we perform the join of two of the output relations from the previous step: R1 ⋈ did R2. We denote the output R4, and T(R4) = V(Emp, [did, rid]).

Third Step: Join

Finally we perform the join R4 ⋈ rid R3.

As can be seen from Table 5.5, this plan is safe and its values match the values obtained directly from the possible-world semantics (Table 5.3).
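For illustration, tracing the tuple (1, 1) through P2 with the values of Tables 5.1 and 5.5: the projection Π did,rid(Emp) merges the independent tuples e1 and e2,

1 − (1 − 0.6)(1 − 0.9) = 0.96 (relation R1),

and the two subsequent joins multiply by the probabilities of the projected Dept and Rank tuples,

0.96 · 0.5 = 0.48 (relation R4), 0.48 · 0.5 = 0.24 (relation R5),

which matches the possible-world result of Table 5.3.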

5.4 Communication Cost and Optimization of Safe and Non-Safe Execution Plans

5.4.1 Communication Cost of Plans

First, we review the communication cost of the plans we have just described.

(a) Relation R1, the result of Π did,rid(Emp):

did   rid   probability
1     1     0.96
2     1     0.8

(b) Relation R2, the result of Π did(Dept):

did   probability
1     0.5
2     0.2

(c) Relation R3, the result of Π rid(Rank):

rid   probability
1     0.5
2     0.1

(d) Relation R4, the result of R1 ⋈ did R2:

did   rid   probability
1     1     0.48
2     1     0.16

(e) Relation R5, the result of R4 ⋈ rid R3:

did   rid   probability
1     1     0.24
2     1     0.08

Table 5.5: Step-by-step execution of the safe plan P2.

Non-safe Plan

The plan has three steps: a join step, a natural-join step and a projection step. The communication cost of the first step is T(Rank) + T(Dept), which is the number of input tuples. The communication cost of the second step is T(Rank) · T(Dept) + T(Emp). The communication cost of the last step is T(Emp). The total communication cost of the execution plan is:

cost(P1) = T (Rank) + T (Dept) + T (Rank) · T (Dept) + 2T (Emp)

Safe Plan

The safe plan also has three steps.

The first step is composed of three independent projection operations. Their communication costs are T(Emp), T(Dept) and T(Rank), respectively.

The communication cost of the second step, which is a join, is T(R1) + T(R2), which is V(Emp, [did, rid]) + T(Dept).

The communication cost of the final step, which is also a join, is V(Emp, [did, rid]) + T(Rank). The total communication cost is

cost(P2) = T(Emp) + 2 · T(Dept) + 2 · T(Rank) + 2 · V(Emp, [did, rid])

Note that by following the Afrati-Ullman optimization [1] and performing a 3-way join instead of two binary joins we can get even better performance.

5.4.2 Comparing the Communication Costs

We now explore the conditions under which cost(P2) < cost(P1). This condition is true in a database instance where T(Emp) > T(Dept) + T(Rank) + V(Emp, [did, rid]). This condition holds on most real-world databases, where the employee table contains many more records than the departments and ranks tables, but not on all of them.

This means that the cost of the plans depends on the instance: on some instances P1 is faster, and on others P2 is faster.

5.5 Safe Plan Optimization

5.5.1 Motivation

Theorem 5.5.1 For every star-join safe query q there exists a safety-preserving execution plan Pq such that the cost of Pq is optimal.

The proof for this theorem is constructive, and is given in Section 5.5.7.

Theorem 5.5.2 There exist a star-join safe query q and a database instance i for which there exists a non-safe binary plan Pq such that the cost of Pq is optimal.

The proof of this theorem was shown by the example in Section 5.4.2.

These theorems mean that among all the safety preserving plans, there is one which is optimal, and our algorithm can find it, but there may be plans which do not preserve safety that may have a smaller communication cost.

Theorems 5.5.1 and 5.5.2 also prove that using traditional optimizers for optimizing a MapReduce plan can be harmful: depending on the database instance, the result plan is sometimes safe and sometimes not safe. Therefore a new optimizer, which guarantees safety, is needed.

5.5.2 The MRQL Optimization is not Always Safe

Some work has been done in the area of MapReduce query optimization. The most promising work is that of Fegaras et al. from 2012 [19], which is now an Apache Incubator project [29]. In their paper they present an optimization framework for MapReduce queries. Their work is actually more extensive: they present a new query language with features like sub-selects in the group-by and select clauses, but we are interested in their optimizer. The input is an MRQL query (similar to SQL), and the output is a MapReduce-based execution plan. The output plan is optimized in terms of MapReduce; however, some issues arise:

1. Their measure of optimization is the number of MapReduce jobs and not the communication cost.

2. The input is not an SQL statement but rather an MRQL statement. MRQL is very similar to SQL, but a minor conversion is still required.

3. It mainly improves queries with group-by statements or sub-queries.

4. The execution plan is not always safe.

We have translated the example query into the following MRQL query:

SELECT distinct d.did, r.rid
FROM e IN emp, d IN dept, r IN rank
WHERE e.did = d.did AND e.rid = r.rid;

The optimization framework produces the following optimized execution plan:

MapReduce2:
    left: MapReduce2:
        left: Source (line): "queries/emp.txt"
        right: Source (line): "queries/dept.txt"
    right: Source (line): "queries/rank.txt"

The resulting plan is composed of two MapReduce jobs. The first MapReduce job is a join between the Emp and Dept tables, and the following job is a join between the result of the first one and the Rank table. As we have shown earlier, the join between the Emp and Dept tables makes the plan a non-safe one. We have used the framework on several more queries, all of which produced non-safe plans.

The fact that MRQL does not necessarily generate a safe plan is another major motivation for this work, which provides a framework, or algorithm, for generating plans that are always safe and usually optimal.

5.5.3 Safe-Preserving Algebraic Laws

When we apply the Dalvi-Suciu algorithm we obtain one possible safe plan. The next step is to rewrite the plan using safe-preserving algebraic laws. However, safe-preserving algebraic laws are different from traditional algebraic laws; for instance, Π and ⋈ do not commute.

Dalvi and Suciu list the following safe-preserving transformation rules:

1. Pushing Selections Below a Join  If all the attributes in condition c involve attributes from only one of the expressions being joined (say R), then

σc(R ⋈ S) ⇔ σc(R) ⋈ S

If condition c is a conjunction of two conditions c1 and c2, such that c1 involves only the attributes of R and c2 involves only the attributes of S, then

σc1∧c2(R ⋈ S) ⇔ σc1(R) ⋈ σc2(S)

2. Pushing Selections Below a Projection

σc(Πa(R)) ⇔ Πa(σc(R))

3. Join Commutativity  Joins are commutative:

R ⋈ S ⇔ S ⋈ R

4. Join Associativity  Joins are associative:

R ⋈ (S ⋈ T) ⇔ (R ⋈ S) ⋈ T

5. Cascading Projections  Successively eliminating attributes from a relation is equivalent to simply eliminating all but the attributes retained by the last projection:

ΠA(ΠA∪B(R)) ⇔ ΠA(R)

6. Pushing Projections Below a Join  A projection can be pushed below a join if it retains all the attributes used in the join:

ΠA(R ⋈ S) ⇒ (ΠAR(R)) ⋈ (ΠAS(S))

where AR = A ∩ Attr(R) and AS = A ∩ Attr(S).

7. Lifting Projections Up a Join  A projection cannot always be lifted up a join. The following transformation rule can be applied only when the top Π operator satisfies the safety condition of Theorem 5.1.6, i.e., for a plan p, Πa1,...,ak is safe in Πa1,...,ak(p) if and only if for every R ∈ Rels(p) the following can be inferred from Γ(p): a1, ..., ak, R.E → Head(p). Under this condition:

(ΠA(R)) ⋈ S ⇒ ΠA∪Attrs(S)(R ⋈ S)

5.5.4 Estimating Cost of a Plan

In order to choose the most efficient plan, we need to know the cost of the various plans. We cannot know these costs exactly without executing the plan. However, most times, the cost of executing a plan is significantly greater than all the work done by the query optimizer in selecting a plan. Therefore, we do not want to execute more than one plan for one query, and we are forced to estimate the cost of any plan without executing it.

For the estimations we require some information regarding the relations in the database, and we use well known notations from [49]:

• T (R) the number of tuples in relation R.

• V (R, α) the value-count for attribute α of relation R, i.e., the number of distinct values relation R has in column α.

So, our problem is how to estimate costs of plans accurately. Given values for these parameters, we may make a number of reasonable estimates of relations sizes that can be used to predict the cost of a complete execution plan.

Estimating the Size of Intermediate Relations [33]

For each operator we give an upper bound and an estimate of the size of the result relation.

Projection  The projection operation eliminates duplicate values. Let S = Πa1,a2,...,an(R); then T(S) = V(R, [a1, a2, ..., an]). However, often we do not have this statistic available, so it must be approximated. In the extreme, T(S) can be as large as T(R), or as small as 1. Another upper limit on T(S) is the maximum number of distinct tuples that could exist: the product of V(R, ai) for i = 1..n. The upper bound is therefore:

T(S) = min(T(R), ∏i=1..n V(R, ai))

An acceptable estimate is:

T(S) = min(T(R)/2, ∏i=1..n V(R, ai))

Selection  When we perform a selection, we almost always reduce the number of tuples. The upper bound is the case where all tuples satisfy the constraint, and therefore:

T(S) = T(R)

Let S = σA=c(R), where A is an attribute of R and c is a constant. In the simplest kind of selection, where an attribute is compared to a constant, there is an easy way to estimate the result, provided we know, or can estimate, the number of different values the attribute has. Then we can estimate:

T(S) = T(R)/V(R, A)

The size estimate is more problematic when the selection involves an inequality comparison, for instance S = σA<c(R). If the minimum and maximum values of A in R are known, one can estimate (following Silberschatz):

T(S) = T(R) · (c − min(A, R)) / (max(A, R) − min(A, R))

However, we shall assume the worst case, when we do not know the value of c (e.g., c is a parameter). Ullman [cite] proposes that a typical inequality will return about one third of the tuples (Silberschatz says half of the tuples):

T(S) = T(R)/3

In the case of a "not equal" comparison, we can assume that most tuples will satisfy the condition. That is, for the selection S = σA≠c(R) we can estimate:

T(S) = T(R)

Join We shall consider here only equi-joins (also include natural-joins). Theta-joins can be estimated as if they were a selection following a product. For start, we shell assume that the equi-join of two relations involves only the equality of two attributes, one of each relation. i.e., we consider the join R(A, B) ×B=C S(C,D). The upper bound is simply a size of a Cartesian product: T (R × S) = T (R) · T (S)

We shell look at this estimation problem as a probability problem: Suppose r and s are tuples in R and S respectively. What is the probability of r and s agree on the joint attributes?

Suppose that V(R, B) ≥ V(S, C). Then the C-value of s is surely one of the B-values that appear in R. Hence, the chance that r joins with s is 1/V(R, B). Similarly, if V(R, B) < V(S, C), then the chance that r and s match is 1/V(S, C). In general, the probability of r matching s is 1/max(V(R, B), V(S, C)). Thus:

T(R ⋈ S) = T(R) · T(S) / max(V(R, B), V(S, C))

5.5.5 Plan Cost Estimation Algorithm

An execution plan is a relational algebra expression. As such, it can be viewed as a tree whose internal vertices are relational algebra operators and whose leaves are relations. We make a distinction between a logical plan and a physical plan. A logical plan is a relational algebra expression that describes the relational operators needed to execute the query. A physical plan is a translation of the logical plan into the MapReduce world: it describes exactly which MapReduce jobs to run in order to execute the plan. The importance of the physical plan follows from Section 5.2: a single relational operator is usually mapped to a single MapReduce job; however, in order to lower the communication cost we can combine two relational operators and map them to a single MapReduce job. We aspire to find the most efficient physical plan for a given logical plan in terms of communication cost.

Algorithm 5.10 does so. This algorithm is based on the query optimization Algorithm 13.7 in [47] (page 600), except that here the physical plan consists of MapReduce jobs. The input is a logical plan, in the form of a tree. The output is the optimal physical plan in terms of communication cost, together with its result size and estimated communication cost. Since this function is called repeatedly, we use a dynamic programming algorithm for estimating the cost and size of plans. Dynamic programming algorithms store the results of computations and reuse them, a method that can greatly reduce the execution time. The function stores the cost and size of plans in an associative array mem, which is indexed by logical plans.
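As an illustration of the memoization idea, a simplified version of the search might look as follows in Java. The LogicalPlan and Pattern interfaces are placeholders invented for this sketch and do not correspond to concrete classes in our implementation.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative memoized search over physical plans (all names are assumptions). */
public class PlanSearch {

    /** The best physical plan found for a logical plan, with its cost and result size. */
    public static class Best {
        final Object physicalPlan;
        final double cost;
        final long size;
        Best(Object physicalPlan, double cost, long size) {
            this.physicalPlan = physicalPlan;
            this.cost = cost;
            this.size = size;
        }
    }

    private final Map<LogicalPlan, Best> mem = new HashMap<>();
    private final List<Pattern> patterns;   // expressions mappable to a single job

    public PlanSearch(List<Pattern> patterns) { this.patterns = patterns; }

    /** Assumes at least one pattern matches every non-leaf plan. */
    public Best optPhyPlan(LogicalPlan p) {
        Best cached = mem.get(p);
        if (cached != null) {
            return cached;                  // reuse a previously computed sub-result
        }
        Best best;
        if (p.isLeaf()) {
            best = new Best(null, 0.0, p.size());
        } else {
            best = null;
            for (Pattern pat : patterns) {
                if (pat.matches(p)) {
                    Best sub = optPhyPlan(pat.subplan(p));
                    double cost = sub.cost + pat.cost(sub.size);
                    if (best == null || cost < best.cost) {
                        best = new Best(pat.buildJob(sub.physicalPlan), cost,
                                        pat.estimateResultSize(sub.size));
                    }
                }
            }
        }
        mem.put(p, best);
        return best;
    }

    /** Minimal interfaces assumed for the sketch. */
    public interface LogicalPlan { boolean isLeaf(); long size(); }
    public interface Pattern {
        boolean matches(LogicalPlan p);
        LogicalPlan subplan(LogicalPlan p);
        double cost(long inputSize);
        long estimateResultSize(long inputSize);
        Object buildJob(Object subPhysicalPlan);
    }
}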

5.5.6 Improving a Safe Plan

From the classic database literature, there are a number of algebraic laws that tend to improve logical query plans and are commonly used in optimizers. We review them first, and then make the appropriate modifications for safety preservation:

• Selections can be pushed down the expression tree as far as they can go. If a selection condition is the AND of several conditions, then we can split the condition and push each piece down the tree separately.

• Projections can be pushed down the tree, or new projections can be added. We may introduce a projection as long as it eliminates only attributes that are neither used by an operator above nor part of the result of the entire expression.

Function 5.10: optPhyPlan
 1  PATTERNS is a set of relational expressions that can be mapped into a single MapReduce job.
 2  subplan(p, P) is a function that returns the sub-plan of P after matching it with pattern p.
 3  cost(p, s) is a function that returns the cost of a pattern p given the input size s.
 4  mem is an associative array that stores the results of previous invocations.
    Input: A logical plan P
    Output: The optimal physical plan, with its estimated communication cost and result size

 5  if mem[P] ≠ −1 then
 6      return mem[P]
 7  end
 8  if isLeaf(P) then
 9      // a triplet consisting of plan, cost and size
10      return (φ, 0, P.size)
11  end
12  possiblePlans ← φ
13  for p ∈ PATTERNS do
14      if match(p, P) then
15          optSubPlan ← optPhyPlan(subplan(p, P))
16          possiblePlans.add(optSubPlan, optSubPlan.cost + cost(p, optSubPlan.size))
17      end
18  end
19  mem[P] ← min(possiblePlans)
20  return mem[P]

• Certain selections can be combined with a product below to turn the pair of operations into an equi-join, which is generally much more efficient to evaluate.

5.5.7 Plan Optimization Algorithm

Given a query q, we first need to generate a safe execution plan for it. We generate the plan using Algorithm 5.11, which is based on the paper by Dalvi and Suciu [13], and denote the safe plan P. Given a safe plan P, we need to generate other safe plans that are equivalent to it. An exhaustive search algorithm would traverse all sub-plans of a plan and, using the equivalence rules, replace each sub-plan with an equivalent one. This process goes as follows. Given an execution plan P, the set of equivalent plans, EP, initially contains only P. Now, each plan in EP is matched with each equivalence rule. If a plan, say P′, matches one side of an equivalence rule, then the optimizer applies the rule and generates a new plan. The new plan is added to EP. This process continues until no more new plans can be generated. Algorithm 5.12 outlines this process.

This algorithm is extremely costly in both space and time, but both can be greatly reduced by using three key ideas:

1. If we generate a plan P′ from a plan P1 by using an equivalence rule on a sub-plan pi, then P′ and P1 have identical sub-plans except for pi and its transformation. A plan-representation technique that allows both plans to point to shared sub-plans can significantly reduce the needed space.

Function 5.11: safePlan(q)
    Input: q query
    Result: P safe plan, if one exists

 1  if Head(q) = Attr(q) then
 2      return any plan P for q
 3  end
 4  for A ∈ (Attr(q) − Head(q)) do
 5      let qA be the query obtained from q by adding A to the head variables
        if Π_Head(q)(qA) is a safe operator then
 6          return Π_Head(q)(safePlan(qA))
 7      end
 8  end
 9  split q into q1 ⋈_c q2, s.t. ∀R1 ∈ Rels(q1), R2 ∈ Rels(q2): R1 and R2 are separated
10  if no such split exists then
11      return "No safe plan exists"
12  end
13  return safePlan(q1) ⋈_c safePlan(q2)

Procedure 5.12: generateAllEquivalent(P)
    Input: P execution plan

 1  EP ← {P}
 2  repeat
 3      match each plan Pi in EP with each equivalence rule rj
        if rj can be applied to Pi then
 4          create a new plan P′ by applying rj
            add P′ to EP if it is not already in EP
 5      end
 6  until no new plans can be added to EP


2. It is not always necessary to generate every expression that can be generated with the equivalence rules. If the optimizer takes cost estimates of evaluation into account, it may be able to avoid examining some of the plans.

3. Some of the rules can be proved to reduce the communication cost of a plan. This means that we can apply these rules without further checking, and the new plan is not worse than the original plan. We elaborate on this idea below.

5.5.8 Sure Rules

We now examine some rules and prove that applying these rules will surely reduce the communication cost of a plan.

Pushing Selection Before (below) a Join

Consider the following plan, P1: σ_c(R ⋈ S). The communication cost of P1 is

2·T(R) + 2·T(S) + T(R)·T(S) / max(V(R, A), V(S, A))

Suppose we apply equivalence rule 1 to P1. We get P2: σ_c(R) ⋈ S. In the case of P2, we have a join operation right after a selection operation, and therefore we can use the select-join operator. So, the communication cost of P2 is

T(R) + 2·T(S) + T(R)/V(R, A)

As we can see, the cost of P2 is less than the cost of P1 on any database instance! This means that we can apply this rule without calculating the cost; it always yields a better plan.
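For completeness, the claim can be verified directly from the two cost expressions above:

\[
\mathrm{cost}(P_1) - \mathrm{cost}(P_2)
  = \left( T(R) - \frac{T(R)}{V(R,A)} \right)
  + \frac{T(R)\,T(S)}{\max\bigl(V(R,A),\,V(S,A)\bigr)} \;\ge\; 0,
\]

since V(R, A) ≥ 1 and all the statistics are non-negative.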

Pushing Selection Before (below) a Projection

Consider the following plan, P1: σ_c(π_α(R)). The communication cost of P1 is

2·T(R) + T(R′) + T(R′)/V(R′, c)

where R′ is the result relation of π_α(R). Suppose we apply equivalence rule 2 to P1. We get P2: π_α(σ_c(R)). In the case of P2, we have a projection operation right after a selection operation, therefore we can use the select-project operator. So, the communication cost of P2 is

T(R) + T(R)/V(R, c)

Even in the rare case where R = σ_c(R), P2 is preferred over P1 on any instance. Therefore we should always apply this rule, without checking the cost of the new plan.

Unifying Projections

Consider the following plan: P1: π_α(π_{α∪β}(R)). It is obvious that unifying the projections, i.e., projecting only α, yields an equivalent and more efficient plan.

5.5.9 Rules that need to be Checked

Consider the following plan: P1: π_α(R ⋈ S). The communication cost of P1 is

2·T(R) + 2·T(S) + 2·T(R)·T(S) / max(V(R, A), V(S, A))

Now, consider applying rule 6 to P1. The resulting plan is P2: π_{α_R}(R) ⋈ π_{α_S}(S). The communication cost of P2 is

2·T(R) + 2·T(S) + 2·T(R)/V(R, α_R) + 2·T(S)/V(S, α_S)

Which plan is better?

When we compare the communication costs of P1 and P2, we find that P1 costs less roughly when T(R)·T(S) < T(R) + T(S); that is, the answer is instance-dependent. We cannot know which plan is better without calculating the cost of both plans.
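Writing the difference out (using the two cost expressions above) makes the instance dependence explicit:

\[
\mathrm{cost}(P_1) - \mathrm{cost}(P_2)
  = \frac{2\,T(R)\,T(S)}{\max\bigl(V(R,A),\,V(S,A)\bigr)}
  - \frac{2\,T(R)}{V(R,\alpha_R)}
  - \frac{2\,T(S)}{V(S,\alpha_S)},
\]

which may be positive or negative depending on the statistics of the particular database instance.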

5.5.10 Finding the Optimized Plan

An algorithm for finding the most efficient plan is as follows. The input is a query q. First, we generate a safe execution plan P. The algorithm is recursive, so at first the stop condition is checked: whether P is only a leaf, i.e., it has no descendants. If this is the case then nothing can be optimized and P is returned as is. Otherwise, all the sure rules are applied. We are then left with the optional rules; in order to know whether to apply them or not, we need to check the cost with and without them, and then make the decision. Finally, we continue the recursion by calling the algorithm on the sub-plan.

The algorithm is presented in Algorithm 5.13.

Function 5.13: findBestPlan(q)
    Input: q query

 1  P ← safePlan(q)
 2  if isNull?(P) then
 3      return "No safe plan exists"
 4  end
 5  if isLeaf?(P) then
 6      return P
 7  end
 8  applySureRules(P)
 9  switch do
10      case a check rule can be applied do
11          P′ ← applyCheckRule(P)
12          if cost(P′) < cost(P) then
13              findBestPlan(P′.subPlan)
14          else
15              findBestPlan(P.subPlan)
16          end
17      otherwise do
18          findBestPlan(P.subPlan)
19      end
20  end

5.6 Experiments

5.6.1 Data

Real Data

NELL [50], or Never-Ending Language Learning is a research project that attempts to create a computer system that learns over time to read the web. It attempts to "read," or extract facts from text found in hundreds of millions of web pages and associates each read fact with a confidence level. In our experiment we have extracted tables from this dataset, using the confidence as the probability for each record (or tuple). We have extracted the following tables and used them as the input for all of the experiments:

• movies {mid,name}, 6,844 records

• actors {aid,name}, 32,806 records

• movie_actors {mid,aid}, 40,539 records

Synthetic Data

Since the real data is not big in the sense that a single computer cannot handle it (which makes the use of the MapReduce framework cumbersome), we used the same relations to generate synthetic records.

• movies, 1m records

• actors, 1m records

• movie_actors, 5m records

5.6.2 Setup

We ran all the experiments on the Amazon Elastic MapReduce [4] infrastructure. The real-data run was executed on a single machine, and the synthetic-data run was executed on four machines, each with SSD-based instance storage for fast I/O performance, a quad-core CPU, and 8 GiB of memory. All machines ran Hadoop version 2.6.0 with Java 8.

5.6.3 Measurement

We used communication cost (see Section 3.7) as the main measurement for comparing the performance of the different algorithms. The input records to each map task and reduce task were simply counted and summed up at the end of the execution. This count is performed on each machine in a distributed manner. The Hadoop implementation provides an internal input-records counter that makes the counting and summing task extremely easy.

Communication cost is an infrastructure-free measurement, meaning that it is not affected by weaker or stronger hardware or by temporary network overloads, making it our measurement of choice. However, we also measured the execution time: we ran each experiment three times and report the average time.
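As an illustration, summing Hadoop's built-in input-record counters after a job completes could look as follows; this is a sketch of the general idea rather than the exact code used in our implementation.

import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

// Illustrative sketch: the communication cost of one job, measured as the sum
// of map-input and reduce-input records (see Section 3.7).
public class CommunicationCost {
    public static long of(Job completedJob) throws Exception {
        Counters counters = completedJob.getCounters();
        long mapInput = counters.findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
        long reduceInput = counters.findCounter(TaskCounter.REDUCE_INPUT_RECORDS).getValue();
        return mapInput + reduceInput;
    }
}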

5.6.4 The Query

The query we executed throughout the experiment is as follows:

"Generate a list of all the movie ids and actor ids, such that we take into consideration the

probability of the existence of the movie and the actor".

The following SQL query is the formalization of the requested query. It is safe.

SELECT m.mid, a.aid
FROM movies AS m, actors AS a, movie_actors AS m_a
WHERE m_a.mid = m.mid
  AND m_a.aid = a.aid

did | rid | provenance
 1  |  1  | (d1 ∧ r1 ∧ e1) ∨ (d1 ∧ r1 ∧ e2)
 2  |  1  | d1 ∧ r1 ∧ e3

Table 5.6: The result table of the example query, with the provenance for each tuple.


Figure 5.2: Comparing the execution time of the three plans on the synthetic and NELL (real) data.

5.6.5 Control Group

As a control group, we use the non-safe execution plan created by MRQL, attaching to each result tuple its provenance. Then we calculate the probability of each result tuple from its provenance. Table 5.6 shows the provenance for the example query.

5.6.6 The Experiments

For the query above, we created three execution plans:

• The Dalvi-Suciu plan, naively transformed to MapReduce.

• Our efficient MapReduce plan.

• A non-safe plan using MRQL with provenance.

We have executed the three plans on the NELL data and the synthetic data, and measured the execution time.

5.6.7 Results

The execution times of the algorithms on the data are presented in Figure 5.2. As can be seen, the new algorithm outperforms the other two; however, it is interesting to note that MRQL, which is the 'control group', is faster than the Dalvi-Suciu algorithm. We believe this happened for the following reasons:

• Since MRQL does not take plan safety into account, its plans are considerably more efficient.

• The probability computation is performed concurrently, which greatly reduces the execution time.

• The provenance resulting from the query is not complicated, and can be calculated easily.

5.7 Conclusion

In this chapter we have investigated the problem of querying distributed probabilistic databases.

First, we reviewed the classic problem of querying a centralized probabilistic database, and presented the Dalvi-Suciu dichotomy, which differentiates queries that can be executed in linear time (safe queries) from ones that cannot (non-safe queries).

Next, we moved to the distributed setting, showed how to translate relational algebra operators into MapReduce jobs, and introduced combined operators, which play a part in optimizing queries.

Then we presented the two theorems that are the main motivation for this chapter: current optimizers for MapReduce jobs may break the safety of an execution plan, yet among all the safe plans we can find the most efficient one.

We presented a search algorithm for finding, among all the safe plans, the most efficient execution plan to be executed as a series of MapReduce jobs.

We have tested this algorithm against other methods and have shown that it outperforms them.

Chapter 6

Universal Hadoop Cache

If we have data, let’s look at data. If all we have are opinions, let’s go with mine.

Jim Barksdale, CEO of Netscape

6.1 The Hadoop Distributed File System

As mentioned earlier, Hadoop comes out of the box with a distributed file system called HDFS, which stands for Hadoop Distributed File System. HDFS is designed for storing very large files, with a focus on streaming data access. The idea behind the design is that the most efficient data-processing pattern is write-once, read-many-times. A dataset is typically generated or imported from a source, and then various analyses and processing are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.

In this chapter we show how MapReduce algorithms can be improved with a caching mechanism. The mechanism is universal, meaning that it does not depend on where the data is loaded from; however, some important concepts come from HDFS, and therefore we start by describing it.

6.1.1 HDFS Concepts

Blocks

A block, in the context of a file system, is the minimum unit of data that can be read or written. It is usually a few kilobytes in size. The concept of a block is usually transparent to a file system user, who simply reads or writes a file of whatever length.

HDFS, too, has the concept of a block, but it is a much larger unit: 128 MB by default. As in a file system for a single disk, files in HDFS are broken into block-sized chunks, which are stored as independent units.

Namenodes and Datanodes

An HDFS cluster has two types of nodes, working in a master-worker paradigm: a namenode (the master) and a number of datanodes (workers). The namenode manages the file system namespace: it maintains the file system tree and the metadata for all the files and directories in the tree. In addition, the namenode knows the datanodes on which all the blocks of a given file are located. A client accesses the file system on behalf of the user by communicating with the namenode and datanodes. Datanodes are the workhorses of the file system: they store and retrieve blocks when they are told to (by clients or the namenode).

How to Read a File

We now describe the work-flow of reading a file. It is essential to elucidate this work-flow since it is a major part of the motivation for the work in this chapter (the work-flow of writing a file is not important to this work).

The client contacts the namenode, giving it the required file's identification, and gets back the locations of the blocks of the file. For each block, the namenode returns the addresses of the datanodes that have a copy of that block. Furthermore, the datanodes are sorted according to their proximity to the client (according to the topology of the cluster's network, which is not in the scope of this work). If the client is itself a datanode (in the case of a MapReduce task, for instance), then it will read from the local datanode, if it hosts a copy of the block (this is called data locality, and we discuss it in more detail later).

The client then connects to the first (closest) datanode for the first block in the file, and data is streamed from the datanode back to the client. When the end of the block is reached, the connection to the datanode is closed, and the client finds the best datanode for the next block. The client connects to that datanode, and the iteration continues until the end of the file.

The most important aspect of this design is that the client contacts datanodes directly to retrieve data and is guided by the namenode to the best datanode for each block. This design allows HDFS to scale to a large number of concurrent clients, since the data traffic is spread across all the datanodes in the cluster. The namenode, meanwhile, merely has to service block-location requests (which it stores in memory, making them very efficient) and does not, for example, serve data, which would quickly become a bottleneck as the number of clients grows.

6.1.2 Hadoop Job Concepts

There are two types of nodes in the Hadoop cluster that control the job execution process: a JobTracker and a number of TaskTrackers.

JobTracker

The JobTracker node is the liaison between the client and the Hadoop framework. Once the client submits code to the cluster, the JobTracker constructs the execution plan by determining which files to process, assigns nodes to different tasks, and monitors all tasks as they're running. There is only one JobTracker daemon per Hadoop cluster.

TaskTracker

The TaskTracker is responsible for executing the individual tasks that the JobTracker assigns.

Although there is a single TaskTracker per slave node, each TaskTracker can spawn multiple OS processes to handle many map tasks or reduce tasks in parallel. One responsibility of the TaskTracker is to constantly communicate with the JobTracker. If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it will assume the TaskTracker has crashed and will resubmit the corresponding tasks to other nodes in the cluster.

Hadoop Job Work-flow

In a job run there are four independent entities:

• The client, which submits the job.

• The JobTracker, which coordinates the job run.

• The TaskTrackers, which run the tasks that the job has been split into.

• The distributed file system (usually HDFS).

To start, the client contacts the JobTracker and sends the job information to it. When the JobTracker receives a call for a job submission, an initialization process begins that involves creating an object to represent the running job, which encapsulates its tasks, and bookkeeping information to keep track of the tasks' status and progress. To create the list of tasks to run, the JobTracker first retrieves the input splits computed by the client from the shared file system. It then creates one map task for each split. The number of reduce tasks to create is determined by the job configuration.

TaskTrackers run a simple loop that periodically sends heartbeat method calls to the JobTracker. Heartbeats tell the JobTracker that a TaskTracker is alive, but they also double as a channel for messages. As a part of the heartbeat, a TaskTracker will indicate whether it is ready to run a new task, and if it is, the JobTracker will allocate it a task, which it communicates to the TaskTracker using the heartbeat return value.

To choose a reduce task, the JobTracker simply takes the next in its list of yet-to-be-run reduce tasks, since there are no data-locality considerations. For a map task, however, it takes into account the TaskTracker's location in the network and selects a task whose input split is as close as possible to the TaskTracker. In the optimal case, the task is data-local, that is, running on the same node that the split resides on. Alternatively, the task may be rack-local: on the same rack, but not the same node, as the split. Some tasks are neither data-local nor rack-local and retrieve their data from a different rack than the one they are running on, or even from an out-of-network source.

Now that the TaskTracker has been assigned a task, the next step is for it to run the task.

Map tasks need to get some kind of input. If the input is a file residing on HDFS, then the reading process is as described in the previous section. Hadoop supports practically any kind of input (files, SQL queries, REST APIs, etc.), but the same principle applies to all inputs: unless the data is local to the node running the task, the executing task needs to connect to the data source, open a stream connection, and read the data from it.

When the JobTracker receives a notification that the last task of a job is complete, it simply tells the client that the job has ended.

6.1.3 Iterative Algorithms Efficiency Problem

Hadoop is designed for data processing in a single pass: the user submits a job which is composed of map and reduce functions. When the framework executes that job, the input data is read once during the execution of the map function and is discarded at the end of the job. Hadoop does not directly support the explicit specification of data that is repeatedly processed across different jobs. In Hadoop, the input of a job is always loaded from its original location (be it a distributed file system, a database, or any user-specified source). When applying different jobs to the same data, the repeatedly processed input needs to be loaded for each job, creating a considerable disk and network I/O overhead. Many data-mining algorithms, such as clustering and association-rule mining, require iterative computation: the same data is processed again and again until the computation converges or a stopping condition is satisfied. From Hadoop's perspective, iterative programs are simply different jobs executed one after the other. Therefore, modifying Hadoop so that it supports efficient access to the same data across different jobs is a highly motivated task.

6.1.4 Contribution

In this chapter we present a universal Hadoop caching mechanism. The caching mechanism provides an interface extension to Hadoop for the explicit specification of repeatedly processed input data, which is loaded from its source into a local shared-memory cache on the executing node. The memory is locally shared in the sense that different processes on the same machine can access it. After the data is loaded into the cache, the framework assigns this node the same data in each subsequent iteration, and the node reads the data from the cache, significantly reducing network traffic.

The main contributions of this chapter are:

• Universal cache: An interface that allows any input data to be cached. The cache is intended for iterative computations but actually works for every computation on data that is already in the cache.

• Cache organization model: The cached data is organized at the split level, and resides as key-value pairs.

• Cache management model: The Hadoop job manager maintains information about the cached data on the nodes, allowing it to assign tasks to nodes that already have the needed data in the cache.

• Java-accessible shared memory: A technical solution for sharing memory between different Java virtual machines on the same machine.

• Implementation: We implemented the system by modifying the Hadoop framework and adding new classes. As a result, jobs already developed for Hadoop can use the cache with little modification.

• Experimental evaluation: We evaluated the system on iterative jobs that process real-world datasets.

6.2 Related Work

Due to (1) the tremendous popularity of Hadoop, (2) the fact that a great number of applications are already written for it, and (3) the vast number of technologies that are built on top of it (like Hive [7], Pig [8] and Cascading [10]), research focused on this framework is warranted, and we review only works that are based on it. Other frameworks, such as the prominent Spark [55] or Twister [18], are not covered.

Recent papers that address the iterative-algorithms problem of Hadoop can be classified into two categories: (1) increasing the data-locality percentage (when possible) and (2) a caching layer.

6.2.1 Data Locality

Each node in a Hadoop cluster may be used both as a processing worker (for executing map or reduce tasks) and as a host for data on the distributed file system. So, when a new job is scheduled, Hadoop tries to run each map task on the node that hosts the input data for that task, in order to save the networking cost; if the attempt is successful, i.e., the node holding the input data is free for processing, then that task is data-local. The data locality rate of a job is the percentage of its tasks that are data-local. A low data locality rate means more data transfer between machines and higher network traffic.

If data locality rates were 100%, or very close to it, then the iterative-algorithms problem would be less significant: the lag between iterations would be as low as the time needed to read data from a local disk. However, this is not the case. The data locality rate may vary: it depends on the data replication factor and the job load on the cluster. For example, in [54] the authors show that a load of 50 jobs can drop data locality rates below 30%, whereas an acceptable rate that takes into account both over- and under-loading is 70%.

A trivial way to increase the data locality rate would be to increase the replication factor of the data, meaning that each data segment would have many copies on different nodes across the cluster; but of course this method requires a storage capacity increase.

A different, more sophisticated approach, which has been the core issue of much research, is to change the scheduling of tasks in the cluster. Two schedulers are the most prominent. Delay Scheduling [54] is a widely cited paper that presents a simple algorithm that aspires to raise the data locality rate by allowing tasks to wait a limited amount of time before they are executed. It relaxes the strict job order for task assignment and delays a job's execution if the job has no map task local to the current node. To assign tasks to a node, the delay algorithm starts the search for a local task at the first job in the queue. If not successful, the scheduler delays the job's execution and searches for a local task in succeeding jobs. A maximum delay time D is set; if a job has been skipped for longer than D, its non-local tasks are then assigned for execution. However, this work shows the classic tradeoff between data locality and fairness: no delay means high fairness but poor locality, while a long delay means the opposite. The paper tries to find a middle ground, but again, one cannot enjoy both worlds.

Matchmaking [27] is another scheduler that considers data locality in its design. The main idea behind this technique is to give every node a fair chance to grab local tasks before any non-local tasks are assigned to any node. This paper also relaxes the strict job order for task assignment: if a local map task cannot be found in the first job, the scheduler continues searching the succeeding jobs. In order to give every node a fair chance to grab its local tasks, when a node fails to find a local task in the queue for the first time, no non-local task is assigned to that node; this means that the node does not receive a map task in this heartbeat interval. If after the second heartbeat no local task is found, the scheduler will assign a non-local one. However, when the frequency of job submissions is not high, nodes may end up waiting unnecessarily and finish without getting data-local tasks.

As mentioned, many papers aspire to find ways to increase the data locality rate; however, targeting data locality rates has two main flaws that cannot be overcome: (1) at the base of this approach stands the assumption that the data can be localized node-wise. This assumption is valid when a distributed file system is used (in most cases HDFS); however, the data can come from a database system or some kind of streaming API and therefore cannot be localized. (2)

6.2.2 Caching Layer

HaLoop [9] is a modified version of Hadoop that is designed to serve iterative programs. HaLoop provides an inter-iteration caching and indexing mechanism with three types of caches: a reducer input cache, a reducer output cache and a mapper input cache. However, HaLoop caches the data on the slave nodes' local disks and not in the slave nodes' RAM, making them load the data from the local disk on every iteration. In addition, the caching mechanism is applicable only to files that reside on HDFS, and not to the general input format mechanism Hadoop provides.

Dacoop [35] is an extended version of Hadoop that introduces a shared memory-based data cache mechanism. However, like HaLoop, Dacoop supports only caching data that comes from files.

Our proposed cache mechanism, in contrast, is universal, meaning that it can accept data from any source.

6.3 Design

The design of the system is based on the following points:

6.3.1 Input Independence

As mentioned before, the input to a MapReduce job is usually an HDFS file, but it is pluggable, meaning that it can be practically anything. The cache mechanism can work with any input.

6.3.2 RAM based

The cache is based on the internal RAM of the executing node.

6.3.3 Current Hadoop Design

To understand the design of the modified framework, we first review the current Hadoop design.

A Hadoop job (or simply a job) is an execution of one map function followed by a reduce function on specific data. A task is an execution of a map or reduce function on a specific chunk of the data, called a map task or reduce task, respectively. A node is a single computer in the cluster. The job manager is a singleton process in the cluster; it accepts job requests from the users and assigns tasks to the nodes.

We will now look closely at the life cycle of a Hadoop job.

A Hadoop job comprises map and reduce functions, and data. A Hadoop job works as follows: the framework assigns a node in the cluster the given map function and a chunk of the input data (called an input split). The framework does its best to ensure that the input split resides on the same computer that executes the map function (the data-locality property). If data locality cannot be achieved, then the framework aspires to execute the function network-wise as close as possible to the input split: same rack, same segment, etc.

At the end of the execution of the function, the node clears its internal RAM, and is ready for the next map function and new data.

Hadoop supports all kinds of input data, including databases and various APIs. However, data locality is possible only when the input data is a file in the distributed file system. Moreover, as mentioned earlier, even if the input data is a file in the file system, data locality is not always guaranteed. This means that Hadoop is designed for large-scale data processing in a single pass. Many algorithms fit naturally into this model, such as word counting and sorting.

6.3.4 Shared Memory

In Hadoop, a node launches a new Java Virtual Machine (JVM) to run each task in, so that any bugs in the user-defined map and reduce functions do not affect other tasks (by causing a crash, for example). This means that the local cache on each node must use an inter-JVM mechanism. Our implementation is based on POSIX shared memory [46].

Since the mapper's input is itself key-value pairs, the shared memory stores serialized key-value pairs per split. Access to the shared memory must be synchronized between tasks: on the first iteration the cache is empty and is populated by the map task assigned to a given data split, and more than one map task may be executing on the same machine.
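The thesis's implementation uses POSIX shared memory [46]. As one illustration of how memory can be shared between JVMs on the same machine, the following sketch uses a memory-mapped file (for example under /dev/shm on Linux) together with a file lock for inter-process synchronization. The path, size and layout are assumptions made for this sketch only; this is not the actual implementation.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

/**
 * Illustrative sketch only: inter-JVM shared memory via a memory-mapped file.
 * The path and layout are assumptions; the real system uses POSIX shared
 * memory and a different organization.
 */
public class SharedSplitCacheDemo {

    private static final long SIZE = 64L * 1024 * 1024; // 64 MB region (example)

    public static void main(String[] args) throws Exception {
        // On Linux, /dev/shm is backed by RAM, so the mapping behaves like shared memory.
        try (RandomAccessFile file = new RandomAccessFile("/dev/shm/split-cache-demo", "rw");
             FileChannel channel = file.getChannel()) {

            MappedByteBuffer region = channel.map(FileChannel.MapMode.READ_WRITE, 0, SIZE);

            // Writers take an exclusive file lock before touching the region.
            try (FileLock lock = channel.lock()) {
                region.putInt(0, 42);   // e.g., a header word marking the split as cached
            }

            // Another JVM mapping the same file would observe this value.
            System.out.println("header = " + region.getInt(0));
        }
    }
}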

6.3.5 Input Format

An input split is a chunk of the input that is processed by a single map. Each map processes a single split. Each split is divided into records, and the map processes each record (a key-value pair) in turn. Splits and records are logical: nothing requires them to be tied to files, although most of the time they are. In a database context, a split might correspond to a range of rows from a table, and a record to a row in that range.

Hadoop can process practically any type of data format, from flat text files to databases.

However, an interface has to be provided to convert the data format to a logical record. This interface is called an input format. For example, Hadoop is provided with a simple input format for reading text files, where a logical record’s value is a single line from the file and the key is the character offset.

In our design we provide a new input format class called CacheInputFormat. This new class uses the delegation design pattern [53] to support Hadoop's pluggable input format feature. The first time a split is read, the input format stores its key-value pairs in the shared memory. On subsequent use of the same split, the input format fetches the pairs from the shared memory.

The following Java code presents the use of the CacheInputFormat, with a delegated input format for SQL queries.

/*
 * connectionString contains the connection string to the database.
 * sqlQuery is the SQL query to execute.
 */
String inputFormatId = connectionString + "|" + sqlQuery;

job.setInputFormatClass(CacheInputFormat.class);
CacheInputFormat.setDelegateInputFormatData(
    job, DBInputFormat.class, inputFormatId
);
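The read-through behaviour that CacheInputFormat provides can be sketched with simplified, invented interfaces; these are not Hadoop's InputFormat API, and the in-process map below merely stands in for the shared-memory region described earlier.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Simplified illustration of a caching, delegating reader. The interfaces
 * below are stand-ins for this sketch only, not Hadoop classes.
 */
public class CachingReader<K, V> {

    /** Stand-in for a delegate record reader over one input split. */
    public interface SplitReader<K, V> extends Iterator<Map.Entry<K, V>> { }

    /** Stand-in for the per-node shared cache, keyed by split identifier. */
    private final Map<String, List<Map.Entry<K, V>>> sharedCache = new ConcurrentHashMap<>();

    /** Read a split: from the cache if present, otherwise through the delegate. */
    public Iterator<Map.Entry<K, V>> read(String splitId, SplitReader<K, V> delegate) {
        List<Map.Entry<K, V>> cached = sharedCache.get(splitId);
        if (cached != null) {
            return cached.iterator();          // cache hit: no I/O to the original source
        }
        List<Map.Entry<K, V>> records = new ArrayList<>();
        while (delegate.hasNext()) {           // cache miss: read through the delegate
            records.add(delegate.next());
        }
        sharedCache.put(splitId, records);     // populate the cache for later iterations
        return records.iterator();
    }
}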

6.3.6 Cache Manager

The cache manager consists of two modules with a master-slave relationship: the master job-cache-manager and the slave task-cache-managers.

Job Cache Manager

The job-cache-manager runs on the same machine as the Hadoop job manager and is an integral part of it. The job-cache-manager orchestrates the cache for all the jobs running on the Hadoop cluster. It keeps track of which worker has which split in its cache, and can send cache-clean commands to tasks in the cluster. For example, when a node reports to the job manager that it is free to receive a task, the job manager tries to assign it a task whose input resides in the node's cache. If the input is not in the cache, then the regular task assignment rules hold, and the job-cache-manager keeps a record of the new cache entry for future iterations or future jobs that use the same data.

Its tasks are:

• Mapping splits to nodes.

• Prioritizing task assignment to nodes that have cached the corresponding split.

Task Cache Manager

The task-cache-manager runs on the nodes of the cluster. Each Hadoop task periodically updates the Hadoop job manager about the cache status. For example, if a node removes a split from RAM due to lack of use, then the job manager can instruct other nodes to remove splits of the same file: since at least one split will have to be reloaded anyway, it does not matter if all the splits are reloaded.

Its tasks are:

• Managing a priority for each cached split.

• Reacting to cache changes on other nodes.

• Notifying the job manager upon caching or de-caching a split.

6.4 Experiments

6.4.1 Setup

We have conducted several experiments running the closed frequent itemsets algorithm presented in Chapter 4 on the same data but from three different sources, with and without the cache mechanism. The data is the raw text of tweets from Twitter, and by running the closed frequent itemsets algorithm we found word-tuples that are likely to appear together.

The Hadoop cluster was composed of 4 Linux machines with 4 GB of memory each. All machines ran Hadoop version 2.6.0 with Java 8. The experiment was performed on the university campus with a 1 Gbit/s Internet connection.

The three data sources are:

• Regular HDFS files. This is the typical setup of Hadoop jobs and serves as a control group, to make sure the cache mechanism does not dramatically harm the common use case. For this setup, we used the Twitter API to download 50 million tweets (about 5 GB of data), which we split and saved in 50 different files.


Figure 6.1: Comparing the execution time of the 1st iteration on each of the data sources.


• A remote MySQL relational database. Mainly due to security concerns, but also for other reasons, in most commercial network setups the databases reside on a different network segment than the Hadoop cluster. In this situation data locality is not possible, both because the server is remote and because the data does not reside on HDFS. Any access to the data incurs heavy network traffic, especially iterative access. This is a perfect setup for testing the cache mechanism.

• The Twitter API. In most cases where data analysis is to be performed on data from Twitter, the tweets are first harvested and saved to local storage. The reason for this is simple: contacting Twitter is limited by the Twitter API, and it can also create a huge load on the network. Using the cache makes the harvesting procedure superfluous.

In the experiments we executed the algorithm on the different datasets, and measured the execution time on the first iteration only and on a full run.

6.4.2 Results

First, we ran the algorithm for mining closed frequent itemsets on the three data sources for a single iteration only, both with and without the cache mechanism. The results are shown in Figure 6.1. As can be seen, without exception, using the cache made the execution a little slower on all data sources. This was expected, because there is some time penalty in loading the splits into the cache.

Then we ran the same algorithm on the same three data sources, but with five iterations.


Figure 6.2: Comparing the execution time of the whole job on each of the data sources.


Figure 6.3: Comparing the execution time of different number of iterations on the Twitter API data source.

The results are shown in Figure 6.2. The figure shows that on every data source, using the cache reduces the execution time, especially when using the Twitter API. When using HDFS files as the data source the improvement is not very significant, for two reasons:

• 70% of the map tasks were data-local, which means that 70% of the input data was not transmitted over the network.

• All communication took place within the same network segment, which is very fast.

Lastly, we focused on the Twitter data and ran the closed frequent itemsets mining algorithm with an increasing number of iterations. The results are shown in Figure 6.3. The graph shows, as expected, that for the first iteration the non-cached version is faster than the cached version. However, for the following iterations, the additional time for the cached algorithm is negligible, while the non-cached one is far less efficient.

6.5 Conclusion and Discussion

The experiments clearly demonstrate that iterative MapReduce algorithms, such as classic K-Means clustering or mining closed frequent itemsets (which we used in this work), benefit greatly from a cache mechanism, especially when the data source is not a file residing on an HDFS deployed in the same network as the Hadoop cluster.

Chapter 7

Conclusion and Future Work

Live as if you were to die tomorrow. Learn as if you were to live forever.

Mahatma Gandhi

This dissertation explores the world of processing Big Data. Our journey began with understanding the challenges data scientists face today when they deal with the deluge of data. At the top of the list stands the problem that vertical scaling of computing power, i.e., buying a faster computer with more RAM, is limited, and the only way to process the data and get results in a reasonable time is to scale horizontally, i.e., to work on more computers in parallel.

However, horizontal scaling raises many engineering problems. The most important ones are: (i) automating the parallelization and distribution of the analysis, (ii) fault tolerance, meaning that the analysis of the data will complete even if some of the hardware fails along the way, (iii) I/O scheduling, meaning that the framework tries to run each task as close as possible to the data, and (iv) querying the status of the analysis job and monitoring the process.

In Chapter 3 we introduced a programming paradigm named MapReduce and its most popular implementation, Hadoop. Hadoop is designed in a way that relieves the programmer of the tedious job of handling the above issues: the programmer needs only to implement two functions, map and reduce, and the framework takes care of all the rest.

In this dissertation we study the capabilities and limitations of this new tool, and show how it can be used for various tasks whose common trait is that they handle vast amounts of data and cannot be executed on a single computer.

In the eyes of many programmers and researchers, MapReduce seems like a magic solution for theoretically endless scaling; however, some researchers, led by Turing Award winner Michael Stonebraker, claim that MapReduce is actually a major step backwards [15]. The authors have three main claims: (1) the lack of schemas in MapReduce makes applications more buggy and longer to develop, (2) MapReduce does not have any kind of index, like a hash or B-tree, and (3) MapReduce is missing important features like updates, transactions and integrity checks.

Since the above paper was published, the big data world has been flooded with schema-less NoSQL databases, and other papers have improved MapReduce with, for example, column-based storage [16], so it seems that MapReduce does claim its own place in the database world.

In Chapter 4 we show that MapReduce can be used for complex, non-trivial algorithms like mining closed frequent itemsets. Most algorithms published for this mining challenge are not designed for parallel processing and therefore fail to provide a feasible solution for today's data surge. The algorithm we introduce can scale practically as much as needed, taking advantage of any hardware we can spare.

Chapter 5 takes us one step further, into the world of distributed probabilistic databases, where we build execution plans for queries using MapReduce. The challenge here is to develop a query optimizer that does not break the safety of the execution plan. Big data is by definition incomplete, which makes the probabilistic model a good representation of the real world. We believe that this is one of the main directions big data research will take.

Finally, we identified a weakness in the design of Hadoop: iterations. Hadoop does not directly support iterations, i.e., processing the same data over and over again with only a small state that changes between iterations. The workaround used by engineers is to create a job for each iteration. This workaround is bad because the data needs to be read from its source on each iteration, which places a huge load on the network. The solution we presented is simple and elegant: a cache layer that requires only a very small change in how Hadoop jobs are coded.

Our journey into the realm of Big Data processing has not yet ended. At every step of the way we encountered intriguing questions and alternatives that we continued to explore, and some areas still await future research.

Appendices


Appendix A

List of Popular Big Data Repositories

• Data.gov
  http://www.data.gov/
  The home of the U.S. Government's open data.

• US Census Bureau
  http://www.census.gov/data.html
  The Census Bureau's mission is to serve as a source of data about the nation's people and economy.

• European Union Open Data Portal
  http://data.europa.eu/euodp/en/data/
  The European Union Open Data Portal is a point of access to a growing range of data from the institutions and other bodies of the European Union (EU). Data are free to use and reuse for commercial or non-commercial purposes. The EU Open Data Portal is managed by the Publications Office of the European Union. Implementation of the EU's open data policy is the responsibility of the Directorate-General for Communications Networks, Content and Technology of the European Commission.

• Data.gov.uk
  http://data.gov.uk/

• The CIA World Factbook
  http://www.cia.gov/library/publications/the-world-factbook/
  The World Factbook provides information on the history, people, government, economy, geography, communications, transportation, military, and transnational issues for 267 world entities.

• HealthData.gov
  http://www.healthdata.gov/
  Making high-value health data more accessible to entrepreneurs, researchers, and policy makers in the hopes of better health outcomes for all.

• Amazon Web Services public datasets
  https://aws.amazon.com/datasets/
  Public Data Sets on AWS provides a centralized repository of public data sets in various topics. For Amazon AWS users, this data can be seamlessly integrated into AWS cloud-based applications.

• Google Books Ngrams
  http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

• Million Song Dataset
  https://aws.amazon.com/datasets/million-song-dataset/
  The Million Songs Collection is a collection of 28 datasets containing audio features and metadata for a million contemporary popular music tracks.

Bibliography

[1] F.N. Afrati and J.D. Ullman. Optimizing joins in a map-reduce environment. In Proceedings of the 13th International Conference on Extending Database Technology, pages 99–110. ACM, 2010.

[2] Rakesh Agrawal, Ramakrishnan Srikant, et al. Fast algorithms for mining association rules. In Proc. 20th Int. Conf. Very Large Data Bases, VLDB, volume 1215, pages 487–499, 1994.

[3] Amazon. AWS public data sets. https://aws.amazon.com/public-data-sets/.

[4] Amazon. Elastic MapReduce (EMR). https://aws.amazon.com/elasticmapreduce/.

[5] Apache. Apache Avro. https://avro.apache.org/.

[6] Apache. Apache Lucene. https://lucene.apache.org/.

[7] Apache. Hive. http://hive.apache.org/.

[8] Apache. Pig. http://pig.apache.org/.

[9] Yingyi Bu, Bill Howe, Magdalena Balazinska, and Michael D. Ernst. HaLoop: Efficient iterative data processing on large clusters. Proceedings of the VLDB Endowment, 3(1-2):285–296, 2010.

[10] Cascading. Cascading. http://www.cascading.org/projects/cascading/.

[11] Roger Cavallo and Michael Pittarelli. The theory of probabilistic databases. In VLDB, volume 87, pages 1–4, 1987.

[12] James Cheng, Yiping Ke, and Wilfred Ng. A survey on algorithms for mining frequent itemsets over data streams. Knowledge and Information Systems, 16(1):1–27, 2008.

[13] Nilesh Dalvi and Dan Suciu. Efficient query evaluation on probabilistic databases. The VLDB Journal, 16:523–544, 2007. doi:10.1007/s00778-006-0004-3.

[14] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[15] David DeWitt and Michael Stonebraker. MapReduce: A major step backwards. The Database Column, 1:23, 2008.

[16] Jens Dittrich, Jorge-Arnulfo Quiané-Ruiz, Alekh Jindal, Yagiz Kargin, Vinay Setty, and Jörg Schad. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). Proc. VLDB Endow., 3(1-2):515–529, September 2010.

[17] Pedro Domingos. A few useful things to know about machine learning. Communications of the ACM, 55(10):78–87, 2012.

[18] Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, and Geoffrey Fox. Twister: A runtime for iterative MapReduce. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, pages 810–818. ACM, 2010.

[19] Leonidas Fegaras, Chengkai Li, and Upa Gupta. An optimization framework for map-reduce queries. In Proceedings of the 15th International Conference on Extending Database Technology, pages 26–37. ACM, 2012.

[20] Flickr. Flickr. https://www.flickr.com/.

[21] The Apache Software Foundation. Hadoop. http://hadoop.apache.org/.

[22] Bart Goethals. Frequent itemset mining dataset repository. http://fimi.ua.ac.be/data.

[23] Yaron Gonen and Ehud Gudes. An improved MapReduce algorithm for mining closed frequent itemsets. In 2016 IEEE International Conference on Software Science, Technology and Engineering (SWSTE), pages 77–83. IEEE, 2016.

[24] Google. Google Photos. https://photos.google.com/.

[25] Jiawei Han, Hong Cheng, Dong Xin, and Xifeng Yan. Frequent pattern mining: Current status and future directions. Data Mining and Knowledge Discovery, 15(1):55–86, 2007.

[26] Jiawei Han, Jian Pei, and Micheline Kamber. Data Mining: Concepts and Techniques. Elsevier, 2011.

[27] Chen He, Ying Lu, and David Swanson. Matchmaking: A new MapReduce scheduling technique. In Cloud Computing Technology and Science (CloudCom), 2011 IEEE Third International Conference on, pages 40–47. IEEE, 2011.

[28] IBM. IBM dataset generator. http://sourceforge.net/projects/ibmquestdatagen/.

[29] Apache Incubator. Apache MRQL. https://mrql.incubator.apache.org/.

[30] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. ACM SIGOPS Operating Systems Review, 41(3):59–72, 2007.

[31] Donald A. Jardine. The ANSI/SPARC DBMS Model; Proceedings of the Second SHARE Working Conference on Data Base Management Systems, Montreal, Canada, April 26-30, 1976. Elsevier Science Inc., 1977.

[32] Kaggle. Kaggle. https://www.kaggle.com/.

[33] Henry F. Korth and Abraham Silberschatz. Database System Concepts. McGraw-Hill, Inc., New York, NY, USA, 1986.

[34] Dustin Lang, David W. Hogg, Keir Mierle, Michael Blanton, and Sam Roweis. Astrometry.net: Blind astrometric calibration of arbitrary astronomical images. The Astronomical Journal, 139(5):1782, 2010.

[35] Yi Liang, Guangrui Li, Lei Wang, and Yanpeng Hu. Dacoop: Accelerating data-iterative applications on map/reduce cluster. In Parallel and Distributed Computing, Applications and Technologies (PDCAT), 2011 12th International Conference on, pages 207–214. IEEE, 2011.

[36] Guimei Liu, Hongjun Lu, Jeffrey Xu Yu, Wei Wang, and Xiangye Xiao. AFOPT: An efficient implementation of pattern growth approach. In FIMI, 2003.

[37] Claudio Lucchese, Carlo Mastroianni, Salvatore Orlando, and Domenico Talia. Mining@home: Toward a public-resource computing framework for distributed data mining. Concurrency and Computation: Practice and Experience, 22(5):658–682, 2010.

[38] Claudio Lucchese, Salvatore Orlando, and Raffaele Perego. Parallel mining of frequent closed patterns: Harnessing modern computer architectures. In Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on, pages 242–251. IEEE, 2007.

[39] Claudio Lucchese, Salvatore Orlando, Raffaele Perego, and Fabrizio Silvestri. Webdocs: A real-life huge transactional dataset. In FIMI, volume 126, 2004.

[40] Bernard Marr. Big data: 20 mind-boggling facts everyone must read. http://www.forbes.com/sites/bernardmarr/2015/09/30/big-data-20-mind-boggling-facts-everyone-must-read/.

[41] Nokia. Disco. http://discoproject.org/.

[42] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal. Discovering frequent closed itemsets for association rules. In ICDT '99, pages 398–416. Springer, 1999.

[43] Jian Pei, Jiawei Han, Runying Mao, et al. CLOSET: An efficient algorithm for mining frequent closed itemsets. In ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, volume 4, pages 21–30, 2000.

[44] Steven J. Plimpton and Karen D. Devine. MapReduce in MPI for large-scale graph algorithms. Parallel Computing, 37(9):610–632, 2011.

[45] Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Christos Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. In High Performance Computer Architecture, 2007. HPCA 2007. IEEE 13th International Symposium on, pages 13–24. IEEE, 2007.

[46] K.A. Robbins and S. Robbins. UNIX Systems Programming: Communication, Concurrency, and Threads. Prentice Hall PTR.

[47] A. Silberschatz, H. Korth, and S. Sudarshan. Database System Concepts. McGraw-Hill Education.

[48] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu, and Raghotham Murthy. Hive: A petabyte scale data warehouse using Hadoop. In 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010), pages 996–1005. IEEE, 2010.

[49] Jeffrey D. Ullman, H. Garcia-Molina, and J. Widom. Database Systems: The Complete Book, 2002.

[50] Carnegie Mellon University. NELL: Never-ending language learning. http://rtw.ml.cmu.edu/rtw/.

[51] Jianyong Wang, Jiawei Han, and Jian Pei. CLOSET+: Searching for the best strategies for mining frequent closed itemsets. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 236–245. ACM, 2003.

[52] Su-Qi Wang, Yu-Bin Yang, Guang-Peng Chen, Yang Gao, and Yao Zhang. MapReduce-based closed frequent itemset mining with efficient redundancy filtering. In Data Mining Workshops (ICDMW), 2012 IEEE 12th International Conference on, pages 449–453. IEEE, 2012.

[53] Wikipedia. Delegation pattern. https://en.wikipedia.org/wiki/Delegation_pattern.

[54] Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, and Ion Stoica. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In Proceedings of the 5th European Conference on Computer Systems, pages 265–278. ACM, 2010.

[55] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pages 10–10, 2010.