Fast Data Analytics by Learning
Fast Data Analytics by Learning

by Yongjoo Park

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Science and Engineering) in The University of Michigan, 2017.

Doctoral Committee:
    Associate Professor Michael Cafarella, Co-Chair
    Assistant Professor Barzan Mozafari, Co-Chair
    Associate Professor Eytan Adar
    Professor H. V. Jagadish
    Associate Professor Carl Lagoze

Yongjoo Park, [email protected]
ORCID iD: 0000-0003-3786-6214

© Yongjoo Park 2017

To my mother

Acknowledgements

I thank my co-advisors, Michael Cafarella and Barzan Mozafari. I am deeply grateful that I could start my graduate studies with Mike Cafarella. Mike is one of the nicest people I have met (I can say, throughout my life). He cared deeply about my personal life as well as my graduate study, and he always encouraged and supported me so that I could make the best decisions for my life. He was certainly more than an academic advisor. Mike also has an excellent talent for presentation: his explanations of any subject are well organized and straightforward. Sometimes I find myself imitating his presentation style when I give my own talks, which, I believe, is natural after being advised by him for more than five years.

I started working with Barzan a few years after I came to Michigan. He excels at writing research papers and presenting work in the most interesting, formal, and concise way. He devoted a tremendous amount of time to me, and I learned a great deal about writing and presentation from him. I am glad to be his first PhD student, and I hope he is glad about that as well.

Now, as I finish my long (and very enjoyable) PhD study, I deeply understand the importance of having good advisors, and I am sincerely grateful to have been advised by Mike and Barzan. I have always thought my life has been full of fortunes I cannot explain, and having them as my advisors was one of them.
I am sincerely grateful to my thesis committee members. Carl Lagoze and Eytan Adar were kind enough to serve on my committee, and they were supportive. Eytan Adar gave detailed feedback on this dissertation, which helped me greatly improve its completeness. H. V. Jagadish gave me wonderful advice on presentations and writing; even from my relatively short experience with him, I could immediately sense that he is wise and a great mentor.

My PhD study was greatly helped by the members of the database group at the University of Michigan, Ann Arbor. Dolan Antenucci helped me set up systems and shared many interesting real-world datasets with me; some of the results in this dissertation are based on datasets he provided. I also had good discussions with Mike Anderson, Shirley Zhe Chen, and Matthew Burgess. I only regret that I did not collaborate more with the people in the database group.

My graduate study would not have been possible without the financial support of the Jeongsong Cultural Foundation. When I first came to the United States as a Masters student (without guaranteed funding), the Jeongsong Cultural Foundation provided me with tuition support. I will always be thankful, and I intend to repay my debt to society. The Kwanjeong Educational Foundation also supported my PhD studies; with its help, I could focus on my research without any financial concerns. I hope to become someone like them, who can contribute to society. My research is one part of such efforts, but I hope I can do more.

Last, but most importantly, I deeply thank my parents and my sister, who always supported me and my study. I was fortunate to have my family; I have always been a taker. I must mention my mother, my beloved mother, who devoted her entire life to her children. My life has become better and better over time. Now I do not have to worry about paying phone bills or bus fares; I can easily afford Starbucks coffee and commute by car.
But I want to go back to the time when I was young. I miss my mother. I miss the time when I fell asleep beside her, watching a late-night movie on television. I miss the food she made for me, and I miss my conversations with her. I miss her smile, and I miss knowing that there was always a person in this world who would stand and speak for me no matter what I did. I received the best education in the world only because of her selfless, boundless sacrifice. She wanted only for her children to receive the best support, eat the tastiest food, and enjoy everything other kids would enjoy; yet she never allowed herself any of those things. To her, raising her children, me and my sister, was her only happiness, and she carried out that task better than anyone. Both of her children grew up wonderfully: I conducted fine research during my graduate studies, and I will continue to do so; my sister became a medical doctor. I want to tell my mom that her life was worth more than any other, and that she should be proud of herself.

Contents

Dedication
Acknowledgements
List of Figures
List of Tables
Abstract

1 Introduction
    1.1 Data Analytics Tasks and Synopses Construction
        1.1.1 Data Analytics Tasks
        1.1.2 Approaches to Building Synopses
    1.2 Overview and Contributions
        1.2.1 Exploiting Past Computations
        1.2.2 Task-Aware Synopses
        1.2.3 Summary of Results

2 Background
    2.1 Techniques for Exact Data Analytics
        2.1.1 Horizontal Scaling
        2.1.2 Columnar Databases
        2.1.3 In-memory Databases
    2.2 Approaches for Approximate Query Processing
        2.2.1 Random Sampling
        2.2.2 Random Projections

Part I: Exploiting Past Computations

3 Faster Data Aggregations by Learning from the Past
    3.1 Motivation
    3.2 Verdict Overview
        3.2.1 Architecture and Workflow
        3.2.2 Supported Queries
        3.2.3 Internal Representation
        3.2.4 Why and When Verdict Offers Benefit
        3.2.5 Limitations
    3.3 Inference
        3.3.1 Problem Statement
        3.3.2 Inference Overview
        3.3.3 Prior Belief
        3.3.4 Model-based Answer
        3.3.5 Key Challenges
    3.4 Estimating Query Statistics
        3.4.1 Covariance Decomposition
        3.4.2 Analytic Inter-tuple Covariances
    3.5 Verdict Process Summary
    3.6 Deployment Scenarios
    3.7 Formal Guarantees
    3.8 Parameter Learning
    3.9 Model Validation
    3.10 Generalization of Verdict under Data Additions
    3.11 Experiments
        3.11.1 Experimental Setup
        3.11.2 Generality of Verdict
        3.11.3 Speedup and Error Reduction
        3.11.4 Confidence Interval Guarantees
        3.11.5 Memory and Computational Overhead
        3.11.6 Impact of Data Distributions and Workload Characteristics
        3.11.7 Accuracy of Parameter Learning
        3.11.8 Model Validation
        3.11.9 Data Append
        3.11.10 Verdict vs. Simple Answer Caching
        3.11.11 Error Reductions for Time-Bound AQP Engines
    3.12 Prevalence of Inter-tuple Covariances in Real-World
    3.13 Technical Details
        3.13.1 Double-integration of Exp Function
        3.13.2 Handling Categorical Attributes
        3.13.3 Analytically Computed Parameter Values
    3.14 Related Work
    3.15 Summary

Part II: Building Task-aware Synopses

4 Accurate Approximate Searching by Learning from Data
    4.1 Motivation
    4.2 Hashing-based kNN Search
        4.2.1 Workflow
        4.2.2 Hash Function Design
    4.3 Neighbor-Sensitive Hashing
        4.3.1 Formal Verification of Our Claim
        4.3.2 Neighbor-Sensitive Transformation
        4.3.3 Our Proposed NST
        4.3.4 Our NSH Algorithm
    4.4 Experiments
        4.4.1 Setup
        4.4.2 Validating Our Main