Optimizing Timeliness, Accuracy, and Cost in Geo-Distributed Data-Intensive Computing Systems

A DISSERTATION SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY

Benjamin Heintz

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy

Professor Abhishek Chandra

December, 2016

© Benjamin Heintz 2016
ALL RIGHTS RESERVED

Acknowledgements

I would like to thank Chenyu Wang for developing and running the experiments in Chapter 3, and Ravali Kandur for her assistance deploying Apache Storm on PlanetLab. The work presented here was supported in part by NSF Grants CNS-0643505, CNS-0519894, CNS-1413998, and IIS-0916425, by an IBM Faculty Award, and by a University of Minnesota Doctoral Dissertation Fellowship.

About six years ago, I set my sights on an M.S. degree in Computer Science. As I was beginning to plan for what I thought would be a brief career as a graduate student, I met with Prof. Jaideep Srivastava to discuss possible summer research. I am incredibly thankful that, instead of talking about which summer projects I could work on, he encouraged me to aim higher and pursue a Ph.D.

I owe a tremendous debt of gratitude to my collaborators, especially Prof. Jon Weissman, Prof. Ramesh K. Sitaraman, and my advisor, Prof. Abhishek Chandra. If I have developed even a small fraction of their ability to see the details without losing track of the big picture, to write well, and most of all to be persistent, then the last five years will have been worth it. I am particularly grateful for my advisor’s endless patience and unshakable calm. Some of the most important results in my research have only come together after I have nearly given up on a deadline or called off a set of experiments. I am grateful that he never let me off the hook so easily.

On a personal note, I cannot thank my parents Henry and Jayne Heintz enough for their support, especially through college and graduate school. I would not have access to a sliver of the opportunities that I do if it weren’t for them. Finally, thanks to my wife Anne, for always being there for me, even when that means listening to me complain about how my experiments aren’t working.

Dedication

To my dad.

Abstract

Big Data touches every aspect of our lives, from the way we spend our free time to the way we make scientific discoveries. Netflix streamed more than 42 billion hours of video in 2015, and in the process recorded massive volumes of data to inform video recommendations and plan investments in new content. The CERN Large Hadron Collider produces enough data to fill more than one billion DVDs every week, and this data has led to the discovery of the Higgs boson particle.

Such large scale computing is challenging because no one machine is capable of ingesting, storing, or processing all of the data. Instead, applications require distributed systems comprising many machines working in concert. Adding to the challenge, many data streams originate from geographically distributed sources. Scientific sensors such as LIGO span multiple sites and generate data too massive to process at any one location. The machines that analyze these data are also geo-distributed; for example Netflix and Facebook users span the globe, and so do the machines used to analyze their behavior.

Many applications need to process geo-distributed data on geo-distributed systems with low latency. A key challenge in achieving this requirement is determining where to carry out the computation. For applications that process unbounded data streams, two performance metrics are critical: WAN traffic and staleness (i.e., delay in receiving results). To optimize these metrics, a system must determine when to communicate results from distributed resources to a central data warehouse.

As an additional challenge, constrained WAN bandwidth often renders exact computation infeasible. Fortunately, many applications can tolerate inaccuracy, albeit with diverse preferences. To support diverse applications, systems must determine what partial results to communicate in order to achieve the desired staleness-error tradeoff.

This thesis presents answers to these three questions—where to compute, when to communicate, and what partial results to communicate—in two contexts: batch computing, where the complete input data set is available prior to computation; and stream computing, where input data are continuously generated. We also explore the challenges facing emerging programming models and execution engines that unify stream and batch computing.

Contents

Acknowledgements

Dedication

Abstract

List of Tables

List of Figures

1 Introduction
  1.1 Challenges in Geo-Distributed Data-Intensive Computing
    1.1.1 Where to Place Computation
    1.1.2 When to Communicate Partial Results
    1.1.3 What Partial Results to Communicate
  1.2 Summary of Research Contributions
    1.2.1 Batch Processing
    1.2.2 Stream Processing
    1.2.3 Unified Stream & Batch Processing
  1.3 Outline

2 Model-Driven Optimization of MapReduce Applications
  2.1 Introduction
    2.1.1 The MapReduce Framework and an Optimization Example
    2.1.2 Research Contributions
  2.2 Model and Optimization
    2.2.1 Model
    2.2.2 Model Validation
    2.2.3 Optimization Algorithm
  2.3 Model and Optimization Results
    2.3.1 Modeled Environments
    2.3.2 Heuristics vs. Model-Driven Optimization
    2.3.3 End-to-End vs. Myopic
    2.3.4 Single-Phase vs. Multi-Phase
    2.3.5 Barriers vs. Pipelining
    2.3.6 Compute Resource Distribution
    2.3.7 Input Data Distribution
  2.4 Comparison to Hadoop
    2.4.1 Experimental Setup
    2.4.2 Applications
    2.4.3 Experimental Results
  2.5 Related Work
  2.6 Concluding Remarks

3 Cross-Phase Optimization in MapReduce
  3.1 Introduction
  3.2 Map-Aware Push
    3.2.1 Overlapping Push and Map to Hide Latency
    3.2.2 Overlapping Push and Map to Improve Scheduling
    3.2.3 Map-Aware Push Scheduling
    3.2.4 Implementation in Hadoop
    3.2.5 Experimental Results
  3.3 Shuffle-Aware Map
    3.3.1 Shuffle-Aware Map Scheduling
    3.3.2 Implementation in Hadoop
    3.3.3 Experimental Results
  3.4 Putting it all Together
    3.4.1 EC2
    3.4.2 PlanetLab
  3.5 Related Work
  3.6 Concluding Remarks

4 Optimizing Timeliness and Cost in Geo-Distributed Streaming Analytics
  4.1 Introduction
    4.1.1 Challenges in Optimizing Windowed Grouped Aggregation
    4.1.2 Research Contributions
  4.2 Problem Formulation
  4.3 Dataset and Workload
  4.4 Minimizing WAN Traffic and Staleness
  4.5 Practical Online Algorithms
    4.5.1 Simulation Methodology
    4.5.2 Emulating the Eager Optimal Algorithm
    4.5.3 Emulating the Lazy Optimal Algorithm
    4.5.4 The Hybrid Algorithm
  4.6 Implementation
    4.6.1 Edge
    4.6.2 Center
  4.7 Experimental Evaluation
    4.7.1 Aggregation Using a Single Edge
    4.7.2 Scaling to Multiple Edges
    4.7.3 Effect of Laziness Parameter
    4.7.4 Impact of Network Capacity
  4.8 Discussion
    4.8.1 Compression
    4.8.2 Variable-Size Aggregates
    4.8.3 Applicability to Other Environments
  4.9 Related Work
  4.10 Concluding Remarks

5 Trading Timeliness and Accuracy in Geo-Distributed Streaming Analytics
  5.1 Introduction
  5.2 Problem Formulation
    5.2.1 Problem Statement
    5.2.2 The Staleness-Error Tradeoff
  5.3 Offline Algorithms
    5.3.1 Minimizing Staleness (Error-Bound)
    5.3.2 Minimizing Error (Staleness-Bound)
    5.3.3 High-Level Lessons
  5.4 Online Algorithms
    5.4.1 The Two-Part Cache Abstraction
    5.4.2 Error-Bound Algorithms
    5.4.3 Staleness-Bound Algorithms
    5.4.4 Implementation
  5.5 Evaluation
    5.5.1 Setup
    5.5.2 Comparison Algorithms
    5.5.3 Simulation Results: Minimizing Staleness (Error-Bound)
    5.5.4 Simulation Results: Minimizing Error (Staleness-Bound)
    5.5.5 Experimental Results
  5.6 Discussion
    5.6.1 Alternative Approximation Approaches
    5.6.2 Coordination Between Multiple Edges
    5.6.3 Incorporating WAN Latency
  5.7 Related Work
  5.8 Concluding Remarks

6 Challenges and Opportunities in Unified Stream & Batch Analytics
  6.1 Introduction
  6.2 Background
    6.2.1 Diverged Stream and Batch Processing
    6.2.2 Early Attempts at Unified Stream & Batch Processing
    6.2.3 The State of the Art: Apache Beam
    6.2.4 Challenges Magnified by Geo-Distribution
  6.3 Simulation Setup
  6.4 Programming Models
    6.4.1 Convenience
    6.4.2 Expressiveness
    6.4.3 Research Challenges
  6.5 Execution Engines
    6.5.1 Hopping Windows
    6.5.2 Out-of-Order Arrivals
  6.6 Concluding Remarks

7 Conclusion
  7.1 Research Contributions
    7.1.1 Batch Processing
    7.1.2 Stream Processing
    7.1.3 Unified Stream & Batch Processing
  7.2 Future Research Directions
    7.2.1 Alternative Performance Metrics and Optimization Objectives
    7.2.2 Alternative Applications
    7.2.3 Scalable System Implementations

References

List of Tables

2.1 Measured bandwidth (KBps) of the slowest/fastest links between nodes in each continent.
3.1 Measured bandwidths in the EC2 experimental setup.
3.2 Measured bandwidths in the PlanetLab experimental setup.
3.3 Number of map tasks assigned to each mapper node in our PlanetLab test environment.
4.1 Queries used throughout this thesis.

List of Figures

1.1 Distribution of log data volume from Akamai’s download network for December 2010. The y-axis shows the fraction of log data generated in a given country (on log-scale); the x-axis shows the rank of that country in sorted order by volume.
2.1 An example of a two-cluster distributed environment, with one data source, one mapper, and one reducer in each cluster. The intra-cluster (local) communication links are solid lines, while the inter-cluster (non-local) communication links are the dotted lines.
2.2 A tripartite graph model for distributed MapReduce with 3 data sources, 2 mappers, and 2 reducers.
2.3 Measured vs. model-predicted makespan from the Hadoop implementation on the local emulated testbed.
2.4 A comparison of three optimization heuristics and our end-to-end multi-phase optimization (e2e multi).
2.5 Our end-to-end multi-phase optimization (e2e multi) compared to a uniform heuristic and a myopic optimization approach.
2.6 Our end-to-end multi-phase optimization (e2e multi) compared to a uniform heuristic as well as end-to-end push (e2e push) and end-to-end shuffle (e2e shuffle) optimization.
2.7 A comparison of barrier configurations. Makespan is normalized relative to the optimal makespan for an all-global-barrier configuration. Each bar represents the boundary at which the global barrier is relaxed to pipelining; “all” corresponds to an all-pipelining configuration.
2.8 Our end-to-end multi-phase optimization (e2e) compared to a uniform heuristic and myopic optimization for different network environments. Makespan is normalized relative to the uniform baseline.
2.9 Our end-to-end multi-phase optimization (e2e) compared to uniform and myopic baselines in the eight-node PlanetLab environment with inputs distributed according to Akamai logs. Makespan is normalized relative to the uniform baseline.
2.10 Actual makespan for three sample applications on our local testbed. Error bars reflect 95% confidence intervals.
2.11 Effect of Hadoop’s dynamic scheduling mechanisms applied atop our optimized static execution plan for three sample applications. Error bars reflect 95% confidence intervals.
2.12 Effect of Hadoop’s dynamic scheduling mechanisms applied atop a competitive Hadoop baseline plan. Error bars reflect 95% confidence intervals.
2.13 Effect of HDFS file replication for three sample applications. Error bars reflect 95% confidence intervals.
3.1 A simple example network with two data sources and two mappers.
3.2 Runtime of a Hadoop WordCount job on text data for the push-then-map approach and the Map-Aware Push approach on globally distributed Amazon EC2 and PlanetLab test environments. Error bars reflect 95% confidence intervals.
3.3 An example network where links from mapper C to reducers D and E are shuffle bottlenecks.
3.4 Runtime of the InvertedIndex job on eBook data for the default Hadoop scheduler and our Shuffle-Aware Map scheduler. Both approaches use an overlapped push and map in these experiments. Error bars reflect 95% confidence intervals.
3.5 Execution time for traditional Hadoop compared with our proposed Map-Aware Push and Shuffle-Aware Map techniques (together, End-to-end) for InvertedIndex and Sessionization applications on our EC2 test environment. Error bars reflect 95% confidence intervals.
3.6 Execution time for traditional Hadoop compared with our proposed Map-Aware Push and Shuffle-Aware Map techniques (together, End-to-end) for the InvertedIndex application with 800 MB eBook data on our PlanetLab test environment. Error bars reflect 95% confidence intervals.
4.1 The distributed model for a typical analytics service comprises a single center and multiple edges, connected by a wide-area network.
4.2 Simple aggregation algorithms such as streaming and batching optimize one metric over the other and their performance depends on the data and query characteristics.
4.3 Staleness s is defined as the delay between the end of the window and final results for the window becoming available at the center.
4.4 Akamai Web download service data set characteristics.
4.5 Eager online algorithms.
4.6 Lazy online algorithms.
4.7 Sensitivity of the hybrid algorithms with a range of α values to overpredicting the available network capacity. Staleness is normalized by window length.
4.8 Cache size over time for eager and lazy offline optimal algorithms. Sizes are normalized relative to the largest size.
4.9 Average traffic for hybrid algorithms with several values of the laziness parameter α. Traffic is normalized relative to an optimal algorithm.
4.10 Aggregation is distributed over Apache Storm clusters at each edge as well as at the center.
4.11 Performance for batching, streaming, optimal, and our hybrid algorithm for the large query with a low stream arrival rate using a one-edge Apache Storm deployment on PlanetLab.
4.12 Performance for batching, streaming, optimal, and our hybrid algorithm for a range of queries and stream arrival rates using a three-edge Apache Storm deployment on PlanetLab.
4.13 Performance for batching, streaming, optimal, and our hybrid algorithm for the large query with low and high stream arrival rates using a six-edge Apache Storm deployment on PlanetLab.
4.14 Effect of laziness parameter α using a three-edge Apache Storm deployment on PlanetLab with query large.
4.15 Traffic and staleness for different algorithms over a range of network capacities.
4.16 Compression ratio and latency for Google Snappy compression applied at several granularities to our anonymized Akamai trace.
4.17 Size of exact and approximate set representations with and without compression.
5.1 The staleness-error tradeoff curve for a Sum aggregation over a single time window. Note that we normalize staleness relative to the window length and error relative to the maximum possible error for the window.
5.2 We view the network as a sequence of contiguous slots. Shaded slots are unavailable to the current window due to the previous window’s staleness.
5.3 Smallest-potential-error-first ordering minimizes error.
5.4 The cache is dynamically partitioned into primary and secondary parts. Updates are flushed only from the primary cache.
5.5 Impact of primary cache eviction policy on staleness given an error bound. Note the logarithmic y-axis scale.
5.6 Impact of the initial prefix error prediction algorithm on staleness given error bound. Note the logarithmic axis scales.
5.7 The impact of primary cache sizing policy on error given staleness bound. Note the logarithmic axis scales.
5.8 Normalized staleness for various aggregations and error bounds. Note the logarithmic y-axis scales.
5.9 Normalized staleness for Sum aggregation with various error bounds under medium and high bandwidth. Note the logarithmic y-axis scales.
5.10 Normalized error for various aggregations and staleness bounds. Note the logarithmic axis scales.
5.11 Normalized error for Sum aggregation with various staleness bounds under medium and high bandwidth. Note the logarithmic y-axis scale.
5.12 Comparison of Random and Caching algorithms for various aggregation functions on a PlanetLab testbed. Note the logarithmic y-axis scale.
5.13 Comparison of Random and Caching algorithms for Sum aggregation with various error and staleness constraints on a local testbed. Note the logarithmic y-axis scale.
5.14 Comparison of Random and Caching algorithms for Max aggregation with various error and staleness constraints on a local testbed. Note the logarithmic y-axis scale.
5.15 Comparison of Random and Caching algorithms for Sum aggregation under dynamic workload and WAN bandwidth on a local testbed. Arrival rate increases during window 6 (dashed vertical line), and available bandwidth changes during window 20 (dotted vertical line).
5.16 Comparison of Random and Caching algorithms for Sum aggregation with various emulated WAN bandwidths on a local testbed. Note the logarithmic y-axis scale.
5.17 Comparison of Random and Caching algorithms for Max aggregation with various emulated WAN bandwidths on a local testbed. Note the logarithmic y-axis scale.
5.18 Comparison of Random, Random (2-part), and Caching algorithms for Sum and Max aggregations at medium WAN bandwidth. Note the logarithmic y-axis scale.
6.1 Staleness over time for a Sum aggregation under an error constraint.
6.2 Staleness over time for a Sum aggregation under a staleness constraint.
6.3 Error over time for a Sum aggregation under a staleness constraint.
6.4 Hopping windows have length W and advance at interval H.
6.5 Traffic and staleness for an exact Sum aggregation.
6.6 The parallel approach with (offline) duplicate elimination is suboptimal.
6.7 Traffic, staleness, and error for a Sum aggregation under an error constraint.
6.8 Traffic, staleness, and error for a Max aggregation under an error constraint.
6.9 Traffic, staleness, and error for a Sum aggregation under a staleness constraint of W/16.
6.10 Traffic, staleness, and error for a Sum aggregation under an error constraint, using a 99th-percentile heuristic watermark, for arrivals with mean delay of 1% of the window length.
6.11 Traffic, staleness, and error for a Sum aggregation under an error constraint for tumbling windows, using a 99th-percentile heuristic watermark.
6.12 Traffic, staleness, and error for a Sum aggregation under an error constraint for tumbling windows, for arrivals with mean delay of 1% of the window length.

Chapter 1

Introduction

Big Data touches virtually every aspect of our lives, from the way we spend our free time, to the way we connect and share with friends and family, and even the way we make breakthrough scientific discoveries. For example, Netflix streamed more than 42 billion hours of video in 2015 [1], and in doing so, recorded trillions of events describing system and user behavior, which it used to recommend new videos, evaluate new features, and plan investment in new content [2]. Every day, more than a billion people around the world log onto Facebook and upload more than 350 million photos [3, 4]. These photos not only serve to connect users with one another, but also help teach massive computing systems how to understand images and make their descriptions accessible to visually impaired users [5]. Meanwhile, the CERN Large Hadron Collider produces enough data each week to fill more than one billion DVDs [6], and this data has led to the discovery of the Higgs boson particle, while massive data streams from LIGO [7] have enabled the detection of gravitational waves.

Computing at such massive scale is enormously challenging due to both the size and growth rate of data: No individual computer is capable of ingesting and storing all of the data, let alone performing the required computation. As a result, these applications require large distributed systems comprising many machines working in concert.

Adding to the challenge is the fact that many interesting data streams originate from geographically distributed sources. For example, smart home sensor data and content delivery network (CDN) log data originate from sources that are physically fixed at diverse locations all around the globe; such data are truly “born distributed” [8]. In scientific settings, sensors such as LIGO span several physical sites, and the data they generate are too massive to process at a single location. In many cases, the computing resources available to analyze these data are also geo-distributed. For example, Netflix and Facebook users span the globe, and so do the computers used to analyze and understand their behavior.

1.1 Challenges in Geo-Distributed Data-Intensive Computing

Many modern applications need to process geo-distributed data on geo-distributed resources. As a motivating example, consider a large content delivery network (CDN) such as Akamai. Content providers use CDNs to deliver web content, live and on-demand video, software downloads, and web applications to users around the world. The servers of a CDN are deployed in clusters in hundreds or even thousands of data centers around the world. Each server of a CDN records detailed data about each user that it interacts with: every web object that is served, each stream that is played, each application that is accessed, as well as each user action such as playing or pausing a stream. Besides user access information, each server also records network-level and system-level data such as network connection statistics [9]. In aggregate, the servers produce tens of billions of lines of log data originating from over a thousand locations each day.

Figure 1.1 shows the truly distributed nature of log data from Akamai’s download network for the month of December 2010 where one can see that no individual country generated more than 14% of the log volume. Analytics applications must process this massive geo-distributed data to extract detailed information, for example describing who is accessing the content, and from which networks and which geographies. Such applications present major challenges for researchers because communication over wide-area network links is often both slow and financially costly, and because both communication and computation resources are typically highly heterogeneous. The choice of where to place computation as well as when and precisely what to communicate therefore have pronounced effects on the timeliness, accuracy, and cost of geo-distributed computation.


Figure 1.1: Distribution of log data volume from Akamai’s download network for De- cember 2010. The y-axis shows the fraction of log data generated in a given country (on log-scale); the x-axis shows the rank of that country in sorted order by volume.

1.1.1 Where to Place Computation

Geo-distributed data processing must be done in a manner that minimizes the completion time (i.e., makespan) so that analysts can quickly understand and act upon the key knowledge derived from their data. One major challenge in achieving low makespan when processing distributed data is determining where to carry out the computation. Jim Gray noted that “you can either move your questions or the data” [10]. These two options, however, represent two extreme possibilities.

At one extreme, sending massive amounts of data from the diverse locations where the data originate to a single centralized location may be too slow [11] to meet the requirements of low latency. Further, the cost may be prohibitive, both in terms of data transfer costs as well as the need to build a single large dedicated data center for application processing. Finally, a centralized solution is intrinsically less fault-tolerant than a distributed one, since failure of the single centralized data center can cause a complete outage that is unacceptable for critical applications such as analytics and monitoring.

At the other extreme, if we map computation onto each input datum in situ, the results of these local subcomputations comprise intermediate data that must be aggregated or reorganized to generate the final analysis results. Unless such intermediate data are small relative to the input data, combining the intermediate data could be more costly than moving input data to a single centralized location. Further, the various locations may be highly heterogeneous in terms of their capacity to perform computation and communicate results, leading to imbalanced resource utilization.

Between the two extremes of moving all the data to a centralized location for processing and moving all the computation to the sources of data, there is a rich continuum of possibilities that may in fact be more efficient than either extreme. We explore this continuum to optimize the completion time of data-intensive applications that process geo-distributed data.

1.1.2 When to Communicate Partial Results

When processing unbounded streams of data, it is common to logically subdivide inputs into time windows and produce results summarizing each of these windows. For instance, a web analyst may wish to count the total visits to her web site broken down by country and aggregated on an hourly basis to track content popularity. Similarly, a network operator may want to compute the average load in different parts of the network every five minutes to identify hotspots. In each case, the user would define a standing query that generates results periodically for each time window (every hour, five minutes, etc.).

Two key metrics are critical when performing such streaming computation in a geo-distributed setting: wide-area-network (WAN) traffic and staleness (the delay in receiving the result for a time window). In a typical geo-distributed analytics service, computation can be performed both at edge servers near the data sources and at a central data warehouse location. In this setting, the choice of when to communicate results from edges to the center is critical, as it impacts both traffic and staleness. For example, if inputs are only briefly aggregated at the edges before results are sent to the center, then staleness may be quite low, but excessive traffic will flow from the edges to the center. On the other hand, if inputs are aggregated at the edge for the full duration of the window, then traffic will be minimized, but staleness will be high due to the need to communicate a large batch of results at the end of the window.

As we will show, simple techniques fail to optimize both traffic and staleness at the same time, and instead pit these critical metrics against each other. Further, the factors influencing the performance of such algorithms are often hard to predict and can vary significantly over time, requiring the design of algorithms that can adapt to changing factors in a dynamic fashion.
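To make this primitive concrete, the following minimal Python sketch (illustrative only; it is not one of the algorithms developed in this thesis, and the names on_visit, flush, and send_to_center are hypothetical) maintains hourly per-country visit counts at an edge and ships a window’s partial aggregates to the center when flush is called. Deciding when to call flush is precisely the question studied here.

    from collections import defaultdict

    WINDOW = 3600  # one-hour windows, in seconds

    # window index -> country -> visit count (partial aggregates held at the edge)
    partials = defaultdict(lambda: defaultdict(int))

    def on_visit(timestamp, country):
        # Aggregate each arriving input into its time window, grouped by country.
        partials[timestamp // WINDOW][country] += 1

    def flush(window, send_to_center):
        # Communicate (and discard) the partial aggregates for one window.
        send_to_center(window, dict(partials.pop(window, {})))

    on_visit(7, "US"); on_visit(42, "IN"); on_visit(55, "US")
    flush(0, lambda w, counts: print("window", w, counts))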

1.1.3 What Partial Results to Communicate

As an additional challenge, constrained WAN bandwidth often renders exact computation with bounded staleness infeasible [12]. Fortunately, many real-world applications can tolerate some staleness or inaccuracy in their final results, albeit with diverse preferences. For instance, a network administrator may need to be alerted to potential network overloads quickly (within a few seconds or minutes), even if there is some degree of error in the results describing network load. On the other hand, a Web analyst might have only a small tolerance for error (say, <1%) in the application statistics (e.g., number of page hits), but be willing to wait for some time to obtain these results with the desired accuracy.

In order for a geo-distributed streaming analytics system to support this diverse range of application requirements, it must determine what partial results to communicate in order to achieve the desired staleness-error tradeoff. For example, if an application requires results with very low staleness, but can tolerate greater error, then it may be necessary to communicate early results that reflect only a partial view of the inputs. On the other hand, if an application demands a high degree of accuracy but can tolerate delay, then the results should be produced later, when a more complete view of inputs is available.

1.2 Summary of Research Contributions

This thesis presents systematic answers to three critical questions—where to compute, when to communicate, and what partial results to communicate—in two contexts: batch computing, where the full input data set is available prior to computation; and stream computing, where input data are continuously generated. We also explore the challenges and opportunities facing emerging programming models and execution engines that aim to unify stream and batch computing.

1.2.1 Batch Processing

In the batch processing context, we focus on the question of where to perform computation. We seek to minimize the execution time of applications in MapReduce [13], the de facto standard model for very-large-scale batch computing.

The main challenges here arise due to heterogeneity, as network links between machines exhibit diverse performance characteristics, and the machines themselves vary in their capacity to perform computation. The choice of where to compute is therefore critical: some network links and machines are much slower than others, and an efficient placement must avoid bottlenecks due to these slow resources while at the same time making effective use of all available resources.

Our work in this context follows a two-pronged approach: We first develop model-driven optimization techniques that serve as oracles and provide high-level insights, and then we apply these insights to develop practical techniques that we implement and demonstrate in a real-world MapReduce implementation.

Model-Driven Optimization

We devise a model to predict the execution time of MapReduce applications in geographically distributed environments as well as an optimization framework for determining the best placement of data and computation. We modify the popular Hadoop MapReduce framework to enact the placement decisions produced by our model-driven optimization, and we use this implementation to experimentally validate our model. We use our model and optimization to derive several insights, for example that application characteristics can significantly influence the optimal data and computation placement, and that our optimization yields greater benefits as the environment becomes increasingly distributed. Further, we show that our optimization approach can significantly outperform standard Hadoop based on experiments in a geographically distributed PlanetLab [14] testbed.

Practical Systems Techniques

Based on the insights from our model-driven optimization, we develop practical techniques for real-world MapReduce implementations such as Hadoop. In particular, we develop new cross-phase optimization techniques that optimize the data ingest (i.e., push) and map phases to enable push-map overlap and to allow map behavior to feed back into push dynamics. Similarly, we devise techniques that optimize the map and reduce phases to enable shuffle cost to feed back and affect map scheduling decisions. We demonstrate the effectiveness of these techniques in reducing execution time for diverse applications through experiments on geographically distributed testbeds on both Amazon EC2 and PlanetLab.

1.2.2 Stream Processing

While batch computing is a fundamental part of many applications, stream computing continues to grow in importance. In many cases, analysts need to extract information from data in near-real-time. For example, systems operators at Netflix need to know within seconds where performance anomalies are occurring in order to avoid prolonged interruptions for their customers. Meanwhile, security analysts have little tolerance for delay in identifying possible intrusions. To support applications like these, systems consume continuous streams of data and produce summaries such as maximum or average values or trends over time.

Exact Computation

Our first contribution in the streaming context focuses on applications where exact results are required. A typical geo-distributed analytics service uses a simple hub-and-spoke model, comprising a single central data warehouse and multiple edges connected by a wide-area network (WAN). We examine the question of when to communicate results from the edges to the center in the context of windowed grouped aggregation, an important and widely used primitive in streaming analytics applications. We focus on designing algorithms to optimize two key metrics of any geo-distributed streaming analytics service: WAN traffic and staleness (the delay in getting the result).

Towards this end, we present a family of optimal offline algorithms that jointly minimize both staleness and traffic. Using this as a foundation, we develop practical online aggregation algorithms based on the observation that grouped aggregation can be modeled as a caching problem where the cache size varies over time. This key insight allows us to exploit well known caching techniques in our design of online aggregation algorithms. We demonstrate the practicality of these algorithms through an implementation in Apache Storm, deployed on a PlanetLab testbed. The results of our experiments, driven by workloads derived from anonymized traces of a popular web analytics service offered by a large commercial CDN, show that our online aggregation algorithms perform close to the optimal offline algorithms for a variety of system configurations, stream arrival rates, and query types.

Approximate Computation

Due to limited WAN bandwidth, it is not always possible to produce exact results in near-real-time. When this is the case, applications must either sacrifice timeliness by allowing delayed—i.e., stale—results, or sacrifice accuracy by allowing some error in final results. Continuing our focus on windowed grouped aggregation, we present optimal offline algorithms for minimizing staleness under an error constraint and for minimizing error under a staleness constraint. Using these offline algorithms as references, we present practical online algorithms for effectively trading off timeliness and accuracy under bandwidth limitations. Using a real-world workload, we demonstrate the effectiveness of our techniques through both trace-driven simulation as well as experiments on an Apache Storm-based implementation deployed on PlanetLab. Experiments demonstrate that our algorithms significantly reduce staleness and error relative to a practical random sampling/batching-based baseline across a diverse set of aggregation functions.

1.2.3 Unified Stream & Batch Processing

While both batch processing and stream processing are important topics on their own, applications are increasingly being developed that do not fit neatly into either of these extremes. For example, a network operator might rely on up-to-the-second summary data to identify potential performance anomalies, but then wish to compare this data against historical statistics. This application requires synthesizing results computed from both unbounded streams and bounded batches of data. As another example, an analyst who regularly computes a daily summary of an important dataset may discover that this report provides a useful signal for day-to-day decisions, and wish to accelerate the pace of reporting, generating hourly—or even continuous—reports. In other words, an application might evolve over time from processing bounded batches of data to processing an unbounded stream of data in near-real-time.

Historically, researchers have considered batch processing and stream processing to be separate problems, and distinct systems have been developed and optimized for each of these problems. The current trend, however, is toward a unified programming model capable of expressing both batch and stream programs [15, 16], as well as unified computing engines capable of efficiently computing over both bounded batches and unbounded streams [17, 18].

The current state-of-the-art programming model requires application developers to explicitly answer the question of when to trigger the production of results. The best answer to this question, however, depends on the application-level requirements in terms of timeliness, accuracy, and cost, as well as dynamic factors such as network conditions and data characteristics. We argue that it is better to allow application developers to specify these high-level requirements and rely on systems to automatically determine when to produce results.

Although programming models raise challenging research questions on their own, they also highlight opportunities to extend our optimization techniques to better optimize timeliness, accuracy, and cost in emerging execution engines. We explore how the techniques presented in this thesis apply under these increasingly general programming models, and which challenges warrant further research attention.

1.3 Outline

The remainder of this thesis proceeds as follows. Chapter 2 presents our model-driven optimization techniques for determining where to place computation in order to minimize end-to-end execution time in batch processing settings. Chapter 3 presents practical systems techniques inspired by this model-driven optimization. In Chapter 4 we present our caching-based algorithms for determining when to communicate partial results in geo-distributed stream computing, and Chapter 5 generalizes these techniques to approximate computation. Chapter 6 explores the challenges and opportunities facing emerging programming models and execution engines for unified stream and batch computing. Finally, Chapter 7 concludes. The work presented in this thesis also appears in workshop, conference, and journal publications [19, 20, 21, 22, 23].

Chapter 2

Model-Driven Optimization of MapReduce Applications

2.1 Introduction

A key question for efficient processing of geo-distributed data is where to carry out the computation. At one extreme, sending massive amounts of data from the diverse locations where the data originate to a single centralized location may be too slow [11] to meet the requirements of low latency. Further, the cost may be prohibitive, both in terms of data transfer costs as well as the need to build a single large dedicated data center for application processing. Finally, a centralized solution is intrinsically less fault-tolerant than a distributed one, since failure of the single centralized data center can cause a complete outage that is unacceptable for critical applications such as analytics and monitoring.

At the other extreme, if we map computation onto each input datum in situ, the results of these local subcomputations comprise intermediate data that must be aggregated or reorganized to generate the final analysis results. If such intermediate data are not small relative to the input data, then combining the intermediate data could be more costly than moving input data to a single centralized location. Further, the various locations may differ in their compute capacities, resulting in imbalanced resource utilization.

Between the two extremes of moving all the data to a centralized location for processing and moving all the computation to the sources of data, there is a rich continuum of possibilities that may in fact be more efficient than either extreme. This chapter focuses on exploring this continuum to optimize the completion time of data-intensive applications that process geographically distributed data.

For the scope of this chapter, we limit our focus to a specific data analysis framework: MapReduce [13]. MapReduce is a powerful programming model used to program a wide variety of analytics applications, and has achieved popularity through its open-source implementation, Hadoop [24], as well as several analysis tools and programming languages built on top of Hadoop. A broad class of analytics applications of interest fit into the MapReduce application framework. Thus, we focus our attention on the efficient execution of applications that fit the MapReduce framework and explore the possibilities for speeding up such applications using a model-driven approach.

2.1.1 The MapReduce Framework and an Optimization Example

The MapReduce application framework comprises a simple but powerful programming model that processes data represented as key-value pairs. It incorporates a map function and a reduce function. The execution of the application proceeds in four phases: push, map, shuffle, and reduce. In the push phase, the input key-value pairs are moved from the data sources to the mappers. In the map phase, the map function is applied to each key-value pair, resulting in zero or more intermediate key-value pairs. In the shuffle phase, these intermediate pairs are grouped by key. Finally, in the reduce phase, the reduce function is applied to each intermediate key along with all of the values associated with that key, producing zero or more output key-value pairs.

Given an application that uses the MapReduce programming model, our work explores the space of options for executing the application on a distributed platform so as to minimize the completion time (i.e., makespan) of the application. Such an optimization involves exploring which servers of the distributed platform should process which portion of the data for both the map and the reduce operations.

We illustrate the importance of optimizing the data communication and task placement in a distributed environment with a simple example and show that the best approach depends on the platform and application characteristics.

Consider the example MapReduce platform shown in Figure 2.1. Assume that the data sources D1 and D2 have 150 GB and 50 GB of data, respectively. Let us model a parameter α that represents the ratio of the amount of data output by a mapper to its input; i.e., α is the expansion (or reduction) of the data in the map phase and is specific to the application. We model different situations below.


Figure 2.1: An example of a two-cluster distributed environment, with one data source, one mapper, and one reducer in each cluster. The intra-cluster (local) communication links are solid lines, while the inter-cluster (non-local) communication links are the dotted lines.

First, consider a situation where α = 1 and the network is perfectly homogeneous. That is, all links have the same bandwidth—say 100 MBps each—whether they are local or non-local, and the computational resources at each mapper and reducer are homogeneous too; say each can process data at the rate of 100 MBps. Clearly, in this case, a uniform data placement, where each data source (resp., mapper) pushes an equal amount of data to each mapper (resp., reducer), produces the minimum makespan.

Now, consider a slight modification to the above scenario. Assume now that the non-local links become much slower and can only transmit at 10 MBps, while all other parameters remain the same. Now, a uniform data placement no longer produces the best makespan. Since the non-local links are much slower, it makes sense to avoid non-local links when possible. In fact, a plan where each data source pushes all of its data to its own local mapper completes the push phase in 150 GB/100 MBps = 1500 seconds, while the uniform push would have taken 75 GB/10 MBps = 7500 seconds. Although the map phase for the uniform placement would be faster by 50 GB/100 MBps = 500 seconds, the local-push approach makes up for this through a more efficient data push.

Finally, consider the same network parameters above where local links are fast (100 MBps) and non-local links are slow (10 MBps). Imagine that α, however, is equal to 10, implying that there is 10 times more data in the shuffle phase than in the push phase. The local push no longer performs well, since it does nothing to alleviate the communication bottleneck in the shuffle phase. To avoid non-local communication in the shuffle phase, it makes sense for data source D2 to push all of its 50 GB of data to mapper M1 in the push phase, so that all the reduce can happen within cluster 1 without having to use non-local links in the communication-heavy shuffle phase. This is an example of how an optimization would have to look at all phases in an end-to-end fashion to find a better makespan. In fact, the local push still minimizes the push time from a myopic perspective; i.e., it makes a locally optimal decision which is globally suboptimal in terms of makespan.
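The arithmetic behind these scenarios can be checked directly. The following minimal Python sketch (an illustration only, assuming 1 GB = 1000 MB and the global-barrier timing developed in Section 2.2) computes push and map completion times for the uniform and local-push plans when non-local links run at 10 MBps:

    GB = 1000.0                                    # MB per GB
    D = {"D1": 150 * GB, "D2": 50 * GB}            # input data sizes (MB)
    COMPUTE = 100.0                                # mapper compute rate (MBps)

    # Link bandwidths (MBps): local links are fast, non-local links are slow.
    BW = {("D1", "M1"): 100.0, ("D2", "M2"): 100.0,
          ("D1", "M2"): 10.0,  ("D2", "M1"): 10.0}

    def push_and_map(plan):
        """plan[src][mapper] = fraction of src's data pushed to that mapper."""
        # Push at a mapper ends when its slowest incoming transfer finishes.
        push_end = {m: max(D[s] * plan[s][m] / BW[(s, m)] for s in D)
                    for m in ("M1", "M2")}
        # With a global barrier, every mapper starts computing only after all pushes end.
        map_start = max(push_end.values())
        map_end = {m: map_start + sum(D[s] * plan[s][m] for s in D) / COMPUTE
                   for m in ("M1", "M2")}
        return max(push_end.values()), max(map_end.values())

    uniform = {"D1": {"M1": 0.5, "M2": 0.5}, "D2": {"M1": 0.5, "M2": 0.5}}
    local   = {"D1": {"M1": 1.0, "M2": 0.0}, "D2": {"M1": 0.0, "M2": 1.0}}

    for name, plan in (("uniform", uniform), ("local", local)):
        push, done = push_and_map(plan)
        print(name, "push ends at", push, "s; map ends at", done, "s")
    # Prints a 7500 s push for the uniform plan and a 1500 s push for the
    # local plan, matching the argument above.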

2.1.2 Research Contributions

Our work makes several research contributions. First, we develop a framework for modeling the performance of MapReduce in a geographically distributed setting. We use this model to formulate an optimization problem whose solution is an optimal execution plan describing the best placement of data and computation within a MapReduce job. Our modeling framework is flexible enough to capture a large number of design choices and optimizations. Our optimization minimizes end-to-end makespan and controls multiple phases of execution, unlike existing approaches which may optimize in a myopic manner or control only a single phase. Further, optimizations using our model are efficiently solvable as Mixed Integer Programs (MIP) using powerful solver libraries.

Second, we modify Hadoop to implement our optimization outputs and use this modified Hadoop implementation along with network and node measurements from the PlanetLab [14] testbed to validate our model and evaluate our proposed optimization. Our model results show that for a geographically distributed compute environment, our end-to-end, multi-phase optimization can provide nearly 82% and 64% reduction in execution time over myopic and the best single-phase optimizations, respectively. Further, using three different applications over an 8-node testbed emulating PlanetLab network characteristics, we show that our statically optimized Hadoop execution plan achieves 31–41% reduction in runtime over an unmodified Hadoop execution employing its dynamic scheduling techniques.

Finally, our model-driven optimization provides several insights into the choice of techniques and execution parameters based on application and platform characteristics. For instance, we find that an application’s data expansion factor α can influence the optimal execution plan significantly, both in terms of which phases of execution are impacted more, as well as where pipeline parallelism is more helpful. Our results also show that as the network becomes more distributed and heterogeneous, our optimization provides higher benefits (82% for globally distributed sites vs. 37% for a single local cluster over myopic optimization) as it can exploit heterogeneity opportunities better. Further, these benefits are obtained independent of application characteristics.

2.2 Model and Optimization

2.2.1 Model

We successively model the distributed platform, the MapReduce application, valid execution plans, and their makespan.

Distributed Platform

We model the distributed platform available for executing the MapReduce application as a tripartite graph with a vertex set V = S ∪ M ∪ R, where S is the set of data sources, M is the set of mappers, and R is the set of reducers (See Figure 2.2). The edge set E of the tripartite graph is the complete set of edges (S × M) ∪ (M × R). Each node represents a physical resource; either a data source providing inputs, or a computation resource available for executing the map or reduce computational processes. A node can therefore represent a single physical machine or even a single map or reduce slot in a small Hadoop cluster, or it can represent an entire rack, cluster, or data center in a much larger deployment. Each edge represents the communication path connecting a pair of such nodes. The capacity of a node i ∈ M ∪ R is denoted by C_i (in bits/second), and captures the computational resources available at that node in units of bits of incoming data that it can process per second. Note that C_i is also application-dependent as different MapReduce applications are likely to require different amounts of computing resources to process the same amount of data. Likewise, the capacity of an edge (i, j) ∈ E is denoted by B_ij and represents the bandwidth (in bits/second) that can be sustained on the communication link that the edge represents.
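As a minimal sketch (with hypothetical names and made-up numbers, not code from our implementation), the platform model reduces to a few dictionaries: input sizes D_i for the sources, capacities C_i for the mappers and reducers, and bandwidths B_ij for the edges of the tripartite graph.

    from dataclasses import dataclass
    from typing import Dict, Tuple

    @dataclass
    class Platform:
        D: Dict[str, float]              # D_i: bits of input data at source i
        C: Dict[str, float]              # C_i: bits/second of compute at mapper/reducer i
        B: Dict[Tuple[str, str], float]  # B_ij: bits/second on edge (i, j)

    # A small instance with the shape of Figure 2.2: 3 sources, 2 mappers, 2 reducers.
    platform = Platform(
        D={"s1": 150e9, "s2": 50e9, "s3": 80e9},
        C={"m1": 800e6, "m2": 800e6, "r1": 800e6, "r2": 800e6},
        B={("s1", "m1"): 800e6, ("s1", "m2"): 80e6,
           ("s2", "m1"): 80e6,  ("s2", "m2"): 800e6,
           ("s3", "m1"): 80e6,  ("s3", "m2"): 80e6,
           ("m1", "r1"): 800e6, ("m1", "r2"): 80e6,
           ("m2", "r1"): 80e6,  ("m2", "r2"): 800e6},
    )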


Figure 2.2: A tripartite graph model for distributed MapReduce with 3 data sources, 2 mappers, and 2 reducers.

MapReduce Application

Rather than model the MapReduce application in detail, we model two key parameters.

First, we model the amount of data D_i (in bits) that originates at data source i, for each i ∈ S. Further, we model an expansion factor α that represents the ratio of the size of the output of a map process to the size of its input. Note that α can take values less than, greater than, or equal to 1 depending on whether the output of the map operation is smaller than, larger than, or equal in size to the input, respectively. Many applications perform extensive aggregation in the map phase, for example to count the number of occurrences of each unique object, or to project only a subset of the fields of complex records. These applications have α much less than 1. On the other hand, some applications may require intermediate data to be broadcast from one mapper to multiple reducers [25], or to otherwise transform the inputs to produce larger intermediate data, yielding α > 1. The value of α can be determined by profiling the MapReduce application on a sample of inputs [26, 27].
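For instance (a minimal, hypothetical sketch; the profiling approaches in [26, 27] are more sophisticated), α can be estimated by running the map function over a small sample of inputs and comparing the size of the intermediate output to the size of the input:

    def wordcount_map(record):
        # Hypothetical map function: emit one (word, 1) pair per word in the line.
        return [(word, 1) for word in record.split()]

    def estimate_alpha(sample_records, map_fn):
        in_bytes = sum(len(r.encode()) for r in sample_records)
        out_bytes = sum(len(k.encode()) + 8          # assume 8-byte encoded counts
                        for r in sample_records for k, v in map_fn(r))
        return out_bytes / in_bytes

    sample = ["the quick brown fox", "jumps over the lazy dog"]
    print(estimate_alpha(sample, wordcount_map))     # ratio of intermediate to input size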

Valid Execution Plans and their Makespan

We define the notion of an execution plan to capture the manner in which data and computation of a MapReduce application are scheduled on a distributed platform. Intuitively, an execution plan determines how the input data are distributed from the sources to the mappers and how the intermediate key-value pairs produced by the mappers are distributed across the reducers. Thus, the execution plan determines which nodes and which communication links are used and to what degree, and is therefore a major determinant of how long the MapReduce application takes to complete.

An execution plan of a MapReduce application on a distributed platform is represented by variables x_ij, for (i, j) ∈ E, that represent the fraction of the outgoing data from node i that is sent to node j.

Valid Execution Plans. We now mathematically express sufficient conditions for an execution plan to be valid and implementable in a MapReduce software framework while obeying the MapReduce application semantics.

∀(i, j) ∈ E : 0 ≤ x_ij ≤ 1    (2.1)

∀i ∈ V : Σ_{(i,j)∈E} x_ij = 1    (2.2)

Equations 2.1 and 2.2 simply express that for each i, the x_ij’s are fractions that sum to 1. The semantics of a MapReduce application requires that each intermediate key be shuffled to a single reducer. Define variable y_k, k ∈ R, to be the fraction of the key space mapped to reducer k. We enforce the one-reducer-per-key requirement with the following constraint.

∀j ∈ M, k ∈ R : x_jk = y_k    (2.3)

We define an execution plan to be valid if Equations 2.1, 2.2, and 2.3 hold.
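A minimal sketch of these validity checks (hypothetical names; the tolerance is illustrative):

    def is_valid_plan(x, sources, mappers, reducers, y, tol=1e-9):
        # Equations 2.1 and 2.2: outgoing fractions lie in [0, 1] and sum to 1.
        for i in list(sources) + list(mappers):
            if any(f < -tol or f > 1 + tol for f in x[i].values()):
                return False
            if abs(sum(x[i].values()) - 1.0) > tol:
                return False
        # Equation 2.3: every mapper sends reducer k the same fraction y_k of its
        # output, so each intermediate key is shuffled to exactly one reducer.
        return all(abs(x[j][k] - y[k]) <= tol for j in mappers for k in reducers)

    x = {"s1": {"m1": 0.7, "m2": 0.3}, "s2": {"m1": 0.2, "m2": 0.8},
         "m1": {"r1": 0.6, "r2": 0.4}, "m2": {"r1": 0.6, "r2": 0.4}}
    print(is_valid_plan(x, ["s1", "s2"], ["m1", "m2"], ["r1", "r2"],
                        y={"r1": 0.6, "r2": 0.4}))   # True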

Modeling the Consecutive Execution of Phases. The push, map, shuffle, and reduce phases are executed in sequence. Between each pair of consecutive phases, there are three possible assumptions that we can make. The simplest assumption is that a global barrier exists between the two consecutive phases, in which case all nodes must complete the current phase before any node can proceed to the next. While the global barrier satisfies data dependencies across the different phases, it limits parallelism.

Alternately, one can assume that a local barrier exists between the two consecutive phases. With a local barrier, each node can move onto the next phase as soon as it finishes the current phase, without regard to the progress of other nodes. This allows a greater degree of parallelism between the execution of the different phases, allowing the makespan to be reduced.

The third option of pipelining provides the greatest amount of parallelism and has the potential for the lowest makespan among the three options. Pipelining allows a node to start the map or reduce phase as long as the first piece of its input data is available for processing; i.e., the node need not wait for all of its inputs to be present but rather it can start as soon as it receives the first piece.

Which of these three options is allowable for each pair of consecutive phases depends on the application. For instance, if the application allows incremental processing in each phase without requiring all input data to be received first, then pipelining is allowable. If, on the other hand, the reduce computation cannot begin until it has received all of its intermediate data, then either a local or global barrier at the shuffle/reduce boundary is required. We now show how to model each of these three possibilities starting with the simplest case where all barriers are global.

Makespan of a Valid Execution Plan with Global Barriers. Given a valid execution plan for a MapReduce application, we now model the total time to completion, i.e., its makespan. To model the makespan, we successively model the time to completion of each of the four phases assuming that a global barrier exists after each phase. We assume that the data is available at all the data sources at time zero when the push phase begins. For each mapper j ∈ M, the time for the push phase to end is denoted by push_end_j and equals the time when all its data is received; i.e.,

∀j ∈ M : push_end_j = max_{i∈S} ( D_i x_ij / B_ij ) .    (2.4)

Since we assume a global barrier after the push phase, all mappers must receive their data before the push phase ends and the map phase begins. Thus, the time when the map phase starts, denoted by map_start, obeys the following equation.

map_start = max_{j∈M} push_end_j .    (2.5)

The computation at each mapper takes time proportional to the data pushed to that mapper. Thus, the time map_end_j for mapper j ∈ M to complete its computation obeys the following.

∀j ∈ M : map_end_j = map_start + ( Σ_{i∈S} D_i x_ij ) / C_j .    (2.6)

As a result of the global barrier, the shuffle phase begins when all mappers have finished their respective computations. Thus, the time shuffle_start when the shuffle phase starts obeys the following.

shuffle_start = max_{j∈M} map_end_j .    (2.7)

The shuffle time for each reducer is governed by the slowest shuffle link into that reducer.

Thus the time when the shuffle phase ends at reducer k ∈ R, denoted by shuffle_end_k, obeys the following.

∀k ∈ R : shuffle_end_k = shuffle_start + max_{j∈M} ( α Σ_{i∈S} D_i x_ij x_jk / B_jk ) .    (2.8)

Following the global barrier assumption, the reduce phase computation begins only after all reducers have received all of their data. Let reduce_start be the time when the reduce phase starts. Then

\[ \text{reduce\_start} = \max_{k \in R} \text{shuffle\_end}_k \tag{2.9} \]

Reduce computation at a given node takes time proportional to the amount of data shuffled to that node. Thus, the time when the reduce phase ends at node k, denoted by reduce_end_k, obeys the following.

\[ \forall k \in R: \quad \text{reduce\_end}_k = \text{reduce\_start} + \frac{\alpha \sum_{i \in S} \sum_{j \in M} D_i x_{ij} x_{jk}}{C_k} \tag{2.10} \]

Finally, makespan equals the time at which the last reducer finishes. Thus

\[ \text{Makespan} = \max_{k \in R} \text{reduce\_end}_k \tag{2.11} \]

Modeling Local Barriers and Pipelining. We now show how to modify the above constraints to model local barriers and pipelining. First, we remove the constraints defining the start time for the last three phases as expressed in Equations 2.5, 2.7 and 2.9. Then we generalize the definitions for the ending time of a stage by defining a combination operator ⊕ for each optimization as follows:

\[ a \oplus b = \begin{cases} a + b, & \text{if local barrier} \\ \max(a, b), & \text{if pipelined} \end{cases} \]

Then we replace the constraints in Equations 2.6, 2.8, and 2.10 with new constraints that capture the fact that a node can start its next phase without waiting for all nodes to complete the previous phase. For the ending time of the map phase, Equation 2.6 is replaced with

\[ \forall j \in M: \quad \text{map\_end}_j = \text{push\_end}_j \oplus \frac{\sum_{i \in S} D_i x_{ij}}{C_j} \tag{2.12} \]

Intuitively, the above constraint specifies that the runtime for the map phase on a node depends on the end time of the push phase on that node as well as the time for map computation on all of the data pushed to that node. For the local barrier case, this equals the sum of the two times (corresponding to the time to push and then compute in sequence). On the other hand, for the pipelining case, this equals the maximum of the two times, since the map phase at a node cannot end until all of its data has arrived and all of its computation has finished, and the slower of these two operations will dictate the completion time. Note that we are assuming that the data push and map computation are completely overlapped. This assumption is valid if the total quantity of data is much larger than individual record size, which is the case for many typical MapReduce applications. Based on similar intuition, the constraint for the shuffle stage in Equation 2.8 is replaced with

\[ \forall k \in R: \quad \text{shuffle\_end}_k = \max_{j \in M} \left( \text{map\_end}_j \oplus \frac{\alpha \sum_{i \in S} D_i x_{ij} x_{jk}}{B_{jk}} \right) \tag{2.13} \]

Finally, the constraint for the end of the reduce stage in Equation 2.10 is replaced with

\[ \forall k \in R: \quad \text{reduce\_end}_k = \text{shuffle\_end}_k \oplus \frac{\alpha \sum_{i \in S} \sum_{j \in M} D_i x_{ij} x_{jk}}{C_k} \tag{2.14} \]
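To make the model concrete, the following minimal sketch (not part of our implementation) evaluates Equations 2.4 to 2.14 for a given execution plan. The data layout is an illustrative assumption: D[i] is the data at source i, x_push and x_shuf hold the plan's fractional placements, B_push and B_shuf are link bandwidths (assumed positive), and C_map and C_red are compute rates.

```python
# Minimal sketch that evaluates the makespan model for a given execution plan.
def combine(a, b, barrier):
    """Combination operator: sum under a local barrier, max under pipelining."""
    return a + b if barrier == "local" else max(a, b)

def makespan(D, x_push, x_shuf, B_push, B_shuf, C_map, C_red, alpha,
             barriers=("global", "global", "global")):
    S, M, K = range(len(D)), range(len(C_map)), range(len(C_red))
    b_pm, b_ms, b_sr = barriers  # push/map, map/shuffle, shuffle/reduce

    # Push phase (Eq. 2.4): each mapper waits for its slowest incoming transfer.
    push_end = [max(D[i] * x_push[i][j] / B_push[i][j] for i in S) for j in M]

    # Map phase (Eq. 2.6 under a global barrier, Eq. 2.12 otherwise).
    map_work = [sum(D[i] * x_push[i][j] for i in S) / C_map[j] for j in M]
    if b_pm == "global":
        start = max(push_end)
        map_end = [start + map_work[j] for j in M]
    else:
        map_end = [combine(push_end[j], map_work[j], b_pm) for j in M]

    # Shuffle phase (Eq. 2.8 or 2.13): bounded by the slowest link into reducer k.
    def shuffle_time(j, k):
        return alpha * sum(D[i] * x_push[i][j] for i in S) * x_shuf[j][k] / B_shuf[j][k]
    if b_ms == "global":
        start = max(map_end)
        shuffle_end = [start + max(shuffle_time(j, k) for j in M) for k in K]
    else:
        shuffle_end = [max(combine(map_end[j], shuffle_time(j, k), b_ms) for j in M)
                       for k in K]

    # Reduce phase (Eq. 2.10 or 2.14).
    reduce_work = [alpha * sum(D[i] * x_push[i][j] * x_shuf[j][k]
                               for i in S for j in M) / C_red[k] for k in K]
    if b_sr == "global":
        start = max(shuffle_end)
        reduce_end = [start + reduce_work[k] for k in K]
    else:
        reduce_end = [combine(shuffle_end[k], reduce_work[k], b_sr) for k in K]

    return max(reduce_end)  # Eq. 2.11
```

For example, passing barriers=("global", "pipelined", "local") corresponds to the barrier configuration used later for the Hadoop comparison in Section 2.4.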

Discussion of Assumptions. We have made some assumptions in our formulation. For example, we have not constrained the amount of data pushed or shuffled to any given node, or the amount of intermediate data stored at a node. Extending our model to include such constraints would be straightforward. We have also assumed that network links are independent, on the premise that a single MapReduce job would be unable to saturate highly multiplexed wide-area links. It is, however, possible to extend the model to account for the fact that some logical links actually share a physical link. For example, to reflect that logical links in the push and shuffle phases share the same physical link, we could add a per-shared-link bandwidth constraint. Additionally, we have assumed that the time for map and reduce computation is linear in the amount of data processed. If a particular application deviates significantly from this assumption, then a nonlinear computation time can be modeled by a piecewise linear approximation. Such an approximation would take the place of Equations 2.6, 2.10, 2.12, or 2.14 as appropriate.

2.2.2 Model Validation

To validate our model, we modify the Hadoop MapReduce framework to follow specific execution plans. We show how our model predicts makespan given the execution plan and network and compute rates as input.

Implementing an Execution Plan in Hadoop

We modify two key aspects of Hadoop, specifically how map and reduce tasks are defined and how they are scheduled. We discuss these aspects first for push and map, then for shuffle and reduce.

First, consider how map tasks are defined. Just as in vanilla (i.e., unmodified) Hadoop, we define map tasks that each represent a small partition of the input data, no more than 64 MB in our experiments. But unlike vanilla Hadoop, where each of these partitions typically corresponds to a contiguous segment of a file stored on a particular host (i.e., an HDFS file block), our partitions combine inputs from several data sources. For example, if the execution plan calls for mapper M to receive 3/4 of its input from source S1, and 1/4 from source S2, then a 64 MB partition created for mapper M will consist of 48 MB from S1 and 16 MB from S2. To encode this in a form usable by Hadoop, we implement the InputFormat interface. This involves implementing an InputSplit to describe the task inputs as well as a RecordReader to provide map tasks a means for reading input key-value pairs. Our RecordReader opens TCP sockets to multiple data sources, which we have implemented as simple TCP servers, and uses each of these sockets to read the amount of data described in its InputSplit.

Now consider scheduling these map tasks in Hadoop. Each task is intended to run on a specific host, and we include this intended host as a field of the InputSplit object. When scheduling a map task, Hadoop queries the InputSplit for that task to determine where it would be considered local. Then Hadoop makes its best effort to assign the task to one of these local nodes. To strictly enforce our execution plans, however, this best-effort basis is not enough; we modify Hadoop to assign tasks exactly where we request them. Specifically, we make minor modifications to Hadoop's JobInProgress class, which is responsible for returning new map tasks to the task scheduler on request. These changes force JobInProgress to return only local tasks when a new configuration option (call it LocalOnly for brevity) is set.

Next, consider task definition for the reduce phase. Each reduce task in Hadoop corresponds to a partition of the intermediate key space. It is the responsibility of the Partitioner to determine the partition for each intermediate key-value pair. By default, Hadoop uses a hash function for this purpose: It assigns an intermediate key-value pair to partition (h mod p) where h is the key's hash and p is the total number of partitions.1 We make the assumption that the Partitioner creates roughly equal-sized partitions. To implement our execution plan, however, we need to create arbitrarily large or small partitions. For example, since we assume that each reduce node will run the same number of tasks, a reduce node k with a higher y_k (Equation 2.3) will need larger partitions than one with a smaller y_k. To achieve this non-uniformity in partition sizes, we add a level of indirection: The user's Partitioner is first used to create a number of smaller partitions, then ranges of these small sub-partitions are combined into the final reduce partitions. Concretely, consider a case with two reducers R1 and R2 where the plan calls for R1 to reduce 3/4 of the intermediate key-value pairs. Keys are partitioned into say 16 sub-partitions, and the first 12 of these are combined into R1's partition, while the last 4 are combined into R2's partition.

Finally, consider how these reduce tasks are scheduled. In vanilla Hadoop, reduce tasks are assigned to arbitrary hosts on demand; there is no concept of reduce task locality in vanilla Hadoop. In our modified version, we establish a mapping between reduce hosts and partitions before the job begins. We modify JobInProgress to assign only specific partition numbers to a requesting reduce host.

1 It is also common for Hadoop users to define their own Partitioner, for example to implement a relational join, or to use range partitioning instead of hash partitioning.
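As an illustration of this indirection, the following sketch maps sub-partitions to reduce partitions in proportion to per-reducer shares. It is written in Python purely for illustration (the actual implementation is a Java Partitioner), and the share values and rounding policy are assumptions made for the example.

```python
# Illustrative sketch of the sub-partition indirection for non-uniform reduce partitions.
def build_subpartition_map(reducer_shares, num_sub=16):
    """Assign contiguous ranges of sub-partitions to reducers in proportion to
    their shares; rounding leftovers go to the last reducer."""
    total = sum(reducer_shares)
    mapping, next_sub = {}, 0
    for reducer, share in enumerate(reducer_shares):
        count = round(share / total * num_sub)
        for sub in range(next_sub, min(next_sub + count, num_sub)):
            mapping[sub] = reducer
        next_sub += count
    for sub in range(next_sub, num_sub):
        mapping[sub] = len(reducer_shares) - 1
    return mapping

def assign_partition(key, reducer_shares, num_sub=16):
    sub = hash(key) % num_sub  # stand-in for the user's hash Partitioner
    return build_subpartition_map(reducer_shares, num_sub)[sub]

# Example from the text: R1 reduces 3/4 of the keys, so sub-partitions 0-11 go
# to R1 and 12-15 to R2.
print(build_subpartition_map([0.75, 0.25]))
```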

Instantiating Barrier Configurations

Our model is general enough to capture different barrier configurations (global, local, and pipelining) at each phase boundary of a MapReduce job. We instantiate a subset of all possible barrier configurations: Hadoop supports some of these by default, and we do not consider others that are either hard to implement or not immediately interesting.

Push/map: To achieve a global barrier at the push/map phase boundary, we run a separate map-only job to enact the push. This job uses the custom InputFormat to read directly from data sources, and writes directly to HDFS. Then we run the main job, which uses a regular FileInputFormat to read directly from HDFS as a typical Hadoop MapReduce job would. This is the same way the DistCP tool from the Hadoop distribution copies files from one distributed file system to another. We achieve pipelining by using the custom InputFormat in the main job itself to read directly from the data sources to the mappers. We do not instantiate a local barrier at this phase boundary.

Map/shuffle: For global barriers, we set the Hadoop configuration parameter mapred.reduce.slowstart.completed.maps to 1. This parameter specifies the fraction of map tasks that must complete before any reduce task is scheduled. Since the shuffle is actually carried out by reducers pulling their data, if no reducers start until all map tasks finish, then there is also no shuffle until the map phase finishes; i.e., the map and shuffle phases are separated by a global barrier. Coarse-grained pipelining is essentially the default behavior of Hadoop as long as there are enough reduce slots to schedule all reduce tasks immediately. MapReduce Online [28] implements finer-grained pipelining between these phases, but we consider only Hadoop's coarse-grained pipelining here. We do not instantiate a local barrier at this phase boundary.

Shuffle/reduce: Here, a local barrier is the default configuration of Hadoop, as a reducer can start as soon as it has finished receiving all of its input data. Implementing pipelining at this boundary is difficult in general: The MapReduce programming model requires that each invocation of the reduce function be provided an intermediate key and all intermediate values for that key. Relaxing the barrier at this phase therefore requires changes at the application level, as Verma et al. [29] describe in detail. We do not implement a global barrier at this phase boundary.

Parameter Estimation

Next, we show how the various parameters of our model can be estimated. We use PlanetLab [14] for our measurements, since it is a globally distributed testbed and is representative of the geographically distributed environments that we consider in this chapter. We measure bandwidth and compute capacities for a set of PlanetLab nodes, and estimate the model parameters for our distributed experimental platform as follows.

For each (i, j) ∈ E, bandwidth B_ij is estimated by transferring data over that link and measuring the achieved bandwidth. For stable estimates, we use downloads of at least 64 MB in size or at least 60 seconds in duration, whichever occurs first. For each compute node i, we estimate its compute rate C_i by measuring the runtime of a simple computation over a list within a Python program, yielding a rate of computation in MBps. This value is not directly useful on its own, but we use it to estimate the relative speeds of the compute nodes.

We base these measurements on a set of eight PlanetLab nodes distributed across eight sites, including four in the United States, two in Europe, and two in Asia. The unscaled C_i values on these nodes range from as low as 9 MBps to as high as about 90 MBps. Table 2.1 shows the slowest and fastest intra-continental and inter-continental link bandwidths for these nodes and highlights the heterogeneity that characterizes such a geographically distributed network.

Table 2.1: Measured bandwidth (KBps) of the slowest/fastest links between nodes in each continent.

          US               EU               Asia
US        216 / 9405       110 / 2267       61 / 3305
EU        794 / 2734       4475 / 11053     1502 / 1593
Asia      401 / 3610       290 / 1071       23762 / 23875
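The two probes described above can be sketched as follows. The host, port, and data sizes are hypothetical, and the list computation is only meant to yield a relative rate, as noted in the text.

```python
# Hedged sketch of the bandwidth and compute-rate probes.
import socket
import time

def measure_bandwidth(host, port, min_bytes=64 * 2**20, max_seconds=60.0):
    """Estimate link bandwidth (MB/s) by timing a bulk download, stopping once
    at least 64 MB has arrived or 60 seconds have elapsed, whichever comes first."""
    received, start = 0, time.time()
    with socket.create_connection((host, port)) as sock:
        while received < min_bytes and time.time() - start < max_seconds:
            chunk = sock.recv(1 << 16)
            if not chunk:
                break
            received += len(chunk)
    return received / 2**20 / (time.time() - start)

def measure_compute_rate(megabytes=64):
    """Estimate a node's relative compute rate (MB/s) by timing a simple
    computation over a list, in the spirit of the profiling code."""
    values = list(range(megabytes * 2**20 // 8))  # roughly 8 bytes per element
    start = time.time()
    total = 0
    for v in values:
        total += v * v
    return megabytes / (time.time() - start)
```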

Compute and Network Emulation

We use an emulated local testbed to achieve stable and repeatable experiments, and to avoid the severe memory constraints on PlanetLab, which prevent such data-intensive experiments. We emulate the full heterogeneity of the PlanetLab environment by using the tc utility to emulate network heterogeneity, and a synthetic Hadoop MapReduce job to emulate computation heterogeneity and the application parameter α. Mappers in this synthetic job read a key-value pair and emit that same key-value pair an appropriate number of times to achieve the user-specified α value. For example, with α = 0.5, the synthetic mapper directly emits only every other input key-value pair; with α = 2, it emits each input key-value pair twice. This job uses an identity reducer. This synthetic application also allows us to emulate computation heterogeneity by scaling the amount of computation on each node based on a user-provided parameter.

We run this synthetic Hadoop MapReduce job using our modified Hadoop implementation on our local cluster of eight nodes. Each node has two dual-core 2800 MHz AMD Opteron 2200 processors, 4 GB RAM, a 250 GB disk, and Linux kernel version 2.6.32. The nodes are connected via Gigabit Ethernet, though we use tc to emulate the network bandwidths measured on PlanetLab.
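The record-selection logic of the synthetic mapper could look like the following sketch; the handling of fractional α values here is an illustrative assumption (the actual job is implemented as a Hadoop MapReduce job).

```python
# Illustrative sketch of the synthetic mapper's alpha emulation.
def synthetic_map(records, alpha):
    """Yield key-value pairs so that output volume is roughly alpha times input."""
    whole, frac = int(alpha), alpha - int(alpha)
    carry = 0.0
    for key, value in records:
        for _ in range(whole):      # e.g., alpha = 2 emits every record twice
            yield key, value
        carry += frac               # e.g., alpha = 0.5 emits every other record
        if carry >= 1.0:
            carry -= 1.0
            yield key, value
```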

Validation Results

We study the predictive power of our model by correlating its predictions with measured makespan using the following approach. First, we need to scale the C_i values, which we do by running the same profiling code on our local cluster as we did on the PlanetLab nodes. The unscaled C_i values are then multiplied by a constant factor such that the greatest C_i is equal to that measured on our local cluster. Then, given these scaled C_i values along with the bandwidth measurements from PlanetLab, we derive the model's prediction for makespan by using the model equations (Equations 2.1 to 2.11, with substitutions as needed for local barriers or pipelining).

For our validation, we consider α values of 0.1, 1, and 2. We consider two levels of network heterogeneity: the network conditions measured from PlanetLab as well as no emulation (raw LAN bandwidths). We also consider two levels of computation heterogeneity: those measured from PlanetLab, as well as no emulation. We consider all combinations of global barriers and pipelining at the push/map phase boundary and the map/shuffle boundary. We use a local barrier at the shuffle/reduce boundary, leading to a total of four barrier configurations. We include two types of execution plans: a uniform plan that distributes the input and intermediate data uniformly among the mappers and reducers respectively, as well as an optimized plan generated by our optimization algorithm (see Section 2.2.3). All of our plans use 256 MB of input data from each of the eight data sources. We use two mappers and one reducer on each of eight compute nodes.

The results of the validation are shown in Figure 2.3, where we observe a strong correlation (R² value of 0.9412) between predicted makespan and measured makespan. In addition, the linear fit to the data points has a slope of 1.1464, which shows there is also a strong correspondence between the absolute values of the predicted and measured makespan.

2.2.3 Optimization Algorithm

Having defined and validated our model, we now use it to find the execution plan that minimizes the makespan of a MapReduce job. To do so, we formulate an optimization problem that has two key properties. First, it minimizes the end-to-end makespan of the MapReduce job, including the time from the initial data push to the final reducer execution. Thus, its decisions may be suboptimal for individual phases, but will be optimal for the overall execution of the job. Second, our optimization is multi-phase in that it controls the data dissemination across both the push and shuffle phases; i.e., it outputs the best way to disseminate both the input and intermediate data so as to minimize makespan.

Figure 2.3: Measured vs. model-predicted makespan from the Hadoop implementation on the local emulated testbed. (Scatter of measured makespan (s) versus predicted makespan (s), with a linear fit, R² = 0.9412.)

Viewing Equations 2.1 to 2.11 as constraints, we need to minimize an objective function that equals the variable Makespan. (If the barriers are not global, the appropriate substitutions of the constraints need to be made.) To perform this optimization efficiently, we rewrite the constraints in linear form to obtain a Mixed Integer Program (MIP). Writing it as a MIP opens up the possibility of using powerful off-the-shelf solvers such as Gurobi version 5.0.0, which we use to derive our results.

There are two types of nonlinearities that occur in our constraints, and these need to be converted to linear form. The first type is the existence of the max operator. Here we use a standard technique of converting an expression of the form max_i z_i = Z into an equivalent set of linear constraints ∀i: z_i ≤ Z, where Z is minimized as a part of the overall optimization. The second type of nonlinearity arises from the quadratic terms in Equations 2.8 and 2.10 (or Equations 2.13 and 2.14 as appropriate).

We remove this type of nonlinearity with a two-step process. First, we express the product terms of the form yz that involve two different variables in separable form. Specifically, we introduce two new variables w = (y + z)/2 and w' = (y − z)/2. This allows us to replace each occurrence of yz with w² − w'², which is in separable form since it does not involve a product of two different variables. Next, we approximate the quadratic terms w² and −w'² with a piecewise linear approximation of the respective function. The more piecewise segments we use, the more accuracy we achieve, but this requires a larger number of choice variables resulting in a longer time to solve. As a compromise, we choose 10 evenly spaced points on the curve, leading to an approximation with 9 line segments and a worst-case deviation of 4.15% between the piecewise linear approximation and the actual function. Since w² is a convex function, its piecewise linear approximation can be expressed using linear constraints with no integral variables. However, since −w² is not a convex function, its piecewise linear approximation requires us to utilize new binary integral variables to choose the appropriate line segment, resulting in a MIP (Mixed Integer Program) instead of an LP (Linear Program).2

2 For a more detailed treatment of the techniques that we have used to remove the two types of nonlinearities, see [30].
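The following small sketch illustrates the two linearization ideas numerically rather than as solver constraints: the separable rewriting of a product yz, and the chord-based piecewise linear approximation of a square term. The values and the [0, 1] range are arbitrary; in the actual MIP these appear as Gurobi constraints.

```python
# Numeric illustration of the product-term linearization steps.
def separable_product(y, z):
    """Rewrite y*z as w^2 - w'^2 with w = (y+z)/2 and w' = (y-z)/2, so the only
    nonlinear terms are squares of single variables (separable form)."""
    w, w_prime = (y + z) / 2.0, (y - z) / 2.0
    return w * w - w_prime * w_prime

def piecewise_square(x, lo=0.0, hi=1.0, segments=9):
    """Chord (piecewise linear) approximation of x^2 over [lo, hi] using
    `segments` segments, i.e., segments + 1 evenly spaced breakpoints."""
    pts = [lo + (hi - lo) * i / segments for i in range(segments + 1)]
    for a, b in zip(pts, pts[1:]):
        if a <= x <= b:
            return a * a + (b * b - a * a) * (x - a) / (b - a)
    raise ValueError("x outside approximated range")

assert abs(separable_product(0.3, 0.7) - 0.3 * 0.7) < 1e-12
print(piecewise_square(0.5), 0.5 ** 2)  # the chord slightly overestimates x^2
```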

2.3 Model and Optimization Results

To understand the benefits of our end-to-end multi-phase optimization relative to other schemes, we use our model to perform several comparisons.

2.3.1 Modeled Environments

Before we present the results of these comparisons, we describe the environments that we consider. We use PlanetLab measurements to create the different distributed network environments that we use throughout this section. Our network environments vary from a single data center that is relatively homogeneous, to a diverse environment comprising eight globally distributed data centers. To realistically model these environments, we use actual measurements of the compute speeds and link bandwidths from PlanetLab nodes distributed around the world, including four in the US, two in Europe, and two in Asia with compute rates and interconnection bandwidths as described in Section 2.2.2. Using these measurements, we generate the following network environments:

Local data center: This setup consists of eight nodes in a single data center in the US, with each node hosting a source, mapper, and reducer. This setup corresponds to the traditional local MapReduce execution scenario.

Intra-continental data centers: This setup consists of two data centers located within a continent: All nodes are in the US, specifically in Texas or California. This setup corresponds to a more distributed topology than the local data center scenario.

Globally distributed data centers: In this setup, nodes span the entire globe, introducing much greater heterogeneity. We consider two global environments: one with four data centers (in California, Texas, Germany, and Japan), and another with eight data centers (same as earlier with additional nodes in California, Illinois, the UK, and Japan). This allows us to study the impact of scaling up the number of locations.

For each of the above setups, the total number of nodes is held constant at eight. In some cases, where we did not have sufficiently many actual nodes to meet this requirement (e.g., for the local data center setup), we added virtual replica nodes to the setup with the measured node/link characteristics of the corresponding real nodes. In addition, we held the number of data sources fixed, allocating these data sources to data centers in the same proportion as mappers and reducers. The total amount of input data per data source was held constant throughout.

The globally distributed environment with eight data centers is most appropriate for studying the general properties of our optimization, as it closely resembles the geographically distributed settings that we focus on in our work. Consequently, we use this environment in all our experiments, except in Section 2.3.6, where we study the impact of distribution of resources on our optimization and therefore use all of the above environments.

2.3.2 Heuristics vs. Model-Driven Optimization

We compare our optimal execution plans to several baseline heuristics, which we refer to as uniform, best-centralized and push-to-local. In the uniform heuristic, sources push uniformly to all mappers, and mappers shuffle uniformly to all reducers. This reflects the expected distribution if data sources and mappers were to randomly choose where to send data. We model this heuristic by adding the following additional constraints.

\[ \text{Uniform Push}: \quad \forall i \in S, j \in M: \; x_{ij} = \frac{1}{|M|} \tag{2.15} \]

\[ \text{Uniform Shuffle}: \quad \forall j \in M, k \in R: \; x_{jk} = \frac{1}{|R|} \tag{2.16} \]

Best-centralized pushes data to a single mapper node and performs all map computation there. Likewise, all reduce computation is performed at a single reducer node at the same location. We can model this heuristic by simply excluding all mappers and reducers other than the selected centralized nodes, and then using the model as usual. We compute the makespan for each possible choice of centralized location, and report the minimum among these values, hence the name best-centralized.

Push-to-local pushes data from each source to the most local mapper (in this case, to a mapper on the same node) and uses a uniform shuffle; it is analogous to the purely distributed extreme. We model this heuristic by adding the uniform shuffle constraint from Equation 2.16 as well as a new local push constraint:

\[ \forall i \in S, j \in M: \quad x_{ij} = \begin{cases} 1, & \text{if } j \text{ is the mapper closest to source } i \\ 0, & \text{otherwise} \end{cases} \tag{2.17} \]
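For reference, the baseline plans can be constructed mechanically, as in the sketch below. It uses link bandwidth as a stand-in for locality when choosing the closest mapper (an assumption of this sketch; in our setup the closest mapper is the one on the same node), and the resulting plans can be fed to a makespan evaluator such as the earlier model sketch.

```python
# Illustrative construction of the uniform and push-to-local baseline plans.
def uniform_plan(num_sources, num_mappers, num_reducers):
    """Uniform push (Eq. 2.15) combined with a uniform shuffle (Eq. 2.16)."""
    x_push = [[1.0 / num_mappers] * num_mappers for _ in range(num_sources)]
    x_shuf = [[1.0 / num_reducers] * num_reducers for _ in range(num_mappers)]
    return x_push, x_shuf

def push_to_local_plan(B_push, num_reducers):
    """Local push (Eq. 2.17) combined with a uniform shuffle (Eq. 2.16)."""
    num_sources, num_mappers = len(B_push), len(B_push[0])
    x_push = [[0.0] * num_mappers for _ in range(num_sources)]
    for i in range(num_sources):
        closest = max(range(num_mappers), key=lambda j: B_push[i][j])
        x_push[i][closest] = 1.0
    x_shuf = [[1.0 / num_reducers] * num_reducers for _ in range(num_mappers)]
    return x_push, x_shuf
```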

Figure 2.4 shows the makespan achieved by each of these heuristics, as well as by our end-to-end multi-phase optimization algorithm (e2e multi) for three values of α. We make several key observations from these results. First, both the uniform and best-centralized heuristics have poor push performance relative to the push-to-local heuristic and our end-to-end optimization. The reason is that the uniform and best-centralized heuristics place heavy traffic on slow wide-area links while the push-to-local heuristic uses fast local links for the push. Our optimization approach intelligently decides how to utilize links to minimize end-to-end makespan. Second, we see that the best-centralized heuristic becomes more favorable relative to the push-to-local heuristic as α increases, demonstrating that neither is universally better than the other. The reason is that, as α increases, the shuffle phase becomes heavier relative to the push phase. The best-centralized heuristic avoids wide-area links in the shuffle, and this becomes increasingly valuable as the shuffle becomes dominant.


Figure 2.4: A comparison of three optimization heuristics and our end-to-end multi-phase optimization (e2e multi). (Panels show α = 0.1, 1, and 10; y-axis: makespan (s); bars are broken into push, map, shuffle, and reduce times for uniform, push-to-local, best-centralized, and e2e multi.)

Finally, we see that our end-to-end multi-phase optimization approach is significantly better than any of the three heuristics, reducing makespan over the best heuristic by 83.5%, 77.3%, and 76.4% for α=0.1, 1, and 10 respectively. This demonstrates that the best data placement is one that considers resource and application characteristics.

2.3.3 End-to-End vs. Myopic

Now we study the benefit of optimizing end-to-end makespan compared to myopically minimizing the time for a single phase. The distinction between these two alternatives is the objective function that is being minimized. With end-to-end, the objective function is the overall makespan of the MapReduce job, whereas a myopic optimizer uses the time for a single data dissemination phase (push or shuffle) as its objective function. Myopic optimization is a localized optimization, e.g., pushing data from data sources to mappers in a manner that minimizes the data push time. Such local optimization might result in suboptimal global execution times by creating bottlenecks in other phases of execution. Note that myopic optimization can be applied to both push and shuffle phases in succession. That is, the push phase is first optimized to minimize push time and then the shuffle phase is optimized to minimize shuffle time assuming that the input data has been pushed to mappers according to the first optimization.

The myopic optimizations described above can be achieved by modifying our formulation in Section 2.2 as follows. Instead of minimizing makespan, we replace Equation 2.11 with alternate objective functions. To optimize the push time we minimize max_{j ∈ M} push_end_j, and to optimize the shuffle time we minimize max_{k ∈ R} shuffle_end_k. (Computing a myopic multi-phase plan requires solving multiple optimizations in sequence.)

In Figure 2.5, we show the makespan achieved in three different cases: (i) for a uniform data placement; (ii) for a myopic, multi-phase optimization, where the push and shuffle phases are optimized myopically in succession; and (iii) for our end-to-end, multi-phase optimization that minimizes the total job makespan. Note that since both (ii) and (iii) are multi-phase, the primary difference between them is that one is myopic and the other is end-to-end, helping us determine the relative merits of end-to-end versus myopic approaches. We evaluate these three schemes for different assumptions for α.


Figure 2.5: Our end-to-end multi-phase optimization (e2e multi) compared to a uniform heuristic and a myopic optimization approach. (Panels show α = 0.1, 1, and 10; y-axis: makespan (s); bars are broken into push, map, shuffle, and reduce times.)

We see that for each α, the myopic optimization reduces the makespan over the uniform data placement approach (by 30%, 44%, and 57% for α = 0.1, 1, and 10 respectively), but is significantly outperformed by the end-to-end optimization (which reduces makespan by 87%, 82%, and 85%). This is because, although the myopic approach makes locally optimal decisions at each phase, these decisions may be globally suboptimal. Our end-to-end optimization makes globally optimal decisions. As an example, for α=0.1, while both the myopic and end-to-end approaches dramatically reduce the push time over the uniform approach (by 99.4% and 98.5% respectively), the end-to-end approach is also able to reduce the map time substantially (by 85%) whereas the myopic approach makes no improvement to the map time. A similar trend is evident for α=10, where the end-to-end approach is able to lower the reduce time significantly (by 68%) over the myopic approach. These results show the benefit of an end-to-end, globally optimal approach over a myopic, locally optimal but globally suboptimal approach.

2.3.4 Single-Phase vs. Multi-Phase

We continue by comparing the performance of our multi-phase optimization to that of a single-phase optimization. The distinction between single-phase and multi-phase is which phase (push, shuffle, or both) is controlled by the optimization. This is orthogonal to the end-to-end versus myopic distinction. A single-phase optimization controls the data distribution of one phase—e.g., the push phase—alone, while using a fixed data distribution for the other communication phase. A single-phase optimization is myopic if it minimizes the time for that phase alone. However, it could also be end-to-end if it optimizes the phase so as to achieve the minimum overall makespan. We model a single-phase optimization by using one of the uniform push or shuffle constraints (Equation 2.15 or 2.16) to constrain the data placement for one of the phases, while allowing the other phase to be optimized.

In Figure 2.6, we compare (i) a uniform data placement, (ii) an end-to-end single-phase push optimization that assumes a uniform shuffle, (iii) an end-to-end single-phase shuffle optimization that assumes a uniform push, and (iv) our end-to-end multi-phase optimization. Note that both the single-phase optimizations here are end-to-end optimizations in that they attempt to minimize the total makespan of the MapReduce job. The primary difference between (ii) and (iii) on the one hand and (iv) on the other hand is that the former are single-phase and the latter multi-phase, letting us evaluate the relative benefit of single- versus multi-phase optimization.

In Figure 2.6, we observe that across all α values, the multi-phase optimization outperforms the best single-phase optimization (by 37%, 64%, and 52% for α=0.1, 1, and 10 respectively). This shows the benefit of controlling the data placement across multiple phases. Further, for each α value, optimizing the bottleneck phase brings greater reduction in makespan than optimizing the non-bottleneck phase. For instance, for α=0.1, the push and map phases dominate the makespan in the baseline (25% and 66% of the total runtime for uniform, respectively) and push optimization alone is able to reduce the makespan over uniform by 80% by lowering the runtime of these two phases. On the other hand, for α=10, the shuffle and reduce phases dominate (27% and 64% of total runtime for uniform, respectively) and optimizing these phases via shuffle optimization reduces makespan by 69% over uniform.

By influencing the data placement across multiple phases, our multi-phase optimization improves both the bottleneck as well as non-bottleneck phases. When there is no prominent bottleneck phase (α=1), the multi-phase optimization outperforms the best single-phase optimization substantially (by 64%). These results show that the multi-phase optimization is able to automatically optimize the execution independent of the application characteristics.

An additional interesting observation from Figures 2.6(b) and (c) is that optimizing earlier phases can have a positive impact on the performance of later phases. In particular, for α=10, push optimization also reduces the shuffle time (by 90%), even though the push and map phases themselves contribute little to the makespan. This is because the location of the mappers to which inputs are pushed has an impact on how intermediate data are shuffled to reducers.

2.3.5 Barriers vs. Pipelining

Figure 2.6: Our end-to-end multi-phase optimization (e2e multi) compared to a uniform heuristic as well as end-to-end push (e2e push) and end-to-end shuffle (e2e shuffle) optimization. (Panels show α = 0.1, 1, and 10; y-axis: makespan (s); bars are broken into push, map, shuffle, and reduce times.)

Now we study the impact of relaxing barriers on the makespan predicted by our model. In particular, we focus on the impact of using pipelining vs. global barriers at each phase boundary. In Figure 2.7, we show the normalized makespan for a select set of different barrier configurations. All makespan values are normalized relative to the optimal makespan for a configuration with global barriers at each phase boundary. Each bar shows the effect of relaxing only a single global barrier to pipelining at a time, at the push/map, the map/shuffle, and the shuffle/reduce boundaries respectively, as well as the all-pipelined configuration, where all phases are pipelined.

Figure 2.7: A comparison of barrier configurations. Makespan is normalized relative to the optimal makespan for an all-global-barrier configuration. Each bar represents the boundary at which the global barrier is relaxed to pipelining; “all” corresponds to an all-pipelining configuration. (y-axis: normalized optimized makespan; x-axis: α = 0.1, 1, 10.)

We make two key observations. First, pipelining is most effective when phases are roughly “balanced” in terms of time. The reason is that when balanced phases are overlapped, one can completely hide the latency of the others. When one phase dominates, however, it continues to dominate even when it overlaps shorter phases. For the parameters considered here, the phases are most closely balanced when α = 1, as Figure 2.6 shows. Consequently, Figure 2.7 shows that each barrier relaxation provides the greatest benefit when α = 1.

Second, relaxing late-stage barriers—such as those between map and shuffle or between shuffle and reduce—is predicted to have a greater benefit than relaxing barriers between push and map stages. The reason is that our optimization is more constrained in data placement during the shuffle phase than the push phase due to the one-reducer-per-key constraint, and hence pipelining finds more remaining opportunity to hide the latency of the shuffle phase with its adjoining computational phases (map or reduce). This phenomenon can also be observed from Figure 2.6, where we see that the shuffle time is higher than the push time with our optimization (e2e multi), particularly for α=1 and 10.

2.3.6 Compute Resource Distribution

To evaluate the effect of distribution of resources, we derive optimized execution plans using our proposed (end-to-end multi-phase) optimization algorithm for all of the environments described in Section 2.3.1, starting from a relatively homogeneous environment including a single data center, to an intra-continental setup consisting of two data centers in the US, to globally distributed setups with four and eight data centers.

Figure 2.8 shows the makespan for the myopic optimization that successively optimizes the execution times of the push phase and the shuffle phase, and the end-to-end optimization that optimizes makespan by considering all phases at once. Both the myopic and end-to-end optimizations are compared against a baseline that performs uniform push and shuffle.

If the network environment is a single data center with relative homogeneity of compute rates and network bandwidths, the uniform baseline does almost as well as end-to-end optimization for all values of α. Interestingly, a uniform heuristic can outperform a myopic optimization in some cases. Intuitively, myopic optimization reacts to rectify small communication imbalances, but this can in turn create larger computational imbalances among the map and reduce tasks, resulting in a longer makespan. This effect can be seen from the figure by the increased map and reduce times for myopic for α=0.1 and 10 respectively. As the network environment becomes more diverse with more data centers, both myopic and end-to-end optimizations outperform uniform, as uniform fails to account for the diversity of the environment. As expected, end-to-end performs the best, with makespans 82–87% lower than uniform and 65–82% lower than myopic.

In summary, our results show that our optimization derives much greater opportunity for improvement as the diversity and heterogeneity of the environment increases, while reducing to a largely uniform placement for tightly coupled, homogeneous environments, where myopic optimization may actually hurt performance.

2.3.7 Input Data Distribution

So far we have considered only situations where each data source hosts the same amount of input data. We now explore the effect of a heterogeneous distribution of input data.

Figure 2.8: Our end-to-end multi-phase optimization (e2e) compared to a uniform heuristic and myopic optimization for different network environments. Makespan is normalized relative to the uniform baseline. (Panels show α = 0.1, 1, and 10; x-axis: number of data centers (1, 2, 4, 8); bars are broken into push, map, shuffle, and reduce times for uniform, myopic, and e2e.)

To do so, we revisit the Akamai log data introduced in Section 1.1, and logically map this data set onto our globally distributed set of PlanetLab data sources as follows. First, we assign each log entry to the nearest of our PlanetLab nodes based on great circle distance. We then count the number of log entries assigned to each PlanetLab node and use this distribution as input to the model. The resulting distribution is highly heterogeneous: The most heavily loaded node supplies 49% of all input data, the top three together supply roughly 80%, and the most lightly loaded supplies less than 1%.

The results for three values of α are shown in Figure 2.9. The overall trends are similar to the homogeneous input distributions in Figure 2.8: A myopic optimization outperforms a uniform heuristic, but our end-to-end multi-phase optimization is much better yet. In fact, for high α, push and map times are small relative to shuffle and reduce times, so the results are quite similar between the homogeneous input distribution and the heterogeneous input distribution. This provides evidence that our optimization approach can improve performance independent of the physical distribution of inputs.
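The log-to-node assignment described above can be sketched as follows. The coordinate handling is an assumption (it presumes each log entry and each node has a known latitude and longitude), and the haversine formula stands in for whatever great-circle computation the actual preprocessing used.

```python
# Hedged sketch of assigning log entries to the nearest PlanetLab node.
import math
from collections import Counter

def great_circle_km(p, q):
    """Great-circle (haversine) distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * math.asin(math.sqrt(a))

def input_distribution(log_coords, node_coords):
    """Assign each log entry to its nearest node and return the per-node fraction
    of entries, which serves as the model's input-data distribution."""
    counts = Counter(
        min(node_coords, key=lambda name: great_circle_km(entry, node_coords[name]))
        for entry in log_coords)
    total = sum(counts.values())
    return {name: counts[name] / total for name in node_coords}
```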


Figure 2.9: Our end-to-end multi-phase optimization (e2e) compared to uniform and myopic baselines in the eight-node PlanetLab environment with inputs distributed according to Akamai logs. Makespan is normalized relative to the uniform baseline. (Bars show push, map, shuffle, and reduce times for α = 0.1, 1, and 10.)

2.4 Comparison to Hadoop

We compare the results of our optimization against vanilla Hadoop, which represents a typical unmodified Hadoop execution.

2.4.1 Experimental Setup

For these experiments, we use an 8-node cluster with bandwidths emulating a distributed PlanetLab environment (Section 2.2.2). We do not emulate compute heterogeneity in this setup; only network links have emulated heterogeneity. Each of our eight physical nodes hosts two map slots and one reduce slot, as well as an HDFS datanode, and a simple TCP server acting as a data source. We use our modified Hadoop implementation (Section 2.2.2), using largely default Hadoop configuration options. We set the replication factor in HDFS to 1 to prevent replication over emulated wide-area links, as we have observed that wide-area replication adversely affects performance. For example, experiments on vanilla Hadoop using three different applications (see Section 2.4.2) showed that increasing HDFS replication from 1 to 2 can increase average makespan by a factor of 1.4 to 2.1, and increasing replication from 1 to 3 can increase average makespan by a factor of 3.6 to 5.7.3

To achieve a fair comparison, we allow vanilla Hadoop to use its dynamic load balancing mechanisms such as speculative task execution and non-local work stealing to mitigate stragglers and avoid idle resources. Additionally, we allow vanilla Hadoop to take advantage of its existing data-locality optimization by hinting (through the InputSplit) that each data source should prefer to push to its most local compute node. For our optimization, we derive execution plans based on the underlying platform and application parameters, including the barrier configuration (global, pipelined, and local at the push/map, map/shuffle, and shuffle/reduce boundaries, respectively). To ensure that Hadoop strictly follows our optimized plans, we disable its dynamic mechanisms by enabling our LocalOnly option (Section 2.2.2). As a baseline, we also compare vanilla Hadoop and our optimization to a uniform execution plan.

3 Improving replication in geographically distributed environments would be useful for improving fault tolerance, but is outside the scope of this thesis.

2.4.2 Applications

For this evaluation, we implement three MapReduce applications with varying characteristics, particularly the expansion factor α.

Word Count (α = 0.09): This application takes a set of documents and computes the number of occurrences of each term. As input, we use plain-text eBooks from Project Gutenberg [31], totaling roughly 16.5 GB, spread across 48,000 eBooks.4

Sessionization (α = 1.0): This application takes a collection of Web server logs and computes the sequence of “sessions” for each user. It is essentially a distributed sort, so the mapper simply routes data to reducers. As input, we use a portion of the WorldCup98 trace [32] (roughly 5 GB spanning 60 million log entries).

Full Inverted Index (α = 1.88): Modeled after the example from Lin and Dyer [33], this application takes a collection of documents and computes the complete list of documents and positions where each word occurs. This application expands the input data by adding additional information, hence the high value of α. For input we use a 4 GB forward index built from the 48,000 eBooks used in the Word Count application.
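For concreteness, a minimal sketch of the Full Inverted Index dataflow is shown below. It is a toy, in-memory illustration of the map and reduce logic, not the Hadoop implementation, and the sample documents are invented.

```python
# Toy illustration of the Full Inverted Index map and reduce logic.
from collections import defaultdict

def fii_map(doc_id, text):
    """Emit (word, (doc_id, position)) for every word occurrence, so the output
    is larger than the input (alpha > 1)."""
    for position, word in enumerate(text.split()):
        yield word, (doc_id, position)

def fii_reduce(word, postings):
    """Collect the complete, sorted list of (doc_id, position) pairs."""
    return word, sorted(postings)

# Tiny local run of the same dataflow, with the shuffle emulated by a dict.
docs = [("d1", "to be or not to be"), ("d2", "to thine own self be true")]
shuffled = defaultdict(list)
for doc_id, text in docs:
    for word, posting in fii_map(doc_id, text):
        shuffled[word].append(posting)
index = dict(fii_reduce(w, p) for w, p in shuffled.items())
print(index["be"])  # [('d1', 1), ('d1', 5), ('d2', 4)]
```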

2.4.3 Experimental Results

Figure 2.10 shows the results of our comparison for the three applications. For Hadoop, since the shuffle phase is partially overlapped with the map and reduce phases, we depict only three phases in each bar in the graph: the push phase, the overlapped map/shuffle phases, and the overlapped shuffle/reduce phases. We see that across all applications, while vanilla Hadoop substantially outperforms the uniform execution plan (by 68%, 40%, and 44% for Word Count, Sessionization, and Full Inverted Index respectively), Hadoop executing our optimized plan achieves a further improvement of 36%, 41%, and 31% over vanilla Hadoop for the same applications. Further, we see how vanilla Hadoop makes myopic decisions. Hadoop reduces the push time substantially (by 92%, 91%, and 83% for Word Count, Sessionization, and Full Inverted Index, respectively) over uniform. Our optimization, however, while increasing the push time over vanilla Hadoop, achieves more significant reduction in end-to-end makespan than vanilla Hadoop by better optimizing the map, shuffle, and reduce phases. Overall, we find that Hadoop executing our offline end-to-end, multi-phase optimal execution plan outperforms vanilla Hadoop using its dynamic mechanisms.

4 At the time of writing, Project Gutenberg hosted fewer than 48,000 books. To reach this data size, we gathered 24,000 books and included each in our inputs twice.


Figure 2.10: Actual makespan for three sample applications on our local testbed. Error bars reflect 95% confidence intervals. (For each application, bars compare the uniform plan, vanilla Hadoop, and our optimized plan, broken into push, map/shuffle, and shuffle/reduce times.)

Dynamically Enhancing the Optimized Plan

Our optimization provides an offline execution plan based on information about the underlying infrastructure available before the job begins execution. As mentioned earlier, Hadoop provides two dynamic mechanisms to modify the initial execution plan based on the observed runtime behavior of the network and nodes: (i) speculative task execution, where another copy of a straggler task is launched on a different node; and (ii) work stealing, where a node may request a non-local task if that node is idle. Here, we evaluate the benefit of using dynamic mechanisms in addition to the static execution plan derived from the results of our optimization.

Figure 2.11 shows the impact of enabling these two dynamic mechanisms on the performance of the optimized plan for each of the three sample applications. Additionally, Figure 2.12 shows the impact of enabling these mechanisms atop a competitive Hadoop baseline plan, specifically one that pushes from each data source to the most local mapper and distributes intermediate keys uniformly across all reducers.

We find from these two figures that, applied on its own, speculation does not statistically significantly degrade performance in any case, while it significantly improves performance in one case.


Figure 2.11: Effect of Hadoop's dynamic scheduling mechanisms applied atop our optimized static execution plan for three sample applications. Error bars reflect 95% confidence intervals. (For each application, bars compare no dynamic mechanisms, speculation, and speculation plus stealing, broken into push, map/shuffle, and shuffle/reduce times.)


Figure 2.12: Effect of Hadoop's dynamic scheduling mechanisms applied atop a competitive Hadoop baseline plan. Error bars reflect 95% confidence intervals. (For each application, bars compare no dynamic mechanisms, speculation, and speculation plus stealing, broken into push, map/shuffle, and shuffle/reduce times.)

On the other hand, the addition of work stealing to speculation never statistically significantly improves performance over speculation alone, while in some cases—specifically Word Count—stealing significantly degrades performance. In fact, there is only one case where the combination of speculation and stealing is statistically significantly better than the static baseline: Full Inverted Index with the Hadoop baseline. The reason in this case is that the static plan myopically optimizes the push phase, adversely impacting the much more dominant shuffle and reduce phases. By enabling speculation and optionally stealing, however, Hadoop is able to bypass the bottleneck network links and compute nodes by moving data over faster links and placing tasks on faster nodes.

Although the dynamic mechanisms are helpful in such a case, they can also degrade performance in other cases. For example, the combination of speculation and stealing statistically significantly worsens performance for two of the three applications when applied atop an optimized plan, and for one of the three applications when applied atop a Hadoop baseline plan. For the optimized plans, this occurs because dynamic changes to the offline plan can actually undermine the optimization. After all, if the computed plan is optimal, then barring any changes to the underlying infrastructure, no dynamic change could improve performance. For the Hadoop baseline, the degradation occurs for the Word Count application, for which the runtime is dominated by the push and map phases. For such an application, Hadoop's myopic optimization is actually quite effective, and dynamically deviating from this plan can yield significantly worse performance as we see here.

Though it is not shown directly in the figures, it is noteworthy that the best Hadoop performance is never statistically significantly better than the performance with the optimized plan. Hadoop's best performance relative to the optimized plan occurs for the Word Count application, for which Hadoop with speculation (but not stealing) yields a lower mean makespan than the optimized plan, but not statistically significantly so. These results overall show the strength of the statically enforced optimized plan. At the same time, they show how dynamic mechanisms can improve performance when the initial plan is far from optimal, as is the case for the Full Inverted Index application with the Hadoop baseline. Such a situation could also arise if network or node conditions change significantly, for example due to network congestion or changes in background CPU load. Developing dynamic mechanisms to improve performance in such cases without degrading performance in other cases is an interesting direction for future work.

Impact of Data Replication Over Slow Links


Figure 2.13: Effect of HDFS file replication for three sample applications. Error bars reflect 95% confidence intervals. (For each application, bars compare HDFS replication factors 1, 2, and 3, broken into push, map/shuffle, and shuffle/reduce times.)

In our model, we restrict replication to be intra-cluster, to avoid sending redundant data across slow wide-area links. Figure 2.13 shows the impact of wide-area replication on the performance of vanilla Hadoop for each of the three applications. As the figure shows, increasing replication substantially increases the cost of data push, as well as the cost of the reduce phase due to the need to materialize final results to the distributed file system. While higher replication also yields a reduction in compute time in the map phase due to greater scheduling flexibility, this improvement is dwarfed by the increased communication costs. However, replication across clusters would be useful for achieving higher fault tolerance against geographically localized faults. Such replication may also provide other opportunities for performance optimization; e.g., work stealing may be directed to local tasks to avoid high-overhead data communication. Enhancing our model to incorporate cross-cluster replication, or complementing our execution with other techniques such as checkpointing for intermediate data [34], is an interesting direction for future work.

2.5 Related Work

Several other works consider data and task placement in MapReduce. Purlieus [35] categorizes jobs by the relative size of inputs to their map and reduce phases and applies different optimization approaches for each category; this is similar to our use of α as a key application parameter. CoGRS [36] considers partition skew and locality and places reduce tasks as close to their intermediate data as possible. Unlike our approach, however, it does not consider reduce task locality while placing map tasks. Gadre et al. [37] and Kim et al. [38] focus on geographically distributed data and compute resources, respectively. Kim et al. focus on reaching the end of the shuffle phase as early as possible, whereas our objective is minimizing end-to-end makespan. Both of these works consider scheduling within only one phase, while our multi-phase approach controls both the map and reduce phases.

Other work considers geo-distributed MapReduce deployments from an architectural standpoint. Cardosa et al. [39] explore three architectures ranging from highly centralized to geographically distributed and conclude that no one architecture is appropriate for all applications. Luo et al. [40] propose running multiple local MapReduce jobs and aggregating their results using a new “Global Reduce” phase. They assume that communication is a small part of the total runtime, whereas our approach can find the best execution plan for a broad range of application and system characteristics.

Heterogeneity in tightly coupled local clusters has been addressed by several other works. For example, Zaharia et al. [41] propose the LATE scheduler to better detect straggler tasks due to hardware heterogeneity. Mantri [42] and Tarazu [43] take proactive approaches to recognizing stragglers and dynamically balancing workloads, respectively. While these works focus on tightly coupled local clusters, they may also apply at the level of a single data center in the geo-distributed deployments that we consider.

Our work does not address the problem of resource provisioning or configuration, instead assuming that the set of available compute resources is determined in advance. Elastisizer [26], STEAMEngine [44], and Bazaar [27] each address the provisioning problem using a different mix of online and offline profiling, as well as predictive modeling. Starfish [45] adapts to system workloads and application characteristics to tune Hadoop configuration options.

Sandholm et al. [46] focus on multi-job scheduling in MapReduce clusters, as do Delay Scheduling [47] and Quincy [48]. While we have focused in this chapter on a single job running in isolation, extending our work to multiple concurrent jobs in a geographically distributed setting is an interesting area for future work.

Our model-driven optimization requires knowledge of network bandwidth and compute speeds. The Network Weather Service (NWS) [49] and OPEN [50] demonstrate systems for gathering and disseminating such information in a scalable manner. He et al. [51] investigate formula-based and history-based techniques for predicting the throughput of large TCP transfers, and show that history-based approaches with even a small number of samples can lead to accurate prediction.

2.6 Concluding Remarks

In this chapter, we focused on optimizing the makespan of data-intensive applications, particularly those that fit the MapReduce model, in geographically distributed environments comprising distributed data and distributed computation resources. We developed a modeling framework to capture the execution time of MapReduce applications in these environments, and showed that it is flexible enough to capture several design choices. Additionally, we developed a model-driven optimization with two key features: (i) it optimizes end-to-end makespan, unlike myopic optimizations which make locally optimal but globally suboptimal decisions, and (ii) it controls multiple MapReduce phases to minimize makespan, unlike single-phase optimizations which control only individual phases.

We used our model to show how our end-to-end multi-phase optimization can significantly reduce execution time compared to myopic or single-phase baselines. We also modified the Hadoop MapReduce framework to enact our optimization outputs, and used three different applications over an 8-node emulated PlanetLab testbed to show that our optimized execution plans can achieve 31–41% reduction in runtime over a vanilla Hadoop execution. Our model-driven optimization provided insights into the problem of placing data and computation in geographically distributed data-intensive computing. For example, we showed that our optimization can significantly outperform either a fully centralized or fully distributed approach over a broad range of application and network characteristics.

Chapter 3

Cross-Phase Optimization in MapReduce

3.1 Introduction

The model-driven optimization from Chapter 2 leads us to the following high-level insights for geographically distributed MapReduce applications:

• Neither a purely distributed nor a purely centralized execution is the best for all situations. Instead, the best approach typically lies between these two extremes.

• Optimizing with an end-to-end objective yields significantly lower makespan than optimizing with a myopic objective. The reason is that end-to-end optimization tolerates local suboptimality in order to achieve global optimality.

• Jointly controlling multiple MapReduce phases lowers the makespan compared to controlling only a single phase. Although optimizing the bottleneck phase alone can be beneficial, each phase contributes to the overall makespan, so controlling multiple phases has the best overall benefit, independent of application characteristics.

These high-level insights inspire the design of cross-phase optimization techniques that are suitable for real-world MapReduce implementations, where decisions must be made under imperfect knowledge, and the dynamic nature of network, compute, and application characteristics might render a static optimization approach impractical.

Based on the need for an end-to-end optimization objective, our Map-Aware Push technique makes push decisions aware of their impact not just on the push phase, but also on the map phase. In the spirit of multi-phase optimization, we also control communication in the shuffle phase through our Shuffle-Aware Map technique, which factors in predicted shuffle costs while placing map tasks. This technique also contributes to a more nearly end-to-end optimization, as it allows both push and map placement to be influenced by downstream costs in the shuffle phase.

Chapter Outline

The remainder of this chapter is organized as follows. We present our proposed Map-Aware Push and Shuffle-Aware Map techniques in Sections 3.2 and 3.3, respectively. We show experimental results by putting these techniques together in an end-to-end manner in Section 3.4. We present related work in Section 3.5 and conclude in Section 3.6.

3.2 Map-Aware Push

The first opportunity for cross-phase optimization in MapReduce lies at the boundary between the push and map phases. A typical practice is what we call a push-then-map approach, where input data are imported in one step, and computation begins only after the data import completes. This approach has two major problems. First, by forcing all computation to wait for the slowest communication link, it introduces waste due to idle resources. Second, separating the push and map phases deprives map tasks of a way to demand more or less work based on their compute capacity. This makes scheduling the push in a map-aware manner more difficult. To overcome these challenges, we propose two changes: first, overlapping (or pipelining) the push and map phases; and second, inferring locality information at runtime and driving scheduling decisions based on this knowledge.

Before discussing how we implement our proposed approach in Hadoop and showing experimental results, we describe in more detail the problems of a push-then-map approach and how our proposed Map-Aware Push technique addresses them. To begin, consider the simple example environment shown in Figure 3.1, comprising two data sources S1 and S2 and two mappers M1 and M2. Assume that each data source initially hosts 15 GB of data and that the link bandwidths and mapper computation rates are as shown in the figure.

[Figure 3.1 illustration: sources S1 and S2 each host 15 GB of input data. Link bandwidths: S1 to M1 at 20 MB/s, S1 to M2 at 10 MB/s, S2 to M1 at 10 MB/s, S2 to M2 at 40 MB/s. Mapper compute rates: M1 at 50 MB/s, M2 at 20 MB/s.]

Figure 3.1: A simple example network with two data sources and two mappers.

3.2.1 Overlapping Push and Map to Hide Latency

With these network links, the following push distribution would optimize the push (i.e., minimize push time): S1 pushes 10 GB to M1 and 5 GB to M2, and S2 pushes 3 GB to M1 and 12 GB to M2. If we were to use this distribution with a push-then-map approach, then the entire push would finish after 500 s even though source S2 would finish its part after only 300 s. Map computation would begin after 500 s and continue until 1350 s. If we instead were to overlap the push and map, allowing map computation to begin at each mapper as soon as data begin to arrive, then we could avoid this unnecessary waiting. For example, assuming that we used this same optimal push distribution, then mapper M2, with slower compute capacity than its incoming network links, would be the bottleneck, finishing after 850 s. By simply overlapping the map and push phases, we could reduce the total push and map runtime by about 37% in this example.

3.2.2 Overlapping Push and Map to Improve Scheduling

The second problem arises due to the lack of feedback from the map phase to the push phase. Without this feedback, and absent any a priori knowledge of the map phase performance, we are left with few options other than simply optimizing the push phase in isolation. Such a single-phase optimization favors pushing more data to mappers with faster incoming network links. In our example, however, it is the mapper with slower network links (M1) that is actually better suited for map computation. Unfortunately, by pursuing an optimal push phase, we end up with more data at M2 and, in turn, roughly 3.3x longer map computation there than at M1.

For better overall performance, we need to weigh the two factors of network bandwidth and computation capacity and trade off between a faster push and a faster map. Continuing with our simple example, we should tolerate a slightly slower push in order to achieve a significantly faster map by sending more data to mapper M1. In fact, in an optimal case, we would send 60% of all input data there, yielding a total push-map runtime of only 600 s, or a 55% reduction over the original push-then-map approach.
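To make the arithmetic of this toy example easy to reproduce, the sketch below recomputes the three scenarios (push-then-map, overlapped push with the push-optimal distribution, and overlapped push with a map-aware 60% split) using the bandwidths and compute rates from Figure 3.1. The timing model is deliberately simplified, and the particular map-aware split shown (12 GB from S1 and 6 GB from S2 to M1) is just one allocation that realizes the 60% share; both are illustrative assumptions rather than part of our Hadoop implementation.

```python
# Rough timing model for the Figure 3.1 example (two 15 GB sources, two mappers).
# Bandwidths and compute rates in MB/s; plan entries in GB (1 GB = 1000 MB here).
LINK = {("S1", "M1"): 20, ("S1", "M2"): 10,
        ("S2", "M1"): 10, ("S2", "M2"): 40}
COMPUTE = {"M1": 50, "M2": 20}

def push_time(plan):
    """Seconds until every source finishes its push; plan[(src, mapper)] = GB."""
    per_source = {}
    for (src, mapper), gb in plan.items():
        t = gb * 1000 / LINK[(src, mapper)]
        per_source[src] = max(per_source.get(src, 0.0), t)
    return max(per_source.values())

def map_time(plan):
    """Seconds for the slowest mapper to process its assigned data in isolation."""
    per_mapper = {}
    for (src, mapper), gb in plan.items():
        per_mapper[mapper] = per_mapper.get(mapper, 0.0) + gb * 1000 / COMPUTE[mapper]
    return max(per_mapper.values())

def overlapped_time(plan):
    """With push-map overlap, a mapper finishes when the slower of its incoming
    push or its total map computation completes (a simplification)."""
    finish = 0.0
    for mapper, rate in COMPUTE.items():
        assigned = [(s, gb) for (s, m), gb in plan.items() if m == mapper]
        arrival = max(gb * 1000 / LINK[(s, mapper)] for s, gb in assigned)
        compute = sum(gb for _, gb in assigned) * 1000 / rate
        finish = max(finish, max(arrival, compute))
    return finish

push_optimal = {("S1", "M1"): 10, ("S1", "M2"): 5, ("S2", "M1"): 3, ("S2", "M2"): 12}
map_aware    = {("S1", "M1"): 12, ("S1", "M2"): 3, ("S2", "M1"): 6, ("S2", "M2"): 9}

print(push_time(push_optimal) + map_time(push_optimal))  # push-then-map: 1350 s
print(overlapped_time(push_optimal))                     # overlapped push: 850 s
print(overlapped_time(map_aware))                        # map-aware split: 600 s
```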

3.2.3 Map-Aware Push Scheduling

This raises the question of how we can schedule push and map jointly, respecting the interaction between the two phases. As we have argued, this is difficult to do when the push and map phases are separated. With overlapped push and map, however, the distribution of computation across mapper nodes can be demand-driven. Specifically, whereas push-then-map first pushes data from sources, our approach logically pulls data from sources on demand. Using existing Hadoop mechanisms, this on-demand pull is initiated when a mapper becomes idle and requests more work, so faster mappers can perform more work. This is how our proposed approach respects map computation heterogeneity.

To respect network heterogeneity, our Map-Aware Push technique departs from the traditional Hadoop approach of explicitly modeling network topology as a set of racks and switches, and instead infers locality information at runtime. It does this by monitoring source-mapper link bandwidth at runtime and estimating the push time for each source-mapper pair. Specifically, let d be the size of a task in bytes (assume for ease of presentation that all task sizes are equal) and let L_{s,m} be the link speed between source node s and mapper node m in bytes per second. Then we estimate the push time T_{s,m} in seconds from source s to mapper m as

T_{s,m} = d / L_{s,m}.   (3.1)

Let S denote the set of all sources that have not yet completed their push. Then when mapper node m requests work, we grant it a task from source s* = argmin_{s ∈ S} T_{s,m}. Intuitively, this is equivalent to selecting the closest task in terms of network bandwidth. This is a similar policy to Hadoop's default approach of preferring data-local tasks, but our overall approach is distinguished in two ways. First, rather than reacting to data movement decisions that have already been made in a separate push phase, it proactively optimizes data movement and task placement in concert. Second, it discovers locality information dynamically and automatically rather than relying on an explicit user-specified model.
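A minimal sketch of this selection rule follows, assuming the monitored link speeds are available as a simple dictionary; the function names (estimate_push_time, pick_source) are ours and do not correspond to identifiers in the Hadoop code.

```python
def estimate_push_time(task_size_bytes, link_speed_bytes_per_sec):
    """Equation 3.1: T_{s,m} = d / L_{s,m}."""
    return task_size_bytes / link_speed_bytes_per_sec

def pick_source(mapper, pending_sources, link_speed, task_size_bytes):
    """When mapper m requests work, grant a task from the source s* that minimizes
    the estimated push time T_{s,m}, considering only sources with pending tasks."""
    return min(pending_sources,
               key=lambda s: estimate_push_time(task_size_bytes, link_speed[(s, mapper)]))

# Example: measured bandwidths (bytes/s) from a hypothetical monitoring module.
link_speed = {("S1", "M1"): 20e6, ("S2", "M1"): 10e6}
print(pick_source("M1", {"S1", "S2"}, link_speed, task_size_bytes=64 * 1024 * 1024))  # -> S1
```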

3.2.4 Implementation in Hadoop

Now we can discuss how we have implemented our approach in Hadoop 1.0.1. First, the overlapping itself is possible using existing Hadoop mechanisms, albeit with a more creative deployment. Specifically, we set up a Hadoop Distributed File System (HDFS) instance comprising the data source nodes, which we refer to as the "remote" HDFS instance and use directly as the input to a Hadoop MapReduce job. Map tasks in Hadoop typically read their inputs from HDFS, so this allows us to directly employ existing Hadoop mechanisms.1

Our scheduling enhancements, on the other hand, require modification to the Hadoop task scheduler. To gather the bandwidth information mentioned earlier, we add a simple network monitoring module which records actual source-to-mapper link performance and makes this information accessible to the task scheduler. For Hadoop MapReduce jobs that read HDFS files as input, each map task corresponds to an InputSplit, which in turn corresponds to an HDFS file block. HDFS provides an interface to determine physical block locations, so the task scheduler can determine the source associated with a particular task and compute its T_{s,m} based on bandwidth information from the monitoring module. If there are multiple replicas of the file block, then T_{s,m} can be computed for each replica, and the system can use the replica that minimizes this value. The task scheduler then assigns tasks from the closest source s* as described earlier.

1 To improve fault tolerance, we have also added an option to cache and replicate inputs at the compute nodes. This reduces the need to re-fetch remote data after task failures or for speculative execution.

3.2.5 Experimental Results

We are interested in the performance of our approach, which overlaps push and map and infers locality at runtime, compared to a baseline push-then-map approach. To implement the push-then-map approach, we also run an HDFS instance comprising the compute nodes (call this the "local" HDFS instance). We first run a Hadoop DistCP job to copy from the remote HDFS to this local HDFS, and then run a MapReduce job directly from the local HDFS. We compare application execution time using these two approaches. Because we are concerned primarily with push and map performance at this point, we run the Hadoop example WordCount job on text data generated by the Hadoop example randomtextwriter generator, as this represents a map-heavy application.

We run this experiment in two different environments: Amazon EC2 and PlanetLab. Our EC2 setup uses eight EC2 nodes in total, all of the m1.small instance type. These nodes are distributed evenly across two EC2 regions: four in the US and the other four in Europe. Each node hosts one map slot and one reduce slot. Two PlanetLab nodes, one in the US and one in Europe, serve as distributed data sources. Table 3.1 shows the bandwidths measured between the multiple nodes in this setup.

Figure 3.2(a) shows the execution time of the WordCount job on 2 GB of input data, and it shows that our approach to overlapping push and map reduces the total runtime of the push and map phases by 17.7%, and the total end-to-end runtime by 15.2%, on our EC2 testbed.

Next, we run the same experiment on PlanetLab. We continue to use two nodes as distributed data sources, and we use four other globally distributed nodes as compute nodes, each hosting one map slot and one reduce slot. Table 3.2 shows the bandwidths measured between the multiple nodes in this setup. Due to the smaller cluster size in this experiment, we use only 1 GB of text input data.

Table 3.1: Measured bandwidths in the EC2 experimental setup.

  From        To          Bandwidth (MB/s)
  Source EU   Worker EU    8
  Source EU   Worker US    3
  Source US   Worker EU    3
  Source US   Worker US    4
  Worker EU   Worker EU   16
  Worker EU   Worker US    2
  Worker US   Worker EU    5
  Worker US   Worker US    2

[Figure 3.2 plots: execution time (s), broken into push, map, overlapped push/map, and reduce, for the Hadoop default (push-then-map) and Map-Aware Push approaches. (a) WordCount on 2 GB text data on EC2. (b) WordCount on 1 GB text data on PlanetLab.]

Figure 3.2: Runtime of a Hadoop WordCount job on text data for the push-then-map approach and the Map-Aware Push approach on globally distributed Amazon EC2 and PlanetLab test environments. Error bars reflect 95% confidence intervals.

Table 3.2: Measured bandwidths in the PlanetLab experimental setup.

  From          To            Bandwidth (MB/s)
  All sources   All workers   1-3
  Workers A-C   Workers A-C   4-9
  Workers A-C   Worker D      2
  Worker D      Workers A-C   0.2-0.4

Figure 3.2(b) shows that push-map overlap can reduce runtime of the push and map phases by 21.3% and the whole job by 17.5% in this environment. We see a slightly greater benefit from push-map overlap on PlanetLab than on EC2 due to the increased heterogeneity of the PlanetLab environment.

3.3 Shuffle-Aware Map

In the previous section, we showed how the map phase can influence the push phase, in terms of both the volume of data each mapper receives as well as the sources from which each mapper receives its data. In turn, the push determines, in part, when a map slot becomes available for a mapper. Thus, from the perspective of the push and map phases, a set of mappers and their data sources are decided. This decision, however, ignores the downstream cost of the shuffle and reduce, as we will show. In this section, we show how the set of mappers can be adjusted to account for the downstream shuffle cost.

In traditional MapReduce, intermediate map outputs are shuffled to reducers in an all-to-all communication. In Hadoop, one can control the granularity of reduce tasks and the amount of work each reducer will obtain. However, these decisions ignore the possibility that a mapper-reducer link may be very poor. For example, in Figure 3.3, the links between mapper C and reducers D and E are poor, thus raising the cost of the shuffle. For applications in which shuffle is dominant, this phenomenon can greatly impact performance, particularly in heterogeneous networks.

Two solutions are possible: changing the reducer nodes, or reducing the amount of work done by mapper C and in turn reducing the volume of data traversing the bottleneck links. We present an algorithm that takes the latter approach. In this way, the downstream shuffle (or reduce) can impact the map. This is similar to the Map-Aware Push technique, where the map influenced the push.

[Figure 3.3 illustration: mappers A, B, and C; reducers D and E.]

Figure 3.3: An example network where links from mapper C to reducers D and E are shuffle bottlenecks.

As in typical MapReduce, we assume the reducer nodes are known a priori. We also assume that we know the approximate distribution of reducer tasks: i.e., we know the fraction of intermediate data allocated to each reducer node. This allows us to know how much data must travel on the link from a mapper node to each reducer, which our algorithm utilizes. The distribution can be estimated using a combination of reducer node execution power and mapper-reducer link speed, pre-profiled. This estimate can be updated during the map phase if shuffle and reduce are overlapped.

3.3.1 Shuffle-Aware Map Scheduling

To estimate the impact of a mapper node upon the reduce phase, we first estimate the time taken by the mapper to obtain a task, execute it, and deliver intermediate data to all reducers (assuming parallel transport). The intuition is that if the shuffle cost is high, then the mapper node should be throttled to allow the map task to be allocated to a mapper with better shuffle performance. We estimate the finish time T_m for a mapper m to execute a map task as T_m = T_m^{map} + T_m^{shuffle}, where T_m^{map} is the estimated time for the mapper m to execute the map task, including the time to read the task input from a source (using the Map-Aware Push approach), and T_m^{shuffle} is the estimated time to shuffle the accumulated intermediate data D_m up to the current task, from mapper m to all reducer nodes r ∈ R, where R is the set of all reducer nodes. Let D_{m,r} be the portion of D_m destined for reducer r, and L_{m,r} be the link speed between mapper node m and reducer node r. Then, we can compute

T_m^{shuffle} = max_{r ∈ R} ( D_{m,r} / L_{m,r} ).   (3.2)

The Shuffle-Aware Map scheduling algorithm uses these T_m estimates to determine a set of eligible mappers to which to assign tasks. The intuition is to throttle those mappers that would have an adverse impact on the performance of the downstream reducers. The set of eligible mappers M_{eligible} is based on the most recent T_m values and a tolerance parameter α:

M_{eligible} = { m ∈ M | T_m ≤ min_{m ∈ M} T_m + α },   (3.3)

where M is the set of all mapper nodes. The intuition is that if the execution time for a mapper (including its shuffle time) is too high, then it should not be assigned more work at present. The value of the tolerance parameter α controls the aggressiveness of the algorithm in excluding slower mappers (in terms of their shuffle performance) from being assigned work. At one extreme, α = 0 would enforce assigning work only to the mapper with the earliest estimated finish time, intuitively achieving good load balancing, but leaving all other mappers idle for long periods of time. At the other extreme, a high value of α (α ≥ max_{m ∈ M} T_m − min_{m ∈ M} T_m) would allow all mapper nodes to be eligible irrespective of their shuffle performance, and would thus reduce to the default MapReduce map scheduling. We select an intermediate value:

α = ( max_{m ∈ M} T_m − min_{m ∈ M} T_m ) / 2.   (3.4)

The intuition behind this value is that it biases towards roughly the upper half of mappers in terms of their shuffle performance. This is of course just one possible threshold; exploring other possibilities would be an interesting topic for future work.

One point worth emphasizing is that the algorithm makes its decisions dynamically, so that over time, a mapper may become eligible or ineligible depending on the relation between its T_m value and the current value of min_{m ∈ M} T_m. As a result, this algorithm allows a discarded mapper node to be re-included later should other nodes begin to offer worse performance. Similarly, a mapper may be throttled if its performance starts degrading over time.
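The sketch below restates Equations 3.2 through 3.4 as executable pseudocode. The per-mapper map-time estimates, intermediate data volumes, and link speeds are made-up inputs standing in for what the monitoring module would report; the function names are ours.

```python
def shuffle_time(intermediate_by_reducer, link_to_reducer):
    """Equation 3.2: bottleneck time to shuffle a mapper's accumulated data to all reducers."""
    return max(intermediate_by_reducer[r] / link_to_reducer[r] for r in intermediate_by_reducer)

def finish_times(map_time, intermediate, links):
    """T_m = T_m^map + T_m^shuffle for every mapper m."""
    return {m: map_time[m] + shuffle_time(intermediate[m], links[m]) for m in map_time}

def eligible_mappers(T):
    """Equations 3.3 and 3.4: keep mappers within alpha of the earliest estimated finish."""
    alpha = (max(T.values()) - min(T.values())) / 2
    return {m for m, t in T.items() if t <= min(T.values()) + alpha}

# Example: times in seconds, intermediate data in MB, link speeds in MB/s.
map_time = {"A": 100, "B": 120, "C": 110}
intermediate = {m: {"D": 400, "E": 400} for m in map_time}
links = {"A": {"D": 8, "E": 8}, "B": {"D": 8, "E": 8}, "C": {"D": 0.4, "E": 0.4}}

T = finish_times(map_time, intermediate, links)
print(eligible_mappers(T))  # {'A', 'B'}: mapper C is throttled by its slow outgoing links
```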

3.3.2 Implementation in Hadoop

We have implemented this Shuffle-Aware Map scheduling algorithm by modifying the task scheduler in Hadoop. The task scheduler now maintains a list of estimates T_m for all mapper nodes m, and updates these estimates as map tasks finish. It also uses the mapper-to-reducer node pair bandwidth information obtained by the network monitoring module to update the estimates of shuffle times from each mapper node. Every time a map task finishes, the task tracker on that node asks the task scheduler for a new map task. At that point, the scheduler applies Equation 3.3 to determine the eligibility of the node to receive a new task. If the node is eligible, then it is assigned a task from the best source determined by the Map-Aware Push algorithm described in Section 3.2. On the other hand, if the node is not eligible, then it is not assigned a task. However, it can request work again periodically by piggybacking on heartbeat messages, and its eligibility will be checked again.

3.3.3 Experimental Results

We now present some results that show the benefit of Shuffle-Aware Map. Here we run our InvertedIndex application, which takes as input a set of eBooks from Project Gutenberg [31] and produces, for each word in its input, the complete list of positions where that word can be found. This application shuffles a large volume of intermediate data, so it is an interesting application for evaluating our Shuffle-Aware Map scheduling technique.

First, we run this application on our EC2 multi-region cloud as described in Table 3.1. In this environment, we use 1.8 GB of eBook data as input, and this produces about 4 GB of intermediate data to be shuffled to reducers. Figure 3.4(a) shows the runtime for a Hadoop baseline with push and map overlapped, as well as the runtime of our Shuffle-Aware Map scheduling technique, also with push and map overlapped. The reduce time shown includes shuffle cost. Note that in Shuffle-Aware Map the shuffle and reduce time (labeled "reduce" in the figure) are smaller than in default Hadoop. Also observe that in Shuffle-Aware Map the map times go up slightly; the algorithm has decided to make this tradeoff, resulting in overall better performance.

On our wider-area PlanetLab setup (see Table 3.2), we use 800 MB of eBook data and see a similar pattern, as Figure 3.4(b) shows. Again, an increase in map time is tolerated to reduce shuffle cost later on. This may mean that a slower mapper is given more work since it has faster links to downstream reducers. For this application, we see performance improvements of 6.8% and 9.6% on EC2 and PlanetLab, respectively.

[Figure 3.4 plots: execution time (s), broken into map and reduce (including shuffle), for the Hadoop default and Shuffle-Aware Map scheduling approaches. (a) InvertedIndex on 1.8 GB eBook data on EC2. (b) InvertedIndex on 800 MB eBook data on PlanetLab.]

Figure 3.4: Runtime of the InvertedIndex job on eBook data for the default Hadoop scheduler and our Shuffle-Aware Map scheduler. Both approaches use an overlapped push and map in these experiments. Error bars reflect 95% confidence intervals.

3.4 Putting it all Together

To determine the complete end-to-end benefit of our proposed techniques, we run experiments comparing a traditional Hadoop baseline, which uses a push-then-map approach, to an alternative that uses our proposed Map-Aware Push and Shuffle-Aware Map techniques. Taken together, we will refer to our techniques as the End-to-end approach. We carry out these experiments on the same EC2 and PlanetLab test environments introduced in Table 3.1 and Table 3.2, respectively. We focus here on the InvertedIndex application from Section 3.3 as well as a new Sessionization application. This Sessionization application takes as input a set of Web server logs from the WorldCup98 trace [32], and sorts these records first by client and then by time. The sorted records for each client are then grouped into a set of "sessions" based on the gap between subsequent records. Both the InvertedIndex and Sessionization applications are relatively shuffle-heavy, representing a class of applications that can benefit from our Shuffle-Aware Map technique.

3.4.1 Amazon EC2

First, we explore the combined benefit of our techniques on our EC2 test environment (see Section 3.2 for details), comprising two distributed data sources and eight worker nodes spanning two EC2 regions. Figure 3.5(a) shows results for the InvertedIndex application, where we see that our approaches reduce the total execution time by about 9.7% over the traditional Hadoop approach. There is little difference in total push and map time, so most of this reduction in runtime comes from a faster shuffle and reduce (labeled "reduce" in the figure). This demonstrates the effectiveness of our Shuffle-Aware Map scheduling approach, as well as the ability of our techniques to automatically determine how to trade off between faster push and map phases or faster shuffle and reduce phases.

Now consider the Sessionization application, which has a slightly lighter shuffle and slightly heavier reduce than does the InvertedIndex application. Figure 3.5(b) shows that for this application on our EC2 environment, our approaches can reduce execution time by 8.8%. Again, most of the reduction in execution time comes from more efficient shuffle and reduce phases. Because this application has a slightly lighter shuffle than does the InvertedIndex application, we would expect a slightly smaller performance improvement, and our experiments confirm this.

[Figure 3.5 plots: execution time (s), broken into push, map, overlapped push/map, and reduce, for the Hadoop default and End-to-end approaches. (a) InvertedIndex on 1.8 GB eBook data on EC2. (b) Sessionization on 2 GB text log data on EC2.]

Figure 3.5: Execution time for traditional Hadoop compared with our proposed Map-Aware Push and Shuffle-Aware Map techniques (together, End-to-end) for the InvertedIndex and Sessionization applications on our EC2 test environment. Error bars reflect 95% confidence intervals.

3.4.2 PlanetLab

Now we move to the PlanetLab environment, which exhibits more extreme heterogeneity than the EC2 environment. For this environment, we consider only the InvertedIndex application, and Figure 3.6 shows that our approaches can reduce execution time by about 16.4%. Although we see a slight improvement in total push and map time using our approach, we can again attribute the majority of the performance improvement to a more efficient shuffle and reduce.


Figure 3.6: Execution time for traditional Hadoop compared with our proposed Map-Aware Push and Shuffle-Aware Map techniques (together, End-to-end) for the InvertedIndex application with 800 MB eBook data on our PlanetLab test environment. Error bars reflect 95% confidence intervals.

Table 3.3: Number of map tasks assigned to each mapper node in our PlanetLab test environment.

  Scheduler        Mapper A   Mapper B   Mapper C   Mapper D
  Hadoop Default   5          4          5          3
  End-to-end       5          5          6          1

To more deeply understand how our techniques achieve this improvement, we record the number of map tasks assigned to each mapper node, as shown in Table 3.3. We see that both Hadoop and our techniques assign fewer map tasks to Mapper D, but that our techniques do so in a much more pronounced manner. Network bandwidth measurements reveal that this node has much slower outgoing network links than do the other mapper nodes; only about 200–400 KB/s compared to about 4–9 MB/s for the other nodes (see Table 3.2). By scheduling three map tasks at that node, Hadoop has effectively "trapped" intermediate data there, resulting in a prolonged shuffle phase. Our Shuffle-Aware Map technique, on the other hand, has the foresight to avoid this problem by refusing to grant Mapper D additional tasks even when it becomes idle and requests more work.

3.5 Related Work

Heterogeneity

Traditionally, the MapReduce [13] programming paradigm assumes a tightly coupled homogeneous cluster applied to a uniform data-intensive application. Previous work has shown that if this assumption is relaxed, then performance suffers. Zaharia et al. [41] show that, under computational heterogeneity, the mechanisms built into Hadoop for identifying straggler tasks break down. Their LATE scheduler provides better techniques for identifying, prioritizing, and scheduling backup copies of slow tasks. In our work, we also assume that nodes can be heterogeneous since they belong to different data centers or locales. Chen et al. [52] report techniques for improving the accuracy of progress estimation for tasks in MapReduce. Ahmad et al. [43] demonstrate that despite straggler optimizations, the performance of MapReduce frameworks on clusters with computational heterogeneity remains poor, as the load balancing used in MapReduce causes excessive and bursty network communication and the heterogeneity further amplifies the load imbalance of reducers. Their Tarazu system uses a communication-aware balancing mechanism and predictive load-balancing across reducers to address these problems. Mantri [42] explores various causes of outlier tasks in further depth, and develops cause- and resource-aware techniques to identify and act on outliers earlier, and to greater benefit, than in traditional MapReduce. Such improvements are complementary to our techniques. Previous work mainly focuses on computational heterogeneity within a single cluster. Our work, however, targets more loosely coupled and dispersed collections of resources with constrained and heterogeneous bandwidth.

Loosely-Coupled Deployments

Several works have targeted MapReduce deployments in loosely coupled environments. MOON [53] explores MapReduce performance in volatile, volunteer computing environments and extends Hadoop to improve performance under loosely coupled networks with unreliable slave nodes. In our work, we do not focus on reliability; instead we are concerned with performance issues of allocating compute resources to MapReduce tasks. Costa et al. [54] propose MapReduce in a global wide-area volunteer setting. However, this system is implemented in the BOINC framework with all input data held by the central scheduler. In our system, we have no such restrictions. Luo et al. [40] propose a multi-cluster MapReduce deployment, but they focus on more compute-intensive jobs that may require resources in multiple clusters for greater compute power. In contrast, we consider multi-site resources not only for their compute power, but also for their locality to data sources.

Data Flow and Locality

Other researchers have addressed MapReduce data flow optimization and locality. Gadre et al. [37] optimize the reduce data placement according to map output locations, which might still end up trapping data in nodes with slow outward links. Kim et al. [38] present a similar idea of shuffle-aware scheduling, but do not consider widely distributed data sources that are not co-located with computation clusters. Their ICMR algorithm could make use of our mechanisms to improve the performance of MapReduce under such environments. Pipelining MapReduce has been proposed in MapReduce Online [28] to modify the Hadoop workflow for improved responsiveness and performance. It assumes, however, that input data are located with the computation resources, and it does not address the issue of pipelining push and map. MapReduce Online would be a complementary optimization to our techniques since it enables the shuffling of intermediate data without storing it to disk. Our mechanisms could be used to decide where data should flow, while their techniques could be used to optimize the transfer. Similarly, Verma et al. [29] discuss the specific challenges associated with pipelining the shuffle and reduce stages. Our Map-Aware Push technique could also be applied to pipeline shuffle and reduce, though we have not yet done so. The Purlieus system [35] considers MapReduce in a single cloud, but is unique in that it focuses on locality in the shuffle phase. It emphasizes the coupling between the placement of tasks (in their case virtual machines) and data. However, these works do not provide an end-to-end overall improvement of the MapReduce data flow.

Parameter Tuning

Other work has focused on fine-tuning MapReduce parameters or offering scheduling optimizations to provide better performance. Sandholm et al. [46] present a dynamic prioritization system for improved MapReduce runtime in the context of multiple jobs. Our work is concerned with optimizing a single job relative to data source and compute resource locations. Babu [55] proposes algorithms for automatically fine-tuning MapReduce parameters to optimize job performance. Starfish [45] proposes a self-tuning architecture which monitors runtime performance of Hadoop and tunes the configuration parameters accordingly. Such work is complementary to ours, however, as we focus on mechanisms to directly change task and data placement rather than tune configuration parameters.

Data Transfer Protocols

Finally, work in wide-area data transfer and dissemination includes GridFTP [56] and BitTorrent [57]. GridFTP is a protocol for high-performance data transfer over high-bandwidth wide-area networks, and BitTorrent is a peer-to-peer file sharing protocol for wide-area distributed systems. These protocols could act as middleware services to further reduce data transfer costs and make geo-distributed data more accessible to geo-distributed compute resources.

3.6 Concluding Remarks

In this chapter, we showed that in geo-distributed environments, MapReduce/Hadoop performance suffers, as the impact of one phase upon another can severely degrade performance along bottleneck links. We showed that Hadoop's default data placement and scheduling mechanisms do not take such cross-phase interactions into account, and while they may try to optimize individual phases, they can result in globally bad decisions and poor overall performance. To overcome these limitations, we developed techniques to achieve cross-phase optimization in MapReduce. The key idea behind our techniques is to consider not only the execution cost of an individual task or computational phase, but also its impact on the performance of subsequent phases. The Map-Aware Push technique enables push-map overlap in order to hide latency and enable dynamic feedback between the map and push phases. Such feedback enables nodes with higher speeds and faster links to process more data at runtime. The Shuffle-Aware Map technique enables a shuffle-aware scheduler to feed back the cost of a downstream shuffle into the map process and affect the map phase. Mappers with poor outgoing links to reducers are throttled, eliminating the impact of mapper-reducer bottleneck links. For a range of heterogeneous environments (multi-region Amazon EC2 and PlanetLab) and diverse data-intensive applications (WordCount, InvertedIndex, and Sessionization), we demonstrated the performance potential of our techniques, as they reduce execution time by 7%–18% depending on the execution environment and application.

Chapter 4

Optimizing Timeliness and Cost in Geo-Distributed Streaming Analytics

4.1 Introduction

A modern analytics service must provide near-real-time analysis of rapid and unbounded streams of data in order to extract meaningful and timely information for its users. As a result, there has been a growing interest in streaming analytics, including the recent development of several distributed analytics platforms [58, 59, 60].

In many streaming analytics domains, inputs originate from diverse sources including users, devices, and sensors located around the globe. As a result, the distributed infrastructure of a typical analytics service (e.g., Google Analytics, Akamai Media Analytics, etc.) has a hub-and-spoke model as shown in Figure 4.1. Data sources send streams of data to nearby "edge" servers. These geographically distributed edge servers process incoming data and send summaries to a central location that can process the data further, store summaries, and present those summaries in visual form to users of the analytics service. While the central hub is typically located in a well-provisioned data center, resources may be limited at the edge locations. In particular, the available WAN bandwidth between the edge and the center is limited.



Figure 4.1: The distributed model for a typical analytics service comprises a single center and multiple edges, connected by a wide-area network.

A traditional approach to analytics processing is the centralized model where no processing is performed at the edges and all the data is sent to a dedicated centralized location. This approach is generally suboptimal, because it strains the scarce WAN bandwidth available between the edge and the center, leading to delayed results. Further, it fails to make use of the available compute and storage resources at the edge. An alternative is a decentralized approach [12] that utilizes the edge for much of the processing in order to minimize WAN traffic. In this chapter, we argue that analytics processing must utilize both edge and central resources in a carefully coordinated manner in order to achieve the stringent requirements of an analytics service in terms of both network traffic and user-perceived delay.

An important primitive in any analytics system is grouped aggregation. Abstractions for grouped aggregation are provided in most data analytics frameworks, for example as the Reduce operation in MapReduce, or Group By in SQL and LINQ. A useful variant in stream computing is windowed grouped aggregation, where each group is further broken down into finite time windows before being summarized. Windowed grouped aggregation is one of the most frequently used primitives in an analytics service and underlies queries that aggregate a metric of interest over a time window. For instance, a web analytics user may wish to compute the total visits to her web site broken down by country and aggregated on an hourly basis to track content popularity. A network operator may need to compute the average load in different parts of the network every five minutes to identify hotspots. In these cases, users define a standing windowed grouped aggregation query that generates results periodically for each time window (every hour, five minutes, etc.).

4.1.1 Challenges in Optimizing Windowed Grouped Aggregation

To understand the challenge of geo-distributed windowed grouped aggregation, consider two alternate approaches to grouped aggregation: pure streaming, where all data is immediately sent from the edge to the center without any edge processing; and pure batching, where all data during a time window is aggregated at the edge, with only the aggregated results being sent to the center at the end of the window. To illustrate the complexities, we consider two simplified queries from our web analytics example. For each example below, assume that the edge can communicate at the rate of 1,000 records/second to the center.

Example 4.1. Consider a query that uses grouped aggregation with key-value pairs (key = country, value = visits) for a time window of one hour. We say this query is “small” since the number of possible distinct keys is small as there are no more than 200 countries in the world. For this query, the right approach might be a pure batching approach. This approach will minimize the WAN traffic, since at most one record is sent for each distinct key. Further, the staleness of the final result might also be small since there are at most 200 records to send, so the aggregate values can be sent to the center within 200/1000 = 0.2 seconds of the end of the window.

Example 4.2. Consider a second query that uses grouped aggregation with key-value pairs (key = ⟨userid, url⟩, value = visits) for a time window of one hour. We say that this query is "large" since the product of the number of unique users and urls for a large website, and hence the number of possible distinct keys received within a time window, could be large. Let us assume this number to be 100,000 distinct keys per window. The above batching solution may not work well for the larger query, since sending 100,000 records at the end of the time window can incur a delay of 100 seconds, resulting in higher staleness. A more appropriate approach might be pure streaming. Since each unique user/url combination is less likely to repeat within a time window, the extra traffic sent by streaming is small. Further, there is no backlog of keys that must be updated at the end of the time window, resulting in smaller staleness.
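The back-of-the-envelope helper below reproduces the staleness estimates in these two examples under the stated assumption of 1,000 records/second from the edge to the center; it ignores queueing and aggregation overheads.

```python
def batching_staleness(distinct_keys_per_window, records_per_second=1000):
    """Pure batching flushes one update per distinct key at the end of the window,
    so the end-of-window backlog alone delays the final result by roughly this long."""
    return distinct_keys_per_window / records_per_second

print(batching_staleness(200))      # "small" query:  0.2 s
print(batching_staleness(100_000))  # "large" query:  100.0 s
```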

These examples illustrate that the right strategy for optimizing grouped aggregation depends on the data and query characteristics, including the rate of arrivals at the edge, the number of unique keys seen in a time window, and the available edge-to-center bandwidth. To show that the intuition gained from these examples applies in practice, we run experiments on a 5-node PlanetLab [14] setup using the traces of a popular web analytics service offered by a large CDN (see Section 4.3 for details on the dataset and queries used in our experiments). Figure 4.2 shows the traffic and staleness obtained for different queries (small, medium, and large) under the pure streaming and pure batching algorithms. The figure shows that, depending on the query, pure batching can reduce traffic by 70–90% while increasing staleness by 37–300% relative to pure streaming. This illustrates that simple aggregation algorithms cannot optimize both traffic and staleness at the same time, instead optimizing one at the cost of degrading the other. Further, the factors influencing the performance of such algorithms are often hard to predict and can vary significantly over time, requiring the design of algorithms that can adapt to changing factors in a dynamic fashion.

4.1.2 Research Contributions

• To our knowledge, we provide the first algorithms and analysis for optimizing grouped aggregation, a key primitive, in a geographically distributed streaming analytics service. In particular, we show that simpler approaches such as pure streaming or batching do not jointly optimize traffic and staleness, and are hence suboptimal.

[Figure 4.2 plots: (a) traffic normalized relative to Pure Streaming (logarithmic y-axis); (b) staleness normalized by the window length; each shown for the small, medium, and large queries under the Pure Streaming and Pure Batching algorithms.]

Figure 4.2: Simple aggregation algorithms such as streaming and batching optimize one metric over the other, and their performance depends on the data and query characteristics.

• We present a family of optimal offline algorithms that jointly minimize both staleness and traffic. Using this as a foundation, we develop practical online aggregation algorithms that emulate the offline optimal algorithms.

• We observe that grouped aggregation can be modeled as a caching problem where the cache size varies over time. This key insight allows us to exploit well-known caching algorithms in our design of online aggregation algorithms.

• We demonstrate the practicality of these algorithms through an implementation in Apache Storm [58], deployed on a PlanetLab [14] testbed. Our experiments are driven by workloads derived from anonymized traces of a popular web analytics service offered by Akamai [9], a large content delivery network. The results of our experiments show that our online aggregation algorithms simultaneously achieve traffic within 2.0% of optimal while reducing staleness by 65% relative to batching. We also show that our algorithms are robust to a variety of system configurations (number of edges), stream arrival rates, and query types.

4.2 Problem Formulation

System Model

We consider the typical hub-and-spoke architecture of an analytics system with a center and multiple edges (see Figure 4.1). Data streams are first sent from each source to a nearby edge. The edges collect and (potentially, partially) aggregate the data. The aggregated data can then be sent from the edges to the center for final aggregation. The final aggregated results are available at the center. Users of the analytics service query the center to visualize the data. To perform grouped aggregation, each edge runs a local aggregation algorithm: it acts independently to decide when and how much to aggregate the incoming data. (Coordination between edges is an interesting topic [61] for future work, but is outside the scope of this thesis.)

Data Streams and Grouped Aggregation

A data stream comprises records of the form (k, v), where k is the key and v is the value of the record. Data records of a stream arrive at the edge over time. Each key k can be multi-dimensional, with each dimension corresponding to a data attribute. A group is a set of records that have the same key.

Windowed grouped aggregation over a time window [T, T + W), where W is the window length, is defined as follows from an input/output perspective. The input is the set of data records that arrive within the time window. The output is determined by first placing the data records into groups, where each group is a set of records sharing the same key. For each group {(k, v_i)}, 1 ≤ i ≤ n, corresponding to the n records in the time window that have key k, an aggregate value v̂ = v_1 ⊕ v_2 ⊕ · · · ⊕ v_n is computed, where ⊕ is an application-defined associative binary operator; e.g., Sum, Max, or HyperLogLog merge.1 Customarily, the timeline is subdivided into non-overlapping intervals of length W and grouped aggregation is computed on each such interval.2

1 More formally, ⊕ is any associative binary operator such that there exists a semigroup (S, ⊕).
2 Such non-overlapping time windows are often called tumbling windows in analytics terminology.

To compute windowed grouped aggregation, we consider aggregation at the edge as well as the center. The data records that arrive at the edge can be partially aggregated locally at the edge, so that the edge can maintain a set of partial aggregates, one for each distinct key k.

The edge may transmit, or flush, these aggregates to the center; we refer to these flushed aggregates as updates. The center can further apply the aggregation operator ⊕ to incoming updates as needed in order to generate the final aggregate result. We assume that the computational overhead of the aggregation operator ⊕ is a small constant compared to the network overhead of transmitting an update.
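As a concrete reading of this definition, the sketch below computes windowed grouped aggregation over timestamped records with an arbitrary associative operator ⊕; the sample records and the Sum operator are illustrative only.

```python
def windowed_grouped_aggregation(records, window_start, window_length, op):
    """records: iterable of (arrival_time, key, value) tuples.
    Returns one aggregate per key for the window [window_start, window_start + window_length),
    where op is an associative binary operator (e.g., sum, max, or a HyperLogLog merge)."""
    aggregates = {}
    for t, k, v in records:
        if window_start <= t < window_start + window_length:
            aggregates[k] = op(aggregates[k], v) if k in aggregates else v
    return aggregates

records = [(0.5, "US", 3), (1.2, "US", 1), (2.7, "IN", 5), (4000.0, "US", 7)]
print(windowed_grouped_aggregation(records, 0, 3600, lambda a, b: a + b))
# {'US': 4, 'IN': 5}  (the record at t = 4000 falls outside the window)
```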

Optimization Metrics

Our goal is to simultaneously minimize two metrics: staleness, a key measure of timeliness; and network traffic, a key measure of cost. Staleness is defined as the smallest time s such that the results of grouped aggregation for time window [T, T + W) are available at the center at time T + W + s, as illustrated in Figure 4.3. In our model, staleness is simply the time elapsed from when the time window completes to when the last update for that time window reaches the center and is included in the final aggregate. The network traffic is measured by the number of updates sent over the network from the edge to the center.


Figure 4.3: Staleness s is defined as the delay between the end of the window and final results for the window becoming available at the center.

Algorithms for Grouped Aggregation

An aggregation algorithm runs on the edge and takes as input the sequence of arrivals for data records in a given time window [T, T + W). The algorithm produces as output a sequence of updates that are sent to the center. For each distinct key k with n_k arrivals in the time window, suppose that the i-th data record (k, v_{i,k}) arrives at time a_{i,k}, where T ≤ a_{i,k} < T + W and 1 ≤ i ≤ n_k. For each key k, the output of the aggregation algorithm is a sequence of m_k updates, where the i-th update (k, v̂_{i,k}) departs3 for the center at time d_{i,k}, 1 ≤ i ≤ m_k. The updates must have the following properties:

• Each update for each key k aggregates all values for that key in the current time window that have not been previously included in an update.

• Each key k that has n_k > 0 arrivals must have m_k > 0 updates such that d_{m_k,k} ≥ a_{n_k,k}. That is, each key with an arrival must have at least one update, and the last update must depart after the final arrival so that all the values received for the key have been aggregated.

The goal of the aggregation algorithm is to minimize traffic, which is simply the total number of updates, i.e., Σ_k m_k. The other simultaneous goal is to minimize staleness, which is the time for the final update to reach the center, i.e., the time for the update with the largest value of d_{m_k,k} to reach the center.4

3 Upon departure from the edge, an update is handed off to the network for transmission.
4 We implicitly assume a FIFO ordering of data records over the network, as is typically the case with protocols like TCP.
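The following sketch encodes the two properties above, along with the two metrics, for a given update schedule. Transmission is idealized as a fixed per-update delay, which is a simplification of ours; the actual model charges each update against the available edge-to-center bandwidth.

```python
def evaluate_schedule(arrivals, updates, window_end, per_update_delay):
    """arrivals: {key: sorted arrival times}; updates: {key: sorted departure times}.
    Checks the two required properties and returns (traffic, staleness)."""
    for k, arr in arrivals.items():
        deps = updates.get(k, [])
        assert deps, f"key {k} had arrivals but no update"
        assert deps[-1] >= arr[-1], f"last update for key {k} departs before its last arrival"
    traffic = sum(len(deps) for deps in updates.values())
    last_arrival_at_center = max(deps[-1] for deps in updates.values()) + per_update_delay
    staleness = max(0.0, last_arrival_at_center - window_end)
    return traffic, staleness

arrivals = {"a": [1.0, 5.0, 9.5], "b": [2.0]}
updates = {"a": [9.5], "b": [2.0]}      # one flush per key, sent at the key's last arrival
print(evaluate_schedule(arrivals, updates, window_end=10.0, per_update_delay=0.8))
# (2, 0.3): two updates total; the last one reaches the center 0.3 s after the window ends
```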

4.3 Dataset and Workload

To derive a realistic workload for evaluating our aggregation algorithms, we have used anonymized workload traces obtained from a real-world analytics service5 offered by Akamai, a large content delivery network. The download analytics service is used by content providers to track important metrics about who is downloading their content, from where it is being downloaded, what performance the users experienced, how many downloads completed, etc. The data source is a software called Download Manager that is installed on the mobile devices, laptops, and desktops of millions of users around the world. The Download Manager is used to download software updates, security patches, music, games, and other content. The Download Manager instances installed on users' devices around the world send information about downloads to the widely-deployed edge servers using "beacons".6 Each download results in one or more beacons being sent to an edge server containing information pertaining to that download. The beacons contain anonymized information about the time the download was initiated, its url, content size, number of bytes downloaded, user's ip, user's network, user's geography, server's network, and server's geography. Throughout this thesis, we use the anonymized beacon logs from Akamai's download analytics service for the month of December 2010. Note that we normalize derived values from the data set, such as data sizes, traffic sizes, and time durations, for confidentiality reasons.

5 http://www.akamai.com/dl/feature_sheets/Akamai_Download_Analytics.pdf
6 A beacon is simply an http GET issued by the Download Manager for a small GIF containing the reported values in its url query string.

Throughout our evaluation, we compute grouped aggregation for three typical queries in the download analytics service. Queries issued in the download analytics service that use the grouped aggregation primitive can be roughly classified according to query size, which we define to be the number of distinct keys that are possible for that query. We choose three representative queries for different size categories (see Table 4.1). The small query groups by two dimensions, with the key consisting of the tuple of content provider id and the user's last-mile bandwidth classified into four buckets. The medium query groups by three dimensions, with the key consisting of the triple of the content provider id, user's last-mile bandwidth, and the user's country code. The large query groups by a different set of three dimensions, with the key consisting of the triple of the content provider id, the user's country code, and the url accessed. Note that the last dimension, url, can take on hundreds of thousands of distinct values, resulting in a very large query size.

Table 4.1: Queries used throughout this thesis.

  Small  (query size O(10^2) keys):  key = (cpid, bw); value = bytes downloaded (integer sum).
         Total bytes downloaded by content provider by last-mile bandwidth.
  Medium (query size O(10^4) keys):  key = (cpid, bw, country code); value = bytes downloaded (first 5 moments).
         Mean and standard deviation of total bytes per download by content provider by bandwidth by country.
  Large  (query size O(10^6) keys):  key = (cpid, bw, url); value = client ip (HyperLogLog).
         Approximate number of unique clients by content provider by bw by url.

The total arrival rate of data records across all keys for all three queries is the same, since each arriving beacon contributes a data record for each query. However, the three queries have a different distribution of those arrivals across the possible keys, as shown in Figure 4.4(a). Recall that the large query has a large number of possible keys. About 56% of the keys for the large query arrive only once in the trace, whereas the same percentage of keys for the medium and small query is 29% and 15%, respectively. The median arrival rate per key is four (resp., nine) times larger for the medium (resp., small) query in comparison with the large query. Figure 4.4(b) shows the number of unique keys arriving per hour at an edge server for the three queries. The figure shows the hourly and daily variations and also the variation across the three queries. Figure 4.4(c) shows the distribution of variation in interarrival times. Many keys have a coefficient of variation greater than 1, indicating relatively bursty arrivals.

4.4 Minimizing WAN Traffic and Staleness

We now explore how to simultaneously minimize both traffic and staleness. We show that if the entire sequence of updates is known beforehand, then it is indeed possible to simultaneously achieve the optimal value for both traffic and staleness. While this offline solution cannot be implemented in practice, it serves as a baseline to which any online algorithm can be compared. Further, it characterizes the optimal solution that helps us develop the more sophisticated online algorithms that we present in Section 4.5.

Lemma 4.1 (Traffic optimality). In each time window, an algorithm is traffic-optimal iff it flushes exactly one update to the center for each distinct key that arrived in the window.

Proof. Any algorithm must flush at least one update for each distinct key that had arrivals in the time window. Suppose for contradiction that the algorithm flushes more than one update for a key. All flushes except the final one can be omitted, thereby decreasing the traffic, which is a contradiction.

Intuitively, this lemma states that multiple records for each key within a window can be aggregated and sent as a single update to minimize traffic. Note that a pure batching algorithm satisfies the above lemma, and hence is traffic-optimal, but may not be staleness-optimal.

Lemma 4.2 (Staleness optimality). Let the optimal staleness for a time window [T, T + W) be S. For any T ≤ t < T + W, let N(t) be the union of the set of keys that have outstanding updates (those not yet sent to the center) at time t and the set of keys that arrive in [t, T + W). For a staleness-optimal algorithm the following holds:

|N(t)| ≤ ∫_t^{T+W+S} b(τ) dτ,   for all T ≤ t < T + W,   (4.1)

where b(τ) is the instantaneous bandwidth at time τ. Further, if S > 0, there exists a critical time t* such that the above Inequality 4.1 is satisfied with an equality.

[Figure 4.4 plots: (a) CDF of the frequency per key at a single edge for the three queries; (b) the unique key arrival rate for the three queries in a real-world web analytics service, normalized to the maximum rate for the large query; (c) CDF of the coefficient of variation of per-key interarrival times, for keys with at least 25 arrivals.]

Figure 4.4: Akamai Web download service data set characteristics.


Proof. Inequality 4.1 holds since |N(t)| records need to be flushed in the interval [t, T + W) and the maximum number of updates that can be sent before T + W + S is ∫_t^{T+W+S} b(τ) dτ. If S > 0, then let t* be the time of arrival of the last key in the time window. If |N(t*)| < ∫_{t*}^{T+W+S} b(τ) dτ, then for some S' < S, |N(t*)| = ∫_{t*}^{T+W+S'} b(τ) dτ, since there are no new arrivals after t*. Thus, all updates for keys in N(t*) can be transmitted by time T + W + S', decreasing the staleness to S', which is a contradiction. Hence, at time t* Inequality 4.1 is satisfied with an equality.

Intuitively, this lemma specifies that for a given arrival sequence, a staleness-optimal algorithm must send out pending updates to the center at a sufficient rate (dependent on the network bandwidth) to have them reach the center within the minimum feasible staleness bound. Note that a pure streaming algorithm satisfies the above lemma if the network has sufficient capacity to stream all arrivals without causing network queues to fill. It need not, however, satisfy Lemma 4.1 if some of the groups have multiple arrivals within the window, and hence may not be traffic-optimal. We now present optimal offline algorithms that minimize both traffic and staleness, provided the entire sequence of key updates is known to our aggregation algorithm beforehand.

Theorem 4.1 (Eager Optimal Algorithm). There exists an optimal offline algorithm that schedules its flushes eagerly; i.e., it flushes exactly one update for each distinct key immediately after the last arrival for that key within the time window.

Proof. Since the proposed algorithm flushes only a single update for each distinct key that arrived within the window, it is traffic-optimal as per Lemma 4.1. Clearly, any aggregation algorithm must flush an update for a key after the last arrival for that key, since the update must include the data contained in that last arrival in the final aggregate for that window. Suppose there exists a key where the last arrival is at time t but the update to the center is sent at t + δ, for some δ > 0. Modifying that schedule such that the update is flushed at time t instead of t + δ cannot increase staleness. Thus, there exists an eager schedule that achieves the same staleness.

Intuitively, this algorithm is traffic-optimal since it sends only one update per key, and is also staleness-optimal since it sends each update without any additional delay. We call the optimal offline algorithm described above the eager optimal algorithm because it eagerly flushes the update for each distinct key immediately after the final arrival for that key. This eager algorithm is just one way to achieve both minimum traffic and minimum staleness: it may well be possible to delay flushes for some groups and still achieve optimal traffic and staleness. An extreme version of such an algorithm is the lazy optimal algorithm, described below, which flushes updates at the latest possible times that still achieve the optimal staleness.

1. Let keys k_i, 1 ≤ i ≤ n, have their last arrivals at times l_i, 1 ≤ i ≤ n, respectively. Order the n keys that require flushing in the given window in increasing order of their last arrival, i.e., such that l_1 ≤ l_2 ≤ ··· ≤ l_n.

2. Compute the minimum possible staleness S using the eager optimal algorithm.

3. As the base case, schedule the flush for the last key k_n at time S − δ_n (measured relative to the end of the window T + W), where δ_n is the time required to transmit the update for k_n. That is, the last update is scheduled such that it arrives at the center with staleness exactly equal to S.

4. Now, iteratively schedule k_i, assuming all keys k_j, j > i, have already been scheduled. The update for k_i is scheduled at time t_{i+1} − δ_i, where δ_i is the time required to transmit the update for k_i. That is, the update for k_i is scheduled such that the update for k_{i+1} begins immediately after the update for k_i completes.
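For concreteness, the following Python sketch constructs both the eager and the lazy offline schedules. It simplifies the setting above by assuming that every update takes the same fixed transmission time delta and that the minimum staleness S is supplied by the caller (in the construction above it is obtained from the eager schedule); the function names are illustrative, not taken from our simulator.

```python
def eager_schedule(arrivals):
    """Eager offline optimal: flush each key immediately after its last arrival.
    `arrivals` is a list of (time, key) pairs within one window."""
    last_arrival = {}
    for t, k in arrivals:
        last_arrival[k] = max(t, last_arrival.get(k, t))
    # One flush per key, issued at that key's last arrival time.
    return sorted((t, k) for k, t in last_arrival.items())


def lazy_schedule(arrivals, window_end, min_staleness, delta):
    """Lazy offline optimal: defer each flush as late as possible while still
    achieving the minimum staleness computed for the eager schedule. `delta`
    is the (assumed fixed) time to transmit one update."""
    flushes = []
    next_start = window_end + min_staleness  # the last update must finish by then
    for t, k in reversed(eager_schedule(arrivals)):  # schedule k_n, k_{n-1}, ...
        start = next_start - delta  # finish exactly when the next update starts
        flushes.append((start, k))
        next_start = start
    # By construction of S, each start time is no earlier than the key's last arrival.
    return sorted(flushes)
```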

Theorem 4.2 (Lazy Optimal Algorithm). The lazy algorithm above is both traffic- and staleness-optimal.

Proof. By construction.

Further, consider a family of offline algorithms 𝒜, where an algorithm A ∈ 𝒜 schedules its update for key k_i at time t_i such that e_i ≤ t_i ≤ l_i, where e_i and l_i are the update times for key k_i in the eager and lazy schedules, respectively. The following clearly holds.

Theorem 4.3 (Family of Offline Optimal Algorithms). Any algorithm A ∈ 𝒜 is both traffic- and staleness-optimal.

4.5 Practical Online Algorithms

In this section, we explore practical online algorithms for grouped aggregation that strive to minimize both traffic and staleness. To ease the design of such online algorithms, we first frame the edge aggregation problem as an equivalent caching problem. This formulation has two advantages. First, it allows us to decompose the problem into two subproblems: determining the cache size, and defining a cache eviction policy. Second, as we will show, while the first subproblem can be solved by using insights gained from the optimal offline algorithms, the second subproblem lends itself to leveraging the extensive prior work on cache replacement policies [62].

Concretely, we frame the grouped aggregation problem as a caching problem by treating the set of aggregates {(k_i, v̂_i)} maintained at the edge as a cache. A novel aspect of our formulation is that the size of this cache changes dynamically. Concretely, the cache works as follows:

• Cache insertion occurs upon the arrival of a record (k, v). If an aggregate with key k and value v_e exists in the cache (a “cache hit”), the cached value for key k is updated as v ⊕ v_e, where ⊕ is the binary aggregation operator defined in Section 4.2. If no aggregate exists with key k (a “cache miss”), then (k, v) is added to the cache.

• Cache eviction occurs as the result of a cache miss when the cache is already full, or due to a decrease in the cache size. When an aggregate is evicted, it is flushed downstream and cleared from the cache.

Given the above definition of cache mechanics, we can express any grouped aggregation algorithm as an equivalent caching algorithm where the updates flushed by the aggregation algorithm correspond to evictions of the caching algorithm. More formally:

Theorem 4.4. An aggregation algorithm A corresponds to a caching algorithm C such that:

1. At any time step, C maintains a cache size that equals the number of pending aggregates (those not sent to the center yet) for A, and

2. if A flushes an update for a key in a time step, C evicts the same key from its cache in that time step.

Proof. By construction.

Thus, any aggregation algorithm can be viewed as a caching algorithm with two policies: one for cache sizing and the other for cache eviction. In practice, we can define an online caching algorithm by defining online cache size and cache eviction policies. While the cache size policy determines when to flush updates, the cache eviction policy identifies which updates to flush at these times. Here we develop policies by attempting to emulate the behavior of the offline optimal algorithms using online information. We explore such online algorithms and the resulting tradeoffs in the rest of this section.
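To make this two-policy view concrete, the following minimal Python sketch (illustrative names, not the prototype's code) maintains the edge aggregates as a cache with a pluggable eviction policy and an externally supplied, time-varying size function.

```python
class AggregationCache:
    """Edge-side grouped aggregation viewed as a dynamically sized cache.
    `combine` is the binary aggregation operator (e.g., addition for Sum),
    `evict_victim` picks which cached key to flush when space is needed, and
    `size_fn` maps the current time to the allowed number of cached entries."""

    def __init__(self, combine, evict_victim, size_fn, flush):
        self.combine = combine
        self.evict_victim = evict_victim   # e.g., an LRU or LFU policy
        self.size_fn = size_fn             # time -> allowed cache size
        self.flush = flush                 # sends (key, aggregate) downstream
        self.cache = {}

    def on_arrival(self, t, key, value):
        if key in self.cache:                                   # cache hit
            self.cache[key] = self.combine(self.cache[key], value)
        else:                                                   # cache miss
            self.cache[key] = value
        self.enforce_size(t)

    def enforce_size(self, t):
        # Evict (flush downstream) until the size function is respected. A real
        # implementation would also call this as time advances, since a decrease
        # in the size function alone can trigger evictions.
        while len(self.cache) > max(int(self.size_fn(t)), 0):
            victim = self.evict_victim(self.cache)
            self.flush(victim, self.cache.pop(victim))
```

The cache size policies developed in the remainder of this section plug in as size_fn, while eviction rules such as LRU or LFU plug in as evict_victim.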

4.5.1 Simulation Methodology

To evaluate the relative merits of these algorithms, we implement a simple simulator in Python. Our simulator models each algorithm as a function that maps from arrival sequences to update sequences. Traffic is simply the length of the update sequence, while staleness is evaluated by modeling the network as a queueing system with deterministic service times and arrival times determined by the update sequence. Note that we deliberately employ a simplified simulation, as the focus here is not on understanding performance in absolute terms, but rather on comparing the tradeoffs between different algorithms. We use these insights to develop practical algorithms that we implement in Apache Storm and deploy on PlanetLab (Section 4.6). Note that throughout the remainder of this section, we present results for the large query, but similar trends also apply to the small and medium queries.

4.5.2 Emulating the Eager Optimal Algorithm

Cache Size

To emulate the cache size corresponding to that of the eager offline optimal algorithm, we observe that, at any given time instant, an aggregate for key k_i is cached only if, within the window, (i) there has already been an arrival for k_i, and (ii) another arrival for k_i is yet to occur. We attempt to compute the number of such keys using two broad approaches: analytical and empirical.

In our analytical approach, the eager optimal cache size at a time instant can be estimated by computing the expected number of keys at that instant for which the above conditions hold. To compute this value, we model the arrival process of records for each key k_i as a Poisson process with mean arrival rate λ_i. Then the probability p_i(t) that the key k_i should be cached at a time instant t within a window [T, T + W) is given by p_i(t) = (1 − t̂^{Wλ_i})(1 − (1 − t̂)^{Wλ_i}), where t̂ = (t − T)/W.7

We consider two different models to estimate the arrival processes for different keys. The first model is a Uniform analytical model, which assumes that key popularities are uniformly distributed, and each key has the same mean arrival rate λ. Then, if the total number of keys arriving during the window is k, the expected number of cached keys at time t is simply k · (1 − t̂^{Wλ})(1 − (1 − t̂)^{Wλ}).

However, as Figure 4.4(a) in Section 4.3 demonstrated, key popularities in reality may be far from uniform. A more accurate model is the Nonuniform analytical model, which assumes each key k_i has its own mean arrival rate λ_i, so that the expected number of cached keys at time t is given by Σ_{i=1}^{k} p_i(t).

An online algorithm built around these models requires predicting the number of unique keys k arriving during a window as well as their arrival rates λ_i. In our evaluation, we use a simple prediction: assume that the current window resembles the prior window, and derive these parameters from the arrival history in the prior window.

Our empirical approach, referred to as Eager Empirical, also uses the history from the prior window as follows: apply the eager offline optimal algorithm to the arrival sequence from the previous window, and use the resulting cache size at time t − W as the prediction for the cache size at time t.

Figure 4.5(a) plots the predicted cache size using these policies, along with the eager optimal cache size as a baseline. We observe that the Uniform model, unsurprisingly, is less accurate than the Nonuniform model. Specifically, it overestimates the cache size, as it incorrectly assumes that arrivals are uniformly distributed across many keys, rather than focused on a relatively small subset of relatively popular keys. Further, we see that the Eager Empirical model and the Nonuniform model both provide reasonably accurate predictions, but are prone to errors as the arrival rate changes from window to window.

7 Note that Wλ_i > 0 since we are considering only keys with more than 0 arrivals, and that t̂ < 1 since T ≤ t < T + W.
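As a sketch (not the simulator's actual code), the Uniform and Nonuniform analytical estimates above can be evaluated as follows, given per-key rates learned from the prior window; the names are illustrative.

```python
def nonuniform_cache_size(t, T, W, rates):
    """Nonuniform model: expected eager-optimal cache size at time t in
    [T, T + W), with per-key mean arrival rates (e.g., from the prior window).
    Uses p_i(t) = (1 - t_hat**(W*lam)) * (1 - (1 - t_hat)**(W*lam))."""
    t_hat = (t - T) / W
    return sum((1 - t_hat ** (W * lam)) * (1 - (1 - t_hat) ** (W * lam))
               for lam in rates.values() if lam > 0)


def uniform_cache_size(t, T, W, num_keys, total_arrivals):
    """Uniform model: every key shares the same mean rate lam = arrivals / (k * W)."""
    lam = total_arrivals / (num_keys * W)
    t_hat = (t - T) / W
    return num_keys * (1 - t_hat ** (W * lam)) * (1 - (1 - t_hat) ** (W * lam))
```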

Cache Eviction

Having determined the cache size, the next issue is which keys to evict when needed. We know that an optimal algorithm will only evict keys without future arrivals. However, determining such keys accurately requires knowledge of the future. Instead, to implement a practical online policy, we consider two popular practical eviction algorithms—namely, least-recently used (LRU) and least-frequently used (LFU)—and examine their interaction with the above cache size policies.

Figures 4.5(b) and 4.5(c) show the traffic and staleness, respectively, for different combinations of these cache size and cache eviction policies. Here, we simulate the case where network capacity is roughly five times that needed to support the full range of algorithms from pure batching to pure streaming. In these figures, traffic is normalized relative to the traffic generated by an optimal algorithm, while staleness is normalized by the window length.

From these figures, we see that the Eager Empirical and Nonuniform models yield similar traffic, though their staleness varies. It is worth noting that although the difference in staleness appears large in relative terms, the absolute values are still extremely low relative to the window length (less than 0.0015%), and are very close to optimal. We also see that LFU is the more effective eviction policy for this trace.

The more interesting result, however, is that the Uniform model, which produces the worst estimate of cache size, actually yields the best traffic: only about 9.6% higher than optimal, while achieving the same staleness as optimal. The reason is that the more aggressive the cache size policy is in evicting keys prior to the end of the window, the more pressure it places on an imperfect cache eviction algorithm to predict which key is least likely to arrive again. On the other hand, when combined with the most accurate model of eager optimal cache size (Nonuniform), even the best practical eviction policy (LFU) generates 28% more traffic than optimal. This result indicates that leaving more headroom in the cache size (as done by Uniform) provides more robustness to errors by an online cache eviction policy.

(a) Cache sizes for emulations of the eager offline optimal algorithm.
(b) Traffic normalized relative to an optimal algorithm.
(c) Staleness normalized by window length.
Figure 4.5: Eager online algorithms.

4.5.3 Emulating the Lazy Optimal Algorithm

Cache Size

To emulate the lazy optimal offline algorithm (Section 4.4), we estimate the cache size by working backwards from the end of the window, determining how large the cache should be such that it can be drained by the end of the window (or as soon as possible thereafter) by fully utilizing the network capacity. This estimation must account for the fact that new arrivals will still occur during the remainder of the window, and each of those that is a cache miss will lead to an additional update in the future. This leads to a cache size c(t) at time t defined as c(t) = max{b̄ · (T + W − t) − M(t), 0}, where b̄ denotes the average available network bandwidth for the remainder of the window, T + W the end of the time window, and M(t) the total number of cache misses that will occur during the remainder of the window.

Based on the above cache size function, an online algorithm needs to estimate the average bandwidth b̄ and the number of cache misses M(t) for the remainder of the window. We begin by focusing on the estimation of M(t); we consider the bandwidth estimation problem in more detail in Section 4.5.4, and assume perfect knowledge of b̄ here.

To estimate M(t), we consider the following approaches. First, we can use a Pessimistic policy, where we assume that all remaining arrivals in the window will be cache misses. Concretely, we estimate M(t) = ∫_t^{T+W} a(τ) dτ, where a(t) is the arrival rate at time t. In practice, this requires the prediction of the future arrival rate a(t). In our evaluation, we simply assume that the future arrival rate is equal to the average arrival rate so far in the window.

Another alternative is to use an Optimistic policy, which assumes that the current cache miss rate will continue for the remainder of the window. In other words, M(t) = m(t) · ∫_t^{T+W} a(τ) dτ, where m(t) is the miss rate at time t. In our evaluation, we predict the arrival rate in the same manner as for the Pessimistic policy, and we use an exponentially weighted moving average for computing the recent cache miss rate.

A third approach is the Lazy Empirical policy, which is analogous to the Eager Empirical approach. It estimates the cache size by emulating the lazy offline optimal algorithm on the arrivals for the prior window.

Figure 4.6(a) shows the cache size produced by each of these policies. We see that both the Lazy Empirical and Optimistic models closely capture the behavior of the optimal algorithm in dynamically decreasing the cache size near the end of the window. The Pessimistic algorithm, by assuming that all future arrivals will be cache misses, decays the cache size more rapidly than the other algorithms.
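The lazy size estimate and the two miss-count policies can be sketched as follows, assuming the caller supplies the predicted average bandwidth, a predicted arrival rate, and a smoothed recent miss rate (illustrative names):

```python
def lazy_cache_size(t, T, W, bandwidth, predicted_misses):
    """c(t) = max(b_bar * (T + W - t) - M(t), 0): the number of pending
    aggregates that can still be drained by (or just after) the window end."""
    return max(bandwidth * (T + W - t) - predicted_misses, 0.0)


def pessimistic_misses(t, T, W, predicted_arrival_rate):
    """Pessimistic: every remaining arrival in the window is a cache miss."""
    return predicted_arrival_rate * (T + W - t)


def optimistic_misses(t, T, W, predicted_arrival_rate, recent_miss_rate):
    """Optimistic: the current (smoothed) miss rate persists until the window ends."""
    return recent_miss_rate * predicted_arrival_rate * (T + W - t)
```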

Cache Eviction

We explore the same eviction algorithms here, namely LRU and LFU, as we did in Section 4.5.2. Figures 4.6(b) and 4.6(c) show the traffic and staleness, respectively, generated by different combinations of these cache size and cache eviction policies. We see that LFU again slightly outperforms LRU. More importantly, we see that, regardless of which cache size policy we use, these lazy approaches outperform the best online eager algorithm in terms of traffic. Even the worst lazy online algorithm produces traffic less than 4% above optimal.

The results for staleness, however, show a significant difference between the different policies. We see that, by assuming that all future arrivals will be cache misses, the Pessimistic policy leaves enough headroom in its cache size estimate to avoid overloading the network towards the end of the window, leading to low staleness.

Based on the results so far, we see that accurately modeling the optimal cache size does not yield the best results in practice. Instead, our algorithms should be lazy, deferring updates until later in the window, and in choosing how long to defer, they should be pessimistic in their assumptions about future arrivals.

(a) Cache sizes for emulations of the lazy offline optimal algorithm.
(b) Traffic normalized relative to an optimal algorithm.
(c) Staleness normalized by window length.
Figure 4.6: Lazy online algorithms.

4.5.4 The Hybrid Algorithm

In the discussion of the lazy online algorithm, we assumed perfect knowledge of the future network bandwidth b̄. In practice, however, if the actual network capacity turns out to be lower than the predicted value, then network queueing delays may grow, potentially resulting in high staleness.


Figure 4.7: Sensitivity of the hybrid algorithms with a range of α values to overpredicting the available network capacity. Staleness is normalized by window length.

Figure 4.7 shows how staleness increases as the result of overpredicting network capacity. Note that the predicted capacity remains constant, while we vary the actual network capacity. The top-most curve corresponds to a lazy online algorithm (Pessimistic + LFU), which is susceptible to very high staleness if it overpredicts network capacity (up to 9.9% of the window length for 100% overprediction).

To avoid this problem, recall Theorem 4.3, where we observed that the eager and lazy optimal algorithms are merely two extremes in a family of optimal algorithms. Figure 4.8 shows, for three windows, the size of the cache for both the eager and lazy optimal algorithms. This figure shows that the eager algorithm maintains a smaller cache size than the lazy algorithm, as the lazy algorithm retains some keys long after their last arrival. This indicates that there is flexibility in terms of the cache size between these two extremes. Further, our results from Sections 4.5.2 and 4.5.3 showed that it is useful to add headroom to the accurate cache size estimates: towards a larger (resp., smaller) cache size in the case of the eager (resp., lazy) algorithm. These insights indicate that a more effective cache size estimate should lie somewhere between the estimates for the eager and lazy algorithms.

Figure 4.8: Cache size over time for eager and lazy offline optimal algorithms. Sizes are normalized relative to the largest size.

We therefore propose a Hybrid algorithm that computes the cache size as a linear combination of the eager and lazy cache sizes. Concretely, a Hybrid algorithm with a laziness parameter α—denoted by Hybrid(α)—estimates the cache size c(t) at time t as c(t) = α · c_l(t) + (1 − α) · c_e(t), where c_l(t) and c_e(t) are the lazy and eager cache size estimates, respectively. In our evaluation, we use the Nonuniform model for the eager and the Optimistic model for the lazy cache size estimation, respectively, as these most accurately capture the cache sizes of their respective optimal baselines.

Observing Figure 4.7 again, we see that as we decrease the laziness parameter α below about 0.5, and thus use a more eager approach, the risk of bandwidth misprediction is largely mitigated, and staleness remains small even under significant bandwidth overprediction. Note that since the predicted network capacity is constant in this figure, traffic is fixed for each algorithm irrespective of the bandwidth prediction error.

Figure 4.9 shows that as we use a more eager hybrid algorithm, traffic increases. This illustrates a tradeoff between traffic and staleness in terms of achieving robustness to network bandwidth overprediction. A reasonable compromise seems to be a low α value, say 0.25. Using this algorithm, traffic is less than 6.0% above optimal, and even when network capacity is overpredicted by 100%, staleness remains below 0.19% of the window length.

Figure 4.9: Average traffic for hybrid algorithms with several values of the laziness parameter α. Traffic is normalized relative to an optimal algorithm.

Overall, we find that a purely eager online algorithm is susceptible to errors by practical eviction policies, while a purely lazy online algorithm is susceptible to errors in bandwidth prediction. A hybrid algorithm that combines these two approaches provides a good compromise by being more robust to errors in both the arrival process and bandwidth estimation.
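The Hybrid estimate itself is a one-liner on top of the eager and lazy estimates; the sketch below (illustrative names) simply applies the linear combination defined above, for example with the Nonuniform eager and Optimistic lazy estimates.

```python
def hybrid_cache_size(t, alpha, lazy_size_fn, eager_size_fn):
    """Hybrid(alpha): alpha = 0 reduces to the eager estimate, alpha = 1 to the lazy one."""
    return alpha * lazy_size_fn(t) + (1 - alpha) * eager_size_fn(t)

# Example (hypothetical names): the compromise setting used in our experiments.
# size_fn = lambda t: hybrid_cache_size(t, 0.25, optimistic_lazy_size, nonuniform_eager_size)
```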

4.6 Implementation

We demonstrate the practicality of our algorithms and ultimately their performance by implementing them in Apache Storm [58]. Our prototype uses a distinct Storm cluster at each edge, as well as at the center, in order to distribute the work of aggregation. We choose this multi-cluster approach rather than attempting to deploy a single geo-distributed Storm cluster for two main reasons. First, a single global Storm cluster would require a custom task scheduler in order to control task placement. Second, and much more critically, Storm was designed and has been optimized for high performance within a single datacenter; it would not be reasonable to expect it to perform well in a geo-distributed setting characterized by high latency and high degrees of compute and bandwidth heterogeneity.

Figure 4.10 shows the overall architecture, including the edge and center Storm topologies. We briefly discuss each component in the order of data flow from edge to center.


Figure 4.10: Aggregation is distributed over Apache Storm clusters at each edge as well as at the center.

4.6.1 Edge

Data enters our prototype at the edge through the Replayer spout, which replays timestamped logs from a file, and can speed up or slow down log replay to explore different stream arrival rates. Each line is parsed using a query-specific parsing function to produce a (timestamp, key, value) triple. Our implementation supports a broad set of value types and associated aggregations by leveraging Twitter's Algebird library.8 The Replayer emits records according to their timestamps; i.e., event time and processing time [15] are equivalent at the Replayer. The Replayer emits records downstream, and also periodically emits punctuation messages to indicate that no messages with earlier timestamps will be sent in the future.

The next step in the dataflow is the Aggregator, for which one or more tasks run at each cluster. The Aggregator defines window boundaries in terms of record timestamps, and maintains the dynamically sized cache from Section 4.5, with each task aggregating a hash-partitioned subset of the key space. We generalize over a broad range of eviction policies by ordering keys using a priority queue with an efficient changePriority implementation. By defining priority as a function of the key, the value, the existing priority (if any), and the time that the key was last updated in the map, we can capture a broad range of algorithms including LRU and LFU.

8 https://github.com/twitter/algebird

The Aggregator also maintains a cache size function, which maps from time within the window to a cache size. This function can be changed at runtime in order to support implementing arbitrary dynamic sizing policies. This mechanism can be used, for example, to update the cache size function based on arrival rate and miss rate as in our Lazy Pessimistic algorithm, or to record the arrival history for one window and use this history to compute the size function for the next window, as in the Eager Empirical algorithm.9

The Aggregator tasks send their output to a single instance of the Reorderer bolt, which is responsible for delaying records as necessary in order to maintain punctuation semantics. Data then flows into the SocketSender bolt, which transmits partial aggregates to the center using TCP sockets. This SocketSender also maintains an estimate of network bandwidth to the center, and periodically emits these estimates upstream to Aggregator instances for use in defining their cache size functions. Our bandwidth estimation is based on simple measurements of the rate at which messages can be sent over the network. For a more reliable prediction, we could employ lower-level techniques [63], or even external monitoring services [49].
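To illustrate how a single priority function captures both LRU and LFU in the Aggregator's eviction mechanism, here is a small Python sketch; the actual prototype runs on the JVM inside Storm and uses a heap with an efficient changePriority, so the class and names below are illustrative only.

```python
class PriorityEvictionPolicy:
    """Eviction generalized by a priority function:
    priority(key, value, old_priority, last_update_time) -> new priority.
    The lowest-priority key is evicted first. (A linear scan stands in for the
    prototype's priority queue to keep this sketch short.)"""

    def __init__(self, priority_fn):
        self.priority_fn = priority_fn
        self.priorities = {}

    def on_update(self, key, value, now):
        old = self.priorities.get(key)
        self.priorities[key] = self.priority_fn(key, value, old, now)

    def pop_victim(self):
        victim = min(self.priorities, key=self.priorities.get)
        del self.priorities[victim]
        return victim


# LRU: priority is the last-update time; LFU: priority is the update count.
lru = PriorityEvictionPolicy(lambda key, value, old, now: now)
lfu = PriorityEvictionPolicy(lambda key, value, old, now: (old or 0) + 1)
```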

4.6.2 Center

At the center, data follows largely the reverse order. First, the SocketReceiver spout is responsible for deserializing partial aggregates and punctuations and emitting them downstream into a Reorderer, where the streams from multiple edges are synchronized. From there, records flow into the central Aggregator, each task of which is responsible for performing the final aggregation over a hash-partitioned subset of the key space. Upon completing aggregation for a window, these central Aggregator tasks emit summary metrics including traffic and staleness, and these metrics are summarized by the final StatsCollector bolt. Staleness is computed relative to the wall-clock (i.e., processing) time at which the window closes. Clocks are synchronized using NTP.10

9 For our experiments, we use this mechanism to implement a baseline cache size policy that learns the eager optimal eviction schedule after processing the log trace once.
10 Clock synchronization errors are small relative to the staleness ranges we explore in our experiments.

Note that our prototype achieves at-most-once delivery semantics. Storm's acknowledgment mechanisms can be used to implement at-least-once semantics, and exactly-once semantics can be achieved by employing additional checks to filter duplicate updates, though we have not implemented these measures.

4.7 Experimental Evaluation

To evaluate the performance of our algorithms in a real geo-distributed setting, we deploy our Apache Storm architecture on a PlanetLab testbed. Our PlanetLab deployment uses a total of eleven nodes (64 total cores) spanning seven sites. Central aggregation is performed using a Storm cluster at a single node at princeton.edu.11 Edge locations include csuohio.edu, uwaterloo.ca, yale.edu, washington.edu, ucla.edu, and wisc.edu. Bandwidth from edge to center varies from as low as 4.5 Mbps (csuohio.edu) to as high as 150 Mbps (yale.edu), based on iperf. To emulate streaming data, each edge replays a geographic partition of the CDN log data described in Section 4.3.

To explore the performance of our algorithms under a range of workloads, we use the three diverse queries described in Table 4.1, and we replay the logs at both low and high (8x faster than low) rates. Note that for confidentiality purposes, we do not disclose the actual replay rates, and we present staleness and traffic results normalized relative to the window length and optimal traffic, respectively.

4.7.1 Aggregation Using a Single Edge

Although our work is motivated by the general case of multiple edges, our algorithms were developed based on an in-depth study of the interaction between a single edge and the center. We therefore begin by studying the real-world performance of our hybrid algorithm when applied at a single edge. Following the rationale from Section 4.5.4, we choose a laziness parameter of α = 0.25 for this initial experiment, though we will study the tradeoffs of different parameter values shortly.

Compared to the extremes of pure batching and pure streaming, as well as an optimal algorithm based on a priori knowledge of the data stream, our algorithm performs quite well. Figures 4.11(a) and 4.11(b) show that our hybrid algorithm very effectively exploits the opportunity to reduce bandwidth relative to streaming, yielding traffic less than 2% higher than the optimal algorithm. At the same time, our hybrid algorithm is able to reduce staleness by 65% relative to a pure batching algorithm.

11 We originally employed multiple nodes at the center, but were forced to confine our central aggregation to a single node due to PlanetLab's restrictive limit on daily network bandwidth usage, which was quickly exhausted by the communication between Storm workers.

(a) Mean traffic (normalized relative to an optimal algorithm). (b) Median staleness (normalized by window length).

Figure 4.11: Performance for batching, streaming, optimal, and our hybrid algorithm for the large query with a low stream arrival rate using a one-edge Apache Storm deployment on PlanetLab.

4.7.2 Scaling to Multiple Edges

Now, in order to understand how well our algorithm scales beyond a single edge, we partition the log data over three geo-distributed edges. We replay the logs at both low and high rates, and for each of the large, medium, and small queries.12 As Figures 4.12(a) and 4.12(b) demonstrate, our hybrid algorithm performs well throughout. It is worth noting that the edges apply their cache size and cache eviction policies based purely on local information, without knowledge of the decisions made by the other edges, except indirectly via the effect that those decisions have on the available network bandwidth to the edge.

Performance is generally more favorable for our algorithm for the large and medium queries than for the small query. The reason is that, for these larger queries, while edge aggregation reduces communication volume, there is still a great deal of data to transfer from the edges to the center. Staleness is quite sensitive to precisely when these partial aggregates are transferred, and our algorithms work well in scheduling this communication. For the small query, on the other hand, edge aggregation is extremely effective in reducing data volumes, so much so that there is little risk in delaying communication until the end of the window. For queries that aggregate extremely well, batching is a promising algorithm, and we do not necessarily outperform batching. The advantage of our algorithm over batching is therefore its broader applicability: the Hybrid algorithm performs roughly as well as batching for small queries, and significantly outperforms it for large queries.

12 We do not present the results for large-high because the amount of traffic generated in these experiments could not be sustained within the PlanetLab bandwidth limits.

(a) Mean traffic (normalized relative to an optimal algorithm). Normalized traffic values for streaming are truncated, as they range as high as 164. (b) Median staleness (normalized by window length).
Figure 4.12: Performance for batching, streaming, optimal, and our hybrid algorithm for a range of queries and stream arrival rates using a three-edge Apache Storm deployment on PlanetLab.

We continue by further partitioning the log data across a total of six geo-distributed edges. Given the higher aggregate compute and network capacity of this deployment, we focus on the large query at both low and high arrival rates. From Figure 4.13(a), we can again observe that our hybrid algorithm yields near-optimal traffic. We can also observe an important effect of stream arrival rate: all else equal, a high stream arrival rate lends itself to more thorough aggregation at the edge. This is evident in the higher normalized traffic for streaming with the high arrival rate than with the low arrival rate.

In terms of staleness, Figure 4.13(b) shows that our algorithm performs well when the arrival rate is high and the network capacity is relatively constrained. In this case, staleness is more sensitive to the particular scheduling algorithm. When the arrival rate is low, we see that our hybrid algorithm performs slightly worse than batching, though in absolute terms the difference is quite small. Our hybrid algorithm generates higher staleness than streaming, but does so at a much lower traffic cost. Just as with the three-edge case, we again see that, where a large opportunity exists, our algorithm exploits it. Where an extreme algorithm such as batching already suffices, our algorithm remains competitive.

(a) Mean traffic (normalized relative to an optimal algorithm). (b) Median staleness (normalized by window length).
Figure 4.13: Performance for batching, streaming, optimal, and our hybrid algorithm for the large query with low and high stream arrival rates using a six-edge Apache Storm deployment on PlanetLab.

4.7.3 Effect of Laziness Parameter

In Section 4.5.4, we observed that a purely eager algorithm is vulnerable to mispredicting which keys will receive further arrivals, while a purely lazy algorithm is vulnerable to overpredicting network bandwidth. This motivated our hybrid algorithm, which uses a linear combination of eager and lazy cache size functions. We now explore the real-world tradeoffs of using a more or less lazy algorithm by running experiments with the large query at a low replay rate over three edges with laziness parameter α ranging from 0 through 1.0 by steps of 0.25.

As expected based on our simulation results, Figure 4.14(a) shows that α has little effect on traffic when it exceeds about 0.25. Somewhere below this value, the imperfections of practical cache eviction algorithms (LRU in our implementation) begin to manifest. More specifically, at α = 0, the hybrid algorithm reduces to a purely eager algorithm, which makes eviction decisions well ahead of the end of the window, and often chooses the wrong victim. By introducing even a small amount of laziness, say with α = 0.25, this effect is largely mitigated.

Figure 4.14(b) shows the opposite side of this tradeoff: a lazier algorithm runs a higher risk of deferring communication too long, in turn leading to higher staleness. Based on staleness alone, a more eager algorithm is better. Based on the shape of these trends, α = 0.25 appears to be a good compromise value for our experiments.

(a) Mean traffic (normalized relative to an optimal algorithm). (b) Median staleness (normalized by window length).

Figure 4.14: Effect of laziness parameter α using a three-edge Apache Storm deployment on PlanetLab with query large.

4.7.4 Impact of Network Capacity

For our final set of results, we use simulations to fully understand the impact of network capacity on our algorithms. We use a simulation methodology for these results since PlanetLab gives us limited control over varying bandwidth capacities, and also has a maximum daily bandwidth cap. Here, we simulate these algorithms across a wide range of network capacities, ranging from highly constrained (less than 20% greater than optimal traffic) to highly unconstrained (about 5x more than that needed to support pure streaming). Figures 4.15(a) and 4.15(b) show traffic and staleness, respectively, for the four algorithms over this range of network capacities for the large query.

In terms of traffic, we see that our Hybrid(0.25) algorithm comes close to the optimal traffic, especially at higher network capacities. Traffic in the highly constrained regime is slightly worse than optimal, but still provides a significant improvement over that from pure streaming. The reason for this trend is that, as the network becomes more constrained, the envelope between the lazy and eager algorithms shrinks, so that the hybrid algorithm has less room for error. Note that batching is traffic-optimal, as discussed in Section 4.4.

(a) Traffic (normalized relative to an optimal algorithm). (b) Staleness (normalized by window length); y-axis is log-scale.

Figure 4.15: Traffic and staleness for different algorithms over a range of network capacities.

In terms of staleness, we see that streaming is nearly staleness-optimal at high network capacity, yet performs worse than all other alternatives under low network capacity. This is because, under a highly constrained network, excessive traffic from streaming leads to large—even unbounded—network queuing delays. Batching, on the other hand, is the worst alternative at high network capacity, yet yields near-optimal staleness at low network capacity. This is because it always defers communication until the end of the window, which can lead to high delay when the network is not a bottleneck, but on the other hand it prevents queue buildups under severe bandwidth constraints. Our Hybrid algorithm follows the same trend as the optimal, performing close to optimal irrespective of the network capacity.

4.8 Discussion

4.8.1 Compression

Rather than flushing individual key-value pairs, compression could be applied to batches of key-value pairs to reduce WAN traffic. Whether this is worthwhile, however, depends on the speed of compression and decompression relative to the WAN bandwidth, as well as the extent to which compression, transmission, and decompression can be pipelined. For example, if compression or decompression throughput is lower than the WAN bandwidth, then compression will become a bottleneck and only increase staleness. In the wide-area settings motivating this work, however, WAN bandwidth is likely to remain the bottleneck.

Further, if compression is only effective over very large batches, then it may lead to higher rather than lower staleness. More specifically, as we increase the size of the batches to which we apply compression, the compression ratio increases, but so does the additional batching delay. In practice, however, it is possible to achieve a good compression ratio over reasonably small batches, and these batches can be compressed and decompressed quickly.

For example, we apply Google's Snappy compression [64] to our anonymized Akamai trace at several granularities, and show the results in Figure 4.16. We run each experiment five times and show the mean, with error bars indicating 95% confidence intervals for latency (compression ratio does not vary from run to run). We see that it is possible to achieve a high compression ratio over batches small enough to be compressed with sub-millisecond latency. In other words, compression can be effective in reducing WAN traffic even when applied at a fine enough granularity to avoid significantly increasing staleness. Incorporating compression into our techniques is therefore quite promising, though implementing it in our experimental prototype remains an item for future work.
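The measurement behind Figure 4.16 can be approximated in a few lines of Python, assuming the python-snappy binding is installed; this is a sketch of the methodology rather than the exact script we used.

```python
import time
import snappy  # python-snappy binding (assumed available)

def measure_chunk(chunk: bytes):
    """Compression ratio and compression latency for one chunk."""
    start = time.perf_counter()
    compressed = snappy.compress(chunk)
    return len(chunk) / len(compressed), time.perf_counter() - start

def measure_trace(trace: bytes, chunk_size: int):
    """Mean compression ratio and worst per-chunk latency at a given granularity."""
    chunks = [trace[i:i + chunk_size] for i in range(0, len(trace), chunk_size)]
    ratios, latencies = zip(*(measure_chunk(c) for c in chunks))
    return sum(ratios) / len(ratios), max(latencies)
```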


Figure 4.16: Compression ratio and latency for Google Snappy compression applied at several granularities to our anonymized Akamai trace.

4.8.2 Variable-Size Aggregates

We have assumed that the time to perform the binary aggregation operation is small relative to the time to transmit an aggregate over the WAN. This is a reasonable assumption in practice. Many aggregation operators (e.g., Sum, Max, Average) involve simple, lightweight computation. For more complex aggregations such as HyperLogLog, we assume that sufficient edge resources are provisioned so that computation is not the bottleneck. For situations where this is not the case, it is necessary to apply load-shedding techniques at the edge, though we do not study such techniques in this thesis.

We have also assumed that aggregate values have constant size. In practice, this is true for many aggregates; e.g., a Sum implemented using an 8-byte long integer has the same size regardless of its value. Even for many sophisticated aggregates, this assumption is reasonable, though there are opportunities for optimization. For example, consider an approximate set cardinality counter implemented using a HyperLogLog sketch [65]. The size of a HyperLogLog instance is dictated not by the set that it approximates, but by the error-tolerance parameter specified at its construction, which dictates the size of the underlying array of “registers”. For large sets, this fixed-size representation is much smaller than an exact representation, but for small sets, the exact representation may be more concise.13


Figure 4.17: Size of exact and approximate set representations with and without compression.

13 At intermediate cardinalities, a sparse representation of the register array may be most efficient.

As a concrete example, Figure 4.17 shows the space required to represent a set of random IP addresses using both exact and approximate representations with and without (Google Snappy) compression. Across a range of cardinalities, the approximate representation requires nearly constant space. Compression is useful for the approximate representation, especially at low cardinalities where the register array is sparse. At high cardinalities, the utility of the HyperLogLog is obvious: even an uncompressed representation is orders of magnitude more concise than an exact representation. At low cardinalities, however, the opposite is true: an exact representation is smaller than even a compressed approximate representation.

In practice, choosing the best representation based on the given value is a powerful optimization. Although we have not explored this optimization in our simulation or experiments, it is fundamentally compatible with our algorithms. We expect practical implementations to choose the most concise representation for a given aggregation (e.g., an exact set at low cardinalities, and a compressed HyperLogLog at high cardinalities). We then view our dynamic sizing policy as reflecting not the number, but rather the total size, of aggregates to allow in the cache. The eviction policy remains responsible for identifying which key is most likely to have attained its final aggregate value, a decision that is independent of the representation of the value for that key. This optimization allows more keys to remain cached at the edge at any given time, reducing the pressure to evict early. This reduces the rate of premature flushes, leading to lower traffic and staleness.
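A sketch of this per-value representation choice, treating the sketch object opaquely and using a generic serialization; the helper names are illustrative, and a real implementation would use the system's own wire format.

```python
import pickle

def serialized_size(obj) -> int:
    """Stand-in for the prototype's wire-format size of an aggregate."""
    return len(pickle.dumps(obj))

def choose_representation(exact_set, sketch):
    """Flush whichever representation is smaller on the wire: the exact set
    (concise at low cardinality) or the fixed-size sketch, e.g., a HyperLogLog
    register array (concise at high cardinality)."""
    exact_bytes, sketch_bytes = serialized_size(exact_set), serialized_size(sketch)
    if exact_bytes <= sketch_bytes:
        return "exact", exact_bytes
    return "approximate", sketch_bytes
```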

4.8.3 Applicability to Other Environments

While our focus is on geo-distributed stream analytics, our algorithms are applicable to other environments such as single data center or cluster environments. For instance, these algorithms could be used to achieve the desired metrics if an intra-data center network is the bottleneck, e.g., for communication across racks. Our algorithms could also be applicable to scheduling I/O to other devices (e.g., scheduling writes to disks) if they become the bottleneck instead of the network.

Although our work is motivated by the typical hub-and-spoke topology, our techniques are not confined to this model. They can also be applied to minimize traffic and staleness for deployments with multiple central sites, or in hierarchical topologies.

4.9 Related Work

Aggregation

Aggregation is a key operator in analytics, and grouped aggregation is supported by many data-parallel programming models [59, 66, 67]. Larson et al. [68] explore the benefits of performing partial aggregation prior to a join operation, much as we do prior to network transmission. While they also recognize similarities to caching, they consider only a fixed-size cache, whereas our approach uses a dynamically varying cache size. In sensor networks, aggregation is often performed over a hierarchical topology to improve energy efficiency and network longevity [69, 70], whereas we focus on cost (traffic) and timeliness (staleness). Amur et al. [71] study grouped aggregation, focusing on the design and implementation of efficient data structures for batch and streaming computation. They discuss tradeoffs between eager and lazy aggregation, but do not consider the effect on staleness, a key performance metric in our work.

Streaming Systems

Numerous streaming systems [58, 60, 72, 73, 74, 75, 76, 77] have been proposed over the years. These systems provide many useful ideas for new analytics systems to build upon, but they do not fully explore the challenges addressed here, in particular how to achieve timely results (low staleness) at low cost (low traffic).

Google recently proposed the Dataflow model [15] as a unified abstraction for computing over both bounded and unbounded datasets. Our work fits within this model. We focus in particular on aggregation over tumbling windows on unbounded data streams.

Wide-Area Computing

Wide-area computing has received increased research attention in recent years, due in part to the widening gap between data processing and communication costs. Much of this attention has been paid to batch computing [8, 20, 78]. Relatively little work on streaming computation [76] has focused on wide-area deployments, or associated questions such as where to place computation. Pietzuch et al. [79] optimize operator placement in geo-distributed settings to balance between system-level bandwidth usage and latency. Hwang et al. [80] rely on replication across the wide area in order to achieve fault tolerance and reduce straggler effects. JetStream [12] considers wide-area streaming computation, but unlike our work, assumes that it is always better to push more computation to the edge.

Optimization Tradeoffs

LazyBase [81] provides a mechanism to trade increased staleness for faster query response in the case of ad-hoc queries. BlinkDB [82] and JetStream [12] provide mechanisms to trade accuracy for response time and bandwidth utilization, respectively. We focus on jointly optimizing both network traffic and staleness. Das et al. [83] consider tradeoffs between throughput and latency in Spark Streaming, but they consider only a uniform batching interval for the entire stream, while we address scheduling on a per-key basis.

4.10 Concluding Remarks

In this chapter, we focused on optimizing the important primitive of windowed grouped aggregation in a geo-distributed setting by minimizing two key metrics: WAN traffic and staleness. We presented a family of optimal offline algorithms that jointly minimize both staleness and traffic. Using this as a foundation, we developed practical online aggregation algorithms based on the observation that grouped aggregation can be modeled as a caching problem where the cache size varies over time. We explored online algorithms ranging from eager to lazy in terms of how early they send out updates. We found that a hybrid online algorithm works best in practice, as it is robust to a wide range of network constraints and estimation errors.

We demonstrated the practicality of our algorithms through an implementation in Apache Storm, deployed on the PlanetLab testbed. The results of our experiments, driven by workloads derived from anonymized traces of Akamai's download analytics service, showed that our online aggregation algorithms perform close to the optimal algorithms for a variety of system configurations, stream arrival rates, and query types.

Chapter 5

Trading Timeliness and Accuracy in Geo-Distributed Streaming Analytics

5.1 Introduction

In Chapter 4, we examined questions of where to compute and when to communicate in the context of geo-distributed streaming analytics. We presented the design of algorithms for performing windowed grouped aggregation in order to optimize the two key metrics of WAN traffic and staleness. In doing so, we assumed that (a) applications required exact results, and (b) resources—WAN bandwidth in particular—were sufficient to deliver exact results.

In general, however, these assumptions do not always hold. For one, it is not always feasible to compute exact results with bounded staleness [12]. Further, many real-world applications can tolerate some staleness or inaccuracy in their final results, albeit with diverse preferences. For instance, a network administrator may need to be alerted to potential network overloads quickly (within a few seconds or minutes) even if there is some degree of error in the results describing network load. On the other hand, a Web analyst might have only a small tolerance for error (say, <1%) in the application statistics (e.g., number of page hits) but be willing to wait for some time to obtain these results with the desired accuracy.

In this chapter, we study the staleness-error tradeoff, recognizing that applications have diverse requirements: some may tolerate higher staleness in order to achieve lower error, and vice versa. We devise both theoretically optimal offline algorithms as well as practical online algorithms to solve two complementary problems: minimize staleness under an error constraint, and minimize error under a staleness constraint. Our algorithms enable geo-distributed streaming analytics systems to support a diverse range of application requirements, whether WAN capacity is plentiful or highly constrained.

Research Contributions

• We study the tradeoff between staleness and error (measures of timeliness and accuracy, respectively) in a geo-distributed stream analytics setting.

• We present optimal offline algorithms that allow us to optimize staleness (resp., error) under an error (resp., staleness) constraint.

• Using these offline algorithms as references, we present practical online algorithms to efficiently trade off staleness and error. These practical algorithms are based on the key insight of representing grouped aggregation at the edge as a two-part cache. This formulation generalizes our caching-based framework for exact windowed grouped aggregation (Chapter 4) by introducing cache partitioning policies to identify which partial results must be sent and which can be discarded.

• We demonstrate the practicality and efficacy of these algorithms through both trace-driven simulations and implementation in Apache Storm [58], deployed on a PlanetLab [14] testbed. Using workloads derived from traces of a popular web analytics service offered by Akamai [9], a large commercial content delivery network, our experiments show that our algorithms reduce staleness by 81.8% to 96.6%, and error by 83.4% to 99.1%, compared to a practical random sampling/batching-based aggregation algorithm.

• We demonstrate that our techniques apply across a diverse set of aggregates, from distributive and algebraic aggregates [66] such as Sum and Max to holistic aggregates such as unique count (via HyperLogLog [65]).

5.2 Problem Formulation

5.2.1 Problem Statement

We consider the same hub-and-spoke system model described in Section 4.2. Additionally, we begin with the same notion of windowed grouped aggregation. The primary difference in this chapter, however, is that we admit approximation.

Approximate Windowed Grouped Aggregation

An aggregation algorithm runs on the edge and takes as input the sequence of arrivals of data records in a given time window [T, T + W). The algorithm produces as output a sequence of updates that are sent to the center.

For each distinct key k with n_k > 0 arrivals in the time window, suppose that the i-th data record (k, v_{i,k}) arrives at time a_{i,k}, where T ≤ a_{i,k} < T + W and 1 ≤ i ≤ n_k. For each such key k, the output of the aggregation algorithm is a sequence of m_k updates, where 0 ≤ m_k ≤ n_k. The j-th update (k, v̂_{j,k}) departs for the center at time d_{j,k}, where 1 ≤ j ≤ m_k. This update aggregates all values for key k that have arrived but have not yet been included in an update.

The final, possibly approximate, aggregated value for key k is then given by V̂_k = v̂_{1,k} ⊕ v̂_{2,k} ⊕ ··· ⊕ v̂_{m_k,k}. Approximation arises in two cases. The first is when, for a key k with n_k > 0 arrivals, no update is flushed; i.e., m_k = 0. This leads to an omission error for key k. The second is when at least one record arrives for key k after the final update is flushed; i.e., when d_{m_k,k} < a_{n_k,k}. This leads to a residual error for key k.

Optimization Metrics

As in Chapter 4, staleness is defined as the smallest time interval s such that the results of grouped aggregation for the time window [T,T + W ) are available at the center at time T + W + s.

Over a window, we define the per-key error e_k for a key k to be the difference between the final aggregated value V̂_k and its true aggregated value V_k: e_k = error(V̂_k, V_k), where error is application-defined.1 Error is then defined as the maximum error over all keys: Error = max_k e_k. Error arises when an algorithm flushes updates for some keys prior to receiving all arrivals for those keys, or when it omits flushes for some keys altogether. As we show in the next section, staleness and error are fundamentally opposing requirements. The goal of an aggregation algorithm is therefore either to minimize staleness given an error constraint, or to minimize error given a staleness constraint.

1 Our work applies to both absolute and relative error definitions.
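For a Sum aggregate with an absolute error function, the window error can be computed as in the following sketch; the error function is application-defined as noted above, and the names are illustrative.

```python
def window_error(true_totals, delivered_totals,
                 error=lambda approx, exact: abs(approx - exact)):
    """Error = max over keys of error(V_hat_k, V_k) for one window. Keys with no
    delivered update contribute an omission error relative to a logical zero;
    keys with a partial aggregate contribute a residual error."""
    worst = 0.0
    for key, exact in true_totals.items():
        approx = delivered_totals.get(key, 0)  # omission: nothing reached the center
        worst = max(worst, error(approx, exact))
    return worst
```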

5.2.2 The Staleness-Error Tradeoff

Mechanics of the Tradeoff

To understand the mechanics behind the staleness-error tradeoff, it is useful to consider what causes staleness and error in the first place. Recall that staleness for a given time window is defined as the delay between the end of that window and the time at which the last update for that window reaches the center.2 Error is caused by any updates that are not delivered to the center, either because the edge sends only partial aggregates, or because the edge omits some aggregates altogether.

Intuitively, the two main causes of high staleness are either using too much network bandwidth (causing delays due to network congestion), or delaying transmissions until the end of the window. Thus, we can reduce staleness by avoiding some updates altogether (to avoid network congestion), and sending updates earlier during the window. Unfortunately, both of these options for reducing staleness lead directly to sources of error. First, if no update is ever sent for a given key, then the center never receives any aggregate value for this key, leading to an omission error for the key. Second, if the last update for a key is scheduled prior to the final arrival for that key, then the center will see only a partial aggregate, leading to a residual error for that key. Thus, we see that there is a fundamental tradeoff between achieving low staleness and low error.

As a more concrete example, Figure 5.1 shows the optimal tradeoff between staleness and error for a single window using a Sum aggregation over our real-world CDN workload. (Sections 4.3 and 5.3 describe the details of the dataset and algorithms, respectively.) We see that the tradeoff is especially pronounced at low WAN bandwidth, where applications must be error-tolerant in order to maintain bounded staleness.

2 Wide-area latency contributes to this delay, but this component of delay is a function of the underlying network infrastructure and is not something we can directly control through our algorithms. We therefore focus our attention on network delays due to wide-area bandwidth constraints.


Figure 5.1: The staleness-error tradeoff curve for a Sum aggregation over a single time window. Note that we normalize staleness relative to the window length and error relative to the maximum possible error for the window.

Challenges in Optimizing Along the Tradeoff Curve

To understand the challenges of optimizing for either of these metrics while bounding the other, consider two alternate approaches to grouped aggregation: streaming, which immediately sends updates to the center without any aggregation at the edge; and batching, which aggregates all data during a time window at the edge and only flushes updates to the center at the end of the window.

When bandwidth is sufficiently high, streaming can deliver extremely low staleness and high accuracy, as arrivals are flushed to the center without additional delay, and all updates reach the center quickly. When bandwidth is constrained, however, it can lead to both high staleness and error. This is because it fails to take advantage of edge resources to reduce the volume of traffic flowing across the wide-area network, leading to high congestion and unbounded network delays and in turn unbounded staleness.

Batching, on the other hand, leverages computation at edge resources to reduce network bandwidth consumption. When the end of the window arrives, a batching algorithm has the final values for every key on hand, and it can prioritize important updates in order to reach an error constraint quickly, or to reduce error rapidly until reaching a staleness constraint. On the other hand, batching introduces delays by deferring all flushes until the end of the window, leading to high staleness for any given error.

One alternate approach is a random sampling algorithm that prioritizes important transmissions after the end of the window, but sends a subset of aggregate values selected randomly during the window. This approach improves over batching by sending updates earlier during the window, thus reducing staleness. It also improves over streaming under bandwidth constraints by reducing the network traffic. As we show in Sections 5.3 and 5.4, however, this approach can still yield high error and staleness due to lack of prioritization among different updates during the window. To satisfy our optimization goals, a more principled approach is required.

5.3 Offline Algorithms

We now consider the two complementary optimization problems of minimizing staleness (resp., error) under an error (resp., staleness) constraint. Before we present practical online algorithms to solve these problems, we consider optimal offline algorithms; i.e., algorithms that have perfect knowledge of all arrivals, even those in the future. Although these offline algorithms cannot be applied directly in practice, they serve both as baselines for evaluating the effectiveness of our online algorithms, and also as design inspiration, helping us to identify heuristics that practical online algorithms might employ in order to emulate optimal algorithms.

5.3.1 Minimizing Staleness (Error-Bound)

The first optimization problem we consider is minimizing staleness under an error constraint (error ≤ E), where E is an application-specified error tolerance value. In this case, the goal is to flush only as many updates as is strictly required, and to flush each of these updates as early as possible such that the error constraint is satisfied. Intuitively, an offline optimal algorithm achieves this by flushing an update for each key as soon as the aggregate value for that key falls within the error constraint.

Throughout this section, consider the time window of length W beginning at time T, let ⊕ denote our binary aggregation function, and let n denote the number of unique keys arriving during the current window. Define the prefix aggregate Vk(t) for key k at time t to be the aggregate value of all arrivals for key k during the current window prior to time t. We define the prefix aggregate Vk(t) to have a logical zero value prior to the first arrival for key k. Let error(x̂, x) denote the error of the aggregate value x̂ with respect to the true value x. Further, define the prefix error ek(t) for key k at time t to be the error of the prefix aggregate for key k with respect to the true final aggregate: ek(t) = error(Vk(t), Vk(T + W)), for T ≤ t < T + W. We refer to the prefix error of key k at the beginning of the window—ek(T)—as the initial prefix error of key k.
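To make these definitions concrete, the following Python sketch computes prefix aggregates and prefix errors for a Sum aggregation from a list of timestamped arrivals. It is illustrative only; the helper names (prefix_aggregates, prefix_error) are ours for exposition and are not part of our prototype.

    # Illustrative only: prefix aggregates Vk(t) and prefix errors ek(t) for a
    # Sum aggregation, computed offline from (time, key, value) arrivals in [T, T + W).
    def prefix_aggregates(arrivals, aggregate=lambda a, b: a + b, zero=0):
        prefixes = {}
        for t, key, value in sorted(arrivals, key=lambda a: a[0]):
            history = prefixes.setdefault(key, [])
            prev = history[-1][1] if history else zero
            history.append((t, aggregate(prev, value)))
        return prefixes  # key -> [(arrival time, prefix aggregate after that arrival)]

    def prefix_error(prefixes, key, t,
                     error=lambda approx, true: abs(true - approx), zero=0):
        history = prefixes[key]
        final = history[-1][1]            # Vk(T + W), the true final aggregate
        current = zero                    # logical zero before the first arrival
        for arrival_time, value in history:
            if arrival_time >= t:
                break
            current = value               # Vk(t): prefix aggregate prior to time t
        return error(current, final)      # ek(t)

With error defined as the absolute difference, prefix_error(prefixes, k, T) yields the initial prefix error ek(T).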

Definition 5.1 (Eager Prefix Error). Given an error constraint E, the Eager Prefix Error (EPE) algorithm flushes each key k at the first time t such that ek(t) ≤ E. If ek(T) ≤ E, then EPE avoids flushing key k altogether.

Theorem 5.1. The EPE algorithm is optimal.

Proof. Such an algorithm satisfies the error constraint by construction, and because flushes are issued as early as possible for each key, and only if strictly necessary, there cannot exist another schedule that achieves lower staleness.

Corollary 5.1. The EPE algorithm is traffic-optimal for any error bound E.

The reason is that it flushes only those keys k with ek(T) > E.

Corollary 5.2. When E = 0, the EPE algorithm achieves the optimal staleness for exact computation.

This is because, when E = 0, this algorithm flushes an update for each key upon the final arrival for that key.3 It is therefore strictly a generalization of the eager offline optimal algorithm for exact windowed grouped aggregation (Section 4.4), which is staleness-optimal.
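The EPE algorithm can be summarized in a few lines of code. The sketch below assumes the prefix_aggregates helper from the earlier sketch and an error tolerance E ≥ 0; it is illustrative rather than a description of our implementation.

    # Illustrative EPE schedule: for each key, flush at the first time its prefix
    # aggregate is within E of its final value; omit keys whose initial prefix
    # error is already within E.
    def epe_schedule(prefixes, E, error=lambda approx, true: abs(true - approx), zero=0):
        schedule = {}
        for key, history in prefixes.items():
            final = history[-1][1]
            if error(zero, final) <= E:
                schedule[key] = None          # ek(T) <= E: never flush this key
                continue
            for t, value in history:
                if error(value, final) <= E:  # first time the prefix error is within E
                    schedule[key] = t
                    break
        return schedule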

5.3.2 Minimizing Error (Staleness-Bound)

Next, we consider the optimization problem of minimizing error under a staleness constraint (staleness ≤ S), where S is an application-specified staleness tolerance (or deadline). To minimize error under a staleness constraint, we abstract the wide-area network as a sequence of contiguous slots, each representing the ability to transmit a single update.

3 The flush can occur prior to the final arrival if the arrivals for key k end with a sequence of one or more zeros.

The duration of a single slot in seconds is then 1/b, where b represents the available network bandwidth in updates per second.4 If S denotes the staleness constraint in seconds, then the final slot ends S seconds after the end of the window. Figure 5.2 illustrates this model for the time window [T, T + W).

[Figure 5.2 here: a timeline from T (window begins) through T + s (previous window's results available) and T + W (window ends) to T + W + S (deadline), divided into n contiguous network slots.]

Figure 5.2: We view the network as a sequence of contiguous slots. Shaded slots are unavailable to the current window due to the previous window’s staleness.

Note that there is no reason to flush a key more than once during any given window, as any updates prior to the last could simply be aggregated into the final update before it is flushed. Given this fact, we can focus on scheduling flushes for the n unique keys that arrive during the window by assigning each into one of n slots. Note that flushes from the previous window occupy the network for the first s seconds of the current window, where s ≤ S is the staleness of the previous window. These slots are therefore unavailable to the current window, and assigning a key to such a slot has the effect of sending no value for that key. In general, the first slot of the current window may begin prior to this time, or in fact prior to the beginning of the current window.

To illustrate how we assign keys to slots, we first introduce the notion of potential error.5 The potential error Ek(t) for key k at time t is defined to be the error that the center would see for key k if key k were assigned to a slot beginning at time t. Recall that, for t < T + s, slots are unavailable, as the network is still in use transmitting updates from the previous window. Assigning a key to such a slot therefore has the effect of sending no value for that key. Similarly, if a key is assigned to a slot prior to that key's first arrival, there is no value to send, so making such an assignment is equivalent to sending no value for that key. We refer to either of these cases as omitting a key. Overall then, potential error is given by

4 In general, bandwidth need not be constant.
5 Throughout this section, we discuss error with respect to the window [T, T + W).

    Ek(t) = ek(T)  if t < T + s,
    Ek(t) = ek(t)  otherwise,

where ek(t) is the prefix error as defined in Section 5.3.1.

To begin, we assume a monotonicity property: the potential error Ek(t) of a key k at any time t is no larger than its potential error Ek(t′) at a prior time t′ < t within the window, i.e., Ek(t) ≤ Ek(t′). Intuitively, we assume that the aggregate value for each key only gets closer to its true value with additional arrivals. (We relax this assumption shortly.)

Definition 5.2 (Smallest Potential Error First). The Smallest Potential Error First (SPEF) algorithm iterates over the n slots beginning with the earliest, assigning to each slot a key with the smallest potential error that has not yet been assigned to a slot.
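A direct transcription of SPEF into code is equally short. In the sketch below, slot_times are the start times of the n slots and potential_error(key, t) computes Ek(t); both are assumed to be provided by the caller, and the names are illustrative.

    # Illustrative SPEF assignment: fill slots from earliest to latest, each time
    # choosing an unassigned key with the smallest potential error.
    def spef_assign(keys, slot_times, potential_error):
        unassigned = set(keys)
        assignment = {}
        for t in sorted(slot_times):
            if not unassigned:
                break
            best = min(unassigned, key=lambda k: potential_error(k, t))
            assignment[best] = t
            unassigned.remove(best)
        return assignment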

Lemma 5.1. A schedule D which assigns keys i and j at times ti and tj such that ti < tj and Ei(ti) ≤ Ej(ti) (i.e., in SPEF order) cannot have higher error than a schedule D′ that swaps these keys.

Proof. For each key 1 ≤ k ≤ n, let tk be the (only) time at which key k is assigned. The final error is then given by E = max1≤k≤n Ek(tk).

Schedule D assigns keys i and j at times ti and tj respectively, where ti < tj, and Ei(ti) ≤ Ej(ti). The final error E is then bounded by E ≥ max(Ei(ti), Ej(tj)). Alternatively, schedule D′ swaps these keys, assigning key i at time tj, and key j at time ti. This would yield error E′ ≥ max(Ej(ti), Ei(tj)).

[Figure 5.3 here: the four errors Ei(ti), Ei(tj), Ej(ti), and Ej(tj), with edges indicating which pairs are ordered by the SPEF assumption and which by monotonicity.]

Figure 5.3: Smallest-potential-error-first ordering minimizes error.

Figure 5.3 illustrates the relationships between these different errors. The non-shaded vertices contribute to E, while the shaded vertices contribute to the alternative error E′. To see why E′ ≥ E, first consider E′. By monotonicity and SPEF ordering, we know that Ej(ti) ≥ Ei(tj), so it is key j that contributes the greater error to E′, and we can describe E′ more simply by E′ ≥ Ej(ti). This error cannot be lower than E because, by SPEF ordering, Ej(ti) is no smaller than Ei(ti), and by monotonicity, Ej(ti) is no smaller than Ej(tj). In other words, whether key i or key j contributes more to the final error in the SPEF-ordered schedule, E′ ≥ E.

Lemma 5.2. A schedule D that assigns only m < n keys can be transformed into another schedule D′ that assigns n keys without increasing staleness or error.

Proof. D omits n − m keys. Inserting these n − m keys into the slots left vacant by D yields D′. This transformation cannot increase error because, by monotonicity, sending a key never yields higher error than omitting it does. Further, it cannot increase staleness due to our slot construction.

Theorem 5.2. The SPEF algorithm is optimal.

Proof. The SPEF algorithm satisfies the staleness constraint due to our definition of slots. To see why it minimizes error, let DSPEF denote a sequence of flushes in smallest-potential-error-first order. Let Dopt denote an optimal sequence of flushes sharing the longest common prefix with DSPEF. Then there exists an m such that Dopt and DSPEF differ for the first time at index m. If DSPEF and Dopt are not already identical, then m < n where n is the number of unique keys during the window, and hence the length of DSPEF.

Assume that Dopt also has length n. (If not, then transform according to Lemma 5.2.)

There must then exist an index p with m < p ≤ n such that Em(tm) ≥ Ep(tm). In other words, in slot m, Dopt emits a key that does not have the smallest potential error. By Lemma 5.1, we can transform Dopt into D′opt by swapping keys m and p without increasing error or staleness. In particular, if p is chosen so that key p has the smallest potential error at time tm among the keys Dopt has not yet assigned, then the transformed schedule is still optimal yet shares a longer common prefix with DSPEF, contradicting our choice of Dopt. The SPEF schedule must therefore itself be optimal.

Corollary 5.3. Let Sopt denote the optimal staleness for exact computation. When the staleness constraint S ≥ Sopt, the SPEF algorithm achieves zero error.

Intuitively, this is because for an optimal staleness bound Sopt (or higher), by definition, the first of the n slots for updates begins late enough that there always exists a key that has attained its final value. The SPEF algorithm then always selects one of these exact keys to be flushed, and hence achieves zero error. At S = Sopt, the SPEF algorithm in fact reduces to the lazy optimal algorithm for exact computation for an optimal staleness bound (Section 4.4).

An Alternate Optimal Algorithm

Recall that assigning a key to a slot prior to the first arrival for that key, or before the network is available, is equivalent to sending no value for that key; i.e., omitting that key. By studying how the SPEF algorithm schedules such omissions, we can derive an alternative optimal algorithm that is more amenable to emulation in an online setting.

Definition 5.3 (SPEF with Early Omissions). Let m be the number of keys that the SPEF algorithm omits. Then the SPEF with Early Omissions (SPEF-EO) algorithm first omits the m keys with the smallest initial prefix errors, then assigns the remaining n − m keys in SPEF order.

Lemma 5.3. A key omitted by the SPEF algorithm has, at the time it is omitted, the smallest initial prefix error among all not-yet-assigned keys.

Proof. A key is omitted when it is assigned to a slot prior to the first arrival for that key, or before the network is available. In either case, its potential error is equal to its initial prefix error. SPEF assigns keys in smallest-potential-error-first (SPEF) order, so at the time that key i is omitted, its initial prefix error must be smaller than the potential error of all other not-yet-assigned keys. By monotonicity, potential error never exceeds initial prefix error, so key i must also have smaller initial prefix error than all other not-yet-assigned keys.

Note that this is a necessary but insufficient condition for a key to be omitted: another key may have a smaller potential error than the smallest initial prefix error.

Theorem 5.3. The SPEF-EO algorithm is optimal.

Proof. By Lemma 5.3, the final key omitted by the SPEF algorithm contributes the greatest error of all prior omissions, and this error is no less than the mth-smallest initial prefix error. Error therefore cannot be reduced by flushing (i.e., not omitting) any of the m keys with the smallest initial prefix errors. The remaining n − m slots occur no earlier than the n − m slots carrying non-zero updates in the original algorithm. By monotonicity, assigning the remaining n − m keys to these slots therefore cannot increase error.
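The SPEF-EO variant is equally simple to express. The sketch below assumes m (the number of omissions made by plain SPEF), initial_prefix_error(key), and potential_error(key, t) are given; it is illustrative only.

    # Illustrative SPEF-EO assignment: omit the m keys with the smallest initial
    # prefix errors up front, then assign the rest to the later slots in SPEF order.
    def spef_eo_assign(keys, slot_times, m, initial_prefix_error, potential_error):
        ranked = sorted(keys, key=initial_prefix_error)
        omitted, kept = ranked[:m], set(ranked[m:])
        assignment = {k: None for k in omitted}      # None marks an omitted key
        for t in sorted(slot_times)[m:]:             # the remaining n - m slots
            if not kept:
                break
            best = min(kept, key=lambda k: potential_error(k, t))
            assignment[best] = t
            kept.remove(best)
        return assignment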

Relaxing the Monotonicity Assumption

Up to this point, we have assumed that prefix error decreases monotonically in time, but this assumption can be relaxed in a straightforward manner. Consider an aggregation for which prefix error is non-monotonic. Let the minimum potential error for key i at time t be the smallest potential error achieved up until time t during the current window. This value decreases monotonically whether or not the underlying error is monotonic. Therefore the following must hold.

Corollary 5.4. If flushed updates always contain the value that achieves the minimum potential error, then assigning to each slot the key with the smallest minimum potential error is an optimal algorithm.

5.3.3 High-Level Lessons

These offline optimal algorithms, although not applicable in practical online settings, leave us with several high-level lessons. First, they flush each key at most once, thereby avoiding wasting scarce network resources. Second, they make the best possible use of network resources. In the error-bound case, EPE achieves this by sending only keys that have already satisfied the error constraint. For the staleness-bound case, with SPEF and SPEF-EO, this means using each unit of network capacity (i.e., each slot) to send the most up-to-date value for the key with the minimum potential error, and for SPEF-EO, sending only those keys with the largest initial prefix errors.

5.4 Online Algorithms

We are now prepared to consider practical online algorithms for achieving near-optimal staleness-error tradeoffs. The offline optimal algorithms from Section 5.3 serve as useful baselines for comparison, and they also provide models to emulate.

To evaluate alternative design decisions and to compare our proposed algorithms to a number of baseline algorithms, we reuse the simulation methodology introduced in Section 4.5. Simulation allows us to more rapidly explore a large design space, to compare against an offline optimal algorithm, and to select a practical baseline approach to use in our experiments later. Throughout this chapter, simulation results present median values (staleness or error) over 25 consecutive time windows from the Akamai trace, spanning multiple days' worth of workload. We normalize error by the largest possible error for our trace, and we normalize staleness by the window length.

5.4.1 The Two-Part Cache Abstraction

Exact Computation

Our online algorithms for approximate windowed grouped aggregation generalize those for exact computation (Chapter 4), where we view the edge as a cache of aggregates and emulate offline optimal algorithms through cache sizing and eviction policies. When a record arrives at the edge, it is inserted into the cache, and its value is merged via aggregation with any existing aggregate sharing the same key. The sizing policy, by dictating how many aggregates may reside in the cache, determines when aggregates are evicted. For exact computation, an ideal sizing policy allows aggregates to remain in the cache until they have reached their final value, while avoiding holding them so long as to lead to high staleness. The eviction policy determines which key to evict, and an ideal policy for exact computation selects keys with no future arrivals. Upon eviction, keys are enqueued to be transmitted over the WAN, which we assume services this queue in FIFO order.

Approximate Computation

For approximate windowed grouped aggregation, we continue to view the edge as a cache, but we now partition it into primary and secondary caches, as illustrated in Figure 5.4.

[Figure 5.4 here: arrivals enter a two-part cache that is dynamically partitioned into primary and secondary parts; updates flow to the center only from the primary part.]

Figure 5.4: The cache is dynamically partitioned into primary and secondary parts. Updates are flushed only from the primary cache.

The reason for this distinction is that, when approximation is allowed, it is no longer necessary to flush updates for all keys. An online algorithm must determine not just when to flush each key, but also which keys to flush. The distinction between the two caches serves to answer the latter question: updates are flushed only from the primary cache. It is the role of the cache partitioning policy to define the boundary between the primary and secondary cache, and the main difference between our error-bound and staleness-bound online algorithms lies in the partitioning policy. As we discuss throughout this section, our error-bound algorithm defines this boundary based on the values of items in the cache, while the staleness-bound algorithm uses a dynamic sizing policy to determine the size of the primary cache.

In addition, the primary cache logically serves as the outgoing network queue: updates are flushed from this cache when network capacity is available. Unlike FIFO queuing, this ensures that our online algorithms make the most effective possible use of network resources. In particular, it ensures that flushed updates always reflect the most up-to-date aggregate value for each key. Additionally, it allows us to use our eviction policies to choose the most valuable key to occupy the network at any time.

5.4.2 Error-Bound Algorithms

Cache Partitioning Policy

Our online error-bound algorithm uses a value-based cache partitioning policy, which defines the boundary between primary and secondary caches in terms of aggregate values. It is inspired by the offline optimal EPE algorithm (Section 5.3.1) that only flushes keys whose prefix error is within the error bound. Specifically, new arrivals are first added to the secondary cache, and they are promoted into the primary cache only when their accumulated error grows to exceed the error constraint. More rigorously, let Fk denote the total aggregate value flushed for key k so far during the current window (logically zero if no update has been flushed), and let Vk denote the aggregate value currently maintained in the cache for key k. Then the accumulated error for key k is defined as error(Fk, Fk ⊕ Vk); i.e., the error between the value that the center currently knows and the value it would see if key k were flushed.6 Key k is moved from the secondary cache to the primary cache when its accumulated error exceeds E.

Given this policy, and the fact that updates are only flushed from the primary cache, we are guaranteed to flush only keys that strictly must be flushed in order to satisfy the error constraint; i.e., this approach avoids false positives. On the other hand, this approach only eventually avoids false negatives. That is, a key that will eventually need to be flushed may remain in the secondary cache beyond the time at which its prefix error falls below the error bound E, since the final value for a key is unknown in an online setting. A more eager variant would promote a key into the primary cache when it is probable—though not necessarily certain—that it will require flushing. We do not explore such alternatives in depth here, however, as our simulation results have shown that these transient false negatives contribute only a small amount to staleness.

After the end of the window, our online algorithm flushes the entire contents of the primary cache, as all of its constituent keys exhibit sufficient accumulated error to violate the error constraint if not flushed.7

6 Note we can compute this value online.
7 The order of these flushes is unimportant, as all of them are required to bring error below E.

Cache Eviction Policy

Prior to the end of the window, some prediction is involved: When the network pulls an update from the primary cache, the eviction policy must identify a key that has reached an aggregate value within E of its final value. We explore three practical cache replacement policies: least-recently used (LRU), smallest potential error (SPE), and most accumulated error (MAE). Intuitively, LRU assumes that the key with the least recent arrival is also the least likely to receive future arrivals, and hence closest to its final value. Therefore LRU evicts the key with the least recent arrival. SPE tries to faithfully emulate the offline optimal algorithm by explicitly predicting the final value for each key, and it uses this prediction to estimate the potential error of each key, ultimately evicting the key with the smallest estimated potential error. The MAE policy is greedy in nature: it evicts the key that will make the largest reduction in error; i.e., the key with the largest accumulated error currently.

Figure 5.5 compares these simple and practical eviction policies for a Sum aggregation. We find that LRU and SPE perform about equally and both slightly outperform MAE, especially at lower error limits. LRU may be preferable to SPE in practice because of its simplicity.
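The following sketch illustrates how the pieces of the error-bound algorithm fit together for a Sum aggregation: value-based partitioning by accumulated error, with LRU eviction from the primary cache. The class is a simplification for exposition, not the data structure used in our Storm prototype.

    # Illustrative error-bound two-part cache for Sum, with LRU primary eviction.
    from collections import OrderedDict

    class ErrorBoundCache:
        def __init__(self, error_limit):
            self.E = error_limit
            self.flushed = {}                 # Fk: value already sent to the center
            self.secondary = {}               # keys with accumulated error <= E
            self.primary = OrderedDict()      # keys with accumulated error > E (LRU order)

        def insert(self, key, value):
            if key in self.primary:
                self.primary[key] += value
                self.primary.move_to_end(key)              # refresh recency
            else:
                self.secondary[key] = self.secondary.get(key, 0) + value
                if abs(self.secondary[key]) > self.E:      # accumulated error exceeds E
                    self.primary[key] = self.secondary.pop(key)

        def evict(self):
            # Called when network capacity is available: flush the least recently
            # updated primary key, sending its full, up-to-date aggregate.
            if not self.primary:
                return None
            key, delta = self.primary.popitem(last=False)
            self.flushed[key] = self.flushed.get(key, 0) + delta
            return key, self.flushed[key]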

[Figure 5.5 here: normalized staleness (log scale) versus normalized error limit for the LRU, SPE, and MAE eviction policies.]

Figure 5.5: Impact of primary cache eviction policy on staleness given an error bound. Note the logarithmic y-axis scale.

The main point of these results is that it is not necessary to develop a sophisticated prediction policy to yield low staleness under an error constraint. Instead, even simple and practical policies are very effective.

5.4.3 Staleness-Bound Algorithms

An online algorithm that aims to minimize error under a staleness constraint faces slightly different challenges. In particular, we can no longer define the boundary between primary and secondary caches by value, as we do not know a priori what this value should be. Instead, our staleness-bound online algorithm uses a dynamic sizing-based cache partitioning policy to emulate the offline optimal SPEF-EO algorithm (Section 5.3.2).

To understand this approach, recall from Definition 5.3 that the SPEF-EO algorithm flushes only the keys with the largest initial prefix errors. Given that the role of the primary cache is to contain the keys that must be flushed, we emulate this behavior by 1) dynamically ranking cached keys by their (estimated) initial prefix error, and then 2) defining the primary cache at time t to comprise the top σ(t) keys in this order. In practice, this is challenging as we must predict both initial prefix errors as well as an appropriate sizing function σ(t).

Initial Prefix Error Prediction

During the window, there are many ways to predict initial prefix errors. One straightforward approach is to use accumulated error (Section 5.4.2) as a proxy for initial prefix error. We call this the Acc policy for short. Intuitively, this policy assumes that the keys that have accumulated the most error so far also have the largest initial prefix errors, and therefore ranking keys by accumulated error is a good approximation for ranking them by initial prefix error. An alternative approach—which we simply call Explicit—attempts to explicitly predict initial prefix errors by predicting the final value for each key.

Figure 5.6 compares these approaches for Sum aggregation, and shows that the very simple Acc policy typically yields lower error than a more complex approach does. This again demonstrates that sophisticated prediction approaches are not required in order to yield good performance with our two-part caching approach; even very simple and practical policies can be effective.

[Figure 5.6 here: normalized error (log scale) versus normalized staleness limit (log scale) for the Acc and Explicit prediction policies.]

Figure 5.6: Impact of the initial prefix error prediction algorithm on error given a staleness bound. Note the logarithmic axis scales.

Cache Size Prediction

Predicting the appropriate primary cache size function σ(t) raises several additional challenges. To begin, consider how to define this function in an ideal world where we have a perfect eviction policy and a perfect prediction of initial prefix error. At time t, the primary cache size σ(t) represents the number of keys that are present in the primary cache. These σ(t) keys will need to be flushed prior to the staleness bound, as will any not-yet-arrived keys with sufficiently high initial prefix error. Let f(t) denote the number of future arrivals of these large-prefix-error keys. Then, if the total number of network slots remaining until the staleness constraint is given by B(t), an ideal primary cache size function σ(t) satisfies B(t) = σ(t) + f(t).

In practice, we have neither a perfect eviction policy, nor a perfect prediction of initial prefix error, and determining f(t) is a nontrivial prediction on its own. We have explored a host of alternative approaches, and we highlight three here: PrevWindow, Linear, and PurePess.

The PrevWindow policy assumes that arrivals in the current window will resemble those in the previous window, and it therefore uses the previous window's arrival sequence to compute f(t). Using an exponentially weighted moving average of the available WAN bandwidth to predict B(t), this f(t) is then used to compute the appropriate primary cache size function σ(t).

The Linear policy makes the simplifying assumption that keys with the largest initial prefix errors have initial arrivals distributed uniformly throughout the window. In other words, if we assume k such keys will arrive during the window, then for the time window [T, T + W), we assume f(t) decays linearly as f(t) = k(T + W − t)/W. We estimate k based on the predicted number of slots available for the window; i.e., k = B(T).

The PurePess policy makes a pessimistic assumption about future arrivals, specifically that all future arrivals are for keys that should occupy the primary cache. We monitor the arrival rate of new keys using an exponentially weighted moving average, and base f(t) on an extrapolation of this average rate. As a result, when bandwidth is highly constrained, the primary cache is effectively nonexistent, and this policy degrades to evicting keys in order of estimated initial prefix error.

Figure 5.7 compares these policies for Sum aggregation, and shows that the PrevWindow policy nearly universally outperforms the others. PurePess also performs reasonably well, but is slightly worse than PrevWindow. The Linear policy performs poorly at all but the highest staleness constraints.
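As one concrete example, the sketch below computes σ(t) under the Linear policy from an estimate of the available bandwidth (in updates per second). The function names are illustrative, and the bandwidth estimate is assumed to be supplied by the caller.

    # Illustrative Linear sizing policy: B(t) = sigma(t) + f(t), with f(t)
    # decaying linearly from k = B(T) at the window start to zero at its end.
    def remaining_slots(t, T, W, S, bandwidth):
        """B(t): slots remaining between time t and the deadline T + W + S."""
        return max(0.0, (T + W + S - t) * bandwidth)

    def linear_future_arrivals(t, T, W, k):
        """f(t): assumed linear decay of future large-prefix-error arrivals."""
        if t >= T + W:
            return 0.0
        return k * (T + W - t) / W

    def primary_cache_size(t, T, W, S, bandwidth):
        """sigma(t) = B(t) - f(t), clamped at zero."""
        k = remaining_slots(T, T, W, S, bandwidth)   # k = B(T)
        return max(0, int(remaining_slots(t, T, W, S, bandwidth)
                          - linear_future_arrivals(t, T, W, k)))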

[Figure 5.7 here: normalized error (log scale) versus normalized staleness limit (log scale) for the PrevWindow, PurePess, and Linear sizing policies.]

Figure 5.7: The impact of primary cache sizing policy on error given staleness bound. Note the logarithmic axis scales.

Cache Eviction Policy

During the window, we use the sizing-based cache partitioning policy to delineate the boundary between primary and secondary caches. When network capacity is available, it remains the role of the cache eviction policy to determine which key should be evicted from the primary cache. As in Section 5.4.2, we again find that simple, well-known policies such as LRU perform well.

Upon reaching the end of the window, there is no need for prediction since all final aggregate values are known. At this point, keys are evicted from the union of the primary and secondary caches in descending order by their accumulated error until reaching the staleness bound.
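For concreteness, the end-of-window behavior can be expressed as a short loop; the sketch below assumes a map from keys to accumulated errors, a send routine, and a deadline given as an absolute timestamp. It is illustrative only.

    # Illustrative end-of-window flush under a staleness bound: send keys in
    # descending order of accumulated error until the deadline passes.
    import time

    def flush_until_deadline(accumulated_error, send, deadline):
        for key in sorted(accumulated_error, key=accumulated_error.get, reverse=True):
            if time.time() >= deadline:
                break                 # staleness bound reached; omit remaining keys
            send(key)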

5.4.4 Implementation

We implement our algorithms in Apache Storm [58], extending our prototype from Chapter 4 for exact computation. Here we highlight the changes from our earlier prototype. At the edge, the Replayer spout is unaltered, while the Aggregator bolt is extended to implement our two-part cache. This two-part cache is implemented using efficient priority queues, allowing us to generalize over a range of cache partitioning and eviction policies in order to implement our online algorithms. At the center, Aggregator bolts are extended to emit additional summary metrics describing error.

Our online algorithms treat the two-part cache as the outgoing network queue, but Storm does not implement flow control at the bolt level, and the Storm programming model only allows bolts to push tuples downstream. To resolve this apparent mismatch, we track queuing delays within both the SocketSender and the SocketReceiver. These delays are communicated upstream, and the edge Aggregators use them as congestion signals, applying an additive-increase / multiplicative-decrease (AIMD) approach to dynamically adjust the rate at which they push records downstream.
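The congestion response can be sketched as a small rate controller; the constants below (delay threshold, step sizes) are illustrative placeholders rather than the values used in our prototype.

    # Illustrative AIMD rate control driven by downstream queuing-delay reports.
    class AimdRateController:
        def __init__(self, initial_rate=1000.0, increase=100.0, decrease=0.5,
                     delay_threshold_ms=50.0, min_rate=10.0):
            self.rate = initial_rate              # records pushed per second
            self.increase = increase              # additive increase step
            self.decrease = decrease              # multiplicative decrease factor
            self.delay_threshold_ms = delay_threshold_ms
            self.min_rate = min_rate

        def on_delay_report(self, queuing_delay_ms):
            # Treat high downstream queuing delay as a congestion signal.
            if queuing_delay_ms > self.delay_threshold_ms:
                self.rate = max(self.min_rate, self.rate * self.decrease)
            else:
                self.rate += self.increase
            return self.rate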

5.5 Evaluation

In this section, we evaluate our online algorithms using both a trace-driven simulation (Section 5.5.3) as well as experiments using our Storm-based implementation on PlanetLab [84] and local testbeds (Section 5.5.5).

5.5.1 Setup

Throughout our experiments, we use two combinations of PlanetLab nodes at the University of Oregon as well as local nodes at the University of Minnesota. The Oregon site includes three nodes, each with 4 GB of physical memory and four CPU cores. The Minnesota site includes up to eight nodes, each with 12 GB of physical memory and eight CPU cores. WAN bandwidth from Minnesota to Oregon averages 12.5 Mbps based on iperf3 measurements.

Throughout our evaluation, we use the anonymized Akamai trace (Section 4.3) as input, and we consider several diverse aggregations: Sum, Max, Count, and HyperLogLog [65]. Each of these aggregations uses the large query key. The Sum aggregation computes the total number of bytes successfully downloaded for each key, while the Max aggregation computes the largest successful download size for each key. The Count aggregation counts the number of arrivals for each key, and the HyperLogLog aggregation is used to approximate the number of unique client IP addresses for each key. This demonstrates an important point: our techniques apply to a broad range of aggregations, from distributive and algebraic [66] aggregates such as Sum, Max, or Average to holistic aggregates such as unique count via HyperLogLog.8

To explore how our algorithms perform under varied WAN bandwidth, we consider low, medium, and high bandwidths, the details of which differ slightly between simulations and experiments. In our simulations, medium and high bandwidth denote three and ten times higher than low bandwidth, respectively. In experiments, low and high bandwidth denote 60% and 160% of the medium bandwidth, respectively.

5.5.2 Comparison Algorithms

We compare the following aggregation algorithms.

Baselines

Streaming: A streaming algorithm performs no aggregation at the edge, and instead simply flushes each key immediately upon arrival; that is, it streams arrivals directly on to the center. An error-bound variant continues streaming arrivals until the error constraint E is eventually satisfied, while a staleness-bound variant continues streaming until reaching the staleness limit S.

8 We adopt the common approach [85] of approximating holistic aggregates (e.g., unique count, Top-K, heavy hitters) using sketch data structures (e.g., HyperLogLog, CountMinSketch [86]).

Batching: A batching algorithm aggregates arrivals until the end of the window without sending any updates. At the end of the window at time T + W, an error-bound variant flushes all keys for which omitting an update would exceed the error bound; i.e., all keys with initial prefix error greater than E. A staleness-bound variant begins flushing keys at the end of the window, and does so in descending order of initial prefix error, stopping upon reaching the staleness limit S.

Batching with Random Early Updates (Random): The Random algorithm effectively combines batching with streaming using random sampling. Concretely, the Random algorithm aggregates arrivals at the edge as batching does, but it sends updates for random keys during the window whenever network capacity is available. When the end of the window arrives, it flushes keys as batching does, in decreasing order by their impact on error. This algorithm is a useful baseline to show that the choice of which key to flush during the window is critical.
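For reference, the Random baseline reduces to two small routines; the sketch below assumes a Sum aggregation with a map from keys to unflushed deltas, a send routine, and (for the staleness-bound variant) a count of remaining network slots. It is illustrative only.

    # Illustrative Random baseline (Sum aggregation): random early updates during
    # the window, then an end-of-window flush in decreasing order of impact on error.
    import random

    def random_early_update(cache, send):
        """Called during the window whenever a network slot becomes available."""
        if cache:
            key = random.choice(list(cache))
            send(key, cache.pop(key))

    def end_of_window_flush(cache, send, slots_remaining):
        ordered = sorted(cache, key=lambda k: abs(cache[k]), reverse=True)
        for key in ordered[:slots_remaining]:
            send(key, cache.pop(key))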

Offline Optimal

To provide an optimal baseline for evaluating our online algorithms, we implement the EPE and SPEF-EO algorithms from Section 5.3 for the error-bound and staleness-bound optimizations, respectively. Note that these are offline algorithms and are therefore confined to simulation.

Online

These refer to our caching-based online algorithms presented in Section 5.4. For the error-bound case, we use the emulated EPE algorithm, consisting of the two-part caching approach with an LRU primary cache eviction policy, as discussed in Section 5.4.2. For the staleness-bound case, we employ the emulated SPEF-EO algorithm discussed in Section 5.4.3, consisting of LRU eviction, Accumulated Error-based estimation, and PrevWindow cache sizing policies.

5.5.3 Simulation Results: Minimizing Staleness (Error-Bound)

Before presenting experimental results on our PlanetLab deployment, we use the simulation framework introduced in Section 4.5.1 to compare our online algorithms to several baseline algorithms. We first compare our caching-based online algorithm against the baseline algorithms for error-constrained applications, where the objective is to minimize staleness for a given error constraint.

Various Aggregations

Figure 5.8 shows staleness (normalized relative to the window length) for a range of error bounds E (normalized relative to the largest possible error for our trace) for our online algorithm, as well as for our three baseline algorithms and an optimal offline algorithm. It shows the comparison for three different aggregation functions under low bandwidth: Sum, Max, and Count.

For all aggregations, it is clear that streaming is infeasible, as staleness far exceeds the window length (shown as the dotted horizontal line). By performing aggregation at the edge, batching significantly improves upon streaming, but it still yields high staleness because it delays all flushes until the end of the window. Random improves over batching, but the improvement is minor for Sum and Max, illustrating the limitation of our random sampling approach. Our caching-based online algorithm, on the other hand, chooses which keys to flush during the window in a principled manner, and yields staleness much closer to an offline optimal algorithm as a result.

It is interesting to contrast Sum and Max aggregations with the Count aggregation. For Count, we see that Batching performs reasonably well, while Random offers a significant improvement. Both the offline optimal and our caching algorithm achieve near-zero staleness across a broad range of error constraints. The reason for the difference is that, with a Count aggregation, every arrival has the same impact on error (i.e., an increment to the count) whereas for Sum and Max, arrivals can have an arbitrarily large impact. For Count, therefore, it is somewhat easier to predict which keys are closer to their final values and as a result are safe to flush.

[Figure 5.8 here: panels (a) Sum, (b) Max, and (c) Count aggregations, each plotting normalized staleness (log scale) against the normalized error limit for the Optimal, Batching, Random, Streaming, and Caching algorithms.]

Figure 5.8: Normalized staleness for various aggregations and error bounds. Note the logarithmic y-axis scales.

Various Bandwidth Constraints

We next examine the impact of WAN bandwidth constraints. Specifically, we run our algorithms under medium and high bandwidth (3 and 10 times higher bandwidth than the low bandwidth used for Figure 5.8, respectively). Figure 5.9 shows normalized staleness for a range of error bounds for Sum aggregation under medium and high bandwidth.

[Figure 5.9 here: panels (a) medium bandwidth and (b) high bandwidth, each plotting normalized staleness (log scale) against the normalized error limit for the Optimal, Batching, Random, Streaming, and Caching algorithms.]

Figure 5.9: Normalized staleness for Sum aggregation with various error bounds under medium and high bandwidth. Note the logarithmic y-axis scales.

As we might expect, we see that staleness for any given error constraint decreases as bandwidth increases, and this trend holds for all algorithms. At medium bandwidth (Figure 5.9(a)), our caching algorithm continues to outperform the baseline algorithms, and approaches the performance of an offline optimal algorithm. When bandwidth is very high (Figure 5.9(b)), we see that streaming performs very well, as it flushes updates without delay, and incurs no network queuing delays in the process, as described in Section 5.2.2. In this case, our online caching-based algorithm also achieves very low staleness, demonstrating its applicability across a broad range of WAN bandwidths.

5.5.4 Simulation Results: Minimizing Error (Staleness-Bound)

Various Aggregations

Figure 5.10 demonstrates the effectiveness of our approach compared to the other baselines for the case where we are minimizing the error given a staleness limit. The figure shows the normalized error values obtained for different staleness limits. We again see that streaming is impractical, as it fails to effectively use the limited WAN capacity to transmit the most important updates. Batching yields a significant improvement over streaming, as long as the tolerance for staleness is sufficiently high. Batching with random early updates represents a further improvement, though again, except for Count aggregation, the benefit of these early updates is not dramatic. Our caching-based online algorithm, on the other hand, by making principled decisions about which keys to flush during the window, yields near-optimal error across a broad range of staleness limits.

Note that all algorithms except streaming converge to the same error as the staleness bound reaches one window length. The reason is that, as the staleness bound approaches the window length, there are fewer and fewer opportunities to flush updates during the window, and these algorithms converge toward pure batching.

Various Bandwidth Constraints

Figure 5.11 shows the same algorithms applied at medium and high bandwidth for the Sum aggregation, where the same general observations hold. Echoing the results for our error-bound algorithms in Figure 5.9, Figure 5.11 shows that our caching-based algorithm outperforms the other baselines whether or not bandwidth is constrained, and all algorithms except Batching perform well under very high bandwidth.

5.5.5 Experimental Results

To demonstrate how our techniques perform in a real-world setting, we now present results from an experimental evaluation conducted on PlanetLab [84] using our Storm-based implementation. Staleness and error are measured at the center, and we treat the staleness constraint as a hard deadline: when the deadline is reached, we compute the error based on records that have arrived at the center up to that point. In general, there may be arrivals after the deadline due to varying WAN delays, but we disregard these late arrivals as they violate the staleness constraint. Based on our results from Subsection 5.5.3, we use Random as a practical baseline algorithm for comparison, since it outperforms both batching and streaming.

[Figure 5.10 here: panels (a) Sum, (b) Max, and (c) Count aggregations, each plotting normalized error (log scale) against the normalized staleness limit for the Optimal, Batching, Random, Streaming, and Caching algorithms.]

Figure 5.10: Normalized error for various aggregations and staleness bounds. Note the logarithmic axis scales.

[Figure 5.11 here: panels (a) medium bandwidth and (b) high bandwidth, each plotting normalized error (log scale) against the normalized staleness limit for the Optimal, Batching, Random, Streaming, and Caching algorithms.]

Figure 5.11: Normalized error for Sum aggregation with various staleness bounds under medium and high bandwidth. Note the logarithmic y-axis scale.

Geo-Distributed Aggregation

In order to demonstrate our techniques in a real geo-distributed setting, we begin with experiments using Minnesota nodes as the edge and Oregon nodes as the center. Note that we use local Minnesota nodes rather than PlanetLab nodes as the edge because PlanetLab imposes restrictive daily limits on the amount of data sent from each node, rendering repeated experiments with PlanetLab edge nodes impractical.9

Figure 5.12(a) and Figure 5.12(b) show staleness and error for error-constrained and staleness-constrained algorithms, respectively. Both staleness and error are normalized relative to the Random baseline for ease of interpretation. We consider the median staleness or error during a relatively stable portion of the workload, and plot the mean over five runs. Error bars show 95% confidence intervals [87].

Figure 5.12 demonstrates that our Caching algorithms significantly outperform a practical random sampling/batching-based baseline in a real-world geo-distributed deployment. Staleness is reduced by 81.8% to 96.6% under an error constraint, while error is reduced by 83.4% to 99.1% under a staleness constraint. Note that the reduction is largest for Max aggregation, as many keys attain their final value well in advance of the end of the window, allowing greater opportunity for early evictions. These results demonstrate that our techniques perform well in a real geo-distributed deployment over real data, and they do so for a diverse set of aggregations.

9 For example, each one of the five repetitions of the HyperLogLog experiments was enough to exhaust the daily bandwidth limit.

[Figure 5.12 here: bar charts of normalized staleness (left) and normalized error (right), on log scales, for the Random and Caching algorithms over Sum, Max, and HyperLogLog aggregations.]

(a) Staleness under an error constraint. (b) Error under a staleness constraint.

Figure 5.12: Comparison of Random and Caching algorithms for various aggregation functions on a PlanetLab testbed. Note the logarithmic y-axis scale.

Various Error and Staleness Constraints

We further study the performance of our algorithms under various error and staleness constraints. Due to PlanetLab bandwidth limitations, we use an alternative deployment with eight total Minnesota nodes: four as the edge and four as the center, using Linux traffic control (tc) to emulate constrained WAN bandwidth between the edge and center clusters. Figure 5.13 shows normalized staleness and error for the Sum aggregation. Likewise, Figure 5.14 shows normalized staleness and error for the Max aggregation.

[Figure 5.13 here: normalized staleness versus normalized error limit (left) and normalized error versus normalized staleness limit (right), on log scales, for the Random and Caching algorithms.]

(a) Staleness under various error constraints. (b) Error under various staleness constraints.

Figure 5.13: Comparison of Random and Caching algorithms for Sum aggregation with various error and staleness constraints on a local testbed. Note the logarithmic y-axis scale.

We observe that our Caching algorithms outperform the Random baseline across all error and staleness constraints used in our experiments. For the most part, as the error (respectively, staleness) limit increases, the staleness (respectively, error) decreases, demonstrating the fundamental staleness-error tradeoff that underlies our algorithm development.

[Figure 5.14 here: normalized staleness versus normalized error limit (left) and normalized error versus normalized staleness limit (right), on log scales, for the Random and Caching algorithms.]

(a) Staleness under various error constraints. (b) Error under various staleness constraints.

Figure 5.14: Comparison of Random and Caching algorithms for Max aggregation with various error and staleness constraints on a local testbed. Note the logarithmic y-axis scale.

One interesting divergence from this pattern is the Max aggregation under a staleness constraint, where Figure 5.14(b) shows that, although our Caching algorithm significantly outperforms the Random baseline, it does not show a significant decrease in error as the staleness limit increases. One possible explanation is that, since many keys reach their final value well in advance of the end of the window, our Caching algorithm performs well even when the staleness limit is low and it is forced to make many early evictions. Further study of this behavior may help us develop more effective partitioning and eviction policies.

Dynamic Workload and WAN Bandwidth

To dig deeper, we assess how our algorithms perform under dynamic workload and WAN bandwidth. We use tc to control the emulated WAN bandwidth between the edge and center clusters. By changing the emulated WAN bandwidth at runtime, we can study how our algorithms adapt to dynamic WAN bandwidth. Further, our anonymized Akamai trace exhibits a significant shift in workload early in the trace, providing a natural opportunity to study performance under workload variations.

Figure 5.15 shows how our staleness-bound and error-bound algorithms behave under these workload and bandwidth changes for the Sum aggregation. To illustrate the effect of a bandwidth decrease, Figure 5.15(a) plots a time series of staleness normalized relative to the window length, while Figure 5.15(b) plots error normalized relative to the largest possible error. Similarly, Figure 5.15(c) and Figure 5.15(d) plot staleness and error, respectively, under increasing bandwidth. For each of these figures, we run the experiment five times, and plot the median staleness or error for each window.

[Figure 5.15(a,b) here: time series over window numbers of normalized staleness and normalized error for the Random and Caching algorithms under a bandwidth decrease.]

(a) Bandwidth decrease: Staleness under an error constraint. (b) Bandwidth decrease: Error under a staleness constraint.

[Figure 5.15(c,d) here: time series over window numbers of normalized staleness and normalized error for the Random and Caching algorithms under a bandwidth increase.]

(c) Bandwidth increase: Staleness under an error constraint. (d) Bandwidth increase: Error under a staleness constraint.

Figure 5.15: Comparison of Random and Caching algorithms for Sum aggregation under dynamic workload and WAN bandwidth on a local testbed. Arrival rate increases during window 6 (dashed vertical line), and available bandwidth changes during window 20 (dotted vertical line).

We see that both Random and Caching algorithms require some time to adapt to the workload change that occurs during window 6. Once stabilized, Caching yields significantly lower staleness and error than Random. During window 20, we change the (emulated) WAN bandwidth. In Figures 5.15(a) and 5.15(b), bandwidth is reduced by 62.5%, and while staleness and error increase for both algorithms, our Caching algorithms continue to significantly outperform the Random baseline. In Figures 5.15(c) and 5.15(d), bandwidth increases by 167% during window 20, and we again observe that our Caching algorithms adapt to the change and continue to outperform the Random baseline.

Various Bandwidth Constraints

To better understand how our algorithms perform under varied WAN bandwidth, we emulate low, medium and high bandwidths using tc on our eight-node cluster. Figure 5.16(a) shows normalized staleness for a fixed moderate error constraint for a Sum aggregation under the three bandwidth constraints. We see that our Caching algorithm yields significantly lower staleness than the Random baseline. Similarly, Figure 5.16(b) shows normalized error under a fixed moderate staleness constraint. We observe that our Caching algorithm significantly reduces error relative to the Random baseline. From a higher level, Figure 5.16 shows that, for both error-bound and staleness-bound applications, lower bandwidth leads to higher staleness (resp., error) for a given error (resp., staleness) constraint, just as we saw in Section 5.5.3.

[Figure 5.16 here: normalized staleness (left) and normalized error (right), on log scales, for the Random and Caching algorithms under low, medium, and high WAN bandwidth.]

(a) Staleness under various error constraints. (b) Error under various staleness constraints.

Figure 5.16: Comparison of Random and Caching algorithms for Sum aggregation with various emulated WAN bandwidths on a local testbed. Note the logarithmic y-axis scale.

Figure 5.17 repeats these experiments for a Max aggregation. We again see the increasing magnitude of the staleness-error tradeoff as available bandwidth decreases. In addition, we see that at high bandwidth, we can achieve zero error for the Max aggregation while Sum showed small but nonzero error. The reason is that, for Max aggregation, many keys take on their final aggregate value well before the window ends. (When the maximum value for a key has already been observed, no additional arrivals will change the maximum for that key.) For Sum, on the other hand, every (nonzero) arrival is guaranteed to affect the value for its key. As a result, flushing keys during the window carries less risk for Max: relatively few keys see large enough arrivals after their initial flush to require a second flush. In practice, this means that the highest bandwidth in our experiments is sufficient to communicate exact values for every key when using a Max aggregation, but is not quite enough to accommodate the additional burden of re-flushing keys that were flushed too early with the Sum aggregation.

[Figure 5.17 here: normalized staleness (left) and normalized error (right), on log scales, for the Random and Caching algorithms under low, medium, and high WAN bandwidth.]

(a) Staleness under various error constraints. (b) Error under various staleness constraints.

Figure 5.17: Comparison of Random and Caching algorithms for Max aggregation with various emulated WAN bandwidths on a local testbed. Note the logarithmic y-axis scale.

Relative Impact of Partitioning and Eviction Policies

In order to understand the relative importance of the cache eviction and cache partitioning policies, we run additional experiments for our error-constrained algorithm. We compare the Random and Caching algorithms along with a third alternative, the Random (2-part) algorithm, where we use our two-part Caching approach with a random eviction policy. This third alternative allows us to separate the effects of cache partitioning and cache eviction policies.10

10 For the staleness-constrained case, separating these is not straightforward, so we evaluate only the error-constrained case.

[Figure 5.18 here: normalized staleness (log scale) for the Random, Random (2-part), and Caching algorithms over Sum and Max aggregations.]

Figure 5.18: Comparison of Random, Random (2-part), and Caching algorithms for Sum and Max aggregations at medium WAN bandwidth. Note the logarithmic y-axis scale.

Figure 5.18 compares these policies for the Sum and Max aggregations at medium WAN bandwidth. For the Sum aggregation, we see that Random (2-part) does not yield significantly lower staleness than the baseline Random algorithm, but our Caching algorithm does. This demonstrates that the partitioning policy alone is not effective in reducing staleness, but the combination of partitioning and eviction policies is. The reason is that, throughout much of the window, many keys have accumulated error larger than the error limit, yet many of these keys will still receive significant future updates and should therefore not yet be flushed. The eviction policy serves to identify keys that are more likely to be safe for eviction.

For the Max aggregation, on the other hand, both Random (2-part) and our Caching algorithm significantly outperform the Random baseline, suggesting that the partitioning policy plays the major role in reducing staleness. The reason is that, as the window advances, fewer and fewer keys have accumulated error larger than the error limit, so the partitioning policy alone is quite effective in identifying important keys to flush. Further, there is a weak correlation between recency of arrivals and the extent to which the maximum value will continue to increase for a given key, hence an LRU eviction policy is not significantly more effective than a random policy for this aggregation.

Overall, our experiments demonstrate that our Caching algorithms apply across a diverse set of aggregations, and that they outperform the Random baseline in real geo-distributed deployments, under dynamic workloads and bandwidth conditions, and across a range of error and staleness constraints.

5.6 Discussion

5.6.1 Alternative Approximation Approaches

In this chapter, we have considered one particular approach to approximation: flushing some keys early while omitting some others altogether. Essentially, this approach samples a prefix for each key, and applies the aggregation function directly to this prefix. A major advantage of this approach is its simplicity; we can apply this prefix-sampling approach to any windowed grouped aggregation. One disadvantage, however, is that upon reaching the error or staleness bound, the maximum key error is known at the center, but there is no further information available for the error of individual keys. Here we briefly describe two alternative approaches that address this disadvantage, albeit with disadvantages of their own.

One alternative is to send a single approximate result for each window, rather than a collection of keys and corresponding approximate results. For example, for a query that performs a Count aggregation, we could use a Count-min Sketch [86]. Rather than providing a deterministic bound on the error for each key, this approach guarantees that each key's error is within a factor of ε of the true count with probability δ for user-specified parameters (ε, δ). While this approach can make more efficient use of constrained WAN bandwidth than our sampling approach, it has its own disadvantages. For one, such approximate data structures are only available for a limited set of queries. For example, if we are computing a Count aggregation in order to allow directly querying the count for each key, then the Count-min Sketch approach applies, but if we are computing counts in order to compute the top-K most frequent keys, then we need an alternative data structure, as the Count-min Sketch does not allow us to enumerate keys. In other words, this alternative approach should be considered an optimization that we can apply only in special cases. In practice, such an optimization may best be applied by a higher-level layer responsible for "compiling" application-level queries into runtime execution plans.

Another alternative is to send a probabilistic value for each key. For example, rather than sending a single integer value to represent a count, we could send a more informative statistic such as a confidence interval or an estimate of the probability distribution of the count. Using this approach, each value embeds an estimate of its own error. But while added error information may be an advantage, this approach may require larger representations for each value, which only exacerbates the challenge of constrained WAN bandwidth. In order to maintain low staleness, it remains necessary to apply some sort of sampling approach, such as the prefix-sampling approach we have considered throughout this chapter. In fact, probabilistic values fit entirely within our model. As long as the user-defined error function appropriately incorporates the value-level approximation, our techniques handle such values directly. Indeed, our experiments demonstrated how our techniques can apply to approximate values, as we used the HyperLogLog sketch [65] to approximate unique counts.
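To illustrate the single-summary alternative discussed above, the following is a minimal Count-min Sketch for a Count aggregation. The hash-based indexing and the default width and depth are simplifications chosen for brevity; this is an illustrative sketch, not the exact construction of [86].

    # Illustrative Count-min Sketch: a constant-size per-window summary that can
    # be merged across edges by cell-wise addition.
    import hashlib

    class CountMinSketch:
        def __init__(self, width=2048, depth=4):
            self.width = width
            self.depth = depth
            self.table = [[0] * width for _ in range(depth)]

        def _index(self, key, row):
            digest = hashlib.sha256(f"{row}:{key}".encode()).hexdigest()
            return int(digest, 16) % self.width

        def add(self, key, count=1):
            for row in range(self.depth):
                self.table[row][self._index(key, row)] += count

        def estimate(self, key):
            # Never underestimates the true count; overestimates with small probability.
            return min(self.table[row][self._index(key, row)] for row in range(self.depth))

        def merge(self, other):
            # Cell-wise addition lets the center combine sketches from multiple edges.
            for row in range(self.depth):
                for col in range(self.width):
                    self.table[row][col] += other.table[row][col]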

5.6.2 Coordination Between Multiple Edges

We have focused in this chapter on algorithms for trading off staleness and error with one edge and a center. In a multi-edge setting, if keys are partitioned across edges, and if bandwidth out of the edges is the bottleneck, then our techniques apply directly. If, however, each key may arrive at multiple edges, then some coordination between edges is necessary.

Under an error constraint, such coordination is required to “allocate” error across multiple edges. While this is trivial for some aggregation functions (e.g., Max), it is challenging in general because error is additive across edges. Ideas from distributed function monitoring [61, 88] may serve as a starting point for developing coordination techniques.

Under a staleness constraint, coordination is required to effectively prioritize keys, as it is impossible to compute accumulated error for a key at a single edge if that key has arrived at multiple edges. Additionally, if the WAN bottleneck arises due to limited bandwidth into the center rather than out of the edges, then it becomes necessary to coordinate the use of this bandwidth across the several edges in order to achieve the best performance. Again, techniques from distributed function monitoring may be useful.

5.6.3 Incorporating WAN Latency

We have focused on WAN bandwidth scarcity as the primary cause of staleness, though latency also plays a role. Under a staleness constraint, if latency is non-negligible relative to the staleness constraint, then we should consider this latency. If we do not, then we risk wasting scarce WAN bandwidth sending updates that will only be ignored downstream, as they will arrive after the staleness tolerance has been exhausted. One approach is to monitor the WAN latency and predict the latency L for the current window. Then, we can apply our algorithms with staleness bound S − L, where S is the application-specified staleness constraint.

Note that there is no need to consider WAN latency directly for our error-bound algorithms, as these algorithms reach the error constraint as quickly as possible regardless of the latency in the wide-area network.
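The following Python sketch illustrates this adjustment; the EWMA predictor is one simple, hypothetical choice for estimating L, not a component of our system.

```python
class LatencyAdjustedBound:
    """Adjust the staleness bound S by a predicted WAN latency L, so that
    updates flushed by the deadline still arrive within the staleness
    tolerance. Uses an EWMA latency predictor (one possible choice)."""

    def __init__(self, staleness_bound_s, alpha=0.2):
        self.s = staleness_bound_s
        self.alpha = alpha
        self.predicted_latency = 0.0

    def observe_latency(self, measured_latency):
        # Exponentially weighted moving average over recent WAN latency samples.
        self.predicted_latency = (self.alpha * measured_latency +
                                  (1 - self.alpha) * self.predicted_latency)

    def effective_bound(self):
        # Run the staleness-bound algorithm with S - L instead of S.
        return max(0.0, self.s - self.predicted_latency)

bound = LatencyAdjustedBound(staleness_bound_s=10.0)
for sample in [0.8, 1.2, 0.9]:
    bound.observe_latency(sample)
print(bound.effective_bound())
```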

5.7 Related Work

Our work bears some resemblance to anytime algorithms (e.g., [89, 90]), which improve their solutions over time and can be interrupted at any point to deliver the current best solution. While our staleness-bound algorithm is essentially an anytime algorithm, terminating when the staleness limit is reached, our error-bound algorithm terminates only when the error constraint is satisfied.

JetStream [12] considers wide-area streaming computation and, like our work, addresses the tension between timeliness and accuracy. Unlike our work, however, it focuses at a higher level on the appropriate abstractions for navigating this tradeoff. Meanwhile, BlinkDB [82] provides mechanisms to trade accuracy and response time, though it does not focus on processing streaming data. Our focus is on specific policies to approach an optimal tradeoff between timeliness and accuracy in streaming settings. Das et al. [83] consider tradeoffs between throughput and latency in Spark Streaming, but they focus on exact computation and consider only a uniform batching interval for the entire stream, while we consider approximate computation and address scheduling on a per-key basis.

Our techniques complement other approaches for approximation in stream computing. For example, our work applies to approximate value types such as the Count-min Sketch [86] or HyperLogLog [65], while recent work in distributed function monitoring [88, 61] could serve as a starting point for coordination between multiple edges.

5.8 Concluding Remarks

In this chapter, we considered the problem of streaming analytics in a geo-distributed environment. Due to WAN bandwidth constraints, applications must often sacrifice either timeliness, by allowing stale results, or accuracy, by allowing some error in final results. Focusing on windowed grouped aggregation, an important and widely used primitive in streaming analytics, we studied the tradeoff between the key metrics of staleness and error. We presented optimal offline algorithms for minimizing staleness under an error constraint and for minimizing error under a staleness constraint. Using these offline algorithms as references, we presented practical online algorithms for effectively trading off timeliness and accuracy in the face of bandwidth limitations. Using a workload derived from a web analytics service offered by a large commercial CDN, we demonstrated the effectiveness of our techniques through both trace-driven simulation and experiments using an Apache Storm-based prototype implementation deployed on PlanetLab. Our results showed that our proposed algorithms outperform practical baseline algorithms for a range of error and staleness bounds, for a variety of aggregation functions, and under varied network bandwidth and workload conditions.

Chapter 6

Challenges and Opportunities in Unified Stream & Batch Analytics

6.1 Introduction

While both batch processing and stream processing are important topics on their own, applications are increasingly being developed that do not fit neatly into either of these categories. For example, a network operator might rely on up-to-the-second summary data to identify potential performance anomalies, but then wish to compare this data against historical statistics. This application requires synthesizing results computed from both unbounded streams and bounded batches of data. As another example, an analyst who regularly computes a daily summary of an important dataset may discover that this report provides a useful signal for day-to-day decisions, and wish to accelerate the pace of reporting, generating hourly—or even continuous—reports. In other words, an application might evolve over time from processing bounded batches of data to processing an unbounded stream of data in near-real-time. In short, modern applications blur the line between stream and batch analytics. In order to support these applications, systems need to provide a programming model that treats analytics over unbounded streams and bounded data sets in a unified manner.

In addition, they need to provide an execution engine capable of satisfying application requirements in terms of timeliness, accuracy, and cost. These programming model and execution engine challenges are rapidly gaining research attention. One key insight from this research is that application semantics are orthogonal to the execution engine. For example, a stream can be discretized into "micro-batches" to be processed by a batch execution engine (e.g., Apache Spark Streaming [60]), and a batch dataset can be viewed as a finite stream and processed by a stream execution engine (e.g., Apache Flink [17] or Apache Apex [18]). It is therefore useful to consider programming models and execution engines separately.

In terms of programming models, Apache Beam [16] (originally Google Dataflow [15]) has emerged as the current standard. The Beam model (the name combines "batch" and "stream") is a powerful one, allowing concise expression of a broader set of applications than we have studied so far in this thesis. For example, Beam allows more general windowing strategies and is designed from the ground up to address the real-world concerns of out-of-order input data. Although Beam is a powerful model, it has several drawbacks that, while important in a single-datacenter environment, are magnified by the challenges of geographic distribution. Specifically, the abstractions provided by the Beam model are burdensome for applications that value tradeoffs along the key dimensions of timeliness, accuracy, and cost. Rather than allowing application developers to explicitly express their constraints for these metrics, Beam forces application developers to describe how to carry out computation, in particular by specifying when to materialize results. In addition, Beam's abstractions are insufficiently expressive to allow application developers to achieve their desired tradeoffs between timeliness, accuracy, and cost.

In terms of execution engines, reliable stream processing systems have been a popular focus of research and engineering effort in recent years. The techniques presented in this thesis are useful for more systematically optimizing timeliness, accuracy, and cost in these systems, but there are many opportunities to extend these techniques in order to more fully address the broader class of applications enabled by the Beam model.

In this chapter, we highlight several opportunities and challenges at the frontier of both programming models and execution engines. Through trace-driven simulation using our Akamai workload, we demonstrate the limitations of the Beam model in terms

of both convenience and expressiveness. We show that the techniques from Chapters 4 and 5 can help navigate the timeliness-accuracy-cost tradeoff space more systematically than common practical approaches, even in the face of Beam's current limitations. We show that, if Beam were extended to overcome some of its limitations, our techniques could be even more effective. Finally, we highlight directions for extending our techniques that could enable more effective timeliness-accuracy-cost tradeoffs for the broader class of applications that Beam supports.

6.2 Background

6.2.1 Diverged Stream and Batch Processing

Historically, researchers and engineers have considered batch processing and stream processing to be separate problems, and distinct systems have been developed and optimized for each. At one extreme lie batch processing systems such as Google MapReduce [13] and its open-source Hadoop [24] counterpart. These systems were originally designed to process massive data sets on very large clusters of commodity hardware. In these settings, jobs may run for many hours, and hardware failures are the common case rather than the exception. In order to ensure progress in the face of failures, these systems rely heavily on persisting state to disk, both locally and in replicated distributed file systems such as GFS [91] and HDFS [92]. As a result, these systems promise reliable computation, but do so with latency on the order of minutes or hours.

At the other extreme lie stream processing systems such as Apache Storm [58]. Unlike batch processing systems, which process a bounded set of input data, these systems execute long-running jobs that process unbounded streams of data in a record-by-record manner. By default, Storm provides only at-most-once processing semantics: some records may fail to be fully processed due to task failures. To address this weakness, Storm provides an acknowledgement mechanism that allows programmers to replay records when they may not have been fully processed. While this is useful for achieving at-least-once semantics in many cases, it is not always easy to apply, for example when applications naturally batch records to compute windowed aggregation. Achieving exactly-once semantics in Storm is complex enough that a separate API is provided for this use case [93]. Overall, early stream computing systems such as Storm promise low-latency results, but typically do so with looser accuracy guarantees than batch systems.

6.2.2 Early Attempts at Unified Stream & Batch Processing

Early attempts at unifying stream and batch processing have fallen short for various reasons. One of these early attempts is the so-called Lambda architecture [94], where the idea is to simultaneously run two versions of every application, one using a reliable but high-latency batch processing system, and the other using a low-latency but possibly approximate stream processing system. The results from these systems are merged at query time, with the streaming results serving as a stop-gap until the batch processing system produces more reliable results. The major downside of this approach is that it requires deploying and managing two processing systems, and developing and maintaining two versions of every application. There have been some attempts to mitigate these downsides, for example by compiling the same high-level program into both batch and stream versions [59], but the operational inefficiencies remain.

The Kappa Architecture [95] was proposed in response to these weaknesses, and it rejects the notion that stream processing systems are inherently approximate or less powerful than batch processing systems. In the Kappa Architecture, all computation is carried out using a stream processing system, and simultaneous versions of an application run only when version n + 1 of an application is replaying old inputs and catching up to the results produced by its predecessor, version n. At its core, the Kappa Architecture is more a rejection of the Lambda Architecture than a concrete proposal for a new approach.

Yet another approach, embodied by Spark Streaming [60], is to model an unbounded stream as a sequence of bounded batches, and apply an efficient batch processing system to each of these batches. One benefit of this so-called micro-batching approach is that the same infrastructure—and largely the same programming model—can be employed for both stream and batch computing. One major weakness, however, is that this approach tightly couples application-level windowing with execution-engine-level batching. This is especially problematic for applications where records must be grouped into time windows based on an externally generated timestamp (i.e., event time) rather than the time at which the system processes them (i.e., processing time).

6.2.3 The State of the Art: Apache Beam

The state of the art is trending toward a unified programming model capable of expressing both batch and stream programs [15], as well as unified execution engines capable of efficiently computing over both bounded batches and unbounded streams [17, 18]. One leading model is that of Apache Beam [16]. This model is motivated by two key ideas: first, that batch processing can be viewed as a special case of stream processing; and second, that stream processing applications must explicitly confront the fact that inputs may arrive out of order, and as a result, the completeness of inputs can never be fully guaranteed. The Beam model requires programmers to answer the following questions.

What results are calculated; i.e., which transformations are applied?

Where in event time do window boundaries lie?

How are results refined as additional (possibly late) data arrive?

When in processing time are results materialized?

In terms of the first question (what transformations to perform), Beam allows more than just the aggregation we have considered in the previous two chapters. For example, Beam applications may include mapping, filtering, or joining operations. These other transformations present interesting challenges, such as determining where to place computation (as discussed in Chapter 2) or how to best translate applications into runtime execution plans (e.g., WANalytics [8]). Here we focus on the rich challenges that have not yet been explored in the context of aggregation.

Beam also allows more general windowing, as captured by the second question (where window boundaries lie). In addition to the tumbling windows we have explored so far, Beam also allows concise expression of hopping windows—discrete approximations of continuously sliding windows—as well as session windows, among others. Session windows raise entirely new challenges and are worthy of significant research effort, though we focus here on the more immediate extension of our techniques to hopping windows.

In many real-world applications, it is crucial to window inputs by their event time—i.e., the time at which inputs are generated by an external system or user—rather than the time at which they arrive at or are processed by the execution engine. In general, whether due to intermittent connectivity between data sources and edges, heterogeneous network performance, failure of intermediate systems, or any number of other factors, inputs may arrive out of order with respect to this event time. Some arrivals are sufficiently delayed relative to others that they may be deemed "late". The choice of how to deal with such late arrivals is the focus of Beam's third question: how refinements relate. This policy is largely dictated by the application: only the application developer can answer whether updated results should replace previous results or instead augment them. Here we explore some of the lower-level challenges of processing data that do not arrive in strict event-time order.

While each of the above questions presents interesting challenges, it is the final question—when to materialize results—that raises the richest research challenges in geo-distributed settings. This question maps closely to questions we have already explored in this thesis, specifically when to communicate partial results (Chapters 4 and 5) and precisely what results to communicate (Chapter 5). As we have shown, the answer to the question of when to materialize results—the triggering policy in Beam parlance—has a significant impact on the timeliness and accuracy of results. For example, if an application requires results with low staleness, but can tolerate greater error, then results should generally be materialized earlier. On the other hand, if an application demands a high degree of accuracy but can tolerate staleness, then it is advantageous to produce results later, when a more complete view of inputs is available. In that sense, the choice of when to materialize results is an important tuning knob that Beam application developers can use to influence the timeliness-accuracy tradeoff.

Beam Triggers

Throughout this chapter, we will study the impact of different triggering policies, so it is useful to understand the Beam trigger abstraction more precisely. It is the role of a triggering policy to determine, for a given key in a given window, when to materialize results. The policy is invoked upon the arrival of new data for that key and periodically as both processing time and event time advance. The result of the invocation is a binary decision: whether to materialize results or continue accumulating. (More precisely, the policy determines both whether to fire and whether to retain any accumulated results upon firing, but the latter question is determined by the application policy for how result refinements relate.)

Note that triggers are unable to consider the state of other keys or windows, or to consider broader runtime state such as recent data arrival patterns or outgoing network performance. As we will show throughout this chapter, these are significant limitations for applications that value performance along the key dimensions of timeliness, accuracy, and cost.
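To make these limitations concrete, the following simplified Python model (an illustration, not the actual Beam API) captures the shape of a per-key, per-window trigger: its only inputs are that key's accumulated state, the current processing time, and the watermark, so it cannot see other keys, the pending backlog, or WAN conditions.

```python
from dataclasses import dataclass

@dataclass
class PaneState:
    """State visible to a trigger: one key in one window, nothing else."""
    key: str
    window_end: float          # event time at which the window closes
    accumulated_value: float = 0.0
    arrivals_since_fire: int = 0

def after_count_trigger(state: PaneState, processing_time: float,
                        watermark: float, threshold: int = 8) -> bool:
    """Fire whenever `threshold` new records have arrived for this key, or
    when the watermark passes the end of the window."""
    del processing_time  # unused by this particular policy
    if watermark >= state.window_end:
        return True
    return state.arrivals_since_fire >= threshold

# Note what is *not* available here: the states of other keys, the backlog of
# pending updates, or the measured WAN bandwidth -- exactly the information
# that cross-key, network-aware policies rely on.
pane = PaneState(key="page-42", window_end=600.0, arrivals_since_fire=8)
print(after_count_trigger(pane, processing_time=123.0, watermark=550.0))  # True
```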

6.2.4 Challenges Magnified by Geo-Distribution

Continuing the pattern from the previous chapters, we consider a hub-and-spoke architecture (see Figure 4.1), comprising one or more edges connected to a single central location by a bandwidth-constrained wide-area network (WAN). To run in such a geo-distributed setting, a Beam application needs to be decomposed into edge and central components, each of which can be expressed as a Beam application. To handle the full generality enabled by Beam, it would be the role of a query planner to determine what transformations to apply at each location. We focus on applications that perform windowed grouped aggregation and are decomposed such that partial aggregation can be performed at the edge(s) and final aggregation at the center. Likewise, the policy for how refinements relate is dictated by the application, and we consider this to be another task for an application-level query planner. The question of where window boundaries lie in event time is unaltered when decomposing applications across the wide area.

The triggering question, that of when to materialize results, is perhaps the most difficult and the most critical in geo-distributed settings, as it has a profound effect on timeliness, accuracy, and cost, as we have shown throughout this thesis. Effectively optimizing performance along these dimensions requires a sophisticated answer to this question, and we argue that application developers should not be responsible for deriving such a policy. Instead, rather than specifying the triggering policy, application developers should specify their requirements along the dimensions of timeliness, accuracy, and

cost, and the system should accept responsibility for employing an appropriate triggering policy to meet these requirements. Throughout this chapter, we explore challenges both on the programming model side and in terms of the techniques that execution engines can use to make effective triggering decisions.

6.3 Simulation Setup

Throughout this chapter, we reuse the simulation framework first introduced in Section 4.5. This simulator models an aggregation algorithm—including its triggering policy—as a function from a sequence of incoming arrivals to a sequence of outgoing updates. Traffic is determined by the number of such updates, while error is computed by aggregating these updates and comparing against an exact aggregation of the original arrival sequence. Staleness is computed by simulating the network as a queueing system with deterministic service times based on the network bandwidth. (We also simulate the network using exponential service times, and no conclusions discussed in this chapter are altered by this change; we report results using the deterministic model for consistency with the previous chapters.)

Throughout this chapter, we use the anonymized Akamai trace with the large query key (see Section 4.3). Except where we show time series, figures reflect results over sixty windows representing multiple days' worth of arrivals drawn from the busiest portion of the trace. Except where otherwise noted, we simulate a medium-bandwidth WAN, which corresponds to sufficient bandwidth to maintain bounded staleness for exact aggregation over tumbling windows when using a traffic-optimal algorithm, but not when using a pure streaming algorithm. As in earlier chapters, for confidentiality reasons, we show staleness values normalized relative to the window length, traffic values normalized relative to pure streaming, and error values normalized relative to the maximum possible error for the given aggregate over the portion of the trace used for our simulations.
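The following Python sketch illustrates the staleness computation under the deterministic service-time model; the function and its inputs are simplified illustrations rather than the simulator's actual code.

```python
def simulate_staleness(updates, bandwidth, window_end):
    """Given (flush_time, size_in_bytes) updates for one window, model the WAN
    as a FIFO queue with deterministic service time size/bandwidth and return
    staleness: how long after the window closes the last update arrives."""
    link_free_at = 0.0
    last_arrival = window_end
    for flush_time, size in sorted(updates):
        start = max(flush_time, link_free_at)      # wait for the link if busy
        link_free_at = start + size / bandwidth    # deterministic service time
        last_arrival = max(last_arrival, link_free_at)
    return last_arrival - window_end

# Two updates flushed near the end of a 60-second window over a slow link.
print(simulate_staleness([(55.0, 4000.0), (59.0, 8000.0)],
                         bandwidth=1000.0, window_end=60.0))  # 7.0
```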

6.4 Programming Models

While the question of when to materialize—and in turn, flush—results is important, there are two problems with requiring application programmers to answer this question.

First, it is burdensome. Application developers are concerned with their high-level accuracy, timeliness, and cost requirements, not with low-level scheduling concerns. Further, this question cannot be answered effectively in a simple, straightforward manner. As we have seen in Chapters 4 and 5, an effective answer to this question must consider dynamic network and data characteristics, as well as application requirements. It is therefore unreasonable to require application developers to answer this question.

Second, the current Beam triggering abstractions are not expressive enough to effectively navigate the timeliness-accuracy-cost tradeoff space. By forcing triggers to consider only a single key and window at a time, and to make their decisions without awareness of runtime conditions such as WAN network congestion, Beam precludes the expression of powerful triggering policies such as our two-part caching algorithms.

6.4.1 Convenience

Application developers ultimately care about timeliness, accuracy, and cost, and they should be allowed to directly specify their requirements along these dimensions. Without this, application developers are burdened with a process of trial and error to find an appropriate triggering policy to meet their requirements. While this is a powerful tuning knob [15], it should not be the responsibility of application programmers to carry out this exploration. Instead, systems should accept the responsibility for automatically determining an appropriate triggering policy given user-specified requirements for timeliness, accuracy, and cost.

To demonstrate the burden imposed by the current Beam model, consider an application that performs a Sum aggregation over tumbling windows with the goal of minimizing staleness subject to an error constraint. We consider a range of triggering policies for this application.

Streaming: This policy immediately materializes results for a key upon the arrival of a new record; i.e., it streams arrivals directly to the center. This reflects an approach that fails to utilize the edges for aggregation.

Batching: This policy only materializes (and flushes) aggregate results for a key when the end of the window is reached. This pure batching approach reflects the default triggering policy in Beam.

Micro-batching: This policy materializes results more frequently, specifically eight times per window. This effectively emulates a micro-batching execution engine such as Spark Streaming [60], where unbounded streams are discretized into small batches, each of which is processed by a batch processing engine. Computing final results requires aggregating across micro-batches (eight in our case).

This also reflects a common approach a Beam application developer might employ in order to reduce latency. The idea is to reduce the delay in transmitting the final results for the window by flushing smaller batches more frequently, as opposed to flushing a single large batch at the end of the window.

Lazy: This policy, which we will use throughout the chapter, is inspired by the Eager Prefix Error (EPE) offline optimal algorithm for error-constrained computation introduced in Section 5.3.1. (Although we are reusing the name, this is not the same algorithm as the Lazy algorithm for exact aggregation in Section 4.4.) The Lazy algorithm is designed to be implementable in Beam, so it makes triggering decisions based on each individual key and window in isolation. Specifically, it maintains the necessary state to compute the accumulated error (see Section 5.4.2) for the key, and it notes the elapsed time between the beginning of the window and the first time at which the accumulated error exceeds

the error constraint E. Let ti denote this elapsed time. At this point, we know that omitting the key would exceed the error constraint, so it must eventually be flushed. The Lazy algorithm makes the assumption that the future mimics the past, and that error will continue to accumulate at the same rate throughout the remainder of the window. Based on this assumption, the key is scheduled to be flushed at time tf = T + W − ti, where T + W denotes the end of the window. (If ti ≥ W/2, then the key is flushed immediately, at tf = T + ti.) Until time tf, arrivals are aggregated for the key, and at time tf the key is flushed. After time tf, the key is flushed again if and when accumulated error exceeds the error constraint, which can occur zero or more times depending on the actual arrivals for the key.

The reason we call this the Lazy algorithm is simply that it aims to delay the first flush until it predicts that the key has accumulated all but E of its final value, and is therefore ready to be flushed. (A sketch of this policy appears after this list of policies.)

Caching: This policy reflects our two-part caching online algorithm discussed in Section 5.4.2. It uses a value-based cache partitioning policy and an LRU cache eviction policy to emulate the EPE offline optimal algorithm. We must emphasize that while this is an online algorithm, it cannot be implemented using the current Beam abstractions, because it determines when to materialize results based on a view of all keys, rather than a single key in isolation. Additionally, it utilizes an awareness of the outgoing WAN bandwidth to determine how rapidly to flush keys. We include this algorithm to study the additional benefit we might realize if Beam were to offer more powerful triggering abstractions.

Offline: Finally, the offline optimal algorithm denotes the EPE algorithm itself, which requires complete knowledge of future arrivals for each key. This algorithm cannot be implemented in an online setting, but is included here to help contextualize the other practical algorithms.
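The following Python sketch (an illustration, not our prototype code) shows how the Lazy policy can be expressed with only per-key state and time, which is why it fits within Beam's current per-key, per-window trigger abstraction.

```python
class LazyTrigger:
    """Per-key Lazy trigger for an error-constrained Sum: a sketch of the
    policy described in the list above, using only this key's state and the
    current time."""

    def __init__(self, window_start, window_length, error_bound):
        self.t_start = window_start
        self.w = window_length
        self.e = error_bound
        self.accumulated = 0.0   # error accumulated since the last flush
        self.flush_at = None     # scheduled first flush time, once known
        self.fired_once = False

    def should_fire(self, now, new_value=0.0):
        """Call on each arrival (with its value) and periodically as time
        advances; returns True when the key should be flushed."""
        self.accumulated += new_value
        if self.fired_once:
            # After the first flush, flush again whenever the error bound
            # would otherwise be violated.
            return self.accumulated > self.e
        if self.accumulated <= self.e:
            return False         # omitting the key still satisfies E
        if self.flush_at is None:
            t_i = now - self.t_start
            # Assume error keeps accruing at the observed rate, so the key is
            # nearly complete by T + W - t_i; if t_i >= W/2, flush right away.
            self.flush_at = max(self.t_start + self.w - t_i, now)
        return now >= self.flush_at

    def on_fire(self):
        self.accumulated = 0.0
        self.fired_once = True
```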

[Figure: normalized staleness (log scale) vs. window number for the Streaming, Batching, Micro-batching, Lazy, Caching, and Offline policies.]

Figure 6.1: Staleness over time for a Sum aggregation under an error constraint.

Figure 6.1 shows a time series of staleness for this set of algorithms for our error-constrained Sum aggregation. Here we use a low WAN bandwidth to more clearly highlight our points. We see that neither Streaming nor the Batching variants (Batching and Micro-batching) consistently perform well. Streaming achieves very low staleness (close to zero, and hence not visible given the logarithmic y-axis scaling of the figure) for the first portion of the workload, but excessive staleness when the workload shifts and it is no longer feasible to stream updates to the center. The Batching variants yield lower—though still excessive—staleness during the latter portion of the simulation, but do so at the cost of higher staleness in the earlier windows. Even between the two Batching algorithms, there is no clear better choice. Micro-batching yields lower staleness during the early windows, but higher staleness in the later windows due to its higher traffic. In short, these simple algorithms, although easily expressed in Beam, are insufficient to achieve desirable performance in terms of timeliness, accuracy, and cost.

To better navigate this tradeoff space, a Beam application developer needs to implement her own trigger policy, and our Lazy algorithm reflects a principled approach to this task. By exploiting our application's tolerance for error, this algorithm yields very low staleness in the early windows, and staleness that remains below one window length in the latter windows. This is a dramatic improvement over the naïve Batching, Micro-batching, and Streaming algorithms, but implementing such a policy is nontrivial, and application developers should not be burdened with this responsibility.

Our Caching algorithm yields even better performance than the Lazy algorithm, demonstrating that a richer triggering abstraction could yield more effective timeliness-accuracy-cost tradeoffs than are possible under the current Beam model.

6.4.2 Expressiveness

For error-constrained applications, we saw that Beam's per-key, per-window triggers limit our ability to express triggering policies that effectively trade timeliness, accuracy, and cost. For applications with a constraint on staleness and a desire to minimize error, the limitations of the Beam model are more severe: the current abstractions preclude the specification of practical triggering policies for minimizing error under a staleness constraint. Specifically, in a staleness-constrained case, the choice of when to materialize and flush results for any given key can only be made in the context of the other keys, and with an awareness of the outgoing wide-area network. The lack of cross-key triggers is therefore a severe limitation of the Beam model.

Given Beam's current limitations, one possible approach for staleness-constrained applications is to use an error-bound algorithm and attempt to set the error bound as low as possible without violating the staleness bound. This is, however, a poor solution, as it requires trial and error to determine an appropriate error bound, and in fact it is not possible to set a static error bound that guarantees satisfying the staleness bound.

To illustrate this point, we simulate an optimal offline error-bound algorithm with three error constraints (0, Lo, and Hi). We use an offline optimal algorithm rather than an online algorithm to focus on the limitations of using an error-bound algorithm in general rather than on the particular behavior of a specific online error-bound algorithm. We also simulate an offline optimal staleness-bound algorithm (see Section 5.3.1) to demonstrate the performance potential of a true staleness-constrained algorithm. We again use a low-bandwidth WAN to highlight our points. Figures 6.2 and 6.3 show staleness and error, respectively, for these four algorithms, for a Sum aggregation with a staleness tolerance of W/2.

[Figure: normalized staleness (log scale) vs. window number for the Optimal, EB (0), EB (Lo), and EB (Hi) algorithms.]

Figure 6.2: Staleness over time for a Sum aggregation under a staleness constraint.

[Figure: normalized error (log scale) vs. window number for the Optimal, EB (0), EB (Lo), and EB (Hi) algorithms.]

Figure 6.3: Error over time for a Sum aggregation under a staleness constraint.

From these figures, we observe that even exact computation (i.e., the EB (0) algorithm) suffices for the first several windows. In the latter portion of the workload, however, it is necessary to tolerate more error in order to maintain the staleness constraint, but the necessary error tolerance varies from window to window. For some windows, the low error tolerance is enough to satisfy the staleness constraint, while other windows exceed the staleness constraint even with the high error tolerance. A genuine staleness-bound algorithm, on the other hand, consistently guarantees the staleness constraint for each window, and minimizes the error given that constraint. In short, we can see that emulating a staleness-bound algorithm by applying an error-bound algorithm yields poor results and fails to achieve the desired performance along the dimensions of timeliness, accuracy, and cost.

While Beam is a useful and powerful model, the triggering abstraction is insufficiently expressive for applications in which the timeliness-accuracy-cost tradeoff is important. To reliably satisfy a staleness constraint while minimizing error, triggers need to be able to prioritize across keys, and to be informed not just by details of the arriving data, but also by outgoing network conditions.

6.4.3 Research Challenges

As we have shown, the current Beam abstractions are both burdensome and insufficiently expressive for applications that value performance along the key dimensions of timeliness, accuracy, and cost. Instead of asking application developers when to materialize results, Beam should allow users to specify their requirements for these key metrics. It should be the responsibility of systems to determine when to materialize results. To make effective decisions, these systems need the ability to make triggering decisions not just for isolated keys, but across keys, and with additional visibility into runtime information such as outgoing network conditions.

To effectively make triggering decisions, systems ultimately need effective algorithms, which we have presented in the previous chapters and study further in the remainder of this chapter. To support these algorithms, systems need to provide more powerful mechanisms for triggering. Allowing cross-key and network-aware (and, as we will show later, cross-window) triggering would enable algorithms that better navigate the timeliness-accuracy-cost tradeoff space. These more powerful mechanisms do, however, raise systems challenges in terms of scalability. The current triggering model allows each key to be logically independent of all other keys, which simplifies tasks such as maintaining and partitioning state, as well as defining and scheduling tasks across clusters of machines. Providing triggers with a broader set of inputs makes these matters more challenging; novel techniques may be required to communicate this information efficiently, or to reduce the scope of required coordination.

6.5 Execution Engines

While there are challenges that need to be addressed in the Beam model, the added expressiveness highlights directions in which our algorithms may also need extension. In particular, we focus here on the implications of more general windowing and out-of-order arrivals. This exploration highlights the extent to which the techniques in this thesis advance the state of the art, while also exposing challenges that require further research.

6.5.1 Hopping Windows

Throughout this thesis, we have focused on tumbling windows; i.e., non-overlapping windows of length W . While tumbling windows are common and useful, there are other interesting window types to consider. A more general type of window is the hopping window, as shown in Figure 6.4. Such a window has length W but advances at interval H, where W is a multiple of H. As an example, we might use hopping windows to compute an aggregate over 1-hour windows, beginning a new window every 10 minutes (i.e., W = 1 hour and H = 10 minutes). Intuitively, as the ratio W/H increases, hopping windows more closely approximate sliding windows. In the limit as W/H approaches infinity, windows are continuously sliding.

[Figure: a length-W window composed of W/H contiguous length-H subwindows, advancing along the time axis at interval H.]

Figure 6.4: Hopping windows have length W and advance at interval H.

We refer to such windows as (W, H) hopping windows throughout this section. We use window to refer to the full length-W window, while subwindow refers to one of the W/H contiguous length-H subparts of each window. Note that, in general, hopping windows overlap.

When computing hopping windows, updates flushed by the edge(s) must somehow encode the windows to which they belong. In the tumbling window case, it suffices to flush (time, key, value) triples, where time corresponds to any time within the specific window. Now, because windows overlap, using a single time would allow an update to apply to multiple windows. To unambiguously specify the windows to which an update should apply, we make this timestamp two-dimensional, indicating the time interval over which the update applies. Concretely, updates take the form ([ts, te), k, v), indicating that they apply to all windows [T, T + W) where T ≤ ts and te ≤ T + W. The center then applies each incoming update to all windows containing the interval [ts, te).
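The following Python sketch (illustrative; names and units are arbitrary) shows how the center can map an interval-stamped update to the hopping windows it applies to.

```python
def windows_for_update(ts, te, window_length, hop):
    """Return the start times T of all (W, H) hopping windows [T, T + W) that
    an update covering the event-time interval [ts, te) applies to, i.e.,
    those with T <= ts and te <= T + W."""
    assert window_length % hop == 0
    # Candidate windows start at multiples of the hop length H.
    latest_start = (ts // hop) * hop                  # last start with T <= ts
    earliest_start = latest_start - (window_length - hop)
    return [t for t in range(earliest_start, latest_start + hop, hop)
            if t <= ts and te <= t + window_length]

# W = 60, H = 10: an update covering [133, 147) spans two subwindows, so it
# applies to the five windows starting at 90, 100, 110, 120, and 130.
print(windows_for_update(133, 147, window_length=60, hop=10))
# [90, 100, 110, 120, 130]
```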

Exact Computation

To demonstrate the added challenge of hopping windows, we begin by exploring exact computation. We consider three algorithms.

Streaming: A pure streaming approach logically performs no work at the edge, and instead merely forwards arrivals directly to the center.

Batching: For exact computation, we must flush updates reflecting all arrivals. Therefore, for every key that has arrivals in a particular length-H subwindow, an update must be flushed. In other words, it is not possible to send any less traffic than an algorithm that simply applies pure batching within each length-H subwindow. This is exactly the approach taken by our Batching algorithm.

Optimal: While the Batching approach is optimal for traffic, its main disadvantage is that it defers all communication until the end of each length-H subwindow, which can lead to high staleness. To mitigate this problem, a more effective algorithm flushes updates prior to the end of the window. In fact, when exact results are required, the problem of computing (W, H) hopping windows is equivalent to that of computing length-H tumbling windows at the edge and relying on the center to logically concatenate these subwindows to form full length-W windows. We can therefore apply the techniques from Chapter 4 to each of these length-H subwindows. Specifically, our Offline Optimal algorithm applies the offline optimal algorithm for exact computation from Section 4.4.

While matching the optimal traffic of a pure batching approach, this offline algorithm also achieves optimal staleness by flushing updates for keys that have reached their final value without waiting until the end of the window.

[Figure, two panels: (a) mean traffic normalized relative to Streaming; (b) median staleness normalized by the window length. Both are plotted against W/H (1 to 64) for the Batching, Offline Optimal, and Streaming algorithms.]

Figure 6.5: Traffic and staleness for an exact Sum aggregation.

Figure 6.5 shows traffic and staleness for these algorithms as W remains fixed and H decreases, yielding a higher W/H ratio. Note that in this figure and throughout the remainder of the chapter, we simulate a medium-bandwidth WAN. These results illustrate several key points. For one, Streaming exhibits the highest traffic, and given the network capacity in this simulation, staleness that quickly exceeds one window length and continues to grow throughout the duration of the simulation. This demonstrates the importance of utilizing both edge and central resources to minimize traffic and staleness.

Further, as W/H increases, optimal traffic increases. The reason is that, as H decreases and we approach sliding windows, aggregation is performed over shorter (sub)windows, and is therefore less effective in reducing the number of flushed updates relative to the number of arrivals.

Optimal staleness also increases with W/H due to the increasing traffic. When W/H is sufficiently high, traffic exceeds the network capacity, and as a result, delays accumulate from window to window, leading to very high staleness. For a pure batching algorithm, staleness first decreases before increasing and eventually converging to that of the optimal algorithm. The reason for this non-monotonic trend is that, while total traffic increases with W/H, the size of each subwindow's batch decreases, and the delay in transmitting the updates for each window therefore decreases as well. Once traffic increases sufficiently, however, congestion becomes a more persistent concern and staleness quickly grows as a result. At high W/H, traffic exceeds the network capacity for this simulated network, and both the offline optimal algorithm and a pure batching algorithm owe their high staleness to network congestion rather than the fine-grained details of when each key is flushed.

Overall, we see that as W/H increases, the effects of a constrained WAN become more pronounced, and jointly minimizing traffic and staleness becomes increasingly challenging. For applications requiring exact, near-real-time results in the face of constrained WAN capacity, it is not feasible to approximate sliding windows by arbitrarily increasing W/H. Instead, applications must make some compromise, either by settling for coarse hopping windows or pure tumbling windows, or by allowing approximation.

Error-Constrained Computation

In many cases, exact near-real-time computation is infeasible, and applications are willing to tolerate some degree of error. When this approximation is permitted, hopping windows become significantly more challenging. In fact, even offline optimal algorithms for minimizing staleness under an error constraint are not yet known. In this section we explore the strengths and weaknesses of practical approaches, and motivate further research.

From a high level, we can divide these practical approaches into two classes: serial subwindows and parallel windows.

Serial Subwindows. Using the serial subwindows approach—serial for short—the edges compute length-H tumbling windows and the center logically concatenates results from a series of these windows to form full length-W windows. While this is optimal under exact computation, as argued earlier, it is not in general optimal for approximate computation. The primary disadvantage of this approach is that it is necessary to logically divide the error tolerance across each of the W/H subwindows constituting each length-W window.

As a concrete example, consider a Sum aggregation with (W, H) = (10, 5) and an error constraint of E = 100. In order to guarantee that the error within every length-W window satisfies the error constraint, we need to ensure that the sum of errors for any key across any two adjacent length-H subwindows does not exceed E. The most straightforward way to do this is to enforce an error constraint of E′ = E · H/W = 50 for each length-H subwindow. (Some aggregations, such as Max, do not suffer from this problem; we explore the implications of this later.) The problem, however, is that keys do not necessarily exhibit uniform error across subwindows. For example, it may be possible to achieve lower staleness by allowing one key an error of 70 in one subwindow and only 30 in the adjacent subwindows. Meanwhile, another key may have arrivals only in a single subwindow, in which case the full error tolerance could be allotted to that single subwindow. In short, the problem with the serial approach is that it limits our ability to fully exploit an application's tolerance for error, which results in higher-than-optimal traffic and staleness.

Parallel Windows. Another approach is to compute (W, H) hopping windows by logically running W/H separate length-W tumbling window aggregation algorithms at the edge, each phase-shifted by k · H for k = 0, 1, . . . , W/H − 1. Because the edge is essentially performing W/H aggregations in parallel, we refer to this as the parallel-windows approach, or more concisely the parallel approach. Each of these parallel instances is oblivious to the decisions made by the others, so the primary disadvantage of this approach is that traffic grows as W/H increases.

As a concrete example, consider a key with a single arrival. The parallel approach would flush an aggregate value for this key once for each of the W/H windows that contain the arrival. In a pathological case where every key is unique, traffic would be W/H times that of streaming!

In an offline setting, it is possible to filter these duplicates. Specifically, if each parallel aggregation algorithm runs the EPE algorithm from Section 5.3.1, then duplicates manifest as identical (key, value) updates flushed at the same instant in time. To eliminate duplicates, we can collect all updates flushed by the W/H parallel aggregation algorithms, and group records by departure time, key, and value. Each update indicates the window to which it was originally destined, specifically by denoting the beginning and end time of that window. For each group, we flush a single update that applies only to the interval [ts, te), where ts is the largest start time of any destination window, and te is the smallest end time of any destination window. In other words, we indicate that each update applies to the intersection of the intervals contained in the group of duplicate updates. Because the center applies an update for time interval [ts, te) to all windows containing [ts, te), this results in applying this single update to exactly the set of windows to which the original group of duplicate updates would have applied. It is worth emphasizing that this technique only applies in offline settings, as it relies on identifying duplicates based not just on identical keys and values but also on identical departure times, as these three dimensions guarantee perfectly redundant updates given the definition of the EPE algorithm.
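The following Python sketch illustrates this offline duplicate-elimination step (the field names are hypothetical; this is not the simulator's code).

```python
from collections import defaultdict

def eliminate_duplicates(updates):
    """Group updates by (departure time, key, value) and replace each group of
    per-window updates with a single update whose interval is the intersection
    of the destination windows. Each input update is a dict with fields
    'departure', 'key', 'value', 'window_start', and 'window_end'."""
    groups = defaultdict(list)
    for u in updates:
        groups[(u["departure"], u["key"], u["value"])].append(u)

    merged = []
    for (departure, key, value), group in groups.items():
        ts = max(u["window_start"] for u in group)  # largest start time
        te = min(u["window_end"] for u in group)    # smallest end time
        merged.append({"departure": departure, "key": key, "value": value,
                       "interval": (ts, te)})
    return merged

# Three parallel instances flush the same (key, value) at t = 12 for windows
# [0, 60), [10, 70), and [20, 80); a single update for [20, 60) replaces them.
dupes = [{"departure": 12, "key": "k", "value": 5, "window_start": s,
          "window_end": s + 60} for s in (0, 10, 20)]
print(eliminate_duplicates(dupes))
```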

Further, although this approach avoids the blatant duplication of the baseline parallel approach, it is not itself optimal. To illustrate why, consider a simple example, focusing on a single key, and assume a simple Sum aggregation. Figure 6.6 shows four arrivals, at times ti with values vi for i = 1..4. The true aggregate values for windows 1 and 2 are v1 + v2 + v3 and v2 + v3 + v4, respectively. Assume the error constraint is E, and that all four individual values are smaller than E, but that the sum of any two values is larger than E. That is, an error-constrained algorithm must flush updates that reflect at least two of the three values arriving within each window. The offline optimal EPE algorithm essentially identifies the shortest prefix of arrivals that cumulatively satisfies the error constraint for each window. As a result, it would flush a value of v1 + v2 at time t2 for window 1, and a value of v2 + v3 at time t3 for window 2.

[Figure: arrivals with values v1, v2, v3, v4 at times t1, t2, t3, t4 along the time axis; window 1 spans t1 through t3 and window 2 spans t2 through t4.]

Figure 6.6: The parallel approach with (offline) duplicate elimination is suboptimal.

An alternative approach is to aggregate a suffix of arrivals for window 1 and a prefix for window 2, and emit a single update with value v2 + v3 at time t3. Because both v1 and v4 are smaller than E, this satisfies the error constraint for both windows, yet it reduces traffic by avoiding one redundant update. The effect on staleness depends on other keys, the network capacity, and the actual values of t2 and t3, but under some circumstances (for example, if t3 − t2 is small) this could also reduce staleness.

This highlights the added challenge facing algorithms for approximate aggregation under hopping windows: it is no longer sufficient to consider only which prefix of values to aggregate for a key. Instead, algorithms must more generally consider which subsequence of arrivals to include in their aggregates, and they must do so with consideration given to multiple overlapping windows. Developing such algorithms, however, remains an open research problem.

Sum Aggregation. In the meantime, although both serial and parallel algorithms are known to be suboptimal, they do provide some insight into the additional challenges presented by hopping windows. Toward this end, Figure 6.7 shows traffic, staleness, and error for several serial and parallel algorithms for a Sum aggregation under an error constraint.

Parallel/Serial Offline: These algorithms apply our EPE algorithm using the parallel and serial approaches, respectively.

Parallel/Serial Lazy: These algorithms apply the online Lazy algorithm introduced in Section 6.4 using the parallel and serial approaches, respectively. Recall that this algorithm is inspired by the EPE algorithm, and it makes its triggering decisions with knowledge of only time and the arrivals for a single key. In other words, it is possible to implement this algorithm using the current Beam abstractions.

Serial Caching: This algorithm applies our two-part caching techniques from Section 5.4.2 using the serial approach. Note that, because this algorithm was designed explicitly for disjoint windows, we simulate only the serial approach. Although this is an online algorithm, it is not possible to implement it using the current Beam abstractions, as it makes decisions—in particular cache eviction decisions—across keys and with network awareness. We include it here to further motivate the addition of more robust triggering mechanisms to modern execution engines.

Parallel+DE Offline: This algorithm applies the EPE algorithm with the offline duplicate elimination technique. Although still not optimal, this demonstrates the potential of an algorithm that makes decisions across keys to reduce traffic.

From a high level, these results reinforce our conclusions for exact computation: as W/H increases, staleness and traffic also increase. We also see that, for all but the smallest values of W/H, the parallel approach performs poorly due to its excessive traffic and, in turn, high staleness. The Lazy algorithms achieve traffic close to their optimal counterparts, while our two-part caching approach shows slightly higher traffic at low W/H, as that algorithm deliberately aims to fully utilize the available network capacity during each window. As W/H reaches roughly 16, this gap vanishes, as even the offline optimal algorithm produces enough traffic to fully utilize our simulated network. (In general, the particular value of W/H at which this occurs depends on the arrival rate, the query, the specific value of H, and the WAN capacity.)

In terms of staleness, Figure 6.7(b) shows that even the relatively simple Lazy algorithm can be effective when using the serial approach. Our two-part caching approach, however, offers significantly lower staleness, especially at intermediate W/H. The reason is that it is easier to predict which key among many is the best candidate for an early flush than to predict precisely when each key should be flushed. This demonstrates again that cross-key triggers would be a powerful addition to Beam.


[Figure, three panels: (a) mean traffic normalized relative to Streaming; (b) median staleness normalized by the window length; (c) median error normalized by the maximum possible error for our inputs. Each is plotted against W/H (1 to 64) for the Parallel Offline, Parallel Lazy, Serial Offline, Serial Lazy, Serial Caching, and Parallel+DE Offline algorithms.]

Figure 6.7: Traffic, staleness, and error for a Sum aggregation under an error constraint.

Figure 6.7(c) quantifies the drawbacks of the serial approach. As W/H increases, the parallel approach is able to fully utilize the error tolerance, while the serial approach utilizes only a fraction of this tolerance, as it uniformly distributes the error tolerance across each window's constituent subwindows.

The main problem with both the serial and parallel approaches is that they fail to optimize across multiple windows. The serial approach could be improved by considering error across multiple serial windows, while the parallel approach could be improved by removing redundant communication across multiple parallel windows. The offline parallel algorithm with duplicate elimination avoids the drawbacks of both the serial and baseline parallel approaches, yielding the best traffic, staleness, and error of the algorithms considered here. As discussed earlier, however, optimal algorithms are not yet known; there remains an opportunity to better optimize across keys and windows to further improve timeliness, accuracy, and cost.

Max Aggregation. We next consider an error-constrained Max aggregation. One interesting aspect of the Max aggregation, as opposed to the Sum aggregation we have explored so far, is that we can allow the full error limit in each of the subwindows. As a result, the serial approach does not suffer from its main drawback. Figure 6.8 shows traffic, staleness, and error using the same algorithms as in the previous figure. We again observe the same high-level trends, specifically that traffic and staleness increase with W/H. Now, however, both the serial and parallel approaches fully exploit the error limit. As a result, the serial approach yields very low staleness. Again, however, the parallel approach with duplicate elimination dominates the other algorithms, as it optimizes across multiple windows. (Staleness is 0 across the range of W/H for the Parallel+DE algorithm, so it does not appear in Figure 6.8(b), which uses a logarithmic y-axis scale.)

Staleness-Constrained Computation

Without cross-key and network-aware triggers, Beam presents severe obstacles to implementing a staleness-bound algorithm. Here we briefly explore how staleness-bound algorithms would perform for hopping windows, for both an offline algorithm and for our two-part caching online algorithm from Section 5.4.3. Note that this algorithm, while online, is not implementable in Beam, because it makes both cache partitioning and cache eviction decisions based on information for multiple keys, and with awareness of outgoing WAN performance. Given that both our offline optimal and online two-part caching algorithms assume disjoint windows, we explore only the serial approach.

[Figure, three panels: (a) mean traffic normalized relative to Streaming; (b) median staleness normalized by the window length; (c) median error normalized by the maximum possible error for our inputs. Each is plotted against W/H (1 to 64) for the Parallel Offline, Parallel Lazy, Serial Offline, Serial Lazy, Serial Caching, and Parallel+DE Offline algorithms.]

Figure 6.8: Traffic, staleness, and error for a Max aggregation under an error constraint.

Figure 6.9 shows traffic, staleness, and error for these two algorithms with a staleness constraint of W/16. We see that our serial caching approach maintains consistent traffic across a range of W/H, since this algorithm intentionally fully utilizes the network. The serial offline algorithm, on the other hand, only fully utilizes the network as W/H becomes large; prior to this it is not necessary to fully utilize the network, as aggregation and omission serve to reduce traffic. This difference arises because our staleness-bound algorithm was not designed to minimize traffic. These results highlight an opportunity to improve our staleness-bound algorithms to additionally emphasize traffic reduction.

While the caching algorithm shows higher error than the serial offline algorithm at low W/H due to imperfect cache partitioning and eviction decisions, when W/H ≥ 16, both algorithms converge to the same error. The reason for this convergence is that, at W/H ≥ 16, both algorithms are forced to delay their flushes until after the close of the current subwindow, as the staleness limit for the previous subwindow has not yet been reached. As a result, both algorithms reduce to pure batching with prioritized flushing. Because the decision of which keys to flush is made after all keys have arrived, there is no prediction error, and it is possible to make the optimal decision of which keys to flush.

This is not to say, however, that this serial approach is optimal overall. The serial approach minimizes the error of each subwindow given the staleness constraint, but as Figure 6.7(c) showed, error is not uniformly distributed across windows. As a result, a minimum-error length-W window does not necessarily comprise minimum-error length-H subwindows.

As with the error-bound case, an optimal algorithm therefore needs to make decisions not just across multiple keys within a window, but across multiple windows. Developing such algorithms, and providing execution-engine support for the associated richer triggering mechanisms, are both promising future research directions.

[Figure 6.9 appears here: Traffic, staleness, and error for a Sum aggregation under a staleness constraint of W/16, comparing the Serial Offline and Serial Caching algorithms as W/H ranges from 1 to 64. (a) Mean traffic normalized relative to Streaming. (b) Median staleness normalized by the window length. (c) Median error normalized by the maximum possible error for our inputs.]

6.5.2 Out-of-Order Arrivals

In practice, records may arrive at the edge out of order with respect to the time they were generated (i.e., their event time). Whether due to intermittent connectivity from data sources to edges, heterogeneous network behavior, or even system failures, systems must gracefully handle these out-of-order arrivals. Out-of-order arrivals significantly complicate matters in practice, primarily because systems can no longer be certain of the completeness of their inputs. Instead, modern systems rely on heuristics to determine the progress of time in their input streams [72].

Beam uses the common abstraction of low watermarks to track the progress of event time. The purpose of a watermark is to indicate the oldest event time that could arrive in the future. In practice, watermarks are typically generated using heuristics, and these heuristics are often informed by characteristics of the particular data source. For example, if data originates from log files, file metadata may be used to help estimate the oldest not-yet-ingested data. Because these watermarks are imperfect, some records will necessarily arrive with event times older than the watermark. How these late data are handled depends on the application, based on its answers to Beam's questions of when to materialize results and how refinements relate.

Throughout the remainder of this chapter, we study the impacts of out-of-order data and associated watermark heuristics on timeliness, accuracy, and cost for grouped aggregation, for both tumbling and hopping windows.

Methodology

To simulate out-of-order arrivals, we preprocess our anonymized Akamai trace to apply a random delay to each arriving record. Specifically, we draw the delay for each record from an exponential distribution and add this delay to the record's event time to determine its arrival time. Records are then sorted by their (delayed) arrival time, and our simulator processes arrivals in arrival-time order. To study the effect of delay, we consider delay distributions with means of 0.1%, 1%, 10%, and 100% of the window length. Because the delays are random, we run each simulation five times; each plotted data point reflects a mean across five runs. For clarity, error bars are not shown, as most 95% confidence intervals are too narrow to be visible in our plots, and in no case are the confidence intervals wide enough to change any of the conclusions presented here.

In terms of watermarks, we employ a simple and practical heuristic. Our simulator computes the delay for each arrival by subtracting its event time from its arrival time, and it maintains a sliding window of recent delays (we track the most recent 10,000 delays in our simulation, though more sophisticated techniques could achieve similar results with a smaller sliding window). We implement percentile watermark heuristics, which track the specified percentile of this delay and assume that the oldest arrivals lag behind the newest arrivals by this value. For example, if the 99th-percentile delay is one minute and the newest event time observed by the system is 3:00pm, then the 99th-percentile heuristic watermark is 2:59pm. Using a percentile watermark allows us to roughly control the fraction of arrivals that are deemed late. For example, when using a 99th-percentile watermark, roughly 1% of arrivals will have event times older than the watermark. Of course, it is possible to develop much more sophisticated and accurate watermark heuristics, but that is not our focus. Instead, our goal is to simulate a reasonable and controllable heuristic, and to study how this heuristic affects performance. To that end, we consider 95th-, 99th-, and 99.9th-percentile heuristic watermarks in this chapter.
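For concreteness, the following Java sketch illustrates the two mechanisms just described: exponential delay sampling and the percentile watermark heuristic over a sliding window of recent delays. It is a minimal illustration, not our simulator's code; the class name, method signatures, and the handling of the 10,000-delay window are assumptions made for this sketch.

import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Random;

/** Minimal sketch of the percentile watermark heuristic described above. */
class PercentileWatermark {
    private final ArrayDeque<Long> recentDelays = new ArrayDeque<>();
    private final int maxSamples;     // sliding window of recent delays (e.g., 10,000)
    private final double percentile;  // e.g., 0.99 for a 99th-percentile watermark
    private long maxEventTime = Long.MIN_VALUE;

    PercentileWatermark(int maxSamples, double percentile) {
        this.maxSamples = maxSamples;
        this.percentile = percentile;
    }

    /** Record one arrival; its delay is arrivalTime - eventTime. */
    void observe(long eventTime, long arrivalTime) {
        maxEventTime = Math.max(maxEventTime, eventTime);
        recentDelays.addLast(arrivalTime - eventTime);
        if (recentDelays.size() > maxSamples) {
            recentDelays.removeFirst();
        }
    }

    /** Watermark = newest observed event time minus the tracked delay percentile. */
    long currentWatermark() {
        if (recentDelays.isEmpty()) {
            return Long.MIN_VALUE;
        }
        long[] delays = recentDelays.stream().mapToLong(Long::longValue).toArray();
        Arrays.sort(delays);
        int idx = Math.max(0, (int) Math.ceil(percentile * delays.length) - 1);
        return maxEventTime - delays[idx];
    }

    /** Exponentially distributed delay with the given mean, used to perturb event times. */
    static long sampleDelay(Random rng, double meanDelay) {
        return (long) (-meanDelay * Math.log(1.0 - rng.nextDouble()));
    }
}

Under this sketch, an arrival is deemed late exactly when its event time is older than currentWatermark() at the time it is processed.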

Algorithms

Given the infeasibility of expressing staleness-constrained algorithms using the current Beam abstractions, we limit our focus to error-constrained algorithms. In particular, we consider three practical online algorithms that can be implemented under the current Beam model, and we consider only the serial-window approach, as it outperforms the parallel-window approach (recall that the duplicate-elimination technique introduced earlier is only applicable in offline settings).

Batching: This algorithm captures the default behavior in Beam. Results are materialized (i.e., triggers "fire" in Beam terminology) when the watermark advances past the end of the window. Late arrivals are disregarded, and these omissions are therefore the sole source of error.

Batching+LF: This algorithm extends the Batching algorithm by adding late firings: late arrivals are immediately streamed to the center. As a result, this approach yields higher traffic but guarantees (eventually) exact results, as all arrivals are ultimately reflected in updates flushed to the center, either as part of the initial batch for a window or in the stream of late flushes triggered by late arrivals.

Lazy: This algorithm uses our Lazy triggering policy. Recall that this algorithm tracks the elapsed time t_i necessary for a key to first accumulate enough error to require flushing, and it schedules the first flush for time T + W - t_i, where T + W denotes the end of the window. Additional flushes are triggered if and when the accumulated error again reaches the error tolerance. Note that this algorithm monitors the progress of time and schedules flushes based on event time (i.e., watermark time).
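The following is a minimal per-key sketch of the Lazy scheduling rule described above, assuming a Sum aggregation and approximating a key's error by the magnitude of its unflushed contribution; the class, field, and method names are illustrative rather than the implementation used in our experiments.

/** Per-key state for a sketch of the Lazy triggering policy (illustrative names). */
class LazyKeyState {
    private final long windowStart;     // T: window start, in event time
    private final long windowLength;    // W: window length
    private final long errorTolerance;  // per-key error bound for a Sum aggregation

    private long pendingDelta;          // unflushed contribution to the Sum
    private Long firstExceedTime;       // event time when the bound was first exceeded
    private Long scheduledFlush;        // event (watermark) time of the next flush

    LazyKeyState(long windowStart, long windowLength, long errorTolerance) {
        this.windowStart = windowStart;
        this.windowLength = windowLength;
        this.errorTolerance = errorTolerance;
    }

    /** Fold one arrival into the key's state and (re)schedule a flush if needed. */
    void onArrival(long eventTime, long value) {
        pendingDelta += value;
        if (Math.abs(pendingDelta) <= errorTolerance) {
            return;  // withholding this key still satisfies the error bound
        }
        if (firstExceedTime == null) {
            firstExceedTime = eventTime;
            long ti = firstExceedTime - windowStart;           // elapsed time t_i
            scheduledFlush = windowStart + windowLength - ti;  // first flush at T + W - t_i
        } else if (scheduledFlush == null) {
            scheduledFlush = eventTime;  // error re-accumulated after a flush: flush again
        }
    }

    /** The trigger fires once the watermark reaches the scheduled flush time. */
    boolean shouldFlush(long watermark) {
        return scheduledFlush != null && watermark >= scheduledFlush;
    }

    /** Called after this key's partial aggregate has been sent to the center. */
    void onFlushed() {
        pendingDelta = 0;
        scheduledFlush = null;
    }
}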

The salient distinction between the Batching algorithms and the Lazy algorithm is that the former base their triggering decisions solely on the progress of event time, whereas the Lazy algorithm considers the actual values of arriving data, relying on time as merely one input to its more robust triggering decisions. We consider a Sum aggregation throughout these simulations.
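To make the contrast concrete, the sketch below shows how the two time-based baselines map onto Beam's Java triggering API for a per-key Sum over hopping (sliding) windows. The input collection edgeArrivals, the window and hop durations, the allowed-lateness horizon, and the choice of discarding fired panes are illustrative assumptions; the Lazy policy has no counterpart in this sketch, since it requires a trigger that fires on accumulated per-key error rather than on time alone.

import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

class TriggerSketches {
    /**
     * Batching: the default event-time behavior. The pane fires when the watermark
     * passes the end of the window; with zero allowed lateness, late arrivals are dropped.
     */
    static PCollection<KV<String, Long>> batching(
            PCollection<KV<String, Long>> edgeArrivals, Duration windowLength, Duration hop) {
        return edgeArrivals
            .apply(Window.<KV<String, Long>>into(SlidingWindows.of(windowLength).every(hop))
                .triggering(AfterWatermark.pastEndOfWindow())
                .withAllowedLateness(Duration.ZERO)
                .discardingFiredPanes())
            .apply(Sum.longsPerKey());
    }

    /**
     * Batching+LF: the same watermark trigger, plus a late firing for each late arrival
     * admitted within the allowed-lateness horizon. With discarding panes, each late pane
     * carries only the late arrivals' contribution, which the center folds into its total.
     */
    static PCollection<KV<String, Long>> batchingWithLateFirings(
            PCollection<KV<String, Long>> edgeArrivals, Duration windowLength, Duration hop) {
        return edgeArrivals
            .apply(Window.<KV<String, Long>>into(SlidingWindows.of(windowLength).every(hop))
                .triggering(AfterWatermark.pastEndOfWindow()
                    .withLateFirings(AfterPane.elementCountAtLeast(1)))
                .withAllowedLateness(windowLength)
                .discardingFiredPanes())
            .apply(Sum.longsPerKey());
    }
}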

Hopping Windows

We begin by exploring traffic, staleness, and error for these three algorithms as the ratio W/H varies for hopping windows, as shown in Figure 6.10. We consider a mean delay of 1% of the window length, and use a 99th-percentile watermark. We again see the expected trend: traffic and staleness increase with W/H. We also observe that the value-based Lazy triggering approach yields better traffic, staleness, and error than the baseline time-based approaches (strictly speaking, Batching+LF achieves lower error, but this is not meaningfully better, since the application is explicitly error-tolerant). Note that the late-firing variant adds only a small amount of additional traffic over the baseline Batching algorithm, as our heuristic watermark yields only about 1% late data. The baseline Batching algorithm avoids some flushes by disregarding all late arrivals, while Batching+LF generates an additional update for every late arrival.


[Figure 6.10 appears here: Traffic, staleness, and error for a Sum aggregation under an error constraint, using a 99th-percentile heuristic watermark, for arrivals with mean delay of 1% of the window length, comparing the Batching, Batching+LF, and Lazy algorithms as W/H ranges from 1 to 64. (a) Mean traffic normalized relative to Streaming. (b) Median staleness normalized by the window length. (c) Median error normalized by the maximum possible error for our inputs.]

The value-based Lazy algorithm, on the other hand, only flushes keys that strictly must be flushed to satisfy the error constraint. This more principled approach can significantly reduce traffic, as shown in Figure 6.10(a).

Because the Lazy algorithm reduces overall traffic while also making many flushes in advance of the end of the window, it also achieves much lower staleness than the Batching algorithms. Figure 6.10(b) shows that the staleness impact of late firings is small. Much of the staleness arises from the delay in communicating the burst of traffic at the end of the window, as opposed to the additional delay of streaming late arrivals.

In terms of error, Figure 6.10(c) shows that the advantage of a value-based approach is even more pronounced. The reason is that a value-based approach such as the Lazy algorithm makes systematic decisions about which flushes to avoid, whereas the purely time-based approach is oblivious to the importance of arrivals, in particular late arrivals. Note that the late-firing approach achieves zero error, as all late arrivals are streamed to the center.

Overall, we see that a value-based triggering policy achieves better traffic, staleness, and error than a time-based trigger. For applications that care about the timeliness-accuracy-cost tradeoff, it is much better to use a value-based approach. Fortunately, even a simple value-based approach like the Lazy algorithm we use here can be implemented in Beam as an error-bound algorithm. With better algorithms, and extensions to execution engines to support cross-key, cross-window triggering policies, the advantage of value-based approaches over time-based approaches would only widen.

Effect of Delay

To dig deeper, we explore how traffic, staleness, and error vary with the mean delay. Figure 6.11 shows these key metrics for the two time-based Batching algorithms and the value-based Lazy algorithm for tumbling windows with a range of delays applied to our Akamai trace. We see no clear trend in traffic or error as mean delay grows. The reason is that mean delay has no systematic effect on which keys are late, or how many keys receive sufficient arrivals to warrant flushing an update. Perhaps the most striking result is that for staleness in Figure 6.11(b). Here we see that staleness for the value-based Lazy algorithm varies closely with mean delay.

[Figure 6.11 appears here: Traffic, staleness, and error for a Sum aggregation under an error constraint for tumbling windows, using a 99th-percentile heuristic watermark, comparing the Batching, Batching+LF, and Lazy algorithms as the normalized mean delay ranges from 0.1% to 100% of the window length. (a) Mean traffic normalized relative to Streaming. (b) Median staleness normalized by the window length. (c) Median error normalized by the maximum possible error for our inputs.]

When arrivals are minimally delayed, staleness is also small. The time-based Batching algorithms do not exhibit this same behavior. When delay is low, these algorithms still yield high staleness, as the entire batch of updates needs to be flushed after the end of the window. In other words, when arrivals are minimally delayed, their staleness arises from WAN delays. When arrivals are highly delayed, however, the delay of the incoming data begins to overshadow the delay in flushing updates. At very high delay, we see that the Batching algorithm achieves lower staleness than the Lazy algorithm, but this is only possible because the Batching algorithm neglects several critically important updates, and in doing so yields error more than two orders of magnitude greater than that of our Lazy algorithm. Aside from this single case, however, we again see that a value-based approach dominates a time-based approach by achieving better traffic, staleness, and error.

Effect of Watermark Heuristic

Finally, we explore the effect of varying the watermark heuristic, as this is one of the main knobs for trading off timeliness and accuracy in state-of-the-art stream processing systems. Figure 6.12 shows traffic, staleness, and error for the two Batching algorithms and our Lazy algorithm. We show two variants of our Lazy algorithm: one with the same error constraint as in the previous figures, Lazy (Hi), and one with a 50% lower error constraint, Lazy (Lo), to illustrate a value-based approach's ability to precisely constrain error.

From Figure 6.12(a), we see that the traffic penalty of late firings decreases as the watermark becomes more conservative. We would expect this trend, as a higher watermark percentile yields fewer late arrivals and therefore fewer redundant flushes for the late-firing algorithm.

As the watermark percentile increases, staleness for the Batching algorithms increases very slightly (Figure 6.12(b)). This happens because a stricter, more conservative watermark lags further behind real time than a more optimistic watermark. The trend in staleness is more pronounced for the Lazy algorithms, although they yield staleness roughly an order of magnitude lower than the Batching algorithms. Although the Lazy algorithms do not wait for the watermark to advance past the end of the window to begin flushing updates, they do base their scheduling decisions on the watermark's progress.

[Figure 6.12 appears here: Traffic, staleness, and error for a Sum aggregation under an error constraint for tumbling windows, for arrivals with mean delay of 1% of the window length, comparing the Batching, Batching+LF, Lazy (Hi), and Lazy (Lo) algorithms across 95th-, 99th-, and 99.9th-percentile watermarks. (a) Mean traffic normalized relative to Streaming. (b) Median staleness normalized by the window length. (c) Median error normalized by the maximum possible error for our inputs.]

As a result, a more conservative watermark will lead to slightly later flushes and in turn slightly higher staleness for the Lazy algorithms.

Figure 6.12(c) demonstrates the most pronounced effect of the watermark heuristic. For the Batching algorithm, a more conservative watermark (i.e., a higher watermark percentile) leads to fewer arrivals being considered late, and in turn lower error. As a result, the watermark heuristic can have a significant effect on error for a time-based triggering policy. It is worth noting that this figure shows median error over sixty windows; the trend in maximum error is much less pronounced, illustrating that, while the watermark heuristic can affect error, it does not afford the same precise control as we see from a value-based triggering policy.

Overall, it is much more powerful to use a value-based triggering policy along the lines of the Lazy algorithm. First, value-based algorithms allow us to explicitly control the error, rather than requiring the trial-and-error process that the time-based approaches would demand. By comparing Lazy (Lo) with Lazy (Hi), we can see how a value-based algorithm can deliver an arbitrary error constraint while making systematic tradeoffs for the other metrics of staleness and traffic. Second, value-based policies can achieve lower traffic, staleness, and error than the time-based approaches, as we have consistently observed in this section.

Again, it is worth emphasizing that the Lazy algorithm considered here, while inspired by the offline optimal error-constrained algorithm from Chapter 5, could be improved upon by allowing cross-key and network-aware triggering decisions. Further, we know that the serial-window approach is suboptimal, so it is possible to devise fundamentally more effective algorithms. Such algorithmic and engine-level improvements would only strengthen the advantage of principled value-based triggering policies over the current state of the art.

6.6 Concluding Remarks

While stream and batch computing have historically been viewed separately, it is increasingly important to develop both programming models and execution engines that unify them. Progress toward a unified programming model has been rapid, and new engines are constantly evolving. The current state of the art, however, requires programmers to answer the low-level question of when to materialize results. We argue that application developers are more concerned with their high-level requirements in terms of timeliness, accuracy, and cost, and that the choice of when to materialize results is a complex one that must consider dynamic network and data characteristics. Systems, therefore, must accept the responsibility for dynamically determining when to materialize results.

There are numerous research challenges to be addressed both for programming models and execution engines. Models should be extended to allow application developers to concisely and explicitly express their requirements along the dimensions of timeliness, accuracy, and cost. Execution engines must evolve to support more sophisticated triggering policies that can make decisions not just for individual keys in isolation, but also across multiple keys and windows, and with better runtime information such as awareness of network congestion.

Meanwhile, the optimization techniques employed by execution engines should be extended to address the new challenges that arise from more general windowing strategies and out-of-order inputs. The techniques presented in this thesis provide useful starting points.

Chapter 7

Conclusion

Many modern applications need to process a large volume of geographically distributed data on a geographically distributed platform with low latency. One key challenge in achieving this requirement is determining where to carry out the computation.

For applications that process unbounded streams of geo-distributed data, two key performance metrics are critical: WAN traffic and staleness. To achieve the goals of low traffic and low staleness, a system must determine when to communicate results from the edges to the center.

As an additional challenge, constrained WAN bandwidth often renders exact computation with bounded staleness infeasible. Many real-world applications can tolerate some staleness or inaccuracy in their final results, albeit with diverse preferences. In order to support these diverse preferences and achieve the desired staleness-error tradeoff, geo-distributed stream processing systems must determine what partial results to communicate.

In this thesis, we presented systematic answers to these three critical questions of where to compute, when to communicate, and what partial results to communicate, in both the batch and stream computing contexts. We also demonstrated the potential for these techniques and outlined open questions in the context of emerging systems for unified stream and batch computing.

7.1 Research Contributions

7.1.1 Batch Processing

In the batch processing context, we focused on the question of where to perform computation, specifically for MapReduce applications. We developed model-driven optimization techniques and applied the resulting insights to develop practical techniques for a real-world MapReduce implementation.

Model-Driven Optimization. We devised a model to predict the execution time of MapReduce applications in geographically distributed environments and an optimization framework for determining the best placement of data and computation. We used our model and optimization to derive several insights, for example that application characteristics can significantly influence the optimal data and computation placement, and that our optimization yields greater benefits as the environment becomes increasingly distributed. Further, we experimentally demonstrated that our optimization approach can significantly outperform standard Hadoop.

Practical Systems Techniques. Based on the insights from our model-driven optimization, we developed practical techniques for real-world MapReduce implementations such as Hadoop. Our Map-Aware Push technique optimizes the push and map phases by enabling push-map overlap that allows map behavior to feed back and affect the push phase. Similarly, our Shuffle-Aware Map technique optimizes the map and reduce phases to enable shuffle cost to feed back and affect map scheduling decisions. We demonstrated the effectiveness of these techniques through experiments on geographically distributed testbeds on both Amazon EC2 and PlanetLab.

7.1.2 Stream Processing

In the stream processing context, we developed techniques for jointly optimizing timeliness and cost for exact computation, and for trading off timeliness and accuracy for approximate computation.

Exact Computation. Our first contribution in the streaming context focused on applications requiring exact results. We examined the question of when to communicate results from the edges to the center in the context of windowed grouped aggregation in order to optimize two key metrics of any geo-distributed streaming analytics service: WAN traffic and staleness. We presented a family of optimal offline algorithms that jointly minimize both staleness and traffic. Building on this foundation, we developed practical online aggregation algorithms based on the observation that grouped aggregation can be modeled as a caching problem where the cache size varies over time. We demonstrated the practicality of these algorithms through an implementation in Apache Storm, deployed on a PlanetLab testbed. Our experiments, driven by workloads derived from anonymized traces of a web analytics service offered by a large commercial CDN, showed that our online aggregation algorithms perform close to the optimal offline algorithms for a variety of system configurations, stream arrival rates, and query types.

Approximate Computation. Continuing our focus on windowed grouped aggregation, we presented optimal offline algorithms for minimizing staleness under an error constraint and for minimizing error under a staleness constraint. Using these offline algorithms as references, we developed practical online algorithms for effectively trading timeliness and accuracy under bandwidth limitations. We evaluated these techniques through both simulation and experiments, and showed that our algorithms significantly reduce staleness and error relative to a practical random sampling/batching-based baseline across a diverse set of aggregation functions.

7.1.3 Unified Stream & Batch Processing

Finally, we focused on the current trend toward a unified programming model capable of expressing both batch and stream programs [15, 16] and unified computing engines capable of efficiently computing over both bounded batches and unbounded streams [17, 18]. We argued that the state-of-the-art programming model is both burdensome and insufficiently expressive for applications that value performance along the key dimensions of timeliness, accuracy, and cost. We also highlighted opportunities to apply our optimization techniques to emerging execution engines, and noted where these techniques need to be extended.

7.2 Future Research Directions

While the work presented here advances the state of the art, it also illuminates open problems worthy of future research. Here we briefly describe several possible research directions.

7.2.1 Alternative Performance Metrics and Optimization Objectives

Throughout this thesis, we have focused on performance along the key dimensions of timeliness, accuracy, and cost. Specifically, we minimized makespan in the batch computing context, and focused on staleness, error, and WAN traffic in the streaming context. While these are critical metrics for many applications and systems, they are certainly not the only metrics worth optimizing.

Storage and energy are two other dimensions to consider. For example, storage constraints, which may exhibit pronounced heterogeneity in geo-distributed settings, can restrict the options for where to place computation. Further, a stream processing system may be forced to communicate partial results early if it lacks sufficient storage. The techniques throughout this thesis could be extended to incorporate storage constraints.

Minimizing energy consumption (both average and peak) is increasingly important. Energy consumption for an analytics application is a non-trivial function of its storage, communication, and computation requirements, all of which vary based on query and data characteristics. Our model-driven optimization for geo-distributed MapReduce could be extended to incorporate energy constraints. An alternative formulation could use energy consumption as an optimization objective subject to a constraint on makespan. In the stream computing context, we have assumed that staleness arises due to constrained WAN bandwidth. Under an energy constraint, or with an objective of minimizing energy consumption, however, the bottleneck may lie elsewhere. Studying the tradeoffs between energy consumption and other key performance metrics would be an interesting future direction.

Even within the timeliness-accuracy-cost tradeoff space we have studied here, there are alternative problem formulations that may be more appropriate for some applications. For example, in a deployment with limited compute capacity at the edges, it may be necessary to apply (or develop new) load-shedding techniques, or to focus on architectures other than the hub-and-spoke model we have considered (e.g., hierarchical topologies). In other cases, it is possible that WAN bandwidth, while costly, is constrained at a system level, but plentiful relative to any individual application. In such cases, it may be worthwhile to extend our caching-based techniques to operate under a bandwidth constraint, or at a higher level to develop techniques for optimizing the allocation of bandwidth across multiple applications. It may also be interesting to incorporate compute and storage costs, and to study scenarios where costs vary with time, for example due to fluctuating demand.

7.2.2 Alternative Applications

This thesis has focused primarily on applications that can be implemented via grouped aggregation, and in the streaming context we have largely confined our attention to tumbling windows. As Chapter 6 highlighted, even a seemingly minor generalization (in particular, hopping windows) can uncover a host of interesting research challenges. Session windows [15] introduce yet more challenges, and would be an interesting focal point for further research.

We have considered computation over geographically distributed data, but we have not yet studied geography as an attribute of the data. In many compelling real-world applications, however, geography is a critical data attribute [96]. For example, an analytics user might perform grouped aggregation to track the popularity of live video streams by city or state, to understand the effectiveness of an advertising campaign in different regions, or to understand trending conversation topics in different countries. In many cases, the geographic attributes of data may correlate with the physical location of data sources, and there may be interesting opportunities to leverage this correlation to inform the decision of where to place computation.

Moving beyond aggregation would expose further research challenges. A particularly interesting topic would be iterative applications. We have focused throughout this thesis on applications where data flows through the system only once, but many modern applications, especially those over graphs [97, 98], process inputs repeatedly until converging to a final result. This pattern of iterative refinement exposes an interesting opportunity for trading timeliness for accuracy and cost. In addition, systems for iterative computation have typically been confined to a single datacenter, and with relatively few exceptions (e.g., Kineograph [99], Naiad [100]), they have assumed static batch inputs. Optimizing timeliness, accuracy, and cost for iterative applications in geographically distributed settings, and additionally over streaming inputs, would yield a host of fascinating new research challenges.

7.2.3 Scalable System Implementations

We have focused more in this thesis on general application models (MapReduce and windowed grouped aggregation) than on specific implementations (e.g., Hadoop, Flink). While we have experimentally demonstrated that our techniques can apply in real-world systems, scaling our techniques, both across resources and across applications, presents several interesting challenges. As discussed in Section 5.6, coordination across multiple edges could improve our ability to trade timeliness and accuracy in geo-distributed stream computing. Distributed function monitoring techniques [61, 88] may be a useful starting point. Providing richer triggering abstractions in Beam, while scaling execution engines across multiple cores, nodes, and edges, is also an important direction, as discussed in Chapter 6.

Scaling to multiple applications is another interesting direction. For example, Clarinet [101] considers collections of batch jobs in geo-distributed settings, and selects query execution plans that share common sub-computations and schedules that consider the interactions between jobs. Meanwhile, Vega [102] focuses on batch jobs in a single datacenter and reuses relevant existing results when queries are modified and rerun. These systems both make their primary contributions at the level of query planning and optimization. If these techniques were extended to address geo-distributed streaming analytics, then we could use them to optimize query plans while using the techniques in this thesis to answer the lower-level questions of where to place computation, when to communicate partial results, and what partial results to communicate. As we discussed in Section 2.4.3, there may be interesting interactions between these static and dynamic optimization techniques that warrant further research.

References

[1] Q4 15 letter to shareholders, 2016. http://ir.netflix.com.

[2] Evolution of the Netflix data pipeline, 2016. http://techblog.netflix.com/2016/02/evolution-of-netflix-data-pipeline.html.

[3] Facebook reports fourth quarter and full year 2015 results, 2016. http://investor.fb.com/releasedetail.cfm?ReleaseID=952040.

[4] Qi Huang, Ken Birman, Robbert van Renesse, Wyatt Lloyd, Sanjeev Kumar, and Harry C. Li. An analysis of facebook photo caching. In Proc. of SOSP, pages 167–181, November 2013.

[5] Teaching machines to see and understand, 2015. https://research.facebook.com/blog/teaching-machines-to-see-and-understand.

[6] Computing, 2015. http://home.cern/about/computing.

[7] Detection — LIGO lab — Caltech, 2016. https://www.ligo.caltech.edu/detection.

[8] Ashish Vulimiri, Carlo Curino, Brighten Godfrey, Konstantinos Karanasos, and George Varghese. WANalytics: Analytics for a geo-distributed data-intensive world. In Proc. of CIDR, January 2015.

[9] Erik Nygren, Ramesh K. Sitaraman, and Jennifer Sun. The Akamai network: a platform for high-performance internet applications. SIGOPS Oper. Syst. Rev., 44(3):2–19, August 2010.

[10] Tony Hey, Stewart Tansley, and Kristin Tolle, editors. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, Redmond, Washington, 2009.

[11] A conversation with Jim Gray. Queue, 1(4):8–17, June 2003.

[12] Ariel Rabkin, Matvey Arye, Siddhartha Sen, Vivek S. Pai, and Michael J. Freedman. Aggregation and degradation in JetStream: Streaming analytics in the wide area. In Proc. of NSDI, pages 275–288, 2014.

[13] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. of OSDI, pages 137–150, 2004.

[14] B. Chun, D. Culler, T. Roscoe, A. Bavier, L. Peterson, M. Wawrzoniak, and M. Bowman. PlanetLab: an overlay testbed for broad-coverage services. ACM SIGCOMM, 33(3):3–12, 2003.

[15] Tyler Akidau et al. The dataflow model: A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proc. of the VLDB Endowment, 8:1792–1803, 2015.

[16] Apache beam (incubating), June 2016. http://beam.incubator.apache.org/.

[17] Apache Flink: Scalable Batch and Stream Data Processing, 2016. https://flink.apache.org/.

[18] Apache Apex, 2016. https://apex.apache.org/.

[19] Benjamin Heintz, Chenyu Wang, Abhishek Chandra, and Jon Weissman. Cross-phase optimization in MapReduce. In Proc. of IEEE IC2E, pages 338–347, March 2013.

[20] Benjamin Heintz, Abhishek Chandra, Ramesh K. Sitaraman, and Jon Weissman. End-to-end optimization for geo-distributed MapReduce. IEEE Transactions on Cloud Computing, 4(3):293–306, July 2016.

[21] Benjamin Heintz, Abhishek Chandra, and Ramesh K. Sitaraman. Towards optimizing wide-area streaming analytics. In Proc. of the 2nd IEEE Workshop on Cloud Analytics, 2015.

[22] Benjamin Heintz, Abhishek Chandra, and Ramesh K. Sitaraman. Optimizing grouped aggregation in geo-distributed streaming analytics. In Proc. of HPDC, pages 133–144, June 2015.

[23] Benjamin Heintz, Abhishek Chandra, and Ramesh K. Sitaraman. Trading timeliness and accuracy in geo-distributed streaming analytics. In Proc. of ACM SoCC, pages 361–373, 2016.

[24] Hadoop, 2016. http://hadoop.apache.org.

[25] Sriram Rao, Raghu Ramakrishnan, Adam Silberstein, Mike Ovsiannikov, and Damian Reeves. Sailfish: A framework for large scale data processing. In Proc. of ACM SoCC, pages 4:1–4:14, 2012.

[26] Herodotos Herodotou, Fei Dong, and Shivnath Babu. No one (cluster) size fits all: automatic cluster sizing for data-intensive analytics. In Proc. of ACM SoCC, pages 18:1–18:14, 2011.

[27] Virajith Jalaparti, Hitesh Ballani, Paolo Costa, Thomas Karagiannis, and Antony Rowstron. Bridging the Tenant-Provider Gap in Cloud Services. In Proc. of ACM SoCC, pages 6:1–6:14, 2012.

[28] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, and Russell Sears. MapReduce Online. In Proc. of NSDI, pages 313–327, 2010.

[29] Abhishek Verma, Nicolas Zea, Brian Cho, Indranil Gupta, and Roy H. Campbell. Breaking the MapReduce stage barrier. In Proc. of IEEE Cluster, pages 235–244, 2010.

[30] H.P. Williams. Model building in mathematical programming. Wiley, 1999.

[31] Free eBooks by Project Gutenberg. http://www.gutenberg.org/.

[32] Martin Arlitt and Tai Jin. Workload characterization of the 1998 World Cup Web Site. Technical Report HPL-1999-35R1, HP Labs, September 1999.

[33] Jimmy Lin and Chris Dyer. Data-Intensive Text Processing with MapReduce. Morgan and Claypool Publishers, 2010.

[34] Steve Ko, Imranul Hoque, Brian Cho, and Indranil Gupta. Making Cloud Intermediate Data Fault-Tolerant. In Proc. of ACM SoCC, pages 181–192, 2010.

[35] Balaji Palanisamy, Aameek Singh, Ling Liu, and Bhushan Jain. Purlieus: locality-aware resource allocation for MapReduce in a cloud. In Proc. of ACM SC, pages 58:1–58:11, 2011.

[36] Mohammad Hammoud, M. Suhail Rehman, and Majd F. Sakr. Center-of-Gravity reduce task scheduling to lower MapReduce network traffic. In Proc. of IEEE Cloud, pages 49–58, 2012.

[37] Hrishikesh Gadre, Ivan Rodero, and Manish Parashar. Investigating MapReduce framework extensions for efficient processing of geographically scattered datasets. SIGMETRICS Perf. Eval. Rev., 39(3):116–118, 2011.

[38] Shingyu Kim, Junghee Won, Hyuck Han, Hyeonsang Eom, and Heon Y. Yeom. Improving Hadoop performance in intercloud environments. In Proc. of ACM SIGMETRICS, pages 107–109, 2011.

[39] Michael Cardosa, Chenyu Wang, Anshuman Nangia, Abhishek Chandra, and Jon Weissman. Exploring MapReduce Efficiency with Highly-Distributed Data. In Proc. of MAPREDUCE, pages 27–33, 2011.

[40] Yuan Luo, Zhenhua Guo, Yiming Sun, Beth Plale, Judy Qiu, and Wilfred W. Li. A hierarchical framework for cross-domain MapReduce execution. In Proc. of ECMLS, pages 15–22, 2011.

[41] Matei Zaharia, Andrew Konwinski, Anthony D. Joseph, Randy H. Katz, and Ion Stoica. Improving MapReduce performance in heterogeneous environments. In Proc. of OSDI, pages 29–42, 2008.

[42] Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, and Bikas Saha. Reining in the outliers in map-reduce clusters using mantri. In Proc. of OSDI, pages 265–278, 2010.

[43] Faraz Ahmad, Srimat Chakradhar, Anand Raghunathan, and T. N. Vijaykumar. Tarazu: Optimizing MapReduce on heterogeneous clusters. In Proc. of ASPLOS, pages 61–74, 2012.

[44] Michael Cardosa, Piyush Narang, Abhishek Chandra, Himabindu Pucha, and Aameek Singh. STEAMEngine: Driving MapReduce Provisioning in the Cloud. In Proc. of HiPC, 2011.

[45] Herodotos Herodotou, Harold Lim, Gang Luo, Nedyalko Borisov, Liang Dong, Fatma Bilgen Cetin, and Shivnath Babu. Starfish: A self-tuning system for big data analytics. In Proc. of CIDR, pages 261–272, 2011.

[46] Thomas Sandholm and Kevin Lai. MapReduce optimization using dynamic regulated prioritization. In Proc. of ACM SIGMETRICS, pages 299–310, 2009.

[47] Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, and Ion Stoica. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In Proc. of EuroSys, pages 265–278, 2010.

[48] Michael Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar, and Andrew Goldberg. Quincy: fair scheduling for distributed computing clusters. In Proc. of SOSP, pages 261–276, 2009.

[49] Rich Wolski, Neil T. Spring, and Jim Hayes. The network weather service: A distributed resource performance forecasting service for metacomputing. Future Generation Computing Systems, 15:757–768, 1999.

[50] Jinoh Kim, Abhishek Chandra, and Jon Weissman. Passive network performance estimation for large-scale, data-intensive computing. IEEE TPDS, 22(8):1365–1373, Aug. 2011.

[51] Qi He, Constantine Dovrolis, and Mostafa Ammar. On the predictability of large transfer TCP throughput. In Proc. of ACM SIGCOMM, pages 145–156, 2005.

[52] Quan Chen, Daqiang Zhang, Minyi Guo, Qianni Deng, and Song Guo. SAMR: A self-adaptive MapReduce scheduling algorithm in heterogeneous environment. In Proc. of IEEE CIT, pages 2736–2743, 2010.

[53] Heshan Lin, Xiaosong Ma, Jeremy Archuleta, Wu-chun Feng, Mark Gardner, and Zhe Zhang. MOON: MapReduce on opportunistic environments. In Proc. of HPDC, pages 95–106, 2010.

[54] Fernando Costa, Luis Silva, and Michael Dahlin. Volunteer cloud computing: MapReduce over the internet. In Proc. of IEEE IPDPSW, pages 1855–1862, 2011.

[55] Shivnath Babu. Towards automatic optimization of MapReduce programs. In Proc. of ACM SoCC, pages 137–142, 2010.

[56] GridFTP. http://globus.org/toolkit/docs/3.2/gridftp/.

[57] BitTorrent. http://www.bittorrent.com.

[58] Storm, distributed and fault-tolerant realtime computation. http://storm.apache.org/, 2015.

[59] Oscar Boykin, Sam Ritchie, Ian O’Connel, and Jimmy Lin. Summingbird: A framework for integrating batch and online mapreduce computations. In Proc. of VLDB, volume 7, pages 1441–1451, 2014.

[60] Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica. Discretized streams: Fault-tolerant streaming computation at scale. In Proc. of SOSP, pages 423–438, 2013.

[61] Graham Cormode. Continuous distributed monitoring: A short survey. In Proc. of AlMoDEP, pages 1–10, 2011.

[62] Stefan Podlipnig and László Böszörményi. A survey of web cache replacement strategies. ACM Comput. Surv., 35(4):374–398, December 2003.

[63] J. Bolliger and T.R. Gross. Bandwidth monitoring for network-aware applications. In Proc. of HPDC, pages 241–251, 2001.

[64] google/snappy: A fast compressor/decompressor. https://github.com/google/snappy/, 2016.

[65] Philippe Flajolet, Éric Fusy, Olivier Gandouet, et al. HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm. In Proc. of AOFA, 2007.

[66] Jim Gray et al. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Min. Knowl. Discov., 1(1):29–53, January 1997.

[67] Yuan Yu, Pradeep Kumar Gunda, and Michael Isard. Distributed aggregation for data-parallel computing: interfaces and implementations. In Proc. of SOSP, pages 247–260, 2009.

[68] P.-A. Larson. Data reduction by partial preaggregation. In Proc. of ICDE, pages 706–715, 2002.

[69] Samuel Madden, Michael J. Franklin, Joseph M. Hellerstein, and Wei Hong. TAG: A Tiny AGgregation service for ad-hoc sensor networks. In Proc. of OSDI, 2002.

[70] R. Rajagopalan and P.K. Varshney. Data-aggregation techniques in sensor networks: A survey. IEEE Communications Surveys & Tutorials, 8(4):48–63, 2006.

[71] Hrishikesh Amur et al. Memory-efficient groupby-aggregate using compressed buffer trees. In Proc. of SoCC, 2013.

[72] Tyler Akidau et al. MillWheel: Fault-tolerant stream processing at internet scale. Proc. of VLDB Endow., 6(11):1033–1044, August 2013.

[73] Sirish Chandrasekaran et al. TelegraphCQ: Continuous dataflow processing for an uncertain world. In Proc. of CIDR, 2003.

[74] Sanjeev Kulkarni et al. Twitter heron: Stream processing at scale. In Proc. of SIGMOD, pages 239–250, 2015.

[75] Zhengping Qian et al. TimeStream: reliable stream computation in the cloud. In Proc. of EuroSys, pages 1–14, 2013.

[76] Daniel J. Abadi et al. The design of the borealis stream processing engine. In Proc. of CIDR, pages 277–289, 2005.

[77] Arvind Arasu et al. STREAM: the Stanford data stream management system. In Data Stream Management - Processing High-Speed Data Streams, pages 317–336. 2016.

[78] Qifan Pu et al. Low latency geo-distributed data analytics. In Proc. of SIGCOMM, pages 421–434, August 2015.

[79] P. Pietzuch et al. Network-aware operator placement for stream-processing systems. In Proc. of ICDE, 2006.

[80] Jeong-Hyon Hwang, U. Cetintemel, and S. Zdonik. Fast and highly-available stream processing over wide area networks. In Proc. of ICDE, pages 804–813, 2008.

[81] James Cipar et al. LazyBase: trading freshness for performance in a scalable database. In Proc. of EuroSys, pages 169–182, 2012.

[82] Sameer Agarwal et al. BlinkDB: queries with bounded errors and bounded response times on very large data. In Proc. of EuroSys, pages 29–42, 2013.

[83] Tathagata Das, Yuan Zhong, Ion Stoica, and Scott Shenker. Adaptive stream processing using dynamic batch sizing. In Proc. of SoCC, pages 16:1–16:13, 2014.

[84] Larry Peterson, Tom Anderson, David Culler, and Timothy Roscoe. A blueprint for introducing disruptive technology into the internet. SIGCOMM Comput. Commun. Rev., 33(1):59–64, January 2003.

[85] Graham Cormode, Theodore Johnson, Flip Korn, S. Muthukrishnan, Oliver Spatscheck, and Divesh Srivastava. Holistic UDAFs at streaming speeds. In Proc. of SIGMOD, pages 35–46, 2004.

[86] Graham Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. J. Algorithms, 55(1):58–75, April 2005.

[87] Bradley Efron. Better bootstrap confidence intervals. Journal of the American Statistical Association, 82(397):171–185, 1987.

[88] Nikos Giatrakos, Antonios Deligiannakis, and Minos Garofalakis. Scalable approximate query tracking over highly distributed data streams. In Proc. of SIGMOD, pages 1497–1512, 2016.

[89] Mark Boddy. Anytime problem solving using dynamic programming. In Proc. of AAAI, pages 738–743, 1991.

[90] Shlomo Zilberstein. Operational rationality through compilation of anytime algorithms. Technical report, 1993.

[91] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In Proc. of SOSP, pages 29–43, 2003.

[92] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The hadoop distributed file system. In Proc. of MSST, pages 1–10, 2010.

[93] Trident API Overview. http://storm.apache.org/releases/current/Trident-API-Overview.html.

[94] Nathan Marz and James Warren. Big Data: Principles and best practices of scalable real-time data systems. Manning, April 2015.

[95] Questioning the Lambda Architecture. https://www.oreilly.com/ideas/questioning-the-lambda-architecture.

[96] A. Magdy, M. F. Mokbel, S. Elnikety, S. Nath, and Y. He. Venus: Scalable real-time spatial queries on microblogs with adaptive load shedding. IEEE Transactions on Knowledge and Data Engineering, 28(2):356–370, Feb 2016.

[97] Grzegorz Malewicz et al. Pregel: A system for large-scale graph processing. In Proc. of SIGMOD, pages 135–146, 2010.

[98] Joseph E. Gonzalez et al. GraphX: Graph processing in a distributed dataflow framework. In Proc. of OSDI, pages 599–613, 2014.

[99] Raymond Cheng et al. Kineograph: Taking the pulse of a fast-changing and connected world. In Proc. of EuroSys, pages 85–98, 2012.

[100] Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. Naiad: A timely dataflow system. In Proc. of SOSP, pages 439–455. ACM, 2013.

[101] Raajay Viswanathan, Ganesh Ananthanarayanan, and Aditya Akella. CLARINET: WAN-aware optimization for analytics queries. In Proc. of OSDI, pages 435–450, November 2016.

[102] Matteo Interlandi et al. Optimizing interactive development of data-intensive applications. In Proc. of SoCC, pages 510–522, 2016.