UC Berkeley Electronic Theses and Dissertations

Title Scalable Systems for Large Scale Dynamic Connected Data Processing

Permalink https://escholarship.org/uc/item/9029r41v

Author Padmanabha Iyer, Anand

Publication Date 2019

Peer reviewed | Thesis/dissertation

Scalable Systems for Large Scale Dynamic Connected Data Processing

by

Anand Padmanabha Iyer

A dissertation submitted in partial satisfaction of the

requirements for the degree of

Doctor of Philosophy

in

Computer Science

in the

Graduate Division

of the

University of California, Berkeley

Committee in charge:

Professor Ion Stoica, Chair Professor Scott Shenker Professor Michael J. Franklin Professor Joshua Bloom

Fall 2019

Scalable Systems for Large Scale Dynamic Connected Data Processing

Copyright 2019 by Anand Padmanabha Iyer

Abstract

Scalable Systems for Large Scale Dynamic Connected Data Processing

by

Anand Padmanabha Iyer

Doctor of Philosophy in Computer Science

University of California, Berkeley

Professor Ion Stoica, Chair

As the proliferation of sensors rapidly makes the Internet-of-Things (IoT) a reality, the devices and sensors in this ecosystem—such as smartphones, video cameras, home automation systems, and autonomous vehicles—constantly map out the real world, producing unprecedented amounts of dynamic, connected data that captures complex and diverse relations. Unfortunately, existing processing and machine learning frameworks are ill-suited for analyzing such dynamic connected data and face several challenges when employed for this purpose.

This dissertation focuses on the design and implementation of scalable systems for dynamic connected data processing. We discuss simple abstractions that make it easy to operate on such data, efficient data structures for state management, and computation models that reduce redundant work. We also describe how bridging theory and practice with algorithms and techniques that leverage approximation and streaming theory can significantly speed up connected data computations. Leveraging these, the systems described in this dissertation achieve more than an order of magnitude improvement over the state-of-the-art.

To my family

Contents

Contents

List of Figures

List of Tables

1 Introduction
  1.1 Motivating Examples
    1.1.1 Alex Diagnoses Network Issues
    1.1.2 Taylor Finds Financial Frauds
    1.1.3 Alex & Taylor Aren't Alone!
  1.2 Problems with Existing Systems
    1.2.1 Programmability
    1.2.2 Storage
    1.2.3 Performance
  1.3 Solution Overview
  1.4 Dissertation Plan

2 Background
  2.1 Data Parallel Processing
  2.2 Graph Parallel Processing
    2.2.1 Property Graph Model
    2.2.2 Graph Parallel Abstractions

3 Ad-Hoc Analytics on Dynamic Connected Data
  3.1 Introduction
  3.2 Background & Challenges
    3.2.1 Time-evolving Graph Workloads
    3.2.2 Limitations of Existing Solutions
    3.2.3 Challenges
  3.3 Tegra Design
    3.3.1 Timelapse Abstraction & API
    3.3.2 Evolving Graph Analytics Using Timelapse
  3.4 Computation Model
    3.4.1 Incremental Graph Computations
    3.4.2 ICE Computation Model
    3.4.3 ICE vs Streaming Systems
    3.4.4 Improving ICE Model
  3.5 Distributed Graph Snapshot Index (DGSI)
    3.5.1 Leveraging Persistent Datastructures
    3.5.2 Graph Storage & Partitioning
    3.5.3 Version Management
    3.5.4 Memory Management
  3.6 Implementation
    3.6.1 ICE on GAS Model
    3.6.2 Using Tegra as a Developer
  3.7 Evaluation
    3.7.1 Microbenchmarks
    3.7.2 Ad-hoc Window Operations
    3.7.3 Timelapse & ICE
    3.7.4 Tegra Shortcomings
  3.8 Related Work
  3.9 Summary

4 Pattern Mining in Dynamic Connected Data
  4.1 Introduction
  4.2 Background & Motivation
    4.2.1 Graph Pattern Mining
    4.2.2 Approximate Pattern Mining
    4.2.3 Graph Pattern Mining Theory
    4.2.4 Challenges
  4.3 ASAP Overview
  4.4 Approximate Pattern Mining in ASAP
    4.4.1 Extending to General Patterns
    4.4.2 Applying to Distributed Settings
    4.4.3 Advanced Mining Patterns
  4.5 Building the Error-Latency Profile (ELP)
    4.5.1 Building Estimator vs. Time Profile
    4.5.2 Building Estimator vs. Error Profile
    4.5.3 Handling Evolving Graphs
  4.6 Evaluation
    4.6.1 Overall Performance
    4.6.2 Advanced Pattern Mining
    4.6.3 Effectiveness of ELP Techniques
    4.6.4 Scaling ASAP on a Cluster
    4.6.5 More Complex Patterns
  4.7 Related Work
  4.8 Summary

5 Approximate Analytics on Dynamic Connected Data
  5.1 Introduction
  5.2 Background & Challenges
    5.2.1 Graph Processing Systems
    5.2.2 Approximate Analytics
    5.2.3 Challenges
  5.3 Our Approach
    5.3.1 Graph Sparsification
    5.3.2 Picking Sparsification Parameter
  5.4 Evaluation
  5.5 Related Work
  5.6 Summary

6 Geo-Distributed Analytics on Connected Data
  6.1 Introduction
  6.2 Background & Challenges
    6.2.1 Geo-Distributed Analytics
    6.2.2 Challenges
  6.3 Our Proposal
    6.3.1 Assumptions
    6.3.2 Approach
  6.4 Evaluation
  6.5 Related Work
  6.6 Summary

7 Real-time Decisions on Dynamic Connected Data
  7.1 Introduction
  7.2 Background and Motivation
    7.2.1 LTE Network Primer
    7.2.2 RAN Troubleshooting Today
    7.2.3 Machine Learning for RAN Diagnostics
  7.3 CellScope Overview
    7.3.1 Problem Statement
    7.3.2 Architectural Overview
  7.4 Mitigating Latency Accuracy Trade-off
    7.4.1 Feature Engineering
    7.4.2 Multi-Task Learning
    7.4.3 Data Grouping for MTL
    7.4.4 Summary
  7.5 Implementation
    7.5.1 Data Grouping API
    7.5.2 Hybrid MTL Modeling
  7.6 Evaluation
    7.6.1 Benefits of Similarity Based Grouping
    7.6.2 Benefits of MTL
    7.6.3 Combined Benefits
    7.6.4 Hybrid model benefits
  7.7 Real World RAN Analysis
    7.7.1 Time Savings to the Operator
    7.7.2 Analysis in the Wild: Findings
  7.8 Extending CellScope to a New Domain
  7.9 Related Work
  7.10 Summary

8 Conclusions & Future Work
  8.1 Future Work
  8.2 Closing Thoughts

Bibliography

List of Figures

1.1 Alex, a network administrator, diagnoses issues using data collected in the network.
1.2 Taylor, a financial analyst, uses transaction data to train a model to detect money laundering.
1.3 Representing connected data as property graphs.

3.1 A timelapse of graph G consisting of three snapshots. For temporal analytics, instead of applying graph-parallel operations independently on each snapshot (left), timelapse enables them to be applied to all snapshots in parallel (right).
3.2 (1) Connected components by label propagation on snapshot G1 produces R1. (2) Vertex A and edge A − B is deleted in G2. Using the last result to bootstrap computation results in incorrect answer R2. (3) A strawman approach of storing all messages during the initial execution and replaying it produces correct results, but needs to store large amounts of state.
3.3 Two examples that depict how ICE works. Dotted circles indicate vertices that recompute, and double circles indicate vertices that need to be present in the subgraph to compute the correct answer, but do not recompute state themselves. (1) Iterations of initial execution is stored in the timelapse. (2) ICE bootstraps computation on a new snapshot, by finding the subgraph consisting of affected vertices and their dependencies (neighbors). In the second example, C is affected by the deletion of A − C. To recompute state it needs D (yields subgraph C − D). (3) At every iteration, after execution of the computation on the subgraph, ICE copies state for entities that did not recompute. Then finds the new subgraph to compute by comparing the previous subgraph to the timelapse snapshot. In the second example, though C recomputes the same value as in previous iteration, its state is different from the snapshot in timelapse and hence needs to be propagated. (4) ICE terminates when the subgraph converges and no entity in the graph needs the state copied from stored snapshots in the timelapse.
3.4 Tegra's DGSI consists of one pART datastructure for vertices and one for edges on each of the partitions. Here, a vertex pART stores properties in its leaves. Vertex id traverses the tree to the leaf storing its property. Changes generate new versions.
3.5 DGSI has fine-grained control over leaves (where data is stored). Here DGSI has 1000s of snapshots. All snapshots except S is on disk, their parents just hold pointers to the file. Parents are also dynamically written to disk if all of their children are on disk. Datastructure uses adaptive leaf sizes for efficiency.
3.6 Snapshot retrieval latency in DD incurs cost and degrades with time.
3.7 Differential dataflow generates state at every operator, while Tegra's state is proportional to the number of vertices.
3.8 On ad-hoc queries on snapshots, Tegra is able to significantly outperform due to state reuse.
3.9 Tegra's performance is superior on ad-hoc window operations even with materialization of results.
3.10 Timelapse can be used to optimize graph-parallel stage.
3.11 Sharing state across queries results in reduction in memory usage and also improvement in performance.
3.12 Incremental computations are not always useful. Tegra can switch to full re-execution when this is the case.
3.13 Monotonicity of updates (additions only) can be leveraged to speed up computations by starting from the last answer.
3.14 DD significantly outperforms Tegra for purely streaming analysis.

4.1 Simply extending approximate processing techniques to graph pattern mining does not work.
4.2 Triangle count by neighborhood sampling.
4.3 ASAP architecture.
4.4 Two ways to sample four cliques. (a) Sample two adjacent edges (0, 1) and (0, 3), sample another adjacent edge (1, 2), and wait for the other three edges. (b) Sample two disjoint edges (0, 1) and (2, 3), and wait for the other four edges.
4.5 Example approximate pattern mining programs written using ASAP API.
4.6 Runtime with graph partition.
4.7 The actual relations between number of estimators and run-time or error rate.
4.8 ASAP is able to gain up to 77× improvement in performance against Arabesque. The gains increase with larger graphs and more complex patterns. Y-axis is in log-scale.
4.9 Runtime vs. number of estimators for Twitter, Friendster, and UK graphs. The black solid lines are ASAP's fitted lines.
4.10 Error vs. number of estimators for Twitter, Friendster, and UK graphs.
4.11 CDF of 100 runs with 3% error target.
4.12 The errors from two cluster scenarios with different number of nodes. Config-1: strong-scaling to fix the total number of estimators as 2M × 128; Config-2: weak-scaling to fix the number of estimators per executor as 2M.
4.13 Two representative (from 21) patterns in 5-Motif.

5.1 Sampling randomly leads to undesirable effects. Here, execution time (speedup) increases (reduces) with smaller samples.
5.2 GAP System Architecture.
5.3 In triangle counting, we see similar trends in performance in graphs with similar characteristics (e.g., AstroPh & Facebook).
5.4 Error due to sparsification. Like the speedup, we see similarity in the error profile of graphs with similar characteristics.
5.5 Larger graph (uk-2007-05 [33, 31] with 3.7B edges) sees better speedup due to the distributed nature of the execution.

6.1 Monarch system architecture. Each data center (DC) contains part of the graph. Communication between DCs happen through border vertices (shaded vertex in the picture), who exchange and synchronize state as described in §6.3.
6.2 In incremental GAS model, each iteration marks and activates the neighborhood it influences. In this example, border vertex A is updated. It marks and activates B in the first iteration, B marks and activates C in the second iteration and C marks and activates D in the third. By leveraging the characteristics of the algorithm being executed, we avoid marking E and F although they are in the immediate neighborhood of C and B.
6.3 Our proposal is able to complete the execution of connected components when GraphX is unable to complete. This is because it tries to transfer too much data across WAN.

7.1 LTE network architecture.
7.2 Simply applying ML for RAN performance diagnosis results in a fundamental trade-off between latency and accuracy.
7.3 CellScope System Architecture.
7.4 CellScope achieves high accuracy while reducing the data collection latency.
7.5 Other domains suffer from latency-accuracy trade-off. Here, we see the problem in the domain of energy debugging for mobile devices. Grouping by phone model or phone operating system does not give benefits.
7.6 CellScope's techniques can easily be extended to new domains, and can benefit them. Here, using our techniques, models built are usable immediately while without CellScope, Carat [128] takes more than a week to build a model that is usable.

List of Tables

3.1 Tegra exposes Timelapse via simple APIs.
3.2 Datasets in our evaluation. M = Millions, B = Billions.
3.3 Ad-hoc analytics on big graphs. A '-' indicates the system failed to run the workload.

4.1 ASAP's Approximate Pattern Mining API.
4.2 Graph datasets used in evaluating ASAP.
4.3 Comparing the performance of ASAP and Arabesque on large graphs. The System column indicates the number of machines used and the number of cores per machine.
4.4 Improvements from techniques in ASAP that handle advanced pattern mining queries.
4.5 ELP building time for different tasks on UK graph.
4.6 Approximating 5-Motif patterns in ASAP.

5.1 Runtimes for pagerank algorithm as reported by popular graph processing systems used in production.

7.1 CellScope is able to reduce operator effort by several orders of magnitude. Resolution time includes field trials & expert analysis using datacubes / state-of-the-art tools [15].

Acknowledgments

I am grateful to my advisor, Ion Stoica, for guiding me throughout my graduate career at Berkeley. Ion taught me several valuable things; some of them include the value of simplicity in all aspects of research, going deep into a problem to find the nugget and the importance of focusing on one thing at a time. He was available whenever I wanted to meet to discuss research or get advice in general, no matter how short of a notice I gave; I'm not sure how he managed to do that with his extremely packed daily schedule.

This dissertation is the result of many successful collaborations. Chapter 3 was joint work with Qifan Pu, Kishan Patel, Joey Gonzalez and Ion Stoica [131]. Chapter 4 builds on work done with Alan Liu, Xin Jin, Shivaram Venkataraman, Vladimir Braverman and Ion Stoica [88, 89]. Chapter 5 contains material from the collaboration with Aurojit Panda, Shivaram Venkataraman, Mosharaf Chowdhury, Aditya Akella, Scott Shenker and Ion Stoica [130]. Chapter 6 was joint work with Aurojit Panda, Mosharaf Chowdhury, Aditya Akella, Scott Shenker and Ion Stoica [90]. Finally, chapter 7 was the result of collaboration with Li Erran Li, Mosharaf Chowdhury and Ion Stoica [85].

I'm thankful to my dissertation committee members, Josh Bloom, Mike Franklin and Scott Shenker. I've always felt that Scott was my unofficial co-advisor. I've enjoyed every single conversation with him; no matter the topic—research, teaching or whether the Indian version of Dairy Milk is really better than its British counterpart. Scott's passion for teaching and elegance in research has had deep influence in shaping my views. Mike and Josh provided excellent feedback that helped tremendously in the positioning of the research presented in this dissertation.

My colleagues at the AMPLab and later at the RISELab—in no particular order David Zats, Aurojit Panda, Shivaram Venkataraman, Sameer Agarwal, Mosharaf Chowdhury, Antonio Blanca, Anurag Khandelwal, Neeraja Yadwadkar, Qifan Pu, Colin Scott and several others—made Berkeley really fun and an unforgettable experience. None of the work in AMP or RISE would be possible without the amazing support staff: Kattt, Boban, Jon, Shane to name a few.

I'm incredibly fortunate to have really loving and understanding family and in-laws. My mother, father and brother shielded me from many storms and allowed me to pursue my dreams. My in-laws never hesitated in providing a helping hand, and always encouraged and supported me throughout. Lastly, but most importantly, I'm really indebted to my wife Sri, who has been my rock-star supporter. She made numerous sacrifices and stood by me throughout this journey; without her support I would have never reached this far. She really deserves this degree more than me.

Chapter 1

Introduction

The availability of cheap data and cheap compute has made big data analytics mainstream. However, recently there has been a paradigm shift in how we produce and process data. As the proliferation of sensors rapidly makes the Internet-of-Things (IoT) a reality, the devices and sensors in this ecosystem—such as smartphones, video cameras, home automation systems and autonomous vehicles—constantly map out the real world, producing unprecedented amounts of connected data that captures complex and diverse relations. When coupled with the significant leap Artificial Intelligence (AI) has made in key domains where these sensors are the major source of data, there is an increasing demand for systems that can ingest such live, dynamic, connected data, analyze it and produce low-latency decisions. These systems have the potential to shape the next generation computation stack and further research in the fields of networked systems, AI & machine learning (ML) and mobile computing.

This dissertation focuses on the core problems towards realizing such infrastructure by designing scalable systems. These systems propose: (1) simple abstractions that make it easy to operate on dynamic connected data, efficient datastructures to ingest and compactly store the data and computation state, and computation models that utilize incremental approaches to reduce redundant work (e.g., Tegra in chapter 3); (2) bridging theory and practice with algorithms and techniques that leverage approximation, streaming theory and machine learning to significantly speed up computations by trading off accuracy (e.g., ASAP in chapter 4); and (3) methods for more accurate application of ML tasks on live dynamic data (e.g., CellScope in chapter 7).

We begin with an introduction to the concept of connected data, explain the necessity for processing it in an efficient manner, and describe the difficulties in doing so using real-world examples.


Figure 1.1: Alex, a network administrator, diagnoses issues using data collected in the network.

1.1 Motivating Examples

In this section we describe two scenarios, which we will use as running examples throughout this dissertation, to motivate the need for dynamic connected data processing. While the names of the persons used in these examples are fictional, the scenarios depict real world problems faced by enterprise organizations today.

1.1.1 Alex Diagnoses Network Issues

Alex works as a network administrator at a large cellular network operator in the United States. Alex's job is to manage several thousands of wireless base stations, deployed across a large geographic region, that serve millions of users connecting to the Internet every day. Whenever problems occur, and they do occur often in today's cellular networks, Alex is tasked with finding the reason for the issue and fixing it. Monitoring and managing network infrastructure is a highly impactful problem for network operators today, with up to USD 22 billion spent by a single top-tier operator every year in network management and operation costs.

For instance, let's assume that Alex is trying to find the answer to the question, "What is the reason for poor download throughput for (several) users at 9:00am?". Much like any data driven company today, Alex's company collects extensive data from their network. The workflow is depicted in fig. 1.1. Alex might start by asking "What did the network look like at 9am?", when the problem actually happened. The query returns something similar to fig. 1.1, where there are several base stations serving many users. Alex suspects congestion as the cause of the low throughput, so the next query is "Which towers were congested at this point?", which returns a few towers from the original answer. Then, to understand the reason, Alex might run a few machine learning algorithms on the data. Now Alex has to confirm that the findings are indeed correct. To do so, Alex asks "How about at 10am?", meaning to repeat the entire analysis again, but now on a different subset of the data.

In order to execute these queries and machine learning algorithms on the data, Alex uses handwritten scripts to parse and represent the data in a format that is amenable to the analysis, and open source tools such as Scikit-Learn [155] or TensorFlow [3] to do the learning. Unfortunately, Alex faces two issues today. First, the tools available are unable to handle the dynamicity of the data. The network which Alex manages generates several terabytes of data every day, and wading through this massive amount of real-time data is difficult. Second, the tools which Alex uses for analysis do not scale to large datasets, largely due to the nature of the queries. Thus, Alex spends a significant amount of effort today in doing such analysis.

Figure 1.2: Taylor, a financial analyst, uses transaction data to train a model to detect money laundering.

1.1.2 Taylor Finds Financial Frauds

Taylor works as a financial analyst at one of the largest banks in the world. A major analysis involved in Taylor's daily job is discovering financial frauds. A particular problem of interest for Taylor is money laundering, which accounts for over 200 billion USD every year, and is thus an important charter for the bank.

Banks collect extensive information about financial exchanges, and a common approach to finding frauds is to train a machine learning algorithm on these financial transactions to learn a model for detecting fraudulent transactions, and then use this model on real-time transactions to mark suspicious ones. Figure 1.2 shows how Taylor does this today, and it starts by retrieving the transactions that took place around the time of known frauds. Once the data is retrieved, Taylor searches for transactions or batches of transactions where small amounts of deposits were made followed by large withdrawals, which is the distinguishing characteristic of classic money laundering. If there are such patterns in the data (there is one such occurrence in fig. 1.2), Taylor gets an estimate of how many such frauds occurred and retrieves a few of them. These retrieved "laundering patterns" are fed to a learning algorithm which learns a model to detect laundering. Taylor does this process in an ongoing fashion; as new laundering patterns pop up, the model needs to be retrained to retain its performance.

Like Alex, Taylor uses a combination of tools to do this analysis; for instance, R [144] for quickly prototyping and testing some statistical methods, PyTorch [141], TensorFlow [3] or Python scripts to train the model, and NetworkX [127] or custom programs to discover the laundering patterns in the data. Similar to Alex, Taylor faces the problems of dynamicity and scalability, albeit to a much higher extent. Due to the complexity of discovering patterns in the data, the analysis cannot even handle moderate sized data. Thus, it takes days, or even weeks, for Taylor to run the analysis on the data collected by the bank.

1.1.3 Alex & Taylor Aren’t Alone! Alex and Taylor work with what we refer to in this dissertation as connected data. In simple terms, connected data is data comprising of entities and their relations. The relations can be explicitly specified (e.g., real-world entities with spatial relations [153] such as base stations and users in Alex’s case, graphical models [101]) or learned (e.g., raw sensor data in deep learning tasks [27]). Such data has the power to capture diverse and complex relations, which could be immensely valuable in many areas including the potential to be the building block of future Artificial Intelligence systems [27]. Indeed, the lives of Alex and Taylor can become much more simpler by the availability of techniques and systems that can operate on large scale dynamic, connected data. CHAPTER 1. INTRODUCTION 5

However, a key question that arises is whether the problem of dynamic connected data processing is limited to Alex and Taylor's use cases. The answer is a resounding no. Several emerging applications could benefit from scalable and efficient connected data processing systems.

Connected vehicles [35] are an increasingly popular area of interest, both in industry and academia. Here, connected data processing systems can enable safety features, for example by taking sensor inputs from vehicles and users and combining them with the real-time paths taken by vehicles to model impending accidents and provide warnings. Autonomous vehicles are undoubtedly the future of transportation, and we are making huge strides towards achieving this goal. However, managing fleets of autonomous vehicles in the real world requires extremely robust and scalable traffic flow management systems that can avoid congestion and choke points. Connected data processing systems are likely the foundation of such systems. Finally, with sensors making their way into our everyday equipment and home automation systems becoming more popular, we are no doubt going to see intelligent smart cities that would enhance our quality of life and the environment. These smart cities would require immense planning, design and ongoing optimizations for efficient operation, and connected data systems would be crucial in understanding how the different entities that constitute the smart city interact with each other and how to optimally actuate the different pieces for achieving the end goal.

In short, as the era of Internet-of-Things becomes reality, the world is moving towards connected data, and systems that can efficiently process large scale, dynamic connected data can shape the next generation computation stack.

1.2 Problems with Existing Systems

Over the past decade, growing data volumes have pushed the frontiers of large scale data processing. As a result, several cluster processing frameworks exist today that can scale out to a large number of machines and process tremendous amounts of data. The natural question is whether these systems can help with dynamic connected data processing. Unfortunately, there are three main challenges that stand in the way of leveraging existing data processing systems for connected data processing.

1.2.1 Programmability

The first challenge is that of programmability. How do we allow end users to query dynamic connected data in a natural and intuitive way? Users like Alex and Taylor are familiar with the simple interface provided by existing Python based tools, and can benefit from similar (simple) interfaces for representing, accessing, manipulating or operating on the data they are interested in. However, no such facilities exist in today's big data frameworks. While these frameworks provide elegant primitives for unstructured data manipulation, the relation between the entities and the dynamic nature of these relations pose problems.

The representation problem is fairly straightforward to address. At a given point in time, connected data depicts the state of entities and their relations. Such relations are naturally represented as graphs. For instance, Alex's network can be represented by a property graph (§2.2.1) where the entities in the network, such as users and base stations, are graph nodes as shown in fig. 1.3. We follow this natural representation format in this dissertation, and thus dynamic connected data is represented as dynamic property graphs (§3.3.1). Unfortunately, efficient storage and operations on dynamic graphs are still open questions.

Figure 1.3: Representing connected data as property graphs.

1.2.2 Storage

Connected data analysis is data intensive, as we saw with the examples of Alex and Taylor. In both these scenarios, data is generated in the order of several terabytes every day. The volume of data is only poised to increase with emerging Internet-of-Things applications; for instance, autonomous/connected vehicles are expected to generate over 4 terabytes of data per day by 2020 [83]. Thus, efficiently storing and retrieving large quantities of dynamic connected data is a huge challenge. This is because techniques commonly used in databases, such as building indices to improve access efficiency, are not possible in connected data processing.

1.2.3 Performance

Extracting performance in the face of dynamicity forms the biggest challenge in connected data analysis. Most connected data queries are both interactive and exploratory. They are interactive because users like Alex and Taylor are firing these queries at a terminal and waiting for the results. The exploratory nature is because the queries are executed based on what the user sees; the answer to a query determines the next query. Thus, queries are executed in an ad-hoc fashion, and are not predetermined. In such scenarios, performance optimization techniques such as pre-processing and query specific caching do not help.

1.3 Solution Overview

This dissertation focuses on the core problems in realizing efficient dynamic connected data processing systems. Towards this goal, we design and implement the following systems:

Tegra focuses on the problem of ad-hoc analytics on dynamic connected data. It aims to solve the problems faced by Alex and Taylor and forms the foundation for the other systems. To do so, it represents dynamic connected data as dynamic/evolving property graphs, and proposes techniques for efficient storage and ad-hoc window operations on evolving graphs. It enables efficient access to the state of the graph at arbitrary windows, and significantly accelerates ad-hoc window queries by using a compact in-memory representation for both graph and intermediate computation state. For this, it leverages persistent datastructures to build a versioned, distributed graph state store, and couples it with an incremental computation model which can leverage these compact states. For users, it exposes these compact states using Timelapse, a natural abstraction. Tegra significantly outperforms other systems (by up to 30×) for ad-hoc window operations.

ASAP specifically focuses on use cases similar to Taylor's roadblock, the intractability of scaling pattern discovery queries to larger datasets, and proposes a fast, approximate computation engine for graph pattern mining. It leverages state-of-the-art results in graph approximation theory, and extends them to general graph patterns in distributed settings. To enable users to navigate the trade-off between result accuracy and latency, we propose a novel approach to build the Error-Latency Profile (ELP) for a given computation. ASAP outperforms existing exact pattern mining solutions by orders of magnitude. Further, it can scale to graphs with billions of edges without the need for large clusters.

GAP explores the question of applying approximation to the problem of iterative analytics on connected data and takes a first attempt at realizing an approximate graph analytics engine. Here, we discuss how traditional approximate analytics techniques do not carry over to the graph use case. Leveraging the characteristics of graph properties and algorithms, GAP proposes a graph sparsification technique, and a machine learning based approach to choose the appropriate amount of sparsification required to meet a given budget.

Monarch attempts to enhance connected data processing systems by making them function when deployed in a geo-distributed fashion, and thus focuses on the problem of efficient geo-distributed graph analytics. We find that optimizing the iterative processing style of graph-parallel systems is the key to achieving this goal, rather than extending existing geo-distributed techniques to graph processing.

CellScope looks at the fundamental difficulties in making real-time decisions on real-time connected data. Such real-time decision tasks range from simple reporting on data streams to sophisticated model building. We observe that the practicality of these analyses is impeded in several domains because they are faced with a fundamental trade-off between data collection latency and analysis accuracy. To solve this issue, we look at one particular domain, Alex's network performance diagnosis, to study the trade-off in detail and find that the trade-off can be resolved using two broad, general techniques: intelligent data grouping and task formulations that leverage domain characteristics. Based on this, CellScope applies a domain specific formulation and application of Multi-task Learning (MTL) to network performance analysis. It uses three techniques: feature engineering to transform raw data into effective features, a PCA-inspired similarity metric to group data from geographically nearby base stations sharing performance commonalities, and a hybrid online-offline model for efficient model updates. We then generalize the techniques and show their efficacy in other domains.

1.4 Dissertation Plan

This dissertation is organized as follows. Chapter 2 provides background on large scale data processing. Chapter 3 describes the work on supporting ad-hoc analytics on dynamic connected data. We discuss Tegra here, which proposes efficient ways to store and perform ad-hoc window operations. In chapter 4, we focus on pattern mining on connected data, looking at problems such as Taylor's. We describe ASAP in this chapter, which is a fast, approximate pattern mining system that achieves several orders of magnitude improvement over state-of-the-art techniques. Chapter 5 attempts to extend the learnings from ASAP to iterative analytics, and describes GAP, an approximate analytics engine. We discuss how connected data is naturally generated in a geo-distributed fashion and attempt to extend connected data processing systems to geo-distributed settings by describing Monarch in chapter 6. Chapter 7 looks at the fundamental trade-offs in making real-time decisions on dynamic connected data and describes CellScope, our solution to this problem. Finally, we conclude and discuss future work in chapter 8.

Chapter 2

Background

In this dissertation, we follow the natural representation format of viewing (dynamic) connected data as (dynamic) property graphs (§1.2.1). Thus, the systems we discuss in the following chapters build on the rich distributed graph processing literature. To aid the reader in following the rest of the dissertation, we begin with a brief background on large scale data processing systems, focusing on data-parallel and graph parallel systems.

2.1 Data Parallel Processing

Over the last decade, data parallel systems have gained popularity for processing large amounts of data. These systems were aimed at simplifying parallel computation over large amounts of distributed data. Typically, these systems expose simple abstractions and operators that developers use to achieve the desired processing goals, and internally manage the intricacies of efficiently running these operations on the data in parallel. The most popular early proposal in this space is Google's MapReduce [52], which exposed just two operators—map and reduce—that could capture many embarrassingly parallel use-cases such as rebuilding Google's search indexes, the use-case it was originally intended for, and provided efficient execution in a fault-tolerant fashion. The tremendous popularity of MapReduce resulted in the emergence of a large number of systems [84, 190, 125, 193, 192, 21] and sparked research in many core areas such as scheduling, query optimization and programming models. The new generation engines, such as Naiad [125], Spark [192] and Flink [21], improved upon the original MapReduce proposal by incorporating new operators, declarative interfaces, better task execution plans and in-memory caching to reduce inefficiencies with intermediate data management.

Traditionally, most of the dataflow engines were targeted at processing batches of unstructured data. However, as new application scenarios emerged, they have increasingly incorporated new functionalities such as the ability to do graph processing [115, 22, 71], stream processing [21, 13, 194] and machine learning [102, 162].

2.2 Graph Parallel Processing

We discuss graph parallel processing in this section, touching upon representation, abstractions and optimizations.

2.2.1 Property Graph Model

Graph processing systems typically represent graph structured data as a property graph [148], which associates user-defined properties with each vertex and edge. The properties can include meta-data (e.g., user profiles and time stamps) and program state (e.g., the number of neighbors, or rank of vertices). For example, fig. 1.3 shows Alex's network at a particular instance in time (say 9:00am) as a property graph. Here, the vertices (users and base stations) can hold properties such as the total data transferred and the number of active users, while the edges can store properties specific to the node pairs such as signal quality, geographical distance and so on. This enables Alex to query and filter the network based on properties and thus create analysis specific instances of graphs as input to the next stage of processing. Property graphs are logically represented in a distributed dataflow framework such as MapReduce [52] as a pair of vertex and edge property collections, where the collections contain the mapping between the vertex or edge and their properties. This enables the composition of graphs with other collections in the dataflow frameworks [71]. An alternative to property graphs for associating data with graph entities is the Resource Description Framework (RDF) format [148], which stores triplets of subject-predicate-object. However, compared to RDF, property graphs are considered to be a better format for storage and querying, especially for graph analytics.
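To make the collection view concrete, the following is a minimal sketch of Alex's network from fig. 1.3 expressed as a pair of vertex and edge property collections, in the spirit of GraphX-style systems [71]. The case classes, ids and property values here are invented for illustration; this is not Tegra's or GraphX's actual API.

// Illustrative only: a property graph as two collections. All names and
// values are made up for this sketch.
object PropertyGraphSketch {
  case class VertexProp(kind: String, activeUsers: Int)
  case class EdgeProp(signalQuality: Double, distanceKm: Double)
  case class Edge(srcId: Long, dstId: Long, attr: EdgeProp)

  // Vertex collection: (vertex id, vertex properties)
  val vertices: Seq[(Long, VertexProp)] = Seq(
    (1L, VertexProp("BaseStation", 2)), // BS1
    (2L, VertexProp("BaseStation", 3)), // BS2
    (10L, VertexProp("UE", 1))          // UE1
  )

  // Edge collection: (src id, dst id, edge properties)
  val edges: Seq[Edge] = Seq(
    Edge(10L, 1L, EdgeProp(signalQuality = 0.8, distanceKm = 0.4)), // UE1 - BS1
    Edge(1L, 2L, EdgeProp(signalQuality = 0.9, distanceKm = 2.1))   // BS1 - BS2
  )

  // Analysis-specific filtering, as Alex might do: keep only busy cells.
  val busyCells: Seq[(Long, VertexProp)] =
    vertices.filter { case (_, p) => p.kind == "BaseStation" && p.activeUsers > 2 }
}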

2.2.2 Graph Parallel Abstractions

Most existing general purpose graph processing systems allow end-users to perform graph computations by exposing a graph-parallel abstraction. A graph-parallel abstraction consists of a graph and a vertex-program, provided by the user, which is executed in parallel on each vertex. The program running on an individual vertex can interact with the vertex's neighborhood. This interaction between vertices is implemented using either shared state (e.g., GraphLab [111]) or message passing (e.g., Pregel [115]). Each vertex program can read and modify its vertex property, and when all vertex programs vote to halt the program terminates.

Pregel

Pregel [115] proposes a Bulk Synchronous Parallel (BSP) message passing abstraction where all the vertex programs run simultaneously in a series of super-steps. In every super-step, each vertex program receives all messages from the previous super-step and sends messages to its neighbors in the next super-step. At the end of each super-step, a barrier is imposed in order to ensure that all the vertex programs finish processing messages before proceeding to the next. This ensures program correctness (at the cost of efficiency). A Pregel program terminates when there are no messages remaining to be sent and every vertex program has voted to halt. Pregel exposes several optimizations to improve the efficiency of the message passing; for instance, messages destined to the same vertex are combined using a commutative, associative user-defined function.
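For concreteness, here is a minimal sketch of a Pregel-style vertex program for connected components by label propagation. The signature and callbacks are illustrative stand-ins, not Pregel's actual API: in each super-step a vertex adopts the smallest label it has seen, forwards it to its neighbors on a change, and otherwise votes to halt.

// Illustrative only: a Pregel-style vertex program. The engine is assumed to
// invoke compute() once per active vertex in every super-step.
object PregelStyleSketch {
  def compute(currentLabel: Long,
              messages: Iterable[Long],
              sendToNeighbors: Long => Unit,
              voteToHalt: () => Unit): Long = {
    val smallestSeen =
      if (messages.isEmpty) currentLabel
      else math.min(currentLabel, messages.min)
    if (smallestSeen < currentLabel) {
      sendToNeighbors(smallestSeen) // neighbors become active in the next super-step
    } else {
      voteToHalt()                  // no change; program ends when all vertices halt
    }
    smallestSeen                    // new vertex property
  }
}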

GraphLab

GraphLab [111] proposes an asynchronous distributed shared-memory abstraction where the vertex programs have shared access to the distributed graph. Each vertex program accesses information about the current and adjacent vertices and edges directly, and no message passing is involved. In this model, correctness of the program is ensured by the GraphLab system by serializing the program execution at neighbors. GraphLab eliminates both messages and synchronization, and can thus achieve higher efficiency by isolating and handling data movements at the system level. However, this comes at the cost of increased complexity, and the gains due to an asynchronous programming model are often offset by this additional complexity. Thus, a majority of distributed graph processing systems today adopt the bulk synchronous model.
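The contrast with message passing can be seen in a minimal sketch of a GraphLab-style update function for the same label-propagation task. The Scope trait below is a stand-in invented for this sketch, not GraphLab's actual interface: the program reads adjacent vertices directly through shared state and schedules neighbors only when its own value changes.

// Illustrative only: a GraphLab-style shared-memory update function.
object GraphLabStyleSketch {
  trait Scope {
    def ownLabel: Long
    def neighborIds: Seq[Long]
    def neighborLabel(id: Long): Long   // direct read of an adjacent vertex
    def setOwnLabel(label: Long): Unit
    def schedule(id: Long): Unit        // ask the engine to run a neighbor's update
  }

  def update(scope: Scope): Unit = {
    val smallestNeighbor =
      if (scope.neighborIds.isEmpty) scope.ownLabel
      else scope.neighborIds.map(scope.neighborLabel).min
    if (smallestNeighbor < scope.ownLabel) {
      scope.setOwnLabel(smallestNeighbor)
      scope.neighborIds.foreach(scope.schedule) // neighbors may need to recompute
    }
  }
}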

GAS Decomposition

PowerGraph [70] observes that while Pregel and GraphLab differ in how they collect and distribute information, they share a common structure. Thus, it proposes the Gather-Apply-Scatter (GAS) decomposition to characterize this common structure and differentiate between vertex and edge specific computations. The GAS model captures the conceptual phases of the vertex program as shown in listing 2.1. In the GAS decomposition, a vertex program consists of three data parallel phases: a gather phase that collects information about adjacent vertices and edges and applies a function on them, the apply phase that uses the function's output to update the vertex, and the scatter phase that uses the new vertex value to update adjacent edges. The system executes these phases sequentially.

def Gather(u, v) = Accum
def Apply(v, Accum) = v_new
def Scatter(v, j) = j_new, Accum

Listing 2.1: The Gather-Apply-Scatter (GAS) decomposition introduced in PowerGraph [70].

Since the GAS decomposition leads to a pull based model of message computation, it enables vertex-cut partitioning and improved work balance, which are essential in achieving performance on real-world graphs that follow a power-law distribution [70]. As a result, many popular open-source graph-processing frameworks (e.g., [71]) have adopted the GAS model.
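As an example of the decomposition, the following sketch shows how PageRank might map onto the three phases of listing 2.1. The signatures follow the spirit of the listing and are not PowerGraph's actual API; the damping factor and convergence threshold are the usual textbook values, chosen here only for illustration.

// Illustrative only: PageRank expressed in GAS form.
object PageRankGasSketch {
  case class VertexData(rank: Double, outDegree: Int)

  // Gather: each in-neighbor u contributes its rank share along edge (u, v).
  def gather(u: VertexData, v: VertexData): Double =
    u.rank / math.max(u.outDegree, 1)

  // Gather results are combined with a commutative, associative sum.
  def sum(a: Double, b: Double): Double = a + b

  // Apply: update the vertex rank from the accumulated contributions.
  def apply(v: VertexData, accum: Double): VertexData =
    v.copy(rank = 0.15 + 0.85 * accum)

  // Scatter: signal out-neighbors only if the rank changed appreciably,
  // so converged parts of the graph stop doing work.
  def scatter(oldV: VertexData, newV: VertexData): Boolean =
    math.abs(newV.rank - oldV.rank) > 1e-6
}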

Chapter 3

Ad-Hoc Analytics on Dynamic Connected Data

3.1 Introduction

In this chapter, we focus on the problem of efficient ad-hoc window operations on dynamic connected data represented as evolving graphs—the ability to perform ad-hoc queries on arbitrary time windows (i.e., segments in time) either in the past or in real-time. The motivating examples described in the previous chapter (§1.1) are indeed instances of such exploratory analysis. Alex's transient failure diagnosis started by retrieving a series of snapshots¹ of the network represented as an evolving graph before and after the failure. The process then involved running a handful of queries on the retrieved window, and then iteratively refining the queries until a hypothesis could be formed. The process culminated with the repetition of the same queries on a different window. Similarly, Taylor's quest towards discovering money laundering involved improving the fraud detection algorithm by retrieving the complete states of the transaction graph at different segments in time to train and test variants of the algorithm. In such scenarios, neither the queries nor the windows on which the queries would be run are predetermined.

To efficiently perform ad-hoc window operations, a graph processing system should provide two key capabilities. First, it must be able to quickly retrieve arbitrary size windows starting at arbitrary points in time. There are two approaches to provide this functionality. The first is to store a snapshot every time the graph is updated, i.e., a vertex or edge is added or deleted. While this allows one to efficiently retrieve the state of the graph at any point in the past, it can result in prohibitive overhead. An alternative is to store only the changes to the graph and reconstruct a snapshot on demand. This approach is space efficient, but can incur high latency, as it needs to re-apply all updates to reconstruct the requested snapshot(s). Thus, there is a fundamental trade-off between in-memory storage and retrieval time. Second, we must be able to efficiently execute queries (e.g., connected components) not only on a single window, but also across multiple related windows of the graph. Existing systems such as Chronos [76] allow executing queries on a single window, while Differential Dataflow [125] supports continuously updating queries over sliding windows. However, none of these systems support efficient execution of queries across multiple windows, as they do not have the ability to share computation state across windows and computations. This fundamental limitation of existing systems arises from their inability to efficiently store intermediate state from within a query for later reuse.

We present Tegra², a system that enables efficient ad-hoc window operations on time-evolving graphs. Tegra is based on two key insights about such real-world evolving graph workloads: (1) during ad-hoc analysis graphs change slowly over time relative to their size, and (2) queries are frequently applied to multiple windows relatively close by in time. Leveraging these insights, Tegra is able to significantly accelerate window queries by reusing both storage and computation across queries on related windows.

Tegra solves the storage problem through a highly efficient, distributed, versioned graph state store which compactly represents graph snapshots in-memory as logically separate versions that are efficient for arbitrary retrieval. We design this store using persistent datastructures that let us heavily share common parts of the graph, thereby reducing the storage requirements by several orders of magnitude (§3.5). Second, to improve the performance of ad-hoc queries, we introduce an efficient in-memory representation of intermediate state that can be stored in our graph state store and enables non-monotonic³ incremental computations. This technique leverages the computation pattern of the familiar graph-parallel models to create compact intermediate state that can be used to eliminate redundant computations across queries (§3.4).

Tegra exposes these compact persistent snapshots of the graph and computation state using a logical abstraction named Timelapse, which hides the intricacies of state management and sharing from the developer. At a high level, a timelapse is formed by a sequence of graph snapshots, starting from the original graph. Viewing the time-evolving graph as consisting of a sequence of independent static snapshots of the entire graph makes it easy for the developer to express a variety of computation patterns naturally, while letting the system optimize computations on those snapshots with much more efficient incremental computations (§3.3.1). Finally, since Timelapse is backed by our persistent graph store, users and computations always work on independent versions of the graph, without having to worry about consistency issues.

¹ A snapshot is a full copy of the graph, and can be viewed as a window of size zero. Non-zero windows have several snapshots.
² For Time Evolving GRaph Analytics.
³ Allows vertex/edge deletions, additions and modifications on any graph algorithm implemented in a graph-parallel fashion.
Tegra outperforms existing systems by up to 30× on ad-hoc window operation workloads (§3.7). In summary, we make the following contributions:

• We present Tegra, a time-evolving graph processing system that enables efficient ad-hoc window operations on both historic and live data. To achieve this, Tegra shares storage, computation and communication across queries by compactly representing the evolving graph and intermediate computation state in-memory.

• We propose Timelapse, a new abstraction for time-evolving graph processing. Tegra exposes timelapse to the developer using a simple API that can encompass many time-evolving graph operations. (§3.3.1)

• We design the Distributed Graph Snapshot Index (DGSI), an efficient distributed, versioned property graph store that enables timelapse APIs to perform efficient operations. (§3.5)

• Leveraging timelapse and DGSI, we present an iterative, incremental graph computation model which supports non-monotonic computations. (§3.4)

3.2 Background & Challenges

3.2.1 Time-evolving Graph Workloads

Time-evolving graph workloads, an important class of graph workloads [153], fall into three categories:

Temporal Queries: Here, an analyst is querying the graph at different points in the past and evaluates how the result changes over time. Examples are "How many friends did Alice have in 2017?" or "How did Alice's friend circle change in the last three years?". Such queries may have time windows of the form [T − δ, T], are performed on offline data, and are executed in batch.

Streaming/Online Queries: These workloads are aimed at keeping the result of a graph computation up-to-date as new data arrives (i.e., [Now − δ, Now]). For example, the analyst may ask "What are the trending topics now?", or use a moving window (e.g., "What are the trending topics in the last 10 minutes?"). These queries focus on the most recent data; thus streaming systems operate on the live graph.

Ad-hoc Queries: In these workloads, an analyst is likely to explore the graph by performing ad-hoc queries on arbitrary windows. For example, consider the exact queries used by Alex (§1.1), a network administrator troubleshooting a transient failure that occurred at 09:00AM. To do so, Alex may ask "What were the hotspots at 08:00AM?", which runs a connected component algorithm on the snapshot. Based on what is seen, the next query may ask "What were the hotspots at 10:00AM?", followed by "At 10:00AM, what is the shortest path of hotspot X to the controller?". Alex iteratively refines this query by taking many snapshots and running the query. In another example, Taylor, a financial expert, is interested in improving the fraud-detection algorithm⁴. Taylor queries "Who were the top influencers 1 month around April 1?", which retrieves the window of 1 month around a known fraud and runs an algorithm (e.g., personalized page rank) on the window, and then launches follow-up queries based on the output. Taylor repeats this on different windows to learn new rules that would detect the fraud. Taylor then tests her changes on a different set of windows, possibly also with injecting artificial data.

In ad-hoc workloads, not only does the analyst need to access arbitrary windows, but also the queries and the windows on which they are executed are determined just-in-time (i.e., not predetermined). Further, the analyst applies the same query to multiple (close-by, discontinuous) windows.

⁴ Typically a combination of machine learning and expert rules.

3.2.2 Limitations of Existing Solutions

Recent work in graph systems has made considerable progress in the area of evolving graph processing (§3.8). Temporal analysis engines (e.g., Chronos [76], ImmortalGraph [120]) operate on offline data and focus on executing queries on one or a sequence of snapshots in the graph's history. Upon execution of a query, these systems load the relevant history of the graph and utilize a pre-processing step to create an in-memory layout that is efficient for analysis. Such preprocessing can often dominate the algorithm execution time [116]. As a result, these systems are tuned for operating on a large number of snapshots in each query (e.g., temporal changes over months or a year), and are efficient in such cases. Fundamentally, the in-memory representation in these systems cannot support updates. Additionally, these systems do not allow updating the results of a query.

Streaming systems (e.g., Kineograph [45], Differential Dataflow [119], Kickstarter [175], GraphBolt [117]) operate on live data and allow query results to be updated incrementally (rather than doing a full computation) when new data arrives. These systems only allow queries on the live graph, and do not support ad-hoc retrieval of previous state.

Additionally, the incremental computation is tied to the live state of the graph, and cannot be utilized over multiple windows. Further, most systems (with the exception of Differential Dataflow, to the best of our knowledge) do not support non-monotonic computations in their incremental model, and either assume some properties of the algorithm, or leave it up to the developer to ensure correctness.

Differential Dataflow (DD) allows general, non-monotonic incremental computations using special versions of operators. Each operator stores "differences" to its input and produces the corresponding differences in output (hence the full output is not materialized), automatically incrementalizing algorithms written using them. While this technique is very efficient for real-time streaming queries, incorporating ad-hoc window operations in it is fundamentally hard. Since the computation model is based on the operators maintaining state (differences to their input and output) indexed by data (rather than time), accessing a particular snapshot can require a full scan of the maintained state. Further, since every operator needs to maintain state, the system accumulates large state over time which must be compacted (at the expense of forgoing the ability to retrieve the past). Finally, the intermediate state of a query is cleared once it completes, and storing it efficiently for reuse is an open question⁵.

⁵ Our conversations with the author of DD revealed that incorporating the state management techniques we propose in this work in DD is fundamentally hard and requires modification of its execution engine.

3.2.3 Challenges

Meeting the requirements necessary to support efficient ad-hoc window operations in practice is hard. There are three main challenges that stand in the way of building such a system. First is that of programmability: the system must provide the end user a natural and intuitive way to operate on time-evolving graphs. Second, the system must support efficient storage of evolving graphs. While it is ideal to store the history of the graph as individual snapshots for zero overhead ad-hoc retrieval, the system needs to consider the storage overhead due to duplication with every snapshot stored. Finally, ad-hoc analytics is both interactive (the user is waiting for answers) and exploratory (new queries are based on the answers to previous ones), and thus extracting performance under these constraints is difficult. To accelerate queries across windows, the system must not only be able to store intermediate states efficiently, but also be able to leverage them in its computation model. In essence, it must be able to compactly represent and share data and state between queries across multiple windows and users.

3.3 Tegra Design

Our solution, Tegra, consists of three components:

Timelapse Abstraction (§3.3.1): In Tegra, users interact with time-evolving graphs using the timelapse abstraction, which logically represents the evolving graph as a sequence of static, immutable graph snapshots. Tegra exposes this abstraction via a simple API that allows users to save/retrieve/query the materialized state of the graph at any point.

Computational Model (§3.4): Tegra proposes a computation model that allows ad-hoc queries across windows to share computation and communication. The model stores compact intermediate state as a timelapse, and uses it to perform general, non-monotonic incremental computations.

Distributed Graph Snapshot Index (§3.5): Tegra stores evolving graphs, intermediate computation state and results in DGSI, an efficient indexed, distributed, versioned property graph store which shares storage between versions of the graph. In fact, Timelapse can be seen as "views" on the data stored in DGSI. Such decoupling of state from queries and operators allows Tegra to share it across queries and users.

3.3.1 Timelapse Abstraction & API

Tegra introduces Timelapse as a new abstraction for time-evolving graph processing that enables efficient ad-hoc analytics. The goal of timelapse is to provide the end-user with a simple, natural interface to run queries on time-evolving graphs, while giving the system opportunities to execute those queries efficiently. In a timelapse, Tegra logically represents a time-evolving graph as a sequence of immutable, static graphs, each of which we refer to as a snapshot in the rest of this chapter. A snapshot depicts a consistent state of the graph at a particular instant in time. Tegra uses the popular property graph model [71], where vertices and edges in the graph are associated with arbitrary properties, to represent each snapshot in the timelapse. For the end-user, timelapse provides the abstraction of having access to a materialized snapshot at any point in the history of the graph. This enables the use of the familiar static graph processing model on evolving graphs (e.g., queries on an arbitrary snapshot). Timelapses are created in Tegra in two ways: by the system and by the users. When a new graph is introduced to the system, a timelapse is created for it that contains a single snapshot of the graph. Then, as the graph evolves, more snapshots are added to the timelapse. Similarly, users may create timelapses while performing analytics.

save(id): id
    Save the state of the graph as a snapshot in its timelapse. The id can be autogenerated. Returns the id of the saved snapshot.

retrieve(id): snapshot
    Return one or more snapshots from the timelapse. Allows simple matching on the id.

diff(snapshot, snapshot): delta
    Difference between two snapshots in the timelapse. (§3.4)

expand(candidates): subgraph
    Given a list of candidate vertices, expand the computation scope by marking their 1-hop neighbors. Used for implementing incremental computations. (§3.4)

merge(snapshot, snapshot, func): snapshot
    Create a new snapshot using the union of vertices and edges of two snapshots. For common vertices, run func to compute their value. Used for implementing incremental computations. (§3.4)

Table 3.1: Tegra exposes Timelapse via simple APIs.

Because snapshots in a timelapse are immutable, any operation on them creates new snapshots (e.g., a query on a snapshot produces another snapshot as its result). Such newly created snapshots during an analytics session may be added to an existing timelapse, or may create a new one, depending on the kind of operations performed. For instance, for an analyst performing what-if analysis by introducing artificial changes to the graph, it is logical to create a new timelapse. Meanwhile, snapshots created as a result of updating a query result should ideally be added to the same timelapse. The system does not impose restrictions on how users book-keep timelapses. Instead, it simply tracks their lineage and allows users to efficiently operate on the timelapses (§3.5). Since a timelapse logically represents a sequence of related graph snapshots, it is intuitive to expose the abstraction using the same semantics as those of a static graph. In Tegra, users interact with timelapses using a language-integrated API. The API extends the familiar Graph interface, common in static graph processing systems, with a simple set of additional operations, listed in table 3.1. This enables users to continue using existing static graph operations on any snapshot in the timelapse obtained using the retrieve() API.
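To make the API concrete, the following sketch shows how an analyst might combine these operations in a session. The graph-loading call and the snapshot ids are illustrative placeholders; only save(), retrieve() and diff() come from table 3.1.

    // Load a property graph into Tegra (loader name is hypothetical)
    val g = TegraContext.loadGraph("transactions")
    // Persist the current state; returns the id of the saved snapshot
    val morning = g.save("2019-10-01-09:00")
    // ... updates are ingested, and further snapshots are saved ...
    val evening = g.save("2019-10-01-18:00")
    // Retrieve an earlier snapshot and run an ordinary static-graph query on it
    val morningDegrees = g.retrieve(morning).degrees
    // Compute what changed between two points in the graph's history
    val delta = g.retrieve(evening).diff(g.retrieve(morning))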

3.3.2 Evolving Graph Analytics Using Timelapse

The natural way to do graph computations over the time dimension is to iterate over a sequence of snapshots. For instance, an analyst interested in executing the degrees query on three snapshots, G1, G2 and G3, depicted in fig. 3.1, can do:

    for (id <- Array(G1, G2, G3))
      result = G.retrieve(id).degrees


Figure 3.1: A timelapse of graph G consisting of three snapshots. For temporal analytics, instead of applying graph-parallel operations independently on each snapshot (left), timelapse enables them to be applied to all snapshots in parallel (right).

However, applying the same operation on multiple snapshots of a time-evolving graph independently is inefficient. In graph-parallel systems (§3.2), the degrees() computation is typically implemented using a user-defined program where every vertex sends a message with value 1 to its neighbors, and every vertex adds up its incoming message values. Such message exchange accounts for a non-trivial portion of the analysis time [154]. In the earlier example, sequentially applying the query to each snapshot results in 11 messages, of which 5 are duplicates (fig. 3.1). To avoid such inefficiencies, timelapse allows access to the lineage of graph entities. That is, it provides efficient retrieval of the state of graph entities in any snapshot. Using this, graph-parallel phases can operate on the evolution of an entity (vertex or edge) as opposed to its single value at a given snapshot. In simple terms, each processing phase is able to see the history of the node's property changes. This allows temporal queries (§3.2.1) involving multiple snapshots, such as the degree computation, to be efficiently expressed as:

    results = G.degrees(Array(G1, G2, G3))

where the degrees implementation takes advantage of timelapse by combining the phases of the graph-parallel computation for these snapshots. That is, the user-defined vertex program is provided with the state in all the snapshots. Thus, we are able to eliminate redundant messages and computation.


Figure 3.2: 1 Connected components by label propagation on snapshot G1 produces R1. 2 Vertex A and edge A − B are deleted in G2. Using the last result to bootstrap the computation produces an incorrect answer R2. 3 A strawman approach of storing all messages during the initial execution and replaying them produces correct results, but needs to store large amounts of state.

3.4 Computation Model

To improve interactivity, Tegra must be able to efficiently execute queries by effectively reusing previous query results to reduce or eliminate redundant computations, commonly referred to as performing incremental computation. Here, we describe Tegra’s incremental computation model.

3.4.1 Incremental Graph Computations

Supporting incremental computation requires the system to manage state. The simplest form of state is the previous computation result. However, many graph algorithms are iterative in nature, where the graph-parallel stages are repeatedly applied in sequence until a fixed point. Here, simply restarting the computation from previous results does not lead to correct answers. To illustrate this, consider a connected components algorithm using label propagation on a graph snapshot G1, as shown in 1 in fig. 3.2, which yields the result R1 after three iterations. When the query is to be repeated on G2, restarting the computation from R1 as shown in 2 computes an incorrect result. In general, correctness in such techniques depends on the properties of the algorithm (e.g., abelian group) and the monotonicity of updates (e.g., the graph only grows).

Supporting general non-monotonic iterative computations requires maintaining intermediate state. In the previous example, one solution is to store every message exchanged between graph entities during the initial execution of the algorithm. When the query is executed on the updated graph, the system can selectively replay these stored messages to ensure correctness of the results, as depicted in 3 . However, this approach requires storing and effectively using large amounts of state (proportional to the number of edges in every iteration), which may pose prohibitive overheads when applied to real-world graphs, where the number of edges is significantly larger than the number of vertices [70]. Further, this state is tied to the computation performed and thus does not provide opportunities to share it across queries. Tegra proposes a general, incremental, iterative graph-parallel computation model that significantly reduces the state requirements. It leverages the fact that graph-parallel computations proceed by making iterative changes to the original graph. Thus, the iterations of a graph-parallel computation can be seen as a time-evolving graph, where the snapshots are the materialized state of the graph at the end of each iteration. Since timelapse can efficiently store and retrieve these snapshots, we can perform incremental computations without the need to store the message exchanges. We call this model Incremental Computation by entity Expansion (ICE).

3.4.2 ICE Computation Model

ICE executes computations only on the subgraph that would be affected by the updates at each iteration. To do so, it needs to find the relevant entities that should participate in the computation at any given iteration. For this, it uses the state stored as a timelapse, and the computation proceeds in four phases:

Initial execution: When an algorithm is executed for the first time, ICE stores the state of the vertices (and edges, if the algorithm demands it) as properties in the graph. At the end of every iteration, a snapshot of the graph is added to the timelapse. The ID is generated using a combination of the graph's unique ID, an algorithm identifier and the iteration number. As depicted in 1 in the examples in fig. 3.3, the timelapse contains three and four snapshots, respectively.

Figure 3.3: Two examples that depict how ICE works. Dotted circles indicate vertices that recompute, and double circles indicate vertices that do not recompute themselves but must be present in the subgraph to compute the correct answer; their state is copied from the stored snapshots in the timelapse. ICE bootstraps computation on a new snapshot by finding the subgraph consisting of affected entities and their dependencies (neighbors), finds the new subgraph to compute at every iteration by comparing against the corresponding timelapse snapshot, and terminates when the subgraph converges and no entity needs state copied from the timelapse.

Bootstrap: When the computation is to be executed on a new snapshot, ICE needs to bootstrap the incremental computation. Intuitively, the subgraph that must participate in the computation at bootstrap consists of the updates to the graph, and the entities affected by those updates. For instance, any newly added or changed vertices should be included. Similarly, edge modifications result in the source and/or destination vertices being included in the computation. However, the changes alone are not sufficient to ensure correctness of the results. This is because, in graph-parallel execution, the state of a

graph entity depends on the collective input from its neighbors. Thus, ICE must also include the one-hop neighbors of the affected entities, and so the bootstrapped subgraph consists of the affected entities and their one-hop neighbors. ICE uses the expand() API for this purpose. The graph computation is then run on this subgraph. The first example's 2 in fig. 3.3 shows how ICE bootstraps when a new vertex D and a new edge between A and D are added. D and A should recompute state, but for A to compute the correct state, it must involve its one-hop neighbor B, yielding the subgraph D − A − B.

Iterations: At each iteration, ICE needs to find the right subgraph on which to perform computations. ICE exploits the fact that the graph-parallel abstraction restricts the propagation distance of updates in an iteration. Intuitively, the graph entities that might possibly have a different state at any iteration are contained in the subgraph on which ICE already executed computation in the last iteration. Thus, after the initial bootstrap, ICE can find the new subgraph at a given iteration by examining the changes to the subgraph from the previous iteration and expanding to the one-hop neighborhood of affected entities. For the vertices/edges that did not recompute their state, ICE simply copies the state from the timelapse. In 3 in fig. 3.3 (first example), though A and D recomputed, only D changed state and needs to be propagated to its neighbor A, which in turn needs B.

Termination: It is possible that modifications to the graph result in more (or fewer) iterations compared to the initial execution. Unlike normal graph-parallel computations, ICE does not necessarily stop when the subgraph converges. If there are more iterations stored in the timelapse from the initial execution, ICE needs to check whether the unchanged parts of the graph must be copied over. Conversely, if the subgraph has not converged and there are no more corresponding iterations, ICE needs to continue; to do so, it simply switches to normal (non-incremental) computation from that point. Thus, ICE converges only when the subgraph converges and no entity needs its state to be copied from the stored snapshot in the timelapse ( 4 in fig. 3.3).
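The bootstrap step can be expressed directly with the diff() and expand() APIs from table 3.1; the same pattern appears later in listing 3.1. The sketch below shows the idea for one graph-parallel step; the variable names and message functions are illustrative.

    // Entities whose state differs between the new snapshot and the
    // snapshot (iteration 0) of the previous execution
    val changed = newSnapshot.diff(prevResult.retrieve(0))
    // Include the one-hop neighbors of affected entities so that every
    // recomputing vertex sees all of its inputs
    val subgraph = newSnapshot.expand(changed)
    // Run one graph-parallel step on just this subgraph
    val msgs = subgraph.aggregateMessages(sendMsg, gather)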

3.4.3 ICE vs Streaming Systems

ICE provides several desirable properties for ad-hoc exploratory analysis on evolving graphs, and differs from the computation models in existing streaming graph processing systems (e.g., Differential Dataflow, Kickstarter, GraphBolt) in two major ways. First, the state in the ICE model is decoupled from computation, and ICE generates exactly the same intermediate states as a system that executes the algorithm from scratch (i.e., non-incremental computation) at every iteration. In addition to guaranteeing correctness even for non-monotonic computations, this allows ICE to leverage any user-specified previous state for incremental computations, whereas streaming systems can only utilize the last state.

That is, there are no ordering constraints: an analyst can, for example, run a query on a 9:00am snapshot of the graph and use the resulting state to run the query on the 8:00am snapshot. Second, streaming systems typically cannot utilize newly available data when a computation is in progress; for example, a query might still be running when a new snapshot of the graph becomes available. Tegra's choice of immutable snapshots allows it to separate computation from data ingestion. Thus, ICE can switch to a new snapshot when one becomes available (and, because state is decoupled, leverage the computation done until the switch).

3.4.4 Improving ICE Model

Sharing State Across Different Queries: Many graph algorithms consist of several stages of computation, some of which are common across different algorithms. For example, variants of connected components and PageRank both require the computation of vertex degrees as one of their steps. Since ICE decouples state, such common computations can be stored as separate state that is shared across different queries. Thus, ICE enables developers to generate and compose modular states. This reduces the need to duplicate common state across queries, which results in reduced memory consumption and better performance.

Incremental Computations Can Be Inefficient: Incremental computation is not useful in all cases. For instance, in graphs with high-degree vertices, a small change may result in a domino effect in terms of computation: during later iterations, a large number of graph entities might need to participate in the computation (e.g., Example 2 in fig. 3.3). To perform incremental computation, ICE needs to spend computation cycles to identify the set of vertices that should recompute (using diff) and to copy the state of vertices that did not perform computation (using merge). Due to this, the total work done by the system may exceed that of executing the computation completely from scratch [61, 175]. Since ICE generates the same intermediate states at every iteration as a full re-execution, it can switch to full re-execution at any point. We propose a simple learning-based technique to determine when to make this switch. We use a decision tree classifier to predict whether the current iteration would be faster using incremental or non-incremental execution. To train the classifier, we use offline data consisting of several runs of queries executed both incrementally and non-incrementally. At each iteration of every run, we record the number of vertices that participated in computation in the last iteration, the time taken for the last iteration, and whether full or incremental execution was faster in that iteration. The label is whether incremental execution was faster than full execution at this iteration. At run time, we use the classifier to predict whether to switch to full execution.
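A sketch of how such a predictor might be wired into the execution loop is shown below. The feature set matches the description above, but the type names and the predictor interface are placeholders rather than Tegra's actual implementation.

    // Features recorded at the end of the previous iteration
    case class IterStats(activeVertices: Long, lastIterMillis: Long)

    // A classifier trained offline (e.g., a decision tree) that answers:
    // "will incremental execution be faster than full re-execution
    //  for the next iteration?"
    type SwitchPredictor = IterStats => Boolean

    // Inside the ICE driver loop: fall back to full re-execution when
    // incremental execution is predicted to be slower.
    def shouldSwitchToFull(predict: SwitchPredictor, stats: IterStats): Boolean =
      !predict(stats)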

3.5 Distributed Graph Snapshot Index (DGSI)

To make the timelapse abstraction and the ICE computation model practical, Tegra needs to back them with a storage layer that satisfies the following three requirements: (1) enable ingestion of updates in real time, and make them available for analysis in the minimum time possible; (2) support space-efficient storage of snapshots and intermediate computation state in a timelapse; and (3) enable fast retrieval of and efficient operations on stored timelapses. These requirements, crucial for efficiently supporting ad-hoc analytics on time-evolving graphs, pose several challenges. For instance, they prohibit the use of preprocessing, typically employed by many graph processing systems, to compactly represent graphs and to make computations efficient. In this section, we describe how Tegra achieves this by building DGSI. It addresses requirements (1) and (2) by leveraging persistent datastructures to build a graph store (§3.5.1, §3.5.2) that enables efficient operations (§3.5.3) while managing memory over time (§3.5.4).

3.5.1 Leveraging Persistent Datastructures

In Tegra, we leverage persistent datastructures [55] to build a distributed, versioned graph state store. The key idea in persistent datastructures is to preserve the previous versions of data when it is modified, thus allowing access to earlier versions. DGSI uses a persistent version of the Adaptive Radix Tree (ART) [108] as its datastructure. ART provides several properties useful for graph storage, such as efficient updates and range scans. The Persistent Adaptive Radix Tree (PART) [137] adds persistence to ART by simple path-copying. For the purpose of building DGSI, we reimplemented PART (hereafter pART) in Scala and made several modifications to optimize it for graph state storage. We also heavily engineered our implementation to avoid performance issues, for example by providing fast iterators, avoiding unnecessary small-object creation, and optimizing path copying under heavy writes.
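The path-copying idea behind pART can be illustrated with a much simpler structure. The sketch below is a minimal persistent fixed-fanout trie, not pART itself: an insert copies only the nodes on the root-to-leaf path, so the old and new versions share everything else.

    // A persistent 16-way trie keyed by a fixed-length sequence of nibbles.
    sealed trait Node[V]
    case class Leaf[V](value: V) extends Node[V]
    case class Branch[V](children: Vector[Option[Node[V]]]) extends Node[V]

    def insert[V](node: Option[Node[V]], key: List[Int], value: V): Node[V] =
      key match {
        case Nil => Leaf(value) // end of key: (re)write the leaf
        case idx :: rest =>
          val branch = node match {
            case Some(b: Branch[V]) => b
            case _                  => Branch[V](Vector.fill[Option[Node[V]]](16)(None))
          }
          // Copy this node with one child replaced; all sibling subtrees
          // are shared with the previous version.
          Branch(branch.children.updated(idx, Some(insert(branch.children(idx), rest, value))))
      }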

3.5.2 Graph Storage & Partitioning

Tegra stores graphs using two pART datastructures: a vertex tree and an edge tree. The vertices are identified by a 64-bit integer key. For edges, we allow arbitrary keys stored as byte arrays. By default, the edge keys are generated from their source and destination vertices and an additional short field for supporting multiple edges between vertex pairs. pART supports prefix matching, so matching on this key enables retrieving all the out-edges (and hence destinations) of a given vertex. The leaves in the tree store pointers to arbitrary properties.


Figure 3.4: Tegra's DGSI consists of one pART datastructure for vertices and one for edges on each of the partitions. Here, a vertex pART stores properties in its leaves. A vertex id is used to traverse the tree to the leaf storing its property. Changes generate new versions.

We create specialized versions of pART to avoid (un)boxing costs when properties are primitive types. Tegra supports several graph partitioning schemes, similar to GraphX [71], to balance load and reduce communication. To distribute the graph across the machines in the cluster, vertices are hash-partitioned and edges are partitioned using one of several schemes (e.g., 2D partitioning). We do not partition the pART structures; instead, Tegra partitions the graph and creates separate pART structures locally in each partition. Hence, logically, in each partition the vertex and edge trees store a subgraph (fig. 3.4). By using local trees, we further amortize the (already low) cost6 associated with modifying the tree upon graph updates. To consume graph updates, Tegra needs to send the updates to the right partition. For this, we impose the same partitioning as the original graph on the vertices and edges in the update.

6: Modifications to nodes in ART trees only affect the O(log256 n) ancestors.
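As an illustration of the edge-key scheme described above, a possible byte layout is sketched below. The exact layout Tegra uses is not spelled out in this chapter, so treat the field widths as an assumption.

    import java.nio.ByteBuffer

    // Edge key = source id (8 bytes) | destination id (8 bytes) | discriminator (2 bytes).
    // Because the source id forms the prefix, a prefix scan on it returns every
    // edge leaving that vertex; the discriminator allows multiple edges
    // between the same vertex pair.
    def edgeKey(srcId: Long, dstId: Long, disc: Short = 0): Array[Byte] =
      ByteBuffer.allocate(18).putLong(srcId).putLong(dstId).putShort(disc).array()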

3.5.3 Version Management

DGSI is a versioned graph state store. Every "version" corresponds to a root in the vertex and edge trees in the partitions; traversing the trees from this root pair materializes the graph snapshot. For version management, DGSI stores a mapping between a root and the corresponding "version id" in every partition. The version id is simply a byte array. For operating on versions, DGSI exposes two low-level primitives inspired by existing version management systems: branch and commit.

A branch operation creates a new working version of the graph by creating a new (transient) root that points to the original root's children. Users operate on this newly created graph without worrying about conflicts, because the root is exclusive to them and not visible in the system. Upon completing operations, a commit finalizes the version by adding the new root to version management and makes the new version available to other users in the system. Once a commit is done on a version, modifications to it can only be made by "branching" that version. Any timelapse-based modification causes branch to be called, and the timelapse save() API invokes commit. Tegra can interface with external graph stores, such as Neo4J [126], Titan [170] or Weaver [180], for importing and exporting graphs. While importing new graphs, DGSI automatically assigns an integer id (if not provided) and commits the version when the loading is complete. We create a version by batching updates; the batch size is user-defined. In order to be able to retrieve the state of the graph in between snapshots, Tegra stores the updates between snapshots in a simple log file, and adds a pointer to this file to the root. The simplest retrieval of a version is by its id. In every partition, DGSI then gets a handle to the root element mapped to this id, thus enabling operations on the version (e.g., branching, materialization). By design, versions in DGSI have no global ordering, because branches can be created from any version at any time. However, in some operations it may be desirable to have ordered access to versions, such as in incremental computations where the system needs access to consecutive iterations. For this purpose, we enable suffix, prefix and simple range matching primitives on the version id.
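A sketch of how the two primitives compose during an update is shown below; the store handle and the update-application call are illustrative names rather than Tegra's public API.

    // Derive a private working copy from a committed version, apply a batch
    // of updates to it, and publish the result as a new version.
    val working = dgsi.branch(versionId)    // new transient root, visible only to the caller
    working.applyUpdates(batch)             // updates go to the private root (in-place is safe)
    val newVersion = dgsi.commit(working)   // root registered; version now visible to others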

3.5.4 Memory Management

Over time, DGSI stores several versions of a graph, and hence Tegra needs to manage these versions efficiently. We employ several techniques to do this. Between branch and commit operations, it is likely that many transient child nodes are created; we aggressively remove them during the commit operation. In addition, we enable in-place updates when the operations are local, such as after a branch and before a commit. Further, during ad-hoc analysis, analysts are likely to create versions that are never committed. We periodically mark such orphans and adjust the reference counts in our trees to make sure that they are garbage collected. For managing stored versions, we leverage a simple Least Recently Used (LRU) eviction policy. Each time a version is accessed, we annotate the version and all its children with a timestamp. The system then employs a thread that periodically removes versions that have not been accessed for a long time. The eviction is done by saving the version to local disk (or a distributed file system). We do this in the following way.


Figure 3.5: DGSI has fine-grained control over leaves (where data is stored). Here, DGSI has thousands of snapshots. All snapshots except S are on disk; their parents just hold pointers to the files. Parents are also dynamically written to disk if all of their children are on disk. The datastructure uses adaptive leaf sizes for efficiency.

Since every version in DGSI is a branch, we write each subtree in that branch to a separate file and then point its root to the file identifier (e.g., in fig. 3.4, we can store v2's leaf that differs from v1 on disk as a file and point the parent node to this file). By writing subtrees to separate files, we ensure that different versions sharing tree nodes in memory can also share tree nodes written to files. Due to this technique, leaf nodes (which consume the most memory) that are specific to a version (not shared with any other version) are always written to disk if the version is evicted. As depicted in fig. 3.5, a large number of versions can be flushed to disk over time while still being retrievable when necessary. Thus, only active snapshots are fully materialized in memory, thereby allowing Tegra to store several snapshots.
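A rough sketch of this eviction pass is given below. Every name here (the version handle, its fields, and the spill helpers) is a placeholder used for illustration; the chapter does not expose Tegra's internal interfaces.

    // Periodically spill versions that have not been accessed recently.
    // A version's private leaf subtrees are written to separate files, and their
    // parents are rewritten to hold file pointers, so subtrees shared with
    // versions still in memory remain shared on disk as well.
    def evictionPass(versions: Seq[VersionHandle], maxIdleMs: Long): Unit = {
      val now = System.currentTimeMillis()
      for (v <- versions if now - v.lastAccessMs > maxIdleMs;
           subtree <- v.privateSubtrees) {
        val file = spillToFile(subtree)                       // persist the subtree
        subtree.parent.replaceChildWithPointer(subtree, file) // parent keeps only a file pointer
      }
    }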

3.6 Implementation

We have implemented Tegra on Apache Spark [192] as a drop-in replacement for GraphX [71]7. We utilize the newly available barrier execution mode to implement direct communication between tasks and avoid most Spark overheads.

7: Tegra is a new implementation and does not extend GraphX's codebase.

3.6.1 ICE on GAS Model

As described in §3.4.2, the diff() API marks the candidates that must perform graph-parallel computation in a given iteration. In the GAS decomposition, the scatter() function, invoked on scatter_nbrs, determines the set of active vertices that must perform computation.

Starting with an initial candidate set (at bootstrap, the changes to the graph; at any later iteration, the candidates from the previous iteration), the diff() API uses the scatter_nbrs (EdgeDirection in GraphX) of the user-defined vertex program to mark all necessary vertices for computation. We mark all scatter_nbrs of a vertex if its state differs from the previous iteration, or from the previous execution stored in the timelapse. For instance, a vertex addition must inspect all its neighbors (as defined by scatter_nbrs) and include them for computation. The vertices in the GAS model perform computation using the user-defined gather(), sum() and apply() functions, where gather_nbrs determines the set of neighbors to gather state from. The expand() API enables correct gather() operations on the candidates marked for recomputation by also marking the gather_nbrs of those candidates. After diff() and expand(), Tegra has the complete subgraph on which the graph-parallel computation can be performed.

3.6.2 Using Tegra as a Developer

Tegra provides feature compatibility with GraphX, and expands the existing GraphX APIs to provide ad-hoc analysis support on evolving graphs. It extends all the operators to operate on user-specified snapshot(s) (e.g., Graph.vertices(id) retrieves the vertices of a given snapshot id, and Graph.mapV([ids]) can apply a map function on the vertices of the graph for a set of snapshots). Graph-parallel computation is enabled in GraphX using the Graph.aggregateMessages() (previously mrTriplets()) API. Because this API works on the Graph interface, developers can directly apply it to the subgraph found using Tegra's diff() and expand() calls, and then use merge() to materialize the complete result. GraphX further offers iterative graph-parallel computation support through a Pregel API, which captures the GAS decomposition using repeated invocations of aggregateMessages and joinVertices until a fixed point. Tegra provides an incremental version of this Pregel API that can perform ICE from a previously saved computation state provided as an argument, as shown in listing 3.1. In general, a developer can write incremental versions of any iterative graph-parallel algorithm by using the Tegra APIs along with aggregateMessages. Tegra provides a library of incremental versions of commonly used graph algorithms.
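For example, a short developer session with these extended operators might look like the sketch below. The snapshot ids, the normalize function, the exact argument shape of mapV, and the incremental PageRank entry point are illustrative placeholders; only the operator names come from the text above.

    // Static-graph operators extended with snapshot arguments
    val vAtT1 = graph.vertices("t1")                       // vertices of snapshot t1
    val relabeled = graph.mapV(Array("t1", "t2", "t3")) {  // map over three snapshots
      (id, attr) => normalize(attr)
    }

    // Incremental graph-parallel computation: bootstrap from a previously
    // saved computation state and let ICE reuse it (entry point is hypothetical)
    val ranks = IncPageRank.run(graph, prevState = graph.retrieve("pagerank-v42"))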

3.7 Evaluation

We have evaluated Tegra through a series of experiments.

def IncPregel(g: Graph[V,E], prevResult: Graph[V,E],
              vprog: (Id, V, M) => V,
              sendMsg: (Triplet) => M,
              gather: (M, M) => M): Graph[V,E] = {
  var iter = 0
  // Loop until no active vertices and nothing to copy
  // from previous results in the timelapse.
  while (!converged) {
    // Restrict to vertices that should recompute
    val msgs: Collection[(Id, M)] =
      g.expand(g.diff(prevResult.retrieve(iter)))
       .aggregateMessages(sendMsg, gather)
    iter += 1
    // Receive messages and copy previous results
    g = g.leftJoinV(msgs).mapV(vprog)
         .merge(prevResult.retrieve(iter))
         .save(iter)
  }
  return g
}

Listing 3.1: Implementation of incremental Pregel using the Tegra API.

Dataset                         Vertices / Edges
twitter [31]                    41.6 M / 1.47 B
uk-2007 [33]                    105.9 M / 3.74 B
Facebook Synthetic Data [2]     Varies / 5, 10, 50 B

Table 3.2: Datasets in our evaluation. M = Millions, B = Billions.

Comparisons. We compare Tegra against a streaming engine and a temporal engine. For the streaming system, we use the Rust implementation of Differential Dataflow (DD) [54]. Since we were unable to obtain an open-source implementation of a temporal engine, we developed a simplified version of Chronos [76] in GraphX [71], which we call Chlonos (Clone of Chronos) in this section. This implementation emulates the array-based in-memory layout of snapshots and the incremental computation model in Chronos. For graph updates, we use a mixture of additions and deletions.

Evaluation Setup. All of our experiments were conducted on 16 commodity machines available as Amazon EC2 instances, each with 8 virtual CPU cores, 61GB of memory, and 160GB SSDs. The cluster runs a recent 64-bit version of Linux. We use Differential Dataflow v0.8.0 and Apache Spark v2.4.0. We warm up the JVM before measurements.

Caveats. While perusing the evaluation results, we wish to remind the reader of a few caveats. Though many of the graphs we use fit in the memory of a modern single machine, Tegra is focused on ad-hoc analytics, which requires storing multiple snapshots of the graph. Further, ad-hoc analytics requires the use of property graphs, and thus Tegra supports edge and vertex properties (and creates a default value), which blows up the graph size by several orders of magnitude (and also affects performance). DD does not support properties, which makes it highly memory efficient. Further, DD only outputs differences (not fully materialized results), whereas Tegra materializes the entire result on every computation. Finally, DD's connected components implementation uses union-find (which is hard to fit in a vertex-centric model) and is superior to Tegra's label-propagation-based implementation.

3.7.1 Microbenchmarks

Graph Update Throughput: To evaluate the ability of Tegra to sustain high update volumes, we load the Twitter graph. We randomly add and remove 1 million edges (no computations are performed), note the time to store the updates, and compute the throughput. We repeat the experiment 10 times, each with a varying number of machines, and average the results. We observe that Tegra is able to average about 1 million updates per machine, and this number scales linearly with more machines, which is expected as there is no coordination required for updating the graph. DD is able to achieve 20 million updates per machine, but it is simply inserting the updates into native arrays, while Tegra is applying the updates to the graph stored in DGSI.

Snapshot Retrieval Latency: Next, we repeat the experiment to evaluate the snapshot retrieval latency. After 1000 updates, we retrieve random snapshots from the graph. For each retrieval, we note the time for the system to materialize the output. We report the average of 10 retrievals each on different numbers of machines in fig. 3.6. DD does periodic state compaction to reduce state overheads, but this results in the inability to retrieve the past. One potential solution is to create a snapshot at the point of compaction and store it separately along with a redo log. However, this requires a solution similar to DGSI (§3.5) to store these snapshots efficiently without duplication. Since this is beyond the scope of this work, we disable compaction in DD and modify it to retrieve the snapshot at a given time. We see that Tegra is able to return the queried snapshot with no computation at all, since it materializes the snapshot at ingestion time. In contrast, DD needs to reconstruct the graph when it is queried. Reconstructing the graph takes about 20 seconds on a single machine, and this decreases linearly with the number of machines.


Figure 3.6: Snapshot retrieval in DD incurs a cost, and the latency degrades over time.


Figure 3.7: Differential dataflow generates state at every operator, while Tegra’s state is proportional to the number of vertices.

However, as more updates are added, the retrieval time degrades. DD also exhibits high variance in retrieval time depending on the snapshot retrieved.

Computation State Storage Overhead: Finally, we measure the memory overhead due to computation state. We perform PageRank (PR) and connected components (CC) computations on the Twitter graph in an incremental fashion, where we add and delete 1000 edges to create each snapshot. We note the memory usage of each system after every 200 such computations, up to 1000 computations (for a total of 1 million edge updates). Figure 3.7 shows this experiment's results. When the number of updates is small, both Tegra and DD use comparable amounts of memory to store the state, even with DD's


Figure 3.8: On ad-hoc queries on snapshots, Tegra is able to significantly outperform due to state reuse.

highly compact layout (native arrays, compared to Tegra's property graph), but DD's state size increases rapidly as it does more incremental computation and grows to up to 2× that of Tegra. Tegra's memory requirement also increases over time, but much more gracefully. The reason is that Tegra's state requirement is proportional to the number of vertices, while DD needs to keep state at every operator. The amount of increase also depends on the algorithm: for instance, PageRank generates the same amount of state in every iteration, while connected components' state requirement decreases over iterations. Note that DD uses compaction in this experiment, which is automatically done by the system.

3.7.2 Ad-hoc Window Operations

Here, we present evaluations that focus on Tegra's main goal. In these experiments, we emulate an analyst. We load the graph, and apply a large number of sequential updates, where each update modifies 0.1% of the edges. We then retrieve 100 random windows of the graph that are close-by, and apply queries to each. We use PageRank and connected components as the algorithms. PageRank is run either until a specified tolerance is reached or for 20 iterations, whichever comes first. We assume that earlier results are available so that the system can do incremental computations. We do not consider the window retrieval time in this experiment for DD and Chlonos. We present the average time taken to compute the query result once the window is retrieved.

Single Snapshot Operations: In the first experiment, we set the window size to zero so every window retrieval returns a single snapshot. The results are depicted in fig. 3.8.


Figure 3.9: Tegra's performance is superior on ad-hoc window operations even with materialization of results.

DD and Chlonos do not allow reusing computation across queries, so they compute from scratch for every retrieval. In contrast, Tegra is able to leverage the compact computation state stored in its DGSI from earlier queries for much faster computation. In this case, most of the snapshots incur no computation overhead because of the extremely small amount of change between them, and Tegra is able to produce an answer within a few seconds. DD takes a few tens of seconds, while Chlonos requires hundreds of seconds. Tegra's benefits range from 18-30× compared to DD.

Window Operations: Here we set the window size to 10 snapshots. Chlonos and DD are able to apply incremental computations once the query has been computed on the first snapshot. Figure 3.9 shows the results. We see that DD is very fast once the first result has been computed, and incurs minimal overheads after that. In contrast, Chlonos incurs a penalty because it uses the first result to bootstrap the rest. Because Tegra needs to materialize the result and also store it separately for each snapshot, and due to scheduling overheads in Spark, it incurs a slight penalty compared to the single-snapshot case. Tegra is still 9-17× faster compared to DD.

Large Graphs: Finally, we evaluate the ability of Tegra to support ad-hoc window analysis on very large graphs. For this, we use synthetic graphs provided by Facebook [2] modeled using the social network's properties. There are 3 graphs with 5B, 10B and 50B edges, respectively. Here, we load the graph and execute the queries once. Then we randomly modify a tiny fraction of the graph (0.01%) 1000 times to create 1000 snapshots. We then randomly pick a snapshot and run the queries on it. We report the average time over 100 such runs.

Graph          5B              10B             50B
               PR      CC      PR      CC      PR      CC
DD             1m      8s      2m      34s     -       -
Giraph         2m      1m      4m      1.5m    22.5m   4.5m
GraphX         6m      3.5m    18m     12m     -       -
Tegra          10s     5s      19s     7s      1.5m    18s

Table 3.3: Ad-hoc analytics on big graphs. A ’-’ indicates the system failed to run the workload.

The results are shown in table 3.3. DD works reasonably well when both the graph and the updates are small (and hence it generates less state). However, as the graph becomes larger, DD needs to push a large number of updates through the computation, and the state it generates becomes a huge bottleneck in its performance (on the largest graph, we were unable to get DD to work, as it failed due to excessive memory usage during the initial execution). In contrast, Tegra is not only able to keep the state compact and scale to large graphs, but also provides significant benefits by using previous computation state. Tegra carries this performance to larger graphs; the runtime increases linearly, which is intuitive for PageRank. Note that Giraph and GraphX recompute the entire results as they do not support incremental computations.

3.7.3 Timelapse & ICE

We evaluate the ability of the Timelapse abstraction to provide efficient ways for graph-parallel phases to perform operations (§3.3.2) and the ICE computation model (§3.4.4).

Parallel computations. We develop a simple parallel computation model (§3.3.2) in which a query applied to a sequence of snapshots can be run in parallel on all the snapshots. That is, instead of running the query snapshot-by-snapshot, we use timelapse to compute across snapshots. We create 20 snapshots of the Twitter graph by starting with 80% of the edges and adding 1% to it repeatedly. We then apply the connected components algorithm on these snapshots, varying the number of snapshots on which the algorithm runs. Thus, the value on the X-axis of this plot indicates the number of snapshots included in the computation. In each run, we measure the time taken to obtain the results on all the snapshots considered. For comparison, we use GraphX and apply the algorithm to each snapshot in a serial fashion. The results are depicted in fig. 3.10. We see that Tegra significantly outperforms GraphX for a single snapshot due to its use of the barrier execution mode in Spark.


Figure 3.10: Timelapse can be used to optimize the graph-parallel stage.


Figure 3.11: Sharing state across queries results in reduced memory usage and improved performance.

Further, we see a linear trend with an increasing number of snapshots. By sharing computation and communication, Tegra is able to achieve up to 36× speedup.

Sharing state across queries: To evaluate how much benefit sharing state between different queries provides, we run an experiment with connected components and PageRank. For these queries, the degree computation can be shared; we evaluate with and without this sharing. We use the Twitter graph and average the results of 10 runs of incremental computations on random snapshots. The results in fig. 3.11 show 20% and 30% reductions in memory usage and runtime, respectively.


Figure 3.12: Incremental computations are not always useful. Tegra can switch to full re-execution when this is the case.

ICE's switching capability: To test ICE's capability to switch to full re-execution when incremental computations are not useful (§3.4.4), we run the connected components algorithm on the Twitter graph. We then introduce a batch of deletions in the largest components so that the incremental computation executes on a large portion of the graph. We then make Tegra recompute with and without switching enabled, and average the results over 10 such runs. The results are shown in fig. 3.12. We see that without switching, Tegra incurs a penalty: the incremental execution takes more time than fully re-executing the algorithm (which takes 90 seconds on average). With switching enabled, Tegra is easily able to identify that it needs to switch.

ICE's versatility: Since ICE differs from streaming engines (§3.4.3), it can also provide flexibility in how it uses state. For instance, if updates are monotonic (only additions), then ICE can simply restart from the last answer rather than performing fully incremental computations. Figure 3.13 shows this for two algorithms on the UK graph. Both PageRank and connected components benefit from monotonicity. PageRank is faster since it only needs to converge to within the tolerance.

3.7.4 Tegra Shortcomings

Finally, we ask the question, "What does Tegra not do well?"

Purely Streaming Analysis: For this experiment, we consider an online query (§3.2) of connected components. To emulate a streaming graph, we first load the graph and continuously change 0.01% of it by adding and deleting edges. We assume that earlier results are available so that the system can perform incremental computation.


Figure 3.13: Monotonicity of updates (additions only) can be leveraged to speed up computations by starting from the last answer.


Figure 3.14: DD significantly outperforms Tegra for purely streaming analysis.

We do this as follows: after a fixed number of updates, we stop and do a complete computation to create previous state. Then we do incremental computation from the next update onward. The average runtime of 10 runs is shown in fig. 3.14. We see that DD is significantly better than Tegra for such workloads, providing 20-30× improvements. This is due to a combination of DD being optimized for online queries (it pushes each update through the computation very quickly) and its Rust implementation. In contrast, Tegra materializes the result after each computation, and is also tuned for updates in batches.

Purely Temporal Analysis: We also consider a purely temporal query. Here, we assume that the system knows the queries and the window before the start, and thus has optimized the data layout for the query.

We run a simple query on a window of size 10 and compare Tegra and Chlonos. Excluding processing time, we noticed that Tegra takes around a 15% performance hit due to its use of a tree structure.

COST Analysis: The COST metric [118] is not designed for incremental systems, but we note that Tegra is able to match the performance of an optimized single-threaded implementation using 4 machines, each with 8 cores, and thus has a COST of 32 cores. However, Tegra uses property graphs while the optimized implementation does not.

3.8 Related Work

(Transactional) Graph Stores: The problem of managing time-evolving graphs has been studied in the context of graph stores [38, 123, 124, 180, 138]. These focus on optimizing point queries that retrieve graph entities, and do not support storing multiple snapshots. This yields a different set of challenges compared to iterative graph analytics.

Managing Graph Snapshots: Several systems manage evolving graphs as a collection of snapshots, converting the problem into analytics on a series of static graphs. DeltaGraph [98] proposes a hierarchical index that can manage multiple snapshots of a graph using deltas and event lists for efficient retrieval, but lacks the ability to do windowed iterative analytics. TAF [99] fixes this, but it is a specialized framework that does not provide a generalized incremental model or ad-hoc operations. LLAMA [114] uses a multiversion array to support incremental ingestion. It is a single-machine system, and it is unclear how the multiversion array can be extended to support the data-parallel operations required for iterative analytics. Version Traveler [95] achieves swift switching between snapshots of a graph by loading the common subgraph in the compressed-sparse-row format and extending it with deltas. However, it does not support incremental computation. Chronos [76] and ImmortalGraph [120] optimize for efficient computation across a series of snapshots. They propose an efficient model for processing temporal queries, and support on-disk storage of graph snapshots using a hybrid model. While their technique reduces redundant computation within a given query, they cannot store and reuse intermediate computation results. Their in-memory layout of snapshots requires preprocessing and cannot support updates. Further, their incremental computation model does not support non-monotonic computations. None of these systems allow compactly representing computation state.

Incremental Maintenance on Evolving Graphs: Another important body of work is streaming systems. CellIQ [91] is a specialized system for cellular network analytics, but it does not support ad-hoc analysis or compact storage of the graph and state. Kineograph [45] supports constructing consistent snapshots of an evolving graph for streaming computations,

but it does not allow ad-hoc analysis. WSP [191] focuses on streaming RDF queries. GraphInc [41] supports incremental graph processing using memoization of the messages in graph-parallel computation, but does not support snapshot generation or maintenance. Kickstarter [175] and GraphBolt [117] support edge deletions, but do not support ad-hoc analysis. Differential Dataflow [125, 121, 119] leverages indexed differences of data in its computation model to perform non-monotonic incremental computations. However, it is challenging to do ad-hoc window operations using indexed differences (§3.2.2). As we demonstrate in our evaluation, compactly representing the graph and computation state is the key to efficient ad-hoc window operations on evolving graphs.

Incremental View Maintenance (IVM): In databases, IVM algorithms [75, 30] maintain a consistent view of the database by reusing computed results. However, they are tuned for different kinds of queries, not for iterative graph computations. Further, they generate large intermediate state and hence require significant storage and computation cost [119]. Versioned file systems (e.g., [161]) allow several versions of a file to exist at a time. However, they focus on disk-based files rather than in-memory efficiency.

3.9 Summary

In this chapter, we presented Tegra, a system that enables efficient ad-hoc window operations on evolving graphs. The key to Tegra's superior performance in such workloads is a compact, in-memory representation of both graph and intermediate computation state, and a computation model that can utilize it efficiently. For this, Tegra leverages persistent datastructures and builds DGSI, a versioned, distributed graph state store. It further proposes ICE, a general, non-monotonic iterative incremental computation model for graph algorithms. Finally, it enables users to access these states via a natural abstraction called Timelapse. Our evaluation shows that Tegra is able to outperform existing temporal and streaming graph systems significantly on ad-hoc window operations. The storage solution presented in this chapter, DGSI, can be used as a standalone state store for dynamic connected data represented as an evolving property graph, and forms the storage layer for other systems presented in the rest of this dissertation. While this chapter focused on simple scenarios such as exact analysis and single datacenter processing, we look at more complex scenarios in later chapters.

Chapter 4

Pattern Mining in Dynamic Connected Data

4.1 Introduction

In the last chapter, we looked at how to enable the ad-hoc window operations that are required to support analysis queries in dynamic connected data systems. This chapter looks at a different but equally important class of queries: pattern mining queries, similar to Taylor's effort to discover money laundering patterns in the transaction graph. Such queries are intractable even on static data for medium to large data sets. We describe the reason for this and propose a novel solution in the rest of this chapter.

Algorithms for graph processing can broadly be classified into two categories. The first, graph analysis algorithms, compute properties of a graph, typically using neighborhood information. Examples of such algorithms include PageRank [132], community detection [63] and label propagation [203]. The second, graph pattern mining algorithms, discover structural patterns in a graph. Examples of graph pattern mining algorithms include motif finding [122], frequent sub-graph mining (FSM) [187] and clique mining [37]. Graph mining algorithms are used in applications like detecting similarity between graphlets [139] in social networks and counting pattern frequencies for credit card fraud detection.

Today, a deluge of graph processing frameworks exists, both in academia and open source [111, 70, 105, 151, 71, 143, 150, 178, 39, 45, 76, 125, 114, 46, 199]. These frameworks typically provide high-level abstractions that make it easy for developers to implement many graph algorithms. A vast majority of the existing graph processing frameworks, however, have focused on graph analysis algorithms. These frameworks are fast and can scale out to handle very large graph analysis settings: for instance, GraM [182] can run

one iteration of PageRank on a trillion-edge graph in 140 seconds in a cluster. In contrast, systems that support graph pattern mining fail to scale to even moderately sized graphs, and are slow, taking several hours to mine simple patterns [166, 57]. The main reason for the lack of scalability in pattern mining is the underlying complexity of these algorithms: mining patterns requires complex computations and storing exponentially large intermediate candidate sets. For example, a graph with a million vertices may possibly contain 10^17 triangles. While distributed graph-processing solutions are good candidates for processing such massive intermediate data, the need to do expensive joins to create candidates severely degrades performance. To overcome this, Arabesque [166] proposes new abstractions for graph mining in distributed settings that can significantly optimize how intermediate candidates are stored. However, even with these methods, Arabesque takes over 10 hours to count motifs in a graph with less than 1 billion edges.

We propose ASAP1, a system that enables both fast and scalable pattern mining. ASAP is motivated by one key observation: in many pattern mining tasks, it is often not necessary to output the exact answer. For instance, in FSM the task is to find the frequency of subgraphs with an end goal of ordering them by occurrences. Similarly, motif counting determines the number of occurrences of a given motif. In these scenarios, it is sufficient to provide an almost-correct answer. Indeed, our conversations with a social network firm [147] revealed that their application for social graph similarity uses a count of similar graphlets [139]. Another company's [147] fraud detection system similarly counts the frequency of pattern occurrences. In both cases, an approximate count is good enough. Furthermore, it is not necessary to materialize all occurrences of a pattern2. Based on these use cases, we build a system for approximate graph pattern mining.

Approximate analytics is an area that has gathered attention in big data analytics [5, 67, 20], where the goal is to let the user trade off accuracy for much faster results. The basic idea in approximation systems is to execute the exact algorithm on a small portion of the data, referred to as samples, and then rely on the statistical properties of these samples to compose partial results and/or error characteristics. The fundamental assumption underlying these systems is that there exists a relationship between the input size and the accuracy of the results which can be inferred. However, this assumption falls apart when applied to graph pattern mining. In particular, running the exact algorithm on a sampled graph may not result in a reduction of runtime or a good estimation of error (§4.2.2). Instead, in ASAP, we leverage graph approximation theory, which has a rich history of proposing approximation algorithms for mining specific patterns such as triangles.

1: for A Swift Approximate Pattern-miner
2: In fact, it may even be infeasible to output all embeddings of a pattern in a large graph.
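To give a flavor of this literature, the sketch below estimates the triangle count of an undirected graph by wedge sampling: sample random length-two paths (wedges) in proportion to how many each vertex contributes, and check whether they close into triangles. This is a classic single-machine estimator from the approximation literature, not ASAP's algorithm; ASAP generalizes this style of instance sampling to arbitrary patterns in a distributed setting.

    import scala.util.Random

    // Wedge-sampling triangle estimator. A wedge is a path of length two;
    // each triangle contains exactly three wedges, so the triangle count is
    // (fraction of sampled wedges that close) * (total wedges) / 3.
    def estimateTriangles(adj: Map[Long, Set[Long]], samples: Int, rnd: Random): Double = {
      val verts  = adj.keys.toArray
      val wedges = verts.map { v => val d = adj(v).size.toLong; d * (d - 1) / 2 }
      val totalW = wedges.map(_.toDouble).sum
      if (totalW == 0.0) return 0.0
      val cum = wedges.map(_.toDouble).scanLeft(0.0)(_ + _).tail  // cumulative weights

      // Pick a wedge center with probability proportional to its wedge count
      def sampleCenter(): Long = {
        val r = rnd.nextDouble() * totalW
        verts(cum.indexWhere(r < _))
      }

      var closed = 0
      for (_ <- 1 to samples) {
        val c  = sampleCenter()
        val ns = adj(c).toArray
        val i  = rnd.nextInt(ns.length)
        var j  = rnd.nextInt(ns.length)
        while (j == i) j = rnd.nextInt(ns.length)    // two distinct neighbors of c
        if (adj(ns(i)).contains(ns(j))) closed += 1  // wedge closes => a triangle
      }
      (closed.toDouble / samples) * totalW / 3.0
    }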

ASAP exploits a key idea that approximate pattern mining can be viewed as equivalent to probabilistically sampling random instances of the pattern. Using this as a foundation, ASAP extends the state-of-the-art probabilistic approximation techniques to general pat- terns in a distributed setting. This lets ASAP massively parallelize instance sampling and provide a drastic reduction in run-times while sacrificing a small amount of accuracy. ASAP captures this technique in a simple API that allows users to plugin code to detect a single instance of the pattern and then automatically orchestrates computation while adjusting the error bounds based on the parallelism. Further, ASAP makes pattern mining practical by supporting predicate matching and introducing caching techniques. In particular, ASAP allows mining for patterns where edges in the pattern satisfy a user-specified property. To further reduce the computation time, ASAP leverages the fact that in several mining tasks, such as motif finding, it is possible to cache partial patterns that are building blocks for many other patterns. Finally, an important problem in any approximation system is in allowing users to navigate the tradeoff between the result accuracy and latency. For this, ASAP presents a novel approach to build the Error-Latency Profile (ELP) for graph mining: it uses a small sample of the graph to obtain necessary information and applies Chernoff bound analysis to estimate the worst-case error profile for the original graph. The combination of these techniques allows ASAP to outperform Arabesque [166], a state-of-the-art exact pattern mining solution by up to 77× on the LiveJournal graph while incurring less than 5% error. In addition, ASAP can scale to graphs with billions of edges—for instance, ASAP can count all the 6 patterns in 4-motifs on the Twitter (1.5B edges) and UK graph (3.7B edges) in 22 and 47 minutes, respectively, in a 16 machine cluster. We make the following contributions in this work:

• We present ASAP, the first system to our knowledge, that does fast, scalable approxi- mate graph pattern mining on large graphs. (§4.3)

• We develop a general API that allows users to mine any graph pattern and present techniques to automatically distribute executions on a cluster. (§4.4)

• We propose techniques that quickly infer the relationship between approximation error and latency, and show that it is accurate across many real-world graphs. (§4.5)

• We show that ASAP handles graphs with billions of edges, a scale that existing systems failed to reach. (§4.6) CHAPTER 4. PATTERN MINING IN DYNAMIC CONNECTED DATA 45

4.2 Background & Motivation

We begin by discussing graph pattern mining algorithms and then motivate the need for a new approach to approximate pattern mining. We then describe recent advancements in graph pattern mining theory that we leverage.

4.2.1 Graph Pattern Mining Mining patterns in a graph represent an important class of graph processing problems. Here, the objective is to find instances of a given pattern in a graph or graphs. The common way of representing graph data is in the form of a property graph [148], where user-defined properties are attached to the vertices and edges of the graph. A pattern is an arbitrary subgraph, and pattern mining algorithms aim to output all subgraphs, commonly referred to as embeddings, that match the input pattern. Matching is done via sub-graph isomorphism, which is known to be NP-complete. Several varieties of graph pattern mining problems exist, ranging from finding cliques to mining frequent subgraphs. We refer the reader to [166, 7] for an excellent, in-depth overview of graph mining algorithms. A common approach to implement pattern mining algorithms is to iterate over all possible embeddings in the graph starting with the simplest pattern (e.g., a vertex or an edge). We can then check all candidate embeddings, and prune those that cannot be a part of the final answer. The resulting candidates are then expanded by adding one more vertex/edge, and the process is repeated until it is not possible to explore further. The obvious challenge in graph pattern mining, as opposed to graph analysis, is the exponentially large candidate set that needs to be checked. Distributed graph processing frameworks are built to process large graphs, and thus seem like an ideal candidate for this problem. Unfortunately when applied to graph mining problems, they face several challenges in managing the candidate set. Arabesque [166], a recently proposed distributed graph mining system, discusses these challenges in detail, and proposes solutions to tackle several of them. However, even Arabesque is unable to scale to large graphs due to the need to materialize candidates and exchange them between machines. As an example, Arabesque takes over 10 hours to count motifs of size 3 in a graph with less than a billion edges on a cluster of 20 machines, each having 256GB of memory. CHAPTER 4. PATTERN MINING IN DYNAMIC CONNECTED DATA 46

graph edge sampling triangle result (p=0.5) counting

0 0 1 1 4 1 4 e = 1 � = 2 �

2 3 2 3

(a) Uniform edge sampling. 100 12 Error 10 80 Speedup 8 60 6 40 Speedup

Error (%) 4 20 2 0 0 0 10 20 30 40 50 60 70 80 90 Edges Dropped (%) (b) 3-chains in Twitter graph 100 18 Error 16 80 Speedup 14 12 60 10 40 8 Speedup

Error (%) 6 20 4 2 0 0 0 10 20 30 40 50 60 70 80 90 Edges Dropped (%) (c) Triangles in UK graph

Figure 4.1: Simply extending approximate processing techniques to graph pattern mining does not work. CHAPTER 4. PATTERN MINING IN DYNAMIC CONNECTED DATA 47

4.2.2 Approximate Pattern Mining Approximate processing is an approach that has been used with tremendous success in solving similar problems in both the big data analytics [5, 67] and databases [43, 48, 49], and thus it is natural to explore similar techniques for graph pattern mining. However, simply extending existing approaches to graphs is insufficient. The common underlying idea in approximate processing systems is to sample the input that a query or an algorithm works on. Several techniques for sampling the input exists, for instance, BlinkDB [5] leverages stratified sampling. To estimate the error, approximation systems rely on the assumption that the sample size relates to the error in the output (e.g., if we sample K items from the original√ input, then the error in aggregate queries, such as SUM, is inversely proportional to K). It is straightforward to envision extending this approach to graph pattern mining—given a graph and a pattern to mine in the graph, we first sample the graph, and run the pattern mining algorithm on the sampled graph. Figure 4.1a depicts the idea as applied to triangle counting. In this example, the input graph consists of 10 triangles. Using uniform sampling on the edges we obtain a graph with 50% of the edges. We can then apply triangle counting on this sample to get an answer 1. To scale this number to the actual graph, we can use several ways. One naive way is to double it, since we reduced the input by half. To verify the validity of the approach, we evaluated it on the Twitter graph [104] for finding 3-chains and the UK webgraph [34] graph for triangle counting. The relation between the sample size, T error and the speedup compared to running on the original graph ( orig ) is shown in Tsample figs. 4.1b and 4.1c respectively. These results show the fundamental limitations of the approach. We see that there is no relation between the size of the graph (sample) and the error or the speedup. Even very small samples do not provide noticeable speedups, and conversely, even very large samples end up with significant errors. We conclude that the existing approximation approach of running the exact algorithm on one or more samples of the input is incompatible with graph pattern mining. Thus, we propose a new approach.

4.2.3 Graph Pattern Mining Theory Graph theory community has spent significant efforts in studying various approximation techniques for specific patterns. The key idea in these approaches is to model the edges in the graph as a stream and sample instances of a pattern from the edge stream. Then the probability of sampling is used to bound the number of occurrences of the pattern. CHAPTER 4. PATTERN MINING IN DYNAMIC CONNECTED DATA 48

There has been a large body of theoretical work on various algorithms to sample specific patterns and analysis to prove their bounds [133, 171, 40, 14, 135, 8, 93]. While the intuition of using such sampling to approximate pattern counts is straight- forward, the technical details and the analysis are quite subtle. Since sampling once results in a large variance in the estimate, multiple rounds are required to bound the variance. Consider triangle counting as an example. Naively, one would design an technique that uniformly samples three edges from the graph without replacement. Since the probability of sampling one edge is 1/m in a graph of m edges, the probability of sampling three edges is 1/m3. If the sampled three edges form a triangle, we estimate the number of triangles to be m3 (the expectation); otherwise, the estimation is 0. While such a sampling technique is unbiased, since m is large in practice, the probability that the sampling would find a triangle is very low and the variance of the result is very large. Obtaining an approximated count with high accuracy, would require a large number of trials, which not only consumes time but also memory. Neighborhood sampling [135] is a recently proposed approach that provides a solution to this problem in the context of a specific graph pattern, triangle counting. The basic idea is to sample one edge and then gradually add more edges until the edges form a triangle or it becomes impossible to form the pattern. This can be analyzed by Bayesian probability [135]. Let’s denote E as the event that a pattern is formed, E1, E2, ... , Ek are the events that edges e1, e2, ... , ek are sampled and stored. Thus the probability of a pattern is actually sampled can be calculated as Pr(E) = Pr(E1 ∩ E2 · · · ∩ Ek) = Pr(E1) × Pr(E2|E1) · · · × Pr(Ek|E1, ... , Ek−1). Intuitively, compared to the naive sampling, neighborhood sampling increases the probability that each trial would find an instance of the given pattern, and thus requires fewer estimations to achieve the same accuracy.

Example: Triangle Counting To illustrate neighborhood sampling, we will revisit the triangle counting example discussed earlier. To sample a triangle from a graph with m edges, we need three edges:

• First edge l0. Uniformly sample one edge from the graph as l0. The sampling probability Pr(l0) = 1/m.

• Second edge l1. Given that l0 is already sampled, we uniformly sample one of l0’s adjacent edges (neighbors) from the graph, which we call l1. Note that neighborhood sampling depends on the ordering of edges in the stream and l1 appears after l0 here. The sampling probability Pr(l1|l0) = 1/c, where c is the number l0’s neighbors appearing after l0. CHAPTER 4. PATTERN MINING IN DYNAMIC CONNECTED DATA 49

graph estimator neighborhood result (r=4) sampling

E0 � = 40 0 E1 � = 0 1 1 4 � = 10 � E2 � = 0 2 3 E3 � = 0

edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)

Figure 4.2: Triangle count by neighborhood sampling

• Third edge l2. Find l2 to finish if edges l2, l1, l0 form a triangle and l2 appears after l1 in the stream. If such a triangle is sampled, the sampling probability is Pr(l0 ∩ l1 ∩ l2) = Pr(l0) × Pr(l1|l0) × Pr(l2|l0, l1) = 1/mc.

The above technique describes the behaviors of one sampling trial. For each trial, if it successfully samples a triangle, converting probabilities to expectation, ei = mc will be 1 the estimate of the triangles in the graph. For a total of r trials, r r ei is output as the approximate result. Figure 4.2 presents an example of a graph with five nodes. P

4.2.4 Challenges While the neighborhood sampling algorithm described above has good theoretical prop- erties, there are a number of challenges in building a general system for large-scale approximate graph mining. First, neighborhood sampling was proposed in the context of a specific graph pattern (triangle counting). Therefore, to be of practical use, ASAP needs to generalize neighborhood sampling to other patterns. Second, neighborhood sampling and its analysis assume that the graph is stored in a single machine. ASAP focuses on large-scale, distributed graph processing, and for this it needs to extend neighborhood sampling to computer clusters. Third, neighborhood sampling assumes homogeneous vertices and edges. Real-world graphs are property graphs, and in practice pattern mining queries require predicate matching which needs the technique to be aware of vertex and edge types and properties. Finally, as in any approximate processing system, ASAP needs CHAPTER 4. PATTERN MINING IN DYNAMIC CONNECTED DATA 50

count: 21453 +/- 14 graphA.patterns(“a->b->c”, “100s”) 1 graphB.fourClique(“5.0%”,“95.0%”) confidence: 95%, Generalized Approximate time: 92s 6 7 Estimates:{error: <5%, time: 95s} Pattern Mining Estimates:{error: <5%, time: 60s} 3 2 Apache Spark Estimator Count Selection Embeddings (optional)

3 Twitter Graph Profiling 40 35 Twitter Graph Profiling 2 30 25 20 1 15

Runtime (min) 10

Error Rate (%) 5 0 0.5M 1M 1.5M 2.1M 0 No. of Estimators 0 0.5m 1m 1.5m 2.1m No. of Estimators 4 3 … Twitter Graph Profiling 40 35 Twitter Graph Profiling Building 2 30 25 20 1 15 10 Runtime (min)

Error Rate (%) 5

0 Profile Latency 0 0.5M 1M 1.5M 2.1M

0 0.5m 1m 1.5m 2.1m - No. of Estimators No. of Estimators (ELP) or main memory … Error Graphs stored on disk disk on stored Graphs

Graph updates 5 Figure 4.3: ASAP architecture

to allow the end user to trade-off accuracy for latency and hence needs to understand the relation between run-time and error in a distributed setting.

4.3 ASAP Overview

We design ASAP, a system that facilitates fast and scalable approximate pattern mining. Figure 4.3 shows the overall architecture of ASAP. We provide a brief overview of the different components, and how users leverage ASAP to do approximate pattern mining in this section to aid the reader in following the rest of this chapter. User interface. ASAP allows the users to tradeoff accuracy for result latency. Specifically, a user can perform pattern mining tasks using the following two modes 1 :

• Time budget T. The user specifies a time budget T, and ASAP returns the most accurate answer within T with a error rate guarantee e and a configurable confidence level (default of 95%).

• Error budget . The user gives an error budget  and confidence level, and ASAP returns an answer within  in the shortest time possible. CHAPTER 4. PATTERN MINING IN DYNAMIC CONNECTED DATA 51

Before running the algorithm, ASAP first returns to the user its estimates on the time or error bounds it can achieve 6 . After user approves the estimates, the algorithm is run and the result presented to the user consists of the count, confidence level and the actual run time 7 . Users can also optionally ask to output actual (potentially large number of) embeddings of the pattern found. Development framework. All pattern mining programs in ASAP are versions of gen- eralized approximate pattern mining 2 we describe in detail in §4.4. ASAP provides a standard library of implementations for several common patterns such as triangles, cliques and chains. To allow developers to write program to mine any pattern, ASAP further provides a simple API that lets them utilize our approximate mining technique (§4.4.1). Using the API, developers simply need to write a program that finds a single instance of the pattern they are interested in, which we refer to as estimator. In a nutshell, our approximate mining approach depends on running multiple such estimators in parallel. Error-Latency Profile (ELP). In order to run a user program, ASAP first must find out how many estimators it needs to run for the given bounds 3 . To do this, ASAP builds an ELP. If the ELP is available for a graph, it simply queries the ELP to find the number of estimators 4 . Otherwise, the system builds a new ELP 5 using a novel technique that is extremely fast and can be done online. We detail our ELP building technique in §4.5. Since this phase is fast, ASAP can also accommodate graph updates; on large changes, we simply rebuild the ELP. System runtime. Once ASAP determines the number of estimators necessary to achieve the required error or time bounds, it executes the approximate mining program using a distributed runtime built on Apache Spark [192, 193].

4.4 Approximate Pattern Mining in ASAP

We now present how ASAP enables large-scale graph pattern mining using neighborhood sampling as a foundation. We first describe our programming abstraction(§4.4.1) that generalizes neighborhood sampling. Then, we describe how ASAP handles errors that arise in distributed processing(§4.4.2). Finally, we show how ASAP can handle queries with predicates on edges or vertices(§4.4.3).

4.4.1 Extending to General Patterns To extend the neighborhood sampling technique to general patterns, we leverage one simple observation: at a high level, neighborhood sampling can be viewed as consisting CHAPTER 4. PATTERN MINING IN DYNAMIC CONNECTED DATA 52

step 1 0 step 2 step 1 0

1 3 1 3

step 3 2 2 step 2 (a) (b)

Figure 4.4: Two ways to sample four cliques. (a) Sample two adjacent edges (0, 1) and (0, 3), sample another adjacent edge (1, 2), and wait for the other three edges. (b) Sample two disjoint edges (0, 1) and (2, 3), and wait for the other four edges.

of two phases, sampling phase and closing phase. In the sampling phase, we select an edge in one of two ways by treating the graph as an ordered stream of edges: (a) sample an edge randomly; (b) sample an edge that is adjacent to any previously sampled edges, from the remainder of the stream. In the closing phase, we wait for one or more specific edges to complete the pattern. The probability of sampling a pattern can be computed from these two phases. The closing phase always has a probability of 1 or 0, depending on whether it finds the edges it is waiting for. The probability of the sampling phase depends on how the initial pattern is formed and is a choice made by the developer. For a general graph pattern with multiple nodes, there can be multiple ways to form the pattern. For example, there are two ways to sample a four-clique with different probabilities, as shown in Figure 4.4. (i) In the first case, the sampling phase finds three adjacent edges, and the closing phase waits for rest three edges to come, in order to form the pattern. The sampling probability is 1 , mc1c2 where c1 is the number of the first edge’s neighbors and c2 represents the neighbor count of the first and the second edges. (ii) In the second case, the sampling phase finds two disjoint edges, and the closing phase waits for other four edges to form the pattern. The 1 sampling probability in this case is m2 .

Analysis of General Patterns We now show how neighborhood sampling, when captured using the two phases, can extend to general patterns. CHAPTER 4. PATTERN MINING IN DYNAMIC CONNECTED DATA 53

Definition 4.4.1 (General Pattern). We define a “general pattern” as a set of k connected vertices that form a subgraph in a given graph.

First, let’s consider how an estimator can (possibly) find any general patterns. We show how to sample one general pattern from the graph uniformly with a certain success probability, taking 2 to 5-node patterns as examples. Then, we turn to the problem of maintaining r 6 1 pattern(s) sampled with replacement from the graph. We sample r patterns and a reasonably large r will yield a count estimate with good accuracy. For the convenience of the analysis, we define the following notations: input graph G = (V, E) has m edges and n vertices, and we denote the occurrence of a given pattern in G as f(G). A pattern p = {ei, ej, ... } contains a set of ordered edges, i.e., ei arrives before ej when i < j. When describing the operation of an estimator, c(e) denotes the number of edges adjacent to e and appearing after e, and ci is c(e1, ... , ei) for any i > 1. For a given a pattern p∗ with k∗ vertices, the technique of neighborhood sampling produces p∗ with probability Pr[p = p∗, k = k∗]. The goal of one estimator is to fix all the vertices that form the pattern, and complete the pattern if possible.

Lemma 4.4.2. Let p∗ be a k-node pattern in the graph. The probability of detecting the pattern p = p∗ depends on k and the different ways to sample using neighborhood sampling technique. (1) When k = 2, the probability that p = p∗ after processing all edges in the graph by all possible neighborhood sampling ways is 1 Pr[p = p∗, k = 2] = m (2) When k = 3, the probability that p = p∗ is 1 Pr[p = p∗, k = 3] = m · c1 (3) When k = 4, the probability that p = p∗ is

∗ 1 1 Pr[p = p , k = 4] = 2 (Type-I) or (Type-II) m m · c1 · c2 (4) When k = 5, the probability that p = p∗ is

∗ 1 Pr[p = p , k = 5] = 2 (Type-I) m · c1 1 or = 2 (Type-II.a) m · c2 1 or = (Type-II.b) m · c1 · c2 · c3 CHAPTER 4. PATTERN MINING IN DYNAMIC CONNECTED DATA 54

Proof. Since a pattern is connected, the operations in the sampling phase are able to reach all nodes in a sampled pattern. To fix such a pattern, the neighborhood sampling needs to confirm all the vertices that form the pattern. Once the vertices are found, the probability of completing such a pattern is fixed. ∗ When k = 2, let p = {e1} be an edge in the graph. Let E1 be the event that e1 is found by neighborhood sampling. There is only one way to fix two vertices of the pattern—uniformly sampling an edge from the graph. By reservoir sampling, we claim that 1 Pr[p = p∗, k = 2] = Pr[E ] = 1 m When k = 3, we need to fix one more vertex beyond the case of k = 2. As shown in [135], we need to sample an edge e2 from e1’s neighbors that occur in the stream after 1 e1. Let E2 be the event that e2 is found. Since Pr[E2|E1] = , c(e1)

∗ 1 Pr[p = p , k = 3] = Pr[E1] · Pr[E2|E1] = m · c(e1)

When k = 4, we require one more step from the case of k = 2 or the case of k = 3, from extending neighborhood sampling. By extending from the case of k = 2 (denoted as Type-I), two more vertices are needed to fix a 4-node pattern. In Type-I, we independently ∗ ∗ find another edge e2 that is not adjacent to the sampled edge e1. Let E2 be the event that ∗ ∗ 1 e2 is found. Since Pr[E2|E1] = m ,

∗ ∗ ∗ Pr[p = p , k = 4] = Pr[p = p , k = 2] ∗ Pr[E2|E1] 1 = (Type-I) m2 When extending from the case k = 2 (denoted as Type-II), one more vertex is needed to fix a 4-node pattern. In Type-II, we sample a “neighbor” e3 that comes after e1ande2. Let E3 be the event that e3 is found. Since e3 is sampled uniformly from the neighbors of e1 1 and e2 and is appearing after e1, e2, Pr[E3|E1, E2] = . Thus, c(e1,e2)

∗ ∗ Pr[p = p , k = 4] = Pr[p = p , k = 3] · Pr[E3|E1, E2] 1 = (Type-II) m · c(e1) · c(e1, e2)

When k = 5, we again need one more step from the case k = 3 or the case k = 4. By extending from k = 3 (denoted as Type-I), we require two separate vertices to fix a 5-node ∗ pattern. In Type-I, we independently sample another edge e3 that is not adjacent to e1, e2. CHAPTER 4. PATTERN MINING IN DYNAMIC CONNECTED DATA 55

∗ ∗ ∗ 1 Let E3 be the event that e3 is found. Pr[E3|E1, E2] = m . Therefore, ∗ ∗ ∗ Pr[p = p , k = 5] = Pr[p = p , k = 3] ∗ Pr[E3|E1, E2] 1 = 2 (Type-I) m · c(e1) When extending from the case k = 4, we need to consider the two types separately. By extending Type-I of case k = 4 (denoted as Type-II.a), we need one more vertex to construct a 5-node pattern and thus we sample a neighboring edge e4. Let E4 be the event that e4 is found. Since e4 is sampled from the neighbors of e1, e2, ∗ ∗ ∗ Pr[p = p , k = 5] = Pr[p = p , k = 4] ∗ Pr[E4|E1, E2] 1 = 2 (Type-II.a) m · c(e1, e2) Similarly, by extending Type-II of case k = 4 (denoted as Type-II.b), 1 Pr[p = p∗, k = 5] = m · c(e1) · c(e1, e2) · c(e1, e2, e3)

Lemma 4.4.3. For pattern p∗ with k∗ nodes, let’s define 1 if p 6= ∅ t˜ = Pr[p=p∗,k=k∗]  0 if p = ∅ Thus, E[t˜] = f(G). Proof. By Lemma 4.4.2, we know that one estimator samples a particular pattern p∗ with probability Pr[p = p∗, k = k∗]. Let p(G) be the set of a given pattern in the graph, E[t˜] = t˜(p 6= ∅) · Pr[p = p∗, k = k∗] = |p(G)| = f(G) ∗ p X∈p(G)

The estimated count is the average of the input of all estimators. Now, we consider how many estimators are needed to maintain an  error guarantee.

Theorem 4.4.4. Let r > 1, 0 <  6 1, and 0 < δ 6 1. There is an O(r)-space bounded algorithm that return an -approximation to the count of a k-node pattern, with probability at 2 2 C1m C2m∆ least 1 − δ. For a certain , when k = 4, we need r > f(G) Type-I estimators, or r > f(G) Type-II estimators for some constants C1 and C2, to achieve -approximation in the worst case; 2 2 C3m ∆ C4m ∆ When k = 5, we need r > f(G) Type-I estimators, or r > f(G) Type-II.a estimators, or 3 C5m∆ r > f(G) Type-II.b estimators, for some constants C3, C4, C5 in the worst case. CHAPTER 4. PATTERN MINING IN DYNAMIC CONNECTED DATA 56

Proof. Let’s first consider the case k = 4. Let Xi for i = 1, ... , r be the output value of i-th ¯ 1 r estimator. Let X = r i=1 Xi be the average of r estimators. By Lemma 4.4.3, we know that E[X ] = f(G) and E[X¯] = f(G). From the properties of graph G, we have c(e) ∆ for i P 6 ∀e ∈ E, where ∆ is the maximum degree (note that in practice ∆ isn’t a tight bound for 2 the edge neighbor information). In Type-I, Xi 6 m and we construct random variables Y = Xi Y = [0 1] Y = r Y E[Y] = f(G)r i m2 such that i , . Let i=1 i and m2 . Thus the probability that the estimated number of patterns has a more than  relative error off its expectation f(G) P ¯ δ is Pr[X > (1 + )f(G)] 6 2 , which is at most

r 2 2 −  E[Y] −  E[Y] δ Pr[ Y > (1 + )E[Y]] e 2+ e 3 i 6 6 6 2 i=1 X 2 r 3m 2 r by Chernoff bound. Thus > 2f(G) · ln δ . Similarly, this lower bound of holds for Pr[X¯ < (1 − )f(G)]. X 6m∆2 Y = Xi Y = [0 1] Y = r Y E[Y] = In Type-II, i 6 . Let i 6m∆2 such that i , . Let i=1 i and 2 f(G)r r 18m∆ · ( 2 ) k = 5 6m∆2 . By Chernoff bound, > 2f(G) ln δ . Similarly, when ,P we (theoretically) 2 2 3 6m ∆ ( 2 ) 12m ∆ ( 2 ) 24m∆ ( 2 ) need 2f(G) · ln δ Type-I estimators, 2f(G) · ln δ Type-II.a estimators, and 2f(G) · ln δ Type-II.b estimators. Since each estimator stores O(1) edges, the total memory is O(r).

Programming API ASAP automates the process of computing the probability of finding a pattern, and derives an expectation from it by providing a simple API that captures two phases. The API, shown in Table 4.1, consists of the following five functions:

• SampleVertex uniformly samples one vertex from the graph. It takes no input, and outputs v and p, where v is the sampled vertex, and p is the probability that sampled v, which is the inverse of the number of vertices.

• SampleEdge uniformly samples one edge from the graph. It also takes no input, and outputs e and p, where e is the sampled edge, and p is the sampling probability, which is the inverse of the number of edges of the graph.

• ConditionalSampleVertex conditionally samples one vertex from the graph, given subgraph as input. It outputs v and p, where v is the sampled vertex and p is the probability to sample v given that subgraph is already sampled. CHAPTER 4. PATTERN MINING IN DYNAMIC CONNECTED DATA 57

API Description sampleVertex: ()→(v, p) Uniformly sample one vertex from the graph. SampleEdge: ()→(e, p) Uniformly sample one edge from the graph. ConditionalSampleVertex:(subgraph)→(v, p) Uniformly sample a vertex that appears after a sampled subgraph. ConditionalSampleEdge:(subgraph)→(e, p) Uniformly sample an edge that is adja- cent to the given subgraph and comes after the subgraph in the order. ConditionalClose:(subgraph, subgraph) Given a sampled subgraph, check if an- →boolean other subgraph that appears later in the order can be formed.

Table 4.1: ASAP’s Approximate Pattern Mining API.

• ConditionalSampleEdge(subgraph) conditionally samples one edge adjacent to subgraph from the graph, given that subgraph is already sampled. It outputs e and p, where e is the sampled edge and p is the probability to sample e given subgraph.

• ConditionalClose(subgraph, subgraph) waits for edges that appear after the first subgraph to form the second subgraph. It takes the two subgraphs as input and outputs yes/no, which is a boolean value indicating whether the second subgraph can be formed. This function is usually used as the final step to sample a pattern where all nodes of a possible instance have been fixed (thereby fixing the edges needed to complete that instance of the pattern) and the sampling process only awaits the additional edges to form the pattern.

These five APIs capture the two phases in neighborhood sampling and can be used to develop pattern mining algorithms. To illustrate the use of these APIs, we describe how they can be used to write two representative graph patterns, shown in Figure 4.5. Chain. Using our API to write a sampling function for counting three-node chains is straightforward. It only includes two steps. In the first step, we use SampleEdge() to uniformly sample one edge from the graph (line 1). In the second step, given the first sampled edge, we use ConditionalSampleEdge (subgraph) to find the second edge of the three-node chain, where subgraph is set to be the first sampled edge (line 2). Finally, if the algorithm cannot find e2 to form a chain with e1 (line 3), it estimates the number of three-node chains to be 0; otherwise, since the probability to get e1 and e2 is p1 · p2, it estimates the number of chains to be 1/(p1 · p2). CHAPTER 4. PATTERN MINING IN DYNAMIC CONNECTED DATA 58

SampleThreeNodeChain SampleFourCliqueType1 (e1, p1) = SampleEdge() (e1, p1) = SampleEdge() (e2, p2) = ConditionalSampleEdge(Subgraph(e1)) (e2, p2) = ConditionalSampleEdge(Subgraph(e1)) ,→ ,→ if (!e2) if (!e2) return 0 return 0 (e3, p3) = ConditionalSampleEdge(Subgraph(e1, else ,→ e2)) return 1/(p1.p2) if (!e3) return 0 subgraph1= Subgraph(e1,e2,e3) subgraph2= FourClique(e1,e2,e3)-subgraph1 if (ConditionalClose(subgraph1,subgraph2)) return 1/(p1.p2.p3) else return 0

Figure 4.5: Example approximate pattern mining programs written using ASAP API

Four clique. Similarly, we can extend the algorithm of sampling three node chains to sample four cliques. We first sample a three-node chain (line 1-2). Then we sample an adjacent edge of this chain to find the fourth node (line 4). Again, during the three steps, if any edges were not sampled, the function would return 0 as no cliques would be found (line 3 and 5). Given e1, e2 and e3, all the four nodes are fixed. Therefore, the function only needs to wait for all edges to form a clique (line 8-9). If the clique is formed, it estimates the number of cliques to be 1/(p1 · p2 · p3); otherwise, it returns 0 (line 10). Figure 4.4(a) illustrates this sampling procedure (CliqueType1).

4.4.2 Applying to Distributed Settings Capturing general graph pattern mining using the simple two phase API allows ASAP to extend pattern mining to distributed settings in a seamless fashion. Intuitively, each execution of the user program can be viewed as an instance of the sampling process. To scale this up, ASAP needs to do two things. First, it needs to parallelize the sampling processes, and second, it needs to combine the outputs in a meaningful fashion that preserves the approximation theory. For parallelizing the pattern mining tasks, ASAP’s runtime takes the pattern mining program and wraps it into an estimator3 task. ASAP first partitions the vertices in the graph across machines and executes many copies of the estimator task using standard dataflow operations: map and reduce. In the map phase, ASAP schedules several copies of the estimator task on each of the machines. Each estimator task operates on the local subgraph in each machine and produces an output, which is a partial count. ASAP’s 3Since each program is providing an estimate of the final answer. CHAPTER 4. PATTERN MINING IN DYNAMIC CONNECTED DATA 59

map: w(=3) workers reduce

subgraph partial count c0 0 (using r estimators)

subgraph partial count c graph 1 �(�) � 1 (using r estimators)

subgraph partial count c2 2 (using r estimators)

Figure 4.6: Runtime with graph partition.

runtime ensures that each estimator in a machine sees the graph’s edges and vertices in the same order, which is important for the sampling process to produce correct results. Note that although every estimator in each partition sees the graph in the same order, there is no restriction on what the order might be (e.g., there is no sorting requirement), thus ASAP uses a random ordering which is fast and requires no pre-processing of the graph. Once this is completed, ASAP runs a reduce task to combine the partial counts and obtain the final answer. This is depicted in fig. 4.6. This massively parallel execution is one of the reasons for huge latency reduction in ASAP. Since the input to the reduce phase is simply an array of numbers, ASAP’s shuffle is extremely light-weight, compared to a system that produces exact answers (and needs to exchange intermediate patterns). Handling Underestimation. Only summing up the partial counts in the reduce phase underestimates the total number of instances, because when vertices are partitioned to the workers, the instances that span across the partitions are not counted. This results in our technique underestimating the results, and makes the theoretical bounds in neighborhood sampling invalid. Thus, ASAP needs to estimate the error incurred due to distributed execution and incorporate that in the total error analysis. We use probability theory to do this estimation. We enforce that the vertices in the graph are uniformly randomly distributed across the machines. ASAP is not affected by the normal shortcomings of random vertex partitioning [70] as the amount of data communication is independent of partitioning scheme used. In this case random vertex partitioning is in fact simple to implement, and allows us to theoretically analyze the underestimation. CHAPTER 4. PATTERN MINING IN DYNAMIC CONNECTED DATA 60

The theoretical proof for handling the underestimation is outside the scope of this work. Intuitively, we can think of the random vertex partitioning into w workers as uniform vertex coloring from w available colors. Vertices with the same color are at the same worker and each worker estimates patterns locally on its monochromatic vertices. By doing this coloring, the occurrence of a pattern has been reduced by a factor of 1/f(w), where f is a function of the number of workers and the pattern. For instance, a locally sampled triangle has three monochromatic vertices and the probability that this happens among all triangles is 1/w2. Thus by the linearity of expectation, each such triangle is scaled by f(w) = w2. A rigorous proof on the maximum possible w with small errors (in practice w can be >> 100), can be shown using concentration bounds and Hajnal-Szemerédi Theorem [133]. Similarly, each monochromatic 4-clique is scaled by f(w) = w3 and f(w) can be computed for any given pattern.

4.4.3 Advanced Mining Patterns Predicate Matching. In property graphs, the edges and vertices contain properties; and thus many real-world mining queries require that matching patterns satisfy some predicates. For example, a predicate query might ask for the count of all four cliques on the graph where every vertex in the clique is of a certain type. ASAP supports two types of predicates on the pattern’s vertices and edges all and atleast-one. For “all” predicate, queries specify a predicate that is applied to every vertex or edge. For example, such query may ask for “four cliques where all vertices have a weight of atleast 10”. To execute such queries, ASAP introduces a filtering phase where the predicate condition is applied before the execution of the pattern mining task. This results in a new graph which consists only of vertices and edges that satisfy the predicate. On this new graph, ASAP runs the pattern mining algorithm. Thus, the “all” predicate query does not require any changes to ASAP’s pattern mining algorithm. The “atleast-one” predicate allows specifying a condition that atleast one of the vertices or edges in the pattern satisfies. An example of such a query is “four cliques where atleast one edge has a weight of 10”. To execute such predicate queries, we modify the execution to take two passes on the edge list. In the first pass, edges that match the predicate are copied from the original edge list to a matched edge list. Every entry in the matched list is a tuple, (edge, pos), where pos is the position in the original list where the matched edge appears. In the second pass, every estimator picks the first edge randomly from the matched list. This ensures that the pattern found by the estimator (if it finds one) satisfies the predicate. For the second edge onwards, the estimator uses the original list but starts the search from the position at which the first matched edge was found. This ensures that ASAP’s probability analysis to estimate the error holds. CHAPTER 4. PATTERN MINING IN DYNAMIC CONNECTED DATA 61

Motif mining. Another query used in many real-world workloads is to find all patterns with a certain number of vertices. We define these as motif queries; for example a 3-motif query will look for two patterns, triangles and 3-chains. Similarly a 4-motif query looks for six patterns [146]. For motif mining we notice that several patterns have the same underlying building block. For example, in 4-motifs, 3-chains are used in many of the constituent patterns. To improve performance, ASAP saves the sampling phase’s state for the building block pattern. This state includes (i) the currently sampled edges, (ii) the probability of sampling at that point, and (iii) the position in the edge list up to which the estimator has traversed. All the patterns that use this building block are then executed starting from the saved state. This technique can significantly speedup the execution of motif mining queries and we evaluate this in Section 4.6.2. Refining accuracy. In many mining tasks, it is common for the user to first ask for a low accuracy answer, followed by a higher accuracy. For example, users performing exploratory analysis on graph data often would like to iteratively refine the queries. In such settings, ASAP caches the state of the estimator from previous runs. For instance, if a query with an error bound of 10% was executed using 1 million estimators, ASAP saves the output from these estimators. Later, when the same pattern is being queried, but with an error bound of 5% that requires 3 million estimators, ASAP only needs to launch 2 million, and can reuse the first 1 million.

4.5 Building the Error-Latency Profile (ELP)

A key feature in any approximate processing system is allowing users to trade-off accuracy for result latency. To do this for graph mining, we need to understand the relation between running time and error. In ASAP’s general, distributed graph pattern mining technique described earlier, the only configurable parameter is the number of estimator processes used for a mining task. By using r estimators and making r sufficient large, ASAP is able to get results with bounded errors. Since an estimator takes computation and memory resource to sample a pattern, picking the number of estimators r provides a trade-off between result accuracy and resource consumption. In other words, setting a specific number of estimators, Ne, results in a fixed runtime and an error within a certain bound. As an example, fig. 4.7 depicts the relation between the number of estimators, runtime and error for triangle counting run on the Twitter graph [104]. To enable the user to traverse this trade-off, ASAP needs to determine the correct number of estimators given an error or time budget. CHAPTER 4. PATTERN MINING IN DYNAMIC CONNECTED DATA 62

3 Twitter Graph

2

1 Runtime (min)

0.5M 1M 1.5M 2M No. of Estimators 40 35 Twitter Graph 30 25 20 15 10

Error Rate (%) 5 0 50k 1m 1.5m 2.1m No. of Estimators

Figure 4.7: The actual relations between number of estimators and run-time or error rate.

4.5.1 Building Estimator vs. Time Profile The time complexity of our approximation algorithm is linearly related to the number of edges in the graph and the number of estimators. Given a graph and a particular pattern, we find the computation time is dominated by the number of estimators when the number of estimators is large enough. From fig. 4.7, we see that the estimator-time curve is close to linear when the number of estimators is greater than 0.5M. Thus we propose using a linear model to relate the running time to the number of estimators. When the number of estimators is small, the computation time is also affected by other factors and thus the curve is not strictly linear. However, for these regions, it is not computationally expensive to profile more exhaustively. Therefore, to build the time profile, we exponentially space our data collection, gathering more points when the number of estimators is small and fewer points as the number of estimators grows. We use a profiling budget T ∗ to bound the total time spent on profiling. Algorithm 1 shows CHAPTER 4. PATTERN MINING IN DYNAMIC CONNECTED DATA 63

Algorithm 1 BuildTimeProfile(T ∗) 1: P ← ∅ // store points for the profile 2: T ← 0, t ← 0, α ← α∗ // α∗ can be a reasonable random start 3: while T + t <= T ∗ do 4: t ← run approximation algorithm with α estimators 5: P.add((α, t)) 6: α ← 2α 7: T ← T + t

the pseudo code. ASAP starts from using a small number of estimators (α ← α∗), and doubles α each time until the total profiling time exceeds the profiling cost T ∗. In practice, we have found that setting T ∗ in the minute granularity gives us good results.

4.5.2 Building Estimator vs. Error Profile Since error profile is non-linear (fig. 4.7), techniques like extrapolating from a few data points is not directly applicable. Some recent work has leveraged sophisticated techniques, such as experiment design [173] or Bayesian optimization [19] for the purpose of building non-linear models in the context of instance selection in the cloud. However, these techniques also require the system to compute the error for a given setting for which we need to know the ground-truth, say, by running the exact algorithm on the graph. Not only is this infeasible in many cases, it also undermines the usefulness of an approximation system. In ASAP, we design a new approach to determine the relationship between the number of estimators Ne and error . Our approach is based on two main insights: first, we observe that for every pattern based on the probability of sampling, a loose upper bound for the number of estimators required can be computed using Chernoff bounds. For instance for triangle counting, the sampling probability is 1/mc where m is the number of edges and c is the degree of first chosen edge( §4.2.3). This probability bound can be N > K∗m∗∆ K translated to an estimator of form e 2P (Theorem 3.3 [135]) where is a constant, m is the number of edges, ∆ is the maximum degree and P is the ground truth or the exact number of triangles. At a high level, the bound is based on the fact that the maximum degree vertex leads to the worst case scenario where we have the minimum probability of sampling. Similar bounds exist for 4-cliques and other patterns [135]. These theoretical bounds provide a relation between the number of estimators (Ne), error bound () and ground truth (P) in terms of the graph properties such as m and ∆. The second insight we use is that for smaller graphs we can get a very close approxi- mation to the ground truth by using a very large number of estimators. This is useful CHAPTER 4. PATTERN MINING IN DYNAMIC CONNECTED DATA 64

in practice as this avoids having to run the exact algorithm to get a good estimate of the ground truth. Based on these two insights, the steps we follow are:

(a) We first uniformly sample the graph by edges to reduce it to a size where we can obtain a nearly 100% accurate result. In our experiments, we find that 5 − 10% of the graph is appropriate according to the size of the graph.

(b) On the sampled graph, we run our algorithm with a large number of estimators (Ngt) to find Pˆs, a value very close to the ground truth for the sampled graph.

(c) Using Pˆs as the ground truth value and the theoretical relationship described above, we compute the value of other variables on the sampled graph. For example, in the sampled graph, it is easy to compute ms and ∆s, and then infer K by running varying number of estimators.

(d) Finally we scale the values ms, ∆s and Pˆs to the larger graph to compute Ne. We note that the scaled Pˆ might not be close to P for the larger graph. But as we use the worst case bound to compute Pˆs, the computed value of Ne offers a good bound in practice for the larger graph.

4.5.3 Handling Evolving Graphs The ELP building process in ASAP is designed to be fast and scalable. Hence, it is possible to extend our pattern mining technique to evolving graphs [86] by simply rebuilding the ELP every time the graph is updated. However, in practice, we don’t need to rebuild the ELP for every update. and that it is possible to reuse an ELP for a limited number of graph changes. Thus we use a simple heuristic where are a fixed number of changes, say 10% of edges, we rebuild the ELP. The general problem of accurately estimating when a profile is incorrect for approximate processing systems is hard [4] and in the future we plan to study if we can automatically determine when to rebuild the ELP by studying changes to the smaller sample graph we use in §4.5.2.

4.6 Evaluation

We evaluate ASAP using a number of real-world graphs and compare it to Arabesque, a state-of-the-art distributed graph mining system. Overall, our evaluations show that:

• Compared to Arabesque, we find ASAP can improve performance by up to 77× with just 5% loss of accuracy for counting 3-motifs and 4-motifs. CHAPTER 4. PATTERN MINING IN DYNAMIC CONNECTED DATA 65

Graph Nodes Edges Degrees CiteSeer [56] 3,312 4732 2.8 MiCo [56] 100,000 1,080,298 22 Youtube [109] 1,134,890 2,987,624 8 LiveJournal [109] 3,997,962 34,681,189 17 Twitter [104] 41.7 million 1.47 billion 36 Friendster [188] 65.5 million 1.80 billion 28 UK [34, 32] 105.9 million 3.73 billion 35

Table 4.2: Graph datasets used in evaluating ASAP.

• We find that ASAP can also scale to much larger graphs (up to 3.7B edges) whereas existing systems fail to complete execution.

• Our techniques to build error profile and time profile (ELP) are highly accurate across all the graphs while finishing within a few minutes.

Implementation. We built ASAP on Apache Spark [193], a general purpose dataflow engine. The implementation uses GraphX [71], the graph processing library of Spark to load and partition the graph. We do not use any other functionality from GraphX, and our techniques only use simple dataflow operators like map and reduce. As such, ASAP can be implemented on any dataflow engine. Datasets and Comparisons. Table 4.2 lists the graphs we use in our experiments. We use 4 small and 3 large graphs and compare ASAP against Arabesque [166] (using its open- source release [72] built on Apache Giraph [22]) on four smaller graphs: CiteSeer [56], Mico [56], Youtube [109], and LiveJournal [109]. For all other evaluations, we use the large graphs. Our experiments were done on a cluster of 16 Amazon EC2 r4.2xlarge instances, each with 8 virtual CPUs and 61GiB of memory. While all of these graphs fit in the main memory of a single server, the intermediate state generated (§4.2) during pattern mining makes it challenging to execute them. Arabesque, despite being a highly optimized distributed solution, fails to scale to the larger graphs in our cluster. We note that Arabesque (or any exact mining system) needs to enumerate the edges significantly more number of times compared to ASAP which only needs to do it once or twice, depending on the query. Patterns and Metrics. For evaluating ASAP, we use two types of patterns, motif s and cliques. For motifs, we consider 3-motifs (consisting of 2 individual patterns), and 4-motifs (consisting of 6 individual patterns) and for cliques, we consider 4-cliques. For our experiments, we run 10 trials for each point and report the median, and error bar in the ELP evaluation. We do not include the time to load the graph for any of the experiments CHAPTER 4. PATTERN MINING IN DYNAMIC CONNECTED DATA 66

10^3 ASAP 10^4 ASAP 299.2 3161 Arabesque Arabesque 10^3 291.4 10^2 162 22.5 10^2 11.8 15.8 11.5 41.6 14.9 18.1 10 4.5 7.3 12.1 2.8 10 1.1 Running Time (Sec) 1 Running Time (Sec) 1 CiteSeer Mico Youtube LiveJ CiteSeer Mico Youtube LiveJ

(a) 3-Motif Counting (b) 4-Motif Counting

Figure 4.8: ASAP is able to gain up to 77× improvement in performance against Arabesque. The gains increase with larger graphs and more complex patterns. Y-axis is in log-scale.

for ASAP and Arabesque. We use total runtime as the metric when raw performance is evaluated. When evaluating ASAP on its ability to provide errors within the requested bound, we need to know the actual error so that it can be compared with ASAP’s output. |t−t | We compute actual error as real , where t is the ground truth number of a specific treal real pattern in a given graph. Since this requires us to know the ground-truth, we use simpler, known patterns, such as triangles and chains, where the ground-truth can be obtained from verified sources for such experiments. Note that the actual error is only used for evaluation purposes. Unless otherwise stated, the ASAP evaluations were done with an error target of 5% at 95% confidence.

4.6.1 Overall Performance We first present the overall performance numbers. To do so, we perform comparisons with Arabesque and evaluate ASAP’s scalability on larger graphs. We do not include ELP building time in these numbers since it is a one-time effort for each graph/task and we measure this in §4.6.3. Comparison with Arabesque. In this experiment, we compare Arabesque and ASAP on the 4 smaller graphs (Table 4.2). In each of these systems, we load the graph first, and then warm up the JVM by running a few test patterns. Then we use each system to perform 3-motif and 4-motif mining, and measure the time taken to complete the task. In Arabesque, we do not consider the time to write the output. Similarly, for ASAP we do not output the patterns embeddings. The results are depicted in figs. 4.8a and 4.8b. CHAPTER 4. PATTERN MINING IN DYNAMIC CONNECTED DATA 67

3-Motif System Graph |V| |E| Runtime ASAP (5%) 16 x 8 Twitter 42M 1.5B 2.5m 16 x 8 Friendster 66M 1.8B 5.0m 16 x 8 UK 106M 3.7B 5.9m Arabesque 20 x 32 Inst4 180M 0.9B 10h45m 4-Motif System Graph |V| |E| Runtime ASAP (5%) 16 x 8 Twitter 42M 1.5B 22m 16 x 8 UK 106M 3.7B 47m 16 x 8 LiveJ 4M 34M 0.7m Arabesque 16 x 8 LiveJ 4M 34M 53m 20 x 32 SN4 5M 199M 6h18m

Table 4.3: Comparing the performance of ASAP and Arabesque on large graphs. The System column indicates the number of machines used and the number of cores per machine.

We see that ASAP significantly outperforms Arabesque on all the graphs on both the patterns, with performance improvements up to 77× with under 5% loss of accuracy. The performance improvements will increase if the user is able to afford a larger error (e.g., 10%). We also noticed that the performance gap between Arabesque and ASAP increases with larger graph and/or more complex patterns. In this experiment, mining the more complex pattern (4-motif) on the largest graph (LiveJournal) provides the highest gains for ASAP. This validates our choice of using approximation for large-scale pattern mining. Scalability on Larger Graphs. We repeat the above experiment on the larger graphs. Since Arabesque fails to execute on these graphs on our cluster, we also provide per- formance numbers that were reported by its authors [166] as a rough comparison. The results are shown in Table 4.3. When mining for 3-motif, ASAP performs vastly superior on the Twitter, the Friendster, and the UK graphs. Arabesque’s authors report a run time of approximately 11 hours on a graph with a similar number of edges. This translates to a 258× improvement for ASAP. In the case of 4-motifs, ASAP is easily able to scale to the more complex pattern on larger graphs. In comparison, Arabesque is only able to handle a much smaller graph with less than 200 million edges. Even then, it takes over 6 hours to mine all the 4-motif patterns. These results indicate that ASAP is able to not only outperform state-of-the-art solutions significantly, but do so in a much smaller cluster. ASAP is able to effortlessly scale to large graphs. 4These graph datasets in Arabesque are not publicly available. CHAPTER 4. PATTERN MINING IN DYNAMIC CONNECTED DATA 68

Pattern Baseline ASAP Improv. Motif Mining 32.2min 22min 32% Predicate Matching 2.5min 27s 82% Accuracy Refinement 2.5min 1.5min 40%

Table 4.4: Improvements from techniques in ASAP that handle advanced pattern mining queries.

4.6.2 Advanced Pattern Mining We next evaluate the advanced pattern mining capabilities in ASAP described in §4.4.3. Motif mining. We first evaluate the impact of ASAP’s optimization when handling motif queries for multiple patterns. We use the Twitter graph and study a 4-motif query that looks for 6 different patterns. In this case ASAP caches the 3-node chain that is shared by multiple patterns. As shown in Table 4.4, we see a 32% performance improvement from this. Predicate Matching. To study how well predicate matching queries work, we annotate every edge in the Twitter graph with a randomly chosen property. We then consider a 3-motif query which matches 10% of the edges. With ASAP’s filtering based technique, the “all” query completes in 27 seconds, compared to 2.5 minutes when running without pre-filtering. Accuracy Refinement. We study a scenario where the user first launches a 3-motif query on the Twitter graph with 10% error guarantee and then refines the results with another query that has a 5% error bound. We find that the running time goes from 2.5min to 1.5min (40% improvement) when our caching technique is enabled.

4.6.3 Effectiveness of ELP Techniques Here, we evaluate the effectiveness of the ELP building techniques in ASAP, described in §4.5. Time Profile. To evaluate how well our time profiling technique (§4.5.1) works, we run three patterns—3-chains, triangles, and 4-cliques—on the three large graphs. In each graph, we obtain the time vs. estimator curve by exhaustively running the mining task with varying number of estimators and noting the time taken to complete the task. We then use our time profiling technique which uses a small number of points instead of exhaustive profiling to obtain ASAP’s estimate. We plot both the curves in fig. 4.9 for each of the three graphs. In these figures, the colored lines represent the actual (exhaustively profiled) curve, and the black line shows ASAP’s estimate. From the figure we can see CHAPTER 4. PATTERN MINING IN DYNAMIC CONNECTED DATA 69 5M 4M 3M 2M No. of Estimators 10M Triangle Counting UK (b) Twitter 1M 8M Friendster 50k

9 8 7 6 5 4 3 2 1

6M 12 11 10 Runtime (min) Runtime 4M Clique Counting No. of Estimators 60K UK (c) Twitter 2M Friendster 1M UK 50k 40K 5 1

60 50 40 30 20 15 10 Twitter Runtime (min) Runtime Friendster Chain Counting 20K No. of Estimators (a) Runtime vs. number of estimators for Twitter, Friendster, and UK graphs. The black solid lines are : 0 9

.

4 4 3 2 1 Runtime (min) Runtime Figure ASAP’s fitted lines. CHAPTER 4. PATTERN MINING IN DYNAMIC CONNECTED DATA 70

Graph Task Time Profile Error Profile 3-Chain 5.2m 2.1m UK-2007-05 3-Motif 6.1m 2.7m 4-Clique 9.5m 4.8m 4-Motif 11.2m 5.9m

Table 4.5: ELP building time for different tasks on UK graph

that the time profile estimated by ASAP very closely tracks the actual time taken, thereby showing the effectiveness of our technique. Error Profile. We repeat the experiment for evaluating ASAP’s error profile building technique. Here, we exhaustively build the error profile by running a different number of estimators on each graph, and note the error. Then we use ASAP’s technique of using a small portion of the graph to build the profile. We show both in fig. 4.10. We see that the actual errors are always within the estimated profile. This means that ASAP is able to guarantee that the answer it returns is within the requested error bound. We also note that in real-world graphs, the worst-case bounds are never really reached. In edge cases, where the number of patterns in the graphs are high like the chains in UK graph, the overestimation may be large, and one concern might be that we run more estimators than required. We are working on techniques that can help us determine a tighter bound for the number of estimators in the future but as discussed in §4.6.1, even with this overestimation we get significant speedups in practice. This experiment confirms that ASAP’s heuristic of using a very small portion of the graph and leveraging the Chernoff bound analysis (§4.5.2) is a viable approach. Error rate Confidence. In Figure 4.11, we evaluate the cumulative distribution function (CDF) of 100 independent runs on the UK graph with 3% error target and 99% confidence. We can see that 100/100 actual results are not worse than 3% error and 74/100 results are within 2% error. Thus the actual results are even better than the theoretical analysis for 99% confidence. ELP Building Time. Finally, we evaluate the time taken for building the profiling curves. For this, we use the UK graph and configure ASAP to use 1% of the graph to build the error profile. The results are shown in table 4.5 for different patterns, which shows that the time to build the profiles is relatively small, even for the largest graph.

Figure 4.10: Error vs. number of estimators for the Twitter, Friendster, and UK graphs: (a) chain counting, (b) triangle counting, (c) clique counting. Each plot shows the actual error and the profiled worst-case error.

4.6.4 Scaling ASAP on a Cluster

ASAP partitions the graph into subgraphs using random vertex partitioning and aggregates the scaled results in the final reduce phase. In this section, we evaluate how configurations with different numbers of machines impact accuracy.

Figure 4.11: CDF of error across 100 runs on the UK graph with a 3% error target.

Figure 4.12: Errors from two cluster scenarios with different numbers of machines. Config-1 (strong scaling) fixes the total number of estimators at 2M × 128; Config-2 (weak scaling) fixes the number of estimators per executor at 2M.

In Fig. 4.12, we consider two scenarios: strong scaling, where we fix the total number of estimators used for the entire graph and increase the number of machines; and weak scaling, where we fix the number of estimators per machine and thus scale the total number of estimators as more machines are added. We run the triangle counting task on the Twitter graph with cluster sizes of 4, 8, 12, and 16 machines. In the strong-scaling regime, adding more machines has no impact on ASAP's accuracy: we are able to correctly adjust the estimators as more graph partitions are created. In the weak-scaling case, accuracy improves as we add machines, which is the expected behavior with more estimators.

Figure 4.13: Two representative patterns (of 21) in the 5-motif family: the 5-chain and the 5-house.

5-Chain
System       Machines x Cores   Graph     |V|     |E|     Runtime
ASAP (5%)    16 x 16            Twitter   42M     1.5B    9.2m
             16 x 16            UK        106M    3.7B    17.3m
ASAP (10%)   16 x 16            Twitter   42M     1.5B    3.2m
             16 x 16            UK        106M    3.7B    6.5m

5-House
System       Machines x Cores   Graph     |V|     |E|     Runtime
ASAP (5%)    16 x 16            Twitter   42M     1.5B    12.3m
             16 x 16            UK        106M    3.7B    22.1m
ASAP (10%)   16 x 16            Twitter   42M     1.5B    5.6m
             16 x 16            UK        106M    3.7B    14.2m

Table 4.6: Approximating 5-Motif patterns in ASAP.

4.6.5 More Complex Patterns

Finally, we evaluate the generality of ASAP's techniques by applying them to mine 5-motifs, which comprise 21 individual patterns. This choice was influenced by our conversations with industry partners, who use similar patterns in their production systems. Due to the complexity of the patterns, we used a larger cluster for this experiment, consisting of 16 machines, each with 16 cores and 128GB of memory. Because of space constraints and the absence of a comparable system, we only report ASAP's performance on two representative patterns (fig. 4.13) in table 4.6. As the table shows, ASAP handles complex patterns on large graphs with ease.

4.7 Related Work

A large number of systems have been proposed in the literature for graph processing [111, 70, 105, 151, 71, 143, 150, 178, 39, 44, 199]. Of these, some [111, 105, 151] are single-machine systems, while the rest support distributed processing. By using careful and optimized operations, these systems can process huge graphs, on the order of a trillion edges. However, these systems have focused mainly on graph analysis and do not support efficient graph pattern mining. Some implement very specific versions of simple pattern mining (e.g., triangle counting), but do not support general pattern mining.

Similar to graph processing systems, a number of graph mining systems have also been proposed. Here too, the proposals contain a mix of centralized and distributed systems, and they fall into two categories. The first category focuses on mining patterns in an input consisting of multiple small graphs. This problem is significantly easier, since the system only needs to find one instance of the pattern in each graph, and it is trivially incorporated in ASAP. Since this approach can be massively parallelized, several distributed systems focus specifically on this problem. The second category mines patterns in a single large graph, which is the setting ASAP targets. The state-of-the-art in distributed, general-purpose pattern mining systems is Arabesque [166]. While it supports efficient pattern mining, it still requires a significant amount of time to process even moderately sized graphs. A few distributed systems have focused on approximate pattern mining; however, each targets a specific algorithm and hence is not general-purpose.

In distributed data processing, approximate analysis systems [5, 67, 20] have recently gained popularity due to the time required to process large datasets. Following approximate query processing theory from the database community, these systems reduce the amount of data used in the analysis in the hope that the analysis time is also reduced. However, as we show in this work, applying an exact algorithm to a sampled graph does not yield the desired results; moreover, doing so complicates, or even makes infeasible, providing good time or error guarantees.

The theory community has invested significant effort in analyzing and proposing approximate graph algorithms for several graph analysis tasks [51, 68, 9, 10, 36, 23]. None of these are aimed at distributed processing, nor do they propose ways to understand the performance profile of the algorithms when deployed in the real world. We leverage this rich theoretical foundation by extending these algorithms to mine general patterns in a distributed setting, and we further devise a strategy to build accurate profiles to make the approach practical.

4.8 Summary

We present ASAP, a distributed, sampling-based approximate computation engine for graph pattern mining. ASAP leverages graph approximation theory and extends it to general patterns in a distributed setting. It further employs a novel ELP building technique that allows users to trade off accuracy for result latency. Our evaluation shows that ASAP not only outperforms state-of-the-art exact solutions by more than an order of magnitude, but also scales to large graphs with modest resource demands.

Chapter 5

Approximate Analytics on Dynamic Connected Data

5.1 Introduction

The last chapter showed how embracing approximation can significantly speed up pattern queries for use cases such as Taylor's and make pattern discovery feasible on medium to large datasets, which are much more realistic in enterprises [153]. We now look at the feasibility of extending approximation to analytic queries. Approximate analytics is an area that has recently garnered attention in big data analytics [5, 67, 20], where the goal is to let the end user trade off accuracy for much faster results. Several proposals for approximate analytics exist, but the underlying key idea is to use a small portion of the dataset to compute the results. Some approximation systems leverage the scheduler and selectively kill tasks to achieve a desired accuracy or latency budget. However, all of these systems focus on simple aggregate queries or analytics, and thus do not consider complex, iterative workloads such as distributed graph processing.

Extending approximate analytics systems to support graph analytics is a challenging task because of differences in the underlying assumptions. The fundamental assumption of a linear relationship between sample size and execution time falls apart in graph processing. Further, approximation systems rely on statistical properties of the samples to compose partial results and/or error characteristics. Finally, these systems store multiple samples and cherry-pick the right amount based on the linearity assumption. These techniques are difficult to incorporate into distributed graph processing due to the iterative nature of the algorithms.

System            PageRank Runtime (s)
PowerGraph [70]   300
GraphX [71]       419
Giraph [22]       596
GraphLab [111]    442

Table 5.1: PageRank runtimes as reported by popular graph processing systems used in production.

We explore the feasibility of bringing approximate analytics to distributed graph processing. Achieving efficient approximate graph processing faces a number of challenges, including how to sample graphs and how to pick the right sampling parameter given a budget (§5.2). To address these challenges, we leverage recent advances in the spectral sparsification literature [163]. Specifically, we propose a spectral graph sparsification strategy that significantly reduces the graph size. We then devise a machine learning based approach to modeling performance and picking the right sparsification parameter (§5.3). We implement our proposed techniques in a system called GAP (for Graph Analytics by Proximation). Our evaluation of GAP shows encouraging results (§5.4).

5.2 Background & Challenges

We begin with a brief overview of graph-parallel systems and approximate analytics, and then list the challenges in building a system for approximate graph analytics.

5.2.1 Graph Processing Systems

The bottleneck in distributed graph-parallel processing arises mainly from message passing between vertices. In a big data system, these messages are implemented as shuffles, which are quite expensive. As a result, executing graph algorithms takes a non-negligible amount of time. Table 5.1 reproduces results reported in recent graph processing literature for running 20 iterations of the PageRank algorithm on a moderately sized graph of 1B edges using 16 machines. The execution time is on the order of several minutes, and performance worsens significantly as the input graph grows larger.


Figure 5.1: Random sampling leads to undesirable effects: execution time increases (i.e., speedup decreases) as samples get smaller.

5.2.2 Approximate Analytics

Approximate analytics is based on the premise that results from partial execution are often good enough. Systems supporting approximate analytics usually provide bounds on two dimensions—latency and accuracy—and let users trade off one for the other. These systems have been used successfully for query processing [5] and for dataflow jobs and straggler mitigation [20]. To provide this trade-off, they leverage sampling strategies. The basic observation is that the more data the system works on, the more accurate the results, and vice versa. Thus, given a corpus of data, approximation systems store samples of it chosen using various criteria. Given a latency or accuracy budget, the job of the system is then to pick the right amount of samples to process and/or to drop tasks once the desired result is achieved.

5.2.3 Challenges

While it may seem straightforward to marry approximate analytics with graph-processing systems, making approximate graph analytics a reality is far from trivial. A system for approximate graph analytics faces a number of challenges.

First, approximation systems rely on a linear relationship between the amount of data in the sample and the execution time. No such linear relationship exists in graph processing. While this could be helpful (i.e., a small reduction in input could lead to a large reduction in execution time), it also means that sampling can lead to undesirable outputs. To illustrate this, consider fig. 5.1, which shows the result of running connected components on different random samples of a graph. Surprisingly, the execution time does not improve at all; it even becomes worse when the sample is small. This is because blind sampling destroys the structure of the graph, leading to much longer paths. Thus, the simple sampling strategies employed by existing approximate analytics systems are not applicable in our setting.

Second, due to this non-linearity, picking the right amount of samples is difficult. Traditional approximation systems create, store, and precompute query results on samples based on the assumption that partial results and errors can be composed. However, this may not hold true in graphs. Thus, precomputing aggregates by creating and storing samples is not a feasible approach.

Finally, existing approximation systems support only simple queries, such as aggregates, where computing the error on the result is intuitive. Graph algorithms, however, execute iteratively, so estimating the error on the output of a graph algorithm operating on a sampled graph is hard. Theoretical bounds exist for a few specific algorithms, but to the best of our knowledge, there are no general guarantees.

5.3 Our Approach

We now describe our vision and approach for an approximate graph analytics system. In addition to solving the challenges listed earlier, we wish to achieve the following goals in our quest towards an efficient approximate graph analytics solution:

• A large body of graph-theoretic work exists in the area of approximation algorithms. These works propose efficient approximate versions of various graph processing algorithms. We do not want to depend on such approximate versions of any particular graph algorithm; in other words, we would like to be approximation-algorithm agnostic.

• Similarly, several flavors of distributed graph-processing engines exist. Some offer an asynchronous processing mode [70], while others offer the favorable properties of dataflow [71]. We would like to propose techniques that are generic and not specific to one graph-parallel model.

• Finally, existing graph processing systems support varied workloads. In this respect, we would like our solution to have low overhead when it needs to accommodate new workloads.

The overall architecture of our solution, GAP, is depicted in fig. 5.2. It consists of two main components. Leveraging work in spectral graph theory, a graph sparsifier reduces the size of the input graph. Based on the observation that a graph workload's performance characteristics depend primarily on the input graph [29], a machine learning (ML) based model learns and predicts the amount of sparsification required for a given budget. When an input graph is provided, we map it to one of the benchmark models using a simple mapping technique. Our intuition is that since the number of graph algorithms is limited and graph characteristics are described by a few variables, ML models are well suited to this task. We discuss these components in detail in the rest of this section.

Figure 5.2: GAP system architecture. Given a request to run algorithm A within a budget (e.g., T seconds), a sparsification selector consults learned models to configure the sparsifier, and the system returns the result along with an error estimate.

5.3.1 Graph Sparsification

The fundamental building block of any approximation system is sampling. Carrying this over to graphs, a straightforward approach is to sample edges and vertices using some criteria. This approach, commonly referred to as graph sparsification (also called graph sketching), has been studied extensively in the graph theory literature. The main idea in this body of work is to compute a (much) smaller graph that preserves crucial properties of the input graph. While several types of sparsifiers have been proposed, many of them are either computationally intensive or not amenable to a distributed implementation (which is the focus of our work). We developed a simple sparsifier, adapted from the work of Spielman and Teng [163], that is based on vertex degrees. The sparsifier keeps an edge between vertices a and b with probability:

\frac{d_{\mathrm{AVG}} \times s}{\min(d_a^{\,o},\; d_b^{\,i})} \qquad (5.1)

where d_AVG is the average degree of the graph, d_a^o is the out-degree of vertex a, d_b^i is the in-degree of vertex b, and s is a tunable parameter that controls the level of sparsification. Intuitively, we would rather drop one of the many edges of a high-degree vertex than drop the only edge of a low-degree vertex; the sparsifier in eq. (5.1) does exactly this. The cost of running the sparsifier is negligible, and we further reduce it by computing vertex degrees when the graph is first loaded into the system. One potential problem with the sparsifier is that it makes decisions solely on local information. To reduce the ill effects of this, we leverage how the algorithm operates: for instance, we can avoid removing an edge if it is in the spanning tree, and so on.
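As a concrete illustration, the sketch below applies the degree-based keep probability of eq. (5.1) to an edge list. The function and variable names are ours, and this is a simplified single-machine sketch rather than GAP's distributed implementation.

```python
import random

def sparsify(edges, out_deg, in_deg, d_avg, s):
    """Keep edge (a, b) with probability d_avg * s / min(out_deg[a], in_deg[b]).

    edges: iterable of (a, b) pairs; out_deg / in_deg: degree lookups computed
    once when the graph is loaded; s: sparsification parameter from eq. (5.1).
    """
    kept = []
    for a, b in edges:
        p = min(1.0, (d_avg * s) / min(out_deg[a], in_deg[b]))
        if random.random() < p:
            kept.append((a, b))
    return kept
```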

Estimating Error due to Sparsification

An important task when using sampling strategies is to estimate the error in the output. In a non-graph setting, error estimation is straightforward; however, it is unclear how to estimate the error that sparsification induces on the output of graph algorithms. We take a simple approach: we define a few error metrics, and leave the user the flexibility of defining additional ones. One default error metric in our system is the degree of reordering. This metric applies to algorithms that output a ranking over the vertices, for example PageRank or triangle counting, and measures how much the ranking is reordered compared to the ground truth. This flexibility exists because we learn the relation between error and sparsification.
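One simple way to quantify the degree of reordering is to measure how far vertices move relative to the exact ranking. The sketch below uses mean rank displacement; the exact definition used in GAP may differ, so treat this as an illustrative assumption.

```python
def degree_of_reordering(exact_ranking, approx_ranking):
    """Mean absolute rank displacement between two rankings of the same vertices.

    exact_ranking / approx_ranking: lists of vertex ids ordered by score.
    Returns 0.0 when the orders agree exactly; larger values mean more reordering.
    """
    exact_pos = {v: i for i, v in enumerate(exact_ranking)}
    displacement = sum(abs(exact_pos[v] - i) for i, v in enumerate(approx_ranking))
    return displacement / len(approx_ranking)
```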

5.3.2 Picking Sparsification Parameter

Once the sparsification strategy is in place, the next question is how to pick the right sparsification parameter s for a given accuracy requirement. To the best of our knowledge, theoretical error bounds for graph sparsification in a general setting are an open problem, hence we develop heuristics. Specifically, we use simple machine learning techniques to learn a model of the relation between s and performance (latency/error).

Building a Model for s

At the simplest level, one could build a model for s by running every possible algorithm on a given graph at varying values of s and then feeding the observed results to a learning algorithm. However, this requires too much time and effort, so an approximation system needs a smarter solution.

In GAP, we take a simple approach. We consider a set of standard graph algorithms. These algorithms are then run on a set of representative graphs at varying values of s. The objective of this task is to learn a function H that maps s and the characteristics of the graph and algorithm to the performance profile. That is, we would like to learn:

H : (s, a, g) → e/p

where a denotes algorithm-specific features (if any), g denotes graph-specific features, and e/p is the error/performance. Since there is no standard benchmark for distributed graph processing, we choose a representative set of algorithms and workloads from the Graph500 benchmark [73]. Our observation (§5.2.3) indicates that performance profiles are non-linear, hence we pick learning techniques that can accommodate discontinuity (e.g., random forests).
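To make this concrete, the sketch below trains a random forest to approximate H from offline profiling runs. The feature names, the illustrative numbers, and the use of scikit-learn are our assumptions for exposition; GAP's actual feature set and learner may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Each offline profiling run contributes one row: [s, algorithm features, graph features]
# and one target: the error (or speedup) observed in that run. Values are illustrative.
X = np.array([
    # s,   algo_id, diameter, clustering_coeff, avg_degree
    [0.9,  0,       18.0,     0.28,             35.0],
    [0.5,  0,       18.0,     0.28,             35.0],
    [0.2,  1,       7.5,      0.61,             110.0],
    [0.05, 1,       7.5,      0.61,             110.0],
])
y = np.array([1.2, 4.8, 2.1, 9.7])   # observed error (%) per run

model = RandomForestRegressor(n_estimators=100).fit(X, y)

# At query time: predict the error for a candidate s on a graph/algorithm,
# and choose the smallest s whose predicted error fits the budget.
print(model.predict([[0.3, 0, 18.0, 0.28, 35.0]]))
```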

Accommodating New Workloads

Once models are built, the final step is to use them to pick the sparsification parameter s when the system needs to run a graph algorithm on an unknown or new graph workload. We do not want to build a model per workload online (the model building phase is intensive and is therefore typically done offline). Thus, we need to find an existing model that can operate on the new workload. For this, we propose a lightweight mechanism.² We randomly pick a few values of the sparsification parameter and run the algorithm on the new workload in an online fashion. Simultaneously, we use the models to predict the output, and we then pick the model(s) with the least error. The random values of s can be chosen so that these tests complete within a given time budget. For every new workload, we also feed the results of running analytics back to our learning component, which lets us refine and improve our models over time.

²We are pursuing better techniques here at the time of writing.
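The following sketch illustrates this lightweight model-selection step: run the algorithm at a few random values of s on the new workload, compare the observed errors against each candidate model's predictions, and keep the model that matches best. All names are illustrative assumptions, not GAP's actual interfaces.

```python
import random

def select_model(candidate_models, run_workload, graph_features, algo_features, trials=3):
    """Pick the offline model that best predicts the new workload's behavior.

    candidate_models: dict name -> model exposing .predict([[s, *algo, *graph]]).
    run_workload: callable s -> observed error (%) from an actual online run.
    """
    probes = [(s, run_workload(s))
              for s in (random.uniform(0.1, 0.9) for _ in range(trials))]

    def mismatch(model):
        return sum(abs(model.predict([[s, *algo_features, *graph_features]])[0] - err)
                   for s, err in probes)

    return min(candidate_models, key=lambda name: mismatch(candidate_models[name]))
```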

5.4 Evaluation

We picked five openly available graph datasets [164, 33, 31] (with up to 3.7 billion edges for the largest graph) based on the characteristics of the underlying graph, such as diameter and clustering coefficient. We evaluate our hypothesis of building a sampling model based on graph characteristics as follows. We ran two algorithms, PageRank and triangle counting, on the datasets with varying values of the sparsification parameter s. We then recorded the speedup of each algorithm relative to execution on the complete graph, and noted the error by evaluating the degree of reordering (§5.3) for each value of s.


Figure 5.3: In triangle counting, we see similar trends in performance in graphs with similar characteristics (e.g., AstroPh & Facebook).


Figure 5.4: Error due to sparsification. Like the speedup, we see similarity in the error profile of graphs with similar characteristics.

Figure 5.3 shows the speedup obtained for the triangle counting algorithm, while fig. 5.4 depicts the error in terms of reordering. Even with a small reduction in input, the system is able to speed up execution, and as the error characteristics show, this speedup does not come at the expense of large errors: the error remains small over a wide range of the sparsification parameter. The diminishing returns with increasing sparsification may seem troubling, but they are due to the use of small datasets and to the fact that this experiment was run on a single machine (the system executes the same way as in a distributed setting but does not incur network penalties).


Figure 5.5: Larger graph (uk-2007-05 [33, 31] with 3.7B edges) sees better speedup due to the distributed nature of the execution.

As the graph grows larger and the computation is spread across many machines, sparsification reduces the shuffled data, and we see much larger gains, as shown in fig. 5.5. A more intriguing question is whether our proposed approach is feasible at all: is it possible to learn a model for approximate analytics? In our dataset, the Facebook and AstroPh graphs have similar diameters and clustering coefficients, as do Wikivote and Epinions. Moreover, Facebook and AstroPh capture social relationships, while Wikivote and Epinions capture voting/rating relationships. Our results follow the same grouping: the performance and error curves of the Facebook and AstroPh datasets exhibit similar trends, and the same is true for the Wikivote and Epinions datasets. Results from our experiments using the PageRank algorithm show similar trends. Thus, we believe that our approach is feasible.

5.5 Related Work

A large number of graph processing systems exist in the literature: [111, 70, 105, 151, 71, 143, 150, 178, 39] focus on iterative analytics on static graphs, while [114, 129, 45, 41, 76, 125, 121, 91] focus on analytics on evolving graphs. None of these systems support approximate analytics. GraphTau [129], which focuses on evolving graph processing, supports approximate PageRank computations, but it does not allow the user to specify a budget. Our techniques can be used to bring approximation to several of these graph processing systems.

Approximate analytics systems have gained much popularity in the big data analytics community recently, and several proposals exist. BlinkDB [5] uses stratified sampling to generate samples and then chooses samples to satisfy the query budget. [20] uses approximation techniques to mitigate stragglers. ApproxHadoop [67] enables approximation-enabled map-reduce jobs. None of these systems support graph processing.

5.6 Summary

For many graph-processing application scenarios, computing an approximate answer is good enough. Yet existing graph processing frameworks, in an effort to compute the exact answer, take several minutes or even hours to execute popular graph algorithms. In this chapter, we looked at the problem of approximate graph analytics and presented our proposal, which uses a spectral sparsifier to reduce the size of the graph and a machine learning model to pick the right amount of sparsification given a budget.

Chapter 6

Geo-Distributed Analytics on Connected Data

6.1 Introduction

In this chapter, we look at a possible enhancement to connected data processing: geo-distributed processing. All the systems we have described so far assume that the data to be processed is available at, and processed in, a single data center. Today, however, many applications that could benefit from graph analysis are deployed on data centers across the globe and generate data in a geo-distributed fashion. For example, users of social networks are located around the globe. Similarly, cellular networks collect data at base stations that are geo-distributed across various locations [91]. This is also true for emerging applications: Internet-of-Things (IoT) applications, such as the much anticipated driverless vehicles, may generate data across multiple aggregation points. When analyzing such datasets, it may not always be feasible to aggregate the data to a central location, for several reasons. First, Wide Area Network (WAN) bandwidth is expensive, and transferring large amounts of data may incur high costs. Second, and more importantly, many of these scenarios could benefit from timely, low-latency analytics. Finally, political reasons may prevent data from moving to a different location.

The problem of analyzing datasets spanning geographical boundaries is not new; the field of Geo-distributed Analytics (GDA), which has gained much attention recently, focuses on precisely this problem [176, 174, 140]. GDA brings WAN awareness to big data analytics, thus eliminating the need to move all the data to one location. Existing GDA proposals look at different aspects of this problem, ranging from low-level task placement [140] to higher-level query optimization [174]. However, current works on GDA have focused on simple queries and aggregates, and have largely ignored iterative workloads such as machine learning and distributed graph processing, two important and emerging workloads in many applications.

Performing graph analytics in a geo-distributed fashion differs from traditional GDA in many ways. Due to the iterative nature of graph algorithms and the complex dependencies among the tasks that perform these iterations, simple task placement techniques do not work well: there is a need to deal with task affinity. Further, in traditional GDA, many datasets are amenable to clean sharding. This is not the case in graph processing, where locality plays an important role in the performance of graph algorithms. Because of the expensive joins that must be performed at every iteration in a graph-parallel setting, simple join optimizations in GDA may not be effective. Finally, many graph algorithms generate large amounts of intermediate data. Thus, geo-distributed graph analytics solutions need to account for the iterative nature of graph processing.

We focus on the problem of geo-distributed graph analytics. While combining traditional GDA with graph analytics may seem straightforward, our experience indicates that it is far from trivial. It is tempting to see this as a graph partitioning problem, since the goal of graph partitioning is to improve locality and thus reduce communication. However, due to the nature of data creation, repartitioning may not be feasible. Other similar challenges exist that must be addressed (§6.2). We observe that the key to geo-distributed graph analytics is optimizing the iterative processing style of graph-parallel systems. Based on this, we propose three techniques. First, we reduce the data to be processed using sampling strategies that leverage the graph algorithms themselves. Second, we remove inefficiencies in current graph processing models with a modification that reduces data exchange using a simple incremental computation strategy. Finally, we discuss how to bring WAN awareness into the picture (§6.3). To evaluate our proposal, we built Monarch, a system that incorporates our proposed techniques. To the best of our knowledge, Monarch is the first system to focus on geo-distributed graph analytics.

6.2 Background & Challenges

We begin with a brief overview of geo-distributed analytics and then list the challenges in geo-distributed graph analytics.

6.2.1 Geo-Distributed Analytics

A number of recent works have made the case for Geo-Distributed Analytics (GDA) [176, 140, 174]. While traditional data analytics assumes that data resides in a single, centralized datacenter, GDA forgoes that assumption: data is collected and stored at geographically distributed datacenters, and analytic tasks run across these datacenters without aggregating data to a central location. The key challenge in GDA systems is to ensure low response times for the analytic tasks being performed. GDA systems address this challenge by being Wide Area Network (WAN) aware. Specifically, these systems consider intermediate data movement to be the bottleneck and thus optimize the placement of such data, and the tasks that operate on it, based on the bandwidth available between datacenters. Some systems [82] go further by switching between different join strategies and coordinating tasks.

6.2.2 Challenges

There are several challenges in building a geo-distributed graph processing system.

First, GDA systems assume simple jobs and queries. In contrast, graph processing systems execute graph algorithms iteratively, with multiple message exchanges in every iteration. Extending this to a geo-distributed setting means that every iteration generates data exchange across the WAN. While traditional GDA's task placement and scheduling can optimize where tasks are placed, they do not alleviate the problem posed by the iterative model of graph processing.

Second, in GDA systems, data is amenable to sharding. Hence, there is fine-grained control over data movement that can be beneficial—for instance, a small amount of data can be moved to a different data center for a significant improvement in task placement flexibility. While graph partitioning has similar goals of improving locality and reducing communication between partitions, cleanly partitioning graphs is a hard problem [70]. Additionally, as graph algorithms progress, the partitioning may need to change for best performance. On top of this, since graphs cannot be partitioned at fine granularity, a complete repartitioning may be required due to the nature of data generation. Thus, a one-time partitioning or data placement strategy is unlikely to help.

Finally, graph algorithms are complex, and their distributed implementations are demanding since they involve expensive operations [71]. The immutability assumption made by many graph-processing frameworks makes things worse in terms of bandwidth usage. For task scheduling purposes, some GDA systems assume that intermediate data can be estimated and placed efficiently. This assumption breaks down in graph processing, where the intermediate data can be large: for example, running connected components on the openly available Twitter data [164] results in shuffling more than 50GB of data during the initial iterations in GraphX [71], a popular graph processing framework.

6.3 Our Proposal

We now describe our proposal for geo-distributed graph analytics after discussing our assumptions.

6.3.1 Assumptions

Geo-Distributed Graph: We assume that the graph is generated in a geo-distributed fashion. That is, a graph G(V, E) exists across P partitions, distributed across D data centers (P ≥ D). Each partition p ∈ P consists of v vertices and e edges. In this setting, aggregating the graph in one data center is clearly expensive. PowerGraph [70] argued that many naturally occurring graphs follow a power-law distribution and hence made a case for vertex-cuts rather than edge-cuts for entities spanning partitions. Following this, we choose vertex-cuts and mirror vertices that have edges spanning partitions. We note that this is not fundamental to our approach, which could also use edge-cuts. As discussed earlier, we do not assume that a complete repartitioning (e.g., using a communication-efficient partitioning scheme such as 2D partitioning [70]) can be done. Thus, we restrict ourselves to the partitioning provided naturally by the data. Leveraging partitioning flexibility is something we wish to pursue in the future.

Algorithms: While a large body of algorithms exists for the analysis of graphs, we restrict our scope in this work to algorithms that can be implemented in the GAS decomposition model. Most commonly used graph algorithms can be expressed in GAS decomposition form; for instance, GraphX [71] provides implementations of six such algorithms (connected components, label propagation, PageRank, SVD, shortest path, and triangle count).

WAN vs LAN Bandwidth: We assume that LAN bandwidth is significantly higher than WAN bandwidth. We further assume that the WAN bandwidth between pairs of DCs can differ significantly. This holds in most cloud provider settings: for instance, [174] notes that inter-DC bandwidth in major cloud providers is 1-2 orders of magnitude less than intra-DC bandwidth, and that pair-wise WAN bandwidth can vary by over 20×.

Figure 6.1: Monarch system architecture. Each data center (DC 1–DC 4) contains part of the graph. Communication between DCs happens through border vertices (shaded in the figure), which exchange and synchronize state as described in §6.3.

6.3.2 Approach

The overall architecture of the system we are currently building, which we call Monarch, is depicted in Figure 6.1. Based on our observation that optimizing the iterative processing style of graph-parallel algorithms is the key to efficient geo-distributed graph analytics, the main idea in our approach is to leverage the characteristics of the graph-parallel computation model and the algorithms it supports to reduce WAN usage. To achieve this, we propose three simple but powerful techniques. First, we reduce the data itself using an accuracy-preserving sampling process, which results in less data to exchange. Second, we propose a modification to the GAS model that removes the inefficiencies of iterative processing. Finally, we bring WAN awareness to this modified model. We explain these techniques in the rest of this section.

Sparsification without Accuracy Loss

In geo-distributed graph analytics, inter-DC data exchange happens only if there are entities spanning multiple data centers. Hence, our first goal is to reduce these spanning entities and/or reduce the data flowing through them. In Monarch, we call vertices that interface with other datacenters border vertices. Since we use vertex-cuts by default, border vertices are mirrors. In the GAS model, vertices gather (scatter) messages from (to) their neighborhood (§6.2). Thus, by reducing the size of the graph, we can reduce the amount of data transferred across data centers. Unfortunately, randomly eliminating graph entities to reduce the graph size leads to incorrect results.

To solve this, we plan on using a simple technique. We observe that in the iterative graph-parallel model, many graph algorithms generate and exchange redundant intermediate data, because the GAS model only considers the immediate neighborhood in each iteration. Leveraging this, we can design a sparsification strategy that eliminates graph entities that will not contribute to the final solution. To illustrate a simple case, consider the connected components algorithm: by examining the connectivity information of border vertices, we can eliminate the need to mirror all but one vertex across a DC pair if the vertices are connected, reducing the amount of data transferred across DCs. Further improvements are possible if only partial results are required (e.g., only components containing particular vertices, or only the top components), as such cases can discard large parts of the graph and even eliminate the need for border vertices.

While this sparsification technique is beneficial in our setting, we note two shortcomings. First, not all algorithms can leverage such sparsification strategies. Second, our strategy may result in a slightly longer convergence time, because dropping entities may eliminate a shorter path. However, we assume that the cost of WAN transfer is much higher than this small sacrifice in convergence time.
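To illustrate the connected-components case above, the sketch below keeps a single representative mirror per locally connected group of border vertices shared between a pair of datacenters. The names and the representation of border vertices are our assumptions for exposition; Monarch's actual implementation operates inside the graph engine.

```python
def prune_border_mirrors(border_vertices, local_component):
    """For each DC pair, keep one mirrored border vertex per local component.

    border_vertices: dict mapping a DC pair (dc_a, dc_b) -> list of vertex ids
                     mirrored across that pair.
    local_component: dict mapping vertex id -> the component label computed
                     locally (e.g., during the bootstrap phase).
    Returns the reduced mirror lists.
    """
    pruned = {}
    for dc_pair, vertices in border_vertices.items():
        representatives = {}
        for v in vertices:
            # The first vertex seen in each local component stands in for the rest.
            representatives.setdefault(local_component[v], v)
        pruned[dc_pair] = list(representatives.values())
    return pruned
```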

Geo-distributed Graph Computation Model

Once the graph is sparsified, the next step is to run graph algorithms on it in a geo-distributed manner. As discussed in §6.2, GAS computations result in two data exchanges (gather and scatter) in every iteration. In our setting, this means that the border vertices potentially need to exchange data twice per iteration. When the data to be exchanged is large and/or when multiple iterations are involved, these transfers can become the bottleneck. To solve this problem, we propose a simple modification to the GAS model: we restrict the execution of the GAS model to each datacenter, and then use a merging strategy to combine the results from the datacenters. Our enhanced GAS model consists of the following stages:

Bootstrap: When a graph algorithm is to be executed, Monarch first invokes the bootstrap stage. In the bootstrap phase, Monarch runs vanilla GAS on every datacenter independently. We consider the subgraph in each data center to be a graph of its own, and compute the algorithm result on this subgraph. At the end of this stage, we end up with local solutions in each datacenter.

Global Sync: After the bootstrap GAS execution has converged, we invoke a global synchronization stage in which only border vertices participate. A gather-like operation invoked on them enables the vertices to collect the partial state from their other mirrors. Then, an algorithm-specific function f_a combines these partial states to generate an updated state for the border vertices. After this stage, all mirrors of each border vertex have the same state. However, the global graph is in an inconsistent state, because the partial results in each DC may no longer be valid after the updates to the border vertices.

iGAS: A strawman approach to recomputing the correct partial results is to reset the local graph's vertices (except the border vertices) and restart the local GAS computation. However, this is wasteful, so we propose a different approach. We observe that after the synchronization, each subgraph is equivalent to an updated graph in which only the border vertices have changed. Thus, we can leverage an incremental computation model to update the results on the local graph. In Monarch, we design an incremental version of the GAS model, which we call iGAS. The iGAS computation leverages both the GAS computation model and algorithm properties. Specifically, we exploit the fact that GAS computations consider only the immediate neighborhood. In each iGAS iteration, we mark the immediate neighborhood of vertices that changed their state and force computations on them; in the first iteration, we mark the neighbors of the border vertices. We repeat this marking and re-computation step until the change at the border vertices has propagated across the entire local graph. One problem with this approach is that potentially all vertices may recompute if the graph is fully connected. To avoid this, we leverage algorithm properties: we mark only neighbors that might use the updated value.


Figure 6.2: In the incremental GAS model, each iteration marks and activates the neighborhood it influences. In this example, border vertex A is updated. It marks and activates B in the first iteration, B marks and activates C in the second, and C marks and activates D in the third. By leveraging the characteristics of the algorithm being executed, we avoid marking E and F even though they are in the immediate neighborhood of C and B.

Figure 6.2 shows our iGAS approach. At a high level, iGAS can be seen as a backtracking and rectification process. To summarize, our enhanced GAS model starts with a normal execution of the GAS model, then switches to iterations of global sync followed by iGAS until convergence. Essentially, we amortize the cost of per-iteration synchronization by batching multiple GAS iterations between global syncs. We note one caveat: the global synchronization step assumes that the algorithm-specific function f_a can correctly combine the partial results for each border vertex. While we have derived f_a for many common graph algorithms, generalizing our technique to any graph algorithm is part of our future work.
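The control loop below sketches this enhanced model: a local bootstrap, then alternating global syncs and incremental (iGAS) passes until the border vertices stop changing. The function names and the per-datacenter graph interface are illustrative assumptions, and the min-based merge shown for f_a corresponds only to the connected components example; other algorithms need their own merge functions.

```python
def enhanced_gas(local_graphs, run_gas, run_igas, merge_fa):
    """Sketch of Monarch's enhanced GAS loop across datacenters.

    local_graphs: per-DC subgraphs (each exposing has/state/set_state/vertices).
    run_gas(g): runs vanilla GAS to convergence on a local subgraph.
    run_igas(g, changed): incrementally recomputes from changed border vertices.
    merge_fa(states): algorithm-specific combine, e.g. min() for connected components.
    """
    # Bootstrap: run vanilla GAS independently in every datacenter.
    for g in local_graphs:
        run_gas(g)

    while True:
        # Global sync: combine the partial states of each border vertex's mirrors.
        changed = set()
        for v in border_vertices(local_graphs):
            merged = merge_fa([g.state(v) for g in local_graphs if g.has(v)])
            for g in local_graphs:
                if g.has(v) and g.state(v) != merged:
                    g.set_state(v, merged)
                    changed.add(v)
        if not changed:
            return  # converged: no border vertex changed during the sync

        # iGAS: incrementally propagate border updates within each datacenter.
        for g in local_graphs:
            run_igas(g, changed)


def border_vertices(local_graphs):
    """Vertices mirrored in more than one datacenter (illustrative helper)."""
    seen, borders = set(), set()
    for g in local_graphs:
        for v in g.vertices():
            if v in seen:
                borders.add(v)
            seen.add(v)
    return borders
```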

Bringing WAN Awareness

While Monarch specifically optimizes for WAN bandwidth, our techniques so far treat all WAN bandwidths as equal, which is not true in practice. To tackle this problem, we envision two approaches. First, we plan to generalize our computation model to support arbitrary interleaving of the global sync and iGAS phases per datacenter: depending on the current WAN bandwidth, a datacenter may decide whether or not to participate in a global sync.


Figure 6.3: Our proposal completes the execution of connected components in a case where GraphX cannot, because GraphX tries to transfer too much data across the WAN.

In the second approach, if the system is given some flexibility in terms of data movement, we plan to explore moving some border vertices based on the amount of bandwidth they might consume. Both are hard problems, and we are actively exploring ways to solve them.

6.4 Evaluation

We evaluate the usefulness of the enhanced GAS model, which includes the local incremental computations and global syncs. We chose the open-source Twitter dataset [104] and used 16 machines on Amazon EC2 in a setup that simulates the setting in [71], picking 4 machines in each region, and then ran the connected components algorithm. The results are depicted in fig. 6.3. GraphX is unable to complete the computation since it fails during the shuffle stage, while the enhanced model we propose completes the computation in around 310 seconds. While this is longer than the numbers reported for GraphX (251s), we believe many optimization opportunities remain to be explored. We also see that while GraphX tries to transfer almost 10GB of data across the WAN, Monarch transfers only around 1GB. In this experiment, we did not use the sparsification process.

6.5 Related Work

A large number of graph processing systems exist in the literature, of which [111, 70, 44, 177, 185, 105, 151, 71, 143, 150, 178, 39, 166, 197, 204, 112, 11, 201, 181, 138, 183, 184, 74, 113, 198, 12, 66, 196, 200] focus on iterative analytics on static graphs, while [98, 99, 120, 114, 45, 41, 76, 125, 121, 129, 91] focus on analytics on evolving graphs. However, none of them supports geo-distributed processing; they all assume a single datacenter where the graph is aggregated. While our work focuses on the GAS decomposition model, our techniques can be incorporated into other models. [62] parallelizes sequential graph algorithms using partial evaluations and algorithm-specific incremental computations, but does not consider geo-distributed settings. On the other hand, many recent works have proposed techniques for geo-distributed data analytics. Iridium [140] uses WAN-aware task placement and scheduling. [176] looks at join algorithm selection strategies for WAN optimization. SWAG [82] coordinates tasks across DCs. Finally, Clarinet [174] argues for WAN-aware query optimization. [80] looks at the problem of geo-distributed machine learning using approximation techniques that are specific to ML algorithms. None of these systems consider iterative graph processing.

6.6 Summary

Graph processing and geo-distributed analytics are two areas that have seen increasing interest in the recent past, yet neither supports the other. Geo-distributed graph analytics could benefit many application scenarios that generate graph-structured data. In this chapter, we took the first step towards marrying geo-distributed analytics with graph-parallel processing: we listed the challenges in doing so and proposed a solution to address them.

Chapter 7

Real-time Decisions on Dynamic Connected Data

7.1 Introduction

Increasingly, the trend in connected data analyses is moving towards tasks that operate on data procured in real time to produce low-latency decisions. Unlike traditional tasks such as aggregates or datacubes, these real-time analytic tasks often involve model building and refinement for the purpose of manual or automatic decision making. For instance, Alex's network can benefit from real-time diagnosis of problems and can use the results to self-heal and reduce the impact on users in the network. Similarly, Taylor's tasks can leverage real-time updates to fraud detection models. However, such analyses face a fundamental trade-off between not having enough data to build accurate-enough models within short timespans and waiting to collect enough data, which in many domains entails stale results. In this chapter, we seek to answer the question of whether it is possible to mitigate this trade-off. Towards this goal, we take the first step and investigate the trade-off in detail, expose its effects, and build techniques to mitigate it in Alex's domain-specific problem: performance diagnostics in cellular Radio Access Networks (RANs).

While RAN technologies have seen tremendous improvements over the past decade [160, 152, 156], performance problems are still prevalent [158]. Factors impacting RAN performance include user mobility, skewed traffic patterns, interference, lack of coverage, unoptimized configuration parameters, inefficient algorithms, equipment failures, software bugs, and protocol errors [157]. Though some of these factors are present in traditional networks, and troubleshooting such networks has received considerable attention in the literature [24, 202, 47, 6, 179], RAN performance diagnosis brings out a unique challenge: the performance of multiple base stations exhibits complex temporal and spatial interdependencies due to the shared radio access medium and user mobility.

Existing systems [58, 16] for detecting performance problems rely on monitoring aggregate metrics, such as connection drop rate and throughput per cell, over minutes-long time windows. Degradation of these metrics triggers mostly manual—hence, time-consuming and error-prone—root cause analysis. Furthermore, due to their dependence on aggregate information, these tools either overlook many performance problems, such as temporal spikes leading to cascading failures, or are unable to isolate root causes. The challenges of leveraging only aggregate metrics have led operators to collect detailed traces from their network [59] to aid domain experts in diagnosing problems. However, the sheer volume of the data and its high dimensionality make troubleshooting by human experts and traditional rule-based systems very hard, if not infeasible [97].

Machine learning (ML) is a natural alternative to these approaches and has recently been used to troubleshoot other complex systems with considerable success. However, simply applying ML to RAN diagnosis is not enough. The desire to troubleshoot RANs as fast as possible exposes the inherent trade-off between latency and accuracy shared by many ML algorithms. To illustrate this trade-off, consider the natural solution of building a model per base station. On one hand, if we want to troubleshoot quickly, the amount of data collected for a given base station may not be enough to learn an accurate model. On the other hand, if we wait long enough to learn a more accurate model, troubleshooting is delayed and the learned model may no longer be valid. Another alternative is to learn one model over the entire dataset. Unfortunately, since base stations can have very different characteristics, using a single model for all of them can also result in low accuracy (§7.2).

We present CellScope, a system that enables fast and accurate RAN performance diagnosis by resolving the latency-accuracy trade-off using two broad techniques: intelligent data grouping and task formulations that leverage domain characteristics. More specifically, CellScope applies Multi-task Learning (MTL) [168, 42], a state-of-the-art machine learning approach, to RAN troubleshooting. In a nutshell, MTL learns multiple related models in parallel by leveraging the commonality between those models. To enable the application of MTL, CellScope uses two techniques. First, it uses feature engineering to identify the relevant features to use for learning. Second, it uses a PCA-based similarity metric to group base stations that share common features, such as interference and load. This is necessary since MTL assumes the models have some commonality, which is not necessarily the case in our setting; e.g., different base stations might exhibit different features. Note that while PCA has traditionally been used to find network anomalies, CellScope uses it to find common features instead.

Figure 7.1: LTE network architecture, showing the User Equipment (UE), base station (eNodeB), Mobility Management Entity (MME), Home Subscriber Server (HSS), Serving Gateway (S-GW), and Packet Gateway (P-GW), with control plane and data plane paths toward the Internet.

To this end, CellScope uses MTL to create a hybrid model: an offline base model that captures common features, and an online per-base-station model that captures the individual features of each base station. This hybrid approach allows us to incrementally update the online model based on the base model, and the resulting models are both accurate and fast to update. Finally, in this approach, finding anomalies is equivalent to detecting concept drift [65]. To demonstrate the effectiveness of our proposal, we have built CellScope on Spark [192, 162, 102]. Our evaluation shows that CellScope achieves accuracy improvements of up to 4.4× without incurring the latency overhead associated with conventional approaches (§7.6). We have also used CellScope to analyze a live LTE network consisting of over 2 million subscribers, where we show that it can reduce the operator's troubleshooting effort by several orders of magnitude (§7.7). We then investigate whether the techniques we present in this chapter generalize. To do so, we take a new domain-specific problem, energy bug diagnosis in mobile phones, and illustrate that the trade-off exists in this domain too. Using a dataset from 800,000+ users, we show how the proposed techniques in CellScope can easily be adapted, and we demonstrate their effectiveness in mitigating the trade-off (§7.8).
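As a rough illustration of the hybrid idea, the sketch below keeps a shared base model trained offline and a small per-base-station correction updated online; predictions combine both. This is a simplified stand-in for CellScope's MTL formulation, and the linear form, names, and drift heuristic are our assumptions rather than the actual implementation.

```python
import numpy as np

class HybridModel:
    """Offline base weights shared by a group of base stations, plus an
    online per-station correction updated incrementally via SGD."""

    def __init__(self, base_weights, n_features, lr=0.01):
        self.w_base = np.asarray(base_weights, dtype=float)   # trained offline
        self.w_station = np.zeros(n_features)                  # refined online
        self.lr = lr

    def predict(self, x):
        return float(np.dot(x, self.w_base + self.w_station))

    def update(self, x, y):
        # One SGD step on the station-specific residual; the base model is frozen.
        error = self.predict(x) - y
        self.w_station -= self.lr * error * np.asarray(x, dtype=float)
        return abs(error)  # a persistently large error can signal concept drift
```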

7.2 Background and Motivation

We begin with a brief primer on LTE networks and the current state of RAN troubleshooting. Then, we illustrate the difficulties in applying ML to RAN performance diagnosis.

7.2.1 LTE Network Primer

LTE networks provide User Equipment (UE) devices such as smartphones with Internet connectivity. When a UE has data to send to or receive from the Internet, it sets up a communication channel between itself and the Packet Data Network Gateway (P-GW). This involves message exchanges between the UE and the Mobility Management Entity (MME). In coordination with the base station (eNodeB), the Serving Gateway (S-GW), and the P-GW, data plane (GTP) tunnels are established between the base station and the S-GW, and between the S-GW and the P-GW. Together with the connection between the UE and the base station, the network establishes a communication channel called an EPS bearer (bearer for short). The LTE network architecture is shown in fig. 7.1.

For network access and service, LTE network entities exchange control plane messages. A specific sequence of such control plane message exchanges is called a network procedure. For example, when a UE powers up, it initiates an attach procedure with the MME, which consists of establishing a radio connection, authentication, and resource allocation. Each network procedure involves the exchange of several messages between two or more entities; their specifications are defined by the 3GPP Technical Specification Groups (TSG) [165]. Network performance degrades and end-user experience suffers when procedure failures happen. The complex nature of these procedures (due to the multiple underlying message and entity interactions) makes diagnosing problems challenging.

Thus, to aid RAN troubleshooting, operators collect extensive measurements from their network. These measurements typically consist of per-procedure information (e.g., attach). To analyze a procedure failure, it is often useful to look at the associated variables; for instance, a failed attach procedure may be diagnosed if the underlying signal strength information was captured. Hence, relevant metadata is also captured along with procedure information. Since there are hundreds of procedures in the network and each procedure can have many possible metadata fields, the collected data contains several hundred fields.

7.2.2 RAN Troubleshooting Today

Current RAN monitoring depends on cell-level aggregate Key Performance Indicators (KPIs). Existing practice is to derive these KPIs from performance counters. The derived KPIs, aggregated over predefined time windows, are then monitored by domain experts. Based on domain knowledge and operational experience, these KPIs are used to determine whether service level agreements (SLAs) are met. For instance, an operator may have designed the network to have no more than 0.5% call drops in a 10-minute window. When a monitored KPI crosses its threshold, an alarm is raised and a ticket created. The ticket is then handled by experts who investigate the cause of the problem, often manually. Several commercial solutions exist [15, 16, 17, 58] that aid this troubleshooting procedure by enabling efficient slicing and dicing of the data.

However, we have learned from domain experts that it is often desirable to apply different models or algorithms on the data for detailed diagnosis. Thus, many RAN trouble tickets end up with experts who work directly on the raw measurement data.

7.2.3 Machine Learning for RAN Diagnostics

The large volume of data collected in the RAN makes it an ideal candidate for the application of machine learning. We now discuss the difficulties in using ML for RAN performance diagnostics.

Data

We obtained measurement data from the live network of a top-tier operator in the United States. The data consists of four types of records:

Bearer Records: These log bearer-level information. In our logs, this includes extensive detail such as frame loss rate, physical radio resources allocated, radio channel quality, physical layer modulation and coding rate, bearer start and end time, bearer setup delay, failure reason code (if any), and the associated base station, MME, S-GW, and P-GW.

Signaling Records: These are logs of network procedures, such as paging, attach/detach, and handoff information. Every procedure in the network creates a new record along with metadata such as the time of the event.

TCP Flow Records: These logs come from strategically placed probes in the network and consist of TCP flow-level information. They are associated with the bearer records to obtain more insight into application-level behavior.

Network Element Records: These are aggregate records collected at network elements such as the eNodeB or MME. Some fields in these records include total failures and downlink/uplink frames.

Collectively, the dataset contains over four hundred fields which could potentially be leveraged as individual features by a machine learning algorithm.

Ineffectiveness of Global Model

A common approach in applying ML on a dataset is to consider the dataset as a single entity and build one model over the entire data.

Figure 7.2: Simply applying ML for RAN performance diagnosis results in a fundamental trade-off between latency and accuracy. (a) A single global model incurs poor accuracy. A local model is the best, but requires enough data to build and frequent updates (otherwise staleness affects its performance); combining data from geographically nearby base stations results in a reduction in accuracy. (b) Distribution of data collected by base stations under various data collection latencies. It takes approximately an hour for a vast majority of the base stations to collect enough data to build statistically significant models. (c) For building local models, it may not be possible to simply wait for enough data: a random forest model (Alg 1) gains from more data, but a lasso regression model (Alg 2) degrades with more data due to temporal effects.

However, base stations in a cellular network exhibit different characteristics. This renders the use of a global model ineffective. To illustrate this problem, we conducted an experiment where the goal is to build a model for call drops in the network (similar to [87]) using information in our traces. Specifically, we build a decision tree model using an hour's worth of data to ensure sufficient data for the algorithm to produce statistically significant results. Figure 7.2a shows the result of this experiment, where we see that the global model achieves poor accuracy and high variance. This underlines the heterogeneity in the characteristics of base stations and hence the ineffectiveness of global models.

Latency/Accuracy Issues with Local Models

The alternative to a single global model is to build a model for every base station. We evaluate this approach by repeating the last experiment, but this time segregating the data for every base station and building an independent model for each. The results of this experiment are shown in fig. 7.2a, which indicates that local models are far superior, with up to 20% higher accuracy and much lower variance. It is natural to think of a per base station model as the final solution to this problem. However, this approach has issues too. Due to the differences in characteristics of the base stations, the amount of data they collect in a given time interval varies vastly. Thus, in small intervals, they may not generate enough data to produce statistically significant results. Figure 7.2b shows the distribution of the amount of data generated by these base stations under different data collection latencies. It shows that at small intervals (e.g., under 10 minutes), most base stations do not generate enough data, and that it takes about an hour for all quartiles of base stations to log a reasonable number of records. To illustrate the effect of this discrepancy, we conduct another experiment. We use two machine learning algorithms, a random forest model to predict connection drops (Alg 1) and a lasso regression model using stochastic gradient descent to predict throughput (Alg 2), at various data collection latencies. These two algorithms represent some of the commonly used models from the broad categories of classification and regression. The result of this experiment is shown in fig. 7.2c. The behavior of Alg 1 is expected; as it gets more data, its accuracy improves due to the slow-varying nature of the underlying causes of failures. After an hour of latency (such high latencies may not be acceptable in many scenarios), it is able to reach a respectable accuracy. However, the second algorithm's accuracy initially seems to improve with more data, but then falls quickly. This is counterintuitive in normal settings, but the explanation lies in the spatio-temporal characteristics of cellular networks. Many of the performance metrics exhibit high temporal variability, and thus need to be analyzed in smaller intervals.

Thus, for models like Alg 2, it is not enough to just “wait” for enough data to be collected, and hence local modeling is ineffective.

Need for Model Updates

An obvious, but flawed, conclusion from our previous experiment is that models similar to the one built by Alg 1 would work once the data collection latency (of an hour) has been incurred. Put differently, can we just use historic data? In any application of ML, models need to be updated to retain their performance. This is true in cellular networks too, where temporal variations affect the performance of the model. To show this, we repeated the experiment where we built a per base station decision tree model for call drops. However, instead of training and testing on parts of the same dataset, we train on an hour's worth of data and test it on the next hour. Figure 7.2a shows that the accuracy drops by 12% with a stale model (because the model built using historic data is no longer valid). Thus, it is important to keep the model fresh by incorporating incoming data and removing old data. Such sliding updates to ML models in a general setting are difficult due to the overhead of retraining them from scratch. To add to this, cellular networks consist of several thousand base stations. Thus, a per base station approach requires creating and updating a huge number of models (e.g., our network consisted of over 13,000 base stations). This makes scaling hard.

Why not Spatial/Temporal Partitioning?

Our experiments point towards the need to obtain enough data for ML algorithms to produce statistically significant results with low latency. The obvious solution to combating this trade-off is to intelligently combine data from multiple base stations. It is intuitive to think of this as a spatial partitioning problem, since base stations in the real world are geographically separated. Thus, it is reasonable to assume that a spatial partitioner which combines data from base stations within a geographical region should give good results. Unfortunately, this isn't the case, which we motivate using a simple example. Consider two base stations, one situated at the center of Times Square in New York and the other a mile away in a residential area. A spatial partitioning scheme that divides the space into equal-sized regions would likely combine data from these two base stations. However, this is not desirable because of the difference in characteristics of these base stations: in our measurements, a base station in a highly popular spot serves more than 300 UEs and carries multiple times the uplink/downlink traffic of another base station situated just a mile from it that serves only 50 UEs. We illustrate this using the same drop modeling experiment as before.

Figure 7.3: CellScope system architecture. Bearer-level traces pass through domain-specific feature engineering and PCA-based similarity grouping into the domain-specific MTL component (built on gradient boosted trees and a streaming ML library); the RAN performance analyzer builds throughput and drop diagnosis modules whose output feeds dashboards and Self-Organizing Networks (SON).

Figure 7.2a shows the performance when we combine data from nearby base stations using a simple grid partitioner and then build a model in each of the partitions. The result shows that this technique is only slightly better than a single global model. We evaluate other spatial partitioning schemes in §7.6.

7.3 CellScope Overview

We now present our solution, CellScope, which mitigates the latency-accuracy trade-off using a domain-specific formulation and application of Multi-Task Learning (MTL).

7.3.1 Problem Statement

CellScope's ultimate goal is to enable fast and accurate RAN performance diagnosis by resolving the trade-off between data collection latency and the achieved accuracy. The key difficulty arises from the fundamental tension between not having enough data to build sufficiently accurate models in short timespans, and waiting to collect enough data, which yields stale results. Additionally, CellScope must support efficient modifications to the learned models to account for the temporal nature of our setting and avoid data and model staleness.

7.3.2 Architectural Overview

Figure 7.3 shows the high-level architecture of CellScope, which consists of the following key components:

Input data: CellScope uses measurement traces that are readily available in modern cellular networks (§7.2.1). Base stations collect traces independently and send them to the associated MME, which merges records if required and uploads them to a data center (the transfer of traces to a data center is not fundamental; extending CellScope to geo-distributed learning is future work).

Feature engineering: Next, CellScope uses domain knowledge to transform the raw data and construct a set of features amenable to learning, e.g., computing interference ratios (§7.4.1). We also leverage protocol details and algorithms (e.g., link adaptation in the physical layer).

Domain-specific MTL: CellScope uses a domain-specific formulation and application of MTL that allows it to perform accurate diagnosis while updating models efficiently (§7.4.2).

Data partitioner: To enable correct application of MTL, CellScope implements a partitioner based on a similarity score derived from Principal Component Analysis (PCA) and geographical distance (§7.4.3). The partitioner segregates the data to be analyzed into independent sets and produces a smaller co-located set relevant to the underlying analysis. This also minimizes the need to shuffle data during training.

RAN performance analyzer: This component binds everything together to build diagnosis modules. It leverages the MTL component and uses appropriate techniques to build call drop and throughput models. We discuss our experience of applying these techniques to a live LTE network in §7.7. This component can easily be replaced to extend CellScope to a new domain, as we show in §7.8.

Output: Finally, CellScope can output analytics results to external modules such as RAN performance dashboards. It can also provide inputs to Self-Organizing Networks (SON).

7.4 Mitigating the Latency-Accuracy Trade-off

In this section, we present how CellScope mitigates the trade-off between latency and accuracy. We first give a high-level overview of RAN-specific feature engineering that prepares the data for learning (§7.4.1). Next, we describe CellScope's MTL formulation (§7.4.2) and how it lets us build fast, accurate, and incremental models. Then, we explain how CellScope achieves grouping that captures commonalities among base stations using a novel PCA-based partitioner that makes the application of MTL possible (§7.4.3).

7.4.1 Feature Engineering

Feature engineering, the process of transforming the raw input data to a set of features that can be effectively utilized by machine learning algorithms, is a fundamental part of ML applications [195]. Generally carried out by domain experts, it is often the first step in ML. In CellScope, the network measurement data contains several hundred fields (§7.2). These fields range from simple bearer identification information to fields associated with LTE network procedures. Unfortunately, many of these fields are not suitable for model building as is. Additionally, several fields are collected in a format that utilizes a compact representation. Finally, these records are not self-contained, and multiple records may need to be “joined” to create a feature for a certain procedure. We use simple feature engineering to obtain fields that can be used in ML algorithms. As an example, for modeling connection drop rates, we use the block error rate (BLER) as a feature. However, the records do not directly provide this value, so it is computed from the block transfer information. Similarly, for throughput modeling, the downlink and uplink throughput values are computed using the amount of physical resource blocks allocated and the transfer time. While we depend on manual feature engineering in this work (automating it is part of our future work), not all fields need to be feature engineered. Further, we found that the engineered fields can be used across several ML algorithms.
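To make this concrete, the sketch below illustrates this style of feature engineering in Python; the record field names (blocks_in_error, dl_prb_allocated, and so on) are illustrative placeholders rather than the actual trace schema, which is not reproduced here.

import pandas as pd

def engineer_features(bearers: pd.DataFrame) -> pd.DataFrame:
    """Derive ML-ready features from raw bearer records (field names are hypothetical)."""
    feats = pd.DataFrame(index=bearers.index)
    # Block error rate (BLER): fraction of transport blocks received in error.
    feats["bler"] = bearers["blocks_in_error"] / bearers["blocks_transmitted"].clip(lower=1)
    # Downlink throughput: bits carried over the allocated physical resource
    # blocks divided by the transfer time.
    feats["dl_throughput_bps"] = (bearers["dl_prb_allocated"] * bearers["bits_per_prb"]
                                  / bearers["transfer_time_s"])
    return feats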

7.4.2 Multi-Task Learning

The latency-accuracy trade-off makes it hard to achieve both low latency and high accuracy in ML tasks (§7.2). The ideal-case scenario for CellScope is one where an infinite amount of data is available per base station with zero latency. In this scenario, we would have a learning task for each base station that produces a model with the best achievable accuracy. In reality, our setting has several tasks, each with its own data. However, each task does not have enough data to produce models with acceptable accuracy within a given latency budget. This makes our setting an ideal candidate for multi-task learning (MTL), a research area in machine learning that has been successful in many ML applications. The key idea behind MTL is to learn from other tasks by weakly coupling their parameters so that the statistical efficiency of many tasks can be boosted [42, 168, 60, 28]. Specifically, if we are interested in building a model of the form

h(x) = m(f_1(x), f_2(x), ..., f_k(x))    (7.1)

where m is a model (e.g., to predict connection drop) composed of feature functions f_1 through f_k, then the traditional MTL formulation, given a dataset D = {(x_i, y_i, bs_i) : i = 1, ..., n}, where x_i ∈ R^d, y_i ∈ R, and bs_i denotes the i-th base station, is to learn

h(x) = m_bs(f_1(x), f_2(x), ..., f_k(x))    (7.2)

where m_bs is a per base station model. In this MTL formulation, the core assumption is a shared structure or dependency across each of the learning problems. Unfortunately, in our setting, the base stations do not share a structure at a global level (§7.2). Due to their geographic separation and the complexities of wireless signal propagation, the base stations share a spatio-temporal structure instead. Thus, we propose a new domain-specific MTL formulation.

CellScope's MTL Formulation

In order to address the difficulty in applying MTL due to the violation of the task dependency assumption in RANs, we can leverage domain-specific characteristics. Although independent learning tasks (learning per base station) are not correlated with each other, they exhibit specific non-random structure. For example, the performance characteristics of nearby base stations are influenced by similar underlying features. Thus, we propose exploiting this knowledge to segregate learning tasks into groups of dependent tasks on which MTL can be applied. MTL in the face of dependency violation has been studied in the machine learning literature in the recent past [100, 69]. However, these approaches assume that each group has its own set of features. This is not entirely true in our setting, where multiple groups may share most or all features but still need to be treated as separate groups. Furthermore, some of the techniques used for automatic grouping without a priori knowledge are computationally intensive. Assuming we can club learning tasks into groups, we can rewrite the MTL formulation in eq. (7.2) as:

h(x) = m_g(bs)(f_1(x), f_2(x), ..., f_k(x))    (7.3)

where m_g(bs) is the per-base station model in group g. We describe a simple technique to achieve this grouping based on domain knowledge in §7.4.3 and experimentally show that grouping alone can achieve significant gains in §7.6.

In theory, the MTL formulation in eq. (7.3) should suffice for our purposes, as it captures the inter-task dependencies using grouping. However, this formulation still builds an independent model for each base station. Building and managing a large number of models leads to significant performance overhead and would impede our goal of scalability. Scalable application of MTL in a general setting is an active area of research in machine learning [134], so we turn to problem-specific optimizations to address this challenge.

The model m_g(bs) could be built using any class of learning functions. In this work, we restrict ourselves to functions of the form F(x) = w · x, where w is the weight vector associated with a set of features x. This simple class of functions gives us tremendous leverage in using standard algorithms that can easily be applied in a distributed setting, thus addressing the scalability issue. In addition to scalable model building, we must also be able to update the built models quickly. However, machine learning models are typically hard to update in real time. To address this challenge, we next discuss a hybrid approach to building the models in our MTL setting.

Hybrid Modeling for Fast Model Updates

Estimation of the model in eq. (7.3) can be posed as an ℓ1-regularized loss minimization problem [169]:

min \sum_{bs} [ L(h(x : f_bs), y) + λ ||R(x : f_bs)|| ]    (7.4)

where L(h(x : f_bs), y) is a non-negative loss function composed of the parameters for a particular base station, hence capturing the error in the prediction for it in the group, and λ > 0 is a regularization parameter scaling the penalty R(x : f_bs) for the base station. However, the temporal and streaming nature of the data means that the model must be refined frequently to minimize staleness. Fortunately, grouping provides us an opportunity to solve this. Since the base stations are grouped into correlated task clusters, we can decompose the features used for each base station into a shared common set f_c and a base station specific set f_s. Thus, we can modify eq. (7.4) to minimize

\sum_{bs} ( \sum L(h(x : f_s, f_c), y) + λ ||R(x : f_s)|| ) + λ ||R(x : f_c)||    (7.5)

where the inner summation is over the data specific to each base station. This separation gives us a powerful advantage. Since we grouped base stations, the feature set f_s is minimal, and in most cases is just a weight vector on the common feature set. Because the core common features do not change often, we need to update only the base station-specific parts of the model frequently, while the common set can be reused. Thus, we end up with a hybrid offline-online model. Furthermore, the choice of our learning functions lets us apply stochastic methods [159], which can be efficiently parallelized.
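As a rough, illustrative sketch of this split (not the actual CellScope implementation), the snippet below keeps the shared group weights fixed and applies stochastic gradient steps only to a per-base-station weight vector on each mini-batch; the squared loss and the ℓ2 penalty on the per-base-station deviation are simplifying assumptions.

import numpy as np

class HybridModel:
    """Offline shared component plus online per-base-station weights (illustrative)."""

    def __init__(self, shared_weights: np.ndarray, lr: float = 0.01, lam: float = 0.001):
        self.w_common = shared_weights          # offline part, retrained rarely
        self.w_bs = {}                          # online part, one vector per base station
        self.lr, self.lam = lr, lam

    def predict(self, bs: str, X: np.ndarray) -> np.ndarray:
        return X @ (self.w_common + self.w_bs.get(bs, 0.0))

    def update(self, bs: str, X: np.ndarray, y: np.ndarray) -> None:
        """One SGD pass over a mini-batch, touching only the base-station-specific weights."""
        w_s = self.w_bs.get(bs, np.zeros_like(self.w_common))
        for xi, yi in zip(X, y):
            err = xi @ (self.w_common + w_s) - yi
            # Gradient of squared error plus an l2 penalty standing in for R(x : f_s).
            w_s -= self.lr * (err * xi + self.lam * w_s)
        self.w_bs[bs] = w_s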

Anomaly Detection Using Concept Drift

A common use case of learning tasks for RAN performance analysis is detecting anomalies. For instance, an operator may be interested in learning whether there is a sudden increase in call drops.

At the simplest level, it is easy to answer this question by monitoring the number of call drops at each base station. However, a simple yes or no answer to such questions is seldom useful. If there is a sudden increase in drops, it is useful to understand whether the issue affects an entire region and what its root cause is. Our MTL approach and the ability to do fast incremental learning enable a better solution for anomaly detection and diagnosis. Concept drift refers to the phenomenon where the underlying distribution of the training data for a machine learning model changes [65]. CellScope leverages this to detect anomalies as concept drifts using a simple technique. Since we process incoming data in mini-batches (§7.5), each batch can be tested quickly on the existing model for significant accuracy drops. An anomaly occurring at just a single base station would be detected by one model, while one affecting a larger area would be detected by many. Once an anomaly is detected, finding its cause is as simple as updating the model and comparing it with the old one.
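A minimal sketch of this check, assuming a scikit-learn-style classifier (whose predict returns a numpy array) and an accuracy-drop threshold that we pick arbitrarily here:

def detect_drift(model, X_batch, y_batch, baseline_acc, threshold=0.10):
    """Flag concept drift when the newest mini-batch's accuracy falls
    noticeably below the recent baseline (threshold is an assumption)."""
    batch_acc = (model.predict(X_batch) == y_batch).mean()
    return batch_acc < baseline_acc - threshold, batch_acc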

7.4.3 Data Grouping for MTL

Having discussed CellScope's MTL formulation, we now turn our focus to how CellScope achieves the efficient grouping of cellular datasets that enables accurate learning. Our data partitioning is based on Principal Component Analysis (PCA), a widely used technique in multivariate analysis [136]. PCA uses an orthogonal coordinate transformation to map a given set of points into a new coordinate space. Each of the new subspaces is commonly referred to as a principal component. Since the new coordinate space is smaller than the original, PCA is used for dimensionality reduction. In their pioneering work, Lakhina et al. [106] showed the usefulness of PCA for network anomaly detection. They observed that it is possible to segregate normal behavior and abnormal (anomalous) behavior using PCA: the principal components explain most of the normal behavior, while the anomalies are captured by the remaining subspaces. Thus, by filtering normal behavior, it is possible to find anomalies that may otherwise go undetected. While the most common use case for PCA has been dimensionality reduction (in machine learning domains) or anomaly detection (in the networking domain), we use it in a novel way: to enable grouping of datasets for multi-task learning. Because individual base stations cannot collect a sufficient amount of data, detecting anomalies at each of them will not yield results. However, the data still yields an explanation of normal behavior. We use this observation to partition the dataset.

Notation

As bearer-level traces are collected continuously, we consider a buffer of bearers as a measurement matrix A. Thus, A consists of m bearer records, each having n observed parameters, making it an m × n time-series matrix. Note that n is on the order of a few hundred fields, while m can be much higher depending on how long the buffering interval is. We enforce n to be fixed in our setting: every measurement matrix must contain n columns. To make this matrix amenable to PCA analysis, we adjust the columns to have zero mean. By applying PCA to any measurement matrix A, we can obtain a set of k principal components ordered by the amount of data variance they capture.
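For concreteness, a small numpy sketch of this step is shown below, using the 95% variance heuristic described later; the function name and the SVD route are our choices rather than details fixed by the design.

import numpy as np

def principal_loadings(A: np.ndarray, var_target: float = 0.95) -> np.ndarray:
    """Return the n x k loading matrix of the components that capture var_target variance."""
    A = A - A.mean(axis=0)                       # zero-mean columns, as required
    _, s, vt = np.linalg.svd(A, full_matrices=False)
    explained = (s ** 2) / (s ** 2).sum()
    k = int(np.searchsorted(np.cumsum(explained), var_target)) + 1
    return vt[:k].T                              # shape (n, k): weight of each field per component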

PCA Similarity

It is intuitive to see that many measurement matrices may be formed based on different criteria. Suppose we are interested in finding if two measurement matrices are similar. One way to achieve this is to compare the principal components of the two matrices. Krzanowski [103] describes such a Similarity Factor (SF). Consider two matrices A and B having the same number of columns, but not rows. The similarity factor between A and B is:

SF = trace(L M' M L') = \sum_{i=1}^{k} \sum_{j=1}^{k} cos^2 θ_{ij}

where L, M are the first k principal components of A and B respectively, and θ_{ij} is the angle between the i-th component of A and the j-th component of B. The similarity factor considers all combinations of k components from both matrices.

CellScope's Similarity Metric

Similarity in our setting bears a slightly different meaning: we do not want strict similarity between measurement matrices, but only similarity between corresponding principal components. This ensures that algorithms will still capture the underlying major influences and trends in observation sets that are not exactly similar. So we propose a simpler metric. Consider two measurement matrices A and B as before, where A is of size m_A × n and B is of size m_B × n. By applying PCA on the matrices, we obtain k principal components using a simple heuristic: we take the first k components that capture 95% of the variance. From the PCA, we obtain the resulting weight vector, or loading, which is an n × k matrix: for each principal component in k, the loading describes the weight on the original n features.

Intuitively, this can be seen as a rough measure of the influence of each of the n features on the principal components. The distance between the corresponding loading matrices gives

SF_CellScope = \sum_{i=1}^{k} d(a_i, b_i) = \sum_{i=1}^{k} \sum_{j=1}^{n} |a_{ij} − b_{ij}|

where a and b are the column vectors representing the loadings for the corresponding principal components from A and B. Thus, SF_CellScope captures how closely the underlying features explain the variation in the data.

Due to the complex interactions between network components and the wireless medium, many of the performance issues in RANs are geographically tied (e.g., congestion might happen in nearby areas, and drops might be concentrated). Proposals for conducting geographically weighted PCA (GW-PCA) exist [78], but they are not applicable here since they assume a smoothly decaying, user-provided bandwidth function. SF_CellScope by itself does not capture this phenomenon because it only considers similarity in normal behavior. Consequently, it is possible for anomaly detection algorithms to miss geographically-relevant anomalies. To account for this domain-specific characteristic, we augment our similarity metric to also capture geographical closeness by weighing the metric by the geographical distance between the two measurement matrices. Our final similarity metric is

SF_CellScope = w_distance(A,B) × \sum_{i=1}^{k} \sum_{j=1}^{n} |a_{ij} − b_{ij}|

A related similarity measure for multivariate time series is proposed in [189], but it is not applicable here due to its stricter form and its dependence on finding the right eigenvector matrices to extend the Frobenius norm.
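A direct sketch of this metric over two loading matrices produced by the principal_loadings helper above; how the geographic weight w_distance(A,B) is derived (e.g., raw distance in kilometers) is left to the caller, since the text only states that the geographical distance between the cells is used as the weight.

import numpy as np

def similarity_cellscope(load_a: np.ndarray, load_b: np.ndarray, distance_weight: float) -> float:
    """Geography-weighted L1 difference between corresponding PCA loadings.

    load_a, load_b: n x k loading matrices for two measurement matrices.
    distance_weight: geographic weight between the two cells (caller-provided assumption).
    """
    k = min(load_a.shape[1], load_b.shape[1])    # compare corresponding components only
    return distance_weight * float(np.abs(load_a[:, :k] - load_b[:, :k]).sum())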

Using the Similarity Metric for Partitioning

With the similarity metric, CellScope can now partition bearer records. We first group the bearers into measurement matrices by segregating them based on the cell on which the bearer originated. This grouping is based on our observation that the cell is the lowest level at which an anomaly would manifest. We then create a graph G(V, E) whose vertices are the individual cell measurement matrices. An edge is drawn between two matrices if the SF_CellScope between them is below a threshold. To compute SF_CellScope, we simply use the geographical distance between the cells as the weight. Once the graph has been created, we run connected components on it to obtain the partitions. The use of the connected components algorithm is not fundamental; it is also possible to use a clustering algorithm instead. For instance, a k-means clustering algorithm that leverages SF_CellScope to merge clusters would yield similar results.
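Putting the metric to use, a small sketch of the partitioning step is shown below; it assumes networkx is available and that per-cell loading matrices and pairwise geographic distances have already been computed (the similarity_cellscope helper is the one sketched above).

import itertools
import networkx as nx

def partition_cells(loadings: dict, geo_dist: dict, threshold: float) -> list:
    """Group cells whose pairwise SF_CellScope score is below `threshold`.

    loadings: cell_id -> (n x k) loading matrix; geo_dist: (cell_a, cell_b) -> distance.
    Returns one set of cell ids per partition (connected component).
    """
    g = nx.Graph()
    g.add_nodes_from(loadings)
    for a, b in itertools.combinations(sorted(loadings), 2):
        score = similarity_cellscope(loadings[a], loadings[b], geo_dist[(a, b)])
        if score < threshold:                    # similar enough: draw an edge
            g.add_edge(a, b)
    return [set(component) for component in nx.connected_components(g)]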

Managing Partitions Over Time

One important consideration is managing group changes over time. To detect group changes, it is necessary to establish correspondence between groups across time intervals. Once this correspondence is established, CellScope's hybrid modeling makes it easy to accommodate changes. Due to the segregation of our model into common and base station specific components, small changes to the group do not affect the common model. In these cases, we can simply bootstrap the new base station using the common model, and then start learning specific features. On the other hand, if there are significant changes to a group, then the common model may no longer be valid, which is easy to detect using concept drift. In such cases, the offline model could be rebuilt.

7.4.4 Summary

We now summarize how CellScope resolves the fundamental trade-off between latency and accuracy. To cope with the fact that individual base stations cannot produce enough data for learning within a given time budget, CellScope uses MTL. However, our datasets violate the assumption of learning task dependencies. As a solution, we proposed a novel way of using PCA to group data into sets with the same underlying performance characteristics. Directly applying MTL on these groups would still be problematic in our setting due to the inefficiencies of model updates. To solve this, we proposed a new MTL formulation which divides the model into an offline and online hybrid. On top of this formulation, we proposed using simple learning functions that are amenable to incremental and distributed execution. Finally, CellScope uses a simple concept drift detection technique to find and diagnose anomalies.

7.5 Implementation

We have implemented CellScope on Spark [192]. We describe its API, which exposes our PCA-based commonality grouping (§7.5.1), and the implementation details of the hybrid offline-online MTL models (§7.5.2).

7.5.1 Data Grouping API

CellScope's grouping API is built on Spark Streaming [194], since the data arrives continuously and we need to operate on it in a streaming fashion. Spark Streaming already provides support for windowing functions on streams of data, so we extended it with the three APIs in listing 7.1.

grouped = DStream.groupBySimilarityAndWindow(windowDuration, slideDuration)

reduced = DStream.reduceBySimilarityAndWindow(func, windowDuration, slideDuration)

joined = DStream.joinBySimilarityAndWindow(windowDuration, slideDuration)

Listing 7.1: Grouping API

The APIs leverage the DStream abstraction provided by Spark Streaming. groupBySimilarityAndWindow takes the buffered data from the last window duration and applies the similarity metric to produce grouped datasets every slide duration. reduceBySimilarityAndWindow allows an additional user-defined associative reduction operation on the grouped datasets. Finally, joinBySimilarityAndWindow joins multiple streams using similarity.

7.5.2 Hybrid MTL Modeling

We use Spark's machine learning library, MLlib [162], to implement our hybrid MTL model. MLlib contains implementations of many distributed learning algorithms. The MTL formulation we presented in §7.4.2 allows us to utilize these existing models in our framework. In MTL, the tasks learn from each other. These tasks in our setting consist of building a model m_g(bs) for each base station in every group created by the PCA-based grouping. In eq. (7.3), we presented our MTL formulation and described a simplified loss minimization method to estimate this model. Further, in eq. (7.5), we decomposed this into a shared set and a base station specific set, so the model m_g(bs) is of the general form h(x : f_s, f_c). Since we restrict ourselves to learning functions of the form w · x for this model, our model per base station is simply a weight vector on the shared group model. This allows the usage of existing ensemble methods [53]. Ensemble methods use multiple learning algorithms to obtain better performance. In our case, we use the ensemble method to learn the shared group model. This can be done in many ways: we can directly employ existing ensemble methods, or we can leverage multiple algorithms as components of the ensemble. However, unlike normal ensemble methods where the output is aggregated, we use the MTL approach of a task per base station to learn the per-base station model. This is equivalent to a linear model on the individual ensemble components, which gives us the weight vector.

We modified the MLlib implementation of Gradient Boosted Trees (GBT) [64]. This implementation supports both classification and regression, and internally uses stochastic methods. We implement the group's shared feature model using either the GBT's ensemble or individual algorithms. As an example, for connection drop prediction, the shared model can be obtained using standard ensembles such as the GBT itself or random forests. Then, we use individual base station data to fit a linear model on the individual ensemble components. Note that it is not necessary to build the base model this way; we could also use multiple learning methods as ensemble components. In the same example, our ensemble could consist of a combination of SVMs and decision trees. Similarly, for throughput prediction, the shared model is built as an ensemble of regression models; for instance, we may use one model for low throughput and another for high throughput, and each of these tasks could use a different standard learning method. In this scheme, we can update the base station specific weight vector in real time as data is streamed in, since we simply need to update the linear model. Further, the group-specific model can be periodically retrained. One way to do so is to simply add more models to the ensemble when new data comes in. Our implementation allows weighting the outcomes to give more weight to the latest models.
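The sketch below approximates this scheme with scikit-learn instead of Spark MLlib: a random forest plays the role of the shared group ensemble, and each base station fits a logistic regression over the per-tree outputs, acting as the per-base-station weight vector. The choice of random forest, the function names, and the binary drop/no-drop labels are illustrative assumptions.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def fit_group_model(X_group, y_group, n_trees=50):
    """Offline shared model for a group: an ensemble trained on the pooled group data."""
    return RandomForestClassifier(n_estimators=n_trees).fit(X_group, y_group)

def tree_outputs(forest, X):
    """Per-tree drop probabilities; these act as the ensemble components."""
    return np.column_stack([t.predict_proba(X)[:, 1] for t in forest.estimators_])

def fit_bs_model(forest, X_bs, y_bs):
    """Online per-base-station model: a linear weighting of the ensemble components."""
    return LogisticRegression(max_iter=1000).fit(tree_outputs(forest, X_bs), y_bs)

def predict_bs(forest, bs_model, X):
    return bs_model.predict(tree_outputs(forest, X))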

7.6 Evaluation

We have evaluated CellScope through a series of experiments on real-world cellular traces from a live LTE network in a large geographical area. Our results show that:

• CellScope’s similarity based grouping provides up to 10% improvement in accuracy on its own compared to the best space partitioning scheme.

• With MTL, CellScope’s accuracy improvements range from 2.5× to 4.4× over different collection latencies.

• Our hybrid online-offline model is able to reduce model update times by up to 4.8× and is able to learn changes in an online fashion with no loss in accuracy.

Evaluation Setup: We use a private cluster of 20 machines, each with 4 CPUs, 32 GB RAM, and a 200 GB hard disk.

Dataset: We collected data from a major metro-area LTE network for a period of over 10 months. It serves over 2 million active users and carries over 6 TB of traffic per hour.

Figure 7.4: CellScope achieves high accuracy while reducing the data collection latency. (a) Grouping by itself is able to provide significant gains; MTL provides further gains. (b) Grouping is not computationally intensive: even a day's worth of data (with >500M records) can be grouped in under a minute. (c) CellScope achieves up to 2.5× accuracy improvements in drop rate classification. (d) Improvements in throughput regression go up to 4.4×. (e) CellScope's hybrid model allows efficient updates, and reduces update time by up to 4.8×. (f) Online training due to the hybrid model helps avoid the loss in accuracy due to staleness of the model.

7.6.1 Benefits of Similarity Based Grouping

We first attempt to answer the question "How much benefit does similarity-based grouping provide?". For this, we conducted two experiments, each with a different learning algorithm. The first experiment, detection of call drops, uses a classification algorithm, while the second, throughput prediction, uses a regression algorithm. We chose these to evaluate the benefits for two different classes of algorithms. In both cases, we pick the data collection latency where the per base station model gives the best accuracy, which was 1 hour for classification and 5 minutes for regression. In order to compare the benefits of our grouping scheme alone, we build a single model per group instead of applying MTL. We compare the accuracy obtained with three different spatial partitioning schemes. The first scheme (Spatial 1) simply partitions space into grids of equal size. The second (Spatial 2) uses a sophisticated space-filling curve based approach [92] that can create dynamically sized partitions. Finally, the third (Spatial 3) creates partitions using base stations that are under the same region. Figure 7.4a shows the results. CellScope's similarity grouping performs as well as the per base station model, which gives the highest accuracy.

It is interesting to note the performance of the spatial partitioning schemes, which ranges from 75% to 80%. None of the spatial schemes come close to the similarity grouping results. This is because the drops are few and concentrated. Spatial schemes club base stations not based on the underlying drop characteristics, but only on spatial proximity. This causes the algorithms to underfit or overfit. Since our similarity based partitioner groups base stations using the drop characteristics, it is able to do as much as 17% better. The benefits are even higher in the regression case. Here, the per base station model is unable to get enough data to build an accurate model and hence achieves only around 66% accuracy. Spatial schemes are able to do slightly better than that. Our similarity based grouping emerges as the clear winner in this case with 77.3% accuracy. This result reflects the highly variable performance characteristics of the base stations and the need to capture them for accuracy. These benefits do not come at the cost of computational overhead due to grouping. Figure 7.4b shows the overhead of similarity based grouping on various dataset sizes.

7.6.2 Benefits of MTL

Next, we characterize the benefits of CellScope's use of MTL. We repeat the previous experiment, but this time apply MTL to the grouped data to see if the accuracy improves compared to the earlier approach of a single model per group. The results are presented in figure 7.4a.

The ability of MTL to learn and improve models from other similar base stations' data results in an increase in accuracy. Over the benefits of grouping, we see an improvement of 6% in the connection drop diagnosis experiment and 16.2% in the throughput prediction experiment. The higher benefits in the latter come from CellScope's ability to capture individual characteristics of the base stations. This ability is not as crucial in the former because of the limited variation in individual characteristics.

7.6.3 Combined Benefits

Here, we are interested in evaluating how CellScope handles the latency-accuracy trade-off. We run the same classification and regression experiments, but at different data collection latencies. We show the results from the classification and regression experiments in fig. 7.4c and fig. 7.4d, which compare CellScope's accuracy against a per base station model's. When the opportunity to collect data at individual base stations is limited, CellScope is able to leverage our MTL formulation to combine data from multiple base stations and build customized models to improve the accuracy. The benefits of CellScope range up to 2.5× in the classification experiment and up to 4.4× in the regression experiment. Lower latencies are problematic in the classification experiment due to the extremely low probability of drops, while higher latencies are a problem in the regression experiment due to the temporal changes in performance.

7.6.4 Hybrid Model Benefits

Finally, we are interested in how much overhead the hybrid model reduces during model updates, and whether it can learn online. To answer the first question, we conducted the following experiment: we considered three different data collection latencies: 10 minutes, 1 hour, and 1 day. We then learn a decision tree model on this data in a tumbling window fashion. So for the 10 minute latency, we collect data for 10 minutes, then build a model, wait another 10 minutes to refine the model, and so on. We compare our hybrid model strategy to two different strategies: a naive approach which rebuilds the model from scratch every time, and a better, strawman approach which reuses the last model and makes changes to it. Both build a single model, while CellScope uses our hybrid MTL model and only updates the online part of the model. The results of this experiment are shown in figure 7.4e. The naive approach incurs the highest overhead, which is expected due to the need to rebuild the entire model from scratch. The overhead increases with the increase in input data. The strawman approach, on the other hand, is able to avoid this heavy overhead. However, it still incurs overhead with larger inputs because its single model requires changes to many parts of the tree. CellScope incurs the least overhead, due to its use of multiple models. When data accumulates, it only needs to update a part of an existing tree, or build a new tree. This strategy results in a reduction of 2.2× to 4.8× in model building time for CellScope.

To wrap up, we evaluated the performance of the hybrid strategy on different data collection intervals. Here we are interested in seeing whether the hybrid model is able to adapt to data changes and provide reasonable accuracy. We use the connection drop experiment again, but run it differently. At different collection latencies, we build the model at the beginning of the collection interval and use it for the next interval. Hence, for the 1 minute latency, we build a model using the first minute's data and use that model for the second minute (until the whole second minute has arrived). The results are shown in figure 7.4f. We see that the per base station model suffers an accuracy loss at higher latencies due to staleness, while CellScope incurs almost zero loss in accuracy. This is because it doesn't wait until the end of the interval, and is able to incorporate data in real time.

7.7 Real World RAN Analysis

We now turn to the question of how operators could benefit from a system such as CellScope. We try to answer this question in two ways: first, we evaluate the benefits of automatic root-cause analysis and how much operator effort it saves. Second, we evaluate CellScope's ability to perform analysis in the wild.

7.7.1 Time Savings to the Operator

Operators spend several billions of dollars diagnosing network problems. Often, finding the cause of a network problem takes hours, or even days, of effort. To evaluate how CellScope could cut down this effort, we collected network trouble tickets from the operator. The operator logs tickets at different levels, so we look at trouble tickets that were investigated by domain experts using state-of-the-art tools such as datacubes. For each ticket where the operator has network data available, we used CellScope to diagnose the problem. This way, we can evaluate the potential time savings CellScope provides. We discuss four real trouble tickets; the time taken by CellScope is shown in table 7.1.

Ticket                                   Resolution Time    CellScope
Throughput degradation after upgrade     3 days             10 minutes
Specific patterns of call drops          1 day              2 minutes
Periodic throughput KPI degradation      7 days             15 minutes
Periodic call drops                      1 hour             1 minute

Table 7.1: CellScope is able to reduce operator effort by several orders of magnitude. Resolution time includes field trials & expert analysis using datacubes / state-of-the-art tools [15].

Throughput Degradation After Upgrade

This ticket reported that a number of users experienced degraded network throughput after a network upgrade. In many cases, throughput decreases of up to 30% were observed. Since not all of the users saw this problem, the operator had to conduct field trials to find the root cause of the problem. We used CellScope to model the throughput before and after the upgrade. Comparing the models, we noticed that a cluster of base stations had one feature influencing the model heavily. This matched the operator's ticket resolution: the field trials in the ticket indicated that the problem was cluster-wide and was caused by the feature CellScope identified being erroneously turned on after the upgrade. The base stations CellScope identified matched those reported in the resolution. In this case, the ticket was resolved in three days including the field trials, while our modeling on CellScope took less than 10 minutes. Note that manually applying learning techniques would not have found the problem without grouping.

Specific Patterns of Call Drops

Here, the operator reported consistent call drops (specifically, VoLTE call drops) in certain areas of the network. Manually analyzing this would have required a domain expert to slice and dice several TB of data to find a pattern and then dig deep into it. To reduce this effort, the operator conducted field trials in the affected areas to obtain test data manageable for the expert, who was able to identify the problem: a missing neighbor configuration in a group of base stations. We used CellScope to model the call drops in an expanded portion of the network. After the grouping process, one particular group's model indicated that drops happened when a handoff procedure was triggered and the procedure failed due to a specific error code at the base station, missingneighbor. Here, the field trial and the domain expert's analysis were completed in one business day, while CellScope did the grouping and modeling on one day's data in 2 minutes.

Periodic Throughput KPI Degradation

The operator noticed a degradation of a KPI in the network. The degradation happened in some serving cells. However, it was not noticed consistently and occurred irregularly. In addition, the problem was transient. Thus, the ticket required a week's worth of effort to diagnose, since field trials did not prove to be of help. We used CellScope in the following fashion: we replayed the data for the days when KPI degradation was reported. We then built incremental models for drop rate and throughput, looked at the intervals when CellScope refined the model due to accuracy loss (detected via concept drift), and examined the model changes. We noticed that during some specific intervals, call drops spiked in some cells while the throughput of the entire cell dropped. The difference in the models built by CellScope indicated that device-specific features influenced the drops. The reason was that a particular device model and software version was creating a deluge of control messages that affected the entire cell when it was near capacity. The ticket closure confirmed this.

Periodic Call Drops

Here, the operator noticed a periodic increase in call drops. The domain expert was able to identify the problem as a PCI collision within an hour, thanks to her vast domain expertise, by looking through the logs from the affected period. We used the same logs in CellScope and were able to generate a call drop model that explained the drops using inter-cell interference. Without such expertise, resolving the ticket would have been time-consuming.

7.7.2 Analysis in the Wild: Findings

To validate our system in the real world, we used CellScope for RAN performance analysis on the live LTE network. Based on our experience with trouble tickets, we considered two metrics that are of significant importance for end-user experience: throughput and connection drops. In this section, we present some of our main findings (which were previously unknown to the operator) and the role played by CellScope.

SINR Anomaly

In a particular week, we noticed that an implementation of a learning task for connection drops predicted unusually high numbers of drops. These high drops happened at some base stations, all of which were assigned to the same group by CellScope's grouping. Upon further investigation with help from network experts, it was revealed that these base stations had been experiencing unusually high levels of interference.

Incorrect Parameters

Similarly, we implemented a throughput prediction model. During a month-long observation, we noticed that the predicted throughput for a set of base stations had fallen below its normal average after a certain date. It was found that the base stations were connected to the same MME and that a software upgrade had incorrectly set some parameters affecting the throughput. This was one of several misconfigurations we found in the network that caused performance degradation. Others included incorrect neighbor assignments and hand-off problems.

Real-time Monitoring

We simulated real-time monitoring of the network and CellScope's ability to detect performance problems. The current approach taken by the operator is to define SLAs for KPIs and then monitor them for SLA violations. However, such aggregate metrics are likely to miss many events. We used CellScope to monitor the network over a month and verified whether the events predicted by CellScope matched ground truth. Not only did CellScope detect 100% of the KPI SLA violations, it also found a few issues that were missed by the KPI-based monitoring system and were later logged as trouble tickets.

Measurement Error

We also found problems in the network measurements. Specifically, during initial deployment trials of CellScope, we noticed that using the feature-engineered block error rate field resulted in poor accuracy. The reason for this was an uninitialized field in the measurement record logger, which resulted in random values.

7.8 Extending CellScope to a New Domain

To show the generality of the techniques presented in this chapter, we now apply them to a new domain: energy anomaly detection in mobile phones [128]. We use a dataset of measurements from approximately 800,000 users collected using the Carat app. The goal here is to suggest actions to users that help improve their battery life. This can be done by building a battery usage model for each user.

Data: The Carat app periodically collects a variety of data from the mobile phone it is running on, including the phone model, the version of the operating system, the state of the battery, the CPU and memory utilization, and the applications that are running. We use these fields to build an ML model that predicts the battery drain rate for a user.


Figure 7.5: Other domains suffer from the latency-accuracy trade-off. Here, we see the problem in the domain of energy debugging for mobile devices. Grouping by phone model or phone operating system does not give benefits.

Using this model, it is possible to point out potential applications that are responsible for increased battery drain.

Latency-Accuracy Trade-off: For users signing up for the Carat app, it is desirable to provide suggestions as soon as possible. However, it currently takes several weeks for the app to collect enough data for a new user. Figure 7.5 shows the results of building a model for suggesting apps that are energy bugs for a particular user once enough data has been collected. It can be seen that a per-user model (denoted Local) works the best, but at the cost of latency. The local model performs poorly until enough data has been collected, as depicted in fig. 7.6. A global model can be built immediately, but has poor accuracy. It is intuitive to think of grouping together users who have the same device model, or the same operating system. However, these groupings (denoted Model and OS) do not yield significant benefits. Further, as people install and uninstall apps, the models need to be updated. This makes the domain ideal for testing CellScope's techniques.

Extending the Similarity Metric and MTL: To extend our techniques to a new domain, we need to (i) customize the similarity metric (used for grouping) to the domain under consideration, and (ii) modify the MTL formulation in eq. (7.5) for this domain. In the cellular network domain, our similarity metric was weighted by the geographic distance between base stations. However, geographic distance does not have an effect here.


Figure 7.6: CellScope's techniques can easily be extended to new domains, and can benefit them. Here, using our techniques, models are usable immediately, while without CellScope, Carat [128] takes more than a week to build a usable model.

From fig. 7.5, we notice that device model and operating system do not make much difference either. Intuitively, the subset of apps common between two users should provide better results. However, that alone is not enough, as usage patterns vary across users with similar apps. The Carat dataset provides enough information to determine the number of times each app is active, which is roughly an indicator of a user's usage pattern. We use this to derive a usage similarity between users, u_usage(A,B), and utilize it to form the similarity metric:

SF_CellScope = u_usage(A,B) × \sum_{i=1}^{k} \sum_{j=1}^{n} |a_{ij} − b_{ij}|

The MTL formulation remains the same as in eq. (7.5); we simply replace f_s with per-user features f_u. We implemented a Mobile Energy Diagnosis module in CellScope, at the same level as the RAN Performance Analyzer in fig. 7.3, that uses our modified similarity metric and MTL formulation. We then applied the grouping and learning to the measurement data we obtained to build a model for suggesting bugs to a new user. The results are shown in fig. 7.6, which shows the accuracy of models built with CellScope (denoted CellScope) and without it (denoted per-user), starting from the day a user installs Carat. We see that on the day of signing up, the model built without CellScope is too inaccurate to be usable. This is intuitive, since only a few samples have been sent by the new user's device. Over time, the user sends enough data and the accuracy improves. However, it takes over a week for Carat to offer usable suggestions to a new user.

In contrast, with CellScope, we are able to build models that are immediately usable, and Carat can begin offering suggestions on day 1.

7.9 Related Work

Monitoring and Troubleshooting

Network monitoring and troubleshooting has been an active area of research in both wired networks [77, 96, 186] and wireless networks [50, 15, 58, 16]. These techniques do not employ machine learning for troubleshooting. Systems targeting the RAN [58, 16] typically monitor aggregate KPIs and per-bearer records separately. Their root cause analysis correlates KPI problems with aggregate air interface metrics such as SINR histograms and configuration data. Because these systems rely on traditional database technologies, it is hard for them to provide fine-grained prediction based on bearer models. Recent research [92] and commercial offerings [18] have looked at the problem of scalable cellular network analytics by leveraging big data frameworks. However, they do not support learning tasks. In contrast, CellScope focuses on the scalable and accurate application of machine learning in such domains.

Self-Organizing Networks (SON) The goal of SON [1] is to make the network capable of self-configuration (e.g. automatic neighbor list configuration) and self-optimization. CellScope’s techniques can provide the necessary diagnostics capabilities for assisting SON.

Modeling and Diagnosis Techniques Problem diagnosis in cellular networks has been explored extensively in the literature in various forms [26, 79, 110, 145, 167, 87]. The focus of these has been either detecting faults or finding the root cause of failures. A vast majority of such techniques depend on aggregate information and correlation-based fault detection. [87] discusses the shortcomings of using aggregate KPIs and proposes the use of fine-grained information. Some studies have focused on understanding the interaction of applications and cellular networks [142, 149, 172, 81, 94]. These are largely orthogonal to our work. Finally, some recent proposals leverage ML for specific tasks. In [167], the authors discuss the use of ML tools in predicting impending call drops and their duration. A probabilistic system for auto-diagnosing faults in the RAN is presented in [26]; it uses KPIs as inputs to the model. By using ML to model the influence of radio network characteristics on user experience metrics, [25] shows that improving the signal-to-noise ratio, decreasing load, and reducing handovers in cellular networks can improve web quality of experience.

Our previous work [87] proposed the use of simple, explainable ML models towards automating RAN problem detection and diagnosis, and discussed several challenges in leveraging ML. In this work, we present techniques that address these challenges in many domains.

Multi-Task Learning MTL builds on the idea that related tasks can learn from each other to achieve better statistical efficiency [42, 168, 60, 28]. Since the assumption of task relatedness does not hold in many scenarios, techniques to automatically cluster tasks have been explored in the past [100, 69]. However, these techniques consider tasks as black boxes and hence cannot leverage domain-specific structure. CellScope proposes a hybrid offline-online MTL formulation built on domain-specific grouping of tasks based on their underlying performance characteristics.

7.10 Summary

The practicality of real-time decision making in many domains generating connected data is impeded by a fundamental trade-off between data collection latency and analysis accuracy. In this chapter, we first exposed this trade-off using the domain of cellular radio access networks (RANs). We presented CellScope to resolve this trade-off by applying a domain-specific formulation of MTL. To apply MTL effectively, CellScope proposed a novel PCA-inspired similarity metric that groups data from geographically nearby base stations sharing performance commonalities. Finally, it also incorporates a hybrid online-offline model for efficient model updates. Our evaluations show significant benefits. We have also used CellScope to analyze a live LTE network, where it could offer a significant reduction in troubleshooting effort. We then explored the generality of our techniques by applying them to a new domain, energy anomaly diagnosis in smartphones. We showed that extending our grouping and learning techniques to a new domain is easy and effective. Thus, we believe our proposals form a solid framework for mitigating the effects of the latency-accuracy trade-off in real-time dynamic connected data analytics systems.

Chapter 8

Conclusions & Future Work

Over the last several chapters, we have discussed the design and implementation of systems for dynamic connected data analysis, focusing on different aspects of the problem such as compact storage, efficient execution, and low-latency decisions. However, several pieces remain before we achieve the vision of building the next generation of computing infrastructure for real-time decisions on large-scale dynamic, connected data. In this chapter, we touch upon a few directions for future work in this space.

8.1 Future Work

Learning Based Connected Data Processing Here, we look at leveraging learning to improve connected data processing. Possible future work in this direction includes:

Pluggable Learned Sparsifiers: In chapter 5, we presented our work on GAP, an effort to bring approximation to graph processing. The effectiveness of GAP depends on the sparsifier used, and we discussed simple techniques to create sparsifiers. An interesting question worth exploring is whether the sparsifier could be learned, perhaps using techniques such as deep learning and/or by leveraging the rich graph theory literature, such as [107], which presents a theoretical analysis of input reduction for some popular graph algorithms. The sparsifier can then be made pluggable; that is, we could learn not one but many sparsifiers. Based on the characteristics of the available choices, it may be possible to cherry-pick one, or an ensemble of them, to meet the latency/error bound.
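As a rough illustration of what such pluggability might look like, the sketch below defines a hypothetical sparsifier registry and a selection routine that picks, among the sparsifiers meeting a requested error bound, the one with the best estimated speedup. The interface, the names, and the example weight-threshold sparsifier are our own assumptions for illustration; they are not part of GAP.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A graph is represented here simply as a list of (src, dst, weight) edges.
Edge = Tuple[int, int, float]

@dataclass
class Sparsifier:
    name: str
    apply: Callable[[List[Edge]], List[Edge]]  # reduces the edge set
    est_error: float                           # estimated relative error
    est_speedup: float                         # estimated speedup factor

REGISTRY: List[Sparsifier] = []

def register(sparsifier: Sparsifier) -> None:
    """Make a (possibly learned) sparsifier available to the engine."""
    REGISTRY.append(sparsifier)

def pick(error_bound: float) -> Sparsifier:
    """Among sparsifiers meeting the error bound, pick the fastest one."""
    candidates = [s for s in REGISTRY if s.est_error <= error_bound]
    if not candidates:
        raise ValueError("no registered sparsifier meets the error bound")
    return max(candidates, key=lambda s: s.est_speedup)

# Example: a simple weight-threshold sparsifier, registered alongside any
# learned sparsifiers that might be trained later.
register(Sparsifier(
    name="drop-light-edges",
    apply=lambda edges: [e for e in edges if e[2] >= 0.1],
    est_error=0.05,
    est_speedup=3.0,
))
```

An ensemble could be expressed in the same way, as a Sparsifier whose apply function composes several registered ones.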

Program Synthesis for Approximation: An alternative to applying sparsification to the graph is to use an approximate version of the algorithm.

In the graph theory literature, there are numerous proposals for approximate versions of algorithms for specific analysis problems. However, using them directly is problematic, since it introduces a dependency on the particular algorithm and hence impedes generality. Instead, a more radical question is whether it is possible to automatically synthesize an approximate version of an analysis given its exact version.

Incremental Approximate Analysis: In both of the approximate techniques presented in this dissertation, the system re-executes queries when the underlying dataset changes. This is acceptable in a majority of cases because the approximation is fast enough to support interactivity. However, when real-time answers are required for complex analyses, rerunning the analysis from scratch may not be feasible. Here, it may be worth looking at incremental approximate analysis. The interesting question is to understand how the approximation error accumulates over time, and how to impose bounds on it.

Learning On Connected Data This direction looks at how to improve learning techniques on connected data; possible future work includes:

Scalable Graph Convolutions for Deep Learning: Graph convolutions have emerged as a popular technique for combining deep learning and structured approaches, and graph networks have recently been touted as fundamental building blocks for next-generation AI [27]. In these approaches, the graph convolution is often the bottleneck for real-time applications, mainly because it is implemented using message passing. Given the usefulness of approximation, it may be possible to take a non-traditional approach that eschews message passing in favor of approximate graph-isomorphism-based techniques for faster and better convolutions.

Approximation as a First Class Citizen: Many AI techniques, such as reinforcement learning (RL), depend on complex distributed algorithms and parallel processing to learn policies by performing a huge number of compute-intensive simulations. A unique characteristic of learning techniques is that they can accommodate approximation to some extent without affecting the performance of the model. This provides us with an opportunity to leverage approximation as a first-class citizen in building systems for learning on connected data in an effort to enable real-time AI; for instance, it may make real-time policy updates possible.

New Scenarios Now we discuss work which could potentially improve connected data analysis systems by leveraging new settings/scenarios.

Error-Correcting-Code Based State Store: As data volumes keep increasing, future connected data analysis systems will need to handle orders of magnitude more data than what we presented in this dissertation. A major bottleneck in connected data analysis is the way data is partitioned across multiple machines. Several factors affect how data can be partitioned, and in many cases the data layout and query distribution result in heavy skew, with some machines becoming hotspots. The use of error-correcting codes to reduce this performance degradation is an idea worth considering.

Leveraging Other/New Storage Hardware: The systems in this dissertation assume that the entire data under analysis is stored in main memory. This is a reasonable assumption today, but with rising data volumes and new analyses (e.g., ones that generate significantly more intermediate data than input, such as pattern mining), it is worthwhile pursuing alternative storage layers. There are two possible paths to explore here: (i) looking at new ways to represent data in traditional disk-based storage (spinning disks or SSDs), possibly leveraging the (limited) computation capability in these devices, or (ii) leveraging new hardware, such as 3D XPoint, to aid connected data analysis.

Simultaneous Queries and Analysis: This dissertation looked at analysis on connected data, but an equally important area is supporting point queries, where every query looks at only a very small amount of data but the number of queries is large. However, supporting such queries is fundamentally at odds with analysis because the two require completely different data layouts. The systems we presented are tuned for working on large sets of data (e.g., the computation model executes the same program on every node), and thus are not suited for point querying. Work in this direction would need to look at new data layouts and computation engines that can support both large-scale and fine-grained tasks. Can we develop data structures that can be transformed into each other easily?

Universal Sketching for Connected Data Queries: Building on the previous direction, a possible approach to supporting faster querying might be to use sketches: compact digests of a dataset that can be used to answer queries. However, traditional sketching techniques are specific to the query; that is, a given sketch can only answer a particular kind of query. There is an opportunity to design universal sketching mechanisms targeted at connected data that can answer several kinds of queries.
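To illustrate why query-specificity is limiting, the sketch below is a minimal Count-Min sketch, a standard streaming data structure included here purely as an example and not a proposal from this dissertation: it can estimate item frequencies over a stream (of, say, edges or labels), but it cannot answer connectivity or triangle-count queries, each of which would require a different sketch.

```python
import hashlib

class CountMinSketch:
    """A minimal Count-Min sketch: estimates item frequencies in a stream.

    It answers exactly one kind of query (frequency estimates, with
    one-sided error); other query classes need other sketches, which is
    the limitation universal sketching would aim to remove.
    """
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One hashed column per row, seeded by the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # The sketch only overestimates, so take the minimum over rows.
        return min(self.table[row][col] for row, col in self._buckets(item))
```

A universal sketch for connected data would aim to answer many such query classes from a single compact digest.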

Incorporating the Edge: The type of connected data analysis we presented in this dissertation resorted to aggregating data from the sources to one (or more) data centers in the cloud and processing it there. However, as we move towards real-time applications, such aggregation may not be possible. Thus, we believe that the next generation of real-time connected data processing systems will span edges and the cloud. Including the edge in the next-generation system architecture can provide vast benefits. For instance, splitting model learning and inference between the cloud and the edge can result in significant latency and cost reductions. However, several open questions need to be answered before this becomes a reality: how should the work be partitioned? How much data should each edge send? How can it do so in the face of constrained bandwidth? Can we leverage sampling to reduce bandwidth requirements? How would accuracy be affected?

Serverless Connected Data Processing: The systems presented in this dissertation, like a vast majority of data processing systems today, assume that they are deployed on a dedicated cluster. Recently, serverless computing architectures have gained attention for cloud workloads. The significant ease of deployment and management, and the highly elastic nature of serverless architectures, make them a favorable candidate for real-time connected data processing systems, as they can easily handle the dynamic nature of both the data volume and the computation requirements over time. However, existing serverless solutions are not well suited for iterative and communication-heavy workloads, both of which are common traits of real-time learning tasks on connected data. It might be possible to leverage approximation and intelligent data structures to achieve highly scalable and elastic connected data analytics and machine learning engines that can be deployed on serverless offerings.

8.2 Closing Thoughts

In this dissertation, we designed and developed scalable systems for connected data processing by proposing new abstractions for operating on such data, compact data structures to store data and state, efficient computation models that reduce redundant work using incremental techniques, techniques that significantly improve performance by embracing approximation, and methods for accurately applying machine learning while mitigating fundamental trade-offs. With the IoT fast becoming a reality, and edges incorporating advanced actuation capabilities, we are entering an exciting era in real-time connected data processing. We believe that there are tremendous opportunities for cross-disciplinary research in the areas of distributed systems, databases, machine learning, and mobile computing. We hope that the systems and techniques we presented here can influence the next generation of data-intensive systems.

Bibliography

[1] 3gpp. Self-Organizing Networks SON Policy Network Resource Model (NRM) Integra- tion Reference Point (IRP). http://www.3gpp.org/ftp/Specs/archive/32_series/ 32.521/. [2] A comparison of state-of-the-art graph processing systems. https://code.facebook. com / posts / 319004238457019 / a - comparison - of - state - of - the - art - graph - processing-systems/. 2016 (accessed January 2019). [3] Martin Abadi et al. “TensorFlow: A System for Large-scale Machine Learning”. In: Proceedings of the 12th USENIX Conference on Operating Systems Design and Im- plementation. OSDI’16. Savannah, GA, USA: USENIX Association, 2016, pp. 265– 283. [4] Sameer Agarwal, Henry Milner, Ariel Kleiner, Ameet Talwalkar, Michael Jor- dan, Samuel Madden, Barzan Mozafari, and Ion Stoica. “Knowing when You’Re Wrong: Building Fast and Reliable Approximate Query Processing Systems”. In: Proceedings of SIGMOD ’14. Snowbird, Utah, USA: ACM, pp. 481–492. [5] Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, and Ion Stoica. “BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data”. In: Proceedings of the 8th ACM European Conference on Computer Systems. EuroSys ’13. Prague, Czech Republic: ACM, 2013, pp. 29–42. [6] Bhavish Aggarwal, Ranjita Bhagwan, Tathagata Das, Siddharth Eswaran, Venkata N. Padmanabhan, and Geoffrey M. Voelker. “NetPrints: diagnosing home network misconfigurations using shared knowledge”. In: Proceedings of the 6th USENIX symposium on Networked systems design and implementation. NSDI’09. Boston, Mas- sachusetts: USENIX Association, 2009, pp. 349–364. [7] Charu C. Aggarwal and Haixun Wang, eds. Managing and Mining Graph Data. Vol. 40. Advances in Database Systems. Springer, 2010. BIBLIOGRAPHY 131

[8] Nesreen K. Ahmed, Nick Duffield, Jennifer Neville, and Ramana Kompella. “Graph Sample and Hold: A Framework for Big-graph Analytics”. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’14. New York, New York, USA: ACM, 2014, pp. 1446–1455. [9] Kook Jin Ahn, Sudipto Guha, and Andrew McGregor. “Analyzing Graph Structure via Linear Measurements”. In: Proceedings of the Twenty-third Annual ACM-SIAM Symposium on Discrete Algorithms. SODA ’12. Kyoto, Japan: Society for Industrial and Applied Mathematics, 2012, pp. 459–467. [10] Kook Jin Ahn, Sudipto Guha, and Andrew McGregor. “Graph Sketches: Spar- sification, Spanners, and Subgraphs”. In: Proceedings of the 31st ACM SIGMOD- SIGACT-SIGAI Symposium on Principles of Database Systems. PODS ’12. Scottsdale, Arizona, USA: ACM, 2012, pp. 5–14. [11] Zhiyuan Ai, Mingxing Zhang, Yongwei Wu, Xuehai Qian, Kang Chen, and Weimin Zheng. “Squeezing out All the Value of Loaded Data: An Out-of-core Graph Processing System with Reduced Disk I/O”. In: 2017 USENIX Annual Technical Conference (USENIX ATC 17). Santa Clara, CA: USENIX Association, 2017, pp. 125– 137. [12] Zhiyuan Ai, Mingxing Zhang, Yongwei Wu, Xuehai Qian, Kang Chen, and Weimin Zheng. “Squeezing out All the Value of Loaded Data: An Out-of-core Graph Processing System with Reduced Disk I/O”. In: 2017 USENIX Annual Technical Conference (USENIX ATC 17). Santa Clara, CA: USENIX Association, 2017, pp. 125– 137. [13] Tyler Akidau, Alex Balikov, Kaya Bekiro˘glu,Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, and Sam Whittle. “MillWheel: Fault-tolerant Stream Processing at Internet Scale”. In: Proc. VLDB Endow. 6.11 (Aug. 2013), pp. 1033–1044. [14] Mohammad Al Hasan and Mohammed J. Zaki. “Output Space Sampling for Graph Patterns”. In: Proc. VLDB Endow. 2.1 (Aug. 2009), pp. 730–741. [15] Alcatel Lucent. 9900 Wireless Network Guardian. http://www.alcatel-lucent.com/ products/9900-wireless-network-guardian. 2013. [16] Alcatel Lucent. 9959 Network Performance Optimizer. http://www.alcatel-lucent. com/products/9959-network-performance-optimizer. 2014. [17] Alcatel Lucent. Alcatel-Lucent Motive Big Network Analytics for service creation. http: //resources.alcatel-lucent.com/?cid=170795. 2014. BIBLIOGRAPHY 132

[18] Alcatel Lucent. Motive Big Network Analytics. http://www.alcatel-lucent.com/ solutions/motive-big-network-analytics. 2014. [19] Omid Alipourfard, Hongqiang Harry Liu, Jianshu Chen, Shivaram Venkataraman, Minlan Yu, and Ming Zhang. “CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics”. In: 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). Boston, MA: USENIX Association, 2017, pp. 469–482. [20] Ganesh Ananthanarayanan, Michael Chien-Chun Hung, Xiaoqi Ren, Ion Stoica, Adam Wierman, and Minlan Yu. “GRASS: Trimming Stragglers in Approximation Analytics.” In: NSDI. 2014, pp. 289–302. [21] Apache Flink. https://flink.apache.org. [22] Apache Giraph. http://giraph.apache.org. [23] Sepehr Assadi, Sanjeev Khanna, and Yang Li. “On Estimating Maximum Matching Size in Graph Streams”. In: Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms. SODA ’17. Barcelona, Spain: Society for Indus- trial and Applied Mathematics, 2017, pp. 1723–1742. [24] Paramvir Bahl, Ranveer Chandra, Albert Greenberg, Srikanth Kandula, David A. Maltz, and Ming Zhang. “Towards Highly Reliable Enterprise Network Services via Inference of Multi-level Dependencies”. In: Proceedings of the 2007 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications. SIGCOMM ’07. Kyoto, Japan: ACM, 2007, pp. 13–24. [25] Athula Balachandran, Vaneet Aggarwal, Emir Halepovic, Jeffrey Pang, Srinivasan Seshan, Shobha Venkataraman, and He Yan. “Modeling Web Quality-of-experience on Cellular Networks”. In: Proceedings of the 20th Annual International Conference on Mobile Computing and Networking. MobiCom ’14. Maui, Hawaii, USA: ACM, 2014, pp. 213–224. [26] Raquel Barco, Volker Wille, Luis Díez, and Matías Toril. “Learning of Model Parameters for Fault Diagnosis in Wireless Networks”. In: Wirel. Netw. 16.1 (Jan. 2010), pp. 255–271. [27] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. “Relational inductive biases, deep learning, and graph networks”. In: arXiv preprint arXiv:1806.01261 (2018). [28] Jonathan Baxter. “A Model of Inductive Bias Learning”. In: J. Artif. Int. Res. 12.1 (Mar. 2000), pp. 149–198. BIBLIOGRAPHY 133

[29] Scott Beamer, Krste Asanovic, and David Patterson. “Locality Exists in Graph Processing: Workload Characterization on an Ivy Bridge Server”. In: Proceedings of the 2015 IEEE International Symposium on Workload Characterization. IISWC ’15. Washington, DC, USA: IEEE Computer Society, 2015, pp. 56–65. [30] Jose A. Blakeley, Per-Ake Larson, and Frank Wm Tompa. “Efficiently Updating Materialized Views”. In: Proceedings of the 1986 ACM SIGMOD International Con- ference on Management of Data. SIGMOD ’86. Washington, D.C., USA: ACM, 1986, pp. 61–71. [31] Paolo Boldi, Marco Rosa, Massimo Santini, and Sebastiano Vigna. “Layered Label Propagation: A MultiResolution Coordinate-Free Ordering for Compressing Social Networks”. In: Proceedings of the 20th international conference on World Wide Web. ACM Press, 2011. [32] Paolo Boldi, Marco Rosa, Massimo Santini, and Sebastiano Vigna. “Layered Label Propagation: A MultiResolution Coordinate-Free Ordering for Compressing Social Networks”. In: Proceedings of the 20th international conference on World Wide Web. Ed. by Sadagopan Srinivasan, Krithi Ramamritham, Arun Kumar, M. P. Ravindra, Elisa Bertino, and Ravi Kumar. ACM Press, 2011, pp. 587–596. [33] Paolo Boldi and Sebastiano Vigna. “The WebGraph Framework I: Compression Techniques”. In: Proc. of the Thirteenth International World Wide Web Conference (WWW 2004). Manhattan, USA: ACM Press, 2004, pp. 595–601. [34] Paolo Boldi and Sebastiano Vigna. “The WebGraph Framework I: Compression Techniques”. In: Proc. of the Thirteenth International World Wide Web Conference (WWW 2004). Manhattan, USA: ACM Press, 2004, pp. 595–601. [35] Flavio Bonomi, Rodolfo Milito, Jiang Zhu, and Sateesh Addepalli. “Fog Computing and Its Role in the Internet of Things”. In: Proceedings of the First Edition of the MCC Workshop on Mobile Cloud Computing. MCC ’12. Helsinki, Finland: ACM, 2012, pp. 13–16. [36] Vladimir Braverman, Rafail Ostrovsky, and Dan Vilenchik. “How Hard Is Counting Triangles in the Streaming Model?” In: Automata, Languages, and Programming: 40th International Colloquium, ICALP 2013, Riga, Latvia, July 8-12, 2013, Proceedings, Part

I. Ed. by Fedor V. Fomin, Rūsiņš Freivalds, Marta Kwiatkowska, and David Peleg. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 244–254. [37] Coen Bron and Joep Kerbosch. “Algorithm 457: finding all cliques of an undirected graph”. In: Communications of the ACM 16.9 (1973), pp. 575–577.

[38] Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry Li, Mark Marchukov, Dmitri Petrov, Lovro Puzar, Yee Jiun Song, and Venkat Venkataramani. “TAO: Facebook’s Distributed Data Store for the Social Graph”. In: Presented as part of the 2013 USENIX Annual Technical Conference (USENIX ATC 13). San Jose, CA: USENIX, 2013, pp. 49–60. [39] Aydin Buluç and John R. Gilbert. “The Combinatorial BLAS: design, implementa- tion, and applications”. In: IJHPCA 25.4 (2011), pp. 496–509. [40] Luciana S. Buriol, Gereon Frahling, Stefano Leonardi, Alberto Marchetti-Spaccamela, and Christian Sohler. “Counting Triangles in Data Streams”. In: Proceedings of the Twenty-fifth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. PODS ’06. Chicago, IL, USA: ACM, 2006, pp. 253–262. [41] Zhuhua Cai, Dionysios Logothetis, and Georgos Siganos. “Facilitating Real-time Graph Mining”. In: Proceedings of the Fourth International Workshop on Cloud Data Management. CloudDB ’12. Maui, Hawaii, USA: ACM, 2012, pp. 1–8. [42] Richard Caruana. “Multitask Learning: A Knowledge-Based Source of Induc- tive Bias”. In: Proceedings of the Tenth International Conference on Machine Learning. Morgan Kaufmann, 1993, pp. 41–48. [43] Surajit Chaudhuri, Gautam Das, and Vivek Narasayya. “Optimized Stratified Sampling for Approximate Query Processing”. In: ACM Trans. Database Syst. 32.2 (June 2007). [44] Rong Chen, Jiaxin Shi, Yanzhe Chen, and Haibo Chen. “PowerLyra: Differentiated Graph Computation and Partitioning on Skewed Graphs”. In: Proceedings of the Tenth European Conference on Computer Systems. EuroSys ’15. Bordeaux, France: ACM, 2015, 1:1–1:15. [45] Raymond Cheng, Ji Hong, Aapo Kyrola, Youshan Miao, Xuetian Weng, Ming Wu, Fan Yang, Lidong Zhou, Feng Zhao, and Enhong Chen. “Kineograph: Taking the Pulse of a Fast-changing and Connected World”. In: Proceedings of the 7th ACM European Conference on Computer Systems. EuroSys ’12. Bern, Switzerland: ACM, 2012, pp. 85–98. [46] Avery Ching, Sergey Edunov, Maja Kabiljo, Dionysios Logothetis, and Sambavi Muthukrishnan. “One Trillion Edges: Graph Processing at Facebook-scale”. In: Proc. VLDB Endow. 8.12 (Aug. 2015), pp. 1804–1815. BIBLIOGRAPHY 135

[47] Ira Cohen, Moises Goldszmidt, Terence Kelly, Julie Symons, and Jeffrey S. Chase. “Correlating Instrumentation Data to System States: A Building Block for Auto- mated Diagnosis and Control”. In: Proceedings of the 6th Conference on Symposium on Opearting Systems Design & Implementation - Volume 6. OSDI’04. San Francisco, CA: USENIX Association, 2004, pp. 16–16. [48] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, John Gerth, Justin Talbot, Khaled Elmeleegy, and Russell Sears. “Online Aggregation and Con- tinuous Query Support in MapReduce”. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. SIGMOD ’10. Indianapolis, Indiana, USA: ACM, 2010, pp. 1115–1118. [49] Graham Cormode, Minos N. Garofalakis, Peter J. Haas, and Chris Jermaine. “Syn- opses for Massive Data: Samples, Histograms, Wavelets, Sketches”. In: Foundations and Trends in Databases 4.1-3 (2012), pp. 1–294. [50] Chuck Cranor, Theodore Johnson, Oliver Spataschek, and Vladislav Shkapenyuk. “Gigascope: a stream database for network applications”. In: Proceedings of the 2003 ACM SIGMOD international conference on Management of data. SIGMOD ’03. San Diego, California: ACM, 2003, pp. 647–651. [51] Atish Das Sarma, Sreenivas Gollapudi, and Rina Panigrahy. “Estimating PageRank on Graph Streams”. In: Proceedings of the Twenty-seventh ACM SIGMOD-SIGACT- SIGART Symposium on Principles of Database Systems. PODS ’08. Vancouver, Canada: ACM, 2008, pp. 69–78. [52] Jeffrey Dean and Sanjay Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters”. In: Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6. OSDI’04. San Francisco, CA: USENIX Association, 2004, pp. 10–10. [53] Thomas G Dietterich. “Ensemble methods in machine learning”. In: Multiple classifier systems. Springer, 2000, pp. 1–15. [54] Differential Dataflow Rust Implementation. https://github.com/frankmcsherry/ differential-dataflow. [55] James R Driscoll, Neil Sarnak, Daniel Dominic Sleator, and Robert Endre Tarjan. “Making data structures persistent”. In: Proceedings of the eighteenth annual ACM symposium on Theory of computing. ACM. 1986, pp. 109–121. [56] Mohammed Elseidy, Ehab Abdelhamid, Spiros Skiadopoulos, and Panos Kalnis. “GraMi: Frequent Subgraph and Pattern Mining in a Single Large Graph”. In: Proc. VLDB Endow. (2014). BIBLIOGRAPHY 136

[57] Mohammed Elseidy, Ehab Abdelhamid, Spiros Skiadopoulos, and Panos Kalnis. “GraMi: Frequent Subgraph and Pattern Mining in a Single Large Graph”. In: Proc. VLDB Endow. 7.7 (Mar. 2014), pp. 517–528. [58] Ericsson. Ericsson RAN Analyzer. http : / / www . ericsson . com / ourportfolio / products/ran-analyzer. 2014. [59] Ericsson. Ericsson RAN Analyzer Overview. http://www.optxview.com/Optimi_ Ericsson/RANAnalyser.pdf. 2012. [60] Theodoros Evgeniou and Massimiliano Pontil. “Regularized Multi–task Learning”. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’04. Seattle, WA, USA: ACM, 2004, pp. 109–117. [61] Wenfei Fan, Chunming Hu, and Chao Tian. “Incremental Graph Computations: Doable and Undoable”. In: Proceedings of the 2017 ACM International Conference on Management of Data. SIGMOD ’17. Chicago, Illinois, USA: ACM, 2017, pp. 155–169. [62] Wenfei Fan, Jingbo Xu, Yinghui Wu, Wenyuan Yu, Jiaxin Jiang, Zeyu Zheng, Bohan Zhang, Yang Cao, and Chao Tian. “Parallelizing Sequential Graph Computations”. In: Proceedings of the 2017 ACM International Conference on Management of Data. SIGMOD ’17. Chicago, Illinois, USA: ACM, 2017, pp. 495–510. [63] Santo Fortunato. “Community detection in graphs”. In: Physics reports 486.3 (2010), pp. 75–174. [64] Jerome H Friedman. “Greedy function approximation: a gradient boosting ma- chine”. In: Annals of statistics (2001), pp. 1189–1232. [65] João Gama, Indre˙ Žliobaite,˙ Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. “A Survey on Concept Drift Adaptation”. In: ACM Comput. Surv. 46.4 (Mar. 2014), 44:1–44:37. [66] P. Gao, M. Zhang, K. Chen, Y. Wu, and W. Zheng. “High Performance Graph Processing with Locality Oriented Design”. In: IEEE Transactions on Computers 66.7 (July 2017), pp. 1261–1267. [67] Inigo Goiri, Ricardo Bianchini, Santosh Nagarakatte, and Thu D. Nguyen. “Approx- Hadoop: Bringing Approximations to MapReduce Frameworks”. In: Proceedings of the Twentieth International Conference on Architectural Support for Programming Lan- guages and Operating Systems. ASPLOS ’15. Istanbul, Turkey: ACM, 2015, pp. 383– 397. BIBLIOGRAPHY 137

[68] Neil Zhenqiang Gong, Wenchang Xu, Ling Huang, Prateek Mittal, Emil Stefanov, Vyas Sekar, and Dawn Song. “Evolution of Social-attribute Networks: Measure- ments, Modeling, and Implications Using Google+”. In: Proceedings of the 2012 Internet Measurement Conference. IMC ’12. Boston, Massachusetts, USA: ACM, 2012, pp. 131–144. [69] Pinghua Gong, Jieping Ye, and Changshui Zhang. “Robust Multi-task Feature Learning”. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’12. Beijing, China: ACM, 2012, pp. 895– 903. [70] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. “PowerGraph: Distributed Graph-parallel Computation on Natural Graphs”. In: Proceedings of the 10th USENIX Conference on Operating Systems Design and Imple- mentation. OSDI’12. Hollywood, CA, USA: USENIX Association, 2012, pp. 17– 30. [71] Joseph Gonzalez, Reynold Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, and Ion Stoica. “GraphX: Graph Processing in a Distributed Dataflow Framework”. In: 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). Broomfield, CO: USENIX Association, Oct. 2014. [72] Graph Data Mining with Arabesque. http://arabesque.io. [73] Graph500 Benchmarks. http://www.graph500.org. [74] Samuel Grossman, Heiner Litz, and Christos Kozyrakis. “Making Pull-based Graph Processing Performant”. In: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. PPoPP ’18. Vienna, Austria: ACM, 2018, pp. 246–260. [75] Ashish Gupta, Inderpal Singh Mumick, and V. S. Subrahmanian. “Maintaining Views Incrementally”. In: Proceedings of the 1993 ACM SIGMOD International Con- ference on Management of Data. SIGMOD ’93. Washington, D.C., USA: ACM, 1993, pp. 157–166. [76] Wentao Han, Youshan Miao, Kaiwei Li, Ming Wu, Fan Yang, Lidong Zhou, Vijayan Prabhakaran, Wenguang Chen, and Enhong Chen. “Chronos: A Graph Engine for Temporal Graph Analysis”. In: Proceedings of the Ninth European Conference on Computer Systems. EuroSys ’14. Amsterdam, The Netherlands: ACM, 2014, 1:1– 1:14. BIBLIOGRAPHY 138

[77] Nikhil Handigol, Brandon Heller, Vimalkumar Jeyakumar, David Mazières, and Nick McKeown. “I Know What Your Packet Did Last Hop: Using Packet His- tories to Troubleshoot Networks”. In: Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation. NSDI’14. Seattle, WA: USENIX Association, 2014, pp. 71–85. [78] Paul Harris, Chris Brunsdon, and Martin Charlton. “Geographically weighted principal components analysis”. In: International Journal of Geographical Information Science 25.10 (2011), pp. 1717–1736. [79] Chi-Yao Hong, Matthew Caesar, Nick Duffield, and Jia Wang. “Tiresias: Online Anomaly Detection for Hierarchical Operational Network Data”. In: Proceedings of the 2012 IEEE 32Nd International Conference on Systems. ICDCS ’12. Washington, DC, USA: IEEE Computer Society, 2012, pp. 173–182. [80] Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, and Onur Mutlu. “Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds”. In: 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). Boston, MA: USENIX Association, 2017, pp. 629–647. [81] Junxian Huang, Feng Qian, Yihua Guo, Yuanyuan Zhou, Qiang Xu, Z. Morley Mao, Subhabrata Sen, and Oliver Spatscheck. “An In-depth Study of LTE: Effect of Network Protocol and Application Behavior on Performance”. In: Proceedings of the ACM SIGCOMM 2013 Conference on SIGCOMM. SIGCOMM ’13. Hong Kong, China: ACM, 2013, pp. 363–374. [82] Chien-Chun Hung, Leana Golubchik, and Minlan Yu. “Scheduling Jobs Across Geo-distributed Datacenters”. In: Proceedings of the Sixth ACM Symposium on Cloud Computing. SoCC ’15. Kohala Coast, Hawaii: ACM, 2015, pp. 111–124. [83] Intel CEO on the Future of Automated Driving. https : / / newsroom . intel . com / editorials/krzanich-the-future-of-automated-driving/. [84] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. “Dryad: Distributed Data-parallel Programs from Sequential Building Blocks”. In: Pro- ceedings of the 2Nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007. EuroSys ’07. Lisbon, Portugal: ACM, 2007, pp. 59–72. [85] Anand Padmanabha Iyer, Li Erran Li, Mosharaf Chowdhury, and Ion Stoica. “Mitigating the Latency-Accuracy Trade-off in Mobile Data Analytics Systems”. In: ACM MobiCom. 2018. BIBLIOGRAPHY 139

[86] Anand Padmanabha Iyer, Li Erran Li, Tathagata Das, and Ion Stoica. “Time- evolving Graph Processing at Scale”. In: Proceedings of the Fourth International Workshop on Graph Data Management Experiences and Systems. GRADES ’16. Red- wood Shores, California: ACM, 2016, 5:1–5:6. [87] Anand Padmanabha Iyer, Li Erran Li, and Ion Stoica. “Automating Diagnosis of Cellular Radio Access Network Problems”. In: Proceedings of the 23rd Annual Inter- national Conference on Mobile Computing and Networking. MobiCom ’17. Snowbird, Utah, USA: ACM, 2017, pp. 79–87. [88] Anand Padmanabha Iyer, Zaoxing Liu, Xin Jin, Shivaram Venkataraman, Vladimir Braverman, and Ion Stoica. “ASAP: Fast, Approximate Graph Pattern Mining at Scale”. In: 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). Carlsbad, CA: USENIX Association, Oct. 2018, pp. 745–761. [89] Anand Padmanabha Iyer, Zaoxing Liu, Xin Jin, Shivaram Venkataraman, Vladimir Braverman, and Ion Stoica. “Towards Fast and Scalable Graph Pattern Mining”. In: 10th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 18). Boston, MA: USENIX Association, July 2018. [90] Anand Padmanabha Iyer, Aurojit Panda, Mosharaf Chowdhury, Aditya Akella, Scott Shenker, and Ion Stoica. “Monarch: Gaining Command on Geo-Distributed Graph Analytics”. In: USENIX HotCloud. 2018. [91] Anand Iyer, Li Erran Li, and Ion Stoica. “CellIQ : Real-Time Cellular Network Analytics at Scale”. In: Proceedings of the 12th USENIX conference on Networked Systems Design and Implementation. NSDI’15. Oakland, CA: USENIX Association, 2015. [92] Anand Iyer, Li Erran Li, and Ion Stoica. “CellIQ : Real-Time Cellular Network Analytics at Scale”. In: 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15). Oakland, CA: USENIX Association, May 2015, pp. 309– 322. [93] Madhav Jha, C. Seshadhri, and Ali Pinar. “A Space-Efficient Streaming Algorithm for Estimating Transitivity and Triangle Counts Using the Birthday Paradox”. In: ACM Trans. Knowl. Discov. Data 9.3 (Feb. 2015), 15:1–15:21. [94] Haiqing Jiang, Yaogong Wang, Kyunghan Lee, and Injong Rhee. “Tackling Bufferbloat in 3G/4G Networks”. In: Proceedings of the 2012 ACM Conference on Internet Mea- surement Conference. IMC ’12. Boston, Massachusetts, USA: ACM, 2012, pp. 329– 342. BIBLIOGRAPHY 140

[95] Xiaoen Ju, Dan Williams, Hani Jamjoom, and Kang G. Shin. “Version Traveler: Fast and Memory-Efficient Version Switching in Graph Processing Systems”. In: 2016 USENIX Annual Technical Conference (USENIX ATC 16). Denver, CO: USENIX Association, 2016, pp. 523–536. [96] Srikanth Kandula, Ratul Mahajan, Patrick Verkaik, Sharad Agarwal, Jitendra Padhye, and Paramvir Bahl. “Detailed diagnosis in enterprise networks”. In: Pro- ceedings of the ACM SIGCOMM 2009 conference on Data communication. SIGCOMM ’09. Barcelona, Spain: ACM, 2009, pp. 243–254. [97] Gunjan Khanna, Mike Yu Cheng, Padma Varadharajan, Saurabh Bagchi, Miguel P. Correia, and Paulo J. Veríssimo. “Automated Rule-Based Diagnosis Through a Distributed Monitor System”. In: IEEE Trans. Dependable Secur. Comput. 4.4 (Oct. 2007), pp. 266–279. [98] U. Khurana and A. Deshpande. “Efficient snapshot retrieval over historical graph data”. In: Data Engineering (ICDE), 2013 IEEE 29th International Conference on. Apr. 2013, pp. 997–1008. [99] Udayan Khurana and Amol Deshpande. “Storing and Analyzing Historical Graph Data at Scale”. In: CoRR abs/1509.08960 (2015). [100] Seyoung Kim and Eric P. Xing. “Tree-Guided Group Lasso for Multi-Task Re- gression with Structured Sparsity”. In: Intenational Conference on Machine Learning (ICML) (2010). [101] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. The MIT Press, 2009. [102] Tim Kraska, Ameet Talwalkar, John C. Duchi, Rean Griffith, Michael J. Franklin, and Michael I. Jordan. “MLbase: A Distributed Machine-learning System”. In: CIDR. 2013. [103] WJ Krzanowski. “Between-groups comparison of principal components”. In: Jour- nal of the American Statistical Association 74.367 (1979), pp. 703–707. [104] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. “What is Twitter, a social network or a news media?” In: WWW ’10: Proceedings of the 19th international conference on World wide web. Raleigh, North Carolina, USA: ACM, 2010, pp. 591– 600. [105] Aapo Kyrola, Guy Blelloch, and Carlos Guestrin. “GraphChi: Large-Scale Graph Computation on Just a PC”. In: Presented as part of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12). Hollywood, CA: USENIX, 2012, pp. 31–46. BIBLIOGRAPHY 141

[106] Anukool Lakhina, Mark Crovella, and Christophe Diot. “Diagnosing Network- wide Traffic Anomalies”. In: Proceedings of the 2004 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications. SIGCOMM ’04. Portland, Oregon, USA: ACM, 2004, pp. 219–230. [107] Silvio Lattanzi, Benjamin Moseley, Siddharth Suri, and Sergei Vassilvitskii. “Filter- ing: A Method for Solving Graph Problems in MapReduce”. In: Proceedings of the Twenty-third Annual ACM Symposium on Parallelism in Algorithms and Architectures. SPAA ’11. San Jose, California, USA: ACM, 2011, pp. 85–94. [108] Viktor Leis, Alfons Kemper, and Thomas Neumann. “The Adaptive Radix Tree: ARTful Indexing for Main-memory Databases”. In: Proceedings of the 2013 IEEE International Conference on Data Engineering (ICDE 2013). ICDE ’13. Washington, DC, USA: IEEE Computer Society, 2013, pp. 38–49. [109] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data. June 2014. [110] Yan Liu, Jing Zhang, M. Jiang, D. Raymer, and J. Strassner. “A model-based approach to adding autonomic capabilities to network fault management system”. In: Network Operations and Management Symposium, 2008. NOMS 2008. IEEE. Apr. 2008, pp. 859–862. [111] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. “GraphLab: A New Framework For Parallel Machine Learning.” In: UAI. Ed. by Peter Grünwald and Peter Spirtes. AUAI Press, 2010, pp. 340–349. [112] Steffen Maass, Changwoo Min, Sanidhya Kashyap, Woon-Hak Kang, Mohan Kumar, and Taesoo Kim. “Mosaic: Processing a Trillion-Edge Graph on a Single Machine”. In: Proceedings of the Twelfth European Conference on Computer Systems, EuroSys 2017, Belgrade, Serbia, April 23-26, 2017. 2017, pp. 527–543. [113] Steffen Maass, Changwoo Min, Sanidhya Kashyap, Woonhak Kang, Mohan Kumar, and Taesoo Kim. “Mosaic: Processing a Trillion-Edge Graph on a Single Machine”. In: Proceedings of the Twelfth European Conference on Computer Systems. EuroSys ’17. Belgrade, Serbia: ACM, 2017, pp. 527–543. [114] P. Macko, V. J. Marathe, D. W. Margo, and M. I. Seltzer. “LLAMA: Efficient graph analytics using Large Multiversioned Arrays”. In: 2015 IEEE 31st International Conference on Data Engineering. Apr. 2015, pp. 363–374. BIBLIOGRAPHY 142

[115] Grzegorz Malewicz, Matthew H. Austern, Aart J.C Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. “Pregel: A System for Large-scale Graph Processing”. In: Proceedings of the 2010 ACM SIGMOD International Confer- ence on Management of Data. SIGMOD ’10. Indianapolis, Indiana, USA: ACM, 2010, pp. 135–146. [116] Jasmina Malicevic, Baptiste Lepers, and Willy Zwaenepoel. “Everything you always wanted to know about multicore graph processing but were afraid to ask”. In: 2017 USENIX Annual Technical Conference (USENIX ATC 17). Santa Clara, CA: USENIX Association, 2017, pp. 631–643. [117] Mugilan Mariappan and Keval Vora. “GraphBolt: Dependency-Driven Synchronous Processing of Streaming Graphs”. In: Proceedings of the Fourteenth EuroSys Confer- ence 2019. EuroSys ’19. Dresden, Germany: ACM, 2019, 25:1–25:16. [118] Frank McSherry, Michael Isard, and Derek G. Murray. “Scalability! But at what COST?” In: 15th Workshop on Hot Topics in Operating Systems (HotOS XV). Kartause Ittingen, Switzerland: USENIX Association, May 2015. [119] Frank McSherry, Derek Gordon Murray, Rebecca Isaacs, and Michael Isard. “Dif- ferential Dataflow”. In: CIDR 2013, Sixth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 6-9, 2013, Online Proceedings. 2013. [120] Youshan Miao, Wentao Han, Kaiwei Li, Ming Wu, Fan Yang, Lidong Zhou, Vijayan Prabhakaran, Enhong Chen, and Wenguang Chen. “ImmortalGraph: A System for Storage and Analysis of Temporal Graphs”. In: Trans. Storage 11.3 (July 2015), 14:1–14:34. [121] Microsoft Naiad Team. GraphLINQ: A graph library for Naiad. http://bigdataatsvc. wordpress.com/2014/05/08/graphlinq-a-graph-library-for-naiad/. 2014. [122] Ron Milo, Shai Shen-Orr, Shalev Itzkovitz, Nadav Kashtan, Dmitri Chklovskii, and Uri Alon. “Network motifs: simple building blocks of complex networks”. In: Science 298.5594 (2002), pp. 824–827. [123] Jayanta Mondal and Amol Deshpande. “EAGr: Supporting Continuous Ego-centric Aggregate Queries over Large Dynamic Graphs”. In: CoRR abs/1404.6570 (2014). [124] Jayanta Mondal and Amol Deshpande. “Stream Querying and Reasoning on Social Data”. In: Encyclopedia of Social Network Analysis and Mining. 2014, pp. 2063–2075. [125] Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. “Naiad: A Timely Dataflow System”. In: Proceedings of the Twenty- Fourth ACM Symposium on Operating Systems Principles. SOSP ’13. Farminton, Penn- sylvania: ACM, 2013, pp. 439–455. BIBLIOGRAPHY 143

[126] Neo4J. http://www.neo4j.com. (accessed March 2019). [127] NetworkX. http://networkx.github.io/. [128] Adam J. Oliner, Anand P. Iyer, Ion Stoica, Eemil Lagerspetz, and Sasu Tarkoma. “Carat: Collaborative Energy Diagnosis for Mobile Devices”. In: Proceedings of the 11th ACM Conference on Embedded Networked Sensor Systems. SenSys ’13. Roma, Italy: ACM, 2013, 10:1–10:14. [129] Anand Padmanabha Iyer, Li Erran Li, Tathagata Das, and Ion Stoica. “Time- evolving graph processing at scale”. In: Proceedings of the Fourth International Work- shop on Graph Data Management Experiences and Systems. ACM. 2016, p. 5. [130] Anand Padmanabha Iyer, Aurojit Panda, Shivaram Venkataraman, Mosharaf Chowdhury, Aditya Akella, Scott Shenker, and Ion Stoica. “Bridging the GAP: Towards Approximate Graph Analytics”. In: GRADES-NDA. Houston, Texas, 2018. [131] Anand Padmanabha Iyer, Qifan Pu, Kishan Patel, Joseph E. Gonzalez, and Ion Stoica. “Tegra: Efficient Ad-Hoc Analytics on Time-evolving Graphs (Under Sub- mission)”. In: 2018. [132] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66. Previous number = SIDL-WP-1999-0120. Stanford InfoLab, Nov. 1999. [133] Rasmus Pagh and Charalampos E. Tsourakakis. “Colorful Triangle Counting and a MapReduce Implementation”. In: CoRR abs/1103.6073 (2011). [134] Yan Pan, Rongkai Xia, Jian Yin, and Ning Liu. “A Divide-and-Conquer Method for Scalable Robust Multitask Learning”. In: Neural Networks and Learning Systems, IEEE Transactions on 26.12 (Dec. 2015), pp. 3163–3175. [135] A. Pavan, Kanat Tangwongsan, Srikanta Tirthapura, and Kun-Lung Wu. “Counting and Sampling Triangles from a Graph Stream”. In: Proc. VLDB Endow. 6.14 (Sept. 2013), pp. 1870–1881. [136] K. Pearson. “On lines and planes of closest fit to systems of points in space”. In: Philosophical Magazine 2.6 (1901), pp. 559–572. [137] Persistent Adaptive Radix Tree. https://github.com/ankurdave/part. [138] Vijayan Prabhakaran, Ming Wu, Xuetian Weng, Frank McSherry, Lidong Zhou, and Maya Haradasan. “Managing Large Graphs on Multi-Cores with Graph Awareness”. In: Presented as part of the 2012 USENIX Annual Technical Conference (USENIX ATC 12). Boston, MA: USENIX, 2012, pp. 41–52. BIBLIOGRAPHY 144

[139] N. Pržulj, D. G. Corneil, and I. Jurisica. “Modeling Interactome: Scale-free or Geometric?” In: Bioinformatics 20.18 (Dec. 2004), pp. 3508–3515. [140] Qifan Pu, Ganesh Ananthanarayanan, Peter Bodik, Srikanth Kandula, Aditya Akella, Paramvir Bahl, and Ion Stoica. “Low Latency Geo-distributed Data Ana- lytics”. In: Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication. SIGCOMM ’15. London, United Kingdom: ACM, 2015, pp. 421– 434. [141] PyTorch. https://pytorch.org/. [142] Feng Qian, Zhaoguang Wang, Alexandre Gerber, Zhuoqing Mao, Subhabrata Sen, and Oliver Spatscheck. “Profiling Resource Usage for Mobile Applications: A Cross-layer Approach”. In: Proceedings of the 9th International Conference on Mobile Systems, Applications, and Services. MobiSys ’11. Bethesda, Maryland, USA: ACM, 2011, pp. 321–334. [143] Abdul Quamar, Amol Deshpande, and Jimmy Lin. “NScale: Neighborhood-centric Large-scale Graph Analytics in the Cloud”. In: The VLDB Journal 25.2 (Apr. 2016), pp. 125–150. [144] R. https://www.r-projec.org/. [145] Sudarshan Rao. “Operational Fault Detection in Cellular Wireless Base-stations”. In: IEEE Trans. on Netw. and Serv. Manag. 3.2 (Apr. 2006), pp. 1–11. [146] Pedro Ribeiro and Fernando Silva. “G-Tries: A Data Structure for Storing and Finding Subgraphs”. In: Data Min. Knowl. Discov. 28.2 (Mar. 2014), pp. 337–377. [147] RISELab Sponsors. https://rise.cs.berkeley.edu/sponsors/. 2018. [148] Ian Robinson, Jim Webber, and Emil Eifrem. Graph Databases. O’Reilly Media, Inc., 2013. [149] Sanae Rosen, Haokun Luo, Qi Alfred Chen, Z. Morley Mao, Jie Hui, Aaron Drake, and Kevin Lau. “Discovering Fine-grained RRC State Dynamics and Performance Impacts in Cellular Networks”. In: Proceedings of the 20th Annual International Conference on Mobile Computing and Networking. MobiCom ’14. Maui, Hawaii, USA: ACM, 2014, pp. 177–188. [150] Amitabha Roy, Laurent Bindschaedler, Jasmina Malicevic, and Willy Zwaenepoel. “Chaos: Scale-out Graph Processing from Secondary Storage”. In: Proceedings of the 25th Symposium on Operating Systems Principles. SOSP ’15. Monterey, California: ACM, 2015, pp. 410–424. BIBLIOGRAPHY 145

[151] Amitabha Roy, Ivo Mihailovic, and Willy Zwaenepoel. “X-Stream: Edge-centric Graph Processing Using Streaming Partitions”. In: Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. SOSP ’13. Farminton, Pennsylva- nia: ACM, 2013, pp. 472–488. [152] Ahmed M Safwat and Hussein Mouftah. “4G network technologies for mobile telecommunications”. In: Network, IEEE 19.5 (2005), pp. 3–4. [153] Siddhartha Sahu, Amine Mhedhbi, Semih Salihoglu, Jimmy Lin, and M. Tamer Özsu. “The Ubiquity of Large Graphs and Surprising Challenges of Graph Pro- cessing”. In: Proc. VLDB Endow. 11.4 (Dec. 2017), pp. 420–431. [154] Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Jiwon Seo, Jongsoo Park, M. Amber Hassaan, Shubho Sengupta, Zhaoming Yin, and Pradeep Dubey. “Navigating the Maze of Graph Analytics Frameworks Using Massive Graph Datasets”. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. SIGMOD ’14. Snowbird, Utah, USA: ACM, 2014, pp. 979– 990. [155] Scikit-Learn. https://scikit-learn.org/. [156] Stefania Sesia, Issam Toufik, and Matthew Baker. LTE: the UMTS long term evolution. Wiley Online Library, 2009. [157] Muhammad Zubair Shafiq, Jeffrey Erman, Lusheng Ji, Alex X. Liu, Jeffrey Pang, and Jia Wang. “Understanding the Impact of Network Dynamics on Mobile Video User Engagement”. In: The 2014 ACM International Conference on Measurement and Modeling of Computer Systems. SIGMETRICS ’14. Austin, Texas, USA: ACM, 2014, pp. 367–379. [158] Muhammad Zubair Shafiq, Lusheng Ji, Alex X. Liu, Jeffrey Pang, Shobha Venkatara- man, and Jia Wang. “A First Look at Cellular Network Performance During Crowded Events”. In: Proceedings of the ACM SIGMETRICS/International Conference on Measurement and Modeling of Computer Systems. SIGMETRICS ’13. Pittsburgh, PA, USA: ACM, 2013, pp. 17–28. [159] Shai Shalev-Shwartz and Ambuj Tewari. “Stochastic Methods for L1-regularized Loss Minimization”. In: J. Mach. Learn. Res. 12 (July 2011), pp. 1865–1892. [160] Clint Smith. 3G wireless networks. McGraw-Hill, Inc., 2006. [161] Craig AN Soules, Garth R Goodson, John D Strunk, and Gregory R Ganger. “Metadata Efficiency in a Comprehensive Versioning File System”. In: (2002). BIBLIOGRAPHY 146

[162] Evan R. Sparks, Ameet Talwalkar, Virginia Smith, Jey Kottalam, Xinghao Pan, Joseph E. Gonzalez, Michael J. Franklin, Michael I. Jordan, and Tim Kraska. “MLI: An API for Distributed Machine Learning”. In: 2013 IEEE 13th International Conference on Data Mining, Dallas, TX, USA, December 7-10, 2013. Ed. by Hui Xiong, George Karypis, Bhavani M. Thuraisingham, Diane J. Cook, and Xindong Wu. IEEE Computer Society, 2013, pp. 1187–1192. [163] Daniel A. Spielman and Shang-Hua Teng. “Spectral Sparsification of Graphs”. In: CoRR abs/0808.4134 (2008). [164] Stanford Large Network Dataset Collection. https://snap.stanford.edu/. [165] Technical Specification Group. 3GPP Specifications. http://www.3gpp.org/specifications. [166] Carlos H. C. Teixeira, Alexandre J. Fonseca, Marco Serafini, Georgos Siganos, Mohammed J. Zaki, and Ashraf Aboulnaga. “Arabesque: A System for Distributed Graph Mining”. In: Proceedings of the 25th Symposium on Operating Systems Princi- ples. SOSP ’15. Monterey, California: ACM, 2015, pp. 425–440. [167] Nawanol Theera-Ampornpunt, Saurabh Bagchi, Kaustubh R. Joshi, and Rajesh K. Panta. “Using Big Data for More Dependability: A Cellular Network Tale”. In: Proceedings of the 9th Workshop on Hot Topics in Dependable Systems. HotDep ’13. Farmington, Pennsylvania: ACM, 2013, 2:1–2:5. [168] Sebastian Thrun. “Is Learning The n-th Thing Any Easier Than Learning The First?” In: Advances in Neural Information Processing Systems. The MIT Press, 1996, pp. 640–646. [169] Robert Tibshirani. “Regression Shrinkage and Selection Via the Lasso”. In: Journal of the Royal Statistical Society, Series B 58 (1994), pp. 267–288. [170] Titan Distributed Graph Database. http://thinkaurelius.github.io/titan/. [171] Charalampos E. Tsourakakis, U. Kang, Gary L. Miller, and Christos Faloutsos. “DOULION: Counting Triangles in Massive Graphs with a Coin”. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’09. Paris, France: ACM, 2009, pp. 837–846. [172] Guan-Hua Tu, Yuanjie Li, Chunyi Peng, Chi-Yu Li, Hongyi Wang, and Songwu Lu. “Control-plane Protocol Interactions in Cellular Networks”. In: Proceedings of the 2014 ACM Conference on SIGCOMM. SIGCOMM ’14. Chicago, Illinois, USA: ACM, 2014, pp. 223–234. BIBLIOGRAPHY 147

[173] Shivaram Venkataraman, Zongheng Yang, Michael Franklin, Benjamin Recht, and Ion Stoica. “Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics”. In: 13th USENIX Symposium on Networked Systems Design and Imple- mentation (NSDI 16). Santa Clara, CA: USENIX Association, 2016, pp. 363–378. [174] Raajay Viswanathan, Ganesh Ananthanarayanan, and Aditya Akella. “CLARINET: WAN-Aware Optimization for Analytics Queries”. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). GA: USENIX Association, 2016, pp. 435–450. [175] Keval Vora, Rajiv Gupta, and Guoqing Xu. “KickStarter: Fast and Accurate Com- putations on Streaming Graphs via Trimmed Approximations”. In: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems. ASPLOS ’17. Xi’an, China: ACM, 2017, pp. 237– 251. [176] Ashish Vulimiri, Carlo Curino, Philip Brighten Godfrey, Thomas Jungblut, Kon- stantinos Karanasos, Jitendra Padhye, and George Varghese. “WANalytics: Geo- Distributed Analytics for a Data Intensive World”. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. SIGMOD ’15. Melbourne, Victoria, Australia: ACM, 2015, pp. 1087–1092. [177] Guozhang Wang, Wenlei Xie, Alan J. Demers, and Johannes Gehrke. “Asyn- chronous Large-Scale Graph Processing Made Easy.” In: CIDR. www.cidrdb.org, 2013. [178] Guozhang Wang, Wenlei Xie, Alan J Demers, and Johannes Gehrke. “Asyn- chronous Large-Scale Graph Processing Made Easy.” In: CIDR. 2013. [179] Helen J. Wang, John C. Platt, Yu Chen, Ruyun Zhang, and Yi-Min Wang. “Au- tomatic Misconfiguration Troubleshooting with Peerpressure”. In: Proceedings of the 6th Conference on Symposium on Opearting Systems Design & Implementation - Volume 6. OSDI’04. San Francisco, CA: USENIX Association, 2004, pp. 17–17. [180] Weaver: A scalable, fast, consistent graph store. http://weaver.systems. [181] Ming Wu and Rong Jin. “A Graph-based Framework for Relation Propagation and Its Application to Multi-label Learning”. In: Proceedings of the 29th Annual Interna- tional ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR ’06. Seattle, Washington, USA: ACM, 2006, pp. 717–718. BIBLIOGRAPHY 148

[182] Ming Wu, Fan Yang, Jilong Xue, Wencong Xiao, Youshan Miao, Lan Wei, Haoxiang Lin, Yafei Dai, and Lidong Zhou. “GraM: Scaling Graph Computation to the Trillions”. In: Proceedings of the Sixth ACM Symposium on Cloud Computing. SoCC ’15. Kohala Coast, Hawaii: ACM, 2015, pp. 408–421. [183] Ming Wu, Fan Yang, Jilong Xue, Wencong Xiao, Youshan Miao, Lan Wei, Haoxiang Lin, Yafei Dai, and Lidong Zhou. “GraM: Scaling Graph Computation to the Trillions”. In: Proceedings of the Sixth ACM Symposium on Cloud Computing. SoCC ’15. Kohala Coast, Hawaii: ACM, 2015, pp. 408–421. [184] Wencong Xiao, Jilong Xue, Youshan Miao, Zhen Li, Cheng Chen, Ming Wu, Wei Li, and Lidong Zhou. “Tux²: Distributed Graph Computation for Machine Learning”. In: 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). Boston, MA: USENIX Association, 2017, pp. 669–682. [185] Wenlei Xie, Guozhang Wang, David Bindel, Alan Demers, and Johannes Gehrke. “Fast Iterative Graph Computation with Block Updates”. In: Proc. VLDB Endow. 6.14 (Sept. 2013), pp. 2014–2025. [186] He Yan, A. Flavel, Zihui Ge, A. Gerber, D. Massey, C. Papadopoulos, H. Shah, and J. Yates. “Argus: End-to-end service anomaly detection and localization from an ISP’s point of view”. In: INFOCOM, 2012 Proceedings IEEE. 2012, pp. 2756–2760. [187] Xifeng Yan and Jiawei Han. “gspan: Graph-based substructure pattern mining”. In: Data Mining, 2002. ICDM 2003. Proceedings. 2002 IEEE International Conference on. IEEE. 2002, pp. 721–724. [188] Jaewon Yang and Jure Leskovec. “Defining and Evaluating Network Communities based on Ground-truth”. In: CoRR abs/1205.6233 (2012). [189] Kiyoung Yang and Cyrus Shahabi. “A PCA-based Similarity Measure for Mul- tivariate Time Series”. In: Proceedings of the 2nd ACM International Workshop on Multimedia Databases. MMDB ’04. Washington, DC, USA: ACM, 2004, pp. 65–74. [190] Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. “DryadLINQ: A System for General-purpose Distributed Data-parallel Computing Using a High-level Language”. In: Proceed- ings of the 8th USENIX Conference on Operating Systems Design and Implementation. OSDI’08. San Diego, California: USENIX Association, 2008, pp. 1–14. [191] Haibo Chen Yunhao Zhang Rong Chen. “Sub-millisecond Stateful Stream Querying over Fast-evolving Linked Data”. In: Proceedings of the 26th Symposium on Operating Systems Principles. SOSP ’17. Shanghai, China: ACM, 2017. BIBLIOGRAPHY 149

[192] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. “Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing”. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. NSDI’12. San Jose, CA: USENIX Association, 2012, pp. 2–2.
[193] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. “Spark: Cluster Computing with Working Sets”. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing. HotCloud’10. Boston, MA: USENIX Association, 2010, pp. 10–10.
[194] Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica. “Discretized Streams: Fault-tolerant Streaming Computation at Scale”. In: Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. SOSP ’13. Farminton, Pennsylvania: ACM, 2013, pp. 423–438.
[195] Ce Zhang, Arun Kumar, and Christopher Ré. “Materialization Optimizations for Feature Selection Workloads”. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. SIGMOD ’14. Snowbird, Utah, USA: ACM, 2014, pp. 265–276.
[196] M. Zhang, Y. Zhuo, C. Wang, M. Gao, Y. Wu, K. Chen, C. Kozyrakis, and X. Qian. “GraphP: Reducing Communication for PIM-Based Graph Processing with Efficient Data Partition”. In: 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). Feb. 2018, pp. 544–557.
[197] Mingxing Zhang, Yongwei Wu, Kang Chen, Xuehai Qian, Xue Li, and Weimin Zheng. “Exploring the Hidden Dimension in Graph Processing”. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). Savannah, GA: USENIX Association, 2016, pp. 285–300.
[198] Mingxing Zhang, Yongwei Wu, Kang Chen, Xuehai Qian, Xue Li, and Weimin Zheng. “Exploring the Hidden Dimension in Graph Processing”. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). Savannah, GA: USENIX Association, 2016, pp. 285–300.
[199] Mingxing Zhang, Yongwei Wu, Youwei Zhuo, Xuehai Qian, Chengying Huan, and Kang Chen. “Wonderland: A Novel Abstraction-Based Out-Of-Core Graph Processing System”. In: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems. ASPLOS ’18. Williamsburg, VA, USA: ACM, 2018, pp. 608–621.

[200] Mingxing Zhang, Yongwei Wu, Youwei Zhuo, Xuehai Qian, Chengying Huan, and Kang Chen. “Wonderland: A Novel Abstraction-Based Out-Of-Core Graph Processing System”. In: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems. ASPLOS ’18. Williamsburg, VA, USA: ACM, 2018, pp. 608–621.
[201] Y. Zhang, V. Kiriansky, C. Mendis, S. Amarasinghe, and M. Zaharia. “Making caches work for graph analytics”. In: 2017 IEEE International Conference on Big Data (Big Data). Dec. 2017, pp. 293–302.
[202] Alice X. Zheng, Jim Lloyd, and Eric Brewer. “Failure Diagnosis Using Decision Trees”. In: Proceedings of the First International Conference on Autonomic Computing. ICAC ’04. Washington, DC, USA: IEEE Computer Society, 2004, pp. 36–43.
[203] Xiaojin Zhu and Zoubin Ghahramani. “Learning from labeled and unlabeled data with label propagation”. In: (2002).
[204] Xiaowei Zhu, Wenguang Chen, Weimin Zheng, and Xiaosong Ma. “Gemini: A Computation-Centric Distributed Graph Processing System”. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). Savannah, GA: USENIX Association, 2016, pp. 301–316.