Secure, Expressive, and Debuggable Large-Scale Analytics

Ankur Dave

Electrical Engineering and Computer Sciences
University of California at Berkeley

Technical Report No. UCB/EECS-2020-143
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2020/EECS-2020-143.html

August 12, 2020

Copyright © 2020, by the author(s). All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

Secure, Expressive, and Debuggable Large-Scale Analytics

by

Ankur Dave

A dissertation submitted in partial satisfaction of the

requirements for the degree of

Doctor of Philosophy

in

Computer Science

in the

Graduate Division

of the

University of California, Berkeley

Committee in charge:

Professor Ion Stoica, Chair
Professor Raluca Ada Popa
Professor John Chuang

Summer 2020

Secure, Expressive, and Debuggable Large-Scale Analytics

Copyright 2020 by Ankur Dave

Abstract

Secure, Expressive, and Debuggable Large-Scale Analytics

by

Ankur Dave

Doctor of Philosophy in Computer Science

University of California, Berkeley

Professor Ion Stoica, Chair

Growing volumes of data collection, outsourced computing, and demand for complex analytics have led to the rise of big data analytics frameworks such as MapReduce and Apache Spark. However, these systems fall short in processing sensitive data, graph querying, and debugging. This dissertation addresses these remaining challenges in analytics by introducing three systems built on top of Spark: Oblivious Coopetitive Queries (OCQ), GraphFrames, and Arthur. OCQ focuses on the setting of coopetitive analytics, which refers to cooperation among competing parties to run queries over their joint data. OCQ is an efficient, general framework for oblivious coopetitive analytics using hardware enclaves. GraphFrames is an integrated system that lets users combine graph algorithms, pattern matching, and relational queries, each of which typically requires a specialized engine, and optimizes work across them. Arthur is a debugger for Apache Spark that provides a rich set of analysis tools at close to zero runtime overhead through selective replay of data flow applications. Together, these systems bring Apache Spark closer to the goal of a unified analytics platform that retains the flexibility, extensibility, and performance of relational systems.

To my family

Contents

Contents ii

List of Figures v

List of Tables viii

1 Introduction 1
1.1 Rise of Distributed Dataflow Systems ...... 1
1.2 Challenges in Large-Scale Analytics ...... 2
1.3 Dissertation Overview ...... 4

2 Oblivious Coopetitive Analytics Using Hardware Enclaves 5
2.1 Introduction ...... 5
2.2 Background ...... 8
2.2.1 Hardware enclaves ...... 8
2.2.2 Oblivious algorithms ...... 8
2.2.3 Spark SQL and Opaque ...... 9
2.3 Architecture ...... 10
2.3.1 Setup phase ...... 11
2.3.2 Query lifecycle ...... 12
2.4 Threat model and security guarantees ...... 12
2.4.1 Abstract enclave model ...... 12
2.4.2 Party threat model ...... 13
2.4.3 Security guarantees ...... 14
2.5 Query Planning ...... 15
2.5.1 Overview ...... 15
2.5.2 Algorithm ...... 17
2.5.3 Query planner rules ...... 17
2.5.4 Determining padding upper bounds ...... 21
2.5.5 Pre-specified padding bounds ...... 22
2.6 Coopetitive algorithms ...... 23
2.6.1 Mixed-sensitivity join ...... 23
2.6.2 Coopetitive aggregation ...... 24
2.6.3 Obliviousness proofs ...... 24
2.7 Implementation ...... 25
2.8 Evaluation ...... 26
2.8.1 Setup ...... 26
2.8.2 Comparison to other systems ...... 27
2.8.3 Overhead of security ...... 28
2.8.4 Benefit of having SGX at each site ...... 30
2.8.5 Benefit of schema-aware padding ...... 30
2.9 Related work ...... 32
2.9.1 Cryptographic approaches ...... 32
2.9.2 Hardware enclave approaches ...... 33
2.9.3 Unencrypted federated databases ...... 33
2.9.4 Differential privacy ...... 33
2.10 Summary ...... 34

3 GraphFrames: An Integrated API for Mixing Graph and Relational Queries 35
3.1 Introduction ...... 35
3.2 Motivation ...... 37
3.2.1 Use Cases ...... 37
3.2.2 Challenges and Requirements ...... 38
3.2.3 Need for a Unified System ...... 39
3.3 GraphFrame API ...... 39
3.3.1 DataFrame Background ...... 40
3.3.2 GraphFrame Data Model ...... 40
Graph Construction ...... 41
Edges, Vertices, Triplets, and Patterns ...... 42
View Creation ...... 43
Relational Operators ...... 43
Attribute-Based Partitioning ...... 44
User-defined Functions ...... 45
3.3.3 Generality of GraphFrames ...... 45
3.3.4 Spark Integration ...... 46
3.3.5 Putting It Together ...... 46
3.4 Query Optimization ...... 47
3.4.1 Query Planner ...... 48
3.4.2 View Selection via Query Planning ...... 51
3.4.3 Adaptive View Creation ...... 52
3.5 Implementation ...... 52
3.6 Evaluation ...... 53
3.6.1 Performance vs. Specialized Systems ...... 53
3.6.2 Impact of Views ...... 54
3.6.3 End-to-End Pipeline Performance ...... 55
3.6.4 Attribute-Based Partitioning Algorithm ...... 56
3.7 Discussion ...... 56
3.7.1 Execution on RDBMS Engines ...... 57
3.7.2 Physical Storage ...... 57
3.7.3 Additional Join Algorithms ...... 57
3.7.4 Dynamically Updating Graphs ...... 58
3.8 Related work ...... 58

4 Arthur: Rich Post-Facto Debugging for Production Analytics Applications 60
4.1 Introduction ...... 60
4.2 Target Environment ...... 62
4.3 Architecture ...... 63
4.4 Basic Features ...... 64
4.4.1 Data Flow Visualization ...... 64
4.4.2 Exploration of Intermediate Datasets ...... 64
4.4.3 Task Replay ...... 66
4.5 Record Tracing ...... 66
4.5.1 Mechanism ...... 67
4.5.2 Forward Tracing ...... 68
4.5.3 Backward Tracing ...... 68
4.6 Post-Hoc Instrumentation ...... 69
4.7 Implementation ...... 70
4.8 Evaluation ...... 72
4.8.1 Recording Overhead ...... 72
4.8.2 Replay Performance ...... 73
4.8.3 Applicability ...... 73
4.9 Discussion ...... 75
4.9.1 Limitations ...... 75
4.9.2 Extensions ...... 76
4.10 Related Work ...... 76

5 Conclusion 79
5.1 Highlights ...... 79
5.1.1 Oblivious Coopetitive Queries (OCQ) ...... 79
5.1.2 GraphFrames ...... 79
5.1.3 Arthur ...... 80
5.2 Future Work ...... 80

Bibliography 82

List of Figures

1.1 No existing system for large-scale analytics addresses all three identified needs, repre- sented as the vertices of the triangle in this figure. This dissertation introduces three systems built on Apache Spark and Apache Spark SQL to address these needs...... 3

2.1 OCQ executes approved federated queries using hardware enclaves. Its query planner and relational operators hide memory and network access patterns and sensitive cardinalities. 5 2.2 Illustration of column sort algorithm. Each column represents an encrypted partition. Column sort enables oblivious distributed sorting using four intra-partition sorts (steps 1, 3, 5, 7) and four data exchanges (2, 4, 6, 8) in fixed patterns. An attacker only sees exchanges of encrypted records. Since the pattern of exchanges is fixed, it cannot leak data contents...... 9 2.3 Architecture of OCQ. OCQ’s replicated federated planner executes operators on Opaque and Spark clusters at each party. Sensitive data never leaves its originating party in plaintext...... 10 2.4 Lifecycle of a query. Queries and sensitivity annotations are submitted to OCQ’s federated planner and optimizer. The resulting federated plan contains operators running at different locations across the federation, and satisfies all parties’ sensitivity annotations. The plan executes securely, and the user only learns the result and cannot access sensitive intermediate relations. Stacked blocks indicate that a component is present at each party...... 11 2.5 (a) Bitonic merge network for 8 elements. (b) Speedup from mixed-sensitivity join compared to standard oblivious join for two relations of equal size. The x axis indicates the size of each relation in number of records. For small joins, other costs dominate, but once each relation contains more than 105 records, the mixed-sensitivity join shows a speedup of up to 2.5× compared to conventional oblivious join...... 24 2.6 OCQ vs. competing systems. OCQ is orders of magnitude faster than SMCQL and DJoin due to its use of trusted hardware, and is faster than Opaque for most queries because it can execute initial filters in plaintext...... 27 2.7 OCQ vs. Opaque, highlighting network transfer time. OCQ retains an advantage even assuming an infinitely fast network because it can execute initial filters in plaintext rather than using oblivious operators on the full inputs...... 28 vi

2.8 Performance of register-oblivious OCQ versus AgMPC. OCQ provides the same level of memory access pattern protection as AgMPC, but scales much better and is up to 219x faster...... 29 2.9 Overhead of OCQ’s security. OCQ incurs 2.2–25× overhead compared to the federated and outsourced Spark SQL baselines, which provide no security...... 29 2.10 Availability of SGX enclaves at all sites provides 1.3–1.6× speedup compared to having SGX at only one site...... 30 2.11 Schema-aware padding provides a 2.5× speedup compared to filter push-up, and is only 26% slower than the baseline plan without padding...... 31

3.1 Graph Analytics Pipeline ...... 38 3.2 View Reuse ...... 43 3.3 View Matching during Linear and Bushy Decompositions...... 50 3.4 Pattern queries [67] ...... 53 3.5 Performance of GraphFrames compared to a specialized engine for graph queries. . . . 54 3.6 Performance of pattern queries with and without views ...... 55 3.7 End-to-end pipeline performance in multiple systems vs. single system ...... 56 3.8 Benefit of Attribute-Based Partitioning ...... 57

4.1 Dependency graph for the tasks produced in our example Spark program. Tall boxes represent datasets, while filled boxes represent partitions of a dataset, which are computed in parallel by different tasks...... 62 4.2 Flow of information while recording a program’s execution and replaying and debugging the program. MapReduce, Dryad, and Spark carry out user transformations by deploying tasks onto the cluster and receiving their results. Arthur logs additional information about the program’s execution, which it can replay on demand after the program finishes. 64 4.3 Partial lineage graph of a Spark application, as plotted by Arthur...... 65 4.4 Tasks that need to be rerun for local task replay. To rerun a task (dark red), Arthur first runs its ancestors (light blue) and saves the last output. It is only necessary to run these tasks rather than the entire job...... 66 4.5 For tracing, Arthur rewrites the dependency graph to propagate tags, which represent provenance, along with each element. For example, the original dependency graph contains an operator that merges datasets of A and B elements to form a dataset of C elements. In the modified graph, the original operator is wrapped with logic to propagate the tags...... 67 4.6 The original filter operator applies a user-supplied predicate directly to each element, while the augmented operator extracts the integer element before passing it to the predicate. 67 4.7 To trace an output record (rightmost rectangle) backward through the data flow, we tag each input element uniquely, run the job to propagate the tags to the outputs, and find which input elements contributed tags (leftmost blue, yellow, and red rectangles). . . . 69 4.8 Performance comparison of various Spark applications with and without debugging. . . 72 vii

4.9 Running time of various interactive queries in the debugger. Arthur only runs the tasks necessary to answer each query, so Query 1 is faster than the original program. Subsequent operations benefit from in-memory caching...... 72 viii

List of Tables

2.1 Example queries and resulting federated plans. Queries (top) are specified in Spark SQL’s logical plan notation, which is similar to relational algebra. The resulting plans (bottom) are specified in Spark SQL’s physical plan notation with OCQ’s physical operator names. Nesting indicates a child relationship; sub-plans nested under a physical operator provide input to that operator. Physical operators take parameters, listed on the same line as the operator name. Most operator parameters are expression lists, listed in square brackets...... 20 2.2 DJoin queries...... 26

3.1 Top-5 views of size three suggested by the system for the workload in Figure 3.5. . . . 51 3.2 Views registered in the system to explore their impact on queries in Figure 3.4...... 55

4.1 Comparison of Inspector Gadget, Daphne, general replay debuggers, and Arthur. Note that (*) Daphne’s task replay requires that all intermediate data in the job is saved to disk and available at debug time, and (†) Inspector Gadget requires instrumenting jobs at runtime, with varying overhead based on the analysis done...... 77 ix

Acknowledgments

I am thankful to my advisors Ion Stoica and Raluca Ada Popa. This dissertation would not have been possible without their wisdom, patience, and trust. I was lucky to work with Matei Zaharia and Scott Shenker starting in my undergraduate years. I learned most of my research skills from Matei and he inspired me to pursue graduate school. I admired Scott's approach to research and teaching and he ultimately convinced me to come to Berkeley. In my early years as a graduate student I worked with Joseph E. Gonzalez and Reynold Xin. Joey taught me valuable lessons about research and I am grateful for his encouragement over the years. Reynold is responsible for my enduring interest in databases, particularly efficient query processing. I worked closely with Wenting Zheng on Opaque [145]. Our complementary skills and personalities made this collaboration a highlight of my time at Berkeley. My brother Arjun and his partner Adora have been with me throughout graduate school and I am a better person thanks to them. Finally, I thank my parents Rashmi and Salil for their incredible love and support.

Chapter 1

Introduction

Computer science historically focused on machines, algorithms, and data structures: specific methods that work together to transform data into insights. Programming languages abstracted away specifics of different machines. The insight of databases was to further abstract away specifics of data representation and algorithms, and expose a simpler model of data as a collection of relations that can be transformed using queries expressed using a small set of relational operators. This gives the database the flexibility to choose which algorithms and data structures are best suited for the task. In real use, databases came to take on two different roles: as the data store of record, and as an environment for extracting insights from that data. Systems began to specialize for one or the other. This dissertation is concerned with the latter role, which has come to be known as analytics.

1.1 Rise of Distributed Dataflow Systems

Analytics has traditionally been performed using specialized data warehouses based on the Massively Parallel Processing (MPP) architecture, in which data is extracted, transformed, and loaded into a shared-nothing database [47] that stores data in columnar format [126] and supports efficient SQL query execution through vectorization [20] or code generation [104]. Data warehouses often use a cluster composed of a small number of high-end physical machines. Since the data warehouse became popular, a few identifiable shifts have occurred in the area of analytical databases:

1. Growing volumes of data collection, driven by globalization, the internet, IoT, and the outsized role of software across all industrial and government sectors. This motivated the creation of the MapReduce [44] framework, which scaled beyond the limits of a traditional data warehouse using hundreds or thousands of commodity machines.

2. Outsourced computing. Rather than each organization needing to maintain its own cluster, it has become more economical and convenient to outsource analytics to a small number of cloud providers. This offers a number of advantages including elasticity, ease of management,
and improved resource utilization through disaggregation of compute and storage resources. It has also created new security and regulatory concerns in case of a breach.

3. Demand for more complex analytics. In addition to relational operations on tabular data, it is increasingly important to view data as graphs or to apply machine learning techniques. These operations are cumbersome to express in SQL, even with user-defined functions (UDFs), creating demand for more specialized frameworks [89, 137, 91, 59]. The flexibility offered by these frameworks in turn makes these operations more difficult to debug than typical SQL queries.

These shifts led to the rise of distributed dataflow systems based on the MapReduce framework, including Dryad [72], Apache Spark [142], and [29] for general data processing; and Dremel [96], [5], [80], Presto [115], and Apache Spark SQL [11] for SQL analytics. These systems support unstructured data in addition to tabular data, and they execute data-parallel operations by breaking them into directed acyclic graphs (DAGs) of deterministic tasks that each operate on a horizontal partition of the input data and can be scheduled in multiple locations and re-executed in case of failure. Their architecture emphasizes extreme scalability. Loosely coupled internals make them adaptable to different storage services and applications. They offer fine-grained, parallel fault recovery, making them suitable for executing long-running, complex queries on cloud machines that can scale elastically and can be treated as interchangeable.
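To make this execution model concrete, here is a minimal Apache Spark word-count sketch in Scala (the input and output paths are illustrative only). Each transformation adds a node to the dataflow DAG, and the shuffle introduced by reduceByKey marks a stage boundary; every stage runs as deterministic tasks over horizontal partitions of the data.

```scala
import org.apache.spark.sql.SparkSession

// Word count over a hypothetical input file; each stage runs as parallel,
// deterministic tasks, one per partition, and can be re-executed on failure.
val spark = SparkSession.builder().appName("wordcount").getOrCreate()
val sc = spark.sparkContext

val counts = sc.textFile("hdfs://example/input.txt")    // hypothetical path
  .flatMap(line => line.split("\\s+"))                   // narrow dependency: same stage
  .map(word => (word, 1))
  .reduceByKey(_ + _)                                     // shuffle: stage boundary

counts.saveAsTextFile("hdfs://example/output")            // hypothetical path
```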

1.2 Challenges in Large-Scale Analytics

Although distributed dataflow systems have seen tremendous growth, we argue that they have not fully addressed these challenges, posing a barrier to further adoption. They do not address the following needs, illustrated in Figure 1.1.

1. Processing sensitive federated data. Tools for large-scale analytics are designed to exploit the elasticity and convenient disaggregated services offered by cloud providers. However, many organizations are hesitant to entrust their most sensitive data to cloud providers, and regulations may prevent them from doing so [30]. In addition, these tools are designed to be operated by a single organization and cannot securely analyze data from multiple untrusting organizations. Traditional federated databases [127, 144, 53, 52] assume participating organizations trust each other and can share data. These systems are unsuitable when organizations are in business competition or are legally prevented from sharing data. For example, banks may wish to collaboratively assess industrywide risk, but they cannot share their proprietary customer information with one another in plaintext. Existing systems in this setting of coopetitive analytics (analytics among cooperative and competing parties) either assume a weak semi-honest threat model or incur prohibitive overhead [134, 136, 15, 103].

2. Graph queries. Specialized graph mining tools support efficient pattern querying [137, 132, 130], but these systems have limited support for other graph analytics tasks such as distributed graph algorithms. A workflow involving multiple systems is cumbersome to orchestrate and incurs significant overhead when data crosses system boundaries. Naively encoding graph queries as join-heavy dataflow computation results in very poor performance due to the large intermediate result sizes generated. Prior work [59, 75, 25, 50] has explored how to express iterative graph algorithms as dataflow computation, but these systems either do not support graph queries, are unable to share intermediate results across multiple queries in a workload, or forego intra-query fault tolerance.

[Figure 1.1: the triangle's vertices are secure federated analytics (Conclave, AgMPC, SMCQL, DJoin), graph analytics and querying (Titan, GraphLab, GraphX, Vertexica, Pregelix, Arabesque), and low-overhead debugging (Newt, Inspector Gadget, Daphne); OCQ, GraphFrames, and Arthur appear in the interior, built on Apache Spark and Apache Spark SQL.]

Figure 1.1: No existing system for large-scale analytics addresses all three identified needs, represented as the vertices of the triangle in this figure. This dissertation introduces three systems built on Apache Spark and Apache Spark SQL to address these needs.

3. Debugging. It is difficult to understand the correctness and performance of complex analytics queries due to their large scale and the scarcity of tools that go beyond coarse metrics and error reporting without incurring significant overhead. Tools for testing assertions, tracing through data flows, and replaying code exist [109, 73, 58, 43], but they are too expensive to use in production.

This dissertation addresses these three challenges in large-scale analytics, toward the goal of a unified analytics platform that retains the flexibility, extensibility, and performance of relational systems.

1.3 Dissertation Overview

The work in this dissertation applies generally to any distributed dataflow system. For concreteness, we choose to build on Apache Spark [142], a popular distributed dataflow framework, and Apache Spark SQL [11], a distributed SQL analytics engine that uses Spark. This dissertation presents three systems that build on Apache Spark and Apache Spark SQL to address the aforementioned challenges in large-scale analytics.

1. OCQ (Chapter 2) enables secure analytics in a multi-party setting. OCQ extends Spark SQL to execute multi-party queries securely using hardware enclaves. Its query planner chooses how and where to execute each relational operator to prevent data leakage through side channels such as memory access patterns, network traffic statistics, and cardinality, while minimizing overhead. We find that OCQ is up to 9.9x faster than Opaque, a state-of-the-art secure analytics framework which outsources all data and computation to an enclave-enabled cloud; and is up to 219x faster than implementing analytics using AgMPC, a state-of-the-art secure multi-party computation framework. This chapter is adapted from previously published work [42].

2. GraphFrames (Chapter 3) enables users to express graph algorithms, pattern matching and relational queries, and optimizes work across them. It executes these operations efficiently by materializing multiple views of the graph, selecting join plans based on the available views, and executing these plans using Spark SQL. This chapter is an extended version of previously published work [41].

3. Arthur (Chapter 4) is a debugger for Spark programs that provides a rich set of analysis tools at close to zero runtime overhead through selective replay. Unlike previous replay debuggers, which add high overheads due to the need to log low-level nondeterministic events, Arthur takes advantage of the structure of Spark programs, which are composed of graphs of deterministic tasks for fault tolerance, to minimize its logging cost. It uses selective replay to implement a variety of debugging features, including rerunning any task in a single-process debugger; ad-hoc queries on computation state; and forward and backward tracing of records through the computation, which it achieves using a program transformation at replay time.
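To give a flavor of the replay-time transformation used for tracing, here is a toy, hypothetical sketch using plain Scala collections rather than Arthur's actual dataflow operators: each record is paired with a set of tags, the wrapped operators propagate the tags, and the user's functions see only the original element.

```scala
// Toy model of tag propagation for record tracing (not Arthur's real API).
type Tagged[A] = (Set[Int], A)

// The wrapped map applies the user's function to the element and carries the tags along.
def tracedMap[A, B](data: Seq[Tagged[A]])(f: A => B): Seq[Tagged[B]] =
  data.map { case (tags, a) => (tags, f(a)) }

// The wrapped filter unwraps the element before calling the user's predicate.
def tracedFilter[A](data: Seq[Tagged[A]])(p: A => Boolean): Seq[Tagged[A]] =
  data.filter { case (_, a) => p(a) }

// Backward tracing idea: tag each input uniquely, run the job, and inspect the tags
// attached to an output record to find the inputs that contributed to it.
val input: Seq[Tagged[Int]] = Seq(1, 2, 3, 4).zipWithIndex.map { case (v, i) => (Set(i), v) }
val output = tracedFilter(tracedMap(input)(_ * 10))(_ > 20)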

Finally, Chapter 5 summarizes the main results and discusses directions for future work.

Chapter 2

Oblivious Coopetitive Analytics Using Hardware Enclaves

In this chapter we address challenges created by the shift toward outsourced and multi-party analytics. This chapter builds upon Opaque [145], a system I coauthored that enables large-scale analytics on sensitive data. Opaque allows a single organization to outsource its computation to an untrusted cloud. In this chapter we study the more general problem of how to securely process data from multiple untrusting organizations.

2.1 Introduction

Distributed analytics frameworks [142] are now widely used, but are designed to operate on data owned by one entity. Federated databases, which span data owned by multiple cooperating parties, have a long history in the database community [127, 144, 53, 52]. This community has focused on the case when the organizations trust each other and can share data. However, there are many applications in which organizations cannot share plaintext data with


Figure 2.1: OCQ executes approved federated queries using hardware enclaves. Its query planner and relational operators hide memory and network access patterns and sensitive cardinalities.

each other because, for example, they are in business competition, or due to privacy regulations and liability concerns. Nevertheless, collaboration among these competing organizations could enable new applications. For example, banks would like to perform analytics over their aggregate data to better detect money laundering, but cannot share the data with each other because they are in competition. The Chief Risk Officer of Scotiabank stated that such collaboration “will enhance yields by orders of magnitude” [48]. We are partnering with Scotiabank for this use case. As another example, consider a consortium of hospitals that want to pool their patient data for researchers to access. Researchers need the ability to correlate patients across hospitals, yet regulations prevent the hospitals from sharing raw patient data or letting it leave the premises [30]. Following previous work [146], we refer to this setting of analytics among cooperative and competing parties as coopetitive analytics. In a coopetitive setting, parties agree in advance on a shared schema and a set of allowed queries. They agree to run these queries on their private data and share only the results. In particular, they will not share the actual database of any party or intermediate results in the query computation. Prior work in the coopetitive setting uses either specialized cryptography or hardware enclaves. Cryptography [114, 15, 146, 2, 79, 95] either offers limited functionality too restrictive for general analytics, as is the case for partially-homomorphic encryption, or introduces enormous overheads, taking hours to months for typical queries, as is the case for secure multi-party computation (MPC), as we show in §2.8. Approaches based on hardware enclaves [145, 49, 98, 107] are more promising for performance. While hardware enclaves have many side channels [135, 24, 123, 139, 26, 84], prior work [68] shows that many classes of side channels disappear if one designs the computation to be oblivious: memory accesses are independent of sensitive data. In a distributed setting, network traffic patterns also create a side channel [106]. Hence, to use hardware enclaves for coopetitive analytics requires oblivious protocols designed for this setting. The overarching challenge is that such protocols are slow. The most relevant related work to OCQ is likely Opaque [145], which offers oblivious analytics on data outsourced to a single untrusted cloud. However, aggregating multiple parties’ sensitive data to a single location suffers from several drawbacks in the coopetitive setting. First, transferring and maintaining a remote copy of large data incurs significant overhead especially if this data changes frequently. Second, this strategy may run afoul of regulations that forbid a database from being moved out of a certain perimeter. Third, the oblivious computation in Opaque crucially assumes that all communication happens within the same cloud; applying Opaque’s algorithms to query execution across different parties connected by a wide-area network results in prohibitive overhead. In this chapter, we propose Oblivious Coopetitive Queries (OCQ), a general framework for coopetitive analytics based on hardware enclaves, overviewed in Figure 2.1. Rather than requiring multiple parties’ data to be aggregated to a single location, OCQ executes queries in a decentralized manner. 
OCQ develops efficient oblivious query algorithms (e.g., oblivious federated join) for the federated setting and a schema-aware padding mechanism, which combined prevent data leakage through (1) memory accesses to data inside the enclaves, and (2) network traffic patterns outside the enclaves, both within a local data center and in the wide area. OCQ also contributes an oblivious planner, which determines where to execute each operation and how to execute it, to minimize the
overhead while maintaining the oblivious security guarantees. We implemented OCQ as an extension to Apache Spark SQL’s Catalyst query planner and execution engine. We evaluate OCQ using SGX-enabled, geographically distributed clusters on a variety of synthetic benchmarks and find that OCQ is up to 9.9× faster than outsourcing all data and computation to a third party running Opaque. Compared to AgMPC [136], a state-of-the-art cryptographic framework for secure multi-party computation, OCQ is up to 219× faster. OCQ’s design addresses the following challenges: Challenge 1: Oblivious queries in the wide area. The first step of distributed operators such as aggregations and joins is typically to shuffle the entire relation to colocate the appropriate records. For example, Opaque implements aggregation using an initial distributed sort based on the grouping attributes to colocate records that belong to the same group. However, in the coopetitive setting, this incurs record movement across the wide area, requiring the use of expensive security protocols to avoid leaking information. Approach: Federated and oblivious planner. Our federated planner chooses operators that maximize the computation run at originating parties and ensures that communication across the wide area does not leak information about the input data. For example, while a high-cardinality aggregation that results in many groups would normally be implemented using an initial distributed sort, OCQ’s planner instead chooses to compute partial aggregates at each party and shuffle those partial aggregates across the wide area, with padding to hide the number of groups from each party. The worst-case upper bound on the number of groups is often much smaller than the number of rows in the database, for example, for fields containing gender or age. This approach avoids exchanging un-aggregated records between parties. Key to performance is that parts of the computation running within a party’s cluster that only touch that party’s data need not be oblivious because the data is known to the party. Such local computation will still run inside the enclaves for integrity and authentication. Challenge 2: Combining data of mixed sensitivities. Queries may consist of operators that combine slices of multiple parties’ data. Because OCQ executes these operators at the parties themselves, and because parties may provide table-level sensitivity annotations, many relational operators in OCQ combine data of varying sensitivity levels. For example, one party may execute a join operator between its own data and a slice of sensitive data from other parties. Executing such operators using fully-oblivious algorithms would incur unnecessary overhead. Approach: Mixed-sensitivity algorithms. OCQ introduces mixed-sensitivity operators, such as a mixed-sensitivity oblivious join algorithm based on the merge phase of bitonic sort that provides up to 2.5x speedup compared to a fully-oblivious join. Challenge 3: Query planning with sensitive cardinalities. Query planners traditionally rely on statistics to choose among multiple plans. However, such statistics are sensitive in a coopetitive setting, as they reveal information about the distribution of each party’s data. The plan chosen can leak information about the input statistics. Additionally, the cardinalities of intermediate relations may leak information, such as the selectivity of a filter. 
Cardinality leakage poses a greater threat in the coopetitive setting than in the outsourced computation setting because a malicious party can
manipulate its input to extract information through cardinalities. Approach: Schema-aware padding. As in previous work on Opaque [145], we take the approach of padding input and intermediate relations to publicly-known bounds to hide sensitive cardinalities. Our contribution is a scheme to refine these bounds by exploiting the likely presence of foreign key relationships between public and private relations in each party’s schema to find tighter padding bounds for each operator. For example, a query to find all distinct disease diagnoses across multiple hospitals would typically involve padding to the number of patient diagnoses, while a foreign key relationship to a set of diseases would enable OCQ to pad to the possibly much smaller number of registered diseases. We introduce rules to propagate these bounds through the query plan. Conveniently, padding rules also obviate the problem of leakage via the choice of plan, because the resulting padding bounds provide exact cardinality information without leaking sensitive statistics.
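As a concrete illustration of the hospital example above, the following is a toy sketch, not OCQ's actual planner API (all names and types are hypothetical), of how a declared foreign key can tighten the padding bound for a distinct-values query.

```scala
// Toy sketch of schema-aware padding bounds. A DISTINCT over a column that is a
// foreign key into a smaller table can be padded to that table's size instead of
// the (much larger) number of input rows.
case class Table(name: String, rowBound: Long)
case class ForeignKey(fromTable: String, fromColumn: String, references: Table)

def distinctPaddingBound(input: Table, column: String, fks: Seq[ForeignKey]): Long = {
  val fkBound = fks
    .find(fk => fk.fromTable == input.name && fk.fromColumn == column)
    .map(_.references.rowBound)
  fkBound.fold(input.rowBound)(b => math.min(b, input.rowBound))
}

val diagnoses = Table("diagnoses", rowBound = 100000000L) // ~10^8 patient diagnoses
val diseases  = Table("diseases",  rowBound = 10000L)     // ~10^4 registered diseases
val fks = Seq(ForeignKey("diagnoses", "disease_id", diseases))

// Pads the distinct-diagnosis result to 10,000 rows instead of 100,000,000.
println(distinctPaddingBound(diagnoses, "disease_id", fks))
```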

2.2 Background

OCQ is designed to enable coopetitive analytics using hardware enclaves. Here we describe the coopetitive setting and provide background on the building blocks for OCQ.

2.2.1 Hardware enclaves Hardware enclaves or trusted execution environments (TEEs) such as Intel SGX [94], AMD Memory Encryption [78], Keystone [82], Sanctum [39], MI6 [22] and others [133, 10] enable code to run in an isolated environment where other processes on the same host, including the OS and hypervisor, cannot tamper with its execution or access its memory. Enclaves also provide remote attestation, which allows the enclave to prove to a client that it is running the desired code and to establish a secure channel with that client. We discuss in Section 2.4 the enclave threat model on which we build OCQ.

2.2.2 Oblivious algorithms As we discuss in Section 2.4, enclaves suffer from side channels that exploit memory access patterns and network traffic patterns. OCQ protects against these side channels with oblivious computation and appropriate padding. Oblivious algorithms aim to process data while ensuring that their memory accesses are independent of the contents of the data; this implies that network traffic patterns are also independent of data content. For example, basic matrix multiplication is oblivious, because its access pattern depends only on the size of the inputs and not their numerical values. In contrast, quicksort is not oblivious because its access pattern depends on the ordering of the data: in each iteration, records smaller than the pivot are swapped to one memory region, while records larger than the pivot are swapped to another. The choice of where to swap each record depends on the record contents. Because sorting is at the heart of most database operators, efficient oblivious sorting algorithms are of particular interest.

Figure 2.2: Illustration of column sort algorithm. Each column represents an encrypted partition. Column sort enables oblivious distributed sorting using four intra-partition sorts (steps 1, 3, 5, 7) and four data exchanges (2, 4, 6, 8) in fixed patterns. An attacker only sees exchanges of encrypted records. Since the pattern of exchanges is fixed, it cannot leak data contents.

Single-machine oblivious sorting can be done using sorting networks that perform a fixed sequence of compare-exchange operations. Asymptotically more compare-exchange operations are needed for oblivious sorting than for traditional sorting. An oblivious compare-exchange can be implemented via a comparison followed by a conditional swap of two equal-length buffers depending on the result of the comparison. For data partitioned across multiple machines, oblivious sorting can be accomplished using a two-level sorting algorithm in which each partition is individually sorted using a sorting network, and records are sorted across partitions using an algorithm called column sort [85]. Column sort consists of a fixed sequence of data exchanges and intra-machine sorts that uses only 4 shuffles, compared to O(n log² n) shuffles for a sorting-network-based distributed sort. It is thus well suited to oblivious distributed sorting, where it was previously applied by Opaque [145]. OCQ also uses it for sorting sensitive data; the algorithm is illustrated in Figure 2.2.
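As an illustration of the compare-exchange primitive described above, here is a minimal sketch in Scala of an oblivious compare-exchange on an array of Long keys. It is a simplified stand-in for the enclave implementation (which operates on equal-length encrypted records) and ignores compiler- and cache-level effects.

```scala
// Compare-exchange that touches the same memory locations and executes the same
// instructions regardless of the key values: the "swap or not" decision becomes
// an arithmetic mask rather than a branch.
def obliviousCompareExchange(a: Array[Long], i: Int, j: Int): Unit = {
  val x = a(i)
  val y = a(j)
  // gt = 1 if x > y, else 0, via an overflow-safe branchless signed comparison:
  // the sign bit of (y & ~x) | (~(y ^ x) & (y - x)) indicates y < x.
  val gt: Long = ((y & ~x) | (~(y ^ x) & (y - x))) >>> 63
  val mask: Long = -gt                  // all ones if a swap is needed, else all zeros
  a(i) = (x & ~mask) | (y & mask)       // smaller key ends up at position i
  a(j) = (y & ~mask) | (x & mask)       // larger key ends up at position j
}

// Example: one compare-exchange step of a sorting network.
val data = Array(42L, 7L)
obliviousCompareExchange(data, 0, 1)    // data is now [7, 42]
```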

2.2.3 Spark SQL and Opaque OCQ's planner and federated execution engine are built on Spark SQL [142, 11], a distributed SQL analytics framework, and Opaque [145], an extension of Spark SQL for secure outsourced computation via hardware enclaves. Spark SQL offers distributed plaintext query execution. Queries can be written in SQL or an embedded Scala DSL called DataFrames. The user submits queries to the Spark SQL driver, which parses them into logical plan format. Spark SQL's extensible rule-based query planner, Catalyst, which also runs at the driver, optimizes these plans and generates a physical plan for execution on a Spark cluster. Catalyst is primarily rule-based, but offers limited statistics collection and cost-based optimization. The physical plan breaks the query into stages consisting of parallel tasks that are executed on workers. Each worker writes results to distributed storage or returns them to the driver. Opaque extends Spark SQL to the untrusted cloud setting, where the driver is trusted but the workers are not. By extending Catalyst with encrypted operators, which produce tasks that run inside SGX enclaves rather than in plaintext, Opaque enables distributed queries on encrypted data. OCQ's federated planner is an extension of Catalyst, and OCQ leverages Opaque to process encrypted data within a single party. This enables OCQ to inherit Spark SQL's query languages and optimizations, and Opaque's secure distributed query processing. However, OCQ must implement rules and orchestration logic specific to secure federated queries.
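For readers unfamiliar with the DataFrame DSL mentioned above, a minimal plaintext Spark SQL example follows (the table path and column names are hypothetical). Catalyst parses this into a logical plan, optimizes it, and emits a physical plan whose stages run as parallel tasks; Opaque and OCQ substitute encrypted and oblivious physical operators for the same kind of plan.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("dsl-example").getOrCreate()

// Hypothetical table of diagnoses with columns (hospital_id, diag).
val diagnoses = spark.read.parquet("hdfs://example/diagnoses")

val result = diagnoses
  .filter(col("diag") === "c. diff")           // logical Filter
  .groupBy(col("hospital_id"))                  // logical Aggregate
  .agg(count("*").as("num_cases"))

result.show()  // triggers planning and distributed execution
```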

2.3 Architecture

Figure 2.3 shows OCQ’s architecture. Each party maintains a Spark cluster with at least one hardware enclave-enabled machine, on which Opaque tasks are scheduled. OCQ’s query planner is deterministic and runs outside the enclave at every party. This is because our query planner builds on Spark SQL’s planner, which is a large Scala codebase that would significantly broaden the enclave’s attack surface and require heavyweight sandboxing techniques. To reduce coordination between parties at query time, each party runs a replica of OCQ’s federated planner; we describe the mechanism for verifying the plan’s integrity in Section 2.3.1. Each planner replica maintains an audit log of all queries issued by any party. The role of each party’s Opaque clusters is to (1) give assurance that the computation at each party happens correctly, and (2) safely mix multiple parties’ data. Therefore it is essential for enclave code to be trusted by all parties. OCQ accomplishes this by granting all parties the ability to invoke pre-approved routines on all enclaves, but ensuring that each enclave verifies that the deployed code is approved by all the parties via remote attestation. In OCQ, parties own different tables; a logical table consisting of rows from different parties can be implemented using union. Parties annotate their tables as either public or sensitive. The query planner determines the sensitivity level of intermediate results using sensitivity propagation rules 11

OCQOFQ OCQOFQ Wide-Area Query Opaque Result PlannerPlanner OptimizerOptimizer Execution

Per-Party SparkOpaque SQL Sensitivity Rules

OpaqueOpaque

Figure 2.4: Lifecycle of a query. Queries and sensitivity annotations are submitted to OCQ’s federated planner and optimizer. The resulting federated plan contains operators running at different locations across the federation, and satisfies all parties’ sensitivity annotations. The plan executes securely, and the user only learns the result and cannot access sensitive intermediate relations. Stacked blocks indicate that a component is present at each party.

discussed in Section 2.5.

2.3.1 Setup phase Parties must share the cryptographic hash of each encrypted partition of each of their input relations with all other parties. Once the setup phase has completed, no party can change its input data, and the enclaves will ensure this. Parties set up their enclaves using OCQ's code, and perform remote attestation amongst all these clusters. Consider a logical enclave at each party (which can be implemented via a cluster of enclaves). The result of this stage is:

• Every party and its enclave know the public keys of every party, the schema and sensitivity of every table of each party, and the hashes of each party’s encrypted data.

• Each enclave checks that the enclaves at the other parties were correctly set up with OCQ and with the same information.

• The enclaves agree on symmetric keys for a secure channel amongst themselves.

Queriers may attempt to submit malicious queries designed to extract sensitive data, and a compromised planner replica may produce a physical plan that reveals sensitive information. We prevent both of these by requiring all parties to agree on the allowed set of queries, the resulting query plans, and size bounds for sensitive input and intermediate data sets. The agreed-upon configuration is represented as a set of DDL statements, queries, and physical plans that is signed by all parties and passed to each party's instance of the OCQ federated planner and enclave. Before signing the configuration, each party should check that it matches expectations.
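The following is a simplified, hypothetical sketch of the two setup-phase checks described above: SHA-256 partition hashes and all-parties plan signatures. It is not OCQ's actual wire format, and verifySig stands in for the enclave's signature verification against each party's public key.

```scala
import java.security.MessageDigest

// Hash published for each encrypted input partition during setup; enclaves later
// recompute it when scanning the partition and abort the query on a mismatch.
def partitionHash(encryptedPartition: Array[Byte]): Array[Byte] =
  MessageDigest.getInstance("SHA-256").digest(encryptedPartition)

// An enclave executes a plan only if every party in the federation has signed it.
def allPartiesSigned(
    planBytes: Array[Byte],
    signatures: Map[String, Array[Byte]],
    parties: Set[String],
    verifySig: (String, Array[Byte], Array[Byte]) => Boolean): Boolean =
  parties.forall(p => signatures.get(p).exists(sig => verifySig(p, planBytes, sig)))
```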

2.3.2 Query lifecycle Figure 2.4 shows the lifecycle of a query. A user submits a query to a federated query planner replica, which broadcasts it to all the others, one planner instance per party. Each planner first checks that the query conforms to the set of approved queries and then performs the same planning and optimization steps, deterministically generating a federated plan with operators running at different locations. The generated plan satisfies all parties’ sensitivity annotations and performs as much computation as possible in plaintext at each party. The party signs the plan, and enclaves will execute only plans signed by all the parties. Each planner runs the relevant operators at its local Spark and Opaque clusters. Special data movement operators trigger data exchange with other parties across the wide area, or between one party’s Spark and Opaque clusters, when allowed by the sensitivity annotations. The final operator collects the results back to the originating replica and returns them to the user. A party can observe network traffic generated by computation on other parties’ data when that computation is run within their Opaque cluster, such as during final aggregation. To prevent size leakage in this case, OCQ automatically determines appropriate public upper bounds for all intermediate results and ensures that each operator pads its output to the appropriate bounds. Section 2.5.4 describes this process in more detail. Our integrity mechanism for ensuring that a malicious party cannot tamper with the integrity of the computation or input data is similar to that of Opaque [145]’s self-verifying computation, so we only describe it at a high level. At query time, the signed physical plans are loaded into the enclaves, which check that all parties signed every plan. When an enclave is requested to execute part of a query plan, it verifies that each of the inputs to the plan fragment were either authentic input data or were generated by the expected child plan, depending on the expected input source. To check that the input is authentic, when scanning a party’s input data, the scanning enclave checks the partition’s hash against the party’s hash from setup to ensure that the party has not tampered with the input data. If the hashes do not match, the enclave signals failure and the query will abort. After executing the plan fragment, the enclave certifies that its output was correctly produced by running the requested plan fragment.

2.4 Threat model and security guarantees

2.4.1 Abstract enclave model OCQ considers an attacker who can see memory accesses to data and/or messages sent over the network. Such an attacker arises from a large class of attacks: attacks leveraging page faults [139, 27, 135], the branch predictor [84], cache timing [123, 24], the memory bus [81], network traffic patterns [106], and others. Oblivious algorithms are impressively effective against such a large variety of attacks [68] because they address the core leakage of these channels: memory accesses based on sensitive data. OCQ contributes oblivious algorithms for the federated analytics setting to thwart such attacks. We group the attackers into two categories:

• A network attacker that sees all network traffic but has no access to the machines. In particular, this attacker does not see any memory accesses inside any of the machines.

• A malicious party attacker that sees all memory accesses made by enclaves in addition to the network traffic. For this party we assume the oblivious row-level conditional-exchange primitive from Section 2.2.2.

OCQ builds on top of an abstract enclave model; OCQ's design is not tied to Intel SGX even though our prototype implementation in Section 2.7 builds on it. There are many different proposals for hardware enclaves, such as Intel SGX [94], AMD Memory Encryption [78], Keystone [82], Sanctum [39], MI6 [22] and others [133, 10], with more upcoming as researchers make rapid progress towards more secure hardware enclaves. OCQ assumes that the attacker cannot access or undetectably modify data or code inside the hardware enclave, that it cannot exploit side channels other than the memory access patterns discussed above, and that it cannot subvert the remote attestation process or otherwise compromise the integrity or secrecy of the enclave; if the attacker could, OCQ does not protect against such attacks. Such side-channel attacks indeed exist in some enclave implementations (such as Intel SGX); addressing them should be done at a different level of abstraction than the one at which OCQ operates, for example, through better enclave design. For instance, Intel SGX suffers from various vulnerabilities or side channels, such as attacks based on speculative execution [26, 33, 122], power consumption [102, 129], rollback [113], intra-cache-line memory accesses [140, 99], denial-of-service attacks [74, 62], and others (e.g., [138, 83]). Many defenses or mitigations have been proposed, such as for closing hyperthreading-based attacks [108, 34], rollback defenses [23, 92], and, importantly, improved enclave designs that remove a wide array of the attacks above, such as Keystone [82] or MI6 [22] (the latter even protects against speculative-execution attacks). We hope that OCQ's design will be ported to better and better enclaves as research progresses on this front.

We remark that our prototype implementation of OCQ, described in Section 2.7, focuses on obliviousness with respect to data at the cache-line granularity. While OCQ's oblivious algorithms are oblivious even when it comes to accesses to OCQ's pseudocode, we have not ensured that our implementation and the generated binary preserve this property. Also, while OCQ's oblivious algorithms could be applied at the intra-cache-line granularity [140, 99], our prototype implementation does not implement them at this level. Both of these can be addressed in the implementation with existing mechanisms (e.g., [108, 34, 88, 61, 40]) at a performance cost.

2.4.2 Party threat model In the coopetitive setting, a malicious party could attempt to tamper with the other parties' data during joint computation, or it could attempt to inspect other parties' data using the attacker capabilities we discussed in §2.4.1. A party can also observe and modify network traffic generated by such outsourced computation. Given the assumption of an abstract enclave model, OCQ guarantees integrity of outsourced operators and data. Each party is free to input data of its choice, and OCQ does not protect against low-quality or maliciously-crafted data. OCQ ensures the integrity of each party's computation and data after the
data has been inputted. The parties, if they wish to, could include checks for each other’s data in the queries given to OCQ. Query results often leak some information about the data from which they were computed. This is why we require the parties to agree on what queries they permit to run on their data and whose results they are agreeing to release. OCQ ensures that all parties agreed to running some query before running that query and releasing the result. Deciding which queries are safe to run is outside the scope of our system, and is largely an unsolved research problem. Nevertheless, existing work in differentially private analytics [76, 77, 16] (discussed further in §2.9.4) and inference detection [66, 45, 116] could aid parties to transform the queries into a safer form or to detect particularly revealing queries.

2.4.3 Security guarantees OCQ guarantees obliviousness for sensitive tables, namely that there is no information leakage about the sensitive data content from the trace of memory accesses and traffic messages other than size, schema, and query information. We use a standard definition of obliviousness, which states that the trace of memory accesses and messages can be simulated without access to the data, while only knowing a bound on the data size, the schema, parties’ configuration, and the query plans to execute. The proof that OCQ meets this guarantee follows from the observation that for all our protocols (query planner, overall query plan, and individual operators), all the accesses to memory and the schedule of messages on the network are performed according to a predefined schedule, fixed ahead of time before the data is input, and which is determined according to our planner’s rules based on data size, schema, query and party’s configuration. Like in much prior work in oblivious computation, our formalism captures only addresses and lengths, and not the actual data content. The reason is that hardware enclaves as used in OCQ and other works, encrypt the content and re-encrypt it upon every access. We remark that OCQ’s query planner runs entirely using non-sensitive information to produce a physical plan; hence, the planning process is oblivious by definition. We now focus on showing that the execution of each physical plan of a query is oblivious w.r.t. sensitive tables. OCQ’s definition of obliviousness differs from that of Opaque because there are multiple parties in addition to the network attacker. A malicious party attacker may see the content of its own sensitive tables and perform non-oblivious computation on them, but not the content of other parties’ sensitive tables. We formalize this by constructing two types of simulators:

• Sim_party has access to a party's sensitive data, but does not have access to the sensitive data of other parties, yet must simulate this party's memory and network access patterns.

• Sim_net does not have access to any party's sensitive data and must simulate the network communication patterns among all parties.

The fact that the simulators can simulate the memory and network patterns without seeing all the sensitive tables implies that these patterns do not leak information about these sensitive tables.

Let D be the set of tables of all parties. As we discussed in Section 2.3, tables are of two types: sensitive and public. Let S_i be the set of tables of party i that are sensitive (including those OCQ marks as sensitive after propagating sensitivity as discussed in Section 2.5). Let Public(D) be all the non-sensitive information about the datasets, such as the content of all public tables and publicly known metadata about the sensitive tables, such as the names and owners of the tables, the names and schema of columns, an upper bound on the number of rows in each table, the set of unique key and foreign key constraints, and others. This metadata does not include the actual values in any column. Let P be the set of publicly known cluster configurations of the parties, including the number of workers and their IP addresses, the untrusted memory size and EPC size of each machine, the oblivious sorting block size, etc. Let q be a query with any user-specified padding bounds (Section 2.5.5). Let Trace_i(D, P, q) be the access pattern trace visible to the malicious attacker at party i: an ordered sequence of memory accesses of the form (read/write, addr, length) and network messages of the form (dst, length). These do not contain timestamps because OCQ does not protect against timing attacks. Trace_net(D, P, q) is the ordered sequence of network messages across all parties.

Theorem 1. For all parties i, datasets D, with party i's sensitive tables S_i, for all cluster configurations P, for all queries q there exist polynomial-time simulators Sim_party and Sim_net such that

∀i, Sim_party(Public(D), i, S_i, P, q) = Trace_i(D, P, q),
Sim_net(Public(D), P, q) = Trace_net(D, P, q).

We provide a proof sketch in Section 2.6.

2.5 Query Planning

We next explore OCQ’s federated query planner, which finds query plans that satisfy security properties while minimizing the scope of expensive oblivious operators.

2.5.1 Overview The federated query planner accepts sensitivity annotations on each party’s base tables. It supports two sensitivity levels:

• Public: the table can be processed at any site with integrity verification. Confidentiality is not required, and data contents and cardinality may safely be exposed.

• Sensitive: if exported from the originating site, the table must be processed with confidentiality and integrity verification, including protection against access pattern side channels and cardinality leakage.

When executing a query, the planner propagates sensitivities through the plan to determine the sensitivity level of each intermediate result, such as a set of partial aggregates. OCQ uses the same two-part sensitivity propagation scheme as Opaque. First, since tables could be correlated via their relationships (e.g., foreign keys), OCQ uses second-path analysis [66] on the user-provided sensitivity annotations to capture these correlations: user-specified base tables are initialized as sensitive, and all tables reachable from a sensitive table via primary–foreign key relationships are recursively marked as sensitive as well. Second, OCQ assigns each operator a sensitivity level determined by the sensitivity levels of its inputs. Operators that process more than one input relation (e.g., join) receive the highest sensitivity level of all their inputs. The federated query planner uses these two sensitivity levels to determine where to execute each operator. Recall that each party has an Opaque site/cluster. Operators may execute in two locations: federated or single-site. The federated operators execute in every site in parallel, whereas the single-site ones execute only at one party. For example, for an aggregation, each site can perform a partial aggregation within their site using the federated operator and the results are then sent to the querying party for final aggregation, which occurs using the single-site operator. OCQ supports the following operator execution modes:

• Federated. The operator executes partitioned and encrypted/authenticated in each site’s Opaque cluster using SGX. Datasets are encrypted and authenticated for integrity verification, but operators will reveal data cardinality and may leak data contents through side channels.

• Federated-Oblivious. The operator executes partitioned and encrypted in each site's Opaque cluster using oblivious algorithms in SGX to hide access pattern side channels. The operator reveals nothing beyond the cardinality of its input; in particular, it hides filter and join selectivities.

• Single-Site-Oblivious. The operator executes obliviously with Opaque at the site where the query originated. This provides the same security guarantees as Federated-Oblivious and is used for final aggregation on sensitive data.

• Single-Site. The operator executes encrypted using Opaque at the site where the query originated. This provides the same security guarantees as the Federated mode and is used for final aggregation on public data.
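To make the two-part sensitivity propagation scheme concrete, the following is a minimal, self-contained Scala sketch; the type and method names here are illustrative assumptions, not OCQ's actual implementation.

sealed trait Sensitivity
case object Public extends Sensitivity
case object Sensitive extends Sensitivity

// A base table, its current sensitivity level, and the tables related to it
// via primary-foreign key relationships.
case class Table(name: String, var level: Sensitivity, related: Seq[Table] = Nil)

// Part 1: propagate Sensitive along key relationships until a fixed point.
def propagateKeys(tables: Seq[Table]): Unit = {
  var changed = true
  while (changed) {
    changed = false
    for (t <- tables if t.level == Sensitive; r <- t.related if r.level == Public) {
      r.level = Sensitive
      changed = true
    }
  }
}

// Part 2: an operator's level is the highest level among its inputs.
def operatorLevel(inputLevels: Seq[Sensitivity]): Sensitivity =
  if (inputLevels.contains(Sensitive)) Sensitive else Public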

The query planner ensures that data of a particular sensitivity level is never processed by an operator with insufficient protections. For example, Sensitive data may be processed with a Federated operator if it has not left its originating site, but after an exchange, the same data must be processed using Federated-Oblivious or Single-Site-Oblivious operators only. The planner can always produce a secure plan for any supported query by running all operators in Single-Site-Oblivious mode with naive worst-case padding, but the resulting overhead can be prohibitive, so its goal is to find secure plans with much lower overhead when possible. In the remainder of this section, we describe our query planning algorithm and rules (Section 2.5.2, Section 2.5.3), then discuss how the planner hides intermediate cardinalities (Section 2.5.4).

The strict sensitivity levels and the use of padding together simplify physical operator selection. This selection traditionally depends on accurate cardinality estimation, which is much easier in OCQ than in traditional databases due to OCQ's use of padding. OCQ introduces padding rules that exploit foreign key constraints in the database schema to minimize padding overhead.

2.5.2 Algorithm

Respecting the constraints described above, the federated query planner applies rules to produce a physical plan that specifies the necessary algorithms and data movement for all three levels of the federation: between parties, within each party, and within each machine. Planning occurs as follows:

1. Obtain the logical plan for the query using Spark SQL.
2. Apply rules to transform the logical plan into a federated physical plan respecting sensitivity annotations and with "ShipTo" operators indicating data movement.
3. Eliminate redundant ShipTo operators.
4. Insert padding operators based on the rules in Section 2.5.4.
5. Execute the federated plan and return the results.

The rules used in step 2 apply recursively to the logical plan in bottom-up order starting with the base tables. For each logical operator (e.g., Project, Filter, Join, Aggregate) and its input plans, the rules produce a physical sub-plan that respects the operator's sensitivity level and avoids unnecessary encryption or data movement overhead. OCQ's rules require operators to be planned in order of execution, because the execution mode chosen for earlier operators determines the execution mode of later operators. This is implemented as a postorder traversal over subplans referencing sensitive data, sketched below.
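As an illustration only (not OCQ's actual classes), the bottom-up rule application in step 2 can be sketched with toy plan types; the real planner extends Spark SQL's Catalyst nodes and the full rule set of Section 2.5.3.

sealed trait LogicalPlan
case class Scan(table: String) extends LogicalPlan
case class Filter(pred: String, child: LogicalPlan) extends LogicalPlan

sealed trait PhysicalPlan
case class FedEncScan(table: String) extends PhysicalPlan
case class FedEncFilter(pred: String, child: PhysicalPlan) extends PhysicalPlan

// Postorder traversal: plan the children first so that each rule can inspect
// the execution mode already chosen for its inputs.
def plan(lp: LogicalPlan): PhysicalPlan = lp match {
  case Scan(t)          => FedEncScan(t)
  case Filter(p, child) => FedEncFilter(p, plan(child))
}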

2.5.3 Query planner rules

We now examine the rules for the four common logical operators (Project, Filter, Join, and Aggregate). We express rules in Scala's pattern matching syntax, where each rule is a case that may match the logical operator being planned. Rules can specify subtyping constraints with the : syntax. We use these constraints to discriminate based on the execution mode (abbreviated as Fed, FedObl, SSObl, and SS) of an operator's inputs. Each rule ensures that the resulting physical operator respects sensitivity levels, so the entire plan will as well. To illustrate how OCQ converts a query into a physical plan, take the example of Query 1 in Table 2.1. For each logical operator that matches a rule, Scala binds the operator's contents to the parameters listed after the case keyword. It then executes the body of the case, which comes after the => arrow, and the planner uses the return value as the physical plan for the given logical operator. For example, the logical operator Filter(diagnosis, diag="c. diff") in Query 1 (Table 2.1) will be mapped to OCQ's physical operator FedEncFilter[diag="c. diff"]. We do not describe what each of OCQ's physical operators does (because there are many), but instead describe only the more novel operator algorithms in §2.6.

Project. The input to the projection is referred to as child, and the projected columns as p. Because a projection never requires data movement and does not leak access patterns, the resulting operator uses the same execution mode as its input.

case Project(p, child: Fed)    => FedEncProject(p, child)
case Project(p, child: FedObl) => FedOblProject(p, child)
case Project(p, child: SSObl)  => OblProject(p, child)
case Project(p, child: SS)     => EncProject(p, child)

Filter. The input to the filter is referred to as child, and the filter predicate as f. As with projection, filtering never requires data movement. However, unlike projection, filtering leaks access pattern information by default, so depending on the execution mode and sensitivity level of the input, we use an oblivious filter operator to hide which input records matched the predicate.

case Filter(f, child: Fed)    => FedEncFilter(f, child)
case Filter(f, child: FedObl) => FedOblFilter(f, child)
case Filter(f, child: SSObl)  => OblFilter(f, child)
case Filter(f, child: SS)     => EncFilter(f, child)

Join. OCQ currently supports only inner joins. The inputs to the join are referred to as left and right, and the join columns as c. Joining does require data movement, and we choose between a broadcast join, where one input is broadcast to all parties and joined separately with each portion of the other input, and a single-site join, where both inputs are brought to the querier's cluster. Both types of joins can be executed with or without obliviousness; we additionally implement a mixed-sensitivity broadcast join (Section 2.6.1). For brevity, we only list half the rules for Join; the other half are the same up to swapping left and right. When multiple join rules apply, such as when there is a choice between broadcasting the left side and the right side, the planner currently chooses one of them arbitrarily, but a cost-based planner could consider both and choose the lower-cost option. In addition, we benefit from Spark SQL's existing planner rules. For example, Spark SQL's broadcast exchange reuse ensures that when a plan indicates that the same relation should be broadcast more than once, it will only be shipped over the WAN once.

case Join(c, left: Fed, right: Public) =>
  FedEncJoin(c, left, BcastToFed(right))
case Join(c, left: Fed, right: Sensitive) =>
  FedMixedSensJoin(c, left, BcastToFed(right))
case Join(c, left: FedObl, right: Public) =>
  FedMixedSensJoin(c, left, BcastToFed(right))
case Join(c, left: FedObl, right: Sensitive) =>
  OblJoin(c, EncCollect(left), EncCollect(right))
case Join(c, left: SSObl, right: Public) =>
  MixedSensJoin(c, EncCollect(left), EncCollect(right))
case Join(c, left: SSObl, right: Sensitive) =>
  OblJoin(c, left, EncCollect(right))

Aggregate. The input to the aggregation is referred to as child (which is a query subplan), the grouping attributes as g, and the aggregation attributes as a. Partial and final aggregates are assigned the same sensitivity level as the input data. The attributes resulting from partial aggregation are referred to using partial(a) and the attributes from final aggregation as final(a). The execution mode of partial aggregation is the same as its input, while final aggregation always occurs at the querier's site in SSObl or SS mode. The with keyword indicates the child's execution mode as well as its sensitivity.

case Aggregate(g, a, child: Fed with Public) =>
  EncAgg(g, final(a), EncCollect(FedAgg(g, partial(a), child)))

This rule applies to an aggregation over Public data that will reside in a Fed manner as a result of running the child subplan. Because the data is public, we can execute both the partial and the final aggregation in Enc mode (without oblivious operators).

case Aggregate(g, a, child: Fed with Sensitive) =>
  OblAgg(g, final(a), EncCollect(FedAgg(g, partial(a), child)))
case Aggregate(g, a, child: FedObl) =>
  OblAgg(g, final(a), EncCollect(FedOblAgg(g, partial(a), child)))

Both rules apply to sensitive federated data. In the first case, each sensitive data slice has not left its originating party, while in the second case, the sensitive data has been commingled with other parties' sensitive data and is protected by obliviousness in addition to encryption. Therefore, in the first case only the final aggregation needs to be oblivious, while in the second case both the partial and the final aggregation must be oblivious. The remaining rules are as follows:

case Aggregate(g, a, child: SSObl) => OblAgg(g, a, child)
case Aggregate(g, a, child: SS)    => EncAgg(g, a, child)

Query planning example. Table 2.1 shows the results of OCQ's federated planning on two sample medical queries [15]. The queries refer to two tables: a diagnosis table containing patient SSNs and the diseases they were diagnosed with, and a medication table containing patient SSNs and the medications they were prescribed. Given these two relations, Query 1 computes comorbidity of the disease c. diff: the most common diseases that c. diff patients are also diagnosed with. Query 2 counts the number of patients with heart disease who were prescribed aspirin. Tables and intermediate results containing identifiable patient information (here, SSN) were specified as Sensitive; tables without SSNs but with more than one column as Sensitive, to prevent correlation attacks; and tables with only one non-identifiable column as Public.

For Query 1, the planner runs the initial Filter operator in Federated mode. Subsequently, only the diag2 column of the result is needed for the Aggregate operator, so Spark SQL automatically inserts a projection to drop the other columns. The resulting table is collected to a single site for the final aggregation and sort.

Query 1 (comorbidity of c. diff):

Sort(
  Aggregate(
    Join(
      Filter(diagnosis, diag="c. diff"),
      Project(diagnosis, diag as diag2),
      col="patientSSN"),
    groupby="diag2", agg="count"),
  col="count")

Federated plan for Query 1:

OblSort [diag_count]
  OblAgg groupby[diag2] agg[count(*) as diag_count]
    OblCollect
      FedOblProject [diag2]
        FedMixedSensJoin [patientSSN]
          BcastToFed
            FedEncFilter [diag="c. diff"]
              FedEncProject [diag]
                FedEncScan diagnosis
          FedEncProject [diag as diag2]
            FedEncScan diagnosis

Query 2 (aspirin count):

Count(
  Aggregate(
    Join(
      Filter(diagnosis, diag="heart disease"),
      Filter(medication, med="aspirin"),
      col="patientSSN"),
    groupby="patientSSN"))

Federated plan for Query 2:

OblCount
  OblAgg groupby[patientSSN]
    OblCollect
      FedOblProject [patientSSN]
        FedMixedSensJoin [patientSSN]
          FedEncFilter [diag="heart disease"]
            FedEncScan diagnosis
          BcastToFed
            FedEncFilter [med="aspirin"]
              FedEncScan medication

Table 2.1: Example queries and resulting federated plans. Queries (top) are specified in Spark SQL's logical plan notation, which is similar to relational algebra. The resulting plans (bottom) are specified in Spark SQL's physical plan notation with OCQ's physical operator names. Nesting indicates a child relationship; sub-plans nested under a physical operator provide input to that operator. Physical operators take parameters, listed on the same line as the operator name. Most operator parameters are expression lists, listed in square brackets.

For Query 2, the planner likewise runs the initial filters in Federated mode. As in Query 1, the right side is Public, so the planner uses a mixed-sensitivity broadcast join. The oblivious aggregation returns the number of distinct patients.

2.5.4 Determining padding upper bounds

Sensitive operators must pad their output to avoid leaking information about the input. For example, a bank may want to hide how many customers it has, so the data sizes of any cross-site communication must be padded to a bound greater than the number of customers. Alternatively, to hide the number of customers with revenue >$1 million, computation downstream of all revenue-based filters must be padded.

The federated query planner ensures that queries' intermediate cardinalities do not leak information about sensitive attributes. The final cardinality is handled the same way as the intermediate cardinalities, depending on which parties can see the final result. After producing an unpadded plan using the rules above, the planner adds padding to Sensitive operators that could leak information about a table's contents. This section describes this rule-based process using the same rule syntax as Section 2.5.3. Though these rules match physical operators, we use logical plan names in some pattern specifications to denote matching all physical operators implementing a logical operator. Additionally, we refer to certain specific join types, such as referential integrity inner equi-joins, as shorthand for pattern specifications that check that these constraints are satisfied. Finally, we use Scala's @ operator to denote binding the physical operator under consideration to a named variable.

Base tables. Base tables are padded using tiered padding to hide their exact cardinality, encapsulated by the round function, which is specified by the parties. We refer to the table scan operator as t. The rule wraps t with a Pad operator that inserts dummy rows to inflate the result cardinality. The base table data must already be padded to an equal or greater bound when writing it to disk to avoid revealing the true cardinality through the file size. This enables this operator to make dummy accesses to the input when generating the dummy output rows.

case t @ TableScan() => Pad(round(t.cardinality), t)
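The round function itself is chosen by the parties; the following is a minimal sketch of one plausible tiered rounding scheme (the tier values here are assumptions for illustration, not OCQ-mandated defaults).

// Round a true cardinality up to the next tier boundary, so that only the tier
// (not the exact row count) is revealed by the padded size.
def round(cardinality: Long,
          tiers: Seq[Long] = Seq(1000L, 10000L, 100000L, 1000000L)): Long =
  tiers.find(_ >= cardinality).getOrElse {
    // Above the largest tier, fall back to rounding up to the next power of two.
    Iterator.iterate(tiers.last)(_ * 2).dropWhile(_ < cardinality).next()
  }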

Filters. Most filters must be padded to the input size to avoid leaking selectivity. The else clause of the rule below transforms the filter into a projection that adds a boolean field with the value of the filter predicate using Scala's :+ operator, which appends this column to the existing columns. Later operators use this tombstone-like field to determine whether or not to include the record in their operations. When at most one record is being selected, such as to extract the record of a single patient based on SSN, padding to the input size would be very wasteful. OCQ identifies this case by checking whether the predicate is an equality comparison against a unique key using the uniquelyReferences method. The if clause below pads the result to a cardinality of 1 to avoid leaking whether or not the predicate matched a row. Within this rule, which branch is taken depends only on the schema and query, not the underlying data, so it does not reveal any new information to an attacker.

case f @ Filter(pred, child) =>
  if (pred.uniquelyReferences(child.keys)) Pad(1, f)
  else Project(child.cols :+ pred, child)

Joins. Arbitrary joins must be padded to the product of the input table sizes, but common join types have much smaller upper bounds. Unique key equijoins are a very common join type where one side's join attribute is known to be unique. We pad these joins to the size of the other table; the unique key ensures that each record in the latter matches at most one record in the former. For brevity, we omit the code listing for this rule; an illustrative sketch follows.
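For illustration only, a plausible form of the omitted rule, written in the same style as the other padding rules, might be the following; the helper isUniqueKeyEquiJoin is an assumed name and not part of OCQ's code.

case j @ Join(cond, left, right) if cond.isUniqueKeyEquiJoin(left.keys) =>
  // Each right-side record matches at most one left-side record, so padding the
  // output to the right input's (already padded) cardinality is sufficient.
  Pad(right.cardinality, j)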

Aggregations. Arbitrary aggregations must be padded to the input size, because each input record could belong to a different group. However, foreign key constraints let us refine this bound. For example, an aggregation over patients' diseases can yield at most one row per disease if there is a foreign key constraint between the (sensitive) patient diagnosis table and the (public) disease table, because the foreign key constraint implies that all patients' diseases correspond to a record in the disease table. Without the foreign key constraint, we must pad the aggregate output to the number of patient diagnoses, which could be large. With the constraint, we can instead pad to the number of diseases. We search for such foreign key constraints using publicTableKeys in the rule below. If there is a matching foreign key, the Some(...) case determines the appropriate cardinality, using min to ensure the new bound is smaller. Otherwise, the None case pads to the input cardinality.

case a @ Aggregate(groupCols, aggExprs, child) =>
  publicTableKeys.find(groupCols.output) match {
    case Some((tbl, key)) =>
      Pad(min(tbl.cardinality(key), child.cardinality), a)
    case None =>
      Pad(child.cardinality, a)
  }

2.5.5 Pre-specified padding bounds

Automatic padding bound search cannot always find the optimal bound. First, it assumes that while the exact cardinality of a base table is sensitive, rounded cardinalities are safe to share. Yet for some tables, even an order-of-magnitude approximation of the cardinality is sensitive. Second, it relies on foreign key constraints to determine padding bounds for individual columns. However, many columns use implicit domains without foreign key constraints, such as ages, genders, letter grades, salaries, and ZIP codes. Third, most relational operators cannot reduce the padding bounds of their output, potentially resulting in large slowdowns.

We therefore support pre-specified padding bounds in schemas and queries. When defining the shared schema, parties can specify cardinality bounds for any table or column. When defining the allowed queries, a query may contain padding bounds for any intermediate result, for example the result of a filter expected to be highly selective. We exposed this functionality using extension methods on Spark SQL's DataFrame API.

Below is an example of specifying table and column cardinality bounds and a bound for a filter operator.

val diseaseDF = spark.load(".../disease/")
  .sizeBound(70000)
val employeeDF = spark.load(".../employee/")
  .colBounds("salary" -> 20, "addressZIP" -> 42000)
val singleEmployee =
  employeeDF.filter($"name" === "John Doe").sizeBound(1)

In case a pre-specified bound underestimates the actual output cardinality, OCQ silently truncates the result to avoid leakage. If the parties wish to know when this has occurred, they can express this using subqueries.

2.6 Coopetitive algorithms

The coopetitive setting affects the performance of secure operators because it implies mixed-sensitivity computation and wide-area communication. Mixed-sensitivity computation occurs when a party combines its own data with sensitive data from another party. Wide-area communication occurs for operators such as aggregation and join that require combining data from multiple parties. We describe two algorithms that leverage the coopetitive setting to reduce the amount of oblivious computation needed compared to previous approaches.

2.6.1 Mixed-sensitivity join

Like Opaque's oblivious join algorithm, OCQ performs oblivious joins using bitonic sorting networks, which have traditionally been used in databases for SIMD parallelism [14] but which we use to protect against access pattern side channels. However, unlike in Opaque, oblivious joins in OCQ are very likely to be of mixed sensitivity. For example, a federated join between two Sensitive relations will be executed as an oblivious partial join at each party between that party's slice of one relation and the entire federation's records for the other relation. In this case, it is unnecessary to handle the former relation obliviously, since the owner of that relation is also the party processing it.

We therefore introduce a mixed-sensitivity join algorithm. We refer to the former relation as the non-sensitive relation and the latter as the sensitive relation. First, the non-sensitive relation is sorted using a conventional external sort. Next, the sensitive relation is obliviously sorted using the algorithm described in Section 2.2.2. The two sorted relations are then merged using an oblivious bitonic merge, illustrated in Figure 2.5a.

When joining two relations of equal size, asymptotic analysis shows that mixed-sensitivity join represents a constant-factor improvement. If the two relations contain n/2 elements each, an oblivious sort of the union requires O(n log^2 n) comparisons, while the mixed-sensitivity algorithm requires O((n/2) log (n/2)) comparisons for the conventional external sort, O((n/2) log^2 (n/2)) comparisons for the sensitive-relation oblivious sort, and O((n/2) log n) comparisons for the bitonic merge. Figure 2.5b demonstrates an empirical speedup due to mixed-sensitivity join of up to 2.5× for large inputs.
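As an illustration of the final merge step, the following is a minimal single-machine sketch of a bitonic merge (assuming a power-of-two input length); the real operator works on encrypted, partitioned data inside enclaves rather than on a plain in-memory array.

// Merge a bitonic sequence (an ascending run followed by a descending run) in
// place. The compare-and-swap schedule is fixed in advance, so the memory access
// pattern is independent of the data values.
def bitonicMerge(a: Array[Int], lo: Int, n: Int): Unit = {
  if (n > 1) {
    val m = n / 2
    for (i <- lo until lo + m) {
      if (a(i) > a(i + m)) {
        val t = a(i); a(i) = a(i + m); a(i + m) = t
      }
    }
    bitonicMerge(a, lo, m)
    bitonicMerge(a, lo + m, m)
  }
}

// Usage: reverse one sorted relation so the concatenation is bitonic, then merge.
val sensitiveSorted    = Array(1, 4, 6, 7)   // obliviously sorted
val nonSensitiveSorted = Array(2, 3, 5, 8)   // conventionally sorted
val merged = sensitiveSorted ++ nonSensitiveSorted.reverse
bitonicMerge(merged, 0, merged.length)       // merged is now fully sorted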


Figure 2.5: (a) Bitonic merge network for 8 elements. (b) Speedup from mixed-sensitivity join compared to standard oblivious join for two relations of equal size. The x axis indicates the size of each relation in number of records. For small joins, other costs dominate, but once each relation contains more than 10^5 records, the mixed-sensitivity join shows a speedup of up to 2.5× compared to conventional oblivious join.

In addition, when the sensitive relation is relatively small, mixed-sensitivity join becomes arbitrarily faster than a standard oblivious join.

2.6.2 Coopetitive aggregation

Opaque implements aggregation using an initial distributed oblivious sort on the grouping attributes to colocate records that belong to the same group. However, in the coopetitive setting, this results in excessive wide-area data movement. Instead, we implement aggregation by computing partial aggregates at each party and sorting only those partial aggregates across the wide area, with padding to hide the exact number of groups from each party. The partial aggregates from each party are padded as described in Sections 2.5.4 and 2.5.5. The final aggregation is then performed as in Opaque, with a boundary processing step, a parallel scan over the sorted partial aggregates to produce final aggregates and dummy records, and, if a user-specified padding bound on the output is provided, an oblivious sort and filter to remove the appropriate number of dummies. The speedup from this approach comes from avoiding the initial global oblivious sort in favor of an oblivious sort over only the partial aggregates. When the bound on the number of groups is small, this produces a substantial performance gain.
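The per-party step before the wide-area phase can be sketched as follows; this is a simplified local illustration with assumed names, not OCQ's actual operator code.

case class PartialAgg(group: String, sum: Long, isDummy: Boolean = false)

// Conventional (non-oblivious) hash aggregation over a party's own data,
// padded with dummy records so the party ships exactly `bound` partial
// aggregates regardless of how many groups it actually has.
def partialAggregateAndPad(rows: Seq[(String, Long)], bound: Int): Seq[PartialAgg] = {
  val partials = rows.groupBy(_._1).map { case (g, vs) =>
    PartialAgg(g, vs.map(_._2).sum)
  }.toSeq
  // If the bound were too small, OCQ would truncate; here we simply require it.
  require(partials.size <= bound)
  partials ++ Seq.fill(bound - partials.size)(PartialAgg("", 0L, isDummy = true))
}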

2.6.3 Obliviousness proofs

Now that we have presented OCQ's algorithms, we proceed to sketch the proof of its obliviousness, as formulated in Section 2.4.3.

Proof of Theorem 1. In this proof, we invoke the simulators for the oblivious building blocks that already exist: bitonic sort, bitonic merge, the column sort primitives, and the Opaque operators in oblivious-pad mode. A query q can consist of different tasks, and it suffices to prove that the simulators can simulate the trace for each task. Here, we present simulators for the more complex algorithms OCQ contributes: mixed-sensitivity join and coopetitive aggregation. For each physical operator O in the physical plan, the planner rules ensure that O runs in oblivious mode if the inputs to O contain any sensitive table owned by a party other than i. Without loss of generality, consider party i. Recall that mixed-sensitivity join occurs between party i's own input A_i and a slice of the other parties' input B_i. If O is a mixed-sensitivity join, the simulation proceeds as follows:

1. Sim_party sorts A_i using quicksort, because it receives A_i as input. Let Public(A_i,sorted) be the public metadata of the sorted result.

2. Sim_party invokes the simulators for bitonic sort and column sort on Public(B) to simulate an oblivious distributed sort of B on the equijoin keys. Let Public(B_i,sorted) be the metadata of the sorted result.

3. Sim_party invokes the simulator for bitonic merge on Public(A_i,sorted) and Public(B_i,sorted). Let Public(U_i,sorted) be the metadata of the sorted union.

4. Sim_party invokes Opaque's simulator for the oblivious padded join on Public(U_i,sorted) with the padding bound specified in the query.

Sim_net runs similarly to Sim_party: it performs the last three steps above.

For coopetitive aggregation, if O is the partial phase of coopetitive aggregation at party i, let the padding bound for O be b. Sim_party for party i extracts the data of the sensitive input A_i and executes a conventional hash aggregation on it, then pads its size to b. Let Public(Agg(A_i)) be the metadata of the padded partial aggregates. If O is the final phase of coopetitive aggregation, Sim_party for party i and Sim_net invoke the simulators for the bitonic sort and column sort on ⋃_i Public(Agg(A_i)). Sim_net then invokes Opaque's simulator for the oblivious padded aggregation on this metadata. □

2.7 Implementation

We implemented OCQ on top of Intel SGX and as an extension to Apache Spark SQL's Catalyst query planner and execution engine, using 2,000 lines of Scala code. OCQ builds on a version of Opaque we modified, which uses 11,000 lines of C++ enclave code and 3,000 lines of Scala code. No code changes to Spark SQL were required; both OCQ and Opaque extend Catalyst only by adding rules and strategies. The schema-aware padding requires that tables be annotated with primary and foreign key hints; since Spark SQL does not natively support key annotations, we added them as a separate extension.

Q1: Count(GroupBy(Join(A, B, "x" == "y"), "x"))
Q5: Count(GroupBy(Filter(Join(A, B, "w"),
      Contains("A.x", "xyz") && ("B.x" + "B.y" > 10) && ("A.y" > "B.y")), "x"))

Table 2.2: DJoin queries.

We implemented the row-level conditional-exchange primitive in SGX using the x86 conditional move instructions on data in registers, for which we assume that the attacker cannot see accesses to registers inside enclaves. Federated query execution is coordinated by the querying JVM, which maintains a connection to each cluster in the federation using Spark's remote query functionality via the SparkSession.

2.8 Evaluation

In this section we evaluate OCQ's performance against outsourced computation and secure multi-party computation. We measure the overhead of OCQ's security guarantees. We explore an alternative design where SGX-enabled machines are required at only one site, and show that requiring SGX at all sites provides a significant speedup. We explore the speedup provided by our security-aware query planner compared to a traditional query planner that is unaware of security. Finally, we evaluate the benefit of schema-aware padding versus the conventional filter push-up approach.

2.8.1 Setup

We performed benchmarks across 5 parties located in AWS us-east-1, AWS us-west-1, AWS eu-west-1, and AWS ap-northeast-1, and in our organization. Each party has approximately 10 MB/s bandwidth to each other party. Our organization's site has an SGX cluster with 5 machines, while the AWS sites use 5-node r5.large clusters with Intel's SGX simulation driver.

We use a federated query workload derived from previous papers on federated analytics. From SMCQL [15] we use the comorbidity and aspirin count queries described in Section 2.5.3. From DJoin [103] we use queries 1–5. The DJoin queries are listed in Table 2.2. We generated synthetic table data with the following total size per table:

1. diagnosis: 1,024,000 rows, 10 GB
2. medication: 142,972 rows, 4.3 MB
3. DJoin A: 15,000 rows, 15 MB
4. DJoin B: 15,000 rows, 15 MB


Figure 2.6: OCQ vs. competing systems. OCQ is orders of magnitude faster than SMCQL and DJoin due to its use of trusted hardware, and is faster than Opaque for most queries because it can execute initial filters in plaintext.

2.8.2 Comparison to other systems

Figure 2.6 compares OCQ to Opaque (outsourced computation) as well as SMCQL and DJoin (secure multi-party computation). For the former, since Opaque's implementation assumes at least 2 MB of non-observable memory, which speeds up its oblivious protocols considerably, we ran OCQ with the same configuration. For the latter two, we report the numbers from the SMCQL and DJoin papers because those systems are either too slow to run on our dataset or not open source. They are also insufficient for the coopetitive setting (as explained in §2.9.1).

OCQ is 1–4 orders of magnitude faster than SMCQL and DJoin due to its use of trusted hardware. Meanwhile, OCQ gains a performance advantage over Opaque, which also uses trusted hardware, for queries that begin with a substantially selective filter operation. Because the initial filter requires no communication between parties, it can be executed in plaintext at each party. The cardinality of the input to subsequent oblivious operators is then greatly reduced. For our synthetic data, the initial c. diff filter for the comorbidity query has ≈1% selectivity, as does the result of the selection and join to find patients with heart disease who were prescribed aspirin for the aspirin count query. For OCQ we specified padding bounds that reflect this selectivity, because we do not wish to treat the selectivity parameter as sensitive. In contrast, Opaque must first perform an oblivious filter over each full relation. Because intermediate relation sizes within our query workload tend to shrink as the query progresses, this initial oblivious filter tends to dominate the running time.

Our reported query times include network transfer time, including the time required to transfer the full relations to the cloud for each Opaque query. Figure 2.7 shows the proportion of query time spent in network transfer. These transfers occur in parallel, so the 5-node cluster uses its full aggregate bandwidth to transfer each relation.


Figure 2.7: OCQ vs. Opaque, highlighting network transfer time. OCQ retains an advantage even assuming an infinitely fast network because it can execute initial filters in plaintext rather than using oblivious operators on the full inputs.

For the medical queries, this transfer is the dominant factor in Opaque's query times due to the large data sizes involved. Although the uploaded data could be reused across queries, we include it in the query times for consistency with the other systems. Data reuse is not always possible, for example in the case of frequently-changing data.

We also evaluated OCQ without any non-observable memory assumption (namely, the attacker can observe accesses to any party's table data in any part of memory) to show that its performance remains much better than MPC-based systems. We compared OCQ against AgMPC [136], a state-of-the-art maliciously secure MPC framework (§2.9.1). We ran a query consisting of a referential integrity inner equi-join on two equal-sized synthetic tables, each containing two 32-bit integers. The first integer in each table was the join key. Figure 2.8 shows that OCQ is up to 219× faster than AgMPC on this query.

2.8.3 Overhead of security

Figure 2.9 compares OCQ to alternative uses of Opaque with varying security guarantees to show the overhead of OCQ's security. "Outsourced Opaque" is as before. "Plaintext federated" refers to an alternative federated configuration where all computation runs in plaintext rather than within SGX. This configuration might be suitable for a network of non-competing entities. "Outsourced Spark SQL" refers to a configuration similar to Opaque but where computation is run in plaintext rather than in SGX. This option provides no security guarantees, as a server-side attacker could access the data in full.

Figure 2.9 shows that OCQ introduces 2.2–12× overhead compared to a plaintext federated configuration due to its use of oblivious operators.


Figure 2.8: Performance of register-oblivious OCQ versus AgMPC. OCQ provides the same level of memory access pattern protection as AgMPC, but scales much better and is up to 219x faster.


Figure 2.9: Overhead of OCQ’s security. OCQ incurs 2.2–25× overhead compared to the federated and outsourced Spark SQL baselines, which provide no security.



Figure 2.10: Availability of SGX enclaves at all sites provides 1.3–1.6× speedup compared to having SGX at only one site.

Additionally, outsourced Spark SQL outperforms OCQ by up to 25× for the DJoin queries, which use small relations without any initial filters and whose complexity is entirely in the core join computation, which is expensive to perform obliviously. However, OCQ outperforms the outsourced configurations for the medical queries, which do contain initial filters. Note that, as before, we include network transfer time in query time, including the time to upload full tables for the outsourced configurations.

2.8.4 Benefit of having SGX at each site

We next explore the design choice in OCQ that each site must have its own local SGX cluster. This represents a constraint on its adoption, since SGX deployments are not yet widespread. However, we observe that some queries benefit significantly from this choice. Figure 2.10 compares OCQ to an alternative design ("OCQ w/ one SGX") where operations on multiple parties' data, such as joins and aggregations, require the data to be collected to a single SGX cluster first. We observe a 1.3–1.6× speedup over the alternative design for the medical queries because the use of broadcast joins allows the components of the broadcast join, namely an oblivious join at each party, to operate on less data. Since all oblivious joins run in parallel, this results in a speedup for the initial join, which dominates the query plans.

2.8.5 Benefit of schema-aware padding

We compare the performance of our schema-aware padding approach to a baseline query without padding and to an implementation of Opaque's "filter push-up" approach. In the latter, filters and aggregations on sensitive relations always return one output row for each input row, with rows that did not match the predicate, or rows other than the first one in each group, resulting in a dummy output row.


Figure 2.11: Schema-aware padding provides a 2.5× speedup compared to filter push-up, and is only 26% slower than the baseline plan without padding.

All dummy rows are filtered out at once as the final step of the query, thus hiding intermediate result sizes. This approach results in large costs when intermediate output is transferred over the network. With appropriate sensitivity step-down hints for the medical dataset, schema-aware padding on the aspirin count and comorbidity queries results in the same plan as a suitably designed filter push-up approach. We therefore introduce a plausible new query on the medical dataset that finds the most costly diseases by grouping the diagnosis table on disease id and computing the total cost for each disease:

diagnosis.groupBy("disease")
  .agg(sum("cost") as "total_cost")
  .sort("total_cost").select("disease").take(5)

This query results in the following physical plan without padding:

OblLimit 5
  OblProject [disease_id]
    OblSort [total_cost]
      OblAgg groupby[disease_id] agg[sum(cost_partial_sum) as total_cost]
        OblCollect
          FedAgg groupby[disease_id] agg[sum(cost) as cost_partial_sum]
            FedEncScan diagnosis

Though the top-level aggregation occurs within SGX because its input is sensitive, this plan may leak the cardinality of the partial aggregates through SGX side channels. Our reimplementation of filter push-up results in the following plan:

OblLimit 5
  OblProject [disease_id]
    OblSort [total_cost]
      OblAggPadded groupby[disease_id] agg[sum(cost_partial_sum) as total_cost]
        OblCollect
          FedPad diagnosis.cardinality
            FedAgg groupby[disease] agg[sum(cost) as cost_partial_sum]
              FedEncScan diagnosis

This plan pads both levels of aggregation to the input size, hiding their cardinalities but resulting in two costly oblivious sort operations on padded data. In contrast, our schema-aware padding recognizes that the cardinalities of the partial and final aggregates are bounded by the number of known disease codes, which is much smaller than the cardinality of the input diagnosis table and is public knowledge. It generates a nearly identical plan to filter push-up, with the difference that both aggregation operations pad to disease.cardinality instead of diagnosis.cardinality.

Figure 2.11 compares the performance of all three plans using the input described in Section 2.8.1. The lower padding bound and consequently reduced oblivious sort cardinality give our schema-aware padding approach a 2.5× speedup compared to the filter push-up approach, and it is only 26% slower than the baseline plan without padding.

2.9 Related work

2.9.1 Cryptographic approaches

SMCQL [15] and Conclave [134] use secure multi-party computation (instead of hardware enclaves) to execute federated analytics queries. Unlike OCQ, SMCQL and Conclave do not protect against a malicious attacker (the attacker is semi-honest), and their implementations support only two or three parties. Further, their join schemes rely on joining on non-sensitive attributes, unlike OCQ, which can join on sensitive data. AgMPC [136] is likely the most relevant cryptographic framework to OCQ because it is n-party, maliciously secure, and supports generic computation. As we show in §2.8, though, AgMPC is orders of magnitude slower than OCQ. DJoin [103] and private intersection-sum [71] use multi-party computation to provide certain SQL operators: DJoin supports only count queries over equi-joins, while private intersection-sum supports sum queries over set intersections. Both assume a passive attacker and incur high overhead. UnLynx [54] and MedCo [118] use partially-homomorphic encryption to provide filter-aggregate queries over multiple parties' data and are secure against malicious queriers. Compared to these systems, OCQ offers a greater range of functionality, but requires trust in hardware enclaves.

Cryptographic approaches have also been used for coopetitive machine learning training and prediction, which often involve secure aggregation [38, 19, 105, 64, 56, 100]. These approaches do not offer general analytics, and they largely assume a passive attacker or two non-colluding servers. Encrypted databases such as CryptDB [114], AlwaysEncrypted [97], and Seabed [112] perform queries over encrypted data, but are not suited for the coopetitive setting.

2.9.2 Hardware enclave approaches

Systems for single-machine or distributed computation using hardware enclaves include SCONE [12], Graphene [32], Ryoan [69], Haven [17], VC3 [121], Cipherbase [9], and Opaque [145]. These systems assume a setting where all data is controlled by one party and are not designed for the coopetitive setting. Prochlo [18] offers privacy-preserving outsourced computation over many users' data using hardware enclaves. However, it requires centralizing the data, which encounters regulatory and logistical challenges in the coopetitive setting. Ohrimenko et al. [107] provide oblivious machine learning in SGX, but do not consider federated analytics or query planning. Oblix [98] focuses on oblivious point queries and does not support analytics or the decentralized setting.

2.9.3 Unencrypted federated databases

Collaborative query planning (CQP) [144] is a proposal for decentralized query planning in a multi-party setting where information sharing policy restricts centralized planning. Queries are instead broken into subqueries, independently planned by each party, and reassembled into a federated plan. CQP shares a setting with OCQ and complements it in the case where query planning information such as statistics must be treated as sensitive. Preference-aware query optimization (PAQO) [53, 52] is a proposal to extend SQL with users' declarations of where their data should be processed in a distributed database, with applications including restricting certain data from untrusted servers. PAQO allows the user to treat these declarations, called intensional descriptions [52], as preferences to be optimized rather than hard constraints. In future work, a similar approach could be applied to OCQ.

2.9.4 Differential privacy

A complementary and synergistic direction to OCQ is differential-privacy systems like Flex [76] and Chorus [77], which offer differential privacy for SQL queries via query rewriting. The queries they produce can be used as input to OCQ; hence, one could add differential privacy to a query result before sharing it among the parties in OCQ. Shrinkwrap [16] provides differential privacy specifically for federations, and can be used with OCQ to reduce the amount of padding for intermediate results.

2.10 Summary

In this chapter we proposed OCQ, an efficient framework for oblivious coopetitive analytics using hardware enclaves. OCQ's contributions are its query planner design, which supports flexible party-specific sensitivity rules; its mechanism for propagating and refining padding upper bounds based on foreign key constraints; and its mixed-sensitivity algorithms.

Chapter 3

GraphFrames: An Integrated API for Mixing Graph and Relational Queries

We now turn to the topic of complex analytics, particularly graph analytics and querying. This chapter builds upon GraphX [59], a framework I coauthored that enables large-scale graph processing within Apache Spark rather than requiring a separate, specialized engine. In this chapter we study how to generalize the ideas behind GraphX to support pattern matching as well as graph algorithms.

3.1 Introduction

Analyzing the graphs of relationships that occur in modern datasets is increasingly important, in domains including commerce, social networks, and medicine [93, 28, 13]. To date, this analysis has been done through specialized systems like Neo4J [137], Titan [132] and GraphLab [89]. These systems offer two main capabilities: pattern matching to find subgraphs of interest [137, 65, 51] and graph algorithms such as shortest paths and PageRank [91, 89, 59].

While graph analytics is powerful, running it in a separate system is both onerous and inefficient. Most workflows involve building a graph from existing data, likely in a relational format, then running search or graph algorithms on it, and then performing further computations on the result. With isolated graph analysis systems, users have to move data manually and there is no optimization of computation across these phases of the workflow. Several recent systems have started to bridge this gap by running graph algorithms on a relational engine [59, 75, 50], but they have no support for pattern matching and do not optimize across graph and relational queries.

We present GraphFrames, an integrated system that can combine relational processing, pattern matching and graph algorithms and optimize computations across them. GraphFrames generalize the ideas behind GraphX [59] and Vertexica [75] by maintaining arbitrary views of a graph (e.g., triplets or triangles) and executing queries using joins across them. They then optimize execution across the relational and graph portions of the computation. A key challenge in achieving this goal is query planning. For this purpose, we extend a graph-aware dynamic programming algorithm by Huang et al. [67] to select among multiple input views and compute a join plan. We also propose an algorithm for suggesting new views based on the query workload.

gf = GraphFrame(vertices, edges)

pairs = gf.pattern(
    "(x:User)->(p:Product)<-(y:User)")

pairs.where(pairs.p.category == "Books")
     .groupBy(pairs.p.name)
     .count()

Listing 3.1: An example of the GraphFrames API. We create a GraphFrame from two tables of vertices and edges, and then we search for all instances of a pattern, namely two users that bought the same product. The result of this search is another table on which we can then perform filtering and aggregation. The system will optimize across these steps, e.g., pushing the filter above the pattern search.

To make complete graph analytics workflows easy to write, GraphFrames provide a declarative API similar to "data frames" in R, Python and Spark [117, 111, 11] that integrates into procedural languages like Python. Users build up a computation out of relational operators, pattern matching, and calls to algorithms, as shown in Listing 3.1. The system then optimizes across these steps, selecting join plans and performing algebraic optimizations. Similar to systems like Pig [110] and Spark SQL [11], the API makes it easy to build a computation incrementally while receiving the full benefits of relational optimization. Finally, the GraphFrames API is also designed to be used interactively: users can launch a session, define views that will aid their queries, and query data interactively from a Python shell. Unlike current tools, GraphFrames let analysts perform their complete workflow in a single system.

We have implemented GraphFrames over Spark SQL [11], and made them compatible with Spark's existing DataFrame API. We show that GraphFrames match the performance of other distributed graph engines for various tasks, while enabling optimizations across the tasks that would not happen in other systems. Support for multiple views of the graph adds significant benefits: for example, materializing a few simple views can speed up queries by 10× over the algorithm in [67]. In addition, viewing graph data as relations makes it easy to write domain-specific optimizations such as attribute-based partitioning [147]. Finally, by building on Spark, GraphFrames interoperate easily with custom UDFs (e.g., ETL code) and with Spark's machine learning library and external data sources.

In summary, our contributions are:

• A declarative API that lets users combine relational processing, pattern matching and graph algorithms into complex workflows and optimizes across them.

• An execution strategy that generalizes those in Vertexica and GraphX to support multiple views of the graph.

• A graph-aware query optimization algorithm that selects join plans based on the available views.

• An implementation and evaluation of GraphFrames on Spark.

3.2 Motivation

In this section, we highlight several use cases that require the mixing of relational processing, graph pattern matching and iterative graph algorithms. We then argue that current systems cannot address these use cases efficiently. We discuss how GraphFrames, the first unified graph analytics system, can fix these problems.

3.2.1 Use Cases

E-Commerce One very important problem e-commerce retailers need to solve is product recommendation. A representative system implementation is shown in Figure 3.1. User activities are captured and stored in different systems: transactional databases for purchases, web logs for click information, and document databases or other NoSQL databases for product review data. The first step is therefore to perform Extract-Transform-Load (ETL) to extract information on users, products, ratings, and purchases from these systems. The output is typically written to a distributed file system such as HDFS. The second step typically involves running collaborative filtering to compute predicted ratings for all users, i.e., to uncover latent ratings not present in the dataset. Collaborative filtering computations are usually implemented as iterative graph algorithms in a graph-parallel processing engine such as GraphX. The output is also usually written to a distributed file system or database for further processing. Retailers typically do not rely only on the results from the second step: they would like to further customize their recommendations based on group behavior to provide more personalized recommendations. This step involves finding user activity patterns to refine or rank recommendation results, and is well-suited for graph pattern matching. Relational processing after this step can also be carried out, e.g., aggregation, filtering, and joining with other datasets. As shown in Figure 3.1, the whole pipeline crosses multiple systems, and data gets copied into and out of the storage system multiple times. Each system has its own API and data structures.

Economic Graph LinkedIn is in the process of constructing the economic graph, which can help connect employees and employers much more effectively. For example, a recruiter at company A might want to find employees who know Golang and worked at Google, and then filter the list to those who are connected to at least one employee at company A within three hops. LinkedIn might perform such candidate recommendations for all of its supported companies.

Instagram Follow Graph Instagram suggests to users other accounts they may want to follow. This involves collaborative filtering to identify similar accounts that have already followed some other users, and is currently performed in Apache Giraph. The results are then further processed for account recommendation.

Figure 3.1: Graph Analytics Pipeline

This further processing involves ranking results from other sources using boosted trees [28]. GraphFrames can perform ETL, collaborative filtering, and boosted tree processing in a single system.

Genomics In genomics, scientists would like to find certain patterns, for example, clones with name YWXD1000 and type YAC [13]. They may then perform clustering on the found patterns. GraphFrames can easily handle this use case, which combines graph pattern matching and clustering.

3.2.2 Challenges and Requirements

All of the above use cases require a combination of Extract-Transform-Load (ETL), iterative graph algorithms, and graph pattern matching. For example, to find users with similar interests on Amazon, we need to do ETL to extract the graph. We then perform collaborative filtering to fill in the missing edges between users and products. With the user-product rating graph, we can use pattern matching to find users who co-bought similar products. For computation on the economic graph, we need graph pattern matching to find related candidates, and we can run iterative graph algorithms to identify communities of users with similar skill sets. Instagram suggests users to follow by extracting the follow graph in an ETL step, running collaborative filtering using Apache Giraph, and ranking potential accounts to suggest using a separate machine learning system. For genomics, we need graph pattern matching and iterative graph algorithms to perform clustering.

However, there is currently no single system capable of efficiently executing all three computations: relational or dataflow processing, graph pattern matching, and iterative graph algorithms. As shown in Figure 3.1, a typical graph analytics pipeline may begin by using a dataflow system such as MapReduce or Apache Spark to load a dataset from HDFS and extract the graph. Once the graph is extracted, it may be written back into HDFS. Then Apache Giraph or GraphX will process the graph and output the results. For the Amazon use case, graph pattern matching is then performed to compute users who bought similar products. The analytics pipeline can be repeated many times during exploratory studies.

There are three major drawbacks of the current implementation.

• The graph analytics pipeline crosses different systems multiple times, and intermediate results are written to a distributed file system. This has serious performance overheads and cannot keep up with interactive exploratory studies.

• No end-to-end optimization can be carried out, and there is no sharing of intermediate state or data structures.

• Different systems have different APIs. There is no single abstraction to ease system development.

3.2.3 Need for a Unified System

To support graph analytics in both batch and interactive modes, the system needs to be responsive and to support a unified programming model. Ideally, we would like a single system to perform ETL, iterative graph algorithms and graph pattern matching. GraphFrames provide the ability to stay within a single framework throughout the analytics process. This removes the need to learn and support multiple systems (e.g., Figure 3.1) and the plumbing to move between them. As a result, it is substantially easier to interactively transform and compute on large graphs, and to share data structures across stages of the pipeline.

GraphFrames support the above-mentioned use cases well. GraphFrames, built on top of Spark DataFrames, can perform powerful ETL processing, e.g., to construct the economic graph. GraphFrames support a concise and declarative API for graph pattern matching and iterative graph algorithms. The API is language-integrated with Scala, which makes it easy to use language features to create user-defined functions (UDFs).

3.3 GraphFrame API

The main programming abstraction in GraphFrames' API is a GraphFrame. Conceptually, it consists of two relations (tables) representing the vertices and edges of the graph, as well as a set of materialized views of subgraphs. The vertices and edges may have multiple attributes that are used in queries. The views are defined using patterns to match various shapes of subgraphs, as we shall describe in Section 3.3.2. For example, a user might create a view of all the triangles in the graph, which can then be used to quickly answer other queries involving triangles.

GraphFrames expose a concise language-integrated API that unifies graph analytics and relational queries. We based this API on DataFrames, a common abstraction for data science in Python and R [111, 117] that is also available as a declarative API on Spark SQL [11]. In this section, we first cover some background on DataFrames, and then discuss the additional operations available on GraphFrames. We demonstrate the generality of GraphFrames for analytics by mapping the core primitives in GraphX into GraphFrame operations. Finally, we discuss how GraphFrames integrate with the rest of Apache Spark (e.g., the machine learning library).

All the code examples are shown in Python. We show the GraphFrame API itself in Scala because it explicitly lists data types.

3.3.1 DataFrame Background

DataFrames are the main programming abstraction for manipulating tables of structured data in R, Python, and Spark. Different variants of DataFrames have slightly different semantics. For the purpose of this chapter, we describe Spark's DataFrame implementation, which we build on [11]. Each DataFrame contains data grouped into named columns, and keeps track of its own schema. A DataFrame is equivalent to a table in a relational database, and can be transformed into new DataFrames using various relational operators available in the API. As an example, the following code snippet computes the number of female employees in each department by performing aggregation and a join between two data frames:

employees
  .join(dept, employees.deptId == dept.id)
  .where(employees.gender == "female")
  .groupBy(dept.id, dept.name)
  .agg(count("name"))

employees is a DataFrame, and employees.deptId is an expression representing the deptId column. Expression objects have many operators that return new expressions, including the usual comparison operators (e.g., == for equality test, > for greater than) and arithmetic ones (+, -, etc). They also support aggregates, such as count("name"). Internally, a DataFrame object represents a logical plan to compute a dataset. A DataFrame does not need to be materialized until the user calls a special "output operation" such as save. This enables rich optimization across all operations that were used to build the DataFrame.1

In terms of data type support, DataFrame columns support all major SQL data types, including boolean, integer, double, decimal, string, date, and timestamp, as well as complex (i.e., non-atomic) data types: structs, arrays, maps and unions. Complex data types can also be nested together to create more powerful types. In addition, DataFrames also support user-defined types [11].

3.3.2 GraphFrame Data Model

A GraphFrame is logically represented as two DataFrames: an edge DataFrame and a vertex DataFrame. That is, edges and vertices are represented in separate DataFrames, and each of them can contain attributes of any of the supported types. Take a social network graph as an example. The vertices can contain attributes including name (string), age (integer), and geographic location (a struct consisting of two floating point values for longitude and latitude), while the edges can contain an attribute recording the time at which one user friended another (timestamp). The GraphFrame model supports user-defined attributes on each vertex and edge, and thus is equivalent to the property

1This aspect of Spark DataFrames is different from R and Python; in those languages, DataFrame contents are materialized eagerly after each operation, which precludes optimization across the whole logical plan [11].

graph model used in many graph systems, including GraphX and GraphLab. GraphFrames are more general than Pregel/Giraph, since GraphFrames support user-defined attributes on edges. Similar to DataFrames, a GraphFrame object is internally represented as a logical plan, and as a result the declaration of a GraphFrame object does not necessarily imply the materialization of its data. Next, we explain how a GraphFrame can be constructed and the operations available on it.

class GraphFrame {
  // Different views on the graph
  def vertices: DataFrame
  def edges: DataFrame
  def triplets: DataFrame

  // Pattern matching
  def pattern(pattern: String): DataFrame

  // Relational-like operators
  def filter(predicate: Column): GraphFrame
  def select(cols: Column*): GraphFrame
  def joinV(v: DataFrame, predicate: Column): GraphFrame
  def joinE(e: DataFrame, predicate: Column): GraphFrame

  // View creation
  def createView(pattern: String): DataFrame

  // Partition function
  def partitionBy(cols: Column*): GraphFrame
}

Listing 3.2: GraphFrame API in Scala

Graph Construction A GraphFrame can be constructed using two DataFrames: a vertex DataFrame and an edge DataFrame. A DataFrame is merely a logical view (plan) and can support a wide range of sources that implement a data source API. Some examples of a DataFrame input include:

• a table in Spark SQL's system catalog
• a table in an external relational database through JDBC
• JSON, Parquet, Avro, or CSV files on disk
• a table in memory in columnar format

• a set of documents in ElasticSearch or Solr
• results from relational transformations on the above

The following code demonstrates constructing a graph using a user table in a live transactional database and an edges table from JSON-based log files in Amazon S3:

users = read.jdbc("mysql://...")
likes = read.json("s3://...")
graph = GraphFrame(users, likes)

Again, since DataFrames and GraphFrames are logical abstractions, the above code does not imply that users, likes, or graph are materialized.

Edges, Vertices, Triplets, and Patterns A GraphFrame exposes four tabular views of a graph: edges, vertices, triplets, and a pattern view that supports specifying graph patterns using a syntax similar to the pattern language in Neo4j [137]. The edges view and the vertices view should be self-evident. The triplets view consists of each edge and its corresponding source and destination vertex attributes. It can be constructed using the following three-way join:

e.join(v, v.id == e.srcId)
 .join(v, v.id == e.dstId)

We provide it directly since the triplets view is used commonly enough. Note that the edges, vertices, and triplets views are also the three fundamental views in GraphX, so GraphFrames are at least as expressive as GraphX from the perspective of views. In addition to the three basic tabular views, a GraphFrame also supports a pattern operator that accepts a graph pattern in a Cypher-like syntax and returns a DataFrame consisting of the edges and vertices specified by the pattern. This pattern operator enables easy expression of pattern matching on graphs. Typical graph patterns consist of two nodes connected by a directed edge relationship, which is represented in the format ()-[]->(). Nodes are specified using parentheses (), and relationships are specified using square brackets, []. Nodes and relationships are linked using an arrow-like syntax to express edge direction. The same node may be referenced in multiple relationships, allowing relationships to be composed into complex patterns. Additionally, nodes and edges can be constrained using inline type predicates expressed using colons. For example, the following snippet shows a user u who viewed both item x and item y:

(u:Person)-[viewed]->(x: Item), u-[viewed]->(y: Item)

The resulting DataFrame from the above pattern contains three structs: u, x, and y. The pattern operator is a simple and intuitive way to specify pattern matching; under the hood it is implemented using the join and filter operators available on a GraphFrame. We provide it because it is often more natural to reason about graphs using patterns than using relational
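To make the triplets construction concrete, the following is a minimal, self-contained pyspark sketch of the join described above. The tiny vertex and edge tables and the column names (id, srcId, dstId) follow this chapter's conventions and are made up for illustration; this is not the GraphFrames implementation itself.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("triplets").getOrCreate()

v = spark.createDataFrame([(1, "alice"), (2, "bob"), (3, "carol")], ["id", "name"])
e = spark.createDataFrame([(1, 2, "2015-03-01"), (2, 3, "2015-04-01")],
                          ["srcId", "dstId", "friendedAt"])

# Join the edge table with the vertex table twice, once for the source vertex
# and once for the destination vertex, to form the triplets view.
triplets = (e.join(v.alias("src"), F.col("src.id") == F.col("srcId"))
             .join(v.alias("dst"), F.col("dst.id") == F.col("dstId")))
triplets.show()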

Figure 3.2: View Reuse

joins. The pattern operator can also be combined with other operators, as demonstrated in the next subsection.

Programmatic Pattern Generation and Pattern Library Patterns such as cliques can be cumbersome to specify for interactive queries. We therefore provide a pattern library to programmatically generate and compose patterns. For example, a star of size K can be specified as (hub, spokes) = star(nodePred, edgePred, K), where nodePred and edgePred filter out nodes and edges. The call returns a hub and a list of K-1 spokes; their names can then be used in further pattern specifications, or materialized immediately.
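Since the pattern library itself is not listed in this chapter, the following self-contained Python sketch only illustrates what programmatic pattern generation might look like: it builds a star of size K in the Cypher-like syntax of the pattern operator and returns the hub and spoke names for later use. The function body and its simplified signature are hypothetical, and the node and edge predicates from the text are omitted for brevity.

def star(k, hub="hub"):
    # Build a K-node star: one hub connected to K-1 spokes, expressed in the
    # pattern syntax accepted by the pattern operator.
    spokes = ["s%d" % i for i in range(1, k)]
    pattern = "; ".join("(%s)-[]->(%s)" % (hub, s) for s in spokes)
    return hub, spokes, pattern

hub, spokes, pattern = star(4)
print(pattern)   # (hub)-[]->(s1); (hub)-[]->(s2); (hub)-[]->(s3)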

View Creation To enable reuse of computation, GraphFrames support view creation. The system materializes the view internally and uses it to speed up subsequent queries. For example, in Figure 3.2, if we create a view on triplets, we can reuse it to create a triangle view. The system will avoid rematerializing the triplet view for computing the triangle view. We will discuss how our query planner performs view selection to optimize the computation in the next section.
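As a usage sketch (following the prototype API of Listing 3.2, with g an existing GraphFrame; this is not the open-source graphframes package), the triplet and triangle views from Figure 3.2 could be registered as follows. The system materializes the triplet view once and can reuse it when building the triangle view.

triplets = g.createView("(a)-[e]->(b)")
triangles = g.createView("(a)-[]->(b); (b)-[]->(c); (c)-[]->(a)")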

Relational Operators Since a GraphFrame exposes the four tabular views, it already supports all the relational operators (e.g. project, filter, join) available on these tabular views. Relational operators on these views can already support many graph analysis algorithms. For example, the following snippet computes the out-degree for each vertex: g.edges.groupBy(g.edges.srcId).count() In addition to relational operators on the tabular views, a GraphFrame also includes a few relational-like operators on a graph:

Project The select operator projects the relevant columns of the corresponding vertices DataFrame and edges DataFrame. The result is a graph with a subset of the vertex attributes

and edge attributes. For example, gf1 = gf.select(gf.vertices.age, gf.edges.likes) projects only the age attribute on vertices and the likes attribute on edges. The select operator also supports all the expressions available in DataFrames, including equality, arithmetic, and string functions.

Filter The filter operator, as the name implies, filters the graph based on edge or vertex attributes and returns the subgraph matching the filter. If a node is filtered, all the edges associated with it will be filtered out, and vice versa.

Join The joinV operator creates a new graph by joining the existing graph’s vertices with a different DataFrame and produces a new graph with more attributes for each vertex. It supports all common join types such as inner join, left outer, right outer, full outer, and semi. It has two common use cases. The first is to enhance a graph by filling in more attributes. For example, the following snippet enhances an existing social graph by adding user interest information: g.joinV(interests, g.vertices.userId == interests.userId, "right_outer") The second use case is to combine joinV with select to model the Apply-Scatter phase in the Gather-Apply-Scatter (GAS) abstraction in GraphLab. This is similar to GraphX’s leftJoinV operator, except the GraphFrame join is more general because it can join on arbitrary attributes. Similarly, the joinE operator creates a new graph by joining the existing graph’s edges with a different DataFrame and produces a new graph with more attributes for each edge.

Attribute-Based Partitioning Similar to GraphX, GraphFrames by default partition a graph based on the natural partitioning scheme of the edges. In [59], it was shown that natural partitioning can lead to good performance when the input is pre-partitioned. In addition, GraphFrames support partitioning a graph based on arbitrary vertex or edge attributes. This is more general than GraphX or Giraph, which only support partitioning on vertex identifiers, and it enables users to partition a graph based on domain-specific knowledge that can lead to strong data locality and minimize data communication. Take the Amazon dataset [93] for example. The following snippet partitions the bipartite graph based on product categories:

g.partitionBy(g.vertices.productCategory)

Intuitively, customers are more likely to buy products in the same category. Partitioning the Amazon graph this way puts products of the same categories and their associated edges closer to each other. In Section 3.6.4, we demonstrate that this partitioning scheme does lead to a substantial reduction in data communication.

User-defined Functions GraphFrames also support arbitrary user-defined functions (UDFs) in Scala, Java, and Python. The udf function accepts a lambda function as input and creates an expression that can be used in relational and graph projections. For example, given a model object for a machine learning model, we could create a UDF predicting some user behavior based on users' age and registrationTime attributes:

model: LogisticRegressionModel = ...
predict = udf(lambda x, y: model.predict(Vector(x, y)))
g.select(
  predict(g.vertices.age, g.vertices.registrationTime))

Unlike database systems, which often require UDFs to be defined in a separate programming environment different from the primary query interface, our GraphFrame API supports inline definition of UDFs. We do not need the complicated packaging and registration processes found in other database systems.

3.3.3 Generality of GraphFrames

As simple as it is, the GraphFrame abstraction is powerful enough to express many workloads. We demonstrate its expressiveness by mapping all GraphX operators to operators in GraphFrames. Since GraphX can be used to model the programming abstractions in GraphLab, Pregel, and BSP [59], by mapping GraphX operations to GraphFrames we demonstrate that GraphFrames are at least as expressive as GraphX, GraphLab, Pregel, and BSP. GraphX's operators can be divided into three buckets: collection views, relational-like operators, and graph-parallel computations. Section 3.3.2 already demonstrated that GraphFrames provide all three fundamental collection views in GraphX (edges, vertices, and triplets). All relational-like operators in GraphX can be trivially mapped one-to-one to GraphFrame operators. For example, the select operator is a superset of GraphX's mapV and mapE operators, and joinV and joinE are generalized variants of GraphX's leftJoinV and leftJoinE operators. The filter operator is a more general version of GraphX's subgraph operator. In GraphX, graph-parallel computations consist of aggregateMessages2 and its variants. Similar to the Gather phase in the GAS abstraction, aggregateMessages encodes a two-stage process of graph-parallel computation. Logically, it is the composition of a projection followed by an aggregation on the triplets view. In [59], it was illustrated using the following SQL query:

SELECT t.dstId, reduceF(mapF(t)) AS msgSum
FROM triplets AS t
GROUP BY t.dstId

This SQL query can be expressed using the following GraphFrame operators:

g.triplets
  .select(mapF(g.triplets.attribute[s]).as("mapOutput"))
  .groupBy(g.triplets.dstId)
  .agg(reduceF("mapOutput"))

2aggregateMessages was called mrTriplets in [59], but renamed in the open source GraphX system.

We demonstrated that GraphFrames can support all the operators available in GraphX and consequently can support all operations in GraphLab, Pregel, and BSP. For convenience, we also provide an API similar to GraphX's Pregel variant in GraphFrames for implementing iterative algorithms. We have also implemented common graph algorithms, including connected components, PageRank, and triangle counting.
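To illustrate how an iterative algorithm decomposes into the same relational operations, here is a self-contained pyspark sketch of a single PageRank iteration expressed as a join followed by an aggregation over a toy edge table. This is only an illustration of the relational formulation under made-up data and column names, not the GraphFrames implementation.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("pagerank-step").getOrCreate()

edges = spark.createDataFrame([(1, 2), (1, 3), (2, 3), (3, 1)], ["srcId", "dstId"])
ranks = spark.createDataFrame([(1, 1.0), (2, 1.0), (3, 1.0)], ["id", "rank"])

out_deg = (edges.groupBy("srcId").count()
                .withColumnRenamed("srcId", "src")
                .withColumnRenamed("count", "outDeg"))

# "Map" over triplets: each edge sends rank / outDeg to its destination vertex.
msgs = (edges.join(ranks, edges["srcId"] == ranks["id"])
             .join(out_deg, F.col("srcId") == F.col("src"))
             .select(F.col("dstId").alias("id"),
                     (F.col("rank") / F.col("outDeg")).alias("msg")))

# "Reduce": sum the messages per destination vertex and apply the damping factor.
new_ranks = msgs.groupBy("id").agg((0.15 + 0.85 * F.sum("msg")).alias("rank"))
new_ranks.show()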

3.3.4 Spark Integration

Because GraphFrames build on top of Spark, they gain three benefits. First, GraphFrames can load data from and save data to existing Spark SQL data sources such as HDFS files in JSON or Parquet format, HBase, Cassandra, etc. Second, GraphFrames can use the growing list of machine learning algorithms in MLlib. Third, GraphFrames can call the Spark DataFrame API. As an example, the following code reads user-product rating information from HDFS into a DataFrame, selects the review text, and uses the (user ID, product ID) pair as the key. We can then call the topic model to learn the topics of the reviews. With the topics, we can compute similar products as in [93] and do graph pattern matching to uncover communities of users who bought similar products.

corpus = rating.read.parquet("hdfs:///...")
  .select(pair(user_id, product_id), review_txt)
ldaModel = LDA.train(corpus, k=10000)
topics = ldaModel.topicsMatrix()

3.3.5 Putting It Together

We show the ease of developing an end-to-end graph analytics pipeline with an example in Listing 3.3. As we discussed in Section 3.2, for e-commerce we would like to group users with similar interests. The first step is to perform ETL to extract information on users, products, ratings, and co-purchased products (cobought); each is represented as a DataFrame. We then construct a GraphFrame graph. The vertices contain both user nodes and product nodes, and an edge exists between a user and a product if the user rated the product, so this is a bipartite graph. For the second step, we run collaborative filtering to compute predicted ratings for users, i.e., to uncover latent ratings not present in the dataset. We then create a graph densifiedGraph with the same vertices as graph and more edges, adding a product-product edge whenever the two products were bought together. As the final step, we find users who gave good ratings to related products. Instead of finding pairs of users, we can also find groups of users of size K. This example shows the ease of using the GraphFrames API: we performed ETL, iterative graph algorithms, and graph pattern matching in one system, which is much more intuitive than coding the pipeline in SQL. Language integration also makes it easy to plug in UDFs; for example, we could create a UDF to extract product topics and the topics a user is interested in. In the next section, we highlight the opportunities for joint optimization.

# 1. ETL
# users: [id: int, attributes: MapType(user_name)]
# products: [id: int, attributes: MapType(brand,
#   category, price)]
# ratings: [user_id: int, product_id: int,
#   rating: int, review_text: string]
# cobought: [product_1_id: int, product_2_id: int]

vertices = users.union(products)
graph = GraphFrame(vertices, ratings)

# 2. Run ALS to get top 1M inferred recommendations
# predictedRatings: [user_id: int, product_id: int,
#   predicted_rating: int]
predictedRatings = ALS.train(graph, iterations=20)
  .recommendForAll(1e6)

densifiedGraph = GraphFrame(vertices,
  ratings.union(predictedRatings).union(cobought))

# 3. Find groups of users with the same interests
densifiedGraph.pattern("""(u1)-[r1]->(p1);
  (u2)-[r2]->(p2);
  (p1)-[]->(p2)""")
  .filter("r1.rating > 3 && r2.rating > 3")
  .select("u1.id", "u2.id")

Listing 3.3: GraphFrame End-to-End Example in Python

3.4 Query Optimization

GraphFrame operators, including both the graph as well as the relational operators, are compiled to relational operators. Thereafter, we optimize the complete pipeline by extending Catalyst. To do this, the GraphFrame query planner extends the dynamic programming algorithm of Huang et al. [67] to the distributed setting, with two major changes:

View rewrite The user can register arbitrary materialized views and the planner will automatically rewrite the query to reuse a materialized view when appropriate. This is useful because pattern queries could be very expensive to run and reusing computations across several queries can improve the user experience. GraphFrame API also allows users to get suggestions for the views to create. Finally, we also describe how we can further extend the query planning to create the views adaptively.

Vertex cut/2D partitioning-aware planning The system partitions distributed graphs using vertex cuts, which are more communication-efficient than the more commonly-used edge cuts. The planner is aware of the vertex cut layout as well as a 2D partitioning heuristic which provides a communication bound, and it chooses join sites appropriately to minimize communication. By building on top of Spark SQL, GraphFrames also benefit from whole-stage code generation.

3.4.1 Query Planner

The dynamic programming algorithm proposed in [67] recursively decomposes a pattern query into fragments, the smallest fragment being a set of co-partitioned edges, and builds a query plan in a bottom-up fashion. The original algorithm considers a single input graph; in this chapter, we extend it to views, i.e., the algorithm matches the views in addition to matching the pattern query. The input to the algorithm is the base graph G, a set of graph views {GV1, GV2, .., GVn}, and the pattern query Q = (Vq, Eq). Each graph view GVi consists of the view query that was used to create the view and a cost estimator CEi. The algorithm also takes the partitioning function P as an input, as opposed to a fixed partitioning in the original algorithm. The output is the best (lowest-cost) query plan to process the query Q. The algorithm starts by recursively decomposing the pattern query into smaller fragments and building the query plan for each fragment. At the leaf level, i.e., when the fragment consists of only a single edge, we look up the edge (along with its associated predicates) in the base graph. At higher levels, we combine the child query plans to produce larger plans. At each level, we also check whether the query fragment matches the view query of any of the graph views. In case a match is found, we add the view as a candidate solution for the query fragment. This also takes care of combining child query plans from multiple graph views, i.e., we consider all combinations of the graph views and later pick the best one. Algorithm 1 shows the extended algorithm for finding plans using views. Each time a new plan is generated for a query fragment, we match the fragment against the set of graph views, as shown in Algorithm 1. Algorithm 2 shows the pseudocode for view matching. For every query plan solution, we check whether its query fragment is equivalent to a view query3 and add the view to the solution set in case a match is found. Note that we keep both solutions, one which uses the view and one which does not, and later pick the best one. Also note that we match the views on the logical query fragments in a bottom-up fashion, i.e., a view matched at lower levels could still be replaced by a larger (and thus more useful) view at higher levels. Combining graph views, however, produces a new (intermediate) graph view, so we need to consider the new (combined) cost estimate when combining it further. To handle this, we keep track of four things in the query plan solution at each level: (i) the query plan, (ii) the estimated cost, (iii) the partitioning, and (iv) the cost estimator. When combining the solutions, we combine their cost estimates as well4. Figure 3.3 illustrates query planning using views. The system takes the given pattern query and three graph views, V1, V2, and V3, as inputs. The linear decomposition (recursively) splits the query into two fragments, such that at least one of them is not decomposable (i.e., it is either a single edge or a co-partitioned set of edges). The lower left part of Figure 3.3 shows one such linear decomposition and the corresponding candidate query plan using views. Here we match a view with a query fragment only when it contains the exact same set of edges. The bushy decomposition generates query fragments none of which may be non-decomposable (i.e., each query fragment

3Instead of looking for exact match, the algorithm could also be extended to match views which contain the query fragment, as in traditional view matching literature.
4We can do this more efficiently by pre-computing the estimates for all combinations of the graph views.

Algorithm 1: FindPlanWithViews

Input: query Q; graph G; views GV1, .., GVn; partitioning P
Output: solutions for running Q
 1  if Q.sol != null then
 2      return null;  // already generated plans for Q
 3  if Q has only one edge e = (v1, v2) then
 4      Q.sol = ("match e", scan cost of e, P(e), ce);
 5      Q.sol = Q.sol ∪ MatchViews(Q.sol, views);
 6      return;
 7  if all edges in Q are co-partitioned w.r.t. P then
 8      Q.sol = ("co-located join of Q", co-located join cost, P(Q), ce);
 9      Q.sol = Q.sol ∪ MatchViews(Q.sol, views);
10      return;
11  T = φ;
12  LD = LinearDecomposition(Q);
13  foreach linear decomposition (q1, q2) in LD do
14      FindPlan(q1);
15      FindPlan(q2);
16      linearPlans = GenerateLinearPlans(q1, q2);
17      T = T ∪ linearPlans;
18      T = T ∪ MatchViews(linearPlans, views);
19  LDAGs = GenerateLayeredDAGs(Q);
20  foreach layered DAG d in LDAGs do
21      (q1, q2, ..., qn−1, qn) = BushyDecomposition(d);
22      for i from 1 to n do
23          FindPlan(qi);
24      bushyPlans = GenerateBushyPlans(q1, ..., qn);
25      T = T ∪ bushyPlans;
26      T = T ∪ MatchViews(bushyPlans, views);
27  Q.sol = EliminateNonMinCosts(T);

Algorithm 2: MatchViews

Input: query plan solution set S; graph views GV1, .., GVn
Output: query plan solutions obtained from matching views
1  V = φ;
2  foreach solution s in S do
3      foreach graph view GVi do
4          queryFragment = s.plan.query;
5          viewQuery = GVi.query;
6          if viewQuery is equivalent to queryFragment then
7              V = V ∪ ("scan", scan cost of GVi, GVi.P, CEi);
8  return V;

Figure 3.3: View Matching during Linear and Bushy Decompositions. The two decompositions yield the candidate plans (V1 ⋈ V2) ⋈ V3 and V1 ⋈ (V2 ⋈ V3), respectively.

S.No.  View
1      A ← B, A ← C, B ← F
2      A ← B, A ← C, C ← E
3      A ← B, A ← C, D ← B
4      A ← B, A ← C, E ← C
5      B ← A, B ← D, E ← B

Table 3.1: Top-5 views of size three suggested by the system for the workload in Figure 3.5.

could be further split into linear or bushy decompositions). The lower right part of Figure 3.3 shows one such bushy decomposition. We can see that the corresponding candidate query plan is different and could not have been generated by the linear decomposition alone. Our extended view matching algorithm can still leverage all of the optimizations proposed by Huang et al. [67], including considering co-partitioned edges as the smallest fragment since they can be computed locally, performing both linear and bushy decompositions of the queries, and applying cycle-based optimization.

3.4.2 View Selection via Query Planning

So far we have assumed that the user manually creates the views, and the system then generates query plans that process pattern queries using one or more of them. In some cases, however, the user may not know which views to create. The question therefore is whether the system can suggest to users the views to create in such scenarios. The nice thing about recursive query planning is that it already traverses the space of all possible query fragments that are relevant for the given query, and we can consider each of these query fragments as a candidate view. This means that every time we generate a query plan for a fragment, we add it to a global set of candidate views. In the end, we can rank the candidate views using different utility functions and present the top ones. One such function is the ratio of the cost savings due to the view to its size, i.e., the cost of materializing and storing it. Recall that each query plan solution (the candidate view) contains its corresponding cost estimator as well, which can be used to compute these metrics. The system could also collect the candidate views over several queries before presenting the interesting ones to the user. We refer the reader to the traditional view selection literature for more details [35]. The key takeaway is that we can generate interesting candidate views as a side effect of dynamic programming based query planning: the user can start running pattern queries on the input graph and later create one or more of the suggested views to improve performance. To illustrate, we ran the view suggestion API for the six-query workload from Figure 3.5. Table 3.1 shows the top-5 views of size three produced by the system.
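A self-contained Python sketch of the ranking step described above: each candidate fragment is scored by its estimated cost savings divided by its estimated size, and the top-k fragments are suggested. The fragment strings and numbers below are made up for illustration.

def suggest_views(candidates, k=5):
    # candidates: list of (fragment, estimated_savings, estimated_size) tuples
    ranked = sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)
    return ranked[:k]

candidates = [
    ("(a)-[]->(b); (b)-[]->(a)",              40.0, 1.5e6),   # 2-cycle
    ("(b)-[]->(a); (b)-[]->(c)",              25.0, 6.8e7),   # V
    ("(b)-[]->(a); (b)-[]->(c); (c)-[]->(a)", 90.0, 2.8e7),   # triangle
]
for fragment, savings, size in suggest_views(candidates, k=2):
    print(fragment, "utility = %.2e" % (savings / size))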

3.4.3 Adaptive View Creation

We discussed how the system can help users select views. However, the views are still created manually as an offline process, which is expensive, and often the utility of a view is not known a priori. Let us now see how we can leverage the query planning algorithm to adaptively create the graph views. The key idea is to start by materializing smaller query fragments and progressively combine them to create views of larger query fragments. To do this, we annotate each solution with the list of graph views it processes, i.e., each solution now has five pieces of information: (plan, cost, partitioning, estimator, {GVi}). When combining the child query plans, we union the graph views from the children. When the algorithm runs for the first time, there is only a single input graph view, which is the base graph itself. We look at all the leaf-level query plans and materialize the one(s) having the maximum utility, i.e., the most useful ones. In each subsequent run, we consider materializing the query plans which combine existing views, i.e., we essentially move the view higher up in the query tree. We still consider materializing new leaf-level plans from the base graph. Rather than combining the graph views greedily, a more sophisticated version can also keep counters on how many times each graph view is used; we can then combine the most frequent as well as most useful graph views. The above adaptive view creation technique has two major advantages. First, it amortizes the cost of creating a view over several queries: creating views at the lower levels involves fewer joins and hence is cheaper, and the system only spends more resources on creating a view if it is used more often. Second, this approach starts from more general leaf-level views, which can be used across a larger set of queries, and gradually specializes to larger views higher up in the query tree. This is useful in scenarios where a user starts from ad-hoc analysis and later converges to a specific query workload, which is plausible for pattern matching queries.

3.5 Implementation

We implemented GraphFrames as a library on top of Spark SQL. The library consists of the GraphFrame interface described in Section 3.3, a pattern parser, and our view-based query planner. We also made improvements to Spark SQL's Catalyst optimizer to support GraphFrames. Each GraphFrame is represented as two Spark DataFrames (a vertex DataFrame and an edge DataFrame) and a collection of user-defined views. Implementations of each of the GraphFrame methods follow naturally from this representation, and the GraphFrame interface is 250 lines of Scala. The GraphFrame operations delegate to the pattern parser and the query planner. Our query planner is implemented as a layer on top of Spark Catalyst, taking patterns as input, collecting statistics using Catalyst APIs, and emitting a Catalyst logical plan. At query time, the planner receives the user-specified views from the GraphFrame interface; it can additionally suggest views when requested by the user. The query planner also accepts custom attribute-based partitioners, which it uses to make more accurate cost estimates and incorporates into the generated plans.

Figure 3.4: Pattern queries [67]

To simplify the query planner, we modified Catalyst to support join elimination when allowed by the foreign key relationship between vertices and edges. This change required adding support for unique and foreign key constraints on DataFrames to Spark SQL. Join elimination enables the query planner to emit one join per referenced vertex, and joins unnecessary to produce the final output will be eliminated. This change required 800 lines of Scala. Our pattern parser uses the Scala parser combinator library and is implemented in 50 lines of Scala. Finally, building on top of Spark enables GraphFrames to easily integrate with data sources and call its machine learning libraries.

3.6 Evaluation

In this section we demonstrate that GraphFrames match the performance of specialized graph query engines, greatly exceed their performance in some cases by materializing appropriate views, and outperform a mix of systems on analytics pipelines by avoiding communication between systems and optimizing across the entire pipeline. All experiments were conducted on Amazon EC2 using 8 r3.2xlarge worker nodes in November 2015. Each node has 8 virtual cores, 61 GB of memory, and one 160 GB SSD.

3.6.1 Performance vs. Specialized Systems

We first show that GraphFrames match the performance of specialized graph query engines. We use the results reported by [67] as a baseline.

[Bar chart: runtime in seconds per query Q1-Q6 for Huang, Venkatraman, and Abadi vs. GraphFrames.]

Figure 3.5: Performance of GraphFrames compared to a specialized engine for graph queries.

We ran the six queries shown in Figure 3.4 on a web graph dataset released by Google in 2002 [86]. This graph has 875,713 vertices and 5,105,039 edges. It is the largest graph used for these queries in [67]. In Figure 3.5 we compare the performance of GraphFrames to the results from [67], which were obtained on a 10-node cluster. GraphFrames match the baseline performance for each of the six queries. Since the two systems use identical query planning algorithms in the absence of views, differences in performance are due to the different cluster configurations and execution engines (PostgreSQL in the baseline compared to Spark in the case of GraphFrames).

3.6.2 Impact of Views

We next demonstrate that materializing the appropriate views reduces query time and in some cases can greatly improve plan selection. We reran the queries in Figure 3.4 on the same dataset, but before running these queries we registered the views listed in Table 3.2. We then ran each query with and without view rewrite enabled. The results are reported in Figure 3.6. Queries 1, 2, 3, and 6 do not benefit much from views, because the main cost in these queries comes from generating unavoidably large intermediate result sets. For example, in Query 1 the bidirectional edge between vertices A and B can use the 2-cycle view, but by far the more expensive part of the plan is joining C and D to the view, because this generates all pairs of such vertices. However, in Query 4 we observe a large speedup when using views. In Query 4, the order-of-magnitude speedup is because the view equivalence check exposes an opportunity to reuse an intermediate result that the planner would otherwise miss. The reuse requires recognizing that two subgraphs are isomorphic despite having different node labels, a problem that is difficult in general but becomes much easier with the right choice of view. In particular, the Triangle view is applicable both to the BCD triangle and the BCE triangle in Query 4, so the planner can replace the naive 5-way join with a single self-join of the Triangle view with equality constraints on vertices B and C.

View      Query                 Size in Google graph
2-cycle   (a)->(b)->(a)         1,565,976
V         (c)<-(a)->(b)         67,833,471
Triangle  (a)<-(b)->(c)->(a)    28,198,954
3-cycle   (a)->(b)->(c)->(a)    11,669,313

Table 3.2: Views registered in the system to explore their impact on queries in Figure 3.4.

[Bar chart: runtime in seconds per query Q1-Q6 without views, with views, and for view creation.]

Figure 3.6: Performance of pattern queries with and without views

Additionally, in Query 5, precomputing views speeds up the main query by a factor of 2 by moving the work of computing the BCD and BED triangles from the main query into the Triangle 2 and 3-cycle views. These views are expensive to create, and since they are common patterns it is reasonable to precompute them.

3.6.3 End-to-End Pipeline Performance

We next evaluate the end-to-end performance of a multi-step pipeline that finds groups of users with the same interests in an Amazon review dataset. We will see that using Spark and GraphFrames for the whole pipeline allows more powerful optimizations and avoids the overhead of moving data across system boundaries. We ran the pipeline described in Listing 3.3 on an Amazon review dataset [93] with 82,836,502 reviews and 168,954,245 pairs of related products. Additionally, after finding groups of users with the same interests in step 3, we aggregated the result for each user to find the number of users with the same interests. To simulate running this pipeline without GraphFrames as a comparison point, we ran each stage separately using Spark, saving and loading the working data between stages. In addition to the I/O overhead, this prevented projections from being pushed down into the data scan, increasing the ETL time. Figure 3.7 shows this comparison.

[Stacked bar chart: total runtime in seconds, broken into ETL, training, and query stages; 187 s using multiple systems vs. 149 s in a single system.]

Figure 3.7: End-to-end pipeline performance in multiple systems vs. single system

3.6.4 Attribute-Based Partitioning

GraphFrames support partitioning based on arbitrary vertex or edge attributes, as discussed in Section 3.3.2. This API enables users to optimize graph partitioning based on their domain-specific knowledge. We conduct an experiment to demonstrate the benefit of applying this API to the Amazon dataset [93]. We first apply random hash partitioning to the Amazon graph and measure the average number of partitions each customer vertex belongs to. This baseline represents the average number of times the system needs to replicate vertex data in GraphX, GraphLab, and GraphFrames. We then apply partitioning based on the product categories and measure the same metric; for items that belong to multiple categories, we use only the first category. Figure 3.8 compares the average number of partitions for customer vertices under the two partitioning schemes, using varying numbers of partitions. Partitioning by product categories leads to approximately a two-fold improvement in data locality (i.e., roughly half the data communication).
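A self-contained Python sketch of the locality metric used in this experiment: for a vertex-cut layout, we count how many distinct edge partitions each customer vertex appears in and average over all customers. The toy edges and the two partitioners (hash vs. product category) are made up for illustration.

from collections import defaultdict

# (customer, product, product_category) edges of a small bipartite graph
edges = [("u1", "p1", "books"), ("u1", "p2", "books"),
         ("u1", "p3", "music"), ("u2", "p2", "books")]

def avg_replication(edges, partition_of, num_partitions=2):
    parts = defaultdict(set)
    for user, product, category in edges:
        p = partition_of(user, product, category) % num_partitions
        parts[user].add(p)   # the customer vertex must be mirrored in partition p
    return sum(len(s) for s in parts.values()) / len(parts)

hash_part = lambda u, p, c: hash((u, p))   # random (hash) partitioning of edges
attr_part = lambda u, p, c: hash(c)        # partitioning by product category

print(avg_replication(edges, hash_part), avg_replication(edges, attr_part))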

3.7 Discussion

Although we have presented one implementation of GraphFrames, we believe that the abstraction is general and could benefit from other execution engines as well as additional optimizations. We discuss several ways in which the system can be extended. 57

Figure 3.8: Benefit of Attribute-Based Partitioning

3.7.1 Execution on RDBMS Engines

Although we chose to integrate with Spark, GraphFrames could also execute over relational database engines, in a manner similar to Vertexica [75] and Grail [50]. All of the operations on GraphFrames compile into a relational operator tree, so they should be able to execute on an RDBMS engine. The main challenges in running GraphFrames over an RDBMS will be integrating our query planning algorithm (which may be possible through optimizer hints) and supporting user-defined functions (which we may replace with UDFs available in the database). Nonetheless, the system would retain the same high-level API for building workflows, and would benefit from more optimized performance and from the transactional, access control, and recovery features of the DBMS.

3.7.2 Physical Storage Since GraphFrames are built on top of Spark SQL, they are currently represented physically as vertex and edge tables using Spark SQL’s compressed columnar representation [11] (and similar tables for any views). There are other physical representations that graph data may come in, such as sparse matrices, adjacency lists and triplets. We would like to explore using these as “first-class” formats for input data and letting the system work against these formats directly.

3.7.3 Additional Join Algorithms

Our current optimization algorithm produces a tree of pairwise join operators. As part of future work, we would like to support other options, such as one-shot join algorithms over multiple tables [1] and worst-case optimal join algorithms [36]. It should be possible to integrate these algorithms into our System-R based framework.

3.7.4 Dynamically Updating Graphs GraphFrames currently do not provide support for processing dynamic graphs. In the future, we would like to develop efficient incremental graph update and processing support. We plan to leverage the newly available IndexedRDD [8] project to do this over Spark, or a relational database engine as an alternative backend. One interesting addition here will be deciding which graph views we wish to maintain incrementally as the graph changes.

3.8 Related Work

To our knowledge, GraphFrames is the first system that lets users combine graph algorithms, pattern matching and relational queries in a single API, and optimizes computations across them. GraphFrames builds on previous work in graph analytics using relational databases, query optimization for pattern matching, and declarative APIs for data analytics.

Graph Databases Graph databases such as Neo4j [137] and Titan [132] focus mostly on graph queries, often using pattern matching languages like Cypher [137] and others [119]. They have very limited support for graph algorithms such as PageRank and for connecting with relational data outside the graph. GraphFrames use the pattern matching abstraction from these systems, but can also support other parts of the graph processing workflow, such as building the graph itself out of relational queries on multiple tables, and running analytics algorithms in addition to pattern matching. They then optimize query execution across this entire workflow. GraphFrames' language-integrated API also makes it easy to call user-defined functions (e.g., ETL code) and, in our Spark-based implementation, to call into Spark's built-in libraries, giving users a single environment in which to write end-to-end workflows.

Graph-Parallel Programming Standalone systems including Pregel, GraphLab, Trinity and X-Stream [91, 89, 124, 120] have been designed to run graph algorithms, but they require separate data export and import and thus make end-to-end workflows complex to build. GraphFrames use similar parallel execution plans to many of these systems (e.g., the Gather-Apply-Scatter pattern) while supporting broader workflows.

Graph Processing over RDBMS GraphX, Vertexica, Pregelix and Grail [59, 75, 25, 50] have explored running graph algorithms on relational databases or dataflow engines. Of these, GraphX materializes a triplets view of the graph to speed up the most common join in iterative graph algorithms, while the others use the raw tables in the underlying database. GraphFrames generalize the execution strategy in these systems by letting the user materialize multiple views of arbitrary patterns, which can greatly speed up common types of queries. GraphFrames also provide a much broader API, including pattern matching and relational queries, where these tasks can all be combined, whereas previous systems only focused on graph algorithms.

Query Planning for Pattern Matching Query planning for graph pattern matching has been studied in many domains, including SPARQL, social networks, and the semantic web; [125], [55] and [51] are some surveys on this work. Most of the early work assumes that the dataset fits in memory on a single machine, and does not consider distributed plans. Several recent projects have also explored distributed graph pattern matching. [128] and [143] optimize distributed pattern matching for large vertex-labeled graphs and Resource Description Framework (RDF) graphs, while [90] looks at graph simulation queries. Huang et al. [67] provide a System-R style dynamic programming algorithm for this task, which takes into account the characteristics of the graph (e.g., that cycles will be more selective than arbitrary joins) to produce execution plans. Our work extends the algorithm in [67] to support multiple subgraph views as inputs to the planning process, and also proposes a mechanism for suggesting new views to build based on the user's queries.

Language-Integrated APIs Finally, the GraphFrames API is inspired by other declarative APIs for data processing, including Spark SQL DataFrames [11], Pig [110] and DryadLINQ [141]. These APIs are designed to be "programmer-friendly" by embedding into a procedural language or offering similar constructs, but still perform end-to-end relational optimization. Embedding into a procedural language makes it easier to build complex workflows in a modular fashion (e.g., by breaking them into functions) and to call UDFs. To our knowledge, GraphFrames are the first extension of this type of API to graphs, through our pattern matching, subgraphing and view creation operators.

Chapter 4

Arthur: Rich Post-Facto Debugging for Production Analytics Applications

Large-scale complex analytics increasingly features long-running queries, unreliable hardware and services, and heavy use of UDFs. This chapter proposes a debugger for distributed dataflow systems that addresses these challenges while incurring near-zero runtime overhead.

4.1 Introduction

Cluster computing frameworks such as Hadoop [6] and Dryad [72] have been widely adopted to enable sophisticated processing of large datasets. These systems provide a simple "data flow" programming model consisting of high-level transformations on distributed datasets, and hide the complexities of distribution and fault tolerance. While these frameworks have been highly successful, debugging the parallel applications written in them is hard. To solve correctness or performance issues, users must understand the actions of thousands of parallel tasks, which produce terabytes of intermediate data across a cluster. Debugging becomes especially difficult in production settings. Although tools for testing assertions, tracing through data flows, and replaying code exist (e.g., Inspector Gadget [109], Daphne [73], liblog [58], and Newt [43]), they invariably add overhead. For large-scale applications running 24/7, even 10% overhead can be expensive, so most operators do not use these tools in production, making bugs that occur in production time-consuming to diagnose and fix. In this chapter, we present a new debugger, Arthur, that can provide these debugging features at close to zero runtime overhead, using selective replay of parts of the computation. Unlike previous replay debuggers for distributed systems [58, 63, 3, 4], which need to log a wide array of non-deterministic events (e.g., message interleavings) and are therefore expensive, we achieve our low overhead by taking advantage of the structure of modern data-parallel applications as graphs of deterministic tasks. Task determinism is implicitly assumed by the fault and straggler mitigation techniques in these frameworks [44, 72], but we use this same feature to efficiently replay parts of the task graph for debugging.1 While the core idea of selective replay is simple, we show that it can be used to implement a rich set of debugging tools. These include:

• Forward and backward tracing of records through just the portion of the job that they affect.
• Interactive ad-hoc queries on intermediate datasets.
• Re-execution of any task in the job in a single-process profiler or debugger.
• Introduction of assertions or instrumentation (e.g., print statements) into the job.

To implement these features, Arthur must tackle several challenges. First, despite frameworks' assumption of task determinism, user error may cause nondeterministic replay, making it impossible to reconstruct a task's output. We do not aim to reproduce nondeterministic results, but instead detect them using checksums of task output across re-executions. We show that this checksumming adds minimal overhead to the original execution. Second, to enable interactive debugging, Arthur also needs to be fast at replay time. We achieve high performance by (1) only replaying the subset of the job's task graph that is needed for a particular debugging action (e.g., to rebuild the input for one task), (2) parallelizing the replay over a cluster, and (3) caching frequently queried datasets in memory to provide fast access. As a result, many debugging queries can be answered within several seconds, even for large applications. Third, Arthur's tracing feature requires tracing records across a variety of operators (e.g., map, filter, reduce, and group-by), taking into account the semantics of each operator. We perform tracing using a program transformation that augments each operator in the job to propagate tags with each record.
To make the tracing efficient, we use a compressed tag set representation based on Bloom filters. A major advantage of our program transformation approach is that we do not need to modify the parallel runtime (Spark) to propagate tags. We implement Arthur to support loading execution logs from either Hadoop or Spark [142] (a recent cluster computing framework with a concise Scala API). The system then replays the job in a Spark-based parallel runtime. It provides an interactive Scala shell from which the user can replay tasks, query intermediate datasets using arbitrary Spark queries, and run other analyses. We evaluate Arthur on a variety of synthetic errors and three real bugs in Hadoop and Spark programs. The issues we test include logic bugs such as incorrect input handling, performance problems such as data skew, and unexpected program outputs that Arthur can trace back to specific input records. In all cases, Arthur's suite of tools can quickly narrow in on the problem. At recording time, Arthur adds less than 4% overhead and produces logs at most several megabytes in size. At replay time, Arthur finishes most analyses in a fraction of the running time of the original job, thanks to selective replay and in-memory caching, and supports querying intermediate datasets in sub-second time. To summarize, our contributions are:

1Note that Arthur does not aim to debug non-deterministic problems, such as the user accidentally writing a non-deterministic task, but it will detect them using a checksum of each task's output. We show that this checksumming adds minimal overhead.

[Diagram: Clicks -> map -> (AdID, 1) -> reduceByKey -> (AdID, count) -> topK -> Top ads.]

Figure 4.1: Dependency graph for the tasks produced in our example Spark program. Tall boxes represent datasets, while filled boxes represent partitions of a dataset, which are computed in parallel by different tasks.

• The observation that data flow frameworks' decomposition of jobs into deterministic tasks, which is fundamental in these systems for fault tolerance, can also be used for low-cost replay debugging.
• A set of rich and efficient debugging tools that selectively replay only the part of the computation needed for an analysis, including ad-hoc queries and forward and backward record tracing.
• A fast, interactive environment for debugging jobs that allows ad-hoc queries in the Scala language and provides high responsiveness using parallel execution and in-memory caching.

The rest of the chapter is organized as follows. Section 4.2 describes the target environment for Arthur. We then discuss its architecture and capabilities in Sections 4.3–4.6. Section 4.7 discusses our implementation. We then evaluate Arthur (Section 4.8), discuss limitations and extensions (Section 4.9), and survey related work (Section 4.10).

4.2 Target Environment

Arthur is designed for parallel computing frameworks that use a data flow programming model, which include MapReduce [44], Dryad [72], Spark [142], Hyracks [21], Pig [110], FlumeJava [31], and others. It can also be applied to frameworks that are not typically thought of as data flow but do break computations into deterministic tasks, such as the Bulk Synchronous Parallel model in Pregel [91], where nodes operate on data locally and communicate at a barrier through the equivalent of a reduce operation. Many frameworks adopt this type of deterministic model for fault tolerance. Arthur's replay approach takes advantage of task determinism, which these frameworks already assume for straggler mitigation and fault recovery, to provide accurate replay. Nevertheless, Arthur can detect nondeterminism and alert the user to the error, as described in Section 4.3. To illustrate the structure of a data flow program, Figure 4.1 shows the lineage graph for a simple Spark program that computes the top 100 ads clicked in a log. This program's code is shown below:

val topAds = clicks.map(c => (c.adID, 1)) .reduceByKey((a, b) => a+b) .topK(100)

This program uses Spark’s functional API in the Scala language [142] to take a dataset called clicks (loaded, e.g., from a file) and run map, reduce, and topK transformations on it. The user code passed to these transformations (in this case, the Scala functions c => (c.adID, 1) and (a, b) => a+b) is expected to be deterministic. Using Arthur, we can rerun just enough portions of the dependency graph to answer a particular debugging query. For example, if one of the reduce tasks is running slowly, we could replay all of the maps and that one reduce task, without having to replay the rest of the job.

4.3 Architecture

The main idea in Arthur is that we can record a data flow program's dependency graph at runtime, and selectively replay parts of the execution at debug time to answer users' questions. In this section, we describe how Arthur performs recording and replay. At record time, Arthur runs as a daemon collocated with the cluster computing framework's master that logs several types of information. The most important is the program's dependency graph, which consists of every transformation in the program (e.g., the map and reduce in MapReduce or an operator in DryadLINQ or Spark) along with what input datasets or external files it acts on, and how it was partitioned into tasks. Arthur also records a checksum of each task's output; this allows the debugger to compare the checksums at replay time to the original ones, and alert the user that a task is nondeterministic if they differ. Finally, Arthur logs the execution time of each task, as well as the cause of failure (e.g., an uncaught exception) for any tasks that fail. Figure 4.2a summarizes the flow of information at record time. After the program has finished running, the user launches Arthur in replay mode and loads the program's execution log. Arthur then accepts queries through an interactive shell and uses the recorded information to replay parts of the program's execution on demand. Replay takes place in parallel on the cluster; Arthur replays tasks from the appropriate parts of the dependency graph by launching them using Spark. Figure 4.2b illustrates the flow of information at replay time. Arthur's basic replay functionality, described in Section 4.4, supports rerunning portions of the program exactly. This allows users to visualize the program's dependency graph, explore intermediate datasets, and rerun specific tasks locally in a conventional debugger. Arthur also provides a more powerful type of replay that involves modifying the original operator graph. This makes it possible to perform analyses such as tracing records forward and backward through the data flow (described in Section 4.5). It also makes it possible to insert post-hoc instrumentation into the execution graph, such as assertions on intermediate datasets and other custom code (described in Section 4.6).
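A self-contained sketch of the checksum comparison described above: hash each task's output records, combine the hashes into a digest, and compare the digest across executions. The order-insensitive combination (XOR of per-record hashes) is an assumption made here for illustration; the text only states that task outputs are checksummed.

import hashlib

def task_checksum(records):
    acc = 0
    for r in records:
        h = hashlib.sha256(repr(r).encode("utf-8")).digest()
        acc ^= int.from_bytes(h, "big")   # order-insensitive combination
    return acc

original = task_checksum([("ad1", 3), ("ad2", 7)])
replayed = task_checksum([("ad2", 7), ("ad1", 3)])
if original != replayed:
    print("warning: task output differs; the task may be nondeterministic")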

[Two-panel diagram: (a) Recording and (b) Debugging, showing the master, the workers, and the log of lineage, checksums, and events.]

Figure 4.2: Flow of information while recording a program’s execution and replaying and debugging the program. MapReduce, Dryad, and Spark carry out user transformations by deploying tasks onto the cluster and receiving their results. Arthur logs additional information about the program’s execution, which it can replay on demand after the program finishes.

4.4 Basic Features

Arthur’s core features are built on top of its ability to load the original program graph and replay parts of it on demand. The user accesses these features through an interactive shell based on Spark’s Scala shell.

4.4.1 Data Flow Visualization

The simplest tool Arthur provides uses the lineage of datasets in the execution log to produce a visualization of the program's data flow graph. Such a visualization can be helpful in understanding the data access patterns and general structure of the program, and it only requires local analysis rather than re-execution of tasks in the job. Figure 4.3 shows an example lineage graph produced by Arthur on a PageRank application. Arrows point from datasets to their dependencies. The graph indicates that dataset 4 is used repeatedly in future computations, suggesting that it might be a good candidate to be cached in memory on the cluster, for example.
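As an illustration of how such a visualization can be produced from recorded lineage, the following self-contained Python sketch emits Graphviz DOT text with arrows pointing from datasets to their dependencies, as in Figure 4.3. The lineage map and dataset names are made up; Arthur's actual output format is not specified here.

lineage = {
    "dataset4": ["dataset3"],
    "dataset5": ["dataset4"],
    "dataset6": ["dataset4", "dataset5"],
}

def to_dot(lineage):
    lines = ["digraph lineage {"]
    for ds, deps in lineage.items():
        for dep in deps:
            lines.append('  "%s" -> "%s";' % (ds, dep))
    lines.append("}")
    return "\n".join(lines)

print(to_dot(lineage))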

4.4.2 Exploration of Intermediate Datasets

In debuggers for programs on a single machine, variable inspector windows and "print" commands give visibility into a program's intermediate state. Arthur can provide a similar experience for data flow programs by allowing the user to query any intermediate dataset post-execution from the interactive debugger shell. To query an intermediate dataset, we (1) read the dependency graph that

Figure 4.3: Partial lineage graph of a Spark application, as plotted by Arthur.

Arthur recorded, (2) find the tasks required to rebuild it, and (3) run them on workers across the cluster. Queries on datasets can be written using any of the operators in Spark [142], which include relational operators, sampling, and transformations using arbitrary Scala code. For example, the following console session shows how a user might explore an intermediate dataset in the program from Section 4.2 that computes the top 100 ads clicked in a log. The user loads the topads.log execution trace and queries dataset 2, which is a collection of ads and the number of user clicks on each ad.

scala> val r = new EventLogReader("topads.log")
r: EventLogReader = EventLogReader@726b37ad

scala> r.datasets(2).take(5)  // sample first 5 ads and click counts from dataset 2
res0: Array[(AdID, Int)] = [...]

scala> r.datasets(2).map(pair => pair._2).mean()
res1: Int = 258288  // mean # of clicks per ad

Because Arthur executes its operations in parallel on the cluster, queries on intermediate datasets run at least as quickly as the original program, and frequently more quickly because only part of the job needs to run in order to produce the requested output. In addition, Arthur uses Spark's capability to cache information in memory once it is computed, enabling it to respond to repeated queries on the same dataset at interactive speeds.

Figure 4.4: Tasks that need to be rerun for local task replay. To rerun a task (dark red), Arthur first runs its ancestors (light blue) and saves the last output. It is only necessary to run these tasks rather than the entire job.

4.4.3 Task Replay

Logic errors in bulk operators such as map functions can be difficult to debug because they execute in parallel on many machines. Programmers typically debug such operators by printing trace information from within the operator and later reading logs on machines that produced exceptions or incorrect results. Instead, it would be helpful to use conventional step-through debuggers and profilers. For example, if a certain task is throwing an exception on specific input, stepping through the user code in that transformation would make it easier to debug. Arthur supports running specific tasks locally under such tools. To rerun a task locally, Arthur first computes the input to that task by running the tasks that it depends on; these tasks persist their outputs to disk. Arthur then launches a small wrapper program locally that receives the task metadata, fetches the outputs of parent tasks, and executes the task in an isolated environment. The user can then attach a conventional debugger such as JDB before the task runs, making it possible to set breakpoints, catch exceptions, and step through the operator's execution on the input data of interest. Local task replay only requires a small portion of the program to be re-executed. In particular, only the task's ancestors must be run, rather than the entire program. Figure 4.4 shows that in order to debug an incorrect result from a particular task, Arthur only needs to rerun those tasks which contribute results to that task's input.
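A self-contained sketch of the selective replay needed for local task replay: starting from the target task, walk the recorded dependency graph backwards and collect only its ancestors, which are the tasks that must be rerun. The toy task graph and task names are made up for illustration.

def tasks_to_replay(target, parents):
    # parents maps each task to the tasks whose output it consumes
    needed, stack = set(), [target]
    while stack:
        t = stack.pop()
        for p in parents.get(t, []):
            if p not in needed:
                needed.add(p)
                stack.append(p)
    return needed

parents = {"reduce1": ["map1", "map2"],
           "reduce2": ["map2", "map3"],
           "top1": ["reduce1", "reduce2"]}
print(tasks_to_replay("reduce1", parents))   # {'map1', 'map2'}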

4.5 Record Tracing

Finding the set of records that stemmed from or led to a given record can be helpful in debugging the program’s operations. For example, in a post to the Spark user group, a user described a word count application that unexpectedly output a count for the empty string in addition to the counts for each word in the input. Tracing that record backward through the program would reveal that the empty string stemmed from empty lines in the input, allowing the user to fix the input parsing bug. Arthur provides the ability to perform such tracing “post-hoc” by rerunning a transformed version of the original program.


Figure 4.5: For tracing, Arthur rewrites the dependency graph to propagate tags, which represent provenance, along with each element. For example, the original dependency graph contains an operator that merges datasets of A and B elements to form a dataset of C elements. In the modified graph, the original operator is wrapped with logic to propagate the tags.

Figure 4.6: (a) filter on integers; (b) filter on tagged integers. The original filter operator applies a user-supplied predicate directly to each element, while the augmented operator extracts the integer element before passing it to the predicate.

4.5.1 Mechanism

Arthur provides record tracing using a program transformation in which it modifies each operation in the original program graph to propagate a tag, which represents provenance, along with each element. In addition to performing its previous function, each operator in the new execution graph is augmented to propagate tags, as illustrated in Figure 4.5. We implement forward and backward record tracing by using this tagging primitive to mark elements of interest and track their ancestors or descendants in the data flow, as described in the next sections.

The program transformation approach takes advantage of the functional, high-level nature of operations in modern data flow frameworks, which provide Arthur with precise information about how data moves through the program. To extract this information, it is necessary for the definition of each operator to include operator-specific tag propagation information. For example, Figure 4.6 shows that the filter operator propagates tags by extracting the element and passing it to the filtering function. In general, an operator f from datasets of type A to datasets of type B must also come with a function from the original operator f to an operator from datasets of type (A,Tag) to those of type (B,Tag). This approach to record tracing relies upon the fine-grained semantics of dataset operations for accuracy. Operations like the filter above carry each input record to at most one output record, allowing the system to track records without any loss of fidelity. On the other hand, some frameworks, such as Hadoop, provide a coarser-grained API where a “map” function operates on multiple input records at once using an iterator, and writes output to a second iterator. Such operations expose only a coarse data flow structure, limiting the fidelity possible with a simple program transformation. More involved techniques such as static analysis of user-supplied operations could improve fidelity. In addition, for these types of operations, Newt [43] proposes a timing-based approach using the observation that any particular output record could only have been influenced by the input records that have appeared until that point. We use this approach in Arthur to handle Hadoop’s map operator and a similar operator called mapPartitions in Spark.
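As a concrete illustration of this wrapping, the following sketch lifts two common operators to carry tags, using a simplified Tag representation (a set of integer record identifiers); the helper names are illustrative and not Arthur’s internal API.

object TagPropagation {
  type Tag = Set[Int]  // simplified: the set of contributing input record IDs

  // filter: apply the user predicate to the element and ignore the tag (Figure 4.6).
  def wrapFilter[A](pred: A => Boolean): ((A, Tag)) => Boolean = {
    case (elem, _) => pred(elem)
  }

  // map: apply the user function to the element and carry its tag along unchanged.
  def wrapMap[A, B](f: A => B): ((A, Tag)) => (B, Tag) = {
    case (elem, tag) => (f(elem), tag)
  }
}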

4.5.2 Forward Tracing

Forward tracing is straightforward to implement on top of the tag-propagating program transformation. Arthur transforms the dependency graph into one that propagates a Boolean tag in addition to each record. It initializes the input records of interest with a true tag and other records with a false tag, runs the modified job using the Boolean or operation to combine tags, and finds which output records end up with a true tag. We show that forward tracing requires < 1.5× overhead at debug time, while leaving the original runtime unaffected. When only a few records are of interest, Arthur traces them through just the relevant subset of the execution graph. For forward tracing, Arthur reruns the required tasks in each stage with tagging, inspects these tasks’ outputs, looks up the elements’ shuffle keys to determine which tasks in the next stage read records from these tasks, and repeats the process on the new set of tasks.
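As a rough sketch of this idea on the ad-click example from Section 4.4.2, written directly against Spark’s RDD API rather than Arthur’s transformation machinery, forward tracing through the counting stage might look as follows; the function name and types are illustrative.

import org.apache.spark.rdd.RDD

// Mark the input clicks of interest with a true tag, then OR the tags as counts merge.
def forwardTraceClicks(
    clicks: RDD[String],  // one ad ID per click
    ofInterest: String => Boolean): RDD[(String, (Int, Boolean))] =
  clicks
    .map(adId => (adId, (1, ofInterest(adId))))
    .reduceByKey { case ((c1, t1), (c2, t2)) => (c1 + c2, t1 || t2) }

An output pair whose tag is true descends from at least one of the marked inputs.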

4.5.3 Backward Tracing

Like forward tracing, backward tracing builds upon tag propagation. Because operators are not guaranteed to be invertible, backward tracing cannot simply tag output records with Booleans and run the program in reverse. Instead, it tags each input element with a unique tag, runs the job, examines the tags that ended up on the output records of interest, and finds which input elements contributed those tags. This process is illustrated in Figure 4.7.


Figure 4.7: To trace an output record (rightmost rectangle) backward through the data flow, we tag each input element uniquely, run the job to propagate the tags to the outputs, and find which input elements contributed tags (leftmost blue, yellow, and red rectangles).

We implement unique tags using integers. Each input record is tagged with a unique integer based on its position in the dataset, and tags are stored as sets of integers which are combined using the union operation. This approach works well for programs such as database queries where each record has a clear provenance. However, in iterative programs such as PageRank, each output record is influenced by a large number of input records. As a result, tags tend to diffuse widely, and in the extreme case each output record may end up with a tag from every input record. This approach therefore performs poorly for long jobs because of the high space overheads that tags impose in later stages. Representing tags as Bloom filters provides a 1.78× speedup on PageRank, at the cost of false positives. An alternative approach to backward tracing is to trace the output records of interest backward through each stage, starting with the last stage. In each stage, Arthur tags each record in the shuffle output from the previous stage with a unique label, runs the stage, and finds which input elements contributed to the tags of the output elements under consideration. This approach allows the tag dependency structure to be precomputed, allowing further backward tracing queries to be performed using just a single cluster-wide join operation.
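For illustration, the same ad-click stage with unique integer tags combined by set union might be sketched as follows, again against plain Spark rather than Arthur’s API; the tags that survive on an output record identify the input positions it depends on.

import org.apache.spark.rdd.RDD

// Tag every input click with its unique position, then union the tags as counts merge.
def backwardTraceClicks(clicks: RDD[String]): RDD[(String, (Int, Set[Long]))] =
  clicks
    .zipWithIndex()
    .map { case (adId, pos) => (adId, (1, Set(pos))) }
    .reduceByKey { case ((c1, s1), (c2, s2)) => (c1 + c2, s1 ++ s2) }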

4.6 Post-Hoc Instrumentation

Arthur’s ability to replay a modified version of the data flow graph, which we used in tracing, also opens the door to other types of analyses. One that we have explored is post-hoc instrumentation of the code, where assertions or print statements can be added to part of the job after it was executed and can be verified in the debugger. While step-through debugging and tracing are powerful tools for tracking down problems, often the best way to understand program execution, especially in a large application, is to test assertions, and Arthur provides the ability to inject these without runtime overhead.

The simplest type of assertion one can test is about the contents of a particular dataset. For example, suppose that we were debugging a PageRank application, and we wanted to ensure that the PageRank of each node was positive at all iterations. We could attach an assertion to the dataset for a given operation as follows:

scala> val ranks = r.datasets(2) // dataset ID #2

scala> ranks.assert(r => r > 0)

Arthur will test the assertion by adding a no-op map operator after the computation of ranks that verifies the predicate and reports any records that do not pass it to the master. By default, Arthur’s assertions are attached “lazily” (they are not tested right away), so it is possible to attach multiple assertions to multiple datasets, and then test all of them using an EventLogReader.checkAssertions() function. Because Arthur’s shell is simply a Scala interpreter, it is also possible to attach these assertions programmatically (e.g., search through the list of datasets for all the ones created on a particular line of the program and add assertions to all of them). Apart from these types of data assertions, Arthur also allows users to instrument their code more closely by replaying a modified version of the original binary. As long as the functions in the program still produce the same outputs (i.e., any modifications are only for print statements or assertions), the system can still run the program on the cluster. Currently, this modification has to be done manually before starting Arthur, but we also wish to support dynamic modification of the program code using the Java VM’s class reloading feature [87] in the future. Finally, it would be straightforward to extend the assertion mechanism to support “distributed assertions” (in the form of a Spark expression that has to hold for an entire dataset, e.g., that the sum of PageRanks is close to 1) [57]. Currently, these can be checked using manual Spark queries on the datasets, as in Section 4.4.2.
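A minimal sketch of the underlying mechanism, expressed in plain Spark rather than Arthur’s API: the assertion becomes a pass-through map that checks the predicate on every record and reports violations (here simply to standard error rather than back to the master).

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Append a no-op map that verifies the predicate and passes each record through.
def checkAssertion[T: ClassTag](dataset: RDD[T], predicate: T => Boolean): RDD[T] =
  dataset.map { record =>
    if (!predicate(record)) {
      // Arthur would report the violating record to the master; this sketch just logs it.
      System.err.println(s"Assertion failed for record: $record")
    }
    record  // the record itself is unchanged
  }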

4.7 Implementation

We implemented Arthur in about 2000 lines of Scala code. The system supports recording applications written in either Hadoop or Spark, and replaying them in a Spark-based parallel runtime where different debug operations can be invoked interactively from a Scala shell. At recording time, Arthur needs to obtain (1) the graph of operators used in the parallel job and (2) checksums of intermediate datasets, used to verify that re-execution has been deterministic. In Hadoop, the operator graph is trivial, because it is always a single MapReduce step, so we only use the job’s configuration to rerun the same map and reduce functions. In Spark, we added an event logging module that logs the datasets used in each parallel operation to a file. For checksums, we use a simple Java OutputStream wrapper that computes a checksum as data is written out. We only perform checksumming for data at task boundaries (e.g., for the output data in a map task), where it is either written to a file or sent over the network, so the checksumming adds little overhead: it is dwarfed by the cost of sending the data over the network.
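For illustration, such a checksumming wrapper could be as simple as the following sketch, assuming a CRC32 checksum; the checksum function Arthur actually uses may differ.

import java.io.OutputStream
import java.util.zip.CRC32

// Wraps an OutputStream and accumulates a checksum over every byte written.
class ChecksummingOutputStream(out: OutputStream) extends OutputStream {
  private val crc = new CRC32()
  def checksum: Long = crc.getValue
  override def write(b: Int): Unit = { crc.update(b); out.write(b) }
  override def write(b: Array[Byte], off: Int, len: Int): Unit = {
    crc.update(b, off, len)
    out.write(b, off, len)
  }
  override def flush(): Unit = out.flush()
  override def close(): Unit = out.close()
}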

At replay time, Arthur replays both Hadoop and Spark computations in the Spark engine, to take advantage of features such as in-memory caching and interactive queries. We use an existing layer on top of Spark, called SHadoop, to execute Hadoop map and reduce tasks within Spark. (This is conceptually simple because Spark also supports map and reduce operators.) We begin by loading the job’s code from a path provided by the user, followed by an event log with the parallel operations run during the job and their operator graphs (as discussed above), then present the user with an interactive Scala shell where they can view the operations and datasets and run queries on them. The actual replay of both the original operators and any transformed versions of them (e.g., for assertions or tracing) is implemented by submitting jobs to the existing Spark engine, so it does not require changes to Spark. The debugging interface for Arthur is a shell in the Scala interpreter, based on Spark’s interactive Scala shell. It provides an object model to load and debug jobs, and lets the user define local variables or functions using the complete Scala language, and query datasets using functional operators written in Spark. Users can also explicitly control which datasets to cache in memory for repeated queries. For example, the following console session shows how one might query an intermediate dataset in a PageRank computation, whose log is read from pagerank.log, and replay a task:

scala> val r = new EventLogReader("pagerank.log")
r: EventLogReader = EventLogReader@726b37ad

scala> r.datasets
#00: hadoopFile at PageRank$.main(PR.scala:31)
#01: map at PageRank$.main(PR.scala:31)
#02: map at PageRank$.main(PR.scala:35)
#03: groupBy at PageRank$.main(PR.scala:35)
#04: flatMap at PageRank$.main(PR.scala:35)
#05: map at PageRank$.iterate(PR.scala:91)
#06: cogroup at PageRank$.iterate(PR.scala:92)
[...]

scala> r.datasets(2).count()
res0: Long = 129941  // # of elements in dataset 2

scala> r.debugTask(3, 1)  // replay task 1 of dataset 3
[launches the JDB debugger]

Finally, for replaying individual tasks in a single-process debugger, we wrote a small wrapper program that reads a serialized Task object from Spark and reads its input from a file (computed in parallel using the cluster), then executes that task locally, so that we can spawn this program in a local subprocess and attach JDB or other debugging tools to it.
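A heavily simplified sketch of such a wrapper appears below, assuming plain Java serialization for both the task and its input; ReplayableTask is a hypothetical stand-in for Spark’s internal Task class, not its real interface.

import java.io.{FileInputStream, ObjectInputStream}

// Hypothetical stand-in for the serialized Spark Task object.
trait ReplayableTask extends Serializable {
  def run(input: Seq[Any]): Unit
}

object TaskReplayWrapper {
  def main(args: Array[String]): Unit = {
    val Array(taskFile, inputFile) = args
    def load(path: String): AnyRef =
      new ObjectInputStream(new FileInputStream(path)).readObject()
    val task  = load(taskFile).asInstanceOf[ReplayableTask]
    val input = load(inputFile).asInstanceOf[Seq[Any]]
    // A debugger such as JDB can be attached to this JVM before the task runs.
    task.run(input)
  }
}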

Figure 4.8: Performance comparison (normalized runtime) of various Spark applications (PageRank, logistic regression, and k-means) with and without debugging.

Figure 4.9: Running time (in seconds) of various interactive queries in the debugger. Arthur only runs the tasks necessary to answer each query, so Query 1 is faster than the original program. Subsequent operations benefit from in-memory caching.

4.8 Evaluation

We evaluated Arthur using a variety of real bugs and injected errors in Hadoop and Spark programs. We find that Arthur’s overhead at record time is only 2–4%, making it feasible to run continuously. At debug time, we found that Arthur can respond to queries at interactive speeds (on the order of a second) by caching frequently-used datasets in memory, and often needs to rerun only a subset of the job even for the first time it answers a query. Finally, we evaluated Arthur’s applicability to different types of bugs by showing how it can diagnose task failures, incorrect output, nondeterministic behavior, and deterministic performance problems in various programs.

4.8.1 Recording Overhead

We tested Arthur’s recording overhead on three Spark programs: PageRank, logistic regression, and k-means. Our PageRank program runs 15 iterations on the articles in a 54 GB Wikipedia dataset, our logistic regression program runs 10 iterations on 10 million 10-dimensional vectors of doubles, and our k-means program runs 5 iterations to cluster 100 million 4D points into 4 clusters. Figure 4.8 compares the runtimes of these applications with and without Arthur when running on a 20-node cluster. Because the overhead from task output checksumming constitutes most of Arthur’s recording overhead, applications with large task outputs see higher overhead than those with small task outputs. Each PageRank iteration updates the rank for every page, forcing Arthur to checksum a large amount of information. On the other hand, each iteration of logistic regression only outputs a single vector, and each iteration of k-means only updates a small number of cluster centers. Regardless of application, however, Arthur’s logging overhead was at most 4%, and its log size less than 5 MB.

4.8.2 Replay Performance

At replay time, Arthur can often run faster than the original program thanks to selective replay and in-memory caching. Figure 4.9 shows the running times of various queries when debugging the PageRank program. Query 1 counts the number of articles whose PageRank in iteration 3 is greater than a threshold. Arthur only needs to run the first three iterations to answer the query, so it runs faster than the original program. Next, Query 2 calculates the average PageRank in the third iteration, and Query 3 samples the PageRank of ten articles. These queries reuse the same dataset as Query 1, so they benefit from in-memory caching. Finally, rerunning a single task locally was fast compared to the original program.

We also benchmarked the performance of forward and backward record tracing during replay of the PageRank application. Forward tracing was 1.48× slower than the original program, while backward tracing with integer set tags was 5.52× slower. The large slowdown for backward tracing was due to diffusion of tags, as described in Section 4.5.3. Both forms of tracing correctly identified the relevant records (e.g., tracing through three iterations of PageRank finds neighbors at most three links away). We also tested backward tracing with Bloom filter tags. This reduced the slowdown to 3.09× the original program due to the more compact size of the Bloom filter representation, but caused about 30% of records in the output to be erroneously tagged. The actual performance ratio and error rate depend on the size of the Bloom filter used.

4.8.3 Applicability

Though Arthur is focused on debugging deterministic problems, we have observed these to be more common than nondeterministic errors for complex distributed programs because, like MapReduce and Dryad, Spark requires transformations to be deterministic. To illustrate the kinds of errors that Arthur can detect, we describe its applicability to three deterministic errors: task failures, incorrect output, and performance problems. We also describe Arthur’s ability to detect unintended nondeterminism through checksumming on a real bug from Conviva, as well as its support for loading Hadoop traces on a pre-existing bug in Mahout.

Deterministic Task Failures As an example of a deterministic task failure, we injected an input processing bug into our Wikipedia PageRank program. This program extracts the links from each article by parsing its XML representation and searching the parse tree for link elements. Certain tasks were consistently throwing XML parsing exceptions, so we used Arthur to search the event log for failing tasks. We reran one of the tasks in a conventional debugger and found that the culprit records contained \N in place of the article’s XML. It turned out that our Wikipedia dump used this string to represent an empty article, so we fixed the bug by adding error-handling code for that case.

Incorrect Output As an example of incorrect output, Carat, a real Spark program for processing time series of devices’ power usage, contained a bug causing the output to contain nonsensical negative values for power usage. We used Arthur to inject assertions at every stage and then replayed the program. Arthur detected the assertion failures immediately after the location of the bug and we were able to halt the program early, avoiding the inconvenience of rerunning the program in its entirety.

Deterministic Performance Problems We debugged Monarch [131], a real Spark program that used logistic regression to classify spam status updates. The program was running much more slowly than expected, and we observed that a few straggler tasks were consistently finishing last, tens of seconds later than the typical task. We recorded the task IDs that appeared to be stragglers and, once the program had finished, we loaded its trace into Arthur and reran the tasks locally under JDB. Examining the input partition to the task revealed that it contained several very large feature records, identifying partition skew as the source of the problem. We were able to reproduce the problem because Spark shards data into partitions deterministically.

Unintended Nondeterminism We used Arthur to detect a bug arising from unintended nondeterminism at Conviva, a video analytics company. An analytics query intended to operate on new records that had arrived in the last few minutes was performing the filter by comparing record timestamps against the current system time from within the query, as in the following example:

records.filter((System.currentTime() - _.time) < INTERVAL)

When we used Arthur to replay the query, we received checksum mismatch warnings because the time had changed from the original execution, and the query now matched fewer records. Examining the operator that triggered the warnings revealed the bug, and we fixed the bug by computing the time in the driver program instead of in the operator, so that the reference time would remain the same across executions of the task:

val now = System.currentTime()
records.filter((now - _.time) < INTERVAL)

Hadoop Jobs To demonstrate Arthur’s support for loading Hadoop job traces, we chose a pre-existing bug in Mahout [7], a Hadoop-based machine learning library. We reproduced the bug, MAHOUT-363, in Arthur. This bug involved a NullPointerException due to a logic error in the map code within Mahout. We were able to load the affected Hadoop job into Arthur, identify the map task causing the error, and rerun and step through the task locally until the NullPointerException. Inspecting the task code and state in JDB confirmed that the exception was due to the Mahout Cache’s failure to handle a null feature vector.

4.9 Discussion

Arthur can perform detailed analysis of job executions with nearly zero runtime overhead by leveraging the determinism and structure of modern data-parallel applications. While the core idea behind Arthur is simple, we showed that it efficiently supports a wide range of analyses, which can be sped up by only replaying the relevant parts of the job. We believe that Arthur’s approach is important for two reasons. First, because the deterministic data flow model we exploit was primarily adopted for fault tolerance, we believe that it will remain present not only in today’s frameworks (e.g., MapReduce, Dryad, Spark), but in future ones as well. Indeed, it is interesting that not only the determinism itself, but also the decomposition of jobs into small tasks, is used to speed up recovery on failure (by minimizing the work redone), and both elements directly speed up selective replay. Second, because of the intrinsically high hardware cost of big data computations, any instrumentation at runtime is expensive, so replay may be the only effective way to debug production problems. Therefore, it is important to study which analyses can be performed this way. Because of its reliance on replay, Arthur does have limitations that more invasive debuggers would not. We discuss some of these next, followed by ways in which Arthur can be extended. We also discuss how parallel runtimes could be extended to enable easier replay.

4.9.1 Limitations

Arthur’s replay approach has several limitations, some of which can be avoided with more care during execution:

Nondeterministic User Code Arthur cannot replay bugs where a user’s code (e.g., a map function) is nondeterministic, although it detects them using checksumming. While users of data flow frameworks are asked to try to write deterministic code to enable fault recovery, nondeterminism can still be a bug, so it is important to be able to fix it. There are two interesting possible approaches. One is that, once nondeterminism has been detected, Arthur can try to reproduce nondeterministic behavior (though maybe not the same as the original run) by simply running multiple copies of the problematic task. It could also run these tasks in a more expensive replay debugger, such as R2 [63], that can recreate nondeterministic events once it sees them the first time. A second approach would be to try to identify nondeterministic code through static analysis (e.g., see whether particular libraries are being called).2

2The most common nondeterministic library that might be called is a random number generator, but fortunately, runtimes can make that deterministic by seeding the generator consistently for a given task ID. For example, Spark does this for its built-in sample operation.
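For illustration, per-task seeding of the kind described in this footnote could look like the following sketch; the helper name and seed derivation are hypothetical.

import scala.util.Random

// Derive the seed from the partition index so a replayed task sees the same values.
def taskRandom(jobSeed: Long, partitionId: Int): Random =
  new Random(jobSeed * 31 + partitionId)

// Hypothetical usage inside a Spark transformation:
// rdd.mapPartitionsWithIndex { (pid, iter) =>
//   val rng = taskRandom(42L, pid)
//   iter.filter(_ => rng.nextDouble() < 0.01)  // a reproducible 1% sample
// }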

Inter-Task Interactions Sometimes, bugs are not caused by a particular task, but by the interaction between multiple tasks on the same machine. For example, Hadoop runs a series of tasks in the same Java VM to amortize startup costs, but if each task leaks memory or uses a library with global state, the behavior of a task may depend on which others have run before it. Arthur cannot guarantee to run the same tasks together at replay time (especially if doing selective replay). Auxiliary monitoring tools, such as a memory usage monitor, might be used to detect some of these conditions.

Lost Input Files Arthur implicitly assumes that the input files for each job are still available at replay time. Fortunately, most data warehouses operate in an “append-only” fashion, and retain files for a long time after ingestion. HDFS does not even support random updates.

Communication Order A subtle issue that can happen in some frameworks is that even though the code in each task is deterministic, an instance of that task might fetch input data from other tasks in a nondeterministic order. For example, in Spark, a reduce task performing a commutative operation (such as a sum) fetches results from multiple map tasks in parallel and receives chunks from different tasks in different orders. Although the operation is, technically, expected to be commutative, some bugs might manifest only depending on the input order. In our implementation, we modified Spark to log the order of chunks fetched and use the same order at replay time. For Hadoop, this problem is not present because Hadoop always sorts the input to a reduce function.

Nondeterministic Programming Models While Arthur works for many current frameworks that perform deterministic computations, such as MapReduce, Dryad, Spark, Hyracks, and Pregel, it cannot be applied to programming models that allow nondeterministic, asynchronous messaging, like MPI or Graphlab [60].

4.9.2 Extensions

While we have implemented several useful debugging tools in Arthur, we discuss other analyses that would be interesting to implement in the model, especially by taking advantage of the parallelism of the cluster, in Section 5.2. In addition, if replay is going to be the cheapest way to debug production problems, it would also be interesting to extend runtime frameworks to better support it.

4.10 Related Work

Debuggers for Data Flow Frameworks Two recent systems for debugging parallel data flow programs are Inspector Gadget [109] and Daphne [73].3 Inspector Gadget is a debugger for programs in the Pig scripting language that adds instrumentation into the program to monitor various properties (e.g., the time spent in each task, the number of records matching a predicate, or user-specified assertions). However, this approach requires the

3“Arthur” was chosen to continue this trend of cartoon characters.

Property                       Inspector Gadget   Daphne    liblog/R2/ODR   Arthur
Job visualization                                 ✓                         ✓
Queries on intermediate data   ✗                  ✗                         ✓
Local task replay              ✗                  ✓∗                        ✓
Assertions                     ✓                                            ✓
Profiling                      ✓                                            ✓
Record tracing                 ✓                                            ✓
Runtime overhead               5–70%†             minimal   > 20%           < 5%

Table 4.1: Comparison of Inspector Gadget, Daphne, general replay debuggers, and Arthur. Note that (*) Daphne’s task replay requires that all intermediate data in the job is saved to disk and available at debug time, and (†) Inspector Gadget requires instrumenting jobs at runtime, with varying overhead based on the analysis done.

user to instrument their job before they run it, and does not allow the user to rerun a task in a local debugger or to run ad-hoc queries on intermediate datasets that she did not add instrumentation for in advance. The runtime overhead from instrumentation can be as high as 70% for some analyses (e.g., data sampling and latency analysis using tags), making it expensive to run in production. In contrast, our replay approach lets users ask ad-hoc questions about the job after it finished, including rerunning parts of the job with the same kinds of instrumentation available in Inspector Gadget, and additionally supports local step-through debugging of tasks.

Daphne lets users visualize and debug DryadLINQ programs. Daphne provides a “job object model” for viewing the tasks in a job, hooks for attaching a debugger to a remote process on the cluster, and the ability to replay a task in a single-process debugger as long as its input data is still available on the cluster. This approach works in DryadLINQ because all communication between tasks is through files on disk, but it will not work in the increasing number of frameworks that perform computations in memory (such as Pregel [91], Graphlab [60], or Spark [142]), or for jobs where the intermediate data has been deleted. In contrast, Arthur can recompute the input to any task. In addition, Arthur also provides checksumming to verify that the user’s code runs deterministically (an assumption in Daphne) and a rich set of capabilities that are not present in Daphne because they require running new code on the cluster, such as running ad-hoc queries on intermediate datasets.

In general, our implementation provides a superset of the features in these debuggers, and shows that these features can be implemented “post-facto” using selective replay of the parts of the job that a particular feature requires. Other features in Arthur, such as the ability to reconstruct intermediate datasets and run ad-hoc queries on them, or to add new assertions after the job has finished, are unique to our approach because they require running new computations on the job’s intermediate data without knowing these computations during the original job’s execution. Table 4.1 summarizes the features in Arthur in comparison to other debuggers.

Record Tracing and Provenance Several projects have explored how to efficiently capture provenance of records in data-parallel computations to enable tracing. RAMP [70] defines and captures provenance for “generalized map and reduce workflows,” which are programs composed of an acyclic graph of map and reduce steps. However, because it does this tracing during job execution, it can add substantial runtime overheads (20–76%).

Newt [43] is a provenance capture and replay framework for Hadoop and Hyracks that supports capturing record-level provenance at runtime, and then replaying just the part of the job that produced a particular output record. Unlike RAMP, it also handles map operators that work on a stream, by looking at the interleaving of records being read and written to determine which input records affected an output record. (RAMP assumes that the map function processes just one record at a time and cannot maintain state between records.) However, Newt still incurs about 14–26% runtime overhead, and it lacks other debugging functions, such as checking assertions or running ad-hoc queries on intermediate datasets. Inspector Gadget [109] supports forward tracing through tag propagation and backward tracing by tagging all input records, but it again needs to perform this tagging at runtime.

All of these approaches could be used within Arthur to capture provenance for records in (part of) the job when recomputing it, while avoiding the runtime overhead in production. Our current tracing module uses the properties of Spark and Hadoop operators, as well as a Newt-like approach for map functions that operate on iterators.

Replay Debuggers Replay debugging for distributed systems has been extensively studied through systems such as liblog [58], R2 [63], ODR [3], and DCR [4]. However, these systems are designed to replay general distributed programs, and thus work by recording all sources of nondeterminism, including message passing order across nodes, system calls, and accesses to memory shared across threads. This results in significant overhead at runtime (often more than 20%), or even larger slowdowns at replay time (> 10×) for systems that log fewer events but infer the order of missing events [3]. In contrast, our debugger leverages the structure of datacenter computing frameworks to deterministically replay tasks. This approach allows us to catch a large class of logic and performance bugs, and although we cannot replay some of the nondeterministic bugs that other systems capture (e.g., race conditions between threads in the same task), we can still detect them via checksumming. Our recording overhead is also low enough that event logging can be turned on by default in production.

Chapter 5

Conclusion

5.1 Highlights

5.1.1 Oblivious Coopetitive Queries (OCQ)

We proposed OCQ, an efficient, general framework for oblivious coopetitive analytics using hardware enclaves. OCQ’s contributions are its query planner design, which supports flexible party-specific sensitivity rules; its mechanism for propagating and refining padding upper bounds based on foreign key constraints; and its mixed-sensitivity join algorithm. Due to its performance gain over existing solutions, we believe OCQ is the first realistically practical system for secure coopetitive analytics.

5.1.2 GraphFrames

Graph analytics applications typically require relational processing, pattern matching, and iterative graph algorithms. However, these applications previously had to be implemented in multiple systems, adding both overhead and complexity. We aimed to unify the three with the GraphFrames abstraction. The GraphFrames API is concise and declarative, based on the “data frame” concept in R, and enables easy expression and mixing of these three paradigms. GraphFrames optimizes the entire computation using a graph-aware join optimization and view selection algorithm that generalizes the execution strategies in previous graph-on-RDBMS systems. GraphFrames is implemented over Spark SQL, enabling parallel execution on Spark and easy integration with Spark’s external data sources, built-in libraries, and custom ETL code. We showed that GraphFrames makes it easy to write complete graph processing pipelines and enables optimizations across them that are not possible in current systems. GraphFrames was released as open source in 2016 and has garnered significant interest on GitHub, with 594 stars at the time of writing.

5.1.3 Arthur

As cluster programming frameworks are adopted for more applications, debugging the programs written in them is increasingly important. This is challenging both because of the scale of the applications and because of the cost of the hardware resources involved, which makes any runtime overhead for debug information expensive. We have proposed an approach based on selective replay that exploits the deterministic nature of computations in these frameworks to efficiently rerun parts of the program. We show that this approach enables a rich set of analyses, including rerunning tasks in a conventional step-through debugger, checking assertions, tracing records forward and back through the computation, and interactively querying intermediate results. The cost to log the operations we require is minimal (less than 4%), allowing our recording to be “always on” in production use. The deterministic operations we leverage are a crucial element of current programming frameworks because they enable fault recovery [44], so we believe that they will remain present in future frameworks, making our approach applicable there as well.

5.2 Future Work

Secure vectorized query execution. Vectorized query execution [20] is a natural fit for secure hardware enclaves because it (1) avoids the need to compile and attest new code at query time, (2) makes heavy use of branchless operations that reduce side-channel leakage, and (3) achieves high performance with a simple system design that reduces the trusted computing base compared to code generation.

Oblivious user-defined functions. OCQ provides oblivious implementations of relational opera- tors, but does not support UDFs. It would be useful to attempt to enforce obliviousness for UDFs or to transform UDFs into oblivious form using techniques from programming languages. The goal is to avoid the overhead of MPC frameworks by verifying that a UDF is inherently oblivious.

Support for hybrid queries spanning trusted hardware and cryptography. Certain threat models may call for both trusted hardware and cryptography. For example, trusted hardware can be used to enforce integrity with low overhead, while cryptography can be used when confidentiality is needed. A planner that could determine when to apply each of these techniques could greatly reduce overhead compared to a crypto-only solution.

Vectorized worst-case-optimal join algorithms for graph queries. Worst-case-optimal join algorithms avoid the need to materialize large intermediate result sets during graph querying. However, they are traditionally expressed in a tuple-at-a-time model, which is not ideal for maximizing memory parallelism. A vectorized worst-case-optimal join algorithm could be ideally suited for high-performance graph querying.

Adaptive planning for graph queries. Graph data typically exhibits extreme skew, making it a potential fit for adaptive query processing techniques [46]. Batching input data offers an opportunity to amortize the cost of runtime plan changes across many tuples. This could allow graph queries to coexist in the same system as traditional SQL workloads rather than resorting to specialized systems and join algorithms for graphs.

Parallel Profiling Arthur provides a very effective foundation to run shadow profiling [101], a technique where multiple copies of the program are run in parallel with sample profiling to collect highly detailed statistics. Arthur’s model naturally allows doing this for only a subset of the job (e.g., one task) and feeding the same input to every copy.

Minimal Example Discovery When a deterministic bug, such as a task crashing, occurs, it would be useful to automatically “narrow down” on a smaller example that produces the problem by trying smaller subsets of the job’s input. This approach is taken in some existing testing tools, such as QuickCheck [37], and clearly benefits from a parallel search.

Extending Runtimes for Debuggability Some of the limitations we highlight in Section 4.9.1 lead directly to ways to extend parallel runtimes for easier debuggability (and, ultimately, more chance of recovering correctly from faults as well). Some ways that frameworks could help Arthur include isolating tasks from each other in separate processes,1 providing hooks to fetch task inputs in a specific order (as we have done in Spark), retaining intermediate data on the filesystem after job completion (if the framework normally writes temporary files and then deletes them), and allowing this data to be used as an input during replay to avoid recomputation. Most of these changes would make programs in these frameworks easier to understand in general.

1One reason this was not done in Hadoop is the startup cost of new Java VMs, but this does not need to be a fundamental limitation.

Bibliography

[1] F. N. Afrati, M. Joglekar, C. Ré, S. Salihoglu, and J. D. Ullman. GYM: A multiround join algorithm in MapReduce. CoRR, abs/1410.4156, 2014.

[2] R. Agrawal and R. Srikant. Privacy-preserving data mining. SIGMOD Rec., 2000.

[3] G. Altekar and I. Stoica. ODR: output-deterministic replay for multicore debugging. In SOSP, 2009.

[4] G. Altekar and I. Stoica. DCR: Replay-debugging for the datacenter. Technical Report UCB/EECS-2010-33, UC Berkeley, 2010.

[5] Apache Drill. https://drill.apache.org/.

[6] Apache Hadoop. http://hadoop.apache.org.

[7] Apache Mahout. http://mahout.apache.org.

[8] Apache Spark. Spark IndexedRDD: An efficient updatable key-value store for apache spark. https://github.com/amplab/spark-indexedrdd, 2015.

[9] A. Arasu, S. Blanas, K. Eguro, R. Kaushik, D. Kossmann, R. Ramamurthy, and R. Venkatesan. Orthogonal security with Cipherbase. In Proceedings of the 6th Biennial Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, Jan. 2013.

[10] ARM. Trustzone. https://developer.arm.com/ip-products/security-ip/ trustzone.

[11] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia. Spark SQL: Relational data processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD ’15, 2015.

[12] S. Arnautov, B. Trach, F. Gregor, T. Knauth, A. Martin, C. Priebe, J. Lind, D. Muthukumaran, D. O’Keeffe, M. L. Stillwell, D. Goltzsche, D. Eyers, R. Kapitza, P. Pietzuch, and C. Fetzer. SCONE: Secure Linux containers with Intel SGX. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016.

[13] B. Bahmani, A. Chowdhury, and A. Goel. Graph data models for genomics. IEEE Engineering in Medicine and Biology special issue on Managing Data for the Human Genome Project, 1995.

[14] C. Balkesen, G. Alonso, J. Teubner, and M. T. Özsu. Multi-core, main-memory joins: Sort vs. hash revisited. Proc. VLDB Endow., 2013.

[15] J. Bater, G. Elliott, C. Eggen, S. Goel, A. Kho, and J. Rogers. SMCQL: Secure querying for federated databases. Proc. VLDB Endow., 2017.

[16] J. Bater, X. He, W. Ehrich, A. Machanavajjhala, and J. Rogers. Shrinkwrap: Efficient query processing in differentially private data federations. Proc. VLDB Endow., 12(3), 2018.

[17] A. Baumann, M. Peinado, and G. Hunt. Shielding applications from an untrusted cloud with Haven. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), 2014.

[18] A. Bittau, U. Erlingsson, P. Maniatis, I. Mironov, A. Raghunathan, D. Lie, M. Rudominer, U. Kode, J. Tinnes, and B. Seefeld. Prochlo: Strong privacy for analytics in the crowd. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP ’17, 2017.

[19] K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS ’17, 2017.

[20] P. Boncz, M. Zukowski, and N. Nes. Monetdb/x100: Hyper-pipelining query execution. In CIDR, 2005.

[21] V. Borkar, M. Carey, R. Grover, N. Onose, and R. Vernica. Hyracks: A flexible and extensible foundation for data-intensive computing. In ICDE ’11, pages 1151–1162, 2011.

[22] T. Bourgeat, I. A. Lebedev, A. Wright, S. Zhang, Arvind, and S. Devadas. Mi6: Secure enclaves in a speculative out-of-order processor. In IEEE/ACM Internationl Symposium on Microarchicture, 2019.

[23] M. Brandenburger, C. Cachin, M. Lorenz, and R. Kapitza. Rollback and forking detection for trusted execution environments using lightweight collective memory. In 2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2017.

[24] F. Brasser, U. Müller, A. Dmitrienko, K. Kostiainen, S. Capkun, and A. Sadeghi. Software Grand Exposure: SGX Cache Attacks Are Practical. In WOOT, 2017.

[25] Y. Bu, V. Borkar, J. Jia, M. J. Carey, and T. Condie. Pregelix: Big(ger) graph analytics on a dataflow engine. Proc. VLDB Endow., 8(2):161–172, Oct. 2014.

[26] J. V. Bulck, M. Minkin, O. Weisse, D. Genkin, B. Kasikci, F. Piessens, M. Silberstein, T. F. Wenisch, Y. Yarom, and R. Strackx. Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution. In USENIX Security, 2018.

[27] J. V. Bulck, N. Weichbrodt, R. Kapitza, F. Piessens, and R. Strackx. Telling Your Secrets without Page Faults: Stealthy Page Table-Based Attacks on Enclaved Execution. In USENIX SECURITY, 2017.

[28] C. J. Burges. From ranknet to lambdarank to lambdamart: An overview. Technical Report MSR-TR-2010-82, Microsoft Research, June 2010.

[29] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas. Apache Flink: Stream and batch processing in a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 36(4), 2015.

[30] Centers for Medicare & Medicaid Services. The Health Insurance Portability and Accountability Act of 1996 (HIPAA). Online at http://www.cms.hhs.gov/hipaa/, 1996.

[31] C. Chambers, A. Raniwala, F. Perry, S. Adams, R. R. Henry, R. Bradshaw, and N. Weizenbaum. FlumeJava: easy, efficient data-parallel pipelines. In PLDI ’10. ACM, 2010.

[32] C. che Tsai, D. E. Porter, and M. Vij. Graphene-SGX: A practical library OS for unmodified applications on SGX. In 2017 USENIX Annual Technical Conference (USENIX ATC 17). USENIX Association, 2017.

[33] G. Chen, S. Chen, Y. Xiao, Y. Zhang, Z. Lin, and T. H. Lai. SgxPectre Attacks: Stealing Intel Secrets from SGX Enclaves via Speculative Execution. In 2019 IEEE European Symposium on Security and Privacy (EuroS P), 2019.

[34] G. Chen, W. Wang, T. Chen, S. Chen, Y. Zhang, X. Wang, T.-H. Lai, and D. Lin. Racing in hyperspace: Closing hyper-threading side channels on sgx with contrived data races. In 2018 IEEE Symposium on Security and Privacy (SP), 2018.

[35] R. Chirkova, A. Y. Halevy, and D. Suciu. A formal perspective on the view selection problem. The VLDB Journal, 11(3):216–237, 2002.

[36] S. Chu, M. Balazinska, and D. Suciu. From theory to practice: Efficient join query evaluation in a parallel database system. In SIGMOD, 2015.

[37] K. Claessen and J. Hughes. QuickCheck: a lightweight tool for random testing of haskell programs. In ICFP ’00, 2000.

[38] H. Corrigan-Gibbs and D. Boneh. Prio: Private, robust, and scalable computation of aggregate statistics. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), 2017.

[39] V. Costan, I. Lebedev, and S. Devadas. Sanctum: Minimal hardware extensions for strong software isolation. Usenix Security Symposium, 2016. https://eprint.iacr.org/ 2015/564.

[40] V. Costan, I. Lebedev, and S. Devadas. Sanctum: Minimal hardware extensions for strong software isolation. In 25th USENIX Security Symposium (USENIX Security 16), 2016.

[41] A. Dave, A. Jindal, L. E. Li, R. Xin, J. Gonzalez, and M. Zaharia. Graphframes: an integrated api for mixing graph and relational queries. In Proceedings of the Fourth International Workshop on Graph Data Management Experiences and Systems, pages 1–8, 2016.

[42] A. Dave, C. Leung, R. A. Popa, J. E. Gonzalez, and I. Stoica. Oblivious coopetitive analytics using hardware enclaves. In Proceedings of the Fifteenth European Conference on Computer Systems, EuroSys ’20, New York, NY, USA, 2020. Association for Computing Machinery.

[43] S. De, D. Logothetis, and K. Yocum. Scalable lineage capture for debugging DISC analytics. OSDI poster session, 2012.

[44] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.

[45] H. S. Delugach and T. H. Hinke. Wizard: A database inference analysis and detection system. IEEE Transactions on Knowledge and Data Engineering, 1996.

[46] A. Deshpande, Z. Ives, V. Raman, et al. Adaptive query processing. Foundations and Trends® in Databases, 1(1):1–140, 2007.

[47] D. J. Dewitt, S. Ghandeharizadeh, D. A. Schneider, A. Bricker, H. I. Hsiao, and R. Rasmussen. The Gamma database machine project. IEEE Trans. on Knowl. and Data Eng., 2(1):44–62, Mar. 1990.

[48] E. Eizenman. Scotiabank’s chief risk officer on the state of anti-money laundering. In McKinsey Company, 2019.

[49] S. Eskandarian and M. Zaharia. An oblivious general-purpose SQL database for the cloud. CoRR, abs/1710.00458, 2017.

[50] J. Fan, A. G. S. Raj, and J. M. Patel. The case against specialized graph analytics engines. In CIDR, 2015.

[51] W. Fan. Graph pattern matching revised for social network analysis. In ICDT, 2012.

[52] N. L. Farnan, A. J. Lee, P. K. Chrysanthis, and T. Yu. Don’t reveal my intension: Protecting user privacy using declarative preferences during distributed query processing. In Proceedings of the 16th European Conference on Research in Computer Security, ESORICS’11, 2011.

[53] N. L. Farnan, A. J. Lee, P. K. Chrysanthis, and T. Yu. PAQO: Preference-aware query optimization for decentralized database systems. In 2014 IEEE 30th International Conference on Data Engineering, 2014.

[54] D. Froelicher, P. Egger, J. S. Sousa, J. L. Raisaro, Z. Huang, C. V. Mouchet, B. Ford, and J.-P. Hubaux. Unlynx: A decentralized system for privacy-conscious data sharing. Proceedings on Privacy Enhancing Technologies, 4:152–170, 2017.

[55] B. Gallagher. Matching structure and semantics: A survey on graph-based pattern matching. AAAI FS, 6:45–53, 2006.

[56] A. Gascón, P. Schoppmann, B. Balle, M. Raykova, J. Doerner, S. Zahur, and D. Evans. Privacy- preserving distributed linear regression on high-dimensional data. PoPETs, 2017(4):345–364, 2017.

[57] D. Geels, G. Altekar, P. Maniatis, T. Roscoe, and I. Stoica. Friday: global comprehension for distributed replay. In NSDI ’07, pages 21–21, 2007.

[58] D. Geels, G. Altekar, S. Shenker, and I. Stoica. Replay debugging for distributed applications. In USENIX ATC, 2006.

[59] J. Gonzalez, R. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. GraphX: Graph processing in a distributed dataflow framework. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), Broomfield, CO, Oct. 2014. USENIX Association.

[60] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: distributed graph-parallel computation on natural graphs. In OSDI ’12, 2012.

[61] D. Gruss, J. Lettner, F. Schuster, O. Ohrimenko, I. Haller, and M. Costa. Strong and efficient cache side-channel protection using hardware transactional memory. In 26th USENIX Security Symposium (USENIX Security 17), 2017.

[62] D. Gruss, M. Lipp, M. Schwarz, D. Genkin, J. Juffinger, S. O’Connell, W. Schoechl, and Y. Yarom. Another flip in the wall of rowhammer defenses. In IEEE Symposium on Security and Privacy (SP), 2018.

[63] Z. Guo, X. Wang, J. Tang, X. Liu, Z. Xu, M. Wu, M. F. Kaashoek, and Z. Zhang. R2: an application-level kernel for record and replay. In OSDI, 2008.

[64] R. Hall, S. E. Fienberg, and Y. Nardi. Secure multiple linear regression based on homomorphic encryption, 2011.

[65] H. He and A. K. Singh. Graphs-at-a-time: Query language and access methods for graph databases. In SIGMOD, 2008.

[66] T. H. Hinke. Inference aggregation detection in database management systems. In Proceedings. 1988 IEEE Symposium on Security and Privacy, 1988.

[67] J. Huang, K. Venkatraman, and D. J. Abadi. Query optimization of distributed pattern matching. In ICDE, 2014.

[68] T. Hunt, Z. Jia, V. Miller, C. J. Rossbach, and E. Witchel. Isolation and beyond: Challenges for system security. In Proceedings of the Workshop on Hot Topics in Operating Systems, HotOS ’19, 2019.

[69] T. Hunt, Z. Zhu, Y. Xu, S. Peter, and E. Witchel. Ryoan: A distributed sandbox for untrusted computation on secret data. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). USENIX Association, 2016.

[70] R. Ikeda, H. Park, and J. Widom. Provenance for generalized map and reduce workflows. In CIDR 2011, 2011.

[71] M. Ion, B. Kreuter, E. Nergiz, S. Patel, S. Saxena, K. Seth, D. Shanahan, and M. Yung. Private intersection-sum protocol with applications to attributing aggregate ad conversions. Cryptology ePrint Archive, Report 2017/738, 2017. https://eprint.iacr.org/2017/ 738.

[72] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In EuroSys, 2007.

[73] V. Jagannath, Z. Yin, and M. Budiu. Monitoring and debugging DryadLINQ applications with Daphne. In HIPS, 2011.

[74] Y. Jang, J. Lee, S. Lee, and T. Kim. SGX-Bomb: Locking down the processor via Rowhammer attack. In Proceedings of the 2nd Workshop on System Software for Trusted Execution, 2017.

[75] A. Jindal, P. Rawlani, E. Wu, S. Madden, A. Deshpande, and M. Stonebraker. Vertexica: Your relational friend for graph analytics! Proc. VLDB Endow., 7(13):1669–1672, Aug. 2014.

[76] N. Johnson, J. P. Near, and D. Song. Towards practical differential privacy for SQL queries. Proc. VLDB Endow., 2018.

[77] N. M. Johnson, J. P. Near, J. M. Hellerstein, and D. Song. Chorus: Differential privacy via query rewriting. CoRR, abs/1809.07750, 2018.

[78] D. Kaplan, J. Powell, and T. Woller. AMD memory encryption. White paper, Apr. 2016.

[79] A. F. Karr, X. Lin, A. P. Sanil, and J. P. Reiter. Secure regression on distributed databases. Journal of Computational and Graphical Statistics, 2005.

[80] M. Kornacker, A. Behm, V. Bittorf, T. Bobrovytsky, C. Ching, A. Choi, J. Erickson, M. Grund, D. Hecht, M. Jacobs, I. Joshi, L. Kuff, D. Kumar, A. Leblang, N. Li, I. Pandis, H. Robinson, D. Rorke, S. Rus, J. Russell, D. Tsirogiannis, S. Wanderman-Milne, and M. Yoder. Impala: A modern, open-source SQL engine for Hadoop. In CIDR, 2015.

[81] D. Lee, D. Jung, I. T. Fang, C.-C. Tsai, and R. A. Popa. An off-chip attack on hardware enclaves via the memory bus, 2019.

[82] D. Lee, D. Kohlbrenner, S. Shinde, K. Asanovic, and D. Song. Keystone: An open framework for architecting trusted execution environments. In European Conference on Computer Systems (Eurosys), 2020. https://keystone-enclave.org/.

[83] J. Lee, J. Jang, Y. Jang, N. Kwak, Y. Choi, C. Choi, T. Kim, M. Peinado, and B. B. Kang. Hacking in darkness: Return-oriented programming against secure enclaves. In 26th USENIX Security Symposium (USENIX Security 17), 2017.

[84] S. Lee, M.-W. Shih, P. Gera, T. Kim, H. Kim, and M. Peinado. Inferring Fine-grained Control Flow Inside SGX Enclaves with Branch Shadowing. In USENIX SECURITY, 2017.

[85] T. Leighton. Tight bounds on the complexity of parallel sorting. In Proceedings of the sixteenth annual ACM symposium on Theory of computing, pages 71–80. ACM, 1984.

[86] J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.

[87] S. Liang and G. Bracha. Dynamic class loading in the Java virtual machine. In OOPSLA ’98, pages 36–44, 1998.

[88] C. Liu, A. Harris, M. Maas, M. Hicks, M. Tiwari, and E. Shi. GhostRider: A hardware-software system for memory trace oblivious computation. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’15, 2015.

[89] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Graphlab: A new framework for parallel machine learning. In P. Grünwald and P. Spirtes, editors, UAI, pages 340–349. AUAI Press, 2010.

[90] S. Ma, Y. Cao, J. Huai, and T. Wo. Distributed graph pattern matching. In Proceedings of the 21st International Conference on World Wide Web, WWW ’12, 2012.

[91] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10, pages 135–146, New York, NY, USA, 2010. ACM.

[92] S. Matetic, M. Ahmed, K. Kostiainen, A. Dhar, D. Sommer, A. Gervais, A. Juels, and S. Capkun. ROTE: Rollback protection for trusted execution. In 26th USENIX Security Symposium (USENIX Security 17), 2017.

[93] J. McAuley, R. Pandey, and J. Leskovec. Inferring networks of substitutable and comple- mentary products. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, pages 785–794, New York, NY, USA, 2015. ACM.

[94] F. McKeen, I. Alexandrovich, A. Berenzon, C. Rozas, H. Shafi, V. Shanbhogue, and U. Savagaonkar. Innovative instructions and software model for isolated execution. In Proceedings of the 2nd International Workshop on Hardware and Architectural Support for Security and Privacy, 2013.

[95] H. B. McMahan, E. Moore, D. Ramage, and B. A. y Arcas. Federated learning of deep networks using model averaging. CoRR, abs/1602.05629, 2016.

[96] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: interactive analysis of web-scale datasets. Proc. VLDB Endow., 3:330–339, Sept 2010.

[97] Microsoft. Always Encrypted database engine, 2019. https://msdn.microsoft.com/en-us/library/mt163865.aspx.

[98] P. Mishra, R. Poddar, J. Chen, A. Chiesa, and R. A. Popa. Oblix: An efficient oblivious search index. In 2018 IEEE Symposium on Security and Privacy (SP), pages 279–296, 2018.

[99] A. Moghimi, J. Wichelmann, T. Eisenbarth, and B. Sunar. Memjam: A false dependency attack against constant-time crypto implementations. International Journal of Parallel Programming, 2019.

[100] P. Mohassel and Y. Zhang. SecureML: A system for scalable privacy-preserving machine learning. In 2017 IEEE Symposium on Security and Privacy (SP), 2017.

[101] T. Moseley, A. Shye, V. J. Reddi, D. Grunwald, and R. Peri. Shadow profiling: Hiding instrumentation costs with parallelism. In CGO ’07, pages 198–208, 2007.

[102] K. Murdock, D. Oswald, F. D. Garcia, J. Van Bulck, D. Gruss, and F. Piessens. Plundervolt: Software-based fault injection attacks against Intel SGX. In 2020 IEEE Symposium on Security and Privacy (SP), 2020.

[103] A. Narayan and A. Haeberlen. DJoin: Differentially private join queries over distributed databases. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI’12, 2012.

[104] T. Neumann. Efficiently compiling efficient query plans for modern hardware. Proceedings of the VLDB Endowment, 4(9):539–550, 2011.

[105] V. Nikolaenko, U. Weinsberg, S. Ioannidis, M. Joye, D. Boneh, and N. Taft. Privacy-preserving ridge regression on hundreds of millions of records. In 2013 IEEE Symposium on Security and Privacy, 2013.

[106] O. Ohrimenko, M. Costa, C. Fournet, C. Gkantsidis, M. Kohlweiss, and D. Sharma. Observing and preventing leakage in MapReduce. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, CCS ’15, 2015.

[107] O. Ohrimenko, F. Schuster, C. Fournet, A. Mehta, S. Nowozin, K. Vaswani, and M. Costa. Oblivious multi-party machine learning on trusted processors. In 25th USENIX Security Symposium (USENIX Security 16), 2016.

[108] O. Oleksenko, B. Trach, R. Krahn, M. Silberstein, and C. Fetzer. Varys: Protecting SGX enclaves from practical side-channel attacks. In 2018 USENIX Annual Technical Conference (USENIX ATC 18), 2018.

[109] C. Olston and B. Reed. Inspector Gadget: A framework for custom monitoring and debugging of distributed dataflows. In SIGMOD, 2011.

[110] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In SIGMOD ’08, pages 1099–1110, 2008.

[111] pandas Python data analysis library. http://pandas.pydata.org.

[112] A. Papadimitriou, R. Bhagwan, N. Chandran, R. Ramjee, A. Haeberlen, H. Singh, A. Modi, and S. Badrinarayanan. Big data analytics over encrypted datasets with Seabed. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016.

[113] B. Parno, J. Lorch, J. J. Douceur, J. Mickens, and J. M. McCune. Memoir: Practical state continuity for protected modules. In Proceedings of the IEEE Symposium on Security and Privacy, 2011.

[114] R. A. Popa, C. M. S. Redfield, N. Zeldovich, and H. Balakrishnan. CryptDB: Protecting confidentiality with encrypted query processing. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, SOSP ’11, 2011.

[115] Presto. https://prestodb.io/.

[116] X. Qian, M. E. Stickel, P. D. Karp, T. F. Lunt, and T. D. Garvey. Detection and elimination of inference channels in multilevel relational database systems. In Proceedings 1993 IEEE Computer Society Symposium on Research in Security and Privacy, pages 196–205. IEEE, 1993.

[117] The R project for statistical computing. http://www.r-project.org.

[118] J. L. Raisaro, J. Troncoso-Pastoriza, M. Misbach, J. S. Sousa, S. Pradervand, E. Missiaglia, O. Michielin, B. Ford, and J.-P. Hubaux. MedCo: Enabling secure and privacy-preserving exploration of distributed clinical and genomic data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2019.

[119] M. A. Rodriguez. The Gremlin graph traversal machine and language. CoRR, abs/1508.03843, 2015.

[120] A. Roy, I. Mihailovic, and W. Zwaenepoel. X-Stream: Edge-centric graph processing using streaming partitions. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP ’13, pages 472–488, New York, NY, USA, 2013. ACM.

[121] F. Schuster, M. Costa, C. Fournet, C. Gkantsidis, M. Peinado, G. Mainar-Ruiz, and M. Russinovich. VC3: Trustworthy data analytics in the cloud using SGX. In Proceedings of the 2015 IEEE Symposium on Security and Privacy, SP ’15. IEEE Computer Society, 2015.

[122] M. Schwarz, M. Lipp, D. Moghimi, J. Van Bulck, J. Stecklina, T. Prescher, and D. Gruss. ZombieLoad: Cross-privilege-boundary data sampling. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, 2019.

[123] M. Schwarz, S. Weiser, D. Gruss, C. Maurice, and S. Mangard. Malware Guard Extension: Using SGX to Conceal Cache Attacks. In DIMVA, 2017.

[124] B. Shao, H. Wang, and Y. Li. Trinity: A distributed graph engine on a memory cloud. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD ’13, pages 505–516, New York, NY, USA, 2013. ACM.

[125] D. Shasha, J. T. L. Wang, and R. Giugno. Algorithmics and applications of tree and graph searching. In Proceedings of the Twenty-first ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’02, pages 39–52, New York, NY, USA, 2002. ACM.

[126] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O’Neil, et al. C-Store: A column-oriented DBMS. In Proceedings of the 31st International Conference on Very Large Data Bases, pages 553–564. VLDB Endowment, 2005.

[127] M. Stonebraker, P. M. Aoki, R. Devine, W. Litwin, and M. Olson. Mariposa: A new architecture for distributed data. In Proceedings of the 10th International Conference on Data Engineering, 1994.

[128] Z. Sun, H. Wang, H. Wang, B. Shao, and J. Li. Efficient subgraph matching on billion node graphs. Proc. VLDB Endow., 5(9):788–799, May 2012.

[129] A. Tang, S. Sethumadhavan, and S. Stolfo. CLKSCREW: Exposing the perils of security-oblivious energy management. In 26th USENIX Security Symposium (USENIX Security 17), 2017.

[130] C. H. Teixeira, A. J. Fonseca, M. Serafini, G. Siganos, M. J. Zaki, and A. Aboulnaga. Arabesque: a system for distributed graph mining. In Proceedings of the 25th Symposium on Operating Systems Principles, pages 425–440. ACM, 2015.

[131] K. Thomas, C. Grier, J. Ma, V. Paxson, and D. Song. Design and evaluation of a real-time URL spam filtering service. In IEEE Symposium on Security and Privacy, 2011.

[132] Titan distributed graph database. http://thinkaurelius.github.io/titan/.

[133] A. Ferraiuolo, A. Baumann, C. Hawblitzel, and B. Parno. Komodo: Using verification to disentangle secure-enclave hardware from software. In SOSP, 2017.

[134] N. Volgushev, M. Schwarzkopf, B. Getchell, M. Varia, A. Lapets, and A. Bestavros. Conclave: Secure multi-party computation on big data. In Proceedings of the Fourteenth EuroSys Conference 2019, EuroSys ’19, 2019.

[135] W. Wang, G. Chen, X. Pan, Y. Zhang, X. Wang, V. Bindschaedler, H. Tang, and C. A. Gunter. Leaky Cauldron on the Dark Land: Understanding Memory Side-Channel Hazards in SGX. In CCS, 2017.

[136] X. Wang, S. Ranellucci, and J. Katz. Global-scale secure multiparty computation. Cryptology ePrint Archive, Report 2017/189, 2017. https://eprint.iacr.org/2017/189.

[137] J. Webber. A programmatic introduction to Neo4j. In Proceedings of the 3rd Annual Conference on Systems, Programming, and Applications: Software for Humanity, SPLASH ’12, pages 217–218, New York, NY, USA, 2012. ACM.

[138] N. Weichbrodt, A. Kurmus, P. Pietzuch, and R. Kapitza. AsyncShock: Exploiting synchronisation bugs in Intel SGX enclaves. In European Symposium on Research in Computer Security. Springer, 2016.

[139] Y. Xu, W. Cui, and M. Peinado. Controlled-Channel Attacks: Deterministic Side Channels for Untrusted Operating Systems. In IEEESP, 2015.

[140] Y. Yarom, D. Genkin, and N. Heninger. CacheBleed: A timing attack on OpenSSL constant-time RSA. Journal of Cryptographic Engineering, 2017.

[141] Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In OSDI, 2008.

[142] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012.

[143] K. Zeng, J. Yang, H. Wang, B. Shao, and Z. Wang. A distributed graph engine for web scale RDF data. In Proceedings of the 39th International Conference on Very Large Data Bases, PVLDB’13, pages 265–276. VLDB Endowment, 2013.

[144] M. Zhao, P. Liu, and J. Lobo. Towards collaborative query planning in multi-party database networks. In P. Samarati, editor, Data and Applications Security and Privacy XXIX, 2015.

[145] W. Zheng, A. Dave, J. G. Beekman, R. A. Popa, J. E. Gonzalez, and I. Stoica. Opaque: An oblivious and encrypted distributed analytics platform. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), 2017.

[146] W. Zheng, R. A. Popa, J. E. Gonzalez, and I. Stoica. Helen: Maliciously secure coopetitive learning for linear models. In 2019 IEEE Symposium on Security and Privacy (SP), 2019.

[147] Y. Zhou, H. Cheng, and J. X. Yu. Graph clustering based on structural/attribute similarities. Proc. VLDB Endow., 2(1), Aug. 2009.