Scalable Fault-Tolerant Elastic Data Ingestion in AsterixDB
UC Irvine
UC Irvine Electronic Theses and Dissertations

Title: Scalable Fault-Tolerant Elastic Data Ingestion in AsterixDB
Permalink: https://escholarship.org/uc/item/9xv3435x
Author: Grover, Raman
Publication Date: 2015
License: https://creativecommons.org/licenses/by-nd/4.0/
Peer reviewed | Thesis/dissertation
eScholarship.org, powered by the California Digital Library, University of California

UNIVERSITY OF CALIFORNIA, IRVINE

DISSERTATION
submitted in partial satisfaction of the requirements for the degree of
DOCTOR OF PHILOSOPHY in Information and Computer Science
by Raman Grover

Dissertation Committee:
Professor Michael J. Carey, Chair
Professor Chen Li
Professor Sharad Mehrotra

2015
© 2015 Raman Grover

DEDICATION
To my wonderful parents...

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGMENTS
CURRICULUM VITAE
ABSTRACT OF THE DISSERTATION

1 Introduction
  1.1 Challenges in Data Feed Management
    1.1.1 Genericity and Extensibility
    1.1.2 Fetch-Once Compute-Many Model
    1.1.3 Scalability and Elasticity
    1.1.4 Fault Tolerance
  1.2 Contributions
  1.3 Organization

2 Related Work
  2.1 Stream Processing Engines
  2.2 Data Routing Engines
  2.3 Extract Transform Load (ETL) Systems
  2.4 Other Systems
    2.4.1 Flume
    2.4.2 Chukwa
    2.4.3 Kafka
    2.4.4 Sqoop
  2.5 Summary

3 Background and Preliminaries
  3.1 AsterixDB
    3.1.1 AsterixDB Architecture
    3.1.2 AsterixDB Data Model
    3.1.3 Querying Data
  3.2 Hyracks
    3.2.1 High-Level Architecture
    3.2.2 Execution Model
  3.3 Summary

4 Data Feed Basics
  4.1 Collecting Data: Feed Adaptors
  4.2 Pre-Processing Collected Data
  4.3 Building a Cascade Network of Feeds
  4.4 Lifecycle of a Feed
  4.5 Policies for Feed Ingestion
  4.6 Summary

5 Runtime for Data Ingestion
  5.1 Feeds Metadata
  5.2 Basic Runtime Components
  5.3 Building the Data Ingestion Pipeline
    5.3.1 Primary Feed without a UDF
    5.3.2 Secondary Feed with AQL UDF
    5.3.3 Feed with a Java UDF
  5.4 Inside a Feed Joint
    5.4.1 Modes of Operation
  5.5 Disconnecting a feed
  5.6 At Least Once Semantics
  5.7 Experimental Evaluation
    5.7.1 Batch Inserts versus Data Ingestion
    5.7.2 Fetch Once, Compute Many Model
    5.7.3 Evaluating Scalability
  5.8 Summary

6 Fault-Tolerant Data Ingestion
  6.1 Soft Failures
    6.1.1 Executing An Operator in a Sandbox
    6.1.2 Logging of an Exception
  6.2 Hard Failures
    6.2.1 Detecting and Identifying a Failure
    6.2.2 The Fault-Tolerance Protocol
    6.2.3 Example Failure Scenarios
    6.2.4 Under the Hood
  6.3 Experimental Evaluation
  6.4 Other Approaches to Fault-Tolerance
    6.4.1 Replication-Based Approach
    6.4.2 Upstream Backup Approach
    6.4.3 Flux
    6.4.4 Borealis Stream Processing Engine
  6.5 Summary

7 Dealing with Data Indigestion
  7.1 Congestion or Data Indigestion
  7.2 Monitoring a Data Ingestion Pipeline
  7.3 Ingestion Policies
    7.3.1 Basic Policy
    7.3.2 Spill Policy
    7.3.3 Discard Policy
    7.3.4 Throttle Policy
    7.3.5 Elastic Policy
  7.4 Discard versus Throttle
  7.5 Comparison with Storm + MongoDB
  7.6 Other Approaches to Dealing with Congestion
    7.6.1 Load Shedding
    7.6.2 Operator-Level Elasticity
    7.6.3 Cluster-Level Elasticity
  7.7 Summary

8 Use Cases
  8.1 Knowledge Base Acceleration
  8.2 Publish-Subscribe
  8.3 Analysis of a Twitter Feed
  8.4 Event Shop
  8.5 Summary

9 Conclusion and Future Work
  9.1 Conclusion
  9.2 Future Work
    9.2.1 Continuous Queries
    9.2.2 Data Replication

Bibliography

Appendices
  A Feed Management Console
  B Writing a Custom Adaptor
    B.1 Push-Based Adaptor
    B.2 Pull-Based Adaptor
    B.3 AdaptorFactory
  C Writing an External Java Function
  D Installing a Pluggable - Adaptor/Function

LIST OF FIGURES

2.1 Dataflow inside a Flume Agent
2.2 Dataflow inside Kafka
3.1 AsterixDB Architecture
3.2 A visualization of the results of a spatial aggregation query. The color of the cell indicates the tweet count.
3.3 Hyracks Architecture
4.1 Building a cascade network of feeds. The solid lines represent the flow of data as constructed by creating a.