Understanding Query Performance in Accumulo

Scott M. Sawyer, B. David O’Gwynn, An Tran, and Tamara Yu MIT Lincoln Laboratory Emails: {scott.sawyer, dogwynn, atran, tamara}@ll.mit.edu

Abstract—Open-source, BigTable-like distributed databases provide a scalable storage solution for data-intensive applications. The simple key–value storage schema provides fast record ingest and retrieval, nearly independent of the quantity of data stored. However, real applications must support non-trivial queries that require careful key design and value indexing. We study an Apache Accumulo–based big data system designed for a network situational awareness application. The application's storage schema and data retrieval requirements are analyzed. We then characterize the corresponding Accumulo performance bottlenecks. Queries are shown to be communication-bound and server-bound in different situations. Inefficiencies in the open-source communication stack and filesystem limit network and I/O performance, respectively. Additionally, in some situations, parallel clients can contend for server-side resources. Maximizing data retrieval rates for practical queries requires effective key design, indexing, and client parallelization.

I. INTRODUCTION

An increasing number of applications now operate on datasets too large to store on a single database server. These Big Data applications face challenges related to the volume, velocity and variety of data. A new class of distributed databases offers scalable storage and retrieval for many of these applications. One such solution, Apache Accumulo [1], is an open-source database based on Google's BigTable design [2]. BigTable is a tabular key–value store, in which each key is a pair of strings corresponding to a row and column identifier, such that records have the format:

(row, column) → value

Records are lexicographically sorted by row key, and rows are distributed across multiple database servers. The sorted records ensure fast, efficient reads of a row, or a small range of rows, regardless of the total system size. Compared to HBase [3], another open-source BigTable implementation, Accumulo provides cell-level access control and features an architecture that leads to higher performance for parallel clients [4].

The BigTable design eschews many features of traditional database systems, making trade-offs between performance, scalability, and data consistency. A traditional Relational Database Management System (RDBMS) based on Structured Query Language (SQL) provides atomic transactions and data consistency, an essential requirement for many applications. On the other hand, BigTable uses a "NoSQL" model that relaxes these transaction requirements, guaranteeing only eventual consistency while tolerating stale or approximate data in the interim. Also unlike RDBMS tables, BigTables do not require a pre-defined schema, allowing columns to be added to any row at will for greater flexibility. However, new NoSQL databases lack the mature code base and rich feature set of established RDBMS solutions. Where needed, query planning and optimization must be implemented by the application developer.

As a result, optimizing data retrieval in a NoSQL system can be challenging. Developers must design a row key scheme, decide which optimizations to employ, and write queries that efficiently scan tables. Because distributed systems are complex, bottlenecks can be difficult to predict and identify. Many open-source NoSQL databases are implemented in Java and use heavy communication middleware, causing inefficiencies that are difficult to characterize. Thus, designing a Big Data system that meets challenging data ingest, query, and scalability requirements represents a very large tradespace.

In this paper, we study a practical Big Data application using a multi-node Accumulo instance. The Lincoln Laboratory Cyber Situational Awareness (LLCySA) system monitors traffic and events on an enterprise network and uses Big Data analytics to detect cybersecurity threats. We present the storage schema, analyze the retrieval requirements, and experimentally determine the bottlenecks for different query types. Although we focus on this particular system, results are applicable to a wide range of applications that use BigTable-like databases to manage semi-structured event data. Advanced applications must perform ad hoc, multi-step queries spanning multiple database servers and clients. Efficient queries must balance server, client and network resources.

Related work has investigated key design schemes and performance issues in BigTable implementations. Although distributed key–value stores have been widely used by web companies for several years, studying these systems is a relatively new area of research. Kepner, et al. propose the Dynamic Distributed Dimensional Data Model (D4M), a general purpose schema suitable for use in Accumulo or other BigTable implementations [5]. Wasi-ur-Rahman, et al. study performance of HBase and identify the communication stack as the principal bottleneck [6]. The primary contribution of our work is the analysis of Accumulo performance bottlenecks in the context of an advanced Big Data application.

This paper is organized as follows. Section II discusses the data retrieval and server-side processing capabilities of Accumulo. Section III contains a case study of a challenging application using Accumulo. Then, Section IV describes our performance evaluation methodology and presents the results. Finally, Section V concludes with specific recommendations for optimizing data retrieval performance in Accumulo.
This work is sponsored by the Assistant Secretary of Defense for Research and Engineering under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.


II. DATA RETRIEVAL IN ACCUMULO

A single key–value pair in Accumulo is called an entry. Per the BigTable design, Accumulo persists entries to a distributed file system, generally running on commodity spinning-disk hard drives. A contiguous range of sorted rows is called a tablet, and each server in the database instance is a tablet server. Accumulo splits tablets into files and stores them on the Hadoop Distributed File System (HDFS) [7], which provides redundancy and high-speed I/O. Tablet servers can perform simple operations on entries, enabling easy parallelization of tasks such as regular expression filtering. Tablet servers will also utilize available memory for caching entries to accelerate ingest and query.

At query time, Accumulo tablet servers process entries using an iterator framework. Iterators can be cascaded together to achieve simple aggregation and filtering. The server-side iterator framework enables entries to be processed in parallel at the tablet servers and filtered before sending entries across the network to clients [8]. While these iterators make it easy to implement a parallel processing chain for entries, the iterator programming model has limitations. In particular, iterators must operate on a single row or entry, be stateless between iterations, and not communicate with other iterators. Operations that cannot be implemented in the iterator framework must be performed at the client.

An Accumulo client initiates data retrieval by requesting scans on one or more row ranges. Entries are processed by any configured iterators, and the resulting entries are sent from server to client using object serialization middleware [9], [10]. Clients deserialize one entry at a time using client-side iterators and process the entries according to application requirements. Simple client operations may include generating a list of unique values or grouping and counting entries.
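To make the scan and iterator mechanics above concrete, the following sketch shows a minimal Accumulo Java client that scans a single row range and attaches a server-side regular-expression filter before entries are serialized to the client. The connection parameters, table name, and row bounds are illustrative placeholders rather than the LLCySA configuration, and the code assumes the Accumulo 1.4-era client API.

import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.user.RegExFilter;
import org.apache.accumulo.core.security.Authorizations;

public class ScanExample {
    public static void main(String[] args) throws Exception {
        // Placeholder connection settings, not the LLCySA deployment.
        Connector conn = new ZooKeeperInstance("instance", "zkhost:2181")
                .getConnector("user", "secret".getBytes());

        Scanner scanner = conn.createScanner("event_table", new Authorizations());
        scanner.setRange(new Range("row_start", "row_end")); // one contiguous range of sorted rows

        // Server-side iterator: filter entries by a regular expression on the
        // value so non-matching entries are never sent to the client.
        IteratorSetting filter = new IteratorSetting(25, "valueFilter", RegExFilter.class);
        RegExFilter.setRegexs(filter, null, null, null, ".*example\\.com.*", false);
        scanner.addScanIterator(filter);

        // Entries arrive at the client sorted and already filtered.
        for (Entry<Key, Value> e : scanner) {
            System.out.println(e.getKey().getRow() + " -> " + e.getValue());
        }
    }
}

Cascading additional IteratorSetting instances at higher priorities builds up the server-side processing chain described above; anything the iterator model cannot express, such as stateful or cross-row logic, falls to the client-side loop.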
Unlike a RDBMS, Accumulo queries often require processing on both the server side and client side because of the limitations of the iterator framework. Fortunately, tablet servers can handle simultaneous requests from multiple clients, enabling clients to be parallelized to accelerate queries. A simple way to parallelize an Accumulo client is to partition the total row range across a number of client processes. Each Accumulo client processes its entries, and then results can be combined or reduced at a single query client. Figure 1 shows how data flows in a query with parallel Accumulo clients.

Fig. 1. In a parallel client query, M tablet servers send entries to N Accumulo clients where results are further processed before being aggregated at a single query client.
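As a sketch of the range-partitioning step behind Figure 1, the helper below splits a time interval into contiguous, non-overlapping sub-ranges, one per parallel Accumulo client; a single query client would later combine the partial results. The assumed row-key layout (a zero-padded epoch-seconds prefix) is illustrative only and is not the LLCySA key scheme.

import java.util.ArrayList;
import java.util.List;

import org.apache.accumulo.core.data.Range;

public class RangePartitioner {
    // Split [startSec, endSec) into n contiguous sub-ranges, one per client process.
    public static List<Range> partitionByTime(long startSec, long endSec, int n) {
        List<Range> ranges = new ArrayList<Range>();
        long step = (endSec - startSec) / n;
        for (int i = 0; i < n; i++) {
            long lo = startSec + i * step;
            long hi = (i == n - 1) ? endSec : lo + step;
            // Inclusive start, exclusive end, so adjacent clients never overlap.
            ranges.add(new Range(String.format("%010d", lo), true,
                                 String.format("%010d", hi), false));
        }
        return ranges;
    }
}

Each worker process scans only its own sub-range and performs its share of the client-side processing; the reduce step at the query client is then typically a simple merge of counts or lists.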
In a multi-step query, an Accumulo client receives entries and uses those results to initiate a new scan. For example, using an index table to locate rows containing specific values requires a two-step query. Suppose each row in the primary table represents a persisted object, and each column and value pair in that row represents an object attribute. In the index table, each row key stores a unique attribute value from the primary table, and each column contains a key pointing to a row in the primary table. Figure 2 illustrates a notional example of this indexing technique. Given these tables, an indexed value look-up requires a first scan of the index table to retrieve a set of row keys, and a second scan to retrieve those rows from the primary table. In a RDBMS, this type of query optimization is handled automatically.

Fig. 2. In this example, network events are stored in a primary table using a datestamp and hash as a unique row identifier. The index table enables efficient look-up by attribute value without the need to scan the entire primary table.

III. CASE STUDY: NETWORK SITUATIONAL AWARENESS

Network Situational Awareness (SA) is an emerging application area that addresses aspects of the cybersecurity problem with data-intensive analytics. The goal of network SA is to provide detailed, current information about the state of a network and how it got that way. The primary information sources are the logs of various network services. Enterprise networks are large and complex, as an organization may have thousands of users and devices, dozens of servers, multiple physical locations, and many users connecting remotely. Thus, network logs exhibit the variety, velocity and volume associated with Big Data. The data must be efficiently stored and retrieved to support real-time threat detection algorithms as well as forensic analysis.

The various data sources capture different types of network events. For example, an event may represent a user logging in to a device or a device requesting a web page through a proxy server. Each type of event has a set of attributes associated with it. LLCySA currently stores more than 50 event types in an Accumulo database instance. A data enrichment process can create additional event types by combining information from multiple captured data sources.

Recall that Accumulo is a tabular key–value store in which keys contain a row and column identifier. In LLCySA, each event type is stored in a separate table. A row in an event table corresponds to a particular event occurrence. Columns store the various attributes of the event occurrence. For example, web proxy traffic is stored in a dedicated Accumulo table. Each row represents a specific web request and contains many columns corresponding to the event's attributes, such as time, requested URL, source IP address, and dozens of other properties.

LLCySA also applies three optimization techniques at ingest time in order to accelerate future queries. First, the row identifier contains a reversed timestamp, encoded to lexicographically sort in descending order (such that the most recent events sort to the top). This feature enables efficient queries constrained by time range, and enables the most recent events to be returned to the user first. Second, each event table has a corresponding index table to support efficient queries based on column values. In the index table, the row identifiers begin with values concatenated with column names, and each column points to a row in the primary event table. Without such an index table, finding a particular value would require scanning the entire event table. This indexing technique would be equivalent to the following SQL:

CREATE INDEX field_index
  ON event_table (field, event_time)

for each event attribute, field. Figure 3 shows the key scheme for the event and index tables. Note that Accumulo keys support column families (for locality grouping), cell versioning, and cell access control. For simplicity, these components of the entry keys are not shown in Figure 3.

Fig. 3. The LLCySA storage schema uses two tables per event type and carefully designed keys to support efficient queries.

The final optimization is sharding, a partitioning technique that distributes rows across database servers. LLCySA uses random sharding in the event tables to ensure database scans will always have a high degree of parallelism by avoiding hot spots corresponding to time ranges. Sharding in Accumulo is accomplished by pre-pending a shard number to the start of the row key. The shard number for an event is determined based on a hash of the event's attributes, resulting in an essentially random assignment, uniformly distributed among the allowed shard numbers. We instruct Accumulo to distribute these shards across tablet servers by pre-splitting tablets based on shard number ranges. As the database grows, the exact mapping between shard number and tablet server may change, but the uniform distribution will be largely maintained. Scanning for a single time range in the sharded event tables requires specifying a key range for each valid shard number. This adds complexity to the Accumulo clients but results in a higher degree of scan parallelism.
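The sketch below illustrates how an ingest client might assemble such keys: a shard prefix derived from a hash of the event, a reversed timestamp so that recent events sort first, and a matching entry in the index table. The shard count, separators, hex encoding, and empty column family are all assumptions made for illustration; the exact LLCySA key layout is summarized in Figure 3 but not fully specified here.

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class EventKeyBuilder {
    static final int NUM_SHARDS = 16; // assumed shard count, not the LLCySA value

    // Reversed timestamp: newest events sort first under Accumulo's ascending key order.
    static String reverseTimestamp(long epochMillis) {
        return String.format("%016x", Long.MAX_VALUE - epochMillis);
    }

    // Shard prefix from a hash of the event attributes (essentially random, uniform).
    static String shard(String eventAttributes) {
        return String.format("%02d", Math.abs(eventAttributes.hashCode()) % NUM_SHARDS);
    }

    // Write one event column and the corresponding index entry.
    static void writeEvent(BatchWriter eventWriter, BatchWriter indexWriter,
                           long epochMillis, String attributes,
                           String field, String value) throws Exception {
        String eventRow = shard(attributes) + "_" + reverseTimestamp(epochMillis);

        Mutation event = new Mutation(new Text(eventRow));
        event.put(new Text(""), new Text(field), new Value(value.getBytes()));
        eventWriter.addMutation(event);

        // Index row: the value concatenated with the field name plus the reversed
        // timestamp, so time-range conditions need no extra computation; the
        // column qualifier points back at the event row (cf. Figures 2 and 3).
        Mutation index = new Mutation(new Text(value + "|" + field + "_" + reverseTimestamp(epochMillis)));
        index.put(new Text(""), new Text(eventRow), new Value(new byte[0]));
        indexWriter.addMutation(index);
    }
}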
Accumulo and the LLCySA storage schema enable high-performance parallel data ingest. We have previously demonstrated data ingest rates in excess of four million records per second for our 8-node Accumulo instance, with speedup for up to 256 client processes parsing raw data and inserting records [11]. These ingest rates on the LLCySA system are roughly an order of magnitude faster than insert rates for traditional relational databases reported on the web [12], and the Accumulo architecture offers significantly more scalability.

Network SA also requires high-performance data retrieval for a variety of queries. Simple queries may extract all or some of the columns for events in some time range. Use cases for these simple scan-based queries include graph formation or listing active devices over some time range. Other queries find the set of events where a column contains a particular value (i.e., the equivalent of a SQL SELECT query with a WHERE clause). These queries may utilize the index table to identify applicable event rows. Another common query creates a histogram of event occurrences over a time range. The next section identifies the bottlenecks of each of these types of query.

IV. PERFORMANCE RESULTS

To quantify Accumulo data retrieval performance, we run four types of queries on the LLCySA web_request table. Rows of the table have over 50 columns (e.g., event_time, source_ip, and domain_name) corresponding to the various event attributes captured in the proxy server log file. In each experiment, the number of Accumulo clients is varied, with the parallel clients partitioned by time range. For example, if a one-hour time range were partitioned across four clients, each client would request a contiguous 15-minute chunk of that hour. The sharding scheme ensures data requested by each client will be distributed uniformly among the tablet servers. The key performance metrics for the query experiments are query speedup (how many times faster the parallel clients complete the query compared to a single client) and scan rate (the total number of entries per second received by clients).

Experiments are performed on a 16-node cluster managed by the Grid Engine scheduler. Each server node contains dual AMD Opteron 6100 12-core processors, 64 GB RAM, and 12 TB RAID storage, and nodes are interconnected with 10 Gbps Ethernet. The first 8 servers comprise a single distributed Accumulo 1.4 database instance. The remaining 8 servers are used as grid compute nodes. To execute parallel queries, clients are dispatched onto compute nodes by the scheduler. Experiments are managed by a custom benchmarking framework, which monitors CPU and network utilization on each server and performs parameter sweeps.

Query 1: Unfiltered Scan

In this query, a full scan is performed over a time range of web request data with no server-side filtering. Every entry within the time range is returned to a client. The total time range is partitioned across a variable number of parallel clients. The clients receive entries and count them, but perform no other processing. The web request records are sharded randomly across all tablet servers, so each parallel client will receive entries from all 8 tablet servers. Figure 4 (a) shows that scan rate increases with the number of clients, with diminishing returns for higher numbers of clients as scan rate approaches about 500,000 entries per second. Figure 4 (b) shows average CPU usage across clients peaks around 2.0 (i.e., 2 fully utilized cores out of 24), while server CPU usage continues to increase linearly with the number of clients. These trends suggest that Query 1 is bound primarily by the ability of the clients to receive entries from the network and de-serialize the messages. Although these queries are communication-bound, average network utilization for the 12-client case is 29.8 Mbps, more than 300x below the network capacity. Therefore, scan rate is limited by communication overhead rather than network bandwidth.

Fig. 4. Query 1 results. (a) Scan rate increases with number of clients. (b) Client CPU usage stalls at 2.0.

Query 2: Server-Side Filtering

Now, records are filtered by the tablet servers. Acceptance rate refers to the fraction of entries read at the tablet server that are sent to the client. A custom server-side iterator is used to vary acceptance rate as an independent variable. By comparison, Query 1 had an acceptance rate of 1.0. The number of clients, n, is also varied. Figure 5 (a) shows that for high acceptance rates (right side of x-axis), scan rate varies with number of clients, which indicates scan rate is bound by the clients' ability to receive entries (as seen in Query 1). However, for low acceptance rates (left side of x-axis), the total scan rate is independent of the number of clients. In Figure 5 (b), entries handled per second refers to the rate at which entries are read from disk and passed through the server-side iterator chain. As acceptance rate decreases (going right to left along the x-axis), the total number of entries handled peaks at a rate mostly independent of client count. Figure 5 (c) shows the average CPU usage at client and server. Together, these three plots demonstrate that data retrieval is client bound for high acceptance rates and server bound for low acceptance rates, with the crossover point around 1% for this particular dataset and cluster. For server-bound scans, sustained disk read rates of 80–160 MB/s were observed, consistent with expected read performance but below the maximum performance of the underlying RAID. Thus we conclude scans with low acceptance rates are I/O bound by HDFS read rates.
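The custom iterator used to sweep the acceptance rate is not listed here; the sketch below shows one plausible way to implement it by extending Accumulo's Filter base class, with the target rate passed in as an iterator option. The option name, the hash-based accept test, and the bucket granularity are assumptions, not the iterator actually used in the experiments.

import java.io.IOException;
import java.util.Map;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.Filter;
import org.apache.accumulo.core.iterators.IteratorEnvironment;
import org.apache.accumulo.core.iterators.SortedKeyValueIterator;

// Server-side filter that accepts an (approximately) fixed fraction of entries.
public class AcceptanceRateFilter extends Filter {
    private double rate = 1.0;

    @Override
    public void init(SortedKeyValueIterator<Key, Value> source,
                     Map<String, String> options, IteratorEnvironment env) throws IOException {
        super.init(source, options, env);
        if (options.containsKey("rate")) {
            rate = Double.parseDouble(options.get("rate"));
        }
    }

    @Override
    public boolean accept(Key k, Value v) {
        // A deterministic test on the row hash keeps the decision stateless,
        // as the iterator programming model requires.
        int bucket = Math.abs(k.getRow().hashCode() % 10000);
        return bucket < rate * 10000;
    }
}

A client would attach the filter with an IteratorSetting (for example, new IteratorSetting(30, "acceptance", AcceptanceRateFilter.class) followed by addOption("rate", "0.01")), so the tablet servers read every entry from HDFS but return only the accepted fraction.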

Query 3: Histogram

Counting occurrences of unique values or binning values to form a histogram are common queries for database applications. While Accumulo provides some server-side capability for aggregating results in the iterator framework, many of these count-type operations must be performed at the client. For example, consider this SQL query that creates a histogram of web requests to a particular domain name over the time range [t1, t2), assuming timestamps have been previously binned and stored in the bin field:

SELECT bin, COUNT(bin) FROM web_requests
  WHERE event_time >= t1 AND event_time < t2
  AND domain_name = "example.com"
  GROUP BY bin;

Figure 6 shows results for the simple histogram query, in which timestamps from web requests are binned into time ranges by parallel clients. This query differs from the SQL example above because the bins are computed by the client, rather than pre-computed and stored as a separate column. Our schema enables this particular query to be performed using only the index table since the WHERE clause depends only on the timestamp and one other field (recall the index table keying scheme in Figure 3). Measured runtime does not include the final reduce step, which would combine the counts generated by each client. However, the runtime of this final step is insignificant compared to total query runtime. Query 3 is run for two different time ranges (1 hour and 8 hours) of web request data. For the larger time range (8 hours), this query achieves increasing speedup for up to 5 clients before contention for the index table eventually lowers the scan rate.
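A hypothetical sketch of the client-side binning step follows: each parallel client scans its sub-range of the index table, decodes the timestamp embedded in the row key, and increments a local counter, and the per-client maps are merged in the final reduce step. The key-decoding helper and the one-hour bin width are assumptions that mirror the key layout assumed in the earlier sketches.

import java.util.Map.Entry;
import java.util.TreeMap;

import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;

public class HistogramClient {
    // Bin the entries from this client's index-table sub-range into hourly counts.
    public static TreeMap<Long, Long> binByHour(Scanner indexScanner) {
        TreeMap<Long, Long> counts = new TreeMap<Long, Long>();
        for (Entry<Key, Value> e : indexScanner) {
            long millis = decodeTimestamp(e.getKey());
            long bin = millis / (3600L * 1000L); // one-hour bins
            Long c = counts.get(bin);
            counts.put(bin, c == null ? 1L : c + 1L);
        }
        return counts;
    }

    // Undo the reversed, hex-encoded timestamp assumed to end the index row key.
    static long decodeTimestamp(Key k) {
        String row = k.getRow().toString();
        String hex = row.substring(row.lastIndexOf('_') + 1);
        return Long.MAX_VALUE - Long.parseLong(hex, 16);
    }
}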
Fig. 5. Query 2 results showing (a) scan rate, (b) entry handling rate, and (c) client and server CPU usage as a function of filter acceptance rate, where n is the number of Accumulo clients.

Fig. 6. Query 3 results show speedup for parallel histogram creation queries.

Fig. 7. Query 4 results show speedup for parallel queries using the index table.

Query 4: Index Look-up

Finally, Query 4 looks at parallelization of a query using both tables. This is a two-step query, in which the index table is used to find a set of rows containing a particular attribute name and value pair. The index table also supports conditions on time range with no additional computation. After the first step completes, the clients have a set of keys corresponding to rows in the event table. The clients then issue scans to retrieve the desired columns from those event rows. In LLCySA, this type of query could be used to retrieve all source IP addresses that accessed a particular domain name. This query could be equivalently expressed in SQL as follows:

SELECT source_ip FROM web_requests
  WHERE event_time >= t1 AND event_time < t2
  AND domain_name = "example.com";

Without an index table, this query would require a complete scan of the event table over the desired time range. If only a small percentage of traffic went to "example.com", the query would have a very low acceptance rate and would consume significant tablet server resources to complete the scan. This multi-step Accumulo query approach could also be used to accomplish JOINs across tables.

Figure 7 shows speedup for Query 4. Interestingly, parallelizing this query across multiple clients significantly degrades performance. In the case of two and four clients (as shown), the index table step comprised 1–2% of the total query runtime. Thus the bottleneck is clearly on the second step, retrieving the set of relevant rows from the event table. During this step, each client requests a disjoint set of rows, located randomly across the tablet servers. Accumulo is unable to efficiently process these random-access requests from multiple clients. The contention for the various tablets and underlying files decimates performance.
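A sketch of how a client might issue this two-step query is shown below, under the same assumed key layout as the earlier sketches: the first scan walks the index table entries whose row keys begin with the value and field name and collects event-row keys from the column qualifiers; the second step fetches only the source_ip column for those rows through a BatchScanner. The table names, the value|field prefix format, and the empty column family are assumptions rather than the exact LLCySA schema, and Range.prefix assumes a reasonably recent Accumulo client library.

import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class IndexLookup {
    public static List<String> sourceIpsForDomain(Connector conn, String domain) throws Exception {
        Authorizations auths = new Authorizations();
        List<String> ips = new ArrayList<String>();

        // Step 1: scan the index table for rows starting with "<value>|<field>"
        // and collect the event-table row keys stored in the column qualifiers.
        Scanner index = conn.createScanner("web_request_index", auths);
        index.setRange(Range.prefix(domain + "|domain_name"));
        List<Range> eventRows = new ArrayList<Range>();
        for (Entry<Key, Value> e : index) {
            eventRows.add(new Range(e.getKey().getColumnQualifier().toString()));
        }
        if (eventRows.isEmpty()) {
            return ips;
        }

        // Step 2: fetch only the source_ip column for the matching event rows.
        // These rows land on effectively random tablets, which is what makes
        // this step contend for server-side resources when parallelized.
        BatchScanner events = conn.createBatchScanner("web_request", auths, 4);
        events.setRanges(eventRows);
        events.fetchColumn(new Text(""), new Text("source_ip"));
        for (Entry<Key, Value> e : events) {
            ips.add(e.getValue().toString());
        }
        events.close();
        return ips;
    }
}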

V. CONCLUSIONS

The results for our four experimental queries help define the landscape of data retrieval performance in Accumulo. For someone with a background in traditional database systems, these results may seem unintuitive. Accumulo is a scalable distributed tabular key–value store, built on an open-source Java-based software stack. Compared to traditional databases, the Accumulo code base is orders of magnitude smaller and less mature. Moreover, developers must implement their own application-specific query planning and optimization. Parallel database management systems have demonstrated far better query performance at the scale of 10–100 nodes [13]. The inefficiency of popular open-source data-intensive computing platforms has been noted for several years [14]. However, the cost, complexity, and limited flexibility of commercial parallel database solutions have pushed many users to the Hadoop paradigm for Big Data applications. Additionally, Accumulo (and other BigTable implementations) can achieve far better ingest rates compared to other database architectures, which can be a key requirement for applications with ingest-heavy workloads.

In the first query, we observed scan rates limited by the client's ability to receive and process messages over the network. When tablet servers send all entries to a small number of clients, the query is bound by communication overhead at the client, as a small number of threads fully utilize a subset of the CPU cores and bottleneck the scan rate. Communication in Accumulo is handled by Apache Thrift, which does not saturate network bandwidth, as confirmed by benchmarks available on the web [15]. Re-architecting the communication stack in future versions of Accumulo could alleviate this bottleneck [6]. For other queries with low acceptance rates, the scan rate is server bound. In this case, the query is bound by I/O rates, and disk read rates are limited by the performance of the distributed filesystem and the storage hardware.
We also observed queries in which server-side contention for records bottlenecked scan rate performance. The histogram operation, in which multiple clients create hotspots in the index table, shows speedup for only a small number of clients. Beyond that, performance is likely bound by contention for the index tablets. Two-step queries utilizing the index table were not effectively parallelized at all. Multiple clients randomly accessing rows from the tablet servers resulted in contention that seriously degraded scan rate performance. Queries with random row-access patterns should not be parallelized.

Given these performance results, we can issue some additional recommendations to application developers. First, row keys should be selected to accelerate the application's queries. For example, in our application case study, all queries were constrained by a time range. Therefore, including a timestamp in the event table row keys improved query performance. Next, an index table should only be used when the scan acceptance rate is expected to be very low. For the LLCySA system, the critical filter acceptance rate was around 1%. This value will vary for different hardware and software configurations. Finally, an area of future work is improving client-bound scan rates by optimizing the Accumulo client. The open-source community has recently created a C++ implementation of an Accumulo client [16].
Using a lower-level language may help improve communication efficiency. In summary, we have characterized performance of typical queries in a Big Data application. Some queries are bound by the communication stack, while others are bound by filesystem I/O. In both cases, hardware is not fully utilized by the open-source Java software stack, which could benefit from optimization. Although Accumulo offers schema-less storage, developers must carefully design row keys in order to efficiently support their application's query requirements. Accumulo and Hadoop offer easy server administration and programmability. On the other hand, traditional database systems provide advanced query planning and optimization features. No single product fits all applications perfectly. However, Accumulo can be an effective storage solution for data-intensive applications requiring high ingest rates, distributed processing, and a high degree of scalability.

REFERENCES

[1] Apache Accumulo. [Online]. Available: http://accumulo.apache.org
[2] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: A distributed storage system for structured data," ACM Transactions on Computer Systems (TOCS), vol. 26, no. 2, p. 4, 2008.
[3] Apache HBase. [Online]. Available: http://hbase.apache.org
[4] S. Patil, M. Polte, K. Ren, W. Tantisiriroj, L. Xiao, J. López, G. Gibson, A. Fuchs, and B. Rinaldi, "YCSB++: benchmarking and performance debugging advanced features in scalable table stores," in Proceedings of the 2nd ACM Symposium on Cloud Computing. ACM, 2011, p. 9.
[5] J. Kepner, W. Arcand, W. Bergeron, N. Bliss, R. Bond, C. Byun, G. Condon, K. Gregson, M. Hubbell, J. Kurz et al., "Dynamic distributed dimensional data model (D4M) database and computation system," in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012, pp. 5349–5352.
[6] M. Wasi-ur-Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. S. Islam, H. Subramoni, C. Murthy, and D. K. Panda, "Understanding the communication characteristics in HBase: What are the fundamental bottlenecks?" in Performance Analysis of Systems and Software (ISPASS), 2012 IEEE International Symposium on. IEEE, 2012, pp. 122–123.
[7] Apache Hadoop. [Online]. Available: http://hadoop.apache.org
[8] A. Fuchs, "Accumulo: extensions to Google's Bigtable design," March 2012, lecture, Morgan State University.
[9] Apache Thrift. [Online]. Available: http://thrift.apache.org
[10] M. Slee, A. Agarwal, and M. Kwiatkowski, "Thrift: Scalable cross-language services implementation," Facebook White Paper, vol. 5, 2007.
[11] C. Byun, W. Arcand, D. Bestor, B. Bergeron, M. Hubbell, J. Kepner, A. McCabe, P. Michaleas, J. Mullen, D. O'Gwynn et al., "Driving big data with big compute," in High Performance Extreme Computing (HPEC), 2012 IEEE Conference on. IEEE, 2012, pp. 1–6.
[12] P. Zaitsev, "High rate insertion with MySQL and Innodb," January 2011. [Online]. Available: http://www.mysqlperformanceblog.com/2011/01/07/high-rate-insertion-with-mysql-and-innodb/
[13] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker, "A comparison of approaches to large-scale data analysis," in Proceedings of the 35th SIGMOD international conference on Management of data. ACM, 2009, pp. 165–178.
[14] E. Anderson and J. Tucek, "Efficiency matters!" ACM SIGOPS Operating Systems Review, vol. 44, no. 1, pp. 40–45, 2010.
[15] "Java serialization benchmarking," 2013. [Online]. Available: https://github.com/eishay/jvm-serializers/wiki
[16] C. J. Nolet, Accumulo++: A C++ library for Apache Accumulo. [Online]. Available: https://github.com/cjnolet/accumulo-cpp