Accelerating Raw Analysis with the ACCORDA Software and Hardware Architecture

Yuanwei Fang (CS Department, University of Chicago), Chen Zou (CS Department, University of Chicago), Andrew A. Chien (University of Chicago and Argonne National Laboratory)

ABSTRACT

The data science revolution and growing popularity of data lakes make efficient processing of raw data increasingly important. To address this, we propose the ACCelerated Operators for Raw Data Analysis (ACCORDA) architecture. By extending the operator interface (subtype with encoding) and employing a uniform runtime worker model, ACCORDA integrates data transformation acceleration seamlessly, enabling a new class of encoding optimizations and robust high-performance raw data processing. Together, these key features preserve the software system architecture, empowering state-of-art heuristic optimizations to drive flexible data encoding for performance. ACCORDA derives performance from its software architecture, but depends critically on the acceleration of the Unstructured Data Processor (UDP) that is integrated into the memory-hierarchy, and accelerates data transformation tasks by 16x-21x (parsing, decompression) to as much as 160x (deserialization) compared to an x86 core. We evaluate ACCORDA using TPC-H queries on tabular data formats, exercising raw data properties such as parsing and data conversion. The ACCORDA system achieves 2.9x-13.2x speedups when compared to SparkSQL, reducing raw data processing overhead to a geomean of 1.2x (20%). In doing so, ACCORDA robustly matches or outperforms prior systems that depend on caching loaded data, while operating on raw, unloaded data. This performance benefit is robust across format complexity, query predicates, and selectivity (data statistics). ACCORDA's encoding-extended operator interface unlocks aggressive encoding-oriented optimizations that deliver an 80% average performance increase over the 7 affected TPC-H queries.

1. INTRODUCTION

Driven by a rapid increase in quantity, type, and number of sources of data, many data analytics approaches use the data lake model. Incoming data is pooled in large quantity – hence the name "lake" – and because of varied origin (e.g., purchased from other providers, internal employee data, web scraped, dumps, etc.), it varies in size, encoding, age, quality, etc. Additionally, it often lacks consistency or any common schema. Analytics applications pull data from the lake as they see fit, digesting and processing it on-demand for a current analytics task. The applications of data lakes are vast, including machine learning, data mining, artificial intelligence, ad-hoc query, and data visualization. Fast direct processing on raw data is critical for these applications.

The data lake model has numerous advantages over traditional data warehouses. First, because raw data is only processed when needed, the data lake model avoids the processing and IO work for unused data. Second, direct access to raw data can reduce the data latency for analysis, a key property for streaming data sources or critical fresh data. Third, the data lake supports exploratory analytics where schemas and attributes may be updated rapidly. Fourth, data lakes can avoid information loss during loading by keeping the source information, so applications can interpret inconsistent data in customized fashion. Fifth, the data lake model avoids scaling limits such as the maximum capacity of database systems and licensing limits or cost (price/loaded-data licensing models). One consequence of the data lake model is that data remains in native complex structures and formats instead of in-machine SQL types with well-defined schemas.

From a system point of view, the essence of the data lake model is to shift ETL/loading work from offline into the query execution.
Of course, the shifted work is much smaller - just the part needed for the query. For each query, only the relevant data is pulled from the lake and processed (essentially load, then analyze) to complete the query.

Because data lakes put raw data interpretation on the critical path of query execution, they elevate raw data processing to a critical performance concern. Queries that require answers quickly, or those that require interactive and real-time analysis, perhaps on recently-arrived raw data, will be sensitive to the speed of raw data processing.

Prior research has made improvements for batch ETL (data loading). For example, several software-acceleration approaches exploit SIMD instructions (InstantLoad [46] – to find delimiters, and MISON [43] – to speculate parsing). More closely relevant are Lazy ETL approaches that avoid unnecessary ETL by only processing the data actually needed.

PVLDB Reference Format:
Y. Fang, C. Zou, and A. A. Chien. Accelerating Raw Data Analysis with the ACCORDA Software and Hardware Architecture. PVLDB, 12(11): 1568-1582, 2019.
DOI: https://doi.org/10.14778/3342263.3342634

These systems [12, 16, 36, 37] apply lazy loading, caching of materialized results, and positional maps to avoid expensive data transformations. These approaches require query data reuse, significant memory, and storage resources to achieve high performance. When these properties do not hold, performance can become variable and poor. Another technique for raw data processing is pushing predicates down before the loading work; several systems [15, 37, 47] exploit query semantics for partial predicate pushdown. For these systems, performance is highly dependent on selectivity, and filters are limited to be very simple (literal-equal) and thus do not work for all external data representations (e.g., gzip).

Apache SparkSQL [15] is a widely-used analytics system for processing raw data built on top of Apache Spark [59], a popular in-memory map-reduce framework. SparkSQL offers a SQL-like interface for relational processing, and its popularity has produced a rich ecosystem for data analytics.

We propose ACCORDA, a novel software and hardware architecture that integrates data encoding into operator interfaces and query plans, and uses a uniform runtime worker model to seamlessly exploit the UDP accelerator for data transformation tasks. We implemented ACCORDA, modifying the SparkSQL code base and adding a set of accelerated operators and new optimization rules for data encoding into the Catalyst query optimizer.

In prior work, we designed and evaluated the ACCORDA accelerator (a.k.a. the Unstructured Data Processor or UDP [23, 25]), and build on it here. The ACCORDA architecture integrates UDP acceleration in the memory-hierarchy with low overhead. The Unstructured Data Processor is a high-performance, low-power data transformation accelerator that achieves a 20x geomean performance improvement and a 1,900x geomean performance/watt improvement for a range of ETL tasks over an entire multi-core CPU [25].

We evaluate the ACCORDA architecture on raw data processing micro-benchmarks and TPC-H queries on tabular format to stress parsing of flat data and data type conversion, and show that ACCORDA provides robust query performance on raw data that matches or exceeds loaded, cached data analysis. Specific contributions of the paper include:

• The ACCORDA software architecture, which integrates data transformation acceleration seamlessly by subtyping operator interfaces with data encoding and a uniform runtime worker model – all accelerated. Thus, ACCORDA enables flexible exploitation of encoding-based query transformation and optimization.

• ACCORDA provides robust, high-performance raw data processing, delivering 2.9x-13.2x speedups on TPC-H queries on tabular data when compared to SparkSQL on CPU. ACCORDA narrows the penalty for raw data analysis from 5.6x to 1.2x, effectively eliminating the overhead relative to computing on loaded, cached data.

• To specifically quantify the benefits of ACCORDA's software architecture, we added encoding-based optimization in the middle of a query plan. These encoding optimizations produce additional 1.1x-11.8x speedups, with an 80% average performance increase over the 7 affected TPC-H queries.

• Finally, we show that ACCORDA's in memory-hierarchy integration enables flexible use of hardware acceleration and improves efficiency in analytics systems (e.g., 1.6x faster and 14x less data movement in a typical select-filter-project query).

The remainder of the paper is organized as follows. Section 2 outlines background on data analytics systems and raw data performance challenges. In Section 3, we illustrate new analytics capabilities enabled by fast data transformation. In Sections 4 and 5, we present the ACCORDA software and hardware architecture. We study ACCORDA's performance in Section 7 using the methodology described in Section 6. We discuss and compare our results to the research literature in Section 8, and summarize the results in Section 9.

2. BACKGROUND

2.1 Data Lake or Raw Data Processing

Raw data processing requires in-line data parsing, cleaning, and transformation in addition to analysis. The raw data exists in an external format, often serialized and compressed for portability, that must be parsed to be understood. Typically, only a small fraction of the data is required, so it can be imported to extract what is actually needed. In contrast, database computations on loaded data generally exploit an internal, optimized format for performance.

Furthermore, raw data lacks a clear schema, making extraction difficult. Parsing, decoding, deserialization, and more can be required to transform elements into CPU binary formats. To allow further read optimizations, row/column transposition can be desired. We term these operations collectively data transformation, and all have a common underlying model derived from finite automata (FA); these models underlie all grammar-based encodings. In many cases, these raw formats are composed of a sequence of tag/value pairs (e.g., Snappy, CSV, JSON, and XML). The corresponding operation is determined by the specific tag value (e.g., delimiter in CSV, tag in Snappy, and key in JSON). These formats often employ sub-word encodings (bytes, symbols, etc.) and variable-length encodings (strings, Huffman, etc.). For each FA state transition, a little work is done. Software usually implements such FAs with costly switch statements or computed branches (multi-target branches), making this computation more expensive than IO (Section 3.2).
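To make this branch-heavy structure concrete, the following minimal Scala sketch (ours, not the paper's code) shows such an FA for splitting CSV-style fields; the '|' delimiter matches the TBL files used in Section 6.2. Each input byte drives a multi-target dispatch followed by a tiny action block – exactly the pattern that Section 3.2 reports CPUs predict poorly.

```scala
// Minimal sketch: a byte-at-a-time FA for delimiter-separated fields.
object CsvFieldScanner {
  sealed trait State
  case object InField  extends State
  case object InQuoted extends State

  def split(line: String): Vector[String] = {
    val fields = Vector.newBuilder[String]
    val cur = new StringBuilder
    var state: State = InField
    for (c <- line) {
      (state, c) match {                    // multi-target dispatch per symbol
        case (InField, '|')  => fields += cur.result(); cur.clear()
        case (InField, '"')  => state = InQuoted
        case (InQuoted, '"') => state = InField
        case (_, ch)         => cur += ch   // short action block: copy one byte
      }
    }
    fields += cur.result()
    fields.result()
  }

  def main(args: Array[String]): Unit =
    println(split("""1|Customer#000000001|"a|b"|1998-12-01"""))
}
```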
2.2 SparkSQL and Optimization

SparkSQL [15] is a relational processing module in Spark [59] using Dataset APIs. It leverages traditional RDBMS optimizations and a tight integration between relational and procedural processing in Spark code. SparkSQL uses a serialized internal format, rather than RDDs, in the execution engine to reduce data SerDes cost in shuffling. Tungsten code generation is used to collapse multiple operators together to save interpretation cost and improve data locality. The Catalyst query optimizer uses heuristic rules to expand a set of candidate plans, followed by cost-based plan selection. Modern query optimizers (e.g., Hive/Calcite [17], SQL Server [5], and Pig [27]) use rules for applying heuristics during query optimization to greatly reduce the plan search space.

3. TOWARDS FIRST-CLASS RAW DATA PROCESSING

We describe new analysis capabilities enabled by accelerated raw data processing, and then demonstrate the CPU's poor performance on it, motivating the ACCORDA approach.

Figure 1: Ideal Raw Data Analysis.

Figure 2: High-Performance Transformation Unlocks Performance from Encoding Choice. (a) Sorting. (b) Hashing.

3.1 Fast Filtering, Extraction, and Transformation Unlock New Capabilities

A system with fast data filtering, extraction, and transformation unlocks new capabilities for data analysis. Robust, high-performance raw data processing enables three dimensions of new analytics capabilities (see Figure 1). First, it enables interactive and real-time analysis on fresh data, avoiding data loading delays - due to scheduling or substantial export-transform-load (ETL). Second, it enables users to write ad-hoc queries that access arbitrary data, without regard for it having been loaded. With no fixed schema, users can compute with flexible, open, user-defined data types and operations on raw data. Uniform fast processing welcomes rich data sources – unstructured web logs, sensor data, social network activities, texts and images, and organized columnar data. This valuable analysis is restricted today by high data transformation cost and rigid schema structure. Third, prior research on acceleration (on-demand loading and in-memory buffering [12, 15, 16]) exploits caching efficiency, making performance sensitive to query history and requiring large memory. Systems with robust, high-performance raw data processing can avoid these limitations.

High-performance data transformation also unlocks new opportunities in traditional query optimization. The encoding of both source data and intermediate data is a critical cost factor for analytical operations. Consider sorting. We sort the customerID field in the TPC-H dataset [6], which is in the format "Customer#xxxxxxxx". The native string format implies a comparison-based implementation. Transforming to a partitioned dictionary encoding with a dictionary-encoded prefix delivers a 7x speedup (see Figure 2a).1 Going further, a partitioned-dictionary format enables the use of radix sort (Dict+Radix), producing an overall 22x speedup.

Consider hashing, used in hash-partition, hash-join, and hash-aggregate. Raw data is often unnormalized, using multiple long attributes to identify records. Their hashing cost can be significant; for example, murmurhash3 has cost proportional to key size. Compressing keys cheaply can reduce hashing cost. In Figure 2b, we compare applying Huffman encoding with a 30% compression ratio (on ≈ 200 bytes per record). With ACCORDA's accelerator (Section 5), the encoding can be done cheaply, producing an overall 20% cost reduction. However, CPU's cannot realize the benefits of such optimizations due to poor encoding performance.

1 We transform the id part as an integer and combine it with the encoded prefix for a 32-bit encoding. The key size reduction and the cheaper comparator increase performance. Source data uses TPC-H scale factor 1.
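The sorting example can be made concrete with a small sketch (ours, simplified, not the paper's implementation) of the 32-bit key described in footnote 1: the prefix is dictionary-encoded into a few high bits and the numeric suffix fills the low bits, so keys within a prefix partition sort by simple integer (radix-friendly) comparison instead of string compares.

```scala
// Illustrative sketch: partitioned-dictionary encoding of keys like "Customer#000012345".
object PartitionedDictEncode {
  private val dict = scala.collection.mutable.LinkedHashMap[String, Int]()

  def encode(key: String): Int = {
    val Array(prefix, id) = key.split("#", 2)
    val code = dict.getOrElseUpdate(prefix, dict.size)  // tiny prefix dictionary
    (code << 27) | id.toInt                             // assumes id fits in 27 bits
  }

  def main(args: Array[String]): Unit = {
    val keys = Seq("Customer#000000042", "Customer#000000007")
    println(keys.map(encode).sorted)                     // integer sort replaces string compares
  }
}
```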
3.2 Data Transformation is Expensive on CPU's

Transforming raw data to database formats is costly whether done in batch [46] or just-in-time [12]. For example, Figure 3a shows the single-threaded costs to load the TPC-H [6] Gzip-compressed CSV files (scale factor from 1 to 30) from SSD into the PostgreSQL relational database [4] (Intel Core-i7 CPU with a 250GB SATA 3.0 SSD). This common extract-transform-load (ETL) task includes decompression, parsing record delimiters, tokenizing attribute values, and deserialization. It requires nearly 800 seconds for scale factor 30 (about 30GB uncompressed), dominating time to initial analysis. Note that CPU execution time for loading is >200x larger than disk IO time (Figure 3b).

Figure 3: High Load Costs: Compressed CSV into PostgreSQL. (a) Cost Breakdown. (b) Disk IO.

CPU's perform poorly on data transformation workloads due to 1) poor support for sub-word, variable-length data, and 2) unpredictable control flow in data transformation tasks. Modern CPU micro-architectures require fifty to hundreds of instructions in flight to achieve full performance, requiring correct branch prediction over a dozen or more branches. However, data transformation is heavily conditional and unpredictable, characterized by multi-target branches (switches) and short code blocks that have been shown to waste 86% of cycles due to misprediction [25]. Expensive data transformation limits the encoding optimizations for data analytics.

3.3 ACCORDA Enables Fast Raw Data Transformation and Processing

Figure 5: ACCORDA's Encoding-extended Operator Interface.

We present ACCORDA (ACCelerated Operators for Raw Data Analysis), a software and hardware architecture for data analytics systems on raw data. In Figure 4, we illustrate how ACCORDA fits into the landscape of analytics systems in two dimensions – batch/on-demand loading and accelerated/non-accelerated. Conventional databases (lower left) load data in batch before query execution; this data transformation is expensive on CPU's, consuming significant resources. Conventional accelerated database systems (upper left) use discrete GPU's or FPGA's, focusing on compute acceleration but suffering from high cost to access acceleration. Recently, several systems have focused on raw data processing, loading raw data on-demand and using CPU's (lower right). They include RAW/NoDB [12, 16, 36, 37], which loads on demand, and then caches loaded data in memory lazily to speed up query execution. Spark [15] executes on raw data sources but suffers from poor parsing and deserialization performance. Sparser [47] applies a narrow class of fast filters before data loading to reduce raw data processing cost.

The ACCORDA system (upper right) combines on-demand loading with seamless acceleration that focuses on data transformation, providing predictable, high performance on raw data processing. Key ideas include introducing data encoding types in a query plan, a uniform runtime worker model with data transformation acceleration, in memory-hierarchy hardware acceleration integration, and novel query optimizations based on data encoding. In Section 4, we describe the ACCORDA software architecture, and in Section 5, we describe the UDP hardware acceleration that serves as the key enabler for ACCORDA's software architecture.

Figure 4: Approaches for Raw Data Analysis.

4. ACCORDA SOFTWARE ARCHITECTURE

We discuss the ACCORDA software architecture, describing the key design choices – extending operators with encoding subtypes and a uniform runtime worker model – that together enable seamless, flexible acceleration and optimization. Next, we show how these elements preserve the structure and benefit of traditional analytics engines.

4.1 Encoding in the Operator Interface

ACCORDA extends the operator interface, conceptually adding data encodings as column properties.2 This effectively subtypes the operators with data encoding, allowing expression of encoding-specialized operators (e.g., a filter for RLE-encoded INTs). These extended types (see Figure 5) are used for query optimization and execution, so query plans can capture encodings and data transformations amongst encodings, and optimize them. A transformation can satisfy the encoding requirement of an accelerated operator's implementation, or it can be fused to improve data locality and reduce transformation cost. ACCORDA's acceleration enables many new inexpensive data transformation operators.

A traditional operator interface is a list of attribute types. The ACCORDA operator interface is a list of (data type, encoding) tuples, as illustrated in Figure 5. ACCORDA's current implementation supports four encodings: Native (native format), Dictionary, Huffman, and Partitioned-Dictionary (a prefix of the data item is dictionary-encoded). NATIVE grandfathers legacy operators, enabling co-existence with new explicit encode operators as well as customized variants for improved performance.

2 Blocking and cross-column group optimizations complicate this slightly, but we defer that complexity to later discussion.
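To illustrate the shape of this interface, here is a minimal Scala sketch of (data type, encoding) column signatures and an encoding-specialized operator; the class names are ours, not ACCORDA's actual classes.

```scala
// Minimal sketch of the encoding-extended operator interface: each column carries
// a (dataType, encoding) pair, so the optimizer can reason about encodings and
// insert or fuse encode operators.
sealed trait Encoding
case object Native          extends Encoding
case object Dictionary      extends Encoding
case object Huffman         extends Encoding
case object PartitionedDict extends Encoding   // prefix is dictionary-encoded

final case class ColumnSig(name: String, dataType: String, enc: Encoding)

trait Operator {
  def inputSig: Seq[ColumnSig]    // encodings required on input
  def outputSig: Seq[ColumnSig]   // encodings produced
}

// An encoding-specialized filter, e.g., a filter over dictionary-encoded INTs.
final case class DictIntFilter(col: String, allowed: Set[Int]) extends Operator {
  val inputSig  = Seq(ColumnSig(col, "int", Dictionary))
  val outputSig = inputSig
}
```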

Figure 7 gives a concrete example to demonstrate what happens in an encode operator. The encode operator first batches a set of records into a block and sends the block to the data transformation operator (accelerated as in Section 5) for a series of transformations (a transpose for efficient dictionary encoding of attribute A, and then a transpose back, with each record streaming into downstream operators). In Section 5, we revisit this example, showing acceleration detail.

Figure 7: Dictionary Encode Operator.

Example. To illustrate the power of encoding subtyping for operators, consider the full query optimization example in Figure 6. We add UDP-accelerated operators (left side), swapping in an accelerated JSON reader (blue) under the traditional operator interface. Extending the operator interface with data encoding (see "native" added to many encodings) enables the four query plan transformations (red).

Figure 6: Encoding-Based Query Optimization Example (Encodings Indicated Between Operators).

The encoding for each data attribute is suffixed with "_". For example, in "string_json", json is the encoding and string is the data type. With the richer encoding-extended interface, the optimizer allows lazy parsing and deserialization by pushing predicates down. Next, it picks an efficient operator implementation with a specific encoding (e.g., inference, sort, and regex). Later steps fuse the introduced encode operators to improve execution efficiency and data locality. Finally, the remaining encode operators can be UDP-accelerated (blue). In summary, the encoding-extended operator interface captures data encodings for intermediate results in a query plan, allowing flexible and cascading optimizations.

4.2 Uniform Runtime Worker Model

In ACCORDA, extended operator types enable accelerated and traditional operators to be treated uniformly. ACCORDA also implements a uniform runtime model; all runtime worker threads are accelerated (Figure 8). This uniformity simplifies the scheduling of work, and allows queries to run end-to-end on a single worker without switching, enhancing data locality.

Likewise, ACCORDA integrates acceleration seamlessly within the memory-hierarchy, reducing access overhead and creating low-overhead data sharing across accelerated and unaccelerated operators. This is possible because the ACCORDA accelerator is fast, small, and low-power, so that a single accelerator is sufficient to serve many CPU cores (see Section 5) and still delivers high speedups (evaluated in Section 7.3). Most other hardware acceleration approaches are forced into looser integration [18, 35, 45, 50] by power, and wind up with two worker types: accelerated and normal. Such an approach complicates scheduling, forcing query execution to switch between workers to exploit acceleration.

Figure 8: Most Accelerated Systems Create Different Worker Types (Upper). ACCORDA Has Uniform Accelerated Workers (Lower).

4.3 Preserving Software Architecture

The ACCORDA software system's design preserves the conventional data analytics system architecture. For example, in our implementation built on SparkSQL, system components such as the query frontend, storage system, memory management, and distributed manager are unchanged. As shown in Figure 9, newly added components include new rules and two classes of new operators (encoding-optimized compute and encode). New elements are depicted in green. The optimization layer gains new data-encoding rules for the rule-based optimizer (e.g., the Catalyst optimizer [15]) that transform the query plans to select good encodings.3 ACCORDA encapsulates legacy operators using the extended interface, and adds encoding and compute operators, enabling exploitation of format-optimized compute operators.

3 Our evaluation (Section 7) inspired the addition of rules for data read, filter, hashing, regex matching, and sorting to place data encoding operators in a query plan.

Figure 9: ACCORDA Query Engine Software Architecture (Added Components in Green).
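As an illustration of such a data-encoding rule, the Scala sketch below (ours; the plan and rule classes are simplified stand-ins, not SparkSQL's Catalyst APIs) rewrites a wide multi-attribute group-by to aggregate over encoded keys, in the spirit of the group-key compression rule discussed in Section 4.4.

```scala
// Minimal sketch of a data-encoding rewrite rule in the style of a Catalyst rule.
sealed trait Plan
final case class Scan(table: String)                          extends Plan
final case class Encode(child: Plan, cols: Seq[String])       extends Plan   // e.g., pack group keys
final case class Aggregate(child: Plan, groupBy: Seq[String]) extends Plan

object EncodeGroupKeysRule {
  // Fire only when the group-by key set is wide enough to be worth compressing.
  def apply(plan: Plan): Plan = plan match {
    case Aggregate(child, keys) if keys.size >= 3 =>
      Aggregate(Encode(child, keys), groupBy = keys)
    case other => other
  }
}

// Example: Aggregate(Scan("customer"), Seq("name", "address", "phone")) is
// rewritten to aggregate over encoded keys, mirroring rule b in Figure 12.
```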

Importantly, the ACCORDA software architecture avoids disrupting the optimizer. Figure 10 contrasts optimizer architectures for traditional, accelerated, and ACCORDA-accelerated systems. The traditional non-accelerated query optimizer consists of a cost model and dozens of optimization rules (left). GPU/FPGA-accelerated systems add layers to manage the cost of data transfer to/from PCIe, data transformation overheads, locality impacts, and device scheduling [18] (center). The ACCORDA query optimizer maintains the original optimizer architecture, adding rules, and using cost metrics alone to manage the use of acceleration (right). The key to this is ACCORDA's critical choice of in memory-hierarchy acceleration (Section 5) and the uniform runtime worker model.

Figure 10: Comparing Optimizer Architectures for Acceleration.

4.4 Preserving Flexible Query Optimization

4.4.1 Coarse-grained vs. Fine-grained

Iterator-based execution (tuple-at-a-time) has well-known virtues for avoiding unnecessary materialization – and the corresponding computation and IO [15, 29]. ACCORDA's software architecture, combined with in memory-hierarchy accelerator integration (Section 5.1), allows ACCORDA to preserve the tuple-at-a-time processing model (or at least a block-oriented version).

In contrast, accelerated DBMSs (e.g., GPU databases) generally employ operator-at-a-time processing [18]. Figure 11 shows these differences. This is a consequence of GPU's separate memory systems (copy-in/copy-out, data movement, and synchronization cost), as well as their non-uniform worker model (see Section 4.2). Consequences include large intermediate materialized results, extra computation, and memory demand – particularly for accelerators with limited memory (e.g., GPU's). So the accelerator totally disrupts the optimizer; a common response is to use a completely new strategy that splits a query plan into coarse-grained chunks and schedules each separately. Each chunk runs on the same device using a single processing model (e.g., operator-at-a-time).

Figure 11: Comparing Execution Models.

ACCORDA enables full, fine-grained query optimization. We reuse the existing rules in SparkSQL. The optimizer treats accelerated operators as first-class citizens, as shown in Figure 12. Rules a and b fire when the sub-tree in a query plan and the estimated statistics meet the transforming condition. Rule c replaces a data transformation operator with an accelerated counterpart. Rule d transforms a costly regex-matching filter expression into a separate accelerated operator. While beneficial in ACCORDA, these fine-grained optimizations would damage performance in GPU/FPGA-accelerated database systems if not constrained by the interaction overheads between CPU's and accelerators.

Figure 12: ACCORDA Rule Examples.

In Figure 13a, we illustrate the benefits of fine-grained optimization for two TPC-H queries using the platform, system, and workload described in Section 6. In Q10, the ACCORDA optimizer replaces the data source operator with an accelerated one (rule c), combines seven group-by attributes together, and compresses them using Huffman coding (rule b). The follow-on hash aggregation is computed on the encoded group. Optimization with rule c alone achieves a 2.6x speedup over the baseline, and rule b adds an additional 18% performance improvement. Similarly, in Q13, besides the accelerated data read (rule c), the expensive regex filtering expression is replaced by the hardware-accelerated regex operator (rule d). The combined effect (rules c+d) produces a 13.2x speedup compared to the baseline.

4.4.2 Integral vs. Standalone Encoding

ACCORDA's encoding-extended operator interface allows encode operators to be decoupled (standalone) from integral implementations (encode-compute-decode). In ACCORDA, attributes can take on varied encodings at each step, reducing the operator implementation's encoding overhead. Thus the ACCORDA query optimizer can use rule-based optimizations to choose the best encodings, while decode operators are inserted where necessary and delayed until after filtering or aggregation to reduce cost. These standalone encoding transformations can later be fused to further improve execution efficiency.

We illustrate the potential gain in Figure 13b with a sort-limit query on the shipdate attribute of the lineitem table on the ACCORDA system. The integral implementation of sorting with hardware acceleration gives a 15x speedup. On the other hand, standalone decode operators allow work to be shifted after the filter operations, producing another 20% performance improvement. Larger benefits are possible with more complicated, aggressive transformation rules.
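The decode-delaying rewrite can be sketched as a simple plan transformation (ours, with simplified stand-in types, not ACCORDA's operators): filtering runs on the smaller encoded values and decode work is only paid for surviving rows.

```scala
// Minimal sketch of the standalone-encoding rewrite in Section 4.4.2.
sealed trait Node
final case class Source(table: String)                          extends Node
final case class Decode(child: Node, col: String)               extends Node
final case class Filter(child: Node, col: String, pred: String) extends Node

object DelayDecodeRule {
  def apply(plan: Node): Node = plan match {
    // Filter over a freshly decoded column: filter on the encoded form instead,
    // assuming the predicate can be translated (e.g., equality on dictionary codes).
    case Filter(Decode(child, col), c, pred) if c == col =>
      Decode(Filter(child, col, pred), col)
    case other => other
  }
}
```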

Figure 13: Benefits of ACCORDA Fine-grained Query Optimization. (a) Q10 and Q13. (b) Sort-Limit.

5. ACCORDA HARDWARE ARCHITECTURE

The ACCORDA hardware system builds on our prior work on the UDP (a high-performance, low-power data transformation accelerator [23, 25]). The unique architectural features of the UDP include multi-way dispatch, variable-size symbol support, flexible-source dispatch, and flexible memory sharing. These features, combined with software programmability, a fast scratchpad memory, and 64 parallel lanes, enable a single UDP to provide a 20x geomean speedup over an entire multi-core CPU across a broad variety of export-transform-load computations such as CSV parsing, Huffman encode/decode, regex pattern matching, dictionary encode/decode, run-length encoding, histogram, and Snappy compression/decompression (see Figure 17). Furthermore, the UDP's implementation achieves a geomean 1,900-fold increase in performance/watt (four orders of magnitude). At a tiny 3.82mm2 area and 0.86 watts [25], the UDP can be integrated on a CPU with minimal cost. Together, these capabilities power ACCORDA's low-cost data transformation at full memory speeds. Next, we explore how to best integrate the UDP into the hardware system architecture and the analytics system software architecture.

Figure 14: Two Approaches to Integrate Acceleration (In Memory-Hierarchy Integration in Green).

5.1 In Memory-Hierarchy Integration

We consider two alternatives for integrating the ACCORDA accelerator into the hardware system: 1) on PCIe and 2) in the memory-hierarchy. Historically, most accelerators (GPU's and FPGA's) use PCIe integration [18, 49], both for the convenience of modular "pluggability" and because of the realities of their large die sizes (both FPGA's and GPU's are typically as large as CPU's, requiring 100's of mm2 for good acceleration performance) and high power (GPU's range from 50-300 watts, and FPGA's from 10-50 watts). As important, GPU's require a dedicated high-bandwidth memory for performance, forcing data sharing with the CPU's via copying and expensive data transfers to utilize the acceleration. The resulting data movement overhead prevents fine-grained interleaved execution between CPU's and accelerators. PCIe-attached acceleration is depicted in Figure 14 (blue).

In ACCORDA, we use "in memory-hierarchy" integration of the accelerator – the UDP (Figure 14, green). The UDP is added to the CPU chip, reducing data transfer cost (traffic stays on the on-chip network). CPU cores access the caches normally, but can also directly access the accelerator's scratchpad memory, which is mapped as "uncacheable" (the data doesn't appear in the regular memory-hierarchy). Data is moved to/from scratchpad memory by libraries using lightweight DMA operations (like memcpy) [52], as in Figure 14 (green arrow). For example, the record block in Figure 7 is moved from the CPU caches by the DMA engine into the UDP scratchpad. Next, the UDP program transposes columns with a balanced distribution across the 64 UDP lanes. After that, each UDP lane performs encoding. Then, the result is transposed again and DMA'ed to the CPU caches. The key advantage of in memory-hierarchy integration is the low-overhead data sharing between CPU cores and accelerators; we leverage this as a foundation of the ACCORDA software architecture.

5.2 Programming the UDP Accelerator

A key challenge for specialized accelerators is their generality and software-programmability. We carefully designed the UDP instruction set [22] to achieve software-programmability with high performance.

Figure 15: UDP's Software Stack Supports a Wide Range of Transformations. Traditional CPU and UDP Computation Can be Integrated Flexibly.
The UDP programs are called as application-level dynamic libraries. UDP programs can be composed by domain-specific translators, selected from UDP kernel libraries, or written in high-level languages from third parties – all sharing a single backend (see Figure 15). The translators support a high-level abstraction and a domain-specific frontend to generate high-level assembly language.
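From the analytics engine's side, invoking such a kernel looks like an ordinary library call. The Scala sketch below (ours, with hypothetical names throughout – not the actual ACCORDA/UDP API) illustrates the shape of that interaction for the dictionary-encode operator of Figure 7.

```scala
// Hedged sketch: a JVM-side encode operator hands a record block to a UDP kernel
// packaged as a dynamic library and reached through JNI.
object UdpKernels {
  // In a real build this would be a JNI-bound native method, e.g.:
  //   @native def dictEncode(block: Array[Byte], column: Int): Array[Byte]
  // Here we stub it so the sketch is self-contained.
  def dictEncode(block: Array[Byte], column: Int): Array[Byte] = block
}

final class UdpEncodeOperator(column: Int) {
  // Conceptually: DMA the block to the UDP scratchpad, transpose, dictionary-encode
  // on the 64 lanes, transpose back, and DMA the result to CPU caches (all inside
  // the kernel); the JVM only sees a block-in / block-out call.
  def processBlock(block: Array[Byte]): Array[Byte] =
    UdpKernels.dictEncode(block, column)
}
```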

The backend takes UDP assembly and does intra-block and cross-block optimizations; most importantly, it does the layout optimization to achieve high code density for multi-way dispatch [24]. Furthermore, it optimizes action block sharing to achieve small code size.

6. METHODOLOGY

6.1 System Modeling

We compare ACCORDA performance to a baseline of SparkSQL running on a 14-core, 28-thread CPU (Intel Xeon E5-2680) with 64GB DDR3. We use the default Spark/SparkSQL configuration and a single worker with 10GB memory. ACCORDA's performance is based on a software prototype derived from Apache SparkSQL (version 2.2) with the query engine and optimizer extensions described in Section 4. This includes the cost of using the Java Native Interface (JNI) to make data accessible to the accelerator. ACCORDA system performance is determined by a timed hybrid-simulation modeling methodology. We utilize the Intel HARP system (a two-socket system with the same Intel Xeon E5-2680 and an Altera Arria10 FPGA connected by QPI) to achieve high-speed execution for large workloads [7]. We capture event logs with timing (e.g., timestamps) and execution labels. Because the FPGA emulation of the UDP as well as the QPI emulation of in memory-hierarchy integration are both slower than a real implementation, we re-time the traces according to Figure 16 and the data in Table 1. ACCORDA workers run across the CPU and UDP with no overlap. UDP execution is scaled up to 1 GHz ASIC speed [25], and data movement of in memory-hierarchy transfers is scaled by the ratio of DMA bandwidth over QPI. The trace log is post-processed to implement the hybrid-simulation model and reflect the updated task time. This hybrid performance modeling is conservative since it assumes no overlap between the CPU core and acceleration, and performance is penalized by cache pollution due to simulation and logging.

Table 1: Hardware Platform Statistics.

                   FPGA Emulation          Actual
DMA Bandwidth      3 GB/s                  30 GB/s
Clock Frequency    40 MHz                  1 GHz
Resources          295K ALM, 10Mb BRAM     8.69mm2

Figure 16: Timed Hybrid-Simulation Model.

6.2 Workload

We use the TPC-H dataset [6] and the entire 22 TPC-H queries for evaluation, spanning simple select-project-join structures to more complicated ones with expensive expressions and large intermediate attributes. We use source raw data in tabular format (TBL) with fields separated by the delimiter '|'. Each field is in text format rather than binary, exercising the flat parsing and data type conversion dimensions of raw data processing.

A more aggressive raw data study would include deep nested structures and unnormalized relations, so our estimates of ACCORDA benefit are conservative. Raw data workloads containing these richer raw properties will further tax conventional systems, causing them to spend more time on complex data parsing and conversion. In contrast, ACCORDA will find more opportunities to apply hardware acceleration and to employ aggressive encoding optimizations on nested structures, unnormalized relations, and so on.

7. EVALUATION

Evaluation of the ACCORDA system includes micro-benchmarks for the hardware accelerator (see [25] for a more complete study), its integration, query performance versus alternate raw data approaches, and performance sensitivity studies.

7.1 ACCORDA Accelerator Micro-benchmark

7.1.1 Regex Matching

UDP accelerates regular expression matching dramatically. Regex is important for data filtering SQL operators (e.g., LIKE and RLIKE). Figure 17a compares performance on a single regex pattern from TPC-H Q13, showing the performance of the ACCORDA accelerator (UDP), an 8-thread CPU (Intel Hyperscan [3]), an FPGA [50] (using the best performance number), and a commercial ASIC implementation (Titan [53]). The 8-thread CPU achieves 12.8 GB/s, and the FPGA nearly doubles that. The Titan accelerator performs similar to the 8-thread CPU but at much less power. UDP achieves 64 GB/s, 4.9x faster than the 8-thread CPU.

7.1.2 Decompression

UDP accelerates data compression, which is widely used to reduce raw data storage cost. For decompression, we compare the UDP to the 8-thread CPU-based Snappy [1] library, a commercial FPGA implementation [8] using a Virtex 7 (28nm) and 28Mb of BRAMs, and a commercial ASIC (Intel [2]). Our results (see Figure 17b) show UDP matches the best performance (13 GB/s) but at a small fraction of the power and area cost required for the FPGA implementation. The ASIC delivers less performance with significantly lower power. UDP is 2.6x faster than the 8-thread CPU.

7.1.3 Parsing

UDP accelerates parsing, which is critical to extract relevant fields from unstructured raw data. The parsing involves finding and validating delimiters, and extracting interesting fields. In Figure 17c, we compare the UDP performance to an 8-thread CPU (SIMD-optimized parser [46]) and an ASIC design for parsing CSV/JSON/XML [14]. The UDP achieves 4.3 GB/s throughput, 2x the performance of the 8-thread CPU (2 GB/s), and 3x greater than the ASIC design.

7.1.4 Deserialization

UDP can accelerate data transformation from external format to native machine format (deserialization). Deserialization happens when raw data is in a machine-neutral interchangeable format. In Figure 17d, we use the UDP to

Figure 17: ACCORDA Accelerator Micro-benchmark Performance (UDP vs. 8 CPU Threads). (a) Regular Expression Matching. (b) LZ77 Decompression. (c) Parsing. (d) Deserialization.

Figure 18: Evaluation of In Memory-Hierarchy Integration and Acceleration Scalability. (a) Query Runtime. (b) Off-chip Data Movement. (c) Scalability (Sharing Overhead).

accelerate ASCII to DATE type conversion. It is one of the most expensive data transformation tasks in raw data processing [12, 37]. We compare the Java deserialization CPU library to our UDP-based implementation. UDP achieves 1.6 GB/s throughput, 20x faster than the 8-thread CPU.

7.2 Benefits of In Memory-Hierarchy Integration

In memory-hierarchy accelerator integration lays the foundation for the ACCORDA software architecture, enabling data transformation to improve performance. We compare two approaches to accelerator integration: the UDP accelerator in memory-hierarchy vs. on PCIe. The total performance and data movement are modeled using the LogCA model [13]. In this experiment, we use the TPC-H order table in the compressed tabular format. The query is simple: SELECT date, comment FROM order WHERE comment RLIKE ".*special.*requests.*".

Figure 18a shows the runtime breakdown for three phases: decompression, parsing-select, and regex filtering. Decompression and regex filtering are offloaded completely, but select requires CPU-accelerator collaboration. This data sharing across the PCIe bus increases overhead by 66%. Figure 18b shows that, varying the comment attribute from 1%-100% selectivity, PCIe integration causes 6.8x-14x higher data movement. In memory-hierarchy accelerator integration preserves the iterator-based execution model, allowing ACCORDA to preserve fine-grained data sharing and interleaved execution between CPU cores and the accelerator.

7.3 Acceleration Scalability

Acceleration performance must be sufficient to support both multiple cores and accelerated worker uniformity. We study the sharing of one UDP accelerator (64 lanes) across multiple cores by increasing the number of workers. Results (see Figure 18c) show that with 16 workers, the sharing increases accelerated runtime by less than 20% on TPC-H Q1 (data transform, filter, hash aggregate, float computation, and sort) with "weak-scaling". When combined with the accelerator's 304x performance advantage on these computations, after data movement and sharing overhead, each of the 16 workers still experiences an overall 3.6x speedup, producing a large benefit for the entire query.

7.4 Benefits of Software-Programmability

The ACCORDA accelerator is software-programmable. FPGA's are a popular alternative: hardware-programmability. We use a set of data transformation and filtering tasks to compare performance achieved per unit silicon area (performance density). FPGA's use LUTs (look-up tables) and interconnection configuration to achieve programmability, paying silicon area overhead and a lower clock rate for flexibility. In contrast, the UDP hardwires an instruction set architecture (basic primitives) and runs software written to that ISA. This enables the UDP implementation to use silicon area efficiently and achieve a high clock rate.

Figure 20 compares performance density for decompression and regular expression matching (1 and 500 patterns). For FPGA's, we report a Xilinx commercial design [8] for decompression, a database-oriented design [50] for single-pattern regex, and a network-monitoring design [58] for 500-pattern regex. Area is calibrated by the LUTs used [54], excluding BRAMs. For the UDP, we present two metrics: the first is program area (program size in the SRAM), and the second adds UDP logic area. In all cases, UDP is more area-efficient (see Figure 20), with 19x better performance density for decompression, and 160x for single-pattern and 90x for 500-pattern regex. Here, software-programmability provides significantly higher performance density than hardware-programmability.

7.5 ACCORDA Raw Data Performance

Figure 19: TPC-H Performance (External, Raw Data).

Figure 20: UDP Software Programmability vs. FPGA Hardware Programmability.

Comparing Raw Data Approaches. We compare ACCORDA system performance to several alternate approaches: 1) SparkSQL [15] with batch loading of the TPC-H tables into memory beforehand, 2) SparkSQL with on-demand ETL without caching any transformed data (the worst case for the RAW/NoDB approach [12]), and 3) Sparser [47] on SparkSQL with on-demand ETL (SIMD-accelerated partial pre-filtering), with data and queries as detailed in Section 6.2.

Our results (see Figure 19) show that ACCORDA improves query performance for all 22 queries, not only compared to batch ETL (16x geomean), but also relative to on-demand ETL (4.6x geomean) and optimized on-demand SIMD-filtered ETL (3.6x geomean). The reasons for ACCORDA's benefits compared to batch ETL (only load what's needed) and on-demand ETL (UDP acceleration of data parsing, conversion, and encoding optimization) are straightforward, but the comparison to Sparser bears further discussion.

Sparser tries to avoid parsing and deserialization by using literal comparisons to raw bytes. This limits the types of predicates and raw data encodings that can be covered, and the effectiveness of the filtering. For example, gzipped data cannot be filtered. Table 2 categorizes the predicates used in the TPC-H queries, comparing Sparser's and ACCORDA's ability to support them. Sparser's SIMD approach supports 32 (and 6 regex filters partially) out of 98 predicates, limited to Literal Equal predicates and the sub-string part of Regex predicates. ACCORDA's flexible hardware acceleration enables it to support a broad range of predicates, including Regex with complex semantics (e.g., range and repetition), multiple columns, range comparison, or the IN operator. Thus, ACCORDA supports all 98 predicates in the TPC-H queries – general acceleration without constraints on predicate semantics.

Table 2: Sparser and ACCORDA Predicate Support (TPC-H 22 Queries). S = Sparser, A = ACCORDA, O = TPC-H Occurrence.

Predicate Type    Example                      S     A     O
Multi-Column      col1 = col2                  0     15    15
Range Compare     10 < col < 20                0     40    40
IN Operator       col IN (val1, val2, ...)     0     5     5
Literal Equal     col = "abcd"                 32    32    32
Regex (partial)   col RLIKE "ab.*[0-9]{2}"     (6)   6     6
Total Supported                                32    98    98

Figure 21: Software Architecture Enabled Performance Benefit (Selected TPC-H Queries).

ACCORDA Software Architecture Benefits. Next, we explore the benefits of ACCORDA's software architecture, which enables aggressive encoding-based optimizations. Figure 21 compares ACCORDA (Raw) and ACCORDA (Enc+Raw) on selected queries Q9, Q10, Q12, Q13, Q16, Q19 and Q22. ACCORDA (Raw) shows acceleration of data loading, but ACCORDA (Enc+Raw) shows how further aggressive encoding-based query optimization in the middle of a query plan can further increase performance. Specifically, Q10 benefits from attribute group compression, Q12 from dictionary encoding for fast filtering, and Q9, Q13, Q16, Q19, Q22 from accelerated operators for multi-pattern matching and regular expression filtering. Overall, the ACCORDA software architecture enables encoding-based query optimizations that yield 1.1x to 11.8x speedups beyond raw data acceleration, a software architecture benefit of 1.8x geomean (80%!). This workload contains only flat parsing (see Section 6.2); we expect larger benefits for raw data with nested structures and unnormalized relations with long attributes.

ACCORDA Raw vs. Loaded, Cached Analysis. To see how much raw data processing overhead remains, we compare ACCORDA raw data analysis to loaded data analysis

Figure 22: ACCORDA Raw Data Processing vs. Loaded, Cached Data Analysis.

(SparkSQL Loaded, Cached Data). In SparkSQL (Loaded, Cached Data), data has been parsed, transformed, transposed, loaded into its internal columnar format, and cached into main memory for fast access. Thus it approximates both a traditional database (with loaded data) and the best case for RAW/NoDB [12], when all raw data accesses hit the in-memory buffer with only selected columns fetched. Our results, in Figure 22, show the raw data processing overhead between ACCORDA (External, Raw Data) and SparkSQL (Loaded, Cached Data) is reduced to 1.2x overall (geomean) from 5.6x. Q13 is an outlier for loaded data, suffering from poor regex performance. The raw data overhead is 1.3x geomean excluding Q13. The key reasons for this improvement are ACCORDA's encoding-based optimization and hardware acceleration. The remaining overheads are due to ACCORDA's lack of projection pushdown, floating-point number conversion, and memory caching – features tailored for loaded data, and not so useful for raw data. Overall, ACCORDA reduces raw data processing overhead to only 20%.

In summary, ACCORDA nearly eliminates transformation cost for raw data, enabling it to outperform other raw data approaches and achieve competitive performance against loaded, cached data analysis. Encoding optimization is critical, enabled by ACCORDA's software architecture, contributing 1.1x-11.8x speedups beyond data loading acceleration. Overall, ACCORDA gives speedups of 16x vs. batch loading, 4.6x vs. on-demand ETL, and 3.6x vs. SIMD filtering (Sparser).

7.6 Sensitivity to Data Statistics

In raw data processing, early partial filtering with inexpensive predicates [15, 47] can avoid the cost of data transformation (e.g., parsing, deserialization) and subsequent query computation. We study ACCORDA's benefit with respect to supported predicate functionality, filter selectivity, cost of predicates, and volume of data. As in Table 2, ACCORDA's flexible hardware acceleration provides excellent coverage.

We compare ACCORDA with different raw data processing approaches using three predicates – Range Compare, Literal Equal, Regex Match – and varying their selectivity (see Figure 23). As shown in Figure 23, On-Demand ETL (SparkSQL) suffers from the CPU's slow data transformation and thus delivers poor performance in all cases. Sparser leverages SIMD-accelerated filters to have less data to parse, deserialize, and filter. However, Sparser can't support the range-based predicates that exist in Q1. In Q12 and Q13, Sparser pushes substrings in the predicates onto the raw data, and thus achieves speedups. Sparser's performance is sensitive to data statistics (selectivity) in these cases. On the other hand, ACCORDA's hardware acceleration enables robust support for all TPC-H predicates, reducing data loading cost. For a wide range of data statistics, it approaches loaded, cached data analysis.

Figure 23: Selectivity Impact on Execution Time.

7.7 Format Type and Complexity Sensitivity

Open data formats are pervasive in data lakes, enabling size optimization, readability, execution performance, and portability. The resulting format complexity incurs raw data processing cost [12, 37, 43, 47] that grows with data complexity. We study its impact on ACCORDA's benefits.

Figure 24a compares TPC-H Q1 execution time on raw data with a succession of more complex formats (Snappy+CSV, Huff+Snappy+CSV, JSON, Snappy+JSON, and Huff+Snappy+JSON). CPU processing cost increases with format complexity – for lineitem, the cost increases by 32% going from the CSV format to the Huffman+Snappy+JSON format. Sparser sees a similar increase. The ACCORDA system, with capable hardware acceleration, sees only a minimal 3.7% increase. As shown in Figure 24b, ACCORDA with its flexibly-programmable acceleration (UDP) delivers a robust 4x speedup vs. SparkSQL across a variety of source format complexities.

Figure 24: Increased Format Complexity: Performance Penalty. (a) Varied Format Complexity. (b) Base Case.

8. DISCUSSION AND RELATED WORK

Optimizing Raw Data Processing. Several SIMD acceleration techniques target parsing in ETL. InstantLoad [46] exploits SIMD parallelism to find delimiters, reducing branch misses by 50%, to achieve single-thread performance over 250 MB/s (x86 CPU). Our UDP delivers 40x better single-thread performance and easily saturates a DDR memory channel. MISON [43] exploits speculative parsing and SIMD acceleration for JSON. A list of pattern trees is built

from a training data set, optimizing a speculative parsing pattern. However, effectiveness depends heavily on template variety and good sample data. ACCORDA accelerates JSON parsing robustly and supports more structures (e.g., XML). SCANRAW [20] is a pipelined architecture for loading raw data, integrated as a meta-operator. Speculative loading attempts to mitigate high ETL cost by loading whenever there is spare disk IO bandwidth. In contrast, ACCORDA doesn't load data but accelerates raw data processing directly.

RAW/NoDB [12, 16, 36, 37] applies lazy loading, caching of materialized results, and positional maps to raw data processing to avoid ETL/data transformation. It also caches transformed records if there is reuse. RAW/NoDB can trade memory for performance. Data Vault [33] takes a similar approach, using query-driven memory caching of multidimensional arrays from scientific raw files. These results show several limitations. First, query performance on first-time touched columns suffers from slow data loading. Second, because caching depends on reuse, query performance is sensitive to pattern and history – queries that share little in a large data lake or touch few columns will have poor performance. Third, effective caching requires significant memory capacity (e.g., 32GB [16]); NoDB's positional maps also consume significant space. ACCORDA uses hardware to accelerate raw data processing, achieving robust, good performance.

Other systems [15, 37, 47] exploit query semantics for predicate pushdown. Inexpensive partial predicates are used to filter raw data, avoiding downstream transformation and processing cost. As in Table 2, predicate types and hardware support limit their applicability. ACCORDA has good predicate coverage, achieving inexpensive filtering broadly.

Data Representations for Fast Query Execution. Database representation strongly affects query performance. C-store [9, 51] demonstrates that column-oriented storage layouts with rich encodings (e.g., bitmap, dictionary, and run-length encodings) can optimize query performance on OLAP workloads. Recently, more advanced data representations [26, 39, 42, 44] have been proposed to exploit SIMD acceleration for fast query scans while preserving storage size reduction. Compressed formats are designed to allow direct processing without decompression [10, 60] under a limited set of query semantics. In ACCORDA, we consider dynamic data transformation acceleration during query execution on raw data without any constraints on query semantics.

Such compressed data representations can accelerate iterative machine learning algorithms [21, 40]. They are designed to allow direct computation during training. Zukowski et al. [61] explored dynamic transposition between row and column formats during query execution. The results suggest distinct layouts benefit different operations (e.g., scan for column and aggregation for row). On the other hand, many database operators can be accelerated by applying SIMD, GPU's, or FPGA's [35, 48, 50, 57]. However, many accelerated operators require special data layouts, and thus could benefit from ACCORDA's fast data transformation, enabling more general exploitation of customized formats.

Database Hardware Accelerators. Data centers now incorporate varied accelerators (e.g., ASIC's, FPGA's, and GPU's) [30, 34, 49]. One thread is accelerators for data analytics. Accelerator designs have been proposed for query processing bottlenecks [11, 28, 35, 38, 41, 50, 55, 56]. Researchers have explored customized accelerators for common SQL operators such as join [56], partition [11, 55, 56], sort [56], aggregation [56], regex search [28, 50, 56], and index traversal [38]. Deploying an accelerator for each task is challenging – design and deployment cost, and obsolescence when new algorithms emerge. Closest is the Oracle DAX [41], accelerating column scan on compressed formats. DAX only supports fixed formats; the UDP accelerator is flexibly-programmable, matching DAX performance and efficiency, and is extensible to broader existing and future encodings.

Active research explores the use of FPGA's to accelerate data analytics [19, 31, 32, 35, 45, 49, 50]. Computation kernels are implemented on FPGA's as co-processors. DAnA [45] automatically generates FPGA implementations from UDFs for iterative machine learning training jobs in RDBMS. Other work explores FPGA's for regex matching [50], data partitioning [35], and histogram generation [32]. However, for interactive queries, building one circuit for each algorithm is impractical due to the latency and resource requirements. FPGA acceleration of multiple operations in a query requires runtime reconfiguration with an additional performance penalty. Furthermore, the reconfigurable LUTs in FPGA's suffer from low energy efficiency [54]. Our UDP accelerator provides data transformation flexibility with high efficiency and flexible software-programmability. Results in Section 7.4 show its advantages over FPGA's in performance and area efficiency.

9. SUMMARY AND FUTURE WORK

ACCelerated Operators for Raw Data Analysis (ACCORDA) is a software and hardware architecture that extends the operator interface with encoding and uses a uniform runtime worker model to seamlessly integrate acceleration. ACCORDA improves data lake / raw data processing performance by 2.9x-13.2x on TPC-H benchmarks, reducing raw data processing overhead to 20%, and its extended operator interface unlocks aggressive encoding-oriented optimizations that deliver an 80% average speedup on the affected TPC-H queries. These experiments exercise flat parsing and data type conversion; larger benefits will accrue for nested structures and unnormalized relations that require complex parsing and expose more encoding optimization opportunities.

ACCORDA's flexible query optimization across data encodings and good performance expose new opportunities. Interesting directions include: novel data formats (customized to application or data), aggressive query optimizations for open data type encodings, and new computation kernels / operators tuned to specific data encodings. Further, ACCORDA integrates acceleration deeply into both hardware and software architectures, preserving and extending traditional query optimization structures and maximizing the breadth of acceleration applicability. This creates the opportunity to extend data analytics to a growing diversity of user-defined data types, addressing the expressiveness gap between programming and query languages.

10. ACKNOWLEDGMENTS

We thank our colleagues at UChicago and the VLDB reviewers for their detailed feedback. This research was supported in part by the Exascale Computing Project (ECP), Project Number 17-SC-20-SC, a collaborative effort of the DOE Office of Science and the National Nuclear Security Administration. Support from the CERES Center for Unstoppable Computing and the Texas Advanced Computing Center (TACC) is also gratefully acknowledged.

10. ACKNOWLEDGMENTS
We thank our colleagues at UChicago and the VLDB reviewers for their detailed feedback. This research was supported in part by the Exascale Computing Project (ECP), Project Number 17-SC-20-SC, a collaborative effort of the DOE Office of Science and the National Nuclear Security Administration. Support from the CERES Center for Unstoppable Computing and the Texas Advanced Computing Center (TACC) is also gratefully acknowledged.

11. REFERENCES
[1] Google Snappy compression library. https://github.com/google/snappy.
[2] Intel chipset 89xx series. http://www.intel.com/content/dam/www/public/us/en/documents/solution-briefs/scaling-acceleration-capacity-brief.pdf.
[3] Intel Hyperscan. https://github.com/01org/hyperscan.
[4] PostgreSQL database. https://www.postgresql.org/.
[5] Query processing architecture guide. https://docs.microsoft.com/en-us/sql/relational-databases/query-processing-architecture-guide?view=sql-server-2017.
[6] TPC-H benchmark. http://www.tpc.org/tpch/.
[7] Xeon+FPGA platform for the data center. https://www.ece.cmu.edu/~calcm/carl/lib/exe/fetch.php?media=carl15-gupta.pdf.
[8] ZipAccel-D gunzip/zlib/inflate data decompression core. http://www.cast-inc.com/ip-cores/data/zipaccel-d/cast-zipaccel-d-x.pdf.
[9] D. Abadi, S. Madden, and M. Ferreira. Integrating compression and execution in column-oriented database systems. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, SIGMOD '06, June 2006.
[10] R. Agarwal, A. Khandelwal, and I. Stoica. Succinct: Enabling queries on compressed data. In 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15), 2015.
[11] S. R. Agrawal, S. Idicula, A. Raghavan, E. Vlachos, V. Govindaraju, V. Varadarajan, C. Balkesen, G. Giannikis, C. Roth, N. Agarwal, et al. A many-core architecture for in-memory data processing. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pages 245-258. ACM, 2017.
[12] I. Alagiannis, R. Borovica, M. Branco, S. Idreos, and A. Ailamaki. NoDB: efficient query execution on raw data files. In SIGMOD, pages 241-252. ACM, 2012.
[13] M. S. B. Altaf and D. A. Wood. LogCA: a performance model for hardware accelerators. IEEE Computer Architecture Letters, 14(2):132-135, 2015.
[14] K. Angstadt, A. Subramaniyan, E. Sadredini, R. Rahimi, K. Skadron, W. Weimer, and R. Das. ASPEN: A scalable in-SRAM architecture for pushdown automata. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 921-932. IEEE, 2018.
[15] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia. Spark SQL: Relational data processing in Spark. In SIGMOD. ACM, 2015.
[16] T. Azim, M. Karpathiotakis, and A. Ailamaki. ReCache: reactive caching for fast analytics over heterogeneous data. PVLDB, 11(3):324-337, 2017.
[17] E. Begoli, J. Camacho-Rodríguez, J. Hyde, M. J. Mior, and D. Lemire. Apache Calcite: A foundational framework for optimized query processing over heterogeneous data sources. In Proceedings of the 2018 International Conference on Management of Data, pages 221-230. ACM, 2018.
[18] S. Breß, M. Heimel, N. Siegmund, L. Bellatreche, and G. Saake. GPU-accelerated database systems: Survey and open challenges. In Transactions on Large-Scale Data- and Knowledge-Centered Systems XV, pages 1-35. Springer, 2014.
[19] A. Caulfield, E. Chung, A. Putnam, H. Angepat, J. Fowers, M. Haselman, S. Heil, M. Humphrey, P. Kaur, H. Kim, D. Lo, T. Massengill, K. Ovtcharov, M. Papamichael, L. Woods, S. Lanka, D. Chiou, and D. Burger. A cloud-scale acceleration architecture. In MICRO'16. ACM/IEEE, 2016.
[20] Y. Cheng and F. Rusu. SCANRAW: A database meta-operator for parallel in-situ processing and loading. ACM Transactions on Database Systems (TODS), 40(3):19, 2015.
[21] A. Elgohary, M. Boehm, P. J. Haas, F. R. Reiss, and B. Reinwald. Compressed linear algebra for large-scale machine learning. PVLDB, 9(12):960-971, 2016.
[22] Y. Fang and A. A. Chien. UDP system interface and lane ISA definition. The University of Chicago Technical Report, TR-2017-05, May 2017.
[23] Y. Fang, T. T. Hoang, M. Becchi, and A. A. Chien. Fast support for unstructured data processing: The unified automata processor. In Proceedings of the 48th International Symposium on Microarchitecture, MICRO-48, Dec. 2015.
[24] Y. Fang, A. Lehane, and A. A. Chien. EffCLiP: Efficient coupled-linear packing for finite automata. The University of Chicago Technical Report, TR-2015-05, May 2015.
[25] Y. Fang, C. Zou, A. J. Elmore, and A. A. Chien. UDP: a programmable accelerator for extract-transform-load workloads and more. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pages 55-68. ACM, 2017.
[26] Z. Feng, E. Lo, B. Kao, and W. Xu. ByteSlice: Pushing the envelop of main memory data processing with a new storage layout. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 31-46. ACM, 2015.
[27] A. Gates, J. Dai, and T. Nair. Apache Pig's optimizer. IEEE Data Eng. Bull., 36(1):34-45, 2013.
[28] V. Gogte, A. Kolli, M. J. Cafarella, L. D. Antoni, and T. F. Wenisch. HARE: Hardware accelerator for regular expressions. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, 2016.
[29] G. Graefe. Volcano - an extensible and parallel query evaluation system. IEEE Transactions on Knowledge and Data Engineering, 6(1):120-135, 1994.
[30] K. Hazelwood, S. Bird, D. Brooks, S. Chintala, U. Diril, D. Dzhulgakov, M. Fawzy, B. Jia, Y. Jia, A. Kalro, et al. Applied machine learning at Facebook: a datacenter infrastructure perspective. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 620-629. IEEE, 2018.
[31] Z. István, G. Alonso, M. Blott, and K. Vissers. A flexible hash table design for 10Gbps key-value stores on FPGAs. In Field Programmable Logic and Applications (FPL), 2013 23rd International Conference on, pages 1-8. IEEE, 2013.
[32] Z. István, L. Woods, and G. Alonso. Histograms as a side effect of data movement for big data. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, June 2014.
[33] M. Ivanova, M. Kersten, and S. Manegold. Data vaults: a symbiosis between database technology and scientific file repositories. In International Conference on Scientific and Statistical Database Management, pages 485-494. Springer, 2012.
[34] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pages 1-12. IEEE, 2017.
[35] K. Kara, J. Giceva, and G. Alonso. FPGA-based data partitioning. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 433-445. ACM, 2017.
[36] M. Karpathiotakis, I. Alagiannis, and A. Ailamaki. Fast queries over heterogeneous data through engine customization. PVLDB, 9(12):972-983, 2016.
[37] M. Karpathiotakis, M. Branco, I. Alagiannis, and A. Ailamaki. Adaptive query processing on raw data. PVLDB, 7(12):1119-1130, 2014.
[38] O. Kocberber, B. Grot, J. Picorel, B. Falsafi, K. Lim, and P. Ranganathan. Meet the walkers: Accelerating index traversals for in-memory databases. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, pages 468-479. ACM, 2013.
[39] H. Lang, T. Mühlbauer, F. Funke, P. A. Boncz, T. Neumann, and A. Kemper. Data Blocks: hybrid OLTP and OLAP on compressed storage using both vectorization and compilation. In Proceedings of the 2016 International Conference on Management of Data, pages 311-326. ACM, 2016.
[40] F. Li, L. Chen, A. Kumar, J. F. Naughton, J. M. Patel, and X. Wu. When Lempel-Ziv-Welch meets machine learning: A case study of accelerating machine learning using coding. arXiv:1702.06943, 2017.
[41] P. Li, J. L. Shin, G. Konstadinidis, F. Schumacher, V. Krishnaswamy, H. Cho, S. Dash, R. Masleid, C. Zheng, Y. D. Lin, et al. 4.2 A 20nm 32-core 64MB L3 cache SPARC M7 processor. In Solid-State Circuits Conference (ISSCC), 2015 IEEE International, pages 1-3. IEEE, 2015.
[42] Y. Li, C. Chasseur, and J. M. Patel. A padded encoding scheme to accelerate scans by leveraging skew. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1509-1524. ACM, 2015.
[43] Y. Li, N. R. Katsipoulakis, B. Chandramouli, J. Goldstein, and D. Kossmann. Mison: a fast JSON parser for data analytics. PVLDB, 10(10):1118-1129, 2017.
[44] Y. Li and J. M. Patel. BitWeaving: fast scans for main memory data processing. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 289-300. ACM, 2013.
[45] D. Mahajan, J. K. Kim, J. Sacks, A. Ardalan, A. Kumar, and H. Esmaeilzadeh. In-RDBMS hardware acceleration of advanced analytics. PVLDB, 11(11):1317-1331, 2018.
[46] T. Mühlbauer, W. Rödiger, R. Seilbeck, A. Reiser, A. Kemper, and T. Neumann. Instant loading for main memory databases. PVLDB, 6(14):1702-1713, 2013.
[47] S. Palkar, F. Abuzaid, P. Bailis, and M. Zaharia. Filter before you parse: faster analytics on raw data with Sparser. PVLDB, 11(11):1576-1589, 2018.
[48] O. Polychroniou, A. Raghavan, and K. A. Ross. Rethinking SIMD vectorization for in-memory databases. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1493-1508. ACM, 2015.
[49] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides, J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray, et al. A reconfigurable fabric for accelerating large-scale datacenter services. In Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on, pages 13-24. IEEE, 2014.
[50] D. Sidler, Z. István, M. Owaida, and G. Alonso. Accelerating pattern matching queries in hybrid CPU-FPGA architectures. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 403-415. ACM, 2017.
[51] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O'Neil, et al. C-Store: a column-oriented DBMS. In PVLDB, pages 553-564, 2005.
[52] T. Thanh-Hoang, A. Shambayati, and A. A. Chien. A data layout transformation (DLT) accelerator: Architectural support for data movement optimization in accelerated-centric heterogeneous systems. In 2016 Design, Automation, and Test in Europe Conference and Exhibition (DATE). IEEE, Mar. 2016.
[53] B. Wheeler. Titan IC floats 100Gbps reg-ex engine. https://www.linleygroup.com/newsletters/newsletter_detail.php?num=5924&year=2018&tag=3.
[54] H. Wong, V. Betz, and J. Rose. Comparing FPGA vs. custom CMOS and the impact on processor microarchitecture. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 5-14. ACM, 2011.
[55] L. Wu, R. J. Barker, M. A. Kim, and K. A. Ross. Navigating big data with high-throughput, energy-efficient data partitioning. In Proceedings of the 40th Annual International Symposium on Computer Architecture, pages 249-260. ACM, 2013.
[56] L. Wu, A. Lottarini, T. K. Paine, M. A. Kim, and K. A. Ross. Q100: The architecture and design of a database processing unit. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '14, Mar. 2014.
[57] W. Xu, Z. Feng, and E. Lo. Fast multi-column sorting in main-memory column-stores. In Proceedings of the 2016 International Conference on Management of Data, pages 1263-1278. ACM, 2016.
[58] Y.-H. Yang and V. Prasanna. High-performance and compact architecture for regular expression matching on FPGA. IEEE Trans. Comput., 61(7), July 2012.
[59] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, pages 2-2. USENIX Association, 2012.
[60] F. Zhang, J. Zhai, X. Shen, O. Mutlu, and W. Chen. Efficient document analytics on compressed data: Method, challenges, algorithms, insights. PVLDB, 11(11):1522-1535, 2018.
[61] M. Zukowski, N. Nes, and P. Boncz. DSM vs. NSM: CPU performance tradeoffs in block-oriented query processing. In Proceedings of the 4th International Workshop on Data Management on New Hardware, pages 47-54. ACM, 2008.