ModelarDB: Modular Model-Based Time Series Management with Spark and Cassandra

Søren Kejser Jensen, Torben Bach Pedersen, and Christian Thomsen
Aalborg University, Denmark

ABSTRACT

Industrial systems, e.g., wind turbines, generate big amounts of data from reliable sensors with high velocity. As it is infeasible to store and query such big amounts of data, only simple aggregates are currently stored. However, aggregates remove fluctuations and outliers that can reveal underlying problems and limit the knowledge to be gained from historical data. As a remedy, we present the distributed Time Series Management System (TSMS) ModelarDB that uses models to store sensor data. We thus propose an online, adaptive multi-model compression algorithm that maintains data values within a user-defined error bound (possibly zero). We also propose (i) a database schema to store time series as models, (ii) methods to push-down predicates to a key-value store utilizing this schema, (iii) optimized methods to execute aggregate queries on models, (iv) a method to optimize execution of projections through static code-generation, and (v) dynamic extensibility that allows new models to be used without recompiling the TSMS. Further, we present a general modular distributed TSMS architecture and its implementation, ModelarDB, as a portable library, using Apache Spark for query processing and Apache Cassandra for storage. An experimental evaluation shows that, unlike current systems, ModelarDB hits a sweet spot and offers fast ingestion, good compression, and fast, scalable online aggregate query processing at the same time. This is achieved by dynamically adapting to data sets using multiple models. The system degrades gracefully as more outliers occur, and the actual errors are much lower than the bounds.

PVLDB Reference Format:
Søren Kejser Jensen, Torben Bach Pedersen, Christian Thomsen. ModelarDB: Modular Model-Based Time Series Management with Spark and Cassandra. PVLDB, 11(11): 1688-1701, 2018.
DOI: https://doi.org/10.14778/3236187.3236215

Table 1: Comparison of common storage solutions

Storage Method                  Size (GiB)    Storage Method          Size (GiB)
PostgreSQL 10.1                 782.87        CSV Files               582.68
RDBMS-X - Row                   367.89        Apache Parquet Files    106.94
RDBMS-X - Column                166.83        Apache ORC Files        13.50
InfluxDB 1.4.2 - Tags           4.33          Apache Cassandra 3.9    111.89
InfluxDB 1.4.2 - Measurements   4.33          ModelarDB               2.41 - 2.84

1. INTRODUCTION

For critical infrastructure, e.g., renewable energy sources, large numbers of high quality sensors with wired electricity and connectivity provide data to monitoring systems. The sensors are sampled at regular intervals and while invalid, missing, or out-of-order data points can occur, they are rare and all but missing data points can be corrected by established cleaning procedures. Although practitioners in the field require high-frequency historical data for analysis, it is currently impossible to store the huge amounts of data points. As a workaround, simple aggregates are stored at the cost of removing outliers and fluctuations in the time series.

In this paper, we focus on how to store and query massive amounts of high quality sensor data ingested in real-time from many sensors. To remedy the problem of aggregates removing fluctuations and outliers, we propose that high quality sensor data is compressed using model-based compression. We use the term model for any representation of a time series from which the original time series can be recreated within a known error bound (possibly zero). For example, the linear function y = ax + b can represent an increasing, decreasing, or constant time series and reduces storage requirements from one value per data point to only two values: a and b. We support both lossy and lossless compression. Our lossy compression preserves the data's structure and outliers, and the user-defined error bound allows a trade-off between accuracy and required storage. This contrasts with traditional sensor data management where models are used to infer data points with less noise [29].

To establish the state-of-the-art used in industry for storing time series, we evaluate the storage requirements of commonly used systems and big data file formats. We select the systems based on DB-Engines Ranking [11], discussions with companies in the energy sector, and our survey [28]. Two Relational Database Management Systems (RDBMSs) and a TSMS are included due to their widespread industrial use, although they are optimized for smaller data sets than the distributed solutions. The big data file formats, on the other hand, handle big data sets well, but do not support streaming ingestion for online analytics. We use sensor data with a 100ms sampling interval from an energy production company. The schema and data set (Energy Production High Frequency) are described in Section 7. The results in Table 1, including our system ModelarDB, show the benefit of using a TSMS or columnar storage for time series. However, the storage reduction achieved by ModelarDB is much more significant, even with a 0% error bound, and clearly demonstrates the advantage of model-based storage for time series.

To efficiently manage high quality sensor data, we find the following properties paramount for a TSMS: (i) Distribution: Due to huge amounts of sensor data, a distributed architecture is needed. (ii) Stream Processing: For monitoring, ingested data points must be queryable after a small user-defined time lag. (iii) Compression: Fine-grained historical values can reveal changes over time, e.g., performance degradation. However, raw data is infeasible to store without compression, which also helps query performance due to reduced disk I/O. (iv) Efficient Retrieval: To reduce processing time for querying a subset of the historical data, indexing, partitioning and/or time-ordered storage is needed. (v) Approximate Query Processing (AQP): Approximation of query results within a user-defined error bound can reduce query response time and enable lossy compression. (vi) Extensibility: Domain experts should be able to add domain-specific models without changing the TSMS, and the system should automatically use the best model.

While methods for compressing segments of a time series using one of multiple models exist [22, 37, 40], we found no TSMS using multi-model compression for our survey [28]. Also, the existing methods do not provide all the properties listed above. They either provide no latency guarantees [22, 40], require a trade-off between latency and compression [37], or limit the supported model types [22, 40]. ModelarDB, in contrast, provides all these features, and we make the following contributions to model-based storage and query processing for big data systems:

• A general-purpose architecture for a modular model-based TSMS providing the listed paramount properties.
• An efficient and adaptive algorithm for online multi-model compression of time series within a user-defined error bound. The algorithm is model-agnostic, extensible, combines lossy and lossless compression, allows missing data points, and offers both low latency and high compression ratio at the same time.
• Methods and optimizations for a model-based TSMS:
  – A database schema to store multiple time series as models.
  – Methods to push-down predicates to a key-value store used as a model-based physical storage layer.
  – Methods to execute optimized aggregate functions directly on models without requiring a dynamic optimizer.
  – Use of static code-generation to optimize projections.
  – Dynamic extensibility making it possible to add additional models without changing or recompiling the TSMS.
• Realization of our architecture as the distributed TSMS ModelarDB, consisting of the portable ModelarDB Core interfaced with unmodified versions of Apache Spark for query processing and Apache Cassandra for storage.
• An evaluation of ModelarDB using time series from the energy domain. The evaluation shows how the individual features and contributions effectively work together to dynamically adapt to the data sets using multiple models, yielding a unique combination of good compression, fast ingestion, and fast, scalable online aggregate query processing. The actual errors are shown to be much lower than the allowed bounds and ModelarDB degrades gracefully when more outliers are added.

The paper is organized as follows. Definitions are given in Section 2. Section 3 describes our architecture. Section 4 details ingestion and our model-based compression algorithm. Section 5 describes query processing and Section 6 ModelarDB's distributed storage. Section 7 presents an evaluation, Section 8 related work, and Section 9 conclusion and future work.

2. PRELIMINARIES

We now provide definitions which will be used throughout the paper. We also exemplify each using a running example.

DEFINITION 1 (TIME SERIES). A time series TS is a sequence of data points, in the form of time stamp and value pairs, ordered by time in increasing order: TS = ⟨(t1, v1), (t2, v2), ...⟩. For each pair (ti, vi), 1 ≤ i, the time stamp ti represents the time when the value vi ∈ R was recorded. A time series TS = ⟨(t1, v1), ..., (tn, vn)⟩ consisting of a fixed number of n data points is a bounded time series.

As a running example we use the time series TS = ⟨(100, 28.3), (200, 30.7), (300, 28.3), (400, 28.3), (500, 15.2), ...⟩, where each pair represents a time stamp in milliseconds since the recording of measurements was initiated and a recorded value. A bounded time series can be constructed, e.g., from the subset of data points of TS where ti ≤ 300, 1 ≤ i.

DEFINITION 2 (REGULAR TIME SERIES). A time series TS = ⟨(t1, v1), (t2, v2), ...⟩ is considered regular if the time elapsed between each data point is always the same, i.e., ti+1 − ti = ti+2 − ti+1 for 1 ≤ i, and irregular otherwise.

Our example time series TS is a regular time series as 100 milliseconds elapse between each of its adjacent data points.

DEFINITION 3 (SAMPLING INTERVAL). The sampling interval of a regular time series TS = ⟨(t1, v1), (t2, v2), ...⟩ is the time elapsed between each pair of adjacent data points in the time series: SI = ti+1 − ti for 1 ≤ i.

As 100 milliseconds elapse between each pair of adjacent data points in TS, it has a sampling interval of 100 milliseconds.

DEFINITION 4 (MODEL). A model is a representation of a time series TS = ⟨(t1, v1), (t2, v2), ...⟩ using a pair of functions M = (mest, merr). For each ti, 1 ≤ i, the function mest is a real-valued mapping from ti to an estimate of the value vi of the corresponding data point in TS. merr is a mapping from a time series TS and the corresponding mest to a positive real value representing the error of the values estimated by mest.

For the bounded subset of TS a model M can be created using, e.g., a linear function with mest = −0.0024 ti + 29.5, 1 ≤ i ≤ 5, and an error function using the uniform error norm so merr = max(|vi − mest(ti)|), 1 ≤ i ≤ 5. This model-based representation of TS has an error of |15.2 − (−0.0024 × 500 + 29.5)| = 13.1 caused by the data point at t5. The difference between the estimated and recorded values would be much smaller without this data point.

DEFINITION 5 (GAP). A gap between a regular bounded time series TS1 = ⟨(t1, v1), ..., (ts, vs)⟩ and a regular time series TS2 = ⟨(te, ve), (te+1, ve+1), ...⟩ with the same sampling interval SI and recorded from the same source, is a pair of time stamps G = (ts, te) with te = ts + m × SI, m ∈ N≥2, and where no data points exist between ts and te.

The concept of a gap is illustrated in Figure 1. For simplicity we will refer to multiple time series from the same source separated by gaps as a single time series containing gaps.

Figure 1: Illustration of a gap G = (ts, te) between TS1 and TS2
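To make the error function in Definition 4 concrete, the following self-contained Java sketch recomputes the example above: it evaluates the linear estimator mest(ti) = −0.0024 ti + 29.5 over the five data points of the running example and derives merr under the uniform error norm. The class and method names are ours and are used for illustration only; they are not part of ModelarDB.

public class UniformErrorExample {
    // Linear estimator from the example in Definition 4: mest(t) = -0.0024 * t + 29.5
    static double mest(long t) {
        return -0.0024 * t + 29.5;
    }

    public static void main(String[] args) {
        long[] timestamps = {100, 200, 300, 400, 500};
        double[] values = {28.3, 30.7, 28.3, 28.3, 15.2};

        // Uniform error norm: the maximum absolute deviation between the
        // recorded values and the values estimated by mest.
        double merr = 0.0;
        for (int i = 0; i < timestamps.length; i++) {
            double deviation = Math.abs(values[i] - mest(timestamps[i]));
            merr = Math.max(merr, deviation);
        }
        // Prints 13.1, the error caused by the data point at t5 = 500.
        System.out.printf("merr = %.1f%n", merr);
    }
}

The maximum deviation of 13.1 is produced by the last data point, matching the value stated in the running example.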

TS does not contain any gaps. However, the time series TSg = ⟨(100, 28.3), (200, 30.7), (300, 28.3), (400, 28.3), (500, 15.2), (800, 30.2), ...⟩ does. While the only difference between TS and TSg is the data point (800, 30.2), a gap is now present as no data points exist with the time stamps 600 and 700. Due to the gap, TSg is an irregular time series, while TS is a regular time series.

DEFINITION 6 (REGULAR TIME SERIES WITH GAPS). A regular time series with gaps is a regular time series TS = ⟨(t1, v1), (t2, v2), ...⟩ where vi ∈ R ∪ {⊥} for 1 ≤ i. For a regular time series with gaps, a gap G = (ts, te) is a sub-sequence where vi = ⊥ for ts < ti < te.

The irregular time series TSg = ⟨(100, 28.3), (200, 30.7), (300, 28.3), (400, 28.3), (500, 15.2), (800, 30.2), ...⟩, with an undefined SI due to the presence of a gap, can be represented as the regular time series with gaps TSgr = ⟨(100, 28.3), (200, 30.7), (300, 28.3), (400, 28.3), (500, 15.2), (600, ⊥), (700, ⊥), (800, 30.2), ...⟩ with SI = 100 milliseconds.

DEFINITION 7 (SEGMENT). For a bounded regular time series with gaps TS = ⟨(ts, vs), ..., (te, ve)⟩ with sampling interval SI, a segment is a 6-tuple S = (ts, te, SI, Gts, M, ε) where Gts is a set of timestamps for which v = ⊥ and where the values of all other timestamps for TS are defined by the model M within the error bound ε.

In the example for Definition 4, the model M represents the data point at t5 with an error that is much larger than for the other data points of TS. For this example we assume the user-defined error bound to be 2.5, which is smaller than the error of M at 13.1. To uphold the error bound a segment S = (100, 400, 100, ∅, (mest = −0.0024 ti + 29.5, merr = max(|vi − mest(ti)|)), 2.5), 1 ≤ i ≤ 4, can be created. As illustrated in Figure 2, segment S contains the first four data points of TS represented by the model M within the user-defined error bound of 2.5, as |30.7 − (−0.0024 × 200 + 29.5)| = 1.68 ≤ 2.5. As additional data points are added to the time series, new segments can be created to represent each sub-sequence of data points within the user-defined error bound.

Figure 2: Model-based representation of data points as a segment S = (ts, te, SI, Gts, M, ε)

In this paper we focus on the use case of multiple regular time series with gaps being ingested and analyzed in a central TSMS.

3. ARCHITECTURE

The architecture of ModelarDB is modular to enable re-use of existing software deployed in a cluster, and is split into three sets of components with well-defined purposes: data ingestion, query processing and segment storage. ModelarDB is designed around ModelarDB Core, a portable library with system-agnostic functionality for model-based time series management, caching of metadata, and a set of predefined models. For distributed query processing and storage ModelarDB Core integrates with existing systems through a set of interfaces, making the portable core simple to use with existing infrastructure. The architecture is shown in Figure 3. Data flow between components is shown as arrows, while the sets of components are separated by dashed lines. Our implementation of the architecture integrates ModelarDB Core with the stock versions of Apache Spark [8, 17, 45, 46, 47] for distributed query processing, while Apache Cassandra [2, 32] or a JDBC compatible RDBMS can be used for storage. As distributed storage is a paramount property of ModelarDB we focus on Cassandra in the rest of the paper. Each component in Figure 3 is annotated with the system or library providing the functionality of that component in ModelarDB. Spark and Cassandra are used due to both being mature components of the Hadoop ecosystem and well integrated through the DataStax Spark Cassandra Connector. To allow unmodified instances of Spark and Cassandra already deployed in a cluster to be used with ModelarDB, it is implemented as a separate JAR file that embeds ModelarDB Core and only utilizes the public interfaces provided by Spark and Cassandra. As a result, ModelarDB can be deployed by submitting the JAR file as a job to an unmodified version of Spark. In addition, a single-node ingestor has been implemented to support ingestion of data points without Spark. Support for other query processing systems, e.g. [3, 18], can be added to ModelarDB by implementing a new engine class. Support for other storage systems, e.g. Apache HBase [4] or MongoDB [5], simply requires that a storage interface provided by ModelarDB Core be implemented.

When ingesting, ModelarDB partitions the time series and assigns each subset to a core in the cluster. Thus, data points are ingested from the subsets in parallel and time series within each subset are ingested concurrently for unbounded time series. The ingested data points are converted to a model-based representation using an appropriate model automatically selected for each dynamically sized segment of the time series. In addition to the predefined models, user-defined models can be dynamically added without changes to ModelarDB. Segments constructed by the segment generators are kept in memory as part of a distributed in-memory cache. As new segments are emitted to the stream, batches of old segments are flushed to the segment store. Both segments in the cache and in the store can be queried using SQL. By keeping the most recent set of segments in memory, efficient queries can be performed on the most recent data. Queries on historical data have increased query processing time as the segments must be retrieved from Cassandra and cached. Subsequent queries will be performed on the cache.

By reusing existing systems, functionality for fault tolerance can be reused as demonstrated in [44]. As a result, ModelarDB can provide mature and well-tested fault-tolerance guarantees and allows users of the system to select a system for storage and a query processing engine with trade-offs appropriate for a given use case. However, as the level of fault tolerance of ModelarDB depends on the query engine and data store, we only discuss it at the architectural level for our implementation. For ModelarDB data loss can occur at three stages: data points being ingested, segments in the distributed memory cache, and segments written to disk. Fault tolerance can be guaranteed for data points being ingested by using a reliable data source, such as a distributed message queue, or by ingesting each time series at multiple nodes. Fault tolerance for segments in memory and on disk can be ensured through use of distributed replication. In our implementation, loss of segments can be prevented by compiling ModelarDB with replication enabled for Spark and Cassandra. In the rest of this paper we do not consider replication, since ModelarDB reuses the replication in Spark and Cassandra without any modification. As each data point is ingested by one node, data points will be lost if a node fails. However, as our main use case is analyzing tendencies in time series data, some data loss can be acceptable to significantly increase the ingestion rate [39].

In addition to fault tolerance, by utilizing existing components the implementation of ModelarDB can be kept small, reducing the burden of ensuring correctness and adding new features. ModelarDB is implemented in 1675 lines of Java code for ModelarDB Core and 1429 lines of Scala code for the command-line and the interfaces to existing systems. ModelarDB Core is implemented in Java to make it simple to interface with the other JVM languages and to keep the translation from source to bytecode as simple as possible when optimizing performance. Scala was used for the other components due to increased productivity from pattern matching, type inference, and immutable data structures. The source code is available at https://github.com/skejserjensen/ModelarDB.
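The storage interface mentioned above is not spelled out in the paper, so the following Java sketch only illustrates the kind of contract a pluggable segment store could satisfy; the interface name, the method names, and the signatures are assumptions and not ModelarDB Core's actual API.

import java.util.List;
import java.util.stream.Stream;

// Hypothetical storage abstraction, for illustration only. ModelarDB Core's
// real storage interface may differ in both names and signatures.
interface SegmentStore {
    // Prepare the underlying system, e.g., create tables or folders.
    void open();

    // Persist a batch of finalized segments when the bulk write size is reached.
    void writeSegments(List<StoredSegment> segments);

    // Retrieve segments overlapping [startTime, endTime] for one time series,
    // so predicates on Tid and time can be pushed to the storage layer.
    Stream<StoredSegment> readSegments(int tid, long startTime, long endTime);
}

// Minimal segment record holding the metadata a store would need to keep.
class StoredSegment {
    final int tid;
    final long startTime;
    final long endTime;
    final int samplingInterval;
    final int modelId;
    final byte[] parameters;

    StoredSegment(int tid, long startTime, long endTime,
                  int samplingInterval, int modelId, byte[] parameters) {
        this.tid = tid;
        this.startTime = startTime;
        this.endTime = endTime;
        this.samplingInterval = samplingInterval;
        this.modelId = modelId;
        this.parameters = parameters;
    }
}

Under this reading, a Cassandra-backed store would map writeSegments to batched inserts and readSegments to a query restricted by the partitioning key, while adding a new storage system would only require another implementation of the same methods.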

Figure 3: Architecture of a ModelarDB node; each node contains query processing and storage to improve locality

4. DATA INGESTION

To use the resources evenly, ingestion is performed in parallel based on the number of threads available for that task and the sampling rate of the time series. The set of time series is partitioned into disjoint subsets SS and assigned to the available threads so the data points per second of each subset are as close to equal as possible. Providing each thread with the same amount of data points to process ensures resources are utilized uniformly across the cluster to prevent bottlenecks. The partitioning method used by ModelarDB is based on [31], and minimizes max(data_points_per_minute(S1)) − min(data_points_per_minute(S2)) for S1, S2 ∈ SS.

4.1 Model-Agnostic Compression Algorithm

To make it possible to extend the set of models provided by ModelarDB Core, we propose an algorithm for segmenting and compressing regular time series with gaps in which models using lossy or lossless compression can be used. By optimizing the algorithm for regular time series with gaps as per our use case described in Section 1, the timestamp of each data point can be discarded as they can be reconstructed using the sampling interval stored for each time series and the start time and end time stored as part of each segment. To alleviate the trade-off between high compression and low latency required by existing multi-model compression algorithms, we introduce two segment types, namely a temporary segment (ST) and a finalized segment (SF). The algorithm emits STs based on a user-defined maximum latency in terms of data points not yet emitted to the stream, while SFs are emitted when a new data point cannot be represented by the set of models used. The general idea of our algorithm is shown in Figure 4 and uses a list of models from which one model is active at a time as proposed by [22]. For this example we set the maximum latency to be three data points, use a single model in the form of a linear function, and ingest the time series TS from Section 2. At t1 and t2, data points are added to a temporary buffer while a model M is incrementally fitted to the data points. As our method is model-agnostic, each model defines how it is fitted to the data points and how the error is computed. This allows models to implement the most appropriate method for fitting data points, e.g., models designed for streaming can fit one data point at a time, while models that must be recomputed for each data point can perform chunking. At t3, three data points have yet to be emitted, see ye, and the model is emitted to the main memory segment cache as a part of a ST. For illustration, we mark the last data point emitted as part of a ST with a T, and the last data point emitted as part of a SF with an F. As M might be able to represent more data points, the data points are kept in the buffer and the next data point is added at t4. At t5, a data point is added which M cannot represent within the user-defined error bound. As our example only includes one model, a SF is emitted to the main memory segment cache and the data points represented by the SF are deleted from the buffer, as shown by dotted circles, before the algorithm starts anew with the next data point. As the SF emitted represents the data point ingested at t4, ye is not incremented at t5 to not emit data points already represented by a SF as a part of a ST. Last, at tn, when the cache reaches a user-defined bulk write size, the segments are flushed to disk.

Our compression algorithm is shown in Algorithm 1. First, variables are initialized in Line 8-11; this corresponds to t0 in Figure 4. To ensure the data points can be reproduced from each segment, in Line 14-16, if a gap exists all data points in the buffer are emitted as one or more SFs.

Figure 4: The multi-model compression algorithm; the maximum for yet to be emitted (ye) data points is three
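The partitioning step at the beginning of this section balances the data points per minute across the disjoint subsets SS; the method ModelarDB actually uses is based on [31]. Purely as an illustration of that balancing objective, and not of the actual method, the following Java sketch greedily assigns each time series to the subset that currently receives the fewest data points per minute. All names are ours.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GreedyPartitioner {
    // A regular time series with sampling interval SI (in ms) produces 60000 / SI data points per minute.
    static double dataPointsPerMinute(int samplingIntervalMs) {
        return 60000.0 / samplingIntervalMs;
    }

    // Illustrative greedy heuristic: assign each time series (given by its sampling
    // interval) to the subset with the currently lowest total data points per minute.
    static List<List<Integer>> partition(List<Integer> samplingIntervals, int threads) {
        List<List<Integer>> subsets = new ArrayList<>();
        double[] load = new double[threads];
        for (int i = 0; i < threads; i++) {
            subsets.add(new ArrayList<>());
        }
        // Assigning the fastest time series first usually balances the subsets better.
        samplingIntervals.sort((a, b) -> Double.compare(dataPointsPerMinute(b), dataPointsPerMinute(a)));
        for (int si : samplingIntervals) {
            int lightest = 0;
            for (int i = 1; i < threads; i++) {
                if (load[i] < load[lightest]) {
                    lightest = i;
                }
            }
            subsets.get(lightest).add(si);
            load[lightest] += dataPointsPerMinute(si);
        }
        return subsets;
    }

    public static void main(String[] args) {
        // Sampling intervals in milliseconds for six hypothetical time series.
        List<Integer> sis = new ArrayList<>(Arrays.asList(100, 100, 1000, 30000, 60000, 60000));
        System.out.println(partition(sis, 2));
    }
}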

Algorithm 1 Online model-agnostic (lossy and lossless) multi-model compression algorithm with latency guarantees

1: Let ts be the time series of data points.
2: Let models be the list of models to select from.
3: Let error be the user-defined error bound.
4: Let limit be the limit on the length of each segment.
5: Let latency be the latency in not emitted data points.
6: Let interval be the sampling interval of the time series.
7:
8: model ← head(models)
9: buffer ← create_list()
10: yet_emitted ← 0
11: previous ← nil
12: while has_next(ts) do
13:   data_point ← retrieve_next_data_point(ts)
14:   if time_diff(previous, data_point) > interval then
15:     flush_buffer(buffer)
16:   end if
17:   append_data_point_to_buffer(data_point, buffer)
18:   previous ← data_point
19:   if append_data_point_to_model(data_point, model, error, limit) then
20:     yet_emitted ← yet_emitted + 1
21:     if yet_emitted = latency then
22:       emit_temporary_segment(model, buffer)
23:       yet_emitted ← 0
24:     end if
25:   else if has_next(models) then
26:     model ← next(models)
27:     initialize(model, buffer)
28:   else
29:     emit_finalized_segment(models, buffer)
30:     model ← head(models)
31:     initialize(model, buffer)
32:     yet_emitted ← min(yet_emitted, length(buffer))
33:   end if
34: end while
35: flush_buffer(buffer)

If the number of data points in the buffer is lower than what is required to instantiate any of the provided models (a linear function requires two data points), a segment containing uncompressed values is emitted. In Line 17-20 the data point is appended to the buffer, and previous is set to the current data point. The data point is appended to model, allowing the model to update its internal parameters and the algorithm to check if the model can represent the new data point within the user-defined error bound or length limit. Afterwards, the number of data points not emitted as a segment is incremented. This incremental process is illustrated by states t1 to t4 in Figure 4. If latency data points have not been emitted, a ST using the current model is emitted in Line 21-23. The current model is kept as it might represent additional data points. This corresponds to t3 in Figure 4 as a ST is emitted due to latency = 3. If the current model cannot be instantiated with the data points in the buffer, a ST containing uncompressed values is emitted. When the model no longer can represent a data point within the required error bound, the next model in the list of models is selected and initialized with the buffered data points in Line 25-27. As the model represents as many data points from buffer as possible when initialized, and any subsequent data points are rejected, no explicit check of whether the model can represent all data points in the buffer is needed. Instead, this check will be done as part of the algorithm's next iteration when a new data point is appended. When the list of models becomes empty, a SF containing the model with the highest compression ratio is emitted in Line 29. To allow for models using lossless compression we compute the compression ratio as the reduction in bytes, not the number of values to be stored: compression ratio = (data_points_represented(model) × size_of(data point)) / size_of(model). As model selection is based on the compression ratio, the segment emitted by emit_finalized_segment might not represent all data points in the buffer. In Line 30-32 model is set to the first model in the list and initialized with any data points left in the buffer. If data points not emitted by a ST were emitted as part of the SF, yet_emitted is decremented appropriately. This process of emitting a SF corresponds to t5 of Figure 4, where the latest data point is left in the buffer and one data point is emitted first by a SF. In Line 35, as all data points have been received, the buffer is flushed so all data points are emitted as SFs.
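The byte-based compression ratio above is what drives the model selection in Line 29 of Algorithm 1, and it is also why a model that cannot yet be instantiated must report NaN (see Section 4.3). A minimal Java sketch of this computation is shown below; the constant for the size of a data point is an assumption used only for illustration.

public class CompressionRatio {
    // Size of one uncompressed data point in bytes. The evaluation stores values as
    // 4-byte floats and timestamps are reconstructed from segment metadata, so this
    // constant is an assumption made only for this illustration.
    static final int SIZE_OF_DATA_POINT = 4;

    // compression ratio = (data points represented * size of a data point) / size of the model
    static double compressionRatio(int dataPointsRepresented, int modelSizeInBytes) {
        if (dataPointsRepresented == 0) {
            return Double.NaN; // A model that represents no data points must not be selected.
        }
        return (dataPointsRepresented * SIZE_OF_DATA_POINT) / (double) modelSizeInBytes;
    }

    public static void main(String[] args) {
        // A linear model storing two 4-byte parameters (a and b) that represents 50 data points.
        System.out.println(compressionRatio(50, 8));   // 25.0
        // A lossless model that needed 120 bytes to represent the same 50 data points.
        System.out.println(compressionRatio(50, 120)); // roughly 1.67
    }
}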

4.2 Considerations Regarding Data Ingestion

Two methods exist for segmenting time series: connected and disconnected. A connected segment starts with the previous segment's last data point, while a disconnected segment starts with the first data point not represented by the previous segment. Our algorithm supports both by changing if emit_finalized_segment keeps the last data point of a segment when it is emitted. The use of connected segments provides two benefits. If used with models supporting interpolation, the time series can be reconstructed with any sampling interval as values between any two data points can be interpolated. Also, connected segments can be stored using only a single time stamp as the end time of one segment is the start time of the next. However, for multi-model compression of time series, [37, 38] demonstrated an increased compression ratio for disconnected segments if the start and end time of each segment are stored for use with indexing. The decreased size is due to the increased flexibility when fitting disconnected segments, as no data point from the previous segment is included [37, 38]. Since time series may have gaps, the start and end time of a segment must be stored to ensure all data points ingested can be reconstructed. As a result, the rest of this paper will only be concerned with disconnected segments.

To represent gaps, there are two methods: flushing the stream of data points when a gap is encountered, or storing the gaps explicitly as a pair of time stamps G = (ts, te). When we evaluated both we observed no significant difference in compression ratio. However, storing gaps explicitly requires additional computation as any operation on segments must skip gaps, which also complicates the implementation. Storing gaps explicitly requires that flush_buffer(buffer) in Line 15 of Algorithm 1 be substituted with timestamp(previous) and timestamp(data_point) being added to a gap buffer, and that the functions emitting segments are modified to include gaps as part of the segment. ModelarDB flushes the stream of data points as shown in Algorithm 1, but explicit storage of gaps can be enabled.

4.3 Implementation of User-Defined Models

For a user to optionally add a new model and segment in addition to those predefined in ModelarDB Core, each must implement the interfaces in Table 2. Tid is a unique id assigned to each time series. By having a segment class, a model object can store data while ingesting data points without increasing the size of its segment. As a result, models can implement model-specific optimizations such as chunking, lazy fitting or memoization. For aggregate queries to be executed directly on a segment, the optional methods must be implemented.

Table 2: Interface for models and segments

Model
  new(Error, Limit): Return a new model with the user-defined error bound and length limit.
  append(Data Point): Append a data point if it and all previous do not exceed the error bound.
  initialize([Data Point]): Clear the existing data points from the model and append the data points from the list until one exceeds the error bound or length limit.
  get(Tid, Start Time, End Time, SI, Parameters, Gaps): Create a segment represented by the model from serialized parameters.
  get(Tid, Start Time, End Time, SI, [Data Point], [Gap]): Create a segment from the model's state and the list of data points.
  length(): Return the number of data points the model currently represents.
  size(): Return the size in bytes currently required for the model's parameters.

Segment
  get(Timestamp, Index): Return the value from the underlying model that matches the timestamp and index; both are provided to simplify implementation of this interface.
  parameters(): Return the segment-specific parameters necessary to reconstruct it.
  sum() (optional): Compute the sum of the values of data points represented by the segment.
  min() (optional): Compute the minimum value of data points represented by the segment.
  max() (optional): Compute the maximum value of data points represented by the segment.

An implementation of sum for a segment using a linear function as the model is shown in Listing 1. For this model, the sum can be computed without recreating the data points by multiplying the average of the values with the number of represented data points. In Line 2-3 the number of data points is computed. In Line 4-5 the minimum and maximum values of the segment (the values of its first and last data points) are computed. Last, in Line 6-7 the sum is computed by multiplying the average with the number of data points. As a result, the sum can be computed in ten arithmetic operations and without a loop.

1 public double sum() {
2   int timespan = this.endTime - this.startTime;
3   int size = (timespan / this.SI) + 1;
4   double first = this.a * this.startTime + this.b;
5   double last = this.a * this.endTime + this.b;
6   double average = (first + last) / 2;
7   return average * size;
8 }

Listing 1: sum implemented for a linear model

To demonstrate, we use the segment from Section 2 with the start time 100, the end time 400, the sampling interval 100, and the linear function −0.0024 ti + 29.5 as the model. For a more realistic example we increase the end time to 7300. First the number of data points represented by the segment is calculated: ((7300 − 100) / 100) + 1 = 73, followed by the value of the first data point: −0.0024 × 100 + 29.5 = 29.26, and of the last data point: −0.0024 × 7300 + 29.5 = 11.98. The average value for the data points represented by the segment is then (29.26 + 11.98) / 2 = 20.62, with the sum of the represented values given by 20.62 × 73 = 1505.26. Our example clearly shows the benefit of using models for queries as computing the sum is reduced from 73 arithmetic operations to 10, or in terms of complexity, from linear to constant time complexity.

All models must exhibit the following behavior. First, a model yet to append enough data points to be instantiated must return an invalid compression ratio, NaN, so it is not selected to be part of a segment. Second, if a model rejects a data point, all following data points must be rejected until the model is reinitialized. Last, as a consequence of using an extensible set of models, the method for computing the error of a model's approximation must be defined by the model. The combination of user-defined models and the model selection algorithm provides a framework expressive enough to express existing approaches for time series compression. For TSMSs that compress time series as statically sized sub-sequences using one compression method, such as Facebook's Gorilla [39], a single model which rejects data points based on the limit parameter can be used. For methods that use multiple lossy models in a predefined sequence, such as [22], the same models can be implemented and re-used with any system that integrates ModelarDB Core, with the added benefit that the ordering of the models is not hardcoded as part of the algorithm as in [22] but is simply a parameter.

For evaluation we implement a set of general-purpose models from the literature. We base our selection of models on [27], which demonstrates substantial increases in compression ratio for models supporting dynamically sized segments and high compression ratios for some constant and linear models, in addition to existing multi-model approaches predominately selecting constant and linear models [22, 38]. Also, we select models that can be fitted incrementally to efficiently fit the data points online. To ensure the user-provided error bound is guaranteed for each data point, only models providing an error bound based on the uniform error norm are considered [34]. Last, we select models with lossless and lossy compression, allowing ModelarDB to select the approach most appropriate for each sub-sequence. We thus implement the following models: the constant PMC-MR model [33] and the linear Swing model [23], both modified so the error bound can be expressed as the percentage difference between the real and approximated value, and the lossless compression algorithm for floating-point values proposed by Facebook [39], modified to use floats. A model storing raw values is used by ModelarDB when no other model is applicable.
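To give a feel for how one of these models can satisfy the Model interface in Table 2, the following Java sketch implements the core append logic of a PMC-MR-style constant model: it tracks the running minimum and maximum and accepts a data point as long as the accepted range stays within twice the error bound, representing all accepted values by the midpoint of that range. This is a simplified reading of PMC-MR [33] with an absolute rather than percentage-based error bound, and the class is not ModelarDB's implementation.

// Simplified constant (PMC-MR style) model with an absolute uniform error bound.
// ModelarDB's actual models use a percentage-based bound and implement the full
// Model interface from Table 2; this sketch only shows the append logic.
public class ConstantModel {
    private final double errorBound;
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;
    private int length = 0;
    private boolean rejected = false;

    public ConstantModel(double errorBound) {
        this.errorBound = errorBound;
    }

    // Append a value if it and all previously appended values stay within the error bound.
    // Once a value is rejected, all following values are rejected until reinitialization.
    public boolean append(double value) {
        if (rejected) {
            return false;
        }
        double newMin = Math.min(min, value);
        double newMax = Math.max(max, value);
        // The midpoint is within errorBound of every accepted value exactly
        // when the accepted range is at most 2 * errorBound wide.
        if (newMax - newMin > 2 * errorBound) {
            rejected = true;
            return false;
        }
        min = newMin;
        max = newMax;
        length++;
        return true;
    }

    public int length() { return length; }

    // The single parameter of the model: the midpoint of the accepted range.
    public double parameter() { return (min + max) / 2.0; }

    public static void main(String[] args) {
        ConstantModel m = new ConstantModel(2.5);
        for (double v : new double[] {28.3, 30.7, 28.3, 28.3, 15.2}) {
            System.out.println(v + " accepted: " + m.append(v));
        }
        // Represents the first four values of the running example by the single value 29.5.
        System.out.println("length = " + m.length() + ", parameter = " + m.parameter());
    }
}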

5. QUERY PROCESSING

5.1 Generic Query Interface

Segments emitted by the segment generators are put in the main-memory cache and made available for querying together with segments in storage. ModelarDB provides a unified query interface for segments in memory and storage using two views. The first view represents segments directly while the second view represents segments as data points. The segment view uses the schema (Tid int, StartTime timestamp, EndTime timestamp, SI int, Mid int, Parameters blob) and allows aggregate queries to be executed on the segments without reconstructing the data points. The attribute Tid is the unique time series id, SI the sampling interval, Mid the id of the model used for the segment, and Parameters the parameters for the model. The data point view uses the schema (Tid int, TS timestamp, Value float) to enable queries to be executed on data points reconstructed from the segments. Implementing views at these two levels of abstraction allows query processing engines to directly interface with the data types utilized by ModelarDB Core. Efficient aggregates can be implemented as user-defined aggregate functions (UDAFs) on the segment view, and predicate push-down can be implemented to the degree that the query processing engine supports it. While only having the data point view would provide a simpler query interface, interfacing a new query processing engine with ModelarDB Core would be more complex as aggregate queries to the data point view should be rewritten to use the segment view [21, 29, 43].

5.2 Query Interface in Apache Spark

Through Spark SQL ModelarDB provides SQL as its query language. Since Spark SQL only pushes the required columns and the predicates of the WHERE clause to the data source, aggregates are implemented as UDAFs on the segment view. While full query rewriting is not performed, the data point view retrieves segments through the segment view, which pushes the predicates of the WHERE clause to the segment store. As a result, the segment store only needs to support predicate push-down from the segment view, and never from both views. Our current implementation supports COUNT, MIN, MAX, SUM, and AVG. The UDAFs use the optional methods from the segment interface shown in Table 2 if available; otherwise the query is executed on data points. As UDAFs in Spark SQL cannot be overloaded, two sets of UDAFs are implemented. The first set operate on segments as rows and have the suffix _S. The second set operate on segments as structs and have the suffix _SS. Queries on segments can be filtered at the segment level using a WHERE clause. Thus, for queries on the segment view to be executed with the same granularity as queries on the data point view, functions are provided to restrict either the start time (START), end time (END), or both (INTERVAL) of segments. While ModelarDB at the moment only supports a limited number of built-in aggregate functions through the segment view, to demonstrate the benefit of computing aggregates using models, any aggregate functions provided as part of Spark SQL can be utilized through the data point view. In addition, existing software developed to operate on time series as data points, e.g., for time series similarity search, can utilize the data point view. Last, using the APIs provided by Spark SQL any distributive or algebraic aggregation function can be added to both the data point view and the segment view.

5.3 Execution of Queries on Views

Examples of queries on the views are shown in Listing 2. Line 1-2 show two queries calculating the sum of all values ingested for the time series with Tid = 3. The first computes the result from data points reconstructed from the segments while the second query calculates the result directly from the segments. The query on Line 4-5 computes the averages of the values for data points with a timestamp after 2012-01-03 12:30. The WHERE clause filters the result at the segment level and START disregards the data that is older than the timestamp provided. Last, in Line 7-8 a query is executed on the data point view as if the data points were stored.

1 SELECT SUM(Value) FROM DataPoint WHERE Tid = 3
2 SELECT SUM_S(*) FROM Segment WHERE Tid = 3
3
4 SELECT AVG_SS( START(*, '2012-01-03 12:30') )
5 FROM Segment WHERE EndTime >= '2012-01-03 12:30'
6
7 SELECT * FROM DataPoint WHERE Tid = 3
8 AND TS < '2012-04-22 12:25'

Listing 2: Query examples supported in ModelarDB

To lower query latency, the main memory segment cache, see Figure 3, stores the most recently emitted or queried SFs and the last ST emitted for each time series. Then, to ensure queries do not return duplicate data points, the start time of a ST is updated when a SF with the same Tid is emitted so the time intervals do not overlap. STs where StartTime > EndTime are dropped. Last, the SF cache is flushed when it reaches a user-defined bulk write size.

Query processing in ModelarDB is primarily concerned with filtering out segments and data points from the query result. An example is shown in Figure 5 with a query for segments that contain data points from the sensor with Tid = 77 from after the date 2012-01-03. Assume that only SF3 and ST6 satisfy both predicates and that SF3 is not in the cache. First, the WHERE clause predicates are pushed to the segment store, see RS1, to retrieve the relevant segment. The segment retrieved from disk is cached, see RS2, and the cache is then unioned with the STs and SFs in memory, shown as RS3 and RS4, to produce the set RS5. RS5 is filtered according to the WHERE clause to possibly remove segments provided by a segment store with imprecise evaluation of the predicates (i.e., with false positives) and to remove irrelevant segments from the in-memory cache. The final result is shown as RS6. Queries on the data point view are processed by retrieving relevant segments through the segment view. The data points are then reconstructed and filtered based on the WHERE clause.

Figure 5: Query processing in ModelarDB (example query: SELECT * FROM Segment WHERE Tid = 77 AND EndTime > '2012-01-03', producing the result sets RS1 to RS6)

5.4 Code-Generation for Projections

In addition to predicate push-down, the views perform projections such that only the columns used in the query are provided. However, building rows dynamically based on the columns requested creates a significant performance overhead. As the columns of each view are static, optimized projection methods can be generated at compile time without the additional overhead and complexity of dynamic code generation.

1 def getDataPointGridFunction
2     (columns: Array[String]): (DataPoint => Row) = {
3   val target = getTarget(columns, dataPointView)
4   (target: @switch) match {
5     //Permutations of ('tid')
6     case 1 => (dp: DataPoint) => Row(dp.tid)
7     ...
8     //Permutations of ('tid', 'ts', 'value')
9     ...
10    case 321 => (dp: DataPoint) => Row(dp.value,
11      new Timestamp(dp.timestamp), dp.tid)
12  }
13 }

Listing 3: Selection of method for a projection

The method generated for the data point view is shown in Listing 3. On Line 3, the list of requested columns is converted to column indexes and concatenated in the requested order to create a unique integer. This works for both views as each has less than ten columns and allows the projection method to be retrieved using a switch instead of comparing the contents of arrays. On Line 4, the projection method is retrieved using a match statement which is compiled to an efficient lookupswitch [15].
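The integer compared in the match statement is simply the one-based column indexes of the requested columns concatenated in the requested order, so requesting ('value', 'ts', 'tid') from the data point view yields 321 as in Listing 3. The small Java sketch below shows this encoding; the column ordering and names are assumed for illustration and may differ from the actual implementation.

import java.util.Arrays;
import java.util.List;

public class ProjectionKey {
    // Assumed column order of the data point view: tid = 1, ts = 2, value = 3.
    private static final List<String> DATA_POINT_VIEW = Arrays.asList("tid", "ts", "value");

    // Concatenate the one-based column indexes in the requested order into one integer,
    // e.g., ["value", "ts", "tid"] -> 321. Works because each view has fewer than ten columns.
    static int target(String[] requestedColumns) {
        int key = 0;
        for (String column : requestedColumns) {
            key = key * 10 + (DATA_POINT_VIEW.indexOf(column) + 1);
        }
        return key;
    }

    public static void main(String[] args) {
        System.out.println(target(new String[] {"tid"}));                // 1
        System.out.println(target(new String[] {"value", "ts", "tid"})); // 321
    }
}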

6. SEGMENT STORAGE

Figure 6 shows a generic schema for storing segments with the metadata necessary for ModelarDB. It has three tables: Time Series for storing metadata about time series (the current implementation requires only the sampling interval), Model for storing the model type contained in each segment, and last Segment for storing segments with the model parameters as blobs. The bulk of the data is stored in the Segment table. Compared to other work [22, 24, 38], the inclusion of both a Tid and the Time Series table allows queries for segments from different time series with different sampling intervals to be served by one Segment table.

Figure 6: Generic schema for storage of segments
  Time Series (Tid (PK), SI), e.g., (1, 60000), (2, 120000), (3, 30000)
  Segment (Tid (PK), StartTime (PK), EndTime, Mid, Parameters), e.g., (1, 1460442200000, 1460442620000, 1, 0x3f50cfc0), (3, 1460642900000, 1460645060000, 2, 0x3f1e...)
  Model (Mid (PK), Name), e.g., (1, PMC-MR), (2, Swing), (3, Facebook)

6.1 Segment Storage in Apache Cassandra

Compression is the primary focus for ModelarDB's storage of segments in Cassandra. As Cassandra expects each column in a table to be independent, using Tid, StartTime, EndTime as the primary key only indicates to Cassandra that each partition is fully sorted by StartTime. As a result, adding StartTime and EndTime to the primary key does not allow direct lookup of segments. However, as segments are partitioned by Tid, inside each partition EndTime would be sorted as a consequence of StartTime being sorted. We utilize this for ModelarDB by partitioning each table on their respective ids, and use EndTime as the clustering column for the segment table so segments are sorted ascendingly by EndTime on disk. This allows the Size of a segment to be stored instead of StartTime, for a higher compression ratio, while allowing ModelarDB to exploit the partitioning and ordering of the segment table when executing queries, as Cassandra can filter segments on EndTime while Spark loads segments until the required StartTime is reached. The StartTime column cannot be omitted due to the presence of gaps, as explained in Section 4.2. To support indexing methods suitable for a specific storage system like in [24], secondary indexes can be implemented in ModelarDB as part of the storage interface shown in Figure 3.

6.2 Predicate Push-Down

The columns for which predicate push-down is supported in our implementation are shown in Figure 7. Each cell in the table shows how a predicate on a specific column is rewritten before it is pushed to the segment view or storage. Cells for the column StartTime marked with Spark takeWhile indicate that Spark reads rows from Cassandra in batches until the predicate represented by the cell is false for a segment. As explained above, this allows StartTime to be replaced with the column Size which stores the number of data points in the segment. This reduces the storage needed for start time without sacrificing its precision. When a segment is loaded, the start time of the segment can be recomputed as StartTime = EndTime - (Size * SI), allowing Spark to load segments until the predicate represented by the cell is false for a segment. Non-equality queries on Tid are rewritten as Cassandra only supports equality queries on a partitioning key.
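Because segments are stored with a Size column instead of StartTime (Section 6.1), the start time has to be recomputed whenever a segment is loaded from Cassandra. A minimal Java sketch of the reconstruction formula from Section 6.2 follows; the example Size value is assumed purely for illustration.

public class SegmentLoading {
    // Reconstruct the start time of a segment from the stored metadata:
    // StartTime = EndTime - (Size * SI), where Size is the stored number of data
    // points for the segment and SI is the sampling interval of its time series.
    static long startTime(long endTime, long size, long samplingInterval) {
        return endTime - (size * samplingInterval);
    }

    public static void main(String[] args) {
        // Assuming a segment with end time 1460442620000, a stored Size of 7, and a
        // sampling interval of 60000 ms, the formula yields 1460442200000.
        System.out.println(startTime(1460442620000L, 7, 60000));
    }
}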
7. EVALUATION

We compare ModelarDB to the state-of-the-art big data systems and file formats used in industry: Apache ORC [6, 26] files stored in HDFS [42], Apache Parquet [7] files stored in HDFS, InfluxDB [13], and Apache Cassandra [32]. InfluxDB is running on a single node as the open-source version does not support distribution. The number of nodes used for each experiment is shown in the relevant figures. Multi-model compression for time series is also evaluated in [22, 37, 38]. We first present the cluster, data sets and queries used in the evaluation, then we describe each experiment.

7.1 Evaluation Environment

The cluster consists of one master and six worker nodes connected by 1 Gbit Ethernet. All nodes have an Intel Core i7-2620M 2.70 GHz, 8 GiB of 1333 MHz DDR3 memory and a 7,200 RPM hard-drive. Each node runs Ubuntu 16.04 LTS, InfluxDB 1.4.2, InfluxDB-Python 2.12, Pandas 0.17.1, HDFS from Hadoop 2.8.0, Spark 2.1.0, Cassandra 3.9 and DataStax Spark Cassandra Connector 2.0.2 on top of EXT4. The master is a Primary HDFS NameNode, Secondary HDFS NameNode and Spark Master. Each worker serves as an HDFS Datanode, a Spark Slave, and a Cassandra Node. Cassandra does not require a master node. Only the software necessary for an experiment is kept running and replication is disabled for all systems. Disk space utilization is found with du. The time series are stored using the same schema as the Data Point View: Tid as an int, TS using each storage method's native timestamp type, and Value as a float. InfluxDB is an exception as it only supports double. Ingestion for all storage methods is performed using float. For Parquet and ORC, one file is created per time series using Spark and stored in HDFS with one folder created for each data set and file format pair; for InfluxDB, time series are stored as one measurement with the Tid as a tag; and for Cassandra we partition on Tid and order each partition on TS and Value for the best compression.

The configuration of each system is, in general, left with its default values. However, the memory available for Spark and either Cassandra or HDFS is statically allocated to prevent crashes. To divide memory between query processing and data storage, we limit the amount of memory Spark can allocate per node, so the rest is available to Cassandra/HDFS and Ubuntu. Memory allocation is limited through Spark to ensure consistency across all experiments. The appropriate amount of memory for Spark is found by assigning half of the memory on each system to Spark and then reducing the memory allocated for Spark until all experiments can run successfully. We enable predicate push-down for Parquet and ORC. The parameters used are shown in Table 3, with ModelarDB-specific parameters, changes to Spark's default parameters, and the models implemented in ModelarDB Core. The parameter values are found to work well with the data sets and the hardware configuration. The error bound is 10% when not stated explicitly.

Table 3: The parameters we use for the evaluation

ModelarDB parameters:
  Error Bound: 0%, 1%, 5%, 10%
  Limit: 50
  Latency: 0
  Bulk Write Size: 50,000

Spark parameters:
  spark.driver.memory: 4 GiB
  spark.executor.memory: 3 GiB
  spark.streaming.unpersist: false
  spark.sql.orc.filterPushdown: true
  spark.sql.parquet.filterPushdown: true

Models (Model, Representation, Type of Compression):
  PMC-MR [33], Constant Function, Lossy Compression
  Swing [23], Linear Function, Lossy Compression
  Facebook [39], Array of Delta Values, Lossless Compression
  Uncompressed Values, Array of Values, No Compression

Figure 7: The two-step methods for predicate push-down utilized by ModelarDB

Data Point View predicate -> rewritten for the Segment View:
  Tid IN ?   -> Tid IN ?      Timestamp IN ?  -> No Pushdown
  Tid > ?    -> Tid > ?       Timestamp > ?   -> EndTime > ?
  Tid >= ?   -> Tid >= ?      Timestamp >= ?  -> EndTime >= ?
  Tid < ?    -> Tid < ?       Timestamp < ?   -> StartTime < ?
  Tid <= ?   -> Tid <= ?      Timestamp <= ?  -> StartTime <= ?
  Tid = ?    -> Tid = ?       Timestamp = ?   -> StartTime <= ? AND EndTime >= ?

Segment View predicate -> rewritten for Cassandra segment storage:
  Tid IN ?   -> Tid IN ?           StartTime IN ?  -> No Pushdown       EndTime IN ?  -> No Pushdown
  Tid > ?    -> Tid IN (?+1..n)    StartTime > ?   -> No Pushdown       EndTime > ?   -> EndTime > ?
  Tid >= ?   -> Tid IN (?..n)      StartTime >= ?  -> No Pushdown       EndTime >= ?  -> EndTime >= ?
  Tid < ?    -> Tid IN (1..?-1)    StartTime < ?   -> Spark takeWhile   EndTime < ?   -> EndTime < ?
  Tid <= ?   -> Tid IN (1..?)      StartTime <= ?  -> Spark takeWhile   EndTime <= ?  -> EndTime <= ?
  Tid = ?    -> Tid = ?            StartTime = ?   -> No Pushdown       EndTime = ?   -> EndTime = ?

7.2 Data Sets and Queries

The data sets we use for the evaluation are regular time series where gaps are uncommon. Each data set is stored as CSV files with one time series per file and one data point per line.

Energy Production High Frequency. This data set is referred to as EH and consists of time series from energy production. The data was collected by us from an OPC Data Access server using a Windows server connected to an energy producer. The data has an approximate sampling interval of 100ms. As pre-processing we round the timestamps and remove data points with equivalent timestamps due to rounding. This pre-processing step is only required due to limitations of our collection process and is not present in a production setup. The data set is 582.68 GiB in size.

REDD. The public Reference Energy Disaggregation Data Set (REDD) [30] is a data set of energy consumption from six houses collected over two months. We use the files containing energy usage in each house per second. Three of the twelve files have been sorted to correct a few out-of-order data points, and the files from house six were removed due to irregular sampling intervals. As REDD fits into memory on a single node, we extend it by replicating each file 2,500 times by multiplying all values of each file with a random value in the range [0.001, 1.001) and rounding each value to two decimal places to ensure our results are not impacted by identical files.
2,500 is selected due to the amount of storage in our cluster. The data set is 487.52 GiB in size, and is referred to as Extended REDD (ER). We use this public data set to enable reproducibility.

Energy Production. This data set is referred to as EP and primarily consists of time series for energy production and is provided by an energy trading company. The data is collected over 508 days, has a sampling interval of 60s, and is 339 GiB in size. The data set also contains entity-specific measurements such as wind speed for a wind turbine and horizontal irradiance for solar panels.

Queries. The first set of queries (S-AGG) consists of small aggregate and GROUP BY queries and represents online analytics on one or a few time series, e.g., correlated sensors, as analytical queries are the intended ModelarDB use case. Both types of queries are restricted by Tid using a WHERE clause, with the GROUP BY queries operating on five time series each and GROUPing on Tid. The second set (L-AGG) consists of large aggregate and GROUP BY queries, which aggregate the entire data set, and each GROUP BY query GROUPs on Tid. L-AGG is designed to evaluate the scalability of the system when performing its intended use case. The third set (P/R) contains time point and range queries restricted by WHERE clauses with either TS or Tid and TS. P/R represents a user extracting a sub-sequence from a time series, which is not the intended ModelarDB use case, but is included for completeness. We do not evaluate SQL JOIN queries as they are not commonly used with time series, and similarity search is not yet built into ModelarDB.

7.3 Experiments

Ingestion Rate. To remove the network as a possible bottleneck, the ingestion rate is mainly evaluated locally on a worker node. For each system/format we ingest channel one from house one of ER from gzipped CSV files (14.67 GiB) on local disks. Except for InfluxDB, ingestion is performed using a local instance of Spark through spark-shell with its default parameters. As no mature Spark Connector to our knowledge exists for InfluxDB, we use the InfluxDB-Python client library [14]. The input files are parsed using Pandas and InfluxDB-Python is configured with a batch size of 50,000. For Cassandra we increase the parameter batch_size_fail_threshold_in_kb to 50 MiB to allow larger batches. To determine ModelarDB's scalability we also evaluate its ingestion rate on the cluster in two different scenarios: Bulk Loading (BL) without queries and Online Analytics (OA) with aggregate queries continuously executed on random time series using the Segment View. When using a single worker, ModelarDB uses the single-node ingestor, and when it is distributed, Spark Streaming with one receiver per node, a micro-batch interval of five seconds, and latency set to zero so each data point is part of a segment only once.

Figure 8: Ingestion, ER (millions of data points per second for the scenarios BL-1, BL-6, and OA-6)
Figure 9: Storage, EH (size in GiB per error bound for InfluxDB, Cassandra, Parquet, ORC, and ModelarDB)
Figure 10: Storage, ER (size in GiB per error bound)
Figure 11: Storage, EP (size in GiB per error bound)

Figure 12: Models, EH
Figure 13: Models, ER
Figure 14: Models, EP
Figure 15: Outlier Effect

The results are shown in Figure 8. As expected, InfluxDB and Cassandra had the lowest ingestion rate, as they are designed to be queried while ingesting. ModelarDB also supports executing queries during ingestion but still provides 11 times and 4.89 times faster ingestion than InfluxDB and Cassandra, respectively. Parquet and ORC provided a 1.52 and 1.39 times higher ingestion rate than ModelarDB, respectively. However, an entire file must be written before Parquet and ORC can be queried, making them unsuitable for online analytics due to the inherent latency of this approach. This compromise is not needed for ModelarDB as queries can be executed on data as it is ingested. When bulk loading on the six node cluster the ingestion rate for ModelarDB increases 5.39 times, a close to linear speedup. The ingestion rate is nearly unaffected, a 5.36 times increase, when doing online analytics in parallel. In summary, ModelarDB achieves high ingestion rates while allowing online analytics, unlike the alternatives.
Effect of Error Bound and Outliers The trade-off between storage efficiency and error bound is evaluated using all three data sets. The models used and the size of each data set are found when stored in ModelarDB with the error bound set to values from 0% to 10%. We compare the storage efficiency of ModelarDB with the systems used in industry. In addition, we evaluate the performance of ModelarDB's adaptive compression method when outliers are present, by adding an increasing number of outliers to each data set. The outliers are randomly created such that the average distance between two consecutive outliers is N and the value of each outlier is set to (value of the replaced data point + 1) * 2.
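The outlier injection just described can be summarized by the small sketch below. It is a simplified reconstruction of the procedure (average distance N between outliers, replacement value (v + 1) * 2), not the exact code used to prepare the data sets.

```scala
// A simplified sketch of the outlier injection described above: on average every
// N-th data point is replaced by an outlier whose value is (original + 1) * 2.
import scala.util.Random

object OutlierInjectionSketch {
  def injectOutliers(values: Seq[Double], n: Int, random: Random = new Random()): Seq[Double] =
    values.map { value =>
      // Replacing with probability 1/N yields an average distance of N between outliers.
      if (random.nextInt(n) == 0) (value + 1.0) * 2.0 else value
    }
}

// Example: roughly one outlier per 250 data points.
// val noisy = OutlierInjectionSketch.injectOutliers(clean, 250)
```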
The storage used for EH is seen in Figure 9. Of the existing systems, InfluxDB performs the best, but even with a 0% error bound, ModelarDB reduces the size of EH 1.52 times compared to InfluxDB. This is expected as the PMC-MR model can be used to perform run-length encoding while changing values are managed with delta-compression using the Facebook model. Increasing the error bound to 10% provides a further 1.18 times reduction, while the average actual error is only 0.005%. The results for ER are seen in Figure 10. Compared to InfluxDB, ModelarDB provides much better compression: 2.40 times for a 1%, 7.02 times for a 5%, and 9.31 times for a 10% error bound. For ER, ORC is the best of the existing systems, but ModelarDB further reduces the size by 2.13 times for a 1%, 6.24 times for a 5%, and 8.27 times for a 10% error bound. In addition, the average actual error for ER is only 0.22% with a 1% bound, 1.25% for a 5% bound, and 2.50% for a 10% bound. Even with a 0% bound, ModelarDB uses just 1.17 times more storage than ORC. EP results are shown in Figure 11. Here, ModelarDB provides the best compression, even at a 0% error bound; however, the difference is smaller than for EH and ER. This is expected, as the EH and ER sampling intervals are 600 and 60 times lower, respectively, yielding more data points with similar values due to close time proximity. ModelarDB also manages to keep the average actual error for EP low at only 0.08% for a 1%, 0.48% for a 5%, and 0.73% for a 10% bound.
The models utilized for each data set are shown in Figures 12-14. Overall, PMC-MR and Facebook are the most utilized models, with Swing used sparingly except for EP with 5% and 10% error bounds. Note that Swing is also utilized for ER and EP with a 0% error bound, as perfectly linear increases and decreases of values do exist in the data sets. Last, except for EH, multiple models are used extensively. These results clearly show the benefit and adaptivity of multi-model compression, as each combination of data set and error bound can be handled efficiently using different combinations of the models.
The effect of outliers is shown in Figure 15. As expected, the storage used increases with the number of outliers, but the increase depends on the data set and error bound. For all data sets, ModelarDB degrades gracefully as additional outliers are added to the data set. As the values of N decrease below 250, the relative size increases more rapidly as the high number of outliers severely restricts the length of the segments ModelarDB can construct. The results also show that ModelarDB is more robust against outliers when a 0% error bound is used. With a 10% error bound the relative increase for EH and EP is slightly higher than for a 0% error bound, while the relative size increase for ER with a 10% error bound in the extreme case of N = 25 is 9.06, while it is only 1.12 with a 0% error bound. This is expected as ER has a high compression ratio with a 10% error bound and the high number of outliers prevents ModelarDB from constructing long segments. The results show that although designed for time series with few outliers, ModelarDB degrades gracefully as the number of outliers increases.
In summary, ModelarDB provides as good compression as the existing formats when a 0% error bound is used, and much better compression for even a small error bound, by combining different models depending on the data and error bound.

Figure 16: L-AGG, ER
Figure 17: Scale-out
Figure 18: Projection, ER
Figure 19: Predicate, ER


Figure 20: S-AGG, EH
Figure 21: S-AGG, ER
Figure 22: S-AGG, EP
Figure 23: P/R, EH

Scale-out To evaluate ModelarDB's scalability we first compare it to the existing systems when executing L-AGG on ER using the test cluster. Then, to determine the scalability of ModelarDB on large clusters, we execute L-AGG on Microsoft Azure using 1-32 Standard D8 v3 nodes; the node type is selected based on the documentation for Spark, Cassandra, and Azure [1, 9, 10]. The configuration from the test cluster is used, with the exception that Spark and Cassandra each have access to 50% of each node's memory, as no crashes were observed with this initial configuration. For each experiment REDD is duplicated, using the method described above, and ingested so each node stores compressed data equivalent to the node's memory. This makes caching the data set in memory impossible for Spark and Cassandra. The number of nodes and the data size are scaled in parallel based on existing evaluation methodology for distributed systems [19]. Queries are executed using the most appropriate method for each system: InfluxDB's command-line interface (CLI), ModelarDB's Segment View (SV) and Data Point View (DPV), and for Cassandra, Parquet, and ORC a Spark SQL Data Frame (DF). We evaluate the query performance using a DF and a cached Data Frame (DFC) as shown in Figure 25. However, as DFCs increased the run-time because the data was inefficiently spilled to disk, we only use DFs for the other queries.
The results are shown in Figures 16 and 17. For both views ModelarDB achieves close to linear scale-up. This is expected as queries can be answered with no shuffle operations since all segments of a time series are co-located. However, using SV the query processing time is significantly reduced, as SV does not reconstruct the data points, which reduces both CPU and memory usage. On one node SV is 2.27 times faster than DPV, and 2.16 times faster with six nodes. For L-AGG on ER, ModelarDB is faster than all the existing systems. Compared to InfluxDB, on one node, ModelarDB is 2.95 times and 1.30 times faster for SV and DPV, respectively. Using six nodes, ModelarDB is 1.52 times and 1.67 times faster than Parquet and ORC, respectively. In summary, ModelarDB scales almost linearly while providing better performance than the competitors.
Effect of Optimizations To evaluate the code generation and predicate push-down optimizations, we execute L-AGG and P/R on ER both with and without the optimizations. As a comparison to static code generation, we implement a straightforward dynamic code generator using scala.tools.reflect.ToolBox and Spark's mapPartitions transformation. By default ModelarDB uses static code-generation for projections and predicate push-down for Tid, Timestamp, and takeWhile. The results for projections in Figure 18 show that generating optimized code for projections decreases the run-time up to 1.60 times compared to constructing each row dynamically. However, using our implementation of dynamic code generation increases the run-time compared to our static code generation. The results for predicate push-down are seen in Figure 19. Predicate push-down has little effect on the query processing time for L-AGG, but the reduction is more pronounced for P/R, where we see a 7.03 times reduction. This is to be expected as all queries in L-AGG must read the entire data set from disk, while all queries in P/R can be answered using only a small subset.
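As context for the comparison above, the following is a minimal sketch of what a dynamic code generator built on scala.tools.reflect.ToolBox can look like. The Array[Any] row representation, the projected columns, and the example values are illustrative assumptions, not ModelarDB's actual implementation or its static code generator.

```scala
// A minimal sketch of run-time code generation with ToolBox (requires scala-compiler
// on the classpath); the row format and projected columns are assumptions.
import scala.tools.reflect.ToolBox

object DynamicProjectionSketch {
  def main(args: Array[String]): Unit = {
    val toolBox = scala.reflect.runtime.currentMirror.mkToolBox()

    // Generate, parse, and compile a projection keeping columns 0 and 2 of each row.
    val source = "(row: Array[Any]) => Array[Any](row(0), row(2))"
    val projection = toolBox.eval(toolBox.parse(source)).asInstanceOf[Array[Any] => Array[Any]]

    val row: Array[Any] = Array(42, 1516034400000L, 37.5)
    println(projection(row).mkString(", ")) // prints "42, 37.5"
    // Inside Spark, such a compiled function could be applied to the rows of each
    // partition with mapPartitions, as in the comparison described above.
  }
}
```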
Further Query Processing Performance To further evaluate the query performance of ModelarDB, we execute S-AGG and P/R on all data sets using the query interfaces described for scale-out. The results for S-AGG are shown in Figures 20-22. Once again the run-time is reduced using the SV. While InfluxDB performs slightly better than ModelarDB for S-AGG on ER and EP, it is limited in terms of scalability and ingestion rate, as shown above where ModelarDB executes L-AGG on ER 2.95 times faster than InfluxDB. Compared to the distributed systems, ModelarDB provides as good, or better, query processing times for nearly all cases. On EH, ModelarDB is 1.4 times faster than ORC and 152.62 times faster than Cassandra. For EH, Parquet is faster, but for all other data sets ModelarDB is faster and uses much less storage space. For ER, which is a core use case scenario, ModelarDB reduces the query processing time by a staggering 34.57, 286.03, and 45.99 times compared to Cassandra, Parquet, and ORC, respectively. Last, for EP, Cassandra and ModelarDB provide the lowest query processing times, with ModelarDB being 11.33 times faster. Thus, for our experiments ModelarDB improves the query processing time significantly compared to the other distributed systems, in core cases by large factors, while remaining competitive with an optimized single-node system for small-scale aggregate queries.
The results for P/R are shown in Figures 23-25. For P/R, InfluxDB and Cassandra perform the best, which contrasts with our scale-out experiment where they perform the worst. Clearly these systems are optimized for queries on a small subset of a data set, while Parquet, ORC, and ModelarDB are optimized for aggregates on large data sets. While P/R queries are not a core use case, ModelarDB provides equivalent performance for P/R queries in most cases and is only significantly slower for EH. When compared to ORC and Parquet, ModelarDB is in general faster and in the best case, ER, provides 1.63 times faster query processing time. In one instance, for EH, ORC is 33.59 times faster than ModelarDB due to better predicate push-down, as disabling predicate push-down for ORC increases the run-time from 47.64 seconds to 1 hour and 40 minutes.

Figure 24: P/R, ER
Figure 25: P/R, EP

However, unlike Parquet and ORC, time series ingested by ModelarDB can be queried online. An interesting result was that DFCs increased the query processing time, particularly for the first query. This indicates that the data set is read from and spilled to disk due to a lack of memory during the initial query, with subsequent queries executed using Spark's on-disk format.
For our experiments, the other systems require a trade-off as they are either good at large aggregate queries or at point/range and small aggregate queries with support for online analytics, but not both. ModelarDB hits a sweet spot, improving the state-of-the-art for online analytics by providing fast ingestion, better compression, and scalability for large aggregate queries while remaining competitive with other systems for small aggregate and point/range queries. This combination of features is not provided by any competitor.

8. RELATED WORK
Management of sensor data using models has received much attention as the amounts of data have increased. We provide a discussion of the most relevant related methods and systems. For a survey of model-based sensor data management see [41], while a survey of TSMSs can be found in [28].
Methods have been proposed for online construction of approximations with minimal error [25], or maximal segment length for better compression [34]. As the optimal model can change over time, methods using multiple mathematical models have been developed. In [37] each data point in the time series is approximated by all models in a set. A model is removed from the set if it cannot represent a data point within the error bound. When the set of models is empty, the model with the highest compression ratio is stored and the process restarted. A relational schema for segments was discussed in [38]. The Adaptive Approximation (AA) algorithm [40] uses functions as models with an extensible method for computing the coefficients. The AA algorithm approximates each data point in the time series until a model exceeds the error bound, upon which a local segment is created for that model and the model is reset. After all models have constructed one local segment, the segments using the lowest number of parameters are stored. An algorithm based on regression was proposed in [22]. It approximates data points with a single polynomial model and then increases the number of coefficients as the error bound becomes unsatisfiable. When the user-defined maximum number of coefficients is reached, the model with the highest compression ratio is stored and the time series rewound to the last data point of the stored segment. In this paper, we propose a multi-model compression algorithm for time series that improves on the state-of-the-art as it supports user-defined models, supports lossy and lossless models, and removes the trade-off between compression and latency inherent to the existing methods.
In addition to techniques for representing time series as models, RDBMSs with support for models have also been proposed. MauveDB [21] supports using models for data cleaning without exporting the data to another application. Models are explicitly constructed from a table with raw data and then maintained by the RDBMS. Model-based views created with a static sampling interval serve as the query interface for the models. FunctionDB [43] natively supports polynomial functions as models. The RDBMS's query processor evaluates queries directly on these models when possible, with the query results provided as discrete values. Model fitting is performed manually by fitting a specific model to a table. Maintenance of the constructed models is outside the scope of the paper. Plato [29] allows for user-defined models. Queries are executed on models if the necessary functions are implemented and on discrete values if not. The granularity at which to instantiate a model for a query can be specified with a grid operator or left to Plato. Fitting models to a data set is done manually as automated model selection is left as future work. Tristan [36], based on the MiSTRAL architecture [35], approximates time series as sequences of fixed-length time series patterns using dictionary compression. Before ingestion a dictionary must be trained offline on historical data. During ingestion a fixed number of data points are buffered before the compression is applied and the dictionary updated with new patterns if necessary. For approximate query processing a subset of the patterns stored for a time series is used. A distributed approach to model-based storage of time series using an in-memory tree-based index, a key-value store, and MapReduce [20] was proposed by [24]. The segmentation is performed using Piecewise Constant Approximation (PCA) [41]. Each segment is stored and indexed twice, once by time and once by value. Query processing is performed by locating segments using the index, retrieving segments from the store using mappers, and last, instantiating each model using reducers. ModelarDB hits a sweet spot and provides functionality in a single extensible TSMS not present in the existing systems: storage and query processing for time series within a user-defined error bound [21, 29, 43], support for both fixed and dynamically sized user-defined models that can be fitted online without requiring offline training of any kind [36], and automated selection of the most appropriate model for each part of a time series while also storing each segment only once [24].

9. CONCLUSION & FUTURE WORK
Motivated by the need for storage and analysis of big amounts of data from reliable sensors, we presented a general architecture for a modular model-based TSMS and a concrete system using it, ModelarDB. We proposed a model-agnostic extensible and adaptive multi-model compression algorithm that supports both lossless and lossy compression within a user-defined error bound. We also presented general methods and optimizations usable in a model-based TSMS: (i) a database schema for storing multiple distinct time series as (possibly user-defined) models, (ii) methods to push-down predicates to a key-value store that utilizes the presented schema, (iii) methods to execute optimized aggregate functions directly on models without requiring a dynamic optimizer, (iv) static code-generation to optimize execution of projections, and (v) dynamic extensibility that allows user-defined models to be added and used without recompiling the TSMS. The architecture was realized as a portable library which was interfaced with Apache Spark for query processing and Apache Cassandra for storage. Our evaluation showed that, unlike current systems, ModelarDB hits a sweet spot and achieves fast ingestion, good compression, almost linear scale-out, and fast aggregate query processing in the same system, while also supporting online queries. The evaluation further demonstrated how the contributions effectively work together to adapt to the data set using multiple models, that the actual errors are much lower than the bounds, and how ModelarDB degrades gracefully as outliers are added.
In future work we are planning to extend ModelarDB in multiple directions: (i) Increase query performance by developing new techniques for indexing segments represented by user-defined models, performing similarity search directly on user-defined models, and performing dynamic optimizations utilizing that the time series are represented as models. (ii) Reduce the storage needed for large volumes of sensor data by representing correlated streams of sensor data as a single stream of segments. (iii) Further simplify use of ModelarDB by removing or automatically inferring parameters.

10. ACKNOWLEDGEMENTS
This research was supported by the DiCyPS center funded by Innovation Fund Denmark [12], and Microsoft Azure for Research [16].

11. REFERENCES
[1] Apache Cassandra - Hardware Choices. http://cassandra.apache.org/doc/latest/operating/hardware.html. Viewed: 2018-07-15.
[2] Apache Cassandra. http://cassandra.apache.org/. Viewed: 2018-07-15.
[3] Apache Flink. https://flink.apache.org/. Viewed: 2018-07-15.
[4] Apache HBase. https://hbase.apache.org/. Viewed: 2018-07-15.
[5] MongoDB. https://www.mongodb.com/. Viewed: 2018-07-15.
[6] Apache ORC. https://orc.apache.org/. Viewed: 2018-07-15.
[7] Apache Parquet. https://parquet.apache.org/. Viewed: 2018-07-15.
[8] Apache Spark. https://spark.apache.org/. Viewed: 2018-07-15.
[9] Apache Spark - Hardware Provisioning. https://spark.apache.org/docs/2.1.0/hardware-provisioning.html. Viewed: 2018-07-15.
[10] Azure Databricks. https://azure.microsoft.com/en-us/pricing/details/databricks/. Viewed: 2018-07-15.
[11] DB-Engines Ranking. https://db-engines.com/en/ranking. Viewed: 2018-07-15.
[12] DiCyPS - Center for Data-Intensive Cyber-Physical Systems. http://www.dicyps.dk/dicyps-in-english/. Viewed: 2018-07-15.
[13] InfluxData InfluxDB. https://www.influxdata.com/. Viewed: 2018-07-15.
[14] InfluxDB API Client Libraries. https://docs.influxdata.com/influxdb/v1.4/tools/api_client_libraries/. Viewed: 2018-07-15.
[15] Java Virtual Machine Specification. https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-3.html#jvms-3.10. Viewed: 2018-07-15.
[16] Microsoft Azure for Research. https://www.microsoft.com/en-us/research/academic-program/microsoft-azure-for-research/. Viewed: 2018-07-15.
[17] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia. Spark SQL: Relational Data Processing in Spark. In Proceedings of SIGMOD, pages 1383–1394. ACM, 2015.
[18] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas. Apache Flink: Stream and Batch Processing in a Single Engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 38(4):28–38, 2015.
[19] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking Cloud Serving Systems with YCSB. In Proceedings of SOCC, pages 143–154. ACM, 2010.
[20] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of OSDI, pages 137–150. USENIX, 2004.
[21] A. Deshpande and S. Madden. MauveDB: Supporting Model-based User Views in Database Systems. In Proceedings of SIGMOD, pages 73–84. ACM, 2006.
[22] F. Eichinger, P. Efros, S. Karnouskos, and K. Böhm. A time-series compression technique and its application to the smart grid. The VLDB Journal, 24(2):193–218, 2015.
[23] H. Elmeleegy, A. K. Elmagarmid, E. Cecchet, W. G. Aref, and W. Zwaenepoel. Online piece-wise linear approximation of numerical streams with precision guarantees. PVLDB, 2(1):145–156, 2009.
[24] T. Guo, T. G. Papaioannou, and K. Aberer. Efficient Indexing and Query Processing of Model-View Sensor Data in the Cloud. Big Data Research, 1:52–65, 2014.
[25] T. Guo, Z. Yan, and K. Aberer. An Adaptive Approach for Online Segmentation of Multi-Dimensional Mobile Data. In Proceedings of MobiDE, pages 7–14. ACM, 2012.
[26] Y. Huai, A. Chauhan, A. Gates, G. Hagleitner, E. N. Hanson, O. O'Malley, J. Pandey, Y. Yuan, R. Lee, and X. Zhang. Major Technical Advancements in Apache Hive. In Proceedings of SIGMOD, pages 1235–1246. ACM, 2014.
[27] N. Q. V. Hung, H. Jeung, and K. Aberer. An Evaluation of Model-Based Approaches to Sensor Data Compression. IEEE TKDE, 25(11):2434–2447, 2013.
[28] S. K. Jensen, T. B. Pedersen, and C. Thomsen. Time Series Management Systems: A Survey. IEEE TKDE, 29(11):2581–2600, 2017.
[29] Y. Katsis, Y. Freund, and Y. Papakonstantinou. Combining Databases and Signal Processing in Plato. In Proceedings of CIDR, 2015.
[30] J. Z. Kolter and M. J. Johnson. REDD: A Public Data Set for Energy Disaggregation Research. In Proceedings of SustKDD, 2011.
[31] R. E. Korf. Multi-Way Number Partitioning. In Proceedings of IJCAI, pages 538–543, 2009.
[32] A. Lakshman and P. Malik. Cassandra: A Decentralized Structured Storage System. ACM SIGOPS Operating Systems Review, 44(2):35–40, 2010.
[33] I. Lazaridis and S. Mehrotra. Capturing Sensor-Generated Time Series with Quality Guarantees. In Proceedings of ICDE, pages 429–440. IEEE, 2003.
[34] G. Luo, K. Yi, S.-W. Cheng, Z. Li, W. Fan, C. He, and Y. Mu. Piecewise Linear Approximation of Streaming Time Series Data with Max-error Guarantees. In Proceedings of ICDE, pages 173–184. IEEE, 2015.
[35] A. Marascu, P. Pompey, E. Bouillet, O. Verscheure, M. Wurst, M. Grund, and P. Cudre-Mauroux. MiSTRAL: An Architecture for Low-Latency Analytics on Massive Time Series. In Proceedings of Big Data, pages 15–21. IEEE, 2013.
[36] A. Marascu, P. Pompey, E. Bouillet, M. Wurst, O. Verscheure, M. Grund, and P. Cudre-Mauroux. TRISTAN: Real-Time Analytics on Massive Time Series Using Sparse Dictionary Compression. In Proceedings of Big Data, pages 291–300. IEEE, 2014.
[37] T. G. Papaioannou, M. Riahi, and K. Aberer. Towards Online Multi-Model Approximation of Time Series. In Proceedings of MDM, volume 1, pages 33–38. IEEE, 2011.
[38] T. G. Papaioannou, M. Riahi, and K. Aberer. Towards Online Multi-Model Approximation of Time Series. Technical report, EPFL LSIR, 2011.
[39] T. Pelkonen, S. Franklin, J. Teller, P. Cavallaro, Q. Huang, J. Meza, and K. Veeraraghavan. Gorilla: A Fast, Scalable, In-Memory Time Series Database. PVLDB, 8(12):1816–1827, 2015.
[40] J. Qi, R. Zhang, K. Ramamohanarao, H. Wang, Z. Wen, and D. Wu. Indexable online time series segmentation with error bound guarantee. World Wide Web, 18(2):359–401, 2015.
[41] S. Sathe, T. G. Papaioannou, H. Jeung, and K. Aberer. A Survey of Model-based Sensor Data Acquisition and Management. In Managing and Mining Sensor Data, pages 9–50. Springer, 2013.
[42] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. In Proceedings of MSST, pages 1–10. IEEE, 2010.
[43] A. Thiagarajan and S. Madden. Querying Continuous Functions in a Database System. In Proceedings of SIGMOD, pages 791–804. ACM, 2008.
[44] D. Xie, F. Li, B. Yao, G. Li, L. Zhou, and M. Guo. Simba: Efficient In-Memory Spatial Analytics. In Proceedings of SIGMOD, pages 1071–1085. ACM, 2016.
[45] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster Computing with Working Sets. In Proceedings of HotCloud. USENIX, 2010.
[46] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized Streams: Fault-Tolerant Streaming Computation at Scale. In Proceedings of SIGOPS, pages 423–438. ACM, 2013.
[47] M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica. Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters. In Proceedings of HotCloud. USENIX, 2012.