CONFIDENTIAL UP TO AND INCLUDING 03/01/2017 - DO NOT COPY, DISTRIBUTE OR MAKE PUBLIC IN ANY WAY

Towards a representative benchmark for time series databases

Thomas Toye Student number: 01610806

Supervisors: Prof. dr. Bruno Volckaert, Prof. dr. ir. Filip De Turck Counsellors: Dr. ir. Joachim Nielandt, Jasper Vaneessen

Master's dissertation submitted in order to obtain the academic degree of Master of Science in de industriële wetenschappen: elektronica-ICT

Academic year 2018-2019


Preface

I would like to thank my supervisors, Prof. dr. Bruno Volckaert and Prof. dr. ir. Filip De Turck.

I am very grateful for the help and guidance of my counsellors, Dr. ir. Joachim Nielandt and Jasper Vaneessen.

I would also like to thank my parents for their support, not only during the writing of this dissertation, but also during my transitionary programme and my master’s.

The author gives permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In all cases of other use, the copyright terms have to be respected, in particular with regard to the obligation to state explicitly the source when quoting results from this master dissertation.

Thomas Toye, June 2019

Towards a representative benchmark for time series databases

Thomas Toye

Master's dissertation submitted in order to obtain the academic degree of Master of Science in de industriële wetenschappen: elektronica-ICT

Academic year 2018–2019

Supervisors: Prof. dr. Bruno Volckaert, Prof. dr. ir. Filip De Turck Counsellors: Dr. ir. Joachim Nielandt, Jasper Vaneessen

Summary

As the fastest growing database type, time series databases (TSDBs) have experienced a rise in database vendors, and with it, a rise in difficulty in selecting the best one. TSDB benchmarks compare the performance of different databases to each other, but the workloads they use are not representative: they use random data, or synthesized data that is only applicable to one domain. This dissertation argues that these non-representative benchmarks may not always accurately model real world performance, and instead, representative workloads should be used in TSDB benchmarks. In this context, workloads are defined as consisting of data sets and queries. Workload data sets can be categorized using eight parameters (number of metrics, regularity, volume, data type, number of tags, tag value data type, tag value cardinality, variation). A new benchmark was created, which uses three representative workloads next to a baseline non-representative workload. Results of this benchmark show significant performance differences for data ingestion speed for complex data, latency and maximum request rate (when broad time ranges are used), and storage efficiency of data points when comparing representative and non-representative workloads. The results show that existing benchmarks may not be accurate for real world performance.

Keywords

Time series database, representative benchmarking, load testing

Towards a representative benchmark for time series databases

Thomas Toye

Supervisor(s): Bruno Volckaert, Filip De Turck

Abstract: As the fastest growing database type, time series databases (TSDBs) have experienced a rise in database vendors, and with it, a rise in difficulty in selecting the best one. TSDB benchmarks compare the performance of different databases to each other, but the workloads they use are not representative: they use random data, or synthesized data that is only applicable to one domain. We argue that these non-representative benchmarks may not always accurately model real world performance, and instead, representative workloads should be used in TSDB benchmarks. In this context, workloads are defined as consisting of data sets and queries. Workload data sets can be categorized using eight parameters (number of metrics, regularity, volume, data type, number of tags, tag value data type, tag value cardinality, variation). A new benchmark was created, which uses three representative workloads next to a baseline non-representative workload. Results of this benchmark show significant performance differences for data ingestion speed for complex data, latency and maximum request rate (when broad time ranges are used), and storage efficiency of data points when comparing representative and non-representative workloads. The results show that existing benchmarks may not be accurate for real world performance.

Keywords: Time series database, representative benchmarking, load testing

I. INTRODUCTION

Time series databases provide storage and interfacing for time series. In its simplest form, time series data are just data with an attached timestamp. This subtype of data has seen increasing interest in the last decade, especially with the rise of the Internet of Things, which produces time series for everything from temperature to sea levels. Other areas where time series are used are the financial industry (e.g. historical analysis of stock performance), the DevOps industry (e.g. capture of metrics from a server fleet) and the analytics industry (e.g. tracking ad performance over time).

Finding the best database to use is not an easy task. Eighty-three existing TSDBs were found by Bader et al. [1]. To determine the best one, benchmarks are used. However, these benchmarks may not be representative of the use case or industry the TSDB is needed for, which makes their results difficult to generalize.

In this abstract, we will first analyze existing TSDB benchmarks. Then, a new benchmark is proposed, which compares representative workloads to non-representative workloads. The results of this benchmark are analysed to determine whether non-representative benchmarks accurately predict performance for representative workloads.

II. EVALUATION OF EXISTING BENCHMARKS

Chen et al. [2] consolidate the properties of a good benchmark as follows: 1. Representative: benchmarks must simulate real world conditions; both the input to a system and the system itself should be representative of real world usage. 2. Relevant: benchmarks must measure relevant metrics and technologies, and results should be useful to compare widely-used solutions. 3. Portable: benchmarks should provide a fair comparison by being easily extensible to competing solutions that solve comparable problems. 4. Scalable: benchmarks must be able to measure performance at a wide range of scales, not just single-node performance but also cluster configurations. 5. Verifiable: benchmarks should be repeatable and independently verifiable. 6. Simple: benchmarks must be easily understandable, while making choices that do not affect performance.

Existing TSDB benchmarks were evaluated; a summary is shown in Table I. Two gaps in the state of the art are clear: current benchmarks insufficiently test TSDB performance at scale, and current benchmarks are not representative or only representative for a single use case. The data used is either random or synthetic; real world data are not used. This begs the question: are results of a non-representative benchmark generalizable to real world performance?

TABLE I
EVALUATION OF EXISTING TSDB BENCHMARKS
(TS-Benchmark, IoTDB-benchmark, TSDBBench, FinTime and influxdb-comparisons, evaluated on the properties representative, relevant, portable, scalable, verifiable and simple. TS-Benchmark is representative for IoT use cases only, FinTime for financial use cases only, and influxdb-comparisons for DevOps use cases only.)

III. BENCHMARK COMPONENTS

A new benchmark is developed to compare benchmark performance between representative and non-representative workloads. Workloads consist of a workload data set that is loaded into the TSDB and a workload query set that executes upon it.

A. Data set

Time series data sets have the following properties in common: data arrives in order, updates are very rare to non-existent, deletion is rare, and data values follow a pattern.

They differ on the following characteristics:
• Metrics: data points are organized in metrics, which can be compared to tables in relational databases.
• Regularity: in regular time series, data points are spaced evenly in time. Irregular time series do not emit data points regularly; they are often the result of event triggers.
• Volume: high volume time series may emit hundreds of thousands of data points a second, while low volume time series only emit one event a day.
• Data type: traditionally, values of data points in a time series have been integers or floating point numbers, but they can also be booleans, strings or even custom data types.
• Tags: a time series data point may have one or more tags associated with the timestamp and value. There may be no tags or a lot of tags. Tags may hold special values, such as geospatial information.
• Tag value cardinality: the number of possible combinations the tag values make. Three tags with two possible values each make a tag value cardinality of eight.
• Variation: while time series data usually follow a pattern, the variation in a series may be very different. One series may describe a flat line, while another may describe seasonal variations with daily spikes.

B. Query set

Bader et al. describe ten distinct TSDB query capabilities in [1]. These building blocks (e.g. update, delete, select from a time range) can form time series queries (e.g. select the mean of temperature values from last year, aggregated by day). Next to the queries themselves, the relative frequency of each query type is an important part of the query set.

C. Measurement characteristics

Measurement characteristics describe the performance metrics that are monitored to quantify performance. For TSDB benchmarks, common metrics include response latency (mean, 95th and 99th percentile, etc.), response size, data ingestion speed, and storage efficiency.

IV. A REPRESENTATIVE BENCHMARK

A benchmark was created with representativeness as its design goal. It compares three representative workloads to a non-representative workload to investigate possible performance differences. Three real world data sets, from domains in which TSDBs are prevalent, are used, next to a baseline. The baseline is a non-representative data set, with random values and tags. For every data set, twenty queries are written, relevant to the data set's domain (e.g. getting the average rating for a movie in the ratings data set), except for the baseline, for which a single query is used. Vegeta [3] was used to capture response latency (mean and 95th percentile), response times, and response size. The http_load program [4] was used for load testing. Standard UNIX tools were used for storage efficiency analysis. Four TSDBs are tested: InfluxDB, OpenTSDB, KairosDB with Cassandra as a backing database, and KairosDB with ScyllaDB. These are modern, open source databases with an HTTP interface.

Table II shows an overview of the data sets used. The baseline is a data set with random values and tags, the financial data set uses historical stock market information, the rating data set uses movie reviews, and the IoT data set is produced by power information for a house.

TABLE II
OVERVIEW OF WORKLOAD DATA SETS

                        Baseline   Financial   Rating      IoT
Metrics                 1          6           1           7
Regularity              Regular    Semi-reg.   Irregular   Regular
Volume                  Low        Low         Low         Low
Tags                    2          1           5           0
Tag value cardinality   10,000     7,164       20M         0
Variation               High       Low         High        Low
Total data points       20M        74.4M       20M         14.5M
License                 NA         CC0         Custom      CC-BY-4

V. EVALUATION

A. Storage efficiency

Figure 1 shows relative storage efficiency. The size in bytes per data point was compared to the size per data point in the comma separated value (CSV) source. The input size was one million data points for every data set. It shows that representative data sets have different storage efficiency than the reference. OpenTSDB is better at storing real world data sets than synthesized data, InfluxDB much worse. Tag value cardinality and data point value variation are thought to have a high impact on storage efficiency.

[Fig. 1. Relative storage efficiency of different TSDBs per data point compared to the CSV source format.]

B. Data ingestion throughput

For every data set, one million data points were loaded into each TSDB and ingestion speed was measured (in data points per second). The results are shown in Figure 2. For the representative ratings workload, performance is degraded, especially for InfluxDB. This is a data set with high tag cardinality and complex tag values.

[Fig. 2. Data points ingested per second. Data sets used were one million data points each.]

C. Load testing

Figure 3 shows the results of the load test. The results for OpenTSDB are surprising: it performed well for the baseline and IoT query workloads, but not for the financial and ratings query workloads. For the latter two workloads, the time ranges are very broad, so the database has to scan more data. The other TSDBs may be able to optimize this operation better.

[Fig. 3. Maximum requests per second. Tests were performed on data sets one million data points in size.]

D. Response latency

Figure 4 shows the mean response latency when using a representative query set. A performance degradation for OpenTSDB surfaces for the financial and ratings query workloads, which use broad time ranges. This is attributed to the same cause as in Section V-C. Otherwise, the baseline is a good predictor for relative performance in the representative benchmarks.

[Fig. 4. Mean latency per request.]

E. Response size

Figure 5 shows the mean response size of TSDBs in bytes. The mean response size is correlated with the data set. The size differences for large responses (e.g. the financial workload) can be attributed mainly to timestamp encoding.

[Fig. 5. Mean size in bytes of the TSDB response.]

VI. CONCLUSIONS

Compared to a baseline non-representative workload, representative workloads showed significant performance differences when it came to storage efficiency, data ingestion speed for complex data, latency and maximum request rate (when broad time ranges are used). Existing TSDB benchmarks do not use representative workloads, so their relevance may be called into question.

The fact that not all representative workloads show a performance impact highlights the importance of using multiple representative workloads for general TSDB benchmarks: just one representative workload may not be enough to highlight possible deviations or performance degradations.

It is impractical to create a representative workload for every domain, but TSDB workloads can be characterized by workload parameters. Further research is needed to determine if these parameters are enough to accurately describe a TSDB workload and thus generalize results of one workload to another with the same workload parameters.

REFERENCES
[1] Andreas Bader, Oliver Kopp, Michael Falkenthal, "Survey and Comparison of Open Source Time Series Databases", Datenbanksysteme für Business, Technologie und Web (BTW 2017) – Workshopband.
[2] Yanpei Chen, Francois Raab, Randy Katz, "From TPC-C to Big Data Benchmarks: A Functional Workload Model", Specifying Big Data Benchmarks, WBDB 2012. Lecture Notes in Computer Science, vol 8163. Springer, Berlin, Heidelberg.
[3] Tomás Senart, Vegeta – HTTP load testing tool and library, https://github.com/tsenart/vegeta
[4] Jef Poskanzer, http_load, https://acme.com/software/http_load/

Contents

Preface

Abstract

Extended abstract

Table of Contents

1 Introduction

2 Literature review
  2.1 Databases
    2.1.1 Database Management Systems
    2.1.2 Relational databases
    2.1.3 Non-relational databases
    2.1.4 NewSQL databases
    2.1.5 Time series databases
  2.2 Time series database benchmarks
    2.2.1 TS-Benchmark
    2.2.2 IoTDB-benchmark
    2.2.3 TSDBBench
    2.2.4 FinTime
    2.2.5 influxdb-comparisons
    2.2.6 STAC-M3
  2.3 Data sets

3 State of the art
  3.1 Uses of time series databases
    3.1.1 TSDB usage as a data store
    3.1.2 Inherent time series database functions used
    3.1.3 Common characteristics of time series data
    3.1.4 Differing characteristics of time series data
    3.1.5 Industry use cases
  3.2 A "good" benchmark
  3.3 Existing benchmarks
    3.3.1 TS-Benchmark
    3.3.2 IoTDB-benchmark
    3.3.3 TSDBBench/YCSB-TS
    3.3.4 FinTime
    3.3.5 influxdb-comparisons
  3.4 Evaluation of existing benchmarks
    3.4.1 On scalability
    3.4.2 On representativeness
  3.5 Contribution

4 A new benchmark
  4.1 Benchmark components
    4.1.1 Workload data set characteristics
    4.1.2 Workload query characteristics
    4.1.3 Measurement characteristics
  4.2 Design of a representative data workload
    4.2.1 A baseline workload
    4.2.2 A financial time series workload
    4.2.3 A rating system workload
    4.2.4 An IoT workload
    4.2.5 Workload data set overview
    4.2.6 Data set pre-processing
  4.3 Design of a representative query workload
    4.3.1 Queries for the baseline workload
    4.3.2 Queries for the financial workload
    4.3.3 Queries for the rating workload
    4.3.4 Queries for the IoT workload
  4.4 Metrics
  4.5 Technical implementation
    4.5.1 Test environment
    4.5.2 Data ingestion
    4.5.3 Load and latency testing
  4.6 Design evaluation

5 Results
  5.1 Storage efficiency
  5.2 Data ingestion throughput
  5.3 Load testing with query workload
  5.4 Response latency
  5.5 Mean response size
  5.6 Evaluation

6 Conclusions and future work
  6.1 Conclusions
  6.2 Future work

A Detailed results
  A.1 Data ingestion throughput
  A.2 Storage efficiency
  A.3 Load testing
  A.4 Response latency
  A.5 Mean response size

Bibliography

List of Abbreviations

List of Figures

List of Tables

Chapter 1

Introduction

Time series databases provide storage and interfacing for time series. In its simplest form, time series data are just data with an attached timestamp. This subtype of data has seen increasing interest in the last decade, especially with the rise of the Internet of Things, which produces time series for everything from temperature to sea levels. Other areas where time series are used are the financial industry (e.g. historical analysis of stock performance), the DevOps industry (e.g. capture of metrics from a server fleet) and the analytics industry (e.g. tracking ad performance over time).

Time Series Databases (TSDBs) are the fastest growing type of databases. When selecting a TSDB, performance is one of the main considerations. Comparing database performance is done using benchmarks, and for TSDBs, a number of benchmarks already exist. However, these all use either random data or synthetic data. Moreover, TSDBs have a wide range of applications, and representative synthesized data is only valid for one domain. Thus, the data used for benchmarks is either non-representative, or only representative for one use case or industry. Can the results of performance tests with random or generated data be generalized to the real world?

In this dissertation, existing TSDB benchmarks are first analyzed. Then, the properties of time series data sets are examined. Finally, a new benchmark is proposed, which compares representative workloads to non-representative workloads.

Chapter 2

Literature review

2.1 Databases

A database is a set of data, organized in a form that makes it easy to process.

2.1.1 Database Management Systems

A Database Management System (DBMS) is an application for management of databases. Apart from the creation and deletion of databases, a DBMS allows create, read, update and delete (CRUD) operations on these databases.

A database is the data itself and how it is organized. The term “database” is often used instead of “DBMS”. In this dissertation, the two are used interchangeably.

2.1.2 Relational databases

Edgar Codd introduced the relational model in 1970 [1]. Relational databases use this model to store data: records are represented as rows, their attributes are organized in columns, and related records are grouped into tables. A relational DBMS (RDBMS) will most often use Structured Query Language (SQL) for data retrieval and manipulation.

2.1.3 Non-relational databases

As applications began to scale, companies started moving away from traditional RDBMSs for the following reasons [2]:

• In traditional DBMSs, the focus on correctness leads to degraded performance.

• The relational model was thought not to be the best way to store data.

• The DBMSs were often used as simple data stores. A full-blown DBMS was overkill for such use cases.

These factors caused a move to so-called "NoSQL" databases. The term originally referred to databases that do away with the relational structure of RDBMSs, but it has taken on the meaning of "Not only SQL" [3]. Cattell [4] identifies six key features of NoSQL DBMSs:

1. Horizontal scalability

2. Replication and partition of data over many machines

3. Simple interface (relative to SQL)

4. Weaker concurrency model (compared to ACID nature of relational DBMSs)

5. Distributed indexes used for data storage

6. Able to add new attributes to existing data

NoSQL databases generally relax the strict correctness guarantees found in relational databases. For example, transactions may not be available in NoSQL DBMSs, or writes may take a while to propagate and show up in reads.

2.1.4 NewSQL databases

NewSQL databases try to bridge RDBMS and NoSQL DBMS differences by bringing relational semantics to NoSQL DBMSs [3]. The aim is to have the best of both worlds: the relational model of RDBMSs and the scalability and fault tolerance of NoSQL DBMSs.

2.1.5 Time series databases

Time series databases (TSDBs) are databases optimised for storing time series. Time series are represented in these databases as data points with a value, a timestamp, and metadata, such as a metric name, tags, and geospatial information.

Time series databases can be relational (e.g. Timescale, a NewSQL DBMS) or non-relational (e.g. InfluxDB, a NoSQL DBMS) databases.

Bader et al. [5] identified 75 TSDBs, of which 42 are open source and 33 are proprietary.

2.2 Time series database benchmarks

There are a number of existing benchmarks tailored to TSDBs. This is a recent development: most of these benchmarks were developed less than three years ago.

2.2.1 TS-Benchmark

TS-Benchmark is a benchmark specifically developed for TSDBs by Chen at the Renmin University of China in December 2018. The benchmark models a wind farm scenario: sensor data are appended and queried [6].

Databases tested in this benchmark are InfluxDB, IoTDB, TimescaleDB, Druid, and OpenTSDB. The benchmark is written in Java and uses no external dependencies or frameworks.

Apart from a presentation, not much information is available on TS-Benchmark. Metrics measured by TS-Benchmark:

• Load performance: The ingestion speed of the TSDB, measured in points loaded per second

• Throughput performance: new data points appended to an existing time series (measured in points appended per second)

• Query performance: For both simple aggregation queries and time range queries, read queries are performed and two measurements are made: requests per second and average response time.

• Stress test: Two stress tests are performed. In the first, data points are appended while a constant number of queries are run (performance measured in points appended per second). In the second, queries are run while a constant number of data points are appended (performance measured in requests per second and average response time).

Load performance is different from throughput performance. The former measures the importing of a big data set into the database, while the latter measures appending points in real-time. It is unclear if the benchmark uses special facilities to test load performance (e.g. bulk or batch functionality from the TSDB) or if importing is needed to test read queries.

2.2.2 IoTDB-benchmark

In a preprint paper on arXiv, Liu and Yuan describe IoTDB-benchmark [7]. The features that set this benchmark apart from basic benchmarks are generation of out-of-order data, measurement of system resources in addition to database performance metrics, and simulation of real-world conditions by running heterogeneous queries concurrently. IoTDB-benchmark is written in Java.

IoTDB-benchmark has 10 types of queries, ranging from "latest data point" to "time range query with value filter". InfluxDB, OpenTSDB, KairosDB, and TimescaleDB are targeted by IoTDB-benchmark. The benchmark also supports Cloud Time Series Database (CTSDB), a TSDB created by Tencent Cloud1, but this is not mentioned in the paper.

Metrics measured by IoTDB-benchmark:

• Query latency: Statistical metrics, such as average, maximum, 95th percentile, etc., are calculated on the time the ten supported query types take.

• Throughput performance: Data points appended to an existing time series, measured in points appended per second.

• Space consumption: The used disk space is measured.

• System resources: System resources such as CPU time, network, memory and I/O usage are measured.

2.2.3 TSDBBench

TSDBBench was created by Bader as part of his dissertation in 2016. It extends the Yahoo! Cloud Serving Benchmark (YCSB) for use with time series databases in a project called YCSB-TS. TSDBBench includes YCSB-TS, the benchmark itself, and Overlord, a provisioning system written in Python that sets up databases to test [5].

In practice, the benchmark seems unmaintained. The documentation is out of date, necessary files are hosted on a defunct domain, and the database versions tested are several years old.

Ten types of queries are supported, such as "insert", "update", "scan" and "sum". TSDBBench supports eighteen databases, which is the most of any TSDB benchmark.

Metrics measured by TSDBBench:

1 Not much documentation on CTSDB is available, and all of it is in Chinese.

• Query latency: Statistical metrics, such as average, maximum, 95th percentile, etc., are calculated on the time the ten supported query types take.

• Space consumption: The used disk space is measured.

2.2.4 FinTime

FinTime was developed in 1999. It is not written in a specific language: FinTime is merely a description of a benchmark. The benchmark describes two models, each with a data model, queries, and operational characteristics [8]. They contain nine queries run by five clients at once, and six queries run by fifty clients at once, respectively.

Metrics measured by FinTime:

• Query latency (defined as “Response Time Metric”): The geometric mean of query latencies.

• Throughput Metric: Average time that a complete set of queries takes. Every set (nine queries for the first model, six for the second) represents a user.

• Cost metric: Defined as (R × T) / TC, where R is the response time metric, T is the throughput metric, and TC is the total cost of the system in USD. This metric provides insight into the cost-effectiveness of a system.
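As an illustration of how these three FinTime metrics combine, the sketch below computes them in Python from the definitions above; the function names and the example numbers are hypothetical and not part of FinTime itself.

```python
import math

def response_time_metric(latencies):
    """Response Time Metric: geometric mean of individual query latencies (seconds)."""
    return math.exp(sum(math.log(l) for l in latencies) / len(latencies))

def throughput_metric(set_durations):
    """Throughput Metric: average time a complete set of queries takes (seconds)."""
    return sum(set_durations) / len(set_durations)

def cost_metric(r, t, total_cost_usd):
    """Cost metric: (R x T) / TC; a lower value indicates a more cost-effective system."""
    return (r * t) / total_cost_usd

# Purely illustrative numbers.
r = response_time_metric([0.8, 1.2, 2.5, 0.9])
t = throughput_metric([12.0, 14.5, 13.2])
print(round(cost_metric(r, t, total_cost_usd=25_000), 6))
```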

2.2.5 influxdb-comparisons

The influxdb-comparisons project was created by InfluxData, the company that develops InfluxDB. It compares the InfluxDB TSDB to other databases. The project is written in Go and was started in 2016.

At this moment, the benchmark supports InfluxDB, Elasticsearch, Cassandra, MongoDB and OpenTSDB.

Metrics measured by influxdb-comparisons:

• Space consumption: After batch loading data, disk usage is measured.

• Load performance: Measured in time taken to load the data and average ingestion rate.

• Query latency: Measured as queries per second.

2.2.6 STAC-M3

STAC-M3 is a closed-source benchmark that measures performance of TSDB stacks, focused on high-speed applications. The publications, specification, and application itself are only accessible to Securities Technology Analysis Center (STAC) members.

At the moment, only results for the kdb+ database have been published publicly. The following metrics are measured:

• Storage efficiency: The size of the original data set divided by the size of the database.

• Mean and maximum response times for a variety of scenarios. For most scenarios, minimum and median response times are also reported, as well as the standard deviation.

2.3 Data sets

To study and create benchmarks for TSDBs, it is important to understand the fields where time series are recorded and analyzed. Six existing repositories of time series data sets were discovered.

Dau et al. maintain a repository of 128 time series data sets for data mining and machine learning purposes [9]. The data sets range from electricity usage to accelerometer data of performed gestures. Every data set is cleaned and documented.

The Center for Machine Learning and Intelligent Systems at the University of California maintains a database of data sets for use with machine learning [10]. Ninety-two time series data sets are currently in their repository, with domains ranging from stress detection and retail to electricity consumption and parking occupancy rates.

Hyndman created the Time Series Data Library (TSDL), which contains about eight hundred time series data sets [11]. TSDL spans many domains, from hydrology and finance to crime and physics.

A "data catalog start-up" called data.world currently has thirty-four time series data sets in its repository [12]. The data sets are mostly governmental statistics, such as crime data and pollution indexes.

On Kaggle, 238 data sets show up when searching for time series databases. These data sets are contributed by different authors.

Leskovec and Krevl maintain the Stanford Network Analysis Project (SNAP) data sets [13]. These data sets are often graphs, but the online reviews and online communities data sets contain time series data.

Chapter 3

State of the art

In this chapter, the various uses of time series databases are examined. Then, existing benchmarks are evaluated, and gaps in the state of the art are identified.

3.1 Uses of time series databases

3.1.1 TSDB usage as a data store

Some use cases do not exploit the full potential of time series databases; they merely use a time series database as a data store for time-coupled data. While the data could be stored in another data store, using a time series database offers clear advantages:

• Compression: Since time series data arrives mostly in order, high compression ratios can be achieved efficiently with delta coding, or with more advanced compression algorithms such as SPRINTZ [14] (see the sketch after this list).

• Scalability: Most modern time series databases come with scalability built-in, removing the need to worry about data migration when applications become bigger or more data-intensive.

• Usage of inherent time functions when needed: Even if an application makes no use of time series functions, it could do so at a later time, without the need for data migration. This also holds true for arbitrary queries: when engineers want to run time-based arbitrary queries, they can do so without data transformation or migration.
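As a minimal sketch of the delta-coding idea mentioned in the compression item above (assuming integer timestamps; the helper names are illustrative and not taken from any particular TSDB):

```python
def delta_encode(timestamps):
    """Keep the first timestamp and store only the differences between consecutive ones."""
    if not timestamps:
        return []
    return [timestamps[0]] + [cur - prev for prev, cur in zip(timestamps, timestamps[1:])]

def delta_decode(deltas):
    """Reconstruct the original timestamps by summing the deltas back up."""
    timestamps = []
    total = 0
    for d in deltas:
        total += d
        timestamps.append(total)
    return timestamps

# Regular, in-order timestamps (one per second) encode to a highly repetitive
# sequence, which storage engines and generic compressors can exploit.
ts = [1546300800, 1546300801, 1546300802, 1546300803]
assert delta_decode(delta_encode(ts)) == ts
print(delta_encode(ts))  # [1546300800, 1, 1, 1]
```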

Anomaly detection, forecasting and prediction are examples that usually use the time series database as a data store: a separate application provides the processing.

3.1.2 Inherent time series database functions used

Most TSDBs are not simple data stores, but provide specialised functions to handle time series analysis and aggregation. Bader et al. [5] describe the following time series database capabilities:

• INS: Insertion of a single data point

• UPDATE: Update of one or more data points with a certain timestamp

• READ: Retrieval of one or more data points with a certain timestamp

• SCAN: Retrieval of rows in a timestamp range

• AVG: Calculates the average value in a time range

• SUM: Calculates the sum of values in a time range

• CNT: Counts the number of data points with a certain timestamp

• DEL: Deletes data points with a certain timestamp

• MAX: Calculates the maximum value in a time range

• MIN: Calculates the minimum value in a time range

Functions that calculate a value, such as SUM, can be aggregated in time periods. Time series databases provide first-class support for queries like "average of temperature grouped in blocks of 7 minutes" and "highest CPU usage for every hour".
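As an example of such a first-class aggregation, the sketch below sends a "mean temperature in 7-minute buckets" InfluxQL query to InfluxDB's 1.x HTTP query endpoint; it assumes the requests package, a locally running InfluxDB instance, and example database and measurement names.

```python
import requests

# Average temperature over the last day, aggregated in 7-minute buckets.
# The database "benchmark" and measurement "temperature" are example names.
query = (
    'SELECT MEAN("value") FROM "temperature" '
    "WHERE time > now() - 1d GROUP BY time(7m)"
)

response = requests.get(
    "http://localhost:8086/query",            # default InfluxDB 1.x HTTP port
    params={"db": "benchmark", "q": query},
    timeout=10,
)
response.raise_for_status()

for result in response.json().get("results", []):
    for series in result.get("series", []):
        # Each entry in "values" is a (timestamp, mean) pair.
        print(series["name"], series["values"][:3])
```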

Visualisation is an example that relies heavily on these features. To provide users with flexible visualisation options, the database needs to support, or at least facilitate, the above functions.

3.1.3 Common characteristics of time series data

While time series are used in different industries for a variety of use cases, in general, time series data have the following characteristics:

• In-order data arrival: Data will, with rare exceptions, arrive with ascending time stamps.

• Updates are rare to non-existent: Changes to existing data points are rare and not part of normal operations.

• Deletion is rare: It is uncommon for individual data points to be deleted, but it may be common to retire a large amount of data points at a time, for example, when data points are being retired as part of a retention policy.

• TSDB-specific functions may be heavily used, depending on the ap- plication.

• Data values follow a pattern: There might be trends, as well as seasonal and non-seasonal cycles. It is rare for time series data to be completely random.

3.1.4 Differing characteristics of time series data

While time series data have general characteristics, series may diverge on the following properties:

• Regularity: In regular time series, data points are spaced evenly in time. Irregular time series do not emit data points regularly. Irregular time series are often the result of event triggers.

• Volume: High volume time series may emit hundreds of thousands of data points a second, while low volume time series only emit one event a day.

• Data type: Traditionally, values of data points in a time series have been integers or floating point numbers. But they can also be booleans, strings or even custom data types.

• Tags: A time series data point may have one or more tags associated with the timestamp and value. There may be no tags or a lot of tags. Tags may hold special values, such as geospatial information.

• Tag value cardinality: The number of possible combinations the tag values make, i.e. the product of the number of possible values of each tag. Three tags with two possible values each make a tag value cardinality of eight (see the sketch after this list).

• Variation: While time series data usually follow a pattern, the variation in a series may be very different. One series may describe a flat line, while another may describe seasonal variations with daily spikes.
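A small sketch of the cardinality calculation described above, taking the cardinality as the product of the number of possible values of each tag (Python 3.8+ for math.prod):

```python
from math import prod

def tag_value_cardinality(values_per_tag):
    """Number of distinct tag-value combinations: the product of the
    number of possible values of each tag."""
    return prod(values_per_tag)

print(tag_value_cardinality([2, 2, 2]))    # three tags, two values each -> 8
print(tag_value_cardinality([100, 100]))   # two tags with 100 values each -> 10,000
```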

3.1.5 Industry use cases

Internet of Things and sensor data

The Internet of Things revolution has made it possible to connect devices to the internet that were previously only available as offline systems. These devices can be split into two categories: actuators, to which commands can be sent to perform an action, and sensors, which sense the current environment and translate physical quantities into digital values.

The values sent from these sensors and the usual analyses performed upon them are a natural fit for time series databases. Every data point generated by a sensor is associated with a timestamp (the time at which it was produced). The frequency of data generation depends on the application domain; common intervals are every minute, every ten minutes and every hour.

Common operations on sensor data include getting the most recent data points, averaging data points over time intervals and flexible visualisation. IoT data sets are usually regular, low volume for small numbers of sensors, and often make use of geospatial tags.

Financial

Time series have long been a subject of study in financial disciplines. Stock information, exchange rates and portfolio valuations can all be represented as time series; thus, a time series database is a logical choice to store financial data points.

For example, kdb+, a time series database developed by Kx Systems, is often used in high-frequency trading. kdb+ also explicitly presents other financial use cases, such as algorithmic trading, forex trading, and regulatory management.

Financial time series are regular, but differ greatly in volume. Data points may be produced anywhere from once a day (e.g. stock closing prices) to every few milliseconds (e.g. high-frequency trading).

DevOps and machine monitoring applications

In the operations and DevOps industries, TSDBs are used extensively to monitor computer systems and software applications. Common metrics include processor load, memory usage and application response times. Metrics are usually aggregated on the device they are collected from in one-minute intervals before being sent to a metrics collector.

The collected data are used for manual analysis (e.g. "What is the slowest component in our stack?"), alerting (e.g. sending an alert when the average load is above 90% for more than 5 minutes) and automatic anomaly detection.

Software monitoring and DevOps use cases produce regular time series that are low volume for small numbers of machines and applications.

Asset tracking

Apart from software applications, time series databases are also often used to monitor physical systems. Most time series databases include support for storing and querying spatial data. This way, it is possible to associate location data with stored data points.

Use cases include asset tracking (e.g. storing current location of vehicles at a point in time) and geographical filtering (e.g. average of temperature for sensors within a range).

Asset tracking use cases produce data points with geospatial information. The time series produced can be regular (e.g. a location is sent every minute), but are often irregular. Since asset tracking use cases involve tracking entities in a large geographical area or in rough terrain, connectivity may be limited. This means that accurately determining a position and transmitting it may be impacted, resulting in irregular time series.

Analytics

In analytics, time series may be used to monitor website visits, advertisement clicks, or E-commerce orders.

Time series are used to track key performance indicators (KPIs) and infrastructure costs at Houghton Mifflin Harcourt [15]. KPIs can give an insight into the performance of the business.

These use cases produce irregular time series, since they are based on events. The volume may depend on various factors, such as the time (e.g. orders on a Wednesday night compared to orders on Black Friday), the weather (e.g. umbrellas sold in a convenience store), or other arbitrary factors (e.g. number of cars per hour on a day with a train strike).

Physics experiment tracking

Time series databases have been used in physics experiments to capture and process high volume data streams. For example, at CERN, the time series database InfluxDB handles writes at a rate of over 700 kHz [16].

Other use cases

Other use cases include game bot detection based on time series classification [17], telecommunications forecasting based on usage pattern prediction, and fraud detection through pattern analysis.

3.2 A “good” benchmark

Chen et al. [18] consolidate the properties of a good benchmark based on previous research as follows:

• Representative: Benchmarks must simulate real-world conditions; both the input to a system and the system itself should be representative of real-world usage.

• Relevant: Benchmarks must measure relevant metrics and technologies. Results should be useful to compare widely-used solutions.

• Portable: Benchmarks should provide a fair comparison by being easily extensible to competing solutions that solve comparable problems.

• Scalable: Benchmarks must be able to measure performance at a wide range of scales: not just single-node performance, but also cluster configurations.

• Verifiable: Benchmarks should be repeatable and independently verifiable.

• Simple: Benchmarks must be easily understandable, while making choices that do not affect performance.

These properties can be used to put existing benchmarks to the test. Relevance of individual benchmarks will not be evaluated. All of these benchmarks evaluate time series databases. Since TSDBs are the fastest growing type of database [19], we consider all benchmarks relevant.

3.3 Existing benchmarks

Here, existing benchmarks for time series databases are examined in more detail, and the properties described in Section 3.2 are discussed.

3.3.1 TS-Benchmark

TS-Benchmark is a benchmark simulating a wind plant monitoring system.

• Representative: TS-Benchmark uses a data model inspired by real world applications. An ARIMA time series model is trained with real-world wind power data [6].

• Portable: TS-Benchmark targets InfluxDB, IoTDB, TimescaleDB, Druid and OpenTSDB.

• Scalable: Only single-node performance of database systems is tested. The benchmark could be extended to perform on multi-node database systems.

• Verifiable: The source code for TS-Benchmark was published on GitHub.

• Simple: The benchmark follows a simple five-stage course, in which each stage performs a single operation or test.

3.3.2 IoTDB-benchmark

In a recent paper, for now only published on ArXiv, Liu et al. describe IoTDB-benchmark, a benchmark specifically designed for time series databases [7].

• Representative: The data generator creates square waves, sine waves and sawtooth waves with optional noise. Furthermore, constant values and random values within a range can be generated. Care needs to be taken when selecting a data generation function: rarely will real-world data follow a perfect sine function. This will have an effect on the compaction of data. To ensure representativeness of data, the "random values within a range" function is the best approximation. However, depending on the use case, it will still not be representative of most real-world data, where subsequent data points may have a relatively low delta compared to other points close in time instead of a completely random delta. IoTDB-benchmark allows configuration of many data generation parameters, such as the data type of fields, number of tags per device, etc.

• Portable: IoTDB-Benchmark supports IoTDB, InfluxDB, OpenTSDB, KairosDB, TimescaleDB, and CTSDB. The focus is on IoTDB, and not all functions are supported in databases other than IoTDB. For example, generation and insertion of customized time series is only supported for IoTDB at the moment.

• Scalable: Only single-node performance of database systems is tested. The benchmark could be extended to perform on multi-node database systems.

• Verifiable: The source code for IoTDB-Benchmark was published on GitHub.

• Simple: The benchmark follows a simple six-stage course, in which each stage performs a single operation or test.

3.3.3 TSDBBench/YCSB-TS

YCSB-TS, part of the TSDBBench benchmark, is a fork of YCSB that targets time series databases, since these databases are not supported in YCSB.

• Representative: YCSB-TS allows configuration of the workload used. Selecting or creating a good workload is critical in ensuring that the benchmark is representative. The standard workload is artificial and not based on real-world data.

• Portable: YCSB-TS supports InfluxDB, KairosDB, Blueflood, Druid, NewTS, OpenTSDB and Rhombus.

• Scalable: YCSB-TS has support for benchmarking multi-node set-ups. Tests were performed with single-node set-ups and five-node set-ups [5].

• Verifiable: The source code for all components of TSDBBench was published on GitHub, along with instructions on how to replicate the benchmark.

• Simple:

3.3.4 FinTime

FinTime is an older benchmark (it was proposed in 1999), but it still holds value as a representative benchmark. It mimics financial industry use cases.

• Representative: FinTime's two models are based on real-world financial use cases. Namely, it specifies data generation and queries for historical financial market information and a tick database for financial instruments.

• Portable: FinTime does not prescribe a query language. Implementations have been created for SQL databases, but SQL is not required.

• Scalable: The benchmark was performed on single-node database systems, but could be extended to work on multi-node systems.

• Verifiable: Only the source code for the data generation was published. It is unclear how latency and throughput are measured.

• Simple: Since FinTime is only a description of a data schema and queries to be run, it requires manual implementation.

3.3.5 influxdb-comparisons

The influxdb-comparisons project is a benchmark created by InfluxData, vendor of InfluxDB.

• Representative: The influxdb-comparisons benchmark simulates a DevOps use case, where a lot of different hosts send usage statistics (such as CPU load, disk IO usage, etc.) to a time series database. This is a representative benchmark for this scenario.

• Portable: The benchmark currently supports seven different TSDBs.

• Scalable: Only single-node performance is tested. The benchmark could be extended to perform on multi-node database systems.

• Verifiable: The source code for influxdb-comparisons is available under the MIT licence on GitHub.

• Simple: The benchmark follows a five-stage course, in which each stage performs a single operation or test.

3.4 Evaluation of existing benchmarks

Table 3.1 shows the compiled evaluation of existing benchmarks.

(Benchmarks evaluated: TS-Benchmark, IoTDB-benchmark, TSDBBench, FinTime and influxdb-comparisons, against the properties representative, relevant, portable, scalable, verifiable and simple. TS-Benchmark is representative for IoT use cases only, FinTime for financial use cases only, and influxdb-comparisons for DevOps use cases only; the individual assessments are given in Section 3.3.)

Table 3.1: Evaluation of existing TSDB benchmarks

3.4.1 On scalability

Scalability is a gap in the current state of the art. Only one benchmark, TSDBBench, tests multi-node performance. Testing multi-node set-ups is often harder due to either long manual or error-prone automated test set-up provisioning.

When TSDBs are actually deployed in the real world, multi-node setups are the norm. Benchmarks should reflect this. Actually supporting multi-node setups in a benchmark is usually not hard, but configuring, setting up, and comparing these setups takes a lot of time.

Most benchmarks are able to test multi-node setups, because most distributed TSDBs present a single interface: the client application does not need to be aware of the clustered nature of the TSDB.

3.4.2 On representativeness

As mentioned in Section 3.2, representativeness means that benchmarks must simulate real-world conditions, both the input to the system and the system itself. For the system itself, this means no configuration tuning that would not be used in real production systems, running benchmarks on system configurations that reflect systems on which production databases would run, etc. For the input to the system, it means real world data and real world queries, or data and queries comparable to real world usage. Representativeness is important for generalisation purposes: we cannot generalize the results of a benchmark to real world usage if the benchmark is not representative of real world usage.

TS-Benchmark, FinTime and influxdb-comparisons seem to be representative benchmarks, but this is only true for specific domains. The results of FinTime are only valid in financial contexts, and those of influxdb-comparisons only in specific DevOps contexts. This leads to false generalisations: we cannot draw conclusions about the performance of a database as a whole when a benchmark simulating a single use case is used.

Tay [20] and Zhang et al. [21] have made the case for application-specific benchmarking: instead of using generic micro-benchmarks, real world data are either used directly to benchmark a system or used to construct a representative benchmark.

Since the use cases of time series databases are broad, it is necessary to develop benchmarks that test a variety of representative scenarios. At the moment, no such benchmarks exist.

3.5 Contribution

This dissertation discusses the design, technical implementation and results of a representative benchmark. It compares three representative workloads to a baseline. The representative workloads use existing real world time series data sets and are chosen to simulate environments and use cases in which TSDBs are often used.

Evaluation of the results of the benchmark will determine if representative benchmarks are a necessity, or if non-representative benchmarks accurately predict performance for representative workloads. If non-representative benchmarks can predict real world performance, then representative workloads are not needed, which may lead to simpler benchmarks. If non-representative benchmarks cannot accurately predict real world performance, the validity of non-representative benchmarks can be called into question.

Chapter 4

A new benchmark

In the previous chapter, current benchmarks have been examined, and their insufficient representativeness has been noted. This may present a problem for generalisation of their results: do they accurately model real world performance? In this chapter, a new benchmark will be described, with a focus on representativeness. This benchmark will be used to test both representative and non-representative workloads to examine differences in performance.

4.1 Benchmark components

A benchmark consists of multiple separate components. The workload data set characteristics are the time series data characteristics described in Section 3.1.4. Workload query characteristics comprise the characteristics of the queries themselves and the spread between query types. Finally, the metrics measurement component will be considered.

4.1.1 Workload data set characteristics

Apart from the time series data characteristics discussed in Section 3.1.4, time series data sets can be categorised as synthetic or real world, and as having a high or low existing data volume.

Synthetic workload data sets are generated by tunable synthesizers [22]. These workloads may trade representativeness for configurability, and care should be taken in their configuration. Real world data will be used as the workload for this benchmark.

High existing data volumes may influence database performance. For big data sets, a DBMS may need to scan large amounts of data.

4.1.2 Workload query characteristics

In Section 3.1.2, the functions of TSDBs were defined. These lead to possible queries, such as reading single data points, averaging data point values within a time range, and summing all data point values with a certain tag.

Not only the type of queries is important, but also the relative frequency of each query type compared to all query types. For example, an application may frequently insert new data, while calculating the maximum data point value is done infrequently. A sketch of such a query mix is shown below.
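One possible way to encode such a query mix is sketched below; the query types follow the capabilities listed in Section 3.1.2, but the weights are purely illustrative and not taken from any existing benchmark.

```python
import random

# Relative frequencies of query types in a hypothetical workload query set.
QUERY_MIX = {
    "INS": 0.90,   # inserts dominate in many TSDB applications
    "SCAN": 0.06,
    "AVG": 0.03,
    "MAX": 0.01,
}

def next_query_type(mix=QUERY_MIX):
    """Draw the next query type according to its relative frequency."""
    types, weights = zip(*mix.items())
    return random.choices(types, weights=weights, k=1)[0]

print([next_query_type() for _ in range(10)])
```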

Concurrency may play an important role when benchmarking queries. When mul- tiple queries are run, performance may degrade, especially when read and write queries are mixed. In this dissertation, mixed read and write queries are not con- sidered. Write queries will be considered in an ingestion benchmark, and read queries will be considered in load testing and latency testing benchmarks.

4.1.3 Measurement characteristics

The last benchmark component is the measurement component. This component measures the effective performance of the operations performed. The metrics surveyed may be latencies, network usage, storage requirements, etc. Care must be taken that the measurement component minimally influences the benchmark results. For example, an ingestion client could monitor the number of data points per second sent to the database: this requires no instrumentation on the database server and thus minimally disturbs it.
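A minimal sketch of such a client-side measurement is shown below; send_batch stands in for whatever function writes one batch of data points to the TSDB under test and is a hypothetical name.

```python
import time

def measure_ingestion_rate(batches, send_batch):
    """Measure ingestion throughput (data points per second) on the client only.

    No instrumentation is added on the database server itself; the client simply
    counts the points it has sent and divides by the elapsed wall-clock time.
    """
    points_sent = 0
    start = time.perf_counter()
    for batch in batches:
        send_batch(batch)          # hypothetical: writes one batch to the TSDB
        points_sent += len(batch)
    elapsed = time.perf_counter() - start
    return points_sent / elapsed
```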

4.2 Design of a representative data workload

In Section 3.4.2, it was argued that representativeness is dependent on industry and use cases. Therefore, as the workload data set for the time series database benchmark, four different data sets will be considered. These are selected to be in different domains, with different time series characteristics. To ensure representativeness, data sets with real data are used. These are selected to model real world use cases for time series databases.

Of course, four different data sets do not cover every industry or use case. However, analysis of the results of benchmarks using these workload data sets will allow comparisons that indicate if the considered use case has an influence on performance.

4.2.1 A baseline workload

This is a non-representative workload, to be used as a baseline for comparison with representative workloads. Data points are written to one metric with random values and random tags.

• Metrics: Only one metric is tracked: “benchmark”. All data points belong to this metric.

• Regularity: The time series is fully regular, with one data point being produced every second.

• Volume: Low volume. There is only one metric where a data point is produced every second. There are no spikes of traffic.

• Data type: For this benchmark, floating point numbers will be used to represent the data point values.

• Tags: Every data point is tagged with two random tags. The possible values of the first tag are TAG_1_00 to TAG_1_99 and the possible values of the second tag are TAG_2_00 to TAG_2_99.

• Tag value cardinality: High. There are 10,000 (two tags with 100 possible values each) possible tag combinations.

• Variation: High. The values are randomly generated for every data point. Data point values bear no relationship to previous values. The values are floating point numbers between 0 and 100 inclusive.
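A small sketch of a generator for this baseline workload is given below; it is illustrative only and not the benchmark's actual implementation.

```python
import random
import time

def baseline_points(start_ts=None):
    """Yield the baseline workload: one point per second for the metric
    "benchmark", a random float in [0, 100], and two random tags."""
    ts = int(time.time()) if start_ts is None else start_ts
    while True:
        yield {
            "metric": "benchmark",
            "timestamp": ts,
            "value": random.uniform(0, 100),
            "tags": {
                "tag1": f"TAG_1_{random.randint(0, 99):02d}",
                "tag2": f"TAG_2_{random.randint(0, 99):02d}",
            },
        }
        ts += 1  # fully regular: one data point every second

# Example: inspect the first three generated points.
gen = baseline_points(start_ts=1_546_300_800)
for _ in range(3):
    print(next(gen))
```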

4.2.2 A financial time series workload

Time series data are often used in financial analysis. Prices of commodities, fu- tures, assets, and other financial instruments produce time series [23]. This his- torical data can then be used in performance calculations, price prediction, and financial ratio calculation.

The data set used for this benchmark was created by Boris Marjanovic and published on Kaggle [24]. It is licensed under CC0 (Creative Commons 1.0 Public Domain Dedication, which dedicates the work to the public domain). The data set contains historical data for 1344 Exchange-Traded Funds (ETFs) and 7195 stocks. For each stock and ETF, it lists the open, high, low and closing prices, next to the volume (the total number of shares traded during a day) and the open interest (the number of outstanding contracts that have not been fulfilled) for every day the ETF or stock was trading.

• Metrics: Six different metrics are tracked: the opening, high, low and closing prices for the stock, and the volume and open interest for the stock.

• Regularity: Semi-regular. Every day, an update is published, except on weekends and market closings (such as holidays). It is rare for new stocks to be published or for existing stocks to be removed from the exchange.

• Volume: Low volume, with short bursts. Data are published at market closing, which is the same time every day. This may lead to short spikes of high traffic when a lot of stocks are tracked.

• Data type: Prices are represented by numbers with five digits past the decimal point. Floating point numbers are sometimes avoided for storing such prices, due to possible inaccuracies and the higher cost of floating point operations; instead, the prices are multiplied by 10^5 and saved as integers. Since this places a burden on client applications if the database does not perform the conversion itself, prices will be stored as floating point numbers for this benchmark.

• Tags: Only a single tag is saved: the ticker symbol. Ticker symbols are strings for which no general format is specified: every exchange specifies its own rules. In general, symbols are short (nine characters is the maximum length in the data set), alphanumeric (and, in this data set, contain no digits) and case-insensitive. As an example, Apple's stock ticker symbol is AAPL.

• Tag value cardinality: Medium. There are 7,164 possible tag values. For the first one million data points, the tag cardinality is 143.

• Variation: Low. While stock prices are volatile, it is rare for stocks to show large changes in the span of a day.

4.2.3 A rating system workload

Rating systems allow customers and consumers to rate their experiences of goods and services. Users can like or dislike products, leave comments about a restaurant visit, or leave a rating for sellers on online marketplaces. Commonly, this feedback is represented as a five-star system, where half a star represents the lowest score, and five stars represents the maximum score.

GroupLens Research created data sets of varying sizes from the MovieLens website, which allows users to rate movies with a five-star system [25]. The MovieLens 20M data set contains twenty million ratings and is the basis for this workload. The data set comes with a custom license, allowing non-commercial use, but forbidding redistribution.

• Metrics: Only one metric is tracked: ratings. The value of the data point is the rating the user gave a movie, and the timestamp is when this rating was published.

• Regularity: The time series is irregular. The data points are events, produced when a user leaves a review.

• Volume: Approximately one review was left every thirty seconds. This is not a high level of activity; we can therefore qualify this time series as low volume.

• Data type: The ratings are floating point numbers, between 0.5 and 5.0 in 0.5 increments. This leads to ten possible values.

• Tags: Five tags are associated with every data point: userId (integer, the identifier of the user who left the review), title (string, the title of the movie being reviewed), imdbId (integer, the identifier of the movie on the Internet Movie Database, https://www.imdb.com/), tmdbId (integer, the identifier of the movie on The Movie Database, https://www.themoviedb.org/), and genres (string, a list of genres the movie belongs to, encoded as a string).

• Tag value cardinality: High. There are 138,493 different users, and 26,212 different movie titles. The rest of the tags are dependent on the movie title (the title directly implies the genre and external identifiers). Since not every user has rated every movie, the tag cardinality is not the product of these two figures. The tag cardinality of the complete data set was determined to be 20,000,262 and the tag cardinality of the first one million data points was determined to be 1,000,000.

• Variation: Subsequent points do not relate to each other, since they are ordered by timestamp and not the movie reviewed. This leads to a high variation. However, the absolute variation is still small, since the maximum absolute variation is 4.5.

4.2.4 An IoT workload

IoT applications, in particular sensor applications, produce a lot of data. This can be temperature data, power consumption, location data, etc. IoT data are almost always temporally indexed, thus a time series database is a natural fit.

The UCI (University of California, Irvine) Machine Learning Repository [10] contains the Individual household electric power consumption data set, which records power information for a house every minute. It was created by Georges Hebrail and Alice Berard and released under the CC BY 4.0 (Creative Commons Attribution 4.0 International) license.

• Metrics: Seven metrics are tracked for the household: active and reactive power, voltage, intensity (current), and three power meters for different rooms.

• Regularity: The data set is regular. Every minute, a new data point is emitted. Data are missing for a small period of time; for these missing data points, the values were filled in with zeroes (a reindexing sketch follows this list).

• Volume: Only seven data points are emitted every minute. This makes the data set low volume.

• Tags: The data contains no tags.

• Variation: Variation between subsequent data point values is low due to the small sampling interval.
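A sketch of the zero-filling step mentioned above is shown below, assuming the power data are held in a pandas DataFrame indexed by timestamp; the actual pre-processing code may differ.

```python
import pandas as pd

def fill_missing_minutes(df):
    """Reindex the household power data onto a regular one-minute grid and
    fill gaps with zeroes, as done for this workload (sketch; column names
    and the exact filling strategy of the original pre-processing may differ)."""
    full_index = pd.date_range(df.index.min(), df.index.max(), freq="1min")
    return df.reindex(full_index, fill_value=0.0)
```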

4.2.5 Workload data set overview

Table 4.1 shows an overview of all used workload data sets.

4.2.6 Data set pre-processing

The data sets were pre-processed using Python. All data sets are denormalized so as to provide one data point per line in the resulting file. Every line provides a complete data point, including the timestamp, metric name, data point value, and (potentially) tags.
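The sketch below illustrates this denormalization for the IoT data set. The input column names and ";" separator follow the UCI file; the output layout (timestamp, metric, value per line) is an assumed intermediate format, not the exact code used for the benchmark.

```python
import csv
from datetime import datetime

METRICS = ["Global_active_power", "Global_reactive_power", "Voltage",
           "Global_intensity", "Sub_metering_1", "Sub_metering_2", "Sub_metering_3"]

def denormalize_iot(in_path, out_path):
    """Turn every row of the UCI household power file into one line per metric:
    timestamp, metric name and value (sketch of the pre-processing step)."""
    with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        reader = csv.DictReader(fin, delimiter=";")
        writer = csv.writer(fout)
        for row in reader:
            ts = datetime.strptime(row["Date"] + " " + row["Time"], "%d/%m/%Y %H:%M:%S")
            for metric in METRICS:
                raw = row[metric]
                value = 0.0 if raw in ("?", "") else float(raw)  # missing values become zero
                writer.writerow([int(ts.timestamp()), metric, value])
```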


                       Baseline      Financial     Rating        IoT
Metrics                1             6             1             7
Regularity             Regular       Semi-regular  Irregular     Regular
Volume                 Low volume    Low volume    Low volume    Low volume
Tags                   2             1             5             0
Tag value cardinality  10,000        7,164         20,000,262    0
Variation              High          Low           High          Low
Total data points      20,000,000    74,418,459    20,000,262    14,526,812
License                N/A           CC0           Custom        CC BY 4.0

Table 4.1: Overview of workload data sets

4.3 Design of a representative query workload

A representative data set is only part of the workload. Representative queries on these data sets are the other. While real world data sets are readily available, information on data usage or queries performed on these data sets is not. Therefore, for every data set, logical queries and patterns will be created. For a truly representative query workload, existing TSDB systems should be surveyed and their usage patterns monitored.

The implementation of query workloads was complicated by the fact that every database uses a custom query language. These languages may have different semantics. For example, when grouping a time range by week, some TSDBs will start the grouping block on the start timestamp of the given range, while others will align the groups by the calendar (so the first block may not be a full week). Query results have to be compared to ensure correctness. A standardized query language, as SQL is for RDBMSs, would speed up development of benchmarks and TSDB applications.
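The difference in grouping semantics can be made concrete with a small sketch: for the same timestamp and query range, one convention anchors weekly buckets on the start of the requested range, the other on fixed calendar/epoch boundaries, so the resulting groups differ. The values used here are arbitrary and only illustrative.

```python
WEEK = 7 * 24 * 3600  # seconds

def bucket_aligned_to_range(ts, range_start):
    """Bucket start when groups are anchored on the start of the queried range."""
    return range_start + ((ts - range_start) // WEEK) * WEEK

def bucket_aligned_to_calendar(ts):
    """Bucket start when groups are anchored on fixed calendar/epoch boundaries."""
    return (ts // WEEK) * WEEK

# The same timestamp falls into differently delimited buckets:
range_start = 1_000_000            # an arbitrary range start, not a week boundary
ts = range_start + 3 * 24 * 3600   # three days into the range
print(bucket_aligned_to_range(ts, range_start))  # 1000000
print(bucket_aligned_to_calendar(ts))            # 1209600
```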

4.3.1 Queries for the baseline workload

The baseline workload is a non-representative workload to which others will be compared. The query workload reflects this: there is only one query, requesting a single data point between two timestamps with two specific tags.

4.3.2 Queries for the financial workload

The financial query workload simulates a stock information application which informs stock traders of historical statistics. The following queries are run:

• Get all opening prices for a stock in a time range (relative frequency: 0.20)

• Get the minimum closing price for a stock (relative frequency: 0.25)

• Get the maximum opening price for a stock (relative frequency: 0.15)

• Get the mean high price for a stock grouped by week (relative frequency: 0.25)

• Get the total volume for a stock grouped by four weeks (relative frequency: 0.15)

4.3.3 Queries for the rating workload

The ratings query workload simulates the backing database of a movie website. The queries get the average rating for a movie by title or IMDb identifier, get ratings for a particular user, and group average ratings for a movie by year:

• Get the mean rating for a movie with a specific title (relative frequency: 0.70)

• Get the mean rating for a movie with a specific IMDb identifier (relative frequency: 0.10)

• Get all ratings by a specific user (relative frequency: 0.05)

• Get mean rating per year for a movie with a specific title (relative frequency: 0.15)

4.3.4 Queries for the IoT workload

The IoT query workload mimics a power consumption application. Mean active power is requested for one and two week time ranges (the latter grouped by day), and for a twelve week time range grouped by week:

• Get mean active power for a one week time range (relative frequency: 0.4)

• Get mean active power for a two week time range grouped by day (relative frequency: 0.4)

• Get mean active power for a twelve week time range grouped by week (rela- tive frequency: 0.2)

4.4 Metrics

Ingestion throughput is the number of data points per second that can be inserted into the database, possibly using a bulk loading mechanism. This metric is especially important for OLAP applications, where data from a master database is loaded into a TSDB for time series analytics processing.

Space consumption is the amount of storage required to store the database. Storage efficiency is space consumption divided by the number of data points stored. This metric shows how efficient the database engine is at compressing data points. The measurement is taken after loading the database with a predefined set of data points, and is expressed in bytes per data point.
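Storage efficiency can be computed with a sketch like the following, which sums the size of every file under the database's data directory after shutdown and divides by the number of data points loaded; the directory path is a parameter and the example path is only illustrative.

```python
import os

def storage_efficiency(data_dir, n_points):
    """Bytes per data point: total on-disk size of the data directory
    (raw database files plus write-ahead logs) divided by the number of
    data points loaded."""
    total = 0
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / n_points

# e.g. storage_efficiency("/var/lib/influxdb", 1_000_000)
```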

Latency, expressed in mean, 95th, and 99th percentile response times, shows how fast the database can answer queries. For user-facing applications, this is especially important: applications need to render quickly, or users leave.

Load testing gives us the maximum number of requests per second a TSDB can handle.

The mean response size is the average size in bytes of the returned TSDB response body. This response body may contain, next to the requested data, metadata such as the number of data points used in calculation, aggregated tags, etc. While this information may be useful to some applications, a small TSDB response is generally preferred. This leads to faster responses, lower network load and lower memory requirements, though the effects may be small.

4.5 Technical implementation

4.5.1 Test environment

The tests were run on homogeneous machines containing two quad-core Intel E5520 (2.2GHz) CPUs, 12GB RAM and a 160GB hard disk. The devices were connected via Gigabit Ethernet.

The versions of the databases used are as follows: OpenTSDB 2.3.1 (with HBase 1.4.4), InfluxDB 1.5.4, and KairosDB 1.2.1 with either ScyllaDB 3.0.6 or Cassandra 3.11. These databases were minimally changed from their stock configuration. For InfluxDB, the maximum number of series was increased; for OpenTSDB, chunked requests were enabled; and for KairosDB (with ScyllaDB) the maximum batch size was decreased to one hundred for the financial workload.

The databases were run in Docker containers (one container for OpenTSDB and InfluxDB, two containers for KairosDB with either underlying DBMS in a docker-compose setup). When not under test, containers were stopped. Only one container was under test at a given time and no other applications were active on the database host, apart from basic monitoring software.

During tests, one machine acted as the database host, while the other loaded the data or performed queries.

4.5.2 Data ingestion

Data loaders from the influxdb-comparisons project [26] were used. These load the data sets, converted for use with a specific database, into that specific database. Since no data loader was available for KairosDB, its Telnet API was used.

4.5.3 Load and latency testing

Vegeta [27], a load testing tool, is used to test latencies. Every second, a data set-specific number of requests is made to the TSDB. There are twenty queries in every query workload, and each one is translated to the query language of every TSDB. The queries are cycled in a round-robin pattern, so as to ensure determinism.

http_load [28] is used to conduct load testing. The program is configured with a thirty second timeout, ten parallel requests and a thirty second run time. The same URLs as in the latency testing measurement are loaded; when one request finishes, another one starts. Afterwards, the number of requests per second is reported.
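A sketch of how a twenty-entry target list could be derived from the relative query frequencies and written in the plain-text target format Vegeta accepts (one "GET <url>" per line) is shown below; the host, port and query strings are placeholders, not the actual benchmark queries. Such a file could then be passed to, for example, vegeta attack -rate=30 -duration=30s -targets=targets.txt.

```python
def build_targets(queries, total=20):
    """Expand (url, relative_frequency) pairs into a fixed-size list of targets.

    With total=20, a relative frequency of 0.25 becomes 5 occurrences of that
    query; the list is then cycled round-robin by the load/latency testers.
    """
    targets = []
    for url, freq in queries:
        targets.extend([url] * round(freq * total))
    return targets

# Hypothetical example for the financial workload against an InfluxDB host:
financial = [
    ("http://db-host:8086/query?db=benchmark&q=...", 0.20),
    ("http://db-host:8086/query?db=benchmark&q=...", 0.25),
    ("http://db-host:8086/query?db=benchmark&q=...", 0.15),
    ("http://db-host:8086/query?db=benchmark&q=...", 0.25),
    ("http://db-host:8086/query?db=benchmark&q=...", 0.15),
]
with open("targets.txt", "w") as f:
    for url in build_targets(financial):
        f.write(f"GET {url}\n")
```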

4.6 Design evaluation

In Section 3.2, properties of a good benchmark were discussed. Now, these will be applied to the benchmark described in this chapter.

• Representative: Through the use of multiple use cases, real world, non-synthetic data sets, and balanced query workloads, this is a very representative benchmark.

• Relevant: This benchmark evaluates TSDBs. As the fastest growing type of database [19], this can be considered a relevant benchmark. The metrics measured are based on other database and TSDB benchmarks, and are comparable with them.

• Portable: To add a new database to the benchmark, the following components are necessary: a Docker container containing the database, a data formatter, a data ingestion loader, and a set of queries as HTTP requests. Most open source databases have existing Docker containers, and creating a data formatter is a few hours of work. A data ingestion loader is more time-consuming, but many databases have existing ingestion loaders. The last component presents a challenge: not every database has an HTTP interface. For example, TSDBs that rely on an existing RDBMS, such as Timescale (built on PostgreSQL), do not include an HTTP API. To benchmark this kind of database, the benchmark would need to be extended to include other measurement tools. This does make comparison of results harder.

• Scalable: Both the ingestion and querying component of the benchmark are able to accept a list of different URLs to spread the load. This makes the ingestion and measuring component of the benchmark scalable. However, tests were only conducted on single-node TSDB setups. Multi-node database setups are hard to set up right, and it is especially hard to fairly compare heterogeneous DBMSs, such as TSDBs.

• Verifiable: The data sets used are available under open licenses (Section 4.2.5), the tools to ingest are available under the MIT licence [26], and the tool to test latencies and response size is available under the MIT license. The components to denormalize the data sets, to transform them to specific database formats, and the database setup components will be made available as open source when the embargo on this master’s dissertation ends.

• Simple: The benchmark was kept as simple as possible, with distinct parts doing a single thing. This leads to an architecture where one component can easily be swapped with another, e.g. the data generator could be switched with a generator from another benchmark.

Chapter 5

Results

The results of the data ingestion of the data set workloads described in Section 4.2 and the query workload upon those data sets described in Section 4.3 are presented here. The metrics reported are described in Section 4.4. The results are analysed to examine possible performance differences between non-representative and representative workloads.

5.1 Storage efficiency

One million data points were inserted in TSDBs. Afterwards, the database was shut down, and the size of the data directory of every TSDB was measured. This includes raw database files and write-ahead logs. As a comparison, the storage efficiency of CSV files is included. Figure 5.1 shows the results graphically.

InfluxDB performs nearly as well as CSV for the baseline data set workload, but is much less efficient for the representative data set workloads.

OpenTSDB and KairosDB require at least one tag to be present on the data points. Therefore, the tag notags with the string value "true" was added on the IoT data set, which contains no tags otherwise. This may influence storage consumption, but both TSDBs perform very well for this data set workload nonetheless.

Figure 5.1: Storage efficiency of different TSDBs in bytes per data point. Data points contain a timestamp, a value, and may contain tags, depending on the data set.

Figure 5.2: Relative storage efficiency of different TSDBs per data point compared to the CSV source format.

OpenTSDB outperforms every other TSDB for all representative data set workloads. It shows exceptional performance for the IoT data set, where it is able to store data nearly four times as efficiently as the CSV input data set. This is likely a result of the low tag value cardinality: there is only one tag and one tag value¹.

KairosDB (with Cassandra) performs well for the IoT data set workload, but does not do better than the CSV source data set. It always uses at least twice as much storage space as the source data set.

KairosDB (with ScyllaDB) was unable to complete data loading for the ratings data set. For the other data sets, it used exactly 1074.73 bytes per data point to store all data, regardless of the data size. To ensure these three measurements were correct, they were repeated, and the same values were found. The fact that the persisted data size is so large is remarkable, since ScyllaDB uses the same storage format as Cassandra [29].

When comparing relative storage efficiency to CSV (graphically displayed in Figure 5.2), the impact of high tag value cardinality becomes clear. Tag value cardinality is the number of possible combinations tag values can make. InfluxDB in particular requires relatively more storage space as tag value cardinality grows for the representative data set workloads. The other TSDBs display no such dependency on tag value cardinality. Variation may also play an important role. The representative workloads have lower data point value variation than the baseline, especially the IoT data set. This may enable OpenTSDB to store the time series more efficiently.

It is clear that representative data set workloads reveal patterns not uncovered by a traditional data set workload. A non-representative benchmark might declare InfluxDB the winner of a storage efficiency test, while it is clear that, on the given representative domains, OpenTSDB has much better storage efficiency.

¹The original data set does not contain any tags, but since OpenTSDB requires at least one tag for data points, the tag notags with the string value "true" was used.

5.2 Data ingestion throughput

The data ingestion throughput or data ingestion rate is the number of data points a TSDB can ingest per second in a bulk loading pattern. Ingestion rate tests were performed with data sets with one million data points and the results are shown in Figure 5.3.

InfluxDB outperforms the other TSDBs in all data set workload ingestion tests except the ratings workload, where its ingestion is seven times slower than KairosDB (with Cassandra) and nearly five times slower than OpenTSDB. This is likely due to the high series cardinality.

OpenTSDB performs better than KairosDB, but is still significantly slower than InfluxDB. For the non-representative baseline data set workload, OpenTSDB is nearly five times slower than InfluxDB. For representative data set workloads, this gap shrinks. InfluxDB is just over twice as fast as OpenTSDB for the IoT and financial data set workloads. For the ratings data set workload ingestion test, OpenTSDB is nearly five times as fast as InfluxDB.

KairosDB (with ScyllaDB) was unable to complete ingestion of the ratings data set workload. The ingestion speed was 33,340 data points per second, but since not all data points were successfully saved, this result is excluded.

The differences between KairosDB with Cassandra and KairosDB with ScyllaDB are not huge, but ScyllaDB consistently outperforms Cassandra. For the baseline data set workload, ScyllaDB performs 8.10% better, and for the IoT and financial workloads 12.95% and 5.55% respectively.

For the IoT and financial data set workloads, relative performance is comparable to the baseline. InfluxDB comes in first, OpenTSDB second, followed by KairosDB with ScyllaDB and KairosDB with Cassandra, respectively. However, for the ratings data set workload, we see a different pattern. Here, InfluxDB has the slowest ingestion speed, and KairosDB with Cassandra the highest, with OpenTSDB in the middle. The reason for this is unclear. High tag value cardinality has been known to slow down InfluxDB performance through high memory usage, but InfluxDB performed well on the baseline, which also has high tag value cardinality. This performance may be caused by the large size of the data points and the large number of tags.

The use of real world, representative data sets revealed a performance degradation of InfluxDB compared to the other TSDBs for the ratings data set.

5.3 Load testing with query workload

The maximum number of queries per second was determined for every TSDB-data set tuple. The results are shown in Figure 5.4.

InfluxDB significantly outperforms all other TSDBs for every query workload. In the non-representative query workload, it outperforms the next runner-up (OpenTSDB) by a factor of 18. In the representative query workloads, this factor is different. For the IoT query workload, InfluxDB performs 8.5 times better than OpenTSDB, for the financial query workload 15 times better, and for the ratings query workload nearly 37 times better. Clearly, the query workload has a big impact on performance.

KairosDB with Cassandra was not able to complete the ratings workload due to memory constraints. KairosDB with ScyllaDB was not able to complete this query workload because not all the data could be loaded (see Section 5.2). For the other query workloads, ScyllaDB outperforms Cassandra every time. In the baseline query workload, it outperforms by just over 20%. For the representative IoT query workload, it achieves 36.49% more requests per second, and for the financial query workload a 22% improvement.

It is remarkable how much better KairosDB performs on the representative workloads. The baseline workload requests just one data point, a very simple query which is easily cacheable, and yet KairosDB performs about twice as well on the more representative, but much more complex, IoT and financial benchmarks. There is no clear explanation why KairosDB would perform so much worse for a much simpler workload. If anything, it would be expected to perform a lot faster than on the IoT workload, since the baseline only requests a single data point (which can be cached) and requires no aggregation or calculations.

Figure 5.3: Data points ingested per second. Data sets used were one million data points each.

OpenTSDB performance is good in the baseline and the IoT query workload, but is degraded in the financial and ratings query workload. This may have to do with the fact that the data ranges to scan are much bigger in these last two query workloads, while the first two only require data from relatively narrow time ranges.

5.4 Response latency

The mean latency, shown graphically in Figure 5.5, is the mean time it takes to receive a response from the TSDB. The 95th percentile response time is displayed graphically in Figure 5.6. This metric shows the maximum latency for 95% of requests; one in twenty requests will have a longer latency than this.

Figure 5.4: Maximum requests per second. Tests were performed on data sets one million data points in size.

The tests were performed with a constant rate of requests. This rate was determined by choosing the lowest maximum requests per second for every query workload: empirically, the request rate was increased until timeouts were observed, and this rate was then rounded down. It was found that some TSDBs were able to handle more requests per second than the load testing indicated when the number of parallel requests was increased, with little increase in latency. Ultimately, the rates used were rounded to 10 requests per second for the baseline query workload, 20 for the IoT query workload, 30 for the financial query workload, and 2 for the ratings query workload.

KairosDB with Cassandra and KairosDB with ScyllaDB were not able to complete the ratings workload, due to memory constraints and because not all data could be loaded, respectively (see Section 5.3).

InfluxDB is the clear winner when it comes to latency. The TSDB is able to handle requests and send a response in less than 2ms for the baseline, and queries for the complex ratings query workload take on average just over 100ms. InfluxDB outperforms all other TSDBs tested when it comes to latency, both mean latency and 95th percentile.

OpenTSDB shows good performance for the baseline and IoT query workloads, but, as in the load testing, has trouble with the financial and ratings query workloads. As mentioned in Section 5.3, this may have to do with the large time ranges the TSDB has to scan to aggregate data points. The latencies for the last two workloads are high: the average latency is over two and a half seconds.

KairosDB with ScyllaDB shows greater performance than KairosDB with Cassandra for every query workload. For the first two workloads, it performs nearly twice as fast when comparing mean latency. For the financial workload, the difference (ScyllaDB 4.63% faster) is small.

5.5 Mean response size

In Figure 5.7, the mean response size is shown graphically. This mean is clearly coupled to the data set. Overall, InfluxDB has the most verbose responses. After inspecting a few responses, the main reason seems to be that InfluxDB encodes timestamps as strings in responses, while KairosDB uses numbers, and OpenTSDB uses numbers encoded as strings. Compare these encodings:

• KairosDB encodes the time as 1189641600000, representing the number of milliseconds since January 1, 1970. This takes 14 bytes to encode in JSON.

• InfluxDB encodes the time as "2007-09-13T00:00:00Z", which takes 23 bytes to encode. However, this format is able to add more precision, adding seconds, milliseconds and even nanoseconds.

• OpenTSDB encodes the time as "1189641600", representing the number of seconds since January 1, 1970, as a string. This takes 13 bytes to encode in JSON, but is not as precise as the other encodings.

Figure 5.5: Mean latency per request.

Figure 5.6: 95th percentile of latency per request.

Other factors influence the response size. For example, OpenTSDB and KairosDB will return a list of tags used on data points. For large responses, such as those for the representative query workloads, the timestamp encoding is the deciding factor.

Both KairosDB TSDBs experienced timeouts for the baseline data set. For KairosDB on Cassandra, 27 timeouts were encountered, and KairosDB on ScyllaDB encountered 4 timeouts. These were ignored when calculating the mean response size.

KairosDB on Cassandra and on ScyllaDB both return the same number of bytes for the IoT and financial workloads since the underlying databases are interchangeable. Given the same data, KairosDB should deliver the same response, and this result gives confidence that it does².

5.6 Evaluation

When comparing storage efficiency (Section 5.1), representative data sets showed that storage efficiency varies heavily between use cases, and so does relative storage efficiency. The results of the non-representative benchmark cannot be generalised to relative storage efficiency in representative workloads. Tag value cardinality and data point value variation were identified as possible parameters with a high impact on storage efficiency. Real world data usually has low variation, while non-representative benchmarks often use random values (high variation). These non-representative benchmarks may become more representative of real world use cases through the use of random walks instead, which have lower variation and more closely model real world data.
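A random walk of the kind suggested here could be generated with a sketch like the one below: each value is the previous value plus a small random step, which keeps the variation between subsequent points low while still producing non-trivial data; the step size and bounds are arbitrary illustrative choices.

```python
import random

def random_walk(n, start=50.0, step=0.5, low=0.0, high=100.0):
    """Yield n values where each differs from the previous one by at most
    `step`, clamped to [low, high]; variation between subsequent points
    stays low, unlike fully random values."""
    value = start
    for _ in range(n):
        value = min(high, max(low, value + random.uniform(-step, step)))
        yield value
```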

The use of representative data sets and query workloads for ingestion speed testing (Section 5.2) showed performance problems when ingesting the complex ratings data set, especially for InfluxDB.

²Some individual queries were compared to further confirm that KairosDB with either Cassandra or ScyllaDB gives the same response for the same query.

Figure 5.7: Mean size in bytes of the TSDB response.

In the load testing benchmark (Section 5.3), it was discovered that OpenTSDB performed well for the baseline and IoT query workloads, but not for the financial and ratings query workloads.

For the response latency (Section 5.4), the use of representative benchmarks again showed a performance degradation for OpenTSDB for the financial and ratings query workloads, which use broad time ranges. Otherwise, the baseline is a good predictor for relative performance in the representative benchmarks.

When testing the mean response size (Section 5.5), the encoding of timestamps was shown to be the deciding factor when it comes to query workloads which return a large response.

These results make it clear that representative data set workloads and query workloads may lead to important differences in benchmark results. They shed doubt on the real world applicability of benchmarks using random or synthetic data sets and/or non-representative query workloads.

The fact that not all representative workloads show performance impact (e.g. only the ratings workload showed the performance degradation for InfluxDB in the data ingestion test) highlights the importance of using multiple representative workloads - just one representative workload may not be enough to highlight possible deviations or performance degradations. It is impractical to create a workload for every use case, but it is possible to generalize workloads into categories (e.g. volume, tag value cardinality, data type, ...). Further testing is needed to confirm that data sets with the same workload parameters will yield comparable results.

Chapter 6

Conclusions and future work

6.1 Conclusions

Compared to a baseline non-representative workload, representative workloads showed significant performance differences when it came to storage efficiency, data ingestion speed for complex data, latency and maximum request rate (when broad time ranges are used). Storage efficiency (bytes per data point) is lower for data sets with low tag value cardinality and low variation. Non-representative benchmarks using random data will have high variation, while real world data often displays low variation. Using random walks instead of random values may make a benchmark more representative. Data ingestion throughput testing highlighted performance problems for data sets with large data points and high tag cardinality. Latency and load testing showed that some databases perform significantly worse when they need to scan a large amount of data. This illustrates the importance of using representative workloads.

A number of TSDB benchmarks have been studied, but none of them use representative workloads. Three existing TSDB benchmarks use nearly representative workloads, but none of them use real world data sets. Instead, they use random or synthesized data. Considering that the benchmark presented in this dissertation, which uses representative, real world workloads, sheds a different light on TSDB performance, the relevance of these existing benchmarks may be called into question.

While representative workloads uncovered significant performance differences compared to non-representative workloads, it is impractical to create or test representative workloads for every use case imaginable. TSDB workloads can, however, be categorized with workload parameters (number of metrics, regularity, volume, data type, number of tags, tag value data type, tag value cardinality, variation). Further research is needed to determine if these parameters are enough to accurately describe a TSDB workload and thus generalize results of one workload to another with the same workload parameters.

Benchmarking TSDBs is a complex endeavour due to the absence of standardized query languages, data models, or capabilities (such as aggregators or functions). The proliferation of TSDB models has the advantage of specialisation: instead of optimizing for the general case, individual TSDBs may seek to specialise in a niche, e.g. geo-spatial data querying, nanosecond timestamp resolution, or real-time streaming queries. The disadvantage is that it is much harder to compare different TSDBs. The varying support for operations means that not all TSDBs can be compared to each other, semantic differences in query languages require careful comparison of results to ensure they are valid, and different database interfacing methods may lead to more difficult interpretation of benchmark results.

6.2 Future work

This dissertation has demonstrated the relevance of representative benchmarks. The experiments and tests that were run for this dissertation took a lot of time to prepare and execute, and therefore, a number of extensions have been left for the future. Several possible lines of research could be pursued:

• The hypothesis that workloads with the same data set characteristics yield comparable benchmark results could be tested. Analysis might produce another, non-obvious workload parameter.

• The benchmark described in this dissertation can be extended to use more TSDBs. Currently, four TSDBs are tested, but more can be added. Another approach would be to extend another existing TSDB benchmark to be more representative.

• The query workload could be extended to include data mutations (such as create, update and delete queries). Benchmarks using this query workload might produce even more representative results. However, query spread should be carefully studied: for most query workloads, create queries will heavily outnumber update and delete queries.

• A comparison of TSDB query languages might yield interesting results on their construction and capabilities. Perhaps a unifying query language could be constructed, which would facilitate research into different TSDB families.

• In production environments, TSDBs are often used in multi-node setups. This scalability aspect is only addressed in one existing benchmark. The benchmark in this dissertation could be extended to test clustered TSDBs.

• This dissertation has focused on TSDBs, a specialized type of database. Representative benchmarking could be studied in different domains as well, such as relational databases and specialized non-relational databases (such as graph, triple or document stores).

Appendix A

Detailed results

This appendix lists detailed results discussed and displayed graphically in Chapter 5.

A.1 Data ingestion throughput

Table A.1 lists the detailed results for Section 5.2.

Data set     InfluxDB    OpenTSDB    KairosDB (Cassandra)    KairosDB (ScyllaDB)
Baseline     481818      89360       54792                   59231
IoT          317999      162473      87413                   98736
Financial    156498      86578      78198                    82535
Ratings      4342        21196       29913                   NA

Table A.1: Data ingestion speed in points per second.

A.2 Storage efficiency

Table A.2 lists the detailed results for Section 5.1.

Data set     CSV    InfluxDB    OpenTSDB    KairosDB (Cassandra)    KairosDB (ScyllaDB)
Baseline     1.0    1.4443      2.4213      2.6983                  18.7948
IoT          1.0    3.3308      0.2585      2.2353                  31.1704
Financial    1.0    5.8211      1.5668      6.2073                  36.5115
Ratings      1.0    7.2624      0.891       3.2897                  NA

Table A.2: Storage efficiency per data point, relative to the CSV source format.

A.3 Load testing

Table A.3 lists the detailed results for Section 5.3. Tests were performed using ten requests in parallel, with a thirty second timeout.

Data set     InfluxDB    OpenTSDB    KairosDB (Cassandra)    KairosDB (ScyllaDB)
Baseline     6400.36     347.367     12.8                    15.4667
IoT          997.567     117.3       29.4333                 40.1667
Financial    235.733     15.5664     26.8332                 28.5667
Ratings      78.3333     2.13333     NA                      NA

Table A.3: Maximum requests per second performed using representative queries.

A.4 Response latency

Table A.4 shows the mean latency and Table A.5 the 95th percentile latency for TSDB responses. Table A.6 shows the number of timeouts that occurred during the latency and response size tests. These results are discussed in Section 5.4.

A.5 Mean response size

Table A.7 lists the detailed results for Section 5.5.

Data set     InfluxDB    OpenTSDB    KairosDB (Cassandra)    KairosDB (ScyllaDB)
Baseline     1.266       12.559      862.643                 230.125
IoT          7.636       18.293      155.91                  74.49
Financial    57.88       2681.441    106.85                  122.25
Ratings      104.41      2563.02     NA                      NA

Table A.4: TSDB mean request latency in milliseconds for representative queries.

Data set     InfluxDB      OpenTSDB       KairosDB (Cassandra)    KairosDB (ScyllaDB)
Baseline     1.399121      12.8529        124.640305              70.532287
IoT          22.049411     44.926242      333.99535               193.673539
Financial    87.277173     3991.057059    133.174898              127.224758
Ratings      462.786983    2786.607644    NA                      NA

Table A.5: TSDB 95th percentile request latency in milliseconds for representative queries.

Data set     InfluxDB    OpenTSDB    KairosDB (Cassandra)    KairosDB (ScyllaDB)
Baseline     0           0           27                      4
IoT          0           0           0                       0
Financial    0           0           0                       0
Ratings      0           0           NA                      NA

Table A.6: Number of timeouts during the latency and response size tests.

Data set     InfluxDB    OpenTSDB    KairosDB (Cassandra)    KairosDB (ScyllaDB)
Baseline     185.0       126.0       202.0                   202.0
IoT          507.35      350.45      459.4                   459.4
Financial    33186.1     28854.35    23117.75                23117.75
Ratings      1250.35     390.65      NA                      NA

Table A.7: TSDB mean response size in bytes for representative queries.

Bibliography

[1] E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Commun. ACM, 13(6):377–387, June 1970.

[2] Andrew Pavlo and Matthew Aslett. What's Really New with NewSQL? SIGMOD Rec., 45(2):45–55, September 2016.

[3] Katarina Grolinger, Wilson A. Higashino, Abhinav Tiwari, and Miriam AM Capretz. Data management in cloud environments: NoSQL and NewSQL data stores. Journal of Cloud Computing: Advances, Systems and Applications, 2(1):22, December 2013.

[4] Rick Cattell. Scalable SQL and NoSQL data stores. ACM SIGMOD Record, 39(4):12, May 2011.

[5] Andreas Bader, Oliver Kopp, and Michael Falkenthal. Survey and Comparison of Open Source Time Series Databases. Gesellschaft für Informatik e.V., 2017.

[6] Yueguo Chen. TS-Benchmark: A benchmark for time series databases. http://prof.ict.ac.cn/Bench18/chenyueguo.pdf, June 2018.

[7] Rui Liu and Jun Yuan. Benchmark Time Series Database with IoTDB-Benchmark for IoT Scenarios. arXiv:1901.08304 [cs], January 2019.

[8] Kaippallimalil J. Jacob and Dennis Shasha. FinTime: A financial time series benchmark. SIGMOD Record, 28:42–48, 1999.

[9] Hoang Anh Dau, Eamonn Keogh, Kaveh Kamgar, Chin-Chia Michael Yeh, Yan Zhu, Shaghayegh Gharghabi, Chotirat Ann Ratanamahatana, Yanping, Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdullah Mueen, and Gustavo Batista. The UCR Time Series Classification Archive. October 2018. https://www.cs.ucr.edu/~eamonn/time_series_data_2018/.

[10] Dheeru Dua and Casey Graff. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml, 2017.

[11] R.J. Hyndman. Time Series Data Library. https://datamarket.com/data/list/?q=provider:tsdl.

[12] Time-series data on data.world: 34 datasets. https://data.world/datasets/time-series.

[13] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford Large Network Dataset Collection. June 2014.

[14] Davis Blalock, Samuel Madden, and John Guttag. Sprintz: Time Series Compression for the Internet of Things. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., 2(3):93:1–93:23, September 2018.

[15] Robert Allen. Case Study: How Houghton Mifflin Harcourt gets real-time views into their AWS spend with InfluxData, October 2017.

[16] Adam Wegrzynek. Towards the integrated ALICE Online-Offline monitoring subsystem. https://indico.cern.ch/event/587955/contributions/2937431/attachments/1678739/2706702/CHEP-2018.pdf, September 2018.

[17] Mario Luca Bernardi, Marta Cimitile, Fabio Martinelli, and Francesco Mercaldo. A Time Series Classification Approach to Game Bot Detection. In Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics, WIMS '17, pages 6:1–6:11, New York, NY, USA, 2017. ACM.

[18] Yanpei Chen, Francois Raab, and Randy Katz. From TPC-C to Big Data Benchmarks: A Functional Workload Model. In David Hutchison, Takeo Kanade, Josef Kittler, Jon M. Kleinberg, Friedemann Mattern, John C. Mitchell, Moni Naor, Oscar Nierstrasz, C. Pandu Rangan, Bernhard Steffen, Madhu Sudan, Demetri Terzopoulos, Doug Tygar, Moshe Y. Vardi, Gerhard Weikum, Tilmann Rabl, Meikel Poess, Chaitanya Baru, and Hans-Arno Jacobsen, editors, Specifying Big Data Benchmarks, volume 8163, pages 28–43. Springer Berlin Heidelberg, Berlin, Heidelberg, 2014.

[19] DB-Engines Ranking per database model category. https://db-engines.com/en/ranking_categories.

[20] Y C Tay. Data Generation for Application-Specific Benchmarking. VLDB, Challenges and Visions, 7:4, 2011.

[21] Xiaolan Zhang and Margo Seltzer. Application-Specific Benchmarking. Harvard University, 2001.

[22] Ajay Joshi, Lieven Eeckhout, and Lizy John. The Return of Synthetic Benchmarks. In 2008 SPEC Benchmark Workshop, pages 1–11, 2008.

[23] A. Chakraborti, M. Patriarca, and M. S. Santhanam. Financial time-series analysis: A brief overview. arXiv:0704.1738 [physics, q-fin], pages 51–67, 2007.

[24] Boris Marjanovic. Huge Stock Market Dataset. https://kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs.

[25] F. Maxwell Harper and Joseph A. Konstan. The MovieLens Datasets: History and Context. ACM Trans. Interact. Intell. Syst., 5(4):19:1–19:19, December 2015.

[26] Code for comparison write ups of InfluxDB and other solutions: Influxdata/influxdb-comparisons. InfluxData, May 2019.

[27] Tomás Senart. HTTP load testing tool and library. tsenart/vegeta. https://github.com/tsenart/vegeta, May 2019.

[28] Jef Poskanzer. http_load. https://acme.com/software/http_load/.

[29] NoSQL data store using the seastar framework, compatible with Apache Cassandra: Scylladb/scylla. https://github.com/scylladb/scylla, May 2019.

List of Abbreviations

ACID Atomicity, Consistency, Isolation, Durability
API Application Programming Interface
ARIMA AutoRegressive Integrated Moving Average
CAP Consistency, Availability and Partition Tolerance
CERN European Organization for Nuclear Research
CPU Central Processing Unit
CRUD Create, Read, Update and Delete
CSV Comma-separated values
CTSDB Cloud Time Series Database
DBMS Database management system
ETF Exchange-Traded Fund
HTTP HyperText Transfer Protocol
IEEE Institute of Electrical and Electronics Engineers
IMDb Internet Movie Database
IoT Internet of Things
JSON JavaScript Object Notation
KPI Key Performance Indicator
MIT Massachusetts Institute of Technology
NoSQL Not Only SQL
OLAP Online Analytical Processing

RAM Random Access Memory
RDBMS Relational Database Management System
REST Representational State Transfer
SNAP Stanford Network Analysis Project
SQL Structured Query Language
STAC Securities Technology Analysis Center
TPC Transaction Processing Performance Council
TS Time Series
TSDB Time Series Database
TSDL Time Series Data Library
UCI University of California, Irvine
UDP User Datagram Protocol
URL Uniform Resource Locator
USD United States Dollar
YCSB Yahoo! Cloud Serving Benchmark

List of Figures

5.1 Storage efficiency of different TSDBs in bytes per data point
5.2 Relative storage efficiency of different TSDBs per data point compared to the CSV source format
5.3 Data points ingested per second
5.4 Maximum requests per second
5.5 Mean latency per request
5.6 95th percentile of latency per request
5.7 Mean size in bytes of the TSDB response

List of Tables

3.1 Evaluation of existing TSDB benchmarks

4.1 Overview of workload data sets

A.1 Data ingestion speed in points per second
A.2 Storage efficiency per data point, relative to the CSV source format
A.3 Maximum requests per second performed using representative queries
A.4 TSDB mean request latency in milliseconds for representative queries
A.5 TSDB 95th percentile request latency in milliseconds for representative queries
A.6 Number of timeouts during the latency and response size tests
A.7 TSDB mean response size in bytes for representative queries