Towards a Representative Benchmark for Time Series Databases


CONFIDENTIAL UP TO AND INCLUDING 03/01/2017 - DO NOT COPY, DISTRIBUTE OR MAKE PUBLIC IN ANY WAY

Towards a representative benchmark for time series databases

Thomas Toye
Student number: 01610806

Supervisors: Prof. dr. Bruno Volckaert, Prof. dr. ir. Filip De Turck
Counsellors: Dr. ir. Joachim Nielandt, Jasper Vaneessen

Master's dissertation submitted in order to obtain the academic degree of Master of Science in de industriële wetenschappen: elektronica-ICT

Academic year 2018-2019

Preface

I would like to thank my supervisors, Prof. dr. Bruno Volckaert and Prof. dr. ir. Filip De Turck. I am very grateful for the help and guidance of my counsellors, Dr. ir. Joachim Nielandt and Jasper Vaneessen. I would also like to thank my parents for their support, not only during the writing of this dissertation, but also during my transitionary programme and my master's.

The author gives permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In all cases of other use, the copyright terms have to be respected, in particular with regard to the obligation to state explicitly the source when quoting results from this master dissertation.
Thomas Toye, June 2019

Summary

As the fastest growing database type, time series databases (TSDBs) have experienced a rise in database vendors, and with it, a rise in difficulty in selecting the best one. TSDB benchmarks compare the performance of different databases to each other, but the workloads they use are not representative: they use random data, or synthesized data that is only applicable to one domain. This dissertation argues that these non-representative benchmarks may not always accurately model real world performance, and instead, representative workloads should be used in TSDB benchmarks. In this context, workloads are defined as consisting of data sets and queries. Workload data sets can be categorized using eight parameters (number of metrics, regularity, volume, data type, number of tags, tag value data type, tag value cardinality, variation).

A new benchmark was created, which uses three representative workloads next to a baseline non-representative workload. Results of this benchmark show significant performance differences for data ingestion speed for complex data, latency and maximum request rate (when broad time ranges are used), and storage efficiency of data points when comparing representative and non-representative workloads. The results show that existing benchmarks may not be accurate for real world performance.
Keywords: Time series database, representative benchmarking, load testing

Towards a representative benchmark for time series databases
Thomas Toye
Supervisor(s): Bruno Volckaert, Filip De Turck

Abstract: As the fastest growing database type, time series databases (TSDBs) have experienced a rise in database vendors, and with it, a rise in difficulty in selecting the best one. TSDB benchmarks compare the performance of different databases to each other, but the workloads they use are not representative: they use random data, or synthesized data that is only applicable to one domain. We argue that these non-representative benchmarks may not always accurately model real world performance, and instead, representative workloads should be used in TSDB benchmarks. In this context, workloads are defined as consisting of data sets and queries. Workload data sets can be categorized using eight parameters (number of metrics, regularity, volume, data type, number of tags, tag value data type, tag value cardinality, variation).
A new benchmark was created, which uses three representative workloads next to a baseline non-representative workload. Results of this benchmark show significant performance differences for data ingestion speed for complex data, latency and maximum request rate (when broad time ranges are used), and storage efficiency of data points when comparing representative and non-representative workloads. The results show that existing benchmarks may not be accurate for real world performance.

Keywords: Time series database, representative benchmarking, load testing

I. INTRODUCTION

Time series databases provide storage and interfacing for time series. In its simplest form, time series data are just data with an attached timestamp. This subtype of data has seen increasing interest in the last decade, especially with the rise of the Internet of Things, which produces time series for everything from temperature to sea levels. Other areas where time series are used are the financial industry (e.g. historical analysis of stock performance), the DevOps industry (e.g. capture of metrics from a server fleet) and the analytics industry (e.g. tracking ad performance over time).

Finding the best database to use is not an easy task. Eighty-three existing TSDBs were found by Bader et al. [1]. To determine the best one, benchmarks are used. However, these benchmarks may not be representative of the use case or industry the TSDB is needed for, which makes their results difficult to generalize.

In this abstract, we will first analyze existing TSDB benchmarks. Then, a new benchmark is proposed, which compares representative workloads to non-representative workloads. The results of this benchmark are analysed to

II. EVALUATION OF EXISTING BENCHMARKS

Chen et al. [2] consolidate the properties of a good benchmark as follows:
1. Representative: Benchmarks must simulate real world conditions; both the input to a system and the system itself should be representative of real world usage.
2. Relevant: Benchmarks must measure relevant metrics and technologies. Results should be useful to compare widely-used solutions.
3. Portable: Benchmarks should provide a fair comparison by being easily extensible to competing solutions that solve comparable problems.
4. Scalable: Benchmarks must be able to measure performance in a wide range of scale: not just single-node performance, but also cluster configurations.
5. Verifiable: Benchmarks should be repeatable and independently verifiable.
6. Simple: Benchmarks must be easily understandable, while making choices that do not affect performance.

Existing TSDB benchmarks were evaluated; a summary is shown in Table I. Two gaps in the state of the art are clear: current benchmarks insufficiently test TSDB performance at scale, and current benchmarks are not representative or only representative for a single use case. The data used is either random or synthetic; real world data are not used. This raises the question: are results of a non-representative benchmark generalizable to real world performance?

TABLE I: EVALUATION OF EXISTING TSDB BENCHMARKS
Criteria: Representative, Relevant, Portable, Scalable, Verifiable, Simple

Benchmark               Representative
TS-Benchmark            For IoT use cases
IoTDB-benchmark
TSDBBench
FinTime                 For financial use cases
influxdb-comparisons    For DevOps use cases

III. BENCHMARK COMPONENTS

A new benchmark is developed to compare benchmark performance between representative and non-representative workloads. Workloads consist of a workload data set that is loaded into the TSDB and a workload query set that executes upon it.

A. Data set

Time series data sets have the following properties in common: data arrives in order, updates are very rare to non-existent, deletion is rare, and data values follow a pattern.

They differ on the following characteristics:
• Metrics: Data points are organized in metrics, which can be compared to tables in relational databases.
• Regularity: In regular time series, data points are spaced evenly in time. Irregular time series do not emit data points regularly. Irregular time series are often the result of event triggers.
• Volume: High volume time series may emit hundreds of thousands of data points a second, while low volume time series only emit one event a day.
• Data type: Traditionally, values of data points in a time series have been integers or floating point numbers. But they can also be booleans, strings or even custom data types.
• Tags: A time series data point may have one or more tags associated with the timestamp and value. There may be no tags or a lot of tags. Tags may hold special values, such as geospatial information.
• Tag value cardinality: The number of possible combinations the tag values make. Three tags with two possible values each make a tag value cardinality of eight.
• Variation: While time series data usually follow a pattern, the variation in a series may be very different. One series may describe a flat line, while another may describe seasonal variations with daily spikes.

TABLE II: OVERVIEW OF WORKLOAD DATA SETS

                        Baseline   Financial   Rating      IoT
Metrics                 1          6           1           7
Regularity              Regular    Semi-reg.   Irregular   Regular
Volume                  Low        Low         Low         Low
Tags                    2          1           5           0
Tag value cardinality   10,000     7,164       20M         0
Variation               High       Low         High        Low
Total data points       20M        74.4M       20M         14.5M
License                 NA         CC0         Custom      CC-BY-4

B. Query set

Bader et al.

[...] set uses historical stock market information, the rating data set uses movie reviews and the IoT data set is produced by power information for a house.

V. EVALUATION

A. Storage efficiency

Figure 1 shows relative storage efficiency. The size in bytes
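The data set characteristics from Section III-A can be made concrete with a small sketch. This is only an illustration, not the benchmark's actual code: the names `tag_value_cardinality`, `is_regular` and `WorkloadDataSet`, and the tag names `region` and `host`, are hypothetical. It assumes that "number of possible combinations" means the product of the number of distinct values per tag, and that a data set without tags has cardinality 0, as the IoT column of Table II suggests.

```python
from dataclasses import dataclass
from math import prod


def tag_value_cardinality(values_per_tag: dict[str, int]) -> int:
    """Upper bound on distinct tag combinations (assumed: product of per-tag value counts)."""
    # An untagged data set has no tag combinations; report 0, matching the IoT column.
    if not values_per_tag:
        return 0
    return prod(values_per_tag.values())


def is_regular(timestamps: list[int]) -> bool:
    """True when consecutive data points are evenly spaced in time."""
    gaps = {later - earlier for earlier, later in zip(timestamps, timestamps[1:])}
    return len(gaps) <= 1


@dataclass(frozen=True)
class WorkloadDataSet:
    """Hypothetical record holding a subset of the workload data set parameters."""
    name: str
    metrics: int
    regularity: str
    volume: str
    tags: int
    tag_value_cardinality: int
    variation: str


# Baseline column of Table II expressed as a record.
baseline = WorkloadDataSet("Baseline", 1, "Regular", "Low", 2, 10_000, "High")

# Hypothetical tags: 3 regions and 4 hosts give 3 * 4 possible combinations.
example_cardinality = tag_value_cardinality({"region": 3, "host": 4})
```

A real data set may contain fewer combinations than this product, because not every combination of tag values has to occur; the function therefore gives an upper bound rather than a measured value.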