Implementing Compression on Distributed Time Series Database

Michael Burman

School of Science

Thesis submitted for examination for the degree of Master of Science in Technology.
Espoo 05.11.2017

Supervisor: Prof. Kari Smolander
Advisor: Mgr. Jiri Kremser

Aalto University, P.O. BOX 11000, 00076 AALTO
www.aalto.fi

Abstract of the master’s thesis

Author: Michael Burman
Title: Implementing compression on distributed time series database
Degree programme:
Major: Computer Science
Code of major: SCI3042
Supervisor: Prof. Kari Smolander
Advisor: Mgr. Jiri Kremser
Date: 05.11.2017
Number of pages: 70+4
Language: English

Abstract

The rise of microservices and distributed applications in containerized deployments is putting an increasing burden on monitoring systems. They push the storage requirements needed to provide suitable performance for large queries. In this paper we present the changes we made to our distributed time series database, Hawkular-Metrics, and how it stores data more effectively in Cassandra. We show that our methods provide significant space savings, ranging from a 50 to 95% reduction in storage usage, while reducing query times by over 90% compared to the nominal approach when using Cassandra. We also present our algorithm, modified from the Gorilla compression algorithm, that we use in our solution and that provides almost three times the compression throughput with an equal compression ratio.

Keywords: timeseries, compression, performance, storage

Aalto University, P.O. BOX 11000, 00076 AALTO
www.aalto.fi

Abstract of the master’s thesis (in Finnish, translated)

Author: Michael Burman
Title: Compression methods in a distributed time series database (Pakkausmenetelmät hajautetussa aikasarjatietokannassa)
Degree programme:
Major: Computer Science
Code of major: SCI3042
Supervisor and advisor: Prof. Kari Smolander
Date: 05.11.2017
Number of pages: 70+4
Language: English

Abstract

The growing prevalence of distributed systems has increased the amount of data in monitoring systems, as the number of time series has grown and data is stored in them more frequently. This has placed a growing load on disk systems, which struggle to serve ever larger queries. In this paper we present changes to our distributed time series database, Hawkular-Metrics, that use more efficient compression and organization of data when it is stored in Cassandra. We made queries almost ten times faster and at the same time reduced disk space requirements by 50-95%, depending on the data set. We also present our modifications to the Gorilla compression algorithm, which we use to achieve these results. Our modifications make compression almost three times faster than the original algorithm without lowering the compression ratio.

Keywords: time series, performance, compression methods

Preface

I would like to thank Red Hat for providing me with an interesting problem to work on and my team for providing valuable feedback. Special thanks go to John Sanda for great discussions, code reviews and ideas, and to Filip Brychta for relentless quality testing. I would also like to thank Kari Smolander for feedback on the thesis. In the end, this paper would not have been possible without the support of my wife and my kids, Linnea, Elli and Nella.
Otaniemi, 5.11.2017
Michael Burman

Contents

Abstract
Abstract (in Finnish)
Preface
Contents
Symbols and abbreviations
1 Introduction
  1.1 Outline
2 Preliminaries
  2.1 Time series
    2.1.1 Time series storage
  2.2 Hawkular
    2.2.1 Hawkular-Metrics
  2.3 Cassandra
    2.3.1 Storage
  2.4 Related work
    2.4.1 Gorilla / Beringei
    2.4.2 Chronix
    2.4.3 Prometheus
    2.4.4 InfluxDB
    2.4.5 Cassandra based time series storages
3 Compression
    3.0.1 Dictionary based coding - LZ77 and LZ78
    3.0.2 Huffman coding
    3.0.3 Arithmetic coding
    3.0.4 Asymmetric Numeral Systems
    3.0.5 Run-Length Encoding
    3.0.6 Delta coding
  3.1 Generic compression algorithms
    3.1.1 LZ4
    3.1.2 Deflate
    3.1.3 Zstandard
  3.2 Gorilla compression
    3.2.1 Compressing time stamps
    3.2.2 Compressing values
  3.3 Integer compression techniques
    3.3.1 Variable Byte, Byte-aligned
    3.3.2 The Simple family, Word-aligned
    3.3.3 Frame-Of-Reference
    3.3.4 Patched coding
    3.3.5 Improvements to PFOR
  3.4 Value predictors
    3.4.1 Computational Predictors
    3.4.2 Context Based Predictors
  3.5 Double-precision floating point compression algorithms
  3.6 Implementing compression algorithm
4 Research material and methods
  4.1 Experiments setup
    4.1.1 Hardware
    4.1.2 Software
  4.2 Compression method evaluation criteria
  4.3 Architectural evaluation criteria
  4.4 Data sets
    4.4.1 Synthetic data
    4.4.2 Real world data
5 Design and implementation
  5.1 Compression algorithm implementations
    5.1.1 Simple family implementations
    5.1.2 Gorilla implementation
    5.1.3 FPC implementation
  5.2 Implementing compression to Hawkular-Metrics
    5.2.1 Data modeling
    5.2.2 Execution
    5.2.3 Transforming data to wide-column format
  5.3 Temporary tables solution
    5.3.1 Table management
    5.3.2 Writing data
    5.3.3 Reading data
    5.3.4 Data partitioning strategies
6 Experiments
  6.1 Data set compression ratio
    6.1.1 Generated data sets
    6.1.2 Real world data sets
  6.2 Performance of the compression job
  6.3 Query performance
7 Observations
  7.1 Compression ratio
  7.2 Compression speed
  7.3 Selecting the compression algorithm
  7.4 Architectural choices
  7.5 Query performance
  7.6 Future work
8 Summary
References
A Optimized Java
  A.1 Intrinsics
  A.2 Inlining
  A.3 Superword Level Parallelism
  A.4 Generic
B Source code repositories
    B.0.1 Repositories
  B.1 Licensing

Symbols and abbreviations

Symbols
δ            delta

Operators
⊕            eXclusive OR

Abbreviations
GC           Garbage Collector
JIT          Just In Time compiler
JVM          Java Virtual Machine
ODS          Facebook’s Operational Data Store
LSM          Log-Structured Merge Tree
TSM          Time-Structured Merge Tree
RLE          Run-Length Encoding
FFT          Fast Fourier Transform
LZ77         Lempel and Ziv compression algorithm from the 1977 paper
LZ78         Lempel and Ziv compression algorithm from the 1978 paper
FOR          Frame-Of-Reference
PFOR         Patched Frame-Of-Reference
AFOR-1       Delbru et al. 32-integer version of binary packing
AFOR-3       Greedy algorithm for binary packing partitioning
VSEncoding   Vector of Split Encoding
FCM          Finite Context Method Predictors
DFCM         Differential FCM
ANS          Asymmetric Numeral System
FSE          Finite State Entropy
Zstd         Zstandard
NSM          n-ary storage model
DSM          decomposition storage model
SLP          Superword Level Parallelism

1 Introduction

Distributed applications are generating an increasing amount of data, given all the modern possibilities for instrumenting an application. This data must be stored and processed for queries that require historical knowledge, such as autoscaling based on historical usage data, or when customers use it for internal or external billing, in which case the exact data must be kept and stored in multiple copies. Containers, which are isolated application images that package everything needed to run an application, including code and runtime requirements [126], have been the hype in infrastructure development for a few years now.
To manage large numbers of containers, solutions such as Kubernetes [127], which automates the deployment, management and scaling of containers, have risen in popularity. Because containers have lower overhead than virtual machines, they allow infrastructure to run workloads with higher efficiency. Containerized systems have been a natural deployment platform for microservices and lambda architectures, thus driving up the number of components in the system. These changes have required monitoring systems to digest more and more data all the time, quickly growing the amount of
