Time Series Compression: a Survey
Giacomo Chiarot, Claudio Silvestri

• Both authors are with the Department of Environmental Sciences, Informatics, and Statistics of Ca' Foscari University of Venice. E-mail: [email protected], [email protected].
• Claudio Silvestri is also with the European Center For Living Technology, hosted by Ca' Foscari University of Venice.

Abstract—The presence of smart objects is increasingly widespread, and their ecosystem, also known as the Internet of Things, is relevant in many different application scenarios. The huge amount of temporally annotated data produced by these smart devices demands efficient techniques for the transfer and storage of time series data. Compression techniques play an important role toward this goal: although standard compression methods can be used with some benefit, several techniques specifically address the case of time series, exploiting their peculiarities to achieve more effective compression and, in the case of lossy techniques, more accurate decompression. This paper provides a state-of-the-art survey of the principal time series compression techniques, proposing a taxonomy that classifies them according to their overall approach and their characteristics. Furthermore, we analyze the performance of the selected algorithms by discussing and comparing the experimental results that were provided in the original articles. The goal of this paper is to provide a comprehensive and homogeneous reconstruction of the state of the art, which is currently fragmented across many papers that use different notations and where the proposed methods are not organized according to a classification.

Index Terms—Time series, Data Compaction and Compression, Internet of Things, State-of-the-art Survey, Sensor Data, Approximation, and Data Abstraction.

1 INTRODUCTION

Time series are relevant in several different contexts, and the Internet of Things ecosystem (IoT) is among the most pervasive ones. IoT devices, indeed, can be found in many applications, ranging from health care (smart wearables) to industrial ones (smart grids) [2], producing very large amounts of data. For instance, a single Boeing 787 flight can produce about half a terabyte of data from sensors [26]. In those scenarios, characterized by high data rates and volumes, time series compression techniques are a sensible choice to increase the efficiency of collection, storage, and analysis of sensor data. In particular, the need to include in the analysis information related to both the recent and the past history of the data stream leads to considering data compression as a solution to optimize space without losing the most important information. A direct application of time series compression, for example, can be seen in Time Series Management Systems (or Time Series Databases), in which compression is one of the most significant steps [15].

There exists an extensive literature on data compression algorithms, both general-purpose ones for finite-size data and domain-specific ones, for example for images and for video and audio data streams. This survey aims at providing an overview of the state of the art in time series compression research, specifically focusing on general-purpose data compression techniques that are either developed for time series or work well with time series. The algorithms we chose to summarize are able to deal with the continuous growth of time series over time and are suitable for generic domains (as in the different applications in the IoT). Further, these algorithms take advantage of the peculiarities of time series produced by sensors, such as:

• Redundancy: some segments of a time series can frequently appear inside the same or other related time series;
• Approximability: sensors in some cases produce time series that can be approximated by functions (see the sketch after this list);
• Predictability: some time series can be predictable, for example using deep neural network techniques.
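As a toy illustration of the approximability property (an example of ours, not taken from the survey), the following Python sketch approximates a slowly varying synthetic sensor signal with a cubic polynomial, so that a handful of coefficients can stand in for many raw samples:

    # Illustrative only: a smooth "sensor" signal approximated by a cubic polynomial.
    import numpy as np

    t = np.arange(100, dtype=float)                  # timestamps
    x = 20.0 + 0.05 * t + 0.5 * np.sin(t / 10.0)     # synthetic temperature-like readings

    coeffs = np.polyfit(t, x, deg=3)                 # 4 coefficients instead of 100 samples
    x_hat = np.polyval(coeffs, t)
    print("max abs error:", np.max(np.abs(x - x_hat)))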
Fig. 1: Visual classification of time series compression algorithms

The main contribution of this survey is to present a reasoned summary of the state of the art in time series compression algorithms, which is currently fragmented among several sub-domains ranging from databases to IoT sensor management. Moreover, we propose a taxonomy of time series compression techniques based on their approach (dictionary-based, functional approximation, autoencoders, sequential, others) and their properties (adaptiveness, lossless reconstruction, symmetry, tuneability), anticipated in visual form in Figure 1 and discussed in Section 3, that will guide the description of the selected approaches. Finally, we recapitulate the results of the performance measurements reported in the described studies.

The article is organized as follows. Section 2 provides some definitions regarding time series, compression, and quality indices. Section 3 describes compression algorithms and is structured according to the proposed taxonomy. Section 4 summarizes the experimental results found in the studies that originally presented the approaches we describe. A summary and conclusions are presented in Section 5.

2 BACKGROUND

This section gives a formal definition of time series, compression, and quality indices.

2.1 Time series

Time series are defined as a collection of data, sorted in ascending order according to the timestamp t_i associated with each element. They are divided into:

• Univariate Time Series (UTS): the elements inside the collection are real values;
• Multivariate Time Series (MTS): the elements inside the collection are arrays of real values, in which each position in the array is associated with a time series feature.

For instance, the temporal evolution of the average daily price of a commodity, as the one represented in the plot in Figure 2, can be modeled as a univariate time series, whereas the summaries of daily exchanges for a stock (including opening price, closing price, volume of trades, and other information) can be modeled as a multivariate time series.

Fig. 2: Example of a UTS representing stock fluctuations

Using a formal notation, a time series can be written as:

    TS = [(t_1, x_1), ..., (t_n, x_n)],   x_i ∈ ℝ^m                          (1)

where n is the number of elements inside the time series and m is the vector dimension of a multivariate time series. For univariate time series, m = 1. Given k ∈ [1, n], we write TS[k] to indicate the k-th element (t_k, x_k) of the time series TS.

A time series can be divided into segments, defined as a portion of the time series without any missing elements and with the ordering preserved:

    TS_[i,j] = [(t_i, x_i), ..., (t_j, x_j)]                                 (2)

where ∀k ∈ [i, j], TS[k] = TS_[i,j][k − i + 1].
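To make this notation concrete, the following Python sketch (ours; the type and function names are illustrative and do not come from the paper) represents a time series as a list of (timestamp, value-vector) pairs and implements the 1-based element access TS[k] and the segment TS_[i,j] defined above:

    # Illustrative sketch of the notation in Section 2.1 (names are ours).
    from typing import List, Tuple

    TimeSeries = List[Tuple[float, List[float]]]   # pairs (t_i, x_i), with x_i in R^m (m = 1 for a UTS)

    def element(ts: TimeSeries, k: int) -> Tuple[float, List[float]]:
        """Return the k-th element (t_k, x_k), with k in [1, n]."""
        return ts[k - 1]

    def segment(ts: TimeSeries, i: int, j: int) -> TimeSeries:
        """Return TS_[i,j]: the contiguous portion from the i-th to the j-th element."""
        return ts[i - 1:j]

    # A univariate series (m = 1) of daily prices.
    ts: TimeSeries = [(1.0, [10.2]), (2.0, [10.4]), (3.0, [10.1]), (4.0, [10.7])]
    assert segment(ts, 2, 3)[0] == element(ts, 2)   # TS_[2,3][1] == TS[2]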
2.2 Compression

Data compression, also known as source coding, is defined in [27] as "the process of converting an input data stream (the source stream or the original raw data) into another data stream (the output, the bitstream, or the compressed stream) that has a smaller size". This process can take advantage of the Simplicity Power (SP) theory, formulated in [39], in which the goal of compression is to remove redundancy while retaining high descriptive power.

The decompression process, complementary to the compression one, is also indicated as source decoding, and tries to reconstruct the original data stream from its compressed representation.

Compression algorithms can be described by the combination of different classes, shown in the following list:

• Non-adaptive - adaptive: a non-adaptive algorithm does not need a training phase to work efficiently with a particular dataset or domain, since its operations and parameters are fixed, while an adaptive one does;
• Lossy - lossless: an algorithm is lossy if the decoder does not return a result that is identical to the original data, and lossless if the decoder result is identical to the original data;
• Symmetric - non-symmetric: a symmetric algorithm uses the same algorithm as encoder and decoder, working in different directions, whereas a non-symmetric one uses two different algorithms.

In the particular case of time series compression, a compression algorithm (encoder) takes as input one time series TS of size s and returns its compressed representation TS' of size s', where s' < s and the size is defined as the bits needed to store the time series: E(TS) = TS'. From the compressed representation TS', using a decoder, it is possible to reconstruct the original time series: if D(TS') = TS, then the algorithm is lossless, otherwise it is lossy. Section 3 presents the most relevant categories of compression techniques and their implementations.

2.3 Quality indices

To measure the performance of a compression encoder for time series, three characteristics are considered: compression ratio, speed, and accuracy.

Compression ratio. This metric measures the effectiveness of a compression technique and is defined as:

    ρ = s' / s                                                               (3)

where s' is the size of the compressed representation and s the size of the original time series. Its inverse, 1/ρ, is named compression factor. An index used for the same purpose is the compression gain, defined as:

    c_g = 100 log_e (1/ρ)                                                    (4)

Speed. The unit of measure for speed is cycles per byte (CPB), defined as the average number of computer cycles needed to compress one byte.

Accuracy, also called distortion criteria, measures the fidelity of the reconstructed time series with respect to the original. It is possible to use different metrics to determine fidelity [28]:

• Mean Squared Error: MSE = (Σ_{i=1}^{n} (x_i − x̃_i)^2) / n, where x̃_i denotes the reconstructed value of x_i;
• Root Mean Squared Error: RMSE = √MSE;
• Signal to Noise Ratio: SNR = ((Σ_{i=1}^{n} x_i^2) / n) / MSE;
• Peak Signal to Noise Ratio: PSNR = x_peak^2 / MSE, where x_peak is the maximum value in the original time series.
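A minimal illustrative sketch of these indices follows (our example: zlib serves only as a stand-in general-purpose lossless encoder E and decoder D, not as one of the time-series-specific methods surveyed). It performs a lossless round trip on a serialized univariate series, computes ρ, the compression factor, and c_g, and defines the accuracy metrics used for the lossy case:

    # Illustrative only: quality indices of Section 2.3 on a synthetic UTS.
    import math
    import pickle
    import zlib
    from typing import List, Tuple

    TS: List[Tuple[float, float]] = [(float(t), 20.0 + 0.1 * t) for t in range(1000)]

    raw = pickle.dumps(TS)            # original representation, size s (in bytes here)
    compressed = zlib.compress(raw)   # E(TS) = TS', size s'
    restored = pickle.loads(zlib.decompress(compressed))
    assert restored == TS             # lossless: D(TS') equals TS

    s, s_prime = len(raw), len(compressed)
    rho = s_prime / s                 # compression ratio (3)
    factor = 1 / rho                  # compression factor
    c_g = 100 * math.log(1 / rho)     # compression gain (4), natural logarithm

    # Accuracy indices for a lossy reconstruction x_tilde of the values x.
    def mse(x: List[float], x_tilde: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(x, x_tilde)) / len(x)

    def rmse(x: List[float], x_tilde: List[float]) -> float:
        return mse(x, x_tilde) ** 0.5

    def snr(x: List[float], x_tilde: List[float]) -> float:
        return (sum(a * a for a in x) / len(x)) / mse(x, x_tilde)

    def psnr(x: List[float], x_tilde: List[float]) -> float:
        return max(x) ** 2 / mse(x, x_tilde)

    x = [v for _, v in TS]
    x_tilde = [v + 0.05 for v in x]   # a hypothetical lossy reconstruction
    print(f"rho={rho:.3f}  factor={factor:.1f}  gain={c_g:.1f}  RMSE={rmse(x, x_tilde):.3f}")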
The choice of the atom length should guarantee a low decompression error and maximize the compression factor at the same time. Listing 1 shows how the training phase works at a high level: the createDictionary function computes a dictionary of segments given a dataset composed of time series and a threshold value th.

Listing 1: Training phase

    createDictionary(Stream S, Threshold th, int segmentLength) {
        Dictionary d;
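Since Listing 1 only shows the opening of the training routine, the following Python sketch gives one possible reading of the behaviour described above; the naming and the use of Euclidean distance as dissimilarity measure are our assumptions, not the survey's actual algorithm. The stream is cut into fixed-length segments, and a segment becomes a new dictionary atom only if no stored atom lies within the threshold th:

    # Rough, hypothetical sketch of the dictionary training phase (our interpretation).
    from typing import List

    def euclidean(a: List[float], b: List[float]) -> float:
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

    def create_dictionary(stream: List[float], th: float, segment_length: int) -> List[List[float]]:
        dictionary: List[List[float]] = []
        for start in range(0, len(stream) - segment_length + 1, segment_length):
            segment = stream[start:start + segment_length]
            # Keep the segment as a new atom only if it is not already well represented.
            if all(euclidean(segment, atom) > th for atom in dictionary):
                dictionary.append(segment)
        return dictionary

    # Example: repeated patterns collapse into a few atoms.
    values = [1.0, 2.0, 3.0, 1.1, 2.1, 3.1, 9.0, 9.5, 9.9]
    print(create_dictionary(values, th=0.5, segment_length=3))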