arXiv:2004.08284v3 [cs.DB] 6 Jun 2020

Time Series Data Cleaning with Regular and Irregular Time Intervals (Technical Report)

Xi Wang, Chen Wang
School of Software, Tsinghua University, Beijing 100084, China

Abstract
Errors are prevalent in time series data, especially in the industrial field. Data with errors could not be stored in the database, which results in the loss of data assets. At present, to deal with these time series containing errors, besides keeping the original erroneous data, discarding erroneous data and manually checking erroneous data, we can also use the cleaning algorithms widely used in the database field to automatically clean time series data. Handling the dirty data in time series with irregular time intervals is non-trivial. This survey provides a classification of time series data cleaning techniques and comprehensively reviews the state-of-the-art methods of each type. In particular, we have a special focus on the cleaning of time series with irregular time intervals. Besides, we summarize data cleaning tools, systems and evaluation criteria from research and industry. Finally, we highlight possible directions of time series data cleaning.

I. INTRODUCTION

Time series data can be defined [1] as a sequence of random variables, x1, x2, ..., xn, where the random variable x1 denotes the value taken by the series at the first time point, the variable x2 denotes the value for the second time period, ..., and xn denotes the value for the n-th time period, and so on.

Time series have been widely used in many fields such as financial economy, meteorology and hydrology, signal processing, industrial manufacturing, and so on [2]–[4]. Time series data are important in industry, where there are all kinds of sensor devices capturing data from the industrial environment uninterruptedly. Owing to the fact that the sensor devices are often unreliable, large amounts of time series data are dirty [5]. In the financial field, the (stock) time series data are often used to predict future commodity (stock) price movements. However, errors are very prevalent even in this field where accuracy matters most: even the relatively accurate stock data sets still contain erroneous values; for instance, in some data sets the correct rate of stock data from Yahoo Finance is 93%. The costs and risks of errors, conflicts, and inconsistencies in time series information have drawn widespread attention from businesses and government agencies. In recent work, data quality issues in time series data are studied, since time series pose unique data quality challenges due to the presence of autocorrelations, trends, seasonality, and gaps in the data [6]. According to Shilakes et al. [7], the annual growth rate of the data quality relevant market is about 17%, which is much higher than the approximately 7% growth rate of the IT industry. For instance, 30% to 80% of the time and cost in data warehousing project development can be spent on cleaning errors. The errors in time series can be either timestamp errors [8] or observed value errors; in this survey, we focus on the existing methods of dealing with observed value errors.

There are two types of methods commonly used in the industry when dealing with time series data with observed value errors: (1) Discarding erroneous data. First, the abnormal data are detected via an anomaly detection algorithm, and then the detected abnormal data are discarded. (2) Cleaning erroneous data. Data cleaning is divided into manual cleaning and automatic cleaning. There is no doubt that manual cleaning has a high accuracy rate, but it takes much more time and effort, while automatic cleaning is difficult to implement well.

The existing surveys of data cleaning [9], [10] mainly summarize the methods of dealing with data inconsistency, data integration, and erroneous data in the database. Karkouch et al. [11], [12] review the reasons for the generation of erroneous data and missing data in sensor data, and the techniques for improving data quality. However, the existing work does not provide a detailed overview of the state-of-the-art cleaning of erroneous observed values in time series. In this survey, we review the state-of-the-art of time series data cleaning, which may provide a tutorial for others.

A. Problem Statement

In this study, enlightened by related research [13]–[15], we summarize the common errors in time series into three categories, namely, single big error, single small error, and continuous errors. The characteristics of these three types of errors are described in detail below. As shown in Figure 1, this article takes the stock price of 30 consecutive trading days as an example. The red line indicates the true price of the stock in the 30 consecutive trading days, and the blue line indicates the observed price of the stock crawled from a website. For various reasons, the observed value may not be the same as the actual value. It can be seen in the figure that the observed values in the four consecutive trading days 8–11 are all 0, while the true values are 1.3, 1.2, 1.1 and 1.15, respectively; on the 20th trading day, the true value is 1.3 and the observed value is 2.4; and on the 25th trading day, the true value is 1.3 and the observed value is 1.6.

(1) Continuous errors. The so-called continuous errors are time series errors that occur at several consecutive time points. Specifically, continuous errors can be subdivided into several types [13], but they are no longer detailed here. The observed values from the 8th to the 11th trading days in Figure 1 are all 0, that is, continuous errors occur there.


Fig. 1. An Example of Error Types
Fig. 2. An Example of Translational Error

Continuous errors are common in real life. For instance, when someone is holding a smartphone and walking on the road, nearby tall buildings may have a lasting impact on the collected GPS information. Besides, system errors can also cause continuous errors.

(2) A single big error. A single point error is an error that occurs discontinuously in a time series and only occurs on a single data point at intervals. A big error means that the observed value of the data point is far from the true value. Remarkably, the size of the error is relative and closely related to the real situation of the data set. As shown on the 20th trading day in Figure 1, the observed value differs from the true value by 1.1. Compared with the 25th trading day, when the observed value differs from the true value by 0.3, the error of this data point is large, so this data point error is a single big error. The single big error is also very common in daily life. For instance, the motor vehicle oil level recorded by cursors may produce a single big error when the vehicle bumps on the road.

(3) A single small error. Similar to a single big error, such errors do not occur consecutively and only occur on a single data point at intervals, but the observed value of the data point differs from the true value by only a small distance, as on the 25th trading day in Figure 1. As stated in [16], the rationale behind single-point small errors is that people or systems always try to minimize possible errors. For instance, people may only make some small omissions when copying files.

(4) Translational error. As shown in Figure 2, where the x axis represents time and the y axis represents the value at the corresponding time, the red line represents the true value and the blue line represents the error value after the translation. There are not as many solutions to this type of error as to the types mentioned above.

Ignoring time series errors often results in unpredictable consequences for a series of applications such as query analysis. Thereby, time series cleaning algorithms are very important for mining the potential value of data. This paper reviews the cleaning algorithms and anomaly detection algorithms for time series data. By summarizing the existing methods, a reference or guidance is given to scholars interested in time series data cleaning, and based on this, the possible challenges and future work of time series cleaning topics are discussed.

B. Problem Challenge

For the problem of time series data cleaning, the following four difficulties have been discovered through the survey:

(1) The amount of data is large and the error rate is high. The main source of time series data is sensor acquisition. Especially in the industrial field, sensors distributed throughout the machine are constantly monitoring the operation of the machine in real-time. These sensors often collect data at a frequency of seconds, and the amount of data collected is quite large. For instance, the sensor collection interval of a wind power company's equipment is 7 seconds, each machine has more than 2000 working status fields, and more than 30 million pieces of data are collected every day, so the working status data of one day could exceed 60 billion values. However, the data collected by the sensors are often not accurate enough, and some physical quantities under observation are difficult to measure accurately. For instance, in a steel mill, affected by environmental disturbances, the surface temperature of the continuous casting slab cannot be accurately measured, or distortion may be caused by the power of the sensor itself.

(2) The reasons for generating time series data errors are complicated. People always try to avoid generating erroneous time series data; however, there are various time series errors. Besides the observed errors that we mentioned above, Karkouch et al. [12] also explain in detail the IoT data errors generated by various complex environments. IoT data is a common type of time series data, and it widely exists in the real world. The complex reasons for time series errors are also challenges we face in cleaning and analyzing data, which is different from traditional relational data.

(3) Time series data are continuously generated and stored. The biggest difference between time series data and relational data is that the time series is continuous. Thereby, for time series data, it is important that the cleaning algorithm supports online operations (real-time operations). The online anomaly detection or cleaning algorithm can monitor the physical quantity in real-time, detect the problem, and then promptly alarm or perform a reasonable cleaning.

Thereby, the time series cleaning algorithm is not only required to support online calculation or streaming calculation, but also to have good throughput.

(4) Minimum modification principle [17]–[19]. Time series data often contain many errors. Most of the widely used time series cleaning methods utilize the principle of smooth filtering. Such methods may change the original data too much and result in the loss of the information contained in the original data. Data cleaning needs to avoid changing the original correct data. It should be based on the principle of minimum modification, that is, the smaller the change, the better.

C. Organization

Different algorithms tackle these challenges in different ways, which usually include smoothing-based methods, constraint-based methods, and statistics-based methods, as shown in Table I. Besides, some time series anomaly detection algorithms can also be effectively used to clean data. The remainder of this paper is organized as follows. The aforesaid four types of algorithms are discussed from Section II to Section V, respectively. In Section VI we introduce existing time series cleaning tools, systems, and evaluation criteria. Finally, we summarize this paper in Section VII and discuss possible future directions.

TABLE I
THE OVERVIEW OF METHODS

Smoothing based (Section II): Moving Average [20]; AutoRegressive [4], [21], [22]; Autoregressive Moving Average Model [13], [23], [24]; Kalman Filter [25]–[32]; Interpolation [33], [34]; State-space Model [26], [35], [36]; Trajectory Simplification [37], [38]
Constraint based (Section III): Order Dependencies [39]–[42]; Denial Constraints [43]; Sequential Dependencies [44]; Speed Constraints [45]; Variance Constraints [46]; Similarity Rule Constraints [47]; Learning Individual Models [48]; Temporal Dependence [49], [50]
Statistics based (Section IV): Maximum Likelihood [51]–[55]; Bayesian Model [56], [57]; Markov Model [58]–[60]; Hidden Markov Model [61]–[64]; SMURF [65]–[67]; Spatio-Temporal Probabilistic Model [68]; Expectation-Maximization [69]; Relationship-dependent Network [70]–[72]
Anomaly detection (Section V): Density-Based Spatial Clustering of Applications with Noise [73], [74]; Local Outlier Factor [73]; Abnormal Sequence Detection [75]–[78]; Window-based Anomaly Detection [79]; Generative Adversarial Networks [80]–[82]; Long Short-Term Memory [83]–[85]; Speed-based cleaning algorithm [86]

II. SMOOTHING BASED CLEANING ALGORITHM

Smoothing techniques are often used to eliminate data noise, especially numerical data noise. Low-pass filtering, which removes the higher-frequency components of the data set, is a simple such algorithm. The characteristic of this type of technique is that the time overhead is small; however, because the original data may be modified considerably, which distorts the data and leads to uncertainty in the analysis results, there are not many applications in time series cleaning. The research on smoothing techniques mainly focuses on algorithms such as Moving Average (MA) [20], Autoregressive (AR) [4], [21], [22] and the Kalman filter model [25]–[27]. Thereby, this chapter mainly introduces these three techniques and their extensions.
A. Moving average

The moving average (MA) algorithm [20] is widely used for smoothing and prediction in time series. A simple moving average (SMA) algorithm calculates the average of the most recently observed time series values, which is used to predict the value at time t. A simple definition is shown in equation (1):

\hat{x}_t = \frac{1}{2n+1} \sum_{i=-n}^{n} x_{t+i}    (1)

In equation (1), \hat{x}_t is the predicted value of x_t, x_t represents the true value at time t, and 2n+1 is the window size (count). For a time series x(t) = 9.8, 8.5, 5.4, 5.6, 5.9, 9.2, 7.4, for simplicity, we use SMA with n = 3 and t = 4, then calculate according to equation (2):

\hat{x}_t = \frac{x_{t-n} + ... + x_{t+n}}{2n+1} = 7.4    (2)

To eliminate errors or noise, the data x(t) can be processed through a sliding time window, continuously calculating the local average over a given interval 2n+1 to filter out the noise (erroneous data) and get a smoother measurement.

In the weighted moving average (WMA) algorithm, data points at different relative positions in the window have different weights. It is generally defined as:

\hat{x}_t = \sum_{i=-n}^{n} \omega_i x_{t+i}    (3)

In equation (3), ω_i represents the weight of the influence of the data point at relative position i on the data point at position t; the other definitions follow the example above. A simple strategy is that the farther apart two data points are, the smaller their mutual influence; for instance, a natural idea is to take the reciprocal of the distance between two data points as the weight of their mutual influence. Similarly, the weight of each data point in the exponentially weighted moving average (EWMA) algorithm [32] decreases exponentially with increasing distance, which is mainly used for unsteady time series [87], [88].

Aiming at the need for rapid response in sensor data cleaning, Zhuang et al. [31] propose an intelligent weighted moving average algorithm, which calculates weighted moving averages via confidence data collected from sensors. Zhang et al. [89] propose a method based on multi-threshold control and approximate positive transform to clean probe vehicle data, and fill the lost data with the weighted average method and the exponential smoothing method. Qu et al. [90] first use cluster-based methods for anomaly detection and then use exponentially weighted averaging for data repair, which is applied to clean power data in a distributed environment.
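To make the smoothing operations concrete, the following is a minimal sketch of SMA and WMA cleaning following equations (1)–(3); it is not taken from any of the surveyed systems, and the truncated border handling and inverse-distance weights are illustrative assumptions.

```python
# A minimal sketch of simple and weighted moving-average smoothing,
# following equations (1) and (3); window size and weights are
# illustrative assumptions, not values prescribed by the survey.

def sma(series, n):
    """Replace each point by the mean of the window [t-n, t+n] (equation (1))."""
    cleaned = []
    for t in range(len(series)):
        window = series[max(0, t - n): t + n + 1]  # truncated at the borders
        cleaned.append(sum(window) / len(window))
    return cleaned

def wma(series, n):
    """Weighted moving average (equation (3)) with inverse-distance weights."""
    cleaned = []
    for t in range(len(series)):
        num, den = 0.0, 0.0
        for i in range(-n, n + 1):
            if 0 <= t + i < len(series):
                w = 1.0 / (abs(i) + 1)  # farther points get smaller weights
                num += w * series[t + i]
                den += w
        cleaned.append(num / den)
    return cleaned

x = [9.8, 8.5, 5.4, 5.6, 5.9, 9.2, 7.4]
print(sma(x, 3)[3])  # the SMA at t = 4 (0-based index 3) is 7.4, as in equation (2)
```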
B. Autoregressive

The Autoregressive (AR) model is a process that uses itself as a regression variable and uses a linear combination of the previous k random variables to describe the linear regression model of the random variable at time t. The definition of the AR model [21], [91] is shown in equation (4):

\hat{x}_t = \sum_{i=1}^{k} \omega_i x_{t-i} + \epsilon_t + a    (4)

In equation (4), \hat{x}_t is the predicted value of x_t, x_t represents the true value at time t, k is the order, µ is the mean value of the process, ε_t is white noise, ω_i is a parameter of the model, and a is a constant.

Park et al. [92] use labeled data y to propose an autoregressive with exogenous input (ARX) model based on the AR model:

\hat{y}_t = x_t + \sum_{i=1}^{k} \omega_i (y_{t-i} - x_{t-i}) + \epsilon_t    (5)

In equation (5), \hat{y}_t is the possible repair of x_t, and the other symbols are the same as in the aforesaid AR model. Alengrin et al. [24] propose the Autoregressive Moving Average (ARMA) model, which is composed of the AR model and the MA model. Besides that, the Gaussian Autoregressive Moving Average model is defined as shown in equation (6) [13]:

\Phi(B) Z_t = \theta_0 + \theta(B) x_t    (6)

In equation (6), Φ(B) = 1 − Φ_1 B − ... − Φ_p B^p and θ(B) = 1 − θ_1 B − ... − θ_q B^q are polynomials in B of degrees p and q, respectively, θ_0 is a constant, B is the backshift operator such that B Z_t = Z_{t−1}, and x_t is a sequence of independent Gaussian variates with mean µ = 0 and variance σ².

C. Kalman filter model

Kalman [25] proposes the Kalman filter theory, which can deal with time-varying systems, non-stationary signals, and multi-dimensional signals. The Kalman filter creatively incorporates errors (predictive and measurement errors) into the calculation; the errors exist independently and are not affected by the measured data. The Kalman model involves probability, random variables, the Gaussian distribution, the state-space model, etc. Considering that the Kalman model involves too much content, no specific description is given here, and only a simple definition is given. First, we introduce a system of discrete control processes which can be described by a linear stochastic difference equation as shown in equation (7):

x_t = m x_{t-1} + n v_t + p_t    (7)

Also, the measured values of the system are expressed as shown in equation (8):

y_t = r x_t + q_t    (8)

In equations (7) and (8), x_t is the system state value at time t, and v_t is the control variable value for the system at time t. m and n are system parameters, and for multi-model systems they are matrices; y_t is the measured value at time t; r is the parameter of the measurement system, and for multi-measurement systems r is a matrix; p_t and q_t represent the noises of the process and the measurement, respectively, and they are assumed to be white Gaussian noise.

The extended Kalman filter is the most widely used estimator for recursive nonlinear systems because it simply linearizes the nonlinear system models. However, the extended Kalman filter has two drawbacks: linearization can produce unstable filters, and it is hard to implement the derivation of the Jacobian matrices. Thereby, Ma et al. [96] present a new method of predicting the Mackey-Glass equation based on the unscented Kalman filter to solve these problems. In the field of signal processing, there are many works [28]–[30] based on Kalman filtering, but these techniques have not been widely used in the field of time series cleaning. Gardner et al. [32] propose a new model, which is based on the Kernel Kalman Filter, to perform various nonlinear time series processing. Zhuang et al. [31] use the Kalman filter model to predict sensor data and smooth it with WMA.
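As an illustration of how the predict-update cycle in equations (7) and (8) damps single big errors, the following is a minimal one-dimensional sketch; the scalar parameters, the omission of the control input, and the noise variances are illustrative assumptions rather than values from the cited works.

```python
# A minimal one-dimensional Kalman filter sketch following equations (7)
# and (8) with scalar system parameters m and r and no control input;
# the noise variances and the test signal are illustrative assumptions.

def kalman_1d(measurements, m=1.0, r=1.0, q_proc=1e-3, q_meas=0.5):
    """Smooth measurements y_t for x_t = m*x_{t-1} + p_t, y_t = r*x_t + q_t."""
    x_est, p_cov = measurements[0], 1.0  # initial state and error covariance
    smoothed = [x_est]
    for y in measurements[1:]:
        # predict step: propagate the state and its uncertainty
        x_pred = m * x_est
        p_pred = m * p_cov * m + q_proc
        # update step: blend the prediction with the new measurement
        k_gain = p_pred * r / (r * p_pred * r + q_meas)
        x_est = x_pred + k_gain * (y - r * x_pred)
        p_cov = (1.0 - k_gain * r) * p_pred
        smoothed.append(x_est)
    return smoothed

print(kalman_1d([1.0, 1.1, 5.0, 1.2, 1.05]))  # the spike at 5.0 is damped
```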

D. Trajectory Simplification

The main purpose of trajectory simplification is to reduce the original trajectory from N trajectory data points to M trajectory data points, M < N, while bounding the error introduced by the simplification [37], [38]. For error checking, the main task is to determine whether there is a simplified trajectory \hat{T} satisfying DAD(\hat{T}) < ε. Error-Search [37] is designed to retain the direction information of the track; the time complexity of the algorithm is O(N² log N) and the space complexity is O(N²). Error-Search is an accurate trajectory simplification algorithm that retains direction information, but its time complexity is too high. [38] proposed the approximate Span-Search algorithm, which uses approximate error instead, achieving time complexity O(N log² N) and space complexity O(N).

TABLE II
SUMMARY OF SMOOTHING

Reference | Method
[20] | MA
[31], [89] | WMA
[32], [89], [90] | EWMA
[21], [22], [91] | AR
[92], [94] | ARX
[13], [23], [24] | ARMA
[93], [95] | ARIMA
[25]–[32] | Kalman Filter Model
[96] | The Unscented Kalman Filter
[26], [35], [36] | The State-space Model
[33], [34] | Interpolation
[97]–[99] | To Estimate the Parameters of Models

E. Summary and Discussion

As shown in Table II, there are many methods based on smoothing, such as the state-space model [26], [35], [36] and interpolation [33], [34]. The state-space model assumes that the change of the system over time can be determined by an unobservable vector sequence, and the relationship between the time series and the observable sequence can be determined by the state-space model. By establishing state equations and observation equations, the state-space model provides a framework to fully describe the temporal characteristics of dynamic systems. To make this kind of smoothing algorithm perform better, many studies [97]–[99] have also proposed various techniques to estimate the parameters in the above methods. Most smoothing techniques have a small time overhead when cleaning time series, but they can very easily change the original correct data points, which greatly affects the accuracy of cleaning. In other words, correct data are altered, which can distort the results of the analysis and lead to uncertainty in the results.

Formal representation is as follows: t1 < t2 then xt1 ≤ xt2 Error checking. the main task is to determine whether there is where x(t) is mileage, t is timestamp. For instance, consider ˆ ˆ a simplified trajectory T , satisfy DAD(T ) <ǫ. Error-Search an example relation instance in Table III. The tuples are sorted is designed to retain the direction information of the track. The on attribute sequence number, which identifies sea level that 2 time complexity of the algorithm is O(N log N), the space rapidly increase from hour to hour. 2 complexity is O(N ). Error-Search is an accurate trajectory Generally, ODs in the form of equation (9) states that N is simplification algorithm that retains direction information, but strictly increasing with M. Such as equation (10). the time complexity is too high. [38] proposed an approximate algorithm Span-Search algorithm. Span-Search algorithm uses M →(0,∞) N (9) approximate error to replace time complexity O(N log2 N) hour ∞ height (10) space complexity O(N). →(0, ) ODs and DCs can also be used as an integrity constraint E. Summary and Discussion for error detection and data repairing in databases. Wijsen As shown in Table II, there are many methods based on [100], [101] extends ODs with a time dimension for tem- smoothing, such as the state-space model [26], [35], [36] poral databases. Let I = {I1, I2, I3,... } be a temporal and Interpolation [33], [34]. The state-space model assumes relation, which can be viewed as a time series of conventional that the system’s change over time can be determined by an “snapshot” relations, all over the same set of attributes. A unobservable vector sequence, the relationship between the trend dependency (TD) allows attributes with linearly ordered time series and the observable sequence can be determined domains to be compared over time by using any operator of by the state-space model. By establishing state equations and {<, =, >, ≤, ≥, 6=}. Consider the constraint is specified over observation equations, the state-space model provides a model (Ii, Ii+1) in I. For each time point i, it requires comparing framework to fully describe the temporal characteristics of employee records at time i with records at the next time dynamic systems. To make this kind of smoothing algorithm i +1, such that salaries of employees should never decrease. have a better effect, many studies [97]–[99] have also proposed Lopatenko et al. [43] propose a numerical type data cleaning various techniques to estimate the parameters in the above method based on Denial Constraints (DCs) as constraints, methods. Most smoothing techniques, when cleaning time whose principle is similar to this one. It is also notable that series, have a small-time overhead, but it is very easy to the given denial constraints could also be dirty and should be change the original correct data points, which greatly affects cleaned together with data [102]. the accuracy of cleaning. In other words, correct data are altered, which can distort the results of the analysis and lead B. Sequential Dependencies to uncertainty in the results. The sequential dependency algorithm proposed by Golab et al. [44] focuses on the difference in values between two III. CONSTRAINTBASEDCLEANINGALGORITHM consecutive data points in a time series. Golab et al. 
[44] In this section, we introduce several typical algorithms, define the CSD Tableau Discovery Problem as given a relation which include order dependencies (ODs) [39], sequential de- instance and an embedded SD M →g N, to find a tableau pendencies (SDs) [44] and speed constraints [45], for repairing tr of minimum size such that the CSD (M →g N,tr) has time series errors. confidence at least a given threshold. A CSD can be

For instance, a CSD can be (hour \to_{(0,\infty)} height, [1961.01.01 00:00–2016.01.01 00:00]). It states that for any two consecutive hours in [1961.01.01 00:00–2016.01.01 00:00], their distance should always be > 0.

Generally, a sequential dependency (SD) is in the form of equation (11):

M \to_g N    (11)

In equation (11), M ⊆ R are ordered attributes, N ⊆ R can be measured by certain distance metrics, and g is an interval. It states that when tuples are sorted on M, the distances between the N-values of any two consecutive tuples are within interval g. Fischer et al. [103] propose the concept of streaming mode to represent the structure and semantic constraints of data streams. The concept contains a variety of semantic information, including not only numeric values but also the order between attributes. The sequential dependency algorithm can be used not only for traditional relational database cleaning, but also for time series cleaning. In fact, there are many dependency-based cleaning algorithms designed for relational databases that are not suitable for time series data cleaning, such as Functional Dependencies (FDs) [104], Conditional Functional Dependencies (CFDs) [105], [106], Matching Dependencies (MDs) [107]–[109], Differential Dependencies (DDs) [110]–[112] or Comparable Dependencies (CDs) [113], [114]. The sequential dependency is one of the few dependency-based algorithms that can be used for time series cleaning.

C. Speed Constraints

To clean time series data, the speed constraint-based method [45] considers the restrictions that speed imposes on value changes in a given interval, following common sense such as the maximum flying speed of a bird, the temperatures in a day, car mileage, etc. A speed constraint with time window T is a pair of minimum speed S_min and maximum speed S_max over the time series x = x1, x2, ..., xt, where each x_i is the value of the i-th data point, with a timestamp i.

Generally, a speed constraint is in the form of equation (12):

S = (S_{min}, S_{max})    (12)

If time series data x satisfies the speed constraint S, then for any x_i, x_j in a time window T, it holds that S_min < (x_j − x_i)/(t_j − t_i) < S_max. In practical applications, speed constraints are often valid for a specific period of time. For instance, when considering the fastest speed of a car, the reference time period is often in hours, and two data points in different years are not compared. The values of the speed constraint S may be positive (the growth rate of the constrained value) or negative (the decline rate of the constrained value).

For instance, consider the time series x(t) = {150, 160, 170, 180, 110, 200, 210, 220, 230} with timestamps t = {1, 2, 3, 4, 5, 6, 7, 8, 9}, shown in Table IV. The value attribute corresponds to x, while the time attribute in Table IV denotes the timestamps.

TABLE IV
AN EXAMPLE RELATION INSTANCE OF TIME SERIES

tuple | time | value
t1 | 1 | 150
t2 | 2 | 160
t3 | 3 | 170
t4 | 4 | 180
t5 | 5 | 110
t6 | 6 | 200
t7 | 7 | 210
t8 | 8 | 220
t9 | 9 | 230

Fig. 3. An Example of Speed Constraints

Suppose a window size T = 2, S_min = −50, and S_max = 50 in the speed constraints. For the data points x5 and x4, (110 − 180)/(5 − 4) = −70 < −50. Similarly, x5 and x6 with speed (200 − 110)/(6 − 5) = 90 > 50 are violations of S_max = 50. To remedy the violations (denoted by red lines in Figure 3), a repair on x5 can be performed, i.e., x5' = 190, which is represented by the blue "∗" symbol in Figure 3. As illustrated in Figure 3, the repaired sequence satisfies both the maximum and minimum speed constraints. Speed constraints are less effective when dealing with small errors, and Wei Yin et al. [46] propose a further study of variance constraints, which use a variance threshold V to measure the degree of dispersion of the time series in a given window W.
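A minimal sketch of speed-constraint checking on the Table IV series follows; the direct assignment of the repaired value is a simplified stand-in for the exact repair computation of [45], and the demonstration values are taken from the example above.

```python
# A minimal sketch of speed-constraint violation detection in the spirit
# of [45], using the Table IV series; the hard-coded repair x_5' = 190
# mirrors Figure 3 and is not the general repair algorithm of [45].

def speed_violations(t, x, s_min, s_max, window):
    """Return pairs (i, j) within the window whose speed leaves [s_min, s_max]."""
    out = []
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            if t[j] - t[i] > window:
                break
            speed = (x[j] - x[i]) / (t[j] - t[i])
            if not (s_min <= speed <= s_max):
                out.append((i, j))
    return out

t = [1, 2, 3, 4, 5, 6, 7, 8, 9]
x = [150, 160, 170, 180, 110, 200, 210, 220, 230]
print(speed_violations(t, x, -50, 50, 2))  # flags the pairs involving x_5 = 110

x[4] = 190  # the repair x_5' = 190 from Figure 3 removes all violations
print(speed_violations(t, x, -50, 50, 2))  # []
```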

D. Temporal Dependence

Discovering and exploiting temporal dependence is highly desired in many applications. For example, in health care, the analysis of temporal dependence can be of value throughout the entire disease treatment process, from disease prevention to treatment. A temporal dependence contains two components: the causative behavior and the dependent behavior. The user's future behavior is affected by the causative behavior, and the causative behavior causes the dependent behavior. For one more example, in E-commerce networks, accurate analysis of users' time series activity records is of significant importance for advertising, marketing, and psychological studies. Qingchao Cai et al. [49] proposed recurrent cohort analysis to group users into cohorts, and Dehua Cheng et al. [50] improved the performance of temporal dependence recovery by reversing the timestamps of the original time series and combining both time series.
E. Summary and Discussion

In the field of relational databases, there are many cleaning algorithms based on integrity constraints; they are difficult to apply in the field of time series, in which the observed values are substantially numerical, because such constraints follow a strict equality relationship. A few methods, which we summarize in Table V, can be used for time series data cleaning. For instance, ODs and SDs can be used to solve problems in some scenarios, such as the number of miles on a car being non-decreasing. Furthermore, speed-based constraints can be used to process data such as GPS traces and stock prices, but only with relevant domain knowledge can a reasonable constraint be given. Therefore, constraint-based cleaning algorithms need to be further improved to achieve better robustness. Similarity rules [47], capturing a more general form of constraints on similarities between data values [115] for data repairing [116], [117] and imputation [47], [118], could be considered. Moreover, learning individual models [48] could help in repairing missing data in different scenarios. One possible future direction is to use anomaly detection methods to detect anomalies first, and then treat the outliers as missing values to repair. We will discuss anomaly detection in Section V.

TABLE V
SUMMARY OF CONSTRAINTS

Reference | Method
[39]–[42] | ODs
[100], [101] | Extended ODs
[43] | DCs
[44] | SDs
[104] | FDs
[105], [106] | CFDs
[45] | Speed Constraints
[46] | Variance Constraints
[47] | Similarity Rule Constraints
[48] | Learning Individual Models

IV. STATISTICS BASED CLEANING ALGORITHM

Statistics-based cleaning algorithms occupy an important position in the field of data cleaning. Such algorithms use models learned from data to clean the data. The statistics-based approach involves a lot of statistical knowledge, but this article focuses on statistics-based data cleaning methods, so we will not cover the statistical background in detail.

A. Maximum likelihood

The intuitive idea of the maximum likelihood principle is that of a random test: if there are several possible outcomes x1, x2, ..., xt, and the result x_i occurs in one test, it is generally considered that the test conditions are favorable for x_i, or that x_i has the highest probability of occurrence.

Notation: for given time series data x(t) that is consistent with a probability distribution d, assume that its probability mass function (discrete distribution) is F_d; consider a distribution parameter θ and a sample x1, x2, ..., xn from this distribution, then use F_d to calculate its probability [51] as shown in equation (13):

P(x_1, x_2, ..., x_n) = F_d(x_1, x_2, ..., x_n \mid \theta)    (13)

Ziekow et al. [52] use the maximum likelihood technique to clean Radio Frequency Identification (RFID) data. Wang et al. [53] propose the first maximum likelihood solution to address the challenge of truth discovery from noisy social sensing data. Yakout et al. [54] argue for a new data repairing approach that is based on maximizing the likelihood of replacement data given the data distribution, which can be modeled using statistical machine learning techniques, but this technology is used to repair the data of a database. For repairing time series data errors, Zhang et al. [55] propose a better solution based on maximum likelihood, which solves the problem from the perspective of probability. According to the probability distribution of the speed changes of adjacent data points in the time series, the time series cleaning problem can be converted into finding a cleaned time series whose speed changes have the greatest likelihood.

B. Markov model

A Markov process is a class of stochastic processes in which the transition of each state depends only on the previous n states. This process is called an n-order model, where n is the number of states that affect the transition. The simplest Markov process is the first-order process, where the transition of each state depends only on the state before it. A Markov process with discrete time and states is called a Markov chain, abbreviated as X_n = X(n), n = 0, 1, 2, .... The Markov chain [58] is a sequence of random variables X_1, X_2, X_3, .... The range of these variables, that is, the set of all their possible values, is called the "state space", and the value of X_n is the state at time n.

The Markov Model [59], [60] is a statistical model based on the Markov chain, which is widely used in speech recognition, automatic part-of-speech annotation, phonetic conversion, probabilistic grammar and other natural language processing applications. In order to find patterns that change over time, the Markov model attempts to build a process model that can generate these patterns. [59] and [60] use specific time steps and states, and make Markov assumptions; with these assumptions, such a pattern-generating system is a Markov process. A Markov process consists of an initial vector and a state transition matrix. One thing to note about this assumption is that the state transition probabilities do not change over time.

The Hidden Markov Model (HMM) [61], [62] is a statistical model based on the Markov Model, which is used to describe a Markov process with implicit unknown parameters. The difficulty is to determine the implicit parameters of the process from the observable parameters, and then use these parameters for further analysis, such as the prediction of time series data. For instance, after rolling a die 10 times, we could get a string of numbers such as 1, 4, 5, 3, 3, 1, 6, 2, 4, 5, as shown in Figure 4. This string of numbers is called the visible state chain. But in an HMM, we not only have such a string of visible states, but also a chain of implied states; in this example, the implicit state chain might be D5, D3, D2, D3, D4, D6, D1, D5, D1, D2. Gupta et al. [63] use an HMM to predict the price of stocks. Baba et al. [64] argue for a data cleaning method based on the HMM, which is used to clean RFID data related to geographic location information. In multi-dimensional time series cleaning, the HMM has more application space than single-dimensional cleaning algorithms because of the correlation between the dimensions.
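To illustrate how a learned first-order Markov model can flag suspicious points in a discrete series, the following is a minimal sketch; the states, the probability threshold, and the sample sequence are illustrative assumptions, not part of the cited methods.

```python
# A minimal sketch of a first-order Markov chain learned from a discrete
# series: transition probabilities are estimated by counting, and
# transitions with low probability are flagged as suspicious.

from collections import defaultdict

def fit_transitions(states):
    """Estimate P(s_t | s_{t-1}) from observed transition counts."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(states, states[1:]):
        counts[a][b] += 1
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}

def suspicious(states, probs, threshold=0.2):
    """Indices whose incoming transition has probability below the threshold."""
    return [i + 1 for i, (a, b) in enumerate(zip(states, states[1:]))
            if probs.get(a, {}).get(b, 0.0) < threshold]

series = list("AABABABABABXABAB")   # 'X' is a rare, likely erroneous state
probs = fit_transitions(series)
print(suspicious(series, probs))    # [1, 11]: the rare A->A and B->X transitions
```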
With these assumptions, this models, which learned from data, to clean data. The statistical- ability to generate a pattern system is a Markov process. based approach involves a lot of statistical knowledge, but this A Markov process consists of an initial vector and a state article focuses on statistical-based data cleaning methods, so transition matrix. One thing to note about this assumption is we won’t cover statistical-related knowledge in detail. that the state transition probability does not change over time. Hidden Markov Model (HMM) [61], [62] is a statisti- A. Maximum likelihood cal model based on Markov Model, which is used to de- scribe a Markov process with implicit unknown parameters. The intuitive idea of the maximum likelihood principle is a The difficulty is to determine the implicit parameters of random test, if there are several possible outcomes x1, x2...xt, the process from observable parameters, and then use these if the result xi occurs in one test, it is generally considered parameters for further analysis, such as prediction of time that the test conditions are favorable for xi, or think that xi series data. For instance, after rolling the dice 10 times, has the highest probability of occurrence. we could get a string of numbers, for example we might Notation: For a given time series data x(t), which consis- get such a string of numbers:1, 4, 5, 3, 3, 1, 6, 2, 4, 5 as shown tents with a probability distribution d, and assume that its in Figure 4. This string of numbers is called the visible probability aggregation function (discrete distribution) is Fd; state chain. But in HMM, we not only have such a string consider a distribution parameter θ, sampling x1, x2...xn from of visible state chains, but also a chain of implied state this distribution, then use Fd to calculate its probability [51] chains. In this example, the implicit state chain might be: as shown in equation (13). D5,D3,D2,D3,D4,D6,D1,D5,D1,D2. Gupta et al. [63] use HMM to predict the price of stocks. P = (x , x ...xn)= Fd(x , x ...xn|θ) (13) 1 2 1 2 Baba et al. [64] argue a data cleaning method based on Ziekow et al. [52] use the maximum likelihood technique to the HMM, which used to clean RFID data related to geo- clean Radio Frequency Identification (RFID) data. Wang et al. graphic location information. In multi-dimensional time series 8

Fig. 4. An Example of Hidden Markov cleaning, HMM has more application space than the single- the integrity constraint is greater than the current window size, dimensional cleaning algorithm, because of the correlation the algorithm linearly increases the current window size by 2 between the dimensions. steps and outputs the point data in the current window. If it is detected that the label does not move, the algorithm outputs C. Binomial Sampling the current window midpoint as the output point, and then continues to slide an epoch for the next processing. Shawn R. Jeffery et al. [67] propose an adaptive method SMURF algorithm is widely used to clean RFID data, and for cleaning RFID data, which exploits techniques based on many studies [65], [66] improve it. Leema, A et al. [65] study sampling and smoothing theory to improve the quality of RFID the effect of tag movement speed on data removal results and data. Tag transition detection: tag transition detection refers to H Xu et al. [66] consider the impact of data redundancy on the fact that when the position of the tag changes, the cleaning setting up sliding windows. result should reflect the fact that the tag leaves. Let’s introduce a few RFID-related concepts. (1) Interrogation cycles: the reader’s question-and-answer D. Spatio-Temporal Probabilistic Model process for the tag is the basic unit of the reader’s detection Besides data cleaning, M. Milani et al. [68] propose Spatio- tag. Temporal Probabilistic Model (STPM), this method learns (2) Reader read cycle (epoch, 0.2 sec - 0.25 sec): a collec- more detailed data patterns from historical data, and then tion of multiple interrogation cycles. cleans the current data. STPM not only gives joint probability Based on the above concept, the following definition: Wi distributions that are updated on the data set at different times, is smooth window of tag i and is composed of ωi epoch, but also distinguishes association updates from association Si is the window that tag i is actually detected in the Wi values. STPM based on Dynamic Probabilistic Relational window, Countt indicates the number of inquiry cycles of Models (DRPMs), so we need to state DRPMs model first. t, and R is the corresponding number of t epoch tag i. For The DRPMs is a graph model used to represent the rela- a given time window, suppose the probability that the tag i tionship between dynamic data sets, its models based on the R may be read in each epoch is pi = , and the Statistical dependency relationship between attributes, and generally uses Countt Soothing for Unreliable RFID Data (SMURF) [67] algorithm conditional probability distribution to calculate the probability treats each epoch’s reading of the tag as a Bernoulli experiment of each attribute value in a given parent node value and forms with probability pi. Therefore, pi conforms to the binomial a relationship chain. For instance, when we need to estimate distribution B(ωi,pi). pi,avg is the average read rate in Si. the data at time T , we can only use the data before time Using the model based on the Bernoulli experiment to T to infer, namely, the current state depends only on the observe the tag i, if the average reading rate of the tags in previous state, which is similar to the Markov Model. STPM ω ωi epoch is (1 − pi,avg) i . 
D. Spatio-Temporal Probabilistic Model

Besides data cleaning, M. Milani et al. [68] propose the Spatio-Temporal Probabilistic Model (STPM). This method learns fine-grained data patterns from historical data and then cleans the current data. STPM not only gives joint probability distributions over the updates of the data set at different times, but also distinguishes association updates from association values. STPM is based on Dynamic Probabilistic Relational Models (DPRMs), so we need to state the DPRM model first. A DPRM is a graph model used to represent the relationships between dynamic data sets; it models the dependency relationships between attributes, and generally uses conditional probability distributions to calculate the probability of each attribute value given the parent node values, forming a relationship chain. For instance, when we need to estimate the data at time T, we can only use the data before time T for inference; namely, the current state depends only on the previous state, which is similar to the Markov Model. STPM extends DPRMs to model the update patterns between data at different times, and captures spatial and temporal update patterns by modeling update events to provide the update relationships that may exist, so as to finally detect and repair data.

E. Others

Firstly, we summarize the methods described above in Table VI. The Bayesian prediction model is a technique based on Bayesian statistics; it utilizes model information, data information, and prior information, so its prediction effect is good and the model is widely used, including in the field of time series data cleaning. Wang et al. [56] establish a cost model for Bayesian analysis, which is used to analyze errors in the data.

TABLE VI
SUMMARY OF STATISTICS

Reference | Method
[51]–[55], [78] | Maximum Likelihood
[58]–[60] | Markov Model
[61]–[64] | HMM
[65]–[67] | SMURF
[68] | STPM
[56], [57] | Bayesian
[70]–[72] | RDN
[120] | GMM
[69] | EM

Bergman et al. [70] consider the user's participation and use the user's feedback on query results to clean the data. Mayfield et al. [72] propose a more complex relationship-dependent network (RDN [71]) model to capture the probability relationships between attributes. The difference between RDNs and traditional relational dependency models (such as Bayesian networks [57]) is that RDNs can contain ring structures. The method iteratively cleans the data set and observes the change in the probability distribution of the data set before and after each wash; when the probability distribution of the data set converges, the cleaning process is aborted. Zhou et al. [119] argue for a technique for accelerating the learning of Gaussian models using GPUs. The article believes that in the case of excessive data, it is not necessary to use all the data to learn the model; the authors also provide a method of automatic tuning. In order to clean and repair fuel level data, Tian et al. [120] propose a modified Gaussian mixture model (GMM) based on the synchronous iteration method, which uses the particle swarm optimization algorithm and the steepest descent algorithm to optimize the parameters of the GMM, and uses a linear interpolation-based algorithm to correct data errors. Shumway et al. [69] use the EM algorithm [121] combined with the state-space model [27], [35] to predict and smooth the time series.

V. TIME SERIES ANOMALY DETECTION

Gupta et al. [14] investigate the anomaly detection methods for time series data: for given time series data, there may be two types of outliers, namely single-point anomalies and subsequence anomalies (continuous anomalies). In this section, we first discuss the detection methods for abnormal points and abnormal sequences, next introduce the application of the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm in data cleaning, and then review the anomaly detection methods related to machine learning.

A. Abnormal point detection

For single-point anomalies, the most common idea is to use predictive models for detection. That is, the predicted value of the established model and the observed value of each data point are compared, and if the difference between the two values is greater than a certain threshold, the observed value is considered to be abnormal. Specifically, Basu et al. [122] select all data points with timestamps t − k to t + k, with the timestamp t as the center point, and the median of these data points is considered to be the predicted value of the data point with timestamp t. Hill et al. [21] first cluster the data points and take the average of the clusters as the predicted value of the point. The AR model and the ARX model are widely used for anomaly detection in various fields, such as economics, social surveys [3], [4], and so on. The ARX model takes advantage of manually labeled information, so it is more accurate than the AR model when cleaning data. The ARIMA model [123] represents a type of time series model consisting of the AR and MA models mentioned above, which can be used for data cleaning of non-stationary time series. Kontaki et al. [79] propose continuous monitoring of distance-based outliers over data streams. One of the most widely used definitions is the one based on distance, as shown in Figure 5: an object p is marked as an outlier if there are fewer than k objects within a given distance. Here k = 4, q is a normal point and p is the abnormal point.

Fig. 5. An Example of Abnormal Point Detection

B. Abnormal sequence detection

Different studies have different definitions of subsequence anomalies. Keogh et al. [75] proposed that a subsequence anomaly is a subsequence that has the largest distance from its nearest non-overlapping match. With this definition, the simplest calculation method is to compute the distance between each subsequence of length n and all other subsequences; of course, the time complexity of this calculation is very high. In later studies, Keogh et al. [124] propose a heuristic algorithm that reorders candidate subsequences, and Wei et al. [125] argue for an acceleration algorithm using locality-sensitive hash values. In calculating the distance, the Euclidean distance is usually used, and Keogh [126] further proposes a method using a compression-based similarity measure as the distance function. As shown in Figure 6, the data is divided into multiple sub-sequences that overlap each other. First, the abnormal score of each window is calculated, and then the abnormal score (AS) of the whole test sequence is calculated according to the abnormal score of each window. Window-based techniques can better locate anomalies compared to directly outputting the entire time series as an outlier. There are two main types of methods based on this technique. One is to maintain a normal database [127], [128], and then compare the test sequence with the sequences in the normal database to determine whether it is abnormal; the other is to build an anomalous database [76], [77], and then compare the test sequence with the sequences in that database to detect whether it is anomalous. [129] found that when the length of the error is unknown, grammar induction can aid anomaly detection without any prior knowledge.

Fig. 6. An Example of Abnormal Sequence Detection
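The brute-force discord definition of Keogh et al. [75] can be sketched directly; the quadratic scan below is only illustrative, and the toy data are assumptions.

```python
# A minimal brute-force sketch of subsequence-anomaly ("discord")
# detection in the sense of Keogh et al. [75]: the subsequence whose
# nearest non-overlapping match is farthest away gets the highest
# anomaly score. O(n^2) distance computations; data are illustrative.

import math

def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def top_discord(series, n):
    """Return (start index, score) of the most anomalous length-n subsequence."""
    subs = [series[i:i + n] for i in range(len(series) - n + 1)]
    best_idx, best_score = -1, -1.0
    for i, s in enumerate(subs):
        # distance to the nearest non-overlapping subsequence
        nearest = min(euclidean(s, t) for j, t in enumerate(subs)
                      if abs(i - j) >= n)
        if nearest > best_score:
            best_idx, best_score = i, nearest
    return best_idx, best_score

data = [1, 2, 1, 2, 1, 2, 9, 9, 1, 2, 1, 2, 1, 2]
print(top_discord(data, 3))  # the window covering the 9s scores highest
```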
In calculating the distance, the and abnormal sequences, next introduce the application of Euclidean distance is usually used, and Keogh [126] further Density-Based Spatial Clustering of Applications with Noise proposes a method using the compression-based similarity (DBSCAN) algorithm in data cleaning, and then review the measure as the distance function. As shown in Figure 6, the abnormal detection methods related to machine learning. data is divided into multiple sub-sequences that overlap each other. First, calculate the abnormal score of each window, A. Abnormal point detection and then calculate the abnormal score (AS) of the whole For single-point anomalies, the most common idea is to use test sequence according to the abnormal score (AS) of each predictive models for detection. That is, the predicted value window. Window-based techniques can better locate anomalies of the established model and the observed value for each compared to direct output of the entire time series as outliers. data point is compared, and if the difference between the two There are two main types of methods based on this technique. values is greater than a certain threshold, the observed value One is to maintain a normal database [127], [128], and then is considered to be an abnormal value. Specifically, Basu et compare the test sequence with the sequence in the normal 10

C. Density-Based Spatial Clustering of Applications with Noise

The DBSCAN [130] algorithm is a clustering method based on density-reachable relationships, which divides regions with sufficient density into clusters and finds clusters of arbitrary shape in a spatial database with noise, defining a cluster as the largest set of density-connected points. The algorithm then delimits clusters according to a set density threshold, that is, when the threshold is satisfied, the points can be considered a cluster.

The principle of the DBSCAN algorithm is as follows: (1) DBSCAN searches for clusters by checking the Eps neighborhood of each point in the data set. If the Eps neighborhood of a point p contains more points than MinPts, a cluster is created with p as the core object. (2) Then, DBSCAN iteratively aggregates the objects that are directly density-reachable from these core objects; this process may involve the consolidation of some density-reachable clusters. (3) When no new points can be added to any cluster, the process ends. Here MinPts is the minimum number of neighbor points for a given point to become a core object in its neighborhood, and Eps is the neighborhood radius. For instance, with Eps = 0.5 and MinPts = 3, the effect of clustering for a given data set is as shown in Figure 7. Some noise points can be repaired and clustered into the classes adjacent to them. Recent research [131] has shown that after repairing erroneous data, the accuracy of clustering on spatial data can be improved; they also perform cleaning experiments on GPS data based on DBSCAN. But this method cannot handle continuous errors and needs further improvement.

Fig. 7. An Example of DBSCAN Clustering
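A compact sketch of the DBSCAN principle described above follows; it is a quadratic-time teaching version with the Eps and MinPts values from the example, not an optimized implementation of [130], and the 2-D points are assumptions.

```python
# A minimal, quadratic-time sketch of the DBSCAN principle (Eps
# neighborhood and MinPts core points); Eps = 0.5 and MinPts = 3 mirror
# the example in the text, and the 2-D points are assumptions.

def dbscan(points, eps=0.5, min_pts=3):
    """Label each point with a cluster id; -1 marks noise."""
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1               # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:                     # expand density-reachable points
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster      # border point rescued from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:     # j is itself a core point
                queue.extend(nbrs)
    return labels

pts = [(0, 0), (0.1, 0), (0, 0.1), (3, 3), (3.1, 3), (3, 3.1), (9, 9)]
print(dbscan(pts))  # two clusters and one noise point (9, 9)
```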
D. Speed-based cleaning algorithm

The speed-based cleaning algorithm [86] (SC algorithm) performs anomaly detection through confidence intervals in the time dimension and density in the spatial dimension, as shown in Figure 8, and designs an algorithm for detecting continuous errors based on these two detection methods. Wang et al. [86] propose the SR model for repairing the data detected as abnormal. The SC algorithm is mainly used for cleaning time series data with irregular time intervals: to solve this problem, the authors exploit the speed characteristics of irregular time series data to design SC algorithms that can clean irregular time series data. Wang et al. [86] prove that the SR model is equivalent to the AR model on equal-interval time series, and give an incremental speed-based cleaning algorithm (the ISC algorithm). The time complexity of the SC algorithm is relatively high, and pruning can be considered in the future to reduce the time complexity.

Fig. 8. The Architecture Diagram of SC Algorithm

E. Generative Adversarial Networks

With the rapid development of machine learning technology, more and more problems are solved using machine learning. Dan Li et al. [80] use GANs to effectively detect anomalies in time series data. A GAN trains two models at the same time: a generation model for capturing the data distribution and a discriminant model for discriminating whether the data are real or pseudo data, as shown in Figure 9.

Fig. 9. A Simple Flow Chart of Generative Adversarial Networks

Given a random variable with a uniform probability distribution as input, we want the output to follow a target probability distribution, say a "dog probability distribution". The philosophy of Generative Matching Networks (GMNs), whose idea is to train the generative network by directly comparing the generated distribution with the true distribution, is to optimize the network by repeating the following steps:

(1) Generate some evenly distributed input;

(2) Let these inputs go through the network and collect the generated output;

(3) Compare the true "dog probability distribution" with the generated one based on the available samples (e.g., calculate the MMD distance between the real dog image samples and the generated image samples);

(4) Use backpropagation and gradient descent to compute the errors and update the weights. The purpose of this process is to minimize the loss of the generation model and the discriminant model.

Dan Li et al. [80] use GANs to detect abnormalities in time series, and a natural idea is to use GANs to repair missing values of time series data; perhaps more machine learning algorithms are waiting to be applied to the cleaning of erroneous time series values. A simple idea is to treat the detected anomalous data as missing data and then repair them. Y. Sun et al. [81] first analyze the similarity between parking space data and parking data, and then use recurrent GANs to generate parking data as repair data, which provides a new idea for solving the problem of time series data repair. C. Fang et al. [82] propose FuelNet, which is based on Convolutional Neural Networks (CNNs) and GANs; FuelNet is used to repair the inconsistent and impute the incomplete fuel consumption rates over time.

F. Long Short-Term Memory

Since the Recurrent Neural Network (RNN) suffers from the problem of gradient vanishing, it is difficult for it to process long-sequence data. F. A. Gers et al. [132] improve the RNN and obtain the special RNN variant Long Short-Term Memory (LSTM), which can avoid the gradient vanishing of the regular RNN. It has been widely used in industry, and [83]–[85] use LSTM to perform anomaly detection on time series data.

In Figure 10, the left picture is a simple RNN structure diagram and the right picture is a simple LSTM structure diagram, with the given function shown in equation (15):

h_t, y_t = f(h_{t-1}, x_t)    (15)

In equation (15), x_t is the data input in the current state, h_{t−1} (hidden state) indicates the input received from the previous node, y_t is the output in the current state, and h_t is the output passed to the next node. As can be seen from Figure 10, the output h_t is related to the values of x_t and h_{t−1}. y_t is often fed into a linear layer (mainly for dimension mapping), and then softmax is used to classify the required data.

Fig. 10. Simple RNN Structure and Simple LSTM Structure

As shown in Figure 10, the RNN has only one delivery state h_t, while the LSTM also has a delivery state c_t (cell state). There are three main stages within an LSTM:

(1) Forgetting phase. The forgetting phase mainly forgets part of the input passed in from the previous node. A simple principle is: forget the unimportant, remember the important. More specifically, z_f is calculated as a forget gate, which is used to control the previous state c_{t−1}, deciding what to retain and what to forget.

(2) Selective memory phase. At this stage, the input is selectively memorized. The main task is to remember the input x_t; the more important the data, the more of it needs to be reserved.

(3) Output phase. This phase determines which outputs will be treated as the current state. Similar to the normal RNN, the output y_t is often also obtained from a transformation of h_t.

Filonov, Pavel et al. [83] and Pankaj Malhotra et al. [133] train recurrent neural networks by providing them with network time series data. The recurrent neural network learns what the normal expected network activity is; when unfamiliar activity from the network is provided to the trained network, it can distinguish whether the activity is expected or an intrusion.
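As a sketch of how LSTM-based detection can work in practice, the following trains a small next-step predictor and flags points with large prediction error; the architecture, the 3-sigma-style threshold, and the toy data are illustrative assumptions, and PyTorch is used here merely as one possible framework.

```python
# A minimal sketch of LSTM-based anomaly detection on a univariate
# series: a one-layer LSTM is trained to predict the next value, and
# points with large prediction error are flagged.

import torch
import torch.nn as nn

class NextStepLSTM(nn.Module):
    def __init__(self, hidden=16):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, seq):                 # seq: (batch, time, 1)
        out, _ = self.lstm(seq)
        return self.head(out)               # one-step-ahead prediction

series = torch.sin(torch.arange(0, 20, 0.1)).unsqueeze(0).unsqueeze(-1)
series[0, 100, 0] = 5.0                     # inject a single big error

model = NextStepLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):                        # fit the normal dynamics
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(series[:, :-1]), series[:, 1:])
    loss.backward()
    opt.step()

with torch.no_grad():
    err = (model(series[:, :-1]) - series[:, 1:]).abs().squeeze()
print(torch.nonzero(err > 3 * err.mean()).squeeze(-1))  # indices near the spike
```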
Long Short-Term Memory geographic data by distance-based method, and then use Since Recurrent Neural Network (RNN) also has the prob- Gradient-boosted tree (GBT) to repair the anomalous data. lem of gradient disappearance, it is difficult to process long- Charu C et al. [78] proposed a solution for distributed storage sequence data. F. A. Gers et al. [132] improve RNN and got the and query for large-scale streaming sensor data, and they RNN special case Long Short-Term Memory (LSTM), which examined the problem of historical storage and diagnosis of can avoid the disappearance of the regular RNN gradient. It massive numbers of simultaneous streams. We can conclude has been widely used in industry and [83]–[85] use LSTM to that anomaly detection algorithms play an important role in perform anomaly detection on time series data. time series data cleaning. It is also becoming more and more The left picture is a simple RNN structure diagram, and the important to design anomaly detection algorithms for time right picture is a simple LSTM structure diagram in Figure series repair, and we discuss future directions in Section VII. 10, where given function as shown in equation (15). VI. TOOLS AND EVALUATION CRITERIA F : h,y = f(h, x) (15) In this section, we first give an overview of tools to clean t In equation (15), x is the input of data in the current state, time series and then summarize evaluation criteria related to t−1 h (hidden state) indicates the input of the previous node time series cleaning methods. received, yt is the output in the current state and ht is the output passed to the next node. As can be seen from the Figure 10, the output ht is related to the values of xt and ht−1. yt A. Tools is often used to invest in a linear layer (mainly for dimension There are many tools or systems for data cleaning, but mapping), and then use softmax to classify the required data. they are not effective on time series cleaning problems. In 12

G. Summary and Discussion

In addition to the methods described above, we also summarize some common methods in Table VII. As shown in Table VII, Xing et al. [134] show that the cleaned sequence can improve the accuracy of time series classification. Diao et al. [73] design an LOF [135] based online anomaly detection and cleaning algorithm. Zhang et al. [136] propose an iterative minimum cleaning algorithm based on the temporal correlation of erroneous time series with continuous errors, keeping the principle of minimum modification in data cleaning; the algorithm is effective in cleaning continuous errors in time series data. Qu et al. [90] first use cluster-based methods for anomaly detection and then use exponentially weighted averaging for data repair, applied to clean power data in a distributed environment. R. Corizzo et al. [74] detect anomalous geographic data with a distance-based method, and then use Gradient-Boosted Trees (GBT) to repair the anomalous data. Charu C et al. [78] proposed a solution for distributed storage and querying of large-scale streaming sensor data, and they examined the problem of historical storage and diagnosis of massive numbers of simultaneous streams. We can conclude that anomaly detection algorithms play an important role in time series data cleaning. It is also becoming more and more important to design anomaly detection algorithms for time series repair, and we discuss future directions in Section VII.

TABLE VII
SUMMARY OF DETECTION

Reference | Method
[73]–[75], [122], [124] | Distance-based
[86] | SC
[127], [128] | Maintain a Normal Database
[76], [77] | Build an Anomalous Database
[90], [130], [131] | Clustering
[80]–[82] | GANs
[83]–[85] | LSTM

VI. TOOLS AND EVALUATION CRITERIA

In this section, we first give an overview of tools to clean time series and then summarize evaluation criteria related to time series cleaning methods.

A. Tools

There are many tools or systems for data cleaning, but they are not effective on time series cleaning problems. In Table VIII we investigate some tools that might be used for time series cleaning, since they [137]–[139], [141] were originally built to solve traditional database cleaning problems. Ding et al. [140] present Cleanits, an industrial time series cleaning system that implements an integrated strategy for detecting and repairing errors in industrial time series. Cleanits provides a user-friendly interface so users can use result and logging visualization over every cleaning process; besides, the algorithm design of Cleanits also considers the characteristics of industrial time series and domain knowledge. The ASAP system proposed by Rong et al. [142] violates the principle of minimum modification and distorts the data, which makes it unsuitable for wide use. EDCleaner, proposed by J. Wang et al. [143], is designed for social network data; detection and cleaning are performed through the characteristics of statistical data fields. Y. Huang et al. [144] propose PACAS, a framework for data cleaning between service providers and customers. R. Huang et al. [145] present TsOutlier, a new framework for detecting outliers with explanations over IoT data; TsOutlier uses multiple algorithms to detect anomalies in time series data, and supports both batch and streaming processing. EGADS [146] offers two classes of algorithms for detecting outliers, plug-in methods and decomposition-based methods, and is designed for automatic anomaly detection on large-scale time series data. There is not much research on time series cleaning tools or systems, and we discuss this further in the future directions of Section VII.

B. Evaluation criteria

The Root Mean Square (RMS) error [67] is used to evaluate the effectiveness of a cleaning algorithm. Let x denote the sequence consisting of the true values of the time series, x' denote the sequence consisting of the observations after errors are added, and \hat{x} denote the sequence consisting of the repaired values after the cleaning. The RMS error [67] is represented as shown in equation (16):

\Delta(x, \hat{x}) = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{x}_i)^2 }    (16)

Equation (16) measures the distance between the true values and the cleaned values: the smaller the RMS error, the better the cleaning effect. Other criteria include the error distance between incorrect data and correct data, and the repair distance between erroneous data and cleaned results, as shown in equation (17), referring to the minimum modification principle in data repairing:

\Delta(x', \hat{x}) = \sum_{i=1}^{n} |x'_i - \hat{x}_i|    (17)

Dasu et al. [147] propose a statistical distortion method to evaluate the quality of cleaning methods. The proposed method directly observes the numerical distribution in the data set and evaluates the quality according to the variation of the distribution caused by different cleaning methods.
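The two criteria in equations (16) and (17) are straightforward to compute; the following minimal sketch uses toy sequences as assumptions.

```python
# A minimal sketch of the evaluation criteria in equations (16) and
# (17): RMS error against the truth and L1 repair distance against the
# observations; the three toy sequences are illustrative assumptions.

import math

def rms_error(truth, repaired):
    """Equation (16): distance between true and cleaned values."""
    return math.sqrt(sum((t - r) ** 2 for t, r in zip(truth, repaired))
                     / len(truth))

def repair_distance(observed, repaired):
    """Equation (17): how much the cleaning changed the input data."""
    return sum(abs(o - r) for o, r in zip(observed, repaired))

truth    = [1.0, 1.2, 1.1, 1.3]
observed = [1.0, 2.4, 1.1, 1.3]   # a single big error at position 2
repaired = [1.0, 1.25, 1.1, 1.3]
print(rms_error(truth, repaired), repair_distance(observed, repaired))
```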
VII. CONCLUSION AND FUTURE DIRECTIONS

In this paper, we review four types of time series cleaning algorithms, cleaning tools or systems, and related research on evaluation criteria. Next, we summarize the full text in Section VII-A and list some suggestions for future directions in Section VII-B.

A. Conclusion

With the development of technology, people gradually realize the value contained in data. Since companies want to derive valuable knowledge from these data, data analysis has played an increasingly important role in finance, healthcare, the natural sciences, and industry. Time series data, as an important data type, is widely found in industrial manufacturing. For instance, a wind power enterprise analyzes data from sensors located throughout a wind turbine to determine whether the fan is in a normal state; transport companies also want to optimize vehicle fleet travel by analyzing vehicle GPS information. However, due to external environmental interference, limited sensor accuracy, and other issues, time series data often contain many errors that can interfere with subsequent data analysis and cause unpredictable effects.

B. Future Directions

As mentioned above, data is an intangible asset, and advanced technology helps to fully exploit its potential value. Time series data cleaning methods thereby provide very important technical support for discovering these values in erroneous time series data. Next, we list some suggestions for future directions based on [15].

An illustrated handbook of time series error types. At present, data scientists have very detailed analyses of the errors in traditional relational databases, but much work remains in the analysis of time series error types. For instance, this paper roughly divides time series errors into three types, namely single-point big errors, single-point small errors, and continuous errors. In fact, continuous errors contain many more fine-grained types, such as additive errors, innovational errors [13], or missing errors [148], [149]. Systematically analyzing these error types and compiling an illustrated handbook of time series error types is therefore very important: clear error types help develop targeted cleaning algorithms and address the "garbage in, garbage out" (GIGO) problem that exists in the current field.
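A small simulation makes this three-way taxonomy tangible; the positions and magnitudes below are arbitrary illustrative choices, not canonical definitions.

```python
import numpy as np

rng = np.random.default_rng(0)
truth = np.sin(np.linspace(0, 10, 200))
observed = truth.copy()

observed[30] += 4.0                            # single-point big error: an obvious spike
observed[80] += 0.3                            # single-point small error: easy to miss
observed[120:130] += rng.normal(1.0, 0.2, 10)  # continuous errors: a sustained shifted run
```

Each type stresses detectors differently: distance-based methods in Table VII catch the spike easily, while small and continuous errors typically require model- or constraint-based approaches.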

TABLE VIII
SOME EXAMPLES OF TOOLS OR SYSTEMS

Method | Category | Detail
PIClean [137] | Based on statistics | Produce probabilistic errors and probabilistic fixes using low-rank approximation, which implicitly discovers and uses relationships between columns of a dataset for cleaning.
HoloClean [138] | Based on statistics | Learn the probability model and then select the final data cleaning plan based on the probability distribution.
ActiveClean [139] | Based on statistics | Allow for progressive and iterative cleaning in statistical modeling problems while preserving convergence guarantees.
Cleanits [140] | Anomaly detection | Develop reliable data cleaning algorithms by considering features of both industrial time series and domain knowledge.
MLClean [141] | Anomaly detection | Combine data cleaning technology and machine learning methods to generate unbiased cleaned data, which is used to train accurate models.
ASAP [142] | Smoothing based | Develop a new analytics operator called ASAP that automatically smooths streaming time series by adaptively optimizing the trade-off between noise reduction and trend retention.
EDCleaner [143] | Based on statistics | For social network data, detection and cleaning are performed through the characteristics of statistical data fields.
PACAS [144] | Based on statistics | Design a framework for data cleaning between service providers and customers.
TsOutlier [145] | Anomaly detection | Use multiple algorithms to detect anomalies in time series data, and support both batch and streaming processing.
EGADS [146] | Anomaly detection | Offer two classes of algorithms for detecting outliers, plug-in methods and decomposition-based methods, designed for automatic anomaly detection of large-scale time series data.

The design of time series data cleaning algorithms. Each chapter of this paper reviews some time series error cleaning algorithms, but further optimizations are possible. The existing methods are mostly designed for one-dimensional time series (even though, for example, GPS data carries two dimensions of information), and each dimension is cleaned separately [45], [55], [136]. To further improve the practicability of these algorithms, it is imperative to consider the cleaning of multidimensional time series. Besides, with the development of machine learning technology, more learning techniques should be considered for data cleaning algorithms, which may lead to better cleaning results because of the mathematical support behind them.

The implementation of time series cleaning tools. At present, the mainstream data cleaning tools in industry are still aimed at relational databases, and these tools are not ideal for processing time series data. As time series cleaning problems become more serious, how to use fast-developing distributed, high-performance computing, and stream processing technologies to implement time series cleaning tools (including research tools and commercial tools) and apply them to real-world scenarios such as industry is also key work for the next stage.

The algorithm design of time series anomaly detection. In real-world scenarios, efficient anomaly detection algorithms play an irreplaceable role in time series repair. It is difficult to judge the difference between an erroneous value and the true value, so it is necessary to specifically design time series anomaly detection algorithms that can be applied to industrial scenes. It is worth noting that more research is needed on how to perform anomaly detection, cleaning, and analysis with weak domain knowledge or little labeled data. A minimal prediction-based detector of this kind is sketched below.
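One common starting point for such industrial detectors is prediction-based: forecast each point from its recent history and flag large residuals. The sketch below uses a moving-average predictor purely for illustration; a deployed system might substitute a stronger model, such as an LSTM predictor [83]–[85]. The function name and parameters are assumptions, not part of any surveyed system.

```python
import numpy as np

def residual_outliers(series, window=10, k=3.0):
    """Prediction-based detection: predict each point as the mean of the
    previous `window` points and flag residuals beyond k standard deviations.
    (A robust scale estimate, e.g. the median absolute deviation, would be
    preferable in practice, since outliers inflate the plain std.)"""
    x = np.asarray(series, dtype=float)
    preds = np.array([x[i - window:i].mean() for i in range(window, len(x))])
    resid = x[window:] - preds
    return window + np.flatnonzero(np.abs(resid) > k * resid.std())
```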
The design of data cleaning algorithms for specific application scenarios. With the application of various technologies in industry, the application scenarios are becoming more and more clear, and the requirements for cleaning algorithms in different application scenarios have different focuses. For instance, rather than cleaning errors, another application is to directly answer queries over the possible repairs [150]. Moreover, rather than quantitative values, cleaning qualitative events [151], [152] under pattern constraints [153], [154] is also in high demand. Finally, the data stored in Blockchain networks [103] are generally structured data; with the development of Blockchain technology, the design of data cleaning algorithms on Blockchain networks is also particularly important.

REFERENCES

[1] Robert H Shumway and David S Stoffer. Time series analysis and its applications: with R examples. Springer, 2017.
[2] James Douglas Hamilton. Time series analysis, volume 2. Princeton University Press, Princeton, NJ, 1994.
[3] Peter J Brockwell and Richard A Davis. Introduction to time series and forecasting. Springer, 2016.
[4] George EP Box, Gwilym M Jenkins, Gregory C Reinsel, and Greta M Ljung. Time series analysis: forecasting and control. John Wiley & Sons, 2015.
[5] Shawn R. Jeffery, Gustavo Alonso, Michael J. Franklin, Wei Hong, and Jennifer Widom. Declarative support for sensor data cleaning. In Pervasive Computing, 4th International Conference, PERVASIVE 2006, Dublin, Ireland, May 7-10, 2006, Proceedings, pages 83–100, 2006.
[6] Tamraparni Dasu, Rong Duan, and Divesh Srivastava. Data quality for temporal streams. IEEE Data Eng. Bull., 39(2):78–92, 2016.
[7] C Shilakes and J Tylman. Enterprise information portals. Enterprise Software Team, 1998.
[8] Shaoxu Song, Yue Cao, and Jianmin Wang. Cleaning timestamps with temporal constraints. PVLDB, 9(10):708–719, 2016.
[9] Xu Chu, Ihab F. Ilyas, Sanjay Krishnan, and Jiannan Wang. Data cleaning: Overview and emerging challenges. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 2201–2206, 2016.
[10] Joseph M Hellerstein. Quantitative data cleaning for large databases. United Nations Economic Commission for Europe (UNECE), 2008.

[11] Mourad Khayati, Alberto Lerner, Zakhar Tymchenko, and Philippe Cudré-Mauroux. Mind the gap: An experimental evaluation of imputation of missing values techniques in time series. PVLDB, 13(5):768–782, 2020.
[12] Aimad Karkouch, Hajar Mousannif, Hassan Al Moatassime, and Thomas Noël. Data quality in internet of things: A state-of-the-art survey. J. Network and Computer Applications, 73:57–81, 2016.
[13] Ruey S Tsay. Outliers, level shifts, and variance changes in time series. Journal of Forecasting, 7(1):1–20, 1988.
[14] Manish Gupta, Jing Gao, Charu C. Aggarwal, and Jiawei Han. Outlier detection for temporal data: A survey. IEEE Trans. Knowl. Data Eng., 26(9):2250–2267, 2014.
[15] Aoqian Zhang. Research on Time Series Data Cleaning. PhD thesis, Tsinghua University, 2018.
[16] Philip Bohannon, Michael Flaster, Wenfei Fan, and Rajeev Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Baltimore, Maryland, USA, June 14-16, 2005, pages 143–154, 2005.
[17] Foto N. Afrati and Phokion G. Kolaitis. Repair checking in inconsistent databases: algorithms and complexity. In Database Theory - ICDT 2009, 12th International Conference, St. Petersburg, Russia, March 23-25, 2009, Proceedings, pages 31–41, 2009.
[18] Jan Chomicki and Jerzy Marcinkowski. Minimal-change integrity maintenance using tuple deletions. Inf. Comput., 197(1-2):90–121, 2005.
[19] Ronald Fagin, Benny Kimelfeld, and Phokion G. Kolaitis. Dichotomies in the complexity of preferred repairs. In Proceedings of the 34th ACM Symposium on Principles of Database Systems, PODS 2015, Melbourne, Victoria, Australia, May 31 - June 4, 2015, pages 3–15, 2015.
[20] David R. Brillinger. Time series - data analysis and theory, volume 36 of Classics in Applied Mathematics. SIAM, 2001.
[21] David J. Hill and Barbara S. Minsker. Anomaly detection in streaming environmental sensor data: A data-driven modeling approach. Environmental Modelling and Software, 25(9):1014–1022, 2010.
[22] Kenji Yamanishi and Jun'ichi Takeuchi. A unifying framework for detecting outliers and change points from non-stationary time series data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26, 2002, Edmonton, Alberta, Canada, pages 676–681, 2002.
[23] S Dilling and BJ MacVicar. Cleaning high-frequency velocity profile data with autoregressive moving average (ARMA) models. Flow Measurement and Instrumentation, 54:68–81, 2017.
[24] Gérard Alengrin and Gérard Favier. New stochastic realization algorithms for identification of ARMA models. In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP '78, Tulsa, Oklahoma, USA, April 10-12, 1978, pages 208–213, 1978.
[25] Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.
[26] Martyna Marczak, Tommaso Proietti, and Stefano Grassi. A data-cleaning augmented Kalman filter for robust estimation of state space models. Econometrics and Statistics, 5:107–123, 2018.
[27] G Wayne Morrison and David H Pike. Kalman filtering applied to statistical forecasting. Management Science, 23(7):768–774, 1977.
[28] Robert Grover Brown, Patrick YC Hwang, et al. Introduction to random signals and applied Kalman filtering, volume 3. Wiley, New York, 1992.
[29] Garry A. Einicke and Langford B. White. Robust extended Kalman filtering. IEEE Trans. Signal Processing, 47(9):2596–2599, 1999.
[30] Zenton Goh, Kah-Chye Tan, and B. T. G. Tan. Kalman-filtering speech enhancement method based on a voiced-unvoiced speech model. IEEE Trans. Speech and Audio Processing, 7(5):510–524, 1999.
[31] Yongzhen Zhuang, Lei Chen, Xiaoyang Sean Wang, and Jie Lian. A weighted moving average-based approach for cleaning sensor data. In 27th IEEE International Conference on Distributed Computing Systems (ICDCS 2007), June 25-29, 2007, Toronto, Ontario, Canada, page 38, 2007.
[32] Everette S Gardner Jr. Exponential smoothing: The state of the art—part II. International Journal of Forecasting, 22(4):637–666, 2006.
[33] Eamonn J. Keogh, Selina Chu, David M. Hart, and Michael J. Pazzani. An online algorithm for segmenting time series. In Proceedings of the 2001 IEEE International Conference on Data Mining, 29 November - 2 December 2001, San Jose, California, USA, pages 289–296, 2001.
[34] Shu Xu, Bo Lu, Michael Baldea, Thomas F Edgar, Willy Wojsznis, Terrence Blevins, and Mark Nixon. Data cleaning in the process industries. Reviews in Chemical Engineering, 31(5):453–490, 2015.
[35] Richard H Jones. Exponential smoothing for multivariate time series. Journal of the Royal Statistical Society: Series B (Methodological), 28(1):241–251, 1966.
[36] JWC Van Lint, SP Hoogendoorn, and Henk J van Zuylen. Accurate freeway travel time prediction with state-space neural networks under missing data. Transportation Research Part C: Emerging Technologies, 13(5-6):347–369, 2005.
[37] Minjie Chen, Mantao Xu, and Pasi Fränti. A fast O(n) multiresolution polygonal approximation algorithm for GPS trajectory simplification. IEEE Trans. Image Processing, 21(5):2770–2785, 2012.
[38] Cheng Long, Raymond Chi-Wing Wong, and H. V. Jagadish. Trajectory simplification: On minimizing the direction-based error. PVLDB, 8(1):49–60, 2014.
[39] Jirun Dong and Richard Hull. Applying approximate order dependency to reduce indexing space. In Proceedings of the 1982 ACM SIGMOD International Conference on Management of Data, Orlando, Florida, June 2-4, 1982, pages 119–127, 1982.
[40] Seymour Ginsburg and Richard Hull. Order dependency in the relational model. Theor. Comput. Sci., 26:149–195, 1983.
[41] Seymour Ginsburg and Richard Hull. Sort sets in the relational model. In Proceedings of the Second ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, March 21-23, 1983, Colony Square Hotel, Atlanta, Georgia, USA, pages 332–339, 1983.
[42] Seymour Ginsburg and Richard Hull. Sort sets in the relational model. J. ACM, 33(3):465–488, 1986.
[43] Andrei Lopatenko and Loreto Bravo. Efficient approximation algorithms for repairing inconsistent databases. In Proceedings of the 23rd International Conference on Data Engineering, ICDE 2007, The Marmara Hotel, Istanbul, Turkey, April 15-20, 2007, pages 216–225, 2007.
[44] Lukasz Golab, Howard J. Karloff, Flip Korn, Avishek Saha, and Divesh Srivastava. Sequential dependencies. PVLDB, 2(1):574–585, 2009.
[45] Shaoxu Song, Aoqian Zhang, Jianmin Wang, and Philip S. Yu. SCREEN: stream data cleaning under speed constraints. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015, pages 827–841, 2015.
[46] Wei Yin, Tianbai Yue, Hongzhi Wang, Yanhao Huang, and Yaping Li. Time series cleaning under variance constraints. In Database Systems for Advanced Applications - DASFAA 2018 International Workshops: BDMS, BDQM, GDMA, and SeCoP, Gold Coast, QLD, Australia, May 21-24, 2018, Proceedings, pages 108–113, 2018.
[47] Shaoxu Song, Yu Sun, Aoqian Zhang, Lei Chen, and Jianmin Wang. Enriching data imputation under similarity rule constraints. IEEE Trans. Knowl. Data Eng., 32(2):275–287, 2020.
[48] Aoqian Zhang, Shaoxu Song, Yu Sun, and Jianmin Wang. Learning individual models for imputation. In 35th IEEE International Conference on Data Engineering, ICDE 2019, Macao, China, April 8-11, 2019, pages 160–171, 2019.
[49] Qingchao Cai, Zhongle Xie, Gang Chen, H. V. Jagadish, Beng Chin Ooi, and Meihui Zhang. Effective temporal dependence discovery in time series data. PVLDB, 11(8):893–905, 2018.
[50] Dehua Cheng, Mohammad Taha Bahadori, and Yan Liu. FBLG: a simple and effective approach for temporal dependence discovery from time series data. In The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, New York, NY, USA, August 24-27, 2014, pages 382–391, 2014.
[51] Yoram Bresler and Albert Macovski. Exact maximum likelihood parameter estimation of superimposed exponential signals in noise. IEEE Trans. Acoustics, Speech, and Signal Processing, 34(5):1081–1089, 1986.
[52] Tomasz Gogacz and Szymon Toruńczyk. Entropy bounds for conjunctive queries with functional dependencies. In 20th International Conference on Database Theory, ICDT 2017, March 21-24, 2017, Venice, Italy, pages 15:1–15:17, 2017.
[53] Dong Wang, Lance M. Kaplan, and Tarek F. Abdelzaher. Maximum likelihood analysis of conflicting observations in social sensing. TOSN, 10(2):30:1–30:27, 2014.
[54] Mohamed Yakout, Laure Berti-Équille, and Ahmed K. Elmagarmid. Don't be scared: use scalable automatic repairing with maximal likelihood and bounded changes. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, June 22-27, 2013, pages 553–564, 2013.
[55] Aoqian Zhang, Shaoxu Song, and Jianmin Wang. Sequential data cleaning: A statistical approach. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 909–924, 2016.

[56] Xiaolan Wang, Xin Luna Dong, and Alexandra Meliou. Data x-ray: A diagnostic tool for data errors. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015, pages 1231–1245, 2015.
[57] Lise Getoor, Nir Friedman, Daphne Koller, and Benjamin Taskar. Learning probabilistic models of relational structure. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), Williams College, Williamstown, MA, USA, June 28 - July 1, 2001, pages 170–177, 2001.
[58] AM Dukhovny. Markov chains with quasitoeplitz transition matrix: Applications. International Journal of Stochastic Analysis, 3(2):141–152, 1990.
[59] Jun Cai. A Markov model of switching-regime ARCH. Journal of Business & Economic Statistics, 12(3):309–316, 1994.
[60] Qinqing Zhang and Saleem A. Kassam. Finite-state Markov model for Rayleigh fading channels. IEEE Trans. Communications, 47(11):1688–1692, 1999.
[61] Md. Rafiul Hassan, Baikunth Nath, and Michael Kirley. A fusion model of HMM, ANN and GA for stock market forecasting. Expert Syst. Appl., 33(1):171–180, 2007.
[62] Ming Dong, Dong Yang, Yan Kuang, David He, Serap Erdal, and Donna Kenski. PM2.5 concentration prediction using hidden semi-Markov model-based times series data mining. Expert Syst. Appl., 36(5):9046–9055, 2009.
[63] Aditya Gupta and Bhuwan Dhingra. Stock market prediction using hidden Markov models. In 2012 Students Conference on Engineering and Systems, pages 1–4. IEEE, 2012.
[64] Asif Iqbal Baba, Manfred Jaeger, Hua Lu, Torben Bach Pedersen, Wei-Shinn Ku, and Xike Xie. Learning-based cleansing for indoor RFID data. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 925–936, 2016.
[65] A Anny Leema and M Hemalatha. An effective and adaptive data cleaning technique for colossal RFID data sets in healthcare. WSEAS Transactions on Information Science and Applications, 8(6):243–252, 2011.
[66] He Xu, Jie Ding, Peng Li, Daniele Sgandurra, and Ruchuan Wang. An improved SMURF scheme for cleaning RFID data. IJGUC, 9(2):170–178, 2018.
[67] Shawn R. Jeffery, Minos N. Garofalakis, and Michael J. Franklin. Adaptive cleaning for RFID data streams. In Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12-15, 2006, pages 163–174, 2006.
[68] Mostafa Milani, Zheng Zheng, and Fei Chiang. Currentclean: Spatio-temporal cleaning of stale data. In 35th IEEE International Conference on Data Engineering, ICDE 2019, Macao, China, April 8-11, 2019, pages 172–183, 2019.
[69] Robert H Shumway and David S Stoffer. An approach to time series smoothing and forecasting using the EM algorithm. Journal of Time Series Analysis, 3(4):253–264, 1982.
[70] Moria Bergman, Tova Milo, Slava Novgorodov, and Wang Chiew Tan. Query-oriented data cleaning with oracles. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015, pages 1199–1214, 2015.
[71] Jennifer Neville and David D. Jensen. Relational dependency networks. J. Mach. Learn. Res., 8:653–692, 2007.
[72] Chris Mayfield, Jennifer Neville, and Sunil Prabhakar. ERACER: a database approach for statistical inference and data cleaning. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6-10, 2010, pages 75–86, 2010.
[73] Yinglong Diao, Ke-yan Liu, Xiaoli Meng, Xueshun Ye, and Kaiyuan He. A big data online cleaning algorithm based on dynamic outlier detection. In 2015 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, CyberC 2015, Xi'an, China, September 17-19, 2015, pages 230–234, 2015.
[74] Roberto Corizzo, Michelangelo Ceci, and Nathalie Japkowicz. Anomaly detection and repair for accurate predictions in geo-distributed big data. Big Data Res., 16:18–35, 2019.
[75] Eamonn J. Keogh, Jessica Lin, Sang-Hee Lee, and Helga Van Herle. Finding the most unusual time series subsequence: algorithms and applications. Knowl. Inf. Syst., 11(1):1–27, 2007.
[76] Fabio A. González and Dipankar Dasgupta. Anomaly detection using real-valued negative selection. Genetic Programming and Evolvable Machines, 4(4):383–403, 2003.
[77] Dipankar Dasgupta and Nivedita Sumi Majumdar. Anomaly detection in multidimensional data using negative selection algorithm. In Proceedings of the 2002 Congress on Evolutionary Computation, CEC'02 (Cat. No. 02TH8600), volume 2, pages 1039–1044. IEEE, 2002.
[78] Charu C. Aggarwal and Philip S. Yu. On historical diagnosis of sensor streams. In 31st IEEE International Conference on Data Engineering, ICDE 2015, Seoul, South Korea, April 13-17, 2015, pages 185–194, 2015.
[79] Maria Kontaki, Anastasios Gounaris, Apostolos N. Papadopoulos, Kostas Tsichlas, and Yannis Manolopoulos. Continuous monitoring of distance-based outliers over data streams. In Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11-16, 2011, Hannover, Germany, pages 135–146, 2011.
[80] Dan Li, Dacheng Chen, Baihong Jin, Lei Shi, Jonathan Goh, and See-Kiong Ng. MAD-GAN: multivariate anomaly detection for time series data with generative adversarial networks. In Artificial Neural Networks and Machine Learning - ICANN 2019: Text and Time Series - 28th International Conference on Artificial Neural Networks, Munich, Germany, September 17-19, 2019, Proceedings, Part IV, pages 703–716, 2019.
[81] Yuqiang Sun, Lei Peng, Huiyun Li, and Min Sun. Exploration on spatiotemporal data repairing of parking lots based on recurrent GANs. In 21st International Conference on Intelligent Transportation Systems, ITSC 2018, Maui, HI, USA, November 4-7, 2018, pages 467–472, 2018.
[82] Chenguang Fang, Shaoxu Song, Zhiwei Chen, and Acan Gui. Fine-grained fuel consumption prediction. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, Beijing, China, November 3-7, 2019, pages 2783–2791, 2019.
[83] Pavel Filonov, Andrey Lavrentyev, and Artem Vorontsov. Multivariate industrial time series with cyber-attack simulation: Fault detection using an LSTM-based predictive data model. CoRR, abs/1612.06676, 2016.
[84] Pankaj Malhotra, Vishnu TV, Anusha Ramakrishnan, Gaurangi Anand, Lovekesh Vig, Puneet Agarwal, and Gautam Shroff. Multi-sensor prognostics using an unsupervised health index based on LSTM encoder-decoder. CoRR, abs/1608.06154, 2016.
[85] Pankaj Malhotra, Lovekesh Vig, Gautam Shroff, and Puneet Agarwal. Long short term memory networks for anomaly detection in time series, 2015.
[86] Xi Wang. Research on irregular time series data cleaning based on speed. Master's thesis, Tsinghua University, 2020.
[87] Robert Goodell Brown. Smoothing, forecasting and prediction of discrete time series. Courier Corporation, 2004.
[88] Charles C Holt. Forecasting seasonals and trends by exponentially weighted moving averages. International Journal of Forecasting, 20(1):5–10, 2004.
[89] Zhaosheng Zhang, Diange Yang, Tao Zhang, Qiaochu He, and Xiaomin Lian. A study on the method for cleaning and repairing the probe vehicle data. IEEE Trans. Intelligent Transportation Systems, 14(1):419–427, 2013.
[90] Z Qu, Y Wang, Chong Wang, Nan Qu, and Jia Yan. A data cleaning model for electric power big data based on Spark framework. International Journal of Database Theory and Application, 9(3):137–150, 2016.
[91] Maksims Volkovs, Fei Chiang, Jaroslaw Szlichta, and Renée J. Miller. Continuous data cleaning. In IEEE 30th International Conference on Data Engineering, ICDE 2014, Chicago, IL, USA, March 31 - April 4, 2014, pages 244–255, 2014.
[92] Gyuhae Park, Amanda C Rutherford, Hoon Sohn, and Charles R Farrar. An outlier analysis framework for impedance-based structural health monitoring. Journal of Sound and Vibration, 286(1-2):229–250, 2005.
[93] George EP Box and David A Pierce. Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. Journal of the American Statistical Association, 65(332):1509–1526, 1970.
[94] Hermine N Akouemo and Richard J Povinelli. Data improving in time series using ARX and ANN models. IEEE Transactions on Power Systems, 32(5):3352–3359, 2017.
[95] Peiyuan Chen, Troels Pedersen, Birgitte Bak-Jensen, and Zhe Chen. ARIMA-based time series model of stochastic wind power generation. IEEE Transactions on Power Systems, 25(2):667–676, 2009.
[96] Jie Ma and Jian-fu Teng. Predict chaotic time-series using unscented Kalman filter. In Proceedings of 2004 International Conference on Machine Learning and Cybernetics (IEEE Cat. No. 04EX826), volume 2, pages 687–690. IEEE, 2004.

[97] Ih Chang, George C Tiao, and Chung Chen. Estimation of time series parameters in the presence of outliers. Technometrics, 30(2):193–204, 1988.
[98] Ananthram Swami and Jerry M. Mendel. ARMA parameter estimation using only output cumulants. IEEE Trans. Acoustics, Speech, and Signal Processing, 38(7):1257–1265, 1990.
[99] Gregory L Plett. Extended Kalman filtering for battery management systems of LiPB-based HEV battery packs: Part 3. State and parameter estimation. Journal of Power Sources, 134(2):277–292, 2004.
[100] Jef Wijsen. Reasoning about qualitative trends in databases. Inf. Syst., 23(7):463–487, 1998.
[101] Jef Wijsen. Trends in databases: Reasoning and mining. IEEE Trans. Knowl. Data Eng., 13(3):426–438, 2001.
[102] Shaoxu Song, Han Zhu, and Jianmin Wang. Constraint-variance tolerant data repairing. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 877–892, 2016.
[103] Roberto Casado-Vara, Fernando de la Prieta, Javier Prieto, and Juan M. Corchado. Blockchain framework for IoT data quality via edge computing. In Proceedings of the 1st Workshop on Blockchain-enabled Networked Sensor Systems, BlockSys@SenSys 2018, Shenzhen, China, November 4, 2018, pages 19–24, 2018.
[104] Catriel Beeri, Martin Dowd, Ronald Fagin, and Richard Statman. On the structure of Armstrong relations for functional dependencies. Journal of the ACM (JACM), 31(1):30–46, 1984.
[105] Wenfei Fan, Floris Geerts, Xibei Jia, and Anastasios Kementsietsidis. Conditional functional dependencies for capturing data inconsistencies. ACM Transactions on Database Systems (TODS), 33(2):6, 2008.
[106] Wenfei Fan, Floris Geerts, Jianzhong Li, and Ming Xiong. Discovering conditional functional dependencies. IEEE Trans. Knowl. Data Eng., 23(5):683–698, 2011.
[107] Shaoxu Song and Lei Chen. Discovering matching dependencies. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM 2009, Hong Kong, China, November 2-6, 2009, pages 1421–1424, 2009.
[108] Shaoxu Song and Lei Chen. Efficient discovery of similarity constraints for matching dependencies. Data Knowl. Eng., 87:146–166, 2013.
[109] Yihan Wang, Shaoxu Song, Lei Chen, Jeffrey Xu Yu, and Hong Cheng. Discovering conditional matching rules. TKDD, 11(4):46:1–46:38, 2017.
[110] Shaoxu Song and Lei Chen. Differential dependencies: Reasoning and discovery. ACM Trans. Database Syst., 36(3):16:1–16:41, 2011.
[111] Shaoxu Song, Lei Chen, and Hong Cheng. Parameter-free determination of distance thresholds for metric distance constraints. In IEEE 28th International Conference on Data Engineering (ICDE 2012), Washington, DC, USA (Arlington, Virginia), 1-5 April, 2012, pages 846–857, 2012.
[112] Shaoxu Song, Lei Chen, and Hong Cheng. Efficient determination of distance thresholds for differential dependencies. IEEE Trans. Knowl. Data Eng., 26(9):2179–2192, 2014.
[113] Shaoxu Song, Lei Chen, and Philip S. Yu. On data dependencies in dataspaces. In Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11-16, 2011, Hannover, Germany, pages 470–481, 2011.
[114] Shaoxu Song, Lei Chen, and Philip S. Yu. Comparable dependencies over heterogeneous data. VLDB J., 22(2):253–274, 2013.
[115] Shaoxu Song, Han Zhu, and Lei Chen. Probabilistic correlation-based similarity measure on text records. Inf. Sci., 289:8–24, 2014.
[116] Shaoxu Song, Hong Cheng, Jeffrey Xu Yu, and Lei Chen. Repairing vertex labels under neighborhood constraints. PVLDB, 7(11):987–998, 2014.
[117] Shaoxu Song, Boge Liu, Hong Cheng, Jeffrey Xu Yu, and Lei Chen. Graph repairing under neighborhood constraints. VLDB J., 26(5):611–635, 2017.
[118] Shaoxu Song, Aoqian Zhang, Lei Chen, and Jianmin Wang. Enriching data imputation with extensive similarity neighbors. PVLDB, 8(11):1286–1297, 2015.
[119] Jingbo Zhou and Anthony K. H. Tung. Smiler: A semi-lazy time series prediction system for sensors. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015, pages 1871–1886, 2015.
[120] Daxin Tian, Yukai Zhu, Xuting Duan, Junjie Hu, Zhengguo Sheng, Min Chen, Jian Wang, and Yunpeng Wang. An effective fuel-level data cleaning and repairing method for vehicle monitor platform. IEEE Trans. Industrial Informatics, 15(1):410–422, 2019.
[121] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
[122] Sabyasachi Basu and Martin Meckesheimer. Automatic outlier detection for time series: an application to sensor data. Knowl. Inf. Syst., 11(2):137–154, 2007.
[123] Guoqiang Peter Zhang. Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing, 50:159–175, 2003.
[124] Eamonn J. Keogh, Jessica Lin, and Ada Wai-Chee Fu. HOT SAX: efficiently finding the most unusual time series subsequence. In Proceedings of the 5th IEEE International Conference on Data Mining (ICDM 2005), 27-30 November 2005, Houston, Texas, USA, pages 226–233, 2005.
[125] Li Wei, Eamonn J. Keogh, and Xiaopeng Xi. Saxually explicit images: Finding unusual shapes. In Proceedings of the 6th IEEE International Conference on Data Mining (ICDM 2006), 18-22 December 2006, Hong Kong, China, pages 711–720, 2006.
[126] Eamonn J. Keogh, Stefano Lonardi, and Chotirat (Ann) Ratanamahatana. Towards parameter-free data mining. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, Washington, USA, August 22-25, 2004, pages 206–215, 2004.
[127] Anup K. Ghosh, Aaron Schwartzbard, and Michael Schatz. Learning program behavior profiles for intrusion detection. In Proceedings of the Workshop on Intrusion Detection and Network Monitoring, Santa Clara, CA, USA, April 9-12, 1999, pages 51–62, 1999.
[128] D. Endler. Intrusion detection applying machine learning to Solaris audit data. In 14th Annual Computer Security Applications Conference (ACSAC 1998), 7-11 December 1998, Scottsdale, AZ, USA, pages 268–279, 1998.
[129] Pavel Senin, Jessica Lin, Xing Wang, Tim Oates, Sunil Gandhi, Arnold P. Boedihardjo, Crystal Chen, and Susan Frankenstein. Time series anomaly discovery with grammar-based compression. In Proceedings of the 18th International Conference on Extending Database Technology, EDBT 2015, Brussels, Belgium, March 23-27, 2015, pages 481–492, 2015.
[130] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, Oregon, USA, pages 226–231, 1996.
[131] Shaoxu Song, Chunping Li, and Xiaoquan Zhang. Turn waste into wealth: On simultaneous clustering and cleaning over dirty data. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015, pages 1115–1124, 2015.
[132] Felix A. Gers, Jürgen Schmidhuber, and Fred A. Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471, 2000.
[133] Pankaj Malhotra, Anusha Ramakrishnan, Gaurangi Anand, Lovekesh Vig, Puneet Agarwal, and Gautam Shroff. LSTM-based encoder-decoder for multi-sensor anomaly detection. CoRR, abs/1607.00148, 2016.
[134] Zhengzheng Xing, Jian Pei, and Philip S. Yu. Early classification on time series. Knowl. Inf. Syst., 31(1):105–127, 2012.
[135] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. LOF: identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16-18, 2000, Dallas, Texas, USA, pages 93–104, 2000.
[136] Aoqian Zhang, Shaoxu Song, Jianmin Wang, and Philip S. Yu. Time series data cleaning: From anomaly detection to anomaly repairing. PVLDB, 10(10):1046–1057, 2017.
[137] Zhuoran Yu and Xu Chu. Piclean: A probabilistic and interactive data cleaning system. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, pages 2021–2024, 2019.
[138] Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Ré. Holoclean: Holistic data repairs with probabilistic inference. PVLDB, 10(11):1190–1201, 2017.
[139] Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J. Franklin, and Ken Goldberg. Activeclean: Interactive data cleaning for statistical modeling. PVLDB, 9(12):948–959, 2016.
[140] Xiaoou Ding, Hongzhi Wang, Jiaxuan Su, Zijue Li, Jianzhong Li, and Hong Gao. Cleanits: A data cleaning system for industrial time series. PVLDB, 12(12):1786–1789, 2019.
[141] Ki Hyun Tae, Yuji Roh, Young Hun Oh, Hyunsu Kim, and Steven Euijong Whang. Data cleaning for accurate, fair, and robust models: A big data - AI integration approach. In Proceedings of the 3rd International Workshop on Data Management for End-to-End Machine Learning, DEEM@SIGMOD 2019, Amsterdam, The Netherlands, June 30, 2019, pages 5:1–5:4, 2019.

[142] Kexin Rong and Peter Bailis. ASAP: prioritizing attention via time series smoothing. PVLDB, 10(11):1358–1369, 2017.
[143] Jinlin Wang, Hongli Zhang, Binxing Fang, Xing Wang, Gongzhu Yin, and Ximiao Yu. Edcleaner: Data cleaning for entity information in social network. In 2019 IEEE International Conference on Communications, ICC 2019, Shanghai, China, May 20-24, 2019, pages 1–7, 2019.
[144] Yu Huang, Mostafa Milani, and Fei Chiang. PACAS: privacy-aware, data cleaning-as-a-service. In IEEE International Conference on Big Data, Big Data 2018, Seattle, WA, USA, December 10-13, 2018, pages 1023–1030, 2018.
[145] Ruihong Huang, Zhiwei Chen, Zhicheng Liu, Shaoxu Song, and Jianmin Wang. Tsoutlier: Explaining outliers with uniform profiles over iot data. In 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, December 9-12, 2019, pages 2024–2027, 2019.
[146] Nikolay Laptev, Saeed Amizadeh, and Ian Flint. Generic and scalable framework for automated time-series anomaly detection. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015, pages 1939–1947, 2015.
[147] Tamraparni Dasu and Ji Meng Loh. Statistical distortion: Consequences of data cleaning. PVLDB, 5(11):1674–1683, 2012.
[148] Jianmin Wang, Shaoxu Song, Xiaochen Zhu, and Xuemin Lin. Efficient recovery of missing events. PVLDB, 6(10):841–852, 2013.
[149] Jianmin Wang, Shaoxu Song, Xiaochen Zhu, Xuemin Lin, and Jiaguang Sun. Efficient recovery of missing events. IEEE Trans. Knowl. Data Eng., 28(11):2943–2957, 2016.
[150] Xiang Lian, Lei Chen, and Shaoxu Song. Consistent query answers in inconsistent probabilistic databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6-10, 2010, pages 303–314, 2010.
[151] Xiaochen Zhu, Shaoxu Song, Xiang Lian, Jianmin Wang, and Lei Zou. Matching heterogeneous event data. In International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014, pages 1211–1222. ACM, 2014.
[152] Yu Gao, Shaoxu Song, Xiaochen Zhu, Jianmin Wang, Xiang Lian, and Lei Zou. Matching heterogeneous event data. IEEE Trans. Knowl. Data Eng., 30(11):2157–2170, 2018.
[153] Xiaochen Zhu, Shaoxu Song, Jianmin Wang, Philip S. Yu, and Jiaguang Sun. Matching heterogeneous events with patterns. In IEEE 30th International Conference on Data Engineering, ICDE 2014, Chicago, IL, USA, March 31 - April 4, 2014, pages 376–387, 2014.
[154] Shaoxu Song, Yu Gao, Chaokun Wang, Xiaochen Zhu, Jianmin Wang, and Philip S. Yu. Matching heterogeneous events with patterns. IEEE Trans. Knowl. Data Eng., 29(8):1695–1708, 2017.