
2013 IEEE 13th International Conference on Data Mining

Efficient Proper Length Series Motif Discovery

Sorrachai Yingchareonthawornchai, Haemwaan Sivaraks, Thanawin Rakthanmanon*, Chotirat Ann Ratanamahatana
Department of Computer Engineering, Chulalongkorn University, Phayathai Rd., Pathumwan, Bangkok, Thailand 10330
*Kasetsart University, Ladyaow, Chatuchak, Bangkok, Thailand 10900
[email protected], [email protected], [email protected], [email protected]

Abstract— As one of the most essential data mining tasks, finding frequently occurring patterns, i.e., motif discovery, has drawn a lot of attention in the past decade. Despite successes in speeding up motif discovery algorithms, most existing algorithms still require predefined parameters. The most critical and cumbersome one is the time series motif length, since it is difficult to determine the proper length of the motifs manually, even for domain experts. In addition, with variability in the motif lengths, ranking among these motifs becomes another major problem. In this work, we propose a novel algorithm that uses compression ratio as a heuristic to discover meaningful motifs in proper lengths. The ranking of these various-length motifs relies on the ability to compress the time series with its own motif as a hypothesis. Furthermore, besides being an anytime algorithm, our proposed method is demonstrated in our experimental evaluation to outperform existing works in various domains, both in terms of speed and accuracy.

Keywords: proper length motif; motif discovery; time series mining.

I. INTRODUCTION

Time series data mining is an active area of research because time series data are omnipresent. Among many other time series mining tasks, finding frequently occurring patterns in a time series, i.e., motifs, is one important task. Motif discovery [1][2][5][6][7][10][11][13][15] is essentially a search for recurring patterns within a time series. It has utility in higher-level data mining tasks such as clustering [8][9]. A plethora of motif discovery algorithms exist, using various techniques. For example, the recent MK motif discovery algorithm [2] utilizes early abandoning to prune unnecessary search space, reducing the time complexity of the algorithm to essentially linear.

However, users still suffer from selecting a set of parameters; an initial window length is a typical one. Existing works require the length of the motif as a parameter on the grounds that it is configurable; this amounts to the unrealistic assumption that the motif length can easily be predefined by users. The choice of this parameter is quite sensitive and untenable for users [15] because it is hard to determine the proper length of a motif by hand, even for domain experts. For instance, suppose a user wants to determine the motifs in the time series data in Fig. 1. Since existing motif discovery algorithms require an initial sliding-window length, the user needs to speculate on the length L manually, or run trial and error over various lengths until satisfied. In fact, the time series in Fig. 1 contains two classes of motifs: one of length 163 and another of length 286. Is there a better way to discover motifs without specifying an initial length L?

Figure 1. a) A typical time series; a user needs to speculate on a motif length L manually. More often than not it is hard to speculate the exact proper length of the motif, as shown in b), where the exact lengths are 163 and 286.

In addition, the problem becomes more subtle if one attempts to exhaustively search for all possible motifs at all possible lengths, since the number of discovered motifs would be huge, not to mention the overlapping motifs at various lengths. It is also difficult to rank motifs because of their length variability.

Since the definitions of motifs in existing works assume a fixed length, we give a new definition of variable-length motifs that considers both similarity and frequency of the motifs through the ability of a motif to compress the time series. In this regard, we use the MDL (Minimum Description Length) principle [4][8][10][12] as the compression basis.

In this work, we propose a compression-based motif discovery algorithm that covers three aspects: parameter freeness, ranking of variable-length motifs, and speedup over a previous work. Our method requires no parameters from users; the output is a ranked list of motif classes. In addition, our method is an anytime algorithm, which is very beneficial for motif discovery in massive datasets. In the experiments, our algorithm is shown to outperform the existing algorithm both in terms of speed and in accuracy based on Accuracy-on-Detection (AoD) [14].

1550-4786/13 $31.00 © 2013 IEEE. DOI 10.1109/ICDM.2013.111

II. BACKGROUND AND NOTATIONS

How can compression provide proper length motif discovery? We have two observations. First, when a motif of length L, as a hypothesis, compresses its common structures, the quality of the compression depends on the similarity between the hypothesis and its occurrences: the more similar they are, the better it can compress. Second, when the time series contains a number of motifs sharing the same structure, the quality of the compression also depends on the frequency of the occurrences having the same structure as the hypothesis: the more frequent the motifs are, the better they can compress. These two observations imply that a good motif is a model that has both similarity and frequency properties. In other words, the best model out of a model class is the one for which the combined encoding cost for data and model is the lowest. Fig. 2 portrays the obvious difference between good and bad compression. The gist is that the notion of a good motif can be captured by applying the MDL principle.

Figure 2. An example of two obvious cases. A good motif class is a motif that matches its neighbors well and has many neighbors (good compression: good matching and frequent occurrences). In contrast, a bad motif class is a motif that matches poorly and has few neighbors (bad compression: poor matching). In this abstraction, we measure goodness by the ability to compress.

Definition 1: K-Compression Motif. Given a time series T, the most significant motif in T (called hereafter the 1-Compression Motif) is the subsequence, or hypothesis, having at least two occurrences, whose encoding cost for its neighboring non-trivial subsequences together with the hypothesis is the lowest. In other words, the 1st compression motif is the hypothesis H that minimizes the description length of H plus the description length of T given H, with the constraint that H must have at least two matching occurrences. The Kth most significant motif in T is the Kth class of subsequences, having at least two occurrences, with the Kth-lowest encoding cost for both data and hypothesis.

Bitsave calculation

The bitsave calculation and equations here closely follow the method proposed in [8]. We calculate the expected number of bits for representing the data via a description length function:

  DL(T) = m * Entropy(T)                                  (1)

where m is the length of a subsequence T. Similarly, the conditional description length is

  DL(A | H) = DL(A - H)                                   (2)

The entropy is defined as

  Entropy(T) = Σ_i p_i log2(1 / p_i)                      (3)

where p_i is the probability that symbol i occurs in T. In order to calculate the entropy, we use discrete normalization:

  Discrete(T) = round( (T - min) / (max - min) * (a - 1) ) + 1    (4)

where T is a vector and a is its cardinality. In this work, we set a = 64 to be consistent with the original work in [8]. The intuition behind this will become clear in the next section. Here, we define a group's description length as follows:

  DLC(C) = DL(H) + Σ_{A ∈ C} DL(A | H)                    (5)

As in Definition 1, a motif is required to have at least two occurrences. Therefore, we need to calculate the bitsave of a closest pair and record the group creation. Given subsequences A and B and a group C = {A, B}, the bitsave of group creation is

  bitsave = DL(A) + DL(B) - DLC(C)                        (6)

As for adding a new neighbor A to the group C, the bitsave after adding the new neighbor to become C' is

  bitsave = DL(A) + DLC(C) - DLC(C')                      (7)

III. THE INTUITION BEHIND THE ALGORITHM

After providing the background and notation, we are now ready to introduce our motif discovery algorithm. We discuss the intuition of the ability to compress here, and we discuss how to speed up the algorithm in the next subsections.

We model the problem as finding a subsequence, as a hypothesis of length w, that best compresses the input time series Ts. In other words, out of a model class, we find the H that minimizes { DL(H) + DL(Ts | H) }, where DL (description length) is a function from R^w to R representing the expected number of bits of a given subsequence.

Calculating DL(Ts | H) is computationally expensive since the length of Ts can be huge. We can reduce the complexity by exploiting the decomposability of description length. In lieu of computing the absolute description length, we calculate the relative description length, i.e., the bitsave obtained as the difference between the description lengths before and after compressing the original time series given the hypothesis H:

  bitsave(T, T | H) = DL(T) - ( DL(T | H) + DL(H) )       (8)

  bitsave = Σ_{i=1}^{k} ( DL(T(o_i : o_i + w)) - DL(T(o_i : o_i + w) | H) ) - DL(H),  k ≥ 2    (9)

where o_i is the location of the ith occurrence of H. In other words, the total bitsave is the sum of the bitsave at each occurrence minus DL(H).

The equations above imply two important observations. The first is that we can calculate only the subsections that differ between the original time series and the compressed time series given H, instead of the whole time series; this is because the description lengths of the uncompressed parts of the two time series are always equal. The second observation is that the description length of the original time series is fixed for all hypotheses H, so we can equivalently find the model that has the highest total bitsave. This can enormously reduce the time complexity to discover the best model.

With the decomposability in perspective, we essentially search for the best model using the exact distance between subsequences as a heuristic, since description length and distance are correlated when the distance is small [8]. We find a good set of hypotheses by constructing a set of reasonably close pairs of subsequences (pairs whose bitsave > 0). Lastly, we discover the number and locations of the occurrences of the hypothesis by finding all neighbors of the closest subsequence pair to H.

We begin the algorithm from the smallest to the largest possible sliding window. Suppose the algorithm is running at length L. We first find the closest pair of subsequences, or MC_L^1. The second step is to compute the MDL of the pair. At step 3, we find another neighbor of this motif candidate and compute the new MDL. We iterate step 3 until the next neighbor cannot compress well anymore, i.e., the new MDL is not better than the previous MDL. Finally, we update the answer with this group. The answer-update operation will be clarified in the next section.

A. Lower bound

Given a subsequence length L, comparing all possible pairs of subsequences to discover the closest pair requires n(n-1)/2 comparisons, where n is the length of the time series minus L, which is in fact a high computational cost. One way to mitigate the O(n^2) complexity is to use the triangular inequality of the Euclidean distance as heuristic information for pruning off extraneous search space. This notion, called early abandoning and lower bounding, is well utilized in [2]; according to [2], the average running time becomes essentially linear. This pruning can cut off extraneous search space and thus save an enormous amount of time.

However, running on all possible lengths is computationally expensive even when the lower bound is utilized. We incorporate two additional techniques to tackle this difficulty: designing an anytime algorithm that updates the answer by the ability to compress, and early termination.

B. Updating the answer by the ability to compress (bitsave)

The algorithm initializes the answer set as an empty set. While running, the answer set is updated with, or extended by, new entries. Whether a motif is interesting is based on the hypothesis that a subsequence with a higher bitsave has a better ability to compress. The updating method is therefore straightforward: for each answer in the answer set that overlaps with the new entry, the bitsave value is the criterion, and the new entry replaces the old one if its bitsave is higher. If no such overlapping entry exists, the new entry is added to the answer set.

Two entries overlap when both of the following conditions hold:
1. The numbers of motifs in the entries are equal.
2. There exists a one-to-one correspondence of overlapping pairs between the elements of the entries.

C. Early Termination

Suppose we discover a motif candidate for each length L in increasing order. It is possible that, at some value of L, we have already discovered all classes of motifs, and we can then terminate the algorithm without losing any potential candidate. The intuition is that if the algorithm has run long enough that all motifs have been found, then eventually, at some length L, the first motif must cover its previously discovered motifs. The question is when the proper time to trigger this early stopping would be. The answer is: when the bitsave value of the first pair of length L is less than zero, i.e., the compressed information does not contain anything new. This means that even the closest pair cannot compress the time series well; therefore, it is very likely that no other pair of length L can compress the time series well either. In practice, this optimization is shown to provide an impressive speedup with very few false dismissals (cf. section V.C).

IV. OUR ALGORITHM

Pseudocode is given in Table 1. The input is the time series data. The algorithm begins by entering a for loop (line 1). In line 2, motifCandidateDiscovery finds a motif candidate MC_L^k in a sliding window of length L. We note that this function utilizes the lower bounding and early abandoning techniques mentioned in the previous section.

The for loop in line 3 runs the subroutines for different initial hypotheses. We continue looping until the bitsave of creating a cluster is negative (line 5). At line 4, A and B indicate the locations of subsequences in the time series. The function createGroup is called and returns the bitsave along with a new group. The group data structure contains the following:
- vector of locations: positions of the repeated patterns.
- bitsave: total bitsave of the current group status.

Table 1. Proper Length Motif Discovery Algorithm (PLMD)
Input: Time series T
Output: a set of kth-compression motifs sorted by MDL
 1  for L = 2 to T.length/2
 2    {A,B} := motifCandidateDiscovery(T,L)
 3    for each k-th motif pair {Ak,Bk}
 4      group := createGroup(Ak,Bk)
 5      if group.bitsave < 0 then break
 6      while hasNextNeighbor(T)
 7        C := NextNeighbor(T,group.center,L)
 8        bs := BitsaveOfAdd(group,C)
 9        if bs > 0 then
10          group.bitsave += bs
11          group := AddToGroup(group,C)
12        else break
13      end
14      answer := update(answer, group)
15    end for
16    if k=1 and bitsave < 0 then break
17  end for
18  return answer

The bitsave in line 5 represents the number of bits saved after performing a group creation. If such a creation cannot contribute any more bitsave, we move on to the next iteration.

In lines 6–13, we run the while loop. NextNeighbor takes the center of the cluster as a reference to find the next closest neighbor C in terms of Euclidean distance. The center of a cluster is calculated as the average of the motif candidate pair A and B. Then the neighbor C is added to the cluster. The calculation of bitsave is given in line 8. This loop continues until bs ≤ 0 or there is no next neighbor left.

At line 14, we regard the group as a new entry. We either add the new entry to the answer set or update an existing answer with the new bitsave value, according to the update rules (cf. section III.B). This is an anytime algorithm, since the update function always moves the motif entries towards better solutions. Hence, users need not wait until completion; rather, they can query the current best model class at any time.

Line 16 checks for early termination: we simply check whether the bitsave of the first pair of motif candidates is negative. Finally, every class of motifs has its own bitsave, which has been updated at each iteration.

V. EXPERIMENTS

It is essential that the parameter-free algorithm output valid results before other aspects are explored. Correctness is determined by an objective function and by comparison to related works. Next, the proposed algorithm is compared with the related work in terms of running time. The datasets and code are available at our support webpage [16].

A. Definitions and Experiment Setup

First, basic definitions are clarified. Let P be the set of planted patterns, D the set of discovered patterns, and S the tuple (P, D). The first measurement is Accuracy-on-Recall (AoR) [14]: the percentage of discovered patterns corresponding to planted patterns out of the total planted patterns, i.e.,

  AoR(S) = |P ∩ D| / |P|                                  (10)

The second measurement is Accuracy-on-Detection (AoD) [14]. AoD is simply the overlapping percentage between discovered motifs and the corresponding planted patterns. Let d be a discovered pattern and p the corresponding planted pattern, where l and w denote the location and length of a subsequence, respectively. The unit-AoD is an overlap percentage defined as follows:

  UnitAoD(d, p) = OverlappingPart(l_d, w_d, l_p, w_p) / UnionPart(l_d, w_d, l_p, w_p)   (11)

where

  OverlappingPart(l1, w1, l2, w2) = min(l1 + w1, l2 + w2) - max(l1, l2) + 1
  UnionPart(l1, w1, l2, w2) = max(l1 + w1, l2 + w2) - min(l1, l2) + 1

AoD is the average of the unit-AoD over the matched pairs in a motif class:

  AoD(S) = (1 / |D ∩ P|) Σ_{(d,p) ∈ D ∩ P} UnitAoD(d, p)  (12)

Finally, correctness is determined as the average AoD of the pattern classes weighted by AoR. Let 𝕊 be a set of tuples S; then

  correctness = (1 / |𝕊|) Σ_{S ∈ 𝕊} AoR(S) * AoD(S)       (13)

Recall that P = {planted patterns}, D = {discovered patterns}, and S = (P, D).

B. Correctness Experiments

To measure correctness, datasets with known ground truth should be used, i.e., the locations of the actual motifs in each dataset are known. Testing on planted motifs is therefore preferable, since it makes it practical to objectively calculate the accuracy of the discovered motifs with respect to the planted motif occurrences. Each dataset is created by implanting motifs into a non-repeating random walk time series. The patterns to be planted are derived from the UCR classification/clustering archive [3]. The datasets are diverse in nature, and we conduct experiments on both single-pattern and multi-pattern datasets.

1) Single-pattern datasets

Eight datasets are used in the experiments, where each dataset consists of one "single" class of motifs, i.e., a class of motifs with a fixed length. Datasets spd1 to spd4 contain a larger number of pattern occurrences, whereas datasets spd5 to spd8 contain only 2 occurrences of the patterns. Table 2 shows the details of all single-pattern datasets; illustrations of these datasets are available at our support webpage [16].

Table 2. Single-pattern datasets
  dataset  length  pattern   w    #patterns
  spd1     6790    50words   270  7
  spd2     5760    Adiac     176  10
  spd3     6350    Beef      470  7
  spd4     8220    Oliveoil  570  6
  spd5     1900    Gun       150  2
  spd6     2150    Trace     275  2
  spd7     2454    OSUleaf   427  2
  spd8     2740    Oliveoil  570  2

The experimental results are shown in Table 3. Our proposed algorithm manages to discover the proper-length motifs planted in the single-pattern datasets. The discovered motifs are largely accurate as measured by AoD (for all datasets here, AoD > 95%). In addition, all of the planted patterns are precisely discovered (for all datasets here, AoR = 100%).

Table 3. Experimental results for spd datasets
  Dataset  Pattern: #   Rank  AoR   AoD     AoR of 1st rank  AoD of 1st rank
  spd1     50words: 7   1st   100%  98.53%  100%             98.53%
  spd2     Adiac: 10    1st   100%  98.32%  100%             98.32%
  spd3     Beef: 5      1st   100%  98.94%  100%             98.94%
  spd4     Oliveoil: 6  1st   100%  98.95%  100%             98.95%
  spd5     Gun: 2       2nd   100%  98.68%  100%             86.76%
  spd6     Trace: 2     4th   100%  96.48%  100%             61.96%
  spd7     OSULeaf: 2   1st   100%  99.53%  100%             99.53%
  spd8     Oliveoil: 2  1st   100%  99.65%  100%             99.65%
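To make the description-length bookkeeping of Eqs. (1)–(6) above concrete, the following minimal Python sketch implements the entropy, discrete normalization, and group-creation bitsave. The function names are ours, the cardinality a = 64 follows [8], and this is an illustration under our reading of the equations rather than the authors' implementation.

```python
import numpy as np

def discretize(t, a=64):
    """Discrete normalization (cf. Eq. 4): map a real-valued series
    onto the integer symbols 1..a; cardinality a = 64 as in [8]."""
    t = np.asarray(t, dtype=float)
    return np.round((t - t.min()) / (t.max() - t.min()) * (a - 1)).astype(int) + 1

def entropy(t):
    """Entropy of the symbol distribution of a discretized series (Eq. 3)."""
    _, counts = np.unique(np.asarray(t), return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def dl(t):
    """Description length DL(T) = m * Entropy(T) (Eq. 1)."""
    return len(t) * entropy(t)

def dl_cond(a_seq, h):
    """Conditional description length DL(A|H) = DL(A - H) (Eq. 2):
    only the residual between A and the hypothesis H is encoded."""
    return dl(np.asarray(a_seq) - np.asarray(h))

def bitsave_create(a_seq, b_seq, h):
    """Bitsave of creating group C = {A, B} under hypothesis H
    (Eqs. 5-6): DL(A) + DL(B) - (DL(H) + DL(A|H) + DL(B|H))."""
    return dl(a_seq) + dl(b_seq) - (dl(h) + dl_cond(a_seq, h) + dl_cond(b_seq, h))
```

With identical occurrences, the residual A - H is all zeros, so DL(A|H) = 0 and the creation bitsave reduces to DL(A), the best case; a poorly matching neighbor leaves a high-entropy residual and can drive the bitsave negative.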

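The accuracy metrics of Eqs. (10)–(12) above are likewise easy to state in code. The sketch below uses our own function names, and it treats a planted pattern as recalled when at least one discovered pattern overlaps it, which is one plausible reading of Eq. (10).

```python
def unit_aod(ld, wd, lp, wp):
    """Unit-AoD (Eq. 11): overlap between a discovered subsequence
    (location ld, length wd) and a planted one (lp, wp), divided by
    the length of their union."""
    overlap = min(ld + wd, lp + wp) - max(ld, lp) + 1
    union = max(ld + wd, lp + wp) - min(ld, lp) + 1
    return max(overlap, 0) / union

def aor(planted, discovered):
    """Accuracy-on-Recall (Eq. 10): fraction of planted patterns matched
    by at least one discovered pattern."""
    hits = sum(
        1 for (lp, wp) in planted
        if any(unit_aod(ld, wd, lp, wp) > 0 for (ld, wd) in discovered)
    )
    return hits / len(planted)

def aod(matched_pairs):
    """AoD (Eq. 12): average unit-AoD over the matched
    (discovered, planted) pairs of a motif class."""
    return sum(unit_aod(ld, wd, lp, wp)
               for (ld, wd), (lp, wp) in matched_pairs) / len(matched_pairs)
```

For example, a discovered pattern shifted by half its length against its planted counterpart overlaps on roughly a third of their union, so its unit-AoD is about 0.34, while an exact match scores 1.0.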
It is worth noting that, apart from discovering the planted patterns accurately, the algorithm may also discover other motifs whose underlying patterns occur naturally in the generated random walk data. For example, in the spd8 dataset, our proposed algorithm discovers the planted motifs correctly as the first rank with AoD = 99.65%; however, a second-ranked motif is discovered at length 347, as shown in Fig. 3.

Figure 3. The 2nd-ranked motifs from the spd8 dataset: the superposition of two similar motifs, one of which includes a pattern that naturally occurs in a subsequence of the random walk.

2) Multi-pattern datasets

This section carries out experiments on more complicated datasets, which contain patterns from more than one class. In addition, this work is further evaluated by comparing with Parameter-free Motif Discovery for time series data [7], the k-BestMotif algorithm (called 'kBM' hereafter), through the AoR and AoD metrics.

The comparison of outputs is performed in terms of AoR and AoD (cf. Section V.A). Note that the kBM algorithm discovers only a set of motif pairs; however, to give the best benefit to their work, the union of the relevant sets of the same classes is used to determine kBM's AoR. In addition, to be fair, the maximum AoD among the motif pairs in a motif set is used to determine kBM's AoD.

The results are portrayed in Table 4. Our proposed method (PLMD) consistently dominates kBM with higher AoD. Our proposed algorithm's AoD is also relatively invariant to the size of the pattern, while kBM's AoD slightly deteriorates for larger patterns, because their ranking method prunes off larger patterns before performing the ranking function: they choose the median of the whole population as a threshold to cut off "too large" motifs, so potential motifs may be eliminated in the first place.

Table 4. Comparison on four datasets in terms of AoR and AoD
                   Input                        Output
  Dataset  Length  Pattern   Pattern  # of   AoR(%)       AoD(%)
                             Length   patt.  PLMD  kBM    PLMD   kBM
  mpd1     7486    Gun       150      5      100   80     98.69  99.34
                   50Words   270      3      100   100    99.27  98.89
                   Fish      463      2      100   100    99.14  98.29
  mpd2     8842    CBF       128      8      100   75     98.1   98.47
                   Yoga      426      2      100   100    99.6   96.39
                   Fish      463      2      100   100    99.7   94.31
                   OliveOil  570      2      100   100    99     90.35
  mpd3     7609    FaceAll   131      9      88.9  66.7   98.51  100
                   Adiac     176      5      100   60     98.33  97.74
                   50Words   270      3      100   66.7   99.27  85.09
                   Beef      470      2      100   100    99.58  98.95
  mpd4     10979   FaceAll   131      7      100   57.1   97.78  98.51
                   Gun       150      6      100   50     99.34  98.69
                   Adiac     176      8      100   75     98.31  99.44
                   OSULeaf   427      2      100   100    99.53  93.45

As for AoR, our proposed method discovers virtually all of the planted patterns for each dataset. However, kBM manages to discover only partial sets; its AoR is unclear when the frequency of a pattern's occurrences is more than two.

C. Speed Experiment: Scalability

1) Running time analysis

This work utilizes an early termination technique, by which the time complexity can be asymptotically reduced. Consider the largest motif in the time series, of length M, where M < n. Early termination is triggered when the first creation of a motif group of length L provides a negative bitsave. For instance, Fig. 4 portrays the bitsave plot for the largest pattern in the mpd1 dataset. Local minima do not affect the correctness of the early termination, since the local minima do not have negative bitsave. This pattern has length 463, and the algorithm correctly discovers its proper length; early stopping occurs at length 568.

Figure 4. Bitsave plot for the largest pattern in the mpd1 dataset. This pattern has length 463. The algorithm terminates at w = 568, as opposed to half of the whole time series length.

Based on the empirical observation in Fig. 4, the early stopping point is equal to M + ϵ, where ϵ is a constant. Instead of running at all possible lengths, the proposed algorithm runs only up to length M + ϵ. Absorbing the constant ϵ and summing the per-length cost of O(m n^2) over the lengths actually examined gives

  T(n) = Σ_{m=2}^{M} m n^2 = n^2 ( M(M+1)/2 - 1 )         (14)

  T(n) = O(M^2 n^2)                                       (15)

Therefore, with early termination, our proposed algorithm spends O(M^2 n^2), where M < n, while the original work performed on a single length m requires O(m n^2). That is, our proposed algorithm's running time is mainly bounded by the length of the patterns.

2) Empirical Observations

Empirical scalability experiments are shown in Fig. 5. The datasets are the same as those in the multi-pattern dataset section, and we report the average over 10 trials. All algorithms are executed on the same Intel i7 3 GHz server. Our proposed method is approximately a few orders of magnitude faster than kBM. Notice that kBM and the proposed method without early termination essentially have the same order of magnitude. With early termination, the

algorithm is a few orders of magnitude faster than kBM. This demonstrates that our algorithm is relatively fast in practice.

Figure 5. Comparison among three methods: kBM, PLMD without Early Termination, and PLMD with Early Termination. The y-axis represents running time in hours; the x-axis represents the length of the time series. The proposed method (PLMD) is a few orders of magnitude faster than kBM.

3) Anytime proper length time series motif discovery

In this section, two desirable properties of an anytime algorithm are evaluated: monotonicity and interruptibility. In order for the quality of the algorithm to be measurable, correctness (cf. section V.A) is used as the measurement. The datasets used are mpd1 to mpd4, and the results are shown in Fig. 6.

As for monotonicity, the correctness value for each dataset is virtually non-decreasing. In addition, practitioners can interrupt the algorithm at any time while it is running. For example, if a user wants the partial result from the mpd4 dataset while the algorithm is running at w = 200, he or she can obtain a valid result with correctness higher than 80 percent, since most of the patterns are of length less than 200. This demonstrates the interruptibility of the algorithm in cases where the user requests a premature answer from the running algorithm.

Figure 6. Correctness plot for mpd1 to mpd4 datasets.

VI. CONCLUSIONS

The problem of predefined parameters in time series motif discovery is not trivial, and ranking all possible motifs is subtle. Our algorithm, based on the MDL principle, covers three perspectives of the problem. First, it requires no parameters from users. Second, it manages to discover proper-length motifs in a meaningful way by utilizing a compression-ratio heuristic. Third, we acknowledge the complexity and introduce ways to speed up the algorithm. Furthermore, our method is an anytime algorithm that can return a valid solution even if it is interrupted or has not yet finished. We are exploring approaches to other applications.

ACKNOWLEDGMENT

This research is partially supported by the Chulalongkorn University Dutsadi Phiphat Scholarship.

REFERENCES

[1] A. Mueen, E. J. Keogh, and N. B. Shamlo, "Finding Time Series Motifs in Disk-Resident Data," ICDM, 2009, pp. 367-376.
[2] A. Mueen, E. Keogh, Q. Zhu, S. Cash, and B. Westover, "Exact Discovery of Time Series Motifs," in Proceedings of the Ninth SIAM International Conference on Data Mining, 2009, pp. 473-484.
[3] E. Keogh, X. Xi, L. Wei, and C. A. Ratanamahatana, "The UCR time series classification/clustering homepage," 2008, http://www.cs.ucr.edu/eamonn/time_series_data/
[4] H. Li and N. Abe, "Clustering Words with the MDL Principle," Proc. of the 16th Int'l Conf. on Computational Linguistics, 1996, pp. 5-9.
[5] H. Tang and S. S. Liao, "Discovering original motifs with different lengths from time series," Knowledge-Based Systems, vol. 21, pp. 666-671, 2008.
[6] J. Lin, E. Keogh, S. Lonardi, and P. Patel, "Finding motifs in time series," in Proceedings of the 2nd Workshop on Temporal Data Mining, at the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002, pp. 53-68.
[7] P. Nunthanid, V. Niennattrakul, and C. A. Ratanamahatana, "Parameter-free motif discovery for time series data," ECTI-CON, 2012, pp. 1-4.
[8] T. Rakthanmanon, E. Keogh, S. Lonardi, and S. Evans, "Time Series Epenthesis: Clustering Time Series Streams Requires Ignoring Some Data," ICDM, 2011, pp. 547-556.
[9] S. Rodpongpun, V. Niennattrakul, and C. A. Ratanamahatana, "Selective Subsequence Time Series clustering," Knowledge-Based Systems, vol. 35, November 2012, pp. 361-368.
[10] Y. Tanaka, K. Iwamoto, and K. Uehara, "Discovery of time-series motif from multi-dimensional data based on MDL principle," Machine Learning, vol. 58, no. 2, 2005.
[11] Y. Li and J. Lin, "Approximate variable-length time series motif discovery using grammar inference," MDMKDD, 2010, pp. 1-9.
[12] B. Hu, T. Rakthanmanon, Y. Hao, et al., "Discovering the Intrinsic Cardinality and Dimensionality of Time Series Using MDL," in Proceedings of the 2011 IEEE 11th ICDM, 2011, pp. 1086-1091.
[13] H. T. Lam, T. Calders, and N. Pham, "Online Discovery of Top-k Similar Motifs in Time Series Data," in SDM, 2011, pp. 1004-1015.
[14] V. Niennattrakul, D. Wanichsan, and C. A. Ratanamahatana, "Accurate subsequence matching on data stream under time warping distance," in Proceedings of the 13th PAKDD, 2009, pp. 156-167.
[15] P. Nunthanid, V. Niennattrakul, and C. A. Ratanamahatana, "Discovery of variable-length time series motif," ECTI-CON, 2011, pp. 472-475.
[16] Support webpage, www.cp.eng.chula.ac.th/~ann/icdm2013/
