A Probabilistic Approach to Fast Pattern Matching in Time Series Databases
Total Page:16
File Type:pdf, Size:1020Kb
From: KDD-97 Proceedings. Copyright © 1997, AAAI (www.aaai.org). All rights reserved. A Probabilistic Approach to Fast Pattern Matching in Time Series Databases Eamonn Keogh and Padhraic Smyth* Department of Information and Computer Science University of California, Irvine CA 92697-3425 {eamonn, smyth}©ics,uci. edu Abstract familiar problem in archived time series storage: high- dimensional data sets at very high resolution make The problem of efficiently and accurately locating pat- manual exploration virtually impossible. 1 terns of interest in massive time series data sets is an important and non-trivial problem in a wide variety In this paper we address the general problem of of applications, including diagnosis and monitoring of matching a sequential pattern (called a query Q) complex systems, biomedical data analysis, and ex- a time series database (called the reference sequence ploratory data analysis in scientific and business time R). We will assume that Q and R are each real- series. In this paper a probabilistic approach is taken valued univariate sequences. Generalizations to multi- to this problem. Using piecewise linear segmentations as the mlderlying rcpresentation, local features (such variate and categorical valued sequences are of signif- ,as peaks, troughs, and platcaus) are defined using icant practical interest but will not be discussed here. prior distribution on expected deformations from a ba- To keep the discussion and notation simple we will as- sic tcmplate. Global shape information is rcpresented sume that the sequence data are mfiformly sampled using another prior on the relative locations of the (i.e., uniformly spaced in time). The generalization individual features. An appropriately defined prob- abilistic model integrates the local and global infor- to the non-uniformly sampled case is straightforward mation and directly leads to an overall distance mea- and will not be discussed. The problem is to find the k sure between sequence patterns based on prior knowl- closest matches in R to the query Q. Most solutions to edge. A search algorithm using this distance measure this problem rely on three specific components: (1) is shownto efficiently and accurately find matches for representation technique which abstracts the notion of a variety of patterns on a numberof data sets, includ- ing engineering sensor data from space Shuttle mis- shape in some sense, (2) a distance measure for pairs sion archives. The proposed approach provides a nat- sequence segments, and (3) an efficient search mecha- ural framework to support user-customizable "query nism for matching queries to reference sequences. The by content" on time series data, taking prior domain contribution of this paper is primarily in components information into account in a principled manner. (1) and (2). A piecewise linear representation scheme is proposed and combined with a generative probabilis- Introduction and Motivation tic model on expected pattern deformations, leading to Massive time series data sets are commonplace in a a natural distance metric incorporating relevant prior variety of online monitoring applications in medicine, knowledge about the problem. engineering, finance, and so forth. As an example, consider mission operations for NASA’s Space Shut- Related Work tle. Approximately 20,000 sensors are telemetered once There are a large number of different techniques for per second to Mission Control at Johnson Space Cen- efficient subsequence matching. The work of Falout- ter, Houston. Entire multi-day missions are archived sos, Ranganathan, and Manolopolous (1994) is fairly at this 1 Hz rate for each of the 20,000 sensors. typical. Sequences are decomposed into windows, fea- From a mission operations viewpoint, only a tiny frac- tures are extracted from each window (locally esti- tion of the data can be viewed in real-time, and the mated spectral coefficients in this case), and efficient archives are too vast to ever investigate. Yet, the matching is then performed using an R*-tree structure data are potentially very valuable for supporting di- in feature space. Agrawal et al. (1995) proposed agnosis, anomaly detection, and prediction. This is a Also with the Jet Propulsion Laboratory 525-3660, Cal- l Copyright @1997,American Association for Artificial ifonfia Institute of Technology, Pasadena, CA 91109. Intelligence (www.aaai.org). All rights reserved. 24 KDD-97 alternative approach which can handle amplitude scal- ing, offset translation, and "don’t care" regions in the data, where distance is determined from the envelopes of the original sequences. Berndt and Clifford (1994) use dynamic time-warping approach to allow for "elas- ticity" in the temporal axis when matching a query Q to a reference sequence R. Another popular approach is to abstract the notion of shape. Relational trees can be used to capture the hierarchy of peaks (or val- leys) in a sequence and tree matching algorithms can then be used to compare two time series (Shaw and DeFigueiredo, 1990; Wang, et al., 1994). A limitation of these approaches in general is that they do not provide a coherent language for expressing prior knowledge, handling uncertainty in the matching Figure 1: Automated segmentation of an inertial nav- process, or integrating shape cues at both the local and igation sensor from Space Shuttle mission STS-57. (a) global level. In this paper we investigate a probabilistic original data, the first 7.5 hours of the mission, origi- approach which offers a theoretically sound formalism nally 27,000 data points, (b) the segmented version for this sequence, K = 43 segments chosen by the multi- scale merging algorithm described in the text. ¯ Integration of local and global shape information, ¯ Graceful handling of noise and uncertainty, and language which can directly capture the notion of se- ¯ Incorporation of prior knowledge in an intuitive quence shapes and which is intuitive as a language for manner. humaninteraction. There is considerable psychological evidence (going The probabilistic approach to template matching is rel- back to Attneave’s famous cat diagram, 1954) that atively well-developed in the computer vision litera- the human visual system segments smooth curves into ture. The method described in this paper is similar in piecewise straight lines. Piecewise linear segmenta- spirit to the recent work of Burl and Perona (1996), al- tions provide both an intuitive and practical method though it differs substantially in details of implementa- for representing curves in a simple parametric form tion (their work is concerned with 2d template match- (generalizations to low-order polynomial and spline ing and they only consider probability functions over representations are straightforward). There are a large inter-feature distances and not over local feature de- number of different algorithms for segmenting a curve formations). into the It" "best" piecewise linear segments (e.g., Pavlidis (1974)). We use a computationally efficient A Segmented Piecewise Linear and flexible approach based on "bottom-up" merg- Representation ing of local segments into a hierarchical multi-scale There are numerous techniques for representing se- segmentation, where at each step the two local seg- quence data. The representation critically impacts ments are merged which lead to the least increase in the sensitivity of the distance measure to various dis- squared error. Automated approaches to finding the tortions and also can substantially determine the effi- best number of segments K can be based on statis- ciency of the matching process. Thus, one seeks robust tical arguments (penalized likelihood for example or representations which are computationally efficient to MinimumDescription Length as in Pednault (1991) work with. Spectral representations are well-suited to for this problem). Wefound that for piecewise linear sequences which are locally stationary in time, e.g., segmentations, simple heuristic techniques for finding the direct use of Fourier coefficients (as in Faloutsos et good values of K worked quite well, based on halting al. (1995)) or parametric spectral models (e.g., Smyth, the bottom-up merging process when the change in ap- 1994). However, many sequences, in particular those proximation error in going from K to K - 1 increased containing transient behavior, are quite non-stationary substantially. Figure 1 shows the segmentation of an and maypossess very weak spectral signatures even lo- inertial navigation sensor from the first 8 hours of a cally. Furthermore, from a knowledge discovery view- Space Shuttle mission by this method. For practical point, the spectral methods are somewhat indirect. applications it may be desirable to have I( chosen di- We are interested here in pursuing a representational rectly by the user to reflect a particular resolution at Keogh 25 which matching is to be performed. V Probabilistic Similarity Measures Defining "similarity" metrics is a long-standing prob- lem in pattern recognition and a large number of dis- Figure 2: A simple query consisting of 2 feature shapes. tance measures based on shape similarity have been The horizontal line at the bottom indicates al, or proposed in tile literature. Typically these similarity equivalently, the degree of horizontal elasticity which measures are designed to have certain desirable prop- is allowed between the 2 features. erties such as invariance to translation or scaling. Here we propose a probabitistic distance model k-1 based on the notion of an ideal prototype