
Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16)

Unsupervised Feature Learning from Time Series

Qin Zhang*, Jia Wu*, Hong Yang§, Yingjie Tian†‡, Chengqi Zhang*
* Quantum Computation & Intelligent Systems Centre, University of Technology Sydney, Australia
† Research Center on Fictitious Economy & Data Science, Chinese Academy of Sciences, Beijing, China
‡ Key Lab of Big Data Mining & Knowledge Management, Chinese Academy of Sciences, Beijing, China
§ MathWorks, Beijing, China
{qin.zhang@student., jia.wu@, chengqi.zhang@}uts.edu.au, [email protected], [email protected]

Abstract

In this paper we study the problem of learning discriminative features (segments), often referred to as shapelets [Ye and Keogh, 2009], of time series from unlabeled time series data. Discovering shapelets for time series classification has been widely studied, where many search-based algorithms are proposed to efficiently scan and select segments from a pool of candidates. However, such search-based algorithms may incur high time costs when the segment candidate pool is large. Alternatively, a recent work [Grabocka et al., 2014] uses regression learning to directly learn, instead of searching for, shapelets from time series. Motivated by the above observations, we propose a new Unsupervised Shapelet Learning Model (USLM) to efficiently learn shapelets from unlabeled time series data. The corresponding learning function integrates the strengths of pseudo-class labels, spectral analysis, a shapelet regularization term and regularized least-squares to auto-learn shapelets, pseudo-class labels and classification boundaries simultaneously. A coordinate descent algorithm is used to iteratively solve the learning function. Experiments show that USLM outperforms search-based algorithms on real-world time series data.

[Figure 1: The two blue thin lines represent the time series data. The upper one is rectangular signals with noise and the lower one is sinusoidal signals with noise. The curves marked in bold red are learnt features (shapelets). We can observe that the learnt shapelets may differ from all candidate segments and thus are robust to noise.]

1 Introduction

Time series classification has wide applications in finance [Ruiz et al., 2012], medicine [Hirano and Tsumoto, 2006] and trajectory analysis [Cai and Ng, 2004]. The main challenge of time series classification is to find discriminative features that can best predict class labels. To solve this challenge, a line of works has been proposed to extract discriminative features, often referred to as shapelets, from time series. Shapelets are maximally discriminative features of time series which enjoy the merit of high prediction accuracy and are easy to explain [Ye and Keogh, 2009]. Therefore, discovering shapelets has become an important branch of time series analysis.

The seminal work on shapelet discovery [Ye and Keogh, 2009] resorts to a full scan of all possible time series segments, where the segments are ranked according to a pre-defined distance metric and the segments which best predict the class labels are selected as shapelets. Based on the seminal work, a line of speed-up algorithms [Mueen et al., 2011][Rakthanmanon and Keogh, 2013][Chang et al., 2012] has been proposed to improve the performance. All these methods can be categorized as search-based algorithms which scan a pool of candidate segments. For example, on the Synthetic Control dataset [Chen et al., 2015], which contains 600 time series examples of length 60, the number of candidates over all lengths is 1.098 × 10^6 (each length-60 series contains \sum_{l=1}^{60}(60-l+1) = 1,830 segments, and 1,830 × 600 = 1.098 × 10^6).

On the other hand, a recent work [Grabocka et al., 2014] proposes a new time series shapelet learning approach. Instead of searching for shapelets from a candidate pool, they use regression learning and aim to learn shapelets from time series. This way, shapelets are detached from candidate segments and the learnt shapelets may differ from all the candidate segments. More importantly, shapelet learning is fast to compute, scalable to large datasets, and robust to noise, as shown in Fig. 1.
We present a new Unsupervised Shapelet Learning Model (USLM for short) that can auto-learn shapelets from unlabeled time series data. We first introduce pseudo-class labels to transform unsupervised learning into supervised learning. Then, we use the popular regularized least-squares and spectral analysis approaches to learn both shapelets and classification boundaries. A new regularization term is also added to avoid similar shapelets being selected. A coordinate descent algorithm is used to iteratively solve for the pseudo-class labels, classification boundary and shapelets.

Compared to the search-based algorithms on unlabeled time series data [Zakaria et al., 2012], our method provides a new tool that learns shapelets from unlabeled time series data. The contributions of the work are summarized as follows:

• We present a new Unsupervised Shapelet Learning Model (USLM) to learn shapelets from unlabeled time series data. USLM combines pseudo-class labels, spectral analysis, shapelet regularization and regularized least-squares for learning.

• We empirically validate the performance of USLM on both synthetic and real-world data. The results are promising compared to the state-of-the-art unsupervised shapelet selection models.

The remainder of the paper is organized as follows: Section 2 surveys related work. Section 3 gives the preliminaries. Section 4 introduces the proposed unsupervised shapelet learning model USLM. Section 5 introduces the learning algorithm with analysis. Section 6 conducts experiments. We conclude the paper in Section 7.

2 Related Work

Shapelets [Ye and Keogh, 2009] are short time series segments that can best predict class labels. The basic idea of shapelet discovery is to consider all segments of the training data and assess them with a scoring function that estimates how predictive they are with respect to the given class labels [Wistuba et al., 2015]. The seminal work [Ye and Keogh, 2009] builds a decision tree classifier by recursively searching for informative shapelets measured by information gain. Based on information gain, several new measures such as F-Stat, Kruskall-Wallis and Mood's median are used in shapelet selection [Hills et al., 2014][Lines et al., 2012].

Since time series data usually have a large number of candidate segments, the runtime of brute-force shapelet selection is infeasible. Therefore, a series of speed-up techniques has been proposed. On the one hand, there are smart implementations using early abandon of distance computations and entropy pruning of the information gain heuristic [Ye and Keogh, 2009]. On the other hand, many speed-ups rely on the reuse of computations and pruning of the search space [Mueen et al., 2011], as well as pruning candidates by searching possibly interesting candidates on the SAX representation [Rakthanmanon and Keogh, 2013] or using infrequent shapelets [He et al., 2012]. Shapelets have been applied in a series of real-world applications.

Shapelet learning. Instead of searching for shapelets exhaustively, a recent work [Grabocka et al., 2014] proposes to learn optimal shapelets and reports statistically significant improvements in accuracy compared to other shapelet-based classifiers. Instead of restricting the pool of possible candidates to those found in the training data and simply searching among them, they consider shapelets to be features that can be learnt through regression learning. This type of learning method does not consider a limited set of candidates but can obtain arbitrary shapelets.
Unsupervised feature selection. Many unsupervised feature selection algorithms have been proposed to select informative features from unlabeled data. A commonly used criterion in unsupervised feature learning is to select the features that best preserve data similarity or the manifold structure constructed from the whole feature space [Zhao and Liu, 2007][Cai et al., 2010], but these methods fail to incorporate the discriminative information implied within data and thus cannot be directly applied to our shapelet learning problem. Earlier unsupervised feature selection algorithms evaluate the importance of each feature individually and select features one by one [He et al., 2005][Zhao and Liu, 2007], with the limitation that correlation among features is neglected [Cai et al., 2010][Zhang et al., 2015].

State-of-the-art unsupervised feature selection algorithms perform feature selection by simultaneously exploiting discriminative information and feature correlation. Unsupervised Discriminative Feature Selection (UDFS) [Yang et al., 2011] aims to select the most discriminative features for data representation, where manifold structure is also considered. Since the most discriminative information for feature selection is usually encoded in labels, it is very important to predict good cluster indicators as pseudo labels for unsupervised feature selection.

Shapelets for clustering. Shapelets have also been utilized to cluster time series [Zakaria et al., 2012]. Zakaria et al. [Zakaria et al., 2012] proposed a method that uses unsupervised shapelets (u-Shapelets) for time series clustering. The algorithm searches for u-Shapelets which can separate and remove a subset of time series from the rest of the dataset, and then iteratively repeats the search among the remaining data until no data remains to be separated. It is a greedy search algorithm which attempts to maximize the gap between the two groups of time series divided by a u-Shapelet.

The k-Shape algorithm is proposed in the work [Paparrizos and Gravano, 2015] to cluster time series. k-Shape is a novel algorithm for shape-based time series clustering that is efficient and domain independent. It is based on a scalable iterative refinement procedure which creates homogeneous and well-separated clusters. Specifically, k-Shape requires a distance measure that is invariant to scaling and shifting; it uses a normalized version of the cross-correlation measure as its distance measure to take the shapes of time series into account. Based on the normalized cross-correlation, the method computes cluster centroids in every iteration to update the assignment of time series to clusters.

Our work differs from the above research problems. We introduce a new approach for unsupervised shapelet learning that auto-learns shapelets from unlabeled time series by combining shapelet learning and unsupervised feature selection methods.

3 Preliminaries

In this paper, scalars are denoted by letters (a, b, ...; α, β, ...), vectors by lower-case bold letters (a, b, ...), and matrices by boldfaced upper-case letters (A, B, ...). We use a(k) to denote the k-th element of vector a, and A(ij) to denote the element located at the i-th row and j-th column. A(i,:) and A(:,j) denote the i-th row and the j-th column of the matrix respectively. In a time series example, t_{a,b} denotes a segment starting from a and ending at b.

Consider a set of time series examples T = {t_1, t_2, ..., t_n}. Each example t_i (1 ≤ i ≤ n) contains an ordered set of real values denoted as (t_i(1), t_i(2), ..., t_i(q_i)), where q_i is the length of t_i. We wish to learn a set of the top-k most discriminative shapelets S = {s_1, s_2, ..., s_k}. Similar to the shapelet learning model [Grabocka et al., 2014], we let the lengths of the shapelets span r different scales starting at a minimum l_min, i.e. {l_min, 2 × l_min, ..., r × l_min}. Each length scale i × l_min contains k_i shapelets and k = \sum_{i=1}^{r} k_i. Obviously, S ∈ R^{(\sum_{i=1}^{r} k_i) × (i · l_min)} and r × l_min ≪ q_i to keep the shapelets compact.

4 Unsupervised shapelet learning

In this section, we formulate the Unsupervised Shapelet Learning Model (USLM), which is shown in Eq. (7). It combines a spectral regularization term, a shapelet similarity regularization term and a regularized least-squares minimization term. To introduce the USLM, we first present the shapelet-transformed representation [Grabocka et al., 2014] of time series, which transfers time series from the original space to a shapelet-based space. Then we introduce the pseudo-class labels and the three terms of the unsupervised shapelet learning model respectively.

Shapelet-transformed Representation. Shapelet transformation was introduced by the work [Lines et al., 2012] to downsize a long time series into a short feature vector in the shapelet feature space. Time series are ordered sequences, and the shapelet transformation can preserve the shape information for classification.

Given a set of time series examples T = {t_1, t_2, ..., t_n} and a set of shapelets S = {s_1, s_2, ..., s_k}, we use X ∈ R^{k×n} to denote the shapelet-transformed matrix, where each element X(s_i, t_j) denotes the distance between shapelet s_i and time series t_j. For simplicity, we use X(ij) to represent X(s_i, t_j), which can be calculated as in Eq. (1),

X_{ij} = \min_{g=1,\dots,q} \frac{1}{l_i} \sum_{h=1}^{l_i} \big( t_j(g+h-1) - s_i(h) \big)^2    (1)

where q = q_j − l_i + 1 denotes the total number of segments of length l_i from series t_j, and q_j, l_i are the lengths of time series t_j and shapelet s_i respectively.

Given the set of time series data, X(ij) is a function with respect to all candidate shapelets S, i.e. X(S)(ij). For simplicity, we omit the variable S and still write X(ij). The distance function in Eq. (1) is not continuous and thus non-differentiable. Following the work [Grabocka et al., 2014], we approximate the distance function using the soft minimum function as in Eq. (2),

X_{ij} \approx \frac{\sum_{q=1}^{q_j - l_i + 1} d_{ijq} \, e^{\alpha d_{ijq}}}{\sum_{q=1}^{q_j - l_i + 1} e^{\alpha d_{ijq}}}    (2)

where d_{ijq} = \frac{1}{l_i} \sum_{h=1}^{l_i} (t_j(q+h-1) - s_i(h))^2, and α controls the precision of the approximation. The soft minimum approaches the true minimum when α → −∞. In our experiments, we set α = −100.
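To make the transformation concrete, the following is a minimal NumPy sketch of Eqs. (1)-(2); the function and variable names are ours, and the stability shift inside the exponent is an implementation detail rather than part of the model.

```python
import numpy as np

def soft_min_distance(t, s, alpha=-100.0):
    """Soft-minimum distance between time series t and shapelet s (Eqs. (1)-(2)).

    d[q] is the mean squared error between s and the q-th sliding window of t;
    sum(d * exp(alpha*d)) / sum(exp(alpha*d)) approaches min(d) as alpha -> -inf.
    """
    l = len(s)
    windows = np.lib.stride_tricks.sliding_window_view(t, l)  # (q, l) segments
    d = ((windows - s) ** 2).mean(axis=1)                     # the d_{ijq} of Eq. (2)
    w = np.exp(alpha * (d - d.min()))   # shift by d.min() for numerical stability
    return float((d * w).sum() / w.sum())

def shapelet_transform(T, S, alpha=-100.0):
    """Shapelet-transformed matrix X of Eq. (2): X[i, j] = distance(s_i, t_j)."""
    return np.array([[soft_min_distance(t, s, alpha) for t in T] for s in S])
```

With α = −100 the weights concentrate almost entirely on the smallest d[q], so the soft minimum is numerically indistinguishable from the hard minimum of Eq. (1) while remaining differentiable.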
Pseudo-class label. Unsupervised learning faces the challenge of unlabeled training examples. Thus, we introduce pseudo-class labels for learning. Suppose we cluster a time series dataset into c categories; the pseudo-class label matrix Y ∈ R^{c×n} contains c labels, where Y(ij) indicates the probability of the j-th time series example belonging to the i-th category. Time series examples that share the same pseudo-class label fall into the same category. If Y(ij) > Y(i′j) for all i′ ≠ i, then the time series example t_j belongs to cluster i.

Spectral Analysis. Spectral analysis was introduced by [Donath and Hoffman, 1973] and has been widely used in unsupervised learning [Von Luxburg, 2007]. The principle behind it is that examples that are close to each other are likely to share the same class label [Tang and Liu, 2014][Von Luxburg, 2007]. Assume that G ∈ R^{n×n} is the similarity matrix of time series based on the shapelet-transformed matrix X; the similarity matrix can be calculated as in Eq. (3), where σ is the parameter of the RBF kernel,

G_{ij} = e^{-\|X_{(:,i)} - X_{(:,j)}\|_2^2 / \sigma^2}    (3)

Based on G, we expect the pseudo-class labels of similar data instances to be the same. Therefore, we can formulate a spectral regularization term as follows,

\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} G_{ij} \|Y_{(:,i)} - Y_{(:,j)}\|_2^2
  = \frac{1}{2} \sum_{k=1}^{c} \sum_{i=1}^{n} \sum_{j=1}^{n} G_{ij} (Y_{ki} - Y_{kj})^2    (4)
  = \sum_{k=1}^{c} Y_{(k,:)} (D_G - G) Y_{(k,:)}^T
  = \mathrm{tr}(Y L_G Y^T)

where L_G = D_G − G is the Laplacian matrix and D_G is a diagonal matrix with elements D_G(i,i) = \sum_{j=1}^{n} G_{ij}.
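As a sketch of Eqs. (3)-(4), with our variable names and a dense Laplacian for clarity:

```python
import numpy as np

def spectral_term(X, Y, sigma=1.0):
    """RBF similarity G (Eq. (3)), Laplacian L_G = D_G - G, and the spectral
    regularizer tr(Y L_G Y^T) of Eq. (4).

    X : (k, n) shapelet-transformed matrix; Y : (c, n) pseudo-label matrix.
    """
    diff = X[:, :, None] - X[:, None, :]               # (k, n, n) column differences
    G = np.exp(-(diff ** 2).sum(axis=0) / sigma ** 2)  # Eq. (3)
    L = np.diag(G.sum(axis=1)) - G                     # L_G = D_G - G
    return np.trace(Y @ L @ Y.T), G, L
```

Because tr(Y L_G Y^T) equals the weighted sum of squared label differences in Eq. (4), minimizing it pulls the pseudo-labels of similar examples together.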

Least Squares Minimization. Based on the pseudo-class labels, we wish to minimize the least squares error. Let W ∈ R^{k×c} be the classification boundary under the pseudo-class labels; the least squares error minimizes the following objective function,

\min_W \|W^T X - Y\|_F^2    (5)

Shapelet Similarity Minimization. We wish to learn shapelets that are diverse in shape. Specifically, we penalize the model in case it outputs similar shapelets. Formally, we denote the shapelet similarity matrix as H ∈ R^{k×k}, where each element H(s_i, s_j) represents the similarity between two shapelets s_i and s_j. For simplicity, we use H(ij) to represent H(s_i, s_j), which can be calculated as in Eq. (6),

H_{ij} = e^{-\tilde{d}_{ij}^2 / \sigma^2}    (6)

where \tilde{d}_{ij} is the distance between shapelet s_i and shapelet s_j; \tilde{d}_{ij} can be calculated following Eq. (2).

Unsupervised Shapelet Learning Model. Eq. (7) gives the unsupervised shapelet learning model (USLM). It is a joint optimization problem with respect to three variables: the classification boundary W, the pseudo-class labels Y and the candidate shapelets S,

\min_{W,S,Y} \frac{1}{2}\mathrm{tr}(Y L_G(S) Y^T) + \frac{\lambda_1}{2}\|H(S)\|_F^2 + \frac{\lambda_2}{2}\|W^T X(S) - Y\|_F^2 + \frac{\lambda_3}{2}\|W\|_F^2    (7)

In the objective function, the first term is the spectral regularization that preserves local structure information. The second term is the shapelet similarity regularization that prefers diverse shapelets. The third and fourth terms are the regularized least-squares minimization.

Note that the matrix L_G in Eq. (4), the matrix H in Eq. (6) and the matrix X in Eq. (2) all depend on the shapelets S. We explicitly write these matrices as variables with respect to the shapelets S in Eq. (7), i.e. L_G(S), H(S) and X(S) respectively. Below, we propose a coordinate descent algorithm to solve the model.
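Before turning to the optimization, here is a sketch of the full objective in Eq. (7), reusing shapelet_transform and spectral_term from the earlier sketches; shapelet_distances is our auxiliary for the pairwise \tilde{d}_{ij} of Eq. (6), under the assumption that the shorter shapelet slides over the longer one as in Eq. (2).

```python
import numpy as np

def shapelet_distances(S, alpha=-100.0):
    """Pairwise soft-min distances d~_ij between shapelets, following Eq. (2)."""
    k = len(S)
    D = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            a, b = sorted((S[i], S[j]), key=len)   # shorter slides over longer
            D[i, j] = soft_min_distance(b, a, alpha)
    return D

def uslm_objective(S, T, Y, W, lam1, lam2, lam3, alpha=-100.0, sigma=1.0):
    """Value of the USLM objective in Eq. (7)."""
    X = shapelet_transform(T, S, alpha)               # Eq. (2)
    spec, _, _ = spectral_term(X, Y, sigma)           # tr(Y L_G(S) Y^T), Eq. (4)
    H = np.exp(-shapelet_distances(S, alpha) ** 2 / sigma ** 2)  # Eq. (6)
    fit = np.linalg.norm(W.T @ X - Y, 'fro') ** 2     # least-squares term
    return 0.5 * spec + 0.5 * lam1 * (H ** 2).sum() \
         + 0.5 * lam2 * fit + 0.5 * lam3 * np.linalg.norm(W, 'fro') ** 2
```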

H(si,sj ) which can be calculated as in Eq. (6), 15: St+1 = Simax+1 16: t t +1 d 2 k ij k 2 (6) 17: end while H(ij) = e 18: Output: S⇤ = St; Y⇤ = Yt; W⇤ = Wt. where dij is the distance between shapelet si and shapelet sj. Update Y by fixing W and S By fixing W and S, the dij can be calculated by following Eq. (2). function• in Eq. (7) degenerates to Eq. (8), Unsupervised Shapelet Learning Model Eq. (7) gives the unsupervised shapelet learning model (USLM). It is a 1 T 2 T 2 min F(Y)= tr(YLGY )+ W X Y F (8) joint optimization problem with respect to three variables, the Y 2 2 k k classification boundary W, Pseudo-class label Y and candi- The derivative of Eq. (8) with respect to Y is, date shapelets S. @FY T = Y(LG 2I) 2W X (9) 1 1 2 @Y min tr(YLG(S)Y>)+ H(S) F W,S,Y 2 2 k k (7) Let Eq. (9) equal to 0, we can obtain the solution of Y as in 2 T 2 3 2 Eq. (10), + W X(S) Y F + W F T 1 2 k k 2 k k Y = 2W X(LG + 2I) (10) In the objective function, the first term is the spectral regu- where I is an identity matrix. Thus, the update of Y is larization that preserves local structure information. The sec- 1 ond term is the shapelet similarity regularization term that Yt+1 = 2Wt>Xt(LGt + 2I) (11) prefers diverse shapelets. The third and fourth terms are the Update W by fixing S and Y By fixing S and Y, Eq. (7) regularized least square minimization. degenerates• to Eq. (12) as below, Note that matrix LG in Eq. (4), matrix H in Eq. (6), and matrix X in Eq. (2) depend on the shapelets S. We explicitly 2 T 2 3 2 min F(W)= W X Y F + W F (12) write these matrices as variables with respect to shapelets S in W 2 k k 2 k k Eq. (7), i.e. ( ) , and respectively. Below, we LG S H(S) X(S) The derivative of Eq. (12) with respect to W is, propose a coordinate descent algorithm to solve the model. @FW T T =(2XX + 3I)W 2XY (13) 5 The Algorithm @W In this part, we first introduce a coordinate descent algorithm Let Eq. (13) equal to 0, we obtain the solution of W as in to solve the USLM, and then analyze its convergence and ini- Eq. (14), tialization methods. T 1 T W =(2XX + 3I) (2XY ) (14) 5.1 Learning Algorithm Thus, the update of W is In the coordinate descent algorithm, we iteratively update one 1 T Wt+1 =(2XtX> + 3I) (2XtY ) (15) variable by fixing the remaining two variables. The steps will t t+1 be repeated until convergence. Algorithm 1 summarizes the Update S by fixing W and Y By fixing W and Y, steps. Eq.• (7) degenerates to Eq. (16) as below,

• Update Y by fixing W and S. By fixing W and S, the function in Eq. (7) degenerates to Eq. (8),

\min_Y F(Y) = \frac{1}{2}\mathrm{tr}(Y L_G Y^T) + \frac{\lambda_2}{2}\|W^T X - Y\|_F^2    (8)

The derivative of Eq. (8) with respect to Y is

\frac{\partial F_Y}{\partial Y} = Y (L_G + 2\lambda_2 I) - 2\lambda_2 W^T X    (9)

Setting Eq. (9) equal to 0, we obtain the solution of Y as in Eq. (10),

Y = 2\lambda_2 W^T X (L_G + 2\lambda_2 I)^{-1}    (10)

where I is an identity matrix. Thus, the update of Y is

Y_{t+1} = 2\lambda_2 W_t^T X_t (L_{G_t} + 2\lambda_2 I)^{-1}    (11)

• Update W by fixing S and Y. By fixing S and Y, Eq. (7) degenerates to Eq. (12),

\min_W F(W) = \frac{\lambda_2}{2}\|W^T X - Y\|_F^2 + \frac{\lambda_3}{2}\|W\|_F^2    (12)

The derivative of Eq. (12) with respect to W is

\frac{\partial F_W}{\partial W} = (2\lambda_2 X X^T + \lambda_3 I) W - 2\lambda_2 X Y^T    (13)

Setting Eq. (13) equal to 0, we obtain the solution of W as in Eq. (14),

W = (2\lambda_2 X X^T + \lambda_3 I)^{-1} (2\lambda_2 X Y^T)    (14)

Thus, the update of W is

W_{t+1} = (2\lambda_2 X_t X_t^T + \lambda_3 I)^{-1} (2\lambda_2 X_t Y_{t+1}^T)    (15)

• Update S by fixing W and Y. By fixing W and Y, Eq. (7) degenerates to Eq. (16),

\min_S F(S) = \frac{1}{2}\mathrm{tr}(Y L_G(S) Y^T) + \frac{\lambda_1}{2}\|H(S)\|_F^2 + \frac{\lambda_2}{2}\|W^T X(S) - Y\|_F^2    (16)

Eq. (16) is non-convex and we cannot solve for S explicitly as we do for W and Y. Instead, we resort to an iterative algorithm with a learning rate η, i.e. S_{i+1} = S_i − η∇S_i, where ∇S_i = ∂F(S_i)/∂S. The iterative steps guarantee the convergence of the objective function. The derivative of Eq. (16) with respect to S_{(mp)} is

\frac{\partial F(S)}{\partial S_{(mp)}} = \frac{1}{2} Y \frac{\partial L_G(S)}{\partial S_{(mp)}} Y^T + \lambda_1 H(S) \frac{\partial H(S)}{\partial S_{(mp)}} + \lambda_2 W \left( W^T X(S) - Y \right) \frac{\partial X(S)}{\partial S_{(mp)}}    (17)

where m = 1, ..., k and p = 1, ..., l_m.
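The two closed-form updates of Eqs. (11) and (15) translate directly to NumPy (our names; np.linalg.solve replaces the explicit inverses, and the solve for Y uses the symmetry of L_G + 2λ_2 I):

```python
import numpy as np

def update_Y(X, W, L_G, lam2):
    """Pseudo-label update of Eq. (11): Y = 2*lam2 * W^T X (L_G + 2*lam2*I)^{-1}."""
    n = X.shape[1]
    M = L_G + 2 * lam2 * np.eye(n)        # symmetric, so M^T = M
    return np.linalg.solve(M, (2 * lam2 * W.T @ X).T).T

def update_W(X, Y, lam2, lam3):
    """Boundary update of Eq. (15): W = (2*lam2*X X^T + lam3*I)^{-1} (2*lam2*X Y^T)."""
    k = X.shape[0]
    return np.linalg.solve(2 * lam2 * X @ X.T + lam3 * np.eye(k),
                           2 * lam2 * X @ Y.T)
```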

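Before unpacking the analytic gradient, a finite-difference stand-in (ours, not the paper's method) is useful both as a drop-in ∇S for small problems and as a correctness check for Eqs. (17)-(21):

```python
import numpy as np

def numeric_grad_S(F, S, eps=1e-6):
    """Central finite-difference gradient of the objective F with respect to
    every shapelet entry S[m][p]; a slow surrogate for Eq. (17)."""
    grads = []
    for s in S:
        g = np.zeros_like(s)
        for p in range(len(s)):
            old = s[p]
            s[p] = old + eps
            f_plus = F(S)
            s[p] = old - eps
            f_minus = F(S)
            s[p] = old                    # restore the perturbed entry
            g[p] = (f_plus - f_minus) / (2 * eps)
        grads.append(g)
    return grads
```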
Because L_G = D_G − G and D_G(i,i) = \sum_{j=1}^{n} G_{ij}, the first term in Eq. (17) reduces to calculating ∂G_{ij}/∂S_{(mp)}, as shown in Eq. (18),

\frac{\partial G_{ij}}{\partial S_{(mp)}} = -\frac{2 G_{ij}}{\sigma^2} \sum_{q=1}^{k} (X_{qi} - X_{qj}) \left( \frac{\partial X_{qi}}{\partial S_{(mp)}} - \frac{\partial X_{qj}}{\partial S_{(mp)}} \right)    (18)

and

\frac{\partial X_{ij}}{\partial S_{(mp)}} = \frac{1}{E_1^2} \sum_{q=1}^{q_{ij}} e^{\alpha d_{ijq}} \big( (1 + \alpha d_{ijq}) E_1 - \alpha E_2 \big) \frac{\partial d_{ijq}}{\partial S_{(mp)}}    (19)

where E_1 = \sum_{q=1}^{q_{ij}} e^{\alpha d_{ijq}}, E_2 = \sum_{q=1}^{q_{ij}} d_{ijq} e^{\alpha d_{ijq}}, q_{ij} = q_j − l_i + 1 and d_{ijq} = \frac{1}{l_i} \sum_{h=1}^{l_i} (t_j(q+h-1) - s_i(h))^2.

\frac{\partial d_{ijq}}{\partial S_{(mp)}} = \begin{cases} 0 & \text{if } i \neq m \\ \frac{2}{l_m} (S_{(mp)} - T_{j,q+p-1}) & \text{if } i = m \end{cases}    (20)

The second term in Eq. (17) reduces to calculating Eq. (21),

\frac{\partial H_{ij}}{\partial S_{(mp)}} = -\frac{2}{\sigma^2} \tilde{d}_{ij} \, e^{-\tilde{d}_{ij}^2/\sigma^2} \frac{\partial \tilde{d}_{ij}}{\partial S_{(mp)}}    (21)

where \tilde{d}_{ij} is the distance between shapelets s_i and s_j. The calculations of \tilde{d}_{ij} and ∂\tilde{d}_{ij}/∂S_{(mp)} are similar to those of X_{ij} and ∂X_{ij}/∂S_{(mp)} respectively.

To sum up, we can calculate the gradient ∇S_i = ∂F(S)/∂S_i by following Eqs. (17)-(21). In the following, we discuss the convergence of the coordinate descent algorithm in solving Eq. (7).

5.2 Convergence

The convergence of Algorithm 1 depends on stepwise descents. When updating Y or W, we know that Y_{t+1} = Y*(W_t, S_t) and W_{t+1} = W*(Y_{t+1}, S_t). When updating S, the objective function in Eq. (16) is not convex and a closed-form solution is difficult to obtain. Instead, we use a gradient descent algorithm. In the iterations, as long as we set an appropriate learning rate η, which is usually very small, the objective function will decrease until convergence.

In addition, the objective function in Eq. (7) is non-convex but has a lower bound of 0 since it is non-negative, so Algorithm 1 converges to a local optimum. In our experiments we run the algorithm several times under different initializations and choose the best solution as output.

5.3 Initialization

Because the algorithm only outputs local optima, we discuss how to initialize the variables to improve its performance. The algorithm expects initial values S_0, Y_0 and W_0. We first initialize S_0 using the centroids of the segments having the same length as the shapelets, because centroids represent typical patterns behind the data. Then, we compute the shapelet-transformed matrix of the time series based on S_0 to obtain X_0. Next, the results obtained by k-means on X_0 are used to initialize W_0 and Y_0. This initialization enables fast convergence.
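A sketch of this initialization, assuming a single shapelet length for brevity; k-means over raw segments is one reading of "centroids of the segments", and scikit-learn's KMeans is our choice of clustering routine:

```python
import numpy as np
from sklearn.cluster import KMeans

def initialize(T, k, l, c, alpha=-100.0):
    """Sec. 5.3 initialization: S0 from segment centroids, X0 by transforming T
    with S0, and Y0 from k-means cluster assignments on the columns of X0."""
    segments = np.vstack([np.lib.stride_tricks.sliding_window_view(t, l) for t in T])
    S0 = list(KMeans(n_clusters=k, n_init=10).fit(segments).cluster_centers_)
    X0 = shapelet_transform(T, S0, alpha)             # (k, n)
    labels = KMeans(n_clusters=c, n_init=10).fit_predict(X0.T)
    Y0 = np.eye(c)[labels].T                          # one-hot pseudo labels, (c, n)
    W0 = update_W(X0, Y0, lam2=1.0, lam3=1.0)         # seed W0 via Eq. (15); unit weights are one choice
    return S0, X0, Y0, W0
```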
6 Experiments

We conduct experiments to validate the performance of USLM. All experiments are conducted on a Windows 8 machine with a 3.00 GHz CPU and 8 GB memory. The Matlab source codes and data are available online¹.

6.1 Datasets

Synthetic data: This dataset is generated by following the works [Shariat and Pavlovic, 2011] and [Zakaria et al., 2012]. It consists of ten examples from two classes. The first class contains sinusoidal signals of length 200. The second class contains rectangular signals of length 100. We randomly embed Gaussian noise in the data. Heterogeneous noise is generated from five Gaussian processes with means ν ∈ [0, 2] and variances σ² ∈ [0, 10] chosen uniformly at random.

Real-world data: We use seven time series benchmark datasets downloaded from the UCR time series archive [Chen et al., 2015][Cetin et al., 2015]. The datasets are summarized in Table 1. More details of the datasets can be found on the archive's website.

Table 1: Summary of the benchmark time series datasets

Dataset        Train/Test       Length   # classes
CBF            30/900 (930)     128      3
ECG 200        100/100 (200)    96       2
Face Four      24/88 (112)      350      4
Ita.Pow.Dem.   67/1029 (1096)   24       2
Lighting2      60/61 (121)      637      2
Lighting7      70/73 (143)      319      7
OSU Leaf       200/242 (442)    427      6

6.2 Measures

Existing measures used to evaluate the performance of time series clustering include the Jaccard Score, Rand Index, and Folkes and Mallow index [Halkidi et al., 2001][Zakaria et al., 2012]. Among them, the Rand index [Rand, 1971] is the most popular, while the remaining measures can be taken as variants of the Rand index. Therefore, we use the Rand index as the evaluation measure.

To calculate the Rand index, we compare the cluster labels Y* obtained by the clustering algorithm with the genuine class labels L_true as in Eq. (22),

\text{Rand index} = \frac{TP + TN}{TP + TN + FP + FN}    (22)

where TP is the number of time series pairs which belong to the same class in L_true and are assigned to the same cluster in Y*, TN is the number of time series pairs which belong to different classes in L_true and are assigned to different clusters in Y*, FP is the number of time series pairs which belong to different classes in L_true but are assigned to the same cluster in Y*, and FN is the number of time series pairs which belong to the same class in L_true but are assigned to different clusters in Y*. If the Rand index is close to 1, it indicates a high-quality clustering [Zakaria et al., 2012][Paparrizos and Gravano, 2015].

¹ https://github.com/BlindReview/shapelet
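A direct implementation of Eq. (22) over all example pairs (our helper; it enumerates the O(n²) pairs explicitly):

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Rand index of Eq. (22): the fraction of pairs on which the two labelings
    agree, i.e. (TP + TN) / (TP + TN + FP + FN)."""
    agree = total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agree += int(same_true == same_pred)   # TP (both together) or TN (both apart)
        total += 1
    return agree / total
```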

6.3 Time series with unequal lengths

[Figure 2: An example of the rectangular signals (short blue thin curves) and the sinusoidal signals (long black thin curves). The two learned shapelets are marked with bold red curves.]

Fig. 2 shows an example of two shapelets learnt by USLM on the synthetic data. Shapelet 1 is a sharp spike of length 12 which matches the most prominent spikes of the rectangular signals. Shapelet 2 is a subsequence of length 30 which is very similar to the standard shape of the sinusoidal signals. From the results, we can observe that USLM can auto-learn representative shapelets from unlabeled time series data.

The results also show that USLM can handle time series and shapelets of unequal lengths, where we obtain the best possible Rand index value of 1. In contrast, the work [Shariat and Pavlovic, 2011] obtains only 0.9 on this dataset even though it uses class label information during training.

6.4 Running time

We test the running time of USLM by varying the number of shapelets k and the number of clusters c. The results are given in Fig. 3.

First, we vary the parameter k from 2 to 12 with a step size of 2. The remaining parameters are fixed as follows: λ_1 = λ_2 = λ_3 = 1, σ = 1, i_max = 50, η = 0.01, and the length of the shapelets is set to 10. The reported running time is the average of ten executions. From Fig. 3(a), we can observe that the running time generally increases linearly with respect to the number of shapelets. Thus, our approach scales well to large datasets.

Then, we let the number of clusters c change from 2 to 7. The length of the shapelets is set to 10% of the time series length. We set k = 2, which means that we learn only two shapelets of equal length. The remaining parameters are the same as above. Fig. 3(b) shows that the running time remains stable with respect to the number of clusters.

[Figure 3: USLM running time w.r.t. (a) the number of shapelets k and (b) the number of clusters c. The running time increases linearly w.r.t. the number of features and remains stable w.r.t. the number of clusters c. Thus, USLM scales well to large datasets.]

6.5 Comparisons with k-Shape

The k-Shape algorithm was proposed in the work [Paparrizos and Gravano, 2015] to cluster unlabeled time series. In this part, we compare our algorithm with the k-Shape algorithm. The source codes for k-Shape are downloaded from the website of the original authors. Table 2 lists the results, where we select the best results obtained by k-Shape over 100 repeats. We can observe that the results obtained by USLM are better.

Table 2: Comparison between k-Shape and USLM

Rand Index   k-Shape   USLM
CBF          0.74      1.00
ECG200       0.70      0.76
Fac.F.       0.64      0.79
Ita.Pow      0.70      0.82
Lig.2        0.65      0.80
Lig.7        0.74      0.79
OSU L.       0.66      0.82
Average      0.69      0.83

From Table 2, we can observe that USLM outperforms the k-Shape algorithm on all seven datasets. The average improvement over the seven datasets is 14% and the best improvement is 26% on the CBF dataset. To sum up, the results show that USLM attains higher accuracy than k-Shape.

7 Conclusions

In this paper, we investigated a new problem of feature learning from unlabeled time series data. To solve the problem, we proposed a new learning model, USLM, which combines pseudo-class labels, spectral analysis, shapelet regularization and regularized least-squares minimization. USLM can auto-learn the most discriminative features from unlabeled time series data. Experiments on real-world time series data show that USLM obtains an average improvement of 14% compared to k-Shape.

Acknowledgments

This work was supported by the Australian Research Council (ARC) Discovery Projects under Grants No. DP140100545 and DP140102206, and has also been partially supported by grants from the National Natural Science Foundation of China (Nos. 61472390 and 11271361). Y. Tian is the corresponding author.

References

[Cai and Ng, 2004] Yuhan Cai and Raymond Ng. Indexing spatio-temporal trajectories with Chebyshev polynomials. In SIGMOD, pages 599–610, 2004.

[Cai et al., 2010] Deng Cai, Chiyuan Zhang, and Xiaofei He. Unsupervised feature selection for multi-cluster data. In KDD, pages 333–342, 2010.

[Cetin et al., 2015] Mustafa S. Cetin, Abdullah Mueen, and Vince D. Calhoun. Shapelet ensemble for multi-dimensional time series. In SDM, pages 307–315, 2015.

[Chang et al., 2012] Kai-Wei Chang, Bikash Deka, Wen-Mei W. Hwu, and Dan Roth. Efficient pattern-based time series classification on GPU. In ICDM, pages 131–140, 2012.

[Chen et al., 2015] Yanping Chen, Eamonn Keogh, Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdullah Mueen, and Gustavo Batista. The UCR time series classification archive, www.cs.ucr.edu/~eamonn/time_series_data/, 2015.

[Donath and Hoffman, 1973] William E. Donath and Alan J. Hoffman. Lower bounds for the partitioning of graphs. IBM Journal of Research and Development, 17(5):420–425, 1973.

[Grabocka et al., 2014] Josif Grabocka, Nicolas Schilling, Martin Wistuba, and Lars Schmidt-Thieme. Learning time-series shapelets. In KDD, pages 392–401, 2014.

[Halkidi et al., 2001] Maria Halkidi, Yannis Batistakis, and Michalis Vazirgiannis. On clustering validation techniques. Journal of Intelligent Information Systems, 17(2):107–145, 2001.

[He et al., 2005] Xiaofei He, Deng Cai, and Partha Niyogi. Laplacian score for feature selection. In NIPS, pages 507–514, 2005.

[He et al., 2012] Qing He, Zhi Dong, Fuzhen Zhuang, Tianfeng Shang, and Zhongzhi Shi. Fast time series classification based on infrequent shapelets. In ICMLA, pages 215–219, 2012.

[Hills et al., 2014] Jon Hills, Jason Lines, Edgaras Baranauskas, James Mapp, and Anthony Bagnall. Classification of time series by shapelet transformation. Data Mining and Knowledge Discovery, 28(4):851–881, 2014.

[Hirano and Tsumoto, 2006] Shoji Hirano and Shusaku Tsumoto. Cluster analysis of time-series medical data based on the trajectory representation and multiscale comparison techniques. In ICDM, pages 896–901, 2006.

[Lines et al., 2012] Jason Lines, Luke M. Davis, Jon Hills, and Anthony Bagnall. A shapelet transform for time series classification. In KDD, pages 289–297, 2012.

[Mueen et al., 2011] Abdullah Mueen, Eamonn Keogh, and Neal Young. Logical-shapelets: an expressive primitive for time series classification. In KDD, pages 1154–1162, 2011.

[Paparrizos and Gravano, 2015] John Paparrizos and Luis Gravano. k-Shape: Efficient and accurate clustering of time series. In SIGMOD, pages 1855–1870, 2015.

[Rakthanmanon and Keogh, 2013] Thanawin Rakthanmanon and Eamonn Keogh. Fast shapelets: A scalable algorithm for discovering time series shapelets. In SDM, pages 668–676, 2013.

[Rand, 1971] William M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971.

[Ruiz et al., 2012] Eduardo J. Ruiz, Vagelis Hristidis, Carlos Castillo, Aristides Gionis, and Alejandro Jaimes. Correlating financial time series with micro-blogging activity. In WSDM, pages 513–522, 2012.

[Shariat and Pavlovic, 2011] Shahriar Shariat and Vladimir Pavlovic. Isotonic CCA for sequence alignment and activity recognition. In ICCV, pages 2572–2578, 2011.

[Tang and Liu, 2014] Jiliang Tang and Huan Liu. An unsupervised feature selection framework for social media data. IEEE Transactions on Knowledge and Data Engineering, 26(12):2914–2927, 2014.

[Von Luxburg, 2007] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

[Wistuba et al., 2015] Martin Wistuba, Josif Grabocka, and Lars Schmidt-Thieme. Ultra-fast shapelets for time series classification. Journal of Data and Knowledge Engineering, 2015.

[Yang et al., 2011] Yi Yang, Heng Tao Shen, Zhigang Ma, Zi Huang, and Xiaofang Zhou. l2,1-norm regularized discriminative feature selection for unsupervised learning. In IJCAI, pages 1589–1594, 2011.

[Ye and Keogh, 2009] Lexiang Ye and Eamonn Keogh. Time series shapelets: a new primitive for data mining. In KDD, pages 947–956, 2009.

[Zakaria et al., 2012] Jamaluddin Zakaria, Abdullah Mueen, and Eamonn Keogh. Clustering time series using unsupervised-shapelets. In ICDM, pages 785–794, 2012.

[Zhang et al., 2015] Peng Zhang, Chuan Zhou, Peng Wang, Byron J. Gao, Xingquan Zhu, and Li Guo. E-tree: An efficient indexing structure for ensemble models on data streams. IEEE Transactions on Knowledge and Data Engineering, 27(2):461–474, 2015.

[Zhao and Liu, 2007] Zheng Zhao and Huan Liu. Spectral feature selection for supervised and unsupervised learning. In ICML, pages 1151–1157, 2007.
