Appl Intell (2011) 35:226–241 DOI 10.1007/s10489-010-0216-5

Semi-Markov conditional random fields for accelerometer-based activity recognition

La The Vinh · Sungyoung Lee · Hung Xuan Le · Hung Quoc Ngo · Hyoung Il Kim · Manhyung Han · Young-Koo Lee

Published online: 12 March 2010
© Springer Science+Business Media, LLC 2010

Abstract  Activity recognition is becoming an important research area and is finding its way to many application domains, ranging from daily life services to industrial zones. Sensing hardware and learning algorithms are two important components of activity recognition. For sensing devices, we prefer to use accelerometers due to their low cost and low power requirements. For learning algorithms, we propose a novel implementation of the semi-Markov Conditional Random Fields (semi-CRF) introduced by Sarawagi and Cohen. Our implementation not only outperforms the original method in terms of computational complexity (at least 10 times faster in our experiments) but is also able to capture the interdependency among labels, which was not possible in the previously proposed model. Our results indicate that the proposed approach works well even for complicated activities like eating and driving a car. The average precision and recall are 88.47% and 86.68%, respectively, which are higher than the results obtained by other methods such as the Hidden Markov Model (HMM) or the Topic Model (TM).

Keywords  Activity recognition · Wearable sensors · Accelerometer · Hidden Markov Model (HMM) · Conditional Random Fields (CRF)

1 Introduction

Nowadays, activity recognition is an increasingly important research area. The modern life style tends to involve more sedentary jobs, while there is growing evidence of the relationship between common health problems, such as diabetes, cardiovascular disease, and osteoporosis, and the level of physical activity [24]. Therefore, simple monitoring will not protect anybody from any disease, but it may help to assess and then alter the life style, which in turn could result in health benefits. In addition, activity recognition has been considered a potential factor in improving convenience as well as productivity at the work place; for example, in smart hospitals [6, 18], in aircraft maintenance [10], or in a workshop [13]. Also, such activity recognition systems can be used to predict abnormal behaviors such as falling down [14] for emergency response in health-care systems.

There are various approaches using video [4] and deployed sensors [21]; many researchers, however, have used accelerometers in their work due to their low cost, low power requirements, portability, and versatility. Therefore, we also use accelerometer-based input for our activity recognition system. With respect to recognition methods, the sliding window approach is commonly used in accelerometer-based activity recognition [1, 13, 17, 20]. However, in most cases, a sliding window cannot cover one complete activity, since the duration of different activities usually varies significantly and the start time of an activity in a continuous stream is unknown in advance. Thus, the sliding window approach may produce fragments of activities, making it difficult to obtain comprehensive models that satisfy the performance requirements of a continuous activity recognition system. One feasible solution for this problem is to take the duration of activities into account so that some short-length fragments can be eliminated. In addition to the problem associated with sliding windows, the interdependency among activities is another issue which should be considered when detecting activities in a continuous data stream.

L.T. Vinh · S. Lee (corresponding author) · H.X. Le · H.Q. Ngo · H.I. Kim · M. Han · Y.-K. Lee
Dept. of Computer Engineering, Kyung Hee University, 446-701 Room 351, Seocheon-dong, Giheung-gu, Yongin-si, Gyeonggi-do, Republic of Korea
e-mail: [email protected]

Nevertheless, to the best of our knowledge, none of the existing activity recognition models is able to handle all the aforementioned problems, especially for large-scale activity recognition systems. Therefore, in this study, we propose a novel method for activity recognition, based on the work of Sarawagi and Cohen [19], to model the duration as well as the interdependency of activities. Furthermore, we introduce a caching algorithm to overcome the high computational complexity of the original work [19].

2 Paper contribution and outline

Our contributions in this work are threefold. First, we propose a novel implementation of semi-Markov Conditional Random Fields (semi-CRF) that is superior to the one proposed in [19]. Second, we propose an efficient algorithm for parameter estimation, which runs much faster than the original training algorithm in [19]. Third, we apply the proposed semi-CRF to accelerometer-based activity recognition with a large-scale dataset.

The rest of this paper is organized as follows. We briefly survey related work and its results in Sect. 3. Section 4 describes the background of our work, which includes the standard CRF [9, 23] and an existing implementation of semi-CRF [19]. We also point out limitations of these models, which are solved by our proposed approach as described in Sect. 5. Section 6 shows how the proposed semi-CRF model can be applied to accelerometer-based activity recognition. In Sect. 7, we discuss our experiments and results in detail to show our improvements. We conclude the paper and outline future work in Sect. 8. Finally, we present details of our algorithms in the appendix.

3 Related work

So far, many algorithms have been proposed for accelerometer-based activity recognition. Decision trees, support vector machines, and some other kinds of classification methods were evaluated in [1, 17]. To make use of the sequential structure of activities, the Hidden Markov Model (HMM) was used in [20]. Recently, the Conditional Random Fields (CRF) model was introduced as a much better approach than the HMM for sequential modeling [9]. Thus, some researchers have successfully applied CRF to activity recognition [12, 23].

A limitation of both the conventional HMM and the first-order CRF is the Markovian property, which assumes that the current state depends only on the previous state. Because of this assumption, the labels of two adjacent states are supposed to occur successively in the observation sequence. Unfortunately, this presumption is not always satisfied in reality. For example, in the activity recognition problem, two expected activities (activities that we want to recognize) are often separated by irrelevant activities (activities that we do not intend to detect). Furthermore, constant self-transition probabilities cause the distribution of a state's duration to be geometric [16], which is inappropriate for real activity durations.

In [19], Sarawagi and Cohen have shown that semi-CRF is capable of using an explicit duration model. It, however, increases the computational complexity of the forward and backward algorithms by D times, from O(TM²) to O(TM²D), where T, M, and D are the length of the input sequence, the number of possible label values, and the maximum duration length, respectively. In [5], the proposed semi-CRF model requires a complexity of O(TM²D) for estimating each gradient. If we have N parameters to be optimized, the computational load will be O(NTM²D), which is very high. Truyen et al. [22] introduced a more complicated model, called Hierarchical Semi-Markov Conditional Random Fields (HSCRF), and demonstrated that HSCRF could be converted to semi-CRF as a special case. Nevertheless, their conversion did not show any improvement in the complexity required for the optimization of the model's parameters. In [15], the authors proposed a method to decrease the computational cost of semi-CRF; they, however, utilized a Bayes filter to eliminate some sequences from the computation, so the approach did not preserve the original problem. In short, the semi-CRF model is a potential solution for modeling sequential data like activity data. However, the current training algorithms for semi-CRF require high complexity, making the model impractical in large-scale systems. Furthermore, the semi-CRF model proposed in [19] is still not able to solve the long-range transition difficulty. The authors in [11] used a high-order CRF model to capture long-range transitions for a predefined transition order; however, if the order of transition is not known exactly, their method cannot be used. It should also be noted that not many research contributions have been made in scalable activity recognition [7].

In this study, we propose to overcome the above limitations of the existing work by introducing our novel semi-CRF to model both the duration and the interdependency of activities. Additionally, we develop a fast training algorithm for our model, making it suitable for scalable activity recognition applications.

4 Background

In this section, we briefly review the theory of Conditional Random Fields [9] and its extension [19]. We also point out the limitations of the existing work at the end of this section.

4.1 Conditional random fields

In the conventional CRF [9], the input sequence X and the corresponding label sequence Y of length T are given in the form

X = {x_1, x_2, ..., x_T},  (1)

Y = {y_1, y_2, ..., y_T}.  (2)

In this work, we consider only the linear-chain CRF as depicted in Fig. 1. The likelihood of the labeled data is calculated as

P(Y|X) = ∏_{t=1}^{T} Ψ(y_{t−1}, y_t, X) / Z_X,  (3)

Ψ(y_{t−1}, y_t, X) = e^{W^T F(y_{t−1}, y_t, X)},  (4)

Z_X = Σ_Y ∏_{t=1}^{T} Ψ(y_{t−1}, y_t, X),  (5)

where F is a column vector of feature functions, W is a column vector of model parameters, and Ψ is called the potential function. Z_X, the normalization factor, is computed by using the forward/backward variables

α_t(y_t) = Σ_{y_{t−1}} Ψ(y_{t−1}, y_t, X) α_{t−1}(y_{t−1}),  (6)

Z_X = Σ_{y_T} α_T(y_T).  (7)

Fig. 1 The graphical structure of a linear-chain Conditional Random Fields model

Model parameters are estimated so that P(Y|X) is maximized, or equivalently, L(Y|X) = log(P(Y|X)) is maximized. Since solving dL/dw_i = 0, i = 1, 2, 3, ..., N, is intractable, numerical methods such as stochastic gradient ascent are applied to optimize the concave function L(Y|X).

4.2 Semi-Markov conditional random fields

Since the conventional CRF is limited by the Markovian assumption that the label y_t at time t depends only on the previous label y_{t−1}, it is not able to capture the duration distribution or the interdependency of labels. Therefore, the semi-Markov model has been proposed to handle these issues. In [19], Sarawagi and Cohen described a method for learning and inference with semi-CRF. The authors include in each state a label, a beginning time, and an ending time. Thus, a new state is defined as

s_i = (y, b, e),  i = 1, 2, ..., P,  (8)

where P is the length of the sequence S = s_1, ..., s_P, which is constructed from the input labels Y = (y_1, y_2, ..., y_T); y, b, and e are the label, beginning time, and ending time of the state s_i, respectively. For example, if we have Y = (1, 1, 2, 2, 2, 3, 4, 4) then S = {(1, 1, 2), (2, 3, 5), (3, 6, 6), (4, 7, 8)}. Originally, the beginning and ending times must satisfy the following constraints:

s_i.b ≤ s_i.e,  i = 1, 2, ..., P,  (9)

s_{i+1}.b = s_i.e + 1,  i = 1, 2, ..., P − 1,  (10)

s_1.b = 1,  (11)

s_P.e = T.  (12)

Now, instead of computing the likelihood of Y given X, the likelihood of S given X is estimated by

P(S|X) = ∏_{i=1}^{P} Ψ(s_{i−1}, s_i, X) / Z_X,  (13)

Z_X = Σ_S ∏_{i=1}^{P} Ψ(s_{i−1}, s_i, X),  (14)

where Ψ(s_{i−1}, s_i, X) encodes the potential of the transition from s_{i−1} to s_i. In the following equations, we suppose that any function Ψ(s_{i−1}, s_i, X) can be rewritten in the form Ψ(s_{i−1}.y, s_i.y, X, s_i.b, s_i.e). For example, with the sequence S above, we can rewrite Ψ(s_1, s_2, X) as Ψ(1, 2, X, 3, 5). Furthermore,

Ψ(s_{i−1}, s_i, X) = e^{W^T F(s_{i−1}, s_i, X)},  (15)

where W = [w_1, w_2, ..., w_N]^T is a column vector of model parameters and F(s_{i−1}, s_i, X) = [f^1(s_{i−1}, s_i, X), f^2(s_{i−1}, s_i, X), ..., f^N(s_{i−1}, s_i, X)]^T is a column vector of feature functions. In (13) and (14), we can consider the product of the potential functions Ψ over all transitions of a sequence as the potential of the sequence. Thus, we can rewrite (13) in the form

P(S|X) = Pol(S) / Σ_{S'} Pol(S'),  (16)

where

Pol(S) = ∏_{i=1}^{P} Ψ(s_{i−1}, s_i, X)  (17)

is the potential of the sequence S = s_1, s_2, ..., s_P. The forward algorithm and parameter estimation are implemented based on the following equations [19].

5 Semi-CRF with discontiguous states

In this section, we present details of our proposed model, which is applied to accelerometer-based activity recognition, based on the semi-CRF model introduced in [19]. To handle the problem of contiguous state labels, we use inequalities instead of the equalities in (10), (11), and (12): the constraint s_{i+1}.b = s_i.e + 1 is relaxed to s_{i+1}.b > s_i.e, and the boundary constraints are relaxed accordingly, so that segments of expected labels no longer need to be adjacent.
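The segment representation in (8) can be illustrated with a short helper that converts a frame-level label sequence Y into states s_i = (y, b, e). The function below is our own illustration, not part of the original model:

```python
def to_segments(labels):
    """Convert a frame-level label sequence Y into semi-Markov states
    s_i = (y, b, e) with 1-based, inclusive begin/end times as in (8)-(12)."""
    segments, b = [], 1
    for t in range(1, len(labels) + 1):
        # Close the current segment at the last frame or on a label change.
        if t == len(labels) or labels[t] != labels[t - 1]:
            segments.append((labels[b - 1], b, t))
            b = t + 1
    return segments

# The example from Sect. 4.2: Y = (1,1,2,2,2,3,4,4)
print(to_segments([1, 1, 2, 2, 2, 3, 4, 4]))
# -> [(1, 1, 2), (2, 3, 5), (3, 6, 6), (4, 7, 8)]
```

Under the relaxed constraints of this section, segments labeled "IA" would simply be dropped from such a list, leaving gaps between the remaining expected states.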

Fig. 3 Training Data Clustering
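The clustering module of Fig. 3 can be sketched as follows. This is a simplified stand-in: the codebook is drawn at random rather than trained with the Linde-Buzo-Gray algorithm, and the per-window feature definitions (mean, standard deviation, mean-crossing rate) are our own illustration of the features named later in Sect. 7.2:

```python
import numpy as np

def frames(signal, win=512, overlap=0.5):
    """Chop a 1-D signal into equal-length, overlapped windows."""
    step = int(win * (1 - overlap))
    return [signal[i:i + win] for i in range(0, len(signal) - win + 1, step)]

def features(w):
    """Mean, standard deviation, and mean-crossing rate of one window
    (the exact feature definitions here are assumptions)."""
    c = w - w.mean()
    # A mean crossing: two consecutive samples on opposite sides of the mean.
    mcr = np.count_nonzero(np.signbit(c[:-1]) != np.signbit(c[1:])) / (len(w) - 1)
    return np.array([w.mean(), w.std(), mcr])

def quantize(feats, codebook):
    """Index of the nearest codebook vector for each feature vector."""
    d = np.linalg.norm(feats[:, None, :] - codebook[None, :, :], axis=2)
    return d.argmin(axis=1)

rng = np.random.default_rng(0)
signal = rng.standard_normal(2048)
feats = np.vstack([features(w) for w in frames(signal)])
codebook = rng.standard_normal((64, 3))   # stand-in for a trained LBG codebook
symbols = quantize(feats, codebook)       # discrete input for the semi-CRF
```

With a 2048-sample signal, 512-sample windows, and 50% overlap, this produces seven discrete symbols, one per window.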

Fig. 2 (Color online) Duration potential with different values of mean and standard deviation. If a detected segment has a length of, for example, 6, then it most likely belongs to the same class as the label whose duration potential is presented in green
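The bell-shaped duration potential of Fig. 2 can be sketched as a Gaussian-shaped log-potential. The exact parameterization of (26) is not reproduced here; the form below, matching the (d − m_y)²/(2σ_y²) factor that appears in the gradient expressions of the appendix, is an assumption, and the weighting of the term is omitted:

```python
import math

def duration_potential(d, mean, std):
    """Gaussian-shaped duration (log-)potential for a segment of length d:
    -(d - mean)^2 / (2 std^2), highest when d equals the mean duration."""
    return -((d - mean) ** 2) / (2 * std ** 2)

# A detected segment of length 6 scores highest under the label whose
# mean duration is closest to 6 (hypothetical means 3, 6, and 12).
scores = {m: duration_potential(6, m, 2.0) for m in (3, 6, 12)}
best = max(scores, key=scores.get)
# best == 6
```

The peak of the bell sits at the mean, which is why the most likely duration has the highest potential value.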

Clearly, the most likely duration (at the center of the bell) has the highest potential value. Since irrelevant activities can have an arbitrary length, we do not model the duration of such activities, because doing so would increase the possible maximum duration length, D, which may result in high complexity. Note that in (26) we assume that the duration of an activity has a Gaussian-like distribution. Although the assumption is not always true, it is reasonable, since most activities often last around a constant amount of time.

Next, we define the weighted observation potential function as

Q^O(s_{i−1}, s_i, X) = Σ_{y,t_1,t_2} [G_y(y, t_1, t_2) δ(s_i.y = y, s_i.b = t_1, s_i.e = t_2) + G_IA(IA, t_1, t_2) δ(s_{i−1}.e + 1 = t_1, s_i.b − 1 = t_2)],  (27)

where

G_y(y, t_1, t_2) = Σ_{t=t_1}^{t_2} Σ_o w^O(y, o) δ(x_t = o),  (28)

G_IA(IA, t_1, t_2) = Σ_{t=t_1}^{t_2} Σ_o w^O(IA, o) δ(x_t = o),  (29)

and w^O(y, o) and w^O(IA, o) are, in that order, the weights of the observation given that input symbol o is observed in a state with label y or IA. For convenience in the presentation of the following equations, we denote G(y, t_1, t_2) = G_y(y, t_1, t_2) + G^D(y, t_2 − t_1 + 1) as a combined potential function.

Our approach is similar to that in [19], but we allow discontinuity in the times of states by using s_{i+1}.b > s_i.e instead of s_{i+1}.b = s_i.e + 1 as in [19]. The inequality enables our model to skip irrelevant activities and to directly model the transition between two expected activities. For the training and inference algorithms, we propose novel implementations of the forward and backward algorithms, the gradient-calculation algorithm, and the Viterbi algorithm. Details of our algorithms are presented in the appendix at the end of this paper.

6 The proposed activity recognition framework

In this section, we present how the proposed semi-CRF model can be applied to accelerometer-based activity recognition.

In our system, discrete values were used as the input of the semi-CRF. Therefore, we first quantized the continuous input signal from the accelerometers using the Linde-Buzo-Gray (LBG) algorithm. Figure 3 illustrates a block diagram of the clustering module. In this module, we employed overlapped sliding windows to chop the signal into equal-length frames. After windowing, feature vectors were extracted from the signal frames and fed into an LBG clustering function to construct a codebook. Then, every input feature vector was quantized to an integer value, which was the index of the nearest codebook vector. The output sequence of discrete values was used for training or testing the semi-CRF model, as shown in Fig. 4.

Fig. 4 Training/Testing semi-CRF with discrete input sequences

We conducted experiments to analyze the effect of different parameter values on the achieved results. The percentage of overlap between two contiguous windows, the length of each window, and the codebook's size were chosen based on these experiments. Figure 5 shows the results obtained with different values of these parameters.

Fig. 5 Achieved precision and recall with different parameter values

From the results depicted in Fig. 5, we decided that the overlap portion was 50%, the window's length was 512 samples, and the number of codebook vectors was 64.

7 Evaluation

In this section, we first show that our proposed algorithm achieves a remarkable improvement in terms of computational complexity compared to the previous work. Then, we present the recognition results with an available large-scale

Fig. 6 Average time needed for computing all the gradients. Herein, the number of labels (M) is 4, the maximum duration (D) is 16, the codebook's size (V) is 128, and the length of the input sequence (T) changes from 32 to 1024. Therefore, the number of gradients is M + M² + MV = 532

dataset. From the achieved results, we show that the proposed method not only speeds up the calculation significantly, but also produces much better recognition accuracy. All the experiments were implemented in C++ on a computer with an Intel dual-core 1.83 GHz processor and 512 MB RAM.

7.1 Complexity analysis

In the algorithm proposed in [19], the estimation of (22) requires that α(t, y) and η_k(t, y) be pre-calculated for all possible values of t and y. Each of them needs a complexity of O(MD), as can be seen in (18) and (21). Therefore, the complexity per gradient of (22) is proportional to O(TM²D). Hence, the estimation of the gradients for all N model parameters takes O(NTM²D).

In our solution, gradients are computed by using (52), (60), (62), and (63). It is obvious that if α, γ, λ, β, θ, ζ, and ν are cached, then estimating the above equations requires a maximum complexity of O(TD). Hence, for optimizing N parameters, our algorithm requires only O(NTD) to calculate all gradients completely. Nevertheless, we need to take into account the extra time for estimating the cached variables. As shown in the pseudo-code for the forward and backward algorithms, α, γ, λ, β, η, and ζ can be computed with O(2TM(M + D)). Meanwhile, from (59) and (68) we see that θ and ν take O(TMD) and O(TM²), respectively. In total, caching these variables requires a complexity of O(3TM(M + D)).

It can be seen that our improvement comes entirely from the caching mechanism. In our gradients (52), (60), (62), and (63), we utilize different partitioning methods, which take the characteristics of the gradients into account. For example, in (52) we partition the set of all length-T label sequences (S^T) into subsets (Λ^{y'}_t ⊕ Ω^y_{t+1}) based on the occurrence time of a transition from y' to y. Meanwhile, in [19] a fixed partitioning method, where S^T was always divided into subsets based on the length of the last segment, was used regardless of the different characteristics of the gradients. This is why our algorithm achieves much more efficient caching.

Herein, we take a numerical example to compare O(NTM²D), the estimated complexity in [19], with O(3TM(M + D)) + O(NTD), our algorithm's complexity. Suppose that we need to compute N = 1000 gradients of an input sequence of length T = 1000, with maximum duration D = 100 and M = 8 labels; then the former is about 6.4 × 10⁹ operations, while the latter is about 10⁸. In addition, Fig. 6 illustrates another comparison of the two complexities with N = 532, M = 4, D = 16, and T changing from 32 to 1024. Both algorithms were evaluated on the same computer with the same dataset. The amount of time required by the method proposed by Sarawagi and Cohen in [19] is marked in blue; the time consumed by our algorithm is marked in red. Obviously, there is a remarkable improvement, since our time requirement is at least 10 times less than the computation time required by Sarawagi and Cohen's method.
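The numerical example in Sect. 7.1 can be reproduced directly from the stated operation counts; the constants are the ones given in the text (N = 1000, T = 1000, M = 8, D = 100):

```python
# Operation-count comparison from Sect. 7.1: N gradients, T time steps,
# M labels, D maximum segment length.
N, T, M, D = 1000, 1000, 8, 100

original = N * T * M**2 * D      # O(NTM^2 D): per-gradient cost in [19]
caching = 3 * T * M * (M + D)    # one-off cost of filling the cached tables
ours = caching + N * T * D       # O(3TM(M+D) + NTD) with cached variables

print(original, ours, original / ours)
```

The gap between the two counts is roughly a factor of 60 for these values, consistent with the at-least-tenfold speed-up measured in Fig. 6.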

Table 1 Low level activities occurring during the routines [8]

Activity Average duration Occurrences Total

Sitting/desk activities               49.41 min   54    3016.0 min
Unlabeled                             1.35 min    239   931.3 min
Having dinner                         17.62 min   6     125.3 min
Walking freely                        2.86 min    38    124.2 min
Driving car                           10.37 min   10    120.3 min
Having lunch                          10.95 min   7     75.1 min
Discussing at white board             12.80 min   5     62.7 min
Attending a presentation              48.9 min    1     48.9 min
Driving a bike                        11.82 min   4     46.3 min
Walking while carrying something      1.43 min    10    23.1 min
Walking                               2.71 min    7     23.0 min
Picking up mensa food                 3.30 min    7     22.6 min
Sitting/having a coffee               5.56 min    4     21.8 min
Queuing in a line                     2.89 min    7     19.8 min
Using the toilet                      1.95 min    2     16.7 min
Washing dishes                        3.37 min    3     12.8 min
Standing/having a coffee              6.7 min     1     6.7 min
Preparing food                        4.6 min     1     4.6 min
Washing hands                         0.32 min    3     2.2 min
Running                               1.0 min     1     1.0 min
Wiping the whiteboard                 0.8 min     1     0.8 min

7.2 Experiments

In our experiments, we used a dataset of long-term activities, which has been published at http://www.mis.informatik.tu-darmstadt.de/data. The dataset contains 7 days of continuous data (except the sleeping time), measured by two tri-axial accelerometers, one on the wrist and the other in the right pocket. The sensors' sampling frequency was set to 100 Hz. In the published dataset, the author calculated the mean value of every 0.4 s window, so the actual sampling frequency was about 2.5 Hz. In total, the dataset has 34 labeled activities, of which a subset of 24 activities occurred during the routines. Table 1 lists all the annotated activities, which were grouped by the authors in [8] into 5 daily routines, as seen in Table 2. To compare our method with the existing methods, which were evaluated in [8] and [2], we kept all the data settings unchanged.

Table 2 Daily routines

Routine          Occurrences
Dinner           7
Commuting        14
Lunch            7
Office work      14
Unknown (null)   >50

From our experimental results illustrated in Fig. 5, we chose 50%-overlapped sliding windows with a length of 512 samples (about 3.41 minutes). Within a window, the mean, standard deviation, and mean crossing rate were extracted from each signal. Then, these values were combined with the time of the frame to form a feature vector. Finally, we followed the leave-one-out cross-validation rule to measure the recognition results, as can be seen in Table 3. Figure 7 demonstrates an example of recognized routines together with the ground truth.

Our results show a considerable improvement compared to [2, 8]: except for the dinner routine, which has lower precision and recall than those of [2], we obtain better results for the 3 other routines. Since we utilize features (mean, standard deviation, time) similar to those in the original work, the improvement can be explained as the result of taking into account the interdependency together with the duration of activities. Nevertheless, when trying to explain the results obtained with the dinner routine, we find that our precision and recall are affected by the fragmentation of the routine. As seen in Table 4, the worst result, decreasing the overall achievement, occurs with the dinner routine of day 2, which is interrupted by other activities such as walking and carrying something. Meanwhile, we still obtain quite good results (on

Table 3 Recognition results: the first and second columns contain the results evaluated by Huynh et al. in [8] using an HMM and a TM, respectively; the third column contains the results of Ulf Blanke and Bernt Schiele using boosting techniques [2]; our results are presented in the last column. All the recognition experiments were evaluated with the same dataset

Baseline (HMM) Huynh et al. Ulf Blanke et al. Our method

Routines     Precision/Recall (%)   Precision/Recall (%)   Precision/Recall (%)   Precision/Recall (%)
Dinner       88.60/27.30            56.90/40.20            85.27/90.48            78.43/71.57
Commuting    72.60/31.50            83.50/71.10            81.77/82.36            86.57/86.86
Lunch        84.40/80.70            73.80/70.20            84.56/90.04            91.86/91.57
Office       89.20/91.10            93.40/81.80            98.12/93.63            97.00/96.71

Table 4 Sequence of activities, which occur in the dinner routine. As can be seen, a dinner routine often contains some related activities such as having dinner, sitting, and washing dishes. However, in day 2 the dinner routine is interrupted by walking and carrying something

Day Sequence of activities Precision (%) Recall (%)

1   Having dinner, sitting                                                                            100   100
2   Having dinner, walking, sitting, walking, sitting, carrying something, walking, washing dishes    0     0
3   Having dinner, sitting, washing dishes                                                            60    100
4   Having dinner                                                                                     100   76
5   Having dinner, washing dishes                                                                     95    95
6   Having dinner, sitting                                                                            100   45
7   Having dinner, sitting, washing dishes                                                            94    85

Fig. 7 A single day recognized routines

average 91.00% and 90.34% for precision and recall, respectively) with the other days, which are not fragmented.

Table 5 shows an example of the transition matrix after training. It can be seen that the potential of the most likely transitions, such as "office work–lunch", "commuting–dinner", or "lunch–office", is always the highest, to encourage the prediction of these events.

In addition to the accuracy, our solution also has a practical training time. With 72 hours of training data, our algorithm takes about 2 hours to complete parameter estimation

Table 5 Transition weights

             Dinner    Commuting   Lunch    Office

Dinner       −9.67     −10.99      −9.34    −8.99
Commuting    −2.50     −8.38       −6.80    −4.20
Lunch        −7.08     −9.39       −8.23    −6.12
Office       −9.60     −9.38       −7.51    −8.38
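The claim that "office work–lunch", "commuting–dinner", and "lunch–office" carry the highest potentials can be checked mechanically against the Table 5 values; the helper below is our own illustration:

```python
# Transition weights from Table 5 (rows: current routine, columns: next).
routines = ["Dinner", "Commuting", "Lunch", "Office"]
W = {
    "Dinner":    [-9.67, -10.99, -9.34, -8.99],
    "Commuting": [-2.50,  -8.38, -6.80, -4.20],
    "Lunch":     [-7.08,  -9.39, -8.23, -6.12],
    "Office":    [-9.60,  -9.38, -7.51, -8.38],
}

def most_likely_next(current):
    """Routine with the highest transition weight out of `current`."""
    row = W[current]
    return routines[row.index(max(row))]

assert most_likely_next("Office") == "Lunch"
assert most_likely_next("Commuting") == "Dinner"
assert most_likely_next("Lunch") == "Office"
```

During Viterbi decoding, these weights bias the model toward the routine orderings seen in training, which is the interdependency effect discussed above.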

in a system with an Intel dual-core 1.83 GHz processor and 512 MB RAM.

In conclusion, our proposed model and algorithm work well in practice, making the semi-CRF model practical for large-scale activity recognition.

8 Conclusion

In this paper, we have presented a novel implementation of semi-Markov Conditional Random Fields and fast algorithms for gradient calculation. The solution not only is able to make use of the interdependency and the duration of activities to increase the accuracy, but also takes a practical amount of time for parameter estimation. Although the algorithm produced a low accuracy in some particular cases, overall we have shown that our approach obtains better results when compared to others. In this study, we only used activity recognition as our target application. However, the semi-CRF model can be applied extensively to other fields such as natural language processing or gene prediction. Our work, therefore, may bring a significant contribution not only to activity recognition, but also to a broader range of research areas.

For future work, we plan to extend the proposed algorithms to handle continuous inputs and disparate sensors, such as audio sensors, gyroscope sensors, and video sensors. It would require further research to find out the mechanism of their correlation and the role of each sensor in activity recognition.

Acknowledgements  This research was supported by the MKE (Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-2009-(C1090-0902-0002)). This work was also sponsored by the IT R&D program of MKE/KEIT [10032105, Development of Realistic Multiverse Game Engine Technology], and by the Basic Science Research Program funded by the National Research Foundation (2009-0076798). We also would like to acknowledge Huynh and his colleagues for the publication of the activity dataset.

Appendix A: Forward algorithm

In Appendices A–D we present details of our algorithms, including the forward and backward algorithms, which are used for computing the normalization factor Z_X, the gradient-estimating algorithms, and the Viterbi algorithm for inference.

The forward algorithm computes the normalization factor Z_X efficiently by using dynamic programming. First, we denote

α(y, t) = Σ_{S^t ∈ Γ^y_t} Pol(S^t) = Σ_{S^t ∈ Γ^y_t} ∏_{i=1}^{q} Ψ(s_{i−1}, s_i, X),  (30)

where Γ^y_t = {S = s_1, s_2, ..., s_q} is the set of all semi-Markov sequences which have an original label sequence (y_1, y_2, ..., y_t) whose last expected label is y. Thus, every S = s_1, s_2, ..., s_q ∈ Γ^y_t satisfies s_q.e ≤ t and s_q.y = y. Similarly,

γ(y, t) = Σ_{S^t ∈ Λ^y_t} Pol(S^t) = Σ_{S^t ∈ Λ^y_t} ∏_{i=1}^{q} Ψ(s_{i−1}, s_i, X),  (31)

where Λ^y_t = {S = s_1, s_2, ..., s_q} is the set of all semi-Markov sequences which have an original label sequence (y_1, y_2, ..., y_{t−1}, y_t = y). Therefore, every S^t = s_1, s_2, ..., s_q ∈ Λ^y_t satisfies s_q.e = t and s_q.y = y. Let φ^t represent a special sequence of length t which contains only "IA" labels. From (14) we have

Z_X = Σ_{S^T} Pol(S^T) = Σ_y Σ_{S^T ∈ Γ^y_T} Pol(S^T) + Pol(φ^T) = Σ_y α(y, T) + e^{G_IA(IA, 1, T)}.  (32)

To compute α(y, t) efficiently, we note that

α(y, t) = Σ_{S^t ∈ Γ^y_t} Pol(S^t) = Σ_{S^{t−1} ∈ Γ^y_{t−1}} Pol(S^{t−1} ⊕ IA) + Σ_{S^t ∈ Λ^y_t} Pol(S^t),  (33)

where S^{t−1} ⊕ IA denotes the concatenation of a label "IA" to the end of the original sequence of S^{t−1}. It is easy to see that

Pol(S^{t−1} ⊕ IA) = Pol(S^{t−1}) e^{G_IA(IA, t, t)}.  (34)

From (33) and (34) we have

α(y, t) = α(y, t − 1) e^{G_IA(IA, t, t)} + γ(y, t).  (35)

The full procedure is given in Algorithm 1.

Algorithm 1: Forward algorithm for calculating Z_X

for t = 1 to T do
    for y = 1 to StateNum do
        α[y][t] = 0
        γ[y][t] = 0
        λ[y][t] = 0
        for d = 1 to D do
            if t − d + 1 > 0 then
                γ[y][t] += λ[y][t − d] · e^{G(y, t−d+1, t)}
                γ[y][t] += e^{G_IA(IA, 1, t−d) + G(y, t−d+1, t)}
            else
                break
        if t > 1 then
            α[y][t] = α[y][t − 1] · e^{G_IA(IA, t, t)} + γ[y][t]
        else
            α[y][t] = γ[y][t]
    for y = 1 to StateNum do
        for y' = 1 to StateNum do
            λ[y][t] += α[y'][t] · e^{w^Tr(y', y)}
Z_X = e^{G_IA(IA, 1, T)}
for y = 1 to StateNum do
    Z_X = Z_X + α(y, T)
end

Deriving γ(y, t) from (31), we see that

γ(y, t) = Σ_{d=1}^{D} Σ_{S^{t−d}} Pol(S^{t−d} ⊕ (y, t−d+1, t)),  (36)

where S^{t−d} ⊕ (y, t−d+1, t) represents the appending of d labels y to the original sequence of S^{t−d}. In case S^{t−d} contains at least one state, we can assume that (y*, b, e) is the last state of S^{t−d}; then we have

Pol(S^{t−d} ⊕ (y, t−d+1, t)) = Pol(S^{t−d}) e^{w^Tr(y*, y) + G(y, t−d+1, t)}.  (37)

In the other case, S^{t−d} = ∅ or its original sequence comprises only "IA". It is clear that

Pol(S^{t−d} ⊕ (y, t−d+1, t)) = e^{G_IA(IA, 1, t−d) + G(y, t−d+1, t)}.  (38)

From (36), (37), and (38) we conclude that

γ(y, t) = Σ_{d=1}^{D} [ Σ_{y'} α(y', t−d) e^{w^Tr(y', y) + G(y, t−d+1, t)} + e^{G_IA(IA, 1, t−d) + G(y, t−d+1, t)} ].  (39)

Obviously, in (39) only α(y', t−d) and w^Tr(y', y) depend on y'; therefore, by pre-calculating

λ(y, t) = Σ_{y'} α(y', t) e^{w^Tr(y', y)},  (40)

we have

γ(y, t) = Σ_{d=1}^{D} [ λ(y, t−d) e^{G(y, t−d+1, t)} + e^{G_IA(IA, 1, t−d) + G(y, t−d+1, t)} ].  (41)

Based on (32), (35), (40), and (41), the forward algorithm can be implemented as in the pseudo-code of Algorithm 1.

Appendix B: Backward algorithm

Similarly to the forward algorithm, let us denote

β(y, t) = Σ_{S^{T−t+1} ∈ Ω^y_t} Pol(S^{T−t+1}) = Σ_{S^{T−t+1} ∈ Ω^y_t} ∏_{i=1}^{q} Ψ(s_{i−1}, s_i, X),  (42)

where Ω^y_t = {S = s_1, s_2, ..., s_q} is the set of all semi-Markov sequences which have an original label sequence (y_t, y_{t+1}, ..., y_T) whose first expected label is y, and

η(y, t) = Σ_{S^{T−t+1} ∈ Υ^y_t} Pol(S^{T−t+1}),  (43)

where Υ^y_t = {S = s_1, s_2, ..., s_q} is the set of all semi-Markov sequences which have an original label sequence (y_t = y, y_{t+1}, ..., y_T). Following similar steps as in the forward algorithm, we come up with

β(y, t) = β(y, t + 1) e^{G_IA(IA, t, t)} + η(y, t).  (44)
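Algorithm 1 can be sketched directly from (32), (35), (40), and (41). The potentials below are hypothetical stand-ins (all zero), chosen so that Z_X simply counts the admissible sequences, which makes the recursion easy to check by hand:

```python
import math

def forward_Z(T, labels, D, G, G_IA, w_tr):
    """Forward pass of the proposed semi-CRF following (32), (35), (40),
    and (41). G(y, t1, t2) is the combined segment potential, G_IA(t1, t2)
    the irrelevant-activity potential (taken as 0 for an empty range), and
    w_tr[(y1, y2)] the transition weights; all inputs here are stand-ins."""
    alpha = {(y, t): 0.0 for y in labels for t in range(T + 1)}
    lam = {(y, t): 0.0 for y in labels for t in range(T + 1)}
    for t in range(1, T + 1):
        for y in labels:
            gamma = 0.0
            for d in range(1, D + 1):
                if t - d + 1 <= 0:
                    break
                seg = math.exp(G(y, t - d + 1, t))
                gamma += lam[(y, t - d)] * seg           # preceded by an expected label
                gamma += math.exp(G_IA(1, t - d)) * seg  # preceded by IA only (or nothing)
            alpha[(y, t)] = gamma + alpha[(y, t - 1)] * math.exp(G_IA(t, t))
        for y in labels:  # cache lambda(y, t) = sum_y' alpha(y', t) e^{w_tr(y', y)}
            lam[(y, t)] = sum(alpha[(yp, t)] * math.exp(w_tr[(yp, y)]) for yp in labels)
    return sum(alpha[(y, T)] for y in labels) + math.exp(G_IA(1, T))

# With one label, T = 2, D = 2, and all potentials zero, Z_X counts the
# five admissible sequences: (y)(y), IA(y), (yy), (y)IA, and IA IA.
Z = forward_Z(2, [0], 2, lambda y, a, b: 0.0, lambda a, b: 0.0, {(0, 0): 0.0})
# Z == 5.0
```

The backward pass of Appendix B mirrors this code, running t from T down to 1 with β, η, and ζ in place of α, γ, and λ.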

Algorithm 2: Backward algorithm for calculating Z_X

for t = T down to 1 do
    for y = 1 to StateNum do
        β[y][t] = 0
        η[y][t] = 0
        ζ[y][t] = 0
        for d = 1 to D do
            if t + d − 1 ≤ T then
                η[y][t] += ζ[y][t + d] · e^{G(y, t, t+d−1)}
                η[y][t] += e^{G(y, t, t+d−1) + G_IA(IA, t+d, T)}
            else
                break
        if t < T then
            β[y][t] = β[y][t + 1] · e^{G_IA(IA, t, t)} + η[y][t]
        else
            β[y][t] = η[y][t]
    for y = 1 to StateNum do
        for y' = 1 to StateNum do
            ζ[y][t] += β[y'][t] · e^{w^Tr(y, y')}
end

Appendix C: Gradient estimation

Taking the logarithm of P(S|X), we have

L(S|X) = Σ_{i=1}^{P} (Q^Tr(s_{i−1}, s_i, X) + Q^D(s_{i−1}, s_i, X) + Q^O(s_{i−1}, s_i, X)) − log(Z_X).  (47)

To find the optimal parameter values w* we have to solve dL/dw* = 0. From (47) we know that

dL/dw* = Σ_{i=1}^{P} dQ*(s_{i−1}, s_i, X)/dw* − (1/Z_X) · dZ_X/dw*.  (48)

Herein we use Q* and w* to refer to any kind of potential function and weight (* can be D, Tr, or O). Computing the first term on the right side of (48) is trivial, and Z_X is calculated by the forward algorithm.

As a result,

dZ_X/dw^D(y) = − Σ_{d=1}^{D} Σ_{t=1}^{T} [(d − m_y)² / (2σ²)] Σ_{S^T ∈ χ_y^{d,t}} Π_{i=1}^{P} ψ(s_{i−1}, s_i, X),  (54)

where χ_y^{d,t} = {S^T = s_1, ..., s_P} is the set of all semi-Markov sequences whose original sequences contain d continuous labels y from time t. Let

θ(y, t, d) = Σ_{S^T ∈ χ_y^{d,t}} Π_{i=1}^{P} ψ(s_{i−1}, s_i, X).  (55)

Obviously, each S^T ∈ χ_y^{d,t} can be represented as a concatenation

S^T = S_prev^{t−1} ⊕ (y, t, t+d−1) ⊕ S_post^{T−t−d+1},  (56)

where

S_prev^{t−1} ∈ Γ_{t−1}^{y'} ∪ {φ^{t−1}},  (57)

and

S_post^{T−t−d+1} ∈ Ω_{t+d}^{y*} ∪ {φ^{T−t−d+1}}.  (58)

Equations (55), (56), (57), and (58) imply that

θ(y, t, d) = λ(y, t−1) ζ(y, t+d) e^{G(y, t, t+d−1)}
    + ζ(y, t+d) e^{G_IA(IA, 1, t−1) + G(y, t, t+d−1)}
    + λ(y, t−1) e^{G(y, t, t+d−1) + G_IA(IA, t+d, T)}
    + e^{G_IA(IA, 1, t−1) + G(y, t, t+d−1) + G_IA(IA, t+d, T)}.  (59)

Using θ(y, t, d), the gradient is calculated as

dZ_X/dw^D(y) = − Σ_{d=1}^{D} Σ_{t=1}^{T} [(d − m_y)² / (2σ²)] θ(y, t, d).  (60)

C.3 Gradient of the observation weight

To estimate the gradient of the observation weight, we consider two cases. In the first case, we handle the observation weights of expected labels. From (27), (28), and (29) we can assert that

dQ^O(s_{i−1}, s_i, X)/dw^O(y, o) = Σ_{k=s_i.b}^{s_i.e} δ(s_i.y = y, x_k = o).  (61)

Combining (61) and the definition of θ in (55) leads to

dZ_X/dw^O(y, o) = Σ_{t,d} Σ_{k ∈ [t, t+d−1]} θ(y, t, d) δ(x_k = o).  (62)

Similarly, we come up with the estimation equation in the case of irrelevant labels:

dZ_X/dw^O(IA, o) = Σ_{t=1}^{T} ν(t) δ(x_t = o),  (63)

where

ν(t) = Σ_{S^T} Π_{i=1}^{P} ψ(s_{i−1}, s_i, x),  (64)

and the sum runs over every sequence S^T = s_1, s_2, ..., s_P that does not contain any expected activity at time t. Such an S^T can be decomposed as

S^T = S_prev^{t−1} ⊕ IA ⊕ S_post^{T−t},  (65)

where

S_prev^{t−1} ∈ Γ_{t−1}^{y'} ∪ {φ^{t−1}},  (66)

and

S_post^{T−t} ∈ Ω_{t+1}^{y*} ∪ {φ^{T−t}}.  (67)

Finally, we have

ν(t) = Σ_{y'} Σ_{y} α(y', t−1) β(y, t+1) e^{w^Tr(y', y) + G_IA(IA, t, t)}
    + Σ_{y'} α(y', t−1) e^{G_IA(IA, t, T)}
    + Σ_{y} β(y, t+1) e^{G_IA(IA, 1, t)}
    + e^{G_IA(IA, 1, T)}.  (68)

After computing the derivatives, many convex optimization techniques can be applied [3]. In our solution, we applied a simple stochastic gradient ascent method to find the optimal result. To avoid over-fitting, we use an L2 regularizer, −(1/2) W^T W. Therefore, the actual target function is

L(S|X) = Σ_{i=1}^{P} (Q^Tr(s_{i−1}, s_i, X) + Q^D(s_{i−1}, s_i, X) + Q^O(s_{i−1}, s_i, X)) − log(Z_X) − (ε/2) W^T W,  (69)

where ε is a smoothing constant, which is manually estimated.
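A single update of the stochastic gradient ascent on the regularized objective (69) can be sketched as follows; lr and eps are hypothetical tuning knobs (the paper fixes ε by hand), and grad_loglik stands for the per-sequence gradient of the unregularized terms obtained from (48):

```python
def sga_step(w, grad_loglik, lr=0.01, eps=0.1):
    """One stochastic gradient ascent step on the objective of Eq. (69).

    w            -- current weight vector (list of floats)
    grad_loglik  -- dL/dw_k of the unregularized log-likelihood for the
                    current training sequence, as in Eq. (48)
    lr, eps      -- learning rate and smoothing constant (assumed names)
    The quadratic penalty -(eps/2) * W^T W contributes -eps * w_k to the
    ascent direction of each coordinate.
    """
    return [wk + lr * (gk - eps * wk) for wk, gk in zip(w, grad_loglik)]
```

Iterating this step over randomly ordered training sequences yields the simple stochastic gradient ascent described above; the −eps·w term shrinks weights toward zero and implements the over-fitting control of (69).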

Appendix D: Inference using the Viterbi algorithm

Inference in our semi-CRF is done using the Viterbi algorithm with a complexity of O(TM²D). Our target is to find the best matched sequence Y given an input X so that P(Y|X) is maximized. Let us denote

δ(y, t) = max { log P(S^t = s_1, s_2, ..., s_q | x_1, x_2, ..., x_t) },  (70)

where the maximum is taken over all S^t = s_1, s_2, ..., s_q whose last expected activity is y, or equivalently s_q.e ≤ t and s_q.y = y. With

A = δ(y, t−1) + G_IA(IA, t, t),
B = G_IA(IA, 1, t−d) + G(y, t−d+1, t),

the recursion is

δ(y, t) = max_{d, y'} { A,  B,  δ(y', t−d) + w^Tr(y', y) + G(y, t−d+1, t) }.  (71)

For backtracking we use ∆State(y, t) and ∆Duration(y, t) to store the previous trace of δ(y, t) as follows:

∆State(y, t) = y if δ(y, t) = A;  IA if δ(y, t) = B;  y' otherwise,  (72)

∆Duration(y, t) = 0 if δ(y, t) = A;  d if δ(y, t) = B;  d otherwise.  (73)

Using δ(y, t), ∆State(y, t) and ∆Duration(y, t), we can follow the steps below to track the optimized path.

Algorithm 3: Viterbi algorithm for tracking the best matched sequence

  Step 1. Initialization
      y* = argmax_y δ(y, T)
      t = T
      y = y*
      i = 1
  end
  Step 2. Backtracking
      while y ≠ IA and t > 0 do
          while ∆Duration(y, t) = 0 do
              t = t − 1
          s_i.y = y
          s_i.e = t
          s_i.b = t − ∆Duration(y, t) + 1
          i = i + 1
          y_prev = ∆State(y, t)
          t = t − ∆Duration(y, t)
          y = y_prev
      end
      P = i − 1
  Step 3. Finalization
      y_1, y_2, ..., y_T = OriginalSequence(s_P, s_{P−1}, ..., s_1)
  end

References

1. Bao L, Intille SS (2004) Activity recognition from user-annotated acceleration data. In: Proceedings of the 2nd international conference on pervasive computing, vol 3001, pp 1–17
2. Blanke U, Schiele B (2009) Daily routine recognition through activity spotting. In: Proceedings of the 4th international symposium on location and context-awareness, vol 5561, pp 192–206
3. Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, Cambridge
4. Brdiczka O, Maisonnasse J, Reignier P, Crowley JL (2009) Detecting small group activities from multimodal observations. Appl Intell 30:47–57
5. Doherty MK (2007) Gene prediction with conditional random fields. Master's thesis, Department of Electrical Engineering and , Massachusetts Institute of Technology
6. Favela J, Tentori M, Castro LA, Gonzalez VM, Moran EB, Martínez-García AI (2007) Activity recognition for context-aware hospital applications: issues and opportunities for the deployment of pervasive networks. Mob Netw Appl 12:155–171
7. Huynh T, Blanke U, Schiele B (2007) Scalable recognition of daily activities with wearable sensors. In: Proceedings of the 3rd international symposium on location and context-awareness, vol 4718, pp 50–67
8. Huynh T, Fritz M, Schiele B (2008) Discovery of activity patterns using topic models. In: Proceedings of the 10th international conference on , vol 344, pp 10–19
9. Lafferty J, McCallum A, Pereira F (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th international conference on machine learning, pp 282–289
10. Lampe M, Strassner M, Fleisch E (2004) A ubiquitous computing environment for aircraft maintenance. In: Proceedings of the ACM symposium on applied computing, pp 1586–1592
11. Liao L, Choudhury T, Fox D, Kautz HA (2007) Training conditional random fields using virtual evidence boosting. In: Proceedings of the 20th international joint conference on artificial intelligence, pp 2530–2535
12. Liao L, Fox D, Kautz HA (2007) Extracting places and activities from gps traces using hierarchical conditional random fields. Int J Rob Res 26:119–134
13. Lukowicz P, Starner TE, Troster G, Ward JA (2006) Activity recognition of assembly tasks using body-worn microphones and accelerometers. IEEE Trans Pattern Anal Mach Intell 28:1553–1567
14. Najafi B, Aminian K, Loew F, Blanc Y, Robert PA (2002) Measurement of stand-sit and sit-stand transitions using a miniature gyroscope and its application in fall risk evaluation in the elderly. IEEE Trans Biomed Eng 49:843–851
15. Okanohara D, Miyao Y, Tsuruoka Y, Tsujii J (2006) Improving the scalability of semi-Markov conditional random fields for named entity recognition. In: Proceedings of the 21st international conference on computational linguistics and the 44th annual meeting of the association for computational linguistics, pp 465–472
16. Rabiner LR (1989) A tutorial on hidden Markov models and selected applications in speech recognition. In: Proceedings of the IEEE, vol 77, pp 257–286

17. Ravi N, Dandekar N, Mysore P, Littman ML (2005) Activity recognition from accelerometer data. In: Proceedings of the 20th national conference on artificial intelligence, vol 20, pp 1541–1546
18. Sánchez D, Tentori M, Favela J (2008) Activity recognition for the smart hospital. IEEE Intell Syst 23:50–57
19. Sarawagi S, Cohen W (2005) Semi-Markov conditional random fields for information extraction. In: Proceedings of the 2004 conference on neural information processing systems, pp 1185–1192
20. Suutala J, Pirttikangas S, Röning J (2007) Discriminative temporal smoothing for activity recognition from wearable sensors. In: Proceedings of the 4th international symposium on ubiquitous computing systems, vol 4836, pp 182–195
21. Tapia EM, Intille SS, Larson K (2004) Activity recognition in the home setting using simple and ubiquitous sensors. In: Proceedings of the 2nd international conference on pervasive computing, vol 3001, pp 158–175
22. Truyen TT, Phung DQ, Bui HH, Venkatesh S (2009) Hierarchical semi-Markov conditional random fields for recursive sequential data. In: Proceedings of the 22nd annual conference on neural information processing systems, pp 1657–1664
23. Vail DL, Veloso MM, Lafferty JD (2007) Conditional random fields for activity recognition. In: Proceedings of the 6th international joint conference on autonomous agents and multi-agent systems, p 235
24. World Health Organization (2004) Global strategy on diet, physical activity and health. Technical report

La The Vinh received his B.S. and M.S. from Ha Noi University of Technology, Vietnam, in 2004 and 2007, respectively. Since September 2008, he has been working on his PhD degree at the Department of Computer Engineering at Kyung Hee University, Korea. His research interests include digital signal processing, pattern recognition and artificial intelligence. Email: [email protected]

Sungyoung Lee received his B.S. from Korea University, Seoul, Korea. He got his M.S. and PhD degrees in Computer Science from Illinois Institute of Technology (IIT), Chicago, Illinois, USA in 1987 and 1991, respectively. He has been a professor in the Department of Computer Engineering, Kyung Hee University, Korea since 1993. He is a founding director of the Ubiquitous Computing Laboratory, and has served as a director of the Neo Medical ubiquitous-Life Care Information Technology Research Center, Kyung Hee University, since 2006. He is a member of ACM and IEEE. Email: [email protected]

Hung Xuan Le received his BS degree in Computer Science from Hanoi University, Vietnam, in 2003, and his M.S. and Ph.D. degrees in Computer Engineering from Kyung Hee University, Korea, in 2005 and 2008, respectively. From 2002 to 2003, he served as a software engineer at FPT Corp., Vietnam. From Dec 2008 to Aug 2009, he was a research professor at the Dept. of Computer Engineering, Kyung Hee University, Korea. Since Sept 2009, he has been a Postdoctoral Research Associate at the Dept. of Electrical Engineering, University of South Florida, USA. His research interests are wireless sensor networks, cryptography and information security. Email: [email protected]

Hung Quoc Ngo received his B.E. in Telecommunications from Ho-Chi-Minh City University of Technology, Vietnam, in 2003, with highest honors. He got his M.S. in Computer Engineering from Kyung Hee University, Korea, in 2005, with highest honors. His research interests include factor graphs and their applications in distributed optimization and inference in wireless sensor networks and . Email: [email protected]

Hyoung Il Kim received his MS in computer engineering from Kyung Hee University, Korea in 1996. Currently, he is a Bada (Samsung Smartphone) platform engineer in the Mobile Communication Division at Samsung Electronics. His research interests are activity recognition for ubiquitous & mobile computing. Email: [email protected]

Manhyung Han received his Bachelor and Master degrees in the Dept. of Computer Engineering from Kyung Hee University, Korea in 2005 and 2007, respectively. Currently, he is working toward a Ph.D. degree in the Dept. of Computer Engineering at Kyung Hee University. His research areas include ubiquitous computing, context-aware middleware, activity recognition and WSN. Email: [email protected]

Young-Koo Lee got his B.S., M.S. and PhD in Computer Science from Korea Advanced Institute of Science and Technology, Korea. He is a professor in the Department of Computer Engineering at Kyung Hee University, Korea. His research interests include ubiquitous data management, , and databases. He is a member of the IEEE, the IEEE Computer Society, and the ACM. Email: [email protected]