THE MODIFIED CUSUM ALGORITHM FOR SLOW AND DRASTIC CHANGE DETECTION IN GENERAL HMMS WITH UNKNOWN CHANGE PARAMETERS

Namrata Vaswani

ECE Dept., Georgia Institute of Technology, Atlanta, GA 30332 [email protected]

ABSTRACT

We study the change detection problem in a general HMM when the change parameters are unknown and the change can be slow or drastic. Drastic changes can be detected easily using the increase in tracking error or the negative log of the observation likelihood (OL), but slow changes usually get missed. In past work we have proposed a statistic called ELL which works for slow change detection. Since single time instant estimates of any statistic can be noisy, we propose a modification of the Cumulative Sum (CUSUM) algorithm which can be applied to both ELL and OL and thus improves both slow and drastic change detection performance.

1. INTRODUCTION

Change detection is required in many practical problems arising in flight control, fault detection and in surveillance problems like abnormal activity detection [1]. In most cases, the underlying system in its normal state can be modeled as a parametric stochastic model which is usually nonlinear. The observations are usually noisy (making the system partially observed). Such a system forms a "general HMM" [2] (also referred to as a "partially observed nonlinear dynamical model" or a "stochastic state space model" in different contexts). It can be approximately tracked (i.e., the hidden state variables can be estimated given the observations) using a Particle Filter (PF) [3].

We study here the change detection problem in a general HMM when the change parameters are unknown and the change can be slow or drastic. We use a PF to estimate the posterior probability distribution of the state at time t, X_t, given observations up to t, π_t(dx) ≜ Pr(X_t ∈ dx | Y_{1:t}). Even when change parameters are unknown, drastic changes can be detected easily using the increase in tracking error [4] or the negative log of the observation likelihood (OL). OL is the negative log likelihood of the current observation conditioned on all past observations. But slow changes usually get missed. In past work [5], we have proposed a statistic called ELL (Expected Log-Likelihood) which is able to detect slow changes, and we have also shown the complementary behavior of ELL and OL for slow and drastic changes [1].

Now, single time instant estimates of ELL or OL may be noisy and are prone to outliers. Hence in practice, one needs to average the statistic over a set of past time instants. A principled way of doing this is the CUSUM algorithm [6]. For known change parameters, the CUSUM algorithm finds the maximum, over all previous time instants t − p + 1, of the observation likelihood ratio LR(p, t), assuming that the change occurred at time t − p + 1 (see equation (2) in Section 3). For linear systems the LR is well defined, but for nonlinear systems linearization techniques like extended Kalman filters are the main tools [6]. These are computationally efficient but are not always applicable. In [7], the authors use a particle filtering approach for evaluating the CUSUM statistic without linearization. In the most general case (when the PF is not asymptotically stable), this would require running (1 + t) PFs to evaluate CUSUM at time t; the authors of [7] therefore define a modification of the CUSUM statistic whose computational complexity does not grow with t.

For unknown changed-system parameters, the LR can be replaced by the Generalized Likelihood Ratio (GLR). The GLR solution for linear systems is well known [6]. But for nonlinear systems, CUSUM applied to GLR would require running one PF for each value of the unknown parameter. In [8], the authors considered a case where the unknown parameter belongs to a finite set with cardinality P. This required running (1 + Pt) PFs at time t to evaluate CUSUM at t. Hence the authors proposed to search over a finite past of length ∆, i.e., to use CUSUM_{t,∆} = max_{0≤p≤∆} LR(p, t). For the general case where the unknown parameter belongs to an uncountable set, ML parameter estimation procedures have been proposed; see the recent survey article [9]. But most of these use a PF estimate of the joint posterior of all states, π_t^N(dx_{1:t} | Y_{1:t}), and the error in the joint posterior estimated using a fixed number of particles increases with time [9] (the estimate is unstable). Also, all of the above algorithms are based on the observation LR and hence are unable to detect slow changes [1].

In this paper, we propose to modify the CUSUM algorithm in a different way. This modification is applicable to any single time instant change detection statistic. We apply it to both OL and ELL and so are able to detect drastic as well as slow changes. Also, it can be evaluated using a single PF even when the unknown change parameter belongs to an uncountable set, and the CUSUM statistic estimate is stable. The paper is organized as follows: In Section 2, we discuss ELL and OL. The modified CUSUM algorithm and its application to ELL and OL are discussed in Section 3. In Section 4, we study two example applications, discuss their stability and show simulation results. Conclusions are presented in Section 5.

1.1. The General HMM Model and Problem Definition

The system (or state) process {X_t} is a Markov process with state transition kernel Q_t(x_t, dx_{t+1}), and the observation process is a memoryless function of the state given by Y_t = h_t(X_t) + w_t, where w_t is an i.i.d. noise process and h_t is, in general, a nonlinear function. The conditional distribution of the observation given the state, G_t(dy_t, x_t), is assumed to be absolutely continuous, and its pdf is given by g_t(Y_t, x) ≜ ψ_t(x). Now p_0(x) (the prior initial state distribution), G_t(dy_t, x_t) and Q_t(x_t, dx_{t+1}) are known and assumed to be absolutely continuous. (For ease of notation, we denote a pdf either by the same symbol as the probability distribution or by its lowercase.) Thus the prior distribution of the state at any t is also absolutely continuous and admits a density, p_t(x). We use the superscript c to denote any parameter related to the changed system, 0 for the original system, and c,0 for the case when the observations of the changed system are filtered using a filter optimal for the original system. (At most places, the superscript 0,0 is abbreviated to 0 and c,c to c.) Also, the unnormalized filter kernel [2] is defined as R_t(x_t, dx_{t+1}) ≜ ψ_t(x_{t+1}) Q_t(x_t, dx_{t+1}).

We assume that the normal (original/unchanged) system has state transition kernel Q_t^0. A change (which can be slow or drastic) in the system model begins to occur at some finite time t_c and lasts till a final finite time t_f. In the time interval [t_c, t_f], the state transition kernel is Q_t^c, and after t_f it again becomes Q_t^0. Both Q_t^c and the change start and end times t_c, t_f are assumed to be unknown. The goal is to detect the change with minimum delay.

The Kerridge Inaccuracy [10] between two pdfs p and q is defined as K(p : q) ≜ ∫ p(x)[−log q(x)] dx. We have ELL(Y_{1:t}) = E_{π_t}[−log p_t^0(x)] = K(π_t : p_t^0).

Why ELL works: Taking the expectation of ELL(Y_{1:t}^0) = K(π_t^0 : p_t^0) over normal observation sequences, we get E_{Y_{1:t}^0}[ELL(Y_{1:t}^0)] = E_{Y_{1:t}^0} E_{π_t^0}[−log p_t^0(x)] = E_{p_t^0}[−log p_t^0(x)] = H(p_t^0) = K(p_t^0 : p_t^0) ≜ EK_t^0, where H(.) denotes entropy. Similarly, for the changed system observations, E_{Y_{1:t}^c}[ELL(Y_{1:t}^c)] = K(p_t^c : p_t^0) ≜ EK_t^c, i.e., the expectation of ELL(Y_{1:t}^c) taken over changed-system observation sequences is actually the Kerridge Inaccuracy between the changed system prior, p_t^c, and the original system prior, p_t^0, which will be larger than the Kerridge Inaccuracy between p_t^0 and itself (the entropy of p_t^0) [11].

Thus, ELL will detect the change when EK_t^c is "significantly" larger than EK_t^0. Setting the change threshold to κ_t ≜ EK_t^0 + 3√(VK_t^0), where VK_t^0 = Var_{Y_{1:t}^0}(K_t^0), will ensure a false alarm probability of less than 0.11 (this follows from the Chebyshev inequality [12]). By the same logic, if EK_t^c − 3√(VK_t^c) > κ_t, then the miss probability [12] (the probability of missing the change) will also be less than 0.11.

1.2. Approximate Non-linear Filtering using a Particle Filter

The problem of nonlinear filtering is to compute, at each time t, the conditional probability distribution of the state X_t given the observation sequence Y_{1:t}, π_t(dx) = Pr(X_t ∈ dx | Y_{1:t}). The transition from π_{t−1} to π_t is:

π_{t−1} → π_{t|t−1} = Q_t(π_{t−1}) → π_t = ψ_t π_{t|t−1} / ⟨π_{t|t−1}, ψ_t⟩.

A Particle Filter [3] is a recursive algorithm for approximate nonlinear filtering which produces, at each time t, a cloud of N particles {x_t^(i)} whose empirical measure, π_t^N, closely "follows" π_t. At t = 0, it samples N times from π_0 to approximate it by π_0^N(dx) ≜ (1/N) Σ_{i=1}^N δ_{x_0^(i)}(dx). Then, for each t, it runs the Bayes recursion:

π_{t−1}^N ≜ (1/N) Σ_{i=1}^N δ_{x_{t−1}^(i)}(dx) → π_{t|t−1}^N ≜ (1/N) Σ_{i=1}^N δ_{x̄_t^(i)}(dx) → π̄_t^N ≜ Σ_{i=1}^N w_t^(i) δ_{x̄_t^(i)}(dx) → π_t^N ≜ (1/N) Σ_{i=1}^N δ_{x_t^(i)}(dx),

where x̄_t^(i) ∼ Q_t(x_{t−1}^(i), dx), x_t^(i) ∼ Multinomial({x̄_t^(i), w_t^(i)}_{i=1}^N) and w_t^(i) ≜ ψ_t(x̄_t^(i)) / (N ⟨π_{t|t−1}^N, ψ_t⟩).

2. CHANGE DETECTION STATISTICS: ELL AND OL

2.1. The ELL statistic

"Expected (negative) Log Likelihood" or ELL [5] at time t is the conditional expectation of the negative logarithm of the prior likelihood of the state at time t, under the no-change hypothesis (H_0), given observations till time t, i.e.,

ELL(Y_{1:t}) ≜ E[−log p_t^0(x) | Y_{1:t}] = E_{π_t}[−log p_t^0(x)].   (1)

Using the PF estimate π_t^N, the ELL estimate becomes ELL^N = (1/N) Σ_{i=1}^N [−log p_t^0(x_t^(i))]. ELL as defined above is equal to the Kerridge Inaccuracy [10] between the posterior and prior state pdfs.

2.2. When ELL fails: The OL Statistic

The above analysis assumed no estimation errors in evaluating ELL; if this were true, ELL would work for drastic changes as well. But the PF being used is optimal for the unchanged system. Hence, when estimating π_t (required for evaluating ELL) for the changed system, there is "exact filtering error" [1]. The particle filtering error is also much larger in this case. The upper bound on the approximation error in estimating ELL is proportional to the "rate of change" [1]. Hence ELL is approximated accurately for a slow change, and thus it detects such a change as soon as it becomes "detectable" [1]. But ELL fails to detect drastic changes because of the large estimation error in evaluating π_t. A large estimation error in evaluating π_t, however, also corresponds to a large value of OL (Observation Likelihood), which can be used for detecting such changes (Theorem 2.4 of Chapter 2 of [1]). OL, as explained earlier, is the negative log likelihood of the current observation conditioned on past observations under the no-change hypothesis, i.e., OL_t ≜ −log Pr(Y_t | Y_{1:t−1}, H_0). It is evaluated using a PF as OL_t^N = −log⟨Q_t^0 π_{t−1}^N, ψ_t⟩. OL, on the other hand, takes longer to detect a slow change, or may not detect it at all (discussed in [1]).

3. THE MODIFIED CUSUM ALGORITHM

As noted above, single time instant estimates of ELL or OL may be noisy and prone to outliers, so in practice one needs to average the statistic over a set of past time instants. Averaging OL over the past p time instants gives aOL(p, t) = (1/p)[−log Pr(Y_{t−p+1:t} | Y_{1:t−p})]. Average ELL is aELL(p, t) = (1/p) Σ_{k=t−p+1}^t ELL(Y_{1:k}). But since ELL(Y_{1:t}) is not stationary, and hence also not ergodic, this averaging cannot be justified for large p. For small p, one can use the standard assumption of approximating any nonstationary process by a piecewise stationary process over small time intervals (the same idea used to justify any other moving average). Instead of aELL, one can evaluate the joint ELL, jELL(p, t) = (1/p) E[−log p_{t−p+1:t}(X_{t−p+1:t}) | Y_{1:t}], which is the Kerridge Inaccuracy between the joint posterior distribution of X_{t−p+1:t} given Y_{1:t} and their joint prior. If using aELL(p, t), the threshold can be set as Th^aELL(p, t) = (1/p) E_{Y_{1:t}}[aELL(p, t)], which is the sum of the individual entropies of X_{t−p+1:t}. If using jELL(p, t), Th^jELL(p, t) = (1/p) E_{Y_{1:t}}[jELL(p, t)], which is the joint entropy of X_{t−p+1:t}.

Now, for the case of known change parameters, the CUSUM statistic applied to the observation likelihood ratio is [6, 7]

CUSUM_t = max_{1≤p≤t} log LR(p, t), where
log LR(p, t) = log Pr^c(Y_{t−p+1:t} | Y_{1:t−p}) − log Pr^0(Y_{t−p+1:t} | Y_{1:t−p}) = [aOL^c(p, t) − aOL^0(p, t)],   (2)

and a change is declared if CUSUM_t > λ for some positive threshold λ. For unknown change parameters and for any statistic, denoted stat(p, t), one can modify this as follows: set a threshold for stat(p, t) under normal observations, Th^stat(p, t) (usually we take Th^stat(p, t) = E_{Y_{1:t}}[stat(p, t)]), and replace log LR(p, t) by [stat(p, t) − Th^stat(p, t)], i.e., use

CUSUM_t^stat = max_{1≤p≤t} [stat(p, t) − Th^stat(p, t)],   (3)

where stat is aOL^0, aELL or jELL. A change is declared if CUSUM_t^stat > λ. The change time is given by t − p* + 1, where p* is the argument maximizing [stat(p, t) − Th^stat(p, t)] over p. Now, the error in the PF estimate of the joint posterior of X_{t−p+1:t} given Y_{1:t} increases as p increases and, as a rule of thumb, the estimates are meaningless for p > 5 [9]. Also, as explained above, aELL can be justified only for small values of p. Thus in practice we use the following modification (with ∆ = 5):

CUSUM_{t,∆}^stat = max_{1≤p≤∆} [stat(p, t) − Th^stat(p, t)].   (4)

4. TWO APPLICATIONS

4.1. Bearings-only Target Tracking

In bearings-only target tracking (details in [3]), the target moves on the x-y plane according to the standard second order model:

X_t = Φ X_{t−1} + Γ n_t,  Φ = [1 1 0 0; 0 1 0 0; 0 0 1 1; 0 0 0 1],  Γ = [0.5 0; 1 0; 0 0.5; 0 1],   (5)

where X_t = [x_{1,t}, ẋ_{1,t}, x_{2,t}, ẋ_{2,t}]^T. Here x_{1,t}, x_{2,t} denote the x and y components of the target location and ẋ_{1,t}, ẋ_{2,t} denote the x and y components of the target velocity. The observation, Y_t, is a noisy measurement of the target bearing, Y_t = tan^{−1}(x_{2,t}/x_{1,t}) + w_t. In this case, the observation model is nonlinear, but the system model is linear Gaussian and hence p_t(x) can be defined in closed form. The system noise, n_t, is zero mean i.i.d. Gaussian with Σ_sys = 0.001 I, and the observation noise is zero mean i.i.d. truncated Gaussian with variance σ_obs² = 0.005 and truncation parameter B = 10 σ_obs. We attempt to detect a change in the state dynamics, where the change is due to an additive bias of Γ [r σ_sys, 0]^T for 10 time steps starting at t = 5. The initial state was assumed to be known (zero variance).

Stability: With the following mild assumption, the above system satisfies the stability results of [1]: assume that for each t, the x component of the target location is bounded, i.e., there exists a P_t < ∞ s.t. −P_t ≤ x_{1,t} ≤ P_t. Using this assumption, along with the fact that h(x) = tan^{−1}(x_{2,t}/x_{1,t}) and that the observation noise is truncated, it is easy to see that the support set of ψ_t(x), E_{x,Y_t}, is compact. By Example 3.10 of [2], this, along with the facts that Φ X_{t−1} is continuous, that π_0 has compact support and that the system noise is Gaussian, makes the unnormalized filter kernels R_t^0, R_t^{c,0}, R_t^c mixing. Also, since E_{x,Y_t} is compact, sup_{x∈E_{x,Y_t}} ψ_t(x) is finite and M_t = sup_{x∈E_{x,Y_t}}[−log p_t(x)] is finite (continuous functions, here ψ_t(x) and [−log p_t(x)], map compact sets onto compact sets). Thus we satisfy all assumptions of Theorem 2.2 of Chapter 2 of [1]. Also, using an argument similar to that in Section 2.7.1 of [1] (the difference here is that, since −P_t ≤ x_{1,t} ≤ P_t, the system noise is truncated Gaussian distributed instead of Gaussian), assumption (iv)' of the stronger Theorem 2.1 is also satisfied and hence it holds, i.e., the error between the true ELL and its approximation, averaged over PF realizations and observation sequences, is eventually monotonically decreasing with time for large N, and hence is stable. This implies that the error in the CUSUM-ELL (a finite sum of past ELLs) estimate is also stable.

Simulation Results: The ROC plots of Figure 1 compare the performance of ELL (blue -o), OL (red -*), CUSUM-ELL (black -square) and CUSUM-OL (magenta -x). An ROC (Receiver Operating Characteristics) plot for a change detection problem [6] is obtained by plotting average detection delay against average time between false alarms for different values of the detection threshold. We simulated 20 realizations and calculated the average detection delay and the average time between false alarms for different values of the detection threshold. As can be seen from the figure, CUSUM-ELL performs much better than ELL in all 3 cases. Also, CUSUM-ELL and ELL significantly outperform CUSUM-OL and OL for the slow change (r=1). In fact, in this example, because of the nature of h(x), CUSUM-ELL outperforms CUSUM-OL and OL even for the faster (r=10) and drastic (r=20) changes.

4.2. Nonlinear State Dynamics

If the system dynamics is nonlinear, in most cases it is not possible to define p_t(x) in closed form. One solution for such cases is to use prior knowledge to define p_t(x) as coming from a certain parametric family, for example a Gaussian or a mixture of Gaussians. We study here an example discussed in [3] which has the following nonlinear state dynamics:

X_t = X_{t−1}/2 + 25 X_{t−1}/(1 + X_{t−1}²) + 8 cos(1.2(t−1)) + n_t  and  Y_t = X_t²/20 + w_t,

where n_t is Gaussian system noise with σ_sys² = 10 and w_t is truncated Gaussian observation noise with σ_obs² = 1. The initial state is taken to be zero. We introduce a change by adding a bias r σ_sys to the state equation for 10 time instants starting at t = 5. Based on prior knowledge, we take p_t(x) to be a Gaussian with mean Σ_{τ=1}^t 8 cos(1.2(τ − 1)) and variance t σ_sys².

Stability: It is easy to see that ψ_t(x) has compact support, E_{x,Y_t}. Also, the system dynamics is continuous and the system noise is Gaussian. This, along with the fact that π_0 has compact support, makes R_t^0, R_t^{c,0}, R_t^c mixing. Since E_{x,Y_t} is compact, sup_{x∈E_{x,Y_t}} ψ_t(x) is finite and M_t = sup_{x∈E_{x,Y_t}}[−log p_t(x)] is finite. Thus all assumptions of Theorem 2.2 of Chapter 2 of [1] are satisfied.

Simulation Results: We show ROC plots in Figure 2 comparing the performance of ELL (blue -o), OL (red -*), CUSUM-aELL (black -square), CUSUM-jELL (green -△) and CUSUM-OL (magenta -x). Here ELL outperforms OL for the slow (r=1) and faster (r=2) changes.

[Figure 1: three ROC panels, (a) Slow Change, (b) Faster Change, (c) Drastic Change; each plots Average Detection Delay vs. Mean Time between False Alarms.]

Fig. 1. Bearings-only tracking example: ROC curves comparing ELL, CUSUM on aELL, OL and CUSUM on OL
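All of the statistics compared in these ROC curves are computed from the bootstrap PF of Section 1.2 together with the PF-based ELL estimate of Section 2.1. The following is a minimal sketch, assuming user-supplied callables for sampling the transition kernel Q_t, the observation likelihood ψ_t and the prior log-pdf (all names are illustrative, not from the paper):

```python
import numpy as np

def pf_step(particles, sample_q, psi, rng):
    """One Bayes-recursion step of the bootstrap PF (Section 1.2):
    predict through Q_t, weight by psi_t, then multinomial resampling."""
    xbar = sample_q(particles, rng)        # xbar^(i) ~ Q_t(x_{t-1}^(i), dx)
    w = psi(xbar)                          # unnormalized weights psi_t(xbar^(i))
    w = w / w.sum()                        # normalize
    idx = rng.choice(len(xbar), size=len(xbar), p=w)  # multinomial resampling
    return xbar[idx]                       # equally weighted cloud pi_t^N

def ell_estimate(particles, log_p0):
    """ELL^N = (1/N) * sum_i [-log p_t^0(x_t^(i))] (Section 2.1)."""
    return -np.mean(log_p0(particles))
```

Running pf_step once per time instant and feeding the resulting cloud to ell_estimate gives the per-instant ELL sequence that aELL and the modified CUSUM then average.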

[Figure 2: three ROC panels, (a) Slow Change (r=1), (b) Faster Change (r=2), (c) Drastic Change (r=10); each plots Average Detection Delay vs. Mean Time between False Alarms for ELL, CUSUM-aELL, CUSUM-jELL, OL and CUSUM-OL.]

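The CUSUM_{t,∆} detector of equation (4), used for all CUSUM curves in Figures 1 and 2, reduces to a maximization over a short window. A sketch, with stat(p, t) and Th^stat(p, t) supplied as callables (this interface is our choice, not the paper's):

```python
import numpy as np

def modified_cusum(stat, th, t, delta=5):
    """CUSUM_{t,Delta}^{stat} = max_{1<=p<=Delta} [stat(p,t) - Th^{stat}(p,t)]
    (equation (4)). Returns the statistic value and the maximizing p,
    so the estimated change time can be read off as t - p_star + 1."""
    ps = range(1, min(delta, t) + 1)
    vals = [stat(p, t) - th(p, t) for p in ps]
    p_star = int(np.argmax(vals)) + 1
    return vals[p_star - 1], p_star
```

A change is declared at time t when the returned value exceeds the design threshold λ; sweeping λ traces out the ROC curves shown above.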
Fig. 2. Nonlinear System Model example: ROC curves comparing ELL, CUSUM on aELL, CUSUM on jELL, OL and CUSUM on OL. Note that in (b), CUSUM-aELL and CUSUM-jELL remain at zero, while in (c), CUSUM-OL and OL remain at zero.

Similarly, CUSUM-aELL and CUSUM-jELL outperform CUSUM-OL for the slow (r=1) and faster (r=2) changes, and vice versa for the drastic (r=10) change. In fact, for r=10, CUSUM-OL and OL have zero detection delay while CUSUM-ELL and ELL completely fail. Also, in all cases, CUSUM-ELL is better than ELL and CUSUM-OL is better than OL. CUSUM-aELL and CUSUM-jELL have comparable performance: jELL is better at lower thresholds while aELL is better at higher thresholds.

5. CONCLUSIONS

We have modified the CUSUM algorithm to work for change detection in nonlinear systems with unknown change parameters. This modification can be applied to the ELL statistic [5] and is the first application of CUSUM to slow change detection. Also, our modified CUSUM can be evaluated using a single PF, and the CUSUM statistic estimates are stable. As part of future work, we would like to use ML parameter estimation [9] to reduce the "exact filtering error" [1] in the ELL (and CUSUM-ELL) approximation.

6. REFERENCES

[1] N. Vaswani, Change Detection in Stochastic Shape Dynamical Models with Applications in Activity Modeling and Abnormality Detection, Ph.D. Thesis, ECE Dept., University of Maryland at College Park, August 2004.
[2] F. LeGland and N. Oudjane, "Stability and Uniform Approximation of Nonlinear Filters using the Hilbert Metric, and Application to Particle Filters," Tech. Report RR-4215, INRIA, 2002.
[3] N.J. Gordon, D.J. Salmond, and A.F.M. Smith, "Novel approach to nonlinear/non-Gaussian Bayesian state estimation," IEE Proceedings-F (Radar and Signal Processing), vol. 140, no. 2, pp. 107–113, 1993.
[4] Y. Bar-Shalom and T.E. Fortmann, Tracking and Data Association, Academic Press, 1988.
[5] N. Vaswani, "Change detection in partially observed nonlinear dynamic systems with unknown change parameters," in American Control Conference (ACC), 2004.
[6] M. Basseville and I. Nikiforov, Detection of Abrupt Changes: Theory and Application, Prentice Hall, 1993.
[7] B. Azimi-Sadjadi and P.S. Krishnaprasad, "Change detection for nonlinear systems: A particle filtering approach," in ACC, 2002.
[8] P. Li and V. Kadirkamanathan, "Particle filtering based likelihood ratio approach to fault diagnosis in nonlinear stochastic systems," IEEE Trans. Syst., Man, Cybern. C, vol. 31, pp. 337–343, August 2001.
[9] C. Andrieu, A. Doucet, S.S. Singh, and V.B. Tadic, "Particle methods for change detection, system identification, and control," Proc. of the IEEE, vol. 93, pp. 423–438, March 2004.
[10] D.F. Kerridge, "Inaccuracy and inference," J. Royal Statist. Society, Ser. B, vol. 23, 1961.
[11] R. Kulhavy, "A geometric approach to statistical estimation," in IEEE CDC, Dec. 1995.
[12] G. Casella and R. Berger, Statistical Inference, Duxbury Thomson Learning, second edition, 2002.