Proceedings of the 10th International Conference on Sampling Theory and Applications

Tracking Dynamic Sparse Signals with Kalman Filters: Framework and Improved Inference

Evripidis Karseras, Kin Leung and Wei Dai
Department of Electrical and Electronic Engineering, Imperial College London, UK
{e.karseras11, kin.leung, wei.dai1}@imperial.ac.uk

The authors would like to acknowledge the European Commission for funding SmartEN ITN (Grant No. 238726) under the Marie Curie ITN FP7 programme.

Abstract—The standard Kalman filter performs optimally for conventional signals but tends to fail when it comes to recovering dynamic sparse signals. In this paper a method to solve this problem is proposed. The basic idea is to model the system dynamics with a hierarchical Bayesian network which successfully captures the inherent sparsity of the data, in contrast to the traditional state-space model. This probabilistic model provides all the necessary statistical information needed to perform sparsity-aware predictions and updates in the Kalman filter steps. A set of theorems shows that a properly scaled version of the associated cost function can lead to less greedy optimisation algorithms, unlike the ones previously proposed. It is demonstrated empirically that the proposed method outperforms the traditional Kalman filter for dynamic sparse signals, and that the redesigned inference algorithm, termed here Bayesian Subspace Pursuit (BSP), greatly improves the inference procedure.

I. INTRODUCTION

The Kalman filter has been the workhorse approach in the area of linear dynamic system modelling in both practical and theoretic scenarios. The escalating trend towards sparse signal representation has rendered this estimator useless when it comes to tracking dynamic sparse signals. It is easy to verify that the estimation process behind the Kalman filter is not fit for sparse signals. Intuitively, the Gaussian prior distribution placed over the system's observations does not place any sparsity constraints over the space of all possible solutions.

The Kalman filter has been externally modified in the literature to admit sparse solutions. The idea in [1] and [2] is to enforce sparsity by thresholds. Work in [3] adopts a probabilistic model, but signal amplitudes and support are estimated separately. Finally, the techniques presented in [4] incorporate prior sparsity knowledge into the tracking process. All these approaches typically require a number of parameters to be pre-determined. It also remains unclear how these methods perform under model and parameter mismatch.

For a single time instance of the sparse reconstruction problem, the Relevance Vector Machine (RVM) introduced in [10] has been used with great success in Compressed Sensing applications [5] and basis selection [6]. The hierarchical Bayesian network behind the RVM achieves highly sparse models for the observations, providing not only estimates of sparse signals but their full posterior distributions as well. This is of great importance since it provides all the necessary statistical information to use in the prediction step of the tracking process. Additionally, the inference procedure used in this framework allows for automatic determination of the active components, hence the need for a pre-determined level of sparsity is eliminated. This is an appealing attribute for an on-line tracking algorithm.

In this work the aforementioned Bayesian network is employed to extend the state-space model adopted in the traditional Kalman filter. This way the problem of modelling sparsity is tackled efficiently. The resulting statistical information from the inference procedure is then incorporated in the Kalman filter steps, thus producing sparsity-aware state estimates.

A set of theorems dictates that a proper scaling of the cost function associated with the inference procedure can lead to more efficient inference algorithms. The techniques initially proposed are greedy methods at heart. By scaling the cost function with the noise variance, and by using knowledge gained from well-known algorithms, it is possible to redesign these methods to admit better qualities. The gains are twofold. Firstly, the improved inference mechanism bears far better qualities than the one previously proposed. Secondly, the proposed method outperforms the traditional Kalman filter in terms of reconstruction error when it comes to dynamic sparse signals.

In Section II we present the basic idea for amalgamating the Bayesian network of the RVM in the Kalman filter, termed here the Hierarchical Bayesian Kalman filter (HB-Kalman). In Section III we present a set of theorems and explain the motivation to improve upon previous techniques. Additionally, we provide the steps for a revised inference algorithm based on the Subspace Pursuit (SP) reconstruction algorithm in [8], termed here Bayesian Subspace Pursuit (BSP). In Section IV we demonstrate the performance of the proposed methods in some synthetic scenarios.

II. HIERARCHICAL BAYESIAN KALMAN FILTER

The system model is described by the following equations:

x_t = F_t x_{t-1} + z_t,  (1)
y_t = Φ_t x_t + n_t,  (2)

where vectors x_t and y_t denote the system's state and observation respectively. The state innovation and observation noise processes are modelled by z_t and n_t respectively.
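To make the model concrete, the following sketch (ours, not the authors' code) simulates Equations (1) and (2) with F_t = I: a state that is sparse on a fixed support follows a Gaussian random walk on its active entries and is observed through a compressive Gaussian design matrix. The dimensions and noise variances mirror the simulation settings of Section IV; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, K, T = 256, 128, 30, 200        # dimensions and sparsity as in Section IV
sigma_z2, sigma_n2 = 0.1, 0.01        # innovation / observation noise variances (Section IV values)

support = rng.choice(n, size=K, replace=False)
x = np.zeros(n)
states, measurements = [], []

for t in range(T):
    # Equation (1) with F_t = I: only the active entries receive an innovation z_t
    z = np.zeros(n)
    z[support] = np.sqrt(sigma_z2) * rng.standard_normal(K)
    x = x + z
    # Equation (2): compressive measurement with a Gaussian design matrix, re-drawn per instance
    Phi = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
    y = Phi @ x + np.sqrt(sigma_n2) * rng.standard_normal(m)
    states.append(x.copy())
    measurements.append(y)
```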


We assume that the signal x_t ∈ R^n is sparse in some domain, which is considered to remain the same at all time instances (e.g. the frames of a video are sparse in the same transform domain). This allows us to set the state transition matrix F_t equal to the unitary matrix I. Equation (1) becomes

x_t = x_{t-1} + z_t.

As in the standard Kalman filter we adopt the Gaussian assumption, so that p(z_t) = N(0, Z_t), p(n_t) = N(0, σ²I), and hence p(x_t|x_{t-1}) = N(x_{t-1}, Z_t) and p(y_t|x_t) = N(Φ_t x_t, σ²I). At each time instance, the Kalman filter involves the prediction step, where the parameters of p(x_t|y_{t-1}) are calculated, while the update step evaluates those of p(x_t|y_t). The advantages of the standard Kalman filter include the ability to track the full statistics, and that the mean squared error solution coincides with the maximum a posteriori solution, which has a closed form. The major issue when applying the filter to dynamic sparse signals is that the solution is typically not sparse. This drawback is due to the fact that in the standard approach the covariance matrix Z_t is given a priori. Variants such as the non-linear Kalman filter also suffer because of the special nature of the non-linearities associated with sparse reconstruction.

To alleviate this problem, the key idea behind Sparse Bayesian Learning (SBL) [10] is employed. As opposed to the traditional Kalman filter, where the covariance matrix Z_t of z_t is given, here it is assumed that the state innovation process is given by

z_t ∼ N(0, A_t^{-1}),

where A_t = diag(α_t) = diag([α_1, ..., α_n]_t), and the hyper-parameters α_i are unknown and have to be learned from y_t. To see how this promotes a sparse solution, let us drop the subscript t for simplicity. Then it holds that

p(x|α) = N(0, A^{-1}) = ∏_{i=1}^n N(0, α_i^{-1}).

Driving α_i → +∞ means that p(x_i|α_i) = N(0, 0); hence it is certain that x_i = 0. What remains is to find the maximum likelihood solution of α for the given observation vector y. The explicit form of the likelihood function p(y|α, σ²) was derived in [10], and a set of fast algorithms to estimate α, and consequently z and x, is proposed in [9].

Finally, the principles behind the Kalman filter and SBL are put together. Similar to the standard Kalman filter, two steps, prediction and update, need to be performed at each time instance. In the prediction step one has to evaluate

μ_{t|t-1} = μ_{t-1|t-1},  Σ_{t|t-1} = Σ_{t-1|t-1} + A_t^{-1},
y_{t|t-1} = Φ_t μ_{t|t-1},  y_{e,t} = y_t − y_{t|t-1},

where the notation t|t−1 means prediction at time instance t from measurements up to time instance t−1. In the update step one computes

K_t = Σ_{t|t-1} Φ_t^T (σ²I + Φ_t Σ_{t|t-1} Φ_t^T)^{-1},
μ_{t|t} = μ_{t|t-1} + K_t y_{e,t},  Σ_{t|t} = (I − K_t Φ_t) Σ_{t|t-1}.

Differently from the standard Kalman filter, one has to perform the additional step of learning the hyper-parameters α_t. From Equation (2) we get y_{e,t} = Φ_t z_t + n_t, where a sparse z_t is preferred so as to produce a sparse x_t. Following the analysis in [10] and [9], maximising the likelihood p(y_t|α_t) is equivalent to minimising the following cost function:

L(α_t) = log|Σ_α| + y_{e,t}^T Σ_α^{-1} y_{e,t},  (3)

where Σ_α = σ²I + Φ_t A_t^{-1} Φ_t^T. The algorithms described in [9] can be applied to estimate α_t. Note that the cost function L(α) is not convex. The obtained estimate α_t is generally sub-optimal; details on the estimation of the globally optimal α_t are given in the next section.
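The prediction and update formulas above translate directly into code. The sketch below is a minimal illustration of one HB-Kalman cycle, assuming the hyper-parameters α_t have already been estimated by the inference procedure discussed next; the function and variable names are ours, not the authors'.

```python
import numpy as np

def hb_kalman_step(mu_prev, Sigma_prev, Phi_t, y_t, alpha_t, sigma2):
    """One HB-Kalman prediction/update cycle with F_t = I, given learned hyper-parameters alpha_t."""
    # Prediction: mu_{t|t-1} = mu_{t-1|t-1}, Sigma_{t|t-1} = Sigma_{t-1|t-1} + A_t^{-1}
    A_inv = np.diag(1.0 / alpha_t)          # alpha_i = inf gives zero innovation variance for entry i
    mu_pred = mu_prev
    Sigma_pred = Sigma_prev + A_inv
    y_pred = Phi_t @ mu_pred
    y_err = y_t - y_pred                    # innovation y_{e,t}
    # Update: Kalman gain and posterior moments
    S = sigma2 * np.eye(len(y_t)) + Phi_t @ Sigma_pred @ Phi_t.T
    K = Sigma_pred @ Phi_t.T @ np.linalg.inv(S)
    mu_post = mu_pred + K @ y_err
    Sigma_post = (np.eye(len(mu_pred)) - K @ Phi_t) @ Sigma_pred
    return mu_post, Sigma_post, y_err
```

In a full tracker, μ_{0|0} and Σ_{0|0} would be initialised before the first measurement, and the estimation of α_t described in Section III would be interleaved with these two steps.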
III. BAYESIAN SUBSPACE PURSUIT

Here we discuss the performance guarantees for a single time instance of the inference procedure. For convenience, subscript t is dropped and focus is turned to Equation (2), where x|α ∼ N(0, A^{-1}). This was analysed in [6] for the purpose of basis selection. It had also been proven in [6] that a maximally sparse solution of y = Φx attains the global minimum of the cost function. However, the analysis did not specify the conditions to avoid local minima. By contrast, we provide a more refined analysis. Due to space constraints, only the main results are presented.

We follow [6] by driving the noise variance σ² → 0. The following theorem specifies the behaviour of the cost function L(α).

Theorem 1. For any given α, define the set I ≜ {1 ≤ i ≤ n : 0 < α_i < ∞}. Then it holds that

lim_{σ²→0} σ² L(α) = ||y − Φ_I Φ_I^† y||_2²,  (4)

where Φ_I is the sub-matrix of Φ formed by the columns indexed by I, and Φ_I^† denotes the pseudo-inverse of Φ_I. Furthermore, if |I| < m and y ∈ span(Φ_I), then L(α) → −∞ and σ²L(α) → 0 as σ² → 0.

Two observations can be obtained: (a) the scenarios analysed in [6] can be seen as special cases of Theorem 1 where L(α) → −∞; and (b) a proper scaling of the cost function gives the squared ℓ₂-norm of the reconstruction error. Reconstruction is then equivalent to recovering a support set that minimises the reconstruction distortion. This principle is the same as the one behind many greedy algorithms such as OMP [7] and SP [8]. Theorem 1 suggests such connections.

According to Theorem 1, the key quantities concerning the algorithms described in [9] must be scaled by the noise variance. The original formulae can be found in [9], while the revised ones are given below:

σ^{-2} Σ_x = (σ² A_I + Φ_I^T Φ_I)^{-1},  μ_x = σ^{-2} Σ_x Φ_I^T y,
σ² C_{-i}^{-1} = I − Φ_{I−i} (σ² A_{I−i} + Φ_{I−i}^T Φ_{I−i})^{-1} Φ_{I−i}^T,
s̄_i = σ² s_i = φ_i^T (σ² C_{-i}^{-1}) φ_i,  q̄_i = σ² q_i = φ_i^T (σ² C_{-i}^{-1}) y.

Subscript I denotes the set of indices for which 0 < α_i < +∞. The notation I − i means removal of index i from I.
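As an illustration, the scaled quantities can be evaluated for a candidate active set exactly as written, without ever dividing by σ², so the noiseless limit σ² = 0 causes no numerical difficulty. The sketch below is ours; it is a naive implementation that recomputes one matrix inverse per index instead of using the efficient updates of [9], and the names are not taken from that reference.

```python
import numpy as np

def scaled_quantities(Phi, y, alpha, I, sigma2):
    """Scaled SBL quantities for an active set I (sigma2 may be 0)."""
    m, n = Phi.shape
    Phi_I = Phi[:, I]
    A_I = np.diag(alpha[I])
    # sigma^{-2} Sigma_x = (sigma^2 A_I + Phi_I^T Phi_I)^{-1},  mu_x = sigma^{-2} Sigma_x Phi_I^T y
    Sx_scaled = np.linalg.inv(sigma2 * A_I + Phi_I.T @ Phi_I)
    mu_x = Sx_scaled @ (Phi_I.T @ y)
    s_bar = np.empty(n)
    q_bar = np.empty(n)
    for i in range(n):
        J = [j for j in I if j != i]                      # the set I - i
        Phi_J = Phi[:, J]
        M = np.linalg.inv(sigma2 * np.diag(alpha[J]) + Phi_J.T @ Phi_J)
        C_scaled = np.eye(m) - Phi_J @ M @ Phi_J.T        # sigma^2 C_{-i}^{-1}
        s_bar[i] = Phi[:, i] @ C_scaled @ Phi[:, i]
        q_bar[i] = Phi[:, i] @ C_scaled @ y
    return mu_x, s_bar, q_bar
```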


Subsequently, formula [9, Equation (20)] for the optimal α_i given all other α_j, j ≠ i, becomes

α_i = s̄_i² / (q̄_i² − σ² s̄_i).

Finally, the scaled cost function becomes

L̄ = σ² L = σ² log|σ²I + Φ_I A_I^{-1} Φ_I^T| + y^T (I − Φ_I (σ² A_I + Φ_I^T Φ_I)^{-1} Φ_I^T) y.

Let us clarify the importance of using the scaled quantities. Assume that σ² = 0. It is then easy to show that the original formulae in [9] result in poor performance. Scaling the cost function (and consequently these quantities) is necessary when we want to account for a given noise variance. This may seem irrelevant, but in many tracking applications the noise floor is assumed to be estimated in some way or provided by the manufacturer of specific devices. The initial work in [10] provides the formula to infer the noise level from the observations. The scaled versions of the aforementioned quantities can still be applied if desired.

We now have a better understanding of the inference procedure, but it still remains unclear what the selection criterion for the basis functions should be. In [9] selection is based on the value of α_i which maximises the difference ∆L in the likelihood function, while algorithms such as OMP and SP make decisions on different grounds. The following theorem sheds some light on this matter.

Theorem 2. Assume the noiseless setting y = Φx, where Φ ∈ R^{m×n} and φ_i^T φ_i = 1 for all 1 ≤ i ≤ n. Furthermore assume that t = max |φ_i^T φ_j| for 1 ≤ i ≠ j ≤ n. Then an algorithm similar to the one in [9], based on one of the following criteria, recovers all s-sparse signals exactly given the sufficient condition t < 0.375/s: (a) the maximum σ²∆L, (b) the maximum x_i, or (c) the minimum α_i.

Theorem 2 is the starting point for redesigning the inference algorithm. Based on the scaled quantities we can re-derive the algorithm in [9], termed Fast Marginal Likelihood Maximisation (FMLM). It is possible to have variants with OMP-like performance guarantees based on different criteria, as Theorem 2 suggests. Actually, the inference algorithm then greatly resembles OMP, where the basis functions are recovered sequentially in decreasing order of correlation with the residual signal. For brevity we only present the version based on maximising x_i, hence the algorithm is termed FMLM-x_i. The steps are given in Algorithm 1.

Algorithm 1 FMLM-x_i
Input: Φ, y, σ²
Initialise:
- T̂ = {index i ∈ [1, n] for maximum |φ_i^T y|}.
Iteration:
- Calculate the values of α_i and [μ_x]_i for i ∈ [1, n] \ T̂.
- T′ = T̂ ∪ {index i corresponding to the maximum value of [μ_x]_i for i ∉ T̂}.
- Calculate the values α_i for i ∈ T′.
- T̃ = {i ∈ T′ : 0 < α_i < +∞}.
- If |L̄_T̃ − L̄_T̂| = 0 then compute σ^{-2}Σ_x, μ_x for T̃ and quit. Set T̂ = T̃ and continue otherwise.
Output:
- Estimated support set T̃ and sparse signal x̃ with |T̃| non-zero components, x̃_T̃ = μ_x.
- Estimated covariance matrix σ^{-2}Σ_x.
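To convey the OMP-like structure of Algorithm 1, the following simplified sketch implements the selection loop in the noiseless limit σ² = 0, where q̄_i and s̄_i reduce to correlations with the projection residual and the scaled cost of Theorem 1 is the squared residual norm. It is our sketch of the idea rather than a faithful transcription of Algorithm 1: the per-index α_i bookkeeping is folded into the tentative coefficients q̄_i/s̄_i, and all names are illustrative.

```python
import numpy as np

def fmlm_x_sketch(Phi, y, K_max=None, tol=1e-10):
    """Greedy FMLM-x_i style selection loop, sketched for the noiseless limit sigma^2 = 0."""
    m, n = Phi.shape
    K_max = K_max or m
    T = [int(np.argmax(np.abs(Phi.T @ y)))]        # initialise with the best-correlated atom
    cost_prev = np.inf
    while len(T) < K_max:
        Phi_T = Phi[:, T]
        resid = y - Phi_T @ np.linalg.lstsq(Phi_T, y, rcond=None)[0]
        cost = resid @ resid                        # scaled cost of Theorem 1
        if abs(cost_prev - cost) < tol:
            break
        cost_prev = cost
        # residualised quantities q_bar_i, s_bar_i and tentative coefficients for i outside T
        P_perp = np.eye(m) - Phi_T @ np.linalg.pinv(Phi_T)
        q_bar = Phi.T @ (P_perp @ y)
        s_bar = np.einsum('ij,ij->j', Phi, P_perp @ Phi)
        mu = np.where(s_bar > tol, q_bar / np.maximum(s_bar, tol), 0.0)
        mu[T] = 0.0
        T.append(int(np.argmax(np.abs(mu))))        # FMLM-x_i: add the largest tentative coefficient
    x_hat = np.zeros(n)
    x_hat[T] = np.linalg.lstsq(Phi[:, T], y, rcond=None)[0]
    return np.array(T), x_hat
```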
Theorem 3. Assume that the same conditions hold as in Theorem 2. An algorithm similar to the one in [9], based on the less greedy criterion of maximum θ_i = q̄_i² − s̄_i, recovers all s-sparse signals exactly given the sufficient condition t < 0.5/s. The algorithm presented in Algorithm 2 recovers all s-sparse signals exactly if the matrix Φ satisfies the RIP with parameter δ_{3s} < 0.205.

Theorem 3 suggests further improvements to the performance guarantees, to match those of the OMP, by altering the optimisation criterion. Also, results from [8] motivate us to extend the FMLM procedure to a less greedy optimisation procedure by borrowing ideas from the SP algorithm. The SP selects a subset of basis functions at each iteration based also on correlation maximisation, but adds a backtracking step so as to retain only the sparse components with the largest magnitudes. The redesigned algorithm, termed here Bayesian Subspace Pursuit, is described in Algorithm 2.

Algorithm 2 Bayesian Subspace Pursuit (BSP)
Input: Φ, y, σ²
Initialise:
- T̂ = {index i ∈ [1, n] for minimum α_i = 1/|φ_i^T y|}.
Iteration:
- Store α_max = max_{i∈T̂} |α_i|.
- Calculate the values α_i and θ_i = q̄_i² − s̄_i for i ∈ [1, n].
- Calculate t_{θ_i>0} = |{i ∈ [1, n] : θ_i > 0}| and t_{α_i≤α_max} = |{i ∈ [1, n] : |α_i| ≤ α_max}|.
- If t_{θ_i>0} = 0 then s = t_{α_i≤α_max} + 1, else s = t_{θ_i>0} + t_{α_i≤α_max}.
- T′ = T̂ ∪ {indices corresponding to the s smallest values of α_i for i ∈ [1, n]}.
- Compute σ^{-2}Σ_x and μ_x for T′.
- T̃ = {indices corresponding to the s largest non-zero values of |μ_x| for which 0 < α_i < +∞}.
- If |L̄_T̃ − L̄_T̂| = 0 then quit. Otherwise set T̂ = T̃ and continue.
Output:
- Estimated support set T̃ and sparse signal x̃ with |T̃| non-zero components, x̃_T̃ = μ_x.
- Estimated covariance matrix σ^{-2}Σ_x for T̃.
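The backtracking idea of Algorithm 2 can be sketched as follows, again in the noiseless limit and, for simplicity, with a fixed expansion size s instead of the adaptive size chosen through t_{θ_i>0} and t_{α_i≤α_max}. The sketch is ours and is only meant to show the expand-then-prune structure borrowed from SP.

```python
import numpy as np

def bsp_sketch(Phi, y, s, max_iter=20, tol=1e-10):
    """Backtracking (SP-style) sketch of the BSP loop for sigma^2 = 0 and fixed sparsity s."""
    m, n = Phi.shape
    T = list(np.argsort(-np.abs(Phi.T @ y))[:s])       # initial support: s best-correlated atoms
    cost_prev = np.inf
    for _ in range(max_iter):
        # expansion: in the noiseless limit alpha_i = s_bar_i^2 / q_bar_i^2, so small alpha = strong atom
        P_perp = np.eye(m) - Phi[:, T] @ np.linalg.pinv(Phi[:, T])
        q_bar = Phi.T @ (P_perp @ y)
        s_bar = np.einsum('ij,ij->j', Phi, P_perp @ Phi)
        alpha = np.where(np.abs(q_bar) > tol, s_bar**2 / np.maximum(q_bar**2, tol), np.inf)
        candidates = [i for i in np.argsort(alpha) if i not in T][:s]
        T_ext = T + candidates
        # backtracking: keep only the s coefficients of largest magnitude over the extended support
        coef = np.linalg.lstsq(Phi[:, T_ext], y, rcond=None)[0]
        keep = np.argsort(-np.abs(coef))[:s]
        T = [T_ext[k] for k in keep]
        resid = y - Phi[:, T] @ np.linalg.lstsq(Phi[:, T], y, rcond=None)[0]
        cost = resid @ resid                           # scaled cost of Theorem 1
        if abs(cost_prev - cost) < tol:
            break
        cost_prev = cost
    x_hat = np.zeros(n)
    x_hat[T] = np.linalg.lstsq(Phi[:, T], y, rcond=None)[0]
    return np.array(T), x_hat
```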


Figure 1. Exact reconstruction rates for m = 128, n = 256: frequency of exact reconstruction versus the number of non-zero components, for FMLM, FMLM-x_i, FMLM-α_i, FMLM-δl_i, FMLM-θ_i and BSP.

Figure 2. Reconstruction error comparison for m = 128: reconstruction error versus time, for the Kalman filter, HBK-FMLM-x_i and HBK-BSP.

IV. EMPIRICAL RESULTS

A. Single Time Instance

We concentrate on the performance of the algorithms for a single time instance and for σ² = 0. The algorithms under comparison are the FMLM algorithm as originally presented in [9], the variants based on the scaled quantities, namely FMLM-x_i, FMLM-α_i, FMLM-δl_i and FMLM-θ_i, and the BSP. The experiment is as follows:
1) Generate Φ ∈ R^{128×256} with i.i.d. entries from N(0, 1/m).
2) Generate T uniformly at random so that |T| = K.
3) Choose the values of x_T from N(0, 1).
4) Compute y = Φx and apply each reconstruction algorithm. Compare the estimate x̂ to x.
5) Repeat the experiment for increasing values of K and for 100 realisations.
The results of this procedure are depicted in Figure 1. The first critical observation is that the original FMLM performs poorly when σ² = 0, due to the improperly scaled cost function. The three scaled variants of FMLM based on the criteria mentioned in Theorem 2 perform, within computational accuracy, in the same manner. We observe the increase in performance for FMLM-θ_i, a consequence of altering the selection criterion to θ_i = q̄_i² − s̄_i. Even though changing the criterion gives theoretically better performance, as Theorem 3 suggests, empirically this gain is not great. By redesigning the inference algorithm based on ideas from the SP we are able to achieve far better performance, as the curve for the BSP algorithm shows.
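A Monte-Carlo harness for steps 1)–5) might look like the sketch below (ours). For concreteness it plugs in the bsp_sketch function from Section III and declares exact reconstruction when the estimate matches the true signal to numerical precision; note that, unlike the algorithms in the paper, this fixed-size sketch is handed the true sparsity K.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, trials = 128, 256, 100

def exact_recovery_rate(recover, K):
    hits = 0
    for _ in range(trials):
        Phi = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))   # step 1
        T_true = rng.choice(n, size=K, replace=False)          # step 2
        x = np.zeros(n)
        x[T_true] = rng.standard_normal(K)                     # step 3
        y = Phi @ x                                            # step 4 (noiseless)
        _, x_hat = recover(Phi, y)
        hits += np.allclose(x_hat, x, atol=1e-6)               # proxy for exact reconstruction
    return hits / trials

# step 5: sweep the sparsity level K and average over realisations
rates = {K: exact_recovery_rate(lambda Phi, y: bsp_sketch(Phi, y, s=K), K)
         for K in range(10, 91, 10)}
```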
B. Dynamic Sparse Signal

We now compare the proposed method, the HB-Kalman filter, against the original Kalman filter. The signal x_t ∈ R^n is assumed to be sparse in its natural basis, with support set S chosen uniformly at random from [1, n], where n = 256. The magnitudes of the non-zero entries of x_t evolve according to the model of Section II, with Z_t = σ_z² I and σ_z² = 0.1. The simulation time for this experiment was T = 200 time instances. At two randomly chosen time instances, t = 50 and t = 150, a change in the support of x_t is introduced: a non-zero component is added to the support of x_50 and a non-zero component is removed from the support of x_150. Apart from these two time instances the support of x_t remains unchanged. At t = 1 the support is initialised with K = 30 non-zero components. The observation noise variance is set to σ_n² = 0.01 for the entire simulation time. We compare the following techniques: the classic Kalman filter, the HBK with FMLM-x_i as the optimisation procedure, and the HBK with BSP.

In this scenario noisy measurements y_t are taken by choosing the design matrix Φ_t ∈ R^{128×256} as described in Subsection IV-A, re-sampled at each time instance. The number of observations m remains constant at each time instance. In Figure 2 we primarily notice how the HBK outperforms the original Kalman filter, a direct consequence of the sparse dynamic model. The HBK-BSP captures the evolution in the support set with greater success, due to the improved optimisation algorithm.

REFERENCES

[1] N. Vaswani, "Kalman filtered compressed sensing," in Proc. 15th IEEE International Conference on Image Processing (ICIP), Oct. 2008, pp. 893–896.
[2] A. Carmi, P. Gurfil, and D. Kanevsky, "Methods for sparse signal recovery using Kalman filtering with embedded pseudo-measurement norms and quasi-norms," IEEE Transactions on Signal Processing, vol. 58, no. 4, pp. 2405–2409, Apr. 2010.
[3] J. Ziniel, L. C. Potter, and P. Schniter, "Tracking and smoothing of time-varying sparse signals via approximate belief propagation," in Conference Record of the Forty-Fourth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), 2010, pp. 808–812.
[4] A. Charles, M. S. Asif, J. Romberg, and C. Rozell, "Sparsity penalties in dynamical system estimation," in Proc. 45th Annual Conference on Information Sciences and Systems (CISS), 2011, pp. 1–6.
[5] S. Ji, Y. Xue, and L. Carin, "Bayesian compressive sensing," IEEE Transactions on Signal Processing, vol. 56, no. 6, pp. 2346–2356, Jun. 2008.
[6] D. P. Wipf and B. D. Rao, "Sparse Bayesian learning for basis selection," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2153–2164, Aug. 2004.
[7] J. A. Tropp, "Greed is good: algorithmic results for sparse approximation," IEEE Transactions on Information Theory, vol. 50, no. 10, pp. 2231–2242, Oct. 2004.
[8] W. Dai and O. Milenkovic, "Subspace pursuit for compressive sensing signal reconstruction," IEEE Transactions on Information Theory, vol. 55, no. 5, pp. 2230–2249, May 2009.
[9] M. E. Tipping and A. C. Faul, "Fast marginal likelihood maximisation for sparse Bayesian models," in Proc. Ninth International Workshop on Artificial Intelligence and Statistics, Jan. 2003.
[10] M. E. Tipping, "The relevance vector machine," in Advances in Neural Information Processing Systems 12, 2000.
