
An Elementary Introduction to Kalman Filtering

Yan Pei, University of Texas at Austin ([email protected])
Swarnendu Biswas, Indian Institute of Technology Kanpur ([email protected])
Donald S. Fussell, University of Texas at Austin ([email protected])
Keshav Pingali, University of Texas at Austin ([email protected])

ABSTRACT

Kalman filtering is a classic state estimation technique used in application areas such as robotics and autonomous control of vehicles. It is now being used to solve problems in computer systems such as controlling the voltage and frequency of processors. Although there are many presentations of Kalman filtering in the literature, they usually deal with particular systems like autonomous robots or linear systems with Gaussian noise, which makes it difficult to understand the general principles behind Kalman filtering. In this paper, we first present the abstract ideas behind Kalman filtering at a level accessible to anyone with a basic knowledge of probability theory and calculus, and then show how these concepts can be applied to the particular problem of state estimation in linear systems. This separation of concepts from applications should make it easier to understand Kalman filtering and to apply it to other problems in computer systems.

KEYWORDS

Kalman filtering, data fusion, uncertainty, noise, state estimation, covariance, BLUE, linear systems

1 INTRODUCTION

Kalman filtering is a state estimation technique invented in 1960 by Rudolf E. Kálmán [16]. Because of its ability to extract useful information from noisy data and its small computational and memory requirements, it is used in many application areas including spacecraft navigation, motion planning in robotics, signal processing, and wireless sensor networks [12, 21, 28–30]. Recent work has used Kalman filtering in controllers for computer systems [5, 13, 14, 24].

Although many introductions to Kalman filtering are available in the literature [1–4, 6–11, 18, 22, 26, 30], they are usually focused on particular applications like robot motion or state estimation in linear systems, which can make it difficult to see how to apply Kalman filtering to other problems. Other presentations derive Kalman filtering as an application of Bayesian inference assuming that noise is Gaussian; this leads to the common misconception that Kalman filtering can be applied only if noise is Gaussian [15]. The goal of this paper is to present the abstract concepts behind Kalman filtering in a way that is accessible to most computer scientists, while clarifying the key assumptions, and then show how the problem of state estimation in linear systems can be solved as an application of these general concepts.

Abstractly, Kalman filtering can be seen as a particular approach to combining approximations of an unknown value to produce a better approximation. Suppose we use two devices of different designs to measure the temperature of a CPU core. Because devices are usually noisy, the measurements are likely to differ from the actual temperature of the core. Since the devices are of different designs, let us assume that noise affects the two devices in unrelated ways (this is formalized using the notion of correlation in Section 2). Therefore, the measurements x1 and x2 are likely to be different from each other and from the actual core temperature xc. A natural question is the following: is there a way to combine the information in the noisy measurements x1 and x2 to obtain a good approximation of the actual temperature xc?

One ad hoc solution is to use the formula 0.5∗x1 + 0.5∗x2 to take the average of the two measurements, giving them equal weight. Formulas of this sort are called linear estimators because they use a weighted sum to fuse values; for our temperature problem, their general form is β∗x1 + α∗x2. In this paper, we use the term estimate to refer both to a noisy measurement and to a value computed by an estimator, since both are approximations of unknown values of interest.

Suppose we have additional information about the two devices; say the second one uses more advanced temperature sensing. Since we would have more confidence in the second measurement, it seems reasonable to discard the first one, which is equivalent to using the linear estimator 0.0∗x1 + 1.0∗x2. Kalman filtering tells us that in general, this intuitively reasonable linear estimator is not "optimal"; paradoxically, there is useful information even in the measurement from the lower-quality device, and the optimal estimator is one in which the weight given to each measurement is proportional to the confidence we have in the device producing that measurement. Only if we have no confidence whatever in the first device should we discard its measurement.

Section 2 describes how these intuitive ideas can be quan-

x ( ) tified. Estimates are modeled as random samples from distri- ( p1 x p butions, and confidence in estimates is quantified in terms of the and of these distributions.1 Sec- x tions3-5 develop the two key ideas behind Kalman filtering. 58 ◦C 60 ◦C 63 ◦C (1) How should estimates be fused optimally? Section3 shows how to fuse scalar estimates such Figure 1: Using pdfs to model devices with systematic as temperatures optimally. It is also shown that the and random errors. Ground truth is 60 ◦C. Dashed lines problem of fusing more than two estimates can be re- are means of pdfs. duced to the problem of fusing two estimates at a time without any loss in the quality of the final estimate. Section4 extends these results to estimates that are vec- with pdf p whose mean and are µ and σ 2 respec- tors, such as state vectors representing the estimated i i i tively; following convention, we use x to represent a random position and velocity of a robot. i sample from this distribution as well. (2) In some applications, estimates are vectors but only Means and variances of distributions model different kinds a part of the vector can be measured directly. For ex- of inaccuracies in measurements. Device i is said to have a ample, the state of a spacecraft may be represented systematic error or bias in its measurements if the mean µ by its position and velocity, but only its position may i of its distribution is not equal to the actual temperature x be observable. In such situations, how do we obtain a c (in general, to the value being estimated, which is known as complete estimate from a partial estimate? ground truth); otherwise, the instrument is unbiased. Figure1 Section5 shows how the Best Linear Unbiased Esti- shows pdfs for two devices that have different amounts of mator (BLUE) can be used for this. Intuitively, it is systematic error. The variance σ 2 on the other hand is a mea- assumed that there is a linear relationship between the i sure of the random error in the measurements. The impact observable and hidden parts of the state vector, and of random errors can be mitigated by taking many measure- this relationship is used to compute an estimate for ments with a given device and averaging their values, but the hidden part of the state, given an estimate for the this approach will not help to reduce systematic error. observable part. In the usual formulation of Kalman filtering, it is assumed Section6 uses these ideas to solve the state estimation that measuring devices do not have systematic errors. How- problems for linear systems, which is the usual context for ever, we do not have the luxury of taking many measure- presenting Kalman filters. Section7 briefly discusses exten- ments of a given state, so we must take into account the sions of Kalman filtering for nonlinear systems. impact of random error on a single measurement. Therefore, confidence in a device is modeled formally by the variance 2 FORMALIZATION OF ESTIMATES of the distribution associated with that device; the smaller This section makes precise the notions of estimates and con- the variance, the higher our confidence in the measurements fidence in estimates. made by the device. In Figure1, the fact that we have less confidence in the first device has been illustrated bymaking 2.1 Scalar estimates p1 more spread out than p2, giving it a larger variance. 
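To make this formalization concrete, the short Python sketch below is our own illustration (not part of the original paper): two hypothetical devices are modeled as random variables with means and variances suggested by Figure 1, and taking a measurement corresponds to drawing a sample. Gaussian sampling is used purely for convenience; nothing in the discussion requires it.

```python
import numpy as np

rng = np.random.default_rng(0)
ground_truth = 60.0                  # actual core temperature (deg C), as in Figure 1

# Hypothetical devices: (mean, variance). Device 1 is biased low and noisier.
mu1, var1 = 58.0, 4.0
mu2, var2 = 63.0, 1.0

# A measurement is a random sample from the device's distribution.
x1 = rng.normal(mu1, np.sqrt(var1))  # Gaussians used only for illustration;
x2 = rng.normal(mu2, np.sqrt(var2))  # the theory does not require them.

print(f"device 1 reads {x1:.2f} C, device 2 reads {x2:.2f} C, truth {ground_truth} C")
```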
To model the behavior of devices producing noisy measure- The informal notion that noise should affect the two de- ments, we associate each device i with a random variable vices in “unrelated ways” is formalized by requiring that the corresponding random variables be uncorrelated. This is that has a probability density function (pdf) pi (x) such as the ones shown in Figure1 (the x-axis in this figure represents a weaker condition than requiring them to be independent, temperature). Random variables need not be Gaussian.2 Ob- as explained in the AppendixA. Suppose we are given the taining a measurement from device i corresponds to drawing measurement made by one of the devices (say x1) and we ) a random sample from the distribution for that device. We have to guess what the other measurement (x2 might be. 2 If knowing x1 does not give us any new information about write xi ∼pi (µi , σi ) to denote that xi is a random variable what x2 might be, the random variables are independent. This is expressed formally by the equation p(x |x ) = p(x ); 1Basic concepts including probability density function, mean, expectation, 2 1 2 variance and covariance are introduced in AppendixA. intuitively, knowing the value of x1 does not change the 2The role of Gaussians in Kalman filtering is discussed in Section 6.5. pdf for the possible values of x2. If the random variables are 2 only uncorrelated, knowing x1 might give us new informa- The covariance Σxx of a random variable x is the T tion about x2 such as restricting its possible values but the matrix E[(x − µ x)(x − µ x) ], where µ x is the mean of x. In- mean of x2|x1 will still be µ2. Using expectations, this can be tuitively, entry (i,j) of this matrix is the covariance between written as E[x2|x1] = E[x2], which is equivalent to requiring the i and j components of vector x; in particular, entry (i,i) th that E[(x1−µ1)(x2−µ2)], the covariance between the two vari- is the variance of the i component of x. A random variable ables, be equal to zero. This is obviously a weaker condition x with a pdf p whose mean is µx and is than independence. Σxx is written as x∼p(µx , Σxx). The inverse of the covariance −1 Although the discussion in this section has focused on matrix (Σxx) is called the precision or information matrix. measurements, the same formalization can be used for esti- Uncorrelated random variables: The cross-covariance ma- mates produced by an estimator. Lemma 2.1(i) shows how trix Σvw of two random variables v and w is the matrix T the mean and variance of a linear combination of pairwise E[(v−µ v)(w−µw ) ]. Intuitively, element (i,j) of this matrix uncorrelated random variables can be computed from the is the covariance between elements v(i) and w(j). If the ran- means and variances of the random variables [19]. The mean dom variables are uncorrelated, all entries in this matrix are and variance can be used to quantify bias and random errors zero, which is equivalent to saying that every component of for the estimator as in the case of measurements. v is uncorrelated with every component of w. Lemma 2.2 An unbiased estimator is one whose mean is equal to the generalizes Lemma 2.1. unknown value being estimated and it is preferable to a bi- ased estimator with the same variance. Only unbiased estima- Lemma 2.2. Let x1∼p1(µ 1, Σ1),..., xn∼pn(µ n, Σn) be a set tors are considered in this paper. Furthermore, an unbiased of pairwise uncorrelated random variables of length m. 
Let Ín estimator with a smaller variance is preferable to one with a y = i=1 Ai xi . larger variance since we would have more confidence in the (i) The mean and covariance matrix of y are the following: estimates it produces. As a step towards generalizing this n discussion to estimators that produce vector estimates, we Õ µ = A µ (3) refer to the variance of an unbiased scalar estimator as the y i i i=1 Mean Square Error of that estimator or MSE for short. n Lemma 2.1(ii) asserts that if a random variable is pairwise Õ T Σyy = Ai ΣiAi (4) uncorrelated with a set of random variables, it is uncorrelated i=1 with any linear combination of those variables. (ii) If a random variable xn+1 is pairwise uncorrelated with 2 2 x ,.., x , it is uncorrelated with y. Lemma 2.1. Let x1∼p1(µ1, σ1 ),..., xn∼pn(µn, σn) be a set of 1 n Ín pairwise uncorrelated random variables. Let y = i=1 αixi be T The MSE of an unbiased estimator y is E[(y−µ y) (y−µ y)], a random variable that is a linear combination of the xi ’s. which is the sum of the variances of the components of y; (i) The mean and variance of y are: if y has length 1, this reduces to variance as expected. The n MSE is also the sum of the diagonal elements of Σ (this is Õ yy µy = αi µi (1) called the of Σyy). i=1 n 2 Õ 2 2 3 FUSING SCALAR ESTIMATES σy = αi σi (2) i=1 Section 3.1 discusses the problem of fusing two scalar esti- mates. Section 3.2 generalizes this to the problem of fusing (ii) If a random variable xn+1 is pairwise uncorrelated with n>2 scalar estimates. Section 3.3 shows that fusing n>2 esti- x1,.., xn, it is uncorrelated with y. mates can be done iteratively by fusing two estimates at a time without any loss of quality in the final estimate.
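The variance-combination formulas of Lemma 2.1 (Equations 1 and 2) are easy to check numerically. The following Python sketch is our addition; the means, standard deviations, and weights are arbitrary example values. It compares the empirical mean and variance of a weighted sum of uncorrelated samples with the predicted values.

```python
import numpy as np

rng = np.random.default_rng(1)
mus    = np.array([2.0, -1.0, 5.0])   # example means (assumed values)
sigmas = np.array([1.0,  0.5, 2.0])   # example standard deviations
alphas = np.array([0.2,  0.3, 0.5])   # weights of the linear combination

# Independent (hence uncorrelated) samples, one column per random variable.
samples = rng.normal(mus, sigmas, size=(1_000_000, 3))
y = samples @ alphas

print("empirical mean", y.mean(), "vs Eq. (1):", alphas @ mus)
print("empirical var ", y.var(),  "vs Eq. (2):", (alphas**2) @ sigmas**2)
```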

2.2 Vector estimates 3.1 Fusing two scalar estimates In some applications, estimates are vectors. For example, the We now consider the problem of choosing the optimal values state of a mobile robot might be represented by a vector of the parameters α and β in the linear estimator β∗x1 + α∗x2 containing its position and velocity. Similarly, the vital signs for fusing estimates x1 and x2 from uncorrelated random of a person might be represented by a vector containing his variables. temperature, pulse rate and blood pressure. In this paper, we The first reasonable requirement is that if the two esti- denote a vector by a boldfaced lowercase letter, and a matrix mates x1 and x2 are equal, fusing them should produce the by an uppercase letter. same value. This implies that α+β=1. Therefore the linear 3 estimators of interest are of the form the precisions of the two distributions, the expressions for y and νy can be written more simply as follows: yα (x1, x2)=(1−α)∗x1 + α∗x2 (5) ν1 ν2 y(x1, x2) = ∗x1 + ∗x2 (11) If x1 and x2 in Equation5 are considered to be unbiased ν1 + ν2 ν1 + ν2 estimators of some quantity of interest, then y is an unbi- α νy = ν1 + ν2 (12) ased estimator for any value of α. How should optimality of such an estimator be defined? One reasonable definition These results say that the weight we should give to an is that the optimal value of α minimizes the variance of yα estimate is proportional to the confidence we have in that since this will produce the highest-confidence fused esti- estimate, and that we have more confidence in the fused es- mates as discussed in Section2. The variance ( MSE) of yα timate than in the individual estimates, which is intuitively can be determined from Lemma 2.1: reasonable. To use these results, we need only the variances of the distributions. In particular, the pdfs pi , which are usu- 2 2 2 2 2 σy (α) = (1 − α) ∗σ1 + α ∗σ2 (6) ally not available in applications, are not needed, and the proof of Theorem 3.1 does not require these pdf’s to have Theorem 3.1. Let x ∼p (µ , σ 2) and x ∼p (µ , σ 2) be un- 1 1 1 1 2 2 2 2 the same mean. correlated random variables. Consider the linear estimator yα (x1, x2) = (1−α)∗x1 + α∗x2. The variance of the estimator 2 3.2 Fusing multiple scalar estimates σ1 is minimized for α = 2 2 . σ1 +σ2 The approach in Section 3.1 can be generalized to optimally fuse multiple pairwise uncorrelated estimates x , x ,..., x . This result can be proved by setting the derivative of σ 2(α) 1 2 n y Let y (x ,.., x ) denote the linear estimator for fusing the with respect to α to zero and solving equation for α. n,α 1 n n estimates given parameters α1,.., αn, which we denote by ( ) Proof. α. The notation yα x1, x2 introduced in the previous section can be considered to be an abbreviation of y2,α (x1, x2). d 2 2 2 σy (α) = −2(1 − α) ∗ σ1 + 2α ∗ σ2 2 dα Theorem 3.2. Let xi ∼pi (µi , σi ) for (1≤i≤n) be a set of 2 2 2 pairwise uncorrelated random variables. Consider the linear = 2α ∗ (σ1 + σ2 ) − 2 ∗ σ1 = 0 estimator y (x ,.., x ) = Ín α x where Ín α = 1. The σ 2 n,α 1 n i=1 i i i=1 i α = 1 (7) variance of the estimator is minimized for σ 2 + σ 2 1 2 1 σ 2 The second order derivative of σ 2(α), (σ 2+σ 2), is positive, α = i y 1 2 i Ín 1 2 j=1 2 showing that σy (α) reaches a minimum at this point. □ σj 2 Ín 2 2 In the literature, the optimal value of α is called the Kalman Proof. From Lemma 2.1, σy (α) = i=1 αi σ i . To find the 2 gain K. 
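Theorem 3.1, together with the closed forms derived next, translates directly into code. The Python sketch below is our transcription (the example readings and variances are made up); it fuses two uncorrelated scalar estimates with weights proportional to their precisions.

```python
def fuse_two_scalars(x1, var1, x2, var2):
    """Minimum-variance linear fusion of two uncorrelated scalar estimates
    (Theorem 3.1): the weight given to each estimate is proportional to 1/var."""
    K = var1 / (var1 + var2)      # optimal weight alpha, i.e. the Kalman gain
    y = x1 + K * (x2 - x1)        # fused estimate
    var_y = (1.0 - K) * var1      # variance of the fused estimate
    return y, var_y

# Example: a noisy reading (variance 4) fused with a better one (variance 1).
y, var_y = fuse_two_scalars(58.3, 4.0, 61.2, 1.0)
print(y, var_y)   # the fused value lies closer to the low-variance estimate
```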
Substituting K into the linear fusion model, we get values of αi that minimize the variance σy under the con- the optimal linear estimator y(x1, x2): straint that the αi ’s sum to 1, we use the method of Lagrange multipliers. Define 2 2 σ2 σ1 y(x1, x2) = ∗x1 + ∗x2 (8) n n σ 2 + σ 2 σ 2 + σ 2 Õ 2 2 Õ 1 2 1 2 f (α1,..., αn) = αi σ i + λ( αi − 1) As a step towards fusion of n>2 estimates, it is useful to i=1 i=1 rewrite this as follows: where λ is the Lagrange multiplier. Taking the partial deriva- 1 1 tives of f with respect to each αi and setting these derivatives 2 2 2 2 2 σ1 σ2 − / ( , ) ∗ ∗ to zero, we find α1σ1 = α2σ2 = ... = αnσn = λ 2. From this, y x1 x2 = 1 1 x1 + 1 1 x2 (9) 2 + 2 2 + 2 and the fact that sum of the αi ’s is 1, the result follows. □ σ1 σ2 σ1 σ2 Substituting the optimal value of α into Equation6, we The minimal variance is given by the following expression: get 1 2 = σyn n (13) 2 1 Õ 1 σy = (10) 1 1 2 2 + 2 σ σ1 σ2 j=1 j 2 The expressions for y and σy are complicated because they As in Section 3.1, these expressions are more intuitive if contain the reciprocals of variances. If we let ν1 and ν2 denote the variance is replaced with precision: the contribution of xi 4 ) ··· ··· to the value of yn(x ,.., xn) is proportional to xi ’s confidence. (ν1 + ν2) (ν1 + ν2 + ν3 (ν1 + + νn−1) (ν1 + + νn ) 1 y(x1, x2) y2(y2(x1, x2), x3) y2(y2(...), xn−1) y2(y2(...), xn ) n ν1 ν1+ν2 ν1+···+νn−1 ν1+ν2 ν1+ν2+ν3 ν1+···+νn Õ νi (ν1) + + ... + + y (x ,.., x ) = ∗ x (14) x n 1 n ν +...+ν i 1 i=1 1 n ν2 ν3 νn−1 νn ν1+ν2 ν1+ν2+ν3 ν1+···+νn−1 ν1+···+νn n Õ x2 x3 xn−1 xn νyn = νi (15) (ν2) (ν3) (νn−1) (νn ) i=1 Equations 14 and 15 generalize Equations 11 and 12. Figure 2: Dataflow graph for incremental fusion. 3.3 Incremental fusing is optimal inversely proportional to the variance of the random variable In many applications, the estimates x1, x2,..., xn become available successively over a period of time. While it is possi- from which that estimate is obtained. Furthermore, when fus- ble to store all the estimates and use Equations 14 and 15 to ing n>2 estimates, estimates can be fused incrementally with- fuse all the estimates from scratch whenever a new estimate out any loss in the quality of the final result. These results are becomes available, it is possible to save both time and storage often expressed formally in terms of the Kalman gain K, as if one can do this fusion incrementally. We show that just as shown below; the equations can be applied recursively to fuse a sequence of numbers can be added by keeping a running multiple estimates. Note that if ν1≫ν2, K≈0 and y(x1,x2)≈x1; sum and adding the numbers to this running sum one at a conversely if ν1≪ν2, K≈1 and y(x1,x2)≈x2. time, a sequence of n>2 estimates can be fused by keeping a 2 2 x1∼p1(µ1, σ1 ), x2∼p2(µ2, σ2 ) “running estimate” and fusing estimates from the sequence 2 one at a time into this running estimate without any loss in σ1 ν2 K = 2 2 = (18) the quality of the final estimate. In short, we want to show σ1 + σ2 ν1 + ν2 that y (x , x ,..., x ) = y (y (y (x , x ), x )..., x ). n 1 2 n 2 2 2 1 2 3 n y(x1, x2) = x1 + K(x2 − x1) (19) A little bit of algebra shows that if n>2, Equations 14 and 2 ( − ) 2 15 for the optimal linear estimator and its precision can be σy = 1 K σ1 or νy = ν1 + ν2 (20) expressed as shown in Equations 16 and 17. 4 FUSING VECTOR ESTIMATES

νyn−1 νn The results in Section3 for fusing scalar estimates can be yn(x1,.., xn) = yn−1(x1,..., xn−1) + xn νyn−1 +νn νyn−1 +νn extended to vectors by replacing variances with covariance (16) matrices.
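The incremental form just given (Equation 16, with precisions ν = 1/σ²) is easy to check against the batch formulas of Equations 14–15. The following Python sketch is ours, with arbitrary sample estimates and variances; it fuses a list of scalar estimates one at a time and all at once, and the two results agree up to floating-point rounding.

```python
def fuse_incremental(xs, variances):
    """Fuse estimates one at a time, keeping a running estimate (Equation 16)."""
    y, var_y = xs[0], variances[0]
    for x, v in zip(xs[1:], variances[1:]):
        K = var_y / (var_y + v)
        y = y + K * (x - y)
        var_y = (1.0 - K) * var_y
    return y, var_y

def fuse_batch(xs, variances):
    """Fuse all estimates at once (Equations 14-15), using precisions."""
    precisions = [1.0 / v for v in variances]
    nu = sum(precisions)
    y = sum(p * x for p, x in zip(precisions, xs)) / nu
    return y, 1.0 / nu

xs, vs = [60.1, 59.4, 61.0, 60.6], [4.0, 2.0, 1.0, 0.5]
print(fuse_incremental(xs, vs))
print(fuse_batch(xs, vs))   # identical up to floating-point rounding
```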

νy = νy + νn (17) n n−1 4.1 Fusing multiple vector estimates This shows that y (x ,.., x ) = y (y − (x ,.., x − ), x ). n 1 n 2 n 1 1 n 1 n For vectors, the linear estimator is Using this argument recursively gives the required result.3 n n To make the connection to Kalman filtering, it is useful Õ Õ ( ) to derive the same result using a pictorial argument. Fig- yn,A x1, x2,.., xn = Ai xi where Ai = I (21) ure2 shows the process of incrementally fusing the n es- i=1 i=1 timates. In this picture, time progresses from left to right, Here A stands for the matrix parameters (A1,..., An). All the the precision of each estimate is shown in parentheses next vectors xi are assumed to be of the same length. To simplify to it, and the weights on the edges are the weights from notation, we omit the subscript n in yn,A in the discussion Equation 11. The contribution of each xi to the final value below since it is obvious from the context. y2(y2(y2(x1, x2), x3)..., xn) is given by the product of the weights Optimality: The parameters A1,..., An in the linear data on the path from xi to the final value and this product isob- fusion model are chosen to minimize MSE(yA) which is T viously equal to the weight of xi in Equation 14, showing E[(yA−µyA ) (yA−µyA )], as explained in Section2. that incremental fusion is optimal. Theorem 4.1 generalizes Theorem 3.2 to the vector case. The proof of this theorem uses matrix derivatives [23] (see 3.4 Summary AppendixB) and is given in AppendixC since it is not needed The results in this section can be summarized informally for understanding the rest of this paper. Comparing Theo- as follows. When using a linear estimator to fuse uncertain rems 4.1 and 3.2, we see that the expressions are similar, the scalar estimates, the weight given to each estimate should be main difference being that the role of variance in the scalar case is played by the covariance matrix in the vector case. 3We thank Mani Chandy for showing us this approach to proving the result. 5 Theorem 4.1. Let xi ∼pi (µi , Σi ) for (1≤i≤n) be a set of 5 BEST LINEAR UNBIASED ESTIMATOR pairwise uncorrelated random variables. Consider the linear (BLUE) estimator y (x ,.., x )= Ín A x , where Ín A = I. The A 1 n i=1 i i i=1 i In some applications, the state of the system is represented value of MSE(y ) is minimized for A by a vector but only part of the state can be measured directly, n so it is necessary to estimate the hidden portion of the state Õ −1 −1 −1 Ai = ( Σj ) Σi (22) corresponding to a measured value of the visible state. This j=1 section describes an estimator called the Best Linear Unbiased Therefore the optimal estimator is Estimator (BLUE) [17, 20, 27] for doing this. n n y Õ −1 −1 Õ −1 y(x1,..., xn) = ( Σj ) Σi xi (23) j=1 i=1 −1 (yb− µy) = ΣyxΣxx(x − µx) The covariance matrix of y can be computed by using Lemma 2.2. yb1 n Õ −1 −1 Σyy = ( Σ ) (24)   j µx j=1 µy In the vector case, precision is the inverse of a covariance matrix, denoted by N . Equations 25–26 use precision to ex- x press the optimal estimator and its variance and generalize x1 Equations 14–15 to the vector case. n Figure 3: BLUE line corresponding to Equation 33. −1 Õ y(x1,..., xn) = Ny Ni xi (25) i=1 Consider the general problem of determining a value for n Õ vector y given a value for a vector x. If there is a functional N = N (26) y j relationship between x and y (say y=F(x) and F is given), j=1 it is easy to compute y given a value for x (say x1). 
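For two vector estimates, the fusion rule takes the Kalman-gain form K = Σ1(Σ1 + Σ2)⁻¹, y = x1 + K(x2 − x1), Σyy = (I − K)Σ1 (Equations 27–32, given below). The small numpy sketch that follows is our addition; the covariance matrices are made-up examples.

```python
import numpy as np

def fuse_two_vectors(x1, S1, x2, S2):
    """Minimum-MSE linear fusion of two uncorrelated vector estimates:
    K = S1 (S1 + S2)^-1, y = x1 + K (x2 - x1), Sy = (I - K) S1."""
    K = S1 @ np.linalg.inv(S1 + S2)
    y = x1 + K @ (x2 - x1)
    Sy = (np.eye(len(x1)) - K) @ S1
    return y, Sy

x1 = np.array([1.0, 0.5]);  S1 = np.array([[2.0, 0.3], [0.3, 1.0]])
x2 = np.array([1.4, 0.2]);  S2 = np.array([[0.5, 0.0], [0.0, 0.8]])
y, Sy = fuse_two_vectors(x1, S1, x2, S2)
print(y)
print(Sy)
```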
As in the scalar case, fusion of n>2 vector estimates can In our context however, x and y are random variables, so be done incrementally without loss of precision. The proof such a precise functional relationship will not hold. Figure3 is similar to the scalar case, and is omitted. shows an example in which x andy are scalar-valued random There are several equivalent expressions for the Kalman variables. The gray ellipse in this figure, called a confidence gain for the fusion of two estimates. The following one, ellipse, is a projection of the joint distribution of x and y onto which is easily derived from Equation 22, is the vector analog the (x,y) plane, that shows where some large proportion of of Equation 18: the (x,y) values are likely to be. Suppose x takes the value −1 x1. Even within the confidence ellipse, there are many points K = Σ1(Σ1 + Σ2) (27) (x1,y), so we cannot associate a single value ofy with x1. One The covariance matrix of the optimal estimator y(x1, x2) possibility is to compute the mean of the y values associated is the following. with x1 (that is, the expectation E[y|x=x1]), and return this as the estimate for y if x=x . This requires knowing the joint Σ = Σ (Σ + Σ )−1Σ (28) 1 yy 1 1 2 2 distribution of x and y, which may not always be available. = KΣ2 = Σ1 − KΣ1 (29) In some problems, we can assume that there is an un- known linear relationship between x and y and that uncer- 4.2 Summary tainty comes from noise. Therefore, we can use a technique The results in this section can be summarized in terms of the similar to the ordinary (OLS) method to esti- Kalman gain K as follows. mate this linear relationship, and use it to return the best estimate of y for any given value of x. In Figure3, we see that x1∼p1(µ 1, Σ1), x2∼p2(µ 2, Σ2) although there are many points (x1,y), the y values are clus- −1 −1 K = Σ1(Σ1 + Σ2) = (N1 + N2) N2 (30) tered around the line shown in the figure so the value yb1 is a reasonable estimate for the value of y corresponding to x1. y(x1, x2) = x1 + K(x2 − x1) (31) This line, called the best linear unbiased estimator (BLUE), is ( − ) Σyy = I K Σ1 or Ny = N1 + N2 (32) the analog of ordinary least squares (OLS) for distributions. 6 ( ) Computing BLUE. Consider the estimator byA,b x =Ax+b. at any time step is a function of the state of the system at We choose A and b so that this is an unbiased estimator with the previous time step and the control inputs applied to the minimal MSE. The “ b ” over the y is notation that indicates system during that interval. This is usually expressed by that we are computing an estimate for y. an equation of the form xt = ft (xt−1, ut ) where ut is the control input. The function f is nonlinear in the general x µ  Σ Σ  t Theorem 5.1. Let ∼p( x , xx xy ). The estima- case, and can be different for different steps. If the system y µ Σ Σ y yx yy is linear, the relation for state evolution over time can be tor y (x)=Ax+b for estimating the value of y for a given bA,b written as xt = Ft xt−1 + Bt ut , where Ft and Bt are time- value of x is an unbiased estimator with minimal MSE if dependent matrices that can be determined from the physics b = µ y−A(µ x) of the system. Therefore, if the initial state x0 is known −1 exactly and the system dynamics are modeled perfectly by A = ΣyxΣxx the Ft and Bt matrices, the evolution of the state over time The proof of Theorem 5.1 is straightforward. For an unbi- can be computed precisely as shown in Figure 4a. ased estimator, E[by]=E[y]. 
This implies that b=µ y−A(µ x) so In general however, we may not know the initial state ex- ( ) ( − ) an unbiased estimator is of the form byA x = µ y+A x µ x . actly, and the system dynamics and control inputs may not Note that this is equivalent to asserting that the BLUE line be known precisely. These inaccuracies may cause the state must pass through the point (µ x, µ y). Setting the derivative computed by the model to diverge unacceptably from the ( ) of MSEA byA with respect to A to zero[23] and solving for actual state over time. To avoid this, we can make measure- A, we find that the best linear unbiased estimator is ments of the state after each time step. If these measurements −1 were exact, there would of course be no need to model the by = µ y + ΣyxΣxx(x − µ x) (33) system dynamics. However, in general, the measurements This equation can be understood intuitively as follows. If themselves are imprecise. we have no information about x and y, the best we can do is Kalman filtering was invented to solve the problem of state ( ) the estimate µ x, µ y , which lies on the BLUE line. However, estimation in such systems. Figure 4b shows the dataflow if we know that x has a particular value x1, we can use the of the computation, and we use it to introduce standard correlation between y and x to estimate a better value for terminology. An estimate of the initial state, denoted by x | , ( − ) b0 0 y from the difference x1 µ x . Note that if Σyx = 0 (that is, is assumed to be available. At each time step t=1, 2,.., the x and y are uncorrelated), the best estimate of y is just µ y, system model is used to provide an estimate of the state at so knowing the value of x does not give us any additional time t using information from time t−1. This step is called information about y as one would expect. In Figure3, this and the estimate that it provides is called the a corresponds to the case when the BLUE line is parallel to priori estimate and denoted by xt |t−1. The a priori estimate the x-axis. At the other extreme, suppose that y and x are b is then fused with zt , the state estimate obtained from the functionally related so y = Cx. In that case, it is easy to see measurement at time t, and the result is the a posteriori state that Σyx = CΣxx, so y = Cx as expected. In Figure3, this b estimate at time t, denoted by bxt |t . This a posteriori estimate corresponds to the case when the confidence ellipse shrinks is used by the model to produce the a priori estimate for down to the BLUE line. the next time step and so on. As described below, the a Equation 33 is a generalization of ordinary least squares priori and a posteriori estimates are the means of certain in the sense that if we compute the relevant means and random variables; the covariance matrices of these random ( ) variances of a set of discrete data xi ,yi and substitute into variables are shown within parentheses above each estimate Equation 33, we get the same line that is obtained by using in Figure4b, and these are used to weight estimates when OLS. fusing them. Section 6.1 presents the state evolution model and a priori 6 KALMAN FILTERS FOR LINEAR state estimation. Section 6.2 discusses how state estimates SYSTEMS are fused if an estimate of the entire state can be obtained In this section, we apply the developed in Sec- by measurement; Section 6.3 addresses this problem when tions3-5 to the particular problem of state estimation in only a portion of the state can be measured directly. 
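Equation 33 can be applied directly once the joint means and covariances are known. The Python sketch below is our illustration, with assumed numbers: it estimates a hidden scalar component from an observed, correlated one.

```python
import numpy as np

def blue_estimate(x, mu_x, mu_y, S_yx, S_xx):
    """Best linear unbiased estimate of y given x (Equation 33):
    y_hat = mu_y + S_yx S_xx^-1 (x - mu_x)."""
    return mu_y + S_yx @ np.linalg.inv(S_xx) @ (x - mu_x)

# Assumed example: hidden position and observed velocity are correlated.
mu_x = np.array([10.0])     # mean of the observed velocity
mu_y = np.array([50.0])     # mean of the hidden position
S_xx = np.array([[4.0]])    # variance of the velocity
S_yx = np.array([[3.0]])    # covariance between position and velocity

print(blue_estimate(np.array([12.0]), mu_x, mu_y, S_yx, S_xx))   # -> [51.5]
```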
linear systems, which is the classical application of Kalman filtering. Figure 4a shows how the evolution of the state of such 6.1 State evolution model and prediction a system over time can be computed if the initial state x0 The evolution of the state over time is described by a series and the model of the system dynamics are known precisely. of random variables x0, x1, x2,... Time advances in discrete steps. The state of the system 7 x0 x1 x2 xt−1 xt f 1 f 2 ..... f t

(a) Discrete-time dynamical system.

(b) Dynamical system with uncertainty: at each time step, the a priori estimate $\hat{x}_{t|t-1}$ (covariance $\Sigma_{t|t-1}$) is fused with the measurement $z_t$ (covariance $R_t$) to produce the a posteriori estimate $\hat{x}_{t|t}$ (covariance $\Sigma_{t|t}$).

Predictor: $\hat{x}_{t|t-1} = F_t \hat{x}_{t-1|t-1} + B_t u_t$, $\quad \Sigma_{t|t-1} = F_t \Sigma_{t-1|t-1} F_t^T + Q_t$.
Measurement: $z_t$ with noise covariance $R_t$.
Fusion: $K_t = \Sigma_{t|t-1}(\Sigma_{t|t-1} + R_t)^{-1}$, $\quad \hat{x}_{t|t} = \hat{x}_{t|t-1} + K_t(z_t - \hat{x}_{t|t-1})$, $\quad \Sigma_{t|t} = (I - K_t)\Sigma_{t|t-1}$.

(c) Implementation of the dataflow diagram (b).

Predictor: $\hat{x}_{t|t-1} = F_t \hat{x}_{t-1|t-1} + B_t u_t$, $\quad \Sigma_{t|t-1} = F_t \Sigma_{t-1|t-1} F_t^T + Q_t$.
Measurement: $z_t = H_t x_t + v_t$ with noise covariance $R_t$.
Fusion: $K_t = \Sigma_{t|t-1} H_t^T (H_t \Sigma_{t|t-1} H_t^T + R_t)^{-1}$, $\quad \hat{x}_{t|t} = \hat{x}_{t|t-1} + K_t(z_t - H_t \hat{x}_{t|t-1})$, $\quad \Sigma_{t|t} = (I - K_t H_t)\Sigma_{t|t-1}$.

(d) Implementation of the dataflow diagram (b) for systems with partial observability.

Figure 4: State estimation using Kalman filtering.
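To make Figure 4 concrete, the following Python sketch (our addition, not part of the original paper) implements the predictor and fusion boxes of Figure 4d and runs them on the falling-body system described later in Section 6.4. The matrices F, B, Q, H, R and the initial covariance are taken from that example as we read them from the text; setting H to the identity recovers the computation of Figure 4c.

```python
import numpy as np

def kf_predict(x, S, F, B, u, Q):
    """Predictor box of Figure 4d: a priori estimate and covariance."""
    return F @ x + B @ u, F @ S @ F.T + Q

def kf_update(x_pred, S_pred, z, H, R):
    """Fusion box of Figure 4d: fuse the a priori estimate with measurement z."""
    K = S_pred @ H.T @ np.linalg.inv(H @ S_pred @ H.T + R)   # Kalman gain
    x_post = x_pred + K @ (z - H @ x_pred)
    S_post = (np.eye(len(x_pred)) - K @ H) @ S_pred
    return x_post, S_post

# Falling-body example of Section 6.4: state is [velocity, distance], dt = 0.25 s.
rng = np.random.default_rng(2)
dt, g = 0.25, 9.8
F = np.array([[1.0, 0.0], [dt, 1.0]])
B = np.array([[dt], [0.5 * dt**2]])
u = np.array([g])
Q = np.array([[2.0, 2.5], [2.5, 4.0]])   # process-noise covariance given in the text
H = np.array([[1.0, 0.0]])               # only velocity is measured
R = np.array([[8.0]])                    # measurement-noise variance given in the text

x_true = np.zeros(2)
x_est, S_est = np.zeros(2), np.diag([80.0, 10.0])   # initial estimate and covariance

for _ in range(80):                      # 80 steps of 0.25 s = 20 s
    x_true = F @ x_true + B @ u + rng.multivariate_normal(np.zeros(2), Q)  # Eq. 34
    z = H @ x_true + rng.normal(0.0, np.sqrt(R[0, 0]), size=1)             # Eq. 37
    x_est, S_est = kf_predict(x_est, S_est, F, B, u, Q)
    x_est, S_est = kf_update(x_est, S_est, z, H, R)

print("true [v, s]     :", x_true)
print("estimated [v, s]:", x_est)
```

Running the loop shows the a posteriori estimates tracking the noisy ground truth, which is the behavior Figure 6 illustrates.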

• The random variable x0 captures the likelihood of dif- time step t = 1, 2,... that capture our beliefs about the likeli- ferent initial states. hood of different states at time t before and after fusion with • The random variables at successive time steps are re- the measurement respectively. The mean and covariance ma- lated by the following linear model: trix of a random variable xi |j are denoted by bxi |j and Σi |j respectively. We assume E[bx0|0]=E[x0] (no bias). xt = Ft xt−1 + Bt ut + wt (34) Prediction essentially uses xt−1|t−1 as a proxy for xt−1 in Here, ut is the control input, which is assumed to Equation 34 to determine xt |t−1 as shown in Equation 35. be deterministic, and wt is a zero-mean noise term that models all the uncertainty in the system. The covariance matrix of wt is denoted by Qt , and the xt |t−1 = Ft xt−1|t−1 + Bt ut + wt (35) noise terms in different time steps are assumed tobe uncorrelated to each other (that is, E[wi wj ]=0 if i,j) For state estimation, we need only the mean and covari- and to x0. ance matrix of xt |t−1. The Predictor box in Figure4 com- For estimation, we have a random variable x0|0 that cap- putes these values; the covariance matrix is obtained from tures our belief about the likelihood of different states at Lemma 2.2 under the assumption that wt is uncorrelated time t=0, and two random variables xt |t−1 and xt |t at each with xt−1|t−1, which is justified in Section 6.2. 8 x t|t−1 example, if the state vector has two components and only the first component is observable, Ht can be [ 1 0 ]. In general, xt+1|t ˆxt|t−1 the Ht matrix can specify a linear relationship between the state and the observation, and it can be time-dependent. The

ˆxt|t imprecise measurement model introduced in Equation 36 x ⨁ t|t ˆxt+1|t becomes: xt zt+1 ˆxt+1|t+1 zt ⨁ zt = Ht xt + vt (37) xt+1 xt+1|t+1 The hidden portion of the state can be specified using a matrix Ct , which is an orthogonal complement of Ht . For example, if H = [ 1 0 ], one choice for C is [ 0 1 ]. Figure 5: Pictorial representation of Kalman filtering. t t Figure 4d shows the computation for this case. The fusion phase can be understood intuitively in terms of the following 6.2 Fusing complete observations of the steps. state (i) The observable part of the a priori estimate of the state If the entire state can be measured at each time step, the Ht bxt |t−1 is fused with the measurement zt , using the imprecise measurement at time t is modeled as follows: techniques developed in Sections3-4. The quantity zt − Ht bxt |t−1 is called the innovation. The result is the a posteriori es- zt = xt + vt (36) timate of the observable state Ht bxt |t . where vt is a zero-mean noise term with covariance matrix (ii) The BLUE in Section5 is used to obtain the a pos- Rt . The noise terms in different time steps are assumed tobe teriori estimate of the hidden state Ct bxt |t by adding uncorrelated with each other (that is, E[vi vj ] is zero if i,j) to the a priori estimate of the hidden state Ct bxt |t−1 a as well as with x0|0 and all wk . A subtle point here is that value obtained from the product of the covariance be- xt in this equation is the actual state of the system at time t tween Ht xt |t−1 andCt xt |t−1 and the difference between (that is, a particular realization of the random variable xt ), Ht bxt |t−1 and Ht bxt |t . so variability in zt comes only from vt and its covariance (iii) The a posteriori estimates of the observable and hidden matrix Rt . portions of the state are composed to produce the a Therefore, we have two imprecise estimates for the state posteriori estimate of the entire state bxt |t . at each time step t = 1, 2,..., the a priori estimate from the The actual implementation produces the final result di- predictor (x ) and the one from the measurement (z ). If bt |t−1 t rectly without going through these steps as shown in Fig- z is uncorrelated to x , we can use Equations 30-32 to t t |t−1 ure 4d, but these incremental steps are useful for understand- fuse the estimates as shown in Figure 4c. ing how all this works, and they are implemented as follows. The assumptions that (i) xt−1|t−1 is uncorrelated with wt , a priori which is used in prediction, and (ii) xt |t−1 is uncorrelated (i) The estimate of the observable part of the state T with zt , which is used in fusion, are easily proved to be is Ht bxt |t−1 and the covariance is Ht Σt |t−1Ht . The a pos- correct by induction on t, using Lemma 2.2(ii). Figure 4b teriori estimate is obtained directly from Equation 31: gives the intuition: xt |t−1 for example is an affine function Ht bxt |t = Ht bxt |t−1 of the random variables x0|0, w1, v1, w2, v2,..., wt , and is T( T )−1( − ) therefore uncorrelated with vt (by assumption about vt and + Ht Σt |t−1Ht Ht Σt |t−1Ht + Rt zt Ht bxt |t−1 Lemma 2.2(ii)) and hence with zt . T( T )−1 Figure5 shows the computation pictorially using confi- Let Kt =Σt |t−1Ht Ht Σt |t−1Ht + Rt . The a posteriori dence ellipses to illustrate uncertainty. 
The dotted arrows estimate of the observable state can be written in terms at the bottom of the figure show the evolution of the state, of Kt as follows: and the solid arrows show the computation of the a priori Ht xt |t = Ht xt |t−1 + Ht Kt (zt − Ht xt |t−1) (38) estimates and their fusion with measurements. b b b (ii) The a priori estimate of the hidden state is Ct bxt |t−1. 6.3 Fusing partial observations of the state The covariance between the hidden portion Ct xt |t−1 T In some problems, only a portion of the state can be mea- and the observable portion Ht xt |t−1 is Ct Σt |t−1Ht . The sured directly. The observable portion of the state is specified difference between the a priori estimate and a posteriori formally using a full row-rank matrix Ht called the observa- estimate of Ht x is Ht Kt (zt −Hbxt |t−1). Therefore the a tion matrix: if the state is x, what is observable is Ht x. For posteriori estimate of the hidden portion of the state is 9 obtained directly from Equation 33: system is easily shown to be the following:           v 1 0 v − 0 0.25 0 C x | = C x | − n n 1 t bt t t bt t 1 = + 2 (42) sn 0.25 1 sn−1 0 0.5 × 0.25 9.8 + (C Σ H T)(H Σ H T)−1H K (z − H x ) t t |t−1 t t t |t−1 t t t t t bt |t−1       v0 0 80 0 = Ct xt |t−1 + Ct Kt (zt − Ht xt |t−1) (39) where we assume = and Σ0 = . b b s0 0 0 10 (iii) Putting the a posteriori estimates (38) and (39) together, The gray lines in Figure6 show the evolution of velocity and distance with time according to this model. Because of       Ht Ht Ht uncertainty in modeling the system dynamics, the actual xt |t = xt |t−1 + Kt (zt − Ht xt |t−1) Ct b Ct b Ct b evolution of the velocity and position will be different in practice. The red lines in Figure6 show one trajectory for H  Since t is invertible, it can be canceled from the left this evolution, corresponding to a Gaussian noise term with Ct  2 2.5 covariance in Equation 34 (because this noise and right hand sides, giving the equation 2.5 4 term is random, there are many trajectories for the evolu- x | = x | − + Kt (zt − Ht x | − ) (40) bt t bt t 1 bt t 1 tion, and we are just showing one of them). The red lines To compute Σt |t , Equation 40 can be rewritten as correspond to “ground truth” in our example. bxt |t = (I − Kt Ht )bxt |t−1 + Kt zt . Since xt |t−1 and zt are un- The green points in Figure 6b show the noisy measure- correlated, it follows from Lemma 2.2 that ments of velocity at different time steps, assuming the noise is modeled by a Gaussian with variance 8. The blue lines T T Σt |t = (I − Kt Ht )Σt |t−1(I − Kt Ht ) + Kt Rt Kt show the a posteriori estimates of the velocity and position. It can be seen that the a posteriori estimates track the ground Substituting the value of K and simplifying, we get t truth quite well even when the ideal system model (the gray Σt |t = (I − Kt Ht )Σt |t−1 (41) lines) is inaccurate and the measurements are noisy. The cyan bars in the right figure show the variance of the veloc- Figure 4d puts all this together. In the literature, this ity at different time steps. Although the initial variance is dataflow is referred to as Kalman filtering. Unlike inSec- quite large, application of Kalman filtering is able to reduce tions3 and4, the Kalman gain is not a dimensionless value it rapidly in few time steps. here. If Ht = I, the computations in Figure 4d reduce to those of Figure 4c as expected. 
6.5 Discussion Equation 40 shows that the a posteriori state estimate is a This section shows that Kalman filtering for state estimation linear combination of the a priori state estimate (bxt |t−1) and in linear systems can be derived from two elementary ideas: the measurement (zt ). The optimality of this linear unbiased optimal linear estimators for fusing uncorrelated estimates estimator is shown in the AppendixD. In Section 3.3, it was and best linear unbiased estimators for correlated variables. shown that incremental fusion of scalar estimates is optimal. This is a different approach to the subject than the standard The dataflow of Figures4(c,d) computes the a posteriori state presentations in the literature. One standard approach is to estimate at time t by incrementally fusing measurements use Bayesian inference. The other approach is to assume that from the previous time steps, and this incremental fusion the a posteriori state estimator is a linear combination of the can be shown to be optimal using a similar argument. form At bxt |t−1+Bt zt , and then find the values of At and Bt that produce an unbiased estimator with minimum MSE. We 6.4 Example: falling body believe that the advantage of the presentation given here is To demonstrate the effectiveness of the , we that it exposes the concepts and assumptions that underlie consider an example in which an object falls from the origin Kalman filtering. at time t=0 with an initial speed of 0 m/s and an expected Most presentations in the literature also begin by assuming 2 constant acceleration of 9.8 m/s due to gravity. Note that that the noise terms wt in the state evolution equation and vt acceleration in reality may not be constant due to factors in the measurement are Gaussian. Some presentations [1, 10] such as wind, air friction, and so on. use properties of Gaussians to derive the results in Sections3 The state vector of the object contains two components, although as we have seen, these results do not depend on one for the distance from the origin s(t) and one for the distributions being Gaussians. Gaussians however enter the velocity v(t). We assume that only the velocity state can be picture in a deeper way if one considers nonlinear estimators. measured at each time step. If time is discretized in steps of It can be shown that if the noise terms are not Gaussian, there 0.25 seconds, the difference equation for the dynamics of the may be nonlinear estimators whose MSE is lower than that 10 2500 300 150 Model-only Model-only Ground Truth Ground Truth Estimated 250 Estimated 125 2000 Measured Variance 200 100 1500

(a) Evolution of state: distance (m) versus time (s), comparing the model-only trajectory, the ground truth, and the estimated values.
(b) Evolution of state: velocity (m/s) versus time (s), comparing the model-only trajectory, the ground truth, the noisy measurements, the estimated values, and the variance of the velocity estimate.

Figure 6: Estimates of the object’s state over time. of the linear estimator presented in Figure 4d. However if h. This requires computing the following Jacobians4, which the noise is Gaussian, this linear estimator is as good as any play the role of Ft and Ht in Figure 4d. unbiased nonlinear estimator (that is, the linear estimator is a minimum variance unbiased estimator (MVUE)). This result ∂f ∂h Ft = Ht = is proved using the Cramer-Rao lower bound [25]. ∂x ∂x xbt−1|t −1,ut xbt |t−1 7 EXTENSION TO NONLINEAR SYSTEMS The EKF performs well in some applications such as navi- gation systems and GPS [29]. The (EKF) and Unscented Kalman Fil- ter (UKF) are heuristic approaches to using Kalman filtering UKF. When the system dynamics and observation models for nonlinear systems. The state evolution and measurement are highly nonlinear, the Unscented Kalman Filter (UKF) [15] equations for nonlinear systems with additive noise can be can be an improvement over the EKF. The UKF is based on written as follows; in these equations, f and h are nonlinear the unscented transformation, which is a method for comput- functions. ing the of a random variable x that undergoes a nonlinear transformation (y = д(x)). The random variable x

xt = f (xt−1, ut ) + wt (43) is sampled using a carefully chosen set of sigma points and these sample points are propagated through the nonlinear z = h(x ) + v (44) t t t function д. The statistics of y are estimated using a weighted sample mean and covariance of the posterior sigma points. Intuitively, the EKF constructs linear approximations to The UKF tends to be more robust and accurate than the EKF the nonlinear functions f and h and applies the Kalman but has higher computation overhead due to the sampling filter equations, while the UKF constructs approximations process. to probability distributions and propagates these through the nonlinear functions to construct approximations to the 8 CONCLUSION posterior distributions. In this paper, we have shown that two concepts - optimal EKF. Examining Figure 4d, we see that the a priori state linear estimators for fusing uncorrelated estimates and best estimate in the predictor can be computed using the system linear unbiased estimators for correlated variables - provide the underpinnings for Kalman filtering. By combining these model: bxt |t−1 = f (bxt−1|t−1, ut ). However, since the system dynamics and measurement equations are nonlinear, it is not ideas, standard results on Kalman filtering for linear systems clear how to compute the covariance matrices for the a priori can be derived in an intuitive and straightforward way that estimate and the measurement. In the EKF, these matrices is simpler than other presentations of this material in the are computed by linearizing Equations 43 and 44 using the 4The Jacobian matrix is the matrix of all first order partial derivatives ofa expansions for the nonlinear functions f and vector-valued function. 11 literature. This approach makes clear the assumptions that [16] Rudolph Emil Kalman. 1960. A New Approach to Linear Filtering underlie the optimality results associated with Kalman filter- and Prediction Problems. Transactions of the ASME–Journal of Basic ing, and should make it easier to apply Kalman filtering to 82, Series D (1960), 35–45. [17] Peter K Kitanidis. 1987. Unbiased Minimum-variance Linear State problems in computer systems. Estimation. Automatica 23, 6 (Nov. 1987), 775–778. https://doi.org/10. 1016/0005-1098(87)90037-9 ACKNOWLEDGMENTS [18] and Giogio Picci. 2017. Linear Stochastic Systems. Springer-Verlag. This research was supported by NSF grants 1337281, 1406355, [19] Peter S Maybeck. 1982. Stochastic models, estimation, and control. Vol. 3. and 1618425, and by DARPA contracts FA8750-16-2-0004 and Academic press. FA8650-15-C-7563. The authors would like to thank K. Mani [20] Jerry M Mendel. 1995. Lessons in for Signal Processing, Chandy (Caltech), Ivo Babuska (UT Austin) and Augusto Communications, and Control. Pearson Education. Ferrante (Padova) for their on this paper. [21] Kaushik Nagarajan, Nicholas Gans, and Roozbeh Jafari. 2011. Mod- eling Human Gait Using a Kalman Filter to Measure Walking Dis- tance. In Proceedings of the 2nd Conference on Wireless Health (WH ’11). REFERENCES ACM, New York, NY, USA, Article 34, 2 pages. https://doi.org/10.1145/ [1] Tim Babb. 2018. How a Kalman filter works, in pictures | Bzarg. https: 2077546.2077584 //www.bzarg.com/p/how-a-kalman-filter-works-in-pictures/. (2018). [22] Eduardo F. Nakamura, Antonio A. F. Loureiro, and Alejandro C. Frery. Accessed: 2018-11-30. 2007. Information Fusion for Wireless Sensor Networks: Methods, [2] A. V. Balakrishnan. 1987. Kalman Filtering Theory. Optimization Models, and Classifications. ACM Comput. Surv. 
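A compact sketch of one EKF predict/update cycle is given below. This is our illustration: the functions f and h implement Equations 43–44 for the particular system, and jac_f and jac_h are assumed to be user-supplied routines returning their Jacobians, which play the roles of the matrices F_t and H_t in Figure 4d.

```python
import numpy as np

def ekf_step(x, S, u, z, f, h, jac_f, jac_h, Q, R):
    """One predict/update cycle of the Extended Kalman Filter.
    f, h: nonlinear state-evolution and measurement functions (Eqs. 43-44);
    jac_f, jac_h: their Jacobians, evaluated at the current estimates."""
    # Predict: propagate the mean through f, linearize for the covariance.
    F = jac_f(x, u)
    x_pred = f(x, u)
    S_pred = F @ S @ F.T + Q
    # Update: linearize h around the a priori estimate and fuse as in Figure 4d.
    H = jac_h(x_pred)
    K = S_pred @ H.T @ np.linalg.inv(H @ S_pred @ H.T + R)
    x_post = x_pred + K @ (z - h(x_pred))
    S_post = (np.eye(len(x)) - K @ H) @ S_pred
    return x_post, S_post
```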
39, 3, Article 9 (Sept. Software, Inc., Los Angeles, CA, USA. 2007). https://doi.org/10.1145/1267070.1267073 [3] Allen L. Barker, Donald E. Brown, and Worthy N. Martin. 1994. [23] Kaare Brandt Petersen and Michael Syskind Pedersen. 2012. The Matrix Bayesian Estimation and the Kalman Filter. Technical Report. Char- Cookbook. http://www2.imm.dtu.dk/pubdb/views/publication_details. lottesville, VA, USA. php?id=3274. (Nov. 2012). Version 20121115. [4] Alex Becker. 2018. Kalman Filter Overview. https://www.kalmanfilter. [24] Raghavendra Pradyumna Pothukuchi, Amin Ansari, Petros Voulgaris, net/default.aspx. (2018). Accessed: 2018-11-08. and Josep Torrellas. 2016. Using Multiple Input, Multiple Output [5] K. Bergman. 2009. Nanophotonic Interconnection Networks in Multi- Formal Control to Maximize Resource Efficiency in Architectures. In core Embedded Computing. In 2009 IEEE/LEOS Winter Topicals Meeting Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual Interna- Series. 6–7. https://doi.org/10.1109/LEOSWT.2009.4771628 tional Symposium on. IEEE, 658–670. [6] Liyu Cao and Howard M. Schwartz. 2004. Analysis of the Kalman [25] C.R. Rao. 1945. Information and the Accuracy Attainable in the Esti- Filter Based Estimation : An Orthogonal Decomposition mation of Statistical Parameters. Bulletin of the Calcutta Mathematical Approach. Automatica 40, 1 (Jan. 2004), 5–19. https://doi.org/10.1016/ Society 37 (1945), 81–89. j.automatica.2003.07.011 [26] Matthew B. Rhudy, Roger A. Salguero, and Keaton Holappa. 2017. A [7] Charles K. Chui and Guanrong Chen. 2017. Kalman Filtering: With Kalman Filtering Tutorial for Undergraduate Students. International Real-Time Applications (5th ed.). Springer Publishing Company, Incor- Journal of Computer Science & Engineering Survey 8, 1 (Feb. 2017). porated. [27] Sailes K. Sengupta. 1995. Fundamentals of Statistical Signal Processing: [8] R.L. Eubank. 2005. A Kalman Filter Primer (Statistics: Textbooks and Estimation Theory. Technometrics 37, 4 (1995), 465–466. https://doi. Monographs). Chapman & Hall/CRC. org/10.1080/00401706.1995.10484391 [9] Geir Evensen. 2006. : The . [28] Éfren L. Souza, Eduardo F. Nakamura, and Richard W. Pazzi. 2016. Springer-Verlag New York, Inc., Secaucus, NJ, USA. Target Tracking for Sensor Networks: A Survey. ACM Comput. Surv. [10] Rodney Faragher. 2012. Understanding the Basis of the Kalman Filter 49, 2, Article 30 (June 2016), 31 pages. https://doi.org/10.1145/2938639 Via a Simple and Intuitive Derivation. IEEE Signal Processing Magazine [29] Sebastian Thrun, Wolfram Burgard, and Dieter Fox. 2005. Probabilistic 29, 5 (September 2012), 128–132. Robotics (Intelligent Robotics and Autonomous Agents). The MIT Press. [11] Mohinder S. Grewal and Angus P. Andrews. 2014. Kalman Filtering: [30] Greg Welch and Gary Bishop. 1995. An Introduction to the Kalman Theory and Practice with MATLAB (4th ed.). Wiley-IEEE Press. Filter. Technical Report. Chapel Hill, NC, USA. [12] Anne-Kathrin Hess and Anders Rantzer. 2010. Distributed Kalman [31] Yan Pei, Swarnendu Biswas, Donald S. Fussell, and Keshav Pingali. Filter Algorithms for Self-localization of Mobile Devices. In Proceedings 2017. An Elementary Introduction to Kalman Filtering. ArXiv e-prints of the 13th ACM International Conference on Hybrid Systems: Compu- (Oct. 2017). arXiv:1710.04055 tation and Control (HSCC ’10). ACM, New York, NY, USA, 191–200. https://doi.org/10.1145/1755952.1755980 [13] Connor Imes and Henry Hoffmann. 2016. 
Bard: A Unified Framework for Managing Soft Timing and Power Constraints. In International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS). [14] C. Imes, D. H. K. Kim, M. Maggio, and H. Hoffmann. 2015. POET: A Portable Approach to Minimizing Energy Under Soft Real-time Constraints. In 21st IEEE Real-Time and Embedded Technology and Applications Symposium. 75–86. https://doi.org/10.1109/RTAS.2015. 7108419 [15] Simon J. Julier and Jeffrey K. Uhlmann. 2004. Unscented Filtering and Nonlinear Estimation. Proc. IEEE 92, 3 (2004), 401–422. https: //doi.org/10.1109/JPROC.2003.823141 12 Appendices x2 : [−1, 1] that are the projections of u on the x and y axes respectively. Given a value for x2, there are only two possible A BASIC PROBABILITY THEORY AND values for x1, so x1 and x2 are not independent. However, the mean of these values is 0, which is the mean of x , so x and STATISTICS TERMINOLOGY 1 1 x2 are not correlated. Probability density function. For a continuous random vari- able x, a probability density function (pdf) is a function p(x) B MATRIX DERIVATIVES whose value provides a relative likelihood that the value of If f (X) is a scalar function of a matrix X, the matrix deriva- the random variable will equal x. The integral of the pdf ∂f (X ) within a range of values is the probability that the random tive ∂X is defined as the matrix variable will take a value within that range. ∂f (X ) ∂f (X ) ... If д(x) is a function of x with pdf p(x), the © ∂X (1,1) ∂X (1,n) ª ­ ...... ® or expectation of д(x) is E[д(x)], defined as the following ­ ® ∂f (X ) ... ∂f (X ) integral: « ∂X (n,1) ∂X (n,n) ¬ ∫ ∞ E[д(x)] = д(x)p(x)dx Lemma B.1. Let X be a m × n matrix, a be a m × 1 vector, −∞ b be a n × 1 vector. By definition, the mean µx of a random variable x is E[x]. The variance of a random variable x measures the variability ∂aT Xb of the distribution. For the set of possible values of x, variance = abT (45) 2 2 2 ∂X (denoted by σx ) is defined by σx = E[(x −µx ) ]. The variance T T of a continuous random variable x can be written as the ∂(a X Xb) T = X(ab + baT ) (46) following integral: ∂X ∫ ∞ ∂(trace(XBXT )) 2 2 = XBT + XB (47) σx = (x − µ) p(x)dx ∂X −∞ If x is discrete and all outcomes are equally likely, then See Petersen and Pedersen for a proof [23]. 2 2 Σ(xi −µx ) σx = n . The standard deviation σx is the of the variance. C PROOF OF THEOREM 4.1 Covariance. The covariance of two random variables is a Theorem 4.1, which is reproduced below for convenience, measure of their joint variability. The covariance between can be proved using matrix derivatives. random variables x : p ∼(µ , σ 2) and x : p ∼(µ , σ 2) is the 1 1 1 1 2 2 2 2 Theorem. Let pairwise uncorrelated estimates x (1≤i≤n) expectation E[(x −µ )∗(x −µ )]. Two random variables are i 1 1 2 2 drawn from distributions p (x)=(µ , Σ ) be fused using the lin- uncorrelated or not correlated if their covariance is zero. This i i i ear model y (x ,.., x ) = Ín A x , where Ín A = I. The is not the same concept as independence of random variables. A 1 n i=1 i i i=1 i MSE(y ) is minimized for Two random variables are independent if knowing the A value of one of the variables does not give us any information n about the possible values of the other one. This is written Õ −1 −1 −1 Ai = ( Σj ) Σi . formally as p(x1|x2) = p(x1); intuitively, knowing the value j=1 of x2 does not change the probability thatp1 takes a particular value. Proof. 
To use the Lagrange multiplier approach, we can Ín 2 Independent random variables are uncorrelated but ran- convert the constraint i=1 Ai = I into a set of m scalar dom variables can be uncorrelated even if they are not in- equations (for example, the first equation would be A1(1, 1)+ 2 dependent. It can be shown that if x1 and x2 are not corre- A2(1, 1) + .. + An(1, 1) = 1), and then introduce m Lagrange lated, E[x1|x2]=E[x1]; intuitively, knowing the value of x2 multipliers, which can denoted by λ(1, 1),...λ(m,m). may change the probability that x1 takes a particular value, This obscures the matrix structure of the problem so it is but the mean of the resulting distribution remains the same better to implement this idea implicitly. Let Λ be an m×m as the mean of x1. A special case of this that is easy to un- matrix in which each entry is one of the scalar Lagrange derstand are examples in which knowing x2 restricts the multipliers we would have introduced in the approach de- possible values of x1 without changing the mean. Consider a scribed above. Analogous to the inner product of vectors, we random variable u : U that is uniformly distributed over the can define the inner product of two matrices as = T Ím Ím unit circle, and consider random variables x1 : [−1, 1] and trace(A B) (it is easy to see that is i=1 j=1 A(i, j)B(i, j)). 13 Using this notation, we can formulate the optimization prob- (b) We prove that the estimator in Equation 48 is unbiased if lem using Lagrange multipliers as follows: At =(I−Bt ) ∗ Ht . n  Õ T T f (A1,..., An) = E (xi − µ i ) Ai Ai (xi − µ i ) i=1 E[y ] = E[At ∗ y + Bt ∗ zt ] (From Equation 48) n bt |t bt |t−1 Õ  ∗ [ ] ∗ [ ] + Λ, Ai − I = At E byt |t−1 + Bt E zt i=1 = At ∗ E[xt ] + Bt ∗ E[Ht xt + vt ] (Equation 37 for zt ) Taking the matrix derivative of f with respect to each Ai = At ∗ E[xt ] + Bt ∗ Ht ∗ E[xt ] (Because vt is zero mean) and setting each derivative to zero to find the optimal values  T = (At + Bt ∗ Ht ) ∗ E[xt ] of Ai gives us the equation E 2Ai (xi − µ i )(xi − µ i ) + Λ = 0. This equation can be written as 2Ai Σi + Λ = 0, which im- plies The estimator is unbiased if (At + Bt ∗ Ht ) = I, which is Λ equivalent to requiring that At = (I − Bt ∗ Ht ). Therefore A1Σ1 = A2Σ2 = ... = An Σn = − the general unbiased linear estimator is of the form 2 Using the constraint that the sum of all Ai equals to the identity matrix I gives us the desired expression for Ai : n yt |t = (I − Bt ∗ Ht ) ∗ yt |t−1 + Bt ∗ zt Õ −1 −1 −1 b b Ai = ( Σj ) Σi = yt |t−1 + Bt ∗ (zt − Ht ∗ yt |t−1) (49) j=1 b b □

D PROOF OF THE OPTIMALITY OF Since Equation 40 is of this form, it is an unbiased linear EQUATION 40 estimator. We show that (bxt |t = bxt |t−1 + Kt (zt − Ht bxt |t−1)) (Equation 40) is an optimal unbiased linear estimator for fusing the a priori Optimality: We now show that using Bt = Kt at each step state estimate with the measurement at each step. The proof is optimal, assuming that this is done at all time steps before t. has two steps: we show that this estimator is unbiased, and Since zt and yt |t−1 are uncorrelated, we can use Lemma 2.2 to then show it is optimal. compute the covariance matrix of byt |t , denoted by Ξt |t . This T T Unbiased condition: We prove a more general result that gives Ξt |t = (I − Bt Ht )Σt |t−1(I − Bt Ht ) + Bt Rt Bt . The MSE characterizes unbiased linear estimators for this problem, is the trace of this matrix, and we need to find Bt that mini- assuming that the prediction stage (Figure4(d)) is unchanged. mizes this trace. Using matrix derivatives (Equation 47), we The general form of the linear estimator for computing the see that a posteriori state estimate is ∗ ∗ byt |t = At byt |t−1 + Bt zt (48) ∂(trace(Ξt |t )) = − ( − ) T + ∗ It is unbiased if E[y ]=E[xt ], and we show that this is true 2 I Bt Ht Σt |t−1Ht 2 Bt Rt bt |t ∂Bt if At =(I−Bt ) ∗ Ht . [ ] [ ] The proof is by induction on t. By assumption, E by0|0 = E x0 . Assume inductively that E[yt−1|t−1]=E[xt−1]. b Setting this zero and solving for Bt gives (a) We first prove that the predictor is unbiased. T T −1 Bt = Σt |t−1Ht [Ht Σt |t−1Ht + Rt ] . This is exactly Kt , prov- ∗ ∗ byt |t−1 = Ft byt−1|t−1 + Bt ut (Predictor in Figure4) ing that Equation 40 is an optimal unbiased linear estimator. [ ] ∗ [ ] ∗ E byt |t−1 = Ft E byt−1|t−1 + Bt ut = Ft ∗ E[xt−1] + Bt ∗ut (By inductive assumption) Comment: This proof of optimality provides another way of deriving Equation 40. We believe the constructive ap- = E[Ft ∗ xt−1 + Bt ∗ut ] proach described in Section6 provides more insight into how = E[F ∗ x + B ∗u + w ] (w is zero mean) t t−1 t t t t and why Kalman filtering works. The ideas in that construc- = E[xt ] (From state evolution equation 34) tion can be used to provide a different proof of optimality.
