Bayesian Inference via Approximation of Log-likelihood for Priors in Exponential Family

Tohid Ardeshiri, Umut Orguner, and Fredrik Gustafsson

Abstract—In this paper, a Bayesian inference technique based on Taylor series approximation of the logarithm of the likelihood function is presented. The proposed approximation is devised for the case where the prior distribution belongs to the exponential family of distributions. The logarithm of the likelihood function is linearized with respect to the sufficient statistic of the prior distribution in the exponential family such that the posterior obtains the same exponential family form as the prior. Similarities between the proposed method and the extended Kalman filter for nonlinear filtering are illustrated. Furthermore, an extended target measurement update is derived for target models where the target extent is represented by a random matrix having an inverse Wishart distribution. The approximate update covers the important case where the spread of the measurements is due to the target extent as well as the measurement noise in the sensor.

Index Terms—Approximate Bayesian inference, exponential family, Bayesian graphical models, extended Kalman filter, extended target tracking, group target tracking, random matrices, inverse Wishart.

I. INTRODUCTION

Determination of the posterior distribution of a latent variable x given the measurements (observed data) y is at the core of Bayesian inference using probabilistic models. These probabilistic models describe the relation between the random latent variables, the deterministic parameters, and the measurements. Such relations are specified by prior distributions of the latent variables p(x) and the likelihood function p(y|x), which gives a probabilistic description of the measurements given (some of) the latent variables. Using the probabilistic model and the measurements, the exact posterior can be expressed in functional form using the Bayes rule

  p(x|y) = p(x) p(y|x) / ∫ p(x) p(y|x) dx.    (1)

The exact posterior distribution can be analytical. A subclass of cases where the posterior is analytical is when the posterior belongs to the same family of distributions as the prior distribution. In such cases, the prior distribution is called a conjugate prior for the likelihood function. A well-known example where an analytical posterior is obtained using conjugate priors is when the latent variable is a priori normal-distributed and the likelihood function given the latent variable as its mean is again normal.

The conjugacy is especially useful when measurements are processed sequentially, as they appear in the filtering task for stochastic dynamical systems known as hidden Markov models (HMMs), whose probabilistic graphical model is presented in Fig. 1. In such filtering problems, the posterior after processing the last measurement is in the same form as the prior distribution before the measurement update. Thus, the same inference algorithm can be used in a recursive manner.

Figure 1: A probabilistic graphical model for a stochastic dynamical system with latent states x_1, x_2, x_3, ..., x_T and measurements y_1, y_2, y_3, ..., y_T.

The exact posterior distribution of a latent variable cannot always be given a compact analytical expression since the integral in the denominator of (1) may not be available in analytical form. Consequently, the number of parameters needed to express the posterior distribution will increase every time the Bayes rule (1) is used for inference. Several methods for approximate inference over probabilistic models have been proposed in the literature, such as variational Bayes [1], [2], expectation propagation [3], integrated nested Laplace approximation (INLA) [4], generalized linear models (GLMs) [5] and Monte-Carlo (MC) sampling methods [6], [7].

Variational Bayes (VB) and expectation propagation (EP) are two analytical optimization-based solutions for approximate Bayesian inference [8]. In these two approaches, the Kullback-Leibler divergence [9] between the true posterior distribution and an approximate posterior is minimized. INLA is a technique to perform approximate Bayesian inference in latent Gaussian models [10] using the Laplace approximation. GLMs are an extension of ordinary linear regression where the errors belong to the exponential family.

Sampling methods such as particle filters and Markov Chain Monte Carlo (MCMC) methods provide a general class of numerical solutions to the approximate Bayesian inference problem. However, in this paper, our focus is on fast analytical approximations which are applicable to large-scale inference problems. The proposed solution is built on properties of the exponential family of distributions and earlier work on the extended Kalman filter [11].

In this paper, a Bayesian inference technique based on Taylor series approximation of the logarithm of the likelihood function is presented. The proposed approximation is derived for the case where the prior distribution belongs to the exponential family of distributions. The rest of this paper is organized as follows: In Section II an introduction to the exponential family of distributions is provided. In Section III a general algorithm for approximate inference in the exponential family of distributions is suggested. We show how the proposed technique is related to the extended Kalman filter for nonlinear filtering in Section III-B. A novel measurement update for tracking extended targets using random matrices is given in Section IV. The new measurement update for tracking extended targets is evaluated in numerical simulations in Section V. The concluding remarks are given in Section VI.

Tohid Ardeshiri and Fredrik Gustafsson are with the Department of Electrical Engineering, Linköping University, 58183 Linköping, Sweden (e-mail: [email protected], [email protected]). The authors gratefully acknowledge funding from the Swedish Research Council (VR) for the project Scalable Kalman Filters. Umut Orguner is with the Department of Electrical and Electronics Engineering, Middle East Technical University, 06531 Ankara, Turkey (e-mail: [email protected]).

II. THE EXPONENTIAL FAMILY

The exponential family of distributions [8] includes many common distributions such as the Gaussian, beta, gamma and Wishart. For x ∈ X the exponential family in its natural form can be represented by

  p(x; η) = h(x) exp(η · T(x) − A(η)),    (2)

where η is the natural parameter, T(x) is the sufficient statistic, A(η) is the log-partition function and h(x) is the base measure. η and T(x) may be vector-valued. Here a · b denotes the inner product of a and b.¹ The log-partition function is defined by the integral

  A(η) ≜ log ∫_X h(x) exp(η · T(x)) dx.    (3)

Also, η ∈ Ω ≜ {η ∈ R^m | A(η) < +∞}, where Ω is the natural parameter space. Moreover, Ω is a convex set and A(·) is a convex function on Ω.

¹In the rest of this paper we may use a^T b in place of a · b when a and b are vector valued. When a and b are matrix valued we may equivalently write Tr(a^T b) instead of vec(a) · vec(b), where Tr(·) is the trace operator and the operator vec(·) vectorizes its argument (lays it out in a vector).

When a probability density function (PDF) in the exponential family is parametrized by a non-canonical parameter θ, the PDF in (2) can be written as

  p(x; θ) = h(x) exp(η(θ) · T(x) − A(η(θ))),    (4)

where θ is the non-canonical parameter vector of the PDF.

Example 1. Consider the normal distribution p(x) = N(x; µ, Σ) with mean µ ∈ R^d and covariance Σ. Its PDF can be written in the exponential family form given in (2) where

  h(x) = (2π)^{−d/2},    (5a)
  T(x) = (x, vec(xx^T)),    (5b)
  η(µ, Σ) = (Σ^{−1}µ, vec(−½Σ^{−1})),    (5c)
  A(η(µ, Σ)) = ½ µ^T Σ^{−1} µ + ½ log |Σ|.    (5d)  ∎

Remark 1. In the rest of this document, the vec(·) operators will be dropped to keep the notation less cluttered. That is, we will use the simplified notation T(x) = (x, xx^T) instead of (5b) and η(µ, Σ) = (Σ^{−1}µ, −½Σ^{−1}) instead of (5c).

A. Conjugate Priors in Exponential Family

In this section we study the conjugate prior family for a likelihood function belonging to the exponential family. Consider m independent and identically distributed (IID) measurements Y ≜ {y^j ∈ R^d | 1 ≤ j ≤ m} and a likelihood function p(Y|x, λ) in the exponential family, where λ is a set of hyper-parameters of the likelihood and x is the latent variable. The likelihood of the m IID measurements can be written as

  p(Y|x, λ) = (∏_{j=1}^m h(y^j)) exp(η(x, λ) · ∑_{j=1}^m T(y^j) − m A(η(x, λ))).    (6)

We seek a conjugate prior distribution for x denoted by p(x). The prior distribution p(x) is a conjugate prior if the posterior distribution

  p(x|Y) ∝ p(Y|x, λ) p(x)    (7)

is also in the same exponential family as the prior. Let us assume that the prior is in the form

  p(x) ∝ exp(F(x))    (8)

for some function F(·). Hence,

  p(x|Y) ∝ p(Y|x, λ) exp(F(x))    (9)
         ∝ exp(η(x, λ) · ∑_{j=1}^m T(y^j) − m A(η(x, λ)) + F(x)).    (10)

For the posterior (10) to be in the same exponential family as the prior [12], F(·) needs to be in the form

  F(x) = ρ_1 · η(x, λ) − ρ_0 A(η(x, λ))    (11)

for some ρ ≜ (ρ_0, ρ_1), such that

  p(x|Y) ∝ exp((ρ_1 + ∑_{j=1}^m T(y^j)) · η(x, λ) − (m + ρ_0) A(η(x, λ))).    (12)

Hence, the conjugate prior for the likelihood (6) is parametrized by ρ and is given by

  p(x; ρ) = (1/Z) exp(ρ_1 · η(x, λ) − ρ_0 A(η(x, λ))),    (13)

where

  Z = ∫ exp(ρ_1 · η(x, λ) − ρ_0 A(η(x, λ))) dx.    (14)

In Example 2, the conjugate prior for the normal likelihood is derived using the exponential family form for the normal distribution given in Example 1. In Example 3, a conjugate prior for a more complicated likelihood function is derived.

Example 2. Consider the measurement y with the likelihood p(y|x) = N(y; Cx, R). The likelihood can be written in exponential family form as in

  p(y|x) = (2π)^{−d/2} exp(η(Cx, R) · T(y) − A(η(Cx, R)))    (15)

where T(y) = (y, yy^T), η(Cx, R) = (R^{−1}Cx, −½R^{−1}) and

  A(η(Cx, R)) = ½ x^T C^T R^{−1} C x + ½ log |R|    (16a)
             = ½ C^T R^{−1} C · xx^T + ½ log |R|.    (16b)

Therefore, the likelihood function can be written in the form

  p(y|x) ∝ exp((C^T R^{−1} y, −½ C^T R^{−1} C) · T(x)).    (17)

Hence, the conjugate prior for the likelihood function of this example should have T(·) = (x, xx^T) as its sufficient statistic, i.e., the conjugate prior is normal. When the prior distribution is normal, i.e., p(x) = N(x; µ, Σ), the posterior distribution obeys

  p(x|y) ∝ exp((C^T R^{−1} y + Σ^{−1}µ, −½ C^T R^{−1} C − ½ Σ^{−1}) · T(x)),    (18)

which is the information filter form of the Kalman filter's measurement update [13]. ∎
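The conjugate update in Example 2 can be checked numerically. Below is a minimal sketch (Python/NumPy; the matrices and measurement are illustrative choices, not values from the paper) that adds the likelihood statistic of (17) to the prior's natural parameters, per (18), and verifies the result against the standard Kalman measurement update.

```python
import numpy as np

# Sketch of Example 2: conjugate (normal) prior updated by natural-parameter
# addition, cf. (17)-(18). All numbers below are illustrative choices.
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
Lam = np.linalg.inv(Sigma)                # prior information matrix
eta1, eta2 = Lam @ mu, -0.5 * Lam         # prior natural parameters

C = np.array([[1.0, 0.0], [1.0, 1.0]])
R = 0.5 * np.eye(2)
Rinv = np.linalg.inv(R)
y = np.array([0.7, 0.2])

eta1_post = eta1 + C.T @ Rinv @ y         # add likelihood statistic, cf. (18)
eta2_post = eta2 - 0.5 * C.T @ Rinv @ C

Sigma_post = np.linalg.inv(-2.0 * eta2_post)   # back to moment form
mu_post = Sigma_post @ eta1_post

# Sanity check against the standard Kalman measurement update
S = C @ Sigma @ C.T + R
K = Sigma @ C.T @ np.linalg.inv(S)
assert np.allclose(mu_post, mu + K @ (y - C @ mu))
assert np.allclose(Sigma_post, Sigma - K @ S @ K.T)
print(mu_post, Sigma_post)
```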

Example 3. Consider the measurement y with the likelihood

  p(y|x) ∝ exp(−(1/24)(y − x)² + cos(y − x)).    (19)

This likelihood function is multi-modal and is illustrated in Fig. 2 for y = 3. The likelihood can be written in the form

  p(y|x) ∝ exp(λ · T(x))    (20)

where λ = [−1/24, y/12, cos y, sin y]^T and T(x) = [x², x, cos x, sin x]^T. Hence, a conjugate prior for the likelihood function of this example should have T(·) as its sufficient statistic, such as p(x) ∝ exp([−1/10, 0, 0, 0] · T(x)). Thus, the posterior for this likelihood function and prior distribution obeys

  p(x|y) ∝ exp([−1/24 − 1/10, y/12, cos y, sin y]^T · T(x)).    (21)

Although such posteriors can be computed analytically up to a normalization factor, the density may not be available in analytical form. The numerically normalized prior and posterior are illustrated in Fig. 2. ∎

Figure 2: The likelihood function (19)–(20), the prior distribution, and the posterior distribution (21) in Example 3, plotted over x ∈ [−10, 15].

Similar reasoning as the one presented in this section can be applied to the case where the prior is fixed and the likelihood function is to be selected; this concept of "conjugate likelihoods" will be considered in the sequel.

Table I: Some continuous exponential family distributions and their sufficient statistics.

Continuous Exp. Family Distribution           | T(·)
Normal distribution with known variance σ²    | x/σ
Normal distribution                           | (x, xx^T)
Pareto distribution with known minimum x_m    | log x
Weibull distribution with known shape k       | x^k
Chi-squared distribution                      | log x
Dirichlet distribution                        | (log x_1, ..., log x_n)
Laplace distribution with known mean µ        | |x − µ|
Inverse Gaussian distribution                 | (x, 1/x)
Scaled inverse Chi-squared distribution       | (log x, 1/x)
Beta distribution                             | (log x, log(1 − x))
Lognormal distribution                        | (log x, (log x)²)
Gamma distribution                            | (log x, x)
Inverse gamma distribution                    | (log x, 1/x)
Gaussian Gamma distribution                   | (log τ, τ, τx, τx²)
Wishart distribution                          | (log |X|, X)
Inverse Wishart distribution                  | (log |X|, X^{−1})
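Example 3's posterior (21) has no standard closed form, but it is cheap to normalize numerically. A minimal sketch (Python/NumPy; the grid resolution is an illustrative choice, the prior weight follows the example):

```python
import numpy as np

# Sketch of Example 3: posterior natural parameter = prior + likelihood
# statistic, normalized on a grid since no closed-form partition function
# is available for this family.

def T(x):
    """Sufficient statistic T(x) = (x^2, x, cos x, sin x)."""
    return np.stack([x**2, x, np.cos(x), np.sin(x)])

y = 3.0
lam_likelihood = np.array([-1/24, y/12, np.cos(y), np.sin(y)])  # lambda(y), cf. (20)
lam_prior = np.array([-1/10, 0.0, 0.0, 0.0])                    # prior of Example 3

x = np.linspace(-10, 15, 4001)
dx = x[1] - x[0]

log_post = (lam_prior + lam_likelihood) @ T(x)   # cf. (21), up to a constant
post = np.exp(log_post - log_post.max())         # stabilize before normalizing
post /= post.sum() * dx                          # numerical normalization

print("posterior mean ~", (x * post).sum() * dx)
```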

B. Conjugate Likelihoods in Exponential Family

First, we define the conjugate likelihood functions.

Definition 1. A likelihood function p(y|x) is called a conjugate likelihood for a prior distribution p(x) in a family of distributions if the posterior distribution p(x|y) belongs to the same family of distributions as the prior distribution.

Now, consider the prior distribution p(x; ρ) in the exponential family form on the latent variable x ∈ X, where ρ is a set of hyper-parameters for the prior:

  p(x; ρ) = h(x) exp(η(ρ) · T(x) − A(η(ρ))).    (22)

We will follow the treatment given for the characterization of the conjugate priors in Section II-A, as well as [12], for the conjugate likelihood functions. Assume now that we have observed m IID measurements Y ≜ {y^j ∈ R^d | 1 ≤ j ≤ m}. Let the likelihood of the measurements be written as

  p(Y|x) ∝ exp(L(x, Y))    (23)

for some function L(·). We seek a conjugate likelihood function in the exponential family. Hence, the posterior distribution has to be in the exponential family, where

  p(x|Y) ∝ h(x) exp(η(ρ) · T(x) − A(η(ρ)) + L(x, Y)).    (24)

For the posterior (24) to be in the same exponential family, L(·) needs to be in the form

  L(x, Y) =⁺ λ(Y) · T(x)    (25)

where =⁺ means equality up to an additive constant with respect to the latent variable x, such that

  p(x|Y) ∝ h(x) exp((η(ρ) + λ(Y)) · T(x)).    (26)

Hence, the conjugate likelihood for the prior distribution family (22) is parametrized by λ(Y) and is given by

  p(Y|x) ∝ exp(λ(Y) · T(x)).    (27)

The property expressed in (27) is the necessary condition which should hold for a conjugate likelihood for a given prior distribution family. Another condition which should hold is that the log-partition function of the posterior (26) should obey

  A = log ∫_X h(x) exp((η(ρ) + λ(Y)) · T(x)) dx < ∞    (28)

such that the natural parameter of the posterior belongs to the natural parameter space Ω. These properties of the conjugate likelihood are summarized in Theorem 1.

Theorem 1. The likelihood function p(y|x) is a conjugate likelihood for the prior distribution in an exponential family p(x) = h(x) exp(η · T(x) − A(η)) with x ∈ X if and only if
1) p(y|x) ∝ exp(λ · T(x)) for some λ, and
2) the likelihood function p(y|x) is integrable with respect to y, and
3) log ∫_X h(x) exp((η + λ) · T(x)) dx < ∞ for all η ∈ Ω.  ∎

In Examples 4 and 5, the conjugate likelihood functions for two prior distributions are derived using the exponential family form.

Example 4. Consider the normal distribution p(x) = N(x; µ, Σ) whose PDF is written in exponential family form in Example 1. The conjugate likelihood function has to be in the form

  p(y|x) ∝ exp(λ_1(y) · x + λ_2(y) · vec(xx^T))    (29)

for some permissible λ(·) ≜ [λ_1, λ_2], which includes p(y|x) = N(y; Cx, R) for a matrix C with appropriate dimensions and symmetric R ≻ 0. ∎

Example 5. Consider the prior distribution

  p(x) ∝ exp(λ · T(x))    (30)

where T(x) = [x², x, cos x, sin x]^T. The family of conjugate likelihood density functions p(y|x) is parametrized by ψ ∈ R⁴ such that
1) log p(y|x) =⁺ ψ · T(x),
2) ∫_{−∞}^{∞} p(y|x) dy = 1, i.e., the likelihood is integrable with respect to y,
3) ∫_{−∞}^{∞} exp((λ + ψ) · T(x)) dx < ∞, i.e., the posterior is integrable with respect to x.

A member of this family is

  p(y|x) ∝ exp(−α²(y − x)² + cos(βy − x + γ))    (31)

for [α, β, γ]^T ∈ R³. ∎

In Table I the sufficient statistics of some continuous members of the exponential family are given. Some members of the continuous exponential family have conjugate likelihoods, which are summarized in Table II.

III. MEASUREMENT UPDATE VIA APPROXIMATION OF THE LOG-LIKELIHOOD

In this section a method is proposed for the analytical approximation of complex likelihood functions based on linearization of the log-likelihood with respect to the sufficient statistic function T(·). We will use the defining property of conjugate likelihood functions, i.e., linearity of the log-likelihood function with respect to the sufficient statistic of the prior distribution, to come up with an approximation of the likelihood function for which there exists an analytical posterior distribution.

In the proposed method, first the log-likelihood function is derived in analytical form. Second, the likelihood function is approximated by a dot product of a statistic which depends on the measurement and the sufficient statistic of the prior distribution. For example, consider a likelihood function

  p(Y|x) ∝ exp(L(x, Y))    (32)

where the log-likelihood function is given by L(·) and the prior distribution is given by

  p(x) = h(x) exp(η · T(x) − A(η))    (33)

where η ∈ Ω. If we approximate L(·) such that

  L(x, Y) ≈⁺ L̂(x, Y) = λ(Y) · T(x)    (34)

for some λ(·) such that for any measurement Y belonging to the support of the likelihood function

  η + λ(Y) ∈ Ω    (35)

and

  ∫ exp(L̂(x, Y)) dY < ∞,    (36)

then the approximate likelihood will be a conjugate likelihood for the prior distribution with sufficient statistic T(x).

When the prior distribution is normal, the proposed approximation can be obtained using the well-known Taylor series approximation at the global maximum of the log-likelihood function. This is due to the fact that T(x) = (x, xx^T) for the normal prior, and a second order Taylor series approximation of the log-likelihood approximates the log-likelihood with a function linear in T(x). This mathematical convenience is used in the well-known approximate Bayesian inference technique INLA [4]. Similarly, when the prior distribution is an exponential distribution, a first order Taylor series approximation can be used to approximate the log-likelihood with a function linear in T(x) = x.

Taylor series expansions of functions will be used in the proposed approximations in the following. Hence, we establish the notation used in this paper for Taylor series expansion of arbitrary functions before we proceed.

A. Taylor series expansion

The Taylor series expansion of a real-valued function f(x) that is infinitely differentiable at a real vector x̂ is given by the series

  f(x) = f(x̂) + (1/1!) ∇_x f(x̂) · (x − x̂) + (1/2!) ∇²_x f(x̂) · (x − x̂)(x − x̂)^T + ···    (37)

where ∇_x f(x̂) is the gradient vector, composed of the partial derivatives of f(x) with respect to the elements of x, evaluated at x̂. Similarly, ∇²_x f(x̂) is the Hessian matrix of f(x), containing the second order partial derivatives and evaluated at x̂.

When f(x) is a vector-valued function, i.e., f(x) ∈ R^d with ith element denoted by f_i(x), first the Taylor series expansion is derived for each element separately. Then, the expansion of f(x) at x̂ is constructed by rearranging the terms in a matrix:

  f(x) = f(x̂) + (1/1!) (∇_x f(x̂))^T (x − x̂) + (1/2!) ∑_{i=1}^d e_i ∇²_x f_i(x̂) · (x − x̂)(x − x̂)^T + ···    (38)

In (38), ∇_x f(x̂) is the Jacobian matrix of f(x) evaluated at x̂ and e_i is a unity vector with its ith element equal to 1.
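As noted above, for a normal prior a second order Taylor series expansion of the log-likelihood is linear in T(x) = (x, xx^T). The sketch below applies this to an assumed Poisson-count log-likelihood L(x) = yx − eˣ (this likelihood is our illustration, not an example from the paper):

```python
import numpy as np

# Sketch: normal prior N(x; mu, sig2) with log-likelihood L(x) = y*x - exp(x)
# (a Poisson count with log-intensity x; an assumed example). A second order
# Taylor expansion of L about xhat is linear in T(x) = (x, x^2), cf. (34),
# so the approximate posterior stays normal.
mu, sig2 = 0.0, 1.0
y = 4.0
xhat = mu                      # linearization point

L1 = y - np.exp(xhat)          # dL/dx at xhat
L2 = -np.exp(xhat)             # d2L/dx2 at xhat

# lambda(y) multiplying (x, x^2): from L(x) ~ (L1 - L2*xhat)*x + (L2/2)*x^2
lam = np.array([L1 - L2 * xhat, L2 / 2.0])

# add to the prior natural parameter (mu/sig2, -1/(2*sig2))
eta = np.array([mu / sig2, -1.0 / (2.0 * sig2)]) + lam
sig2_post = -1.0 / (2.0 * eta[1])
mu_post = sig2_post * eta[0]
print(mu_post, sig2_post)
```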

Table II: Some prior distributions and their conjugate likelihood functions. The arguments of the PDFs in the left column are the latent random variables. In the right column the conjugate likelihood functions p(y|·) are given, where y denotes the measurement. In the middle column the sufficient statistic functions of the prior distributions are given. The logarithm of each conjugate likelihood function is linear in the sufficient statistic function to its left.

Prior distribution                   | T(·)                 | Conjugate likelihood function
N(x; µ, Σ)                           | (x, xx^T)            | N(y; Cx, R) ∝ exp(−½ Tr(R^{−1}(y − Cx)(y − Cx)^T))
N(x; µ, Σ)                           | (x, xx^T)            | log-N(y; x, σ²) ∝ exp(−(log(y) − x)²/(2σ²))
Gamma(x; α, β)                       | (log x, x)           | Exp(y; x) = x exp(−xy)
Gamma(x; α, β)                       | (log x, x)           | IGamma(y; α′, x) ∝ x^{α′} exp(−x/y)
Gamma(x; α, β)                       | (log x, x)           | Gamma(y; α′, x) ∝ x^{α′} exp(−xy)
Gamma(x; α, β)                       | (log x, x)           | N(y; µ, x^{−1}) ∝ x^{1/2} exp(−(x/2)(y − µ)²)
IGamma(x; α, β)                      | (log x, 1/x)         | N(y; µ, x) ∝ x^{−1/2} exp(−(y − µ)²/(2x))
IGamma(x; α, β)                      | (log x, 1/x)         | Weibull(y; x, k) ∝ x^{−1} y^{k−1} exp(−y^k/x)
GaussianGamma(x, τ; µ, λ, α, β)      | (log τ, τ, τx, τx²)  | N(y; x, τ^{−1}) ∝ τ^{1/2} exp(−(τ/2)(y − x)²)
W(X; n, V)                           | (log |X|, X)         | N(y; µ, X^{−1}) ∝ |X|^{1/2} exp(−½ Tr(X(y − µ)(y − µ)^T))
IW(X; ν, Ψ)                          | (log |X|, X^{−1})    | N(y; µ, X) ∝ |X|^{−1/2} exp(−½ Tr(X^{−1}(y − µ)(y − µ)^T))
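As an illustration of one Table II row (Gamma prior with an exponential likelihood), the sketch below adds each measurement's statistic to the prior's natural parameter; the hyper-parameter values are assumptions for the demo, not values from the paper.

```python
import numpy as np

# Sketch of the Gamma/exponential row of Table II: with prior
# Gamma(x; alpha, beta) and measurements y_j ~ Exp(y; x), each measurement
# contributes (1, -y_j) to the natural parameter (alpha - 1, -beta) of the
# prior, so the posterior is Gamma(alpha + m, beta + sum(y)).

rng = np.random.default_rng(1)
alpha, beta = 2.0, 1.0                      # illustrative prior hyper-parameters
x_true = 0.5
y = rng.exponential(scale=1.0 / x_true, size=100)

eta = np.array([alpha - 1.0, -beta])        # prior natural parameter
for yj in y:
    eta += np.array([1.0, -yj])             # log Exp(y; x) = log x - x*y

alpha_post, beta_post = eta[0] + 1.0, -eta[1]
assert np.isclose(alpha_post, alpha + len(y))
assert np.isclose(beta_post, beta + y.sum())
print(alpha_post, beta_post)
```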

B. The extended Kalman filter

A well-known problem where a special case of the proposed inference technique is used is the filtering problem for nonlinear state-space models in the presence of additive Gaussian noise, which will be described here. Consider the discrete-time stochastic nonlinear dynamical system

  y_k | x_k ∼ N(y_k; c(x_k), R_k),    (39a)
  x_{k+1} | x_k ∼ N(x_{k+1}; f(x_k), Q_k),    (39b)

where x_k and y_k are the latent variable and the measurement at time index k, and c(·) and f(·) are nonlinear functions. A common solution to the inference problem is the extended Kalman filter (EKF) [11]. In the following example, we show that the measurement update of the EKF is a special case of the proposed algorithm.

Example 6. Consider the latent variable x with a priori PDF p(x) = N(x; µ, Σ) and the measurement y with the likelihood function p(y|x) = N(y; c(x), R). The likelihood function can be written in exponential family form as in

  p(y|x) = (2π)^{−d/2} exp(η(x, R) · T(y) − A(η(x, R)))    (40)

where T(y) = (y, yy^T), η(x, R) = (R^{−1}c(x), −½R^{−1}) and

  A(η(x, R)) = ½ Tr(R^{−1}c(x)c(x)^T) + ½ log |R|.    (41)

Thus,

  p(y|x) ∝ exp(R^{−1}c(x) · y − ½ c(x)^T R^{−1} c(x))    (42)

and the log-likelihood function can be written as

  L(x) =⁺ y^T R^{−1} c(x) − ½ c(x)^T R^{−1} c(x)    (43)

where =⁺ means equality up to an additive constant with respect to the latent variable x. A second order Taylor series approximation of L(x) would be linear in the sufficient statistic T(x) = (x, xx^T). However, such an approximation does not guarantee the integrability of the approximate likelihood and the corresponding approximate posterior, due to the dependence of the second element of the posterior's natural parameter on the measurement y.

A common solution to the log-likelihood linearization problem has been the first order Taylor series approximation of c(x) about x̂ = µ (instead of a second order Taylor series approximation of L(x)), i.e.,

  c(x) ≈ c(µ) + c′(µ)(x − µ)    (44)

where

  c′(µ) ≜ ∇_x c(x)|_{x=µ}.    (45)

With this approximation the approximate log-likelihood becomes

  L(x) ≈⁺ (c′(µ)R^{−1}(y − c(µ) + c′(µ)^T µ)) · x − ½ c′(µ)R^{−1}c′(µ)^T · xx^T.    (46)

Hence, a conjugate likelihood function for the prior distribution is obtained by approximating the log-likelihood function by a function that is linear in the sufficient statistic of the prior distribution. Also note that the posterior distribution obeys

  p(x|y) ∝ exp(φ · T(x))    (47)

where

  φ = (c′(µ)R^{−1}(y − c(µ) + c′(µ)^T µ) + Σ^{−1}µ, −½ c′(µ)R^{−1}c′(µ)^T − ½ Σ^{−1}),    (48)

which is the extended information filter form of the extended Kalman filter [14, Sec. 3.5.4]. An advantage of this solution over a second order Taylor series approximation of the log-likelihood is that the natural parameter of the posterior will be in the natural parameter space by construction, i.e., ½ c′(µ)R^{−1}c′(µ)^T + ½ Σ^{−1} ≻ 0. ∎
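A small sketch of Example 6 (Python/NumPy; the range-like measurement function and all numbers are illustrative assumptions): linearizing c(x) about the prior mean and adding the resulting statistic to the prior's natural parameters reproduces the EKF measurement update in information form.

```python
import numpy as np

# Sketch of Example 6: EKF measurement update as natural-parameter addition,
# cf. (44)-(48). c(x) and all numbers are illustrative.

def c(x):
    return np.array([np.sqrt(x[0]**2 + x[1]**2)])   # e.g. a range measurement

def c_jac(x):
    r = np.sqrt(x[0]**2 + x[1]**2)
    return np.array([[x[0] / r, x[1] / r]])          # Jacobian C = dc/dx

mu = np.array([3.0, 4.0])
Sigma = np.diag([1.0, 2.0])
R = np.array([[0.1]])
y = np.array([5.2])

C = c_jac(mu)
Rinv = np.linalg.inv(R)
Lam = np.linalg.inv(Sigma)

# phi in (48): information vector and information matrix of the posterior
info_vec = C.T @ Rinv @ (y - c(mu) + C @ mu) + Lam @ mu
info_mat = C.T @ Rinv @ C + Lam

Sigma_post = np.linalg.inv(info_mat)
mu_post = Sigma_post @ info_vec

# Same result as the textbook EKF update
S = C @ Sigma @ C.T + R
K = Sigma @ C.T @ np.linalg.inv(S)
assert np.allclose(mu_post, mu + K @ (y - c(mu)))
assert np.allclose(Sigma_post, Sigma - K @ S @ K.T)
print(mu_post)
```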

C. A general linearization guideline

The approximation of a general scalar function L(·) with another function L̂(·) linear in a general vector-valued function t(·), i.e., L̂(·) ≜ λ · t(·), can be expressed as an optimization problem where various cost functions are suggested and minimized. The study of such procedures is considered outside the scope of this paper and a subject of future research. However, a few tricks are introduced here which can be used for the purpose of linearization.

Lemma 1. Let L(x) be an arbitrary scalar-valued continuous and differentiable function and t(x) be a vector-valued invertible function with independent continuous elements such that x = t^{−1}(z) and L(t^{−1}(z)) is differentiable with respect to z. Then, L(x) can be linearized with respect to t(x) about t(x̂) such that

  L(x) ≈ L(x̂) + Φ · (t(x) − t(x̂))    (49)

with Φ = ∇_z L(t^{−1}(z))|_{z=t(x̂)}.

Proof. Let z = t(x) and Q(z) ≜ L(t^{−1}(z)). Then x = t^{−1}(z) and Q(z) = L(x) for z = t(x). Using the first-order Taylor series approximation of the scalar function Q(z) about ẑ ≜ t(x̂) we obtain

  Q(z) ≈ Q(ẑ) + ∇_z Q(z)|_{z=ẑ} · (z − ẑ).    (50)

Hence, the proof follows. ∎

Remark 2. The variable z and the function t(x) can be scalar-valued, vector-valued or even matrix-valued. Further, ∇_z Q(z) can be computed using the chain rule and the fact that Q(z) = L(t^{−1}(z)).

The elements of the sufficient statistic of most continuous members of the exponential family are dependent, such as x and xx^T for the normal distribution. Furthermore, some of them are not invertible, such as log |X| for the Wishart distribution. However, the linearization method given in Lemma 1 can be applied to individual summand terms of the log-likelihood function and individual elements of the sufficient statistic. Due to the freedom in choosing the element of the sufficient statistic with respect to which the log-likelihood is linearized, there is no unique solution for the linearization problem. In Example 7 we will use Lemma 1 to linearize a log-likelihood function in various ways and describe the properties of the resulting approximations; a numerical sketch of one of the solutions is given after the example.

Example 7. Consider the scalar latent variable x with a priori PDF p(x) = IGamma(x; α, β) and the measurement y with the likelihood function p(y|x) = N(y; 0, x + σ²). Since the exact posterior density is not analytically tractable, an approximation is needed.

The prior distribution can be written in the exponential family form where the natural parameter is η = (−α − 1, −β) and the sufficient statistic is T(x) = (log x, x^{−1}). The logarithm of the prior distribution is given by

  log p(x) =⁺ −(α + 1) log x − βx^{−1}.    (51)

In order to compute the posterior using the proposed linearization approach, first the log-likelihood function is computed:

  −2L(x, y) =⁺ log(x + σ²) + y²/(x + σ²).    (52)

The linearization of the log-likelihood with respect to the sufficient statistic of the prior distribution can be carried out in several ways, four of which are listed in the following.

Solution 1: We can linearize the right-hand side of (52) with respect to log x using Lemma 1 by letting z ≜ log x and subsequently linearizing with respect to z about ẑ ≜ log x̂, which gives

  −2L(x, y) ≈⁺ (x̂/(x̂ + σ²) − y²x̂/(x̂ + σ²)²) log x,    (53)

where ≈⁺ means approximation up to an additive constant with respect to all sufficient statistics (i.e., the elements of T(x)). The approximate likelihood can be found as

  p̂_1(y|x) ∝ x^{−x̂(x̂ + σ² − y²)/(2(x̂ + σ²)²)},    (54)

which is not integrable with respect to y since x̂ > 0 and p̂_1(y|x) goes to infinity as y goes to infinity.

Solution 2: We can linearize the right-hand side of (52) with respect to x^{−1} by letting z ≜ x^{−1} and subsequently linearizing with respect to z about ẑ ≜ x̂^{−1} using Lemma 1, which gives

  −2L(x, y) ≈⁺ (−x̂²/(x̂ + σ²) + y²x̂²/(x̂ + σ²)²) x^{−1}.    (55)

The approximate likelihood can be found as

  p̂_2(y|x) ∝ exp(−(x̂²(y² − x̂ − σ²)/(2(x̂ + σ²)²)) x^{−1}),    (56)

which is integrable with respect to y. However, the approximate posterior obtains the form

  p̂(x|y) ∝ exp(−(α + 1) log x − (x̂²(y² − x̂ − σ²)/(2(x̂ + σ²)²) + β) x^{−1}),    (57)

which is not integrable with respect to x for a small y such that x̂²(y² − x̂ − σ²)/(2(x̂ + σ²)²) + β < 0.

Solution 3: We can linearize the first term on the right-hand side of (52) with respect to log x and the second term with respect to x^{−1}, about log x̂ and x̂^{−1} respectively, using Lemma 1, which gives

  −2L(x, y) ≈⁺ (x̂/(x̂ + σ²)) log x + (y²x̂²/(x̂ + σ²)²) x^{−1}.    (58)

The approximate likelihood can be found as

  p̂_3(y|x) ∝ x^{−x̂/(2(x̂ + σ²))} exp(−y²x̂²/(2x(x̂ + σ²)²)),    (59)

which is integrable with respect to y. The approximate posterior remains integrable with respect to x for x̂ > 0.

Solution 4: We can first expand the log-likelihood as in

  −2L(x, y) =⁺ log x + log(1 + σ²/x) + y²/(x + σ²)    (60)

and then linearize the last two terms on the right-hand side of (60) with respect to z ≜ x^{−1} about the nominal point ẑ ≜ x̂^{−1} using Lemma 1, which gives

  −2L(x, y) ≈⁺ log x + (σ²x̂/(σ² + x̂) + y²x̂²/(σ² + x̂)²) x^{−1}.    (61)

The approximate likelihood can be found as

  p̂_4(y|x) ∝ x^{−1/2} exp(−(1/(2x))(σ²x̂/(σ² + x̂) + y²x̂²/(σ² + x̂)²)),    (62)

which is integrable with respect to y. The posterior remains integrable with respect to x for x̂ > 0.

Integrability of the approximate posterior is not guaranteed for the first two solutions, while the last two solutions result in an inverse gamma approximate posterior distribution. The parameters of the posterior depend on the linearization point x̂ among other factors. A candidate point for linearization is the prior mean x̂ = β/(α − 1). ∎
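As a numerical illustration of Solution 3 (a sketch with made-up values of α, β, σ² and y, not taken from the paper), the coefficients of log x and x^{−1} in (58) are added to the prior's natural parameter, yielding an inverse gamma approximate posterior:

```python
import numpy as np

# Sketch of Example 7, Solution 3: linearize -2L(x,y) = log(x+s2) + y^2/(x+s2)
# with respect to (log x, 1/x) about xhat, cf. (58), and add the result to the
# natural parameter of the IGamma(alpha, beta) prior. Numbers are illustrative.

alpha, beta = 3.0, 4.0
sigma2 = 1.0
y = 2.5
xhat = beta / (alpha - 1.0)          # linearization point: the prior mean

# Coefficients of log x and 1/x in (58) (note the overall factor -1/2 in L)
a = xhat / (xhat + sigma2)                     # multiplies log x
b = (y**2) * xhat**2 / (xhat + sigma2)**2      # multiplies 1/x

# Prior natural parameter for T(x) = (log x, 1/x) is (-alpha-1, -beta);
# the approximate likelihood contributes (-a/2, -b/2).
alpha_post = alpha + a / 2.0                   # posterior IGamma shape
beta_post = beta + b / 2.0                     # posterior IGamma scale

print("approximate posterior: IGamma(%.3f, %.3f)" % (alpha_post, beta_post))
```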

As pointed out in Theorem 1 and illustrated in Example 7, linearity of the log-likelihood with respect to the sufficient statistic of the prior is a necessary condition for conjugacy. Hence, even after succeeding in linearizing the log-likelihood with respect to T(x), the integrability of the posterior distribution should be examined. Consequently, the linearization should be done skillfully, such that the approximation error is minimized and the approximate posterior remains integrable.

In the rest of this paper we focus on exemplifying the application of the linearization technique in a specific Bayesian inference problem in Section IV and the related numerical simulations in Section V. In this specific example, two of the sufficient statistics are not invertible as required in Lemma 1. Hence, a remedy is suggested and used in the example.

IV. EXTENDED TARGET TRACKING

A. The problem formulation

In the Bayesian extended target tracking (ETT) framework [15] the state of an extended target consists of a kinematic state and a symmetric positive definite matrix representing the extent state. Suppose that the extended target's kinematic state and the target's extent state are denoted by x_k and X_k, respectively. Further, we assume that the measurements Y_k ≜ {y_k^j ∈ R^d}_{j=1}^{m_k} generated by the extended target are independent and identically distributed (conditioned on x_k and X_k) as y_k^j ∼ N(y_k^j; Hx_k, sX_k + R); see Fig. 3. The following measurement likelihood, which will be used in this paper, was first introduced in [16]:

  p(Y_k | x_k, X_k) = ∏_{j=1}^{m_k} N(y_k^j; Hx_k, sX_k + R),    (63)

where
• m_k is the number of measurements at time k,
• s is a real positive scalar constant,
• R is the measurement noise covariance.

Figure 3: A probabilistic graphical model for extended targets with kinematic states x_k, extent states X_k and measurements y_k, where m_k > 0 measurements are received at each time k.

We assume the following prior form on the kinematic state and target extent state:

  p(x_k, X_k | Y_{0:k−1}) = N(x_k; x_{k|k−1}, P_{k|k−1}) IW(X_k; ν_{k|k−1}, V_{k|k−1})    (64)

where x_k ∈ R^n, X_k ∈ R^{d×d}_{>0} and the double time index "a|b" is read "at time a using measurements up to and including time b".

The inverse Wishart distribution IW(X; ν, V) we use in this work is given in the following form:

  IW(X; ν, V) ≜ |V|^{(ν−d−1)/2} exp(−½ tr(V X^{−1})) / (2^{(ν−d−1)d/2} Γ_d((ν − d − 1)/2) |X|^{ν/2}),    (65)

where X is a symmetric positive definite matrix of dimension d × d, ν > 2d is the scalar degrees of freedom and V is a symmetric positive definite matrix of dimension d × d called the scale matrix. This form of the inverse Wishart distribution is the same one used in the well-known reference [17].

No analytical solution for the posterior exists. The reason for the lack of an analytical solution for the posterior corresponding to the likelihood (63) and the prior (64) is both the covariance addition in the Gaussian distributions on the right-hand side of (63) and the intertwined nature of the kinematic state and extent state in the likelihood. In order to get rid of the covariance addition, latent variables are defined in [18] and the problem of intertwined kinematic and extent states is solved using a variational approximation. In [16] the authors design an unbiased estimator. Both of these measurement updates can be used in a recursive estimation framework, which requires the posterior to be in the same form as the prior, as in

  p(x_k, X_k | Y_{0:k}) ≈ N(x_k; x_{k|k}, P_{k|k}) IW(X_k; ν_{k|k}, V_{k|k}).    (66)

B. Solution proposed by Feldmann et al. [16]

In [16] the authors cleverly design a measurement update based on unbiasedness properties for the measurement model (63) to calculate x_{k|k}, P_{k|k}, ν_{k|k} and V_{k|k} in the posterior (66).

The kinematic state x_k is updated as follows:

  x_{k|k} = x_{k|k−1} + K_k (ȳ_k − Hx_{k|k−1})    (67)
  P_{k|k} = P_{k|k−1} − K_k S_k K_k^T    (68)

where

  K_k = P_{k|k−1} H^T S_k^{−1}    (69)
  S_k = H P_{k|k−1} H^T + (sX_{k|k−1} + R)/m_k    (70)
  X_{k|k−1} = V_{k|k−1} / (ν_{k|k−1} − 2d − 2)    (71)

and

  ȳ_k ≜ (1/m_k) ∑_{j=1}^{m_k} y_k^j    (72)

is the mean of the measurements at time k.

The extent state update given in [16] is of the following form:

  ν_{k|k} = ν_{k|k−1} + m_k    (73)
  V_{k|k} = V_{k|k−1} + M_k^{FFK}    (74)

where M_k^{FFK} is a given positive definite update matrix and the superscript ·^{FFK}, used to ease presentation, consists of the authors' initials in [16]. Before describing the construction of the matrix M_k^{FFK} of [16], let us define the measurement spread 𝒴_k (denoted 𝒴_k to distinguish it from the measurement set Y_k) around the predicted measurement Hx_{k|k−1} as follows:

  𝒴_k ≜ (1/m_k) ∑_{j=1}^{m_k} (y_k^j − Hx_{k|k−1})(y_k^j − Hx_{k|k−1})^T.    (75)
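The kinematic update (67)–(72) is a Kalman update against the measurement mean with an inflated innovation covariance. A sketch (Python/NumPy; all numbers and the stand-in measurements are illustrative):

```python
import numpy as np

# Sketch of the FFK kinematic update (67)-(72); all numbers illustrative.
d = 2
H = np.hstack([np.eye(d), np.zeros((d, d))])                # H = [I2, 0]
s, R = 0.25, 100.0**2 * np.eye(d)

x_pred = np.array([0.0, 0.0, 100.0, 100.0])
P_pred = np.diag([50.0**2, 50.0**2, 10.0**2, 10.0**2])
nu_pred, V_pred = 20.0, 10.0 * np.eye(d)

Y = np.random.default_rng(2).normal(size=(15, d)) * 300.0   # stand-in measurements
m = Y.shape[0]
y_bar = Y.mean(axis=0)                                      # (72)

X_pred = V_pred / (nu_pred - 2 * d - 2)                     # (71)
S = H @ P_pred @ H.T + (s * X_pred + R) / m                 # (70)
K = P_pred @ H.T @ np.linalg.inv(S)                         # (69)
x_post = x_pred + K @ (y_bar - H @ x_pred)                  # (67)
P_post = P_pred - K @ S @ K.T                               # (68)
print(x_post)
```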

Feldmann et al. first write the measurement spread 𝒴_k as

  𝒴_k = 𝒴_k^1 + 𝒴_k^2.    (76)

The summands 𝒴_k^1 and 𝒴_k^2 are defined as

  𝒴_k^1 ≜ (ȳ_k − Hx_{k|k−1})(ȳ_k − Hx_{k|k−1})^T,    (77)
  𝒴_k^2 ≜ (1/m_k) ∑_{j=1}^{m_k} (y_k^j − ȳ_k)(y_k^j − ȳ_k)^T.    (78)

When we take the conditional expected values of 𝒴_k^1 and 𝒴_k^2, we obtain

  𝒴̄_k^1 ≜ E[𝒴_k^1 | X_k = X_{k|k−1}] = H P_{k|k−1} H^T + (sX_{k|k−1} + R)/m_k,    (79)
  𝒴̄_k^2 ≜ E[𝒴_k^2 | X_k = X_{k|k−1}] = ((m_k − 1)/m_k)(sX_{k|k−1} + R).    (80)

The matrix M_k^{FFK} is then proposed to be

  M_k^{FFK} = X_{k|k−1}^{1/2} (𝒴̄_k^1)^{−1/2} 𝒴_k^1 (𝒴̄_k^1)^{−1/2} X_{k|k−1}^{1/2} + (m_k − 1) X_{k|k−1}^{1/2} (𝒴̄_k^2)^{−1/2} 𝒴_k^2 (𝒴̄_k^2)^{−1/2} X_{k|k−1}^{1/2}.    (81)

The conditional expected value of M_k^{FFK} is

  E[M_k^{FFK} | X_k = X_{k|k−1}] = m_k X_{k|k−1},    (82)

which results in an unbiased estimator.

C. ETT via log-likelihood linearization

A new measurement update is obtained by performing linearization of the logarithm of the likelihood function (63) with respect to the sufficient statistic of the prior distribution (64). The sufficient statistics of the prior distribution (64) are given as follows:

  T(x_k, X_k) = (x_k, x_k x_k^T, X_k^{−1}, log |X_k|).    (83)

As seen above, the second and fourth elements of T(·) are not invertible. In the following lemma, using log-likelihood linearization (LLL) via a first order Taylor series expansion, we approximate the likelihood function (63) to obtain a conjugate likelihood (with respect to the prior (64)) with the sufficient statistics given in (83).

Lemma 2. The likelihood ∏_{j=1}^m N(y^j; Hx, sX + R) can be approximated up to a multiplicative constant² by a first order Taylor series approximation around the nominal points X̂ (for the variable X) and x̂ (for the variable x) as

  ∏_{j=1}^m N(y^j; Hx, sX + R) ≈ (∏_{j=1}^m N(y^j; Hx, sX̂ + R)) IW(X; m, M)    (84)

where

  M = mX̂ + msX̂(sX̂ + R)^{−1} (𝒴 − (sX̂ + R)) (sX̂ + R)^{−1} X̂    (85)

and 𝒴 ≜ (1/m) ∑_{j=1}^m (y^j − Hx̂)(y^j − Hx̂)^T is the measurement spread around Hx̂.

²That is, constant with respect to the variables X and x.

Proof. The proof is given in Appendix A for the sake of clarity. ∎

Lemma 2 states that the likelihood (63) can be approximately factorized into two independent likelihood terms corresponding to the kinematic and extent states. This type of factorization gives independent measurement updates for the kinematic state and the extent state. For this purpose, one can set X̂ = V_{k|k−1}/(ν_{k|k−1} − 2d − 2) and x̂ = x_{k|k−1} in order to find the factors of Lemma 2. Using the conjugate likelihood factors in (84), the posterior density p(x_k, X_k | Y_{0:k}) is given in the form of (66) with the update parameters given below:

  x_{k|k} = P_{k|k} (P_{k|k−1}^{−1} x_{k|k−1} + m_k H^T (sX_{k|k−1} + R)^{−1} ȳ_k),    (86)
  P_{k|k} = (P_{k|k−1}^{−1} + m_k H^T (sX_{k|k−1} + R)^{−1} H)^{−1},    (87)

where

  X_{k|k−1} = V_{k|k−1} / (ν_{k|k−1} − 2d − 2),    (88)
  ȳ_k = (1/m_k) ∑_{j=1}^{m_k} y_k^j.    (89)

Note that the kinematic updates are the same as the kinematic updates proposed in [16]. The extent state is updated as given in

  ν_{k|k} = ν_{k|k−1} + m_k,    (90)
  V_{k|k} = V_{k|k−1} + M_k^{LLL},    (91)

where

  M_k^{LLL} = m_k X_{k|k−1} + m_k s X_{k|k−1} (sX_{k|k−1} + R)^{−1} (𝒴_k − (sX_{k|k−1} + R)) (sX_{k|k−1} + R)^{−1} X_{k|k−1}.    (92)

This form of the update suggests that the matrix 𝒴_k serves as a pseudo-measurement for the quantity sX_k + R. If 𝒴_k is larger than the current predicted estimate of sX_k + R (i.e., sX_{k|k−1} + R), then a positive matrix quantity is added to the statistic V_{k|k−1}, and vice versa.

We here note that

  E[𝒴_k | X_k = X_{k|k−1}] = H P_{k|k−1} H^T + sX_{k|k−1} + R,    (93)

where the expectation is taken over all the measurements at time k and the estimate x_{k|k−1}. Hence we conclude that the expected value of the second term on the right-hand side of (92) is always positive, which makes the current update biased. Another drawback of the update is that the uncertainty of the predicted estimate x_{k|k−1} does not affect the update. In order to solve both problems we modify the quantity M_k^{LLL} as follows:

  M_k^{ULL} ≜ m_k X_{k|k−1} + m_k s X_{k|k−1} (H P_{k|k−1} H^T + sX_{k|k−1} + R)^{−1}
           × (𝒴_k − (H P_{k|k−1} H^T + sX_{k|k−1} + R))
           × (H P_{k|k−1} H^T + sX_{k|k−1} + R)^{−1} X_{k|k−1}.    (94)

Note that in the modified update the difference

  𝒴_k − (H P_{k|k−1} H^T + sX_{k|k−1} + R)    (95)

has zero conditional mean, which makes the update unbiased, and the terms (H P_{k|k−1} H^T + sX_{k|k−1} + R)^{−1} decrease with increasing uncertainty in the predicted estimate x_{k|k−1}, which makes the update term smaller. In the following we will call the proposed unbiased measurement update, based on linearization of the likelihood in the log domain, ULL for ease of presentation.
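A sketch of the ULL extent update (90)–(94) in Python/NumPy (the values and stand-in measurements are illustrative; `spread` stands for 𝒴_k computed as in (75)):

```python
import numpy as np

# Sketch of the ULL extent update (90)-(94); values are illustrative.
def ull_extent_update(nu_pred, V_pred, x_pred, P_pred, Y, H, s, R, d):
    m = Y.shape[0]
    X_pred = V_pred / (nu_pred - 2 * d - 2)                  # (88)
    resid = Y - H @ x_pred                                   # y_k^j - H x_{k|k-1}
    spread = resid.T @ resid / m                             # (75)
    C = H @ P_pred @ H.T + s * X_pred + R                    # cf. (93)
    Cinv = np.linalg.inv(C)
    M_ull = m * X_pred + m * s * X_pred @ Cinv @ (spread - C) @ Cinv @ X_pred  # (94)
    return nu_pred + m, V_pred + M_ull                       # (90), (91)

d = 2
H = np.hstack([np.eye(d), np.zeros((d, d))])
Y = np.random.default_rng(3).normal(size=(12, d)) * 200.0    # stand-in data
nu_post, V_post = ull_extent_update(
    nu_pred=20.0, V_pred=10.0 * np.eye(d),
    x_pred=np.zeros(4), P_pred=np.diag([2500.0, 2500.0, 100.0, 100.0]),
    Y=Y, H=H, s=0.25, R=400.0 * np.eye(d), d=d)
print(nu_post)
print(V_post)
```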

V. NUMERICAL SIMULATION

In this section the newly proposed measurement update, ULL, is compared with the update based on variational Bayes [18] and the update proposed by Feldmann et al. in [16], which will be referred to as VB and FFK, respectively. In Section V-A, we use a simulation scenario which is partly based on the simulation scenario described in [18, Section IV], which in turn is based on [19, Section 6.1]. In Section V-B an ETT scenario based on the numerical simulation described in [16, Section VI-B] is presented.

Figure 4: Extent estimation error for FFK, VB and ULL with respect to the optimal Bayesian solution. Solid lines show the mean error E_X and shaded areas of respective colors illustrate the interval between the fifth and the ninety-fifth percentiles.

A. Monte-Carlo simulations

A planar extended target in the two-dimensional space whose true kinematic state x_k^0 and extent state X_k^0 are

  x_k^0 = [0 m, 0 m, 100 m/s, 100 m/s]^T    (96a)
  X_k^0 = E_k Diag([300² m², 200² m²]) E_k^T    (96b)

is considered. Here, E_k ≜ [e_1, e_2] is a 2 × 2 matrix whose columns e_1 and e_2 are the normalized eigenvectors of X_k^0, which are e_1 ≜ (1/√2)[1, 1]^T and e_2 ≜ (1/√2)[1, −1]^T. For the extended target with these true parameters, we conduct Monte Carlo (MC) simulations to quantify the differences between the measurement updates FFK, ULL and VB. For each MC run we assume that the initial predicted target density for all three methods is the same and has the following structure:

  p(x_k, X_k | Y_{0:k−1}) = N(x_k; x_{k|k−1}^j, P_{k|k−1}) IW(X_k; ν_{k|k−1}^j, V_{k|k−1}^j)    (97)

where the superscript j indicates the jth MC run. The quantities x_{k|k−1}^j, P_{k|k−1} for the kinematic state are selected as

  x_{k|k−1}^j ∼ N(·; x_k^0, P_{k|k−1}/α_k),    (98a)
  P_{k|k−1} = Diag([50², 50², 10², 10²]),    (98b)

where the scalar α_k is the scaling parameter for the covariance of the density in (98a), used to adjust the distribution of x_{k|k−1}^j in the MC simulation study. The quantities ν_{k|k−1}^j and V_{k|k−1}^j are selected as

  ν_{k|k−1}^j = max(7, ν ∼ Poisson(100)),    (99a)
  V_{k|k−1}^j | ν_{k|k−1}^j ∼ W(·; δ_k, ((ν_{k|k−1}^j − 2d − 2)/δ_k) X_k^0),    (99b)

where the functions Poisson(λ) and W(·; n, Ψ) represent the Poisson density with expected value λ and the Wishart density with degrees of freedom n and scale matrix Ψ. The scalar δ_k controls the variance of the Wishart density in this study. For the jth MC run, m_k^j measurements are generated according to

  m_k^j = max(2, m ∼ Poisson(10)),    (100a)
  y_k^{j,i} ∼ N(·; Hx_k^0, sX_k^0 + R),    (100b)

where s = 0.25, R = 100² I_2 m² and H = [I_2, 0_{2×2}]. The matrices I_2 and 0_{2×2} are the identity matrix and zero matrix of size 2 × 2, respectively. The performance of the measurement updates is compared for various levels of accuracy of the initial predicted target density in view of the optimal Bayesian solution. To this end, the optimal solution is numerically computed via importance sampling using the predicted density as the proposal density. A total of 10⁵ samples are generated from the predicted density and the normalized importance weights are calculated from the likelihood function given in (63). Using the importance weights, the posterior expected values of the kinematic state x_k and extent state X_k are calculated. The resulting expected values are referred to as x_{k|k}^{opt} and X_{k|k}^{opt}, respectively.

To change the level of accuracy of the predicted target density, the parameters α_k are selected on a linear grid of 40 values between 1 and 50, while the values of δ_k are selected on a logarithmic grid of 40 values between 2 and 1000. For each pair of α_k and δ_k out of the 40 pairs, N_MC = 1000 runs have been made by selecting the parameters x_{k|k−1}^j, P_{k|k−1}, ν_{k|k−1}^j, V_{k|k−1}^j and Y_k^j as above, and the estimates x_{k|k}^{U,j} and X_{k|k}^{U,j} for U ∈ {FFK, VB, ULL} are calculated.³

³In each MC run, 20 iterations were performed for the VB solution, although the VB solution would usually converge within the first 5 iterations.

Using the results of the MC runs, the average error metrics E_x^U and E_X^U for U ∈ {FFK, VB, ULL} for the kinematic state and the extent state, respectively, are calculated as follows:

  E_x^U ≜ ((1/(d N_MC)) ∑_{j=1}^{N_MC} ||H(x_{k|k}^{U,j} − x_{k|k}^{opt,j})||²)^{1/2},    (101)
  E_X^U ≜ ((1/(d² N_MC)) ∑_{j=1}^{N_MC} tr²(X_{k|k}^{U,j} − X_{k|k}^{opt,j}))^{1/4}.    (102)

Figure 5: Kinematic state estimation error for FFK, VB and ULL with respect to the optimal Bayesian solution. Solid lines show the mean error E_x and shaded areas of respective colors illustrate the interval between the fifth and the ninety-fifth percentiles. Since FFK and ULL use identical kinematic state updates (for identical X_{k|k−1}), their respective graphs coincide.

Figure 6: Trajectory of an extended target in the x–y plane (coordinates in km). For every third scan, the true extended target and its center are shown.
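A sketch of the error metrics (101)–(102) (Python/NumPy; the estimate arrays are placeholders, and the square/fourth-root forms follow the reconstruction above):

```python
import numpy as np

# Sketch of the error metrics (101)-(102). x_hat/x_opt have shape (N_MC, n),
# X_hat/X_opt have shape (N_MC, d, d); contents here are placeholders.
def error_metrics(x_hat, x_opt, X_hat, X_opt, H):
    n_mc, d = X_hat.shape[0], X_hat.shape[1]
    dx = (H @ (x_hat - x_opt).T).T                       # position errors
    E_x = np.sqrt(np.sum(dx**2) / (d * n_mc))            # (101)
    tr = np.trace(X_hat - X_opt, axis1=1, axis2=2)
    E_X = (np.sum(tr**2) / (d**2 * n_mc)) ** 0.25        # (102)
    return E_x, E_X

rng = np.random.default_rng(4)
H = np.hstack([np.eye(2), np.zeros((2, 2))])
x_opt = rng.normal(size=(1000, 4)); x_hat = x_opt + rng.normal(size=(1000, 4))
X_opt = np.tile(np.eye(2), (1000, 1, 1)); X_hat = X_opt * 1.1
print(error_metrics(x_hat, x_opt, X_hat, X_opt, H))
```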

Table III: Comparison of measurement updates for ETT with respect to cycle time and estimation errors. The data are obtained from the single ETT simulation scenario. Since the VB update diverged several times, errors exceeding 24 m are set to 24 m to make the quantitative comparison meaningful.

Update | E_x [m] (Mean ± St.Dev.) | E_X [m] (Mean ± St.Dev.) | Cycle time [s] (Mean ± St.Dev.)
FFK    | 15.4581 ± 0.9943         | 19.4354 ± 0.6894         | 0.1627 ± 0.0000
VB     | 16.3447 ± 1.1913         | 19.8340 ± 0.7764         | 3.2547 ± 0.0002
ULL    | 15.5204 ± 0.9685         | 19.2356 ± 0.6680         | 0.1293 ± 0.0000

Figure 7: Histogram of the root mean square error of the kinematic state E_x^{U,j} for U ∈ {FFK, VB, ULL} in a single ETT scenario.

The errors are illustrated in Fig. 4 and Fig. 5. Since FFK and ULL use identical kinematic state updates, their respective graphs coincide. VB has the lowest extent state estimation error. ULL has a lower estimation error compared to FFK in the simulation scenario presented here. When the simulation is repeated with R = 50²I_2, FFK performs better than ULL, while VB maintains its superior estimation performance in terms of E_X.

B. Single extended target tracking scenario

The performance of the measurement updates is evaluated in a single extended target tracking scenario where an extended target follows a trajectory in a two-dimensional space illustrated in Fig. 6. In this simulation scenario, an elliptical extended target with diameters 340 and 80 meters moves with a constant velocity of about 50 km/h. The extended target's true initial kinematic state x_1^0 and true initial extent state X_1^0 are as follows:

  x_1^0 = [0 m, 0 m, 9.8 m/s, −9.8 m/s]^T,    (103a)
  X_1^0 = E_k Diag([170² m², 40² m²]) E_k^T,    (103b)

where E_k is the 2 × 2 matrix whose columns are the normalized eigenvectors of X_1^0, which are e_1 ≜ (1/√2)[−1, 1]^T and e_2 ≜ (1/√2)[1, 1]^T, i.e., E_k ≜ [e_1, e_2].

We conduct an MC study to quantify the difference between the measurement updates in a single extended target tracking scenario. The measurement set Y_k^j = {y_k^{j,i}}_{i=1}^{m_k^j} for the jth MC run is generated according to (100), where s = 0.25, R = 20² I_2 m², H = [I_2, 0_{2×2}] and 1 ≤ k ≤ 181.

In the filters, the kinematic state vector consists of the position and the velocity, x = [p_x, p_y, ṗ_x, ṗ_y]^T, which evolves according to the discrete-time constant velocity model in two-dimensional Cartesian coordinates [20], where p(x_{k+1}|x_k) = N(x_{k+1}; A_k x_k, Q_k) and

  A_k = [I_2, τI_2; 0_2, I_2],  Q_k = σ_ν² [(τ⁴/4) I_2, (τ³/2) I_2; (τ³/2) I_2, τ² I_2],

with τ = 10 s and σ_ν = 0.1 m/s².

In the filters the initial prior density for the kinematic state and extent state is

  p(x_1, X_1) = N(x_1; x_{1|0}^j, P_{1|0}) IW(X_1; ν_{1|0}^j, V_{1|0}^j)    (104)

where

  x_{1|0}^j ∼ N(·; x_1^0, P_{1|0}/α_0),    (105a)
  P_{1|0} = Diag([50², 50², 10², 10²]),    (105b)
  ν_{1|0}^j = max(7, ν ∼ Poisson(10)),    (105c)
  V_{1|0}^j | ν_{1|0}^j ∼ W(·; δ_0, ((ν_{1|0}^j − 2d − 2)/δ_0) X_1^0),    (105d)
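A sketch of the constant-velocity model matrices used in the filters (Python/NumPy, using the stated τ and σ_ν; the initial state values are illustrative):

```python
import numpy as np

# Constant velocity model in 2D Cartesian coordinates, cf. Section V-B.
tau, sigma_nu = 10.0, 0.1
I2, Z2 = np.eye(2), np.zeros((2, 2))

A = np.block([[I2, tau * I2],
              [Z2, I2]])
Q = sigma_nu**2 * np.block([[tau**4 / 4 * I2, tau**3 / 2 * I2],
                            [tau**3 / 2 * I2, tau**2 * I2]])

# One time-update step of the kinematic state, cf. (106)-(107) below
x, P = np.zeros(4), np.diag([2500.0, 2500.0, 100.0, 100.0])
x_pred = A @ x
P_pred = A @ P @ A.T + Q
print(x_pred, P_pred.diagonal())
```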

and where α_0 = 10 and δ_0 = 5.

1) Time update: The expression for the time update of the target's kinematic state in the filter follows the standard Kalman filter time update equations,

  x_{k+1|k}^j = A_k x_{k|k}^j,    (106)
  P_{k+1|k}^j = A_k P_{k|k}^j A_k^T + Q_k.    (107)

The extent state is assumed to be slowly varying. Hence, an exponential forgetting strategy [15], [21], [22] can be used to account for possible changes in the parameters over time. In [23] it is shown that using the exponential forgetting factor produces the maximum entropy distribution in the time update for processes which are slowly varying with unknown dynamics but bounded by a Kullback-Leibler divergence (KLD) constraint. The forgetting factor is applied to the parameters of the inverse Wishart distribution as follows:

  ν_{k+1|k}^j = exp(−τ/τ_0) ν_{k|k}^j,    (108)
  V_{k+1|k}^j = ((ν_{k+1|k}^j − 2d − 2)/(ν_{k|k}^j − 2d − 2)) V_{k|k}^j.    (109)

The exponential decay time constant was selected as τ_0 = 15.

2) Evaluation of filters: 50 000 MC runs are performed where the parameters x_{1|0}^j, P_{1|0}^j, ν_{1|0}^j, V_{1|0}^j and Y_k^j for 1 ≤ k ≤ 181 are selected as above. Using the results of the MC runs, the average error metrics E_x^{U,j} and E_X^{U,j} for U ∈ {FFK, VB, ULL} and the jth MC run for the target's kinematic state and extent state, respectively, are calculated as follows:

  E_x^{U,j} ≜ ((1/(dK)) ∑_{k=1}^K ||H(x_{k|k}^{U,j} − x_k^0)||²)^{1/2},    (110a)
  E_X^{U,j} ≜ ((1/(d²K)) ∑_{k=1}^K tr²(X_{k|k}^{U,j} − X_k^0))^{1/4}.    (110b)

The errors (mean ± standard deviation) and the cycle times (mean ± standard deviation) for the evaluated measurement update rules are given in Table III. The histograms of the distribution of the errors defined in (110) are given in Fig. 7 and Fig. 8. The estimation errors of FFK and ULL are comparable, but the proposed update, ULL, is less computationally expensive.

Figure 8: Histogram of the average extent state estimation error E_X^{U,j} for U ∈ {FFK, VB, ULL} in a single ETT scenario.

VI. CONCLUSION

A Bayesian inference technique based on Taylor series approximation of the logarithm of the likelihood function is presented. The proposed approximation is devised for the case where the prior distribution belongs to the exponential family of distributions and is continuous. The logarithm of the likelihood function is linearized with respect to the sufficient statistic of the prior distribution in the exponential family such that the posterior obtains the same exponential family form as the prior. A Taylor series approximation is used for the linearization with respect to the sufficient statistic. However, the Taylor series approximation is a local approximation which is used here to approximate the posterior over its entire support. In spite of this inherent weakness, the proposed algorithm performs well in the numerical simulations that are presented here for extended target tracking. Furthermore, there are numerous successful applications of the extended Kalman filter, which is a special case of the proposed algorithm, in theory and practice. When a Bayesian posterior estimate is needed with a limited computational budget which prohibits the use of Monte-Carlo methods or even a variational Bayes approximation, as in real-time filtering problems, the proposed algorithm offers a good alternative.

The comparison of possible choices for the linearization point and of linearization methods with respect to the sufficient statistic are among the future research problems.

VII. ACKNOWLEDGMENTS

The authors would like to thank Henri Nurminen for proofreading and providing comments on an earlier version of the manuscript.

REFERENCES

[1] M. Jordan, Learning in Graphical Models. MIT Press, 1999.
[2] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, "An introduction to variational methods for graphical models," Mach. Learn., vol. 37, no. 2, pp. 183–233, Nov. 1999. [Online]. Available: http://dx.doi.org/10.1023/A:1007665907178
[3] T. Minka, "Expectation propagation for approximate Bayesian inference," in Proceedings of the Seventeenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-01). San Francisco, CA: Morgan Kaufmann, 2001, pp. 362–369.
[4] H. Rue, S. Martino, and N. Chopin, "Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 71, no. 2, pp. 319–392, 2009.
[5] J. A. Nelder and R. W. M. Wedderburn, "Generalized linear models," Journal of the Royal Statistical Society. Series A (General), vol. 135, no. 3, pp. 370–384, 1972.
[6] W. Hastings, "Monte Carlo sampling methods using Markov chains and their applications," Biometrika, vol. 57, pp. 97–109, 1970.
[7] S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 6, no. 6, pp. 721–741, 1984.
[8] M. Wainwright and M. Jordan, Graphical Models, Exponential Families, and Variational Inference, ser. Foundations and Trends in Machine Learning. Now Publishers, 2008.
[9] T. M. Cover and J. Thomas, Elements of Information Theory. John Wiley and Sons, 2006.
[10] W. Hennevogl, L. Fahrmeir, and G. Tutz, Multivariate Statistical Modelling Based on Generalized Linear Models, ser. Springer Series in Statistics. Springer New York, 2001.
[11] G. Smith, S. Schmidt, and L. McGee, Application of Statistical Filter Theory to the Optimal Estimation of Position and Velocity on Board a Circumlunar Vehicle, ser. NASA Technical Report. National Aeronautics and Space Administration, 1962.
[12] M. I. Jordan, "Lecture notes on conjugacy and exponential family," 2014.
[13] R. E. Kalman, "A new approach to linear filtering and prediction problems," Transactions of the ASME–Journal of Basic Engineering, vol. 82, no. Series D, pp. 35–45, 1960.
[14] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics, ser. Intelligent Robotics and Autonomous Agents. MIT Press, 2005.

[15] J. W. Koch, "Bayesian approach to extended object and cluster tracking using random matrices," IEEE Trans. Aerosp. Electron. Syst., vol. 44, no. 3, pp. 1042–1059, Jul. 2008.
[16] M. Feldmann, D. Fränken, and W. Koch, "Tracking of extended objects and group targets using random matrices," IEEE Trans. Signal Process., vol. 59, no. 4, pp. 1409–1420, Apr. 2011.
[17] A. K. Gupta and D. K. Nagar, Matrix Variate Distributions. Boca Raton, FL: Chapman & Hall/CRC, 2000.
[18] U. Orguner, "A variational measurement update for extended target tracking with random matrices," IEEE Trans. Signal Process., vol. 60, no. 7, pp. 3827–3834, 2012.
[19] M. Baum, M. Feldmann, D. Fraenken, U. D. Hanebeck, and W. Koch, "Extended object and group tracking: A comparison of random matrices and random hypersurface models," in GI Jahrestagung (2), ser. LNI, K.-P. Fähnrich and B. Franczyk, Eds., vol. 176. GI, 2010, pp. 904–906.
[20] X. Li and V. Jilkov, "Survey of maneuvering target tracking. Part I. Dynamic models," IEEE Trans. Aerosp. Electron. Syst., vol. 39, no. 4, pp. 1333–1364, Oct. 2003.
[21] R. Kulhavy and M. B. Zarrop, "On a general concept of forgetting," International Journal of Control, vol. 58, no. 4, pp. 905–924, 1993.
[22] C. M. Carvalho and M. West, "Dynamic matrix-variate graphical models," Bayesian Anal., vol. 2, pp. 69–98, 2007.
[23] E. Özkan, V. Smidl, S. Saha, C. Lundquist, and F. Gustafsson, "Marginalized adaptive particle filtering for non-linear models with unknown time-varying noise parameters," to appear in Automatica.

APPENDIX

A. Proof of Lemma 2

The logarithm of the likelihood ∏_{j=1}^m N(y^j; Hx, sX + R) is given as

  −2 ∑_{j=1}^m log N(y^j; Hx, sX + R)
  =⁺ m log |sX + R| + ∑_{j=1}^m (y^j − Hx)^T (sX + R)^{−1} (y^j − Hx)    (111)
  = m log |X| + m log |sI + X^{−1/2} R X^{−1/2}| + ∑_{j=1}^m (y^j − Hx)^T (sX + R)^{−1} (y^j − Hx)    (112)
  = m log |X| + m log |sI + X^{−1/2} R X^{−1/2}| + ∑_{j=1}^m tr((y^j − Hx)(y^j − Hx)^T (sX + R)^{−1}),    (113)

where the sign =⁺ denotes an equality up to an additive constant. We approximate the right-hand side of (113) by a function that is linear in the sufficient statistic

  T(x, X) = (x, xx^T, log |X|, X^{−1}).    (114)

As aforementioned, the statistics xx^T and log |X| are not invertible. The remedy we suggest here is to make a first order Taylor series expansion of (113) with respect to the new variables

  Z ≜ X^{−1},    (115a)
  N^j ≜ (y^j − Hx)(y^j − Hx)^T, j = 1, …, m,    (115b)

around the corresponding nominal values

  Ẑ ≜ X̂^{−1},    (116a)
  N̂^j ≜ (y^j − Hx̂)(y^j − Hx̂)^T, j = 1, …, m.    (116b)

Writing (113) in terms of Z and N ≜ [N^1, …, N^m], we obtain

  −2 ∑_{j=1}^m log N(y^j; Hx, sX + R)
  =⁺ m log |Z^{−1}| + m log |sI + Z^{1/2} R Z^{1/2}| + ∑_{j=1}^m tr(N^j (sZ^{−1} + R)^{−1}).    (117)

If we make a first order Taylor series expansion of the second and third terms on the right-hand side of (117) with respect to the variables Z and N around Ẑ and N̂, we obtain

  −2 ∑_{j=1}^m log N(y^j; Hx, sX + R) ≈⁺ m log |Z^{−1}| + m tr((sR^{−1} + Ẑ)^{−1} Z)
  + ∑_{j=1}^m tr(N^j (sẐ^{−1} + R)^{−1}) + s ∑_{j=1}^m tr(Ẑ^{−1}(sẐ^{−1} + R)^{−1} N̂^j (sẐ^{−1} + R)^{−1} Ẑ^{−1} Z),    (118)

where ≈⁺ represents an approximation up to an additive constant. The first order Taylor series approximations for the scalar-valued functions of matrix variables of interest are given in Appendix B. We now substitute the relations (115) and (116) back into (118) to obtain the approximation

  −2 ∑_{j=1}^m log N(y^j; Hx, sX + R) ≈⁺ m log |X| + m tr((sR^{−1} + X̂^{−1})^{−1} X^{−1})
  + ∑_{j=1}^m tr((y^j − Hx)(y^j − Hx)^T (sX̂ + R)^{−1})
  + s ∑_{j=1}^m tr(X̂(sX̂ + R)^{−1} (y^j − Hx̂)(y^j − Hx̂)^T (sX̂ + R)^{−1} X̂ X^{−1})    (119)

  = m log |X| + m tr((sR^{−1} + X̂^{−1})^{−1} X^{−1}) + ∑_{j=1}^m (y^j − Hx)^T (sX̂ + R)^{−1} (y^j − Hx)
  + s tr(X̂(sX̂ + R)^{−1} (∑_{j=1}^m (y^j − Hx̂)(y^j − Hx̂)^T) (sX̂ + R)^{−1} X̂ X^{−1}).    (120)

Rearranging the terms in (120), dividing both sides by −2 and then taking the exponential of both sides, we can see that

  ∏_{j=1}^m N(y^j; Hx, sX + R) ≈ˣ (∏_{j=1}^m N(y^j; Hx, sX̂ + R)) IW(X; m, M)    (121)

where the sign ≈ˣ denotes an approximation up to a multiplicative constant and M is given as

  M ≜ m(sR^{−1} + X̂^{−1})^{−1} + sX̂(sX̂ + R)^{−1} (∑_{j=1}^m (y^j − Hx̂)(y^j − Hx̂)^T) (sX̂ + R)^{−1} X̂.    (122)

Another form of the inverse Wishart parameter M can be written using the matrix inversion lemma as follows:

  M = mX̂ + msX̂(sX̂ + R)^{−1} ((1/m) ∑_{j=1}^m (y^j − Hx̂)(y^j − Hx̂)^T − (sX̂ + R)) (sX̂ + R)^{−1} X̂.    (123)

The proof is complete. ∎
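The equivalence of (122) and (123) can be checked numerically; a sketch (Python/NumPy, with illustrative values):

```python
import numpy as np

# Numeric check that (122) and (123) give the same inverse Wishart parameter M.
rng = np.random.default_rng(5)
d, m, s = 2, 10, 0.25
Xh = np.array([[4.0, 1.0], [1.0, 3.0]])          # nominal extent X-hat
R = np.array([[2.0, 0.0], [0.0, 2.0]])
N_sum = sum(np.outer(e, e) for e in rng.normal(size=(m, d)) * 2.0)

C = np.linalg.inv(s * Xh + R)
M_122 = m * np.linalg.inv(s * np.linalg.inv(R) + np.linalg.inv(Xh)) \
        + s * Xh @ C @ N_sum @ C @ Xh
M_123 = m * Xh + m * s * Xh @ C @ (N_sum / m - (s * Xh + R)) @ C @ Xh
assert np.allclose(M_122, M_123)
print(M_122)
```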

B. First Order Taylor Series Approximations for Some Scalar Valued Functions of Matrix Variables

In this section, first order Taylor series approximations of some scalar-valued functions of matrix arguments are studied. The two functions we consider are

• Function 1: f_1(Z) ≜ log |sI + Z^{1/2} R Z^{1/2}|,    (124)
• Function 2: f_2(Z) ≜ tr(N(sZ^{−1} + R)^{−1}).    (125)

Both functions can be approximated with a first order Taylor series approximation around a nominal point Ẑ as follows:

  f_k(Z) ≈ f_k(Ẑ) + tr(F_k^T (Z − Ẑ))    (126)

for k = 1, 2, where F_k is defined as

  [F_k]_{ij} ≜ ∂f_k/∂z_{ij} |_{Z=Ẑ}    (127)

for 1 ≤ i, j ≤ d, where the notation [·]_{ij} denotes the element in the ith row and jth column of the argument matrix and z_{ij} ≜ [Z]_{ij}. Hence, in order to construct the required Taylor series approximation, we only need to calculate the matrices F_1 and F_2.

1) Calculation of F_1: We first write f_1(·) as

  f_1(Z) ≜ log |sI + Z^{1/2} R Z^{1/2}|    (128)
        = log |Z| + log |sZ^{−1} + R|.    (129)

Now, we can calculate the related derivative using two well-known matrix derivatives,

  ∂ ln |U| / ∂x = tr(U^{−1} ∂U/∂x),    (130)
  ∂U^{−1}/∂x = −U^{−1} (∂U/∂x) U^{−1}.    (131)

We obtain

  ∂f_1/∂z_{ij} = tr(Z^{−1} ∂Z/∂z_{ij}) + tr((sZ^{−1} + R)^{−1} ∂(sZ^{−1} + R)/∂z_{ij})    (132)
  = tr(Z^{−1} E_{ij}) + s tr((sZ^{−1} + R)^{−1} ∂Z^{−1}/∂z_{ij})    (133)
  = [Z^{−1}]_{ji} − s tr((sZ^{−1} + R)^{−1} Z^{−1} (∂Z/∂z_{ij}) Z^{−1})    (134)
  = [Z^{−1}]_{ji} − s tr(Z^{−1}(sZ^{−1} + R)^{−1} Z^{−1} E_{ij})    (135)
  = [Z^{−1}]_{ji} − s [Z^{−1}(sZ^{−1} + R)^{−1} Z^{−1}]_{ji}    (136)
  = [Z^{−1} − sZ^{−1}(sZ^{−1} + R)^{−1} Z^{−1}]_{ji}    (137)
  = [(Z + sR^{−1})^{−1}]_{ji},    (138)

where E_{ij} is a d × d matrix filled with zeros except for the ijth element, which is unity. The last expression (138) is equivalent to

  F_1 = (Z + sR^{−1})^{−T},    (139)

which gives

  log |sI + Z^{1/2} R Z^{1/2}| ≈ log |sI + RẐ| + tr((Ẑ + sR^{−1})^{−1}(Z − Ẑ)).    (140)

2) Calculation of F_2:

  ∂f_2/∂z_{ij} = ∂ tr(N(sZ^{−1} + R)^{−1}) / ∂z_{ij}    (141)
  = tr(N ∂(sZ^{−1} + R)^{−1}/∂z_{ij})    (142)
  = −tr(N(sZ^{−1} + R)^{−1} (∂(sZ^{−1} + R)/∂z_{ij}) (sZ^{−1} + R)^{−1})    (143)
  = −s tr(N(sZ^{−1} + R)^{−1} (∂Z^{−1}/∂z_{ij}) (sZ^{−1} + R)^{−1})    (144)
  = s tr(N(sZ^{−1} + R)^{−1} Z^{−1} (∂Z/∂z_{ij}) Z^{−1} (sZ^{−1} + R)^{−1})    (145)
  = s tr(N(sZ^{−1} + R)^{−1} Z^{−1} E_{ij} Z^{−1} (sZ^{−1} + R)^{−1})    (146)
  = s tr(Z^{−1}(sZ^{−1} + R)^{−1} N (sZ^{−1} + R)^{−1} Z^{−1} E_{ij})    (147)
  = s [Z^{−1}(sZ^{−1} + R)^{−1} N (sZ^{−1} + R)^{−1} Z^{−1}]_{ji}.    (148)

The last expression (148) is equivalent to

  F_2 = s (Z^{−1}(sZ^{−1} + R)^{−1} N (sZ^{−1} + R)^{−1} Z^{−1})^T,    (149)

which gives

  tr(N(sZ^{−1} + R)^{−1}) ≈ tr(N(sẐ^{−1} + R)^{−1}) + s tr(Ẑ^{−1}(sẐ^{−1} + R)^{−1} N (sẐ^{−1} + R)^{−1} Ẑ^{−1} (Z − Ẑ)).    (150)
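The gradients (139) and (149) are easy to validate by finite differences, treating the entries z_{ij} as independent as in (127); a sketch (symmetric positive definite test matrices, illustrative values):

```python
import numpy as np

# Finite-difference check of F1 = (Z + s R^-1)^-T, cf. (139), and F2 in (149).
rng = np.random.default_rng(6)
d, s, eps = 3, 0.25, 1e-6
A = rng.normal(size=(d, d)); Z = A @ A.T + d * np.eye(d)   # SPD test point
B = rng.normal(size=(d, d)); R = B @ B.T + d * np.eye(d)
N = np.eye(d)

def f1(Z):
    return np.linalg.slogdet(Z)[1] + np.linalg.slogdet(s * np.linalg.inv(Z) + R)[1]

def f2(Z):
    return np.trace(N @ np.linalg.inv(s * np.linalg.inv(Z) + R))

F1 = np.linalg.inv(Z + s * np.linalg.inv(R)).T             # (139)
ZiCi = np.linalg.inv(Z) @ np.linalg.inv(s * np.linalg.inv(Z) + R)
F2 = s * (ZiCi @ N @ ZiCi.T).T                             # (149)

for f, F in ((f1, F1), (f2, F2)):
    num = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            E = np.zeros((d, d)); E[i, j] = eps
            num[i, j] = (f(Z + E) - f(Z - E)) / (2 * eps)
    assert np.allclose(num, F, atol=1e-4)
print("gradients (139) and (149) match finite differences")
```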