
Estimation Theory

Alireza Karimi

Laboratoire d’Automatique, MEC2 397, email: alireza.karimi@epfl.ch

Spring 2013

Course Objective

Extract information from noisy data.

Parameter Estimation Problem : Given a set of measured data

{x[0], x[1],...,x[N − 1]}

which depends on an unknown vector θ, determine an estimator

θˆ = g(x[0], x[1],...,x[N − 1])

where g is some function.

Applications : Image processing, communications, biomedicine, system identification, state estimation in control, etc.

Some Examples

Range estimation : We transmit a pulse that is reflected by the aircraft. An echo is received after τ seconds. The range θ is estimated from the equation θ = τc/2, where c is the speed of light.

System identification : The input of a plant is excited with u and the output y is measured. If we have :

y[k] = G(q^{-1}, θ) u[k] + n[k]

where n[k] is the measurement noise. The model parameters are estimated using u[k] and y[k] for k = 1, ..., N.

DC level in noise : Consider a set of data {x[0], x[1], ..., x[N − 1]} that can be modeled as x[n] = A + w[n], where w[n] is some zero-mean noise process. A can be estimated using the measured data set.

Outline

Classical estimation (θ deterministic) :
Minimum Variance Unbiased Estimator (MVU)
Cramer-Rao Lower Bound (CRLB)
Best Linear Unbiased Estimator (BLUE)
Maximum Likelihood Estimator (MLE)
Least Squares Estimator (LSE)

Bayesian estimation (θ random) :
Minimum Mean Square Error Estimator (MMSE)
Maximum A Posteriori Estimator (MAP)
Linear MMSE Estimator

References

Main reference : Fundamentals of Statistical Signal Processing: Estimation Theory, by Steven M. Kay, Prentice-Hall, 1993 (available in the Library de La Fontaine, RLC). We cover Chapters 1 to 14, skipping Chapter 5 and Chapter 9.

Other references :
Lessons in Estimation Theory for Signal Processing, Communications and Control, by Jerry M. Mendel, Prentice-Hall, 1995.
Probability, Random Processes, and Estimation Theory for Engineers, by Henry Stark and John W. Woods, Prentice-Hall, 1986.

Review of Probability and Random Variables

Random Variables

Random Variable : A rule X(·) that assigns to every element of a sample space Ω a real value is called a RV. So X is not really a variable that varies randomly but a function whose domain is Ω and whose range is some subset of the real line.

Example : Consider the experiment of flipping a coin twice. The sample space (the possible outcomes) is :

Ω={HH, HT , TH, TT }

We can define a random variable X such that

X(HH) = 1, X(HT) = 1.1, X(TH) = 1.6, X(TT) = 1.8

Random variable X assigns to each event (e.g. E = {HT , TH}⊂Ω) a subset of the real line (in this case B = {1.1, 1.6}).

Probability Distribution Function

For any element ζ in Ω, the event {ζ | X(ζ) ≤ x} is an important event. The probability of this event,

Pr[{ζ | X(ζ) ≤ x}] = P_X(x),

is called the probability distribution function of X.

Example : For the random variable defined earlier, we have :

P_X(1.5) = Pr[{ζ | X(ζ) ≤ 1.5}] = Pr[{HH, HT}] = 0.5

P_X(x) can be computed for all x ∈ R. It is clear that 0 ≤ P_X(x) ≤ 1.

Remark : For the same experiment (flipping a coin twice) we could define another random variable that would lead to a different P_X(x). In most engineering problems the sample space is a subset of the real line, so X(ζ) = ζ and P_X(x) is a continuous function of x.

Probability Density Function (PDF)

The Probability Density Function, if it exists, is given by :

p_X(x) = \frac{dP_X(x)}{dx}

When we deal with a single random variable the subscripts are removed : p(x) = \frac{dP(x)}{dx}

Properties :
(i) \int_{-\infty}^{\infty} p(x)\,dx = P(\infty) - P(-\infty) = 1
(ii) Pr[{ζ | X(ζ) ≤ x}] = Pr[X ≤ x] = P(x) = \int_{-\infty}^{x} p(\alpha)\,d\alpha
(iii) Pr[x_1 < X ≤ x_2] = \int_{x_1}^{x_2} p(x)\,dx

Gaussian Probability Density Function

A random variable is distributed according to a Gaussian or normal distribution if the PDF is given by :

p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

The PDF has two parameters : µ, the mean, and σ^2, the variance.

We write X ∼ N(µ, σ^2) when the random variable X has a normal (Gaussian) distribution with mean µ and standard deviation σ. A small σ means small variability (uncertainty) and a large σ means large variability.

Remark : The Gaussian distribution is important because, according to the Central Limit Theorem, the sum of N independent RVs has a PDF that converges to a Gaussian distribution when N goes to infinity.

Some Other Common PDFs

Uniform (b > a) : p(x) = \frac{1}{b-a} for a < x < b, and 0 otherwise

Exponential (µ > 0) : p(x) = \frac{1}{\mu}\exp(-x/\mu)\,u(x)

Rayleigh (σ > 0) : p(x) = \frac{x}{\sigma^2}\exp\left(-\frac{x^2}{2\sigma^2}\right)u(x)

Chi-square \chi_n^2 : p(x) = \frac{1}{2^{n/2}\Gamma(n/2)}\,x^{n/2-1}\exp(-x/2) for x > 0, and 0 for x < 0

where \Gamma(z) = \int_0^{\infty} t^{z-1}e^{-t}\,dt and u(x) is the unit step function.

Joint, Marginal and Conditional PDF

Joint PDF : Consider two random variables X and Y ; then :

Pr[x_1 < X ≤ x_2 and y_1 < Y ≤ y_2] = \int_{x_1}^{x_2}\int_{y_1}^{y_2} p(x, y)\,dy\,dx

Marginal PDF :

p(x) = \int_{-\infty}^{\infty} p(x, y)\,dy and p(y) = \int_{-\infty}^{\infty} p(x, y)\,dx

Conditional PDF : p(x|y) is defined as the PDF of X conditioned on knowing the value of Y.

Bayes' Formula : Consider two RVs defined on the same probability space ; then we have :

p(x, y) = p(x|y)p(y) = p(y|x)p(x) or p(x|y) = \frac{p(x, y)}{p(y)}

Independent Random Variables

Two RVs X and Y are independent if and only if :

p(x, y)=p(x)p(y)

A direct conclusion is that :

p(x|y) = \frac{p(x, y)}{p(y)} = \frac{p(x)p(y)}{p(y)} = p(x) and p(y|x) = p(y)

which means conditioning does not change the PDF.

Remark : For a joint Gaussian PDF the contours of constant density are ellipses centered at (µ_x, µ_y). For independent X and Y the major (or minor) axis is parallel to the x or y axis.

Expected Value of a Random Variable

The expected value, if it exists, of a random variable X with PDF p(x) is defined by :

E(X) = \int_{-\infty}^{\infty} x\,p(x)\,dx

Some properties of the expected value :
E{X + Y} = E{X} + E{Y}
E{aX} = aE{X}
The expected value of Y = g(X) can be computed by : E(Y) = \int_{-\infty}^{\infty} g(x)p(x)\,dx

Conditional expectation : The conditional expectation of X given that a specific value of Y has occurred is :

E(X|Y) = \int_{-\infty}^{\infty} x\,p(x|y)\,dx

Moments of a Random Variable

The rth moment of X is defined as :

E(X^r) = \int_{-\infty}^{\infty} x^r p(x)\,dx

The first moment of X is its expected value or the mean (µ = E(X )).

Moments of Gaussian RVs : A Gaussian RV with N(µ, σ^2) has moments of all orders in closed form :
E(X) = µ
E(X^2) = µ^2 + σ^2
E(X^3) = µ^3 + 3µσ^2
E(X^4) = µ^4 + 6µ^2σ^2 + 3σ^4
E(X^5) = µ^5 + 10µ^3σ^2 + 15µσ^4
E(X^6) = µ^6 + 15µ^4σ^2 + 45µ^2σ^4 + 15σ^6

Central Moments

The rth central moment of X is defined as :

E[(X − µ)^r] = \int_{-\infty}^{\infty} (x − µ)^r p(x)\,dx

The second central moment (variance) is denoted by σ2 or var(X ).

Central Moments of Gaussian RVs :

E[(X − µ)^r] = 0 if r is odd, and σ^r (r − 1)!! if r is even

where n!! denotes the double factorial, that is, the product of every odd number from n down to 1.

Some Properties of Gaussian RVs

If X ∼ N(µ_x, σ_x^2) then Z = (X − µ_x)/σ_x ∼ N(0, 1).
If Z ∼ N(0, 1) then X = σ_x Z + µ_x ∼ N(µ_x, σ_x^2).
If X ∼ N(µ_x, σ_x^2) then Z = aX + b ∼ N(aµ_x + b, a^2σ_x^2).
If X ∼ N(µ_x, σ_x^2) and Y ∼ N(µ_y, σ_y^2) are two independent RVs, then aX + bY ∼ N(aµ_x + bµ_y, a^2σ_x^2 + b^2σ_y^2).

The sum of squares of n independent RVs with standard normal distribution N(0, 1) has a χ_n^2 distribution with n degrees of freedom.
For large values of n, χ_n^2 converges to N(n, 2n).
The Euclidean norm \sqrt{X^2 + Y^2} of two independent RVs with standard normal distribution has the Rayleigh distribution.

Covariance

For two RVs X and Y, the covariance is defined as

σ_{xy} = E[(X − µ_x)(Y − µ_y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (x − µ_x)(y − µ_y)\,p(x, y)\,dx\,dy

If X and Y are zero mean then σ_{xy} = E{XY}.
var(X + Y) = σ_x^2 + σ_y^2 + 2σ_{xy}
var(aX) = a^2 σ_x^2

Important formula : The relation between the variance and the mean of X is given by

σ^2 = E[(X − µ)^2] = E(X^2) − 2µE(X) + µ^2 = E(X^2) − µ^2

The variance is the mean of the square minus the square of the mean.

Independence, Uncorrelatedness and Orthogonality

If σ_{xy} = 0, then X and Y are uncorrelated and E{XY} = E{X}E{Y}.

X and Y are called orthogonal if E{XY} = 0. If X and Y are independent then they are uncorrelated.

p(x, y)=p(x)p(y) ⇒ E{XY } = E{X }E{Y }

Uncorrelatedness does not imply independence. For example, if X is a normal RV with zero mean and Y = X^2, we have p(y|x) ≠ p(y) but

σ_{xy} = E{XY} − E{X}E{Y} = E{X^3} − 0 = 0

Correlation only captures the linear dependence between two RVs, so it is weaker than independence. For jointly Gaussian RVs, independence is equivalent to being uncorrelated.

Random Vectors

Random Vector : is a vector of random variables¹ :

x = [x_1, x_2, ..., x_n]^T

Expectation Vector : µ_x = E(x) = [E(x_1), E(x_2), ..., E(x_n)]^T
Covariance Matrix : C_x = E[(x − µ_x)(x − µ_x)^T]

C_x is an n × n symmetric matrix which is assumed to be positive definite and so invertible.

The elements of this matrix are : [C_x]_{ij} = E{[x_i − E(x_i)][x_j − E(x_j)]}.

If the random variables are uncorrelated then C_x is a diagonal matrix.

Multivariate Gaussian PDF :

p(x) = \frac{1}{\sqrt{(2\pi)^n \det(C_x)}} \exp\left(-\frac{1}{2}(x − µ_x)^T C_x^{-1}(x − µ_x)\right)

1. In some books (including our main reference) there is no distinction between the random variable X and its specific value x. From now on we adopt the notation of our reference.

Random Processes

Discrete Random Process : x[n] is a sequence of random variables defined for every integer n.
Mean value : is defined as E(x[n]) = µ_x[n].
Autocorrelation Function (ACF) : is defined as

r_{xx}[k, n] = E(x[n]\,x[n + k])

Wide Sense Stationary (WSS) : x[n] is WSS if its mean and its autocorrelation function (ACF) do not depend on n.
Autocovariance function : is defined as

c_{xx}[k] = E[(x[n] − µ_x)(x[n + k] − µ_x)] = r_{xx}[k] − µ_x^2

Cross-correlation Function (CCF) : is defined as

r_{xy}[k] = E(x[n]\,y[n + k])

Cross-covariance function : is defined as

c_{xy}[k] = E[(x[n] − µ_x)(y[n + k] − µ_y)] = r_{xy}[k] − µ_x µ_y

Discrete Random Processes

Some properties of ACF and CCF :

r_{xx}[0] ≥ |r_{xx}[k]|, r_{xx}[k] = r_{xx}[−k], r_{xy}[k] = r_{yx}[−k]

Power Spectral Density : The Fourier transform of the ACF and CCF gives the Auto-PSD and Cross-PSD :

P_{xx}(f) = \sum_{k=-\infty}^{\infty} r_{xx}[k]\exp(-j2\pi fk)

P_{xy}(f) = \sum_{k=-\infty}^{\infty} r_{xy}[k]\exp(-j2\pi fk)

Discrete White Noise : is a discrete random process with zero mean and r_{xx}[k] = σ^2 δ[k], where δ[k] is the Kronecker impulse function. The PSD of white noise becomes P_{xx}(f) = σ^2 and is completely flat with frequency.
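To make this concrete, here is a small numerical sketch (not from the slides; the length and variance are illustrative) that estimates r_xx[0], r_xx[1] and the periodogram of a simulated white sequence:

```python
# A sketch estimating the ACF of white noise and checking that the periodogram
# is roughly flat around sigma^2. All values are illustrative.
import numpy as np

rng = np.random.default_rng(11)
N, sigma2 = 4096, 2.0
w = np.sqrt(sigma2) * rng.standard_normal(N)

r0 = np.mean(w * w)                      # estimate of r_xx[0] = sigma^2
r1 = np.mean(w[:-1] * w[1:])             # estimate of r_xx[1], close to 0
Pxx = np.abs(np.fft.rfft(w))**2 / N      # periodogram, mean approximately sigma^2
print(r0, r1, Pxx.mean())
```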

Introduction and Minimum Variance Unbiased Estimation

The Mathematical Estimation Problem

Parameter Estimation Problem : Given a set of measured data x = {x[0], x[1], ..., x[N − 1]} which depends on an unknown parameter vector θ, determine an estimator θ̂ = g(x[0], x[1], ..., x[N − 1]) where g is some function.

The first step is to find the PDF of the data as a function of θ : p(x; θ).

Example : Consider the problem of a DC level in white Gaussian noise with one observed data sample, x[0] = θ + w[0], where w[0] has the PDF N(0, σ^2). Then the PDF of x[0] is :

p(x[0]; θ) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x[0] − θ)^2\right)


Example : Consider a data sequence that can be modeled with a linear trend in white Gaussian noise

x[n]=A + Bn + w[n] n =0, 1,...,N − 1

Suppose that w[n] ∼ N(0, σ^2) and is uncorrelated with all the other samples. Letting θ = [A B]^T and x = [x[0], x[1], ..., x[N − 1]]^T, the PDF is :

p(x; θ) = \prod_{n=0}^{N-1} p(x[n]; θ) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n] − A − Bn)^2\right)

The quality of any estimator for this problem is related to the assumptions on the data model; in this example, the linear trend and the WGN PDF assumption.


Classical versus Bayesian estimation
If we assume θ is deterministic we have a classical estimation problem. The following methods will be studied : MVU, MLE, BLUE, LSE.
If we assume θ is a random variable with a known PDF, then we have a Bayesian estimation problem. In this case the data are described by the joint PDF

p(x,θ)=p(x|θ)p(θ)

where p(θ) summarizes our knowledge about θ before any data is observed and p(x|θ) summarizes our knowledge provided by data x conditioned on knowing θ. The following methods will be studied : MMSE, MAP, Kalman Filter.

Assessing Estimator Performance

Consider the problem of estimating a DC level A in uncorrelated noise :

x[n]=A + w[n] n =0, 1,...,N − 1

Consider the following estimators :

Â_1 = \frac{1}{N}\sum_{n=0}^{N-1} x[n]

Â_2 = x[0]

Suppose that A = 1, Â_1 = 0.95 and Â_2 = 0.98. Which estimator is better ?

An estimator is a random variable, so its performance can only be described by its PDF or statistically (e.g. by Monte-Carlo simulation).
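As a concrete illustration, here is a minimal Monte-Carlo sketch (my own; A, sigma, N and the number of runs M are illustrative choices) comparing Â_1 and Â_2 statistically:

```python
# Monte-Carlo comparison of A1_hat = sample mean and A2_hat = x[0]
# for x[n] = A + w[n], w[n] ~ N(0, sigma^2).
import numpy as np

rng = np.random.default_rng(0)
A, sigma, N, M = 1.0, 1.0, 20, 10000     # true level, noise std, data length, MC runs

x = A + sigma * rng.standard_normal((M, N))   # M independent realizations
A1 = x.mean(axis=1)                           # sample-mean estimator
A2 = x[:, 0]                                  # single-sample estimator

# Both are (approximately) unbiased, but var(A1) ~ sigma^2/N while var(A2) ~ sigma^2.
print("mean:", A1.mean(), A2.mean())
print("var :", A1.var(), A2.var(), "theory:", sigma**2 / N, sigma**2)
```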

Unbiased Estimators

An estimator that on the average yields the true value is unbiased. Mathematically :

E(θ − θ̂) = 0 for a < θ < b

Let's compute the expectation of the two estimators Â_1 and Â_2 :

E(Â_1) = \frac{1}{N}\sum_{n=0}^{N-1} E(x[n]) = \frac{1}{N}\sum_{n=0}^{N-1} E(A + w[n]) = \frac{1}{N}\sum_{n=0}^{N-1}(A + 0) = A

E(Â_2) = E(x[0]) = E(A + w[0]) = A + 0 = A

Both estimators are unbiased. Which one is better ? Now, let's compute the variance of the two estimators :

var(Â_1) = var\left(\frac{1}{N}\sum_{n=0}^{N-1} x[n]\right) = \frac{1}{N^2}\sum_{n=0}^{N-1} var(x[n]) = \frac{1}{N^2}\,N\sigma^2 = \frac{\sigma^2}{N}

var(Â_2) = var(x[0]) = σ^2 > var(Â_1)


Remark : When several unbiased estimators of the same parameter from independent sets of data are available, i.e., θ̂_1, θ̂_2, ..., θ̂_n, a better estimator can be obtained by averaging :

θ̂ = \frac{1}{n}\sum_{i=1}^{n} θ̂_i ⇒ E(θ̂) = θ

Assuming that the estimators have the same variance, we have :

var(θ̂) = \frac{1}{n^2}\sum_{i=1}^{n} var(θ̂_i) = \frac{1}{n^2}\,n\,var(θ̂_i) = \frac{var(θ̂_i)}{n}

By increasing n, the variance will decrease (if n → ∞, θ̂ → θ). This is not the case for biased estimators, no matter how many estimators are averaged.

Minimum Variance Criterion

The most logical criterion for estimation is the Mean Square Error (MSE) :

mse(θˆ)=E[(θˆ − θ)2]

Unfortunately this criterion generally leads to unrealizable estimators (the estimator will depend on θ).

mse(θˆ)=E{[θˆ − E(θˆ)+E(θˆ) − θ]2} = E{[θˆ − E(θˆ)+b(θ)]2}

where b(θ) = E(θ̂) − θ is defined as the bias of the estimator. Therefore :

mse(θˆ)=E{[θˆ − E(θˆ)]2} +2b(θ)E[θˆ − E(θˆ)] + b2(θ)=var(θˆ)+b2(θ)

Instead of minimizing MSE we can minimize the variance of the unbiased estimators : Minimum Variance Unbiased Estimator

Minimum Variance Unbiased Estimator

Existence of the MVU Estimator : In general the MVU estimator does not always exist. There may be no unbiased estimator at all, or none of the unbiased estimators has uniformly minimum variance.

Finding the MVU Estimator : There is no known procedure which always leads to the MVU estimator. Three existing approaches are :
1. Determine the Cramer-Rao lower bound (CRLB) and check to see if some estimator satisfies it.
2. Apply the Rao-Blackwell-Lehmann-Scheffe theorem (we will skip it).
3. Restrict to linear unbiased estimators.

Cramer-Rao Lower Bound

CRLB is a lower bound on the variance of any unbiased estimator.

var(θˆ) ≥ CRLB(θ)

Note that the CRLB is a function of θ. It tells us what is the best performance that can be achieved (useful in feasibility study and comparison with other estimators). It may lead us to compute the MVU estimator.

Theorem (scalar case)
Assume that the PDF p(x; θ) satisfies the regularity condition

E\left[\frac{\partial \ln p(x; θ)}{\partial θ}\right] = 0 for all θ

Then the variance of any unbiased estimator θ̂ satisfies

var(θ̂) ≥ \left(-E\left[\frac{\partial^2 \ln p(x; θ)}{\partial θ^2}\right]\right)^{-1}

An unbiased estimator that attains the CRLB can be found iff :

\frac{\partial \ln p(x; θ)}{\partial θ} = I(θ)(g(x) − θ)

for some functions g(x) and I(θ). The estimator is θ̂ = g(x) and the minimum variance is 1/I(θ).

Example : Consider x[0] = A + w[0] with w[0] ∼ N(0, σ^2).

p(x[0]; A) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x[0] − A)^2\right)

\ln p(x[0]; A) = -\ln\sqrt{2\pi\sigma^2} - \frac{1}{2\sigma^2}(x[0] − A)^2

Then

\frac{\partial \ln p(x[0]; A)}{\partial A} = \frac{1}{\sigma^2}(x[0] − A) ⇒ -\frac{\partial^2 \ln p(x[0]; A)}{\partial A^2} = \frac{1}{\sigma^2}

According to the theorem :

var(Â) ≥ σ^2, I(A) = \frac{1}{\sigma^2}, and Â = g(x[0]) = x[0]


Example : Consider multiple observations of a DC level in WGN : x[n] = A + w[n], n = 0, 1, ..., N − 1, with w[n] ∼ N(0, σ^2).

p(x; A) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n] − A)^2\right)

Then

\frac{\partial \ln p(x; A)}{\partial A} = \frac{\partial}{\partial A}\left[-\ln\left((2\pi\sigma^2)^{N/2}\right) - \frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n] − A)^2\right] = \frac{1}{\sigma^2}\sum_{n=0}^{N-1}(x[n] − A) = \frac{N}{\sigma^2}\left(\frac{1}{N}\sum_{n=0}^{N-1} x[n] − A\right)

According to the theorem :

var(Â) ≥ \frac{\sigma^2}{N}, I(A) = \frac{N}{\sigma^2}, and Â = g(x) = \frac{1}{N}\sum_{n=0}^{N-1} x[n]

Transformation of Parameters

If it is desired to estimate α = g(θ), then the CRLB is :

var(α̂) ≥ \left(\frac{\partial g}{\partial θ}\right)^2 \left(-E\left[\frac{\partial^2 \ln p(x; θ)}{\partial θ^2}\right]\right)^{-1}

Example : Compute the CRLB for estimation of the power (A^2) of a DC level in noise :

var(\widehat{A^2}) ≥ (2A)^2\,\frac{\sigma^2}{N} = \frac{4A^2\sigma^2}{N}

Definition
Efficient estimator : An unbiased estimator that attains the CRLB is said to be efficient.

Example : Knowing that x̄ = \frac{1}{N}\sum_{n=0}^{N-1} x[n] is an efficient estimator for A, is x̄^2 an efficient estimator for A^2 ?

Transformation of Parameters

Solution : Knowing that x̄ ∼ N(A, σ^2/N), we have :

E(x̄^2) = E^2(x̄) + var(x̄) = A^2 + \frac{\sigma^2}{N} ≠ A^2

So the estimator \widehat{A^2} = x̄^2 is not even unbiased. Let's look at the variance of this estimator :

var(x̄^2) = E(x̄^4) − E^2(x̄^2)

but we have from the moments of Gaussian RVs (slide 15) :

E(x̄^4) = A^4 + 6A^2\frac{\sigma^2}{N} + 3\left(\frac{\sigma^2}{N}\right)^2

Therefore :

var(x̄^2) = A^4 + 6A^2\frac{\sigma^2}{N} + \frac{3\sigma^4}{N^2} − \left(A^2 + \frac{\sigma^2}{N}\right)^2 = \frac{4A^2\sigma^2}{N} + \frac{2\sigma^4}{N^2}


Remarks : The estimator \widehat{A^2} = x̄^2 is biased and not efficient. As N → ∞ the bias goes to zero and the variance of the estimator approaches the CRLB. This type of estimator is called asymptotically efficient.

General Remarks :
If g(θ) = aθ + b is an affine function of θ, then \widehat{g(θ)} = g(θ̂) is an efficient estimator. First, it is unbiased : E(aθ̂ + b) = aθ + b = g(θ). Moreover :

var(\widehat{g(θ)}) ≥ \left(\frac{\partial g}{\partial θ}\right)^2 var(θ̂) = a^2 var(θ̂)

but var(g(θ̂)) = var(aθ̂ + b) = a^2 var(θ̂), so the CRLB is achieved.
If g(θ) is a nonlinear function of θ and θ̂ is an efficient estimator, then g(θ̂) is an asymptotically efficient estimator.

Cramer-Rao Lower Bound

Theorem (Vector Parameter)
Assume that the PDF p(x; θ) satisfies the regularity condition

E\left[\frac{\partial \ln p(x; θ)}{\partial θ}\right] = 0 for all θ

Then the covariance matrix of any unbiased estimator θ̂ satisfies C_θ̂ − I^{-1}(θ) ≥ 0, where ≥ 0 means that the matrix is positive semidefinite. I(θ) is called the Fisher information matrix and is given by :

[I(θ)]_{ij} = -E\left[\frac{\partial^2 \ln p(x; θ)}{\partial θ_i\,\partial θ_j}\right]

An unbiased estimator that attains the CRLB can be found iff :

\frac{\partial \ln p(x; θ)}{\partial θ} = I(θ)(g(x) − θ)

CRLB Extension to Vector Parameter

Example : Consider a DC level in WGN with A and σ^2 unknown. Compute the CRLB for estimation of θ = [A σ^2]^T.

\ln p(x; θ) = -\frac{N}{2}\ln 2\pi - \frac{N}{2}\ln σ^2 - \frac{1}{2σ^2}\sum_{n=0}^{N-1}(x[n] − A)^2

The Fisher information matrix is :

I(θ) = -E\begin{bmatrix} \frac{\partial^2 \ln p(x;θ)}{\partial A^2} & \frac{\partial^2 \ln p(x;θ)}{\partial A\,\partial σ^2} \\ \frac{\partial^2 \ln p(x;θ)}{\partial σ^2\,\partial A} & \frac{\partial^2 \ln p(x;θ)}{\partial (σ^2)^2} \end{bmatrix} = \begin{bmatrix} \frac{N}{σ^2} & 0 \\ 0 & \frac{N}{2σ^4} \end{bmatrix}

The matrix is diagonal (just for this example) and can be easily inverted to yield :

var(Â) ≥ \frac{σ^2}{N} and var(\hat{σ^2}) ≥ \frac{2σ^4}{N}

Is there any unbiased estimator that achieves these bounds ?

Transformation of Parameters

If it is desired to estimate α = g(θ), and the CRLB for the covariance of θ̂ is I^{-1}(θ), then :

C_α̂ ≥ \frac{\partial g}{\partial θ}\left(-E\left[\frac{\partial^2 \ln p(x; θ)}{\partial θ^2}\right]\right)^{-1}\frac{\partial g}{\partial θ}^T

Example : Consider a DC level in WGN with A and σ^2 unknown. Compute the CRLB for estimation of the signal-to-noise ratio α = A^2/σ^2.
We have θ = [A σ^2]^T and α = g(θ) = θ_1^2/θ_2 ; then the Jacobian is :

\frac{\partial g(θ)}{\partial θ} = \left[\frac{\partial g(θ)}{\partial θ_1}, \frac{\partial g(θ)}{\partial θ_2}\right] = \left[\frac{2A}{σ^2}, -\frac{A^2}{σ^4}\right]

So the CRLB is :

var(α̂) ≥ \left[\frac{2A}{σ^2}, -\frac{A^2}{σ^4}\right]\begin{bmatrix} \frac{σ^2}{N} & 0 \\ 0 & \frac{2σ^4}{N} \end{bmatrix}\begin{bmatrix} \frac{2A}{σ^2} \\ -\frac{A^2}{σ^4} \end{bmatrix} = \frac{4α + 2α^2}{N}

Linear Models with WGN

If the N observed data samples can be modeled as x = Hθ + w, where
x = N × 1 observation vector
H = N × p observation matrix (known, rank p)
θ = p × 1 vector of parameters to be estimated
w = N × 1 noise vector with PDF N(0, σ^2 I)
compute the CRLB and the MVU estimator that achieves this bound.

Step 1 : Compute \ln p(x; θ).
Step 2 : Compute I(θ) = -E\left[\frac{\partial^2 \ln p(x; θ)}{\partial θ^2}\right] and the covariance matrix of θ̂ : C_θ̂ = I^{-1}(θ).
Step 3 : Find the MVU estimator g(x) by factoring

\frac{\partial \ln p(x; θ)}{\partial θ} = I(θ)[g(x) − θ]

Step 1 : \ln p(x; θ) = -\ln\left((2\pi\sigma^2)^{N/2}\right) - \frac{1}{2\sigma^2}(x − Hθ)^T(x − Hθ)

Step 2 :

\frac{\partial \ln p(x; θ)}{\partial θ} = -\frac{1}{2\sigma^2}\frac{\partial}{\partial θ}\left[x^T x − 2x^T Hθ + θ^T H^T Hθ\right] = \frac{1}{\sigma^2}\left[H^T x − H^T Hθ\right]

Then I(θ) = -E\left[\frac{\partial^2 \ln p(x; θ)}{\partial θ^2}\right] = \frac{1}{\sigma^2}H^T H

Step 3 : Find the MVU estimator g(x) by factoring

\frac{\partial \ln p(x; θ)}{\partial θ} = I(θ)[g(x) − θ] = \frac{H^T H}{\sigma^2}\left[(H^T H)^{-1}H^T x − θ\right]

Therefore :

θ̂ = g(x) = (H^T H)^{-1}H^T x and C_θ̂ = I^{-1}(θ) = σ^2(H^T H)^{-1}
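A small numerical sketch of this result (my own example, reusing the linear-trend model x[n] = A + Bn + w[n] from an earlier slide; all numbers are illustrative):

```python
# Build H for x[n] = A + B*n + w[n], compute theta_hat = (H^T H)^{-1} H^T x
# and its covariance sigma^2 (H^T H)^{-1}.
import numpy as np

rng = np.random.default_rng(1)
N, A, B, sigma = 50, 2.0, 0.5, 1.0
n = np.arange(N)
x = A + B * n + sigma * rng.standard_normal(N)

H = np.column_stack([np.ones(N), n])            # N x 2 observation matrix
theta_hat = np.linalg.solve(H.T @ H, H.T @ x)   # MVU / LS estimate of [A, B]
C_theta = sigma**2 * np.linalg.inv(H.T @ H)     # covariance of the estimate
print(theta_hat, np.sqrt(np.diag(C_theta)))
```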


For a linear model with WGN represented by x = Hθ + w, the MVU estimator is :

θ̂ = (H^T H)^{-1}H^T x

This estimator is efficient and attains the CRLB. That the estimator is unbiased can be seen easily from :

E(θ̂) = (H^T H)^{-1}H^T E(Hθ + w) = θ

The statistical performance of θ̂ is completely specified because θ̂ is a linear transformation of a Gaussian vector x and hence has a Gaussian distribution :

θ̂ ∼ N(θ, σ^2(H^T H)^{-1})

Example (Curve Fitting)

Consider fitting the data x[n] by a p-th order polynomial in n :

x[n] = θ_0 + θ_1 n + θ_2 n^2 + ··· + θ_p n^p + w[n]

We have N data samples, so :
x = [x[0], x[1], ..., x[N − 1]]^T
w = [w[0], w[1], ..., w[N − 1]]^T
θ = [θ_0, θ_1, ..., θ_p]^T

Thus x = Hθ + w, where H is the N × (p + 1) matrix :

H = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 1 & \cdots & 1 \\ 1 & 2 & 4 & \cdots & 2^p \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & N-1 & (N-1)^2 & \cdots & (N-1)^p \end{bmatrix}

Hence the MVU estimator is : θ̂ = (H^T H)^{-1}H^T x

Example (Fourier Analysis)

Consider the Fourier analysis of the data x[n]:

x[n] = \sum_{k=1}^{M} a_k \cos\left(\frac{2\pi kn}{N}\right) + \sum_{k=1}^{M} b_k \sin\left(\frac{2\pi kn}{N}\right) + w[n]

so we have θ = [a_1, a_2, ..., a_M, b_1, b_2, ..., b_M]^T and x = Hθ + w where :

H = [h_1^a, h_2^a, ..., h_M^a, h_1^b, h_2^b, ..., h_M^b]

with the n-th entries (n = 0, 1, ..., N − 1) of the columns given by :

[h_k^a]_n = \cos\left(\frac{2\pi kn}{N}\right), \quad [h_k^b]_n = \sin\left(\frac{2\pi kn}{N}\right)

Hence the MVU estimate of the Fourier coefficients is : θ̂ = (H^T H)^{-1}H^T x

After simplification (noting that (H^T H)^{-1} = \frac{2}{N} I), we have :

θ̂ = \frac{2}{N}\left[(h_1^a)^T x, ..., (h_M^a)^T x, (h_1^b)^T x, ..., (h_M^b)^T x\right]^T

which is the same as the standard solution :

\hat{a}_k = \frac{2}{N}\sum_{n=0}^{N-1} x[n]\cos\left(\frac{2\pi kn}{N}\right), \quad \hat{b}_k = \frac{2}{N}\sum_{n=0}^{N-1} x[n]\sin\left(\frac{2\pi kn}{N}\right)

From the properties of linear models the estimates are unbiased. The covariance matrix is :

C_θ̂ = σ^2(H^T H)^{-1} = \frac{2σ^2}{N} I

Note that θ̂ is Gaussian and C_θ̂ is diagonal (the amplitude estimates are independent).

Example (System Identification)

Consider identification of a Finite Impulse Response (FIR) model h[k], for k = 0, 1, ..., p − 1, with input u[n] and output x[n] provided for n = 0, 1, ..., N − 1 :

x[n] = \sum_{k=0}^{p-1} h[k]\,u[n − k] + w[n], \quad n = 0, 1, ..., N − 1

The FIR model can be represented by the linear model x = Hθ + w where

θ = \begin{bmatrix} h[0] \\ h[1] \\ \vdots \\ h[p-1] \end{bmatrix}_{p \times 1}, \quad H = \begin{bmatrix} u[0] & 0 & \cdots & 0 \\ u[1] & u[0] & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ u[N-1] & u[N-2] & \cdots & u[N-p] \end{bmatrix}_{N \times p}

The MVU estimate is θ̂ = (H^T H)^{-1}H^T x with C_θ̂ = σ^2(H^T H)^{-1}.
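A sketch of this FIR example under assumed values (the true impulse response, input and noise level below are made up):

```python
# Form the convolution matrix H from u[n] and estimate theta = [h[0], ..., h[p-1]]
# by (H^T H)^{-1} H^T x.
import numpy as np

rng = np.random.default_rng(2)
N, p, sigma = 200, 4, 0.1
h_true = np.array([1.0, 0.5, -0.3, 0.1])
u = rng.standard_normal(N)                       # known input
x = np.convolve(u, h_true)[:N] + sigma * rng.standard_normal(N)

# H[n, k] = u[n - k] for n - k >= 0, else 0 (same structure as on the slide)
H = np.column_stack([np.concatenate([np.zeros(k), u[:N - k]]) for k in range(p)])
theta_hat = np.linalg.solve(H.T @ H, H.T @ x)
print(theta_hat)   # close to h_true for small sigma
```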

Linear Models with Colored Gaussian Noise

Determine the MVU estimator for the linear model x = Hθ + w, where w is colored Gaussian noise with PDF N(0, C).

Whitening approach : Since C is positive definite, its inverse can be factored as C^{-1} = D^T D, where D is an invertible matrix. This matrix acts as a whitening transformation for w :

E[(Dw)(Dw)^T] = E(Dww^T D^T) = DCD^T = DD^{-1}D^{-T}D^T = I

Now transform the linear model x = Hθ + w to :

x' = Dx = DHθ + Dw = H'θ + w'

where w' = Dw ∼ N(0, I) is white, and compute the MVU estimator as :

θ̂ = (H'^T H')^{-1}H'^T x' = (H^T D^T DH)^{-1}H^T D^T Dx

so we have :

θ̂ = (H^T C^{-1}H)^{-1}H^T C^{-1}x with C_θ̂ = (H'^T H')^{-1} = (H^T C^{-1}H)^{-1}
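A short sketch (the colored-noise covariance C below is my own toy choice) checking that whitening followed by ordinary LS gives the same estimate as the closed-form expression:

```python
# Factor C^{-1} = D^T D with a Cholesky factor and compare LS on (Dx, DH)
# with (H^T C^{-1} H)^{-1} H^T C^{-1} x.
import numpy as np

rng = np.random.default_rng(3)
N, p = 30, 2
H = rng.standard_normal((N, p))
idx = np.arange(N)
C = 0.9 ** np.abs(idx[:, None] - idx[None, :])   # illustrative covariance C[i,j] = 0.9^|i-j|
x = H @ np.array([1.0, -2.0]) + rng.multivariate_normal(np.zeros(N), C)

Cinv = np.linalg.inv(C)
D = np.linalg.cholesky(Cinv).T                   # C^{-1} = D^T D
theta_white = np.linalg.lstsq(D @ H, D @ x, rcond=None)[0]
theta_gls = np.linalg.solve(H.T @ Cinv @ H, H.T @ Cinv @ x)
print(np.allclose(theta_white, theta_gls))       # True
```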

Linear Models with Known Components

Consider a linear model x = Hθ + s + w, where s is a known signal. To determine the MVU estimator let x' = x − s, so that x' = Hθ + w is a standard linear model. The MVU estimator is :

θ̂ = (H^T H)^{-1}H^T(x − s) with C_θ̂ = σ^2(H^T H)^{-1}

Example : Consider a DC level and an exponential in WGN : x[n] = A + r^n + w[n], where r is known. Then we have :

\begin{bmatrix} x[0] \\ x[1] \\ \vdots \\ x[N-1] \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} A + \begin{bmatrix} 1 \\ r \\ \vdots \\ r^{N-1} \end{bmatrix} + \begin{bmatrix} w[0] \\ w[1] \\ \vdots \\ w[N-1] \end{bmatrix}

The MVU estimator is :

Â = (H^T H)^{-1}H^T(x − s) = \frac{1}{N}\sum_{n=0}^{N-1}(x[n] − r^n) with var(Â) = \frac{σ^2}{N}

Best Linear Unbiased Estimators (BLUE)

Problems with finding the MVU estimator :
The MVU estimator does not always exist or may be impossible to find.
The PDF of the data may be unknown.

The BLUE is a suboptimal estimator that :
restricts the estimate to be linear in the data : θ̂ = Ax
restricts the estimate to be unbiased : E(θ̂) = AE(x) = θ
minimizes the variance of the estimate ;
needs only the mean and the variance of the data (not the PDF). As a result, in general, the PDF of the estimate cannot be computed.

Remark : The unbiasedness restriction implies a linear model for the data. However, it may still be used if the data are transformed suitably or the model is linearized.

Finding the BLUE (Scalar Case)

1. Choose a linear estimator for the observed data x[n], n = 0, 1, ..., N − 1 :

θ̂ = \sum_{n=0}^{N-1} a_n x[n] = a^T x where a = [a_0, a_1, ..., a_{N-1}]^T

2. Restrict the estimate to be unbiased :

E(θ̂) = \sum_{n=0}^{N-1} a_n E(x[n]) = θ

3. Minimize the variance :

var(θ̂) = E{[θ̂ − E(θ̂)]^2} = E{[a^T x − a^T E(x)]^2} = E{a^T[x − E(x)][x − E(x)]^T a} = a^T Ca


Consider the problem of estimating the amplitude of a known signal in noise : x[n] = θ s[n] + w[n]

1. Choose a linear estimator : θ̂ = \sum_{n=0}^{N-1} a_n x[n] = a^T x
2. Restrict the estimate to be unbiased : E(θ̂) = a^T E(x) = a^T s\,θ = θ, so a^T s = 1, where s = [s[0], s[1], ..., s[N − 1]]^T
3. Minimize a^T Ca subject to a^T s = 1.

The constrained optimization can be solved using Lagrange multipliers : minimize J = a^T Ca + λ(a^T s − 1). The optimal solution is :

θ̂ = \frac{s^T C^{-1}x}{s^T C^{-1}s} and var(θ̂) = \frac{1}{s^T C^{-1}s}
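A minimal sketch of this scalar BLUE formula, with an assumed signal shape and noise variances of my own choosing:

```python
# BLUE theta_hat = s^T C^{-1} x / (s^T C^{-1} s) for x[n] = theta*s[n] + w[n].
import numpy as np

rng = np.random.default_rng(4)
N, theta = 100, 2.0
s = np.cos(2 * np.pi * 0.05 * np.arange(N))      # known signal shape (illustrative)
var_n = 0.5 + np.arange(N) / N                   # uncorrelated noise with varying variance
x = theta * s + np.sqrt(var_n) * rng.standard_normal(N)

Cinv = np.diag(1.0 / var_n)
theta_hat = (s @ Cinv @ x) / (s @ Cinv @ s)
var_hat = 1.0 / (s @ Cinv @ s)
print(theta_hat, var_hat)
```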

Finding the BLUE (Vector Case)

Theorem (Gauss–Markov) If the data are of the general linear model form

x = Hθ + w

where w is a noise vector with zero mean and covariance C (the PDF of w is arbitrary), then the BLUE of θ is :

θˆ =(HT C−1H)−1HT C−1x

and the covariance matrix of θˆ is

T −1 −1 Cθˆ =(H C H)

Remark : If noise is Gaussian then BLUE is MVU estimator.

Finding the BLUE

Example : Consider the problem of a DC level in noise : x[n] = A + w[n], where w[n] has an unspecified PDF with var(w[n]) = σ_n^2. We have θ = A and H = 1 = [1, 1, ..., 1]^T. The covariance matrix is C = diag(σ_0^2, σ_1^2, ..., σ_{N-1}^2), so C^{-1} = diag(1/σ_0^2, 1/σ_1^2, ..., 1/σ_{N-1}^2), and hence the BLUE is :

θ̂ = (H^T C^{-1}H)^{-1}H^T C^{-1}x = \left(\sum_{n=0}^{N-1}\frac{1}{σ_n^2}\right)^{-1}\sum_{n=0}^{N-1}\frac{x[n]}{σ_n^2}

and the minimum covariance is :

C_θ̂ = (H^T C^{-1}H)^{-1} = \left(\sum_{n=0}^{N-1}\frac{1}{σ_n^2}\right)^{-1}

Maximum Likelihood Estimation

Problems : The MVU estimator often does not exist or cannot be found. The BLUE is restricted to linear models.

Maximum Likelihood Estimator (MLE) : can always be applied if the PDF is known ; is optimal for large data size ; is computationally complex and requires numerical methods.

Basic Idea : Choose the parameter value that makes the observed data the most likely data to have been observed.

Likelihood Function : is the PDF p(x; θ) when θ is regarded as a variable (not a parameter).

ML Estimate : is the value of θ that maximizes the likelihood function.

Procedure : Find the log-likelihood function ln p(x; θ) ; differentiate w.r.t. θ, set to zero and solve for θ.

Example : Consider a DC level in WGN with unknown variance, x[n] = A + w[n]. Suppose that A > 0 and σ^2 = A. The PDF is :

p(x; A) = \frac{1}{(2\pi A)^{N/2}} \exp\left(-\frac{1}{2A}\sum_{n=0}^{N-1}(x[n] − A)^2\right)

Taking the derivative of the log-likelihood function, we have :

\frac{\partial \ln p(x; A)}{\partial A} = -\frac{N}{2A} + \frac{1}{A}\sum_{n=0}^{N-1}(x[n] − A) + \frac{1}{2A^2}\sum_{n=0}^{N-1}(x[n] − A)^2

What is the CRLB ? Does an MVU estimator exist ? The MLE can be found by setting the above equation to zero :

Â = -\frac{1}{2} + \sqrt{\frac{1}{N}\sum_{n=0}^{N-1} x^2[n] + \frac{1}{4}}
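A quick numerical check of this closed-form MLE against a brute-force search of the log-likelihood (the data values and grid are illustrative):

```python
# MLE for x[n] = A + w[n] with w[n] ~ N(0, A): closed form vs. grid maximization.
import numpy as np

rng = np.random.default_rng(5)
A_true, N = 3.0, 500
x = A_true + np.sqrt(A_true) * rng.standard_normal(N)

A_closed = -0.5 + np.sqrt(np.mean(x**2) + 0.25)

A_grid = np.linspace(0.1, 10.0, 5000)
loglik = np.array([-0.5 * N * np.log(2 * np.pi * A) - np.sum((x - A)**2) / (2 * A)
                   for A in A_grid])
A_brute = A_grid[np.argmax(loglik)]
print(A_closed, A_brute)    # the two values agree up to the grid resolution
```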

Stochastic Convergence

Convergence in distribution : Let {p(x_N)} be a sequence of PDFs. If there exists a PDF p(x) such that

\lim_{N\to\infty} p(x_N) = p(x)

at every point x at which p(x) is continuous, we say that p(x_N) converges in distribution to p(x). We also write x_N \overset{d}{\to} x.

Example
Consider N independent RVs x_1, x_2, ..., x_N with mean µ and finite variance σ^2. Let x̄_N = \frac{1}{N}\sum_{n=1}^{N} x_n ; then, according to the Central Limit Theorem (CLT), z_N = \sqrt{N}\,\frac{x̄_N − µ}{σ} converges in distribution to z ∼ N(0, 1).


Convergence in probability : A sequence of random variables {x_N} converges in probability to the random variable x if for every ε > 0

\lim_{N\to\infty} \Pr\{|x_N − x| > ε\} = 0

Convergence with probability 1 (almost sure convergence) : A sequence of random variables {x_N} converges with probability 1 to the random variable x if and only if for all possible events

\Pr\{\lim_{N\to\infty} x_N = x\} = 1

Example
Consider N independent RVs x_1, x_2, ..., x_N with mean µ and finite variance σ^2. Let x̄_N = \frac{1}{N}\sum_{n=1}^{N} x_n. Then according to the Law of Large Numbers (LLN), x̄_N converges in probability to µ.

Asymptotic Properties of Estimators

Asymptotic Unbiasedness : The estimator θ̂_N is an asymptotically unbiased estimator of θ if :

\lim_{N\to\infty} E(θ̂_N − θ) = 0

Asymptotic Distribution : It refers to p(x_n) as it evolves from n = 1, 2, ..., especially for large values of n (it is not the ultimate form of the distribution, which may be degenerate).

Asymptotic Variance : is not equal to \lim_{N\to\infty} var(θ̂_N) (which is the limiting variance). It is defined as :

asymptotic var(θ̂_N) = \frac{1}{N}\lim_{N\to\infty} E\{N[θ̂_N − \lim_{N\to\infty} E(θ̂_N)]^2\}

Mean-Square Convergence : The estimator θ̂_N converges to θ in a mean-squared sense if :

\lim_{N\to\infty} E[(θ̂_N − θ)^2] = 0

Consistency

The estimator θ̂_N is a consistent estimator of θ if for every ε > 0

plim(θ̂_N) = θ ⇔ \lim_{N\to\infty} \Pr[|θ̂_N − θ| > ε] = 0

Remarks :

If θ̂_N is asymptotically unbiased and its limiting variance is zero, then it converges to θ in mean-square.

If θ̂_N converges to θ in mean-square, then the estimator is consistent.
Asymptotic unbiasedness does not imply consistency and vice versa.
plim can be treated as an operator, e.g. :

plim(xy) = plim(x)\,plim(y) ; \quad plim\left(\frac{x}{y}\right) = \frac{plim(x)}{plim(y)}

The importance of consistency is that any continuous function of a consistent estimator is itself a consistent estimator.

Maximum Likelihood Estimation

Properties : The MLE may be biased and is not necessarily an efficient estimator. However :
The MLE is a consistent estimator, meaning that

\lim_{N\to\infty} \Pr[|θ̂ − θ| > ε] = 0

The MLE asymptotically attains the CRLB (its asymptotic variance is equal to the CRLB).
Under some "regularity" conditions, the MLE is asymptotically normally distributed,

θ̂ \overset{a}{\sim} N(θ, I^{-1}(θ))

even if the PDF of x is not Gaussian.
If an MVU estimator exists, then the ML procedure will find it.


Example : Consider a DC level in WGN with known variance σ^2. Then

p(x; A) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n] − A)^2\right)

\frac{\partial \ln p(x; A)}{\partial A} = \frac{1}{\sigma^2}\sum_{n=0}^{N-1}(x[n] − A) = 0 ⇒ \sum_{n=0}^{N-1} x[n] − NÂ = 0

which leads to Â = x̄ = \frac{1}{N}\sum_{n=0}^{N-1} x[n]

MLE for Transformed Parameters (Invariance Property)

Theorem (Invariance Property of the MLE) The MLE of the parameter α = g(θ), where the PDF p(x; θ) is parameterized by θ,isgivenby

αˆ = g(θˆ)

where θˆ is the MLE of θ.

It can be proved using the property of consistent estimators.
If α = g(θ) is a one-to-one function, then

α̂ = \arg\max_{α} p(x; g^{-1}(α)) = g(θ̂)

If α = g(θ) is not a one-to-one function, then

\bar{p}_T(x; α) = \max_{\{θ : α = g(θ)\}} p(x; θ) and α̂ = \arg\max_{α} \bar{p}_T(x; α) = g(θ̂)


Example : Consider a DC level in WGN and find the MLE of α = exp(A). Since g(θ) is a one-to-one function :

α̂ = \arg\max_{α} p_T(x; α) = \arg\max_{α} p(x; \ln α) = \exp(x̄)

Example : Consider a DC level in WGN and find the MLE of α = A^2. Since g(θ) is not a one-to-one function :

α̂ = \arg\max_{α ≥ 0}\{p(x; \sqrt{α}), p(x; -\sqrt{α})\} = \left[\arg\max_{-\infty < A < \infty} p(x; A)\right]^2 = Â^2 = x̄^2

MLE (Extension to Vector Parameter)

Example Consider DC Level in WGN with unknown variance. The vector parameter θ =[A σ2]T should be estimated. We have :

\frac{\partial \ln p(x; θ)}{\partial A} = \frac{1}{σ^2}\sum_{n=0}^{N-1}(x[n] − A)

\frac{\partial \ln p(x; θ)}{\partial σ^2} = -\frac{N}{2σ^2} + \frac{1}{2σ^4}\sum_{n=0}^{N-1}(x[n] − A)^2

which leads to the following MLE :

Â = x̄ and \hat{σ^2} = \frac{1}{N}\sum_{n=0}^{N-1}(x[n] − x̄)^2
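A tiny numerical sketch of this vector MLE (values are illustrative); note the 1/N normalization of the variance estimate:

```python
# MLE of A and sigma^2 for x[n] = A + w[n]: sample mean and 1/N sample variance.
import numpy as np

rng = np.random.default_rng(6)
A, sigma2, N = 1.5, 2.0, 1000
x = A + np.sqrt(sigma2) * rng.standard_normal(N)

A_hat = x.mean()
sigma2_hat = np.mean((x - A_hat)**2)      # equivalently x.var(ddof=0)
print(A_hat, sigma2_hat)
```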

MLE for the General Gaussian Case

Consider the general Gaussian case where x ∼ N(µ(θ), C(θ)). The partial derivatives of the log-PDF are :

\frac{\partial \ln p(x; θ)}{\partial θ_k} = -\frac{1}{2}\mathrm{tr}\left[C^{-1}(θ)\frac{\partial C(θ)}{\partial θ_k}\right] + \frac{\partial µ(θ)^T}{\partial θ_k} C^{-1}(θ)(x − µ(θ)) - \frac{1}{2}(x − µ(θ))^T \frac{\partial C^{-1}(θ)}{\partial θ_k}(x − µ(θ))

for k = 1, ..., p. By setting the above equations equal to zero, the MLE can be found. A particular case is when C is known (the first and third terms become zero). In addition, if µ(θ) is linear in θ, the general linear model is obtained.

MLE for General Linear Models

Consider the general linear model x = Hθ + w where w is a noise vector with PDF N(0, C) :

p(x; θ) = \frac{1}{\sqrt{(2\pi)^N \det(C)}} \exp\left(-\frac{1}{2}(x − Hθ)^T C^{-1}(x − Hθ)\right)

Taking the derivative of \ln p(x; θ) leads to :

\frac{\partial \ln p(x; θ)}{\partial θ} = \frac{\partial (Hθ)^T}{\partial θ} C^{-1}(x − Hθ)

Then

HT C−1(x − Hθ)=0 ⇒ θˆ =(HT C−1H)−1HT C−1x

which is the same as MVU estimator. The PDF of θˆ is :

θˆ ∼N(θ, (HT C−1H)−1)

MLE (Numerical Method)

Newton–Raphson : A closed-form estimator cannot always be computed by maximizing the likelihood function. However, the maximum can be computed by numerical methods such as the iterative Newton–Raphson algorithm.

θ_{k+1} = θ_k - \left[\frac{\partial^2 \ln p(x; θ)}{\partial θ\,\partial θ^T}\right]^{-1}\frac{\partial \ln p(x; θ)}{\partial θ}\Bigg|_{θ = θ_k}

Remarks : The Hessian can be replaced by the negative of its expectation, the Fisher information matrix I(θ). This method suffers from convergence problems (local maxima). Typically, for large data lengths, the log-likelihood function becomes more quadratic and the algorithm will produce the MLE.
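A generic sketch of the Newton-Raphson iteration for a scalar MLE, using finite-difference derivatives of the log-likelihood; it is applied to the DC-level example with σ² = A used earlier, and the step size h and iteration count are arbitrary choices of mine:

```python
# Newton-Raphson maximization of a scalar log-likelihood with numerical derivatives.
import numpy as np

rng = np.random.default_rng(7)
A_true, N = 3.0, 500
x = A_true + np.sqrt(A_true) * rng.standard_normal(N)

def loglik(A):
    return -0.5 * N * np.log(2 * np.pi * A) - np.sum((x - A)**2) / (2 * A)

A_k, h = np.mean(x), 1e-4                 # start from a rough estimate
for _ in range(20):
    d1 = (loglik(A_k + h) - loglik(A_k - h)) / (2 * h)                   # first derivative
    d2 = (loglik(A_k + h) - 2 * loglik(A_k) + loglik(A_k - h)) / h**2    # second derivative
    A_k = A_k - d1 / d2                                                  # Newton-Raphson step
print(A_k, -0.5 + np.sqrt(np.mean(x**2) + 0.25))                         # matches the closed form
```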

Least Squares Estimation

The Least Squares Approach

In all the previous methods we assumed that the measured signal x[n] is the sum of a true signal s[n] and a measurement error w[n] with a known probabilistic model. In the least squares method x[n] = s[n, θ] + e[n], where e[n] represents the modeling and measurement errors. The objective is to minimize the LS cost :

J(θ) = \sum_{n=0}^{N-1}(x[n] − s[n, θ])^2

We do not need a probabilistic assumption but only a deterministic signal model. It has a broader range of applications. No claim about optimality can be made. The statistical performance cannot be assessed.


Example : Estimate the DC level of a signal. We observe x[n] = A + e[n] for n = 0, ..., N − 1 and the LS criterion is :

J(A) = \sum_{n=0}^{N-1}(x[n] − A)^2

\frac{\partial J(A)}{\partial A} = -2\sum_{n=0}^{N-1}(x[n] − A) = 0 ⇒ Â = \frac{1}{N}\sum_{n=0}^{N-1} x[n]

Linear Least Squares

Suppose that the observation model is linear, x = Hθ + e ; then

J(θ) = \sum_{n=0}^{N-1}(x[n] − s[n, θ])^2 = (x − Hθ)^T(x − Hθ) = x^T x − 2x^T Hθ + θ^T H^T Hθ

where H is full rank. The gradient is

\frac{\partial J(θ)}{\partial θ} = -2H^T x + 2H^T Hθ = 0 ⇒ θ̂ = (H^T H)^{-1}H^T x

The minimum LS cost is :

J_{min} = (x − Hθ̂)^T(x − Hθ̂) = x^T[I − H(H^T H)^{-1}H^T]x = x^T(x − Hθ̂)

where I − H(H^T H)^{-1}H^T is an idempotent matrix.

Comparing Different Estimators for the Linear Model

Consider the following linear model : x = Hθ + w

Estimator | Assumption | Estimate
LSE | No probabilistic assumption | θ̂ = (H^T H)^{-1}H^T x
BLUE | w is white with unknown PDF | θ̂ = (H^T H)^{-1}H^T x
MLE | w is white Gaussian noise | θ̂ = (H^T H)^{-1}H^T x
MVUE | w is white Gaussian noise | θ̂ = (H^T H)^{-1}H^T x

For MLE and MVUE the PDF of the estimate will be Gaussian.

Weighted Linear Least Squares

The LS criterion can be modified by including a positive definite (symmetric) weighting matrix W :

J(θ)=(x − Hθ)T W(x − Hθ)

That leads to the following estimator :

θˆ =(HT WH)−1HT Wx

and minimum LS cost :

T T −1 T Jmin = x [W − WH(H WH) H W]x

Remark : If we take W = C^{-1}, where C is the covariance of the noise, then the weighted least squares estimator is the BLUE. However, there is no true LS-based reason for this choice.

Geometrical Interpretation

Recall the general signal model s = Hθ. If we denote the columns of H by h_i we have :

s = [h_1\; h_2\; \cdots\; h_p]\begin{bmatrix} θ_1 \\ θ_2 \\ \vdots \\ θ_p \end{bmatrix} = \sum_{i=1}^{p} θ_i h_i

The signal model is a linear combination of the vectors {h_i}. The LS criterion minimizes the length of the error vector between the data and the signal model, e = x − s :

J(θ) = (x − Hθ)^T(x − Hθ) = \|x − Hθ\|^2

The data vector can lie anywhere in R^N, while the signal vectors must lie in a p-dimensional subspace of R^N, termed S^p, which is spanned by the columns of H.

Geometrical Interpretation (Orthogonal Projection)

Intuitively, it is clear that the LS error is minimized when ŝ = Hθ̂ is the orthogonal projection of x onto S^p. So the LS error vector e = x − ŝ is orthogonal to all columns of H :

H^T e = 0 ⇒ H^T(x − Hθ̂) = 0 ⇒ θ̂ = (H^T H)^{-1}H^T x

The signal estimate is the projection of x onto S^p :

ŝ = Hθ̂ = H(H^T H)^{-1}H^T x = Px

where P is the orthogonal projection matrix.
Note that if z ∈ Range(H), then Pz = z. Recall that Range(H) is the subspace spanned by the columns of H. Now, since Px ∈ S^p, we have P(Px) = Px. Therefore any projection matrix is idempotent, i.e. P^2 = P. It can be verified that P is symmetric and singular (with rank p).

Geometrical Interpretation (Orthonormal Columns of H)

Recall that

H^T H = \begin{bmatrix} \langle h_1, h_1\rangle & \cdots & \langle h_1, h_p\rangle \\ \langle h_2, h_1\rangle & \cdots & \langle h_2, h_p\rangle \\ \vdots & \ddots & \vdots \\ \langle h_p, h_1\rangle & \cdots & \langle h_p, h_p\rangle \end{bmatrix}

If the columns of H are orthonormal then H^T H = I and θ̂ = H^T x. In this case θ̂_i = h_i^T x, thus

ŝ = Hθ̂ = \sum_{i=1}^{p} θ̂_i h_i = \sum_{i=1}^{p}(h_i^T x)h_i

If we increase the number of parameters (the order of the linear model), we can easily compute the new estimate.

Choosing the Model Order

Suppose that you have a set of data and the objective is to fit a polynomial to data. What is the best polynomial order ?

Remarks :

It is clear that by increasing the order, J_min is monotonically non-increasing. By choosing p = N we can perfectly fit the data to the model; however, we fit the noise as well. We should choose the simplest model that adequately describes the data. We increase the order only if the cost reduction is significant.

If we have an idea about the expected level of J_min, we increase p to approximately attain this level. There is an order-recursive LS algorithm to efficiently compute a (p + 1)-th order model based on a p-th order one (see Section 8.6).

Sequential Least Squares

Suppose that θ̂[N − 1], based on x[N − 1] = [x[0], ..., x[N − 1]]^T, is available. If we get a new data sample x[N], we want to compute θ̂[N] as a function of θ̂[N − 1] and x[N].

Example : Consider the LS estimate of a DC level : Â[N − 1] = \frac{1}{N}\sum_{n=0}^{N-1} x[n]. We have :

Â[N] = \frac{1}{N+1}\sum_{n=0}^{N} x[n] = \frac{1}{N+1}\left[N\left(\frac{1}{N}\sum_{n=0}^{N-1} x[n]\right) + x[N]\right]
= \frac{N}{N+1}Â[N − 1] + \frac{1}{N+1}x[N]
= Â[N − 1] + \frac{1}{N+1}(x[N] − Â[N − 1])

new estimate = old estimate + gain × prediction error

Example (DC level in uncorrelated noise) : x[n] = A + w[n] with var(w[n]) = σ_n^2. The WLS estimate (or BLUE) is :

Â[N − 1] = \frac{\sum_{n=0}^{N-1} x[n]/σ_n^2}{\sum_{n=0}^{N-1} 1/σ_n^2}

Similar to the previous example we can obtain :

Â[N] = Â[N − 1] + \frac{1/σ_N^2}{\sum_{n=0}^{N} 1/σ_n^2}\left(x[N] − Â[N − 1]\right) = Â[N − 1] + K[N]\left(x[N] − Â[N − 1]\right)

The gain factor K[N] can be reformulated as :

K[N] = \frac{var(Â[N − 1])}{var(Â[N − 1]) + σ_N^2}

Consider the general linear model x = Hθ + w, where w is uncorrelated noise with covariance matrix C. The BLUE (or WLS estimator) is :

θ̂ = (H^T C^{-1}H)^{-1}H^T C^{-1}x and C_θ̂ = (H^T C^{-1}H)^{-1}

Let's define :

C[n] = diag(σ_0^2, σ_1^2, ..., σ_n^2)

H[n] = \begin{bmatrix} H[n − 1] \\ h^T[n] \end{bmatrix}, with H[n − 1] of size n × p and h^T[n] of size 1 × p

x[n] = [x[0], x[1], ..., x[n]]^T

The objective is to find θ̂[n], based on n + 1 data samples, as a function of θ̂[n − 1] and the new data x[n]. The batch estimator is :

θ̂[n − 1] = Σ[n − 1]H^T[n − 1]C^{-1}[n − 1]x[n − 1]

with Σ[n − 1] = (H^T[n − 1]C^{-1}[n − 1]H[n − 1])^{-1}


Estimator Update :

θ̂[n] = θ̂[n − 1] + K[n]\left(x[n] − h^T[n]θ̂[n − 1]\right)

where

K[n] = \frac{Σ[n − 1]h[n]}{σ_n^2 + h^T[n]Σ[n − 1]h[n]}

Covariance Update :

Σ[n] = (I − K[n]h^T[n])Σ[n − 1]
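A minimal sketch of this recursion for the DC-level case (h[n] = 1, p = 1), with a diffuse initialization of my own choosing, checked against the batch estimate:

```python
# Sequential LS for x[n] = A + w[n], compared with the batch sample mean.
import numpy as np

rng = np.random.default_rng(8)
A, sigma2, N = 1.0, 0.5, 200
x = A + np.sqrt(sigma2) * rng.standard_normal(N)

theta = np.zeros(1)                  # initial estimate
Sigma = np.eye(1) * 1e6              # large initial covariance (little prior confidence)
h = np.ones(1)                       # regressor for the DC-level model
for n in range(N):
    K = Sigma @ h / (sigma2 + h @ Sigma @ h)      # gain K[n]
    theta = theta + K * (x[n] - h @ theta)        # estimator update
    Sigma = (np.eye(1) - np.outer(K, h)) @ Sigma  # covariance update

print(theta[0], x.mean())            # essentially equal for a diffuse initialization
```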

The following lemma is used to compute the updates :

Matrix Inversion Lemma : (A + BCD)^{-1} = A^{-1} − A^{-1}B[C^{-1} + DA^{-1}B]^{-1}DA^{-1}

with A = Σ^{-1}[n − 1], B = h[n], C = 1/σ_n^2, D = h^T[n]. Initialization ?

Constrained Least Squares

Linear constraints of the form Aθ = b can be incorporated into the LS solution. The LS criterion becomes :

J_c(θ) = (x − Hθ)^T(x − Hθ) + λ^T(Aθ − b)

\frac{\partial J_c}{\partial θ} = -2H^T x + 2H^T Hθ + A^T λ

Setting the gradient equal to zero produces :

θ̂_c = (H^T H)^{-1}H^T x − \frac{1}{2}(H^T H)^{-1}A^T λ = θ̂ − (H^T H)^{-1}A^T \frac{λ}{2}

where θ̂ = (H^T H)^{-1}H^T x is the unconstrained LSE. Now Aθ̂_c = b can be solved to find λ :

λ = 2[A(H^T H)^{-1}A^T]^{-1}(Aθ̂ − b)

Nonlinear Least Squares

Many applications have a nonlinear observation model : s(θ) ≠ Hθ. This leads to a nonlinear optimization problem that can be solved numerically.

Newton-Raphson Method : Find a zero of the gradient of the criterion by linearizing the gradient around the current estimate.

θ̂_{k+1} = θ̂_k − \left[\frac{\partial g(θ)}{\partial θ}\right]^{-1} g(θ)\Bigg|_{θ = θ̂_k}

where g(θ) = \frac{\partial s^T(θ)}{\partial θ}[x − s(θ)] and

\frac{\partial [g(θ)]_i}{\partial θ_j} = \sum_{n=0}^{N-1}\left[(x[n] − s[n])\frac{\partial^2 s[n]}{\partial θ_i\,\partial θ_j} − \frac{\partial s[n]}{\partial θ_j}\frac{\partial s[n]}{\partial θ_i}\right]

Around the solution, [x − s(θ)] is small so the first term in the Jacobian can be neglected. This makes the method equivalent to the Gauss-Newton algorithm, which is numerically more robust.

Gauss-Newton Method : Linearize the signal model around the current estimate and solve the resulting linear problem.

s(θ) ≈ s(θ_0) + \frac{\partial s(θ)}{\partial θ}\Bigg|_{θ = θ_0}(θ − θ_0) = s(θ_0) + H(θ_0)(θ − θ_0)

The solution to the linearized problem is :

θ̂ = [H^T(θ_0)H(θ_0)]^{-1}H^T(θ_0)[x − s(θ_0) + H(θ_0)θ_0] = θ_0 + [H^T(θ_0)H(θ_0)]^{-1}H^T(θ_0)[x − s(θ_0)]

If we now iterate the solution, it becomes :

θ̂_{k+1} = θ̂_k + [H^T(θ̂_k)H(θ̂_k)]^{-1}H^T(θ̂_k)[x − s(θ̂_k)]

Remark : Both the Newton-Raphson and the Gauss-Newton methods can have convergence problems.
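A small Gauss-Newton sketch on a toy nonlinear model s[n; θ] = θ_1 exp(−θ_2 n) (my own example, not one from the slides):

```python
# Gauss-Newton iteration: H(theta) is the Jacobian of s with respect to theta.
import numpy as np

rng = np.random.default_rng(12)
N = 50
n = np.arange(N)
theta_true = np.array([2.0, 0.1])
x = theta_true[0] * np.exp(-theta_true[1] * n) + 0.05 * rng.standard_normal(N)

theta = np.array([1.0, 0.2])                      # initial guess
for _ in range(20):
    s = theta[0] * np.exp(-theta[1] * n)
    H = np.column_stack([np.exp(-theta[1] * n),                      # ds/dtheta1
                         -theta[0] * n * np.exp(-theta[1] * n)])     # ds/dtheta2
    theta = theta + np.linalg.solve(H.T @ H, H.T @ (x - s))          # Gauss-Newton step
print(theta)   # close to theta_true if the iteration converges
```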

Transformation of parameters : Transform into a linear problem by seeking an invertible function α = g(θ) such that :

s(θ) = s(g^{-1}(α)) = Hα

So the nonlinear LSE is θ̂ = g^{-1}(α̂) where α̂ = (H^T H)^{-1}H^T x.

Example (Estimate the amplitude and phase of a sinusoidal signal) :

s[n] = A\cos(2\pi f_0 n + φ), \quad n = 0, 1, ..., N − 1

The LS problem is nonlinear ; however, we have :

A\cos(2\pi f_0 n + φ) = A\cos φ\,\cos 2\pi f_0 n − A\sin φ\,\sin 2\pi f_0 n

If we let α_1 = A\cos φ and α_2 = -A\sin φ then the signal model becomes linear, s = Hα. The LSE is α̂ = (H^T H)^{-1}H^T x and g^{-1}(α) is

A = \sqrt{α_1^2 + α_2^2} and φ = \arctan\left(\frac{-α_2}{α_1}\right)

Separability of parameters : If the signal model is linear in some of the parameters, try to write it as s = H(α)β. The LS error can then be minimized with respect to β :

β̂ = [H^T(α)H(α)]^{-1}H^T(α)x

The cost function becomes a function of α only, which can be minimized by a numerical method (e.g. brute force) :

J(α, β̂) = x^T\left(I − H(α)[H^T(α)H(α)]^{-1}H^T(α)\right)x

Then

α̂ = \arg\max_{α}\; x^T H(α)[H^T(α)H(α)]^{-1}H^T(α)x

Remark : This method is interesting if the dimension of α is much less than the dimension of β.


Example (Damped Exponentials) Consider the following signal model :

s[n] = A_1 r^n + A_2 r^{2n} + A_3 r^{3n}, \quad n = 0, 1, ..., N − 1

with θ = [A_1, A_2, A_3, r]^T. It is known that 0 < r < 1. Let's take β = [A_1, A_2, A_3]^T so we get s = H(r)β with :

H(r) = \begin{bmatrix} 1 & 1 & 1 \\ r & r^2 & r^3 \\ \vdots & \vdots & \vdots \\ r^{N-1} & r^{2(N-1)} & r^{3(N-1)} \end{bmatrix}

Step 1 : Maximize x^T H(r)[H^T(r)H(r)]^{-1}H^T(r)x to obtain r̂.
Step 2 : Compute β̂ = [H^T(r̂)H(r̂)]^{-1}H^T(r̂)x.
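A sketch of this two-step procedure with illustrative true values, using a simple grid search for r:

```python
# Separable LS for damped exponentials: grid-search r, then solve the linear amplitudes.
import numpy as np

rng = np.random.default_rng(9)
N, r_true, beta_true = 100, 0.9, np.array([1.0, -0.5, 0.25])
n = np.arange(N)

def H_of(r):
    return np.column_stack([r**n, r**(2 * n), r**(3 * n)])

x = H_of(r_true) @ beta_true + 0.01 * rng.standard_normal(N)

def cost(r):                                   # x^T H (H^T H)^{-1} H^T x, to be maximized
    H = H_of(r)
    return x @ H @ np.linalg.solve(H.T @ H, H.T @ x)

r_grid = np.linspace(0.5, 0.99, 500)
r_hat = r_grid[np.argmax([cost(r) for r in r_grid])]
H = H_of(r_hat)
beta_hat = np.linalg.solve(H.T @ H, H.T @ x)
print(r_hat, beta_hat)
```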

The Bayesian Philosophy

Introduction

Classical Approach : Assumes θ is unknown but deterministic.
Some prior knowledge on θ cannot be used.
The variance of the estimate may depend on θ.
An MVUE may not exist.
In Monte-Carlo simulations, we do M runs for each fixed θ and then compute the sample mean and variance for each θ (no averaging over θ).

Bayesian Approach : Assumes θ is random with a known prior PDF, p(θ). We estimate a realization of θ based on the available data.
The variance of the estimate does not depend on θ.
A Bayesian estimate always exists.
In Monte-Carlo simulations, we do M runs for randomly chosen θ and then we compute the sample mean and variance over all θ values.

Mean Square Error (MSE)

Classical MSE : The classical MSE is a function of the unknown parameter θ and cannot be used for constructing estimators.

mse(θ̂) = E\{[θ − θ̂(x)]^2\} = \int [θ − θ̂(x)]^2 p(x; θ)\,dx

Note that the E{·} is w.r.t. the PDF of x.

Bayesian MSE : The Bayesian MSE is not a function of θ and can be minimized to find an estimator.

Bmse(θ̂) = E\{[θ − θ̂(x)]^2\} = \int\int [θ − θ̂(x)]^2 p(x, θ)\,dx\,dθ

Note that the E{·} is wrt the joint PDF of x and θ.

Minimum Mean Square Error Estimator

Consider the estimation of A that minimizes the Bayesian MSE, where A is a random variable with uniform prior PDF p(A) = U[−A_0, A_0], independent of w[n].

Bmse(Â) = \int\int [A − Â]^2 p(x, A)\,dx\,dA = \int\left[\int [A − Â]^2 p(A|x)\,dA\right]p(x)\,dx

where we used Bayes' theorem : p(x, A) = p(A|x)p(x). Since p(x) ≥ 0 for all x, we need only minimize the integral in brackets. We set the derivative equal to zero :

\frac{\partial}{\partial Â}\int [A − Â]^2 p(A|x)\,dA = -2\int A\,p(A|x)\,dA + 2Â\int p(A|x)\,dA

which results in

Â = \int A\,p(A|x)\,dA = E(A|x)

Bayesian MMSE Estimate : is the conditional mean of A given the data x, i.e. the mean of the posterior PDF p(A|x).

How to compute the Bayesian MMSE estimate :

Â = E(A|x) = \int A\,p(A|x)\,dA

The posterior PDF can be computed using Bayes' rule :

p(A|x) = \frac{p(x|A)p(A)}{p(x)} = \frac{p(x|A)p(A)}{\int p(x|A)p(A)\,dA}

p(x) is the marginal PDF defined as p(x) = \int p(x, A)\,dA. The integral in the denominator acts as a normalization of p(x|A)p(A) such that the integral of p(A|x) equals 1. p(x|A) has exactly the same form as p(x; A), which is used in classical estimation. The MMSE estimator always exists and can be computed by :

Â = \frac{\int A\,p(x|A)p(A)\,dA}{\int p(x|A)p(A)\,dA}


Example : For p(A) = U[−A_0, A_0] we have :

Â = \frac{\frac{1}{2A_0}\int_{-A_0}^{A_0} A\,\frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n] − A)^2\right)dA}{\frac{1}{2A_0}\int_{-A_0}^{A_0}\frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n] − A)^2\right)dA}

Before collecting the data, the mean of the prior PDF p(A) is the best estimate, while after collecting the data the best estimate is the mean of the posterior PDF p(A|x).
The choice of p(A) is crucial for the quality of the estimation. Only a Gaussian prior PDF leads to a closed-form estimator.
Conclusion : For an accurate estimator choose a prior PDF that can be physically justified. For a closed-form estimator choose a Gaussian prior PDF.

Properties of the Gaussian PDF

Theorem (Conditional PDF of a Bivariate Gaussian)
If x and y are distributed according to a bivariate Gaussian PDF,

p(x, y) = \frac{1}{2\pi\sqrt{\det(C)}} \exp\left(-\frac{1}{2}\begin{bmatrix} x − E(x) \\ y − E(y) \end{bmatrix}^T C^{-1}\begin{bmatrix} x − E(x) \\ y − E(y) \end{bmatrix}\right)

with covariance matrix

C = \begin{bmatrix} var(x) & cov(x, y) \\ cov(x, y) & var(y) \end{bmatrix}

then the conditional PDF p(y|x) is also Gaussian and :

E(y|x) = E(y) + \frac{cov(x, y)}{var(x)}[x − E(x)]

var(y|x) = var(y) − \frac{cov^2(x, y)}{var(x)}


Example (DC level in WGN with Gaussian prior PDF)
Consider the Bayesian model x[0] = A + w[0] (just one observation) with a prior PDF A ∼ N(µ_A, σ_A^2), independent of the noise. First we compute the covariance cov(A, x[0]) :

cov(A, x[0]) = E\{(A − µ_A)(x[0] − E(x[0]))\} = E\{(A − µ_A)(A + w[0] − µ_A − 0)\} = E\{(A − µ_A)^2 + (A − µ_A)w[0]\} = var(A) + 0 = σ_A^2

The Bayesian MMSE estimate is also Gaussian with :

Â = µ_{A|x} = µ_A + \frac{cov(A, x[0])}{var(x[0])}[x[0] − µ_A] = µ_A + \frac{σ_A^2}{σ^2 + σ_A^2}(x[0] − µ_A)

var(Â) = σ_{A|x}^2 = σ_A^2 − \frac{cov^2(A, x[0])}{var(x[0])} = σ_A^2\left(1 − \frac{σ_A^2}{σ^2 + σ_A^2}\right)
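A tiny numerical sketch of this single-observation result (the prior mean, prior variance and noise variance below are illustrative):

```python
# Posterior mean and variance for x[0] = A + w[0] with a Gaussian prior on A:
# the posterior mean shrinks x[0] toward the prior mean mu_A.
import numpy as np

mu_A, var_A, var_w = 0.0, 1.0, 4.0     # prior mean/variance and noise variance
x0 = 2.5                               # one observed sample

A_mmse = mu_A + var_A / (var_A + var_w) * (x0 - mu_A)   # E(A | x[0])
var_post = var_A * (1 - var_A / (var_A + var_w))        # var(A | x[0])
print(A_mmse, var_post)                # 0.5 and 0.8: a noisy sample is strongly shrunk
```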


Theorem (Conditional PDF of a Multivariate Gaussian)
If x (with dimension k × 1) and y (with dimension l × 1) are jointly Gaussian with PDF

p(x, y) = \frac{1}{(2\pi)^{\frac{k+l}{2}}\sqrt{\det(C)}} \exp\left(-\frac{1}{2}\begin{bmatrix} x − E(x) \\ y − E(y) \end{bmatrix}^T C^{-1}\begin{bmatrix} x − E(x) \\ y − E(y) \end{bmatrix}\right)

with covariance matrix

C = \begin{bmatrix} C_{xx} & C_{xy} \\ C_{yx} & C_{yy} \end{bmatrix}

then the conditional PDF p(y|x) is also Gaussian and :

E(y|x) = E(y) + C_{yx}C_{xx}^{-1}[x − E(x)]

C_{y|x} = C_{yy} − C_{yx}C_{xx}^{-1}C_{xy}

Bayesian Linear Model

Let the data be modeled as

x = Hθ + w

where θ is a p × 1 random vector with prior PDF N(µ_θ, C_θ) and w is a noise vector with PDF N(0, C_w). Since θ and w are independent and Gaussian, they are jointly Gaussian, so the posterior PDF is also Gaussian. In order to find the MMSE estimator we should compute the covariance matrices :

C_{xx} = E\{[x − E(x)][x − E(x)]^T\} = E\{(Hθ + w − Hµ_θ)(Hθ + w − Hµ_θ)^T\} = E\{(H(θ − µ_θ) + w)(H(θ − µ_θ) + w)^T\} = H\,E\{(θ − µ_θ)(θ − µ_θ)^T\}H^T + E(ww^T) = HC_θH^T + C_w

C_{θx} = E\{(θ − µ_θ)[H(θ − µ_θ) + w]^T\} = E\{(θ − µ_θ)(θ − µ_θ)^T\}H^T = C_θH^T


Theorem (Posterior PDF for the Bayesian General Linear Model) If the observed data can be modeled as

x = Hθ + w

where θ is a p × 1 random vector with prior PDF N(µ_θ, C_θ) and w is a noise vector with PDF N(0, C_w), then the posterior PDF p(θ|x) is Gaussian with mean :

E(θ|x) = µ_θ + C_θH^T(HC_θH^T + C_w)^{-1}(x − Hµ_θ)

and covariance

C_{θ|x} = C_θ − C_θH^T(HC_θH^T + C_w)^{-1}HC_θ

Remark : In contrast to the classical general linear model, H need not be full rank.

Example (DC Level in WGN with Gaussian Prior PDF)

x[n] = A + w[n], n = 0, ..., N − 1, with A ∼ N(µ_A, σ_A^2) and w[n] ∼ N(0, σ^2)

We have the general Bayesian linear model x = HA + w with H = 1. Then

E(A|x) = µ_A + σ_A^2\mathbf{1}^T(\mathbf{1}σ_A^2\mathbf{1}^T + σ^2 I)^{-1}(x − \mathbf{1}µ_A)

var(A|x) = σ_A^2 − σ_A^2\mathbf{1}^T(\mathbf{1}σ_A^2\mathbf{1}^T + σ^2 I)^{-1}\mathbf{1}σ_A^2

We use the Matrix Inversion Lemma to get :

\left(I + \frac{σ_A^2}{σ^2}\mathbf{1}\mathbf{1}^T\right)^{-1} = I − \frac{\mathbf{1}\mathbf{1}^T}{\frac{σ^2}{σ_A^2} + \mathbf{1}^T\mathbf{1}} = I − \frac{\mathbf{1}\mathbf{1}^T}{\frac{σ^2}{σ_A^2} + N}

E(A|x) = µ_A + \frac{σ_A^2}{σ_A^2 + \frac{σ^2}{N}}(x̄ − µ_A) and var(A|x) = \frac{\frac{σ^2}{N}σ_A^2}{σ_A^2 + \frac{σ^2}{N}}

Nuisance Parameters

Definition (Nuisance Parameter)
Suppose that θ and α are unknown parameters but we are only interested in estimating θ. In this case α is called a nuisance parameter.

In the classical approach we have to estimate both, but in the Bayesian approach we can integrate the nuisance parameter out. Note that in the Bayesian approach we can find p(θ|x) from p(θ, α|x) as a marginal PDF :

p(θ|x) = \int p(θ, α|x)\,dα

We can also express it as

p(θ|x) = \frac{p(x|θ)p(θ)}{\int p(x|θ)p(θ)\,dθ} where p(x|θ) = \int p(x|θ, α)p(α|θ)\,dα

Furthermore, if α is independent of θ we have :

p(x|θ) = \int p(x|θ, α)p(α)\,dα

General Bayesian Estimators

Risk Function A general Bayesian estimator is obtained by minimizing the Bayes Risk

θˆ = arg min R(θˆ) θˆ

where R(θˆ)=E{C(ε)} is the Bayes Risk, C(ε) is a cost function and ε = θ − θˆ is the estimation error.

Three common risk functions :
1. Quadratic : R(θ̂) = E{ε^2} = E{(θ − θ̂)^2}
2. Absolute : R(θ̂) = E{|ε|} = E{|θ − θ̂|}
3. Hit-or-Miss : R(θ̂) = E{C(ε)} where C(ε) = 0 if |ε| < δ and C(ε) = 1 if |ε| ≥ δ


The Bayes risk is

R(θ̂) = E\{C(ε)\} = \int\int C(θ − θ̂)p(x, θ)\,dx\,dθ = \int g(θ̂)p(x)\,dx

where g(θ̂) = \int C(θ − θ̂)p(θ|x)\,dθ should be minimized.

Quadratic cost : In this case R(θ̂) = E{(θ − θ̂)^2} = Bmse(θ̂), which leads to the MMSE estimator θ̂ = E(θ|x), so θ̂ is the mean of the posterior PDF p(θ|x).


Absolute cost : In this case we have :

g(θ̂) = \int |θ − θ̂|\,p(θ|x)\,dθ = \int_{-\infty}^{θ̂}(θ̂ − θ)p(θ|x)\,dθ + \int_{θ̂}^{\infty}(θ − θ̂)p(θ|x)\,dθ

By setting the derivative of g(θ̂) equal to zero and using Leibniz's rule, we get :

\int_{-\infty}^{θ̂} p(θ|x)\,dθ = \int_{θ̂}^{\infty} p(θ|x)\,dθ

Then θ̂ is the median (area to the left = area to the right) of p(θ|x).


Hit-or-Miss cost : In this case we have

g(θ̂) = \int_{-\infty}^{θ̂ − δ} 1\cdot p(θ|x)\,dθ + \int_{θ̂ + δ}^{\infty} 1\cdot p(θ|x)\,dθ = 1 − \int_{θ̂ − δ}^{θ̂ + δ} p(θ|x)\,dθ

For δ arbitrarily small the optimal estimate is the location of the maximum of p(θ|x) or the mode of the posterior PDF.

This estimator is called the maximum a posteriori (MAP) estimator.

Remark : For a unimodal and symmetric posterior PDF (e.g. a Gaussian PDF), the mean, the mode and the median coincide.

(General Bayesian Estimators) Estimation Theory Spring 2013 109 / 152 Minimum Mean Square Error Estimators

Extension to the vector parameter case : In general, like the scalar case, we can write :

θ̂_i = E(θ_i|x) = ∫ θ_i p(θ_i|x) dθ_i,   i = 1, 2, ..., p

Then p(θi |x) can be computed as a marginal conditional PDF. For example :

θ̂_1 = ∫ θ_1 [∫ ··· ∫ p(θ|x) dθ_2 ··· dθ_p] dθ_1 = ∫ θ_1 p(θ|x) dθ

In vector form we have :
θ̂ = [∫ θ_1 p(θ|x) dθ, ∫ θ_2 p(θ|x) dθ, ..., ∫ θ_p p(θ|x) dθ]^T = ∫ θ p(θ|x) dθ = E(θ|x)
Similarly Bmse(θ̂_i) = ∫ [Cθ|x]_ii p(x) dx

(General Bayesian Estimators) Estimation Theory Spring 2013 110 / 152 Properties of MMSE Estimators

For the Bayesian linear model, poor prior knowledge leads to the MVU estimator:
θ̂ = E(θ|x) = µθ + (Cθ^{−1} + H^T Cw^{−1} H)^{−1} H^T Cw^{−1} (x − Hµθ)
For no prior knowledge, µθ → 0 and Cθ^{−1} → 0, and therefore:
θ̂ → (H^T Cw^{−1} H)^{−1} H^T Cw^{−1} x
Commutes over affine transformations. Suppose that α = Aθ + b; then the MMSE estimator of α is α̂ = E(α|x) = E(Aθ + b|x) = A E(θ|x) + b = Aθ̂ + b.
Enjoys the additive property for independent data sets. Assume that θ, x1, x2 are jointly Gaussian with x1 and x2 independent:
θ̂ = E(θ) + Cθx1 Cx1x1^{−1} [x1 − E(x1)] + Cθx2 Cx2x2^{−1} [x2 − E(x2)]
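The first property can be checked numerically. The sketch below is a minimal illustration under assumed values (sizes, noise covariance and data are arbitrary): as the prior covariance Cθ = c·I grows, the MMSE estimate converges to the weighted least-squares (MVU) form.

import numpy as np

rng = np.random.default_rng(1)
N, p = 40, 3
H = rng.standard_normal((N, p))
C_w = np.diag(rng.uniform(0.5, 2.0, N))
x = rng.standard_normal(N)                 # any data vector serves for this check
mu_theta = np.zeros(p)

Cw_inv = np.linalg.inv(C_w)
mvu = np.linalg.solve(H.T @ Cw_inv @ H, H.T @ Cw_inv @ x)   # MVU / weighted LS

for c in [1.0, 1e2, 1e6]:                  # prior covariance C_theta = c * I
    C_theta_inv = np.eye(p) / c
    mmse = mu_theta + np.linalg.solve(C_theta_inv + H.T @ Cw_inv @ H,
                                      H.T @ Cw_inv @ (x - H @ mu_theta))
    print(c, np.linalg.norm(mmse - mvu))   # difference shrinks as c grows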

(General Bayesian Estimators) Estimation Theory Spring 2013 111 / 152 Maximum A Posteriori Estimators

In the MAP estimation approach we have :
θ̂ = arg max_θ p(θ|x),   where p(θ|x) = p(x|θ)p(θ)/p(x)

An equivalent maximization is :
θ̂ = arg max_θ p(x|θ)p(θ)   or   θ̂ = arg max_θ [ln p(x|θ) + ln p(θ)]
If p(θ) is uniform or is approximately constant around the maximum of p(x|θ) (for large data length), we can remove p(θ) to obtain the Bayesian maximum likelihood estimator: θ̂ = arg max_θ p(x|θ)

(General Bayesian Estimators) Estimation Theory Spring 2013 112 / 152 Maximum A Posteriori Estimators
Example (DC Level in WGN with Uniform Prior PDF)
The MMSE estimator cannot be obtained in explicit form due to the need to evaluate the following integrals:
Â = [∫_{−A0}^{A0} A (1/(2A0)) (1/(2πσ²)^{N/2}) exp(−(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)²) dA] / [∫_{−A0}^{A0} (1/(2A0)) (1/(2πσ²)^{N/2}) exp(−(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)²) dA]
The MAP estimator is given as: Â = −A0 if x̄ < −A0,  Â = x̄ if −A0 ≤ x̄ ≤ A0,  Â = A0 if x̄ > A0

Remark : The main advantage of the MAP estimator is that for non-jointly-Gaussian PDFs it may lead to an explicit solution or to less computational effort. (General Bayesian Estimators) Estimation Theory Spring 2013 113 / 152 Maximum A Posteriori Estimators

Extension to the vector parameter case

θ̂_1 = arg max_{θ_1} p(θ_1|x) = arg max_{θ_1} ∫ ··· ∫ p(θ|x) dθ_2 ··· dθ_p
This requires evaluating the integrals for the marginal conditional PDFs.

An alternative is the following vector MAP estimator that maximizes the joint conditional PDF :

θ̂ = arg max_θ p(θ|x) = arg max_θ p(x|θ)p(θ)
This corresponds to a circular Hit-or-Miss cost function. In general, as N → ∞, the MAP estimator → the Bayesian MLE. If the posterior PDF is Gaussian, the mode is identical to the mean, therefore the MAP estimator is identical to the MMSE estimator. The invariance property in ML theory does not hold for the MAP estimator. (General Bayesian Estimators) Estimation Theory Spring 2013 114 / 152 Performance Description

In classical estimation the mean and the variance of the estimate (or its PDF) indicate the performance of the estimator. In the Bayesian approach the PDF of the estimate is different for each realization of θ, so a good estimator should perform well for every possible value of θ. The estimation error ε = θ − θ̂ is a function of two random variables (θ and x). The mean and the variance of the estimation error, with respect to both random variables, indicate the performance of the estimator. The mean value of the estimation error is zero, so the estimates are unbiased (in the Bayesian sense):
E_{x,θ}(θ − θ̂) = E_{x,θ}(θ − E(θ|x)) = E_x[E_{θ|x}(θ) − E(θ|x)] = E_x(0) = 0
The variance of the estimation error is the Bmse:
var(ε) = E_{x,θ}(ε²) = E_{x,θ}[(θ − θ̂)²] = Bmse(θ̂)

(General Bayesian Estimators) Estimation Theory Spring 2013 115 / 152 Performance Description (Vector Parameter Case)

The vector of the estimation error is ε = θ − θ̂ and has zero mean. Its covariance matrix is:
Mθ̂ = E_{x,θ}(εε^T) = E_{x,θ}{[θ − E(θ|x)][θ − E(θ|x)]^T} = E_x{E_{θ|x}{[θ − E(θ|x)][θ − E(θ|x)]^T}} = E_x(Cθ|x)
If x and θ are jointly Gaussian, we have:
Mθ̂ = Cθ|x = Cθθ − Cθx Cxx^{−1} Cxθ

since Cθ|x does not depend on x. For a Bayesian linear model we have:
Mθ̂ = Cθ|x = Cθ − Cθ H^T (H Cθ H^T + Cw)^{−1} H Cθ
In this case ε is a linear transformation of x and θ and thus is Gaussian. Therefore: ε ∼ N(0, Mθ̂)

(General Bayesian Estimators) Estimation Theory Spring 2013 116 / 152 Signal Processing Example

Deconvolution Problem : Estimate a signal transmitted through a channel with known impulse response :

x[n] = Σ_{m=0}^{ns−1} h[n − m] s[m] + w[n],   n = 0, 1, ..., N − 1

where s[n] is a WSS process with known ACF, and w[n] is WGN with variance σ². Stacking the samples gives x = Hs + w, where H is the N × ns convolution matrix with entries [H]_nm = h[n − m] (rows [h[0] 0 ··· 0], [h[1] h[0] ··· 0], ..., [h[N − 1] h[N − 2] ··· h[N − ns]]), s = [s[0], s[1], ..., s[ns − 1]]^T and w = [w[0], w[1], ..., w[N − 1]]^T.

x = Hθ + w (with θ = s)  ⇒  ŝ = Cs H^T (H Cs H^T + σ² I)^{−1} x

where Cs is a symmetric Toeplitz matrix with [Cs]_ij = rss[i − j]
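The sketch below assembles this model for an assumed first-order AR signal (the ACF used later in this example) and a short illustrative impulse response; all values, including the seed, are made up for the demo, and the MMSE formula above is then applied directly.

import numpy as np

rng = np.random.default_rng(2)
N, ns, sigma2 = 60, 60, 0.2
h = np.array([1.0, 0.6, 0.3])                   # illustrative channel impulse response
a, sigma2_u = 0.8, 1.0
rss = (sigma2_u / (1 - a**2)) * (-a) ** np.abs(np.arange(ns))       # AR(1) ACF
C_s = rss[np.abs(np.arange(ns)[:, None] - np.arange(ns)[None, :])]  # Toeplitz [C_s]_ij = r_ss[i-j]

# Convolution matrix: [H]_nm = h[n - m]
H = np.zeros((N, ns))
for n in range(N):
    for m in range(ns):
        if 0 <= n - m < len(h):
            H[n, m] = h[n - m]

s = rng.multivariate_normal(np.zeros(ns), C_s)
x = H @ s + np.sqrt(sigma2) * rng.standard_normal(N)

s_hat = C_s @ H.T @ np.linalg.solve(H @ C_s @ H.T + sigma2 * np.eye(N), x)
print("MSE of the MMSE deconvolution:", np.mean((s - s_hat) ** 2))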

(General Bayesian Estimators) Estimation Theory Spring 2013 117 / 152 Signal Processing Example

Consider that H = I (no filtering), so the Bayesian model becomes x = s + w.
Classical approach : The MVU estimator is ŝ = (H^T H)^{−1} H^T x = x.
Bayesian approach : The MMSE estimator is ŝ = Cs (Cs + σ² I)^{−1} x.
Scalar case : We estimate s[0] based on x[0]:

ŝ[0] = rss[0]/(rss[0] + σ²) x[0] = η/(η + 1) x[0]
where η = rss[0]/σ² is the SNR.
Signal is a realization of an autoregressive process : Consider a first-order AR process s[n] = −a s[n − 1] + u[n], where u[n] is WGN with variance σ_u². The ACF of s is:
rss[k] = σ_u²/(1 − a²) (−a)^{|k|}  ⇒  [Cs]_ij = rss[i − j]

(General Bayesian Estimators) Estimation Theory Spring 2013 118 / 152 Linear Bayesian Estimators

(Linear Bayesian Estimators) Estimation Theory Spring 2013 119 / 152 Introduction

Problems with general Bayesian estimators : they are difficult to determine in closed form, need intensive computations, involve multidimensional integration (MMSE) or multidimensional maximization (MAP), and can be determined in closed form only under the jointly Gaussian assumption. What can we do if the joint PDF is not Gaussian or is unknown? Keep the MMSE criterion; restrict the estimator to be linear. This leads to the linear MMSE (LMMSE) estimator, a suboptimal estimator that can be easily implemented. It needs only the first and second moments of the joint PDF. It is analogous to the BLUE in classical estimation. In practice, this estimator is termed the Wiener filter.

(Linear Bayesian Estimators) Estimation Theory Spring 2013 120 / 152 Linear MMSE Estimators (scalar case)

Goal : Estimate θ, given the data vector x. Assume that only the first two moments of the joint PDF of x and θ are available: the mean [E(θ), E(x)]^T and the covariance [Cθθ Cθx; Cxθ Cxx].
LMMSE Estimator : Take the class of all affine estimators
θ̂ = Σ_{n=0}^{N−1} a_n x[n] + a_N = a^T x + a_N,   where a = [a_0, a_1, ..., a_{N−1}]^T.
Then minimize the Bayesian MSE: Bmse(θ̂) = E[(θ − θ̂)²]

Computing a_N : Let's differentiate the Bmse with respect to a_N:
∂/∂a_N E[(θ − a^T x − a_N)²] = −2 E[(θ − a^T x − a_N)]
Setting this equal to zero gives: a_N = E(θ) − a^T E(x).
(Linear Bayesian Estimators) Estimation Theory Spring 2013 121 / 152 Linear MMSE Estimators (scalar case)

Minimize the Bayesian MSE:
Bmse(θ̂) = E[(θ − θ̂)²] = E[(θ − a^T x − E(θ) + a^T E(x))²]
= E{[a^T(x − E(x)) − (θ − E(θ))]²}
= E[a^T(x − E(x))(x − E(x))^T a] − E[a^T(x − E(x))(θ − E(θ))] − E[(θ − E(θ))(x − E(x))^T a] + E[(θ − E(θ))²]
= a^T Cxx a − a^T Cxθ − Cθx a + Cθθ
This can be minimized by setting the gradient to zero:
∂Bmse(θ̂)/∂a = 2 Cxx a − 2 Cxθ = 0
which results in a = Cxx^{−1} Cxθ and leads to (note that Cθx = Cxθ^T):
θ̂ = a^T x + a_N = Cθx Cxx^{−1} x + E(θ) − Cθx Cxx^{−1} E(x) = E(θ) + Cθx Cxx^{−1} (x − E(x))

(Linear Bayesian Estimators) Estimation Theory Spring 2013 122 / 152 Linear MMSE Estimators (scalar case)

The minimum Bayesian MSE is obtained as:

Bmse(θ̂) = Cθx Cxx^{−1} Cxx Cxx^{−1} Cxθ − Cθx Cxx^{−1} Cxθ − Cθx Cxx^{−1} Cxθ + Cθθ
         = Cθθ − Cθx Cxx^{−1} Cxθ

Remarks : The results are identical to those of MMSE estimators for jointly Gaussian PDF. The LMMSE estimator relies on the correlation between random variables. If a parameter is uncorrelated with the data (but nonlinearly dependent), it cannot be estimated with an LMMSE estimator.
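As a small illustration of these remarks, the sketch below applies the LMMSE formula to a deliberately non-Gaussian model (uniform parameter, Laplacian noise), estimating the required moments by Monte Carlo; the distributions, sizes and seed are assumptions made for the demo only.

import numpy as np

rng = np.random.default_rng(3)
N, M = 5, 200000                      # data length per trial, number of trials

# Non-Gaussian joint PDF: theta uniform, noise Laplacian; only moments are used
theta = rng.uniform(-1.0, 1.0, M)
X = theta[:, None] + rng.laplace(0.0, 0.5, (M, N))   # each row is one data vector

Ex, Et = X.mean(axis=0), theta.mean()                # first moments
Cxx = np.cov(X, rowvar=False)                        # second moments
Ctx = ((theta - Et)[:, None] * (X - Ex)).mean(axis=0)   # C_{theta x}

a = np.linalg.solve(Cxx, Ctx)                        # a = Cxx^{-1} Cxθ
theta_hat = Et + (X - Ex) @ a                        # LMMSE estimate for every trial
print("empirical Bmse:  ", np.mean((theta - theta_hat) ** 2))
print("theoretical Bmse:", np.var(theta) - Ctx @ a)  # Cθθ - Cθx Cxx^{-1} Cxθ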

(Linear Bayesian Estimators) Estimation Theory Spring 2013 123 / 152 Linear MMSE Estimators (scalar case)

Example (DC Level in WGN with Uniform Prior PDF) The data model is: x[n] = A + w[n], n = 0, 1, ..., N − 1, where A ∼ U[−A0, A0] is independent of w[n] (WGN with variance σ²). We have E(A) = 0 and E(x) = 0. The covariances are:

Cxx = E(xx^T) = E[(A1 + w)(A1 + w)^T] = E(A²) 1 1^T + σ² I
Cθx = E(Ax^T) = E[A(A1 + w)^T] = E(A²) 1^T

Therefore the LMMSE estimator is:
Â = Cθx Cxx^{−1} x = σ_A² 1^T (σ_A² 1 1^T + σ² I)^{−1} x = (A0²/3)/(A0²/3 + σ²/N) x̄

where σ_A² = E(A²) = ∫_{−A0}^{A0} A² (1/(2A0)) dA = [A³/(6A0)]_{−A0}^{A0} = A0²/3

(Linear Bayesian Estimators) Estimation Theory Spring 2013 124 / 152 Linear MMSE Estimators (scalar case)

Example (DC Level in WGN with Uniform Prior PDF) Comparison of different Bayesian estimators:
MMSE : Â = [∫_{−A0}^{A0} A exp(−(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)²) dA] / [∫_{−A0}^{A0} exp(−(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)²) dA]
MAP : Â = −A0 if x̄ < −A0,  Â = x̄ if −A0 ≤ x̄ ≤ A0,  Â = A0 if x̄ > A0

LMMSE : Â = (A0²/3)/(A0²/3 + σ²/N) x̄
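A sketch of this comparison on one simulated record (A0, σ², N and the seed are illustrative): the MMSE ratio of integrals is approximated on a grid, the MAP estimator is the clipped sample mean, and the LMMSE estimator is the shrunken sample mean.

import numpy as np

rng = np.random.default_rng(4)
A0, sigma2, N = 1.0, 0.5, 10
A_true = rng.uniform(-A0, A0)
x = A_true + np.sqrt(sigma2) * rng.standard_normal(N)
xbar = x.mean()

# MMSE: approximate the ratio of integrals on a uniform grid over [-A0, A0]
A_grid = np.linspace(-A0, A0, 2001)
log_lik = -np.sum((x[None, :] - A_grid[:, None]) ** 2, axis=1) / (2 * sigma2)
w = np.exp(log_lik - log_lik.max())          # unnormalized posterior on the grid
A_mmse = np.sum(A_grid * w) / np.sum(w)

# MAP: sample mean clipped to [-A0, A0]
A_map = np.clip(xbar, -A0, A0)

# LMMSE: shrinkage of the sample mean
sA2 = A0 ** 2 / 3
A_lmmse = sA2 / (sA2 + sigma2 / N) * xbar

print(A_true, A_mmse, A_map, A_lmmse)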

(Linear Bayesian Estimators) Estimation Theory Spring 2013 125 / 152 Geometrical Interpretation

Vector space of random variables : The set of scalar zero-mean random variables is a vector space. The zero-length vector of the set is a RV with zero variance. For any real scalar a, ax is another zero-mean RV in the set. For two RVs x and y, x + y = y + x is a RV in the set. For two RVs x and y, the inner product is ⟨x, y⟩ = E(xy). The length of x is defined as ‖x‖ = √⟨x, x⟩ = √E(x²). Two RVs x and y are orthogonal if ⟨x, y⟩ = E(xy) = 0.

For RVs, x1, x2, y and real numbers a1 and a2 we have :

< a1x1 + a2x2, y >= a1 < x1, y > +a2 < x2, y >

E[(a_1x_1 + a_2x_2)y] = a_1 E(x_1y) + a_2 E(x_2y)
The projection of y on x is: (⟨y, x⟩/‖x‖²) x = (E(yx)/σ_x²) x

(Linear Bayesian Estimators) Estimation Theory Spring 2013 126 / 152 Geometrical Interpretation

The LMMSE estimator can be determined using the vector space viewpoint. We have:
θ̂ = Σ_{n=0}^{N−1} a_n x[n]
where a_N is zero because of the zero-mean assumption. θ̂ belongs to the subspace spanned by x[0], x[1], ..., x[N − 1]; θ is not in this subspace. A good estimator will minimize the MSE: E[(θ − θ̂)²] = ‖ε‖², where ε = θ − θ̂ is the error vector. Clearly the length of the error vector is minimized when ε is orthogonal to the subspace spanned by x[0], x[1], ..., x[N − 1], i.e. to each data sample:
E[(θ − θ̂) x[n]] = 0   for n = 0, 1, ..., N − 1

(Linear Bayesian Estimators) Estimation Theory Spring 2013 127 / 152 Geometrical Interpretation

The LMMSE estimator can be determined by solving the following equations:
E{[θ − Σ_{m=0}^{N−1} a_m x[m]] x[n]} = 0,   n = 0, 1, ..., N − 1
or
Σ_{m=0}^{N−1} a_m E(x[m]x[n]) = E(θx[n]),   n = 0, 1, ..., N − 1
In matrix form, the matrix [E(x[m]x[n])] (with E(x²[0]), ..., E(x²[N − 1]) on its diagonal) multiplies [a_0, a_1, ..., a_{N−1}]^T to give [E(θx[0]), E(θx[1]), ..., E(θx[N − 1])]^T. Therefore
Cxx a = Cθx^T  ⇒  a = Cxx^{−1} Cθx^T
The LMMSE estimator is: θ̂ = a^T x = Cθx Cxx^{−1} x
(Linear Bayesian Estimators) Estimation Theory Spring 2013 128 / 152 The vector LMMSE Estimator

We want to estimate θ = [θ_1, θ_2, ..., θ_p]^T with a linear estimator that minimizes the Bayesian MSE for each element:
θ̂_i = Σ_{n=0}^{N−1} a_in x[n] + a_iN,   minimize Bmse(θ̂_i) = E[(θ_i − θ̂_i)²]
Therefore:
θ̂_i = E(θ_i) + C_{θ_i x} Cxx^{−1} (x − E(x)),   i = 1, 2, ..., p
The scalar LMMSE estimators can be combined into a vector estimator:
θ̂ = E(θ) + Cθx Cxx^{−1} (x − E(x))   (Cθx is p × N, Cxx is N × N, x is N × 1)
and similarly
Bmse(θ̂_i) = [Mθ̂]_ii   where Mθ̂ = Cθθ − Cθx Cxx^{−1} Cxθ   (p × p)

(Linear Bayesian Estimators) Estimation Theory Spring 2013 129 / 152 The vector LMMSE Estimator

Theorem (Bayesian Gauss–Markov Theorem) If the data are described by the Bayesian linear model form

x = Hθ + w

where θ is a p × 1 random vector with mean E(θ) and covariance Cθθ, and w is a noise vector with zero mean and covariance Cw, uncorrelated with θ (the joint PDF p(θ, w) is otherwise arbitrary), then the LMMSE estimator of θ is:
θ̂ = E(θ) + Cθθ H^T (H Cθθ H^T + Cw)^{−1} (x − H E(θ))

The performance of the estimator is measured by ε = θ − θ̂, whose mean is zero and whose covariance matrix is
Cε = Mθ̂ = Cθθ − Cθθ H^T (H Cθθ H^T + Cw)^{−1} H Cθθ = (Cθθ^{−1} + H^T Cw^{−1} H)^{−1}

(Linear Bayesian Estimators) Estimation Theory Spring 2013 130 / 152 Sequential LMMSE Estimation

Objective : Given θˆ[n − 1] based on x[n − 1], update the new estimate θˆ[n] based on the new sample x[n]. Example (DC Level in White Noise) Assume that both A and w[n] have zero mean :

x[n] = A + w[n]  ⇒  Â[N − 1] = σ_A²/(σ_A² + σ²/N) x̄

Estimator Update : Aˆ[N]=Aˆ[N − 1] + K[N](x[N] − Aˆ[N − 1])

where K[N] = Bmse(Â[N − 1]) / (Bmse(Â[N − 1]) + σ²)
Minimum MSE Update :

Bmse(Aˆ[N]) = (1 − K[N])Bmse(Aˆ[N − 1])
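The sketch below runs these three update equations on simulated data; σ_A², σ² and the number of samples are assumed values for the demo.

import numpy as np

rng = np.random.default_rng(5)
sigma2_A, sigma2, Nmax = 2.0, 0.5, 100
A = rng.normal(0.0, np.sqrt(sigma2_A))               # true DC level
x = A + np.sqrt(sigma2) * rng.standard_normal(Nmax)  # noisy observations

A_hat, bmse = 0.0, sigma2_A          # start from the prior mean and prior variance
for n in range(Nmax):
    K = bmse / (bmse + sigma2)                  # gain K[N]
    A_hat = A_hat + K * (x[n] - A_hat)          # estimator update
    bmse = (1 - K) * bmse                       # minimum MSE update
print(A, A_hat, bmse)    # bmse -> sigma2_A * sigma2 / (Nmax * sigma2_A + sigma2)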

(Linear Bayesian Estimators) Estimation Theory Spring 2013 131 / 152 Sequential LMMSE Estimation

Vector space view : If two observations are orthogonal, the LMMSE estimate of θ is the sum of the projections of θ on each observation.
1. Find the LMMSE estimator of A based on x[0], yielding Â[0]:
Â[0] = (E(Ax[0])/E(x²[0])) x[0] = σ_A²/(σ_A² + σ²) x[0]
2. Find the LMMSE estimator of x[1] based on x[0], yielding x̂[1|0]:
x̂[1|0] = (E(x[0]x[1])/E(x²[0])) x[0] = σ_A²/(σ_A² + σ²) x[0]
3. Determine the innovation of the new data: x̃[1] = x[1] − x̂[1|0]. This error vector is orthogonal to x[0].

4. Add to Â[0] the LMMSE estimator of A based on the innovation :

Â[1] = Â[0] + (E(Ax̃[1])/E(x̃²[1])) x̃[1] = Â[0] + K[1](x[1] − x̂[1|0])
where the added term is the projection of A on x̃[1].
(Linear Bayesian Estimators) Estimation Theory Spring 2013 132 / 152 Sequential LMMSE Estimation

Basic Idea : Generate a sequence of orthogonal RVs, namely the innovations:
x̃[0] = x[0],  x̃[1] = x[1] − x̂[1|0],  x̃[2] = x[2] − x̂[2|0,1],  ...,  x̃[n] = x[n] − x̂[n|0,1,...,n−1]
Then, add the individual estimators to yield :

Â[N] = Σ_{n=0}^{N} K[n] x̃[n] = Â[N − 1] + K[N] x̃[N],   where K[n] = E(Ax̃[n])/E(x̃²[n])
It can be shown that:
x̃[N] = x[N] − Â[N − 1]   and   K[N] = Bmse(Â[N − 1]) / (σ² + Bmse(Â[N − 1]))
and the minimum MSE can be updated as: Bmse(Â[N]) = (1 − K[N]) Bmse(Â[N − 1])

(Linear Bayesian Estimators) Estimation Theory Spring 2013 133 / 152 General Sequential LMMSE Estimation

Consider the general Bayesian linear model x = Hθ + w, where w is an uncorrelated noise with the diagonal covariance matrix Cw. Let's define:
Cw[n] = diag(σ_0², σ_1², ..., σ_n²)
H[n] = [H[n − 1]; h^T[n]]   (H[n − 1] is n × p, h^T[n] is 1 × p)
x[n] = [x[0], x[1], ..., x[n]]^T
Estimator Update :
θ̂[n] = θ̂[n − 1] + K[n] (x[n] − h^T[n] θ̂[n − 1])
where
K[n] = M[n − 1] h[n] / (σ_n² + h^T[n] M[n − 1] h[n])
Minimum MSE Matrix Update :
M[n] = (I − K[n] h^T[n]) M[n − 1]

(Linear Bayesian Estimators) Estimation Theory Spring 2013 134 / 152 General Sequential LMMSE Estimation

Remarks : For the initialization of the sequential LMMSE estimator, we can use the prior information :

θ̂[−1] = E(θ),   M[−1] = Cθθ

For no prior knowledge about θ we can let Cθθ → ∞. Then we have the same form as the sequential LSE, although the approaches are fundamentally different. No matrix inversion is required. The gain factor K[n] weighs confidence in the new data (measured by σ_n²) against all previous data (summarized by M[n − 1]).
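A sketch of this vector recursion, initialized from the prior as in the remarks above; the model sizes, prior, noise variances and seed are assumptions for the demo.

import numpy as np

rng = np.random.default_rng(6)
p, N = 3, 200
theta_mean = np.zeros(p)
C_theta = np.diag([2.0, 1.0, 0.5])
theta = rng.multivariate_normal(theta_mean, C_theta)   # true parameter
sigma2_n = rng.uniform(0.2, 1.0, N)                    # time-varying noise variances
Hrows = rng.standard_normal((N, p))                    # h[n]^T stacked row by row
x = Hrows @ theta + np.sqrt(sigma2_n) * rng.standard_normal(N)

theta_hat, M = theta_mean.copy(), C_theta.copy()       # initialization from the prior
for n in range(N):
    h = Hrows[n]
    K = M @ h / (sigma2_n[n] + h @ M @ h)              # gain vector (no matrix inversion)
    theta_hat = theta_hat + K * (x[n] - h @ theta_hat) # estimator update
    M = (np.eye(p) - np.outer(K, h)) @ M               # minimum MSE matrix update
print(theta)
print(theta_hat)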

(Linear Bayesian Estimators) Estimation Theory Spring 2013 135 / 152 Wiener Filtering

Consider the signal model: x[n] = s[n] + w[n], n = 0, 1, ..., N − 1, where the signal and noise are zero mean with known covariance matrices.

There are three main problems concerning Wiener filters: Filtering : Estimate θ = s[n] (scalar) based on the data set x = [x[0], x[1], ..., x[n]]. The signal sample is estimated based on the present and past data only.

Smoothing : Estimate θ = s = [s[0], s[1], ..., s[N − 1]] (vector) based on the data set x = [x[0], x[1], ..., x[N − 1]]. The signal sample is estimated based on the present, past and future data.

Prediction : Estimate θ = x[N − 1 + ℓ] for a positive integer ℓ, based on the data set x = [x[0], x[1], ..., x[N − 1]].

Remarks : All these problems are solved using the LMMSE estimator: θ̂ = Cθx Cxx^{−1} x, with the minimum Bmse: Mθ̂ = Cθθ − Cθx Cxx^{−1} Cxθ

(Linear Bayesian Estimators) Estimation Theory Spring 2013 136 / 152 Wiener Filtering ()

Estimate θ = s = [s[0], s[1], ..., s[N − 1]] (vector) based on the data set x = [x[0], x[1], ..., x[N − 1]].
Cxx = Rxx = Rss + Rww   and   Cθx = E(sx^T) = E(s(s + w)^T) = Rss
Therefore:
ŝ = Csx Cxx^{−1} x = Rss (Rss + Rww)^{−1} x = Wx
Filter Interpretation : The Wiener smoothing matrix can be interpreted as an FIR filter. Let's define :

W = [a_0, a_1, ..., a_{N−1}]^T, where a_n^T is the (n + 1)th row of W. Let's define also h_n = [h^(n)[0], h^(n)[1], ..., h^(n)[N − 1]], which is just the vector a_n flipped upside down. Then
ŝ[n] = a_n^T x = Σ_{k=0}^{N−1} h^(n)[N − 1 − k] x[k]
This represents a time-varying, non-causal FIR filter.
(Linear Bayesian Estimators) Estimation Theory Spring 2013 137 / 152 Wiener Filtering (Filtering)
Estimate θ = s[n] based on the data set x = [x[0], x[1], ..., x[n]]. We have again Cxx = Rxx = Rss + Rww and
Cθx = E(s[n][x[0], x[1], ..., x[n]]) = [rss[n], rss[n − 1], ..., rss[0]] = (r'ss)^T
Therefore:
ŝ[n] = (r'ss)^T (Rss + Rww)^{−1} x = a^T x
Filter Interpretation : If we define h_n = [h^(n)[0], h^(n)[1], ..., h^(n)[n]], which is just the vector a flipped upside down, then
ŝ[n] = a^T x = Σ_{k=0}^{n} h^(n)[n − k] x[k]
This represents a time-varying, causal FIR filter. The impulse response can be computed as:
(Rss + Rww) a = r'ss  ⇒  (Rss + Rww) h_n = rss
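A sketch of the smoothing solution ŝ = Rss (Rss + Rww)^{−1} x above, for an assumed AR(1) signal in white noise (all constants and the seed are illustrative):

import numpy as np

rng = np.random.default_rng(7)
N, a, sigma2_u, sigma2_w = 200, 0.9, 1.0, 1.0
rss = (sigma2_u / (1 - a**2)) * (-a) ** np.abs(np.arange(N))      # AR(1) ACF
Rss = rss[np.abs(np.arange(N)[:, None] - np.arange(N)[None, :])]  # Toeplitz R_ss
Rww = sigma2_w * np.eye(N)

s = rng.multivariate_normal(np.zeros(N), Rss)        # one realization of the signal
x = s + np.sqrt(sigma2_w) * rng.standard_normal(N)   # noisy observations

W = Rss @ np.linalg.inv(Rss + Rww)                   # Wiener smoothing matrix
s_hat = W @ x
print(np.mean((s - x) ** 2), np.mean((s - s_hat) ** 2))   # smoothing reduces the MSE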

(Linear Bayesian Estimators) Estimation Theory Spring 2013 138 / 152 Wiener Filtering (Prediction)

Estimate θ = x[N − 1 + ℓ] based on the data set x = [x[0], x[1], ..., x[N − 1]]. We have again Cxx = Rxx and
Cθx = E(x[N − 1 + ℓ][x[0], x[1], ..., x[N − 1]]) = [rxx[N − 1 + ℓ], rxx[N − 2 + ℓ], ..., rxx[ℓ]] = (r'xx)^T
Therefore:
x̂[N − 1 + ℓ] = (r'xx)^T Rxx^{−1} x = a^T x
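As a quick check of this predictor (the AR(1) model and sizes below are assumptions for the demo), for a noiselessly observed AR(1) process the one-step (ℓ = 1) weights a should put all their mass on the most recent sample, with weight equal to the AR coefficient:

import numpy as np

# One-step prediction (l = 1) for a noiseless AR(1) signal s[n] = -a s[n-1] + u[n]
a, sigma2_u, N = -0.8, 1.0, 4                 # so the AR coefficient -a equals 0.8
rxx = (sigma2_u / (1 - a**2)) * (-a) ** np.arange(N + 1)   # r_xx[0..N]
Rxx = rxx[np.abs(np.arange(N)[:, None] - np.arange(N)[None, :])]
r_pred = rxx[N:0:-1]                          # [r_xx[N], r_xx[N-1], ..., r_xx[1]]

w = np.linalg.solve(Rxx, r_pred)              # weights: x_hat[N] = sum_k w[k] x[k]
print(w)                                      # ~ [0, 0, 0, 0.8]: x_hat[N] = 0.8 x[N-1]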

Filter Interpretation : If we again let h[N − k] = a_k, we have
x̂[N − 1 + ℓ] = a^T x = Σ_{k=0}^{N−1} h[N − k] x[k]
This represents a time-invariant, causal FIR filter. The impulse response can be computed as: Rxx h = r'xx. For ℓ = 1, this is equivalent to an autoregressive (linear prediction) model.
(Linear Bayesian Estimators) Estimation Theory Spring 2013 139 / 152 Kalman Filters

(Kalman Filters) Estimation Theory Spring 2013 140 / 152 Introduction

The Kalman filter is a generalization of the Wiener filter. In the Wiener filter we estimate s[n] based on the noisy observation vector x[n]; we assume that s[n] is a stationary random process with known mean and covariance matrix. In the Kalman filter we assume that s[n] is a non-stationary random process whose mean and covariance matrix vary according to a known dynamical model. The Kalman filter is a sequential MMSE estimator of s[n]. If the signal and noise are jointly Gaussian, the Kalman filter is the optimal MMSE estimator; if not, it is the optimal LMMSE estimator. The Kalman filter has many applications in control for state estimation. It can be generalized to vector signals and noise (in contrast to the Wiener filter).

(Kalman Filters) Estimation Theory Spring 2013 141 / 152 Dynamical Signal Models

Consider a first order dynamical model :

s[n]=as[n − 1] + u[n] n ≥ 0

where u[n] is WGN with variance σ_u², s[−1] ∼ N(µ_s, σ_s²), and s[−1] is independent of u[n] for all n ≥ 0. It can be shown that :

s[n] = a^{n+1} s[−1] + Σ_{k=0}^{n} a^k u[n − k]  ⇒  E(s[n]) = a^{n+1} µ_s
and
Cs[m, n] = E{[s[m] − E(s[m])][s[n] − E(s[n])]} = a^{m+n+2} σ_s² + σ_u² a^{m−n} Σ_{k=0}^{n} a^{2k}   for m ≥ n

and Cs[m, n] = Cs[n, m] for m < n. (Kalman Filters) Estimation Theory Spring 2013 142 / 152 Dynamical Signal Models
Theorem (Vector Gauss-Markov Model) The Gauss-Markov model for a p × 1 vector signal s[n] is :

s[n]=As[n − 1] + Bu[n] n ≥ 0

A is p × p, B is p × r, and all eigenvalues of A are less than 1 in magnitude. The r × 1 vector u[n] ∼ N(0, Q) and the initial condition s[−1] is a p × 1 random vector with s[−1] ∼ N(µ_s, Cs), independent of the u[n]'s. Then the signal process is Gaussian with mean E(s[n]) = A^{n+1} µ_s and covariance:
Cs[m, n] = A^{m+1} Cs (A^{n+1})^T + Σ_{k=m−n}^{m} A^k B Q B^T (A^{n−m+k})^T

for m ≥ n, and Cs[m, n] = Cs^T[n, m] for m < n. The covariance matrix is C[n] = Cs[n, n], and the mean and covariance propagation equations are:
E(s[n]) = A E(s[n − 1]),   C[n] = A C[n − 1] A^T + B Q B^T
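The propagation equations can be checked by simulation; the sketch below uses an assumed stable 2 × 2 model (all matrices and the seed are illustrative).

import numpy as np

rng = np.random.default_rng(8)
A = np.array([[0.9, 0.1], [0.0, 0.8]])     # eigenvalues inside the unit circle
B = np.eye(2)
Q = 0.1 * np.eye(2)
mu_s, C_s0 = np.array([1.0, -1.0]), 0.5 * np.eye(2)
steps, trials = 20, 20000

# Monte Carlo simulation of s[n] = A s[n-1] + B u[n]
s = rng.multivariate_normal(mu_s, C_s0, size=trials)
for n in range(steps):
    s = s @ A.T + rng.multivariate_normal(np.zeros(2), Q, size=trials) @ B.T

# Analytical propagation of the mean and covariance
m, C = mu_s.copy(), C_s0.copy()
for n in range(steps):
    m = A @ m
    C = A @ C @ A.T + B @ Q @ B.T

print(m, s.mean(axis=0))                     # should agree
print(C, np.cov(s, rowvar=False), sep="\n")  # should agree up to Monte Carlo error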

(Kalman Filters) Estimation Theory Spring 2013 143 / 152 Scalar Kalman Filter

Data model :
State equation: s[n] = a s[n − 1] + u[n]
Observation equation: x[n] = s[n] + w[n]
Assumptions :
u[n] is zero-mean Gaussian noise with independent samples and E(u²[n]) = σ_u².
w[n] is zero-mean Gaussian noise with independent samples and E(w²[n]) = σ_n² (time-varying variance).
s[−1] ∼ N(µ_s, σ_s²) (for simplicity we suppose that µ_s = 0).
s[−1], u[n] and w[n] are independent.
Objective : Develop a sequential MMSE estimator of s[n] based on the data X[n] = [x[0], x[1], ..., x[n]]. This estimator is the mean of the posterior PDF:
ŝ[n|n] = E(s[n] | x[0], x[1], ..., x[n])

(Kalman Filters) Estimation Theory Spring 2013 144 / 152 Scalar Kalman Filter

To develop the equations of the Kalman filter we need the following properties:
Property 1 : For two uncorrelated, jointly Gaussian data vectors, the MMSE estimator of θ (if θ is zero mean) is given by:

θˆ = E(θ|x1, x2)=E(θ|x1)+E(θ|x2)

Property 2 : If θ = θ1 + θ2, then the MMSE estimator of θ is :

θˆ = E(θ|x)=E(θ1 + θ2|x)=E(θ1|x)+E(θ2|x)

Basic Idea : Generate the innovation x̃[n] = x[n] − x̂[n|n − 1], which is uncorrelated with the previous samples X[n − 1]. Then use x̃[n] instead of x[n] for estimation (X[n] is equivalent to {X[n − 1], x̃[n]}).

(Kalman Filters) Estimation Theory Spring 2013 145 / 152 Scalar Kalman Filter

From Property 1, we have :
ŝ[n|n] = E(s[n] | X[n − 1], x̃[n]) = E(s[n] | X[n − 1]) + E(s[n] | x̃[n]) = ŝ[n|n − 1] + E(s[n] | x̃[n])
From Property 2, we have :
ŝ[n|n − 1] = E(a s[n − 1] + u[n] | X[n − 1]) = a E(s[n − 1] | X[n − 1]) + E(u[n] | X[n − 1]) = a ŝ[n − 1|n − 1]
The MMSE estimator of s[n] based on x̃[n] is:
E(s[n] | x̃[n]) = (E(s[n]x̃[n]) / E(x̃²[n])) x̃[n] = K[n](x[n] − x̂[n|n − 1])
where x̂[n|n − 1] = ŝ[n|n − 1] + ŵ[n|n − 1] = ŝ[n|n − 1]. Finally we have:
ŝ[n|n] = a ŝ[n − 1|n − 1] + K[n](x[n] − ŝ[n|n − 1])

(Kalman Filters) Estimation Theory Spring 2013 146 / 152 Scalar State – Scalar Observation Kalman Filter

State Model : s[n] = a s[n − 1] + u[n]
Observation Model : x[n] = s[n] + w[n]
Prediction : ŝ[n|n − 1] = a ŝ[n − 1|n − 1]
Minimum Prediction MSE :

M[n|n − 1] = a² M[n − 1|n − 1] + σ_u²

Kalman Gain : K[n] = M[n|n − 1] / (σ_n² + M[n|n − 1])
Correction : ŝ[n|n] = ŝ[n|n − 1] + K[n](x[n] − ŝ[n|n − 1])
Minimum MSE : M[n|n] = (1 − K[n]) M[n|n − 1]
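A sketch of these recursions on a simulated scalar model; a, the noise variances and the seed are assumed values, and the observation noise is taken time-invariant here for simplicity.

import numpy as np

rng = np.random.default_rng(9)
a, sigma2_u, sigma2_w, N = 0.95, 0.1, 1.0, 200

# Simulate the state and the noisy observations
s = np.zeros(N)
s_prev = rng.normal(0.0, 1.0)                        # s[-1]
for n in range(N):
    s_prev = a * s_prev + rng.normal(0.0, np.sqrt(sigma2_u))
    s[n] = s_prev
x = s + rng.normal(0.0, np.sqrt(sigma2_w), N)

# Scalar state - scalar observation Kalman filter
s_hat, M = 0.0, 1.0                                  # prior mean and variance of s[-1]
s_filt = np.zeros(N)
for n in range(N):
    s_pred = a * s_hat                               # prediction
    M_pred = a**2 * M + sigma2_u                     # minimum prediction MSE
    K = M_pred / (sigma2_w + M_pred)                 # Kalman gain
    s_hat = s_pred + K * (x[n] - s_pred)             # correction
    M = (1 - K) * M_pred                             # minimum MSE
    s_filt[n] = s_hat

print(np.mean((s - x) ** 2), np.mean((s - s_filt) ** 2))   # filtering reduces the MSE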

(Kalman Filters) Estimation Theory Spring 2013 147 / 152 Properties of Kalman Filter
The Kalman filter is an extension of the sequential MMSE estimator, applied to time-varying parameters that are represented by a dynamical model.
No matrix inversion is required.
The Kalman filter is a time-varying linear filter.
The Kalman filter computes (and uses) its own performance measure, the Bayesian MSE M[n|n]. The prediction stage increases the error, while the correction stage decreases it.
The Kalman filter generates an uncorrelated sequence from the observations, so it can be viewed as a whitening filter.
The Kalman filter is the optimal MMSE estimator in the Gaussian case and the optimal LMMSE estimator if the Gaussian assumption is not valid.
In the KF, M[n|n], M[n|n − 1] and K[n] do not depend on the measurements and can be computed off line.
At steady state (n → ∞) the Kalman filter becomes an LTI filter (M[n|n], M[n|n − 1] and K[n] become constant).
(Kalman Filters) Estimation Theory Spring 2013 148 / 152 Vector State – Scalar Observation Kalman Filter

State Model : s[n] = A s[n − 1] + B u[n] with u[n] ∼ N(0, Q)
Observation Model : x[n] = h^T[n] s[n] + w[n], with s[−1] ∼ N(µ_s, Cs)
Prediction : ŝ[n|n − 1] = A ŝ[n − 1|n − 1]
Minimum Prediction MSE (p × p) : M[n|n − 1] = A M[n − 1|n − 1] A^T + B Q B^T
Kalman Gain (p × 1) : K[n] = M[n|n − 1] h[n] / (σ_n² + h^T[n] M[n|n − 1] h[n])
Correction : ŝ[n|n] = ŝ[n|n − 1] + K[n](x[n] − h^T[n] ŝ[n|n − 1])
Minimum MSE Matrix (p × p) : M[n|n] = (I − K[n] h^T[n]) M[n|n − 1]

(Kalman Filters) Estimation Theory Spring 2013 149 / 152 Vector State – Vector Observation Kalman Filter

State Model : s[n] = A s[n − 1] + B u[n] with u[n] ∼ N(0, Q)
Observation Model : x[n] = H[n] s[n] + w[n] with w[n] ∼ N(0, C[n])
Prediction : ŝ[n|n − 1] = A ŝ[n − 1|n − 1]
Minimum Prediction MSE (p × p) :

M[n|n − 1] = A M[n − 1|n − 1] A^T + B Q B^T

Kalman Gain Matrix (p × M) (matrix inversion required!):
K[n] = M[n|n − 1] H^T[n] (C[n] + H[n] M[n|n − 1] H^T[n])^{−1}

Correction : ŝ[n|n] = ŝ[n|n − 1] + K[n](x[n] − H[n] ŝ[n|n − 1])
Minimum MSE Matrix (p × p) :

M[n|n]=(I − K[n]H[n])M[n|n − 1]
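A sketch of the vector recursions on an assumed two-state position/velocity model observed through its first component (all matrices and the seed are illustrative).

import numpy as np

rng = np.random.default_rng(10)
A = np.array([[1.0, 1.0], [0.0, 1.0]])       # toy position/velocity dynamics
B = np.array([[0.5], [1.0]])
Q = np.array([[0.01]])
H = np.array([[1.0, 0.0]])                   # observe the position only
C = np.array([[1.0]])                        # observation noise covariance
N = 200

s = np.zeros(2)
s_hat, M = np.zeros(2), np.eye(2)
err_raw, err_kf = [], []
for n in range(N):
    s = A @ s + B @ rng.multivariate_normal(np.zeros(1), Q)      # state propagation
    x = H @ s + rng.multivariate_normal(np.zeros(1), C)          # observation

    s_pred = A @ s_hat                                           # prediction
    M_pred = A @ M @ A.T + B @ Q @ B.T                           # prediction MSE
    K = M_pred @ H.T @ np.linalg.inv(C + H @ M_pred @ H.T)       # Kalman gain matrix
    s_hat = s_pred + K @ (x - H @ s_pred)                        # correction
    M = (np.eye(2) - K @ H) @ M_pred                             # MSE matrix

    err_raw.append((s[0] - x[0]) ** 2)
    err_kf.append((s[0] - s_hat[0]) ** 2)
print(np.mean(err_raw), np.mean(err_kf))     # the filtered position error is smaller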

(Kalman Filters) Estimation Theory Spring 2013 150 / 152 Extended Kalman Filter

The extended Kalman filter is a sub-optimal approach when the state and the observation equations are nonlinear.
Nonlinear data model : Consider the following nonlinear data model
s[n] = f(s[n − 1]) + B u[n]
x[n] = g(s[n]) + w[n]
where f(·) and g(·) are nonlinear (vector-to-vector) mappings.
Model linearization : We linearize the model around the estimated value of s using a first-order Taylor series:
f(s[n − 1]) ≈ f(ŝ[n − 1|n − 1]) + (∂f/∂s[n − 1]) (s[n − 1] − ŝ[n − 1|n − 1])
g(s[n]) ≈ g(ŝ[n|n − 1]) + (∂g/∂s[n]) (s[n] − ŝ[n|n − 1])
We denote the Jacobians, evaluated at ŝ[n − 1|n − 1] and ŝ[n|n − 1] respectively, by:
A[n − 1] = ∂f/∂s[n − 1] |_{ŝ[n−1|n−1]},   H[n] = ∂g/∂s[n] |_{ŝ[n|n−1]}
(Kalman Filters) Estimation Theory Spring 2013 151 / 152 Extended Kalman Filter

Now we use the linearized models :

s[n] = A[n − 1] s[n − 1] + B u[n] + f(ŝ[n − 1|n − 1]) − A[n − 1] ŝ[n − 1|n − 1]
x[n] = H[n] s[n] + w[n] + g(ŝ[n|n − 1]) − H[n] ŝ[n|n − 1]

There are two differences with the standard Kalman filter:
1. A is now time-varying.
2. Both equations have known terms added to them.
This does not change the derivation of the Kalman filter. The extended Kalman filter has exactly the same equations, with A replaced by A[n − 1]. A[n − 1] and H[n] must be computed at each sampling time, so the MSE matrices and the Kalman gain can no longer be computed off line.
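A minimal EKF sketch for an assumed scalar nonlinear model with analytic derivatives standing in for the Jacobians; the functions f, g and all constants are invented for the demo.

import numpy as np

rng = np.random.default_rng(11)
# Illustrative scalar nonlinear model:
#   s[n] = 0.9 s[n-1] + 0.2 sin(s[n-1]) + u[n],   x[n] = s[n]^2 / 2 + w[n]
f = lambda s: 0.9 * s + 0.2 * np.sin(s)
g = lambda s: 0.5 * s ** 2
df = lambda s: 0.9 + 0.2 * np.cos(s)          # Jacobian (derivative) of f
dg = lambda s: s                              # Jacobian (derivative) of g
q, r, N = 0.01, 0.1, 200

s_true, s_hat, M = 1.0, 1.0, 1.0
err = []
for n in range(N):
    s_true = f(s_true) + rng.normal(0.0, np.sqrt(q))
    x = g(s_true) + rng.normal(0.0, np.sqrt(r))

    A = df(s_hat)                             # A[n-1], evaluated at s_hat[n-1|n-1]
    s_pred = f(s_hat)                         # prediction
    M_pred = A * M * A + q
    H = dg(s_pred)                            # H[n], evaluated at s_hat[n|n-1]
    K = M_pred * H / (r + H * M_pred * H)     # gain
    s_hat = s_pred + K * (x - g(s_pred))      # correction uses g(s_hat[n|n-1])
    M = (1 - K * H) * M_pred
    err.append((s_true - s_hat) ** 2)
print(np.mean(err))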

(Kalman Filters) Estimation Theory Spring 2013 152 / 152