10. Linear Models and Maximum Likelihood Estimation ECE 830, Spring 2017
Rebecca Willett

Primary Goal

General problem statement: We observe i.i.d. $y_i \sim p_\theta$, $\theta \in \Theta$, and the goal is to determine the $\theta$ that produced $\{y_i\}_{i=1}^n$. That is, given a collection of observations $y_1, \ldots, y_n$ and a probability model $p(y_1, \ldots, y_n \mid \theta)$ parameterized by $\theta$, determine the value of $\theta$ that best matches the observations.

Estimation Using the Likelihood

Definition (likelihood function): $p(y \mid \theta)$, viewed as a function of $\theta$ with $y$ fixed, is called the "likelihood function".

If the likelihood function carries the information about $\theta$ brought by the observations $y = \{y_i\}_i$, how do we use it to obtain an estimator?

Definition (maximum likelihood estimation):
$$\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta \in \Theta} p(y \mid \theta)$$
is the value of $\theta$ that maximizes the density at $y$. Intuitively, we are choosing $\theta$ to maximize the probability of occurrence of $y$.

Maximum Likelihood Estimation

MLEs are a very important type of estimator for the following reasons:
- The MLE occurs naturally in composite hypothesis testing and signal detection (e.g., the GLRT).
- The MLE is often simple and easy to compute.
- MLEs are invariant under reparameterization.
- MLEs often have asymptotically optimal properties (e.g., consistency: MSE $\to 0$ as $n \to \infty$).

Computing the MLE

If the likelihood function is differentiable, then $\hat{\theta}$ is found from
$$\frac{\partial \log p(y \mid \theta)}{\partial \theta} = 0.$$
If multiple solutions exist, then the MLE is the solution that maximizes $\log p(y \mid \theta)$; that is, take the global maximizer. Note: it is possible to have multiple global maximizers that are all MLEs!

Example: Estimating the mean and variance of a Gaussian

$$y_i = A + \nu_i, \quad \nu_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2), \quad i = 1, \ldots, n, \qquad \theta = [A, \sigma^2]^\top$$
$$\frac{\partial \log p(y \mid \theta)}{\partial A} = \frac{1}{\sigma^2} \sum_{i=1}^n (y_i - A)$$
$$\frac{\partial \log p(y \mid \theta)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^n (y_i - A)^2$$
Setting these derivatives to zero gives
$$\hat{A} = \frac{1}{n} \sum_{i=1}^n y_i, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{A})^2.$$
Note: $\hat{\sigma}^2$ is biased!

Example: Stock Market (Dow-Jones Industrial Average)

Based on a plot of the index over time, we might conjecture that the data is "on average" increasing. Probability model:
$$y_i = A + Bi + \nu_i,$$
where $A, B$ are unknown parameters and $\nu_i$ is white Gaussian noise modeling fluctuations.
$$p(\{y_i\}_i \mid A, B) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - A - Bi)^2 \right\}$$

A Modern Example: Imaging

Image processing can involve complicated estimation problems. For example, suppose we observe a moving object with noise. The observed image is blurry and noisy, and our goal is to restore it by deblurring and denoising.

We can model the moving part of the observed image as
$$y = \underbrace{\theta}_{\text{noise-free blurry image}} + \underbrace{\nu}_{\text{noise}} = \underbrace{x * w}_{\text{motion blur (convolution)}} + \;\nu,$$
where the parameter of interest $w$ is the ideal (non-blurry) image and $x$ represents the blur model or point spread function.

Observation model:
$$y = Xw + \nu, \quad \nu \sim \mathcal{N}(0, \sigma^2 I) \quad \Longrightarrow \quad y \sim \mathcal{N}(Xw, \sigma^2 I),$$
where $X$ is a circulant matrix whose first row corresponds to $x$.

Linear Models and Subspaces

The basic linear model is
$$\theta = Xw,$$
where $\theta \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$ with $n > p$ is known and full rank, and $w \in \mathbb{R}^p$ is a $p$-dimensional parameter or weight vector. In other words, we can write
$$\theta = \sum_{i=1}^p w_i x_i,$$
where $w_i$ is the $i$th element of $w$ and $x_i$ is the $i$th column of $X$. We say that $\theta$ lies in the subspace spanned by the columns of $X$.
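To make the linear-model form concrete, here is a minimal NumPy sketch (not from the original notes) that writes the stock-market trend model $y_i = A + Bi + \nu_i$ as $y = Xw + \nu$ with columns $x_1 = \mathbf{1}$ and $x_2 = (1, \ldots, n)^\top$, and recovers $A$ and $B$ by least squares, which (as the later slides show) coincides with the MLE under white Gaussian noise. The synthetic data and the "true" values `A_true`, `B_true`, `sigma` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic trend-plus-noise data: y_i = A + B*i + nu_i (values assumed for illustration)
n = 200
A_true, B_true, sigma = 10.0, 0.05, 1.0
i = np.arange(1, n + 1)
y = A_true + B_true * i + sigma * rng.standard_normal(n)

# Linear-model form theta = X w with columns x_1 = all-ones, x_2 = i
X = np.column_stack([np.ones(n), i])   # X in R^{n x 2}, full rank since n > 2

# For white Gaussian noise, maximizing the likelihood over w is the least-squares
# problem min_w ||y - X w||^2, solved here via numpy's least-squares routine.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
A_hat, B_hat = w_hat
print(f"A_hat = {A_hat:.3f}, B_hat = {B_hat:.4f}")   # should be close to 10 and 0.05
```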
Subspaces

Consider a set of vectors $x_1, x_2, \ldots, x_p \in \mathbb{R}^n$. The span of these $p$ vectors is the set of all vectors $\theta \in \mathbb{R}^n$ that can be generated from linear combinations of the set:
$$\mathrm{span}\left(\{x_i\}_{i=1}^p\right) := \left\{ \theta : \theta = \sum_{i=1}^p w_i x_i, \; w_1, \ldots, w_p \in \mathbb{R} \right\}.$$

$p$-dimensional subspace: If $x_1, \ldots, x_p \in \mathbb{R}^n$ are linearly independent, then their span is a $p$-dimensional subspace of $\mathbb{R}^n$. Hence, even though the signal $\theta$ is a length-$n$ vector, the fact that it lies in the subspace $\mathcal{X}$ means that it is actually a function of only $p \le n$ free parameters or "degrees of freedom".

Example: $n = 3$
$$x_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad x_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \qquad \mathcal{X} = \mathrm{span}(x_1, x_2) = \left\{ \begin{bmatrix} a \\ b \\ 0 \end{bmatrix} : a, b \in \mathbb{R} \right\}$$

Example: $n = 3$
$$x_1 = \begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \\ 0 \end{bmatrix}, \quad x_2 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}, \qquad \mathcal{X} = \mathrm{span}(x_1, x_2) = \left\{ \begin{bmatrix} a \\ a \\ b \end{bmatrix} : a, b \in \mathbb{R} \right\}$$

Examples

Example: "Constant" subspace
$$p = 1, \quad X = x_1 = \frac{1}{\sqrt{n}} \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix} \in \mathbb{R}^n$$
$\theta = x_1 w_1$ is a constant signal with unknown amplitude.

Example: Recommender system
We have $n$ movies in a database, and for a new client we want to estimate how much they will like each movie, denoted $\theta \in \mathbb{R}^n$. We have $p$ existing archetype clients for whom we know their movie preferences; the preferences of the $i$th archetype client are denoted $x_i \in \mathbb{R}^n$, $i = 1, \ldots, p$. We will assume that $\theta$ is a weighted combination of these archetype profiles.

Example: Polynomial regression
$$\theta_i = w_1 + w_2 i + w_3 i^2 + \cdots + w_p i^{p-1}, \qquad X = \text{Vandermonde matrix}$$

Using subspace models

- Imagine we measure a function or signal in noise, $y = \theta + \nu$, where the elements of $\nu$ are independent realizations from a Gaussian density and $\theta = Xw$.
- Question: How can we use knowledge of $X$ to estimate $\theta$ from $y$?
- Let $\mathcal{X}$ be the subspace spanned by the columns of $X$, and let $\mathcal{X}^\perp$ be its orthogonal complement. Then we can uniquely write $y = u + v$, where $u \in \mathcal{X}$ and $v \in \mathcal{X}^\perp$. Since we know $\theta \in \mathcal{X}$, $v$ can be considered pure "noise" and it makes sense to remove it.
- Answer: Orthogonal projection: find the point $u \in \mathcal{X}$ which is closest to the observed $y$.
- In general $\theta \ne u$ because $\nu$ has some component in $\mathcal{X}$.

Orthogonal projection

Given a point $y \in \mathbb{R}^n$ and a subspace $\mathcal{X}$, we want to find the point $\hat{y} \in \mathcal{X}$ which is closest to $y$:
$$\hat{y} = \arg\min_{\theta \in \mathcal{X}} \|\theta - y\|^2 = \arg\min_{\theta = Xw,\ w \in \mathbb{R}^p} \|Xw - y\|^2 = \underbrace{X (X^\top X)^{-1} X^\top}_{=:\,P_{\mathcal{X}},\ \text{projection matrix}} y,$$
where $(X^\top X)^{-1} X^\top$ is the pseudoinverse of $X$. Essentially, we want to keep the component of $y$ that lies in the subspace $\mathcal{X}$ and remove (or "kill") the component in the orthogonal subspace.

Example: Sinusoid subspace

Let
$$\theta_i = A \cos(2\pi f i + \phi), \quad i = 1, \ldots, n,$$
where the frequency $f$ is known but the amplitude $A$ and phase $\phi$ are unknown.

Goal: estimate $A$ and $\phi$ from $y = \theta + \nu$, where $\nu$ is white Gaussian noise.

First we need to express $\theta = [\theta_1, \ldots, \theta_n]^\top$ as an element of a 2-dimensional subspace; i.e., write $\theta = Xw$ where $X$ is a known $n \times 2$ matrix and $w$ is unknown. Then we can apply orthogonal projection.

Real solution: First note that $\cos(\alpha + \beta) = \cos(\alpha)\cos(\beta) - \sin(\alpha)\sin(\beta)$, so that
$$\theta_i = A \cos(2\pi f i + \phi) = A \cos(2\pi f i)\cos(\phi) - A \sin(2\pi f i)\sin(\phi).$$
Let $w_1 := A\cos(\phi)$ and $w_2 := -A\sin(\phi)$, and
$$X = \begin{bmatrix} \cos(2\pi f) & \sin(2\pi f) \\ \cos(4\pi f) & \sin(4\pi f) \\ \vdots & \vdots \\ \cos(2n\pi f) & \sin(2n\pi f) \end{bmatrix}.$$
Then $\theta = Xw$ with $X \in \mathbb{R}^{n \times 2}$ and $w \in \mathbb{R}^2$. Use the pseudoinverse to compute $\hat{w}$, then set
$$\hat{A} = \sqrt{\hat{w}_1^2 + \hat{w}_2^2}, \qquad \hat{\phi} = \arctan(-\hat{w}_2 / \hat{w}_1).$$

MLE and Linear Models

When we observe
$$y = \theta + \nu = Xw + \nu$$
with $\nu \sim \mathcal{N}(0, \sigma^2 I_n)$, the MLE of $w$ (and hence of $\theta$) can be computed by projection onto the subspace spanned by the columns of $X$.
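As a concrete check of this statement, here is a minimal NumPy sketch (not part of the original slides) that applies it to the sinusoid example above: it builds the $n \times 2$ matrix $X$, computes $\hat{w} = (X^\top X)^{-1} X^\top y$, and recovers $\hat{A}$ and $\hat{\phi}$. The true amplitude, phase, frequency, and noise level used to generate the synthetic data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Ground truth (assumed for illustration): known frequency, unknown amplitude/phase
n, f = 100, 0.05
A_true, phi_true, sigma = 2.0, 0.7, 0.5
i = np.arange(1, n + 1)
theta = A_true * np.cos(2 * np.pi * f * i + phi_true)
y = theta + sigma * rng.standard_normal(n)

# 2-D subspace model: theta = X w with columns cos(2*pi*f*i) and sin(2*pi*f*i)
X = np.column_stack([np.cos(2 * np.pi * f * i), np.sin(2 * np.pi * f * i)])

# Orthogonal projection / MLE under white Gaussian noise: w_hat = (X^T X)^{-1} X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
theta_hat = X @ w_hat                      # = P_X y, the projection of y onto span(X)

# Recover amplitude and phase from w_1 = A cos(phi), w_2 = -A sin(phi)
A_hat = np.hypot(w_hat[0], w_hat[1])
phi_hat = np.arctan2(-w_hat[1], w_hat[0])
print(f"A_hat = {A_hat:.3f}, phi_hat = {phi_hat:.3f}")   # close to 2.0 and 0.7
```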
For more general probability models, the MLE may not have this form. Generalized linear models provide a unifying framework for modeling the relationship between $y$ and $w$. The basic idea is to define a link function $g$ such that
$$g(\mathbb{E}[y]) = Xw.$$
Typically there is no closed-form expression for $\hat{w}_{\mathrm{MLE}}$ and it must be computed numerically.

Example: Colored Gaussian Noise

$$y \sim \mathcal{N}(Xw, \Sigma), \quad w \in \mathbb{R}^p, \quad \Sigma, X \text{ known}$$
$$p(y \mid w) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (y - Xw)^\top \Sigma^{-1} (y - Xw) \right\}$$
The MLE $\hat{w}$ is given by
$$\hat{w} = \arg\min_w \, \left[-\log p(y \mid w)\right] = \arg\min_w \, (y - Xw)^\top \Sigma^{-1} (y - Xw) = (X^\top \Sigma^{-1} X)^{-1} X^\top \Sigma^{-1} y.$$

Invariance of MLE

Suppose we wish to estimate the function $g = G(\theta)$ and not $\theta$ itself. Intuitively we might try $\hat{g} = G(\hat{\theta})$, where $\hat{\theta}$ is the MLE of $\theta$. Remarkably, it turns out that $\hat{g}$ is the MLE of $g$. This very special invariance principle is summarized in the following theorem.

Theorem (Invariance of the MLE): Let $\hat{\theta}$ denote the MLE of $\theta$. Then $\hat{g} = G(\hat{\theta})$ is the MLE of $g = G(\theta)$.

Example: Let $y = [y_1, \ldots, y_n]^\top$ where the $y_i \overset{\text{iid}}{\sim} \text{Poisson}(\lambda)$. Given $y$, find the MLE of the probability that a draw from $\text{Poisson}(\lambda)$ exceeds the mean $\lambda$.
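The notes leave this example as posed. One way to proceed (a sketch, not from the notes): differentiating the Poisson log-likelihood shows that $\hat{\lambda}_{\mathrm{MLE}} = \bar{y}$, the sample mean, and by the invariance principle the MLE of $g = P(Y > \lambda)$ is $G(\hat{\lambda}) = 1 - F_{\mathrm{Poisson}}(\lfloor \hat{\lambda} \rfloor;\, \hat{\lambda})$. The SciPy sketch below illustrates this on synthetic data; `lam_true` and the sample size are illustrative assumptions.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(2)

# Synthetic Poisson data (lambda value assumed for illustration)
lam_true = 3.4
y = rng.poisson(lam_true, size=500)

# MLE of lambda: setting d/d(lambda) log p(y | lambda) = 0 gives the sample mean
lam_hat = y.mean()

# Target quantity: g = G(lambda) = P(Y > lambda) for Y ~ Poisson(lambda).
# By the invariance principle, the MLE of g is G(lam_hat).
# Since Y is integer-valued, P(Y > lambda) = 1 - F(floor(lambda); lambda).
g_hat = poisson.sf(np.floor(lam_hat), mu=lam_hat)

print(f"lam_hat = {lam_hat:.3f}, MLE of P(Y > lambda) = {g_hat:.3f}")
```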