Parametric Inference  Maximum Likelihood Inference  Exponential Families  Expectation Maximization (EM)  Bayesian Inference  Statistical Decision Theory

IP, José Bioucas Dias, IST, 2007 1 Statistical Inference

Statistics aims at retrieving the “causes” (e.g., the parameters of a pdf) from the observations (the effects)

Probability: from causes to effects. Statistics: from the observed effects back to the causes.

Statistical inference problems can thus be seen as Inverse Problems

As a result of this perspective, in the eighteenth century (at the time of Bayes and Laplace) Statistics was often called Inverse Probability

IP, José Bioucas Dias, IST, 2007 2 Parametric Inference

Consider the parametric model $\{\, p(g; f) : f \in \Theta \,\}$, where

$\Theta$ is the parameter space and $f$ is the parameter

The problem of inference reduces to the estimation of $f$ from the observation $g$; i.e., to building an estimator $\hat f = \hat f(g)$

Parameters of interest and nuisance parameters: sometimes we are only interested in some function of the parameter that depends only on part of it; that part is the parameter of interest, and the remaining components are nuisance parameters. Example:

IP, José Bioucas Dias, IST, 2007 3 Parametric Inference (theoretical limits)

The Cramer Rao Lower Bound (CRLB)

Under appropriate regularity conditions, the covariance matrix of any unbiased estimator $\hat f$ of $f$ satisfies

$\mathrm{cov}(\hat f) \succeq I(f)^{-1}$, where $I(f)$ is the Fisher information matrix, given by $[I(f)]_{ij} = -E\!\left[\dfrac{\partial^2 \ln p(g; f)}{\partial f_i\, \partial f_j}\right]$

An unbiased estimator that attains the CRLB may be found iff

$\dfrac{\partial \ln p(g; f)}{\partial f} = I(f)\,\big(h(g) - f\big)$

for some function $h$. The efficient estimator is then $\hat f = h(g)$, and its covariance is $I(f)^{-1}$.
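As a quick numerical check of the bound, a minimal sketch for the scalar Gaussian-mean problem, where the Fisher information is n/σ², the CRLB is σ²/n, and the sample mean attains it (all numerical values are illustrative):

```python
import numpy as np

# CRLB check for estimating the mean f of N(f, sigma2) from n IID samples:
# the Fisher information is n / sigma2, so any unbiased estimator has variance >= sigma2 / n.
rng = np.random.default_rng(0)
f, sigma2, n, trials = 1.5, 2.0, 50, 20000
g = rng.normal(f, np.sqrt(sigma2), size=(trials, n))
estimates = g.mean(axis=1)                     # the sample mean (unbiased)
print("empirical variance:", estimates.var())  # close to sigma2 / n = 0.04
print("CRLB:", sigma2 / n)
```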

IP, José Bioucas Dias, IST, 2007 4 CRLB for the general Gaussian case

Example: Parameter of a signal in white noise

If

Example: Known signal in unknown white noise

IP, José Bioucas Dias, IST, 2007 5 Maximum Likelihood Method

The maximum likelihood estimate (MLE) is the value of the parameter that maximizes the likelihood: $\hat f_{\mathrm{ML}} = \arg\max_{f \in \Theta} p(g; f)$

Since the logarithm is monotonic, if $p(g; f) > 0$ for all $f$ we can equivalently use the log-likelihood $L(f) = \ln p(g; f)$

Example (Bernoulli)
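A minimal numerical sketch of the Bernoulli example; it assumes, as is standard, that the MLE of the success probability is the sample mean, and checks it against a brute-force grid maximization of the log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.binomial(1, 0.3, size=200)         # IID Bernoulli(0.3) observations

theta_ml = g.mean()                        # closed-form MLE: fraction of ones

# brute-force check: maximize the log-likelihood on a grid
theta = np.linspace(1e-3, 1 - 1e-3, 1000)
loglik = g.sum() * np.log(theta) + (g.size - g.sum()) * np.log(1 - theta)
print(theta_ml, theta[np.argmax(loglik)])  # both close to 0.3
```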

IP, José Bioucas Dias, IST, 2007 6 Maximum Likelihood

Example (Uniform)


IP, José Bioucas Dias, IST, 2007 7 Maximum Likelihood

Example (Gaussian) IID

Sample mean

Sample variance

IP, José Bioucas Dias, IST, 2007 8 Maximum Likelihood

Example (Multivariate Gaussian) IID

Sample mean

Sample covariance
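A minimal sketch of the multivariate Gaussian MLEs; note that the ML covariance uses the 1/n normalization, whereas np.cov defaults to the unbiased 1/(n-1) estimate (the numerical values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = np.array([1.0, -2.0])
C_true = np.array([[2.0, 0.5], [0.5, 1.0]])
x = rng.multivariate_normal(mu_true, C_true, size=500)   # IID samples, shape (n, d)

mu_ml = x.mean(axis=0)                                   # sample mean
xc = x - mu_ml
C_ml = xc.T @ xc / x.shape[0]                            # sample covariance (1/n normalization)
print(mu_ml, C_ml, sep="\n")
```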

IP, José Bioucas Dias, IST, 2007 9 Maximum Likelihood (linear observation model)

Example: Linear observation in Gaussian noise

A is full rank

IP, José Bioucas Dias, IST, 2007 10 Example: Linear observation in Gaussian noise (cont.)

• The MLE is equivalent to the least-squares estimator (LSE) using the noise-covariance-weighted norm

• If the noise is white and A has full column rank, $\hat f$ is given by the Moore-Penrose pseudo-inverse: $\hat f = A^{+} g = (A^T A)^{-1} A^T g$

• $A A^{+}$ is a projection matrix (it projects g onto the range of A)

(SVD)

• If the noise is zero-mean but not Gaussian, the Best Linear Unbiased Estimator (BLUE) is still given by the same expression (see the sketch below)
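A minimal sketch of the linear-Gaussian MLE, assuming white noise so that the pseudo-inverse applies; np.linalg.lstsq returns the same least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 5
A = rng.normal(size=(m, n))                     # full column rank (with probability 1)
f_true = rng.normal(size=n)
g = A @ f_true + 0.1 * rng.normal(size=m)       # g = A f + w, white Gaussian noise

f_pinv  = np.linalg.pinv(A) @ g                 # Moore-Penrose pseudo-inverse solution
f_lstsq = np.linalg.lstsq(A, g, rcond=None)[0]  # equivalent least-squares solution
print(np.allclose(f_pinv, f_lstsq))             # True
```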

IP, José Bioucas Dias, IST, 2007 11 Maximum likelihood

Linear observation in Gaussian noise

MLE

Properties (MLE is optimal for the linear model)

• It is the Minimum Variance Unbiased (MVU) estimator [it is unbiased and its covariance is the minimum among all unbiased estimators]

• It is efficient (it attains the Cramer-Rao Lower Bound (CRLB))

• Its PDF is Gaussian

IP, José Bioucas Dias, IST, 2007 12 Maximum likelihood (characterization)

Appealing properties of MLE

Let the data be a sequence of IID random vectors and let $\hat f_n$ denote the MLE computed from the first n observations. Then:

1. The MLE is consistent: $\hat f_n$ converges in probability to the true parameter

2. The MLE is equivariant: if $\hat f$ is the MLE of $f$, then $\tau(\hat f)$ is the MLE of $\tau(f)$

3. The MLE (under appropriate regularity conditions) is asymptotically Normal and optimal (efficient): its asymptotic covariance is the inverse of the Fisher information matrix

IP, José Bioucas Dias, IST, 2007 13 The Exponential Family

Definition: the set of densities is an exponential family of dimension k if there are functions $h$, $c = (c_1, \dots, c_k)$, $T = (T_1, \dots, T_k)$, and $d$ such that each density can be written as $p(x; f) = h(x)\,\exp\{\, c(f)^T T(x) - d(f) \,\}$

$T(x)$ is a sufficient statistic for f, i.e., the conditional distribution of the data given $T(x)$ does not depend on f

Theorem (Neyman-Fisher Factorization): $T(x)$ is a sufficient statistic for f iff the density can be factored as the product of a term that depends on x only through $T(x)$ (and on f) and a term that depends on x alone

IP, José Bioucas Dias, IST, 2007 14 The exponential family

Natural (or canonical) form: given an exponential family, it is always possible to introduce the change of variables $\eta = c(f)$ and the corresponding reparameterization such that $p(x; \eta) = h(x)\,\exp\{\, \eta^T T(x) - A(\eta) \,\}$

Since $p(x; \eta)$ is a PDF, it must integrate to one, which determines the log-partition function: $A(\eta) = \ln \int h(x)\, e^{\eta^T T(x)}\, dx$

IP, José Bioucas Dias, IST, 2007 15 The exponential family (The partition function)

Computing moments from the derivatives of the partition function

After some calculus, $E[T(x)] = \nabla A(\eta)$ and $\mathrm{cov}[T(x)] = \nabla^2 A(\eta)$
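A short derivation of the first-moment identity, a sketch assuming the canonical form introduced above:

```latex
A(\eta) = \ln \int h(x)\, e^{\eta^{T} T(x)}\, dx
\;\;\Longrightarrow\;\;
\nabla A(\eta)
 = \frac{\int T(x)\, h(x)\, e^{\eta^{T} T(x)}\, dx}{\int h(x)\, e^{\eta^{T} T(x)}\, dx}
 = E\big[T(x)\big],
\qquad
\nabla^{2} A(\eta) = \operatorname{cov}\big[T(x)\big].
```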

IP, José Bioucas Dias, IST, 2007 16 The exponential family (IID sequences)

Let $p(x; \eta)$ be a member of an exponential family defined by the canonical form above

The density of the IID sequence $x_1, \dots, x_n$ is the product of the individual densities, and it

belongs to an exponential family of the same dimension, with sufficient statistic $\sum_{i=1}^n T(x_i)$ and log-partition function $n A(\eta)$

IP, José Bioucas Dias, IST, 2007 17 Examples of exponential families

Many of the most common probabilistic models belong to exponential families; e.g., Gaussian, Poisson, Bernoulli, binomial, exponential, gamma, and beta.

Example:

Canonical form

IP, José Bioucas Dias, IST, 2007 18 Examples of exponential families (Gaussian)

Example:

Canonical form
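As a reference, a sketch of the canonical form for the scalar Gaussian $\mathcal N(\mu, \sigma^2)$; the natural parameters $\eta_1, \eta_2$ and the statistic $T(x)$ below follow the standard rewriting and the notation introduced above:

```latex
p(x; \mu, \sigma^2)
 = \exp\!\Big\{ \underbrace{\tfrac{\mu}{\sigma^{2}}}_{\eta_1}\, x
   \;+\; \underbrace{\big(-\tfrac{1}{2\sigma^{2}}\big)}_{\eta_2}\, x^{2}
   \;-\; A(\eta) \Big\},
\qquad T(x) = (x,\, x^{2}),
\qquad
A(\eta) = -\frac{\eta_1^{2}}{4\eta_2} - \tfrac12 \ln(-2\eta_2) + \tfrac12 \ln(2\pi).
```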

IP, José Bioucas Dias, IST, 2007 19 Computing maximum likelihood estimates

Very often the MLE cannot be found analytically. Commonly used numerical methods:
1. Newton-Raphson
2. Scoring
3. Expectation Maximization (EM)

Newton-Raphson method

Scoring method

The Fisher information matrix used by the scoring method does not depend on the observed data and can be computed off-line (as a function of the parameter)
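A minimal sketch of a Newton-Raphson iteration for an MLE, using Bernoulli data in the natural (logit) parameterization as a toy example; for this exponential family the Hessian does not depend on the data, so Newton-Raphson and scoring coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.binomial(1, 0.3, size=500)           # Bernoulli data; parameterize by eta = logit(theta)

eta = 0.0                                    # starting point
for _ in range(25):
    theta = 1.0 / (1.0 + np.exp(-eta))       # sigmoid
    grad = g.sum() - g.size * theta          # d loglik / d eta
    hess = -g.size * theta * (1 - theta)     # d^2 loglik / d eta^2 (minus the Fisher information)
    eta = eta - grad / hess                  # Newton-Raphson update (here identical to scoring)
print(1.0 / (1.0 + np.exp(-eta)), g.mean())  # both close to 0.3
```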

IP, José Bioucas Dias, IST, 2007 20 Computing maximum likelihood estimates (EM)

Expectation Maximization (EM) [Dempster, Laird, and Rubin, 1977]

Suppose that the likelihood $p(g; f)$ is hard to maximize,

but we can find a vector z such that $p(g, z; f)$ is easy to maximize and $p(g; f) = \int p(g, z; f)\, dz$

Idea: iterate between two steps:

E-step: “fill in” z. M-step: maximize.
Terminology: g is the observed data, z is the missing data, and (g, z) is the complete data

IP, José Bioucas Dias, IST, 2007 21 Expectation maximization

The EM algorithm

1. Pick a starting vector $\hat f^{(0)}$ and repeat steps 2 and 3 until convergence:

2. E-step: calculate $Q(f; \hat f^{(t)}) = E\big[\ln p(g, z; f) \,\big|\, g, \hat f^{(t)}\big]$

3. M-step: $\hat f^{(t+1)} = \arg\max_f Q(f; \hat f^{(t)})$

Alternatively (GEM): choose any $\hat f^{(t+1)}$ such that $Q(\hat f^{(t+1)}; \hat f^{(t)}) \ge Q(\hat f^{(t)}; \hat f^{(t)})$

IP, José Bioucas Dias, IST, 2007 22 Expectation maximization

The EM (GEM) algorithm always increases the likelihood.

Define

1.

2. Kullback-Leibler distance 3.

4.

KL distance maximization
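A compact version of the monotonicity argument, a sketch written in terms of the Q function defined above and the Kullback-Leibler distance:

```latex
\ln p(g; f) = Q(f; \hat f^{(t)}) + H(f; \hat f^{(t)}),
\qquad
H(f; \hat f^{(t)}) = -\,E\big[\ln p(z \mid g; f) \,\big|\, g, \hat f^{(t)}\big],
```
```latex
H(f; \hat f^{(t)}) - H(\hat f^{(t)}; \hat f^{(t)})
 = \mathrm{KL}\big(p(z \mid g; \hat f^{(t)}) \,\|\, p(z \mid g; f)\big) \;\ge\; 0,
```
```latex
\ln p(g; \hat f^{(t+1)}) - \ln p(g; \hat f^{(t)})
 \;\ge\; Q(\hat f^{(t+1)}; \hat f^{(t)}) - Q(\hat f^{(t)}; \hat f^{(t)}) \;\ge\; 0 .
```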

IP, José Bioucas Dias, IST, 2007 23 Expectation maximization (why does it work?)

IP, José Bioucas Dias, IST, 2007 24 EM: Mixtures of densities

Let z be the (discrete) random variable that selects the active mode of the mixture:

where the mixing probabilities are nonnegative and sum to one

IP, José Bioucas Dias, IST, 2007 25 EM: Mixtures of densities

Consider now that the data form a sequence of IID random variables

Let $z_1, \dots, z_n$ be IID random variables, where $z_i$ selects the active mode of sample i:

IP, José Bioucas Dias, IST, 2007 26 EM: Mixtures of densities

Equivalent Q

Where is the sample mean of x, i.e.,

IP, José Bioucas Dias, IST, 2007 27 EM: Mixtures of densities

E-step:

M-step:

IP, José Bioucas Dias, IST, 2007 28 EM: Mixtures of densities

E-step:

M-step:

IP, José Bioucas Dias, IST, 2007 29 EM: Mixtures of Gaussian densities (MOGs)

E-step:

M-step:

Weighted sample mean

Weighted sample covariance
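A minimal numerical sketch of the E and M steps above for a 1-D mixture of Gaussians; the variable names are my own, and the usage example draws data from the 3-mode configuration used in the next slides (means 0, 3, 6; variances 1, 3, 10; weights 0.6316, 0.3158, 0.0526; N = 1900):

```python
import numpy as np

def em_mog_1d(x, K, n_iter=50, seed=0):
    """EM for a 1-D mixture of K Gaussians (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = x.size
    w = np.full(K, 1.0 / K)                      # mixing weights
    mu = rng.choice(x, size=K, replace=False)    # crude initialization of the means
    var = np.full(K, x.var())                    # variances
    for _ in range(n_iter):
        # E-step: responsibilities p(z = k | x_i, current parameters)
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = w * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted sample means / variances and updated weights
        nk = r.sum(axis=0)
        w = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var

# usage: data drawn from the 3-mode example of the slides
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 1200),
                    rng.normal(3, np.sqrt(3), 600),
                    rng.normal(6, np.sqrt(10), 100)])
print(em_mog_1d(x, K=3))
```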

IP, José Bioucas Dias, IST, 2007 30 EM: Mixtures of Gaussian densities. 1D Example

True vs. estimated parameters (p = 3 modes, N = 1900 samples):

mode   true mean   true variance   true weight   est. mean   est. variance   est. weight
1      0           1               0.6316        -0.0288     1.0287          0.6258
2      3           3               0.3158         2.8952     2.5649          0.3107
3      6           10              0.0526         6.1687     7.3980          0.0635

[Plot: log-likelihood L(f_k) versus EM iteration (0 to 30).]

IP, José Bioucas Dias, IST, 2007 31 EM: Mixtures of Gaussian Densities (MOGs)

Example, 1D, p = 3 modes, N = 1900; true parameters (mean, variance, weight): (0, 1, 0.6316), (3, 3, 0.3158), (6, 10, 0.0526)

[Plots: histogram of the data with the estimated MOG and estimated modes, and with the true MOG and true modes.]

IP, José Bioucas Dias, IST, 2007 32 EM: Mixtures of Gaussian Densities: 2D Example

MOG with determination of the number of modes [M. Figueiredo, 2002]

[Plot: 2-D data and the fitted mixture with k = 3 modes.]

IP, José Bioucas Dias, IST, 2007 33 Bayesian Estimation

IP, José Bioucas Dias, IST, 2007 34 The Bayesian Philosophy ([Wasserman, 2004])

Bayesian Inference

B1 – Probabilities describe degrees of belief, not limiting relative frequency

B2 – We can make probability statements about parameters, even though they are fixed parameters

B3 – We make inferences about a parameter θ by producing a probability distribution for θ

Frequentist or Classical Inference

F1 – Probabilities refer to limiting relative frequencies and are objective properties of the real world

F2 – Parameters are fixed unknown constants

F3 – The criteria for obtaining statistical procedures are based on long run frequency properties.

IP, José Bioucas Dias, IST, 2007 35 The Bayesian Philosophy

Observation model: unknown parameter f, observation g

Prior knowledge

Bayesian Inference vs. Classical Inference:

in the Bayesian view, the prior describes degrees of belief (subjective), not limiting frequency

IP, José Bioucas Dias, IST, 2007 36 The Bayesian method

1. Choose a prior density p(f), called the prior (or a priori) distribution, that expresses our beliefs about f before we see any data

2. Choose the observation model p(g|f) that reflects our beliefs about g given f

3. Calculate the posterior (or a posteriori) distribution using the Bayes law: $p(f \mid g) = p(g \mid f)\, p(f) / p(g)$, where

$p(g) = \int p(g \mid f)\, p(f)\, df$

is the marginal on g (other names: evidence, unconditional, predictive)

4. Any inference should be based on the posterior
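A minimal numerical sketch of steps 1-3 for a scalar parameter, approximating the evidence p(g) by a sum over a grid (the Bernoulli model and the grid size are illustrative choices of mine):

```python
import numpy as np

g = np.array([1, 0, 1, 1, 0, 1, 1, 1])             # observed coin flips
theta = np.linspace(1e-3, 1 - 1e-3, 1000)           # grid over the parameter
dtheta = theta[1] - theta[0]

prior = np.ones_like(theta)                          # step 1: flat prior (up to normalization)
lik = theta**g.sum() * (1 - theta)**(g.size - g.sum())   # step 2: observation model p(g|theta)

post = lik * prior
post /= post.sum() * dtheta                          # step 3: divide by the evidence p(g)
print(theta[np.argmax(post)], (theta * post).sum() * dtheta)  # posterior mode and mean
```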

IP, José Bioucas Dias, IST, 2007 37 The Bayesian method

Example: Let $g_1, \dots, g_n$ be IID Bernoulli(θ) observations and let the prior on θ be Beta(α, β)

4 ==0.5 for = >1, pulls 3.5 ==1 ==2  towards 1/2 3 ==10

2.5

2

1.5

1

0.5

0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

IP, José Bioucas Dias, IST, 2007 38 Example (cont.):

(Bernoulli observations, Beta prior)

Observation model: $p(g_1, \dots, g_n \mid \theta) = \theta^{\sum_i g_i}\, (1-\theta)^{\, n - \sum_i g_i}$

Prior: $\theta \sim \mathrm{Beta}(\alpha, \beta)$, i.e., $p(\theta) \propto \theta^{\alpha-1} (1-\theta)^{\beta-1}$

Posterior: $p(\theta \mid g) \propto \theta^{\alpha + \sum_i g_i - 1}\, (1-\theta)^{\beta + n - \sum_i g_i - 1}$

Thus, the posterior is $\mathrm{Beta}\big(\alpha + \textstyle\sum_i g_i,\; \beta + n - \sum_i g_i\big)$

IP, José Bioucas Dias, IST, 2007 39 Example (cont.):

(Bernoulli observations, Beta prior)

Maximum a posteriori (MAP) estimate: the mode of the Beta posterior, $\hat\theta_{\mathrm{MAP}} = \dfrac{\alpha + \sum_i g_i - 1}{\alpha + \beta + n - 2}$

• Total ignorance: flat prior, α = β = 1 • Note that for large values of n the MAP estimate approaches the ML estimate (the sample mean)

The von Mises Theorem

If the prior is continuous and not zero at the location of the ML estimate, then, as n grows, the posterior concentrates around the ML estimate and Bayesian and ML inferences asymptotically coincide
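A minimal numerical sketch of the Beta-Bernoulli conjugate update and the resulting ML, MAP, and posterior-mean estimates (the hyperparameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true, a, b, n = 0.3, 2.0, 2.0, 100          # hypothetical truth and Beta(a, b) prior
g = rng.binomial(1, theta_true, size=n)           # Bernoulli observations

# conjugate update: Beta(a, b) prior  ->  Beta(a + sum(g), b + n - sum(g)) posterior
a_post, b_post = a + g.sum(), b + n - g.sum()

theta_ml  = g.mean()                              # ML estimate (sample mean)
theta_map = (a_post - 1) / (a_post + b_post - 2)  # posterior mode (MAP)
theta_pm  = a_post / (a_post + b_post)            # posterior mean
print(theta_ml, theta_map, theta_pm)              # all close to 0.3 for large n
```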

IP, José Bioucas Dias, IST, 2007 40 Conjugate priors

In the previous example, the prior and the posterior are both Beta distributed. We say that the prior is conjugate with respect to the model

• Formally, consider a parametrized family of priors and a parametrized family of observation models

• The family of priors is a conjugate family for the family of observation models if, for every prior in the family and every observation, the resulting posterior also belongs to the family of priors,

for some values of its parameters

• Very often, prior information about f is weak, which leaves freedom to select a conjugate prior

• Why conjugate priors? Because computing the posterior density simply consists in updating the parameters of the prior

IP, José Bioucas Dias, IST, 2007 41 Conjugate priors (Gaussian observation, Gaussian prior)

• Gaussian observations

• Gaussian prior

• The posterior distribution is Gaussian

1. The mean of the posterior lies on the segment between the observation g and the prior mean (it is a convex combination of the two)

2. The variance of the posterior is the "parallel" combination of the observation variance and the prior variance (their inverses add)

IP, José Bioucas Dias, IST, 2007 42 Conjugate priors (Gaussian IID observations, Gaussian prior)

• Gaussian IID observations

• Gaussian prior

• The posterior distribution is Gaussian

1. The mean of the posterior lies on the segment between the sample mean of the observations and the prior mean

2. The variance of the posterior is the "parallel" combination of the observation variance divided by n and the prior variance
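A minimal numerical sketch of the conjugate Gaussian update above, with the posterior precision obtained as the sum of the data and prior precisions (all numerical values are illustrative):

```python
import numpy as np

# prior N(mu0, s0sq); IID observations N(f, ssq) with known noise variance ssq
mu0, s0sq = 0.0, 4.0      # prior mean and variance
ssq = 1.0                 # known observation noise variance
g = np.random.default_rng(1).normal(2.0, np.sqrt(ssq), size=20)  # simulated data
n, gbar = g.size, g.mean()

# posterior precision = sum of precisions ("parallel" combination of variances)
post_var = 1.0 / (n / ssq + 1.0 / s0sq)
# posterior mean = precision-weighted combination of the sample mean and the prior mean
post_mean = post_var * (n * gbar / ssq + mu0 / s0sq)
print(post_mean, post_var)
```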

IP, José Bioucas Dias, IST, 2007 43 Conjugate Priors (Gaussian IID observations, Gaussian prior)

[Figure: prior and posterior densities for the Gaussian IID observations / Gaussian prior example.]

IP, José Bioucas Dias, IST, 2007 44 Conjugate Priors (multivariate Gaussian: observation and prior)

• (g,f) jointly Gaussian distributed:

• Then a)

b)

c)
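For reference, a sketch of the standard Gaussian conditioning result that items a)-c) presumably summarize (the block notation for the means and covariances is my own):

```latex
p(f \mid g) = \mathcal{N}\!\big(f;\ \mu_{f\mid g},\ C_{f\mid g}\big),
\qquad
\mu_{f\mid g} = \mu_f + C_{fg}\, C_{gg}^{-1}\,(g - \mu_g),
\qquad
C_{f\mid g} = C_{ff} - C_{fg}\, C_{gg}^{-1}\, C_{gf}.
```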

IP, José Bioucas Dias, IST, 2007 45 Conjugate Priors (multivariate Gaussian: observation and prior)

• Linear observation model (f and w independent)

• Posterior

IP, José Bioucas Dias, IST, 2007 46 Conjugate Priors (multivariate Gaussian: observation and prior)

• Linear observation model (f and w independent)

• Using the matrix inversion lemma

• The posterior mean (equal to the MAP estimate in the Gaussian case) is the solution of the following regularized LS problem

e.g., penalize oscillatory solutions
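A minimal numerical sketch of the MAP / regularized LS solution, using a finite-difference operator D as an illustrative smoothness penalty, so that λ DᵀD plays the role of σ² C⁻¹ and oscillatory solutions are penalized:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, s2 = 30, 20, 0.1
A = rng.normal(size=(m, n))
f_true = np.sin(np.linspace(0, 3, n))           # a smooth "true" parameter
g = A @ f_true + rng.normal(scale=np.sqrt(s2), size=m)   # g = A f + w, w ~ N(0, s2 I)

D = np.diff(np.eye(n), axis=0)                  # first-difference operator (penalizes oscillations)
lam = s2                                        # regularization weight (illustrative choice)

# MAP / regularized LS: minimize ||g - A f||^2 + lam * ||D f||^2
f_map = np.linalg.solve(A.T @ A + lam * (D.T @ D), A.T @ g)
print(np.linalg.norm(f_map - f_true))
```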

IP, José Bioucas Dias, IST, 2007 47 Improper Priors

• Assume that p(f) = k (constant) on a given domain

• Even if the domain of f is unbounded, and thus p(f) cannot integrate to one (the prior is improper),

the posterior may still be well defined (proper).

• In a sense, improper priors account for a state of total ignorance. This raises no issues for the Bayesian framework, as long as the posterior is proper.

IP, José Bioucas Dias, IST, 2007 48 Bayes Estimators

IP, José Bioucas Dias, IST, 2007 49 Bayes estimators

Ingredients of Statistical Decision Theory:

• posterior distribution conveys all knowledge about f, given the observation g

• the loss function $L(f, \hat f)$ measures the discrepancy between the true parameter f and the estimate $\hat f$

• the a posteriori expected loss is the expectation of the loss with respect to the posterior

• the optimal Bayes estimator minimizes the a posteriori expected loss (see the sketch below)
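A compact statement of the recipe above, a sketch using a generic loss $L(f, \hat f)$ and writing $\rho(\hat f \mid g)$ for the a posteriori expected loss:

```latex
\hat f_{\mathrm{Bayes}}(g) \;=\; \arg\min_{\hat f}\; \rho(\hat f \mid g),
\qquad
\rho(\hat f \mid g) \;=\; \int L(f, \hat f)\, p(f \mid g)\, df .
```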

IP, José Bioucas Dias, IST, 2007 50 Bayesian framework

• Nuisance Parameter

Let the parameter be split as $f = (\theta, \phi)$, where θ is the parameter of interest and

φ is the nuisance parameter

• The posterior risk for θ depends only on the marginal posterior of θ

• In a pure Bayesian framework, nuisance parameters are integrated out

IP, José Bioucas Dias, IST, 2007 51 Bayes estimators: Maximum a posteriori probability (MAP)

• Zero-one ("0/1") loss: the loss is 0 if the estimate falls within an ε-ball of the true parameter and 1 otherwise; the volume of the ε-ball appears as a normalization and, letting ε → 0, one obtains

• Maximum a posteriori probability

A discrete domain leads to the MAP estimator as well

IP, José Bioucas Dias, IST, 2007 52 Bayes Estimators: Posterior Mean (PM)

• Quadratic loss: $L(f, \hat f) = (f - \hat f)^T Q\, (f - \hat f)$, where Q is symmetric and positive definite

(only this term depends on the estimate $\hat f$)

• Posterior mean may be hard to compute

• Valid for any symmetric positive definite Q. If Q is diagonal, the loss function is additive

IP, José Bioucas Dias, IST, 2007 53 Bayes estimators: Additive loss

• Let the loss be additive: $L(f, \hat f) = \sum_i L_i(f_i, \hat f_i)$

• Then, the minimization is decoupled

• Each component of the Bayes estimator minimizes the corresponding marginal a posteriori expected loss

IP, José Bioucas Dias, IST, 2007 54 Bayes Estimators: Additive Loss

• Additive “0/1” loss:

each component of the estimate is the maximizer of the corresponding posterior marginal

• Additive quadratic loss:

The additive quadratic loss is a quadratic loss with Q = I. Therefore, the corresponding Bayes estimator is the posterior mean

IP, José Bioucas Dias, IST, 2007 55 Example (Gaussian IID observations, Gaussian prior)

• Gaussian IID observations

• Gaussian prior

• The posterior distribution is Gaussian

As the number of observations grows, the posterior mean tends to the sample mean (the ML estimate) and the influence of the prior vanishes

IP, José Bioucas Dias, IST, 2007 56 Example (Gaussian observation, Laplacian prior)

MAP estimate

• The log-posterior is strictly concave, so the MAP estimate is unique

IP, José Bioucas Dias, IST, 2007 57 Example (Gaussian observation, Laplacian prior)

MAP estimate
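A minimal sketch of the resulting estimator for the scalar model g = f + n, n ~ N(0, σ²), p(f) ∝ exp(-λ|f|); maximizing the log-posterior gives the standard soft-threshold rule (σ² and λ are assumed known hyperparameters):

```python
import numpy as np

def map_laplacian(g, sigma2, lam):
    """MAP estimate for g = f + n, n ~ N(0, sigma2), p(f) proportional to exp(-lam*|f|).
    Maximizing the log-posterior gives the soft-threshold rule (stated here as a sketch)."""
    return np.sign(g) * np.maximum(np.abs(g) - lam * sigma2, 0.0)

print(map_laplacian(np.array([-3.0, -0.5, 0.2, 2.0]), sigma2=1.0, lam=1.0))
```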

IP, José Bioucas Dias, IST, 2007 58 Example (Gaussian observation, Laplacian prior)

PM estimate

There is no closed-form expression; one must resort to numerical procedures

IP, José Bioucas Dias, IST, 2007 59 Example (Gaussian observation, Laplacian prior)

[Figure: densities for the Gaussian observation / Laplacian prior example.]

IP, José Bioucas Dias, IST, 2007 60 Example (Gaussian observation, Laplacian prior)

[Figure: estimates as a function of the observation g, for g in [-5, 5].]

IP, José Bioucas Dias, IST, 2007 61 Example (Multivariate Gaussian: observation and prior)

• Linear observation model (f and w independent)

• Posterior

• The resulting linear estimator is called the Wiener filter

• If all the eigenvalues of C approach infinity (a non-informative prior), the Wiener filter tends to

the Moore-Penrose pseudo (or generalized) inverse of A

IP, José Bioucas Dias, IST, 2007 62