

Generative vs. Discriminative Models, Maximum Likelihood Estimation, Mixture Models

Mihaela van der Schaar

Department of Engineering Science, University of Oxford


Generative vs Discriminative Approaches

Machine learning: learn a (random) function that maps a variable $X$ (feature) to a variable $Y$ (class) using a (labeled) dataset $\mathcal{M} = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$.

Discriminative Approach: learn $P(Y \mid X)$.
Generative Approach: learn $P(Y, X) = P(Y \mid X)\,P(X)$.

[Figure: scatter plot of two classes (Class 1, Class 2) in the $(X_1, X_2)$ plane.]

Generative vs Discriminative Approaches

Discriminative Approach: finds a good fit for $P(Y \mid X)$ without explicitly modeling the generative process. Example techniques: $K$-nearest neighbors, SVMs, perceptrons, etc. Example problem: given 2 classes, separate the classes.

[Figure: scatter plot of two classes (Class 1, Class 2) in the $(X_1, X_2)$ plane, with a separating boundary.]

Generative vs Discriminative Approaches

Generative Approach: finds a probabilistic model (a joint distribution $P(Y, X)$) that explicitly models the distribution of both the features and the corresponding labels (classes). Example techniques: Naive Bayes, Hidden Markov Models, etc.

[Figure: class-conditional density $P(X)$ over $X$, with an inset showing the posteriors $P(Y = 0 \mid X)$ and $P(Y = 1 \mid X)$.]

Generative vs Discriminative Approaches

Generative Approach: finds parameters that explain all the data.
- Makes use of all the data.
- Flexible framework; can incorporate many tasks (e.g. classification, regression, generating new data samples similar to the existing dataset, etc.).
- Stronger modeling assumptions.

Discriminative Approach: finds parameters that help to predict the relevant data.
- Learns to perform better on the given tasks.
- Weaker modeling assumptions.
- Less immune to overfitting.


Problem Setup

We are given a dataset $\mathcal{D} = (X_i)_{i=1}^n$ with $n$ entries. Example: the $X_i$'s are the annual incomes of $n$ individuals picked randomly from a large population. Goal: estimate the distribution that describes the entire population from which the random samples $(X_i)_{i=1}^n$ are drawn.

What we observe: random samples drawn from a distribution. What we want to estimate: the distribution!

[Figure: scatter of samples in the $(X_1, X_2)$ plane alongside the underlying probability density to be estimated.]

Models and Likelihood Functions: Parametric Families of Distributions

Key to making progress: restrict attention to a parametrized family of distributions! The formalization is as follows: the dataset $\mathcal{D}$ comprises independent and identically distributed (iid) samples from a distribution $P_\theta$ with a parameter $\theta$, i.e.

$$X_i \sim P_\theta, \qquad (X_1, X_2, \dots, X_n) \sim P_\theta^{\otimes n}.$$

The distribution $P_\theta$ belongs to the family $\mathcal{P}$, i.e.

$$\mathcal{P} = \{P_\theta : \theta \in \Theta\},$$

where $\Theta$ is a parameter space. Estimating the distribution $P_\theta \in \mathcal{P}$ → estimating the parameter $\theta \in \Theta$!

Models and Likelihood Functions: The Likelihood Function

How is the family of models $\mathcal{P}$ related to the dataset $\mathcal{D}$?

The likelihood function $L_n(\theta, \mathcal{D})$ is defined as

$$L_n(\theta, \mathcal{D}) = \prod_{i=1}^{n} P_\theta(X_i).$$

Intuitively, $L_n(\theta, \mathcal{D})$ quantifies how compatible any choice of $\theta$ is with the occurrence of $\mathcal{D}$.

Maximum Likelihood Estimator (MLE)

Given a dataset $\mathcal{D}$ of size $n$ drawn from a distribution $P_\theta \in \mathcal{P}$, the MLE estimate of $\theta$ is defined as

$$\hat{\theta}_n^* = \arg\max_{\theta \in \Theta} L_n(\theta, \mathcal{D}).$$
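As a quick illustration (not part of the original slides), here is a minimal sketch of computing an MLE by direct numerical optimization, assuming NumPy and SciPy are available; the Gaussian family, synthetic data, and all function names are illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=3.0, size=500)  # iid samples from P_theta, theta = (5, 3)

def neg_log_likelihood(params, x):
    """-log L_n(theta, D) for a Gaussian family; sigma parametrized as exp(log_sigma) > 0."""
    mu, log_sigma = params
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

# arg max of L_n equals arg min of -log L_n
result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), args=(data,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")  # close to (5, 3)
```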

Why Maximum Likelihood Estimation?

Why is $\hat{\theta}_n^*$ a good estimator for $\theta$?

1. Consistency: the estimate $\hat{\theta}_n^*$ converges to $\theta$ in probability:
$$\hat{\theta}_n^* \xrightarrow{p} \theta.$$
2. Asymptotic Normality: we can compute asymptotic confidence intervals:
$$\sqrt{n}\,(\hat{\theta}_n^* - \theta) \xrightarrow{d} \mathcal{N}(0, \sigma^2(\theta)).$$
3. Asymptotic Efficiency: the asymptotic variance of $\hat{\theta}_n^*$ is, in fact, equal to the Cramér-Rao lower bound for the variance of a consistent, asymptotically normally distributed estimator.
4. Invariance under re-parametrization: if $g(\cdot)$ is a continuous and continuously differentiable function, then the MLE of $g(\theta)$ is $g(\hat{\theta}_n^*)$.

See proofs in (Keener, Chapter 8).

The Gaussian Family: The Family $\mathcal{P}$ and Parameter Space $\Theta$

The dataset $\mathcal{D} = (X_1, X_2, \dots, X_n)$ is drawn from a distribution $P_\theta^{\otimes n} = \prod_{i=1}^n P_\theta(X_i)$, where
$$P_\theta(X) = \frac{1}{\sqrt{2\pi}\,\sigma} \cdot \exp\left(-\frac{(X - \mu)^2}{2\sigma^2}\right),$$
where $\theta = (\mu, \sigma)$. The parameter space $\Theta$ is
$$\Theta = \left\{(\mu, \sigma) : \mu \in \mathbb{R},\; \sigma \in \mathbb{R}^+\right\}.$$

The family $\mathcal{P}$ is the family of Gaussian distributions given by
$$\mathcal{P} = \left\{\frac{1}{\sqrt{2\pi}\,\sigma} \cdot e^{-\frac{(X - \mu)^2}{2\sigma^2}} : \mu \in \mathbb{R},\; \sigma \in \mathbb{R}^+\right\}.$$

The Gaussian Family: The Gaussian Likelihood Function

The likelihood function is given by

$$L(\theta, \mathcal{D}) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \cdot e^{-\frac{(X_i - \mu)^2}{2\sigma^2}} = (\sqrt{2\pi}\,\sigma)^{-n} \cdot e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2}$$
The MLE estimate $\hat\theta_n^* = (\hat\mu_n^*, \hat\sigma_n^*)$ is given by
$$(\hat\mu_n^*, \hat\sigma_n^*) = \arg\max_{\mu \in \mathbb{R},\, \sigma \in \mathbb{R}^+} (\sqrt{2\pi}\,\sigma)^{-n} \cdot e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2}$$

It is usually more convenient to work with the log-likelihood function

$$\log(L(\theta, \mathcal{D})) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2$$

The MLE Estimator: Finding $\hat\mu_n^*$ and $\hat\sigma_n^*$ (I)

The $\log(\cdot)$ operation is monotonic, therefore

$$\arg\max_{\theta \in \Theta} \log(L(\theta, \mathcal{D})) = \arg\max_{\theta \in \Theta} L(\theta, \mathcal{D})$$
We can solve the optimization problem $\arg\max_{\theta} L(\theta, \mathcal{D})$ by taking the first derivative of $\log(L(\theta, \mathcal{D}))$ with respect to $\theta$ and equating it to zero (first-order condition), i.e.

$$\frac{\partial \log(L(\theta, \mathcal{D}))}{\partial \theta} = 0.$$
What properties must $\log(L(\theta, \mathcal{D}))$ have for the above method to work? Concavity and log-concavity!


The MLE Estimator: Finding $\hat\mu_n^*$ and $\hat\sigma_n^*$ (II)

Note that $\theta = (\mu, \sigma)$ is vector-valued: the first-order condition becomes
$$\nabla_\theta \log(L(\theta, \mathcal{D})) = 0 \;\rightarrow\; \left[\frac{\partial \log(L(\theta, \mathcal{D}))}{\partial \mu} \;\; \frac{\partial \log(L(\theta, \mathcal{D}))}{\partial \sigma}\right]^T = 0$$

By taking the first derivative with respect to $\mu$ and $\sigma$, we have that:
$$\frac{\partial}{\partial \mu}\left(-\frac{n\log(2\pi\sigma^2)}{2} - \frac{\sum_{i=1}^n (X_i - \mu)^2}{2\sigma^2}\right) = \frac{\sum_{i=1}^n (X_i - \mu)}{\sigma^2}$$
$$\frac{\partial}{\partial \sigma}\left(-\frac{n\log(2\pi\sigma^2)}{2} - \frac{\sum_{i=1}^n (X_i - \mu)^2}{2\sigma^2}\right) = -\frac{n}{\sigma} + \frac{\sum_{i=1}^n (X_i - \mu)^2}{\sigma^3}$$


The MLE Estimator: Finding $\hat\mu_n^*$ and $\hat\sigma_n^*$ (III)

The MLE estimators are:
Sample mean:
$$\frac{\sum_{i=1}^n (X_i - \mu)}{\sigma^2} = 0 \;\rightarrow\; \hat\mu_n^* = \frac{1}{n}\sum_{i=1}^n X_i$$
Sample variance:
$$-\frac{n}{\sigma} + \frac{\sum_{i=1}^n (X_i - \mu)^2}{\sigma^3} = 0 \;\rightarrow\; (\hat\sigma_n^*)^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \hat\mu_n^*)^2$$
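As a hedged numerical check (not in the original slides), the closed-form estimators above are one line each in NumPy; the synthetic data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=3.0, size=1000)

mu_hat = x.mean()                        # (1/n) * sum_i X_i
sigma2_hat = np.mean((x - mu_hat) ** 2)  # (1/n) * sum_i (X_i - mu_hat)^2; the MLE divides by n, not n-1
print(mu_hat, np.sqrt(sigma2_hat))
```

Note that `np.var(x)` uses the same $1/n$ convention by default (`ddof=0`), so it coincides with the MLE of the variance.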


The MLE Estimator: Finding $\hat\mu_n^*$ and $\hat\sigma_n^*$ (IV)

Exercise: try to derive the MLE estimator when $X$ follows a multivariate Gaussian distribution:

The dataset is $\mathcal{D} = (X_1, \dots, X_n)$, where $X_i$ is $M$-dimensional and $X_i \sim \mathcal{N}(\mu, \Sigma)$.
The parameter space is $\Theta = \{(\mu, \Sigma) : \mu \in \mathbb{R}^M,\; \Sigma \succeq 0\}$.
The multivariate Gaussian distribution is
$$P_\theta(X) = (2\pi)^{-\frac{M}{2}} \cdot |\Sigma|^{-\frac{1}{2}} \cdot \exp\left(-\frac{1}{2}(X - \mu)^T \Sigma^{-1} (X - \mu)\right)$$
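If you want to check your derivation numerically, the widely known closed-form answer (the sample mean vector and the $1/n$ scatter matrix) can be sketched as follows; the parameter values and data are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(2)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[2.0, 0.5],
                       [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=2000)  # n x M data matrix

mu_hat = X.mean(axis=0)                     # MLE of mu: the sample mean vector
centered = X - mu_hat
Sigma_hat = centered.T @ centered / len(X)  # MLE of Sigma: (1/n) * sum_i (X_i - mu_hat)(X_i - mu_hat)^T
print(mu_hat)
print(Sigma_hat)
```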


Confidence Intervals: What is our confidence in the estimates $\hat\mu_n^*$ and $\hat\sigma_n^*$?

It depends on the sample size $n$: the larger $n$, the smaller the variance of $\hat\mu_n^*$ and $\hat\sigma_n^*$.

[Figure: $\hat\mu_n^*$ versus sample size $n$ for $\mu = 5$, $\sigma = \sqrt{3}$; $n = 50$ gives larger variance (less confidence), $n = 250$ gives smaller variance (more confidence).]

Confidence Intervals: Confidence Sets

Point estimators provide no quantification of uncertainty → we need to introduce a measure of confidence in an estimate!

Confidence Sets: A $(1 - \alpha)$ confidence set for a parameter $\theta$ is a subset of the parameter space, $\tilde\Theta(X_1, \dots, X_n) \subset \Theta$, such that

$$P(\theta \in \tilde\Theta) \geq 1 - \alpha.$$

Confidence intervals are one-dimensional confidence sets. Because of the asymptotic normality of general MLE estimates, we can compute asymptotic confidence intervals. Normality → compute confidence intervals. Asymptotic normality → compute asymptotic confidence intervals.

Confidence Intervals: Example: Unknown Mean and Known Variance (I)

Assume we know $\sigma$ and want to estimate $\mu$ from $\mathcal{D}$.
The MLE estimate for $\mu$ is $\hat\mu_n^* = \frac{1}{n}\sum_{i=1}^n X_i$.
We know that $\frac{\sqrt{n}}{\sigma}(\hat\mu_n^* - \mu) \sim \mathcal{N}(0, 1)$ (normality).
We want to compute the confidence interval $\tilde\Theta = [\hat\mu_n^* - \tilde\mu,\; \hat\mu_n^* + \tilde\mu]$ (symmetric, since the normal distribution is symmetric).

Computing the Confidence Interval for $\mu$:
Find $\gamma$ for which $P\left(\frac{\sqrt{n}}{\sigma}(\hat\mu_n^* - \mu) \geq \gamma\right) = \frac{\alpha}{2} \;\rightarrow\; \gamma = Q^{-1}\!\left(\frac{\alpha}{2}\right)$, where $Q(x) = \frac{1}{\sqrt{2\pi}}\int_x^\infty \exp\left(-\frac{u^2}{2}\right) du$.
$$P\left(\frac{\sqrt{n}}{\sigma}\,|\hat\mu_n^* - \mu| \leq Q^{-1}\!\left(\frac{\alpha}{2}\right)\right) = 1 - \alpha \iff \tilde\mu = \frac{\sigma}{\sqrt{n}}\, Q^{-1}\!\left(\frac{\alpha}{2}\right)$$


Confidence Intervals: Example: Unknown Mean and Known Variance (II)

The 95% confidence interval for the MLE mean estimate is
$$\tilde\Theta = \left[\hat\mu_n^* - 1.96\,\frac{\sigma}{\sqrt{n}},\; \hat\mu_n^* + 1.96\,\frac{\sigma}{\sqrt{n}}\right]$$
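A minimal sketch of this interval in code (not from the slides; the data and seed are illustrative), assuming $\sigma$ is known:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 5.0, 3.0                 # sigma assumed known
rng = np.random.default_rng(3)
x = rng.normal(mu, sigma, size=250)

alpha = 0.05
mu_hat = x.mean()
z = norm.ppf(1 - alpha / 2)          # Q^{-1}(alpha/2) = 1.96 for alpha = 0.05
half_width = z * sigma / np.sqrt(len(x))
print(f"95% CI: [{mu_hat - half_width:.3f}, {mu_hat + half_width:.3f}]")
```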

[Figure: $\hat\mu_n^*$ and its 95% confidence interval versus sample size $n$ for $\mu = 5$, $\sigma = 3$; the confidence interval gets narrower as the sample size increases.]

The Categorical Distribution: The Categorical Family $\mathcal{P}$ and the Parameter Space $\Theta$

Each data point takes a value from a finite set of values: $X_i \in \{1, 2, \dots, r\}$.

The probability that $X_i = j$ is given by $p_j \in [0, 1]$.

The parameter of a categorical distribution is $\theta = (p_1, \dots, p_r)$. The parameter space is the simplex
$$\Theta = \left\{(p_1, \dots, p_r) : (p_1, \dots, p_r) \in [0, 1]^r,\; \sum_{j=1}^r p_j = 1\right\}.$$
The probability mass function of the dataset $\mathcal{D}$:

$$P_\theta(X_1, \dots, X_n) = \prod_{i=1}^n \prod_{j=1}^r p_j^{\mathbf{1}\{X_i = j\}} = p_1^{n_1} p_2^{n_2} \cdots p_r^{n_r}, \qquad n_j = \sum_{i=1}^n \mathbf{1}\{X_i = j\}.$$

The MLE Estimator: Finding $\hat\theta_n^*$ (I)

The log-likelihood function: $\log(L(\theta, \mathcal{D})) = \sum_{j=1}^r n_j \cdot \log(p_j)$. The MLE estimate $\hat\theta_n^*$ is
$$\hat\theta_n^* = \arg\max_{p_1, \dots, p_r} \sum_{j=1}^r n_j \log(p_j) \quad \text{s.t.} \quad \sum_{j=1}^r p_j = 1.$$

Constrained optimization: not as easy as in the Gaussian case!


The MLE Estimator: Finding $\hat\theta_n^*$ (II): Method A

The Method of Lagrange Multipliers:
Maximize $\sum_{j=1}^r n_j \log(p_j) - \lambda\left(\sum_{j=1}^r p_j - 1\right)$ via the first-order condition:
$$\nabla_\theta \log(L(\theta, \mathcal{D})) = 0 \;\rightarrow\; p_{j,n}^* = \lambda^{-1} n_j.$$
Since $\sum_j n_j = n$ and $\sum_j p_{j,n}^* = 1$, then $\lambda = n$, and
$$\hat\theta_n^* = \left[\frac{n_1}{n} \;\cdots\; \frac{n_r}{n}\right]^T.$$
The MLE $\hat\theta_n^*$ is the empirical distribution function, which uniformly converges to the true probability mass function. This matches our expectations regarding the consistency of $\hat\theta_n^*$.
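The empirical-frequency MLE is immediate in code; a hedged sketch with illustrative data (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(4)
r = 4
true_p = np.array([0.1, 0.2, 0.3, 0.4])
x = rng.choice(np.arange(1, r + 1), size=5000, p=true_p)  # X_i in {1, ..., r}

n_j = np.bincount(x, minlength=r + 1)[1:]  # counts n_j = #{i : X_i = j}
p_hat = n_j / n_j.sum()                    # MLE: the empirical distribution n_j / n
print(p_hat)                               # close to true_p for large n
```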


The MLE Estimator: Finding $\hat\theta_n^*$ (III): Method B, An Information-Theoretic Approach

We can reformulate the MLE optimization problem as follows:

$$\hat\theta_n^* = \arg\max_{p_1, \dots, p_r} \sum_{j=1}^r n_j \log(p_j) = \arg\max_{p_1, \dots, p_r} \sum_{j=1}^r \frac{n_j}{n} \log(p_j)$$
$$= \arg\max_{p_1, \dots, p_r} \sum_{j=1}^r q_j \log\left(\frac{p_j}{q_j}\right) + \sum_{j=1}^r q_j \log(q_j), \qquad q_j = \frac{n_j}{n}$$
$$= \arg\max_{p_1, \dots, p_r} \sum_{j=1}^r q_j \log\left(\frac{p_j}{q_j}\right) = \arg\min_{p} \underbrace{D(q \,\|\, p)}_{\text{KL divergence}}$$

Since $D(q \,\|\, p) \geq 0$, the minimum is achieved at $q = p$, and we have $p_{j,n}^* = q_j = \frac{n_j}{n}$.

The MLE Estimator: Confidence Intervals

Unlike in the case of the Gaussian distribution, the MLE estimators are not normally distributed for any $n$. For large $n$, we can construct asymptotic confidence intervals using asymptotic normality. For arbitrary $n$, consider the case $r = 2$ (Bernoulli distribution), where $\theta = (p_1, p_2 = 1 - p_1)$. A conservative $(1 - \alpha)$ confidence interval for the parameter $p_1$ is
$$\tilde\Theta = \left[\frac{n_1}{n} - \frac{Q^{-1}(\alpha/2)}{2\sqrt{n}},\; \frac{n_1}{n} + \frac{Q^{-1}(\alpha/2)}{2\sqrt{n}}\right]$$

The derivation follows from the central limit theorem and bounding the standard deviation of a Bernoulli random variable by $\sqrt{p_1(1 - p_1)} \leq \frac{1}{2}$.
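A short sketch of this conservative interval (illustrative data; not from the slides):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
x = rng.binomial(1, 0.3, size=400)  # Bernoulli(p1 = 0.3) samples

alpha = 0.05
p1_hat = x.mean()                                         # n_1 / n
margin = norm.ppf(1 - alpha / 2) / (2 * np.sqrt(len(x)))  # Q^{-1}(alpha/2) / (2 sqrt(n))
print(f"conservative 95% CI for p1: [{p1_hat - margin:.3f}, {p1_hat + margin:.3f}]")
```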

The Generative Model: Heterogeneous Populations

In many applications, the data is sampled from $K$ different populations with different parameters. Example: a Gaussian mixture with 3 components.

[Figure: probability density of a Gaussian mixture with 3 components over $(X_1, X_2)$.]


The Generative Model: Gaussian Mixture Models

A $K$-component (univariate) Gaussian mixture has the following parameters:
- The Gaussian parameters $\theta_k = (\mu_k, \sigma_k)$, $1 \leq k \leq K$.
- The mixing proportions $\pi_k$, with $\sum_{k=1}^K \pi_k = 1$.
The dataset $\mathcal{D}$ has the following structure:

$$\mathcal{D} = ((X_1, Z_1), \dots, (X_n, Z_n)).$$

Each variable $X_i$ is drawn from one of the component Gaussian distributions: given $Z_i = k$, $X_i \sim \mathcal{N}(\mu_k, \sigma_k^2)$.

The variable $Z_i$ represents the membership of the $i$-th data point in one of the $K$ components, and is drawn from a categorical distribution $Z_i \sim \text{Categorical}(\pi_1, \dots, \pi_K)$.
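The two-stage generative process (draw $Z_i$, then draw $X_i$ from the selected component) is easy to simulate; a sketch with illustrative parameter values, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(6)
pis = np.array([0.5, 0.3, 0.2])     # mixing proportions (sum to 1)
mus = np.array([-3.0, 0.0, 4.0])    # component means
sigmas = np.array([1.0, 0.5, 1.5])  # component standard deviations

n = 1000
z = rng.choice(len(pis), size=n, p=pis)  # Z_i ~ Categorical(pi_1, ..., pi_K)
x = rng.normal(mus[z], sigmas[z])        # X_i | Z_i = k  ~  N(mu_k, sigma_k^2)
```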

MLE for Mixture Models: Complete-Data and Marginal Likelihood Functions

Two possible scenarios:

If $Z_i$ is observed, then the complete-data likelihood function is uni-modal:

$$L(\theta, \mathcal{D}) = \prod_{i=1}^n \pi_{z_i} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_{z_i}} \cdot e^{-\frac{(X_i - \mu_{z_i})^2}{2\sigma_{z_i}^2}}$$

If $Z_i$ is latent, then the marginal likelihood function is multi-modal:

$$L(\theta, \mathcal{D}) = \prod_{i=1}^n \sum_{k=1}^K \pi_k \cdot \frac{1}{\sqrt{2\pi}\,\sigma_k} \cdot e^{-\frac{(X_i - \mu_k)^2}{2\sigma_k^2}}$$
Hard problem → approximate solution using the EM algorithm (next lecture)!
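Evaluating this marginal log-likelihood is a common building block for EM; a sketch (not from the slides) using the log-sum-exp trick for numerical stability, with illustrative names and test values:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def gmm_log_likelihood(x, pis, mus, sigmas):
    """log L(theta, D) = sum_i log( sum_k pi_k * N(x_i; mu_k, sigma_k^2) )."""
    log_terms = np.log(pis) + norm.logpdf(x[:, None], loc=mus, scale=sigmas)  # shape (n, K)
    return logsumexp(log_terms, axis=1).sum()  # log-sum-exp avoids underflow in the inner sum

rng = np.random.default_rng(7)
x = rng.normal(size=100)
print(gmm_log_likelihood(x, np.array([0.6, 0.4]),
                         np.array([-1.0, 2.0]),
                         np.array([1.0, 0.5])))
```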

Maximum Likelihood is a Frequentist Method

A frequentist method...
1. Never uses or gives the probability of a hypothesis (no prior or posterior).
2. Depends on the likelihood for both observed and unobserved data.
3. Does not require a prior.
4. Tends to be less computationally intensive.
Frequentist measures such as p-values and confidence intervals have been widely used in research practice since the 20th century.


Bayesian Methods

On the other hand, a Bayesian method...
1. Assumes a prior: uses probabilities for both hypotheses and data.
2. Depends on the prior and the likelihood of observed data.
3. Requires one to know or construct a subjective prior.
4. May be computationally intensive due to integration over many parameters.
Many recent advances in Bayesian methods! Read about: variational Bayesian methods, Markov chain Monte Carlo methods, etc.


References

1. Robert W. Keener, "Statistical Theory: Notes for a Course in Theoretical Statistics," 2006.
2. Robert W. Keener, "Theoretical Statistics: Topics for a Core Course," 2010.
3. Christopher Bishop, "Pattern Recognition and Machine Learning," 2007.
