

Generative vs. Discriminative Models, Maximum Likelihood Estimation, Mixture Models

Mihaela van der Schaar

Department of Engineering Science, University of Oxford


Generative vs Discriminative Approaches

Machine learning: learn a (random) function that maps a variable $X$ (feature) to a variable $Y$ (class) using a (labeled) dataset $\mathcal{M} = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$.

Discriminative Approach: learn $P(Y \mid X)$.
Generative Approach: learn $P(Y, X) = P(Y \mid X)\,P(X)$.

[Figure: scatter plot of two classes (Class 1, Class 2) in the $(X_1, X_2)$ plane.]

Generative vs Discriminative Approaches

Discriminative Approach: finds a good fit for $P(Y \mid X)$ without explicitly modeling the generative process. Example techniques: $K$-nearest neighbors, SVMs, perceptrons, etc. Example problem: given 2 classes, separate the classes.

[Figure: scatter plot of two classes (Class 1, Class 2) in the $(X_1, X_2)$ plane, with a separating boundary.]

Generative vs Discriminative Approaches

Generative Approach: finds a probabilistic model (a joint distribution $P(Y, X)$) that explicitly models the distribution of both the features and the corresponding labels (classes). Example techniques: Naive Bayes, Hidden Markov Models, etc.

[Figure: class-conditional density $P(X)$ over $X$, with an inset showing the posteriors $P(Y = 0 \mid X)$ and $P(Y = 1 \mid X)$.]

Generative vs Discriminative Approaches

Generative Approach: finds parameters that explain all the data.
- Makes use of all the data.
- Flexible framework; can incorporate many tasks (e.g. classification, regression, generating new data samples similar to the existing dataset, etc.).
- Stronger modeling assumptions.

Discriminative Approach: finds parameters that help to predict the relevant data.
- Learns to perform better on the given tasks.
- Weaker modeling assumptions.
- Less immune to overfitting.


Problem Setup

We are given a dataset $\mathcal{D} = (X_i)_{i=1}^n$ with $n$ entries. Example: the $X_i$'s are the annual incomes of $n$ individuals picked randomly from a large population. Goal: estimate the distribution that describes the entire population from which the random samples $(X_i)_{i=1}^n$ are drawn.

What we observe: random samples drawn from a distribution. What we want to estimate: the distribution!

[Figure: scatter of samples in the $(X_1, X_2)$ plane alongside the underlying probability density to be estimated.]

Models and Likelihood Functions: Parametric Families of Distributions

Key to making progress: restrict attention to a parametrized family of distributions! The formalization is as follows: the dataset $\mathcal{D}$ comprises independent and identically distributed (iid) samples from a distribution $P_\theta$ with a parameter $\theta$, i.e.

$$X_i \sim P_\theta, \qquad (X_1, X_2, \dots, X_n) \sim P_\theta^{\otimes n}.$$

The distribution $P_\theta$ belongs to the family $\mathcal{P}$, i.e.

$$\mathcal{P} = \{P_\theta : \theta \in \Theta\},$$

where $\Theta$ is a parameter space. Estimating the distribution $P_\theta \in \mathcal{P}$ → estimating the parameter $\theta \in \Theta$!

Models and Likelihood Functions: The Likelihood Function

How is the family of models $\mathcal{P}$ related to the dataset $\mathcal{D}$?

The likelihood function $L_n(\theta, \mathcal{D})$ is defined as

$$L_n(\theta, \mathcal{D}) = \prod_{i=1}^{n} P_\theta(X_i).$$

Intuitively, $L_n(\theta, \mathcal{D})$ quantifies how compatible any choice of $\theta$ is with the occurrence of $\mathcal{D}$.

Maximum Likelihood Estimator (MLE)

Given a dataset $\mathcal{D}$ of size $n$ drawn from a distribution $P_\theta \in \mathcal{P}$, the MLE estimate of $\theta$ is defined as

$$\hat{\theta}_n^* = \arg\max_{\theta \in \Theta} L_n(\theta, \mathcal{D}).$$
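As a quick illustration (not part of the original slides), here is a minimal sketch of computing an MLE by direct numerical optimization, assuming NumPy and SciPy are available; the Gaussian family, synthetic data, and all function names are illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=3.0, size=500)  # iid samples from P_theta, theta = (5, 3)

def neg_log_likelihood(params, x):
    """-log L_n(theta, D) for a Gaussian family; sigma parametrized as exp(log_sigma) > 0."""
    mu, log_sigma = params
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

# arg max of L_n equals arg min of -log L_n
result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), args=(data,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")  # close to (5, 3)
```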

Why Maximum Likelihood Estimation?

Why is $\hat{\theta}_n^*$ a good estimator for $\theta$?

1. Consistency: the estimate $\hat{\theta}_n^*$ converges to $\theta$ in probability:
$$\hat{\theta}_n^* \xrightarrow{p} \theta.$$
2. Asymptotic Normality: we can compute asymptotic confidence intervals:
$$\sqrt{n}\,(\hat{\theta}_n^* - \theta) \xrightarrow{d} \mathcal{N}(0, \sigma^2(\theta)).$$
3. Asymptotic Efficiency: the asymptotic variance of $\hat{\theta}_n^*$ is, in fact, equal to the Cramér-Rao lower bound for the variance of a consistent, asymptotically normally distributed estimator.
4. Invariance under re-parametrization: if $g(\cdot)$ is a continuous and continuously differentiable function, then the MLE of $g(\theta)$ is $g(\hat{\theta}_n^*)$.

See proofs in (Keener, Chapter 8).

The Gaussian Family: The Family $\mathcal{P}$ and Parameter Space $\Theta$

The dataset $\mathcal{D} = (X_1, X_2, \dots, X_n)$ is drawn from a distribution $P_\theta^{\otimes n} = \prod_{i=1}^n P_\theta(X_i)$, where
$$P_\theta(X) = \frac{1}{\sqrt{2\pi}\,\sigma} \cdot \exp\left(-\frac{(X - \mu)^2}{2\sigma^2}\right),$$
where $\theta = (\mu, \sigma)$. The parameter space $\Theta$ is
$$\Theta = \left\{(\mu, \sigma) : \mu \in \mathbb{R},\; \sigma \in \mathbb{R}^+\right\}.$$

The family $\mathcal{P}$ is the family of Gaussian distributions given by
$$\mathcal{P} = \left\{\frac{1}{\sqrt{2\pi}\,\sigma} \cdot e^{-\frac{(X - \mu)^2}{2\sigma^2}} : \mu \in \mathbb{R},\; \sigma \in \mathbb{R}^+\right\}.$$

The Gaussian Family: The Gaussian Likelihood Function

The likelihood function is given by

$$L(\theta, \mathcal{D}) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \cdot e^{-\frac{(X_i - \mu)^2}{2\sigma^2}} = (\sqrt{2\pi}\,\sigma)^{-n} \cdot e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2}$$
The MLE estimate $\hat\theta_n^* = (\hat\mu_n^*, \hat\sigma_n^*)$ is given by
$$(\hat\mu_n^*, \hat\sigma_n^*) = \arg\max_{\mu \in \mathbb{R},\, \sigma \in \mathbb{R}^+} (\sqrt{2\pi}\,\sigma)^{-n} \cdot e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2}$$

It is usually more convenient to work with the log-likelihood function

$$\log(L(\theta, \mathcal{D})) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2$$

The MLE Estimator: Finding $\hat\mu_n^*$ and $\hat\sigma_n^*$ (I)

The $\log(\cdot)$ operation is monotonic, therefore

$$\arg\max_{\theta \in \Theta} \log(L(\theta, \mathcal{D})) = \arg\max_{\theta \in \Theta} L(\theta, \mathcal{D})$$
We can solve the optimization problem $\arg\max_{\theta} L(\theta, \mathcal{D})$ by taking the first derivative of $\log(L(\theta, \mathcal{D}))$ with respect to $\theta$ and equating it to zero (first-order condition), i.e.

$$\frac{\partial \log(L(\theta, \mathcal{D}))}{\partial \theta} = 0.$$
What properties must $\log(L(\theta, \mathcal{D}))$ have for the above method to work? Concavity and log-concavity!


The MLE Estimator: Finding $\hat\mu_n^*$ and $\hat\sigma_n^*$ (II)

Note that $\theta = (\mu, \sigma)$ is vector-valued: the first-order condition becomes
$$\nabla_\theta \log(L(\theta, \mathcal{D})) = 0 \;\rightarrow\; \left[\frac{\partial \log(L(\theta, \mathcal{D}))}{\partial \mu} \;\; \frac{\partial \log(L(\theta, \mathcal{D}))}{\partial \sigma}\right]^T = 0$$

By taking the first derivative with respect to $\mu$ and $\sigma$, we have that:
$$\frac{\partial}{\partial \mu}\left(-\frac{n\log(2\pi\sigma^2)}{2} - \frac{\sum_{i=1}^n (X_i - \mu)^2}{2\sigma^2}\right) = \frac{\sum_{i=1}^n (X_i - \mu)}{\sigma^2}$$
$$\frac{\partial}{\partial \sigma}\left(-\frac{n\log(2\pi\sigma^2)}{2} - \frac{\sum_{i=1}^n (X_i - \mu)^2}{2\sigma^2}\right) = -\frac{n}{\sigma} + \frac{\sum_{i=1}^n (X_i - \mu)^2}{\sigma^3}$$


The MLE Estimator: Finding $\hat\mu_n^*$ and $\hat\sigma_n^*$ (III)

The MLE estimators are:
Sample mean:
$$\frac{\sum_{i=1}^n (X_i - \mu)}{\sigma^2} = 0 \;\rightarrow\; \hat\mu_n^* = \frac{1}{n}\sum_{i=1}^n X_i$$
Sample variance:
$$-\frac{n}{\sigma} + \frac{\sum_{i=1}^n (X_i - \mu)^2}{\sigma^3} = 0 \;\rightarrow\; (\hat\sigma_n^*)^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \hat\mu_n^*)^2$$
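As a hedged numerical check (not in the original slides), the closed-form estimators above are one line each in NumPy; the synthetic data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=3.0, size=1000)

mu_hat = x.mean()                        # (1/n) * sum_i X_i
sigma2_hat = np.mean((x - mu_hat) ** 2)  # (1/n) * sum_i (X_i - mu_hat)^2; the MLE divides by n, not n-1
print(mu_hat, np.sqrt(sigma2_hat))
```

Note that `np.var(x)` uses the same $1/n$ convention by default (`ddof=0`), so it coincides with the MLE of the variance.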


The MLE Estimator: Finding $\hat\mu_n^*$ and $\hat\sigma_n^*$ (IV)

Exercise: try to derive the MLE estimator when $X$ follows a multivariate Gaussian distribution:

The dataset is $\mathcal{D} = (X_1, \dots, X_n)$, where $X_i$ is $M$-dimensional and $X_i \sim \mathcal{N}(\mu, \Sigma)$.
The parameter space is $\Theta = \{(\mu, \Sigma) : \mu \in \mathbb{R}^M,\; \Sigma \succeq 0\}$.
The multivariate Gaussian distribution is
$$P_\theta(X) = (2\pi)^{-\frac{M}{2}} \cdot |\Sigma|^{-\frac{1}{2}} \cdot \exp\left(-\frac{1}{2}(X - \mu)^T \Sigma^{-1} (X - \mu)\right)$$
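If you want to check your derivation numerically, the widely known closed-form answer (the sample mean vector and the $1/n$ scatter matrix) can be sketched as follows; the parameter values and data are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(2)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[2.0, 0.5],
                       [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=2000)  # n x M data matrix

mu_hat = X.mean(axis=0)                     # MLE of mu: the sample mean vector
centered = X - mu_hat
Sigma_hat = centered.T @ centered / len(X)  # MLE of Sigma: (1/n) * sum_i (X_i - mu_hat)(X_i - mu_hat)^T
print(mu_hat)
print(Sigma_hat)
```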


Confidence Intervals: What is our confidence in the estimates $\hat\mu_n^*$ and $\hat\sigma_n^*$?

It depends on the sample size $n$: the larger $n$, the smaller the variance of $\hat\mu_n^*$ and $\hat\sigma_n^*$.

[Figure: $\hat\mu_n^*$ versus sample size $n$ for $\mu = 5$, $\sigma = \sqrt{3}$; $n = 50$ gives larger variance (less confidence), $n = 250$ gives smaller variance (more confidence).]

Confidence Intervals: Confidence Sets

Point estimators provide no quantification of uncertainty → we need to introduce a measure of confidence in an estimate!

Confidence Sets: A $(1 - \alpha)$ confidence set for a parameter $\theta$ is a subset of the parameter space, $\tilde\Theta(X_1, \dots, X_n) \subset \Theta$, such that

$$P(\theta \in \tilde\Theta) \geq 1 - \alpha.$$

Confidence intervals are one-dimensional confidence sets. Because of the asymptotic normality of general MLE estimates, we can compute asymptotic confidence intervals. Normality → compute confidence intervals. Asymptotic normality → compute asymptotic confidence intervals.

Confidence Intervals: Example: Unknown Mean and Known Variance (I)

Assume we know $\sigma$ and want to estimate $\mu$ from $\mathcal{D}$.
The MLE estimate for $\mu$ is $\hat\mu_n^* = \frac{1}{n}\sum_{i=1}^n X_i$.
We know that $\frac{\sqrt{n}}{\sigma}(\hat\mu_n^* - \mu) \sim \mathcal{N}(0, 1)$ (normality).
We want to compute the confidence interval $\tilde\Theta = [\hat\mu_n^* - \tilde\mu,\; \hat\mu_n^* + \tilde\mu]$ (symmetric, since the normal distribution is symmetric).

Computing the Confidence Interval for $\mu$:
Find $\gamma$ for which $P\left(\frac{\sqrt{n}}{\sigma}(\hat\mu_n^* - \mu) \geq \gamma\right) = \frac{\alpha}{2} \;\rightarrow\; \gamma = Q^{-1}\!\left(\frac{\alpha}{2}\right)$, where $Q(x) = \frac{1}{\sqrt{2\pi}}\int_x^\infty \exp\left(-\frac{u^2}{2}\right) du$.
$$P\left(\frac{\sqrt{n}}{\sigma}\,|\hat\mu_n^* - \mu| \leq Q^{-1}\!\left(\frac{\alpha}{2}\right)\right) = 1 - \alpha \iff \tilde\mu = \frac{\sigma}{\sqrt{n}}\, Q^{-1}\!\left(\frac{\alpha}{2}\right)$$


Confidence Intervals: Example: Unknown Mean and Known Variance (II)

The 95% confidence interval for the MLE mean estimate is
$$\tilde\Theta = \left[\hat\mu_n^* - 1.96\,\frac{\sigma}{\sqrt{n}},\; \hat\mu_n^* + 1.96\,\frac{\sigma}{\sqrt{n}}\right]$$
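A minimal sketch of this interval in code (not from the slides; the data and seed are illustrative), assuming $\sigma$ is known:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 5.0, 3.0                 # sigma assumed known
rng = np.random.default_rng(3)
x = rng.normal(mu, sigma, size=250)

alpha = 0.05
mu_hat = x.mean()
z = norm.ppf(1 - alpha / 2)          # Q^{-1}(alpha/2) = 1.96 for alpha = 0.05
half_width = z * sigma / np.sqrt(len(x))
print(f"95% CI: [{mu_hat - half_width:.3f}, {mu_hat + half_width:.3f}]")
```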

[Figure: $\hat\mu_n^*$ and its 95% confidence interval versus sample size $n$ for $\mu = 5$, $\sigma = 3$; the confidence interval gets narrower as the sample size increases.]

The Categorical Distribution: The Categorical Family $\mathcal{P}$ and the Parameter Space $\Theta$

Each data point takes a value from a finite set of values: $X_i \in \{1, 2, \dots, r\}$.

The probability that $X_i = j$ is given by $p_j \in [0, 1]$.

The parameter of a categorical distribution is $\theta = (p_1, \dots, p_r)$. The parameter space is the simplex
$$\Theta = \left\{(p_1, \dots, p_r) : (p_1, \dots, p_r) \in [0, 1]^r,\; \sum_{j=1}^r p_j = 1\right\}.$$
The probability mass function of the dataset $\mathcal{D}$:

$$P_\theta(X_1, \dots, X_n) = \prod_{i=1}^n \prod_{j=1}^r p_j^{\mathbf{1}\{X_i = j\}} = p_1^{n_1} p_2^{n_2} \cdots p_r^{n_r}, \qquad n_j = \sum_{i=1}^n \mathbf{1}\{X_i = j\}.$$

The MLE Estimator: Finding $\hat\theta_n^*$ (I)

The log-likelihood function: $\log(L(\theta, \mathcal{D})) = \sum_{j=1}^r n_j \cdot \log(p_j)$. The MLE estimate $\hat\theta_n^*$ is
$$\hat\theta_n^* = \arg\max_{p_1, \dots, p_r} \sum_{j=1}^r n_j \log(p_j) \quad \text{s.t.} \quad \sum_{j=1}^r p_j = 1.$$

Constrained optimization: not as easy as in the Gaussian case!


The MLE Estimator: Finding $\hat\theta_n^*$ (II): Method A

The Method of Lagrange Multipliers:
Maximize $\sum_{j=1}^r n_j \log(p_j) - \lambda\left(\sum_{j=1}^r p_j - 1\right)$ via the first-order condition:
$$\nabla_\theta \log(L(\theta, \mathcal{D})) = 0 \;\rightarrow\; p_{j,n}^* = \lambda^{-1} n_j.$$
Since $\sum_j n_j = n$ and $\sum_j p_{j,n}^* = 1$, then $\lambda = n$, and
$$\hat\theta_n^* = \left[\frac{n_1}{n} \;\cdots\; \frac{n_r}{n}\right]^T.$$
The MLE $\hat\theta_n^*$ is the empirical distribution function, which uniformly converges to the true probability mass function. This matches our expectations regarding the consistency of $\hat\theta_n^*$.
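The empirical-frequency MLE is immediate in code; a hedged sketch with illustrative data (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(4)
r = 4
true_p = np.array([0.1, 0.2, 0.3, 0.4])
x = rng.choice(np.arange(1, r + 1), size=5000, p=true_p)  # X_i in {1, ..., r}

n_j = np.bincount(x, minlength=r + 1)[1:]  # counts n_j = #{i : X_i = j}
p_hat = n_j / n_j.sum()                    # MLE: the empirical distribution n_j / n
print(p_hat)                               # close to true_p for large n
```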


The MLE Estimator: Finding $\hat\theta_n^*$ (III): Method B, An Information-Theoretic Approach

We can reformulate the MLE optimization problem as follows:

$$\hat\theta_n^* = \arg\max_{p_1, \dots, p_r} \sum_{j=1}^r n_j \log(p_j) = \arg\max_{p_1, \dots, p_r} \sum_{j=1}^r \frac{n_j}{n} \log(p_j)$$
$$= \arg\max_{p_1, \dots, p_r} \sum_{j=1}^r q_j \log\left(\frac{p_j}{q_j}\right) + \sum_{j=1}^r q_j \log(q_j), \qquad q_j = \frac{n_j}{n}$$
$$= \arg\max_{p_1, \dots, p_r} \sum_{j=1}^r q_j \log\left(\frac{p_j}{q_j}\right) = \arg\min_{p} \underbrace{D(q \,\|\, p)}_{\text{KL divergence}}$$

Since $D(q \,\|\, p) \geq 0$, the minimum is achieved at $q = p$, and we have $p_{j,n}^* = q_j = \frac{n_j}{n}$.

The MLE Estimator: Confidence Intervals

Unlike in the case of the Gaussian distribution, the MLE estimators are not normally distributed for any $n$. For large $n$, we can construct asymptotic confidence intervals using asymptotic normality. For arbitrary $n$, consider the case $r = 2$ (Bernoulli distribution), where $\theta = (p_1, p_2 = 1 - p_1)$. A conservative $(1 - \alpha)$ confidence interval for the parameter $p_1$ is
$$\tilde\Theta = \left[\frac{n_1}{n} - \frac{Q^{-1}(\alpha/2)}{2\sqrt{n}},\; \frac{n_1}{n} + \frac{Q^{-1}(\alpha/2)}{2\sqrt{n}}\right]$$

The derivation follows from the central limit theorem and bounding the standard deviation of a Bernoulli random variable by $\sqrt{p_1(1 - p_1)} \leq \frac{1}{2}$.
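A short sketch of this conservative interval (illustrative data; not from the slides):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
x = rng.binomial(1, 0.3, size=400)  # Bernoulli(p1 = 0.3) samples

alpha = 0.05
p1_hat = x.mean()                                         # n_1 / n
margin = norm.ppf(1 - alpha / 2) / (2 * np.sqrt(len(x)))  # Q^{-1}(alpha/2) / (2 sqrt(n))
print(f"conservative 95% CI for p1: [{p1_hat - margin:.3f}, {p1_hat + margin:.3f}]")
```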

The Generative Model: Heterogeneous Populations

In many applications, the data is sampled from $K$ different populations with different parameters. Example: a Gaussian mixture with 3 components.

[Figure: probability density of a Gaussian mixture with 3 components over $(X_1, X_2)$.]


The Generative Model: Gaussian Mixture Models

A $K$-component (univariate) Gaussian mixture has the following parameters:
- The Gaussian parameters $\theta_k = (\mu_k, \sigma_k)$, $1 \leq k \leq K$.
- The mixing proportions $\pi_k$, with $\sum_{k=1}^K \pi_k = 1$.
The dataset $\mathcal{D}$ has the following structure:

$$\mathcal{D} = ((X_1, Z_1), \dots, (X_n, Z_n)).$$

Each variable $X_i$ is drawn from one of the component Gaussian distributions: given $Z_i = k$, $X_i \sim \mathcal{N}(\mu_k, \sigma_k^2)$.

The variable $Z_i$ represents the membership of the $i$-th data point in one of the $K$ components, and is drawn from a categorical distribution $Z_i \sim \text{Categorical}(\pi_1, \dots, \pi_K)$.
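The two-stage generative process (draw $Z_i$, then draw $X_i$ from the selected component) is easy to simulate; a sketch with illustrative parameter values, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(6)
pis = np.array([0.5, 0.3, 0.2])     # mixing proportions (sum to 1)
mus = np.array([-3.0, 0.0, 4.0])    # component means
sigmas = np.array([1.0, 0.5, 1.5])  # component standard deviations

n = 1000
z = rng.choice(len(pis), size=n, p=pis)  # Z_i ~ Categorical(pi_1, ..., pi_K)
x = rng.normal(mus[z], sigmas[z])        # X_i | Z_i = k  ~  N(mu_k, sigma_k^2)
```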

MLE for Mixture Models: Complete-Data and Marginal Likelihood Functions

Two possible scenarios:

If $Z_i$ is observed, then the complete-data likelihood function is uni-modal:

$$L(\theta, \mathcal{D}) = \prod_{i=1}^n \pi_{z_i} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_{z_i}} \cdot e^{-\frac{(X_i - \mu_{z_i})^2}{2\sigma_{z_i}^2}}$$

If $Z_i$ is latent, then the marginal likelihood function is multi-modal:

$$L(\theta, \mathcal{D}) = \prod_{i=1}^n \sum_{k=1}^K \pi_k \cdot \frac{1}{\sqrt{2\pi}\,\sigma_k} \cdot e^{-\frac{(X_i - \mu_k)^2}{2\sigma_k^2}}$$
Hard problem → approximate solution using the EM algorithm (next lecture)!
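Evaluating this marginal log-likelihood is a common building block for EM; a sketch (not from the slides) using the log-sum-exp trick for numerical stability, with illustrative names and test values:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def gmm_log_likelihood(x, pis, mus, sigmas):
    """log L(theta, D) = sum_i log( sum_k pi_k * N(x_i; mu_k, sigma_k^2) )."""
    log_terms = np.log(pis) + norm.logpdf(x[:, None], loc=mus, scale=sigmas)  # shape (n, K)
    return logsumexp(log_terms, axis=1).sum()  # log-sum-exp avoids underflow in the inner sum

rng = np.random.default_rng(7)
x = rng.normal(size=100)
print(gmm_log_likelihood(x, np.array([0.6, 0.4]),
                         np.array([-1.0, 2.0]),
                         np.array([1.0, 0.5])))
```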

Maximum Likelihood is a Frequentist Method

A frequentist method...
1. Never uses or gives the probability of a hypothesis (no prior or posterior).
2. Depends on the likelihood for both observed and unobserved data.
3. Does not require a prior.
4. Tends to be less computationally intensive.
Frequentist measures such as p-values and confidence intervals have been widely used in research practice since the 20th century.


Bayesian Methods

On the other hand, a Bayesian method...
1. Assumes a prior: uses probabilities for both hypotheses and data.
2. Depends on the prior and the likelihood of observed data.
3. Requires one to know or construct a subjective prior.
4. May be computationally intensive due to integration over many parameters.
Many recent advances in Bayesian methods! Read about: variational Bayesian methods, Markov chain Monte Carlo methods, etc.


References

1. Robert W. Keener, "Statistical Theory: Notes for a Course in Theoretical Statistics," 2006.
2. Robert W. Keener, "Theoretical Statistics: Topics for a Core Course," 2010.
3. Christopher Bishop, "Pattern Recognition and Machine Learning," 2007.
