Probabilistic & Unsupervised Learning

Model selection, optimisation, and Gaussian Processes

Maneesh Sahani
[email protected]
Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science
University College London

Term 1, Autumn 2014

Learning model structure

• How many clusters in the data?
• How smooth should the function be?
• Is this input relevant to predicting that output?
• What is the order of a dynamical system?
• How many states in a hidden Markov model? (e.g. SVYDAAAQLTADVKKDLRDSWKVIGSDKKGNG)
• How many auditory sources in the input?

Model selection

Models (labelled by m) have parameters θ_m that specify the probability of data:

P(D | θ_m, m) .

If the model is known, learning θ_m ≡ finding a posterior or point estimate (ML, MAP, ...).

What if we need to learn the model too?
• Could combine models into a single “supermodel”, with composite parameter (m, θ_m).
• ML learning will overfit: it favours the most flexible (nested) model with the most parameters, even if the data actually come from a simpler one.
• A density function on the composite parameter space (a union of manifolds of different dimensionalities) is difficult to define ⇒ MAP learning is ill-posed.
• The joint posterior is difficult to compute: the dimension of the composite parameter varies.

⇒ Separate model selection step:

P(θ_m, m | D) = P(θ_m | m, D) · P(m | D)

(the first factor is the model-specific posterior; the second is the model-selection posterior).

Model complexity and overfitting: a simple example

[Figure: polynomial fits of order M = 0, 1, ..., 7 to the same small data set (x from 0 to 10); higher-order polynomials track the observations ever more closely.]

Model selection

Given models labelled by m with parameters θ_m, identify the “correct” model for data D.

ML/MAP has no good answer: P(D | θ_m^ML) is always larger for more complex (nested) models.

Neyman-Pearson hypothesis testing
• For nested models. Starting with the simplest model (m = 1), compare (e.g. by likelihood ratio test) the null hypothesis m to the alternative m + 1. Continue until m + 1 is rejected.
• Usually only valid asymptotically in data number.
• Conservative (N-P hypothesis tests are asymmetric).

Likelihood validation
• Partition the data into disjoint training and validation sets D = D_tr ∪ D_vld. Choose the model with greatest P(D_vld | θ_m^ML), with θ_m^ML = argmax_θ P(D_tr | θ).
• Unbiased, but often high-variance.
• Cross-validation uses multiple partitions and averages the likelihoods.

Bayesian model selection
• Choose the most likely model: argmax_m P(m | D).
• Principled from a probabilistic viewpoint (if the true model is in the set being considered), but sensitive to assumed priors etc.
• Can use the posterior probabilities to weight models for combined predictions (no need to select at all).

Bayesian model selection: some terminology

A model class m is a set of distributions parameterised by θ_m, e.g. the set of all possible mixtures of m Gaussians.

The model implies both a prior over the parameters P(θ_m | m), and a likelihood of data given parameters (which might require integrating out latent variables) P(D | θ_m, m).

The posterior distribution over parameters is

P(θ_m | D, m) = P(D | θ_m, m) P(θ_m | m) / P(D | m) .

The marginal probability of the data under model class m is:

P(D | m) = ∫_Θm P(D | θ_m, m) P(θ_m | m) dθ_m

(also called the Bayesian evidence for model m).

The ratio of two marginal probabilities (or sometimes its log) is known as the Bayes factor:

P(D | m) / P(D | m′) = [P(m | D) / P(m′ | D)] · [p(m′) / p(m)]
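
As a concrete illustration (not from the slides), the sketch below compares two model classes for a sequence of coin flips: m0, a fair coin with no free parameters, and m1, a coin of unknown bias with a Beta(1, 1) prior. Both marginal likelihoods are available in closed form, so the Bayes factor and the posterior over models can be computed exactly; the data and prior choices are assumptions made for this example.

# Hypothetical illustration: exact Bayes factor for coin-flip data.
# m0: fair coin (no free parameters); m1: unknown bias p with a Beta(1, 1) prior.
import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=50)        # 50 flips from a (secretly) biased coin
n, k = len(x), int(x.sum())

# Log marginal likelihoods (evidence) for each model class
log_ev_m0 = n * np.log(0.5)                          # P(D|m0) = 0.5^n
log_ev_m1 = betaln(k + 1, n - k + 1) - betaln(1, 1)  # integral of p^k (1-p)^(n-k) Beta(p|1,1) dp

print("log Bayes factor (m1 vs m0):", log_ev_m1 - log_ev_m0)

# Posterior over model classes under a uniform prior P(m0) = P(m1) = 1/2
log_post = np.array([log_ev_m0, log_ev_m1])
post = np.exp(log_post - log_post.max())
print("P(m|D):", post / post.sum())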

The Bayesian Occam’s razor

Occam’s Razor is a principle of scientific philosophy: of two explanations adequate to explain the same set of observations, the simpler should always be preferred.

Bayesian model comparison formalises and automatically implements a form of Occam’s Razor. Compare model classes m using their posterior probabilities given the data:

P(m | D) = P(D | m) P(m) / P(D),    P(D | m) = ∫_Θm P(D | θ_m, m) P(θ_m | m) dθ_m

P(D | m) is the probability that randomly selected parameter values from the model class would generate data set D.

Model classes that are too simple are unlikely to generate the observed data set.

Model classes that are too complex can generate many possible data sets, so again, they are unlikely to generate that particular data set at random.

Like Goldilocks, we favour a model that is just right.

[Figure: P(D | m) plotted over the space of possible data sets for a too-simple, a “just right”, and a too-complex model class; the observed data set is marked on the axis.]

Bayesian model comparison: Occam’s razor at work

[Figure: model evidence P(Y | M) for polynomial orders M = 0, ..., 7 fitted to the earlier example data, shown alongside the corresponding fits.]
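
The bar-chart panel can be reproduced numerically in the linear-Gaussian setting developed later in these notes. The sketch below is my own illustration with assumed settings (toy data, prior variance tau2, noise variance sigma2): it scores polynomial models of increasing order M by their marginal likelihood P(Y | X, M) = N(Y; 0, τ²Φ_M^TΦ_M + σ²I), which typically peaks at an intermediate order rather than at the most flexible model.

# Hypothetical sketch: Bayesian evidence for polynomial models of order M = 0..7,
# assuming a Gaussian prior w ~ N(0, tau2 I) on the coefficients and noise sigma2.
import numpy as np

rng = np.random.default_rng(1)
N, tau2, sigma2 = 20, 10.0, 0.1
x = np.linspace(-1, 1, N)
y = 5 * x**3 - 3 * x + rng.normal(0, np.sqrt(sigma2), N)     # toy data from a cubic

def log_evidence(order):
    Phi = np.vander(x, order + 1, increasing=True)           # rows are [1, x_i, ..., x_i^order]
    C = tau2 * Phi @ Phi.T + sigma2 * np.eye(N)               # Cov[Y | X, M]
    sign, logdet = np.linalg.slogdet(C)
    return -0.5 * (logdet + y @ np.linalg.solve(C, y) + N * np.log(2 * np.pi))

for M in range(8):
    print(f"M = {M}: log P(Y|X,M) = {log_evidence(M):7.2f}")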

Conjugate-exponential families (recap)

Can we compute P(D | m)? ... Sometimes.

Suppose P(D | θ_m, m) is a member of the exponential family:

P(D | θ_m, m) = ∏_{i=1..N} P(x_i | θ_m, m) = ∏_{i=1..N} exp{ s(x_i)^T θ_m − A(θ_m) } .

If our prior on θ_m is conjugate:

P(θ_m | m) = exp{ s_p^T θ_m − n_p A(θ_m) } / Z(s_p, n_p)

then the joint is in the same family:

P(D, θ_m | m) = exp{ (∑_i s(x_i) + s_p)^T θ_m − (N + n_p) A(θ_m) } / Z(s_p, n_p)

and so:

P(D | m) = ∫ dθ_m P(D, θ_m | m) = Z(∑_i s(x_i) + s_p, N + n_p) / Z(s_p, n_p) .

But this is a special case. In general, we need to approximate ... (a worked example of the normaliser-ratio computation is sketched after the list below).

Practical Bayesian approaches

• Laplace approximation:
  - Approximate the posterior by a Gaussian centred at the maximum a posteriori parameter estimate.
• Bayesian Information Criterion (BIC):
  - An asymptotic (N → ∞) approximation.
• Variational Bayes:
  - Lower bound on the marginal probability.
  - Biased estimate.
  - Easy and fast, and often better than Laplace or BIC.
• Monte Carlo methods:
  - (Annealed) Importance sampling: estimate the evidence using samples θ^(i) drawn from an arbitrary f(θ):

    ∫ dθ f(θ) [ P(D | θ, m) P(θ | m) / f(θ) ] = P(D | m) ≈ (1/S) ∑_{i=1..S} P(D, θ^(i) | m) / f(θ^(i))

  - “Reversible jump” MCMC: sample from the posterior on the composite (m, θ_m); the number of samples for each m ∝ p(m | D).
  - Both are exact in the limit of infinite samples, but may have high variance with finite samples.

Not an exhaustive list (Bethe approximations, Expectation Propagation, ...).

We will discuss Laplace and BIC now, leaving the rest for the second half of the course.
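
To make the normaliser-ratio formula concrete, here is a small sketch (my own example, with an assumed data set and prior) for count data with a Poisson likelihood and its conjugate Gamma(a, b) prior on the rate. The evidence is the ratio of the posterior normaliser to the prior normaliser, up to the fixed base-measure term ∏_i 1/x_i!, and it can be checked against a brute-force numerical integral.

# Hypothetical example: exact evidence for a conjugate Gamma-Poisson model.
# P(D|m) = [prod_i 1/x_i!] * Z(a + sum_i x_i, b + N) / Z(a, b),
# where Z(shape, rate) = Gamma(shape) / rate^shape is the Gamma normaliser.
import numpy as np
from scipy.special import gammaln
from scipy.integrate import quad

rng = np.random.default_rng(2)
x = rng.poisson(3.0, size=30)            # toy count data
N, S = len(x), int(x.sum())
a, b = 2.0, 1.0                          # Gamma(shape a, rate b) prior on the Poisson rate

def log_Z(shape, rate):                  # log normaliser of a Gamma density
    return gammaln(shape) - shape * np.log(rate)

log_evidence = -gammaln(x + 1).sum() + log_Z(a + S, b + N) - log_Z(a, b)
print("closed form        :", log_evidence)

# Brute-force check: numerically integrate likelihood * prior over the rate lam.
def integrand(lam):
    log_lik = S * np.log(lam) - N * lam - gammaln(x + 1).sum()
    log_prior = a * np.log(b) - gammaln(a) + (a - 1) * np.log(lam) - b * lam
    return np.exp(log_lik + log_prior)

print("numerical integral :", np.log(quad(integrand, 0, 10)[0]))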

Laplace approximation

We want to find P(D | m) = ∫ P(D, θ_m | m) dθ_m.

As the data size N grows (relative to the parameter count d), θ_m becomes more constrained, so P(D, θ_m | m) ∝ P(θ_m | D, m) becomes concentrated on the posterior mode θ_m*.

Idea: approximate log P(D, θ_m | m) to second order around θ_m*:

∫ P(D, θ_m | m) dθ_m = ∫ exp{ log P(D, θ_m | m) } dθ_m

≈ ∫ exp{ log P(D, θ_m* | m) + ∇log P(D, θ_m* | m)·(θ_m − θ_m*) + (1/2)(θ_m − θ_m*)^T ∇²log P(D, θ_m* | m) (θ_m − θ_m*) } dθ_m

(the gradient term vanishes at the mode, and the Hessian term defines −A)

= P(D, θ_m* | m) ∫ exp{ −(1/2)(θ_m − θ_m*)^T A (θ_m − θ_m*) } dθ_m

= P(D | θ_m*, m) P(θ_m* | m) (2π)^{d/2} |A|^{−1/2}

where A = −∇²log P(D, θ_m* | m) is the negative Hessian of log P(D, θ | m) evaluated at θ_m*.

This is equivalent to approximating the posterior by a Gaussian, an approximation which is asymptotically correct.

Bayesian Information Criterion (BIC)

BIC can be obtained from the Laplace approximation:

log P(D | m) ≈ log P(θ_m* | m) + log P(D | θ_m*, m) + (d/2) log 2π − (1/2) log |A|

We have

∇²log P(D, θ* | m) = ∇²log P(D | θ*, m) + ∇²log P(θ* | m)

so as the number of iid data N → ∞, A grows as N·A₀ + constant for a fixed matrix A₀

⇒ log |A| → log |N·A₀| = log(N^d |A₀|) = d log N + log |A₀|.

Retaining only the terms that grow with N we get:

log P(D | m) ≈ log P(D | θ_m*, m) − (d/2) log N

Properties:
• Quick and easy to compute.
• Does not depend on the prior.
• We can use the ML estimate of θ instead of the MAP estimate (they coincide as N → ∞).
• Related to the “Minimum Description Length” (MDL) criterion.
• Assumes that in the large-sample limit all the parameters are well-determined (i.e. the model is identifiable; otherwise d should be the number of well-determined parameters).
• Neglects multiple modes (e.g. permutations in a MoG).
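
Here is a minimal sketch (my own one-parameter example, not from the slides) comparing the Laplace and BIC estimates of the log evidence with the exact value, for a Bernoulli model with an assumed Beta(2, 2) prior where conjugacy gives the exact answer.

# Hypothetical illustration: Laplace and BIC estimates of the log evidence for a
# Bernoulli model with a Beta(2, 2) prior, where the exact answer is known.
import numpy as np
from scipy.special import betaln
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
x = rng.binomial(1, 0.3, size=200)
n, k = len(x), int(x.sum())
a, b, d = 2.0, 2.0, 1                       # Beta prior parameters; d = number of parameters

def log_joint(theta):                       # log P(D, theta | m) = log likelihood + log Beta prior
    return (k + a - 1) * np.log(theta) + (n - k + b - 1) * np.log(1 - theta) - betaln(a, b)

# Exact evidence (conjugacy): P(D|m) = B(k+a, n-k+b) / B(a, b)
log_ev_exact = betaln(k + a, n - k + b) - betaln(a, b)

# Laplace: Gaussian around the MAP estimate theta*, with A = -d^2/dtheta^2 log joint
t_star = minimize_scalar(lambda t: -log_joint(t), bounds=(1e-6, 1 - 1e-6), method="bounded").x
A = (k + a - 1) / t_star**2 + (n - k + b - 1) / (1 - t_star)**2
log_ev_laplace = log_joint(t_star) + 0.5 * d * np.log(2 * np.pi) - 0.5 * np.log(A)

# BIC: log P(D | theta_ML, m) - (d/2) log N
t_ml = k / n
log_ev_bic = k * np.log(t_ml) + (n - k) * np.log(1 - t_ml) - 0.5 * d * np.log(n)

print(f"exact {log_ev_exact:.2f}   Laplace {log_ev_laplace:.2f}   BIC {log_ev_bic:.2f}")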

Hyperparameters and evidence optimisation

In some cases, we need to choose between a family of continuously parameterised models. Consider:

P(D | η) = ∫ P(D | θ) P(θ | η) dθ,    η = hyperparameters

This choice can be made by ascending the gradient in:
• the exact evidence (if tractable);
• the approximated evidence (Laplace, EP, Bethe, ...);
• a free-energy bound on the evidence (Variational Bayes);

or by placing a prior on the hyperparameters η and sampling from the posterior

P(η | D) = P(D | η) P(η) / P(D)

using Markov chain Monte Carlo sampling.

Evidence optimisation in linear regression

[Graphical model: each output y_i depends on its input x_i and the weights w; the hyperparameters are the prior covariance C of w and the noise variance σ².]

w ∼ N(0, C),    y_i ∼ N(w^T x_i, σ²)

• Maximise

P(y_1 ... y_N | x_1 ... x_N, C, σ²) = ∫ P(y_1 ... y_N | x_1 ... x_N, w, σ²) P(w | C) dw

to find optimal values of C, σ².
• Compute the posterior P(w | y_1 ... y_N, x_1 ... x_N, C, σ²) given these optimal values.

The evidence for linear regression

Note: X is a matrix whose columns are the input vectors, and Y is a row vector of the corresponding outputs.

• The posterior on w is normal, with variance Σ = (XX^T/σ² + C^{−1})^{−1} and mean μ = ΣXY^T/σ².
• The evidence, E(C, σ²) = ∫ P(Y | X, w, σ²) P(w | C) dw, is given by:

E(C, σ²) = sqrt( |2πΣ| / (|2πσ²I| |2πC|) ) · exp{ −(1/2) Y ( I/σ² − X^TΣX/σ⁴ ) Y^T }

• For optimisation, general forms for the gradients are available. If θ is a parameter in C:

∂/∂θ log E(C, σ²) = (1/2) Tr[ (C − Σ − μμ^T) ∂C^{−1}/∂θ ]

∂/∂σ² log E(C, σ²) = (1/σ²) [ −N + Tr(I − ΣC^{−1}) + (1/σ²)(Y − μ^TX)(Y − μ^TX)^T ]

Evidence optimisation is also called maximum marginal likelihood or ML-2 (Type 2 maximum likelihood).

Automatic Relevance Determination

The most common form of evidence optimisation for regression (due to MacKay and Neal) takes C = diag(α)^{−1} (i.e. w_i ∼ N(0, α_i^{−1})) and then optimises the precisions {α_i}.

Setting the gradients to 0 and solving gives

α_i^new = (1 − α_i Σ_ii) / μ_i²,    (σ²)^new = (Y − μ^TX)(Y − μ^TX)^T / (N − ∑_i (1 − Σ_ii α_i))

During optimisation the α_i meet one of two fates:
• α_i → ∞  ⇒  w_i = 0: irrelevant input x_i
• α_i finite  ⇒  w_i = argmax P(w_i | X, Y, α_i): relevant input x_i

This procedure, Automatic Relevance Determination (ARD), yields sparse solutions that improve on ML regression (cf. L1-regression or the LASSO).
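
The fixed-point updates above are straightforward to run. The sketch below is a minimal illustration with assumed toy data in which only the first three of ten inputs are relevant; it is not the slides' own code, and it caps the precisions for numerical convenience.

# Hypothetical sketch of ARD for linear regression, using the slides' conventions:
# X has input vectors as columns (D x N), Y is a length-N vector of outputs,
# prior w ~ N(0, diag(alpha)^-1), observation noise variance sigma2.
import numpy as np

rng = np.random.default_rng(4)
D, N = 10, 100
X = rng.normal(size=(D, N))
w_true = np.zeros(D)
w_true[:3] = [2.0, -3.0, 1.5]                  # only three relevant inputs
Y = w_true @ X + rng.normal(0, 0.1, size=N)

alpha, sigma2 = np.ones(D), 1.0
for _ in range(200):
    # Posterior over w given the current hyperparameters
    Sigma = np.linalg.inv(X @ X.T / sigma2 + np.diag(alpha))
    mu = Sigma @ X @ Y / sigma2
    # Fixed-point updates: alpha_i <- (1 - alpha_i Sigma_ii) / mu_i^2, plus the noise update
    gamma = 1.0 - alpha * np.diag(Sigma)
    alpha = np.minimum(gamma / mu**2, 1e10)    # cap: alpha -> infinity means w_i = 0
    resid = Y - mu @ X
    sigma2 = resid @ resid / (N - gamma.sum())

print("precisions alpha       :", np.round(alpha, 2))
print("posterior mean weights :", np.round(mu, 2))
print("noise variance         :", round(sigma2, 4))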

Prediction averaging

w ∼ N(0, τ²I),    y_i ∼ N(w^T x_i, σ²)

Linear regression predicts the output y given an input vector x by:

y ∼ N(w^T x, σ²)

The posterior over w is Gaussian with Σ = (XX^T/σ² + I/τ²)^{−1} and mean μ = ΣXY^T/σ² (where X is the matrix with columns being input vectors, and Y is the row vector of outputs).

Given a new input vector x′, the predicted output y′ is (integrating out w):

y′ | x′ ∼ N( μ^T x′, x′^T Σ x′ + σ² )

The additional variance term x′^T Σ x′ comes from the posterior uncertainty in w.

Marginalised linear regression

w ∼ N(0, τ²I),    Y ∼ N(0, τ²X^TX + σ²I)

Integrate out w: the joint distribution of y_1, ..., y_N given x_1, ..., x_N is Gaussian. The means and covariances are:

E[y_i] = E[w^T x_i] = 0^T x_i = 0

E[(y_i − E[y_i])²] = E[(x_i^T w)(w^T x_i)] + σ² = τ² x_i^T x_i + σ²

E[(y_i − E[y_i])(y_j − E[y_j])] = E[(x_i^T w)(w^T x_j)] = τ² x_i^T x_j

so

(y_1, ..., y_N)^T | x_1, ..., x_N ∼ N(0, Σ_Y),   with [Σ_Y]_ij = τ² x_i^T x_j + σ² δ_ij

(i.e. τ² x_i^T x_i + σ² on the diagonal and τ² x_i^T x_j off it).
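
A minimal sketch of the weight-space route just described, with assumed toy data: compute the posterior over w and the predictive distribution for a new input.

# Hypothetical sketch of Bayesian linear regression (weight-space view).
# Conventions from the slides: X is D x N with inputs as columns, Y is length N.
import numpy as np

rng = np.random.default_rng(5)
D, N = 3, 50
tau2, sigma2 = 4.0, 0.25
X = rng.normal(size=(D, N))
w_true = rng.normal(0, np.sqrt(tau2), size=D)
Y = w_true @ X + rng.normal(0, np.sqrt(sigma2), size=N)

# Posterior over w: Sigma = (X X^T / sigma2 + I / tau2)^-1,  mu = Sigma X Y^T / sigma2
Sigma = np.linalg.inv(X @ X.T / sigma2 + np.eye(D) / tau2)
mu = Sigma @ X @ Y / sigma2

# Predictive distribution for a new input x': N(mu^T x', x'^T Sigma x' + sigma2)
x_new = rng.normal(size=D)
print("predictive mean    :", mu @ x_new)
print("predictive variance:", x_new @ Sigma @ x_new + sigma2)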

Predictions with marginalised regression

Now, include the test input vector x′ and test output y′:

(Y, y′) | X, x′ ∼ N( 0, [[ τ²X^TX + σ²I ,  τ²X^Tx′ ], [ τ²x′^TX ,  τ²x′^Tx′ + σ² ]] )

We can find y′ | Y by the standard multivariate Gaussian result:

(a, b) ∼ N( 0, [[ A, C ], [ C^T, B ]] )  ⇒  b | a ∼ N( C^T A^{−1} a,  B − C^T A^{−1} C )

So

y′ | Y, X, x′ ∼ N( τ²x′^TX (τ²X^TX + σ²I)^{−1} Y^T,  τ²x′^Tx′ + σ² − τ²x′^TX (τ²X^TX + σ²I)^{−1} τ²X^Tx′ )

            ∼ N( (1/σ²) x′^T Σ X Y^T,  x′^T Σ x′ + σ² )    with Σ = (XX^T/σ² + I/τ²)^{−1}

• Same answer as obtained by integrating with respect to the posterior over w.
• The evidence P(Y | X) is just the probability under the joint Gaussian; it also reduces to the expression found previously.
• Thus, Bayesian linear regression can be derived from a joint, parameter-free distribution on all the outputs conditioned on all the inputs.

We can also introduce a nonlinear mapping x ↦ φ(x):

w ∼ N(0, τ²I),    y_i ∼ N(w^T φ(x_i), σ²)

Each element of φ(x) is a (nonlinear) feature extracted from x. There may be many more features than elements of x.

The regression function f(x) = w^T φ(x) is nonlinear, but the outputs Y are still jointly Gaussian:

Y | X ∼ N( 0_N, τ²Φ^TΦ + σ²I_N )

where the i-th column of the matrix Φ is φ(x_i). Proceeding as before, the predictive distribution over y′ for a test input x′ is:

y′ | x′, Y, X ∼ N( τ²φ(x′)^T Φ K^{−1} Y^T,  τ²φ(x′)^Tφ(x′) + σ² − τ⁴φ(x′)^T Φ K^{−1} Φ^T φ(x′) ),    K = τ²Φ^TΦ + σ²I
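
The "same answer" claim is easy to verify numerically. The sketch below (my own check, with assumed toy data) computes the predictive mean and variance for a test input both by the weight-space route (posterior over w) and by conditioning the joint Gaussian over outputs, and confirms that they agree.

# Hypothetical check that the marginalised (function-space) prediction matches the
# weight-space prediction obtained by integrating over the posterior on w.
import numpy as np

rng = np.random.default_rng(6)
D, N = 3, 50
tau2, sigma2 = 4.0, 0.25
X = rng.normal(size=(D, N))
Y = rng.normal(0, np.sqrt(tau2), size=D) @ X + rng.normal(0, np.sqrt(sigma2), size=N)
x_new = rng.normal(size=D)

# Weight-space route: posterior over w, then predict
Sigma = np.linalg.inv(X @ X.T / sigma2 + np.eye(D) / tau2)
mu = Sigma @ X @ Y / sigma2
mean_w, var_w = mu @ x_new, x_new @ Sigma @ x_new + sigma2

# Function-space route: condition the joint Gaussian of (Y, y') on Y
K = tau2 * X.T @ X + sigma2 * np.eye(N)        # Cov[Y | X]
k_new = tau2 * X.T @ x_new                     # Cov[Y, y']
mean_f = k_new @ np.linalg.solve(K, Y)
var_f = tau2 * x_new @ x_new + sigma2 - k_new @ np.linalg.solve(K, k_new)

print(np.allclose([mean_w, var_w], [mean_f, var_f]))   # True: the two routes agree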

The covariance kernel

Y | X ∼ N( 0_N, τ²Φ^TΦ + σ²I_N )

The covariance of the output vector Y plays a central role in the development of the theory of Gaussian processes.

Define the covariance kernel function K : X × X → R such that if x, x′ ∈ X are two input vectors with corresponding outputs y, y′, then

K(x, x′) = Cov[y, y′] = E[y y′] − E[y] E[y′]

In the nonlinear regression example we have K(x, x′) = τ²φ(x)^Tφ(x′) + σ²δ_{x=x′}.

The covariance kernel has two properties:
• Symmetric: K(x, x′) = K(x′, x) for all x, x′.
• Positive semidefinite: the matrix [K(x_i, x_j)] formed by any finite set of input vectors x_1, ..., x_N is positive semidefinite.

Theorem: a covariance kernel K : X × X → R is symmetric and positive semidefinite if and only if there is a feature map φ : X → H such that

K(x, x′) = φ(x)^Tφ(x′)

The feature space H can potentially be infinite dimensional.

Regression using the covariance kernel

For non-linear regression, all operations depended on K(x, x′) rather than explicitly on φ(x). So we can define the joint in terms of K, implicitly using a (potentially infinite-dimensional) feature map φ(x):

Y | X, K ∼ N( 0_N, K(X, X) )

where the (i, j) entry of the matrix K(X, X) is K(x_i, x_j). This is called the kernel trick.

Prediction: compute the predictive distribution of y′ conditioned on Y (mean first, then variance):

y′ | x′, X, Y, K ∼ N( K(x′, X) K(X, X)^{−1} Y^T,  K(x′, x′) − K(x′, X) K(X, X)^{−1} K(X, x′) )

Evidence: this is just the Gaussian likelihood:

P(Y | X, K) = |2πK(X, X)|^{−1/2} exp{ −(1/2) Y K(X, X)^{−1} Y^T }

Evidence optimisation: the covariance kernel K often has parameters, and these can be optimised by gradient ascent in log P(Y | X, K).
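
A minimal sketch of the recipe above, expressed entirely through a kernel function (here an assumed squared-exponential kernel, with the noise folded in as the σ²δ term): compute the predictive mean and variance at test inputs and the log evidence.

# Hypothetical sketch of regression using only a covariance kernel (the kernel trick).
import numpy as np

def K(x1, x2, theta2=1.0, eta=1.0):
    """Squared-exponential covariance between two sets of scalar inputs."""
    return theta2 * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / eta**2)

rng = np.random.default_rng(7)
N, sigma2 = 30, 0.05
x = np.sort(rng.uniform(0, 10, N))
y = np.sin(x) + rng.normal(0, np.sqrt(sigma2), N)

Kxx = K(x, x) + sigma2 * np.eye(N)            # K(X, X), including the sigma2 * delta term
x_test = np.linspace(0, 10, 5)
Kxs = K(x, x_test)                            # K(X, x')
Kss = K(x_test, x_test) + sigma2 * np.eye(5)  # K(x', x'), again including sigma2 * delta

# Predictive mean K(x',X) K(X,X)^-1 Y and covariance K(x',x') - K(x',X) K(X,X)^-1 K(X,x')
pred_mean = Kxs.T @ np.linalg.solve(Kxx, y)
pred_cov = Kss - Kxs.T @ np.linalg.solve(Kxx, Kxs)

# Log evidence: -1/2 log|2 pi K(X,X)| - 1/2 Y K(X,X)^-1 Y^T
sign, logdet = np.linalg.slogdet(2 * np.pi * Kxx)
log_evidence = -0.5 * logdet - 0.5 * y @ np.linalg.solve(Kxx, y)

print("predictive mean     :", np.round(pred_mean, 3))
print("predictive variance :", np.round(np.diag(pred_cov), 3))
print("log evidence        :", round(log_evidence, 2))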

The Gaussian process

A Gaussian process (GP) is a collection of random variables, any finite number of which have (consistent) Gaussian distributions.

In our regression setting, for each input vector x we have an output f(x). Given X = [x_1, ..., x_N], the joint distribution of the outputs F = [f(x_1), ..., f(x_N)] is:

F | X, K ∼ N( 0, K(X, X) )

Thus the random function f(x) (as a collection of random variables, one f(x) for each x) is a Gaussian process.

In general, a Gaussian process is parametrised by a mean function m(x) and a covariance kernel K(x, x′), and we write

f(·) ∼ GP( m(·), K(·, ·) )

Posterior Gaussian process: on observing X and F, the conditional joint distribution of F′ = [f(x′_1), ..., f(x′_M)] on another set of input vectors x′_1, ..., x′_M is still Gaussian:

F′ | X′, X, F, K ∼ N( K(X′, X) K(X, X)^{−1} F^T,  K(X′, X′) − K(X′, X) K(X, X)^{−1} K(X, X′) )

thus the posterior over functions f(·) | X, F is still a Gaussian process!

Regression with Gaussian processes

We seek to learn the function that maps inputs x_1, ..., x_N to outputs y_1, ..., y_N. We model it with a GP prior:

f(·) ∼ GP( 0, K(·, ·) ) .

Any function is possible (no restriction on support) but some are (much) more likely a priori.

Observations y_i are usually taken to be noisy versions of the latent f(x_i):

y_i | x_i, f(·) ∼ N( f(x_i), σ² )

Evidence: given by the multivariate Gaussian likelihood:

P(Y | X) = |2π(K(X, X) + σ²I)|^{−1/2} exp{ −(1/2) Y (K(X, X) + σ²I)^{−1} Y^T }

Posterior: also a GP:

f(·) | X, Y ∼ GP( K(·, X)(K(X, X) + σ²I)^{−1} Y^T,  K(·, ·) − K(·, X)(K(X, X) + σ²I)^{−1} K(X, ·) )

Predictions: posterior on f, plus observation noise:

y′ | X, Y, x′ ∼ N( E[f(x′) | X, Y],  Var[f(x′) | X, Y] + σ² )

Evidence optimisation: gradient ascent in log P(Y | X).
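
Putting the pieces together, the sketch below is a small GP regression example of my own (squared-exponential kernel, assumed toy data): the lengthscale is chosen by evidence optimisation, and predictions add the observation noise to the posterior variance of f.

# Hypothetical sketch of GP regression with observation noise, including evidence
# optimisation of the kernel lengthscale eta by maximising log P(Y | X).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(8)
N, sigma2 = 40, 0.1
x = rng.uniform(0, 10, N)
y = np.sin(x) + rng.normal(0, np.sqrt(sigma2), N)

def sqexp(x1, x2, eta):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / eta**2)

def neg_log_evidence(log_eta):
    C = sqexp(x, x, np.exp(log_eta)) + sigma2 * np.eye(N)     # K(X,X) + sigma2 I
    sign, logdet = np.linalg.slogdet(C)
    return 0.5 * (logdet + y @ np.linalg.solve(C, y) + N * np.log(2 * np.pi))

# Evidence optimisation (one hyperparameter here, so a scalar optimiser suffices)
eta = np.exp(minimize_scalar(neg_log_evidence, bounds=(-3, 3), method="bounded").x)

# Posterior mean and variance of f at test inputs; add sigma2 to predict y itself
xs = np.linspace(0, 10, 6)
C = sqexp(x, x, eta) + sigma2 * np.eye(N)
ks = sqexp(x, xs, eta)
f_mean = ks.T @ np.linalg.solve(C, y)
f_var = np.diag(sqexp(xs, xs, eta) - ks.T @ np.linalg.solve(C, ks))

print("optimised lengthscale:", round(eta, 3))
print("predictive mean      :", np.round(f_mean, 3))
print("predictive variance  :", np.round(f_var + sigma2, 3))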

Samples from a Gaussian process

We can draw sample functions from a GP by fixing a set of input vectors x_1, ..., x_N, and drawing a sample f(x_1), ..., f(x_N) from the corresponding multivariate Gaussian. This can then be plotted.

[Figure: example samples from a prior GP and from the corresponding posterior GP.]

Another approach is to
• sample f(x_1) first,
• then f(x_2) | f(x_1),
• and generally f(x_n) | f(x_1), ..., f(x_{n−1}) for n = 1, 2, ... (both schemes are sketched in code below).
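
A minimal sketch of both sampling schemes, under an assumed squared-exponential kernel: a joint draw via the Cholesky factor, and a sequential draw built up by Gaussian conditioning (a small jitter is added for numerical stability).

# Hypothetical sketch: two ways of drawing a sample function from a GP prior.
import numpy as np

rng = np.random.default_rng(9)
x = np.linspace(0, 10, 50)
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / 1.5**2)   # squared-exponential kernel
jitter = 1e-6 * np.eye(len(x))                             # numerical stabilisation

# Approach 1: one joint draw from N(0, K) via the Cholesky factor
L = np.linalg.cholesky(K + jitter)
f_joint = L @ rng.normal(size=len(x))

# Approach 2: sequential draws f(x_n) | f(x_1), ..., f(x_{n-1})
f_seq = np.zeros(len(x))
for n in range(len(x)):
    if n == 0:
        mean, var = 0.0, K[0, 0]
    else:
        sol = np.linalg.solve(K[:n, :n] + jitter[:n, :n], K[n, :n])
        mean = sol @ f_seq[:n]
        var = K[n, n] - sol @ K[n, :n]
    f_seq[n] = rng.normal(mean, np.sqrt(max(var, 1e-12)))

# f_joint and f_seq are two (different) samples from the same GP prior; plot against x.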

Sample from a 2D Gaussian process

[Figure: a sample surface drawn from a GP over a two-dimensional input space.]

Examples of covariance kernels

• Polynomial:

  K(x, x′) = (1 + x^Tx′)^m,    m = 1, 2, ...

  [Figure: sample functions (output y against input x) for m = 1, 2, 3.]

• Squared-exponential (or exponentiated-quadratic):

  K(x, x′) = θ² exp{ −‖x − x′‖² / (2η²) }

  [Figure: sample functions for η = 1, 2, 4.]

• Periodic (exp-sine):

  K(x, x′) = θ² exp{ −2 sin²(π(x − x′)/τ) / η² }

  [Figure: sample functions for τ = 1, η = 2; τ = 2, η = 1; τ = 3, η = 0.5.]

• Rational quadratic:

  K(x, x′) = ( 1 + ‖x − x′‖² / (2αη²) )^{−α},    α > 0

  [Figure: sample functions for α = 0.5, 1, 2 with η = 2.]

Forms of kernels

If K₁ and K₂ are covariance kernels, then so are:
• Rescaling: αK₁ for α > 0.
• Addition: K₁ + K₂.
• Elementwise product: K₁K₂.
• Mapping: K₁(φ(x), φ(x′)) for some function φ.

A covariance kernel is translation-invariant if

K(x, x′) = h(x − x′)

A GP with a translation-invariant covariance kernel is stationary: if f(·) ∼ GP(0, K), then so is f(· − x) ∼ GP(0, K) for each x.

A covariance kernel is radial (or radially symmetric) if

K(x, x′) = h(‖x − x′‖)

A GP with a radial covariance kernel is stationary with respect to translations, rotations, and reflections of the input space.
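
The kernels above and the closure rules are short to implement. The sketch below (my own illustration) defines each kernel for scalar inputs, builds a combined kernel via rescaling, addition and elementwise product, and checks positive semidefiniteness numerically on a small grid.

# Hypothetical implementations of the covariance kernels listed above, plus a kernel
# built from the closure rules (rescaling, addition, elementwise product).
import numpy as np

def polynomial(x1, x2, m=2):
    return (1.0 + np.outer(x1, x2)) ** m

def squared_exponential(x1, x2, theta=1.0, eta=1.0):
    return theta**2 * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / eta**2)

def periodic(x1, x2, theta=1.0, tau=2.0, eta=1.0):
    return theta**2 * np.exp(-2.0 * np.sin(np.pi * (x1[:, None] - x2[None, :]) / tau)**2 / eta**2)

def rational_quadratic(x1, x2, alpha=1.0, eta=1.0):
    return (1.0 + (x1[:, None] - x2[None, :])**2 / (2 * alpha * eta**2)) ** (-alpha)

def combined(x1, x2):
    # Rescaling, addition and elementwise product of valid kernels give a valid kernel.
    return 0.5 * squared_exponential(x1, x2) + periodic(x1, x2) * rational_quadratic(x1, x2)

x = np.linspace(0, 5, 6)
for k in (polynomial, squared_exponential, periodic, rational_quadratic, combined):
    min_eig = np.linalg.eigvalsh(k(x, x)).min()
    print(f"{k.__name__:21s} min eigenvalue = {min_eig:.2e}")   # all >= 0 up to rounding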

Nonparametric Bayesian Models and Occam’s Razor

Overparameterised models can overfit. In the GP, the parameter is the function f(x), which can be infinite-dimensional.

However, the Bayesian treatment integrates over these parameters: we never identify a single “best fit” f, just a posterior (and posterior mean). So f cannot be adjusted to overfit the data.

The GP is an example of the larger class of nonparametric Bayesian models.

• Infinite number of parameters.
• Often constructed as the infinite limit of a nested family of finite models (sometimes equivalent to infinite model averaging).
• Parameters integrated out, so the effective number of parameters available to overfit is zero or small (hyperparameters).
• No need for model selection: the Bayesian posterior on parameters will concentrate on the “submodel” with the largest integral automatically.
• No explicit need for Occam’s razor, validation or an added regularisation penalty.