Learning model structure
Probabilistic & Unsupervised Learning How many clusters in the data?
Model selection, Hyperparameter optimisation, How smooth should the function be? and Gaussian Processes Is this input relevant to predicting that output? Maneesh Sahani [email protected] What is the order of a dynamical system? Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science SVYDAAAQLTADVKKDLRDSWKVIGSDKKGNG University College London How many states in a hidden Markov model?
Term 1, Autumn 2014 How many auditory sources in the input?
Model selection Model complexity and overfitting: a simple example
Models (labelled by m) have parameters θm that specify the probability of data: M = 0 M = 1 M = 2 M = 3
P(D|θm, m) . 40 40 40 40
If model is known, learning θ means finding posterior or point estimate (ML, MAP, . . . ). m 20 20 20 20
What if we need to learn the model too? 0 0 0 0 I Could combine models into a single “supermodel”, with composite parameter (m, θm).
I ML learning will overfit: favours most flexible (nested) model with most parameters, −20 −20 −20 −20 0 5 10 0 5 10 0 5 10 0 5 10 even if the data actually come from a simpler one.
I Density function on composite parameter space (union of manifolds of different M = 4 M = 5 M = 6 M = 7 dimensionalities) difficult to define ⇒ MAP learning ill-posed. 40 40 40 40 I Joint posterior difficult to compute — dimension of composite parameter varies.
20 20 20 20 ⇒ Separate model selection step: 0 0 0 0 P(θm, m|D) = P(θm|m, D) · P(m|D) | {z } | {z } model-specific posterior model selection −20 −20 −20 −20 0 5 10 0 5 10 0 5 10 0 5 10 Model selection Bayesian model selection: some terminology
Given models labeled by m wih parameters θm, identify the “correct” model for data D. A model class m is a set of distributions parameterised by θm, e.g. the set of all possible ML ML/MAP has no good answer: P(D|θm ) is always larger for more complex (nested) models. mixtures of m Gaussians. Neyman-Pearson hypothesis testing The model implies both a prior over the parameters P(θm|m), and a likelihood of data given I For nested models. Starting with simplest model (m = 1), compare (e.g. by likelihood parameters (which might require integrating out latent variables) P(D|θm, m). ratio test) null hypothesis m to alternative m + 1. Continue until m + 1 is rejected. I Usually only valid asympotically in data number. The posterior distribution over parameters is I Conservative (N-P hypothesis tests are asymmetric). P(D|θm, m)P(θm|m) P(θm|D, m)= . Likelihood validation P(D|m)
I Partition data into disjoint training and validation data sets D = Dtr ∪ Dvld. Choose The marginal probability of the data under model class m is: (D |θML) θML = (D |θ) model with greatest P vld m , with m argmax P tr . Z I Unbiased, but often high-variance. P(D|m)= P(D|θm, m)P(θm|m) dθm. I Cross-validation uses multiple partitions and averages likelihoods. Θm
Bayesian model selection (also called the Bayesian evidence for model m). I Choose most likely model: argmax P(m|D). The ratio of two marginal probabilities (or sometimes its log) is known as the Bayes factor: I Principled from a probabilistic viewpoint—if true model is in set being considered—but 0 sensitive to assumed priors etc. P(D|m) P(m|D) p(m ) 0 = 0 I Can use posterior probabilities to weight models for combined predictions (no need to P(D|m ) P(m |D) p(m) select at all).
The Bayesian Occam’s razor Bayesian model comparison: Occam’s razor at work Occam’s Razor is a principle of scientific philosophy: of two explanations adequate to explain the same set of observations, the simpler should always be preferred. Bayesian inference formalises and automatically implements a form of Occam’s Razor. Compare model classes m using their posterior probability given the data: M = 0 M = 1 M = 2 M = 3 Z P(D|m)P(m) 40 40 40 40 Model Evidence P(m|D) = , P(D|m)= P(D|θm, m)P(θm|m) dθm P(D) 1 Θm 20 20 20 20
0.8 P(D|m): The probability that randomly selected parameter values from the model class 0 0 0 0
would generate data set D. −20 −20 −20 −20 0.6 0 5 10 0 5 10 0 5 10 0 5 10
Model classes that are too simple are unlikely to generate the observed data set. M = 4 M = 5 M = 6 M = 7 P(Y|M) 0.4
Model classes that are too complex can generate many possible data sets, so again, they are 40 40 40 40 unlikely to generate that particular data set at random. 0.2 20 20 20 20 Like Goldilocks, we favour a model that is just right. 0 0 1 2 3 4 5 6 7 0 0 0 0 M
−20 −20 −20 −20 0 5 10 0 5 10 0 5 10 0 5 10 ) m D| ( P
data sets: D D0 Conjugate-exponential families (recap) Practical Bayesian approaches
Can we compute P(D|m)? ...... Sometimes. I Laplace approximation:
Suppose P(D|θm, m) is a member of the exponential family: I Approximate posterior by a Gaussian centred at the maximum a posteriori parameter estimate. N N T Y Y s(xi ) θm−A(θm) I Bayesian Information Criterion (BIC) P(D|θm, m) = P(xi |θm, m) = e . I an asymptotic (N → ∞) approximation. i=1 i=1 I Variational Bayes If our prior on θ is conjugate: m I Lower bound on the marginal probability.
T I Biased estimate. sp θm−np A(θm) P(θm|m) = e /Z(sp, np) I Easy and fast, and often better than Laplace or BIC.
I Monte Carlo methods: then the joint is in the same family: (i) I (Annealed) Importance sampling: estimate evidence using samples θ from arbitrary f (θ): T P s(x )+s θ −(N+n )A(θ ) (i) (i) Z i i p m p m X P(D|θ , m)P(θ |m) P(D, θ|m) P(D, θm|m) = e /Z (sp, p) → dθ f (θ) = P(D|m) f (θ(i)) f (θ) i and so: I “Reversible jump” Markov Chain Monte Carlo: sample from posterior on composite (m, θm). Z # samples for each m ∝ p(m|D). P(D|m) = dθ P(D, θ |m) = Z P s(x ) + s , N + n Z(s , p) m m i i p p p I Both exact in the limit of infinite samples, but may have high variance with finite samples. Not an exhaustive list (Bethe approximations, Expectation propagation, . . . ) But this is a special case. In general, we need to approximate . . . We will discuss Laplace and BIC now, leaving the rest for the second half of course.
Laplace approximation Bayesian Information Criterion (BIC) BIC can be obtained from the Laplace approximation: Z We want to find P(D|m) = P(D, θ |m) dθ . d 1 m m log P(D|m) ≈ log P(θ∗ |m) + log P(D|θ∗ , m) + log 2π − log |A| m m 2 2 As data size N grows (relative to parameter count d), θm becomes more constrained ∗ We have ⇒ P(D, θm|m) ∝ P(θm|D, m) becomes concentrated on posterior mode θm. ∗ 2 ∗ 2 ∗ 2 ∗ Idea: approximate log P(D, θm|m) to second-order around θ . A = ∇ log P(D, θ |m) = ∇ log P(D|θ , m) + ∇ log P(θ |m)
Z Z So as the number of iid data N → ∞, A grows as NA0 + constant for a fixed matrix A0. log P(D,θm|m) P(D, θm|m)dθm = e dθm d ⇒ log |A| → log |NA0| = log(N |A0|) = d log N + log |A0|. Z log P(D,θ∗ |m)+∇ log P(D,θ∗ |m)·(θ −θ∗ )+ 1 (θ −θ∗ )T∇2 log P(D,θ∗|m)(θ −θ∗ ) Retaining only terms that grow with N we get: = e m m m m 2 m m m m dθ | {z } | {z } m =0 =−A d Z log P(D|m) ≈ log P(D|θ∗ , m) − log N ∗ − 1 (θ −θ∗ )TA(θ −θ∗ ) m 2 m m m m = P(D, θm|m)e dθm 2 Properties: ∗ ∗ d − 1 2 2 = P(D|θm, m)P(θm|m)(2π) |A| I Quick and easy to compute.
I Does not depend on prior. A = −∇2 log P(D, θ∗ |m) is the negative Hessian of log P(D, θ|m) evaluated at θ∗ . m m I We can use the ML estimate of θ instead of the MAP estimate (= as N → ∞).
I Related to the “Minimum Description Length” (MDL) criterion. This is equivalent to approximating the posterior by a Gaussian: an approximation which is I Assumes that in the large sample limit, all the parameters are well-determined (i.e. the asymptotically correct. model is identifiable; otherwise, d should be the number of well-determined parameters).
I Neglects multiple modes (e.g. permutations in a MoG). Hyperparameters and Evidence optimisation Evidence optimisation in linear regression
Consider simple linear regression: In some cases, we need to choose between a family of continuously parameterised models.
Z xi P(D|η) = P(D|θ)P(θ|η) dθ ↑ hyperparameters w ∼ N (0, C)
C w yi 2 This choice can be made by ascending the gradient in: σ the exact evidence (if tractable). I T 2 yi ∼ N w xi , σ I the approximated evidence (Laplace, EP, Bethe, . . . )
I a free-energy bound on the evidence (Variational Bayes) Maximize or by placing a hyperprior on the hyperparameters η, and sampling from the posterior I Z 2 2 P(D|η)P(η) P(y1 ... yN |x1 ... xN , C, σ ) = P(y1 ... yN |x1 ... xN , w, σ )P(w|C) dw P(η|D) = P(D) to find optimal values of C, σ. using Markov chain Monte Carlo sampling. 2 I Compute the posterior P(w|y1 ... yN , x1 ... xN , C, σ ) given these optimal values.
The evidence for linear regression Automatic Relevance Determination
The most common form of evidence optimization for regression (due to MacKay and Neal) −1 −1 T T XX −1 −1 XY takes C = diag(α) (i.e. wi ∼ N (0, αi )) and then optimizes the precisions {αi }. I The posterior on w is normal, with variance Σ = ( σ2 + C ) and mean µ = Σ σ2 . Note: X is a matrix where columns are input vectors, and Y is a row vector of corresponding Setting the gradients to 0 and solving gives predicted outputs. 1 − α Σ αnew = i ii 2 R 2 i 2 I The evidence, E(C, σ ) = P(Y |X, w, σ )P(w|C) dw, is given by: µi T T T s 2 new (Y − µ X)(Y − µ X) T (σ ) = 2 |2πΣ| 1 I X ΣX T P E(C, σ ) = exp − Y − Y N − i (1 − Σii αi ) |2πσ2I| |2πC| 2 σ2 σ4 During optimization the αi s meet one of two fates I For optimization, general forms for the gradients are available. If θ is a parameter in C: αi → ∞ ⇒ wi = 0 irrelevant input xi ∂ 1 ∂ − log E(C, σ2) = Tr (C − Σ − µµT) C 1 αi finite ⇒ wi = argmax P (wi | X, Y , αi ) relevant input xi ∂θ 2 ∂θ ∂ 1 1 This procedure, Automatic Relevance Determination (ARD), yields sparse solutions that log E(C, σ2) = −N + Tr I − ΣC−1 + (Y − µTX)(Y − µTX)T ∂σ2 σ2 σ2 improve on ML regression. (cf. L1-regression or LASSO).
Evidence optimisation is also called maximum marginal likelihood or ML-2 (Type 2 maximum likelihood). Prediction averaging Marginalised linear regression
xi xi
w ∼ N 0, τ 2I 2 2 2 2 τ y1,..., yN σ τ w yi σ
T 2 T 2 T 2 Y ∼ N 0, τ X X + σ I yi ∼ N w xi , σ
Integrate out w: the joint distribution of y1,..., yN given x1,..., xN is Gaussian. Linear regression predicts output y given input vector x by: The means and covariances are: T T y ∼ N (wTx, σ2) E[yi ] = E[w xi ] = 0 xi = 0 2 T T 2 2 T 2 E[(yi − yi ) ] = E[(xi w)(w xi )] + σ = τ xi xi + σ Posterior over w is Gaussian with covariance Σ=( 1 XX T + 1 I)−1 and mean µ= 1 ΣXY T σ2 τ 2 σ2 T T 2 T (where X is matrix with columns being input vectors, Y is row vector of outputs). E[(yi − yi )(yj − yj )] = E[(xi w)(w xj )] = τ xi xj Given a new input vector x0, the predicted output y 0 is (integrating out w): 2 T 2 2 T 2 T y1 0 τ x1 x1 + σ τ x1 x2 ··· τ x1 xN 2 T 2 T 2 2 T 0 0 T 0 0T 0 2 y2 0 τ x2 x1 τ x2 x2 + σ τ x2 xN y |x ∼ N (µ x , x Σx + σ ) . x1,..., xN ∼ N . , . . . . . . .. . 0T 0 2 T 2 T 2 T 2 the additional variance term x Σx comes from the posterior uncertainty in w. yN 0 τ xN x1 τ xN x2 ··· τ xN xN + σ
Predictions with marginalised regression Nonlinear regression Now, include the test input vector x0 and test output y 0:
T 2 T 2 2 T 0 xi Y 0 0 τ X X + σ I τ X x 0 X, x ∼ N , T T y 0 τ 2x0 X τ 2x0 x0 + σ2
0 We can find y |Y by the standard multivariate Gaussian result: w ∼ N 0, τ 2I a 0 AC T −1 T −1 2 w yi 2 ∼ N , ⇒ b|a ∼ N C A a, B − C A C τ σ b 0 CT B