Probabilistic & Unsupervised Learning

Model selection, optimisation, and Gaussian Processes

Maneesh Sahani
[email protected]
Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science
University College London

Term 1, Autumn 2014

Learning model structure

• How many clusters in the data?
• How smooth should the function be?
• Is this input relevant to predicting that output?
• What is the order of a dynamical system?
• How many states in a hidden Markov model? (e.g. SVYDAAAQLTADVKKDLRDSWKVIGSDKKGNG)
• How many auditory sources in the input?

Model selection

Models (labelled by m) have parameters θ_m that specify the probability of data:

P(D | θ_m, m) .

If the model is known, learning θ_m ≡ finding a posterior or point estimate (ML, MAP, ...).

What if we need to learn the model too?
• Could combine models into a single “supermodel”, with composite parameter (m, θ_m).
• ML learning will overfit: it favours the most flexible (nested) model with the most parameters, even if the data actually come from a simpler one.
• A density function on the composite parameter space (a union of manifolds of different dimensionalities) is difficult to define ⇒ MAP learning is ill-posed.
• The joint posterior is difficult to compute: the dimension of the composite parameter varies.

⇒ Separate model selection step:

P(θ_m, m | D) = P(θ_m | m, D) · P(m | D)

(the first factor is the model-specific posterior; the second is the model-selection posterior).

Model complexity and overfitting: a simple example

[Figure: polynomial fits of order M = 0, 1, ..., 7 to the same small data set (x from 0 to 10); higher-order polynomials track the observations ever more closely.]

Model selection

Given models labelled by m with parameters θ_m, identify the “correct” model for data D.

ML/MAP has no good answer: P(D | θ_m^ML) is always larger for more complex (nested) models.

Neyman-Pearson hypothesis testing
• For nested models. Starting with the simplest model (m = 1), compare (e.g. by likelihood ratio test) the null hypothesis m to the alternative m + 1. Continue until m + 1 is rejected.
• Usually only valid asymptotically in data number.
• Conservative (N-P hypothesis tests are asymmetric).

Likelihood validation
• Partition the data into disjoint training and validation sets D = D_tr ∪ D_vld. Choose the model with greatest P(D_vld | θ_m^ML), with θ_m^ML = argmax_θ P(D_tr | θ).
• Unbiased, but often high-variance.
• Cross-validation uses multiple partitions and averages the likelihoods.

Bayesian model selection
• Choose the most likely model: argmax_m P(m | D).
• Principled from a probabilistic viewpoint (if the true model is in the set being considered), but sensitive to assumed priors etc.
• Can use the posterior probabilities to weight models for combined predictions (no need to select at all).

Bayesian model selection: some terminology

A model class m is a set of distributions parameterised by θ_m, e.g. the set of all possible mixtures of m Gaussians.

The model implies both a prior over the parameters P(θ_m | m), and a likelihood of data given parameters (which might require integrating out latent variables) P(D | θ_m, m).

The posterior distribution over parameters is

P(θ_m | D, m) = P(D | θ_m, m) P(θ_m | m) / P(D | m) .

The marginal probability of the data under model class m is:

P(D | m) = ∫_Θm P(D | θ_m, m) P(θ_m | m) dθ_m

(also called the Bayesian evidence for model m).

The ratio of two marginal probabilities (or sometimes its log) is known as the Bayes factor:

P(D | m) / P(D | m′) = [P(m | D) / P(m′ | D)] · [p(m′) / p(m)]
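
As a concrete illustration (not from the slides), the sketch below compares two model classes for a sequence of coin flips: m0, a fair coin with no free parameters, and m1, a coin of unknown bias with a Beta(1, 1) prior. Both marginal likelihoods are available in closed form, so the Bayes factor and the posterior over models can be computed exactly; the data and prior choices are assumptions made for this example.

# Hypothetical illustration: exact Bayes factor for coin-flip data.
# m0: fair coin (no free parameters); m1: unknown bias p with a Beta(1, 1) prior.
import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=50)        # 50 flips from a (secretly) biased coin
n, k = len(x), int(x.sum())

# Log marginal likelihoods (evidence) for each model class
log_ev_m0 = n * np.log(0.5)                          # P(D|m0) = 0.5^n
log_ev_m1 = betaln(k + 1, n - k + 1) - betaln(1, 1)  # integral of p^k (1-p)^(n-k) Beta(p|1,1) dp

print("log Bayes factor (m1 vs m0):", log_ev_m1 - log_ev_m0)

# Posterior over model classes under a uniform prior P(m0) = P(m1) = 1/2
log_post = np.array([log_ev_m0, log_ev_m1])
post = np.exp(log_post - log_post.max())
print("P(m|D):", post / post.sum())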

The Bayesian Occam’s razor

Occam’s Razor is a principle of scientific philosophy: of two explanations adequate to explain the same set of observations, the simpler should always be preferred.

Bayesian model comparison formalises and automatically implements a form of Occam’s Razor. Compare model classes m using their posterior probabilities given the data:

P(m | D) = P(D | m) P(m) / P(D),    P(D | m) = ∫_Θm P(D | θ_m, m) P(θ_m | m) dθ_m

P(D | m) is the probability that randomly selected parameter values from the model class would generate data set D.

Model classes that are too simple are unlikely to generate the observed data set.

Model classes that are too complex can generate many possible data sets, so again, they are unlikely to generate that particular data set at random.

Like Goldilocks, we favour a model that is just right.

[Figure: P(D | m) plotted over the space of possible data sets for a too-simple, a “just right”, and a too-complex model class; the observed data set is marked on the axis.]

Bayesian model comparison: Occam’s razor at work

[Figure: model evidence P(Y | M) for polynomial orders M = 0, ..., 7 fitted to the earlier example data, shown alongside the corresponding fits.]
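
The bar-chart panel can be reproduced numerically in the linear-Gaussian setting developed later in these notes. The sketch below is my own illustration with assumed settings (toy data, prior variance tau2, noise variance sigma2): it scores polynomial models of increasing order M by their marginal likelihood P(Y | X, M) = N(Y; 0, τ²Φ_M^TΦ_M + σ²I), which typically peaks at an intermediate order rather than at the most flexible model.

# Hypothetical sketch: Bayesian evidence for polynomial models of order M = 0..7,
# assuming a Gaussian prior w ~ N(0, tau2 I) on the coefficients and noise sigma2.
import numpy as np

rng = np.random.default_rng(1)
N, tau2, sigma2 = 20, 10.0, 0.1
x = np.linspace(-1, 1, N)
y = 5 * x**3 - 3 * x + rng.normal(0, np.sqrt(sigma2), N)     # toy data from a cubic

def log_evidence(order):
    Phi = np.vander(x, order + 1, increasing=True)           # rows are [1, x_i, ..., x_i^order]
    C = tau2 * Phi @ Phi.T + sigma2 * np.eye(N)               # Cov[Y | X, M]
    sign, logdet = np.linalg.slogdet(C)
    return -0.5 * (logdet + y @ np.linalg.solve(C, y) + N * np.log(2 * np.pi))

for M in range(8):
    print(f"M = {M}: log P(Y|X,M) = {log_evidence(M):7.2f}")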

Conjugate-exponential families (recap)

Can we compute P(D | m)? ... Sometimes.

Suppose P(D | θ_m, m) is a member of the exponential family:

P(D | θ_m, m) = ∏_{i=1..N} P(x_i | θ_m, m) = ∏_{i=1..N} exp{ s(x_i)^T θ_m − A(θ_m) } .

If our prior on θ_m is conjugate:

P(θ_m | m) = exp{ s_p^T θ_m − n_p A(θ_m) } / Z(s_p, n_p)

then the joint is in the same family:

P(D, θ_m | m) = exp{ (∑_i s(x_i) + s_p)^T θ_m − (N + n_p) A(θ_m) } / Z(s_p, n_p)

and so:

P(D | m) = ∫ dθ_m P(D, θ_m | m) = Z(∑_i s(x_i) + s_p, N + n_p) / Z(s_p, n_p) .

But this is a special case. In general, we need to approximate ... (a worked example of the normaliser-ratio computation is sketched after the list below).

Practical Bayesian approaches

• Laplace approximation:
  - Approximate the posterior by a Gaussian centred at the maximum a posteriori parameter estimate.
• Bayesian Information Criterion (BIC):
  - An asymptotic (N → ∞) approximation.
• Variational Bayes:
  - Lower bound on the marginal probability.
  - Biased estimate.
  - Easy and fast, and often better than Laplace or BIC.
• Monte Carlo methods:
  - (Annealed) Importance sampling: estimate the evidence using samples θ^(i) drawn from an arbitrary f(θ):

    ∫ dθ f(θ) [ P(D | θ, m) P(θ | m) / f(θ) ] = P(D | m) ≈ (1/S) ∑_{i=1..S} P(D, θ^(i) | m) / f(θ^(i))

  - “Reversible jump” MCMC: sample from the posterior on the composite (m, θ_m); the number of samples for each m ∝ p(m | D).
  - Both are exact in the limit of infinite samples, but may have high variance with finite samples.

Not an exhaustive list (Bethe approximations, Expectation Propagation, ...).

We will discuss Laplace and BIC now, leaving the rest for the second half of the course.
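
To make the normaliser-ratio formula concrete, here is a small sketch (my own example, with an assumed data set and prior) for count data with a Poisson likelihood and its conjugate Gamma(a, b) prior on the rate. The evidence is the ratio of the posterior normaliser to the prior normaliser, up to the fixed base-measure term ∏_i 1/x_i!, and it can be checked against a brute-force numerical integral.

# Hypothetical example: exact evidence for a conjugate Gamma-Poisson model.
# P(D|m) = [prod_i 1/x_i!] * Z(a + sum_i x_i, b + N) / Z(a, b),
# where Z(shape, rate) = Gamma(shape) / rate^shape is the Gamma normaliser.
import numpy as np
from scipy.special import gammaln
from scipy.integrate import quad

rng = np.random.default_rng(2)
x = rng.poisson(3.0, size=30)            # toy count data
N, S = len(x), int(x.sum())
a, b = 2.0, 1.0                          # Gamma(shape a, rate b) prior on the Poisson rate

def log_Z(shape, rate):                  # log normaliser of a Gamma density
    return gammaln(shape) - shape * np.log(rate)

log_evidence = -gammaln(x + 1).sum() + log_Z(a + S, b + N) - log_Z(a, b)
print("closed form        :", log_evidence)

# Brute-force check: numerically integrate likelihood * prior over the rate lam.
def integrand(lam):
    log_lik = S * np.log(lam) - N * lam - gammaln(x + 1).sum()
    log_prior = a * np.log(b) - gammaln(a) + (a - 1) * np.log(lam) - b * lam
    return np.exp(log_lik + log_prior)

print("numerical integral :", np.log(quad(integrand, 0, 10)[0]))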

Laplace approximation

We want to find P(D | m) = ∫ P(D, θ_m | m) dθ_m.

As the data size N grows (relative to the parameter count d), θ_m becomes more constrained, so P(D, θ_m | m) ∝ P(θ_m | D, m) becomes concentrated on the posterior mode θ_m*.

Idea: approximate log P(D, θ_m | m) to second order around θ_m*:

∫ P(D, θ_m | m) dθ_m = ∫ exp{ log P(D, θ_m | m) } dθ_m

≈ ∫ exp{ log P(D, θ_m* | m) + ∇log P(D, θ_m* | m)·(θ_m − θ_m*) + (1/2)(θ_m − θ_m*)^T ∇²log P(D, θ_m* | m) (θ_m − θ_m*) } dθ_m

(the gradient term vanishes at the mode, and the Hessian term defines −A)

= P(D, θ_m* | m) ∫ exp{ −(1/2)(θ_m − θ_m*)^T A (θ_m − θ_m*) } dθ_m

= P(D | θ_m*, m) P(θ_m* | m) (2π)^{d/2} |A|^{−1/2}

where A = −∇²log P(D, θ_m* | m) is the negative Hessian of log P(D, θ | m) evaluated at θ_m*.

This is equivalent to approximating the posterior by a Gaussian, an approximation which is asymptotically correct.

Bayesian Information Criterion (BIC)

BIC can be obtained from the Laplace approximation:

log P(D | m) ≈ log P(θ_m* | m) + log P(D | θ_m*, m) + (d/2) log 2π − (1/2) log |A|

We have

∇²log P(D, θ* | m) = ∇²log P(D | θ*, m) + ∇²log P(θ* | m)

so as the number of iid data N → ∞, A grows as N·A₀ + constant for a fixed matrix A₀

⇒ log |A| → log |N·A₀| = log(N^d |A₀|) = d log N + log |A₀|.

Retaining only the terms that grow with N we get:

log P(D | m) ≈ log P(D | θ_m*, m) − (d/2) log N

Properties:
• Quick and easy to compute.
• Does not depend on the prior.
• We can use the ML estimate of θ instead of the MAP estimate (they coincide as N → ∞).
• Related to the “Minimum Description Length” (MDL) criterion.
• Assumes that in the large-sample limit all the parameters are well-determined (i.e. the model is identifiable; otherwise d should be the number of well-determined parameters).
• Neglects multiple modes (e.g. permutations in a MoG).
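
Here is a minimal sketch (my own one-parameter example, not from the slides) comparing the Laplace and BIC estimates of the log evidence with the exact value, for a Bernoulli model with an assumed Beta(2, 2) prior where conjugacy gives the exact answer.

# Hypothetical illustration: Laplace and BIC estimates of the log evidence for a
# Bernoulli model with a Beta(2, 2) prior, where the exact answer is known.
import numpy as np
from scipy.special import betaln
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
x = rng.binomial(1, 0.3, size=200)
n, k = len(x), int(x.sum())
a, b, d = 2.0, 2.0, 1                       # Beta prior parameters; d = number of parameters

def log_joint(theta):                       # log P(D, theta | m) = log likelihood + log Beta prior
    return (k + a - 1) * np.log(theta) + (n - k + b - 1) * np.log(1 - theta) - betaln(a, b)

# Exact evidence (conjugacy): P(D|m) = B(k+a, n-k+b) / B(a, b)
log_ev_exact = betaln(k + a, n - k + b) - betaln(a, b)

# Laplace: Gaussian around the MAP estimate theta*, with A = -d^2/dtheta^2 log joint
t_star = minimize_scalar(lambda t: -log_joint(t), bounds=(1e-6, 1 - 1e-6), method="bounded").x
A = (k + a - 1) / t_star**2 + (n - k + b - 1) / (1 - t_star)**2
log_ev_laplace = log_joint(t_star) + 0.5 * d * np.log(2 * np.pi) - 0.5 * np.log(A)

# BIC: log P(D | theta_ML, m) - (d/2) log N
t_ml = k / n
log_ev_bic = k * np.log(t_ml) + (n - k) * np.log(1 - t_ml) - 0.5 * d * np.log(n)

print(f"exact {log_ev_exact:.2f}   Laplace {log_ev_laplace:.2f}   BIC {log_ev_bic:.2f}")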

Hyperparameters and evidence optimisation

In some cases, we need to choose between a family of continuously parameterised models. Consider:

P(D | η) = ∫ P(D | θ) P(θ | η) dθ,    η = hyperparameters

This choice can be made by ascending the gradient in:
• the exact evidence (if tractable);
• the approximated evidence (Laplace, EP, Bethe, ...);
• a free-energy bound on the evidence (Variational Bayes);

or by placing a prior on the hyperparameters η and sampling from the posterior

P(η | D) = P(D | η) P(η) / P(D)

using Markov chain Monte Carlo sampling.

Evidence optimisation in linear regression

[Graphical model: each output y_i depends on its input x_i and the weights w; the hyperparameters are the prior covariance C of w and the noise variance σ².]

w ∼ N(0, C),    y_i ∼ N(w^T x_i, σ²)

• Maximise

P(y_1 ... y_N | x_1 ... x_N, C, σ²) = ∫ P(y_1 ... y_N | x_1 ... x_N, w, σ²) P(w | C) dw

to find optimal values of C, σ².
• Compute the posterior P(w | y_1 ... y_N, x_1 ... x_N, C, σ²) given these optimal values.

The evidence for linear regression

Note: X is a matrix whose columns are the input vectors, and Y is a row vector of the corresponding outputs.

• The posterior on w is normal, with variance Σ = (XX^T/σ² + C^{−1})^{−1} and mean μ = ΣXY^T/σ².
• The evidence, E(C, σ²) = ∫ P(Y | X, w, σ²) P(w | C) dw, is given by:

E(C, σ²) = sqrt( |2πΣ| / (|2πσ²I| |2πC|) ) · exp{ −(1/2) Y ( I/σ² − X^TΣX/σ⁴ ) Y^T }

• For optimisation, general forms for the gradients are available. If θ is a parameter in C:

∂/∂θ log E(C, σ²) = (1/2) Tr[ (C − Σ − μμ^T) ∂C^{−1}/∂θ ]

∂/∂σ² log E(C, σ²) = (1/σ²) [ −N + Tr(I − ΣC^{−1}) + (1/σ²)(Y − μ^TX)(Y − μ^TX)^T ]

Evidence optimisation is also called maximum marginal likelihood or ML-2 (Type 2 maximum likelihood).

Automatic Relevance Determination

The most common form of evidence optimisation for regression (due to MacKay and Neal) takes C = diag(α)^{−1} (i.e. w_i ∼ N(0, α_i^{−1})) and then optimises the precisions {α_i}.

Setting the gradients to 0 and solving gives

α_i^new = (1 − α_i Σ_ii) / μ_i²,    (σ²)^new = (Y − μ^TX)(Y − μ^TX)^T / (N − ∑_i (1 − Σ_ii α_i))

During optimisation the α_i meet one of two fates:
• α_i → ∞  ⇒  w_i = 0: irrelevant input x_i
• α_i finite  ⇒  w_i = argmax P(w_i | X, Y, α_i): relevant input x_i

This procedure, Automatic Relevance Determination (ARD), yields sparse solutions that improve on ML regression (cf. L1-regression or the LASSO).
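
The fixed-point updates above are straightforward to run. The sketch below is a minimal illustration with assumed toy data in which only the first three of ten inputs are relevant; it is not the slides' own code, and it caps the precisions for numerical convenience.

# Hypothetical sketch of ARD for linear regression, using the slides' conventions:
# X has input vectors as columns (D x N), Y is a length-N vector of outputs,
# prior w ~ N(0, diag(alpha)^-1), observation noise variance sigma2.
import numpy as np

rng = np.random.default_rng(4)
D, N = 10, 100
X = rng.normal(size=(D, N))
w_true = np.zeros(D)
w_true[:3] = [2.0, -3.0, 1.5]                  # only three relevant inputs
Y = w_true @ X + rng.normal(0, 0.1, size=N)

alpha, sigma2 = np.ones(D), 1.0
for _ in range(200):
    # Posterior over w given the current hyperparameters
    Sigma = np.linalg.inv(X @ X.T / sigma2 + np.diag(alpha))
    mu = Sigma @ X @ Y / sigma2
    # Fixed-point updates: alpha_i <- (1 - alpha_i Sigma_ii) / mu_i^2, plus the noise update
    gamma = 1.0 - alpha * np.diag(Sigma)
    alpha = np.minimum(gamma / mu**2, 1e10)    # cap: alpha -> infinity means w_i = 0
    resid = Y - mu @ X
    sigma2 = resid @ resid / (N - gamma.sum())

print("precisions alpha       :", np.round(alpha, 2))
print("posterior mean weights :", np.round(mu, 2))
print("noise variance         :", round(sigma2, 4))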

Prediction averaging

w ∼ N(0, τ²I),    y_i ∼ N(w^T x_i, σ²)

Linear regression predicts the output y given an input vector x by:

y ∼ N(w^T x, σ²)

The posterior over w is Gaussian with Σ = (XX^T/σ² + I/τ²)^{−1} and mean μ = ΣXY^T/σ² (where X is the matrix with columns being input vectors, and Y is the row vector of outputs).

Given a new input vector x′, the predicted output y′ is (integrating out w):

y′ | x′ ∼ N( μ^T x′, x′^T Σ x′ + σ² )

The additional variance term x′^T Σ x′ comes from the posterior uncertainty in w.

Marginalised linear regression

w ∼ N(0, τ²I),    Y ∼ N(0, τ²X^TX + σ²I)

Integrate out w: the joint distribution of y_1, ..., y_N given x_1, ..., x_N is Gaussian. The means and covariances are:

E[y_i] = E[w^T x_i] = 0^T x_i = 0

E[(y_i − E[y_i])²] = E[(x_i^T w)(w^T x_i)] + σ² = τ² x_i^T x_i + σ²

E[(y_i − E[y_i])(y_j − E[y_j])] = E[(x_i^T w)(w^T x_j)] = τ² x_i^T x_j

so

(y_1, ..., y_N)^T | x_1, ..., x_N ∼ N(0, Σ_Y),   with [Σ_Y]_ij = τ² x_i^T x_j + σ² δ_ij

(i.e. τ² x_i^T x_i + σ² on the diagonal and τ² x_i^T x_j off it).
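
A minimal sketch of the weight-space route just described, with assumed toy data: compute the posterior over w and the predictive distribution for a new input.

# Hypothetical sketch of Bayesian linear regression (weight-space view).
# Conventions from the slides: X is D x N with inputs as columns, Y is length N.
import numpy as np

rng = np.random.default_rng(5)
D, N = 3, 50
tau2, sigma2 = 4.0, 0.25
X = rng.normal(size=(D, N))
w_true = rng.normal(0, np.sqrt(tau2), size=D)
Y = w_true @ X + rng.normal(0, np.sqrt(sigma2), size=N)

# Posterior over w: Sigma = (X X^T / sigma2 + I / tau2)^-1,  mu = Sigma X Y^T / sigma2
Sigma = np.linalg.inv(X @ X.T / sigma2 + np.eye(D) / tau2)
mu = Sigma @ X @ Y / sigma2

# Predictive distribution for a new input x': N(mu^T x', x'^T Sigma x' + sigma2)
x_new = rng.normal(size=D)
print("predictive mean    :", mu @ x_new)
print("predictive variance:", x_new @ Sigma @ x_new + sigma2)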

Predictions with marginalised regression

Now, include the test input vector x′ and test output y′:

(Y, y′) | X, x′ ∼ N( 0, [[ τ²X^TX + σ²I ,  τ²X^Tx′ ], [ τ²x′^TX ,  τ²x′^Tx′ + σ² ]] )

We can find y′ | Y by the standard multivariate Gaussian result:

(a, b) ∼ N( 0, [[ A, C ], [ C^T, B ]] )  ⇒  b | a ∼ N( C^T A^{−1} a,  B − C^T A^{−1} C )

So

y′ | Y, X, x′ ∼ N( τ²x′^TX (τ²X^TX + σ²I)^{−1} Y^T,  τ²x′^Tx′ + σ² − τ²x′^TX (τ²X^TX + σ²I)^{−1} τ²X^Tx′ )

            ∼ N( (1/σ²) x′^T Σ X Y^T,  x′^T Σ x′ + σ² )    with Σ = (XX^T/σ² + I/τ²)^{−1}

• Same answer as obtained by integrating with respect to the posterior over w.
• The evidence P(Y | X) is just the probability under the joint Gaussian; it also reduces to the expression found previously.
• Thus, Bayesian linear regression can be derived from a joint, parameter-free distribution on all the outputs conditioned on all the inputs.

We can also introduce a nonlinear mapping x ↦ φ(x):

w ∼ N(0, τ²I),    y_i ∼ N(w^T φ(x_i), σ²)

Each element of φ(x) is a (nonlinear) feature extracted from x. There may be many more features than elements of x.

The regression function f(x) = w^T φ(x) is nonlinear, but the outputs Y are still jointly Gaussian:

Y | X ∼ N( 0_N, τ²Φ^TΦ + σ²I_N )

where the i-th column of the matrix Φ is φ(x_i). Proceeding as before, the predictive distribution over y′ for a test input x′ is:

y′ | x′, Y, X ∼ N( τ²φ(x′)^T Φ K^{−1} Y^T,  τ²φ(x′)^Tφ(x′) + σ² − τ⁴φ(x′)^T Φ K^{−1} Φ^T φ(x′) ),    K = τ²Φ^TΦ + σ²I
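
The "same answer" claim is easy to verify numerically. The sketch below (my own check, with assumed toy data) computes the predictive mean and variance for a test input both by the weight-space route (posterior over w) and by conditioning the joint Gaussian over outputs, and confirms that they agree.

# Hypothetical check that the marginalised (function-space) prediction matches the
# weight-space prediction obtained by integrating over the posterior on w.
import numpy as np

rng = np.random.default_rng(6)
D, N = 3, 50
tau2, sigma2 = 4.0, 0.25
X = rng.normal(size=(D, N))
Y = rng.normal(0, np.sqrt(tau2), size=D) @ X + rng.normal(0, np.sqrt(sigma2), size=N)
x_new = rng.normal(size=D)

# Weight-space route: posterior over w, then predict
Sigma = np.linalg.inv(X @ X.T / sigma2 + np.eye(D) / tau2)
mu = Sigma @ X @ Y / sigma2
mean_w, var_w = mu @ x_new, x_new @ Sigma @ x_new + sigma2

# Function-space route: condition the joint Gaussian of (Y, y') on Y
K = tau2 * X.T @ X + sigma2 * np.eye(N)        # Cov[Y | X]
k_new = tau2 * X.T @ x_new                     # Cov[Y, y']
mean_f = k_new @ np.linalg.solve(K, Y)
var_f = tau2 * x_new @ x_new + sigma2 - k_new @ np.linalg.solve(K, k_new)

print(np.allclose([mean_w, var_w], [mean_f, var_f]))   # True: the two routes agree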

The covariance kernel

Y | X ∼ N( 0_N, τ²Φ^TΦ + σ²I_N )

The covariance of the output vector Y plays a central role in the development of the theory of Gaussian processes.

Define the covariance kernel function K : X × X → R such that if x, x′ ∈ X are two input vectors with corresponding outputs y, y′, then

K(x, x′) = Cov[y, y′] = E[y y′] − E[y] E[y′]

In the nonlinear regression example we have K(x, x′) = τ²φ(x)^Tφ(x′) + σ²δ_{x=x′}.

The covariance kernel has two properties:
• Symmetric: K(x, x′) = K(x′, x) for all x, x′.
• Positive semidefinite: the matrix [K(x_i, x_j)] formed by any finite set of input vectors x_1, ..., x_N is positive semidefinite.

Theorem: a covariance kernel K : X × X → R is symmetric and positive semidefinite if and only if there is a feature map φ : X → H such that

K(x, x′) = φ(x)^Tφ(x′)

The feature space H can potentially be infinite dimensional.

Regression using the covariance kernel

For non-linear regression, all operations depended on K(x, x′) rather than explicitly on φ(x). So we can define the joint in terms of K, implicitly using a (potentially infinite-dimensional) feature map φ(x):

Y | X, K ∼ N( 0_N, K(X, X) )

where the (i, j) entry of the matrix K(X, X) is K(x_i, x_j). This is called the kernel trick.

Prediction: compute the predictive distribution of y′ conditioned on Y (mean first, then variance):

y′ | x′, X, Y, K ∼ N( K(x′, X) K(X, X)^{−1} Y^T,  K(x′, x′) − K(x′, X) K(X, X)^{−1} K(X, x′) )

Evidence: this is just the Gaussian likelihood:

P(Y | X, K) = |2πK(X, X)|^{−1/2} exp{ −(1/2) Y K(X, X)^{−1} Y^T }

Evidence optimisation: the covariance kernel K often has parameters, and these can be optimised by gradient ascent in log P(Y | X, K).
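
A minimal sketch of the recipe above, expressed entirely through a kernel function (here an assumed squared-exponential kernel, with the noise folded in as the σ²δ term): compute the predictive mean and variance at test inputs and the log evidence.

# Hypothetical sketch of regression using only a covariance kernel (the kernel trick).
import numpy as np

def K(x1, x2, theta2=1.0, eta=1.0):
    """Squared-exponential covariance between two sets of scalar inputs."""
    return theta2 * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / eta**2)

rng = np.random.default_rng(7)
N, sigma2 = 30, 0.05
x = np.sort(rng.uniform(0, 10, N))
y = np.sin(x) + rng.normal(0, np.sqrt(sigma2), N)

Kxx = K(x, x) + sigma2 * np.eye(N)            # K(X, X), including the sigma2 * delta term
x_test = np.linspace(0, 10, 5)
Kxs = K(x, x_test)                            # K(X, x')
Kss = K(x_test, x_test) + sigma2 * np.eye(5)  # K(x', x'), again including sigma2 * delta

# Predictive mean K(x',X) K(X,X)^-1 Y and covariance K(x',x') - K(x',X) K(X,X)^-1 K(X,x')
pred_mean = Kxs.T @ np.linalg.solve(Kxx, y)
pred_cov = Kss - Kxs.T @ np.linalg.solve(Kxx, Kxs)

# Log evidence: -1/2 log|2 pi K(X,X)| - 1/2 Y K(X,X)^-1 Y^T
sign, logdet = np.linalg.slogdet(2 * np.pi * Kxx)
log_evidence = -0.5 * logdet - 0.5 * y @ np.linalg.solve(Kxx, y)

print("predictive mean     :", np.round(pred_mean, 3))
print("predictive variance :", np.round(np.diag(pred_cov), 3))
print("log evidence        :", round(log_evidence, 2))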

The Gaussian process

A Gaussian process (GP) is a collection of random variables, any finite number of which have (consistent) Gaussian distributions.

In our regression setting, for each input vector x we have an output f(x). Given X = [x_1, ..., x_N], the joint distribution of the outputs F = [f(x_1), ..., f(x_N)] is:

F | X, K ∼ N( 0, K(X, X) )

Thus the random function f(x) (as a collection of random variables, one f(x) for each x) is a Gaussian process.

In general, a Gaussian process is parametrised by a mean function m(x) and a covariance kernel K(x, x′), and we write

f(·) ∼ GP( m(·), K(·, ·) )

Posterior Gaussian process: on observing X and F, the conditional joint distribution of F′ = [f(x′_1), ..., f(x′_M)] on another set of input vectors x′_1, ..., x′_M is still Gaussian:

F′ | X′, X, F, K ∼ N( K(X′, X) K(X, X)^{−1} F^T,  K(X′, X′) − K(X′, X) K(X, X)^{−1} K(X, X′) )

thus the posterior over functions f(·) | X, F is still a Gaussian process!

Regression with Gaussian processes

We seek to learn the function that maps inputs x_1, ..., x_N to outputs y_1, ..., y_N. We model it with a GP prior:

f(·) ∼ GP( 0, K(·, ·) ) .

Any function is possible (no restriction on support) but some are (much) more likely a priori.

Observations y_i are usually taken to be noisy versions of the latent f(x_i):

y_i | x_i, f(·) ∼ N( f(x_i), σ² )

Evidence: given by the multivariate Gaussian likelihood:

P(Y | X) = |2π(K(X, X) + σ²I)|^{−1/2} exp{ −(1/2) Y (K(X, X) + σ²I)^{−1} Y^T }

Posterior: also a GP:

f(·) | X, Y ∼ GP( K(·, X)(K(X, X) + σ²I)^{−1} Y^T,  K(·, ·) − K(·, X)(K(X, X) + σ²I)^{−1} K(X, ·) )

Predictions: posterior on f, plus observation noise:

y′ | X, Y, x′ ∼ N( E[f(x′) | X, Y],  Var[f(x′) | X, Y] + σ² )

Evidence optimisation: gradient ascent in log P(Y | X).
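
Putting the pieces together, the sketch below is a small GP regression example of my own (squared-exponential kernel, assumed toy data): the lengthscale is chosen by evidence optimisation, and predictions add the observation noise to the posterior variance of f.

# Hypothetical sketch of GP regression with observation noise, including evidence
# optimisation of the kernel lengthscale eta by maximising log P(Y | X).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(8)
N, sigma2 = 40, 0.1
x = rng.uniform(0, 10, N)
y = np.sin(x) + rng.normal(0, np.sqrt(sigma2), N)

def sqexp(x1, x2, eta):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / eta**2)

def neg_log_evidence(log_eta):
    C = sqexp(x, x, np.exp(log_eta)) + sigma2 * np.eye(N)     # K(X,X) + sigma2 I
    sign, logdet = np.linalg.slogdet(C)
    return 0.5 * (logdet + y @ np.linalg.solve(C, y) + N * np.log(2 * np.pi))

# Evidence optimisation (one hyperparameter here, so a scalar optimiser suffices)
eta = np.exp(minimize_scalar(neg_log_evidence, bounds=(-3, 3), method="bounded").x)

# Posterior mean and variance of f at test inputs; add sigma2 to predict y itself
xs = np.linspace(0, 10, 6)
C = sqexp(x, x, eta) + sigma2 * np.eye(N)
ks = sqexp(x, xs, eta)
f_mean = ks.T @ np.linalg.solve(C, y)
f_var = np.diag(sqexp(xs, xs, eta) - ks.T @ np.linalg.solve(C, ks))

print("optimised lengthscale:", round(eta, 3))
print("predictive mean      :", np.round(f_mean, 3))
print("predictive variance  :", np.round(f_var + sigma2, 3))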

Samples from a Gaussian process

We can draw sample functions from a GP by fixing a set of input vectors x_1, ..., x_N, and drawing a sample f(x_1), ..., f(x_N) from the corresponding multivariate Gaussian. This can then be plotted.

[Figure: example samples from a prior GP and from the corresponding posterior GP.]

Another approach is to
• sample f(x_1) first,
• then f(x_2) | f(x_1),
• and generally f(x_n) | f(x_1), ..., f(x_{n−1}) for n = 1, 2, ... (both schemes are sketched in code below).
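
A minimal sketch of both sampling schemes, under an assumed squared-exponential kernel: a joint draw via the Cholesky factor, and a sequential draw built up by Gaussian conditioning (a small jitter is added for numerical stability).

# Hypothetical sketch: two ways of drawing a sample function from a GP prior.
import numpy as np

rng = np.random.default_rng(9)
x = np.linspace(0, 10, 50)
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / 1.5**2)   # squared-exponential kernel
jitter = 1e-6 * np.eye(len(x))                             # numerical stabilisation

# Approach 1: one joint draw from N(0, K) via the Cholesky factor
L = np.linalg.cholesky(K + jitter)
f_joint = L @ rng.normal(size=len(x))

# Approach 2: sequential draws f(x_n) | f(x_1), ..., f(x_{n-1})
f_seq = np.zeros(len(x))
for n in range(len(x)):
    if n == 0:
        mean, var = 0.0, K[0, 0]
    else:
        sol = np.linalg.solve(K[:n, :n] + jitter[:n, :n], K[n, :n])
        mean = sol @ f_seq[:n]
        var = K[n, n] - sol @ K[n, :n]
    f_seq[n] = rng.normal(mean, np.sqrt(max(var, 1e-12)))

# f_joint and f_seq are two (different) samples from the same GP prior; plot against x.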

Sample from a 2D Gaussian process

[Figure: a sample surface drawn from a GP over a two-dimensional input space.]

Examples of covariance kernels

• Polynomial:

  K(x, x′) = (1 + x^Tx′)^m,    m = 1, 2, ...

  [Figure: sample functions (output y against input x) for m = 1, 2, 3.]

• Squared-exponential (or exponentiated-quadratic):

  K(x, x′) = θ² exp{ −‖x − x′‖² / (2η²) }

  [Figure: sample functions for η = 1, 2, 4.]

• Periodic (exp-sine):

  K(x, x′) = θ² exp{ −2 sin²(π(x − x′)/τ) / η² }

  [Figure: sample functions for τ = 1, η = 2; τ = 2, η = 1; τ = 3, η = 0.5.]

• Rational quadratic:

  K(x, x′) = ( 1 + ‖x − x′‖² / (2αη²) )^{−α},    α > 0

  [Figure: sample functions for α = 0.5, 1, 2 with η = 2.]

Forms of kernels

If K₁ and K₂ are covariance kernels, then so are:
• Rescaling: αK₁ for α > 0.
• Addition: K₁ + K₂.
• Elementwise product: K₁K₂.
• Mapping: K₁(φ(x), φ(x′)) for some function φ.

A covariance kernel is translation-invariant if

K(x, x′) = h(x − x′)

A GP with a translation-invariant covariance kernel is stationary: if f(·) ∼ GP(0, K), then so is f(· − x) ∼ GP(0, K) for each x.

A covariance kernel is radial (or radially symmetric) if

K(x, x′) = h(‖x − x′‖)

A GP with a radial covariance kernel is stationary with respect to translations, rotations, and reflections of the input space.
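
The kernels above and the closure rules are short to implement. The sketch below (my own illustration) defines each kernel for scalar inputs, builds a combined kernel via rescaling, addition and elementwise product, and checks positive semidefiniteness numerically on a small grid.

# Hypothetical implementations of the covariance kernels listed above, plus a kernel
# built from the closure rules (rescaling, addition, elementwise product).
import numpy as np

def polynomial(x1, x2, m=2):
    return (1.0 + np.outer(x1, x2)) ** m

def squared_exponential(x1, x2, theta=1.0, eta=1.0):
    return theta**2 * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / eta**2)

def periodic(x1, x2, theta=1.0, tau=2.0, eta=1.0):
    return theta**2 * np.exp(-2.0 * np.sin(np.pi * (x1[:, None] - x2[None, :]) / tau)**2 / eta**2)

def rational_quadratic(x1, x2, alpha=1.0, eta=1.0):
    return (1.0 + (x1[:, None] - x2[None, :])**2 / (2 * alpha * eta**2)) ** (-alpha)

def combined(x1, x2):
    # Rescaling, addition and elementwise product of valid kernels give a valid kernel.
    return 0.5 * squared_exponential(x1, x2) + periodic(x1, x2) * rational_quadratic(x1, x2)

x = np.linspace(0, 5, 6)
for k in (polynomial, squared_exponential, periodic, rational_quadratic, combined):
    min_eig = np.linalg.eigvalsh(k(x, x)).min()
    print(f"{k.__name__:21s} min eigenvalue = {min_eig:.2e}")   # all >= 0 up to rounding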

Nonparametric Bayesian Models and Occam’s Razor

Overparameterised models can overfit. In the GP, the parameter is the function f(x), which can be infinite-dimensional.

However, the Bayesian treatment integrates over these parameters: we never identify a single “best fit” f, just a posterior (and posterior mean). So f cannot be adjusted to overfit the data.

The GP is an example of the larger class of nonparametric Bayesian models.

• Infinite number of parameters.
• Often constructed as the infinite limit of a nested family of finite models (sometimes equivalent to infinite model averaging).
• Parameters integrated out, so the effective number of parameters available to overfit is zero or small (hyperparameters).
• No need for model selection: the Bayesian posterior on parameters will concentrate on the “submodel” with the largest integral automatically.
• No explicit need for Occam’s razor, validation or an added regularisation penalty.