Modern Bayesian Nonparametrics
Peter Orbanz (Cambridge University and Columbia University)
Yee Whye Teh (Gatsby Computational Neuroscience Unit, UCL)
NIPS 2011

OVERVIEW
1. Nonparametric Bayesian models
2. Regression
3. Clustering
4. Applications
   Coffee refill break
5. Asymptotics
6. Exchangeability
7. Latent feature models
8. Dirichlet process
9. Completely random measures
10. Summary

PARAMETERS AND PATTERNS
Parameters
  P(X | θ) = Probability[data | pattern]
Inference idea
  data = underlying pattern + independent noise
[Figure: GP regression example from Rasmussen & Williams, Gaussian Processes for Machine Learning, MIT Press 2006, Fig. 2.5. Panel (a): data generated from a GP with hyperparameters (ℓ, σ_f, σ_n) = (1, 1, 0.1), shown with the 95% confidence region for the underlying function f. Panels (b) and (c): the same data fit with length-scales ℓ = 0.3 and ℓ = 3; the length-scale is roughly the distance you have to move in input space before the function value changes significantly.]

TERMINOLOGY
Parametric model
- Number of parameters fixed (or constantly bounded) w.r.t. sample size
Nonparametric model
- Number of parameters grows with sample size
- ∞-dimensional parameter space
Example: Density estimation
- Parametric: samples from a two-dimensional Gaussian form a cloud centered on the mean µ, and the fitted density is summarized by finitely many parameters (Duda, Hart & Stork, Fig. 2.9).
- Nonparametric: Parzen-window density estimates built from the same five samples, using Gaussian window functions of widths h = 0.2, 0.5, 1; the estimate keeps one window per sample and so grows with the data (Duda, Hart & Stork, Figs. 4.3 and 4.4). A small code sketch of this contrast follows below.
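As a concrete illustration of the parametric/nonparametric contrast in the density-estimation example above, here is a minimal NumPy sketch (not part of the original slides; the data, grid, and bandwidth h are illustrative choices): a Gaussian fit is summarized by a fixed number of parameters, while a Parzen-window estimate retains one window per sample.

    import numpy as np

    def gaussian_fit(data):
        # Parametric: the whole estimate is two numbers (mean, variance),
        # regardless of how many samples we observe.
        return data.mean(), data.var()

    def parzen_density(x, data, h):
        # Nonparametric: one Gaussian window of width h per sample,
        # so the estimate itself grows with the sample size.
        diffs = (x[:, None] - data[None, :]) / h
        windows = np.exp(-0.5 * diffs**2) / (np.sqrt(2.0 * np.pi) * h)
        return windows.mean(axis=1)

    rng = np.random.default_rng(0)
    data = rng.normal(size=5)                  # five samples, as in Fig. 4.4
    grid = np.linspace(-4.0, 4.0, 200)

    mu, var = gaussian_fit(data)               # parametric summary
    p_hat = parzen_density(grid, data, h=0.5)  # nonparametric estimate on a grid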
NONPARAMETRIC BAYESIAN MODEL
Definition
  A nonparametric Bayesian model is a Bayesian model on an ∞-dimensional parameter space.
Interpretation
  Parameter space T = set of possible patterns, for example:
  Problem              T
  Density estimation   Probability distributions
  Regression           Smooth functions
  Clustering           Partitions
Solution to the Bayesian problem = posterior distribution on patterns  [Sch95]

REGRESSION

GAUSSIAN PROCESSES
Nonparametric regression
  Patterns = continuous functions, say on an interval [a, b]:
  θ : [a, b] → R,  so T = C[a, b]
Gaussian process prior
- Hyperparameters: a mean function m ∈ C[a, b] and a covariance function k : [a, b] × [a, b] → R
- Plug in a finite set s = {s_1, ..., s_n} ⊂ [a, b]:
    m(s) = (m(s_1), ..., m(s_n))  and  k(s, s) = [k(s_i, s_j)]_{i,j=1,...,n}
- The distribution of θ is a Gaussian process if
    (θ(s_1), ..., θ(s_n)) ∼ N(m(s), k(s, s))  for every finite s ⊂ [a, b]
[RW06]
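The finite-dimensional reading of the GP prior, i.e. that θ evaluated at any finite set of points is jointly Gaussian with mean m(s) and covariance k(s, s), can be sketched in a few lines of NumPy (not from the slides; the zero mean, squared-exponential covariance, and jitter term are illustrative assumptions):

    import numpy as np

    def se_cov(s, t, lengthscale=1.0, signal_var=1.0):
        # Squared-exponential covariance k evaluated on two sets of inputs.
        d = s[:, None] - t[None, :]
        return signal_var * np.exp(-0.5 * (d / lengthscale) ** 2)

    a, b, n = -5.0, 5.0, 100
    s = np.linspace(a, b, n)            # finite set {s_1, ..., s_n} in [a, b]

    m_s = np.zeros(n)                   # mean vector m(s), here m = 0
    K_ss = se_cov(s, s)                 # covariance matrix k(s, s)

    # (θ(s_1), ..., θ(s_n)) ~ N(m(s), k(s, s)); the jitter keeps K_ss
    # numerically positive definite.
    rng = np.random.default_rng(0)
    theta_s = rng.multivariate_normal(m_s, K_ss + 1e-10 * np.eye(n), size=3)

Each row of theta_s is one draw of the function θ evaluated on the grid, i.e. one random "pattern" from the prior.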
GAUSSIAN PROCESS REGRESSION
Observation model
- Inputs s = (s_1, ..., s_n)
- Outputs t = (t_1, ..., t_n)
- t_i ∼ N(θ(s_i), σ²_noise)
Posterior distribution
- The posterior is again a Gaussian process
- It quantifies prediction uncertainty
[Figure: Rasmussen & Williams (2006), Fig. 2.2, "Prior and Posterior". Panel (a): three functions drawn at random from a GP prior; panel (b): three functions drawn from the posterior, i.e. the prior conditioned on five noise-free observations. The shaded areas show the pointwise mean plus and minus two standard deviations (the 95% confidence region) for prior and posterior respectively.]
Predictions at test points
- Test inputs s* = (s*_1, ..., s*_m)
- Posterior mean and covariance of θ at the test points:
    m̂ = k(s*, s) [k(s, s) + σ²_noise I]⁻¹ t
    k̂ = k(s*, s*) − k(s*, s) [k(s, s) + σ²_noise I]⁻¹ k(s, s*)
- Predictive distribution for noisy outputs t* at the test points:
    p(t* | s*, s, t) = N(m̂, k̂ + σ²_noise I)
[RW06]
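The prediction formulas above are a few lines of linear algebra. The sketch below (not from the tutorial; the squared-exponential covariance, toy data, and noise level are illustrative assumptions) computes m̂ and k̂ for a set of test inputs:

    import numpy as np

    def se_cov(s, t, lengthscale=1.0, signal_var=1.0):
        # Squared-exponential covariance, as in the previous sketch.
        d = s[:, None] - t[None, :]
        return signal_var * np.exp(-0.5 * (d / lengthscale) ** 2)

    rng = np.random.default_rng(0)
    s = np.sort(rng.uniform(-5.0, 5.0, size=8))              # training inputs
    sigma_noise = 0.1
    t = np.sin(s) + sigma_noise * rng.normal(size=s.shape)   # noisy outputs
    s_star = np.linspace(-5.0, 5.0, 100)                     # test inputs

    K = se_cov(s, s) + sigma_noise**2 * np.eye(len(s))       # k(s, s) + σ²_noise I
    K_star = se_cov(s_star, s)                               # k(s*, s)

    # m̂ = k(s*, s) [k(s, s) + σ²_noise I]⁻¹ t
    m_hat = K_star @ np.linalg.solve(K, t)
    # k̂ = k(s*, s*) − k(s*, s) [k(s, s) + σ²_noise I]⁻¹ k(s, s*)
    k_hat = se_cov(s_star, s_star) - K_star @ np.linalg.solve(K, K_star.T)

The diagonal of k_hat gives the pointwise posterior variance, which is what produces the widening error bars away from the training points in the figures above.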