arXiv:1402.4306v2 [stat.ML] 19 Feb 2014
Student-t Processes as Alternatives to Gaussian Processes

Amar Shah (University of Cambridge), Andrew Gordon Wilson (University of Cambridge), Zoubin Ghahramani (University of Cambridge)

Abstract

We investigate the Student-t process as an alternative to the Gaussian process as a nonparametric prior over functions. We derive closed form expressions for the marginal likelihood and predictive distribution of a Student-t process, by integrating away an inverse Wishart process prior over the covariance kernel of a Gaussian process model. We show surprising equivalences between different hierarchical Gaussian process models leading to Student-t processes, and derive a new sampling scheme for the inverse Wishart process, which helps elucidate these equivalences. Overall, we show that a Student-t process can retain the attractive properties of a Gaussian process (a nonparametric representation, analytic marginal and predictive distributions, and easy model selection through covariance kernels) but has enhanced flexibility, and predictive covariances that, unlike a Gaussian process, explicitly depend on the values of training observations. We verify empirically that a Student-t process is especially useful in situations where there are changes in covariance structure, or in applications like Bayesian optimization, where accurate predictive covariances are critical for good performance. These advantages come at no additional computational cost over Gaussian processes.

Appearing in Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS) 2014, Reykjavik, Iceland. JMLR: W&CP volume 33. Copyright 2014 by the authors.

1 INTRODUCTION

Gaussian processes are rich distributions over functions, which provide a Bayesian nonparametric approach to regression. Owing to their interpretability, non-parametric flexibility, large support, consistency, simple exact learning and inference procedures, and impressive empirical performances [Rasmussen, 1996], Gaussian processes as kernel machines have steadily grown in popularity over the last decade.

At the heart of every Gaussian process (GP) is a parametrized covariance kernel, which determines the properties of likely functions under a GP. Typically simple parametric kernels, such as the Gaussian (squared exponential) kernel are used, and its parameters are determined through marginal likelihood maximization, having analytically integrated away the Gaussian process. However, a fully Bayesian nonparametric treatment of regression would place a nonparametric prior over the Gaussian process covariance kernel, to represent uncertainty over the kernel function, and to reflect the natural intuition that the kernel does not have a simple parametric form.

Likewise, given the success of Gaussian processes kernel machines, it is also natural to consider more general families of elliptical processes [Fang et al., 1989], such as Student-t processes, where any collection of function values has a desired elliptical distribution, with a covariance matrix constructed using a kernel.

As we will show, the Student-t process can be derived by placing an inverse Wishart process prior on the kernel of a Gaussian process. Given their intuitive value, it is not surprising that various forms of Student-t processes have been used in different applications [Yu et al., 2007, Zhang and Yeung, 2010, Xu et al., 2011, Archambeau and Bach, 2010]. However, the connections between these models, and the theoretical properties of these models, remain largely unknown. Similarly, the practical utility of such models remains uncertain. For example, Rasmussen and Williams [2006] wonder whether "the Student-t process is perhaps not as exciting as one might have hoped".

In short, our paper answers in detail many of the "what, when and why?" questions one might have about Student-t processes (TPs), inverse Wishart processes, and elliptical processes in general. Specifically:

• We precisely define and motivate the inverse Wishart process [Dawid, 1981] as a prior over covariance matrices of arbitrary size.

• We propose a Student-t process, which we derive from hierarchical Gaussian process models. We derive analytic forms for the marginal and predictive distributions of this process, and analytic derivatives of the marginal likelihood.

• We show that the Student-t process is the most general elliptically symmetric process with analytic marginal and predictive distributions.

• We derive a new way of sampling from the inverse Wishart process, which intuitively resolves the seemingly bizarre marginal equivalence between inverse Wishart and inverse Gamma priors for covariance kernels in hierarchical GP models.

• We show that the predictive covariances of a TP depend on the values of training observations, even though the predictive covariances of a GP do not.

• We show that, contrary to the Student-t process described in Rasmussen and Williams [2006], an analytic TP noise model can be used which separates signal and noise analytically.

• We demonstrate non-trivial differences in behaviour between the GP and TP on a variety of applications. We specifically find the TP more robust to change-points and model misspecification, to have notably improved predictive covariances, to have useful "tail-dependence" between distant function values (which is orthogonal to the choice of kernel), and to be particularly promising for Bayesian optimization, where predictive covariances are especially important.

We begin by introducing the inverse Wishart process in section 2. We then derive a Student-t process by using an inverse Wishart process over covariance kernels (section 3), and discuss the properties of this Student-t process in section 4. Finally, we demonstrate the Student-t process on regression and Bayesian optimization problems in section 5.

2 INVERSE WISHART PROCESS

In this section we argue that the inverse Wishart distribution is an attractive choice of prior for covariance matrices of arbitrary size. The Wishart distribution is a probability distribution over $\Pi(n)$, the set of real valued, $n \times n$, symmetric, positive definite matrices. Its density function is defined as follows.

Definition. A random $\Sigma \in \Pi(n)$ is Wishart distributed with parameters $\nu > n - 1$, $K \in \Pi(n)$, and we write $\Sigma \sim W_n(\nu, K)$ if its density is given by

$$p(\Sigma) = c_n(\nu, K)\, |\Sigma|^{(\nu - n - 1)/2} \exp\left(-\tfrac{1}{2} \mathrm{Tr}\left(K^{-1} \Sigma\right)\right), \qquad (1)$$

where $c_n(\nu, K) = \left(|K|^{\nu/2}\, 2^{\nu n/2}\, \Gamma_n(\nu/2)\right)^{-1}$.

The Wishart distribution defined with this parameterization is consistent under marginalization. If $\Sigma \sim W_n(\nu, K)$, then any $n_1 \times n_1$ principal submatrix $\Sigma_{11}$ is $W_{n_1}(\nu, K_{11})$ distributed. This property makes the Wishart distribution appear to be an attractive choice of prior over covariance matrices. Unfortunately the Wishart distribution suffers a flaw which makes it impractical for nonparametric Bayesian modelling.

Suppose we wish to model a covariance matrix using $\nu^{-1}\Sigma$, so that its expected value $E[\nu^{-1}\Sigma] = K$, and $\mathrm{var}[\nu^{-1}\Sigma_{ij}] = \nu^{-1}(K_{ij}^2 + K_{ii}K_{jj})$. Since we require $\nu > n - 1$, we must let $\nu \to \infty$ to define a process which has positive semidefinite Wishart distributed marginals of arbitrary size. However, as $\nu \to \infty$, $\nu^{-1}\Sigma$ tends to the constant matrix $K$ almost surely. Thus the requirement $\nu > n - 1$ prohibits defining a useful process which has Wishart marginals of arbitrary size. Nevertheless, the inverse Wishart distribution does not suffer this problem. Dawid [1981] parametrized the inverse Wishart distribution as follows:

Definition. A random $\Sigma \in \Pi(n)$ is inverse Wishart distributed with parameters $\nu \in \mathbb{R}_+$, $K \in \Pi(n)$ and we write $\Sigma \sim IW_n(\nu, K)$ if its density is given by

$$p(\Sigma) = c_n(\nu, K)\, |\Sigma|^{-(\nu + 2n)/2} \exp\left(-\tfrac{1}{2} \mathrm{Tr}\left(K \Sigma^{-1}\right)\right), \qquad (2)$$

with $c_n(\nu, K) = \dfrac{|K|^{(\nu + n - 1)/2}}{2^{(\nu + n - 1)n/2}\, \Gamma_n\left((\nu + n - 1)/2\right)}$.

If $\Sigma \sim IW_n(\nu, K)$, $\Sigma$ has mean and covariance only when $\nu > 2$ and $E[\Sigma] = (\nu - 2)^{-1} K$. Both the Wishart and the inverse Wishart distributions place prior mass on every $\Sigma \in \Pi(n)$. Furthermore $\Sigma \sim W_n(\nu, K)$ if and only if $\Sigma^{-1} \sim IW_n(\nu - n + 1, K^{-1})$.

Dawid [1981] shows that the inverse Wishart distribution defined as above is consistent under marginalization. If $\Sigma \sim IW_n(\nu, K)$, then any principal submatrix $\Sigma_{11}$ will be $IW_{n_1}(\nu, K_{11})$ distributed. Note the key difference in the parameterizations of both distributions: the parameter $\nu$ does not need to depend on the size of the matrix in the inverse Wishart distribution. These properties are desirable and motivate defining a process which has inverse Wishart marginals of arbitrary size. Let $\mathcal{X}$ be some input space and $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ a positive definite kernel function.

Definition. $\sigma$ is an inverse Wishart process on $\mathcal{X}$ with parameters $\nu \in \mathbb{R}_+$ and base kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ if

Definition.
$y \in \mathbb{R}^n$ is multivariate Student-t distributed with parameters $\nu \in \mathbb{R}_+ \setminus [0, 2]$, $\phi \in \mathbb{R}^n$ and $K \in \Pi(n)$ if it has density

$$p(y) = \frac{\Gamma\left(\frac{\nu + n}{2}\right)}{\left((\nu - 2)\pi\right)^{n/2}\, \Gamma\left(\frac{\nu}{2}\right)}\, |K|^{-1/2} \left(1 + \frac{(y - \phi)^\top K^{-1} (y - \phi)}{\nu - 2}\right)^{-\frac{\nu + n}{2}} \qquad (5)$$

We write $y \sim \mathrm{MVT}_n(\nu, \phi, K)$.

We easily compute the mean and covariance of

Figure 1: Five samples (blue solid) from $\mathcal{GP}(h, \kappa)$ (left) and $\mathcal{TP}(\nu, h, \kappa)$ (right), with $\nu = 5$, $h(x) = \cos(x)$ (red dashed) and $\kappa(x_i, x_j) = 0.01 \exp(-20 (x_i - x_j)^2)$. The grey shaded area represents a 95% predictive interval under each model.
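
The multivariate Student-t density in equation (5) is easiest to evaluate in log form. The following is a minimal sketch, not the authors' implementation: the function names `cholesky`, `forward_solve`, `mvt_logpdf` and `gaussian_logpdf` are hypothetical, and it assumes that $K$ is passed as a small symmetric positive definite list of lists. It uses only the Python standard library, so it is meant for illustration rather than speed.

```python
import math


def cholesky(K):
    """Return lower-triangular L with L L^T = K (K symmetric positive definite)."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("K must be positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L


def forward_solve(L, b):
    """Solve L z = b by forward substitution, for lower-triangular L."""
    z = []
    for i in range(len(b)):
        z.append((b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    return z


def quad_form_and_logdet(y, phi, K):
    """Return (y - phi)^T K^{-1} (y - phi) and log|K| via one Cholesky factor."""
    L = cholesky(K)
    z = forward_solve(L, [yi - fi for yi, fi in zip(y, phi)])
    beta = sum(zi * zi for zi in z)
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(len(y)))
    return beta, log_det


def mvt_logpdf(y, nu, phi, K):
    """Log of the density in equation (5), MVT_n(nu, phi, K), for nu > 2."""
    if nu <= 2.0:
        raise ValueError("equation (5) requires nu > 2")
    n = len(y)
    beta, log_det = quad_form_and_logdet(y, phi, K)
    return (math.lgamma(0.5 * (nu + n)) - math.lgamma(0.5 * nu)
            - 0.5 * n * math.log((nu - 2.0) * math.pi)
            - 0.5 * log_det
            - 0.5 * (nu + n) * math.log1p(beta / (nu - 2.0)))


def gaussian_logpdf(y, phi, K):
    """Log density of N(phi, K), included only as a comparison point."""
    n = len(y)
    beta, log_det = quad_form_and_logdet(y, phi, K)
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + beta)
```

As a usage example, `mvt_logpdf([0.3, -0.2], 5.0, [0.0, 0.1], [[1.0, 0.5], [0.5, 2.0]])` evaluates equation (5) at a two-dimensional point. For very large `nu` the result should approach `gaussian_logpdf` with the same `phi` and `K`, which gives a quick sanity check. The $(\nu - 2)$ scaling in equation (5) is what lets $K$ act as the covariance matrix rather than as a generic scale matrix, which is why the sketch rejects `nu <= 2`.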