
Student-t Processes as Alternatives to Gaussian Processes

Amar Shah, Andrew Gordon Wilson, Zoubin Ghahramani
University of Cambridge

Abstract

We investigate the Student-t process as an alternative to the Gaussian process as a nonparametric prior over functions. We derive closed form expressions for the marginal likelihood and predictive distribution of a Student-t process, by integrating away an inverse Wishart process prior over the covariance kernel of a Gaussian process model. We show surprising equivalences between different hierarchical Gaussian process models leading to Student-t processes, and derive a new sampling scheme for the inverse Wishart process, which helps elucidate these equivalences. Overall, we show that a Student-t process can retain the attractive properties of a Gaussian process – a nonparametric representation, analytic marginal and predictive distributions, and easy model selection through covariance kernels – but has enhanced flexibility, and predictive covariances that, unlike a Gaussian process, explicitly depend on the values of training observations. We verify empirically that a Student-t process is especially useful in situations where there are changes in covariance structure, or in applications such as Bayesian optimization, where accurate predictive covariances are critical for good performance. These advantages come at no additional computational cost over Gaussian processes.

Appearing in Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS) 2014, Reykjavik, Iceland. JMLR: W&CP volume 33. Copyright 2014 by the authors.

1 INTRODUCTION

Gaussian processes are rich distributions over functions, which provide a Bayesian nonparametric approach to regression. Owing to their interpretability, non-parametric flexibility, large support, consistency, simple exact learning and inference procedures, and impressive empirical performances [Rasmussen, 1996], Gaussian processes as kernel machines have steadily grown in popularity over the last decade.

At the heart of every Gaussian process (GP) is a parametrized covariance kernel, which determines the properties of likely functions under a GP. Typically simple parametric kernels, such as the Gaussian (squared exponential) kernel, are used, and their parameters are determined through marginal likelihood maximization, having analytically integrated away the Gaussian process. However, a fully Bayesian nonparametric treatment of regression would place a nonparametric prior over the Gaussian process covariance kernel, to represent uncertainty over the kernel function, and to reflect the natural intuition that the kernel does not have a simple parametric form.

Likewise, given the success of Gaussian process kernel machines, it is also natural to consider more general families of elliptical processes [Fang et al., 1989], such as Student-t processes, where any collection of function values has a desired elliptical distribution, with a covariance matrix constructed using a kernel.

As we will show, the Student-t process can be derived by placing an inverse Wishart process prior on the kernel of a Gaussian process. Given their intuitive value, it is not surprising that various forms of Student-t processes have been used in different applications [Yu et al., 2007, Zhang and Yeung, 2010, Xu et al., 2011, Archambeau and Bach, 2010]. However, the connections between these models, and the theoretical properties of these models, remain largely unknown. Similarly, the practical utility of such models remains uncertain. For example, Rasmussen and Williams [2006] wonder whether "the Student-t process is perhaps not as exciting as one might have hoped".

In short, our paper answers in detail many of the "what, when and why?" questions one might have about Student-t processes (TPs), inverse Wishart processes, and elliptical processes in general. Specifically:

• We precisely define and motivate the inverse Wishart process [Dawid, 1981] as a prior over covariance matrices of arbitrary size.


• We propose a Student-t process, which we derive from hierarchical Gaussian process models. We derive analytic forms for the marginal and predictive distributions of this process, and analytic derivatives of the marginal likelihood.

• We show that the Student-t process is the most general elliptically symmetric process with analytic marginal and predictive distributions.

• We derive a new way of sampling from the inverse Wishart process, which intuitively resolves the seemingly bizarre marginal equivalence between inverse Wishart and inverse Gamma priors for covariance kernels in hierarchical GP models.

• We show that the predictive covariances of a TP depend on the values of training observations, even though the predictive covariances of a GP do not.

• We show that, contrary to the Student-t process described in Rasmussen and Williams [2006], an analytic TP noise model can be used which separates signal and noise analytically.

• We demonstrate non-trivial differences in behaviour between the GP and TP on a variety of applications. We specifically find the TP more robust to change-points and model misspecification, to have notably improved predictive covariances, to have useful "tail-dependence" between distant function values (which is orthogonal to the choice of kernel), and to be particularly promising for Bayesian optimization, where predictive covariances are especially important.

We begin by introducing the inverse Wishart process in section 2. We then derive a Student-t process by using an inverse Wishart process over covariance kernels (section 3), and discuss the properties of this Student-t process in section 4. Finally, we demonstrate the Student-t process on regression and Bayesian optimization problems in section 5.

2 INVERSE WISHART PROCESS

In this section we argue that the inverse Wishart distribution is an attractive choice of prior for covariance matrices of arbitrary size. The Wishart distribution is a distribution over Π(n), the set of real valued, n × n, symmetric, positive definite matrices. Its density function is defined as follows.

Definition. A random Σ ∈ Π(n) is Wishart distributed with parameters ν > n − 1 and K ∈ Π(n), and we write Σ ∼ W_n(ν, K), if its density is given by

    p(Σ) = c_n(ν, K) |Σ|^{(ν−n−1)/2} exp(−(1/2) Tr(K^{-1} Σ)),    (1)

where c_n(ν, K) = (|K|^{ν/2} 2^{νn/2} Γ_n(ν/2))^{-1}.

The Wishart distribution defined with this parameterization is consistent under marginalization: if Σ ∼ W_n(ν, K), then any n_1 × n_1 principal submatrix Σ_11 is W_{n_1}(ν, K_11) distributed. This property makes the Wishart distribution appear to be an attractive choice of prior over covariance matrices. Unfortunately, the Wishart distribution suffers a flaw which makes it impractical for nonparametric Bayesian modelling.

Suppose we wish to model a covariance matrix using ν^{-1}Σ, so that its expected value is E[ν^{-1}Σ] = K and var[ν^{-1}Σ_ij] = ν^{-1}(K_ij² + K_ii K_jj). Since we require ν > n − 1, we must let ν → ∞ to define a process which has positive semidefinite Wishart distributed marginals of arbitrary size. However, as ν → ∞, ν^{-1}Σ tends to the constant matrix K almost surely. Thus the requirement ν > n − 1 prohibits defining a useful process which has Wishart marginals of arbitrary size.

The inverse Wishart distribution does not suffer from this problem. Dawid [1981] parametrized the inverse Wishart distribution as follows:

Definition. A random Σ ∈ Π(n) is inverse Wishart distributed with parameters ν ∈ R_+ and K ∈ Π(n), and we write Σ ∼ IW_n(ν, K), if its density is given by

    p(Σ) = c_n(ν, K) |Σ|^{-(ν+2n)/2} exp(−(1/2) Tr(K Σ^{-1})),    (2)

with c_n(ν, K) = |K|^{(ν+n−1)/2} / (2^{(ν+n−1)n/2} Γ_n((ν+n−1)/2)).

If Σ ∼ IW_n(ν, K), then Σ has a finite mean only when ν > 2, with E[Σ] = (ν − 2)^{-1} K. Both the Wishart and the inverse Wishart distributions place prior mass on every Σ ∈ Π(n). Furthermore, Σ ∼ W_n(ν, K) if and only if Σ^{-1} ∼ IW_n(ν − n + 1, K^{-1}).

Dawid [1981] shows that the inverse Wishart distribution defined as above is consistent under marginalization: if Σ ∼ IW_n(ν, K), then any n_1 × n_1 principal submatrix Σ_11 will be IW_{n_1}(ν, K_11) distributed. Note the key difference between the two parameterizations: the parameter ν does not need to depend on the size of the matrix in the inverse Wishart distribution. These properties are desirable and motivate defining a process which has inverse Wishart marginals of arbitrary size.


Let X be some input space and k : X × X → R a positive definite kernel function.

Definition. σ is an inverse Wishart process on X with parameters ν ∈ R_+ and base kernel k : X × X → R if, for any finite collection x_1, ..., x_n ∈ X, σ(x_1, ..., x_n) ∼ IW_n(ν, K), where K ∈ Π(n) with K_ij = k(x_i, x_j). We write σ ∼ IWP(ν, k).

In the next section we use the inverse Wishart process as a nonparametric prior over kernels in a hierarchical Gaussian process model.

3 DERIVING THE STUDENT-t PROCESS

Gaussian processes (GPs) are popular nonparametric Bayesian distributions over functions. A thorough guide to GPs has been provided by Rasmussen and Williams [2006]. GPs are characterized by a mean function and a kernel function. Practitioners tend to use parametric kernel functions and learn their hyperparameters using maximum likelihood or sampling based methods. We propose placing an inverse Wishart process prior on the kernel function, leading to a Student-t process.

For a base kernel k_θ parameterized by θ, and a continuous mean function φ : X → R, our generative approach is as follows:

    σ ∼ IWP(ν, k_θ),
    y | σ ∼ GP(φ, (ν − 2)σ).    (3)

Since the inverse Wishart distribution is a conjugate prior for the covariance matrix of a Gaussian likelihood, we can analytically marginalize σ in the generative model of (3). For any collection of inputs x_1, ..., x_n, let y = (y_1, ..., y_n)^T, φ = (φ(x_1), ..., φ(x_n))^T and Σ = σ(x_1, ..., x_n); then

    p(y | ν, K) = ∫ p(y | Σ) p(Σ | ν, K) dΣ
                ∝ ∫ |Σ|^{-(ν+2n+1)/2} exp(−(1/2) Tr[(K + (y − φ)(y − φ)^T/(ν − 2)) Σ^{-1}]) dΣ
                ∝ (1 + (y − φ)^T K^{-1} (y − φ)/(ν − 2))^{-(ν+n)/2}.    (4)

We recognize (4) as proportional to a multivariate Student-t density.

Definition. y ∈ R^n is multivariate Student-t distributed with parameters ν ∈ R_+ \ [0, 2], φ ∈ R^n and K ∈ Π(n) if it has density

    p(y) = Γ((ν + n)/2) / (((ν − 2)π)^{n/2} Γ(ν/2)) |K|^{-1/2} (1 + (y − φ)^T K^{-1} (y − φ)/(ν − 2))^{-(ν+n)/2}.    (5)

We write y ∼ MVT_n(ν, φ, K).

We can easily compute the mean and covariance of y under the MVT using the generative derivation: E[y] = E[E[y | Σ]] = φ and cov[y] = E[E[(y − φ)(y − φ)^T | Σ]] = E[(ν − 2)Σ] = K. We prove the following lemma in the supplementary material [Shah et al., 2014].

Lemma 1. The multivariate Student-t distribution is consistent under marginalization.

We define a Student-t process as follows.

Definition. f is a Student-t process on X with parameters ν > 2, mean function Φ : X → R, and kernel function k : X × X → R if any finite collection of function values has a joint multivariate Student-t distribution, i.e. (f(x_1), ..., f(x_n))^T ∼ MVT_n(ν, φ, K), where K ∈ Π(n) with K_ij = k(x_i, x_j) and φ ∈ R^n with φ_i = Φ(x_i). We write f ∼ TP(ν, Φ, k).
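The density in (5) is exactly the TP marginal likelihood for a finite set of inputs. The following self-contained sketch (our illustration, not the authors' code; the squared exponential kernel and noise level are placeholder choices) evaluates its logarithm:

```python
# Sketch of the multivariate Student-t log density in equation (5),
# i.e. the TP log marginal likelihood for a finite set of inputs.
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.special import gammaln

def mvt_logpdf(y, nu, phi, K):
    """log MVT_n(nu, phi, K) with nu > 2; K is the covariance matrix, not a scale."""
    n = y.shape[0]
    L, lower = cho_factor(K, lower=True)
    resid = y - phi
    beta = resid @ cho_solve((L, lower), resid)     # (y - phi)^T K^{-1} (y - phi)
    logdetK = 2.0 * np.sum(np.log(np.diag(L)))
    return (gammaln((nu + n) / 2.0) - gammaln(nu / 2.0)
            - 0.5 * n * np.log((nu - 2.0) * np.pi) - 0.5 * logdetK
            - 0.5 * (nu + n) * np.log1p(beta / (nu - 2.0)))

# Example: marginal likelihood of noisy observations under a squared exponential kernel.
def sq_exp_kernel(X, lengthscale=1.0, signal_var=1.0, noise_var=1e-2):
    d2 = np.square(X[:, None] - X[None, :])
    return signal_var * np.exp(-0.5 * d2 / lengthscale**2) + noise_var * np.eye(len(X))

X = np.linspace(0.0, 3.0, 25)
y = np.cos(X) + 0.1 * np.random.default_rng(1).standard_normal(25)
print(mvt_logpdf(y, nu=5.0, phi=np.cos(X), K=sq_exp_kernel(X)))
```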
4 TP PROPERTIES & RELATION TO OTHER PROCESSES

In this section we discuss the conditional distribution of the TP, the relationship between GPs and TPs, another covariance prior which leads to the same TP, elliptical processes, and a sampling scheme for the IWP which gives insight into this equivalence. Finally, we consider modelling noisy functions with a TP.

4.1 Relation to the Gaussian Process

The Student-t process generalizes the Gaussian process. A GP can be seen as a limiting case of a TP, as shown in Lemma 2.

Lemma 2. Suppose f ∼ TP(ν, Φ, k) and g ∼ GP(Φ, k). Then f tends to g in distribution as ν → ∞.

Proof sketch. It is sufficient to show convergence in density for any finite collection of inputs. Let y ∼ MVT_n(ν, φ, K) and set β = (y − φ)^T K^{-1} (y − φ); then p(y) ∝ (1 + β/(ν − 2))^{-(ν+n)/2} → e^{-β/2} as ν → ∞. Hence the distribution of y tends to a N_n(φ, K) distribution as ν → ∞. A full proof is given in the supplementary material.

The ν parameter controls how heavy tailed the process is. Smaller values of ν correspond to heavier tails; as ν gets larger, the tails converge to Gaussian tails. This is illustrated in the prior sample draws shown in Figure 1. Notice that the samples from the TP tend to have more extreme behaviour than the samples from the GP, leaving the 95% predictive interval more often – a clear indicator of the heavier tails.

Figure 1: Five samples (blue solid) from GP(h, κ) (left) and TP(ν, h, κ) (right), with ν = 5, h(x) = cos(x) (red dashed) and κ(x_i, x_j) = 0.01 exp(−20(x_i − x_j)²). The grey shaded area represents a 95% predictive interval under each model.

ν also controls the nature of the dependence between variables which are jointly Student-t distributed, not just their marginal distributions. In Figure 2 we show plots of samples which all have Gaussian marginals but different joint distributions. Notice how the tail dependency of these distributions is controlled by ν. For example, the dependencies between y(x_p) and y(x_q) are different depending on whether y is a TP or a GP, even if the TP and GP have the same kernel.

Figure 2: Uncorrelated bivariate samples from a Student-t copula with ν = 3 (left), a Student-t copula with ν = 10 (centre) and a Gaussian copula (right). All marginal distributions are N(0, 1) distributed.

4.2 Conditional Distribution

We now give the conditional distribution of a multivariate Student-t distribution, which has an analytic form, stated in Lemma 3 and proven in the supplementary material.

Lemma 3. Suppose y ∼ MVT_n(ν, φ, K) and let y_1 and y_2 represent the first n_1 and remaining n_2 entries of y respectively. Then

    y_2 | y_1 ∼ MVT_{n_2}(ν + n_1, φ̃_2, ((ν + β_1 − 2)/(ν + n_1 − 2)) K̃_22),    (6)

where φ̃_2 = K_21 K_11^{-1} (y_1 − φ_1) + φ_2, β_1 = (y_1 − φ_1)^T K_11^{-1} (y_1 − φ_1) and K̃_22 = K_22 − K_21 K_11^{-1} K_12. Note that E[y_2 | y_1] = φ̃_2 and cov[y_2 | y_1] = ((ν + β_1 − 2)/(ν + n_1 − 2)) K̃_22.

To see why (6) holds, note that β_1 + β_2 = (y − φ)^T K^{-1} (y − φ), where β_2 = (y_2 − φ̃_2)^T K̃_22^{-1} (y_2 − φ̃_2), so that

    p(y_2 | y_1) = p(y_1, y_2)/p(y_1) ∝ (1 + (β_1 + β_2)/(ν − 2))^{-(ν+n)/2} (1 + β_1/(ν − 2))^{(ν+n_1)/2} ∝ (1 + β_2/(β_1 + ν − 2))^{-(ν+n)/2}.

Comparing this density to the one in equation (5) yields the parameters in (6).

As ν tends to infinity, this predictive distribution tends to a Gaussian process predictive distribution, as we would expect given Lemma 2. Perhaps less intuitively, this predictive distribution also tends to a Gaussian process predictive as n_1 tends to infinity.

The predictive mean has the same form as for a Gaussian process, conditioned on having the same kernel k with the same hyperparameters. The key difference is in the predictive covariance, which now explicitly depends on the training observations. Indeed, a somewhat disappointing feature of the Gaussian process is that, for a given kernel, the predictive covariance of new samples does not depend on the training observations. Importantly, since the marginal likelihood of the TP in (5) differs from the marginal likelihood of the GP, both the predictive mean and the predictive covariance of a TP will differ from those of a GP after learning kernel hyperparameters.

The scaling constant of the multivariate Student-t predictive covariance has an intuitive explanation. Note that β_1 is distributed as the sum of squares of n_1 independent MVT_1(ν, 0, 1) random variables, and hence E[β_1] = n_1. If the observed value of β_1 is larger than n_1, the predictive covariance is scaled up, and vice versa. The magnitude of the scaling is controlled by ν. The fact that the predictive covariance of the multivariate Student-t depends on the data is one of the key benefits of this distribution over a Gaussian.

4.3 Another Covariance Prior

Despite the apparent flexibility of the inverse Wishart distribution, we illustrate in Lemma 4 the surprising result that a multivariate Student-t distribution can be derived using a much simpler covariance prior, which has been considered previously [Yu et al., 2007]. The proof can be found in the supplementary material.

Lemma 4. Let K ∈ Π(n), φ ∈ R^n, ν > 2, ρ > 0 and

    r ∼ Γ^{-1}(ν/2, ρ/2),
    y | r ∼ N_n(φ, r(ν − 2)K/ρ);    (7)

then marginally y ∼ MVT_n(ν, φ, K).

From (7), r^{-1} | y ∼ Γ((ν + n)/2, (ρ/2)(1 + β/(ν − 2))), where β = (y − φ)^T K^{-1}(y − φ), and hence E[(ν − 2)r/ρ | y] = (ν + β − 2)/(ν + n − 2). This is exactly the factor by which K̃_22 is scaled in the MVT conditional distribution in (6).

This result is surprising because we previously integrated over an infinite dimensional nonparametric object (the IWP) to derive the Student-t process, yet here we show that we can integrate over a single scale parameter (inverse Gamma) to arrive at the same marginal process. We provide some insight into why these distinct priors lead to the same marginal multivariate Student-t distribution in section 4.5.

4.4 Elliptical Processes

We now show that both Gaussian and Student-t processes are elliptically symmetric, and that the Student-t process is the more general elliptical process.

Definition. y ∈ R^n is elliptically symmetric if and only if there exist μ ∈ R^n, a nonnegative random variable R, an n × d matrix Ω with maximal rank d, and u uniformly distributed on the unit sphere in R^d, independent of R, such that y =_D μ + RΩu, where =_D denotes equality in distribution.

An overview of elliptically symmetric distributions and the following lemma can be found in Fang et al. [1989].

Lemma 5. Suppose R_1 ∼ χ²(n) and R_2 ∼ Γ^{-1}(ν/2, 1/2) independently. If R = √R_1, then y is Gaussian distributed. If R = √((ν − 2)R_1 R_2), then y is MVT distributed.
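The representation in Lemma 5 is easy to check by simulation. The sketch below (ours, not from the paper) draws y = √((ν − 2)R_1 R_2) Ωu with μ = 0 and ΩΩ^T = K, and confirms that the sample covariance is close to K, as required by the parametrization in (5):

```python
# Monte Carlo sketch of Lemma 5: with R1 ~ chi^2(n), R2 ~ Inv-Gamma(nu/2, 1/2),
# u uniform on the unit sphere and Omega Omega^T = K, the draw
# y = sqrt((nu - 2) R1 R2) Omega u has covariance K under this MVT parametrization.
import numpy as np

rng = np.random.default_rng(0)
n, nu, m = 3, 7.0, 200000
A = rng.standard_normal((n, n))
K = A @ A.T + np.eye(n)
Omega = np.linalg.cholesky(K)

u = rng.standard_normal((m, n))
u /= np.linalg.norm(u, axis=1, keepdims=True)            # uniform on the unit sphere
R1 = rng.chisquare(n, size=m)
R2 = 1.0 / rng.gamma(shape=nu / 2.0, scale=2.0, size=m)  # Inv-Gamma(nu/2, 1/2)
y = np.sqrt((nu - 2.0) * R1 * R2)[:, None] * (u @ Omega.T)

print(np.round(np.cov(y.T), 2))   # should be close to K
print(np.round(K, 2))
```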


Elliptically symmetric distributions characterize a large class of distributions which are unimodal and where the likelihood of a point decreases with its distance from this mode. These properties are natural assumptions we often want to encode in our prior distribution, making elliptical distributions ideal for multivariate modelling tasks. The idea naturally extends to infinite dimensional objects.

Definition. Let Y = {y_i} be a countable family of random variables. It is an elliptical process if any finite subset of them is jointly elliptically symmetric.

Not all elliptical distributions have densities (e.g. Lévy alpha-stable distributions). Even fewer elliptical processes have densities, and the set of those that do is characterized in Theorem 6, due to Kelker [1970].

Theorem 6. Suppose Y = {y_i} is an elliptical process. Any finite collection z = {z_1, ..., z_n} ⊂ Y has a density if and only if there exists a non-negative random variable r such that z | r ∼ N_n(μ, rΩΩ^T).

A simple corollary of this theorem describes the only two cases where an elliptical process has an analytically representable density: the Gaussian process and the Student-t process. Given that the TP has analytic marginal and conditional distributions, the same computational costs as a Gaussian process and increased flexibility, the Student-t process can be used as a drop-in replacement for a Gaussian process in many applications.

4.5 A New Way to Sample the IWP

We show that the density of an inverse Wishart distribution depends only on the eigenvalues of a positive definite matrix. To the best of our knowledge this change of variables has not been computed previously. The decomposition offers a novel way of sampling from an inverse Wishart distribution, and insight into why the Student-t process can be derived using either an inverse Gamma or an inverse Wishart process covariance prior.

Let Ξ(n) be the set of all n × n orthogonal matrices. A matrix is orthogonal if it is square, real valued and its rows and columns are orthogonal unit vectors. Orthogonal matrices are compositions of rotations and reflections, which are volume preserving operations. Symmetric positive definite (SPD) matrices can be represented through a diagonal and an orthogonal matrix:

Theorem 8. Let Σ ∈ Π(n), the set of SPD n × n matrices, and suppose {λ_1, ..., λ_n} are the eigenvalues of Σ. Then there exists Q ∈ Ξ(n) such that Σ = QΛQ^T, where Λ = diag(λ_1, ..., λ_n).

Now suppose Σ ∼ IW_n(ν, I). We compute the density of an IW using the representation in Theorem 8, being careful to include the Jacobian of the change of variable, J(Σ; Q, Λ), given in Edelman and Rao [2005]. From (2), and using the facts that Q^T Q = I and |AB| = |BA|,

    p(Σ) dΣ = p(QΛQ^T) |J(Σ; Q, Λ)| dΛ dQ
            ∝ |Λ|^{-(ν+2n)/2} exp(−(1/2) Tr(Λ^{-1})) ∏_{1≤i<j≤n} |λ_i − λ_j| dΛ dQ,    (8)

so that Q and Λ are independent, with Q uniformly distributed over Ξ(n) and the eigenvalues λ_1, ..., λ_n exchangeable.

This result provides a geometric interpretation of what a sample from IW_n(ν, I) looks like. We first pick, uniformly at random, an orthogonal set of basis vectors in R^n, and then stretch these basis vectors using an exchangeable set of scalar random variables. An analogous interpretation holds for the Wishart distribution.

Recall from Lemma 5 that if u is uniformly distributed on the unit sphere in R^n and R ∼ χ²(n) independently, then √R u ∼ N_n(0, I). By (4) and Lemma 5, if we sample Q and Λ from the generative process above, then √((ν − 2)R) QΛ^{1/2} u is marginally a draw from MVT_n(ν, 0, I). Since the diagonal elements of Λ are exchangeable, Q is orthogonal and sampled uniformly over Ξ(n), and u is spherically symmetric, we must have that QΛ^{1/2} u =_D √(R_0) u for some positive scalar random variable R_0, by symmetry. By Lemma 5 we know R_0 ∼ Γ^{-1}(ν/2, 1/2). In summary, the action of QΛ^{1/2} on u is equivalent in distribution to a rescaling by an inverse Gamma variate.
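A quick Monte Carlo check of this equivalence (our sketch; it assumes SciPy's inverse Wishart with df = ν + n − 1 and identity scale corresponds to IW_n(ν, I) above): for Σ ∼ IW_n(ν, I) and u uniform on the unit sphere, the squared length of QΛ^{1/2}u equals u^T Λ u, which has the same distribution as u^T Σ u; since R_0 ∼ Γ^{-1}(ν/2, 1/2), the quantity 1/(u^T Σ u) should then follow a χ²_ν distribution.

```python
# Sketch: empirical quantiles of 1/(u^T Sigma u) against chi^2_nu quantiles,
# illustrating the inverse-Gamma rescaling equivalence described above.
import numpy as np
from scipy.stats import invwishart, chi2

rng = np.random.default_rng(0)
n, nu, m = 5, 6.0, 50000

sigmas = invwishart.rvs(df=nu + n - 1, scale=np.eye(n), size=m, random_state=rng)
u = rng.standard_normal((m, n))
u /= np.linalg.norm(u, axis=1, keepdims=True)

q = np.einsum('mi,mij,mj->m', u, sigmas, u)      # u^T Sigma u for each draw
probs = np.linspace(0.05, 0.95, 10)
print(np.round(np.quantile(1.0 / q, probs), 2))  # empirical quantiles
print(np.round(chi2.ppf(probs, df=nu), 2))       # should roughly agree
```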


Figure 3: Scatter plots of points drawn from various 2-dim processes. Here ν = 2.1 and K_ij = 0.8δ_ij + 0.2. Top-left: MVT_2(ν, 0, K) + MVT_2(ν, 0, 0.5I). Top-right: MVT_2(ν, 0, K + 0.5I) (our model). Bottom-left: MVT_2(ν, 0, K) + N_2(0, 0.5I). Bottom-right: N_2(0, K + 0.5I).

Figure 4: Posterior distributions of one sample from Synthetic Data B under a GP prior (left) and a TP prior (right). The solid line is the posterior mean, the shaded area represents a 95% predictive interval, circles are training points and crosses are test points.

4.6 Modelling Noisy Functions

It is common practice to assume that outputs are the sum of a latent Gaussian process and independent Gaussian noise. Such a model is analytically tractable, since Gaussian distributions are closed under addition. Unfortunately, the Student-t distribution is not closed under addition.

This problem was encountered by Rasmussen and Williams [2006], who went on to dismiss the multivariate Student-t process for practical purposes. Our approach is to incorporate the noise into the kernel function, for example, letting k = k_θ + δ, where k_θ is a parametrized kernel and δ is a diagonal kernel function. Such a model is not equivalent to adding independent noise, since the scaling parameter ν will have an effect on the squared exponential kernel as well as the noise kernel. Zhang and Yeung [2010] propose a similar method for handling noise; however, they incorrectly assume that the latent function and noise are independent under this model. The noise will be uncorrelated with the latent function, but not independent.

As ν → ∞ this model tends to a GP with independent Gaussian noise. In Figure 3, we consider bivariate samples from a TP when ν is small and the signal to noise ratio is small. Here we see that the TP with noise incorporated into its kernel behaves similarly to a TP with independent Student-t noise.

There have been several attempts to make GP regression robust to heavy tailed noise that rely on approximate inference [Neal, 1997, Vanhatalo et al., 2009]. It is hence attractive that our proposed method can model heavy tailed noise whilst retaining an analytic inference scheme. This is a novel finding to the best of our knowledge.
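As a small illustration of this construction (a hypothetical helper, not code from the paper), the sketch below builds the combined kernel k = k_θ + δ with a squared exponential k_θ, adding the noise variance only on the diagonal of the training covariance:

```python
# Sketch of the noise model above: the noise variance is folded into the kernel,
# k = k_theta + sigma_n^2 * delta, so nu rescales signal and noise together.
import numpy as np

def sq_exp(X1, X2, lengthscale, signal_var):
    d2 = np.square(X1[:, None] - X2[None, :])
    return signal_var * np.exp(-0.5 * d2 / lengthscale**2)

def noisy_kernel(X1, X2, lengthscale=1.0, signal_var=1.0, noise_var=0.1):
    K = sq_exp(X1, X2, lengthscale, signal_var)
    if X1 is X2:                        # delta kernel: contributes only on the diagonal
        K = K + noise_var * np.eye(len(X1))
    return K

# The TP marginal likelihood (equation 5) is then evaluated with this combined kernel;
# as nu -> infinity the model recovers a GP with independent Gaussian noise.
X = np.linspace(0.0, 1.0, 5)
print(noisy_kernel(X, X))
```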

5 APPLICATIONS

In this section we compare TPs to GPs for regression and Bayesian optimization.

5.1 Regression

Consider a set of observations {x_i, y_i}_{i=1}^n for x_i ∈ X and y_i ∈ R. Analogous to Gaussian process regression, we assume the following model:

    f ∼ TP(ν, Φ, k_θ),
    y_i = f(x_i) for i = 1, ..., n.    (9)

In this work we consider parametric kernel functions. A key task when using such kernels is learning the parameters of the chosen kernel, which are called the hyperparameters of the model. We include derivatives of the marginal log likelihood of the TP with respect to the hyperparameters in the supplementary material.

5.1.1 Experiments

We test the Student-t process as a regression model on a range of datasets. We sample hyperparameters using Hamiltonian Monte Carlo [Neal, 2011] and use a kernel function which is a sum of a squared exponential and a delta kernel function (k = k_SE + δ). The results for all of these experiments are summarized in Table 1.

Synthetic Data A. We sample 100 functions from a GP prior with Gaussian noise and fit both GPs and TPs to the data with the goal of predicting test points. For each function we train on 80 data points and test on 20. The TP, which generalizes the GP, has superior predictive uncertainty in this example.


Table 1: Predictive mean squared errors (MSE) and log likelihoods (LL) for the regression experiments. The TP attains the highest LL on every dataset and the lowest MSE on most.

                Gaussian Process              Student-t Process
    Data set    MSE            LL             MSE            LL
    Synth A     2.24 ± 0.09    -1.66 ± 0.04   2.29 ± 0.08    -1.00 ± 0.03
    Synth B     9.53 ± 0.03    -1.45 ± 0.02   5.69 ± 0.03    -1.30 ± 0.02
    Snow        10.2 ± 0.08     4.00 ± 0.12   10.5 ± 0.07     25.7 ± 0.18
    Spatial     6.89 ± 0.04     4.34 ± 0.22   5.71 ± 0.03     44.4 ± 0.4
    Wine        4.84 ± 0.08    -1.4  ± 1      4.20 ± 0.06     113  ± 2

Synthetic Data B. We construct data by drawing 100 functions from a GP with a squared exponential kernel and adding Student-t noise independently. The posterior distribution of one sample is shown in Figure 4. The predictive means are also not identical, since the posterior distributions of the hyperparameters differ between the TP and the GP. Here the TP has a superior predictive mean, since after hyperparameter training it is better able to model Student-t noise, as well as better predictive uncertainty.

Whistler Snowfall Data. Daily snowfall amounts in Whistler have been recorded for the years 2010 and 2011 (the dataset can be found at http://www.climate.weatheroffice.ec.gc.ca). This data exhibits clear changepoint type behaviour due to seasonality, which the TP handles much better than the GP.

Figure 5: Posterior distribution of a function to maximize under a GP prior (top) and acquisition functions (bottom). The solid green line is the acquisition function for a GP; the dotted red and dashed black lines are for TP priors with ν = 15 and ν = 5 respectively. All other hyperparameters are kept the same.

Spatial Interpolation Data. This dataset contains rainfall measurements at 467 locations (100 observed and 367 to be estimated) in Switzerland on 8 May 1986 (available under SIC97 at http://www.ai_geostats.org).

Wine Data. This dataset, due to Cortez et al. [2009], consists of 12 attributes of various red wines including acidity, density, pH and alcohol level. Each wine is given a corresponding quality score between 0 and 10. We choose a random subset of 400 wines: 360 for training and 40 for testing.

5.2 Bayesian Optimization

Machine learning algorithms often require tuning parameters, which control learning rates and modelling abilities, via optimizing an objective function. One can model this objective function using a Gaussian process, under a powerful iterative optimization procedure known as Gaussian process Bayesian optimization [Brochu et al., 2010]. To pick where to query the objective function next, one can optimize the expected improvement (EI) over the running optimum, the probability of improving on the current best, or a GP upper confidence bound.

5.2.1 Method

In this paper we work with the EI criterion, and for reasons described in Snoek et al. [2012] we use an ARD Matérn 5/2 kernel defined as

    k_M52(x, x') = θ_0² (1 + √(5 r²(x, x'))) exp(−√(5 r²(x, x'))),    (10)

where r²(x, x') = Σ_{d=1}^D (x_d − x'_d)²/θ_d².

We assume that the function we wish to optimize is f : R^D → R, drawn from a multivariate Student-t process with ν > 2, constant mean μ and a kernel function which is a sum of an ARD Matérn 5/2 kernel and a delta function kernel. Our goal is to find where f attains its minimum. Let X_N = {x_n, f_n}_{n=1}^N be our current set of N observations and f_best = min{f_1, ..., f_N}. To compress notation we let θ represent the parameters θ, ν, μ. Let the acquisition function a_EI(x; X_N, θ) denote the expected improvement over the current best value from choosing to sample at point x, given current observations X_N and hyperparameters θ. Note that the distribution of f(x) | X_N, θ is MVT_1(ν + N, μ̃(x; X_N), τ̃(x; X_N, ν)²), where the forms of μ̃ and τ̃ follow from (6). Let γ̃ = (f_best − μ̃)/τ̃. Then

    a_EI(x; X_N, θ) = E[max(f_best − f(x), 0) | X_N, θ]
                    = ∫_{−∞}^{f_best} (f_best − y) (1/τ̃) λ_{ν+N}((y − μ̃)/τ̃) dy
                    = γ̃ τ̃ Λ_{ν+N}(γ̃) + τ̃ (1 + (γ̃² − 1)/(ν + N − 1)) λ_{ν+N}(γ̃),    (11)

where λ_ν and Λ_ν are the density and distribution functions of a MVT_1(ν, 0, 1) distribution respectively.
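The following sketch (ours, using SciPy; the helper names are hypothetical) evaluates equation (11). Note that λ_ν and Λ_ν refer to MVT_1(ν, 0, 1), which has unit covariance rather than unit scale, so a standard Student-t must be rescaled by √((ν − 2)/ν):

```python
# Sketch of the Student-t expected improvement in equation (11), assuming the
# predictive parameters mu_tilde and tau_tilde have been computed from equation (6).
import numpy as np
from scipy.stats import t as student_t

def mvt1_pdf(x, nu):
    s = np.sqrt((nu - 2.0) / nu)          # MVT_1(nu, 0, 1) has unit covariance
    return student_t.pdf(x / s, df=nu) / s

def mvt1_cdf(x, nu):
    return student_t.cdf(x / np.sqrt((nu - 2.0) / nu), df=nu)

def tp_expected_improvement(f_best, mu_tilde, tau_tilde, nu, N):
    """Expected improvement when f(x) | X_N, theta ~ MVT_1(nu + N, mu_tilde, tau_tilde^2)."""
    d = nu + N
    gamma = (f_best - mu_tilde) / tau_tilde
    return (gamma * tau_tilde * mvt1_cdf(gamma, d)
            + tau_tilde * (1.0 + (gamma**2 - 1.0) / (d - 1.0)) * mvt1_pdf(gamma, d))

# Heavier tails (smaller nu) leave more probability mass on large improvements,
# which tends to encourage more exploration than the Gaussian EI.
print(tp_expected_improvement(f_best=0.0, mu_tilde=0.5, tau_tilde=1.0, nu=5.0, N=10))
```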



Figure 6: Function evaluations for the synthetic function (left), Branin-Hoo function (centre) and the Hartmann function (right). Evaluations under a Student-t process prior (solid line) and a Gaussian process prior (dashed line) are shown. Error bars are computed over 50 runs. In each panel we are minimizing an objective function; the vertical axis represents the running minimum function value.

The parameters θ are all sampled from the posterior using slice sampling, similar to the method used in Snoek et al. [2012]. Suppose we have H sets of posterior samples {θ_h}_{h=1}^H. We set

    ã_EI(x; X_N) = (1/H) Σ_{h=1}^H a_EI(x; X_N, θ_h)    (12)

as our approximate marginalized acquisition function. The choice of the next place to sample is x_next = argmax_{x ∈ R^D} ã_EI(x; X_N), which we find by using gradient descent based methods starting from a dense set of points in the input space.

To get more intuition on how ν changes the behaviour of the acquisition function, we study an example in Figure 5. Here we fix all hyperparameters other than ν and plot the acquisition functions for varying ν. In this example, it is clear that in certain scenarios the TP prior and GP prior will lead to very different proposals given the same information.

5.2.2 Experiments

We compare a TP prior with a Matérn plus a delta function kernel to a GP prior with the same kernel, for Bayesian optimization. To integrate away uncertainty we slice sample the hyperparameters [Neal, 2003]. We consider 3 functions: a 1-dim synthetic sinusoidal, the 2-dim Branin-Hoo function and a 6-dim Hartmann function. All the results are shown in Figure 6.

Sinusoidal synthetic function. In this experiment we aimed to find the minimum of f(x) = −(x − 1)² sin(3x + 5x^{-1} + 1) in the interval [5, 10]. The function has 2 local minima in this interval. TP optimization clearly outperforms GP optimization in this problem; the TP was able to come to within 0.1% of the minimum in 8.1 ± 0.4 iterations whilst the GP took 10.7 ± 0.6 iterations.

Branin-Hoo function. This function is a popular benchmark for optimization methods [Jones, 2001] and is defined on the set {(x_1, x_2) : 0 ≤ x_1 ≤ 15, −5 ≤ x_2 ≤ 15}. We initialized the runs with 4 initial observations, one for each corner of the input square.

Hartmann function. This is a function with 6 local minima in [0, 1]^6 [Picheny et al., 2013]. The runs are initialised with 6 observations at corners of the unit cube in R^6. Notice that the TP tends to behave more like a step function, whereas the Gaussian process' rate of improvement is somewhat more constant. The reason for this behaviour is that the TP tends to more thoroughly explore any modes which it has found, before moving away from these modes. This phenomenon seems more prevalent in higher dimensions.

6 CONCLUSIONS

We have shown that the inverse Wishart process (IWP) is an appropriate prior over covariance matrices of arbitrary size. We used an IWP prior over a GP kernel and showed that marginalizing over the IWP results in a Student-t process (TP). The TP has consistent marginals, closed form conditionals and contains the Gaussian process as a special case. We also proved that the TP is the only elliptical process other than the GP which has an analytically representable density function. The TP prior was applied in regression and Bayesian optimization tasks, showing improved performance over GPs with no additional computational costs.

The take home message for practitioners should be that the TP has many if not all of the benefits of GPs, but with increased modelling flexibility at no extra cost. Our work suggests that it could be useful to replace GPs with TPs in almost any application. The added flexibility of the TP is orthogonal to the choice of kernel, and could complement recent expressive closed form kernels [Wilson and Adams, 2013, Wilson et al., 2013] in future work.


References

C. Archambeau and F. Bach. Multiple Gaussian Process Models. Advances in Neural Information Processing Systems, 2010.

E. Brochu, M. Cora, and N. de Freitas. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Applications to Active User Modeling and Hierarchical Reinforcement Learning. arXiv, 2010. URL http://arxiv.org/abs/1012.2599.

P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. Modeling Wine Preferences by Data Mining from Physiochemical Properties. Decision Support Systems, Elsevier, 47(4):547–553, 2009.

A. P. Dawid. Spherical Matrix Distributions and a Multivariate Model. J. R. Statistical Society B, 39:254–261, 1977.

A. P. Dawid. Some Matrix-Variate Distribution Theory: Notational Considerations and a Bayesian Application. Biometrika, 68:265–274, 1981.

A. Edelman and N. Raj Rao. Random Matrix Theory. Acta Numerica, 1:1–65, 2005.

K. T. Fang, S. Kotz, and K. W. Ng. Symmetric Multivariate and Related Distributions. Chapman & Hall, 1989.

D. R. Jones. A Taxonomy of Global Optimization Methods Based on Response Surfaces. Journal of Global Optimization, 21(4):345–383, 2001.

D. Kelker. Distribution Theory of Spherical Distributions and a Location-Scale Parameter Generalization. Sankhya, Ser. A, 32:419–430, 1970.

R. M. Neal. Monte Carlo Implementation of Gaussian Process Models for Bayesian Regression and Classification. Technical Report No. 9702, Dept of Statistics, University of Toronto, 1997.

R. M. Neal. Slice Sampling. Annals of Statistics, 31(3):705–767, 2003.

R. M. Neal. Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC, 2011.

V. Picheny, T. Wagner, and D. Ginsbourger. A Benchmark of Kriging-Based Infill Criteria for Noisy Optimization. Structural and Multidisciplinary Optimization, 48(3):607–626, 2013.

C. E. Rasmussen. Evaluation of Gaussian Processes and other Methods for Non-Linear Regression. PhD thesis, Graduate Department of Computer Science, University of Toronto, 1996.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

A. Shah, A. G. Wilson, and Z. Ghahramani. Student-t Processes as Alternatives to Gaussian Processes: Supplementary Material. 2014. URL http://mlg.eng.cam.ac.uk/amar/tpsupp.pdf.

J. Snoek, H. Larochelle, and R. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. Advances in Neural Information Processing Systems, 2012.

J. Vanhatalo, P. Jylanki, and A. Vehtari. Gaussian Process Regression with Student-t Likelihood. Advances in Neural Information Processing Systems, pages 1910–1918, 2009.

A. G. Wilson and R. P. Adams. Gaussian Process Kernels for Pattern Discovery and Extrapolation. Proceedings of the 30th International Conference on Machine Learning, 2013.

A. G. Wilson, E. Gilboa, A. Nehorai, and J. P. Cunningham. GPatt: Fast Multidimensional Pattern Extrapolation with Gaussian Processes. arXiv preprint arXiv:1310.5288, 2013. URL http://arxiv.org/abs/1310.5288.

Z. Xu, F. Yan, and Y. Qi. Sparse Matrix-Variate t Process Blockmodel. Proceedings of the 25th AAAI Conference on Artificial Intelligence, 2011.

S. Yu, V. Tresp, and K. Yu. Robust Multi-Task Learning with t-Processes. Proceedings of the 24th International Conference on Machine Learning, 2007.

Y. Zhang and D. Y. Yeung. Multi-Task Learning using Generalized t Process. Proceedings of the 13th Conference on Artificial Intelligence and Statistics, 2010.
