Student-t Processes as Alternatives to Gaussian Processes

Amar Shah, Andrew Gordon Wilson, Zoubin Ghahramani
University of Cambridge

Abstract

We investigate the Student-t process as an alternative to the Gaussian process as a nonparametric prior over functions. We derive closed form expressions for the marginal likelihood and predictive distribution of a Student-t process, by integrating away an inverse Wishart process prior over the covariance kernel of a Gaussian process model. We show surprising equivalences between different hierarchical Gaussian process models leading to Student-t processes, and derive a new sampling scheme for the inverse Wishart process, which helps elucidate these equivalences. Overall, we show that a Student-t process can retain the attractive properties of a Gaussian process (a nonparametric representation, analytic marginal and predictive distributions, and easy model selection through covariance kernels) but has enhanced flexibility, and predictive covariances that, unlike a Gaussian process, explicitly depend on the values of training observations. We verify empirically that a Student-t process is especially useful in situations where there are changes in covariance structure, or in applications like Bayesian optimization, where accurate predictive covariances are critical for good performance. These advantages come at no additional computational cost over Gaussian processes.

Appearing in Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS) 2014, Reykjavik, Iceland. JMLR: W&CP volume 33. Copyright 2014 by the authors.

arXiv:1402.4306v2 [stat.ML] 19 Feb 2014

1 INTRODUCTION

Gaussian processes are rich distributions over functions, which provide a Bayesian nonparametric approach to regression. Owing to their interpretability, non-parametric flexibility, large support, consistency, simple exact learning and inference procedures, and impressive empirical performances [Rasmussen, 1996], Gaussian processes as kernel machines have steadily grown in popularity over the last decade.

At the heart of every Gaussian process (GP) is a parametrized covariance kernel, which determines the properties of likely functions under a GP. Typically simple parametric kernels, such as the Gaussian (squared exponential) kernel, are used, and their parameters are determined through marginal likelihood maximization, having analytically integrated away the Gaussian process. However, a fully Bayesian nonparametric treatment of regression would place a nonparametric prior over the Gaussian process covariance kernel, to represent uncertainty over the kernel function, and to reflect the natural intuition that the kernel does not have a simple parametric form.

Likewise, given the success of Gaussian process kernel machines, it is also natural to consider more general families of elliptical processes [Fang et al., 1989], such as Student-t processes, where any collection of function values has a desired elliptical distribution, with a covariance constructed using a kernel.

As we will show, the Student-t process can be derived by placing an inverse Wishart process prior on the kernel of a Gaussian process. Given their intuitive value, it is not surprising that various forms of Student-t processes have been used in different applications [Yu et al., 2007, Zhang and Yeung, 2010, Xu et al., 2011, Archambeau and Bach, 2010]. However, the connections between these models, and the theoretical properties of these models, remain largely unknown. Similarly, the practical utility of such models remains uncertain. For example, Rasmussen and Williams [2006] wonder whether "the Student-t process is perhaps not as exciting as one might have hoped".

In short, our paper answers in detail many of the "what, when and why?" questions one might have about Student-t processes (TPs), inverse Wishart processes, and elliptical processes in general. Specifically:

• We precisely define and motivate the inverse Wishart process [Dawid, 1981] as a prior over covariance matrices of arbitrary size.

• We propose a Student-t process, which we derive from hierarchical Gaussian process models. We derive analytic forms for the marginal and predictive distributions of this process, and analytic derivatives of the marginal likelihood.

• We show that the Student-t process is the most general elliptically symmetric process with analytic marginal and predictive distributions.

• We derive a new way of sampling from the inverse Wishart process, which intuitively resolves the seemingly bizarre marginal equivalence between inverse Wishart and inverse Gamma priors for covariance kernels in hierarchical GP models.

• We show that the predictive covariances of a TP depend on the values of training observations, even though the predictive covariances of a GP do not.

• We show that, contrary to the Student-t process described in Rasmussen and Williams [2006], an analytic TP noise model can be used which separates signal and noise analytically.

• We demonstrate non-trivial differences in behaviour between the GP and TP on a variety of applications. We specifically find the TP more robust to change-points and model misspecification, to have notably improved predictive covariances, to have useful "tail-dependence" between distant function values (which is orthogonal to the choice of kernel), and to be particularly promising for Bayesian optimization, where predictive covariances are especially important.

We begin by introducing the inverse Wishart process in section 2. We then derive a Student-t process by using an inverse Wishart process over covariance kernels (section 3), and discuss the properties of this Student-t process in section 4. Finally, we demonstrate the Student-t process on regression and Bayesian optimization problems in section 5.

2 INVERSE WISHART PROCESS

In this section we argue that the inverse Wishart distribution is an attractive choice of prior for covariance matrices of arbitrary size. The Wishart distribution is a probability distribution over $\Pi(n)$, the set of real valued, $n \times n$, symmetric, positive definite matrices. Its density function is defined as follows.

Definition. A random $\Sigma \in \Pi(n)$ is Wishart distributed with parameters $\nu > n - 1$, $K \in \Pi(n)$, and we write $\Sigma \sim W_n(\nu, K)$ if its density is given by

$$p(\Sigma) = c_n(\nu, K)\, |\Sigma|^{(\nu - n - 1)/2} \exp\left(-\tfrac{1}{2} \operatorname{Tr}\left(K^{-1}\Sigma\right)\right), \tag{1}$$

where $c_n(\nu, K) = \left(|K|^{\nu/2}\, 2^{\nu n/2}\, \Gamma_n(\nu/2)\right)^{-1}$.

The Wishart distribution defined with this parameterization is consistent under marginalization. If $\Sigma \sim W_n(\nu, K)$, then any $n_1 \times n_1$ principal submatrix $\Sigma_{11}$ is $W_{n_1}(\nu, K_{11})$ distributed. This property makes the Wishart distribution appear to be an attractive choice of prior over covariance matrices. Unfortunately the Wishart distribution suffers a flaw which makes it impractical for nonparametric Bayesian modelling.

Suppose we wish to model a covariance matrix using $\nu^{-1}\Sigma$, so that its expectation is $E[\nu^{-1}\Sigma] = K$, and $\operatorname{var}[\nu^{-1}\Sigma_{ij}] = \nu^{-1}(K_{ij}^2 + K_{ii}K_{jj})$. Since we require $\nu > n - 1$, we must let $\nu \to \infty$ to define a process which has positive semidefinite Wishart distributed marginals of arbitrary size. However, as $\nu \to \infty$, $\nu^{-1}\Sigma$ tends to the constant matrix $K$ almost surely. Thus the requirement $\nu > n - 1$ prohibits defining a useful process which has Wishart marginals of arbitrary size.

Nevertheless, the inverse Wishart distribution does not suffer this problem. Dawid [1981] parametrized the inverse Wishart distribution as follows:

Definition. A random $\Sigma \in \Pi(n)$ is inverse Wishart distributed with parameters $\nu \in \mathbb{R}_+$, $K \in \Pi(n)$ and we write $\Sigma \sim IW_n(\nu, K)$ if its density is given by

$$p(\Sigma) = c_n(\nu, K)\, |\Sigma|^{-(\nu + 2n)/2} \exp\left(-\tfrac{1}{2} \operatorname{Tr}\left(K\Sigma^{-1}\right)\right), \tag{2}$$

with $c_n(\nu, K) = \dfrac{|K|^{(\nu + n - 1)/2}}{2^{(\nu + n - 1)n/2}\, \Gamma_n\left((\nu + n - 1)/2\right)}$.

If $\Sigma \sim IW_n(\nu, K)$, $\Sigma$ has mean and covariance only when $\nu > 2$, and $E[\Sigma] = (\nu - 2)^{-1}K$. Both the Wishart and the inverse Wishart distributions place prior mass on every $\Sigma \in \Pi(n)$. Furthermore $\Sigma \sim W_n(\nu, K)$ if and only if $\Sigma^{-1} \sim IW_n(\nu - n + 1, K^{-1})$.

Dawid [1981] shows that the inverse Wishart distribution defined as above is consistent under marginalization. If $\Sigma \sim IW_n(\nu, K)$, then any principal submatrix $\Sigma_{11}$ will be $IW_{n_1}(\nu, K_{11})$ distributed. Note the key difference in the parameterizations of both distributions: the parameter $\nu$ does not need to depend on the size of the matrix in the inverse Wishart distribution. These properties are desirable and motivate defining a process which has inverse Wishart marginals of arbitrary size. Let $\mathcal{X}$ be some input space and $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ a positive definite kernel function.

Definition. $\sigma$ is an inverse Wishart process on $\mathcal{X}$ with parameters $\nu \in \mathbb{R}_+$ and base kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ if

for any finite collection $x_1, ..., x_n \in \mathcal{X}$, $\sigma(x_1, ..., x_n) \sim IW_n(\nu, K)$ where $K \in \Pi(n)$ with $K_{ij} = k(x_i, x_j)$. We write $\sigma \sim \mathcal{IWP}(\nu, k)$.

In the next section we use the inverse Wishart process as a nonparametric prior over kernels in a hierarchical Gaussian process model.

Figure 1: Five samples (blue solid) from $\mathcal{GP}(h, \kappa)$ (left) and $\mathcal{TP}(\nu, h, \kappa)$ (right), with $\nu = 5$, $h(x) = \cos(x)$ (red dashed) and $\kappa(x_i, x_j) = 0.01 \exp(-20(x_i - x_j)^2)$. The grey shaded area represents a 95% predictive interval under each model.

3 DERIVING THE STUDENT-t PROCESS

Gaussian processes (GPs) are popular nonparametric Bayesian distributions over functions. A thorough guide to GPs has been provided by Rasmussen and Williams [2006]. GPs are characterized by a mean function and a kernel function. Practitioners tend to use parametric kernel functions and learn their hyperparameters using maximum likelihood or sampling based methods. We propose placing an inverse Wishart process prior on the kernel function, leading to a Student-t process.

For a base kernel $k_\theta$ parameterized by $\theta$, and a continuous mean function $\phi : \mathcal{X} \to \mathbb{R}$, our generative approach is as follows

$$\sigma \sim \mathcal{IWP}(\nu, k_\theta), \qquad y \mid \sigma \sim \mathcal{GP}(\phi, (\nu - 2)\sigma). \tag{3}$$

Since the inverse Wishart distribution is a conjugate prior for the covariance matrix of a Gaussian likelihood, we can analytically marginalize $\sigma$ in the generative model of (3). For any collection of data $y = (y_1, ..., y_n)^\top$ with $\phi = (\phi(x_1), ..., \phi(x_n))^\top$,

$$\begin{aligned}
p(y \mid \nu, K) &= \int p(y \mid \Sigma)\, p(\Sigma \mid \nu, K)\, d\Sigma \\
&\propto \int \frac{\exp\left(-\tfrac{1}{2} \operatorname{Tr}\left[\left(K + \frac{(y - \phi)(y - \phi)^\top}{\nu - 2}\right)\Sigma^{-1}\right]\right)}{|\Sigma|^{(\nu + 2n + 1)/2}}\, d\Sigma \\
&\propto \left(1 + \frac{1}{\nu - 2}(y - \phi)^\top K^{-1}(y - \phi)\right)^{-(\nu + n)/2}
\end{aligned} \tag{4}$$

Definition. $y \in \mathbb{R}^n$ is multivariate Student-t distributed with parameters $\nu \in \mathbb{R}_+ \setminus [0, 2]$, $\phi \in \mathbb{R}^n$ and $K \in \Pi(n)$ if it has density

$$p(y) = \frac{\Gamma\left(\frac{\nu + n}{2}\right)}{((\nu - 2)\pi)^{n/2}\, \Gamma\left(\frac{\nu}{2}\right)}\, |K|^{-1/2} \left(1 + \frac{(y - \phi)^\top K^{-1}(y - \phi)}{\nu - 2}\right)^{-\frac{\nu + n}{2}} \tag{5}$$

We write $y \sim MVT_n(\nu, \phi, K)$.

We easily compute the mean and covariance of the MVT using the generative derivation: $E[y] = E[E[y \mid \Sigma]] = \phi$ and $\operatorname{cov}[y] = E[E[(y - \phi)(y - \phi)^\top \mid \Sigma]] = E[(\nu - 2)\Sigma] = K$. We prove the following Lemma in the Supplementary Material.

Lemma 1. The multivariate Student-t is consistent under marginalization.

We define a Student-t process as follows.

Definition. $f$ is a Student-t process on $\mathcal{X}$ with parameters $\nu > 2$, mean function $\Phi : \mathcal{X} \to \mathbb{R}$, and kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ if any finite collection of function values have a joint multivariate Student-t distribution, i.e. $(f(x_1), ..., f(x_n))^\top \sim MVT_n(\nu, \phi, K)$ where $K \in \Pi(n)$ with $K_{ij} = k(x_i, x_j)$ and $\phi \in \mathbb{R}^n$ with $\phi_i = \Phi(x_i)$. We write $f \sim \mathcal{TP}(\nu, \Phi, k)$.

4 TP PROPERTIES & RELATION TO OTHER PROCESSES

In this section we discuss the conditional distribution of the TP, the relationship between GPs and TPs, another covariance prior which leads to the same TP, elliptical processes, and a sampling scheme for the IWP which gives insight into this equivalence. Finally we consider modelling noisy functions with a TP.

4.1 Relation to Gaussian process

The Student-t process generalizes the Gaussian process. A GP can be seen as a limiting case of a TP as shown in Lemma 2, which is proven in the Supplementary Material.

Lemma 2. Suppose $f \sim \mathcal{TP}(\nu, \Phi, k)$ and $g \sim \mathcal{GP}(\Phi, k)$. Then $f$ tends to $g$ in distribution as $\nu \to \infty$.

The $\nu$ parameter controls how heavy tailed the process is. Smaller values of $\nu$ correspond to heavier tails. As $\nu$ gets larger, the tails converge to Gaussian tails. This is illustrated in prior sample draws shown in Figure 1. Notice that the samples from the TP tend to have more extreme behaviour than the GP.

$\nu$ also controls the nature of the dependence between variables which are jointly Student-t distributed, and
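The density in (5) is straightforward to evaluate numerically. The following is a minimal sketch, assuming NumPy is available; the function name `mvt_logpdf` is hypothetical and not part of the paper. It uses a Cholesky factor of $K$ to obtain both the quadratic form and the log determinant.

```python
import math
import numpy as np

def mvt_logpdf(y, nu, phi, K):
    """Log density of MVT_n(nu, phi, K) following (5).

    K is the covariance matrix of y (not a scale matrix), so nu > 2 is required.
    """
    y = np.asarray(y, dtype=float)
    phi = np.asarray(phi, dtype=float)
    K = np.asarray(K, dtype=float)
    n = y.shape[0]
    L = np.linalg.cholesky(K)               # K = L L^T
    alpha = np.linalg.solve(L, y - phi)     # alpha = L^{-1} (y - phi)
    beta = float(alpha @ alpha)             # (y - phi)^T K^{-1} (y - phi)
    log_det_K = 2.0 * float(np.sum(np.log(np.diag(L))))
    return (math.lgamma((nu + n) / 2.0) - math.lgamma(nu / 2.0)
            - 0.5 * n * math.log((nu - 2.0) * math.pi)
            - 0.5 * log_det_K
            - 0.5 * (nu + n) * math.log1p(beta / (nu - 2.0)))
```

For large $\nu$ the value approaches the Gaussian log density with the same mean and covariance, in line with Lemma 2.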




As $\nu$ tends to infinity, this predictive distribution tends to a Gaussian process predictive distribution as we would expect given Lemma 2. Perhaps less intuitively, this predictive distribution also tends to a Gaussian process predictive as $n_1$ tends to infinity.

The predictive mean has the same form as for a Gaussian process, conditioned on having the same kernel $k$, with the same hyperparameters. The key difference is in the predictive covariance, which now explicitly depends on the training observations. Indeed, a somewhat disappointing feature of the Gaussian process is that for a given kernel, the predictive covariance of new samples does not depend on training observations. Importantly, since the marginal likelihood of the TP in (5) differs from the marginal likelihood of the GP, both the predictive mean and predictive covariance of a TP will differ from that of a GP, after learning kernel hyperparameters.

4.4 Elliptical Processes

We now show that both Gaussian and Student-t processes are elliptically symmetric, and that the Student-t process is the more general elliptical process.

Definition. $y \in \mathbb{R}^n$ is elliptically symmetric if and only if there exists $\mu \in \mathbb{R}^n$, $R$ a nonnegative random variable, $\Omega$ an $n \times d$ matrix with maximal rank $d$ and $u$ uniformly distributed on the unit sphere in $\mathbb{R}^d$ independent of $R$ such that $y \overset{D}{=} \mu + R\Omega u$, where $\overset{D}{=}$ denotes equality in distribution.

An overview of elliptically symmetric distributions and the following Lemma can be found in Fang et al. [1989].

Lemma 5. Suppose $R_1 \sim \chi^2(n)$ and $R_2 \sim \Gamma^{-1}(\nu/2, 1/2)$ independently. If $R = \sqrt{R_1}$, then $y$ is Gaussian distributed. If $R = \sqrt{(\nu - 2)R_1R_2}$ then $y$ is MVT distributed.
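The representation $y \overset{D}{=} \mu + R\Omega u$ in Lemma 5 can be turned directly into a sampler. The sketch below is illustrative only and rests on two assumptions that are not spelled out in the text: $\Omega$ is taken to be square ($d = n$), and $\Gamma^{-1}(\nu/2, 1/2)$ is read as an inverse Gamma with shape $\nu/2$ and scale $1/2$, so that $1/R_2 \sim \chi^2(\nu)$. Under that reading $E[(\nu - 2)R_2] = 1$, which matches $\operatorname{cov}[y] = K$. The function name `sample_elliptical` is hypothetical.

```python
import numpy as np

def sample_elliptical(mu, Omega, nu=None, size=1, rng=None):
    """Draw y = mu + R * Omega @ u as in Lemma 5.

    nu=None gives the Gaussian case R = sqrt(R1); otherwise the MVT case
    R = sqrt((nu - 2) * R1 * R2). Assumes a square Omega (d = n).
    """
    rng = np.random.default_rng() if rng is None else rng
    mu = np.asarray(mu, dtype=float)
    Omega = np.asarray(Omega, dtype=float)
    d = Omega.shape[1]
    z = rng.standard_normal((size, d))
    u = z / np.linalg.norm(z, axis=1, keepdims=True)  # uniform on the unit sphere
    r1 = rng.chisquare(d, size=size)                  # R1 ~ chi^2(d)
    if nu is None:
        radius = np.sqrt(r1)
    else:
        # Assumed parameterization: 1 / R2 ~ chi^2(nu)
        r2 = 1.0 / rng.chisquare(nu, size=size)
        radius = np.sqrt((nu - 2.0) * r1 * r2)
    return mu + radius[:, None] * (u @ Omega.T)
```

With `Omega` set to a Cholesky factor of a covariance matrix $K$, both cases produce samples whose covariance is $K$; only the tail behaviour differs.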

Elliptically symmetric distributions characterize a large class of distributions which are unimodal and where the likelihood of a point decreases in its distance from this mode. These properties are natural assumptions we often want to encode in our prior distribution, making elliptical distributions ideal for multivariate modelling tasks. The idea naturally extends to infinite dimensional objects.

Definition. Let $\mathcal{Y} = \{y_i\}$ be a countable family of random variables. It is an elliptical process if any finite subset of them are jointly elliptically symmetric.

Not all elliptical distributions have densities (e.g. Lévy, alpha-stable distributions). Even fewer elliptical processes have densities, and the set of those that do is characterized in Theorem 6 due to Kelker [1970].

Theorem 6. Suppose $\mathcal{Y} = \{y_i\}$ is an elliptical process. Any finite collection $z = \{z_1, ..., z_n\} \subset \mathcal{Y}$ has a density if and only if there exists a non-negative random variable $r$ such that $z \mid r \sim N_n(\mu, r\Omega\Omega^\top)$.

A simple corollary of this theorem describes the only two cases where an elliptical process has an analytically representable density function. […]

[…]tions, the same computational costs as a Gaussian process and increased flexibility, the Student-t process can be used as a drop-in replacement for a Gaussian process in many applications.

4.5 A New Way to Sample the IWP

We show that the density of an inverse Wishart distribution depends only on the eigenvalues of a positive definite matrix. To the best of our knowledge this change of variables has not been computed previously. This decomposition offers a novel way of sampling from an inverse Wishart distribution and insight into why the Student-t process can be derived using an inverse Gamma or an inverse Wishart process covariance prior.

Let $\Xi(n)$ be the set of all $n \times n$ orthogonal matrices. A matrix is orthogonal if it is square, real valued and its rows and columns are orthogonal unit vectors. Orthogonal matrices are compositions of rotations and reflections, which are volume preserving operations. Symmetric positive definite (SPD) matrices can be represented through a diagonal and an orthogonal matrix:

Theorem 8. Let $\Sigma \in \Pi(n)$, the set of SPD, $n \times n$ matrices. Suppose $\{\lambda_1, ..., \lambda_n\}$ are the eigenvalues of $\Sigma$. There exists $Q \in \Xi(n)$ such that $\Sigma = Q\Lambda Q^\top$, where $\Lambda = \operatorname{diag}(\lambda_1, ..., \lambda_n)$.

Now suppose $\Sigma \sim IW_n(\nu, I)$. We compute the density of an IW using the representation in Theorem 8, being careful to include the Jacobian of the change of variable, $J(\Sigma; Q, \Lambda)$, given in Edelman and Rao [2005]. From (2) and using the facts that $Q^\top Q = I$ and $|AB| = |BA|$,

$$\begin{aligned}
p(\Sigma)\, d\Sigma &= p(Q\Lambda Q^\top)\, |J(\Sigma; Q, \Lambda)|\, d\Lambda\, dQ \\
&\propto |Q\Lambda Q^\top|^{-(\nu + 2n)/2} \exp\left(-\tfrac{1}{2} \operatorname{Tr}\left((Q\Lambda Q^\top)^{-1}\right)\right) \times \prod_{1 \le i < j \le n} |\lambda_i - \lambda_j|\, d\Lambda\, dQ
\end{aligned}$$

[…]

This result provides a geometric interpretation of what a sample from $IW_n(\nu, I)$ looks like. We first uniformly at random pick an orthogonal set of basis vectors in $\mathbb{R}^n$ and then stretch these basis vectors using an exchangeable set of scalar random variables. An analogous interpretation holds for the Wishart distribution.

Recall from Lemma 5 that if $u$ is uniformly distributed on the unit sphere in $\mathbb{R}^n$ and $R \sim \chi^2(n)$ independently, then $\sqrt{R}\,u \sim N_n(0, I)$. By (4) and Lemma 5, if we sample $Q$ and $\Lambda$ from the generative process above, then $\sqrt{(\nu - 2)R}\, Q\Lambda^{1/2}u$ is marginally a draw from $MVT(\nu, 0, I)$. Since the diagonal elements of $\Lambda$ are exchangeable, $Q$ is orthogonal and sampled uniformly over $\Xi(n)$, and $u$ is spherically symmetric, we must have that $Q\Lambda^{1/2}u \overset{D}{=} \sqrt{R'}\,u$ for some positive scalar random variable $R'$ by symmetry. By Lemma 5 we know $R' \sim \Gamma^{-1}(\nu/2, 1/2)$. In summary, the action of $Q\Lambda^{1/2}$ on $u$ is equivalent in distribution to a rescaling by an inverse Gamma variate.
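The marginal equivalence between the inverse Wishart and inverse Gamma constructions can be checked by simulation. The sketch below is not the eigenvalue-based sampler of this section. It draws $\Sigma \sim IW_n(\nu, I)$ through the relation $\Sigma^{-1} \sim W_n(\nu + n - 1, I)$ from Section 2, which for integer $\nu$ can be built as $GG^\top$ with $G$ an $n \times (\nu + n - 1)$ standard normal matrix. As an assumption, $\Gamma^{-1}(\nu/2, 1/2)$ is again read as $1/R' \sim \chi^2(\nu)$. Both function names are hypothetical.

```python
import numpy as np

def sample_via_inverse_wishart(nu, n, size, rng):
    """y | Sigma ~ N(0, (nu - 2) Sigma) with Sigma ~ IW_n(nu, I), integer nu."""
    m = nu + n - 1
    out = np.empty((size, n))
    for s in range(size):
        G = rng.standard_normal((n, m))
        precision = G @ G.T                    # Sigma^{-1} ~ W_n(m, I)
        C = np.linalg.cholesky(precision)      # precision = C C^T
        z = rng.standard_normal(n)
        # Sigma = C^{-T} C^{-1}, so C^{-T} z has covariance Sigma
        out[s] = np.sqrt(nu - 2.0) * np.linalg.solve(C.T, z)
    return out

def sample_via_inverse_gamma(nu, n, size, rng):
    """y = sqrt((nu - 2) R') z with z ~ N(0, I) and 1 / R' ~ chi^2(nu)."""
    r = 1.0 / rng.chisquare(nu, size=size)
    return np.sqrt((nu - 2.0) * r)[:, None] * rng.standard_normal((size, n))
```

Comparing summary statistics of the two sample sets, such as the sample covariance or quantiles of the squared norm, gives an empirical view of the equivalence: both should look like draws from $MVT_n(\nu, 0, I)$.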

Figure 3: Scatter plots of points drawn from various 2-dim processes. Here $\nu = 2.1$ and $K_{ij} = 0.8\delta_{ij} + 0.2$. Top-left: $MVT_2(\nu, 0, K) + MVT_2(\nu, 0, 0.5I)$. Top-right: $MVT_2(\nu, 0, K + 0.5I)$ (our model). Bottom-left: $MVT_2(\nu, 0, K) + N_2(0, 0.5I)$. Bottom-right: $N_2(0, K + 0.5I)$.

Figure 4: Posterior distributions of 1 sample from Synthetic Data B under GP prior (left) and TP prior (right). The solid line is the posterior mean, the shaded area represents a 95% predictive interval, circles are training points and crosses are test points.

4.6 Modelling Noisy Functions

It is common practice to assume that outputs are the sum of a latent Gaussian process and independent Gaussian noise. An advantage of such a model is in the fact that the sum of independent Gaussian distributions is Gaussian distributed and hence such a Gaussian process model remains analytic in the presence of noise. Unfortunately the sum of two independent MVTs is analytically intractable.

This problem was encountered by Rasmussen and Williams [2006], who went on to dismiss the multivariate Student-t process for practical purposes. Our approach is to incorporate the noise into the kernel function, for example, letting $k = k_\theta + \delta$, where $k_\theta$ is a parametrized kernel and $\delta$ is a diagonal kernel function. Such a model is not equivalent to adding independent noise, since the scaling parameter $\nu$ will have an effect on the squared-exponential kernel as well as the noise kernel. Zhang and Yeung [2010] propose a similar method for handling noise; however, they incorrectly assume that the latent function and noise are independent under this model. The noise will be uncorrelated with the latent function, but not independent.

As $\nu \to \infty$ this model tends to a GP with independent Gaussian noise. In Figure 3, we consider samples from various two dimensional processes when $\nu$ is small and the signal to noise ratio is small. Here we see that the TP with noise incorporated into its kernel behaves similarly to a TP with independent Student-t noise.

There have been several attempts to make GP regression robust to heavy tailed noise that rely on approximate inference [Neal, 1997, Vanhatalo et al., 2009]. It is hence attractive that our proposed method can model heavy tailed noise whilst retaining an analytic inference scheme. This is a novel finding to the best of our knowledge.

5 APPLICATIONS

In this section we compare TPs to GPs for regression and Bayesian optimization.

5.1 Regression

Consider a set of observations $\{x_i, y_i\}_{i=1}^n$ for $x_i \in \mathcal{X}$ and $y_i \in \mathbb{R}$. Analogous to Gaussian process regression, we assume the following generative model

$$f \sim \mathcal{TP}(\nu, \Phi, k_\theta), \qquad y_i = f(x_i) \ \text{ for } i = 1, ..., n. \tag{9}$$

In this work we consider parametric kernel functions. A key task when using such kernels is in learning the parameters of the chosen kernel, which are called the hyperparameters of the model. We include derivatives of the marginal log likelihood of the TP with respect to the hyperparameters in the Supplementary Material.

5.1.1 Experiments

We test the Student-t process as a regression model on a number of datasets. We sample hyperparameters using Hamiltonian Monte Carlo [Neal, 2011] and use a kernel function which is a sum of a squared exponential and a delta kernel function ($k_\theta = k_{SE}$). The results for all of these experiments are summarized in Table 1.

Synthetic Data A. We sample 100 functions from a GP prior with Gaussian noise and fit both GPs and TPs to the data with the goal of predicting test points. For each function we train on 80 data points and test on 20. The TP, which generalizes the GP, has superior predictive uncertainty in this example.
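To make the GP versus TP comparison concrete, the sketch below computes predictions with a squared exponential plus delta kernel, as in Section 4.6. The conditional used here is the standard multivariate Student-t conditional rewritten in the covariance parameterization of (5): the mean has the GP form, and the GP predictive covariance is rescaled by $(\nu + \beta_1 - 2)/(\nu + n_1 - 2)$ with $\beta_1 = (y_1 - \phi_1)^\top K_{11}^{-1}(y_1 - \phi_1)$. This formula is stated as an assumption, since the corresponding lemma is not reproduced above. The function names, the fixed hyperparameter defaults and the constant mean are all hypothetical.

```python
import numpy as np

def se_plus_delta(xa, xb, amp=1.0, length=1.0, noise=0.1, same=False):
    """Squared exponential kernel plus a delta (diagonal) kernel on 1-d inputs."""
    d2 = (xa[:, None] - xb[None, :]) ** 2
    K = amp ** 2 * np.exp(-0.5 * d2 / length ** 2)
    if same:  # the delta kernel only contributes for identical inputs
        K = K + noise ** 2 * np.eye(len(xa))
    return K

def tp_predict(x1, y1, x2, nu, mean=0.0, **kern):
    """Return (predictive mean, TP predictive covariance, GP predictive covariance)."""
    n1 = len(x1)
    K11 = se_plus_delta(x1, x1, same=True, **kern)
    K12 = se_plus_delta(x1, x2, **kern)
    K22 = se_plus_delta(x2, x2, same=True, **kern)
    L = np.linalg.cholesky(K11)
    A = np.linalg.solve(L, K12)               # L^{-1} K12
    r = np.linalg.solve(L, y1 - mean)         # L^{-1} (y1 - phi1)
    beta1 = float(r @ r)
    mu = mean + A.T @ r                       # same form as the GP predictive mean
    gp_cov = K22 - A.T @ A                    # does not depend on y1
    scale = (nu + beta1 - 2.0) / (nu + n1 - 2.0)  # depends on y1 through beta1
    return mu, scale * gp_cov, gp_cov
```

Scaling the training outputs leaves the GP covariance unchanged but inflates the TP covariance through $\beta_1$, which is the data dependence discussed in Section 4.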

Table 1: Predictive Mean Squared Errors (MSE) and Log Likelihoods (LL) of regression experiments. The TP consistently has the lowest MSE and highest LL.

Data set    Gaussian Process                 Student-t Process
            MSE            LL                MSE            LL
Synth A     2.24 ± 0.09    -1.66 ± 0.04      2.29 ± 0.08    -1.00 ± 0.03
Synth B     9.53 ± 0.03    -1.45 ± 0.02      5.69 ± 0.03    -1.30 ± 0.02
Snow        10.2 ± 0.08    4.00 ± 0.12       10.5 ± 0.07    25.7 ± 0.18
Spatial     6.89 ± 0.04    4.34 ± 0.22       5.71 ± 0.03    44.4 ± 0.4
Wine        4.84 ± 0.08    -1.4 ± 1          4.20 ± 0.06    113 ± 2

Synthetic Data B. We construct data by drawing 100 functions from a GP with a squared exponential kernel and adding Student-t noise independently. The posterior distribution of one sample is shown in Figure 4. The predictive means are also not identical since the posterior distributions of the hyperparameters differ between the TP and the GP. Here the TP has a superior predictive mean, since after hyperparameter training it is better able to model Student-t noise, as well as better predictive uncertainty.

Whistler Snowfall Data¹. Daily snowfall amounts in Whistler have been recorded for the years 2010 and 2011. This data exhibits clear changepoint type behaviour due to seasonality which the TP handles much better than the GP.

Spatial Interpolation Data². This dataset contains rainfall measurements at 467 (100 observed and 367 to be estimated) locations in Switzerland on 8 May 1986.

Wine Data. This dataset due to Cortez et al. [2009] consists of 12 attributes of various red wines including acidity, density, pH and alcohol level. Each wine is given a corresponding quality score between 0 and 10. We choose a random subset of 400 wines: 360 for training and 40 for testing.

¹ The snowfall dataset can be found at http://www.
² The spatial interpolation data can be found at http:// under SIC97.

Figure 5: Posterior distribution of a function to maximize under a GP prior (top) and acquisition functions (bottom). The solid green line is the acquisition function for a GP, the dotted red and dashed black lines are for TP priors with $\nu = 15$ and $\nu = 5$ respectively. All other hyperparameters are kept the same.

5.2 Bayesian Optimization

Machine learning algorithms often require tuning parameters, which control learning rates and abilities, via optimizing an objective function. One can model this objective function using a Gaussian process, under a powerful iterative optimization procedure known as Gaussian process Bayesian optimization [Brochu et al., 2010]. To pick where to query the objective function next, one can optimize the expected improvement (EI) over the running optimum, the probability of improving the current best or a GP upper confidence bound.

5.2.1 Method

In this paper we work with the EI criterion and for reasons described in Snoek et al. [2012] we use an ARD Matérn 5/2 kernel defined as

$$k_{M52}(x, x') = \theta_0 \left(1 + \sqrt{5 r^2(x, x')} + \tfrac{5}{3} r^2(x, x')\right) \exp\left(-\sqrt{5 r^2(x, x')}\right) \tag{10}$$

where $r^2(x, x') = \sum_{d=1}^{D} \dfrac{(x_d - x'_d)^2}{\theta_d^2}$.

We assume that the function we wish to optimize over is $f : \mathbb{R}^D \to \mathbb{R}$ and is drawn from a multivariate Student-t process with scale parameter $\nu > 2$, constant mean $\mu$ and kernel function a linear sum of an ARD Matérn 5/2 kernel and a delta function kernel. Our goal is to find where $f$ attains its minimum. Let $X_N = \{x_n, f_n\}_{n=1}^N$ be our current set of $N$ observations and $f_{best} = \min\{f_1, ..., f_N\}$. To compress notation we let $\theta$ represent the parameters $\theta, \nu, \mu$. Let the acquisition function $a_{EI}(x; X_N, \theta)$ denote the expected improvement over the current best value from choosing to sample at point $x$ given current observations $X_N$ and hyperparameters $\theta$. Note that the distribution of $f(x) \mid X_N, \theta$ is $MVT_1(\nu + N, \tilde\mu(x; X_N), \tilde\tau(x; X_N, \nu)^2)$, where the form of $\tilde\mu$ and $\tilde\tau$ are derived in (13). Let $\tilde\gamma = \dfrac{f_{best} - \tilde\mu}{\tilde\tau}$. Then

$$\begin{aligned}
a_{EI}(x; X_N, \theta) &= E\left[\max(f_{best} - f(x), 0) \mid X_N, \theta\right] \\
&= \int_{-\infty}^{f_{best}} dy\, (f_{best} - y)\, \frac{1}{\tilde\tau}\, \lambda_{\nu + N}\left(\frac{y - \tilde\mu}{\tilde\tau}\right) \\
&= \tilde\gamma \tilde\tau \Lambda_{\nu + N}(\tilde\gamma) + \tilde\tau \left(1 + \frac{\tilde\gamma^2 - 1}{\nu + N - 1}\right) \lambda_{\nu + N}(\tilde\gamma).
\end{aligned} \tag{11}$$
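The closed form in (11) can be sketched in a few lines. In the sketch, $\lambda_\nu$ and $\Lambda_\nu$ are taken to be the density and distribution function of $MVT_1(\nu, 0, 1)$, which has unit variance and is therefore a standard Student-t variable divided by $\sqrt{\nu/(\nu - 2)}$. The code assumes SciPy is available; the function names are hypothetical, and `nu_post` stands for $\nu + N$.

```python
import numpy as np
from scipy import stats

def mvt1_pdf(z, nu):
    """Density of MVT_1(nu, 0, 1): a Student-t rescaled to unit variance."""
    c = np.sqrt(nu / (nu - 2.0))
    return c * stats.t.pdf(c * z, df=nu)

def mvt1_cdf(z, nu):
    """Distribution function of MVT_1(nu, 0, 1)."""
    c = np.sqrt(nu / (nu - 2.0))
    return stats.t.cdf(c * z, df=nu)

def tp_expected_improvement(f_best, mu, tau, nu_post):
    """Expected improvement (11) for minimization; nu_post plays the role of nu + N."""
    gamma = (f_best - mu) / tau
    return (gamma * tau * mvt1_cdf(gamma, nu_post)
            + tau * (1.0 + (gamma ** 2 - 1.0) / (nu_post - 1.0))
            * mvt1_pdf(gamma, nu_post))
```

As `nu_post` grows, the value approaches the familiar Gaussian expected improvement $\tilde\tau(\tilde\gamma\,\Phi(\tilde\gamma) + \varphi(\tilde\gamma))$, where $\Phi$ and $\varphi$ here denote the standard normal distribution function and density.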

Figure 6: Function evaluations for the synthetic function (left), Branin-Hoo function (centre) and the Hartmann function (right). Evaluations under a Student-t process prior (solid line) and a Gaussian process prior (dashed line) are shown. Error bars represent the standard deviation of 50 runs. In each panel we are minimizing an objective function. The vertical axis represents the running minimum function value.

Here $\lambda_\nu$ and $\Lambda_\nu$ are the density and distribution functions of a $MVT_1(\nu, 0, 1)$ distribution respectively.

The parameters $\theta$ are all sampled from the posterior using slice sampling, similar to the method used in Snoek et al. [2012]. Suppose we have $H$ sets of posterior samples $\{\theta_h\}_{h=1}^H$. We set

$$\tilde{a}_{EI}(x; X_N) = \frac{1}{H} \sum_{h=1}^{H} a_{EI}(x; X_N, \theta_h) \tag{12}$$

as our approximate marginalized acquisition function. The choice of the next place to sample is $x_{next} = \operatorname{argmax}_{x \in \mathbb{R}^D} \tilde{a}_{EI}(x; X_N)$, which we find by using gradient descent based methods starting from a dense set of points in the input space.

To get more intuition on how $\nu$ changes the behaviour of the acquisition function, we study an example in Figure 5. Here we fix all hyperparameters other than $\nu$ and plot the acquisition functions varying $\nu$. In this example, it is clear that in certain scenarios the TP prior and GP prior will lead to very different proposals given the same information.

5.2.2 Experiments

We compare a TP prior with a Matérn plus a delta function kernel to a GP prior with the same kernel, for Bayesian optimization. To integrate away uncertainty we slice sample the hyperparameters [Neal, 2003]. We consider 3 functions: a 1-dim synthetic sinusoidal, the 2-dim Branin-Hoo function and a 6-dim Hartmann function. All the results are shown in Figure 6.

Sinusoidal synthetic function. In this experiment we aimed to find the minimum of $f(x) = -(x - 1)^2 \sin(3x + 5x^{-1} + 1)$ in the interval $[5, 10]$. The function has 2 local minima in this interval. TP optimization clearly outperforms GP optimization in this problem; the TP was able to come to within 0.1% of the minimum in 8.1 ± 0.4 iterations whilst the GP took 10.7 ± 0.6 iterations.

Branin-Hoo function. This function is a popular benchmark for optimization methods [Jones, 2001] and is defined on the set $\{(x_1, x_2) : 0 \le x_1 \le 15, -5 \le x_2 \le 15\}$. We initialized the runs with 4 initial observations, one for each corner of the input square.

Hartmann function. This is a function with 6 local minima in $[0, 1]^6$ [Picheny et al., 2013]. The runs are initialised with 6 observations at corners of the unit cube in $\mathbb{R}^6$. Notice that the TP tends to behave more like a step function whereas the Gaussian process' rate of improvement is somewhat more constant. The reason for this behaviour is that the TP tends to more thoroughly explore any modes which it has found, before moving away from these modes. This phenomenon seems more prevalent in higher dimensions.

6 CONCLUSIONS

We have shown that the inverse Wishart process (IWP) is an appropriate prior over covariance matrices of arbitrary size. We used an IWP prior over a GP kernel and showed that marginalizing over the IWP results in a Student-t process (TP). The TP has consistent marginals, closed form conditionals and contains the Gaussian process as a special case. We also proved that the TP is the only elliptical process other than the GP which has an analytically representable density function. The TP prior was applied in regression and Bayesian optimization tasks, showing improved performance over GPs with no additional computational costs.

The take home message for practitioners should be that the TP has many if not all of the benefits of GPs, but with increased modelling flexibility at no extra cost. Our work suggests that it could be useful to replace GPs with TPs in almost any application. The added flexibility of the TP is orthogonal to the choice of kernel, and could complement recent expressive closed form kernels [Wilson and Adams, 2013, Wilson et al., 2013] in future work.

References

C. Archambeau and F. Bach. Multiple Gaussian Process Models. Advances in Neural Information Processing Systems, 2010.

E. Brochu, M. Cora, and N. de Freitas. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Applications to Active User Modeling and Hierarchical Reinforcement Learning. arXiv, 2010.

P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. Modeling Wine Preferences by Data Mining from Physicochemical Properties. Decision Support Systems, Elsevier, 2009.

A. P. Dawid. Spherical Matrix Distributions and a Multivariate Model. J. R. Statistical Society B, 1977.

A. P. Dawid. Some Matrix-Variate Distribution Theory: Notational Considerations and a Bayesian Application. Biometrika, 1981.

A. Edelman and N. Raj Rao. Random Matrix Theory. Acta Numerica, 1:1–65, 2005.

K. T. Fang, S. Kotz, and K. W. Ng. Symmetric Multivariate and Related Distributions. Chapman & Hall, 1989.

D. R. Jones. A Taxonomy of Global Optimization Methods Based on Response Surfaces. Journal of Global Optimization, 21(4):345–383, 2001.

D. Kelker. Distribution Theory of Spherical Distributions and a Location-Scale Parameter. Sankhya, Ser. A, 1970.

R. M. Neal. Monte Carlo Implementation of Gaussian Process Models for Bayesian Regression and Classification. Technical Report No. 9702, Dept of Statistics, University of Toronto, 1997.

R. M. Neal. Slice Sampling. Annals of Statistics, 31(3):705–767, 2003.

R. M. Neal. Handbook of Markov chain Monte Carlo. Chapman & Hall/CRC, 2011.

V. Picheny, T. Wagner, and D. Ginsbourger. A Benchmark of Kriging-Based Infill Criteria for Noisy Optimization. Structural and Multidisciplinary Optimization, 48(3):607–626, 2013.

C. E. Rasmussen. Evaluation of Gaussian Processes and other Methods for Non-Linear Regression. PhD thesis, Graduate Department of Computer Science, University of Toronto, 1996.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

J. Snoek, H. Larochelle, and R. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. Advances in Neural Information Processing Systems, 2012.

J. Vanhatalo, P. Jylanki, and A. Vehtari. Gaussian Process Regression with Student-t Likelihood. Advances in Neural Information Processing Systems, pages 1910–1918, 2009.

A. G. Wilson and R. P. Adams. Gaussian process covariance kernels for pattern discovery and extrapolation. Proceedings of the 30th International Conference on Machine Learning, 2013.

A. G. Wilson, E. Gilboa, A. Nehorai, and J. P. Cunningham. GPatt: Fast multidimensional pattern extrapolation with Gaussian processes. arXiv, 2013.

Z. Xu, F. Yan, and Y. Qi. Sparse Matrix-Variate t Process Blockmodel. 2011.

S. Yu, V. Tresp, and K. Yu. Robust Multi-Task Learning with t-Processes. 2007.

Y. Zhang and D. Y. Yeung. Multi-Task Learning using Generalized t Process. Proceedings of the 13th Conference on Artificial Intelligence and Statistics, 2010.

Supplementary Material

In Appendix A, we provide proofs of the Lemmas and Corollaries from our paper. In Appendix B, we give the derivatives of the log marginal likelihood of the Student-t process, which are useful for hyperparameter learning. In Appendix C, we offer more insight into why two seemingly different covariance priors for a Gaussian process lead to the same marginal distribution.

A Proofs

Lemma. [1] The multivariate Student-t is consistent under marginalization.

Proof. Assume the generative process of equation 3 of the main text. For any principal submatrix $\Sigma_{11}$ of $\Sigma$, $\Sigma_{11}$ is $\mathrm{IW}_{n_1}(\nu, K_{11})$ distributed. Furthermore, $y_1 \mid \Sigma_{11} \sim N_{n_1}(\mu_1, (\nu - 2)\Sigma_{11})$, since the Gaussian distribution is consistent under marginalization. Hence $y_1 \sim \mathrm{MVT}_{n_1}(\nu, \mu_1, K_{11})$.

Lemma. [2] Suppose $f \sim \mathcal{TP}(\nu, \Phi, k)$ and $g \sim \mathcal{GP}(\Phi, k)$. Then $f$ tends to $g$ in distribution as $\nu \to \infty$.

Proof. It is sufficient to show convergence in density for any finite collection of inputs. Let $y \sim \mathrm{MVT}_n(\nu, \phi, K)$ and set $\beta = (y - \phi)^\top K^{-1} (y - \phi)$. Then

$$p(y) \propto \left(1 + \frac{\beta}{\nu - 2}\right)^{-(\nu + n)/2} \to e^{-\beta/2}$$

as $\nu \to \infty$. Hence the distribution of $y$ tends to a $N_n(\phi, K)$ distribution as $\nu \to \infty$.
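The limit in this proof is easy to check numerically. The sketch below is only an illustration: the values of `beta` and `n` are arbitrary choices rather than values from the paper, and the helper name `mvt_kernel` is hypothetical. It compares the unnormalized Student-t factor with its Gaussian limit for growing $\nu$.

```python
import math

def mvt_kernel(beta, nu, n):
    """Unnormalized MVT density factor (1 + beta/(nu - 2))^(-(nu + n)/2)."""
    return math.exp(-0.5 * (nu + n) * math.log1p(beta / (nu - 2.0)))

beta, n = 2.5, 3                       # illustrative values, not from the paper
gauss = math.exp(-0.5 * beta)          # Gaussian limit exp(-beta/2)
nus = (5.0, 50.0, 500.0, 5000.0)
errors = [abs(mvt_kernel(beta, nu, n) - gauss) for nu in nus]
for nu, err in zip(nus, errors):
    print(f"nu={nu:7.0f}  |kernel - exp(-beta/2)| = {err:.2e}")
```

The gap shrinks as $\nu$ grows, which is the pointwise convergence the proof relies on.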

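Lemma. [3] Suppose $y \sim \mathrm{MVT}_n(\nu, \phi, K)$ and let $y_1$ and $y_2$ represent the first $n_1$ and remaining $n_2$ entries of $y$ respectively. Then

$$y_2 \mid y_1 \sim \mathrm{MVT}_{n_2}\!\left(\nu + n_1,\ \tilde{\phi}_2,\ \frac{\nu + \beta_1 - 2}{\nu + n_1 - 2} \times \tilde{K}_{22}\right), \qquad (13)$$

where $\tilde{\phi}_2 = K_{21} K_{11}^{-1} (y_1 - \phi_1) + \phi_2$, $\beta_1 = (y_1 - \phi_1)^\top K_{11}^{-1} (y_1 - \phi_1)$ and $\tilde{K}_{22} = K_{22} - K_{21} K_{11}^{-1} K_{12}$.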

Proof. Let $\beta_2 = (y_2 - \tilde{\phi}_2)^\top \tilde{K}_{22}^{-1} (y_2 - \tilde{\phi}_2)$. Note that $\beta_1 + \beta_2 = (y - \phi)^\top K^{-1} (y - \phi)$. We have

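$$p(y_2 \mid y_1) = \frac{p(y_1, y_2)}{p(y_1)} \propto \left(1 + \frac{\beta_1 + \beta_2}{\nu - 2}\right)^{-(\nu + n)/2} \left(1 + \frac{\beta_1}{\nu - 2}\right)^{(\nu + n_1)/2} \propto \left(1 + \frac{\beta_2}{\beta_1 + \nu - 2}\right)^{-(\nu + n)/2}.$$

Comparing this expression to the definition of a MVT density function gives the required result.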
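The final comparison can be spelled out term by term. In the sketch below, $\nu'$ and $K'$ are shorthand introduced here (not notation from the paper) for the degrees of freedom and the scale matrix claimed in equation 13; the quadratic form and the exponent of an $\mathrm{MVT}_{n_2}(\nu', \tilde{\phi}_2, K')$ density then match the last expression of the proof.

```latex
\begin{align*}
\nu' &= \nu + n_1, \qquad
K' = \frac{\nu + \beta_1 - 2}{\nu + n_1 - 2}\,\tilde{K}_{22}, \\
\frac{(y_2 - \tilde{\phi}_2)^\top K'^{-1} (y_2 - \tilde{\phi}_2)}{\nu' - 2}
  &= \frac{\nu + n_1 - 2}{\nu + \beta_1 - 2} \cdot \frac{\beta_2}{\nu + n_1 - 2}
   = \frac{\beta_2}{\beta_1 + \nu - 2}, \\
-\frac{\nu' + n_2}{2} &= -\frac{\nu + n_1 + n_2}{2} = -\frac{\nu + n}{2}.
\end{align*}
```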

Lemma. [4] Let $K \in \Pi(n)$, $\phi \in \mathbb{R}^n$, $\nu > 2$, $\rho > 0$ and

$$r^{-1} \sim \Gamma(\nu/2, \rho/2), \qquad y \mid r \sim N_n(\phi, r(\nu - 2)K/\rho), \qquad (14)$$

then marginally $y \sim \mathrm{MVT}_n(\nu, \phi, K)$.

Proof. Let $\beta = (y - \phi)^\top K^{-1} (y - \phi)$. We can analytically marginalize out the scalar $r$:

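$$\begin{aligned}
p(y) = \int p(y \mid r)\, p(r)\, dr
&\propto \int \exp\!\left(-\frac{\rho\beta}{2(\nu - 2) r}\right) r^{-n/2} \exp\!\left(-\frac{\rho}{2r}\right) r^{-(\nu + 2)/2}\, dr \\
&\propto \left(1 + \frac{\beta}{\nu - 2}\right)^{-(\nu + n)/2} \int \exp\!\left(-\frac{1}{2r}\right) r^{-(\nu + n + 2)/2}\, dr \\
&\propto \left(1 + \frac{\beta}{\nu - 2}\right)^{-(\nu + n)/2}.
\end{aligned}$$

Hence $y \sim \mathrm{MVT}_n(\nu, \phi, K)$. Note the redundancy in $\rho$. Without loss of generality, let $\rho = 1$.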
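A small Monte Carlo sketch can make the scale mixture of equation 14 concrete. It is restricted to the scalar case $n = 1$ and assumes that $\Gamma(\nu/2, \rho/2)$ is a shape and rate parametrization, which is what the integrand in the proof implies. The function names and parameter values are illustrative choices. Under these assumptions the marginal variance of $y$ should be close to $K$ for any $\rho$.

```python
import math
import random

def sample_y(nu, phi, k, rho, rng):
    """One draw of scalar y from the hierarchy of equation 14 (n = 1)."""
    # r^{-1} ~ Gamma(shape=nu/2, rate=rho/2); gammavariate takes a scale, so pass 2/rho
    r = 1.0 / rng.gammavariate(nu / 2.0, 2.0 / rho)
    return rng.gauss(phi, math.sqrt(r * (nu - 2.0) * k / rho))

def sample_variance(nu, phi, k, rho, num=200_000, seed=0):
    """Unbiased sample variance of `num` draws of y."""
    rng = random.Random(seed)
    draws = [sample_y(nu, phi, k, rho, rng) for _ in range(num)]
    mean = sum(draws) / num
    return sum((d - mean) ** 2 for d in draws) / (num - 1)

nu, phi, k = 8.0, 1.0, 2.0             # illustrative values, not from the paper
var_rho1 = sample_variance(nu, phi, k, rho=1.0)
var_rho3 = sample_variance(nu, phi, k, rho=3.0)
print(var_rho1, var_rho3)              # both estimates should be close to k
```

Both estimates land near $k$ regardless of $\rho$, which illustrates the redundancy noted at the end of the proof.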

Corollary. [7] Suppose $\mathcal{Y} = \{y_i\}$ is an elliptical process. Any finite collection $z = \{z_1, \ldots, z_n\} \subset \mathcal{Y}$ has an analytically representable density if and only if $\mathcal{Y}$ is either a Gaussian process or a Student-t process.

Proof. By Theorem 6, we need to be able to analytically solve $\int p(z \mid r)\, p(r)\, dr$, where $z \mid r \sim N_n(\mu, r\,\Omega\Omega^\top)$. This is possible either when $r$ is a constant with probability 1 or when $r \sim \Gamma^{-1}(\nu/2, 1/2)$, the inverse gamma distribution. These lead to the Gaussian and Student-t processes respectively.

B Marginal Likelihood Derivatives

Being able to analytically compute the derivatives of the log marginal likelihood with respect to the hyperparameters is useful for hyperparameter learning, e.g. by maximum likelihood or Hamiltonian (Hybrid) Monte Carlo. The log marginal likelihood is

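$$\log p(y \mid \nu, K_\theta) = -\frac{n}{2} \log\big((\nu - 2)\pi\big) - \frac{1}{2} \log |K_\theta| + \log \frac{\Gamma\!\left(\frac{\nu + n}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)} - \frac{\nu + n}{2} \log\!\left(1 + \frac{\beta}{\nu - 2}\right),$$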

where $\beta = (y - \phi)^\top K_\theta^{-1} (y - \phi)$, and its derivative with respect to a hyperparameter $\theta$ is

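$$\frac{\partial}{\partial \theta} \log p(y \mid \nu, \phi, K_\theta) = \frac{1}{2} \operatorname{Tr}\!\left( \left( \frac{\nu + n}{\nu + \beta - 2}\, \alpha\alpha^\top - K_\theta^{-1} \right) \frac{\partial K_\theta}{\partial \theta} \right),$$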

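where $\alpha = K_\theta^{-1} (y - \phi)$. We may also learn $\nu$ using gradient based methods and the following derivative:

$$\frac{\partial}{\partial \nu} \log p(y \mid \nu, K_\theta) = -\frac{n}{2(\nu - 2)} + \frac{1}{2}\psi\!\left(\frac{\nu + n}{2}\right) - \frac{1}{2}\psi\!\left(\frac{\nu}{2}\right) - \frac{1}{2} \log\!\left(1 + \frac{\beta}{\nu - 2}\right) + \frac{(\nu + n)\beta}{2(\nu - 2)^2 + 2\beta(\nu - 2)}, \qquad (15)$$

where $\psi$ is the digamma function.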
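Gradient formulas of this kind are easy to get wrong, so a finite-difference check is a useful companion. The sketch below is an illustration under stated assumptions: it uses a hypothetical two-point squared-exponential kernel with lengthscale `theta` and zero mean ($\phi = 0$), a small hand-written `digamma`, and arbitrary test values. The $\psi$ terms carry the factor $1/2$ that arises from differentiating $\log \Gamma((\nu + n)/2)$ and $\log \Gamma(\nu/2)$ with respect to $\nu$.

```python
import math

def digamma(x):
    """Digamma for x > 0: upward recurrence, then an asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def log_ml(nu, beta, n, logdet_K):
    """Student-t log marginal likelihood in terms of beta and log|K|."""
    return (-0.5 * n * math.log((nu - 2.0) * math.pi) - 0.5 * logdet_K
            + math.lgamma(0.5 * (nu + n)) - math.lgamma(0.5 * nu)
            - 0.5 * (nu + n) * math.log1p(beta / (nu - 2.0)))

def dlog_ml_dnu(nu, beta, n):
    """Derivative with respect to nu (equation 15, psi terms scaled by 1/2)."""
    return (-n / (2.0 * (nu - 2.0))
            + 0.5 * digamma(0.5 * (nu + n)) - 0.5 * digamma(0.5 * nu)
            - 0.5 * math.log1p(beta / (nu - 2.0))
            + (nu + n) * beta / (2.0 * (nu - 2.0) ** 2 + 2.0 * beta * (nu - 2.0)))

def kernel_2x2(theta, dist=1.0, jitter=0.1):
    """Hypothetical 2-point kernel: returns (diagonal, off-diagonal, d off-diagonal / d theta)."""
    k = math.exp(-0.5 * dist ** 2 / theta ** 2)
    return 1.0 + jitter, k, k * dist ** 2 / theta ** 3

def solve_2x2(diag, off, y):
    """beta, log|K|, alpha = K^{-1} y and the off-diagonal entry of K^{-1}."""
    det = diag * diag - off * off
    inv_diag, inv_off = diag / det, -off / det
    alpha = (inv_diag * y[0] + inv_off * y[1], inv_off * y[0] + inv_diag * y[1])
    return y[0] * alpha[0] + y[1] * alpha[1], math.log(det), alpha, inv_off

def log_ml_theta(nu, theta, y):
    diag, off, _ = kernel_2x2(theta)
    beta, logdet, _, _ = solve_2x2(diag, off, y)
    return log_ml(nu, beta, 2, logdet)

def dlog_ml_dtheta(nu, theta, y):
    diag, off, dk = kernel_2x2(theta)
    beta, _, alpha, inv_off = solve_2x2(diag, off, y)
    c = (nu + 2.0) / (nu + beta - 2.0)          # (nu + n)/(nu + beta - 2) with n = 2
    # dK/dtheta is zero on the diagonal and dk off it, so Tr(M dK) = 2 * dk * M[0][1]
    m01 = c * alpha[0] * alpha[1] - inv_off
    return 0.5 * 2.0 * dk * m01

def central_diff(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2.0 * h)

nu, theta, y = 5.0, 0.8, (0.7, -1.2)            # arbitrary test values
diag, off, _ = kernel_2x2(theta)
beta, logdet, _, _ = solve_2x2(diag, off, y)
grad_nu = dlog_ml_dnu(nu, beta, 2)
grad_nu_fd = central_diff(lambda v: log_ml(v, beta, 2, logdet), nu)
grad_theta = dlog_ml_dtheta(nu, theta, y)
grad_theta_fd = central_diff(lambda t: log_ml_theta(nu, t, y), theta)
print(grad_nu, grad_nu_fd)
print(grad_theta, grad_theta_fd)
```

For larger problems one would replace the 2 by 2 helpers with a Cholesky-based solver; the closed-form inverse is used here only to keep the check dependency-free.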

C More Insight Into the Inverse Wishart Process and Inverse Gamma Priors

As a reminder, we define a Wishart distribution as follows.

Definition. A random matrix $\Sigma \in \Pi(n)$ is Wishart distributed with parameters $\nu > n - 1$, $K \in \Pi(n)$, and we write $\Sigma \sim W_n(\nu, K)$, if its density is given by

$$p(\Sigma) = c_n(\nu, K)\, |\Sigma|^{(\nu - n - 1)/2} \exp\!\left(-\frac{1}{2} \operatorname{Tr}\!\left(K^{-1} \Sigma\right)\right), \qquad (16)$$

where $c_n(\nu, K) = \left( |K|^{\nu/2}\, 2^{\nu n/2}\, \Gamma_n(\nu/2) \right)^{-1}$.

C.1 The Multivariate Gamma Function

The function in the normalizing constant of the Wishart distribution is called the multivariate gamma function and is defined as follows

Definition. The multivariate gamma function, $\Gamma_n(\cdot)$, is a generalization of the gamma function defined as

$$\Gamma_n(a) = \int_{S > 0} |S|^{a - (n+1)/2} \exp\big(-\operatorname{Tr}(S)\big)\, dS, \qquad (17)$$

where $S > 0$ means $S$ is positive definite. In the following lemma we illustrate an explicit relationship between the multivariate gamma function and the gamma function.

Lemma. [A]

$$\Gamma_n(a) = \pi^{n(n-1)/4} \prod_{j=1}^{n} \Gamma\big(a + (1 - j)/2\big). \qquad (18)$$

Proof.

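Partition $S$ into the scalar $S_{11}$, the off-diagonal blocks $S_{12} = S_{21}^\top$ and the remaining block $S_{22}$, and let $S_{22.1} = S_{22} - S_{21} S_{11}^{-1} S_{12}$ denote the Schur complement of $S_{11}$. Then

$$\begin{aligned}
\Gamma_n(a) &= \int_{S > 0} |S|^{a - (n+1)/2} \exp\big(-\operatorname{Tr}(S)\big)\, dS \\
&= \int_{S > 0} S_{11}^{a - (n+1)/2} \exp(-S_{11})\, |S_{22.1}|^{a - (n+1)/2} \exp\big(-\operatorname{Tr}(S_{22.1})\big) \\
&\qquad \times \exp\big(-\operatorname{Tr}(S_{21} S_{11}^{-1} S_{12})\big)\, dS_{11}\, dS_{12}\, dS_{22.1} \\
&= \int_{S_{11} > 0} (\pi S_{11})^{(n-1)/2}\, S_{11}^{a - (n+1)/2} \exp(-S_{11})\, dS_{11} \\
&\qquad \times \int_{S_{22.1} > 0} |S_{22.1}|^{a - (n+1)/2} \exp\big(-\operatorname{Tr}(S_{22.1})\big)\, dS_{22.1} \\
&= \pi^{(n-1)/2}\, \Gamma(a)\, \Gamma_{n-1}(a - 1/2).
\end{aligned}$$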

This recursive relationship and the fact that $\Gamma_1(b) = \Gamma(b)$ imply

$$\begin{aligned}
\Gamma_n(a) &= \prod_{j=1}^{n} \pi^{(j-1)/2}\, \Gamma\big(a - (j - 1)/2\big) \\
&= \pi^{n(n-1)/4} \prod_{j=1}^{n} \Gamma\big(a + (1 - j)/2\big),
\end{aligned}$$

which is as required.

A simple corollary of this result will be key later.

Corollary. [B]

$$\frac{\Gamma_n(a)}{\Gamma_n(a - 1/2)} = \frac{\Gamma(a)}{\Gamma(a - n/2)}. \qquad (19)$$
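Both the product formula of Lemma A and the ratio in Corollary B can be checked numerically in log space. The sketch below uses only the standard library; the helper name `log_mv_gamma` and the values of `n` and `a` are illustrative choices, picked so that every gamma function argument is positive.

```python
import math

def log_mv_gamma(n, a):
    """log Gamma_n(a) via the product formula of Lemma A (needs a > (n - 1)/2)."""
    return (n * (n - 1) / 4.0) * math.log(math.pi) + sum(
        math.lgamma(a + (1 - j) / 2.0) for j in range(1, n + 1))

n, a = 4, 3.7                                   # illustrative values with a - n/2 > 0

# Corollary B: Gamma_n(a) / Gamma_n(a - 1/2) = Gamma(a) / Gamma(a - n/2)
lhs = log_mv_gamma(n, a) - log_mv_gamma(n, a - 0.5)
rhs = math.lgamma(a) - math.lgamma(a - n / 2.0)

# Recursion from the proof: Gamma_n(a) = pi^{(n-1)/2} Gamma(a) Gamma_{n-1}(a - 1/2)
rec = 0.5 * (n - 1) * math.log(math.pi) + math.lgamma(a) + log_mv_gamma(n - 1, a - 0.5)

print(lhs, rhs)
print(log_mv_gamma(n, a), rec)
```

The ratio identity follows because the product in Lemma A telescopes: each factor $\Gamma(a - (j-1)/2)$ in the numerator cancels against a factor in the denominator, except the first and the last.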

C.2 Two Different Covariance Priors

The two generative processes we are interested in are

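$$\begin{aligned}
r^{-1} &\sim \Gamma(\nu/2, 1/2), & \Omega &\sim W_n(\nu + n - 1, K^{-1}), \\
y_1 \mid r &\sim N_n\big(0, (\nu - 2)\, r K\big), & y_2 \mid \Omega &\sim N_n\big(0, (\nu - 2)\, \Omega^{-1}\big),
\end{aligned}$$

where $n \in \mathbb{N}$, $\nu > 2$ and $K$ is an $n \times n$ symmetric, positive definite matrix.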

The marginal distribution for y1 is

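$$\begin{aligned}
p(y_1) &= \int p(y_1 \mid r)\, p(r)\, dr \\
&= \int \big(2\pi r(\nu - 2)\big)^{-n/2} |K|^{-1/2} \exp\!\left(-\frac{y_1^\top K^{-1} y_1}{2(\nu - 2) r}\right) \frac{r^{-\nu/2 - 1} \exp\big(-1/(2r)\big)}{2^{\nu/2}\, \Gamma(\nu/2)}\, dr \\
&= \frac{\big(2\pi(\nu - 2)\big)^{-n/2} |K|^{-1/2}}{2^{\nu/2}\, \Gamma(\nu/2)} \int r^{-(\nu + n)/2 - 1} \exp\!\left(-\left(1 + \frac{y_1^\top K^{-1} y_1}{\nu - 2}\right) \Big/ 2r\right) dr \\
&= \frac{\big(2\pi(\nu - 2)\big)^{-n/2} |K|^{-1/2}}{2^{\nu/2}\, \Gamma(\nu/2)} \left(\left(1 + \frac{y_1^\top K^{-1} y_1}{\nu - 2}\right) \Big/ 2\right)^{-(\nu + n)/2} \Gamma\big((\nu + n)/2\big) \\
&= \big(\pi(\nu - 2)\big)^{-n/2} |K|^{-1/2} \left(1 + \frac{y_1^\top K^{-1} y_1}{\nu - 2}\right)^{-(\nu + n)/2} \frac{\Gamma\big((\nu + n)/2\big)}{\Gamma(\nu/2)}. \qquad (20)
\end{aligned}$$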

The marginal distribution for y2 is

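$$\begin{aligned}
p(y_2) &= \int p(y_2 \mid \Omega)\, p(\Omega)\, d\Omega \\
&= \int \big(2\pi(\nu - 2)\big)^{-n/2} |\Omega|^{1/2} \exp\!\left(-\frac{y_2^\top \Omega\, y_2}{2(\nu - 2)}\right) \\
&\qquad \times c_n(\nu + n - 1, K^{-1})\, |\Omega|^{(\nu - 2)/2} \exp\!\left(-\frac{1}{2} \operatorname{Tr}(K\Omega)\right) d\Omega \\
&= \big(2\pi(\nu - 2)\big)^{-n/2}\, c_n(\nu + n - 1, K^{-1}) \\
&\qquad \times \int |\Omega|^{(\nu - 1)/2} \exp\!\left(-\frac{1}{2} \operatorname{Tr}\!\left(\left(K + \frac{y_2 y_2^\top}{\nu - 2}\right) \Omega\right)\right) d\Omega \\
&= \big(2\pi(\nu - 2)\big)^{-n/2}\, c_n(\nu + n - 1, K^{-1}) \times c_n\!\left(\nu + n, \left(K + \frac{y_2 y_2^\top}{\nu - 2}\right)^{-1}\right)^{-1} \\
&= \big(2\pi(\nu - 2)\big)^{-n/2} \left( |K|^{-(\nu + n - 1)/2}\, 2^{(\nu + n - 1)n/2}\, \Gamma_n\big((\nu + n - 1)/2\big) \right)^{-1} \\
&\qquad \times |K|^{-(\nu + n)/2} \left(1 + \frac{y_2^\top K^{-1} y_2}{\nu - 2}\right)^{-(\nu + n)/2} 2^{(\nu + n)n/2}\, \Gamma_n\big((\nu + n)/2\big) \\
&= \big(\pi(\nu - 2)\big)^{-n/2} |K|^{-1/2} \left(1 + \frac{y_2^\top K^{-1} y_2}{\nu - 2}\right)^{-(\nu + n)/2} \frac{\Gamma_n\big((\nu + n)/2\big)}{\Gamma_n\big((\nu + n - 1)/2\big)}. \qquad (21)
\end{aligned}$$

Both marginal distributions are equivalent given the result in Corollary B.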
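The last claim amounts to a single substitution. As a brief sketch, setting $a = (\nu + n)/2$ in Corollary B turns the ratio of multivariate gamma functions in equation 21 into the ratio of ordinary gamma functions in equation 20:

```latex
\begin{align*}
\frac{\Gamma_n\big((\nu + n)/2\big)}{\Gamma_n\big((\nu + n - 1)/2\big)}
  = \frac{\Gamma_n(a)}{\Gamma_n(a - 1/2)}\bigg|_{a = (\nu + n)/2}
  = \frac{\Gamma\big((\nu + n)/2\big)}{\Gamma\big((\nu + n)/2 - n/2\big)}
  = \frac{\Gamma\big((\nu + n)/2\big)}{\Gamma(\nu/2)}.
\end{align*}
```

All remaining factors in equations 20 and 21 already coincide, so the two generative processes give the same marginal density.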