
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17)

Student-t Process Regression with Student-t Likelihood

Qingtao Tang†, Li Niu‡, Yisen Wang†, Tao Dai†, Wangpeng An†, Jianfei Cai‡, Shu-Tao Xia†
† Department of Computer Science and Technology, Tsinghua University, China
‡ School of Computer Science and Engineering, Nanyang Technological University, Singapore
{tqt15, dait14, wangys14, awp15}@mails.tsinghua.edu.cn; [email protected]; [email protected]; [email protected]

Abstract

Gaussian Process Regression (GPR) is a powerful Bayesian method. However, the performance of GPR can be significantly degraded when the training data are contaminated by outliers, including target outliers and input outliers. Although there are some variants of GPR (e.g., GPR with Student-t likelihood (GPRT)) aiming to handle outliers, most of the variants focus on handling the target outliers, while little effort has been made to deal with the input outliers. In contrast, in this work, we aim to handle both the target outliers and the input outliers at the same time. Specifically, we replace the Gaussian noise in GPR with independent Student-t noise to cope with the target outliers. Moreover, to enhance the robustness w.r.t. the input outliers, we use a Student-t Process prior instead of the common Gaussian Process prior, leading to Student-t Process Regression with Student-t Likelihood (TPRT). We theoretically show that TPRT is more robust to both input and target outliers than GPR and GPRT, and prove that both GPR and GPRT are special cases of TPRT. Various experiments demonstrate that TPRT outperforms GPR and its variants on both synthetic and real datasets.

1 Introduction

Gaussian Process Regression (GPR) is a powerful Bayesian method with good interpretability, non-parametric flexibility, and simple hyper-parameter learning [Rasmussen, 2006]. Due to these nice properties, GPR has been successfully applied to many fields, such as reinforcement learning [Rasmussen et al., 2003], computer vision [Liu and Vasconcelos, 2015], and spatio-temporal modeling [Senanayake et al., 2016].

In GPR, the basic model is y = f(X) + ε, where y = {y_i}_{i=1}^n is the target vector, X = {x_i}_{i=1}^n is the collection of input vectors, and ε is the noise. The latent function f is given a Gaussian Process prior and ε is assumed to be independent and identically distributed (i.i.d.) Gaussian noise. In practice, as the number of input vectors is finite, the latent variables f(X) follow a multivariate Gaussian distribution. Due to the thin-tailed property of the Gaussian distribution, GPR performs poorly on data from heavy-tailed distributions or with outliers. However, real-world data often exhibit heavy-tailed phenomena [Nair et al., 2013] and contain outliers [Bendre et al., 1994; Niu et al., 2015; 2016].

In order to handle the outliers, heavy-tailed distributions (e.g., the Laplace distribution, mixtures of Gaussians, and the Student-t distribution) have been introduced into GPR. In particular, Laplace noise is used in [Kuss, 2006], while two forms of Gaussian corruption are mixed in [Naish-Guzman and Holden, 2008]. In [Neal, 1997; Vanhatalo et al., 2009; Jylänki et al., 2011], the noise ε is assumed to follow the Student-t distribution (GPRT). However, all these methods are only robust to the target outliers, but not robust to the outliers in the inputs X, since the latent variables f(X) are still assumed to follow the Gaussian distribution.

Related to the robustness w.r.t. the outliers in the inputs X, some works [Shah et al., 2014; Solin and Särkkä, 2015; Tang et al., 2016] rely on the Student-t Process to handle the input outliers. In particular, the method in [Shah et al., 2014; Solin and Särkkä, 2015] replaces the Gaussian Process with the Student-t Process and incorporates the noise term into the kernel function (TPRK) for computational simplicity. Following [Shah et al., 2014; Solin and Särkkä, 2015], an input-dependent Student-t noise (TPRD) is proposed in [Tang et al., 2016]. Note that Tang et al. [2016] prove that TPRK, TPRD, and GPR have the same predictive mean if the kernel has a certain property named the β property, which is actually satisfied by most kernels. Taking the frequently used kernels implemented in GPML [Rasmussen and Nickisch, 2010] (the most popular toolbox in the Gaussian Process community) as examples, 24 out of 28 kernels have the β property, for which the above Student-t Process based methods (i.e., TPRK and TPRD) have the same predictive value as GPR and thus fail to deal with the input outliers effectively.

In this paper, with the aim to handle both the input outliers and the target outliers at the same time, we propose Student-t Process Regression with Student-t Likelihood (TPRT). In our model, the latent function f is given a Student-t Process prior, while the noise is assumed to be independent Student-t noise, instead of the noise incorporated into the kernel as in [Shah et al., 2014; Solin and Särkkä, 2015] or the dependent noise as in [Tang et al., 2016]. In addition to owning all the advantages of GPR, such as good interpretability, non-parametric flexibility, and simple hyper-parameter learning, our proposed TPRT method is robust to both input and target outliers, because the Student-t Process prior contributes to the robustness to the input outliers, while the independent Student-t noise assumption copes with the target outliers. One challenge of our TPRT method is that the inference is analytically intractable. To solve the inference problem, we utilize the Laplace approximation for computing the posterior and the marginal likelihood. The computational cost of TPRT is roughly the same as that of GPRT, which also requires approximate inference. From the perspective of the posterior and the marginal likelihood, we show that TPRT is more robust than GPR and GPRT. Besides, both GPR and GPRT are proved to be special cases of TPRT. Finally, extensive experiments also demonstrate the effectiveness of our TPRT method on both synthetic and real datasets.

2 Background

In this section, we briefly introduce Gaussian Process Regression (GPR), provide the definitions of the Student-t distribution and the Student-t Process, and then compare the Gaussian Process (GP) with the Student-t Process (TP).

2.1 Review of GPR

In a regression problem, we have a training set D = {X, y} of n instances, where X = {x_i}_{i=1}^n and x_i denotes a d-dim input vector; y = {y_i}_{i=1}^n and y_i denotes a scalar output or target. In GPR, we have

y_i = f(x_i) + \epsilon_i, \quad i = 1, 2, \ldots, n,    (1)

where ε_i (i = 1, 2, ..., n) is assumed to be i.i.d. Gaussian noise and the latent function f is given a GP prior, implying that any finite subset of latent variables f = {f(x_i)}_{i=1}^n follows a multivariate Gaussian distribution, i.e., p(f | X, K) = N(f | µ, K), where µ is the mean vector and K is the covariance matrix. Specifically, µ is an n-dim vector which is usually assumed to be 0 for simplicity, and K is the covariance matrix with K_{i,j} = k(x_i, x_j; θ_k), in which k is a kernel function and θ_k = (θ_{k1}, θ_{k2}, ..., θ_{kl}) is the set of kernel parameters. As ε_i is i.i.d. Gaussian noise, given the latent variables f, the likelihood can be represented as

p(y \mid f, \sigma) = \mathcal{N}(y \mid f, \sigma^2 I),    (2)

where σ² is the variance of the noise and I is an n × n identity matrix. Based on Bayes' theorem, we can obtain the following marginal likelihood by integrating over f:

p(y \mid X, \sigma, K) = \mathcal{N}(y \mid 0, \Sigma),    (3)

where Σ = K + σ²I. Then, the hyper-parameters σ and θ_k can be learned by minimizing the negative log marginal likelihood

-\ln p(y \mid X, \sigma, K) = \tfrac{1}{2}\, y^\top \Sigma^{-1} y + \tfrac{1}{2} \ln|\Sigma| + \tfrac{n}{2} \ln 2\pi.    (4)

After learning the hyper-parameters σ and θ_k, given an input x_* ∈ R^d, the predictive mean is

\mathbb{E}(f_* \mid X, y, x_*) = k_*^\top \Sigma^{-1} y,    (5)

where k_* = {k(x_i, x_*; θ_k)}_{i=1}^n. Please refer to [Rasmussen, 2006] for more details.
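To make the GPR review above concrete, here is a minimal NumPy sketch of the negative log marginal likelihood (4) and the predictive mean (5); it assumes a zero mean and a precomputed kernel matrix, and the function and variable names are our own rather than part of any toolbox.

import numpy as np

def gpr_neg_log_marglik(K, y, sigma):
    """Negative log marginal likelihood of GPR, Eq. (4), with Sigma = K + sigma^2 I."""
    n = y.shape[0]
    L = np.linalg.cholesky(K + sigma**2 * np.eye(n))        # Sigma = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))     # alpha = Sigma^{-1} y
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * n * np.log(2 * np.pi)

def gpr_predictive_mean(K, k_star, y, sigma):
    """Predictive mean of GPR at a test input, Eq. (5): k_*^T Sigma^{-1} y."""
    n = y.shape[0]
    return k_star @ np.linalg.solve(K + sigma**2 * np.eye(n), y)

In practice, σ and the kernel parameters would be obtained by minimizing the first quantity with a gradient-based optimizer, as described above.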

2.2 Student-t Distribution and Student-t Process

The Student-t distribution [McNeil, 2006] we use in this paper is defined as follows.

Definition 1. An n-dim random vector x ∈ R^n follows the n-variate Student-t distribution with degrees of freedom ν ∈ R_+, mean vector µ ∈ R^n, and correlation matrix R ∈ Π(n) if its joint probability density function (PDF) is given by

St(x \mid \nu, \mu, R) = \frac{\Gamma[(\nu+n)/2]}{\Gamma(\nu/2)\,\nu^{n/2}\pi^{n/2}|R|^{1/2}} \left(1 + \frac{1}{\nu}(x-\mu)^\top R^{-1}(x-\mu)\right)^{-\frac{\nu+n}{2}}.

Given the definition of the Student-t distribution, we can give the definition of the Student-t Process [Shah et al., 2014].

Definition 2. The process f is a Student-t Process (TP) on X with degrees of freedom ν ∈ R_+, mean function m: X → R, and kernel function k: X × X → R if any finite subset of function values follows a multivariate Student-t distribution, i.e., f = {f(x_i)}_{i=1}^n ∼ St(ν, µ, K), where K ∈ Π(n) with K_{ij} = k(x_i, x_j; θ_k) and µ ∈ R^n with µ_i = m(x_i). We denote that the process f is a Student-t Process with degrees of freedom ν, mean function m, and kernel function k by f ∼ TP(ν, m, k).

2.3 Comparison of GP and TP

In [Shah et al., 2014], it has been proved that GP is a special case of TP with degrees of freedom ν → +∞. Among all the elliptical processes with an analytically representable density, TP is the most general one, which implies its expressiveness for nonparametric Bayesian modeling. The comparison of TP and GP is illustrated in Figure 1 ([Shah et al., 2014]), from which we can see that TP allows the samples (blue solid) to be away from the mean (red dashed), while the samples of GP gather around the mean. This indicates that outliers (usually away from the mean) will not have much effect on the mean of TP, but will affect the mean of GP severely, as GP enforces the samples to be close to the mean. Since we make predictions mainly with the predictive mean in practice, TP is expected to be more robust to outliers than GP.

Figure 1: A comparison of TP and GP with identical mean function (red dashed), kernel function, and hyper-parameters. The degrees of freedom ν of the TP is 5. The blue solid lines are samples from the processes and the gray shaded area represents the 95% confidence interval. (a) Gaussian Process; (b) Student-t Process.
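To visualize the difference summarized in Figure 1, one can draw samples from a GP and a TP that share the same kernel matrix. The following NumPy sketch is our own illustration, using the standard chi-square scale-mixture construction of the multivariate Student-t distribution in Definition 1; the function name, seed, and jitter value are illustrative choices.

import numpy as np

def sample_process(K, nu=None, n_samples=5, seed=0):
    """Draw samples from a zero-mean GP (nu=None) or a TP with `nu` degrees of freedom,
    both built on the same kernel matrix K (cf. Figure 1)."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    L = np.linalg.cholesky(K + 1e-8 * np.eye(n))      # jitter for numerical stability
    z = rng.standard_normal((n_samples, n)) @ L.T     # Gaussian samples with covariance K
    if nu is None:
        return z
    g = rng.chisquare(nu, size=(n_samples, 1))        # one chi-square variable per sample
    return z * np.sqrt(nu / g)                        # heavier-tailed Student-t samples

For example, evaluating K on a dense grid of 1-D inputs and plotting the rows of sample_process(K) and sample_process(K, nu=5) reproduces the qualitative behavior in Figure 1: the TP draws wander much farther from the mean than the GP draws.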



3 Our TPRT Method

3.1 Assumptions of TPRT

Now, we introduce our Student-t Process Regression with Student-t Likelihood (TPRT) model. In the regression problem (1), we assume

f \sim TP(\nu_1, m, k),    (6)

\epsilon_i \overset{\mathrm{ind}}{\sim} St(\epsilon_i \mid \nu_2, 0, \sigma^2), \quad i = 1, 2, \ldots, n.    (7)

Compared with GPR, TPRT has the following two differences:

• The latent function f is assumed to be a TP, which is more robust than a GP. From Definition 2, we know that the latent variables f follow the multivariate Student-t distribution instead of the Gaussian distribution. It has been theoretically analyzed in [Box and Tiao, 1962; O'Hagan, 1979] that the Student-t distribution is a robust alternative to the Gaussian distribution. Specifically, the Student-t distribution can reject up to N outliers when there are at least 2N observations in total. With sufficient data, the degrees of freedom ν_1, which control how heavy the tail of the distribution is, can be estimated based on maximum marginal likelihood, yielding an adaptive robust procedure. Based on the above discussion, the latent variables f = {f(x_i)}_{i=1}^n are expected to be more robust to input outliers.

• The noise ε_i (i = 1, 2, ..., n) is assumed to follow the i.i.d. Student-t distribution instead of the i.i.d. Gaussian distribution or the dependent Student-t distribution. With (1) and (7), we can see that the likelihood p(y_i | f(x_i)) is a Student-t distribution, which accounts for the robustness to target outliers. The reasons for using i.i.d. Student-t noise instead of other heavy-tailed distributions are as follows: (a) the Student-t distribution is a natural generalization of the Gaussian distribution; when the degrees of freedom ν approach +∞, the Student-t distribution reduces to the Gaussian distribution. (b) The robustness of the Student-t distribution has been theoretically proved in [Box and Tiao, 1962; O'Hagan, 1979], as mentioned above. (c) The i.i.d. Student-t noise assumption (7) has been successfully used in previous works such as Bayesian linear regression [West, 1984] and spatio-temporal modeling [Chen et al., 2012]. Thus, we expect that TPRT is also robust to outliers in the targets y.

To the best of our knowledge, TPRT is the first improvement of GPR that considers the robustness to both input and target outliers. As discussed in Section 1, the previous approaches in [Kuss, 2006; Naish-Guzman and Holden, 2008; Vanhatalo et al., 2009; Jylänki et al., 2011] replace the i.i.d. Gaussian noise with heavy-tailed noise, only aiming at robustness to outliers in the targets y. The Student-t Process is used in [Shah et al., 2014; Solin and Särkkä, 2015; Tang et al., 2016], but the robustness to input outliers is compromised by their dependent noise assumption. In contrast, our method combines the strengths of these two types of methods and avoids their weaknesses. Specifically, on one hand, the Student-t Process assumption (see (6)) enhances the robustness to input outliers. On the other hand, the independent Student-t noise (likelihood) assumption (see (7)) tackles the thin-tailed issue of Gaussian noise and avoids the problems of the dependent noise.

One challenge of our proposed method is that the inference is analytically intractable. Inspired by the Laplace approximation in Gaussian Process classification [Rasmussen, 2006] and GPRT [Vanhatalo et al., 2009], we propose a Laplace approximation for the posterior and the marginal likelihood, which are required for prediction and hyper-parameter learning, respectively. Note that a similar approach has been considered by West [1984] in the case of robust linear regression and by Rue et al. [2009] in their integrated nested Laplace approximation. In the following, for ease of presentation, we collect all the hyper-parameters into θ = {θ_k, ν_1, ν_2, σ} for the rest of this paper.

3.2 Approximation for the Posterior

In this section, we derive the approximation for the conditional posterior of the latent variables f, and compare it with the posteriors of GPRT and GPR.

In particular, from (6) and Definition 2, we have f ∼ St(ν_1, µ, K). Similarly as in GPR, the mean µ is assumed to be 0 for simplicity. Then, we have

p(f \mid X, \theta) = \frac{\Gamma[(\nu_1+n)/2]}{\Gamma(\nu_1/2)\,\nu_1^{n/2}\pi^{n/2}|K|^{1/2}} \left(1 + \frac{1}{\nu_1} f^\top K^{-1} f\right)^{-\frac{\nu_1+n}{2}}.    (8)

Considering (7) and (1), we have the likelihood

p(y \mid f, X, \theta) = \prod_{i=1}^{n} \frac{\Gamma[(\nu_2+1)/2]}{\Gamma(\nu_2/2)\,\nu_2^{1/2}\pi^{1/2}\sigma} \left[1 + \frac{1}{\nu_2}\left(\frac{y_i - f(x_i)}{\sigma}\right)^2\right]^{-\frac{\nu_2+1}{2}}.    (9)

Based on (8) and (9), the posterior of the latent variables f,

p(f \mid X, y, \theta) \propto p(f \mid X, \theta)\, p(y \mid X, f, \theta),    (10)

is analytically intractable. To approximate the posterior of the latent variables f, denoting the unnormalized posterior p(f | X, θ) p(y | X, f, θ) as exp(Ψ(f)), we use a second-order Taylor expansion of Ψ(f) around the mode of the unnormalized posterior as follows,

\Psi(f) \simeq \Psi(\hat{f}) - \tfrac{1}{2}(f - \hat{f})^\top A^{-1}(f - \hat{f}),    (11)

where f̂ is the mode of Ψ(f),

\hat{f} = \arg\max_f\, p(f \mid X, \theta)\, p(y \mid X, f, \theta) = \arg\min_f\, \ln Q,    (12)

with

\ln Q = \sum_{i=1}^{n} \frac{\nu_2+1}{2} \ln\left[1 + \frac{1}{\nu_2}\left(\frac{y_i - f(x_i)}{\sigma}\right)^2\right] + \frac{\nu_1+n}{2} \ln\left(1 + \frac{1}{\nu_1} f^\top K^{-1} f\right).
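As an aside, the mode-finding problem (12) can be written compactly in code. The following NumPy/SciPy sketch is our own illustration, not the authors' implementation: it evaluates ln Q for given hyper-parameters and locates f̂ with a conjugate gradient optimizer, anticipating the implementation choice discussed in Section 3.5; the helper names are assumptions.

import numpy as np
from scipy.optimize import minimize

def ln_Q(f, y, K_inv, sigma, nu1, nu2):
    """Negative log unnormalized posterior of TPRT, Eq. (12)."""
    n = y.shape[0]
    r = (y - f) / sigma                                   # standardized residuals
    lik_term = 0.5 * (nu2 + 1) * np.sum(np.log1p(r**2 / nu2))
    prior_term = 0.5 * (nu1 + n) * np.log1p(f @ K_inv @ f / nu1)
    return lik_term + prior_term

def find_mode(y, K, sigma, nu1, nu2):
    """Find the posterior mode f_hat by conjugate gradients (cf. Section 3.5)."""
    K_inv = np.linalg.inv(K)                              # illustration only; prefer Cholesky solves
    obj = lambda f: ln_Q(f, y, K_inv, sigma, nu1, nu2)
    res = minimize(obj, x0=np.zeros(len(y)), method='CG')
    return res.x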


In (11), A^{-1} is the negative Hessian matrix of Ψ(f) at f̂:

A^{-1} = -\nabla\nabla \ln\left[p(f \mid X, \theta)\, p(y \mid X, f, \theta)\right] = (\nu_1+n)\,\frac{K^{-1}(\nu_1 + \hat{f}^\top K^{-1}\hat{f}) - 2K^{-1}\hat{f}\hat{f}^\top K^{-1}}{(\nu_1 + \hat{f}^\top K^{-1}\hat{f})^2} + W,

where W is a diagonal matrix with

W_{ii} = -(\nu_2+1)\,\frac{(y_i - \hat{f}_i)^2 - \nu_2\sigma^2}{\left[(y_i - \hat{f}_i)^2 + \nu_2\sigma^2\right]^2}, \quad i = 1, 2, \ldots, n.

Based on (10) and (11), we can reach

p(f \mid X, y, \theta) \propto \exp\left(-\tfrac{1}{2}(f - \hat{f})^\top A^{-1}(f - \hat{f})\right),    (13)

which is the unnormalized PDF of N(f̂, A). Therefore, we obtain the approximation q(f | X, y, θ) for p(f | X, y, θ),

p(f \mid X, y, \theta) \simeq q(f \mid X, y, \theta) \triangleq \mathcal{N}(\hat{f}, A),    (14)

which will be used for making predictions in Section 3.4.

Comparison with GPRT: Note that the Laplace approximation is also utilized in GPRT [Vanhatalo et al., 2009], and the mode f̂' in their approximation can be written as

\hat{f}' = \arg\min_f\, \ln Q',    (15)

with

\ln Q' = \sum_{i=1}^{n} \frac{\nu_2+1}{2} \ln\left[1 + \frac{1}{\nu_2}\left(\frac{y_i - f(x_i)}{\sigma}\right)^2\right] + f^\top K^{-1} f.

We can see that the only difference between (12) and (15) is that the term (ν_1+n)/2 · ln(1 + f^⊤K^{-1}f / ν_1) in (12) is the result of a log transformation of the term f^⊤K^{-1}f in (15). If there are input outliers, the term f^⊤K^{-1}f would be disturbed, and the log transformation can reduce the disturbance. Therefore, the mean of the approximate posterior of TPRT (i.e., (12)) is more robust to input outliers than that of GPRT (i.e., (15)). In fact, similar log transformations have been widely used in data analysis and statistics [Box and Cox, 1964].

Comparison with GPR: For GPR, as the posterior distribution is symmetric, the mean of the posterior is equal to the mode. Thus, the mean of the posterior can be written as

\arg\min_f\, f^\top K^{-1} f + \sum_{i=1}^{n} \left(\frac{y_i - f(x_i)}{\sigma}\right)^2.    (16)

Comparing (12) and (16), we can see that both terms in (12) take the log transformation of the corresponding terms in (16). Thus, when there are outliers in the targets y or in the inputs X, (12) would be more robust than (16). Note that the Student-t distribution is a more general form of the Gaussian distribution, and the degrees of freedom ν_1, ν_2, which control the heavy tails, also affect the log transformation. In particular, when ν_1, ν_2 approach +∞, (12) reduces to (16), which means there is no log transformation.
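The effect of the log transformation discussed above can be seen with a toy numeric comparison. The following sketch is our own illustration with arbitrary example values; it contrasts the quadratic penalty of (16) with the log-compressed penalty of (12) for a residual that is inflated by an outlier.

import numpy as np

# A clean residual and an outlying residual (in units of sigma), chosen for illustration.
residuals = np.array([1.0, 10.0])
nu2 = 4.0

quadratic_penalty = residuals**2                               # as in Eq. (16)
log_penalty = 0.5 * (nu2 + 1) * np.log1p(residuals**2 / nu2)   # as in Eq. (12)

# The outlier inflates the quadratic penalty by a factor of 100,
# but the log-compressed penalty grows only by roughly a factor of 15.
print(quadratic_penalty)   # [  1. 100.]
print(log_penalty)         # approximately [0.56, 8.14]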
3.3 Approximation for the Marginal Likelihood

In this section, we introduce the approximation for the marginal likelihood of TPRT, which is needed for learning the hyper-parameters, and compare it with the marginal likelihoods of GPR and GPRT. In particular, the marginal likelihood of TPRT is

p(y \mid X, \theta) = \int p(f \mid X, \theta)\, p(y \mid X, f, \theta)\, df = \int \exp(\Psi(f))\, df,    (17)

which is also analytically intractable. Similarly to Section 3.2, we utilize the Laplace approximation. Considering the approximation of Ψ(f) in (11), we can rewrite (17) as

p(y \mid X, \theta) \simeq \exp(\Psi(\hat{f})) \int \exp\left(-\tfrac{1}{2}(f - \hat{f})^\top A^{-1}(f - \hat{f})\right) df,

which can be evaluated analytically. Then we obtain an approximation, -ln q(y | X, θ), for the negative log marginal likelihood as follows,

-\ln p(y \mid X, \theta) \simeq -\ln q(y \mid X, \theta) \triangleq \ln Q\big|_{f=\hat{f}} + \tfrac{1}{2}\ln|B| + c,    (18)

where ln Q is the same as in (12), B = KA^{-1}, and

c = -\sum_{i=1}^{n} \ln\frac{\Gamma[(\nu_2+1)/2]}{\Gamma(\nu_2/2)\,\nu_2^{1/2}\pi^{1/2}\sigma} - \ln\frac{\Gamma[(\nu_1+n)/2]}{\Gamma(\nu_1/2)\,\nu_1^{n/2}\pi^{n/2}} - \frac{n}{2}\ln(2\pi).

Note that (18) is differentiable w.r.t. θ and thus the hyper-parameters θ can be estimated by minimizing (18).

Comparisons with GPR and GPRT: The negative log marginal likelihood of GPR is provided in (4), and the approximate negative log marginal likelihood of GPRT [Vanhatalo et al., 2009] is

\ln Q'\big|_{f=\hat{f}'} + \tfrac{1}{2}\ln|B'| + c',    (19)

where ln Q' is the same as in (15), B' = I + KW, and c' = -\sum_{i=1}^{n} \ln\frac{\Gamma[(\nu_2+1)/2]}{\Gamma(\nu_2/2)\,\nu_2^{1/2}\pi^{1/2}\sigma}. Comparing the negative log marginal likelihoods of GPR, GPRT, and TPRT, we can observe that the main difference among them is also caused by the log transformation, similar to the difference among their posteriors discussed in Section 3.2. Due to the log transformation, the negative log marginal likelihood of TPRT is more stable when there are outliers in the inputs or the targets. Thus, the estimation of the hyper-parameters θ, which is the solution of the minimum negative log marginal likelihood, is less affected by the outliers.
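As an illustration of how the approximate objective (18) can be evaluated once the mode f̂ is available, here is a self-contained NumPy sketch under our own naming conventions; it mirrors the quantities defined above (ln Q, W, A^{-1}, B, and c) but is a didactic approximation rather than the authors' implementation.

import numpy as np
from scipy.special import gammaln

def tprt_neg_log_marglik(f_hat, y, K, sigma, nu1, nu2):
    """Approximate negative log marginal likelihood of TPRT, Eq. (18)."""
    n = y.shape[0]
    K_inv = np.linalg.inv(K)                              # illustration only
    r2 = ((y - f_hat) / sigma) ** 2
    s = f_hat @ K_inv @ f_hat
    # ln Q evaluated at the mode f_hat, Eq. (12)
    lnQ = 0.5 * (nu2 + 1) * np.sum(np.log1p(r2 / nu2)) + 0.5 * (nu1 + n) * np.log1p(s / nu1)
    # Negative Hessian A^{-1} of the log unnormalized posterior at f_hat
    W = np.diag(-(nu2 + 1) * ((y - f_hat) ** 2 - nu2 * sigma ** 2)
                / ((y - f_hat) ** 2 + nu2 * sigma ** 2) ** 2)
    A_inv = (nu1 + n) * (K_inv * (nu1 + s)
                         - 2 * np.outer(K_inv @ f_hat, K_inv @ f_hat)) / (nu1 + s) ** 2 + W
    B = K @ A_inv
    # Constant c collecting the Student-t normalizers, Eq. (18)
    c = -n * (gammaln((nu2 + 1) / 2) - gammaln(nu2 / 2) - 0.5 * np.log(nu2 * np.pi) - np.log(sigma)) \
        - (gammaln((nu1 + n) / 2) - gammaln(nu1 / 2) - 0.5 * n * np.log(nu1 * np.pi)) \
        - 0.5 * n * np.log(2 * np.pi)
    _, logdetB = np.linalg.slogdet(B)
    return lnQ + 0.5 * logdetB + c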

3.4 Making Predictions

Once we obtain the approximate posterior q(f | X, y, θ) and the hyper-parameters θ, we can make predictions. Specifically, given a new input x_* ∈ R^d, the predictive mean is computed as

\mathbb{E}(f_* \mid x_*, X, y) = \int \mathbb{E}(f_* \mid f, X, x_*)\, p(f \mid X, y, \theta)\, df \simeq \int \mathbb{E}(f_* \mid f, X, x_*)\, q(f \mid X, y, \theta)\, df = k_*^\top K^{-1}\hat{f},    (20)

where k_* = {k(x_i, x_*; θ_k)}_{i=1}^n, and we use the fact that, for a TP, E(f_* | f, X, x_*) = k_*^⊤ K^{-1} f (see Lemma 3 in [Shah et al., 2014]). The predictive variance can be derived in a similar way.

3.5 Implementation Details

When computing the Laplace approximation q(f | X, y, θ) for the posterior of the latent variables f, the main computational effort lies in finding the mode of the unnormalized log posterior, i.e., solving (12). In [Rasmussen, 2006], Newton's method is used to find the mode. However, in our model, as the negative Hessian matrix of the log posterior is not necessarily positive definite, we cannot use Newton's method and thus use a conjugate gradient algorithm instead.

After obtaining the mode f̂ using (12), the hyper-parameters θ can be estimated by minimizing the approximate negative log marginal likelihood (18). Our implementation is also based on conjugate gradient optimization, in which we need to compute the derivatives of (18) w.r.t. θ. The dependency of the approximate marginal likelihood on θ is two-fold:

\frac{\partial \ln q(y \mid X, \theta)}{\partial \theta_i} = \sum_{k,l} \frac{\partial \ln q(y \mid X, \theta)}{\partial K_{k,l}} \frac{\partial K_{k,l}}{\partial \theta_i} + \frac{\partial \ln q(y \mid X, \theta)}{\partial \hat{f}} \frac{\partial \hat{f}}{\partial \theta_i}.    (21)

In particular, as ln q(y | X, θ) is a function of K, there is an explicit dependency via the terms involving K. Besides, since a change in θ causes a change in f̂, there is an implicit dependency through the terms involving f̂. The explicit derivative of (18) can be easily obtained. The implicit derivative accounts for the dependency of (18) on θ due to the change in the mode f̂. As f̂ is the solution to (12), we have

\frac{\partial \ln Q}{\partial f}\Big|_{f=\hat{f}} = 0.    (22)

Thus, differentiating (18) w.r.t. f̂ can be simplified to (1/2) ∂ln|B|/∂f̂. Then, we have

\frac{\partial \ln q(y \mid X, \theta)}{\partial \hat{f}} \frac{\partial \hat{f}}{\partial \theta_i} = -\frac{1}{2}\frac{\partial \ln|B|}{\partial \hat{f}} \frac{\partial \hat{f}}{\partial \theta_i},    (23)

where ∂f̂/∂θ_i can be obtained by differentiating (22) w.r.t. θ_i (see [Kuss and Rasmussen, 2005]). After obtaining the explicit and implicit derivatives of (18) w.r.t. θ, we can solve for θ using conjugate gradient descent.
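For concreteness, a minimal NumPy sketch of the predictive mean (20) is given below; k_star denotes the vector of kernel evaluations between the training inputs and the test input, and the function name is our own.

import numpy as np

def tprt_predictive_mean(K, k_star, f_hat):
    """TPRT predictive mean at a test input, Eq. (20): k_*^T K^{-1} f_hat."""
    return k_star @ np.linalg.solve(K, f_hat)

Note the contrast with the GPR predictive mean (5): TPRT replaces Σ^{-1} y by K^{-1} f̂, so the prediction is driven by the robustly estimated mode f̂ rather than by the raw targets.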
4 Relations to GPR and GPRT

Theorem 1. GPR is a special case of TPRT with ν_1, ν_2 → +∞.

Proof. When ν_1, ν_2 → +∞, the TP assumption (6) reduces to a GP and the Student-t noise (7) reduces to Gaussian noise. In this case, since the Laplace method approximates the posterior and the marginal likelihood with Gaussian distributions, it is not difficult to prove that the posterior and the marginal likelihood obtained via the Laplace approximation are exactly the same as those of GPR. Therefore, TPRT → GPR as ν_1, ν_2 → +∞. □

Theorem 2. GPRT is a special case of TPRT with ν_1 → +∞.

Proof. The proof is similar to that of Theorem 1 and thus we omit the details here due to the space limitation. □

Theorems 1 and 2 indicate that TPRT is a natural generalization of GPR and GPRT, just as the Student-t distribution/process is a generalization of the Gaussian distribution/process. When tuning ν_1, ν_2 carefully, we can guarantee that the performance of TPRT is at least comparable with the performances of GPR and GPRT. In fact, the hyper-parameters ν_1, ν_2 obtained from maximum marginal likelihood instead of manual tuning can already achieve satisfactory performance, as we will show in the experiments.

5 Experiments

In this section, we evaluate our TPRT method on both synthetic and real datasets. The experimental results demonstrate the effectiveness of our TPRT method. We also investigate why our TPRT method works on real datasets.

5.1 Datasets

The experiments are conducted on both synthetic and real datasets. For the synthetic datasets, we use the Neal dataset [Neal, 1997] and its variant with input outliers. The training data are described as follows.

• Neal data. This dataset was proposed by [Neal, 1997] and contains target outliers. It contains 100 instances with one attribute.

• Neal with input outliers. To study the robustness to input outliers, we add outliers to 5% of the inputs of the Neal data, i.e., 5% of the elements in the inputs X of the Neal data are randomly increased or decreased by 3 standard deviations (a short sketch of this corruption procedure is given at the end of this subsection). We use Neal Xoutlier to denote Neal with input outliers in Table 2.

In the testing stage, 50 instances are generated from the underlying true function [Neal, 1997] as test data.

For the real datasets, we use 6 real datasets from [Alcalá et al., 2010; Lichman, 2013]; their detailed information is listed in Table 1, from which we can observe that the 6 real datasets have different numbers of instances and attributes, and cover a wide range of areas. Following [Shah et al., 2014], on the Concrete and Bike datasets, a subset of 400 instances is randomly chosen from each dataset for the experiments. For each dataset, 80% of the instances are randomly sampled as the training data and the remaining instances are used as the test data.
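The following is a minimal NumPy sketch of the input-outlier corruption described above (5% of input entries shifted by 3 standard deviations); the random seed, function name, and use of a global standard deviation are our own assumptions, and the exact corruption protocol of the original experiments may differ in detail.

import numpy as np

def add_input_outliers(X, frac=0.05, shift=3.0, seed=0):
    """Randomly add or subtract `shift` standard deviations to a fraction of the input entries."""
    rng = np.random.default_rng(seed)
    X = np.array(X, dtype=float, copy=True)
    idx = rng.choice(X.size, size=max(1, int(frac * X.size)), replace=False)
    signs = rng.choice([-1.0, 1.0], size=idx.size)
    flat = X.reshape(-1)                      # view into the copied array
    # The Neal data has a single attribute, so a global standard deviation is used here.
    flat[idx] += signs * shift * X.std()
    return X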


Table 1: Detailed information of the real datasets.

Dataset       # of Instances   # of Attributes   Area
Diabetes      43               2                 Medical
Machine CPU   209              6                 Computer
MPG           398              8                 Industry
ELE           495              2                 Industry
Concrete      1030             9                 Physical
Bike          17389            16                Social

Table 2: The RMSE results of our TPRT method and all baselines on different datasets. The best results on each dataset are denoted in boldface.

Dataset         GPR      GPRT     TPRT
Neal            0.1676   0.0739   0.0558
Neal Xoutlier   0.4392   0.3815   0.3369
Diabetes        0.8895   0.8870   0.8512
Machine CPU     0.4719   0.4811   0.4094
MPG             0.3376   0.3360   0.3287
ELE             0.5785   0.5740   0.5500
Concrete        0.4013   0.4005   0.3894
Bike            0.3445   0.3536   0.3383

5.2 Baselines, Experimental Settings and Computational Complexity

For the kernel, we choose the squared exponential kernel [Rasmussen and Nickisch, 2010], which is one of the most commonly used kernels. As this kernel has the β property, the methods in [Shah et al., 2014; Solin and Särkkä, 2015; Tang et al., 2016] generate the same predictive mean as GPR. Thus, we only report the results of GPR as a representative. We also include GPRT as a baseline, in which the Laplace approximation is also used, following [Vanhatalo et al., 2009], for a fair comparison. Each experiment is repeated 10 times and the average results of all methods on each dataset are reported. We use the root mean squared error (RMSE) between the predictive mean and the true value for evaluation.

Recall that when learning the hyper-parameters, we use the conjugate gradient method, in which the maximum number of iterations is set to 100. The initial values for the kernel parameters θ_k and the variance of the noise σ² are set to 1. Note that we use the same initial θ_k and σ for the baselines GPR and GPRT.

The main computational cost of TPRT lies in solving the inverse of the kernel matrix (O(n³) time complexity), which can be accelerated by most methods that speed up GPR, such as the methods proposed in [Williams and Seeger, 2000; Deisenroth and Ng, 2015; Wilson and Nickisch, 2015].
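To make the experimental protocol concrete, here is a small NumPy sketch of the squared exponential kernel and the RMSE metric used above; the unit values mirror the initialization described in the text, while the function names and the isotropic length-scale parameterization are our own assumptions.

import numpy as np

def squared_exponential_kernel(X1, X2, length_scale=1.0, signal_var=1.0):
    """Squared exponential kernel k(x, x') = s^2 exp(-||x - x'||^2 / (2 l^2))."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2 * X1 @ X2.T)
    return signal_var * np.exp(-0.5 * np.maximum(sq_dists, 0.0) / length_scale**2)

def rmse(y_pred, y_true):
    """Root mean squared error between predictive means and ground-truth targets."""
    return np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2))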

5.3 Experimental Results

The experimental results are summarized in Table 2, from which we can observe that GPRT and TPRT perform better than GPR on the Neal dataset, which demonstrates the effectiveness of GPRT and TPRT in handling the target outliers. On the Neal Xoutlier dataset, TPRT outperforms GPR and GPRT significantly, and the reason can be explained as follows: the Student-t Process assumption (6) is more tolerant of the input outliers, and the log transformations in the posterior and the marginal likelihood reduce the effect of the outliers. On all the real datasets, TPRT also outperforms all the baselines and achieves the best results. In particular, on the Machine CPU dataset, the RMSE of TPRT is 13.24% lower than that of GPR.

We further investigate why our TPRT method works on the real datasets by studying the distribution of each real dataset. In particular, we use kernel density estimation [Silverman, 1986] to estimate the probability density function of each attribute of each dataset. One observation is that most datasets (5 out of 6, except the Diabetes dataset) have certain heavy-tailed input attributes, which may be caused by input outliers. Taking Machine CPU and Concrete as examples, Figure 2 reports their estimated probability density functions. It is clear that the estimated densities of both datasets have heavy tails to some degree. As analyzed in Sections 3.2 and 3.3, TPRT is more robust to the input outliers than GPR and GPRT. This may explain why TPRT performs better on these datasets. Especially for the Machine CPU dataset, all input attributes exhibit a heavy-tailed distribution, which may explain why TPRT has a much lower RMSE on this dataset, as demonstrated in Table 2.

Figure 2: The heavy-tailed phenomenon on the real datasets. Each subfigure shows the results of a representative attribute of the corresponding dataset. (a) Machine CPU; (b) Concrete.

6 Conclusion

In this paper, motivated by the fact that neither GPR nor its variants are robust to both input and target outliers, we propose a novel Student-t Process Regression with Student-t Likelihood (TPRT), which is robust to input and target outliers at the same time. Specifically, we propose the Student-t Process prior assumption to handle the input outliers and the Student-t likelihood (noise) assumption to handle the target outliers. We derive a Laplace approximation for the inference and analyze why TPRT is robust to both input and target outliers from the views of the posterior and the marginal likelihood. We also prove that both GPR and GPRT are special cases of TPRT. Extensive experiments demonstrate the effectiveness of our TPRT method on both synthetic and real datasets.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under grant Nos. 61371078 and 61375054, and the R&D Program of Shenzhen under grant Nos. JCYJ20140509172959977, JSGG20150512162853495, ZDSYS20140509172959989, and JCYJ20160331184440545. This work is partially supported by NTU-SCSE Grant 2016-2017 and NTU-SUG (CoE) Grant 2016-2018.


References

[Alcalá et al., 2010] J. Alcalá, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, and F. Herrera. KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic and Soft Computing, 17(2-3):255–287, 2010.

[Bendre et al., 1994] S. M. Bendre, V. Barnett, and T. Lewis. Outliers in statistical data, 1994.

[Box and Cox, 1964] George E. P. Box and David R. Cox. An analysis of transformations. Journal of the Royal Statistical Society. Series B (Methodological), pages 211–252, 1964.

[Box and Tiao, 1962] G. E. P. Box and G. C. Tiao. A further look at robustness via Bayes's theorem. Biometrika, 49(3/4):419–432, 1962.

[Chen et al., 2012] Yang Chen, Feng Chen, Jing Dai, T. Charles Clancy, and Yao-Jan Wu. Student-t based robust spatio-temporal prediction. In ICDM, pages 151–160. IEEE, 2012.

[Deisenroth and Ng, 2015] Marc Deisenroth and Jun Wei Ng. Distributed Gaussian processes. In ICML, pages 1481–1490, 2015.

[Jylänki et al., 2011] Pasi Jylänki, Jarno Vanhatalo, and Aki Vehtari. Robust Gaussian process regression with a Student-t likelihood. JMLR, 12(Nov):3227–3257, 2011.

[Kuss and Rasmussen, 2005] Malte Kuss and Carl Edward Rasmussen. Assessing approximate inference for binary Gaussian process classification. JMLR, 6(Oct):1679–1704, 2005.

[Kuss, 2006] Malte Kuss. Gaussian process models for robust regression, classification, and reinforcement learning. PhD thesis, Darmstadt University of Technology, Germany, 2006.

[Lichman, 2013] M. Lichman. UCI machine learning repository, 2013.

[Liu and Vasconcelos, 2015] Bo Liu and Nuno Vasconcelos. Bayesian model adaptation for crowd counts. In CVPR, pages 4175–4183, 2015.

[McNeil, 2006] Alexander J. McNeil. Multivariate t distributions and their applications. Journal of the American Statistical Association, 101(473):390–391, 2006.

[Nair et al., 2013] Jayakrishnan Nair, Adam Wierman, and Bert Zwart. The fundamentals of heavy-tails: Properties, emergence, and identification. SIGMETRICS Perform. Eval. Rev., 41(1):387–388, June 2013.

[Naish-Guzman and Holden, 2008] Andrew Naish-Guzman and Sean Holden. Robust regression with twinned Gaussian processes. In NIPS, pages 1065–1072, 2008.

[Neal, 1997] Radford M. Neal. Monte Carlo implementation of Gaussian process models for Bayesian regression and classification. Technical report, Dept. of Statistics and Dept. of Computer Science, University of Toronto, 1997.

[Niu et al., 2015] Li Niu, Wen Li, and Dong Xu. Visual recognition by learning from web data: A weakly supervised domain generalization approach. In CVPR, June 2015.

[Niu et al., 2016] L. Niu, X. Xu, L. Chen, L. Duan, and D. Xu. Action and event recognition in videos by learning from heterogeneous web sources. TNNLS, PP(99):1–15, 2016.

[O'Hagan, 1979] Anthony O'Hagan. On outlier rejection phenomena in Bayes inference. Journal of the Royal Statistical Society. Series B (Methodological), pages 358–367, 1979.

[Rasmussen and Nickisch, 2010] Carl Edward Rasmussen and Hannes Nickisch. Gaussian processes for machine learning toolbox. JMLR, 11(Nov):3011–3015, 2010.

[Rasmussen et al., 2003] Carl Edward Rasmussen, Malte Kuss, et al. Gaussian processes in reinforcement learning. In NIPS, volume 4, page 1, 2003.

[Rasmussen, 2006] Carl Edward Rasmussen. Gaussian processes for machine learning. 2006.

[Rue et al., 2009] Håvard Rue, Sara Martino, and Nicolas Chopin. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):319–392, 2009.

[Senanayake et al., 2016] Ransalu Senanayake, Simon O'Callaghan, and Fabio Ramos. Predicting spatio-temporal propagation of seasonal influenza using variational Gaussian process regression. In AAAI, 2016.

[Shah et al., 2014] Amar Shah, Andrew Wilson, and Zoubin Ghahramani. Student-t processes as alternatives to Gaussian processes. In AISTATS, pages 877–885, 2014.

[Silverman, 1986] Bernard W. Silverman. Density estimation for statistics and data analysis, volume 26. CRC Press, 1986.

[Solin and Särkkä, 2015] Arno Solin and Simo Särkkä. State space methods for efficient inference in Student-t process regression. In AISTATS, pages 885–893, 2015.

[Tang et al., 2016] Qingtao Tang, Yisen Wang, and Shu-Tao Xia. Student-t process regression with dependent Student-t noise. In ECAI, pages 82–89, 2016.

[Vanhatalo et al., 2009] Jarno Vanhatalo, Pasi Jylänki, and Aki Vehtari. Gaussian process regression with Student-t likelihood. In NIPS, pages 1910–1918, 2009.

[West, 1984] Mike West. Outlier models and prior distributions in Bayesian linear regression. Journal of the Royal Statistical Society. Series B (Methodological), pages 431–439, 1984.

[Williams and Seeger, 2000] Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In NIPS, pages 661–667, 2000.

[Wilson and Nickisch, 2015] Andrew Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes. In ICML, pages 1775–1784, 2015.
