Gaussian processes: the next step in exoplanet data analysis
Suzanne Aigrain (University of Oxford)
Neale Gibson, Tom Evans, Amy McQuillan, Steve Roberts, Steve Reece, Mike Osborne

... let the data speak

Gaussian processes for modelling systematics

high precision time-series: disentangle stochastic, correlated noise

Figure 1. The 'raw' HD 189733 NICMOS dataset used as an example for our Gaussian process model. Left: Raw light curves of HD 189733 for each of the 18 wavelength channels, from 2.50 µm to 1.48 µm, top to bottom. Right: The optical state parameters extracted from the spectra, plotted as a time series. These are used as the input parameters for our GP model. The red lines represent a GP regression on the input parameters, used to remove the noise and test how this affects the GP model.

the noisy input parameters. We also checked that the choice of hyperprior length scale had little effect on the results, by setting them to large values and repeating the procedure with varying length scales, ensuring the transmission spectra were not significantly altered.

3.2 Type-II maximum likelihood

As mentioned in Sect. 2.4, a useful approximation is Type-II maximum likelihood. First, the log posterior is maximised with respect to the hyperparameters and variable mean-function parameters using a Nelder-Mead simplex algorithm (see e.g. Press et al. 1992). The transit model is the same as that used in GPA11, which uses Mandel & Agol (2002) models calculated assuming quadratic limb darkening and a circular orbit. All non-variable parameters were fixed to the values given in Pont et al. (2008), except for the limb darkening parameters, which were calculated for each wavelength channel (GPA11, Sing 2010). The only variable mean-function parameters are the planet-to-star radius ratio and two parameters that govern a linear baseline model: an out-of-transit flux foot and a (time) gradient Tgrad.

An example of the predictive distributions found using Type-II maximum likelihood is shown in Fig. 2 for four of the wavelength channels. In this example, only orbits 2, 3 and 5 are used to determine the parameters and hyperparameters of the GP (or 'train' the GP), and are shown by the red points (footnote 4). Orbit 4 (green points) was not used in the training set. Predictive distributions were calculated for orbits 2-5, and are shown by the grey regions, which plot the 1 and 2σ confidence intervals. The predictive distribution is a good fit to orbit 4, showing that our GP model is effective at modelling the instrumental systematics. The systematics model will of course be even more constrained than in this example, as we use orbits 2-5 to simultaneously infer the parameters of the GP and the mean function.

Now that all parameters and hyperparameters are optimised with respect to the posterior distribution, the hyperparameters are held fixed. This means the inverse covariance matrix and log determinant used to evaluate the log posterior are also fixed, and need only be calculated once. An MCMC is used to marginalise over the remaining parameters of interest, in this case the planet-to-star radius ratio.

Footnote 4: We used the smoothed hyperparameters here for aesthetic purposes. Using the noisy hyperparameters would simply result in noisy predictive distributions.
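The two-stage procedure above (maximise the posterior over everything with Nelder-Mead, then fix the covariance hyperparameters and run an MCMC over the remaining mean-function parameters) can be sketched in a few lines of Python. This is an illustrative toy, not the pipeline used in the paper: the box-shaped transit_model, the synthetic "optical state" inputs, the noise level and the flat priors are all placeholder assumptions, and a Mandel & Agol (2002) model would replace the box in practice.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.optimize import minimize

rng = np.random.default_rng(1)
time = np.linspace(-0.1, 0.1, 200)
# Toy "optical state" inputs: here just time and one random state parameter.
X = np.column_stack([time, rng.normal(size=time.size)])
n_hyper = 1 + X.shape[1]      # kernel amplitude + one inverse length scale per input

def transit_model(phi, t=time):
    """Placeholder mean function: box-shaped transit plus linear baseline.
    phi = (depth, foot, Tgrad)."""
    depth, foot, tgrad = phi
    flux = foot + tgrad * t
    flux[np.abs(t) < 0.04] -= depth
    return flux

def se_kernel(X, theta):
    """Squared exponential kernel with one inverse-squared length scale per input."""
    amp, inv_l2 = theta[0], np.asarray(theta[1:])
    d2 = (((X[:, None, :] - X[None, :, :]) ** 2) * inv_l2).sum(axis=-1)
    return amp ** 2 * np.exp(-d2)

def neg_log_post(p, t_obs, sigma=5e-4):
    """Negative log posterior (flat priors assumed) for [hyperparameters | mean parameters]."""
    theta, phi = np.exp(p[:n_hyper]), p[n_hyper:]   # log-parametrise the positive quantities
    r = t_obs - transit_model(phi)
    K = se_kernel(X, theta) + sigma ** 2 * np.eye(t_obs.size)
    L, low = cho_factor(K)
    return 0.5 * (r @ cho_solve((L, low), r)
                  + 2.0 * np.log(np.diag(L)).sum()
                  + t_obs.size * np.log(2.0 * np.pi))

# Synthetic light curve: transit + smooth state-dependent systematics + white noise.
t_obs = (transit_model([0.02, 1.0, 0.0]) + 2e-3 * np.sin(3.0 * X[:, 1])
         + 5e-4 * rng.normal(size=time.size))

# Step 1: maximise the posterior over hyperparameters and mean parameters (Nelder-Mead).
p0 = np.concatenate([np.log([1e-3, 10.0, 10.0]), [0.02, 1.0, 0.0]])
fit = minimize(neg_log_post, p0, args=(t_obs,), method="Nelder-Mead",
               options={"maxiter": 20000, "xatol": 1e-8, "fatol": 1e-10})

# Step 2: hold the hyperparameters fixed.  The covariance matrix, its Cholesky
# factor and log determinant are then constant, so they need be computed only
# once before an MCMC marginalises over the remaining mean-function parameters.
print("MAP (log) hyperparameters:", fit.x[:n_hyper])
print("MAP mean-function parameters:", fit.x[n_hyper:])
```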


A Gaussian process in a nutshell

GAUSSIAN PROCESSES FOR REGRESSION

Models based on Gaussian processes are extensively used in the machine learning community for Bayesian inference in regression and classification problems. They can be seen as an extension of kernel regression to probabilistic models. In this appendix, we endeavour to give a very brief introduction to the use of GPs for regression. Our explanations are based on the textbooks by Rasmussen & Williams (2006) and Bishop (2006), where the interested reader will find far more complete and detailed information.

Introducing Gaussian processes

Consider a collection of N observations, each consisting of D known inputs, or independent variables, and one output, or dependent variable, which is measured with finite accuracy (it is trivial to extend the same reasoning to more than one jointly observed output). In the case of observations of a planetary transit in a single bandpass, for example, the output is a flux or magnitude, and the inputs might include the time at which the observations were made, but also ancillary data pertaining to the state of the instrument and the observing conditions. In the machine learning literature, observed outputs are usually referred to as targets, because the quality of a model is measured by its ability to reproduce the targets, given the inputs, and we shall adopt this convention here.

We write the inputs for the nth observation as a column vector x_n = (x_{1,n}, ..., x_{D,n})^T, and the corresponding target value as t_n. The N target values are then collated into a column vector t = (t_1, ..., t_N)^T, and the D × N inputs into a matrix X, with elements X_ij = x_{i,j}. A standard approach, which most astronomers will be familiar with, is to model the data as:

t = m(X, φ) + ε

where m is a mean function with parameters φ, which acts on the matrix of N input vectors, returning an N-element vector, and ε represents independent, identically distributed (i.i.d.) noise

ε ∼ N[0, σ² I].

Here N[a, B] represents a multivariate Gaussian distribution with mean vector a and covariance matrix B, σ is the standard deviation of the noise, and I is the identity matrix. The joint probability distribution for a set of outputs t is thus:

P(t | X, σ, φ) = N[m(X, φ), σ² I].

A Gaussian process (hereafter GP) is defined by the fact that the joint distribution of any set of samples drawn from a GP is a multivariate Gaussian distribution. Therefore, i.i.d. noise is a special case of a GP. We now extend our model to the general GP case:

P(t | X, θ, φ) = N[m(X, φ), K]   (1)

where the covariance matrix K has elements

K_nm ≡ cov[x_n, x_m] = k(x_n, x_m, θ).

The function k, known as the covariance function or kernel, has parameters θ, and acts on a pair of inputs, returning the covariance between the corresponding outputs. The specification of the covariance function implies a probability distribution over functions, and we can draw samples from this distribution by choosing a number of input points, writing out the corresponding covariance matrix, and generating a random Gaussian vector with this covariance matrix.

In equation (1): t is the output vector (e.g. fluxes); X is the input matrix (e.g. time, pointing, detector temperature, ...); P(t | X, θ, φ) is the likelihood; N denotes a multivariate Gaussian; m is the mean function; K is the covariance matrix, built from the kernel function k; and θ, φ are the (hyper-)parameters.

The mean function controls the deterministic component of the model, whilst the covariance function controls its stochastic component. Depending on the situation, the former may simply be zero everywhere. Similarly, the latter may represent a physical process of interest, or it may represent noise. A commonly used family of covariance functions is the squared exponential kernel:

k_SE(x_n, x_m, θ) = θ_0 exp[ −θ_1 Σ_{i=1}^{D} (x_{i,n} − x_{i,m})² ]

where θ_0 is an amplitude, and θ_1 an inverse length scale (this becomes more obvious if we write θ_1 = 1/(2l²)). Pairs of observations generated from the squared exponential kernel are strongly mutually correlated if they lie near each other in input space, and relatively uncorrelated if they are mutually distant in input space. Multiple covariance functions can be added or multiplied together to model multiple stochastic processes simultaneously occurring in the data. For example, most GP models include an i.i.d. component:

k(x_n, x_m, θ, σ) = k_SE(x_n, x_m, θ) + δ_nm σ²

where δ is the Kronecker delta function.

The joint probability given by equation 1 is the likelihood of observing the targets t, given X, φ and θ. This likelihood is already marginalised over all possible realisations of the stochastic component which satisfy the covariance matrix specified by X and θ. In this sense, GPs are intrinsically Bayesian. In fact, one can show that GP regression with the squared exponential kernel is equivalent to Bayesian linear regression with an infinite number of basis functions (footnote 1). The basis function weights can be seen as the parameters of the model, and in Bayesian jargon, the elements of θ and φ are all hyper-parameters of the model.

GPs as interpolation tools

Given a set of inputs X and observed targets t, we may wish to make predictions for a new set of inputs X′. We refer ...

Footnote 1: In the Bayesian linear basis equivalent of GP regression, the basis function weights are subject to automatic relevance determination (ARD) priors. These priors are zero-mean Gaussians, which force the weights to be zero unless there is evidence in the data to support a non-zero value. This helps to explain how GPs can be extremely flexible, yet avoid the over-fitting problem which affects maximum-likelihood linear models. In linear basis modelling, one seeks a global probability maximum in feature space, where each coordinate represents the weight of a particular basis function. Working with a high-dimensional feature space is problematic: it gives rise to degeneracy, numerical instability and excessive computational demand. This problem is known as the curse of dimensionality. GPs belong to a class of methods which circumvent this problem by working in function space rather than feature space, and parametrising the functions in terms of their covariance properties. This is known as the kernel trick.
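To make the "draw samples from the prior" statement concrete, here is a minimal Python sketch for 1-D inputs and a zero mean function: it writes out the covariance matrix from the squared exponential kernel plus an i.i.d. white-noise term, then generates random Gaussian vectors with that covariance. The grid, seed and hyperparameter values (θ_0 = 25, θ_1 = 2, σ = 1) are illustrative choices only.

```python
import numpy as np

def k_se(x1, x2, theta0, theta1):
    """k_SE(x_n, x_m) = theta0 * exp(-theta1 * (x_n - x_m)^2) for 1-D inputs."""
    return theta0 * np.exp(-theta1 * (x1[:, None] - x2[None, :]) ** 2)

# Choose a number of input points and write out the corresponding covariance
# matrix, adding the i.i.d. term delta_nm * sigma^2 on the diagonal.
x = np.linspace(0.0, 10.0, 200)
theta0, theta1, sigma = 25.0, 2.0, 1.0
K = k_se(x, x, theta0, theta1) + sigma ** 2 * np.eye(x.size)

# A draw from the GP prior (zero mean function) is simply a random Gaussian
# vector with this covariance matrix.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(x.size), K, size=3)
print(samples.shape)   # (3, 200): three random functions evaluated at x
```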







c c20112011 RAS, RAS, MNRAS MNRAS 000000,1–,1–???? ￿￿ 2 N. P. Gibson, S. Aigrain et al.


Gaussian process regression

a priori knowledge: choose kernel & mean functions, guess initial hyper-parameters → prior probability distribution over functions
get some data → condition the GP (see the code sketch below) → predictive (posterior) distribution and marginal likelihood
marginal likelihood → posterior distribution over hyper-parameters
predictive distribution → interpolate (predict)
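As a concrete, simplified illustration of the "condition the GP" step, the sketch below computes the predictive (posterior) mean and standard deviation at new inputs, together with the log marginal likelihood, assuming a zero mean function and a 1-D squared exponential kernel. The condition_gp helper, the toy sine-wave data and the chosen hyper-parameter values are assumptions for illustration only.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def k_se(x1, x2, amp, l):
    """1-D squared exponential kernel with amplitude amp and length scale l."""
    return amp ** 2 * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / l ** 2)

def condition_gp(x, t, x_star, amp, l, sigma):
    """Return predictive mean, predictive std and log marginal likelihood."""
    K = k_se(x, x, amp, l) + sigma ** 2 * np.eye(x.size)   # training covariance + white noise
    K_s = k_se(x, x_star, amp, l)                          # training vs prediction points
    K_ss = k_se(x_star, x_star, amp, l)                    # prediction points
    L, low = cho_factor(K)
    alpha = cho_solve((L, low), t)
    mean = K_s.T @ alpha
    cov = K_ss - K_s.T @ cho_solve((L, low), K_s)
    log_ml = -0.5 * (t @ alpha + 2.0 * np.log(np.diag(L)).sum()
                     + x.size * np.log(2.0 * np.pi))
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None)), log_ml

# Toy data: a noisy sine wave, interpolated onto a finer grid.
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0.0, 10.0, 30))
t = np.sin(x) + 0.1 * rng.normal(size=x.size)
x_star = np.linspace(0.0, 10.0, 200)
mean, std, log_ml = condition_gp(x, t, x_star, amp=1.0, l=1.5, sigma=0.1)
print("log marginal likelihood:", log_ml)
```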

White noise with standard deviation σ is represented by:

k_WN(t, t′) = σ² I,   (6)

where I is the identity matrix.

The squared exponential (SE) kernel is given by:

k_SE(t, t′) = A² exp( −(t − t′)² / (2l²) )   (7)

where A is an amplitude and l is a length scale. This gives rather smooth variations with a typical time-scale of l and r.m.s. amplitude A.

The rational quadratic (RQ) kernel is given by:

k_RQ(t, t′) = A² ( 1 + (t − t′)² / (2αl²) )^(−α)   (8)

where α is known as the index. Rasmussen & Williams (2006) show that this is equivalent to a scale mixture of SE kernels with different length scales, distributed according to a gamma distribution with parameters α and l⁻². This gives variations with a range of time-scales, the distribution peaking around l but extending to significantly longer periods (while remaining rather smooth). When α → ∞, the RQ kernel reduces to the SE kernel with length scale l.

Examples of functions drawn from these 3 kernels are shown in Figure 3. There are many more types of covariance functions in use, including some (such as the Matérn family) which are better suited to modelling rougher, less smooth variations. However, the SE and RQ kernels already offer a great degree of freedom with relatively few hyper-parameters, and covariance functions based on these are sufficient to model the data of interest satisfactorily.
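The kernels of equations (6) to (8) translate directly into code. The sketch below (plain NumPy, 1-D time inputs assumed, with arbitrary illustrative parameter values) evaluates each kernel on a time grid and checks numerically that the rational quadratic kernel approaches the squared exponential one as α becomes large.

```python
import numpy as np

def k_wn(t, sigma):
    """White noise kernel, eq. (6): sigma^2 on the diagonal only."""
    return sigma ** 2 * np.eye(t.size)

def k_se(t, A, l):
    """Squared exponential kernel, eq. (7)."""
    d2 = (t[:, None] - t[None, :]) ** 2
    return A ** 2 * np.exp(-d2 / (2.0 * l ** 2))

def k_rq(t, A, l, alpha):
    """Rational quadratic kernel, eq. (8)."""
    d2 = (t[:, None] - t[None, :]) ** 2
    return A ** 2 * (1.0 + d2 / (2.0 * alpha * l ** 2)) ** (-alpha)

t = np.linspace(0.0, 5.0, 100)
# As alpha -> infinity the RQ kernel tends to the SE kernel with the same l.
diff = np.max(np.abs(k_rq(t, 1.0, 0.7, alpha=1e6) - k_se(t, 1.0, 0.7)))
print(diff)   # should be close to zero
```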

4.2 Periodic and quasi-periodic kernels

A periodic covariance function can be constructed from any kernel involving the squared distance (t − t′)², by replacing the latter with sin²[π(t − t′)/P], where P is the period. For example, the following:

k_per,SE(t, t′) = A² exp( − sin²[π(t − t′)/P] / (2L²) )   (9)

gives periodic variations which, within a given period, closely resemble samples drawn from a squared exponential kernel. The length scale L is now relative to the period: letting L → ∞ gives sinusoidal variations, whilst increasingly small values of L give periodic variations with increasingly complex harmonic content. Similar periodic functions could be constructed from any kernel. Other periodic functions could also be used, so long as they give rise to a symmetric, positive definite covariance matrix; sin² is merely the simplest.

As described in Rasmussen & Williams (2006), valid covariance functions can be constructed by adding or multiplying simpler covariance functions. Thus, we can obtain a quasi-periodic kernel simply by multiplying a periodic kernel with one of the basic kernels described above. The latter then specifies the rate of evolution of the periodic signal. For example, we can multiply equation 9 with yet another SE kernel:

k_QP,SE(t, t′) = A² exp( − sin²[π(t − t′)/P] / (2L²) − (t − t′)² / (2l²) )   (10)

to model a quasi-periodic signal with a single evolutionary time-scale l.

Figure A1. Top: Examples of random functions drawn from a GP with a squared exponential covariance kernel. The left and right plots have hyperparameters {θ0, θ1} = {25, 2} and {θ0, θ1} = {25, 0.5}, respectively. The dark and light grey shaded areas represent the 1 and 2σ boundaries of the distribution. Bottom: The predictive distributions generated after adding some training data and conditioning using the same hyperparameters as the corresponding plots above, with the addition of a white noise term (σ² = 1). The coloured lines are now random vectors drawn from the posterior distribution.

... type-II maximum likelihood. This involves fixing the covariance hyperparameters at their maximum likelihood values, and marginalising over the remaining mean function hyperparameters of interest. This is a valid approximation when the posterior distribution is sharply peaked at its maximum with respect to the covariance hyperparameters, and should give similar results to the full marginalisation.

Fig. A2 gives examples of GP regression using a squared exponential covariance function, after optimising the hyperparameters via a Nelder-Mead simplex algorithm (see e.g. Press et al. 1992). The training data (red points) were generated from various functions (shown by the green dashed line) with the addition of i.i.d. noise. All were modelled using a zero-mean function, with the exception of the transit function at the bottom left, which used a transit mean function. In the top left, the data were generated from a straight-line function. The marginal likelihood strongly favours functions with a longer length scale, and the GP is able to predict the underlying function both when interpolating and when extrapolating away from the data. The top right data were generated from a quadratic function, and again the GP reliably interpolates the underlying function. However, as the length scale is now shorter, the predictive distribution veers towards the prior more quickly outside the training data. These two examples show that GPs with squared exponential kernels are very good at interpolating data sets, provided the data are generated from a smooth function. The bottom left plot shows data drawn from a transit mean function with sinusoidal and Gaussian time-domain 'systematics' added. The GP model has a transit mean function, and the additional 'systematics' are modelled by the GP. Finally, the bottom right plot shows data drawn from a step function. In this case the GP interpolation is less reliable. This is because the squared exponential covariance implies smooth functions. A more appropriate choice of covariance kernel is required for such functions (e.g. Garnett et al. 2010). Indeed, a wide choice of covariance kernels exist which effectively model periodic, quasi-periodic and step functions (see Rasmussen & Williams 2006), enabling GPs to be a very flexible and powerful framework for modelling data in the absence of a parametric model.
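Returning to equations (9) and (10): the quasi-periodic kernel is literally the product of the periodic kernel and an extra squared exponential factor that sets the evolutionary time-scale l. The sketch below (NumPy, with arbitrary illustrative values of A, L, P and l) builds both kernels and draws one quasi-periodic sample; the small jitter added to the diagonal is purely for numerical stability and is an assumption of this sketch, not part of equation (10).

```python
import numpy as np

def k_per_se(t1, t2, A, L, P):
    """Periodic SE-like kernel, eq. (9)."""
    s2 = np.sin(np.pi * (t1[:, None] - t2[None, :]) / P) ** 2
    return A ** 2 * np.exp(-s2 / (2.0 * L ** 2))

def k_qp_se(t1, t2, A, L, P, l):
    """Quasi-periodic kernel, eq. (10): periodic kernel times an SE factor."""
    d2 = (t1[:, None] - t2[None, :]) ** 2
    return k_per_se(t1, t2, A, L, P) * np.exp(-d2 / (2.0 * l ** 2))

# Draw one quasi-periodic sample: nearly periodic on time-scales much shorter
# than l, with the shape of the cycles evolving over a few times l.
t = np.linspace(0.0, 30.0, 600)
K = k_qp_se(t, t, A=1.0, L=0.7, P=3.0, l=10.0) + 1e-8 * np.eye(t.size)
sample = np.random.default_rng(3).multivariate_normal(np.zeros(t.size), K)
print(sample[:5])
```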

c 2002 RAS, MNRAS 000,1–12 ￿ White noise is with standard deviation σ is represented bt:

2 kWN(t, t￿) = σ I, (6)

where I is the identity matrix.

The squared exponential (SE) kernel is given by:

12 N. P. Gibson et al. 2 2 (t t￿) A very simple example kSE(t, t￿) = A exp − (7) − 2l2 ￿ ￿ where A is an amplitude and l is a length scale. This gives rather smooth variations with a typical time-scale of l and r.m.s. amplitude σ. from Gibson et al. (submitted)

The rational quadratic (RQ) kernel is given by:

White noise with standard deviation σ is represented by:

k_{\rm WN}(t,t') = \sigma^2 \, I, \qquad (6)

where I is the identity matrix.

The squared exponential (SE) kernel is given by:

k_{\rm SE}(t,t') = A^2 \exp\left[-\frac{(t-t')^2}{2 l^2}\right], \qquad (7)

where A is an amplitude and l is a length scale. This gives rather smooth variations with a typical time-scale of l and r.m.s. amplitude A.

The rational quadratic (RQ) kernel is given by:

k_{\rm RQ}(t,t') = A^2 \left[1 + \frac{(t-t')^2}{2 \alpha l^2}\right]^{-\alpha}, \qquad (8)

where α is known as the index. Rasmussen & Williams (2006) show that this is equivalent to a scale mixture of SE kernels with different length scales, distributed according to a gamma distribution with parameters α and l⁻². This gives variations with a range of time-scales: the distribution peaks around l but extends to significantly longer time-scales (while remaining rather smooth). When α → ∞, the RQ kernel reduces to the SE kernel with length scale l.

Examples of functions drawn from these three kernels are shown in Figure 3. There are many more types of covariance function in use, including some (such as the Matérn family) which are better suited to modelling rougher, less smooth variations. However, the SE and RQ kernels already offer a great degree of freedom with relatively few hyperparameters, and covariance functions based on these are sufficient to model the data of interest satisfactorily.
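To make the kernel definitions above concrete, the short sketch below implements equations (6)–(8) in numpy and draws random functions from the resulting covariance matrices, in the spirit of Figure 3. The function names, hyperparameter values and time grid are illustrative choices of ours, not taken from the paper or from any released code.

import numpy as np

def k_wn(t1, t2, sigma):
    # White noise kernel, equation (6): sigma^2 on the diagonal only.
    return sigma**2 * (t1[:, None] == t2[None, :]).astype(float)

def k_se(t1, t2, A, l):
    # Squared exponential kernel, equation (7).
    return A**2 * np.exp(-(t1[:, None] - t2[None, :])**2 / (2.0 * l**2))

def k_rq(t1, t2, A, l, alpha):
    # Rational quadratic kernel, equation (8).
    return A**2 * (1.0 + (t1[:, None] - t2[None, :])**2 / (2.0 * alpha * l**2))**(-alpha)

# Drawing random functions: choose input times, build the covariance matrix,
# and generate a Gaussian random vector with that covariance.
t = np.linspace(0.0, 50.0, 200)
rng = np.random.default_rng(42)
for K in (k_se(t, t, A=1.0, l=5.0) + k_wn(t, t, sigma=0.05),
          k_rq(t, t, A=1.0, l=5.0, alpha=0.3) + k_wn(t, t, sigma=0.05)):
    sample = rng.multivariate_normal(np.zeros_like(t), K)

The small white-noise term doubles as numerical jitter, keeping the covariance matrices safely positive definite when sampling.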

4.2 Periodic and quasi-periodic kernels

A periodic covariance function can be constructed from any kernel involving the squared distance (t − t')² by replacing the latter with sin²[π(t − t')/P], where P is the period. For example, the following:

k_{\rm per,SE}(t,t') = A^2 \exp\left[-\frac{\sin^2[\pi(t-t')/P]}{2 L^2}\right] \qquad (9)

gives periodic variations which closely resemble samples drawn from a squared exponential GP within a given period. The length scale L is now relative to the period: letting L → ∞ gives sinusoidal variations, whilst increasingly small values of L give periodic variations with increasingly complex harmonic content. Similar periodic functions could be constructed from any kernel. Other periodic functions could also be used, so long as they give rise to a symmetric, positive definite covariance matrix – sin² is merely the simplest.

As described in Rasmussen & Williams (2006), valid covariance functions can be constructed by adding or multiplying simpler covariance functions. Thus, we can obtain a quasi-periodic kernel simply by multiplying a periodic kernel with one of the basic kernels described above; the latter then specifies the rate of evolution of the periodic signal. For example, we can multiply equation 9 with yet another SE kernel:

k_{\rm QP,SE}(t,t') = A^2 \exp\left[-\frac{\sin^2[\pi(t-t')/P]}{2 L^2} - \frac{(t-t')^2}{2 l^2}\right] \qquad (10)

to model a quasi-periodic signal with a single evolutionary time-scale l.

Figure A1. Top: Examples of random functions drawn from a GP with a squared exponential covariance kernel. The left and right plots have hyperparameters {θ0, θ1} = {25, 2} and {θ0, θ1} = {25, 0.5}, respectively. The dark and light grey shaded areas represent the 1 and 2σ boundaries of the distribution. Bottom: The predictive distributions generated after adding some training data and conditioning using the same hyperparameters as the corresponding plots above, with the addition of a white noise term (σ² = 1). The coloured lines are now random vectors drawn from the posterior distribution.

Type-II maximum likelihood involves fixing the covariance hyperparameters at their maximum likelihood values, and marginalising over the remaining mean function hyperparameters of interest. This is a valid approximation when the posterior distribution is sharply peaked at its maximum with respect to the covariance hyperparameters, and should give similar results to the full marginalisation.

Fig. A2 gives examples of GP regression using a squared exponential covariance function, after optimising the hyperparameters via a Nelder-Mead simplex algorithm (see e.g. Press et al. 1992). The training data (red points) were generated from various functions (shown by the green dashed line) with the addition of i.i.d. noise. All were modelled using a zero-mean function, with the exception of the transit function at the bottom left, which used a transit mean function. In the top left the data were generated from a straight-line function. The marginal likelihood strongly favours functions with a longer length scale, and the GP is able to predict the underlying function both when interpolating and when extrapolating away from the data. The top right data were generated from a quadratic function, and again the GP reliably interpolates the underlying function. However, as the length scale is now shorter, the predictive distribution veers towards the prior more quickly outside the training data. These two examples show that GPs with squared exponential kernels are very good at interpolating data sets, provided the data are generated from a smooth function. The bottom left plot shows data drawn from a transit mean function with sinusoidal and Gaussian time-domain 'systematics' added. The GP model has a transit mean function, and the additional 'systematics' are modelled by the GP. Finally, the bottom right plot shows data drawn from a step function. In this case the GP interpolation is less reliable. This is because the squared exponential covariance function implies smooth functions. A more appropriate choice of covariance kernel is required for such functions (e.g. Garnett et al. 2010). Indeed, a wide choice of covariance kernels exists which effectively model periodic, quasi-periodic and step functions (see Rasmussen & Williams 2006), enabling GPs to be a very flexible and powerful framework for modelling data in the absence of a parametric model.
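As a schematic illustration of the type-II maximum likelihood procedure and the regression examples described above, the sketch below fits the hyperparameters of a zero-mean SE-plus-white-noise GP to toy data by maximising the log marginal likelihood with a Nelder-Mead simplex (via scipy), then evaluates the predictive mean and variance at test inputs. The toy data, parametrisation and function names are our own assumptions; this is not the code used to produce Fig. A2.

import numpy as np
from scipy.optimize import minimize

def se_cov(t1, t2, A, l):
    # Squared exponential covariance, as in equation (7).
    return A**2 * np.exp(-(t1[:, None] - t2[None, :])**2 / (2.0 * l**2))

def neg_log_marginal_likelihood(p, t, y):
    A, l, sigma = np.exp(p)              # work in log space to keep parameters positive
    K = se_cov(t, t, A, l) + sigma**2 * np.eye(t.size)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * t.size * np.log(2 * np.pi)

# Toy 'training' data: a smooth function plus i.i.d. noise.
rng = np.random.default_rng(1)
t_train = np.sort(rng.uniform(0.0, 10.0, 40))
y_train = np.sin(t_train) + 0.1 * rng.standard_normal(t_train.size)

# Type-II maximum likelihood: optimise the hyperparameters with Nelder-Mead.
res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.1]),
               args=(t_train, y_train), method="Nelder-Mead")
A, l, sigma = np.exp(res.x)

# Predictive distribution at test inputs, with the hyperparameters now held fixed.
t_test = np.linspace(-2.0, 12.0, 200)
K = se_cov(t_train, t_train, A, l) + sigma**2 * np.eye(t_train.size)
Ks = se_cov(t_train, t_test, A, l)
Kss = se_cov(t_test, t_test, A, l)
pred_mean = Ks.T @ np.linalg.solve(K, y_train)
pred_var = np.diag(Kss - Ks.T @ np.linalg.solve(K, Ks))

In a full analysis the mean function parameters of interest would then be marginalised over (e.g. with an MCMC) while the covariance hyperparameters stay fixed at these values.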

Example application 1: instrumental systematics in transmission spectra

See Neale Gibson's talk

Example application 2: modelling HD 189733b's OOT light curve

5 Modelling the HD 189733 light curve

5.1 Choice of model

We modelled the HD 189733 data with all the individual kernels detailed in section 4, as well as a number of different combinations. We initially trained the GP on all three datasets, but the MOST data had virtually no effect on the results (mainly because they do not overlap with any of the transit observations of interest), so we did not use the MOST data for the final calculations. We used LOO-CV to compare different kernels, but we also performed a careful visual examination of the results, generating predictive distributions over the entire monitoring period and over individual seasons. Interestingly, we note that, in practice, LOO-CV systematically favours the simplest kernel which appears to give good results by eye (a sketch of the closed-form LOO-CV computation follows below).

We experimented with various combinations of SE and RQ kernels to form quasi-periodic models, and found that using an RQ kernel for the evolutionary term significantly improves the LOO-CV pseudo-likelihood relative to an SE-based evolutionary term. It makes very little difference to the mean of the predictive distribution when the observations are well-sampled, but it vastly increases the predictive power of the GP away from observations, as it allows for a small amount of covariance on long time-scales, even if the dominant evolution time-scale is relatively short. On the other hand, we were not able to distinguish between SE and RQ kernels for the periodic term (the two give virtually identical best-fit marginal likelihoods and pseudo-likelihoods), and therefore opted for the simpler SE kernel.

To describe the observational noise, we experimented with a separate, additive SE kernel as well as a white noise term. However, we found that this did not significantly improve the marginal or pseudo-likelihood, and the best-fit length-scale was comparable to the typical interval between consecutive data points. We therefore reverted to a white-noise term only.
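The LOO-CV comparison referred to above can be computed in closed form for a Gaussian process, using the leave-one-out expressions of Rasmussen & Williams (2006, Sect. 5.4.2). The sketch below is a minimal version for a zero-mean GP with covariance matrix K; it is our own illustration, not necessarily the exact implementation used here (in particular, any mean function would need to be subtracted from the data first).

import numpy as np

def loo_pseudo_likelihood(K, y):
    # Leave-one-out predictive log (pseudo-)likelihood for a zero-mean GP with
    # covariance matrix K (including the white noise term) and data vector y,
    # using the closed-form expressions of Rasmussen & Williams (2006), Sect. 5.4.2.
    Kinv = np.linalg.inv(K)
    Kinv_y = Kinv @ y
    var_loo = 1.0 / np.diag(Kinv)        # LOO predictive variances
    resid = Kinv_y * var_loo             # y_i minus the LOO predictive mean for point i
    return np.sum(-0.5 * np.log(2 * np.pi * var_loo) - 0.5 * resid**2 / var_loo)

Higher values indicate better out-of-sample predictive performance, which is how different kernels can be ranked without refitting the GP N times.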

The final kernel was thus:

k_{\rm QP,mixed}(t,t') = A^2 \exp\left[-\frac{\sin^2[\pi(t-t')/P]}{2 L^2}\right] \times \left[1 + \frac{(t-t')^2}{2 \alpha l^2}\right]^{-\alpha} + \sigma^2 I. \qquad (11)

The best-fit hyperparameters were A = 6.68 mmag, P = 11.86 days, L = 0.91, α = 0.23, l = 17.80 days, and σ = 2.1 mmag. Our period estimate is in excellent agreement with Henry & Winn (2008). We note that very similar best-fit hyperparameters were obtained with the other kernels we tried (where those kernels shared equivalent hyperparameters). These hyperparameters contain information about the rotation rate, spot distribution and spot evolution. The relatively long periodic length-scale L indicates that the variations are dominated by fairly large active regions. The evolutionary term has a relatively short time-scale l (about 1.5 times the period) but a relatively shallow index α, which is consistent with the notion that individual active regions evolve relatively fast, but that there are preferentially active longitudes where active regions tend to be located (as inferred from better-sampled, long-duration CoRoT light curves of similarly active stars, e.g. by Lanza et al.). Note that we should really treat the noise term(s) separately for each dataset, and the constant relative offsets (or linear trends where applicable) as parameters of the mean function. However, we doubt this would make a big difference to the results.
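For reference, equation (11) with the best-fit hyperparameters quoted above can be written down directly; the following is a minimal sketch (the function name and time grid are ours), with A and σ in mmag and P and l in days.

import numpy as np

def k_qp_mixed(t1, t2, A, P, L, alpha, l, sigma):
    # Quasi-periodic mixed kernel of equation (11): a periodic SE term
    # multiplied by an RQ evolutionary term, plus a white noise term.
    dt = t1[:, None] - t2[None, :]
    periodic = np.exp(-np.sin(np.pi * dt / P)**2 / (2.0 * L**2))
    evolving = (1.0 + dt**2 / (2.0 * alpha * l**2))**(-alpha)
    white = sigma**2 * (dt == 0.0).astype(float)   # sigma^2 I for distinct input times
    return A**2 * periodic * evolving + white

# Best-fit hyperparameters quoted in the text.
t = np.arange(0.0, 120.0, 0.5)
K = k_qp_mixed(t, t, A=6.68, P=11.86, L=0.91, alpha=0.23, l=17.80, sigma=2.1)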

5.2 Results

• predict brightness at the times of the transit observations
• correct the measured radius ratio for the effect of un-occulted spots (Frederic Pont's talk)

Once the GP has been trained and conditioned on the available data, we can compute a predictive distribution for any desired set of input times. This predictive distribution is a multi-variate Gaussian, specified by a mean vector and a covariance matrix (see e.g. Appendix A2 of Gibson et al. 2011 for details). Here we are interested in estimating the difference between the predicted flux at two input times, which is just the difference between the corresponding elements of the mean vector. We also want an estimate of the uncertainty associated with this predicted difference. This can be obtained directly from the covariance matrix of the predictive distribution:

\sigma^2_{y_2|y_1} = {\rm cov}(x_1, x_1) + {\rm cov}(x_2, x_2) - 2\,{\rm cov}(x_1, x_2). \qquad (12)
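Equation (12) translates into a one-line operation on the predictive mean vector and covariance matrix; here is a minimal sketch (variable names ours), assuming pred_mean and pred_cov have already been computed for the input times of interest.

import numpy as np

def predicted_difference(pred_mean, pred_cov, i, j):
    # Difference between the predicted fluxes at input times i and j,
    # and its 1-sigma uncertainty from equation (12).
    diff = pred_mean[j] - pred_mean[i]
    var = pred_cov[i, i] + pred_cov[j, j] - 2.0 * pred_cov[i, j]
    return diff, np.sqrt(var)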

Pros ... and cons

Pros:
• Rigorous error propagation
• Extremely versatile
• Built-in Ockham's razor
• Joint modelling of an arbitrary number of inputs (and outputs)
• Easy to combine with other techniques, e.g. MCMC

... and cons:
• Computationally intensive: O(N³) (see the timing sketch below)
  • OK up to N ~ 1000
  • alternative: Variational Bayes (see Tom Evans' poster)
• What if the joint distribution isn't Gaussian?
  • Gaussian mixture models
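The O(N³) scaling quoted above comes from factorising the N×N covariance matrix; the toy timing sketch below is our own illustration, not part of the module mentioned next.

import time
import numpy as np

def se_cov(t, A=1.0, l=5.0):
    return A**2 * np.exp(-(t[:, None] - t[None, :])**2 / (2.0 * l**2))

# The dominant cost of GP regression is factorising the N x N covariance
# matrix, which scales as O(N^3): doubling N costs roughly eight times more.
for n in (250, 500, 1000, 2000):
    t = np.linspace(0.0, 100.0, n)
    K = se_cov(t) + 1e-6 * np.eye(n)   # small jitter for numerical stability
    start = time.perf_counter()
    np.linalg.cholesky(K)
    print(n, time.perf_counter() - start)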

Want to try?

Python GP module under development