MNRAS 000,1–10 (2021) Preprint 2 March 2021 Compiled using MNRAS LATEX style file v3.0

Atmospheric characterization of hot Jupiters using hierarchical models of Spitzer observations

Dylan Keating,1★ Nicolas B. Cowan,1,2 1Department of Physics, McGill University, Montréal, QC H3A 2T8, Canada 2Department of Earth & Planetary Sciences, McGill University, Montréal, QC H3A 2T8, Canada

Submitted 2021 February 25

ABSTRACT The field of atmospheric characterization is trending towards comparative studies involving many planets, and using hierarchical modelling is a natural next step. Here we demonstrate two use cases. We first use hierarchical modelling to quantify variability in repeated observations, by reanalyzing a suite of ten Spitzer secondary eclipse observations of the hot Jupiter XO-3b. We compare three models: one where we fit ten separate eclipse depths, one where we use a single eclipse depth for all ten observations, and a hierarchical model. By comparing the Widely Applicable Information Criterion (WAIC) of each model, we show that the hierarchical model is preferred over the others. The hierarchical model yields less scatter across the suite of eclipse depths, and higher precision on the individual eclipse depths, than does fitting the observations separately. We do not detect appreciable variability in the secondary eclipses of XO-3b, in line with other analyses. Finally, we fit the suite of published dayside brightness measurements from Garhart et al.(2020) using a hierarchical model. The hierarchical model gives tighter constraints on the individual brightness temperatures and is a better predictive model, according to the WAIC. Notably, we do not detect the increasing trend in brightness temperature ratios versus stellar irradiation reported by Garhart et al.(2020) and Baxter et al.(2020). Although we tested hierarchical modelling on Spitzer eclipse data of hot Jupiters, it is applicable to observations of smaller planets like hot neptunes and super earths, as well as for photometric and spectroscopic transit or phase curve observations. Key words: planets and satellites: individual (XO-3b) – techniques: photometric

1 INTRODUCTION from multiple planets simultaneously using hierarchical models to robustly infer trends. Although the Spitzer Space Telescope wasn’t designed for exoplanet science, it was a workhorse for the field. In particular, observa- tions of exoplanet transits, secondary eclipses, and phase curves with Spitzer’s Infrared Array Camera have been used to characterize the atmospheres of over a hundred transiting planets. 1.1 Spitzer Systematics Substantial progress has been made towards a statistical under- Exoplanet observations taken with Spitzer’s Infrared Array Camera standing of exoplanetary atmospheres (Cowan & Agol 2011; Sing (IRAC; Fazio et al.(2004)) are dominated by systematics noise. et al. 2016; Schwartz & Cowan 2015; Schwartz et al. 2017; Parmen- The systematics are driven by intrapixel sensitivity variations on the tier & Crossfield 2018; Zhang et al. 2018; Keating et al. 2019, 2020; detector and by now are well characterized (Ingalls et al. 2016). Baxter et al. 2020; Bell et al. 2020). Many of the planets in these

arXiv:2103.00010v1 [astro-ph.EP] 26 Feb 2021 The detector systematics are typically fitted simultaneously with the studies had been analyzed using disparate reduction and analysis astrophysical signal of interest. Each transit, secondary eclipse, and pipelines, but recently the field has been moving towards analyzing phase curve yields information about the IRAC detector sensitivity, suites of observations from multiple planets using a single pipeline. but typically this information is not shared between observations. Garhart et al.(2020) independently reduced and analyzed 78 eclipse Since the Spitzer systematics are a function of the centroid loca- depths from 36 planets and found that hotter planets had higher tion on the pixel, efforts have been made to determine the detector brightness temperatures at 4.5 휇m than at 3.6 휇m. Bell et al.(2020) sensitivity independently using observations of quiet (Ingalls reanalyzed every available Spitzer 4.5 휇m hot Jupiter phase curve et al. 2012; Krick et al. 2020; May & Stevenson 2020). The flux of using an open-source reduction and analysis pipeline, confirming a calibration should be constant as a function of time, so any several previously reported trends. deviation must be due to the centroid moving across the detector as In this work we outline a complementary way to further the statis- the telescope pointing drifts. A crucial assumption for this approach tical understanding of exoplanet atmospheres: fitting measurements is that the Spitzer systematics do not vary with time, and that they are not dependent on the brightness of the star. Ingalls et al.(2012) and May & Stevenson(2020) approached the problem by explicitly ★ E-mail: [email protected] calculating the detector sensitivity, while Krick et al.(2020) used a

© 2021 The Authors 2 D. Keating and N. Cowan machine learning technique called random forests to look for patterns underfitting all of the observations. If we observe one secondary in the systematics. eclipse, we would expect that the next one we observe would have a Other approaches do not assume anything explicit about the detec- similar— but not identical— depth, due to measurement uncertainty, tor sensitivity. Independent component analysis (Waldmann 2012; if not astrophysical variability. The second eclipse we observe would Morello et al. 2014, 2016) separates the signal into additive subcom- also change our beliefs about the first one. Each measurement of the ponents using blind source separation, with the idea being that one planet’s eclipse depth can be thought of as a draw from a distribution, of these signals is the astrophysical signal. In another approach, Mor- with some variance. With enough measurements, the shape of this van et al.(2020) used the baseline signal before and after a transit to distribution can be inferred. A hierarchical model naturally takes all learn and predict the in transit detector systematics using a machine of this into account by fitting for the parameters that describe the learning technique known as Long short-term memory networks. higher level distribution simultaneously with the astrophysical signal In this work, we opted to model and fit the detector systematics to of each observation. allow for any correlations between the systematics and astrophysical Bayesian analysis requires us to specify priors on the parameters signals. we are trying to infer. We can write down our prior on the 푖th eclipse depth as

1.2 Hierarchical Models 퐸퐷푖 ∼ N (휇, 휎), (1) Hierarchical models (Gelman et al. 2014) are routinely used in other where we have used shorthand to say that the eclipse depth is drawn fields because they offer a natural way to infer higher level trends that from a normal distribution centered on 휇 with a standard deviation exist in a dataset, and can increase measurement precision. They are of 휎; 휇 and 휎 are hyperparameters. In a non-hierarchical model, we gaining traction in exoplanet studies: for example, to study the - would specify values 휇 and 휎 to represent our prior expectations of radius (Teske et al. 2020) and mass-radius-period (Neil & Rogers what 퐸퐷푖 could be. After fitting, we would get a separate posterior 2020) relations, and radius inflation of hot Jupiters (Sarkis et al. distribution for each eclipse depth. 2021; Thorngren et al. 2021). Hierarchical models have not yet been In a hierarchical model, we instead make 휇 and 휎 parameters and applied to atmospheric characterization of . fit them simultaneously with the ten eclipse depths. We represent our There is one major difference between a typical Bayesian model beliefs about hyperparameters 휇 and 휎 with hyperpriors. This allows and a hierarchical one. In a hierarchical model one assumes that each eclipse observation to inform the others, by pulling the eclipse some of the model parameters of interest are drawn from a higher depths closer to the mode of the distribution of 휇. This is known as level, unobserved distribution. The higher level parameters describ- Bayesian shrinkage. After fitting, we get a posterior distribution for ing this distribution, otherwise known as hyperparameters, are fitted each eclipse depth, as well as for 휇 and 휎. simultaneously with the parameters of interest. As we explain below, In the limit that 휎 goes to infinity, the hierarchical model is equiv- this naturally represents how our intuition pools information across alent to the model with completely separate eclipse depths. Likewise observations. It also helps to tame models by compromising between when 휎 goes to zero, it is equivalent to the single eclipse depth model. overfitting and underfitting. A hierarchical model empirically fits for the amount of pooling based Hierarchical models can and should be used whenever a higher on what is most consistent with the observations. level structure can be assumed in a data set, such as when one quantity is measured multiple times. A natural example is the archetypical suite of of ten XO-3b Spitzer IRAC Channel 2 (4.5 휇m) eclipses 2.1 Priors (Wong et al. 2014). Below we explain the model and present results. Priors are necessary in a model to encode prior knowledge, as well Afterwards, we show how we extend the model to fit multiple eclipses as to properly sample a model. In all cases, we use weakly infor- from different planets simultaneously and present results from fitting mative priors rather than flat, “uninformative” priors. Half-normal the eclipse data from Garhart et al.(2020) with a hierarchical model. priors or wide normal priors are unlikely to introduce much bias into the parameter estimates and can make sampling more efficient. Flat priors are discouraged in practice because we usually have at least 2 HIERARCHICAL MODEL OF XO-3B ECLIPSES some vague knowledge of the range of values a parameter can take (Gelman et al. 2017). In the Spitzer data challenge, several groups analyzed ten secondary For instance, consider a flat prior on the eclipse depth. This is eclipses of XO-3b in order to test repeatability and accuracy of equivalent to saying that all values of eclipse depth are equally likely, various decorrelation techniques (Ingalls et al. 2016). The reduced even extremely large, unphysical values. Instead, we chose to place archival data from the data challenge are publicly available, so we a normal prior with a large standard deviation so that we kept the downloaded them rather than reducing them ourselves. predicted values within the right order of magnitude. For XO-3b and other planets with repeated secondary eclipse ob- servations, the eclipses have usually been fitted separately from one another, with a separate eclipse depth parameter for each observation 2.2 Astrophysical Model (Ingalls et al. 2016; Kilpatrick et al. 2020). In other cases, a single eclipse depth parameter has been used to simultaneously fit multiple The astrophysical model for each observation was a secondary secondary eclipse measurements (Wong et al. 2014). This is also eclipse. We used STARRY (Luger et al. 2019) to compute the shape what is typically done for phase curves that are bracketed by two of each eclipse, with the depth and time of eclipse left as free pa- eclipses (Cowan et al. 2012; Bell et al. 2020). rameters. We fixed the radius of the planet and host star, the orbital However, neither approach quite matches what our intuition tells period, ratio of semi-major axis to stellar radius, , us. Because we are measuring the same thing each time, fitting the longitude of periastron and eccentricity to the literature values. eclipse observations separately amounts to overfitting the individ- To get a rough upper limit on the eclipse depth, we used the ual observations, and fitting a single eclipse parameter amounts to parameterization of Cowan & Agol(2011) to calculate the maximum

MNRAS 000,1–10 (2021) Hierarchical modelling of Spitzer data 3

dayside temperature, in the limit of a Bond albedo of zero and no where 푥 and 푦 represent the centroid locations on the IRAC detec- heat recirculation: tor, in pixel coordinates. The terms 푙푥 and 푙푦 are the covariance √︂  1/4 lengthscales, and 훼 is the Gaussian process amplitude. The squared 푅★ 2 푇d,max = 푇eff . (2) exponential kernel has the intuitive property that locations on the 푎 3 detector pixel that are close together should have similar sensitivity. If the length scales are fixed by the user rather than fitted for, this Here 푇eff is the stellar effective temperature, and 푎/푅★ is the ratio of semimajor axis to stellar radius. We note that this equation assumes boils down to the Gaussian kernel regression of Knutson et al.(2012) a circular , while XO-3b is on an eccentric orbit (푒 = 0.28; and Lewis et al.(2013) Bonomo et al.(2017)). Nonetheless, it allows us to get an order of With no loss of generality, we can let the mean function be zero, magnitude estimate of the eclipse depth. and instead fit the residuals between the astrophysical model and The above temperature can be converted to an eclipse depth using observations which gives 푝(residuals|훾) ∼ N (0, Σ). (6)  2 퐵(휆, 푇d) 푅p 퐸퐷 = (3) We placed weakly informative inverse gamma priors on the length- 퐵(휆, 푇★,4.5휇푚) 푅★ scales. We chose the parameters such that 99% of the prior probabil- where 퐵 is the Planck function, and 푇★,4.5휇푚 is the brightness tem- ity was between lengthscales 0 and 1, measured in pixels. This gives perature of the star at 4.5 휇m, which we calculated by integrating 푝(푙푥), 푝(푙푦) ∼ InverseGamma(훼=11, 훽=5). PHOENIX models (Allard et al. 2011) over the Spitzer bandpass For the amplitude 훼, we used a weakly informative half normal (Baxter et al. 2020). prior: 훼 ∼ Half-N(0, Δ퐹/3), where Δ퐹/3) is the range of observed For the non-hierarchical model, we placed a wide prior flux values. By using a half normal prior, we weakly constrain the on the eclipse depth to prevent biasing the value: 퐸퐷 ∼ scale of the Gaussian process amplitude without introducing bias. N(퐸퐷푚푎푥/2, 퐸퐷푚푎푥/2). For the time of eclipse, we let 휏 ∼ We placed the same prior on the white noise uncertainty, 휎phot. N(Δ푡/2, Δ푡/2) where Δ푡 is the duration of the observation, and time is measured from the start of the observation. We experimented with various priors and found that our resulting fits were consistent and not strongly dependent on the choice of priors. 2.4 The posterior For the hierarchical model, we used a wide Normal prior for For each eclipse, we fit the eclipse depth ED, time of eclipse 휏, the hierarchical mean: 휇 ∼ N (퐸퐷푚푎푥/2, 퐸퐷푚푎푥/2). For the hi- photometric uncertainty 휎phot, Gaussian process lengthscales 푙푥 and erarchical standard deviation we used a weakly informative Half- 푙푦, and Gaussian process amplitude 훼. The planet-to-star flux ratio as Normal prior: 휎 ∼ Half-N(300ppm). We then let the individual a function of time 푡 is given by 퐹. The 푥 and 푦 centroid locations are eclipse depths be drawn from the following higher level distribution: included as covariates. We also fit for the hierarchical eclipse depth 퐸퐷푖 ∼ N (휇, 휎) mean 휇, and standard deviation 휎. The likelihood function for one eclipse is given by

2.3 Detector Systematics: Gaussian Processes 푝(퐹|푡, 푥, 푦, 휎phot,퐸퐷,휇,휎,휏,푙푥, 푙푦) (7) The IRAC detector sensitivity in Channels 1 and 2 depends on the tar- and the prior is get centroid position on the detector. To parameterize this behaviour, we used a Gaussian process. The advantage of using a Gaussian 푝(푡, 푥, 푦, 휎phot,퐸퐷,휇,휎,휏,푙푥, 푙푦). (8) process is that it doesn’t require calculating the detector sensitivity explicitly, in contrast with polynomial models (Cowan et al. 2012) or We form the posterior function by multiplying the likelihood and BLISS (Stevenson et al. 2012). prior together: When using Gaussian processes, we make the usual assumption that the data are normally distributed, but allow for covariance be- ( | ) ∝ ( | ) ( ) tween data points. The likelihood function can be written 푝 퐸퐷, 휇, 휎, 휃 퐹 푝 퐹 퐸퐷, 휇, 휎, 휃 푝 퐸퐷, 휇, 휎, 휃 (9) where we have replaced the variables and parameters other than 푝(data|훾) ∼ N (휇GP, Σ), (4) eclipse depth by 휃 for shorthand. where 훾 represents the model parameters and independent variables, First, note that the likelihood function does not depend on the and 휇GP is the mean function around which the data are distributed. hierarchical parameters directly, so we can remove 휇 and 휎 from The covariance function, Σ, is an 푛 × 푛 matrix where 푛 is the number the brackets of the likelihood function. Second, the eclipse depth of data. The entries along the diagonal of Σ are the measurement depends on the hierarchical parameters in this way: uncertainties on each datum, and the off-diagonal entries are the 푝(퐸퐷, 휇, 휎) covariance between data. When the off-diagonal elements are equal 푝(퐸퐷|휇, 휎) = , (10) to zero, the likelihood function reduces to the usual assumption of 푝(휇, 휎) independent Gaussian uncertainties. where we have made use of Bayes’ theorem. Although it is computationally intractable to fit each off-diagonal This means we can rewrite the likelihood and prior to obtain: entry of the covariance matrix, they can be parameterized using a kernel function with a handful of parameters. We used the squared 푝(퐸퐷, 휇, 휎, 휃|퐹) ∝ 푝(퐹|퐸퐷, 휃)푝(퐸퐷|휇, 휎)푝(휇, 휎, 휃). (11) exponential kernel employed by Evans et al.(2015) It is this refactoring that makes a hierarchical model different from a "  2  2# 푥푖 − 푥 푗 푦푖 − 푦 푗 non-hierarchical one. Σ푖 푗 = 훼 exp − − , (5) 푙푥 푙푦 To form the posterior of the full hierarchical model, we multiply

MNRAS 000,1–10 (2021) 4 D. Keating and N. Cowan the individual eclipse posteriors together: how close each model is to the true model. In practice (at least in 푛 exoplanet science), they typically yield similar conclusions. Ö A shortcoming shared by BIC and AIC is that they are accurate 푝(퐸퐷푖, 휇, 휎, 휃푖 |퐹푖) ∝ 푖=1 only when using flat priors, which are not recommended for most 푛 (12) models (Gelman et al. 2017). The AIC also assumes that the posterior Ö 푝(휇, 휎) 푝(퐹푖 |퐸퐷푖, 휃푖)푝(퐸퐷푖 |휇, 휎)푝(휃푖) distributions are multivariate Gaussians. The priors are never flat for 푖=1 hierarchical models, which means we cannot use the AIC or BIC for model comparison. where 푛 is the number of eclipse observations— 10 in the case of the A more general model comparison tool is the Widely Applica- XO-3b dataset. ble Information Criterion (WAIC; Watanabe(2010)). The WAIC is Bayesian, uses the full fit posterior, and critically, makes no assump- tions about the shape of the posterior or priors. The WAIC is easily 2.5 Hamiltonian Monte Carlo computed from the full fit posterior (McElreath, R. 2020): ! We used the probabilistic programming package PyMC3 to build ∑︁ and sample from the model. PyMC3 uses Hamiltonian Monte Carlo WAIC(data, Θ) = −2 lppd − var휃 (log p (Fi | EDi, 휃i)) (13) (HMC), the state-of-the-art Markov Chain Monte Carlo (MCMC) i algorithm, to perform the sampling. HMC is more efficient than 푡ℎ where Θ is the posterior, 퐹푖 stands for the i eclipse observation, other MCMC algorithms, meaning it can effectively describe the data refers to the entire suite of observations, and lppd stands for the posterior using fewer samples than other MCMC algorithms. For high log-pointwise predictive-density, dimensional models, and especially for high dimensional hierarchical 푆 ! models, HMC is all but necessary (Betancourt & Girolami 2013). ∑︁ 1 ∑︁ lppd = log 푝 퐹 | 퐸퐷 , 휃  , (14) Hamiltonian Monte Carlo works by exploiting the fact that for ev- 푆 푖 푖 푖,푠 ery probabilistic system we care about, there is an equivalent physical 푖 푠=0 system that is easier to reason about (Betancourt 2017). The chains th where 푆 is the number of samples and 휃푖,푠 is the 푆 set of parameters in an MCMC sampler move through parameter space to estimate for the 푖th observation. The log predictive pointwise density is an the shape of the posterior; an equivalent physical system is the mo- estimate of how well the model would fit new, unseen data. The tion of a satellite orbiting a giant planet where the planet represents second term in the WAIC expression is a penalty term that penalizes the mode of the probability distribution. The crucial step in HMC overly complex models. is to transform these trajectories from parameter space to momen- Once MCMC sampling has finished, the WAIC can be computed tum space using the Hamiltonian of the system, and sample from in a few lines of code using the MCMC chains. It is also possible that instead. The HMC chains traverse through the typical set of the to compute the standard error of the WAIC, something that is not distribution, roughly defined as where most random draws from the possible with the BIC or AIC. If the difference in WAIC between two model of interest would lie. models is significantly larger than the standard error of the difference, To use Hamiltonian Monte Carlo, the gradient of the likelihood then the model with the smaller WAIC is favoured over the other. If function, with respect to the parameters, is needed in order to con- the difference in WAIC is smaller than the standard error of the struct the Hamiltonian of the system. PyMC3 does this using Theano, difference, then the models make equally good predictions and there which is a deep learning library that allows for efficient manipulation is no evidence to favour one over the other. of matrices (The Theano Development Team et al. 2016). In our case Since all modern secondary eclipse, transit, and phase curve anal- the eclipse expressions reduce to the analytic expressions of Mandel yses use Markov Chain Monte Carlo to sample and store the posterior & Agol(2002), but STARRY allows for analytic expressions and draws, the WAIC is a better choice than BIC or AIC for model com- gradients in the general case of eclipse and phase mapping. parison. The biggest advantage is that HMC can diagnose problematic pos- teriors or models. Posteriors with pathological regions, such as high curvature, are hard for typical MCMC samplers to explore efficiently; 2.7 Pooling the GP parameters hierarchical models exhibit such pathologies. This can lead to biases We hypothesized that fitting a common set of Gaussian process am- in the final results, which is hard to diagnose because typical samplers plitude and length scales across the suite of eclipses would yield will quietly sample away unbeknownst to the user. When an HMC more precise eclipse depths by sharing information about the detec- chain gets stuck in a region of high curvature or otherwise behaves tor sensitivity across the observations. Because we used Hamiltonian badly, it will diverge to infinity and the sampler keeps track of where Monte Carlo, it was feasible to fully marginalize over the Gaussian this occurred. Divergences can often be eliminated by changing the process hyperparameters. However, we found that the fitted eclipse HMC step size or by reparameterizing the model. depths had nearly identical means and standard deviations between the shared and non-shared detector models of XO-3b. In practice we adopted the shared GP model for XO-3b because it has fewer 2.6 Model comparison: Information Criteria parameters and is therefore easier to sample. It is common among exoplanet scientists to use the Bayesian Infor- mation Criterion (BIC) or Aikake Information Criterion (AIC) to 2.8 Results perform model comparison and selection (Schwarz 1978; Akaike 1974). Both criteria use the maximum likelihood and a complex- We fit the ten 4.5 휇m eclipses of XO-3b with three models: a model ity term to penalize overly complex models. These two information where each eclipse observation had its own, separate eclipse depth criteria describe slightly different things— the AIC measures the parameter, one where we used a single, pooled eclipse depth for all relative predictive loss of a set of models, and the BIC measures the observations, and a hierarchical model. To fit each model, we

MNRAS 000,1–10 (2021) Hierarchical modelling of Spitzer data 5

Table 1. Best-fit eclipse depths from each model of the ten Spitzer IRAC 0.0020 Hierarchical channel 2 eclipses of XO-3b. The parameters 휇 and 휎 are the hierarchical Pooled mean and standard deviation for the suite of eclipse depths. With the pooled F / model, we are implicitly assuming 휎 = 0, while for the separate model we p Separate F 0.0018

are implicitly assuming 휎 = ∞. The hierarchical model fits for the amount ,

h of pooling, and is preferred over both the completely pooled and separate t

p models.

e 0.0016 D

Eclipse Number Hierarchical Pooled Separate e s

p 0.0014 1 1602±122 1507±45 1655±144 i l

c 2 1723±148 1507±45 1892±163

E 3 1744±136 1507±45 1894±142 0.0012 4 1548±111 1507±45 1561±140 5 1366±119 1507±45 1269±140 1 2 3 4 5 6 7 8 9 10 6 1538±120 1507±45 1553±153 Eclipse Number 7 1455±118 1507±45 1412±146 8 1378±123 1507±45 1280±150 9 1468±105 1507±45 1438±126 Figure 1. Eclipse depths of XO-3b for the three different models. The pooled 10 1383±108 1507±45 1316±118 model has the lowest eclipse depth uncertainty, but is not the best model 휇 1520±81 1507±45 1527±219 according to the WAIC. The hierarchical model is the best model, and yields 휎 193±80 0 ∞ eclipses that are closer together, with lower uncertainties on the individual eclipse depths than the separate model (15% smaller on average). The hier- archical model represents a compromise between the overfit separate model and the underfit pooled model. Table 2. Comparison of different information criteria for the three XO-3b eclipse depth models— lower values are better. For the WAICwe also tabulate the standard error in the difference between each model. Using the BIC and used 2000 tuning steps to initialize four HMC chains, and obtained AIC the pooled model is preferred. However, our model does not satisfy the 1000 samples for each chain. After sampling, we confirmed that the assumptions necessitated by the BIC and AIC. The WAIC is more robust. Gelman-Rubin statistic was close to 1 for all parameter values, and According to the WAIC, the hierarchical model is the preferred model. that there were no divergences. The eclipse depths from each model Model Δ WAIC 휎ΔWAIC ΔAIC ΔBIC are shown graphically in Figure1 and tabulated in Table1. We also computed the WAIC values for each model, shown in Table2. Hierarchical 0.0 0.0 7.46 10.80 Pooled 8.54 5.04 0 0 The hierarchical model had a significantly lower WAIC than the Separate 65.66 25.60 20.31 23.03 separate model, and a marginally lower WAIC than the single eclipse depth (pooled) model. This suggests that the eclipse depths are indeed 3 HIERARCHICAL MODEL FOR MULTIPLE PLANETS different between observations, but are more similar than if we had used a separate eclipse parameter to describe each. This also hints While hierarchical modelling is most obviously applicable for re- at some variability from observation to observation. Although the peated measurements of the same planet, we can also extend it to difference in WAIC between the hierarchical and fully pooled model secondary eclipse observations of multiple planets. We considered is only 1.69 휎ΔWAIC, typical hot Jupiters are predicted to show some the published dayside brightness temperatures from Garhart et al. -to-epoch variability (Komacek & Showman 2020), which is (2020), the largest dataset of uniformly analyzed hot Jupiter sec- another reason to adopt the hierarchical model over the pooled one. ondary eclipses. The mean eclipse depth for the three models are consistent with We expect that hot Jupiter dayside temperatures, 푇d, are approxi- one another within the uncertainties. In Figure1, we see the effects of √︃ mately proportional to their irradiation temperatures, 푇 = 푇 푅★ . shrinkage on the eclipse depths. Compared to the separate model, the 0 eff 푎 We built a hierarchical model by including this intuition in our hy- hierarchical model yields smaller scatter across the suite of eclipse perprior, and making the hierarchical mean a function of irradiation depths and higher precision on the individual eclipse depths. Addi- temperature: tionally, the individual uncertainties on the fitted eclipse depths are smaller by 15% on average in the hierarchical fits compared to the 휇d = 푚 (푇0− < 푇0 >) + 푏, separate fits. The individual eclipse depth observations help constrain where < 푇0 > is the average irradiation temperature. In other words, the others by shrinking the whole suite of eclipse depths towards the our hierarchical mean is now a line described by a slope and standard grand mean, but not as much as in the completely pooled model. deviation. We represent the scatter about this line using the hyper- parameter 휎푑. The prior on each dayside brightness temperature is then 푇d,p ∼ N (휇d, 휎푑). For this hierarchical model, the hierarchical mean itself depends on two hyperparameters, the slope and intercept of the line, which we fit for simultaneously with the suite of dayside brightness tem- peratures. We used the following weakly informative priors for the hyperparameters: 푚 ∼ N (1, 0.5), 푏 ∼ N (2200퐾, 500퐾), 휎푑 ∼ Half-N(500퐾). We included the planets that had measurements at both 4.5 휇m and 3.6 휇m, which gave a total of 34 planets. The planets WASP-14b, WASP-19b, and WASP-103b had two sets of eclipses in both chan-

MNRAS 000,1–10 (2021) 6 D. Keating and N. Cowan

1.005 AOR: 46471424 AOR: 46471168 F / p

F 1.000

0.995

0.0 0.1 0.2 0.3 0.0 0.1 0.2 0.3 Time from start of observation (Days) Time from start of observation (Days) 1.005 AOR: 46470912 AOR: 46470656 F / p

F 1.000

0.995

0.0 0.1 0.2 0.3 0.0 0.1 0.2 0.3 Time from start of observation (Days) Time from start of observation (Days) 1.005 AOR: 46470400 AOR: 46469632 F / p

F 1.000

0.995

0.0 0.1 0.2 0.3 0.0 0.1 0.2 0.3 Time from start of observation (Days) Time from start of observation (Days) 1.005 AOR: 46469120 AOR: 46468608 F / p

F 1.000

0.995

0.0 0.1 0.2 0.3 0.0 0.1 0.2 0.3 Time from start of observation (Days) Time from start of observation (Days) 1.005 AOR: 46468352 AOR: 46468096 F / p

F 1.000

0.995

0.0 0.1 0.2 0.3 0.0 0.1 0.2 0.3 Time from start of observation (Days) Time from start of observation (Days)

Figure 2. The full fit to each of the ten XO-3b eclipses, using the hierarchical model for the eclipse depths and a pooled Gaussian process for the detector systematics. nels, so in total there were 74 measurements. Fitting 74 eclipse obser- hierarchical model, but were not distinguishable from each other. The vations simultaneously with a two-dimensional Gaussian process is fitted values from the wavelength independent hierarchical model can computationally intractable, so we took the published measurements be seen in Figure4. The effects of Bayesian shrinkage are evident; at face value rather than refit them. the refit measurements are closer together, and more precise. We also We again used PyMC3, and first fit a non-hierarchical model as plot the scaled dayside brightness temperatures in Fig5. our baseline, using separate 푇푑, 푝 parameters for each measurement. As expected, this model just reproduces the published dayside tem- peratures and uncertainties. This also acts as a confidence check that The wavelength dependent model is indistinguishable from the our priors are not biasing the fitted parameters. wavelength independent model, according to the WAIC, meaning Since there are measurements at two different wavelengths, we fit they make equally good predictions. In the wavelength independent two different versions of the hierarchical model. In the wavelength model, the dayside brightness temperatures for both channels fol- dependent model, we allowed the dayside brightness temperature low the same slope, or equivalently, the ratio of the slopes for each distributions to be different between the two wavelengths, fitting one channel is equal to one. Our interpretation of this is that we are not set of hierarchical parameters for the 4.5 휇m measurements, and an- detecting the trend of increasing brightness temperature ratio versus other set for the 3.6 휇m measurements. In the wavelength independent stellar irradiation reported by Garhart et al.(2020) and Baxter et al. model, we used a common distribution for all the measurements, and (2020). The best-fit hierarchical parameters from our wavelength thus one set of hierarchical parameters. We tabulate the refit bright- independent model are 휇푑 = 1.43 ± 0.07, 푏 = 2123 ± 25퐾, and ness temperatures in Table3. The WAIC values for the three models 휎푚 = 150 ± 22퐾. These constraints make it possible to empirically for the Garhart et al.(2020) dataset are tabulated in Table4. The predict a planet’s brightness temperature in the Spitzer channels 1 or two hierarchical models are both significantly favoured over the non- 2 bandpass given its irradiation temperature.

MNRAS 000,1–10 (2021) Hierarchical modelling of Spitzer data 7

1.0050 AOR: 46471424 AOR: 46471168

1.0025 F / p F 1.0000

0.9975 1.0050 0.0 0.1 0.2 0.3 0.0 0.1 0.2 0.3 AOR: 46470912 AOR: 46470656 Time from start of observation (Days) Time from start of observation (Days) 1.0025 F / p F 1.0000

0.9975 1.0050 0.0 0.1 0.2 0.3 0.0 0.1 0.2 0.3 AOR: 46470400 AOR: 46469632 Time from start of observation (Days) Time from start of observation (Days) 1.0025 F / p F 1.0000

0.9975 1.0050 0.0 0.1 0.2 0.3 0.0 0.1 0.2 0.3 AOR: 46469120 AOR: 46468608 Time from start of observation (Days) Time from start of observation (Days) 1.0025 F / p F 1.0000

0.9975 1.0050 0.0 0.1 0.2 0.3 0.0 0.1 0.2 0.3 AOR: 46468352 AOR: 46468096 Time from start of observation (Days) Time from start of observation (Days) 1.0025 F / p F 1.0000

0.9975 0.0 0.1 0.2 0.3 0.0 0.1 0.2 0.3 Time from start of observation (Days) Time from start of observation (Days)

Figure 3. The ten XO-3b eclipses with the detector systematics removed using a Gaussian process as a function of stellar centroid location. The eclipse signal is clearly visible in the corrected raw data (represented by the black dots), and the best fit eclipse signal for each is shown as the red line.

4 DISCUSSION AND CONCLUSIONS way in order to search for variability. The hot Jupiter WASP-43b has one published (Stevenson et al. 2017; Mendonça et al. 2018; Morello In our reanalysis of the ten secondary eclipses of XO-3b, we found et al. 2019; May & Stevenson 2020; Bell et al. 2020), and two that the hierarchical model was favoured. This means that the mea- unpublished, Spitzer phase curves at 4.5 휇m. A hierarchical model sured eclipse depths are indeed different from epoch to epoch, yet could be used to quantify the variability in the phase amplitudes and clustered. The biggest difference compared to previous analyses is offsets of the three 4.5 휇m phase curves. that we were able to empirically fit for the amount of variability favoured by the data, and doing this improves the precision on our We also showed that a hierarchical model is the best choice when measurements by 15% on average, because of Bayesian shrinkage. fitting measurements from multiple planets in order to look for trends. Hierarchical models could improve measurements of other hot For the Garhart et al.(2020) dataset, our two hierarchical models were Jupiters and other types of planets. They could be used to robustly test significantly favoured over the non-hierarchical model. We did not the reported variability in the secondary eclipses of the super earth 55 detect the trend of increasing brightness temperature ratio with in- Cancri e (Demory et al. 2016; Tamburo et al. 2018), or to fit the twelve creasing stellar irradiation reported by Garhart et al.(2020); Baxter Spitzer eclipses of the recently discovered hot Saturn LTT 9779b et al.(2020), but we did not rule it out. Additionally, we experimented (Dragomir et al. 2020). The published variability constraints for HD with fitting the larger eclipse dataset from Baxter et al.(2020) with 189733b (Agol et al. 2010) and HD 209458b (Kilpatrick et al. 2020) a hierarchical model, but the results were inconclusive: the hierar- could be revisited with hierarchical models, and the unpublished set chical model did slightly better, but was indistinguishable from the of eleven Spitzer secondary eclipses of WASP-18b, which straddles non-hierarchical model according to the WAIC. Because the Baxter the boundary between hot Jupiter and ultra hot Jupiter, could also et al.(2020) sample is comprised of eclipses analyzed with disparate benefit from a hierarchical treatment. pipelines, we recommend that a wholesale reanalysis be performed Repeated phase curve observations could be analyzed in a similar to enable a more robust comparison.

MNRAS 000,1–10 (2021) 8 D. Keating and N. Cowan

Table 3. Results from fitting a wavelength independent hierarchical model to the Garhart et al.(2020) eclipse dataset.

Planet 푇0 (퐾 ) 푇d,ch2 (K) 푇d,ch1 (K) HAT-13 b 2338±71 1802±119 1821±129 HAT-30 b 2430±48 1928±107 2008±99 HAT-33 b 2623±209 2073±119 2111±114 HAT-40 b 2505±54 1966±133 2007±140 HAT-41 b 2383±82 1725±97 1836±136 KELT-2 b 2434±51 1835±92 1971±88 KELT-3 b 2587±59 2108±99 2284±101 KELT-7 b 2908±44 2412±67 2492±64 Qatar-1 b 2011±51 1508±128 1488±143 WASP-12 b 3601±116 2995±92 3198±115 WASP-14 b 2677±85 2232±78 2267±74 WASP-14 b 2677±85 2276±78 2267±69 WASP-18 b 3417±82 3229±72 3034±58 WASP-19 b 2968±55 2342±115 2455±96 WASP-19 b 2968±55 2427±127 2464±90 WASP-36 b 2411±62 1854±145 1884±153 WASP-43 b 2042±57 1535±68 1740±62 WASP-46 b 2352±76 1883±134 1822±154 WASP-62 b 2025±47 1549±108 1694±122 WASP-63 b 2172±52 1613±137 1637±139 WASP-64 b 2367±239 1820±137 1958±126 WASP-65 b 2107±64 1561±150 1646±137 WASP-74 b 2718±65 2176±86 2097±79 WASP-76 b 3097±61 2725±56 2659±54 WASP-77 b 2372±40 1740±77 1806±73 WASP-78 b 3111±58 2621±143 2681±136 WASP-79 b 2489±72 1959±92 1968±98 WASP-87 b 3281±88 2878±108 2787±115 WASP-94A b 2133±106 1395±99 1454±80 WASP-97 b 2185±57 1635±86 1735±89 WASP-100 b 3121±240 2535±113 2490±119 WASP-101 b 2205±54 1606±104 1707±111 WASP-103 b 3554±69 3118±133 3022±111 WASP-103 b 3554±69 3057±127 2942±119 WASP-104 b 2123±267 1670±124 1650±122 WASP-121 b 3346±81 2611±62 2567±69 WASP-131 b 2069±45 1470±139 1530±146

Table 4. WAIC scores for the three models used to fit the suite of published measurements of certain targets, and will both carry out photomet- eclipse depths from Garhart et al.(2020)— lower values are better. We also ric and spectroscopic transit, eclipse, and phase curve surveys for a tabulate the standard error of the difference in WAICbetween each model. The variety of targets (Bean et al. 2018; Tinetti et al. 2018; Charnay et al. two hierarchical models both make equally good predictions, and significantly 2021). This will allow for atmospheric characterization of potentially outperform the non-hierarchical model. thousands of more exoplanets, from Earth-like planets to ultra-hot Jupiters, and we recommend that these comparative surveys incorpo- Model Δ WAIC 휎ΔWAIC rate hierarchical modelling to make measurements and predictions Wavelength Dependent 0.0 0.0 that are as robust as possible. Wavelength Independent 0.25 1.8 Separate 11.09 3.61

ACKNOWLEDGEMENTS One obvious extension of our work, then, is to simultaneously refit the detector systematics and astrophysical signals for the entire suite D. K. and N. B. C. acknowledge support from McGill Space Institute of Spitzer secondary eclipses using an open source pipeline such and the Institut de recherche sur les exoplanètes. We have made use as SPCA (Bell et al. 2020), and using a hierarchical model for the of open-source software provided by the Python, Astropy, SciPy, dayside brightness temperatures. This is computationally intractable Matplotlib, and PyMC3 communities. using a two dimensional Gaussian process like we did for XO-3b, but it could potentially be done with an easier-to-compute detector model like Pixel Level Decorrelation (Deming et al. 2015; Garhart DATA AVAILABILITY et al. 2020), especially if Hamiltonian Monte Carlo is used. In this work we have shown that hierarchical models are useful The reduced photometry for the ten archival secondary eclipse when analyzing repeated measurements from a single target, or when observations of XO-3b are freely available at https://irachpp. doing comparative exoplanetology studies of many targets. Next gen- spitzer.caltech.edu/page/data-challenge-2015. The eration telescopes like James Webb and Ariel will make repeated code used in the XO-3b reanalysis, and a Jupyter notebook showing

MNRAS 000,1–10 (2021) Hierarchical modelling of Spitzer data 9

the reanalysis of the Garhart et al.(2020) eclipses can both be found 3500 at https://github.com/dylanskeating/HARMONiE.

3000 REFERENCES

2500 Agol E., Cowan N. B., Knutson H. A., Deming D., Steffen J. H., Henry G. W., Charbonneau D., 2010, ApJ, 721, 1861 ) Akaike H., 1974, IEEE Transactions on Automatic Control, 19, 716 K (

Allard F., Homeier D., Freytag B., 2011, in Johns-Krull C., Browning M. K., y 2000 a

d West A. A., eds, Astronomical Society of the Pacific Conference Series T Vol. 448, 16th Cambridge Workshop on Cool Stars, Stellar Systems, and 1500 the Sun. p. 91 (arXiv:1011.5405) Baxter C., et al., 2020, A&A, 639, A36 Bean J. L., et al., 2018, PASP, 130, 114402 1000 Bell T. J., et al., 2020, arXiv e-prints, p. arXiv:2010.00687 Betancourt M., 2017, arXiv e-prints, p. arXiv:1706.01520 Hierarchical Betancourt M. J., Girolami M., 2013, arXiv e-prints, p. arXiv:1312.0906 500 Garhart (2020) Bonomo A. S., et al., 2017, A&A, 602, A107 Charnay B., et al., 2021, arXiv e-prints, p. arXiv:2102.06523 2000 2250 2500 2750 3000 3250 3500 Cowan N. B., Agol E., 2011, ApJ, 729, 54 T0 (K) Cowan N. B., Machalek P., Croll B., Shekhtman L. M., Burrows A., Deming D., Greene T., Hora J. L., 2012, ApJ, 747, 82 Deming D., et al., 2015, ApJ, 805, 132 Figure 4. Dayside brightness temperatures for the planets analyzed by Garhart Demory B.-O., Gillon M., Madhusudhan N., Queloz D., 2016, MNRAS, 455, et al.(2020), refit with a wavelength independent hierarchical model. The red 2018 dots are the published values and the black line is the best fit trend line. The Dragomir D., et al., 2020, ApJ, 903, L6 effects of Bayesian shrinkage are evident: the measurements are clustered Evans T. M., Aigrain S., Gibson N., Barstow J. K., Amundsen D. S., Tremblin closer to the line, and the uncertainties on the measurements are reduced. P., Mourier P., 2015, MNRAS, 451, 680 Fazio G. G., et al., 2004, ApJS, 154, 10 Garhart E., et al., 2020,AJ, 159, 137 Gelman A., Carlin J. B., Stern H. S., B. D. D., Vehtari A., Rubin. D. B., 2014, Bayesian Data Analysis, (3rd ed.). Boca Raton : CRC Press 1.0 Gelman A., Simpson D., Betancourt M., 2017, Entropy, 19, 555 Ingalls J. G., Krick J. E., Carey S. J., Laine S., Surace J. A., Glaccum W. J., Grillmair C. C., Lowrance P. J., 2012, in Clampin M. C., Fazio G. G., MacEwen H. A., Oschmann Jacobus M. J., eds, Society of Photo-Optical 0.8 Instrumentation Engineers (SPIE) Conference Series Vol. 8442, Space Telescopes and Instrumentation 2012: Optical, Infrared, and Millimeter

0 Wave. p. 84421Y, doi:10.1117/12.926947 T

/ Ingalls J. G., et al., 2016,AJ, 152, 44 y

a Keating D., Cowan N. B., Dang L., 2019, Nature Astronomy, 3, 1092 d 0.6 T Keating D., et al., 2020,AJ, 159, 225 Kilpatrick B. M., et al., 2020,AJ, 159, 51 Knutson H. A., et al., 2012, ApJ, 754, 22 Komacek T. D., Showman A. P., 2020, ApJ, 888, 2 0.4 Krick J. E., Fraine J., Ingalls J., Deger S., 2020,AJ, 160, 99 Lewis N. K., et al., 2013, ApJ, 766, 95 Hierarchical Luger R., Agol E., Foreman-Mackey D., Fleming D. P., Lustig-Yaeger J., Deitrick R., 2019,AJ, 157, 64 Garhart (2020) 0.2 Mandel K., Agol E., 2002, ApJ, 580, L171 2000 2250 2500 2750 3000 3250 3500 May E. M., Stevenson K. B., 2020,AJ, 160, 140 T McElreath, R. 2020, Statistical Rethinking, (2nd ed.). Boca Raton : CRC 0 Press Mendonça J. M., Malik M., Demory B.-O., Heng K., 2018,AJ, 155, 150 Morello G., Waldmann I. P., Tinetti G., Peres G., Micela G., Howarth I. D., Figure 5. Dayside brightness temperatures from Fig4, scaled by irradiation 2014, ApJ, 786, 22 temperature. The horizontal lines represent some theoretical limits on dayside Morello G., Waldmann I. P., Tinetti G., 2016, ApJ, 820, 86 temperature assuming a zero Bond albedo: the solid line assumes zero heat Morello G., Danielski C., Dickens D., Tremblin P., Lagage P. O., 2019,AJ, recirculation, the dashed line assumes a uniform dayside hemisphere but a 157, 205 temperature of zero on the nightside, and the dotted line assumes a uniform Morvan M., Nikolaou N., Tsiaras A., Waldmann I. P., 2020,AJ, 159, 109 temperature at every location on the planet. Neil A. R., Rogers L. A., 2020, ApJ, 891, 12 Parmentier V., Crossfield I. J. M., 2018, Exoplanet Phase Curves: Observations and Theory. Springer International Publishing AG, doi:10.1007/978-3-319-55333-7_116 Sarkis P., Mordasini C., Henning T., Marleau G. D., Mollière P., 2021, A&A, 645, A79 Schwartz J. C., Cowan N. B., 2015, MNRAS, 449, 4192

MNRAS 000,1–10 (2021) 10 D. Keating and N. Cowan

Schwartz J. C., Kashner Z., Jovmir D., Cowan N. B., 2017, ApJ, 850, 154 Schwarz G., 1978, Annals of Statistics, 6, 461 Sing D. K., et al., 2016, Nature, 529, 59 Stevenson K. B., et al., 2012, ApJ, 754, 136 Stevenson K. B., et al., 2017,AJ, 153, 68 Tamburo P., Mandell A., Deming D., Garhart E., 2018,AJ, 155, 221 Teske J., et al., 2020, arXiv e-prints, p. arXiv:2011.11560 The Theano Development Team et al., 2016, arXiv e-prints, p. arXiv:1605.02688 Thorngren D. P., Fortney J. J., Lopez E. D., Berger T. A., Huber D., 2021, arXiv e-prints, p. arXiv:2101.05285 Tinetti G., et al., 2018, Experimental Astronomy, 46, 135 Waldmann I. P., 2012, ApJ, 747, 12 Watanabe S., 2010, arXiv e-prints, p. arXiv:1004.2316 Wong I., et al., 2014, ApJ, 794, 134 Zhang M., et al., 2018,AJ, 155, 83

This paper has been typeset from a TEX/LATEX file prepared by the author.

MNRAS 000,1–10 (2021)