MNRAS 000, 1–11 (2015) Preprint 2 June 2016 Compiled using MNRAS LaTeX style file v3.0
Jackknife resampling technique on mocks: an alternative method for covariance matrix estimation
S. Escoffier1⋆, M.-C. Cousinou1, A. Tilquin1, A. Pisani1,2,3, A. Aguichine1,4, S. de la Torre5, A. Ealet1, W. Gillard1, E. Jullo5
1 Aix Marseille Université, CNRS/IN2P3, CPPM UMR 7346, 13288 Marseille, France
2 Sorbonne Universités, UPMC (Paris 06), UMR 7095, Institut d'Astrophysique de Paris, 98bis Bd. Arago, F-75014 Paris, France
3 CNRS, UMR 7095, Institut d'Astrophysique de Paris, 98bis Bd. Arago, F-75014 Paris, France
4 Université Paris-Saclay, ENS Cachan, 61 av. du Président Wilson, 94230 Cachan, France
5 Aix Marseille Université, CNRS, LAM (Laboratoire d'Astrophysique de Marseille) UMR 7326, 13388 Marseille, France
Accepted XXX. Received YYY; in original form ZZZ
ABSTRACT
We present a fast and robust alternative method to compute the covariance matrix in cosmology studies. Our method is based on jackknife resampling applied to simulated mock catalogues. Using a set of 600 BOSS DR11 mock catalogues as a reference, we find that the jackknife technique gives a similar estimate of the galaxy clustering covariance matrix while requiring a smaller number of mocks. A comparison of convergence rates shows that ∼7 times fewer simulations are needed to reach a similar accuracy on the variance. We expect this technique to be applicable to any analysis where the number of available N-body simulations is low.

Key words: cosmology: cosmological parameters – large-scale structure of Universe – methods: data analysis
arXiv:1606.00233v1 [astro-ph.CO] 1 Jun 2016

1 INTRODUCTION

During the last decade the observation of the large-scale structure of the Universe has intensified: the mapping of the three-dimensional distribution of galaxies has reached an unprecedented level of precision. However, sub-percent accuracy on the measured parameters is required to probe the nature of dark energy or modifications to gravity. Incoming and future large spectroscopic galaxy surveys such as the Dark Energy Spectroscopic Instrument (Levi et al. 2013), the Subaru Prime Focus Spectrograph (Ellis et al. 2014) and the space-based Euclid mission (Laureijs et al. 2011) aim to deliver high statistical accuracy. The challenge for these surveys is to control systematic biases and to contain all potential systematic errors within the bounds set by statistical ones.

An essential step in galaxy clustering measurements is to obtain an unbiased estimate of the covariance matrix, which is needed to fully control the accuracy on the measured parameters. The difficulty in predicting an accurate data covariance matrix lies intrinsically in the modelling of the nonlinear gravitational evolution of the matter distribution, galaxy bias and redshift-space distortions (RSD). The data covariance matrix can be estimated in a number of ways. A standard approach is to generate many mock samples that match the properties of the cosmological data set (e.g. Manera et al. 2013). The data covariance matrix is then computed from the scatter in the parameter value obtained for each statistically independent realization. Recent works (Taylor et al. 2013; Dodelson & Schneider 2013; Taylor & Joachimi 2014; Percival et al. 2014) showed that this covariance estimate suffers from noise due to the finite number of mock realizations, implying an extra variance on the estimated cosmological parameters of O(1/(N_mocks − N_bins)), where N_bins is the total number of bins in the measurement. To keep this error contribution at an acceptable level, one needs to estimate the covariance matrix from a very large number of mock realizations, satisfying N_mocks ≫ N_bins. For instance, in the SDSS survey, 600 independent mock catalogues were generated for the DR10 and DR11 BOSS measurements (Anderson et al. 2014; Percival et al. 2014), and several tens of thousands are expected for the next generation of galaxy surveys. The main limitation of this approach is that the large number of mock realizations requires large computational resources, and this situation will be difficult to manage for surveys such as DESI or Euclid.

Numerous efforts have been made to propose viable alternatives. Approximate methods aim at reducing computational time with fast gravity solvers instead of producing full N-body simulations. Besides log-normal density field realizations (Coles & Jones 1991), methods based on Lagrangian perturbation theory (LPT) have been proposed, such as pthalos (Scoccimarro & Sheth 2002; Manera et al. 2013), pinocchio (Monaco et al. 2002, 2013) and the patchy algorithm (Kitaura et al. 2015). Other recent approaches include the effective Zel'dovich approximation ezmocks (Chuang et al. 2015), the quick particle mesh (White et al. 2014) and the hybrid fast cola method (Tassev et al. 2013; Koda et al. 2015).

Interestingly, statistical methods can also be applied in the limit of a small number of simulations, such as shrinkage estimation (Pope & Szapudi 2008), resampling in Fourier modes (Schneider et al. 2011), the covariance tapering method (Paz & Sanchez 2015), or theoretically motivated few-parameter models (Pearson & Samushia 2016; O'Connell et al. 2015).

Lastly, the data covariance matrix can be estimated through the resampling of the data itself, using jackknife (Tukey 1958) or bootstrap (Efron 1979) methods. Such internal methods have been applied extensively in the past (Barrow et al. 1984; Ling et al. 1986; Hamilton 1993; Fisher et al. 1994; Zehavi et al. 2002). The jackknife resampling technique has some disadvantages (Norberg et al. 2009), but it is usually discussed in the delete-one scheme, based on deleting a single case from the original sample.

In this paper we propose a hybrid approach that applies jackknife resampling to mock catalogues (smc method) instead of to the data itself. The sample covariance is estimated using the delete-d jackknife scheme (Shao & Wu 1989) applied to an ensemble of independent realizations. The smc method can be particularly useful when the number of available N-body light-cone simulations is low.

This paper is organized as follows. In Section 2 we present the formalism of the covariance and precision matrices. We briefly recall how the sample covariance matrix is estimated from a set of independent simulations in Section 2.1. We then describe the jackknife resampling technique and give an estimate of the sample covariance matrix with our smc method in Section 2.2. In Section 3 we apply our method to the estimate of the correlation function covariance matrix. We describe in more detail our method of resampling mock realizations with the jackknife method and illustrate it with the sample mean and sample variance of the correlation function in Section 3.2. We discuss the settings of the jackknife parameters in the covariance matrix estimate in Section 3.3. In Section 3.4 we compare the full covariance matrix obtained by the smc estimate to the one obtained simply from N-body simulations. We next compare precision matrices in Section 3.5. Finally, we compare the rates of convergence in Section 3.6. We conclude with a general discussion in Section 4.

⋆ E-mail: escoffi[email protected]

2 COVARIANCE AND PRECISION MATRIX

We suppose here that the parameter estimation is derived from a likelihood analysis. If the data distribution is a multivariate Gaussian, then parameter constraints are obtained by minimizing:

\chi^2 = -2 \ln \mathcal{L} = \sum_{i,j=1}^{N_b} (x_i^d - x_i^{\rm model}) \, \Psi_{ij}^t \, (x_j^d - x_j^{\rm model})    (1)

where x^d is the data collected in N_b bins, x^{model} is the model prediction and Ψ^t is the inverse covariance matrix, the so-called precision matrix. The superscript t denotes the true matrix.

2.1 Independent mock realizations

2.1.1 Covariance matrix

The evaluation of the likelihood function requires the knowledge of the precision matrix Ψ^t, computed as the inverse of the covariance matrix. If a physical model is given for the covariance matrix, then the computation is analytical. Otherwise, as is the case when the statistical properties of the data are not well known, the data covariance matrix can be estimated from an ensemble of simulations. In such a case, an unbiased estimate of C_ij is given by:

\hat{C}_{ij} = \frac{1}{N_s - 1} \sum_{k=1}^{N_s} (y_i^k - \bar{y}_i)(y_j^k - \bar{y}_j)    (2)

where N_s is the number of independent realizations, y_i^k is the value of the data parameter in bin i for the k-th mock realization and \bar{y}_i is the mean value in bin i over the set of mock catalogues:

\bar{y}_i = \frac{1}{N_s} \sum_{k=1}^{N_s} y_i^k    (3)

The statistical properties of the sample covariance matrix \hat{C} of Eq. (2) follow a Wishart distribution (Wishart 1928), which generalizes the χ² distribution in the case of multivariate Gaussian parameters. From the moments of the Wishart distribution one can deduce the variance of the elements of the sample covariance matrix:

\sigma^2[\hat{C}_{ij}] = \frac{1}{N_s - 1} \left[ C_{ij}^2 + C_{ii} C_{jj} \right]    (4)

2.1.2 Precision matrix

The precision matrix is defined as the inverse of the true covariance matrix, Ψ^t ≡ [C^t]^{-1}. However, the measured [\hat{C}]^{-1} follows the skewed inverse-Wishart distribution (Press 1982), giving a biased estimate of the true precision matrix Ψ^t. If the condition N_s > N_b + 2 is satisfied, where N_b is the size of the data x^d, then an unbiased estimate of the precision matrix is given by (Hartlap et al. 2007):

\hat{\Psi} = \frac{N_s - N_b - 2}{N_s - 1} [\hat{C}]^{-1}    (5)

Taylor et al. (2013) and Dodelson & Schneider (2013) found that the bias in the mean of the precision matrix is not the only effect that affects the accuracy of cosmological parameter estimation, and that the covariance of the sample precision matrix should also be corrected for, due to the finite number of realizations used. The unbiased variance of the elements of the precision matrix can be expressed as (Taylor et al. 2013):

\sigma^2[\hat{\Psi}_{ij}] = A \left[ (N_s - N_b) \Psi_{ij}^2 + (N_s - N_b - 2) \Psi_{ii} \Psi_{jj} \right]    (6)

with

A = \frac{1}{(N_s - N_b - 1)(N_s - N_b - 4)}    (7)
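As a concrete illustration (not part of the original analysis), Eqs. (2), (3) and (5) can be sketched in a few lines of NumPy; the function names and array shapes are our own choices:

```python
import numpy as np

def sample_covariance(y):
    """Unbiased sample covariance, Eq. (2).

    y has shape (Ns, Nb): one row per independent mock realization,
    one column per measurement bin."""
    ns = y.shape[0]
    ybar = y.mean(axis=0)          # per-bin mean over mocks, Eq. (3)
    d = y - ybar
    return d.T @ d / (ns - 1)

def hartlap_precision(cov, ns):
    """Unbiased precision matrix, Eq. (5), valid for Ns > Nb + 2
    (Hartlap et al. 2007)."""
    nb = cov.shape[0]
    if ns <= nb + 2:
        raise ValueError("Eq. (5) requires Ns > Nb + 2")
    return (ns - nb - 2) / (ns - 1) * np.linalg.inv(cov)
```

For Gaussian-distributed bins, `sample_covariance` agrees with `np.cov(y, rowvar=False)`; the Hartlap prefactor only rescales the naive inverse, so the correction matters most when Ns is not much larger than Nb.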
2.2 The jackknife resampling on mock catalogs

2.2.1 The delete-d jackknife method

The jackknife method is a statistical technique which aims at using the observed data themselves to estimate the error on a measurement (Quenouille 1949; Tukey 1958). Given a sample of size N, one can omit, in turn, each of the N subsamples from the initial dataset to draw N new datasets, each of size N − 1 (delete-1 method). The resulting mean and variance are computed over the entire ensemble of N datasets.

First attempts were made for two-point clustering statistics from galaxy redshift surveys, where the resampling technique was applied by removing each galaxy in turn from the catalogue (Barrow et al. 1984; Ling et al. 1986). However, Norberg et al. (2009) argued that, in order to avoid underestimating clustering uncertainties and to resolve computational troubles given the very high statistics of modern galaxy catalogues, the resampling should be applied to N_s sub-volumes into which the dataset has been split, instead of to individual galaxies. The main issue is to be able to reproduce the large-scale variance. Another concern is the delete-1 jackknife variance estimator, which is known to be asymptotically consistent for linear estimators (e.g. the sample mean), but biased in non-linear cases (e.g. the correlation function) (Miller 1974; Wu 1986).

Hence we consider in this paper the delete-d jackknife method proposed by Shao & Wu (1989). The delete-d jackknife method consists of leaving out N_d observations at a time instead of only one, which in our case means leaving out, in turn, N_d subsamples amongst the N_s initial subsamples. The dimension of each new dataset is (N_s − N_d) and the number of datasets is the binomial coefficient N_jk = \binom{N_s}{N_d}. It can be shown that the delete-d jackknife method is asymptotically unbiased if N_d satisfies both conditions \sqrt{N_s}/N_d → 0 and N_s − N_d → ∞ (Wu 1986). This means it is preferable to choose

\sqrt{N_s} < N_d < N_s    (8)

In the delete-d jackknife scheme where N_jk jackknife resamplings are applied to a single realization, the covariance matrix is estimated by:

\hat{C}_{ij} = \frac{N_s - N_d}{N_d \, N_{jk}} \sum_{k=1}^{N_{jk}} (y_i^k - \bar{y}_i)(y_j^k - \bar{y}_j)    (9)

where y_i^k is the value of the data parameter in bin i for the k-th jackknife configuration and \bar{y}_i is the empirical average of the jackknife replicates:

\bar{y}_i = \frac{1}{N_{jk}} \sum_{k=1}^{N_{jk}} y_i^k    (10)

2.2.2 Average covariance and precision matrices

The combined method we propose in this paper consists in averaging the covariance matrices computed with N_jk jackknife pseudo-realizations over a small sample of independent mock catalogues, rather than using a single realization. If \hat{C}_{ij}^{(m)} denotes the covariance matrix computed with N_jk jackknife pseudo-realizations applied to the initial mock m and satisfying Eq. (9), then we can define an estimator of the sample covariance matrix as the average of the covariance matrices over N_M independent mocks:

\bar{C}_{ij} = \frac{1}{N_M} \sum_{m=1}^{N_M} \hat{C}_{ij}^{(m)}    (11)

As \bar{C}_{ij} is a sample mean computed from uncorrelated \hat{C}_{ij}^{(m)} (independent mock catalogues being uncorrelated by definition), we can compute the variance of the sample mean as:

\sigma^2[\hat{C}_{ij}] = \frac{1}{N_M - 1} \sum_{m=1}^{N_M} \left( \hat{C}_{ij}^{(m)} - \bar{C}_{ij} \right)^2    (12)

We define in this case the precision matrix as the inverse of the average covariance matrix:

\hat{\Psi}_{ij} = [\bar{C}_{ij}]^{-1}    (13)

The relevance of using an average covariance matrix over several mock realizations is discussed and highlighted in Sections 3.2 and 3.4.

3 ILLUSTRATION USING THE CORRELATION FUNCTION

In this section we present the new hybrid method for the estimate of the covariance matrix, applied to the two-point correlation function. We perform some validation tests, using as a reference the covariance matrix computed from 600 independent mock realizations (Manera et al. 2013).

3.1 Method

3.1.1 Simulation data

The galaxy mock catalogues used for this work were generated for the SDSS-III/BOSS survey (Dawson et al. 2013) and used in the DR11 BOSS analysis (Anderson et al. 2014). They were generated using a method similar to pthalos (Scoccimarro & Sheth 2002), and a detailed description is given in Manera et al. (2013). We only consider the CMASS galaxy sample (0.43 < z < 0.75) for this study, with 600 independent realizations available.

The study presented here is restricted to the CFHT Legacy Survey W1 field in a multi-parameter fit. We defer the study extended to the full NGC and SGC BOSS footprints to a future paper.

In order to generate jackknife configurations, we divide our sample into separate regions on the sky of approximately equal area. Unless otherwise specified, we use N_s = 12 and N_d = 6 in the following, cutting in the CFHTLS W1 field.

3.1.2 Clustering statistics

In the rest of this paper we consider two-point clustering statistics, calculating the two-dimensional correlation function ξ(s, µ), in which s is the redshift-space separation of pairs of galaxies and randoms and µ is the cosine of the angle of the pair to the line of sight, using the Landy & Szalay (1993) estimator:

\xi(s,\mu) = \frac{DD(s,\mu) - 2\,DR(s,\mu) + RR(s,\mu)}{RR(s,\mu)}    (14)
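The delete-d resampling loop of Eqs. (9)-(10) can be sketched as follows. This is our own minimal illustration: `measure` is a placeholder for the user's clustering pipeline, returning the Nb-bin statistic computed on the union of the kept sub-volumes.

```python
import itertools
import numpy as np

def delete_d_jackknife_cov(measure, ns, nd):
    """Delete-d jackknife covariance matrix, Eqs. (9)-(10).

    measure(kept) -> 1-D array of length Nb, the statistic computed
    on the sub-volumes whose indices (0..ns-1) are in `kept`."""
    reps = []
    for dropped in itertools.combinations(range(ns), nd):
        kept = [s for s in range(ns) if s not in dropped]
        reps.append(measure(kept))
    reps = np.asarray(reps, dtype=float)
    reps = reps.reshape(len(reps), -1)       # shape (Njk, Nb)
    njk = reps.shape[0]
    ybar = reps.mean(axis=0)                 # jackknife average, Eq. (10)
    d = reps - ybar
    return (ns - nd) / (nd * njk) * (d.T @ d)   # Eq. (9)
```

With the paper's choice N_s = 12 and N_d = 6, the loop makes \binom{12}{6} = 924 calls to `measure`, one per jackknife pseudo-realization. For N_d = 1 the prefactor reduces to (N_s − 1)/N_s and Eq. (9) recovers the standard delete-1 jackknife estimator.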
where DD, DR and RR represent the numbers of pairs of galaxies extracted from the galaxy sample D and from the random sample R. One galaxy sample D is generated for each of the N_jk combinations (924 combinations for the (12, 6) configuration), to which is associated one random catalogue R generated on the same deleted field, with a number of randomly distributed points fifteen times larger than the number of data points.

We project out the µ-dependence to obtain the multipoles of the correlation function:

\xi_\ell(s) = \frac{2\ell + 1}{2} \int_{-1}^{1} d\mu \, \xi(s,\mu) \, L_\ell(\mu)    (15)

where L_\ell(\mu) is the Legendre polynomial of order \ell. The monopole ξ_0(s) and quadrupole ξ_2(s) components are then projected on 2 × 11 bins of equal width ∆log(s) = 0.1, with s varying from 2.8 h^{-1} Mpc to 28.2 h^{-1} Mpc.

3.2 Sample mean and variance

In order to illustrate the sample variance computed from an ensemble of 600 independent realizations, we display in the upper panel of Fig. 1 the monopole of the two-point correlation function computed for each of the 600 pthalos mock realizations (in light grey) and the related sample mean (in black). The dispersion of the distribution can be measured by the standard deviation (error bars on black points), simply defined as the square root of the diagonal elements of the covariance matrix computed from Eq. (2).

The lower panel of Fig. 1 shows the monopole of the two-point correlation function computed for each of the 924 jackknife pseudo-realizations (in light magenta) generated from one single mock in the (12, 6) configuration (the choice of the jackknife parameters is discussed in the next section). The attractive point of the jackknife resampling technique is that a single pthalos mock realization is self-sufficient to define a sample variance. The sample standard deviation (error bars on dark magenta squares) is defined in this case as the square root of the diagonal terms of the covariance matrix given by Eq. (9).

[Figure 1. Monopole of the correlation function, s²ξ_0(s) versus s (Mpc/h), computed for each of the 600 mock realizations (light grey in the upper panel) and for each of the 924 jackknife pseudo-mocks (light magenta in the lower panel) generated from one given pthalos mock realization (green). The sample means (squares) and standard deviations (error bars) are computed from the 600 mock realizations (black) and from the 924 jackknife pseudo-mocks (dark magenta).]

The sample mean and its standard deviation on the monopole computed from the 924 jackknife pseudo-realizations are also reported in the upper panel of Fig. 1 to ease the comparison. Close attention reveals a difference between the sample means from the 600 pthalos mock realizations (in black) and from the 924 jackknife pseudo-realizations (in magenta). Indeed, the latter has been computed using one single pthalos mock catalogue (in green), and we see a very good consistency between the initial monopole function and the sample mean of the monopole function computed with the jackknife technique, as expected.

We understand intuitively that if we want to faithfully reproduce the sample mean of the 600 pthalos mocks, we must consider many independent mock realizations. Fig. 2 shows the average over an arbitrary set of N_M = 48 jackknife sample means, where each jackknife sample mean is the mean of 924 jackknife pseudo-mocks, and where the standard deviation (error bars) is the square root of the diagonal elements of the covariance matrix as given by Eq. (11). The agreement of the sample means and standard deviations between the 600 pthalos mocks and the average over 48 mocks is very good, especially since only 48 mocks were used for this exercise. The estimates of the sample means agree within 2%.

We will show in the next sections that the variance is also in very good agreement, and converges faster with few independent mocks when using the jackknife resampling method.

3.3 Settings of jackknife parameters

From this section on we put aside the sample mean, introduced for illustration in Section 3.2, and we reframe the discussion on variance and covariance. In this section we discuss the choice of the jackknife parameters in the delete-d method. For this purpose we only consider the diagonal terms of the covariance matrices of the correlation function, and the comparison of variance estimates is done for one single mock realization (N_M = 1). We will then show that the jackknife variance depends on the choice of the initial mock catalogue and that it is necessary to use several independent simulation
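The averaging over N_M independent mocks used above for Fig. 2 (Eqs. 11-13) can be sketched as follows; this is our own illustration, and the function name and array layout are assumptions:

```python
import numpy as np

def smc_average(jk_covs):
    """Combine the delete-d jackknife covariances of NM independent
    mocks: the average matrix (Eq. 11), the element-wise scatter
    about it (Eq. 12), and the resulting precision matrix (Eq. 13).

    jk_covs has shape (NM, Nb, Nb): one jackknife covariance
    matrix per mock."""
    jk_covs = np.asarray(jk_covs, dtype=float)
    nm = jk_covs.shape[0]
    cbar = jk_covs.mean(axis=0)                               # Eq. (11)
    scatter = ((jk_covs - cbar) ** 2).sum(axis=0) / (nm - 1)  # Eq. (12)
    return cbar, scatter, np.linalg.inv(cbar)                 # Eq. (13)
```

Since the mocks are independent, the element-wise scatter of Eq. (12) shrinks as 1/N_M, which is the mechanism behind the faster convergence discussed in Section 3.6.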
[Figure 2. Sample mean of the monopole of the correlation function for N_M = 48 arbitrary independent mocks (in blue), for which ... (caption truncated). Legend: sample mean of the 924 jackknife realizations for one pthalos mock realization; average of 48 jackknife sample means; sample mean of the 600 mock realizations. Axes: s²ξ_0(s) (Mpc/h)² versus s (Mpc/h).]

[Figure: error on the monopole, σ(ξ_0(s)), versus s (Mpc/h), for jackknife resampling configurations (12, 6), (12, 4) and (12, 8) applied to mock #1.]