Estimating Cosmological Parameters from the Dark Matter Distribution

Siamak Ravanbakhsh* ([email protected]), Junier Oliva* ([email protected]), Sebastien Fromenteau† ([email protected]), Layne C. Price† ([email protected]), Shirley Ho† ([email protected]), Jeff Schneider* ([email protected]), Barnabás Póczos* ([email protected])

*School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213, USA
†McWilliams Center for Cosmology, Department of Physics, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213, USA

Abstract

A grand challenge of 21st century cosmology is to accurately estimate the cosmological parameters of our Universe. A major approach to estimating the cosmological parameters is to use the large-scale matter distribution of the Universe. Galaxy surveys provide the means to map out cosmic large-scale structure in three dimensions. Information about galaxy locations is typically summarized in a "single" function of scale, such as the galaxy correlation function or power spectrum. We show that it is possible to estimate these cosmological parameters directly from the distribution of matter. This paper presents the application of deep 3D convolutional networks to a volumetric representation of dark-matter simulations, as well as the results obtained using a recently proposed distribution regression framework, showing that machine learning techniques are comparable to, and can sometimes outperform, maximum-likelihood point estimates using "cosmological models". This opens the way to estimating the parameters of our Universe with higher accuracy.

Figure 1. Dark matter distribution in three cubes produced using different sets of parameters. Each cube is divided into small sub-cubes for training and prediction. Note that although the cubes in this figure are produced using very different cosmological parameters within our constrained sampled set, the effect is not visually discernible.

1. Introduction

The 21st century has brought us tools and methods to observe and analyze the Universe in far greater detail than before, allowing us to deeply probe the fundamental properties of cosmology. We have a suite of cosmological observations that allow us to make serious inroads into the understanding of our own universe, including the cosmic microwave background (CMB) (Planck Collaboration et al., 2015; Hinshaw et al., 2013), supernovae (Perlmutter et al., 1999; Riess et al., 1998) and the large scale structure of galaxies and galaxy clusters (Cole et al., 2005; Anderson et al., 2014; Parkinson et al., 2012). In particular, large scale structure involves measuring the positions and other properties of bright sources in great volumes of the sky. The amount of information is overwhelming, and modern methods in machine learning and statistics can play an increasingly important role in modern cosmology. For example, the common method to compare large scale structure observation and theory is to compare the compressed two-point correlation function of the observation with the theoretical prediction (which is only correct up to a certain physical separation scale). We argue here that there may be a better way to make this comparison.

The best model of the Universe is currently described by less than 10 parameters in the standard ΛCDM cosmology model, where CDM stands for cold dark matter and Λ stands for the cosmological constant. The ΛCDM parameters that are important for this analysis include the matter density Ωm ≈ 0.3 (normal matter and dark matter together constitute ∼30% of the energy content of the Universe), the dark energy density ΩΛ ≈ 0.7 (∼70% of the energy content of the Universe is a dark energy substance that pushes the content of the universe apart), the variance in the matter over-densities σ8 ≈ 0.8 (measured on the matter power spectrum smoothed over 8 h⁻¹Mpc spheres), and the current Hubble parameter H0 = 100h ≈ 70 km/s/Mpc (which describes the present rate of expansion of the Universe). ΛCDM also assumes a flat geometry for the Universe, which requires ΩΛ = 1 − Ωm (Dodelson, 2003). Note that the unit of distance megaparsec/h (h⁻¹Mpc) used above is time-dependent, where 1 Mpc is equivalent to 3.26 × 10⁶ light years and h is the dimensionless Hubble parameter that accounts for the expansion of the universe.

The expansion of the Universe stretches the wavelength of, or redshifts, the light that is emitted from distant galaxies, with the amount of change in wavelength depending on their distances and the cosmological parameters. Consequently, for a fixed cosmology we can use the directly observed redshift z of galaxies as a proxy for their distance away from us and/or the time at which the light was emitted.

Here, we present a first attempt at using advanced machine learning to predict cosmological parameters directly from the distribution of matter. The final goal is to apply such models to produce better estimates for the cosmological parameters of our universe. In the following, Section 2 presents our main results. Section 3 elaborates on the simulation and cosmological analysis procedures as well as the machine learning techniques used to obtain these estimates.

Figure 2. Prediction and ground truth of Ωm and σ8 using the 3D conv-net and analysis of the power spectrum on 50 test cube instances.

2. Results

To build the computational model, we rely on direct dark matter simulations produced using different cosmological parameters and random seeds. We sample these parameters within a very narrow range that reflects the uncertainty of our current best estimates of these parameters for our universe from real data, in particular the Planck Collaboration et al. (2015) CMB observations.
Our objective is to show that it is possible to further improve the estimates in this range, for simulated data, using a deep convolutional neural network (conv-net).

We consider two sets of simulations. The first set contains only one snapshot of the dark matter distribution at the present day. The following cosmological parameters are varied across simulations: I) the mass density Ωm; II) σ8 (or alternatively, the amplitude of the primordial power spectrum, As, which can be used to predict σ8).

Here, each training and test instance is the output of an N-body simulation with millions of particles in a box or "cube" that is tens of h⁻¹Mpc across. All the simulations in this dataset are recorded at the present day, i.e., redshift z = 0. Figure 1 shows three cubes with their corresponding cosmological parameters. As is evident from this figure, distinguishing the constants using visual clues is challenging. Importantly, there is substantial variation among cubes even with similar cosmological parameters, since the initial conditions are chosen randomly in each simulation. In all experiments, we use 90% of the data for training and the remaining 10% for testing.

We compare the performance of the conv-net to a standard cosmology analysis based on the standard maximum likelihood fit to the matter power spectrum (Dodelson, 2003). Figure 2 presents our main result, the prediction versus the ground truth for the cosmological parameters using both methods. We find that the maximum likelihood prediction for (σ8, Ωm) has an average relative error of (0.013, 0.072), respectively.¹ In comparison, the conv-net has an average relative error of (0.012, 0.028), which is a clear advantage in predicting Ωm. Predictions for the conv-net are the mean value of the predictions on smaller 128 h⁻¹Mpc sub-cubes.

¹Relative error for the ground truth Ωm and the prediction Ω̂m is defined as |Ωm − Ω̂m| / Ωm.
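For concreteness, the following minimal NumPy sketch (ours, not part of the paper's pipeline) computes the average relative error of footnote 1 over a set of test predictions; the example values are hypothetical placeholders.

```python
import numpy as np

def avg_relative_error(truth, pred):
    """Average relative error |y - yhat| / y, as defined in footnote 1."""
    return np.mean(np.abs(truth - pred) / truth)

# Hypothetical Omega_m values on three test cubes, for illustration only.
omega_m_true = np.array([0.27, 0.31, 0.29])
omega_m_pred = np.array([0.28, 0.30, 0.29])
print(avg_relative_error(omega_m_true, omega_m_pred))
```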
On these sub-cubes, the conv-net has a relatively small standard deviation of (0.0044, 0.0032), indicating only small variations in predictions using much smaller sub-cubes. We have not performed a maximum likelihood estimate on these small sub-cubes, since the quality of the results would be drastically limited by sample variance.² We also observed that changing the size of these sub-cubes by a factor of two did not significantly affect the conv-net's prediction accuracy; see Appendix A for details.

²For the power spectrum analysis there is a strong degeneracy in the (σ8, Ωm) plane on small scales: larger (smaller) values of σ8 combined with smaller (larger) Ωm predict comparable power spectra. This introduces a small bias into the maximum likelihood estimate.

The second dataset contains 100 simulations using a more sophisticated simulation code (Trac et al., 2015), where each simulation is recorded at 13 different redshifts z ∈ [0, 6]; see Figure 3. Simulations in this set use fewer particles, and since the distribution of matter at different redshifts is substantially different (compared to the effect of the cosmological parameters in the first dataset), we are able to produce reasonable estimates of the redshift using the distribution-to-real regression framework of Oliva et al. (2014) as well as a 3D conv-net. Figure 4 reports both results for the training and test sets.

Figure 3. Log-density of dark matter at different redshifts. Each row shows a slice of a different 3D cube. From left to right the redshift increases in 1 Gyr steps.

Figure 4. Prediction and ground truth of the redshift z on test instances for both the 3D conv-net and the double-basis estimator (2BE).

3. Methods

We review the procedure for dark matter simulations in Section 3.1 and outline the standard cosmological likelihood analysis in Section 3.2. Section 3.3 and Section 3.4 detail our deep conv-net applied to the data, and Section 3.5 describes the details of the redshift estimation using a double-basis estimator.

3.1. Simulations

Simulations play an important part in modern cosmology studies, particularly in order to model the non-linear effects of general relativity and gravity, which are impossible to take into account in a simpler analytic solution. Consequently, significant effort has been made in the last few decades to obtain a large number of realistic simulations as a function of the cosmological parameters. The simulations also provide a useful test for supervised machine learning techniques.

In order to apply a machine learning process to cosmological parameter estimation we need to generate a huge number of simulations for the training set. Moreover, it is important to generate big-volume simulation boxes in order to accurately reproduce the statistics of large-scale structures. There are several algorithms for calculating the gravitational acceleration in N-body simulations, ranging from slow-and-accurate to fast-and-approximate. The equations of motion for the N particles are solved in discrete time steps to track the nonlinear trajectories of the particles.

As we are interested in large-scale statistics, for the first dataset we use the COmoving Lagrangian Acceleration (COLA) code (Tassev et al., 2013; Koda et al., 2015). The COLA code is a mixture of N-body simulation and second-order Lagrangian perturbation theory. This method conserves the N-body accuracy at large scales and agrees with the non-linear power spectrum (see Section 3.2) that can be obtained with ultra high-resolution pure N-body simulations (Springel, 2005) at better than 95% up to k ∼ 0.5 h Mpc⁻¹.
For the first study we generate 500 cubic simulations with a side of 512 h⁻¹Mpc containing 512³ dark matter particles, evolving the simulation until redshift z = 0. The mass of these particles varies with the value of Ωm from mp ∼ 6.5 × 10¹⁰ to mp ∼ 9.5 × 10¹⁰ h⁻¹M⊙, where M⊙ is a solar mass. We start the simulations at a redshift of z ∼ 20³ and use 20 steps up to the final redshift z = 0. Each box is generated using a different seed for the random initial conditions.⁴ The Hubble parameter used for all simulations is H0 = 70 km/s/Mpc.⁵ Each simulation on average requires 6 CPU hours on 2 GHz processors and the final raw snapshot is about 1 GB in size.

³Corresponding to a scale factor of a = 0.05, as advocated in Izard et al. (2015).

⁴This random seed is generated by an adjusted version of the 2LPTIC code (Koda et al., 2015).

⁵We use a scalar perturbation spectral index of 0.96 for all the simulations and a cosmological constant with ΩΛ = 1 − Ωm in order to conserve a flat universe.

Motivated by the PLANCK results (Planck Collaboration et al., 2015), we use a Gaussian distribution ln(10¹⁰As) = 3.089 ± 0.036 for the amplitude of the initial scalar perturbations and a flat distribution on the range [0.25, 0.35] for Ωm. Note that the PLANCK results arguably give us the best constraints on the parameters of our Universe, limiting our simulations mostly to the uncertain regions of the parameter space. The value for σ8 is obtained by calculating the convolution of the linear power spectrum with a top-hat window function of radius 8 h⁻¹Mpc, using the CAMB code; see Section 3.2 for the power spectrum. Figure 5 shows the distribution of the three parameters (two independent, one derived) that vary across simulations.

Figure 5. Distribution of cosmological parameters in the first set of simulations.
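As an illustration of the sampling scheme just described, here is a minimal NumPy sketch; the mapping from (As, Ωm) to σ8 requires a Boltzmann code such as CAMB, which we only stub out (compute_sigma8 is a placeholder, not a real API).

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims = 500

# ln(10^10 A_s) ~ N(3.089, 0.036^2) and Omega_m ~ Uniform[0.25, 0.35],
# matching the sampling ranges described above.
ln_1e10_As = rng.normal(3.089, 0.036, size=n_sims)
omega_m = rng.uniform(0.25, 0.35, size=n_sims)
A_s = np.exp(ln_1e10_As) * 1e-10

# sigma_8 is derived from (A_s, Omega_m) via the linear power spectrum
# smoothed with an 8 Mpc/h top-hat window (e.g., with CAMB).
# The call below is a hypothetical placeholder for that step:
# sigma_8 = compute_sigma8(A_s, omega_m, h=0.7, n_s=0.96)
```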

The simulations in the second dataset are based on a particle-particle-particle-mesh (P³M) algorithm from Trac et al. (2015).⁶ Each simulation is computed in 13 time steps of 1 gigayear (Gyr), using 128³ particles in boxes with sides of 128 h⁻¹Mpc, using the standard ΛCDM cosmology.

⁶The long-range potential is computed using a particle-mesh algorithm where Poisson's equation is efficiently solved using Fast Fourier Transforms. The short-range force is computed for particle-particle interactions using direct summation.

3.2. Two-Point Correlation and Maximum Likelihood Power Spectrum Analysis

A commonly used measurement for the analysis of the distribution of matter is the two-point correlation function ξ(r⃗), measuring the excess probability, relative to a random distribution, of finding two points of the matter distribution in the volume elements dV₁ and dV₂ separated by a vector distance r⃗; that is,

$$dP_{12}(\vec{r}) = n^2 \left(1 + \xi(\vec{r})\right) dV_1\, dV_2, \qquad (1)$$

where n is the mean density (the number of particles divided by the volume), and n² dV₁dV₂ in the equation above measures the probability of finding two points in dV₁ and dV₂ at vector distance r⃗. Under the cosmological principle the Universe is statistically isotropic and homogeneous; therefore the correlation function only depends on the distance r = |r⃗|. The matter power spectrum Pm(k) is the Fourier transform of the correlation function, where k = |k⃗| is the magnitude of the Fourier basis vector.

The form of the power spectrum as a function of k depends on the cosmological parameters, in particular σ8 and Ωm. For a larger (smaller) σ8 the amplitude of the power spectrum smoothed on the scale of 8 h⁻¹Mpc increases (decreases). Similarly, larger Ωm shifts power into smaller scales.
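The "empirical" power spectrum used in the likelihood analysis below can be estimated from a density cube by an FFT binned isotropically in |k|. A minimal NumPy sketch follows; the normalization is one common convention and the function is only illustrative, not the pipeline used here.

```python
import numpy as np

def power_spectrum(delta, box_size, n_bins=32):
    """Isotropically binned P(k) of an overdensity cube delta.

    delta:    (ng, ng, ng) array of overdensity rho/rho_bar - 1.
    box_size: box side length, e.g. in Mpc/h.
    """
    ng = delta.shape[0]
    delta_k = np.fft.fftn(delta)
    # One common normalization: P(k) = |delta_k|^2 * V / N^2.
    power = (np.abs(delta_k) ** 2 * box_size**3 / ng**6).ravel()

    # |k| for every Fourier mode on the grid.
    k1 = 2 * np.pi * np.fft.fftfreq(ng, d=box_size / ng)
    kx, ky, kz = np.meshgrid(k1, k1, k1, indexing="ij")
    k_mag = np.sqrt(kx**2 + ky**2 + kz**2).ravel()

    bins = np.linspace(k_mag[k_mag > 0].min(), k_mag.max(), n_bins + 1)
    which = np.digitize(k_mag, bins)
    pk = []
    for i in range(1, n_bins + 1):
        sel = which == i
        pk.append(power[sel].mean() if sel.any() else 0.0)
    return 0.5 * (bins[1:] + bins[:-1]), np.array(pk)
```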
algorithm where Poisson’s equation is efficiently solved using 8We use the linear Boltzmann code CAMB (Lewis et al., 2000), Fast Fourier Transforms. The short-range force is computed for supplemented with the empirically calibrated non-linear correc- particle-particle interactions using direct summation. tions obtained from HALOFIT (Smith et al., 2003). Cosmological Parameters from the Dark matter Distribution we discretize the power spectrum to equally spaced bins in permutation invariance. For conv-nets, prior to data aug- log k. We the estimate the covariance matrix of this Gaus- mentation, we transform this data to volumetric form, sian using 20 different simulations with the fixed cosmol- where a 3D histogram of d3 voxels represents the nor- ogy of (σ8, Ωm) = (0.812, 0.273). Note that each of these malized density of the matter for each cube. For the first is using different random initial conditions to obtain an es- and second datasets this resolution (in proportion to the timate of the sample variance.9 The sample variance on number of particles and the size of these cubes) is set to scales of k . 0.1Mpc gives a large uncertainty in the es- d = 256 and d = 64 respectively, which means each voxel ˆ −1 −1 timate of Pm(k) at scales & 100 h Mpc in real-space, is 2 h Mpc along each edge. A normalization step en- which corresponds to approximately 20% of the entire sim- sures that the model generalizes to simulations with differ- ulation box. This limits the inferences we can draw from ent number of particles as long as densities remain non- large scales in the dark matter simulation. degenerate. In the first dataset we further break down each of the 500 simulation cubes to 643-voxel sub-cubes, corre- We then maximize the likelihood function over Y using sponding to 1283( h−1Mpc )3. This is in order to obtain the downhill simplex method (Nelder & Mead, 1965) to more training instances for our conv-net; see Figure1 obtain an estimate Yˆ that can be compared to the ground truth cosmological parameter values that are known from Translation invariance is addressed by shift-invariance of the simulations. 10 the convolutional parameters. We augment both datasets with symmetries of a cube. This symmetry group has 48 3.3. Invariances of the Distribution of Matter elements: 6 different 90◦ rotations and 23 = 8 different axis-reflections of each sub-cube. Modern cosmology is built on the cosmological princi- ple that states at large scales, the distribution of matter in The combination of data-augmentation and us- the Universe is homogeneous and isotropic (Ryden, 2003), ing “sub”-cubes increases the training data which implies shift, rotation and reflection invariance of the = (X(1),Y (1)),..., (X(N),Y (N)) to have N > 106 Sand N{ > 62000 instances for the first} and second dataset distribution of matter. These invariances have also made 3 the two-point correlation function –as a shift, rotation and respectively, where in the following we use X Υ = R64 ∈ reflection invariant measurement– an indispensable tool in to denote a (sub-)cube from either dataset. cosmological data analysis. Here, we intend to go beyond To see if the data-augmentation has indeed produced the this measure. Let X denote a cube and Y = (Ωm, σ8) the desirable invariance, we predicted both Ωm and σ8 using corresponding dependent variable. The existence of invari- 48 replicates of each sub-cube. 
3.3. Invariances of the Distribution of Matter

Modern cosmology is built on the cosmological principle, which states that at large scales the distribution of matter in the Universe is homogeneous and isotropic (Ryden, 2003). This implies shift, rotation and reflection invariance of the distribution of matter. These invariances have also made the two-point correlation function, as a shift-, rotation- and reflection-invariant measurement, an indispensable tool in cosmological data analysis. Here, we intend to go beyond this measure. Let X denote a cube and Y = (Ωm, σ8) the corresponding dependent variable. The existence of invariance in the data means p(Y | X) = p(Y | transform(X)), where the invariance identifies the valid transformations.

In machine learning, and in particular deep learning, several recent works have attempted to identify and model the invariances and symmetries of data (e.g., Gens & Domingos, 2014; Cohen & Welling, 2014). However, due to the inefficiency of current techniques, any known symmetry beyond translation invariance is often enforced by data augmentation (e.g., Krizhevsky et al., 2012); see (Dieleman et al., 2015) for an application in astronomy. Data augmentation is the process of replicating data by invariant transformations.

In the original representation of the cubes, particles are fully interchangeable, and this permutation invariance is a source of redundancy. For conv-nets, prior to data augmentation, we transform this data into volumetric form, where a 3D histogram of d³ voxels represents the normalized density of the matter in each cube. For the first and second datasets this resolution (in proportion to the number of particles and the size of the cubes) is set to d = 256 and d = 64 respectively, which means each voxel is 2 h⁻¹Mpc along each edge. A normalization step ensures that the model generalizes to simulations with different numbers of particles, as long as the densities remain non-degenerate. In the first dataset we further break down each of the 500 simulation cubes into 64³-voxel sub-cubes, corresponding to (128 h⁻¹Mpc)³, in order to obtain more training instances for our conv-net; see Figure 1.

Translation invariance is addressed by the shift-invariance of the convolutional parameters. We augment both datasets with the symmetries of a cube. This symmetry group has 48 elements: 6 different 90° rotations and 2³ = 8 different axis-reflections of each sub-cube. The combination of data augmentation and the use of sub-cubes increases the training data S = {(X⁽¹⁾, Y⁽¹⁾), ..., (X⁽ᴺ⁾, Y⁽ᴺ⁾)} to N > 10⁶ and N > 62000 instances for the first and second dataset respectively, where in the following we use X ∈ Υ = ℝ^{64³} to denote a (sub-)cube from either dataset.

To see if the data augmentation has indeed produced the desirable invariance, we predicted both Ωm and σ8 using the 48 replicates of each sub-cube. The average standard deviation in these predictions is 0.0013 and 0.0017 respectively, i.e., small compared to 0.029 and 0.039, their respective standard deviations over the whole test set.
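A minimal NumPy sketch of the preprocessing and augmentation just described follows; the helper names are ours, and the 48 orientations are generated as all 6 axis permutations combined with all 8 axis reflections.

```python
import itertools
import numpy as np

def voxelize(positions, box_size, d):
    """3D histogram of particle positions -> normalized density cube."""
    hist, _ = np.histogramdd(positions, bins=d, range=[(0, box_size)] * 3)
    return hist / hist.sum()

def sub_cubes(cube, s=64):
    """Split a (d, d, d) cube into non-overlapping (s, s, s) sub-cubes."""
    d = cube.shape[0]
    for i, j, k in itertools.product(range(0, d, s), repeat=3):
        yield cube[i:i + s, j:j + s, k:k + s]

def cube_symmetries(x):
    """All 48 symmetries of a cube: 6 axis permutations x 8 reflections."""
    for perm in itertools.permutations(range(3)):
        xt = np.transpose(x, perm)
        for flips in itertools.product([False, True], repeat=3):
            axes = tuple(a for a, f in enumerate(flips) if f)
            yield np.flip(xt, axis=axes) if axes else xt
```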
3.4. Deep Convolutional Network for Volumetric Data

Our goal is to learn the model parameters θ* ∈ Θ for an expressive class of functions fθ : Υ → ℝ², so as to minimize the expected loss E_{X,Y}[ℓ(f(X) − Y)], where ℓ : ℝ² → ℝ is a loss function; e.g., we use the L1 norm. However, due to the unavailability of p(X, Y), a common practice is to minimize the empirical loss $\sum_{(X^{(n)}, Y^{(n)}) \in S} \ell(f(X^{(n)}) - Y^{(n)})$, with an eye towards generalization to new data, which is often enforced by regularization.

Our function class is the class of deep convolutional neural networks (LeCun et al., 2015; Bengio, 2009). Conv-nets have mostly been applied to 2D image data in the past. Besides applications in video processing, with two image dimensions and time as the third dimension, applications of conv-nets to volumetric data are very recent and mostly limited to 3D medical image segmentation (e.g., Kamnitsas et al., 2015; Roth et al., 2015).

Figure 6 shows the architecture of our model, as well as the feature-maps produced at the first two convolutional layers for a particular input. The model uses six convolutional layers that are initially followed by pooling layers to reduce the size of the feature-maps. These convolutional layers are followed by three fully connected layers. A major restriction when moving from 2D images to volumetric data is the substantial increase in the size of the input, which in turn restricts the number of feature-maps at the first layers of the conv-net. This memory usage is further amplified by the fact that in 3D convolution the advantage of using the FFT is considerable; however, FFT-based convolution requires larger memory compared to its time-domain counterpart.

Figure 6. The architecture of our 3D conv-net. The model has six convolutional and 3 fully connected layers. The first two convolutional layers are followed by average pooling. All layers, except the final layer, use leaky rectified linear units, and all the convolutional layers use batch-normalization (b.n.).

Figure 7. (top) Visualization of inputs that maximize the activation of 7/1024 units (corresponding to seven rows) at the first fully connected layer. In this figure, we have unwrapped the maximizing input sub-cubes for better visualization. (bottom) Magnified portion of the top row.

In designing our network we identified several choices that are critical in obtaining the results reported in Section 2:

• Leaky rectified linear units (ReLU) (Maas et al., 2013). These significantly speed up the learning compared to the non-leaky variation. We used the leak parameter c = 0.01 in f(x) = max(0, x) − c max(0, −x).

• Average pooling. We used average pooling in our model and could not learn a meaningful model using max-pooling (which is often used for image processing tasks). One explanation is that with the combination of ReLU and average pooling, activity at higher layers of the conv-net signifies the weighted sum of the dark-matter mass at particular regions of the cube. This information (the total mass in a region) is lost when using max-pooling.

Here, both pooling layers sub-sample by a factor of two along each dimension.

• Batch normalization (Ioffe & Szegedy, 2015). This is necessary to undo the internal covariate shift and stabilize the gradient calculations. The basic idea is to normalize the output of each layer, with an online estimate of the mean and variance of all the training data at that layer, before applying the non-linearity. Without batch-normalization, we observed exploding gradients early during training.¹¹ In using batch-normalization, we normalize the values across all the voxels of each feature-map. However, since the number of training instances in each mini-batch is limited due to memory constraints, batch-normalization across the fully connected layers introduces relatively large oscillations during learning. For this reason, we limit batch-normalization to the convolutional layers.

¹¹Batch-normalization would not be critical in a more shallow network. However, we observe consistent, although sometimes marginal, improvement by increasing the number of layers in our conv-net up to its current value.

Regularization is enforced by "drop-out" at the fully connected layers, where 50% of units are ignored during each activation, in order to reduce overfitting by preventing co-adaptation (Hinton et al., 2012). For training with back-propagation, we use Adam (Kingma & Ba, 2014) with a learning rate of 0.0005 and first and second moment exponential decay rates of 0.9 and 0.999, respectively.
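A minimal PyTorch sketch consistent with the choices listed above (six 3D convolutional layers with batch-norm and leaky ReLU, average pooling after the first two, three fully connected layers with 50% drop-out, L1 loss, and Adam with the stated hyperparameters) is given below. The channel widths, kernel sizes and hidden-layer sizes are our own illustrative guesses; the exact architecture is that of Figure 6.

```python
import torch
import torch.nn as nn

class DarkMatterNet(nn.Module):
    """Illustrative 3D conv-net: six conv layers (first two followed by
    average pooling), then three fully connected layers -> (Omega_m, sigma_8)."""
    def __init__(self, widths=(2, 4, 8, 16, 16, 16)):
        super().__init__()
        layers, c_in = [], 1
        for i, c_out in enumerate(widths):
            layers += [nn.Conv3d(c_in, c_out, kernel_size=3, padding=1),
                       nn.BatchNorm3d(c_out),   # b.n. on conv layers only
                       nn.LeakyReLU(0.01)]      # leak parameter c = 0.01
            if i < 2:                           # pool after first two convs
                layers.append(nn.AvgPool3d(2))  # sub-sample by 2 per axis
            c_in = c_out
        self.features = nn.Sequential(*layers)
        flat = widths[-1] * (64 // 4) ** 3      # 64^3 input, pooled twice
        self.head = nn.Sequential(
            nn.Linear(flat, 1024), nn.LeakyReLU(0.01), nn.Dropout(0.5),
            nn.Linear(1024, 256), nn.LeakyReLU(0.01), nn.Dropout(0.5),
            nn.Linear(256, 2))                  # final layer is linear

    def forward(self, x):                       # x: (batch, 1, 64, 64, 64)
        return self.head(self.features(x).flatten(1))

model = DarkMatterNet()
opt = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.999))
loss_fn = nn.L1Loss()                           # L1 loss, as in Section 3.4
```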

3.4.1. Visualization

A common approach to visualizing the representation learned by a deep neural network is to maximize the activation of a particular unit while treating the input X as the optimization variable (Erhan et al., 2009; Simonyan et al., 2013):

$$X^* = \arg\max_{X} X_{l,i} \quad \text{s.t.} \quad \|X\|_2 \leq \zeta,$$

where X_{l,i} is the iᵗʰ unit at layer l of the conv-net and ζ > 0 is a constant. Figure 7 shows the representation learned by seven units at the first fully connected layer of our model.¹² The visualization suggests that the conv-net has learned to identify various patterns involving the periodic concentration of mass as a key feature in predicting Ωm and σ8.

¹²Since the input to our conv-net is a distribution, it seems more appropriate to bound X by ‖X‖₁ = 1 and Xᵢ > 0 ∀i. However, using a penalty method for this optimization did not produce visually meaningful features.
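A sketch of this optimization, assuming projected gradient ascent onto the ℓ2 ball of radius ζ (one standard way to enforce the constraint above); the model, layer, and unit index are placeholders.

```python
import torch

def maximize_activation(model, layer, unit, steps=200, lr=0.05, zeta=1.0):
    """Gradient ascent on the input to maximize one unit's activation,
    projecting back onto the ball ||X||_2 <= zeta after each step."""
    model.eval()  # freeze batch-norm statistics and disable drop-out
    x = torch.randn(1, 1, 64, 64, 64, requires_grad=True)
    acts = {}
    layer.register_forward_hook(lambda m, i, o: acts.update(out=o))
    for _ in range(steps):
        model(x)
        score = acts["out"].flatten(1)[0, unit]
        score.backward()
        with torch.no_grad():
            x += lr * x.grad
            x.grad.zero_()
            norm = x.norm()
            if norm > zeta:          # projection onto the L2 ball
                x *= zeta / norm
    return x.detach()

# Hypothetical usage, e.g. a unit of the first fully connected layer:
# x_star = maximize_activation(model, model.head[0], unit=7)
```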

3.5. Estimating the Redshift

We applied the conv-net of the previous section to estimate the redshift in our second dataset. Since this is an easier task, we removed two fully connected layers without losing prediction power; all the other training settings are kept the same. For this dataset we could also obtain good results using the double-basis estimator, described in the following section.

3.5.1. Distribution to Real Regression

We analyzed the use of distribution-to-real regression (Póczos et al., 2013) and the double-basis estimator (2BE) (Oliva et al., 2014) for predicting cosmological parameters. Here, we take sub-cubes of simulation snapshots to be sample sets from an underlying distribution, and regress a mapping that maps the underlying distribution to a real value (in this case the redshift of the simulation snapshot). In other words, we consider our data to be $\mathcal{D} = \{(\mathcal{X}_i, Y_i)\}_{i=1}^N$, where $\mathcal{X}_i = \{X_{ij} \in \mathbb{R}^3\}_{j=1}^{n_i} \overset{iid}{\sim} P_i$. We look to estimate a mapping $Y_i = f(P_i) + \epsilon_i$, where $\epsilon_i$ is a noise term (Oliva et al., 2014).

Roughly speaking, the 2BE operates in an approximate primal space that allows one to use a kernelized estimator on distributions without computing a Gram matrix. The 2BE uses:

1. An orthonormal basis, so that we can estimate the L2 distance between two distributions, $\|P_i - P_j\|_2$, as the Euclidean distance between finite vectors of their projection coefficients onto a finite subset of the orthonormal basis, $\|\vec{a}(P_i) - \vec{a}(P_j)\|$.

2. A random basis, to approximate kernel evaluations on distributions $K(P_i, P_j)$ as the dot product of finite vectors of random features on the respective projection coefficients of the distributions, $z(\vec{a}(P_i))^T z(\vec{a}(P_j))$.

Using these two bases, the 2BE is able to regress a non-parametric mapping efficiently. In short, the 2BE estimates a real-valued response $Y_i \in \mathbb{R}$ as $Y_i \approx \psi^T z(\vec{a}(P_i))$, where $z(\vec{a}(P_i))$ are the aforementioned random features of projection coefficients, and $\psi \in \mathbb{R}^D$ is a vector of model parameters that are optimized over. We expound on the details below.
Orthonormal Basis. First, we use orthonormal basis projection estimators (Tsybakov, 2008) for estimating the densities of $P_i$ from a sample $\mathcal{X}_i$. Let $\Upsilon = [a, b]$ and suppose that $\Upsilon^l \subseteq \mathbb{R}^l$ is the domain of input densities. If $\{\varphi_i\}_{i \in \mathbb{Z}}$ is an orthonormal basis for $L_2(\Upsilon)$, then the tensor product of $\{\varphi_i\}_{i \in \mathbb{Z}}$ serves as an orthonormal basis for $L_2(\Upsilon^l)$; that is,

$$\{\varphi_\alpha\}_{\alpha \in \mathbb{Z}^l} \quad \text{where} \quad \varphi_\alpha(x) = \prod_{i=1}^{l} \varphi_{\alpha_i}(x_i), \; x \in \Upsilon^l \qquad (2)$$

serves as an orthonormal basis (so we have $\forall \alpha, \rho \in \mathbb{Z}^l$, $\langle \varphi_\alpha, \varphi_\rho \rangle = I\{\alpha = \rho\}$).

Let $P \in \mathcal{I} \subseteq L_2(\Upsilon^l)$; then

$$p(x) = \sum_{\alpha \in \mathbb{Z}^l} a_\alpha(P)\, \varphi_\alpha(x) \quad \text{where} \quad a_\alpha(P) = \langle \varphi_\alpha, p \rangle = \int_{\Upsilon^l} \varphi_\alpha(z)\, dP(z) \in \mathbb{R}. \qquad (3)$$

Here, $p(x)$ denotes the probability density function of the distribution $P$.

If the space of input densities $\mathcal{I}$ is in a Sobolev-ellipsoid-type space (see Ingster & Stepanova, 2011; Laurent, 1996; Oliva et al., 2014 for details), we can effectively approximate input densities using a finite set of empirically estimated projection coefficients. Given a sample $\mathcal{X}_i = \{X_{i1}, \ldots, X_{i n_i}\}$ where $X_{ij} \overset{iid}{\sim} P_i \in \mathcal{I}$, let $\hat{P}_i$ be the empirical distribution of $\mathcal{X}_i$, i.e., $\hat{P}_i(X = X_{ij}) = \frac{1}{n_i}$. Our estimator for $p_i$ will be:

$$\tilde{p}_i(x) = \sum_{\alpha \in M} a_\alpha(\hat{P}_i)\, \varphi_\alpha(x), \quad \text{where} \qquad (4)$$

$$a_\alpha(\hat{P}_i) = \int_{\Upsilon^l} \varphi_\alpha(z)\, d\hat{P}_i(z) = \frac{1}{n_i} \sum_{j=1}^{n_i} \varphi_\alpha(X_{ij}). \qquad (5)$$

Choosing $M$ optimally can be shown to lead to $\mathbb{E}[\|\tilde{p}_i - p_i\|_2^2] = O(n_i^{-2/(2+\gamma^{-1})})$, where $\gamma$ is a smoothing constant (Nussbaum, 1983).

Random Basis. Next, we use random basis functions from Random Kitchen Sinks (RKS) (Rahimi & Recht, 2007) to compute our estimate of the response. In particular, we consider the RBF kernel

$$K_\delta(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\delta^2}\right),$$

where $x, y \in \mathbb{R}^d$ and $\delta$ is a bandwidth parameter. Rahimi & Recht (2007) show that for a shift-invariant kernel such as $K_\delta$:

$$K_\delta(x, y) \approx z(x)^T z(y), \quad \text{where} \qquad (6)$$

$$z(x) \equiv \sqrt{\tfrac{2}{D}} \left[\cos(\omega_1^T x + b_1)\; \cdots\; \cos(\omega_D^T x + b_D)\right]^T \qquad (7)$$

with $\omega_i \overset{iid}{\sim} \mathcal{N}(0, \delta^{-2} I_d)$, $b_i \overset{iid}{\sim} \text{Unif}[0, 2\pi]$. The quality of the approximation depends on the number of random features $D$, as well as other factors; see (Rahimi & Recht, 2007) for details.
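A minimal NumPy sketch of the two bases follows, assuming samples rescaled to the unit box (as in Section 3.6) and the cosine basis noted there; the helper names and the choice of the index set M are illustrative.

```python
import numpy as np

def cosine_phi(k, x):
    """1D cosine basis: phi_0 = 1, phi_k(x) = sqrt(2) cos(k pi x)."""
    return np.ones_like(x) if k == 0 else np.sqrt(2) * np.cos(k * np.pi * x)

def projection_coeffs(sample, alphas):
    """Empirical coefficients of Eq. (5): a_alpha = mean_j prod_i phi_{alpha_i}(X_ji).

    sample: (n, 3) points in the unit box; alphas: list of 3-tuples (the set M).
    """
    return np.array([
        np.mean(np.prod([cosine_phi(a_i, sample[:, i])
                         for i, a_i in enumerate(alpha)], axis=0))
        for alpha in alphas])

def rks_features(a, omegas, bs):
    """Random kitchen sink features z(a) of Eq. (7);
    omegas: (D, s) draws from N(0, delta^-2 I_s), bs: (D,) draws from Unif[0, 2pi]."""
    D = len(bs)
    return np.sqrt(2.0 / D) * np.cos(omegas @ a + bs)
```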

Below we consider the RBF kernel on distributions,

$$K_\delta(P_i, P_j) = \exp\left(-\frac{\|p_i - p_j\|^2}{2\delta^2}\right),$$

where $p_i, p_j$ are the respective densities and $\|p_i - p_j\|$ is the $L_2$ norm on functions. We take the class of mappings we regress to be:

$$Y_i = \sum_{j=1}^{\infty} \theta_j K_\delta(G_j, P_i) + \epsilon_i, \qquad (8)$$

where $\|\theta\|_1 < \infty$, the $G_j \in \mathcal{I}$ are unknown distributions, and $\epsilon_i$ is a noise term (Oliva et al., 2014). Note that this model is analogous to a linear smoother on some unknown infinite dataset, and is nonparametric. We show that (8) can be approximated with the 2BE below.

Double-Basis Estimator. First note that:

$$\langle \tilde{p}_i, \tilde{p}_j \rangle = \left\langle \sum_{\alpha \in M} a_\alpha(\hat{P}_i) \varphi_\alpha, \sum_{\alpha \in M} a_\alpha(\hat{P}_j) \varphi_\alpha \right\rangle = \sum_{\alpha \in M} \sum_{\beta \in M} a_\alpha(\hat{P}_i)\, a_\beta(\hat{P}_j) \langle \varphi_\alpha, \varphi_\beta \rangle = \sum_{\alpha \in M} a_\alpha(\hat{P}_i)\, a_\alpha(\hat{P}_j) = \left\langle \vec{a}(\hat{P}_i), \vec{a}(\hat{P}_j) \right\rangle,$$

where $\vec{a}(\hat{P}_i) = (a_{\alpha_1}, \ldots, a_{\alpha_s})$, $M = \{\alpha_1, \ldots, \alpha_s\}$, and the last inner product is the vector dot product. Thus,

$$\|\tilde{p}_i - \tilde{p}_j\|_2 = \left\|\vec{a}(\hat{P}_i) - \vec{a}(\hat{P}_j)\right\|_2,$$

where the norm on the LHS is the $L_2$ norm and that on the RHS is the $\ell_2$ norm.

Consider a fixed $\delta$. Let $\omega_i \overset{iid}{\sim} \mathcal{N}(0, \delta^{-2} I_s)$, $b_i \overset{iid}{\sim} \text{Unif}[0, 2\pi]$ be fixed. Then,

$$\sum_{i=1}^{\infty} \theta_i K_\delta(G_i, P_0) \approx \sum_{i=1}^{\infty} \theta_i K_\delta(\vec{a}(G_i), \vec{a}(P_0)) \approx \sum_{i=1}^{\infty} \theta_i z(\vec{a}(G_i))^T z(\vec{a}(\hat{P}_0)) = \left(\sum_{i=1}^{\infty} \theta_i z(\vec{a}(G_i))\right)^{\!T} z(\vec{a}(\hat{P}_0)) = \psi^T z(\vec{a}(\hat{P}_0)), \qquad (9)$$

where $\psi = \sum_{i=1}^{\infty} \theta_i z(\vec{a}(G_i)) \in \mathbb{R}^D$. Thus, we consider estimators of the form (9); i.e., we use a linear estimator in the non-linear space induced by $z(\vec{a}(\cdot))$. In particular, we consider the OLS estimator using the dataset $\{(z(\vec{a}(\hat{P}_i)), Y_i)\}_{i=1}^N$:

$$\hat{f}(\hat{P}_0) \equiv \hat{\psi}^T z(\vec{a}(\hat{P}_0)), \quad \text{where} \qquad (10)$$
$$\hat{\psi} \equiv \arg\min_\beta \|\vec{Y} - Z\beta\|_2^2 \qquad (11)$$
$$= (Z^T Z)^{-1} Z^T \vec{Y} \qquad (12)$$

for $\vec{Y} = (Y_1, \ldots, Y_N)^T$, and with $Z$ being the $N \times D$ matrix $Z = [z(\vec{a}(\hat{P}_1)) \cdots z(\vec{a}(\hat{P}_N))]^T$.

A straightforward extension of (10) is to use a ridge regression estimate on the features $z(\vec{a}(\cdot))$ rather than an OLS estimate. That is, for $\lambda \geq 0$ let

$$\hat{\psi}_\lambda \equiv \arg\min_\beta \|\vec{Y} - Z\beta\|_2^2 + \lambda \|\beta\|_2^2 \qquad (13)$$
$$= (Z^T Z + \lambda I)^{-1} Z^T \vec{Y}. \qquad (14)$$

3.5.2. Algorithm

We summarize the basic steps for training the 2BE in practice, given a dataset of empirical functional observations $\mathcal{D} = \{(\mathcal{X}_i, Y_i)\}_{i=1}^N$, parameters $\delta$ and $D$ (which may be cross-validated), and an orthonormal basis $\{\varphi_i\}_{i \in \mathbb{Z}}$ for $L_2([a, b])$:

1. Determine the set of basis functions $M$ for approximating $p$. This may be done via cross-validation of density estimates (see (Oliva et al., 2014) for more details).

2. Let $s = |M|$, and draw $\omega_i \overset{iid}{\sim} \mathcal{N}(0, \delta^{-2} I_s)$, $b_i \overset{iid}{\sim} \text{Unif}[0, 2\pi]$ for $i \in \{1, \ldots, D\}$; keep the set $\{(\omega_i, b_i)\}_{i=1}^D$ fixed henceforth.

3. Let $\{\alpha_1, \ldots, \alpha_s\} = M$. Generate the dataset of (random kitchen sink features of the projection coefficient vectors, response) pairs, $\{(z(\vec{a}(\hat{P}_i)), Y_i)\}_{i=1}^N$. Let $\hat{\psi} = (Z^T Z + \lambda I)^{-1} Z^T \vec{Y} \in \mathbb{R}^D$ where $Z = [z(\vec{a}(\hat{P}_1)) \cdots z(\vec{a}(\hat{P}_N))]^T \in \mathbb{R}^{N \times D}$, and $\lambda$ may be chosen via cross-validation. Note that $Z^T \vec{Y}$ and $Z^T Z$ can be computed efficiently using parallelism.

4. For all future query input functional observations $\hat{P}_0$, estimate the corresponding response as $\hat{f}(p_0) = \hat{\psi}^T z(\vec{a}(\hat{P}_0))$.
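A compact sketch of steps 2-4, reusing the hypothetical projection_coeffs and rks_features helpers from the sketch after Eq. (7); it is only an illustration of the algorithm, not the experiments' implementation.

```python
import numpy as np

def fit_2be(samples, y, alphas, delta, D, lam, seed=0):
    """Steps 2-3: draw (omega_i, b_i), build Z, solve ridge for psi."""
    rng = np.random.default_rng(seed)
    s = len(alphas)
    omegas = rng.normal(0.0, 1.0 / delta, size=(D, s))  # N(0, delta^-2 I_s)
    bs = rng.uniform(0.0, 2 * np.pi, size=D)
    # Z is the N x D matrix of RKS features of projection coefficients.
    Z = np.stack([rks_features(projection_coeffs(X, alphas), omegas, bs)
                  for X in samples])
    psi = np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ np.asarray(y))
    return psi, omegas, bs

def predict_2be(sample, psi, omegas, bs, alphas):
    """Step 4: f_hat(P_0) = psi^T z(a(P_0))."""
    return psi @ rks_features(projection_coeffs(sample, alphas), omegas, bs)
```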
3.6. 2BE for Redshift Prediction

We divide simulation snapshots into sub-cubes of side 16 h⁻¹Mpc, for a total of 512 sub-cubes per simulation snapshot. Each sub-cube is then rescaled to the unit box. We treat each sub-cube as a sample Xᵢ with a response Yᵢ, the redshift it was observed at. In total, a training set of approximately 600K (sample Xᵢ, response Yᵢ) pairs was used for constructing our model. A total of 130 simulation snapshots were held out. Test accuracies were assessed by averaging the predicted response over the boxes of each held-out snapshot.

We used D = 20K random features, as in Eq. (7). We used the cosine basis, i.e., the tensor product in Eq. (2) of ϕ₀(x) = 1 and ϕₖ(x) = √2 cos(kπx) for k ≥ 1. The set of basis functions M in (5) was taken to be M = {α ∈ ℤ³ : ‖α‖ ≤ 18} via a rule of thumb. The free parameters δ, the bandwidth, and λ, the regularizer, were chosen by validation on a held-out portion of the training set. In total, the 2BE model's parameters ψ totaled 20K dimensions.

Future Directions

We demonstrated that machine learning techniques can produce accurate estimates of the cosmological parameters from simulated dark matter distributions, which are highly competitive with standard analysis techniques. In particular, the advantage of conv-nets on small-scale boxes shows that convolutional features that carry higher-order correlation information provide high fidelity and could produce low-variance estimates of the cosmological parameters.

The eventual goal is to use such models to estimate the parameters of our own Universe, where we only have access to the distribution of "visible" matter. This introduces extra complexities, as galaxies and clusters are biased tracers of the underlying matter distribution. Furthermore, the direct simulation of galaxy clusters is highly complex. In the next step, we would like to evaluate and establish the robustness of these models to variations across simulation settings, before applying proper models to Sloan Digital Sky Survey data (Alam et al., 2015), which observes the distribution of galaxies at large scales.

As another direction for future work, we would also like to investigate the application of approximate Bayesian computation (ABC; Marin et al., 2012) in combination with the power-spectrum method for this problem.

Acknowledgements

We would like to thank Hy Trac for providing the simulations for the second set of experiments. We would also like to thank the anonymous reviewers for their helpful feedback. The research of SR was supported by the Department of Energy grant DE-SC0011114.

References

Alam, Shadab et al. The Eleventh and Twelfth Data Releases of the Sloan Digital Sky Survey: Final Data from SDSS-III. Astrophys. J. Suppl., 219(1):12, 2015. doi: 10.1088/0067-0049/219/1/12.

Anderson, Lauren et al. The clustering of galaxies in the SDSS-III Baryon Oscillation Spectroscopic Survey: baryon acoustic oscillations in the Data Releases 10 and 11 Galaxy samples. Mon. Not. Roy. Astron. Soc., 441(1):24–62, 2014. doi: 10.1093/mnras/stu523.

Bengio, Yoshua. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 2009.

Cohen, Taco and Welling, Max. Learning the irreducible representations of commutative Lie groups. arXiv preprint arXiv:1402.4437, 2014.

Cole, Shaun et al. The 2dF Galaxy Redshift Survey: Power-spectrum analysis of the final dataset and cosmological implications. Mon. Not. Roy. Astron. Soc., 362:505–534, 2005. doi: 10.1111/j.1365-2966.2005.09318.x.

Dieleman, Sander, Willett, Kyle W, and Dambre, Joni. Rotation-invariant convolutional neural networks for galaxy morphology prediction. Monthly Notices of the Royal Astronomical Society, 450(2):1441–1459, 2015.

Dodelson, S. Modern Cosmology. Elsevier Science, 2003. ISBN 9780080511979.

Erhan, Dumitru, Bengio, Yoshua, Courville, Aaron, and Vincent, Pascal. Visualizing higher-layer features of a deep network. Dept. IRO, Université de Montréal, Tech. Rep. 4323, 2009.

Gens, Robert and Domingos, Pedro M. Deep symmetry networks. In Advances in Neural Information Processing Systems, pp. 2537–2545, 2014.

Hinshaw, G. et al. Nine-Year Wilkinson Microwave Anisotropy Probe (WMAP) Observations: Cosmological Parameter Results. Astrophys. J. Suppl., 208:19, 2013. doi: 10.1088/0067-0049/208/2/19.

Hinton, Geoffrey E, Srivastava, Nitish, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

Ingster, Y. and Stepanova, N. Estimation and detection of functions from anisotropic Sobolev classes. Electronic Journal of Statistics, 5:484–506, 2011.

Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Izard, A., Crocce, M., and Fosalba, P. ICE-COLA: Towards fast and accurate synthetic galaxy catalogues optimizing a quasi N-body method. ArXiv e-prints, September 2015.

Kamnitsas, Konstantinos, Chen, Liang, Ledig, Christian, Rueckert, Daniel, and Glocker, Ben. Multi-scale 3D convolutional neural networks for lesion segmentation in brain MRI. Ischemic Stroke Lesion Segmentation, pp. 13, 2015.

Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Koda, J., Blake, C., Beutler, F., Kazin, E., and Marin, F. Fast and accurate mock catalogue generation for low-mass galaxies. ArXiv e-prints, July 2015.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

Laurent, B. Efficient estimation of integral functionals of a density. The Annals of Statistics, 24(2):659–681, 1996.

LeCun, Yann, Bengio, Yoshua, and Hinton, Geoffrey. Deep learning. Nature, 521(7553):436–444, 2015.

Lewis, Antony and Bridle, Sarah. Cosmological parameters from CMB and other data: a Monte Carlo approach. Phys. Rev., D66:103511, 2002.

Lewis, Antony, Challinor, Anthony, and Lasenby, Anthony. Efficient computation of CMB anisotropies in closed FRW models. Astrophys. J., 538:473–476, 2000.

Maas, Andrew L, Hannun, Awni Y, and Ng, Andrew Y. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, 2013.

Marin, Jean-Michel, Pudlo, Pierre, Robert, Christian P, and Ryder, Robin J. Approximate Bayesian computational methods. Statistics and Computing, 22(6):1167–1180, 2012.

Nelder, John A and Mead, Roger. A simplex method for function minimization. The Computer Journal, 7(4):308–313, 1965.

Nussbaum, M. On optimal filtering of a function of many variables in white Gaussian noise. Problemy Peredachi Informatsii, 19(2):23–29, 1983.

Oliva, Junier B, Neiswanger, Willie, Póczos, Barnabás, Schneider, Jeff, and Xing, Eric. Fast distribution to real regression. AISTATS, 2014.

Parkinson, David et al. The WiggleZ Dark Energy Survey: Final data release and cosmological results. Phys. Rev., D86:103518, 2012. doi: 10.1103/PhysRevD.86.103518.

Perlmutter, S. et al. Measurements of Omega and Lambda from 42 high redshift supernovae. Astrophys. J., 517:565–586, 1999. doi: 10.1086/307221.

Planck Collaboration, Ade, P. A. R., Aghanim, N., Arnaud, M., Ashdown, M., Aumont, J., Baccigalupi, C., Banday, A. J., Barreiro, R. B., Bartlett, J. G., and et al. Planck 2015 results. XIII. Cosmological parameters. ArXiv e-prints, February 2015.

Póczos, Barnabás, Rinaldo, Alessandro, Singh, Aarti, and Wasserman, Larry. Distribution-free distribution regression. AISTATS, 2013.

Rahimi, Ali and Recht, Benjamin. Random features for large-scale kernel machines. Advances in Neural Information Processing Systems, pp. 1177–1184, 2007.

Riess, Adam G. et al. Observational evidence from supernovae for an accelerating universe and a cosmological constant. Astron. J., 116:1009–1038, 1998. doi: 10.1086/300499.

Roth, Holger R, Farag, Amal, Lu, Le, Turkbey, Evrim B, and Summers, Ronald M. Deep convolutional networks for pancreas segmentation in CT imaging. In SPIE Medical Imaging, pp. 94131G–94131G. International Society for Optics and Photonics, 2015.

Ryden, Barbara Sue. Introduction to Cosmology, volume 4. Addison-Wesley, San Francisco, USA, 2003.

Simonyan, Karen, Vedaldi, Andrea, and Zisserman, Andrew. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

Smith, R. E., Peacock, J. A., Jenkins, A., White, S. D. M., Frenk, C. S., Pearce, F. R., Thomas, P. A., Efstathiou, G., and Couchmann, H. M. P. Stable clustering, the halo model and nonlinear cosmological power spectra. Mon. Not. Roy. Astron. Soc., 341:1311, 2003. doi: 10.1046/j.1365-8711.2003.06503.x.

Springel, V. The cosmological simulation code GADGET-2. Mon. Not. Roy. Astron. Soc., 364:1105–1134, December 2005. doi: 10.1111/j.1365-2966.2005.09655.x.

Tassev, S., Zaldarriaga, M., and Eisenstein, D. J. Solving large scale structure in ten easy steps with COLA. JCAP, 6:036, June 2013. doi: 10.1088/1475-7516/2013/06/036.

Trac, H., Cen, R., and Mansfield, P. SCORCH I: The Galaxy-Halo Connection in the First Billion Years. ApJ, 813:54, November 2015. doi: 10.1088/0004-637X/813/1/54.

Tsybakov, Alexandre B. Introduction to Nonparametric Estimation. Springer, 2008.



A. Spatial Scale of the Cubes

Larger sub-cubes allow the model to access information at larger spatial scales; however, increasing the size of the sub-cubes comes at the cost of fewer training/test instances. We evaluated the effect of scale by using different quantizations of the original 512³ (h⁻¹Mpc)³ cubes. The results of Section 2 use a 3D histogram with 256³ voxels divided into 64³-voxel sub-cubes. We also tried 3D histograms with 512³ and 128³ voxels, with similar 64³-voxel sub-cubes, and then used the same conv-net for training. This resulted in using 2³ = 8 times fewer or more instances. Figure 8 compares the prediction accuracy under the different spatial scales. Error-bars show one standard deviation for the predictions made using different sub-cubes that belong to the same cube (sibling sub-cubes). Interestingly, these predictions for sibling sub-cubes are also consistent, having a small standard deviation for both parameters (Ωm and σ8).

Moreover, changing the spatial volume of the sub-cubes does not seem to significantly affect the prediction accuracy. We are able to make predictions with similar accuracy using sub-cubes with both smaller and larger spatial scales.
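The different quantizations above amount to re-binning the same density field at coarser voxel resolutions. A minimal sketch of such coarsening (ours, for illustration; it assumes the cube side is divisible by the factor):

```python
import numpy as np

def coarsen(cube, factor=2):
    """Reduce a (d, d, d) density cube to (d/f, d/f, d/f) by summing
    f x f x f blocks, preserving the total density."""
    d = cube.shape[0]
    assert d % factor == 0
    return (cube.reshape(d // factor, factor,
                         d // factor, factor,
                         d // factor, factor)
                .sum(axis=(1, 3, 5)))
```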



[Figure 8: three pairs of panels (Ωm, σ8), prediction versus ground truth for the test and train sets, at sub-cubic volumes (a) 64³ (h⁻¹Mpc)³, (b) 128³ (h⁻¹Mpc)³, and (c) 256³ (h⁻¹Mpc)³.]

Figure 8. Prediction and ground truth using (a) small, (b) medium, and (c) large sub-cubes. The error-bars show the standard deviation over predictions made by sibling sub-cubes.