
The Astrophysical Journal, 852:92 (11pp), 2018 January 10 https://doi.org/10.3847/1538-4357/aa9c7d © 2018. The American Astronomical Society. All rights reserved.

Cosmic Background Mapmaking with a Messenger Field

Kevin M. Huffenberger1 and Sigurd K. Næss2
1 Department of Physics, Florida State University, Tallahassee, FL 32306, USA; [email protected]
2 Center for Computational Astrophysics, Flatiron Institute, New York, NY 10010, USA
Received 2017 June 24; revised 2017 November 12; accepted 2017 November 20; published 2018 January 10

Abstract

We apply a messenger field method to solve the linear minimum-variance mapmaking equation in the context of Cosmic Microwave Background (CMB) observations. In simulations, the method produces sky maps that converge significantly faster than those from a conjugate gradient descent algorithm with a diagonal preconditioner, even though the computational cost per iteration is similar. The messenger method recovers large scales in the map better than conjugate gradient descent, and yields a lower overall χ². In the single, pencil beam approximation, each iteration of the messenger mapmaking procedure produces an unbiased map, and the iterations become more optimal as they proceed. A variant of the method can handle differential data or perform deconvolution mapmaking. The messenger method requires no preconditioner, but a high-quality solution needs a cooling parameter to control the convergence. We study the convergence properties of this new method and discuss how the algorithm is feasible for the large data sets of current and future CMB experiments.

Key words: cosmic background radiation – methods: data analysis – methods: statistical

1. Introduction

Cosmic Microwave Background (CMB) experiments measure the intensity and polarization of radiation from the z∼1100 surface of last scattering. To reduce systematics, CMB telescopes scan the sky in constant motion, and we must reassemble the scanning data into sky maps. Doing so is computationally expensive. Janssen & Gulkis (1992) suggested a linear, minimum-variance solution to the mapmaking problem for the Cosmic Background Explorer (COBE), and Tegmark (1997) showed that this solution preserves the information content of the data. While there are other methods that also preserve information, this solution makes few assumptions and produces an unbiased map with minimized errors, which are appealing features.

As CMB experiments achieved higher resolution, the computational scaling and increased number of map pixels made brute force approaches to the minimum-variance solution impractical, and researchers turned to iterative linear system solvers that could find the solution without explicitly performing an intractable matrix inversion (Wright et al. 1996; Doré et al. 2001; Natoli et al. 2001; Dupac & Giard 2002; Stompor et al. 2002; de Gasperis et al. 2005). Preconditioned conjugate gradient descent is the currently preferred method, and these solvers have subsequently been optimized for massively parallel high-performance computing environments (e.g., Cantalupo et al. 2010). Applications of similar approaches to sub-mm and infrared mapmaking include Patanchon et al. (2008) and Piazzo et al. (2015). The computational costs of mapmaking limit the number of Monte Carlo simulations that a project can make of its end-to-end data analysis pipeline. These simulations quantify uncertainties and biases and help to validate the data processing (e.g., Planck Collaboration XII et al. 2016).

Preconditioner matrices speed up the convergence to the mapmaking solution. These are approximate matrix inverses, whose purpose is to lower the condition number of the matrix in the linear system. With simple preconditioners, conjugate gradient methods typically converge in a few hundred iterations for CMB data sets. With a more carefully designed preconditioner, the number of iterations can be greatly reduced (Næss & Louis 2014; Szydlarski et al. 2014), but the preconditioner must be tuned carefully to fit the characteristics of the CMB survey (using, e.g., a typical scanning pattern). Such computations can be much more costly per iteration.

In this paper, we introduce a new algorithm for mapmaking, which is based on a messenger field method. Messenger fields were introduced by Elsner & Wandelt (2013) to circumvent the inversion of a dense covariance matrix in the computation of the linear Wiener filter. The messenger method is iterative, and requires no preconditioner. Along with the map, it solves for an additional field (the "messenger"), but only requires inversions of sparse matrices. Beyond solving the linear system for the Wiener filter, the messenger field can be used to generate samples of the signal and signal covariance via Gibbs sampling. Several subsequent works have applied the messenger method to problems in cosmology, including CMB, gravitational lensing, and large-scale structure (Anderes et al. 2015; Jasche & Lavaux 2015; Alsing et al. 2016; Lavaux & Jasche 2016). As a relatively new approach, it is still being refined to improve the convergence (Kodi Ramanah et al. 2017) and to treat wider varieties of covariance matrix structures (Huffenberger 2017).

The approach to the mapmaking problem that we present here uses messenger fields, but differs from the above works in two important respects. First, we solve a different linear system in the mapmaking problem than in the Wiener filter. Mapmaking makes no assumption about the signal's covariance. Second, in all the above works, the data and the signal vectors have the same dimensionality, that is, the same number of pixels. This is not true in mapmaking, since the time-ordered data for all detectors contains vastly more samples than the number of pixels in the temperature and polarization components of the resulting maps. The time samples are tied to positions on the sky map by the pointing of the telescope's detectors. Many time samples refer to the same map pixel, and the cross-linking of the scans is important to optimize the measurement of the sky.
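To illustrate the structures involved, in the single pencil beam approximation the pointing matrix P acts like an index lookup from map pixels to time samples, and its transpose bins time samples back into pixels. A minimal sketch in Python (the array names and toy sizes here are ours, for illustration only):

```python
import numpy as np

# Toy pointing: each time sample hits exactly one sky pixel (single pencil beam).
# P maps a map to a timestream; P^T ("binning") sums samples into pixels.
pix = np.array([0, 1, 1, 2, 0, 2, 2, 1])   # pixel index hit at each time sample
m = np.array([10.0, 20.0, 30.0])           # a 3-pixel sky map

tod = m[pix]                               # P m: project the map into the timestream
hits = np.bincount(pix)                    # diagonal of P^T P: samples per pixel
binned = np.bincount(pix, weights=tod) / hits   # naive coadded map

assert np.allclose(binned, m)              # noiseless data coadds back to the map
```

Note that several time samples land in the same pixel, which is the dimensionality mismatch described above: the timestream has more entries than the map.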

The tests we report here indicate that our new messenger mapmaking algorithm can converge much faster than the preconditioned conjugate gradient descent method, at a similar computational cost per iteration. Furthermore, the method is feasible for large data sets, which is an important feature for modern CMB experiments that utilize many thousands of detectors and multiyear campaigns.

This paper is organized so that in Section 2, we introduce the messenger mapmaking method. In Section 3, we present the results of the method applied to simple simulations that allow a careful examination of the convergence properties. In Section 4, we discuss scaling up the method to real data sets. In Section 5, we conclude with a brief description of a preliminary application to ACTPol data. Three appendices provide derivations of the method for the single-pencil-beam approximation, for composite-beam (e.g., differential) experiments, and for an extension that can treat complications in the noise properties at an increased computational cost.

2. Methods

The standard model for a vector of time-ordered data is

d = P m_true + n,    (1)

where P is the pointing matrix that scans the sky (vector m_true), and n is zero-mean noise in the timestream. The pointing matrix contains the weight that a particular time sample provides to a map pixel (in temperature and polarization). The linear, unbiased estimate for the sky that minimizes the variance is given by the mapmaking equation,

m = (P† N⁻¹ P)⁻¹ P† N⁻¹ d,    (2)

where N is the covariance matrix of the noise. For Gaussian noise, this linear estimate has the smallest variance among all possible estimates. This solution minimizes

χ²(m) = (d − P m)† N⁻¹ (d − P m),    (3)

and therefore maximizes the Gaussian probability ∝ exp(−χ²(m)/2). It is also unbiased, i.e., ⟨m⟩ = m_true. The noise covariance for data timestreams is approximately diagonal in the frequency domain, which makes the multiplication of the data by the inverse noise matrix tractable. However, the matrix inverse (P† N⁻¹ P)⁻¹ is not directly computable for such correlated noise and large numbers of pixels, and the problem requires a linear system solver. Conjugate gradient descent methods, for example, move (P† N⁻¹ P) to the left-hand side of Equation (2) and prescribe iterative updates to m to find the solution.

2.1. Messenger Field Mapmaking

In Appendix A, we show that iterating the following equations provides an alternative solution to the mapmaking equation:

t_i = (N̄⁻¹ + (λT)⁻¹)⁻¹ (N̄⁻¹ d + (λT)⁻¹ P m_i)
m_{i+1} = (P† T⁻¹ P)⁻¹ P† T⁻¹ t_i,    (4)

for N̄ = N − T. We are free to choose any T that leaves N̄ positive definite. The value λ=1 solves the mapmaking problem, but the convergence can be slow. Other values of λ>1 solve the problem for certain modes only, but the convergence can be fast. Over the course of the solution, we will ultimately tune λ from large values to unity as a "cooling parameter," as discussed below. This approach is inspired by the messenger field method of Elsner & Wandelt (2013).

The messenger field t represents a timestream that contains our statistical best estimate for the signal component, plus a noise component with statistics governed by our choice of T. Operationally, it is a weighted average of the data and the map (projected into the timestream by the pointing matrix). In the simplest implementations, we let T = τI be proportional to the identity matrix, which implies that the noise part of t is uniform and uncorrelated. The messenger field t has less noise than the full timestream d because it lacks the remaining correlated noise in the data (governed by N̄). The estimated map m is a simple coaddition of t. (T drops out of the second equation.) The value τ is the variance of the noise in t, and so it cannot be larger than the overall noise floor in the data, or it will violate the positive-definiteness of N̄. Since the identity matrix is diagonal in any basis, the choice of T = τI makes all the matrix inversions trivial so long as each time sample is associated in P with one sky position only (the single, pencil beam approximation).

Although the messenger field t has a nice interpretation, it is large for real data sets; it is the same size as the time-ordered data. It is preferable to precompute smaller data objects and discard the time-ordered data and similar objects from computer memory. For the second equation, we do not actually need t, just P† T⁻¹ t, and we can combine the equations, rewriting the iteration using map-sized objects:

m_{i+1} = m_d(λ) + F(λ) m_i.    (5)

Here, m_d(λ) is a binned map of filtered data,

m_d(λ) = (P† T⁻¹ P)⁻¹ P† T⁻¹ (N̄⁻¹ + (λT)⁻¹)⁻¹ N̄⁻¹ d
       = (P† T⁻¹ P)⁻¹ P† (λ⁻¹ N̄ + T)⁻¹ d,    (6)

and F is a matrix filter applied to the current iteration,

F(λ) = (P† T⁻¹ P)⁻¹ P† T⁻¹ (N̄⁻¹ + (λT)⁻¹)⁻¹ (λT)⁻¹ P
     = (P† T⁻¹ P)⁻¹ P† T⁻¹ (I + λT N̄⁻¹)⁻¹ P
     = I − (P† T⁻¹ P)⁻¹ P† (λ⁻¹ N̄ + T)⁻¹ P.    (7)

The Woodbury formula leads to the last equation. Together, these yield another compact form for the mapmaking algorithm, which is

m_{i+1} = m_i + (P† T⁻¹ P)⁻¹ P† (λ⁻¹ N̄ + T)⁻¹ (d − P m_i),    (8)

although this form requires the full time-ordered data.

For practical mapmaking, the main computational costs per iteration are the projection of the map into a timestream, the Fourier transform to the frequency domain for filtering by sparse matrices, the inverse Fourier transform back to the time domain, and the projection back to the map domain. These costs per iteration are comparable to conjugate gradient descent methods with a diagonal preconditioner, but in our examples, we find the messenger field method can converge to a higher-quality solution in many fewer iterations.

For general values of the cooling parameter λ>1, the iterations minimize

χ²(m, λ) = (d − P m)† N′(λ)⁻¹ (d − P m),    (9)
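For concreteness, the map-space update of Equation (8) with T = τI and a noise power spectrum that is diagonal in the Fourier domain can be sketched in a few lines of NumPy. This is our own toy implementation for a single detector with a periodic, uniformly sampled timestream; the function name, the choice τ = min N(f), and the guard against empty pixels are our assumptions, not the authors' code:

```python
import numpy as np

def messenger_map(d, pix, Nf, npix, lam_schedule):
    """Sketch of Eq. (8): m <- m + (P'T^-1 P)^-1 P'(lam^-1 Nbar + T)^-1 (d - P m),
    with T = tau*I and the noise power N(f) (rfft-sized array Nf) Fourier-diagonal."""
    tau = Nf.min()                        # noise floor; keeps Nbar = N - T non-negative
    hits = np.bincount(pix, minlength=npix).astype(float)   # diagonal of P^T P
    m = np.zeros(npix)
    for lam in lam_schedule:              # cooling: lam from large values down to 1
        r = d - m[pix]                    # residual timestream d - P m
        rf = np.fft.rfft(r)
        rf *= 1.0 / ((Nf - tau) / lam + tau)   # apply (lam^-1 Nbar + T)^-1 per mode
        r = np.fft.irfft(rf, n=d.size)
        m += tau * np.bincount(pix, weights=r, minlength=npix) / np.maximum(hits, 1.0)
    return m
```

As a sanity check on this sketch, for white noise (constant N(f)) a single λ=1 step already returns the simple coadded map, consistent with the interpretation of m as a coaddition of the messenger timestream.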

where N′(λ) = N + (λ − 1)T. This minimization leads to the λ-dependent, unbiased map solution

m_d(λ) = (P† N′(λ)⁻¹ P)⁻¹ P† N′(λ)⁻¹ d,    (10)

which is optimal only for λ=1.

For very large λ, F=0 and the next iteration gets no contributions from the current one. Then the map is an unbiased weighted map of the data: m(large λ) = m_d = (P† T⁻¹ P)⁻¹ P† T⁻¹ d. For T=τI, this is a coaddition of d. Regardless of λ, if an iteration is unbiased (⟨m_i⟩ = m_true), then the messenger field has mean ⟨t_i⟩ = P m_true, and the next iteration will be unbiased as well.

The residual in a map compared to the solution at a particular λ is

r_{i+1}(λ) = m_{i+1}(λ) − m(λ) = F(λ) r_i(λ).    (11)

Thus the convergence of the remaining residual to zero depends on the eigenvalues of the matrix F. If an eigenvalue is much less than unity, the corresponding eigenmode will converge fast. If it is close to unity, the mode will converge slowly. Looking at the structure of F, we may expect that this transition point will occur (for frequency domain noise) around N(f) ≈ λτ, so that modes less noisy than that will rapidly converge to their proper values for m(λ). For modes that correspond to frequencies that are much more noisy than λτ, the solution m(λ) and our desired m(λ=1) are the same. Following Elsner & Wandelt (2013), we thus establish a "cooling schedule" for λ that takes it from large values to λ=1 as we iterate. We must lower the value of λ so that less-and-less noisy modes converge to their final values, but not lower λ too quickly, or noisy modes will get stuck with eigenvalues close to unity before converging. A proper cooling schedule will permit noisy modes to converge quickly at large λ, and then let less noisy modes converge to their proper values as λ → 1, speeding the convergence to the ultimate solution.

2.2. Composite Pointing Matrix

The above discussion of messenger mapmaking supposes that the matrix P† T⁻¹ P is simple to invert for suitable T. This is true in the single pencil beam approximation, which has only one pixel entry (or polarized pixel block) in the pointing matrix per time sample. However, it is not true for differential experiments like the COBE Differential Microwave Radiometer or the Wilkinson Microwave Anisotropy Probe (WMAP), which sample and difference widely separated directions on the sky. In the differential case, the pointing matrix has a +1 and a −1 pixel entry per time sample in the pencil beam approximation, corresponding to the pointing of the two sides of the differencing assembly (Wright et al. 1996; Hinshaw et al. 2003). The pointing matrix has even more entries when attempting deconvolution mapmaking, where a portion of the beam is placed in the pointing matrix and deconvolved in a regularized way (Armitage & Wandelt 2004; Armitage-Caplan & Wandelt 2009; Cantalupo et al. 2010).

The case with two components has the form

P = P_A + P_B,    (12)

but we could generalize this to an arbitrary number of components. In the differential case, the A-component could have +1 entries, and the B-component could have −1 entries. We also split the messenger covariance as T = T_A + T_B. This form allows us to make progress because matrices like P_A† T_A⁻¹ P_A, and sums of such matrices, can be made simple to invert.

In Appendix B, we show that the messenger mapmaking solution in this case still has the form m_{i+1} = m_d(λ) + F(λ) m_i, with slightly modified terms (compare Equations (6) and (7)):

m_d(λ) = (P_A† T_A⁻¹ P_A + P_B† T_B⁻¹ P_B)⁻¹ P† (λ⁻¹ N̄ + T)⁻¹ d
F(λ) = I − (P_A† T_A⁻¹ P_A + P_B† T_B⁻¹ P_B)⁻¹ P† (λ⁻¹ N̄ + T)⁻¹ P.    (13)

The matrix (P_A† T_A⁻¹ P_A + P_B† T_B⁻¹ P_B) is a weighted sum of weight maps, and as before, can be made (block) diagonal for simple T matrices. Another compact form for the mapmaking solution is

m_{i+1} = m_i + (P_A† T_A⁻¹ P_A + P_B† T_B⁻¹ P_B)⁻¹ P† (λ⁻¹ N̄ + T)⁻¹ (d − P m_i),    (14)

which recalls Equation (8).

There are some important differences to the single-beam case. Here, F ≠ 0 for large λ, so even in that case the initial condition matters. Unlike before, a large λ does not automatically ensure an unbiased map. In some cases, for example when T_A = T_B = T/2 and all are diagonal, (P_A† T_A⁻¹ P_A + P_B† T_B⁻¹ P_B) is proportional to the diagonal part of P† T⁻¹ P. Then, in the limit of large λ, Equation (14) expresses a weighted Jacobi method that will converge toward the unbiased binned map solution (P† T⁻¹ P)⁻¹ P† T⁻¹ d. At other λ, the map will converge toward an unbiased, but non-optimal, m(λ) (see Equation (10)). If the method achieves an unbiased map at any λ, or is initialized with an unbiased map, then subsequent iterations will stay unbiased.

2.3. Simulations

To validate and study the messenger mapmaking method, we prepared mock time-ordered data based on cross-linked raster scans of model instruments over a simulated sky. The scanned area is square and two degrees across. The statistics of the sky signal are not important for the mapmaking solution, which depends only on the noise covariance and the pointing. For the signal in the test case, we used a collection of compact and extended axisymmetric polarized and unpolarized sources. This lets us observe how different scales converge. The source amplitudes range from a few hundred microkelvin to a few millikelvin, and their signals also vary within pixels. All the mapmaking algorithms we tested here were subject to the same subpixel signal bias, so it does not figure into the relative comparisons.

The single-beam instrument model uses an orthogonal pair of raster scanning patterns and has nine polarization-sensitive detectors, at various orientations, that are offset from the telescope boresight by up to 1.8 arcmin. The noise for each detector is independent and has a power spectrum with power-law and white noise components,

N(f) = N₀ (1 + (f/f_knee)⁻²),    (15)

where N₀ = (20 μK)² and f_knee = 66.7 Hz. Each raster scan lasts for 1000 s and covers the bulk of the scanning area. The scans are unphysical; the boresight moves in one direction only without turn-arounds and jumps back to the start with a
frequency 0.46 Hz (every 2.17 s), while stepping ahead monotonically in the cross-scan direction at the end of each scan.

We made a second set of simulations to test the differential mapmaking case. In this test, we simplified the differential model instrument to have a single temperature-only detector (with two beams). We offset the A-side and the B-side beams from the boresight in opposite directions, each by 9 arcminutes. We scanned over the same model as before, fixing the angle of the boresight so that the separation between the beams is perpendicular to the scanning direction. As a simplifying assumption, we discarded subpixel structure. At the mapmaking resolution we used for the single-beam model, we find that the differential scenario is not well cross-linked for the same pair of orthogonal scans. Therefore, we rotated the raster pattern to nine different orientations, shortening the angular length of each scan so that they still fit within the two-degree square of the model. We adjusted the noise level of the detectors so that we spend the same total effort for both the differential and single-beam cases, although the single-beam scan pattern distributes it more evenly.

3. Results

3.1. Single Beam

In Figure 1, we compare the results of the messenger mapmaking procedure to a conjugate gradient mapmaker on the same time-ordered data. The two algorithms converge to nearly the same end point, but in strikingly different ways. The messenger method starts after one iteration with an unbiased binned map full of scanning stripes. As the iterations proceed, the maps remain unbiased and become more optimal. The stripes fade away, in this case after a few tens of iterations, and the map in the cross-linked region resembles the input map (Figure 2). This map used a cooling schedule that we discuss below.

The conjugate gradient maps converge much more slowly on large scales. Initialized to zero, these maps start highly filtered, showing only the minima and maxima positions of the final map. As the iterations proceed, the maps gradually become less biased. A pulse of signal propagates from each pixel (most obviously from the bright extrema), filling out the rest of the map. Even at 500 iterations, however, the map has large-scale distortions compared to the messenger map and the input map. These tend to emanate from the uncross-linked regions. There are also other narrow glitches at the map edges that are not present in the messenger map at 40 iterations.

The mapmaking procedure is supposed to minimize χ²(m) (Equation (3)). In Figure 3, we show that χ² decreases monotonically for iterations of both the messenger and conjugate gradient mapmakers. We evaluate χ² in the Fourier domain as a sum of contributions from every Fourier mode for all detector timestreams for all scans. The conjugate gradient maps have a χ² that drops rapidly, but reaches a plateau.

The convergence for the messenger method depends strongly on the cooling schedule. Without a cooling schedule, at a constant λ=1, the χ² falls quickly, but the noisy modes are slow to converge, and it would take hundreds or thousands of iterations to achieve a quality solution.

We implement cooling schedules where λ is set very high for one iteration, which starts us with an unbiased binned map. Therefore, subsequent iterations are also unbiased. Then λ is set so that the levels of λτ are logarithmically distributed between the maximum and minimum noise power, and we repeat each level for a fixed number of iterations. This naturally concludes with λ=1 for several iterations to finish the solution. Thus the schedule with 8 levels of cooling, for 5 iterations each, we denote as 8×5. At a particular value of λ, the messenger maps will reach a plateau in χ² after a few iterations, as the low-noise modes rapidly converge. If there are noisy modes left to converge, then they will proceed slower, and χ² will fall more and more slowly. When we lower λ, the χ² will fall rapidly toward the new equilibrium. Cooling schemes can lead to better overall χ² values than the conjugate gradient solution, in fewer iterations, and lead to better maps, particularly on the large scales that converge at high λ.

For this simulation, we argue that the 16×10 cooling scheme is well converged, based on comparisons among several cooling schemes and the fact that the plateaus in χ² are very flat, indicating that noisy modes were well converged at previous λ levels. In the 8×5 cooling scheme, the plateaus are not flat, so more convergence could be achieved at each λ level. Even so, the 8×5 cooling scheme after 40 or so iterations achieves a lower final χ² than the conjugate gradient algorithm after 500 iterations.

Although χ² is a good diagnostic, it does not tell us what parts of the solution are converging. It is instructive to examine where the contributions to χ² are coming from. In Figures 4 and 5, we examine χ² per Fourier mode for a single detector during a single scan. In the ideal limit, where the scan is completely in the cross-linked area, and there is so much data that m ≈ m_true, the χ² per mode for a segment of time-ordered data has a χ² distribution, rescaled to have unit mean and with two degrees of freedom (for the real and imaginary portions of the Fourier coefficient). We are not in that limit here, and the distribution of the χ² values per mode is a complicated interplay of the noise and the cross-linking of the scanning pattern. It varies as a function of frequency and has a substantial contribution from subpixel bias.

There is also no simple relationship between the frequencies in the time-ordered data and the eigenmodes of F that converge in the map, except for a rough correspondence that high frequencies in the timestream tend to produce small-scale features in the map, and low frequencies tend to produce large-scale features (but not exclusively). In Figure 4, for messenger iterations with constant λ=1 (no cooling), we can see that the high frequencies find their final χ² values first. Lower and lower frequencies follow, but progress slows down as we dig into noisier and noisier modes. These plots are binned in frequency to average over nearby χ² values, giving a rough idea of the mean of the distribution of χ² as a function of the frequency. They are compared to the well-converged 16×10 cooling scheme.

In another panel, we show iterations at a different fixed value of λ. Here, λτ = N(1 Hz), and it converges toward the solution for that N′(λ). For that solution, the χ² per mode approaches its final value for very low frequencies. At ∼1 Hz the χ² per mode is lower than in the fully converged solution, while it is higher at high frequencies. Essentially, this solution over-fits the ∼1 Hz component of the noise and allows the high-frequency part of the map to be too noisy compared to the ultimate (λ = 1) solution.
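The logarithmically spaced cooling schedules described here are straightforward to generate. A minimal sketch (our own helper, not the authors' code; it assumes the minimum noise power is approximately τ, so that the last level is λ=1, and the initial very-high-λ step would be prepended separately):

```python
import numpy as np

def cooling_schedule(max_noise_over_tau, nlevels=8, niters=5):
    """Levels of lam chosen so that lam*tau is log-spaced from the maximum noise
    power down to tau itself (lam = 1); each level is repeated niters times, so
    nlevels=8, niters=5 corresponds to the '8 x 5' scheme in the text."""
    levels = np.geomspace(max_noise_over_tau, 1.0, nlevels)
    return np.repeat(levels, niters)
```

For example, `cooling_schedule(Nf.max() / tau, 16, 10)` would give the 16×10 scheme, ending with several iterations at λ=1 to finish the solution.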


Figure 1. Convergence of messenger and conjugate gradient maps in the single-beam case. The top row shows the messenger mapmaker up to 40 iterations, using a cooling schedule with 8 cooling levels of 5 iterations each (denoted elsewhere 8 × 5). The iteration number is noted on each panel. The color scale has a range of ±1000 μK. The second row shows the residual difference to a well-converged solution to the map (using the 16 × 10 cooling schedule), inflated by a factor of 10. The third row shows a conjugate gradient mapmaker with a diagonal preconditioner up to 500 iterations. The bottom row shows the conjugate gradient residual differences to the well-converged map (inflated by a factor of 10). The conjugate gradient map retains spurious large-scale features even at 500 iterations. These images show the intensity component, but the polarization components show similar features.

In Figure 5, we compare the final χ² per mode for the messenger and conjugate gradient algorithms, extending to even lower frequencies. We can see more clearly that the conjugate gradient solution's average χ² per mode is above those for the pair of messenger solutions, and that even the relatively quick 8×5 cooling schedule provides a nearly converged map by comparison. We note that the worst frequencies for the conjugate gradient method are near to the telescope scanning frequency (0.46 Hz), and also at the very lowest frequencies in the data.

Figure 2. Left: intensity component of our polarized model sky. Right: difference between maps produced by messenger mapmaking with two cooling schemes. Here, we show the map with 8 cooling levels of 5 iterations each minus the more-converged map with 16 cooling levels of 10 iterations each.

To check that the messenger algorithm is unbiased in the absence of subpixel effects, we ran another simulation that modifies the pointing to sample pixel centers exactly. The messenger maps here are indeed unbiased, with no apparent residuals from the sky signal. Without subpixel bias, the overall χ² values are substantially lower. The comparisons to the conjugate gradient maps lead to the same conclusions as before: conjugate gradient maps show large-scale features that are not present in the messenger maps with cooling. Like before, the 16×10 cooling scheme has the lowest χ². Unlike before, the 8×5 cooling scheme has a slightly higher χ² than the conjugate gradient maps (at 500 iterations), but if the messenger algorithm is allowed to continue at λ=1, then it achieves a lower χ² after 46 total iterations.

We also compare the χ² per mode for a cooling schedule and a conjugate gradient mapmaker. In the 8×5 cooling schedule solution, the early, higher-λ iterations also have low χ² at low frequency, but as λ approaches unity, the iterations force the χ² of the low-frequency modes up to their final values. This balances the penalty in χ² between low and high frequencies. Compared to the others, the conjugate gradient solver converges very quickly at high frequencies, but even after 500 iterations, it does not converge at frequencies around 1 Hz.

Figure 3. With an appropriate cooling schedule, the messenger method can achieve lower χ² than the conjugate gradient algorithm (CG) in many fewer iterations. The plot shows the minimized quantity χ²(m) (Equation (3)) for several mapmaking schemes. Lower values mean better maps. The green line shows the result of a messenger mapmaker with no cooling scheme, holding the cooling parameter constant at λ=1. Two cooling schedules for λ achieve better final outcomes. Here, we show 8 cooling levels of 5 iterations each (8 × 5) and 16 levels of 10 iterations each (16 × 10). The red line shows the progress of a conjugate gradient method, and the dotted red line shows its value after 500 iterations. We compare all curves to the final χ² from the 16×10 cooling scheme.

3.2. Differential Beams

To verify our solution for a differential experiment, we implemented Equation (14) using T_A = T_B = T/2, and applied it to our differential beam simulation. We solve for the map and examine the residuals compared to the input model. (Recall that we discarded the subpixel structure for this case.) For the solution in Figure 6, we take one step at a very high λ, then take λ → 1 with a 16×10 cooling schedule.

In these examples, the effect of the differential beams is visible as a bias in the first iteration of the map. Bright objects are surrounded by a ring of shadow images (of opposite sign). These show the positions of the other beam when one beam has scanned over the bright object. This example has nine scans of two beams, so there are 18 shadows from each bright object. All structures in the map are surrounded by similar features, which for extended objects blend together into doughnut-like shapes.

There are two distinct phases in the iterative solution. First, in the early iterations, the shadow features fade away, and this debiasing dominates the reduction of χ². In this example, the shadows have mostly disappeared on small scales after a dozen iterations, and the map becomes nearly unbiased. Second, the subsequent iterations mostly serve to make the reconstructed map more optimal. After 160 iterations, our solution in Figure 6 has eliminated much of the noise. The reconstruction errors that remain on medium to large scales appear to be noise and change when varying the noise realization.

A test that narrows the separation between the differential beam pair from 18 to 4.8 arcmin provides a solution that is unbiased on small scales, but has a large-scale bias that persists after a 16×10 cooling schedule and does not appear to depend on the particular noise realization. Running for more iterations at a high λ before starting the cooling schedule helps to reduce that bias, and so we conclude that the bias at least in part comes from a schedule that cools prematurely, before the map became unbiased on large scales, and erroneously freezes in the beam shadows of the large-scale features.

Another possibility we need to consider is missing modes. Depending on the cross-linking, differential experiments may struggle to represent certain modes on the sky. The mean of the map is always unconstrained, and other modes could be unconstrained also, particularly on scales larger than the beam separation. For example, in WMAP data the l=3 BB power spectrum has especially large errors among the low-l polarization modes because it is the least modulated mode in the time-ordered data (Page et al. 2007; Hinshaw et al. 2009). When modes are not representable, it means the matrix (P† N⁻¹ P) has less than full rank and so is not invertible.

To probe particular modes of the differential solution more carefully, we wanted to examine a brute force solution, but this is only practical at very low resolution. We downsampled the model from the original map resolution of 600×600 pixels to 50×50 pixels. We then generated time-ordered data from that map (without subpixel structure) using the same scan pattern and noise realization. Figure 6 shows the messenger solution at low resolution with the same cooling scheme. The messenger solution displays reconstruction errors that are similar to the high-resolution solution.

At this lower resolution, we computed the brute force solution by tabulating the 2500² entries of (P† N⁻¹ P) explicitly. From that matrix we computed eigenvectors, eigenvalues, and the Penrose pseudoinverse via singular value decomposition. We find that 1794 eigenmodes have eigenvalues much larger than the others. Large eigenvalues indicate that modes are well constrained. Since 1795 pixels have one or more observations, it appears that only one signal mode, the mean, is unconstrained in the map. For the case with the narrower beam separation, at low resolution again only the mean appears to be unconstrained, which supports the hypothesis that the bias we saw there was due to insufficient iterations at a high λ to debias the map before cooling.

We keep only the singular values of the well-constrained modes in our computation of the pseudoinverse. The brute force solution in Figure 6 shows smaller reconstruction errors than the messenger solution. In Figure 6, we also show the two eigenmodes of (P† N⁻¹ P) with the largest eigenvalues and the two modes with the smallest eigenvalues (of the well-constrained modes). The most constrained modes characterize the central region with the most cross-linking. The least constrained modes are large-scale gradients. When we compute the solution with the pseudoinverse, we see that the remaining reconstruction error does indeed have a large-scale gradient.

In the low-resolution case, we also see that the messenger method with a 16×10 cooling schedule in Figure 6 has not achieved reconstruction errors as small as the brute force solution. On the other hand, at 160 iterations it has a χ² that is only a fraction 10⁻³ higher than the brute force solution (which is comparable to the conjugate gradient mapmaker's quality in the single-beam case), so the large-scale feature is not very significant. Letting it continue to iterate at λ=1 until 320 iterations lowers the χ² steadily to a fraction 2×10⁻⁴ above the brute force value. We tested several fixed and adaptive cooling schedules. The quickest we achieved a χ² that is a fraction 10⁻³ above the brute force value is about 110 iterations, although that solution has larger large-scale noise residuals than what we see in Figure 6. A χ² fraction 10⁻⁴ above the brute force value we can achieve after 670 iterations


or so. The slowest and most stringent case we tried reached a fraction 2×10⁻⁵ above the brute force χ² value after about 2700 iterations.

For many applications, a small number of messenger iterations may produce a map of sufficient quality. It is also possible that a more clever approach to the cooling schedule could improve these results.

Figure 4. Evolution of χ² per frequency mode with iterations (considering one detector during one raster scan). The color and line style indicate the iteration count, and the thick black line shows the χ² values for the well-converged 16×10 cooling schedule for the λ parameter. The black line can be viewed as the target for convergence. A logarithmic binning averages over modes for plotting. The top row displays the convergence of the messenger method with a fixed λ parameter and shows how different frequencies converge in a case without cooling (λ = 1) and a case with a higher λ. The bottom row contrasts a cooled messenger method to a conjugate gradient mapmaker with a diagonal preconditioner. In the messenger method with the 8×5 cooling schedule, the high frequencies converge last as λ drops. The messenger method converges faster and more completely than the conjugate gradient mapmaker, which has not converged at ∼1 Hz after 500 iterations.

4. Discussion of Large Surveys

We discuss two points that are important in applying this method to the large detector arrays in current and future CMB experiments. First, a practical algorithm should avoid storing the time-ordered data in memory, and second, it must treat the diversity of detector noise properties inherent in a multiseason ground-based CMB observing campaign.

4.1. Projecting the Data to Map-sized Objects

To minimize its memory footprint, a mapmaking code should ideally avoid storing the full time-ordered data from iteration to iteration, and the computer should not read the data in more than once to avoid costly input/output operations. In Equation (5), we showed that the messenger algorithm depends on the data only through the map-sized object m_d(λ), which is far smaller than the full data. If we establish a cooling schedule with levels {λ_j} in advance, we can load the time-ordered data once from disk (in chunks), precompute the set of {m_d(λ_j)} maps, and then flush the data from memory. (In parallelization schemes where segments of the time-ordered data are distributed among computational nodes, each node needs only the parts of the m_d(λ) maps that the data touch.) This limitation to fixed λ-levels constrains the on-the-fly adaptability of our cooling schemes, but does not constrain the number of iterations per level. We have shown already that simple cooling schemes can work well.

For real data sets, we do not want to compute χ²(m_i) per iteration because it too requires the full data. However, in our single-beam test case, we have shown that the behavior of χ² per mode for individual detectors can be indicative of the convergence overall. The χ² per mode for representative detector timestreams thus can be a useful online diagnostic.

4.2. Heterogeneous Data

Over the course of a multiseason campaign, differing observing conditions and loading will alter the minimum white noise level for detectors. If T = τI, then τ must represent the global minimum noise for all timestreams going into the map. This same argument applies to mixing timestreams from observatories with heterogeneous experimental designs, as in the current planning for the Simons Observatory³ and CMB-S4.⁴ This single τ value will be too low for most detectors, and causes the convergence for them to be slower than it could be. A better choice for T is piecewise proportional to the identity matrix and uses appropriate τ values for each detector and each uncorrelated time-segment of the data:

T = diag(τ₁, …, τ₁, τ₂, …, τ₂, …, τ_{n_det×n_tod}, …, τ_{n_det×n_tod}).  (16)

In general, good choices for T can be best understood in terms of a hierarchical forward model for the time-ordered data:

1. Map m is the sky signal.

³ https://simonsobservatory.org/
⁴ https://cmb-s4.org, Abazajian et al. (2016).
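As a concrete sketch of Equation (16), the diagonal of such a T can be assembled directly from per-timestream noise floors. The segment length and τ values below are invented for illustration, not taken from the paper:

```python
import numpy as np

# Sketch of the piecewise-constant messenger covariance T of Equation (16).
# The tau values and segment length are illustrative placeholders.
nsamp = 100                          # samples per uncorrelated timestream
taus = [2.5, 1.0, 4.0]               # per-timestream white-noise floors

# T is diagonal, so we store only its diagonal: each tau_i repeated over
# that timestream's contiguous segment of the concatenated data vector.
T_diag = np.concatenate([np.full(nsamp, tau) for tau in taus])

# The single global choice T = tau*I must use the minimum tau, which
# slows convergence for the noisier timestreams:
tau_global = min(taus)
```

Because T remains diagonal, all the matrix inversions it enters stay trivial, while each timestream now contributes its own optimized τ.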

2. Messenger field t is m plus noise drawn from T.
3. Data d is t plus noise drawn from N̄, so that the total amount of noise in d is N.

The more noise that we can put into T, the better the method will work (while keeping N̄ = N − T positive definite). In the limit that all the noise is in T, then t = d and the method "converges" in one single step. In practice, we cannot do that because we cannot invert (P†T⁻¹P) when T = N for realistic cases, as it reverts to the standard mapmaking problem.

So our goal for T is that it should contain as much of the noise detail as possible, while still allowing for simple matrix inversions of N̄, (N̄⁻¹ + (λT)⁻¹), (λ⁻¹N̄ + T), and (P†T⁻¹P). Uniform noise per detector timestream in T fulfills this goal.

Figure 5. Contrasting χ² per mode for messenger and conjugate gradient mapmaking for the timestream of a single detector during one raster scan of our simulation. A logarithmic binning averages together nearby frequencies above 0.01 Hz. At <0.01 Hz, we show individual modes. The top plot compares the χ² per mode for the end point of the messenger mapmaking with two cooling schemes and for the conjugate gradient mapmaker after 500 iterations. The middle plot shows the difference to the most-converged 16×10 cooling schedule. The bottom plot shows the ratio to the 16×10 cooling schedule. We divide before averaging into bins.

5. Conclusions

We have presented a mapmaking algorithm based on messenger fields. The procedure is faster and (by several indications) produces higher-quality maps than preconditioned conjugate gradient descent mapmakers, at least in our example. A straightforward modification to the method allows treatment of differential data and deconvolution mapmaking. It does not require a preconditioner, but it does require a cooling schedule for rapid convergence. We showed in our test cases that simple cooling schedules suffice. Compared to conjugate gradient methods, the extra effort to design an appropriate cooling scheme may be more than offset by the reduction in the number of iterations required, or a hybrid scheme may yield better performance at all scales.

Substantial reductions in the computational cost for mapmaking could allow projects to produce more end-to-end Monte Carlo simulations, leading to better understanding of noise covariances in maps, power spectra, cosmological parameters, and other data products.

We discussed ways in which the algorithm can scale up to handle real data sets. We have successfully applied the messenger mapmaking method to polarization data from the Atacama Cosmology Telescope's ACTPol receiver (Niemack et al. 2010), adapting some of the existing mapmaking infrastructure to map one of the sky regions from Næss et al. (2014). We will discuss it more fully in future work, but our early efforts indicate that the map quality can be similar to conjugate gradient descent methods in a fraction of the iterations. We have not worked hard yet to optimize the cooling schedule for the ACTPol data. Like our simulation example here, the preliminary map of the real data appears more fully converged on large scales than its conjugate gradient counterpart. If these preliminary indications bear out, messenger field mapmaking may be a promising and powerful new addition to our CMB analysis toolbox.

We thank Thibaut Louis, Arthur Kosowsky, Lyman Page, and Suzanne Staggs for useful comments and encouragement. K.M.H. acknowledges support from NASA grant NNX17AF87G. The Flatiron Institute is supported by the Simons Foundation.

Appendix A
Derivation of Messenger Mapmaking

The mapmaking equation minimizes

χ²(m, d) = (d − Pm)†N⁻¹(d − Pm)  (17)

and maximizes the Gaussian probability P(m|d) ∝ exp(−χ²(m, d)/2). We introduce the augmented χ² with a messenger field

χ₁²(m, t, d) = (t − Pm)†T⁻¹(t − Pm) + (d − t)†N̄⁻¹(d − t)  (18)

(with N̄ = N − T) that corresponds to the Gaussian probability P₁(m, t|d) ∝ exp(−χ₁²(m, t, d)/2) and has the property that

∫ P₁(m, t|d) dt = P(m|d)  (19)

when t is marginalized. The field t is called the messenger field because it appears both with the map and the data in χ₁², but the map and the data never appear together. Thus t communicates between the two. The value of m at the maximum is the same in both distributions P and P₁. By iteratively maximizing the marginal distributions P₁(t|m, d) and P₁(m|t, d), we can work our way up to the peak of P₁(m, t|d), and find the m that solves the mapmaking equation.
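This iterative scheme can be sketched end-to-end on a toy problem, alternately updating t and m from the means of P₁(t|m, d) and P₁(m|t, d) (Equations (20) and (21) below), with the cooling parameter λ scaling T. This is our illustrative reconstruction, not the paper's code: the pointing, noise levels, and cooling schedule are invented, and dense matrices stand in for the structured operators a real pipeline would exploit.

```python
import numpy as np

# Toy single-beam messenger mapmaker; all sizes and values illustrative.
rng = np.random.default_rng(1)
npix, nsamp = 8, 200
pix = rng.integers(0, npix, nsamp)            # one pixel hit per sample
P = np.zeros((nsamp, npix))
P[np.arange(nsamp), pix] = 1.0

m_true = rng.standard_normal(npix)
Ndiag = rng.uniform(1.0, 3.0, nsamp)          # diagonal noise covariance N
d = P @ m_true + np.sqrt(Ndiag) * rng.standard_normal(nsamp)

tau = 0.9 * Ndiag.min()                       # homogeneous part: T = tau*I
Nbar_inv = np.diag(1.0 / (Ndiag - tau))       # Nbar = N - T stays pos. def.

m = np.zeros(npix)
for lam in [8.0, 4.0, 2.0, 1.0, 1.0]:         # a simple cooling schedule
    # covariance of the t-update with T -> lam*T (all diagonal here)
    Ci = np.linalg.inv(Nbar_inv + np.eye(nsamp) / (lam * tau))
    for _ in range(20):
        # t update: mean of P1(t | m, d)
        t = Ci @ (Nbar_inv @ d + (P @ m) / (lam * tau))
        # m update: mean of P1(m | t, d); T = tau*I cancels out
        m = np.linalg.solve(P.T @ P, P.T @ t)

# Direct minimum-variance (GLS) solution of the mapmaking equation
A = P.T @ np.diag(1.0 / Ndiag) @ P
m_gls = np.linalg.solve(A, P.T @ (d / Ndiag))
print(np.abs(m - m_gls).max())                # small once cooled to lam=1
```

Every inversion here is of a diagonal or map-sized matrix, and the fixed point at λ = 1 is the minimum-variance solution, which is the point of the construction.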


Figure 6. Messenger mapmaking algorithm for a differential instrument. The top row shows iterations from a 16×10 cooling schedule. The color scale is ±1000 μK. The last column shows the reconstruction error (compared to the input model) inflated by a factor of 10. (The mean is subtracted because it is not constrained in a differential experiment.) The second row zooms in on a bright source that is below and left of the image center. The first messenger iteration shows a ring of shadow beam images around the source, a bias that results from the differential beam and scanning pattern. There are two shadows for each of the nine scanning directions. The shadows fade after a few iterations, and then the map gradually becomes more optimal as the cooling schedule proceeds. The third row shows the same situation at lower resolution. The bottom row shows a brute force solution by singular value decomposition, which the low resolution permits. We show the two most and two least constrained eigenmodes, the solution itself, and the reconstruction error to the input model. The χ² of the low-resolution messenger solution is a fraction 10⁻³ higher than that of the brute force solution, and the map contains some additional large-scale noise fluctuations.
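The brute force eigenmode analysis in the bottom row can be mimicked at toy scale. The following sketch is ours (the sizes and scan are invented): build the normal matrix for purely differential pointing, eigendecompose it, and keep only the well-constrained singular values in the pseudoinverse, as described in the text.

```python
import numpy as np

# Toy version of the brute force eigenmode analysis; sizes illustrative.
rng = np.random.default_rng(0)
npix, nsamp = 16, 400
a = rng.integers(0, npix, nsamp)                   # beam A pixel
b = (a + rng.integers(1, npix, nsamp)) % npix      # beam B pixel (differs)

P = np.zeros((nsamp, npix))                        # differential pointing
P[np.arange(nsamp), a] += 1.0
P[np.arange(nsamp), b] -= 1.0

A = P.T @ P                                        # P^T N^-1 P, white noise
evals, evecs = np.linalg.eigh(A)
well = evals > 1e-8 * evals.max()                  # well-constrained modes
print(npix - well.sum())                           # unconstrained: the mean

# Penrose pseudoinverse keeping only the well-constrained singular values
A_pinv = (evecs[:, well] / evals[well]) @ evecs[:, well].T
m_pinv = A_pinv @ (P.T @ rng.standard_normal(nsamp))  # brute force map
```

With enough cross-linking only the mean mode is unconstrained, mirroring the count of well-constrained eigenvalues discussed in the text.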

We can write down the marginal distributions by completing the square in χ₁² for t and m, respectively,

χ₁²(t|m, d) = (t − μ_t)†(N̄⁻¹ + T⁻¹)(t − μ_t) + ...
χ₁²(m|t, d) = (m − μ_m)†P†T⁻¹P(m − μ_m) + ...,  (20)

and finding the maximum probability points (the means) of the marginal distributions at

μ_t = (N̄⁻¹ + T⁻¹)⁻¹(N̄⁻¹d + T⁻¹Pm)
μ_m = (P†T⁻¹P)⁻¹P†T⁻¹t.  (21)

Setting t = μ_t and m = μ_m leads to the iterative solution. We are free to choose T so long as N̄ is positive definite. Smart choices make the matrix inverses in the means of the marginal distributions easy to calculate. If T = τI, then the matrix inverses are easy in the upper equation and the T cancels from the lower equation. Alternatively, T can be made diagonal so that pieces of the time-ordered data each get their own optimized τ values, in which case m in the next iteration is a weighted coaddition of those pieces of the messenger field.

Appendix B
Mapmaking with a Composite Pointing Matrix

We can treat cases with a nontrivial pointing matrix using a modified formalism where the full pointing matrix is a sum of simpler pointing matrices, each of which includes only a single pixel entry per time sample. The case with two components has the form

P = P_A + P_B,  (22)

but we could generalize this to an arbitrary number of components. In the differential case, the A-component could have +1 entries and the B-component could have −1 entries. The messenger field covariance will also be split: T = T_A + T_B. Matrices like P_A†T_A⁻¹P_A can be made simple to invert. The time-ordered data has the form

d = P_A m + P_B m + n.  (23)

As before, we write down an augmented χ², but this time with two messenger fields:

χ₁²(m, t_A, t_B, d) = (t_A − P_A m)†T_A⁻¹(t_A − P_A m) + (t_B − P_B m)†T_B⁻¹(t_B − P_B m) + (d − t_A − t_B)†N̄⁻¹(d − t_A − t_B).  (24)

Each messenger field has an associated covariance matrix T_A, T_B. With few restrictions, we are able to choose these matrices to make subsequent operations simpler. Again we have N̄ = N − T_A − T_B.

As before, we complete the square to find the means of the marginal distributions,

χ₁²(t_A|...) = (t_A − μ_{t_A})†(N̄⁻¹ + T_A⁻¹)(t_A − μ_{t_A}) + ...
χ₁²(t_B|...) = (t_B − μ_{t_B})†(N̄⁻¹ + T_B⁻¹)(t_B − μ_{t_B}) + ...
χ₁²(m|...) = (m − μ_m)†(P_A†T_A⁻¹P_A + P_B†T_B⁻¹P_B)(m − μ_m) + ...,  (25)

and find

μ_{t_A} = (N̄⁻¹ + T_A⁻¹)⁻¹(N̄⁻¹(d − t_B) + T_A⁻¹P_A m)
μ_{t_B} = (N̄⁻¹ + T_B⁻¹)⁻¹(N̄⁻¹(d − t_A) + T_B⁻¹P_B m)
μ_m = (P_A†T_A⁻¹P_A + P_B†T_B⁻¹P_B)⁻¹(P_A†T_A⁻¹t_A + P_B†T_B⁻¹t_B).  (26)

The iterative solution follows from t_A = μ_{t_A}, t_B = μ_{t_B}, and m = μ_m. These equations require inversions only of trivial matrices and are sufficient to find the mapmaking solution. The term (P_A†T_A⁻¹P_A + P_B†T_B⁻¹P_B) is a weighted sum of weight maps (summed over pointing component) when we choose T_A, T_B that are piecewise proportional to the identity matrix. The generalization to more pointing components is straightforward.

Despite the appearance of t_A and t_B, we can write the solution in terms of a single messenger field t = t_A + t_B. Then t_B = t − t_A and vice versa. In that case,

t_A = (N̄⁻¹ + T_A⁻¹)⁻¹(N̄⁻¹(d − t + t_A) + T_A⁻¹P_A m),  (27)

and similarly for t_B. Putting all the t_A terms on the left-hand side, and multiplying both sides by T_A(N̄⁻¹ + T_A⁻¹), yields

t_A = T_A N̄⁻¹(d − t) + P_A m,  (28)

which is similar for t_B. If we add the two messenger fields together, then we find that

t = (T_A + T_B)N̄⁻¹(d − t) + (P_A + P_B)m,  (29)

which can be rearranged to give

t = (N̄⁻¹ + T⁻¹)⁻¹(N̄⁻¹d + T⁻¹Pm),  (30)

which is the same as the case with a simple pointing matrix. Thus we conclude that there is only one real messenger field, namely t, as t_A and t_B are simply projections of it (using Equation (28)).

We can plug those projections into the final term in the expression for the map in Equation (26) and work to express the right-hand side in terms of the data and map only:

P_A†T_A⁻¹t_A + P_B†T_B⁻¹t_B
 = P_A†N̄⁻¹(d − t) + P_A†T_A⁻¹P_A m + P_B†N̄⁻¹(d − t) + P_B†T_B⁻¹P_B m
 = P†N̄⁻¹(d − t) + (P_A†T_A⁻¹P_A + P_B†T_B⁻¹P_B)m
 = P†N̄⁻¹d − P†N̄⁻¹(N̄⁻¹ + T⁻¹)⁻¹(N̄⁻¹d + T⁻¹Pm) + (P_A†T_A⁻¹P_A + P_B†T_B⁻¹P_B)m
 = P†(N̄⁻¹ − N̄⁻¹(N̄⁻¹ + T⁻¹)⁻¹N̄⁻¹)d + ((P_A†T_A⁻¹P_A + P_B†T_B⁻¹P_B) − P†N̄⁻¹(N̄⁻¹ + T⁻¹)⁻¹T⁻¹P)m
 = P†(N̄ + T)⁻¹d + ((P_A†T_A⁻¹P_A + P_B†T_B⁻¹P_B) − P†(N̄ + T)⁻¹P)m.  (31)

We used the Woodbury formula in the first term of the last step. Multiplying by (P_A†T_A⁻¹P_A + P_B†T_B⁻¹P_B)⁻¹, we arrive at a compact expression for updating the map iterations:

m_{i+1} = (P_A†T_A⁻¹P_A + P_B†T_B⁻¹P_B)⁻¹P†(N̄ + T)⁻¹d + (I − (P_A†T_A⁻¹P_A + P_B†T_B⁻¹P_B)⁻¹P†(N̄ + T)⁻¹P)m_i
 = m_d + F m_i.  (32)

This expresses the mapmaking solution in a way that requires only map-sized objects be stored in memory. The expressions

m_d = (P_A†T_A⁻¹P_A + P_B†T_B⁻¹P_B)⁻¹P†(N̄ + T)⁻¹d
F = I − (P_A†T_A⁻¹P_A + P_B†T_B⁻¹P_B)⁻¹P†(N̄ + T)⁻¹P  (33)

recall similar ones from Equations (6) and (7) for the single-beam case. Including the λ parameters, we have

m_d(λ) = (P_A†(λT_A)⁻¹P_A + P_B†(λT_B)⁻¹P_B)⁻¹P†(N̄ + λT)⁻¹d
F(λ) = I − (P_A†(λT_A)⁻¹P_A + P_B†(λT_B)⁻¹P_B)⁻¹P†(N̄ + λT)⁻¹P.  (34)

Another convenient expression of the solution is

m_{i+1} = m_i + (P_A†(λT_A)⁻¹P_A + P_B†(λT_B)⁻¹P_B)⁻¹P†(N̄ + λT)⁻¹(d − Pm_i),  (35)

which recalls Equation (8).

Finally, we can check that we recover the simple, single pencil beam case by setting P_B = 0. Then P = P_A and, with T_B = 0, N̄ = N − T_A. So we have

m_d(λ) = (P_A†(λT_A)⁻¹P_A)⁻¹P_A†(N̄ + λT_A)⁻¹d
F(λ) = I − (P_A†(λT_A)⁻¹P_A)⁻¹P_A†(N̄ + λT_A)⁻¹P_A  (36)

and recover one of the forms for the single pencil beam case from before (Equations (6) and (7)).

Appendix C
Mapmaking with Composite Timestream Noise

The timestream noise covariance could be complicated and composed of several pieces, like N = Σ_i N_i, so that by itself, it is difficult to invert in any easily accessible basis. Huffenberger (2017) showed that the messenger field method can be extended to handle such cases if the subcomponents are invertible in separate and convenient bases. That work examined Wiener filters, but the same idea applies for mapmaking. For the case


N=N0+N1, we need additional messenger fields t0, t1, d0 and References two matrices T0, T1. (More complicated noise will require even more additional fields.) Iterating the following equations will yield Abazajian, K. N., Adshead, P., Ahmed, Z., et al. 2016, arXiv:1610.02743 Alsing, J., Heavens, A., Jaffe, A. H., et al. 2016, MNRAS, 455, 4452 the mapmaking solution: Anderes, E., Wandelt, B. D., & Lavaux, G. 2015, ApJ, 808, 152 Armitage, C., & Wandelt, B. D. 2004, PhRvD, 70, 123007 ¯ --1 1 -1 ¯ --1 1 tdd1 =+(NT1 1 )( N1 + T1 0) Armitage-Caplan, C., & Wandelt, B. D. 2009, ApJS, 181, 533 Cantalupo, C. M., Borrill, J. D., Jaffe, A. H., Kisner, T. S., & Stompor, R. ¯ --1 1 -1 ¯ --1 1 dtt0 =+(NT0 1 )( N0 0 + T1 1) 2010, ApJS, 187, 212 tdm=+NT¯ --1 1 -1 N¯ --1 + TP1 de Gasperis, G., Balbi, A., Cabella, P., Natoli, P., & Vittorio, N. 2005, A&A, 0 ( 0 0 )(0 0 0 ) 436, 1159 ††-1 -1 -1 Doré, O., Teyssier, R., Bouchet, F. R., Vibert, D., & Prunet, S. 2001, A&A, mt= ()PT0 P PT0 0,37 () 374, 358 ¯ ¯ Dupac, X., & Giard, M. 2002, MNRAS, 330, 497 where N00=-NT0 and similarly for N1. Elsner, F., & Wandelt, B. D. 2013, A&A, 549, A111 Gap filling, for example, can be naturally incorporated into Hinshaw, G., Barnes, C., Bennett, C. L., et al. 2003, ApJS, 148, 63 this scheme. We can let N0 be the timestream noise (diagonal in Hinshaw, G., Weiland, J. L., Hill, R. S., et al. 2009, ApJS, 180, 225 Huffenberger, K. M. 2017, MNRAS, submitted (arXiv:1704.00865) the Fourier domain) and let N1 have infinite variance during ( ) Janssen, M. A., & Gulkis, S. 1992, ASI Ser. C., 359, 391 gaps in the timestream diagonal in the time domain , but Jasche, J., & Lavaux, G. 2015, MNRAS, 447, 1204 minimal other noise. This is similar to the treatment of spatial Kodi Ramanah, D., Lavaux, G., & Wandelt, B. D. 2017, MNRAS, 468, 1782 masks in Elsner & Wandelt (2013), Huffenberger (2017), and Lavaux, G., & Jasche, J. 2016, MNRAS, 455, 3169 Næss, S. K., & Louis, T. 2014, JCAP, 8, 045 many other works. 
Under this assumption, d0 will be an optimized estimate of the gap-filled timestream and there are no Næss, S. 2014, JCAP, 10, 007 fi Natoli, P., de Gasperis, G., Gheller, C., & Vittorio, N. 2001, A&A, 372, 346 ambiguities about how to ll gaps. However, this method may Niemack, M. D., Ade, P. A. R., Aguirre, J., et al. 2010, Proc. SPIE, 7741, not be practical because of the need to carry around additional 77411S messenger fields. Each of these is large for real data sets and is Page, L., Hinshaw, G., Komatsu, E., et al. 2007, ApJS, 170, 335 the same size as the time-ordered data. The additional fields Patanchon, G., Ade, P. A. R., Bock, J. J., et al. 2008, ApJ, 681, 708 Piazzo, L., Calzoletti, L., Faustini, F., et al. 2015, MNRAS, 447, 1471 will also make the system slower to converge. Planck Collaboration XII, Ade, P. A. R., Aghanim, N., et al. 2016, A&A, 594, A12 ORCID iDs Stompor, R., Balbi, A., Borrill, J. D., et al. 2002, PhRvD, 65, 022003 Szydlarski, M., Grigori, L., & Stompor, R. 2014, A&A, 572, A39 Kevin M. Huffenberger https://orcid.org/0000-0001- Tegmark, M. 1997, ApJL, 480, L87 7109-0099 Wright, E. L., Hinshaw, G., & Bennett, C. L. 1996, ApJL, 458, L53
