A&A 574, A74 (2015), Astronomy & Astrophysics, DOI: 10.1051/0004-6361/201323006, © ESO 2015

Denoising, deconvolving, and decomposing photon observations

Derivation of the D3PO algorithm

Marco Selig¹,² and Torsten A. Enßlin¹,²

¹ Max-Planck Institut für Astrophysik, Karl-Schwarzschild-Straße 1, 85748 Garching, Germany
  e-mail: [email protected]
² Ludwig-Maximilians-Universität München, Geschwister-Scholl-Platz 1, 80539 München, Germany

Received 7 November 2013 / Accepted 22 October 2014

ABSTRACT

The analysis of astronomical images is a non-trivial task. The D3PO algorithm addresses the inference problem of denoising, deconvolving, and decomposing photon observations. Its primary goal is the simultaneous but individual reconstruction of the diffuse and point-like photon flux given a single photon count image, where the fluxes are superimposed. In order to discriminate between these morphologically different signal components, a probabilistic algorithm is derived in the language of information field theory based on a hierarchical Bayesian parameter model. The signal inference exploits prior information on the spatial correlation structure of the diffuse component and the brightness distribution of the spatially uncorrelated point-like sources. A maximum a posteriori solution and a solution minimizing the Gibbs free energy of the inference problem using variational Bayesian methods are discussed. Since the derivation of the solution is not dependent on the underlying position space, the implementation of the D3PO algorithm uses the NIFTY package to ensure applicability to various spatial grids and at any resolution. The fidelity of the algorithm is validated by the analysis of simulated data, including a realistic high energy photon count image showing a 32 × 32 arcmin² observation with a spatial resolution of 0.1 arcmin. In all tests the D3PO algorithm successfully denoised, deconvolved, and decomposed the data into a diffuse and a point-like signal estimate for the respective photon flux components.

Key words. methods: data analysis – methods: numerical – methods: statistical – techniques: image processing – gamma rays: general – X-rays: general

A copy of the code is available at the CDS via anonymous ftp to cdsarc.u-strasbg.fr (130.79.128.5) or via http://cdsarc.u-strasbg.fr/viz-bin/qcat?J/A+A/574/A74

1. Introduction

An astronomical image might display multiple superimposed features, such as point sources, compact objects, diffuse emission, or background radiation. The raw photon count images delivered by high energy telescopes are far from perfect; they suffer from shot noise and distortions caused by instrumental effects. The analysis of such astronomical observations demands elaborate denoising, deconvolution, and decomposition strategies.

The data obtained by the detection of individual photons is subject to Poissonian shot noise, which is more severe for low count rates. This makes it difficult to discriminate faint sources against noise, and renders their detection exceptionally challenging. Furthermore, uneven or incomplete survey coverage and complex instrumental response functions leave imprints in the photon data. As a result, the data set might exhibit gaps and artificial distortions, rendering the clear recognition of different features a difficult task. Point-like sources are especially afflicted by the instrument's point spread function (PSF), which smooths them out in the observed image and can therefore cause fainter ones to vanish completely in the background noise.

In addition to such noise and convolution effects, it is the superposition of the different objects that makes their separation ambiguous, if possible at all. In astrophysics, photon emitting objects are commonly divided into two morphological classes, diffuse sources and point sources. Diffuse sources span rather smoothly across large fractions of an image, and exhibit apparent internal correlations. Point sources, on the contrary, are local features that, if observed perfectly, would only appear in one pixel of the image. In this work, we will not distinguish between diffuse sources and background; both are diffuse contributions. Intermediate cases, which are sometimes classified as extended or compact sources, are also not considered here.

The question arises of how to reconstruct the original source contributions, the individual signals, that caused the observed photon data. This task is an ill-posed inverse problem without a unique solution. There are a number of heuristic and probabilistic approaches that address the problem of denoising, deconvolution, and decomposition in partial or simpler settings.

SExtractor (Bertin & Arnouts 1996) is one of the heuristic tools and the most prominent for identifying sources in astronomy. Its popularity is mostly based on its speed and easy operability. However, SExtractor produces a catalog of fitted sources rather than denoised and deconvolved signal estimates. CLEAN (Högbom 1974) is commonly used in radio astronomy and attempts a deconvolution assuming there are only contributions from point sources. Therefore, diffuse emission is not optimally reconstructed in the analysis of real observations using CLEAN. Multiscale extensions of CLEAN (Cornwell 2008; Rau & Cornwell 2011) improve on this, but are also not perfect

Article published by EDP Sciences

(Junklewitz et al. 2014). Decomposition techniques for diffuse backgrounds, based on the analysis of angular power spectra, have recently been proposed by Hensley et al. (2013).

Inference methods, in contrast, investigate the probabilistic relation between the data and the signals. Here, the signals of interest are the different source contributions. Probabilistic approaches allow a transparent incorporation of model and a priori assumptions, but often result in computationally heavier algorithms.

As an initial attempt, a maximum likelihood analysis was proposed by Valdes (1982). In later work, maximum entropy (Strong 2003) and minimum χ² methods (e.g., Bouchet et al. 2013) were applied to the INTEGRAL/SPI data, although they reconstruct only a single signal component. On the basis of sparse regularization, a number of techniques exploiting waveforms (based on the work by Haar 1910, 1911) have proven successful in performing denoising and deconvolution tasks in different settings (González-Nuevo et al. 2006; Willett & Nowak 2007; Dupé et al. 2009, 2011; Figueiredo & Bioucas-Dias 2010). For example, Schmitt et al. (2010, 2012) analyzed simulated (single and multi-channel) data from the Fermi γ-ray space telescope, focusing on the removal of Poisson noise and deconvolution or background separation. Furthermore, a (generalized) morphological component analysis denoised, deconvolved and decomposed simulated radio data assuming Gaussian noise statistics (Bobin et al. 2007; Chapman et al. 2013).

Still in the regime of Gaussian noise, Giovannelli & Coulais (2005) derived a deconvolution algorithm for point and extended sources minimizing regularized least squares. They introduce an efficient convex regularization scheme at the price of a priori unmotivated fine tuning parameters. The fast algorithm PowellSnakes I/II by Carvalho et al. (2009, 2012) is capable of analyzing multi-frequency data sets and detecting point-like objects within diffuse emission regions. It relies on matched filters using PSF templates and Bayesian filters exploiting, among others, priors on source position, size, and number. PowellSnakes II has been successfully applied to the Planck data (Planck Collaboration VII 2011).

The approach closest to ours is the background-source separation technique used to analyze the ROSAT data (Guglielmetti et al. 2009). This Bayesian model is based on a two-component mixture model that reconstructs extended sources and (diffuse) background concurrently. The latter is, however, described by a spline model with a small number of spline sampling points.

The strategy presented in this work aims at the simultaneous reconstruction of two signals, the diffuse and point-like photon flux. Both fluxes contribute equally to the observed photon counts, but their morphological imprints are very different. The proposed algorithm, derived in the framework of information field theory (IFT; Enßlin et al. 2009; Enßlin 2013, 2014), therefore incorporates prior assumptions in the form of a hierarchical parameter model. The fundamentally different morphologies of diffuse and point-like contributions are reflected in different prior correlations and statistics. The exploitation of these different prior models is the key to the signal decomposition. In this work, we exclusively consider Poissonian noise, in particular, but not exclusively, in the low count rate regime, where the signal-to-noise ratio becomes challengingly low. The D3PO algorithm presented here targets the simultaneous denoising, deconvolution, and decomposition of photon observations into two signals, the diffuse and point-like photon flux. This task includes the reconstruction of the harmonic power spectrum of the diffuse component from the data themselves. Moreover, the proposed algorithm provides a posteriori uncertainty information on both inferred signals.

The diffuse and point-like signals are treated as two separate signal fields. A signal field represents an original signal appearing in nature; e.g., the physical photon flux distribution of one source component as a function of real space or sky position. In theory, a field has infinitely many degrees of freedom, being defined on a continuous position space. In computational practice, however, a field needs to be defined on a finite grid. It is desirable that the signal field is reconstructed independently of the grid's resolution, except for potentially unresolvable features¹. We note that the point-like signal field hosts one point source in every pixel; however, most of them might be invisibly faint. Hence, a complicated determination of the number of point sources, as many algorithms perform, is not required in our case.

¹ If the resolution of the reconstruction were increased gradually, the diffuse signal field might exhibit more and more small scale features until the information content of the given data is exhausted. From this point on, any further increase in resolution would not change the signal field reconstruction significantly. In a similar manner, the localization accuracy and number of detections of point sources might increase with the resolution until all relevant information of the data was captured. All higher resolution grids can then be regarded as acceptable representations of the continuous position space.

The derivation of the algorithm makes use of a wide range of Bayesian methods that are discussed below in detail with regard to their implications and applicability. For now, let us consider an example to emphasize the range and performance of the D3PO algorithm. Figure 1 illustrates a reconstruction scenario in one dimension, where the coordinate could be an angle or position (or time, or energy) in order to represent a 1D sky (or a time series, or an energy spectrum). The numerical implementation uses the NIFTY² package (Selig et al. 2013). NIFTY permits an algorithm to be set up abstractly, independent of the finally chosen topology, dimension, or resolution of the underlying position space. In this way, a 1D prototype code can be used for development, and then just be applied in 2D, 3D, or even on the sphere.

² NIFTY homepage http://www.mpa-garching.mpg.de/ift/nifty/

The remainder of this paper is structured as follows. Section 2 discusses the inference on photon observations; i.e., the underlying model and prior assumptions. The D3PO algorithm solving this inference problem by denoising, deconvolution, and decomposition is derived in Sect. 3. In Sect. 4 the algorithm is demonstrated in a numerical application on simulated high energy photon data. We conclude in Sect. 5.

2. Inference on photon observations

2.1. Signal inference

Here, a signal is defined as an unknown quantity of interest that we want to learn about. The most important information source on a signal is the data obtained in an observation to measure the signal. Inferring a signal from an observational data set poses

[Figure 1: four panels, each plotted against coordinate [arbitrary] with a logarithmic vertical axis (ticks from 2^0 to 2^10). (a) superimposed signal components; signal response (superimposed and convolved); vertical axis: integrated flux [counts]. (b) data points (zeros excluded); signal response (noiseless data); vertical axis: photon data [counts]. (c) reconstructed point-like signal; reconstructed diffuse signal; 2σ uncertainty interval; (original) signal response; vertical axis: integrated flux [counts]. (d) reproduced signal response; 2σ shot noise interval; data points (zeros excluded); vertical axis: photon data [counts].]

Fig. 1. Illustration of a 1D reconstruction scenario with 1024 pixels. Panel a) shows the superimposed diffuse and point-like signal components (green solid line) and its observational response (gray contour). Panel b) shows again the signal response representing noiseless data (gray contour) and the generated Poissonian data (red markers). Panel c) shows the reconstruction of the point-like signal component (blue solid line), the diffuse one (orange solid line), its 2σ reconstruction uncertainty interval (orange dashed line), and again the original signal response (gray contour). The point-like signal comprises 1024 point-sources of which only five are not too faint. Panel d) shows the reproduced signal response representing noiseless data (black solid line), its 2σ shot noise interval (black dashed line), and again the data (gray markers).
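A scenario of the kind shown in Fig. 1 can be mimicked in a few lines. The following Python sketch is not the D3PO or NIFTY code; it only illustrates the forward direction, from two superimposed signal components to Poissonian counts. The grid size, the functional form of the diffuse log-flux, the five bright pixels, the Gaussian PSF with periodic boundaries, and all function names are assumptions made for illustration.

```python
import math
import random


def poisson_sample(lam, rng):
    """Draw one Poisson variate with Knuth's multiplication method.

    Adequate for the moderate expectation values used here (lam well below ~700).
    """
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def simulate_observation(n_pix=256, seed=1):
    """Toy 1D forward model: log-fluxes -> total flux -> response -> counts."""
    rng = random.Random(seed)
    rho0 = 1.0  # flux unit (assumed)

    # Diffuse log-flux s: smooth by construction (two low-frequency modes, assumed).
    s = [1.5 + 0.8 * math.sin(2 * math.pi * x / n_pix)
         + 0.4 * math.cos(6 * math.pi * x / n_pix) for x in range(n_pix)]

    # Point-like log-flux u: one source per pixel, all but five negligibly faint.
    u = [-10.0] * n_pix
    for x in rng.sample(range(n_pix), 5):
        u[x] = rng.uniform(4.0, 5.5)

    # Superposition of both components.
    rho = [rho0 * (math.exp(a) + math.exp(b)) for a, b in zip(s, u)]

    # Response: normalized Gaussian PSF, periodic boundary, unit exposure (assumed).
    width, half = 2.0, 8
    offsets = range(-half, half + 1)
    kernel = [math.exp(-0.5 * (j / width) ** 2) for j in offsets]
    norm = sum(kernel)
    kernel = [w / norm for w in kernel]
    lam = [sum(w * rho[(i + j) % n_pix] for j, w in zip(offsets, kernel))
           for i in range(n_pix)]

    # Independent Poissonian counts per pixel.
    d = [poisson_sample(l, rng) for l in lam]
    return s, u, rho, lam, d


s, u, rho, lam, d = simulate_observation()
```

Because the assumed PSF is normalized and the boundary is periodic, the response conserves the total flux, so the sums over `lam` and `rho` agree up to rounding. The inverse problem treated in this work is the far harder step of recovering `s` and `u` from `d` alone.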

a fundamental problem because of the presence of noise in the data and the ambiguity that several possible signals could have produced the same data, even in the case of negligible noise.

For example, given some image data like photon counts, we want to infer the underlying photon flux distribution. This physical flux is a continuous scalar field that varies with respect to time, energy, and observational position. The measured photon count data, however, is restricted by its spatial and energy binning, as well as its limitations in energy range and observation time. Basically, all data sets are finite for practical reasons, and therefore cannot capture all of the infinitely many degrees of freedom of the underlying continuous signal field.

There is no exact solution to such signal inference problems, since there might be (infinitely) many signal field configurations that could lead to the same data. This is why a probabilistic data analysis, which does not pretend to calculate the correct field configuration but provides expectation values and uncertainties of the signal field, is appropriate for signal inference.

Given a data set $d$, the a posteriori probability distribution $P(s|d)$ judges how likely a potential signal $s$ is. This posterior is given by Bayes' theorem,

$$P(s|d) = \frac{P(d|s)\, P(s)}{P(d)}, \tag{1}$$

as a combination of the likelihood $P(d|s)$, the signal prior $P(s)$, and the evidence $P(d)$, which serves as a normalization. The likelihood characterizes how likely it is to measure data set $d$ from a given signal field $s$. It covers all processes that are relevant for the measurement of $d$. The prior describes the knowledge about $s$ without considering the data, and should, in general, be less restrictive than the likelihood.

IFT is a Bayesian framework for the inference of signal fields exploiting mathematical methods from theoretical physics. A signal field, $s = s(x)$, is a function of a continuous position $x$ in some position space $\Omega$. In order to avoid a dependence of the reconstruction on the partition of $\Omega$, the corresponding calculus regarding fields is geared to preserve the continuum limit (Enßlin 2013, 2014; Selig et al. 2013). In general, we are interested in the a posteriori mean estimate $m$ of the signal field given the data, and its (uncertainty) covariance $D$, defined as

$$m = \langle s \rangle_{(s|d)} = \int \mathcal{D}s \; s \, P(s|d), \tag{2}$$

$$D = \left\langle (m - s)(m - s)^\dagger \right\rangle_{(s|d)}, \tag{3}$$

where $\dagger$ denotes adjunction and $\langle \, \cdot \, \rangle_{(s|d)}$ the expectation value with respect to the distribution $P(s|d)$³.

³ This expectation value is computed by a path integral, $\int \mathcal{D}s$, over the complete phase space of the signal field $s$; i.e., all possible field configurations.

In the following, the posterior of the physical photon flux distribution of two morphologically different source components given a data set of photon counts is built up piece by piece according to Eq. (1).

2.2. Poissonian likelihood

The images provided by astronomical high energy telescopes typically consist of integer photon counts that are binned spatially into pixels. Let $d_i$ be the number of detected photons, also called events, in pixel $i$, where $i \in \{1, \dots, N_\mathrm{pix}\} \subset \mathbb{N}$.

The kind of signal field we would like to infer from such data is the causative photon flux distribution. The photon flux, $\rho = \rho(x)$, is defined for each position $x$ on the observational space $\Omega$. In astrophysics, this space $\Omega$ is typically the $S^2$ sphere representing an all-sky view, or a region within $\mathbb{R}^2$ representing an approximately plane patch of the sky. The flux $\rho$ might express different morphological features, which can be classified into a diffuse and a point-like component. The exact definitions of the diffuse and point-like flux should be specified a priori, without knowledge of the data, and are addressed in Sects. 2.3.1 and 2.3.3, respectively. At this point it shall suffice to say that the diffuse flux varies smoothly on large spatial scales, while the flux originating from point sources is fairly local. These two flux components are superimposed,

$$\rho = \rho_\mathrm{diffuse} + \rho_\mathrm{point\text{-}like} = \rho_0 \left( e^{s} + e^{u} \right), \tag{4}$$

where we introduced the dimensionless diffuse and point-like signal fields, $s$ and $u$, and the constant $\rho_0$ which absorbs the physical dimensions of the photon flux; i.e., events per area per energy and time interval. The exponential function in Eq. (4) is applied componentwise. In this way, we naturally account for the strict positivity of the photon flux at the price of a non-linear change of variables, from the flux to its natural logarithm.

A measurement apparatus observing the photon flux $\rho$ is expected to detect a certain number of photons $\lambda$. This process can be modeled by a linear response operator $R_0$ as follows,

$$\lambda = R_0 \rho = R \left( e^{s} + e^{u} \right), \tag{5}$$

where $R = R_0 \rho_0$. This reads for pixel $i$,

$$\lambda_i = \int_\Omega \mathrm{d}x \; R_i(x) \left( e^{s(x)} + e^{u(x)} \right). \tag{6}$$

The response operator $R_0$ comprises all aspects of the measurement process; i.e., all instrument response functions. This includes the survey coverage, which describes the instrument's overall exposure to the observational area, and the instrument's PSF, which describes how a point source is imaged by the instrument.

The superposition of different components and the transition from continuous coordinates to some discrete pixelization, cf. Eq. (6), cause a severe loss of information about the original signal fields. In addition to that, measurement noise distorts the signal's imprint in the data. The individual photon counts per pixel can be assumed to follow a Poisson distribution $\mathcal{P}$ each. Therefore, the likelihood of the data $d$ given an expected number of events $\lambda$ is modeled as a product of statistically independent Poisson processes,

$$P(d|\lambda) = \prod_i \mathcal{P}(d_i, \lambda_i) = \prod_i \frac{1}{d_i!} \, \lambda_i^{d_i} \, e^{-\lambda_i}. \tag{7}$$

The Poisson distribution has a signal-to-noise ratio of $\sqrt{\lambda}$, which scales with the expected number of photon counts. Therefore, Poissonian shot noise is most severe in regions with low photon fluxes. This makes the detection of faint sources in high energy astronomy a particularly challenging task, as X- and γ-ray photons are sparse.

The likelihood of photon count data given a two component photon flux is hence described by Eqs. (5) and (7). Rewriting this likelihood $P(d|s, u)$ in the form of its negative logarithm yields


the information Hamiltonian $H(d|s, u)$⁴,

\begin{align}
H(d|s, u) &= -\log P(d|s, u) \tag{8} \\
&= H_0 + 1^\dagger \lambda - d^\dagger \log(\lambda) \tag{9} \\
&= H_0 + 1^\dagger R \left( e^{s} + e^{u} \right) - d^\dagger \log\left( R \left( e^{s} + e^{u} \right) \right), \tag{10}
\end{align}

where the ground state energy $H_0$ comprises all terms constant in $s$ and $u$, and $1$ is a constant data vector being 1 everywhere.

⁴ Throughout this work we define $H(\,\cdot\,) = -\log P(\,\cdot\,)$, and absorb constant terms into a normalization constant $H_0$ in favor of clarity.

2.3. Prior assumptions

The diffuse and point-like signal fields, $s$ and $u$, contribute equally to the likelihood defined by Eq. (10), and thus leave it completely degenerate. On the mere basis of the likelihood, the full data set could be explained by the diffuse signal alone, or only by point sources, or any other conceivable combination. In order to downweight intuitively implausible solutions, we introduce priors. The priors discussed in the following address the morphology of the different photon flux contributions, and define diffuse and point-like in the first place. These priors aid the reconstruction by providing some remedy for the degeneracy of the likelihood. The likelihood describes noise and convolution properties, and the priors describe the individual morphological properties. Therefore, the denoising and deconvolution of the data towards the total photon flux $\rho$ is primarily likelihood driven, but for a decomposition of the total photon flux into $\rho^{(s)}$ and $\rho^{(u)}$, the signal priors are imperative.

2.3.1. Diffuse component

The diffuse photon flux, $\rho^{(s)} = \rho_0 e^{s}$, is strictly positive and might vary in intensity over several orders of magnitude. Its morphology shows cloudy patches with smooth fluctuations across spatial scales; i.e., one expects similar values of the diffuse flux in neighboring locations. In other words, the diffuse component exhibits spatial correlations. A log-normal model for $\rho^{(s)}$ satisfies those requirements according to the maximum entropy principle (Oppermann et al. 2013; Kinney 2014). If the diffuse photon flux follows a multivariate log-normal distribution, the diffuse signal field $s$ obeys a multivariate Gaussian distribution $\mathcal{G}$,

$$P(s|S) = \mathcal{G}(s, S) = \frac{1}{\sqrt{\det[2\pi S]}} \exp\left( -\frac{1}{2} s^\dagger S^{-1} s \right), \tag{11}$$

with a given covariance $S = \langle s s^\dagger \rangle_{(s|S)}$. This covariance describes the strength of the spatial correlations, and thus the smoothness of the fluctuations.

A convenient parametrization of the covariance $S$ can be found if the signal field $s$ is a priori not known to distinguish any position or orientation axis; i.e., its correlations only depend on relative distances. This is equivalent to assuming $s$ to be statistically homogeneous and isotropic. Under this assumption, $S$ is diagonal in the harmonic basis⁵ of the position space $\Omega$ such that

$$S = \sum_k e^{\tau_k} S_k, \tag{12}$$

where $\tau_k$ are spectral parameters and $S_k$ are projections onto a set of disjoint harmonic subspaces of $\Omega$. These subspaces are commonly denoted as spectral bands or harmonic modes. The set of spectral parameters, $\tau = \{\tau_k\}_k$, is then the logarithmic power spectrum of the diffuse signal field $s$ with respect to the chosen harmonic basis denoted by $k$.

⁵ The basis in which the Laplace operator is diagonal is denoted harmonic basis. If $\Omega$ is an $n$-dimensional Euclidean space $\mathbb{R}^n$ or torus $\mathcal{T}^n$, the harmonic basis is the Fourier basis; if $\Omega$ is the $S^2$ sphere, the harmonic basis is the spherical harmonics basis.

However, the diffuse signal covariance is in general unknown a priori. This requires the introduction of another prior for the covariance, or for the set of parameters $\tau$ describing it adequately. This approach of hyperpriors on prior parameters creates a hierarchical parameter model.

2.3.2. Unknown power spectrum

The lack of knowledge of the power spectrum requires its reconstruction from the same data the signal is inferred from (Wandelt et al. 2004; Jasche et al. 2010; Enßlin & Frommert 2011; Jasche & Wandelt 2013). Therefore, two a priori constraints for the spectral parameters $\tau$, which describe the logarithmic power spectrum, are incorporated in the model.

The power spectrum is unknown and might span several orders of magnitude. This implies a logarithmically uniform prior for each element of the power spectrum, and a uniform prior for each spectral parameter $\tau_k$, respectively. We initially assume independent inverse-Gamma distributions $\mathcal{I}$ for the individual elements,

\begin{align}
P(e^{\tau}|\alpha, q) &= \prod_k \mathcal{I}(e^{\tau_k}, \alpha_k, q_k) \tag{13} \\
&= \prod_k \frac{q_k^{\alpha_k - 1}}{\Gamma(\alpha_k - 1)} \, e^{-\left( \alpha_k \tau_k + q_k e^{-\tau_k} \right)}, \tag{14}
\end{align}

and hence

\begin{align}
P_\mathrm{un}(\tau|\alpha, q) &= \prod_k \mathcal{I}(e^{\tau_k}, \alpha_k, q_k) \left| \frac{\mathrm{d} e^{\tau_k}}{\mathrm{d} \tau_k} \right| \tag{15} \\
&\propto \exp\left( -(\alpha - 1)^\dagger \tau - q^\dagger e^{-\tau} \right), \tag{16}
\end{align}

where $\alpha = \{\alpha_k\}_k$ and $q = \{q_k\}_k$ are the shape and scale parameters, and $\Gamma$ denotes the Gamma function. In the limit of $\alpha_k \to 1$ and $q_k \to 0 \; \forall k$, the inverse-Gamma distributions become asymptotically flat on a logarithmic scale, and thus $P_\mathrm{un}$ becomes constant⁶. Small non-zero scale parameters, $0 < q_k$, provide lower limits for the power spectrum that, in practice, lead to more stable inference algorithms.

⁶ If $P(\tau_k = \log z) = \mathrm{const.}$, then a substitution yields $P(z) = P(\log z) \, |\mathrm{d}(\log z)/\mathrm{d}z| \propto z^{-1} \sim \mathcal{I}(z, \alpha \to 1, q \to 0)$.

So far, the variability of the individual elements of the power spectrum is accounted for, but the question about their correlations has not been addressed. Empirically, power spectra of a diffuse signal field do not exhibit wild fluctuations or change drastically over neighboring modes. They rather show some sort of spectral smoothness. Moreover, for diffuse signal fields that were shaped by local and causal processes, we might expect a finite correlation support in position space. This translates into a smooth power spectrum. In order to incorporate spectral smoothness, we employ a prior introduced by Enßlin & Frommert (2011) and Oppermann et al. (2013). This prior is based on the second logarithmic derivative of the spectral parameters $\tau$, and favors power spectra that obey a power law. It reads

$$P_\mathrm{sm}(\tau|\sigma) \propto \exp\left( -\frac{1}{2} \tau^\dagger T \tau \right), \tag{17}$$

with

$$\tau^\dagger T \tau = \int \mathrm{d}(\log k) \, \frac{1}{\sigma_k^2} \left( \frac{\partial^2 \tau_k}{\partial (\log k)^2} \right)^2, \tag{18}$$

where $\sigma = \{\sigma_k\}_k$ are Gaussian standard deviations specifying the tolerance against deviation from a power-law behavior of the power spectrum. A choice of $\sigma_k = 1 \; \forall k$ would typically allow for a change in the power law's slope of 1 per e-fold in $k$. In the limit of $\sigma_k \to \infty \; \forall k$, no smoothness is enforced upon the power spectrum.

The resulting prior for the spectral parameters is given by the product of the priors discussed above,

$$P(\tau|\alpha, q, \sigma) = P_\mathrm{un}(\tau|\alpha, q) \, P_\mathrm{sm}(\tau|\sigma). \tag{19}$$

The parameters $\alpha$, $q$ and $\sigma$ are considered to be given as part of the hierarchical Bayesian model, and provide a flexible handle to model our knowledge on the scaling and smoothness of the power spectrum.

2.3.3. Point-like component

The point-like photon flux, $\rho^{(u)} = \rho_0 e^{u}$, is supposed to originate from very distant astrophysical sources. These sources appear morphologically point-like to an observer because their actual extent is negligible given the extreme distances. This renders point sources spatially local phenomena. The photon flux contributions of neighboring point sources can (to zeroth order approximation) be assumed to be statistically independent of each other. Even if two sources are very close on the observational plane, their physical distance might be huge. Even in practice, the spatial cross-correlation of point sources is negligible. Therefore, statistically independent priors for the photon flux contribution of each point source are assumed in the following.

Because of the spatial locality of a point source, the corresponding photon flux signal is supposed to be confined to a single spot, too. If the point-like signal field, defined over a continuous position space $\Omega$, is discretized properly⁷, this spot is sufficiently identified by an image pixel in the reconstruction. A discretization, $\rho(x \in \Omega) \to (\rho_x)_x$, is an inevitable step since the algorithm is to be implemented in a computer environment anyway. Nevertheless, we have to ensure that the a priori assumptions do not depend on the chosen discretization but satisfy the continuous limit.

⁷ The numerical discretization of information fields is described in great detail in Selig et al. (2013).

Therefore, the prior for the point-like signal component factorizes spatially,

$$P(\rho^{(u)}) = \prod_x P(\rho_x^{(u)}), \tag{20}$$

but the functional form of the priors is yet to be determined. This model allows the point-like signal field to host one point source in every pixel. Most of these point sources are expected to be invisibly faint, contributing negligibly to the total photon flux. However, the point sources which are just identifiable from the data are pinpointed in the reconstruction. In this approach, there is no necessity for a complicated determination of the number and position of sources.

For the construction of a prior, the fact that the photon flux is a strictly positive quantity also needs to be considered. Thus, a simple exponential prior,

$$P(\rho_x^{(u)}) \propto \exp\left( -\rho_x^{(u)} / \rho_0 \right), \tag{21}$$

has been suggested (e.g., Guglielmetti et al. 2009). It has the advantage of being (easily) analytically treatable, but its physical implications are questionable. This distribution strongly suppresses high photon fluxes in favor of lower ones. The maximum entropy prior, which is also often applied, is even worse because it corresponds to a brightness distribution⁸,

$$P(\rho_x^{(u)}) \propto \left( \rho_x^{(u)} / \rho_0 \right)^{\left( -\rho_x^{(u)} / \rho_0 \right)}. \tag{22}$$

⁸ The so-called maximum entropy regularization $\sum_x (\rho_x^{(u)}/\rho_0) \log(\rho_x^{(u)}/\rho_0)$ of the log-likelihood can be regarded as log-prior, cf. Eqs. (20) and (22).

The following (rather crude) consideration might motivate a more astrophysical prior. Say the universe hosts a homogeneous distribution of point sources. The number of point sources would therefore scale with the observable volume; i.e., with distance cubed. Their apparent brightness is reduced because of the spreading of the light rays; i.e., it is inversely proportional to the distance squared. Consequently, a power-law behavior between the number of point sources and their brightness with a slope $\beta = \frac{3}{2}$ is to be expected (Fomalont 1968; Malyshev & Hogg 2011). However, such a plain power law diverges at 0, and is not necessarily normalizable. Furthermore, Galactic and extragalactic sources cannot be found at arbitrary distances owing to the finite size of the Galaxy and the cosmic (past) light cone. Imposing an exponential cut-off above 0 onto the power law yields an inverse-Gamma distribution, which has been shown to be an appropriate prior for point-like photon fluxes (Guglielmetti et al. 2009; Carvalho et al. 2009, 2012).

The prior for the point-like signal field is therefore derived from a product of independent inverse-Gamma distributions⁹,

\begin{align}
P(\rho^{(u)}|\beta, \eta) &= \prod_x \mathcal{I}(\rho_x^{(u)}, \beta_x, \rho_0 \eta_x) \tag{23} \\
&= \prod_x \frac{(\rho_0 \eta_x)^{\beta_x - 1}}{\Gamma(\beta_x - 1)} \left( \rho_x^{(u)} \right)^{-\beta_x} \exp\left( -\frac{\rho_0 \eta_x}{\rho_x^{(u)}} \right), \tag{24}
\end{align}

yielding

\begin{align}
P(u|\beta, \eta) &= \prod_x \mathcal{I}(\rho_0 e^{u_x}, \beta_x, \rho_0 \eta_x) \left| \frac{\mathrm{d} \rho_0 e^{u_x}}{\mathrm{d} u_x} \right| \tag{25} \\
&\propto \exp\left( -(\beta - 1)^\dagger u - \eta^\dagger e^{-u} \right), \tag{26}
\end{align}

where $\beta = \{\beta_x\}_x$ and $\eta = \{\eta_x\}_x$ are the shape and scale parameters. The latter is responsible for the cut-off of vanishing fluxes, and should be chosen adequately small in analogy to the spectral scale parameters $q$. The determination of the shape parameters is more difficult. The geometrical argument above suggests a universal shape parameter, $\beta_x = \frac{3}{2} \; \forall x$. A second argument for this value results from demanding a priori independence of the discretization. If we choose a coarser resolution that would add up the flux from two point sources at merged pixels, then our prior should still be applicable. The universal value of $\frac{3}{2}$ indeed fulfills this requirement, as shown in Appendix A. There it is also shown that $\eta$ has to be chosen resolution dependent, though.

⁹ A possible extension of this prior model that includes spatial correlations would be an inverse-Wishart distribution for $\mathrm{diag}[\rho^{(u)}]$.
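The change of variables from the inverse-Gamma prior on the flux, Eqs. (23) and (24), to the prior on the log-flux, Eqs. (25) and (26), can be checked numerically for a single pixel. The following Python sketch evaluates the inverse-Gamma density at $\rho_0 e^{u}$, multiplies by the Jacobian, and compares the result with the reduced exponent of Eq. (26); the two should differ only by a constant that does not depend on $u$. The function names and the parameter values (other than $\beta = 3/2$) are assumptions chosen for illustration.

```python
import math


def log_inverse_gamma(z, shape, scale):
    """Logarithm of I(z, shape, scale) in the convention of Eq. (24)."""
    return ((shape - 1.0) * math.log(scale) - math.lgamma(shape - 1.0)
            - shape * math.log(z) - scale / z)


def log_prior_u_full(u, beta, eta, rho0):
    """Log of Eq. (25) for one pixel: density at rho0*exp(u) times the Jacobian."""
    rho = rho0 * math.exp(u)
    log_jacobian = math.log(rho0) + u  # log |d(rho0 e^u)/du|
    return log_inverse_gamma(rho, beta, rho0 * eta) + log_jacobian


def log_prior_u_reduced(u, beta, eta):
    """Exponent of Eq. (26) for one pixel, normalization dropped."""
    return -(beta - 1.0) * u - eta * math.exp(-u)


beta, eta, rho0 = 1.5, 1e-4, 2.0  # eta and rho0 are illustrative values
test_points = (-8.0, -4.0, 0.0, 3.0, 6.0)
offsets = [log_prior_u_full(u, beta, eta, rho0) - log_prior_u_reduced(u, beta, eta)
           for u in test_points]

# The u-independent offset is the log-normalization eta^(beta-1) / Gamma(beta-1).
expected_offset = (beta - 1.0) * math.log(eta) - math.lgamma(beta - 1.0)

# Setting the derivative of the reduced exponent to zero gives the prior mode.
u_mode = math.log(eta / (beta - 1.0))
```

The offset is independent of $\rho_0$, which is consistent with $u$ being dimensionless. For small $\eta$ the mode `u_mode` lies at very negative log-flux, in line with the expectation that most pixels host an invisibly faint point source.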

A74, page 6 of 20 M. Selig and T. A. Enßlin: D3PO – denoising, deconvolving, and decomposing photon observations

[Fig. 2: graphical model with the nodes α, q, σ, β, η, τ, s, u, ρ, λ, and d.]

Fig. 2. Graphical model of the model parameters α, q, σ, β, and η, the logarithmic spectral parameters τ, the diffuse signal field s, the point-like signal field u, the total photon flux ρ, the expected number of photons λ, and the observed photon count data d.

2.4. Parameter model

Figure 2 gives an overview of the parameter hierarchy of the suggested Bayesian model. The data $d$ is given, and the diffuse signal field $s$ and the point-like signal field $u$ shall be reconstructed from that data. The logarithmic power spectrum $\tau$ is a set of nuisance parameters that also need to be reconstructed from the data in order to accurately model the diffuse flux contributions. The model parameters form the top layer of this hierarchy and are given to the reconstruction algorithm. This set of model parameters can be boiled down to five scalars, namely $\alpha$, $q$, $\sigma$, $\beta$, and $\eta$, if one defines $\boldsymbol{\alpha} = \alpha \mathbb{1}$, etc. The incorporation of the scalars in the inference is possible in theory, but this would increase the computational complexity dramatically.

We discussed reasonable values for these scalars to be chosen a priori. If additional information sources, such as theoretical power spectra or object catalogs, are available, the model parameters can be adjusted accordingly. In Sect. 4, different parameter choices for the analysis of simulated data are investigated.

3. Denoising, deconvolution, and decomposition

The likelihood model, describing the measurement process, and the prior assumptions for the signal fields and the power spectrum of the diffuse component yield a well-defined inference problem. The corresponding posterior is given by

\[ P(s,\tau,u|d) = \frac{P(d|s,u) \, P(s|\tau) \, P(\tau|\alpha,q,\sigma) \, P(u|\beta,\eta)}{P(d)}, \tag{27} \]

which is a complex form of Bayes' theorem (1).

Ideally, we would now calculate the a posteriori expectation values and uncertainties according to Eqs. (2) and (3) for the diffuse and point-like signal fields, $s$ and $u$, as well as for the logarithmic spectral parameters $\tau$. However, an analytical evaluation of these expectation values is not possible because of the complexity of the posterior.

The posterior is non-linear in the signal fields and, except for artificially constructed data, non-convex. It, however, is more flexible and therefore allows for a more comprehensive description of the parameters to be inferred (Kirkpatrick et al. 1983; Geman & Geman 1984).

Numerical approaches involving Markov chain Monte Carlo methods (Metropolis & Ulam 1949; Metropolis et al. 1953) are possible, but hardly feasible because of the huge parameter phase space. Nevertheless, similar problems have been addressed by elaborate sampling techniques (Wandelt et al. 2004; Jasche et al. 2010; Jasche & Kitaura 2010; Jasche & Wandelt 2013).

Here, two approximative algorithms with lower computational costs are derived. The first one uses the maximum a posteriori (MAP) approximation, the second one minimizes the Gibbs free energy of an approximate posterior ansatz in the spirit of variational Bayesian methods. The fidelity and accuracy of these two algorithms are compared in a numerical application in Sect. 4.

3.1. Posterior maximum

The posterior maximum and mean coincide if the posterior distribution is symmetric and single peaked. In practice, this often holds, at least in good approximation, so that the maximum a posteriori approach can provide suitable estimators. This can either be achieved using a $\delta$-distribution at the posterior's mode,

\[ \langle s \rangle_{(s|d)} \overset{\text{MAP-}\delta}{\approx} \int \mathcal{D}s \; s \; \delta(s - s_{\rm mode}), \tag{28} \]

or using a Gaussian approximation around this point,

\[ \langle s \rangle_{(s|d)} \overset{\text{MAP-}\mathcal{G}}{\approx} \int \mathcal{D}s \; s \; \mathcal{G}(s - s_{\rm mode}, D_{\rm mode}). \tag{29} \]

Both approximations require us to find the mode, which is done by extremizing the posterior.

Instead of the complex posterior distribution, it is convenient to consider the information Hamiltonian, defined by its negative logarithm,

\[ H(s,\tau,u|d) = -\log P(s,\tau,u|d) \tag{30} \]
\[ = H_0 + 1^\dagger R \left( e^s + e^u \right) - d^\dagger \log\left( R \left( e^s + e^u \right) \right) + \frac{1}{2} \log\left( \det[S] \right) + \frac{1}{2} s^\dagger S^{-1} s + (\alpha - 1)^\dagger \tau + q^\dagger e^{-\tau} + \frac{1}{2} \tau^\dagger T \tau + (\beta - 1)^\dagger u + \eta^\dagger e^{-u}, \tag{31} \]

where all terms constant in $s$, $\tau$, and $u$ have been absorbed into a ground state energy $H_0$, cf. Eqs. (7), (11), (19), and (26), respectively.

The MAP solution, which maximizes the posterior, minimizes the Hamiltonian. This minimum can thus be found by taking the first (functional) derivatives of the Hamiltonian with respect to $s$, $\tau$, and $u$ and equating them with zero. Unfortunately, this yields a set of implicit, self-consistent equations rather than an explicit solution. However, these equations can be solved by an iterative minimization of the Hamiltonian, for example using a steepest descent method; see Sect. 3.4 for details.

In order to better understand the structure of the MAP solution, we consider the minimum $(s, \tau, u) = (m^{(s)}, \tau^\star, m^{(u)})$. The resulting filter formulas for the diffuse and point-like signal field read

\[ \left. \frac{\partial H}{\partial s} \right|_{\rm min} = 0 = (1 - d/l)^\dagger R \ast e^{m^{(s)}} + S^{\star -1} m^{(s)}, \tag{32} \]
\[ \left. \frac{\partial H}{\partial u} \right|_{\rm min} = 0 = (1 - d/l)^\dagger R \ast e^{m^{(u)}} + \beta - 1 - \eta \ast e^{-m^{(u)}}, \tag{33} \]

with $l$ and $S^\star$ as defined in Eqs. (34) and (35).
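The $s$- and $u$-dependent part of the Hamiltonian in Eq. (31) and the gradients that enter Eqs. (32) and (33) can be sketched in a few lines. This is a sketch under simplifying assumptions that are not in the text: the response $R$ is a small dense matrix, the power spectrum (and hence $S^{-1}$) is held fixed, $\beta$ and $\eta$ are scalars, and all function and argument names are hypothetical.

```python
import numpy as np

def hamiltonian(s, u, d, R, S_inv, beta=1.5, eta=1e-4):
    """Terms of Eq. (31) that depend on s and u (tau held fixed)."""
    lam = R @ (np.exp(s) + np.exp(u))          # expected photon counts
    likelihood = lam.sum() - d @ np.log(lam)   # 1^T lam - d^T log(lam)
    prior_s = 0.5 * s @ S_inv @ s              # Gaussian prior on s
    prior_u = np.sum((beta - 1.0) * u + eta * np.exp(-u))
    return likelihood + prior_s + prior_u

def gradients(s, u, d, R, S_inv, beta=1.5, eta=1e-4):
    """Derivatives with respect to s and u; they vanish at the MAP solution."""
    lam = R @ (np.exp(s) + np.exp(u))
    back = R.T @ (1.0 - d / lam)               # R^T (1 - d / l)
    grad_s = back * np.exp(s) + S_inv @ s
    grad_u = back * np.exp(u) + (beta - 1.0) - eta * np.exp(-u)
    return grad_s, grad_u
```

Comparing these gradients against finite differences of `hamiltonian` is a cheap way to validate an implementation before handing it to a descent scheme.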

In the filter formulas of Eqs. (32) and (33),

\[ l = R \left( e^{m^{(s)}} + e^{m^{(u)}} \right), \tag{34} \]
\[ S^\star = \sum_k e^{\tau_k^\star} S_k. \tag{35} \]

Here, $\ast$ and $/$ denote componentwise multiplication and division, respectively. The first term in Eqs. (32) and (33), which comes from the likelihood, vanishes in case $l = d$. We note that $l = \lambda|_{\rm min}$ describes the most likely number of photon counts, not the expected number of photon counts $\lambda = \langle d \rangle_{(d|s,u)}$, cf. Eqs. (5) and (7). Disregarding the regularization by the priors, the solution would overfit; i.e., noise features are partly assigned to the signal fields in order to achieve an unnecessarily close agreement with the data. However, the a priori regularization suppresses this tendency to some extent.

The second derivative of the Hamiltonian describes the curvature around the minimum, and therefore approximates the (inverse) uncertainty covariance,

\[ \left. \frac{\partial^2 H}{\partial s \, \partial s^\dagger} \right|_{\rm min} \approx D^{(s)\,-1}, \qquad \left. \frac{\partial^2 H}{\partial u \, \partial u^\dagger} \right|_{\rm min} \approx D^{(u)\,-1}. \tag{36} \]

The closed form of $D^{(s)}$ and $D^{(u)}$ is given explicitly in Appendix B.

The filter formula for the power spectrum, which is derived from a first derivative of the Hamiltonian with respect to $\tau$, yields

\[ e^{\tau^\star} = \frac{q + \frac{1}{2} \left( \mathrm{tr}\left[ m^{(s)} m^{(s)\dagger} S_k^{-1} \right] \right)_k}{\gamma + T \tau^\star}, \tag{37} \]

where $\gamma = (\alpha - 1) + \frac{1}{2} \left( \mathrm{tr}\left[ S_k S_k^{-1} \right] \right)_k$. This formula is in accordance with the results by Enßlin & Frommert (2011) and Oppermann et al. (2013). It has been shown by the former authors that such a filter exhibits a perception threshold; i.e., on scales where the signal-response-to-noise ratio drops below a certain bound the reconstructed signal power becomes vanishingly low. This threshold can be cured by a better capture of the a posteriori uncertainty structure.

3.2. Posterior approximation

In order to overcome the analytical infeasibility as well as the perception threshold, we seek an approximation to the true posterior. Instead of approximating the expectation values of the posterior, approximate posteriors are investigated in this section. In case the approximation is good, the expectation values of the approximate posterior should then be close to the real ones.

The posterior given by Eq. (27) is inaccessible because of the entanglement of the diffuse signal field $s$, its logarithmic power spectrum $\tau$, and the point-like signal field $u$. The involvement of $\tau$ can be simplified by a mean field approximation,

\[ P(s,\tau,u|d) \approx Q = Q_s(s,u|\mu,d) \, Q_\tau(\tau|\mu,d), \tag{38} \]

where $\mu$ denotes an abstract mean field mediating some information between the signal field tuple $(s,u)$ and $\tau$ that are separated by the product ansatz in Eq. (38). This mean field is fully determined by the problem, as it represents effective (rather than additional) degrees of freedom. It is only needed implicitly for the derivation; an explicit formula can be found in Appendix C.3, though.

Since the a posteriori mean estimates for the signal fields and their uncertainty covariances are of primary interest, a Gaussian approximation for $Q_s$ that accounts for correlation between $s$ and $u$ would be sufficient. Hence, our previous approximation is extended by setting

\[ Q_s(s,u|\mu,d) = \mathcal{G}(\varphi, D), \tag{39} \]

with

\[ \varphi = \begin{pmatrix} s - m^{(s)} \\ u - m^{(u)} \end{pmatrix}, \qquad D = \begin{pmatrix} D^{(s)} & D^{(su)} \\ D^{(su)\dagger} & D^{(u)} \end{pmatrix}. \tag{40} \]

This Gaussian approximation is also a convenient choice in terms of computational complexity because of its simple analytic structure.

The goodness of the approximation $P \approx Q$ can be quantified by an information theoretical measure, see Appendix C.1. The Gibbs free energy of the inference problem,

\[ G = \left\langle H(s,\tau,u|d) \right\rangle_Q - \left\langle -\log Q(s,\tau,u|d) \right\rangle_Q, \tag{41} \]

which is equivalent to the Kullback-Leibler divergence $D_{\rm KL}(Q,P)$, is chosen as such a measure (Enßlin & Weig 2010).

In favor of comprehensibility, we suppose the solution for the logarithmic power spectrum $\tau^\star$ is known for the moment. The Gibbs free energy is then calculated by plugging in the Hamiltonian, and evaluating the expectation values$^{10}$,

\[ G = G_0 + \left\langle H(s,u|d) \right\rangle_{Q_s} - \frac{1}{2} \log\left( \det[D] \right) \tag{42} \]
\[ = G_1 + 1^\dagger l - d^\dagger \left( \log(l) - \sum_{\nu=2}^{\infty} \frac{(-1)^\nu}{\nu} \left\langle (\lambda/l - 1)^\nu \right\rangle_{Q_s} \right) + \frac{1}{2} m^{(s)\dagger} S^{\star -1} m^{(s)} + \frac{1}{2} \mathrm{tr}\left[ D^{(s)} S^{\star -1} \right] + (\beta - 1)^\dagger m^{(u)} + \eta^\dagger e^{-m^{(u)} + \frac{1}{2} \hat{D}^{(u)}} - \frac{1}{2} \log\left( \det[D] \right), \tag{43} \]

with

\[ \lambda = R \left( e^s + e^u \right), \tag{44} \]
\[ l = \langle \lambda \rangle_{Q_s} = R \left( e^{m^{(s)} + \frac{1}{2} \hat{D}^{(s)}} + e^{m^{(u)} + \frac{1}{2} \hat{D}^{(u)}} \right), \tag{45} \]
\[ S^\star = \sum_k e^{\tau_k^\star} S_k, \quad \text{and} \tag{46} \]
\[ \hat{D} = \mathrm{diag}[D]. \tag{47} \]

Here, $G_0$ and $G_1$ carry all terms independent of $s$ and $u$. In comparison to the Hamiltonian given in Eq. (31), there are a number of correction terms that now also consider the uncertainty covariances of the signal estimates properly. For example, the expectation values of the photon fluxes differ comparing $l$ in Eqs. (34) and (45), where it now describes the expectation value of $\lambda$ over the approximate posterior. In case $l = \lambda$ the explicit sum in Eq. (43) vanishes. This sum includes powers of $\langle \lambda^{\nu > 2} \rangle_{Q_s}$.

10 The second likelihood term in Eq. (43), $d^\dagger \log(\lambda)$, is thereby expanded according to
\[ \langle \log(x) \rangle = \log\langle x \rangle - \sum_{\nu=2}^{\infty} \frac{(-1)^\nu}{\nu} \left\langle \left( \frac{x}{\langle x \rangle} - 1 \right)^\nu \right\rangle \approx \log\langle x \rangle + \mathcal{O}\left( \langle x^2 \rangle \right), \]
under the assumption $x \approx \langle x \rangle$.

The evaluation of the explicit sum in Eq. (43) would require all entries of $D$ to be known explicitly. In order to keep the algorithm computationally feasible, this sum shall hereafter be neglected. This is equivalent to truncating the corresponding expansion at second order; i.e., $\nu = 2$. It can be shown that, in consequence of this approximation, the cross-correlation $D^{(su)}$ equals zero, and $D$ becomes block diagonal.

Without these second order terms, the Gibbs free energy reads

\[ G = G_1 + 1^\dagger l - d^\dagger \log(l) + \frac{1}{2} m^{(s)\dagger} S^{\star -1} m^{(s)} + \frac{1}{2} \mathrm{tr}\left[ D^{(s)} S^{\star -1} \right] + (\beta - 1)^\dagger m^{(u)} + \eta^\dagger e^{-m^{(u)} + \frac{1}{2} \hat{D}^{(u)}} - \frac{1}{2} \log\left( \det\left[ D^{(s)} \right] \right) - \frac{1}{2} \log\left( \det\left[ D^{(u)} \right] \right). \tag{48} \]

Minimizing the Gibbs free energy with respect to $m^{(s)}$, $m^{(u)}$, $D^{(s)}$, and $D^{(u)}$ would optimize the fitness of the posterior approximation $P \approx Q$. Filter formulas for the Gibbs solution can be derived by taking the derivative of $G$ with respect to the approximate mean estimates,

\[ \frac{\partial G}{\partial m^{(s)}} = 0 = (1 - d/l)^\dagger R \ast e^{m^{(s)} + \frac{1}{2} \hat{D}^{(s)}} + S^{\star -1} m^{(s)}, \tag{49} \]
\[ \frac{\partial G}{\partial m^{(u)}} = 0 = (1 - d/l)^\dagger R \ast e^{m^{(u)} + \frac{1}{2} \hat{D}^{(u)}} + \beta - 1 - \eta \ast e^{-m^{(u)} + \frac{1}{2} \hat{D}^{(u)}}. \tag{50} \]

These filter formulas again account for the uncertainty of the mean estimates in comparison to the MAP filter formulas in Eqs. (32) and (33). The uncertainty covariances can be constructed by either taking the second derivatives,

\[ \frac{\partial^2 G}{\partial m^{(s)} \partial m^{(s)\dagger}} \approx D^{(s)\,-1}, \qquad \frac{\partial^2 G}{\partial m^{(u)} \partial m^{(u)\dagger}} \approx D^{(u)\,-1}, \tag{51} \]

or setting the first derivatives of $G$ with respect to the uncertainty covariances equal to zero matrices,

\[ \frac{\partial G}{\partial D^{(s)}_{xy}} = 0, \qquad \frac{\partial G}{\partial D^{(u)}_{xy}} = 0. \tag{52} \]

The closed form of $D^{(s)}$ and $D^{(u)}$ is given explicitly in Appendix B.

So far, the logarithmic power spectrum $\tau^\star$, and with it $S^\star$, have been supposed to be known. The mean field approximation in Eq. (38) does not specify the approximate posterior $Q_\tau(\tau|\mu,d)$, but it can be retrieved by variational Bayesian methods (Jordan et al. 1999; Wingate & Weber 2013), according to the procedure detailed in Appendix C.2. The subsequent Appendix C.3 discusses the derivation of a solution for $\tau$ by extremizing $Q_\tau$. This result, which was also derived in Oppermann et al. (2013), applies to the inference problem discussed here, yielding

\[ e^{\tau^\star} = \frac{q + \frac{1}{2} \left( \mathrm{tr}\left[ \left( m^{(s)} m^{(s)\dagger} + D^{(s)} \right) S_k^{-1} \right] \right)_k}{\gamma + T \tau^\star}. \tag{53} \]

Again, this solution includes a correction term in comparison to the MAP solution in Eq. (37). Since $D^{(s)}$ is positive definite, it contributes positively to the (logarithmic) power spectrum, and therefore reduces the possible perception threshold further.

We note that this is a minimal Gibbs free energy solution that maximizes $Q_\tau$. A proper calculation of $\langle \tau \rangle_{Q_\tau}$ might include further correction terms, but their derivation is not possible in closed form. Moreover, the above used diffuse signal covariance $S^{\star -1}$ should be replaced by $\langle S^{-1} \rangle_{Q_\tau}$, adding further correction terms to the filter formulas.

In order to keep the computational complexity on a feasible level, all these higher order corrections are not considered here. The detailed characterization of their implications and implementation difficulties is left for future investigation.

3.3. Physical flux solution

To perform calculations on the logarithmic fluxes is convenient for numerical reasons, but it is the physical fluxes that are actually of interest to us. Given the chosen approximation, we can compute the posterior expectation values of the diffuse and point-like photon flux, $\rho^{(s)}$ and $\rho^{(u)}$, straightforwardly,

\[ \left\langle \rho^{(\cdot)} \right\rangle_P \overset{\text{MAP-}\delta}{\approx} \left\langle \rho^{(\cdot)} \right\rangle_\delta = \rho_0 \, e^{m^{(\cdot)}_{\rm mode}}, \tag{54} \]
\[ \overset{\text{MAP-}\mathcal{G}}{\approx} \left\langle \rho^{(\cdot)} \right\rangle_{\mathcal{G}} = \rho_0 \, e^{m^{(\cdot)}_{\rm mode} + \frac{1}{2} \hat{D}^{(\cdot)}_{\rm mode}}, \tag{55} \]
\[ \overset{\text{Gibbs}}{\approx} \left\langle \rho^{(\cdot)} \right\rangle_Q = \rho_0 \, e^{m^{(\cdot)}_{\rm mean} + \frac{1}{2} \hat{D}^{(\cdot)}_{\rm mean}}, \tag{56} \]

in accordance with Eqs. (28), (29), or (38), respectively. Those solutions differ from each other in terms of the involvement of the posterior's mode or mean, and in terms of the inclusion of the uncertainty information, see subscripts.

In general, the mode approximation holds for symmetric, single peaked distributions, but can perform poorly in other cases (e.g., Enßlin & Frommert 2011). The exact form of the posterior considered here is highly complex because of the many degrees of freedom. In a dimensionally reduced frame, however, the posterior appears single peaked and exhibits a negative skewness$^{11}$. Although this is not necessarily generalizable, it suggests a superiority of the posterior mean compared to the MAP because of the asymmetry of the distribution. Nevertheless, the MAP approach is computationally cheaper compared to the Gibbs approach that requires permanent knowledge of the uncertainty covariance.

The uncertainty of the reconstructed photon flux can be approximated as for an ordinary log-normal distribution,

\[ \left\langle \rho^{(\cdot)\,2} \right\rangle_P - \left\langle \rho^{(\cdot)} \right\rangle_P^2 \overset{\text{MAP}}{\approx} \left\langle \rho^{(\cdot)} \right\rangle_{\mathcal{G}}^2 \left( e^{\hat{D}^{(\cdot)}_{\rm mode}} - 1 \right), \tag{57} \]
\[ \overset{\text{Gibbs}}{\approx} \left\langle \rho^{(\cdot)} \right\rangle_Q^2 \left( e^{\hat{D}^{(\cdot)}_{\rm mean}} - 1 \right), \tag{58} \]

where the square root of the latter term would describe the relative uncertainty.

3.4. Imaging algorithm

The problem of denoising, deconvolving, and decomposing photon observations is a non-trivial task. Therefore, this section discusses the implementation of the D3PO algorithm given the two sets of filter formulas derived in Sects. 3.1 and 3.2, respectively.

The information Hamiltonian, or equivalently the Gibbs free energy, is a scalar quantity defined over a huge phase space.

11 For example, the posterior $P(s|d)$ for a one-dimensional diffuse signal is proportional to $\exp\left( -\frac{1}{2} s^2 + ds - \exp(s) \right)$, whereby all other parameters are fixed to unity. Analogously, $P(u|d) \propto \exp\left( du - 2\cosh(u) \right)$.
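The one-dimensional posterior quoted in footnote 11 can be inspected numerically. The sketch below tabulates $\exp(-\frac{1}{2} s^2 + ds - e^s)$ on a regular grid and returns its mode, mean, and skewness; the grid limits, the resolution, and the function names are assumptions made for illustration only.

```python
import numpy as np

def posterior_1d(d=1.0, lo=-12.0, hi=6.0, num=18001):
    """Normalized one-dimensional posterior of footnote 11 on a regular grid."""
    s = np.linspace(lo, hi, num)
    log_p = -0.5 * s**2 + d * s - np.exp(s)
    p = np.exp(log_p - log_p.max())   # subtract the maximum for numerical stability
    p /= p.sum()                      # probability mass per grid cell
    return s, p

def summarize(s, p):
    """Mode, mean, and skewness of a gridded distribution."""
    mode = s[np.argmax(p)]
    mean = np.sum(p * s)
    var = np.sum(p * (s - mean) ** 2)
    skew = np.sum(p * (s - mean) ** 3) / var ** 1.5
    return mode, mean, skew
```

For a single observed count, `summarize(*posterior_1d(d=1.0))` places the mode at $s = 0$ (the solution of $s + e^s = 1$) and the mean below it, in line with the negative skewness described in the text.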

The phase space of the information Hamiltonian, or Gibbs free energy, comprises all possible field and parameter configurations including, among others, the elements of $m^{(s)}$ and $m^{(u)}$. If we only consider those, and no resolution refinement from data to signal space, two numbers need to be inferred from one data value each. Including $\tau$ and the uncertainty covariances $D^{(s)}$ and $D^{(u)}$ in the inference, the problem of underdetermined degrees of freedom gets worse. This is reflected in the possibility of a decent number of local minima in the non-convex manifold landscape of the codomain of the Hamiltonian, or Gibbs free energy, respectively (Kirkpatrick et al. 1983; Geman & Geman 1984; Giovannelli & Coulais 2005). The complexity of the inference problem goes back to the, in general, non-linear entanglement between the individual parameters.

The D3PO algorithm is based on an iterative optimization scheme, where certain subsets of the problem are optimized alternately instead of the full problem at once. Each subset optimization is designed individually, see below. The global optimization cycle is to some degree sensitive to the starting values because of the non-convexity of the considered potential; i.e., the information Hamiltonian or Gibbs free energy, respectively. We can find appropriate starting values by solving the inference problem in a reduced frame in advance, see below. So far, a step-by-step guide of the algorithm looks like the following.

1. Initialize the algorithm with primitive starting values; e.g., $m_x^{(s)} = m_x^{(u)} = 0$, $D_{xy}^{(s)} = D_{xy}^{(u)} = \delta_{xy}$, and $\tau_k^\star = \log(k^{-2})$. – Those values are arbitrary. Although the optimization is rather insensitive to them, inappropriate values can cripple the algorithm for numerical reasons because of the high non-linearity of the inference problem.
2. Optimize $m^{(s)}$, the diffuse signal field, coarsely. – The preliminary optimization shall yield a rough estimate of the diffuse only contribution. This can be achieved by reconstructing a coarse screened diffuse signal field that only varies on large scales; i.e., limiting the bandwidth of the diffuse signal in its harmonic basis. Alternatively, obvious point sources in the data could be masked out by introducing an artificial mask into the response, if feasible.
3. Optimize $m^{(u)}$, the point-like signal field, locally. – This initial optimization shall approximate the brightest, most obvious, point sources that are visible in the data image by eye. Their current disagreement with the data dominates the considered potential, and introduces some numerical stiffness. The gradient of the potential can be computed according to Eqs. (33) or (50), and its minimum will be at the expected position of the brightest point source which has not been reconstructed yet. It is therefore very efficient to increase $m^{(u)}$ at this location directly until the sign of the gradient flips, and repeat this procedure until the obvious point sources are fit.
4. Optimize $m^{(u)}$, the point-like signal field. – This task can be done by a steepest descent minimization of the potential combined with a line search following the Wolfe conditions (Nocedal & Wright 2006). The potentials can be computed according to Eqs. (31) or (43) neglecting terms independent of $m^{(u)}$, and the gradient according to Eqs. (33) or (50). A more sophisticated minimization scheme, such as a non-linear conjugate gradient (Shewchuk 1994), is conceivable but would require the application of the full Hessian, cf. step 5. In the first run, it might be sufficient to restrict the optimization to the locations identified in step 3.
5. Update $\hat{D}^{(u)}$, the point-like uncertainty variance, in case of a Gibbs approach. – It is not feasible to compute the full uncertainty covariance $D^{(u)}$ explicitly in order to extract its diagonal. A more elegant way is to apply a probing technique relying on the application of $D^{(u)}$ to random fields $\xi$ that project out the diagonal (Hutchinson 1989; Selig et al. 2012). The uncertainty covariance is given as the inverse Hessian by Eqs. (36) or (51), and should be symmetric and positive definite. For that reason, it can be applied to a field using a conjugate gradient (Shewchuk 1994); i.e., solving $(D^{(u)})^{-1} y = \xi$ for $y$. However, if the current phase space position is far away from the minimum, the Hessian is not necessarily positive definite. One way to overcome this temporary instability would be to introduce a Levenberg damping in the Hessian (inspired by Transtrum et al. 2010; Transtrum & Sethna 2012).
6. Optimize $m^{(s)}$, the diffuse signal field. – An analogous scheme as in step 4 using steepest descent and Wolfe conditions is effective. The potentials can be computed according to Eqs. (31) or (43) neglecting terms independent of $m^{(s)}$, and the gradient according to Eqs. (32) or (49), respectively. It has proven useful to first ensure a convergence on large scales; i.e., small harmonic modes $k$. This can be done by repeating steps 6 to 8 for all $k < k_{\rm max}$ with growing $k_{\rm max}$ using the corresponding projections $S_k$.
7. Update $\hat{D}^{(s)}$, the diffuse uncertainty variance, in case of a Gibbs approach in analogy to step 5.
8. Optimize $\tau^\star$, the logarithmic power spectrum. – This is done by solving Eqs. (37) or (53). The trace term can be computed analogously to the diagonal; e.g., by probing. Given this, the equation can be solved efficiently by a Newton-Raphson method.
9. Repeat the steps 4 to 8 until convergence. – This scheme will take several cycles until the algorithm reaches the desired convergence level. Therefore, it is not required to achieve a convergence to the final accuracy level in all subsets in all cycles. It is advisable to start with weak convergence criteria in the first loop and increase them gradually.

A few remarks are in order.

The phase space of possible signal field configurations is tremendously huge. It is therefore impossible to judge if the algorithm has converged to the global or some local minimum, but this does not matter if both yield reasonable results that do not differ substantially.

In general, the converged solution is also subject to the choice of starting values. Solving a non-convex, non-linear inference problem without proper initialization can easily lead to nonsensical results, such as fitting (all) diffuse features by point sources. Therefore, the D3PO algorithm essentially creates its own starting values executing the initial steps 1 to 3. The primitive starting values are thereby processed to rough estimates that cover coarsely resolved diffuse and prominent point-like features. These estimates then serve as actual starting values for the optimization cycle.

Because of the iterative optimization scheme starting with the diffuse component in step 2, the algorithm might be prone to explaining some point-like features by diffuse sources. Starting with the point-like component instead would give rise to the opposite bias. To avoid such biases, it is advisable to restart the algorithm partially. To be more precise, we propose to discard the current reconstruction of $m^{(u)}$ after finishing step 8 for the first time, then start the second iteration again with step 3, and to discard the current $m^{(s)}$ before step 6.

The above scheme exploits a few numerical techniques, such as probing or Levenberg damping, that are described in great detail in the given references. The code of our implementation of the D3PO algorithm will be made public in the future under http://www.mpa-garching.mpg.de/ift/d3po/.

[Fig. 3: six image panels, labeled (a) to (f).]

Fig. 3. Illustration of the data and noiseless, but reconvolved, signal responses of the reconstructions. Panel a) shows the data from a mock observation of a 32 × 32 arcmin$^2$ patch of the sky with a resolution of 0.1 arcmin corresponding to a total of 102 400 pixels. The data had been convolved with a Gaussian-like PSF (FWHM ≈ 0.2 arcmin = 2 pixels, finite support of 1.1 arcmin = 11 pixels) and masked because of an uneven exposure. Panel b) shows the centered convolution kernel. Panel c) shows the exposure mask. The bottom panels show the reconvolved signal response $R \langle \rho \rangle$ of a reconstruction using a different approach each, namely d) MAP-δ, e) MAP-G, and f) Gibbs. All reconstructions shown here and in the following figures used the same model parameters: $\alpha = 1$, $q = 10^{-12}$, $\sigma = 10$, $\beta = \frac{3}{2}$, and $\eta = 10^{-4}$.
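The three estimators named in the caption differ only in which log-flux estimate and which variance enter Eqs. (54) to (58). A minimal sketch, assuming the mean field `m` and the diagonal variance `d_hat` are available as NumPy arrays (function names hypothetical):

```python
import numpy as np

def flux_estimate(m, d_hat=None, rho0=1.0):
    """Flux estimate: Eq. (54) without variance, Eqs. (55)/(56) with variance."""
    m = np.asarray(m, dtype=float)
    if d_hat is None:
        return rho0 * np.exp(m)                       # MAP-delta
    d_hat = np.asarray(d_hat, dtype=float)
    return rho0 * np.exp(m + 0.5 * d_hat)             # MAP-G or Gibbs

def flux_std(m, d_hat, rho0=1.0):
    """Square root of Eqs. (57)/(58): log-normal standard deviation of the flux."""
    d_hat = np.asarray(d_hat, dtype=float)
    return flux_estimate(m, d_hat, rho0) * np.sqrt(np.expm1(d_hat))
```

Passing the mode and its variance gives the MAP-G estimate, passing the Gibbs mean and variance gives the Gibbs estimate; the ratio `flux_std / flux_estimate` is the relative uncertainty $\sqrt{e^{\hat{D}} - 1}$.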

4. Numerical application

Going beyond the simple 1D scenario illustrated in Fig. 1, the D3PO algorithm is now applied to a realistic, but simulated, data set. The data set represents a high energy observation with a field of view of 32 × 32 arcmin$^2$ and a resolution of 0.1 arcmin; i.e., the photon count image comprises 102 400 pixels. The instrument response includes the convolution with a Gaussian-like PSF with a FWHM of roughly 0.2 arcmin, and an uneven survey mask attributable to the inhomogeneous exposure of the virtual instrument. The data image and those characteristics are shown in Fig. 3.

In addition, the bottom panels of Fig. 3 show the reproduced signal responses of the reconstructed (total) photon flux. The reconstructions used the same model parameters, $\alpha = 1$, $q = 10^{-12}$, $\sigma = 10$, $\beta = \frac{3}{2}$, and $\eta = 10^{-4}$, in a MAP-δ, MAP-G and a Gibbs approach, respectively. They all show a very good agreement with the actual data, and differences are barely visible by eye. We note that only the quality of denoising is visible, since the signal response shows the convolved and superimposed signal fields.

The diffuse contribution to the deconvolved photon flux is shown in Fig. 4 for all three estimators, cf. Eqs. (54) to (56). There, all point-like contributions as well as noise and instrumental effects have been removed, presenting a denoised, deconvolved and decomposed reconstruction result for the diffuse photon flux. Figure 4 also shows the absolute difference to the original flux. Although the differences in the MAP estimators are insignificant, the Gibbs solution seems to be slightly better.

In order to have a quantitative statement about the goodness of the reconstruction, we define a relative residual error $\epsilon^{(s)}$ for the diffuse contribution as follows,

\[ \epsilon^{(s)} = \left| \rho^{(s)} - \left\langle \rho^{(s)} \right\rangle \right|_2 \left| \rho^{(s)} \right|_2^{-1}, \tag{59} \]

where $|\cdot|_2$ is the Euclidean $L^2$-norm. For the point-like contribution, however, we have to consider an error in brightness and position. For this purpose we define

\[ \epsilon^{(u)} = \int_1^N \mathrm{d}n \; \left| R_{\rm PSF}^n \rho^{(u)} - R_{\rm PSF}^n \left\langle \rho^{(u)} \right\rangle \right|_2 \left| R_{\rm PSF}^n \rho^{(u)} \right|_2^{-1}, \tag{60} \]

where $R_{\rm PSF}$ is a (normalized) convolution operator, such that $R_{\rm PSF}^N$ becomes the identity for large $N$. These errors are listed in Table 1. When comparing the MAP-δ and MAP-G approach, the incorporation of uncertainty corrections seems to improve the results slightly. The full regularization treatment within the Gibbs approach outperforms MAP solutions in terms of the chosen error measure $\epsilon^{(\cdot)}$. For a discussion of how such measures

[Fig. 4: six image panels, labeled (a) to (f).]

Fig. 4. Illustration of the diffuse reconstruction. The top panels show the denoised and deconvolved diffuse contribution $\langle \rho^{(s)} \rangle / \rho_0$ reconstructed using a different approach each, namely a) MAP-δ, b) MAP-G, and c) Gibbs. The bottom panels d) to f) show the difference between the originally simulated signal and the respective reconstruction.
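A difference map such as the one in the bottom panels can be condensed into a single number with the relative residual error of Eq. (59). A minimal sketch with hypothetical argument names, where `rho_true` is the simulated flux and `rho_rec` its reconstruction:

```python
import numpy as np

def relative_residual_error(rho_true, rho_rec):
    """Eq. (59): L2 norm of the residual divided by the L2 norm of the truth."""
    rho_true = np.asarray(rho_true, dtype=float).ravel()
    rho_rec = np.asarray(rho_rec, dtype=float).ravel()
    return np.linalg.norm(rho_true - rho_rec) / np.linalg.norm(rho_true)
```

A reconstruction that overestimates every pixel by two percent yields an error of 0.02. The point-source measure of Eq. (60) additionally requires a convolution operator and is not covered by this sketch.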

can change the view on certain Bayesian estimators, we refer to the work by Burger & Lucka (2014).

Table 1. Overview of the relative residual errors in the photon flux reconstructions for the respective approaches, all using the same model parameters, cf. text.

                   MAP-δ      MAP-G      Gibbs
  ε^(s)            4.442%     4.441%     2.078%
  ε^(u)            1.540%     1.540%     1.089%

Figure 5 illustrates the reconstruction of the diffuse signal field, now in terms of logarithmic flux. The original and the reconstructions agree well, and the strongest deviations are found in the areas with low amplitudes. With regard to the exponential ansatz in Eq. (4), it is not surprising that the inference on the signal fields is more sensitive to higher values than to lower ones. For example, a small change in the diffuse signal field, $s \to (1 \pm \epsilon) s$, translates into a factor in the photon flux, $\rho^{(s)} \to \rho^{(s)} e^{\pm \epsilon s}$, that scales exponentially with the amplitude of the diffuse signal field. The Gibbs solution shows less deviation from the original signal than the MAP solution. Since the latter lacks the regularization by the uncertainty covariance, it exhibits a stronger tendency to overfitting compared to the former. This includes overestimates in noisy regions with low flux intensities, as well as underestimates at locations where point-like contributions dominate the total flux.

The reconstruction of the power spectrum, as shown in Fig. 6, gives further indications of the reconstruction quality of the diffuse component. The simulation used a default power spectrum of

\[ \exp(\tau_k) = 42 \, (k + 1)^{-7}. \tag{61} \]

This power spectrum was on purpose chosen to deviate from a strict power law supposed by the smoothness prior.

From Fig. 6 it is apparent that the reconstructed power spectra track the original well up to a harmonic mode $k$ of roughly 0.4 arcmin$^{-1}$. Beyond that point, the reconstructed power spectra fall steeply until they hit a lower boundary set by the model parameter $q = 10^{-12}$. This drop-off point at 0.4 arcmin$^{-1}$ corresponds to a physical wavelength of roughly 2.5 arcmin, and thus to (half-phase) fluctuations on spatial distances below 1.25 arcmin. The Gaussian-like PSF of the virtual observatory has a finite support of 1.1 arcmin. The lack of reconstructed power indicates that the algorithm assigns features on spatial scales smaller than the PSF support preferably to the point-like component. This behavior is reasonable because solely the point-like signal can cause PSF-like shaped imprints in the data image. However, there is no strict threshold in the distinction between the components on the mere basis of their spatial extent. We rather observe a continuous transition from assigning flux to the diffuse component to assigning it to the point-like component while reaching smaller spatial scales, because strict boundaries are blurred out under the consideration of noise effects.
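The numbers quoted for the power spectrum can be retraced with a few lines. The sketch below evaluates Eq. (61), its local logarithmic slope, and the wavelength that belongs to a harmonic mode; treating $k$ in Eq. (61) as a plain number in units of arcmin$^{-1}$ is an assumption, and the function names are hypothetical.

```python
import numpy as np

def default_power(k):
    """Default power spectrum of Eq. (61): exp(tau_k) = 42 (k + 1)^(-7)."""
    return 42.0 * (np.asarray(k, dtype=float) + 1.0) ** -7

def log_slope(k):
    """d log(power) / d log(k) = -7 k / (k + 1); a strict power law has a constant slope."""
    k = np.asarray(k, dtype=float)
    return -7.0 * k / (k + 1.0)

def wavelength(k):
    """Physical wavelength of a harmonic mode k (in arcmin if k is in 1/arcmin)."""
    return 1.0 / np.asarray(k, dtype=float)
```

The slope runs from 0 at small $k$ toward $-7$ at large $k$, which is the deliberate deviation from a strict power law. At the drop-off mode of 0.4 arcmin$^{-1}$ the wavelength is 2.5 arcmin, and half of it, 1.25 arcmin, is close to the 1.1 arcmin support of the PSF.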

[Fig. 5: eight image panels, labeled (a), (b), (c), (d) = (a) − (b), (e) = (a) − (c), (f) = |(e)| / (c), (g), and (h).]

Fig. 5. Illustration of the reconstruction of the diffuse signal field $s = \log \rho^{(s)}$ and its uncertainty. The top panels show diffuse signal fields. Panel a) shows the original simulation $s$, panel b) the reconstruction $m^{(s)}_{\rm mode}$ using a MAP approach, and panel c) the reconstruction $m^{(s)}_{\rm mean}$ using a Gibbs approach. The panels d) and e) show the differences between original and reconstruction. Panel f) shows the relative difference. The panels g) and h) show the relative uncertainty of the above reconstructions.

The differences between the reconstructions using a MAP and a Gibbs approach are subtle. The reconstruction formulas given by Eqs. (37) and (53) differ by an additive trace term involving $D^{(s)}$, which is positive definite. Therefore, a reconstructed power spectrum regularized by uncertainty corrections is never below the one without, given the same $m^{(s)}$. However, the reconstruction of the signal field follows different filter formulas in the two approaches. Since the Gibbs approach considers the uncertainty covariance $D^{(s)}$ properly in each cycle, it can present a more conservative solution. The drop-off point is apparently at higher $|k|$ for the MAP approach, leading to higher power on scales between roughly 0.3 and 0.7 arcmin$^{-1}$. In turn, the MAP solution tends to overfit by absorbing some noise power into $m^{(s)}$, as discussed in Sect. 3. Thus, the higher MAP power spectrum in Fig. 6 seems to be caused by a higher level of noise remnants in the signal estimate.

[Figure 6, panels (a) and (b): x-axis $|k|$ [arcmin$^{-1}$], y-axis $|k|^2 \exp(\tau_k)$; legend: default, realization, MAP, MAP + corrections, Gibbs]

Fig. 6. Illustration of the reconstruction of the logarithmic power spectrum τ. Both panels show the default power spectrum (black dashed line) and the simulated realization (black dotted line), as well as the reconstructed power spectra using a MAP approach (orange solid line), a MAP approach plus second order corrections (orange dashed line), and a Gibbs approach (blue solid line). Panel a) shows the reconstruction for a chosen σ parameter of 10, panel b) for a σ of 1000.

The influence of the choice of the model parameter σ is also shown in Fig. 6. Neither a smoothness prior with σ = 10 nor a weak one with σ = 1000 influences the reconstruction of the power spectrum substantially in this case^12. The latter choice, however, exhibits some more fluctuations in order to better track the concrete realization.

^12 For a discussion of further log-normal reconstruction scenarios, please refer to the work by Oppermann et al. (2013).

The results for the reconstruction of the point-like component are illustrated in Fig. 7. Overall, the reconstructed point-like signal field and the corresponding photon flux are in good agreement with the original ones. The point sources have been located with an accuracy of ±0.1 arcmin, which is less than the FWHM of the PSF. The localization tends to be more precise for higher flux values because of the higher signal-to-noise ratio. The reconstructed intensities match the simulated ones well, although the MAP solution shows a spread that exceeds the expected shot noise uncertainty interval. This is again an indication of the overfitting known for MAP solutions. Moreover, neither reconstruction shows a bias towards higher or lower fluxes.

[Figure 7, top panels (a), (b), (c): gray scale from 1 to 5000 [counts/pixel]. Bottom panels (d), (e): x-axis original flux [counts/pixel], $10^0$ to $10^3$; y-axes reconstructed flux [counts/pixel] and reconstructed flux [relative]; legend: off-set = 0, off-set ≈ 6 arcsec, 2σ shot noise interval]

Fig. 7. Illustration of the reconstruction of the point-like signal field $u = \log \rho^{(u)}$ and its uncertainty. The top panels show the location (markers) and intensity (gray scale) of the point-like photon fluxes; underlaid is the respective diffuse contribution (contours) to guide the eye, cf. Fig. 4. Panel a) shows the original simulation, panel b) the reconstruction using a MAP approach, and panel c) the reconstruction using a Gibbs approach. The bottom panels d) and e) show the match between original and reconstruction in absolute and relative fluxes, the 2σ shot noise interval (gray contour), as well as some reconstruction uncertainty estimate (error bars).

The uncertainty estimates for the point-like photon flux $\rho^{(u)}$ obtained from $D^{(u)}$ according to Eqs. (57) and (58) are, in general, consistent with the deviations from the original and the shot noise uncertainty, cf. Fig. 7. They show a reasonable scaling, being higher for lower fluxes and vice versa. However, some uncertainties seem to be underestimated. There are different reasons for this.

On the one hand, the Hessian approximation for $D^{(u)}$ in Eqs. (36) or (51) is, in individual cases, poor insofar as the curvature of the considered potential does not describe the uncertainty of the point-like component adequately. The data admittedly constrain the flux intensity of a point source sufficiently, especially if it is a bright one. However, the rather narrow dip in the manifold landscape of the considered potential can be asymmetric, and thus is not always well described by the quadratic approximation of Eqs. (36) or (51), respectively.

On the other hand, the approximation leading to a vanishing cross-correlation $D^{(su)}$ takes away the possibility of communicating uncertainties between diffuse and point-like components. However, omitting the used simplification or incorporating higher order corrections would render the algorithm too computationally expensive. The fact that the Gibbs solution, which takes $D^{(u)}$ into account, shows improvements backs up this argument.

The reconstructions shown in Figs. 5 and 7 used the model parameters σ = 10, β = 3/2, and η = $10^{-4}$. In order to reflect the influence of the choice of σ, β, and η, Table 2 summarizes the results from several reconstructions carried out with varying model parameters. Accordingly, the best parameters seem to be σ = 10, β = 5/4, and η = $10^{-4}$, although we caution that the total error is difficult to determine as the residual errors, $\epsilon^{(s)}$ and $\epsilon^{(u)}$, are defined differently. Although the errors vary significantly, 2–15% for $\epsilon^{(s)}$, we would like to stress that the model parameters were changed drastically, partly even by orders of magnitude. The impact of the prior clearly exists, but is moderate. We note that the case of σ → ∞ corresponds to neglecting the smoothness prior completely. The β = 1 case, which corresponds to a logarithmically flat prior on $u$, showed a tendency to fit more noise features by point-like contributions.

In summary, the D3PO algorithm is capable of denoising, deconvolving, and decomposing photon observations by reconstructing the diffuse and point-like signal fields, and the logarithmic power spectrum of the former. The reconstructions using the MAP and Gibbs approaches perform flawlessly, except for a slight underestimation of the uncertainty of the point-like component. The MAP approach shows signs of overfitting, but those are not overwhelming. Considering the simplicity of the MAP approach, which goes along with a numerically faster performance, this shortcoming seems acceptable.

Because of the iterative scheme of the algorithm, a combination of the MAP approach for the signal fields and a Gibbs approach for the power spectrum is possible.

5. Conclusions and summary

The D3PO algorithm for denoising, deconvolving, and decomposing photon observations has been derived. It allows for the simultaneous but individual reconstruction of the diffuse and point-like photon fluxes, as well as the harmonic power spectrum of the diffuse component, from a single data image that is exposed to Poissonian shot noise and effects of the instrument response functions. Moreover, the D3PO algorithm can provide a posteriori uncertainty information on the reconstructed signal fields. With these capabilities, D3PO surpasses previous approaches that address only subsets of these complications.

The theoretical foundation is a hierarchical Bayesian parameter model embedded in the framework of IFT. The model comprises a priori assumptions for the signal fields that account for the different statistics and correlations of the morphologically different components. The diffuse photon flux is assumed to obey multivariate log-normal statistics, where the covariance is described by a power spectrum. The power spectrum is a priori unknown and reconstructed from the data along with the signal. Therefore, hyperpriors on the (logarithmic) power spectra have been introduced, including a spectral smoothness prior (Enßlin & Frommert 2011; Oppermann et al. 2013). The point-like photon flux, in contrast, is assumed to factorize spatially in independent inverse-Gamma distributions, implying a (regularized) power-law behavior of the amplitudes of the flux.

An adequate description of the noise properties in terms of a likelihood, here a Poisson distribution, and the incorporation of all instrumental effects into the response operator render the denoising and deconvolution task possible. The strength of the proposed approach is the performance of the additional decomposition task, which especially exploits the a priori description of the diffuse and point-like components. The model comes down to five scalar parameters, for which all a priori defaults can be motivated, and of which none is driving the inference predominantly.

We discussed maximum a posteriori (MAP) and Gibbs free energy approaches to solve the inference problem. The derived solutions provide optimal estimators that, in the considered examples, yielded equivalently excellent results. The Gibbs solution slightly outperforms MAP solutions (in terms of the considered L2-residuals) thanks to the full regularization treatment, albeit at the price of a computationally more expensive optimization. Which approach is to be preferred in general might depend on the concrete problem at hand and the trade-off between reconstruction precision and computational effort.

The performance of the D3PO algorithm has been demonstrated in realistic simulations carried out in 1D and 2D. The implementation relies on the NIFTY package (Selig et al. 2013), which allows for the application regardless of the underlying position space.

In the 2D application example, a high energy observation of a 32 × 32 arcmin$^2$ patch of a simulated sky with a 0.1 arcmin resolution has been analyzed. The D3PO algorithm successfully denoised, deconvolved, and decomposed the data image. The analysis yielded a detailed reconstruction of the diffuse photon flux and its logarithmic power spectrum, the precise localization of the point sources and accurate determination of their flux intensities, as well as a posteriori estimates of the reconstructed fields.

The D3PO algorithm should be applicable to a wide range of inference problems appearing in astronomical imaging and related fields. Concrete applications in high energy astrophysics, for example, the analysis of data from the Chandra X-ray observatory or the Fermi γ-ray space telescope, are currently considered by the authors. In this regard, the public release of the D3PO code is planned.
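The remark in Sect. 4 that the inverse Hessian can underestimate the uncertainty of the point-like component when the potential is asymmetric can be illustrated with a single-pixel toy problem. The sketch below is not part of the D3PO code. It assumes one pixel, a unit response, no diffuse flux, $\rho_0 = 1$, and the potential $H(u) = e^u - d\,u + (\beta - 1)\,u + \eta\,e^{-u}$; the function names are hypothetical. It compares the variance from the inverse curvature at the mode with the variance of $\exp(-H)$ obtained by brute-force summation.

```python
import math


def hamiltonian(u, d, beta, eta):
    """Toy single-pixel potential (assumption: unit response, no diffuse flux).

    Poisson term exp(u) - d*u plus the inverse-Gamma prior term
    (beta - 1)*u + eta*exp(-u); constants in u are dropped.
    """
    return math.exp(u) - d * u + (beta - 1.0) * u + eta * math.exp(-u)


def map_and_hessian_variance(d, beta, eta):
    """Mode of the potential and the variance given by its inverse curvature."""
    k = d - beta + 1.0
    # Setting dH/du = 0 with x = exp(u) gives x**2 - k*x - eta = 0.
    flux = 0.5 * (k + math.sqrt(k * k + 4.0 * eta))
    m = math.log(flux)
    curvature = math.exp(m) + eta * math.exp(-m)  # d^2 H / du^2 at the mode
    return m, 1.0 / curvature


def posterior_variance(d, beta, eta, half_width=12.0, n=20001):
    """Variance of u under exp(-H), summed on a regular grid around the mode."""
    m, var_h = map_and_hessian_variance(d, beta, eta)
    step = half_width * math.sqrt(var_h)
    h_min = hamiltonian(m, d, beta, eta)
    grid = [m + step * (2.0 * i / (n - 1) - 1.0) for i in range(n)]
    weights = [math.exp(h_min - hamiltonian(u, d, beta, eta)) for u in grid]
    norm = sum(weights)
    mean = sum(w * u for w, u in zip(weights, grid)) / norm
    return sum(w * (u - mean) ** 2 for w, u in zip(weights, grid)) / norm


if __name__ == "__main__":
    for counts in (3, 30, 300):
        _, var_hessian = map_and_hessian_variance(counts, 1.5, 1e-4)
        var_true = posterior_variance(counts, 1.5, 1e-4)
        print(counts, var_hessian, var_true, var_true / var_hessian)
```

In this toy setting the ratio of the two variances exceeds one for faint pixels and approaches one for bright pixels, which is in line with the statement that the quadratic approximation works best where the data constrain the flux tightly.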


Table 2. Overview of the relative residual error in the photon flux reconstructions for a MAP-δ approach with varying model parameters σ, β, and η.

Fixed parameters: α = 1, q = 10^-12

β     η       residual   σ = 1     σ = 10    σ = 100   σ = 1000  σ → ∞
1     10^-6   ε(s)       0.06710   0.05406   0.05323   0.05383   0.05359
              ε(u)       0.02000   0.01941   0.01602   0.01946   0.01898
5/4   10^-6   ε(s)       0.02874   0.01929   0.01974   0.02096   0.01991
              ε(u)       0.01207   0.01102   0.01090   0.01123   0.01104
3/2   10^-6   ε(s)       0.05890   0.02237   0.02318   0.02238   0.02344
              ε(u)       0.02741   0.01343   0.01346   0.01342   0.01351
7/4   10^-6   ε(s)       0.10864   0.04304   0.03234   0.03248   0.03263
              ε(u)       0.04840   0.02767   0.02142   0.02143   0.02167
2     10^-6   ε(s)       0.11870   0.04614   0.04527   0.04522   0.04500
              ε(u)       0.05360   0.02926   0.02924   0.02926   0.02915
1     10^-4   ε(s)       0.06660   0.05474   0.05377   0.05474   0.05423
              ε(u)       0.02157   0.01903   0.01657   0.01986   0.02055
5/4   10^-4   ε(s)       0.02874   0.01929   0.01974   0.02096   0.01991
              ε(u)       0.01207   0.01100   0.01103   0.01123   0.01102
3/2   10^-4   ε(s)       0.05890   0.02237   0.02318   0.02238   0.02344
              ε(u)       0.02743   0.01343   0.01346   0.01340   0.01352
7/4   10^-4   ε(s)       0.10864   0.04304   0.03234   0.03248   0.03263
              ε(u)       0.04840   0.02766   0.02145   0.02142   0.02166
2     10^-4   ε(s)       0.11870   0.04614   0.04527   0.04522   0.04500
              ε(u)       0.05358   0.02926   0.02926   0.02927   0.02916
1     10^-2   ε(s)       0.07271   0.06209   0.06192   0.06291   0.06265
              ε(u)       0.02252   0.02047   0.02109   0.01764   0.02068
5/4   10^-2   ε(s)       0.02335   0.01934   0.02042   0.01999   0.01930
              ε(u)       0.01139   0.01112   0.01097   0.01124   0.01102
3/2   10^-2   ε(s)       0.05999   0.02227   0.02347   0.02266   0.02274
              ε(u)       0.02745   0.01341   0.01356   0.01332   0.01351
7/4   10^-2   ε(s)       0.10715   0.04304   0.03254   0.03264   0.03258
              ε(u)       0.04833   0.02766   0.02140   0.02144   0.02163
2     10^-2   ε(s)       0.12496   0.04614   0.04497   0.04528   0.04500
              ε(u)       0.05361   0.02927   0.02915   0.02914   0.02915
1     1       ε(s)       0.15328   0.14544   0.14138   0.14181   0.14185
              ε(u)       0.03250   0.03291   0.02905   0.03087   0.02876
5/4   1       ε(s)       0.15473   0.14406   0.14357   0.14465   0.13964
              ε(u)       0.03217   0.03166   0.03089   0.03101   0.03160
3/2   1       ε(s)       0.15360   0.14216   0.14248   0.14208   0.14233
              ε(u)       0.03262   0.03063   0.02534   0.02872   0.03095
7/4   1       ε(s)       0.15206   0.14156   0.13772   0.14160   0.14390
              ε(u)       0.03262   0.03065   0.03174   0.03141   0.03178
2     1       ε(s)       0.06421   0.05479   0.05365   0.05499   0.05429
              ε(u)       0.02043   0.01966   0.01676   0.02070   0.01996

Notes. The parameters α and q were fixed. The best and worst residuals are printed in bold face.

Acknowledgements. We thank Niels Oppermann, Henrik Junklewitz and two anonymous referees for the insightful discussions and productive comments. Furthermore, we thank the DFG Forschergruppe 1254 "Magnetisation of Interstellar and Intergalactic Media: The Prospects of Low-Frequency Radio Observations" for travel support in order to present this work at their annual meeting in 2013. Some of the results in this publication have been derived using the NIFTY package (Selig et al. 2013). This research has made use of NASA's Astrophysics Data System.

Appendix A: Point source stacking

In Sect. 2.3.3, a prior for the point-like signal field has been derived under the assumption that the photon flux of point sources is independent between different pixels and identically inverse-Gamma distributed,

\begin{equation}
\rho_x^{(u)} \sim \mathcal{I}\left(\rho_x^{(u)},\, \beta = \tfrac{3}{2},\, \rho_0 \eta\right) \quad \forall x, \tag{A.1}
\end{equation}

with the shape and scale parameters, β and η. It can be shown that, for β = 3/2, the sum of N such variables still obeys an inverse-Gamma distribution,

\begin{align}
\rho_N^{(u)} &= \sum_x^N \rho_x^{(u)} \tag{A.2} \\
\rho_N^{(u)} &\sim \mathcal{I}\left(\rho_N^{(u)},\, \beta = \tfrac{3}{2},\, N^2 \rho_0 \eta\right). \tag{A.3}
\end{align}

For a proof see Giron (2001).

In the case of β = 3/2, the power-law behavior of the prior becomes independent of the discretization of the continuous position space. This means that the slope of the distribution of $\rho_x^{(u)}$ remains unchanged regardless of whether we refine or coarsen the resolution of the reconstruction. However, the scale parameter η needs to be adapted for each resolution; i.e., η → N²η if N pixels are merged.
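The stacking property of Eqs. (A.2) and (A.3) can be checked numerically. The following is a minimal sketch, not taken from the D3PO code, and all function names are hypothetical. It assumes that $\mathcal{I}(\rho, \beta, \eta) \propto \rho^{-\beta} \exp(-\rho_0\eta/\rho)$, so that β = 3/2 corresponds to shape 1/2 in the standard inverse-Gamma convention, whose cumulative distribution is $\mathrm{erfc}(\sqrt{\rho_0\eta/\rho})$. Sums of N simulated pixel fluxes are compared with this distribution using a Kolmogorov-Smirnov distance.

```python
import math
import random


def draw_flux(scale, rng):
    """One draw from an inverse-Gamma law with shape 1/2 and the given scale.

    Uses X = scale / G with G ~ Gamma(1/2, 1), where G = Z**2 / 2 for a
    standard normal Z.
    """
    z = rng.gauss(0.0, 1.0)
    return 2.0 * scale / (z * z)


def flux_cdf(x, scale):
    """CDF of the shape-1/2 inverse-Gamma law: erfc(sqrt(scale / x))."""
    return math.erfc(math.sqrt(scale / x))


def ks_distance(samples, scale):
    """Kolmogorov-Smirnov distance between the samples and the model CDF."""
    xs = sorted(samples)
    n = len(xs)
    return max(
        max((i + 1) / n - flux_cdf(x, scale), flux_cdf(x, scale) - i / n)
        for i, x in enumerate(xs)
    )


def stacked_fluxes(n_pixels, scale, n_draws, seed=0):
    """Sums of n_pixels independent pixel fluxes, repeated n_draws times."""
    rng = random.Random(seed)
    return [
        sum(draw_flux(scale, rng) for _ in range(n_pixels))
        for _ in range(n_draws)
    ]


if __name__ == "__main__":
    n_pixels, scale = 4, 1e-4  # scale plays the role of rho_0 * eta
    sums = stacked_fluxes(n_pixels, scale, n_draws=20000)
    print("KS distance, scale N^2 * eta:", ks_distance(sums, n_pixels**2 * scale))
    print("KS distance, scale N * eta:  ", ks_distance(sums, n_pixels * scale))
```

A small distance for the scale N²η and a large one for the naive scale Nη is what Eq. (A.3) leads one to expect, since the shape-1/2 inverse-Gamma law is a Lévy distribution and therefore stable under summation.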


Appendix B: Covariance and curvature

The covariance $D$ of a Gaussian $\mathcal{G}(s - m, D)$ describes the uncertainty associated with the mean $m$ of the distribution. It can be computed by second moments or cumulants according to Eq. (3), or in this Gaussian case as the inverse Hessian of the corresponding information Hamiltonian,

\begin{equation}
\left.\frac{\partial^2 H}{\partial s\,\partial s^\dagger}\right|_{s=m}
= \left.\frac{\partial^2}{\partial s\,\partial s^\dagger}\left(\frac{1}{2}(s-m)^\dagger D^{-1}(s-m)\right)\right|_{s=m}
= D^{-1}. \tag{B.1}
\end{equation}

In Sect. 3, uncertainty covariances for the diffuse signal field $s$ and the point-like signal field $u$ have been derived that are here given in closed form.

The MAP uncertainty covariances introduced in Sect. 3.1 are approximated by inverse Hessians. According to Eq. (36), they read

\begin{equation}
D^{(s)\,-1}_{xy} \approx \left\{\sum_i \left(1 - \frac{d_i}{l_i}\right) R_{ix}\, e^{m_x^{(s)}}\right\}\delta_{xy}
+ \sum_i \frac{d_i}{l_i^2}\left(R_{ix}\, e^{m_x^{(s)}}\right)\left(R_{iy}\, e^{m_y^{(s)}}\right)
+ S^{\star\,-1}_{xy}, \tag{B.2}
\end{equation}

and

\begin{equation}
D^{(u)\,-1}_{xy} \approx \left\{\sum_i \left(1 - \frac{d_i}{l_i}\right) R_{ix}\, e^{m_x^{(u)}} + \eta\, e^{-m_x^{(u)}}\right\}\delta_{xy}
+ \sum_i \frac{d_i}{l_i^2}\left(R_{ix}\, e^{m_x^{(u)}}\right)\left(R_{iy}\, e^{m_y^{(u)}}\right), \tag{B.3}
\end{equation}

with

\begin{equation}
l_i = \int \mathrm{d}x\; R_{ix}\left(e^{m_x^{(s)}} + e^{m_x^{(u)}}\right). \tag{B.4}
\end{equation}

The corresponding covariances derived in the Gibbs approach according to Eq. (51) yield

\begin{equation}
D^{(s)\,-1}_{xy} \approx \left\{\sum_i \left(1 - \frac{d_i}{l_i}\right) R_{ix}\, e^{m_x^{(s)} + \frac{1}{2} D_{xx}^{(s)}}\right\}\delta_{xy}
+ \sum_i \frac{d_i}{l_i^2}\left(R_{ix}\, e^{m_x^{(s)} + \frac{1}{2} D_{xx}^{(s)}}\right)\left(R_{iy}\, e^{m_y^{(s)} + \frac{1}{2} D_{yy}^{(s)}}\right)
+ S^{\star\,-1}_{xy}, \tag{B.5}
\end{equation}

and

\begin{equation}
D^{(u)\,-1}_{xy} \approx \left\{\sum_i \left(1 - \frac{d_i}{l_i}\right) R_{ix}\, e^{m_x^{(u)} + \frac{1}{2} D_{xx}^{(u)}} + \eta\, e^{-m_x^{(u)} + \frac{1}{2} D_{xx}^{(u)}}\right\}\delta_{xy}
+ \sum_i \frac{d_i}{l_i^2}\left(R_{ix}\, e^{m_x^{(u)} + \frac{1}{2} D_{xx}^{(u)}}\right)\left(R_{iy}\, e^{m_y^{(u)} + \frac{1}{2} D_{yy}^{(u)}}\right), \tag{B.6}
\end{equation}

with

\begin{equation}
l_i = \int \mathrm{d}x\; R_{ix}\left(e^{m_x^{(s)} + \frac{1}{2} D_{xx}^{(s)}} + e^{m_x^{(u)} + \frac{1}{2} D_{xx}^{(u)}}\right). \tag{B.7}
\end{equation}

They are identical up to the $+\frac{1}{2} D_{xx}$ terms in the exponents. On the one hand, this reinforces the approximations done in Sect. 3.2. On the other hand, this shows that higher order correction terms might alter the uncertainty covariances further, cf. Eq. (43). The concrete impact of these correction terms is difficult to judge, since they introduce terms involving $D_{xy}$ that couple all elements of $D$ in an implicit manner.

We note that the inverse Hessian describes the curvature of the potential; its interpretation as uncertainty is, strictly speaking, only valid for quadratic potentials. However, in most cases it is a sufficient approximation.

The Gibbs approach provides an alternative by equating the first derivative of the Gibbs free energy with respect to the covariance with zero. Following Eq. (52), the covariances read

\begin{align}
D^{(s)\,-1}_{xy} &= \left\{\sum_i \left(1 - \frac{d_i}{l_i}\right) R_{ix}\, e^{m_x^{(s)} + \frac{1}{2} D_{xx}^{(s)}}\right\}\delta_{xy} \tag{B.8} \\
&\quad + S^{\star\,-1}_{xy}, \tag{B.9}
\end{align}

and

\begin{align}
D^{(u)\,-1}_{xy} &= \Bigg\{\sum_i \left(1 - \frac{d_i}{l_i}\right) R_{ix}\, e^{m_x^{(u)} + \frac{1}{2} D_{xx}^{(u)}} \tag{B.10} \\
&\quad + \eta\, e^{-m_x^{(u)} + \frac{1}{2} D_{xx}^{(u)}}\Bigg\}\,\delta_{xy}. \tag{B.11}
\end{align}

Compared to the above solutions, there is one term missing, indicating that they already lack first order corrections. For this reason, the solutions obtained from the inverse Hessians are used in the D3PO algorithm.

Appendix C: Posterior approximation

C.1. Information theoretical measure

If the full posterior $P(z|d)$ of an inference problem is so complex that an analytic handling is infeasible, an approximate posterior $Q$ might be used instead. The fitness of such an approximation can be quantified by an asymmetric measure for which different terminologies appear in the literature.

First, the Kullback-Leibler divergence,

\begin{align}
D_{\mathrm{KL}}(Q, P) &= \int \mathcal{D}z\; Q(z|d) \log\frac{Q(z|d)}{P(z|d)} \tag{C.1} \\
&= \left\langle \log\frac{Q(z|d)}{P(z|d)} \right\rangle_Q, \tag{C.2}
\end{align}

defines mathematically an information theoretical distance, or divergence, which is minimal if a maximal cross information between $P$ and $Q$ exists (Kullback & Leibler 1951).

Second, the information entropy,

\begin{align}
S_E(Q, P) &= -\int \mathcal{D}z\; P(z|d) \log\frac{P(z|d)}{Q(z|d)} \tag{C.3} \\
&= -\left\langle \log\frac{P(z|d)}{Q(z|d)} \right\rangle_P \tag{C.4} \\
&= -D_{\mathrm{KL}}(P, Q), \notag
\end{align}

is derived under the maximum entropy principle (Jaynes 1957a,b) from fundamental axioms demanding locality, coordinate invariance and system independence (see e.g. Caticha 2008, 2011).

Third, the (approximate) Gibbs free energy (Enßlin & Weig 2010),

\begin{align}
G &= \left\langle H(z|d) \right\rangle_Q - S_B(Q) \tag{C.5} \\
&= \left\langle -\log P(z|d) \right\rangle_Q - \left\langle -\log Q(z|d) \right\rangle_Q \tag{C.6} \\
&= D_{\mathrm{KL}}(Q, P), \notag
\end{align}

describes the difference between the internal energy $\langle H(z|d)\rangle_Q$ and the Boltzmann-Shannon entropy $S_B(Q) = S_E(1, Q)$. The derivation of the Gibbs free energy is based on the principles of thermodynamics^13.

^13 In Eq. (C.5), a unit temperature is implied, see discussion by Enßlin & Weig (2010, 2012); Iatsenko et al. (2012).

The Kullback-Leibler divergence, information entropy, and the Gibbs free energy are equivalent measures that allow one to assess the approximation $Q \approx P$. Alternatively, a parametrized proposal for $Q$ can be pinned down by extremizing the measure of choice with respect to the parameters.

C.2. Calculus of variations

The information theoretical measure can be interpreted as an action to which the principle of least action applies. This concept is the basis for variational Bayesian methods (Jordan et al. 1999; Wingate & Weber 2013), which enable, among other things, the derivation of approximate posterior distributions.

We suppose that $z$ is a set of multiple signal fields, $z = \{z^{(i)}\}_{i \in \mathbb{N}}$, $d$ a given data set, and $P(z|d)$ the posterior of interest. In practice, such a problem is often addressed by a mean field approximation that factorizes the variational posterior $Q$,

\begin{equation}
P(z|d) \approx Q = \prod_i Q_i(z^{(i)}|\mu, d). \tag{C.7}
\end{equation}

Here, the mean field $\mu$, which mimics the effect of all $z^{(i \neq j)}$ onto $z^{(j)}$, has been introduced. The approximation in Eq. (C.7) shifts any possible entanglement between the $z^{(i)}$ within $P$ into the dependence of $z^{(i)}$ on $\mu$ within $Q_i$. Hence, the mean field $\mu$ is well determined by the inference problem at hand, as demonstrated in the subsequent Sect. C.3. We note that $\mu$ represents effective rather than additional degrees of freedom.

Following the principle of least action, any variation of the Gibbs free energy must vanish. We consider a variation $\delta_j = \delta/\delta Q_j(z^{(j)}|\mu, d)$ with respect to one approximate posterior $Q_j(z^{(j)}|\mu, d)$. It holds,

\begin{equation}
\frac{\delta Q_i(\tilde{z}^{(i)}|\mu, d)}{\delta Q_j(z^{(j)}|\mu, d)} = \delta_{ij}\, \delta(z^{(i)} - \tilde{z}^{(j)}). \tag{C.8}
\end{equation}

Computing the variation of the Gibbs free energy yields

\begin{align}
\delta_j G = 0 &= \frac{\delta}{\delta Q_j(z^{(j)}|\mu, d)} \left( \left\langle H(z|d) \right\rangle_Q - \left\langle -\log Q \right\rangle_Q \right) \tag{C.9} \\
&= \frac{\delta}{\delta Q_j(z^{(j)}|\mu, d)} \left\langle H(z|d) + \sum_i \log Q_i(\tilde{z}^{(i)}|\mu, d) \right\rangle_{\prod Q_i} \tag{C.10} \\
&= \frac{\delta}{\delta Q_j(z^{(j)}|\mu, d)} \int \mathcal{D}\tilde{z}^{(j)}\; Q_j(\tilde{z}^{(j)}|\mu, d) \left( \left\langle H(z|d) \right\rangle_{\prod Q_{i \neq j}} + \log Q_j(\tilde{z}^{(j)}|\mu, d) \right)
+ \underbrace{\frac{\delta}{\delta Q_j(z^{(j)}|\mu, d)} \sum_{i \neq j} \ldots}_{=0} \notag \\
&= \int \mathcal{D}\tilde{z}^{(j)}\; \delta(z^{(i)} - \tilde{z}^{(j)}) \left( \left\langle H(z|d) \right\rangle_{\prod Q_{i \neq j}} + \log Q_j(\tilde{z}^{(j)}|\mu, d) + 1 \right) \notag \\
&= \left.\left\langle H(z|d) \right\rangle_{\prod Q_{i \neq j}}\right|_{z^{(j)}} + \log Q_j(z^{(j)}|\mu, d) + \mathrm{const.} \tag{C.11}
\end{align}

This defines a solution for the approximate posterior $Q_j$, where the constant term in Eq. (C.11) ensures the correct normalization^14 of $Q_j$,

\begin{equation}
Q_j(z^{(j)}|\mu, d) \propto \exp\left( -\left.\left\langle H(z|d) \right\rangle_{\prod Q_{i \neq j}}\right|_{z^{(j)}} \right). \tag{C.12}
\end{equation}

^14 The normalization could be included by usage of Lagrange multipliers; i.e., by adding a term $\sum_i \lambda_i \left(1 - \int \mathcal{D}z^{(i)}\, Q_i(z^{(i)}|\mu, d)\right)$ to the Gibbs free energy in Eq. (C.9).

Although the parts $z^{(i \neq j)}$ are integrated out, Eq. (C.12) is no marginalization since the integration is performed on the level of the (negative) logarithm of a probability distribution. The success of the mean field approach might be that this integration is often better behaved than the corresponding marginalization. However, the resulting equations for the $Q_i$ depend on each other, and thus need to be solved self-consistently.

A maximum a posteriori solution for $z^{(j)}$ can then be found by minimizing an effective Hamiltonian,

\begin{align}
\underset{z^{(j)}}{\mathrm{argmax}}\; P(z|d) &= \underset{z^{(j)}}{\mathrm{argmin}}\; H(z|d) \tag{C.13} \\
&\approx \underset{z^{(j)}}{\mathrm{argmin}} \left.\left\langle H(z|d) \right\rangle_{\prod Q_{i \neq j}}\right|_{z^{(j)}}. \tag{C.14}
\end{align}

Since the posterior is approximated by a product, the Hamiltonian is approximated by a sum, and each summand depends on solely one variable in the partition of the latent variable $z$.

C.3. Example

In this section, the variational method is demonstrated with an exemplary posterior of the following form,

\begin{align}
P(s, \tau|d) &= \frac{P(d|s)}{P(d)}\, P(s|\tau)\, P(\tau) \tag{C.15} \\
&= \frac{P(d|s)}{P(d)}\, \mathcal{G}(s, S)\, P_{\mathrm{un}}(\tau|\alpha, q)\, P_{\mathrm{sm}}(\tau|\sigma), \tag{C.16}
\end{align}


where $P(d|s)$ stands for an arbitrary likelihood describing how likely the data $d$ can be measured from a signal $s$, and $S = \sum_k e^{\tau_k} S_k$ for a parametrization of the signal covariance. This posterior is equivalent to the one derived in Sect. 2 in order to find a solution for the logarithmic power spectrum τ. Here, any explicit dependence on the point-like signal field $u$ is veiled in favor of clarity.

[Figure C.1, panels (a) and (b): graphical models with nodes τ, s, d, and the mean field µ]

Fig. C.1. Graphical model for the variational method applied to the example posterior in Eq. (C.15). Panel a) shows the graphical model without, and panel b) with the mean field µ.

The corresponding Hamiltonian reads

\begin{align}
H(s, \tau|d) &= -\log P(s, \tau|d) \tag{C.17} \\
&= H_0 + \frac{1}{2} \sum_k \left( \varrho_k \tau_k + \mathrm{tr}\left[ s s^\dagger S_k^{-1} \right] e^{-\tau_k} \right) \tag{C.18} \\
&\quad + (\alpha - 1)^\dagger \tau + q^\dagger e^{-\tau} + \frac{1}{2} \tau^\dagger T \tau, \notag
\end{align}

where $\varrho_k = \mathrm{tr}\left[ S_k S_k^{-1} \right]$ and all terms constant in τ, including the likelihood $P(d|s)$, have been absorbed into $H_0$.

For an arbitrary likelihood it might not be possible to marginalize the posterior over $s$ analytically. However, an integration of the Hamiltonian over $s$ might be feasible since the only relevant term is quadratic in $s$. As, on the one hand, the prior $P(s|\tau)$ is Gaussian and, on the other hand, a posterior mean $m$ and covariance $D$ for the signal field $s$ suffice, cf. Eqs. (2) and (3), we assume a Gaussian approximation for $Q_s$; i.e., $Q_s = \mathcal{G}(s - m, D)$.

We now introduce a mean field approximation, denoted by µ, by changing the causal structure as depicted in Fig. C.1. With the consequential approximation of the posterior,

\begin{equation}
P(s, \tau|d) \approx \mathcal{G}(s - m, D)\; Q_\tau(\tau|\mu, d), \tag{C.19}
\end{equation}

we can calculate the effective Hamiltonian for τ as

\begin{align}
\left.\left\langle H(s, \tau|d) \right\rangle_{Q_s}\right|_\tau &= H_0 + \gamma^\dagger \tau + \frac{1}{2} \tau^\dagger T \tau + q^\dagger e^{-\tau} \tag{C.20} \\
&\quad + \frac{1}{2} \sum_k \mathrm{tr}\left[ \left\langle s s^\dagger \right\rangle_{Q_s} S_k^{-1} \right] e^{-\tau_k} \notag \\
&= H_0 + \gamma^\dagger \tau + \frac{1}{2} \tau^\dagger T \tau + q^\dagger e^{-\tau} \tag{C.21} \\
&\quad + \frac{1}{2} \sum_k \mathrm{tr}\left[ \left( m m^\dagger + D \right) S_k^{-1} \right] e^{-\tau_k}, \notag
\end{align}

where $\gamma = (\alpha - 1) + \frac{1}{2} \varrho$.

The nature of the mean field µ can be derived from the coupling term in Eq. (C.18) that ensures an information flow between $s$ and τ,

\begin{equation}
\mu = \begin{pmatrix} \left\langle \mathrm{tr}\left[ s s^\dagger S_k^{-1} \right] \right\rangle_{Q_s} \\[4pt] \left\langle \sum_k e^{-\tau_k} S_k^{-1} \right\rangle_{Q_\tau} \end{pmatrix}
= \begin{pmatrix} \mathrm{tr}\left[ \left( m m^\dagger + D \right) S_k^{-1} \right] \\[4pt] \left\langle S^{-1} \right\rangle_{Q_\tau} \end{pmatrix}. \tag{C.22}
\end{equation}

Hence, the mean field effect on τ is given by the above trace, and the mean field effect on $s$ is described by $\left\langle S^{-1} \right\rangle_{Q_\tau}$.

Extremizing Eq. (C.21) yields

\begin{equation}
e^{\tau} = \frac{q + \frac{1}{2} \left( \mathrm{tr}\left[ \left( m m^\dagger + D \right) S_k^{-1} \right] \right)_k}{\gamma + T \tau}. \tag{C.23}
\end{equation}

This formula is in agreement with the critical filter formula (Enßlin & Frommert 2011; Oppermann et al. 2013). If a Gaussian likelihood and no smoothness prior are assumed, it is the exact maximum of the true posterior with respect to the (logarithmic) power spectrum.

References

Bertin, E., & Arnouts, S. 1996, A&AS, 117, 393
Bobin, J., Starck, J.-L., Fadili, J. M., Moudden, Y., & Donoho, D. 2007, IEEE Trans. Image Proc., 16, 2675
Bouchet, L., Amestoy, P., Buttari, A., Rouet, F.-H., & Chauvin, M. 2013, Astronomy and Computing, 1, 59
Burger, M., & Lucka, F. 2014, Inverse Problems, 30, 114004
Carvalho, P., Rocha, G., & Hobson, M. P. 2009, MNRAS, 393, 681
Carvalho, P., Rocha, G., Hobson, M. P., & Lasenby, A. 2012, MNRAS, 427, 1384
Caticha, A. 2008 [arXiv:0808.0012]
Caticha, A. 2011, in Am. Inst. Phys. Conf. Ser., eds. A. Mohammad-Djafari, J.-F. Bercher, & P. Bessière, AIP Conf. Ser., 1305, 20
Chapman, E., Abdalla, F. B., Bobin, J., et al. 2013, MNRAS, 429, 165
Cornwell, T. J. 2008, IEEE J. Select. Top. Sign. Process., 2, 793
Dupé, F.-X., Fadili, J. M., & Starck, J.-L. 2009, IEEE Trans. Im. Process., 18, 310
Dupé, F.-X., Fadili, J., & Starck, J.-L. 2011 [arXiv:1103.2213]
Enßlin, T. 2013, in AIP Conf. Ser., Am. Inst. Phys. Conf. Ser., 1553, ed. U. von Toussaint, 184
Enßlin, T. 2014, AIP Conf. Proc., 1636, 49
Enßlin, T. A., & Frommert, M. 2011, Phys. Rev. D, 83, 105014
Enßlin, T. A., & Weig, C. 2010, Phys. Rev. E, 82, 1112
Enßlin, T. A., & Weig, C. 2012, Phys. Rev. E, 85, 3102
Enßlin, T. A., Frommert, M., & Kitaura, F. S. 2009, Phys. Rev. D, 80, 5005
Figueiredo, M. A. T., & Bioucas-Dias, J. M. 2010, IEEE Trans. Image Process., 19, 3133
Fomalont, E. B. 1968, Bull. Astron. Inst. Netherlands, 20, 69
Geman, S., & Geman, D. 1984, IEEE Trans. Pattern Anal. Mach. Intell., 6, 721
Giovannelli, J.-F., & Coulais, A. 2005, A&A, 439, 401
Giron, Francisco Javier, C. C. d. 2001, RACSAM, 95, 39
González-Nuevo, J., Argüeso, F., Lopez-Caniego, M., Toffolatti, L., et al. 2006, Not. Roy. Astron. Soc., 1603
Guglielmetti, F., Fischer, R., & Dose, V. 2009, MNRAS, 396, 165
Haar, A. 1910, Math. Ann., 69, 331
Haar, A. 1911, Math. Ann., 71, 38
Hensley, B. S., Pavlidou, V., & Siegal-Gaskins, J. M. 2013, MNRAS, 433, 591
Högbom, J. A. 1974, A&AS, 15, 417
Hutchinson, M. F. 1989, Communications in Statistics – Simulation and Computation, 18, 1059
Iatsenko, D., Stefanovska, A., & McClintock, P. V. E. 2012, Phys. Rev. E, 85, 3101
Jasche, J., & Kitaura, F. S. 2010, MNRAS, 407, 29
Jasche, J., & Wandelt, B. D. 2013, ApJ, 779, 15
Jasche, J., Kitaura, F. S., Wandelt, B. D., & Enßlin, T. A. 2010, MNRAS, 406, 60
Jaynes, E. T. 1957a, Phys. Rev., 108, 171
Jaynes, E. T. 1957b, Phys. Rev., 106, 620
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. 1999, Machine Learning, 37, 183
Junklewitz, H., Bell, M. R., Selig, M., & Enßlin, T. A. 2014, A&A, submitted [arXiv:1311.5282]
Kinney, J. B. 2014, Phys. Rev. E, 90, 011301
Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. 1983, Science, 220, 671
Kullback, S., & Leibler, R. A. 1951, Ann. Math. Stat., 22, 79
Malyshev, D., & Hogg, D. W. 2011, ApJ, 738, 181
Metropolis, N., & Ulam, S. 1949, J. Amer. Statist. Assn., 44, 335


Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. 1953, J. Chem. Phys., 21, 1087
Nocedal, J., & Wright, S. J. 2006, Numerical optimization, http://site.ebrary.com/id/10228772
Oppermann, N., Selig, M., Bell, M. R., & Enßlin, T. A. 2013, Phys. Rev. E, 87, 032136
Planck Collaboration VII. 2011, A&A, 536, A7
Rau, U., & Cornwell, T. J. 2011, A&A, 532, A71
Schmitt, J., Starck, J. L., Casandjian, J. M., Fadili, J., & Grenier, I. 2010, A&A, 517, A26
Schmitt, J., Starck, J. L., Casandjian, J. M., Fadili, J., & Grenier, I. 2012, A&A, 546, A114
Selig, M., Oppermann, N., & Enßlin, T. A. 2012, Phys. Rev. E, 85, 021134
Selig, M., Bell, M. R., Junklewitz, H., et al. 2013, A&A, 554, A26
Shewchuk, J. R. 1994, Techn. rep., Carnegie Mellon University, Pittsburgh, PA
Strong, A. W. 2003, A&A, 411, L127
Transtrum, M. K., & Sethna, J. P. 2012 [arXiv:1201.5885]
Transtrum, M. K., Machta, B. B., & Sethna, J. P. 2010, Phys. Rev. Lett., 104, 0201
Valdes, F. 1982, in SPIE Conf. Ser., 331, 465
Wandelt, B. D., Larson, D. L., & Lakshminarayanan, A. 2004, Phys. Rev. D, 70, 083511
Willett, R., & Nowak, R. 2007, IEEE Trans., 53, 3171
Wingate, D., & Weber, T. 2013 [arXiv:1301.1299]
