Scale Space Multiresolution Analysis of Random Signals

Lasse Holmström*, Leena Pasanen

Department of Mathematical Sciences, University of Oulu, Finland

Reinhard Furrer

Institute of Mathematics, University of Zürich, Switzerland

Stephan R. Sain

National Center for Atmospheric Research, Boulder, Colorado, USA

Abstract

A method to capture the scale-dependent features in a random signal is proposed, with the main focus on images and spatial fields defined on a regular grid. A technique based on scale space smoothing is used. However, where the usual scale space analysis approach is to suppress detail by increasing smoothing progressively, the proposed method instead considers differences of smooths at neighboring scales. A random signal can then be represented as a sum of such differences, a kind of multiresolution analysis, each difference representing details relevant at a particular scale or resolution. Bayesian analysis is used to infer which details are credible and which are just artifacts of random variation. The applicability of the method is demonstrated using noisy digital images as well as global temperature change fields produced by numerical climate prediction models.

Keywords: Scale space smoothing, Bayesian methods, Image analysis, Climate research

1. Introduction

In signal processing, smoothing is often used to suppress noise. An optimal level of smoothing is then of interest. In scale space analysis of noisy curves and images, instead of a single, in some sense "optimal" level of smoothing, a whole family of smooths is considered and each smooth is thought to provide information about the underlying object at a particular scale or resolution.

∗Corresponding author
∗∗P.O. Box 3000, 90014 University of Oulu, Finland. Tel. +358 50 563 7465, Fax +358 9 553 1730.
Email address: [email protected] (Lasse Holmström)
URL: http://cc.oulu.fi/~llh/ (Lasse Holmström)
The electronic version of this paper includes Matlab software used in the computations.

The concept of scale space was introduced in the context of image analysis (see Lindeberg, 1994, and the references there) but during the past ten years it has emerged also as a useful statistical data analysis procedure (cf. Holmström, 2010). The seminal idea was the SiZer technology introduced by Chaudhuri and Marron (1999, 2000) and the original concept has since then been extended in various directions, including Bayesian versions that can be used to analyze curves underlying noisy data (Erästö and Holmström, 2005, Godtliebsen and Øigård, 2005, Holmström, 2010). Recently, a Bayesian version for the analysis of images and spatial fields has also been proposed (Holmström and Pasanen, 2007, 2008, Pasanen and Holmström, 2008). The central idea in statistical scale space methods is to make inferences about the "statistically significant" features of the smooths of an object of which only a noisy observation is available. This is typically done by estimating a measure of local change of the smooths, such as the derivative, or, in the Bayesian approach, by exploring its posterior distribution. The analysis is carried out for a wide range of smoothing levels in the hope of discovering the salient scale-dependent features. Raising the smoothing level progressively suppresses details, revealing increasingly large scale characteristics in the object underlying the data.

By employing differences of smooths at neighboring scales, the method proposed in this paper attempts to separate the features into distinct scale categories even more aggressively than the usual scale space procedure. The basic idea is illustrated with the simple example shown in Figure 1. The top panel in the middle column shows a curve represented by a vector x = [x1, ..., xn]^T of values on a discrete, equally spaced grid. The curve was constructed as the sum of the four underlying "detail" curves x1, ..., x4 shown on the left. They can be thought to represent the features of the curve x in four different scales. In order to recover these scale-dependent detail curves, Nadaraya-Watson kernel smoothing (see Appendix B) was applied in the middle column to the curve x. The smoothing operator, an n × n matrix, is denoted by Sλ and λ denotes the smoothing level. Here λ1 = 0 (no smoothing) and Sλ1 x therefore is just x. The differences of these smooths are shown in the right column and they, together with the mean Sλ4 x, capture the original constituent detail curves reasonably well. The middle column can be thought to represent conventional scale space analysis where, in a sense, Sλi x contains the signal features for all scales λ ≥ λi. By considering differences of smooths, we attempt to remedy this by isolating, for two levels λi < λj, those features that are present at level λi but not at λj. One way to think about the contrast between usual scale space analysis and the difference of smooths approach is the distinction between low pass and band pass filtering. A related idea is the reroughing technique of Tukey (Tukey, 1977, Ch. 16) which, if the smoothing levels it uses are selected appropriately, can produce similar scale-dependent signal components.

The purpose of this paper is to take the idea presented in this simple example further. We will develop a method that applies to more general situations, focusing on noisy images and random fields defined on regular grids.
In contrast to the noiseless situation of the previous example, it then becomes essential to distinguish true features in the data from artifacts created by random fluctuation. Obviously, in order for the suggested idea to work well one must properly select the smooths used to reconstruct the details. Subtracting smooths of very different scales, such as using the difference between x and Sλ3 x in Figure 1, could easily miss the mid-scale features, whereas differences of smooths of very similar scales probably will not reveal anything interesting. We therefore also try to provide tools for identifying a sequence of smooths that, when differenced, would reveal interesting features underlying the data.
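To make the construction of Figure 1 concrete, the following sketch builds Nadaraya-Watson smoothing matrices on an equispaced grid and forms the differences of smooths for a toy curve. It is a minimal illustration in Python with NumPy (the paper's own software is Matlab); the Gaussian kernel, the toy curve and the bandwidths are assumptions made for the example, not the settings used to produce Figure 1.

import numpy as np

def nw_smoother_matrix(t, lam):
    # Nadaraya-Watson smoothing matrix S_lambda with a Gaussian kernel;
    # lam = 0 gives the identity (no smoothing), lam = inf gives the mean.
    n = len(t)
    if lam == 0:
        return np.eye(n)
    if np.isinf(lam):
        return np.full((n, n), 1.0 / n)
    K = np.exp(-0.5 * ((t[:, None] - t[None, :]) / lam) ** 2)
    return K / K.sum(axis=1, keepdims=True)

t = np.linspace(0.0, 1.0, 200)
x = np.sin(16 * np.pi * t) + np.sin(4 * np.pi * t) + t          # toy curve
lambdas = [0.0, 0.01, 0.05, np.inf]                              # smoothing levels
S = [nw_smoother_matrix(t, lam) for lam in lambdas]
details = [(S[i] - S[i + 1]) @ x for i in range(len(S) - 1)]     # differences of smooths
details.append(S[-1] @ x)                                        # the mean component
assert np.allclose(sum(details), x)                              # the details sum back to x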

[Figure 1 panels. Left column: the component curves x1, x2, x3, x4. Middle column: x = Sλ1 x and the smooths Sλ2 x, Sλ3 x, Sλ4 x. Right column: the details z1, z2, z3.]
Figure 1: The curve x = Sλ1 x in the uppermost panel in the middle column is constructed as the sum of the curves x1, x2, x3 and x4 on the left. Differences of the smooths Sλi x in the right hand column reveal its underlying components.

The scale space multiresolution procedure we propose has three steps: (1) Bayesian signal synthesis or reconstruction, (2) forming of scale-dependent detail components of the reconstructed or synthesized signal using differences of smooths at neighboring scales, and (3) posterior credibility analysis of the features in these details. In the second step, methods for finding useful smoothing levels are needed.

The paper is organized as follows. In Section 2 we explain the idea of using differences of smooths in more detail and also discuss ideas for the selection of suitable sets of smoothing levels. Section 3 demonstrates the application of the proposed method to the analysis of digital images and random fields. In Section 4 we consider alternative approaches to resolving a signal into meaningful details, concentrating on image decomposition. Section 5 is a short summary of the main points made in the paper and the Appendices describe some technicalities associated with inference and the particular smoothers used in the analyses. The Matlab code used for all computations, together with instructions for its use and examples, can be found at http://cc.oulu.fi/~lpasanen/MRBSiZer/. The Matlab functions are also available with the electronic version of the paper.

2. Smoothing-based multiresolution analysis

2.1. The detail decomposition

Let x be a signal, e.g. a curve or an image or, more generally, a random field, represented as an n-dimensional random vector. Also, let Sλ be a smoother represented as an n × n matrix. Here λ ≥ 0 is a smoothing parameter that controls the level of smoothing, a small λ corresponding to little smoothing and a large λ corresponding to heavy smoothing. For example, Sλ could be a kernel smoother with λ the kernel bandwidth, or Sλ could be a spline smoother with λ controlling the size of the roughness penalty. Now, let

0 = λ_1 < λ_2 < ··· < λ_{L−1} < λ_L = ∞    (1)

be a set of smoothing levels. We assume that S0 is the identity mapping, S0 x = Sλ1 x = x. The effect of S∞ depends on the particular smoother used. For the Nadaraya-Watson kernel smoother used in Figure 1, L = 4 and the smooth S∞x = Sλ4 x is the mean of the curve x, whereas for a local linear regression smoother S∞x is a linear fit to the data x. In all applications presented in this paper S∞x is in fact the mean of x. We now have trivially that

x = Σ_{i=1}^{L−1} (S_{λ_i} − S_{λ_{i+1}}) x + S_{λ_L} x ≡ Σ_{i=1}^{L−1} z_i + z_L,    (2)

where zi = (Sλi − Sλi+1)x for i = 1, ..., L − 1 and zL = S∞x. (To save space, the mean z4 was not included in the right hand column of Figure 1.) Here zi, for i = 1, ..., L − 1, is the difference between two consecutive smooths and it can be interpreted as the detail lost when smoothing is increased from λi to λi+1. The "detail" zi therefore represents features in x at the scale λi and the decomposition (2) can be viewed as a kind of multiresolution analysis of x. However, since zi is a random vector, one cannot in general consider all of its non-zero components as corresponding to true underlying features of x. We will quantify the uncertainty in the zi's using Bayesian modeling.

Thus, suppose one has available p(zi|data), the posterior of the detail given the data, i = 1, ..., L. The data could be an observed noisy image or the observations used in a hierarchical Bayes model for x. One can then make inferences about the features in the detail zi by finding the subsets of the components of zi that differ credibly from zero. This is in fact the procedure used in the Bayesian scale space methods BSiZer and iBSiZer that find scale-dependent features in scatter plots and digital images, respectively (Erästö and Holmström, 2005, 2007, Holmström and Pasanen, 2007, 2008, Pasanen and Holmström, 2008). Following this approach we first select a posterior probability threshold 0 < α < 1, a typical choice being α = 0.95, the value used in all examples in this paper. Then, let zi be a vectorization of an array [zi,s]s∈I, where I is a set of spatial locations. To visualize the inferences made we now flag the location s blue if

P(zi,s > 0|data) > α, red if P(zi,s < 0|data) > α, and gray otherwise. However, as in BSiZer and iBSiZer, instead of considering each location separately, the inference is in fact done jointly over all locations. This color coding corresponds to the original SiZer convention but in some applications it is in fact more natural to reverse the roles of blue and red (cf. Section 3.2). For more details, see Appendix A.

Figure 2 shows an artificial example that demonstrates the above ideas. The true, unobserved 200 × 200 image x together with its noisy observed version y are shown in the first column. The remaining five panels in this column show the additive components from which x was built. Thus, x is the sum of these five images with positive and negative pixel values shown in white and black, respectively. To display their features more clearly, the component images are individually scaled in the second column. The next column shows the details found, summarized as their posterior means and again scaled to show their features more clearly. Using the method described in the next section, the following sequence of smoothing parameters was selected to define the decomposition (2):

[λ1, λ2, λ3, λ4, λ5, λ6, λ7] = [0, 1, 10^2, 10^4, 10^6, 10^8, ∞].

The green and yellow circles indicate the effective sizes of the impulse responses of the smoothers Sλi and Sλi+1, giving an idea of the spatial ranges of these smoothers. For the largest scales the ranges extend beyond the actual images. As one can see, the true underlying details are reasonably well identified. In reality, the true number of the underlying components is of course not known and, depending on the number of smoothing levels used, some scale contributions of the original signal may be captured by several details, as in fact is the case here. Namely, z1 and z2 both clearly represent the true smallest scale component image x1 and, correspondingly, the contribution of the mid-scale component x3 is covered by the details z4 and z5. The credibility analysis of the fourth column shows that the detail z1 is not credible as its credibility map is almost completely gray. According to this analysis the rest of the detail images are mostly credible. The Bayesian image model used in this example assumes that

y = x + ε, (3)

with ε = [ε1, ..., εn]^T ∼ N(0, σ²I), where I is the identity matrix. For σ² an Inv-χ²(ν0, σ0²) prior was used and for the underlying image x we used a smoothing prior of the form

p(x | λ_0, σ²) ∝ (λ_0/σ²)^{(n−1)/2} exp( −(λ_0/(2σ²)) x^T Q x ),    (4)

where the precision matrix Q is defined in Appendix B (equation (B.1)). The parameter values used for this example were λ0 = 1, σ0² = 19.6² and ν0 = 50. The noise prior mean is 20², the true variance of the noise in the observed image, but the prior has little influence on the posterior. The posterior density p(x|y) is a multivariate Student t-distribution which can be easily sampled. A posterior sample of size 5000 was obtained and the transformation x ↦ zi = (Sλi − Sλi+1)x or x ↦ zL = S∞x was applied to the sample vectors x. The resulting sample of detail vectors zi was then used to make inferences about the credible features in the detail component zi. More complicated models, such as dependent errors, can be handled similarly. For more information we refer to Holmström and Pasanen (2008) and Pasanen and Holmström (2008).
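Schematically, the sampling-based inference just outlined proceeds as in the sketch below (Python/NumPy, not the paper's Matlab code). The smoothing matrices S and the posterior draws of x are assumed to be available; only pointwise posterior probabilities are computed here, whereas the joint inference actually used is described in Appendix A.

import numpy as np

def detail_draws(x_draws, S):
    # Map posterior draws of x (one row per draw) to draws of the details
    # z_i = (S_i - S_{i+1}) x, i = 1, ..., L-1, and z_L = S_L x.
    L = len(S)
    z = [x_draws @ (S[i] - S[i + 1]).T for i in range(L - 1)]
    z.append(x_draws @ S[-1].T)
    return z

def pointwise_flags(z_i_draws, alpha=0.95):
    # Flag each component blue (+1), red (-1) or gray (0) from pointwise
    # posterior probabilities estimated from the sample.
    p_pos = (z_i_draws > 0).mean(axis=0)
    p_neg = (z_i_draws < 0).mean(axis=0)
    return np.where(p_pos > alpha, 1, np.where(p_neg > alpha, -1, 0))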

2.2. Selection of smoothing levels

How should one choose the smoothing levels (1) used in the multiresolution analysis (2) so that the decomposition would capture the scale-dependent features of x as well as possible? One alternative is simply to experiment with different choices of the λi's and to search for a good set interactively. In fact, as the "features" and the corresponding "scales" are somewhat subjective ideas, we believe that at least some user interaction is always necessary to select the set (1) in a useful manner. Still, with an appropriate choice of the smoother Sλ used in the analysis it is possible to gain helpful insight into what kind of smoothing level sets could be effective. The approach taken here is to use a roughness penalty smoother

S_λ = (I + λQ)^{−1}    (5)

that minimizes the penalized loss defined by Q,

S_λ x = argmin_u { ‖x − u‖² + λ u^T Q u }.

Here Q is a positive semidefinite matrix with the property that x^T Q x in some sense measures the "roughness" of x. As an example, for curves x = [x1, ..., xn]^T the spline smoother is of this type with x^T Q x = ∫ x″(t)² dt, where x(t) is the natural cubic spline with values xi at the knots (e.g. Green and Silverman, 1994). For the examples discussed in this paper we need smoothers of the type (5) for images and random fields defined on a sphere. These smoothers are described in Appendix B. With images, we employ the matrix Q that defines the smoother Sλ also in the prior (4) used for image reconstruction.

For a smoother (5), let

Q = Σ_{j=1}^{n} γ_j v_j v_j^T,

where 0 ≤ γ1 ≤ ··· ≤ γn are the eigenvalues of Q and v1, ..., vn are the corresponding orthonormal eigenvectors. The smooth of x can then be written as

S_λ x = Σ_{j=1}^{n} (1 + λγ_j)^{−1} (v_j^T x) v_j.    (6)

We note that v_i^T Q v_i = γ_i, so that the roughness of vi is measured by its associated eigenvalue. Looking at (6), the smoothing effect of Sλ on x can now be viewed as a consequence of the fact that increasing λ suppresses most the projections of x onto the vj's with the largest eigenvalues γj, that is, the rough components of x. In fact, denoting β_j(λ) = (1 + λγ_j)^{−1}, we see that while βj(λ) decreases as λ increases, the relative rate of decrease

|dβ_j(λ)/dλ| / β_j(λ) = γ_j / (1 + λγ_j)

at the same time increases with γj.
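For a moderate-sized penalty matrix the representation (6) can be used directly, as in the following sketch (Python/NumPy; a plain eigendecomposition is used here for illustration, whereas for the image sizes of Section 3 one would exploit fast transforms as noted in Appendix B).

import numpy as np

def eig_of_penalty(Q):
    # Q = sum_j gamma_j v_j v_j^T with eigenvalues in increasing order;
    # Q is assumed symmetric positive semidefinite.
    gamma, V = np.linalg.eigh(Q)
    return gamma, V

def smooth(x, lam, gamma, V):
    # S_lambda x = sum_j (1 + lam * gamma_j)^(-1) (v_j^T x) v_j, cf. (6);
    # lam = inf keeps only the null-space components of Q.
    if np.isinf(lam):
        shrink = (gamma == 0).astype(float)
    else:
        shrink = 1.0 / (1.0 + lam * gamma)
    return V @ (shrink * (V.T @ x))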

[Figure 2 panel labels. First column: x, y and the components x1, x2, x3, x4, x5. Third column: the posterior means E(z1|y), ..., E(z7|y).]

Figure 2: Decomposition of an artificial image into scale dependent details and the analysis of the posterior credibility of the features in the details. The original image is x and it was constructed as the sum of the components x1, ..., x5 shown in the first column. The noisy observed image is y. To show their features more clearly, the components are individually scaled in the second column. The details z1, ..., z7 found by scale space multiresolution analysis are summarized by their posterior means in the third column. The circles indicate the effective ranges of the smoothers Sλi and Sλi+1 that define the detail. The fourth column shows which detail features are (jointly) credible, blue and red indicating credibly positive and negative pixels, respectively.

Suppose now that Q has a null space of dimension n0, 1 ≤ n0 < n, that is, the rank of Q equals n − n0. Then 0 = γ_1 = ... = γ_{n_0} < γ_{n_0+1} ≤ ... ≤ γ_n, and (6) takes the form

S_λ x = Σ_{j=1}^{n_0} (v_j^T x) v_j + Σ_{j=n_0+1}^{n} (1 + λγ_j)^{−1} (v_j^T x) v_j.    (7)

The first L − 2 details in the decomposition (2) can then be written as

z_i = (S_{λ_i} − S_{λ_{i+1}}) x = Σ_{j=n_0+1}^{n} [(1 + λ_i γ_j)^{−1} − (1 + λ_{i+1} γ_j)^{−1}] (v_j^T x) v_j.    (8)

Setting λ = ∞ in (7) we see that the Lth detail is

z_L = S_{λ_L} x = S_∞ x = Σ_{j=1}^{n_0} (v_j^T x) v_j    (9)

and that the (L − 1)th detail is

z_{L−1} = (S_{λ_{L−1}} − S_{λ_L}) x = Σ_{j=n_0+1}^{n} (1 + λ_{L−1} γ_j)^{−1} (v_j^T x) v_j.    (10)

It follows from (8)-(10) that a detail zi can be expressed as

z_i = Σ_{j=1}^{n} α_j^{(i)} (v_j^T x) v_j,    (11)

where

α_j^{(i)} = 0 for 1 ≤ j ≤ n_0,   α_j^{(i)} = (1 + λ_i γ_j)^{−1} − (1 + λ_{i+1} γ_j)^{−1} for n_0 < j ≤ n,    (12)

for i = 1, ..., L − 2, and

α_j^{(L−1)} = 0 for 1 ≤ j ≤ n_0,   α_j^{(L−1)} = (1 + λ_{L−1} γ_j)^{−1} for n_0 < j ≤ n;   α_j^{(L)} = 1 for 1 ≤ j ≤ n_0,   α_j^{(L)} = 0 for n_0 < j ≤ n.    (13)

We refer to the L sequences α_i = [α_j^{(i)}]_{j=1}^{n}, i = 1, ..., L, as "tapering functions". Following the idea suggested by Nychka (2000) we may now choose the set (1) so that the supports of the tapering functions are approximately disjoint, consisting of roughly non-overlapping segments of the integers 1, ..., n. This will produce details that are somewhat orthogonal and that can therefore be hoped to correspond to features that are dissimilar in their characteristics.

To illustrate this idea, we consider again the example discussed in the previous section. The smoothing levels λi used to produce Figure 2 were in fact selected with the help of the tapering functions shown in Figure 3. The image size is 200 × 200 so there are 40 000 eigenvalues in all. We selected seven λi's, including the values λ1 = 0 and λ7 = ∞. The corresponding seven tapering functions are shown in Figure 3 using different line types and colors. The eigenvalue index is on the horizontal axis and the ranges [λi, λi+1] associated with the details zi are given in the legend. The rank of the matrix Q used is n − 1 (see Appendix B), so that n0 = 1 and the tapering function corresponding to the detail z7, the image mean, is the delta spike on the left.

Figure 3: The tapering functions (12) and (13) corresponding to the seven details found for the test image x of Figure 2 are shown as curves of different colors and line types. The horizontal axis shows the index of the smoothing operator eigenvalue γi in (6). The legend displays the smoothing range [λi, λi+1] of each detail.

The smallest scale details, represented by the two topmost panels in the middle column of Figure 2, correspond to the solid blue and the dashed green tapering functions that involve the largest eigenvalues (largest indices), that is, the roughest eigenvectors. Similarly, large scale details involve small eigenvalues or smooth eigenvectors. As can be seen from Figure 3, disjointness of the tapering function supports is satisfied only partly. Also, in order to produce large scale details that correspond well to the actual image components, the tapering function corresponding to the range [10^8, ∞] was chosen to lie completely inside the support of the tapering function associated with the range [10^6, 10^8]. This demonstrates the need for user interaction in the selection of appropriate pairs of smooths. Still, the idea of non-overlapping tapering functions offers at least a good starting point for the search of useful sets of smoothing levels.

This smoothing level selection method depends only on the eigenvalues of the matrix Q and therefore only on the particular smoother Sλ used, as well as the dimensions of the signal analyzed. It seems reasonable that better results could be obtained by taking into account in the selection of the λi's also the structure of the underlying signal x. This could be accomplished by considering the signal-dependent tapering functions [α_j^{(i)} (v_j^T x)]_{j=1}^{n} (cf. (11)). However, as the underlying signal is unknown, we replace it by the posterior mean E(x|y) and define the signal-dependent tapering functions as

α̃_i = [α̃_j^{(i)}]_{j=1}^{n},   α̃_j^{(i)} = α_j^{(i)} (v_j^T E(x|y)),   i = 1, ..., L.    (14)
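The tapering functions (12)-(13) and their signal-dependent versions (14) are easy to tabulate once the eigendecomposition of Q is available. A sketch, continuing the notation of the previous code fragment and assuming a posterior mean vector x_mean for E(x|y):

import numpy as np

def tapering_functions(gamma, lambdas):
    # Rows are the tapering functions alpha_1, ..., alpha_L of (12)-(13);
    # gamma are the eigenvalues of Q (zeros first), lambdas = [0, ..., inf].
    def beta(lam):
        if np.isinf(lam):
            return (gamma == 0).astype(float)
        return 1.0 / (1.0 + lam * gamma)
    alphas = [beta(lambdas[i]) - beta(lambdas[i + 1]) for i in range(len(lambdas) - 1)]
    alphas.append(beta(np.inf))      # alpha_L picks out the null space (the mean)
    return np.array(alphas)

def signal_dependent_tapering(alphas, V, x_mean):
    # Signal-dependent tapering functions (14): alpha_j^(i) * (v_j^T E(x|y)).
    return alphas * (V.T @ x_mean)[None, :]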

Figure 4: Signal-dependent tapering functions (14) corresponding to the seven details found for the test image x of Figure 2. For description of the graphics, see the caption of Figure 3.

Figure 4 presents these tapering functions for our test image x shown in Figure 2 using the smoothing levels [0, 1, 100, 10^4, 10^6, 10^8, ∞]. The upper limit of the vertical axis was truncated to 1500 in order to show the whole structure of the tapering functions more clearly. The values of the components α̃_j^{(i)} vary wildly and the salient features of the tapering functions are hard to summarize. However, with careful inspection one in fact discovers that the tapering functions do have different ranges where their absolute values are larger than those of the others.

As the signal-dependent tapering functions in this as well as in other examples tested are too irregular for visual selection of appropriate smoothing levels, a more formal approach is needed. We propose to use optimization of a suitable objective function with respect to the λi's to achieve rough orthogonality of the tapering functions. In order to avoid additional complexity associated with non-linear optimization in a high-dimensional space, for the purposes of demonstration we show how to optimize a signal multiresolution decomposition consisting of just four terms corresponding to a smoothing parameter sequence [0, λ2, λ3, ∞]. We minimize the objective function

G(λ2, λ3) = Σ_{i,j=1,2,3; i<j} |α̃_i^T α̃_j| / (‖α̃_i‖ ‖α̃_j‖).    (15)

The result of the minimization for the test image x of Figure 2 is shown in Figure 5.

10 109

107

105 2 λ 103

101

10−1

Figure 5: Minimization of the objective function (15) for the test image x of Figure 2. The minimum point is indicated by the white diamond.

For the analysis summarized in Figure 6 we then used the smoothing parameter sequence [0, 1, 100, 5 · 10^5, ∞], obtained by adding an extra small level λ = 1 and thus letting the minimizers of G assume the roles of the parameters λ3 and λ4. The value λ = 1 was added in order to detect also the smallest squares. This again demonstrates the need for at least some user input in the selection of an appropriate smoothing level sequence. The details z1 and z5 were omitted from Figure 6 since z1 contained nothing credible and z5 (the mean) was just all blue. We note that the overall content of the credibility maps is similar to those in Figure 2 with only the largest scale gradient missing. This can be explained by the use of just four details in the multiresolution analysis.
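In practice the minimization of (15) can be carried out by a simple grid search over (λ2, λ3), for example on a logarithmic grid. The sketch below reuses the functions of the previous code fragments; the eigenvalues gamma, eigenvectors V, posterior mean x_mean and the grid bounds are assumptions for illustration.

import numpy as np
from itertools import combinations

def G(lam2, lam3, gamma, V, x_mean):
    # Objective (15): sum of absolute cosines between the signal-dependent
    # tapering functions of the four-term decomposition [0, lam2, lam3, inf].
    a = signal_dependent_tapering(
        tapering_functions(gamma, [0.0, lam2, lam3, np.inf]), V, x_mean)
    return sum(abs(a[i] @ a[j]) / (np.linalg.norm(a[i]) * np.linalg.norm(a[j]))
               for i, j in combinations(range(3), 2))

grid = np.logspace(-1, 9, 41)        # candidate smoothing levels
best = min(((l2, l3) for l2 in grid for l3 in grid if l2 < l3),
           key=lambda p: G(p[0], p[1], gamma, V, x_mean))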

3. Experiments

3.1. A digital image

Our first real data example is a 284 × 400 digital camera image of a desk with a sketch pad and some office supplies. The sketch pad has some writing on it. The intensity range of the image is [28, 230] and the "noisy image" was obtained by adding independent and identically distributed Gaussian noise with standard deviation 10. The original and the noisy images are displayed in the first column of Figure 7.

We used the prior (4) for x and an Inv-χ² prior for σ². The values of the parameters λ0, σ0² and ν0 were 0.2, 8.9², and 10, respectively. The noise prior mean is equal to the actual noise variance in the image but, as in the case of the artificial example of Figure 2, the prior has little influence on the posterior. The choice of λ0 is more important and a rather small value was selected in order not to smooth out the finest image features in the reconstruction phase. The size of the posterior sample generated was 3000. The minimum of the objective function (15) was at [354, 6.39 · 10^5]. By trying these values we concluded that the minimizing value for λ2 was too large. However, G is quite flat near the minimum and we therefore chose to use a smaller value for λ2 for which G(λ2, λ3) still remained relatively small. The set of smoothing levels used in the multiresolution analysis was then [0, 1, 30, 6 · 10^5, ∞].


Figure 6: Decomposition of the artificial image x of Figure 2 into scale dependent details and the analysis of the posterior credibility of the features in the details. Smoothing levels were chosen partially automatically based on signal-dependent tapering functions. The details z1 and z5 are omitted. For more information, see the text and the caption of Figure 2.

As with the artificial test image, a small value was included in the smoothing level sequence so that also the smallest features could be detected. Of the sharpest features, the sketch pad perforation is captured in z1. The grid of the cross-ruled paper and the paper clips appear in the detail z2. In z3 the mathematical formulas become clearly visible while the cross-ruled pattern is smoothed out. Finally, in z4 the paper is divided diagonally into positive and negative halves. This feature can be seen to correspond to the lighting gradient of the original image. The fourth panel of the first column shows how one may correct for such a distortion by subtracting the mean of this detail from the reconstructed image. The detail z5 is the overall mean of the image and of course credibly positive and therefore omitted. To help visually assess the level of noise in the image relative to the smallest scale details, the last row of Figure 7 displays a magnification of the square area indicated by the yellow square in the noisy image.

3.2. Climate prediction fields

Our second application involves the analysis of climate model outputs. Numerical experiments based on atmosphere–ocean general circulation models (AOGCMs or simply GCMs) are one of the primary tools in deriving projections for future climate change. Although each GCM has the same underlying partial differential equations, modeling large scale effects, they have different small scale parameterizations and different discretizations to solve the equations, resulting in different climate projections. This motivates climate projections synthesized from the output of several GCMs.

Present day and future climate projections were combined in a single high-dimensional hierarchical Bayes model that separates the spatial response into a large scale climate change signal and an isotropic process representing small scale variability among GCMs.

[Figure 7 panel labels, by row: x and E(z1|y); y and E(z2|y); E(x|y) and E(z3|y); E(x|y) − E(z4|y) and E(z4|y); close-ups of y and E(z2|y).]
Figure 7: Decomposition of a digital image into scale dependent details and the analysis of the posterior credibility of the features of the details. The fourth panel in the first column shows the reconstructed image corrected for the background gradient. To better assess visually the noise level in the image, the last row shows a close-up of the square area depicted in the noisy image y. For more information, see the text and the caption of Figure 2.

Samples from the posterior distributions are obtained with computer-intensive MCMC simulations. For this paper, we consider the predicted seasonal (boreal winter, i.e., December, January, and February) global surface temperature change between 1980–2000 (present day) and 2080–2100 (future). Further, the "business as usual" emission scenario (A1B, see also Nakićenović and Swart, 2000) was used. This scenario of world future development lies somewhere between the most pessimistic and the most optimistic possible alternatives. For details about the statistical model we refer to Furrer et al. (2007).

We sampled 500 temperature fields from the posterior p(x|data) of the consensus climate change x. The "data" now consist of the simulated global temperature change fields produced by the prediction models. The field size was 32 × 72 with grid box centers at 5 degree intervals. Thus, all longitudes were included and only the latitudes above 80 degrees and below −80 degrees were omitted.

Scale space multiresolution analysis can now be applied directly to the posterior sample without any further need for modeling. However, smoothing must be carried out in a manner that takes into account the fact that the data actually lie on a sphere. The smoother used is described in Appendix B. Figure 8 shows the sample mean of the 500 fields and four individual sample fields. The overall pattern in these samples appears to be captured in the mean quite well but there is sample-specific variation in the details. The question is: which patterns in the data are "really there" and what is just random variation?

The answer provided by multiresolution scale space analysis is shown in Figure 9. The role of colors in the credibility maps shown in the right hand column is now switched so that blue and red correspond to negative (colder) and positive (warmer) field values, respectively. The set of smoothing levels used was [0, 2 · 10^−4, 8 · 10^−2, ∞], obtained by selecting a point (λ2, λ3) in a valley of small values of the function (15). The first two detail fields represent credible change in the northern latitudes concentrating in specific sea areas. These are the areas where the climate models have most difficulties in making good predictions because of the uncertainties related to the presence of sea ice and the difficulties in modeling sea ice in the GCMs. The next largest detail is a south–north temperature gradient and the largest scale detail is just the mean global temperature change. Thus, according to this analysis, the pattern of global temperature change can be decomposed into three credible sub-patterns: a constant that represents an overall rise in the global mean temperature, a south–north gradient and a more complex pattern that concentrates in the northern latitudes.

4. Comparison with other methods

Many techniques are available for the reconstruction of a degraded signal, including statistical methods based both on frequentist and Bayesian ideas. In image analysis, descriptions of various methods related to the signal reconstruction approach adopted in this paper can be found for example in Kaipio and Somersalo (2004), Peyré et al. (2008), Winkler (2006) and their references. Multiscale segmentation has also been discussed in many papers, including Kolaczyk et al. (2005) and Salembier and Serra (1992).



Figure 8: Global temperature change fields as predicted by the consensus of 21 climate models. The sample mean (top row) and four individual sample fields (the four other panels).

The most straightforward way to find image features is to apply a threshold to the pixel intensity values of the reconstructed image. More sophisticated alternatives include methods like MSER (Maximally Stable Extremal Regions) (Matas et al., 2002) which has a multiscale character in that it is based on thresholding the image using several threshold values. The conclusion of Holmström and Pasanen (2008) was that the Bayesian scale space smoothing approach compared favorably with this and other thresholding methods. The benefits of scale space smoothing included modeling flexibility, better performance in more difficult situations such as in the presence of heteroskedastic noise, and a sound probabilistic interpretation of the salient image features.

Still, when discussing multiresolution analysis, probably the most natural alternative to consider is wavelets (e.g. Vidakovic, 1999). Similarly to the smoothing based details considered here, the distinct resolution levels of a signal's wavelet decomposition could be interpreted as features in different scales. So, how exactly do the scale space and the wavelet analyses compare? A useful discussion on this question can be found in Section 2.9 of Lindeberg (1994) and we include some of its relevant observations here.

First, the scale space details are found by smoothing that preserves the same signal sampling density through scales, whereas in a wavelet multiresolution analysis the grid sizes decrease progressively as resolution is diminished. For this reason wavelets can be very useful in signal compression, whereas scale space analysis can be said to be maximally redundant. From our point of view the efficiency of representation is not important as our goal is to make explicit certain features of the signal rather than to aim at a representation that is in some sense computationally optimal.

Figure 9: A scale space multiresolution analysis of global winter temperature change from 1980–2000 to 2080–2100 as predicted by the consensus of 21 climate prediction models. Four details are used in the multiresolution decomposition. The left hand column summarizes the details as posterior means and the right hand column shows the corresponding credibility maps. As opposed to other credibility analyses in this paper, the role of colors has been switched so that blue and red correspond to negative (colder) and positive (warmer) field values, respectively.

Second, for a signal defined on a grid (such as an image), wavelet representation typically uses a discrete dyadic set of scales whereas in scale space analysis a continuum of resolutions can be explored, making tracking of features across scales easier. Of course, a continuous wavelet transform could be used for a similar purpose. Further, for example in the case of images, any image size can be readily handled with no need for additional adjustments, whereas basic wavelet analyses typically assume dyadic image dimensions. Also, we believe that our scale space method can be rather easily extended beyond regular data grids to more general spatial settings involving for example data on graphs, where the pertinent statistical tools are already available (e.g. Rue and Held, 2005).

To see how the proposed scale space approach compares with wavelets in practice, we analyzed with both methods the 256 × 256 image of John Lennon shown in the first panel of the first row of Figure 10. We experimented with several wavelet families found in the Matlab Wavelet Toolbox, including Daubechies wavelets, Coiflets and Symmlets. Of these, the Coiflet 4 wavelets appeared to perform at least as well as the others so we report results using them. Another type of wavelet family tested were the W-wavelets introduced by Kwong and Tang (1994) and used for example by Nychka et al. (2002). The code for W-wavelets is available in the Fields R-package for spatial data analyses (Fields Development Team, 2006). Like the smoothing based details discussed in this paper, the W-wavelet decomposition is not completely orthogonal and the associated scaling and wavelet functions have a smooth appearance. These wavelets therefore offer a rather natural comparison with our scale space approach.

The results of our analyses are shown in Figure 10. The two leftmost columns present the details derived from the Coiflet 4 wavelets and the W-wavelets, respectively, based on combinations of the resolution levels 128, 64, 32, and 16. The decomposition produced by simply summing within each level the three component images corresponding to the horizontal, vertical and diagonal details had a noisy appearance (results not shown) and better results were obtained by summing these components also across the levels. The best results were obtained from the combinations 128+64 and 32+16 and they are shown as the top and the middle panels of the two first columns. The images on the bottom are the parts corresponding to the scaling function and therefore the three images in each column sum up to the original image. In the third column are the differences of the smooths corresponding to the levels [0, 10, 10^3, 10^5, ∞], excluding the mean. In this example the signal-dependent tapering functions were not very useful in smoothing level selection and these levels were therefore selected by trial and error with the help of Figure 11 that uses the tapering functions (12) and (13). Note that although substantial user input could here be viewed as a handicap, it, on the other hand, makes possible fine tuning on a level not possible with wavelets that are bound to a discrete ladder of resolutions.

While the details obtained using all three methods are somewhat similar, the wavelet decompositions do look somewhat noisy and they contain artifacts that do not correspond to clear image features. On the other hand, the details obtained as differences of smooths are very natural.
In the smallest scale the sharpest edges in the image are visible. In the second level the face of John Lennon is easy to recognize. In the next level a more generic human face emerges and, finally, the last detail is just a light oval against a dark background, giving a very coarse description of the image structure. In all, the decomposition of the image features into scale-dependent components is quite satisfying.

We also tested how well the salient features can be found under noisy conditions by adding to the Lennon image independent Gaussian noise with standard deviation 20. The intensity range of the original image is [0, 192] and the noisy image is shown in the right hand panel of the first row of Figure 10. We used the smoothing prior (4) with λ0 = 0.4 for x and an Inv-χ²(10, 17.9²) prior for σ². The sample size generated from the posterior of x was 5000. The credibility maps shown in the rightmost column indicate that, with the exception of the sharpest details, most of the relevant features are found to be credible.

5. Summary

A multiresolution analysis of random signals based on Bayesian scale space smoothing was proposed. The method consists of three steps: Bayesian signal synthesis or reconstruction, forming of scale-dependent signal detail components using differences of smooths at neighboring scales, and posterior credibility analysis of the features in these details. All these steps can be carried out using other methods but the suggested approach is believed to offer clear benefits, including flexible statistical modeling and a clear probabilistic interpretation of the scale-dependent signal features through inference based on the joint posterior probability of the feature patterns. The visualization methods developed by Holmström and Pasanen (2008) for inferences on image differences proved to be effective here, too.

A key ingredient of the proposed method is the selection of the smoothing levels that define the multiresolution details and our general guidelines for this can be summed up as follows. First, we recommend only 2 or 3 intermediate smoothing levels in addition to the fixed values λ1 = 0 and λL = ∞, thus keeping the number of levels L relatively small. With too many levels some of the details will probably describe similar structure (e.g. details z4 and z5 in Figure 2) or some meaningful features may in fact be missed. One can then search for useful smoothing levels of the form λ_i = τ^i, where for instance i ∈ {1, ..., 5} and τ ∈ {5, 10, 25, 100} give a rough idea of interesting levels in the examples considered in this paper. A more formal approach that we have found to work well is offered by the computational and graphical tools described in Section 2.2. Still, no matter how one first finds reasonable initial values for the smoothing levels, for best results some fine tuning by the user is usually required. We have also observed that, in order to capture very fine scale features, it is sometimes useful to insert a small value, such as λ = 1, in the final smoothing level sequence.

The performance of the method was demonstrated using an artificial image, two digital images and a posterior global temperature change field based on computer model outputs. In an image analysis test our method was competitive against basic dyadic scale wavelet analyses.

Appendix A. Inference on credible features

Consider a detail zi, a vectorization of an array [zs]s∈I . To simplify notation the index i is left out from the components zs. We explain here how to find subsets of I on which the components zs differ jointly credibly from zero.

[Figure 10 column headings: Coiflet 4 wavelets, W-wavelets, diff. of smooths, HPW-maps.]

Figure 10: Decompositions of an image of John Lennon using two different wavelet methods and differences of smooths. The rightmost column shows a noisy version of the Lennon image (top panel) and credibility analysis of the details found using scale space multiresolution analysis. Here white and black indicate credibly positive and negative pixels, respectively. For more information, see the text.

Figure 11: Tapering functions for the Lennon image. For description of the graphics, see the caption of Figure 3.

Decompose first the index set I into three disjoint subsets,

I^b = {s | P(z_s > 0 | data) ≥ α},
I^r = {s | P(z_s < 0 | data) ≥ α},    (A.1)
I^g = I \ (I^b ∪ I^r).

Here α is the level of credibility considered in the inference. All the probabilities here and below can be easily found using simulation. If each location were treated independently of the others, then the location s would be flagged blue if s ∈ I^b, red if s ∈ I^r, and gray if s ∈ I^g. However, in the simultaneous inference applied in this paper we seek a decomposition of the index set I into three disjoint subsets J^b, J^r, and J^g = I \ (J^b ∪ J^r), so that

P(z_s > 0 for s ∈ J^b and z_s < 0 for s ∈ J^r | data) ≥ α.    (A.2)

The location s is then colored blue, red or gray depending on whether s ∈ J^b, s ∈ J^r or s ∈ J^g. Clearly J^b ⊂ I^b, J^r ⊂ I^r, and hence J^g ⊃ I^g, where I^b, I^r and I^g are defined in (A.1). The search for jointly credible features of the detail z_i therefore amounts to finding suitable subsets J^b ⊂ I^b, J^r ⊂ I^r. Note that the choice of such subsets is in general not unique. The two simultaneous inference methods described below were first suggested in a one dimensional setting (for curves) by Erästö and Holmström (2005). The first method is based on considering the highest pointwise probabilities and it strives to maximize the number of points colored blue or red. The second method employs simultaneous credible intervals centered on the posterior means. This method flags pixels credible more conservatively and we used it in the example of Figure 2, whereas the first method was used in all other examples.

Appendix A.1. Highest pointwise probabilities

For a set S, denote by |S| the number of its elements and let N = |I^b| + |I^r|, where I^b and I^r are defined in (A.1). Let E_s denote the event z_s > 0 | data when s ∈ I^b and, correspondingly, the event z_s < 0 | data when s ∈ I^r. Let s_1, ..., s_N be a permutation of the locations s in I^b ∪ I^r for which

P(E_{s_1}) ≥ P(E_{s_2}) ≥ ··· ≥ P(E_{s_N}) ≥ α    (A.3)

and let

k = max{l | P(E_{s_1} & ··· & E_{s_l}) ≥ α}.    (A.4)

Then, if l ∈ {1, ..., k}, we flag s_l blue or red depending on whether s_l ∈ I^b or s_l ∈ I^r, and the rest of the pixels s are flagged gray.
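Working from a posterior sample of a detail (one row per draw, one column per location), this search can be sketched as follows (Python/NumPy); because the joint probability can only decrease as events are added, the loop stops at the first drop below α.

import numpy as np

def hpw_flags(z_draws, alpha=0.95):
    # Highest pointwise probability flagging: +1 (blue), -1 (red), 0 (gray).
    p_pos = (z_draws > 0).mean(axis=0)
    p_neg = (z_draws < 0).mean(axis=0)
    sign = np.where(p_pos >= alpha, 1, np.where(p_neg >= alpha, -1, 0))
    cand = np.flatnonzero(sign != 0)
    order = cand[np.argsort(-np.maximum(p_pos, p_neg)[cand])]   # ordering (A.3)
    flags = np.zeros(z_draws.shape[1], dtype=int)
    joint = np.ones(z_draws.shape[0], dtype=bool)
    for s in order:
        event = z_draws[:, s] > 0 if sign[s] == 1 else z_draws[:, s] < 0
        if (joint & event).mean() < alpha:                       # criterion (A.4)
            break
        joint &= event
        flags[s] = sign[s]
    return flags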

Appendix A.2. Simultaneous credible intervals

Let Δ > 0 satisfy

P( max_{s∈I} |z_s − E(z_s | data)| / Std(z_s | data) ≤ Δ | data ) = α.    (A.5)

Define

J^b = {s | E(z_s | data) − Δ Std(z_s | data) > 0},
J^r = {s | E(z_s | data) + Δ Std(z_s | data) < 0}.

Then (A.2) is clearly satisfied.
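A corresponding sketch for the simultaneous credible intervals, with Δ taken as the empirical α-quantile of the maximal standardized deviation in (A.5):

import numpy as np

def sci_flags(z_draws, alpha=0.95):
    # Simultaneous credible interval flagging: +1 (blue), -1 (red), 0 (gray).
    mean = z_draws.mean(axis=0)
    std = z_draws.std(axis=0, ddof=1)
    max_dev = np.abs((z_draws - mean) / std).max(axis=1)   # max over locations, per draw
    delta = np.quantile(max_dev, alpha)
    blue = mean - delta * std > 0
    red = mean + delta * std < 0
    return np.where(blue, 1, np.where(red, -1, 0))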

Appendix B. The smoothers used

We describe here the smoothers used in this paper. Consider first a curve x(t) of which we have noisy observations y_i = x(t_i) + ε_i, i = 1, ..., n. The Nadaraya-Watson kernel smoother (Watson, 1964) estimates x(t) by local weighted averaging as

x_λ(t) = Σ_{i=1}^{n} K((t − t_i)/λ) y_i / Σ_{i=1}^{n} K((t − t_i)/λ),

where a typical choice for the kernel function K is the Gaussian and the smoothing parameter λ > 0 controls the level of smoothing used in the estimate. When evaluated at the grid points t_i, the smooth x_λ = [x_λ(t_1), ..., x_λ(t_n)]^T can be expressed as x_λ = S_λ y, where y = [y_1, ..., y_n]^T and S_λ is an n × n smoothing matrix. Most popular smoothers can in fact be formulated in terms of such a smoothing matrix (e.g. Wand and Jones, 1995, Green and Silverman, 1994). In the example of Figure 1, the Nadaraya-Watson smoother was applied to the noiseless curve x itself.

The general form of the smoothers used in the artificial example of Section 2 as well as in the applications of Section 3 is S_λ = (I + λQ)^{−1}, where x^T Q x in some sense measures the "roughness" of x (cf. Section 2.2). Consider first an image or, more generally, a planar random field x defined on a regular equispaced two-dimensional grid with n grid points. For two grid point locations (pixels) s and t, write s ∼ t if they are neighbors. The matrix Q we use in this case is defined by

x^T Q x = Σ_t ( Σ_{s∼t} x_s − 4 x_t )²,    (B.1)

where the inner summation is over all unordered pairs of neighboring locations. In order to have four neighbors also at a boundary location t, the boundary values of x are actually extended beyond the original grid.

This Neumann boundary condition modifies Q accordingly. The rank of Q is then n − 1 and its null space consists of constants. The rank deficiency is reflected in the formula (4) for the prior of x. We note that x^T Q x = ‖Cx‖², where Q = C^T C and the matrix C can be interpreted as the discrete Laplacian. The resulting smoother therefore penalizes for roughness as measured by the variability of discrete second partial derivatives. Eigenanalysis needed for the tapering function method of Section 2.2 is facilitated by the use of the two dimensional discrete cosine transformation to speed up the computations.
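For small grids the matrix Q of (B.1) can be assembled explicitly. A sketch (Python/NumPy) that builds C row by row with the reflecting boundary extension and returns Q = C^T C; for the image sizes used in Section 3 one would instead rely on the discrete cosine transform mentioned above.

import numpy as np

def image_penalty_matrix(nrows, ncols):
    # Row t of C evaluates the discrete Laplacian sum_{s~t} x_s - 4 x_t at
    # pixel t, with values outside the grid taken from the nearest boundary pixel.
    n = nrows * ncols
    C = np.zeros((n, n))
    def idx(r, c):
        return r * ncols + c
    for r in range(nrows):
        for c in range(ncols):
            t = idx(r, c)
            C[t, t] = -4.0
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr = min(max(r + dr, 0), nrows - 1)
                cc = min(max(c + dc, 0), ncols - 1)
                C[t, idx(rr, cc)] += 1.0
    return C.T @ C                                      # Q = C^T C, rank n - 1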

In Section 3.2 we need smoothing on a sphere. A simple approach would be a Nadaraya-Watson type smoother with a symmetric unimodal probability density function, like the Fisher density, as a kernel. This in fact works fine except that the smoother is not of the type (5) that would allow the use of tapering functions for smoothing level selection. We therefore used a smoother based on an analogue of (B.1) on the sphere. Let 0 ≤ φ ≤ π and 0 ≤ θ < 2π be the polar and the azimuth angles of a point on the sphere, respectively. Then, given a function u(φ, θ) defined on the sphere, its Laplacian can be defined as

B(u) = (1/sin φ) ∂/∂φ ( sin φ ∂u/∂φ ) + (1/sin² φ) ∂²u/∂θ².    (B.2)

A simpler functional that also involves partial derivatives with respect to the two spherical coordinates is

C(u) = ∂²u/∂φ² + (1/sin² φ) ∂²u/∂θ²,    (B.3)

where the factor 1/sin² φ compensates for the dependence of the radius of the latitude circle on the polar angle. Since use of these two functionals produced very similar results, we adopted the simpler expression (B.3). For discretization we used the ideas suggested by Lai and Wang (2002). Our temperature change field is specified as values u(φ_j, θ_k) on an M × N grid, equispaced in the coordinates φ and θ. Since, for a fixed φ, the function u is periodic in θ, we can approximate it using the truncated Fourier series

u(φ, θ_k) = Σ_{j=−N/2}^{N/2−1} a_j(φ) e^{ijθ_k},    a_j(φ) = (1/N) Σ_{k=0}^{N−1} u(φ, θ_k) e^{−ijθ_k}.    (B.4)

Here θk = 2kπ/N and these formulas define the discrete Fourier transform (DFT) [aj (φ)]j of the vector [u(φ, θk)]k and its inverse transform. Using this expression for u, the functional C(u) in (B.3) becomes

Σ_{j=−N/2}^{N/2−1} [ d²a_j(φ)/dφ² − (j²/sin² φ) a_j(φ) ] e^{ijθ_k}.    (B.5)

To discretize the above expression, consider a fixed j and denote a_{l,j} = a_j(φ_l), where φ_l = φ_1 + (l − 1)δ, l = 1, ..., M, and δ = (φ_M − φ_1)/(M − 1). Here 0 < φ_1 < φ_M < π, because the poles are not included in our grid. Using centered differences the expression inside the square brackets in (B.5) then becomes

(a_{l+1,j} − 2a_{l,j} + a_{l−1,j})/δ² − (j²/sin² φ_l) a_{l,j},    (B.6)

for l = 1, ..., M. In the spirit of the Neumann boundary conditions used for a flat grid we define here a_{0,j} = a_{1,j} and a_{M+1,j} = a_{M,j}. Define now a_j = [a_{1,j}, ..., a_{M,j}]^T and

D_j = (1/δ²) F − j² diag(1/sin² φ_1, ..., 1/sin² φ_M),

where F is the discrete second differencing matrix. Then the expressions (B.6) for l = 1, ..., M are obtained as components of the vector D_j a_j. Let us then vectorize the array u columnwise into an MN × 1 vector U and denote by D the MN × MN block diagonal matrix composed of the matrices D_j, j = −N/2, ..., N/2 − 1. Then the discrete version of the functional C(u) in (B.3) can be written as CU, where

C = NV∗DV, (B.7)

V = V_N ⊗ I_M, V_N is the N × N DFT matrix, I_M the M × M identity matrix and V* is the conjugate transpose of V. Here we use the fact that V^{−1} = N V*. The matrix C is the desired discretized differential operator and we define Q = C*C in the smoother (5). This means that we measure the roughness of u by U^T Q U = ‖CU‖², which approximately equals the sum of the quantities C(u)² evaluated at the grid points. Using symmetry properties of the DFT and the matrix D it is easy to see that the components of CU are real for any real U. Therefore Q is a real matrix. In our particular example the matrix Q is full and quite large. However, the identity Q = N V* D^T D V and the fact that D^T D has a block diagonal structure greatly speeds up the eigenanalysis of Q needed for the tapering function based multiresolution analysis.
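For completeness, the blocks D_j of the discretized operator are straightforward to assemble. A sketch under the grid conventions above (poles excluded, Neumann ends a_{0,j} = a_{1,j} and a_{M+1,j} = a_{M,j}); the grid parameters are assumptions of the example.

import numpy as np

def spherical_blocks(M, N, phi1, phiM):
    # D_j = F / delta^2 - j^2 diag(1/sin^2 phi_1, ..., 1/sin^2 phi_M),
    # j = -N/2, ..., N/2 - 1, cf. (B.6).
    phi = np.linspace(phi1, phiM, M)
    delta = (phiM - phi1) / (M - 1)
    F = -2.0 * np.eye(M) + np.eye(M, k=1) + np.eye(M, k=-1)
    F[0, 0] += 1.0       # a_{0,j} = a_{1,j}
    F[-1, -1] += 1.0     # a_{M+1,j} = a_{M,j}
    S = np.diag(1.0 / np.sin(phi) ** 2)
    return {j: F / delta ** 2 - j ** 2 * S for j in range(-N // 2, N // 2)}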

References

Chaudhuri, P., Marron, J. S., 1999. SiZer for exploration of structures in curves. Journal of the American Statistical Association 94 (447), 807–823.

Chaudhuri, P., Marron, J. S., 2000. Scale space view of curve estimation. The Annals of Statistics 28 (2), 408–428.

Erästö, P., Holmström, L., 2005. Bayesian multiscale smoothing for making inferences about features in scatter plots. Journal of Computational and Graphical Statistics 14 (3), 569–589.

Erästö, P., Holmström, L., 2007. Bayesian analysis of features in a scatter plot with dependent observations and errors in predictors. Journal of Statistical Computation and Simulation 77 (5), 421–431.

Fields Development Team, 2006. fields: Tools for Spatial Data. http://www.cgd.ucar.edu/Software/Fields, National Center for Atmospheric Research, Boulder, CO.

Furrer, R., Sain, S. R., Nychka, D. W., Meehl, G. A., 2007. Multivariate Bayesian analysis of atmosphere-ocean general circulation models. Environmental and Ecological Statistics 14 (3), 249–266.

Godtliebsen, F., Øigård, T., 2005. A visual display device for significant features in complicated signals. Computational Statistics & Data Analysis 48 (2), 317–343.

Green, P. J., Silverman, B. W., 1994. Nonparametric Regression and Generalized Linear Models. A roughness penalty approach. Chapman & Hall.

Holmström, L., 2010. Scale space methods. Wiley Interdisciplinary Reviews: Computational Statistics 2 (2), 150–159, available on-line at http://dx.doi.org/10.1002/wics.79.

Holmström, L., Pasanen, L., 2007. Bayesian analysis of image differences in multiple scales. In: Niskanen, M., Heikkilä, J. (Eds.), Proceedings, Finnish Signal Processing Symposium 2007, August 30, Oulu, Finland. University of Oulu, Department of Electrical and Information Engineering, CD-ROM, ISBN 978-951-42-8546-2.

Holmström, L., Pasanen, L., 2008. Bayesian scale space analysis of differences in images. http://cc.oulu.fi/~llh/preprints/iBSiZer.pdf, submitted for publication.

Kaipio, J., Somersalo, E., 2004. Statistical and Computational Inverse Problems. Applied Mathematical Sciences. Springer Verlag, Berlin.

Kolaczyk, E. D., Ju, J., Gopal, S., December 2005. Multiscale, multigranular statistical image segmentation. Journal of the American Statistical Association 100 (472), 1358–1369.

Kwong, M. K., Tang, P. T. P., 1994. W-matrices, nonorthogonal multiresolution analysis, and finite signals of arbitrary length. Tech. rep., Mathematics and Computer Science Division, Argonne National Laboratory.

Lai, M.-C., Wang, W.-C., 2002. Fast direct solvers for Poisson equation on 2D polar and spherical geometries. Numerical Methods for Partial Differential Equations 18 (1), 56–68.

Lindeberg, T., 1994. Scale-Space Theory in Computer Vision. Kluwer Academic Publishers.

Matas, J., Chum, O., Urban, M., Pajdla, T., 2002. Robust wide baseline stereo from maximally stable extremal regions. In: British Machine Vision Conference. Vol. 1. pp. 384–393.

Nakićenović, N., Swart, R. (Eds.), 2000. Special Report on Emission Scenarios. Intergovernmental Panel on Climate Change, Cambridge University Press, 599 pp.

Nychka, D., 2000. Spatial process estimates as smoothers. In: Schimek, M. G. (Ed.), Smoothing and Regression. Approaches, Computation and Application. John Wiley & Sons, pp. 393–425.

Nychka, D., Wikle, C., Royle, J., 2002. Multi-resolution models for nonstationary spatial covariance functions. Statistical Modelling 2 (4), 315–331.

Pasanen, L., Holmström, L., 2008. Bayesian Scale Space Analysis of Image Differences. In: Proceedings of the 2008 Joint Statistical Meetings, Section on Statistical Computing. Denver, Colorado, USA, pp. 1786–1793.

Peyré, G., Bougleux, S., Cohen, L. D., 2008. Non-local regularization of inverse problems. In: ECCV (3). pp. 57–68.

Rue, H., Held, L., 2005. Gaussian Markov Random Fields: Theory and Applications. Vol. 104 of Monographs on Statistics and Applied Probability. Chapman & Hall, London.

Salembier, P., Serra, J. C., 1992. Morphological multiscale image segmentation. Vol. 1818. SPIE, pp. 620–631.

Tukey, J. W., 1977. Exploratory Data Analysis. Addison-Wesley, Reading, Massachusetts.

Vidakovic, B., 1999. Statistical Modeling by Wavelets. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., New York.

Wand, M. P., Jones, M. C., 1995. Kernel smoothing. Chapman & Hall, London.

Watson, G., 1964. Smooth regression analysis. Sankhya, Ser. A 26, 359–372.

Winkler, G., 2006. Image Analysis, Random Fields and Markov Chain Monte Carlo Methods: A Mathematical Introduction. Stochastic Modelling and Applied Probability. Springer Verlag, Berlin.
