John von Neumann Institute for Computing

Implementation of Anisotropic Nonlinear Diffusion for Filtering 3D Images in Structural Biology on SMP Clusters

S. Tabik, E.M. Garzon,´ I. Garc´ıa, J.J. Fernandez´

published in Parallel Computing: Current & Future Issues of High-End Computing, Proceedings of the International Conference ParCo 2005, G.R. Joubert, W.E. Nagel, F.J. Peters, O. Plata, P. Tirado, E. Zapata ( Editors), John von Neumann Institute for Computing, J¨ulich, NIC Series, Vol. 33, ISBN 3-00-017352-8, pp. 727-734, 2006.

c 2006 by John von Neumann Institute for Computing Permission to make digital or hard copies of portions of this work for personal or classroom use is granted provided that the copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise requires prior specific permission by the publisher mentioned above. http://www.fz-juelich.de/nic-series/volume33 727

Implementation of Anisotropic Nonlinear Diffusion for filtering 3D images in Structural Biology on SMP Clusters∗

Siham Tabika, E. M. Garzon´ a, I. Garc´ıaa, J.J. Fernandez´ a aDept. de Arquitectura de Computadores y Electronica.´ Universidad de Almer´ıa. 04120 Almer´ıa

In this work we have proposed a parallel implementation of the Anisotropic Nonlinear Diffusion (AND) for filtering 3D images. AND is a powerful reduction technique in the field of . This method is based on a partial differential equation (PDE) tightly coupled with a massive set of eigensystems. Denoising large (3D) images in biomedicine and structural cellular biology by AND has a high computational cost. Consequently, an appropriate parallel implementation of AND is the best approach to reduce its runtime. This work proposes a portable parallel code using several programming models:(1) shared address space model, (2) message passing model and (3) hybrid model. The proposed parallel implementation has been evaluated on a cluster of biprocessors. The evaluation of the performance of AND-code shows that the hybrid model scales better in this kind of platforms.

1. Introduction

In many disciplines, raw data acquired from instruments are substantially corrupted by noise and sophisticated filtering techniques are then indispensable for a proper interpretation or post- processing. In general terms, smoothing techniques can be classified into linear and non-linear. Standard linear filtering techniques based on local averages or Gaussian kernels succeed in reducing the noise, but at expenses of poor feature preservation. In other words, they may severely blur the features as their edges are attenuated. However, nonlinear filtering techniques achieve better feature preservation as they try to adaptively tune the strength of the smoothing to the local structures found in the image. Anisotropic nonlinear diffusion (AND) is currently one of the most powerful techniques in the field of computer vision [20]. This technique takes into account the local structures found in the image to filter noise, preserve edges and enhance some features, thus considerably increasing the signal-to-noise ratio (SNR) with no significant quantitative of the signal. Pioneered in 1990 by Perona and Malik [17], AND has grown to become a well-established tool in the last decade [15,20–22]. AND has already been successfully applied in different disciplines, such as medicine [2,11,12] or biology [8–10], for denoising multidimensional images. AND has actually been crucial to achieve some recent breakthroughs [4,6,13,16]. The mathematical basis of AND is a partial differential equation (PDE) tightly coupled with a massive set of eigensystems [18]. AND may turn out to be extremely expensive, from the compu- tational point of view, depending on the size of the images. There are some disciplines where the requirements may be so huge –much more than 1 Gbyte in size [7]– that parallel computing proves to be essential. The standard numerical scheme for solving PDEs is based upon an explicit finite difference dis- cretization. More efficient schemes have been specifically designed for nonlinear diffusion [23], though. However, they are complex to implement and, despite their efficiency, they still require to be parallelized [5].

∗This work was supported by the Spanish Ministry of Education and Science through grant TIC2002-00228 7282

In this work we address the parallelization of AND for its application to denoising of large three- dimensional (3D) volumes in biomedicine and structural cellular biology. We make use of the stan- dard explicit numerical scheme for the discretization. This scheme is commonly used in other fields where PDEs are involved [18] and, as a consequence, the parallel approaches that are presented and discussed here may be valuable for them too.

2. Review of anisotropic nonlinear diffusion

AND accomplishes a sophisticated edge-preserving denoising that takes into account the struc- tures at local scales. AND tunes the strength of the smoothing along different directions based on the local structure estimated at every point of the multidimensional image. Conceptually speaking, AND can be considered as an adaptive gaussian filtering technique in which, for every voxel in the volume, an anisotropic 3D gaussian function is computed whose widths and orientations depend on the local structure [3]. This section presents local structure determination via structure , the concept of diffusion, a diffusion approach commonly used in image processing and, finally, details of the numerical implementation. 2.1. Estimation of local structure The structure is the mathematical tool that allows us to estimate the local structure in a multidimensional image. Let I(x) denote a 3D image, where x = (x, y, z) is the coordinate vector. The of I is a symmetric positive semi-definite matrix given by: 2 Ix IxIy IxIz T  2  J(∇I) = ∇I · ∇I = IxIy Iy IyIz (1)  2   IxIz IyIz Iz  ∂I ∂I ∂I where Ix = ∂x , Iy = ∂y , Iz = ∂z are the derivatives of the image with respect to x, y and z, respectively. The eigen-analysis of the structure tensor allows determination of the local structural features in the image [20]:

µ1 0 0   T J(∇I) = [v1 v2 v3] · 0 µ2 0 · [v1 v2 v3] (2)    0 0 µ3 

The orthogonal eigenvectors v1, v2, v3 provide the preferred local orientations, and the correspond- ing eigenvalues µ1, µ2, µ3 (assume µ1 ≥ µ2 ≥ µ3) provide the average contrast along these di- rections. The first eigenvector v1 represents the direction of the maximum variance. Therefore, v1 represents the direction normal to the local feature (see Fig. 1). 2.2. Concept of diffusion in image processing Diffusion is a physical process that equilibrates concentration differences as a function of time, without creating or destroying mass. In image processing, density values play the role of concentra- tion. This observation is expressed by the diffusion equation [20]:

It = div(D · ∇I) (3) ∂I where It = ∂t denotes the derivative of the image I with respect to the time t, ∇I is the gradient vector, D is a square matrix called diffusion tensor and div is the divergence operator: ∂fx ∂fy ∂fz div(f) = + + ∂x ∂y ∂z 7293

v3

v1

v2

Figure 1. Local structure found by eigen-analysis of the structure tensor. v1, v2, v3 are the corre- sponding eigenvectors. v1 is the direction normal to the local structure.

In AND the smoothing depends on both the strength of the gradient and its direction measured at a local scale. The diffusion tensor D is therefore defined as a function of the structure tensor J:

λ1 0 0   T D = [v1 v2 v3] · 0 λ2 0 · [v1 v2 v3] (4)    0 0 λ3  where vi denotes the eigenvectors of the structure tensor. The values of the eigenvalues λi define the strength of the smoothing along the direction of the corresponding eigenvector vi. The values of λi rank from 0 (no smoothing) to 1 (strong smoothing). Therefore, this approach allows smoothing to take place anisotropically according to the eigenvec- tors determined from the local structure of the image. Consequently, AND allows smoothing on the edges: Smoothing runs along the edges so that they are not only preserved but smoothed. AND has turned out, by far, the most effective denoising method by its capabilities for structure preservation and feature enhancement [8,9,20]. 2.3. Edge Enhancing Diffusion One of the most common ways of setting up the diffusion tensor D gives rise to the so-called Edge Enhancing Diffusion (EED) approach [20]. The primary effects of EED are edge preservation and enhancement. Here strong smoothing is applied along the preferred directions of the local structure, (the second and third eigenvectors, v2 and v3). The strength of the smoothing along the normal of the structure, i.e. the eigenvector v1, depends on the gradient: the higher the value is, the lower the smoothing strength is. Consequently, λi are then set up as:

λ1 = g(|∇I|)   λ2 = 1 (5) λ3 = 1  −3.31488 with g being a monotonically decreasing function, such as g(x) = 1 − exp  (x/K)8 , where K > 0 acts as a contrast parameter [20]; Structures with |∇I| > K are regarded as edges, otherwise they are considered to belong to the interior of a region. Therefore, smoothing along edges is preferred over smoothing across them, hence edges are preserved and enhanced. 2.4. Numerical discretization of the diffusion equation The diffusion equation, Eq. (3), can be numerically solved using finite differences. The term ∂I It = ∂t can be replaced by an Euler forward difference approximation. The resulting explicit 7304

scheme allows calculation of subsequent versions of the image iteratively:

(s) (s−1) ∂ ∂ ∂ ∂ I = I + τ · ( ∂x (D11Ix) + ∂x (D12Iy) + ∂x (D13Iz) + ∂y (D21Ix) ∂ ∂ ∂ ∂ ∂ (6) + ∂y (D22Iy) + ∂y (D23Iz) + ∂z (D31Ix) + ∂z (D32Iy) + ∂z (D33Iz))

(s) where τ denotes the time step size, I denotes the image at time ts = sτ, the terms Ix, Iy, Iz are the derivatives of the image with respect to x, y and z, respectively. Finally, the Dmn terms represent the components of the diffusion tensor D. The standard scheme to approximate the spatial derivatives ∂ ∂ ∂ ( ∂x , ∂y and ∂z ) is based on central differences. In this traditional explicit scheme for solving the partial differential equation Eq. (3), the stability is an issue [20]. The maximum time step that is allowed is τ ≤ 0.5/Nd, where Nd is the number of dimensions of the problem. In our case, we are dealing with a three-dimensional problem, so Nd = 3. In the experiments carried out in this work, we used a conservative value of τ = 0.1.

Figure 2. Left: slice from the original volume of a mitochondrion; right: slice from the filtered volume

Fig. 2 shows the result of the application of AND to a volume of a mitochondrion, a cell organelle. The enhancement in visualizing a slice of the volume is apparent (up: slice from the original volume; bottom: slice from the filtered volume). 2.5. Scheme of the diffusion approach The outline of the AND approach is the following:

(1) Compute the image at iteration s. For every voxel: Compute the structure tensor J according to Eqs.(1) and (2). Compute the diffusion tensor D according to Eq.(4). Solve the partial differential equation of diffusion, Eq. (3). (2) Iterate: go to step (1)

3. Parallel implementation of AND

Large scale parallel systems based on clusters of shared memory symmetric multiprocessors (SMP) are today’s dominant computing platforms, where the communication between nodes is es- tablished by means of message passing across the network and inside the nodes the communication 7315 between processors is established by means of the shared memory [14,19]. This architecture sup- ports three parallel programming models: (1) shared address space model, (2) message passing model and (3) hybrid model. The hybrid model combines the two previous models, shared address space model inside nodes and message passing model between nodes. In this work a portable code has been designed based on a hybrid model, which can be translated into shared address space and/or message passing models, taking into account the features of the considered parallel platform. The hybrid implementation of the AND-code uses two standards: MPI for the message passing between nodes and Pthreads for the shared memory parallelism inside the nodes. The hybrid implementation is based on two features of AND:

• At each time step, s, the 3D image is updated by z-planes, where the z-axis is the direction of the larger image dimension Nz.

s+1 s s • The update of the k z-plane, Ik , is only function of Ik; and its four neighbour z-planes, Ik−1, s s s Ik−2, Ik+1 and Ik+2, where s denotes the iteration index and k denotes the z-plane index.

To optimize data access and storage of the AND-code, memory requirements have been mini- mized, allocating and computing only the necessary data needed for updating each single z-plane. The following parallel algorithm takes into account the underlined scheme:

0 local Distribute I among nodes, Nz planes will be assigned to each node Do s = 1, ...n (a) Each thread initializes its auxiliary data structures local (b) Do k = 2, ..., d(Nz − 4)/T e s+1 s Ik = AND(Ik) End Do (c) Interchange boundary z-planes between neighbour nodes. End Do Collect the image.

local where Nz denotes the local number of z-planes of the image in every node, T denotes the number of processors inside one node, and I 0 denotes the original 3D image, and n denotes the number of iterations to obtain an acceptable denoised image. Dimensions of the volumes in structural biology usually range between 256x256x256 and 768x768x768. Typical values for n are around 60-100 iterations It was necessary to control the data distribution at two levels, at node level and at processor level inside the node:

local 1. The total number of z-planes is distributed among nodes, Nz = dNz/P e+4 z-planes will be assigned to each node, where P denotes the number of nodes. To decrease message passing communications at each time step, for updating boundary planes of the diffusion tensor all nodes allocate four additional planes to hold four neighbour planes: two planes from each boundary, as it is illustrated in Figure 3. At the end of each time step, nodes communicate the updated four boundary planes via message passing.

local 2. At processor level, each thread will update its subset of d(Nz −4)/T e z-planes applying the AND process, where a z-plane of the structure tensor and of the diffusion tensor are computed 7326

updated planes by one thread

z axis

updated planes in one node Nz planes in memory of one node

Figure 3. The distribution of an image at node level and processor level.

to update the corresponding z-plane of the image. To carry out this process, two auxiliary data structures have been defined by each thread, in order to hold both the boundary z-plane of the image and the boundary z-plane of the diffusion tensor, before they are modified by the thread neighbour. In the next section, the described parallel code will be evaluated on a cluster of SMP.

4. Evaluation of the Parallel implementation of AND

In this paper, we evaluate the performance of a portable implementation of AND-code as message passing code and as hybrid code on a cluster of SMP multiprocessors. The aim of this evaluation is to determine which model exploits best the SMP clusters. The implementation is carried out on a cluster of biprocessors of Intel(R) Xeon(TM) 3.06 GHz with 2 GB RAM, 512 KB cache. Nodes are interconnected via two Gigabit Ethernet networks, one for data (NFS) and the other for computation. Four test volumes of different sizes (256x256x256, 384x384x384, 512x512x512 and 640x640x640) have been selected to develop the evaluation pro- cess. Next, the speed-up will be analyzed for the test volumes. Traditionally, the speedup is only re- ferred to the sequential runtime. Recently, a general concept of speedup has been introduced [1], where the parallel runtime is used as reference instead. In this evaluation, this concept is tacked into account to evaluate the performance of AND on the cluster of SMPs. Specifically, for the volumes 512x512x512 and 640x640x640 the parallel run time with four processors is considered as reference, since it was not possible to run the codes on fewer than four processors with these volumes. While the sequential times is used as reference for the smaller volumes 256x256x256 and 384x384x384. Fig 4 shows the speedup achieved by the pure MPI code and the hybrid code, for the test volumes, on the cluster of SMPs described above. In global terms, both strategies yield good results, with slightly better performance for the hybrid strategy. It is evident from these figures that the hybrid strategy yields better scalability than the strategy based on message passing, specially for increasing number of processors. Moreover, the influence of the volume size has proven to be relevant on this platform, given that the speedup is better for large sizes. Finally, notice that Nz = Nx = Ny for the test images considered in this work; better performance is expected if the z-dimension of the image verifies: Nz >> Nx, Ny. 7337

image size: 256x256x256 image size: 384x384x384 30 30 28 28 26 26 24 24

22 pure mpi 22 pure mpi 20 hybrid 20 hybrid 18 ideal speedup 18 ideal speedup 16 16 14 14 speedup 12 speedup 12 10 10 8 8 6 6 4 4 2 2 0 0 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 number of processors number of processors image size: 512x512x512 image size: 640x640x640 8

7 6 6 pure MPI pure mpi Hybrid hybrid ideal speedup 5 ideal speedup 4 4 speedup speedup 3

2 2

1

0 0 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 number of processors number of processors

Figure 4. Speed-up of the message passing and the hybrid models versus the number of processors for the test volumes.

5. Conclusions and future works

It can be concluded from the evaluation that the hybrid implementation of the AND-code allows: (1) to reduce significantly the runtime over a large range of images size, specially for images of large sizes and (2) to be run on different kind of parallel platforms. The evaluation shows that the parallel AND-code is scalable, as well as the hybrid model achieves better performance in SMP clusters. In future works, an evaluation of the AND-code will be developed in an extended range of platforms.

Acknowledgments

The authors wish to thank Dr. Guy Perkins (National Center for Microscopy and Imaging Re- search, San Diego, USA) for kindly providing the mitochondrion dataset.

References

[1] S. G. Akl. Superlinear performance in real-time parallel computation. The Journal of Supercomputing, 29(1):89–111, 2004. 7348

[2] I. Bajla and I. Hollander. Nonlinear filtering of magnetic resonance tomograms by geometry-driven diffusion. Machine Vision and Applications, 10:243–255, 1998. [3] D. Barash. A fundamental relationship between bilateral filtering, adaptive smoothing and the nonlinear diffusion equation. IEEE Trans. Patt. Anal. Mach. Intel., 24:844–847, 2002. [4] M. Beck, F. Forster, M. Ecke, J. M. Plitzko, F. Melchior, G. Gerisch, W. Baumeister, and O. Medalia. Nuclear pore complex structure and dynamics revealed by cryoelectron tomography. Science, 306:1387– 1390, 2004. [5] A. Bruhn, T. Jakob, M. Fischer, T. Kohlberger, J. Weickert, U. Bruning, and C. Schnorr. High perfor- mance cluster computing with 3-D nonlinear diffusion filters. Real-Time Imaging, 10:41–51, 2004. [6] M. Cyrklaff, C. Risco, J. J. Fernandez, M. V. Jimenez, M. Esteban, W. Baumeister, and J. L. Carrascosa. Cryo-electron tomography of vaccinia virus. Proc. Natl. Acad. Sci. USA, 102:2772–2777, 2005. [7] J. J. Fernandez, J.M. Carazo, and I. Garcia. Three-dimensional reconstruction of cellular structures by electron microscope tomography and parallel computing. J. Paral. Distr. Computing, 64:285–300, 2004. [8] J. J. Fernandez and S. Li. An improved algorithm for anisotropic nonlinear diffusion for denoising cryo-tomograms. J. Struct. Biol., 144:152–161, 2003. [9] A. S. Frangakis and R. Hegerl. Noise reduction in electron tomographic reconstructions using nonlinear anisotropic diffusion. J. Struct. Biol., 135:239–250, 2001. [10] A. S. Frangakis, A. Stoschek, and R. Hegerl. transform filtering and nonlinear anisotropic diffusion assessed for signal reconstruction performance on multidimensional biomedical data. IEEE Trans. BioMed. Engineering., 48:213–222, 2001. [11] G. Gerig, R. Kikinis, O. Kubler, and F. A. Jolesz. Nonlinear anisotropic filtering of MRI data. IEEE Trans. Med. Imaging, 11:221–232, 1992. [12] O. Ghita, K. Robinson, M. Lynch, and P. F. Whelan. MRI diffusion-based filtering: A note on perfor- mance characterisation. Computerized Medical Imaging and Graphics, 29:267–277, 2005. [13] K. Grunewald, P. Desai, D. C. Winkler, J. B. Heymann, D. M. Belnap, W. Baumeister, and A. C. Steven. Three-dimensional structure of herpes simplex virus from cryo-electron tomography. Science, 302:1396–1398, 2003. [14] H. Shan, J.P. Singh, L. Oliker and R. Biswas. Message passing and shared address space parallelism on an SMP cluster. Parallel Computing, 29:167–186, 2003. [15] J. Weickert. Coherence-enhancing diffusion of colour images. Image and Vision Computing, 17:201– 212, 1999. [16] O. Medalia, I. Weber, A. S. Frangakis, D. Nicastro, G. Gerisch, and W. Baumeister. Macromolecular architecture in eukaryotic cells visualized by cryoelectron tomography. Science, 298:1209–1213, 2002. [17] P. Perona and J. Malik. and using anisotropic diffusion. IEEE Trans. Patt. Anal. Mach. Intel., 12:629–639, 1990. [18] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, 1992. [19] M. J. Quinn. Parallel Programming in C with MPI and OpenMP. McGraw-Hill, 2004. [20] J. Weickert. Anisotropic Diffusion in Image Processing. Teubner, 1998. [21] J. Weickert. Coherence-enhancing diffusion filtering. Int. J. Computer Vision, 31:111–127, 1999. [22] J. Weickert and H. Scharr. A scheme for coherence-enhancing diffusion filtering with optimized rotation invariance. J. Visual Comm. Imag. Repres., 13:103–118, 2002. [23] J. Weickert, B. M. ter Haar Romeny, and M. A. Viergever. Efficient and reliable schemes for nonlinear diffusion filtering. IEEE Trans. Image Processing, 7:398–410, 1998.