FRIEDRICH-ALEXANDER-UNIVERSITÄT ERLANGEN-NÜRNBERG, INSTITUT FÜR INFORMATIK (MATHEMATISCHE MASCHINEN UND DATENVERARBEITUNG)

Lehrstuhl für Informatik 10 (Systemsimulation)

Solving Finite Element Systems with Hypre and Z88

Stephan Helou

Bachelor Thesis


Aufgabensteller: Prof. Dr. U. Rüde, Prof. Dr.-Ing. F. Rieg
Betreuer: Dipl.-Inf. T. Gradl
Bearbeitungszeitraum: 23.08.2009 – 23.11.2009

Erklärung:

Ich versichere, daß ich die Arbeit ohne fremde Hilfe und ohne Benutzung anderer als der angegebenen Quellen angefertigt habe und daß die Arbeit in gleicher oder ähnlicher Form noch keiner anderen Prüfungsbehörde vorgelegen hat und von dieser als Teil einer Prüfungsleistung angenommen wurde. Alle Ausführungen, die wörtlich oder sinngemäß übernommen wurden, sind als solche gekennzeichnet.

Erlangen, den 10. März 2010 ......

Abstract

Over time, the problems arising in linear elasticity and the finite element method have become ever larger and more complex. The aim of this thesis is to test different parallel solvers and preconditioners and to compare the achieved results with those of the finite element program Z88. This is done by calculating the displacements of different component parts with more than 2 million degrees of freedom. The hypre framework provides various highly parallel solvers and preconditioners, for instance BoomerAMG, the Conjugate Gradient method or ParaSails. The methods are compared with respect to their solving time, number of iterations and residuals. The parallel Conjugate Gradient method using ParaSails as a preconditioner is the fastest combination tested in this thesis. With only 8 CPUs it is up to 110 seconds faster than the Z88 methods; with more CPUs it calculates the problems up to 140 seconds faster. Solving the largest component part with 243 CPUs is even 22.3 minutes faster. In view of these values and the increasingly complex problems, the use of parallel solvers is indispensable.

Zusammenfassung

Mit der Zeit wurden die Probleme im Bezug auf lineare Elastizität und Finite-Elemente-Methode immer größer und komplexer. Das Ziel dieser Arbeit ist es, verschiedene parallele Löser und Vorkonditionierer zu testen und die erzielten Ergebnisse mit den Algorithmen des Finite-Elemente-Programmes Z88 zu vergleichen. Dies wird durch die Berechnung der Verschiebungen in unterschiedlichen Bauteilen mit mehr als 2 Millionen Freiheitsgraden umgesetzt. Durch Benutzung des hypre-Frameworks stehen verschiedene hochparallele Löser und Vorkonditionierer zur Verfügung, z.B. BoomerAMG, die Methode der konjugierten Gradienten oder ParaSails. Die Methoden werden bezüglich ihrer Lösungszeiten, Anzahl der Iterationen und Residuen verglichen. Der mit ParaSails vorkonditionierte parallele Konjugierte-Gradienten-Algorithmus ist die schnellste Kombination der Arbeit. Mit nur 8 CPUs wird er bis zu 110 Sekunden schneller als die Z88-Verfahren. Mit mehreren CPUs werden die Probleme bis zu 140 Sekunden schneller berechnet. Das Lösen des größten Bauteiles mit 243 CPUs ist sogar 22,3 Minuten schneller. Im Bezug auf diese Werte und die komplexeren Problemstellungen ist die Anwendung von parallelen Lösern unentbehrlich.

Acknowledgements

I would like to thank Prof. Dr. Ulrich Rüde and Prof. Dr.-Ing. Frank Rieg from the University of Bayreuth for offering me this thesis. Very special thanks go to my supervisor Tobias Gradl for his overall support and advice. Special thanks also to my contact person in Bayreuth, Martin Neidnicht, who helped me with issues concerning Z88 and provided me with the data of the component parts. Additionally, I would like to thank Tobias Preclik, Dr. Harald Köstler and Klaus Iglberger for their overall support.

Contents

1 Introduction

2 Theoretical Background
  2.1 Linear Elasticity
  2.2 Finite Element Method
  2.3 Solvers
      2.3.1 The Classical Algebraic Multigrid
      2.3.2 The Conjugate Gradient Method
      2.3.3 ParaSails Preconditioner

3 The Frameworks
  3.1 The Hypre Package
      3.1.1 Interfaces
      3.1.2 Solvers and Preconditioners
  3.2 Z88

4 Computer Architecture

5 Results
  5.1 Test Problems
  5.2 Measurement Methods
  5.3 Z88
  5.4 BoomerAMG
  5.5 Conjugate Gradient
  5.6 Conclusions

6 Conclusion

Bibliography


List of Figures

2.1  Deformation caused by external forces in 2D
2.2  Types of grids
2.3  Basis functions in 1D
2.4  Example for the assembling routine
2.5  Types of cycles
2.6  Example for the coarsening strategy
2.7  Convergence of CG

3.1  Overview of the different interfaces
3.2  Overview of the various solvers and preconditioners
3.3  Z88COM

4.1  Compute node of the Woodcrest Cluster

5.1  L section
5.2  Piston
5.3  Fan
5.4  Hub carrier
5.5  Arch
5.6  Connecting rod
5.7  I beam
5.8  Residual for solving the L section with Z88I2
5.9  Residual for solving the L section with BoomerAMG
5.10 Residual for solving the L section with CG
5.11 Residual for solving the L section with CG and using Euclid, BoomerAMG or ParaSails as a preconditioner
5.12 Times and ratios for solving the component parts with CG+ParaSails and Z88I2
5.13 Ratios for solving the fan with multiple CPUs
5.14 Ratios for solving the I beam with multiple CPUs

6.1  Deformation of the fan


List of Tables

5.1  Times for calculating the displacements of the component parts with Z88
5.2  Times for calculating the displacements of the component parts with BoomerAMG
5.3  Times for calculating the displacements of the component parts with CG
5.4  Times for calculating the displacements of the component parts with CG and a preconditioner
5.5  Times and speedup for multiple CPUs


List of Algorithms

1   Multigrid MG(u_l, f_l, l)
2   Algebraic Multigrid Coarsening
3   Conjugate Gradient Method
4   ParaSails
5   Building an IJ Matrix
6   Building an IJ Vector
7   Setup and run routine for the solvers
8   Setup and run routine for the solvers using a preconditioner
9   Time measurement of Z88F
10  Time measurement of Z88I2
11  Time measurement of hypre solvers


Nomenclature

C-points    points on the coarse grid
F_i^s       F-points strongly connected to i
H^1         Sobolev space
N_i^w       all points weakly connected to i
S_i         set of points that strongly influence i
AMG         algebraic multigrid
CG          conjugate gradient
FEM         finite element method
LE          linear elasticity
MG          multigrid


1 Introduction

In modern times, linear elasticity (LE) analyses have become indispensable. The calculation of stresses, strains and displacements in the mechanical components of a car are only a few of the many application fields of elasticity theory. All these problems can be described by partial differential equations (PDEs). One of the most widely used discretization methods over the last 50 years has been the finite element method (FEM) [JL01]. The basic idea of the method was formulated by the engineers Turner, Clough, Martin and Topp. They described the approach of separating a solid body into a finite number of elements and calculating the displacements at the elements' nodes. The theoretical proof was given by mathematicians in the 1960s. Since FEM problems are not easy to solve analytically, the method gained wide acceptance with the appearance of the first digital computers. One of the first applications was the calculation of artillery trajectories in World War 2, done with the help of the Zuse Z3 and the Harvard Mark I, among the first programmable computing machines [RH09]. At that time only people with access to a mainframe could deal with FE calculations. Today, nearly everybody can handle FE problems on a normal desktop PC.

The increasing power of software and hardware over the last years has made it possible to calculate more complicated and larger problems. With the aid of the FEM, a large spectrum of problems in different technical application fields is covered, for instance weather forecasting, medical engineering or the classical fields of aircraft and vehicle construction. Furthermore, it provides the basis of every CAD program. With more complex local geometries, higher numbers of degrees of freedom or highly refined finite element meshes, the computational effort increases strongly. This problem can be handled with the aid of parallel computers. Using such parallel systems in combination with more powerful solving methods minimizes the computational costs, so the equations can be solved quickly and efficiently.

This thesis arose from the cooperation with Prof. Dr.-Ing. F. Rieg from the University of Bayreuth, who is the editor and designer of the finite element program Z88. Except for the PARDISO algorithm (in version 13), no parallel solver has yet been included in the Z88 framework. Hence, the aim of the thesis was to test different parallel solvers and preconditioners and to compare them with the algorithms included in Z88 (version 12). In all cases, displacements caused by surface loads, pressure loads or external forces were calculated. In the scope of this thesis the highly parallel hypre framework was used, which provides a variety of parallel solvers and preconditioners. The main focus lies on the comparison of the computation times spent by the methods of the two frameworks.

The structure of the thesis is arranged in the following way. Chapter 2 gives an overview of the theoretical background. It starts with a short elucidation of LE. Afterwards, the FEM is described. Moreover, the main solvers and preconditioners used in the thesis are explained: algebraic multigrid (AMG), conjugate gradient (CG) and ParaSails.
Chapter 3 takes a closer look at the two applied frameworks: the free hypre library, which was developed at Lawrence Livermore National Laboratory [HYP09] and can be used on massively parallel computers, and Z88, a free finite element program developed at the University of Bayreuth [Z8809]. Chapter 4 gives a short overview of the used computer architecture, the Woodcrest Cluster at the Regional Computing Center of Erlangen (RRZE), on which the problems were computed. Chapter 5 presents the test components and the achieved results of the used methods. Finally, Chapter 6 concludes the thesis.


2 Theoretical Background

2.1 Linear Elasticity

This section gives only a brief insight into LE; for a deeper understanding refer to [JL01]. In the beginning, displacements are elucidated. Due to heat generation or external forces, an undeformed body (Figure 2.1(a)) changes its shape into a deformed body (Figure 2.1(b)).

(a) undeformed body (b) deformed body

Figure 2.1: Deformation caused by external forces in 2D.

There are two different possibilities to explain the deformation in 3D. The Lagrange description focuses on the movement of a particle X. The displacement u and location x are given by

\[ x = x(X), \quad u = u(X) \qquad \text{alternatively} \qquad x_i = x_i(X_j), \quad u_i = u_i(X_j). \qquad (2.1) \]

The Euler description focuses on the state of a point in space. The particle's location and displacement are given by

\[ X = X(x), \quad u = u(x) \qquad \text{alternatively} \qquad X_i = X_i(x_j), \quad u_i = u_i(x_j), \qquad (2.2) \]

where i, j stand for the coordinate axes x, y, z. The displacements u are calculated with the help of the gradient of displacements

\[ H_{ij} = u_{i,j} = \tfrac{1}{2}(u_{i,j} + u_{j,i}) + \tfrac{1}{2}(u_{i,j} - u_{j,i}) = \varepsilon_{ij} + \omega_{ij}, \qquad (2.3) \]

where the infinitesimal tensor of rotation ω_ij can be neglected since the rotations have no effect on the element's stress. The first term ε_ij in equation (2.3) is the infinitesimal strain tensor

\[ \varepsilon_{ij} = \tfrac{1}{2}(u_{i,j} + u_{j,i}), \qquad i, j \in \{x, y, z\}. \qquad (2.4) \]

The elements of the tensor are symmetric, i.e. ε_ij = ε_ji, and have the matrix notation

\[ \varepsilon = \begin{pmatrix} \varepsilon_{xx} & \varepsilon_{xy} & \varepsilon_{xz} \\ \varepsilon_{yx} & \varepsilon_{yy} & \varepsilon_{yz} \\ \varepsilon_{zx} & \varepsilon_{zy} & \varepsilon_{zz} \end{pmatrix}. \qquad (2.5) \]

Another important quantity in LE is the stress tensor

\[ \sigma = [\sigma_{ij}] = \begin{pmatrix} \sigma_{xx} & \sigma_{xy} & \sigma_{xz} \\ \sigma_{yx} & \sigma_{yy} & \sigma_{yz} \\ \sigma_{zx} & \sigma_{zy} & \sigma_{zz} \end{pmatrix} = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_{22} & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_{33} \end{pmatrix}. \qquad (2.6) \]

In the case i = j the components σ_ij are normal components and are called normal stresses. If i ≠ j the components σ_ij are tangential and are called shear stresses. With the aid of Hooke's law it is possible to combine strains and stresses to

\[ \sigma_{ij} = \lambda\,\varepsilon_{kk}\,\delta_{ij} + 2\mu\,\varepsilon_{ij}, \qquad i, j \in \{x, y, z\}, \qquad (2.7) \]

where the Kronecker delta δ_ij is defined as

\[ \delta_{ij} = \begin{cases} 1 & \text{for } i = j, \\ 0 & \text{for } i \neq j, \end{cases} \qquad i, j = 0, 1, \ldots, n. \qquad (2.8) \]

The parameters µ and λ are the Lamé parameters and are defined as

\[ \lambda = \frac{\nu E}{(1 + \nu)(1 - 2\nu)}, \qquad \mu = \frac{E}{2(1 + \nu)}, \qquad (2.9) \]

where ν (Poisson's ratio) and E (Young's modulus) are material constants. In the scope of this thesis only the displacements are considered and calculated.
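As a quick numerical illustration (the material values E = 210 GPa and ν = 0.3 are assumed here for a generic structural steel and are not taken from the thesis), the Lamé parameters evaluate to

\[ \lambda = \frac{0.3 \cdot 210\,\text{GPa}}{(1 + 0.3)(1 - 0.6)} \approx 121.2\,\text{GPa}, \qquad \mu = \frac{210\,\text{GPa}}{2(1 + 0.3)} \approx 80.8\,\text{GPa}. \]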

2.2 Finite Element Method

One of the standard approaches for solving LE problems is the FEM. Using the fact that the solution does not have to fulfill the partial differential equation (PDE) in every point, the basis of the method is to create the weak or variational formulation (cf. [JL01]). In the following, the transformation from classical to variational formulation is shown with the help of the 1D heat equation. With a given temperature at the left boundary and a free heat exchange at the right boundary, the equation is defined as

\[
\begin{aligned}
-u''(x) + c\,u(x) &= f(x) \qquad \forall\, x \in \Omega = (a, b), \quad u(x) \in \mathbb{R}^n, \; f(x) \in \mathbb{R}^n, \\
u(a) &= g_a, \qquad (2.10) \\
-u'(b) &= \alpha_b\,(u(b) - u_b).
\end{aligned}
\]

Multiplying the equation with a test function v, contained in the Sobolev space H^1(a, b) (cf. [JL01]), leads to

\[ -u''(x)\,v(x) + c\,u(x)\,v(x) = f(x)\,v(x). \qquad (2.11) \]

Using partial integration and adding the given boundary conditions results in

\[ -\int_a^b u''(x)\,v(x)\,dx = \int_a^b u'(x)\,v'(x)\,dx + \alpha_b\,u(b)\,v(b) - \alpha_b\,u_b\,v(b). \qquad (2.12) \]

This is done to obtain a symmetric stiffness matrix. Furthermore, the problem gains in accuracy since the partial integration lowers the order of differentiation required of the solution.

With the help of equations (2.11) and (2.12) the variational formulation is defined as

\[
\begin{aligned}
&\text{find } u \in V_g = \{u \in H^1(a, b) : u(a) = g_a\} \text{ such that} \\
&a(u, v) = \langle F, v\rangle \qquad \forall\, v \in V_0 = \{v \in H^1(a, b) : v(a) = 0\}, \text{ where} \\
&a(u, v) = \int_a^b \left[u'(x)\,v'(x) + c\,u(x)\,v(x)\right] dx + \alpha_b\,u(b)\,v(b), \qquad (2.13) \\
&\langle F, v\rangle = \int_a^b f(x)\,v(x)\,dx + \alpha_b\,u_b\,v(b).
\end{aligned}
\]

In the first step of solving the weak formulation, the domain is discretized by splitting it into elements. The discretization can be structured (Figure 2.2(a)) or unstructured (Figure 2.2(b)). The mesh is generated with the aid of different fundamental geometries. The ones most commonly used are

• 1D → interval,
• 2D → triangle, quadrangle,
• 3D → tetrahedron, pyramid, hexahedron.

(a) structured (b) unstructured

Figure 2.2: Types of grids.

Another important ingredient of the FEM are the basis functions, which approximate the solution at a finite number of points. These functions have to fulfill the following conditions:

• the function has to be defined on the entire element,
• every function has to be assigned to one node of the element,
• at these nodes equation (2.8) has to be fulfilled,
• the sum of the approximation functions on an element has to be 1,
• the approximation functions in the nodes of elements with a common edge or surface have to be the same.

In the 1D case, the basis functions mainly used are: linear (Figure 2.3(a)), quadratic (Figure 2.3(b)) and cubic (Figure 2.3(c)). In the 2D case they are linear, bilinear, biquadratic and bicubic.

(a) linear (b) quadratic (c) cubic

Figure 2.3: Basis functions in 1D.
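For reference, the linear basis (hat) function p_i associated with node x_i in Figure 2.3(a) has the standard form (not written out in the thesis):

\[
p_i(x) =
\begin{cases}
\dfrac{x - x_{i-1}}{x_i - x_{i-1}}, & x \in [x_{i-1}, x_i], \\[4pt]
\dfrac{x_{i+1} - x}{x_{i+1} - x_i}, & x \in [x_i, x_{i+1}], \\[4pt]
0, & \text{otherwise},
\end{cases}
\]

so that p_i(x_j) = δ_ij, as required by the conditions listed above.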

Combining equation (2.13) with the basis functions results in

\[ u_h \in V_{gh} = \Big\{u_h(x) : u_h(x) = \sum_{j=1}^{n} u_j\,p_j(x) + g_a\,p_0(x)\Big\}, \qquad (2.14) \]

where

\[ a(u_h, v_h) = \langle F, v_h\rangle \qquad \forall\, v_h \in V_{0h} = \Big\{v_h(x) : v_h(x) = \sum_{i=1}^{n} v_i\,p_i(x)\Big\}. \qquad (2.15) \]

Now it is possible to calculate the local stiffness matrices. After discretizing the domain with the help of n fundamental geometries, the left hand side of equation (2.15) is computed as

\[ \sum_{i=1}^{n} \int_{x_{i-1}}^{x_i} \left[u_h'(x)\,v_h'(x) + c\,u_h(x)\,v_h(x)\right] dx. \qquad (2.16) \]

Regrouping the equation results in

\[ \sum_{i=1}^{n} \begin{pmatrix} v_{i-1} & v_i \end{pmatrix} K^{(i)} \begin{pmatrix} u_{i-1} \\ u_i \end{pmatrix}, \qquad (2.17) \]

where

\[ K^{(i)} = \begin{pmatrix} K_{11}^{(i)} & K_{12}^{(i)} \\ K_{21}^{(i)} & K_{22}^{(i)} \end{pmatrix} \qquad (2.18) \]

is the local stiffness matrix of one of the fundamental geometries. K^{(i)} has the following composition:

\[
\begin{aligned}
K_{11}^{(i)} &= \int_{x_{i-1}}^{x_i} \left[p_{i-1}'(x)\,p_{i-1}'(x) + c\,p_{i-1}(x)\,p_{i-1}(x)\right] dx, \\
K_{12}^{(i)} &= \int_{x_{i-1}}^{x_i} \left[p_i'(x)\,p_{i-1}'(x) + c\,p_i(x)\,p_{i-1}(x)\right] dx, \\
K_{21}^{(i)} &= \int_{x_{i-1}}^{x_i} \left[p_{i-1}'(x)\,p_i'(x) + c\,p_{i-1}(x)\,p_i(x)\right] dx, \qquad (2.19) \\
K_{22}^{(i)} &= \int_{x_{i-1}}^{x_i} \left[p_i'(x)\,p_i'(x) + c\,p_i(x)\,p_i(x)\right] dx.
\end{aligned}
\]

Applying the same concept to the right hand side yields

\[ \sum_{i=1}^{n} \int_{x_{i-1}}^{x_i} f(x)\,v_h(x)\,dx = \sum_{i=1}^{n} \begin{pmatrix} v_{i-1} & v_i \end{pmatrix} f^{(i)} \qquad (2.20) \]

with

\[ f^{(i)} = \begin{pmatrix} f_1^{(i)} \\ f_2^{(i)} \end{pmatrix} = \begin{pmatrix} \int_{x_{i-1}}^{x_i} f(x)\,p_{i-1}(x)\,dx \\ \int_{x_{i-1}}^{x_i} f(x)\,p_i(x)\,dx \end{pmatrix}. \qquad (2.21) \]

Assembling the local stiffness matrices into the global stiffness matrix A ∈ R^{n×n} leads to the equation system

\[ A u = f. \qquad (2.22) \]

The assembling routine is shown with the help of a small example. According to the problem given in Figure 2.4, the global stiffness matrix results in

\[ A = \begin{pmatrix} K_{11}^1 & K_{12}^1 & & \\ K_{21}^1 & K_{22}^1 + K_{22}^2 & K_{23}^2 & \\ & K_{32}^2 & K_{33}^2 + K_{33}^3 & K_{34}^3 \\ & & K_{43}^3 & K_{44}^3 \end{pmatrix}, \qquad (2.23) \]

where e.g. the local stiffness matrix of element 1 is

\[ K^1 = \begin{pmatrix} K_{11}^1 & K_{12}^1 \\ K_{21}^1 & K_{22}^1 \end{pmatrix}. \qquad (2.24) \]

Combining the nodes 1 and 2, the element matrix entries defined by equation (2.24) are added to the global matrix in the following way: K^1_11 is added to position A(1, 1), K^1_12 to A(1, 2), K^1_21 to A(2, 1) and K^1_22 to A(2, 2).

Figure 2.4: Example for the assembling routine.
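To make the assembly step concrete, the following C sketch (written for this text, not part of Z88 or hypre; the connectivity and the local 2x2 matrices are placeholder values in the spirit of Figure 2.4) adds the local stiffness matrices of a 1D chain of elements into a dense global matrix:

#include <stdio.h>

#define NNODES 4              /* global nodes 1..4 of the example          */
#define NELEMS 3              /* elements 1..3, each connecting two nodes  */

int main(void)
{
    /* element -> global node connectivity (0-based): element e couples
       nodes conn[e][0] and conn[e][1], as in Figure 2.4                  */
    int conn[NELEMS][2] = { {0, 1}, {1, 2}, {2, 3} };

    /* assumed local 2x2 stiffness matrices K^(e); values are placeholders */
    double Ke[NELEMS][2][2] = {
        { { 2.0, -1.0 }, { -1.0,  2.0 } },
        { { 2.0, -1.0 }, { -1.0,  2.0 } },
        { { 2.0, -1.0 }, { -1.0,  2.0 } }
    };

    double A[NNODES][NNODES] = { {0.0} };   /* global stiffness matrix */

    /* assembly: add each local entry to the position given by the global
       node numbers of the element (cf. equation (2.23))                  */
    for (int e = 0; e < NELEMS; ++e)
        for (int a = 0; a < 2; ++a)
            for (int b = 0; b < 2; ++b)
                A[conn[e][a]][conn[e][b]] += Ke[e][a][b];

    for (int i = 0; i < NNODES; ++i) {
        for (int j = 0; j < NNODES; ++j)
            printf("%6.2f ", A[i][j]);
        printf("\n");
    }
    return 0;
}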

2.3 Solvers

In this section, the used iterative methods and preconditioners for solving equation (2.22) are explained. In the beginning, the focus lies on the classical AMG. This method only requires the underlying matrix, and no knowledge of the geometry of the problem is needed. Moreover, the section takes a look at another well-known solver, the CG method. In the end, the ParaSails preconditioner is elucidated, which is a parallel sparse approximate inverse preconditioner.

2.3.1 The Classical Algebraic Multigrid

The basic modules in every multigrid (MG) method are the pre-smoothing, post-smoothing, restriction, prolongation and correction steps. The following elucidation is mainly based on [WBM99] and [Fal06].

Algorithm 1 Multigrid MG(u_l, f_l, l)
 1  if l = 1 then
 2      MG(u_l, f_l, l) = A_l^{-1} f_l
 3  end if
 4  if l > 1 then
 5      u_l = R_{l,f_l}^{ν1}(u_l)                      // ν1 pre-smoothing steps
 6      // Coarse grid correction
 7      Residual:      r_l = f_l - A_l u_l
 8      Restriction:   r_{l-1} = I_l^{l-1} r_l
 9      Recursive call:
10      e_{l-1}^0 = 0
11      for i = 1 to µ do
12          e_{l-1}^i = MG(e_{l-1}^{i-1}, r_{l-1}, l - 1)
13          e_{l-1} = e_{l-1}^µ
14      end for
15      Prolongation:  e_l = I_{l-1}^{l} e_{l-1}
16      Correction:    u_l = u_l + e_l
17      // ν2 post-smoothing steps
18      MG(u_l, f_l, l) = R_{l,f_l}^{ν2}(u_l)
19  end if

Algorithm 1 shows the standard multigrid algorithm, where l is the number of levels, I_l^{l-1} is the restriction operator, I_{l-1}^{l} is the prolongation (interpolation) operator and R is the smoothing operator. AMG does not need any geometric information, but in order to keep the explanation as simple as possible it is nevertheless elucidated with the help of geometry. As mentioned before, AMG is based on the same concept as the classical MG: it recursively removes the smooth error remaining after relaxation with the help of a coarse grid correction. Both methods are recursive, specified by the parameter µ. In the case of µ = 1 a V-cycle is executed; Figure 2.5(a) shows a 5-level V-cycle. If µ = 2, a W-cycle is performed; Figure 2.5(b) depicts a 4-level W-cycle.

(a) V-cycle (b) W-cycle

Figure 2.5: Types of cycles.

Nevertheless, there are some differences between the classical MG and AMG. In the classical MG the high frequency geometric error is damped with the aid of the smoothing operator. The remaining low frequency error (smooth error), which is eliminated by the coarse grid correction, is smooth in the usual geometric sense. In the case of AMG, where linear systems are solved on the basis of MG principles, the remaining smooth error can be geometrically oscillatory (cf. [Fal06]).

In AMG, one of the crucial points is defining the coarse grid operator. In order to get a good coarsening strategy, the nature of the smooth error is needed, which is the underlying basis of the main components of the method. Since only the matrix A is known, but not the geometry leading to the structure of A, the error has to be characterized in an algebraic way. Because the iterative methods Gauss-Seidel and Jacobi damp high frequencies well, they are often used as relaxation schemes. The smooth error corresponds to the eigenvectors of A with small eigenvalues. Since the smoothers eliminate the large eigenmodes, the coarse grid correction has to reduce the small modes. To select the coarse grid, it is necessary to split the fine grid into C-points (contained in both the fine and the coarse grid) and F-points (contained only in the fine grid). An important factor of this operation is the "strength of connection" between two neighboring entries in the matrix A, measured with a threshold Θ. Using

\[ -a_{ij} \ge \Theta \max_{k \ne i}(-a_{ik}) \qquad \text{with } 0 < \Theta \le 1, \qquad (2.25) \]

the coarse grid selection can be merged into two heuristics (cf. [WBM99]):

H-1: For each F-point i, every point j ∈ Si that strongly influences i either should be in the coarse interpolatory set Ci or should strongly depend on at least one point in Ci.

H-2: The set of coarse points C should be a maximal subset of all points with the property that no C-point strongly depends on another C-point.

Figure 2.6 shows the relevant operations of the three-step procedure given in Algorithm 2 for choosing the coarse grid.

Algorithm 2 Algebraic Multigrid Coarsening
1  Selecting a C-point with maximal weight.
2  Defining the neighbor points as F-points.
3  All neighbor points of the defined F-points are updated.

The discretization stencil used in Figure 2.6 is

\[ \begin{pmatrix} -1 & -1 & -1 \\ -1 & \phantom{-}8 & -1 \\ -1 & -1 & -1 \end{pmatrix}. \qquad (2.26) \]
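As an illustration of criterion (2.25), the following stand-alone C sketch (not taken from hypre; the CSR arrays and names are chosen for this example) marks the strong connections of one matrix row:

#include <stdio.h>

/* Mark the strong connections of row i of a CSR matrix according to
   -a_ij >= theta * max_k(-a_ik), k != i (cf. equation (2.25)).        */
static void strong_connections(int i, const int *row_ptr, const int *col_idx,
                               const double *val, double theta, int *is_strong)
{
    double max_offdiag = 0.0;

    /* largest negative off-diagonal coupling of row i */
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
        if (col_idx[k] != i && -val[k] > max_offdiag)
            max_offdiag = -val[k];

    for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
        is_strong[k] = (col_idx[k] != i) && (-val[k] >= theta * max_offdiag);
}

int main(void)
{
    /* row 0 of a small example matrix: diagonal 8, off-diagonals -1 and -4 */
    int    row_ptr[] = { 0, 3 };
    int    col_idx[] = { 0, 1, 2 };
    double val[]     = { 8.0, -1.0, -4.0 };
    int    is_strong[3];

    strong_connections(0, row_ptr, col_idx, val, 0.5, is_strong);
    for (int k = 0; k < 3; ++k)
        printf("entry to column %d strong: %d\n", col_idx[k], is_strong[k]);
    return 0;
}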

The next module of the AMG is the interpolation. Since small eigenmodes indicate a smooth error, the prolongation uses the formula r = Ae to show the coherence between smooth error and small residuals

\[ r^T r = e^T A^2 e = \lambda^2 < 1. \qquad (2.27) \]

Rewriting equation (2.27) at an F-point i results in

\[ a_{ii}\,e_i = -\sum_{j \in C_i} a_{ij}\,e_j - \sum_{j \in F_i^s} a_{ij}\,e_j - \sum_{j \in N_i^w} a_{ij}\,e_j, \qquad (2.28) \]

where C_i are the C-points strongly connected to i, F_i^s are the F-points strongly connected to i and N_i^w are all points weakly connected to i.

Exchanging the ej’s in the last two sums with Ci or the F-point i leads to the definition of interpolation.


Figure 2.6: Example for the coarsening strategy.

2.3.2 The Conjugate Gradient Method

The CG algorithm is widely used for solving linear systems of equations. In this section, only the fundamentals of the algorithm are explained; for further details refer to the appropriate technical literature (cf. [She94]). The major idea of CG is to find the minimum of the quadratic form

\[ f(x) = \tfrac{1}{2}\,x^T A x - b^T x + c, \qquad (2.29) \]

instead of solving the linear system Ax = b directly. For this purpose A-orthogonal search directions are used, as shown in Algorithm 3.

Algorithm 3 Conjugate Gradient Method
1  d_(0) = r_(0) = b - A x_(0)
2  for the number of iterations do
3      α_(i) = (r_(i)^T r_(i)) / (d_(i)^T A d_(i))
4      x_(i+1) = x_(i) + α_(i) d_(i)
5      r_(i+1) = r_(i) - α_(i) A d_(i)
6      β_(i+1) = (r_(i+1)^T r_(i+1)) / (r_(i)^T r_(i))
7      d_(i+1) = r_(i+1) + β_(i+1) d_(i)
8  end for

The parameter α specifies the length of each step. The aim is to go in every search direction only once. Hence, the first direction is chosen along the coordinate axis. This implies the orthogonality of e_(i+1) to d_(i) and results in a first approximation

\[ \alpha_{(i)} = -\frac{d_{(i)}^T e_{(i)}}{d_{(i)}^T d_{(i)}}. \qquad (2.30) \]

Since e_(i) is not known, a new approach is needed. Considering additionally the A-orthogonality of two search vectors d_(i), d_(i+1) and the residual equation r_(i) = -A e_(i) = -∇f(x_(i)), which is the negative gradient, results in the final formulation

\[ \alpha_{(i)} = \frac{d_{(i)}^T r_{(i)}}{d_{(i)}^T A d_{(i)}}. \qquad (2.31) \]

In the last two lines of the algorithm, the new search direction d_(i+1) is calculated. As a starting point the Gram-Schmidt conjugation is chosen:

\[ d_{(i)} = u_i + \sum_{k=0}^{i-1} \beta_{ik}\,d_{(k)}. \qquad (2.32) \]

Combining this equation with the residual by setting u_(i) = r_(i) leads to

\[ d_{(i+1)} = r_{(i+1)} + \beta_{(i+1)}\,d_{(i)}. \qquad (2.33) \]

The residual is orthogonal to the previous search directions, i.e.

\[ d_{(i)}^T r_{(j)} = 0, \qquad i < j. \qquad (2.34) \]

This implies

\[ r_{(i)}^T r_{(j)} = 0, \qquad i \ne j. \qquad (2.35) \]

Using these two relations, β is obtained as

\[ \beta_{(i+1)} = \frac{r_{(i+1)}^T r_{(i+1)}}{r_{(i)}^T r_{(i)}}. \qquad (2.36) \]

Because the search directions are A-orthogonal and the step length α of each step is exact, the CG method converges in at most n steps (in exact arithmetic). This is shown in Figure 2.7.

Figure 2.7: Convergence of CG (cf. [She94]).
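For reference, a compact C implementation of Algorithm 3 for a small dense SPD system is sketched below (a stand-alone illustration written for this text, not the hypre or Z88 implementation; the matrix and vector values are arbitrary):

#include <stdio.h>
#include <math.h>

#define N 3

static double dot(const double *a, const double *b)
{
    double s = 0.0;
    for (int i = 0; i < N; ++i) s += a[i] * b[i];
    return s;
}

static void matvec(const double A[N][N], const double *x, double *y)
{
    for (int i = 0; i < N; ++i) {
        y[i] = 0.0;
        for (int j = 0; j < N; ++j) y[i] += A[i][j] * x[j];
    }
}

int main(void)
{
    /* small symmetric positive definite example system A x = b */
    double A[N][N] = { {4, 1, 0}, {1, 3, 1}, {0, 1, 2} };
    double b[N] = {1, 2, 3};
    double x[N] = {0, 0, 0};

    double r[N], d[N], q[N];
    matvec(A, x, q);
    for (int i = 0; i < N; ++i) d[i] = r[i] = b[i] - q[i];  /* d_0 = r_0 = b - A x_0 */

    double rr = dot(r, r);
    for (int it = 0; it < N && sqrt(rr) > 1e-12; ++it) {
        matvec(A, d, q);                        /* q = A d                     */
        double alpha = rr / dot(d, q);          /* step length, eq. (2.31)     */
        for (int i = 0; i < N; ++i) {
            x[i] += alpha * d[i];               /* x_{i+1} = x_i + alpha d_i   */
            r[i] -= alpha * q[i];               /* r_{i+1} = r_i - alpha A d_i */
        }
        double rr_new = dot(r, r);
        double beta = rr_new / rr;              /* beta, eq. (2.36)            */
        for (int i = 0; i < N; ++i) d[i] = r[i] + beta * d[i];
        rr = rr_new;
    }
    printf("x = (%g, %g, %g)\n", x[0], x[1], x[2]);
    return 0;
}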

2.3.3 ParaSails Preconditioner

The hypre package contains several preconditioners. In the scope of this thesis, four of them have been analyzed and compared:

• BoomerAMG: a parallel algebraic multigrid,
• Euclid: an implementation of the parallel ILU algorithm,
• PILUT: a parallel incomplete LU factorization,
• ParaSails: a sparse approximate inverse preconditioner.

The best solutions of the problems of this thesis were achieved with ParaSails. It approximates the inverse of a matrix A by a sparse matrix M. This is done with the help of the Cholesky factor L and by minimizing

\[ \| I - M L \|_F^2 \qquad (2.37) \]

in the Frobenius norm

\[ \| A \|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2}. \qquad (2.38) \]

This elucidation is based on [Cho00].

In the case of symmetric positive definite (spd) problems a sparse lower triangular matrix G is chosen to approximate A^{-1}:

\[ G^T G \approx A^{-1}. \qquad (2.39) \]

This is done by minimizing

\[ \| I - G L \|_F^2. \qquad (2.40) \]

Since all problems of this thesis are spd problems, only the factorized case is considered. For other possibilities refer to [Cho00].

Algorithm 4 ParaSails
1  Threshold A to produce Ã.
2  Compute the pattern of Ã^L and let the pattern of G be the lower triangular part of the pattern of Ã^L.
3  Compute the nonzero entries in G.
4  Filtration: drop small entries in G and rescale.

In the following, the single steps of the ParaSails method (Algorithm 4) are explained. In a first step, the binary matrix Ã is calculated from A. It is defined as

\[ \tilde{A}_{ij} = \begin{cases} 1, & \text{if } i = j \text{ or } |(D^{-1/2} A D^{-1/2})_{ij}| > \text{thresh}, \\ 0, & \text{otherwise}, \end{cases} \qquad (2.41) \]

with

\[ D_{ii} = \begin{cases} |A_{ii}|, & \text{if } |A_{ii}| > 0, \\ 1, & \text{otherwise}, \end{cases} \qquad (2.42) \]

where thresh is the first parameter of the ParaSails implementation. Choosing a smaller value for thresh keeps more entries in Ã and therefore produces a more accurate (but more expensive) preconditioning matrix. Using matrix (2.41), the pattern of Ã^L is calculated by merging and storing the sparse rows in a dense format. The exponent L is defined with the aid of the parameter nlevels within the hypre code

L = nlevels + 1. (2.43)

In the case of nlevels = 0 and thresh = 0, the sparsity pattern of G is the same as that of A. The nonzero entries are computed with the help of the normal equations corresponding to (2.40),

\[ \left(G\,L\,L^T\right)_{ij} = \left(L^T\right)_{ij}, \qquad (i, j) \in S_L, \qquad (2.43/2.44) \]

where L is the Cholesky factor and S_L denotes the sparsity pattern chosen for G. Some reformulations lead to the preconditioned matrix

\[ G\,A\,G^T. \qquad (2.45) \]

The filtration step is used to reduce the cost of applying the preconditioner: all entries of G that are smaller than the filter parameter are dropped. With the aid of the three mentioned parameters of the ParaSails implementation, thresh, nlevels and filter, it is possible to adjust the accuracy and cost of the preconditioner. Because of the special form of the Frobenius norm (2.38), the minimization decouples into independent least-squares problems for the individual rows of G, so the algorithm is inherently parallel.
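In hypre these three parameters are set through the ParaSails interface. A minimal fragment, in the style of the code listings in Chapter 3, might look as follows (a sketch assuming the ParaSails API of hypre 2.x; the parameter values are illustrative and not the settings used in the thesis):

...
HYPRE_Solver precond;

HYPRE_ParaSailsCreate(MPI_COMM_WORLD, &precond);
HYPRE_ParaSailsSetParams(precond, 0.1, 1);   /* thresh = 0.1, nlevels = 1 */
HYPRE_ParaSailsSetFilter(precond, 0.05);     /* drop entries below 0.05   */
HYPRE_ParaSailsSetSym(precond, 1);           /* spd problem               */
/* the preconditioner object is later passed to a Krylov solver          */
...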


3 The Frameworks

In this chapter, the frameworks used for solving the linear system (2.22) are introduced and explained. The chapter is based on [HYP06b], [HYP06a], [FY02] and [Rie08]. In the beginning, hypre, a software library of high performance preconditioners and solvers, is explained. It was developed at the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory; version 2.0.0 from 2006 was used. In the second part of the chapter, the finite element program Z88 is elucidated. Prof. Dr.-Ing. F. Rieg of the University of Bayreuth is its designer and editor. In the scope of this thesis version 12.0.0 has been used.

3.1 The Hypre Package

The hypre software package is mainly written in C and provides interfaces for other programming languages with the help of Babel; for more information on Babel refer to [Bab04]. The framework was developed to solve large, sparse linear systems of equations on massively parallel computers. For parallelization the package uses MPI, which is based on a distributed memory concept; for more information on MPI consider [MPI09]. It includes several families of scalable preconditioners, for instance the already mentioned ParaSails, Euclid or PILUT algorithms. Since the most commonly used Krylov-based iterative methods are part of the framework, a broad spectrum of problems can be solved: GMRES is applied in the case of nonsymmetric problems and CG for symmetric matrices, and these are only two among the various solvers. For setting up the sparse matrix data structure, hypre offers different types of interfaces: a stencil-based structured (Struct) and semi-structured (SStruct) interface, a finite element based interface (FEI) and a linear-algebraic interface (IJ).

3.1.1 Interfaces

One of the most important points when using hypre is choosing the correct interface. Figure 3.1 (cf. [HYP06b]) gives an overview of the different interfaces. The Struct interface is used for solving finite difference or finite volume problems; it requires a fixed stencil and a structured rectangular grid. For solving block-structured or composite grids the SStruct interface is the right choice. It supports multiple unknowns per cell and uses a graph to allow nearly arbitrary relationships between parts of the data. The FEI interface is used for solving linear equations arising from a finite element discretization by offering a set of finite element data structures. Since the matrix A in this thesis already includes the boundary values and the element information, splitting the finalized matrix A back into a finite element structure would be too costly. Therefore the last interface, the IJ interface, was used to solve for the displacements of the component parts.

Figure 3.1: Overview of the different interfaces.

Since hypre was built for a high number of processors, the matrix of the linear system in equation (2.22) has to be distributed as

\[ A = \begin{pmatrix} A_0 \\ A_1 \\ \vdots \\ A_{p-1} \end{pmatrix}, \qquad (3.1) \]

where p is the number of processors and each A_i is itself a matrix consisting of a certain number of rows given by ilower and iupper.

Algorithm 5 Building an IJ Matrix
...
HYPRE_IJMatrixCreate(comm, ilower, iupper, jlower, jupper, &ij_matrix);
HYPRE_IJMatrixSetObjectType(ij_matrix, HYPRE_PARCSR);
HYPRE_IJMatrixInitialize(ij_matrix);

HYPRE_IJMatrixSetValues(ij_matrix, nrows, ncols, rows, cols, values);

...

HYPRE_IJMatrixAssemble(ij_matrix);
HYPRE_IJMatrixGetObject(ij_matrix, (void**) &parcsr_matrix);
...

The code snippet in Algorithm 5 shows the standard setup of an IJ matrix. In the beginning, a new matrix is built with the Create() routine; the rows are split across the processors as shown in equation (3.1). With the help of the SetObjectType() function the matrix object type is set to HYPRE_PARCSR. This is a sparse matrix storage format; it needs three vectors containing the nonzero values (values), the column indices (cols) and a vector storing the number of nonzero values per row (ncols). A short example of the ParCSR storage format is illustrated in the following. The matrix given by

\[ \begin{pmatrix}
7 & 0 & 6 & 0 & 0 & 5 & 0 \\
4 & 0 & 0 & 2 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 3 & 0 & 0 & 0 & 0 \\
0 & 8 & 0 & 0 & 9 & 0 & 17 \\
0 & 0 & 34 & 0 & 0 & 14 & 0 \\
0 & 0 & 0 & 26 & 0 & 0 & 10
\end{pmatrix} \qquad (3.2) \]

can be written as

values [ 7, 6, 5, 4, 2, 1, 3, 8, 9, 17, 34, 14, 26, 10 ]
cols   [ 0, 2, 5, 0, 3, 5, 2, 1, 4,  6,  2,  5,  3,  6 ]        (3.3)
ncols  [ 3, 2, 1, 1, 3, 2, 2 ]

Calling the Initialize() routine indicates that the matrix is ready to be filled with coefficients, which is done with the SetValues() function. The parameters values, cols and ncols specify the ParCSR vectors; in addition, the function needs the rows vector, containing the row indices, and the nrows variable, specifying the number of rows to be set. After finalizing the matrix with the Assemble() function and calling the GetObject() routine, the ParCSR matrix can be passed to a hypre solver. The setup of the right hand side and of the solution vector is done with the IJ vector interface. The approach is the same as in the matrix case and is shown in the code snippet of Algorithm 6.

Algorithm 6 Building an IJ Vector
...
HYPRE_IJVectorCreate(comm, jlower, jupper, &ij_vector);
HYPRE_IJVectorSetObjectType(ij_vector, HYPRE_PARCSR);
HYPRE_IJVectorInitialize(ij_vector);

HYPRE_IJVectorSetValues(ij_vector, nvalues, indices, values);

...

HYPRE_IJVectorAssemble(ij_vector);
HYPRE_IJVectorGetObject(ij_vector, (void**) &parcsr_vector);
...

3.1.2 Solvers and Preconditioners

As mentioned before, hypre includes several solvers and preconditioners. Figure 3.2 shows the methods used in the thesis and their dependencies on the interfaces.

Figure 3.2: Overview of the various solvers and preconditioners.

The basic setup and run routine is nearly identical for the various hypre solvers; the only difference is the setting of the solver-specific parameters. This is illustrated in Algorithm 7.

Algorithm 7 Setup and run routine for the solvers
...
// Create the solver
HYPRE_SOLVERCreate(MPI_COMM_WORLD, &solver);

// set certain parameters
HYPRE_SOLVERSetTol(solver, 1.e-7);
...
// set up the solver
HYPRE_SOLVERSetup(solver, A, b, x);
// solve the system
HYPRE_SOLVERSolve(solver, A, b, x);
// Destroy the solver
HYPRE_SOLVERDestroy(solver);
...

A preconditioner has to be set up before the solver routines are called, as shown in Algorithm 8.

Algorithm 8 Setup and run routine for the solvers using a preconditioner
...
// Set up the preconditioner
HYPRE_PRECONDCreate(MPI_COMM_WORLD, &precond_solver);

// Optional fine-tuning of the preconditioner
...

// Set up the used solver
HYPRE_SOLVERCreate(MPI_COMM_WORLD, &solver);

// Optional fine-tuning of the solver
...

// Initialize the preconditioner
HYPRE_SOLVERSetPrecond(solver, HYPRE_PRECONDSolve, HYPRE_PRECONDSetup, precond_solver);

HYPRE_SOLVERSetup(solver, A, b, x);

// Solve the system
HYPRE_SOLVERSolve(solver, A, b, x);

// Destroy the solver and the preconditioner
HYPRE_SOLVERDestroy(solver);
HYPRE_PRECONDDestroy(precond_solver);
...
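As a concrete instance of this pattern, the CG/ParaSails combination used later in the thesis could be set up roughly as follows. This is a sketch assuming the ParCSR PCG and ParaSails interfaces of hypre 2.x; the preconditioner object precond is configured as outlined in Section 2.3.3, and parcsr_matrix, parcsr_b and parcsr_x are assumed to be the objects obtained via the IJ interface (Algorithms 5 and 6):

...
HYPRE_Solver solver;
/* precond: ParaSails object created and parameterized as in Section 2.3.3 */

HYPRE_ParCSRPCGCreate(MPI_COMM_WORLD, &solver);
HYPRE_PCGSetTol(solver, 1.e-7);              /* stopping criterion of Section 5.2 */
HYPRE_PCGSetMaxIter(solver, 10000);          /* illustrative limit                */

HYPRE_ParCSRPCGSetPrecond(solver, HYPRE_ParaSailsSolve,
                          HYPRE_ParaSailsSetup, precond);

HYPRE_ParCSRPCGSetup(solver, parcsr_matrix, parcsr_b, parcsr_x);
HYPRE_ParCSRPCGSolve(solver, parcsr_matrix, parcsr_b, parcsr_x);

HYPRE_ParCSRPCGDestroy(solver);
HYPRE_ParaSailsDestroy(precond);
...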

3.2 Z88

The second framework used in the thesis is the finite element program Z88. It was developed and designed especially for PCs and is subdivided into modules. Z88 is compatible with Windows and Unix platforms and is capable of exchanging data with CAD systems, e.g. Pro/ENGINEER. The framework covers 20 different element types in total, and it is possible to calculate plane stress, plate bending, axially symmetric structures and spatial structures up to 20-node Serendipity hexahedrons. Z88 consists of different modules which can be run separately. After finishing a job, the modules free the allocated memory. The communication between the functions is handled via input and output data sets. A graphical user interface (GUI) for the framework's modules is given by Z88COM, from which all operations can be started and controlled. Figure 3.3 (cf. [Rie08]) shows the GUI for the Windows and Unix Commander.

(a) Windows (b) Unix

Figure 3.3: Z88COM.

The basis of the framework are the solvers. At the moment Z88 supports three different solvers for calculating the FE problems. Z88F, a direct Cholesky solver without fill-in, is used for solving small to average size structures up to 30000 degrees of freedom. The PARDISO solver was developed by O. Schenk at the University of Basel and is currently the only parallel solver in the framework, using up to 9 CPUs. Using a direct decomposition with fill-in makes it possible to calculate medium size structures up to 150000 degrees of freedom. The huge memory requirement and the limited number of CPUs in Z88PAR are two disadvantages of this method. Z88I2 is a sparse matrix solver based on the CG algorithm. Since the iterative method is preconditioned with either SOR or a Cholesky decomposition, fast calculations of FE structures up to 5 million degrees of freedom are no problem. The execution of the method is split into two phases: in a first step Z88I1 builds up the structure of the global stiffness matrix; in the second step Z88I2 calculates the local stiffness matrices, assembles them into the global matrix and solves the system. Another important feature of Z88 is the possibility of creating Z88 input files from CAD data and the other way round. This is done with the module Z88X and the 3D converter Z88G. Joint forces and strains are calculated with the help of Z88D and Z88E. Furthermore, the framework includes a net generator, a file checker and its own plotting program. Since in version 12 the PARDISO solver was not yet included, only the Z88F and Z88I1/I2 solvers were used in the thesis.

4 Computer Architecture

The computer architecture used in this thesis is the Woodcrest Cluster (termed woody) [oE09] at the Regional Computing Center of Erlangen (RRZE). Woody is a high-performance cluster with 217 nodes. On each node (Figure 4.1) there are two "Xeon 5160" Woodcrest chips with 4 cores and 8 GB of RAM in total. The clock speed of the cores is 3.00 GHz, and they run with a 32 kB Level 1 cache and a 4 MB unified Level 2 cache.

Figure 4.1: Compute node of the Woodcrest Cluster.

The InfiniBand network of the cluster has a bandwidth of 10 GBit/s per link and direction. Furthermore, woody has a peak performance of 12 GFlops/s per processor and an overall peak performance of 10.4 TFlops/s. The performance measured with LINPACK is 6.62 TFlops/s. In November 2006 the system entered the Top 500 list on rank 124 and was still on rank 329 in November 2007.


5 Results

This chapter gives an overview of the results achieved when calculating the displacements. Moreover, it compares the different solvers and preconditioners with respect to time and number of iterations. In the beginning, the component parts used, together with their configurations, are introduced. Afterwards, the results of the solvers are presented, beginning with the Z88 methods, followed by BoomerAMG and CG. The chapter is concluded with a summary of all results.

5.1 Test Problems

This section gives a short overview of the component parts, on which the displacements were calculated. For all seven problems the boundary condition, number of nodes, number of elements, number of degrees of freedom, type of forces and the element type are mentioned.

(a) light view (b) grid view

Figure 5.1: L section.

L section (Figure 5.1)

• 3D
• Dirichlet boundary conditions
• 1758 nodes
• 6535 elements
• 5274 degrees of freedom
• surface pressure loads
• type of element
  – tetrahedrons
  – linear basis functions
  – size of stiffness matrix 12x12

(a) light view (b) grid view

Figure 5.2: Piston.

Piston (Figure 5.2)

• 3D
• Dirichlet boundary conditions
• 32522 nodes
• 129569 elements
• 97566 degrees of freedom
• surface pressure loads
• type of element
  – tetrahedrons
  – linear basis functions
  – size of stiffness matrix 12x12

(a) light view (b) grid view

Figure 5.3: Fan.

Fan (Figure 5.3)

• 3D
• Dirichlet boundary conditions
• 34495 nodes
• 130541 elements
• 103485 degrees of freedom
• surface pressure loads
• type of element
  – tetrahedrons
  – linear basis functions
  – size of stiffness matrix 12x12

(a) light view (b) grid view

Figure 5.4: Hub carrier.

Hub carrier (Figure 5.4)

• 3D
• Dirichlet boundary conditions
• 13392 nodes
• 58794 elements
• 40176 degrees of freedom
• surface pressure loads
• type of element
  – tetrahedrons
  – linear basis functions
  – size of stiffness matrix 12x12

(a) light view (b) grid view

Figure 5.5: Arch.

Arch (Figure 5.5)

• 3D
• Dirichlet boundary conditions
• 31431 nodes
• 18983 elements
• 94293 degrees of freedom
• surface pressure loads
• type of element
  – tetrahedrons
  – quadratic isoparametric serendipity elements
  – size of stiffness matrix 30x30

(a) light view (b) grid view

Figure 5.6: Connecting rod.

Connecting rod (Figure 5.6)

• 3D
• Dirichlet boundary conditions
• 35751 nodes
• 19622 elements
• 107253 degrees of freedom
• external forces
• type of element
  – tetrahedrons
  – quadratic isoparametric serendipity elements
  – size of stiffness matrix 30x30

(a) light view (b) grid view

Figure 5.7: I beam.

I beam (Figure 5.7)

• 3D
• Dirichlet boundary conditions
• 753474 nodes
• 4151839 elements
• 2260422 degrees of freedom
• external forces
• type of element
  – tetrahedrons
  – linear basis functions
  – size of stiffness matrix 12x12

5.2 Measurement Methods

The running time is measured with gettimeofday(), which has a resolution of microseconds. The routine is wrapped in the functions timer_start() and timer_lap(); the latter returns the time in seconds since the last call. Since the solvers are the focus of interest, only the solver routines themselves are measured. The code snippets in Algorithms 9, 10 and 11 show the implementation of the time measurement for Z88F in file z88cc.c, for Z88I2 in file z88ci.c and for the hypre solvers, respectively.

Algorithm 9 Time measurement of Z88F
...
timer_start();
choy88();
printf("Time for solving: %e", timer_lap());
...

Algorithm 10 Time measurement of Z88I2
...
timer_start();
sorcg88();
printf("Time for solving: %e", timer_lap());
...

Algorithm 11 Time measurement of hypre solvers
...
timer_start();
HYPRE_SOLVERSolve(solver, A, b, x);
printf("Time for solving: %e", timer_lap());
...
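The implementation of timer_start() and timer_lap() is not listed in the thesis; a minimal sketch based on gettimeofday() that matches the behavior described above could look like this (the static state variable and its handling are assumptions):

#include <sys/time.h>

static struct timeval t_last;          /* time of the last timer call */

void timer_start(void)
{
    gettimeofday(&t_last, NULL);
}

/* seconds elapsed since the previous timer_start()/timer_lap() call */
double timer_lap(void)
{
    struct timeval now;
    gettimeofday(&now, NULL);
    double secs = (double)(now.tv_sec - t_last.tv_sec)
                + (double)(now.tv_usec - t_last.tv_usec) * 1.0e-6;
    t_last = now;
    return secs;
}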

All solvers chosen in the thesis use the same stopping criterion, i.e. the iteration stops as soon as the residual is smaller than 10^{-7}.
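The iteration counts and residuals reported in the following tables can be queried from hypre after the solve; a short fragment (assuming the PCG interface of hypre 2.x) is:

...
int    num_iterations;
double final_res_norm;

HYPRE_PCGGetNumIterations(solver, &num_iterations);
HYPRE_PCGGetFinalRelativeResidualNorm(solver, &final_res_norm);
printf("iterations: %d, final relative residual: %e\n",
       num_iterations, final_res_norm);
...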

5.3 Z88

Table 5.1 shows the wall-clock time in seconds and the number of iterations for the Z88 solvers. Due to the large number of degrees of freedom of the problems, the Cholesky solver Z88F could only handle the L section. Z88I2, the iterative CG method, was preconditioned with SOR using a relaxation parameter ω = 1 (Gauß-Seidel). Referring to Section 5.1, the fan is one of the most complicated component parts tested in the thesis; Z88I2 needs 152 seconds and 2322 iterations to solve it.

                    Z88F               Z88I2
Test problems       time      iter     time      iter
L section           37.42     5274     0.41      224
Hub carrier         /         /        8.82      314
Piston              /         /        18.93     235
Arch                /         /        47.72     383
Fan                 /         /        152.36    2322
Connecting rod      /         /        134.60    987
I beam              /         /        1480.63   589

Table 5.1: Times for calculating the displacements of the component parts with Z88.

Figure 5.8 illustrates the relative residual over the iterations for the L section, solved by Z88I2.

[Plot: relative residual (log scale, 1 down to 1e-09) versus number of iterations (0 to 200); curve: Z88.]

Figure 5.8: Residual for solving the L section with Z88I2.

5.4 BoomerAMG

This section presents the results achieved with BoomerAMG used as a solver. The following optional parameters of the method were chosen: the strength threshold Θ, explained in equation (2.25), was set to 0.73 and the maximum number of cycles to 20. Furthermore, the Ruge3 coarsening and the F-F interpolation were chosen; for more information on these algorithms refer to [HYP06b]. Table 5.2 shows the solving time and the number of iterations for the component parts when BoomerAMG is used as a solver. The calculations of the three largest problems, the fan, the connecting rod and the I beam, were stopped after more than 12.5 hours (after more than 70000 iterations in the case of the fan and the connecting rod).

                    BoomerAMG
Test problems       time       iter
L section           35.86      6423
Hub carrier         523.39     4311
Piston              842.51     2799
Arch                7864.2     15038
Fan                 > 12.5 h   > 136532
Connecting rod      > 12.5 h   > 76000
I beam              > 12.5 h   > 2140

Table 5.2: Times for calculating the displacements of the component parts with BoomerAMG.

The bad results for AMG are not unexpected. According to [TO00] and [MGS03], several problems arise when solving LE problems with MG, e.g. concerning the rigid body modes and the near null space. Figure 5.9 illustrates the residual over the iterations for solving the L section; the displacements have been calculated with BoomerAMG used as a solver.

[Plot: relative residual (log scale, 1 down to 1e-09) versus number of iterations (0 to 6000); curve: BoomerAMG.]

Figure 5.9: Residual for solving the L section with BoomerAMG.

5.5 Conjugate Gradient

The last tested method of the thesis is CG. It was used as a plain solver and in combination with different preconditioners, e.g. the above mentioned BoomerAMG, Euclid, PILUT and ParaSails. CG with PILUT as a preconditioner diverged for every problem tested. Table 5.3 presents the times and numbers of iterations for the plain CG solver. Noticeable are the large numbers of iterations, which cause higher calculation times than Z88I2.

                    CG
Test problems       time       iter
L section           0.32       922
Hub carrier         46.30      6584
Piston              70.89      3773
Arch                76.20      2486
Fan                 362.46     20655
Connecting rod      563.95     15477
I beam              9033.94    10869

Table 5.3: Times for calculating the displacements of the component parts with CG.

Figure 5.10 illustrates the residuals over the iterations when the CG method is used for calculating the displacements of the L section.

[Plot: relative residual (log scale, 1 down to 1e-09) versus number of iterations (0 to 900); curve: CG.]

Figure 5.10: Residual for solving the L section with CG.

The next part of the section presents the results for the preconditioners BoomerAMG, Euclid and ParaSails in combination with CG. In the case of BoomerAMG, the parameters were set in the same way as for BoomerAMG used as a plain solver: the strength threshold was set to 0.73, Ruge3 was chosen as the coarsening strategy and the F-F interpolation was used. Table 5.4 shows the values for all component parts. It was not possible to calculate the displacements of the fan and the piston in combination with the Euclid preconditioner; the calculation was numerically unstable.

                  CG+BoomerAMG       CG+Euclid          CG+ParaSails
Test problems     time      iter     time      iter     time      iter
L section         1.87      23       0.22      85       0.16      228
Hub carrier       39.91     24       6.97      204      5.63      567
Piston            91.91     23       /         /        12.05     455
Arch              234.276   35       31.39     122      23.97     591
Fan               1234.87   319      /         /        106.78    3617
Connecting rod    854.50    120      88.32     289      46.36     1002
I beam            > 4h      > 56     4643.76   1445     5183.85   3971

Table 5.4: Times for calculating the displacements of the component parts with CG and a preconditioner.

Figure 5.11 illustrates once more the residuals over the iterations for the L section; this time, the CG method is used with different preconditioners.

[Plot: relative residual (log scale, 1 down to 1e-09) versus number of iterations (0 to 200); curves: CG+ParaSails, CG+Euclid, CG+BoomerAMG.]

Figure 5.11: Residual for solving the L section with CG and using Euclid, BoomerAMG or ParaSails as a preconditioner.

Furthermore, Figure 5.12 illustrates the exact times and the time ratios (t_hypre/t_Z88I2) for solving the displacements with the CG method preconditioned with ParaSails and with Z88I2. The time differences for the smaller problems are only marginal; e.g. the calculation of the displacements of the hub carrier is only 3 seconds faster than with Z88I2. Using these parallel methods is more profitable for larger problems: e.g. the fan is solved 46 seconds faster and the connecting rod 88 seconds faster with the combination of CG and ParaSails. The only exception among the component parts is the I beam; this problem is solved 61 minutes faster with Z88I2 than with the hypre methods.

32 1 100

0.8

10

0.6

1

time log scaled 0.4 range of t(hypre)/t(Z88)

0.1 0.2

Z88 hypre t(hypre)/t(Z88) 0.01 0 L section Hub carrier Piston Arch Connecting rod Fan Component Parts

Figure 5.12: Times and ratios for solving the component parts with CG+ParaSails and Z88I2.

Now, the section is concluded with the results for using multiple CPUs. Table 5.5 shows the calculation times and the speedup for the different component parts using up to 8 CPUs.

(a) L section                 (b) Hub carrier               (c) Arch
#CPU   time    speedup        #CPU   time    speedup        #CPU   time    speedup
1      0.16    1              1      5.63    1              1      23.97   1
2      0.14    1.15           2      4.08    1.37           2      15.70   1.53
3      0.08    1.88           3      3.90    1.44           3      16.12   1.49
4      0.07    2.28           4      3.70    1.52           4      15.67   1.53
5      0.07    2.38           5      3.28    1.71           5      12.68   1.89
6      0.06    2.49           6      2.81    2.00           6      11.18   2.15
7      0.07    2.36           7      2.79    2.02           7      10.99   2.18
8      0.08    1.88           8      2.65    2.12           8      10.64   2.25

(d) Piston                    (e) Fan                       (f) Connecting rod
#CPU   time    speedup        #CPU   time    speedup        #CPU   time    speedup
1      12.05   1              1      106.78  1              1      46.36   1
2      9.03    1.33           2      76.72   1.40           2      36.02   1.37
3      8.19    1.47           3      79.49   1.35           3      35.33   1.40
4      8.15    1.48           4      67.04   1.60           4      32.55   1.52
5      6.15    1.96           5      49.36   2.17           5      28.84   1.71
6      6.48    1.86           6      43.94   2.44           6      24.94   1.98
7      5.65    2.13           7      46.30   2.32           7      23.47   2.10
8      5.79    2.08           8      43.86   2.44           8      23.17   2.13

(g) I beam
#CPU   time      speedup
1      5183.85   1
2      3838.49   1.35
3      3667.69   1.41
4      3225.62   1.60
5      2480.51   2.09
6      1955.41   2.65
7      1918.76   2.70
8      1833.24   2.83

Table 5.5: Times and speedup for multiple CPUs.
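The speedup values in Table 5.5 are consistent with the usual definition S_p = t_1 / t_p (the thesis does not spell the definition out, but the tabulated values match it); for example, for the hub carrier on 8 CPUs:

\[ S_8 = \frac{t_1}{t_8} = \frac{5.63\,\text{s}}{2.65\,\text{s}} \approx 2.12, \]

which is the entry reported in Table 5.5(b).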

Figure 5.13 and Figure 5.14 illustrate the time ratios (thypre/tZ88) for solving the fan and the I beam with multiple CPUs, respectively.

[Plot: ratio t_hypre/t_Z88 versus number of CPUs (up to 120) for the fan.]

Figure 5.13: Ratios for solving the fan with multiple CPUs.

[Plot: ratio t_hypre/t_Z88 versus number of CPUs (50 to 250) for the I beam.]

Figure 5.14: Ratios for solving the I beam with multiple CPUs.

5.6 Conclusions

For all problems described in Section 5.1, the displacements caused by surface loads, pressure loads or external forces have been calculated. For this purpose a number of solvers and preconditioners have been tested and compared.

The direct Cholesky solver Z88F is only useful for small to medium sized problems. Of the component parts, only the L section with 5274 degrees of freedom could be calculated; all other problems led to segmentation faults and aborted the calculation. The iterative solver Z88I2 is the most used and most robust method within the Z88 framework. All mentioned problems have been solved in a fast and accurate way. For instance, Table 5.1 shows that the calculation of the displacements for the L section was about 37 seconds faster than with Z88F.

The worst results were achieved with BoomerAMG. Even the smallest problems take about 80 times longer than the fastest solution with CG and ParaSails. Solving the three largest problems was stopped after more than 12.5 hours (the calculation of the fan was terminated after 136532 iteration steps at a relative residual of 2·10^{-2}). According to Table 5.2, the solution of the L section takes over 6000 iterations with an average convergence factor of 0.9. Using BoomerAMG as a preconditioner does not improve the calculation speed: solving the equations for the fan in combination with CG is about 800 seconds slower than using the plain CG method, as shown in Tables 5.3 and 5.4. As mentioned before, these bad results are not unexpected; [TO00] and [MGS03] provide more details on the problems regarding LE and MG.

The best results were achieved with the combination of the CG method and ParaSails used as a preconditioner, both from the highly parallel hypre framework. Even with only 1 CPU the hypre combination is much faster than the Z88 solvers, except for the I beam, which has over 2 million degrees of freedom and was solved more slowly than with Z88I2. Comparing the smaller and simpler problems, shown in Tables 5.1 and 5.4, the time differences are only marginal: for example, the L section is calculated 0.25 seconds faster and the piston 6.88 seconds faster than with Z88I2. For more complex component parts the differences increase, as the same tables show. For instance, Z88I2 is about 46 seconds slower for calculating the displacements of the fan and 88 seconds slower for the connecting rod than the CG and ParaSails combination. These results are also illustrated in Figure 5.12, where the time differences of the frameworks are compared directly.

The performance of the hypre solvers can be increased considerably by using multiple CPUs. Table 5.5 shows the times and speedup values for the different problems when solving them with up to 8 CPUs. Even with just 6 CPUs, a doubling of the speed can be achieved for all component parts. For instance, solving the displacements of the fan using CG and ParaSails with 8 CPUs is 143% faster than the calculation with 1 CPU and even 247% faster than Z88I2. The fan was also calculated with up to 120 CPUs; Figure 5.13 shows that the best results were achieved with 30 to 45 processors. When using more than 50 CPUs the communication between the processors becomes too expensive and the solving times slowly increase again. With the help of multiple CPUs the solving time of the largest problem, the I beam, can also be reduced considerably: with 10 CPUs the calculation becomes as fast as Z88I2, and with 243 CPUs it is even 22.3 minutes faster. This behavior is shown in Figure 5.14.

6 Conclusion

The main idea of the thesis was to solve displacement equations and to test and compare various solvers and preconditioners. The displacements of the component parts are one of the application fields of LE. The problems were discretized with the help of the FEM, leading to the linear equation system (2.22). Afterwards this system was solved with the Z88 methods and with the parallel solvers and preconditioners from the hypre package. The best results were achieved with a parallel implementation of the CG algorithm using ParaSails as a preconditioner. The performance of these hypre methods was further increased by using multiple CPUs. Figure 6.1 shows the deformation of one of the component parts, the fan.

(a) Undeformed (b) Deformed

Figure 6.1: Deformation of the fan.


Bibliography

[Bab04]   Babel. https://computation.llnl.gov/casc/components/babel.html [Accessed 23.11.2009], 2004.

[Cho00]   E. Chow. Parallel implementation and performance characteristics of least squares sparse approximate inverse preconditioners. Int. J. High Perf. Comput. Apps, 2000.

[Fal06]   R. D. Falgout. An introduction to algebraic multigrid. Computing in Science and Engineering, 2006.

[FY02]    R. D. Falgout and U. M. Yang. hypre: a library of high performance preconditioners. Computational Science, 2002.

[HYP06a]  Hypre - Reference Manual. https://computation.llnl.gov/casc/hypre/software.html [Accessed 23.11.2009], 2006.

[HYP06b]  Hypre - User's Manual. https://computation.llnl.gov/casc/hypre/software.html [Accessed 23.11.2009], 2006.

[HYP09]   Hypre. https://computation.llnl.gov/casc/linear_solvers/sls_hypre.html [Accessed 23.11.2009], 2009.

[JL01]    M. Jung and U. Langer. Methode der finiten Elemente für Ingenieure. Teubner Verlag, 2001.

[MGS03]   M. Griebel, D. Oeltz and M. A. Schweitzer. An algebraic multigrid method for linear elasticity. SIAM, 2003.

[MPI09]   MPI. http://www.mcs.anl.gov/research/projects/mpi/ [Accessed 23.11.2009], 2009.

[oE09]    Regional Computing Center of Erlangen. Woodcrest Cluster. http://www.rrze.uni-erlangen.de/dienste/arbeiten-rechnen/hpc/systeme/woodcrest-cluster.shtml [Accessed 23.11.2009], 2009.

[RH09]    F. Rieg and R. Hackenschmidt. Finite Elemente Analyse für Ingenieure. HANSER, 2009.

[Rie08]   F. Rieg. Z88 - User's Manual. http://www.z88.uni-bayreuth.de/english.html [Accessed 23.11.2009], 2008.

[She94]   J. R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain. Technical report, School of Computer Science, Carnegie Mellon University, 1994.

[TO00]    U. Trottenberg and C. W. Oosterlee. Multigrid: Basics, Parallelism and Adaptivity. Academic Press, 2000.

[WBM99]   W. Briggs, V. E. Henson and S. McCormick. A Multigrid Tutorial. SIAM, 1999.

[Z8809]   Z88. http://www.z88.uni-bayreuth.de/ [Accessed 23.11.2009], 2009.
