SIAM J. SCI. COMPUT. © 2015 Axel Klawonn, Martin Lanser, Oliver Rheinbach. Vol. 37, No. 6, pp. C667–C696

TOWARD EXTREMELY SCALABLE NONLINEAR DOMAIN DECOMPOSITION METHODS FOR ELLIPTIC PARTIAL DIFFERENTIAL EQUATIONS∗

AXEL KLAWONN†, MARTIN LANSER†, AND OLIVER RHEINBACH‡

Abstract. The solution of nonlinear problems, e.g., in material science, requires fast and highly scalable parallel solvers. Finite element tearing and interconnecting dual primal (FETI-DP) domain decomposition methods are parallel solution methods for implicit problems discretized by finite elements. Recently, nonlinear versions of the well-known FETI-DP methods for linear problems have been introduced. In these methods, the nonlinear problem is decomposed before linearization. This approach can be viewed as a strategy to further localize computational work and to extend the parallel scalability of FETI-DP methods toward extreme-scale supercomputers. Here, a recent nonlinear FETI-DP method is combined with an approach that allows an inexact solution of the FETI-DP coarse problem. We combine the nonlinear FETI-DP domain decomposition method with an algebraic multigrid (AMG) method and thus obtain a hybrid nonlinear domain decomposition/multigrid method. We consider scalar nonlinear problems as well as nonlinear hyperelasticity problems in two and three space dimensions. For the first time for a domain decomposition method, weak parallel scalability can be shown beyond half a million cores and subdomains. We can show weak parallel scalability for up to 524 288 cores on the Mira Blue Gene/Q supercomputer for our new implementation and discuss the steps necessary to obtain these results. We solve a heterogeneous nonlinear hyperelasticity problem discretized using piecewise quadratic finite elements with a total of 42 billion degrees of freedom in about six minutes. Our analysis reveals that scalability beyond 524 288 cores depends critically on both efficient construction and solution of the coarse problem.

Key words. domain decomposition, FETI-DP, nonlinear, parallel computing

AMS subject classifications. 65N55, 65F08, 65F10, 65Y05

DOI. 10.1137/140997907

∗Submitted to the journal’s Software and High-Performance Computing section December 1, 2014; accepted for publication (in revised form) September 24, 2015; published electronically December 8, 2015. This work was supported in part by the German Research Foundation (DFG) through the Priority Programme 1648 “Software for Exascale Computing” (SPPEXA), projects KL 2094/4-1 and RH 122/2-1. A preliminary version of this paper has been submitted as a proceedings paper [40]. The present paper has been completely revised and significantly extended. http://www.siam.org/journals/sisc/37-6/99790.html
†Mathematisches Institut, Universität zu Köln, Weyertal 86-90, 50931 Köln, Germany ([email protected], [email protected]).
‡Fakultät für Mathematik und Informatik, Institut für Numerische Mathematik und Optimierung, Technische Universität Bergakademie Freiberg, 09596 Freiberg, Germany ([email protected]).

1. Introduction. The discretization of elliptic partial differential equations leads to very large and very ill-conditioned problems. It is a challenging task to build a parallel scalable solver for the resulting discrete linear or nonlinear systems. This is partially due to the fact that iterative elliptic solvers, such as multigrid or domain decomposition methods, need a global communication mechanism to be scalable. Here, we consider new nonoverlapping domain decomposition methods which belong to the family of finite element tearing and interconnecting dual primal (FETI-DP) domain decomposition methods. Domain decomposition methods are divide and conquer algorithms that rely on a geometrical decomposition of the original problem. FETI-DP domain decomposition methods for linear or linearized problems were first introduced in Farhat et al. [24, 23]. They belong to the family of nonoverlapping domain decomposition methods that have reduced communication compared to other domain decomposition approaches such as overlapping Schwarz methods [56, 58]. In finite element tearing and interconnecting (FETI) methods, the original problem is decomposed into problems on nonoverlapping subdomains. The continuity of the global solution is enforced as a linear constraint by dual Lagrange multipliers. In FETI-DP methods, additional (primal) constraints are enforced throughout the iteration; see Figure 1; for further details, see [58, 46]. A classical FETI-DP method was awarded a Gordon Bell prize in 2002 [7] for a structural mechanics simulation on unstructured grids using 3 400 ASCI White cores. Modified versions, i.e., inexact FETI-DP domain decomposition methods, introduced in [43], have scaled up to 65 536 Blue Gene/P cores in 2009 [46]. For highly scalable parallel algebraic multigrid (AMG) solvers, see, e.g., [1], where BoomerAMG was shown to be parallel scalable to 100 000 cores in 2012. Scalability for multigrid solvers to the complete JUQUEEN machine at Forschungszentrum Jülich, Germany, has recently been shown for porous media [35] using DUNE [5, 6] and for earth mantle convection [55]. More recently, a computational scale bridging approach combined with FETI-DP methods was also scaled to the complete JUQUEEN [39].

The classical approach to solving nonlinear partial differential equations with FETI-DP methods uses a Newton–Krylov (NK) FETI-DP approach in which the discretized nonlinear problem is linearized by a Newton method, possibly within a globalization loop. Then, in each Newton step, the linear system defined by the tangent matrix is solved iteratively using a Krylov subspace method combined with a FETI-DP method. In [37, 38] a new approach was considered where the nonlinear problem is first decomposed and then linearized, yielding the new family of nonlinear FETI-DP methods. The numerical results presented in [37, 38] were obtained sequentially using MATLAB, and only scalar problems based on variants of the p-Laplace operator were considered. As in traditional (exact) FETI-DP methods, in the nonlinear FETI-DP algorithms in [37, 38], direct solvers are used for the elimination of the local subdomain problems and for the factorization of the coarse Schur complement matrix. The latter prevents the use of this method on parallel supercomputers with several hundreds of thousands of cores due to the superlinear memory and time complexity of the direct solver in the coarse problem.
In this paper, we overcome this limitation and discuss the steps necessary to obtain scalability to half a million cores. The paper presents several major new contributions to the field of nonlinear FETI-DP methods. First, a nonlinear FETI-DP method allowing for the inexact solution of the coarse problem is presented together with a new scalable implementation of this method. A different formulation allows the use of an AMG preconditioner for the inexact solution of the coarse problem. This is opposed to our earlier nonlinear FETI-DP methods [37, 38], where we only presented sequential results using MATLAB and solved the coarse problem exactly. The derivation of this new inexact nonlinear FETI-DP method alongside preliminary results was first presented at the 22nd International Conference on Domain Decomposition Methods; see [40]. Meanwhile, the scalability of this new method has been substantially improved. In the present paper, parallel numerical results based on these improvements are shown. Second, for the first time for any linear or nonlinear FETI-DP-type domain decomposition method, weak parallel scalability is achieved for more than 500 000 processor cores. Let us note that to date we are not aware of such a scalability result for any other domain decomposition method. Finally, we give a detailed analysis of our implementation to find components that limit the scalability. The corresponding findings are discussed, and solutions to overcome these issues and to further improve the parallel performance are provided.

We present weak parallel scaling results for nonlinear problems using 32 768 cores of SuperMUC at Leibniz-Rechenzentrum Munich, 131 072 cores of the Vulcan Blue Gene/Q supercomputer at Lawrence Livermore National Laboratory, and 524 288 cores of the Mira Blue Gene/Q supercomputer at Argonne National Laboratory. In our method, the nonlinear problem is decomposed before linearization, using Lagrange multipliers. We use a hybrid domain decomposition/multigrid method that combines a nonlinear FETI-DP domain decomposition method with a parallel AMG preconditioner. The nonlinear approach serves to reduce communication, and the multilevel approach allows the extension of the parallel scalability by adding further coarse levels. If the method is applied to linear problems and the coarse solver is iterated until convergence in each outer iteration, then the method is equivalent to the standard FETI-DP method. In the nonlinear method presented in this paper, however, one or two multigrid iterations for the coarse problem are sufficient in each outer iteration. Note that in nonlinear FETI-DP methods some global communication still takes place in the outermost loop. With respect to software development aspects it is interesting that the algorithmic building blocks of the new nonlinear methods and of a corresponding classical NK FETI-DP method remain largely identical; they are only arranged in a different way. A well-designed existing implementation of a traditional FETI-DP method can thus easily be transformed into a nonlinear FETI-DP method.

2. Nonlinear domain decomposition. This section puts the method discussed in this paper into the context of existing linear and nonlinear domain decomposition methods. The experienced reader may skip this section.
Fast solvers for nonlinear elliptic partial differential equations discretized by finite elements are iterative schemes with several levels of nested iterations. A classical strategy for a given discrete nonlinear problem

(2.1) $A(u) = 0$

is the reduction of the nonlinear problem to a sequence of linear problems. Superlinear convergence of such schemes can only be expected if the iterative correction $\delta u^{(k)}$ in

(2.2) $u^{(k+1)} = u^{(k)} - \delta u^{(k)}$

is sufficiently close to a step of the classical Newton method. Since convergence can only be expected from a sufficiently good initial guess, a suitable globalization technique is used, e.g., line search, trust region methods, or load stepping. Let us denote the sequence of linearized problems by

(2.3) $DA(u^{(k)})\,\delta u^{(k)} = A(u^{(k)}), \quad k = 1, \ldots, n_N.$

The left-hand side $DA(u^{(k)})$ is sparse but can be very large and is ill-conditioned. Today, fast solvers for these linearized systems are, again, iterative schemes such as multilevel, multigrid [59], or domain decomposition methods [56, 58]. These methods are often used as preconditioners; i.e., for acceleration they are embedded in a Krylov subspace iteration such as the conjugate gradient (CG) method or the generalized minimal residual method (GMRES) [28]. These algorithms are then referred to as preconditioned Newton–Krylov (NK) methods.
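To make the nesting of these loops concrete, the following self-contained C program sketches a plain NK iteration for a model problem that is not taken from this paper (a 1D finite-difference discretization of -u'' + u^3 = 1); the inner Krylov solver is a bare, unpreconditioned CG iteration. It only illustrates the structure of (2.1)-(2.3) and is not the authors' code.

```c
/* Illustration of a Newton-Krylov iteration, cf. (2.1)-(2.3).  The model
 * problem -u'' + u^3 = 1 on (0,1) with homogeneous Dirichlet data is NOT
 * taken from the paper; it only shows the nesting of the outer Newton loop
 * and the inner (here: unpreconditioned CG) Krylov loop. */
#include <math.h>
#include <stdio.h>

#define N 64                         /* interior grid points */
static const double h = 1.0 / (N + 1);

/* residual A(u), cf. (2.1) */
static void residual(const double *u, double *r) {
  for (int i = 0; i < N; ++i) {
    double ul = (i > 0)     ? u[i - 1] : 0.0;
    double ur = (i < N - 1) ? u[i + 1] : 0.0;
    r[i] = (2.0 * u[i] - ul - ur) / (h * h) + u[i] * u[i] * u[i] - 1.0;
  }
}

/* matrix-free application of the tangent DA(u), cf. (2.3) */
static void tangent_apply(const double *u, const double *x, double *y) {
  for (int i = 0; i < N; ++i) {
    double xl = (i > 0)     ? x[i - 1] : 0.0;
    double xr = (i < N - 1) ? x[i + 1] : 0.0;
    y[i] = (2.0 * x[i] - xl - xr) / (h * h) + 3.0 * u[i] * u[i] * x[i];
  }
}

static double dot(const double *a, const double *b) {
  double s = 0.0; for (int i = 0; i < N; ++i) s += a[i] * b[i]; return s;
}

/* inner Krylov solver: CG steps for DA(u) du = r */
static void cg(const double *u, const double *r, double *du, int maxit) {
  double p[N], q[N], res[N];
  for (int i = 0; i < N; ++i) { du[i] = 0.0; res[i] = r[i]; p[i] = r[i]; }
  double rho = dot(res, res);
  for (int it = 0; it < maxit && rho > 1e-20; ++it) {
    tangent_apply(u, p, q);
    double alpha = rho / dot(p, q);
    for (int i = 0; i < N; ++i) { du[i] += alpha * p[i]; res[i] -= alpha * q[i]; }
    double rho_new = dot(res, res);
    for (int i = 0; i < N; ++i) p[i] = res[i] + (rho_new / rho) * p[i];
    rho = rho_new;
  }
}

int main(void) {
  double u[N] = {0.0}, r[N], du[N];
  residual(u, r);
  double norm0 = sqrt(dot(r, r));
  for (int k = 0; k < 20; ++k) {                 /* outer Newton loop, (2.2) */
    cg(u, r, du, 200);                           /* inner Krylov loop        */
    for (int i = 0; i < N; ++i) u[i] -= du[i];   /* full step, no line search */
    residual(u, r);
    double norm = sqrt(dot(r, r));
    printf("Newton step %d: ||A(u)|| = %.3e\n", k + 1, norm);
    if (norm <= 1e-8 * norm0) break;             /* relative tolerance 1e-8  */
  }
  return 0;
}
```

In the methods discussed below, the inner solve would be preconditioned, e.g., by a domain decomposition or multigrid method, and the outer loop would additionally employ a globalization strategy such as a line search.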

The primitives of (preconditioned) Krylov methods are inner products, matrix-vector multiplications, daxpy operations, and the application of the preconditioner. For numerically scalable preconditioners, the latter is the computationally most expensive step. Numerical scalability demands that the number of Krylov iterations can be bounded independently of the problem size. This property is widely regarded as a prerequisite to obtaining parallel scalability for ill-conditioned problems. FETI-DP methods are, in this sense, numerically scalable domain decomposition methods.

The parallelization of the solution scheme is then performed on the level of the Krylov iteration, by providing parallel versions of the Krylov primitives, and on the level of the (linear) preconditioner. The parallelization of the latter is typically performed by a geometrical decomposition of the two- or three-dimensional problem. This decomposition can be computed by applying graph partitioners to the finite element mesh or the matrix graph. This approach has proven efficient for parallel multigrid methods as well as for parallel domain decomposition preconditioners. Global communication appears during the Krylov primitives and, to obtain numerical scalability, in the preconditioner. In preconditioned NK methods the decomposition into concurrent subproblems is thus performed after linearization; i.e., conceptually, it is introduced in an interior loop. As a result, in the innermost loop of NK FETI-DP methods, in every iteration step, nearest-neighbor communication at the subdomain interfaces is necessary in addition to (limited) global coarse communication.

On the other hand, in nonlinear FETI-DP methods [38] the decomposition into concurrent subproblems is performed early, i.e., before linearization. In consequence, the nearest-neighbor communication can be removed from the main loop, resulting in improved concurrency. Nonlinear approaches to domain decomposition can thus be viewed as a strategy to localize computational work in order to extend parallel scalability toward future extreme-scale supercomputers. Moreover, in these approaches local nonlinearities are resolved by local computational work, which may reduce the total energy to solution if other cores are idle long enough to make entering a power conservation state worthwhile.

Different approaches to nonlinear domain decomposition are known. The nonlinear, overlapping additive Schwarz preconditioned inexact Newton (ASPIN) approach was introduced in Cai and Keyes [11]; see also [34, 12, 33] and Groß and Krause [30, 29]. As opposed to our approach here, this is an overlapping domain decomposition method.

Nonlinear domain decomposition has also been used as a coupling method rather than as a solver, e.g., in fluid-structure interaction (FSI); see Deparis et al. [19, 18], Deparis [17], or Fernandez et al. [25]; it has also been used for the coupling of multiphase flow; see, e.g., Ganis et al. [27, 26]. In these approaches the number of subdomains is small, i.e., two in the case of FSI, and there is no need for a coarse problem. This is opposed to the method presented here, where, due to the large number of subdomains, the coarse space plays an important role. Indeed, no scalability can be achieved without a coarse space.

Nonlinear FETI-1 methods were introduced in Pebrel, Rey, and Gosselet [53], and nonlinear Neumann–Neumann methods, as a scalable solver approach, were introduced in Bordeu, Boucard, and Gosselet [8].
These methods are related to our method, but ours does not use a projection for the coarse space. This allows the use of a preconditioner for the coarse problem, which is central to the approach presented in this paper. Nonlinear Schwarz methods as a solver, i.e., not as a nonlinear preconditioner, have been considered earlier; see, e.g., [21, 10]. A technique to use local nonlinear problems in standard methods has been denoted nonlinear localization [13]. Nonlinear multigrid methods, where on each level a nonlinear problem is considered and thus nonlinear smoothers are used, have also been known for some time [59].

In domain decomposition methods of the FETI-DP [24, 23, 46, 47, 44, 48] and BDDC types [20, 14, 50, 49, 51], the coarse spaces are constructed from partial assembly of the finite elements (dual primal). This has facilitated the extension of the scalability of these methods; see, e.g., [61, 52, 43, 45, 60, 57]. Inexact FETI-DP methods were introduced in [43], and their parallel scalability was demonstrated in [46, 54]. These methods are nonoverlapping domain decomposition methods. Scalable overlapping domain decomposition preconditioners for linear problems are also known [56, 58]; see [36] for a recent study on robustness and scalability. A recent overview of approaches to nonlinear solution methods and their combination can be found in [9].

3. Nonlinear FETI-DP formulation. Let $\Omega \subset \mathbb{R}^d$, $d = 2, 3$, be a computational domain, discretized by finite elements, and let $V^h$ be the corresponding finite element space. We consider a nonlinear, typically elliptic, problem on $\Omega$. In nonlinear FETI-DP methods we decompose $\Omega$ into $N$ nonoverlapping subdomains $\Omega_i$, $i = 1, \ldots, N$. We will denote the size of a finite element by $h$ and the size of a subdomain by $H$. We consider the minimization of a nonlinear energy $J: V^h \to \mathbb{R}$,

(3.1) $\min_{u \in V^h} J(u)$.

For standard problems, such as hyperelasticity, discretized by finite elements, this global energy can be represented by a sum

(3.2) $J(u) = \sum_{i=1}^{N} J_i(u_i)$

of local energies on the nonoverlapping subdomains; for details, see [40, 38]. Our geometric decomposition into concurrent problems is thus performed directly on the nonlinear energy, i.e., on the highest possible level rather than after linearization as in standard NK FETI-DP approaches. The solution of concurrent nonlinear problems will therefore appear as a building block of all the different nonlinear FETI-DP methods. Let $\varphi_{i,j}$, $i = 1, \ldots, N$, $j = 1, \ldots, N_i$, be the nodal finite element basis functions for the local finite element space $W_i$. We write

$J_i'(u_i)(\varphi_{i,j}) = (K_i(u_i) - f_i)_j,$

where $f_i$ is independent of $u_i$. We define the nonlinear, discrete block operators and block vectors

$K(u) := \begin{pmatrix} K_1(u_1) \\ \vdots \\ K_N(u_N) \end{pmatrix}, \quad f := \begin{pmatrix} f_1 \\ \vdots \\ f_N \end{pmatrix}, \quad u := \begin{pmatrix} u_1 \\ \vdots \\ u_N \end{pmatrix},$

where $K_i(u_i)$, $i = 1, \ldots, N$, define independent, concurrent nonlinear subdomain problems.
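In a distributed-memory implementation, each MPI rank typically owns exactly the data belonging to its block of $K$, $f$, and $u$, plus the coupling information introduced below. The following C struct is a purely illustrative sketch of such a per-subdomain container; the member names are hypothetical, anticipate the building blocks of section 5, and are not taken from the authors' code.

```c
/* Hypothetical per-subdomain (per-rank) data container for a nonlinear
 * FETI-DP solver; the member names are illustrative only and do not
 * correspond to the authors' implementation. */
#include <stddef.h>

typedef struct {
  size_t  n_local;        /* local degrees of freedom of K_i(u_i)            */
  size_t  n_dual;         /* dual (interface) unknowns acted on by B^(i)     */
  size_t  n_primal;       /* primal unknowns coupled through R_Pi^(i)        */

  double *u_i;            /* local solution vector u_i                       */
  double *f_i;            /* local right-hand side f_i (independent of u_i)  */
  double *K_i_of_u;       /* local nonlinear residual K_i(u_i)               */

  void   *DK_i;           /* sparse tangent DK_i(u_i), e.g., in CSR storage  */
  void   *DK_i_BB_lu;     /* sparse LU factorization of its (B,B) block      */

  int    *dual_map;       /* local-to-global map for Lagrange multipliers    */
  int    *primal_map;     /* local-to-global map for primal (coarse) dofs    */
} SubdomainData;
```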

Fig. 1. The operator $\widetilde{K}$ represents partially assembled nonlinear subdomain problems. Here, $4 \times 4$ subdomain problems are assembled at the vertices.

Let $R_\Pi^T$ be the standard FETI-DP partial assembly operator that is used to define the FETI-DP coarse problem; see, e.g., [46, 58] for the notation. It introduces coupling in a small set of interface variables, denoted as primal variables. In typical scenarios the number of primal variables is three to four orders of magnitude smaller than that of the original problem; see, e.g., [42, 46]. Let $B$ be the standard FETI-DP jump operator; see, e.g., [46, 58]. The linear constraint $B\tilde{u} = 0$ enforces continuity of the solution in the dual variables across the subdomain boundaries. We define the nonlinear, partially assembled operator

(3.3) $\widetilde{K}(\tilde{u}) := R_\Pi^T K(R_\Pi \tilde{u})$

and the corresponding right-hand side $\tilde{f} := R_\Pi^T f$. We can then introduce the nonlinear FETI-DP master system [37, 38]

(3.4) $\widetilde{K}(\tilde{u}) + B^T\lambda - \tilde{f} = 0, \qquad B\tilde{u} = 0.$

We assume that the operator $\widetilde{K}$ is continuously differentiable and locally invertible. In a typical parallel implementation, the application of the operators $B$ and $B^T$ includes nearest-neighbor communication.

4. A hybrid nonlinear FETI-DP/multigrid method. We now describe the new hybrid nonlinear domain decomposition/multigrid approach that we propose and use in our new parallel code.

4.1. Derivation of the method. Newton linearization of (3.4) with respect to $(\tilde{u}, \lambda)$ results in the linear system

(4.1) $\begin{pmatrix} D\widetilde{K}(\tilde{u}) & B^T \\ B & 0 \end{pmatrix} \begin{pmatrix} \delta\tilde{u} \\ \delta\lambda \end{pmatrix} = \begin{pmatrix} \widetilde{K}(\tilde{u}) + B^T\lambda - \tilde{f} \\ B\tilde{u} \end{pmatrix}.$

As in standard FETI-DP methods, we partition $\delta\tilde{u}$ into primal variables $\delta\tilde{u}_\Pi$ and dual variables $\delta u_B$, i.e.,

$\delta\tilde{u} = \begin{pmatrix} \delta u_B \\ \delta\tilde{u}_\Pi \end{pmatrix}.$

Since, by construction, we have continuity in the primal variables $\delta\tilde{u}_\Pi$, the jump operator can be written as $B = [\,B_B \ \ O\,]$, where the first block $B_B$ corresponds to the dual variables $\delta u_B$ and the second, zero block to the (subassembled) primal variables $\delta\tilde{u}_\Pi$. We can write (4.1) accordingly, i.e., as a block system

(4.2) $\begin{pmatrix} (D\widetilde{K}(\tilde{u}))_{BB} & (D\widetilde{K}(\tilde{u}))_{\Pi B}^T & B_B^T \\ (D\widetilde{K}(\tilde{u}))_{\Pi B} & (D\widetilde{K}(\tilde{u}))_{\Pi\Pi} & 0 \\ B_B & 0 & 0 \end{pmatrix} \begin{pmatrix} \delta u_B \\ \delta\tilde{u}_\Pi \\ \delta\lambda \end{pmatrix} = \begin{pmatrix} (\widetilde{K}(\tilde{u}))_B + B_B^T\lambda - \tilde{f}_B \\ (\widetilde{K}(\tilde{u}))_\Pi - \tilde{f}_\Pi \\ B_B u_B \end{pmatrix}.$

Assuming that $(D\widetilde{K}(\tilde{u}))_{BB}$ is invertible, we perform one step of block Gauss elimination of $u_B$ and obtain a reduced system

(4.3) $A_r x_r = F_r$;

(4.4) $A_r = \begin{pmatrix} S_{\Pi\Pi} & -(D\widetilde{K}(\tilde{u}))_{\Pi B}\,(D\widetilde{K}(\tilde{u}))_{BB}^{-1} B_B^T \\ \text{(sym.)} & -B_B\,(D\widetilde{K}(\tilde{u}))_{BB}^{-1} B_B^T \end{pmatrix}$;

(4.5) $x_r = \begin{pmatrix} \delta\tilde{u}_\Pi \\ \delta\lambda \end{pmatrix}$;

cf. the notation in [43] for linear problems. The Schur complement $S_{\Pi\Pi}$ in (4.4), i.e.,

(4.6) $S_{\Pi\Pi} = (D\widetilde{K}(\tilde{u}))_{\Pi\Pi} - (D\widetilde{K}(\tilde{u}))_{\Pi B}\,(D\widetilde{K}(\tilde{u}))_{BB}^{-1}\,(D\widetilde{K}(\tilde{u}))_{\Pi B}^T,$

constitutes the coarse problem of the method. An elimination of $\tilde{u}_\Pi$ would yield the Nonlinear-FETI-DP-1 (NL1) system

$F_{\mathrm{NL1}}\,\delta\lambda = d$

introduced in [37, 38]. Here, we choose not to carry out the elimination of $\tilde{u}_\Pi$ but to solve the block system (4.4) iteratively instead, using the left block-triangular preconditioner

(4.7) $B_{r,L}^{-1} := B_{r,L}^{-1}\bigl(\widehat{S}_{\Pi\Pi}^{-1},\, (D\widetilde{K}(\tilde{u}))_{BB}^{-1},\, M^{-1}\bigr) := \begin{pmatrix} \widehat{S}_{\Pi\Pi}^{-1} & 0 \\ -M^{-1} B_B\,(D\widetilde{K}(\tilde{u}))_{BB}^{-1}\,(D\widetilde{K}(\tilde{u}))_{\Pi B}^T\, \widehat{S}_{\Pi\Pi}^{-1} & -M^{-1} \end{pmatrix}.$

This new approach, along with some preliminary results, was first proposed in a DD22 plenary talk [40]. It combines the inexact approach from [43, 46] with the Nonlinear-FETI-DP-1 method [37, 38] and may therefore also be denoted the irNonlinear-FETI-DP-1 (inexact reduced Nonlinear-FETI-DP-1) method. The application of $\widehat{S}_{\Pi\Pi}^{-1}$ will consist of a few cycles of a parallel AMG method; for $(D\widetilde{K}(\tilde{u}))_{BB}^{-1}$ we will use concurrent sparse direct solvers on the subdomains. Here, $M^{-1}$ is a good, parallel preconditioner for the dual Schur complement. We choose $M^{-1}$ as one of the standard FETI-DP preconditioners, e.g., the Dirichlet preconditioner [58]. Its application is embarrassingly parallel.

We use a Krylov subspace method suitable for unsymmetric systems since the preconditioner (4.7) is unsymmetric. The use of CG requires a symmetric reformulation; see, e.g., [43].

This method is, in general, not identical to a standard FETI-DP method applied after Newton linearization. In nonlinear FETI-DP methods, continuity of the solution may not be reached until convergence of the Newton iteration. This is different from standard exact or inexact NK FETI-DP methods, where each Newton iterate is continuous.
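The application of (4.7) to a residual $(r_\Pi, r_\lambda)$ thus consists of AMG cycles for the coarse block, local sparse forward-backward substitutions, the jump operator, and the Dirichlet preconditioner. The following C sketch only fixes this order of operations; all callbacks are hypothetical placeholders and this is not the authors' code.

```c
/* Sketch of applying the left block-triangular preconditioner (4.7) to a
 * residual (r_Pi, r_lambda):
 *   z_Pi     = S^_PiPi^{-1} r_Pi                                (AMG cycles)
 *   z_lambda = -M^{-1} ( B_B (DK)_BB^{-1} (DK)_PiB^T z_Pi + r_lambda ).
 * All callbacks are hypothetical placeholders. */
#include <stddef.h>

typedef void (*ApplyOp)(const double *in, double *out, void *ctx);

typedef struct {
  ApplyOp amg_SPiPi_inv;   /* a few BoomerAMG V-cycles approximating S_PiPi^{-1} */
  ApplyOp DK_PiB_T;        /* apply (DK)_PiB^T : primal-sized -> dual-sized      */
  ApplyOp DK_BB_inv;       /* local sparse LU forward/backward substitutions     */
  ApplyOp B_B;             /* jump operator (nearest-neighbor communication)     */
  ApplyOp M_inv;           /* Dirichlet preconditioner (embarrassingly parallel) */
  void   *ctx;
  size_t  n_primal, n_dual, n_lambda;
} BlockTriangularPrec;

void apply_Brl_inverse(const BlockTriangularPrec *P,
                       const double *r_Pi, const double *r_lambda,
                       double *z_Pi, double *z_lambda,
                       double *work_dual, double *work_lambda)
{
  size_t i;

  /* coarse block: z_Pi ~= S_PiPi^{-1} r_Pi */
  P->amg_SPiPi_inv(r_Pi, z_Pi, P->ctx);

  /* off-diagonal block: w = B_B (DK)_BB^{-1} (DK)_PiB^T z_Pi */
  P->DK_PiB_T(z_Pi, work_dual, P->ctx);
  P->DK_BB_inv(work_dual, work_dual, P->ctx);   /* assumed to allow in-place use */
  P->B_B(work_dual, work_lambda, P->ctx);

  /* dual block: z_lambda = -M^{-1} (w + r_lambda) */
  for (i = 0; i < P->n_lambda; ++i) work_lambda[i] += r_lambda[i];
  P->M_inv(work_lambda, z_lambda, P->ctx);
  for (i = 0; i < P->n_lambda; ++i) z_lambda[i] = -z_lambda[i];
}
```

Wrapped in, e.g., a PETSc PCSHELL, such a routine could serve as the left preconditioner for the outer Krylov iteration.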

4.2. Computing an initial value using concurrent nonlinear problems. A good initial value can be crucial for the convergence of Newton-type methods. A suitable initial value $\tilde{u}^{(0)}$ for the Newton method as presented in section 4.1 has to be continuous in all primal variables $\tilde{u}_\Pi^{(0)}$ and should provide a good local approximation of the given problem. Note that the start value is allowed to be discontinuous in the dual variables $u_B^{(0)}$. A possible choice of an initial value $\tilde{u}^{(0)}$ has been proposed in [38]. An initial value can be obtained from solving the nonlinear problem

(4.8) $\widetilde{K}(\tilde{u}^{(0)}) = \tilde{f} - B^T\lambda^{(0)}.$

For simplicity, we use $\lambda^{(0)} = 0$. Here, (4.8) requires the solution of concurrent nonlinear subdomain problems which are only coupled in the primal unknowns. The communication is thus limited to the primal variables, i.e., a few unknowns for each MPI rank. This step can also be seen as a nonlinear localization step. Linearization of (4.8) results in

(4.9) $\begin{pmatrix} (D\widetilde{K}(\tilde{u}))_{BB} & (D\widetilde{K}(\tilde{u}))_{\Pi B}^T \\ (D\widetilde{K}(\tilde{u}))_{\Pi B} & (D\widetilde{K}(\tilde{u}))_{\Pi\Pi} \end{pmatrix} \begin{pmatrix} \delta u_B \\ \delta\tilde{u}_\Pi \end{pmatrix} = \begin{pmatrix} (\widetilde{K}(\tilde{u}))_B + B_B^T\lambda - \tilde{f}_B \\ (\widetilde{K}(\tilde{u}))_\Pi - \tilde{f}_\Pi \end{pmatrix}.$

A block Gauss elimination of uB yields the symmetric system

(4.10) $S_{\Pi\Pi}\,\delta\tilde{u}_\Pi = \tilde{d}_\Pi,$

where $S_{\Pi\Pi}$ is defined as in (4.6). We solve (4.10) by a Krylov method using the AMG preconditioner $\widehat{S}_{\Pi\Pi}^{-1}$; see (4.7). We will also refer to the initialization phase as Phase 1 of the algorithm. The solution phase corresponding to section 4.1 is then referred to as Phase 2; cf. Figure 2.

5. Algorithmic building blocks. Although our approach differs substantially from the NK FETI-DP approach, the building blocks of the new algorithm are largely identical. In Figure 2 the hybrid nonlinear FETI-DP/multigrid algorithm presented in section 4 is summarized, and, for comparison, in Figure 3 the more traditional hybrid NK FETI-DP/multigrid approach (also known as NK irFETI-DP) is outlined. Both approaches include an inexact solution of the coarse problem using an AMG method.

In our implementation, for the solution of linearized sparse systems, we build upon sparse direct and iterative solvers. For linear problems on the subdomains we apply a multifrontal sparse direct solver which is known for its robustness, i.e., UMFPACK 5.6.2 [15]. Cycles of an AMG preconditioner (BoomerAMG [31] from Hypre 2.9.1a [22]) are applied to the (linearized) global coarse problem $S_{\Pi\Pi}$. This adds further (algebraic) coarse levels to the method.

All packages were interfaced through PETSc [3, 2, 4]. We make use of the infrastructure provided by PETSc where possible. On Mira, we first use PETSc 3.4.3 and later PETSc 3.5.3. On Vulcan and SuperMUC, PETSc 3.4.3 was used; on JUQUEEN, PETSc 3.5.2 was used. We use Hypre 2.9.1a on all supercomputers, except for Vulcan, where the newer version 2.9.4a was applied.
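As an illustration only: through PETSc's hypre interface, an AMG-preconditioned coarse solve of this kind could be configured as sketched below. This is not the authors' code; the option values in the comment correspond to the kind of BoomerAMG adaptation discussed later in section 6.4 (HMIS coarsening, interpolation truncation, larger strong threshold) and are assumed, plausible choices. The sketch assumes a PETSc 3.5-style KSPSetOperators signature.

```c
/* Minimal sketch (not the authors' code): solve the linearized coarse problem
 * S_PiPi * x = d with GMRES preconditioned by BoomerAMG, using PETSc's hypre
 * interface (PETSc >= 3.5-style KSPSetOperators signature). */
#include <petscksp.h>

PetscErrorCode solve_coarse(Mat S_PiPi, Vec d, Vec x)
{
  KSP ksp;
  PC  pc;
  PetscErrorCode ierr;

  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp); CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, S_PiPi, S_PiPi); CHKERRQ(ierr);
  ierr = KSPSetType(ksp, KSPGMRES); CHKERRQ(ierr);

  ierr = KSPGetPC(ksp, &pc); CHKERRQ(ierr);
  ierr = PCSetType(pc, PCHYPRE); CHKERRQ(ierr);
  ierr = PCHYPRESetType(pc, "boomeramg"); CHKERRQ(ierr);

  /* Illustrative BoomerAMG options in the spirit of section 6.4, e.g., passed
   * on the command line:
   *   -pc_hypre_boomeramg_coarsen_type HMIS
   *   -pc_hypre_boomeramg_P_max 3
   *   -pc_hypre_boomeramg_strong_threshold 0.5
   *   -pc_hypre_boomeramg_max_iter 1          (one V-cycle per application) */
  ierr = KSPSetFromOptions(ksp); CHKERRQ(ierr);

  ierr = KSPSolve(ksp, d, x); CHKERRQ(ierr);
  ierr = KSPDestroy(&ksp); CHKERRQ(ierr);
  return 0;
}
```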

Hybrid Nonlinear FETI-DP/Multigrid Method

    init: $\tilde{u}^{(0)} \in \widetilde{W}$, $\lambda^{(0)} = 0$
    for k = 0, ..., convergence   // Phase 1: Compute initial value
        build: $\widetilde{K}(\tilde{u}^{(k)})$ and $D\widetilde{K}(\tilde{u}^{(k)})$
        iterative Krylov solve for $\delta\tilde{u}_\Pi^{(k)}$ using the AMG preconditioner $\widehat{S}_{\Pi\Pi}^{-1}$:
            $S_{\Pi\Pi}\,\delta\tilde{u}_\Pi^{(k)} = \tilde{d}_\Pi$   // see (4.10)
        update:
            $\delta u_B^{(k)} := (D\widetilde{K}(\tilde{u}^{(k)}))_{BB}^{-1}\bigl[(\widetilde{K}(\tilde{u}^{(k)}))_B + B_B^T\lambda^{(0)} - \tilde{f}_B - (D\widetilde{K}(\tilde{u}^{(k)}))_{\Pi B}^T\,\delta\tilde{u}_\Pi^{(k)}\bigr]$   // see (4.9)
            $\delta\tilde{u}^{(k)} := [\delta u_B^{(k)T},\, \delta\tilde{u}_\Pi^{(k)T}]^T$
            $\tilde{u}^{(k+1)} := \tilde{u}^{(k)} - \alpha^{(k)}\,\delta\tilde{u}^{(k)}$
    end
    $\tilde{u}^{(0)} := \tilde{u}^{(k+1)}$
    for k = 0, ..., convergence   // Phase 2: Main iteration loop
        build: $\widetilde{K}(\tilde{u}^{(k)})$, $D\widetilde{K}(\tilde{u}^{(k)})$, and $M^{-1}$
        iterative Krylov solve for $x_r = [\delta\tilde{u}_\Pi^{(k)T},\, \delta\lambda^{(k)T}]^T$ using the left preconditioner $B_{r,L}^{-1} := B_{r,L}^{-1}(\widehat{S}_{\Pi\Pi}^{-1}, (D\widetilde{K}(\tilde{u}))_{BB}^{-1}, M^{-1})$:
            $A_r x_r = F_r$   // see (4.3)
        update:
            $\delta u_B^{(k)} := (D\widetilde{K}(\tilde{u}^{(k)}))_{BB}^{-1}\bigl[(\widetilde{K}(\tilde{u}^{(k)}))_B + B_B^T(\lambda^{(k)} - \delta\lambda^{(k)}) - \tilde{f}_B - (D\widetilde{K}(\tilde{u}^{(k)}))_{\Pi B}^T\,\delta\tilde{u}_\Pi^{(k)}\bigr]$   // see (4.2)
            $\delta\tilde{u}^{(k)} := [\delta u_B^{(k)T},\, \delta\tilde{u}_\Pi^{(k)T}]^T$
            $\tilde{u}^{(k+1)} := \tilde{u}^{(k)} - \alpha^{(k)}\,\delta\tilde{u}^{(k)}$
            $\lambda^{(k+1)} := \lambda^{(k)} - \alpha^{(k)}\,\delta\lambda^{(k)}$
    end

Fig. 2. Pseudocode of the hybrid nonlinear FETI-DP/multigrid algorithm as an instance of the irNonlinear-FETI-DP-1 framework; cf. [46, 43]. The application of $\widehat{S}_{\Pi\Pi}^{-1}$ consists of cycles of a parallel AMG method; for $(D\widetilde{K}(\tilde{u}))_{BB}^{-1}$, concurrent forward-backward substitutions of a sparse direct solver on the subdomains are used.

5.1. Parallel application of $(D\widetilde{K})^{-1}$ to a vector. We have the product representation

(5.1) $(D\widetilde{K})^{-1} = \begin{pmatrix} I & -(D\widetilde{K})_{BB}^{-1}(D\widetilde{K})_{\Pi B}^T \\ 0 & I \end{pmatrix} \begin{pmatrix} (D\widetilde{K})_{BB}^{-1} & 0 \\ 0 & S_{\Pi\Pi}^{-1} \end{pmatrix} \begin{pmatrix} I & 0 \\ -(D\widetilde{K})_{\Pi B}(D\widetilde{K})_{BB}^{-1} & I \end{pmatrix}.$

Hybrid Newton–Krylov FETI-DP/Multigrid Method

    init: $\tilde{u}^{(0)} \in \widetilde{W}$
    for k = 0, ..., convergence
        build: $\widetilde{K}(\tilde{u}^{(k)})$, $D\widetilde{K}(\tilde{u}^{(k)})$, and $M^{-1}$
        iterative Krylov solve for $x_r = [\delta\tilde{u}_\Pi^{(k)T},\, \lambda^T]^T$ using the left preconditioner $B_{r,L}^{-1} := B_{r,L}^{-1}(\widehat{S}_{\Pi\Pi}^{-1}, (D\widetilde{K}(\tilde{u}))_{BB}^{-1}, M^{-1})$:
            $A_r x_r = F_r$   // see (4.3)
        update:
            $\delta u_B^{(k)} := (D\widetilde{K}(\tilde{u}^{(k)}))_{BB}^{-1}\bigl[(\widetilde{K}(\tilde{u}^{(k)}))_B - \tilde{f}_B + B_B^T\lambda - (D\widetilde{K}(\tilde{u}^{(k)}))_{\Pi B}^T\,\delta\tilde{u}_\Pi^{(k)}\bigr]$   // see (4.2)
            $\delta\tilde{u}^{(k)} := [\delta u_B^{(k)T},\, \delta\tilde{u}_\Pi^{(k)T}]^T$
            $\tilde{u}^{(k+1)} := \tilde{u}^{(k)} - \alpha^{(k)}\,\delta\tilde{u}^{(k)}$
    end

Fig. 3. Pseudocode of the hybrid NK FETI-DP/multigrid algorithm. The application of $\widehat{S}_{\Pi\Pi}^{-1}$ consists of cycles of a parallel AMG method; for $(D\widetilde{K}(\tilde{u}))_{BB}^{-1}$, concurrent forward-backward substitutions of a sparse direct solver on the subdomains are used.

Here, the block operator

(5.2) $(D\widetilde{K})_{BB}^{-1} = \begin{pmatrix} ((DK^{(1)})_{BB})^{-1} & & 0 \\ & \ddots & \\ 0 & & ((DK^{(N)})_{BB})^{-1} \end{pmatrix}$

is completely parallel and is implemented by embarrassingly parallel forward-backward substitutions of a sparse direct solver. For standard FETI-DP methods with an exact solution of the coarse problem the sparse direct solver is also used to factor $S_{\Pi\Pi}$. This can be performed by first broadcasting $S_{\Pi\Pi}$ to all ranks before factoring it locally. In this case, in the subsequent Krylov iteration, no communication is necessary for the coarse problem. Alternatively, $S_{\Pi\Pi}$ can be factored in parallel on a subset of ranks or on separate ranks. Of course, the parallel scalability of this approach is eventually limited by the sparse direct solver.

In our hybrid nonlinear FETI-DP/multigrid approach the application of $(D\widetilde{K})^{-1}$ to a vector is not required in the main iteration loop; see Figure 2 (Phase 2). Instead, only the application of $(D\widetilde{K})_{BB}^{-1}$ to a vector and of the AMG preconditioner $\widehat{S}_{\Pi\Pi}^{-1}$ to a vector are needed in each Krylov iteration step. In our experiments, only one or two V-cycles of the AMG preconditioner are used in each Krylov iteration.
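Where the full $(D\widetilde{K})^{-1}$ is needed (e.g., in Phase 1 updates), (5.1) is applied factor by factor. The following C sketch only illustrates that order of operations; the callbacks are hypothetical placeholders, this is not the authors' code, and in the inexact method the coarse solve would be replaced by AMG cycles.

```c
/* Sketch (not the authors' code) of applying the block factorization (5.1)
 * to a block right-hand side (r_B, r_Pi), from right to left:
 *   1. y_Pi = r_Pi - (DK)_PiB (DK)_BB^{-1} r_B           (third factor)
 *   2. x_B  = (DK)_BB^{-1} r_B,  x_Pi = S_PiPi^{-1} y_Pi (block-diagonal factor)
 *   3. x_B  = x_B - (DK)_BB^{-1} (DK)_PiB^T x_Pi         (first factor)   */
#include <stddef.h>

typedef void (*Apply)(const double *in, double *out, void *ctx);

typedef struct {
  Apply DK_BB_inv;   /* embarrassingly parallel local LU solves, cf. (5.2) */
  Apply DK_PiB;      /* apply (DK)_PiB   : dual-sized -> primal-sized      */
  Apply DK_PiB_T;    /* apply (DK)_PiB^T : primal-sized -> dual-sized      */
  Apply S_PiPi_inv;  /* coarse solve (exact, or AMG cycles in practice)    */
  void *ctx;
  size_t n_B, n_Pi;
} BlockOps;

void apply_DK_inverse(const BlockOps *op, const double *r_B, const double *r_Pi,
                      double *x_B, double *x_Pi, double *w_B, double *w_Pi)
{
  size_t i;

  /* step 1: y_Pi = r_Pi - (DK)_PiB (DK)_BB^{-1} r_B */
  op->DK_BB_inv(r_B, x_B, op->ctx);               /* x_B = (DK)_BB^{-1} r_B */
  op->DK_PiB(x_B, w_Pi, op->ctx);
  for (i = 0; i < op->n_Pi; ++i) w_Pi[i] = r_Pi[i] - w_Pi[i];

  /* step 2: x_Pi = S_PiPi^{-1} y_Pi */
  op->S_PiPi_inv(w_Pi, x_Pi, op->ctx);

  /* step 3: x_B = x_B - (DK)_BB^{-1} (DK)_PiB^T x_Pi */
  op->DK_PiB_T(x_Pi, w_B, op->ctx);
  op->DK_BB_inv(w_B, w_B, op->ctx);               /* assumed to allow in-place use */
  for (i = 0; i < op->n_B; ++i) x_B[i] -= w_B[i];
}
```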

5.2. Building the coarse operator $S_{\Pi\Pi}$ (cf. (4.6)) in parallel. As in multigrid methods, it is vital to build the coarse operator efficiently. In parallel multigrid methods this is achieved by carrying out a parallel triple matrix product to build the RAP Galerkin operator, where R and P are the restriction and prolongation operators, respectively, and A is the system matrix. It is well known that repartitioning of the coarse problem is also vital for efficiency. Fortunately, in FETI-DP methods the coarse problem is small compared to the fine level; i.e., it can be smaller by a factor of $10^3$ to $10^4$ compared to the original problem [42, 46]. FETI-DP and related domain decomposition methods can apply such aggressive coarsening since direct solvers are used on the fine level instead of smoothers.
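Concretely, one way to realize such an additive parallel assembly of the coarse operator from local contributions (formalized in (5.3) below) is sketched here with PETSc. This is not the authors' code; preallocation is omitted for brevity, and the row distribution (e.g., placing all rows on a subset of ranks) is left to the caller.

```c
/* Sketch (not the authors' code): additive assembly of the FETI-DP coarse
 * operator from local contributions, cf. (5.3) below.  Each MPI rank inserts
 * its dense local Schur complement contribution with ADD_VALUES; the row
 * distribution of S_PiPi may place all rows on a subset of the ranks. */
#include <petscmat.h>

PetscErrorCode assemble_coarse(MPI_Comm comm,
                               PetscInt n_global_primal,    /* size of S_PiPi          */
                               PetscInt n_local_rows,       /* 0 on "non-coarse" ranks */
                               PetscInt n_i,                /* # local primal dofs     */
                               const PetscInt *primal_map,  /* local -> global dofs    */
                               const PetscScalar *S_i,      /* dense n_i x n_i block   */
                               Mat *S_PiPi)
{
  PetscErrorCode ierr;

  ierr = MatCreate(comm, S_PiPi); CHKERRQ(ierr);
  ierr = MatSetSizes(*S_PiPi, n_local_rows, n_local_rows,
                     n_global_primal, n_global_primal); CHKERRQ(ierr);
  ierr = MatSetType(*S_PiPi, MATAIJ); CHKERRQ(ierr);
  ierr = MatSetUp(*S_PiPi); CHKERRQ(ierr);   /* preallocation omitted for brevity */

  /* additive (Galerkin-type) assembly of the local contribution, cf. (5.3) */
  ierr = MatSetValues(*S_PiPi, n_i, primal_map, n_i, primal_map,
                      S_i, ADD_VALUES); CHKERRQ(ierr);

  /* the assembly communicates off-process contributions to the owning ranks */
  ierr = MatAssemblyBegin(*S_PiPi, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  ierr = MatAssemblyEnd(*S_PiPi, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  return 0;
}
```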

The FETI-DP coarse operator $S_{\Pi\Pi}$ can be built from local contributions computed in parallel, i.e.,

(5.3) $S_{\Pi\Pi} = \sum_{i=1}^{N} R_\Pi^{(i)T} S_{\Pi\Pi}^{(i)} R_\Pi^{(i)},$

where the $S_{\Pi\Pi}^{(i)}$ are distributed among the processor cores. The parallel assembly can be partially overlapped with the parallel computation of the local Schur complement contributions $S_{\Pi\Pi}^{(i)} = (DK^{(i)})_{\Pi\Pi} - (DK^{(i)})_{\Pi B}(DK^{(i)})_{BB}^{-1}(DK^{(i)})_{\Pi B}^T$ of the processor cores. In some of our experiments on the Mira and Vulcan supercomputers, the coarse problem is first distributed over all MPI ranks and then redistributed to a subset of cores; i.e., 16K coarse cores are used for the problems running on 65K and 262K total cores. We also provide results for an improved approach on the Vulcan Blue Gene/Q, avoiding the redistribution process; see sections 6.4 and 6.5 for details. It is interesting to note that in our experiments the current implementation of the assembly of the coarse problem starts to be a bottleneck for more than 131K cores. This has not been observed previously.

5.3. Parallel matrix-free inversion of $B(D\widetilde{K}(\tilde{u}^{(k)}))^{-1}B^T$. Systems with the left-hand side $B(D\widetilde{K}(\tilde{u}^{(k)}))^{-1}B^T$ are solved using a Krylov subspace method such as GMRES. Nearest-neighbor communication is necessary when applying the operators $B^T$ and $B$ to a vector. The application of $(D\widetilde{K}(\tilde{u}^{(k)}))^{-1}$ to a vector is implemented as described above in section 5.1.

6. Numerical results.

6.1. Model problems. We consider different model problems based on the nonlinear p-Laplace operator $\Delta_p$ and on nonlinear hyperelasticity. Our first model problem is $-\Delta u - 4\Delta_p u = 1$ for p = 4, where $\Delta$ is the usual Laplace operator and $\Delta_p$ is the nonlinear p-Laplace operator. For this problem the number of Newton iterations and the time to solution are small. The computations are carried out on the unit square Ω = (0, 1) × (0, 1). For the discretization, we use linear finite elements. The energy is given by

$J(u) := \int_\Omega \tfrac{1}{2}|\nabla u|^2 + |\nabla u|^4 - u \, dx.$

Our second model problem is nonlinear hyperelasticity. We consider a Neo–Hooke material with a soft matrix material and stiff circular inclusions; see Figure 4 for the geometry. The strain energy density function W [62, 32] is given by

$W(u) = \tfrac{1}{2}\bigl(K - \tfrac{2}{3}G\bigr)\ln^2(\det F) + \tfrac{1}{2}G\bigl(\operatorname{tr}(F^T F) - 3 - 2\ln(\det F)\bigr)$

with

$K = \frac{E}{3(1-2\nu)}, \qquad G = \frac{E}{2(1+\nu)},$

and the deformation gradient $F(x) := \nabla\varphi(x)$; here, $\varphi(x) = x + u(x)$ denotes the deformation and $u(x)$ the displacement of $x$.

Fig. 4. Decomposition of the computational domain Ω into 64 subdomains; each subdomain has a (slightly off-centered) circular inclusion of stiffer material.

The energy functional whose stationary points are computed is given by

$J(u) = \int_\Omega W(u) - H_1(u)\,dx - \int_\Gamma H_2(u)\,ds,$

where $H_1(u)$ and $H_2(u)$ are functionals related to the volume and traction forces. In our experiments in two dimensions, we have the following material parameters E and ν (see Figure 4 for the geometry): In the circular inclusions we have E = 210 000, and in the surrounding matrix material E = 210. We have chosen ν = 0.3 in the complete domain Ω. The nonlinear elasticity problem is discretized with piecewise quadratic finite elements. For our three-dimensional problems, we have chosen a similar setup with stiff spherical inclusions in softer material. Again, we choose E = 210 000 in the inclusions and E = 210 inside the surrounding material.

We always use one iteration of BoomerAMG [31, 22, 1] with symmetric-SOR/Jacobi smoothing for the coarse problem. In all our two-dimensional experiments we choose all vertices to be primal. We additionally use edge averages as primal constraints on all edges in three-dimensional problems and perform a local transformation of basis; see [42, section 4] for details on the implementation of edge constraints. In the weak scaling experiments in this paper, the number of subdomains is identical to the number of cores. For the Newton iterations appearing in our methods, we use a relative tolerance of $10^{-8}$; i.e., the iteration is stopped when the norm of the residual has been reduced by eight orders of magnitude. The linearized systems are then solved using GMRES as a Krylov method with a relative tolerance of $10^{-10}$.

6.2. Computational platforms. We perform our computations on different supercomputers: Mira is a 49 152 node 10-petaflops Blue Gene/Q system at Argonne National Laboratory (Lemont, IL) with a total number of 786 432 processor cores and is ranked 5 in the current TOP500 list (November 2014, www.top500.org). JUQUEEN is a 28 672 node 6-petaflops Blue Gene/Q system at Jülich Supercomputing Center (JSC, Germany) with a total number of 458 752 processor cores and is ranked 8 in the current TOP500 list. Vulcan is a 24 576 node 5-petaflops Blue Gene/Q production system at Lawrence Livermore National Laboratory (Livermore, CA) with a total number of 393 216 processor cores and is ranked 9 in the current TOP500 list. All three Blue Gene/Q systems use a Power BQC 16C 1.6 GHz processor with 16 cores and 16 GB memory per node. SuperMUC is a 9 400 node 3.2-petaflops system at Leibniz Supercomputing Center (Munich, Germany) with a total number of

155 656 processor cores (Sandy Bridge-EP Xeon E5-2680 8C 2.7 GHz; 16 cores and 32 GB memory per node). It is ranked 14 in the current TOP500 list (November 2014). On SuperMUC the Intel compiler was used, and on Vulcan the IBM XL compiler was used. As we encountered internal compiler errors with the IBM compiler on the Blue Gene/Q systems in templated C++ code, we first had to resort to the GNU compiler 4.4.7 on the Mira supercomputer for Tables 2 and 3. The later results, i.e., those in Tables 8 and 9, use the IBM XL C/C++ compiler for Blue Gene V12.1, the ESSL, and the flags -O3 -qhot=level=0 -qsimd=auto -qmaxmem=-1 -qstrict -qstrict_induction.

6.3. Comparison with existing methods. In Table 1, a comparison of the new method discussed in this paper with the more traditional NK FETI-DP/multigrid method [46] is presented. For linear problems both methods are mathematically identical. We consider the second nonlinear model benchmark problem from [38, section 5.2.2], i.e., the p-Laplacian [38, section 5.2.3] with p = 4 in inclusions embedded inside the subdomains and p = 2 elsewhere. We have H/h = 128, and the inclusions have a distance of η = 2h to the subdomain boundary. Here, we use piecewise quadratic finite elements. The runs from 16 up to 65 536 cores have been performed on the Vulcan Blue Gene/Q at Lawrence Livermore National Laboratory. By parallel efficiency we refer to the ratio of the baseline time to solution to the actual time to solution; the baseline is, in this case, the time to solution on a single Blue Gene/Q node (16 cores). The total time to solution (or total execution time) includes the time for the assembly of the problem as well as its solution.

We see a reduction of the number of Krylov iterations and of the time used in the Krylov iterations by 70%. As a result, the parallel efficiency of the nonlinear FETI-DP/multigrid method appears to be superior to the efficiency of the more traditional NK FETI-DP/multigrid approach. Of course, for these simple nonlinear problems, the time used in the Krylov iteration is small. Nonetheless, the reduction of the total time increases from 28% on 16 cores to 37% on 65 536 cores; i.e., the advantage of the new method increases with the number of cores.

For numerically harder nonlinear problems the time used in the Krylov iteration can constitute a larger portion of the time to solution. Results for one such example are given in Figure 5. The model problem considered here is the third benchmark problem defined in [38, section 5.2.3]: We have considered the p-Laplacian with p = 4 and a multiplicative weight of α = $10^6$ in 64 channels of width $\frac{1}{4}H$ and p = 2, α = 1 elsewhere. In this example the total time to solution is reduced by a factor of two by using the nonlinear method on 4 096 cores. Here, the (small) coarse problem is solved directly as in [38]. This is feasible if the number of subdomains is small, and it then yields essentially the same results as when using an efficient multigrid preconditioner for the coarse problem; see [43].

6.4. Weak parallel scalability. In Table 2 and Figure 6, weak parallel scalability results on the Mira Blue Gene/Q are presented for the p-Laplacian ($-(\Delta + 4\Delta_p)u = 1$, p = 4, H/h = 180). Again, the total time to solution, denoted total execution time in the table, includes the time for the assembly of the problem as well as its solution using relative tolerances of $10^{-8}$ for Newton and $10^{-10}$ for GMRES. Again, for the parallel efficiency, our reference baseline is the time on a single node (16 cores).
The parallel scalability is satisfactory, although we do see an increase in the total time, especially when scaling from 65K to 262K cores. A parallel efficiency exceeding 100% stems from a variation in the number of Newton steps by one iteration. In Table 3 the weak parallel scalability for both solution phases is shown separately.

Fig. 5. Comparison of the nonlinear FETI-DP approach [38, Nonlinear-FETI-DP-2, third model problem, coefficient $10^6$] (cf. right column) with the more traditional NK FETI-DP approach (cf. left column) on 4 096 cores of a Cray XT6 (Universität Duisburg-Essen). In both methods a line search method with a Wolfe condition is used as the globalization technique. The total time to solution can be reduced by more than a factor of two for this problem.

Table 1
New, hybrid nonlinear FETI-DP/multigrid (abbreviated by NL) algorithm compared to the more traditional NK FETI-DP/multigrid (abbreviated by NK) on the Vulcan supercomputer at Lawrence Livermore National Laboratory; p-Laplace inclusions in a linear Laplace equation; p = 4; H/h = 128; and piecewise quadratic finite elements. The baseline for the parallel efficiency is a single Blue Gene/Q node (16 cores).

Vulcan Blue Gene/Q Supercomputer (LLNL)

Cores    Problem size    Solver   Newton steps        Krylov   Krylov    Total exec.   Parallel
                                  (Phase 1/Phase 2)   iter.    time      time          effic.
16       1 050 625       NK       –/20                219      57.5s     459.7s        100%
                         NL       13/7                 84      20.3s     332.0s        100%
64       4 198 401       NK       –/21                438      110.4s    534.9s         86%
                         NL       14/7                138      33.17s    356.1s         93%
256      16 785 409      NK       –/22                546      137.9s    587.1s         78%
                         NL       14/8                185      44.94s    396.6s         84%
1 024    67 125 249      NK       –/23                599      152.7s    624.3s         74%
                         NL       15/8                208      51.3s     413.6s         80%
4 096    268 468 225     NK       –/23                624      160.6s    630.1s         73%
                         NL       16/7                199      49.6s     410.6s         81%
16 384   1 073 807 361   NK       –/24                688      178.4s    676.9s         68%
                         NL       16/8                228      57.5s     441.2s         75%
65 536   4 295 098 369   NK       –/24                722      190.0s    720.5s         64%
                         NL       17/7                211      54.1s     453.9s         73%

As expected, since it includes significantly less communication, Phase 1 scales better than Phase 2. We proceed with a discussion of possible reasons for some of the inefficiencies observed. We are then able to present improved results on Vulcan and Mira considering hyperelasticity after having implemented some modifications; see Tables 5, 6, and 7 for Vulcan and Table 8 for Mira, and the corresponding discussion. A better understanding of why the time to solution increases when scaling from 65K to 262K cores on Mira can be achieved by an analysis of the phases of the method. Considering Figure 7, we can see that a better scalability for Phase 1 is prevented by

Table 2
Weak scalability for the hybrid nonlinear FETI-DP/multigrid algorithm with Wolfe line search on the Mira supercomputer at Argonne National Laboratory; see also the detailed analysis in Figure 7. Improvements resulting from this analysis are discussed in the text.

MIRA Blue Gene/Q Supercomputer (ANL)

Cores     Problem size    Newton steps        Krylov   Krylov   Total exec.   Parallel
                          (Phase 1/Phase 2)   iter.    time     time          effic.
16        514 089         5/1                 10       1.23s    64.4s         100%
64        2 053 489       5/1                 18       2.16s    66.0s          98%
256       8 208 225       4/1                 23       3.17s    57.0s         113%
1 024     32 821 441      4/1                 23       3.79s    57.9s         111%
4 096     131 262 849     4/1                 22       4.04s    58.2s         111%
16 384    525 005 569     5/1                 21       3.97s    69.8s          92%
65 536    2 099 930 625   4/1                 23       3.59s    64.9s          99%
262 144   8 399 539 201   4/1                 23       4.77s    90.2s          71%

Fig. 6. Weak parallel scalability on the Mira Blue Gene/Q; cf. the data in Table 2.

Table 3
Scalability of the two different solution phases for the hybrid nonlinear FETI-DP/multigrid algorithm with Wolfe line search on the Mira supercomputer at Argonne National Laboratory; see also the detailed analysis in Figure 7. Improvements resulting from this analysis are discussed in the text.

MIRA Blue Gene/Q Supercomputer (ANL)

                          Phase 1 (per Newton step)   Phase 2 (per Newton step)
Cores     Problem size    ø time     effic.           ø time     effic.
16        514 089         10.0s      100%             14.4s      100%
64        2 053 498       10.1s       99%             15.5s       93%
256       8 208 225       10.1s       99%             16.5s       87%
1 024     32 821 441      10.2s       98%             17.1s       84%
4 096     131 262 849     10.2s       98%             17.4s       83%
16 384    525 005 569     10.4s       96%             17.7s       81%
65 536    2 099 930 625   11.6s       86%             18.7s       77%
262 144   8 399 539 201   16.0s       63%             26.1s       55%

a noticeable, undesired increase of the time used in the Krylov iterations (yellow in Figure 7). Since, in Phase 1, the Krylov method operates exclusively on the coarse problem, this suggests an inefficiency in the coarse problem. Indeed, for this problem, as a result of the larger number of Newton iterations in Phase 1, the coarse problem has to be solved more often in Phase 1 than in Phase 2. For improving scalability, we therefore have to revisit the coarse problem. In fact, in recent results on the Vulcan Blue Gene/Q, the scalability of the coarse problem solve has been improved by optimizing several BoomerAMG parameters and by using a better distribution of the coarse problem to MPI ranks; see Figure 9, the corresponding discussion, and also section 6.5. For our second model problem, the following BoomerAMG parameters proved to be superior: We now use a nodal-based HMIS coarsening (cf. [16]), a stronger truncation of the interpolation operator to a maximum of three nonzero entries per row, and, finally, a higher threshold value for the determination of strong couplings.

Second, there is a noticeable increase in the remaining time phase (orange in Figure 7), especially in Phase 2 of the algorithm. Here, the remaining time consists of the update of the solution, the computation of parallel norms necessary for the stopping criterion of the global Newton iteration, e.g., the norm of the Newton residual, and the cleanup of parallel data structures. Further investigations on the Vulcan Blue Gene/Q (see the discussion below) exposed an inefficiency in the parallel update of the Lagrange multipliers which affected the solution phases of all our nonlinear FETI-DP implementations. Although the Lagrange multipliers can be updated completely locally, an unnecessary VecAssembly caused some communication. The update was then replaced by an improved implementation avoiding the parallel assembly process, which subsequently was used to obtain the weak scaling results for Vulcan in Table 7 and Figure 12 and the strong scaling results presented in Table 10 and Figure 15.

Third, for 262K cores we see an unexpected increase in the portion of the total time that is spent in the FETI-DP setup (green in Figure 7). This phase contains the assembly of the coarse problem, the local LU factorizations, the construction of several scatters, the AMG setup, and the construction of the Dirichlet preconditioner (only necessary in Phase 2). The increase in the FETI-DP setup time affects both Phases 1 and 2. Since we expect the local factorizations to scale perfectly, the increase can only be a result of the coarse setup, i.e., the construction of the coarse problem; cf. section 5.2. This has been confirmed by detailed additional analysis on the Vulcan Blue Gene/Q; see the discussion below. Better results have been achieved on Vulcan by avoiding the redistribution step described in section 5.2 for the assembly of the coarse problem; see Table 7 and Figures 11 and 12. We can conclude that not only the solution of the FETI-DP coarse problem but also its efficient construction will be key to obtaining scalability to one million cores.

The results presented in this paper are based on pure MPI; i.e., the largest problem uses 524 288 MPI ranks. A hybrid MPI/OpenMP model offers further potential to increase scalability. We can reduce the number of MPI ranks by using a threaded solver on the subdomain level.
Results for this approach are promising but beyond the scope of this paper. Hybrid MPI/OpenMP aspects and corresponding computational results are discussed elsewhere [41]. Threaded BLAS or LAPACK libraries were also not used here. From our experience on Blue Gene/P, the sparse direct solver [15] does not profit from the fine-grained parallelism of a threaded BLAS.

The same p-Laplace model problem as on Mira has been solved on SuperMUC

Fig. 7. Detailed analysis of Phase 1 (left) and Phase 2 (right), performed on the Mira supercomputer, of the first implementation of the hybrid nonlinear FETI-DP/multigrid algorithm; cf. Table 3. Improvements resulting from this analysis are discussed in the text; see also the improved results on Vulcan in Figure 12 and on Mira in Table 8.

Table 4
Hybrid nonlinear FETI-DP/multigrid algorithm on the SuperMUC supercomputer at Leibniz-Rechenzentrum (LRZ) in Munich; $-\Delta u - 4\Delta_p u = 1$, p = 4, $H_x/h_x = 768$, $H_y/h_y = 384$.

SuperMUC (LRZ)

                         Phase 1 (per Newton step)   Phase 2 (per Newton step)   Total
Cores    Problem size    ø time     effic.           ø time     effic.           time      effic.
32       9M              11.1s      100%             16.2s      100%             112.5s    100%
128      38M             11.4s       97%             17.2s       94%             117.8s     96%
512      151M            11.4s       97%             17.2s       94%             119.1s     95%
2 048    604M            11.5s       97%             17.4s       93%             119.1s     95%
8 192    2 416M          12.3s       90%             18.8s       86%             127.9s     88%
32 768   9 664M          14.6s       76%             23.3s       70%             151.4s     74%

at Leibniz-Rechenzentrum (LRZ) in Munich, Germany. The weak parallel scalability for both phases can be seen in Table 4, showing behavior similar to that on Mira.

As already mentioned, we now present improved weak scalability results and provide a detailed analysis of several timers. The computations are performed on the Vulcan Blue Gene/Q. In Table 5 the weak scalability of the complete application for a nonlinear two-dimensional elasticity problem is presented. Here, we consider a rectangular domain Ω with corners (0, 0) and (2, 1). As a Dirichlet boundary condition, we apply a fixed 1% displacement in the x-direction in each degree of freedom on the boundary, i.e., $x = \begin{pmatrix} 1.01 & 0 \\ 0 & 1 \end{pmatrix} X$ for all nodes $X \in \partial\Omega$ in the undeformed configuration. In these experiments, no further volume or traction forces $H_1(u)$ and $H_2(u)$ are considered. We use a Neo–Hooke material model and consider a heterogeneous material characterized by two different elasticity moduli; cf. section 6.1 and Figure 4 for the distribution of the coefficients. In order to illustrate the nonlinearity and heterogeneity of the chosen model problem, we provide a heat map picturing the deviation of the heterogeneous solution from the solution of an equivalent problem considering linear elasticity and a homogeneous material; see Figure 8. The smallest problem with 1.6 million degrees of freedom is decomposed into 32 subdomains, and the largest problem with 6.7 billion degrees of freedom is decomposed into 131 072 subdomains. All in all, a nearly constant behavior of the time spent in Krylov iterations can be observed. This is comparable to the behavior of our hybrid nonlinear FETI-DP/multigrid method

Table 5
Hybrid nonlinear FETI-DP/multigrid algorithm on the Vulcan supercomputer at Lawrence Livermore National Laboratory; Neo-Hooke material; E = 210 000 in circular inclusions and E = 210 in the surrounding matrix material; Poisson ratio ν = 0.3 in the complete domain; see Figure 4 for the geometry; a fixed displacement of 1% in the x-direction is prescribed in each boundary node; two-dimensional rectangular domain with corners (0, 0) and (2, 1); H/h = 80 and piecewise quadratic finite elements.

Vulcan Blue Gene/Q Supercomputer (LLNL)

Cores     Problem size    Newton steps        Krylov   Krylov   Total exec.   Parallel
                          (Phase 1/Phase 2)   iter.    time     time          effic.
32        1 642 242       4/3                  93      23.2s    250.3s        100%
128       6 561 282       4/3                 107      27.2s    256.0s         98%
512       26 229 762      4/3                 109      27.9s    257.7s         97%
2 048     104 888 322     4/3                 109      28.4s    258.7s         97%
8 192     419 491 842     4/3                 107      28.7s    261.0s         96%
32 768    1 677 844 482   4/3                 105      27.2s    261.7s         96%
131 072   6 711 132 162   4/3                 102      26.7s    278.9s         90%

applied to the p-Laplace model problem; cf. section 6.3. The parallel efficiency of the application is always better than 90%. But again we see an increase in runtime and, consequently, a drop in efficiency from 96% to 90% when scaling from 32K to 131K cores. We provide detailed time measurements to explain this effect and how we obtain improvements; see, e.g., Figures 9, 10, 11, and 12. We also present weak scaling results for nonlinear elasticity in three dimensions; see Table 6. The problem setup in this experiment is similar to the two-dimensional model problem: We have one stiff spherical inclusion (E = 210 000) in each subdomain, embedded in softer material (E = 210). To increase the memory available for each subdomain, we use 8 of the 16 cores per Vulcan node. The parallel efficiency on 65K cores compared to 128 cores is 78% and thus seems satisfactory.

Fig. 8. Deviation $\|u - u_0\|_2$, where u is the solution of the nonlinear problem and $u_0$ is the initial value. We chose $u_0$ as the solution of a linear and homogeneous problem of the same size and with the same boundary conditions. The geometry consists of 32 subdomains with circular inclusions. The image on the left shows the deviation, whereas the image on the right also shows the positions of the inclusions as white circles. The highest measured deviation is 0.0027 and corresponds to the dark red areas in the picture.

Let us now discuss the weak scalability results for our nonlinear Neo–Hooke model problems. In Figure 9, we partition the average runtime of one Newton step into four different phases: assembly of the local problems (blue in Figure 9), FETI-DP and preconditioner setup including the local LU factorizations (green in Figure 9), Krylov iteration (yellow in Figure 9), and the remaining time (orange in Figure 9). In Figure 9 (left) we present the average timings over all Newton steps of Phase 1, and in Figure 9 (right) we give the respective timings for Phase 2 (solution phase). Let us remark that Phase 2 is comparable to classical NK FETI-DP/multigrid.

Table 6
NK FETI-DP/multigrid on the Vulcan supercomputer at Lawrence Livermore National Laboratory; Neo-Hooke material; E = 210 000 in spherical inclusions and E = 210 in the surrounding matrix material; Poisson ratio ν = 0.3 in the complete domain; a fixed displacement of 1% in the x-direction is prescribed in each boundary node; three-dimensional cubic domain; H/h = 7 and piecewise quadratic finite elements.

Vulcan Blue Gene/Q Supercomputer (LLNL)

Cores    Problem size   Krylov   Krylov   Newton   Average time      Total exec.   Parallel
                        iter.    time     steps    per Newton step   time          effic.
128      698 691         85       9.3s    4        22.2s              88.7s        100%
1 024    5 447 811       98      11.8s    4        23.1s              92.2s         96%
8 192    43 022 595     116      15.5s    4        24.7s              98.9s         90%
65 536   341 955 075    181      27.2s    4        28.5s             113.9s         78%

The assembly time (blue in Figure 9) scales, as one should expect, nearly perfectly for both phases. In Phase 1, the Krylov iteration time can now be neglected after our improvements of the coarse problem solve: We have better adapted BoomerAMG to the properties of our FETI-DP coarse operator (HMIS nodal coarsening, truncation of the interpolation; see [16] and the discussion above for more details) and use a better distribution of the coarse problem to the MPI ranks; i.e., we use, e.g., 16K cores for the coarse problem in the run using a total of 131K cores. The combination of these two changes alone results in a noticeable improvement compared to the Mira results presented before. In Phase 2, the time spent in the Krylov method takes a larger part but scales quite well, also because the nonlinear FETI-DP method keeps the number of Krylov iterations nearly constant.

We also notice that the remaining time (orange in Figure 9) does not scale perfectly when reaching 131K cores. This was caused by an inefficiency in the implementation of the update of the Lagrange multipliers; see also the discussion above. This does not affect this weak scaling experiment on Vulcan substantially but proved to be a crucial point in later strong scaling tests on Vulcan; see section 6.5. A noticeable parallel inefficiency in the FETI-DP setup time (green in Figure 9) is caused by the assembly of the coarse problem. This corresponds to the earlier results on Mira; cf. the discussion above.

We present a detailed analysis of the FETI-DP setup time in Phase 1 in Figure 10. The FETI-DP setup phase is split into the LU factorizations (green in Figure 10), the assembly and redistribution of the globally coupled coarse problem (red and purple in Figure 10), the construction of the scatters representing $B$ and $R_\Pi$ (yellow in Figure 10), and the BoomerAMG setup (orange in Figure 10). Apparently, only the phases related to the coarse problem are not scaling perfectly. It is possible to avoid the redistribution process (see section 5.2) by assembling the coarse problem directly on a subset of MPI ranks. This proves to be more efficient, and improved results compared to Table 5 can be found in Table 7 and Figure 11. In Figure 11 the redistribution (purple in Figures 10 and 11) is removed, resulting in an improvement of the overall parallel efficiency from 90% (Table 5) to 92% in Table 7 and Figure 12. Nevertheless, we assume that improvements are still possible in the construction of the coarse problem. We may be able to learn from highly scalable parallel multigrid codes, which also rely on an efficient construction of the coarse operator.

We finally provide a large weak scalability test on Mira using our modified and

Fig. 9. Detailed analysis of Phase 1 (left) and Phase 2 (right) of the hybrid nonlinear FETI-DP/multigrid method; average time per Newton step; corresponds to the data in Table 5.

optimized software for the heterogeneous Neo–Hooke problem in two dimensions; see Table 8. We scale from 16 cores (a single node) up to 524 288 cores (32 768 nodes) and subdomains. The range of weak scalability is thus extended by almost one order of magnitude compared to the best earlier results obtained for FETI-DP methods [46]. Our largest nonlinear problem has 41 944 million degrees of freedom and is solved in 377.8s. This corresponds to an overall parallel efficiency of 96%, choosing the time to solution on 16 cores (364s) as a baseline; cf. Figure 13. The number of Newton steps slightly decreases for an increasing number of subdomains. Thus, a fairer comparison may be obtained by considering the average runtime spent in a Newton step. Choosing the average runtime on 16 cores as a baseline, we measure a parallel efficiency of 67% in Phase 1 and even 68% in Phase 2, which is a remarkable result; see also Figure 14.

A Blue Gene/Q core executes instructions in order in two pipelines (one integer/load/store/control pipeline and one floating point pipeline). At most one instruction is issued per thread in each cycle; each Blue Gene/Q core supports up to four hardware threads. However, the use of OpenMP threading on the node within our algorithmic framework is discussed elsewhere [41]. To make use of the hardware threads using pure MPI, we can use multiple MPI ranks per core; i.e., here we use 32 MPI ranks per node (option -p 32 on JUQUEEN) instead of 16 MPI ranks; see also the discussion of the --overcommit option for Vulcan in section 6.5. The results are presented in Table 9. As a reference, in the first line of Table 9, we provide the results for 16 ranks per node (-p 16, i.e., one MPI rank per core). We can see that our implementation can profit from the use of two MPI ranks per core if the memory constraint of 512 MB per MPI process can be accommodated; i.e., we can use half the nodes without doubling the time (95.2s versus 141.8s). Scaling up the number of nodes using this approach, we obtain a parallel efficiency of 80% for 524K MPI ranks compared to a baseline of 32 MPI ranks (a single node) for the nonlinear problem. For each Newton step we have an efficiency of 52% in Phase 1 and 60% in Phase 2, which indicates that the coarse problem is the limiting factor in this configuration with multiple MPI ranks per core.

6.5. Strong parallel scalability. Finally, we present strong scaling tests for a nonlinear elasticity problem in two dimensions performed on the Vulcan Blue Gene/Q;

Fig. 10. Detailed subtimers of the FETI-DP/multigrid setup in Phase 1; average time per Newton step; corresponds to the data in Table 5.

Fig. 11. Detailed subtimers of the FETI-DP/multigrid setup in Phase 1; average time per Newton step; corresponds to the data in Table 7. Improved results compared to Figure 10.

see Table 10. We again use the problem setup with stiff circular inclusions; cf. Figure 4. We decompose a problem with 419 million degrees of freedom and 131 072 inclusions into 131 072 subdomains and solve with our new hybrid nonlinear FETI-DP/multigrid method using an increasing number of Blue Gene/Q cores. We start with 1 024 cores (64 nodes) and scale up to 131K cores (8 192 nodes). This implies that we solve between one and 128 subdomain problems on each processor core, and each of these has approximately 6.6K degrees of freedom, which is comparatively small. We are not able to start our strong scaling test on fewer cores due to memory constraints. The result of 63% parallel efficiency on 131K cores is convincing. Let us also remark

Fig. 12. Detailed analysis of Phase 1 (left) and Phase 2 (right) of the hybrid nonlinear FETI-DP/multigrid method; average time per Newton step; corresponds to the data in Table 7. Improved results compared to Figure 9.

Table 7 Setting as in Table 5; assembly of the coarse problem on a subset of min(MPI-size, 16K) cores instead of redistribution of the coarse space.

Vulcan Blue Gene/Q Supercomputer (LLNL)
Cores   | Problem size  | Newton steps Phase 1/Phase 2 | Krylov iter | Krylov time | Total execution time | Parallel effic.
32      | 1 642 242     | 4/3 |  93 | 23.2s | 249.9s | 100%
128     | 6 561 282     | 4/3 | 107 | 27.4s | 256.2s |  98%
512     | 26 229 762    | 4/3 | 108 | 27.9s | 258.2s |  97%
2 048   | 104 888 322   | 4/3 | 109 | 28.6s | 258.9s |  97%
8 192   | 419 491 842   | 4/3 | 106 | 28.9s | 261.2s |  96%
32 768  | 1 677 844 482 | 4/3 | 105 | 28.3s | 263.7s |  95%
131 072 | 6 711 132 162 | 4/3 | 102 | 26.8s | 273.1s |  92%

that we even gain a notable speedup when scaling from 65K to 131K cores. A graphical presentation of the parallel speedup can be found in Figure 15.

We also provide and discuss some detailed timings for this strong scaling experiment. In Figure 16, we present four different subtimers: Krylov iterations (yellow in Figure 16), assembly of local problems (blue in Figure 16), FETI-DP setup (green in Figure 16), and updates and parallel norms (orange in Figure 16). On the left, the average time per Newton step of Phase 2 is presented for these subtimers. On the right, their percentage of the total time is depicted.

Let us note that three main improvements compared to our first results on Mira enabled these strong scaling results. First, the elimination of the inefficiency in the update of the Lagrange multipliers has made the remaining time (orange in Figure 16) insignificant. Second, we have removed the redistribution of the coarse problem and replaced it by a faster assembly process on a subset of MPI ranks. For the strong scaling results, we assemble the coarse problem on at most 16K cores. This also results in a better scalability of the FETI-DP setup phase. Third, we have better adapted BoomerAMG to the properties of the FETI-DP coarse problem; see our previous detailed discussion on the improved use of BoomerAMG.
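As a rough illustration of the second improvement, assembling the coarse problem on a subset of MPI ranks amounts to forming a subcommunicator and restricting the coarse operator to it. The C sketch below is not our implementation but a minimal example under stated assumptions: the limit of 16 384 ranks mirrors the choice reported above, and selecting the lowest-numbered ranks is an arbitrary illustrative choice.

```c
/* Illustrative sketch: restrict the FETI-DP coarse problem to at most
 * max_coarse_ranks MPI ranks via a subcommunicator. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int max_coarse_ranks = 16384;  /* example limit, cf. the text */
    int n_coarse = (size < max_coarse_ranks) ? size : max_coarse_ranks;

    /* Ranks 0 .. n_coarse-1 own a piece of the coarse problem; all other
     * ranks only send their local contributions to these owners. */
    int color = (rank < n_coarse) ? 0 : MPI_UNDEFINED;
    MPI_Comm coarse_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &coarse_comm);

    if (coarse_comm != MPI_COMM_NULL) {
        /* ... assemble the coarse operator and set up BoomerAMG on coarse_comm ... */
        MPI_Comm_free(&coarse_comm);
    }

    MPI_Finalize();
    return 0;
}
```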

Table 8
Hybrid nonlinear FETI-DP/multigrid algorithm on the Mira supercomputer at Argonne National Laboratory; hyperelastic Neo-Hooke material; E = 210 000 in circular inclusions and E = 210 in the surrounding matrix material; Poisson ratio ν = 0.3 in the complete domain; see Figure 4 for the geometry; a fixed displacement of 1% in the x-direction is prescribed in each boundary node; two-dimensional and rectangular domain; H/h = 100 and piecewise quadratic finite elements. The number of subdomains is identical to the number of cores and MPI ranks. Scalability 16 → 524K for each Newton step: 67% (Phase 1), 68% (Phase 2). A relative tolerance of rtol = 10^{-8} is used for Newton and rtol = 10^{-10} for GMRES.

Mira Blue Gene/Q Supercomputer (ANL)
Cores   | Problem size | Phase 1 time / Newton | Phase 2 time / Newton | Krylov iter | Total time | Parallel efficiency
16      | 1.3M    | 158.7s / 4 | 205.3s / 3 |  83 | 364.0s | 100%
32      | 2.6M    | 159.2s / 4 | 212.9s / 3 |  95 | 372.1s |  98%
64      | 5.1M    | 159.5s / 4 | 220.9s / 3 | 109 | 380.4s |  96%
128     | 10M     | 159.5s / 4 | 224.6s / 3 | 113 | 384.1s |  95%
256     | 20M     | 160.1s / 4 | 238.9s / 3 | 135 | 399.0s |  91%
512     | 41M     | 160.1s / 4 | 231.2s / 3 | 113 | 391.3s |  93%
1 024   | 82M     | 160.3s / 4 | 245.2s / 3 | 136 | 405.5s |  90%
2 048   | 164M    | 160.8s / 4 | 230.3s / 3 | 110 | 391.1s |  93%
4 096   | 328M    | 182.0s / 4 | 246.5s / 3 | 110 | 428.4s |  85%
8 192   | 655M    | 186.4s / 4 | 254.0s / 3 | 114 | 440.4s |  83%
16 384  | 1 311M  | 137.3s / 4 | 249.0s / 3 | 110 | 433.3s |  84%
32 768  | 2 622M  | 138.9s / 4 | 251.7s / 3 | 111 | 390.6s |  93%
65 536  | 5 243M  | 145.3s / 4 | 180.3s / 2 |  85 | 325.6s | 112%
131 072 | 10 486M | 147.5s / 3 | 182.0s / 2 |  84 | 329.5s | 110%
262 144 | 20 972M | 144.9s / 3 | 177.5s / 2 |  83 | 322.4s | 113%
524 288 | 41 944M | 177.6s / 3 | 200.2s / 2 |  82 | 377.8s |  96%
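As a consistency check, not an additional result, the overall weak scaling efficiency quoted above follows directly from the total times in Table 8:
\[
E_{\mathrm{weak}} \;=\; \frac{T(16)}{T(524\,288)} \;=\; \frac{364.0\,\mathrm{s}}{377.8\,\mathrm{s}} \;\approx\; 0.96 .
\]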

Fig. 13. Time to solution on the Mira Blue Gene/Q for our hyperelastic model problem; cf. the data in Table 8.

Still, the FETI-DP setup phase does not show optimal parallel efficiency, and its percentage of a single Newton step grows (green in Figure 16). Further investigations and optimizations of the assembly process of the coarse problem may be necessary.

It is possible to execute up to four MPI processes on each core on Vulcan by specifying the --overcommit option to make better use of the hardware threads of the Power BQC processor. In strong scaling experiments, as presented above, several subdomain problems have to be solved on each core. It is natural to handle these subdomains sequentially on each core. Alternatively, we can handle subdomains in parallel by using several MPI processes per core, which may help to fill the two

Fig. 14. Weak scalability on Mira Blue Gene/Q for our hyperelastic model problem; cf. the data in Table 8. Phase 1 and Phase 2 are average values over the Newton steps.

Table 9
Hybrid nonlinear FETI-DP/multigrid algorithm on the JUQUEEN supercomputer at the Jülich Supercomputing Center (JSC) using 32 MPI ranks per node (-p 32); hyperelastic Neo-Hooke material; E = 210 000 in circular inclusions and E = 210 in the surrounding matrix material; Poisson ratio ν = 0.3 in the complete domain; see Figure 4 for the geometry; a fixed displacement of 1% in the x-direction is prescribed in each boundary node; two-dimensional and rectangular domain; H/h = 70 and piecewise quadratic finite elements. The number of subdomains is identical to the number of MPI ranks. Scalability 32 → 524K MPI ranks for each Newton step: 52% (Phase 1), 60% (Phase 2). A relative tolerance of rtol = 10^{-8} is used for Newton and rtol = 10^{-10} for GMRES.

JUQUEEN Blue Gene/Q Supercomputer (JSC)
MPI ranks | Cores   | Problem size | Phase 1 time / Newton | Phase 2 time / Newton | Krylov iter | Total time | Par. eff.
32        | 32      | 1.3M    | 40.2s / 4 | 54.0s / 3 |  81 |  95.2s |
32        | 16      | 1.3M    | 59.1s / 4 | 82.7s / 3 |  89 | 141.8s | 100%
128       | 64      | 5M      | 59.3s / 4 | 88.0s / 3 | 104 | 147.3s |  96%
512       | 256     | 20M     | 59.6s / 4 | 90.9s / 3 | 108 | 150.6s |  94%
2 048     | 1 024   | 80M     | 60.2s / 4 | 91.6s / 3 | 106 | 151.8s |  94%
8 192     | 4 096   | 321M    | 60.7s / 4 | 90.7s / 3 | 104 | 151.1s |  94%
32 768    | 16 384  | 1 285M  | 47.0s / 3 | 64.6s / 2 |  81 | 111.6s | 127%
131 072   | 65 536  | 5 138M  | 54.1s / 3 | 70.8s / 2 |  80 | 124.9s | 114%
524 288   | 262 144 | 20 553M | 85.4s / 3 | 92.0s / 2 |  78 | 177.4s |  80%

pipelines and thus better utilize the execution units. The memory of the node will be partitioned accordingly. We present results for a Neo-Hooke hyperelasticity problem with 838 942 722 degrees of freedom in Figure 17 and compare computations on 1 024, 2 048, and 4 096 nodes with one MPI process per core (upper blue line) and with four, two, and one MPI processes per core, respectively (lower green line). We use the maximum number of four processes per core on 1 024 nodes and two processes per core on 2 048 nodes. Of course we cannot expect perfect scalability, but using two MPI processes per processor core we do achieve a significant speedup of 30% compared to the standard approach, which is surprising. We do not obtain an additional benefit from using four MPI processes per core. Thus, if feasible, an overcommit should be considered.

7. Conclusion. We have presented a new parallel nonlinear FETI-DP domain decomposition method for scalar nonlinear problems and nonlinear hyperelasticity in two and three dimensions which is especially suited for the currently leading computing facilities with their large number of cores. We obtain weak scalability to 524 288

Table 10
Hybrid nonlinear FETI-DP/multigrid algorithm on the Vulcan supercomputer at Lawrence Livermore National Laboratory; Neo-Hooke material; E = 210 000 in circular inclusions and E = 210 in the surrounding matrix material; Poisson ratio ν = 0.3 in the complete domain; see Figure 4 for the geometry; a fixed displacement of 1% in the x-direction is prescribed in each boundary node; piecewise quadratic finite elements.

Vulcan Blue Gene/Q Supercomputer (LLNL)
Cores   | Subdomains | Problem size | Total execution time | Actual speedup | Ideal speedup | Parallel effic.
1 024   | 131 072 | 419 471 361 | 3 365.1s |  1.0 |   1 | 100%
2 048   | 131 072 | 419 471 361 | 1 726.4s |  1.9 |   2 |  97%
4 096   | 131 072 | 419 471 361 |   868.0s |  3.9 |   4 |  97%
8 192   | 131 072 | 419 471 361 |   453.5s |  7.4 |   8 |  93%
16 384  | 131 072 | 419 471 361 |   231.4s | 14.6 |  16 |  91%
32 768  | 131 072 | 419 471 361 |   119.8s | 28.1 |  32 |  88%
65 536  | 131 072 | 419 471 361 |    64.3s | 51.6 |  64 |  81%
131 072 | 131 072 | 419 471 361 |    41.7s | 80.6 | 128 |  63%
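As a consistency check of the tabulated values, the speedup and efficiency in the last row of Table 10 follow, up to rounding of the printed times, from
\[
S(131\,072) \;=\; \frac{T(1\,024)}{T(131\,072)} \;=\; \frac{3365.1\,\mathrm{s}}{41.7\,\mathrm{s}} \;\approx\; 80.7,
\qquad
E \;=\; \frac{S(131\,072)}{131\,072/1\,024} \;\approx\; \frac{80.7}{128} \;\approx\; 0.63,
\]
consistent with the reported values of 80.6 and 63%; the small difference in the speedup presumably stems from rounding of the reported times.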

Fig. 15. Strong scaling on Vulcan: Visualization of the speedup from Table 10.

Blue Gene/Q cores with an efficiency of 96% for a nonlinear hyperelasticity problem using a pure MPI implementation and achieve strong scalability from 1 024 to 131 072 cores with an efficiency of 63%. In contrast to the nonlinear FETI-DP methods proposed in [38, 37], the new method allows for the inexact solution of the coarse problem, here using an AMG preconditioner, which enables the very good parallel scalability results on up to several hundreds of thousands of cores. The nonlinear domain decomposition approach helps to localize computational work and to reduce the number of Krylov iterations. Algorithmically, largely the same building blocks are used as in standard FETI-DP methods, but they are arranged in a different order to achieve increased locality and concurrency.

Fig. 16. Detailed subtimers for the strong scaling experiments from Table 10; average time per Newton step in Phase 2.

Fig. 17. Effect of using multiple MPI processes for each processor core (overcommit): One process per core compared to four/two/one processes per core. For the upper (blue) line we use 65 536 subdomains and 16 384, 32 768, and 65 536 MPI ranks on 16 384, 32 768, and 65 536 BG/Q cores, respectively. For the lower (green) line we use 65 536 subdomains and 65 536 MPI ranks on 16 384, 32 768, and 65 536 BG/Q cores.

Acknowledgments. The authors would like to thank Satish Balay and Barry Smith, Argonne National Laboratory, for the fruitful cooperation. The authors also gratefully acknowledge the cooperation with Ulrike Meier Yang, Lawrence Livermore National Laboratory.
This research used resources, i.e., Mira, of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. Additional computing time was provided by the Argonne Leadership Computing Facility through the Director's Discretionary program and the Mira Bootcamp 2015, both in preparation for an Innovative and Novel Computational Impact on Theory and Experiment (INCITE) proposal. The computational experiments on the Vulcan supercomputer were performed while one of the authors (Martin Lanser) was visiting Lawrence Livermore National Laboratory (www.llnl.gov).
The authors also gratefully acknowledge the Gauss Centre for Supercomputing e.V. (www.gauss-centre.eu) for providing computing time on the GCS Supercomputer SuperMUC at Leibniz Supercomputing Centre (LRZ, www.lrz.de) and JUQUEEN at Jülich Supercomputing Centre (JSC, www.fz-juelich.de/ias/jsc). GCS is the alliance of the three national supercomputing centres HLRS (Universität Stuttgart), JSC (Forschungszentrum Jülich), and LRZ (Bayerische Akademie der Wissenschaften), funded by the German Federal Ministry of Education and Research (BMBF) and the German State Ministries for Research of Baden-Württemberg (MWK), Bayern (StMWFK), and Nordrhein-Westfalen (MIWF). The use of the Cray XT6 at University of Duisburg-Essen is also gratefully acknowledged.

REFERENCES

[1] A. H. Baker, R. D. Falgout, T. V. Kolev, and U. M. Yang, Scaling hypre's multigrid solvers to 100,000 cores, in High-Performance Scientific Computing, M. W. Berry, K. A. Gallivan, E. Gallopoulos, A. Grama, B. Philippe, Y. Saad, and F. Saied, eds., Springer, London, 2012, pp. 261–279.
[2] S. Balay, J. Brown, K. Buschelman, V. Eijkhout, W. D. Gropp, D. Kaushik, M. G. Knepley, L. C. McInnes, B. F. Smith, and H. Zhang, PETSc Users Manual, Technical Report ANL-95/11, Revision 3.5, Argonne National Laboratory, Lemont, IL, 2014.
[3] S. Balay, J. Brown, K. Buschelman, W. D. Gropp, D. Kaushik, M. G. Knepley, L. C. McInnes, B. F. Smith, and H. Zhang, PETSc Web Page, http://www.mcs.anl.gov/petsc (2014).
[4] S. Balay, W. D. Gropp, L. C. McInnes, and B. F. Smith, Efficient management of parallelism in object oriented numerical software libraries, in Modern Software Tools in Scientific Computing, E. Arge, A. M. Bruaset, and H. P. Langtangen, eds., Birkhäuser, Basel, 1997, pp. 163–202.
[5] P. Bastian, M. Blatt, A. Dedner, C. Engwer, R. Klöfkorn, M. Ohlberger, and O. Sander, A generic grid interface for parallel and adaptive scientific computing. Part I: Abstract framework, Computing, 82 (2008), pp. 103–119.
[6] P. Bastian, M. Blatt, A. Dedner, C. Engwer, R. Klöfkorn, M. Ohlberger, and O. Sander, A generic grid interface for parallel and adaptive scientific computing. Part II: Implementation and tests in DUNE, Computing, 82 (2008), pp. 121–138.
[7] M. Bhardwaj, K. H. Pierson, G. Reese, T. Walsh, D. Day, K. Alvin, J. Peery, C. Farhat, and M. Lesoinne, Salinas: A scalable software for high performance structural and mechanics simulation, in ACM/IEEE Proceedings of SC02: High Performance Networking and Computing, Gordon Bell Award, ACM, New York, IEEE, Washington, DC, 2002, pp. 1–19.
[8] F. Bordeu, P.-A. Boucard, and P. Gosselet, Balancing domain decomposition with nonlinear relocalization: Parallel implementation for laminates, in Proceedings of the 1st International Conference on Parallel, Distributed and Grid Computing for Engineering, B. H. V. Topping and P. Iványi, eds., Civil-Comp Press, Stirlingshire, UK, 2009.
[9] P. Brune, M. G. Knepley, B. Smith, and X. Tu, Composing Scalable Nonlinear Algebraic Solvers, Technical Report ANL/MCS-P2010-0112, Argonne National Laboratory, Lemont, IL, 2013.
[10] X.-C. Cai and M. Dryja, Domain decomposition methods for monotone nonlinear elliptic problems, in Domain Decomposition Methods in Scientific and Engineering Computing (University Park, PA, 1993), Contemp. Math. 180, AMS, Providence, RI, 1994, pp. 21–27.
[11] X.-C. Cai and D. E. Keyes, Nonlinearly preconditioned inexact Newton algorithms, SIAM J. Sci. Comput., 24 (2002), pp. 183–200.
[12] X.-C. Cai, D. E. Keyes, and L. Marcinkowski, Non-linear additive Schwarz preconditioners and application in computational fluid dynamics, Internat. J. Numer. Methods Fluids, 40 (2002), pp. 1463–1470.
[13] P. Cresta, O. Allix, C. Rey, and S. Guinard, Nonlinear localization strategies for domain decomposition methods: Application to post-buckling analyses, Comput. Methods Appl. Mech. Engrg., 196 (2007), pp. 1436–1446.

[14] J.-M. Cros, A preconditioner for the Schur complement domain decomposition method, in Domain Decomposition Methods in Science and Engineering, O. Widlund, I. Herrera, D. Keyes, and R. Yates, eds., National Autonomous University of Mexico (UNAM), Mexico City, Mexico, 2003, pp. 373–380.
[15] T. A. Davis, A column pre-ordering strategy for the unsymmetric-pattern multifrontal method, ACM Trans. Math. Software, 30 (2004), pp. 165–195.
[16] H. De Sterck, U. M. Yang, and J. J. Heys, Reducing complexity in parallel algebraic multigrid preconditioners, SIAM J. Matrix Anal. Appl., 27 (2006), pp. 1019–1039.
[17] S. Deparis, Numerical Analysis of Axisymmetric Flows and Methods for Fluid-Structure Interaction Arising in Blood Flow Simulation, Ph.D. thesis, EPFL, Lausanne, Switzerland, 2004.
[18] S. Deparis, M. Discacciati, G. Fourestey, and A. Quarteroni, Heterogeneous domain decomposition methods for fluid-structure interaction problems, in Domain Decomposition Methods in Science and Engineering XVI, Lect. Notes Comput. Sci. Eng. 55, Springer, Berlin, Heidelberg, 2007, pp. 41–52.
[19] S. Deparis, M. Discacciati, G. Fourestey, and A. Quarteroni, Fluid-structure algorithms based on Steklov-Poincaré operators, Comput. Methods Appl. Mech. Engrg., 195 (2006), pp. 5797–5812.
[20] C. R. Dohrmann, A preconditioner for substructuring based on constrained energy minimization, SIAM J. Sci. Comput., 25 (2003), pp. 246–258.
[21] M. Dryja and W. Hackbusch, On the nonlinear domain decomposition method, BIT, 37 (1997), pp. 296–311.
[22] R. D. Falgout, J. E. Jones, and U. M. Yang, The design and implementation of hypre, a library of parallel high performance preconditioners, in Numerical Solution of Partial Differential Equations on Parallel Computers, A. M. Bruaset, P. Bjorstad, and A. Tveito, eds., Lect. Notes Comput. Sci. Eng. 51, Springer-Verlag, Berlin, 2006, pp. 267–294.
[23] C. Farhat, M. Lesoinne, P. LeTallec, K. Pierson, and D. Rixen, FETI-DP: A dual-primal unified FETI method - part I: A faster alternative to the two-level FETI method, Internat. J. Numer. Methods Engrg., 50 (2001), pp. 1523–1544.
[24] C. Farhat, M. Lesoinne, and K. Pierson, A scalable dual-primal domain decomposition method, Numer. Linear Algebra Appl., 7 (2000), pp. 687–714.
[25] M. Á. Fernández, J.-F. Gerbeau, A. Gloria, and M. Vidrascu, Domain decomposition based Newton methods for fluid-structure interaction problems, in CANUM 2006—Congrès National d'Analyse Numérique, ESAIM Proc. 22, EDP Sci., Les Ulis, 2008, pp. 67–82.
[26] B. Ganis, K. Kumar, G. Pencheva, M. F. Wheeler, and I. Yotov, A global Jacobian method for mortar discretizations of a fully implicit two-phase flow model, Multiscale Model. Simul., 12 (2014), pp. 1401–1423.
[27] B. Ganis, G. Pencheva, M. F. Wheeler, T. Wildey, and I. Yotov, A frozen Jacobian multiscale mortar preconditioner for nonlinear interface operators, Multiscale Model. Simul., 10 (2012), pp. 853–873.
[28] A. Greenbaum, Iterative Methods for Solving Linear Systems, Frontiers Appl. Math. 17, SIAM, Philadelphia, 1997.
[29] C. Groß and R. Krause, A Generalized Recursive Trust-Region Approach—Nonlinear Multiplicatively Preconditioned Trust-Region Methods and Applications, Technical Report 2010-09, Institute of Computational Science, Università della Svizzera italiana, Lugano, Switzerland, 2010.
[30] C. Groß and R. Krause, On the Globalization of ASPIN Employing Trust-Region Control Strategies—Convergence Analysis and Numerical Examples, Technical Report 2011-03, Institute of Computational Science, Università della Svizzera italiana, Lugano, Switzerland, 2011.
[31] V. E. Henson and U. M. Yang, BoomerAMG: A parallel algebraic multigrid solver and preconditioner, Appl. Numer. Math., 41 (2002), pp. 155–177.
[32] G. A. Holzapfel, Nonlinear Solid Mechanics. A Continuum Approach for Engineering, John Wiley and Sons, Chichester, UK, 2000.
[33] F.-N. Hwang and X.-C. Cai, Improving robustness and parallel scalability of Newton method through nonlinear preconditioning, in Domain Decomposition Methods in Science and Engineering, Lect. Notes Comput. Sci. Eng. 40, Springer, Berlin, 2005, pp. 201–208.
[34] F.-N. Hwang and X.-C. Cai, A class of parallel two-level nonlinear Schwarz preconditioned inexact Newton algorithms, Comput. Methods Appl. Mech. Engrg., 196 (2007), pp. 1603–1611.

[35] O. Ippisch, M. Blatt, J. Fahlke, and F. Heimann, MuPhi—Simulation of Flow and Transport in Porous Media, http://www.fz-juelich.de/ias/jsc/EN/Expertise/High-Q-Club/muPhi/node.html (2014).
[36] P. Jolivet, F. Hecht, F. Nataf, and C. Prud'homme, Scalable domain decomposition preconditioners, in SuperComputing SC13, Denver, CO, 2013.
[37] A. Klawonn, M. Lanser, P. Radtke, and O. Rheinbach, On an adaptive coarse space and on nonlinear domain decomposition, in Domain Decomposition Methods in Science and Engineering XXI, Lect. Notes Comput. Sci. Eng. 98, J. Erhel, M. J. Gander, L. Halpern, G. Pichot, T. Sassi, and O. B. Widlund, eds., Springer-Verlag, Berlin, 2014, pp. 71–83.
[38] A. Klawonn, M. Lanser, and O. Rheinbach, Nonlinear FETI-DP and BDDC methods, SIAM J. Sci. Comput., 36 (2014), pp. A737–A765.
[39] A. Klawonn, M. Lanser, and O. Rheinbach, FE2TI (ex nl/fe2) EXASTEEL—Bridging Scales for Multiphase Steels, http://www.fz-juelich.de/ias/jsc/EN/Expertise/High-Q-Club/FE2TI/node.html (2015).
[40] A. Klawonn, M. Lanser, and O. Rheinbach, A nonlinear FETI-DP method with an inexact coarse problem, in Domain Decomposition Methods in Science and Engineering XXII, R. Krause, M. J. Gander, Th. Dickopf, L. F. Pavarino, and L. Halpern, eds., Lect. Notes Comput. Sci. Eng., Springer-Verlag, Berlin, to appear.
[41] A. Klawonn, M. Lanser, O. Rheinbach, H. Stengel, and G. Wellein, Hybrid MPI/OpenMP parallelization in FETI-DP methods, in Proceedings of the Conference on Recent Trends in Computational Engineering (CE2014), Lect. Notes Comput. Sci. Eng. 105, Springer-Verlag, Berlin, 2015, pp. 67–84.
[42] A. Klawonn and O. Rheinbach, A parallel implementation of dual-primal FETI methods for three-dimensional linear elasticity using a transformation of basis, SIAM J. Sci. Comput., 28 (2006), pp. 1886–1906.
[43] A. Klawonn and O. Rheinbach, Inexact FETI-DP methods, Internat. J. Numer. Methods Engrg., 69 (2007), pp. 284–307.
[44] A. Klawonn and O. Rheinbach, Robust FETI-DP methods for heterogeneous three dimensional elasticity problems, Comput. Methods Appl. Mech. Engrg., 196 (2007), pp. 1400–1414.
[45] A. Klawonn and O. Rheinbach, A hybrid approach to 3-level FETI, PAMM Proc. Appl. Math. Mech., 8 (2008), pp. 10841–10843.
[46] A. Klawonn and O. Rheinbach, Highly scalable parallel domain decomposition methods with an application to biomechanics, ZAMM Z. Angew. Math. Mech., 90 (2010), pp. 5–32.
[47] A. Klawonn and O. B. Widlund, Dual-primal FETI methods for linear elasticity, Comm. Pure Appl. Math., 59 (2006), pp. 1523–1572.
[48] A. Klawonn, O. B. Widlund, and M. Dryja, Dual-primal FETI methods for three-dimensional elliptic problems with heterogeneous coefficients, SIAM J. Numer. Anal., 40 (2002), pp. 159–179.
[49] J. Li and O. B. Widlund, FETI-DP, BDDC, and block Cholesky methods, Internat. J. Numer. Methods Engrg., 66 (2006), pp. 250–271.
[50] J. Mandel and C. R. Dohrmann, Convergence of a balancing domain decomposition by constraints and energy minimization, Numer. Linear Algebra Appl., 10 (2003), pp. 639–659.
[51] J. Mandel, C. R. Dohrmann, and R. Tezaur, An algebraic theory for primal and dual substructuring methods by constraints, Appl. Numer. Math., 54 (2005), pp. 167–193.
[52] J. Mandel, B. Sousedík, and C. R. Dohrmann, On multilevel BDDC, in Domain Decomposition Methods in Science and Engineering XVII, U. Langer, M. Discacciati, D. E. Keyes, O. B. Widlund, and W. Zulehner, eds., Lect. Notes Comput. Sci. Eng. 60, Springer, Berlin, Heidelberg, 2008, pp. 287–294.
[53] J. Pebrel, C. Rey, and P. Gosselet, A nonlinear dual-domain decomposition method: Application to structural problems with damage, Internat. J. Multiscale Comp. Eng., 6 (2008), pp. 251–262.
[54] O. Rheinbach, Parallel iterative substructuring in structural mechanics, Arch. Comput. Methods Eng., 16 (2009), pp. 425–463.
[55] U. Rüde, Terra-neo—Integrated Co-design of an Exa-scale Earth Mantle Modeling Framework, http://www.fz-juelich.de/ias/jsc/EN/Expertise/High-Q-Club/Terra-Neo/node.html (2014).
[56] B. F. Smith, P. E. Bjørstad, and W. Gropp, Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations, Cambridge University Press, Cambridge, UK, 1996.

[57] B. Sousedík and J. Mandel, On adaptive-multilevel BDDC, in Domain Decomposition Methods in Science and Engineering XIX, Lect. Notes Comput. Sci. Eng. 78, Springer, Heidelberg, 2011, pp. 39–50.
[58] A. Toselli and O. Widlund, Domain Decomposition Methods—Algorithms and Theory, Springer Ser. Comput. Math. 34, Springer, Berlin, 2004.
[59] U. Trottenberg, C. W. Oosterlee, and A. Schüller, Multigrid, Academic Press, London, San Diego, 2001.
[60] X. Tu, Three-level BDDC in three dimensions, SIAM J. Sci. Comput., 29 (2007), pp. 1759–1780.
[61] X. Tu, Three-level BDDC in two dimensions, Internat. J. Numer. Methods Engrg., 69 (2007), pp. 33–59.
[62] O. C. Zienkiewicz and R. L. Taylor, The Finite Element Method for Solid and Structural Mechanics, Elsevier, Oxford, UK, 2005.