
Perspectives in Science (2016) 7, 140-150

Available online at www.sciencedirect.com

ScienceDirect

journal homepage: www.elsevier.com/pisc

Numerical libraries solving large-scale problems developed at IT4Innovations Research Programme Supercomputing for Industry

Michal Merta (a,b,∗), Jan Zapletal (a,b), Tomas Brzobohaty (a), Alexandros Markopoulos (a), Lubomir Riha (a), Martin Cermak (a), Vaclav Hapla (a,b), David Horak (a,b), Lukas Pospisil (a,b), Alena Vasatova (a,b)

(a) IT4Innovations National Supercomputing Center, 17. listopadu 15/2172, 708 00 Ostrava, Czech Republic

(b) Department of Applied Mathematics, VSB - Technical University of Ostrava, 17. listopadu 15/2172, 708 33 Ostrava, Czech Republic

Received 26 October 2015; accepted 11 November 2015

Available online 15 December 2015

KEYWORDS
FETI; TFETI; BEM; Domain decomposition; Quadratic programming; HPC

Summary The team of Research Programme Supercomputing for Industry at IT4Innovations National Supercomputing Center is focused on development of highly scalable algorithms for solution of linear and non-linear problems arising from different engineering applications. As a main parallelisation technique, domain decomposition methods (DDM) of FETI type are used. These methods are combined with finite element (FEM) or boundary element (BEM) discretisation methods and quadratic programming (QP) algorithms. All these algorithms were implemented into our in-house software packages BEM4I, ESPRESO and PERMON, which demonstrate high scalability up to tens of thousands of cores.

© 2015 Published by Elsevier GmbH. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

This article is part of a special issue entitled "Proceedings of the 1st Czech-China Scientific Conference 2015".

∗ Corresponding author at: IT4Innovations National Supercomputing Center, 17. listopadu 15/2172, 708 33 Ostrava, Czech Republic.

E-mail address: [email protected] (M. Merta).

http://dx.doi.org/10.1016/j.pisc.2015.11.023

Introduction

High performance of contemporary computers results from an increasing number of compute nodes in clusters and number of processor cores per node. While the current most powerful petascale or multi-petascale computers contain hundreds of thousands of CPU cores, the future exascale
systems will comprise millions of them. For efficient use of such systems, algorithms with high parallel scalability have to be developed.

Discretisation of most engineering problems describable by partial differential equations (PDE) leads to large sparse linear systems of equations. However, problems that can be expressed as elliptic variational inequalities, such as those describing the equilibrium of elastic bodies in mutual contact, lead to quadratic programming (QP) problems.

Finite element tearing and interconnecting (FETI) and boundary element tearing and interconnecting (BETI) (Langer and Steinbach, 2003; Of and Steinbach, 2009) methods form a successful subclass of domain decomposition methods (DDM). They belong to non-overlapping methods and combine sparse iterative and direct solvers. FETI was first introduced by Farhat and Roux (Farhat and Roux, 1991, 1992). The key ingredient of the FETI method is the decomposition of the spatial domain into non-overlapping subdomains that are "glued together" by Lagrange multipliers. Elimination of the primal variables reduces the original linear problem to a smaller, relatively well conditioned, equality constrained QP. If the FETI procedure is applied to a contact problem (Dostál et al., 1998, 2000, 2005, 2010, 2012; Dostál and Horák, 2004), the resulting QP has additional bound constraints. FETI methods allow highly accurate computations scaling up to tens of thousands of processors.

Our team was successful in adapting the FETI approach for contact problems and designed new variants. One of them is Total-FETI (TFETI) developed by Dostál et al. (Dostál et al., 2006, 2010; Kruis et al., 2002; Čermák et al., 2015), which uses Lagrange multipliers to enforce Dirichlet boundary conditions. This enables a simpler building of the stiffness matrix kernel, as all subdomains are floating and the associated subdomain stiffness matrices have the same kernel, obtained without any computation. Hybrid-TFETI (HTFETI) reduces the coarse problem (CP) size by aggregating the subdomains into clusters, i.e. TFETI is applied twice.

The resulting QP problems can then be solved by means of the efficient MPRGP and SMALBE algorithms, designed again by Dostál et al. (Dostál et al., 2003; Dostál and Schöberl, 2005; Dostál, 2009), with a known rate of convergence given by spectral properties of the solved system.

We develop several software packages dealing with FETI: PERMON based on PETSc and ESPRESO based on Intel MKL and Cilk. The BEM4I library implements BEM discretisation, and together with the other two packages the BETI method.

The paper is organised as follows. After the introduction, we describe the main principles of FETI and BETI methods. Then the particular libraries and their modules are introduced with the achieved highlights from various areas.

Numerical methods

FETI methods

FETI-1 (Farhat and Roux, 1991, 1992; Farhat et al., 1994; Kruis, 2006) is a non-overlapping DDM (Gosselet and Rey, 2006) which is based on decomposing the original spatial domain into non-overlapping subdomains. They are "glued together" by Lagrange multipliers which have to satisfy certain equality constraints which will be discussed later. The original FETI-1 method assumes that the boundary subdomains inherit Dirichlet conditions from the original problem where the conditions are embedded into the linear system arising from FEM. Physically, this means that subdomains whose interfaces intersect the Dirichlet boundary are fixed while others are kept floating; in linear algebra terms, the corresponding subdomain stiffness matrices are non-singular and singular, respectively.

The basic idea of the Total-FETI (TFETI) method (Dostál et al., 2006, 2010; Čermák et al., 2015) is to keep all the subdomains floating and enforce the Dirichlet boundary conditions by means of a constraint matrix and Lagrange multipliers, similarly to the gluing conditions along subdomain interfaces. This simplifies implementation of the stiffness matrix generalised inverse. The key point is that the kernels $R^s$ of the subdomain stiffness matrices $K^s$ are known a priori, have the same dimension and can be formed without any computation from the mesh data, so that the matrix $R$ ($\mathrm{Im}\, R = \mathrm{Ker}\, K$) also has a convenient block-diagonal layout. Furthermore, each local stiffness matrix can be regularised cheaply, and the inverse of the resulting nonsingular matrix is at the same time a generalised inverse of the original singular one (Dostál et al., 2011; Brzobohatý et al., 2011).

FETI methods use the Lagrange multipliers to enforce both equality and inequality constraints (gluing and nonpenetration conditions) in the original primal problem

$$\min \frac{1}{2} u^T K u - u^T f \quad \text{s.t.} \quad B_E u = o \ \text{ and } \ B_I u \le c_I.$$

The primal problem is then transformed using duality into a significantly smaller and better conditioned dual problem with an equality constraint and a nonnegativity bound

$$\min \frac{1}{2} \lambda^T F \lambda - \lambda^T d \quad \text{s.t.} \quad G \lambda = e, \ \lambda_I \ge o$$

with

$$F = B K^+ B^T, \quad G = R^T B^T, \quad d = B K^+ f, \quad e = R^T f.$$

After homogenisation using the particular solution $\tilde{\lambda} = G^T (G G^T)^{-1} e$, with $\lambda = \hat{\lambda} + \tilde{\lambda}$, $\hat{\lambda} \in \mathrm{Ker}\, G$, $\tilde{\lambda} \in \mathrm{Im}\, G^T$, and enforcing the homogenised equality constraint by means of the projector $P = I - Q$ onto $\mathrm{Ker}\, G$, where $Q = G^T (G G^T)^{-1} G$ is the projector onto $\mathrm{Im}\, G^T$, the SMALSE algorithm can be applied to the problem

$$\min \frac{1}{2} \hat{\lambda}^T P F P \hat{\lambda} - \hat{\lambda}^T P (d - F \tilde{\lambda}) \quad \text{s.t.} \quad G \hat{\lambda} = o, \ \hat{\lambda}_I \ge -\tilde{\lambda}_I.$$

For this dual problem the classical estimate of the spectral condition number is valid, i.e. $\kappa(PFP|\mathrm{Im}\, P) \le C (H/h)$, with $H$ denoting the decomposition and $h$ the discretisation parameter. On massively parallel computers it is natural to maximise the number of subdomains (decrease $H$) so that the sizes of the subdomain stiffness matrices are reduced, which not only accelerates their factorisation and the subsequent generalised inverse application but also improves conditioning and reduces the number of iterations. The negative effect is an increase of the dual and null space dimensions,
which decelerates the coarse problem (CP) solution, i.e. the solution of the system $G G^T x = y$, so that the bottleneck of the TFETI method is the application of the projector, which dominates the solution time.

Hybrid FETI method

Although there are several efficient coarse problem parallelisation strategies (Hapla and Horák, 2012; Kozubek et al., 2012, 2013), the size of the coarse problem is still limited. Several hybrid (multilevel) methods were therefore proposed (Lee, 2009; Klawonn and Rheinbach, 2010). The key idea is to aggregate a small number of neighbouring subdomains into clusters (see Fig. 1), which naturally results in a smaller coarse problem. In our HTFETI, the aggregation of subdomains into the clusters is enforced again by Lagrange multipliers. Thus the TFETI method is used on both cluster and subdomain levels. This approach simplifies implementation of hybrid FETI methods and, owing to lower memory requirements, makes it possible to extend parallelisation of the original problem up to tens of thousands of cores. This is the positive effect of reducing the coarse space. The negative one is a worse convergence rate compared with the original TFETI. To improve it, the transformation of basis originally introduced by Klawonn and Widlund (Klawonn and Widlund, 2006), Klawonn and Rheinbach (Klawonn and Rheinbach, 2006), and Li and Widlund (Li and Widlund, 2006) is applied to the derived hybrid algorithm.

Boundary element method and BETI

The boundary element method (BEM) is well-suited for the solution of exterior problems such as sound or electromagnetic wave scattering, or shape optimisation problems. The boundary integral formulation of the given problem leads to the discretisation of the boundary only, and thus effectively reduces the problem dimension.

The method is applicable to problems for which the fundamental solution is known, which is the case, e.g., for the Laplace or Helmholtz equations. In 3D, the respective fundamental solutions read

$$v(x, y) := \frac{1}{4\pi} \frac{1}{\|x - y\|}, \qquad v_\kappa(x, y) := \frac{1}{4\pi} \frac{e^{i\kappa\|x - y\|}}{\|x - y\|}.$$

The solution to the boundary value problem under consideration is given by the representation formula

$$u(x) := \int_{\partial\Omega} \gamma^1 u(y)\, v(x, y)\, \mathrm{d}s_y - \int_{\partial\Omega} \gamma^0 u(y)\, \frac{\partial v}{\partial n_y}(x, y)\, \mathrm{d}s_y,$$

where $\gamma^0$ and $\gamma^1$ represent the Dirichlet and Neumann trace operators. The unknown Cauchy data can be obtained from the appropriate system of boundary integral equations. Applying the Dirichlet and Neumann trace operators to the representation formula leads to the boundary integral equations

$$(V \gamma^1 u)(x) = \frac{1}{2} \gamma^0 u(x) + (K \gamma^0 u)(x) \quad \text{for } x \in \partial\Omega, \qquad (1)$$

$$(D \gamma^0 u)(x) = \frac{1}{2} \gamma^1 u(x) - (K^* \gamma^1 u)(x) \quad \text{for } x \in \partial\Omega$$

with $V$, $K$, $K^*$, and $D$ denoting the single-layer, double-layer, adjoint double-layer, and hypersingular boundary integral operators, respectively. The Galerkin discretisation of the single-layer operator equation (1) leads to the system of linear equations

$$V t = \left( \frac{1}{2} M + K \right) u$$

with the boundary element matrices

$$V[k, \ell] := \int_{\tau_k} \int_{\tau_\ell} v(x, y)\, \mathrm{d}s_y\, \mathrm{d}s_x, \qquad K[k, j] := \int_{\tau_k} \int_{\partial\Omega} \frac{\partial v}{\partial n_y}(x, y)\, \varphi_j(y)\, \mathrm{d}s_y\, \mathrm{d}s_x$$

and the sparse identity matrix $M$.

The assembly of the full matrices is of quadratic complexity with respect to the degrees of freedom on the surface. Moreover, advanced numerical quadrature methods must be applied to treat singularities occurring in the integrals in the case of identical or adjacent elements (see Rjasanow and Steinbach (2007) or Sauter and Schwab (2010)). Therefore, an efficient implementation and parallelisation of the method is necessary to allow the solution of large scale problems.

The FETI domain decomposition methodology combined with BEM discretisation results in the so-called BETI (boundary element tearing and interconnecting) method.

MPRGP and SMALBE algorithms

A combination of the Semi-Monotonic Augmented Lagrangian algorithm for Bound and Equality constraints (SMALBE) and the Modified Proportioning with Reduced Gradient Projection (MPRGP) algorithm (Dostál et al., 2003; Dostál and Schöberl, 2005; Dostál, 2009) was developed and tested for the solution of QP problems resulting from discretisation of contact problems of mechanics, but it can be used equally well for any other QP problem. The algorithms have a theoretically supported rate of convergence given by spectral properties of the solved system. General linear inequality constraints must be converted to bound constraints by applying dualisation, which also typically improves conditioning and reduces dimension.

MPRGP is an active set based algorithm. The main idea of MPRGP is gradient splitting based on active sets into free and chopped gradients whose sum yields the projected gradient. The algorithm exploits a test to decide about leaving the face and three types of steps to generate a sequence of the iterates that approximate the solution:

1 The expansion step, if the solution is proportional, may expand the current active set using fixed steplength related to matrix norm and reduced free gradient.

2 The proportioning step may remove indices from the active set using chopped gradient.

3 The conjugate gradient step.
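The three step types listed above can be illustrated with a small dense NumPy sketch. It follows the textbook formulation of MPRGP for the bound-constrained problem min 1/2 x^T A x - b^T x subject to x >= l as we understand it from Dostál (2009); it is not the PERMON implementation. The function name `mprgp`, the parameter names `gamma` and `alpha_bar`, the choice `alpha_bar = 1/||A||`, and the demonstration problem are assumptions made for illustration only.

```python
import numpy as np

def mprgp(A, b, l, gamma=1.0, tol=1e-10, max_iter=10_000):
    """Sketch: minimise 0.5*x'Ax - b'x subject to x >= l, with A symmetric positive definite."""
    alpha_bar = 1.0 / np.linalg.norm(A, 2)        # fixed step length, assumed in (0, 2/||A||]
    x = l.astype(float).copy()                    # feasible starting point (assumption)

    def split(x, g):
        free = x > l                              # indices not in the active set
        phi = np.where(free, g, 0.0)              # free gradient
        beta = np.where(free, 0.0, np.minimum(g, 0.0))  # chopped gradient
        return phi, beta

    g = A @ x - b
    phi, beta = split(x, g)
    p = phi.copy()
    for it in range(max_iter):
        # phi + beta is the projected gradient; stop when it is small
        if np.linalg.norm(phi + beta) <= tol * np.linalg.norm(b):
            return x, it
        phi_red = np.minimum((x - l) / alpha_bar, phi)   # reduced free gradient
        if beta @ beta <= gamma**2 * (phi_red @ phi):    # iterate is proportional
            Ap = A @ p
            alpha_cg = (g @ p) / (p @ Ap)
            pos = p > 0
            alpha_f = np.min((x[pos] - l[pos]) / p[pos]) if pos.any() else np.inf
            if alpha_cg <= alpha_f:
                # conjugate gradient step (stays feasible)
                x = np.maximum(x - alpha_cg * p, l)      # clamp guards against rounding only
                g = g - alpha_cg * Ap
                phi, beta = split(x, g)
                p = phi - (phi @ Ap) / (p @ Ap) * p
            else:
                # expansion step: go to the boundary, then a fixed-length projected step
                x_half = np.maximum(x - alpha_f * p, l)
                g_half = g - alpha_f * Ap
                phi_half, _ = split(x_half, g_half)
                x = np.maximum(x_half - alpha_bar * phi_half, l)
                g = A @ x - b
                phi, beta = split(x, g)
                p = phi.copy()
        else:
            # proportioning step: release active indices along the chopped gradient
            Ad = A @ beta
            alpha_cg = (g @ beta) / (beta @ Ad)
            x = x - alpha_cg * beta
            g = g - alpha_cg * Ad
            phi, beta = split(x, g)
            p = phi.copy()
    return x, max_iter

# Made-up demonstration problem: 1D Laplacian-like matrix, sign-changing load, bound l = 0.
n = 30
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = 0.1 * np.sin(np.linspace(0.0, 3.0 * np.pi, n))
l = np.zeros(n)
x, iterations = mprgp(A, b, l)
```

In this sketch the returned `x` satisfies the bound and has a small projected gradient, which for a convex QP is equivalent to the KKT conditions. A production code would avoid the dense norm computation and the extra matrix-vector product in the expansion step.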

Figure 1 Cube prepared for TFETI and HFETI.

MPRGP has been proved to enjoy an R-linear rate of convergence in terms of the spectral condition number. SMALBE is an algorithm based on augmented Lagrangians. It takes care of the equality constraints, while in each of its iterations the inner problem, a bound-constrained minimisation of the augmented Lagrangian, is solved by any suitable solver such as MPRGP described above.

The BEM4I library

Overview

The boundary element library BEM4I concentrates on the efficient assembly of the boundary element matrices for the 3D Laplace, Helmholtz, Lamé, and time-domain wave equations. It employs sparsification methods, namely the fast multipole method (FMM) (Greengard and Rokhlin, 1987; Of, 2007) and the adaptive cross approximation (ACA) (Bebendorf, 2008; Rjasanow and Steinbach, 2007), to reduce the computational effort to almost linear.

The core of the library consists of three main sets of classes:

1 BESpace: the classes inheriting from the BESpace class are responsible for the approximation of the continuous function spaces. The stored information includes the order of polynomial test and Ansatz functions or data necessary to approximate matrices using the ACA or FMM methods.

2 BEBilinearForm: the main purpose of this class and its descendants is to assemble the boundary element system matrices (in both full and sparsified formats). The element-wise assembly is performed using the BEIntegrator class. The assembly is parallelised using OpenMP and MPI at this level.

3 BEIntegrator: the classes responsible for the local system matrix assembly inherit from the BEIntegrator class. Several types of numerical quadratures are employed by these classes, including the classical Gaussian quadrature schemes over the pairs of distant elements, and the semi-analytical approach (Rjasanow and Steinbach, 2007; Zapletal and Bouchala, 2014) and fully numerical schemes (Sauter and Schwab, 2010) to treat the singularities in the integrals over pairs of close elements. The computation is vectorised to reduce the computational time using the SSE or AVX instruction sets (Fig. 3).

In addition to these classes the library also contains the supportive classes representing full, sparse, and sparsified matrices, iterative and direct solvers, preconditioners, surface meshes, etc. The library structure together with the results of the scalability tests have been presented in Merta and Zapletal (2015, accepted for publication), Čermák et al. (2015) (Fig. 2).

Figure 2 Structure of the solver for Laplace boundary value problems.
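To give a feeling for why the sparsification mentioned in the overview pays off, the following NumPy sketch builds a low-rank approximation of a Laplace kernel block between two well-separated point clusters. It is the simplest, fully pivoted variant of cross approximation and works on an explicitly stored block; practical ACA uses partial pivoting so that only a few rows and columns are ever evaluated. The function name `cross_approximation`, the point clouds, and the tolerance are assumptions for illustration and do not reflect the BEM4I interface.

```python
import numpy as np

def cross_approximation(A, eps=1e-8, max_rank=None):
    """Sketch: return U, V with A ~ U @ V, built from crosses of the residual (full pivoting)."""
    R = np.array(A, dtype=float)                 # residual, starts as a copy of A
    norm_A = np.linalg.norm(R)
    max_rank = min(R.shape) if max_rank is None else max_rank
    U, V = [], []
    for _ in range(max_rank):
        if np.linalg.norm(R) <= eps * norm_A:    # Frobenius-norm stopping test
            break
        i, j = np.unravel_index(np.argmax(np.abs(R)), R.shape)  # largest residual entry
        u = R[:, j].copy()                       # pivot column
        v = R[i, :] / R[i, j]                    # scaled pivot row
        R -= np.outer(u, v)                      # zeroes row i and column j of the residual
        U.append(u)
        V.append(v)
    if not U:
        return np.zeros((R.shape[0], 0)), np.zeros((0, R.shape[1]))
    return np.column_stack(U), np.vstack(V)

# Made-up example: Laplace kernel between two well-separated point clusters.
rng = np.random.default_rng(0)
X = rng.random((200, 3))                                    # points in the unit cube
Y = rng.random((150, 3)) + np.array([20.0, 0.0, 0.0])       # shifted, distant cluster
dist = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
A = 1.0 / (4.0 * np.pi * dist)
U, V = cross_approximation(A, eps=1e-6)
rel_err = np.linalg.norm(A - U @ V) / np.linalg.norm(A)
```

Storing `U` and `V` instead of the full block reduces memory from 200 x 150 entries to rank x (200 + 150) entries; applied blockwise over a cluster tree, this is the mechanism behind the almost linear effort quoted above.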

Figure 3 Concurrent summation of scalars using vector instructions.

Intel Xeon Phi utilisation

To reduce the computational time the code of the library is accelerated by the Intel Xeon Phi coprocessors. The computationally most demanding parts of the code are offloaded to the coprocessor using offload pragmas of the Intel compiler and the computation is carried out using 60 physical (240 logical) cores available at the coprocessor (see Fig. 4). The computation consists of several steps.

1 Pack the data (mainly nodes and elements of a surface mesh) and send it to the coprocessor.

2 Perform simultaneous computation on the coprocessor and the host.

3 Send results from the coprocessor to the host processor.

4 Combine data from the coprocessor and the processor and assemble the global system matrix.

The results of the numerical benchmarks focused on the assembly of the full single-layer operator matrix for the Laplace equation show a significant reduction in the computational time (see Fig. 5). The main bottleneck is currently the data transfer from the coprocessor to the host processor (Fig. 6).

Figure 4 Offload of the computation to the Intel Xeon Phi coprocessor.

Figure 5 Comparison of the assembly of the single layer operator matrix.

Figure 6 Sound scattering off a submarine shaped domain.

ExaScale PaRallel FETI SOlver (ESPRESO)

Overview

The ESPRESO library is implemented in C++. A significant part of the development effort was devoted to the development of a C++ wrapper for (1) the selected sparse and dense BLAS routines and (2) the sparse direct solvers (MKL and original versions of PARDISO direct solvers) of the Intel MKL library. The solver is developed to support current and future multi and many core architectures, for instance Intel Xeon Phi or Nvidia Tesla. Therefore the CPU and Xeon Phi versions use the Intel MKL library, and the GPU version uses the CUDA libraries (cuBLAS, cuSPARSE, cuSolver).

Communication layer optimisation

ESPRESO-H is mainly focused on the scalability of the communication layer for large computer systems with thousands and tens of thousands of compute nodes. All the processing is done by the CPUs. The solver uses hybrid parallelisation which is well suited for multi-socket and multi-core compute nodes, as this is the architecture of most of today's supercomputers.

The first level of parallelisation is designed for parallel processing of the clusters of subdomains. Individual clusters are processed per node. It is possible to process multiple clusters per one node, but not the other way around. The distributed memory parallelisation is done using MPI. In particular we are using the MPI 3.0 standard, which is implemented in most of the modern MPI distributions. MPI 3.0 is used because the communication hiding techniques implemented in the communication layer require the non-blocking collective operations.

The communication layer is identical for both the TFETI and HTFETI solvers in ESPRESO. It uses novel communication hiding techniques for the main iterative solver. In particular we have implemented: (1) the Pipelined Conjugate Gradient (PipeCG) solver, which hides communication of the global dot products behind the local matrix vector multiplications; (2) distributed CP processing, which merges two global communication operations (Gather and Scatter) into one (AllGather) and parallelises the CP processing using the
distributed inverse matrix of the CP; and (3) the optimised version of global gluing matrix multiplication (matrix B for FETI and B1 for HFETI), written as stencil communication which is fully scalable, see Fig. 7.

Figure 7 The stencil communication for simple decomposition into four subdomains. The Lagrange multipliers (LMs) that connect different neighbouring subdomains are depicted in different colours. In every iteration when the LMs are updated an exchange is performed between the neighbouring subdomains to finish the update. This affinity also controls the distribution of the data for the main distributed iterative solver, which iterates over local LMs only. In our implementation each MPI process modifies only those elements of the vectors used by the CG solver that match the LMs associated with the particular domain in case of FETI or the set of domains in a cluster in case of hybrid FETI.

Inter-cluster processing

The second level of parallelisation is designed for parallel processing of subdomains in a cluster. Our implementation enables oversubscription of CPU cores so that each core can process multiple subdomains and therefore the size of the cluster is not limited by the hardware configuration. This shared memory parallelisation is implemented using Intel Cilk+. We have chosen Cilk+ due to its advanced support for the C++ language. In particular we are taking advantage of the functionality that allows us to create custom parallel reduction operations on top of the C++ objects, which in our case are sparse matrices.

Numerical results

ESPRESO is designed to solve large problems using the world's largest supercomputers. In this paper we present the results measured on the largest European machine, CSCS Piz Daint in Lugano, Switzerland. Piz Daint is a Cray XC30 machine with 5272 compute nodes, each with one 8-core Sandy Bridge CPU (E5-2670), 32 GB of RAM and one K20X GPU accelerator.

All the following tests are done using the synthetic 3D cube linear elasticity benchmark. For this benchmarking we are developing a massively parallel in-memory problem generator, which eliminates I/O bottlenecks and allows the efficiency and scalability of the solver routines to be evaluated more precisely.

The first set of results is shown in Fig. 8. This figure presents the weak scalability of the HTFETI solver in ESPRESO. Due to the limited amount of memory per node, the solver is able to process 2.7 million unknowns per single node. The amount of work per node is then kept fixed and we increase the number of nodes from 1 to 2197, which gives a maximum problem size of 5.8 billion unknowns. This is so far the largest problem we were able to solve on the Piz Daint machine. The important message from this measurement is the flattening characteristic from 343 to 2197 nodes, which is the expected result of good weak scalability of the solver.

The next tests show strong scaling of the HTFETI method in ESPRESO. In Fig. 9 we can see the strong scalability of a single iteration time. This experiment decouples the numerical scalability of the HTFETI method and the scalability of the implementation itself. We can see that ESPRESO achieves super-linear scalability per iteration when solving a problem with 2.6 billion unknowns, starting from 1000 nodes and scaling to 4913 nodes. The per-iteration time is shown in the figure next to each point (second line), while the number of nodes is given in the first line. The blue line shows the linear scaling based on the processing time on 1000 nodes.

The last test using the synthetic benchmark shows the strong scalability of the entire iterative solver in ESPRESO. This involves the per-iteration time as well as the number of iterations (the numerical scalability). We can see that even in this test the solver achieved linear scaling. Please note that for both strong scalability tests we keep the cluster configuration identical; in other words, the number of domains per node remains the same and we reduce the domain size while increasing the number of nodes/clusters (Fig. 10).

ESPRESO-GPU and ESPRESO-MIC

In parallel with ESPRESO-H we are developing two more flavours of ESPRESO which are designed to take advantage of modern many-core accelerators. ESPRESO-GPU uses CUDA and its libraries to run on Nvidia Tesla GPUs. ESPRESO-MIC is developed under the Intel Parallel Computing Center (IPCC) at IT4Innovations and its main focus is to fully utilise the potential of Xeon Phi accelerators based on the Knights Corner architecture. This is essential research for IT4Innovations as it has the largest European Xeon Phi accelerated system, called Salomon.
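The pipelined CG solver named in the communication layer discussion can be illustrated serially. The sketch below follows the unpreconditioned pipelined CG recurrences published by Ghysels and Vanroose as we recall them; it is not the ESPRESO code. In a distributed setting the two dot products would be started as one non-blocking reduction (for example with MPI-3 `MPI_Iallreduce`) and completed after the matrix-vector product, which is where the communication hiding comes from. The function name `pipelined_cg` and the test matrix are assumptions for illustration.

```python
import numpy as np

def pipelined_cg(A, b, tol=1e-10, max_iter=1000):
    """Sketch of unpreconditioned pipelined CG for a symmetric positive definite A."""
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x
    w = A @ r                                   # w tracks A r
    z = np.zeros_like(x)                        # z tracks A s
    s = np.zeros_like(x)                        # s tracks A p
    p = np.zeros_like(x)
    gamma_old = alpha_old = 1.0
    for i in range(max_iter):
        # In a distributed code these two dot products form ONE global reduction,
        # started non-blocking and overlapped with the product q = A w below.
        gamma = r @ r
        delta = w @ r
        if np.sqrt(gamma) <= tol * np.linalg.norm(b):
            return x, i
        q = A @ w                               # local work that hides the reduction
        if i == 0:
            beta = 0.0
            alpha = gamma / delta
        else:
            beta = gamma / gamma_old
            alpha = gamma / (delta - beta * gamma / alpha_old)
        z = q + beta * z
        s = w + beta * s
        p = r + beta * p
        x = x + alpha * p
        r = r - alpha * s
        w = w - alpha * z
        gamma_old, alpha_old = gamma, alpha
    return x, max_iter

# Made-up, well-conditioned SPD test system.
n = 50
A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x, iterations = pipelined_cg(A, b)
```

Compared with classical CG, the recurrences need three extra vectors and are somewhat more sensitive to rounding, which is the price paid for having a single synchronisation point per iteration.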

Figure 8 The weak scaling evaluation of the ESPRESO solver on the largest European supercomputer, CSCS Piz Daint. The solver is able to process 2.7 million unknowns per node. The scalability is evaluated from 1 to 2197 nodes. The flattening shape of the total execution time shows the potential of ESPRESO to scale even further.

Figure 9 Strong scalability of a single iteration time of the ESPRESO solver. In this test ESPRESO is solving a problem with 2.6 billion unknowns on 1000 to 4913 nodes.

Figure 10 Strong scalability of the iterative solver of ESPRESO. In this test ESPRESO is solving a problem with 2.6 billion unknowns on 1000 to 4913 nodes.

PERMON

Overview

Since 2011 we have been developing a novel software package called the PERMON (Parallel, Efficient, Robust, Modular, Object-oriented, Numerical) toolbox, which is based on PETSc and uses TFETI for the solution of QP problems. It makes use of theoretical results in discretisation techniques, QP algorithms, and DDM. It incorporates our own codes, and makes use of renowned open source libraries. The solver layer, discussed here, consists of three modules: PermonFLLOP, PermonQP, and PermonIneq. Other modules are problem-specific, such as PermonPlasticity for plasticity, PermonImage for image recognition, PermonMultiBody for particle dynamics and others.

PermonQP

PermonQP is a package providing a base for the solution of QP problems. Its main idea is the separation of the concepts of QP problems, transforms and solvers, which are abstracted by three basic classes QP, QPT and QPS, respectively. A QP transform derives a new QP from the given QP, so that a doubly linked list (QP chain) is generated where every node is a QP (Fig. 11). The programming interface (API) of PermonQP is carefully designed to be easy to use, and at the same time efficient and suitable for HPC. From the user's point of view, the solution process is divided into the following sequence of actions:

1 QP problem specification;

2 QP transforms, which reformulate the original problem and create a chain of QP problems where the last one is passed to the solver;

3 automatic or manual choice of an appropriate QP solver;

4 QP solution.

PermonQP as a stand-alone package allows solving unconstrained QP problems (i.e. linear systems with a positive semidefinite matrix) or equality constrained ones. In both cases it makes use of the PETSc KSP package, which includes both direct and iterative solvers, including interfaces to many external solvers. Examples of equality constraints are multipoint constraints, or the alternative enforcing of Dirichlet boundary conditions using a separate constraint matrix. This module is being prepared for publishing under the BSD 2-Clause license.

Figure 11 Doubly linked list of QPs.

PermonIneq

PermonQP capabilities can be further extended with the PermonIneq package, which adds several concrete solvers for inequality constrained QPs, e.g. the already mentioned MPRGP and SMALBE algorithms.

PermonFLLOP

PermonFLLOP is a wrapper of PermonQP implementing FETI. It assembles the FETI-specific constraint matrix B and nullspace matrix R. They are passed internally to PermonQP together with subdomain-wise stiffness matrices and load vectors, which can be assembled with an arbitrary FEM library such as PermonCube or libMesh. The FETI method itself consists here just in calling the proper sequence of QP transformations: primal scaling, dualisation, dual scaling, homogenisation of the equality constraints, preconditioning by the orthogonal projector onto the kernel of the dual equality constraint matrix.

Numerical experiments

Table 1 Results for the cube contact linear elasticity problem.

X     NS      # Decomp. DOF    Solution time [s]    Outer iters    Inner iters
4     64      3,000,000        2.68E+01             3              94
6     216     10,125,000       5.38E+01             3              147
8     512     24,000,000       1.21E+02             4              250
10    1000    46,875,000       1.84E+02             4              248

As a benchmark an elastic cube was subjected to volume forces pressing it against an obstacle. There were two reasons for this decision. The elastic cube is a numerical model which can be fully controlled, and the obtained results are not affected by the complexity of the geometry. Another reason is that it is very difficult or even impossible to create very large meshes on complex geometries using existing meshing
tools. With our mesh generator PermonCube we were able to prepare large scale problems decomposed into thousands of subdomains.

The R and G matrices were orthonormalised using the Iterative Classical Gram-Schmidt process, and the K matrix was factorised using the Cholesky factorisation from the MUMPS library. Currently, each computational core owns one and only one subdomain. The norm of the projected gradient compared with the $10^{-5}$ multiple of the projected dual RHS was used as the stopping criterion. The results are summarised in Table 1.

The weak scalability for 13,824; 8000 and 4096 elements per subdomain and the numerical scalability for these configurations (corresponding to the fixed ratios H/h = 24, 20, 16) are then illustrated in Figs. 12 and 13. To investigate the strong scalability we selected a discretisation with 32,768,000 elements (approx. 100,000,000 unknowns), for which the strong scalability was demonstrated up to 8000 cores (41.5 s using 2197 cores; 19.8 s using 4096 cores; 15.7 s using 8000 cores).

Figure 12 Parallel weak scalability of the TFETI method implementation in the PermonFLLOP code for the linear elasticity cube benchmark at HECToR supercomputer.

Figure 13 Numerical scalability of the TFETI method within the PermonFLLOP code for the linear elasticity cube benchmark. Note that from a certain point we get almost constant number of iterations allowing good parallel scalability.

Conclusion

Efficient variants of the BEM discretisation method, scalable QP algorithms, and FETI-type domain decomposition methods (BETI, TFETI and HTFETI) were implemented into our in-house software packages. These solvers were optimised employing available state-of-the-art external libraries, communication hiding and avoiding techniques, hybrid MPI-OpenMP programming, GPU and MIC accelerators, etc. Scalability was proven for both huge model problems and complicated engineering problems up to tens of thousands of cores.

The presented BEM4I library for the boundary element discretisation of engineering problems has been tested up to more than a thousand cores. Currently, its acceleration using the Intel Xeon Phi coprocessors is under development. The initial results suggest a significant reduction in computational time in the case of full system matrices for the Laplace equation; therefore the acceleration of the assembly of matrices sparsified by ACA, as well as the assembly of the system matrices for the Lamé equation, is being considered.

The presented ESPRESO library brings highly optimised TFETI and HTFETI implementations. ESPRESO-H is oriented to large computer systems with thousands and tens of thousands of compute nodes. ESPRESO-GPU and ESPRESO-MIC are developed to exploit the power of GPU and MIC accelerators.

We have also presented our PERMON toolbox, mainly its solver packages based on PETSc. They uniquely combine FETI DDM with QP algorithms. PermonFLLOP is used to generate FETI-specific objects for a contact problem of elasticity while FEM objects are provided by any FEM code for each
Numerical libraries solving large-scale problems 149

subdomain independently. PermonFLLOP wraps PermonQP Dostál, Z., Horák, D., 2007. Theoretically supported scalable FETI

and PermonIneq which solve the resulting QP problem. for numerical solution of variational inequalities. SIAM Journal

Results for contact problem of elastic cube, generated by on Numerical Analysis 45 (2), 500—513.

Dostál, Z., Horák, D., Kuˇcera, R., Vondrák, V. , Haslinger, J.,

PermonCube package, were shown.
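As a rough illustration of how the strong-scaling timings quoted above for the elastic cube (41.5 s on 2197 cores, 19.8 s on 4096 cores, 15.7 s on 8000 cores) translate into speedup and parallel efficiency, the following Python sketch takes the 2197-core run as the baseline. The function name `strong_scaling` and the choice of baseline are assumptions made for this example; they are not part of the PERMON packages.

```python
# Strong-scaling timings quoted in the text: (cores, wall-clock seconds).
timings = [(2197, 41.5), (4096, 19.8), (8000, 15.7)]


def strong_scaling(runs):
    """Return (cores, speedup, efficiency) relative to the first run.

    Hypothetical helper for illustration only: the first entry of `runs`
    is taken as the baseline, which is an assumption of this sketch.
    """
    base_cores, base_time = runs[0]
    rows = []
    for cores, seconds in runs:
        speedup = base_time / seconds   # measured speedup vs. the baseline
        ideal = cores / base_cores      # ideal speedup for this core count
        rows.append((cores, speedup, speedup / ideal))
    return rows


for cores, speedup, efficiency in strong_scaling(timings):
    print(f"{cores:5d} cores: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

Under these assumptions, the efficiency relative to the 2197-core run comes out at about 1.12 on 4096 cores and about 0.73 on 8000 cores.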

Conflict of interest

The authors declare that there is no conflict of interest.

Acknowledgements

This work was supported by the European Regional Development Fund in the IT4Innovations Centre of Excellence project (CZ.1.05/1.1.00/02.0070); by the project of major infrastructures for research, development and innovation of the Ministry of Education, Youth and Sports with reg. num. LM2011033; by the EXA2CT project funded from the EU's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 610741; by the internal student grant competition project SP2015/186 "PERMON toolbox development"; by the project POSTDOCI II reg. no. CZ.1.07/2.3.00/30.0055 within the Operational Programme Education for Competitiveness; and by the Grant Agency of the Czech Republic (GACR) project no. 15-18274S. We thank CSCS (www.cscs.ch) for the support in using the Piz Daint supercomputer.
