Solving Sparse Linear Systems Part 2: Trilinos Solver Overview

Michael A. Heroux Sandia National Laboratories

http://www.users.csbsju.edu/~mheroux/HartreeTutorialPart2.pdf

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000.

Outline

§ Sources of sparse linear systems.
§ Sparse solver ecosystem.
§ Details about sparse data structures.
§ Overview of Trilinos.
§ Overview of Trilinos linear solvers.

If You Know Some Linear Algebra

§ If this intro material is not new, visit the Trilinos Tutorial Site:
  w https://github.com/trilinos/Trilinos_tutorial/wiki/TrilinosHandsOnTutorial
  w Lots of examples.
  w A Web portal for compiling, linking and running Trilinos examples.
  w Pointers to Trilinos User Group videos: Coverage of many Trilinos topics.
§ If you know Linux well:
  w Start installing Trilinos from the instructions.
  w Install these packages:
    • Kokkos, Tpetra, Epetra, EpetraExt, AztecOO, ML, Belos, MueLu, Ifpack, Ifpack2, Zoltan, Zoltan2, Amesos
    • Rob Allan has already done this on Tutorial systems, so he can assist you.
§ If you are into Docker:
  w Go to https://hub.docker.com/r/sjdeal/webtrilinos/ for the image.
§ If you want to start on the Epetra and AztecOO examples:
  w Scroll down the tutorial page and start using and studying them.
§ Explore…

Setting up Linux Environment

§ Commands:
  w module avail # (look for Trilinos install)
  w module load intel # (get Intel compilers)
  w module load trilinos/12.2.1
  w env # (look for path to Trilinos)
§ Copy and paste the sample makefile from the Hands-on page:
  w http://trilinos.org/oldsite/Export_Makefile_example.txt
  w Get rid of references to libmyappLib
  w Define: MYAPP_TRILINOS_DIR=/gpfs/stfc/local/apps/intel/trilinos/12.2.1
§ Try to make.

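Sparse Linear Systems

Matrices

§ Matrix (defn): (not rigorous) An m-by-n, 2 dimensional array of numbers.
§ Examples:

\[ A = \begin{bmatrix} 1.0 & 2.0 & 1.5 \\ 2.0 & 3.0 & 2.5 \\ 1.5 & 2.5 & 5.0 \end{bmatrix} \]

\[ A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \]

Sparse Matrices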

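§ Sparse Matrix (defn): (not rigorous) An m-by-n matrix with enough zero entries that it makes sense to keep track of what is zero and nonzero.
§ Example:

\[ A = \begin{bmatrix}
a_{11} & a_{12} & 0 & 0 & 0 & 0 \\
a_{21} & a_{22} & a_{23} & 0 & 0 & 0 \\
0 & a_{32} & a_{33} & a_{34} & 0 & 0 \\
0 & 0 & a_{43} & a_{44} & a_{45} & 0 \\
0 & 0 & 0 & a_{54} & a_{55} & a_{56} \\
0 & 0 & 0 & 0 & a_{65} & a_{66}
\end{bmatrix} \]

Dense vs Sparse Costs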

§ What is the cost of storing the tridiagonal matrix with all entries?
§ What is the cost if we store each of the diagonals as a vector?
§ What is the cost of computing y = Ax for vectors x (known) and y (to be computed):
  w If we ignore sparsity?
  w If we take sparsity into account?
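One way to explore these questions is a small sketch like the following. It assumes a hypothetical `Tridiag` struct that stores the three diagonals as padded length-n vectors (so 3n doubles rather than the 3n - 2 true nonzeros); the struct and function names are illustrative only and do not come from any library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Tridiagonal matrix stored as three diagonals (hypothetical layout).
struct Tridiag {
  std::vector<double> lower;  // a(i, i-1); entry 0 is unused padding
  std::vector<double> diag;   // a(i, i)
  std::vector<double> upper;  // a(i, i+1); last entry is unused padding
};

// y = A*x using only the diagonals: about 3n multiplies instead of n*n.
std::vector<double> tridiagMatVec(const Tridiag& A, const std::vector<double>& x) {
  const std::size_t n = A.diag.size();
  assert(x.size() == n && A.lower.size() == n && A.upper.size() == n);
  std::vector<double> y(n, 0.0);
  for (std::size_t i = 0; i < n; ++i) {
    double sum = A.diag[i] * x[i];
    if (i > 0) sum += A.lower[i] * x[i - 1];
    if (i + 1 < n) sum += A.upper[i] * x[i + 1];
    y[i] = sum;
  }
  return y;
}

// Number of stored doubles under each scheme.
std::size_t denseStorage(std::size_t n) { return n * n; }
std::size_t diagonalStorage(std::size_t n) { return 3 * n; }
```

For the 6-by-6 example this gives 36 stored values for the dense layout against 18 for the padded diagonals, and the gap grows linearly versus quadratically with n.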

Origins of Sparse Matrices

§ In practice, most large matrices are sparse. Specific sources:
  w Differential equations.
    • Encompasses the vast majority of scientific and engineering simulation.
    • E.g., structural mechanics.
      – F = ma. Car crash simulation.
  w Stochastic processes.
    • Matrices describe probability distribution functions.
  w Networks.
    • Electrical and telecommunications networks.
    • Matrix element aij is nonzero if there is a wire connecting point i to point j.
  w 3D imagery for Google Earth
    • Relies on SuiteSparse (via the Ceres nonlinear least squares solver developed by Google).
  w And more…

Example: 1D Heat Equation (Laplace Equation)

§ The one-dimensional, steady-state heat equation on the interval [0,1] is as follows:

\[ u''(x) = 0, \quad 0 < x < 1, \qquad u(0) = a, \quad u(1) = b. \]

§ The solution u(x) to this equation describes the distribution of heat on a wire with temperature equal to a and b at the left and right endpoints, respectively, of the wire.

Finite Difference Approximation

§ The following formula provides an approximation of u''(x) in terms of u:

\[ u''(x) = \frac{u(x+h) - 2u(x) + u(x-h)}{h^2} + O(h^2) \]

§ For example, if we want to approximate u''(0.5) with h = 0.25:

\[ u''(0.5) \approx \frac{u(0.75) - 2u(0.5) + u(0.25)}{(1/4)^2} \]
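The O(h^2) behaviour of the central second difference can be checked numerically. The sketch below uses the denominator h^2 and picks u(x) = sin(x) purely as an illustrative test function (its exact second derivative is -sin(x)); the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Central second difference: (u(x+h) - 2u(x) + u(x-h)) / h^2.
template <typename F>
double secondDifference(F u, double x, double h) {
  assert(h > 0.0);
  return (u(x + h) - 2.0 * u(x) + u(x - h)) / (h * h);
}

// Absolute error of the approximation for u(x) = sin(x),
// whose exact second derivative is -sin(x).
double sineError(double x, double h) {
  const double approx =
      secondDifference([](double t) { return std::sin(t); }, x, h);
  return std::fabs(approx - (-std::sin(x)));
}
```

Halving h from 0.25 to 0.125 at x = 0.5 should cut the error by roughly a factor of four, which is what second-order accuracy predicts.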

1D Grid

Note that it is impossible to find u(x) for all values of x. Instead we: Create a “grid” with n points. Then find an approximation to u at these grid points. If we want a better approximation, we increase n.

Interval [0, 1] with n = 5 grid points:

x:      x0 = 0     x1 = 0.25     x2 = 0.5     x3 = 0.75     x4 = 1
u(x):   u0 = a     u1            u2           u3            u4 = b

Note: We know u0 and u4. We know a relationship between the ui via the finite difference equations. We need to find ui for i = 1, 2, 3.

What We Know

Left endpoint:

\[ \frac{1}{h^2}(u_0 - 2u_1 + u_2) = 0, \quad \text{or} \quad 2u_1 - u_2 = a. \]

Right endpoint:

\[ \frac{1}{h^2}(u_2 - 2u_3 + u_4) = 0, \quad \text{or} \quad 2u_3 - u_2 = b. \]

Middle points:

\[ \frac{1}{h^2}(u_{i-1} - 2u_i + u_{i+1}) = 0, \quad \text{or} \quad -u_{i-1} + 2u_i - u_{i+1} = 0 \quad \text{for } i = 2. \]

Write in Matrix Form

\[ \begin{bmatrix} 2 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 2 \end{bmatrix}
\begin{bmatrix} u_1 \\ u_2 \\ u_3 \end{bmatrix} =
\begin{bmatrix} a \\ 0 \\ b \end{bmatrix} \]

Notes:
1. This system was assembled from pieces of what we know.
2. This is a linear system with 3 equations and 3 unknowns.
3. We can easily solve.
4. Note that n = 5 generates this 3 equation system.
5. In general, for n grid points on [0, 1], we will have n-2 equations and unknowns.

General Form of 1D Finite Difference Matrix

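\[ \begin{bmatrix}
2 & -1 & 0 & 0 & 0 \\
-1 & 2 & -1 & 0 & 0 \\
0 & \ddots & \ddots & \ddots & 0 \\
\vdots & \ddots & \ddots & \ddots & -1 \\
0 & 0 & 0 & -1 & 2
\end{bmatrix}
\begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ \vdots \\ u_{n-1} \end{bmatrix} =
\begin{bmatrix} a \\ 0 \\ 0 \\ \vdots \\ b \end{bmatrix} \]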
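As a rough illustration of how such a system can be solved directly, the sketch below applies the Thomas algorithm (tridiagonal Gaussian elimination) to the matrix tridiag(-1, 2, -1) with right-hand side (a, 0, ..., 0, b). It assumes n counts grid points including both endpoints, so there are n - 2 unknowns; the function name `solveHeat1D` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Solve tridiag(-1, 2, -1) u = (a, 0, ..., 0, b)^T for the n - 2 interior
// values of an n-point grid on [0, 1], using the Thomas algorithm.
std::vector<double> solveHeat1D(std::size_t n, double a, double b) {
  assert(n >= 3);
  const std::size_t m = n - 2;  // number of interior unknowns
  std::vector<double> diag(m, 2.0), rhs(m, 0.0);
  rhs[0] += a;
  rhs[m - 1] += b;  // for m == 1 both boundary values enter the same equation
  // Forward elimination: every off-diagonal entry is -1.
  for (std::size_t i = 1; i < m; ++i) {
    const double w = -1.0 / diag[i - 1];  // multiplier for row i
    diag[i] -= w * (-1.0);
    rhs[i] -= w * rhs[i - 1];
  }
  // Back substitution: diag[i]*u[i] - u[i+1] = rhs[i].
  std::vector<double> u(m);
  u[m - 1] = rhs[m - 1] / diag[m - 1];
  for (std::size_t i = m - 1; i-- > 0;) {
    u[i] = (rhs[i] + u[i + 1]) / diag[i];
  }
  return u;
}
```

Because the exact solution of u'' = 0 is the straight line from a to b, calling `solveHeat1D(5, 1.0, 5.0)` should return values close to 2, 3 and 4 at x = 0.25, 0.5 and 0.75.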

A View of More Realistic Problems

§ The previous example is very simple.
§ But basic principles apply to more complex problems.
§ Finite difference approximations exist for any differential equation.
§ Finite volume is widely used.
§ Sandia primarily uses finite elements.
§ Leads to far more complex matrix patterns.
§ Fun example…

“Tapir” Matrix (John Gilbert)

Corresponding Mesh

Spot Quiz

Define sparse matrix.

Sparse Linear Systems: Problem Definition

§ A frequent requirement for scientific and engineering computing is to solve:
    Ax = b
  where A is a known large (sparse) matrix (a linear operator), b is a known vector, and x is an unknown vector.
§ NOTE: We are using x differently than before.
  w Previous x: Points in the interval [0, 1].
  w New x: Vector of u values.
§ Goal: Find x.
§ Question: How do we solve this problem?

WHY AND HOW TO USE SPARSE SOLVER LIBRARIES

§ A farmer had chickens and pigs. There was a total of 60 heads and 200 feet. How many chickens and how many pigs did the farmer have?
§ Let x be the number of chickens, y be the number of pigs.
§ Then:
    x + y = 60
    2x + 4y = 200
§ From the first equation x = 60 - y, so replace x in the second equation:
    2(60 - y) + 4y = 200
§ Solve for y:
    120 - 2y + 4y = 200
    2y = 80
    y = 40
§ Solve for x: x = 60 - 40 = 20.
§ The farmer has 20 chickens and 40 pigs.

Problem Statement
§ A restaurant owner purchased one box of frozen chicken and another box of frozen pork for $60. Later the owner purchased 2 boxes of chicken and 4 boxes of pork for $200. What is the cost of a box of frozen chicken and a box of frozen pork?

Variables
§ Let x be the price of a box of chicken, y the price of a box of pork.

Problem Setup
§ Then:
    x + y = 60
    2x + 4y = 200

Solution Method
§ From the first equation x = 60 - y, so replace x in the second equation:
    2(60 - y) + 4y = 200
§ Solve for y:
    120 - 2y + 4y = 200
    2y = 80
    y = 40
§ Solve for x: x = 60 - 40 = 20.

Translate Back
§ A box of chicken costs $20. A box of pork costs $40.
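The point that both word problems share one solution method can be made concrete with a single routine. The sketch below is a hypothetical helper (`solve2x2` is not from any library) that eliminates x from the second equation, mirroring the substitution steps, and assumes the leading coefficient is nonzero.

```cpp
#include <cassert>
#include <utility>

// Solve  a11*x + a12*y = b1,  a21*x + a22*y = b2  by eliminating x.
// Returns the pair (x, y).
std::pair<double, double> solve2x2(double a11, double a12, double b1,
                                   double a21, double a22, double b2) {
  assert(a11 != 0.0);                       // no row swapping in this sketch
  const double factor = a21 / a11;          // multiple of row 1 to subtract
  const double pivot = a22 - factor * a12;  // remaining coefficient of y
  assert(pivot != 0.0);                     // zero pivot means a singular system
  const double y = (b2 - factor * b1) / pivot;
  const double x = (b1 - a12 * y) / a11;
  return {x, y};
}
```

Calling `solve2x2(1, 1, 60, 2, 4, 200)` serves the farmer and the restaurant owner equally well; only the translation back to chickens or dollars differs.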

Why Sparse Solver Libraries?

§ Many types of problems.
§ Similar Mathematics.
§ Separation of concerns:
  w Problem Statement.
  w Translation to Math.
  w Set up problem.
  w Solve Problem.
  w Translate Back.

[Figure labels: App, SuperLU]

Importance of Sparse Solver Libraries

§ Computer solution of math problems is hard:
  w Floating point arithmetic not exact:
    • 1 + ε = 1, for small ε > 0.
    • (a + b) + c not always equal to a + (b + c).
  w High fidelity leads to large problems: 1M to 10B equations.
  w Clusters require coordinated solution across 100 – 1M processors.
§ Sophisticated solution algorithms and libraries leveraged:
  w Solver expertise highly specialized, expensive.
  w Write code once, use in many settings.
§ Trilinos is a large collection of state-of-the-art work:
  w The latest in scientific algorithms.
  w Modern software design and architecture.
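Both floating point effects are easy to reproduce in IEEE double precision. The helpers below are a hypothetical sketch; `volatile` is used only to discourage the compiler from keeping intermediate results in higher precision.

```cpp
#include <cassert>

// True when adding eps to 1.0 leaves the result unchanged in double precision.
bool absorbedByOne(double eps) {
  assert(eps > 0.0);
  volatile double sum = 1.0 + eps;
  return sum == 1.0;
}

// True when (a + b) + c and a + (b + c) round to different doubles.
bool associativityFails(double a, double b, double c) {
  volatile double left = (a + b) + c;
  volatile double right = a + (b + c);
  return left != right;
}
```

With ε = 1e-17 the sum 1 + ε is indistinguishable from 1, and with a = 1e20, b = -1e20, c = 1 the two groupings give 1 and 0 respectively, because c is lost when it is added to b first.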

Sparse Direct Methods

§ Construct L and U, lower and upper triangular, respectively, such that
    LU = A
§ Solve Ax = b:
  1. Ly = b
  2. Ux = y
§ Symmetric versions: LL^T = A, LDL^T = A

§ When are direct methods effective?
  w 1D: Always, even on many, many processors.
  w 2D: Almost always, except on many, many processors.
  w 2.5D: Most of the time.
  w 3D: Only for “small/medium” problems on “small/medium” processor counts.
§ Bottom line: Direct sparse solvers should always be in your toolbox.
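The factor-then-solve pattern can be sketched on a small dense matrix. This is a plain Doolittle LU without pivoting, chosen for brevity; it is not how the sparse packages work internally, since they add pivoting, fill-reducing orderings and sparse data structures. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// In-place Doolittle LU without pivoting: on return the strict lower triangle
// holds L (unit diagonal implied) and the upper triangle holds U.
void luFactor(Matrix& A) {
  const std::size_t n = A.size();
  for (std::size_t k = 0; k < n; ++k) {
    assert(A[k][k] != 0.0);  // this sketch does not pivot
    for (std::size_t i = k + 1; i < n; ++i) {
      A[i][k] /= A[k][k];
      for (std::size_t j = k + 1; j < n; ++j) A[i][j] -= A[i][k] * A[k][j];
    }
  }
}

// Solve A x = b from the packed factors: first L y = b, then U x = y.
std::vector<double> luSolve(const Matrix& LU, std::vector<double> b) {
  const std::size_t n = LU.size();
  for (std::size_t i = 0; i < n; ++i)  // forward substitution
    for (std::size_t j = 0; j < i; ++j) b[i] -= LU[i][j] * b[j];
  for (std::size_t i = n; i-- > 0;) {  // back substitution
    for (std::size_t j = i + 1; j < n; ++j) b[i] -= LU[i][j] * b[j];
    b[i] /= LU[i][i];
  }
  return b;
}
```

Separating `luFactor` from `luSolve` reflects a common usage pattern: the factorization is the expensive step, and once it is done the same factors can be reused for many right-hand sides.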

Sparse Direct Solver Packages

§ HSL: http://www.hsl.rl.ac.uk
§ MUMPS: http://mumps.enseeiht.fr
§ Pardiso: http://www.pardiso-project.org
§ PaStiX: http://pastix.gforge.inria.fr
§ SuiteSparse: http://www.cise.ufl.edu/research/sparse/SuiteSparse
§ SuperLU: http://crd-legacy.lbl.gov/~xiaoye/SuperLU/index.html
§ UMFPACK: http://www.cise.ufl.edu/research/sparse/umfpack/
§ WSMP: http://researcher.watson.ibm.com/researcher/view_project.php?id=1426
§ Trilinos/Amesos/Amesos2: http://trilinos.org
§ Notes:
  w All have threaded parallelism.
  w All but SuiteSparse and UMFPACK have distributed memory (MPI) parallelism.
  w MUMPS, PaStiX, SuiteSparse, SuperLU, Trilinos, UMFPACK are freely available.
  w HSL, Pardiso, WSMP are available freely, with restrictions.
  w Some research efforts on GPUs, unaware of any products.
§ Emerging hybrid packages:
  w PDSLin – Sherry Li.
  w HIPS – Gaidamour, Henon.
  w Trilinos/ShyLU – Rajamanickam, Boman, Heroux.

Other Sparse Direct Solver Packages

§ “Legacy” packages that are open source but not under active development today.
  w TAUCS: http://www.tau.ac.il/~stoledo/taucs/
  w PSPASES: http://www-users.cs.umn.edu/~mjoshi/pspases/
  w BCSLib: http://www.boeing.com/phantom/bcslib/

§ Eigen: http://eigen.tuxfamily.org
  w Newer, active, but sequential only (for sparse solvers).
  w Sparse Cholesky (including LDL^T), Sparse LU, Sparse QR.
  w Wrappers to quite a few third-party sparse direct solvers.

Emerging Trend in Sparse Direct

§ New work in low-rank approximations to off-diagonal blocks.
§ Typically:
  w Off-diagonal blocks in the factorization stored as dense matrices.
§ New:
  w These blocks have low rank (up to the accuracy needed for solution).
  w Can be represented by approximate SVD.
§ Still uncertain how broad the impact will be.
  w Will rank-k SVD continue to have low rank for hard problems?
§ Potential: Could be a breakthrough for extending sparse direct methods to much larger 3D problems.

Iterative Methods

§ Given an initial guess for x, called x(0), (x(0) = 0 is acceptable) compute a sequence x(k), k = 1,2, … such that each x(k) is “closer” to x.
§ Definition of “close”:
  w Suppose x(k) = x exactly for some value of k.
  w Then r(k) = b – Ax(k) = 0 (the vector of all zeros).
  w And norm(r(k)) = sqrt(<r(k), r(k)>) = 0 (a number).
  w For any x(k), let r(k) = b – Ax(k).
  w If norm(r(k)) = sqrt(<r(k), r(k)>) is small (< 1.0E-6 say) then we say that x(k) is close to x.
  w The vector r is called the residual vector.

Sparse Iterative Solver Packages

§ PETSc: http://www.mcs.anl.gov/petsc
§ hypre: https://computation.llnl.gov/casc/linear_solvers/sls_hypre.html
§ Trilinos: http://trilinos.sandia.gov
§ Paralution: http://www.paralution.com (Manycore; GPL/Commercial license)
§ HSL: http://www.hsl.rl.ac.uk (Academic/Commercial License)
§ Eigen: http://eigen.tuxfamily.org (Sequential CG, BiCGSTAB, ILUT/Sparskit)
§ Sparskit: http://www-users.cs.umn.edu/~saad/software
§ Notes:
  w There are many other efforts, but I am unaware of any that have a broad user base like hypre, PETSc and Trilinos.
  w Sparskit, and other software by Yousef Saad, is not a product with a large official user base, but these codes appear as embedded (serial) source code in many applications.
  w PETSc and Trilinos support threading, distributed memory (MPI) and growing functionality for accelerators.
  w Many of the direct solver packages support some kind of iteration, if only iterative refinement.

Which Type of Solver to Use?

Dimension       Type        Notes
1D              Direct      Often tridiagonal (Thomas alg, periodic version).
2D very easy    Iterative   If you have a good initial guess, e.g., transient simulation.
2D otherwise    Direct      Almost always better than iterative.
2.5D            Direct      Example: shell problems. Good ordering can keep fill low.
3D “smooth”     Direct?     Emerging methods for low-rank SVD representation.
3D easy         Iterative   Simple preconditioners: diagonal scaling. CG or BiCGSTAB.
3D harder       Iterative   Prec: IC, ILU (with domain decomposition if in parallel).
3D hard         Iterative   Use GMRES (without restart if possible).
3D + large      Iterative   Add multigrid, geometric or algebraic.
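To make the "diagonal scaling plus CG" row concrete, here is a minimal sketch of conjugate gradients with a Jacobi (diagonal) preconditioner, stopping when the residual norm sqrt(<r, r>) drops below a tolerance. It is hard-wired to the 1D model matrix tridiag(-1, 2, -1) as an assumption for brevity, and all function names are hypothetical rather than taken from Trilinos or any other package.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// y = A x for the 1D model matrix tridiag(-1, 2, -1).
Vec applyA(const Vec& x) {
  const std::size_t n = x.size();
  Vec y(n);
  for (std::size_t i = 0; i < n; ++i) {
    y[i] = 2.0 * x[i];
    if (i > 0) y[i] -= x[i - 1];
    if (i + 1 < n) y[i] -= x[i + 1];
  }
  return y;
}

double dot(const Vec& a, const Vec& b) {
  assert(a.size() == b.size());
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

// Preconditioned CG starting from x(0) = 0. Stops when norm(r) < tol or
// after maxIters iterations; the iteration count is returned in 'iters'.
Vec pcg(const Vec& b, double tol, int maxIters, int& iters) {
  const std::size_t n = b.size();
  Vec x(n, 0.0), r = b, z(n), p(n);
  for (std::size_t i = 0; i < n; ++i) z[i] = r[i] / 2.0;  // z = M^{-1} r, diag(A) = 2
  p = z;
  double rz = dot(r, z);
  for (iters = 0; iters < maxIters && std::sqrt(dot(r, r)) >= tol; ++iters) {
    const Vec Ap = applyA(p);
    const double alpha = rz / dot(p, Ap);
    for (std::size_t i = 0; i < n; ++i) {
      x[i] += alpha * p[i];   // update the iterate
      r[i] -= alpha * Ap[i];  // update the residual r = b - A x
      z[i] = r[i] / 2.0;      // apply the diagonal preconditioner
    }
    const double rzNew = dot(r, z);
    const double beta = rzNew / rz;
    for (std::size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
    rz = rzNew;
  }
  return x;
}
```

For this particular matrix the diagonal is constant, so the preconditioner is only a uniform scaling and does not change the iteration count; it is included to show where a preconditioner enters the loop.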

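Details about Sparse Matrices

General Sparse Matrix

• Example:

\[ A = \begin{bmatrix}
a_{11} & 0 & 0 & 0 & 0 & a_{16} \\
0 & a_{22} & a_{23} & 0 & 0 & 0 \\
0 & a_{32} & a_{33} & 0 & a_{35} & 0 \\
0 & 0 & 0 & a_{44} & 0 & 0 \\
0 & 0 & a_{53} & 0 & a_{55} & a_{56} \\
a_{61} & 0 & 0 & 0 & a_{65} & a_{66}
\end{bmatrix} \]

Compressed Row Storage (CRS) Format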

Idea:
• Create:
  1 length nnz array of nonzero values.
  1 length nnz array of column indices.
  1 length m+1 array of ints:

    double * values = new double [nnz];
    int * colIndices = new int[nnz];
    int * rowPointers = new int[m+1];

nnz – Number of nonzero terms in the matrix.
m – Matrix dimension.

Compressed Row Storage (CRS) Format

Fill arrays as follows: rowPointer[0] = 0; double * curValuePtr = values; int * curIndicesPtr = colIndices; for (i=0; i
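One way the three arrays might be filled is sketched below, under the assumption that the matrix is first available as a dense row-major array; the pointer-based loop on the slide may well differ. The `CrsMatrix` struct and `denseToCrs` function are hypothetical conveniences.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The three CRS arrays bundled together (hypothetical convenience struct).
struct CrsMatrix {
  std::vector<double> values;    // nonzero values, row by row
  std::vector<int> colIndices;   // column index of each stored value
  std::vector<int> rowPointers;  // length m + 1; row i occupies
                                 // [rowPointers[i], rowPointers[i+1])
};

// Build CRS storage from a dense matrix by scanning each row and keeping
// only the nonzero entries.
CrsMatrix denseToCrs(const std::vector<std::vector<double>>& dense) {
  CrsMatrix A;
  A.rowPointers.push_back(0);
  for (const auto& row : dense) {
    for (std::size_t j = 0; j < row.size(); ++j) {
      if (row[j] != 0.0) {
        A.values.push_back(row[j]);
        A.colIndices.push_back(static_cast<int>(j));
      }
    }
    // The end of this row is the start of the next one.
    A.rowPointers.push_back(static_cast<int>(A.values.size()));
  }
  assert(A.rowPointers.size() == dense.size() + 1);
  return A;
}
```

An empty row simply repeats the previous entry in `rowPointers`, which is how CRS represents rows with no nonzeros.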

\[ A = \begin{bmatrix} 4 & 0 & 0 & 1 \\ 0 & 3 & 0 & 2 \\ 0 & 0 & 6 & 0 \\ 5 & 0 & 9 & 8 \end{bmatrix} \]

values = {4, 1, 3, 2, 6, 5, 9, 8}
colIndices = {0, 3, 1, 3, 2, 0, 2, 3}
rowPointers = {0, 2, 4, 5, 8}

CRS Example

\[ A = \begin{bmatrix} 4 & 2 & 0 & 0 \\ 0 & 3 & 0 & 2 \\ 0 & 0 & 0 & 0 \\ 5 & 7 & 9 & 8 \end{bmatrix} \]

values = {4, 2, 3, 2, 5, 7, 9, 8}
colIndices = {0, 1, 1, 3, 0, 1, 2, 3}
rowPointers = {0, 2, 4, 4, 8}

Serial Sparse MV

    // Computes y = A*x for an m-row matrix A stored in CRS format.
    int sparsemv(int m, double * values, int * colIndices,
                 int * rowPointers, double * x, double * y) {
      for (int i = 0; i < m; ++i) {
        double sum = 0.0;
        int curNumEntries = rowPointers[i+1] - rowPointers[i];
        double * curVals = &values[rowPointers[i]];   // values of row i
        int * curInds = &colIndices[rowPointers[i]];  // column indices of row i

        for (int j = 0; j < curNumEntries; j++)
          sum += curVals[j] * x[curInds[j]];
        y[i] = sum;
      }
      return(0);
    }

Trilinos Overview

What is Trilinos?

§ Object-oriented software framework for…
§ Solving big complex science & engineering problems
§ More like LEGO™ bricks than Matlab™

Background/Motivation

Optimal Kernels to Optimal Solutions:
  w Geometry, Meshing
  w Discretizations, Load Balancing.
  w Scalable Linear, Nonlinear, Eigen, Transient, Optimization, UQ solvers.
  w Scalable I/O, GPU, Manycore

Laptops to Leadership systems
  w 60 Packages.
  w Other distributions:
    w Cray LIBSCI
  w Public repo.
  w Thousands of Users.
  w Worldwide distribution.

Trilinos Strategic Goals

Algorithmic Goals
§ Scalable Computations: As problem size and processor counts increase, the cost of the computation will remain nearly fixed.
§ Hardened Computations: Never fail unless problem essentially intractable, in which case we diagnose and inform the user why the problem fails and provide a reliable measure of error.
§ Full Vertical Coverage: Provide leading edge enabling technologies through the entire technical application software stack: from problem construction, solution, analysis and optimization.

Software Goals
§ Grand Universal Interoperability: All Trilinos packages, and important external packages, will be interoperable, so that any combination of packages and external software (e.g., PETSc, Hypre) that makes sense algorithmically will be possible within Trilinos.
§ Universal Accessibility: All Trilinos capabilities will be available to users of major computing environments: C++, Python and the Web, and from the desktop to the latest scalable systems.
§ Universal Solver RAS: Trilinos will be:
  w Reliable: Leading edge hardened, scalable solutions for each of these applications
  w Available: Integrated into every major application at Sandia
  w Serviceable: “Self-sustaining”.

Capability Leaders: Layer of Proactive Leadership

§ Areas:
  w Framework, Tools & Interfaces (J. Willenbring).
  w Software Engineering Technologies and Integration (R. Bartlett).
  w Discretizations (P. Bochev).
  w Geometry, Meshing & Load Balancing (K. Devine).
  w Parallel Programming Models (H.C. Edwards).
  w Linear Algebra Services (M. Hoemmen).
  w Linear & Eigen Solvers (J. Hu).
  w Nonlinear, Transient & Optimization Solvers (A. Salinger).
  w Scalable I/O (R. Oldfield).
  w User Experience (W. Spotz).
§ Each leader provides strategic direction across all Trilinos packages within area.

Unique features of Trilinos

§ Huge library of algorithms
  w Linear and nonlinear solvers, preconditioners, …
  w Optimization, transients, sensitivities, uncertainty, …
§ Growing support for multicore & hybrid CPU/GPU
  w Built into the new Tpetra linear algebra objects
    • Therefore into iterative solvers with zero effort!
  w Unified intranode programming model
  w Spreading into the whole stack:
    • Multigrid, sparse factorizations, element assembly…
§ Support for mixed and arbitrary precisions
  w Don’t have to rebuild Trilinos to use it
§ Support for huge (> 2B unknowns) problems

Trilinos Access

§ Trilinos 12.6 current.
  w 57 packages.
§ Website: https://trilinos.org.
§ GitHub Fork:
  w https://github.com/trilinos/Trilinos

Trilinos software organization

Trilinos Package Summary

                 Objective                        Package(s)
Discretizations  Meshing & Discretizations        STK, Intrepid, Pamgen, Sundance, ITAPS, Mesquite
                 Time Integration                 Rythmos
Methods          Automatic Differentiation        Sacado
                 Mortar Methods                   Moertel
Services         Linear algebra objects           Epetra, Tpetra, Kokkos, Xpetra
                 Interfaces                       Thyra, Stratimikos, RTOp, FEI, Shards
                 Load Balancing                   Zoltan, Isorropia, Zoltan2
                 “Skins”                          PyTrilinos, WebTrilinos, ForTrilinos, Ctrilinos, Optika
                 C++ utilities, I/O, thread API   Teuchos, EpetraExt, Kokkos, Triutils, ThreadPool, Phalanx, Trios
Solvers          Iterative linear solvers         AztecOO, Belos, Komplex
                 Direct sparse linear solvers     Amesos, Amesos2, ShyLU
                 Direct dense linear solvers      Epetra, Teuchos, Pliris
                 Iterative eigenvalue solvers     Anasazi, Rbgen
                 ILU-type preconditioners         AztecOO, IFPACK, Ifpack2, ShyLU
                 Multilevel preconditioners       ML, CLAPS, MueLu
                 Block preconditioners            Meros, Teko
                 Nonlinear system solvers         NOX, LOCA, Piro
                 Optimization (SAND)              MOOCHO, Aristos, TriKota, Globipack, Optipack
                 Stochastic PDEs                  Stokhos

Interoperability (“Can Use”) vs. Dependence (“Depends On”)

§ Although most Trilinos packages have no explicit dependence, often packages must interact with some other packages:
  w NOX needs operator, vector and linear solver objects.
  w AztecOO needs preconditioner, matrix, operator and vector objects.
  w Interoperability is enabled at configure time.
  w Trilinos cmake system is vehicle for:
    • Establishing interoperability of Trilinos components…
    • Without compromising individual package autonomy.
    • Trilinos_ENABLE_ALL_OPTIONAL_PACKAGES option
§ Architecture supports simultaneous development on many fronts.

Trilinos is made of packages

§ Not a monolithic piece of software
  w Like LEGO™ bricks, not Matlab™
§ Each package:
  w Has its own development team and management
  w Makes its own decisions about algorithms, coding style, etc.
  w May or may not depend on other Trilinos packages
§ Trilinos is not “indivisible”
  w You don’t need all of Trilinos to get things done
  w Any subset of packages can be combined and distributed
  w Current public release contains ~50 of the 55+ Trilinos packages
§ Trilinos top layer framework
  w Not a large amount of source code: ~1.5%
  w Manages package dependencies
    • Like a GNU/Linux package manager
  w Runs packages’ tests nightly, and on every check-in
§ Package model supports multifrontal development
§ New effort to create apps by gluing Trilinos together: Albany

Software Development and Delivery

“Are C++ templates safe? No, but they are good.”

Compile-time Polymorphism: Templates and Sanity upon a shifting foundation

Software delivery:
• Essential Activity

How can we:
• Implement mixed precision algorithms?
• Implement generic fine-grain parallelism?
• Support hybrid CPU/GPU computations?
• Support extended precision?
• Explore redundant computations?
• Prepare for both exascale “swim lanes”?

C++ templates only sane way, for now.

Template Benefits:
– Compile time polymorphism.
– True generic programming.
– No runtime performance hit.
– Strong typing for mixed precision.
– Support for extended precision.
– Many more…

Template Drawbacks:
– Huge compile-time performance hit:
  • But good use of multicore :)
  • Eliminated for common data types.
– Complex notation:
  • Esp. for Fortran & C programmers.
  • Can insulate to some extent.

Solver Software Stack

[Figure: layered solver stack. Legend: Phase I packages (SPMD, int/double); Phase II packages (templated); Phase III packages (manycore*, templated).]

§ Optimization (unconstrained, constrained): MOOCHO
§ Bifurcation Analysis: LOCA, T-LOCA
§ Transient Problems (DAEs/ODEs): Rythmos
§ Nonlinear Problems: NOX, T-NOX
§ Linear Problems (linear equations, eigen problems): AztecOO, Belos*, Ifpack, Ifpack2*, ML, MueLu*, etc.; Anasazi
§ Sensitivities (automatic differentiation): Sacado
§ Distributed Linear Algebra (matrix/graph equations, vector problems): Epetra, Tpetra*, Kokkos*, Teuchos

Using Trilinos Linear Solvers

Sandia is a multiprogram laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U. S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Trilinos Package Summary

                 Objective                        Package(s)
Discretizations  Meshing & Discretizations        STKMesh, Intrepid, Pamgen, Sundance, Mesquite
                 Time Integration                 Rythmos
Methods          Automatic Differentiation        Sacado
                 Mortar Methods                   Moertel
Services         Linear algebra objects           Epetra, Tpetra
                 Interfaces                       Xpetra, Thyra, Stratimikos, RTOp, FEI, Shards
                 Load Balancing                   Zoltan, Isorropia, Zoltan2
                 “Skins”                          PyTrilinos, WebTrilinos, ForTrilinos, Ctrilinos, Optika
                 Utilities, I/O, thread API       Teuchos, EpetraExt, Kokkos, Triutils, ThreadPool, Phalanx
Solvers          Iterative linear solvers         AztecOO, Belos, Komplex
                 Direct sparse linear solvers     Amesos, Amesos2, ShyLU
                 Incomplete factorizations        AztecOO, IFPACK, Ifpack2
                 Multilevel preconditioners       ML, CLAPS, MueLu
                 Direct dense linear solvers      Epetra, Teuchos, Pliris
                 Iterative eigenvalue solvers     Anasazi
                 Block preconditioners            Meros, Teko
                 Nonlinear solvers                NOX, LOCA
                 Optimization                     MOOCHO, Aristos, TriKota, Globipack, Optipack
                 Stochastic PDEs                  Stokhos

ShyLU

• Subdomain solvers or smoothers have to adapt to hierarchical architectures.
• One MPI process per core cannot exploit intra-node parallelism.
• One subdomain per MPI process is hard to scale (due to the increase in the number of iterations).

[Figure: Hypergraph/graph based ordering of the matrix for ShyLU]

§ ShyLU (Scalable Hybrid LU) is hybrid
  • In the mathematical sense (direct + iterative) for robustness.
  • In the parallel programming sense (MPI + Threads) for scalability.
§ More robust than simple preconditioners and more scalable than direct solvers.
§ ShyLU is a subdomain solver where a subdomain is not limited to one MPI process.
§ Will be part of Trilinos. In precopyright Trilinos for Sandia users.
§ Results: Over 19x improvement in the simulation time for large Xyce circuits.

Amesos2

§ Direct Solver interface for the Tpetra Stack.
§ Typical Usage:
  w preOrder(),
  w symbolicFactorization(),
  w numericFactorization(),
  w solve().
§ Easy to support new solvers (Current support for all the SuperLU variants).
§ Easy to support new multivectors and sparse matrices.
§ Can support third party solver specific parameters with little changes.
§ Available in the current release of Trilinos.

AztecOO

§ Iterative linear solvers: CG, GMRES, BiCGSTAB,…
§ Incomplete factorization preconditioners
§ Aztec was Sandia’s workhorse solver:
  w Extracted from the MPSalsa reacting flow code
  w Installed in dozens of Sandia apps
  w 1900+ external licenses
§ AztecOO improves on Aztec by:
  w Using Epetra objects for defining matrix and vectors
  w Providing more preconditioners & scalings
  w Using C++ class design to enable more sophisticated use
§ AztecOO interface allows:
  w Continued use of Aztec for functionality
  w Introduction of new solver capabilities outside of Aztec

Developers: Mike Heroux, Alan Williams, Ray Tuminaro

Belos

§ Next-generation linear iterative solvers

§ Decouples algorithms from linear algebra objects
  w Linear algebra library has full control over data layout and kernels
  w Improvement over AztecOO, which controlled vector & matrix layout
  w Essential for hybrid (MPI+X) parallelism
§ Solves problems that apps really want to solve, faster:
  w Multiple right-hand sides: AX=B
  w Sequences of related systems: (A + ΔAk) Xk = B + ΔBk
§ Many advanced methods for these types of systems
  w Block & pseudoblock solvers: GMRES & CG
  w Recycling solvers: GCRODR (GMRES) & CG
  w “Seed” solvers (hybrid GMRES)
  w Block orthogonalizations (TSQR)
§ Supports arbitrary & mixed precision, complex, …
§ If you have a choice, pick Belos over AztecOO

Developers: Heidi Thornquist, Mike Heroux, Chris Baker, Mark Hoemmen

Ifpack(2): Algebraic preconditioners

§ Preconditioners:
  w Overlapping domain decomposition
  w Incomplete factorizations (within an MPI process)
  w (Block) relaxations & Chebyshev
§ Accepts user matrix via abstract matrix interface
§ Uses {E,T}petra for basic matrix / vector calculations
§ Perturbation stabilizations & condition estimation
§ Can be used by all other Trilinos solver packages
§ Ifpack2: Tpetra version of Ifpack
  w Supports arbitrary precision & complex arithmetic
  w Path forward to hybrid-parallel factorizations

Developers: Mike Heroux, Mark Hoemmen, Siva Rajamanickam, Marzio Sala, Alan Williams, etc.

ML: Multi-level Preconditioners

§ Smoothed aggregation, multigrid, & domain decomposition
§ Critical technology for scalable performance of many apps
§ ML compatible with other Trilinos packages:
  w Accepts Epetra sparse matrices & dense vectors
  w ML preconditioners can be used by AztecOO, Belos, & Anasazi
§ Can also be used independent of other Trilinos packages
§ Next-generation version of ML: MueLu
  w Works with Epetra or Tpetra objects (via Xpetra interface)

Developers: Ray Tuminaro, Jeremie Gaidamour, Jonathan Hu, Marzio Sala, Chris Siefert

MueLu: Next-gen algebraic multigrid

§ Motivation for replacing ML
  w Improve maintainability & ease development of new algorithms
  w Decouple computational kernels from algorithms
    • ML mostly monolithic (& 50K lines of code)
    • MueLu relies more on other Trilinos packages
  w Exploit Tpetra features
    • MPI+X (Kokkos programming model mitigates risk)
    • 64-bit global indices (to solve problems with >2B unknowns)
    • Arbitrary Scalar types (Tramonto runs MueLu w/ double-double)
§ Works with Epetra or Tpetra (via Xpetra common interface)
§ Facilitate algorithm development
  w Energy minimization methods
  w Geometric or classic algebraic multigrid; mix methods together
§ Better support for preconditioner reuse
  w Explore options between “blow it away” & reuse without change

Developers: Andrey Prokopenko, Jonathan Hu, Chris Siefert, Ray Tuminaro, Tobias Wiesner

Recap

§ Solution of linear systems is ubiquitous in HPC:
  w Implicit problem formulations.
  w Consume huge fraction of computational time.
§ Knowing about consistency, non-singularity is important:
  w Problem setup errors manifest as consistency, singularity issues.
§ Effective use of solvers means knowing spectrum of tools:
  w LAPACK, Direct Sparse, Iterative methods.
  w Problem solving environments, esp. Matlab.
§ All sparse solver libraries follow a basic pattern:
  w Setup structure, values: Complicated interaction with app.
  w Select solver parameters: Easy to code. Challenging to get right.
  w Solve system.
  w Solve similar problem, same structure.
§ Trilinos provides a large collection of sparse solvers.

Exercise: Find the Trilinos packages that should work best for each class of problem

Dimension       Type        Notes
1D              Direct      Often tridiagonal (Thomas alg, periodic version).
2D very easy    Iterative   If you have a good initial guess, e.g., transient simulation.
2D otherwise    Direct      Almost always better than iterative.
2.5D            Direct      Example: shell problems. Good ordering can keep fill low.
3D “smooth”     Direct?     Emerging methods for low-rank SVD representation.
3D easy         Iterative   Simple preconditioners: diagonal scaling. CG or BiCGSTAB.
3D harder       Iterative   Prec: IC, ILU (with domain decomposition if in parallel).
3D hard         Iterative   Use GMRES (without restart if possible).
3D + large      Iterative   Add multigrid, geometric or algebraic.

Exercises

§ Visit the Trilinos Tutorial Site:
  w https://github.com/trilinos/Trilinos_tutorial/wiki/TrilinosHandsOnTutorial
§ If you haven’t installed Trilinos, use the web portal:
  w http://trilinos.csbsju.edu/WebTrilinosMPI-shared-12.0/c++/index.html
  w Install these packages:
    • Kokkos, Tpetra, Epetra, EpetraExt, AztecOO, ML, Belos, MueLu, Ifpack, Ifpack2, Zoltan, Zoltan2, Amesos
§ Warning: Please take care when using the web portal.
  w Lots of users.
  w If you set a large problem size, bad things might happen.

Exercises

§ From the web Trilinos drop down menu:
  w Select the Epetra_Comm example, run it.
    • Vary the number of MPI ranks, pthread threads.
  w Select the Epetra_Map example.
    • How many elements are assigned to each MPI process?
  w Select Epetra_Vector example.
    • Change the input values.
  w Select the AztecOO example.
    • Find the AztecOO User Guide (google search).
    • Change the parameter values for:
      – Solver.SetAztecOption(AZ_subdomain_solve, AZ_ilut);
      – Solver.SetAztecOption(AZ_graph_fill, 3);
      – Solver.SetAztecOption(AZ_overlap, 0);