
Practical and Efficient Time Integration and Kronecker Solvers

Jed Brown [email protected] (CU Boulder and ANL) Collaborators: Debojyoti Ghosh (LLNL), Matt Normile (CU), Martin Schreiber (Exeter)

SIAM Central, 2017-09-30
This talk: https://jedbrown.org/files/20170930-FastKronecker.pdf

Theoretical Peak Floating Point Operations per Byte, Double Precision

[Figure: theoretical peak flops/byte (double precision) by end of year, 2008-2016, for Intel Xeon CPUs, NVIDIA Tesla GPUs, AMD Radeon GPUs, and Intel Xeon Phi; c/o Karl Rupp]

2017 HPGMG performance spectra

Motivation

- Hardware trends
  - Memory bandwidth is a precious commodity (8+ flops/byte)
  - Vectorization is necessary for floating point performance
  - Conflicting demands of cache reuse and vectorization
  - Can deliver bandwidth, but latency is hard
- Assembled sparse matrices are doomed!
  - Limited by memory bandwidth (1 flop / 6 bytes)
  - No vectorization without blocking; return of ELLPACK
- Spatial-domain vectorization is intrusive
  - Must be unassembled to avoid the bandwidth bottleneck
  - Whether it is "hard" depends on the discretization
  - Geometry, boundary conditions, and adaptivity

Sparse linear algebra is dead (long live sparse ...)

- Arithmetic intensity < 1/4
- Idea: multiple right-hand sides (see the sketch after this list)

  (2k flops)(bandwidth) / (sizeof(Scalar) + sizeof(Int)),   k ≪ avg. nonzeros/row

- Problem: popular algorithms have nested data dependencies
  - Time step → Nonlinear solve → Krylov solve → Preconditioner / sparse matrix kernel
- Cannot parallelize/vectorize these nested loops
- Can we create new algorithms to reorder/fuse loops?
  - Reduce latency-sensitivity for communication
  - Reduce memory bandwidth (reuse the matrix while it is in cache)
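To make the multiple right-hand-sides bullet concrete, here is a back-of-envelope sketch of the arithmetic intensity of a sparse matrix applied to k right-hand sides. The function name and the assumed sizes (8-byte scalars, 4-byte column indices, k vectors resident in cache) are my own illustrative assumptions, not from the talk.

```python
# Back-of-envelope arithmetic intensity for a sparse matrix applied to k right-hand
# sides, assuming CSR storage (8-byte scalars, 4-byte column indices) and that the
# k vectors stay resident in cache, so the matrix is the only streamed traffic.
def spmv_intensity(k, sizeof_scalar=8, sizeof_int=4):
    flops_per_nonzero = 2 * k                      # one multiply + one add per RHS
    bytes_per_nonzero = sizeof_scalar + sizeof_int
    return flops_per_nonzero / bytes_per_nonzero   # flops/byte; times bandwidth gives a flop rate

for k in (1, 2, 4, 8):
    print(f"k={k}: {spmv_intensity(k):.2f} flops/byte")
```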

Attempt: s-step methods in 3D

[Figure: modeled work, memory, and communication vs. dofs/process (10^2 to 10^5) for p = 1 and p = 2 discretizations with s-step values s = 1, 2, 3, 5]

- Limited choice of preconditioners (none optimal, surface/volume)
- Amortizing message latency is most important for strong scaling
- s-step methods have high overhead for small subdomains

Attempt: distribute in time (multilevel SDC/Parareal)

- PFASST algorithm (Emmett and Minion, 2012)
- Zero-latency messages (cf. performance model of s-step)
- Spectral Deferred Correction: iterative, converges to IRK (Gauss, Radau, ...)
- Stiff problems use an implicit basic integrator (synchronizing on the spatial communicator)

Problems with SDC and time-parallel

[Figure: speedup and relative efficiency vs. number of cores (256 to 448K) for ideal scaling, PFASST (theory), PMG+SDC (fine), PMG+SDC (coarse), and PMG+PFASST; c/o Matthew Emmett, parallel compared to sequential SDC]

- Iteration count is not uniform in s; efficiency starts low
- Low arithmetic intensity; tight error tolerance (cf. Crank-Nicolson)
- Parabolic space-time also works, but the comparison is flawed

Runge-Kutta methods

\dot{u} = F(u)

\underbrace{\begin{pmatrix} Y_1 \\ \vdots \\ Y_s \end{pmatrix}}_{Y}
= u^n + h
\underbrace{\begin{pmatrix} a_{11} & \cdots & a_{1s} \\ \vdots & \ddots & \vdots \\ a_{s1} & \cdots & a_{ss} \end{pmatrix}}_{A}
F\begin{pmatrix} Y_1 \\ \vdots \\ Y_s \end{pmatrix}

u^{n+1} = u^n + h b^T F(Y)

- General framework for one-step methods
- Diagonally implicit: A lower triangular; stage order 1 (or 2 with an explicit first stage)
- Singly diagonally implicit: all a_ii equal, reuse solver setup; stage order 1
- If A is a general full matrix, all stages are coupled: "implicit RK"

Implicit Runge-Kutta

\begin{array}{c|ccc}
\tfrac{1}{2} - \tfrac{\sqrt{15}}{10} & \tfrac{5}{36} & \tfrac{2}{9} - \tfrac{\sqrt{15}}{15} & \tfrac{5}{36} - \tfrac{\sqrt{15}}{30} \\
\tfrac{1}{2} & \tfrac{5}{36} + \tfrac{\sqrt{15}}{24} & \tfrac{2}{9} & \tfrac{5}{36} - \tfrac{\sqrt{15}}{24} \\
\tfrac{1}{2} + \tfrac{\sqrt{15}}{10} & \tfrac{5}{36} + \tfrac{\sqrt{15}}{30} & \tfrac{2}{9} + \tfrac{\sqrt{15}}{15} & \tfrac{5}{36} \\ \hline
 & \tfrac{5}{18} & \tfrac{4}{9} & \tfrac{5}{18}
\end{array}

- Excellent accuracy and stability properties
- Gauss methods with s stages (construction sketched below)
  - order 2s, (s,s) Padé approximation to the exponential
  - A-stable, symplectic
- Radau (IIA) methods with s stages
  - order 2s − 1, A-stable, L-stable
- Lobatto (IIIC) methods with s stages
  - order 2s − 2, A-stable, L-stable, self-adjoint
- Stage order s or s + 1
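As a companion to the tableau above, a minimal sketch of generating a Gauss tableau from the collocation conditions at Gauss-Legendre nodes; gauss_tableau is a hypothetical helper name and this is a standard construction, not code from the talk.

```python
import numpy as np

def gauss_tableau(s):
    """Butcher table (A, b, c) of the s-stage Gauss collocation method."""
    # Nodes: Gauss-Legendre points shifted from (-1, 1) to (0, 1)
    c = (np.polynomial.legendre.leggauss(s)[0] + 1) / 2
    # Collocation conditions: sum_j a_ij c_j^(k-1) = c_i^k / k and
    #                         sum_j b_j  c_j^(k-1) = 1 / k,      k = 1..s
    V = np.vander(c, increasing=True)        # V[j, k] = c_j^k
    k = np.arange(1, s + 1)
    A = (c[:, None] ** k / k) @ np.linalg.inv(V)
    b = (1.0 / k) @ np.linalg.inv(V)
    return A, b, c

# The s = 3 result matches the tableau above (up to rounding)
A, b, c = gauss_tableau(3)
assert np.allclose(A.sum(axis=1), c) and np.isclose(b.sum(), 1.0)
```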

Method of Butcher (1976) and Bickart (1977)

- Newton linearize the Runge-Kutta system at u^*:

  Y = u^n + h A F(Y),    [I_s ⊗ I_n + hA ⊗ J(u^*)] δY = RHS

- Solve the linear system with the product operator

  Ĝ = S ⊗ I_n + I_s ⊗ J

  where S = (hA)^{-1} is s × s dense and J = −∂F(u)/∂u is sparse
- SDC (2000) is Gauss-Seidel with a low-order corrector
- Butcher/Bickart method: diagonalize S = V Λ V^{-1} (sketched below)
  - Λ ⊗ I_n + I_s ⊗ J
  - s decoupled solves
  - Complex eigenvalues (overhead for a real problem)
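A small dense sketch (my own, with random stand-ins for the RK coefficient matrix and the Jacobian) of why diagonalizing S decouples the stage system into s independent shifted solves:

```python
import numpy as np

rng = np.random.default_rng(0)
s, n, h = 3, 40, 0.1
A = np.diag([2.0, 3.0, 4.0]) + 0.3 * rng.standard_normal((s, s))  # stand-in RK matrix
J = rng.standard_normal((n, n)) / np.sqrt(n)                      # stand-in for -dF/du
rhs = rng.standard_normal(s * n)

S = np.linalg.inv(h * A)                                  # S = (hA)^{-1}
G = np.kron(S, np.eye(n)) + np.kron(np.eye(s), J)         # Ghat = S (x) I_n + I_s (x) J

# Diagonalize S = V Lambda V^{-1}: the stages decouple into s shifted solves with J
lam, V = np.linalg.eig(S)
R = np.linalg.solve(np.kron(V, np.eye(n)), rhs).reshape(s, n)
Y = np.array([np.linalg.solve(lam[i] * np.eye(n) + J, R[i]) for i in range(s)])
x = np.kron(V, np.eye(n)) @ Y.reshape(-1)

assert np.allclose(G @ x, rhs)  # same solution as the coupled Kronecker system
```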

Ill conditioning

A = V Λ V^{-1}

Skip the diagonalization

\underbrace{\begin{pmatrix}
s_{11} I + J & s_{12} I & \\
s_{21} I & s_{22} I + J & s_{23} I \\
 & s_{32} I & s_{33} I + J
\end{pmatrix}}_{S \otimes I_n + I_s \otimes J}
\qquad \longrightarrow \qquad
\underbrace{\begin{pmatrix}
S + J_{11} I_s & J_{12} I_s & \\
J_{21} I_s & S + J_{22} I_s & J_{23} I_s \\
 & J_{32} I_s & S + J_{33} I_s
\end{pmatrix}}_{I_n \otimes S + J \otimes I_s}

- Accessing memory for J dominates the cost
- Irregular vector access in the application of J limits vectorization
- Permute the Kronecker product to reuse J and make the fine-grained structure regular (see the sketch below)
- Stages are coupled in registers at spatial-point granularity
- Same convergence properties as Butcher/Bickart
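A quick numerical check (my own sketch with small dense matrices) that the two orderings above differ only by the perfect-shuffle permutation between stage-major and point-major storage:

```python
import numpy as np

rng = np.random.default_rng(1)
s, n = 3, 5
S = rng.standard_normal((s, s))
J = rng.standard_normal((n, n))

G_stage_major = np.kron(S, np.eye(n)) + np.kron(np.eye(s), J)  # stages outer, points inner
G_point_major = np.kron(np.eye(n), S) + np.kron(J, np.eye(s))  # points outer, stages inner

# Permutation taking entry (stage i, point j) to (point j, stage i)
P = np.zeros((s * n, s * n))
for i in range(s):
    for j in range(n):
        P[j * s + i, i * n + j] = 1.0

assert np.allclose(P @ G_stage_major @ P.T, G_point_major)
```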

PETSc MatKAIJ: "sparse" Kronecker product matrices

G = I_n ⊗ S + J ⊗ T

- J is parallel and sparse; S and T are small and dense
- More general than multiple right-hand sides
- Compare J ⊗ I_s to multiple right-hand sides in row-major storage
- Runge-Kutta systems have T = I_s (permuted from the Butcher method)
- Stream J through cache once, with the same efficiency as multiple RHS (see the sketch below)
- Unintrusive compared to spatial-domain vectorization or s-step methods
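A sketch of a KAIJ-style product y = (I_n ⊗ S + J ⊗ T)x in the point-major ordering, assuming SciPy for the sparse J. It mimics the access pattern described above (J streamed once, stage blocks handled like multiple right-hand sides); it is not PETSc's implementation.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(2)
n, s = 200, 4
J = sp.random(n, n, density=0.05, format="csr", random_state=2)
S = rng.standard_normal((s, s))
T = np.eye(s)                    # T = I_s for the permuted Runge-Kutta system
X = rng.standard_normal((n, s))  # unknowns: one length-s stage block per spatial point

# y = (I_n (x) S + J (x) T) x with point-major storage: each row of X is one point
Y = X @ S.T + (J @ X) @ T.T      # J is streamed once, as with multiple right-hand sides

# Reference: explicit Kronecker assembly (row-major flattening matches point-major order)
G = sp.kron(sp.eye(n), S) + sp.kron(J, T)
assert np.allclose(G @ X.reshape(-1), Y.reshape(-1))
```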

Convergence with point-block Jacobi preconditioning

- 3D centered-difference diffusion problem

  Method    order  nsteps  Krylov its. (average per step)
  Gauss 1       2      16  130 (8.1)
  Gauss 2       4       8  122 (15.2)
  Gauss 4       8       4  100 (25)
  Gauss 8      16       2   78 (39)

We really want multigrid

- Prolongation: P ⊗ I_s
- Coarse operator: I_n ⊗ S + (RJP) ⊗ I_s
- Larger time steps
- GMRES(2)/point-block Jacobi smoothing
- FGMRES outer

  Method    order  nsteps  Krylov its. (average per step)
  Gauss 1       2      16   82 (5.1)
  Gauss 2       4       8   64 (8)
  Gauss 4       8       4   44 (11)
  Gauss 8      16       2   42 (21)

Toward a better AMG for IRK/tensor-product systems

- Start with R̂ = R ⊗ I_s, P̂ = P ⊗ I_s

  Ĝ_coarse = R̂ (I_n ⊗ S + J ⊗ I_s) P̂

- Imaginary component slows convergence
- Can we use a Kronecker product interpolation?
- Rotation on coarse grids (connections to shifted Laplacian)

Why implicit is silly for waves

- Implicit methods require an implicit solve in each stage
- Time step size is proportional to CFL for accuracy reasons
- Methods of order higher than one are not unconditionally strong stability preserving (SSP; Spijker 1983)
  - Empirically, c_eff ≤ 2: Ketcheson, Macdonald, Gottlieb (2008) and others
  - Downwind methods offer a way around this, but are so far not practical
- Time step size is chosen for stability
  - Increase the order if more accuracy is needed
- Large errors come from the spatial discretization; accuracy demands are modest
- My goal: less memory motion per stage
  - Better accuracy and symplecticity are a nice bonus only
  - Cannot sell the method without efficiency

Implicit Runge-Kutta for advection

Table: Total number of iterations (communications or accesses of J) to solve linear advection to t = 1 on a 1024-point grid using point-block Jacobi preconditioning of the implicit Runge-Kutta matrix. The relative algebraic solver tolerance is 10^{-8}.

  Method    order  nsteps  Krylov its. (average per step)
  Gauss 1       2    1024  3627 (3.5)
  Gauss 2       4     512  2560 (5)
  Gauss 4       8     256  1735 (6.8)
  Gauss 8      16     128  1442 (11.2)

- Naive centered-difference discretization
- Leapfrog requires 1024 iterations at CFL = 1
- This is A-stable (can handle dissipation)

Diagonalization revisited

(I ⊗ I − hA ⊗ L) Y = (1 ⊗ I) u^n        (1)
u^{n+1} = u^n + h (b^T ⊗ L) Y           (2)

- Eigendecomposition A = V Λ V^{-1}:

  (V ⊗ I)(I ⊗ I − hΛ ⊗ L)(V^{-1} ⊗ I) Y = (1 ⊗ I) u^n

- Find a diagonal W such that W^{-1} 1 = V^{-1} 1
- Commute diagonal matrices:

  (I ⊗ I − hΛ ⊗ L)(W V^{-1} ⊗ I) Y = (1 ⊗ I) u^n,    with Z = (W V^{-1} ⊗ I) Y

- Using b̃^T = b^T V W^{-1}, we have the completion formula

  u^{n+1} = u^n + h (b̃^T ⊗ L) Z

- (Λ, b̃) is a new, diagonal Butcher table (see the sketch below)
- Compute the coefficients offline in extended precision to handle the ill conditioning of V
- Equivalent for linear problems, but usually fails nonlinear stability
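A minimal sketch (my own, using the 2-stage Gauss table) of computing the diagonal table (Λ, b̃) and checking that it reproduces the stability function of the original method:

```python
import numpy as np

# 2-stage Gauss (order 4) Butcher table
r3 = np.sqrt(3)
A = np.array([[1/4, 1/4 - r3/6],
              [1/4 + r3/6, 1/4]])
b = np.array([1/2, 1/2])
one = np.ones(2)

lam, V = np.linalg.eig(A)                    # A = V Lambda V^{-1}
Vinv = np.linalg.inv(V)
W = np.diag(1.0 / (Vinv @ one))              # diagonal W with W^{-1} 1 = V^{-1} 1
btilde = b @ V @ np.linalg.inv(W)            # b~^T = b^T V W^{-1}

# Both tables give the same stability function R(z) on the linear test problem
z = 0.3 + 0.2j
R_full = 1 + z * b @ np.linalg.solve(np.eye(2) - z * A, one)
R_diag = 1 + z * btilde @ (1.0 / (1.0 - z * lam))
assert np.isclose(R_full, R_diag)
```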

Exploiting realness

- Eigenvalues come in conjugate pairs:

  A = V Λ V^{-1}

- For each conjugate pair, create the unitary transformation

  T = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ i & -i \end{pmatrix}

- Real 2 × 2 block-diagonal D; real Ṽ (with appropriate phase):

  A = (V T^*)(T Λ T^*)(T V^{-1}) = Ṽ D Ṽ^{-1}

- Yields a new block-diagonal Butcher table (D, b̃)
- Halve the number of stages using the identity (for real J and u)

  (ᾱ + J)^{-1} u = \overline{(α + J)^{-1} u}

  Solve one complex problem per conjugate pair, then take twice the real part (see the sketch below).
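A small numerical check (my own, with a random real operator and a made-up weight w standing in for whatever coefficient multiplies this stage) of the conjugate-pair identity above:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
J = rng.standard_normal((n, n)) / np.sqrt(n)  # real operator (sparse in practice)
u = rng.standard_normal(n)                    # real right-hand side
alpha = 1.3 + 0.7j                            # one member of a conjugate eigenvalue pair
w = 0.4 - 0.2j                                # its complex weight; the conjugate gets conj(w)

x = np.linalg.solve(alpha * np.eye(n) + J, u)  # one complex solve
pair = 2 * (w * x).real                        # contribution of the whole pair

# Reference: solve both shifted systems explicitly
x_conj = np.linalg.solve(np.conj(alpha) * np.eye(n) + J, u)
assert np.allclose(pair, (w * x + np.conj(w) * x_conj).real)
```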

Conditioning

[Figure: "Conditioning of diagonalization: gauss"; κ(V), ||b̃||_1, and ||b̃_re||_1 vs. number of stages (2 to 14), on a log scale from 10^0 to 10^7]

- Diagonalization in extended precision helps somewhat, as does the real formulation
- Neither makes an arbitrarily large number of stages viable

REXI: Rational approximation of exponential

u(t) = e^{Lt} u(0)

- Haut, Babb, Martinsson, Wingate; Schreiber and Loft

  (α ⊗ I + h I ⊗ L) Y = (1 ⊗ I) u^n
  u^{n+1} = (β^T ⊗ I) Y

- α is a complex-valued diagonal, β is complex
- Constructs rational approximations of Gaussian functions, targeting (the real part of) e^{it}
- REXI is a Runge-Kutta method: can convert via the "modified Shu-Osher form"
  - Developed for SSP (strong stability preserving) methods
  - Ferracina and Spijker (2005), Higueras (2005)
- Yields a diagonal Butcher table A = −α^{-1}, b = −α^{-2} β (see the sketch below)
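A sketch of the conversion stated in the last bullet. The α, β values below are placeholders rather than an actual REXI table; real coefficients come from the Gaussian-approximation construction referenced above.

```python
import numpy as np

# Placeholder REXI coefficients for one conjugate-symmetric set of poles (not a real table)
alpha = np.array([1.0 + 2.0j, 1.0 - 2.0j, 3.0 + 0.5j, 3.0 - 0.5j])
beta = np.array([0.2 - 0.1j, 0.2 + 0.1j, -0.3 + 0.4j, -0.3 - 0.4j])

# Equivalent diagonal Butcher table per the formula on the slide
A = np.diag(-1.0 / alpha)     # stage matrix A = -alpha^{-1}
b = -beta / alpha**2          # weights     b = -alpha^{-2} beta
c = np.diag(A)                # abscissae c_i = sum_j a_ij for a diagonal A
print("diag(A) =", c)
print("b =", b)
```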

Abscissa for RK and REXI methods

[Figure: abscissae in the complex plane for Gauss, diagonalized Gauss ("Gauss diag"), real block-diagonalized Gauss ("Gauss rbdiag"), REXI 33, and Cauchy 16]

Stability regions

[Figure: stability regions and order stars for the Cauchy (16,6) method and for Gauss with 4 poles]

Outlook on Kronecker product solvers

I ⊗ S + J ⊗ I

- A (block) diagonal S is usually sufficient
- Best opportunity for "time parallel" (for linear problems)
- Is it possible to beat explicit wave propagation with high efficiency?
- The same structure appears in stochastic Galerkin and other UQ methods
- IRK unintrusively offers bandwidth reuse and vectorization
- Need polynomial smoothers for IRK spectra
- Change the number of stages on spatially-coarse grids (p-MG, or even increase)?
- Experiment with SOR-type smoothers
  - Prefer point-block Jacobi in smoothers for spatial parallelism
- Possible IRK correction for IMEX (non-smooth explicit function)
- PETSc implementation works in parallel; hardening is in progress
- Thanks to DOE ASCR