
Lawrence Livermore National Laboratory

A Tutorial

Ulrike Meier Yang

Lawrence Livermore National Laboratory, P. O. Box 808, Livermore, CA 94551

This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-PRES-463131

References

. "Introduction to Parallel Computing", Blaise Barney, https://computing.llnl.gov/tutorials/parallel_comp

. "A Parallel Multigrid Tutorial", Jim Jones, https://computing.llnl.gov/casc/linear_solvers/present.html

. "Introduction to Parallel Computing: Design and Analysis of Algorithms", Vipin Kumar, Ananth Grama, Anshul Gupta, George Karypis

Overview

. Basics of Parallel Computing (see Barney)
  - Concepts and Terminology
  - Architectures
  - Programming Models
  - Designing Parallel Programs

. Parallel algorithms and their implementation
  - basic kernels
  - Krylov methods
  - multigrid

Serial computation

. Traditionally, software has been written for serial computation:
  - run on a single computer
  - instructions are run one after another
  - only one instruction executed at a time

Von Neumann Architecture

. John von Neumann first authored the general requirements for an electronic computer in 1945
. Data and instructions are stored in memory
. The control unit fetches instructions/data from memory, decodes the instructions and then sequentially coordinates operations to accomplish the programmed task
. The arithmetic unit performs basic arithmetic operations
. Input/Output is the interface to the human operator

Parallel Computing

. Simultaneous use of multiple compute resources to solve a single problem

Why use parallel computing?

. Save time and/or money
. Solve larger problems
. Provide concurrency
. Use non-local resources
. Overcome the limits of serial computing

Concepts and Terminology

. Flynn’s Taxonomy (1966)

SISD - Single Instruction, Single Data
SIMD - Single Instruction, Multiple Data
MISD - Multiple Instruction, Single Data
MIMD - Multiple Instruction, Multiple Data

SISD

. Serial computer
. Deterministic execution
. Examples: older generation mainframes, workstations, PCs

SIMD

. A type of parallel computer
. All processing units execute the same instruction at any given clock cycle
. Each processing unit can operate on a different data element
. Two varieties: Processor Arrays and Vector Pipelines
. Most modern computers, particularly those with graphics processor units (GPUs), employ SIMD instructions and execution units

MISD

. A single data stream is fed into multiple processing units
. Each processing unit operates on the data independently via independent instruction streams
. Few actual examples: the Carnegie-Mellon C.mmp computer (1971)

MIMD

. Currently the most common type of parallel computer
. Every processor may be executing a different instruction stream
. Every processor may be working with a different data stream
. Execution can be synchronous or asynchronous, deterministic or non-deterministic
. Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs
. Note: many MIMD architectures also include SIMD execution sub-components

Parallel Computer Architectures

. Shared memory: all processors can access the same memory

. Uniform memory access (UMA):
  - identical processors
  - equal access and access times to memory

Non-uniform memory access (NUMA)

. Not all processors have equal access to all memories

. Memory access across link is slower

. Advantages:
  - user-friendly programming perspective to memory
  - fast and uniform data sharing due to the proximity of memory to CPUs
. Disadvantages:
  - lack of scalability between memory and CPUs
  - the programmer is responsible for ensuring "correct" access of global memory
  - expense

Distributed Memory

. Distributed memory systems require a communication network to connect inter-processor memory.
. Advantages:
  - Memory is scalable with the number of processors.
  - No memory interference or overhead for trying to keep cache coherency.
  - Cost effective
. Disadvantages:
  - The programmer is responsible for data communication between processors.
  - It can be difficult to map existing data structures to this memory organization.

Hybrid Distributed-Shared Memory

. Generally used for the currently largest and fastest computers . Has a mixture of previously mentioned advantages and disadvantages

Parallel programming models

. Shared memory
. Threads
. Message Passing
. Data Parallel
. Hybrid

All of these can be implemented on any architecture.

Shared memory

. Tasks share a common address space, which they read and write asynchronously
. Various mechanisms such as locks / semaphores may be used to control access to the shared memory
. Advantage: no need to explicitly communicate data between tasks -> simplified programming
. Disadvantages:
  • Need to take care when managing memory, avoid conflicts
  • Harder to control data locality

Threads

. A thread can be considered as a subroutine in the main program
. Threads communicate with each other through the global memory
. Commonly associated with shared memory architectures and operating systems
. POSIX Threads (pthreads)
. OpenMP

Message Passing

. A set of tasks that use their own local memory during computation
. Data exchange through sending and receiving messages
. Data transfer usually requires cooperative operations to be performed by each process; for example, a send operation must have a matching receive operation

. MPI (released in 1994) . MPI-2 (released in 1996)

Data Parallel

. The data parallel model demonstrates the following characteristics:
  • Most of the parallel work performs operations on a data set, organized into a common structure, such as an array
  • A set of tasks works collectively on the same data structure, with each task working on a different partition
  • Tasks perform the same operation on their partition

. On shared memory architectures, all tasks may have access to the data structure through global memory. On distributed memory architectures the data structure is split up and resides as "chunks" in the local memory of each task.

Other Models

. Hybrid - combines various models, e.g. MPI/OpenMP . Single Program Multiple Data (SPMD) - A single program is executed by all tasks simultaneously

. Multiple Program Multiple Data (MPMD) - An MPMD application has multiple executables. Each task can execute the same or different program as other tasks.

Designing Parallel Programs

. Examine the problem:
  - Can the problem be parallelized?
  - Are there data dependencies?
  - Where is most of the work done?
  - Identify bottlenecks (e.g. I/O)

. Partitioning - How should the data be decomposed?

Various partitionings

How should the data be distributed?

Communications

. Types of communication:

- point-to-point

- collective

Synchronization types

. Barrier
  • Each task performs its work until it reaches the barrier. It then stops, or "blocks".
  • When the last task reaches the barrier, all tasks are synchronized.
. Lock / semaphore
  • The first task to acquire the lock "sets" it. This task can then safely (serially) access the protected data or code.
  • Other tasks can attempt to acquire the lock but must wait until the task that owns the lock releases it.
  • Can be blocking or non-blocking

Load balancing

. Keep all tasks busy all of the time. Minimize idle time. . The slowest task will determine the overall performance.

Granularity

. In parallel computing, granularity is a qualitative measure of the ratio of computation to communication.
. Fine-grain parallelism:
  - Low computation to communication ratio
  - Facilitates load balancing
  - Implies high communication overhead and less opportunity for performance enhancement
. Coarse-grain parallelism:
  - High computation to communication ratio
  - Implies more opportunity for performance increase
  - Harder to load balance efficiently

Parallel Computing Performance Metrics

. Let T(n,p) be the time to solve a problem of size n using p processors

- Speedup: S(n,p) = T(n,1)/T(n,p)

- Efficiency: E(n,p) = S(n,p)/p

- Scaled Efficiency: SE(n,p) = T(n,1)/T(pn,p)

An algorithm is scalable if there exists C>0 with SE(n,p) ≥ C for all p

Strong Scalability

Increasing the number of processors for a fixed problem size.

[Figures: run times in seconds, speedup of the method vs. perfect speedup, and efficiency, plotted against the number of processors (1-8)]

Weak scalability

Increasing the number of processors using a fixed problem size per processor.

[Figures: run times in seconds and scaled efficiency, plotted against the number of processors (1-8)]

Amdahl's Law

. Maximal Speedup = 1/(1-P), P parallel portion of code

Amdahl's Law

Speedup = 1/((P/N)+S), N no. of processors, S serial portion of code

Speedup:

      N    P=0.50   P=0.90   P=0.99
     10      1.82     5.26     9.17
    100      1.98     9.17    50.25
   1000      1.99     9.91    90.99
  10000      1.99     9.91    99.02
 100000      1.99     9.99    99.90

[Figure: speedup curves for parallel fractions of 50%, 75%, 90%, and 95%]
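The formula above is easy to tabulate; a minimal sketch in C (the helper name amdahl_speedup is illustrative, not from the slides) reproduces one row of the table:

  #include <stdio.h>

  /* speedup = 1 / (P/N + S), with serial fraction S = 1 - P */
  double amdahl_speedup(double P, int N)
  {
      double S = 1.0 - P;               /* serial portion of the code */
      return 1.0 / (P / (double)N + S);
  }

  int main(void)
  {
      /* reproduces the N=1000, P=0.99 entry of the table above */
      printf("N=1000, P=0.99: speedup = %.2f\n", amdahl_speedup(0.99, 1000));
      return 0;
  }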

Performance Model for distributed memory

. Time for n floating point operations:

tcomp = fn

(1/f is the achievable flop rate, e.g. in Mflops)

. Time for communicating n words:

tcomm = α + βn

α = latency, 1/β = bandwidth

. Total time: tcomp + tcomm
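As a small illustration, the model can be written as a one-line cost function; this is a sketch under the slide's assumptions (uniform f, α, β; the function name is illustrative):

  /* f = time per flop, alpha = latency, beta = time per word */
  double model_time(double n_flops, double n_words,
                    double f, double alpha, double beta)
  {
      double t_comp = f * n_flops;             /* computation: f * n        */
      double t_comm = alpha + beta * n_words;  /* one message of n words    */
      return t_comp + t_comm;                  /* total time                */
  }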

Parallel Algorithms and Implementations

. basic kernels . Krylov methods . multigrid

Vector Operations

. Adding two vectors: z = x + y
  - each processor adds its local pieces: z_i = x_i + y_i on Proc i
. Axpys: y = y + αx
  - also purely local
. Vector (dot) products: s = x^T y
  - each processor computes its local product s_i = x_i^T y_i
  - the partial results must be summed: s = s_0 + s_1 + s_2
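A minimal MPI sketch of the dot product (function and variable names are illustrative): the vector add and axpy need no communication, while the dot product ends with one global reduction.

  #include <mpi.h>

  /* s = x^T y for vectors distributed in pieces of length nlocal */
  double parallel_dot(const double *x, const double *y, int nlocal)
  {
      double s_local = 0.0, s;
      for (int k = 0; k < nlocal; k++)      /* local part: s_i = x_i^T y_i */
          s_local += x[k] * y[k];
      /* s = s_0 + s_1 + ... + s_{p-1}, available on every process */
      MPI_Allreduce(&s_local, &s, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      return s;
  }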

Dense n x n Matrix-Vector Multiplication

. Matrix size n x n , p processors

[Figure: the rows of the matrix and the corresponding pieces of the result vector are partitioned across Proc 0 - Proc 3]

. T = 2(n²/p)f + (p-1)(α + βn/p)

Matrix-Vector Multiplication

. Matrix of size n² x n², p² processors

. Consider graph of matrix

. Partitioning influences run time:
  T_1 ≈ 10(n²/p²)f + 2(α + βn)   vs.   T_2 ≈ 10(n²/p²)f + 4(α + βn/p)

Effect of Partitioning on Performance

[Figure: speedup vs. number of processors (1-16) for perfect speedup, an optimized MPI partitioning (MPI opt), and a non-optimized one (MPI-noopt)]

Sparse Matrix Vector Multiply with COO Data Structure

. Coordinate Format (COO) real Data(nnz), integer Row(nnz), integer Col(nnz)

Matrix:         COO arrays:
1 0 0 2         Data  2 5 1 3 7 9 4 8 6
0 3 4 0         Row   0 2 0 1 3 3 1 3 2
0 5 6 0         Col   3 1 0 1 0 3 2 2 2
7 0 8 9

. Matrix-Vector multiplication (y initialized to zero):
  for (k=0; k<nnz; k++)
     y[Row[k]] += Data[k]*x[Col[k]];

Sparse Matrix Vector Multiply with COO Data Structure

. Matrix-Vector multiplication (y initialized to zero):
  for (k=0; k<nnz; k++)
     y[Row[k]] += Data[k]*x[Col[k]];

Matrix:         COO arrays (entries split between two processors/threads by index k):
1 0 0 2         Data  2 3 1 5 7 9 4 8 6
0 3 4 0         Row   0 1 0 2 3 3 1 3 2
0 5 6 0         Col   3 1 0 1 0 3 2 2 2
7 0 8 9         possible conflict: rows 1-3 are updated from both halves of the arrays

Partitioning the entries by rows instead avoids the conflict:
Proc 0 (rows 0-1):  Data  2 3 1 4      Row  0 1 0 1      Col  3 1 0 2
Proc 1 (rows 2-3):  Data  5 7 9 8 6    Row  2 3 3 3 2    Col  1 0 3 2 2
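A hedged sketch of what the conflict means for a threaded (OpenMP) COO matvec: protecting the update with an atomic is one safe but slow option, which is why partitioning by rows, as above, is usually preferred.

  /* Threaded COO matvec; two threads may update the same y[Row[k]],
     so the update is made atomic. Assumes y is already zeroed.      */
  void coo_matvec_omp(int nnz, const double *Data, const int *Row,
                      const int *Col, const double *x, double *y)
  {
      #pragma omp parallel for
      for (int k = 0; k < nnz; k++) {
          #pragma omp atomic
          y[Row[k]] += Data[k] * x[Col[k]];
      }
  }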

Sparse Matrix Vector Multiply with CSR Data Structure

. Compressed sparse row (CSR) real Data(nnz), integer Rowptr(n+1), Col(nnz)

Matrix:         CSR arrays:
1 0 0 2         Data    1 2 3 4 5 6 7 8 9
0 3 4 0         Rowptr  0 2 4 6 9
0 5 6 0         Col     0 3 1 2 1 2 0 2 3
7 0 8 9

. Matrix-Vector multiplication:
  for (i=0; i<n; i++)
     for (k=Rowptr[i]; k<Rowptr[i+1]; k++)
        y[i] += Data[k]*x[Col[k]];
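For reference, a self-contained C program using the example matrix above (the input vector x of all ones is an arbitrary choice, not from the slides):

  #include <stdio.h>

  int main(void)
  {
      const int n = 4;
      double Data[]   = {1, 2, 3, 4, 5, 6, 7, 8, 9};
      int    Rowptr[] = {0, 2, 4, 6, 9};
      int    Col[]    = {0, 3, 1, 2, 1, 2, 0, 2, 3};
      double x[] = {1, 1, 1, 1};           /* hypothetical input vector */
      double y[] = {0, 0, 0, 0};

      for (int i = 0; i < n; i++)
          for (int k = Rowptr[i]; k < Rowptr[i + 1]; k++)
              y[i] += Data[k] * x[Col[k]];

      for (int i = 0; i < n; i++)
          printf("y[%d] = %g\n", i, y[i]); /* expect 3, 7, 11, 24 */
      return 0;
  }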

Sparse Matrix Vector Multiply with CSC Data Structure

. Compressed sparse column (CSC) real Data(nnz), integer Colptr(n+1), Row(nnz)

Matrix:         CSC arrays:
1 0 0 2         Data    1 7 3 5 4 6 8 2 9
0 3 4 0         Colptr  0 2 4 7 9
0 5 6 0         Row     0 3 1 2 1 2 3 0 3
7 0 8 9

. Matrix-Vector multiplication (y initialized to zero):
  for (i=0; i<n; i++)
     for (k=Colptr[i]; k<Colptr[i+1]; k++)
        y[Row[k]] += Data[k]*x[i];

Parallel Matvec, CSR vs. CSC

. CSR: processor i owns a block of rows A_i and computes its own piece of the result directly: y_i = A_i x

. CSC: processor i owns a block of columns A_i together with the piece x_i of x; it computes a partial result y^(i) = A_i x_i over the whole vector, and the partial results must be summed: y = y^(0) + y^(1) + y^(2)

Sparse Matrix Vector Multiply with Diagonal Storage

. Diagonal storage real Data(n,m), integer offset(m)

Matrix:       Data (n x m):    offset:
1 2 0 0       0 1 2            -2  0  1
0 3 0 0       0 3 0
5 0 6 7       5 6 7
0 8 0 9       8 9 0

. Matrix-Vector multiplication (y initialized to zero):
  for (i=0; i<n; i++)
     for (d=0; d<m; d++) {
        j = i + offset[d];
        if (j >= 0 && j < n) y[i] += Data[i][d]*x[j];
     }

Other Data Structures

. Blocked data structures (BCSR, BCSC): block size b x b, n divisible by b
  real Data(nnb*b*b), integer Ptr(n/b+1), integer Index(nnb)

. Based on space-filling curves (e.g. Hilbert curves, Z-order):
  integer IJ(2*nnz), real Data(nnz)

. Matrix-Vector multiplication (row/column indices stored pairwise in IJ):
  for (k=0; k<nnz; k++)
     y[IJ[2*k]] += Data[k]*x[IJ[2*k+1]];

Data Structures for Distributed Memory

. Concept for MATAIJ (PETSc) and ParCSR (hypre)

[Figure: the matrix rows are distributed in contiguous blocks across Proc 0, Proc 1, and Proc 2; each block has entries both in columns owned by the processor itself and in columns owned by other processors (global column indices 0-10)]

. Each processor stores:
  - a local CSR matrix (entries in columns it owns)
  - a nonlocal CSR matrix (entries in columns owned by other processors)
  - a mapping from local to global column indices, e.g. 0 2 3 8 10
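A rough sketch of such a data structure in C; the type and field names are illustrative and do not reproduce the actual PETSc or hypre structs.

  typedef struct {
      int     n_local;       /* number of rows owned by this process        */
      /* local part: entries in columns owned by this process               */
      double *diag_data;   int *diag_rowptr;   int *diag_col;
      /* nonlocal part: entries in columns owned by other processes         */
      double *offd_data;   int *offd_rowptr;   int *offd_col;
      int     n_offd_cols;   /* number of distinct nonlocal columns         */
      int    *col_map_offd;  /* local-to-global map, e.g. {0, 2, 3, 8, 10}  */
  } DistCSRMatrix;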

Communication Information

. Neighborhood information (no. of neighbors)

. Amount of data to be sent and/or received

. Global partitioning for p processors

r0 | r1 | r2 | .... | rp  -  storing the whole partition on every processor is not good on very large numbers of processors!

Assumed partition (AP) algorithm

. Answering global distribution questions previously required O(P) storage & computations . On BG/L, O(P) storage may not be possible

. New algorithm requires
  • O(1) storage
  • O(log P) computations
. Enables scaling to 100K+ processors
. Assumed partition: global index i is assumed to be owned by processor p = (iP)/N (see the sketch after this list)
. Actual partition info is sent to the assumed partition processors -> distributed directory

[Figure: actual vs. assumed partition of the index space 1..N over processors 1-4; data owned by processor 3 may fall in processor 4's assumed partition]

. AP idea has general applicability beyond hypre
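A minimal sketch of the assumed-partition map itself (the function name is illustrative); the point is that every process can evaluate it directly, without any O(P) partition table.

  /* Assumed owner of global index i, for an index space of size N
     distributed over P processes: p = (i*P)/N, as on the slide.    */
  int assumed_owner(long long i, long long N, int P)
  {
      return (int)((i * (long long)P) / N);
  }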

Assumed partition (AP) algorithm is more challenging for structured AMR grids

. AMR can produce grids with “gaps” . Our AP accounts for these gaps for scalability . Demonstrated on 32K procs of BG/L

Simple, naïve AP function leaves processors with empty partitions

Effect of AP on one of hypre's multigrid solvers

Run times on up to 64K procs on BG/P

[Figure: run times in seconds vs. number of processors, comparing the global partition with the assumed partition]

Machine specification – Hera - LLNL

. Multi-core / multi-socket cluster
. 864 diskless nodes interconnected by a Quad Data Rate (QDR) InfiniBand network
. AMD Opteron quad-core processors (2.3 GHz)

Multicore cluster details (Hera):

. Individual 512 KB L2 cache for each core
. 2 MB L3 cache shared by 4 cores
. 4 sockets per node, 16 cores sharing the 32 GB main memory
. NUMA memory access

MxV Performance 2-16 cores – OpenMP vs. MPI 7pt stencil

[Figures: Mflops vs. matrix size (up to 400,000) for 1, 2, 4, 8, 12, and 16 cores; left: OpenMP, right: MPI]

MxV Speedup for various matrix sizes, 7-pt stencil

[Figures: speedup vs. number of threads (OpenMP) and number of tasks (MPI), 1-16, for matrix sizes 512, 8000, 21952, 32768, 64000, 110592, and 512000, compared with perfect speedup]

Iterative linear solvers

. Krylov solvers:
  - Conjugate gradient
  - GMRES
  - BiCGSTAB

. Generally consist of basic linear kernels

Conjugate Gradient for solving Ax=b

k = 0;  x_0 = 0;  r_0 = b;  ρ_0 = r_0^T r_0;
while (ρ_k > ε² ρ_0)
   Solve M z_k = r_k;
   γ_k = r_k^T z_k;
   if (k = 0)  p_1 = z_0  else  p_{k+1} = z_k + (γ_k / γ_{k-1}) p_k;
   k = k + 1;
   α_k = γ_{k-1} / (p_k^T A p_k);
   x_k = x_{k-1} + α_k p_k;
   r_k = r_{k-1} - α_k A p_k;
   ρ_k = r_k^T r_k;
end
x = x_k;

Weak scalability – diagonally scaled CG

[Figures: total times in seconds for 100 iterations, and number of iterations, vs. number of processors (up to 120)]

Achieve algorithmic scalability through multigrid

[Figure: multigrid V-cycle - smooth on the finest grid, restrict to the first coarse grid and continue down, then interpolate back up]

Multigrid parallelized by domain partitioning

. Partition fine grid

. Interpolation fully parallel
. Smoothing

Algebraic Multigrid

Setup Phase:

• Select coarse "grids"
• Define interpolation P^(m), m = 1, 2, ...
• Define restriction and coarse-grid operators:
    R^(m) = (P^(m))^T,    A^(m+1) = R^(m) A^(m) P^(m)

Solve Phase:

Relax        A^(m) u^m = f^m
Compute      r^m = f^m - A^(m) u^m
Restrict     r^(m+1) = R^(m) r^m
Solve        A^(m+1) e^(m+1) = r^(m+1)
Interpolate  e^m = P^(m) e^(m+1)
Correct      u^m <- u^m + e^m
Relax        A^(m) u^m = f^m

How about scalability of multigrid V cycle?

. Time for relaxation on the finest grid with n x n points per proc:
    T_0 ≈ 4α + 4nβ + 5n²f
. Time for relaxation on level k:
    T_k ≈ 4α + 4(n/2^k)β + 5(n/2^k)²f
. Total time for a V-cycle (one relaxation sweep per level on the way down and one on the way up):
    T_V ≈ 2 Σ_k [4α + 4(n/2^k)β + 5(n/2^k)²f]
        ≈ 8(1 + 1 + 1 + ...)α + 8(1 + 1/2 + 1/4 + ...)nβ + 10(1 + 1/4 + 1/16 + ...)n²f
        ≈ 8Lα + 16nβ + (40/3)n²f,    with L ≈ log₂(pn) the number of levels
. Scaled efficiency:  lim_{p→∞} T_V(n,1) / T_V(pn,p) = O(1/log₂(p))
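The model is easy to evaluate directly; a minimal sketch (function name illustrative) under the slide's assumptions:

  #include <math.h>

  /* T_V ~ 8*L*alpha + 16*n*beta + (40/3)*n*n*f, with L ~ log2(p*n) levels */
  double vcycle_time(double n, double p, double f, double alpha, double beta)
  {
      double L = log2(p * n);                  /* number of levels */
      return 8.0 * L * alpha + 16.0 * n * beta + (40.0 / 3.0) * n * n * f;
  }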

How about scalability of multigrid W cycle?

. Total time for W cycle:

T_W ≈ 2 Σ_k 2^k [4α + 4(n/2^k)β + 5(n/2^k)²f]
    ≈ 8(1 + 2 + 4 + ...)α + 8(1 + 1 + 1 + ...)nβ + 10(1 + 1/2 + 1/4 + ...)n²f
    ≈ 8(2^(L+1) - 1)α + 16Lnβ + 20n²f,    L ≈ log₂(pn)
    ≈ 8(2pn - 1)α + 16 log₂(pn) nβ + 20n²f

Smoothers

Smoothers significantly affect multigrid convergence and runtime
• solving: Ax = b, A SPD
• smoothing: x_{n+1} = x_n + M^{-1}(b - A x_n)
• smoothers:
  • Jacobi, M = D, easy to parallelize:

      x_k(i) = (1/A(i,i)) [ b(i) - Σ_{j≠i} A(i,j) x_{k-1}(j) ],    i = 0, ..., n-1

• Gauss-Seidel, M=D+L, highly sequential!!

      x_k(i) = (1/A(i,i)) [ b(i) - Σ_{j<i} A(i,j) x_k(j) - Σ_{j>i} A(i,j) x_{k-1}(j) ],    i = 0, ..., n-1
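The formulas above map directly onto simple loops; a minimal sketch of one Jacobi sweep for a dense matrix (names illustrative). A Gauss-Seidel sweep differs only in reading already-updated values, which is what makes it sequential.

  /* One Jacobi sweep: xnew = x_k from xold = x_{k-1}, A stored row-major. */
  void jacobi_sweep(int n, const double *A, const double *b,
                    const double *xold, double *xnew)
  {
      for (int i = 0; i < n; i++) {
          double s = b[i];
          for (int j = 0; j < n; j++)
              if (j != i)
                  s -= A[i * n + j] * xold[j];   /* only old values used */
          xnew[i] = s / A[i * n + i];
      }
  }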

Parallel Smoothers

Red-Black Gauss-Seidel

Multi-color Gauss-Seidel

Hybrid smoothers

Hybrid Gauss-Seidel:            M_k = D_k + L_k
Hybrid symmetric Gauss-Seidel:  M_k = (D_k + L_k) D_k^{-1} (D_k + L_k)^T
Hybrid SOR:                     M_k = (1/ω_k) D_k + L_k

with M = diag(M_1, ..., M_p), i.e. block-diagonal over the p processors (or threads)

Corresponding block schemes, Schwarz solvers, ILU, etc.

Can be used for distributed as well as shared-memory parallel computing
Needs to be used with care
Might require a smoothing parameter: M_ω = ω M

How is the smoother threaded?

. Matrices are distributed across P processors in contiguous blocks of rows:
    A = [A_1; A_2; ...; A_p]
. The smoother is threaded such that each thread works on a contiguous subset of the rows:
    A_p = [t_1; t_2; t_3; t_4]
. Hybrid: Gauss-Seidel within each thread, Jacobi on thread boundaries
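A hedged sketch of such a threaded hybrid sweep for a CSR matrix (not the actual hypre code): each thread runs Gauss-Seidel on its own contiguous row block but reads old values for rows owned by other threads, so the coupling across thread boundaries is Jacobi-like. The caller passes xold as a copy of x.

  #include <omp.h>

  void hybrid_gs_sweep(int n, const double *Data, const int *Rowptr,
                       const int *Col, const double *b,
                       double *x, const double *xold)
  {
      #pragma omp parallel
      {
          int nthreads = omp_get_num_threads();
          int tid      = omp_get_thread_num();
          int lo = (n * tid) / nthreads;           /* this thread's rows */
          int hi = (n * (tid + 1)) / nthreads;

          for (int i = lo; i < hi; i++) {
              double s = b[i], diag = 1.0;         /* defensive default  */
              for (int k = Rowptr[i]; k < Rowptr[i + 1]; k++) {
                  int j = Col[k];
                  if (j == i)                 diag = Data[k];
                  else if (j >= lo && j < hi) s -= Data[k] * x[j];    /* GS inside block    */
                  else                        s -= Data[k] * xold[j]; /* Jacobi on boundary */
              }
              x[i] = s / diag;
          }
      }
  }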

FEM application: 2D Square

. The application partitions the grid element-wise into nice subdomains for each processor

16 procs

(colors indicate element numbering) . AMG determines thread subdomains by splitting nodes (not elements) . the application’s numbering of the elements/nodes becomes relevant! (i.e., threaded domains dependent on colors)

We want to use OpenMP on multicore machines...

. hybrid-GS is not guaranteed to converge, but in practice it is effective for diffusion problems when subdomains are well-defined

Some Options 1. applications needs to order unknowns in processor subdomain “nicely” • may be more difficult due to refinement • may impose non-trivial burden on application

2. AMG needs to reorder unknowns (rows) • non-trivial, costly

3. smoothers that don’t depend on parallelism

Coarsening algorithms

. Original coarsening algorithm

. Parallel coarsening algorithms

Choosing the Coarse Grid (Classical AMG)

. Add measures (Number of strong influences of a point)

Choosing the Coarse Grid (Classical AMG)

. Pick a point with maximal measure and make it a C-point

[Figure: grid of points labeled with their measures]

Choosing the Coarse Grid (Classical AMG)

. Pick a point with maximal measure and make it a C-point . Make its neighbors F-points

[Figure: the remaining points and their measures]

Choosing the Coarse Grid (Classical AMG)

. Pick a point with maximal measure and make it a C-point
. Make its neighbors F-points
. Increase measures of neighbors of new F-points by the number of new F-neighbors

[Figure: updated measures]

Choosing the Coarse Grid (Classical AMG)

. Pick a point with maximal measure and make it a C-point
. Make its neighbors F-points
. Increase measures of neighbors of new F-points by the number of new F-neighbors
. Repeat until finished

[Figure: updated measures]

Choosing the Coarse Grid (Classical AMG)

. Pick a point with maximal measure and make it a C-point
. Make its neighbors F-points
. Increase measures of neighbors of new F-points by the number of new F-neighbors
. Repeat until finished

[Figures: the selection steps repeat until every point is either a C-point or an F-point]

Parallel Coarsening Algorithms

. Perform sequential algorithm on each processor, then deal with processor boundaries

PMIS: start

- select C-pt with maximal measure
- select neighbors as F-pts

[Figure: 7x7 grid of measures - 8 at interior points, 5 on edges, 3 at corners]

PMIS: add random numbers

- select C-pt with maximal measure
- select neighbors as F-pts

[Figure: the same grid with a random number in (0,1) added to each measure to break ties]

PMIS: exchange neighbor information

- select C-pt with maximal measure
- select neighbors as F-pts

[Figure: the grid of measures after boundary measures have been exchanged with neighboring processors]

PMIS: select

- select C-pts with maximal measure locally
- make neighbors F-pts

[Figure: every point whose measure is a local maximum is selected as a C-point]

PMIS: update 1

- select C-pts with maximal measure locally
- make neighbors F-pts (requires neighbor info)

[Figure: the new C-points and their F-point neighbors are removed from the grid of remaining measures]

PMIS: select

- select C-pts with maximal measure locally
- make neighbors F-pts

[Figure: the selection step is repeated on the remaining points]

PMIS: update 2

- select C-pts with maximal measure locally
- make neighbors F-pts

[Figure: only a few points remain after the second update]

PMIS: final grid

- select C-pts with maximal measure locally
- make neighbors F-pts
- remove neighbor edges

Interpolation

. Nearest neighbor interpolation is straightforward to parallelize

. Long-range interpolation is harder to parallelize

[Figure: interpolation stencil for point i involving points j, k, and l]

Communication patterns – 128 procs – Level 0

. Setup . Solve

Communication patterns – 128 procs – Setup distance 1 vs. distance 2 interpolation

. Distance 1 – Level 4 . Distance 2 – Level 4

Communication patterns – 128 procs – Solve distance 1 vs. distance 2 interpolation

. Distance 1 – Level 4 . Distance 2 – Level 4

AMG-GMRES(10) on Hera, 7pt 3D Laplace problem on a unit cube, 100x100x100 grid points per node (16 procs per node)

[Figures: setup times and solve times in seconds vs. number of cores (up to 3000), comparing Hera-1, Hera-2, BGL-1, and BGL-2]

Distance 1 vs distance 2 interpolation on different computer architectures

. AMG-GMRES(10), 7pt 3D Laplace problem on a unit cube, using 50x50x25 points per processor (100x100x100 per node on Hera)

[Figure: times in seconds vs. number of cores (up to 3000) for Hera-1, Hera-2, BGL-1, and BGL-2]

AMG-GMRES(10) on Hera, 7pt 3D Laplace problem on a unit cube, 100x100x100 grid points per node (16 procs per node)

[Figures: setup times and solve cycle times in seconds vs. number of cores (up to 10000), comparing MPI 16x1, H 8x2, H 4x4, H 2x8, OMP 1x16, and MCSUP 1x16]

AMG-GMRES(10) on Hera, 7pt 3D Laplace problem on a unit cube, 100x100x100 grid points per node (16 procs per node)

[Figure: total times in seconds vs. number of cores (up to 12000), comparing MPI 16x1, H 8x2, H 4x4, H 2x8, OMP 1x16, and MCSUP 1x16]

Blue Gene/P Solution – Intrepid Argonne National Laboratory

. 40 racks with 1024 compute nodes each
. Quad-core 850 MHz PowerPC 450 processor
. 163,840 cores
. 2 GB main memory per node, shared by all cores
. 3D torus network - isolated dedicated partitions

AMG-GMRES(10) on Intrepid, 7pt 3D Laplace problem on a unit cube, 50x50x25 grid points per core (4 cores per node)

[Figures: setup times and cycle times in seconds vs. number of cores (up to 100,000), comparing smpH1x4, dualH2x2, vnMPI, and vnH4x1-omp]

AMG-GMRES(10) on Intrepid, 7pt 3D Laplace problem on a unit cube, 50x50x25 grid points per core (4 cores per node)

[Figure: times in seconds vs. number of cores (up to 100,000) for H1x4, H2x2, MPI, and H4x1]

Number of iterations:

  No. of cores   H1x4   H2x2   MPI
          128     17     17     18
         1024     20     20     22
         8192     25     25     25
        27648     29     27     28
        65536     30     33     29
       128000     37     36     36

XT-5 Jaguar-PF – Oak Ridge National Laboratory

. 18,688 nodes in 200 cabinets
. Two sockets with an AMD Opteron hex-core processor per node
. 16 GB main memory per node
. Network: 3D torus / mesh with wrap-around links (torus-like) in two dimensions and without such links in the remaining dimension
. Compute Node Linux, limited set of services, but reduced noise ratio
. PGI's C compiler (v.904) with and without OpenMP; OpenMP settings: -mp=numa (preemptive thread pinning, localized memory allocations)

AMG-GMRES(10) on Jaguar, 7pt 3D Laplace problem on a unit cube, 50x50x30 grid points per core (12 cores per node)

[Figures: setup times and cycle times in seconds vs. number of cores (up to 200,000), comparing MPI, H12x1, H1x12, H4x3, and H2x6]

AMG-GMRES(10) on Jaguar, 7pt 3D Laplace problem on a unit cube, 50x50x30 grid points per core (12 cores per node)

[Figure: total times in seconds vs. number of cores (up to 200,000) for MPI, H1x12, H4x3, and H2x6]

Thank You!

This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
