Lawrence Livermore National Laboratory
Ulrike Meier Yang
Lawrence Livermore National Laboratory, P. O. Box 808, Livermore, CA 94551

This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-PRES-463131

References
. Blaise Barney, "Introduction to Parallel Computing", https://computing.llnl.gov/tutorials/parallel_comp
. Jim Jones, "A Parallel Multigrid Tutorial", https://computing.llnl.gov/casc/linear_solvers/present.html
. Vipin Kumar, Ananth Grama, Anshul Gupta, George Karypis, "Introduction to Parallel Computing: Design and Analysis of Algorithms"
Overview
. Basics of Parallel Computing (see Barney)
  - Concepts and Terminology
  - Computer Architectures
  - Programming Models
  - Designing Parallel Programs
. Parallel algorithms and their implementation
  - basic kernels
  - Krylov methods
  - multigrid
Serial computation
. Traditionally, software has been written for serial computation:
  - run on a single computer
  - instructions are run one after another
  - only one instruction executed at a time
Von Neumann Architecture
. John von Neumann first authored the general requirements for an electronic computer in 1945
. Data and instructions are stored in memory
. The control unit fetches instructions and data from memory, decodes the instructions, and then sequentially coordinates operations to accomplish the programmed task
. The arithmetic unit performs basic arithmetic operations
. Input/output is the interface to the human operator
Parallel Computing
. Simultaneous use of multiple compute resources to solve a single problem
Why use parallel computing?
. Save time and/or money
. Enables the solution of larger problems
. Provides concurrency
. Use of non-local resources
. Limits to serial computing
Concepts and Terminology
. Flynn’s Taxonomy (1966)
  SISD - Single Instruction, Single Data        SIMD - Single Instruction, Multiple Data
  MISD - Multiple Instruction, Single Data      MIMD - Multiple Instruction, Multiple Data
SISD
. Serial computer
. Deterministic execution
. Examples: older generation mainframes, workstations, PCs
SIMD
. A type of parallel computer
. All processing units execute the same instruction at any given clock cycle
. Each processing unit can operate on a different data element
. Two varieties: processor arrays and vector pipelines
. Most modern computers, particularly those with graphics processing units (GPUs), employ SIMD instructions and execution units
MISD
. A single data stream is fed into multiple processing units
. Each processing unit operates on the data independently via independent instruction streams
. Few actual examples exist: the Carnegie-Mellon C.mmp computer (1971)
MIMD
. Currently the most common type of parallel computer
. Every processor may be executing a different instruction stream
. Every processor may be working with a different data stream
. Execution can be synchronous or asynchronous, deterministic or non-deterministic
. Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs
. Note: many MIMD architectures also include SIMD execution sub-components
Parallel Computer Architectures
. Shared memory: all processors can access the same memory
. Uniform memory access (UMA):
  - identical processors
  - equal access and access times to memory
Non-uniform memory access (NUMA)
. Not all processors have equal access to all memories
. Memory access across the link is slower
. Advantages (of shared memory):
  - user-friendly programming perspective to memory
  - fast and uniform data sharing due to the proximity of memory to CPUs
. Disadvantages:
  - lack of scalability between memory and CPUs
  - programmer responsible for ensuring "correct" access of global memory
  - expense

Distributed memory
. Distributed memory systems require a communication network to connect inter-processor memory
. Advantages:
  - memory is scalable with the number of processors
  - no memory interference or overhead for maintaining cache coherency
  - cost effective
. Disadvantages:
  - programmer responsible for data communication between processors
  - difficult to map existing data structures to this memory organization

Hybrid Distributed-Shared Memory
. Generally used for the currently largest and fastest computers . Has a mixture of previously mentioned advantages and disadvantages
Parallel programming models
. Shared memory
. Threads
. Message passing
. Data parallel
. Hybrid
All of these can be implemented on any architecture.
Shared memory
. Tasks share a common address space, which they read and write asynchronously
. Various mechanisms such as locks and semaphores may be used to control access to the shared memory
. Advantage: no need to explicitly communicate data between tasks -> simplified programming
. Disadvantages:
  • need to take care when managing memory; avoid synchronization conflicts
  • harder to control data locality
Threads
. A thread can be considered a subroutine within the main program
. Threads communicate with each other through the global memory
. Commonly associated with shared memory architectures and operating systems
. Implementations: POSIX Threads (pthreads), OpenMP
Message Passing
. A set of tasks that use their own local memory during computation. . Data exchange through sending and receiving messages. . Data transfer usually requires cooperative operations to be performed by each process. For example, a send operation must have a matching receive operation.
. MPI (released in 1994)
. MPI-2 (released in 1997)
Data Parallel
. The data parallel model demonstrates the following characteristics:
  • most of the parallel work performs operations on a data set organized into a common structure, such as an array
  • a set of tasks works collectively on the same data structure, with each task working on a different partition
  • tasks perform the same operation on their partition
. On shared memory architectures, all tasks may have access to the data structure through global memory. On distributed memory architectures the data structure is split up and resides as "chunks" in the local memory of each task.

Other Models
. Hybrid - combines various models, e.g. MPI/OpenMP
. Single Program Multiple Data (SPMD) - a single program is executed by all tasks simultaneously
. Multiple Program Multiple Data (MPMD) - an MPMD application has multiple executables; each task can execute the same or a different program as other tasks
Designing Parallel Programs
. Examine the problem:
  - Can the problem be parallelized?
  - Are there data dependencies?
  - Where is most of the work done?
  - Identify bottlenecks (e.g. I/O)
. Partitioning:
  - How should the data be decomposed?
Various partitionings
How should the algorithm be distributed?
Communications
. Types of communication:
- point-to-point
- collective
Synchronization types
. Barrier
  • Each task performs its work until it reaches the barrier. It then stops, or "blocks".
  • When the last task reaches the barrier, all tasks are synchronized.
. Lock / semaphore
  • The first task to acquire the lock "sets" it. This task can then safely (serially) access the protected data or code.
  • Other tasks can attempt to acquire the lock but must wait until the task that owns the lock releases it.
  • Can be blocking or non-blocking
Load balancing
. Keep all tasks busy all of the time; minimize idle time
. The slowest task determines the overall performance
Granularity
. In parallel computing, granularity is a qualitative measure of the ratio of computation to communication
. Fine-grain parallelism:
  - low computation to communication ratio
  - facilitates load balancing
  - implies high communication overhead and less opportunity for performance enhancement
. Coarse-grain parallelism:
  - high computation to communication ratio
  - implies more opportunity for performance increase
  - harder to load balance efficiently
Parallel Computing Performance Metrics
. Let T(n,p) be the time to solve a problem of size n using p processors
- Speedup: S(n,p) = T(n,1)/T(n,p)
- Efficiency: E(n,p) = S(n,p)/p
- Scaled Efficiency: SE(n,p) = T(n,1)/T(pn,p)
An algorithm is scalable if there exists C>0 with SE(n,p) ≥ C for all p
Strong Scalability
. Increasing the number of processors for a fixed problem size
. [Figure: run times, speedup (method vs. perfect speedup), and efficiency for 1-8 processors]
Weak scalability
. Increasing the number of processors with a fixed problem size per processor
. [Figure: run times and scaled efficiency for 1-8 processors]
Amdahl's Law
. Maximal speedup = 1/(1-P), where P is the parallel portion of the code
Amdahl's Law
Speedup = 1/((P/N)+S), where N is the number of processors and S = 1-P the serial portion of the code
  [Figure: speedup curves for parallel fractions 50%, 75%, 90%, and 95%]

  N        P=0.50   P=0.90   P=0.99
  10       1.82     5.26     9.17
  100      1.98     9.17     50.25
  1000     1.99     9.91     90.99
  10000    1.99     9.91     99.02
  100000   1.99     9.99     99.90
Performance Model for distributed memory
. Time for n floating point operations:
  tcomp = f·n   (1/f is the machine speed in Mflops)
. Time for communicating n words:
  tcomm = α + β·n   (α: latency, 1/β: bandwidth)
. Total time: tcomp + tcomm
Parallel Algorithms and Implementations
. basic kernels
. Krylov methods
. multigrid
Vector Operations
. Adding two vectors, x = y + z, and axpys, y = y + αx, are embarrassingly parallel; each processor computes its own piece:
  z0 = x0 + y0 (Proc 0), z1 = x1 + y1 (Proc 1), z2 = x2 + y2 (Proc 2)
. Vector (inner) products, s = x^T y, require a reduction: each processor computes a local product s_i = x_i^T y_i, then s = s0 + s1 + s2
Dense n x n Matrix Vector Multiplication
. Matrix size n x n, p processors
. Partition the matrix into p contiguous blocks of rows (Proc 0, ..., Proc p-1); each processor computes its part of y = Ax and must receive the pieces of x owned by the other processors
. T = 2(n²/p)f + (p-1)(α+βn/p)
Sparse Matrix Vector Multiplication
. Matrix of size n² x n², p² processors
. Consider the graph of the matrix
. Partitioning influences run time:
  T1 ≈ 10(n²/p²)f + 2(α+βn)    (strip partitioning: 2 neighbors, messages of n words)
  vs.
  T2 ≈ 10(n²/p²)f + 4(α+βn/p)  (block partitioning: 4 neighbors, messages of n/p words)
Effect of Partitioning on Performance
. [Figure: speedup vs. number of procs (1-15) for optimized ("MPI opt") and non-optimized ("MPI-noopt") partitionings, compared to perfect speedup]
Sparse Matrix Vector Multiply with COO Data Structure
. Coordinate format (COO): real Data(nnz), integer Row(nnz), integer Col(nnz)
. Example matrix:
    1 0 0 2
    0 3 4 0
    0 5 6 0
    7 0 8 9
  Data 2 5 1 3 7 9 4 8 6
  Row  0 2 0 1 3 3 1 3 2
  Col  3 1 0 1 0 3 2 2 2
. Matrix-Vector multiplication:
  for (k=0; k<nnz; k++)
    y[Row[k]] += Data[k]*x[Col[k]];

Sparse Matrix Vector Multiply with COO Data Structure (continued)

. Parallelizing the same loop by splitting the nnz entries among processors:
  Data 2 3 1 5 7 | 9 4 8 6     (Proc 0 | Proc 1)
  Row  0 1 0 2 3 | 3 1 3 2
  Col  3 1 0 1 0 | 3 2 2 2
  - possible conflict: both processors update y in rows 1-3
. Reordering the entries so that each processor owns complete rows avoids the conflict:
  Data 2 3 1 4 | 5 7 9 8 6     (Proc 0 | Proc 1)
  Row  0 1 0 1 | 2 3 3 3 2
  Col  3 1 0 2 | 1 0 3 2 2

Sparse Matrix Vector Multiply with CSR Data Structure

. Compressed sparse row (CSR): real Data(nnz), integer Rowptr(n+1), integer Col(nnz)
  Data   1 2 3 4 5 6 7 8 9
  Rowptr 0 2 4 6 9
  Col    0 3 1 2 1 2 0 2 3
. Matrix-Vector multiplication:
  for (i=0; i<n; i++)
    for (k=Rowptr[i]; k<Rowptr[i+1]; k++)
      y[i] += Data[k]*x[Col[k]];

Sparse Matrix Vector Multiply with CSC Data Structure

. Compressed sparse column (CSC): real Data(nnz), integer Colptr(n+1), integer Row(nnz)
  Data   1 7 3 5 4 6 8 2 9
  Colptr 0 2 4 7 9
  Row    0 3 1 2 1 2 3 0 3
. Matrix-Vector multiplication:
  for (j=0; j<n; j++)
    for (k=Colptr[j]; k<Colptr[j+1]; k++)
      y[Row[k]] += Data[k]*x[j];

Parallel Matvec, CSR vs. CSC

. CSR: partition by rows; each processor computes the rows it owns (y0 = A0 x, y1 = A1 x, y2 = A2 x) with no write conflicts
. CSC: partition by columns; each processor computes a partial result vector from its columns, and the results are summed: y = y0 + y1 + y2

Sparse Matrix Vector Multiply with Diagonal Storage

. Diagonal storage: real Data(n,m), integer offset(m)
. Example matrix
    1 2 0 0
    0 3 0 0
    5 0 6 7
    0 8 0 9
  with offset -2 0 1 and
  Data 0 1 2
       0 3 0
       5 6 7
       8 9 0
. Matrix-Vector multiplication:
  for (i=0; i<n; i++)
    for (d=0; d<m; d++) {
      j = i + offset[d];
      if (j >= 0 && j < n) y[i] += Data[i][d]*x[j];
    }

Other Data Structures

. Block data structures (BCSR, BCSC): block size bxb, n divisible by b
  real Data(nnb*b*b), integer Ptr(n/b+1), integer Index(nnb)
. Structures based on space filling curves (e.g. Hilbert curves, Z-order):
  integer IJ(2*nnz), real Data(nnz)
. Matrix-Vector multiplication: loop over the stored (i,j) pairs in curve order, as in the COO kernel

Data Structures for Distributed Memory

. Concept for MATAIJ (PETSc) and ParCSR (hypre)
. The matrix is partitioned into contiguous blocks of rows, one block per processor (Proc 0, Proc 1, Proc 2)
  [Figure: an 11x11 sparse matrix; within each row block, the columns owned by the processor form a local block and the remaining columns are off-processor]
. Each processor stores:
  - a local CSR matrix (the columns it owns)
  - a nonlocal CSR matrix (the remaining columns)
  - a mapping from local to global indices

Communication Information

. Neighborhood information (number of neighbors)
. Amount of data to be sent and/or received
. Global partitioning for p processors, r0 r1 r2 ... rp:
  not good on very large numbers of processors!

Assumed partition (AP) algorithm

. Answering global distribution questions previously required O(P) storage and computation
. On BG/L, O(P) storage may not be possible
. The new algorithm requires
  • O(1) storage
  • O(log P) computations
. Enables scaling to 100K+ processors
. Assumed partition of N items over P processors: item i is assumed to belong to processor p = (iP)/N; actual partition information is sent to the assumed partition processors, which form a distributed directory
. The AP idea has general applicability beyond hypre

Assumed partition (AP) algorithm is more challenging for structured AMR grids

. AMR can produce grids with "gaps"
. Our AP function accounts for these gaps for scalability
. Demonstrated on 32K procs of BG/L
. A simple, naive AP function leaves processors with empty partitions

Effect of AP on one of hypre's multigrid solvers

. [Figure: run times on up to 64K procs on BG/P; the "global" partitioning times grow with processor count while the "assumed" partitioning times stay flat]

Machine specification - Hera (LLNL)

. Multi-core/multi-socket cluster
. 864 diskless nodes interconnected by Quad Data Rate (QDR) InfiniBand
. AMD Opteron quad-core processors (2.3 GHz)
. Fat tree network

Multicore cluster details (Hera)

. Individual 512 KB L2 cache for each core
. 2 MB L3 cache shared by 4 cores
. 4 sockets per node; 16 cores sharing the 32 GB main memory
. NUMA memory access

MxV Performance, 2-16 cores - OpenMP vs. MPI, 7pt stencil

. [Figure: Mflops vs. matrix size for 1, 2, 4, 8, 12, and 16 cores, OpenMP vs. MPI]

MxV Speedup for various matrix sizes, 7-pt stencil

. [Figure: speedup vs. number of threads (OpenMP) or tasks (MPI) for matrix sizes from 512 to 512000, compared to perfect speedup]

Iterative linear solvers
. Krylov solvers:
  - Conjugate gradient
  - GMRES
  - BiCGSTAB
. Generally consist of basic linear kernels

Conjugate Gradient for solving Ax=b

  k = 0; x_0 = 0; r_0 = b; rho_0 = r_0^T r_0;
  while (rho_k > eps^2 rho_0)
    Solve M z_k = r_k;
    gamma_k = r_k^T z_k;
    if (k = 0) p_1 = z_0 else p_{k+1} = z_k + (gamma_k / gamma_{k-1}) p_k;
    k = k+1;
    alpha_k = gamma_{k-1} / (p_k^T A p_k);
    x_k = x_{k-1} + alpha_k p_k;
    r_k = r_{k-1} - alpha_k A p_k;
    rho_k = r_k^T r_k;
  end
  x = x_k;

Weak scalability - diagonally scaled CG

. [Figure: total times for 100 iterations and number of iterations vs. number of procs; the iteration count grows with the problem size, so total time grows even though the time per iteration stays flat]

Achieve algorithmic scalability through multigrid

. [Figure: V-cycle between the finest grid and the first coarse grid: smooth, restrict, interpolate]

Multigrid parallelized by domain partitioning

. Partition fine grid
. Interpolation fully parallel
. Smoothing

Algebraic Multigrid (AMG)

Setup Phase:
  • Select coarse "grids"
  • Define interpolation P(m), m = 1,2,...
  • Define restriction and coarse-grid operators:
    R(m) = (P(m))^T,  A(m+1) = R(m) A(m) P(m)
Solve Phase (V-cycle on level m):
  Relax A(m) u_m = f_m
  Compute r_m = f_m - A(m) u_m
  Restrict r_{m+1} = R(m) r_m
  Solve A(m+1) e_{m+1} = r_{m+1} (recursively)
  Interpolate e_m = P(m) e_{m+1}
  Correct u_m = u_m + e_m
  Relax A(m) u_m = f_m

How about scalability of the multigrid V-cycle?

. Time for relaxation on the finest grid, with n x n points per proc:
  T_0 ≈ 4α + 4nβ + 5n²f
. Time for relaxation on level k:
  T_k ≈ 4α + 4(n/2^k)β + 5(n/2^k)²f
. Total time for the V-cycle, with L ≈ log2(pn) levels:
  T_V ≈ 2 Σ_k (4α + 4(n/2^k)β + 5(n/2^k)²f)
      ≈ 8(1+1+1+...)α + 8(1 + 1/2 + 1/4 + ...)nβ + 10(1 + 1/4 + 1/16 + ...)n²f
      ≈ 8Lα + 16nβ + (40/3)n²f
. Scaled efficiency: T_V(n,1)/T_V(pn,p) = O(1/log2(p)) for p >> 0; not strictly scalable, but the degradation is only logarithmic

How about scalability of the multigrid W-cycle?

. Total time for the W-cycle, with L ≈ log2(pn) levels:
  T_W ≈ 2 Σ_k 2^k (4α + 4(n/2^k)β + 5(n/2^k)²f)
      ≈ 8(1+2+4+...)α + 8(1 + 1 + 1 + ...)nβ + 10(1 + 1/2 + 1/4 + ...)n²f
      ≈ 8(2^(L+1) - 1)α + 16Lnβ + 20n²f
      ≈ 8(2pn - 1)α + 16 log2(pn) nβ + 20n²f
  - the latency term grows linearly in pn: the W-cycle is not scalable

Smoothers

. Smoothers significantly affect multigrid convergence and runtime
  • solving: Ax = b, A SPD
  • smoothing: x_{n+1} = x_n + M^{-1}(b - A x_n)
. Jacobi, M = D, easy to parallelize:
  x_{k+1}(i) = (1/A(i,i)) ( b(i) - Σ_{j=0, j≠i}^{n-1} A(i,j) x_k(j) ),  i = 0,...,n-1
. Gauss-Seidel, M = D + L, highly sequential:
  x_{k+1}(i) = (1/A(i,i)) ( b(i) - Σ_{j=0}^{i-1} A(i,j) x_{k+1}(j) - Σ_{j=i+1}^{n-1} A(i,j) x_k(j) ),  i = 0,...,n-1

Parallel Smoothers

. Red-black Gauss-Seidel
. Multi-color Gauss-Seidel

Hybrid smoothers

. Apply a sequential smoother within each processor block and a Jacobi-style update across blocks, M = diag(M_1, ..., M_p):
  - Hybrid Gauss-Seidel:             M_k = D_k + L_k
  - Hybrid symmetric Gauss-Seidel:   M_k = (D_k + L_k) D_k^{-1} (D_k + L_k)^T
  - Hybrid SOR:                      M_k = (1/ω) D_k + L_k
. Corresponding block schemes, Schwarz solvers, ILU, etc.
. Can be used for distributed as well as shared parallel computing
. Needs to be used with care
. Might require a smoothing parameter: M_ω = ω M

How is the smoother threaded?

. Matrices are distributed across P processors in contiguous blocks of rows: A = (A_1; A_2; ...; A_p)
. The smoother is threaded such that each thread works on a contiguous subset of the rows: A_p = (t_1; t_2; t_3; t_4)
. Hybrid: Gauss-Seidel within each thread, Jacobi on thread boundaries

FEM application: 2D Square

. The application partitions the grid element-wise into nice subdomains for each processor (16 procs; colors indicate element numbering)
. AMG determines thread subdomains by splitting nodes (not elements)
. The application's numbering of the elements/nodes becomes relevant (i.e., the threaded domains depend on that ordering)

We want to use OpenMP on multicore machines...

. Hybrid GS is not guaranteed to converge, but in practice it is effective for diffusion problems when subdomains are well-defined
. Some options:
  1. the application orders unknowns in each processor subdomain "nicely"
     • may be more difficult due to refinement
     • may impose a non-trivial burden on the application
  2. AMG reorders unknowns (rows)
     • non-trivial, costly
  3. smoothers that don't depend on parallelism

Coarsening algorithms

. Original coarsening algorithm
. Parallel coarsening algorithms

Choosing the Coarse Grid (Classical AMG)

. Add measures (the number of strong influences of a point)
. Pick a point with maximal measure and make it a C-point
. Make its neighbors F-points
. Pick a point with maximal measure and make it a C-point
. Make its neighbors F-points
. Increase the measures of neighbors of new F-points by the number of new F-neighbors
. Repeat until finished
  [Figure: the grid of measures is updated step by step until every point is a C-point or an F-point]

Parallel Coarsening Algorithms
. Perform the sequential algorithm on each processor, then deal with processor boundaries

PMIS: start

. Select C-points with maximal measure; select their neighbors as F-points
  [Figure: a 7x7 grid of measures: 8 at interior points, 5 on edges, 3 at corners]

PMIS: add random numbers

. Break ties by adding a random number in (0,1) to each measure
  [Figure: the same grid with measures such as 8.0, 8.5, 8.9, ...]

PMIS: exchange neighbor information

. Exchange boundary measures with neighboring processors

PMIS: select and update

. Select C-points with locally maximal measure; make their neighbors F-points (requires neighbor information)
. Update the measures of the remaining points and repeat the select and update steps
PMIS: final grid

. Continue selecting C-points with locally maximal measure, making their neighbors F-points and removing neighbor edges, until all points are C-points or F-points

Interpolation

. Nearest-neighbor interpolation is straightforward to parallelize
. Long-range interpolation is harder to parallelize

Communication patterns - 128 procs - Level 0

. [Figure: communication matrices for the setup and solve phases]

Communication patterns - 128 procs - Setup, distance 1 vs. distance 2 interpolation

. [Figure: communication matrices on level 4 for distance 1 and distance 2 interpolation]

Communication patterns - 128 procs - Solve, distance 1 vs. distance 2 interpolation

. [Figure: the corresponding communication matrices for the solve phase]

AMG-GMRES(10) on Hera, 7pt 3D Laplace problem on a unit cube, 100x100x100 grid points per node (16 procs per node)

. [Figure: setup and solve times on up to 3000 cores for Hera-1, Hera-2, BGL-1, and BGL-2]

Distance 1 vs. distance 2 interpolation on different computer architectures
. AMG-GMRES(10), 7pt 3D Laplace problem on a unit cube, using 50x50x25 points per processor (100x100x100 per node on Hera)
  [Figure: run times on up to 3000 cores for Hera-1, Hera-2, BGL-1, and BGL-2]

AMG-GMRES(10) on Hera, 7pt 3D Laplace problem on a unit cube, 100x100x100 grid points per node (16 procs per node)

. [Figure: setup times and solve cycle times on up to 10000 cores for MPI 16x1, H 8x2, H 4x4, H 2x8, OMP 1x16, and MCSUP 1x16]
. [Figure: total times on up to 12000 cores for the same configurations]

Blue Gene/P Solution - Intrepid, Argonne National Laboratory

. 40 racks with 1024 compute nodes each
. Quad-core 850 MHz PowerPC 450 processor
. 163,840 cores
. 2 GB main memory per node shared by all cores
. 3D torus network - isolated dedicated partitions

AMG-GMRES(10) on Intrepid, 7pt 3D Laplace problem on a unit cube, 50x50x25 grid points per core (4 cores per node)

. [Figure: setup times and cycle times on up to 100000 cores for smpH1x4, dualH2x2, vnMPI, and vnH4x1-omp]

AMG-GMRES(10) on Intrepid, 7pt 3D Laplace problem on a unit cube, 50x50x25 grid points per core (4 cores per node)
. Number of iterations:

  No. of cores   H1x4   H2x2   MPI
  128            17     17     18
  1024           20     20     22
  8192           25     25     25
  27648          29     27     28
  65536          30     33     29
  128000         37     36     36

. [Figure: total times on up to 100000 cores for H1x4, H2x2, MPI, and H4x1]

Cray XT-5 Jaguar-PF - Oak Ridge National Laboratory

. 18,688 nodes in 200 cabinets
. Two sockets with AMD Opteron hex-core processors per node
. 16 GB main memory per node
. Network: 3D torus/mesh with wrap-around links (torus-like) in two dimensions and without such links in the remaining dimension
. Compute Node Linux: limited set of services, but reduced noise ratio
. PGI's C compilers (v.904) with and without OpenMP; OpenMP settings: -mp=numa (preemptive thread pinning, localized memory allocations)

AMG-GMRES(10) on Jaguar, 7pt 3D Laplace problem on a unit cube, 50x50x30 grid points per core (12 cores per node)

. [Figure: setup times and cycle times on up to 200000 cores for MPI, H12x1, H1x12, H4x3, and H2x6]
. [Figure: total times on up to 200000 cores for MPI, H1x12, H4x3, and H2x6]

Thank You!

This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.