PGAS Applications: What, Where, and Why?
Kathy Yelick
Professor of Electrical Engineering and Computer Sciences, University of California, Berkeley; Associate Laboratory Director for Computing Sciences, Lawrence Berkeley National Laboratory
What is PGAS? Partitioned Global Address Space
[Figure: a global address space spanning processes p0 … pn; each process holds private variables (l:) and globally visible data (g:, x, y) that any process can address directly]
• Global Address Space: directly access remote memory
• Partitioned: programmer controls data layout for scalability
PGAS Languages and Libraries
• Language mechanisms for distributed arrays
  – Coarray Fortran: REAL :: X(2,3)[*]   local access X(1,2) … remote access X(1,3)[5]
  – UPC: shared double X[100]; or shared double X[THREADS*6];   access X[i] … X[MYTHREAD]
  – Chapel: const ProblemSpace = {1..m} dmapped Block(boundingBox={1..m}); var X: [ProblemSpace] real;
  – UPC++: upcxx::shared_array
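For concreteness, here is a minimal sketch of a block-distributed array in the current UPC++ API (which differs from the upcxx::shared_array syntax above); the block size, neighbor choice, and values are illustrative:

#include <upcxx/upcxx.hpp>

int main() {
  upcxx::init();
  const int n_local = 6;                                  // elements owned per rank (illustrative)
  // Each rank allocates its block of the distributed array in its shared segment
  upcxx::global_ptr<double> my_block = upcxx::new_array<double>(n_local);
  // Publish all blocks so any rank can reach any other rank's block
  upcxx::dist_object<upcxx::global_ptr<double>> blocks(my_block);
  upcxx::barrier();
  // Write element 2 of the block owned by the next rank
  int neighbor = (upcxx::rank_me() + 1) % upcxx::rank_n();
  upcxx::global_ptr<double> remote = blocks.fetch(neighbor).wait();
  upcxx::rput(3.14, remote + 2).wait();
  upcxx::barrier();
  upcxx::finalize();
}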
Simulation: More Regular; Analytics: More Irregular
Message Passing Programming: divide the domain into pieces; compute one piece; send/receive data from others (MPI, and many libraries).
Global Address Space Programming: each process starts computing and grabs whatever data it needs, whenever it needs it (UPC, UPC++, CAF, X10, Chapel, SHMEM, GA).
Distributed Arrays, Directory Style
Many UPC programs avoid UPC-style arrays in favor of directories of objects:
  typedef shared [] double *sdblptr;
  shared sdblptr directory[THREADS];
  directory[i] = upc_alloc(local_size*sizeof(double));
[Figure: a directory of per-thread blocks forming one distributed array]
• These are also more general (see the sketch below):
  – Multidimensional, unevenly distributed
  – Ghost regions around blocks
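The same directory idea carries over to UPC++; a hedged sketch (illustrative only, assuming the current dist_object API) in which each rank publishes a block of a different size:

#include <upcxx/upcxx.hpp>
#include <utility>

int main() {
  upcxx::init();
  // Each rank owns a block whose length may differ across ranks (uneven distribution)
  std::size_t my_n = 100 + 10 * (std::size_t)upcxx::rank_me();   // illustrative sizes
  upcxx::global_ptr<double> my_blk = upcxx::new_array<double>(my_n);
  my_blk.local()[0] = (double)upcxx::rank_me();                  // initialize my first element
  // The "directory": each rank publishes its block pointer and its extent
  upcxx::dist_object<std::pair<upcxx::global_ptr<double>, std::size_t>> dir({my_blk, my_n});
  upcxx::barrier();
  // Any rank can look up another rank's block and how large it is
  auto entry = dir.fetch(0).wait();                              // rank 0's {pointer, size}
  double first = upcxx::rget(entry.first).wait();                // read its first element
  upcxx::barrier();
  upcxx::finalize();
}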
[Figure: physical and conceptual 3D array layout]
Example: Hash Table in PGAS
[Figure: a hash table distributed over the global address space; key/value pairs (e.g., key: act, val: a; key: cga, val: g; key: gac, val: c) are spread across processes p0 … pn, and any process can access any bucket]
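A minimal sketch of a PGAS hash-table insert and lookup in the current UPC++ API (not the HipMer implementation; helper names such as dht_insert are made up here): the key is hashed to an owner rank, and an RPC operates on that rank's local map.

#include <upcxx/upcxx.hpp>
#include <string>
#include <unordered_map>

using Map = std::unordered_map<std::string, std::string>;

int owner_of(const std::string& key) {
  return (int)(std::hash<std::string>{}(key) % upcxx::rank_n());
}

// Insert executes on the owner's rank via an RPC against its local map
upcxx::future<> dht_insert(upcxx::dist_object<Map>& table,
                           const std::string& key, const std::string& val) {
  return upcxx::rpc(owner_of(key),
      [](upcxx::dist_object<Map>& t, const std::string& k, const std::string& v) {
        (*t)[k] = v;
      }, table, key, val);
}

// Lookup fetches the value from the owner's local map
upcxx::future<std::string> dht_find(upcxx::dist_object<Map>& table,
                                    const std::string& key) {
  return upcxx::rpc(owner_of(key),
      [](upcxx::dist_object<Map>& t, const std::string& k) { return (*t)[k]; },
      table, key);
}

int main() {
  upcxx::init();
  upcxx::dist_object<Map> table(Map{});          // one local map per rank
  dht_insert(table, "ACGGT", "contig-42").wait();
  upcxx::barrier();                              // all inserts have landed
  std::string v = dht_find(table, "ACGGT").wait();
  upcxx::finalize();
}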
Other Programming Model Variations
• What can you do remotely? (a short sketch follows this list)
  – Read, write, lock
  – Atomics (compare-and-swap, fetch-and-add)
  – Invoke functions
  – Signal processes to wake up (task graphs)
• What type of parallelism is there?
  – Data parallel (single-threaded semantics, e.g., A = B + C), with collective communication
  – Single Program Multiple Data (SPMD): if (MYTHREAD == 0) ...
  – Hierarchical SPMD (teams): if (MYTEAM ...) ...
  – Fork-join: fork / async
  – Task graph (events)
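A compact, hedged sketch of the "what can you do remotely" menu in the current UPC++ API; the counter layout and values are illustrative:

#include <upcxx/upcxx.hpp>
#include <atomic>
#include <cstdint>

int main() {
  upcxx::init();
  // Rank 0 hosts two shared words: [0] plain data, [1] an atomic counter
  upcxx::global_ptr<std::int64_t> p;
  if (upcxx::rank_me() == 0) {
    p = upcxx::new_array<std::int64_t>(2);
    p.local()[0] = 7; p.local()[1] = 0;
  }
  p = upcxx::broadcast(p, 0).wait();
  upcxx::atomic_domain<std::int64_t> ad({upcxx::atomic_op::fetch_add});

  std::int64_t v = upcxx::rget(p).wait();                              // remote read
  upcxx::rput(v, p).wait();                                            // remote write
  ad.fetch_add(p + 1, 1, std::memory_order_relaxed).wait();            // remote atomic
  int twice = upcxx::rpc(0, [](int x) { return 2 * x; }, 21).wait();   // remote function invocation

  upcxx::barrier();
  ad.destroy();
  upcxx::finalize();
}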
Where is PGAS programming used?
1. Asynchronous fine-grained reads/writes/atomics
De novo Genome Assembly
• DNA sequence consists of 4 bases: A/C/G/T
• Read: short fragment of DNA
• De novo assembly: construct a genome (chromosomes) from a collection of reads
Metagenome Assembly: Grand Challenge
For complex metagenomes (e.g., soil), most of the reads cannot be assembled.
De novo Genome Assembly à la Meraculous
Input: reads that may contain errors.
1. Chop reads into k-mers and process the k-mers to exclude errors (Bloom filter, statistical). Intensive I/O; high memory footprint.
2. Construct and traverse the de Bruijn graph of k-mers to generate contigs. Huge graph stored as a hash table; irregular accesses; injection limited.
3. Leverage read information to link contigs and generate scaffolds. Multiple hash tables; high memory requirements; intensive computation; intensive I/O.
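A rough illustration of stage 1 in plain C++ (not HipMer's code; the helper kmers is made up for this sketch):

#include <iostream>
#include <string>
#include <vector>

// Chop one read into its overlapping k-mers; erroneous k-mers would later be
// filtered statistically (e.g., via counts tracked through a Bloom filter).
std::vector<std::string> kmers(const std::string& read, int k) {
  std::vector<std::string> out;
  for (std::size_t i = 0; i + k <= read.size(); ++i)
    out.push_back(read.substr(i, k));
  return out;
}

int main() {
  // "GATCTGA" with k = 3 yields GAT, ATC, TCT, CTG, TGA
  for (const auto& km : kmers("GATCTGA", 3)) std::cout << km << "\n";
  // Each k-mer would then be hashed to an owner rank and inserted into the
  // distributed hash table (see the dht_insert sketch earlier).
}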
Georganas, Buluç, Chapman, Oliker, Rokhsar, Yelick [Aluru, Egan, Hofmeyr], SC14, IPDPS15, SC15
Application Challenge: Random Access to Large Data
• Parallel DFS (from randomly selected k-mers) to compute contigs
• Some tricky synchronization to deal with conflicts
[Figure: de Bruijn graph of k-mers (GAT, ATC, TCT, CTG, TGA, AAC, ACC, CCG, AAT, ATG, TGC) traversed to produce Contig 1: GATCTGA, Contig 2: AACCG, Contig 3: AATGC]
• Hash tables used in all phases
  – Different use cases, different implementations
• No a priori locality: that is the problem you’re trying to solve
HipMer (High-Performance Meraculous) Assembly Pipeline
Distributed Hash Tables in PGAS
• Remote atomics, dynamic aggregation, software caching (see the aggregation sketch below)
• 13x faster than another HPC/MPI code (Ray) on 960 cores
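A hedged sketch of the dynamic-aggregation idea in the current UPC++ API (not the HipMer implementation; the buffer size and helper names are illustrative): updates bound for the same rank are buffered locally and shipped in a single RPC when the buffer fills.

#include <upcxx/upcxx.hpp>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Map  = std::unordered_map<std::string, std::string>;
using Item = std::pair<std::string, std::string>;
constexpr std::size_t FLUSH_AT = 1024;          // tune to network/message size

// One buffer per destination rank; call outbox.resize(upcxx::rank_n()) after upcxx::init()
std::vector<std::vector<Item>> outbox;

void buffered_insert(upcxx::dist_object<Map>& table,
                     const std::string& key, const std::string& val) {
  int dest = (int)(std::hash<std::string>{}(key) % upcxx::rank_n());
  outbox[dest].push_back({key, val});
  if (outbox[dest].size() >= FLUSH_AT) {
    // Ship the whole batch in one fire-and-forget RPC
    upcxx::rpc_ff(dest,
        [](upcxx::dist_object<Map>& t, const std::vector<Item>& batch) {
          for (auto& kv : batch) (*t)[kv.first] = kv.second;
        }, table, outbox[dest]);
    outbox[dest].clear();
  }
  // Remaining partial buffers must be flushed, and all RPCs quiesced,
  // before any rank reads the table.
}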
[Figure: HipMer strong scaling on Edison for the human genome (480 to 15,360 cores) and wheat genome (960 to 15,360 cores); log-log plots of overall time and of the k-mer analysis, contig generation, and scaffolding phases (seconds vs. number of cores), closely tracking the ideal overall-time line]
Evangelos Georganas, Aydın Buluç, Jarrod Chapman, Steven Hofmeyr, Chaitanya Aluru, Rob Egan, Lenny Oliker, Dan Rokhsar, and Kathy Yelick. HipMer: An Extreme-Scale De Novo Genome Assembler, SC’15
Comparison to Other Assemblers
[Figure: "Runtime on Assemblers" in hours at equal core counts (960 cores on Edison) for Meraculous (contig only), SGA, ABySS (960), Ray (960), HipMer (960), and HipMer (20K); one bar is annotated at 140 hours (off scale), while HipMer at 20K cores finishes in about 4 minutes]
Science Impact: HipMer is transformative
• Human genome (3 Gbp) de novo assembled:
  – Meraculous: 48 hours
  – HipMer: 4 minutes (720x speedup relative to Meraculous)
• Wheat genome (17 Gbp) de novo assembled (2014):
  – Meraculous: did not run
  – HipMer: 39 minutes on 15K cores (first all-in-one assembly)
• Pine genome (20 Gbp) de novo assembled (2014):
  – MaSuRCA: 3 months, 1 TB RAM
• Wetland metagenome (1.25 Tbp) analysis (2015):
  – Meraculous (projected): 15 TB of memory
  – HipMer: strong scaling to over 100K cores (contig generation only)
Makes unsolvable problems solvable!
Georganas, Buluç, Chapman, Oliker, Rokhsar, Yelick [Aluru, Egan, Hofmeyr], SC14, IPDPS15, SC15
UPC++: PGAS with “Mixins” (Teams and Asyncs)
• UPC++ uses C++ templates (no custom compiler needed)
[Figure: a shared_var and per-rank private variables in the global address space]
• Research on teams for hierarchical algorithms and machines: teamsplit (team) { ... }
• Use these for “domain-specific” runtime systems
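A hedged sketch of team splitting in the current UPC++ API (the teamsplit block syntax above is from the earlier UPC++ design; the group size of 4 is illustrative):

#include <upcxx/upcxx.hpp>

int main() {
  upcxx::init();
  // Split the world team into groups of 4 consecutive ranks, e.g. one team per node
  int color = upcxx::rank_me() / 4;            // which sub-team this rank joins
  int key   = upcxx::rank_me() % 4;            // rank ordering within the sub-team
  upcxx::team node_team = upcxx::world().split(color, key);
  // Collectives and rank numbering can now be done relative to the sub-team
  upcxx::barrier(node_team);
  int my_local_rank = node_team.rank_me();
  node_team.destroy();
  upcxx::finalize();
}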
Where is PGAS programming used?
1. Asynchronous fine-grained reads/writes/atomics (aggregation and software caching when possible)
2. Strided irregular updates (adds) to a distributed matrix
Application Challenge: Data Fusion
Distributed Matrix Construction
• Remote asyncs with user-controlled resource management
• Divide threads into injectors / updaters
• 6x faster than MPI 3.0 on 1K XE6 nodes
• “Fusing” observational data into simulation
• Interoperates with MPI/Fortran/ScaLAPACK
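A minimal sketch of the remote-update idea in the current UPC++ API (the real code divides threads into injectors and updaters and manages resources explicitly, which is not shown; the row-block layout and sizes are assumptions): an update to element (i, j) is routed to the rank owning that row block and accumulated there.

#include <upcxx/upcxx.hpp>
#include <vector>

// Assumed 1-D row-block decomposition: rank r owns rows [r*ROWS_PER_RANK, (r+1)*ROWS_PER_RANK)
constexpr long ROWS_PER_RANK = 1000;            // illustrative
constexpr long NCOLS         = 512;             // illustrative

using Block = std::vector<double>;              // row-major local block of the distributed matrix

int owner_of_row(long i) { return (int)(i / ROWS_PER_RANK); }

// Fire-and-forget accumulate into the owning rank's block; the caller must
// quiesce (e.g., completion tracking plus a barrier) before reading the matrix.
void accumulate(upcxx::dist_object<Block>& A, long i, long j, double v) {
  upcxx::rpc_ff(owner_of_row(i),
      [](upcxx::dist_object<Block>& blk, long i, long j, double v) {
        long li = i % ROWS_PER_RANK;            // local row index on the owner
        (*blk)[li * NCOLS + j] += v;            // accumulate (add) into the owner's block
      }, A, i, j, v);
}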
Scott French, Yili Zheng, Barbara Romanowicz, Katherine Yelick, "Parallel Hessian Assembly for Seismic Waveform Inversion Using Global Updates", IPDPS 2015
Science Impact: Whole-Mantle Seismic Model
• First-ever whole-mantle seismic model from numerical waveform tomography
• Finding: most volcanic hotspots are linked to two spots on the boundary between the metal core and rocky mantle, 1,800 miles below Earth's surface
Makes unsolvable problems solvable!
Scott French, Barbara Romanowicz, "Broad plumes rooted at the base of the Earth’s mantle beneath major hotspots", Nature, 2015
Multidimensional Arrays in UPC++
• UPC++ arrays have a rich set of operations
  – translate, restrict, slice (n dims to n-1), transpose
• These create new views of the data in the original array
• Example: ghost cell exchange in AMR (copy the intersection of one block's interior with a neighbor's ghost region)
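A hedged sketch of a ghost-cell exchange written directly against global pointers (this is not the UPC++ multidimensional array library interface; the column-major layout, block size, and neighbor pattern are assumptions):

#include <upcxx/upcxx.hpp>
#include <cstddef>

// Each rank owns an N x N block plus one ghost layer on each side, stored
// column-major so that one column (the exchanged face) is contiguous.
constexpr int N   = 64;
constexpr int LDA = N + 2;                      // rows per column, incl. ghost rows

int main() {
  upcxx::init();
  upcxx::global_ptr<double> block = upcxx::new_array<double>((std::size_t)LDA * LDA);
  upcxx::dist_object<upcxx::global_ptr<double>> dir(block);
  double* mine = block.local();                 // my block is local to me
  for (int i = 0; i < LDA * LDA; ++i) mine[i] = (double)upcxx::rank_me();
  upcxx::barrier();

  int right = (upcxx::rank_me() + 1) % upcxx::rank_n();
  upcxx::global_ptr<double> rblock = dir.fetch(right).wait();

  // Push my last interior column into the right neighbor's left ghost column
  upcxx::rput(mine + (std::size_t)N * LDA,      // source: my last interior column
              rblock,                           // destination: neighbor's ghost column 0
              (std::size_t)LDA).wait();
  upcxx::barrier();                             // all exchanges complete
  upcxx::finalize();
}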
Where is PGAS programming used?
1. Asynchronous fine-grained reads/writes/atomics (aggregation and software caching when possible)
2. Strided irregular updates (adds) to a distributed matrix
3. Dynamic work stealing
Application Challenge: Dynamic Load Balancing
• Hartree-Fock example (e.g., in NWChem, which is already PGAS)
  – Inherent load imbalance
• UPC++ version (see the work-sharing sketch below)
  – Dynamic work stealing and fast atomic operations improve load balance
  – Transpose an irregularly blocked matrix
[Figure: a 4 x 4 grid of blocks numbered 0-15, with one block highlighted as the local array on a given process]
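A hedged sketch of the atomics-based work sharing idea in the current UPC++ API (illustrative; the actual code also steals from per-rank task queues): ranks claim task indices from a shared counter with a remote fetch-and-add.

#include <upcxx/upcxx.hpp>
#include <atomic>
#include <cstdint>

int main() {
  upcxx::init();
  const std::int64_t n_tasks = 10000;            // e.g., blocks of the Fock matrix
  upcxx::atomic_domain<std::int64_t> ad({upcxx::atomic_op::fetch_add});

  // Rank 0 hosts the shared task counter; every rank learns its address
  upcxx::global_ptr<std::int64_t> counter;
  if (upcxx::rank_me() == 0) counter = upcxx::new_<std::int64_t>(0);
  counter = upcxx::broadcast(counter, 0).wait();

  // Each rank repeatedly claims the next unprocessed task index
  while (true) {
    std::int64_t t = ad.fetch_add(counter, 1, std::memory_order_relaxed).wait();
    if (t >= n_tasks) break;
    // ... process task t here ...
  }
  upcxx::barrier();
  ad.destroy();
  upcxx::finalize();
}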
David Ozog (CSGF Fellow), A. Kamil, Y. Zheng, P. Hargrove, J. Hammond, A. Malony, W. de Jong, K. Yelick
Hartree-Fock Code
Improved Scalability
[Figure: strong scaling of the UPC++ Hartree-Fock code on NERSC Edison, compared to the (highly optimized) GTFock with Global Arrays]
Towards NWChem in UPC++
• New Global Arrays toolkit over GASNet: increased scalability!
• Over 20% faster on InfiniBand
• More scalable aggregate FFTs than FFTW
[Figures: (left) wall-clock time in seconds of the GA-over-GASNet version versus the base Global Arrays version on 512 to 2048 MPI ranks (16 per node, 8 per socket) on Edison; (right) FFT performance in GFLOPS (5 N log N / T) versus number of processor cores (16 per node, 1 to 32,768 cores) for FFTW and several multi-FFT variants on 256^3 problems, with annotations of 8.6 TFLOPS and 4.6 TFLOPS]
• Goal of making this ready for production use (Bert de Jong)
E. Hoffman (GAGA), H. Simhadri (FFTs)
Where is PGAS programming used?
1. Asynchronous fine-grained reads/writes/atomics (aggregation and software caching when possible)
2. Strided irregular updates (adds) to a distributed matrix
3. Dynamic work stealing
4. Hierarchical algorithms / one programming model
UPC++ Communication Speeds up AMR
• Adaptive Mesh Refinement on block-structured meshes
  – Used in ice sheet modeling, climate, subsurface (fracking), astrophysics, accelerator modeling, and many more
[Figure: FillBoundary test on 2048 Cori cores; renormalized time (lower is better) for MPI, MPI/OpenMP, hierarchical MPI, UPC++, UPC++/OpenMP, and hierarchical UPC++ variants, ranging from about 1.13 down to 0.52]
• Hierarchical UPC++ (distributed / shared style): UPC++ plus UPC++ is 2x faster than MPI plus OpenMP
• MPI + MPI also does well
Reducing Metadata Overhead in AMR
• Scaling the AMR grid hierarchy: reducing the metadata size
• Distribute the grid hierarchy data structure using UPC
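A hedged sketch of the distributed/shared "hierarchical" style in the current UPC++ API (illustrative, not the AMR code): ranks on the same node read a neighbor's data directly through shared memory, while off-node data moves with one-sided rget/rput.

#include <upcxx/upcxx.hpp>

int main() {
  upcxx::init();
  upcxx::global_ptr<double> patch = upcxx::new_array<double>(1024);
  patch.local()[0] = (double)upcxx::rank_me();
  upcxx::dist_object<upcxx::global_ptr<double>> dir(patch);
  upcxx::barrier();

  int nbr = (upcxx::rank_me() + 1) % upcxx::rank_n();
  upcxx::global_ptr<double> remote = dir.fetch(nbr).wait();

  double ghost0;
  if (remote.is_local()) {
    // Same node (same local_team): bypass communication, use a raw pointer
    ghost0 = remote.local()[0];
  } else {
    // Off-node: one-sided get over the network
    ghost0 = upcxx::rget(remote).wait();
  }
  upcxx::barrier();
  upcxx::finalize();
}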
Slide source: Mike Norman, UCSD (Enzo code)
Where is PGAS programming used?
1. Asynchronous fine-grained reads/writes/atomics (aggregation and software caching when possible)
2. Strided irregular updates (adds) to a distributed matrix
3. Dynamic work stealing
4. Hierarchical algorithms / one programming model
5. Task graph scheduling (UPC++)
[Excerpt from a paper on parallel sparse Cholesky factorization shown on the slide:] Column j of L is used to update the remaining columns of A; if A is dense, every column k > j is updated. Once the factorization is computed, the solution of the original linear system is obtained by solving two triangular systems with the Cholesky factor L. For large-scale applications A is often sparse, and during the column updates some zero entries become nonzero; this extra fill-in means the columns of L tend to become denser from left to right (see [1] for an in-depth discussion). The dependencies among columns are captured by the elimination tree: a directed acyclic graph with vertices v1, ..., vn, where vertex vi corresponding to column i is the parent of vj (for i > j) if and only if l_ij is the first off-diagonal nonzero entry in column j of L; see [2] for details. Expressed over supernodes rather than columns, this becomes a supernodal elimination tree (Figure 1b in the paper). When scheduling the numerical factorization in parallel, the only constraints that must be respected are these numerical dependencies: column k of A has to be updated by column j of L for any j < k with l_kj nonzero.
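A hedged sketch of task-graph scheduling with UPC++ futures (illustrative; not the sparse Cholesky code, and the dependency structure here is just a chain): a column task runs once the futures of the columns it depends on have completed.

#include <upcxx/upcxx.hpp>
#include <vector>

int main() {
  upcxx::init();
  const int n = 8;                                       // number of column tasks (illustrative)
  std::vector<std::vector<int>> deps(n);                 // deps[k] = predecessors of task k
  for (int k = 1; k < n; ++k) deps[k].push_back(k - 1);  // simple chain, for example

  std::vector<upcxx::future<>> done(n);
  for (int k = 0; k < n; ++k) {
    upcxx::future<> ready = upcxx::make_future();        // trivially ready
    for (int d : deps[k]) ready = upcxx::when_all(ready, done[d]);
    // Schedule the task body to run once its dependencies are satisfied
    done[k] = ready.then([k]() { /* factor/update column k here */ });
  }
  done[n - 1].wait();                                    // the chain ends at the last task
  upcxx::finalize();
}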