PGAS Applications: What, Where and Why?

Kathy Yelick

Professor of Electrical Engineering and Computer Sciences, University of California at Berkeley; Associate Laboratory Director for Computing Sciences, Lawrence Berkeley National Laboratory

What is PGAS? Partitioned Global Address Space

[Figure: PGAS memory model; a shared array (2, 4, 6, 8, …, 22, 24) and shared per-thread variables x and y live in the global address space across processors p0, p1, …, pn, referenced by local pointers (l) and global pointers (g)]
• Global Address Space: directly access remote memory
• Partitioned: programmer controls data layout
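As a rough illustration of these two ideas (not tied to any particular PGAS language or library), a global address can be modeled as a (rank, local offset) pair, with the owning rank determined by a block layout the programmer chooses; the names global_addr and block_owner below are made up for the sketch.

  #include <cstddef>
  #include <cstdio>

  // Hypothetical sketch: a "global address" names an element of a block-distributed
  // array by the rank that owns it plus an offset into that rank's local block.
  struct global_addr {
      int         rank;    // which process's partition holds the element
      std::size_t offset;  // index within that process's local block
  };

  // Block layout chosen by the programmer: element i lives on rank i / block_size.
  global_addr block_owner(std::size_t i, std::size_t block_size) {
      return { static_cast<int>(i / block_size), i % block_size };
  }

  int main() {
      const std::size_t n = 24, nranks = 4, block = n / nranks;
      const std::size_t idx[] = {0, 7, 23};
      for (std::size_t i : idx) {
          global_addr a = block_owner(i, block);
          std::printf("element %zu -> rank %d, local offset %zu\n", i, a.rank, a.offset);
      }
  }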

DEGAS Overview 2 PGAS Languages and Libraries

• Language mechanisms for distributed arrays
  – Coarray Fortran: REAL :: X(2,3)[*]; X(1,2) …. X(1,3)[5]

– UPC: shared double X[100]; or shared double X[THREADS*6]; X[i] …. X[MYTHREAD]

– Chapel const ProblemSpace= {1..m} dmapped Block(boundingBox={1..m}); var X: [ProblemSpace] real;

– UPC++: upcxx::shared_array<double> X; X.init(128); or X.init(128, 4); X[upcxx::ranks()] ... X[6]
– Many others...

3 Programming Data Analytics vs Simulation

Simulation: More Regular; Analytics: More Irregular

Message Passing Programming: divide up the domain into pieces; each process computes its own piece and sends/receives data from others. Examples: MPI, and many libraries.

Global Address Space Programming: each thread starts computing and grabs whatever data it needs, whenever it needs it. Examples: UPC, UPC++, CAF, X10, Chapel, SHMEM, GA.

DEGAS Overview 4 Distributed Arrays Directory Style

Many UPC programs avoid the UPC-style arrays in favor of directories of objects:

  typedef shared [] double *sdblptr;
  shared sdblptr directory[THREADS];
  directory[i] = upc_alloc(local_size * sizeof(double));
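A minimal single-process C++ analog of this directory pattern (purely illustrative: in UPC the directory entries are pointers-to-shared and the blocks are remotely accessible) shows how a global index resolves to an owner and a local offset through the directory.

  #include <cstdio>
  #include <vector>

  // Single-process analog of the UPC directory pattern above: a directory of
  // per-"thread" pointers, each to a separately allocated block. Everything is
  // local here; the point is only the indexing.
  int main() {
      const int THREADS = 4;
      const std::size_t local_size = 6;

      std::vector<std::vector<double>> blocks(THREADS);   // stand-in for upc_alloc'd blocks
      std::vector<double*> directory(THREADS);            // one entry per thread
      for (int t = 0; t < THREADS; ++t) {
          blocks[t].assign(local_size, t * 100.0);
          directory[t] = blocks[t].data();
      }

      // Resolve a global index i to (owner thread, local offset) via the directory.
      std::size_t i = 15;
      int owner = static_cast<int>(i / local_size);
      std::size_t off = i % local_size;
      std::printf("global index %zu -> directory[%d][%zu] = %g\n",
                  i, owner, off, directory[owner][off]);
  }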

Distributed Arrays Directory Style (continued)
[Figure: the directory array holds one pointer per thread, each referencing that thread's locally allocated block]

• These are also more general:
  – Multidimensional, unevenly distributed
  – Ghost regions around blocks

[Figure: physical and conceptual 3D array layout]

Example: Hash Table in PGAS

[Figure: distributed hash table in the global address space across processors p0, p1, …, pn; buckets hold key/value pairs such as (act, a), (cga, g), (gac, c), (tac, c), (cca, t), (gta, c)]
• ⟨key, value⟩ pairs, stored in some bucket based on hash(key)
• One-sided communication: never having to say “receive”
• Allows for terabyte-to-petabyte size data sets vs. ~1 TB in shared memory
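A minimal sketch of the placement logic, independent of any particular runtime: hash(key) chooses both the owning rank and the bucket within that rank's table; in the real code the insert would be a one-sided remote write or atomic, which is omitted here. The constants and helper names are made up.

  #include <cstdio>
  #include <functional>
  #include <string>

  // Sketch of PGAS hash-table placement: hash(key) picks the owning rank and the
  // bucket within that rank's local table. Only the placement is computed here.
  const int NRANKS = 8;
  const std::size_t BUCKETS_PER_RANK = 1 << 20;

  int owner_of(const std::string& key) {
      return static_cast<int>(std::hash<std::string>{}(key) % NRANKS);
  }
  std::size_t bucket_of(const std::string& key) {
      return (std::hash<std::string>{}(key) / NRANKS) % BUCKETS_PER_RANK;
  }

  int main() {
      for (std::string key : {"act", "cga", "gac", "tac"})
          std::printf("key %-4s -> rank %d, bucket %zu\n",
                      key.c_str(), owner_of(key), bucket_of(key));
  }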

DEGAS Overview 7 Other Programming Model Variations

• What can you do remotely?
  – Read, Write, Lock
  – Atomics (compare-and-swap, fetch-and-add)
  – Invoke functions
  – Signal processes to wake up (task graphs)
• What type of parallelism is there?
  – Data parallel (single-threaded semantics, e.g., A = B + C); collective communication
  – Single Program Multiple Data (SPMD): if (MYTHREAD == 0) ….
  – Hierarchical SPMD (teams): if (MYTEAM ….) ...
  – Fork-join: fork / async
  – Task graph (events)

Programming Models and Environments 8 Where is PGAS programming used?

1. Asynchronous fine-grained reads/write/atomics

9 De novo Genome Assembly

• DNA sequence consists of 4 bases: A/C/G/T
• Read: short fragment of DNA
• De novo assembly: construct a genome (chromosomes) from a collection of reads
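For concreteness, a minimal sketch of the k-mer step (k = 3 here purely for readability; real assemblies use much larger k, and error filtering is omitted):

  #include <cstdio>
  #include <string>
  #include <vector>

  // Chop a read into overlapping k-mers.
  std::vector<std::string> kmers(const std::string& read, std::size_t k) {
      std::vector<std::string> out;
      if (read.size() >= k)
          for (std::size_t i = 0; i + k <= read.size(); ++i)
              out.push_back(read.substr(i, k));
      return out;
  }

  int main() {
      for (const auto& km : kmers("GATCTGA", 3))
          std::printf("%s\n", km.c_str());   // GAT ATC TCT CTG TGA
  }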

DEGAS Overview 10 Metagenome Assembly: Grand Challenge

For complex metagenomes (e.g., soil), most of the reads cannot be assembled

DEGAS Overview 11 De novo Genome Assembly a la Meraculous

Input: reads that may contain errors

1. Chop reads into k-mers; filter k-mers to exclude errors
   • Bloom filter (statistical)
   • Intensive I/O
   • High memory footprint
2. Construct & traverse the de Bruijn graph of k-mers, generate contigs
   • Huge graph stored as a hash table
   • Irregular accesses
   • Injection limited
3. Leverage read information to link contigs and generate scaffolds
   • Multiple hash tables
   • High memory requirements
   • Intensive computation
   • Intensive I/O
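As a rough illustration of step 1's statistical filter (not the HipMer implementation), a Bloom filter can record k-mers seen once so that only k-mers seen again, which are unlikely to be pure sequencing errors, are promoted to the k-mer hash table; the sizes and hash functions below are arbitrary.

  #include <bitset>
  #include <cstdio>
  #include <functional>
  #include <string>

  // Minimal Bloom-filter sketch: first sightings only set bits in the filter;
  // a k-mer is kept for the hash table once the filter says it was probably seen before.
  const std::size_t M = 1 << 20;                 // bits in the filter
  std::bitset<M> filter;

  std::size_t h1(const std::string& s) { return std::hash<std::string>{}(s) % M; }
  std::size_t h2(const std::string& s) { return (std::hash<std::string>{}(s + "#") * 31) % M; }

  bool probably_seen(const std::string& kmer) { return filter[h1(kmer)] && filter[h2(kmer)]; }
  void record(const std::string& kmer)        { filter[h1(kmer)] = filter[h2(kmer)] = true; }

  int main() {
      const char* stream[] = {"GAT", "ATC", "GAT", "TTT"};
      for (const std::string kmer : stream) {
          if (probably_seen(kmer))
              std::printf("%s: seen before, keep for the k-mer hash table\n", kmer.c_str());
          else
              record(kmer);                       // first (possibly erroneous) sighting
      }
  }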

Georganas, Buluc, Chapman, Oliker, Rokhsar, Yelick, [Aluru,Egan,Hofmeyr] in SC14, IPDPS15, SC15 12 Application Challenge: Random Access to Large Data

• Parallel DFS (from randomly selected k-mers) to compute contigs
• Some tricky synchronization to deal with conflicts

[Figure: de Bruijn graph of k-mers (GAT, ATC, TCT, CTG, TGA; AAC, ACC, CCG; AAT, ATG, TGC) traversed to produce Contig 1: GATCTGA, Contig 2: AACCG, Contig 3: AATGC]

• Hash tables used in all phases
  – Different use cases, different implementations
• No a priori locality: that is the problem you’re trying to solve
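A conceptual single-node sketch of the traversal (distributed hash table, one-sided lookups, and conflict handling omitted): each k-mer is stored with its forward extension base, and a walk from a seed k-mer repeatedly looks up the rightmost k-mer to extend the contig.

  #include <cstdio>
  #include <string>
  #include <unordered_map>

  int main() {
      // k = 3 k-mers of the contig GATCTGA, each mapped to the base that follows it.
      std::unordered_map<std::string, char> next_base = {
          {"GAT", 'C'}, {"ATC", 'T'}, {"TCT", 'G'}, {"CTG", 'A'}
      };

      std::string contig = "GAT";                       // randomly selected seed k-mer
      std::string cur = contig;
      for (auto it = next_base.find(cur); it != next_base.end(); it = next_base.find(cur)) {
          contig.push_back(it->second);                 // extend the contig by one base
          cur = contig.substr(contig.size() - 3);       // new rightmost k-mer
      }
      std::printf("contig: %s\n", contig.c_str());      // GATCTGA
  }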

DEGAS Overview 13 HipMer (High Performance Meraculous) Assembly Pipeline

Distributed Hash Tables in PGAS
• Remote atomics, dynamic aggregation, software caching
• 13x faster than another HPC/MPI code (Ray) on 960 cores

[Figure: strong scaling of HipMer for the human and wheat genomes; runtime in seconds (overall time, k-mer analysis, contig generation, scaffolding, and ideal overall time) vs. number of cores, 480–15,360 cores for human and 960–15,360 for wheat]

Evangelos Georganas, Aydın Buluç, Jarrod Chapman, Steven Hofmeyr, Chaitanya Aluru, Rob Egan, Lenny Oliker, Dan Rokhsar, and Kathy Yelick. HipMer: An Extreme-Scale De Novo Genome Assembler, SC’15

Comparison to Other Assemblers
[Figure: runtime in hours at equal core counts (960 cores on Edison) for Meraculous, SGA, ABySS, Ray, HipMer (960 cores), and HipMer (20K cores, contig generation only); times range from roughly 140 hours down to about 4 minutes]

Science Impact: HipMer is transformative

• Human genome (3 Gbp) “de novo” assembled:
  – Meraculous: 48 hours
  – HipMer: 4 minutes (720x relative to Meraculous)
• Wheat genome (17 Gbp) “de novo” assembled (2014):
  – Meraculous: did not run
  – HipMer: 39 minutes; 15K cores (first all-in-one assembly)
• Pine genome (20 Gbp) “de novo” assembled (2014):
  – MaSuRCA: 3 months; 1 TB RAM
• Wetland metagenome (1.25 Tbp) analysis (2015):
  – Meraculous (projected): 15 TB of memory
  – HipMer: strong scaling to over 100K cores (contig generation only)
Makes unsolvable problems solvable!
Georganas, Buluc, Chapman, Oliker, Rokhsar, Yelick, [Aluru, Egan, Hofmeyr] in SC14, IPDPS15, SC15

DEGAS Overview 16 UPC++: PGAS with “Mixins” (Teams and Asyncs)

• UPC++ uses templates (no compiler needed):
  shared_var<T> s;
  global_ptr<T> g;
  shared_array<T> sa(8);
[Figure: shared variable s = 16, global pointer g, and shared array sa (18, 63, 27, …) in the global address space across p0, p1, p2, alongside per-thread private variables x and y]
• Default execution model is SPMD, but …
• Remote methods, async:
  async(place)(Function f, T1 arg1, …);
  wait(); // other side does poll();

• Research in teams for hierarchical algorithms and machines:
  teamsplit(team) { ... }

• Use these for “domain-specific” runtime systems
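As a rough single-process analog of the async/wait style (this uses std::async and std::future, not the UPC++ API), work is launched "somewhere else", a handle is kept, and synchronization happens later.

  #include <cstdio>
  #include <future>
  #include <vector>

  // In UPC++ the "somewhere else" is another rank and the handle is completed
  // by the remote side making progress (poll()); here it is just another thread.
  double update_block(int block, double scale) {
      return block * scale;                      // stand-in for a remote method
  }

  int main() {
      std::vector<std::future<double>> pending;
      for (int b = 0; b < 4; ++b)                // cf. async(place)(update_block, b, 2.0)
          pending.push_back(std::async(std::launch::async, update_block, b, 2.0));

      double sum = 0;
      for (auto& f : pending) sum += f.get();    // cf. wait() for all outstanding asyncs
      std::printf("sum = %g\n", sum);
  }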

17 Where is PGAS programming used?

1. Asynchronous fine-grained reads/writes/atomics (aggregation and software caching when possible)
2. Strided irregular updates (adds) to a distributed matrix

18 Application Challenge: Data Fusion

Distributed Matrix Construction
• Remote asyncs with user-controlled resource management
• Divide threads into injectors / updaters (see the sketch below)
• 6x faster than MPI 3.0 on 1K XE6 nodes

• “Fusing” observational data into simulation
• Interoperates with MPI/Fortran/ScaLAPACK
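A highly simplified shared-memory sketch of the injector/updater split referenced above (not the actual IPDPS'15 implementation): injector threads produce (index, value) updates and a dedicated updater thread applies them to the locally owned part of the matrix.

  #include <atomic>
  #include <cstdio>
  #include <mutex>
  #include <queue>
  #include <thread>
  #include <vector>

  struct Update { std::size_t idx; double val; };

  std::queue<Update> q;          // stand-in for the buffer fed by remote asyncs
  std::mutex m;
  std::atomic<bool> done{false};

  void injector(int id) {        // produces updates destined for the local partition
      for (int i = 0; i < 1000; ++i) {
          std::lock_guard<std::mutex> lk(m);
          q.push({static_cast<std::size_t>((id * 1000 + i) % 64), 1.0});
      }
  }

  void updater(std::vector<double>& local) {   // applies queued updates
      for (;;) {
          std::unique_lock<std::mutex> lk(m);
          if (!q.empty()) {
              Update u = q.front(); q.pop();
              lk.unlock();
              local[u.idx] += u.val;
          } else if (done) {
              break;
          }
      }
  }

  int main() {
      std::vector<double> local(64, 0.0);
      std::thread upd(updater, std::ref(local));
      std::vector<std::thread> inj;
      for (int t = 0; t < 3; ++t) inj.emplace_back(injector, t);
      for (auto& th : inj) th.join();
      done = true;
      upd.join();
      double total = 0;
      for (double v : local) total += v;
      std::printf("updates applied: %g (expected 3000)\n", total);
  }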

Scott French, Yili Zheng, Barbara Romanowicz, Katherine Yelick, "Parallel Hessian Assembly for Seismic Waveform Inversion Using Global Updates", IPDPS 2015
DEGAS Overview 19 Application Challenge: Data Fusion

Scott French, Yili Zheng, Barbara Romanowicz, Katherine Yelick, "Parallel Hessian Assembly for Seismic Waveform Inversion Using Global Updates", IPDPS 2015
DEGAS Overview 20 Science Impact: Whole-Mantle Seismic Model

• First-ever whole-mantle seismic model from numerical waveform tomography
• Finding: most volcanic hotspots are linked to two spots on the boundary between the metal core and rocky mantle, 1,800 miles below Earth's surface.

Makes unsolvable problems solvable!

Scott French, Barbara Romanowicz, "Broad plumes rooted at the base of the Earth’s mantle beneath major hotspots", Nature, 2015
DEGAS Overview 21 Multidimensional Arrays in UPC++

• UPC++ arrays have a rich set of operations:

  – translate, restrict, slice (n dim to n-1), transpose
• Create new views of the data in the original array
• Example: ghost cell exchange in AMR
  ndarray gridB = bArrays[i, j, k]; …
  gridA.async_copy(gridB.shrink(1));
[Figure: gridA’s ghost cells are filled from the intersection (copied area) with gridB’s interior]
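A conceptual 1-D sketch of the ghost-cell exchange (the UPC++ array library does this for n-dimensional views with one-sided async_copy; here everything is local and one-dimensional):

  #include <cstdio>
  #include <vector>

  // Each grid owns cells [1, n] and keeps ghost cells at index 0 and n+1; the ghost
  // region of one grid is filled from the adjacent interior cell of its neighbor.
  int main() {
      const std::size_t n = 6;
      std::vector<double> gridA(n + 2, 1.0), gridB(n + 2, 2.0);

      // Analogous in spirit to gridA.async_copy(gridB.shrink(1)):
      gridA[n + 1] = gridB[1];   // gridA's right ghost cell gets gridB's first interior cell
      gridB[0]     = gridA[n];   // and vice versa for the other direction

      std::printf("gridA right ghost = %g, gridB left ghost = %g\n", gridA[n + 1], gridB[0]);
  }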

DEGAS Overview 22 Where is PGAS programming used?

1. Asynchronous fine-grained reads/writes/atomics (aggregation and software caching when possible)
2. Strided irregular updates (adds) to a distributed matrix
3. Dynamic work stealing

23 Application Challenge: Dynamic Load Balancing

• Hartree-Fock example (e.g., in NWChem, which is already PGAS)
  – Inherent load imbalance
• UPC++ version (see the work-stealing sketch below)
  – Dynamic work stealing and fast atomic operations enhanced load balance
  – Transpose an irregularly blocked matrix

[Figure: an irregularly blocked matrix; the local array is divided into blocks numbered 0–15]
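A minimal shared-memory sketch of counter-based dynamic work stealing (not the NWChem/UPC++ code): each worker owns a range of task indices handed out by an atomic fetch-and-add, and workers that drain their own range move on to other pools. In the PGAS version the counters live in remote memory and the fetch-and-add is a remote atomic.

  #include <atomic>
  #include <cstdio>
  #include <thread>
  #include <vector>

  const int WORKERS = 4, TASKS_PER_WORKER = 100;

  struct Pool { std::atomic<int> next{0}; };   // per-worker task counter
  Pool pools[WORKERS];
  std::atomic<int> tasks_run{0};

  void worker(int me) {
      int executed = 0;
      for (int hops = 0; hops < WORKERS; ++hops) {   // own pool first, then victims
          int victim = (me + hops) % WORKERS;
          for (;;) {
              int t = pools[victim].next.fetch_add(1);   // atomically grab a task index
              if (t >= TASKS_PER_WORKER) break;          // this pool is drained, try the next
              ++executed;                                // "run" task (victim, t)
          }
      }
      tasks_run += executed;
  }

  int main() {
      std::vector<std::thread> ts;
      for (int w = 0; w < WORKERS; ++w) ts.emplace_back(worker, w);
      for (auto& t : ts) t.join();
      std::printf("tasks executed: %d of %d\n", tasks_run.load(), WORKERS * TASKS_PER_WORKER);
  }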

David Ozog (CSGF Fellow), A. Kamil, Y. Zheng, P. Hargrove, J. Hammond, A. Malony, W. de Jong, K. Yelick DEGAS Overview 24 Hartree Fock Code

Improved Scalability

[Figure: strong scaling of UPC++ HF on NERSC Edison compared to (highly optimized) GTFock with Global Arrays]
David Ozog (CSGF Fellow), A. Kamil, Y. Zheng, P. Hargrove, J. Hammond, A. Malony, W. de Jong, K. Yelick
DEGAS Overview 25 Towards NWChem in UPC++

• New Global Arrays toolkit over GASNet: increased scalability!
• Over 20% faster on InfiniBand
• More scalable aggregate FFTs than FFTW

[Figure: wall clock time (sec) of GA over GASNet vs. the GA base version as a function of MPI ranks (16 per node, 8 per socket) on Edison, 512–2048 ranks]
[Figure: FFT performance in GFLOPS (5·N·log N / T) vs. number of processor cores (16 per node), 1–32,768 cores, for FFTW (256^3) and several X+Y+Z decompositions of 64×256^3 problems, reaching 4.6 and 8.6 TFLOPS]
• Goal of making this ready for production use (Bert de Jong)

E. Hoffman (GAGA), H. Simhadri (FFTs) DEGAS Overview 26 Where is PGAS programming used?

1. Asynchronous fine-grained reads/writes/atomics (aggregation and software caching when possible)
2. Strided irregular updates (adds) to a distributed matrix
3. Dynamic work stealing
4. Hierarchical algorithms / one programming model

27 UPC++ Communication Speeds up AMR

• Adaptive Mesh Refinement on block-structured meshes
  – Used in ice sheet modeling, climate, subsurface (fracking), astrophysics, accelerator modeling, and many more
[Figure: FillBoundary test on 2048 Cori cores; renormalized time (lower is better) for MPI, MPI/OMP, hierarchical MPI, UPC++, UPC++/OMP, and hierarchical UPC++, ranging from 1.13 down to 0.52]
• Hierarchical UPC++ (distributed / shared style)
• UPC++ plus UPC++ is 2x faster than MPI plus OpenMP
• MPI + MPI also does well

Reducing Metadata Overhead in AMR
[Figure: scaling the AMR grid hierarchy]
• Reducing the metadata size

• Distribute the grid hierarchy data structure using UPC

Slide source: Mike Norman, UCSD (Enzo code)

29 Where is PGAS programming used?
1. Asynchronous fine-grained reads/writes/atomics (aggregation and software caching when possible)
2. Strided irregular updates (adds) to a distributed matrix
3. Dynamic work stealing
4. Hierarchical algorithms / one programming model
5. Task Graph Scheduling (UPC++)

30 [Paper excerpt: sparse Cholesky factorization]
2) Column j of L is used to update the remaining columns of A. If A is a dense matrix, then every column k, k > j, is updated. Once the factorization is computed, the solution to the original linear system can be obtained by solving two triangular linear systems using the Cholesky factor L.
B. Cholesky factorization of sparse matrices. For large-scale applications, A is often sparse, meaning that most of the elements of A are zero. When the Cholesky factorization of A is computed, some of the zero entries will turn into nonzeros (due to the subtraction operations in the column updates; see Alg. 1). The extra nonzero entries are referred to as fill-in. For an in-depth discussion of sparse Cholesky factorization, the reader is referred to [1]. An important observation in sparse Cholesky factorization is that the columns of L are expected to become denser and denser as one moves from the left to the right, because the fill-in in one column will …
The elimination tree is an acyclic graph that has n vertices {v_1, v_2, …, v_n}, with v_i corresponding to column i of A. Suppose i > j. There is an edge between v_i and v_j in the elimination tree if and only if l_ij is the first off-diagonal nonzero entry in column j of L. Thus, v_i is called the parent of v_j and v_j is a child of v_i. The elimination tree contains a lot of information regarding the sparsity structure of L and the dependency among the columns of L; see [2] for details. An elimination tree can be expressed in terms of supernodes rather than columns; in such a case, it is referred to as a supernodal elimination tree. An example of such a tree is depicted in Figure 1b.
C. Scheduling in parallel sparse Cholesky factorization. The only constraints that have to be respected in the numerical factorization are the numerical dependencies among the columns: column k of A has to be updated by column j of L, for any j < k with l_kj ≠ 0.
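The scheduling constraint at the end of the excerpt (column k must be updated by every earlier column j with l_kj ≠ 0) maps naturally onto per-column dependency counts for a dynamic task scheduler; below is a minimal sketch with a made-up sparsity pattern, not the actual solver code.

  #include <cstdio>
  #include <vector>

  int main() {
      const int n = 5;
      // nonzero[k] lists the columns j < k with l_kj != 0 (strictly lower triangle, by row k).
      std::vector<std::vector<int>> nonzero = {
          {}, {0}, {0}, {1, 2}, {3}
      };

      std::vector<int> deps(n, 0);               // unsatisfied dependencies per column task
      for (int k = 0; k < n; ++k)
          deps[k] = static_cast<int>(nonzero[k].size());

      // Columns with deps[k] == 0 are immediately ready; finishing column j would
      // decrement deps[k] for every k with l_kj != 0, dynamically releasing new tasks.
      for (int k = 0; k < n; ++k)
          std::printf("column %d: %d dependencies%s\n", k, deps[k],
                      deps[k] == 0 ? " (ready)" : "");
  }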

• Dynamic scheduling outperforms other approaches
• The combination of algorithm and implementation (in UPC++) outperforms the competition

Mathias Jacquelin, Esmond Ng, Yili Zheng, Kathy Yelick

32 Where is PGAS programming used?
1. Asynchronous fine-grained reads/writes/atomics (aggregation and software caching when possible)
2. Dynamic work stealing
3. Strided irregular updates (adds) to a distributed matrix
4. Hierarchical algorithms / one programming model
5. Task Graph Scheduling (UPC++)

6. Dynamic runtimes (CHARM++, Legion, HPX)

33 HPX Asynchronous Runtime Performs on Manycore
[Embedded slide: "HPX by example", Example 1: Matrix Transpose; Thomas Heller ([email protected]), FAU, 25.10.2014]

[Figure: LibGeoDecomp weak scaling, distributed, on Babbage (host cores): GFLOPS vs. number of localities (16 cores each, up to ~16K) for HPX, MPI, and the theoretical peak; higher is better]

Credit: Hartmut Kaiser, LSU, and the HPX team

34 Legion Programming Model & Runtime

• Dynamic task-based
  – Data-centric: tasks specify what data they access and how they use them (read-only, read-write, exclusive, etc.)
  – Separates task implementation from hardware mapping decisions
  – Latency tolerant
• Declarative specification of the task graph in Legion
  – Serial program
  – Read/write effects on regions of data structures
  – Determine maximum parallelism
• Port of S3D complete
[Figure: weak-scaling throughput per node (points/s) of Legion S3D vs. MPI Fortran S3D on Titan (1–13,824 nodes) and Piz Daint (1–4,096 nodes)]
Legion team from Stanford, ExaCT Co-Design Center

Why is PGAS used? (Besides Application Characteristics)

36 One-Sided Communication is Closer to Hardware

[Figure: a one-sided put message carries the destination address and data payload and is deposited directly into memory by the network interface; a two-sided message carries a message ID and data payload that must be matched by the host CPU]

• One-sided communication (put/get) is what hardware does
  – Even underneath send/receive
  – Information on where to put the data is in the message
  – Decouples synchronization from transfer
• Two-sided message passing (e.g., send/receive in MPI)
  – Requires matching with the remote side to “find” the address to write data
  – Couples data transfer with synchronization (often, but not always, what you want)
Exascale should offer programmers / vendors a lightweight option
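A rough shared-memory analogy of the contrast above (not MPI or GASNet code): with a one-sided put the initiator already knows the destination address in the target's exposed window and deposits the payload directly, whereas a two-sided send cannot deliver until the target has posted a matching receive.

  #include <cstdio>
  #include <vector>

  struct Window { std::vector<double> mem; };            // target's exposed memory

  // One-sided: the initiator computes the destination address; no action by the target.
  void put(Window& target, std::size_t offset, double value) {
      target.mem[offset] = value;
  }

  struct PostedRecv { double* buffer; int tag; };        // what a receiver must supply

  // Two-sided: delivery requires a matching posted receive, coupling transfer with sync.
  bool send(double value, int tag, const PostedRecv* posted) {
      if (!posted || posted->tag != tag) return false;   // no match yet: cannot deliver
      *posted->buffer = value;
      return true;
  }

  int main() {
      Window target{std::vector<double>(8, 0.0)};
      put(target, 3, 42.0);                              // one-sided: done immediately

      double recv_buf = 0.0;
      PostedRecv r{&recv_buf, /*tag=*/7};
      bool delivered = send(42.0, 7, &r);                // two-sided: needs the matching post
      std::printf("window[3]=%g, delivered=%d, recv_buf=%g\n",
                  target.mem[3], delivered, recv_buf);
  }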

DEGAS Overview 37 One-Sided Communication Between Nodes is Faster

[Figure: Cray XE6 (Hopper) bandwidth (MB/s) vs. message size from 8 B to 2 MB for Berkeley UPC, Cray UPC, and Cray MPI]

• For communication-intensive problems, the gap is substantial
  – Problems with small messages
  – Bisection bandwidth problems (global FFTs)

Hargrove, Ibrahim 38 Overhead for Messaging

• Overhead (processor busy time) gets worse on “exascale” cores
• Having a low-overhead option is increasingly important

Average cycles per call (to do nothing) on Intel Ivy Bridge:
  iSend(): 3,692 cycles off-node, 1,262 cycles on-node
  iRecv(): 1,154 cycles off-node, 1,924 cycles on-node

[Figure: software overhead in nanoseconds for isend/irecv, off-node and on-node, on a 2.9 GHz x86 core and modeled for a 1 GHz x86 core and a 1 GHz 3-SIMT core]

Computer Architecture Group (CAL) 39 Summary

• Successful PGAS applications are mostly asynchronous

1. Asynchronous fine-grained reads/writes/atomics (aggregation and software caching when possible)
2. Dynamic work stealing
3. Strided irregular updates (adds) to a distributed matrix
4. Hierarchical algorithms / one programming model
5. Task Graph Scheduling (UPC++)
6. Dynamic runtimes (CHARM++, Legion, HPX)

• Exascale architecture trends

40