M04: Programming Using the Partitioned Global Address Space (PGAS) Model

Tarek El-Ghazawi, GWU

Brad Chamberlain, Cray

Vijay Saraswat, IBM

Overview
A. Introduction to PGAS (~45 mts)
B. Introduction to Languages
   A. UPC (~60 mts)
   B. X10 (~60 mts)
   C. Chapel (~60 mts)
C. Comparison of Languages (~45 minutes)
   A. Comparative Heat transfer Example
   B. Comparative Summary of features
   C. Discussion
D. Hands-On (90 mts)


A. Introduction to PGAS

The common architectural landscape

[Figure: SMP clusters — SMP nodes (PEs plus memory) connected by an interconnect — and massively parallel processor systems such as IBM Blue Gene and Cray XT5]

Programming Models

‹ What is a programming model? 0The logical interface between architecture and applications

‹ Why Programming Models? 0Decouple applications and architectures

‹ Write applications that run effectively across architectures ‹ Design new architectures that can effectively support legacy applications

‹ Programming Model Design Considerations 0Expose modern architectural features to exploit machine power and improve performance 0Maintain Ease of Use 5

Examples of Parallel Programming Models

‹ Message Passing

‹ Shared Memory (Global Address Space)

‹ Partitioned Global Address Space (PGAS)

The Message Passing Model

‹ Concurrent sequential processes

‹ Explicit communication, two-sided

‹ Library-based

‹ Positive:

0Programmers control data … and work distribution.

‹ Negative:
 0 Significant communication overhead for small transactions
 0 Excessive buffering
 0 Hard to program in

‹ Example: MPI

The Shared Memory Model

‹ Concurrent threads with shared space

‹ Positive: 0 Simple statements … 0Read remote memory via an expression 0Write remote memory through assignment

‹ Negative:
 0 Manipulating shared data leads to synchronization requirements
 0 Does not allow locality exploitation

‹ Example: OpenMP, Java

Hybrid Model(s) Example: Shared + Message Passing

[Figure: several SMP nodes, each running multiple threads in a shared address space, connected by a network]

‹ Example: OpenMP at the node (SMP), and MPI in between

The PGAS Model

‹ Concurrent threads with a partitioned shared space
 0 A datum may reference data in other partitions
 0 Data structures may have fragments in multiple partitions

‹ Positive:
 0 Helps in exploiting locality
 0 Simple statements, as in shared memory

‹ Negative:
 0 Sharing all memory can result in subtle bugs and race conditions

‹ Examples (this tutorial!): UPC, X10, Chapel, CAF, Titanium

PGAS vs. other programming models/languages

                              UPC, X10, Chapel                          MPI                   OpenMP          HPF
Model                         PGAS (Partitioned Global Address Space)   Distributed Memory    Shared Memory   Shared Memory
Notation                      Language                                  Library               Annotations     Language
Global arrays?                Yes                                       No                    No              Yes
Global pointers/references?   Yes                                       No                    No              No
Locality exploitation         Yes                                       Yes, necessarily      No              Yes

The heterogeneous/accelerated architectural landscape

[Figure: a "scalable unit" of 100's of cluster nodes with I/O gateway nodes, joined by a cluster interconnect switch/fabric; examples: Cray XT5h (FPGA/vector-accelerated Opteron) and Roadrunner (Cell-accelerated Opteron)]

Example accelerator technologies

[Figure: the Cell processor (a PPE plus eight SPEs on the EIB); an NVidia GPU (from the NVidia CUDA Programming Guide), annotated with X10-style places 0–8; an FPGA; and a multi-core processor with accelerators (Intel IXP 2850)]

The current architectural landscape

‹ Substantial architectural innovation is anticipated over the next ten years.
 0 Hardware situation remains murky, but programmers need stable interfaces to develop applications.

‹ Multicore systems will dramatically raise the number of cores available to applications.
 0 Programmers must understand the concurrent structure of their applications.

‹ Heterogeneous accelerator-based systems will exist, raising serious programmability challenges.
 0 Programmers must choreograph interactions between heterogeneous processors and memory subsystems.

‹ Applications seeking to leverage these architectures will need to go beyond the data-parallel, globally synchronizing MPI model.

‹ These changes, while most profound for HPC now, will change the face of commercial computing over time.

Asynchronous PGAS

[Figure: asynchronous PGAS — threads and activities over a partitioned shared address space, as in UPC, CAF, X10]

‹ Explicit concurrency
 0 SPMD is a special case
‹ Asynchronous activities can be started and stopped in a given space partition
‹ Asynchronous activities can be used for active messaging, DMAs, fork/join concurrency, do-all/do-across parallelism

Concurrency is made explicit and programmable.

How do we realize APGAS?

‹ Through an APGAS library in C, Fortran, Java (co-habiting with MPI)
 0 Implements PGAS: remote references, global data-structures
 0 Implements inter-place messaging: optimizes inlinable asyncs
 0 Implements global and/or collective operations
 0 Implements intra-place concurrency: atomic operations, algorithmic scheduler
 0 Builds on XL UPC runtime, GASNet, ARMCI, LAPI, DCMF, DaCS, Chapel runtime

‹ Through languages
 0 Asynchronous Co-Array Fortran: extension of CAF with asyncs
 0 Asynchronous UPC (AUPC): proper extension of UPC with asyncs
 0 X10 (already asynchronous): extension of sequential Java
 0 Chapel

‹ Language runtimes share a common APGAS runtime
‹ Libraries reduce cost of adoption, languages offer enhanced productivity benefits
 0 Customer gets to choose

8 Overview A. Introduction to PGAS (~ 45 mts) B. Introduction to Languages A. UPC (~ 60 mts) B. X10 (~ 60 mts) C. Chapel (~ 60 mts) C. Comparison of Languages (~45 minutes) A. Comparative Heat transfer Example B. Comparative Summary of features C. Discussion D. Hands-On (90 mts)


SC08 PGAS Tutorial

Unified Parallel C – UPC

Tarek El-Ghazawi [email protected] The George Washington University

1 UPC Overview

1) UPC in a nutshell
   0 Memory model
   0 Execution model
   0 UPC Systems
2) Data Distribution and Pointers
   0 Shared vs Private Data
   0 Examples of data distribution
   0 UPC pointers
3) Workload Sharing
   0 upc_forall
4) Advanced topics in UPC
   0 Dynamic Memory Allocation
   0 Synchronization in UPC
   0 UPC Libraries
5) UPC Productivity
   0 Code efficiency

3

Introduction

‹ UPC – Unified Parallel C

‹ Set of specs for a parallel C 0v1.0 completed February of 2001 0v1.1.1 in October of 2003 0v1.2 in May of 2005

‹ implementations by vendors and others

‹ Consortium of government, academia, and HPC vendors including IDA CCS, GWU, UCB, MTU, UMN, ARSC, UMCP, U of Florida, ANL, LBNL, LLNL, DoD, DoE, HP, Cray, IBM, Sun, Intrepid, Etnus, …

4

2 Introduction cont.

‹ UPC compilers are now available for most HPC platforms and clusters 0Some are open source

‹ A debugger is available and a performance analysis tool is in the works

‹ Benchmarks, programming examples, and compiler testing suite(s) are available

‹ Visit www.upcworld.org or upc.gwu.edu for more information

5

UPC Systems

‹ Current UPC Compilers 0Hewlett-Packard 0Cray 0IBM 0Berkeley 0Intrepid (GCC UPC) 0MTU ‹ UPC application development tools 0Totalview 0PPW from UF

6

3 UPC Home Page http://www.upc.gwu.edu

7

UPC textbook now available

‹ UPC: Distributed Shared Memory Programming Tarek El-Ghazawi William Carlson Thomas Sterling Katherine Yelick

‹ Wiley, May, 2005

‹ ISBN: 0-471-22048-5

8

4 What is UPC?

‹ Unified Parallel C

‹ An explicit parallel extension of ISO C

‹ A partitioned shared memory parallel

9

UPC Execution Model

‹ A number of threads working independently in a SPMD fashion 0MYTHREAD specifies thread index (0..THREADS-1) 0Number of threads specified at compile-time or run-time

‹ Synchronization when needed 0Barriers 0Locks 0Memory consistency control
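A minimal hello-world sketch (not from the original slides) showing MYTHREAD, THREADS, and a barrier:

   /* hello_upc.c — illustrative sketch only */
   #include <upc_relaxed.h>
   #include <stdio.h>

   int main() {
       printf("Hello from thread %d of %d\n", MYTHREAD, THREADS);
       upc_barrier;                 /* all threads synchronize here */
       if (MYTHREAD == 0)
           printf("All %d threads passed the barrier\n", THREADS);
       return 0;
   }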

10

5 UPC Memory Model

[Figure: the UPC memory model — a shared space partitioned among threads 0..THREADS-1 (partitioned global address space), plus a private space per thread]

‹ A pointer-to-shared can reference all locations in the shared space, but there is data-thread affinity

‹ A private pointer may reference addresses in its private space or its local portion of the shared space

‹ Static and dynamic memory allocations are supported for both shared and private memory

11

UPC Overview

1) UPC in a nutshell 4) Advanced topics in UPC 0 Memory model 0 Dynamic Memory Allocation 0 Execution model 0 Synchronization in UPC 0 UPC Systems 0 UPC Libraries

2) Data Distribution and Pointers 5) UPC Productivity 0 Shared vs. Private Data 0 Code efficiency 0 Examples of data distribution 0 UPC pointers

3) Workload Sharing 0 upc_forall

12

A First Example: Vector addition

   //vect_add.c
   #include <upc_relaxed.h>
   #define N 100*THREADS

   shared int v1[N], v2[N], v1plusv2[N];

   void main() {
       int i;
       for(i=0; i<N; i++)
           if (MYTHREAD == i%THREADS)
               v1plusv2[i] = v1[i] + v2[i];
   }

[Figure: with 2 threads, iterations 0, 2, … and elements v1[0], v1[2], v2[0], v2[2], … map to thread 0; iterations 1, 3, … and the odd-indexed elements map to thread 1]

13

2nd Example: A More Efficient Implementation

   //vect_add.c
   #include <upc_relaxed.h>
   #define N 100*THREADS

   shared int v1[N], v2[N], v1plusv2[N];

   void main() {
       int i;
       for(i=MYTHREAD; i<N; i+=THREADS)
           v1plusv2[i] = v1[i] + v2[i];
   }

14

3rd Example: A More Convenient Implementation with upc_forall

   //vect_add.c
   #include <upc_relaxed.h>
   #define N 100*THREADS

   shared int v1[N], v2[N], v1plusv2[N];

   void main() {
       int i;
       upc_forall(i=0; i<N; i++; i)
           v1plusv2[i] = v1[i] + v2[i];
   }

15

Example: UPC Matrix-Vector Multiplication – Default Distribution

   // vect_mat_mult.c
   #include <upc_relaxed.h>

   shared int a[THREADS][THREADS];
   shared int b[THREADS], c[THREADS];

   void main (void)
   {
       int i, j;
       upc_forall(i = 0; i < THREADS; i++; i) {
           c[i] = 0;
           for (j = 0; j < THREADS; j++)
               c[i] += a[i][j]*b[j];
       }
   }

16

Data Distribution

[Figure: c = A × b with the default distribution — the elements of A, b, and c are spread round-robin one element at a time, so the row of A needed for each c[i] is scattered across all threads]

A Better Data Distribution

[Figure: c = A × b with a row-wise distribution — each thread owns one full row of A, aligned with the element c[i] it computes]

18

Example: UPC Matrix-Vector Multiplication – The Better Distribution

   // vect_mat_mult.c
   #include <upc_relaxed.h>

   shared [THREADS] int a[THREADS][THREADS];
   shared int b[THREADS], c[THREADS];

   void main (void)
   {
       int i, j;
       upc_forall(i = 0; i < THREADS; i++; i) {
           c[i] = 0;
           for (j = 0; j < THREADS; j++)
               c[i] += a[i][j]*b[j];
       }
   }

19

Shared and Private Data

Examples of Shared and Private Data Layout: Assume THREADS = 3 shared int x; /*x will have affinity to thread 0 */ shared int y[THREADS]; int z; will result in the layout:

   Thread 0    Thread 1    Thread 2
   x
   y[0]        y[1]        y[2]
   z           z           z

20

10 Shared and Private Data

shared int A[4][THREADS]; will result in the following data layout:

   Thread 0    Thread 1    Thread 2
   A[0][0]     A[0][1]     A[0][2]
   A[1][0]     A[1][1]     A[1][2]
   A[2][0]     A[2][1]     A[2][2]
   A[3][0]     A[3][1]     A[3][2]

21

Shared and Private Data shared int A[2][2*THREADS]; will result in the following data layout:

Thread 0 Thread 1 Thread (THREADS-1)

A[0][0] A[0][1] A[0][THREADS-1]

A[0][THREADS] A[0][THREADS+1] A[0][2*THREADS-1]

A[1][0] A[1][1] A[1][THREADS-1]

A[1][THREADS] A[1][THREADS+1] A[1][2*THREADS-1]

22

11 Blocking of Shared Arrays

‹ Default block size is 1

‹ Shared arrays can be distributed on a block per thread basis, round robin with arbitrary block sizes.

‹ A block size is specified in the declaration as follows: shared [block-size] type array[N]; 0e.g.: shared [4] int a[16];

23

Blocking of Shared Arrays

‹ Block size and THREADS determine affinity

‹ The term affinity refers to the thread in whose local shared-memory space a shared data item resides

‹ Element i of a blocked array has affinity to thread:

   floor(i / blocksize) mod THREADS
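A worked example (not from the slides), assuming THREADS = 2:

   shared [4] int a[16];   /* blocks of 4 elements, dealt round-robin to threads */
   /* a[0..3]   -> thread floor(0/4)  mod 2 = 0                                  */
   /* a[4..7]   -> thread floor(4/4)  mod 2 = 1                                  */
   /* a[8..11]  -> thread floor(8/4)  mod 2 = 0   (distribution wraps around)    */
   /* a[12..15] -> thread floor(12/4) mod 2 = 1                                  */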

24

12 Shared and Private Data

‹ Shared objects placed in memory based on affinity

‹ Affinity can be also defined based on the ability of a thread to refer to an object by a private pointer

‹ All non-array shared qualified objects, i.e. shared scalars, have affinity to thread 0

‹ Threads access shared and private data

25

Shared and Private Data

Assume THREADS = 4

shared [3] int A[4][THREADS]; will result in the following data layout:

   Thread 0    Thread 1    Thread 2    Thread 3
   A[0][0]     A[0][3]     A[1][2]     A[2][1]
   A[0][1]     A[1][0]     A[1][3]     A[2][2]
   A[0][2]     A[1][1]     A[2][0]     A[2][3]
   A[3][0]     A[3][3]
   A[3][1]
   A[3][2]

26

13 Special Operators

‹ upc_localsizeof(type-name or expression); returns the size of the local portion of a shared object ‹ upc_blocksizeof(type-name or expression); returns the blocking factor associated with the argument ‹ upc_elemsizeof(type-name or expression); returns the size (in bytes) of the left-most type that is not an array

27

Usage Example of Special Operators

   typedef shared int sharray[10*THREADS];
   sharray a;
   char i;

‹ upc_localsizeof(sharray) → 10*sizeof(int)
‹ upc_localsizeof(a) → 10*sizeof(int)
‹ upc_localsizeof(i) → 1
‹ upc_blocksizeof(a) → 1
‹ upc_elemsizeof(a) → sizeof(int)

28

14 String functions in UPC

‹ UPC provides standard library functions to move data to/from shared memory

‹ Can be used to move chunks in the shared space or between shared and private spaces

29

String functions in UPC

‹ Equivalent of memcpy: 0upc_memcpy(dst, src, size) ‹ copy from shared to shared 0upc_memput(dst, src, size) ‹ copy from private to shared 0upc_memget(dst, src, size) ‹ copy from shared to private ‹ Equivalent of memset: 0upc_memset(dst, char, size) ‹ initializes shared memory with a character ‹ The shared block must be contiguous, with all of its elements having the same affinity
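A short sketch (not from the slides) of upc_memget/upc_memput staging one thread's block through a private buffer; the array shape and helper name are illustrative:

   #include <upc_relaxed.h>
   #define B 100

   shared [B] int data[B*THREADS];        /* one block of B ints per thread */

   void process_local_block(void) {
       int buf[B];
       /* shared -> private: fetch this thread's own block */
       upc_memget(buf, &data[MYTHREAD*B], B*sizeof(int));
       for (int i = 0; i < B; i++)
           buf[i] += 1;                    /* work on the private copy */
       /* private -> shared: write the block back */
       upc_memput(&data[MYTHREAD*B], buf, B*sizeof(int));
   }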

30

15 UPC Pointers

                            Where does it point to?
                            Private      Shared
   Where does    Private    PP           PS
   it reside?    Shared     SP           SS

31

UPC Pointers

‹ How to declare them? int *p1; /* private pointer pointing locally */ shared int *p2; /* private pointer pointing into the shared space */ int *shared p3; /* shared pointer pointing locally */ shared int *shared p4; /* shared pointer pointing into the shared space */

‹ You may find many using “shared pointer” to mean a pointer pointing to a shared object, e.g. equivalent to p2 but could be p4 as well.

32

16 UPC Pointers

[Figure: p1 and p2 are private pointers, with one copy per thread in the private space; p3 and p4 are shared pointers, with a single copy in thread 0's shared partition; p2 and p4 point into the shared space, while p1 points into private space]

33

UPC Pointers

‹ What are the common usages? 0int *p1; /* access to private data or to local shared data */ 0shared int *p2; /* independent access of threads to data in shared space */ 0int *shared p3; /* not recommended*/ 0shared int *shared p4; /* common access of all threads to data in the shared space*/

34

17 UPC Pointers

‹ In UPC pointers to shared objects have three fields: 0thread number 0local address of block 0phase (specifies position in the block)

Thread # Block Address Phase

‹ Example: Cray T3E implementation

   Phase: bits 63–49 | Thread: bits 48–38 | Virtual Address: bits 37–0

35

UPC Pointers

‹ Pointer arithmetic supports blocked and non-blocked array distributions

‹ Casting of shared to private pointers is allowed but not vice versa !

‹ When casting a pointer-to-shared to a private pointer, the thread number of the pointer-to-shared may be lost

‹ Casting of a pointer-to-shared to a private pointer is well defined only if the pointed to object has affinity with the local thread

36

18 Special Functions

‹ size_t upc_threadof(shared void *ptr);
   returns the thread number that has affinity to the object pointed to by ptr
‹ size_t upc_phaseof(shared void *ptr);
   returns the index (position within the block) of the object which is pointed to by ptr
‹ size_t upc_addrfield(shared void *ptr);
   returns the address of the block which is pointed at by the pointer to shared
‹ shared void *upc_resetphase(shared void *ptr);
   resets the phase to zero
‹ size_t upc_affinitysize(size_t ntotal, size_t nbytes, size_t thr);
   returns the exact size of the local portion of the data in a shared object with affinity to a given thread
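A small sketch (not from the slides) that inspects a pointer-to-shared with these functions; the printed values assume THREADS >= 2:

   #include <upc_relaxed.h>
   #include <stdio.h>

   shared [3] int x[3*THREADS];

   int main() {
       if (MYTHREAD == 0) {
           shared [3] int *p = &x[4];        /* element 4: block 1, position 1  */
           printf("thread=%d phase=%d\n",
                  (int) upc_threadof(p),      /* 1 — block 1 lives on thread 1   */
                  (int) upc_phaseof(p));      /* 1 — second element of its block */
       }
       return 0;
   }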

37

UPC Pointers

Pointer-to-shared arithmetic examples. Assume THREADS = 4:

   #define N 16
   shared int x[N];
   shared int *dp = &x[5], *dp1;
   dp1 = dp + 9;

38

19 UPC Pointers

   Thread 0         Thread 1         Thread 2               Thread 3
   X[0]             X[1]             X[2]                   X[3]
   X[4]             X[5]   dp        X[6]   dp+1            X[7]   dp+2
   X[8]   dp+3      X[9]   dp+4      X[10]  dp+5            X[11]  dp+6
   X[12]  dp+7      X[13]  dp+8      X[14]  dp+9 (= dp1)    X[15]

39

UPC Pointers

Assume THREADS = 4 shared[3] int x[N], *dp=&x[5], *dp1; dp1 = dp + 9;

40

20 UPC Pointers

   Thread 0               Thread 1         Thread 2         Thread 3
   X[0]                   X[3]             X[6]   dp+1      X[9]   dp+4
   X[1]                   X[4]             X[7]   dp+2      X[10]  dp+5
   X[2]                   X[5]   dp        X[8]   dp+3      X[11]  dp+6
   X[12]  dp+7            X[15]
   X[13]  dp+8
   X[14]  dp+9 (= dp1)

41

UPC Pointers

Example Pointer Castings and Mismatched Assignments:

‹ Pointer Casting shared int x[THREADS]; int *p; p = (int *) &x[MYTHREAD]; /* p points to x[MYTHREAD] */

0Each of the private pointers will point at the x element which has affinity with its thread, i.e. MYTHREAD

42

21 UPC Pointers

‹ Mismatched Assignments Assume THREADS = 4 shared int x[N]; shared[3] int *dp=&x[5], *dp1; dp1 = dp + 9; 0The last statement assigns to dp1 a value that is 9 positions beyond dp 0The pointer will follow its own blocking and not that of the array

43

UPC Pointers

   Thread 0               Thread 1         Thread 2         Thread 3
   X[0]                   X[1]             X[2]             X[3]
   X[4]                   X[5]   dp        X[6]   dp+3      X[7]   dp+6
   X[8]                   X[9]   dp+1      X[10]  dp+4      X[11]  dp+7
   X[12]                  X[13]  dp+2      X[14]  dp+5      X[15]  dp+8
   X[16]  dp+9 (= dp1)

44

22 UPC Pointers

‹ Given the declarations shared[3] int *p; shared[5] int *q;

‹ Then p = q; /* is acceptable (an implementation may require an explicit cast, e.g. p = (shared [3] int *) q;) */

‹ Pointer p, however, will follow pointer arithmetic for blocks of 3, not 5 !!

‹ A pointer cast sets the phase to 0

45

UPC Overview

1) UPC in a nutshell 4) Advanced topics in UPC 0 Memory model 0 Dynamic Memory Allocation 0 Execution model 0 Synchronization in UPC 0 UPC Systems 0 UPC Libraries

2) Data Distribution and Pointers 0 Shared vs Private Data 5) UPC Productivity 0 Examples of data 0 Code efficiency distribution 0 UPC pointers

3) Workload Sharing 0 upc_forall

46

23 Worksharing with upc_forall

‹ Distributes independent iterations across threads in the way you wish – typically used for locality exploitation in a convenient way

‹ Simple C-like syntax and semantics
   upc_forall(init; test; loop; affinity) statement
 0 Affinity could be an integer expression, or a
 0 Reference to (address of) a shared object

47

Work Sharing and Exploiting Locality via upc_forall()

‹ Example 1: explicit affinity using shared references shared int a[100],b[100], c[100]; int i; upc_forall (i=0; i<100; i++; &a[i]) a[i] = b[i] * c[i];

‹ Example 2: implicit affinity with integer expressions and distribution in a round-robin fashion shared int a[100],b[100], c[100]; int i; upc_forall (i=0; i<100; i++; i) a[i] = b[i] * c[i]; Note: Examples 1 and 2 result in the same distribution

48

24 Work Sharing: upc_forall()

‹ Example 3: Implicitly with distribution by chunks shared int a[100],b[100], c[100]; int i; upc_forall (i=0; i<100; i++; (i*THREADS)/100) a[i] = b[i] * c[i];

‹ Assuming 4 threads, the following results

   i        i*THREADS    i*THREADS/100
   0..24    0..96        0
   25..49   100..196     1
   50..74   200..296     2
   75..99   300..396     3

49

UPC Overview

1) UPC in a nutshell 4) Advanced topics in UPC 0 Memory model 0 Dynamic Memory Allocation 0 Execution model 0 Synchronization in UPC 0 UPC Systems 0 UPC Libraries

2) Data Distribution and Pointers 0 Shared vs Private Data 5) UPC Productivity 0 Examples of data 0 Code efficiency distribution 0 UPC pointers

3) Workload Sharing 0 upc_forall

50

25 Dynamic Memory Allocation in UPC ‹ Dynamic memory allocation of shared memory is available in UPC

‹ Functions can be collective or not

‹ A collective function has to be called by every thread and will return the same value to all of them

‹ As a convention, the name of a collective function typically includes “all”

51

Collective Global Memory Allocation

shared void *upc_all_alloc (size_t nblocks, size_t nbytes);

nblocks: number of blocks nbytes: block size

‹ This function has the same result as upc_global_alloc. But this is a collective function, which is expected to be called by all threads

‹ All the threads will get the same pointer

‹ Equivalent to : shared [nbytes] char[nblocks * nbytes]

52

26 Collective Global Memory Allocation

[Figure: upc_all_alloc(THREADS, N*sizeof(int)) — one block of N elements is placed in each thread's shared partition; every thread's private ptr holds the same pointer-to-shared to the start of the allocation]

shared [N] int *ptr; ptr = (shared [N] int *) upc_all_alloc( THREADS, N*sizeof( int ) );

53

Global Memory Allocation shared void *upc_global_alloc (size_t nblocks, size_t nbytes); nblocks : number of blocks nbytes : block size

‹ Non collective, expected to be called by one thread

‹ The calling thread allocates a contiguous memory region in the shared space

‹ Space allocated per calling thread is equivalent to : shared [nbytes] char[nblocks * nbytes]

‹ If called by more than one thread, multiple regions are allocated and each calling thread gets a different pointer

54

27 Global Memory Allocation

shared [N] int *ptr; ptr = (shared [N] int *) upc_global_alloc( THREADS, N*sizeof( int ));

shared [N] int *shared myptr[THREADS]; myptr[MYTHREAD] = (shared [N] int *) upc_global_alloc( THREADS, N*sizeof( int ));

55

[Figure: when every thread calls upc_global_alloc(THREADS, N*sizeof(int)), THREADS separate regions are allocated, each spread across the shared space; each thread's private ptr points to its own region]

[Figure: storing the results in the shared array myptr[THREADS] makes every thread's allocation visible to all threads]

57

Local-Shared Memory Allocation shared void *upc_alloc (size_t nbytes);

nbytes: block size

‹ Non collective, expected to be called by one thread

‹ The calling thread allocates a contiguous memory region in the local-shared space of the calling thread

‹ Space allocated per calling thread is equivalent to : shared [] char[nbytes]

‹ If called by more than one thread, multiple regions are allocated and each calling thread gets a different pointer

58

29 Local-Shared Memory Allocation

[Figure: upc_alloc(N*sizeof(int)) — each calling thread allocates N elements in its own local shared partition; each thread's private ptr points to its local region]

shared [] int *ptr; ptr = (shared [] int *)upc_alloc(N*sizeof( int ));

59

Memory Space Clean-up

void upc_free(shared void *ptr);

‹ The upc_free function frees the dynamically allocated shared memory pointed to by ptr

‹ upc_free is not collective
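A short sketch (not from the slides) combining upc_all_alloc with upc_free; freeing once from thread 0 after a barrier is one common convention:

   #include <upc_relaxed.h>
   #define N 100

   int main() {
       shared [N] int *buf =
           (shared [N] int *) upc_all_alloc(THREADS, N*sizeof(int));

       buf[MYTHREAD*N] = MYTHREAD;    /* each thread touches its own block */

       upc_barrier;                   /* make sure everyone is done        */
       if (MYTHREAD == 0)
           upc_free(buf);             /* upc_free is not collective        */
       return 0;
   }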

60

30 Example: Matrix Multiplication in UPC

‹ Given two integer matrices A(NxP) and B(PxM), we want to compute C =A x B.

‹ Entries c_ij in C are computed by the formula:

   c_ij = Σ (l = 1..P) a_il × b_lj

61

Doing it in C

   #include <stdio.h>

   #define N 4
   #define P 4
   #define M 4

   int a[N][P] = {1,2,3,4,5,6,7,8,9,10,11,12,14,14,15,16}, c[N][M];
   int b[P][M] = {0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1};

   void main (void)
   {
       int i, j, l;
       for (i = 0; i < N; i++) {
           for (j = 0; j < M; j++) {
               c[i][j] = 0;
               for (l = 0; l < P; l++)
                   c[i][j] += a[i][l]*b[l][j];
           }
       }
   }

31 Domain Decomposition for UPC

Exploiting locality in matrix multiplication

‹ A (N × P) is decomposed row-wise into blocks of size (N × P)/THREADS, as shown below: thread 0 gets elements 0 .. (N*P/THREADS)-1, thread 1 gets (N*P/THREADS) .. (2*N*P/THREADS)-1, …, thread THREADS-1 gets ((THREADS-1)*N*P)/THREADS .. (THREADS*N*P/THREADS)-1

‹ B (P × M) is decomposed column-wise into M/THREADS blocks, as shown below: thread 0 gets columns 0 .. (M/THREADS)-1, …, thread THREADS-1 gets columns ((THREADS-1)*M)/THREADS .. M-1

• Note: N and M are assumed to be multiples of THREADS

63

UPC Matrix Multiplication Code

   #include <upc_relaxed.h>
   #define N 4
   #define P 4
   #define M 4

   shared [N*P/THREADS] int a[N][P];
   shared [N*M/THREADS] int c[N][M];
   /* a and c are blocked shared matrices; initialization is not currently implemented */
   shared [M/THREADS] int b[P][M];

   void main (void)
   {
       int i, j, l;   // private variables

       upc_forall(i = 0; i < N; i++; &c[i][0]) {
           for (j = 0; j < M; j++) {
               c[i][j] = 0;
               for (l = 0; l < P; l++)
                   c[i][j] += a[i][l]*b[l][j];
           }
       }
   }

64

32 UPC Matrix Multiplication Code with Privatization #include #define N 4 #define P 4 #define M 4 shared [N*P /THREADS] int a[N][P]; // N, P and M divisible by THREADS shared [N*M /THREADS] int c[N][M]; shared [M/THREADS] int b[P][M]; int *a_priv, *c_priv; void main (void) { int i, j , l; // private variables upc_forall(i = 0 ; i

UPC Matrix Multiplication Code with block copy #include shared [N*P /THREADS] int a[N][P]; shared [N*M /THREADS] int c[N][M]; /* a and c are blocked shared matrices, initialization is not currently implemented */ shared[M/THREADS] int b[P][M]; int b_local[P][M]; void main (void) { int i, j , l; // private variables for( i=0; i

66

33 UPC Matrix Multiplication Code with Privatization and Block Copy #include shared [N*P /THREADS] int a[N][P]; // N, P and M divisible by THREADS shared [N*M /THREADS] int c[N][M]; shared[M/THREADS] int b[P][M]; int *a_priv, *c_priv, b_local[P][M]; void main (void) { int i, priv_i, j , l; // private variables for( i=0; i

67

Matrix Multiplication with dynamic memory

#include shared [N*P /THREADS] int *a; shared [N*M /THREADS] int *c; shared [M/THREADS] int *b;

void main (void) { int i, j , l; // private variables a = upc_all_alloc(THREADS,(N*P/THREADS) *upc_elemsizeof(*a)); c=upc_all_alloc(THREADS,(N*M/THREADS)* upc_elemsizeof(*c)); b=upc_all_alloc(P*THREADS,(M/THREADS)* upc_elemsizeof(*b));

upc_forall(i = 0 ; i

34 Example: RandomAccess

Description of the problem: Let T be a table of size 2^n and let S be a table of size 2^m filled with random 64-bit integers. Let {A_i} be a stream of 64-bit integers of length 2^(n+2) generated by the primitive polynomial over GF(2), X^63 + 1.

For each a_i:

   T[LSB_{n-1..0}(a_i)] = T[LSB_{n-1..0}(a_i)] XOR S[MSB_{m-1..0}(a_i)]

Two sets of typical problem sizes:
 (a) m = 9; n = 8, 9, …, max possible integer size
 (b) m such that 2^m is half the size of the cache; n such that 2^n is equal to half of the total memory

Example: RandomAccess

[Figure: a sequence of 64-bit random numbers ran_i of length UPDATES = 2^(n+2), generated by Eq. 1, drives the updates (here m = 9, n = 22). Stable[ ] is the substitution table S of 2^m 64-bit integers, indexed by bits 63..55 of ran; Table[ ] is the data-set array T of 2^n 64-bit integers, indexed by bits 21..0 of ran.]

Operation: update of T:  T = T XOR S
Eq. 1:  ran_0 = 1;  ran_n = (2 * ran_{n-1}) XOR (bit 63 of ran_{n-1} ? 7 : 0)

70

35 RandomAccess – UPC

   u64Int Stable[STSIZE];     // private
   shared u64Int *Table;      // shared

   // Table[] allocated dynamically at run-time by:
   Table = (shared u64Int *) upc_all_alloc(TableSize, sizeof(u64Int));
   // STSIZE = 2^m where m = 9,
   // TableSize = 2^n where n = 4 in this example
   …
   Table[ran & (TableSize-1)] ^= Stable[ran >> B_SHR];

[Figure: each of the four threads holds a private copy of Stable[0..511]; Table[] lives in the shared space with the default round-robin layout — elements 0, 4, 8, 12 on thread 0; 1, 5, 9, 13 on thread 1; 2, 6, 10, 14 on thread 2; 3, 7, 11, 15 on thread 3]

71

RandomAccess: Computational Core void RandomAccessUpdate(u64Int TableSize) { s64Int i; Initialize the local part u64Int ran[128], ind; of Table[] int j; /* Initialize main table */ upc_forall( i=0; i>(64-LSTSIZE)]; } } As-it-is in SEQ code } 72

36 Expressing Memory Operations Efficiently: Large Histogram (HPCC RA)

[Figure: in UPC the kernel simply reads the input index, reads Table[], updates it, and writes it back in shared memory; the MPI-2 version must first create memory windows, then for each update decide the target rank and offset and issue MPI_Get, update, and MPI_Put]

73

Compact Code

   Random Access    #lines    % increase
   C                144       -
   UPC              158       9.72%
   MPI/2            210       45.83%
   MPI/1            344       138.89%

*Study done with HPCC 0.5alpha compliant code

74

Less Conceptual Complexity (RandomAccess)

MPI/2 (with rank and np):
                       Work Distr.  Data Distr.  Comm.  Synch. & Consist.  Misc. Ops   Sum
   # Parameters        26           9            35     5                  6           81
   # Function Calls    0            2            8      4                  4           18
   # Keywords          15           6            8      0                  2           31
   # MPI Types         0            5            10     4                  2           21
   Overall score: 151
   Notes: if/for statements, memory allocation, window creation, fence, a collective operation, barriers, mpi_init, mpi_finalize, mpi_comm_rank, mpi_comm_size, one-sided operations

UPC:
                            Work Distr.  Data Distr.  Comm.  Synch. & Consist.  Misc. Ops   Sum
   # Parameters             19           2            0      0                  2           23
   # Function Calls         0            1            0      5                  2           8
   # UPC Keywords           5            1            0      0                  0           6
   # UPC Constructs & Types 3            2            0      0                  0           5
   Overall score: 42
   Notes: if/for statements, upc_forall, shared declarations, upc_all_alloc, barriers, global_exit

75

Synchronization

‹ Explicit synchronization with the following mechanisms: 0Barriers 0Locks 0Memory Consistency Control 0Fence

76

38 Synchronization - Barriers

‹ No implicit synchronization among the threads

‹ UPC provides the following barrier synchronization constructs:
 0 Barriers (blocking)
   ‹ upc_barrier expr_opt;
 0 Split-phase barriers (non-blocking)
   ‹ upc_notify expr_opt;
   ‹ upc_wait expr_opt;
Note: upc_notify is not blocking; upc_wait is.
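A hedged sketch (not from the slides) of the split-phase barrier, overlapping purely local work between upc_notify and upc_wait:

   #include <upc_relaxed.h>

   shared int flags[THREADS];
   int local_work[1000];              /* private: one copy per thread */

   void phase(void) {
       flags[MYTHREAD] = 1;           /* shared update for this phase           */
       upc_notify;                    /* non-blocking: signal arrival           */
       for (int i = 0; i < 1000; i++) /* local work only, no shared references  */
           local_work[i] = i * MYTHREAD;
       upc_wait;                      /* blocking: wait until all have notified */
   }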

77

Synchronization - Locks

‹ In UPC, shared data can be protected against multiple writers : 0void upc_lock(upc_lock_t *l) 0int upc_lock_attempt(upc_lock_t *l) //returns 1 on success and 0 on failure 0void upc_unlock(upc_lock_t *l)

‹ Locks are allocated dynamically, and can be freed

‹ Locks are properly initialized after they are allocated
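A small sketch (not from the slides): a lock allocated by one thread protects a shared counter:

   #include <upc_relaxed.h>

   shared int counter = 0;              /* scalar: affinity to thread 0     */
   upc_lock_t *shared counter_lock;     /* one lock pointer, visible to all */

   int main() {
       if (MYTHREAD == 0)
           counter_lock = upc_global_lock_alloc();
       upc_barrier;

       upc_lock(counter_lock);
       counter += 1;                    /* critical section */
       upc_unlock(counter_lock);

       upc_barrier;
       if (MYTHREAD == 0)
           upc_lock_free(counter_lock);
       return 0;
   }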

78

39 Memory Consistency Models

‹ Has to do with ordering of shared operations, and when a change of a shared object by a thread becomes visible to others ‹ Consistency can be strict or relaxed ‹ Under the relaxed model, the shared operations can be reordered by the compiler / runtime system ‹ The strict consistency model enforces sequential ordering of shared operations. (No operation on shared can begin before the previous ones are done, and changes become visible immediately)
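A brief sketch (not from the slides) of how the model is selected: a per-file default via the upc_strict.h / upc_relaxed.h headers, and per-variable strict / relaxed type qualifiers:

   #include <upc_relaxed.h>          /* file default: relaxed consistency       */

   strict  shared int flag = 0;      /* accesses to flag follow strict ordering */
   relaxed shared int data[100];     /* accesses to data may be reordered       */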

79

Memory Consistency- Fence

‹ UPC provides a fence construct 0Equivalent to a null strict reference, and has the syntax

‹ upc_fence; 0UPC ensures that all shared references are issued before the upc_fence is completed

80

40 Memory Consistency Example

   strict shared int flag_ready = 0;
   shared int result0, result1;

   if (MYTHREAD == 0) {
       result0 = expression1;
       flag_ready = 1;        // if not strict, it could be switched with the above statement
   }
   else if (MYTHREAD == 1) {
       while(!flag_ready);    // same note
       result1 = expression2 + result0;
   }

 0 We could have used a barrier between the first and second statement in the if and the else code blocks. Expensive!! Affects all operations at all threads.
 0 We could have used a fence in the same places. Affects shared references at all threads!
 0 The above works as an example of point-to-point synchronization.

81

UPC Libraries

‹ UPC Collectives

‹ UPC-IO

82

41 Overview UPC Collectives

‹ A collective function performs an operation in which all threads participate

‹ Recall that UPC includes the collectives: 0upc_barrier, upc_notify, upc_wait, upc_all_alloc, upc_all_lock_alloc

‹ Collectives library include functions for bulk data movement and computation. 0upc_all_broadcast, upc_all_exchange, upc_all_prefix_reduce, etc.
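A hedged sketch (not from the slides) of upc_all_broadcast from <upc_collective.h>; the signature and sync flags follow the UPC collectives specification as recalled here:

   #include <upc_relaxed.h>
   #include <upc_collective.h>
   #define B 16

   shared [B] int src[B*THREADS];
   shared [B] int dst[B*THREADS];

   void bcast(void) {
       /* copy B ints from src into the block of dst on every thread */
       upc_all_broadcast(dst, src, B*sizeof(int),
                         UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);
   }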

83

Overview of UPC-IO

‹Most UPC-IO functions are collective 0Function entry/exit includes implicit synchronization 0Single return values for specific functions

‹API provided through extension libraries

‹UPC-IO data operations support: 0shared or private buffers 0Blocking (upc_all_fread_shared(), …) 0Non-blocking (async) operations (upc_all_fread_shared_async(), …)

‹Supports List-IO Access ‹Several reference implementations by GWU

84

File Accessing and File Pointers

‹ List I/O access using explicit offsets
 0 With local buffers
 0 With shared buffers
‹ File I/O with file pointers
 0 With local buffers: with individual file pointer
 0 With shared buffers: with individual file pointer, or with common file pointer

All read/write operations have blocking and asynchronous (non-blocking) variants

UPC Overview

1) UPC in a nutshell 4) Advanced topics in UPC 0 Memory model 0 Dynamic Memory Allocation 0 Execution model 0 Synchronization in UPC 0 UPC Systems 0 UPC Libraries

2) Data Distribution and Pointers 0 Shared vs Private Data 5) UPC Productivity 0 Examples of data 0 Code efficiency distribution 0 UPC pointers

3) Workload Sharing 0 upc_forall

86

Reduced Coding Effort is Not Limited to Random Access – NPB Examples

                      SEQ1     UPC      SEQ2     MPI      UPC effort (%)   MPI effort (%)
   NPB-CG  #lines     665      710      506      1046     6.77             106.72
           #chars     16145    17200    16485    37501    6.53             127.49
   NPB-EP  #lines     127      183      130      181      44.09            36.23
           #chars     2868     4117     4741     6567     43.55            38.52
   NPB-FT  #lines     575      1018     665      1278     77.04            92.18
           #chars     13090    21672    22188    44348    65.56            99.87
   NPB-IS  #lines     353      528      353      627      49.58            77.62
           #chars     7273     13114    7273     13324    80.31            83.20
   NPB-MG  #lines     610      866      885      1613     41.97            82.26
           #chars     14830    21990    27129    50497    48.28            86.14

UPC effort = (#UPC − #SEQ1) / #SEQ1      MPI effort = (#MPI − #SEQ2) / #SEQ2
SEQ1 is C; SEQ2 is from NAS, all Fortran except for IS

87

44 Overview A. Introduction to PGAS (~ 45 mts) B. Introduction to Languages A. UPC (~ 60 mts) B. X10 (~ 60 mts) C. Chapel (~ 60 mts) C. Comparison of Languages (~45 minutes) A. Comparative Heat transfer Example B. Comparative Summary of features C. Discussion D. Hands-On (90 mts)

1

The X10 Programming Language*
http://x10-lang.org
Vijay Saraswat*

* Winner: 2007 HPCC Award for “Most Productive Research Implementation” * With thanks to Christoph von Praun, Vivek Sarkar, Nate Nystrom, Igor Peshansky for contributions to slides.

* Please see http://x10-lang.org for most uptodate version of these slides.

Acknowledgements

‹ X10 Core Team (IBM)
 0 Shivali Agarwal, Ganesh Bikshandi, Dave Grove, Sreedhar Kodali, Bruce Lucas, Nathaniel Nystrom, Igor Peshansky, Vijay Saraswat, Pradeep Varma, Sayantan Sur, Olivier Tardieu, Krishna Venkat, Jose Castanos, Ankur Narang
‹ X10 Tools
 0 Philippe Charles, Robert Fuhrer, Matt Kaplan
‹ Emeritus
 0 Kemal Ebcioglu, Christian Grothoff, Vincent Cave, Lex Spoon, Christoph von Praun, Rajkishore Barik, Chris Donawa, Allan Kielstra, Tong Wen
‹ Research colleagues
 0 Vivek Sarkar, Rice U
 0 Satish Chandra, Guojing Cong, Calin Cascaval (IBM)
 0 Ras Bodik, Guang Gao, Radha Jagadeesan, Jens Palsberg, Rodric Rabbah, Jan Vitek
 0 Vinod Tipparaju, Jarek Nieplocha (PNNL)
 0 Kathy Yelick, Dan Bonachea (Berkeley)
 0 Several others at IBM

Selected Recent Publications
1. "Solving large, irregular graph problems using adaptive work-stealing", to appear in ICPP 2008.
2. "Constrained types for OO languages", to appear in OOPSLA 2008.
3. "Type Inference for Locality Analysis of Distributed Data Structures", PPoPP 2008.
4. "Deadlock-free scheduling of X10 Computations with bounded resources", SPAA 2007.
5. "A Theory of Memory Models", PPoPP 2007.
6. "May-Happen-in-Parallel Analysis of X10 Programs", PPoPP 2007.
7. "An annotation and compiler plug-in system for X10", IBM Technical Report, Feb 2007.
8. "X10: An Object-Oriented Approach to Non-Uniform Cluster Computing", OOPSLA conference, October 2005.
9. "Concurrent Clustered Programming", CONCUR conference, August 2005.

Tutorials
‹ TiC 2006, PACT 2006, OOPSLA 2006, PPoPP 2007, SC 2007
‹ Graduate course on X10 at U Pisa (07/07)
‹ Graduate course at Waseda U (Tokyo, 04/08)

Acknowledgements (contd.)

‹ X10 is an open source project (… Public License)
 0 Reference implementation in Java, runs on any Java 5 VM
 0 Windows/Intel, Linux/Intel, AIX/PPC, Linux/PPC
 0 Runs on multiprocessors
‹ X10Flash project --- cluster implementation of X10 under development at IBM
 0 Translation of X10 to Common PGAS Runtime

‹ Website: http://x10-lang.org
‹ Website contains
 0 Language specification
 0 Tutorial material
 0 Presentations
 0 Download instructions
 0 Copies of some papers
 0 Pointers to mailing list

‹ This material is based upon work supported in part by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002.

The common architectural landscape

[Figure: SMP clusters — SMP nodes (PEs plus memory) connected by an interconnect — and massively parallel processor systems such as IBM Blue Gene and Cray XT5]

The heterogeneous/accelerated architectural landscape

[Figure: a "scalable unit" of 100's of cluster nodes with I/O gateway nodes, joined by a cluster interconnect switch/fabric; examples: Cray XT5h (FPGA/vector-accelerated Opteron) and Roadrunner (Cell-accelerated Opteron)]

Example accelerator technologies

[Figure: the Cell processor (a PPE plus eight SPEs on the EIB); an NVidia GPU (from the NVidia CUDA Programming Guide), annotated with X10-style places 0–8; an FPGA; and a multi-core processor with accelerators (Intel IXP 2850)]

Asynchronous PGAS

‹ Places may have different computation properties
 0 Global SPMD no longer appropriate
‹ Use asynchrony
 0 Simple explicitly concurrent model for the user: async (p) S runs statement S "in parallel" at place p
 0 Controlled through finish, and local (conditional) atomic
 0 SPMD is a special case
‹ Asyncs used for active messaging (remote asyncs), DMAs, fork/join concurrency, do-all/do-across parallelism

Concurrency is made explicit and programmable.

4 What is X10?

‹ X10 is a new language developed in the IBM PERCS project as part of the DARPA program on High Productivity Computing Systems (HPCS)

‹ X10 is an instance of the APGAS framework in the Java family

‹ X10 1. Is more productive than current models, 2. Can support high levels of abstraction 3. Can exploit multiple levels of parallelism and non- uniform data access 4. Is suitable for multiple architectures, and multiple workloads.

X10 is not an “HPC” Language 9

How is X10 realized?

[Figure: current and planned X10 tool-chains — a front end (with a foreign function interface to sequential C, Fortran, Java, …) produces an AST; a Parallel Intermediate Representation (PIR), developed in collaboration with Vivek Sarkar's group at Rice, supports interprocedural analysis and optimization; back ends emit classfiles for a portable managed runtime (XVM) and annotated C/C++/Fortran code (via xlc) for restricted code regions targeting accelerators and high-end computing]

5 PGAS runtime structure

[Figure: UPC and CAF front-ends (with the IBM XL UPC compiler group) and the X10 compiler sit on per-language runtime APIs — array indexing and dereferencing, locks and allocators, asyncs, clocks/finish, work stealing — which share common services (shared-memory allocation, collectives, startup/shutdown, value caching, directory, locks) over a distributed transport API with messaging back-ends for BlueGene, Myrinet GM/MX, TCP/IP sockets, LAPI/GSM, InfiniBand verbs, and a dummy SMP transport]

11

The X10 RoadMap

[Figure: roadmap from 06/08 to 12/10 with external and internal milestones — libraries and tools (X10DT, concurrency refactorings, SSCA#2, UTS, other apps, a trial at Rice U), language definition (v1.7 and v2.0 specs), JVM implementation (v1.7, initial debugger release, v2.0), C/C++ implementation (X10 Flash v1.7 releases 1 and 2, v2.0, port to PERCS hardware), and multi-process/advanced debugger work, leading to the public release of the X10 2.0 system]

Quick Language Review

Language goals

‹ Simple
 0 Start with a well-accepted programming model, build on strong technical foundations, add few core constructs
‹ Safe
 0 Eliminate possibility of errors by design, and through static checking
‹ Powerful
 0 Permit easy expression of high-level idioms
 0 And permit expression of high-performance programs
‹ Scalable
 0 Support high-end computing with millions of concurrent tasks
‹ Universal
 0 Present one core programming model to abstract from the current plethora of architectures

7 Overview of Features

‹ A lot of sequential features of Java inherited unchanged
 0 Classes (w/ single inheritance)
 0 Interfaces (w/ multiple inheritance)
 0 Instance and static fields
 0 Constructors, (static) initializers
 0 Overloaded, over-rideable methods
 0 Garbage collection
‹ Value classes
‹ Closures
‹ Points, Regions, Distributions, Arrays
‹ Substantial extensions to the type system
 0 Dependent types
 0 Generic types
 0 Function types
 0 Type definitions, inference
‹ Concurrency
 0 Fine-grained concurrency: async (p,l) S
 0 Atomicity: atomic (s)
 0 Ordering: L: finish S
 0 Data-dependent synchronization: when (c) S

Value and reference classes

‹ Value classes
 0 All fields of a value class are final
 0 A variable of value class type is never null
 0 "primitive" types are value classes: Boolean, Int, Char, Double, ...
 0 Instances of value classes may be freely copied from place to place
‹ Reference classes
 0 May have mutable fields
 0 May be null
 0 Only references to instances may be communicated between places (Remote Refs)

16

8 Points and Regions

‹ A point is an element of an n-dimensional Cartesian space (n>=1) with integer-valued coordinates, e.g., [5], [1, 2], …
‹ A point variable can hold values of different ranks, e.g.,
 0 var p: Point = [1]; p = [2,3]; ...
‹ Regions are collections of points of the same dimension
‹ Rectangular regions have a simple representation, e.g. [1..10, 3..40]
‹ Rich algebra over regions is provided

Operations
 0 p1.rank
   ‹ returns rank of point p1
 0 p1(i)
   ‹ returns element (i mod p1.rank) if i < 0 or i >= p1.rank
 0 p1 < p2, p1 <= p2, p1 > p2, p1 >= p2
   ‹ returns true iff p1 is lexicographically <, <=, >, or >= p2
   ‹ only defined when p1.rank and p2.rank are equal

17

Distributions and Arrays

‹ Distributions specify mapping of points in a region to places
 0 E.g. Dist.makeBlock(R)
 0 E.g. Dist.unique()
‹ Arrays are defined over a distribution and a base type
 0 A: Array[T]
 0 A: Array[T](d)
‹ Arrays are created through initializers
 0 Array.make[T](d, init)
‹ Arrays may be immutable
‹ Array operations
 0 A.rank ::= # dimensions in array
 0 A.region ::= index region (domain) of array
 0 A.dist ::= distribution of array A
 0 A(p) ::= element at point p, where p belongs to A.region
 0 A(R) ::= restriction of array onto region R
   ‹ Useful for extracting subarrays

18

9 Generic classes public abstract value class Rail[T] ‹ Classes and interfaces may have (length: int) type parameters implements Indexable[int,T], Settable[int,T] ‹ class Rail[T] { 0 private native def this(n: int): Defines a type constructor Rail[T]{length==n}; Rail public native def get(i: int): T; 0and a family of types Rail[int], public native def apply(i: int): T; public native def set(v: T, i: int): void; Rail[String], Rail[Object], } Rail[C], ...

‹ Rail[C]: as if Rail class is copied and C substituted for T

‹ Can instantiate on any type, including primitives (e.g., int)

19

Dependent Types

‹ Classes have properties
 0 public final instance fields
 0 class Region(rank: int, zeroBased: boolean, rect: boolean) { ... }
‹ Can constrain properties with a boolean expression
 0 Region{rank==3} — the type of all regions with rank 3
 0 Array[int]{region==R} — the type of all arrays defined over region R (R must be a constant or a final variable in scope at the type)
‹ Dependent types are checked statically
‹ Runtime casts are also permitted
 0 Requires run-time constraint checking/solving
‹ Dependent type system is extensible
 0 See OOPSLA 08 paper

10 Function Types

‹ (T1, T2, ..., Tn) => U — the type of functions that take arguments Ti and return U
 0 If f: (T) => U and x: T, then invoke with f(x): U
‹ Function types can be used as an interface
 0 Define an apply method with the appropriate signature
‹ Closures — first-class functions
 0 (x: T): U => e
 0 Used in array initializers: Array.make[int](0..4, (p: point) => p(0)*p(0)) — the array [0, 1, 4, 9, 16]
‹ Operators
 0 int.+, boolean.&, ...
 0 sum = a.reduce(int.+(int,int), 0)

21

Type inference

ƒ Field, local variable types inferred from initializer type
 – val x = 1; /* x has type int{self==1} */
 – val y = 1..2; /* y has type Region{rank==1} */
ƒ Loop index types inferred from region
 – R: Region{rank==2}
 – for (p in R) { ... /* p has type Point{rank==2} */ }
ƒ Method return types inferred from method body
 – def m() { ... return true ... return false ... } /* m has return type boolean */
‹ Proposed:
 0 Inference of place types for asyncs (cf PPoPP 08 paper)

22

11 async

Stmt ::= async(p,l) Stmt

‹ Creates a new child activity that executes statement S (cf Cilk's spawn)
‹ Returns immediately
‹ S may reference final variables in enclosing blocks
‹ Activities cannot be named
‹ Activity cannot be aborted or cancelled

   void run() {
     if (r < 2) return;
     final Fib f1 = new Fib(r-1),
               f2 = new Fib(r-2);
     finish {
       async f1.run();
       f2.run();
     }
     result = f1.r + f2.r;
   }

23

L: finish S
Stmt ::= finish Stmt

‹ Execute S, but wait until all (transitively) spawned asyncs have terminated (cf Cilk's sync)
‹ Rooted exception model
 0 Trap all exceptions thrown by spawned activities
 0 Throw an (aggregate) exception if any spawned async terminates abruptly
 0 Implicit finish at main activity
‹ finish is useful for expressing "synchronous" operations on (local or) remote data

   void run() {
     if (r < 2) return;
     final Fib f1 = new Fib(r-1),
               f2 = new Fib(r-2);
     finish {
       async f1.run();
       f2.run();
     }
     result = f1.r + f2.r;
   }

24

12 atomic

Stmt ::= atomic Statement
MethodModifier ::= atomic

‹ Atomic blocks are conceptually executed in a single step while other activities are suspended: isolation and atomicity.
‹ An atomic block
 0 must be nonblocking
 0 must not create concurrent activities (sequential)
 0 must not access remote data (local)

   // target defined in lexically
   // enclosing scope.
   atomic boolean CAS(Object old, Object new) {
     if (target.equals(old)) {
       target = new;
       return true;
     }
     return false;
   }

   // push data onto concurrent list-stack
   Node node = new Node(data);
   atomic {
     node.next = head;
     head = node;
   }

25

when Stmt ::= WhenStmt WhenStmt ::= when ( Expr ) Stmt | WhenStmt or (Expr) Stmt

‹ when (E) S
 0 Activity suspends until a state in which the guard E is true.
 0 In that state, S is executed atomically and in isolation.
‹ Guard E
 0 boolean expression
 0 must be nonblocking
 0 must not create concurrent activities (sequential)
 0 must not access remote data (local)
 0 must not have side-effects (const)
‹ await (E)
 0 syntactic shortcut for when (E) ;

   class OneBuffer {
     var datum: Object = null;
     var filled: Boolean = false;

     def send(v: Object) {
       when ( ! filled ) {
         datum = v;
         filled = true;
       }
     }

     def receive(): Object {
       when ( filled ) {
         val v = datum;
         datum = null;
         filled = false;
         return v;
       }
     }
   }

13 Clocks: Motivation

‹ Activity coordination using finish is accomplished by checking for activity termination ‹ But in many cases activities have a producer-consumer relationship and a “barrier”-like coordination is needed without waiting for activity termination 0 The activities involved may be in the same place or in different places

‹ Design clocks to offer determinate and deadlock-free coordination between a dynamically varying number of activities.

[Figure: activities 0, 1, 2, … advance together through clock phases 0, 1, …]

Clocks – Main operations

var c = Clock.make();
 ‹ Allocate a clock, register current activity with it. Phase 0 of c starts.

async (…) clocked (c1,c2,…) S
ateach (…) clocked (c1,c2,…) S
foreach (…) clocked (c1,c2,…) S
 ‹ Create async activities registered on clocks c1, c2, …

c.resume();
 ‹ Nonblocking operation that signals completion of work by the current activity for this phase of clock c.

next;
 ‹ Barrier --- suspend until all clocks that the current activity is registered with can advance. c.resume() is first performed for each such clock, if needed.
 ‹ next can be viewed like a "finish" of all computations under way in the current phase of the clock.

28

14 Fundamental X10 Property

Programs written using async, finish, atomic, clock cannot deadlock

29

X10 Programming Idioms

15 Random Access (LOC=79)

Core algorithm: static RAUpdate(logLocalTableSize:Int, Table: Array[Long]{rail}){ finish ateach((p) in UNIQUE) { val localTableSize=1<

SPMD + Remote atomic operations (X10Flash Implementation) 31

Performance

[Figure: Random Access performance — left: on BG/W, GUPS vs. number of racks (UPC'06 vs. X10'07); right: on a Power5 SMP cluster, GUPS vs. number of processes (X10 vs. MPI); note the comlib change from '06 to '07]

          1         2         4         8         16        32        64
   UPC    0.58      1.15      2.28      4.49      8.83      14.8      28.3
   X10    0.58      1.04      1.31      2.51      4.51

          1         2         4         8         16        32        64        128       256       512       1024
   X10    0.005613  0.007755  0.013426  0.025345  0.048783  0.078405  0.11704   0.155228  0.323228  0.384118  0.398779
   MPI    0.001342  0.002124  0.004144  0.002796  0.005622  0.01084   0.020375  0.029293  0.055678  0.094022  0.143919

32

16 Depth-first search

class V(index:Int) { var parent:V; var neighbors: Rail[V]; def this(i:int):V(i){property(i);} public void compute(): void { for (int k=0; k < neighbors.length; k++) { val v = neighbors[k]; atomic v.parent=(v.parent==null?this:v.parent); if (v.parent==this)

async// doclocked something! (c) {next;v.compute();}} v.compute();}}} public computeTree(): void {finish compute();} … }

Single-Node Work-stealing (Java implementation) – 33 ICPP 08 paper

Adaptive stealing -- Code def compute(w:Worker):Void throws StealAbort { w.popAndReturnFrame(); val newList:List = null, newLength:Int = 0; var oldList:V = this, par:V = parent, batchSize:Int=0; do { val v = oldList, edges = v.neighbors; oldList = v.next; for (var k:Int = 0; k < edges.length; ++k) { val e = edges[k]; if (e != null && e.level == 0 && UPDATER.compareAndSet(e,0,1)) { e.parent = par; e.next = newList; newList=e; if (batchSize=0) { val s=w.getLocalQueueSize(); batchSize=(s <1)? 1:(s>=LOG_MAX? 1 <} } while (oldList != null); } Implemented in application space.

34

17 Torus on moxie: DFS vs BFS

Moxie = 32-way Sun Fire T200, 1.2 GHz, 32 GB memory, 8 KB data cache/processor, 16 KB instruction cache/core, 2 MB integrated L2 cache

KGraph on Moxie: BFS vs DFS

36

18 FT Code (LOC=137)

Key routine: global transpose finish ateach((p) in UNIQUE) { val numLocalRows = SQRTN/NUM_PLACES; int rowStartA = p*numLocalRows; // local transpose val block = [0..numLocalRows-1,0..numLocalRows-1]; for ((k) in 0..NUM_PLACES-1) { //for each block int colStartA = k*numLocalRows; … transpose locally… for ((i) in 0..numLocalRows-1) { val srcIdx = 2*((rowStartA + i)*SQRTN+colStartA), destIdx = 2*(SQRTN * (colStartA + i) + rowStartA); async (UNIQUE(k)) Runtime.arrayCopy(Y, srcIdx, Z, destIdx, 2*numLocalRows); }}} Communication/computation overlap across multiple nodes (X10Flash implementation) 37

FT Performance

[Figure: FT performance — left: BG/W, where X10 uses g++ and is now running for more racks (UPC ≈ 51 GF/s for one rack); right: 32 nodes of 16-way Power5+, 1.9 GHz, 64 GB]

38

19 LU Code (LOC=291)

Core algorithm: void run() { finish foreach (point [pi,pj] : [0:px-1,0:py-1]) { val startY=0, startX = new Rail.make[Int](ny); val myBlocks=A(pord(pi,pj)).z; while(startY < ny) { bvar done: boolean =false; for (var j:Int=startY; j < min(startY+LOOK_AHEAD, ny) && !done; ++j) { for (var i:Int =startX(j); i < nx; ++i) { val b = myBlocks[lord(i,j)]; if (b.ready) { if (i==startX[j]) startX[j]++; Single-place program } else done |= b.step(startY, startX);2d-block distribution of workers }} if (startX(startY)==nx) { startY++;} Step checks dependencies and executes appropriate basic }}} operation (LU, bSolve, lower). Communication/computation(Code overlap doesn’t across compute y.) multiple nodes (X10Flash implementation) 39

Example of steps

def step(val startY:Int, startX:Rail[Int]):Boolean { visitCount++; if (count==maxCount) { return I

40

LU Performance (SMP only)

LU performance on a 64-way 2.3 GHz p595+ PowerPC (GFlops; block size 256, PX = PY = 8; X10 on the IBM J9 JVM, 64-bit):

   ProbSz   X10      UPC      MPI
   4096     53.24    7.10     108.00
   8192     129.30   93.55    201.90
   16384    238.65   250.14   274.30
   32768    354.36   380.16   325.60
   65536    404.43   428.60   358.40

41

Additional work presented at http://x10-lang.org


Overview A. Introduction to PGAS (~ 45 mts) B. Introduction to Languages A. UPC (~ 60 mts) B. X10 (~ 60 mts) C. Chapel (~ 60 mts) C. Comparison of Languages (~45 minutes) A. Comparative Heat transfer Example B. Comparative Summary of features C. Discussion D. Hands-On (90 mts)

1

Chapel: the Cascade High Productivity Language

Brad Chamberlain Cray Inc.

SC08: Tutorial M04 – 11/17/08


Chapel

Chapel: a new parallel language being developed by Cray Inc.

Themes: • general parallel programming ƒ data-, task-, and nested parallelism ƒ express general levels of software parallelism ƒ target general levels of hardware parallelism • global-view abstractions • multiresolution design • control of locality • reduce gap between mainstream & parallel languages


Chapel’s Setting: HPCS

HPCS: High Productivity Computing Systems (DARPA et al.) • Goal: Raise HEC user productivity by 10× for the year 2010 • Productivity = Performance + Programmability + Portability + Robustness ƒ Phase II: Cray, IBM, Sun (July 2003 – June 2006) • Evaluated the entire system architecture’s impact on productivity… ƒ processors, memory, network, I/O, OS, runtime, compilers, tools, … ƒ …and new languages: Cray: Chapel IBM: X10 Sun: Fortress ƒ Phase III: Cray, IBM (July 2006 – 2010) • Implement the systems and technologies resulting from phase II • (Sun also continues work on Fortress, without HPCS funding)



Chapel and Productivity

Chapel’s Productivity Goals: • vastly improve programmability over current languages/models ƒ writing parallel codes ƒ reading, modifying, porting, tuning, maintaining, evolving them • support performance at least as good as MPI ƒ competitive with MPI on generic clusters ƒ better than MPI on more capable architectures • improve portability compared to current languages/models ƒ as ubiquitous as MPI, but with fewer architectural assumptions ƒ more portable than OpenMP, UPC, CAF, … • improve code robustness via improved semantics and concepts ƒ eliminate common error cases altogether ƒ better abstractions to help avoid other errors


Outline

✓ Chapel Context
➤ Terminology: Global-view & Multiresolution Prog. Models
☐ Language Overview
☐ Status, Future Work, Collaborations



Parallel Programming Model Taxonomy

programming model: the mental model a programmer uses when coding using a language, library, or other notation

fragmented models: those in which the programmer writes code from the point-of-view of a single processor/thread

SPMD models: Single-Program, Multiple Data -- a fragmented model in which the user writes one program, runs multiple copies of it, parameterized by a unique ID (1..numCopies)

global-view models: those in which the programmer can write code that describes the computation as a whole


Global-view vs. Fragmented

Problem: “Apply 3-pt stencil to vector”

[Figure: global-view — the stencil ( + )/2 = expressed once for the whole vector; the fragmented view is shown on the next slide]



Global-view vs. Fragmented

Problem: “Apply 3-pt stencil to vector”

[Figure: global-view — the stencil ( + )/2 = drawn once for the whole vector; fragmented — the same stencil drawn separately for each of three fragments, each needing boundary values from its neighbors]


Parallel Programming Model Taxonomy

programming model: the mental model a programmer uses when coding using a language, library, or other notation

fragmented models: those in which the programmer writes code from the point-of-view of a single processor/thread

SPMD models: Single-Program, Multiple Data -- a common fragmented model in which the user writes one program & runs multiple copies of it, parameterized by a unique ID

global-view models: those in which the programmer can write code that describes the computation as a whole



Global-view vs. SPMD Code

Problem: "Apply 3-pt stencil to vector"

global-view:
   def main() {
     var n: int = 1000;
     var a, b: [1..n] real;

     forall i in 2..n-1 {
       b(i) = (a(i-1) + a(i+1))/2;
     }
   }

SPMD:
   def main() {
     var n: int = 1000;
     var locN: int = n/numProcs;
     var a, b: [0..locN+1] real;

     if (iHaveRightNeighbor) {
       send(right, a(locN));
       recv(right, a(locN+1));
     }
     if (iHaveLeftNeighbor) {
       send(left, a(1));
       recv(left, a(0));
     }
     forall i in 1..locN {
       b(i) = (a(i-1) + a(i+1))/2;
     }
   }


Global-view vs. SPMD Code Assumes numProcs divides n; a more general version would Problem: “Apply 3-pt stencil to vector” require additional effort SPMD global-view def main() { def main() { var n: int = 1000; var n: int = 1000; var locN: int = n/numProcs; var a, b: [1..n] real; var a, b: [0..locN+1] real; var innerLo: int = 1; forall i in 2..n-1 { var innerHi: int = locN; b(i) = (a(i-1) + a(i+1))/2; } if (iHaveRightNeighbor) { } send(right, a(locN)); recv(right, a(locN+1)); } else { innerHi = locN-1; } if (iHaveLeftNeighbor) { send(left, a(1)); recv(left, a(0)); } else { innerLo = 2; } forall i in innerLo..innerHi { b(i) = (a(i-1) + a(i+1))/2; } Tutorial M04: Chapel (12) }


MPI SPMD pseudo-code

Problem: “Apply 3-pt stencil to vector”
SPMD (pseudocode + MPI)
Communication becomes geometrically more complex for higher-dimensional arrays.

    var n: int = 1000, locN: int = n/numProcs;
    var a, b: [0..locN+1] real;
    var innerLo: int = 1, innerHi: int = locN;
    var numProcs, myPE: int;
    var retval: int;
    var status: MPI_Status;

    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myPE);
    if (myPE < numProcs-1) {
      retval = MPI_Send(&(a(locN)), 1, MPI_FLOAT, myPE+1, 0, MPI_COMM_WORLD);
      if (retval != MPI_SUCCESS) { handleError(retval); }
      retval = MPI_Recv(&(a(locN+1)), 1, MPI_FLOAT, myPE+1, 1, MPI_COMM_WORLD, &status);
      if (retval != MPI_SUCCESS) { handleErrorWithStatus(retval, status); }
    } else
      innerHi = locN-1;
    if (myPE > 0) {
      retval = MPI_Send(&(a(1)), 1, MPI_FLOAT, myPE-1, 1, MPI_COMM_WORLD);
      if (retval != MPI_SUCCESS) { handleError(retval); }
      retval = MPI_Recv(&(a(0)), 1, MPI_FLOAT, myPE-1, 0, MPI_COMM_WORLD, &status);
      if (retval != MPI_SUCCESS) { handleErrorWithStatus(retval, status); }
    } else
      innerLo = 2;
    forall i in (innerLo..innerHi) {
      b(i) = (a(i-1) + a(i+1))/2;
    }

Tutorial M04: Chapel (13)

rprj3 stencil from NAS MG

[figure: the 27-point rprj3 stencil — the center point is weighted w0, face neighbors w1, edge neighbors w2, and corner neighbors w3, all summed into the coarse-grid result]

Tutorial M04: Chapel (14)


NAS MG rprj3 stencil in Fortran + MPI

[code listing: the NAS MG rprj3 stencil in Fortran + MPI/CAF — the rprj3 routine together with its comm3, give3, take3, and comm1p boundary-exchange helpers; most of the listing is communication and buffering bookkeeping rather than the stencil itself. Contrast with the Chapel version on the next slide.]

Tutorial M04: Chapel (15)

NAS MG rprj3 stencil in Chapel

    def rprj3(S, R) {
      const Stencil = [-1..1, -1..1, -1..1],
            w: [0..3] real = (0.5, 0.25, 0.125, 0.0625),
            w3d = [(i,j,k) in Stencil] w((i!=0) + (j!=0) + (k!=0));

      forall ijk in S.domain do
        S(ijk) = + reduce [offset in Stencil]
                     (w3d(offset) * R(ijk + offset*R.stride));
    }

Our previous work in ZPL showed that compact, global-view codes like these can result in performance that matches or beats hand-coded Fortran+MPI.

Tutorial M04: Chapel (16)


Summarizing Fragmented/SPMD Models

ƒ Advantages:
  • fairly straightforward model of execution
  • relatively easy to implement
  • reasonable performance on commodity architectures
  • portable/ubiquitous
  • lots of important scientific work has been accomplished with them

ƒ Disadvantages:
  • blunt means of expressing parallelism: cooperating executables
  • fails to abstract away architecture / implementing mechanisms
  • obfuscates algorithms with many low-level details
    ƒ error-prone
    ƒ brittle code: difficult to read, maintain, modify, experiment with
    ƒ “MPI: the assembly language of parallel computing”

Tutorial M04: Chapel (17)

Current HPC Programming Notations

ƒ communication libraries:
  • MPI, MPI-2
  • SHMEM, ARMCI, GASNet

ƒ shared memory models: • OpenMP, pthreads

ƒ PGAS languages:
  • Co-Array Fortran
  • UPC
  • Titanium

ƒ HPCS languages:
  • Chapel
  • X10 (IBM)
  • Fortress (Sun)

Tutorial M04: Chapel (18)


Parallel Programming Models: Two Camps

[figure: two camps of parallel programming models, both ultimately targeting the machine — one camp (ZPL, HPF) builds higher-level abstractions, prompting “Why do my hands feel tied?”; the other (MPI, OpenMP, pthreads) exposes the implementing mechanisms, prompting “Why is everything so painful?”]

Tutorial M04: Chapel (19)

Multiresolution Language Design

Our Approach: Permit the language to be utilized at multiple levels, as required by the problem/programmer (see the sketch after the diagram below)
  • provide high-level features and automation for convenience
  • provide the ability to drop down to lower, more manual levels
  • use appropriate separation of concerns to keep these layers clean

[figure: three multiresolution stacks over the target machine — language concepts (from Distributions down through the Base Language and Locality Control), task scheduling (stealable tasks, suspendable tasks, run to completion, thread per task), and memory management (garbage collection, region-based, malloc/free)]
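To make the multiresolution idea concrete, here is a small sketch (not from the tutorial) that writes one array operation twice in the Chapel dialect used in these slides — once with the high-level data-parallel forall, and once a level lower with explicit tasks and locales. The names n, D, A, and B are assumptions introduced only for this illustration.

    config const n = 8;
    const D: domain(1) distributed Block = [1..n];   // distributed index set, in the slides' syntax
    var A, B: [D] real;

    // High level: one data-parallel statement over the distributed array
    forall i in D do
      B(i) = 2 * A(i);

    // One level down: the "same" loop written with explicit tasks and locales
    coforall loc in Locales do
      on loc do
        for i in D do                      // a production code would restrict i to the
          if (A(i).locale == loc) then     // indices owned by loc rather than filtering
            B(i) = 2 * A(i);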

Tutorial M04: Chapel (20)


Outline

✓ Chapel Context
✓ Terminology: Global-view & Multiresolution Prog. Models
➤ Language Overview
  • Base Language
  • Parallel Features
    • task parallel
    • data parallel
  • Locality Features
☐ Status, Future Work, Collaborations

Tutorial M04: Chapel (21)

Base Language: Design

ƒ Block-structured, imperative programming
ƒ Intentionally not an extension to an existing language
ƒ Instead, select attractive features from others:
  • ZPL, HPF: data parallelism, index sets, distributed arrays (see also APL, NESL, Fortran90)
  • Cray MTA C/Fortran: task parallelism, lightweight synchronization
  • CLU: iterators (see also Ruby, Python, C#)
  • ML: latent types (see also Scala, Matlab, Perl, Python, C#)
  • Java, C#: OOP, type safety
  • C++: generic programming/templates (without adopting its syntax)
  • C, Modula, Ada: syntax

Tutorial M04: Chapel (22)


Base Language: Standard Stuff

ƒ Lexical structure and syntax based largely on C/C++
      { a = b + c; foo(); }   // no surprises here
ƒ Reasonably standard in terms of:
  • scalar types
  • constants, variables
  • operators, expressions, statements, functions
ƒ Support for object-oriented programming
  • value- and reference-based classes
  • no strong requirement to use OOP
ƒ Modules for namespace management
ƒ Generic functions and classes
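A minimal sketch, in the dialect shown on these slides, of the “standard stuff” above — one module, one class, and one generic routine; the names Shapes, Circle, and sum3 are made up for the example.

    module Shapes {
      class Circle {
        var radius: real;
        def area() { return 3.14159 * radius * radius; }
      }

      // generic routine: argument and return types are inferred at each callsite
      def sum3(x, y, z) {
        return x + y + z;
      }
    }

    use Shapes;
    var c = new Circle(radius = 2.0);
    writeln(c.area());        // 12.56636
    writeln(sum3(1, 2, 3));   // 6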

Tutorial M04: Chapel (23)

Base Language: Departures

ƒ Syntax: declaration syntax differs from C/C++
      var <name> [: <type>] [= <init>];
      def <name>(<args>) [: <return-type>] { … }
ƒ Types
  • support for complex, imaginary, string types
  • sizes more explicit than in C/C++ (e.g., int(32), complex(128))
  • richer array support than C/C++, Java, even Fortran
  • no pointers (apart from class references)
ƒ Operators
  • casts via ‘:’ (e.g., 3.14: int(32))
  • exponentiation via ‘**’ (e.g., 2**n)
ƒ Statements: for loop differs from C/C++
      for <index> in <iterand> { … }     e.g., for i in 1..n { … }
ƒ Functions: argument-passing semantics
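A brief illustrative sketch (not from the slides) exercising the departures listed above — declaration syntax, a sized type, a ‘:’ cast, ‘**’ exponentiation, and the for-in loop:

    var count: int(32) = 0;             // var <name> : <type> = <init>
    const approxPi = 3.14159;
    var asInt = approxPi: int(32);      // cast via ':'   (asInt == 3)
    var big = 2**20;                    // exponentiation via '**'

    for i in 1..5 {                     // for <index> in <iterand> { … }
      count += i;
    }
    writeln(count, " ", asInt, " ", big);    // 15 3 1048576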

Tutorial M04: Chapel (24)


Base Language: My Favorite Departures

ƒ Rich compile-time language
  • parameter values (compile-time constants)
  • folded conditionals, unrolled for loops, expanded tuples
  • type and parameter functions – evaluated at compile-time
ƒ Latent types:
  • ability to omit type specifications for convenience or reuse
  • type specifications can be omitted from…
    ƒ variables (inferred from initializers)
    ƒ class members (inferred from constructors)
    ƒ function arguments (inferred from callsite)
    ƒ function return types (inferred from return statements)
ƒ Configuration variables (and parameters)
      config const n = 100;    // override with --n=1000000
ƒ Tuples
ƒ Iterators…
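As a hypothetical example (all names are made up) touching the features just listed — a config constant, a compile-time param, latent typing, and tuples:

    config const n = 100;           // override at run time:  ./a.out --n=1000000
    config param verbose = false;   // compile-time constant; the 'if' below folds away

    var pt = (1.0, 2.0);            // a 2-tuple; its type is inferred
    var (x, y) = pt;                // tuple destructuring

    def dist(p, q) {                // latent types: argument types inferred per callsite
      return ((p(1)-q(1))**2 + (p(2)-q(2))**2) ** 0.5;
    }

    if verbose then
      writeln("n = ", n, ", dist = ", dist(pt, (4.0, 6.0)));   // dist = 5.0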

Tutorial M04: Chapel (25)

Base Language: Motivation for Iterators

[figure: three C code fragments — a program containing many similar doubly nested loops; the edits required to convert them all from row-major order (RMO) to column-major order (CMO); and the further edits required to tile all of the loops]

Tutorial M04: Chapel (26)


Base Language: Motivation for Iterators

[figure: the same three code fragments as on the previous slide — many similar loops, the RMO-to-CMO conversion, and the tiled versions]

Tutorial M04: Chapel (27)

Base Language: Iterators

ƒ Iterators are like functions, but they yield a number of elements one-by-one:

      iterator RMO() {
        for i in 1..m do
          for j in 1..n do
            yield (i,j);
      }

      iterator tiled(blocksize) {
        for ii in 1..m by blocksize do
          for jj in 1..n by blocksize do
            for i in ii..min(m, ii+blocksize-1) do
              for j in jj..min(n, jj+blocksize-1) {
                yield (i,j);
              }
      }

ƒ iterators are used to drive loops:

      for ij in RMO() {
        …A(ij)…
      }

      for ij in tiled(blocksize) {
        …A(ij)…
      }

ƒ as with functions…
  …one iterator can be redefined to change the behavior of many loops
  …a single invocation can be altered, or its arguments can be changed
ƒ not necessarily any more expensive than in-line loops

Tutorial M04: Chapel (28)


Task Parallelism: Task Creation

begin: creates a task for future evaluation

begin DoThisTask(); WhileContinuing(); TheOriginalThread();

sync: waits on all begins created within a dynamic scope

sync { begin recursiveTreeSearch(root); }

Tutorial M04: Chapel (29)

Task Parallelism: Task Coordination

sync variables: store full/empty state along with value

      var result$: sync real;    // result is initially empty
      sync {
        begin … = result$;       // block until full, leave empty
        begin result$ = …;       // block until empty, leave full
      }
      result$.readFF();          // read when full, leave full;
                                 // other variations also supported

single-assignment variables: writable once only

      var result$: single real = begin f();   // result initially empty
      …                                       // do some other things
      total += result$;                       // block until f() has completed

atomic sections: support transactions against memory

      atomic {
        newnode.next = insertpt;
        newnode.prev = insertpt.prev;
        insertpt.prev.next = newnode;
        insertpt.prev = newnode;
      }

Tutorial M04: Chapel (30)


Task Parallelism: Structured Tasks

cobegin: creates a task per component statement:

      cobegin {
        computeTaskA(…);
        computeTaskB(…);
        computeTaskC(…);
      } // implicit join here

      computePivot(lo, hi, data);
      cobegin {
        Quicksort(lo, pivot, data);
        Quicksort(pivot, hi, data);
      } // implicit join

coforall: creates a task per loop iteration

coforall e in Edges { exploreEdge(e); } // implicit join here

Tutorial M04: Chapel (31)

Producer/Consumer example

      var buff$: [0..buffersize-1] sync int;

      cobegin {
        producer();
        consumer();
      }

      def producer() {
        var i = 0;
        for … {
          i = (i+1) % buffersize;
          buff$(i) = …;
        }
      }

      def consumer() {
        var i = 0;
        while … {
          i = (i+1) % buffersize;
          …buff$(i)…;
        }
      }

Tutorial M04: Chapel (32)


Domains

domain: a first-class index set

var m = 4, n = 8;

var D: domain(2) = [1..m, 1..n];

[figure: the 4 × 8 set of indices described by D]

Tutorial M04: Chapel (33)

Domains

domain: a first-class index set

var m = 4, n = 8;

var D: domain(2) = [1..m, 1..n]; var Inner: subdomain(D) = [2..m-1, 2..n-1];

[figure: Inner, the 2 × 6 interior of D, shown inside the full 4 × 8 domain D]

Tutorial M04: Chapel (34)


Domains: Some Uses

ƒ Declaring arrays:
      var A, B: [D] real;

ƒ Iteration (sequential or parallel):
      for ij in Inner { … }
      or: forall ij in Inner { … }
      or: …

ƒ Array Slicing:
      A[Inner] = B[Inner];


ƒ Array reallocation:
      D = [1..2*m, 1..2*n];

Tutorial M04: Chapel (35)

Data Parallelism: Domains

domains: first-class index sets, whose indices can be…

…integer tuples (dense, strided, or sparse sets such as (1,0)…(10,24))…
…anonymous (opaque indices, e.g., for graphs)…
…or arbitrary values (associative domains, e.g., the strings “steve”, “mary”, “wayne”, “david”, “john”, “samuel”, “brad”).

Tutorial M04: Chapel (36)


Data Parallelism: Domain Declarations

      var DnsDom: domain(2) = [1..10, 0..24],
          StrDom: subdomain(DnsDom) = DnsDom by (2,4),
          SpsDom: subdomain(DnsDom) = genIndices();

[figure: DnsDom, StrDom, and SpsDom as dense, strided, and sparse subsets of the indices (1,0)…(10,24); GrphDom as an opaque (graph) domain; NameDom as an associative domain over the names read in]

      var GrphDom: domain(opaque),
          NameDom: domain(string) = readNames();

Tutorial M04: Chapel (37)

Data Parallelism: Domains and Arrays

Domains are used to declare arrays…

      var DnsArr: [DnsDom] complex,
          SpsArr: [SpsDom] real;


Tutorial M04: Chapel (38)


Data Parallelism: Domain Iteration

…to iterate over index spaces…

      forall ij in StrDom {
        DnsArr(ij) += SpsArr(ij);
      }


Tutorial M04: Chapel (39)

Data Parallelism: Array Slicing

…to slice arrays…

      DnsArr[StrDom] += SpsArr[StrDom];


Tutorial M04: Chapel (40)


Data Parallelism: Array Reallocation

…and to reallocate arrays

      StrDom = DnsDom by (2,2);
      SpsDom += genEquator();


Tutorial M04: Chapel (41)

Locality: Locales

locale: architectural unit of locality
  • has capacity for processing and storage
  • threads within a locale have ~uniform access to local memory
  • memory within other locales is accessible, but at a price
  • e.g., a multicore processor or SMP node could be a locale

[figure: four locales L0–L3, each containing processors and memory]
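For instance, a minimal sketch (not part of the original slides) in which each locale identifies itself:

    // assumes the program was launched on several locales, e.g.  ./myChapelProg -nl=4
    coforall loc in Locales do    // one task per locale...
      on loc do                   // ...running on that locale
        writeln("hello from locale ", here.id, " of ", numLocales);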

Tutorial M04: Chapel (42)


Locality: Locales

ƒ user specifies # locales on the executable's command line
      prompt> myChapelProg -nl=8
ƒ Chapel programs have built-in locale variables:
      config const numLocales: int;
      const LocaleSpace = [0..numLocales-1],
            Locales: [LocaleSpace] locale;
ƒ Programmers can create their own locale views:
      var CompGrid = Locales.reshape([1..GridRows, 1..GridCols]);

      var TaskALocs = Locales[..numTaskALocs];

      var TaskBLocs = Locales[numTaskALocs+1..];

Tutorial M04: Chapel (43)

Locality: Task Placement

on clauses: indicate where tasks should execute

Either in a data-driven manner…

      computePivot(lo, hi, data);
      cobegin {
        on A(lo) do Quicksort(lo, pivot, data);
        on A(pivot) do Quicksort(pivot, hi, data);
      }

…or by naming locales explicitly

      cobegin {
        on TaskALocs do computeTaskA(…);
        on TaskBLocs do computeTaskB(…);
        on Locales(0) do computeTaskC(…);
      }

Tutorial M04: Chapel (44)


Locality: Domain Distribution

Domains may be distributed across locales

      var D: domain(2) distributed Block on CompGrid = …;

[figure: D, and the arrays A and B declared over it, block-distributed across the 2 × 4 locale grid CompGrid]

A distribution implies…
  …ownership of the domain's indices (and its arrays' elements)
  …the default work ownership for operations on the domains/arrays

Chapel provides…
  …a standard library of distributions (Block, Recursive Bisection, …)
  …the means for advanced users to author their own distributions
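As a small sketch in the slides' syntax (the Block distribution and the 'distributed' keyword are used as shown above; the domain size and array name are made up for illustration):

    const D: domain(2) distributed Block = [1..8, 1..8];
    var A: [D] real;

    forall (i,j) in D do    // iterations follow the distribution's ownership by default
      A(i,j) = here.id;     // each element records the locale that computed it

    writeln(A);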

Tutorial M04: Chapel (45)

Locality: Domain Distributions

A distribution must implement…
  …the mapping from indices to locales
  …the per-locale representation of domain indices and array elements
  …the compiler's target interface for lowering global-view operations


Tutorial M04: Chapel (46)


Locality: Domain Distributions

A distribution must implement…
  …the mapping from indices to locales
  …the per-locale representation of domain indices and array elements
  …the compiler's target interface for lowering global-view operations


Tutorial M04: Chapel (47)

Locality: Distributions Overview

Distributions: “recipes for distributed arrays”

ƒ Intuitively, distributions support the lowering…
  …from: the user's global-view operations on a distributed array
  …to: the fragmented implementation for a distributed memory machine
ƒ Users can implement custom distributions:
  • written using task parallel features, on clauses, domains/arrays
  • must implement a standard interface:
    ƒ allocation/reallocation of domain indices and array elements
    ƒ mapping functions (e.g., index-to-locale, index-to-value)
    ƒ iterators: parallel/serial × global/local
    ƒ optionally, communication idioms
ƒ Chapel provides a standard library of distributions…
  …written using the same mechanism as user-defined distributions
  …tuned for different platforms to maximize performance

Tutorial M04: Chapel (48)


Other Features

ƒ zippered and tensor flavors of iteration and promotion
ƒ subdomains and index types to help reason about indices
ƒ reductions and scans (standard or user-defined operators)
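A few of these features in a short, hypothetical sketch — a reduction, a scan, and a zippered loop over two arrays (the arrays A and B are invented for the example):

    var A, B: [1..5] real;
    [i in 1..5] A(i) = i;            // A = 1.0 2.0 3.0 4.0 5.0

    var total    = + reduce A;       // reduction: 15.0
    var partials = + scan A;         // scan:      1.0 3.0 6.0 10.0 15.0
    var largest  = max reduce A;     // reduction with another operator: 5.0

    forall (a, b) in (A, B) do       // zippered iteration over two arrays
      b = 2 * a;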

Tutorial M04: Chapel (49)

Outline

✓ Chapel Context
✓ Global-view Programming Models
✓ Language Overview
➤ Status, Future Work, Collaborations

Tutorial M04: Chapel (50)


Chapel Work

ƒ Chapel Team's Focus:
  • specify Chapel syntax and semantics
  • implement an open-source prototype compiler for Chapel
  • perform code studies of benchmarks, apps, and libraries in Chapel
  • do community outreach to inform and learn from users/researchers
  • support users of code releases
  • refine the language based on all these activities

[figure: these activities feed one another in a cycle — specify, implement, code studies, outreach, support, release]

Tutorial M04: Chapel (51)

Prototype Compiler Development

ƒ Development Strategy:
  • start by developing and nurturing within Cray under HPCS
  • initial releases to small sets of “friendly” users for the past few years
  • public release scheduled for SC08
  • turn over to the community when it's ready to stand on its own

ƒ Compilation approach:
  • source-to-source compiler for portability (Chapel-to-C)
  • link against runtime libraries to hide machine details
    ƒ threading layer currently implemented using pthreads
    ƒ communication currently implemented using Berkeley's GASNet

Tutorial M04: Chapel (52)


Compiling Chapel

[figure: Chapel source code → Chapel compiler → Chapel executable]

Chapel Standard Modules

Tutorial M04: Chapel (53)

Chapel Compiler Architecture Chapel Compiler

[figure: Chapel source code → Chapel-to-C compiler → generated C code → standard C compiler & linker → Chapel executable]

[figure: the Chapel standard modules and internal modules (written in Chapel) feed the Chapel-to-C compiler; runtime support libraries (in C) plus 1-sided messaging and threading libraries are linked into the executable]

Tutorial M04: Chapel (54)


Implementation Status

ƒ Base language: stable (a few gaps and bugs remain)
ƒ Task parallel: stable, multithreaded
ƒ Data parallel:
  • stable serial reference implementation
  • initial support for a multi-threaded implementation
ƒ Locality:
  • stable locale types and arrays
  • stable task parallelism across multiple locales
  • initial support for distributed arrays across multiple locales
ƒ Performance:
  • has received much attention in designing the language
  • yet very little implementation effort thus far

Tutorial M04: Chapel (55)

Chapel and Research

ƒ Chapel contains a number of research challenges
  • the broadest: “solve the parallel programming problem”
ƒ We intentionally bit off more than an academic project would
  • due to our emphasis on general parallel programming
  • due to the belief that adoption requires a broad feature set
  • to create a platform for broad community involvement
ƒ Most Chapel features are taken from previous work
  • though we mix and match heavily, which brings new challenges
ƒ Others represent research of interest to us/the community

Tutorial M04: Chapel (56)


Some Research Challenges

ƒ Near-term:
  • user-defined distributions
  • zippered parallel iteration
  • index/subdomain optimizations
  • heterogeneous locale types
  • language interoperability
ƒ Medium-term:
  • memory management policies/mechanisms
  • task scheduling policies
  • performance tuning for multicore processors
  • unstructured/graph-based codes
  • compiling/optimizing atomic sections (STM)
  • parallel I/O
ƒ Longer-term:
  • checkpoint/resiliency mechanisms
  • mapping to accelerator technologies (GP-GPUs, FPGAs?)
  • hierarchical locales

Tutorial M04: Chapel (57)

Chapel and Community

ƒ Our philosophy:
  • Help the community understand what we are doing
  • Make our code available to the community
  • Encourage external collaborations

ƒ Goals:
  • to get feedback that will help make the language more useful
  • to support collaborative research efforts
  • to accelerate the implementation
  • to aid with adoption

Tutorial M04: Chapel (58)


Current Collaborations

ORNL (David Bernholdt et al.): Chapel code studies – Fock matrix computations, MADNESS, Sweep3D, … (HIPS `08)
PNNL (Jarek Nieplocha et al.): ARMCI port of the communication layer
UIUC (Vikram Adve and Rob Bocchino): Software Transactional Memory (STM) over distributed memory (PPoPP `08)
EPCC (Michele Weiland, Thom Haddow): performance study of single-locale task parallelism
CMU (Franz Franchetti): Chapel as a portable parallel back-end language for SPIRAL
(Your name here?)

Tutorial M04: Chapel (59)

Possible Collaboration Areas

ƒ any of the previously-mentioned research topics…
ƒ task parallel concepts
  • implementation using alternate threading packages
  • work-stealing task implementation
ƒ application/benchmark studies
ƒ different back-ends (LLVM? MS CLR?)
ƒ visualizations, algorithm animations
ƒ library support
ƒ tools
  • correctness debugging
  • performance debugging
  • IDE support
ƒ runtime compilation
ƒ (your ideas here…)

Tutorial M04: Chapel (60)


Next Steps

ƒ Continue to improve performance
ƒ Continue to add missing features
ƒ Expand the set of codes that we are currently studying
ƒ Expand the set of architectures that we are targeting
ƒ Support the public release
ƒ Continue to support collaborations and seek out new ones

Tutorial M04: Chapel (61)

Summary

Chapel strives to solve the Parallel Programming Problem

through its support for…
  …general parallel programming
  …global-view abstractions
  …control over locality
  …multiresolution features
  …modern language concepts and themes

Tutorial M04: Chapel (62)


Chapel Team

ƒ Current Team
  • Brad Chamberlain
  • Steve Deitz
  • Samuel Figueroa
  • David Iten

ƒ Interns
  • Robert Bocchino (`06 – UIUC)
  • James Dinan (`07 – Ohio State)
  • Mackale Joyner (`05 – Rice)
  • Andy Stone (`08 – Colorado St)

ƒ Alumni
  • David Callahan
  • Roxana Diaconescu
  • Shannon Hoffswell
  • Mary Beth Hribar
  • Mark James
  • John Plevyak
  • Wayne Wong
  • Hans Zima

Tutorial M04: Chapel (63)

For More Information

[email protected]

http://chapel.cs.washington.edu

Parallel Programmability and the Chapel Language; Chamberlain, Callahan, Zima; International Journal of High Performance Computing Applications, August 2007, 21(3):291-312.

Tutorial M04: Chapel (64)


Questions?

SC08: Tutorial M04 – 11/17/08

Brad Chamberlain, Cray Inc.

Overview
A. Introduction to PGAS (~45 mts)
B. Introduction to Languages
   A. UPC (~60 mts)
   B. X10 (~60 mts)
   C. Chapel (~60 mts)
C. Comparison of Languages (~45 minutes)
   A. Comparative Heat transfer Example
   B. Comparative Summary of features
   C. Discussion
D. Hands-On (90 mts)

1

Comparison of Languages

UPC

1 2D Heat Conduction Problem

‹ Based on the 2D Partial Differential Equation (1), the 2D Heat Conduction problem is similar to a 4-point stencil operation, as seen in (2):

    \frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2} = \frac{1}{\alpha}\,\frac{\partial T}{\partial t}    (1)

    T^{t+1}_{i,j} = \frac{1}{4\cdot\alpha}\left(T^{t}_{i-1,j} + T^{t}_{i+1,j} + T^{t}_{i,j-1} + T^{t}_{i,j+1}\right)    (2)

Because of the time steps, typically two grids are used.

[figure: the stencil — T^{t+1}_{y,x} computed from its four neighbors T^t_{y+1,x}, T^t_{y-1,x}, T^t_{y,x+1}, T^t_{y,x-1}]
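For example, one application of (2) at an interior point whose left neighbor sits on the heated boundary, taking α = 1 and the remaining neighbors at 0 (values chosen only for illustration):

    T^{t+1}_{i,j} = \frac{1}{4\cdot 1}\,(1 + 0 + 0 + 0) = 0.25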

3

Heat Transfer in Pictures

[figure: an n × n grid A with one boundary row set to 1.0; each step replaces every interior point by the average of its four neighbors (Σ ÷ 4), and the sweep repeats until the maximum change is < ε]

4

2D Heat Conduction Problem

    shared [BLOCKSIZE] double grids[2][N][N];
    shared double dTmax_local[THREADS];
    int i, x, y, nr_iter = 0, finished = 0;
    int dg = 1, sg = 0;
    double dTmax, dT, T, epsilon = 0.0001;

    do {
        dTmax = 0.0;
        for( y=1; y<… )
            …

        dTmax_local[MYTHREAD] = dTmax;
        upc_barrier;
        dTmax = dTmax_local[0];
        for( i=1; i<… )
            …

        if( dTmax < epsilon )
            finished = 1;
        else {
            …

Work distribution follows the defined BLOCKSIZE of grids[][][]; the affinity expression used here is generic, working for any BLOCKSIZE.

6

Comparison of Languages

X10

Heat transfer in X10

‹ X10 permits smooth variation between multiple concurrency styles 0“High-level” ZPL-style (operations on global arrays)

‹ Chapel “global view” style ‹ Expressible, but relies on “compiler magic” for performance 0OpenMP style

‹ Chunking within a single place 0MPI-style

‹ SPMD computation with explicit all-to-all reduction ‹ Uses clocks 0“OpenMP within MPI” style

‹ For hierarchical parallelism ‹ Fairly easy to derive from ZPL-style program.

8

Heat Transfer in X10 – ZPL style

    class Stencil2D {
      static type Real=Double;                          // type declaration
      const n = 6, epsilon = 1.0e-5;

      const BigD = Dist.makeBlock([0..n+1, 0..n+1]),    // block distribution
            D = BigD | [1..n, 1..n],
            LastRow = [0..0, 1..n] to Region;
      val A = Array.make[Real](BigD),
          Temp : Array[Real](BigD);
      { A(LastRow) = 1.0D; }                            // instance initializer

      def run() {
        do {
          finish ateach (p in D)
            Temp(p) = A(p.stencil(1)).reduce(Double.sum)/4;

          val delta = (A(D) - Temp(D)).abs().reduce(Double.max);   // operations on global arrays
          A(D) = Temp(D);
        } while (delta > epsilon);
      }
    }

9

Heat transfer in X10 – ZPL style

‹ Cast in fork-join style rather than SPMD style
  0 Compiler needs to transform into SPMD style
‹ Compiler needs to chunk iterations per place
  0 Fine-grained iteration has too much overhead
‹ Compiler needs to generate code for distributed array operations
  0 Create temporary global arrays, hoist them out of the loop, etc.
‹ Uses implicit syntax to access remote locations.

Simple to write --- tough to implement efficiently 10

Heat Transfer in X10 -- II

    def run() {
      do {
        finish ateach (z in D.places())
          for (p in D(z))
            Temp(p) = A(p.stencil(1)).reduce(Double.sum)/4;

        val delta = Math.abs(A(D) - Temp(D)).reduce(Double.max);
        A(D) = Temp(D);
      } while (delta > epsilon);
    }

‹ Flat parallelism: Assume one activity per place is desired.

‹ D.places() returns ValRail of places in D.

‹ D(z) returns sub-region of D at place z.

Explicit Loop Chunking 11

Heat Transfer in X10 -- III

    def run() {
      val blocks = Dist.util.block(D, P);
      do {
        finish ateach (z in D.places())
          foreach (q in 1..P)
            for (p in blocks(z,q))
              Temp(p) = A(p.stencil(1)).reduce(Double.sum)/4;

        val delta = Math.abs(A(D) - Temp(D)).reduce(Double.max);
        A(D) = Temp(D);
      } while (delta > epsilon);
    }

‹ Hierarchical parallelism: P activities at place z. 0Easy to change above code so P can vary with z.

‹ Dist.util.block(D,P)(z,q) is the region allocated to the q’th activity in the z’th place. (Block-block division.)

Explicit Loop Chunking with Hierarchical Parallelism 12

Heat Transfer in X10 -- IV

    def run() {
      finish async {
        val c = clock.make();
        val D_Base = Dist.unique(D.places);
        val diff = Array.make[Real](D_Base),
            scratch = Array.make[Real](D_Base);

        ateach (z in D.places()) clocked(c)      // one activity per place == MPI task
          do {
            diff(z) = 0.0D;
            for (p in D(z)) {
              val tmp = A(p);
              A(p) = A(p.stencil(1)).reduce(Double.sum)/4;
              diff(z) = Math.max(diff(z), Math.abs(tmp - A(p)));
            }
            next;                                // akin to a UPC barrier
            reduceMax(z, diff, scratch);
          } while (diff(z) > epsilon);
      }
    }

‹ reduceMax performs an all-to-all max reduction.
‹ Temp array is internalized.

SPMD with all-to-all reduction == MPI style 13

Heat Transfer in X10 -- V

    def run() {
      finish async {
        val c = clock.make();
        val D_Base = Dist.unique(D.places);
        val diff = Array.make[Real](D_Base),
            scratch = Array.make[Real](D_Base);

        ateach (z in D.places()) clocked(c)
          foreach (q in 1..P) clocked(c)
            do {
              if (q==1) diff(z) = 0.0D;
              var myDiff: Double = 0.0D;
              for (p in blocks(z,q)) {
                val tmp = A(p);
                A(p) = A(p.stencil(1)).reduce(Double.sum)/4;
                myDiff = Math.max(myDiff, Math.abs(tmp - A(p)));
              }
              atomic diff(z) = Math.max(myDiff, diff(z));
              next;
              if (q==1) reduceMax(z, diff, scratch);
              next;
            } while (diff(z) > epsilon);
      }
    }

“OpenMP within MPI style” 14

7 Heat Transfer in X10 -- VI

‹ All previous versions permit fine-grained remote access 0Used to access boundary elements

‹ Much more efficient to transfer boundary elements in bulk between clock phases.

‹ May be done by allocating extra “ghost” boundary at each place 0API extension: Dist.makeBlock(D, P, f)

‹ D: distribution, P: processor grid, f: region to region transformer.

‹ reduceMax phase overlapped with ghost distribution phase. (few extra lines.)

15

Comparison of Languages

Chapel

Heat Transfer in Chapel

    config const n = 6,
                 epsilon = 1.0e-5;

    const BigD: domain(2) = [0..n+1, 0..n+1],
          D: subdomain(BigD) = [1..n, 1..n],
          LastRow: subdomain(BigD) = D.exterior(1,0);

    var A, Temp : [BigD] real;

    A[LastRow] = 1.0;

    do {
      [(i,j) in D] Temp(i,j) = (A(i-1,j) + A(i+1,j)
                              + A(i,j-1) + A(i,j+1)) / 4;

      const delta = max reduce abs(A(D) - Temp(D));
      A[D] = Temp[D];
    } while (delta > epsilon);

    writeln(A);

17

Heat Transfer in Chapel

Declare program parameters:

    config const n = 6,
                 epsilon = 1.0e-5;

  const  ⇒ can't change values after initialization
  config ⇒ can be set on the executable command line:
             prompt> jacobi --n=10000 --epsilon=0.0001
  Note that no types are given; they are inferred from the initializers:
    n       ⇒ integer (current default, 32 bits)
    epsilon ⇒ floating-point (current default, 64 bits)

18

Heat Transfer in Chapel

Declare domains (first-class index sets):

    const BigD: domain(2) = [0..n+1, 0..n+1],
          D: subdomain(BigD) = [1..n, 1..n],
          LastRow: subdomain(BigD) = D.exterior(1,0);

  domain(2)    ⇒ 2D arithmetic domain, indices are integer 2-tuples
  subdomain(P) ⇒ a domain of the same type as P whose indices are guaranteed to be a subset of P's
  exterior     ⇒ one of several built-in domain generators

[figure: BigD as the full 0..n+1 × 0..n+1 index set, D as its interior, LastRow as the row just below D]

19

Heat Transfer in Chapel

Declare arrays:

    var A, Temp : [BigD] real;

  var              ⇒ can be modified throughout its lifetime
  : T              ⇒ declares the variable to be of type T
  : [D] T          ⇒ an array of size D with elements of type T
  (no initializer) ⇒ values initialized to the default value (0.0 for reals)

[figure: A and Temp allocated over BigD]

20

Heat Transfer in Chapel

Set the explicit boundary condition:

    A[LastRow] = 1.0;

  indexing by domain ⇒ slicing mechanism
  array expressions  ⇒ parallel evaluation

21

Heat Transfer in Chapel

Compute the 5-point stencil:

    [(i,j) in D] Temp(i,j) = (A(i-1,j) + A(i+1,j)
                            + A(i,j-1) + A(i,j+1)) / 4;

  [(i,j) in D] ⇒ a parallel forall expression over D's indices, binding them to new variables i and j
  Note: since (i,j) ∈ D and D ⊆ BigD and Temp: [BigD], no bounds check is required for Temp(i,j);
  with compiler analysis, the same can be proven for A's accesses.

22

Heat Transfer in Chapel

Compute the maximum change:

    const delta = max reduce abs(A(D) - Temp(D));

  op reduce ⇒ collapse an aggregate expression to a scalar using op
  Promotion: abs() and – are scalar operators, automatically promoted to work with array operands

23

Heat Transfer in Chapel

Copy the data back & repeat until done:

      A[D] = Temp[D];
    } while (delta > epsilon);

  uses slicing and whole-array assignment
  standard do…while loop construct

24

Heat Transfer in Chapel

Write the array to the console:

    writeln(A);

  If written to a file, parallel I/O would be used.

25

Heat Transfer in Chapel

    const BigD: domain(2) = [0..n+1, 0..n+1] distributed Block,
          D: subdomain(BigD) = [1..n, 1..n],
          LastRow: subdomain(BigD) = D.exterior(1,0);

  With this change, the same code runs in a distributed manner.
  Domain distribution maps indices to locales
    ⇒ decomposition of arrays & default location of iterations over locales
  Subdomains inherit the parent domain's distribution.

[figure: BigD, D, LastRow, A, and Temp block-distributed across the locales]

26

Heat Transfer in Chapel

    config const n = 6,
                 epsilon = 1.0e-5;

    const BigD: domain(2) = [0..n+1, 0..n+1] distributed Block,
          D: subdomain(BigD) = [1..n, 1..n],
          LastRow: subdomain(BigD) = D.exterior(1,0);

    var A, Temp : [BigD] real;

    A[LastRow] = 1.0;

    do {
      [(i,j) in D] Temp(i,j) = (A(i-1,j) + A(i+1,j)
                              + A(i,j-1) + A(i,j+1)) / 4;

      const delta = max reduce abs(A(D) - Temp(D));
      A[D] = Temp[D];
    } while (delta > epsilon);

    writeln(A);

27

Comparison of Languages

Comparative Feature Matrix

14 Features Matrix UPC X10 Chapel

Memory model
  UPC, X10, Chapel: PGAS

Programming/Execution model
  UPC: SPMD | X10: Multithreaded | Chapel: Global-view / Multithreaded

Base Language
  UPC: C | X10: Java | Chapel: N/A (influences include C, Modula, Java, Perl, CLU, ZPL, MTA, Scala, …)

Nested Parallelism
  UPC: Not supported | X10: Supported | Chapel: Supported

Incremental Parallelization of code
  UPC: Indirectly supported | X10: Supported | Chapel: Supported

Locality Awareness
  UPC: Yes (blocking and affinity) | X10: Yes | Chapel: Yes (affinity of code and data to locales; distributed data aggregates)

Dynamic Parallelism
  UPC: Still in research | X10: Yes – Asynchronous PGAS | Chapel: Yes – Asynchronous PGAS

29

Features Matrix

UPC X10 Chapel

Implicit/Explicit Communications
  UPC: Both | X10: Both | Chapel: Implicit; the user can assert locality of a code block (checked at compile-/runtime)

Collective Operations
  UPC: No explicit collective operations, but remote string functions are provided | X10: Yes (possibly nonblocking, initiated by a single activity) | Chapel: Reductions, scans, whole-array operations

Work Sharing
  UPC: Different affinity values in upc_forall | X10: Work-stealing supported on a single node | Chapel: Currently must be explicitly done by the user; future versions will support a work-sharing mode

Data Distribution
  UPC: Block, round-robin | X10: Standard distributions; users may define more | Chapel: Library of standard distributions + the ability for advanced users to define their own

Memory Consistency Model Control
  UPC: Strict and relaxed, allowed on a block-of-statements or variable-by-variable basis | X10: Under development (see theory in PPoPP 07) | Chapel: Strict with respect to sync/single variables; relaxed otherwise

30

15 Features Matrix

UPC X10 Chapel

Dynamic Memory Allocation
  UPC: Private or shared, with or without blocking | X10: Supports objects and arrays | Chapel: No pointers -- all dynamic allocation is through allocating new objects & resizing arrays

Synchronization
  UPC: Barriers, split-phase barrier, locks, and memory consistency control | X10: Conditional atomic blocks, dynamic barriers (clocks) | Chapel: Synchronization and single variables; transactional-memory-style atomic blocks

Type Conversion
  UPC: C rules; casting of shared pointers to private pointers | X10: Coercions, conversions supported as in OO languages | Chapel: C#-style rules

Pointers To Shared Space
  UPC: Yes | X10: Yes | Chapel: Yes

Global-view distributed arrays
  UPC: Yes, but 1D only | X10: Yes | Chapel: Yes

31

Partial Construct Comparison

Constructs UPC X10 Chapel

Concurrency
  UPC: upc_forall, spawn | X10: async, future, foreach, ateach | Chapel: begin, cobegin, forall, coforall

Termination detection
  UPC: N/A | X10: finish | Chapel: sync

Distribution construct
  UPC: affinity in upc_forall, blocksize in work distribution | X10: places, regions, distributions | Chapel: locales, domains, distributions

Atomicity control
  UPC: N/A | X10: Basic atomic blocks | Chapel: TM-based atomic blocks

Data-flow synchronization
  UPC: N/A | X10: Conditional atomic blocks | Chapel: single variables

Barriers
  UPC: upc_barrier | X10: clocks | Chapel: sync variable
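To make the Chapel column concrete, here is a small sketch (not from the tutorial) touching each construct it names; work() is a placeholder routine invented for the example.

    def work(id: int) { writeln("task ", id, " on locale ", here.id); }

    var counter = 0;
    var done$: sync bool;                 // data-flow synchronization: a sync variable

    begin { work(0); done$ = true; }      // concurrency: spawn one asynchronous task

    cobegin {                             // concurrency: one task per statement,
      work(1);                            // with an implicit join at the end
      work(2);                            // (termination detection)
    }

    coforall loc in Locales do            // distribution: task placement via 'on'
      on loc do work(here.id);

    atomic { counter += 1; }              // atomicity control: TM-based atomic block

    const ok = done$;                     // blocks until the 'begin' task has run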

32

16 You might consider using UPC if...

‹ you prefer C-based languages

‹ the SPMD programming/execution model fits your algorithm

‹ 1D block-cyclic/cyclic global arrays fit your algorithm

‹ you need to do production work today

33

You might consider using X10 if...

‹ you prefer Java-style languages

‹ you require richer/nested parallelism than SPMD

‹ you require multidimensional global arrays

‹ you're able to work with an emerging technology

34

17 You might consider using Chapel if...

‹ you're not particularly tied to any base language

‹ you require richer/nested parallelism than SPMD

‹ you require multidimensional global arrays

‹ you're able to work with an emerging technology

35

Discussion

Overview
A. Introduction to PGAS (~45 mts)
B. Introduction to Languages
   A. UPC (~60 mts)
   B. X10 (~60 mts)
   C. Chapel (~60 mts)
C. Comparison of Languages (~45 minutes)
   A. Comparative Heat transfer Example
   B. Comparative Summary of features
   C. Discussion
D. Hands-On (90 mts)

37

D. Hands-On

Backup

Heat Transfer in Chapel (Backup Variations)

40

Heat Transfer in Chapel (double buffered version)

    config const n = 6, epsilon = 1.0e-5;

    const BigD: domain(2) = [0..n+1, 0..n+1] distributed Block,
          D: subdomain(BigD) = [1..n, 1..n],
          LastRow: subdomain(BigD) = D.exterior(1,0);

    var A : [1..2] [BigD] real;

    A[..][LastRow] = 1.0;
    var src = 1, dst = 2;

    do {
      [(i,j) in D] A(dst)(i,j) = (A(src)(i-1,j) + A(src)(i+1,j)
                                + A(src)(i,j-1) + A(src)(i,j+1)) / 4;

      const delta = max reduce abs(A(src) - A(dst));
      src <=> dst;
    } while (delta > epsilon);

    writeln(A);

41

Heat Transfer in Chapel (named direction version)

    config const n = 6, epsilon = 1.0e-5;

    const BigD: domain(2) = [0..n+1, 0..n+1] distributed Block,
          D: subdomain(BigD) = [1..n, 1..n],
          LastRow: subdomain(BigD) = D.exterior(1,0);

    const north = (-1,0), south = (1,0), east = (0,1), west = (0,-1);

    var A, Temp : [BigD] real;

    A[LastRow] = 1.0;

    do {
      [ind in D] Temp(ind) = (A(ind + north) + A(ind + south)
                            + A(ind + east) + A(ind + west)) / 4;

      const delta = max reduce abs(A(D) - Temp(D));
      A[D] = Temp[D];
    } while (delta > epsilon);

    writeln(A);

42

Heat Transfer in Chapel (array of offsets version)

    config const n = 6, epsilon = 1.0e-5;

    const BigD: domain(2) = [0..n+1, 0..n+1] distributed Block,
          D: subdomain(BigD) = [1..n, 1..n],
          LastRow: subdomain(BigD) = D.exterior(1,0);

    param offset : [1..4] (int, int) = ((-1,0), (1,0), (0,1), (0,-1));

    var A, Temp : [BigD] real;

    A[LastRow] = 1.0;

    do {
      [ind in D] Temp(ind) = (+ reduce [off in offset] A(ind + off))
                               / offset.numElements;

      const delta = max reduce abs(A(D) - Temp(D));
      A[D] = Temp[D];
    } while (delta > epsilon);

    writeln(A);

43

Heat Transfer in Chapel (sparse offsets version)

    config const n = 6, epsilon = 1.0e-5;

    const BigD: domain(2) = [0..n+1, 0..n+1] distributed Block,
          D: subdomain(BigD) = [1..n, 1..n],
          LastRow: subdomain(BigD) = D.exterior(1,0);

    param stencilSpace: domain(2) = [-1..1, -1..1],
          offSet: sparse subdomain(stencilSpace) = ((-1,0), (1,0), (0,1), (0,-1));

    var A, Temp : [BigD] real;

    A[LastRow] = 1.0;

    do {
      [ind in D] Temp(ind) = (+ reduce [off in offSet] A(ind + off))
                               / offSet.numIndices;

      const delta = max reduce abs(A(D) - Temp(D));
      A[D] = Temp[D];
    } while (delta > epsilon);

    writeln(A);

44

Heat Transfer in Chapel (UPC-ish version)

    config const N = 6, epsilon = 1.0e-5;

    const BigD: domain(2) = [0..#N, 0..#N] distributed Block,
          D: subdomain(BigD) = BigD.expand(-1);

    var grids : [0..1] [BigD] real;
    var src = 0, dst = 1;

    do {
      [(x,y) in D] grids(dst)(x,y) = (grids(src)(x-1,y) + grids(src)(x+1,y)
                                    + grids(src)(x,y-1) + grids(src)(x,y+1)) / 4;

      const dTmax = max reduce abs(grids(src) - grids(dst));
      src <=> dst;
    } while (dTmax > epsilon);

    writeln(grids(src));

45

23