Parallel Programming Models

Lecture 3
Parallel Programming Models
Dr. Wilson Rivera
ICOM 6025: High Performance Computing
Electrical and Computer Engineering Department
University of Puerto Rico

Outline
• Goal: Understand parallel programming models
  – Programming Models
  – Application Models
  – Parallel Computing Frameworks
    • OpenMP
    • MPI
    • CUDA
    • Map-Reduce
  – Parallel Computing Libraries

Parallel Programming Models
• A parallel programming model is an abstraction of the underlying hardware or computer system.
  – Programming models are not specific to a particular type of memory architecture.
  – In theory, any programming model can be implemented on any underlying hardware.

Programming Language Models
• Threads Model
• Data Parallel Model
• Task Parallel Model
• Message Passing Model
• Partitioned Global Address Space Model

Threads Model
• A single process can have multiple, concurrent execution paths
  – The thread-based model requires the programmer to manually manage synchronization, load balancing, and locality, and to guard against:
    • Deadlocks
    • Race conditions
• Implementations
  – POSIX Threads
    • Specified by the IEEE POSIX 1003.1c standard (1995) and commonly referred to as Pthreads
    • Library-based explicit parallelism model
  – OpenMP
    • Compiler directive-based model that supports parallel programming in C/C++ and FORTRAN

Data Parallel Model
• Focus on data partitioning
• Each task performs the same operation on its own partition of the data.
  – This model is suitable for problems with static load balancing (e.g. very regular fluid element analysis, image processing).
• Implementations
  – FORTRAN 90 (F90), FORTRAN 95 (F95)
  – High Performance Fortran (HPF)
  – MapReduce, CUDA

Task Parallel Model
• Focus on task execution (workflow)
• Each processor executes a different thread or process
• Implementations
  – Threading Building Blocks (TBB)
  – Task Parallel Library (TPL)
  – Intel Concurrent Collections (CnC)

Message Passing Model
• A set of tasks use their own local memory during computation
  – Tasks exchange data by sending and receiving messages
  – Data transfer usually requires cooperative operations to be performed by each process.
    • For example, a send operation must have a matching receive operation.
• Implementations
  – MPI
  – Erlang

Partitioned Global Address Space Model
• Multiple threads share at least a part of a global address space
  – The idea behind this model is to provide the ability to access local and remote data with the same mechanisms.
• Implementations
  – Co-Array Fortran
  – Unified Parallel C (UPC)
  – Titanium

Integrated view of PP Models
[Figure: data parallel, task parallel, and message passing models arranged by execution style (SIMD, MIMD) and memory organization (Shared Memory, Distributed Memory). © Wilson Rivera]

PP Model Ecosystem
[Figure: programming models ordered along a "Higher Abstraction" axis. Model labels: Message Passing, Runtime System, Task Parallel, Data Parallel, Thread. Technologies shown: Erlang, MPI, CnC, TBB, TPL, HPF, UPC, Co-Array, MapReduce, Fortran90, Ct, CUDA, RapidMind, OpenMP, Pthreads. © Wilson Rivera]

Application Models
• Bag-of-Tasks
  – The user simply states a set of unordered computation tasks and lets the system assign them to processors
• Map-Reduce
  – A map operator is applied to a set of name-value pairs to generate several intermediate sets
  – A reduce operator is then applied to summarize the intermediate sets into one or more final sets
• Bulk-Synchronous-Parallel
  – Processes run in phases that are separated by a global synchronization step
  – During each phase, they perform calculations independently and then communicate new results to their data-dependent peers.
  – The execution time of a phase is determined by the most heavily loaded processor
• Directed Acyclic Graph (DAG)
  – A workflow application can be modeled as a directed acyclic graph A=(V,E), where V is the set of nodes that represent the tasks in an application and E is the set of edges that represent the dependences between tasks.
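The DAG model A=(V,E) described above can be made concrete with a small scheduling sketch. The Python code below is only an illustration, not part of the lecture: it uses Kahn's algorithm to derive one valid execution order, and the function name `topological_order` and the four task names are hypothetical.

```python
from collections import deque


def topological_order(tasks, edges):
    """Return one valid execution order for a DAG A = (V, E).

    tasks: iterable of task names (the node set V).
    edges: iterable of (u, v) pairs meaning "v depends on u" (the edge set E).
    Raises ValueError if the dependencies contain a cycle (not a DAG).
    """
    tasks = list(tasks)
    successors = {t: [] for t in tasks}
    pending = {t: 0 for t in tasks}  # unfinished prerequisites per task
    for u, v in edges:
        successors[u].append(v)
        pending[v] += 1

    # Tasks with no prerequisites can start immediately.
    ready = deque(t for t in tasks if pending[t] == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        # Finishing a task may release the tasks that depend on it.
        for nxt in successors[task]:
            pending[nxt] -= 1
            if pending[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(tasks):
        raise ValueError("dependency cycle: not a DAG")
    return order


# Hypothetical four-task workflow (names are illustrative only).
V = ["load", "filter", "stats", "report"]
E = [("load", "filter"), ("load", "stats"),
     ("filter", "report"), ("stats", "report")]
order = topological_order(V, E)
print(order)  # ['load', 'filter', 'stats', 'report']
```

Tasks that sit in the ready queue at the same time (here `filter` and `stats`) have no dependence on each other, so a runtime could assign them to different processors.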

Application Models
[Figure: the application models (Directed Acyclic Graph, Bulk Synchronous Parallel, Map-Reduce, Bag of Tasks) placed along two axes, Granularity and Synchronization, each running from less to more. © Wilson Rivera]

Application Settings
[Figure: High Performance Computing (Clusters), Cloud Computing, and Grid Computing plotted by Application Coupling (low to high) against Latency (low to high). © Wilson Rivera]

Motifs
[Table: computational motifs (rows) against application areas (columns: Embed, SPEC, DB, Games, ML, HPC, Health, Image, Speech, Music, Browser); the cell values were not preserved.]
1 Finite State Mach.
2 Combinational
3 Graph Traversal
4 Structured Grid
5 Dense Matrix
6 Sparse Matrix
7 Spectral (FFT)
8 Dynamic Prog
9 N-Body
10 MapReduce
11 Backtrack/B&B
12 Graphical Models
13 Unstructured Grid
Source: UC Berkeley

High Performance Libraries
http://www.netlib.org

HPC Libraries
• Basic Linear Algebra Subprograms (BLAS)
• Linear Algebra PACKage (LAPACK)
  – IBM (ESSL), AMD (ACML), Intel (MKL), Cray (libsci)
• Basic Linear Algebra Communication Subprograms (BLACS)
• Automatically Tuned Linear Algebra Software (ATLAS)
• Parallel Linear Algebra Software for Multicore Architectures (PLASMA)
• Matrix Algebra on GPU and Multicore Architectures (MAGMA)

Summary
• Goal: Understand parallel programming models
  – Programming Models
  – Application Models
  – Parallel Computing Frameworks
    • OpenMP
    • MPI
    • CUDA
    • Map-Reduce
  – Parallel Computing Libraries
