Lecture 3: Parallel Programming Models
Dr. Wilson Rivera
ICOM 6025: High Performance Computing
Electrical and Computer Engineering Department, University of Puerto Rico

Outline
• Goal: Understand parallel programming models
  – Programming Models
  – Application Models
  – Parallel Computing Frameworks
    • OpenMP
    • MPI
    • CUDA
    • MapReduce
  – Parallel Computing Libraries
Parallel Programming Models
• A parallel programming model represents an abstraction of the underlying hardware or computer system.
  – Programming models are not specific to a particular type of memory architecture.
  – In fact, any programming model can, in theory, be implemented on any underlying hardware.
Programming Language Models
• Threads Model
• Data Parallel Model
• Task Parallel Model
• Message Passing Model
• Partitioned Global Address Space Model
Threads Model
• A single process can have multiple concurrent execution paths.
  – The thread-based model requires the programmer to manually manage synchronization, load balancing, and locality, which exposes hazards such as:
    • Deadlocks
    • Race conditions
• Implementations
  – POSIX Threads (Pthreads)
    • Specified by the IEEE POSIX 1003.1c standard (1995).
    • Library-based, explicit parallelism model.
  – OpenMP
    • Compiler directive-based model that supports parallel programming in C/C++ and Fortran.
Data Parallel Model
• Focus on data partitioning.
• Each task performs the same operation on its own partition of the data.
  – This model is suitable for problems with static load balancing (e.g., very regular fluid element analysis, image processing).
• Implementations
  – Fortran 90 (F90), Fortran 95 (F95)
  – High Performance Fortran (HPF)
  – MapReduce, CUDA
Task Parallel Model
• Focus on task execution (workflow).
• Each processor executes a different thread or process.
• Implementations
  – Threading Building Blocks (TBB)
  – Task Parallel Library (TPL)
  – Intel Concurrent Collections (CnC)
Message Passing Model
• A set of tasks use their own local memory during computation.
  – Tasks exchange data by sending and receiving messages.
  – Data transfer usually requires cooperative operations to be performed by each process; for example, a send operation must have a matching receive operation.
• Implementations
  – MPI
  – Erlang
Partitioned Global Address Space Model
• Multiple threads share at least a part of a global address space.
  – The idea behind this model is to provide the ability to access local and remote data with the same mechanisms.
• Implementations
  – Co-Array Fortran
  – Unified Parallel C (UPC)
  – Titanium
Integrated View of PP Models
[Figure: integrated view of parallel programming models, mapping data parallel and task parallel models onto SIMD and MIMD execution over shared and distributed memory, with message passing on the distributed-memory side. © Wilson Rivera]

PP Model Ecosystem
[Figure: PP model ecosystem, arranged by level of abstraction (higher abstraction toward the top). © Wilson Rivera
  Message Passing: Erlang, MPI runtime system
  Task Parallel: TBB, TPL, CnC
  Data Parallel: HPF, UPC, Co-Array Fortran, Fortran 90, Ct, CUDA, RapidMind, MapReduce
  Threads: OpenMP, Pthreads]
Application Models
• Bag-of-Tasks
  – The user states a set of unordered computation tasks and allows the system to assign them to processors.
• MapReduce
  – A map operator is applied to a set of key-value pairs to generate several intermediate sets.
  – A reduce operator is then applied to summarize the intermediate sets into one or more final sets.
• Bulk Synchronous Parallel
  – Processes run in phases separated by a global synchronization step.
  – During each phase, they perform calculations independently and then communicate new results with their data-dependent peers.
  – The execution time of a step is determined by the most heavily loaded processor.
• Directed Acyclic Graph (DAG)
  – A workflow application can be modeled as a directed acyclic graph A = (V, E), where V is the set of nodes representing the tasks in an application, and E is the set of edges representing the dependencies between tasks.
[Figure: the application models arranged along two axes, granularity and synchronization: Bag of Tasks, MapReduce, Bulk Synchronous Parallel, and Directed Acyclic Graph, with granularity decreasing and synchronization increasing across the sequence. © Wilson Rivera]
Application Settings
[Figure: application settings arranged by application coupling (vertical axis, low to high) versus latency (horizontal axis, low to high): high performance computing (clusters) at high coupling and low latency, grid computing at low coupling and high latency, with cloud computing between them. © Wilson Rivera]

Motifs
[Table: the 13 Berkeley motifs — 1 Finite State Machine, 2 Combinational Logic, 3 Graph Traversal, 4 Structured Grid, 5 Dense Matrix, 6 Sparse Matrix, 7 Spectral (FFT), 8 Dynamic Programming, 9 N-Body, 10 MapReduce, 11 Backtrack/Branch & Bound, 12 Graphical Models, 13 Unstructured Grid — mapped against application areas: Embedded, SPEC, DB, Games, ML, HPC, Health, Image, Speech, Music, Browser. Source: UC Berkeley]
High Performance Libraries
http://www.netlib.org

HPC Libraries
• Basic Linear Algebra Subprograms (BLAS)
• Linear Algebra PACKage (LAPACK)
  – Vendor implementations: IBM (ESSL), AMD (ACML), Intel (MKL), Cray (libsci)
• Basic Linear Algebra Communication Subprograms (BLACS)
• Automatically Tuned Linear Algebra Software (ATLAS)
• Parallel Linear Algebra Software for Multicore Architectures (PLASMA)
• Matrix Algebra on GPU and Multicore Architectures (MAGMA)

Summary
• Goal: Understand parallel programming models
  – Programming Models
  – Application Models
  – Parallel Computing Frameworks
    • OpenMP
    • MPI
    • CUDA
    • MapReduce
  – Parallel Computing Libraries