Lecture 3: Parallel Programming Models

Dr. Wilson Rivera

ICOM 6025: High Performance Computing
Electrical and Computer Engineering Department
University of Puerto Rico

Outline

• Goal: Understand parallel programming models
  – Programming Models
  – Application Models
  – Frameworks
    • OpenMP
    • MPI
    • CUDA
    • MapReduce
  – Parallel Computing Libraries

Parallel Programming Models

• A parallel programming model represents an abstraction of the underlying hardware or computer system.

– Programming models are not specific to a particular type of memory architecture.
– In fact, any programming model can, in theory, be implemented on any underlying hardware.

Programming Language Models

• Threads Model
• Data Parallel Model
• Task Parallel Model
• Message Passing Model
• Partitioned Global Address Space Model

Threads Model

• A single process can have multiple, concurrent execution paths
  – The thread-based model requires the programmer to manually manage synchronization, load balancing, and locality
    • Deadlocks
    • Race conditions

• Implementations
  – POSIX Threads
    • Specified by the IEEE POSIX 1003.1c standard (1995) and commonly referred to as Pthreads
    • Library-based model
  – OpenMP
    • Compiler directive-based model that supports parallel programming in C/C++ and FORTRAN
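To make the library-based approach concrete, below is a minimal Pthreads sketch in C; the thread count and the trivial worker function are arbitrary choices for illustration. Each pthread_create call starts an independent execution path, and pthread_join waits for it (compile with -pthread).

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4   /* arbitrary thread count for this sketch */

/* Each thread prints its logical id; in a real code this is where
   the per-thread work (and any synchronization) would go. */
void *worker(void *arg) {
    long id = (long)arg;
    printf("Hello from thread %ld\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, worker, (void *)t);
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);   /* wait for every thread to finish */
    return 0;
}
```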

Data Parallel Model

• Focus on data partitioning
• Each task performs the same operation on its partition of the data
  – This model is suitable for problems with static load balancing (e.g., very regular fluid element analysis, image processing)

• Implementations
  – FORTRAN 90 (F90), FORTRAN 95 (F95)
  – High Performance Fortran (HPF)
  – MapReduce, CUDA
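As a minimal data-parallel sketch (not tied to any one implementation above), the C loop below uses an OpenMP parallel loop to apply the same SAXPY-style operation to every element of an array; the iterations, i.e. the data partition, are divided among the threads. Problem size and values are arbitrary (compile with -fopenmp).

```c
#include <stdio.h>

#define N 1000000   /* arbitrary problem size for this sketch */

int main(void) {
    static float x[N], y[N];
    float a = 2.0f;

    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 0.0f; }

    /* Same operation applied to every element; the iterations
       (the data partition) are divided among the threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);
    return 0;
}
```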

Task Parallel Model

• Focus on task execution (workflow)
• Each processor executes a different thread or process

• Implementations
  – Threading Building Blocks (TBB)
  – Task Parallel Library (TPL)
  – Intel Concurrent Collections (CnC)
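TBB, TPL, and CnC are C++ and .NET libraries; as a language-neutral illustration of the same idea, the C sketch below uses OpenMP tasks (an assumption of this example, not one of the frameworks listed) to create independent units of work that the runtime schedules onto threads. The untuned recursive Fibonacci is only illustrative.

```c
#include <stdio.h>

/* Each recursive call is spawned as an independent task; the runtime
   schedules the tasks onto the available threads. */
long fib(int n) {
    long a, b;
    if (n < 2) return n;
    #pragma omp task shared(a)
    a = fib(n - 1);
    #pragma omp task shared(b)
    b = fib(n - 2);
    #pragma omp taskwait        /* wait for both child tasks */
    return a + b;
}

int main(void) {
    long result;
    #pragma omp parallel
    #pragma omp single          /* one thread creates the initial task tree */
    result = fib(20);
    printf("fib(20) = %ld\n", result);
    return 0;
}
```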

Message Passing Model

• A set of tasks that use their own local memory during computation
  – Tasks exchange data through communications by sending and receiving messages
  – Data transfer usually requires cooperative operations to be performed by each process
    • For example, a send operation must have a matching receive operation

• Implementations
  – MPI
  – Erlang
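A minimal MPI sketch in C of the cooperative pairing described above: rank 0 performs a send that rank 1 matches with a receive (the payload is arbitrary; run with at least two processes, e.g. mpirun -np 2).

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                   /* arbitrary payload */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* The receive must match the send (source, tag, communicator). */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```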

Partitioned Global Address Space Model

• Multiple threads share at least a part of a global address space
  – The idea behind this model is to provide the ability to access local and remote data with the same mechanisms

• Implementations
  – Co-Array Fortran
  – Unified Parallel C (UPC)
  – Titanium

Integrated view of PP Models

[Figure: integrated view of parallel programming models, relating data parallel and task parallel models to SIMD and MIMD execution, shared memory, and message passing]

PP Model Ecosystem

[Figure: parallel programming model ecosystem, arranged from higher to lower abstraction: message passing (MPI, Erlang), task parallel (TBB, TPL, CnC), data parallel (HPF, UPC, Co-Array Fortran, Fortran 90, MapReduce, Ct, CUDA, RapidMind), and thread (OpenMP, Pthreads) models over the runtime system]

Application Models

• Bag-of-Tasks
  – The user simply states a set of unordered computation tasks and allows the system to assign them to processors

• Map-Reduce
  – A map operator is applied to a set of name-value pairs to generate several intermediate sets
  – Subsequently, a reduce operator is applied to summarize the intermediate sets into one or more final sets
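As a toy illustration of the map/reduce pattern (purely sequential C, not a distributed MapReduce framework; the square and sum operators are arbitrary stand-ins for the map and reduce operators), the sketch below maps an operator over the input and then reduces the intermediate values:

```c
#include <stdio.h>

#define N 8   /* arbitrary input size for this sketch */

/* "Map": apply the same operator to every input element. */
static int map_square(int x) { return x * x; }

/* "Reduce": summarize the intermediate values into one result. */
static int reduce_sum(const int *v, int n) {
    int acc = 0;
    for (int i = 0; i < n; i++) acc += v[i];
    return acc;
}

int main(void) {
    int in[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    int mid[N];

    for (int i = 0; i < N; i++)      /* map phase: independent, parallelizable */
        mid[i] = map_square(in[i]);

    printf("sum of squares = %d\n", reduce_sum(mid, N));
    return 0;
}
```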

• Bulk-Synchronous-Parallel
  – Processes run in phases that are separated by a global synchronization step
  – During each phase, they perform calculations independently, and then communicate new results with their data-dependent peers
  – The execution time of a step is determined by the most heavily loaded processor
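A rough BSP-style sketch in C with MPI (an assumed pairing for illustration; the local update rule and the number of supersteps are arbitrary): each superstep is an independent local computation followed by a collective that exchanges results and synchronizes before the next phase begins.

```c
#include <mpi.h>
#include <stdio.h>

#define STEPS 3   /* arbitrary number of supersteps for this sketch */

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = rank + 1.0, global = 0.0;
    for (int step = 0; step < STEPS; step++) {
        local = local * 0.5 + 1.0;              /* independent local computation */
        /* Communicate new results; the collective also acts as the global
           synchronization that ends the superstep. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }
    if (rank == 0) printf("final global value = %f\n", global);
    MPI_Finalize();
    return 0;
}
```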

• Directed Acyclic Graph (DAG)
  – A workflow application can be modeled as a directed acyclic graph A = (V, E), where V is the set of nodes that represent the tasks in an application, and E is the set of edges that represent the dependences between tasks

Application Models

[Figure: application models ordered from less granularity toward more synchronization: Bag of Tasks, MapReduce, Bulk Synchronous Parallel, Directed Acyclic Graph]

Application Settings

[Figure: application settings by application coupling and latency: High Performance Computing (clusters) at high coupling and low latency, Grid Computing at low coupling and high latency]

Motifs

[Table: the 13 Berkeley computational motifs (1 Finite State Machine, 2 Combinational, 3 Graph Traversal, 4 Structured Grid, 5 Dense Matrix, 6 Sparse Matrix, 7 Spectral (FFT), 8 Dynamic Programming, 9 N-Body, 10 MapReduce, 11 Backtrack/Branch & Bound, 12 Graphical Models, 13 Unstructured Grid) mapped against application domains (Embedded, SPEC, DB, Games, ML, HPC, Health, Image, Speech, Music, Browser). Source: UC Berkeley]

High Performance Libraries

http://www.netlib.org

HPC Libraries

• Basic Linear Algebra Subprograms (BLAS)
• Linear Algebra PACKage (LAPACK)
  – IBM (ESSL), AMD (ACML), Intel (MKL), Cray (libsci)
• Basic Linear Algebra Communication Subroutines (BLACS)
• Automatically Tuned Linear Algebra Software (ATLAS)
• Parallel Linear Algebra Software for Multicore Architectures (PLASMA)
• Matrix Algebra on GPU and Multicore Architectures (MAGMA)
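A minimal sketch of calling BLAS from C through the CBLAS interface (assuming a CBLAS-providing library such as ATLAS, OpenBLAS, or MKL is linked; matrix sizes and values are arbitrary): a dense matrix-matrix multiply via dgemm.

```c
#include <stdio.h>
#include <cblas.h>   /* C interface to BLAS (e.g., from ATLAS, OpenBLAS, or MKL) */

int main(void) {
    /* Compute C = alpha*A*B + beta*C for small 2x2 row-major matrices. */
    double A[4] = {1.0, 2.0,
                   3.0, 4.0};
    double B[4] = {5.0, 6.0,
                   7.0, 8.0};
    double C[4] = {0.0, 0.0,
                   0.0, 0.0};

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,        /* M, N, K */
                1.0, A, 2,      /* alpha, A, lda */
                B, 2,           /* B, ldb */
                0.0, C, 2);     /* beta, C, ldc */

    printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```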

Summary

• Goal: Understand parallel programming models
  – Programming Models
  – Application Models
  – Parallel Computing Frameworks
    • OpenMP
    • MPI
    • CUDA
    • MapReduce
  – Parallel Computing Libraries
