
NUMERICAL PARALLEL COMPUTING
Lecture 3: Programming multicore processors with OpenMP
http://people.inf.ethz.ch/iyves/pnc12/
Peter Arbenz∗, Andreas Adelmann∗∗
∗ Dept., ETH Zürich, E-mail: [email protected]
∗∗ Paul Scherrer Institut, Villigen, E-mail: [email protected]

Review of last week

So far

I Moore’s law.

I Flynn’s taxonomy of parallel computers: SISD, SIMD, MIMD.

I Some terminology: work, speedup, efficiency, scalability.

I Amdahl’s and Gustafson’s laws.

I SIMD programming.

Today

I MIMD programming (Part 1)


MIMD: Multiple Instruction stream – Multiple Data stream

Each processor (core) can execute its own instruction stream on its own data independently from the other processors. Each processor is a full-fledged CPU with both control unit and ALU. MIMD systems are asynchronous.

Shared memory machines (multiprocessors)

I autonomous processors connected to memory system via interconnection network
I single address space accessible by all processors
I (implicit) communication by means of shared data
I data dependencies / race conditions possible

Fig. 2.3 in Pacheco (2011)

Distributed memory machines

I distributed memory machines (multicomputers)
I Each processor has its own local/private memory
I processor/memory pairs communicate via interconnection network
I all data are local to some processor
I (explicit) communication by message passing or some other means to access memory of a remote processor

Fig.2.4 in Pacheco (2011)

Typical architecture of a multicore processor

Multiple cores share multiple caches that are arranged in a tree-like structure. Three-level example:
I L1-cache in-core,

I 2 cores share L2-cache,

I all cores have access to all of the L3 cache and memory.

UMA (uniform memory access): each processor has a direct connection to (a block of) memory.

Typical architecture of a multicore processor (cont’d)

NUMA: non-uniform memory access

I Processors can access each other’s memory through special hardware built into the processors.
I Own memory is faster to access than remote memory.

Interconnection networks

Most widely used interconnects on shared memory machines

I bus (slow / cheap / not scalable)

I crossbar switch

Fig.2.7(a) in Pacheco (2011)


Execution of parallel programs

I Multitasking (time sharing)
In operating systems that support multitasking, several threads or processes are executed on the same processor in time slices (time sharing). In this way, latency due to, e.g., I/O operations can be hidden. This form of executing multiple tasks at the same time is called concurrency. Multiple tasks are executed at the same time, but only one of them has access to the compute resources at any given time. No simultaneous parallel execution takes place.


Execution of parallel programs (cont.)

I Using multiple physical processors admits the parallel execution of multiple tasks. The parallel hardware may cause some overhead, though.


Execution of parallel programs (cont.)

I Simultaneous Multithreading (SMT)
In simultaneous multithreading (SMT) or hyperthreading, multiple flows of control run concurrently on a processor (or a core). The processor switches among these so-called threads of control by means of dedicated hardware. If multiple logical processors are executed on one physical processor, the hardware resources can be employed better and task execution may be sped up. (With two logical processors, performance improvements of up to 30% have been observed.)

Multicore programming

I Multicore processors are programmed with multithreaded programs.

I Although many programs use multithreading, there are some notable differences between multicore programming and SMT.

I SMT is mainly used by the OS to hide I/O overhead. On multicore processors the work is actually distributed over the different cores.

I Cores have individual caches. False sharing may occur: two cores may work on different data items that are stored in the same cache line. Although there is no data dependence, the cache line of the other core is marked invalid when the first core writes its data item. (Massive) performance degradation is possible. A minimal sketch is given below.
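The following is a minimal sketch of how false sharing arises (not from the lecture slides; the array name counts and the iteration count are illustrative): two threads update adjacent array elements that typically live in the same cache line. Padding each counter to its own cache line is a common remedy.

#include <stdio.h>
#include <omp.h>

#define NITER 100000000L

int main(void)
{
    /* counts[0] and counts[1] are adjacent in memory and usually share
       a cache line -> false sharing when two threads update them.      */
    long counts[2] = {0, 0};

    #pragma omp parallel num_threads(2)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < NITER; i++)
            counts[id]++;     /* every write invalidates the cache line
                                 held by the other core                  */
    }
    printf("%ld %ld\n", counts[0], counts[1]);
    /* Remedy (sketch): pad each counter to a separate cache line,
       e.g. put it in a struct padded to 64 bytes.                       */
    return 0;
}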

Multicore programming (cont.)

I Priorities. If a multithreaded program is executed on a single-processor machine, always the thread with the highest priority is executed. On multicore processors, threads with different priorities can be executed simultaneously. This may lead to different results! Programming multicore machines therefore requires techniques, methods, and designs from parallel programming.

Parallel programming models

The design of a parallel program is always based on an abstract view of the parallel system on which the software shall be executed. This abstract view is called the parallel programming model. It does not only describe the underlying hardware; it describes the whole system as it is presented to a software developer:

I System software (operating system)

I parallel programming language

I parallel library

I compiler

I run time system

Parallel programming models (cont.)

Level of parallelism: On which level of the program do we have parallelism?

I Instruction level parallelism
The compiler can detect independent instructions and distribute them on different functional units of a (superscalar) processor.
I Data or loop level parallelism
Data structures, e.g., arrays, are partitioned into portions. The same operation is applied to all elements of the portions. SIMD.
I Function level parallelism
Functions in a program, e.g., in recursive calls, can be invoked in parallel, provided there are no dependences. (A sketch follows after this list.)
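As a sketch of function-level parallelism (not taken from the lecture; task_a and task_b are hypothetical independent functions), the OpenMP sections construct lets independent calls run on different threads:

#include <stdio.h>
#include <omp.h>

void task_a(void) { printf("task A on thread %d\n", omp_get_thread_num()); }
void task_b(void) { printf("task B on thread %d\n", omp_get_thread_num()); }

int main(void)
{
    /* The two independent function calls may be executed by
       different threads of the team.                          */
    #pragma omp parallel sections
    {
        #pragma omp section
        task_a();
        #pragma omp section
        task_b();
    }
    return 0;
}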

Parallel programming models (cont.)

Explicit vs. implicit parallelism: How is parallelism declared in the program?

I Implicit parallelism. Parallelizing compilers can detect regions/statements in the code that can be executed concurrently/in parallel. Parallelizing compilers are of limited success.
I Explicit parallelism with implicit partitioning. The programmer indicates to the compiler where there is potential to parallelize. The partitioning is done implicitly. OpenMP.
I Explicit partitioning. The programmer also indicates how to partition, but does not indicate where to execute the parts.
I Explicit communication and synchronization. MPI.

Parallel programming models (cont.)

There are two flavors of explicit parallel programming:

I Thread programming
A thread is a sequence of statements that can be executed in parallel with other sequences of statements (threads). Each thread has its own resources (program counter, status information, etc.), but the threads use a common address space. Suited for multicore processors.
I Message passing programming
In message passing programming, processes are used for the various pieces of the parallel program; they run on physical or logical processors. Each of the processes has its own (private) address space.


Thread programming

Programming multicore processors is tightly connected to parallel programming with a shared address space and to thread programming. There are a number of environments for thread programming:

I Pthreads (Posix threads)

I Java threads

I OpenMP

I Intel TBB (Threading Building Blocks)

In this lecture we deal with OpenMP, which is the most commonly used of these in HPC. For comparison, a minimal Pthreads sketch is shown below.
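Here is that minimal Pthreads sketch (illustrative only; it uses the standard POSIX calls pthread_create and pthread_join and a hypothetical worker function hello). Compile with gcc -pthread.

#include <pthread.h>
#include <stdio.h>

void* hello(void* arg)
{
    long id = (long)arg;
    printf("Hello from thread %ld\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[4];
    for (long i = 0; i < 4; i++)
        pthread_create(&threads[i], NULL, hello, (void*)i);   /* fork */
    for (long i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);                       /* join */
    return 0;
}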

Processes vs. threads

Processes

I A process is an instance of a program that is executing more or less autonomously on a physical processor.

I A process contains all the information needed to execute the program.

I process ID
I program code
I actual value of the program counter
I actual content in registers
I data on run time stack
I global data
I data on heap

Each process has its own address space.

I This information changes dynamically during process execution.

Processes (cont.)

I If compute resources are assigned to another process, the status of the present (to be suspended) process has to be saved, so that the execution of the suspended process can be resumed at some later time.

I This (time-consuming) operation is called a context switch.

I It is the basis of multitasking, where processes are given time slices in a round-robin fashion.

I In contrast to single (scalar) processors, on multiprocessor systems the processes can actually run in parallel.

Threads

I The thread model is an extension of the process model.

I Each process consists of multiple independent instruction streams (called threads) that are assigned compute resources by some scheduling procedure.

I The threads of a process share the address space of this process. Global variables and all dynamically allocated data objects are accessible by all threads of a process.

I Each thread has its own run time stack, registers, program counter.

I Threads can communicate by reading / writing variables in the common address space. (This may require synchronization.)

I We consider system threads here, but not user threads.


Synchronization

I Threads communicate through shared variables. Uncoordinated access to these variables can lead to undesired effects.

I If, e.g., two threads T1 and T2 increment a variable by 1 or 2, the result of the parallel program depends on the order in which the shared variable is accessed. This is called a race condition.

I To prevent unexpected results, the access to shared variables must be synchronized.

I A barrier is set to synchronize all threads. All threads wait at the barrier until all of them have arrived there. (A sketch is given below.)
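A minimal sketch of a barrier (not from the slides; the array data and the neighbour computation are illustrative): each thread first writes its own slot, and only after the barrier may it safely read a slot written by another thread.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int data[64] = {0};                    /* assumes at most 64 threads */

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nt = omp_get_num_threads();
        data[id] = id * id;                /* phase 1: each thread writes its slot */

        #pragma omp barrier                /* wait until all threads have written  */

        int right = data[(id + 1) % nt];   /* phase 2: read a neighbour's value    */
        printf("thread %d sees %d\n", id, right);
    }
    return 0;
}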

Synchronization (cont.)

I Mutual exclusion ensures that only one of multiple threads can access a critical section of the code (e.g., to increment a variable). Mutual exclusion serializes the access to the critical section. (See the sketch after this list.)

I Synchronization can be very time consuming. If it is not done right, much waiting time is spent. (E.g., if the load is not balanced well among the processors.)
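A sketch of the race described above and one way to repair it (not from the lecture; the loop count is arbitrary): unprotected increments of a shared counter can interleave and lose updates, whereas a critical section, or an atomic update, serializes the access.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    long counter = 0;

    #pragma omp parallel
    {
        for (int i = 0; i < 1000000; i++) {
            /* An unprotected "counter++;" would be a race condition: the
               final value would depend on how the threads interleave.    */
            #pragma omp critical
            counter++;            /* mutual exclusion: one thread at a time */
            /* Alternative: #pragma omp atomic before the increment.        */
        }
    }
    printf("counter = %ld\n", counter);   /* deterministic: nthreads * 1000000 */
    return 0;
}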

Parallel Programming Concepts

I The new generation of multicore processors gets its performance from the multitude of cores on a chip.

I Unlike the situation until recently, the programmer has to take action to be able to exploit the improved performance.

I Techniques of parallel programming have been known for years in scientific computing and elsewhere.

I What is new: parallel programming has reached the mainstream.

I The essential step in programming multicore processors is in providing multiple streams of execution that can be executed simultaneously (concurrently) on the multiple cores.

I Let us introduce a few concepts and notions.

Design of parallel programs: 1. Partitioning

I Basic idea of parallel programming: generate multiple instruction streams that can be executed in parallel. If we have a sequential code available, we may parallelize it.

I In order to generate independent instruction streams we partition the problem that we want to solve into tasks. Tasks are the smallest units of parallelism.

I The size of a task is called granularity (fine/coarse grain).

Design of parallel programs: 2. Communication

I Tasks may depend on each other in one way or another. They may access the same data concurrently (data dependence) or they may need to wait for another task to finish (flow dependence), as its results are needed.

I Both dependences may be translated into communication in a message passing environment.

Design of parallel programs: 3. Scheduling

I The tasks are mapped (aggregated) on threads or processes that are executed on the physical compute resources which can be processors of a parallel machine or cores of a multicore processor.

I The assignment of tasks to processes or threads is called scheduling. In static scheduling the assignment takes place before the actual computation. In dynamic scheduling the assignment takes place during program execution. (A sketch of both variants, using the OpenMP schedule clause discussed later in this lecture, follows below.)
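A sketch of the two variants in OpenMP terms (the schedule clause is treated in detail at the end of this lecture; expensive_work is a hypothetical function whose cost varies with i): with static scheduling the iteration-to-thread assignment is fixed before the loop starts, while with dynamic scheduling idle threads grab the next chunk at run time, which helps when iterations have unequal cost.

#include <omp.h>

extern double expensive_work(int i);   /* hypothetical, cost varies with i */

void demo(double *y, int n)
{
    /* static: iterations are assigned to threads before execution */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        y[i] = expensive_work(i);

    /* dynamic: each idle thread fetches the next chunk (here 16
       iterations) at run time -- better load balance for irregular work */
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n; i++)
        y[i] = expensive_work(i);
}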

Design of parallel programs: 4. Mapping

I The processes or threads are mapped on the processors/cores. This mapping may be explicitly done in the program or (mostly) by the operating system.

I The mapping should be done in a way to minimize communication and balance the work load.


OpenMP

OpenMP is an application programming interface that provides a parallel programming model for shared memory and distributed shared memory multiprocessors. It extends programming languages (C/C++ and Fortran) by

I a set of compiler directives to express shared memory parallelism. (Compiler directives are called pragmas in C.)

I runtime library routines and environment variables that are used to examine and modify execution parameters.

There is a standard include file omp.h for C/C++ OpenMP programs. OpenMP is becoming the de facto standard for parallelizing applications for shared memory multiprocessors. OpenMP is independent of the underlying hardware or operating system.

OpenMP References

There are a number of good books on OpenMP:

I B. Chapman, G. Jost, R. van der Pas: Using OpenMP. MIT Press, 2008. Easy to read. Examples in both C and Fortran (but not C++).
I R. Chandra, R. Menon, L. Dagum, D. Kohr, D. Maydan, J. McDonald: Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, 2001. Easy to read. Much of the material is in Fortran.
I S. Hoffmann, R. Lienhart: OpenMP. Springer, Berlin, 2008. Easy to read. C and C++. In German. Available online via NEBIS.

There is an OpenMP organization with most of the major computer manufacturers and the U.S. DOE ASCI program on its board, see http://www.openmp.org.

Execution model: fork-join parallelism

I OpenMP is based on the fork-join execution model.

I At the start of an OpenMP program, a single thread (master thread) is executing.

I If the master thread arrives at a compiler directive #pragma omp parallel that indicates the beginning of a parallel section of the code, then it forks the execution into a certain number of threads.

I At the end of the parallel section the execution is joined again in the single master thread.

I At the end of a parallel section there is an implicit barrier. The program cannot proceed before all threads have reached the barrier.

Execution model: fork-join parallelism II

I Master thread spawns a team of threads as needed

I From a programmer’s perspective, parallelism is added incrementally: i.e., the sequential program evolves into a parallel program (in contrast to MPI distributed memory programming).

I However, parallelism is limited with respect to the number of processors.

Execution model: fork-join parallelism III

I This is (at least in theory) dynamic thread generation. The number of threads may vary from one parallel region to the other. The number of threads that are generated can be determined by functions from the runtime library or by environment variables, e.g.
setenv OMP_NUM_THREADS 4
(In static thread generation, the number of threads is fixed a priori from the start.) A sketch of the different ways to set the thread count is given after this list.

I In practice, a number of threads may be initiated at the start of the program. But (slave) threads become active only at the beginning of a parallel region.
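A sketch (using standard OpenMP runtime routines) of the three ways to choose the number of threads: the OMP_NUM_THREADS environment variable, the omp_set_num_threads() library call, and the num_threads clause on an individual parallel directive. Each thread prints one line per region.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* 1. Default taken from the environment, e.g. OMP_NUM_THREADS=4 */
    #pragma omp parallel
    printf("region 1: %d threads\n", omp_get_num_threads());

    /* 2. Runtime library call; affects subsequent parallel regions */
    omp_set_num_threads(2);
    #pragma omp parallel
    printf("region 2: %d threads\n", omp_get_num_threads());

    /* 3. Clause on the directive; affects only this region */
    #pragma omp parallel num_threads(3)
    printf("region 3: %d threads\n", omp_get_num_threads());

    return 0;
}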

Some OpenMP demo programs

OpenMP hello world

#include <stdio.h>
#include <omp.h>

int main()
{
#pragma omp parallel
  {
    printf("Hello world\n");
  }
}

gcc compiler

OpenMP programs can be compiled by the GNU compiler

gcc -o hello hello.c -fopenmp -lgomp

When -fopenmp is used, the compiler generates parallel code based on the OpenMP directives encountered. A macro _OPENMP is defined that can be checked with the preprocessor directive #ifdef. -lgomp links the library of the GNU OpenMP project.

A more complicated Hello world demo

The following little program calls functions from the OpenMP runtime library to get the number of threads and the id of the current thread.

/* Modified example 2.1 from Hoffmann-Lienhart */
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(int argc, char* argv[])
{
#ifdef _OPENMP
  printf("Number of processors: %d\n", omp_get_num_procs());

#pragma omp parallel
  {
    printf("Thread %d of %d says \"Hello World!\"\n",
           omp_get_thread_num(), omp_get_num_threads());
  }
#else
  printf("OpenMP not supported.\n");
#endif

  printf("Job completed.\n");

  return 0;
}

It is important to note that the order in which the threads actually execute is not always the same; it depends on the run. This does not matter here, of course. But if it did, there would be a race condition.

OpenMP parallel control structures

In the OpenMP fork/join model, the parallel control structures are those that fork (i.e., start) new threads. There are just two of these:

1. The parallel directive is used to create multiple threads of execution that execute concurrently. The parallel directive applies to a structured block, i.e., a block of code with one entry point at the top and one exit point at the bottom of the block. (Exception: exit().)

2. Further constructs are needed to divide work among an existing set of parallel threads. The for directive, e.g., is used to express loop-level parallelism.


An example for loop-level parallelism (back to axpy)

1. The sequential program

for (i=0; i<N; i++) {
  y[i] = alpha*x[i] + y[i];
}

2. OpenMP parallel region (note: here every thread redundantly executes the whole loop)

#pragma omp parallel
{
  for (i=0; i<N; i++) {
    y[i] = alpha*x[i] + y[i];
  }
}


3. OpenMP parallel region (assumes Nthrds divides N)

#pragma omp parallel
{
  int id, i, Nthrds, istart, iend;
  id = omp_get_thread_num();
  Nthrds = omp_get_num_threads();
  istart = id*N/Nthrds;
  iend = (id+1)*N/Nthrds;
  for (i=istart; i<iend; i++) {
    y[i] = alpha*x[i] + y[i];
  }
}


4. OpenMP parallel region combined with a for-directive

#pragma omp parallel
#pragma omp for
for (i=0; i<N; i++) {
  y[i] = alpha*x[i] + y[i];
}

5. OpenMP parallel region combined with a for-directive and a schedule clause

#pragma omp parallel for schedule(static)
for (i=0; i<N; i++) {
  y[i] = alpha*x[i] + y[i];
}

For-construct with the schedule clause

The schedule clause affects how loop iterations are mapped onto threads.

I schedule(static [,chunk])
Deal out blocks of iterations of size “chunk” to each thread.
I schedule(dynamic [,chunk])
Each thread grabs “chunk” iterations off a queue until all iterations have been handled.
I schedule(guided [,chunk])
Threads dynamically grab blocks of iterations. The size of the block starts large and shrinks down to size “chunk” as the calculation proceeds (guided self-scheduling).
I schedule(runtime)
Schedule and chunk size defined by the OMP_SCHEDULE environment variable.

Timings with varying chunk sizes in the for

#pragma omp parallel for schedule(static,chunk_size)
for (i=0; i<N; i++) {
  y[i] = alpha*x[i] + y[i];
}

chunk size   p = 1      2      4      6      8     12     16
N/p           1674    854    449    317    239    176     59
100           1694   1089    601    405    317    239    166
4             1934   2139   1606   1294    850    742    483
1             2593   2993   3159   2553   2334   2329   2129

Table 1: Some execution times in µsec for the saxpy operation with varying chunk size and processor number p.

Here, chunks of size chunk_size are cyclically assigned to the processors. In Table 1, timings on the HP Superdome are summarized for N = 1000000.

How to measure time (walltime.c)

double t0, tw;
...
t0 = walltime(&tw);
for (ict=0; ict
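The walltime() helper from walltime.c is not shown on the slide, and the loop above is truncated. A minimal alternative (a sketch, assuming the standard OpenMP routine omp_get_wtime() and the saxpy loop from the timing experiment) would be:

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double x[N], y[N];
    double alpha = 2.0;

    double t0 = omp_get_wtime();            /* wall-clock time in seconds */

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        y[i] = alpha*x[i] + y[i];

    double t1 = omp_get_wtime();
    printf("saxpy took %g s\n", t1 - t0);
    return 0;
}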
