Introduction to Openmp! Agenda!

Introduction to OpenMP! Agenda! • Introduction to parallel computing! • Classification of parallel computers! • What is OpenMP? ! • OpenMP examples! !" #" Parallel computing! Demand for parallel computing! A form of computation in which many instructions are carried out simultaneously operating on the principle With the increased use of computers in every sphere of that large problems can often be divided into smaller human activity, computer scientists are faced with two ones, which are then solved concurrently (in parallel).! crucial issues today:! Parallel computing has become a dominant paradigm in • Processing has to be done faster ! computer architecture, mainly in the form of multi-core processors.! • Larger or more complex problems need to be solved! Why is it required?! $" %" Parallelism! Serial computing! A problem is broken into a discrete series of instructions. ! Increasing the number of transistors as per Moore’s law Instructions are executed one after another. ! is not a solution, as it increases power consumption.! Only one instruction may execute at any moment in time.! Power consumption has been a major issue recently, as it causes a problem of processor heating.! The prefect solution is parallelism, in hardware as well as software.! Moore's law (1965): The density of transistors will double every 18 months.! &" '" Parallel computing! Forms of parallelism! A problem is broken into discrete parts that can be solved concurrently. • Bit-level parallelism! !!!!!!! Each part is further broken down to a series of instructions. ! Instructions from each part execute simultaneously on different CPUs! !More than one bit is manipulated per clock cycle ! • Instruction-level parallelism! !!!! !Machine instructions are processed in multi-stage pipelines! • Data parallelism! !!!!!!! !Focuses on distributing the data across different parallel !computing nodes. It is also called loop-level parallelism! • Task parallelism! !!!!!!! !Focuses on distributing tasks across different parallel !computing nodes. It is also called functional or control !parallelism! (" )" Key difference between data ! Parallelism versus concurrency! and task parallelism! • !Concurrency is when two tasks can start, run and !complete in overlapping time periods. It doesn’t Data parallelism! Task parallelism! !necessarily mean that they will ever be running at A task ‘A’ is divided into sub- A task ‘A’ and a task ‘B’ are !the same instant of time. E.g., multitasking on a parts and then processed ! processed separately by !single-threaded machine.! different processors! • !Parallelism is when two tasks literally run at the !same time, e.g. on a multi-core computer.! *" !+" Granularity! Flynn’s classification of computers! • !Granularity is the ratio of computation to !communication.! • !Fine-grained parallelism means individual !tasks are relatively small in terms of code size !and execution time. The data are transferred !among processors frequently in amounts of !one or a few memory words.! • !Coarse-grained is the opposite: data are !communicated infrequently, after larger !amounts of computation. ! M. Flynn, 1966! !!" !#" Flynn’s taxonomy: SISD! Flynn’s taxonomy: MIMD! 678" ,-./"." ,-./"." 5.,,"9" ,-./">" 678" .//"0" .//"0" ,-./"." >?5">" 12-34"5"" 12-34"5"" :;<=","" :;<=@"" =!" =#" =A" A serial (non-parallel) computer! Examples: uni-core PCs, single CPU workstations and Multiple Instruction: Every Examples: Most modern Single Instruction: Only one instruction mainframes! processing unit may be executing computers fall into this stream is acted on by the CPU in one a different instruction at any given category! clock cycle! clock cycle! Single Data: Only one data stream is used Multiple Data: Each processing as input during any clock cycle! unit can operate on a different data element! !$" !%" Flynn’s taxonomy: SIMD! Flynn’s taxonomy: MISD! ,-./".B!C" ,-./".B#C" ,-./".BAC" 678" ,-./"." ,-./"." ,-./"." 678" .//"0B!C" .//"0B#C" .//"0BAC" <;,2"!" <;,2"#" <;,2"A" 12-34"5B!C"" 12-34"5B#C"" 12-34"5BAC"" 12-34"5B!C"" 12-34"5B#C"" 12-34"5BAC"" =!" =#" =A" =!" =#" =A" Single Instruction: All processing Best suited for problems of A single data stream is fed into Examples: Few actual units execute the same instruction high degree of regularity, multiple processing units! examples of this category at any given clock cycle! such as image processing! have ever existed! Each processing unit operates on Multiple Data: Each processing Examples: Connection the data independently via unit can operate on a different data Machine CM-2, Cray C90, independent instruction streams! element! NVIDIA! !&" !'" Flynn’s Taxonomy! (1966)! Parallel classification! MISD (Multiple-Instruction Single-Data)! MIMD (Multiple-Instruction Multiple-Data)! Parallel computers:! >!! > ! A >!! >A! = ! !"!"!" = ! • SIMD (Single Instruction Multiple Data): ! ! A =!! !"!"!" =A! !Synchronized execution of the same instruction on a ! /D! /E! / !/ ! / !/ ! !set of processors! D! E! DA EA SISD (Single-Instruction Single-Data)! SIMD (Single-Instruction Multiple-Data)! • MIMD (Multiple Instruction Multiple Data): ! !Asynchronous execution of different instructions! >! >! =!! !"!"!! =A! "=! / !/ ! / !/ ! D! E! DA EA / ! / ! D E !(" 18! Parallel machine memory models! Distributed memory! Distributed memory model! Shared memory model! Processors have local (non-shared) + Very scalable! Hybrid memory model! memory! + No cache coherence problems! + Use commodity processors! Requires communication network. Data from one processor must be - Lot of programmer responsibility! communicated if required ! - NUMA (Non-Uniform Memory !Access) times! Synchronization is programmer’s responsibility! !*" #+" Shared memory! Hybrid memory! Multiple processors operate + User friendly! independently but access global + UMA (Uniform Memory Access) Biggest machines today have both! memory space! !times. UMA = SMP (Symmetric !Multi-Processor)! Advantages and disadvantages are Changes in one memory location those of the individual parts! are visible to all processors! - Expense for large computers! Poor scalability is “myth”! #!" ##" Shared memory access! Uniform Memory Access – UMA: ! (Symmetric Multi-Processors – SMP)! • !centralized shared memory, accesses to global !memory! !from all processors have same latency! Non-uniform Memory Access – NUMA: ! (Distributed Shared Memory – DSM)! • !memory is distributed among the nodes, local accesses !much faster than remote accesses! #$" #%" Moore’s Law! (1965)! Parallelism in hardware and software! • Hardware! Moore’s Law, despite predictions of its demise, is still !going parallel (many cores)! in effect. Despite power issues, transistor densities are still doubling every 18 to 24 months. ! • Software?! With the end of frequency scaling, these new !Herb Sutter (Microsoft):! transistors can be used to add extra hardware, such as !“The free lunch is over. Software performance additional cores, to facilitate parallel computing. ! !will no longer increase from one generation to !the next as hardware improves ... unless it is !parallel software.”! #&" #'" Amdahl’s Law [1967]! Gene Amdahl, IBM designer! Assume fraction fp of execution time is parallelizable! !No overhead for scheduling, communication, synchronization, etc. ! Fraction fs = 1"fp is serial (not parallelizable)" Time on 1 processor:! T1 = fpT1 + (1! fp )T1 f T Time on N processors:"T = p 1 + (1! f )T N N p 1 T 1 1 Speedup "= 1 = ! T fp 1" f N + (1! f ) p N p #(" #)" 1 1 Speedup = Speedup = f f p + (1! f ) p + (1! f ) Example! N p N p Amdahl’s Law! 95% of a program’s execution time occurs inside a loop that can be executed in parallel. ! For mainframes Amdahl expected fs =1-fp =35%! What is the maximum speedup we should expect from a parallel version of the program executing on 8 processors?! !For a 4-processor: speedup # 2! !For an infinity-processor: speedup < 3 ! 1 Speedup ! " 5.9 !Therefore, stay with mainframes with only ! ! 0.95 / 8 0.05 + !one/few processors!! What is the maximum speedup achievable by this program, Amdahl’s Law prevented designers from exploiting regardless of how many processes are used?! parallelism for many years as even the most trivially 1 parallel programs contain a small amount of natural serial Speedup ! = 20 0.05 code! #*" $+" Limitations of Amdahl’s Law! Gustafson’s Law [1988]! Two key flaws in past arguments that Amdahl’s law is a The key is in observing that f and N are not independent of fatal limit to the future of parallelism are! s one another.! !(1) Its proof focuses on the steps in a particular ! Parallelism can be used to increase the size of the problem. ! !algorithm, and does not consider that other ! Time gained by using N processors (in stead of one) for the ! !algorithms with more parallelism may exist! parallel part of the program can be used in the serial part to solve a larger problem.! !(2) Gustafson’s Law: The proportion of the ! !! !computations that are sequential normally ! Gustafson’s speedup factor is called scaled speedup factor.! !! !decreases as the problem size increases! $!" $#" Speedup = N + (1 ! N ) fs Gustafson’s Law (cont’d)! Example! 95% of a program’s execution time occurs inside a loop that can be Assume TN is fixed to one, meaning the problem is executed in parallel. ! scaled up to fit the parallel computer. ! Then execution time on a single processor will be What is the scaled speedup on 8 processors?! fs+Nfp as the N parallel parts must be executed sequentially. Thus, ! Speedup = 8 + (1! 8)" 0.05 = 7.65 T1 fs + Nfp Speedup = = = fs + Nfp = N + (1! N) fs TN

Introduction to Openmp! Agenda!

Data-Level Parallelism

High Performance Computing Through Parallel and Distributed Processing

2.5 Classification of Parallel Computers

Balanced Multithreading: Increasing Throughput Via a Low Cost Multithreading Hierarchy

Cimple: Instruction and Memory Level Parallelism a DSL for Uncovering ILP and MLP

Amdahl's Law Threading, Openmp

Introduction to Multi-Threading and Vectorization Matti Kortelainen Larsoft Workshop 2019 25 June 2019 Outline

Computer Hardware Architecture Lecture 4

Parallel Generation of Image Layers Constructed by Edge Detection

CS650 Computer Architecture Lecture 10 Introduction to Multiprocessors

Scheduling Task Parallelism on Multi-Socket Multicore Systems

Demystifying Parallel and Distributed Deep Learning: an In-Depth Concurrency Analysis