Introduction

Todd C. Mowry, CS 418, January 20, 25 & 26, 2011

Parallel programming: rich space of techniques and issues
• Trade off and interact with one another

Performance issues can be addressed/helped by software or hardware
• Algorithmic or programming techniques
• Architectural techniques

Focus here on performance issues and software techniques
• Point out some architectural implications
• Architectural techniques covered in rest of class


Programming as Successive Refinement

Not all issues dealt with up front

Partitioning often independent of architecture, and done first
• View machine as a collection of communicating processors
  – balancing the workload
  – reducing the amount of inherent communication
  – reducing extra work
• Tug-of-war even among these three issues

Then interactions with architecture
• View machine as extended memory hierarchy
  – extra communication due to architectural interactions
  – cost of communication depends on how it is structured
• May inspire changes in partitioning

Discussion of issues is one at a time, but identifies tradeoffs
• Use examples, and measurements on SGI Origin2000

Outline

1. Partitioning for performance
2. Relationship of communication, data locality and architecture
3. Orchestration for performance
4. Components of execution time as seen by processor
   • What workload looks like to architecture, and relate to software issues

For each issue:
• Techniques to address it, and tradeoffs with previous issues
• Illustration using case studies
• Application to grid solver
• Some architectural implications


Partitioning for Performance

1. Balancing the workload and reducing wait time at synch points
2. Reducing inherent communication
3. Reducing extra work

Even these algorithmic issues trade off:
• Minimize comm. => run on 1 processor => extreme load imbalance
• Maximize load balance => random assignment of tiny tasks => no control over communication
• Good partition may imply extra work to compute or manage it
Goal is to compromise
• Fortunately, often not difficult in practice

Load Balance and Synch Wait Time

Limit on speedup:
  Speedup_problem(p) < Sequential Work / Max Work on any Processor
• Work includes data access and other costs
• Not just equal work, but must be busy at same time

Four parts to load balance and reducing synch wait time:
1. Identify enough concurrency
2. Decide how to manage it
3. Determine the granularity at which to exploit it
4. Reduce serialization and cost of synchronization
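To make the bound concrete (numbers invented here purely for illustration, not taken from the lecture): if the sequential work is 100 time units and the most heavily loaded of p = 4 processors ends up doing 40 units of work (including its data access), then Speedup_problem(4) < 100 / 40 = 2.5 no matter how cheap communication is; hence the emphasis on keeping every processor equally busy at the same time.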


Identifying Concurrency

Techniques seen for equation solver:
• Loop structure, fundamental dependences, new algorithms

Data parallelism versus function parallelism

Often see orthogonal levels of parallelism; e.g. VLSI routing
[Figure: (a) wires W1, W2, W3; (b) wire W2 expands to segments S21–S26; (c) segment S23 expands to routes]

Identifying Concurrency (contd.)

Function parallelism:
• entire large tasks (procedures) that can be done in parallel
• on same or different data
• e.g. different independent grid computations in Ocean
• pipelining, as in video encoding/decoding, or polygon rendering
• degree usually modest and does not grow with input size
• difficult to load balance
• often used to reduce synch between data parallel phases

Most scalable programs data parallel (per this loose definition)
• function parallelism reduces synch between data parallel phases

Load Balance and Synch Wait Time (recap of the speedup limit and the four parts listed earlier).

Deciding How to Manage Concurrency

Static versus dynamic techniques

Static:
• Algorithmic assignment based on input; won't change
• Low runtime overhead
• Computation must be predictable
• Preferable when applicable (except in multiprogrammed or heterogeneous environment)

Dynamic:
• Adapt at runtime to balance load
• Can increase communication and reduce locality
• Can increase task management overheads
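As a minimal sketch of a static technique (my own illustration in C, not code from the lecture): each process derives its contiguous range of loop iterations from its id and the problem size alone, so there is no runtime scheduling overhead, but the split is only as good as our ability to predict the work per iteration.

```c
/* Static block assignment of n iterations among nprocs processes.
 * Process `pid` computes its own half-open range [lo, hi) up front;
 * nothing is negotiated at runtime (illustrative sketch only). */
void static_range(int pid, int nprocs, int n, int *lo, int *hi)
{
    int chunk = (n + nprocs - 1) / nprocs;        /* ceiling division */
    *lo = pid * chunk;
    *hi = (*lo + chunk < n) ? *lo + chunk : n;    /* clamp last chunk */
}
```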


Dynamic Assignment

Profile-based (semi-static):
• Profile work distribution at runtime, and repartition dynamically
• Applicable in many computations, e.g. Barnes-Hut, some graphics

Dynamic tasking:
• Deal with unpredictability in program or environment (e.g. Raytrace)
  – computation, communication, and memory system interactions
  – multiprogramming and heterogeneity
  – used by runtime systems and OS too
• Pool of tasks; take and add tasks until done
• E.g. "self-scheduling" of loop iterations (shared loop counter)

Dynamic Tasking with Task Queues

Centralized versus distributed queues

Task stealing with distributed queues
• Can compromise comm and locality, and increase synchronization
• Whom to steal from, how many tasks to steal, ...
• Termination detection
• Maximum imbalance related to size of task

[Figure: (a) centralized task queue — all processes insert tasks into and remove tasks from a single queue Q; (b) distributed task queues, one per process — P0–P3 each insert into and remove from their own queue Q0–Q3, and others may steal.]
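A minimal sketch of the "self-scheduling" idea above (my own illustration using C11 atomics, not the course's code): the shared loop counter acts as a trivially centralized task queue, and each process keeps taking the next unclaimed iteration until the pool is exhausted.

```c
#include <stdatomic.h>

atomic_int next_iter;    /* shared loop counter = centralized task pool */

/* Each process/thread runs this; do_iter(i) is whatever work
 * iteration i performs (hypothetical callback, for illustration). */
void self_schedule(int n, void (*do_iter)(int))
{
    for (;;) {
        int i = atomic_fetch_add(&next_iter, 1);   /* grab next task */
        if (i >= n)
            break;                                 /* pool exhausted */
        do_iter(i);
    }
}
```

Taking a small chunk of iterations per fetch-and-add, rather than one, is the usual way to trade contention on the shared counter against load balance, which is exactly the granularity question taken up below.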


Impact of Dynamic Assignment

On SGI Origin 2000 (cache-coherent shared memory):
[Figure: speedup versus number of processors (up to 31) for (a) Barnes-Hut with static versus semistatic assignment and (b) Raytrace with static versus dynamic assignment, each measured on both the Origin and the Challenge.]

Load Balance and Synch Wait Time (recap of the speedup limit and the four parts listed earlier).

Determining Task Granularity

Task granularity: amount of work associated with a task

General rule:
• Coarse-grained => often less load balance
• Fine-grained => more overhead; often more communication & contention

Communication & contention actually affected by assignment, not size
• Overhead by size itself too, particularly with task queues

Load Balance and Synch Wait Time (recap of the speedup limit and the four parts listed earlier).


Reducing Serialization

Careful about assignment and orchestration (including scheduling)

Event synchronization
• Reduce use of conservative synchronization
  – e.g. point-to-point instead of barriers, or granularity of pt-to-pt
• But fine-grained synch more difficult to program, more synch ops.

Mutual exclusion
• Separate locks for separate data
  – e.g. locking records in a database: lock at the granularity of record or field
  – lock per task in task queue, not per queue
  – finer grain => less contention/serialization, more space, less reuse
• Smaller, less frequent critical sections
  – don't do reading/testing in critical section, only modification
  – e.g. searching for task to dequeue in task queue, building tree
• Stagger critical sections in time

Partitioning for Performance (recap of the three goals: balancing the workload and reducing synch wait time, reducing inherent communication, reducing extra work).
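A minimal pthreads sketch of the "separate locks for separate data" advice above (my own illustration; `record_t` and `deposit` are made-up names): each record carries its own lock, and the critical section contains only the modification, not the search or read that precedes it.

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;   /* protects only this record */
    double balance;
} record_t;

/* Finer-grain mutual exclusion: updates to different records proceed
 * in parallel instead of serializing on one table-wide lock. */
void deposit(record_t *r, double amount)
{
    pthread_mutex_lock(&r->lock);     /* small, short critical section  */
    r->balance += amount;             /* only the modification is inside */
    pthread_mutex_unlock(&r->lock);
}
```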


Reducing Inherent Communication

Communication is expensive!
Measure: communication to computation ratio

Focus here on inherent communication
• Determined by assignment of tasks to processes
• Later see that actual communication can be greater

Assign tasks that access same data to same process

Solving communication and load balance NP-hard in general case
But simple heuristic solutions work well in practice
• Applications have structure!

Domain Decomposition

Works well for scientific, engineering, graphics, ... applications

Exploits local-biased nature of physical problems
• Information requirements often short-range
• Or long-range but fall off with distance

Simple example: nearest-neighbor grid computation
[Figure: an n × n grid partitioned into 16 square blocks of side n/√p, one per process P0–P15; each process exchanges only the boundary of its block with its neighbors.]

Perimeter to area comm-to-comp ratio (area to volume in 3D)
• Depends on n, p: decreases with n, increases with p


Domain Decomposition (Continued)

Best domain decomposition depends on information requirements

Nearest neighbor example: block versus strip decomposition
[Figure: the same n × n grid divided among processes P0–P15, comparing square blocks with horizontal strips of n/p rows per process.]

Comm-to-comp ratio: 4√p / n for block, 2p / n for strip
• Retain block from here on

Application dependent: strip may be better in other cases
• E.g. particle flow in tunnel

Finding a Domain Decomposition

Static, by inspection
• Must be predictable: grid example above, and Ocean

Static, but not by inspection
• Input-dependent, require analyzing input structure
• E.g. sparse matrix computations, data mining

Semi-static (periodic repartitioning)
• Characteristics change but slowly; e.g. Barnes-Hut

Static or semi-static, with dynamic task stealing
• Initial decomposition, but highly unpredictable; e.g. ray tracing
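A hedged sketch of the block decomposition retained above (assuming p is a perfect square whose root divides n; my own helper, not the lecture's code): each process owns an (n/√p) × (n/√p) block, so it computes on n²/p points while exchanging roughly 4·n/√p boundary points, which is where the 4√p/n comm-to-comp ratio comes from.

```c
#include <math.h>

/* Block decomposition of an n x n grid among p = q*q processes.
 * Process `pid` owns rows [*r0, *r0 + *b) and columns [*c0, *c0 + *b),
 * with block side b = n/q.  Assumes p is a perfect square and q
 * divides n (illustrative only; link with -lm). */
void block_bounds(int pid, int p, int n, int *r0, int *c0, int *b)
{
    int q = (int)lround(sqrt((double)p));   /* process grid is q x q    */
    *b  = n / q;                            /* side length of the block */
    *r0 = (pid / q) * (*b);                 /* first row owned          */
    *c0 = (pid % q) * (*b);                 /* first column owned       */
}
```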

Other Techniques

Scatter decomposition, e.g. initial partition in Raytrace
[Figure: domain decomposition versus scatter decomposition of the plane among four processes; in the scatter version the 1 2 / 3 4 pattern of process assignments is tiled repeatedly over the whole domain.]

Preserve locality in task stealing
• Steal large tasks for locality, steal from same queues, ...

Implications of Comm-to-Comp Ratio

If denominator is execution time, ratio gives average BW needs
If operation count, gives extremes in impact of latency and bandwidth
• Latency: assume no latency hiding
• Bandwidth: assume all latency hidden
• Reality is somewhere in between

Actual impact of comm. depends on structure & cost as well

Still:
  Speedup < Sequential Work / Max (Work + Synch Wait Time + Comm Cost)
• Need to keep communication balanced across processors as well
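A minimal sketch of scatter decomposition as drawn above (my own illustration; the q × q process tiling, with q = 2 for four processes, is the assumption): small blocks are dealt out cyclically, so every process samples the whole domain and "hot" regions are averaged across processes, at the cost of locality, which is why it is combined with locality-preserving task stealing.

```c
/* Scatter (interleaved) decomposition: the domain is cut into many
 * small blocks and block (bi, bj) goes to process (bi mod q, bj mod q)
 * of a q x q process grid -- the 1 2 / 3 4 pattern tiled over the
 * domain in the figure corresponds to q = 2. */
int scatter_owner(int bi, int bj, int q)
{
    return (bi % q) * q + (bj % q);
}
```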

Partitioning for Performance (recap of the three goals: balancing the workload and reducing synch wait time, reducing inherent communication, reducing extra work).

Reducing Extra Work

Common sources of extra work:
• Computing a good partition
  – e.g. partitioning in Barnes-Hut or sparse matrix
• Using redundant computation to avoid communication
• Task, data and process management overhead
  – applications, languages, runtime systems, OS
• Imposing structure on communication
  – coalescing messages, allowing effective naming

Architectural implications:
• Reduce need by making communication and orchestration efficient

  Speedup < Sequential Work / Max (Work + Synch Wait Time + Comm Cost + Extra Work)


Summary: Analyzing Parallel Algorithms

Requires characterization of multiprocessor and algorithm

Historical focus on algorithmic aspects: partitioning, mapping

PRAM model: data access and communication are free
• Only load balance (including serialization) and extra work matter

  Speedup < Sequential Instructions / Max (Instructions + Synch Wait Time + Extra Instructions)

• Useful for early development, but unrealistic for real performance
• Ignores communication and also the imbalances it causes
• Can lead to poor choice of partitions as well as orchestration
• More recent models incorporate comm. costs: BSP, LogP, ...

Outline (recap): 1. partitioning for performance; 2. relationship of communication, data locality and architecture; 3. orchestration for performance; 4. components of execution time as seen by processor.


Limitations of Algorithm Analysis

Inherent communication in algorithm is not all
• artifactual communication caused by program implementation and architectural interactions can even dominate
• thus, amount of communication not dealt with adequately

Cost of communication determined not only by amount
• also how communication is structured
• and cost of communication in system

Both architecture-dependent, and addressed in orchestration step

To understand techniques, first look at system interactions

What is a Multiprocessor?

A collection of communicating processors
• View taken so far
• Goals: balance load, reduce inherent communication and extra work

A multi-cache, multi-memory system
• Role of these components essential regardless of programming model
• Programming model and comm. abstraction affect specific performance tradeoffs

Most of remaining performance issues focus on second aspect


Memory-Oriented View

Multiprocessor as extended memory hierarchy
• as seen by a given processor

Levels in extended hierarchy:
• Registers, caches, local memory, remote memory (topology)
• Glued together by communication architecture
• Levels communicate at a certain granularity of data transfer

Need to exploit spatial and temporal locality in hierarchy
• Otherwise extra communication may also be caused
• Especially important since communication is expensive

Uniprocessor

Performance depends heavily on memory hierarchy

Time spent by a program:
  Time_prog(1) = Busy(1) + Data Access(1)
• Divide by instructions to get CPI equation

Data access time can be reduced by:
• Optimizing machine: bigger caches, lower latency...
• Optimizing program: temporal and spatial locality


Extended Hierarchy

Idealized view: local cache hierarchy + single main memory

But reality is more complex
• Centralized memory: caches of other processors
• Distributed memory: some local, some remote; + network topology
• Management of levels
  – caches managed by hardware
  – main memory depends on programming model
    » SAS: data movement between local and remote transparent
    » message passing: explicit
• Levels closer to processor are lower latency and higher bandwidth
• Improve performance through architecture or program locality
• Tradeoff with parallelism; need good node performance and parallelism

Artifactual Comm. in Extended Hierarchy

Accesses not satisfied in local portion cause communication
• Inherent communication, implicit or explicit, causes transfers
  – determined by program
• Artifactual communication
  – determined by program implementation and arch. interactions
  – poor allocation of data across distributed memories
  – unnecessary data in a transfer
  – unnecessary transfers due to system granularities
  – redundant communication of data
  – finite replication capacity (in cache or main memory)
• Inherent communication assumes unlimited capacity, small transfers, perfect knowledge of what is needed
• More on artifactual later; first consider replication-induced communication further


Communication and Replication

Comm. due to finite capacity is most fundamental artifact
• Like cache size and miss rate or memory traffic in uniprocessors
• Extended memory hierarchy view useful for this relationship

View as three level hierarchy for simplicity
• Local cache, local memory, remote memory (ignore network topology)

Classify "misses" in "cache" at any level as for uniprocessors
– compulsory or cold misses (no size effect)
– capacity misses (yes)
– conflict or collision misses (yes)
– communication or coherence misses (no)
• Each may be helped/hurt by large transfer granularity (spatial locality)
• Traffic from any type of miss can be local or non-local (communication)

Working Set Perspective

• At a given level of the hierarchy (to the next further one)
[Figure: data traffic versus replication capacity (cache size), i.e. the working set curve for a program; traffic drops sharply past the first and second working sets, and the components of traffic are cold-start (compulsory) traffic, inherent communication, other capacity-independent communication, and capacity-generated traffic (including conflicts).]
• Hierarchy of working sets
• At first level cache (fully assoc, one-word block), inherent to algorithm
  – working set curve for program


Outline (recap): 1. partitioning for performance; 2. relationship of communication, data locality and architecture; 3. orchestration for performance; 4. components of execution time as seen by processor.

Orchestration for Performance

Reducing amount of communication:
• Inherent: change logical data sharing patterns in algorithm
• Artifactual: exploit spatial, temporal locality in extended hierarchy
  – Techniques often similar to those on uniprocessors

Structuring communication to reduce cost

Let's examine techniques for both...


Reducing Artifactual Communication

Message passing model
• Communication and replication are both explicit
• Even artifactual communication is in explicit messages

Shared address space model
• More interesting from an architectural perspective
• Occurs transparently due to interactions of program and system
  – sizes and granularities in extended memory hierarchy

Use shared address space to illustrate issues

Exploiting Temporal Locality

• Structure algorithm so working sets map well to hierarchy
  – often techniques to reduce inherent communication do well here
  – schedule tasks for data reuse once assigned
• Multiple data structures in same phase
  – e.g. database records: local versus remote
• Solver example: blocking
[Figure: (a) unblocked access pattern in a sweep; (b) blocked access pattern with B = 4.]
• More useful when O(n^(k+1)) computation on O(n^k) data
  – many linear algebra computations (factorization, matrix multiply)
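A minimal sketch of blocking for a near-neighbor sweep (my own illustration, not the actual solver code; B is the band width): walking the grid a B-column band at a time lets the rows touched by the stencil stay in cache and be reused within the band, instead of being streamed through and evicted on every full-row pass.

```c
#define B 4   /* band width; tune so a band of rows fits in cache */

/* Blocked sweep over the interior of an n x n grid (C99 VLA parameter),
 * using a 5-point average of the kind the grid solver uses. */
void blocked_sweep(int n, double A[n][n])
{
    for (int jb = 1; jb < n - 1; jb += B)                /* column band  */
        for (int i = 1; i < n - 1; i++)                  /* rows in band */
            for (int j = jb; j < jb + B && j < n - 1; j++)
                A[i][j] = 0.2 * (A[i][j] + A[i-1][j] + A[i+1][j] +
                                 A[i][j-1] + A[i][j+1]);
}
```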

Exploiting Spatial Locality

Besides capacity, granularities are important:
• Granularity of allocation
• Granularity of communication or data transfer
• Granularity of coherence

Major spatial-related causes of artifactual communication:
• Conflict misses
• Data distribution/layout (allocation granularity)
• Fragmentation (communication granularity)
• False sharing of data (coherence granularity)

All depend on how spatial access patterns interact with data structures
• Fix problems by modifying data structures, or layout/alignment

Examine later in context of architectures
• one simple example here: data distribution in SAS solver

Spatial Locality Example

• Repeated sweeps over 2-d grid, each time adding 1 to elements
• Natural 2-d versus higher-dimensional array representation
• Contiguity in memory layout
[Figure: (a) two-dimensional array — pages straddle partition boundaries, making it difficult to distribute memory well, and cache blocks straddle partition boundaries; (b) four-dimensional array — a page does not straddle a partition boundary, and cache blocks lie within a partition.]
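A hedged sketch of the four-dimensional representation above (my own illustration; assumes a q × q process grid with block side b = n/q dividing evenly): indexing the grid block-major, i.e. grid[BI][BJ][i][j], makes each process's b × b partition contiguous in memory, so pages and cache blocks fall inside a single partition instead of straddling boundaries as in the row-major 2-d layout.

```c
#include <stdlib.h>

/* Block-major ("4-D") layout of an n x n grid for a q x q process grid.
 * Logical element (i, j) lives in block (i/b, j/b) at offset (i%b, j%b),
 * and each b x b block is one contiguous chunk of memory.
 * (Sketch only; error handling omitted.) */
double *alloc_grid_blockmajor(int n)
{
    return malloc((size_t)n * n * sizeof(double));
}

static inline double *elem(double *grid, int n, int q, int i, int j)
{
    int b = n / q;                                        /* block side   */
    size_t blk = (size_t)(i / b) * q + (size_t)(j / b);   /* which block  */
    size_t off = (size_t)(i % b) * b + (size_t)(j % b);   /* inside block */
    return grid + blk * (size_t)b * (size_t)b + off;
}
```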


Tradeoffs with Inherent Communication

Partitioning grid solver: blocks versus rows
• Blocks still have a spatial locality problem on remote data
• Rows can perform better despite worse inherent c-to-c ratio
[Figure: with a block partition of a row-major grid, non-local accesses at the row-oriented boundary have good spatial locality, while non-local accesses at the column-oriented boundary have poor spatial locality.]
• Result depends on n and p

Example Performance Impact

Performance measured on an SGI Origin2000
[Figure: speedup versus number of processors (up to 31) for Ocean (514 x 514 grids) and for the solver kernel (12K x 12K grid), comparing 4D, 2D and Rows partitions, each with and without the "rr" placement variant.]

Structuring Communication

Given amount of communication, goal is to reduce cost

Cost of communication as seen by process:

  C = f * ( o + l + (n_c / m) / B + t_c - overlap )

  – f = frequency of messages
  – o = overhead per message (at both ends)
  – l = network delay per message
  – n_c = total data sent
  – m = number of messages
  – B = bandwidth along path (determined by network, NI, assist)
  – t_c = cost induced by contention per message
  – overlap = amount of latency hidden by overlap with comp. or comm.

• Portion in parentheses is cost of a message (as seen by processor)
• That portion, ignoring overlap, is latency of a message
• Goal: reduce terms in latency and increase overlap

Reducing Overhead

Can reduce # of messages m or overhead per message o

o is usually determined by hardware or system software
• Program should try to reduce m by coalescing messages
• More control when communication is explicit

Coalescing data into larger messages:
• Easy for regular, coarse-grained communication
• Can be difficult for irregular, naturally fine-grained communication
  – may require changes to algorithm and extra work
    » coalescing data and determining what and to whom to send
  – will discuss more in implications for programming models later
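A small sketch that simply encodes the cost model above (my own reading: the frequency f is taken to equal the message count m, which the slide lists separately, so treat this as one plausible interpretation; parameter values are machine-specific and supplied by the caller). With that reading, the cost separates into m * (o + l + t_c - overlap) plus the pure volume term n_c / B, which makes the coalescing argument concrete: sending the same total data in fewer messages removes per-message cost but leaves the bandwidth term unchanged.

```c
/* C = f * (o + l + (n_c / m) / B + t_c - overlap), with f taken as m.
 * Units are whatever o, l, t_c and 1/B are expressed in. */
double comm_cost(double m, double n_c, double o, double l,
                 double B, double t_c, double overlap)
{
    double per_message = o + l + (n_c / m) / B + t_c - overlap;
    return m * per_message;
}
```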


Reducing Network Delay

Network delay component = f * h * t_h
  – h = number of hops traversed in network
  – t_h = link + switch latency per hop

Reducing f: communicate less, or make messages larger

Reducing h:
• Map communication patterns to network topology
  – e.g. nearest-neighbor on mesh and ring; all-to-all
• How important is this?
  – used to be major focus of parallel algorithms
  – depends on no. of processors, how t_h compares with other components
  – less important on modern machines
    » overheads, processor count, multiprogramming

Reducing Contention

All resources have nonzero occupancy
• Memory, communication controller, network link, etc.
• Can only handle so many transactions per unit time

Effects of contention:
• Increased end-to-end cost for messages
• Reduced available bandwidth for individual messages
• Causes imbalances across processors

Particularly insidious performance problem
• Easy to ignore when programming
• Slow down messages that don't even need that resource
  – by causing other dependent resources to also congest
• Effect can be devastating: Don't flood a resource!


Types of Contention

Network contention and end-point contention (hot-spots)

Location and module hot-spots

Location: e.g. accumulating into global variable, barrier
• solution: tree-structured communication
[Figure: flat accumulation into one location (contention) versus tree-structured accumulation (little contention).]
• In general, reduce burstiness; may conflict with making messages larger

Module: all-to-all personalized comm. in matrix transpose
• solution: stagger access by different processors to same node temporally

Overlapping Communication

Cannot afford to stall for high latencies
• even on uniprocessors!

Overlap with computation or communication to hide latency

Requires extra concurrency (slackness), higher bandwidth

Techniques:
• Prefetching
• Block data transfer
• Proceeding past communication
• Multithreading
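A minimal sketch of the tree-structured fix for the location hot-spot above (my own illustration; `barrier_wait()` stands in for whatever barrier primitive the program already has): partial sums are combined pairwise up a tree in log2(p) rounds, so no single memory location is hammered by all p processes at once.

```c
extern void barrier_wait(void);   /* hypothetical global barrier */

/* Tree-structured reduction: partial[pid] holds process pid's local
 * sum; after the loop, partial[0] holds the grand total.  Each round
 * halves the number of writers, avoiding a single hot location. */
void tree_reduce(double *partial, int pid, int p)
{
    for (int stride = 1; stride < p; stride *= 2) {
        barrier_wait();                       /* previous round finished */
        if (pid % (2 * stride) == 0 && pid + stride < p)
            partial[pid] += partial[pid + stride];
    }
}
```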


Summary of Tradeoffs

Different goals often have conflicting demands
• Load balance
  – fine-grain tasks
  – random or dynamic assignment
• Communication
  – usually coarse grain tasks
  – decompose to obtain locality: not random/dynamic
• Extra work
  – coarse grain tasks
  – simple assignment
• Communication cost:
  – big transfers: amortize overhead and latency
  – small transfers: reduce contention

Outline (recap): 1. partitioning for performance; 2. relationship of communication, data locality and architecture; 3. orchestration for performance; 4. components of execution time as seen by processor — what workload looks like to architecture, and relate to software issues.


Processor-Centric Perspective

[Figure: per-processor breakdown of execution time into busy-useful, busy-overhead, data-local, data-remote and synchronization components, i.e. the components that appear in the summary equation below.]

Relationship between Perspectives

Parallelization step(s) → performance issue → processor time component:
• Decomposition/assignment/orchestration → load imbalance and synchronization → synch wait
• Decomposition/assignment → extra work → busy-overhead
• Decomposition/assignment → inherent communication volume → data-remote
• Orchestration → artifactual communication and data locality → data-local
• Orchestration/mapping → communication structure


Summary

  Speedup_prob(p) = (Busy(1) + Data(1)) / (Busy_useful(p) + Data_local(p) + Synch(p) + Data_remote(p) + Busy_overhead(p))

• Goal is to reduce denominator components
• Both programmer and system have role to play
• Architecture cannot do much about load imbalance or too much communication
• But it can:
  – reduce incentive for creating ill-behaved programs (efficient naming, communication and synchronization)
  – reduce artifactual communication
  – provide efficient naming for flexible assignment
  – allow effective overlapping of communication

