Introduction
Todd C. Mowry
CS 418
January 20, 25 & 26, 2011

Parallel programming: rich space of techniques and issues
• Trade off and interact with one another
Performance issues can be addressed/helped by software or hardware
• Algorithmic or programming techniques
• Architectural techniques
Focus here on performance issues and software techniques
• Point out some architectural implications
• Architectural techniques covered in rest of class
Programming as Successive Refinement

Not all issues dealt with up front
Partitioning often independent of architecture, and done first
• View machine as a collection of communicating processors
  – balancing the workload
  – reducing the amount of inherent communication
  – reducing extra work
• Tug-o-war even among these three issues
Then interactions with architecture
• View machine as extended memory hierarchy
  – extra communication due to architectural interactions
  – cost of communication depends on how it is structured
• May inspire changes in partitioning
Discussion of issues is one at a time, but identifies tradeoffs
• Use examples, and measurements on SGI Origin2000

Outline

1. Partitioning for performance
2. Relationship of communication, data locality and architecture
3. Orchestration for performance
4. Components of execution time as seen by processor
• What workload looks like to architecture, and relate to software issues
For each issue:
• Techniques to address it, and tradeoffs with previous issues
• Illustration using case studies
• Application to grid solver
• Some architectural implications
Partitioning for Performance

1. Balancing the workload and reducing wait time at synch points
2. Reducing inherent communication
3. Reducing extra work
Even these algorithmic issues trade off:
• Minimize comm. => run on 1 processor => extreme load imbalance
• Maximize load balance => random assignment of tiny tasks => no control over communication
• Good partition may imply extra work to compute or manage it
Goal is to compromise
• Fortunately, often not difficult in practice

Load Balance and Synch Wait Time

Limit on speedup:  Speedup_problem(p) ≤ Sequential Work / Max Work on any Processor
• Work includes data access and other costs
• Not just equal work, but must be busy at same time
Four parts to load balance and reducing synch wait time:
1. Identify enough concurrency
2. Decide how to manage it
3. Determine the granularity at which to exploit it
4. Reduce serialization and cost of synchronization
Identifying Concurrency

Techniques seen for equation solver:
• Loop structure, fundamental dependences, new algorithms
Data parallelism versus function parallelism
Often see orthogonal levels of parallelism; e.g. VLSI routing:
(Figure: (a) wires W1, W2, W3; (b) wire W2 expands to segments S21–S26; (c) segment S23 expands to routes.)

Identifying Concurrency (contd.)

Function parallelism:
• entire large tasks (procedures) that can be done in parallel
• on same or different data
• e.g. different independent grid computations in Ocean
• pipelining, as in video encoding/decoding, or polygon rendering
• degree usually modest and does not grow with input size
• difficult to load balance
• often used to reduce synch between data parallel phases
Most scalable programs are data parallel (per this loose definition)
• function parallelism reduces synch between data parallel phases
Deciding How to Manage Concurrency

Static versus dynamic techniques
Static:
• Algorithmic assignment based on input; won't change
• Low runtime overhead
• Computation must be predictable
• Preferable when applicable (except in multiprogrammed or heterogeneous environment)
Dynamic:
• Adapt at runtime to balance load
• Can increase communication and reduce locality
• Can increase task management overheads
Dynamic Assignment

Profile-based (semi-static):
• Profile work distribution at runtime, and repartition dynamically
• Applicable in many computations, e.g. Barnes-Hut, some graphics
Dynamic tasking:
• Deal with unpredictability in program or environment (e.g. Raytrace)
  – computation, communication, and memory system interactions
  – multiprogramming and heterogeneity
  – used by runtime systems and OS too
• Pool of tasks; take and add tasks until done
• E.g. "self-scheduling" of loop iterations (shared loop counter)

Dynamic Tasking with Task Queues

Centralized versus distributed queues
Task stealing with distributed queues
• Can compromise comm and locality, and increase synchronization
• Whom to steal from, how many tasks to steal, ...
• Termination detection
• Maximum imbalance related to size of task
(Figure: (a) centralized task queue Q — all processes insert and remove tasks; (b) distributed task queues Q0–Q3, one per process — P0–P3 insert into and remove from their own queues, and others may steal.)
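The "self-scheduling" idea above — a shared loop counter serving as the simplest centralized task pool — can be sketched as follows. This is an illustrative sketch; the function and variable names are ours, not the lecture's.

```python
import threading

def self_schedule(n_iters, n_workers, body):
    """Each worker repeatedly grabs the next unclaimed iteration index."""
    counter = [0]                  # shared loop counter (the task pool)
    lock = threading.Lock()        # serializes only the counter increment

    def worker():
        while True:
            with lock:             # small critical section: take a task
                i = counter[0]
                counter[0] += 1
            if i >= n_iters:
                return
            body(i)                # actual work happens outside the lock

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Usage: record how many times each iteration ran (should be exactly once).
counts = [0] * 100
count_lock = threading.Lock()

def body(i):
    with count_lock:
        counts[i] += 1

self_schedule(100, 4, body)
```

Because workers pull iterations on demand, load balances automatically even if some iterations are slower than others — at the cost of contention on the shared counter, which is exactly the serialization issue the later slides address.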
Impact of Dynamic Assignment

On SGI Origin 2000 (cache-coherent shared memory):
(Figure: speedup versus number of processors (1–31) for (a) Barnes-Hut and (b) Raytrace; curves compare Origin and Challenge under static, semistatic, and dynamic assignment.)
Determining Task Granularity

Task granularity: amount of work associated with a task
General rule:
• Coarse-grained => often less load balance
• Fine-grained => more overhead; often more communication & contention
Communication & contention actually affected by assignment, not size
• Overhead affected by size itself too, particularly with task queues
Reducing Serialization

Careful about assignment and orchestration (including scheduling)
Event synchronization:
• Reduce use of conservative synchronization
  – e.g. point-to-point instead of barriers, or granularity of pt-to-pt
• But fine-grained synch more difficult to program, more synch ops.
Mutual exclusion:
• Separate locks for separate data
  – e.g. locking records in a database: lock per process, record, or field
  – lock per task in task queue, not per queue
  – finer grain => less contention/serialization, more space, less reuse
• Smaller, less frequent critical sections
  – don't do reading/testing in critical section, only modification
  – e.g. searching for task to dequeue in task queue, building tree
• Stagger critical sections in time
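The "separate locks for separate data" point can be illustrated with a lock per database record rather than one lock for the whole table, so that updates to different records proceed in parallel. A minimal sketch with invented names:

```python
import threading

NRECORDS = 8
table = [0] * NRECORDS
record_locks = [threading.Lock() for _ in range(NRECORDS)]

def deposit(rec, amount):
    # Only the touched record is locked, so deposits to different
    # records do not serialize on a single table-wide lock.
    with record_locks[rec]:
        table[rec] += amount

def run(n_threads, deposits_per_thread):
    def worker(tid):
        for k in range(deposits_per_thread):
            deposit((tid + k) % NRECORDS, 1)
    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

run(4, 1000)
```

The tradeoff is exactly the one on the slide: finer-grain locks reduce contention and serialization but cost more space and can reduce reuse.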
Reducing Inherent Communication

Communication is expensive!
Measure: communication to computation ratio
Focus here on inherent communication
• Determined by assignment of tasks to processes
• Later see that actual communication can be greater
Assign tasks that access same data to same process
Solving communication and load balance is NP-hard in general case
But simple heuristic solutions work well in practice
• Applications have structure!

Domain Decomposition

Works well for scientific, engineering, graphics, ... applications
Exploits local-biased nature of physical problems
• Information requirements often short-range
• Or long-range but fall off with distance
Simple example: nearest-neighbor grid computation
(Figure: an n x n grid partitioned into √p x √p blocks, assigned to processors P0–P15; each block is (n/√p) x (n/√p).)
Perimeter-to-area comm-to-comp ratio (area-to-volume in 3D)
• Depends on n, p: decreases with n, increases with p
Domain Decomposition (Continued)

Best domain decomposition depends on information requirements
Nearest-neighbor example: block versus strip decomposition
(Figure: the n x n grid partitioned into 4x4 blocks (P0–P15) versus horizontal strips of n/p rows each.)
Comm-to-comp ratio: 4√p/n for block, 2p/n for strip
• Retain block from here on
Application dependent: strip may be better in other cases
• E.g. particle flow in tunnel

Finding a Domain Decomposition

Static, by inspection
• Must be predictable: grid example above, and Ocean
Static, but not by inspection
• Input-dependent, require analyzing input structure
• E.g. sparse matrix computations, data mining
Semi-static (periodic repartitioning)
• Characteristics change but slowly; e.g. Barnes-Hut
Static or semi-static, with dynamic task stealing
• Initial decomposition, but highly unpredictable; e.g. ray tracing
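The two comm-to-comp ratios above can be written directly as functions of n and p. This is a sketch with per-point work normalized to one, so the ratio is communicated boundary points over computed points per processor:

```python
from math import sqrt

def block_ratio(n, p):
    # Four block edges of length n/sqrt(p), over n^2/p points of work:
    # 4*(n/sqrt(p)) / (n^2/p) = 4*sqrt(p)/n
    return 4 * sqrt(p) / n

def strip_ratio(n, p):
    # Two strip edges of length n, over n^2/p points of work:
    # 2*n / (n^2/p) = 2*p/n
    return 2 * p / n

# Both ratios shrink as n grows; block beats strip once p > 4,
# and the gap widens with p.
b = block_ratio(1024, 64)   # 4*8/1024
s = strip_ratio(1024, 64)   # 2*64/1024
```

At p = 4 the two decompositions communicate the same amount; beyond that, the perimeter advantage of square blocks takes over, which is why the lecture retains the block partition.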
Other Techniques

Scatter decomposition, e.g. initial partition in Raytrace
(Figure: domain decomposition versus scatter decomposition — scatter interleaves small 2x2 tiles numbered 1–4 across the whole grid instead of giving each processor one contiguous block.)
Preserve locality in task stealing
• Steal large tasks for locality, steal from same queues, ...

Implications of Comm-to-Comp Ratio

If denominator is execution time, ratio gives average BW needs
If operation count, gives extremes in impact of latency and bandwidth
• Latency: assume no latency hiding
• Bandwidth: assume all latency hidden
• Reality is somewhere in between
Actual impact of comm. depends on structure & cost as well
Speedup ≤ Sequential Work / Max (Work + Synch Wait Time + Comm Cost)
• Need to keep communication balanced across processors as well
Reducing Extra Work

Common sources of extra work:
• Computing a good partition
  – e.g. partitioning in Barnes-Hut or sparse matrix
• Using redundant computation to avoid communication
• Task, data and process management overhead
  – applications, languages, runtime systems, OS
• Imposing structure on communication
  – coalescing messages, allowing effective naming
Architectural implications:
• Reduce need by making communication and orchestration efficient

Speedup ≤ Sequential Work / Max (Work + Synch Wait Time + Comm Cost + Extra Work)
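The speedup bound above can be read off directly in code: speedup is limited by the most loaded processor once synch wait, communication, and extra work are charged to it. A small illustration — the numbers below are made up, not measurements from the lecture:

```python
def speedup_bound(sequential_work, per_proc_costs):
    """per_proc_costs: list of (work, synch_wait, comm_cost, extra_work)
    tuples, one per processor. Returns the upper bound on speedup."""
    worst = max(sum(costs) for costs in per_proc_costs)
    return sequential_work / worst

# Perfectly balanced, cost-free case: 4 processors, linear speedup.
ideal = speedup_bound(100, [(25, 0, 0, 0)] * 4)

# Same useful work, but one processor also pays synch wait,
# communication, and extra work -- it alone halves the bound.
real = speedup_bound(100, [(25, 5, 10, 10)] + [(25, 0, 0, 0)] * 3)
```

Note the max in the denominator: it is the single worst processor, not the average, that sets the limit — which is why balancing communication across processors matters as much as balancing work.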
Summary: Analyzing Parallel Algorithms

Requires characterization of multiprocessor and algorithm
Historical focus on algorithmic aspects: partitioning, mapping
PRAM model: data access and communication are free
• Only load balance (including serialization) and extra work matter
Speedup ≤ Sequential Instructions / Max (Instructions + Synch Wait Time + Extra Instructions)
• Useful for early development, but unrealistic for real performance
• Ignores communication and also the imbalances it causes
• Can lead to poor choice of partitions as well as orchestration
• More recent models incorporate comm. costs: BSP, LogP, ...
Limitations of Algorithm Analysis

Inherent communication in parallel algorithm is not all
• artifactual communication caused by program implementation and architectural interactions can even dominate
• thus, amount of communication not dealt with adequately
Cost of communication determined not only by amount
• also how communication is structured
• and cost of communication in system
Both architecture-dependent, and addressed in orchestration step
To understand techniques, first look at system interactions

What is a Multiprocessor?

A collection of communicating processors
• View taken so far
• Goals: balance load, reduce inherent communication and extra work
A multi-cache, multi-memory system
• Role of these components essential regardless of programming model
• Programming model and comm. abstraction affect specific performance tradeoffs
Most of remaining performance issues focus on second aspect
Memory-Oriented View

Multiprocessor as extended memory hierarchy
• as seen by a given processor
Levels in extended hierarchy:
• Registers, caches, local memory, remote memory (topology)
• Glued together by communication architecture
• Levels communicate at a certain granularity of data transfer
Need to exploit spatial and temporal locality in hierarchy
• Otherwise extra communication may also be caused
• Especially important since communication is expensive

Uniprocessor

Performance depends heavily on memory hierarchy
Time spent by a program:
  Time_prog(1) = Busy(1) + Data Access(1)
• Divide by instructions to get CPI equation
Data access time can be reduced by:
• Optimizing machine: bigger caches, lower latency...
• Optimizing program: temporal and spatial locality
Extended Hierarchy

Idealized view: local cache hierarchy + single main memory
But reality is more complex
• Centralized memory: caches of other processors
• Distributed memory: some local, some remote; + network topology
• Management of levels
  – caches managed by hardware
  – main memory depends on programming model
    » SAS: data movement between local and remote is transparent
    » message passing: explicit
• Levels closer to processor are lower latency and higher bandwidth
• Improve performance through architecture or program locality
• Tradeoff with parallelism; need good node performance and parallelism

Artifactual Communication in Extended Hierarchy

Accesses not satisfied in local portion cause communication
• Inherent communication, implicit or explicit, causes transfers
  – determined by program
• Artifactual communication
  – determined by program implementation and arch. interactions
  – poor allocation of data across distributed memories
  – unnecessary data in a transfer
  – unnecessary transfers due to system granularities
  – redundant communication of data
  – finite replication capacity (in cache or main memory)
• Inherent communication assumes unlimited capacity, small transfers, perfect knowledge of what is needed
• More on artifactual later; first consider replication-induced communication
Communication and Replication

Comm. due to finite capacity is most fundamental artifact
• Like cache size and miss rate or memory traffic in uniprocessors
• Extended memory hierarchy view useful for this relationship
View as three-level hierarchy for simplicity
• Local cache, local memory, remote memory (ignore network topology)
Classify "misses" in "cache" at any level as for uniprocessors
  – compulsory or cold misses (no size effect)
  – capacity misses (yes)
  – conflict or collision misses (yes)
  – communication or coherence misses (no)
• Each may be helped/hurt by large transfer granularity (spatial locality)

Working Set Perspective

• At a given level of the hierarchy (to the next further one)
(Figure: data traffic versus replication capacity (cache size); traffic drops sharply past the first working set and again past the second working set. Stacked components, top to bottom: capacity-generated traffic (including conflicts), other capacity-independent communication, inherent communication, cold-start (compulsory) traffic.)
• Hierarchy of working sets
• At first level cache (fully assoc., one-word block), inherent to algorithm
  – working set curve for program
• Traffic from any type of miss can be local or non-local (communication)
Orchestration for Performance

Reducing amount of communication:
• Inherent: change logical data sharing patterns in algorithm
• Artifactual: exploit spatial, temporal locality in extended hierarchy
  – Techniques often similar to those on uniprocessors
Structuring communication to reduce cost
Let's examine techniques for both...
Reducing Artifactual Communication

Message passing model:
• Communication and replication are both explicit
• Even artifactual communication is in explicit messages
Shared address space model:
• More interesting from an architectural perspective
• Occurs transparently due to interactions of program and system
  – sizes and granularities in extended memory hierarchy
Use shared address space to illustrate issues

Exploiting Temporal Locality

• Structure algorithm so working sets map well to hierarchy
  – often techniques to reduce inherent communication do well here
  – schedule tasks for data reuse once assigned
• Multiple data structures in same phase
  – e.g. database records: local versus remote
• Solver example: blocking
(Figure: (a) unblocked access pattern in a sweep versus (b) blocked access pattern with B = 4.)
• More useful when O(n^(k+1)) computation on O(n^k) data
  – many linear algebra computations (factorization, matrix multiply)
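The blocking transformation above can be sketched as reordering a grid sweep into B x B blocks so that each block stays resident in cache across the inner work. This toy version only demonstrates that the reordering preserves the result — the locality benefit shows up in cache behavior, not in the output:

```python
def sweep_unblocked(a):
    """Plain row-by-row sweep over an n x n grid."""
    n = len(a)
    for i in range(n):
        for j in range(n):
            a[i][j] += 1

def sweep_blocked(a, B):
    """Same sweep, reordered into B x B blocks for reuse."""
    n = len(a)
    for ii in range(0, n, B):              # block row
        for jj in range(0, n, B):          # block column
            for i in range(ii, min(ii + B, n)):
                for j in range(jj, min(jj + B, n)):
                    a[i][j] += 1

n = 8
x = [[0] * n for _ in range(n)]
y = [[0] * n for _ in range(n)]
sweep_unblocked(x)
sweep_blocked(y, 4)   # B = 4, matching the slide's figure
```

In a real solver the inner block would be swept repeatedly (or combined with other operations on the same block) before moving on, which is where the O(n^(k+1))-work-on-O(n^k)-data payoff comes from.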
Exploiting Spatial Locality

Besides capacity, granularities are important:
• Granularity of allocation
• Granularity of communication or data transfer
• Granularity of coherence
Major spatial-related causes of artifactual communication:
• Conflict misses
• Data distribution/layout (allocation granularity)
• Fragmentation (communication granularity)
• False sharing of data (coherence granularity)
All depend on how spatial access patterns interact with data structures
• Fix problems by modifying data structures, or layout/alignment
Examine later in context of architectures
• one simple example here: data distribution in SAS solver

Spatial Locality Example

• Repeated sweeps over 2-d grid, each time adding 1 to elements
• Natural 2-d versus higher-dimensional array representation
(Figure: (a) two-dimensional array — contiguity in memory layout follows whole rows, so pages straddle partition boundaries (making memory difficult to distribute well) and cache blocks straddle partition boundaries; (b) four-dimensional array — each partition is contiguous, pages do not straddle partitions, and cache blocks stay within a partition.)
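One way to see the 2-d versus 4-d layout difference is to compute the flat memory addresses a processor's block occupies under each layout. A sketch with invented helper names (assuming a square grid evenly divided into square blocks):

```python
def flat_indices_2d(n, bsize, bi, bj):
    """Flat addresses of block (bi, bj) in a row-major n x n array:
    each row of the block is contiguous, but rows are n apart."""
    return [(bi * bsize + i) * n + (bj * bsize + j)
            for i in range(bsize) for j in range(bsize)]

def flat_indices_4d(n, bsize, bi, bj):
    """Flat addresses when blocks themselves are the allocation unit
    (the "4-d array" layout): the whole block is one contiguous run."""
    nb = n // bsize
    base = (bi * nb + bj) * bsize * bsize
    return list(range(base, base + bsize * bsize))

def contiguous(idxs):
    return idxs == list(range(idxs[0], idxs[0] + len(idxs)))

# For a 16x16 grid split into 4x4 blocks, pick block (1, 2):
scattered = flat_indices_2d(16, 4, 1, 2)   # strided, crosses row boundaries
packed = flat_indices_4d(16, 4, 1, 2)      # one contiguous range
```

Under the 2-d layout a partition's data is strided through memory, so pages and cache blocks inevitably straddle partition boundaries; the 4-d layout packs each partition into one run, which is what makes distribution and coherence granularities line up with the partitioning.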
Tradeoffs with Inherent Communication

Partitioning grid solver: blocks versus rows
• Blocks still have a spatial locality problem on remote data
• Rows can perform better despite worse inherent c-to-c ratio
(Figure: with a row-major layout, a block partition gets good spatial locality on non-local accesses at its row-oriented boundary but poor spatial locality on non-local accesses at its column-oriented boundary.)
• Result depends on n and p

Example Performance Impact

Performance measured on an SGI Origin2000
(Figure: speedup versus number of processors (1–31) for Ocean with 514x514 grids and for the Solver Kernel with a 12K x 12K grid; curves compare 4D, Rows, and 2D partitions, plus round-robin variants (4D-rr, Rows-rr, 2D-rr) for the kernel.)
Structuring Communication

Given amount of communication, goal is to reduce cost
Cost of communication as seen by process:
  C = f * (o + l + n_c/(m*B) + t_c - overlap)
  – f = frequency of messages
  – o = overhead per message (at both ends)
  – l = network delay per message
  – n_c = total data sent
  – m = number of messages
  – B = bandwidth along path (determined by network, NI, assist)
  – t_c = cost induced by contention per message
  – overlap = amount of latency hidden by overlap with comp. or comm.
• Portion in parentheses is cost of a message (as seen by processor)
• That portion, ignoring overlap, is latency of a message
• Goal: reduce terms in latency and increase overlap

Reducing Overhead

Can reduce # of messages m or overhead per message o
o is usually determined by hardware or system software
• Program should try to reduce m by coalescing messages
• More control when communication is explicit
Coalescing data into larger messages:
• Easy for regular, coarse-grained communication
• Can be difficult for irregular, naturally fine-grained communication
  – may require changes to algorithm and extra work
    » coalescing data and determining what and to whom to send
  – will discuss more in implications for programming models later
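The cost expression above can be transcribed directly. The example values below are invented purely to show the effect of coalescing: sending the same total data n_c in fewer, larger messages pays the per-message overhead o and delay l fewer times:

```python
def comm_cost(f, o, l, n_c, m, B, t_c, overlap):
    """C = f * (o + l + n_c/(m*B) + t_c - overlap), as on the slide."""
    per_message = o + l + (n_c / m) / B + t_c - overlap
    return f * per_message

# Same 1e6 units of data, bandwidth B = 100 units/time, o = 5, l = 2,
# no contention or overlap. f tracks m here (every message is sent).
fine   = comm_cost(f=1000, o=5.0, l=2.0, n_c=1e6, m=1000, B=100.0,
                   t_c=0.0, overlap=0.0)
coarse = comm_cost(f=10,   o=5.0, l=2.0, n_c=1e6, m=10,   B=100.0,
                   t_c=0.0, overlap=0.0)
```

The pure transfer time n_c/B is identical in both cases; the difference is entirely the o + l terms multiplied by message count, which is why the next slide focuses on reducing m when o is fixed by hardware or system software.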
Reducing Network Delay

Network delay component = f * h * t_h
  – h = number of hops traversed in network
  – t_h = link + switch latency per hop
Reducing f: communicate less, or make messages larger
Reducing h:
• Map communication patterns to network topology
  – e.g. nearest-neighbor on mesh and ring; all-to-all
• How important is this?
  – used to be a major focus of parallel algorithms
  – depends on no. of processors, and how t_h compares with other components
  – less important on modern machines
    » overheads, processor count, multiprogramming

Reducing Contention

All resources have nonzero occupancy
• Memory, communication controller, network link, etc.
• Can only handle so many transactions per unit time
Effects of contention:
• Increased end-to-end cost for messages
• Reduced available bandwidth for individual messages
• Causes imbalances across processors
Particularly insidious performance problem
• Easy to ignore when programming
• Slows down messages that don't even need that resource
  – by causing other dependent resources to also congest
• Effect can be devastating: don't flood a resource!
Types of Contention

Network contention and end-point contention (hot-spots)
Location and module hot-spots
Location: e.g. accumulating into global variable, barrier
• solution: tree-structured communication
(Figure: flat communication — all processors send to one node, causing contention — versus tree-structured communication, with little contention.)
Module: all-to-all personalized comm. in matrix transpose
• solution: stagger access by different processors to same node temporally
• In general, reduce burstiness; may conflict with making messages larger

Overlapping Communication

Cannot afford to stall for high latencies
• even on uniprocessors!
Overlap with computation or communication to hide latency
Requires extra concurrency (slackness), higher bandwidth
Techniques:
• Prefetching
• Block data transfer
• Proceeding past communication
• Multithreading
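The tree-structured fix for the accumulation hot-spot can be sketched as pairwise combining in rounds: with p contributors, the total takes log2(p) communication rounds, instead of p serialized updates queued at one location. A sketch, not the lecture's code (it assumes p is a power of two):

```python
def tree_reduce(values):
    """Combine pairwise until one value remains.
    Returns (total, number_of_rounds); len(values) must be a power of 2."""
    rounds = 0
    while len(values) > 1:
        # All pairs in a round could combine in parallel on different
        # nodes -- no single location sees more than one update per round.
        values = [values[i] + values[i + 1]
                  for i in range(0, len(values), 2)]
        rounds += 1
    return values[0], rounds

# 16 "processors" each contribute one partial sum.
total, rounds = tree_reduce(list(range(16)))
```

The same shape is used by barrier implementations and by collective-reduction libraries: the root is touched once instead of p times, trading a little extra latency (log2(p) rounds) for the removal of the hot-spot.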
Summary of Tradeoffs

Different goals often have conflicting demands
• Load balance
  – fine-grain tasks
  – random or dynamic assignment
• Communication
  – usually coarse-grain tasks
  – decompose to obtain locality: not random/dynamic
• Extra work
  – coarse-grain tasks
  – simple assignment
• Communication cost:
  – big transfers: amortize overhead and latency
  – small transfers: reduce contention
Processor-Centric Perspective

(Figure: execution time per processor broken into components: busy-useful, busy-overhead, data-local, data-remote, and synchronization wait.)

Relationship between Perspectives

Parallelization step(s)                   Performance issue                              Processor time component
Decomposition/assignment/orchestration    Load imbalance and synchronization             Synch wait
Decomposition/assignment                  Extra work                                     Busy-overhead
Decomposition/assignment                  Inherent communication volume                  Data-remote
Orchestration                             Artifactual communication and data locality    Data-local
Orchestration/mapping                     Communication structure
Summary

Speedup_prob(p) = (Busy(1) + Data(1)) / (Busy_useful(p) + Data_local(p) + Synch(p) + Data_remote(p) + Busy_overhead(p))

• Goal is to reduce denominator components
• Both programmer and system have a role to play
• Architecture cannot do much about load imbalance or too much communication
• But it can:
  – reduce incentive for creating ill-behaved programs (efficient naming, communication and synchronization)
  – reduce artifactual communication
  – provide efficient naming for flexible assignment
  – allow effective overlapping of communication