Introduction

Todd C. Mowry, CS 418, January 20, 25 & 26, 2011

Parallel programming: rich space of techniques and issues
• Trade off and interact with one another

Performance issues can be addressed/helped by software or hardware
• Algorithmic or programming techniques
• Architectural techniques

Focus here on performance issues and software techniques
• Point out some architectural implications
• Architectural techniques covered in rest of class


Programming as Successive Refinement

Not all issues dealt with up front

Partitioning often independent of architecture, and done first
• View machine as a collection of communicating processors
  – balancing the workload
  – reducing the amount of inherent communication
  – reducing extra work
• Tug-of-war even among these three issues

Then interactions with architecture
• View machine as extended memory hierarchy
  – extra communication due to architectural interactions
  – cost of communication depends on how it is structured
• May inspire changes in partitioning

Discussion of issues is one at a time, but identifies tradeoffs
• Use examples, and measurements on SGI Origin2000

Outline

1. Partitioning for performance
2. Relationship of communication, data locality and architecture
3. Orchestration for performance
4. Components of execution time as seen by processor
   • What workload looks like to architecture, and relate to software issues

For each issue:
• Techniques to address it, and tradeoffs with previous issues
• Illustration using case studies
• Application to grid solver
• Some architectural implications


Partitioning for Performance

1. Balancing the workload and reducing wait time at synch points
2. Reducing inherent communication
3. Reducing extra work

Even these algorithmic issues trade off:
• Minimize comm. => run on 1 processor => extreme load imbalance
• Maximize load balance => random assignment of tiny tasks => no control over communication
• Good partition may imply extra work to compute or manage it
Goal is to compromise
• Fortunately, often not difficult in practice

Load Balance and Synch Wait Time

Limit on speedup:
  Speedup_problem(p) < Sequential Work / Max Work on any Processor
• Work includes data access and other costs
• Not just equal work, but must be busy at same time

Four parts to load balance and reducing synch wait time:
1. Identify enough concurrency
2. Decide how to manage it
3. Determine the granularity at which to exploit it
4. Reduce serialization and cost of synchronization
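To make the bound concrete (numbers invented here purely for illustration, not taken from the lecture): if the sequential work is 100 time units and the most heavily loaded of p = 4 processors ends up doing 40 units of work (including its data access), then Speedup_problem(4) < 100 / 40 = 2.5 no matter how cheap communication is; hence the emphasis on keeping every processor equally busy at the same time.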


Identifying Concurrency

Techniques seen for equation solver:
• Loop structure, fundamental dependences, new algorithms

Data parallelism versus function parallelism

Often see orthogonal levels of parallelism; e.g. VLSI routing
[Figure: (a) wires W1, W2, W3; (b) wire W2 expands to segments S21–S26; (c) segment S23 expands to routes]

Identifying Concurrency (contd.)

Function parallelism:
• entire large tasks (procedures) that can be done in parallel
• on same or different data
• e.g. different independent grid computations in Ocean
• pipelining, as in video encoding/decoding, or polygon rendering
• degree usually modest and does not grow with input size
• difficult to load balance
• often used to reduce synch between data parallel phases

Most scalable programs data parallel (per this loose definition)
• function parallelism reduces synch between data parallel phases

Load Balance and Synch Wait Time (recap of the speedup limit and the four parts listed earlier).

Deciding How to Manage Concurrency

Static versus dynamic techniques

Static:
• Algorithmic assignment based on input; won't change
• Low runtime overhead
• Computation must be predictable
• Preferable when applicable (except in multiprogrammed or heterogeneous environment)

Dynamic:
• Adapt at runtime to balance load
• Can increase communication and reduce locality
• Can increase task management overheads
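As a minimal sketch of a static technique (my own illustration in C, not code from the lecture): each process derives its contiguous range of loop iterations from its id and the problem size alone, so there is no runtime scheduling overhead, but the split is only as good as our ability to predict the work per iteration.

```c
/* Static block assignment of n iterations among nprocs processes.
 * Process `pid` computes its own half-open range [lo, hi) up front;
 * nothing is negotiated at runtime (illustrative sketch only). */
void static_range(int pid, int nprocs, int n, int *lo, int *hi)
{
    int chunk = (n + nprocs - 1) / nprocs;        /* ceiling division */
    *lo = pid * chunk;
    *hi = (*lo + chunk < n) ? *lo + chunk : n;    /* clamp last chunk */
}
```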


Dynamic Assignment

Profile-based (semi-static):
• Profile work distribution at runtime, and repartition dynamically
• Applicable in many computations, e.g. Barnes-Hut, some graphics

Dynamic tasking:
• Deal with unpredictability in program or environment (e.g. Raytrace)
  – computation, communication, and memory system interactions
  – multiprogramming and heterogeneity
  – used by runtime systems and OS too
• Pool of tasks; take and add tasks until done
• E.g. "self-scheduling" of loop iterations (shared loop counter)

Dynamic Tasking with Task Queues

Centralized versus distributed queues

Task stealing with distributed queues
• Can compromise comm and locality, and increase synchronization
• Whom to steal from, how many tasks to steal, ...
• Termination detection
• Maximum imbalance related to size of task

[Figure: (a) centralized task queue — all processes insert tasks into and remove tasks from a single queue Q; (b) distributed task queues, one per process — P0–P3 each insert into and remove from their own queue Q0–Q3, and others may steal.]
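A minimal sketch of the "self-scheduling" idea above (my own illustration using C11 atomics, not the course's code): the shared loop counter acts as a trivially centralized task queue, and each process keeps taking the next unclaimed iteration until the pool is exhausted.

```c
#include <stdatomic.h>

atomic_int next_iter;    /* shared loop counter = centralized task pool */

/* Each process/thread runs this; do_iter(i) is whatever work
 * iteration i performs (hypothetical callback, for illustration). */
void self_schedule(int n, void (*do_iter)(int))
{
    for (;;) {
        int i = atomic_fetch_add(&next_iter, 1);   /* grab next task */
        if (i >= n)
            break;                                 /* pool exhausted */
        do_iter(i);
    }
}
```

Taking a small chunk of iterations per fetch-and-add, rather than one, is the usual way to trade contention on the shared counter against load balance, which is exactly the granularity question taken up below.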


Impact of Dynamic Assignment

On SGI Origin 2000 (cache-coherent shared memory):
[Figure: speedup versus number of processors (up to 31) for (a) Barnes-Hut with static versus semistatic assignment and (b) Raytrace with static versus dynamic assignment, each measured on both the Origin and the Challenge.]

Load Balance and Synch Wait Time (recap of the speedup limit and the four parts listed earlier).

Determining Task Granularity

Task granularity: amount of work associated with a task

General rule:
• Coarse-grained => often less load balance
• Fine-grained => more overhead; often more communication & contention

Communication & contention actually affected by assignment, not size
• Overhead by size itself too, particularly with task queues

Load Balance and Synch Wait Time (recap of the speedup limit and the four parts listed earlier).


Reducing Serialization

Careful about assignment and orchestration (including scheduling)

Event synchronization
• Reduce use of conservative synchronization
  – e.g. point-to-point instead of barriers, or granularity of pt-to-pt
• But fine-grained synch more difficult to program, more synch ops.

Mutual exclusion
• Separate locks for separate data
  – e.g. locking records in a database: lock at the granularity of record or field
  – lock per task in task queue, not per queue
  – finer grain => less contention/serialization, more space, less reuse
• Smaller, less frequent critical sections
  – don't do reading/testing in critical section, only modification
  – e.g. searching for task to dequeue in task queue, building tree
• Stagger critical sections in time

Partitioning for Performance (recap of the three goals: balancing the workload and reducing synch wait time, reducing inherent communication, reducing extra work).
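A minimal pthreads sketch of the "separate locks for separate data" advice above (my own illustration; `record_t` and `deposit` are made-up names): each record carries its own lock, and the critical section contains only the modification, not the search or read that precedes it.

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;   /* protects only this record */
    double balance;
} record_t;

/* Finer-grain mutual exclusion: updates to different records proceed
 * in parallel instead of serializing on one table-wide lock. */
void deposit(record_t *r, double amount)
{
    pthread_mutex_lock(&r->lock);     /* small, short critical section  */
    r->balance += amount;             /* only the modification is inside */
    pthread_mutex_unlock(&r->lock);
}
```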


Reducing Inherent Communication

Communication is expensive!
Measure: communication to computation ratio

Focus here on inherent communication
• Determined by assignment of tasks to processes
• Later see that actual communication can be greater

Assign tasks that access same data to same process

Solving communication and load balance NP-hard in general case
But simple heuristic solutions work well in practice
• Applications have structure!

Domain Decomposition

Works well for scientific, engineering, graphics, ... applications

Exploits local-biased nature of physical problems
• Information requirements often short-range
• Or long-range but fall off with distance

Simple example: nearest-neighbor grid computation
[Figure: an n × n grid partitioned into 16 square blocks of side n/√p, one per process P0–P15; each process exchanges only the boundary of its block with its neighbors.]

Perimeter to area comm-to-comp ratio (area to volume in 3D)
• Depends on n, p: decreases with n, increases with p


Domain Decomposition (Continued)

Best domain decomposition depends on information requirements

Nearest neighbor example: block versus strip decomposition
[Figure: the same n × n grid divided among processes P0–P15, comparing square blocks with horizontal strips of n/p rows per process.]

Comm-to-comp ratio: 4√p / n for block, 2p / n for strip
• Retain block from here on

Application dependent: strip may be better in other cases
• E.g. particle flow in tunnel

Finding a Domain Decomposition

Static, by inspection
• Must be predictable: grid example above, and Ocean

Static, but not by inspection
• Input-dependent, require analyzing input structure
• E.g. sparse matrix computations, data mining

Semi-static (periodic repartitioning)
• Characteristics change but slowly; e.g. Barnes-Hut

Static or semi-static, with dynamic task stealing
• Initial decomposition, but highly unpredictable; e.g. ray tracing
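A hedged sketch of the block decomposition retained above (assuming p is a perfect square whose root divides n; my own helper, not the lecture's code): each process owns an (n/√p) × (n/√p) block, so it computes on n²/p points while exchanging roughly 4·n/√p boundary points, which is where the 4√p/n comm-to-comp ratio comes from.

```c
#include <math.h>

/* Block decomposition of an n x n grid among p = q*q processes.
 * Process `pid` owns rows [*r0, *r0 + *b) and columns [*c0, *c0 + *b),
 * with block side b = n/q.  Assumes p is a perfect square and q
 * divides n (illustrative only; link with -lm). */
void block_bounds(int pid, int p, int n, int *r0, int *c0, int *b)
{
    int q = (int)lround(sqrt((double)p));   /* process grid is q x q    */
    *b  = n / q;                            /* side length of the block */
    *r0 = (pid / q) * (*b);                 /* first row owned          */
    *c0 = (pid % q) * (*b);                 /* first column owned       */
}
```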

Other Techniques

Scatter decomposition, e.g. initial partition in Raytrace
[Figure: domain decomposition versus scatter decomposition of the plane among four processes; in the scatter version the 1 2 / 3 4 pattern of process assignments is tiled repeatedly over the whole domain.]

Preserve locality in task stealing
• Steal large tasks for locality, steal from same queues, ...

Implications of Comm-to-Comp Ratio

If denominator is execution time, ratio gives average BW needs
If operation count, gives extremes in impact of latency and bandwidth
• Latency: assume no latency hiding
• Bandwidth: assume all latency hidden
• Reality is somewhere in between

Actual impact of comm. depends on structure & cost as well

Still:
  Speedup < Sequential Work / Max (Work + Synch Wait Time + Comm Cost)
• Need to keep communication balanced across processors as well
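A minimal sketch of scatter decomposition as drawn above (my own illustration; the q × q process tiling, with q = 2 for four processes, is the assumption): small blocks are dealt out cyclically, so every process samples the whole domain and "hot" regions are averaged across processes, at the cost of locality, which is why it is combined with locality-preserving task stealing.

```c
/* Scatter (interleaved) decomposition: the domain is cut into many
 * small blocks and block (bi, bj) goes to process (bi mod q, bj mod q)
 * of a q x q process grid -- the 1 2 / 3 4 pattern tiled over the
 * domain in the figure corresponds to q = 2. */
int scatter_owner(int bi, int bj, int q)
{
    return (bi % q) * q + (bj % q);
}
```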

Partitioning for Performance (recap of the three goals: balancing the workload and reducing synch wait time, reducing inherent communication, reducing extra work).

Reducing Extra Work

Common sources of extra work:
• Computing a good partition
  – e.g. partitioning in Barnes-Hut or sparse matrix
• Using redundant computation to avoid communication
• Task, data and process management overhead
  – applications, languages, runtime systems, OS
• Imposing structure on communication
  – coalescing messages, allowing effective naming

Architectural implications:
• Reduce need by making communication and orchestration efficient

  Speedup < Sequential Work / Max (Work + Synch Wait Time + Comm Cost + Extra Work)


Summary: Analyzing Parallel Algorithms

Requires characterization of multiprocessor and algorithm

Historical focus on algorithmic aspects: partitioning, mapping

PRAM model: data access and communication are free
• Only load balance (including serialization) and extra work matter

  Speedup < Sequential Instructions / Max (Instructions + Synch Wait Time + Extra Instructions)

• Useful for early development, but unrealistic for real performance
• Ignores communication and also the imbalances it causes
• Can lead to poor choice of partitions as well as orchestration
• More recent models incorporate comm. costs: BSP, LogP, ...

Outline (recap): 1. partitioning for performance; 2. relationship of communication, data locality and architecture; 3. orchestration for performance; 4. components of execution time as seen by processor.


Limitations of Algorithm Analysis

Inherent communication in algorithm is not all
• artifactual communication caused by program implementation and architectural interactions can even dominate
• thus, amount of communication not dealt with adequately

Cost of communication determined not only by amount
• also how communication is structured
• and cost of communication in system

Both architecture-dependent, and addressed in orchestration step

To understand techniques, first look at system interactions

What is a Multiprocessor?

A collection of communicating processors
• View taken so far
• Goals: balance load, reduce inherent communication and extra work

A multi-cache, multi-memory system
• Role of these components essential regardless of programming model
• Programming model and comm. abstraction affect specific performance tradeoffs

Most of remaining performance issues focus on second aspect


Memory-Oriented View

Multiprocessor as extended memory hierarchy
• as seen by a given processor

Levels in extended hierarchy:
• Registers, caches, local memory, remote memory (topology)
• Glued together by communication architecture
• Levels communicate at a certain granularity of data transfer

Need to exploit spatial and temporal locality in hierarchy
• Otherwise extra communication may also be caused
• Especially important since communication is expensive

Uniprocessor

Performance depends heavily on memory hierarchy

Time spent by a program:
  Time_prog(1) = Busy(1) + Data Access(1)
• Divide by instructions to get CPI equation

Data access time can be reduced by:
• Optimizing machine: bigger caches, lower latency...
• Optimizing program: temporal and spatial locality


Extended Hierarchy

Idealized view: local cache hierarchy + single main memory

But reality is more complex
• Centralized memory: caches of other processors
• Distributed memory: some local, some remote; + network topology
• Management of levels
  – caches managed by hardware
  – main memory depends on programming model
    » SAS: data movement between local and remote transparent
    » message passing: explicit
• Levels closer to processor are lower latency and higher bandwidth
• Improve performance through architecture or program locality
• Tradeoff with parallelism; need good node performance and parallelism

Artifactual Comm. in Extended Hierarchy

Accesses not satisfied in local portion cause communication
• Inherent communication, implicit or explicit, causes transfers
  – determined by program
• Artifactual communication
  – determined by program implementation and arch. interactions
  – poor allocation of data across distributed memories
  – unnecessary data in a transfer
  – unnecessary transfers due to system granularities
  – redundant communication of data
  – finite replication capacity (in cache or main memory)
• Inherent communication assumes unlimited capacity, small transfers, perfect knowledge of what is needed
• More on artifactual later; first consider replication-induced communication further


Communication and Replication

Comm. due to finite capacity is most fundamental artifact
• Like cache size and miss rate or memory traffic in uniprocessors
• Extended memory hierarchy view useful for this relationship

View as three level hierarchy for simplicity
• Local cache, local memory, remote memory (ignore network topology)

Classify "misses" in "cache" at any level as for uniprocessors
– compulsory or cold misses (no size effect)
– capacity misses (yes)
– conflict or collision misses (yes)
– communication or coherence misses (no)
• Each may be helped/hurt by large transfer granularity (spatial locality)
• Traffic from any type of miss can be local or non-local (communication)

Working Set Perspective

• At a given level of the hierarchy (to the next further one)
[Figure: data traffic versus replication capacity (cache size), i.e. the working set curve for a program; traffic drops sharply past the first and second working sets, and the components of traffic are cold-start (compulsory) traffic, inherent communication, other capacity-independent communication, and capacity-generated traffic (including conflicts).]
• Hierarchy of working sets
• At first level cache (fully assoc, one-word block), inherent to algorithm
  – working set curve for program


Outline (recap): 1. partitioning for performance; 2. relationship of communication, data locality and architecture; 3. orchestration for performance; 4. components of execution time as seen by processor.

Orchestration for Performance

Reducing amount of communication:
• Inherent: change logical data sharing patterns in algorithm
• Artifactual: exploit spatial, temporal locality in extended hierarchy
  – Techniques often similar to those on uniprocessors

Structuring communication to reduce cost

Let's examine techniques for both...


Reducing Artifactual Communication

Message passing model
• Communication and replication are both explicit
• Even artifactual communication is in explicit messages

Shared address space model
• More interesting from an architectural perspective
• Occurs transparently due to interactions of program and system
  – sizes and granularities in extended memory hierarchy

Use shared address space to illustrate issues

Exploiting Temporal Locality

• Structure algorithm so working sets map well to hierarchy
  – often techniques to reduce inherent communication do well here
  – schedule tasks for data reuse once assigned
• Multiple data structures in same phase
  – e.g. database records: local versus remote
• Solver example: blocking
[Figure: (a) unblocked access pattern in a sweep; (b) blocked access pattern with B = 4.]
• More useful when O(n^(k+1)) computation on O(n^k) data
  – many linear algebra computations (factorization, matrix multiply)
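A minimal sketch of blocking for a near-neighbor sweep (my own illustration, not the actual solver code; B is the band width): walking the grid a B-column band at a time lets the rows touched by the stencil stay in cache and be reused within the band, instead of being streamed through and evicted on every full-row pass.

```c
#define B 4   /* band width; tune so a band of rows fits in cache */

/* Blocked sweep over the interior of an n x n grid (C99 VLA parameter),
 * using a 5-point average of the kind the grid solver uses. */
void blocked_sweep(int n, double A[n][n])
{
    for (int jb = 1; jb < n - 1; jb += B)                /* column band  */
        for (int i = 1; i < n - 1; i++)                  /* rows in band */
            for (int j = jb; j < jb + B && j < n - 1; j++)
                A[i][j] = 0.2 * (A[i][j] + A[i-1][j] + A[i+1][j] +
                                 A[i][j-1] + A[i][j+1]);
}
```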

Exploiting Spatial Locality

Besides capacity, granularities are important:
• Granularity of allocation
• Granularity of communication or data transfer
• Granularity of coherence

Major spatial-related causes of artifactual communication:
• Conflict misses
• Data distribution/layout (allocation granularity)
• Fragmentation (communication granularity)
• False sharing of data (coherence granularity)

All depend on how spatial access patterns interact with data structures
• Fix problems by modifying data structures, or layout/alignment

Examine later in context of architectures
• one simple example here: data distribution in SAS solver

Spatial Locality Example

• Repeated sweeps over 2-d grid, each time adding 1 to elements
• Natural 2-d versus higher-dimensional array representation
• Contiguity in memory layout
[Figure: (a) two-dimensional array — pages straddle partition boundaries, making it difficult to distribute memory well, and cache blocks straddle partition boundaries; (b) four-dimensional array — a page does not straddle a partition boundary, and cache blocks lie within a partition.]
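A hedged sketch of the four-dimensional representation above (my own illustration; assumes a q × q process grid with block side b = n/q dividing evenly): indexing the grid block-major, i.e. grid[BI][BJ][i][j], makes each process's b × b partition contiguous in memory, so pages and cache blocks fall inside a single partition instead of straddling boundaries as in the row-major 2-d layout.

```c
#include <stdlib.h>

/* Block-major ("4-D") layout of an n x n grid for a q x q process grid.
 * Logical element (i, j) lives in block (i/b, j/b) at offset (i%b, j%b),
 * and each b x b block is one contiguous chunk of memory.
 * (Sketch only; error handling omitted.) */
double *alloc_grid_blockmajor(int n)
{
    return malloc((size_t)n * n * sizeof(double));
}

static inline double *elem(double *grid, int n, int q, int i, int j)
{
    int b = n / q;                                        /* block side   */
    size_t blk = (size_t)(i / b) * q + (size_t)(j / b);   /* which block  */
    size_t off = (size_t)(i % b) * b + (size_t)(j % b);   /* inside block */
    return grid + blk * (size_t)b * (size_t)b + off;
}
```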


Tradeoffs with Inherent Communication

Partitioning grid solver: blocks versus rows
• Blocks still have a spatial locality problem on remote data
• Rows can perform better despite worse inherent c-to-c ratio
[Figure: with a block partition of a row-major grid, non-local accesses at the row-oriented boundary have good spatial locality, while non-local accesses at the column-oriented boundary have poor spatial locality.]
• Result depends on n and p

Example Performance Impact

Performance measured on an SGI Origin2000
[Figure: speedup versus number of processors (up to 31) for Ocean (514 x 514 grids) and for the solver kernel (12K x 12K grid), comparing 4D, 2D and Rows partitions, each with and without the "rr" placement variant.]

Structuring Communication

Given amount of communication, goal is to reduce cost

Cost of communication as seen by process:

  C = f * ( o + l + (n_c / m) / B + t_c - overlap )

  – f = frequency of messages
  – o = overhead per message (at both ends)
  – l = network delay per message
  – n_c = total data sent
  – m = number of messages
  – B = bandwidth along path (determined by network, NI, assist)
  – t_c = cost induced by contention per message
  – overlap = amount of latency hidden by overlap with comp. or comm.

• Portion in parentheses is cost of a message (as seen by processor)
• That portion, ignoring overlap, is latency of a message
• Goal: reduce terms in latency and increase overlap

Reducing Overhead

Can reduce # of messages m or overhead per message o

o is usually determined by hardware or system software
• Program should try to reduce m by coalescing messages
• More control when communication is explicit

Coalescing data into larger messages:
• Easy for regular, coarse-grained communication
• Can be difficult for irregular, naturally fine-grained communication
  – may require changes to algorithm and extra work
    » coalescing data and determining what and to whom to send
  – will discuss more in implications for programming models later
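A small sketch that simply encodes the cost model above (my own reading: the frequency f is taken to equal the message count m, which the slide lists separately, so treat this as one plausible interpretation; parameter values are machine-specific and supplied by the caller). With that reading, the cost separates into m * (o + l + t_c - overlap) plus the pure volume term n_c / B, which makes the coalescing argument concrete: sending the same total data in fewer messages removes per-message cost but leaves the bandwidth term unchanged.

```c
/* C = f * (o + l + (n_c / m) / B + t_c - overlap), with f taken as m.
 * Units are whatever o, l, t_c and 1/B are expressed in. */
double comm_cost(double m, double n_c, double o, double l,
                 double B, double t_c, double overlap)
{
    double per_message = o + l + (n_c / m) / B + t_c - overlap;
    return m * per_message;
}
```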


Reducing Network Delay

Network delay component = f * h * t_h
  – h = number of hops traversed in network
  – t_h = link + switch latency per hop

Reducing f: communicate less, or make messages larger

Reducing h:
• Map communication patterns to network topology
  – e.g. nearest-neighbor on mesh and ring; all-to-all
• How important is this?
  – used to be major focus of parallel algorithms
  – depends on no. of processors, how t_h compares with other components
  – less important on modern machines
    » overheads, processor count, multiprogramming

Reducing Contention

All resources have nonzero occupancy
• Memory, communication controller, network link, etc.
• Can only handle so many transactions per unit time

Effects of contention:
• Increased end-to-end cost for messages
• Reduced available bandwidth for individual messages
• Causes imbalances across processors

Particularly insidious performance problem
• Easy to ignore when programming
• Slow down messages that don't even need that resource
  – by causing other dependent resources to also congest
• Effect can be devastating: Don't flood a resource!


Types of Contention

Network contention and end-point contention (hot-spots)

Location and module hot-spots

Location: e.g. accumulating into global variable, barrier
• solution: tree-structured communication
[Figure: flat accumulation into one location (contention) versus tree-structured accumulation (little contention).]
• In general, reduce burstiness; may conflict with making messages larger

Module: all-to-all personalized comm. in matrix transpose
• solution: stagger access by different processors to same node temporally

Overlapping Communication

Cannot afford to stall for high latencies
• even on uniprocessors!

Overlap with computation or communication to hide latency

Requires extra concurrency (slackness), higher bandwidth

Techniques:
• Prefetching
• Block data transfer
• Proceeding past communication
• Multithreading
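A minimal sketch of the tree-structured fix for the location hot-spot above (my own illustration; `barrier_wait()` stands in for whatever barrier primitive the program already has): partial sums are combined pairwise up a tree in log2(p) rounds, so no single memory location is hammered by all p processes at once.

```c
extern void barrier_wait(void);   /* hypothetical global barrier */

/* Tree-structured reduction: partial[pid] holds process pid's local
 * sum; after the loop, partial[0] holds the grand total.  Each round
 * halves the number of writers, avoiding a single hot location. */
void tree_reduce(double *partial, int pid, int p)
{
    for (int stride = 1; stride < p; stride *= 2) {
        barrier_wait();                       /* previous round finished */
        if (pid % (2 * stride) == 0 && pid + stride < p)
            partial[pid] += partial[pid + stride];
    }
}
```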


Summary of Tradeoffs

Different goals often have conflicting demands
• Load balance
  – fine-grain tasks
  – random or dynamic assignment
• Communication
  – usually coarse grain tasks
  – decompose to obtain locality: not random/dynamic
• Extra work
  – coarse grain tasks
  – simple assignment
• Communication cost:
  – big transfers: amortize overhead and latency
  – small transfers: reduce contention

Outline (recap): 1. partitioning for performance; 2. relationship of communication, data locality and architecture; 3. orchestration for performance; 4. components of execution time as seen by processor — what workload looks like to architecture, and relate to software issues.


Processor-Centric Perspective

[Figure: per-processor breakdown of execution time into busy-useful, busy-overhead, data-local, data-remote and synchronization components, i.e. the components that appear in the summary equation below.]

Relationship between Perspectives

Parallelization step(s) → performance issue → processor time component:
• Decomposition/assignment/orchestration → load imbalance and synchronization → synch wait
• Decomposition/assignment → extra work → busy-overhead
• Decomposition/assignment → inherent communication volume → data-remote
• Orchestration → artifactual communication and data locality → data-local
• Orchestration/mapping → communication structure


Summary

  Speedup_prob(p) = (Busy(1) + Data(1)) / (Busy_useful(p) + Data_local(p) + Synch(p) + Data_remote(p) + Busy_overhead(p))

• Goal is to reduce denominator components
• Both programmer and system have role to play
• Architecture cannot do much about load imbalance or too much communication
• But it can:
  – reduce incentive for creating ill-behaved programs (efficient naming, communication and synchronization)
  – reduce artifactual communication
  – provide efficient naming for flexible assignment
  – allow effective overlapping of communication

