Parallel Computer Systems

Randal E. Bryant CS 347 Lecture 27 April 29, 1997

Topics
• Parallel Applications
• Shared vs. Distributed Model
• Concurrency Models
• Single Systems
• Network-Based Systems
• Lessons Learned

Motivation

Limits to Sequential Processing
• Cannot push clock rates beyond technological limits
• Instruction-level parallelism gets diminishing returns
  – 4-way superscalar machines only get an average of 1.5 instructions / cycle
  – Branch prediction, speculative execution, etc. yield diminishing returns

Applications Have an Insatiable Appetite for Computing
• Modeling of physical systems
• Virtual reality, real-time graphics, video
• Database search, data mining

Many Applications Can Exploit Parallelism
• Work on multiple parts of the problem simultaneously
• Synchronize to coordinate efforts
• Communicate to share information

Historical Perspective: The Graveyard

• Lots of venture capital and DoD research $$'s
• Too many to enumerate, but some examples …

ILLIAC IV
• Early research machine with overambitious technology

Thinking Machines
• CM-2: 64K single-bit processors with a single controller (SIMD)
• CM-5: Tightly coupled network of SPARC processors

Encore Computer
• Machine using National Semiconductor microprocessors

Kendall Square Research KSR-1
• Shared memory machine using a proprietary processor

NCUBE / Hypercube / Intel Paragon
• Connected network of small processors
• Survive only in niche markets

Historical Perspective: Successes

Shared Memory Multiprocessors (SMP's)
• E.g., SGI Challenge, Sun servers, DEC Alpha servers
• Good for handling "server" applications
  – Number of loosely coupled (or independent) computing tasks
  – E.g., multiuser system, Web server
  – Share resources such as primary memory

Vector Machines (also Fujitsu, NEC)
• Single instruction can specify an operation over an entire vector
  – E.g., c[i] = a[i] + b[i] * d, for 0 ≤ i < 128
• Effective for many scientific computing applications

Cray T3D, T3E
• DEC Alphas connected by a high-performance network
• Less versatile than a vector machine, but better cost/performance

Application Classes

Application classes span a spectrum from loosely coupled to tightly coupled:

Compute Servers (loosely coupled)
• Number of independent users using a single computing facility
• Only synchronization is to mediate use of shared resources
  – Memory, disk sectors, file system

Database Servers
• Users performing transactions on a shared database
  – E.g., bank records, flight reservations
• Synchronization required to guarantee consistency
  – Don't want two people to get the last seat on a flight

True Parallel Applications (tightly coupled)
• Computationally intensive task exploiting multiple computing agents
• Synchronization required to coordinate efforts

Parallel Application Example

Finite Element Model
• Discrete representation of a continuous system
• Spatially: partition into mesh elements
• Temporally: update state every dT time units

Example Computation

    For time from 0 to maxT
      For each mesh element
        Update mesh value

Locality
• Update depends only on values of adjacent elements
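A minimal C sketch of this computation, assuming a 2-D mesh and a simple nearest-neighbor averaging update (the actual update rule would depend on the physics being modeled):

    #include <string.h>

    #define N     128      /* mesh is N x N (assumed size) */
    #define MAX_T 1000     /* number of time steps (assumed) */

    static double mesh[N][N], next[N][N];

    void simulate(void) {
        for (int t = 0; t < MAX_T; t++) {
            /* Update each interior element from its four neighbors.
               Locality: no other elements are touched. */
            for (int i = 1; i < N - 1; i++)
                for (int j = 1; j < N - 1; j++)
                    next[i][j] = 0.25 * (mesh[i-1][j] + mesh[i+1][j] +
                                         mesh[i][j-1] + mesh[i][j+1]);
            memcpy(mesh, next, sizeof(mesh));
        }
    }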

Parallel Mapping

Spatial Partitioning
• Divide the mesh into regions (e.g., a 3 × 3 grid of regions P1 … P9)
• Allocate different regions to different processors

Computation for Each Processor

    For time from 0 to maxT
      Get boundary values from neighbors
      For each mesh element
        Update mesh value
      Send boundary values to neighbors
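A C sketch of the per-processor loop, using a 1-D strip partition for simplicity (the slide's picture shows a 3 × 3 block partition). The functions recv_boundary and send_boundary are hypothetical stand-ins for whatever message or shared-memory mechanism the machine provides; each strip carries one "ghost" row on each side for neighbors' boundary values:

    #include <string.h>

    #define N     128               /* full mesh is N x N           */
    #define NPROC 8                 /* assumed number of processors */
    #define ROWS  (N / NPROC + 2)   /* strip rows + 2 ghost rows    */

    static double region[ROWS][N], next[ROWS][N];

    /* Hypothetical stand-ins for the machine's communication mechanism. */
    static void recv_boundary(double r[ROWS][N]) { (void)r; /* fill ghost rows   */ }
    static void send_boundary(double r[ROWS][N]) { (void)r; /* export edge rows  */ }

    void worker(int max_t) {
        for (int t = 0; t < max_t; t++) {
            recv_boundary(region);                /* get boundary values from neighbors */
            for (int i = 1; i < ROWS - 1; i++)    /* update this strip's elements */
                for (int j = 1; j < N - 1; j++)
                    next[i][j] = 0.25 * (region[i-1][j] + region[i+1][j] +
                                         region[i][j-1] + region[i][j+1]);
            memcpy(region, next, sizeof(region));
            send_boundary(region);                /* send boundary values to neighbors */
        }
    }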

Complicating Factors

Communication Overhead
• N × N mesh, M processors
• Elements / processor = N² / M
  – Determines how much work is required per iteration
• Boundary elements / processor ~ N / sqrt(M)
  – Determines how much communication is required per iteration
• Communication vs. computation load ~ sqrt(M) / N
  – Becomes communication limited as the number of processors increases

Nonuniformities
• Irregular mesh, varying computation / mesh element
• Makes partitioning & load balancing difficult

Synchronization
• Keeping all processors on the same iteration
• Determining global properties such as convergence and time step
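A quick worked example of the ratio: with N = 1024 and M = 64, each processor updates 1024² / 64 = 16,384 elements per iteration but exchanges only about 1024 / 8 = 128 boundary values, a communication-to-computation ratio of roughly 0.008. Quadrupling M to 256 doubles the ratio to about 0.016, and it keeps growing as sqrt(M) until communication dominates.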

Shared Memory Model

[Diagram: processors P … P all accessing a single global memory space]

Conceptual View
• All processors access a single memory
  – One physical address space
  – Use virtual address mapping to partition it among processes
• If one processor updates a location, then all will see it
  – Memory consistency

Bus-Based Realization

[Diagram: processors P … P, each with a cache C, sharing one memory over a common memory bus]

Memory Bus
• Handles all accesses to shared memory

Caches
• One per processor
• Allow local copies of heavily used data
• Must avoid stale data

Considerations
• Small step up from a single-processor system
  – Support added to many microprocessor chips
• Does not scale well
  – Bus becomes a bottleneck
  – Limited to ~16 processors

Network-Based Realization

[Diagram: processors P … P, each with a cache C and a local memory M, connected by an interconnection network]

Memory
• Partitioned among processors

Network
• Transmits messages to perform accesses to remote memories

Caches
• Local copies of heavily used data
• Must avoid stale data
  – Harder than with a bus-based system

Considerations
• Scales well
  – 1024-processor systems have been built
• Nonuniform memory access
  – 100's of cycles for a remote access

Memory Consistency

Model
• Initially: x = 0, y = 0
• Independent processes with access to shared variables
• No assumptions about relative timing
  – Which starts first
  – Relative rates

    Process A              Process B
    a1: x = 1              b1: y = 1
    a2: if (y == 0) …      b2: if (x == 0) …

Sequential Consistency
• Each process executes its steps in program order
• Overall effect should match that of some interleaving of the individual process steps

Sequential Consistency Example

    Process A              Process B
    a1: x = 1              b1: y = 1
    a2: if (y == 0) …      b2: if (x == 0) …

Possible Interleavings
(each column is one interleaving, read top to bottom; "= T" / "= F" is the outcome of the test)

    a1       a1       a1       b1       b1       b1
    a2 = T   b1       b1       a1       a1       b2 = T
    b1       a2 = F   b2 = F   a2 = F   b2 = F   a1
    b2 = F   b2 = F   a2 = F   b2 = F   a2 = F   a2 = F

Sequential Inconsistency

• Cannot have both tests yield T (i.e., a2 = T and b2 = T)
  – b2 = T means b2 read x = 0, so b2 must precede a1
  – a2 = T means a2 read y = 0, so a2 must precede b1
  – Cannot satisfy these plus the program order constraints (a1 before a2, b1 before b2)

Real-Life Scenario
[Diagram: processors Pa and Pb connected by a network to memories holding x and y]
• Process A
  – Remote write x
  – Local read y
• Process B
  – Remote write y
  – Local read x
• Could have both reads yield 0
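The example is easy to run as a C / Pthreads sketch (not from the lecture; the thread API and trial count are illustrative). Under sequential consistency the (T, T) outcome can never occur; machines that buffer stores may produce it:

    #include <pthread.h>
    #include <stdio.h>

    static volatile int x, y;      /* shared variables, initially 0 */
    static volatile int ra, rb;    /* values read by a2 and b2 */

    static void *proc_a(void *arg) {
        (void)arg;
        x = 1;                     /* a1 */
        ra = y;                    /* a2: test yields T if y == 0 */
        return NULL;
    }

    static void *proc_b(void *arg) {
        (void)arg;
        y = 1;                     /* b1 */
        rb = x;                    /* b2: test yields T if x == 0 */
        return NULL;
    }

    int main(void) {
        int counts[2][2] = {{0}};  /* counts[a2 is T][b2 is T] */
        for (int trial = 0; trial < 100000; trial++) {
            pthread_t ta, tb;
            x = y = 0;
            pthread_create(&ta, NULL, proc_a, NULL);
            pthread_create(&tb, NULL, proc_b, NULL);
            pthread_join(ta, NULL);
            pthread_join(tb, NULL);
            counts[ra == 0][rb == 0]++;
        }
        printf("a2=F, b2=F: %d\n", counts[0][0]);
        printf("a2=F, b2=T: %d\n", counts[0][1]);
        printf("a2=T, b2=F: %d\n", counts[1][0]);
        printf("a2=T, b2=T: %d  (impossible under sequential consistency)\n",
               counts[1][1]);
        return 0;
    }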

Snoopy Bus-Based Consistency

[Diagram: processors P … P with caches C on a shared memory bus; each cache is bus master for its own processor's requests and snoops on other masters' transactions]

Caches
• Write-back
  – Minimizes bus traffic
• Monitor ("snoop on") bus transactions when not bus master

Cached Blocks
• A clean block can have multiple, read-only copies
• To write, must obtain an exclusive copy
  – Marked as dirty

Getting a Copy
• Make a bus request
• Memory replies if the block is clean
• Owning cache replies if it is dirty

Implementation Details

Block Status
• Maintained by each cache for each of its blocks
• Invalid
  – Entry not valid
• Clean
  – Valid, read-only copy
  – Matches copy in main memory
• Dirty
  – Exclusive, writeable copy
  – Must write back to evict

Bus Operations
• Read
  – Get read-only copy
• Invalidate
  – Invalidate all other copies
  – Make local copy writeable
• Write
  – Write back a dirty block
  – To make room for a different block
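A sketch of the per-block state a cache would keep, in C (names are illustrative, not from any particular machine):

    #include <stdint.h>

    /* Status of one cache block under the snoopy protocol. */
    enum block_status {
        INVALID,   /* entry not valid */
        CLEAN,     /* valid, read-only copy; matches main memory */
        DIRTY      /* exclusive, writeable copy; must write back to evict */
    };

    struct cache_block {
        enum block_status status;
        uint32_t tag;              /* which memory block this entry holds */
        uint8_t  data[64];         /* assumed 64-byte block */
    };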

Bus Master Actions

Notation: i = requested block, t = currently cached block; "=" / "≠" indicate whether the request matches the cached tag. Each entry gives the bus operation issued, the processor action, and the block update.

Current block Invalid
• P Read i:   bus Read i / stall / block becomes Clean : i
• P Write i:  bus Read i / stall / block becomes Clean : i, then the write retries

Current block Clean : t
• P Read  =:  no bus operation / read proceeds
• P Read  ≠:  bus Read i / stall / block becomes Clean : i
• P Write =:  bus Inval. i / write proceeds / block becomes Dirty : i
• P Write ≠:  bus Read i / stall / block becomes Clean : i, then the write retries

Current block Dirty : t
• P Read  =:  no bus operation / read proceeds
• P Write =:  no bus operation / write proceeds
• P Read  ≠:  bus Write t (write back) / stall / then proceed as for Invalid
• P Write ≠:  bus Write t (write back) / stall / then proceed as for Invalid
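The same transitions as a C sketch (bus_read, bus_invalidate, and bus_write_back are hypothetical bus primitives; a real controller would also handle the stall/retry timing):

    #include <stdbool.h>
    #include <stdint.h>

    enum block_status { INVALID, CLEAN, DIRTY };

    struct cache_block {
        enum block_status status;
        uint32_t tag;
    };

    /* Hypothetical bus primitives. */
    static void bus_read(uint32_t tag)       { (void)tag; /* Read i: fetch read-only copy */ }
    static void bus_invalidate(uint32_t tag) { (void)tag; /* Inval. i: kill other copies  */ }
    static void bus_write_back(uint32_t tag) { (void)tag; /* Write t: flush dirty block   */ }

    /* One step of the bus-master logic for a processor access to block i.
       Returns true if the access completes, false if the processor must
       stall and retry (the "stall" entries in the table above). */
    bool master_access(struct cache_block *b, uint32_t i, bool is_write) {
        bool hit = (b->status != INVALID) && (b->tag == i);

        if (!hit) {
            if (b->status == DIRTY)
                bus_write_back(b->tag);   /* evict the current block first */
            bus_read(i);                  /* obtain a read-only copy */
            b->status = CLEAN;
            b->tag = i;
            return false;                 /* stall; retry the access */
        }
        if (is_write && b->status == CLEAN) {
            bus_invalidate(i);            /* obtain exclusive ownership */
            b->status = DIRTY;
        }
        return true;                      /* read or write proceeds locally */
    }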

Bus Snoop Actions

Notation: i = block named in the observed bus transaction, t = currently cached block.

Current block Invalid, or tag ≠ i
• Any bus operation: no action

Current block Clean : t, tag = i
• B Read i:   no action (memory supplies the block)
• B Inval. i: block becomes Invalid

Current block Dirty : t, tag = i
• B Read i:   Data: this cache supplies the block (memory's copy is stale)
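The corresponding snoop logic, again as a hedged C sketch (supply_data is a hypothetical helper that puts the block on the bus in place of memory; dropping a dirty block back to read-only after supplying it is an assumption, not stated on the slide):

    #include <stdint.h>

    enum block_status { INVALID, CLEAN, DIRTY };
    enum bus_op { BUS_READ, BUS_INVALIDATE, BUS_WRITE };

    struct cache_block {
        enum block_status status;
        uint32_t tag;
    };

    /* Hypothetical helper: this cache answers the request instead of memory. */
    static void supply_data(struct cache_block *b) { (void)b; }

    /* Called for every bus transaction on block i that this cache observes
       while it is not the bus master. */
    void snoop(struct cache_block *b, enum bus_op op, uint32_t i) {
        if (b->status == INVALID || b->tag != i)
            return;                        /* not our block: no action */

        switch (op) {
        case BUS_READ:
            if (b->status == DIRTY) {
                supply_data(b);            /* memory's copy is stale; we reply */
                b->status = CLEAN;         /* assumed: fall back to read-only  */
            }
            break;                         /* if clean, memory replies */
        case BUS_INVALIDATE:
            b->status = INVALID;           /* another cache wants exclusivity */
            break;
        case BUS_WRITE:
            break;                         /* another cache's write-back */
        }
    }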

Example 1

    Process A              Process B
    a1: x = 1              b1: y = 1
    a2: if (y == 0) …      b2: if (x == 0) …

Bus Transactions

A: Read x
A: Invalidate x      (a1: x = 1)
B: Read y
A: Read y            (a2 reads y = 0: test yields T)
B: Invalidate y      (b1: y = 1)
B: Read x            (b2 reads x = 1: test yields F)

Example 2

    Process A              Process B
    a1: x = 1              b1: y = 1
    a2: if (y == 0) …      b2: if (x == 0) …

Bus Transactions

A: Read x
A: Invalidate x      (a1: x = 1)
B: Read y
B: Invalidate y      (b1: y = 1)
A: Read y            (a2 reads y = 1: test yields F)
B: Read x            (b2 reads x = 1: test yields F)

Livelock Example

    Process A            Process B
    a1: y = …            b:  while (…) {
                         b1:   t = y
                         b2:   y = t + 1
                             }

Bus Transactions

A: Read y            (A fetches y, intending to write it)
B: Read y
B: Invalidate y      (b1: t = y; b2: y = t + 1)
A: Read y            (A's copy was invalidated; it must fetch again)
B: Read y
B: Invalidate y      (b1: t = y; b2: y = t + 1)
A: Read y
B: Read y
…

• A never gets the chance to write: B's invalidation keeps destroying A's copy between A's read and its own invalidate request

Single Bus Machine Example

SGI Challenge Series
• Up to 36 MIPS R4400 processors
• Up to 16 GB main memory

Bus
• 256-bit-wide data path
• 40-bit-wide addresses
• Data transferred at 1.22 GB / second
• Split transactions
  – Read request & read response are separate bus transactions
  – Can use the bus for other things while a read is outstanding
  – Complicates synchronization

Performance
• 164 processor cycles to handle a remote read
• Assuming no bus contention

Network-Based Cache Coherency

Home-Based Protocol
• Each block has a "home"
  – The memory controller that tracks its status
• The home maintains
  – Block status
  – Identity of copy holders
    » 1-bit flag / processor

Example directory state at Memory Controller 4:

    Block   Status     Copy holders
    24      shared     0 1 0 1 0 1 0 1
    25      remote     0 1 0 0 0 0 0 0
    26      uncached   0 0 0 0 0 0 0 0

Block Status Values
• Shared
  – 1 or more remote, read-only copies
• Remote
  – Writeable copy in a remote cache
• Uncached
  – No remote copies
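A sketch of a directory entry in C (field names illustrative; assumes at most 8 processors with one presence bit each, as in the table above):

    #include <stdint.h>

    #define NUM_BLOCKS 4096     /* assumed directory size */

    enum dir_status {
        SHARED,     /* 1 or more remote, read-only copies */
        REMOTE,     /* writeable copy in exactly one remote cache */
        UNCACHED    /* no remote copies */
    };

    struct dir_entry {
        enum dir_status status;
        uint8_t copy_holders;   /* bit p set => processor p holds a copy */
    };

    /* One entry per memory block managed by this home controller. */
    struct dir_entry directory[NUM_BLOCKS];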

Network-Based Consistency

To Obtain a Copy of a Block
• Processor sends a message to the block's home
• Home retrieves the remote copy if the status is remote
• Home sends a copy to the requester
• If an exclusive copy was requested, home sends invalidate messages to all other copy holders

Tricky Details
• Lots of possible sources of deadlock & errors
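A sketch of the home controller's request handling in C. send_block, fetch_and_invalidate, and send_invalidate are hypothetical message primitives, and the tricky details just mentioned (races, retries, deadlock avoidance) are deliberately omitted:

    #include <stdint.h>

    #define NUM_BLOCKS 4096      /* assumed directory size */
    #define NUM_PROCS  8

    enum dir_status { SHARED, REMOTE, UNCACHED };

    struct dir_entry {
        enum dir_status status;
        uint8_t copy_holders;    /* bit p set => processor p holds a copy */
    };

    static struct dir_entry directory[NUM_BLOCKS];

    /* Hypothetical message primitives. */
    static void send_block(int p, uint32_t b)           { (void)p; (void)b; }
    static void fetch_and_invalidate(int p, uint32_t b) { (void)p; (void)b; }
    static void send_invalidate(int p, uint32_t b)      { (void)p; (void)b; }

    /* Handle a request from processor req for block b. */
    void home_request(int req, uint32_t b, int exclusive) {
        struct dir_entry *e = &directory[b];

        /* If a remote cache holds a writeable copy, retrieve it first. */
        if (e->status == REMOTE) {
            for (int p = 0; p < NUM_PROCS; p++)
                if (e->copy_holders & (1 << p))
                    fetch_and_invalidate(p, b);
            e->copy_holders = 0;
        }

        if (exclusive) {
            /* Invalidate all other copies before granting ownership. */
            for (int p = 0; p < NUM_PROCS; p++)
                if (p != req && (e->copy_holders & (1 << p)))
                    send_invalidate(p, b);
            e->copy_holders = (uint8_t)(1 << req);
            e->status = REMOTE;
        } else {
            e->copy_holders |= (uint8_t)(1 << req);
            e->status = SHARED;
        }
        send_block(req, b);      /* send the copy to the requester */
    }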

Lesson #1: Use Commodity Microprocessors

Build on Their Economy of Scale
• Thousands of engineering person-years
• Efficient manufacturing

Failures
• KSR: 15 MHz processor, designed by a 3-person team
• Tera: > 2 years late shipping its first system

Successes
• Cray T3D, T3E use the Alpha, the fastest processor made
• SMP's based on MIPS, SPARC, PentiumPro, …
  – Snoopy protocol hardware integrated into the chip

Lesson #2: Provide Shared Memory Interface

Natural Programming Model for Many Applications
• Shared database
  – Can access different parts of the database by memory references
• Task scheduling
  – Processors grab tasks from a shared queue
  – Get code & data via memory references

Can Simulate Other Models Effectively
• Message passing
  – Store data in a buffer allocated in distant memory (see the sketch below)

Exploits Memory Management Technology
• OS involved in setting up & allocating resources
• Once allocated, communication proceeds without OS intervention
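A minimal sketch of message passing layered on shared memory, assuming a sequentially consistent machine (on weaker memory models the flag update would need explicit ordering). The sender writes into a buffer that lives in the receiver's memory, then sets a flag the receiver spins on; names are illustrative:

    #include <string.h>

    #define MSG_SIZE 64

    /* One-slot mailbox allocated in the receiver's portion of shared memory. */
    struct mailbox {
        volatile int full;          /* 0 = empty, 1 = message present */
        char buf[MSG_SIZE];
    };

    void send_msg(struct mailbox *mb, const char *msg) {
        while (mb->full)            /* wait for previous message to drain */
            ;
        memcpy(mb->buf, msg, MSG_SIZE);  /* write into distant memory */
        mb->full = 1;               /* publish: receiver may now read */
    }

    void recv_msg(struct mailbox *mb, char *out) {
        while (!mb->full)           /* spin on local memory */
            ;
        memcpy(out, mb->buf, MSG_SIZE);
        mb->full = 0;               /* free the slot for the next message */
    }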

Lesson #3: Watch Out for Workstations

Workstations Good for High Performance Computing
• Easy to program
• Low cost per compute cycle
• Low unit cost allows frequent upgrades

Parallel Processors at a Disadvantage
• Takes work to make an application run well on a parallel machine
• Parallel machines are big & expensive
• Higher unit costs & longer development times
  – Hard to upgrade processors
  – Even harder to upgrade the network or bus

Viable Strategies
• Minimal extension of the workstation
• Stick with applications requiring performance
