Parallel Computer Systems

Randal E. Bryant CS 347 Lecture 27 April 29, 1997

Topics
• Parallel Applications
• Shared vs. Distributed Model
• Concurrency Models
• Single Systems
• Network-Based Systems
• Lessons Learned

Motivation

Limits to Sequential Processing
• Cannot push clock rates beyond technological limits
• Instruction-level parallelism gets diminishing returns
  – 4-way superscalar machines only get an average of 1.5 instructions / cycle
  – Branch prediction, speculative execution, etc. yield diminishing returns

Applications Have an Insatiable Appetite for Computing
• Modeling of physical systems
• Virtual reality, real-time graphics, video
• Database search, data mining

Many Applications Can Exploit Parallelism
• Work on multiple parts of the problem simultaneously
• Synchronize to coordinate efforts
• Communicate to share information

Historical Perspective: The Graveyard

• Lots of venture capital and DoD research $$'s
• Too many to enumerate, but some examples …

ILLIAC IV
• Early research machine with overambitious technology

Thinking Machines
• CM-2: 64K single-bit processors with a single controller (SIMD)
• CM-5: Tightly coupled network of SPARC processors

Encore Computer
• Machine using National Semiconductor microprocessors

Kendall Square Research KSR-1
• Shared memory machine using a proprietary processor

NCUBE / Hypercube / Intel Paragon
• Connected network of small processors
• Survive only in niche markets

Historical Perspective: Successes

Shared Memory Multiprocessors (SMP's)
• E.g., SGI Challenge, Sun servers, DEC Alpha servers
• Good for handling "server" applications
  – Number of loosely coupled (or independent) computing tasks
  – E.g., multiuser system, Web server
  – Share resources such as primary memory

Vector Machines (also Fujitsu, NEC)
• Single instruction can specify an operation over an entire vector
  – E.g., c[i] = a[i] + b[i] * d, for 0 ≤ i < 128
• Effective for many scientific computing applications

Cray T3D, T3E
• DEC Alphas connected by a high-performance network
• Less versatile than a vector machine, but better cost/performance

Application Classes

Application classes span a spectrum from loosely coupled to tightly coupled:

Compute Servers (loosely coupled)
• Number of independent users using a single computing facility
• Only synchronization is to mediate use of shared resources
  – Memory, disk sectors, file system

Database Servers
• Users performing transactions on a shared database
  – E.g., bank records, flight reservations
• Synchronization required to guarantee consistency
  – Don't want two people to get the last seat on a flight

True Parallel Applications (tightly coupled)
• Computationally intensive task exploiting multiple computing agents
• Synchronization required to coordinate efforts

Parallel Application Example

Finite Element Model
• Discrete representation of a continuous system
• Spatially: partition into mesh elements
• Temporally: update state every dT time units

Example Computation

    For time from 0 to maxT
      For each mesh element
        Update mesh value

Locality
• Update depends only on values of adjacent elements
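A minimal C sketch of this computation, assuming a 2-D mesh and a simple nearest-neighbor averaging update (the actual update rule would depend on the physics being modeled):

    #include <string.h>

    #define N     128      /* mesh is N x N (assumed size) */
    #define MAX_T 1000     /* number of time steps (assumed) */

    static double mesh[N][N], next[N][N];

    void simulate(void) {
        for (int t = 0; t < MAX_T; t++) {
            /* Update each interior element from its four neighbors.
               Locality: no other elements are touched. */
            for (int i = 1; i < N - 1; i++)
                for (int j = 1; j < N - 1; j++)
                    next[i][j] = 0.25 * (mesh[i-1][j] + mesh[i+1][j] +
                                         mesh[i][j-1] + mesh[i][j+1]);
            memcpy(mesh, next, sizeof(mesh));
        }
    }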

Parallel Mapping

Spatial Partitioning
• Divide the mesh into regions (e.g., a 3 × 3 grid of regions P1 … P9)
• Allocate different regions to different processors

Computation for Each Processor

    For time from 0 to maxT
      Get boundary values from neighbors
      For each mesh element
        Update mesh value
      Send boundary values to neighbors
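A C sketch of the per-processor loop, using a 1-D strip partition for simplicity (the slide's picture shows a 3 × 3 block partition). The functions recv_boundary and send_boundary are hypothetical stand-ins for whatever message or shared-memory mechanism the machine provides; each strip carries one "ghost" row on each side for neighbors' boundary values:

    #include <string.h>

    #define N     128               /* full mesh is N x N           */
    #define NPROC 8                 /* assumed number of processors */
    #define ROWS  (N / NPROC + 2)   /* strip rows + 2 ghost rows    */

    static double region[ROWS][N], next[ROWS][N];

    /* Hypothetical stand-ins for the machine's communication mechanism. */
    static void recv_boundary(double r[ROWS][N]) { (void)r; /* fill ghost rows   */ }
    static void send_boundary(double r[ROWS][N]) { (void)r; /* export edge rows  */ }

    void worker(int max_t) {
        for (int t = 0; t < max_t; t++) {
            recv_boundary(region);                /* get boundary values from neighbors */
            for (int i = 1; i < ROWS - 1; i++)    /* update this strip's elements */
                for (int j = 1; j < N - 1; j++)
                    next[i][j] = 0.25 * (region[i-1][j] + region[i+1][j] +
                                         region[i][j-1] + region[i][j+1]);
            memcpy(region, next, sizeof(region));
            send_boundary(region);                /* send boundary values to neighbors */
        }
    }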

Complicating Factors

Communication Overhead
• N × N mesh, M processors
• Elements / processor = N² / M
  – Determines how much work is required per iteration
• Boundary elements / processor ~ N / sqrt(M)
  – Determines how much communication is required per iteration
• Communication vs. computation load ~ sqrt(M) / N
  – Becomes communication limited as the number of processors increases

Nonuniformities
• Irregular mesh, varying computation / mesh element
• Makes partitioning & load balancing difficult

Synchronization
• Keeping all processors on the same iteration
• Determining global properties such as convergence and time step
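A quick worked example of the ratio: with N = 1024 and M = 64, each processor updates 1024² / 64 = 16,384 elements per iteration but exchanges only about 1024 / 8 = 128 boundary values, a communication-to-computation ratio of roughly 0.008. Quadrupling M to 256 doubles the ratio to about 0.016, and it keeps growing as sqrt(M) until communication dominates.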

Shared Memory Model

[Diagram: processors P … P all accessing a single global memory space]

Conceptual View
• All processors access a single memory
  – One physical address space
  – Use virtual address mapping to partition it among processes
• If one processor updates a location, then all will see it
  – Memory consistency

Bus-Based Realization

[Diagram: processors P … P, each with a cache C, sharing one memory over a common memory bus]

Memory Bus
• Handles all accesses to shared memory

Caches
• One per processor
• Allow local copies of heavily used data
• Must avoid stale data

Considerations
• Small step up from a single-processor system
  – Support added to many microprocessor chips
• Does not scale well
  – Bus becomes a bottleneck
  – Limited to ~16 processors

Network-Based Realization

[Diagram: processors P … P, each with a cache C and a local memory M, connected by an interconnection network]

Memory
• Partitioned among processors

Network
• Transmits messages to perform accesses to remote memories

Caches
• Local copies of heavily used data
• Must avoid stale data
  – Harder than with a bus-based system

Considerations
• Scales well
  – 1024-processor systems have been built
• Nonuniform memory access
  – 100's of cycles for a remote access

Memory Consistency

Model
• Initially: x = 0, y = 0
• Independent processes with access to shared variables
• No assumptions about relative timing
  – Which starts first
  – Relative rates

    Process A              Process B
    a1: x = 1              b1: y = 1
    a2: if (y == 0) …      b2: if (x == 0) …

Sequential Consistency
• Each process executes its steps in program order
• Overall effect should match that of some interleaving of the individual process steps

Sequential Consistency Example

    Process A              Process B
    a1: x = 1              b1: y = 1
    a2: if (y == 0) …      b2: if (x == 0) …

Possible Interleavings
(each column is one interleaving, read top to bottom; "= T" / "= F" is the outcome of the test)

    a1       a1       a1       b1       b1       b1
    a2 = T   b1       b1       a1       a1       b2 = T
    b1       a2 = F   b2 = F   a2 = F   b2 = F   a1
    b2 = F   b2 = F   a2 = F   b2 = F   a2 = F   a2 = F

Sequential Inconsistency

• Cannot have both tests yield T (i.e., a2 = T and b2 = T)
  – b2 = T means b2 read x = 0, so b2 must precede a1
  – a2 = T means a2 read y = 0, so a2 must precede b1
  – Cannot satisfy these plus the program order constraints (a1 before a2, b1 before b2)

Real-Life Scenario
[Diagram: processors Pa and Pb connected by a network to memories holding x and y]
• Process A
  – Remote write x
  – Local read y
• Process B
  – Remote write y
  – Local read x
• Could have both reads yield 0
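The example is easy to run as a C / Pthreads sketch (not from the lecture; the thread API and trial count are illustrative). Under sequential consistency the (T, T) outcome can never occur; machines that buffer stores may produce it:

    #include <pthread.h>
    #include <stdio.h>

    static volatile int x, y;      /* shared variables, initially 0 */
    static volatile int ra, rb;    /* values read by a2 and b2 */

    static void *proc_a(void *arg) {
        (void)arg;
        x = 1;                     /* a1 */
        ra = y;                    /* a2: test yields T if y == 0 */
        return NULL;
    }

    static void *proc_b(void *arg) {
        (void)arg;
        y = 1;                     /* b1 */
        rb = x;                    /* b2: test yields T if x == 0 */
        return NULL;
    }

    int main(void) {
        int counts[2][2] = {{0}};  /* counts[a2 is T][b2 is T] */
        for (int trial = 0; trial < 100000; trial++) {
            pthread_t ta, tb;
            x = y = 0;
            pthread_create(&ta, NULL, proc_a, NULL);
            pthread_create(&tb, NULL, proc_b, NULL);
            pthread_join(ta, NULL);
            pthread_join(tb, NULL);
            counts[ra == 0][rb == 0]++;
        }
        printf("a2=F, b2=F: %d\n", counts[0][0]);
        printf("a2=F, b2=T: %d\n", counts[0][1]);
        printf("a2=T, b2=F: %d\n", counts[1][0]);
        printf("a2=T, b2=T: %d  (impossible under sequential consistency)\n",
               counts[1][1]);
        return 0;
    }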

Snoopy Bus-Based Consistency

[Diagram: processors P … P with caches C on a shared memory bus; each cache is bus master for its own processor's requests and snoops on other masters' transactions]

Caches
• Write-back
  – Minimizes bus traffic
• Monitor ("snoop on") bus transactions when not bus master

Cached Blocks
• A clean block can have multiple, read-only copies
• To write, must obtain an exclusive copy
  – Marked as dirty

Getting a Copy
• Make a bus request
• Memory replies if the block is clean
• Owning cache replies if it is dirty

Implementation Details

Block Status
• Maintained by each cache for each of its blocks
• Invalid
  – Entry not valid
• Clean
  – Valid, read-only copy
  – Matches copy in main memory
• Dirty
  – Exclusive, writeable copy
  – Must write back to evict

Bus Operations
• Read
  – Get read-only copy
• Invalidate
  – Invalidate all other copies
  – Make local copy writeable
• Write
  – Write back a dirty block
  – To make room for a different block
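A sketch of the per-block state a cache would keep, in C (names are illustrative, not from any particular machine):

    #include <stdint.h>

    /* Status of one cache block under the snoopy protocol. */
    enum block_status {
        INVALID,   /* entry not valid */
        CLEAN,     /* valid, read-only copy; matches main memory */
        DIRTY      /* exclusive, writeable copy; must write back to evict */
    };

    struct cache_block {
        enum block_status status;
        uint32_t tag;              /* which memory block this entry holds */
        uint8_t  data[64];         /* assumed 64-byte block */
    };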

Bus Master Actions

Notation: i = requested block, t = currently cached block; "=" / "≠" indicate whether the request matches the cached tag. Each entry gives the bus operation issued, the processor action, and the block update.

Current block Invalid
• P Read i:   bus Read i / stall / block becomes Clean : i
• P Write i:  bus Read i / stall / block becomes Clean : i, then the write retries

Current block Clean : t
• P Read  =:  no bus operation / read proceeds
• P Read  ≠:  bus Read i / stall / block becomes Clean : i
• P Write =:  bus Inval. i / write proceeds / block becomes Dirty : i
• P Write ≠:  bus Read i / stall / block becomes Clean : i, then the write retries

Current block Dirty : t
• P Read  =:  no bus operation / read proceeds
• P Write =:  no bus operation / write proceeds
• P Read  ≠:  bus Write t (write back) / stall / then proceed as for Invalid
• P Write ≠:  bus Write t (write back) / stall / then proceed as for Invalid
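The same transitions as a C sketch (bus_read, bus_invalidate, and bus_write_back are hypothetical bus primitives; a real controller would also handle the stall/retry timing):

    #include <stdbool.h>
    #include <stdint.h>

    enum block_status { INVALID, CLEAN, DIRTY };

    struct cache_block {
        enum block_status status;
        uint32_t tag;
    };

    /* Hypothetical bus primitives. */
    static void bus_read(uint32_t tag)       { (void)tag; /* Read i: fetch read-only copy */ }
    static void bus_invalidate(uint32_t tag) { (void)tag; /* Inval. i: kill other copies  */ }
    static void bus_write_back(uint32_t tag) { (void)tag; /* Write t: flush dirty block   */ }

    /* One step of the bus-master logic for a processor access to block i.
       Returns true if the access completes, false if the processor must
       stall and retry (the "stall" entries in the table above). */
    bool master_access(struct cache_block *b, uint32_t i, bool is_write) {
        bool hit = (b->status != INVALID) && (b->tag == i);

        if (!hit) {
            if (b->status == DIRTY)
                bus_write_back(b->tag);   /* evict the current block first */
            bus_read(i);                  /* obtain a read-only copy */
            b->status = CLEAN;
            b->tag = i;
            return false;                 /* stall; retry the access */
        }
        if (is_write && b->status == CLEAN) {
            bus_invalidate(i);            /* obtain exclusive ownership */
            b->status = DIRTY;
        }
        return true;                      /* read or write proceeds locally */
    }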

Bus Snoop Actions

Notation: i = block named in the observed bus transaction, t = currently cached block.

Current block Invalid, or tag ≠ i
• Any bus operation: no action

Current block Clean : t, tag = i
• B Read i:   no action (memory supplies the block)
• B Inval. i: block becomes Invalid

Current block Dirty : t, tag = i
• B Read i:   Data: this cache supplies the block (memory's copy is stale)
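The corresponding snoop logic, again as a hedged C sketch (supply_data is a hypothetical helper that puts the block on the bus in place of memory; dropping a dirty block back to read-only after supplying it is an assumption, not stated on the slide):

    #include <stdint.h>

    enum block_status { INVALID, CLEAN, DIRTY };
    enum bus_op { BUS_READ, BUS_INVALIDATE, BUS_WRITE };

    struct cache_block {
        enum block_status status;
        uint32_t tag;
    };

    /* Hypothetical helper: this cache answers the request instead of memory. */
    static void supply_data(struct cache_block *b) { (void)b; }

    /* Called for every bus transaction on block i that this cache observes
       while it is not the bus master. */
    void snoop(struct cache_block *b, enum bus_op op, uint32_t i) {
        if (b->status == INVALID || b->tag != i)
            return;                        /* not our block: no action */

        switch (op) {
        case BUS_READ:
            if (b->status == DIRTY) {
                supply_data(b);            /* memory's copy is stale; we reply */
                b->status = CLEAN;         /* assumed: fall back to read-only  */
            }
            break;                         /* if clean, memory replies */
        case BUS_INVALIDATE:
            b->status = INVALID;           /* another cache wants exclusivity */
            break;
        case BUS_WRITE:
            break;                         /* another cache's write-back */
        }
    }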

Example 1

    Process A              Process B
    a1: x = 1              b1: y = 1
    a2: if (y == 0) …      b2: if (x == 0) …

Bus Transactions

A: Read x
A: Invalidate x      (a1: x = 1)
B: Read y
A: Read y            (a2 reads y = 0: test yields T)
B: Invalidate y      (b1: y = 1)
B: Read x            (b2 reads x = 1: test yields F)

Example 2

    Process A              Process B
    a1: x = 1              b1: y = 1
    a2: if (y == 0) …      b2: if (x == 0) …

Bus Transactions

A: Read x
A: Invalidate x      (a1: x = 1)
B: Read y
B: Invalidate y      (b1: y = 1)
A: Read y            (a2 reads y = 1: test yields F)
B: Read x            (b2 reads x = 1: test yields F)

Livelock Example

    Process A            Process B
    a1: y = …            b:  while (…) {
                         b1:   t = y
                         b2:   y = t + 1
                             }

Bus Transactions

A: Read y            (A fetches y, intending to write it)
B: Read y
B: Invalidate y      (b1: t = y; b2: y = t + 1)
A: Read y            (A's copy was invalidated; it must fetch again)
B: Read y
B: Invalidate y      (b1: t = y; b2: y = t + 1)
A: Read y
B: Read y
…

• A never gets the chance to write: B's invalidation keeps destroying A's copy between A's read and its own invalidate request

Single Bus Machine Example

SGI Challenge Series
• Up to 36 MIPS R4400 processors
• Up to 16 GB main memory

Bus
• 256-bit-wide data path
• 40-bit-wide addresses
• Data transferred at 1.22 GB / second
• Split transactions
  – Read request & read response are separate bus transactions
  – Can use the bus for other things while a read is outstanding
  – Complicates synchronization

Performance
• 164 processor cycles to handle a remote read
• Assuming no bus contention

Network-Based Cache Coherency

Home-Based Protocol
• Each block has a "home"
  – The memory controller that tracks its status
• The home maintains
  – Block status
  – Identity of copy holders
    » 1-bit flag / processor

Example directory state at Memory Controller 4:

    Block   Status     Copy holders
    24      shared     0 1 0 1 0 1 0 1
    25      remote     0 1 0 0 0 0 0 0
    26      uncached   0 0 0 0 0 0 0 0

Block Status Values
• Shared
  – 1 or more remote, read-only copies
• Remote
  – Writeable copy in a remote cache
• Uncached
  – No remote copies
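A sketch of a directory entry in C (field names illustrative; assumes at most 8 processors with one presence bit each, as in the table above):

    #include <stdint.h>

    #define NUM_BLOCKS 4096     /* assumed directory size */

    enum dir_status {
        SHARED,     /* 1 or more remote, read-only copies */
        REMOTE,     /* writeable copy in exactly one remote cache */
        UNCACHED    /* no remote copies */
    };

    struct dir_entry {
        enum dir_status status;
        uint8_t copy_holders;   /* bit p set => processor p holds a copy */
    };

    /* One entry per memory block managed by this home controller. */
    struct dir_entry directory[NUM_BLOCKS];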

Network-Based Consistency

To Obtain a Copy of a Block
• Processor sends a message to the block's home
• Home retrieves the remote copy if the status is remote
• Home sends a copy to the requester
• If an exclusive copy was requested, home sends invalidate messages to all other copy holders

Tricky Details
• Lots of possible sources of deadlock & errors
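A sketch of the home controller's request handling in C. send_block, fetch_and_invalidate, and send_invalidate are hypothetical message primitives, and the tricky details just mentioned (races, retries, deadlock avoidance) are deliberately omitted:

    #include <stdint.h>

    #define NUM_BLOCKS 4096      /* assumed directory size */
    #define NUM_PROCS  8

    enum dir_status { SHARED, REMOTE, UNCACHED };

    struct dir_entry {
        enum dir_status status;
        uint8_t copy_holders;    /* bit p set => processor p holds a copy */
    };

    static struct dir_entry directory[NUM_BLOCKS];

    /* Hypothetical message primitives. */
    static void send_block(int p, uint32_t b)           { (void)p; (void)b; }
    static void fetch_and_invalidate(int p, uint32_t b) { (void)p; (void)b; }
    static void send_invalidate(int p, uint32_t b)      { (void)p; (void)b; }

    /* Handle a request from processor req for block b. */
    void home_request(int req, uint32_t b, int exclusive) {
        struct dir_entry *e = &directory[b];

        /* If a remote cache holds a writeable copy, retrieve it first. */
        if (e->status == REMOTE) {
            for (int p = 0; p < NUM_PROCS; p++)
                if (e->copy_holders & (1 << p))
                    fetch_and_invalidate(p, b);
            e->copy_holders = 0;
        }

        if (exclusive) {
            /* Invalidate all other copies before granting ownership. */
            for (int p = 0; p < NUM_PROCS; p++)
                if (p != req && (e->copy_holders & (1 << p)))
                    send_invalidate(p, b);
            e->copy_holders = (uint8_t)(1 << req);
            e->status = REMOTE;
        } else {
            e->copy_holders |= (uint8_t)(1 << req);
            e->status = SHARED;
        }
        send_block(req, b);      /* send the copy to the requester */
    }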

Lesson #1: Use Commodity Microprocessors

Build on Their Economy of Scale
• Thousands of engineering person-years
• Efficient manufacturing

Failures
• KSR: 15 MHz processor, designed by a 3-person team
• Tera: > 2 years late shipping its first system

Successes
• Cray T3D, T3E use the Alpha, the fastest processor made
• SMP's based on MIPS, SPARC, PentiumPro, …
  – Snoopy protocol hardware integrated into the chip

Lesson #2: Provide Shared Memory Interface

Natural Programming Model for Many Applications
• Shared database
  – Can access different parts of the database by memory references
• Task scheduling
  – Processors grab tasks from a shared queue
  – Get code & data via memory references

Can Simulate Other Models Effectively
• Message passing
  – Store data in a buffer allocated in distant memory (see the sketch below)

Exploits Memory Management Technology
• OS involved in setting up & allocating resources
• Once allocated, communication proceeds without OS intervention
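A minimal sketch of message passing layered on shared memory, assuming a sequentially consistent machine (on weaker memory models the flag update would need explicit ordering). The sender writes into a buffer that lives in the receiver's memory, then sets a flag the receiver spins on; names are illustrative:

    #include <string.h>

    #define MSG_SIZE 64

    /* One-slot mailbox allocated in the receiver's portion of shared memory. */
    struct mailbox {
        volatile int full;          /* 0 = empty, 1 = message present */
        char buf[MSG_SIZE];
    };

    void send_msg(struct mailbox *mb, const char *msg) {
        while (mb->full)            /* wait for previous message to drain */
            ;
        memcpy(mb->buf, msg, MSG_SIZE);  /* write into distant memory */
        mb->full = 1;               /* publish: receiver may now read */
    }

    void recv_msg(struct mailbox *mb, char *out) {
        while (!mb->full)           /* spin on local memory */
            ;
        memcpy(out, mb->buf, MSG_SIZE);
        mb->full = 0;               /* free the slot for the next message */
    }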

Lesson #3: Watch Out for Workstations

Workstations Good for High Performance Computing
• Easy to program
• Low cost per compute cycle
• Low unit cost allows frequent upgrades

Parallel Processors at a Disadvantage
• Takes work to make an application run well on a parallel machine
• Parallel machines are big & expensive
• Higher unit costs & longer development times
  – Hard to upgrade processors
  – Even harder to upgrade the network or bus

Viable Strategies
• Minimal extension of the workstation
• Stick with applications requiring performance
