Parallel Computer Systems

Randal E. Bryant
CS 347 Lecture 27
April 29, 1997

Topics
• Parallel Applications
• Shared vs. Distributed Model
• Concurrency Models
• Single Bus Systems
• Network-Based Systems
• Lessons Learned

Motivation
Limits to Sequential Processing
• Cannot push clock rates beyond technological limits
• Instruction-level parallelism gets diminishing returns
  – 4-way superscalar machines only get an average of 1.5 instructions / cycle
  – Branch prediction, speculative execution, etc. yield diminishing returns
Applications Have an Insatiable Appetite for Computing
• Modeling of physical systems
• Virtual reality, real-time graphics, video
• Database search, data mining
Many Applications Can Exploit Parallelism
• Work on multiple parts of the problem simultaneously
• Synchronize to coordinate efforts
• Communicate to share information

Historical Perspective: The Graveyard
• Lots of venture capital and DoD research $$'s
• Too big to enumerate, but some examples …
ILLIAC IV
• Early research machine with overambitious technology
Thinking Machines
• CM-2: 64K single-bit processors with single controller (SIMD)
• CM-5: Tightly coupled network of SPARC processors
Encore Computer
• Shared memory machine using National microprocessors
Kendall Square Research KSR-1
• Shared memory machine using proprietary processor
NCUBE / Intel Hypercube / Intel Paragon
• Connected network of small processors
• Survive only in niche markets

Historical Perspective: Successes
Shared Memory Multiprocessors (SMPs)
• E.g., SGI Challenge, Sun servers, DEC AlphaServers
• Good for handling "server" applications
  – Number of loosely coupled (or independent) computing tasks
  – E.g., multiuser system, Web server
  – Share resources such as primary memory
Cray Vector Machines (also Fujitsu, NEC)
• Single instruction can specify an operation over an entire vector
  – E.g., c[i] = a[i] + b[i] * d, for 0 ≤ i < 128
• Effective for many scientific computing applications
Cray T3D, T3E
• DEC Alphas connected by a high-performance network
• Less versatile than vector machines, but better cost / performance

Application Classes
Compute Servers (loosely coupled)
• Number of independent users using a single computing facility
• Only synchronization is to mediate use of shared resources
  – Memory, disk sectors, file system
Database Servers
• Users performing transactions on a shared database
  – E.g., bank records, flight reservations
• Synchronization required to guarantee consistency
  – Don't want two people to get the last seat on a flight
True Parallel Applications (tightly coupled)
• Computationally intensive task exploiting multiple computing agents
• Synchronization required to coordinate efforts

Parallel Application Example
Finite Element Model
• Discrete representation of a continuous system
• Spatially: partition into mesh elements
• Temporally: update state every dT time units
Example Computation
  For time from 0 to maxT
    For each mesh element
      Update mesh value
Locality
• Update depends only on values of adjacent elements

Parallel Mapping
Spatial Partitioning
• Divide mesh into regions (P1, P2, …, P9)
• Allocate different regions to different processors
Computation for Each Processor
  For time from 0 to maxT
    Get boundary values from neighbors
    For each mesh element
      Update mesh value
    Send boundary values to neighbors

Complicating Factors
Communication Overhead
• N × N mesh, M processors
• Elements / processor = N² / M
  – How much work is required per iteration
• Boundary elements / processor ~ N / sqrt(M)
  – How much communication is required per iteration
• Communication vs. computation load ~ sqrt(M) / N
  – Become communication limited as the number of processors increases
Nonuniformities
• Irregular mesh, varying computation per mesh element
• Makes partitioning & load balancing difficult
Synchronization
• Keeping all processors on the same iteration
• Determining global properties such as convergence and time step

Shared Memory Model
Conceptual View
• All processors access a single memory
  – Physical address space
  – Use virtual address mapping to partition among processes
• If one processor updates a location, then all will see it
  – Memory consistency

Bus-Based Realization
Memory Bus
• Handles all accesses to shared memory
Caches
• One per processor
• Allow local copies of heavily used data
• Must avoid stale data
Considerations
• Small step up from a single-processor system
  – Support added to many microprocessor chips
• Does not scale well
  – Bus becomes bottleneck
  – Limited to ~16 processors

Network-Based Realization
Memory
• Partitioned among processors
Network
• Transmit messages to perform accesses to remote memories
Caches
• Local copies of heavily used data
• Must avoid stale data
  – Harder than with a bus-based system
Considerations
• Scales well
  – 1024-processor systems have been built
• Nonuniform memory access
  – 100's of cycles for a remote access

Memory Consistency Model
Initially: x = 0, y = 0
• Independent processes with access to shared variables
• No assumptions about relative timing
  – Which starts first
  – Relative rates
  Process A               Process B
    a1: x = 1               b1: y = 1
    a2: if (y == 0) …       b2: if (x == 0) …
Sequential Consistency
• Each process executes its steps in program order
• Overall effect should match that of some interleaving of the individual process steps

Sequential Consistency Example
Possible interleavings of (a1, a2) with (b1, b2), and the resulting test outcomes:
• a1 a2 b1 b2: a2 = T, b2 = F
• a1 b1 a2 b2: a2 = F, b2 = F
• a1 b1 b2 a2: a2 = F, b2 = F
• b1 a1 a2 b2: a2 = F, b2 = F
• b1 a1 b2 a2: a2 = F, b2 = F
• b1 b2 a1 a2: a2 = F, b2 = T

Sequential Inconsistency
• Cannot have both tests yield T
  – Would require b2 to precede a1, and a2 to precede b1
  – Cannot satisfy these plus the program-order constraints (a1 before a2, b1 before b2)
Real-Life Scenario
• Process A
  – Remote write x
  – Local read y
• Process B
  – Remote write y
  – Local read x
• Could have both reads yield 0

Snoopy Bus-Based Consistency
Caches
• Write-back
  – Minimize bus traffic
• Monitor bus transactions when not bus master
Cached Blocks
• Clean block can have multiple, read-only copies
• To write, must obtain exclusive copy
  – Marked as dirty
Getting a Copy
• Make bus request
• Memory replies if block clean
• Owning cache replies if dirty

Implementation Details
Block Status
• Maintained by each cache for each of its blocks
• Invalid
  – Entry not valid
• Clean
  – Valid, read-only copy
  – Matches copy in main memory
• Dirty
  – Exclusive, writeable copy
  – Must write back to evict
Bus Operations
• Read
  – Get read-only copy
• Invalidate
  – Invalidate all other copies
  – Make local copy writeable
• Write
  – Write back dirty block
  – To make room for a different block

Bus Master Actions
Notation: i = requested block, t = currently cached block; "=" means the request matches the cached tag, "≠" means it does not; P = processor operation, B = bus operation.
• Invalid
  – P Read i: issue Read i; block becomes Clean
  – P Write i: issue Read i, then Invalidate i; block becomes Dirty
• Clean (holding t)
  – P Read =: hit; no bus operation
  – P Write =: issue Invalidate i; block becomes Dirty
  – P Read ≠: stall; issue Read i; block becomes Clean with new tag
  – P Write ≠: stall; issue Read i, then Invalidate i; block becomes Dirty
• Dirty (holding t)
  – P Read = / P Write =: hit; no bus operation
  – P Read ≠: stall; issue Write t (write back), then Read i
  – P Write ≠: stall; issue Write t, then Read i and Invalidate i

Bus Snoop Actions
• Invalid: no action on any observed bus operation
• Clean (holding t)
  – B Read =: no action (memory supplies the block)
  – B Invalidate =: mark block Invalid
• Dirty (holding t)
  – B Read =: Data: cache supplies the block (instead of memory); block becomes Clean

Example 1
  Process A               Process B
    a1: x = 1               b1: y = 1
    a2: if (y == 0) …       b2: if (x == 0) …
Bus Transactions
  A: Read x
  A: Invalidate x
    a1: x = 1
  B: Read y
  A: Read y
    a2: test yields T
  B: Invalidate y
    b1: y = 1
  B: Read x
    b2: test yields F

Example 2
Same program as Example 1.
Bus Transactions
  A: Read x
  A: Invalidate x
    a1: x = 1
  B: Read y
  B: Invalidate y
    b1: y = 1
  A: Read y
    a2: test yields F
  B: Read x
    b2: test yields F

Livelock Example
  Process A               Process B
    a1: y = …               while (…):
                              b1: t = y
                              b2: y = t + 1
Bus Transactions
  A: Read y
  B: Read y
  B: Invalidate y
    b1: t = y
    b2: y = t + 1
  A: Read y
  B: Read y
  B: Invalidate y
    b1: t = y
    b2: y = t + 1
  A: Read y
  …
• A never gets a chance to write

Single Bus Machine Example
SGI Challenge Series
• Up to 36 MIPS R4400 processors
• Up to 16 GB main memory
Bus
• 256-bit-wide data
• 40-bit-wide address
• Data transferred at 1.22 GB / second
• Split transaction
  – Read request & read response are separate bus transactions
  – Can use bus for other things while a read is outstanding
  – Complicates synchronization
Performance
• 164 processor cycles to handle a remote read
• Assuming no bus contention

Network-Based Cache Coherency
Home-Based Protocol
• Each block has a "home"
  – Memory controller tracking its status
• Home maintains
  – Block status
  – Identity of copy holders
    » 1-bit flag / processor
Example directory state:
  Block   Status      Copy Holders
  24      shared      0 1 0 1 0 1 0 1
  25      remote      0 1 0 0 0 0 0 0
  26      uncached    0 0 0 0 0 0 0 0
Block Status Values
• Shared
  – 1 or more remote, read-only copies
• Remote
  – Writeable copy in a remote cache
• Uncached
  – No remote copies

Network-Based Consistency
To Obtain Copy of Block
• Processor sends a message to the block's home
• Home retrieves the remote copy if status is remote
• Sends copy to requester
• If an exclusive copy is requested, send invalidate messages to all other copy holders
Tricky Details
• Lots of possible sources of deadlock & errors
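The vector operation on the "Successes" slide, c[i] = a[i] + b[i] * d for 0 ≤ i < 128, is exactly the loop that a single Cray vector instruction would cover. A plain-Python rendering of that loop (the function name is an illustrative choice, not from the lecture):

```python
# The loop a single vector instruction replaces:
# c[i] = a[i] + b[i] * d, for 0 <= i < 128.
def vector_madd(a, b, d):
    assert len(a) == len(b) == 128
    return [a[i] + b[i] * d for i in range(128)]
```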
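The per-processor loop on the "Parallel Mapping" slide (get boundary values, update each element, send boundary values) can be sketched as a sequential simulation: a 1-D mesh is divided into regions, one per simulated processor, each padded with ghost cells. The mesh size, the three-point averaging update rule, and the fixed zero boundary are illustrative assumptions, not from the lecture.

```python
# Sequential simulation of the spatial-partitioning scheme: each
# "processor" owns one region plus two ghost cells at the ends.
def step(regions):
    p = len(regions)
    # "Get boundary values from neighbors" into the ghost cells.
    for k in range(p):
        regions[k][0] = regions[k - 1][-2] if k > 0 else 0.0
        regions[k][-1] = regions[k + 1][1] if k < p - 1 else 0.0
    # "For each mesh element: update mesh value" (three-point average);
    # the update depends only on adjacent elements (locality).
    for k in range(p):
        r = regions[k]
        old = r[:]
        for i in range(1, len(r) - 1):
            r[i] = (old[i - 1] + old[i] + old[i + 1]) / 3.0
    # "Send boundary values to neighbors" happens implicitly at the
    # start of the next iteration in this sequential sketch.

def run(n=12, p=3, max_t=5):
    per = n // p
    # Interior starts at 1.0; ghost cells at the ends start at 0.0.
    regions = [[0.0] + [1.0] * per + [0.0] for _ in range(p)]
    for _ in range(max_t):          # "For time from 0 to maxT"
        step(regions)
    return [x for r in regions for x in r[1:-1]]
```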
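The scaling claims on the "Complicating Factors" slide can be checked numerically: with work per processor N²/M and boundary elements N/sqrt(M), the communication-to-computation ratio is sqrt(M)/N, so quadrupling the processor count doubles the ratio. The particular mesh and processor counts below are arbitrary illustrations.

```python
import math

def comm_vs_comp(n, m):
    work = n * n / m              # elements (work) per processor
    boundary = n / math.sqrt(m)   # boundary elements per processor
    return boundary / work        # = sqrt(m) / n

# Quadrupling M doubles the ratio: the system becomes communication
# limited as the number of processors increases.
r16 = comm_vs_comp(1024, 16)
r64 = comm_vs_comp(1024, 64)
```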
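The interleaving table in the sequential-consistency example can be verified exhaustively: enumerate every ordering of {a1, a2, b1, b2} that respects each process's program order, execute it, and collect the test outcomes. This is a brute-force check, not part of the lecture.

```python
from itertools import permutations

def sc_outcomes():
    # a1: x = 1; a2: test (y == 0); b1: y = 1; b2: test (x == 0)
    results = set()
    for order in permutations(["a1", "a2", "b1", "b2"]):
        # Keep only interleavings consistent with program order.
        if order.index("a1") > order.index("a2"):
            continue
        if order.index("b1") > order.index("b2"):
            continue
        x = y = 0
        a2 = b2 = None
        for op in order:
            if op == "a1":
                x = 1
            elif op == "b1":
                y = 1
            elif op == "a2":
                a2 = (y == 0)
            elif op == "b2":
                b2 = (x == 0)
        results.add((a2, b2))
    return results
```

Running it confirms the slide's claim: one test (or neither) can yield T, but never both.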
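The snoopy write-back protocol of slides 15-18 can be sketched as a toy model: each cache keeps a per-block state (invalid / clean / dirty); a read miss issues a bus Read answered by memory, or by the dirty owner; a write first obtains an exclusive copy via Invalidate. The class and method names are illustrative assumptions, and bus arbitration and stalls are omitted.

```python
INVALID, CLEAN, DIRTY = "invalid", "clean", "dirty"

class Bus:
    def __init__(self, memory):
        self.memory = memory      # dict: address -> value
        self.caches = []

    def read(self, requester, addr):
        # Snoop action: a dirty owner supplies the block (writing it
        # back) and reverts to clean; otherwise memory replies.
        for c in self.caches:
            if c is not requester and c.state.get(addr) == DIRTY:
                self.memory[addr] = c.data[addr]
                c.state[addr] = CLEAN
        return self.memory[addr]

    def invalidate(self, requester, addr):
        # Snoop action: every other holder marks its copy invalid.
        for c in self.caches:
            if c is not requester and c.state.get(addr, INVALID) != INVALID:
                c.state[addr] = INVALID

class Cache:
    def __init__(self, bus):
        self.bus = bus
        self.state, self.data = {}, {}
        bus.caches.append(self)

    def load(self, addr):
        if self.state.get(addr, INVALID) == INVALID:
            self.data[addr] = self.bus.read(self, addr)   # bus Read
            self.state[addr] = CLEAN
        return self.data[addr]

    def store(self, addr, value):
        if self.state.get(addr, INVALID) != DIRTY:
            self.load(addr)                    # get a copy first
            self.bus.invalidate(self, addr)    # make it exclusive
            self.state[addr] = DIRTY
        self.data[addr] = value
```

Replaying Example 1 with two caches reproduces the slide's transaction sequence: A's store of x triggers Read x then Invalidate x, A's load of y still sees 0 (test T), and B's later load of x forces A's dirty copy to be written back.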
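The home-based directory of the last two slides can be sketched for a single block: the home tracks the status (uncached / shared / remote) and a one-bit-per-processor copy-holder vector, and an exclusive request invalidates every other holder. Message passing is collapsed into direct calls in this single-process sketch, and the names are illustrative assumptions.

```python
UNCACHED, SHARED, REMOTE = "uncached", "shared", "remote"

class Home:
    """Directory entry for one block at its home memory controller."""
    def __init__(self, nprocs):
        self.status = UNCACHED
        self.holders = [0] * nprocs   # 1-bit flag per processor
        self.invalidations = 0        # invalidate messages sent

    def request(self, proc, exclusive):
        if exclusive:
            # Send invalidate messages to all other copy holders.
            self.invalidations += sum(
                1 for p, h in enumerate(self.holders) if h and p != proc)
            self.holders = [0] * len(self.holders)
            self.holders[proc] = 1
            self.status = REMOTE      # writeable copy in a remote cache
        else:
            # (If status is REMOTE, the home would first retrieve the
            #  dirty copy from its holder; elided in this sketch.)
            self.holders[proc] = 1
            self.status = SHARED      # one or more read-only copies
```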
