Parallel Processing, Part 6
Part VI  Implementation Aspects

Fall 2010   Parallel Processing, Implementation Aspects   Slide 1

About This Presentation
This presentation is intended to support the use of the textbook Introduction to Parallel Processing: Algorithms and Architectures (Plenum Press, 1999, ISBN 0-306-45970-1). It was prepared by the author in connection with teaching the graduate-level course ECE 254B: Advanced Computer Architecture: Parallel Processing, at the University of California, Santa Barbara. Instructors can use these slides in classroom teaching and for other educational purposes. Any other use is strictly prohibited. © Behrooz Parhami

Edition   Released       Revised     Revised
First     Spring 2005    Fall 2008   Fall 2010*
* Very limited update

VI  Implementation Aspects
Study real parallel machines in MIMD and SIMD classes:
• Examine parallel computers of historical significance
• Learn from modern systems and implementation ideas
• Bracket our knowledge with history and forecasts

Topics in This Part
Chapter 21  Shared-Memory MIMD Machines
Chapter 22  Message-Passing MIMD Machines
Chapter 23  Data-Parallel SIMD Machines
Chapter 24  Past, Present, and Future

21  Shared-Memory MIMD Machines
Learn about practical shared-variable parallel architectures:
• Contrast centralized and distributed shared memories
• Examine research prototypes and production machines

Topics in This Chapter
21.1  Variations in Shared Memory
21.2  MIN-Based BBN Butterfly
21.3  Vector-Parallel Cray Y-MP
21.4  Latency-Tolerant Tera MTA
21.5  CC-NUMA Stanford DASH
21.6  SCI-Based Sequent NUMA-Q

21.1  Variations in Shared Memory

                          Single Copy of         Multiple Copies of
                          Modifiable Data        Modifiable Data
Central main memory       UMA (BBN Butterfly,    CC-UMA
                          Cray Y-MP)
Distributed main memory   NUMA (Tera MTA)        COMA; CC-NUMA (Stanford
                                                 DASH, Sequent NUMA-Q)

Fig. 21.1  Classification of shared-memory hardware architectures and example systems that will be studied in the rest of this chapter.

C.mmp: A Multiprocessor of Historical Significance
Sixteen processors (0–15), each with a cache, an address map, and a bus interface (BI), connect through a 16 × 16 crossbar to sixteen memory modules, under a central clock and bus control.

Fig. 21.2  Organization of the C.mmp multiprocessor.
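The 16 × 16 crossbar above lets any processor reach any memory module, with conflicts arising only when two processors target the same module in the same cycle. A minimal sketch of that behavior in Python (the fixed-priority tie-break is an assumption for illustration, not C.mmp's actual arbitration logic):

```python
def crossbar_grant(requests):
    """Grant at most one processor per memory module.

    requests: dict mapping processor id -> requested module id.
    Lower-numbered processors win ties (an assumed policy).
    Returns a dict of granted processor id -> module id.
    """
    granted = {}
    taken = set()  # modules already claimed this cycle
    for proc in sorted(requests):
        module = requests[proc]
        if module not in taken:
            granted[proc] = module
            taken.add(module)
    return granted
```

When all sixteen processors request distinct modules, all sixteen transfers proceed in parallel; requests for the same module serialize, one grant per cycle.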
Shared-Memory Consistency Models
Varying latencies make each processor's view of the memory different.

Sequential consistency (strictest and most intuitive): mandates that the interleaving of reads and writes be the same from the viewpoint of all processors. This provides the illusion of a FCFS single-port memory.

Processor consistency (laxer): mandates only that writes be observed in the same order by all processors. This allows reads to overtake writes, providing better performance through out-of-order execution.

Weak consistency separates ordinary memory accesses from synch accesses that require memory to become consistent. Ordinary read and write accesses can proceed as long as there is no pending synch access, but the latter must wait for all preceding accesses to complete.

Release consistency is similar to weak consistency but recognizes two synch accesses, called "acquire" and "release", that sandwich protected shared accesses. Ordinary read/write accesses can proceed only when there is no pending acquire access from the same processor, and a release access must wait for all reads and writes to be completed.

21.2  MIN-Based BBN Butterfly
Each processing node contains an MC 68000 processor, a processor node controller, a memory manager, EPROM, a switch interface, an I/O connection, and 1 MB of memory, with a daughter-board connection for 3 MB of memory expansion. A small 16-node machine connects the nodes through two stages of 4 × 4 switches.

Fig. 21.3  Structure of a processing node in the BBN Butterfly.
Fig. 21.4  A small 16-node version of the multistage interconnection network of the BBN Butterfly.

21.3  Vector-Parallel Cray Y-MP
Each processor holds eight 64-bit vector registers (V0–V7, elements 0–63) feeding vector function units for add, shift, logic, and weight/parity, inter-processor communication, and floating-point add, multiply, and reciprocal-approximation units; scalar integer units (add, shift, logic, weight/parity) work with the eight 64-bit S registers and eight 64-bit T registers, all connected to central memory and input/output.

Fig. 21.5  Key elements of the Cray Y-MP processor. Address registers, address function units, instruction buffers, and control are not shown.

Cray Y-MP's Interconnection Network
Eight processors (P0–P7) reach 256 interleaved memory banks (bank groups 0, 4, 8, ..., 28; 1, 5, 9, ..., 29; and so on, up to 227, 231, ..., 255) through stages of 1 × 8 sections, 8 × 8 subsections, and 4 × 4 switches.

Fig. 21.6  The processor-to-memory interconnection network of Cray Y-MP.

21.4  Latency-Tolerant Tera MTA

Fig. 21.7  The instruction execution pipelines of Tera MTA [figure: instruction fetch feeds an issue pool; memory accesses travel through the interconnection network to memory, with a retry pool and internal pipeline tolerating long latencies].

21.5  CC-NUMA Stanford DASH
Processor clusters, each containing processors with I- and D-caches, a level-2 cache, main memory, and a directory and network interface, are connected by separate request and reply meshes built from wormhole routers.

Fig. 21.8  The architecture of Stanford DASH.

21.6  SCI-Based Sequent NUMA-Q
Each quad places four Pentium Pro (P6) processors, memory, an IQ-Link board, and Orion PCI bridges (OPB) on a 533 MB/s local bus (64 bits, 66 MHz); the OPBs drive 4 × 33 MB/s PCI buses for 100 MB/s Fibre Channel and public LAN connections. Quads are joined by a 1 GB/s interquad SCI link, with a private Ethernet, console, bridges, and SCSI peripheral bays completing the system.

Fig. 21.9  The physical placement of Sequent's quad components on a rackmount baseboard (not to scale).
Fig. 21.10  The architecture of Sequent NUMA-Q 2000.
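Multistage butterfly networks like the one in Section 21.2 are self-routing: each switch stage consumes part of the destination address to choose an output port, independent of the source. A minimal sketch for a butterfly built from 2 × 2 switches (the actual BBN network uses 4 × 4 switches, which would consume two address bits per stage; names here are illustrative):

```python
def butterfly_route(dst, stages):
    """Return the output port taken at each of `stages` switch
    stages when routing to node `dst` in a 2 x 2-switch butterfly.

    Destination-tag routing: the switch at stage i selects its
    upper (0) or lower (1) output according to bit i of dst,
    most significant bit first. The source node never matters.
    """
    return [(dst >> i) & 1 for i in range(stages - 1, -1, -1)]
```

Routing to node 5 (binary 101) in an 8-node, 3-stage network follows outputs 1, 0, 1 from every source, which is what makes the switches simple enough to build in bulk.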
Details of the IQ-Link in Sequent NUMA-Q
On the network side, the IQ-Link board holds a bus interface controller with snooping tags, a directory controller with local and remote tags and directories, a 32 MB remote cache, and an SCI interconnect controller on the P6 local bus.

Fig. 21.11  Block diagram of the IQ-Link board.

Within the interconnect controller, outgoing traffic is assembled into request and response send queues on the SCI output; on the SCI input, a stripper diverts arriving packets into request and response receive queues, with an elastic buffer and a bypass FIFO passing through traffic bound for other nodes.

Fig. 21.12  Block diagram of IQ-Link's interconnect controller.

22  Message-Passing MIMD Machines
Learn about practical message-passing parallel architectures:
• Study mechanisms that support message passing
• Examine research prototypes and production machines

Topics in This Chapter
22.1  Mechanisms for Message Passing
22.2  Reliable Bus-Based Tandem NonStop
22.3  Hypercube-Based nCUBE3
22.4  Fat-Tree-Based Connection Machine 5
22.5  Omega-Network-Based IBM SP2
22.6  Commodity-Driven Berkeley NOW

22.1  Mechanisms for Message Passing
1. Shared-medium networks (buses, LANs; e.g., Ethernet, token-based)
2. Router-based networks (aka direct networks; e.g., Cray T3E, Chaos)
3. Switch-based networks (crossbars, MINs; e.g., Myrinet, IBM SP2)

A 2 × 2 crosspoint switch supports four settings: through (straight), crossed (exchange), lower broadcast, and upper broadcast.

Fig. 22.2  Example 4 × 4 and 2 × 2 switches used as building blocks for larger networks.

Router-Based Networks
A generic router provides input and output channels, each with a link controller (LC) and a message queue (Q), plus injection and ejection channels to the local node, an internal switch, and routing/arbitration logic.

Fig. 22.1  The structure of a generic router.
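The four settings of the 2 × 2 crosspoint switch named above can be captured in a few lines. A sketch, assuming the common convention that upper/lower broadcast replicate the corresponding input onto both outputs:

```python
def switch_2x2(setting, upper_in, lower_in):
    """Apply a 2 x 2 crosspoint switch setting to two inputs and
    return (upper_out, lower_out)."""
    if setting == "through":          # straight: inputs pass unchanged
        return (upper_in, lower_in)
    if setting == "crossed":          # exchange: inputs swap outputs
        return (lower_in, upper_in)
    if setting == "upper_broadcast":  # upper input copied to both outputs
        return (upper_in, upper_in)
    if setting == "lower_broadcast":  # lower input copied to both outputs
        return (lower_in, lower_in)
    raise ValueError(f"unknown setting: {setting}")
```

Larger switches and MINs compose such elements: a log2(n)-stage network of 2 × 2 switches connects n inputs to n outputs with n/2 switches per stage.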
Case Studies of Message-Passing Machines

                Shared-Medium        Router-Based   Switch-Based
                Network              Network        Network
Coarse-grain    Tandem NonStop
                (bus)
Medium-grain    Berkeley NOW (LAN)   nCUBE3         TMC CM-5, IBM SP2
Fine-grain

Fig. 22.3  Classification of message-passing hardware architectures and example systems that will be studied in this chapter.

22.2  Reliable Bus-Based Tandem NonStop
Each section contains four processor-and-memory modules on the duplicated Dynabus; I/O controllers attach to multiple I/O buses so that no single bus failure isolates a device.

Fig. 22.4  One section of the Tandem NonStop Cyclone system.

High-Level Structure of the Tandem NonStop

Fig. 22.5  Four four-processor sections interconnected by Dynabus+.
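The redundant buses in the NonStop design exist so that delivery can continue after a single failure. A toy illustration of that idea, retrying on an alternate path when one fails (hypothetical names; this is not Tandem's actual Dynabus protocol, which operates in hardware below the software level):

```python
class BusFailure(Exception):
    """Raised when a delivery attempt over one bus fails."""

def send_with_failover(message, buses):
    """Try each redundant bus in turn until one delivers the message.

    buses: list of callables; each either delivers (returning an
    acknowledgment) or raises BusFailure. Models redundant-path
    delivery: any single bus failure is masked by the survivors.
    """
    for bus in buses:
        try:
            return bus(message)
        except BusFailure:
            continue  # this path is down; try the next bus
    raise BusFailure("all redundant buses failed")
```

With two buses, one failed path costs only a retry; the system stops delivering only when every redundant path has failed, which is the essence of the "NonStop" guarantee.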