Part VI Implementation Aspects


About This Presentation

This presentation is intended to support the use of the textbook Introduction to Parallel Processing: Algorithms and Architectures (Plenum Press, 1999, ISBN 0-306-45970-1). It was prepared by the author in connection with teaching the graduate-level course ECE 254B: Advanced Computer Architecture: Parallel Processing, at the University of California, Santa Barbara. Instructors can use these slides in classroom teaching and for other educational purposes. Any other use is strictly prohibited. © Behrooz Parhami

Edition: First | Released: Spring 2005 | Revised: Fall 2008 | Revised: Fall 2010*

* Very limited update

VI Implementation Aspects

Study real parallel machines in MIMD and SIMD classes:
• Examine parallel computers of historical significance
• Learn from modern systems and implementation ideas
• Bracket our knowledge with history and forecasts

Topics in This Part
Chapter 21 Shared-Memory MIMD Machines
Chapter 22 Message-Passing MIMD Machines
Chapter 23 Data-Parallel SIMD Machines
Chapter 24 Past, Present, and Future

21 Shared-Memory MIMD Machines

Learn about practical shared-variable parallel architectures:
• Contrast centralized and distributed shared memories
• Examine research prototypes and production machines

Topics in This Chapter
21.1 Variations in Shared Memory
21.2 MIN-Based BBN Butterfly
21.3 Vector-Parallel Cray Y-MP
21.4 Latency-Tolerant Tera MTA
21.5 CC-NUMA Stanford DASH
21.6 SCI-Based Sequent NUMA-Q

21.1 Variations in Shared Memory

Fig. 21.1 Classification of shared-memory hardware architectures and example systems that will be studied in the rest of this chapter:
Central main memory, single copy of modifiable data: UMA (BBN Butterfly, Cray Y-MP)
Central main memory, multiple copies of modifiable data: CC-UMA
Distributed main memory, single copy of modifiable data: NUMA (Tera MTA)
Distributed main memory, multiple copies of modifiable data: CC-NUMA (Stanford DASH, Sequent NUMA-Q), COMA

C.mmp: A Multiprocessor of Historical Significance

Fig. 21.2 Organization of the C.mmp multiprocessor (16 processors, 0 through 15, each with a cache, address map, and bus interface, connected through a 16 × 16 crossbar to 16 memory modules, with a common clock and bus control).

Shared Memory Consistency Models

Varying latencies make each processor's view of memory different.

Sequential consistency (strictest and most intuitive): mandates that the interleaving of reads and writes be the same from the viewpoint of all processors. This provides the illusion of a FCFS single-port memory.

Processor consistency (laxer): only mandates that writes be observed in the same order by all processors. This allows reads to overtake writes, providing better performance due to out-of-order execution.

Weak consistency: separates ordinary memory accesses from synch accesses that require the memory to become consistent. Ordinary read and write accesses can proceed as long as there is no pending synch access, but the latter must wait for all preceding accesses to complete.

Release consistency: similar to weak consistency, but recognizes two synch accesses, called "acquire" and "release", that sandwich protected shared accesses. Ordinary read/write accesses can proceed only when there is no pending acquire access from the same processor, and a release access must wait for all reads and writes to be completed.
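To relate release consistency to something runnable, the following sketch (an illustrative assumption using C++11 atomics, not part of any machine discussed in this chapter) sandwiches an ordinary shared write between a release on the writer's side and an acquire on the reader's side.

```cpp
// Minimal sketch of acquire/release ordering around ordinary shared accesses.
// Assumes C++11 or later; the names (shared_data, ready) are illustrative only.
#include <atomic>
#include <thread>
#include <cassert>

int shared_data = 0;                 // ordinary (non-synchronizing) access
std::atomic<bool> ready{false};      // synchronizing flag

void producer() {
    shared_data = 42;                              // ordinary write
    ready.store(true, std::memory_order_release);  // "release": prior writes become visible
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) // "acquire": later reads see those writes
        ;                                          // spin until released
    assert(shared_data == 42);                     // guaranteed under acquire/release ordering
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join(); t2.join();
    return 0;
}
```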

21.2 MIN-Based BBN Butterfly

Fig. 21.3 Structure of a processing node in the BBN Butterfly (MC 68000 processor, processor node controller, memory manager, EPROM, switch interface, 1 MB of memory with a daughter-board connection for 3 MB of memory expansion, and connections to I/O boards).

Fig. 21.4 A small 16-node version of the multistage interconnection network of the BBN Butterfly, built of 4 × 4 switches.

21.3 Vector-Parallel Cray Y-MP

Fig. 21.5 Key elements of the Cray Y-MP processor (eight 64-element, 64-bit vector registers V0-V7 feeding integer vector units for add, shift, logic, and weight/parity and floating-point units for add, multiply, and reciprocal approximation; scalar integer units with T and S registers, eight 64-bit each; inter-processor communication; and the central memory). Address registers, address function units, instruction buffers, and control are not shown.

Cray Y-MP's Interconnection Network

Fig. 21.6 The processor-to-memory interconnection network of Cray Y-MP (processors P0-P7 connect through 1 × 8 switches to 8 × 8 sections and 4 × 4 subsections leading to 256 interleaved memory banks: 0, 4, 8, ..., 28; 1, 5, 9, ..., 29; ...; 227, 231, ..., 255).

21.4 Latency-Tolerant Tera MTA

Fig. 21.7 The instruction execution pipelines of Tera MTA (instruction fetch, issue pool, MAC pipelines, and retry pool, with memory reached through the interconnection network and the memory's internal pipeline).

21.5 CC-NUMA Stanford DASH

Fig. 21.8 The architecture of Stanford DASH (processor clusters, each processor with instruction and data caches and a level-2 cache, sharing main memory and a directory & network interface, connected by separate request and reply meshes through wormhole routers).

21.6 SCI-Based Sequent NUMA-Q

Fig. 21.9 The physical placement of Sequent's quad components on a rackmount baseboard (not to scale). Each quad contains four Pentium Pro (P6) processors on a 533 MB/s local bus (64 bits, 66 MHz), memory, an Orion PCI bridge (OPB) with 4 × 33 MB/s PCI buses for 100 MB/s Fibre Channel and public LAN connections, an MDP, and the IQ-Link providing a 1 GB/s interquad SCI link.

Fig. 21.10 The architecture of Sequent NUMA-Q 2000 (multiple quads joined by the IQ-Link SCI ring, with peripheral bays on private SCSI, Ethernet, bridges, and a console).

Details of the IQ-Link in Sequent NUMA-Q

Fig. 21.11 Block diagram of the IQ-Link board (a P6 bus interface controller with local and remote snooping tags and local and remote directory tags, a directory controller, an SCI interconnect controller, and a 32 MB remote data cache).

Fig. 21.12 Block diagram of IQ-Link's interconnect controller (SCI input passes through an elastic buffer and stripper and is disassembled into request and response receive queues; request and response send queues are assembled onto the SCI output, with a bypass FIFO for passing traffic).

22 Message-Passing MIMD Machines

Learn about practical message-passing parallel architectures:
• Study mechanisms that support message passing
• Examine research prototypes and production machines

Topics in This Chapter
22.1 Mechanisms for Message Passing
22.2 Reliable Bus-Based Tandem NonStop
22.3 Hypercube-Based nCUBE3
22.4 Fat-Tree-Based Connection Machine 5
22.5 Omega-Network-Based IBM SP2
22.6 Commodity-Driven Berkeley NOW

22.1 Mechanisms for Message Passing

1. Shared-medium networks (buses, LANs; e.g., Ethernet, token-based)
2. Router-based networks (aka direct networks; e.g., Cray T3E, Chaos)
3. Switch-based networks (crossbars, MINs; e.g., Myrinet, IBM SP2)

Fig. 22.2 Example 4 × 4 and 2 × 2 switches used as building blocks for larger networks (a 4 × 4 crosspoint switch, and 2 × 2 switch settings: through/straight, crossed/exchange, lower broadcast, and upper broadcast).

Router-Based Networks

Fig. 22.1 The structure of a generic router (input and output channels with link controllers LC and message queues Q, an internal switch, a routing/arbitration unit, and injection/ejection channels connecting the router to its local node).

Case Studies of Message-Passing Machines

Fig. 22.3 Classification of message-passing hardware architectures and example systems that will be studied in this chapter:
Coarse-grain, shared-medium network (bus): Tandem NonStop
Medium-grain, shared-medium network (LAN): Berkeley NOW
Medium-grain, router-based network: nCUBE3
Medium-grain, switch-based network: TMC CM-5, IBM SP2
Fine-grain: no example studied

22.2 Reliable Bus-Based Tandem NonStop

Fig. 22.4 One section of the Tandem NonStop Cyclone system (four processor-and-memory modules on the Dynabus, each with I/O connections to shared I/O controllers).

High-Level Structure of the Tandem NonStop

Fig. 22.5 Four four-processor sections interconnected by Dynabus+.

22.3 Hypercube-Based nCUBE3

Fig. 22.6 An eight-node nCUBE architecture (hosts and I/O attached to a hypercube of processing nodes).

22.4 Fat-Tree-Based Connection Machine 5

Fig. 22.7 The overall structure of CM-5 (up to 16K processing nodes, each a processor P with memory M; one or more control processors CP; and I/O nodes such as data vaults, a HIPPI or VME interface, and graphics output; all connected to separate data, control, and diagnostic networks).

Processing Nodes in CM-5

Fig. 22.8 The components of a processing node in CM-5 (a SPARC processor, four memory banks, four vector units, each with a 64 × 64 register file, a pipelined ALU, memory control, and instruction decode, a bus interface on a 64-bit bus, and a network interface to the data and control networks).

The Interconnection Network in CM-5

Fig. 22.9 The fat-tree (hyper-tree) data network of CM-5 (data routers built of 4 × 2 switches above processors and I/O nodes numbered 0 through 63).

22.5 Omega-Network-Based IBM SP2

Fig. 22.10 The architecture of the IBM SP series of systems (compute nodes, each with a processor, memory, disk, an Ethernet adapter, and a network interface on a Micro Channel I/O bus with its controller MCC, connected by the high-performance switch; plus input/output nodes, host nodes, gateways to other systems, an Ethernet, and a console).

The IBM SP2 Network Interface

Fig. 22.11 The network interface controller of IBM SP2 (a Micro Channel interface with left and right DMA engines and 2 × 2 KB FIFO buffers, a 40 MHz i860 processor with 8 MB of DRAM on a 160 MB/s i860 bus, and a memory and switch management unit with input and output FIFOs to the switch).

The Interconnection Network of IBM SP2

Fig. 22.12 A section of the high-performance switch network of IBM SP2 (4 × 4 crossbars connecting nodes 0 through 63; only 1/4 of the links at the upper level are shown).

22.6 Commodity-Driven Berkeley NOW

NOW aims at building a distributed supercomputer from commercial workstations (high-end PCs), linked via switch-based networks.

Components and challenges for successful deployment of NOWs (aka clusters of workstations, or COWs):
Network interface hardware: system-area network (SAN), a middle ground between LANs and specialized interconnections
Fast communication protocols: must provide competitive values for the parameters of the LogP model (see Section 4.5)
Distributed file systems: data files must be distributed among nodes, with no central warehouse or arbitration authority
Global resource management: Berkeley uses GLUnix (global-layer Unix); other systems of comparable functionality include Argonne Globus, NSCP metacomputing system, Princeton SHRIMP, Rice TreadMarks, Syracuse WWVM, Virginia Legion, and Wisconsin Wind Tunnel
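To make the LogP reference above concrete, here is a small sketch (an illustration only; the parameter values and helper names are assumptions, not NOW measurements) that estimates message costs from the model's parameters L (latency), o (overhead), g (gap), and P (processor count).

```cpp
// Illustrative LogP cost estimates (assumed helper, not from the NOW project).
// L = network latency, o = send/receive overhead, g = gap between messages,
// P = number of processors (part of the model, unused in these two estimates).
#include <algorithm>
#include <cstdio>

struct LogP { double L, o, g; int P; };

// One small message: send overhead + latency + receive overhead.
double one_message(const LogP& m) { return m.o + m.L + m.o; }

// k back-to-back small messages from one sender: successive injections are
// spaced by max(o, g); the last message still needs L + o to be received.
double k_messages(const LogP& m, int k) {
    return (k - 1) * std::max(m.o, m.g) + one_message(m);
}

int main() {
    LogP san{5.0, 2.0, 4.0, 32};   // hypothetical microsecond-scale SAN parameters
    std::printf("1 message : %.1f us\n", one_message(san));
    std::printf("8 messages: %.1f us\n", k_messages(san, 8));
    return 0;
}
```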

23 Data-Parallel SIMD Machines

Learn about practical implementation of SIMD architectures:
• Assess the successes, failures, and potentials of SIMD
• Examine various SIMD parallel computers, old and new

Topics in This Chapter
23.1 Where Have All the SIMDs Gone?
23.2 The First Supercomputer: ILLIAC IV
23.3 Goodyear MPP
23.4 Distributed Array Processor (DAP)
23.5 Hypercubic Connection Machine 2
23.6 Multiconnected MasPar MP-2

23.1 Where Have All the SIMDs Gone?

Associative or content-addressable memories/processors constituted early forms of SIMD parallel processing.

Fig. 23.1 Functional view of an associative memory/processor (a control unit broadcasts data and commands, including a comparand and mask, to cells 0 through m-1, each with a response-store tag bit t0 through tm-1; a global tag operations unit provides global operations, control & response, and read lines).

Search Functions in Associative Devices

Exact match: Locating data based on partial knowledge of contents

Inexact match: Finding numerically or logically proximate values

Membership: Identifying all members of a specified set

Relational: Determining values that are less than, less than or equal, etc.

Interval: Marking items that are between or outside given limits

Extrema: Finding the maximum, minimum, next higher, or next lower

Rank-based: Selecting the kth largest/smallest element, or the k largest/smallest elements

Ordered retrieval: Repeated max- or min-finding with elimination (sorting)
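As a software illustration of the exact-match function above, the following sketch emulates word-parallel associative search: every cell compares its contents with a comparand under a mask and sets a tag bit. The data and names are illustrative assumptions, not a description of any particular associative processor.

```cpp
// Emulation of associative (content-addressable) exact-match search.
// Each "cell" holds one word; bits cleared in the mask are ignored.
#include <cstdint>
#include <vector>
#include <cstdio>

std::vector<bool> exact_match(const std::vector<uint32_t>& cells,
                              uint32_t comparand, uint32_t mask) {
    std::vector<bool> tags(cells.size());            // response store (one tag per cell)
    for (size_t i = 0; i < cells.size(); ++i)        // conceptually done in all cells at once
        tags[i] = ((cells[i] ^ comparand) & mask) == 0;
    return tags;
}

int main() {
    std::vector<uint32_t> mem = {0x12345678, 0x12AB5678, 0xDEADBEEF};
    // Match on the low 16 bits only (the mask selects which bits participate).
    auto tags = exact_match(mem, 0x00005678, 0x0000FFFF);
    for (size_t i = 0; i < tags.size(); ++i)
        std::printf("cell %zu: %s\n", i, tags[i] ? "match" : "no match");
    return 0;
}
```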

Mixed-Mode SIMD / MIMD Parallelism

Fig. 23.2 The architecture of Purdue PASM (a system control unit with memory management system and control storage, mid-level controllers, and a parallel computation unit of processors 0 through p-1 with paired memory modules A and B, joined by an interconnection network and backed by a memory storage system).

23.2 The First Supercomputer: ILLIAC IV

Fig. 23.3 The ILLIAC IV computer (a B6700 host and a control unit that fetches data or instructions and issues data, control, and mode signals to processors 0 through 63, each with its own memory; the inter-processor routing network is only partially shown). Main memory consists of synchronized disks organized into 52 bands (Band 0 through Band 51) of 300 pages each.

23.3 Massively Parallel Goodyear MPP

Fig. 23.4 The architecture of Goodyear MPP (a VAX 11/780 host processor with tape, disk, printer, terminal, and network connections; a program and data management unit; an array control unit exchanging control and status with staging memories and switches; and an array unit of 128 × 128 processors with 128-bit input and output interfaces).

Processing Elements in the Goodyear MPP

Fig. 23.5 The single-bit processor of MPP (registers S for routing to/from the left and right processors, G for masking, A, B, C for carry, and P; a full adder with sum and carry outputs; logic; a comparator feeding the global OR tree; NEWS-neighbor connections; a variable-length shift register; and bit-serial access to local memory over the data bus and address lines).

23.4 Distributed Array Processor (DAP)

Fig. 23.6 The bit-serial processor of DAP (registers Q, C for carry, A for condition, S, and D; a full adder with sum and carry; multiplexers selecting inputs from the N, E, W, S neighbors and from the control unit; row and column response lines; connections to the north and south neighbors; and bit-serial access to local memory).

DAP's High-Level Structure

Fig. 23.7 The high-level architecture of a DAP system (an array of processors with N, E, W, S neighbor links and fast I/O, a master control unit, program memory, a host interface unit and host workstation, and per-processor register planes Q, C, A, and D backed by an array memory of at least 32K planes; the local memory of processor ij is its bit position in each plane).

23.5 Hypercubic Connection Machine 2

Fig. 23.8 The architecture of CM-2 (front ends 0 through 3 connected through bus interfaces and a nexus to sequencers 0 through 3, each driving 16K processors, with data vaults, an I/O system, and a graphic display).

The Simple Processing Elements of CM-2

Fig. 23.9 The bit-serial ALU of CM-2 (bits a and b come from memory and c from the flags; two 8-to-1 multiplexers, selected by a, b, and c and fed by 8-bit f and g opcodes, produce f(a, b, c), written back to memory, and g(a, b, c), written to a flag).
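The mux-based ALU of Fig. 23.9 amounts to evaluating a 3-input Boolean function whose truth table is the 8-bit opcode. The sketch below emulates that idea in software; the opcode constants are assumptions chosen to realize a full adder, not CM-2 microcode.

```cpp
// Emulating a CM-2-style bit-serial ALU step: an 8-bit opcode is the truth
// table of a 3-input Boolean function; inputs (a, b, c) select one of its bits.
#include <cstdint>
#include <cstdio>

// Evaluate f(a,b,c): bit k of 'opcode' gives the output for input pattern k = 4a + 2b + c.
int lut3(uint8_t opcode, int a, int b, int c) {
    int index = (a << 2) | (b << 1) | c;
    return (opcode >> index) & 1;
}

int main() {
    const uint8_t SUM   = 0x96;  // odd parity of a, b, c (full-adder sum)
    const uint8_t CARRY = 0xE8;  // majority of a, b, c (full-adder carry)
    for (int a = 0; a <= 1; ++a)
        for (int b = 0; b <= 1; ++b)
            for (int c = 0; c <= 1; ++c)
                std::printf("a=%d b=%d c=%d  sum=%d carry=%d\n",
                            a, b, c, lut3(SUM, a, b, c), lut3(CARRY, a, b, c));
    return 0;
}
```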

23.6 Multiconnected MasPar MP-2

Fig. 23.10 The architecture of MasPar MP-2 (an Ethernet-attached front end, a disk array, an array control unit, an I/O channel controller and memory, the X-net-connected processor array, and the global router).

The Interconnection Network of MasPar MP-2

Fig. 23.11 The physical packaging of processor clusters and the 3-stage global router in MasPar MP-2 (router stages 1, 2, and 3 connecting the processor clusters).

The Processing Nodes of MasPar MP-2

Fig. 23.12 Processor architecture in MasPar MP-2 (X-net communication in and out, router connections to stage 3 and from stage 1, flags, exponent and significand units, ALU, logic unit, and barrel shifter on a 32-bit word bus and a bit bus, a register file, memory address and memory data/ECC units to external memory, and control/reduction logic with instruction broadcast).

24 Past, Present, and Future

Put the state of the art in context of history and future trends:
• Quest for higher performance, and where it is leading us
• Interplay between technology progress and architecture

Topics in This Chapter
24.1 Milestones in Parallel Processing
24.2 Current Status, Issues, and Debates
24.3 TFLOPS, PFLOPS, and Beyond
24.4 Processor and Memory Technologies
24.5 Interconnection Technologies
24.6 The Future of Parallel Processing

24.1 Milestones in Parallel Processing

1840s: Desirability of computing multiple values at the same time was noted in connection with Babbage's Analytical Engine
1950s: von Neumann's and Holland's cellular computers were proposed
1960s: The ILLIAC IV SIMD machine was designed at U Illinois, based on the earlier SOLOMON computer
1970s: Vector machines, with deeply pipelined function units, emerged and quickly dominated the supercomputer market; early experiments with shared-memory multiprocessors, message-passing multicomputers, and gracefully degrading systems paved the way for further progress
1980s: Fairly low-cost and massively parallel computers, using hypercube and related interconnection schemes, became available
1990s: Massively parallel computers with mesh or torus interconnection, and using wormhole routing, became dominant
2000s: Commodity and network-based parallel computing took hold, offering speed, flexibility, and reliability for server farms and other areas

24.2 Current Status, Issues, and Debates

Design choices and key debates, with related user concerns:

Architecture
  General- or special-purpose system?          Cost per MIPS
  SIMD, MIMD, or hybrid?                       Scalability, longevity
Interconnection
  Shared-medium, direct, or multistage?        Cost per MB/s
  Custom or commodity?                         Expandability, reliability
Routing
  Oblivious or adaptive?                       Speed, efficiency
  Wormhole, packet, or virtual cut-through?    Overhead, saturation limit
Programming
  Shared memory or message passing?            Ease of programming
  New languages or libraries?                  Software portability

24.3 TFLOPS, PFLOPS, and Beyond

Fig. 24.1 Milestones in the Accelerated Strategic Computing Initiative (ASCI) program, sponsored by the US Department of Energy, with extrapolation up to the PFLOPS level (ASCI Red: 1+ TFLOPS, 0.5 TB; ASCI Blue: 3+ TFLOPS, 1.5 TB; ASCI White: 10+ TFLOPS, 5 TB; ASCI Q: 30+ TFLOPS, 10 TB; ASCI Purple: 100+ TFLOPS, 20 TB; shown as plan/develop/use phases over calendar years 1995-2010).

Performance Milestones in Supercomputers

Chart: performance milestones in supercomputers, from MFLOPS around 1960 (CDC) through GFLOPS (Cray 1, Cray Y-MP, IBM) toward TFLOPS, PFLOPS, and EFLOPS by 2020; 1993 and later milestones based on .com data.

24.4 Processor and Memory Technologies

Fig. 24.2 Key parts of the CPU in the Intel Pentium Pro microprocessor (a reservation station issues to port-0 units: shift, FLP multiply, FLP divide, integer divide, FLP add, and integer execution unit 0; port-1 units: jump and integer execution unit 1; and ports 2-4 dedicated to memory access, i.e., address generation units, etc.; results go to the reorder buffer and retirement register file).

24.5 Interconnection Technologies

Fig. 24.3 Changes in the ratio of a 1-cm wire delay to device switching time as the feature size is reduced from 0.5 to 0.13 μm (the ratio rises from roughly 4 toward 8 with continued use of Al/oxide; a changeover to lower-resistivity wires and low-dielectric-constant insulators, e.g., Cu/polyimide, moderates the increase).

Intermodule and Intersystem Connection Options

Fig. 24.4 Various types of intermodule and intersystem connections, plotted by bandwidth (10^3 to 10^12 b/s) versus latency (ns to hours): processor bus, I/O network, system-area network (SAN), local-area network (LAN), metro-area network (MAN), and wide-area network (WAN), ranging from the same geographic location to geographically distributed.

System Interconnection Media

Fig. 24.5 The three commonly used media for computer and network connections: twisted pair (plastic-insulated copper), coaxial cable (copper core, insulator, outer conductor), and optical fiber (silica, guiding light from a source by internal reflection).

24.6 The Future of Parallel Processing

In the early 1990s, it was unclear whether the TFLOPS performance milestone would be reached with a thousand or so GFLOPS nodes or with a million or so simpler MFLOPS processors. The eventual answer was closer to 1K × GFLOPS than 1M × MFLOPS. PFLOPS performance was achieved in more or less the same manner.

Asynchronous design for its speed and greater energy economy: Because pure asynchronous circuits create some difficulties, designs that are locally synchronous and globally asynchronous are flourishing.

Intelligent memories to help remove the memory wall: Memory chips have high internal bandwidths, but are limited by pins in how much data they can handle per clock cycle.

Reconfigurable systems for adapting hardware to application needs: Raw hardware that can be configured via "programming" to adapt to application needs may offer the benefits of special-purpose processing.

Network and grid computing to allow flexible, ubiquitous access: Resources of many small and large computers can be pooled together via a network, offering supercomputing capability to anyone.

Parallel Processing for Low-Power Systems

Chart (Figure 25.1 of Parhami's computer architecture textbook): performance from kIPS to TIPS over calendar years 1980-2010, comparing absolute processor performance, DSP performance per watt, and general-purpose (GP) processor performance per watt.

Peak Performance of Supercomputers

Chart: peak performance of supercomputers, 1980-2010, rising roughly ×10 every 5 years, from GFLOPS (Cray X-MP, Cray 2, TMC CM-2) through TFLOPS (TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific) toward PFLOPS (Earth Simulator).

Dongarra, J., "Trends in High Performance Computing," Computer J., Vol. 47, No. 4, pp. 399-403, 2004. [Dong04]

Evolution of Computer Performance/Cost

From: "Robots After All," by H. Moravec, CACM, pp. 90-97, October 2003.

Chart: evolution of computer performance/cost over time, with mental power shown in four scales.

Two Approaches to Parallel Processing

Architecture (topology) and algorithm considered in tandem
+ Potentially very high performance
– Large effort to develop efficient algorithms

Users interact with topology-independent platforms
+ Less steep learning curve
+ Program portability
– Sacrifice some performance for ease of use
Main examples: virtual shared memory; message-passing interface (a minimal MPI sketch follows below)

The topology-independent approach is currently dominant
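To illustrate the topology-independent, message-passing style cited above, here is a minimal sketch using the standard MPI C API from C++; the payload, tag, and file name are arbitrary assumptions for illustration.

```cpp
// Minimal MPI sketch: rank 0 sends one integer to rank 1, which prints it.
// Typical build/run: mpic++ hello_mpi.cpp -o hello_mpi && mpirun -np 2 ./hello_mpi
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int tag = 0;
    if (rank == 0 && size > 1) {
        int value = 42;                                   // arbitrary payload
        MPI_Send(&value, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int value = 0;
        MPI_Recv(&value, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```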

ABCs of Parallel Processing in One Slide

A  Amdahl's Law (Speedup Formula)
Bad news – Sequential overhead will kill you, because:
Speedup = T1/Tp ≤ 1/[f + (1 – f)/p] ≤ min(1/f, p)
Moral: For f = 0.1, speedup is at best 10, regardless of peak OPS.

B  Brent's Scheduling Theorem
Good news – Optimal scheduling is very difficult, but even a naive scheduling algorithm can ensure:
T1/p ≤ Tp < T1/p + T∞ = (T1/p)[1 + p/(T1/T∞)]
Result: For a reasonably parallel task (large T1/T∞), or for a suitably small p (say, p < T1/T∞), good speedup and efficiency are possible.

C  Cost-Effectiveness Adage
Real news – The most cost-effective parallel solution may not be the one with the highest peak OPS (communication?), greatest speedup (at what cost?), or best utilization (hardware busy doing what?).
Analogy: Mass transit might be more cost-effective than private cars even if it is slower and leads to many empty seats.
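As a quick numeric check of the two bounds above, the sketch below evaluates Amdahl's speedup limit and Brent's scheduling bounds for sample values; the chosen parameters (f = 0.1, p = 100, T1 = 10000, T∞ = 50) are assumptions for illustration.

```cpp
// Worked check of Amdahl's law and Brent's scheduling theorem.
#include <algorithm>
#include <cstdio>

// Amdahl: speedup <= 1 / (f + (1 - f)/p) <= min(1/f, p)
double amdahl_bound(double f, int p) { return 1.0 / (f + (1.0 - f) / p); }

int main() {
    double f = 0.1;           // sequential fraction (assumed)
    int p = 100;              // number of processors (assumed)
    std::printf("Amdahl bound: %.2f (cap: %.2f)\n",
                amdahl_bound(f, p), std::min(1.0 / f, double(p)));

    // Brent: T1/p <= Tp < T1/p + Tinf, for any greedy schedule.
    double T1 = 10000.0, Tinf = 50.0;   // total work and critical-path time (assumed)
    std::printf("Brent bounds on Tp: [%.1f, %.1f)\n", T1 / p, T1 / p + Tinf);
    return 0;
}
```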

Gordon Bell Prize for High-Performance Computing

Established in 1988 by Gordon Bell, one of the pioneers in the field.
Prizes are given annually in several categories (there are small cash awards, but the prestige of winning is the main motivator).
The two key categories are as follows; others have varied over time:
Peak performance: the entrant must convince the judges that the submitted program runs faster than any other comparable application.
Price/performance: the entrant must show that performance of the application divided by the list price of the smallest system needed to achieve the reported performance is better than that of any other entry.

Different hardware/software combinations have won the competition.
Supercomputer performance improvement has outpaced Moore's law; however, progress shows signs of slowing down due to lack of funding.
In the past few years, improvements have been due to faster uniprocessors (2004 NRC report, Getting up to Speed: The Future of Supercomputing).

Recent Gordon Bell Prizes, November 2008

Performance in a scientific application (superconductor simulation):
Oak Ridge National Lab, Cray XT Jaguar, 150,000 cores, 1.35 PFLOPS
http://www.supercomputingonline.com/article.php?sid=16648

Special prize for algorithm innovation:
Lawrence Berkeley National Lab, harnessing the potential of nanostructures
http://newscenter.lbl.gov/press-releases/2008/11/24/berkeley-lab-team-wins-special-acm-gordon-bell-prize-for-algorithm-innovation/
