ΗΜΥ 656 ADVANCED COMPUTER ARCHITECTURE

Spring Semester 2007

LECTURE 7: Parallel Computer Systems

ΧΑΡΗΣ ΘΕΟΧΑΡΙΔΗΣ ([email protected])

Ack: Parallel Computer Architecture: A Hardware/Software Approach, David E. Culler et al., Morgan Kaufmann

What is Parallel Architecture?

• A parallel computer is a collection of processing elements that cooperate to solve large problems fast

• Some broad issues: – Resource Allocation:

• how large a collection?

• how powerful are the elements?

• how much memory?

– Data access, Communication and Synchronization

• how do the elements cooperate and communicate?

• how are data transmitted between processors?

• what are the abstractions and primitives for cooperation?

– Performance and Scalability

• how does it all translate into performance?

• how does it scale?

Why Study Parallel Architecture?

• Role of a computer architect:
– To design and engineer the various levels of a computer system to maximize performance and programmability within limits of technology and cost.
• Parallelism:
– Provides alternative to faster clock for performance
– Applies at all levels of system design
– Is a fascinating perspective from which to view architecture

– Is increasingly central in information processing

Why Study it Today?

• History: diverse and innovative organizational structures, often tied to novel programming models

• Rapidly maturing under strong technological constraints
– The “killer micro” is ubiquitous
– Laptops and supercomputers are fundamentally similar!
– Technological trends cause diverse approaches to converge
• Technological trends make parallel computing inevitable
– In the mainstream
• Need to understand fundamental principles and design tradeoffs, not just taxonomies

– Naming, Ordering, Replication, Communication performance

Inevitability of Parallel Computing

• Application demands: our insatiable need for cycles
– Scientific computing: CFD, Biology, Chemistry, Physics, ...
– General-purpose computing: Video, Graphics, CAD, Databases, TP, ...
• Technology trends
– Number of transistors on chip growing rapidly
– Clock rates expected to go up only slowly
• Architecture trends
– Instruction-level parallelism valuable but limited
– Coarser-level parallelism, as in MPs, the most viable approach
• Economics
• Current trends:
– Today’s microprocessors have multiprocessor support
– Servers and even PCs becoming MP: Sun, SGI, COMPAQ, Dell, ...
– Tomorrow’s microprocessors are multiprocessors

Application Trends

• Demand for cycles fuels advances in hardware, and vice-versa

– Cycle drives exponential increase in microprocessor performance
– Drives parallel architecture harder: most demanding applications
• Range of performance demands
– Need range of system performance with progressively increasing cost
– Platform pyramid
• Goal of applications in using parallel machines: Speedup

  Speedup (p processors) = Performance (p processors) / Performance (1 processor)

• For a fixed problem size (input data set), performance = 1/time

  Speedup_fixed (p processors) = Time (1 processor) / Time (p processors)
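For example, if a fixed-size problem takes 100 seconds on 1 processor and 25 seconds on 8 processors, the speedup is 100/25 = 4 (a parallel efficiency of 4/8 = 50%).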

Scientific Computing Demand

Summary of Application Trends

• Transition to parallel computing has occurred for scientific and engineering computing
• In rapid progress in commercial computing
– Database and transactions as well as financial
– Usually smaller-scale, but large-scale systems also used
• Desktop also uses multithreaded programs, which are a lot like parallel programs
• Demand for improving throughput on sequential workloads
– Greatest use of small-scale multiprocessors
• Solid application demand exists and will increase

Technology Trends

[Figure: performance (log scale, 0.1 to 100) of supercomputers, mainframes, minicomputers, and microprocessors, 1965-1995.]

• Commodity microprocessors have caught up with supercomputers.

Architectural Trends

• Architecture translates technology’s gifts to performance and capability

• Resolves the tradeoff between parallelism and locality – Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect

– Tradeoffs may change with scale and technology advances
• Understanding microprocessor architectural trends
– Helps build intuition about design issues of parallel machines
– Shows fundamental role of parallelism even in “sequential” computers

• Four generations of architectural history: tube, transistor, IC, VLSI
– Here focus only on VLSI generation
• Greatest delineation in VLSI has been in type of parallelism exploited

Arch. Trends: Exploiting Parallelism

• Greatest trend in VLSI generation is increase in parallelism – Up to 1985: bit level parallelism: 4-bit -> 8 bit -> 16-bit

• slows after 32 bit
• adoption of 64-bit now under way, 128-bit far off (not a performance issue)
• great inflection point when 32-bit micro and cache fit on a chip
– Mid 80s to mid 90s: instruction level parallelism

• pipelining and simple instruction sets, + compiler advances (RISC)
• on-chip caches and functional units => superscalar execution
• greater sophistication: out of order execution, speculation, prediction
– to deal with control transfer and latency problems

– Next step: thread level parallelism

Phases in VLSI Generation

[Figure: transistors per microprocessor (log scale, 1,000 to 100,000,000) versus year, 1970-2005, from the i4004, i8008, i8080, and i8086 through the i80286, i80386, R2000, R3000, Pentium, and R10000; the phases are labeled bit-level parallelism (to the mid-80s), instruction-level parallelism (mid-80s to mid-90s), and thread-level parallelism (?).]

• How good is instruction-level parallelism?
• Thread-level needed in microprocessors?

Architectural Trends: ILP

• Reported speedups for superscalar processors:
– Horst, Harris, and Jardine [1990] ...... 1.37
– Wang and Wu [1988] ...... 1.70
– Smith, Johnson, and Horowitz [1989] ...... 2.30
– Murakami et al. [1989] ...... 2.55
– Chang et al. [1991] ...... 2.90
– Jouppi and Wall [1989] ...... 3.20
– Lee, Kwok, and Briggs [1991] ...... 3.50
– Wall [1991] ...... 5
– Melvin and Patt [1991] ...... 8
– Butler et al. [1991] ...... 17+
• Large variance due to differences in
– application domain investigated (numerical versus non-numerical)

– capabilities of processor modeled

ILP Ideal Potential

[Figure: (left) fraction of total cycles (%) versus number of instructions issued per cycle (0 to 6+); (right) speedup (up to about 3) versus instructions issued per cycle (0 to 15).]

• Infinite resources and fetch bandwidth, perfect branch prediction and renaming

– real caches and non-zero miss latencies

Results of ILP Studies

[Figure: speedup (1x to 4x) with perfect branch prediction versus 1 branch unit/real prediction, for the studies Jouppi_89, Smith_89, Murakami_89, Chang_91, Butler_91, and Melvin_91.]

• Concentrate on parallelism for 4-issue machines
• Realistic studies show only 2-fold speedup
• Recent studies show that for more parallelism, one must look across threads

Architectural Trends: Bus-based MPs

• Micro on a chip makes it natural to connect many to shared memory
– dominates server and enterprise market, moving down to desktop
• Faster processors began to saturate bus, then bus technology advanced
– today, range of sizes for bus-based systems, desktop to large servers

[Figure: number of processors in fully configured commercial bus-based shared-memory systems, 1984-1998; from the Sequent B8000/Symmetry and SGI PowerSeries through the SGI Challenge/PowerChallenge, Sun SC2000/SS1000, AS8400, HP K400, and P-Pro, up to the CRAY CS6400 and Sun E10000 (64 processors).]

Bus Bandwidth

[Figure: shared bus bandwidth (MB/s, log scale 10 to 100,000) of commercial bus-based systems, 1984-1998; from the Sequent B8000/B2100 and SGI PowerSeries through the Symmetry, SS690MP, SC2000, SGI Challenge, AS8400, and HP K400, up to the SGI PowerChallenge XL, Sun E6000, and Sun E10000.]

Economics

• Commodity microprocessors are not only fast but CHEAP
• Development cost is tens of millions of dollars ($5-100 million typical)
• BUT, many more are sold compared to supercomputers
– Crucial to take advantage of the investment, and use the commodity building block

– Exotic parallel architectures are no more than special-purpose machines

• Multiprocessors being pushed by software vendors (e.g. database) as well as hardware vendors

• Standardization by Intel makes small, bus-based SMPs commodity

• Desktop: few smaller processors versus one larger one? – Multiprocessor on a chip

History

• Historically, parallel architectures tied to programming models – Divergent architectures, with no predictable pattern of growth.

[Figure: divergent architectures; Application Software and System Software sit atop an Architecture layer that branches into Systolic Arrays, SIMD, Message Passing, Dataflow, and Shared Memory.]

Uncertainty of direction paralyzed parallel software development!

Today

• Extension of “computer architecture” to support communication and cooperation – OLD: Instruction Set Architecture – NEW: Communication Architecture

• Defines – Critical abstractions, boundaries, and primitives (interfaces) – Organizational structures that implement interfaces (hw or sw)

• Compilers, libraries and OS are important bridges today

History

• “Mainframe” approach:
– Motivated by multiprogramming
– Extends crossbar used for mem bw and I/O
– Originally processor cost limited to small scale
• later, cost of crossbar
– Bandwidth scales with p
– High incremental cost; use multistage instead

• “Minicomputer” approach:
– Almost all microprocessor systems have bus
– Motivated by multiprogramming, TP
– Used heavily for parallel computing
– Called symmetric multiprocessor (SMP)
– Latency larger than for uniprocessor
– Bus is bandwidth bottleneck
• caching is key: coherence problem

– Low incremental cost

Example: Intel Pentium Pro Quad

[Figure: Intel Pentium Pro Quad; four P-Pro modules (each with CPU, interrupt controller, 256-KB L2 $, and bus interface) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz), with a memory controller and MIU to 1-, 2-, or 4-way interleaved DRAM, and two PCI bridges to PCI buses with I/O cards.]

– All coherence and multiprocessing glue in processor module
– Highly integrated, targeted at high volume

– Low latency and bandwidth

Example: Sun Enterprise

[Figure: Sun Enterprise; CPU/mem cards (each with two processors, their caches and L2 $, memory, and a memory controller) and I/O cards (bus interface to three SBUSes, 100bT, SCSI, and two FiberChannel) share a bus interface/switch onto the Gigaplane bus (256-bit data, 41-bit address, 83 MHz).]

– 16 cards of either type: processors + memory, or I/O – All memory accessed over bus, so symmetric – Higher bandwidth, higher latency bus

Scaling Up

[Figure: two scalable organizations; “dance hall”: processors with caches ($) on one side of the network and memories (M) on the other; distributed memory: each node pairs a processor, cache, and memory on the network.]

– Problem is interconnect: cost (crossbar) or bandwidth (bus)
– Dance-hall: bandwidth still scalable, but lower cost than crossbar

• latencies to memory uniform, but uniformly large

– Distributed memory or non-uniform memory access (NUMA)

• Construct shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response) – Caching shared (particularly nonlocal) data?

Example: Cray T3E

[Figure: Cray T3E node; processor (P) with cache ($), memory, and a memory controller with network interface (NI), connected to the 3-D interconnect through X, Y, and Z switches; external I/O attaches to the node.]

– Scale up to 1024 processors, 480MB/s links

– Memory controller generates comm. request for nonlocal references – No hardware mechanism for coherence (SGI Origin etc. provide this)

Message Passing Architectures

• Complete computer as building block, including I/O – Communication via explicit I/O operations

• Programming model: – directly access only private address space (local memory) – communicate via explicit messages (send/receive )

• High-level block diagram similar to distributed-mem SAS – But comm. integrated at IO level, need not put into memory system

– Like networks of workstations (clusters), but tighter integration

– Easier to build than scalable SAS

• Programming model further from basic hardware ops – Library or OS intervention

Message Passing Abstraction

[Figure: process P executes Send X, Q, t, transmitting the data at local address X in its address space to process Q with tag t; process Q executes Receive Y, P, t, matching on sender P and tag t and depositing the data at local address Y in its address space.]

– Send specifies buffer to be transmitted and receiving process
– Recv specifies sending process and application storage to receive into

– Memory to memory copy, but need to name processes

– Optional tag on send and matching rule on receive

– User process names local data and entities in process/tag space too

– In simplest form, the send/recv match achieves pairwise synch event

• Other variants too

– Many overheads: copying, buffer management, protection
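A minimal sketch of this abstraction using MPI in C (illustrative only; the buffer names and tag value are made up, not from the slides):

/* process 0 sends a value to process 1; the (source, tag) pair implements the matching rule */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, tag = 7;
    double x = 3.14, y = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* "Send X, Q, t": local buffer x, destination process 1, tag t */
        MPI_Send(&x, 1, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* "Receive Y, P, t": application storage y, matching on source 0 and tag t */
        MPI_Recv(&y, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %f\n", y);
    }

    MPI_Finalize();
    return 0;
}

(Run with two processes, e.g. mpirun -np 2 ./a.out; the blocking MPI_Send/MPI_Recv pair provides the pairwise synchronization event described above.)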

Evolution of Message Passing

• Early machines: FIFO on each link
– Hardware close to programming model
• synchronous ops
– Replaced by DMA, enabling non-blocking ops
• Buffered by system at destination until recv

• Diminishing role of topology
– Store & forward routing: topology important
– Introduction of pipelined routing made it less so
– Cost is in node-network interface
– Simplifies programming

[Figure: a 3-cube (hypercube) with nodes labeled 000 through 111.]

Example: IBM SP-2

[Figure: IBM SP-2 node; Power 2 CPU with L2 $ on a memory bus to a memory controller and 4-way interleaved DRAM; the NIC (with i860 processor, NI, DMA, and its own DRAM) sits on the MicroChannel I/O bus; nodes are connected by a general interconnection network formed from 8-port switches.]

– Made out of essentially complete RS6000 workstations – Network interface integrated in I/O bus (bw limited by I/O bus)

Example: Intel Paragon

[Figure: Intel Paragon node; two i860 processors with L1 $ on a 64-bit, 50 MHz memory bus, with a memory controller, 4-way interleaved DRAM, DMA, driver, and NI; nodes attach to a 2D grid network (8-bit bidirectional links at 175 MHz), one processing node per switch. Shown: Sandia's Intel Paragon XP/S-based supercomputer.]

Toward Architectural Convergence

• Evolution and role of software have blurred the boundary
– Send/recv supported on SAS machines via buffers
– Can construct global address space on MP using hashing
– Page-based (or finer-grained) shared virtual memory
• Hardware organization converging too
– Tighter NI integration even for MP (low-latency, high-bandwidth)
– At lower level, even hardware SAS passes hardware messages
• Even clusters of workstations/SMPs are parallel systems
– Emergence of fast system area networks (SAN)
• Programming models distinct, but organizations converging
– Nodes connected by general network and communication assists
– Implementations also converging, at least in high-end machines

Data Parallel Systems

• Programming model: – Operations performed in parallel on each element of data structure

– Logically single thread of control, performs sequential or parallel steps

– Conceptually, a processor associated with each data element • Architectural model: – Array of many simple, cheap processors with little memory each

• Processors don’t sequence through instructions

– Attached to a control processor that issues instructions

– Specialized and general communication, cheap global synchronization

• Original motivation:
– Matches simple differential equation solvers
– Centralize high cost of instruction fetch & sequencing

[Figure: a control processor broadcasting instructions to a grid of PEs.]

Application of Data Parallelism

– Each PE contains an employee record with his/her salary

  If salary > 100K then
      salary = salary * 1.05
  else
      salary = salary * 1.10

– Logically, the whole operation is a single step
– Some processors enabled for the arithmetic operation, others disabled
(a sketch of this update on a conventional multiprocessor follows below)

• Other examples:
– Finite differences, linear algebra, ...
– Document searching, graphics, image processing, ...
• Some recent machines:
– Thinking Machines CM-1, CM-2 (and CM-5)
– Maspar MP-1 and MP-2
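A minimal sketch of the salary update as a data-parallel loop on a conventional shared-memory multiprocessor (SPMD-style with OpenMP rather than a true SIMD PE array; the array name and contents are illustrative):

#include <stdio.h>

#define NUM_EMPLOYEES 8

int main(void) {
    double salary[NUM_EMPLOYEES] = {50e3, 120e3, 90e3, 200e3, 75e3, 150e3, 99e3, 101e3};

    /* Logically a single step over all elements; the condition plays the
       role of enabling/disabling each PE for the arithmetic operation. */
    #pragma omp parallel for
    for (int i = 0; i < NUM_EMPLOYEES; i++) {
        if (salary[i] > 100e3)
            salary[i] *= 1.05;
        else
            salary[i] *= 1.10;
    }

    for (int i = 0; i < NUM_EMPLOYEES; i++)
        printf("%.2f\n", salary[i]);
    return 0;
}

(Compile with an OpenMP-capable compiler, e.g. gcc -fopenmp; without the flag the pragma is ignored and the loop runs sequentially with the same result.)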

Parallel Computer Architectures

• Flynn’s Classification • Legacy Parallel Computers • Current Parallel Computers • Trends in Supercomputers • Converged Parallel Computer Architecture

Parallel Architectures Tied to Programming Models

Divergent architectures, no predictable pattern of growth

[Figure: Application Software and System Software sit atop an Architecture layer that branches into Systolic Arrays, SIMD, Dataflow, Message Passing, and Shared Memory.]

Uncertainty of direction paralyzed parallel software development!

Flynn’s Classification

• Based on notions of instruction and data streams (1972) – SISD (Single Instruction stream over a Single Data stream ) – SIMD (Single Instruction stream over Multiple Data streams )

– MISD (Multiple Instruction streams over a Single Data stream)

– MIMD (Multiple Instruction streams over Multiple Data streams)

• Popularity – MIMD > SIMD > MISD

SISD (Single Instruction Stream Over A Single Data Stream )

• SISD – Conventional sequential machines

IS : Instruction Stream DS : Data Stream CU : Control Unit PU : Processing Unit MU : Memory Unit

[Figure: SISD; a control unit (CU) issues an instruction stream (IS) to a processing unit (PU), which exchanges a data stream (DS) with the memory unit (MU) and I/O.]

SIMD (Single Instruction Stream Over Multiple Data Streams)

• SIMD
– Vector computers
– Special purpose computations

PE : Processing Element  LM : Local Memory

[Figure: SIMD architecture with distributed memory; a control unit (CU) broadcasts the instruction stream (IS) to processing elements PE1 ... PEn, each with its own local memory LM1 ... LMn; the program is loaded from the host and the data sets are loaded from the host into the local memories.]

MISD (Multiple Instruction Streams Over A Single Data Stream)

• MISD
– Processor arrays, systolic arrays
– Special purpose computations

[Figure: MISD architecture (the systolic array); a memory holding program and data feeds control units CU1, CU2, ..., CUn, each issuing its own instruction stream to its processing unit; a single data stream (DS) passes from PU1 to PU2 to ... PUn and back to I/O.]

MIMD (Multiple Instruction Streams Over Multiple Data Streams)

• MIMD
– General purpose parallel computers

[Figure: MIMD architecture with shared memory; control units CU1 ... CUn each issue their own instruction stream to processing units PU1 ... PUn, which exchange data streams with a shared memory and I/O.]

Dataflow Architectures

• Represent computation as a graph of essential dependences – Logical processor at each node, activated by availability of operands

– Message (tokens) carrying tag of next instruction sent to next processor

– Tag compared with others in matching store; match fires execution

Example: a = (b + 1) × (b − c); d = c × e; f = a × d

[Figure: the corresponding dataflow graph; nodes for +, −, and × are connected by their data dependences, with the results a and d feeding the final × that produces f.]

[Figure: a dataflow processing element; the network delivers tokens to a token store and a waiting-matching unit; a match triggers instruction fetch from the program store, execution, and formation of new tokens, which re-enter the network through a token queue.]
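A hedged sketch of the same graph expressed with OpenMP task dependences, so that each "node" fires as soon as its operands are available (an illustrative mapping onto a conventional multiprocessor, not a model of a dataflow machine; the input values are made up):

#include <stdio.h>

int main(void) {
    double b = 2.0, c = 3.0, e = 4.0;
    double a, d, f;

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)    /* fires when b and c are available */
        a = (b + 1.0) * (b - c);

        #pragma omp task depend(out: d)    /* no dependence on a: may run in parallel */
        d = c * e;

        #pragma omp task depend(in: a, d)  /* fires only after both operand tokens arrive */
        f = a * d;

        #pragma omp taskwait
        printf("f = %f\n", f);
    }
    return 0;
}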

Convergence: General Parallel Architecture

• A generic modern multiprocessor

[Figure: nodes, each containing one or more processors (P), cache ($), memory (Mem), and a communication assist (CA), connected by a scalable network.]

• Node: processor(s), memory system, plus communication assist
– Network interface and communication controller
• Scalable network
• Convergence allows lots of innovation, now within framework
– Integration of assist with node, what operations, how efficiently...

Data Flow vs. Control Flow

• Control-flow computer – Program control is explicitly controlled by instruction flow in program – Basic components • PC (Program Counter) • Shared memory • Control sequencer

Data Flow vs. Control Flow (Cont’d)

• Advantages
– Full control
– Complex data and control structures are easily implemented
• Disadvantages
– Less efficient
– Difficult to program
– Difficult to prevent runtime errors

Data Flow vs. Control Flow (Cont’d)

• Data-flow computer – Execution of instructions is driven by data availability – Basic components • Data are directly held inside instructions • Data availability check unit • Token matching unit • Chain reaction of asynchronous instruction executions

Data Flow vs. Control Flow (Cont’d)

• Advantages
– Very high potential for parallelism
– High throughput
– Free from side effects
• Disadvantages
– Time lost waiting for unneeded arguments
– High control overhead
– Difficult to manipulate data structures

Dataflow Machine (Manchester Dataflow Computer)

[Figure: Manchester Dataflow Computer; an I/O switch connects to and from the host; tokens flow through a token queue into the matching unit (backed by an overflow unit), then to the instruction store, and on to the functional units func1 ... funck, whose result tokens return through the network.]

First actual hardware implementation

Execution on Control Flow Machines

Assume all the external inputs are available before entering the do loop; + : 1 cycle, * : 2 cycles, / : 3 cycles.

[Figure: sequential execution on a uniprocessor takes 24 cycles.]

Execution On A Data Flow Machine

[Figure: parallel execution on a shared-memory 4-processor system takes 7 cycles (using partial sums such as s1 = b1+b2, t1 = s1+b3, s2 = b3+b4, t2 = s1+s2); data-driven execution on a 4-processor dataflow computer takes 9 cycles.]

Problems

• Excessive copying of large data structures in dataflow operations – I-structure : a tagged memory unit for overlapped usage by the producer and consumer • Retreat from pure dataflow approach (shared memory) • Handling complex data structures • Chain reaction control is difficult to implement – Complexity of matching store and memory units • Expose too much parallelism (?)

Convergence of Dataflow Machines

• Converged to use conventional processors and memory – Support for large, dynamic set of threads to map to processors • Operations have locality across them, useful to group together

– Typically shared address space as well
– But separation of programming model from hardware (like data-parallel)

Contributions

• Integration of communication with thread (handler) generation
• Tightly integrated communication and fine-grained synchronization
– Each instruction represents a synchronization operation
– Absorb the communication latency and minimize the losses due to synchronization waits

Systolic Architectures

– Replace single processor with array of regular processing elements – Orchestrate data flow for high throughput with less memory access

[Figure: memories (M) at both ends feeding a linear array of PEs.]

• Different from pipelining:
– Nonlinear array structure, multidirection data flow, each PE may have (small) local instruction and data memory

• Different from SIMD: each PE may do something different • Initial motivation: VLSI enables inexpensive special-purpose chips

• Represent algorithms directly by chips connected in regular pattern

Systolic Arrays (Cont)

• Example: Systolic array for 1-D convolution

[Figure: systolic array for 1-D convolution; the inputs x(i+1), x(i), x(i−1), ..., x(i−k) stream past PEs holding the weights W(1), W(2), ..., W(k), producing the outputs y(i), y(i+1), ..., y(i+k), y(i+k+1).]

  y(i) = Σ_{j=1}^{k} w(j) · x(i−j)

– Practical realizations (e.g. iWARP) use quite general processors
• Enable variety of algorithms on same hardware
– But dedicated interconnect channels
• Data transfer directly from register to register across channel
– Specialized, and same problems as SIMD
• General purpose systems work well for same algorithms (locality etc.)
(a plain C reference computation of the convolution is sketched below)
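A hedged sketch in plain C of the convolution the array computes, y(i) = Σ_{j=1..k} w(j)·x(i−j); it produces the same values sequentially with explicit memory accesses (array names, sizes, and test data are illustrative, not from the slides):

#include <stdio.h>

#define K 4                  /* number of weights / PEs        */
#define N 16                 /* index range of outputs         */

int main(void) {
    double w[K + 1] = {0, 0.25, 0.5, 0.75, 1.0};   /* weights w[1..K]       */
    double x[N + 1];                               /* input samples x[0..N] */
    double y[N + 1];                               /* outputs y[K..N]       */

    for (int i = 0; i <= N; i++)
        x[i] = (double)i;                          /* arbitrary test signal */

    for (int i = K; i <= N; i++) {                 /* need x[i-1] .. x[i-K] */
        y[i] = 0.0;
        for (int j = 1; j <= K; j++)               /* one term per PE/weight */
            y[i] += w[j] * x[i - j];
        printf("y(%d) = %g\n", i, y[i]);
    }
    return 0;
}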

Systolic Architectures

• Orchestrate data flow for high throughput with less memory access

• Different from pipelining – Nonlinear array structure, multidirection data flow, each PE may have (small) local instruction and data memory

• Different from SIMD – Each PE may do something different • Initial motivation – VLSI enables inexpensive special-purpose chips – Represent algorithms directly by chips connected in regular pattern

Systolic Architectures

[Figure: conventional organization; memory (M) feeding a single PE; systolic array: memory (M) feeding a linear array of PEs.]

Replace a processing element (PE) with an array of PEs without increasing I/O bandwidth

Two Communication Styles

[Figure: systolic communication; CPUs pass data directly to neighboring CPUs, each CPU also having its own local memory; memory communication: CPUs exchange data only through their local memories.]

Characteristics

• Practical realizations (e.g. Intel iWARP) use quite general processors
– Enable variety of algorithms on same hardware
• But dedicated interconnect channels
– Data transfer directly from register to register across channel
• Specialized, and same problems as SIMD
– General purpose systems work well for same algorithms (locality etc.)

Matrix Multiplication

Use the following PE to compute

  y_i = Σ_{j=1}^{n} a_ij · x_j,   i = 1, ..., n

PE behavior (weight w, inputs xin and yin, outputs xout and yout):
  xout = x
  x = xin
  yout = yin + w * xin

Recursive algorithm:
  for i = 1 to n
      y(i,0) = 0
      for j = 1 to n
          y(i,0) = y(i,0) + a(i,j) * x(j,0)

(a plain C sketch of this recurrence follows)
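A hedged sketch in plain C of the per-PE recurrence yout = yin + w · xin accumulated over the streamed x values, i.e. an ordinary matrix-vector product computed the way the array does, one partial sum per step (N and the test data are illustrative):

#include <stdio.h>

#define N 4

int main(void) {
    double a[N][N] = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}, {13, 14, 15, 16}};
    double x[N] = {1, 1, 1, 1};
    double y[N] = {0};                     /* y(i,0) = 0, as in the recursive algorithm */

    for (int i = 0; i < N; i++)            /* one PE (one row of A) per i             */
        for (int j = 0; j < N; j++)        /* x(j) streams past PE i                  */
            y[i] = y[i] + a[i][j] * x[j];  /* yout = yin + w * xin, with w = a(i,j)   */

    for (int i = 0; i < N; i++)
        printf("y[%d] = %g\n", i, y[i]);
    return 0;
}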

Systolic Array Representation of Matrix Multiplication

[Figure: a linear systolic array; the inputs x1, x2, x3, x4 stream in from one side while the accumulators y1, y2, y3, y4, initialized to 0, collect the partial sums.]

Example of Convolution

PE behavior:
  xout = x
  x = xin
  yout = yin + w * xin

y(i) = w1 * x(i) + w2 * x(i+1) + w3 * x(i+2) + w4 * x(i+3)
y(i) is initialized as 0.

[Figure: the inputs x1 ... x8 stream (in two interleaved waves, x8 x6 x4 x2 and x7 x5 x3 x1) through PEs holding the weights w4, w3, w2, w1, producing the outputs y1, y2, y3.]

Data Parallel Systems

• Programming model
– Operations performed in parallel on each element of data structure
– Logically single thread of control, performs sequential or parallel steps
– Conceptually, a processor associated with each data element

Data Parallel Systems (Cont’d)

• SIMD Architectural model
– Array of many simple, cheap processors with little memory each
• Processors don’t sequence through instructions
– Attached to a control processor that issues instructions
– Specialized and general communication, cheap global synchronization

Evolution & Convergence

• Popular due to cost savings of centralized sequencer
– Replaced by vector processors in mid-70s
– Revived in mid-80s when 32-bit datapath slices fit on a chip
– Old machines
• Thinking Machines CM-1, CM-2
• Maspar MP-1 and MP-2

Evolution & Convergence (Cont’d)

• Drawbacks – Low applicability – Simple, regular applications can do well anyway in MIMD • Convergence: SPMD (Single Program Multiple Data) – Fast global synchronization is needed

Vector Processors

• Merits of vector processor – Very deep pipeline without data hazard • The computation of each result is independent of the computation of previous results

– Instruction bandwidth requirement is reduced • A vector instruction specifies a great deal of work – Control hazards are nonexistent • A vector instruction represents an entire loop. • No loop branch

Vector Processors (Cont’d)

– The high latency of initiating a main memory access is amortized • A single access is initiated for the entire vector rather than a single word

– Known access pattern – Interleaved memory banks

• A vector operation is faster than a sequence of scalar operations on the same number of data items!

Vector Programming Example

Y = a * X + Y

        LD    F0, a
        ADDI  R4, Rx, #512     ; last address to load (8 bytes * 64 elements)
Loop:   LD    F2, 0(Rx)        ; load X(i)
        MULTD F2, F0, F2       ; a x X(i)
        LD    F4, 0(Ry)        ; load Y(i)
        ADDD  F4, F2, F4       ; a x X(i) + Y(i)
        SD    F4, 0(Ry)        ; store into Y(i)
        ADDI  Rx, Rx, #8       ; increment index to X
        ADDI  Ry, Ry, #8       ; increment index to Y
        SUB   R20, R4, Rx      ; compute bound
        BNZ   R20, Loop        ; check if done

The loop body repeats 64 times (RISC machine). A plain C equivalent is sketched below.
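A minimal sketch in C of the DAXPY computation (Y = a*X + Y over 64 double-precision elements) that both the scalar loop above and the vector code below implement; the function and variable names are illustrative:

/* Y[i] = a * X[i] + Y[i] for 64 elements (8 bytes * 64 = 512, the address bound above) */
void daxpy64(double a, const double *X, double *Y) {
    for (int i = 0; i < 64; i++)
        Y[i] = a * X[i] + Y[i];
}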

Vector Programming Example (Cont’d)

Y = a * X + Y

        LD     F0, a      ; load scalar a
        LV     V1, Rx     ; load vector X
        MULTSV V2, F0, V1 ; vector-scalar multiply
        LV     V3, Ry     ; load vector Y
        ADDV   V4, V2, V3 ; add
        SV     Ry, V4     ; store the result

6 instructions (low instruction bandwidth) on the vector machine

Basic Vector Architecture

• Vector-register processor – All vector operations except load and store are among the vector registers – The major vector computers • Memory-memory vector processor – All vector operations are memory to memory – The first vector computer

A Vector-Register Architecture (DLXV)

[Figure: DLXV vector-register architecture; main memory feeds a vector load-store unit; the vector registers and scalar registers connect through crossbars to the pipelined FP functional units (FP add/subtract, etc.).]

Vector Machines

Machine          Registers   Elements per register   Load-store units   Functional units
CRAY-1           8           64                      1                  6
Fujitsu VP200    8-256       32-1024                 2                  3
CRAY X-MP        8           64                      2 Ld / 1 St        8
Hitachi S820     32          256                     4                  4
NEC SX/2         8 + 8192    256                     8                  16
Convex C-1       8           128                     1                  4
CRAY-2           8           64                      1                  5
CRAY Y-MP        8           64                      2 Ld / 1 St        8
CRAY C-90        8           128                     4                  8
NEC SX/4         8 + 8192    256                     8                  16

Convoy

• Convoy
– The set of vector instructions that could begin execution in one clock period
– The instructions in a convoy must not contain any structural or data hazards
• Chime
– An approximate measure of execution time for a vector sequence

Convoy Example

        LV     V1, Rx     ; load vector X
        MULTSV V2, F0, V1 ; vector-scalar multiply
        LV     V3, Ry     ; load vector Y
        ADDV   V4, V2, V3 ; add
        SV     Ry, V4     ; store the result

Convoys:
  1. LV
  2. MULTSV  LV
  3. ADDV
  4. SV
Chime = 4

Strip Mining

• Strip mining
– When a vector has a length greater than that of the vector registers, segment the long vector into fixed-length segments (size = MVL, the maximum vector length).

Total execution time:  Tn = ceil(n/MVL) * (Tloop + Tstart) + n * Tchime

Example:  T200 = ceil(200/64) * (15 + 49) + 200 * 4 = 4 * 64 + 800 = 1056 cycles

Strip Mining Example

      do 10 i = 1, n
  10  Y(i) = a * X(i) + Y(i)

      low = 1
      VL = (n mod MVL)          ; find the odd-size piece
      do 1 j = 0, (n/MVL)       ; outer loop
        do 10 i = low, low+VL-1 ; runs for length VL
          Y(i) = a*X(i) + Y(i)  ; main operation
  10    continue
        low = low + VL          ; start of next vector
        VL = MVL                ; reset the length to max
  1   continue
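A minimal sketch of the same strip-mining structure in C, applied to the DAXPY loop (MVL is illustrative; a real machine would set the hardware vector-length register to vl for each strip):

#define MVL 64

void daxpy_stripmined(int n, double a, const double *X, double *Y) {
    int low = 0;
    int vl = n % MVL;                        /* odd-size piece first */
    for (int j = 0; j <= n / MVL; j++) {     /* outer strip loop     */
        for (int i = low; i < low + vl; i++) /* runs for length vl   */
            Y[i] = a * X[i] + Y[i];
        low += vl;                           /* start of next strip  */
        vl = MVL;                            /* reset length to max  */
    }
}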

Performance Enhancement Techniques

• Chaining
• Conditional execution
• Sparse matrix

Chaining

• Chaining
– Allow a vector operation to start as soon as the individual elements of its vector source operand become available
– Permits the operations to be scheduled in the same convoy and reduces the number of chimes required

Chaining Example

Start-up overheads:
  Load and store unit   12 cycles
  Multiply unit          7 cycles
  Add unit               6 cycles

Vector sequence:
  MULTV V1, V2, V3
  ADDV  V4, V1, V5

Unchained: the ADDV waits for the MULTV to complete: 7 + 64 + 6 + 64 = 141 cycles.
Chained: the ADDV starts as soon as the first MULTV result is available: 7 + 64 + 6 = 77 cycles.

Conditional Execution

• Vector mask register – Any vector instructions executed operate only on the vector elements whose corresponding entries in the vector-mask registers are 1. – Require execution time even when the condition is not satisfied. • The elimination of a branch and the associated control dependences can make a conditional instruction faster.

Conditional Execution Example

      do 10 i = 1, 64
        if (A(i) .ne. 0) then
          A(i) = A(i) - B(i)
        endif
  10  continue

        LV    V1, Ra     ; load vector A into V1
        LV    V2, Rb     ; load vector B
        LD    F0, #0     ; load FP zero into F0
        SNESV F0, V1     ; sets the VM to 1 if V1(i) != F0
        SUBV  V1, V1, V2 ; subtract under the VM
        CVM              ; set the VM to all 1s
        SV    Ra, V1     ; store the result in A

Sparse Matrix

• Sparse matrix
– Only a small number of the elements are non-zero
• Scatter-gather operations using index vectors
– Move between a dense representation and the normal representation of a sparse matrix
– Gather
• Build a dense representation
– Scatter
• Return to the sparse representation

Sparse Matrix Example

LVI : Load Vector Indexed SVI : Store Vector Indexed

      do 10 i = 1, n
  10  A(K(i)) = A(K(i)) + C(M(i))

        LV   Vk, Rk        ; load K
        LVI  Va, (Ra+Vk)   ; load A(K(i)) - gather
        LV   Vm, Rm        ; load M
        LVI  Vc, (Rc+Vm)   ; load C(M(i)) - gather
        ADDV Va, Va, Vc    ; add them
        SVI  (Ra+Vk), Va   ; store A(K(i)) - scatter

Current Parallel Computer Architectures

MIMD
  Multiprocessor (Shared Address Space): PVP, SMP, DSM
  Multicomputer (Message Passing): MPP, Constellation, Cluster

Programming Models

• What does programmer use in coding applications? • Specifies communication and synchronization • Classification – Uniprocessor model: Von Neumann model – Multiprogramming: no comm. and synch. at program level • (ex) CAD

– Shared address space

– Symmetric multiprocessor model – CC-NUMA model – Message passing – Data parallel

Communication Architecture

• User/System interface
– Communication primitives exposed at the user level realize the programming model
• Implementation
– How to implement primitives: HW or OS
– How optimized are they?
– Network structure

Communication Architecture (Cont’d)

• Goals – Performance and cost – Programmability – Scalability – Broad applicability

Shared Address Space Architecture

• Any processor can directly reference any memory location
– Communication occurs implicitly through “loads and stores”
• Natural extension of the uniprocessor model
– Location transparency
– Good throughput on multiprogrammed workloads
• OS uses shared memory to coordinate processes
• Shared memory multiprocessors
– SMP: every processor has equal access to the shared memory, the I/O devices, and the OS services (UMA architecture)
– NUMA: distributed shared memory
(a minimal shared-memory sketch follows)
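A minimal sketch of implicit communication through ordinary loads and stores in a shared address space, using POSIX threads (variable names are illustrative; a flag guarded by a mutex and condition variable stands in for the machine's synchronization primitives):

#include <pthread.h>
#include <stdio.h>

static double shared_value;                 /* lives in the shared address space */
static int ready = 0;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg) {
    pthread_mutex_lock(&m);
    shared_value = 42.0;                    /* an ordinary store communicates the data */
    ready = 1;
    pthread_cond_signal(&c);
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);

    pthread_mutex_lock(&m);
    while (!ready)                          /* wait until the producer has stored it */
        pthread_cond_wait(&c, &m);
    printf("consumer loaded %f\n", shared_value);   /* an ordinary load */
    pthread_mutex_unlock(&m);

    pthread_join(t, NULL);
    return 0;
}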

Shared Address Space Model

[Figure: virtual address spaces of processes P0 ... Pn communicating via shared addresses; the shared portion of each process's address space maps, through loads and stores, onto common physical addresses in the machine's physical address space, while each process's private portion maps onto its own private physical memory.]

SAS History

• “Mainframe” approach – Motivated by multiprogramming – Extends crossbar used for memory bandwidth and I/O – Bandwidth scales with p – High incremental cost; use multistage instead

[Figure: crossbar switch connecting memory modules (Mem) to processors (P, each with a cache) and I/O.]

Minicomputer

• “Minicomputer” approach – Almost all microprocessor systems have bus – Motivated by multiprogramming, TP – Called symmetric multiprocessor (SMP) – Latency larger than for uniprocessor – Bus is bandwidth bottleneck – caching is key: coherence problem – Low incremental cost

Bus Connection

[Figure: memory modules (Mem), processors (P, each with a cache), and I/O modules sharing a single bus.]

Scaling Up

• Problem is interconnect: cost (crossbar) or bandwidth (bus) • Dance-hall: bandwidth still scalable, but lower cost than crossbar

– Latencies to memory uniform, but uniformly large
• Distributed memory or non-uniform memory access (NUMA)
– Construct shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response)

• Caching shared (particularly nonlocal) data?

Organizations

[Figure: two organizations; “dance hall”: processors with caches on one side of the network and memories on the other; distributed memory: each node pairs a processor and cache with its own memory on the network.]

Multiprocessors (Shared Address Space Architecture)

• PVP (Parallel Vector Processor)
– A small number of proprietary vector processors connected by a high-bandwidth crossbar switch
• SMP (Symmetric Multiprocessor)
– A small number of COTS microprocessors connected by a high-speed bus or crossbar switch
• DSM (Distributed Shared Memory)
– Similar to SMP
– The memory is physically distributed among nodes

PVP (Parallel Vector Processor)

VP : Vector Processor  SM : Shared Memory

[Figure: PVP; vector processors (VP) connected to shared-memory modules (SM) through a crossbar switch.]

SMP (Symmetric Multi-Processor)

P/C : Microprocessor and Cache

[Figure: SMP; microprocessor/cache nodes (P/C) connected to shared-memory modules (SM) through a bus or crossbar switch.]

DSM (Distributed Shared Memory)

DIR : Cache Directory

[Figure: DSM; each node has a microprocessor/cache (P/C) on a memory bus (MB) with local memory (LM), a cache directory (DIR), and a NIC; nodes are connected by a custom-designed network.]

MPP (Massively Parallel Processing)

MB : Memory Bus  NIC : Network Interface Circuitry

[Figure: MPP; each node has a microprocessor/cache (P/C) on a memory bus (MB) with local memory (LM) and a NIC; nodes are connected by a custom-designed network.]

Cluster

LD : Local Disk  IOB : I/O Bus

[Figure: Cluster; each node has a microprocessor/cache (P/C) and memory (M) on a memory bus (MB), with a bridge to an I/O bus (IOB) carrying a local disk (LD) and a NIC; nodes are connected by a commodity network (Ethernet, ATM, Myrinet, VIA).]

Constellation

IOC : I/O Controller

[Figure: Constellation; each node is itself a large SMP (>= 16 microprocessor/cache units P/C sharing memory modules SM through a hub, with an I/O controller IOC, local disk LD, and NIC); nodes are connected by a custom or commodity network.]

Trend in Parallel Computer Architectures

[Figure: number of HPC systems (0 to 400) in each class, MPPs, Constellations, Clusters, and SMPs, over the years 1997-2002.]

Food-Chain of High-Performance Computers

Converged Architecture of Current Supercomputers

[Figure: converged architecture; several multiprocessor nodes, each with its own memory and multiple processors (P), connected by an interconnection network.]