
Chapter 1: Perspectives

Copyright © 2005-2008 Yan Solihin

Copyright notice: No part of this publication may be reproduced, stored in a retrieval system, or transmitted by any means (electronic, mechanical, photocopying, recording, or otherwise) without the prior written permission of the author. An exception is granted for academic lectures at universities and colleges, provided that the following text is included in such copy: “Source: Yan Solihin, Fundamentals of Parallel Computer Architecture, 2008”.

Outline for Lecture 1

 Introduction
 Types of parallelism
 Architectural trends
 Why parallel computers?
 Scope of CSC/ECE 506

Evolution in …

[Figure slide: image not recoverable from the source]

Key Points

 More and more components can be integrated on a single chip

 Speed of integration tracks Moore’s law, doubling every 18–24 months.

 Exercise: Look up how the number of transistors per chip has changed, esp. since 2006. Submit here.

 Until recently, performance tracked speed of integration

 At the architectural level, two techniques facilitated this:
   Instruction-level parallelism
   Cache memory

 Performance gain from uniprocessor systems was high enough that multiprocessor systems were not viable for most uses.

Illustration

 100-processor system with perfect speedup
   Compared to a single-processor system:
     Year 1: 100x faster
     Year 2: 62.5x faster
     Year 3: 39x faster
     …
     Year 10: 0.9x faster
   Single-processor performance catches up in just a few years!
 Even worse
   It takes longer to develop a multiprocessor system
     Low volume means prices must be very high
     High prices delay adoption
 Perfect speedup is unattainable
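To see roughly where those numbers come from, here is a small back-of-the-envelope sketch in C. It assumes single-processor performance improves by about 60% per year, a rate chosen only to match the figures above; the slide itself does not state one.

    #include <stdio.h>
    #include <math.h>

    /* Illustrative only: how a 100x advantage (perfect speedup on 100
       processors) erodes if uniprocessor performance grows ~60% per year. */
    int main(void) {
        for (int n = 0; n <= 10; n++) {
            double advantage = 100.0 / pow(1.6, n);
            printf("After %2d years of uniprocessor growth: %.1fx\n", n, advantage);
        }
        return 0;   /* prints 100.0x, 62.5x, 39.1x, ..., ~0.9x after 10 years */
    }

(Link with -lm for pow.)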

Why did uniprocessor performance grow so fast?

 ≈ half from circuit improvement (smaller transistors, faster clock, etc.)
 ≈ half from architecture/organization:

   Instruction-level parallelism (ILP)
     Pipelining: RISC, CISC with RISC back-end
     Superscalar
     Out-of-order execution

   Memory hierarchy (caches)
     Exploit spatial and temporal locality
     Multiple cache levels

But uniprocessor perf. growth is stalling

 Source of uniprocessor performance growth: ILP
   Parallel execution of independent instructions from a single instruction stream

 ILP growth has slowed abruptly
   Memory wall: processor speed grows at 55%/year, while memory speed grows at only 7%/year (see the rough estimate after this list)
   ILP wall: achieving higher ILP requires quadratically increasing complexity (and power)

 Power efficiency
   Thermal packaging limit vs. cost
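Treating the quoted rates as annual multipliers gives a sense of how quickly the processor-memory gap compounds; this is a back-of-the-envelope estimate, not a figure from the slides:

    \[ \text{gap}(n) = \left(\frac{1.55}{1.07}\right)^{n} \approx 1.45^{n}, \qquad \text{gap}(10) \approx 41\times \]

That is, after a decade the gap has grown by roughly a factor of 40, which is why caches and latency tolerance matter so much.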

Types of parallelism

 Instruction level (cf. ECE 521)

 Pipelining

    A (a load)   IF  ID  EX  MEM  WB
    B                IF  ID  EX   MEM  WB
    C                    IF  ID   EX   MEM  WB

Types of parallelism, cont.

 Superscalar / VLIW
   Original code:

    LD   F0, 34(R2)
    ADDD F4, F0, F2
    LD   F7, 45(R3)
    ADDD F8, F7, F6

   Schedule as:

    LD   F0, 34(R2)  |  LD   F7, 45(R3)
    ADDD F4, F0, F2  |  ADDD F8, F7, F6

+ Moderate degree of parallelism (sometimes up to 50)
– Requires fast communication (register level)

Why ILP is slowing

 Branch-prediction accuracy is already > 90%
   Hard to improve it much further

 Pipelines are already deep (≈ 20–30 stages)
   But critical dependence loops do not change
   Memory latency requires more clock cycles to satisfy

 Processor width is already high
   Quadratically increasing complexity to increase the width further

 Cache size
   Effective, but also shows diminishing returns
   In general, the size must be doubled to reduce the miss rate by half

Current trends: multicore and manycore

Aspect            Intel Clovertown    AMD Barcelona          IBM Cell
# cores           4                   4                      8+1
Clock frequency   2.66 GHz            2.3 GHz                3.2 GHz
Core type         OOO superscalar     OOO superscalar        2-issue SIMD
Caches            2x4 MB L2           512 KB L2 (private),   256 KB local store
                                      2 MB L3 (shared)
Chip power        120 watts           95 watts               100 watts

Exercise: Browse the Web for information on more recent processors, and for each processor, fill out this form. (You can view the submissions here.)

Historical perspectives

 80s – early 90s: Prime time for parallel architecture research
   A processor could not fit on a chip, so systems naturally needed multiple chips (and processors)
   J-machine, M-machine, Alewife, Tera, HEP, DASH, etc.
   Exercise: Pick one of these machines, and identify a major innovation that it introduced. Submit here.

 90s: At the low end, uniprocessor systems’ speed grows much faster than parallel systems’ speed
   A microprocessor fits on a chip. So do branch predictors, multiple functional units, large caches, etc.!
   The microprocessor also exploits parallelism (pipelining, multiple issue, VLIW) – techniques originally invented for multiprocessors
   Many parallel computer vendors went bankrupt
   Prestigious but small high-performance computing market

“If the automobile industry advanced as rapidly as the semiconductor industry, a Rolls Royce would get a half million miles per gallon and it would be cheaper to throw it away than to park it.”

Gordon Moore, Intel Corporation, 1998

Historical perspectives, cont.

 90s: Emergence of distributed (vs. parallel) machines
   Progress in network technologies:
     Network bandwidth grows faster than Moore’s law
     Fast interconnection networks are getting cheap
   Connect cheap uniprocessor systems into a large distributed machine
   Networks of Workstations, Clusters, Grids

 00s: Parallel architectures are back!
   Transistors per chip >> transistors needed for a single microprocessor
   Harder to get more performance from a uniprocessor
   SMT (Simultaneous Multithreading), CMP (Chip Multiprocessor), ultimately massive CMP
   E.g., Intel Pentium D, Core Duo, AMD Dual Core, IBM Power5, Sun Niagara

Parallel computers

 A parallel computer is a collection of processing elements that can communicate and cooperate to solve a large problem fast. [Almasi & Gottlieb]

 “collection of processing elements”
   How many? How powerful is each?
   Few very powerful ones (e.g., Altix) vs. many small ones (BlueGene)

 “that can communicate”
   How do PEs communicate? (shared memory vs. message-passing)
   Interconnection network (bus, multistage, crossbar, …)
   Metrics: cost, latency, throughput, scalability, fault tolerance

Parallel computers, cont.

 “and cooperate”
   Issues: granularity, synchronization, and autonomy
   Synchronization allows sequencing of operations to ensure correctness
   Granularity up → parallelism down, communication down, overhead down

 Statement/instruction level: 2–10 instructions (ECE 521)

 Loop level: 10 – 1K instructions

 Task level: 1K – 1M instructions

 Program level: > 1M instructions

 Autonomy

 SIMD (single instruction stream) vs. MIMD (multiple instruction streams)

Parallel computers, cont.

 “solve a large problem fast”
   General- vs. special-purpose machine?
   Any machine can solve certain problems well

What domains?
   Highly (embarrassingly) parallel apps
     Many scientific codes
   Moderately parallel apps
     Many engineering apps (finite-element, VLSI-CAD)
   Not parallel apps
     Compilers, editors (do we care?)

Why parallel computers?

 Absolute performance: Can we afford to wait?
   Folding of a single protein takes years to simulate on the most advanced microprocessor; it takes only days on a parallel computer
   Weather forecasting: timeliness is crucial

 Cost/performance
   Harder to improve performance on a single processor
   Bigger monolithic processor vs. many simple processors

 Power/performance

 Reliability and availability

 Key enabling technologies
   Advances in microprocessor and interconnect technology
   Advances in software technology

Scope of CSC/ECE 506

 Parallelism
   Loop-level and task-level parallelism

 Flynn taxonomy:
   SIMD (vector architecture)
   MIMD
     Shared memory machines (SMP and DSM)
     Clusters

 Programming Model:
   Shared memory
   Message-passing
   Hybrid

Loop-level parallelism

 Sometimes each iteration can be performed independently (see the OpenMP sketch below):

    for (i=0; i<8; i++)
        a[i] = b[i] + c[i];

 Sometimes iterations cannot be performed independently ⇒ no loop-level parallelism

    for (i=1; i<8; i++)
        a[i] = b[i] + a[i-1];

+ Very high degree of parallelism (often > 1K)
+ Often easy to achieve load balance
– Some loops are not parallel
– Some apps do not have many loops
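A minimal OpenMP sketch of the independent loop above; OpenMP is used here only as one common way to express loop-level parallelism, not something the slide prescribes:

    #include <stdio.h>

    int main(void) {
        double a[8], b[8], c[8];
        for (int i = 0; i < 8; i++) { b[i] = i; c[i] = 2 * i; }

        /* Each iteration is independent, so the work can be split across threads. */
        #pragma omp parallel for
        for (int i = 0; i < 8; i++)
            a[i] = b[i] + c[i];

        for (int i = 0; i < 8; i++)
            printf("a[%d] = %.0f\n", i, a[i]);
        return 0;
    }

Compile with an OpenMP-capable compiler (e.g., gcc -fopenmp); without the flag the pragma is simply ignored and the loop runs sequentially. The second loop on the slide (a[i] = b[i] + a[i-1]) carries a dependence from one iteration to the next, so this pragma would not be valid there.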

Task-level parallelism

 Arbitrary code segments in a single program
   Across loops: independent loops can be run concurrently as separate tasks

 Subroutines (see the threaded sketch below):

    Cost = getCost();
    A = computeSum();
    B = A + Cost;

 Threads: e.g., in an editor: GUI, printing, parsing

+ Larger granularity => lower overheads and less communication
– Low degree of parallelism
– Hard to balance load
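A minimal POSIX-threads sketch of the subroutine example above. getCost and computeSum are the slide's hypothetical routines; the bodies given here are stand-ins:

    #include <pthread.h>
    #include <stdio.h>

    /* Stand-in bodies for the slide's hypothetical routines. */
    static void *getCost(void *arg)    { *(double *)arg = 10.0;  return NULL; }
    static void *computeSum(void *arg) { *(double *)arg = 100.0; return NULL; }

    int main(void) {
        double Cost, A;
        pthread_t t1, t2;

        /* The two calls are independent, so they can run as concurrent tasks. */
        pthread_create(&t1, NULL, getCost, &Cost);
        pthread_create(&t2, NULL, computeSum, &A);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        double B = A + Cost;   /* B depends on both tasks, so it runs after the joins. */
        printf("B = %.1f\n", B);
        return 0;
    }

Compile with -pthread.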

Program-level parallelism

 Various independent programs execute together
 gmake:

    gcc -c code1.c             // assign to proc1
    gcc -c code2.c             // assign to proc2
    gcc -c main.c              // assign to proc3
    gcc main.o code1.o code2.o

+ No communication
– Hard to balance
– Few opportunities

Scope of CSC/ECE 506

 Parallelism
   Loop-level and task-level parallelism

 Flynn taxonomy:
   SIMD (vector architecture)
   MIMD
     *Shared memory machines (SMP and DSM)
     Clusters

 Programming Model:
   Shared memory
   Message-passing
   Hybrid

Taxonomy of parallel computers

The Flynn taxonomy:
• Single or multiple instruction streams.
• Single or multiple data streams.

 1. SISD machine (most desktops, laptops)
   Only one instruction fetch stream
   Most of today’s workstations or desktops

[Diagram: a single control unit issues one instruction stream to one ALU, which operates on one data stream]

SIMD

 Examples: Vector processors, SIMD extensions (MMX)
 A single instruction operates on multiple data items.

SISD:
    for (i=0; i<8; i++)
        a[i] = b[i] + c[i];

SIMD:
    a = b + c;   // vector addition

[Diagram: one control unit broadcasts a single instruction stream to ALUs 1..n, each operating on its own data stream]
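To make the SIMD example above concrete, here is a sketch using x86 SSE intrinsics; this is just one possible realization of “a single instruction operates on multiple data items”, not something the slide specifies:

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    int main(void) {
        float b[8] = {0, 1, 2, 3, 4, 5, 6, 7};
        float c[8] = {0, 2, 4, 6, 8, 10, 12, 14};
        float a[8];

        /* Each _mm_add_ps adds four floats with one SIMD instruction,
           so the 8-element loop becomes two vector additions. */
        for (int i = 0; i < 8; i += 4) {
            __m128 vb = _mm_loadu_ps(&b[i]);
            __m128 vc = _mm_loadu_ps(&c[i]);
            _mm_storeu_ps(&a[i], _mm_add_ps(vb, vc));
        }

        for (int i = 0; i < 8; i++)
            printf("a[%d] = %.0f\n", i, a[i]);
        return 0;
    }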

MISD

 Example: CMU Warp
 Systolic arrays

[Diagram: a single data stream passes through ALUs 1..n, each driven by its own control unit and instruction stream]

Systolic arrays (contd.)

Example: Systolic array for 1-D convolution

y(i) = w1 × x(i) + w2 × x(i + 1) + w3 × x(i + 2) + w4 × x(i + 3)

[Diagram: a linear systolic array holding weights w4 w3 w2 w1; inputs x1, x2, … stream in while partial results y1, y2, y3 stream out. Each cell computes: xout = x; x = xin; yout = yin + w × xin]
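A plain-C rendering of what each output y(i) of this array computes (array contents and sizes here are illustrative, not from the slide):

    #include <stdio.h>

    #define N 8   /* number of inputs (illustrative) */
    #define K 4   /* number of weights, as in the formula above */

    int main(void) {
        double x[N] = {1, 2, 3, 4, 5, 6, 7, 8};
        double w[K] = {0.25, 0.5, 0.75, 1.0};   /* w1..w4 (made-up values) */
        double y[N - K + 1];

        /* y(i) = w1*x(i) + w2*x(i+1) + w3*x(i+2) + w4*x(i+3) */
        for (int i = 0; i < N - K + 1; i++) {
            y[i] = 0.0;
            for (int k = 0; k < K; k++)
                y[i] += w[k] * x[i + k];
            printf("y(%d) = %.2f\n", i + 1, y[i]);
        }
        return 0;
    }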

 Practical realizations (e.g. iWARP) use quite general processors

 Enable a variety of algorithms on the same hardware
 But dedicated interconnect channels
   Data transfer directly from register to register across a channel
 Specialized, and same problems as SIMD

 General purpose systems work well for same algorithms (locality etc.)

MIMD

 Independent processors connected together to form a multiprocessor system.
 Physical organization:
   Determines which memory hierarchy level is shared
 Programming abstraction:
   Shared Memory:

 on a chip: Chip Multiprocessor (CMP)

 Interconnected by a bus: Symmetric multiprocessors (SMP)

 Point-to-point interconnection: Distributed Shared Memory (DSM)

   Distributed Memory:

 Clusters, Grid

MIMD Physical Organization

Shared-cache architecture:
- CMP (or Simultaneous Multi-Threading)
- e.g.: Pentium 4 chip, IBM Power4 chip, Sun Niagara, Pentium D, etc.
- Implies shared-memory hardware
[Diagram: processors P … P share the on-chip caches and memory M]

UMA (Uniform Memory Access) Shared Memory:
- Pentium Pro Quad, Sun Enterprise, etc.
- What interconnection network? Bus, multistage, crossbar, etc.
- Implies shared-memory hardware
[Diagram: each processor P has private caches; an interconnection network connects them to memory M]

MIMD Physical Organization (2)

NUMA (Non-Uniform Memory Access) Shared Memory:
- SGI Origin, Altix, IBM p690, AMD Hammer-based systems
- What interconnection network? Crossbar, mesh, hypercube, etc.
- Also referred to as Distributed Shared Memory
[Diagram: each processor P has its own caches and local memory M; an interconnection network connects the nodes]

MIMD Physical Organization (3)

Distributed System/Memory:
- Also called clusters, grid
- Don’t confuse it with distributed shared memory
[Diagram: nodes, each with a processor P, caches, memory M, and I/O, connected by a network]

Scope of CSC/ECE 506

 Parallelism
   Loop-level and task-level parallelism

 Flynn taxonomy:
   MIMD
     Shared memory machines (SMP and DSM)

 Programming Model:
   Shared memory
   Message-passing
   Hybrid (e.g., UPC)
   Data parallel

Programming models: shared memory

 Shared Memory / Shared Address Space:
   Each processor can see the entire memory

[Diagram: processors P … P all connected to a single shared memory M]

 Programming model = thread programming in uniprocessor systems
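A minimal sketch of that model: ordinary threads communicating through shared variables. POSIX threads are used here only as a familiar example of the shared-memory abstraction:

    #include <pthread.h>
    #include <stdio.h>

    /* All threads see the same address space, so they communicate
       simply by reading and writing shared data. */
    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* synchronization ensures correctness */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);   /* 400000 */
        return 0;
    }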

Programming models: message-passing

 Distributed Memory / Message Passing / Multiple Address Space:
   A processor can only directly access its own local memory. All communication happens via explicit messages.

[Diagram: four nodes, each with its own processor P and private memory M; nodes communicate only by sending messages]
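A minimal MPI sketch of the message-passing model; MPI is one standard realization, the slide does not mandate a specific library:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            int value = 42;   /* lives in rank 0's private memory */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int value;        /* rank 1 cannot read rank 0's memory directly */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Run with, e.g., mpirun -np 2 ./a.out.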

Programming models: data parallel

 Programming model
   Operations performed in parallel on each element of a data structure
   Logically a single thread of control, performing sequential or parallel steps
   Conceptually, a processor is associated with each data element

 Architectural model
   Array of many simple, cheap processing elements (PEs), each with little memory
   Processing elements don’t sequence through instructions
   PEs are attached to a control processor that issues instructions
   Specialized and general communication, cheap global synchronization

 Original motivation
   Matches simple differential equation solvers
   Centralize the high cost of instruction fetch/sequencing

Top 500

 http://www.top500.org
 Let’s look at the Earth Simulator
   Was #1 in 2004, now #10 in 2006
 Hardware:
   5,120 500 MHz NEC CPUs (640 8-way nodes)
   8 GFLOPS per CPU (41 TFLOPS total)

 Sustained performance in the 30s of TFLOPS!
 2 GB per CPU (4 × 512 MB FPLRAM modules), 10 TB total
   Shared memory inside the node
   10 TB total memory
 640 × 640 crossbar switch between the nodes
 16 GB/s inter-node bandwidth
 20 kVA power consumption per node

Exercise

 Go to http://www.top500.org and look at the Statistics menu near the right-hand side. Click on one of the statistics, e.g., Vendors or Processor Architecture, and examine what kinds of systems are prevalent. Then do the same for earlier lists, and report on the trend. You may find interesting results by clicking on the “Development” tab.
 For example, if you choose “Processor Architecture,” the current list, http://www.top500.org/stats/list/34/procarch/, will tell you how many vector and scalar architectures there are. Change the “34” to “32” to get last year’s list. Change it to a lower number to get an earlier year’s list. You can go all the way back to the first list from 1993.
 Submit your results here.
