University of Athens, Department of Informatics and Telecommunications

Architectures of chips

Ø Basic architectural features
Ø Single-core processor chip
ü Scalar processors – pipelining
ü Superscalars – ILP – threading
Ø Amdahl's Law
Ø Models of multicores
Ø Multicores
ü Symmetric multicore chips
ü Asymmetric multicore chips
ü Dynamic multicore chips
Ø Multi-threading
Ø Accelerators



Basic Architectural features

Ø Memory hierarchy (caches, main memory, HDs, ...)
Ø Pipelining
Ø Branch prediction
Ø Instruction-Level Parallelism – ILP
Ø Data-Level Parallelism – DLP
Ø Thread-Level Parallelism – TLP



Example




Basic techniques

Scoreboarding
Based on the CDC 6600 architecture. Important feature: the scoreboard. Hazard handling – issue stalls on WAW hazards, reading operands (decode) stalls on RAW hazards, and execute/write results stall on WAR hazards.

Tomasulo Algorithm
Implemented in the IBM 360/91's floating-point unit. Important features: reservation stations and the CDB (Common Data Bus). Issue: copy the operands if they are available, otherwise record their tags; Execute: stall on RAW hazards while monitoring the CDB; Write results: send results to the CDB and dump the store-buffer contents. Exception handling: no instructions can be issued until a branch is resolved.

Reorder Buffer

Register Renaming


Single-core processors

Multiple-issue architectures: increase the instructions per cycle (IPC) by exploiting instruction-level parallelism (ILP).

Common Name | Issue Structure | Hazard Detection | Scheduling | Distinguishing Characteristics | Examples
Superscalar (static) | Dynamic | Hardware | Static | In-order execution | Sun UltraSPARC II and III
Superscalar (dynamic) | Dynamic | Hardware | Dynamic | Some out-of-order execution | IBM Power2
Superscalar (speculative) | Dynamic | Hardware | Dynamic with speculation | Out-of-order execution with speculation | Pentium 3 and 4
VLIW/LIW | Static | Software | Static | No hazards between issue packets | Trimedia, i860
EPIC | Mostly static | Mostly software | Mostly static | Explicit dependences marked by compiler | Itanium

Key techniques: register renaming, Tomasulo algorithm, reorder buffer, scoreboarding.


[Figure: SPECint92 performance of Intel's x86 line, 1979-2005, from the 8088/5 up to the Pentium 4 Prescott – roughly a 100× increase per 10 years, leveling off in the early 2000s.]

The integer performance of Intel’s x86 line of processors


Processor performance (SPECint2000)


Source: F. Labonte, http://mos.stanford.edu/papers/flabonte_thesis.pdf

Processor clock rate


Source: F. Labonte, http://mos.stanford.edu/papers/flabonte_thesis.pdf

Power density evolution – Watt/mm²


Source: F. Labonte, http://mos.stanford.edu/papers/flabonte_thesis.pdf

ILP processing: Pipelining

Paradigms of ILP-processing

Temporal parallelism

Pipeline processors

Pipelining Schemes

Types of temporal parallelism in ILP processors

Sequential processing | Prefetching: overlapping the fetch and further phases | Pipelined EUs: overlapping the execute phases through pipelining | Pipeline processors: overlapping all phases

[Figure: instruction timing diagrams for the four schemes. Examples – early mainframes: Stretch (1961), Atlas (1963); prefetching mainframes: IBM 360/91 (1967); pipelined EUs: CDC 7600 (1969), IBM 360/91 (1967); pipeline processors: i80286 (1982), M68020 (1985), i80386 (1985), M68030 (1988), R2000 (1988).]

(F: fetch cycle, D: decode cycle, E: execute cycle, W: write cycle)


Sources of raising clock frequencies (3)

[Figure: number of pipeline stages vs. year, 1990-2005 – from ~5 stages (Pentium) and 6 (K6, Athlon) through ~12-14 (Pentium Pro, Athlon 64, Core Duo, Conroe) up to ~20-30 (Pentium 4, P4 Prescott).]

Figure 4.2: Number of pipeline stages in Intel's and AMD's processors


POWER5 vs. POWER6 Pipeline Comparison

[Figure: POWER5 vs. POWER6 pipeline stage diagrams (22 FO4 vs. 13 FO4 per stage). The diagrams compare the fetch/decode/dispatch front ends (I-cache, branch resolution, group formation/dispatch, rename) and the key operation latencies of the two pipelines, e.g. FX-FX, load-load, FP-FP, and load-to-FP-use delays.]


ILP processing: VLIW

Paradigms of ILP-processing

Temporal parallelism: pipeline processors
Issue parallelism (static dependency resolution): VLIW processors

University of Athens Department of Informatics and Telecommunications

VLIW processing

[Figure: the processor fetches VLIW instructions consisting of independent sub-instructions (static dependency resolution) and executes them in parallel on multiple execution units.]

VLIW: Very Long Instruction Word


ILP processing – Superscalars

Paradigms of ILP processing

Temporal parallelism: pipeline processors
Issue parallelism, static dependency resolution: VLIW processors
Issue parallelism, dynamic dependency resolution: superscalar processors

University of Athens Department of Informatics and Telecommunications

VLIW processing vs. superscalar processing

[Figure: a VLIW processor receives independent instructions (static dependency resolution), while a superscalar processor receives possibly dependent instructions and resolves the dependencies dynamically in hardware.]

VLIW: Very Long Instruction Word


ILP processing

Paradigms of ILP processing

Temporal parallelism: pipeline processors
Issue parallelism, static dependency resolution: VLIW processors
Issue parallelism, dynamic dependency resolution: superscalar processors
Data parallelism: SIMD extension

ILP-processing

[Figure: timeline – sequential processing → temporal parallelism (pipeline processors, ~'85) → issue parallelism: dynamic dependency resolution (superscalar processors, ~'90) or static dependency resolution (VLIW, later EPIC processors) → data parallelism (superscalar processors with SIMD extension, ~'95-'00).]

Figure 1.3: The emergence of ILP-paradigms and processor types

Performance of ILP-processors

Absolute performance (ideal vs. real case):

Sequential: $P_a = f_C \cdot \frac{1}{CPI}$
Pipeline: $P_a = f_C \cdot \frac{1}{CPI}$ (CPI approaching 1)
VLIW/superscalar: $P_a = f_C \cdot \frac{1}{CPI} \cdot IP$
SIMD extension: $P_{ao} = f_C \cdot \frac{1}{CPI} \cdot IP \cdot OPI$

Performance components of ILP-processors:

$P_{ao} = f_C \cdot \frac{1}{CPI} \cdot IP \cdot OPI \cdot \eta$

where $f_C$ is the clock frequency, $1/CPI$ the temporal parallelism, $IP$ the issue parallelism, $OPI$ the data parallelism, and $\eta$ the efficiency of speculative execution.

Equivalently, $P_{ao} = f_C \cdot IPC_{eff}$ with $IPC_{eff} = \frac{1}{CPI} \cdot IP \cdot OPI \cdot \eta$.

The clock frequency depends on the technology and the microarchitecture; the per-cycle efficiency depends on the ISA, the microarchitecture, the system architecture, the OS, the compiler, and the application.
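As a quick illustration, with hypothetical values (not from the slides) $f_C = 3\,\text{GHz}$, $CPI = 1$, $IP = 4$, $OPI = 2$ and $\eta = 0.75$:

$IPC_{eff} = \frac{1}{1} \cdot 4 \cdot 2 \cdot 0.75 = 6, \qquad P_{ao} = 3 \times 10^9 \cdot 6 = 1.8 \times 10^{10}\ \text{operations/s}.$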

Options to implement issue parallelism

Pipeline processing →
ü VLIW (EPIC) instruction issue: static dependency resolution (3.2)
ü Superscalar instruction issue: dynamic dependency resolution (3.3)



VLIW processing (1)

[Figure: VLIW instructions with independent sub-instructions (static dependency resolution) are fetched from memory/cache into the VLIW processor and dispatched to its execution units (~10-30 EUs).]

Figure 3.1: Principle of VLIW processing



VLIW processing (2)

VLIW: Very Long Instruction Word. The term was coined in 1983 (Fisher).

Length of sub-instructions: ~32 bits. Instruction length: ~n × 32 bits, where n is the number of execution units (EUs).

Static dependency resolution with parallel optimization

Complex VLIW compiler


VLIW processing (3)

The term ‘VLIW’

Figure 3.2: Experimental and commercially available VLIW processors

Source: Sima et al., ACA, Addison-Wesley, 1997


VLIW processing (4)

Benefits of static dependency resolution:

Less complex processors

Earlier appearance

Either higher f_C or larger ILP


VLIW processing (5)

Drawbacks of static dependency resolution:

Completely new ISA

New compilers, OS

Rewriting of applications

Achieving the critical mass to convince the market

The compiler uses technology-dependent parameters (e.g. latencies of EUs and caches, repetition rates of EUs) for dependency resolution and parallel optimization

New proc. models require new compiler versions


VLIW processing (6)

Drawbacks of static dependency resolution (cont.):

VLIW instructions are only partially filled

Poorly utilized memory space and bandwidth



VLIW processing (7)

Commercial VLIW processors:

Trace (1987) – Multiflow
Cydra-5 (1989) – Cydrome

Within a few years both firms went bankrupt

Their developers moved to HP and IBM

They became initiators/developers of EPIC processors


VLIW processing (8)

VLIW → EPIC

Integration of SIMD instructions and advanced superscalar features

1994: Intel and HP announced their cooperation

1997: The EPIC term was born

2001: IA-64 ⇒ Itanium



Superscalar processing

Pipeline processing → superscalar instruction issue

Main attributes of superscalar processing:

Dynamic dependency resolution

Compatible ISA



History

Experimental superscalar processors

Source: Sima et al., ACA, Addison-Wesley, 1997


Emergence of superscalar processors

RISC processors

Intel 960: 960KA/KB → 960CA (3)
Motorola M 88000: MC 88100 → MC 88110 (2)
HP PA: PA 7000 → PA 7100 (2)
Sun SPARC: MicroSparc → SuperSparc (3)
MIPS R: R 4000 → R 8000 (4)
AMD 29K: Am 29000 → superscalar 29K (4)
IBM Power: RS/6000 → Power1 (4)
DEC Alpha: → α 21064 (2)
PowerPC: → PPC 601 (3), PPC 603 (3)

(1987-1996)

CISC processors

Intel x86: i486 → Pentium (2)
Motorola M 68000: M 68040 → M 68060 (2)
Gmicro: Gmicro/100p → Gmicro 500 (2)
AMD K5: → K5 (4)
Cyrix M1: → M1 (2)

(n) denotes superscalar processors and their issue width.

Source: Sima et al., ACA, Addison-Wesley, 1997



Attributes of first generation superscalars (1)

Width: 2-3 RISC instructions/cycle or 2 CISC instructions/cycle "wide"
Core: static branch prediction
Cache: single-ported, blocking L1 data caches; off-chip L2 caches attached via the processor bus

Examples: Alpha 21064, PA 7100, Pentium



Attributes of first generation superscalars (2)

Consistency of processor features (1)

Dynamic instruction frequencies in general-purpose applications:

FX instructions: ~40%
Load instructions: ~30%
Store instructions: ~10%
Branches: ~20%
FP instructions: ~1-5%

Available parallelism in general-purpose applications assuming direct issue:

~ 2 instructions / cycle

(Wall 1989; Lam & Wilson 1992)

Source: Sima et al., ACA, Addison-Wesley, 1997


Attributes of first generation superscalars (3)

Consistency of processor features (2) Reasonable core width:

2-3 instructions/cycle

Required number of data cache ports (np):

n_p ≈ 0.4 × (2-3) = 0.8-1.2 memory accesses/cycle → single-port data caches suffice

Required EUs (each L/S instruction generates an address calculation as well; see the sketch below):
FX: ~0.8 × (2-3) = 1.6-2.4 → 2-3 FX EUs
L/S: ~0.4 × (2-3) = 0.8-1.2 → 1 L/S EU
Branch: ~0.2 × (2-3) = 0.4-0.6 → 1 B EU
FP: ~(0.01-0.05) × (2-3) → 1 FP EU
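This back-of-the-envelope sizing is easy to automate. A minimal Python sketch under the slide's assumptions (the helper name and the rounding rule are ours; the slides round the fractional demands to small integer unit counts):

    # EU sizing from a dynamic instruction mix, following the slide's reasoning.
    mix = {"fx": 0.40, "load": 0.30, "store": 0.10, "branch": 0.20, "fp": 0.05}

    def required_eus(width, mix):
        """Rough unit counts per class for a given issue width (instructions/cycle)."""
        ls = mix["load"] + mix["store"]
        demand = {
            "FX": (mix["fx"] + ls) * width,   # FX EUs also do the L/S address calculations
            "L/S": ls * width,
            "Branch": mix["branch"] * width,
            "FP": mix["fp"] * width,
        }
        return {k: max(1, round(v)) for k, v in demand.items()}

    print(required_eus(3, mix))   # first-generation width (~2-3 instructions/cycle)
    print(required_eus(5, mix))   # second-generation width (~4-5 instructions/cycle)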

The bottleneck evoked and its resolution (1)

The issue bottleneck

[Figure: (a) simplified structure of the microarchitecture assuming direct issue – instructions flow from the I-cache through the I-buffer into a 3-wide instruction window, from which decode/check/issue sends them to the EUs; (b) the issue process – dependent instructions block the instruction issue, only executable instructions are issued.]

The principle of direct issue

The bottleneck evoked and its resolution (2)

Eliminating the issue bottleneck

[Figure: instructions flow from the I-cache through the I-buffer into the instruction window and are dispatched, without checking for dependences, into shelving buffers (reservation stations); after dependency checking, shelved instructions that are not dependent are issued for execution to the EUs.]

Principle of the buffered (out-of-order) issue


The bottleneck evoked and its resolution (3)

First generation (narrow) superscalars → second generation (wide) superscalars

Elimination of the issue bottleneck and, in addition, widening of the processing width of all subsystems of the core


Generations of superscalars

First generation ("narrow" superscalars) vs. second generation ("wide" superscalars):

Width – 1st gen: 2-3 RISC instructions/cycle or 2 CISC instructions/cycle "wide"; 2nd gen: 4 RISC instructions/cycle or 3 CISC instructions/cycle "wide"
Core – 1st gen: static branch prediction; 2nd gen: buffered (ooo) issue, predecoding, dynamic branch prediction, register renaming, ROB
Caches – 1st gen: single-ported, blocking L1 data caches, off-chip L2 caches attached via the processor bus; 2nd gen: dual-ported, non-blocking L1 data caches, directly attached off-chip L2 caches

Examples – 1st gen: Alpha 21064, PA 7100, Pentium; 2nd gen: Alpha 21264, PA 8000, Pentium Pro, K6


Attributes of second generation superscalars (2)

Consistency of processor features (1)

Dynamic instruction frequencies in general-purpose applications:
FX instructions: ~40%
Load instructions: ~30%
Store instructions: ~10%
Branches: ~20%
FP instructions: ~1-5%

Available parallelism in general-purpose applications assuming buffered issue:

~4-6 instructions/cycle

(Wall 1990)

Source: Sima et al., ACA, Addison-Wesley, 1997


Parallelism available in general purpose applications

Extent of parallelism available in general purpose applications assuming buffered issue

Source: Wall: Limits of ILP, WRL TN-15, Dec. 1990


Attributes of second generation superscalars (3)

Consistency of processor features (2) Reasonable core width:

4-5 instructions/cycle

Required number of data cache ports (np):

n_p ≈ 0.4 × (4-5) = 1.6-2 memory accesses/cycle → dual-port data caches

Required EUs (each L/S instruction generates an address calculation as well):
FX: ~0.8 × (4-5) = 3.2-4 → 3-4 FX EUs
L/S: ~0.4 × (4-5) = 1.6-2 → 2 L/S EUs
Branch: ~0.2 × (4-5) = 0.8-1 → 1 B EU
FP: ~(0.01-0.05) × (4-5) → 1 FP EU



Exhausting the issue parallelism

In general-purpose applications, second-generation ("wide") superscalars already exhaust the parallelism available at the instruction level.


Introducing data parallelism (1)

Possible approaches to introduce data parallelism

Dual-operation instructions (e.g. i = a*b + c): n_OPI = 2; OPI = 1+ε (for general use)
SIMD instructions:
ü FX-SIMD (MM support): n_OPI = 2/4/8/16/32, dedicated use; OPI > 1
ü FP-SIMD (3D support): n_OPI = 2/4, dedicated use

Both approaches are ISA extensions.

(n_OPI: number of operations per instruction; OPI: average number of operations per instruction)

Implementation alternatives of data parallelism


Introducing data parallelism (2)

[Figure: SIMD instructions (FX/FP) pack multiple operations within a single instruction; they can extend superscalar issue (superscalar extension) as well as EPIC processors (EPIC extension).]

Principle of introducing SIMD instructions in superscalar and VLIW (EPIC) processors

The appearance of SIMD instructions in superscalars (1)

Intel's and AMD's ISA extensions (MMX, SSE, SSE2, SSE3, 3DNow!, 3DNow! Professional)

[Figure: timeline (1990-2003) of RISC lines (Compaq/DEC Alpha, Motorola MC 88000, HP PA, IBM Power/PowerPC, MIPS R, Sun/HAL SPARC) and CISC lines (Intel 80x86, Cyrix/VIA M, AMD/NexGen), marking the models that introduced multimedia support (FX-SIMD) and 3D support (FP-SIMD).]

The emergence of FX-SIMD and FP-SIMD instructions in superscalars



2.5 and 3rd generation superscalars (1)

Second generation superscalars + FX SIMD (MM) → 2.5 generation superscalars
Second generation superscalars + FX SIMD + FP SIMD (MM+3D) → 3rd generation superscalars



2.5 and 3rd generation superscalars (2)

[Figure: the FX-SIMD/FP-SIMD timeline of the previous slide, repeated.]


Overview of superscalar processor generations

Superscalars:
First generation ("thin" superscalars) → second generation ("wide" superscalars) → 2.5 generation ("wide" superscalars with MM support) → third generation ("wide" superscalars with MM/3D support)

Features:
Width – 1st gen: 2-3 RISC instructions/cycle or 2 CISC instructions/cycle "wide"; 2nd gen onward: 4 RISC instructions/cycle or 3 CISC instructions/cycle "wide"
Core – 1st gen: no predecoding, static branch prediction, unbuffered issue, no renaming, no ROB; 2nd gen onward: predecoding, dynamic branch prediction, buffered issue (shelving), renaming, ROB
Caches – 1st gen: single-ported data caches; blocking L1 data caches or non-blocking caches with at most a single pending cache miss; off-chip L2 caches attached via the processor bus; 2nd gen onward: dual-ported data caches; non-blocking L1 data caches with multiple cache misses allowed; off-chip direct-coupled L2 caches, later on-chip L2 caches
ISA – 1st/2nd gen: no MM/3D support; 2.5 gen: FX-SIMD instructions; 3rd gen: FX- and FP-SIMD instructions

Examples:
1st gen: Alpha 21064¹, PA 7100, PowerPC 601¹, SuperSparc¹ ⁴, Pentium²
2nd gen: Alpha 21264, PA 8000, Power2³, PowerPC 604, PowerPC 620, UltraSparc I/II¹ ⁴, Pentium Pro
2.5 gen: Pentium II, K6
3rd gen: Power 4, Pentium III (0.18 µ), Pentium 4, Athlon (model 4), Athlon MP (model 6)

¹ No renaming. ² Dual-ported data cache, optional dynamic branch prediction. ³ No off-chip direct-coupled L2. ⁴ Only single-ported data cache.

(Performance, complexity, memory bandwidth, and branch prediction accuracy all grow from generation to generation.)

Exhausting the performance potential of data parallelism

In general-purpose applications second-generation superscalars already exhaust the parallelism available at the instruction level, whereas third-generation superscalars also exhaust the instruction-level parallelism available in dedicated applications (such as MM or 3D applications).

Thus the era of ILP-processors came to an end.


Summary: Main evolution scenarios

a. Evolutionary scenario (superscalar approach) – the main road

Introduction and increase of temporal parallelism → introduction and increase of issue parallelism → introduction and increase of data parallelism

b. Radical scenario (VLIW/EPIC approach)

Introduction of VLIW processing (EPIC) → introduction of data parallelism


Summary: Main road of processor evolution (1)

[Figure: extent of operation-level parallelism vs. time – traditional von Neumann processors (sequential processing) → pipeline processors (temporal parallelism, ~1985/88) → superscalar processors (+ issue parallelism, ~1990/93) → superscalar processors with SIMD extension (+ data parallelism, ~1994/00); the level of hardware redundancy grows accordingly.]

The three cycles of the main road of processor evolution

Summary: The main road of evolution (2)

Each cycle i (i = 1..3: introduction of temporal → issue → data parallelism) proceeds as follows:

introduction of a particular dimension of parallelism →

processing bottleneck(s) arise →

elimination of the bottleneck(s) evoked, by introducing appropriate techniques →

as a consequence, the parallelism available in the given dimension becomes exhausted; further performance increase is achievable only by introducing a new dimension of parallelism

The three main cycles of the main road

Summary: Main road of the evolution (3)

Traditional sequential processing → introduction of temporal parallelism → introduction of issue parallelism → introduction of data parallelism
(traditional sequential processors → pipeline processors → superscalar processors → superscalars with SIMD extension)

New techniques per cycle:
ü 1st cycle (~1985/88), advanced memory subsystem and advanced branch processing → 1st generation: caches, dynamic instruction scheduling, branch prediction, ...
ü 2nd cycle (~1990/93) → 2nd generation: predecoding, renaming, dynamic branch prediction, ROB, dual-ported data caches, non-blocking L1 data caches with multiple cache misses allowed, off-chip direct-coupled L2 caches
ü 3rd cycle (~1994/97), ISA extension: FX SIMD extension (→ 2.5 generation), FP SIMD extension (→ 3rd generation); extension of the system architecture: AGP, on-chip L2, ...

New techniques introduced in the three main cycles of processor evolution


Main road of evolution (4)

[Figure: memory bandwidth and hardware complexity grow along with processor performance, ~1985-2000.]

Memory bandwidth and hardware complexity vs. rising processor performance


Main road of evolution (5)

[Figure: branch prediction accuracy and the number of pipeline stages grow along with the clock frequency f_C, ~1985-2000.]

Figure 5.5: Branch prediction accuracy vs. rising clock rates



Microprocessor technology trend

Moore’s Law

2× transistors per chip every 1.5 years – this is called "Moore's Law". Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density on microchips would double roughly every 18 months. Processors keep getting smaller, denser, and more powerful.

Slide source: Jack Dongarra


Moore’s Law

Gordon Moore, co-founder of Intel:

Ø Number of transistors per square inch on integrated circuits had doubled every year since the IC was invented.

Ø Predicted that this trend would continue for the foreseeable future.

Ø In subsequent years the pace slowed down a bit, but data density has doubled approximately every 18 months, and this is the current definition of Moore's Law, which Moore himself has blessed. Most experts, including Moore himself, expect Moore's Law to hold for at least another two decades.

Ø HOWEVER ...

(Webopedia.com, 1998)



The classic sources of performance improvement (ILP, clock speed, power) are flagging (from K. Olukotun, L. Hammond, H. Sutter, and B. Smith)

In the past, processors got faster every year.

Today the clock frequency stays flat or even decreases.

The 'free lunch' is over!

Nevertheless, performance still doubles every 18-24 months.

The CPU-memory gap keeps getting worse.

Moore's Law needs restating.



The memory wall: Processor-DRAM gap (latency)

[Figure: processor performance improves ~60%/year ("Moore's Law") while DRAM improves only ~7%/year; the processor-memory performance gap grows ~50% per year (1980-2000).]



Moore's Law in practice

[Figure: log(speed) vs. year for CPU, network bandwidth, RAM, 1/network latency, and software – the curves improve at diverging rates.]



Hardware problems

Ø Today's hardware and lithography limitations
ü Power consumption constrains the design of high-performance chips
§ Intel Tejas Pentium 4 cancelled due to power issues
ü Manufacturing yield for high-performance chips drops dramatically
§ IBM quotes yields of 10-20% on 8-processor
ü Design and verification of high-performance chips become unwieldy
§ Verification teams > design teams on leading edge processors

Ø Solution: 'small is beautiful' – multicore and manycore chips
ü Shallow-pipeline (5- to 9-stage) CPUs, FPUs, vector and SIMD PEs
§ Small cores not much slower than large cores
ü Parallelism is a way to a good performance/power ratio:
§ Lower threshold and supply voltages lower the energy per op
ü Redundant processors improve the chip yield
§ Cisco Metro 188 CPUs + 4 spares; Sun Niagara sells 6 or 8 CPUs
ü Design verification is easier for small, identical cores



Multicore: Redefining "Conventional Wisdom" in Computer Architecture

Ø Old Conventional Wisdom (CW): power is free, but transistors expensive
– New CW is the "power wall": power is expensive, but transistors are "free" – we can put more transistors on a chip than we have the power to turn on
Ø Old CW: multiplies slow, but loads and stores fast
– New CW is the "memory wall": loads and stores are slow, but multiplies fast – 200 clocks to DRAM, but even FP multiplies take only 4 clocks
Ø Old CW: we can reveal more ILP via compilers and architecture innovation – branch prediction, OOO execution, speculation, VLIW, ...
– New CW is the "ILP wall": diminishing returns on finding more ILP
Ø Old CW: 2× CPU performance every 18 months
– New CW: Power Wall + Memory Wall + ILP Wall = Brick Wall



Restating Moore's Law

Moore's Law continues! How do we exploit all these transistors to keep performance on Moore's Law's trajectory?
Ø More out-of-order execution (prohibitive complexity, small performance gain, power consumption)
Ø More threading (asymptotic performance)
Ø More DLP/SIMD (limited applications, compilers?)
Ø Larger caches
Ø Putting an SMP on a chip = 'multicore' (and manycore)

The industry's answer:
Ø The number of cores per chip doubles every 18 months, while the clock frequency decreases (it does not increase)
ü We will, however, have to deal with millions of simultaneous threads
§ Future generations will have billions of threads.
ü It becomes necessary to be able to easily replace the parallelism between chips with parallelism within the chip



MULTI-CORE Architecture

Ø Better microelectronics processes allow savings in die area
Ø This means more transistors per chip
ü 90nm, 65nm and 45nm today
ü 32nm, 22nm and 10nm tomorrow
Ø Chip manufacturers decided that the best (cheapest) use of the extra transistors is to add more cores
ü 4 or 8 cores with up to 64 threads today
ü Hundreds of cores tomorrow – manycore

Ø Another way out: more functions on a single core (e.g., adding a GPU, a crypto engine, etc.)


MULTI-CORE ARCHITECTURES

Ø Multi-core long used in DSP, embedded and network processors
Ø Many household items contain multi-core processors:
ü iPod – 2-core ARM CPU
ü The PS3 – IBM Cell CPU
ü The Xbox 360 – 3-core Xenon CPU (PowerPC-based, SMT-capable)
ü GPUs, like the NVIDIA 8800 series – up to 128 mini-cores


Powerful processors

Ø The Hertz race is over; however ...
ü Some processors are still at it ...
ü Power 6 and 7 running at 4 and 5 GHz
ü Intel Polaris: 3.6 to 6 GHz
Ø Many hardware re-designs are in order
ü Make pipelines shorter, simpler
ü Get rid of "extra" hardware features


Multi-core trends in this decade

[Figure: timeline 2001-2011 – IBM: Power4 (2001, 2-core 64-bit PowerPC), Power5 (2-core 64-bit PowerPC with SMT), Power6 (2-core 64-bit PowerPC with SMT), Power7 (up to 8 cores), CBE (9-core chip); Intel: Pentium D and Pentium EE (IA32 x86 dual-core), Xeon dual core, Core Duo, Core 2 Duo, Penryn/Wolfdale (dual & quad core), Nehalem (1- to 8-core chip), Sandy Bridge; AMD: Opteron Denmark (2-core), Barcelona (native 4-core), Turion 64 X2; Sun: UltraSPARC T1 (Niagara: 8-core processor, 32 logical threads), UltraSPARC T2 (Niagara 2: 8-core processor, 64 logical threads).]


Intel Processors



AMD processors



IBM processors



Example: the Power6 microarchitecture

Pictures courtesy of IBM, from "IBM Power6 Microarchitecture"

Power6: running at frequencies from 4 to 5 GHz; a 13 FO4 pipeline versus POWER5's 23 FO4.



Recall Amdahl’s Law

Ø Begins with a simple software assumption (limit argument)
ü Fraction F of execution time is perfectly parallelizable
ü No overhead for scheduling, communication, synchronization, etc.
ü Fraction (1 - F) is completely serial

Ø Time on 1 core = (1 - F)/1 + F/1 = 1

Ø Time on N cores = (1 - F)/1 + F/N



Recall Amdahl’s Law [1967]

1 Amdahl’s Speedup = F 1 -F + 1 N

Ø For mainframes, Amdahl expected 1 - F = 35%
ü For a 4-processor speedup ≈ 2
ü For infinite-processor speedup < 3
ü Therefore, stay with mainframes with one/few processors

Ø Amdahl's Law applied to the minicomputer and PC eras
Ø What about the multicore era?



Designing Multicore Chips Is Hard

Ø Designers must confront single-core design options
ü Instruction fetch, wakeup, select
ü Execution-unit configuration & operand bypass
ü Load/store queue(s) & data cache
ü Checkpoint, log, runahead, commit

Ø As well as additional design degrees of freedom
ü How many cores? How big is each?
ü Shared caches: how many levels? How many banks?
ü Memory interface: how many banks?
ü On-chip interconnect: bus, switched, ordered?



A Simple Multicore Hardware Model
(Wisconsin Multifacet project, www.cs.wisc.edu/multifacet/amdahl)

To Complement Amdahl’s Simple Software Model

(1) Chip hardware roughly partitioned into
ü multiple cores (with L1 caches)
ü the rest (L2/L3 cache banks, interconnect, pads, etc.)
ü Changing core size/number does NOT change "the rest"

(2) Resources for multiple cores bounded
ü Bound of N resources per chip for cores
ü Due to area, power, cost (€€€), or multiple factors
ü Bound = power? (but our pictures use area)



A simple Multicore Hardware Model, cont.

(3) Micro-architects can improve single-core performance using more of the bounded resource

Ø A simple base core
ü Consumes 1 Base Core Equivalent (BCE) of resources
ü Provides performance normalized to 1

Ø An enhanced core (in the same process generation)
ü Consumes R BCEs
ü Performance as a function Perf(R)

Ø What does function Perf(R) look like?



More on Enhanced Cores

Ø (Performance Perf(R), consuming R BCEs of resources)

Ø If Perf(R) > R → always enhance the core
Ø It cost-effectively speeds up both sequential and parallel execution

Ø Therefore, the equations assume Perf(R) < R

Ø Graphs assume Perf(R) = square root of R
ü 2× performance for 4 BCEs, 3× for 9 BCEs, etc.
ü Why? Models diminishing returns with "no coefficients"
ü Alpha EV4/5/6 [Kumar 11/2005] & Intel's Pollack's Law



Symmetric Multicore Chips

Ø How many cores per chip?
Ø Each chip bounded to N BCEs (for all cores)
Ø Each core consumes R BCEs
Ø Assume symmetric multicore = all cores identical
Ø Therefore, N/R cores per chip – (N/R) × R = N
Ø For an N = 16 BCE chip:

Sixteen 1-BCE cores | four 4-BCE cores | one 16-BCE core



Performance of Symmetric Multicore Chips

Ø Serial fraction 1 - F uses 1 core at rate Perf(R)
Ø Serial time = (1 - F) / Perf(R)

Ø Parallel fraction F uses N/R cores at rate Perf(R) each
Ø Parallel time = F / (Perf(R) × (N/R)) = F·R / (Perf(R)·N)

Ø Therefore, w.r.t. one base core:

$\text{Symmetric Speedup} = \dfrac{1}{\dfrac{1-F}{Perf(R)} + \dfrac{F \cdot R}{Perf(R) \cdot N}}$

Ø Implications?

Enhanced cores speed up both the serial and the parallel sections (see the sketch below).
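A minimal Python sketch of the symmetric speedup under the model's illustrative Perf(R) = sqrt(R) assumption (the function names are ours):

    import math

    def perf(r):
        return math.sqrt(r)   # the slides' illustrative Perf(R) assumption

    def symmetric_speedup(f, n, r):
        """f: parallel fraction, n: chip budget in BCEs, r: BCEs per core (N/R cores)."""
        return 1.0 / ((1.0 - f) / perf(r) + f * r / (perf(r) * n))

    # Reproduces the N = 16, F = 0.5 slide: one 16-BCE core is optimal.
    print(symmetric_speedup(0.5, 16, 16))   # 4.0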



Symmetric Multicore Chip, N = 16 BCEs

[Figure: speedup vs. R for F = 0.5 with 16, 8, 4, 2, and 1 cores.]

F = 0.5: optimum at R = 16, Cores = 1, Speedup = 4 = 1/(0.5/4 + 0.5·16/(4·16))

Need to increase parallelism to make multicore optimal!



Symmetric Multicore Chip, N = 16 BCEs

F=0.9, R=2, Cores=8, Speedup=6.7

F=0.5 R=16, Cores=1, Speedup=4

At F = 0.9, multicore is optimal, but the speedup is limited. Need to obtain even more parallelism!



Symmetric Multicore Chip, N = 16 BCEs

F → 1: R = 1, Cores = 16, Speedup → 16

F matters: Amdahl’s Law applies to multicore chips MANY Researchers should target parallelism F first



Need a Third “Moore’s Law?”

Ø Technologist’s Moore’s Law ü Double Transistors per Chip every 2 years ü Slows or stops: To Be Discussed Ø Microarchitect’sMoore’s Law ü Double Performance per Core every 2 years ü Slowed or stopped: Early 2000s Ø Multicore’sMoore’s Law ü Double Cores per Chip every 2 years ü & Double Parallelism per Workload every 2 years ü & Aided by Architectural Support for Parallelism ü = Double Performance per Chip every 2 years ü Starting now

Ø Software as Producer, not Consumer, of Performance Gains!



Symmetric Multicore Chip, N = 16 BCEs

Recall F=0.9, R=2, Cores=8, Speedup=6.7

As Moore’s Law enables N to go from 16 to 256 BCEs, More cores? Enhance cores? Or both?



Symmetric Multicore Chip, N = 256 BCEs

F → 1: R = 1 (vs. 1), Cores = 256 (vs. 16), Speedup = 204 (vs. 16) → MORE CORES!

F = 0.99: R = 3 (vs. 1), Cores = 85 (vs. 16), Speedup = 80 (vs. 13.9) → MORE CORES & ENHANCE CORES!

F = 0.9: R = 28 (vs. 2), Cores = 9 (vs. 8), Speedup = 26.7 (vs. 6.7) → ENHANCE CORES!

As Moore's Law increases N, enhanced core designs are often needed. Some architecture researchers should target single-core performance.



Software for Large Symmetric Multicore Chips

Ø F matters: Amdahl’s Law applies to multicorechips

Ø N = 256 ü F=0.9 è Speedup = 27 @ R = 28 ü F=0.99 è Speedup = 80 @ R = 3 ü F=0.999 è Speedup = 204 @ R = 1

Ø N = 1024 ü F=0.9 è Speedup = 53 @ R = 114 ü F=0.99 è Speedup = 161 @ R = 10 ü F=0.999 è Speedup = 506 @ R = 1
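These optima can be reproduced by sweeping R (a self-contained sketch under the same Perf(R) = sqrt(R) assumption):

    import math

    def symmetric_speedup(f, n, r):
        p = math.sqrt(r)                      # Perf(R) = sqrt(R) assumption
        return 1.0 / ((1.0 - f) / p + f * r / (p * n))

    for n in (256, 1024):
        for f in (0.9, 0.99, 0.999):
            r = max(range(1, n + 1), key=lambda x: symmetric_speedup(f, n, x))
            print(f"N={n} F={f}: speedup ~{symmetric_speedup(f, n, r):.0f} at R={r}")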

Ø Researchers must target parallelism F first



Asymmetric (Heterogeneous) Multicore Chips

Ø Symmetric multicore required all cores to be equal
Ø Why not enhance some (but not all) cores?

Ø For Amdahl's simple software assumptions
ü One enhanced core
ü The others are base cores

Ø How? (The model ignores the design cost of an asymmetric design.)

Ø How does this affect our hardware model?



How Many Cores per Asymmetric Chip?

Ø Each chip bounded to N BCEs (for all cores)
Ø One R-BCE core leaves N - R BCEs
Ø Use the N - R BCEs for N - R base cores
Ø Therefore, 1 + N - R cores per chip
Ø For an N = 16 BCE chip:

Symmetric: four 4-BCE cores | Asymmetric: one 4-BCE core & twelve 1-BCE base cores



Performance of Asymmetric Multicore Chips

Ø Serial fraction 1 - F is the same, so serial time = (1 - F) / Perf(R)

Ø Parallel fraction F
ü One core at rate Perf(R)
ü N - R cores at rate 1
ü Parallel time = F / (Perf(R) + N - R)

Ø Therefore, w.r.t. one base core:

$\text{Asymmetric Speedup} = \dfrac{1}{\dfrac{1-F}{Perf(R)} + \dfrac{F}{Perf(R) + N - R}}$
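The corresponding sketch (same illustrative Perf(R) = sqrt(R) assumption):

    import math

    def asymmetric_speedup(f, n, r):
        p = math.sqrt(r)                      # Perf(R) = sqrt(R) assumption
        return 1.0 / ((1.0 - f) / p + f / (p + n - r))

    # Reproduces the N = 256, F = 0.99 slide point.
    print(asymmetric_speedup(0.99, 256, 41))  # ~166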



Asymmetric Multicore Chip, N = 256 BCEs

[Figure: speedup vs. R for configurations of 256 cores, 1+252, 1+240, 1+192, ..., down to 1 core.]

Number of cores = 1 (enhanced) + 256 - R (base). How do asymmetric and symmetric speedups compare?



Recall Symmetric Multicore Chip, N = 256 BCEs

Recall F=0.9, R=28, Cores=9, Speedup=26.7



Asymmetric Multicore Chip, N = 256 BCEs

F = 0.99: R = 41 (vs. 3), Cores = 216 (vs. 85), Speedup = 166 (vs. 80)

F = 0.9: R = 118 (vs. 28), Cores = 139 (vs. 9), Speedup = 65.6 (vs. 26.7)

Asymmetric offers greater speedup potential than symmetric. In the paper: as Moore's Law increases N, asymmetric gets better. Some architecture researchers should target asymmetric multicores.



Asymmetric Multicore: 3 Software Issues

1. Schedule computation (e.g., when to use the bigger core)
2. Manage locality (e.g., sending code or data can sap gains)
3. Synchronize (e.g., asymmetric cores reaching a barrier)

At what level?
ü Application programmer
ü Library author
ü Compiler
ü Runtime system
ü Operating system
ü Hypervisor (virtual machine monitor)
ü Hardware



Dynamic Multicore Chips, Take 1

Ø Why NOT Have Your Cake and Eat It Too?

Ø N base cores for best parallel performance
Ø Harness R cores together for serial performance

Ø How? DYNAMICALLY harness cores together

[Figure: the same chip runs N base cores in parallel mode and fuses them into one large core in sequential mode.]



Dynamic Multicore Chips, Take 2

Ø Let POWER provide the limit of N BCEs
Ø While area is unconstrained (to first order)

[Figure: parallel mode vs. sequential mode.] How to model these two chips?

Ø Result: N base cores for parallel; a large core for serial
ü [Chakraborty, Wells, & Sohi, Wisconsin CS-TR-2007-1607]
ü When the Simultaneous Active Fraction (SAF) < ½



Performance of Dynamic Multicore Chips

Ø N base cores, with R BCEs harnessed for serial execution

Ø Serial fraction 1 - F uses R BCEs at rate Perf(R)
Ø Serial time = (1 - F) / Perf(R)

Ø Parallel fraction F uses N base cores at rate 1 each
Ø Parallel time = F / N

Ø Therefore, w.r.t. one base core:

$\text{Dynamic Speedup} = \dfrac{1}{\dfrac{1-F}{Perf(R)} + \dfrac{F}{N}}$
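And the dynamic variant (same illustrative assumptions):

    import math

    def dynamic_speedup(f, n, r):
        p = math.sqrt(r)                      # Perf(R) = sqrt(R) assumption
        return 1.0 / ((1.0 - f) / p + f / n)

    # Harnessing all 256 BCEs serially reproduces the N = 256, F = 0.99 slide point.
    print(dynamic_speedup(0.99, 256, 256))    # ~223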


Recall Asymmetric Multicore Chip, N = 256 BCEs

Recall F=0.99 R=41 Cores=216 Speedup=166

What happens with a dynamic chip?



Dynamic Multicore Chip, N = 256 BCEs

F = 0.99: R = 256 (vs. 41), Cores = 256 (vs. 216), Speedup = 223 (vs. 166)

Dynamic offers greater speedup potential than asymmetric. Architecture researchers should target dynamically harnessing cores.


Dynamic Asymmetric Multicore: 3 Software Issues

1. Schedule computation (e.g., when to use the bigger core)
2. Manage locality (e.g., sending code or data can sap gains)
3. Synchronize (e.g., asymmetric cores reaching a barrier)

At what level?
ü Application programmer
ü Library author
ü Compiler
ü Runtime system
ü Operating system
ü Hypervisor (virtual machine monitor)
ü Hardware

Dynamic challenges > asymmetric ones. Dynamic chips are likely, due to power.


Three Multicore Amdahl’s Law

$\text{Symmetric Speedup} = \dfrac{1}{\dfrac{1-F}{Perf(R)} + \dfrac{F \cdot R}{Perf(R) \cdot N}}$ (sequential section: 1 enhanced core; parallel section: N/R enhanced cores)

$\text{Asymmetric Speedup} = \dfrac{1}{\dfrac{1-F}{Perf(R)} + \dfrac{F}{Perf(R) + N - R}}$ (sequential: 1 enhanced core; parallel: 1 enhanced + N - R base cores)

$\text{Dynamic Speedup} = \dfrac{1}{\dfrac{1-F}{Perf(R)} + \dfrac{F}{N}}$ (sequential: R BCEs harnessed; parallel: N base cores)



Multicore architectures

Building blocks of multicore processors (MC)

• Cores (C)
• L2 cache(s) (L2)
• L3 cache(s), if available (L3)
• Interconnection network (IN)
• FSB controller (FSB c.), bus controller (B. c.), or memory controller (M. c.)

[Figure: two cores with split L2 I/D caches, per-core L3 caches, an interconnection network, and an FSB controller attached to the FSB.]



Design space of Multicores

Macro architecture of multi-core (MC) processors

ü Layout of the cores
ü Layout of L2 caches
ü Layout of L3 caches (if available)
ü Layout of the on-chip interconnections
ü Layout of the I/O and memory architecture



Layout of the cores – 1

Basic layout

SMC (symmetrical MC): all other MCs
HMC (heterogeneous MC): Cell BE (1 PPE + 8 SPEs)


Layout of the cores – 2

Physical implementation

Monolithic implementation: Yonah (Core Duo), Pentium D 8xx, Pentium EE 840, Tulsa, Merom, Conroe, Woodcrest, Montecito, and all DCs of AMD, IBM, Sun, HP and RMI

Multi-chip implementation: Paxville DP, Paxville MP, Pentium D 9xx, Pentium EE 955/965, Dempsey, Kentsfield, Clovertown


L2 caches – 1

Allocation of L2 caches to the cores

Private L2 cache for each core | shared L2 cache for all cores

Examples:

[Figure: Athlon 64 X2 (2005) – private L2 per core, system request queue, crossbar, bus and memory controllers on the HT bus; Core Duo (2006) and Core 2 Duo (2006) – shared L2 with an FSB controller on the FSB.]


L2 caches – 2

Allocation of L2 caches to the cores

Private L2 cache for each core: Smithfield-, Presler-, Irwindale- and Cedar Mill-based lines (2005/2006), Montecito (2006), Athlon 64 X2 and AMD's Opteron lines (2005/2006), POWER6 (2007), UltraSPARC IV (2004), UltraSPARC IV+ (2005), PA 8800 (2004), PA 8900 (2005)

Shared L2 cache for all cores: Core Duo (Yonah) and Core 2 Duo (Core) based lines (2006), POWER4 (2001), POWER5 (2005), Cell BE (2006), UltraSPARC T1 (2005), UltraSPARC T2 (2007), SPARC64 VI (2007), SPARC64 VII (2008), XLR (2005)

L2 caches – 3

Inclusion policy of L2 caches

Inclusive L2 (most implementations): lines replaced (victimized) in L1 are written to L2; references to data missing in L1 but available in L2 reload the pertaining cache line into L1. The L2 usually operates as a write-back cache: only modified data that is replaced in L2 is written back to memory, while unmodified replaced data is simply deleted.

Exclusive L2 (victim cache): the L2 holds the lines victimized in L1; lines missing in L1 but present in L2 are reloaded into L1 and deleted from L2. Examples: Athlon 64 X2 (2005), dual-core Opteron lines (2005).


L2 caches – 4

Use by instructions/data

Unified instr./data cache(s): the typical implementation (the expected trend)
Split instr./data cache(s): Montecito (2006)


L2 caches – 6

Mapping of addresses to the L2 modules/banks

POWER4 (2001), POWER5 (2005): the cores reach three shared L2 modules through a crossbar and the fabric bus controller (with L3 tags/controller and the GX bus attached). Mapping of addresses to the modules: the 128-byte L2 cache lines are hashed across the 3 modules; hashing is performed by modulo-3 arithmetic applied to a large number of real address bits.

UltraSPARC T1 (2005) (Niagara; 8 cores, 4 L2 banks): the cores reach the four L2 banks through a crossbar, with a memory controller behind each bank. Mapping of addresses to the banks: the four L2 banks are interleaved at 64-byte blocks.
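Illustrative address-to-bank mappings for the two schemes (simplified sketches; the real POWER hash draws on many more real-address bits):

    def niagara_bank(addr: int) -> int:
        # Four L2 banks interleaved at 64-byte blocks.
        return (addr // 64) % 4

    def power4_module(addr: int) -> int:
        # 128-byte lines distributed over 3 L2 modules by modulo-3 arithmetic.
        return (addr // 128) % 3

    print(niagara_bank(0x1240), power4_module(0x1240))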


L2 caches – 7

Integration into the processor chip

Partial integration of L2 (on-chip L2 tags/control, off-chip data): UltraSPARC IV (2004), PA 8800 (2004), PA 8900 (2005)

Full integration of L2 (entire L2 on-chip): UltraSPARC IV+ (2005) and all other lines considered – the expected trend



L3 caches – 1

Allocation of L3 caches to the L2 caches

Private L3 cache for each L2: Montecito (2006)
Shared L3 cache for all L2s: Athlon 64 X2 Barcelona (2007), POWER4 (2001), POWER5 (2005), UltraSPARC IV+ (2005)


L3 caches – 2

Inclusion policy of L3 caches

Inclusive L3: POWER4 (2001), Montecito (2006) – lines replaced (victimized) in L2 are written to L3; references to data missing in L2 but available in L3 reload the pertaining cache line into L2. The L3 usually operates as a write-back cache: only modified data that is replaced in L3 is written back to memory, while unmodified replaced data is deleted.

Exclusive L3 (victim cache): Athlon 64 X2 Barcelona (2007), POWER5 (2005), POWER6 (2007), UltraSPARC IV+ (2005) – the L3 holds the lines victimized in L2; lines missing in L2 but present in L3 are reloaded into L2 and deleted from L3.


L3 caches – 3

Integration into the processor chip

L3 partially integrated (on-chip L3 tags/control, off-chip data): UltraSPARC IV+ (2005), POWER4 (2001), POWER5 (2005), POWER6 (2007)

L3 fully integrated (entire L3 on-chip): Montecito (2006) – the expected trend


L3 caches – Examples

Examples of partially integrated L3 caches:

[Figure: POWER5 (2005) – two cores share an on-chip L2; the fabric bus controller connects to the on-chip L3 tags/controller, with the L3 data off-chip and the memory controller on the GX bus. UltraSPARC IV+ (2005) – two cores with a shared on-chip L2 and on-chip L3 tags/controller, off-chip L3 data, and bus/memory controllers on the Fire Plane bus.]


L2-L3 caches

Cache architecture

L2 partially integrated, no L3 – private L2: UltraSPARC IV, Gemini, PA-8800, PA-8900
L2 fully integrated, no L3 – private L2: Smithfield/Presler-based processors, Athlon 64 X2 (excl.), dual-core Opteron (excl.); shared L2: Core Duo (Yonah), Core 2 Duo based processors, Cell BE, UltraSPARC T1/T2, SPARC64 VI/VII, XLR
L2 fully integrated, L3 partially integrated – private L2: POWER6 (excl.); shared L2: POWER4, POWER5 (excl.), UltraSPARC IV+ (excl.)
L2 fully integrated, L3 fully integrated – private L2: Montecito



L2-L3 caches – examples

Examples for cache architectures (1)

L2 partially integrated, no L3:

[Figure: private L2 – UltraSPARC IV (2004): per-core on-chip L2 tags/controller with off-chip L2 data, bus and memory controllers on the Fire Plane bus; shared L2 – PA-8800 (2004), PA-8900 (2005): on-chip L2 tags/controller with off-chip L2 data, shared by the two cores through an interconnection network, FSB controller on the FSB.]


L2-L3 caches - examples

Examples for cache architectures (2)

L2 fully integrated, private, no L3 — Athlon 64 X2 (2005): per-core L2, System Request Queue, crossbar (Xbar), integrated memory controller and HT bus.
L2 fully integrated, shared, no L3 — Core Duo (2006) and Core 2 Duo (2006): two cores with a shared L2 and an FSB controller on the FSB; UltraSPARC T1 (2005): cores 0-7 connected through a crossbar interconnect to a banked shared L2, per-bank memory controllers and memory, plus a JBus controller.


L2-L3 caches - examples

Examples for cache architectures (3)

L2 fully integrated, shared, L3 partially integrated —
POWER4 (2001): cores with a shared L2, chip-to-chip/module-to-module interconnect (Fabric Bus Controller), on-chip L3 tags/controller with off-chip L3 data, GX bus, memory controller and memory.
UltraSPARC IV+ (2005): cores connected through an interconnection network to a shared L2, on-chip L3 tags/controller with off-chip L3 data, bus and memory controllers on the Fire Plane bus, memory.


L2-L3 caches - example

Examples for cache architectures (4)

L2 fully integrated, private, L3 fully integrated — Montecito (2006): two cores, each with private L2 I and L2 D caches and a private on-chip L3, sharing an FSB controller on the FSB.


Cores' interconnect - 1

On-chip interconnections

• Between cores and shared L2 modules
• Between private or shared L2 modules and shared L3 modules
• Between private or shared L2/L3 modules and the bus controller/memory controller (B.c./M.c.), or alternatively the FSB controller
• Chip-to-chip and module-to-module interconnects


Cores' interconnect - 2

Implementation of interconnections

Implementation alternatives: arbiter/multi-port cache, crossbar, ring.

Quantitative aspects, such as the number of sources/destinations or the bandwidth requirement, determine which implementation alternative is the most beneficial.

Arbiter/multi-port cache implementations: for a small number of sources/destinations, e.g. to connect dual cores to shared L2 caches.
Crossbar: for a larger number of sources/destinations — UltraSPARC T1 (2005), UltraSPARC T2 (2007), XLR (2005).
Ring: Cell BE (2006).


Performance Beyond ILP

Ø Much higher natural parallelism in some applications
ü Databases, web servers, or scientific codes
Ø Explicit Thread-Level Parallelism
Ø Thread: has its own instructions and data
ü May be part of a parallel program or of independent programs
ü Each thread has all the state (instructions, data, PC, register state, and so on) needed to execute
Ø Multithreading: Thread-Level Parallelism within a processor

http://www.intel.com/technology/computing/dual-core/demo/popup/demo.htm


Principles of sequential, multitasked, and multithreaded programming

Sequential programming | Multitasked programming | Multithreaded programming

[Diagram: process/thread management example — sequential: a single process P1; multitasked: P1 uses fork()/exec() to create processes P2 and P3; multithreaded: CreateProcess()/CreateThread() and fork() spawn threads T1-T6, which finish with join(). A minimal code contrast follows below.]
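To make the contrast concrete, here is a minimal C sketch (illustrative, not from the slides): fork() creates a new process with its own address space, while pthread_create() adds a thread sharing the parent's address space.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static void *worker(void *arg) {            /* thread entry point */
    printf("thread %ld running in the parent's address space\n", (long)arg);
    return NULL;
}

int main(void) {
    /* Multitasked: fork() creates a new process with a private address space. */
    pid_t pid = fork();
    if (pid == 0) { printf("child process\n"); _exit(0); }
    waitpid(pid, NULL, 0);

    /* Multithreaded: pthread_create() adds a thread to this address space. */
    pthread_t t;
    pthread_create(&t, NULL, worker, (void *)1L);
    pthread_join(t, NULL);                   /* corresponds to join() in the figure */
    return 0;
}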


Main features of multithreading

Threads:
• belong to the same process,
• usually share a common address space (otherwise multiple address-translation paths (virtual to real) need to be maintained concurrently),
• are executed concurrently (simultaneously, i.e. overlapped by time sharing, or in parallel), depending on the implementation of multithreading.

Main tasks of thread management

• creation, control and termination of individual threads,
• context switching between threads,
• maintaining multiple sets of thread states.

Basic thread states

• thread program state (state of the ISA), including the PC, the FX/FP architectural registers and the state registers,
• thread microstate (supplementary state of the microarchitecture), including the rename register mappings, branch history, ROB, etc.
A sketch of the program state a context switch must save and restore follows below.
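A minimal C sketch of the per-thread program state and a software context switch; the fields are illustrative assumptions, not a real ISA's layout.

#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t pc;           /* program counter */
    uint64_t fx_regs[32];  /* fixed-point architectural registers */
    double   fp_regs[32];  /* floating-point architectural registers */
    uint64_t state_reg;    /* condition/state register */
} thread_state;

/* Software context switch: save the running thread's state, load the next. */
void context_switch(thread_state *current, const thread_state *next,
                    thread_state *hw /* the physical register file */) {
    memcpy(current, hw, sizeof *hw);   /* save the suspended thread */
    memcpy(hw, next, sizeof *next);    /* restore the thread to run */
}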


Thread-Level Parallelism (TLP)

Ø ILP exploits implicit parallel operations within a loop or straight-line code segment
Ø TLP is explicitly represented by multiple threads of execution that are inherently parallel
Ø Goal: use multiple instruction streams to improve
ü Throughput of computers that run many programs
ü Execution time of multi-threaded programs
Ø TLP could be more cost-effective to exploit than ILP


Hardware multi-threading comes back

Ø Intel's Montecito (Itanium 2) – up to 35% improvement cited
ü 2 cores, 2 non-simultaneous threads per core ("temporal multithreading")
ü Switches over to the other thread in case of a high-latency event (i.e. a page fault)
Ø Sun's Niagara 1 (UltraSPARC T1)
ü 8 cores, 4 non-simultaneous threads per core
ü More fine-grained switching – after every instruction
ü A thread is skipped if it triggers a large-latency event
ü Single thread slower, but throughput very high
Ø Sun's Niagara 2 (UltraSPARC T2)
ü 8 cores, 8 non-simultaneous threads per core
ü MySQL compiled with Sun's compiler runs 2.5x faster than with gcc –O3
Ø Intel chips from Nehalem (Core i7, Q4 '08) onwards feature HyperThreading
ü Simultaneous multi-threading (only on superscalar CPUs)
ü Takes advantage of instruction-level parallelism


Options for Multithreading

Implementation of multithreading

(while executing multithreaded apps/OSs)

Software multithreading: execution of multithreaded apps/OSs on a single-threaded processor simultaneously (i.e. by time sharing); multiple threads are maintained simultaneously by the OS (multithreaded OSs).

Hardware multithreading: execution of multithreaded apps/OSs on a multithreaded processor concurrently; multiple threads are maintained concurrently by the processor (multithreaded processors).

Fast context switching between threads is required.


Multithreaded processors

Multicore processors (SMP: Symmetric Multiprocessing, CMP: Chip Multiprocessing): several cores per chip, each with its own L2/L3, sharing the L3/memory.
Multithreaded cores: a single MT core per chip with its L2/L3 and L3/memory.


Requirements of Multithreading

Ø Requirement of software multithreading

Maintaining multiple thread program states concurrently by the OS, including the PC, the FX/FP architectural registers and the state registers

Ø Core enhancements needed in multithreaded cores

• Maintaining multiple thread program states concurrently by the processor, including the PC, FX/FP architectural registers and state registers
• Maintaining multiple thread microstates, pertaining to the rename register mappings, the RAS (Return Address Stack), the ROB, etc.
• Providing increased sizes for scarce or sensitive resources, such as the instruction buffer and store queue, and, in the case of merged architectural and rename registers, appropriately large register file sizes (FX/FP)

Ø Options to provide multiple states

• Implementing individual per-thread structures, like 2 or 4 sets of FX registers
• Implementing tagged structures, like a tagged ROB, a tagged instruction buffer, etc.


Chip area requirements vs gain

                        Multicore processors   Multithreaded cores
Additional complexity        ~(60-80) %             ~(2-10) %
Additional gain
(in gen. purp. apps)         ~(60-80) %             ~(0-30) %


Multithreaded OSs

Ø Windows NT
Ø OS/2
Ø Unix w/Posix
Ø Most OSs developed from the '90s on


Contrasting sequential, multitasked and multithreaded execution (1)

Sequential programs: a single process on a single processor. No issues with parallel programs. Key issue: the sequential bottleneck.

Multitasked programs (software implementation): multiple processes on a single-threaded processor using time sharing; multiple programs with quasi-parallel execution; private address spaces. Key issue: solutions for fast context switching.

Multithreaded programs, software multithreading: multithreaded software on a single processor using time sharing; quasi-parallel execution; threads share the process address space; thread context switches are needed. Key issues: thread state management and context switching.

Multithreaded programs, hardware multithreading on a multithreaded core: simultaneous execution of threads; threads share the address space; no thread context switches needed (except in coarse-grained MT). Key issue: intra-core communication.

Multithreaded programs, hardware multithreading on a multicore processor: true parallel execution of threads; threads share the address space; no thread context switches needed. Key issue: thread scheduling.


Contrasting sequential, multitasked and multithreaded execution (2)

OS support — sequential programs: legacy OS support; multitasked programs: traditional Unix; multithreaded programs (software multithreading, multithreaded core, multicore processor): most modern OSs (Windows NT/2000, OS/2, Unix w/Posix).

Performance level — sequential: low; multitasked: low-medium; software multithreading: high; on a multithreaded core: higher; on a multicore processor: highest.

Software development — sequential programs: no API-level support; multitasked programs: process life-cycle management API; multithreaded programs (all three implementations): process and thread management API, explicit threading API, OpenMP.


Multithreading Paradigms

[Diagram: issue slots (FU1-FU4) over execution time, filled by Threads 1-5 or left unused, for five paradigms:]
Conventional superscalar (single-threaded) | Fine-grained multithreading (cycle-by-cycle interleaving) | Coarse-grained multithreading (block interleaving) | Chip multiprocessor (CMP), called multi-core processors today | Simultaneous multithreading (SMT), Intel's HT


Fine-Grained Multithreading

Ø Switches between threads on each instruction, interleaving the execution of multiple threads
ü Usually done round-robin, skipping stalled threads
Ø The CPU must be able to switch threads on every clock cycle
Ø Pro: hides the latency of both short and long stalls
ü Instructions from other threads are always available to execute
ü Easy to insert on short stalls
Ø Con: slows down the execution of individual threads
ü A thread ready to execute without stalls is delayed by instructions from other threads
Ø Used on Sun's Niagara (a round-robin selection sketch follows below)
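A minimal C sketch of cycle-by-cycle round-robin selection that skips stalled threads; illustrative only, not Niagara's actual logic.

#include <stdbool.h>

#define NTHREADS 4

static bool stalled[NTHREADS];   /* set on e.g. a cache miss, cleared on refill */
static int  last = NTHREADS - 1; /* thread selected in the previous cycle */

int select_thread_fine_grained(void) {
    for (int i = 1; i <= NTHREADS; i++) {
        int t = (last + i) % NTHREADS;   /* next thread in round-robin order */
        if (!stalled[t]) { last = t; return t; }  /* issue from thread t */
    }
    return -1;  /* all threads stalled: the issue slot stays idle this cycle */
}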


Coarse-Grained Multithreading

Ø Switches threads only on costly stalls
ü e.g., L2 cache misses
Ø Pro: no switching each clock cycle
ü Relieves the need for very fast thread switching
ü No slowdown for ready-to-go threads
§ Other threads only issue instructions when the main one would stall (for a long time) anyway
Ø Con: limited ability to hide shorter stalls
ü The pipeline must be emptied or frozen on a stall, since the CPU issues instructions from only one thread
ü The new thread must fill the pipe before its instructions can complete
ü Thus better for reducing the penalty of high-cost stalls, where pipeline refill time << stall time
Ø Used in the IBM AS/400 (a switch-on-miss sketch follows below)
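A minimal C sketch of the switch-on-costly-stall policy; the thread count and refill penalty are illustrative assumptions, not the AS/400's actual parameters.

#define NTHREADS 2
#define REFILL_CYCLES 10   /* illustrative pipeline refill penalty */

static int current = 0;

/* Called once per cycle; returns dead cycles spent refilling the pipeline. */
int on_cycle(int l2_miss /* 1 if the running thread just missed in L2 */) {
    if (l2_miss) {
        current = (current + 1) % NTHREADS;  /* switch to the other thread */
        return REFILL_CYCLES;                /* pipeline refill before it issues */
    }
    return 0;                                /* keep issuing from `current` */
}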


Simultaneous Multithreading (SMT)

Ø Exploits TLP at the same time it exploits ILP
Ø Intel's HyperThreading (2-way SMT)
Ø Others: IBM POWER5 and Intel's future multicore (8-core, 2-thread, 45 nm Nehalem)
Ø Basic ideas: conventional MT + simultaneous issue + sharing of common resources

[Diagram: SMT pipeline — per-thread PCs and renamers feed shared physical register files and execution units: FDiv (unpipelined, 16 cycles), FMult (4 cycles), FAdd (2 cycles), ALU1, ALU2 and load/store (variable latency), with I-cache and D-cache.]


Simultaneous Multithreading (SMT)

Ø Insight: a dynamically scheduled processor already has many HW mechanisms to support multithreading:

ü A large set of virtual registers that can hold the register sets of independent threads
ü Register renaming provides unique register identifiers
§ Instructions from multiple threads can be mixed in the data path
§ Without confusing sources and destinations across threads!
ü Out-of-order completion allows the threads to execute out of order and get better utilization of the HW

Ø Just add a per-thread renaming table and keep separate PCs
ü Independent commitment can be supported via a separate reorder buffer for each thread


Overview of multithreaded cores (1)

SCMT (single-core multithreaded):
• Pentium 4 (Northwood B) — 11/02, 130 nm/146 mm2, 55 mtrs./82 W, 2-way MT
• Pentium 4 (Prescott) — 02/04, 90 nm/112 mm2, 125 mtrs./103 W, 2-way MT

DCMT (dual-core multithreaded):
• Pentium EE 840 (Smithfield) — 5/05, 90 nm/2*103 mm2, 230 mtrs./130 W, 2-way MT/core
• Pentium EE 955/965 (Presler) — 1/06, 65 nm/2*81 mm2, 2*188 mtrs./130 W, 2-way MT/core

Figure 2.1: Intel's multithreaded desktop families (timeline 2002-2006)


Overview of multithreaded cores (2)

SCMT:
• Pentium 4 (Prestonia-A) — 2/02, 130 nm/146 mm2, 55 mtrs./55 W, 2-way MT
• Pentium 4 (Irwindale-A) — 11/03, 130 nm/135 mm2, 169 mtrs./110 W, 2-way MT
• Pentium 4 (Nocona) — 6/04, 90 nm/112 mm2, 125 mtrs./103 W, 2-way MT

DCMT:
• Xeon DP 2.8 (Paxville DP) — 10/05, 90 nm/2*135 mm2, 2*169 mtrs./135 W, 2-way MT/core
• Xeon 5000 (Dempsey) — 6/06, 65 nm/2*81 mm2, 2*188 mtrs./95/130 W, 2-way MT/core

Figure 2.2: Intel's multithreaded Xeon DP families (timeline 2002-2006)


Overview of multithreaded cores (3)

SCMT:
• Pentium 4 (Foster-MP) — 3/02, 180 nm/n/a, 108 mtrs./64 W, 2-way MT
• Pentium 4 (Gallatin) — 3/04, 130 nm/310 mm2, 178/286 mtrs./77 W, 2-way MT
• Pentium 4 (Potomac) — 3/05, 90 nm/339 mm2, 675 mtrs./95/129 W, 2-way MT

DCMT:
• Xeon 7000 (Paxville MP) — 11/05, 90 nm/2*135 mm2, 2*169 mtrs./95/150 W, 2-way MT/core
• Xeon 7100 (Tulsa) — 8/06, 65 nm/435 mm2, 1328 mtrs./95/150 W, 2-way MT/core

Figure 2.3: Intel's multithreaded Xeon MP families (timeline 2002-2006)


Overview of multithreaded cores (4)

DCMT:
• Itanium 2 9x00 (Montecito) — 7/06, 90 nm/596 mm2, 1720 mtrs./104 W, 2-way MT/core

Figure 2.4: Intel's multithreaded EPIC-based server family (timeline 2002-2006)


Overview of multithreaded cores (5)

SCMT:
• RS64 IV (Sstar) — 2000, 180 nm/n/a, 44 mtrs./n/a, 2-way MT
• Cell BE PPE — 2006, 90 nm/221* mm2, 234* mtrs./95* W, 2-way MT (*: entire processor)

DCMT:
• POWER5 — 5/04, 130 nm/389 mm2, 276 mtrs./80 W (est.), 2-way MT/core
• POWER5+ — 10/05, 90 nm/230 mm2, 276 mtrs./70 W, 2-way MT/core
• POWER6 — 2007, 65 nm/341 mm2, 750 mtrs./~100 W, 2-way MT/core

Figure 2.5: IBM's multithreaded server families (timeline 2000, 2004-2007)


Overview of multithreaded cores (6)

8CMT:
• UltraSPARC T1 (Niagara) — 11/2005, 90 nm/379 mm2, 279 mtrs./63 W, 4-way MT/core
• UltraSPARC T2 (Niagara II) — 2007, 65 nm/342 mm2, 72 W (est.), 8-way MT/core

QCMT:
• APL SPARC64 VII (Jupiter) — 2008, 65 nm/464 mm2, ~120 W, 2-way MT/core

DCMT:
• APL SPARC64 VI (Olympus) — 2007, 90 nm/421 mm2, 540 mtrs./120 W, 2-way MT/core

Figure 2.6: Sun's and Fujitsu's multithreaded server families (timeline 2004-2008)


Overview of multithreaded cores (7)

8CMT:
• XLR 5xx — 5/05, 90 nm/~220 mm2, 333 mtrs./10-50 W, 4-way MT/core

Figure 2.7: RMI's multithreaded XLR family (scalar RISC) (timeline 2002-2006)


Overview of multithreaded cores (8)

SCMT:
• Alpha 21464 (V8) — planned for 2003, 130 nm/n/a, 250 mtrs./10-50 W, 4-way MT; cancelled 6/2001

Figure 2.8: DEC's/Compaq's multithreaded processor (timeline 2002-2006)


Overview of multithreaded cores (9)

Underlying core(s)

Scalar core(s):
• SUN UltraSPARC T1 (2005) (Niagara) — up to 8 cores/4T
• RMI XLR 5xx (2005) — 8 cores/4T

Superscalar core(s):
• IBM RS64 IV (2000) (SStar) — single-core/2T
• Pentium 4 based processors — single-core/2T (2002-), dual-core/2T (2005-)
• PPE of Cell BE (2006) — single-core/2T
• Fujitsu SPARC64 VI/VII — dual-core/quad-core/2T

VLIW core(s):
• SUN MAJC 5200 (2000) — quad-core/4T (dedicated use)
• Intel Montecito (2006) — dual-core/2T


Thread scheduling (1)

Thread scheduling in software multithreading on a traditional superscalar processor:

[Diagram: dispatch slots vs clock cycles — Thread 1 executes, a context switch occurs, then Thread 2 executes.]

The execution of a new thread is initiated by a context switch (needed to save the state of the suspended thread and to load the state of the thread to be executed next).

Thread scheduling assuming software multithreading on a 4-way superscalar processor


Thread scheduling (2)

Thread scheduling in multicore processors (CMPs):

[Diagram: dispatch slots vs clock cycles — Thread 1 and Thread 2 run at the same time on different cores.]

Both t-way superscalar cores execute different threads independently.

Thread scheduling in a dual-core processor


Thread scheduling (3)

Thread scheduling in multithreaded cores

Coarse grained MT


Thread scheduling (4)

[Diagram: dispatch/issue slots vs clock cycles — Thread 1 runs until a costly stall, a rapid context switch occurs, then Thread 2 runs.]

Threads are switched by means of rapid, HW-supported context switches.

Thread scheduling in a 4-way coarse-grained multithreaded processor


Thread scheduling (5)

Coarse grained MT

Scalar based | Superscalar based | VLIW based

Superscalar based: IBM RS64 IV (2000) (SStar) — single-core/2T
VLIW based: SUN MAJC 5200 (2000) — quad-core/4T (dedicated use); Intel Montecito (2006?) — dual-core/2T


Thread scheduling (6)

Thread scheduling in multithreaded cores

Coarse grained MT Fine grained MT


Thread scheduling (7)

[Diagram: dispatch/issue slots vs clock cycles — in each cycle, all slots are filled from one of Threads 1-4.]

The hardware thread scheduler chooses a thread in each cycle, and instructions from this thread are dispatched/issued in this cycle.

Thread scheduling in a 4-way fine-grained multithreaded processor


Thread scheduling (8)

Fine grained MT

Round-robin selection policy | Priority-based selection policy
(each subdivided into scalar based / superscalar based / VLIW based)

Round robin, scalar based: SUN UltraSPARC T1 (2005) (Niagara) — up to 8 cores/4T
Priority based, superscalar based: PPE of Cell BE (2006) — single-core/2T


Thread scheduling (9)

Thread scheduling in multithreaded cores

Coarse grained MT Fine grained MT Simultaneous MT (SMT)


Thread scheduling (10)

[Diagram: dispatch/issue slots vs clock cycles — in each cycle, instructions from several of Threads 1-4 share the issue slots.]

Available instructions (chosen according to an appropriate selection policy, such as the priority of the threads) are dispatched/issued for execution in each cycle.

SMT: proposed by Tullsen, Eggers and Levy in 1995 (U. of Washington).

Thread scheduling in a 4-way simultaneous multithreaded processor


Thread scheduling (11)

SMT cores

Scalar based | Superscalar based | VLIW based

Superscalar based:
• Pentium 4 based proc.s — single-core/2T (2002-), dual-core/2T (2005-)
• DEC 21464 (2003) — single-core/4T (cancelled in 2001)
• IBM POWER5 (2005) — dual-core/2T


SMT Pipeline

[Diagram: SMT pipeline stages — Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire — with per-thread PCs, a shared register map, register files, Icache and Dcache.]

SMT in the Alpha 21464 (V8)

Data from Compaq


IBM Power 4

Single-threaded predecessor to the Power 5. Eight execution units in the out-of-order engine; each can issue an instruction every cycle.


Changes in IBM Power 5 to support SMT

Ø Increased associativity of the L1 instruction cache and the instruction address translation buffers
Ø Added per-thread load and store queues
Ø Increased size of the L2 (1.92 vs. 1.44 MB) and L3 caches
Ø Added separate instruction prefetch and buffering per thread
Ø Increased the number of virtual registers from 152 to 240
Ø Increased the size of several issue queues
Ø The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support


SMT in IBM Power 5

[Diagram: POWER5 pipeline — 2 fetch units (PCs) and 2 initial decodes; 2 commits (architected register sets).]


Detailed SMT pipeline of IBM POWER5

Source: Kalla, R., "IBM's POWER5 Micro Processor Design and Methodology", IBM Corporation, 2003

IBM Power 5 data flow

Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to become a bottleneck.

Power 5 thread performance

The relative priority of each thread is controllable in hardware.

For balanced operation, both threads run slower than if each "owned" the machine.


Pentium 4 Hyperthreading (2002)

Ø First commercial SMT design (2-way SMT)
ü Hyperthreading == SMT
Ø Logical processors share nearly all resources of the physical processor
ü Caches, execution units, branch predictors
Ø Die area overhead of hyperthreading ~ 5%
Ø When one logical processor is stalled, the other can make progress
ü No logical processor can use all entries in the queues when two threads are active
Ø A processor running only one active software thread runs at approximately the same speed with or without hyperthreading


Pentium-4 Hyperthreading Front End

[Diagram: front-end resources divided between the logical CPUs vs resources shared between the logical CPUs. Source: Intel Technology Journal, Q1 2002]


Pentium-4 Hyperthreading Execution Pipeline

[Intel Technology Journal, Q1 2002]


SMT adaptation to parallelism type

For regions with high thread-level parallelism (TLP), the entire machine width is shared by all threads. For regions with low TLP, the entire machine width is available for instruction-level parallelism (ILP).

[Diagram: issue width vs time for both regimes.]


Initial Performance of SMT

Ø Pentium 4 Extreme SMT yields a 1.01 speedup for the SPECint_rate benchmark and 1.07 for SPECfp_rate
ü The Pentium 4 is a dual-threaded SMT
ü SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark
Ø Running each of the 26 SPEC benchmarks paired with every other (26² runs) on a Pentium 4 gives speedups from 0.90 to 1.58; the average was 1.20
Ø An 8-processor Power 5 server is 1.23x faster for SPECint_rate with SMT, and 1.16x faster for SPECfp_rate
Ø Power 5 running 2 copies of each app: speedup between 0.89 and 1.41
ü Most gained some
ü FP apps had the most cache conflicts and the least gains


Pentium 4 Hyperthreading: Performance Improvements


ICOUNT Choosing Policy

Fetch from the thread with the fewest instructions in flight.

Why does this enhance throughput? The thread with the fewest instructions in flight is the one moving fastest through the machine, so favoring it keeps the issue queues filled with instructions that are likely to issue soon, rather than letting a stalled thread clog them. A sketch follows below.
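A minimal C sketch of the ICOUNT selection, assuming a simple per-thread in-flight counter (the counter maintenance itself is not shown).

#define NTHREADS 4

static int in_flight[NTHREADS];  /* instructions fetched but not yet retired */

int icount_select(void) {
    int best = 0;
    for (int t = 1; t < NTHREADS; t++)
        if (in_flight[t] < in_flight[best])
            best = t;            /* the least-loaded thread wins the fetch slot */
    return best;
}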


SMT Fetch Policies (Locks)

Ø Problem: a spin-looping thread consumes resources

Ø Solution: provide a quiescing operation that allows a thread to sleep until a memory location changes

loop:    ARM     r1, 0(r2)     ; load and start watching 0(r2)
         BEQ     r1, got_it
         QUIESCE               ; inhibit scheduling of this thread until activity is observed on 0(r2)
         BR      loop
got_it:


The Cray example

Ø Cray XMT
ü 128 hardware streams
§ A stream is 31 64-bit registers, 8 target registers, and a control register
ü Three functional units: M, A and C
ü 500 MHz
ü Full and Empty bits per word (2 bits)
Ø An example of a design with a very high degree of SMT


The Cray MTA2

[Diagram: programs running in parallel — serial code forks into concurrent threads of computation per subproblem (i = 0 ... n); the threads map onto the 128 hardware streams (unused streams stay idle); ready instructions enter the instruction ready pool and the pipeline of executing instructions.]

Cray MTA2 picture from John Feo's "Can Programmers and Machines Ever Be Friends?"


Summary: Multithreaded Categories

[Diagram: issue slots over time (processor cycles) for superscalar, fine-grained, coarse-grained, multiprocessing and simultaneous multithreading; slots are filled by Threads 1-5 or left idle.]


Multithreaded Programming

Ø A thread of execution is a fork of a computer program into two or more concurrently running tasks
Ø POSIX Threads (Pthreads) is a set of threading interfaces developed by the IEEE
Ø The "assembly language" of shared-memory programming
Ø The programmer has to manually:
ü Create and terminate threads
ü Wait for threads to complete
ü Manage the interaction between threads using mutexes, condition variables, etc.
(a small Pthreads example follows below)
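A minimal Pthreads sketch of this manual management — thread creation, a mutex-protected shared counter, and joining; illustrative, not from the slides.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *work(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* interaction managed by hand */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, work, NULL);   /* create threads   */
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);                  /* wait for threads */
    printf("counter = %ld\n", counter);            /* prints 400000    */
    return 0;
}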


Concurrency Platforms

• Programming directly on Pthreads is painful and error-prone.
• With Pthreads, you either sacrifice memory usage or load balance among processors.
• A concurrency platform provides linguistic support and handles load balancing (see the sketch after this list).
• Examples:
  • Threading Building Blocks (TBB)
  • OpenMP
  • Cilk++
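For contrast, a minimal OpenMP sketch in C, where the platform creates the threads and balances the load; the loop itself is an illustrative assumption.

#include <omp.h>
#include <stdio.h>

int main(void) {
    double sum = 0.0;
    /* OpenMP splits the iterations across threads and combines the sums. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= 1000000; i++)
        sum += 1.0 / i;
    printf("harmonic(1e6) = %f, max threads = %d\n", sum, omp_get_max_threads());
    return 0;
}

Compile with, e.g., gcc -fopenmp; no manual thread creation, joining or locking is needed.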


Trends

Ø Multicore - manycore

Ø Hybrid processors
ü Accelerators for specific kinds of computation
ü More difficult to take advantage of

Ø Application-specific supercomputers
ü Not covered


Multicore versus manycore (multicore vs manycore)

Ø Multicore: the present course/tactic
ü Same core
ü Double the number every 18 months (2, 4, 8, etc.)
ü Examples: Intel Dunnington (6 cores), Tukwila (next generation of Itanium, 4 cores), AMD Shanghai (4 cores), AMD Constantinople (6 cores), IBM Power 6 (up to 16 cores), IBM Power 7, AMD Magny-Cours (12 cores), ...

Ø Manycore: convergence toward this direction
ü Simplify the core (shorter pipelines, lower clock frequencies, in-order processing)
ü Start with 100 cores and double every 18 months
ü Examples: Nvidia G80 (128 cores), Intel Polaris (80 cores), Cisco/Tensilica Metro (188 cores)

Ø Convergence: ultimately to manycore
ü Manycore, provided we find out how to program them


From Multi- to Many-Core

Ø Multi-core (2-4 cores) designs dominate the commodity market and percolate into high-end systems
Ø Many-core (10s or 100+ cores) emerging
ü Heterogeneity is a real possibility
Ø Examples
ü Intel 80-core TeraScale chip & 32-core Larrabee chip
ü IBM Cyclops-64 chip with 160 thread units
ü ClearSpeed 96-core CSX chip
ü 2nd-generation Nvidia Tesla products use the 240-core C870 processor (0.9 TF single precision)
ü AMD/ATI FireStream 9170 with 320 cores

[Diagram: IBM Cell]


From Multi- to Many-Core

[Diagram: IBM Cell]


Example: Intel Tera-scale Processor

Ø 80 simple cores on a single chip
Ø Two programmable FP engines per core
Ø Design aspects:
ü Tiled design
ü Network on a chip
§ Each core has a 5-port message-passing router
§ Routers connected in a 2-D mesh
ü Fine-grain power management: compute engines/routers can be individually put to sleep
ü Other innovations: sleep transistors, mesochronous clocking, clock gating
Ø Reaches 1 Tflops peak while consuming 62 Watt


The problem with manycore!

"I know how to make 4 horses pull a cart - I don't know how to make 1024 chickens do it."
Enrico Clementi, former IBM fellow

Slide by Ruth Poole, IBM Software Engineer, Blue Gene Control System


HPC Accelerators

Ø Cell
Ø GPGPUs
Ø FPGAs
Ø ClearSpeed
Ø Hybrids
ü Intel Larrabee, ...


Sony-Toshiba-IBM Cell

Ø A heterogeneous architecture developed for the PS3
Ø Combines a PowerPC with co-processing elements to accelerate multimedia and vector-processing applications
Ø Software-controlled memories
Ø Available since 2005
Ø Many research Cell clusters/projects
Ø A hybrid Opteron-Cell cluster has become the first petaflop system
ü A nice overview can be found at: http://en.wikipedia.org/wiki/Cell_(microprocessor)
ü A nice workshop on Cell at UCSD: http://crca.ucsd.edu/cellworkshop/


IBM PowerXCell 8i


Architecture of the Cell

Power Processor Element (PPE):
• general-purpose 64-bit RISC processor (PowerPC 2.02)
• 2-way hardware multithreaded
• L1: 32KB I; 32KB D
• L2: 512KB
• coherent load/store
• VMX
• 3.2 GHz

Synergistic Processor Elements (SPE):
• 8 per chip
• 128-bit wide SIMD units
• integer and floating point
• 256KB local store
• up to 25.6 GF/s per SPE — 200 GF/s total *


Cell/B.E. - Example of eight concurrent transactions


Comparison to some multicores

[Plot: sparse matrix × vector (SpMV) performance compared across some multicores; the peak is difficult to achieve.]


Mercury Cell Accelerator Board 2

Ø 1 Cell processor, 2.8 GHz
Ø 4 GB DDR2, GbE, PCI-Express 16x, 256 MB DDR2
Ø Full Mercury MultiCore Plus SDK support

Workstation Accelerator Board


Mercury’s 1U Server DCBS-2

Ø Dual Cell processors, 3.2 GHz, 1 GB XDR per Cell, dual GigE, InfiniBand/10GE option, 8 SPEs per Cell
Ø Larger memory footprint compared to the PS3
Ø The dual Cell processors give a single application access to 16 SPEs
Ø Preconfigured with Yellow Dog Linux
Ø Binary-compatible with the PS3

Excellent Small Application Option


IBM QS-22 Blade

Ø Dual IBM PowerXCell™ 8i (new double-precision SPEs)
Ø Up to 32 GB DDRII per Cell
Ø Dual GigE and optional 10GE/InfiniBand
Ø Red Hat Enterprise Linux 5.2
Ø Full software/hardware support from IBM
Ø Up to 14 blades in a BladeCenter chassis
Ø Very high-density solution

Double Precision Workhorse


SONY ZEGO BCU-100

Ø 1U Cell server
Ø Single 3.2 GHz Cell/B.E., 1 GB XDR, 1 GB DDR2, RSX GPU, GE
Ø Full Cell/B.E. with 8 SPEs
Ø PCI-Express slot for InfiniBand or 10GbE
Ø Preloaded with Yellow Dog Linux

New Product


Software Development

Ø Utilize free software
ü IBM Cell Software Development Kit (SDK)
§ C/C++ libraries providing programming models, data movement operations, SPE local store management, process communication, and SIMD math functions
§ Open-source compilers/debuggers
n GNU and IBM XL C/C++ compilers
n Eclipse IDE enhancements specifically for Cell targets
n Instruction-level debugging on the PPE and SPEs
§ IBM System Simulator: allows testing Cell applications without Cell hardware
§ Code optimization tools
n Feedback Directed Program Restructuring (FDPR-Pro) optimizes performance and memory footprint
ü Eclipse Integrated Development Environment (IDE)
§ Compile on Linux, run remotely on Cell targets
§ Develop, compile, and run directly on Cell-based hardware and system simulators
ü Linux operating system
§ Customizable kernel
§ Large software base for development and management tools

Ø Additional software available for purchase
ü Mercury's MultiCore Framework (MCF), PAS, SAL

Multiple Software Development Options Allow Greater Flexibility and Cost Savings


Application Development Using IBM SDK

[Diagram: the user application's setup/init, communication, data I/O, processing and synchronization layers built directly on IBM SDK primitives — node management, mailbox, signals, lists, single intrinsics — alongside MCF/SAL, POSIX, C/C++ and NUMA support, with init & buffer management and init & logic layers.]

Advantages:
• SPE affinity supported (SDK 2.1+)
• Light-weight infrastructure
• Low-level DMA control and monitoring
• SPE-SPE communications possible
• Low-level DMA operations can be functionally hidden
• SPE-SPE transfers are possible
• Free w/o support

Disadvantages:
• Lower-level SPE/PPE setup/control
• Plugins are implicitly loaded at run time
• Increased complexity
• Manual I/O buffer management
• Technical support unknown

° Data processing and task synchronization are comparable between the Mercury MCF and the IBM SDK


Application Development Using Mercury MCF

[Diagram: the user application's setup/init, communication, data I/O, processing and synchronization layers built on MCF tasks/plugins, node management, mailbox, mutex and SAL, with IBM single intrinsics, POSIX, C/C++ and NUMA support.]

Advantages:
• Node management is easy to set up and change dynamically
• Simplifies complex data movement
• Various data I/O operations are hidden from the user after initial setup
• Multiple striding, overlapping and multi-buffer options available
• Single one-time transfers can be performed via the IBM DMA APIs
• Technical support provided

Disadvantages:
• Plugins must be explicitly defined and loaded at runtime
• SPE affinity is not supported
• Added overhead
• Mailbox communication is restricted to between a PPE and its SPEs
• SPE-to-SPE transfers aren't supported in MCF 1.1
• Cost
• Possible interference when trying to utilize IBM SDK features that aren't exposed via the MCF APIs

° Data processing and task synchronization are comparable between the Mercury MCF and the IBM SDK


GP-GPUs

Ø General-purpose computing on graphics processing units
Ø Using the GPU as a "vector CPU" by (ab)using the programmable vertex shaders
Ø Available since 2000
Ø Becoming more and more popular
Ø Uses stream processing to exploit the extreme parallelism
Ø CUDA tutorial at the Supercomputing Conference 2007
ü A nice introduction can be found at: http://en.wikipedia.org/wiki/GPGPU


Some GPGPUs


Programming GPGPUs - CUDA

CUDA: C for the GPU

• C programming environment that adds the ability to
− Specify manycore parallelism
− Map functions to the GPU
− Transfer data between GPU & CPU
• Low learning curve
− Easily specify 1000s of threads
• Integrated CPU-GPU programming model
• 50 million CUDA-enabled GPUs already deployed
(a minimal kernel sketch follows below)
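A minimal CUDA C sketch of these points — a function mapped to the GPU, thousands of threads, and CPU↔GPU data transfer; the kernel and sizes are illustrative assumptions, not from the slides.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one thread per element */
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *h = new float[n];
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice); /* CPU -> GPU */
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);   /* ~1M threads in 4096 blocks */
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost); /* GPU -> CPU */
    printf("h[0] = %f\n", h[0]);                   /* prints 2.0 */
    cudaFree(d); delete[] h;
    return 0;
}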


CUDA toolkit and SDK examples

• Free CUDA Toolkit and SDK examples
• The Toolkit and SDK include
− Compiler
− Debugger, profiler
− Documentation
− Libraries
− Samples


FPGAs

Ø Field-programmable gate array
Ø "Adjust the architecture to the needs of your algorithm"
Ø Invented in 1984
Ø Used heavily in embedded and real-time systems
Ø Used in supercomputers like the Cray XD1 and SGI RASC blades
Ø Programmability!
ü An overview can be found at: http://en.wikipedia.org/wiki/Field-programmable_gate_array


Some FPGAs


FPGA example



ClearSpeed Boards

Ø Accelerator boards specially developed for scientific computing and the needs of the HPC community
Ø Only one accelerated platform in the current Top500 (Tsubame Grid Cluster, Japan, No. 29)
Ø Advertisement claims:
ü "World's highest performance processor" (96 GF per board)
ü "World's highest performance per watt" (2.86 GF/Watt)

http://www.clearspeed.com/acceleration/performance/benchmarks


The CSX700 Processor

• Includes dual MTAP cores:
– 96 GFLOPS peak (32- & 64-bit)
– 48 GMACS peak (16x16 → 32+64)
– 10 W max power consumption
– 250 MHz clock speed
– 192 processing elements (2x96)
– 8 spare PEs for resiliency
– ECC on all internal memories
• On-die temperature sensors
• Active power management
• Dual integrated 64-bit DDR2 memory controllers with ECC
• Integrated PCI Express x16
• CCBR chip-to-chip bridge port
• IBM 90 nm process
• 266 million transistors
• Shipping to customers since June '08

Copyright © 2008 ClearSpeed Technology Inc. All rights reserved.

The ClearSpeed Advance™ e710, e720 and CATS-700

Ø 96 GFLOPS; the e710 & e720 fit standard 1U & HP blade servers
ü Low power consumption of 25 W max; small, light, passively cooled
ü Designed for high reliability (MTBF)
ü All memory is error-protected; no moving parts (e.g. fans) are required

Ø CATS-700 1U system
ü 1.152 TFLOPS 32- and 64-bit floating point
ü 96 GBytes/s memory bandwidth to 24 GB of ECC-protected DDR2
ü 300 W typical power consumption

Ø Easy-to-use Software Development Kit
ü ANSI C compiler, gdb-based debugger, advanced profiler

Copyright © 2008 ClearSpeed Technology Inc. All rights reserved.


CSX700 and beyond

Ø The CSX700 is much more power-efficient than the Cell and GPUs for embedded processing
ü E.g. for a single-precision complex 1024x1024 2D FFT:

               GFLOPS   Power    GFLOP/Watt
Cell (8 SPE)     38      40 W      0.95
S870 (Tesla)     50     170 W      0.07
x86 core          3      25 W      0.12
CSX700           20       7 W      2.86

Ø Next-generation processor "Carnac" in design now
ü Focusing on 1- and 2-D FFT performance
ü Design goal is 100 GFLOPS/Watt sustained on 2D FFTs

Copyright © 2008 ClearSpeed Technology Inc. All rights reserved.


RECONFIGURABLE COMPUTING

Ø Reconfigurable Computing (RC)
ü The idea of reconfiguring a computer to your current needs
ü Use FPGAs for the reconfiguration
Ø The concept has existed since the 1960s (paper by Gerald Estrin): "Unfortunately this idea was far ahead of its time in needed electronic technology."
Ø Renaissance in the '80s/'90s
ü "The world's first commercial reconfigurable computer, the Algotronix CHS2X4, was completed in 1991. It was not a commercial success."
[Quotes are taken from http://en.wikipedia.org/wiki/Reconfigurable_computing]
Ø Reconfigurable HPC (RHPC)


Overview of available Accelerators

Source: white paper from HP, http://www.hp.com/techservers/hpccn/hpccollaboration/ADCatalyst/downloads/accelerators.pdf


Hybrids

Ø Larrabee


Larrabee: A Many-core, Thread-by-Data Parallel (Vector) Hybrid?

Ø Larrabee integrates data- and thread-parallel design elements and emphasizes programmability, making it a programmable hybrid


Larrabee's Thread and Data Parallel Design Elements and Abstractions

Ø Processor: a many-core die with from 8 to 32 cores
Ø Core: a processing element that runs and maintains the context of up to 4 threads that share the L2 cache
Ø Thread: a hardware-managed context of from 2 to 10 fibers, switched to hide long, unpredictable latency
Ø Fiber: a software-managed context of from 1 to 4 vector ops (vectors; may also include scalars?)
Ø Vector: a data-parallel instruction (data block) composed of 16 single- (8 double-) precision strands, used to hide short, predictable memory latency
Ø Strand: a series of operations on individual data elements with common register, cache, or memory locations


Larrabee's Parallel Abstraction Stack

Larrabee processor: P — 8 to 32 cores per processor
Cores: c c ... c c
Threads: T T T T — 4 thread contexts per core
Fibers: (Vector Vector ... Vector) — 2 to 10 fibers per thread
Strands: S S ... S S — 16 to (4 x 16) strands per fiber


Larrabee Architecture

Larrabee Core and Caches

Ø Instruction set
ü Standard Pentium x86 instructions
ü Scalar and cache-control instructions
ü Vector instructions
Ø The L1 cache can act somewhat like an extended register file
Ø The global L2 cache is divided into separate local subsets, one per core
ü Low latency for the local L2 subset
ü Data written by a core is stored in its own L2 subset and is flushed from other subsets, if necessary

Scalar Unit and Cache Control Instructions

Ø The scalar pipeline is derived from the dual-issue Pentium processor
ü Short, inexpensive execution pipeline
ü Multi-threading, 64-bit extensions, and sophisticated pre-fetching
ü New scalar instructions (bit count, bit scan)
Ø New instructions and instruction modes for explicit cache control
ü Pre-fetch data into the L1 or L2 caches
ü Reduce the priority of a cache line
ü Use the L2 cache like a scratchpad memory
Ø Supports 4 threads, with a separate register set per thread


Vector Processing Unit

Ø 16-wide vector processing unit (VPU)
ü Integer and single/double-precision float instructions
Ø Data can come directly from the L1
Ø The VPU supports
ü a wide variety of instructions on int/float
§ Standard arithmetic/logical
§ Additional ld/st instructions for conversions between data types
ü gather and scatter, supporting
§ 16 elements from up to 16 different addresses
Ø Instructions can be predicated by a mask register, which
ü controls which parts of a register/memory location are written or left untouched
ü reduces branch misprediction penalties
(a scalar sketch of the masked semantics follows below)
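A scalar C sketch of what a masked (predicated) 16-wide vector add does; purely illustrative, not Larrabee's actual instruction semantics.

#include <stdint.h>

#define WIDTH 16

void masked_add(float dst[WIDTH], const float a[WIDTH],
                const float b[WIDTH], uint16_t mask) {
    for (int i = 0; i < WIDTH; i++)
        if (mask & (1u << i))      /* predicated lane: write the result... */
            dst[i] = a[i] + b[i];
        /* ...else the lane is left untouched; both sides of a branch can
           run under complementary masks, avoiding misprediction penalties */
}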


Inter-Processor Ring Network

Ø Larrabee uses a bi-directional ring network
ü 512 bits wide per direction
ü All routing decisions are made before injecting messages into the network
ü Provides a path for the L2 to access memory
ü Fixed-function logic agents are accessed by the CPU cores and in turn access L2/memory


Comparing Larrabee to Nehalem, Tesla, PowerXCell

                        Larrabee (Intel)    Nehalem (Intel)    Tesla 10 (NVIDIA)          PowerXCell (IBM)
Availability            ~Q1 2010            Q4 2008            Q3 2008                    Q2 2008
Board integration       socket or bus       socket             bus                        socket or bus
Core type               small, in-order,    large, out-of-     very small FPU,            small, mixed,
                        x86-64              order, x86-64      in-order, nvidia-64        ppe/spe-64
Core number             8 to ~32 (64-bit)   4 to 8 (64-bit)    240 (32-bit), 30 (64-bit)  8 + 1 (64-bit)
Max clock               ~2.5 GHz            ~3.2 GHz           ~1.5 GHz                   ~3.2 GHz
On-chip interconnect    ring(s)             switch             pipeline + switch          ring
Vector width            16                  4                  (1+1)                      4 (SIMD 32-bit)
New parallel API reqd.  No, compiler        No, compiler       Yes, CUDA                  Yes, DaCS, ALF
                        has burden          has burden
Programmable pipeline   Fully, 1 fixed-     Fully, no fixed-   Partially, several         Fully, no fixed-
                        function unit       function units     fixed-function units       function units


Convergence of Platforms

ü Multiple parallel general-purpose processors (GPPs)
ü Multiple application-specific processors (ASPs)

[Diagrams: Intel IXP2800 network processor — 1 GPP core (Intel XScale) plus 16 ASPs (MEv2 microengines); IBM Cell — 1 GPP (2 threads) plus 8 ASPs; Picochip DSP — 1 GPP core plus 248 ASPs; Cisco CRS-1 — 188 Tensilica GPPs; Sun Niagara — 8 GPP cores (32 threads).]

Intel 4004 (1971): 4-bit processor, 2312 transistors, ~100 KIPS, 10-micron PMOS, 11 mm2. Today: 1000s of processor cores per die. "The processor is the new transistor" [Rowen].
