University of Athens Department of Informatics and Telecommunications
Architectures of processor chips
- Basic architectural features
- Single-core processor chips
  - Scalar processors: pipelining
  - Superscalars: ILP, threading
- Amdahl's Law
- Models of multicores
- Multicores
  - Symmetric multicore chips
  - Asymmetric multicore chips
  - Dynamic multicore chips
- Multi-threading
- Accelerators
Constantin Halatsis
Basic Architectural features
- Memory hierarchy (caches, main memory, HDs, ...)
- Pipelining
- Branch prediction
- Instruction-Level Parallelism (ILP)
- Data-Level Parallelism (DLP)
- Thread-Level Parallelism (TLP)
Example
Basic techniques
Scoreboarding: based on the CDC 6600 architecture. Key feature: the scoreboard. Hazard handling: the issue stage checks WAW hazards, the decode (read operands) stage checks RAW hazards, and the execute/write-results stages check WAR hazards.

Tomasulo's algorithm: implemented in the IBM 360/91's floating-point unit. Key features: reservation stations and the CDB (Common Data Bus). Issue: copy operands if available, otherwise record their tags; Execute: stall on RAW hazards by monitoring the CDB; Write results: broadcast results on the CDB and dump the store-buffer contents. Exception handling: no instructions can be issued until a branch is resolved.

Related techniques: reorder buffer, register renaming.
Single-core processors

Multiple-issue architectures: increase the number of instructions per cycle (IPC) by exploiting instruction-level parallelism (ILP).
Common name | Issue structure | Hazard detection | Scheduling | Distinguishing characteristics | Examples
Superscalar (static) | Dynamic | Hardware | Static | In-order execution | Sun UltraSPARC II and III
Superscalar (dynamic) | Dynamic | Hardware | Dynamic | Some out-of-order execution | IBM Power2
Superscalar (speculative) | Dynamic | Hardware | Dynamic with speculation | Speculative out-of-order execution | Pentium 3 and 4
VLIW / LIW | Static | Software | Static | No hazards between issue packets | Trimedia, i860
EPIC | Mostly static | Mostly software | Mostly static | Explicit dependences marked by compiler | Itanium
Register renaming, Tomasulo's algorithm, reorder buffer, scoreboarding
Intel's x86 line of processors

[Chart: SPECint92 performance of Intel x86 processors from the 8088/5 (~0.2) to the Prescott P4/3200 (~10000), years 1979-2005, on a log scale; roughly a 100x improvement per 10 years, leveling off at the end.]
The integer performance of Intel’s x86 line of processors
Processor performance (SPECint2000)
Source: F. Labonte, http://mos.stanford.edu/papers/flabonte_thesis.pdf
Processor clock rate
Source: F. Labonte, http://mos.stanford.edu/papers/flabonte_thesis.pdf
Power density evolution: Watt/mm²
Source: F. Labonte, http://mos.stanford.edu/papers/flabonte_thesis.pdf
ILP processing: Pipelining
Paradigms of ILP-processing
Temporal parallelism
Pipeline processors
Pipelining Schemes
Types of temporal parallelism in ILP processors
Sequential processing → prefetching (overlapping the fetch and further phases) → pipelined EUs (overlapping the execute phases through pipelining) → pipeline processors (overlapping all phases)

[Timing diagrams: instructions i, i+1, i+2, i+3 with their F/D/E/W cycles progressively overlapped under each scheme.]
Mainframes: Stretch (1961), Atlas (1963), IBM 360/91 (1967), CDC 7600 (1969). Microprocessors: i80286 (1982), M68020 (1985), i80386 (1985), M68030 (1988), R2000 (1988).
(F: fetch cycle, D: decode cycle, E: execute cycle, W: write cycle)
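The overlap in these diagrams can be quantified with a small sketch (illustrative: it assumes an ideal 4-stage F/D/E/W pipeline with one stage per cycle and no hazards or stalls, which the later slides relax):

```python
# Ideal k-stage pipeline vs. purely sequential execution of n instructions.
# Assumes one stage per cycle and no hazards or stalls (an idealization).

def sequential_cycles(n, k=4):
    return n * k              # each instruction runs F, D, E, W alone

def pipelined_cycles(n, k=4):
    return k + (n - 1)        # fill the pipeline once, then one result per cycle

n = 100
print(sequential_cycles(n))   # 400
print(pipelined_cycles(n))    # 103
print(sequential_cycles(n) / pipelined_cycles(n))  # speedup ~3.9
```

As n grows, the speedup approaches the stage count k, which is why deeper pipelines (see the pipeline-stage chart below) promise higher throughput, at the cost of harder branch handling.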
Sources of rising clock frequencies (3)
[Chart: number of pipeline stages vs. year, 1990-2005: Pentium (5), K6 (6), Pentium Pro, Athlon (12), Athlon-64 (~12), Core Duo Conroe (14), Pentium 4 (~20), P4 Prescott (~30).]
Figure 4.2: Number of pipeline stages in Intel's and AMD's processors
POWER5 vs. POWER6 Pipeline Comparison
[Diagram: POWER5 vs. POWER6 pipeline stage comparison. POWER5 stages are 22 FO4 deep, POWER6 stages 13 FO4. The figure traces each pipeline (instruction cache and predecode, instruction buffering, group formation/decode, group dispatch, rename/issue, register-file read, execute, data-cache access, format, writeback, completion) and annotates key latencies such as branch redirect, load-to-load, FX-to-FX, load-to-FP use and FP-to-FP use for each design.]
ILP processing: VLIW
Paradigms of ILP-processing
Temporal parallelism Issue parallelism
Static dependency resolution
Pipeline processors | VLIW processors
VLIW processing

[Diagram: independent instructions (static dependency resolution) fetched and executed in parallel by multiple fetch (F) and execute (E) units within one processor.]

VLIW: Very Long Instruction Word
ILP processing: Superscalars
Paradigms of ILP processing
Temporal parallelism Issue parallelism
Static dependency resolution | Dynamic dependency resolution

Pipeline processors | VLIW processors | Superscalar processors
VLIW processing vs. superscalar processing

[Diagram: VLIW processors receive independent instructions (static dependency resolution); superscalar processors receive dependent instructions and resolve the dependencies dynamically in hardware.]

VLIW: Very Long Instruction Word
ILP processing
Paradigms of ILP processing
Temporal parallelism Issue parallelism Data parallelism
Static dependency resolution | Dynamic dependency resolution

Pipeline processors | VLIW processors | Superscalar processors | SIMD extension
ILP-processing
[Figure 1.3: The emergence of ILP paradigms and processor types. Sequential processing → temporal parallelism (pipeline processors, ~'85); issue parallelism with static dependency resolution (VLIW processors → EPIC processors) or dynamic dependency resolution (superscalar processors, ~'90); data parallelism (superscalar processors with SIMD extension, ~'95-'00).]
Performance of ILP-processors
Absolute performance (ideal vs. real case):

  Sequential:        P_a = f_C · (1/CPI)
  Pipeline:          P_a = f_C · (1/CPI)
  VLIW/superscalar:  P_a = f_C · (1/CPI) · IP
  SIMD extension:    P_a = f_C · (1/CPI) · IP · OPI

where f_C is the clock frequency, CPI the cycles per instruction, IP the issue parallelism (instructions issued per cycle), and OPI the operations per instruction.
Performance components of ILP-processors:
P_a = f_C · (1/CPI) · IP · OPI · η

clock frequency · temporal parallelism · issue parallelism · data parallelism · efficiency of speculative execution. Equivalently:

P_a = f_C · IPC_eff,  with  IPC_eff = (1/CPI) · IP · OPI · η

The clock frequency depends on technology and microarchitecture; the per-cycle efficiency IPC_eff depends on the ISA, microarchitecture, system architecture, OS, compiler, and application.
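As a worked example, the performance formula can be evaluated numerically. The figures below are illustrative assumptions, not values from the slides:

```python
# P_a = f_C * (1/CPI) * IP * OPI * eta, as defined on this slide.

def absolute_performance(f_c, cpi, ip=1.0, opi=1.0, eta=1.0):
    """Operations/second: clock * temporal * issue * data parallelism * efficiency."""
    return f_c * (1.0 / cpi) * ip * opi * eta

# Hypothetical 2 GHz, 4-issue superscalar with 2-wide SIMD, CPI = 1.25, eta = 0.8:
p = absolute_performance(f_c=2e9, cpi=1.25, ip=4, opi=2, eta=0.8)
print(f"{p / 1e9:.2f} G operations/s")  # 10.24 G operations/s
```

Note how each factor multiplies through: halving CPI, doubling the issue width, or doubling OPI each doubles P_a, while a poor speculation efficiency η scales everything down.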
Options to implement issue parallelism
Pipeline processing extended by either:
- VLIW (EPIC) instruction issue: static dependency resolution (3.2)
- Superscalar instruction issue: dynamic dependency resolution (3.3)
VLIW processing (1)
[Diagram: VLIW instructions with independent sub-instructions (static dependency resolution) are fetched from memory/cache into a VLIW processor with roughly 10-30 execution units (EUs).]
Figure 3.1: Principle of VLIW processing
VLIW processing (2)
VLIW: Very Long Instruction Word. The term was coined in 1983 by Fisher.

Length of sub-instructions: ~32 bits. Instruction length: ~n × 32 bits, where n is the number of execution units (EUs).

Static dependency resolution with parallel optimization requires a complex VLIW compiler.
VLIW processing (3)
The term ‘VLIW’
Figure 3.2: Experimental and commercially available VLIW processors
Source: Sima et al., ACA, Addison-Wesley, 1997
VLIW processing (4)
Benefits of static dependency resolution:
Less complex processors
Earlier appearance
Either higher f_C or larger ILP
VLIW processing (5)
Drawbacks of static dependency resolution:
Completely new ISA
New compilers, OS
Rewriting of applications
Achieving the critical mass to convince the market
The compiler uses technology-dependent parameters (e.g. latencies of EUs and caches, repetition rates of EUs) for dependency resolution and parallel optimization.

New processor models therefore require new compiler versions.
VLIW processing (6)
Drawbacks of static dependency resolution (cont.):
VLIW instructions are only partially filled

Poorly utilized memory space and bandwidth
VLIW processing (7)
Commercial VLIW processors:

Trace (1987) by Multiflow; Cydra-5 (1989) by Cydrome

Within a few years both firms went bankrupt.

Their developers moved to HP and IBM.
They became initiators/developers of EPIC processors
VLIW processing (8)
VLIW → EPIC

Integration of SIMD instructions and advanced superscalar features
1994: Intel, HP announced the cooperation
1997: The EPIC term was born
2001: IA-64 ⇒ Itanium
Superscalar processing
Pipeline processing + superscalar instruction issue
Main attributes of superscalar processing:
Dynamic dependency resolution
Compatible ISA
History
Experimental superscalar processors
Source: Sima et al., ACA, Addison-Wesley, 1997
Emergence of superscalar processors
RISC processors (1987-1996):
- Intel 960: 960KA/KB → 960CA (3)
- M 88000: MC 88100 → MC 88110 (2)
- HP PA: PA 7000 → PA 7100 (2)
- SPARC: MicroSparc → SuperSparc (3)
- MIPS R: R 4000 → R 8000 (4)
- Am 29000: 29000 → 29040, 29000 sup (4)
- IBM Power: Power1 (4) (RS/6000)
- DEC Alpha: Alpha 21064 (2)
- PowerPC: PPC 601 (3), PPC 603 (3)

CISC processors:
- Intel x86: i486 → Pentium (2)
- M 68000: M 68040 → M 68060 (2)
- Gmicro: Gmicro/100p → Gmicro 500 (2)
- AMD K5: K5 (4)
- Cyrix M1: M1 (2)
denotes superscalar processors.
Source: Sima et al., ACA, Addison-Wesley, 1997
Attributes of first generation superscalars (1)
Width: 2-3 RISC instructions/cycle or 2 CISC instructions/cycle "wide"
Core: static branch prediction
Cache: single-ported, blocking L1 data caches; off-chip L2 caches attached via the processor bus

Examples: Alpha 21064, PA 7100, Pentium
Attributes of first generation superscalars (2)
Consistency of processor features (1)
Dynamic instruction frequencies in gen. purpose applications:
FX instructions ~40%, load instructions ~30%, store instructions ~10%, branches ~20%, FP instructions ~1-5%
Available parallelism in gen. purpose applications assuming direct issue:
~ 2 instructions / cycle
(Wall 1989, Lam, Wilson 1992)
Source: Sima et al., ACA, Addison-Wesley, 1997
Attributes of first generation superscalars (3)
Consistency of processor features (2)

Reasonable core width: 2-3 instructions/cycle
Required number of data cache ports (np):
np ~ 0.4 × (2-3) = 0.8-1.2 accesses/cycle → single-port data caches

Required EUs (each L/S instruction also generates an address calculation):
FX:     ~0.8 × (2-3) = 1.6-2.4 → 2-3 FX EUs
L/S:    ~0.4 × (2-3) = 0.8-1.2 → 1 L/S EU
Branch: ~0.2 × (2-3) = 0.4-0.6 → 1 branch EU
FP:     ~(0.01-0.05) × (2-3)   → 1 FP EU
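This sizing arithmetic follows directly from the dynamic instruction mix; a sketch (the mix fractions are the ones quoted above, the dictionary layout is illustrative, and passing widths 4-5 reproduces the second-generation numbers later on):

```python
# Required data-cache ports and execution units from the dynamic instruction
# mix and the core width (each L/S instruction also needs an FX address calc).

MIX = {"FX": 0.40, "load": 0.30, "store": 0.10, "branch": 0.20}

def required_units(width):
    ls = MIX["load"] + MIX["store"]          # 0.4: fraction of L/S instructions
    return {
        "data cache ports": ls * width,      # np ~ 0.4 * width
        "FX EUs": (MIX["FX"] + ls) * width,  # FX ops plus L/S address calculations
        "L/S EUs": ls * width,
        "branch EUs": MIX["branch"] * width,
    }

for width in (2, 3):                         # first-generation core width
    print(width, {k: round(v, 2) for k, v in required_units(width).items()})
```

For width 3 this yields 1.2 cache ports, 2.4 FX EUs, 1.2 L/S EUs and 0.6 branch EUs, matching the slide's 2-3 FX / 1 L/S / 1 branch provisioning.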
The bottleneck evoked and its resolution (1)
The issue bottleneck
[Diagram: direct issue. Instructions flow from the I-cache through an I-buffer into a 3-entry instruction window; decode/check logic issues independent instructions to the EUs, while a dependent instruction blocks the issue of all subsequent instructions in the window. (a) Simplified structure of the microarchitecture assuming direct issue; (b) the issue process.]
The principle of direct issue
The bottleneck evoked and its resolution (2)
Eliminating the issue bottleneck

[Diagram: buffered (out-of-order) issue. Instructions are dispatched from the I-cache/I-buffer/instruction window without dependency checking into shelving buffers (reservation stations); dependency checking is performed at the shelving buffers, and shelved instructions that are not dependent are issued for execution to the EUs.]
Principle of the buffered (out-of-order) issue
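The difference between the two issue disciplines can be illustrated with a toy model (the instruction format, the uniform one-cycle latency, and the example program are made-up assumptions, not the slides' microarchitecture):

```python
# Toy issue model: instructions are (name, {source names}); an instruction is
# ready once all its sources have completed. With direct issue, the first
# non-ready instruction blocks everything behind it; with buffered issue
# (shelving), later ready instructions may bypass it.

def cycles_to_complete(program, width, buffered):
    done, pending, cycles = set(), list(program), 0
    while pending:
        cycles += 1
        issued = []
        for name, deps in pending:
            if len(issued) == width:
                break
            if deps <= done:
                issued.append((name, deps))
            elif not buffered:
                break            # direct issue: blocked instruction stalls the rest
        for instr in issued:
            pending.remove(instr)
        done |= {name for name, _ in issued}
    return cycles

prog = [("i1", set()), ("i2", {"i1"}), ("i3", {"i2"}),
        ("i4", set()), ("i5", set()), ("i6", set())]
print(cycles_to_complete(prog, width=3, buffered=False))  # 4 cycles (direct)
print(cycles_to_complete(prog, width=3, buffered=True))   # 3 cycles (buffered)
```

With direct issue, the dependent chain i1→i2→i3 repeatedly blocks the independent i4-i6; with shelving, the independent instructions slip past the blocked ones, which is exactly the bottleneck elimination the slide describes.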
The bottleneck evoked and its resolution (3)

First generation ("narrow") superscalars → second generation ("wide") superscalars
Elimination of the issue bottleneck and in addition widening the processing width of all subsystems of the core
Generations of superscalars: first generation ("narrow" superscalars) vs. second generation ("wide" superscalars)

Width: 2-3 RISC or 2 CISC instructions/cycle "wide" → 4 RISC or 3 CISC instructions/cycle "wide"
Core: static branch prediction → buffered (OoO) issue, predecoding, dynamic branch prediction, register renaming, ROB
Caches: single-ported, blocking L1 data caches; off-chip L2 caches attached via the processor bus → dual-ported, non-blocking L1 data caches; directly attached off-chip L2 caches

Examples: Alpha 21064 → Alpha 21264; PA 7100 → PA 8000; Pentium → Pentium Pro, K6
Attributes of second generation superscalars (2)
Consistency of processor features (1). Dynamic instruction frequencies in general-purpose applications: FX instructions ~40%, load instructions ~30%, store instructions ~10%, branches ~20%, FP instructions ~1-5%
Available parallelism in gen. purpose applications assuming buffered issue:
~ 4 –6 instructions / cycle
(Wall 1990)
Source: Sima et al., ACA, Addison-Wesley, 1997
Parallelism available in general purpose applications
Extent of parallelism available in general-purpose applications assuming buffered issue.
Source: Wall, Limits of ILP, WRL TN-15, Dec. 1990
Attributes of second generation superscalars (3)
Consistency of processor features (2)

Reasonable core width: 4-5 instructions/cycle
Required number of data cache ports (np):
np ~ 0.4 × (4-5) = 1.6-2 accesses/cycle → dual-port data caches

Required EUs (each L/S instruction also generates an address calculation):
FX:     ~0.8 × (4-5) = 3.2-4 → 3-4 FX EUs
L/S:    ~0.4 × (4-5) = 1.6-2 → 2 L/S EUs
Branch: ~0.2 × (4-5) = 0.8-1 → 1 branch EU
FP:     ~(0.01-0.05) × (4-5)  → 1 FP EU
Exhausting the issue parallelism
In general-purpose applications, second-generation ("wide") superscalars already exhaust the parallelism available at the instruction level
Introducing data parallelism (1)
Possible approaches to introduce data parallelism
Dual-operation instructions (e.g. i: a*b+c) vs. SIMD instructions, the latter split into FX-SIMD (MM support) and FP-SIMD (3D support).

n_OPI (number of operations per instruction): 2 for dual-operation instructions; 2/4/8/16/32 for FX-SIMD; 2/4 for FP-SIMD (both for dedicated use).
OPI (average number of operations per instruction): 1+ε for general use; >1 with an ISA extension.

Implementation alternatives of data parallelism
Introducing data parallelism (2)
SIMD instructions (FX/FP) provide multiple operations within a single instruction; combined with superscalar issue they form the superscalar extension, combined with VLIW issue the EPIC extension.

Principle of introducing SIMD instructions in superscalar and VLIW (EPIC) processors
The appearance of SIMD instructions in superscalars (1)

Intel's and AMD's ISA extensions (MMX, SSE, SSE2, SSE3, 3DNow!, 3DNow! Professional)
[Timeline, 1990-2003: RISC lines (Compaq/DEC Alpha 21064 → 21364, Motorola MC 88110, HP PA 7100 → PA 8700, IBM Power/PowerPC up to Power 4+, MIPS R8000 → R16000, Sun UltraSPARC through UltraSPARC-3-Cu) and CISC lines (Intel Pentium → Pentium 4 with HT, Cyrix/VIA M1 → MII, AMD Nx586 → Opteron), marking when each family gained multimedia (FX-SIMD) and 3D (FP-SIMD) support.]

The emergence of FX-SIMD and FP-SIMD instructions in superscalars
2.5 and 3rd generation superscalars (1)

- 2.5-generation superscalars: second-generation superscalars + FX SIMD (MM)
- 3rd-generation superscalars: + FX SIMD and FP SIMD (MM + 3D)
2.5 and 3rd generation superscalars (2)

[Timeline figure repeated from the previous slide: the emergence of FX-SIMD and FP-SIMD support in RISC and CISC superscalars, 1990-2003.]
Overview of superscalar processor generations
Superscalars
First generation ("thin" superscalars) → second generation ("wide" superscalars) → 2.5 generation ("wide" with MM support) → third generation ("wide" with MM/3D support)

Width: 2-3 RISC or 2 CISC instructions/cycle → 4 RISC or 3 CISC instructions/cycle (from the second generation on)
Core: no predecoding, static branch prediction, unbuffered issue, no renaming, no ROB → predecoding, dynamic branch prediction, buffered issue (shelving), renaming, ROB
Caches: single-ported data caches; blocking L1 data caches (or non-blocking with at most a single pending cache miss); off-chip L2 caches attached via the processor bus → dual-ported data caches; non-blocking L1 data caches with multiple cache misses allowed; off-chip direct-coupled L2 caches → on-chip L2 caches (third generation)
ISA: no MM/3D support → FX-SIMD instructions (2.5 generation) → FX- and FP-SIMD instructions (third generation)
Examples: 1st: Alpha 21064, PA 7100, Power2, PowerPC 601, SuperSparc, Pentium; 2nd: Alpha 21264, PA 8000, PowerPC 604/620, UltraSparc I/II, Pentium Pro; 2.5: Pentium II, K6, Athlon (model 4); 3rd: Power 4, Pentium III (0.18 µ), Pentium 4, Athlon MP (model 6)

Across the generations, performance, complexity, memory bandwidth, and branch prediction accuracy all increase.
Exhausting the performance potential of data parallelism
In general-purpose applications, second-generation superscalars already exhaust the parallelism available at the instruction level; third-generation superscalars also exhaust the instruction-level parallelism available in dedicated applications (such as MM or 3D applications).
Thus the era of ILP-processors came to an end.
Summary: Main evolution scenarios
a. Evolutionary scenario (superscalar approach, the main road): introduction and increase of temporal parallelism → introduction and increase of issue parallelism → introduction and increase of data parallelism

b. Radical scenario (VLIW/EPIC approach): introduction of VLIW processing → introduction of data parallelism (EPIC)
Summary: Main road of processor evolution (1)
[Diagram: extent of operation-level parallelism vs. time. Traditional von Neumann processors (sequential processing) → pipeline processors (temporal parallelism, ~1985/88) → superscalar processors (+ issue parallelism, ~1990/93) → superscalar processors with SIMD extension (+ data parallelism, ~1994/00); the level of hardware redundancy grows accordingly.]
The three cycles of the main road of processor evolution
Summary: The main road of evolution (2)
For each dimension of parallelism introduced (temporal → issue → data), the same cycle repeats:
1. introduction of a particular dimension of parallelism;
2. processing bottleneck(s) arise;
3. elimination of the bottleneck(s) evoked, by introducing appropriate techniques;
4. as a consequence, the parallelism available in the given dimension becomes exhausted, and further performance increase is achievable only by introducing a new dimension of parallelism.
The three main cycles of the main road
Summary: Main road of the evolution (3)
Traditional sequential processing → introduction of temporal parallelism → introduction of issue parallelism → introduction of data parallelism

Traditional sequential processors → pipeline processors (~1985/88) → superscalar processors (~1990/93) → superscalars with SIMD extension (~1994/97)

New techniques per cycle:
- 1st generation → 2nd generation: advanced memory subsystem (caches, on-chip L2), advanced branch processing (branch prediction, dynamic branch prediction), dynamic instruction scheduling, renaming, predecoding, ROB, dual-ported data caches, non-blocking L1 data caches with multiple cache misses allowed, off-chip direct-coupled L2 caches
- 2nd generation → 2.5 generation: ISA extension with FX SIMD, extension of the system architecture (AGP)
- 2.5 generation → 3rd generation: FP SIMD extension
New techniques introduced in the three main cycles of processor evolution
Main road of evolution (4)
[Chart, ~1985-2000: processor performance rises, and with it the required memory bandwidth and hardware complexity.]
Memory bandwidth and hardware complexity vs. rising processor performance
5.2 Main road of evolution (5)
[Chart, ~1985-2000: the clock frequency f_C and the number of pipeline stages rise, demanding ever higher branch prediction accuracy.]
Figure 5.5: Branch prediction accuracy vs. rising clock rates
Microprocessor technology trend
Moore’s Law
2× transistors per chip every 1.5 years. Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of microchips would double roughly every 18 months; this became known as "Moore's Law". Processors keep getting smaller, denser, and more powerful.
Slide source: Jack Dongarra
Moore’s Law
Gordon Moore, co-founder of Intel
- The number of transistors per square inch on integrated circuits had doubled every year since the IC was invented.
- Moore predicted that this trend would continue for the foreseeable future.
- In subsequent years the pace slowed down a bit, but data density has doubled approximately every 18 months, and this is the current definition of Moore's Law, which Moore himself has blessed. Most experts, including Moore himself, expect Moore's Law to hold for at least another two decades.
- HOWEVER ... (Webopedia.com, 1998)
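The 18-month doubling rate compounds quickly; a back-of-the-envelope check (the function and the chosen horizons are illustrative):

```python
# Growth factor implied by a doubling every `doubling_months` months.

def moores_law_factor(years, doubling_months=18):
    return 2 ** (years * 12 / doubling_months)

print(f"{moores_law_factor(10):.0f}x per decade")   # ~102x per decade
print(f"{moores_law_factor(20):.0f}x in 20 years")
```

The roughly 100x-per-decade figure agrees with the ~100x/10 years seen earlier in the x86 SPECint chart.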
The classical sources of performance improvement (ILP, clock speed, power) are flattening out (from K. Olukotun, L. Hammond, H. Sutter, and B. Smith)
- In the past, processors got faster every year
- Today the clock frequency is flat or even decreasing
- The "Free Lunch" is over!
- Nevertheless, performance still doubles every 18-24 months
- The CPU-memory gap keeps getting worse
- Moore's Law needs restating
The memory wall: the processor-DRAM gap (latency)
[Chart, 1980-2000 (log scale): processor performance grows at ~60%/year ("Moore's Law"), DRAM at only ~7%/year; the processor-memory performance gap grows by ~50% per year.]
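The divergence in the chart can be recomputed from the quoted yearly rates; a sketch (the rates are the ones from this slide, the loop and horizons are illustrative):

```python
# Processor performance ~60%/year vs. DRAM ~7%/year: the gap compounds at
# roughly 1.60 / 1.07 - 1 ~ 50% per year, as the chart states.

def perf(years, annual_gain):
    return (1 + annual_gain) ** years

for years in (1, 5, 10):
    gap = perf(years, 0.60) / perf(years, 0.07)
    print(f"after {years:2d} years the processor-memory gap is ~{gap:.1f}x")
```

After a decade at these rates the gap is already more than 50x, which is what makes caches and the memory hierarchy so central on the earlier slides.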
Moore's Law in practice
[Chart: log(speed) vs. year for CPU, network bandwidth, RAM, 1/(network latency), and software, each improving at a different exponential rate.]
Hardware problems
- Today's hardware and lithography limits:
  - Power consumption constrains the design of high-performance chips (Intel Tejas Pentium 4 cancelled due to power issues)
  - The production yield for high-performance chips drops dramatically (IBM quotes yields of 10-20% on the 8-processor Cell)
  - Design and verification of high-performance chips become unwieldy (verification teams > design teams on leading-edge processors)
- Solution: "small is beautiful": multicore and manycore chips
  - Shallow pipelines (5- to 9-stage) for CPUs, FPUs, vector and SIMD PEs (small cores not much slower than large cores)
  - Parallelism is a route to a good performance/power ratio (lower threshold and supply voltages lower the energy per op)
  - Redundant processors improve chip yield (Cisco Metro: 188 CPUs + 4 spares; Sun Niagara sells 6 or 8 CPUs)
  - Design verification is easier for small, identical cores
Multicore: Redefining "Conventional Wisdom" in Computer Architecture
- Old conventional wisdom (CW): power is free, but transistors are expensive. New CW is the "power wall": power is expensive, but transistors are "free"; we can put more transistors on a chip than we have the power to turn on.
- Old CW: multiplies are slow, but loads and stores are fast. New CW is the "memory wall": loads and stores are slow, but multiplies are fast; ~200 clocks to DRAM, but even FP multiplies take only 4 clocks.
- Old CW: we can reveal more ILP via compilers and architecture innovation (branch prediction, OOO execution, speculation, VLIW, ...). New CW is the "ILP wall": diminishing returns on finding more ILP.
- Old CW: 2x CPU performance every 18 months. New CW: power wall + memory wall + ILP wall = brick wall.
Restating Moore's Law
Ονόμος Moore συνεχίζει! Πωςνααξιοποιήσουμεόλααυτάτα transistors γιανασυνεχίσουμεναέχουμε επιδόσειςσταπλαίσιατουνόμου Moore; Ø Περισσότερη out-of-order εκτέλεση (απαγορευτικόλόγωπολυπλοκότητας, μικρήβελτίωσηεπίδοσης, κατανάλωσηισχύος ) Ø Περισσότερηνημάτωση -threading (ασυμπτωτικήεπίδοση) Ø Περισσότερη DLP/SIMD (περιορισμένεςεφαρμογές, compilers?) Ø Μεγαλύτερες caches Ø Τοποθέτησηενός SMP σεένα chip = ‘multicore’ (and manycore)
The industry's answer:
Ø The number of cores per chip doubles every 18 months, while the clock frequency decreases (does not increase)
ü We therefore have to deal with millions of concurrent threads
§ Future generations will have billions of threads.
ü It becomes necessary to easily replace chip-to-chip parallelism with parallelism inside the chip
MULTI-CORE Architecture
Ø Better microelectronics process technology allows savings in die area
Ø This means more transistors per chip
ü 90nm, 65nm and 45nm today
ü 32nm, 22nm and 10nm tomorrow
Ø Chip makers decided that the best (cheapest) use of the extra transistors is to add more cores
ü 4 or 8 cores with up to 64 threads today
ü Hundreds of cores tomorrow - manycore
Ø Another route: more functionality in the single core (e.g., adding a GPU, a crypto engine, etc.)
MULTI-CORE ARCHITECTURES
Ø Multi-core long used in DSP, embedded and network processors
Ø Many household items contain multi-core processors:
ü iPod - 2-core ARM CPU
ü The PS3 - IBM Cell CPU
ü The Xbox 360 - 3-core Xenon CPU (PowerPC based, SMT capable)
ü GPUs, like the NVIDIA 8800 series - up to 128 mini-cores
Powerful processors
Ø The Hertz race is over; however …
ü Some processors are still at it …
ü Power6 and Power7 running at 4 and 5 GHz
ü Intel Polaris: 3.6 to 6 GHz
Ø Many hardware re-designs are in order
ü Make pipelines shorter, simpler
ü Get rid of "extra" hardware features
Multi-core Trends in this Decade
[Figure: timeline of multicore chips, 2001-2011, by vendor.
SUN: UltraSPARC T1 (Niagara): 8-core processor, 32 logical threads; UltraSPARC T2: 8-core processor, 64 logical threads; Rock: 16-core processor, 32 logical threads.
IBM: Power4 (64-bit PowerPC, 2-core); Power5 (2-core with SMT); Power6 (2-core with SMT); Power7; CBE (Cell, 9-core chip); Xenon (64-bit PowerPC, 3-core).
Intel: Pentium D and Pentium EE (IA32 x86 dual-core); Xeon dual-core; Core Duo; Core 2 Duo (Penryn, Wolfdale: dual- and quad-core); Nehalem (1- to 8-core chip); Sandy Bridge.
AMD: Opteron (Denmark, IA32 x86 2-core); Turion 64 X2 (IA32 x86 dual-core); Barcelona (IA32 x86 native 4-core).]
Intel Processors
AMD processors
IBM processors
Example: the Power6 microarchitecture
Pictures courtesy of IBM, from "IBM Power6 Microarchitecture"
Power6: running at frequencies from 4 to 5 GHz; 13 FO4 (versus 23 FO4) pipeline
Recall Amdahl’s Law
Ø Begins with Simple Software Assumption (Limit Arg.)
ü Fraction F of execution time perfectly parallelizable
ü No overhead for scheduling, communication, synchronization, etc.
ü Fraction (1 - F) is completely serial
Ø Time on 1 core = (1 - F) / 1 + F / 1 = 1
Ø Time on N cores = (1 - F) / 1 + F / N
Recall Amdahl’s Law [1967]
1 Amdahl’s Speedup = F 1 -F + 1 N
Ø For mainframes, Amdahl expected 1 - F = 35%
ü For a 4-processor speedup = 2
ü For infinite-processor speedup < 3
ü Therefore, stay with mainframes with one/few processors
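These numbers can be checked with a short sketch of the law (Python; the function and variable names are my own):

```python
def amdahl_speedup(F, N):
    """Amdahl's Law: F is the perfectly parallelizable fraction of
    execution time, N the number of processors; (1 - F) stays serial."""
    return 1.0 / ((1.0 - F) + F / N)

# Amdahl's 1967 mainframe estimate: 1 - F = 35%, i.e. F = 0.65
print(round(amdahl_speedup(0.65, 4), 2))   # 4 processors -> 1.95, about 2
print(round(1.0 / (1.0 - 0.65), 2))        # infinite processors -> 2.86 < 3
```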
Ø Amdahl’s Law applied to Minicomputer to PC Eras Ø What about the Multicore Era?
Designing Multicore Chips Is Hard
Ø Designers must confront single-core design options
ü Instruction fetch, wakeup, select
ü Execution unit configuration & operand bypass
ü Load/store queue(s) & data cache
ü Checkpoint, log, runahead, commit
Ø As well as additional design degrees of freedom ü How many cores? How big each? ü Shared caches: levels? How many banks? ü Memory interface: How many banks? ü On-chip interconnect: bus, switched, ordered?
A Simple Multicore Hardware Model (Wisconsin Multifacet project, www.cs.wisc.edu/multifacet/amdahl)
To Complement Amdahl’s Simple Software Model
(1) Chip Hardware Roughly Partitioned into ü Multiple Cores (with L1 caches) ü The Rest (L2/L3 cache banks, interconnect, pads, etc.) ü Changing Core Size/Number does NOT change The Rest
(2) Resources for Multiple Cores Bounded ü Bound of N resources per chip for cores ü Due to area, power, cost (€€€), or multiple factors ü Bound = Power? (but our pictures use Area)
A simple Multicore Hardware Model, cont.
(3) Micro-architects can improve single-core performance using more of the bounded resource
Ø A Simple Base Core ü Consumes 1 Base Core Equivalent (BCE) resources ü Provides performance normalized to 1
Ø An Enhanced Core (in same process generation) ü Consumes R BCEs ü Performance as a function Perf(R)
Ø What does function Perf(R) look like?
More on Enhanced Cores
Ø (Performance Perf(R) consuming R BCEs of resources)
Ø If Perf(R) > R → Always enhance core
Ø Cost-effectively speeds up both sequential & parallel
Ø Therefore, Equations Assume Perf(R) < R
Ø Graphs Assume Perf(R) = Square Root of R ü 2x performance for 4 BCEs, 3x for 9 BCEs, etc. ü Why? Models diminishing returns with “no coefficients” ü Alpha EV4/5/6 [Kumar 11/2005] & Intel’s Pollack’s Law
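A minimal sketch of this performance model (the function name is my own):

```python
import math

def perf(R):
    """Pollack's-Law-style model: the performance of a core built
    from R BCEs grows as the square root of its resources."""
    return math.sqrt(R)

print(perf(4))   # 2.0 : 2x performance for 4 BCEs
print(perf(9))   # 3.0 : 3x performance for 9 BCEs
```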
Symmetric Multicore Chips
Ø How many cores per chip?
Ø Each Chip Bounded to N BCEs (for all cores)
Ø Each Core consumes R BCEs
Ø Assume Symmetric Multicore = All Cores Identical
Ø Therefore, N/R Cores per Chip: (N/R)*R = N
Ø For an N = 16 BCE Chip:
Sixteen 1-BCE cores | Four 4-BCE cores | One 16-BCE core
Performance of Symmetric Multicore Chips
Ø Serial Fraction 1 - F uses 1 core at rate Perf(R)
Ø Serial time = (1 - F) / Perf(R)
Ø Parallel Fraction F uses N/R cores at rate Perf(R) each
Ø Parallel time = F / (Perf(R) * (N/R)) = F*R / (Perf(R)*N)
Ø Therefore, w.r.t. one base core:
Symmetric Speedup = 1 / ( (1 - F)/Perf(R) + F*R/(Perf(R)*N) )
Ø Implications?
Enhanced Cores speed up both Serial & Parallel execution
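A sketch of the symmetric-speedup formula, using the Perf(R) = sqrt(R) model from the previous slide (the names are my own):

```python
import math

def perf(R):
    """Pollack-style core performance: Perf(R) = sqrt(R)."""
    return math.sqrt(R)

def symmetric_speedup(F, N, R):
    """Speedup of a symmetric chip of N BCEs split into N/R cores
    of R BCEs each, w.r.t. one 1-BCE base core."""
    return 1.0 / ((1.0 - F) / perf(R) + F * R / (perf(R) * N))

# Slide examples for an N = 16 BCE chip:
print(round(symmetric_speedup(0.5, 16, 16), 1))  # 4.0 (one 16-BCE core)
print(round(symmetric_speedup(0.9, 16, 2), 1))   # 6.7 (eight 2-BCE cores)
```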
Symmetric Multicore Chip, N = 16 BCEs
[Chart: symmetric speedup vs. R, for 16, 8, 4, 2 and 1 cores]
F=0.5: optimum at R=16, Cores=1, Speedup = 4 = 1/(0.5/4 + 0.5*16/(4*16))
Need to increase parallelism to make multicore optimal!
Symmetric Multicore Chip, N = 16 BCEs
F=0.9, R=2, Cores=8, Speedup=6.7
F=0.5 R=16, Cores=1, Speedup=4
At F=0.9, Multicore optimal, but speedup limited Need to obtain even more parallelism!
Symmetric Multicore Chip, N = 16 BCEs
F→1, R=1, Cores=16, Speedup→16
F matters: Amdahl’s Law applies to multicore chips MANY Researchers should target parallelism F first
Need a Third "Moore's Law"?
Ø Technologist’s Moore’s Law ü Double Transistors per Chip every 2 years ü Slows or stops: To Be Discussed Ø Microarchitect’sMoore’s Law ü Double Performance per Core every 2 years ü Slowed or stopped: Early 2000s Ø Multicore’sMoore’s Law ü Double Cores per Chip every 2 years ü & Double Parallelism per Workload every 2 years ü & Aided by Architectural Support for Parallelism ü = Double Performance per Chip every 2 years ü Starting now
Ø Software as Producer, not Consumer, of Performance Gains!
Symmetric Multicore Chip, N = 16 BCEs
Recall F=0.9, R=2, Cores=8, Speedup=6.7
As Moore’s Law enables N to go from 16 to 256 BCEs, More cores? Enhance cores? Or both?
Symmetric Multicore Chip, N = 256 BCEs
F→1: R=1 (vs. 1), Cores=256 (vs. 16), Speedup=204 (vs. 16). MORE CORES!
F=0.99: R=3 (vs. 1), Cores=85 (vs. 16), Speedup=80 (vs. 13.9). MORE CORES & ENHANCE CORES!
F=0.9: R=28 (vs. 2), Cores=9 (vs. 8), Speedup=26.7 (vs. 6.7). ENHANCE CORES!
As Moore's Law increases N, often need enhanced core designs
Some arch. researchers should target single-core performance
Software for Large Symmetric Multicore Chips
Ø F matters: Amdahl's Law applies to multicore chips
Ø N = 256
ü F=0.9 → Speedup = 27 @ R = 28
ü F=0.99 → Speedup = 80 @ R = 3
ü F=0.999 → Speedup = 204 @ R = 1
Ø N = 1024
ü F=0.9 → Speedup = 53 @ R = 114
ü F=0.99 → Speedup = 161 @ R = 10
ü F=0.999 → Speedup = 506 @ R = 1
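The optimal R values above can be reproduced by a brute-force sweep over core sizes (a sketch; the names are my own):

```python
import math

def perf(R):                      # Pollack-style Perf(R) = sqrt(R)
    return math.sqrt(R)

def symmetric_speedup(F, N, R):
    return 1.0 / ((1.0 - F) / perf(R) + F * R / (perf(R) * N))

def best_config(F, N):
    """Core size R that maximizes symmetric speedup on an N-BCE chip."""
    return max(range(1, N + 1), key=lambda R: symmetric_speedup(F, N, R))

for F in (0.9, 0.99, 0.999):
    R = best_config(F, 256)
    print(F, R, round(symmetric_speedup(F, 256, R)))
# 0.9   -> R=28, speedup 27
# 0.99  -> R=3,  speedup 80
# 0.999 -> R=1,  speedup 204
```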
Ø Researchers must target parallelism F first
Asymmetric (Heterogeneous) Multicore Chips
Ø Symmetric Multicore Required All Cores Equal Ø Why Not Enhance Some (But Not All) Cores?
Ø For Amdahl’s Simple Software Assumptions ü One Enhanced Core ü Others are Base Cores
Ø How?
Ø How does this affect our hardware model?
How Many Cores per Asymmetric Chip?
Ø Each Chip Bounded to N BCEs (for all cores)
Ø One R-BCE Core leaves N - R BCEs
Ø Use N - R BCEs for N - R Base Cores
Ø Therefore, 1 + N - R Cores per Chip
Ø For an N = 16 BCE Chip:
Symmetric: four 4-BCE cores. Asymmetric: one 4-BCE core & twelve 1-BCE base cores.
Performance of Asymmetric Multicore Chips
Ø Serial Fraction 1 - F same, so time = (1 - F) / Perf(R)
Ø Parallel Fraction F
ü One core at rate Perf(R)
ü N - R cores at rate 1
ü Parallel time = F / (Perf(R) + N - R)
Ø Therefore, w.r.t. one base core:
Asymmetric Speedup = 1 / ( (1 - F)/Perf(R) + F/(Perf(R) + N - R) )
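A sketch of the asymmetric-speedup formula, again with the Perf(R) = sqrt(R) model (the names are my own):

```python
import math

def perf(R):
    return math.sqrt(R)

def asymmetric_speedup(F, N, R):
    """One R-BCE enhanced core plus N - R base cores: the serial
    fraction runs on the big core, the parallel fraction on all cores."""
    return 1.0 / ((1.0 - F) / perf(R) + F / (perf(R) + N - R))

cores = 1 + 256 - 41                              # 1 enhanced + 215 base
print(cores)                                      # 216
print(round(asymmetric_speedup(0.99, 256, 41)))   # 166
```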
Asymmetric Multicore Chip, N = 256 BCEs
[Chart: asymmetric speedup vs. R, for chips of 256 cores down to 1 core (1+252, 1+240, 1+192 cores, …)]
Number of cores = 1 (enhanced) + 256 - R (base)
How do Asymmetric & Symmetric speedups compare?
Recall Symmetric Multicore Chip, N = 256 BCEs
Recall F=0.9, R=28, Cores=9, Speedup=26.7
Asymmetric Multicore Chip, N = 256 BCEs
F=0.99: R=41 (vs. 3), Cores=216 (vs. 85), Speedup=166 (vs. 80)
F=0.9: R=118 (vs. 28), Cores=139 (vs. 9), Speedup=65.6 (vs. 26.7)
Asymmetric offers greater speedup potential than Symmetric
In the paper: as Moore's Law increases N, Asymmetric gets better
Some arch. researchers should target asymmetric multicores
Asymmetric Multicore: 3 Software Issues
1. Schedule computation (e.g., when to use bigger core)
2. Manage locality (e.g., sending code or data can sap gains)
3. Synchronize (e.g., asymmetric cores reaching a barrier)
At what level?
ü Application Programmer
ü Library Author
ü Compiler
ü Runtime System
ü Operating System
ü Hypervisor (Virtual Machine Monitor)
ü Hardware
Dynamic Multicore Chips, Take 1
Ø Why NOT Have Your Cake and Eat It Too?
Ø N Base Cores for Best Parallel Performance Ø Harness R Cores Together for Serial Performance
Ø How? DYNAMICALLY Harness Cores Together
[Figure: chip in parallel mode (all base cores active) vs. sequential mode (cores harnessed together into one large core)]
Dynamic Multicore Chips, Take 2
Ø Let POWER provide the limit of N BCEs Ø While Area is Unconstrained (to first order)
[Figure: the two chips, each shown in parallel mode and in sequential mode] How to model these two chips?
Ø Result: N base cores for parallel; large core for serial
ü [Chakraborty, Wells, & Sohi, Wisconsin CS-TR-2007-1607]
ü When Simultaneous Active Fraction (SAF) < ½
Performance of Dynamic Multicore Chips
Ø N base cores, with R BCEs used serially
Ø Serial Fraction 1 - F uses R BCEs at rate Perf(R)
Ø Serial time = (1 - F) / Perf(R)
Ø Parallel Fraction F uses N base cores at rate 1 each
Ø Parallel time = F / N
Ø Therefore, w.r.t. one base core:
Dynamic Speedup = 1 / ( (1 - F)/Perf(R) + F/N )
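A sketch of the dynamic-speedup formula with Perf(R) = sqrt(R) (the names are my own):

```python
import math

def perf(R):
    return math.sqrt(R)

def dynamic_speedup(F, N, R):
    """Serial phase: R BCEs harnessed into one big core at Perf(R);
    parallel phase: all N base cores run at rate 1."""
    return 1.0 / ((1.0 - F) / perf(R) + F / N)

# Harnessing the whole N = 256 BCE chip for the serial phase:
print(round(dynamic_speedup(0.99, 256, 256)))   # 223
```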
Recall Asymmetric Multicore Chip, N = 256 BCEs
Recall F=0.99: R=41, Cores=216, Speedup=166
What happens with a dynamic chip?
Dynamic Multicore Chip, N = 256 BCEs
F=0.99: R=256 (vs. 41), Cores=256 (vs. 216), Speedup=223 (vs. 166)
Dynamic offers greater speedup potential than Asymmetric
Arch. researchers should target dynamically harnessing cores
Dynamic Asymmetric Multicore: 3 Software Issues
1. Schedule computation (e.g., when to use bigger core)
2. Manage locality (e.g., sending code or data can sap gains)
3. Synchronize (e.g., asymmetric cores reaching a barrier)
At what level?
ü Application Programmer
ü Library Author
ü Compiler
ü Runtime System
ü Operating System
ü Hypervisor (Virtual Machine Monitor)
ü Hardware
Dynamic challenges > asymmetric ones
Dynamic chips likely, due to power
Three Multicore Amdahl’s Law
Symmetric Speedup = 1 / ( (1 - F)/Perf(R) + F*R/(Perf(R)*N) )
(sequential section: 1 enhanced core; parallel section: N/R enhanced cores)
Asymmetric Speedup = 1 / ( (1 - F)/Perf(R) + F/(Perf(R) + N - R) )
(sequential section: 1 enhanced core; parallel section: 1 enhanced & N - R base cores)
Dynamic Speedup = 1 / ( (1 - F)/Perf(R) + F/N )
(sequential section: R BCEs harnessed at Perf(R); parallel section: N base cores)
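The three laws can be compared side by side for the same chip budget; a sketch with the Perf(R) = sqrt(R) model (the names are my own):

```python
import math

def perf(R):
    return math.sqrt(R)

def symmetric(F, N, R):
    return 1.0 / ((1.0 - F) / perf(R) + F * R / (perf(R) * N))

def asymmetric(F, N, R):
    return 1.0 / ((1.0 - F) / perf(R) + F / (perf(R) + N - R))

def dynamic(F, N, R):
    return 1.0 / ((1.0 - F) / perf(R) + F / N)

# Best configurations from the slides, F = 0.99 on an N = 256 BCE chip:
print(round(symmetric(0.99, 256, 3)))      # 80
print(round(asymmetric(0.99, 256, 41)))    # 166
print(round(dynamic(0.99, 256, 256)))      # 223
```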
Multicore architectures
Building blocks of multicore processors (MC)
• Cores (C)
• L2 cache(s) (L2)
• L3 cache(s), if available (L3)
• Interconnection network (IN)
• FSB controller (FSB c.) or bus controller (B. c.)
• Memory controller (M. c.)
[Figure: two cores with split L2 I/L2 D caches, L3, interconnection network, FSB controller, FSB]
Design space of Multicores
Macro architecture of multi-core (MC) processors
• Layout of the cores
• Layout of L2 caches
• Layout of L3 caches (if available)
• Layout of the on-chip interconnections
• Layout of the I/O and memory architecture
Layout of the cores - 1
Basic layout
SMC (Symmetrical MC): all other MCs
HMC (Heterogeneous MC): Cell BE (1 PPE + 8 SPE)
Layout of the cores - 2
Physical implementation
Monolithic implementation: Yonah Duo; Pentium D 8xx, Pentium EE 840; Tulsa; Merom, Conroe, Woodcrest; Montecito; all DCs of AMD, IBM, Sun, Fujitsu, HP and RMI
Multi-chip implementation: Paxville DP, Paxville MP; Pentium D 9xx, Pentium EE 955/965, Dempsey; Kentsfield, Clovertown
L2 caches - 1
Allocation of L2 caches to the cores
Private L2 cache for each core / Shared L2 cache for all cores
Examples:
[Figure: Athlon 64 X2 (2005): two cores with private L2s, System Request Queue, crossbar (IN), bus and memory controllers, HT-bus, memory. Core Duo (2006)/Core 2 Duo (2006): two cores, shared L2 with L2 controller, FSB controller, FSB]
L2 caches - 2
Allocation of L2 caches to the cores
Private L2 cache for each core / Shared L2 cache for all cores
Private L2 cache for each core: Smithfield, Presler, Irwindale and Cedar Mill based lines (2005/2006); Montecito (2006); Athlon 64 X2 and AMD's Opteron lines (2005/2006); POWER6 (2007); UltraSPARC IV (2004); UltraSPARC IV+ (2005)
Shared L2 cache for all cores: Core Duo (Yonah) and Core 2 Duo (Core) based lines (2006); POWER4 (2001); POWER5 (2005); Cell BE (2006); PA-8800 (2004); PA-8900 (2005); UltraSPARC T1 (2005); UltraSPARC T2 (2007); SPARC64 VI (2007); SPARC64 VII (2008); XLR (2005)
L2 caches - 3
Inclusion policy of L2 caches
Inclusive L2 / Exclusive L2
[Figure: inclusive vs. exclusive L2 data paths between L1, L2, memory controller and memory]
Inclusive L2: most implementations. Data missing in L1/L2 is fetched from memory; modified L1 data is written back to L2.
Exclusive L2 (victim cache): lines replaced (victimized) in the L1 are written to L2. References to data missing in L1 but available in L2 initiate reloading of the pertaining cache line to L1; reloaded data is deleted from L2. The L2 usually operates as a write-back cache (only modified data that is replaced in the L2 is written back to memory); unmodified data that is replaced in L2 is deleted.
Examples of exclusive L2: Athlon 64 X2 (2005), dual-core Opteron lines (2005)
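The exclusive (victim) policy described above can be illustrated with a toy model; this is a deliberate simplification (no associativity, tags or dirty bits), and all the names are my own:

```python
# Toy sketch of an exclusive (victim) L2, as described above.
# L1 is a small FIFO of line addresses; L2 holds only L1 victims.
class VictimL2Demo:
    def __init__(self, l1_capacity):
        self.l1, self.l2 = [], set()
        self.cap = l1_capacity

    def access(self, line):
        if line in self.l1:                 # L1 hit
            return "L1"
        if line in self.l2:                 # hit in L2: reload to L1,
            self.l2.discard(line)           # then delete from L2
            hit = "L2"
        else:
            hit = "memory"                  # miss in both levels
        self.l1.append(line)
        if len(self.l1) > self.cap:         # victimize oldest L1 line:
            self.l2.add(self.l1.pop(0))     # the victim moves into L2
        return hit

c = VictimL2Demo(l1_capacity=2)
c.access("A"); c.access("B"); c.access("C")    # "A" victimized into L2
print(c.access("A"))                           # L2 hit; "A" leaves L2
print("A" in c.l2)                             # False: L1 and L2 stay disjoint
```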
L2 caches - 4
Use by instructions/data
Unified instr./data cache(s) / Split instr./data cache(s)
Unified: typical implementation. Split: Montecito (2006).
Expected trend
L2 caches - 6
Mapping of addresses to memory modules
[Figure: POWER4 (2001)/POWER5 (2005): two cores, crossbar, three L2 modules, Fabric Bus Controller (IN), L3 tags/controller, bus and memory controllers, GX bus. UltraSPARC T1 (2005) (Niagara): 8 cores, crossbar, 4 L2 banks, memory controllers, memory]
Mapping of addresses to the banks:
• POWER4/POWER5: the 128-byte L2 cache lines are hashed across the 3 modules; hashing is performed by modulo-3 arithmetic applied to a large number of real address bits.
• UltraSPARC T1 (8 cores / 4 L2 banks): the four L2 banks are interleaved at 64-byte blocks.
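Both mapping schemes are easy to sketch; note that the modulo-3 hash here is simplified to the line number rather than the "large number of real address bits" used by the real hardware:

```python
# Toy address-to-bank mappings, following the two schemes above.

def niagara_bank(addr):
    """Niagara-style: four L2 banks interleaved at 64-byte blocks."""
    return (addr // 64) % 4

def power_module(addr):
    """POWER4/5-style (simplified): hash the 128-byte line address
    across 3 L2 modules by modulo-3 arithmetic on the line number."""
    return (addr // 128) % 3

print([niagara_bank(a) for a in (0, 64, 128, 192, 256)])   # [0, 1, 2, 3, 0]
print(power_module(0), power_module(128), power_module(384))  # 0 1 0
```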
L2 caches - 7
Integration into the processor chip
Partial integration of L2 (on-chip L2 tags/control, off-chip data): UltraSPARC IV (2004), PA-8800 (2004), PA-8900 (2005)
Full integration of L2 (entire L2 on-chip): UltraSPARC IV+ (2005), all other lines considered
Expected trend: full integration
L3 caches - 1
Allocation of L3 caches to the L2 caches
Private L3 cache for each L2: Montecito (2006), POWER5 (2005)
Shared L3 cache for all L2s: Athlon 64 X2 - Barcelona (2007), POWER4 (2001), UltraSPARC IV+ (2004)
L3 caches - 2
Inclusion policy of L3 caches
Inclusive L3: Montecito (2006), POWER4 (2001)
Exclusive L3: Athlon 64 X2 - Barcelona (2007), POWER5 (2005), POWER6 (2007), UltraSPARC IV+ (2005)
[Figure: inclusive vs. exclusive L3 data paths between L2, L3, memory controller and memory]
Exclusive L3 (victim cache): lines replaced (victimized) in the L2 are written to L3. References to data missing in L2 but available in L3 initiate reloading of the pertaining cache line to L2; reloaded data is deleted from L3. The L3 usually operates as a write-back cache (only modified data that is replaced in the L3 is written back to memory); unmodified data that is replaced in L3 is deleted.
L3 caches - 3
Integration into the processor chip
L3 partially integrated (on-chip L3 tags/control, off-chip data): UltraSPARC IV+ (2005), POWER4 (2001), POWER5 (2005), POWER6 (2007)
L3 fully integrated (entire L3 on-chip): Montecito (2006)
Expected trend: full integration
L3 caches - Examples
Examples of partially integrated L3 caches:
[Figure: POWER5 (2005): two cores, shared L2, on-chip L3 tags/controller with off-chip L3 data, Fabric Bus Controller, memory controller, GX bus. UltraSPARC IV+ (2005): two cores, interconnection network, L2, on-chip L3 tags/controller with off-chip L3 data, bus and memory controllers, Fire Plane bus, memory]
L2-L3 caches
Cache architecture
L2 partially integrated, no L3:
• L2 private: UltraSPARC IV, Gemini
• L2 shared: PA-8800, PA-8900
L2 fully integrated, no L3:
• L2 private: Smithfield/Presler-based processors, Athlon 64 X2 (excl.), dual-core Opteron (excl.)
• L2 shared: Core Duo (Yonah), Core 2 Duo based processors, Cell BE, UltraSPARC T1/T2, SPARC64 VI/VII, XLR
L2 fully integrated, L3 partially integrated:
• L2 private: POWER6 (excl.)
• L2 shared: POWER4, POWER5 (excl.), UltraSPARC IV+ (excl.)
L2 fully integrated, L3 fully integrated:
• L2 private: Montecito
L2-L3 caches - examples
Examples for cache architectures (1)
L2 partially integrated, private, no L3: UltraSPARC IV (2004)
L2 partially integrated, shared, no L3: PA-8800 (2004), PA-8900 (2005)
[Figure: UltraSPARC IV: two cores, per-core on-chip L2 tags/controller with off-chip L2 data, interconnection network, bus and memory controllers, Fire Plane bus. PA-8800/PA-8900: two cores, shared on-chip L2 tags/controller with off-chip L2 data, FSB controller, FSB]
L2-L3 caches - examples
Examples for cache architectures (2)
L2 fully integrated, private, no L3: Athlon 64 X2 (2005)
L2 fully integrated, shared, no L3: Core Duo (2006), Core 2 Duo (2006), UltraSPARC T1 (2005)
[Figure: Athlon 64 X2: two cores with private L2s, System Request Queue, crossbar, memory controller, HT-bus. Core Duo/Core 2 Duo: two cores, shared L2 with controller, FSB controller, FSB. UltraSPARC T1: cores 0-7, crossbar, four L2 banks, four memory controllers, bus controller, JBus]
L2-L3 caches - examples
Examples for cache architectures (3)
L2 fully integrated, shared, L3 partially integrated: POWER4 (2001), UltraSPARC IV+ (2005)
[Figure: POWER4: two cores, shared L2 modules, Fabric Bus Controller with chip-to-chip and memory-to-memory interconnects, on-chip L3 tags/controller with off-chip L3 data, memory controller, GX bus. UltraSPARC IV+: two cores, interconnection network, shared L2, on-chip L3 tags/controller with off-chip L3 data, bus and memory controllers, Fire Plane bus]
L2-L3 caches - example
Examples for cache architectures (4)
L2 fully integrated, private, L3 fully integrated: Montecito (2006)
[Figure: Montecito: two cores, per-core split L2 I and L2 D caches, per-core L3, FSB controller, FSB]
Cores' interconnect - 1
On-chip interconnections
On-chip interconnections are needed:
• between cores and shared L2 modules
• between private or shared L2 modules and shared L3 modules
• between private or shared L2/L3 modules and the B. c./M. c., or alternatively the FSB c.
• as chip-to-chip and module-to-module interconnects
[Figure: block diagrams showing the interconnection network (IN) in each of the four positions, relative to cores, L2/L3 modules, controllers, I/O bus, memory and FSB]
Cores' interconnect - 2
Implementation of interconnections
• Arbiter/multi-port cache implementations: for a small number of sources/destinations, e.g. to connect dual cores to shared L2 caches
• Crossbar: for a larger number of sources/destinations; UltraSPARC T1 (2005), UltraSPARC T2 (2007)
• Ring: Cell BE (2006), XLR (2005)
Quantitative aspects, such as the number of sources/destinations or the bandwidth requirement, affect which implementation alternative is the most beneficial.
Performance Beyond ILP
Ø Much higher natural parallelism in some applications
ü Database, web servers, or scientific codes
Ø Explicit Thread-Level Parallelism
Ø Thread: has own instructions and data
ü May be part of a parallel program or independent programs
ü Each thread has all state (instructions, data, PC, register state, and so on) needed to execute
Ø Multithreading: Thread-Level Parallelism within a processor
http://www.intel.com/technology/computing/dual-core/demo/popup/demo.htm
Principles of sequential, multitasked and multithreaded programming
Sequential programming / Multitasked programming / Multithreaded programming
[Figure: process/thread management example. Sequential: processes P1, P2, P3 run one after another. Multitasked: fork()/exec()/CreateProcess() spawn processes P1-P3. Multithreaded: CreateThread() spawns threads T1-T6 within a process, joined with join().]
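The multithreaded column can be illustrated with Python's threading module standing in for CreateThread()/join() (a sketch; the names are my own):

```python
import threading

shared = []                      # one address space, visible to all threads
lock = threading.Lock()

def worker(tid):
    # Threads share the process state; synchronization is the
    # programmer's job, hence the lock around the shared list.
    with lock:
        shared.append(tid)

# CreateThread() analogue: spawn T1..T6 inside the same process
threads = [threading.Thread(target=worker, args=(i,)) for i in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()                     # join(): wait for every thread to finish

print(sorted(shared))            # [0, 1, 2, 3, 4, 5]
```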
Main features of multithreading
Threads
• belong to the same process,
• usually share a common address space (otherwise multiple address translation paths (virtual to real) need to be maintained concurrently),
• are executed concurrently (simultaneously, i.e. overlapped by time sharing, or in parallel), depending on the implementation of multithreading.
Main tasks of thread management
• creation, control and termination of individual threads,
• context switching between threads,
• maintaining multiple sets of thread states.
Basic thread states
• thread program state (state of the ISA), including the PC, the FX/FP architectural registers and the state registers,
• thread microstate (supplementary state of the microarchitecture), including the rename register mappings, branch history, ROB, etc.
Thread-Level Parallelism (TLP)
Ø ILP exploits implicit parallel operations within loop or straight-line code segment Ø TLP is explicitly represented by multiple threads of execution that are inherently parallel Ø Goal: Use multiple instruction streams to improve ü Throughput of computers that run many programs ü Execution time of multi-threaded programs Ø TLP could be more cost-effective to exploit than ILP
Hardware multi-threading comes back
Ø Intel's Montecito (Itanium 2) - up to 35% improvement cited
ü 2 cores, 2 non-simultaneous threads per core ("temporal multithreading")
ü Switches over to the other thread in case of a high-latency event (i.e. page fault)
Ø Sun's Niagara 1 (UltraSPARC T1)
ü 8 cores, 4 non-simultaneous threads per core
ü More fine-grained switching - after every instruction
ü A thread is skipped if it triggers a large-latency event
ü Single thread slower, but throughput very high
Ø Sun's Niagara 2 (UltraSPARC T2)
ü 8 cores, 8 non-simultaneous threads per core
ü MySQL compiled with Sun's compiler runs 2.5x faster than with gcc -O3
Ø Intel chips from Nehalem (Core i7 - Q4 '08) onwards will feature HyperThreading
ü Simultaneous multi-threading (only on superscalar CPUs)
ü Takes advantage of instruction-level parallelism
Options for Multithreading
Implementation of multithreading
(while executing multithreaded apps/OSs)
• Software multithreading: execution of multithreaded apps/OSs on a single-threaded processor simultaneously (i.e. by time sharing); multiple threads maintained simultaneously by the OS: multithreaded OSs
• Hardware multithreading: execution of multithreaded apps/OSs on a multithreaded processor concurrently; multiple threads maintained concurrently by the processor: multithreaded processors
Fast context switching between threads required.
Multithreaded processors
• Multicore processors (SMP: Symmetric Multiprocessing, CMP: Chip Multiprocessing): multiple cores, each with its own L2/L3, in front of L3/memory
• Multithreaded cores: a single MT core per chip, in front of L2/L3 and memory
Requirements of Multithreading
Ø Requirement of software multithreading
Maintaining multiple thread program states concurrently by the OS, including the PC, FX/FP architectural registers, state registers
Ø Core enhancements needed in multithreaded cores
• Maintaining multiple thread program states concurrently by the processor, including the PC, FX/FP architectural registers, state registers
• Maintaining multiple thread microstates, pertaining to: rename register mappings, the RAS (Return Address Stack), the ROB, etc.
• Providing increased sizes for scarce or sensitive resources, such as the instruction buffer, the store queue and, in case of merged architectural and rename registers, appropriately large register file sizes (FX/FP), etc.
Ø Options to provide multiple states
• Implementing individual per-thread structures, like 2 or 4 sets of FX registers,
• Implementing tagged structures, like a tagged ROB, a tagged buffer, etc.
Chip area requirements vs. gain
Additional complexity: multicore processors ~(60-80)%, multithreaded cores ~(2-10)%
Additional gain (in general-purpose apps): multicore processors ~(60-80)%, multithreaded cores ~(0-30)%
ConstantinHalatsis 133
Multithreaded OSs
Ø Windows NT
Ø OS/2
Ø Unix w/Posix
Ø Most OSs developed from the ’90s on
Contrasting sequential, multitasked, and multithreaded execution (1)

Ø Sequential programs (a single process on a single-threaded processor)
ü Description: no issues with parallel programs
ü Key issue: the sequential bottleneck
Ø Multitasked programs (multiple processes on a single processor using time sharing)
ü Description: multiple programs with quasi-parallel execution
ü Key feature: private address spaces
ü Key issue: solutions for fast context switching
Ø Multithreaded programs, software multithreading (multithreaded software on a single processor using time sharing)
ü Description: multiple threads with quasi-parallel execution
ü Key features: shared process address space; thread context switches needed
ü Key issues: thread state management and context switching
Ø Multithreaded programs, hardware multithreading on a multithreaded core
ü Description: simultaneous execution of threads
ü Key features: threads share the address space; no thread context switches needed (except coarse-grained MT)
ü Key issue: intra-core communication
Ø Multithreaded programs, hardware multithreading on a multicore processor
ü Description: true parallel execution of threads
ü Key features: threads share the address space; no thread context switches needed
ü Key issue: thread scheduling
Contrasting sequential, multitasked, and multithreaded execution (2)

Ø OS support
ü Sequential programs: legacy OS support
ü Multitasked programs: traditional Unix
ü Software multithreading, hardware multithreading on a multithreaded core, and multicore implementations: most modern OSs (Windows NT/2000, OS/2, Unix w/Posix)
Ø Performance level
ü Sequential: low; multitasked: low–medium; software multithreading: high; multithreaded core: higher; multicore: highest
Ø Software development
ü Sequential programs: no API-level support
ü Multitasked programs: process life-cycle management API
ü Multithreaded variants: process and thread management API, explicit threading API, OpenMP
Multithreading Paradigms

[Figure: occupancy of four functional units (FU1–FU4) over execution time, with unused slots and Threads 1–5 shaded, for: a conventional superscalar (single-threaded); fine-grained multithreading (cycle-by-cycle interleaving); coarse-grained multithreading (block interleaving); a chip multiprocessor (CMP, today’s multi-core processors); and simultaneous multithreading (SMT, Intel’s HT).]
Fine-Grained Multithreading
Ø Switches between threads on each instruction, interleaving execution of multiple threads
ü Usually done round-robin, skipping stalled threads
Ø CPU must be able to switch threads on every clock cycle
Ø Pro: hides latency of both short and long stalls
ü Instructions from other threads are always available to execute
ü Easy to insert on short stalls
Ø Con: slows down execution of individual threads
ü A thread ready to execute without stalls is delayed by instructions from other threads
Ø Used on Sun’s Niagara
Coarse-Grained Multithreading
Ø Switches threads only on costly stalls
ü e.g., L2 cache misses
Ø Pro: no switching each clock cycle
ü Relieves the need for very fast thread switching
ü No slowdown for ready-to-go threads
§ Other threads only issue instructions when the main one would stall (for a long time) anyway
Ø Con: limited ability to hide shorter stalls
ü The pipeline must be emptied or frozen on a stall, since the CPU issues instructions from only one thread
ü The new thread must fill the pipe before its instructions can complete
ü Thus, better for reducing the penalty of high-cost stalls, where pipeline refill time << stall time
Ø Used in IBM AS/400
Simultaneous Multithreading (SMT) Ø Exploits TLP at the same time it exploits ILP Ø Intel’s HyperThreading (2-way SMT) Ø Others: IBM Power5 and Intel future multicore(8-core, 2-thread 45nm Nehalem) Ø Basic ideas: Conventional MT + Simultaneous issue + Sharing common resources Fdiv, unpipe (16 cycles) FFeettchch Dececode RS & ROB FMult Unnitit ppllusus (4 cycles) Phhysysiiccalal RRegeg FAdd RRegeg Reegiisstteerr Reggiisstteerr FileRRegeg (2 cyc) FFileileRRegeg PCPC Reenaamemerr FFileile FFileileReg PCPC FFileile PCPC File PCPC A L U1 A L U1 II--Cacachee A L U2 A L U2
Load/Store D--Cacachee (variable)
ConstantinHalatsis 140
70 PDF created with pdfFactory Pro trial version www.pdffactory.com University of Athens Department of Informatics and Telecommunications
Simultaneous Multithreading (SMT)
Ø Insight: a dynamically scheduled processor already has many HW mechanisms to support multithreading:
ü Large set of virtual registers that can be used to hold register sets for independent threads
ü Register renaming provides unique register identifiers
§ Instructions from multiple threads can be mixed in the data path
§ Without confusing sources and destinations across threads!
ü Out-of-order completion allows the threads to execute out of order and get better utilization of the HW
Ø Just add a per-thread renaming table and keep separate PCs
ü Independent commitment can be supported via a separate reorder buffer for each thread
Overview of multithreaded cores (1)

[Figure 2.1: Intel’s multithreaded desktop families, on a timeline from 1H 2002 to 2H 2006:
Ø SCMT: Pentium 4 (Northwood B), 11/02, 130 nm/146 mm², 55 Mtrs./82 W, 2-way MT; Pentium 4 (Prescott), 02/04, 90 nm/112 mm², 125 Mtrs./103 W, 2-way MT
Ø DCMT: Pentium EE 840 (Smithfield), 5/05, 90 nm/2×103 mm², 230 Mtrs./130 W, 2-way MT/core; Pentium EE 955/965 (Presler), 1/06, 65 nm/2×81 mm², 2×188 Mtrs./130 W, 2-way MT/core]
Overview of multithreaded cores (2)

[Figure 2.2: Intel’s multithreaded Xeon DP families, on a timeline from 1H 2002 to 2H 2006:
Ø SCMT: Pentium 4 (Prestonia-A), 2/02, 130 nm/146 mm², 55 Mtrs./55 W, 2-way MT; Pentium 4 (Irwindale-A), 11/03, 130 nm/135 mm², 169 Mtrs./110 W, 2-way MT; Pentium 4 (Nocona), 6/04, 90 nm/112 mm², 125 Mtrs./103 W, 2-way MT
Ø DCMT: Xeon DP 2.8 (Paxville DP), 10/05, 90 nm/2×135 mm², 2×169 Mtrs./135 W, 2-way MT/core; Xeon 5000 (Dempsey), 6/06, 65 nm/2×81 mm², 2×188 Mtrs./95/130 W, 2-way MT/core]
Overview of multithreaded cores (3)

[Figure 2.3: Intel’s multithreaded Xeon MP families, on a timeline from 1H 2002 to 2H 2006:
Ø SCMT: Pentium 4 (Foster-MP), 3/02, 180 nm/n/a, 108 Mtrs./64 W, 2-way MT; Pentium 4 (Gallatin), 3/04, 130 nm/310 mm², 178/286 Mtrs./77 W, 2-way MT; Pentium 4 (Potomac), 3/05, 90 nm/339 mm², 675 Mtrs./95/129 W, 2-way MT
Ø DCMT: Xeon 7000 (Paxville MP), 11/05, 90 nm/2×135 mm², 2×169 Mtrs./95/150 W, 2-way MT/core; Xeon 7100 (Tulsa), 8/06, 65 nm/435 mm², 1328 Mtrs./95/150 W, 2-way MT/core]
Overview of multithreaded cores (4)

[Figure 2.4: Intel’s multithreaded EPIC-based server family, on a timeline from 1H 2002 to 2H 2006:
Ø DCMT: Itanium 9x00 (Montecito), 7/06, 90 nm/596 mm², 1720 Mtrs./104 W, 2-way MT/core]
Overview of multithreaded cores (5)

[Figure 2.5: IBM’s multithreaded server families, on a timeline from 2000 to 2007:
Ø SCMT: RS64 IV (SStar), 2000, 180 nm/n/a, 44 Mtrs./n/a, 2-way MT; Cell BE PPE, 2006, 90 nm/221* mm², 234* Mtrs./95* W, 2-way MT (*: entire processor)
Ø DCMT: POWER5, 5/04, 130 nm/389 mm², 276 Mtrs./80 W (est.), 2-way MT/core; POWER5+, 10/05, 90 nm/230 mm², 276 Mtrs./70 W, 2-way MT/core; POWER6, 2007, 65 nm/341 mm², 750 Mtrs./~100 W, 2-way MT/core]
Overview of multithreaded cores (6)

[Figure 2.6: Sun’s and Fujitsu’s multithreaded server families, on a timeline from 1H 2004 to 2H 2008:
Ø 8CMT: UltraSPARC T1 (Niagara), 11/2005, 90 nm/379 mm², 279 Mtrs./63 W, 4-way MT/core; UltraSPARC T2 (Niagara II), 2007, 65 nm/342 mm², 72 W (est.), 8-way MT/core
Ø QCMT: APL SPARC64 VII (Jupiter), 2008, 65 nm/464 mm², ~120 W, 2-way MT/core
Ø DCMT: APL SPARC64 VI (Olympus), 2007, 90 nm/421 mm², 540 Mtrs./120 W, 2-way MT/core]
Overview of multithreaded cores (7)

[Figure 2.7: RMI’s multithreaded XLR family (scalar RISC), on a timeline from 1H 2002 to 2H 2006:
Ø 8CMT: XLR 5xx, 5/05, 90 nm/~220 mm², 333 Mtrs./10–50 W, 4-way MT/core]
Overview of multithreaded cores (8)

[Figure 2.8: DEC’s/Compaq’s multithreaded processor, on a timeline from 1H 2002 to 2H 2006:
Ø SCMT: Alpha 21464 (EV8), planned for 2003, 130 nm/n/a, 250 Mtrs./10–50 W, 4-way MT; cancelled 6/2001]
Overview of multithreaded cores (9)

Underlying core(s):
Ø Scalar core(s): IBM RS64 IV (SStar) (2000), single-core/2T; RMI XLR 5xx (2005), 8-core/4T
Ø Superscalar core(s): SUN UltraSPARC T1 (Niagara) (2005), up to 8 cores/4T; Pentium 4 based processors, single-core/2T (2002–) and dual-core/2T (2005–); PPE of Cell BE (2006), single-core/2T; Fujitsu SPARC64 VI/VII, dual-core/quad-core/2T
Ø VLIW core(s): SUN MAJC 5200 (2000), quad-core/4T (dedicated use); Intel Montecito (2006), dual-core/2T

Thread scheduling (1)
Thread scheduling in software multithreading on a traditional superscalar processor

[Figure: dispatch slots vs. clock cycles; Thread 1 executes, a context switch occurs, then Thread 2 executes.]
The execution of a new thread is initiated by a context switch (needed to save the state of the suspended thread and load the state of the thread to be executed next).
Thread scheduling assuming software multithreading on a 4-way superscalar processor
Thread scheduling (2)
Thread scheduling in multicore processors (CMPs)

[Figure: dispatch slots vs. clock cycles; Thread 1 and Thread 2 run on separate cores.]

Both superscalar cores execute different threads independently.
Thread scheduling in a dual core processor
Thread scheduling (3)
Thread scheduling in multithreaded cores
Coarse grained MT
Thread scheduling (4)

[Figure: dispatch/issue slots vs. clock cycles; Thread 1 executes, a rapid context switch occurs, then Thread 2 executes.]
Threads are switched by means of rapid, HW-supported context switches.
Thread scheduling in a 4-way coarse grained multithreaded processor
Thread scheduling (5)
Coarse grained MT
Ø Scalar based: IBM RS64 IV (SStar) (2000), single-core/2T
Ø VLIW based: SUN MAJC 5200 (2000), quad-core/4T (dedicated use); Intel Montecito (2006?), dual-core/2T
Thread scheduling (6)
Thread scheduling in multithreaded cores
Coarse grained MT Fine grained MT
Thread scheduling (7)

[Figure: dispatch/issue slots vs. clock cycles; in each cycle, all slots are filled from a single one of Threads 1–4.]
The hardware thread scheduler chooses a thread in each cycle, and instructions from this thread are dispatched/issued in that cycle.
Thread scheduling in a 4-way fine grained multithreaded processor
Thread scheduling (8)
Fine grained MT: threads are selected either with a round-robin selection policy or with a priority-based selection policy, over scalar-, superscalar-, or VLIW-based cores.

Examples (superscalar based):
Ø SUN UltraSPARC T1 (Niagara) (2005), up to 8 cores/4T
Ø PPE of Cell BE (2006), single-core/2T
Thread scheduling (9)
Thread scheduling in multithreaded cores
Coarse grained MT Fine grained MT Simultaneous MT (SMT)
Thread scheduling (10)

[Figure: dispatch/issue slots vs. clock cycles; in each cycle, slots are filled with instructions from several of Threads 1–4 at once.]
Available instructions (chosen according to an appropriate selection policy, such as the priority of the threads) are dispatched/issued for execution in each cycle.
SMT: Proposed by Tullsen, Eggers and Levy in 1995 (U. of Washington).
Thread scheduling in a 4-way simultaneous multithreaded processor
Thread scheduling (11)
SMT cores
Ø Superscalar based: Pentium 4 based processors, single-core/2T (2002–) and dual-core/2T (2005–); DEC Alpha 21464 (2003), dual-core/4T (cancelled in 2001); IBM POWER5 (2005), dual-core/2T
Ø Scalar based and VLIW based: none listed
SMT Pipeline
[Figure: the SMT pipeline of the Alpha 21464 (EV8): Fetch → Decode/Map → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire, with per-thread PCs, a shared register map, register files, Icache, and Dcache. Data from Compaq.]
IBM Power 4
Single-threaded predecessor to Power 5. 8 execution units in the out-of-order engine; each can issue an instruction each cycle.
Changes in IBM Power 5 to support SMT
Ø Increased associativity of the L1 instruction cache and the instruction address translation buffers
Ø Added per-thread load and store queues
Ø Increased size of the L2 (1.92 vs. 1.44 MB) and L3 caches
Ø Added separate instruction prefetch and buffering per thread
Ø Increased the number of virtual registers from 152 to 240
Ø Increased the size of several issue queues
Ø The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support
SMT in IBM Power 5
[Figure: the Power 5 pipeline, with 2 fetch stages (PCs), 2 initial decode stages, and 2 commits (architected register sets).]
Detailed SMT pipeline of IBM POWER5
Source: Kalla R., “IBM’s POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003
IBM Power 5 data flow
Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to becoming a bottleneck.
Power 5 thread performance
The relative priority of each thread is controllable in hardware.
For balanced operation, both threads run slower than if they “owned” the machine.
Pentium-4 Hyperthreading (2002)
Ø First commercial SMT design (2-way SMT)
ü Hyperthreading == SMT
Ø Logical processors share nearly all resources of the physical processor
ü Caches, execution units, branch predictors
Ø Die area overhead of hyperthreading ~5%
Ø When one logical processor is stalled, the other can make progress
ü No logical processor can use all entries in the queues when two threads are active
Ø A processor running only one active software thread runs at approximately the same speed with or without hyperthreading
Pentium-4 Hyperthreading Front End
[Figure: front-end resources divided between the logical CPUs vs. resources shared between the logical CPUs. Source: Intel Technology Journal, Q1 2002.]
Pentium-4 Hyperthreading Execution Pipeline
[Figure: the hyperthreaded execution pipeline. Source: Intel Technology Journal, Q1 2002.]
SMT adaptation to parallelism type

For regions with high thread-level parallelism (TLP), the entire machine width is shared by all threads; for regions with low TLP, the entire machine width is available for instruction-level parallelism (ILP).

[Figure: issue width vs. time in the two regimes.]
Initial Performance of SMT
Ø Pentium 4 Extreme SMT yields a 1.01 speedup for the SPECint_rate benchmark and 1.07 for SPECfp_rate
ü Pentium 4 is dual-threaded SMT
ü SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark
Ø Running each of the 26 SPEC benchmarks paired with every other on a Pentium 4 (26² runs) gives speedups from 0.90 to 1.58; the average was 1.20
Ø A Power 5 8-processor server is 1.23× faster for SPECint_rate with SMT, 1.16× faster for SPECfp_rate
Ø Power 5 running 2 copies of each app shows speedups between 0.89 and 1.41
ü Most gained some
ü Fl.Pt. apps had the most cache conflicts and the least gains
Pentium 4 Hyperthreading: Performance Improvements
ICOUNT Choosing Policy
Fetch from the thread with the fewest instructions in flight.
Why does this enhance throughput? Threads with few instructions in flight are moving through the pipeline quickly, while threads with many in-flight instructions are likely stalled and clogging the issue queues.
SMT Fetch Policies (Locks)
Ø Problem: a spin-looping thread consumes resources
Ø Solution: provide a quiescing operation that allows a thread to sleep until a memory location changes
loop:    ARM     r1, 0(r2)    ; load and start watching 0(r2)
         BEQ     r1, got_it
         QUIESCE              ; inhibit scheduling of the thread until
         BR      loop         ; activity is observed on 0(r2)
got_it:
The Cray example
Ø Cray XMT
ü 128 hardware streams
§ A stream is 31 64-bit registers, 8 target registers, and a control register
ü Three functional units: M, A, and C
ü 500 MHz
ü Full and Empty bits per word (2 bits)
Ø An example of a very highly multithreaded SMT design
The Cray MTA2
[Figure: programs running in parallel: serial code is divided into subproblems (A, B) whose concurrent threads of computation (i = 0 … n) are mapped onto the 128 hardware streams; unused streams idle while ready instructions enter the instruction-ready pool and the pipeline of executing instructions. Cray MTA2 picture from John Feo’s “Can programmers and machines ever be friends?”.]
Summary: Multithreaded Categories

[Figure: time (processor cycles) vs. issue slots for a superscalar, fine-grained MT, coarse-grained MT, multiprocessing, and simultaneous multithreading; shading distinguishes Threads 1–5 and idle slots.]
Multithreaded Programming
Ø A thread of execution is a fork of a computer program into two or more concurrently running tasks.
Ø POSIX Threads (Pthreads) is a set of threading interfaces developed by the IEEE
Ø The assembly language of shared-memory programming
Ø The programmer has to manually:
ü Create and terminate threads
ü Wait for threads to complete
ü Manage the interaction between threads using mutexes, condition variables, etc.
Concurrency Platforms
• Programming directly on Pthreads is painful and error-prone.
• With Pthreads, you either sacrifice memory usage or load balance among processors.
• A concurrency platform provides linguistic support and handles load balancing.
• Examples:
• Threading Building Blocks (TBB)
• OpenMP
• Cilk++
Trends
Ø Multicore – manycore
Ø Hybrid processors ü Accelerators for specific kinds of computation ü More difficult to take advantage of
Ø Application-specific supercomputers ü Not covered
Multicore versus manycore
Ø Multicore: the current course
ü Same core
ü Double the count every 18 months (2, 4, 8, … etc.)
ü Examples: Intel Dunnington (6 cores), Tukwila (next generation of Itanium, 4 cores), AMD Shanghai (4 cores), AMD Constantinople (6 cores), IBM Power 6 (up to 16 cores), IBM Power 7, AMD Magny-Cours (12 cores), …
Ø Manycore: convergence toward this direction
ü Simplify the core (shorter pipelines, lower clock frequencies, in-order processing)
ü Start with 100 cores and double every 18 months
ü Examples: Nvidia G80 (128 cores), Intel Polaris (80 cores), Cisco/Tensilica Metro (188 cores)
Ø Convergence: ultimately to manycore
ü Manycore, provided we figure out how to program them
From Multi-to Many-Core
Ø Multi-core (2–4 cores) designs dominate the commodity market and percolate into high-end systems
Ø Many-core (10s or 100+ cores) designs are emerging
ü Heterogeneity is a real possibility
Ø Examples
ü Intel 80-core TeraScale chip & 32-core Larrabee chip
ü IBM Cyclops-64 chip with 160 thread units
ü ClearSpeed 96-core CSX chip
ü 2nd-generation Nvidia Tesla products use the 240-core C870 processor (0.9 TF single precision)
ü AMD/ATI FireStream 9170 with 320 cores

[Figure: IBM Cell]
From Multi- to Many-Core

[Figure: IBM Cell]
Example: Intel Tera-scale Processor
Ø 80 simple cores on a single chip
Ø Two programmable FP engines per core
Ø Design aspects:
ü Tiled design
ü Network on a chip
§ Each core has a 5-port message-passing router
§ Routers connected in a 2-D mesh
ü Fine-grained power management: compute engines/routers can be individually put to sleep
ü Other innovations: sleep transistors, mesochronous clocking, clock gating
Ø Reaches 1 Tflops peak consuming 62 Watt
The problem with manycore!
“I know how to make 4 horses pull a cart. I don’t know how to make 1024 chickens do it.” Enrico Clementi, former IBM fellow
Slide by Ruth Poole – IBM Software Engineer Blue Gene Control System
HPC Accelerators
Ø Cell
Ø GPGPUs
Ø FPGAs
Ø ClearSpeed
Ø Hybrids
ü Intel Larrabee, …
Sony Toshiba IBM Cell
Ø A heterogeneous architecture developed for the PS3
Ø Combines a PowerPC with co-processing elements to accelerate multimedia and vector-processing applications
Ø Software-controlled memories
Ø Available since 2005
Ø Many research Cell clusters/projects
Ø A hybrid Opteron–Cell cluster has become the first petaflop system
ü A nice overview can be found at: http://en.wikipedia.org/wiki/Cell_(microprocessor)
ü A nice workshop on Cell at UCSD: http://crca.ucsd.edu/cellworkshop/
IBM PowerXCell 8i
Cell architecture

Power Processor Element (PPE):
• General-purpose, 64-bit RISC processor (PowerPC 2.02)
• 2-way hardware multithreaded
• L1: 32 KB I; 32 KB D
• L2: 512 KB
• Coherent load/store
• VMX
• 3.2 GHz

Synergistic Processor Elements (SPE):
• 8 per chip
• 128-bit wide SIMD units
• Integer and floating point
• 256 KB local store
• Up to 25.6 GF/s per SPE; 200 GF/s total*
Cell/B.E. –Example of eight concurrent transactions
Comparison to some multicores

[Figure: sparse matrix × vector (SpMV) performance on several multicores; this peak is difficult to achieve.]
Mercury Cell Accelerator Board 2
Ø 1 Cell processor, 2.8 GHz
Ø 4 GB DDR2, GbE, PCI Express x16, 256 MB DDR2
Ø Full Mercury MultiCore Plus SDK support
Workstation Accelerator Board
Mercury’s 1U Server DCBS-2
Ø Dual Cell processors, 3.2 GHz, 1 GB XDR per Cell, dual GigE, InfiniBand/10GbE option, 8 SPEs per Cell
Ø Larger memory footprint compared to the PS3
Ø Dual Cell processors give a single application access to 16 SPEs
Ø Preconfigured with Yellow Dog Linux
Ø Binary compatible with the PS3
Excellent Small Application Option
IBM QS-22 Blade
Ø Dual IBM PowerXCell™ 8i (new double-precision SPEs)
Ø Up to 32 GB DDR2 per Cell
Ø Dual GigE and optional 10GbE/InfiniBand
Ø Red Hat Enterprise Linux 5.2
Ø Full software/hardware support from IBM
Ø Up to 14 blades in a BladeCenter chassis
Ø Very high density solution
Double Precision Workhorse
SONY ZEGO BCU-100
Ø 1U Cell server
Ø Single 3.2 GHz Cell/B.E., 1 GB XDR, 1 GB DDR2, RSX GPU, GbE
Ø Full Cell/B.E. with 8 SPEs
Ø PCI Express slot for InfiniBand or 10 GbE
Ø Preloaded with Yellow Dog Linux
New Product
Software Development
Ø Utilize free software
ü IBM Cell Software Development Kit (SDK)
§ C/C++ libraries providing programming models, data movement operations, SPE local store management, process communication, and SIMD math functions
§ Open-source compilers/debuggers
n GNU and IBM XL C/C++ compilers
n Eclipse IDE enhancements specifically for Cell targets
n Instruction-level debugging on the PPE and SPEs
§ IBM System Simulator: allows testing Cell applications without Cell hardware
§ Code optimization tools
n Feedback Directed Program Restructuring (FDPR-Pro) optimizes performance and memory footprint
ü Eclipse Integrated Development Environment (IDE)
§ Compile from Linux workstations, run remotely on Cell targets
§ Develop, compile, and run directly on Cell-based hardware and system simulators
ü Linux operating system
§ Customizable kernel
§ Large software base for development and management tools
Ø Additional software available for purchase
ü Mercury’s MultiCore Framework, PAS, SAL
Multiple software development options allow greater flexibility and cost savings
Application Development Using IBM SDK
[Diagram: user application layers (setup/init, communication, data I/O, processing, synchronization) built directly on IBM SDK primitives: node management, mailbox, signals, lists, single intrinsics, alongside MCF, SAL, POSIX C/C++, NUMA, and init & buffer management.]

Advantages:
• SPE affinity supported (SDK 2.1+)
• Plugins are implicitly loaded at run time
• Low-level DMA control and monitoring
• SPE–SPE communications possible
• Low-level DMA operations can be functionally hidden
• SPE–SPE transfers are possible
• Free, without support
Disadvantages:
• Lower-level SPE/PPE setup/control
• The lightweight infrastructure increases complexity
• Manual I/O buffer management
• Technical support unknown

° Data processing and task synchronization are comparable between the Mercury MCF and the IBM SDK
Application Development Using Mercury MCF
IIBM SSiingnglele IIntntrriinnssiicscs
NNododee MMggmtt -- Maililboxbox Muullttiipplele SSALAL Mututexex MCF TTaassksks/Pl/Pluuggiinsns
IInniitt && LogLogicic PPOOSSIXIX C//C++ NUNUMAMA
Usserr App Settup//IInitit Comm Datta II/O/O Prrocessssiing Syync
Advvanttages Diissadvavanttages •Node management is easy to setup and change dynamically. •Pluginsmust be explicitly defined and loaded at runtime •SPE affinity are not supported. •Added overhead •Simplifies complex data movement •Various data I/O operations are hidden from the user after initial setup. •Mailbox communication is restricted between a PPE and SPEs •Multiple striding, overlapping, and multi-buffer options available. •Single one-time transfers can be performed via IBM DMA APIs •SPE to SPE transfers isn’t supported in MCF 1.1. •Technical Support Provided •Added Overhead
•Cost •Possible interference when trying to utilize IBM SDK features that ° Data processing and task synchronization are aren’t exposed via MCF APIs. comparable between the Mercury MCF and IBM SDK
GP-GPUs
Ø General-purpose computing on graphics processing units
Ø Using the GPU as a “vector CPU” by (ab)using the programmable vertex shaders
Ø Available since 2000
Ø Becoming more and more popular
Ø Uses stream processing to exploit the extreme parallelism
Ø CUDA tutorial at the Supercomputing Conference 2007
ü A nice introduction can be found at: http://en.wikipedia.org/wiki/GPGPU
Some GPGPUs
Programming GPGPUs: CUDA
CUDA: C for the GPU
• C programming environment that adds the ability to
− Specify many-core parallelism
− Map functions to the GPU
− Transfer data between GPU & CPU
• Low learning curve
− Easily specify 1000s of threads
• Integrated CPU-GPU programming model
• 50 million CUDA-enabled GPUs already deployed
CUDA toolkit and SDK examples
• Free CUDA Toolkit and SDK examples
• Toolkit and SDK include
− Compiler
− Debugger, profiler
− Documentation
− Libraries
− Samples
FPGAs
Ø Field-programmable gate array
Ø “Adjust the architecture to the needs of your algorithm”
Ø Invented 1984
Ø Used heavily in embedded and real-time systems
Ø Used in supercomputers like Cray XD1, SGI RASC Blades
Ø Programmability!
ü An overview can be found at: http://en.wikipedia.org/wiki/Field-programmable_gate_array
Some FPGAs
FPGA example
FPGA example
FPGA example
ClearSpeed Boards
Ø Accelerator boards specially developed for scientific computing and the needs of the HPC community
Ø Only one accelerated platform in the current Top500 (Tsubame Grid Cluster, Japan, No. 29)
Ø Advertisement claims:
ü “World’s highest performance processor” (96 GF per board)
ü “World’s highest performance per watt” (2.86 GF/Watt)
http://www.clearspeed.com/acceleration/performance/benchmarks
The CSX700 Processor
• Includes dual MTAP cores:
– 96 GFLOPS peak (32- & 64-bit)
– 48 GMACS peak (16x16 → 32+64)
– 10 W max power consumption
– 250 MHz clock speed
– 192 Processing Elements (2x96)
– 8 spare PEs for resiliency
– ECC on all internal memories
• On-die temperature sensors
• Active power management
• Dual integrated 64-bit DDR2 memory controllers with ECC
• Integrated PCI Express x16
• CCBR chip-to-chip bridge port
• IBM 90 nm process
• 266 million transistors
• Shipping to customers since June 08
Copyright © 2008 ClearSpeed Technology Inc. All rights reserved.
The ClearSpeed Advance™ e710, e720 and CATS-700
Ø 96 GFLOPS e710 & e720 fit standard 1U & HP blade servers
ü Low power consumption of 25 W max; small, light, passively cooled
ü Designed for high reliability (MTBF)
ü All memory is error protected; no moving parts (e.g. fans) are required
Ø CATS-700 1U system
ü 1.152 TFLOPS 32- and 64-bit floating point
ü 96 GBytes/s memory bandwidth to 24 GB of ECC-protected DDR2
ü 300 W typical power consumption
Ø Easy-to-use Software Development Kit
ü ANSI C compiler, gdb-based debugger, advanced profiler
CSX700 and beyond
Ø The CSX700 is much more power efficient than Cell and GPUs for embedded processing.
ü E.g. for single-precision complex 1024x1024 2D FFT:
ü Cell (8 SPE): 38 GFLOPS, 40 W, 0.95 GFLOP/watt
ü S870 (Tesla) GPU: 50 GFLOPS, 170 W, 0.07 GFLOP/watt
ü x86 core: 3 GFLOPS, 25 W, 0.12 GFLOP/watt
ü CSX700: 20 GFLOPS, 7 W, 2.86 GFLOP/watt
Ø Next-generation processor “Carnac” in design now
ü Focusing on 1- and 2D FFT performance
ü Design goal is 100 GFLOPS/watt sustained on 2D FFTs
RECONFIGURABLE COMPUTING
Ø Reconfigurable Computing (RC)
ü Idea of reconfiguring a computer to your current needs
ü Use FPGAs for the reconfiguration
Ø Concept exists since the 1960s (paper by Gerald Estrin): “Unfortunately this idea was far ahead of its time in needed electronic technology.”
Ø Renaissance in the 80s/90s
ü “The world’s first commercial reconfigurable computer, the Algotronix CHS2X4, was completed in 1991. It was not a commercial success.”
Quotes are taken from http://en.wikipedia.org/wiki/Reconfigurable_computing
Ø Reconfigurable HPC (RHPC)
Overview of available Accelerators
Source: White Paper from HP: http://www.hp.com/techservers/hpccn/hpccollaboration/ADCatalyst/downloads/accelerators.pdf
Hybrids
Ø Larrabee
Larrabee: A Many-core, Thread-by-Data Parallel (Vector) Hybrid?
Ø Larrabee integrates data- and thread-parallel design elements and emphasizes programmability, making it a programmable hybrid
Larrabee’s Thread and Data Parallel Design Elements and Abstractions
Ø Processor: A many-core die with from 8 to 32 cores
Ø Core: A processing element that runs and maintains the context of up to 4 threads that share an L2 cache
Ø Thread: A hardware-managed context of from 2 to 10 fibers, switched to hide long, unpredictable latency
Ø Fiber: A software-managed context of from 1 to 4 vector ops (vectors; may also include scalars?)
Ø Vector: A data-parallel instruction (data block) composed of 16 single-precision (8 double-precision) strands, used to hide short, predictable memory latency
Ø Strand: A series of operations on individual data elements with common register, cache, or memory locations
Larrabee’s Parallel Abstraction Stack
Larrabee Processor: P (8 to 32 cores per processor)
Cores: c c . . . . c c
Threads: T T T T (4 thread contexts per core)
Fibers: (Vector Vector . . . . Vector) (2 to 10 fibers per thread)
Strands: S S . . . . S S (16 to (4 x 16) strands per fiber)
Larrabee Architecture
Larrabee Core and Caches
Ø Instruction Set
ü Standard Pentium x86 instructions
ü Scalar and cache-control instructions
ü Vector instructions
Ø L1 cache can act somewhat like an extended register file
Ø Global L2 cache is divided into separate local subsets, one per core
ü Low latency for the local L2 subset
ü Data written by a core is stored in its own L2 subset, and is flushed from other subsets if necessary
Scalar Unit and Cache Control Instructions
Ø Scalar pipeline is derived from the dual-issue Pentium processor
ü Short, inexpensive execution pipeline
ü Multi-threading, 64-bit extensions, and sophisticated pre-fetching
ü New scalar instructions (bit count, bit scan)
Ø New instructions and instruction modes for explicit cache control
ü Pre-fetch data into L1 or L2 caches
ü Reduce the priority of a cache line
ü Use L2 cache like a scratchpad memory
Ø Supports 4 threads with a separate register set per thread
Vector Processing Unit
Ø 16-wide vector processing unit (VPU)
ü Integer, single/double-precision float instructions
Ø Data can come directly from L1
Ø The VPU supports
ü a wide variety of instructions on int/float
§ Standard arithmetic/logical
§ Additional ld/st instructions for data-type conversions
ü gather and scatter, supporting
§ 16 elements from up to 16 different addresses
Ø Instructions can be predicated by a mask register, which
ü controls which parts of a register/memory are written or left untouched
ü reduces branch misprediction penalties
Inter-Processor Ring Network
Ø Larrabee uses a bi-directional ring network
ü 512 bits wide per direction
ü All routing decisions are made before injecting messages into the network
ü Provides a path for the L2 to access memory
ü Fixed-function logic agents are accessed by the CPU cores and in turn access L2/memory
Comparing Larrabee to Nehalem, Tesla, PowerXCell
Feature | Larrabee (Intel) | Nehalem (Intel) | Tesla 10 (NVIDIA) | PowerXCell (IBM)
Availability | ~Q1 2010 | Q4 2008 | Q3 2008 | Q2 2008
Board integration | socket or bus | socket | bus | socket or bus
Core type | small, in-order, x86-64 | large, out-of-order, x86-64 | very small FPU, in-order, nvidia-64 | small, mixed, ppe/spe-64
Core number | 8 to ~32 (64-bit) | 4 to 8 (64-bit) | 240 (32-bit), 30 (64-bit) | 8 + 1 (64-bit)
Max clock | ~2.5 GHz | ~3.2 GHz | ~1.5 GHz | ~3.2 GHz
On-chip interconnect | ring(s) | switch | pipeline + switch | ring
New parallel API required | No, compiler has burden | No, compiler has burden | Yes, CUDA | Yes, DaCS, ALF
Vector width | 16 | 4 | (1+1) | 4 (SIMD 32-bit)
Programmable pipeline | Fully, 1 fixed function unit | Fully, no fixed function units | Partially, several fixed func. units | Fully, no fixed function units
Convergence of Platforms
ü Multiple parallel general-purpose processors (GPPs)
ü Multiple application-specific processors (ASPs)
[Figure: example converged platforms. IBM Cell: 1 GPP (2 threads) + 8 ASPs. Intel Network Processor (IXP2850): 1 GPP core (Intel XScale) + 16 ASPs (128 threads). Sun Niagara: 8 GPP cores (32 threads). picoChip DSP: 1 GPP core + 248 ASPs. Cisco CRS-1: 188 Tensilica GPPs.]
Intel 4004 (1971): 4-bit processor, 2312 transistors, ~100 KIPS, 10 micron PMOS, 11 mm2
“The Processor is the new Transistor” [Rowen]: 1000s of processor cores per die