Case Studies: The PowerPC 620 and Intel P6 Case Study: The PowerPC 620 Modern Processor Design: Fundamentals of Superscalar Processors

2

IBM/Motorola/Apple Alliance PowerPC 620 Case Study  Alliance begun in 1991 with a joint design center (Somerset) in Austin – Ambitious objective: unseat Intel on the desktop – Delays, conflicts, politics…hasn’t happened, alliance largely dissolved today  First-generation out-of-order processor  PowerPC 601 – Quick design based on RSC compatible with POWER and PowerPC  Developed as part of Apple-IBM-Motorola  PowerPC 603 alliance – Low power implementation designed for small uniprocessor systems – 5 FUs: branch , integer , system, load/store, FP  Aggressive goals, targets  PowerPC 604 – 4-wide machine  Interesting microarchitectural features – 6 FUs, each with 2-entry RS  Hopelessly delayed  PowerPC 620 – First 64-bit machine, also 4-wide  Led to future, successful designs – Same 6 FUs as 604 – Next slide, also chapter 5 in the textbook  PowerPC G3, G4 – Newer derivatives of the PowerPC 603 (3-issue, in-order) – Added Altivec multimedia extensions

1 PowerPC 620 PowerPC 620 Pipeline Fetch stage

Instruction buffer (8)

Dispatch stage

BRU LSU XSU0 XSU1 MC-FXU FPU Reservation stations (6)

Execute stage(s)

Completion buffer (16)

Complete stage

Writeback stage

 Fetch stage – 4-wide, BTAC simple predictor  PowerPC 620  Instruction Buffer – Joint IBM/Apple/Motorola design – Decouples fetch from dispatch stalls – Holds up to 8 instructions – Aggressively out-of-order, weak memory order, 64 bits  Hopelessly delayed, very few shipped, but influenced later designs

PowerPC 620 Pipeline PowerPC 620 Pipeline Fetch stage Fetch stage Instruction buffer (8) Instruction buffer (8) Dispatch stage Dispatch stage

BRU LSU BRU XSU0 XSU1 MC-FXU FPU LSU Reservation stations (6) XSU0 XSU1 MC-FXU FPU Reservation stations (6) Execute stage(s)

Completion buffer (16) Execute stage(s)

Complete stage Completion buffer (16) Complete stage Writeback stage

 Dispatch Stage Writeback stage – Rename – Allocate: rename buffer, completion buffer  Execute Stage – Dispatch to reservation station – Six functional units – Branches: resolve (if operands avail.) or predict with BHT – Execute, bypass to waiting RS entries, write rename buffer  Reservation Stations  Completion Buffer – 2 to 4 entries per functional unit, depending on type – Sixteen entries, holds instruction state until in-order completion – RS holds instruction payload, operands

2 PowerPC 620 Pipeline Fetch stage Benchmark Performance Instruction buffer (8)

Dispatch stage Benchmarks Dynamic Instructions Execution Cycles IPC

BRU compress 6,884,247 6,062,494 1.14 LSU XSU0 XSU1 MC-FXU FPU Eqntott 3,147,233 2,188,331 1.44 Reservation stations (6) espresso 4, 615, 085 3, 412, 653 1351.35 Execute stage(s)

Completion buffer (16) Li 3,376,415 3,399,293 0.99 Complete stage alvinn 4,861,138 2,744,098 1.77

Writeback stage hydro2d 4,114,602 4,293,230 0.96

 Complete Stage tomcatv 6,858,619 6,494,912 1.06 – Maintains precise exceptions by buffering out-of-order instructions –4-wide  Writeback Stage – In-order writeback: results copied from rename buffer to architected register file

Branch Prediction Branch Prediction Accuracy

 Two-phase branch prediction compress eqntott espresso li alvinn hydro2d tomcatv BranchResolution –BTAC Not Taken 40.35% 31.84% 40.05% 33.09% 6.38% 17.51% 6.12%  Holds targets of taken branches only – On miss, fetch sequential (not-taken) path Taken 59.65% 68.16% 59.95% 66.91% 93.62% 82.49% 93.88%  Accessed in single cycle in fetch stage BTACPrediction Correct 84.10% 82.64% 81.99% 74.70% 94.49% 88.31% 93.31%  Generates fetch address for next cycle Incorrect 15.90% 17.36% 18.01% 25.30% 5.51% 11.69% 6.69%  256 entries, 2 -way set-associative BHT Prediction –BHT Resolved 19.71% 18.30% 17.09% 28.83% 17.49% 26.18% 45.39%  Accessed in dispatch stage Correct 68.86% 72.16% 72.27% 62.45% 81.58% 68.00% 52.56%  2048 entries of 2-bit counters (bimodal) – Also attempts to resolve branches at dispatch Incorrect 11.43% 9.54% 10.64% 8.72% 0.92% 5.82% 2.05% BTAC Incorrect and  Interactions BHT Correct 0.01% 0.79% 1.13% 7.78% 0.07% 0.19% 0.00%

– {BTAC right, wrong} x {BHT right, wrong} = 4 cases BTAC Correct and – BHT overrides BTAC BHT Incorrect 0.00% 0.12% 0.37% 0.26% 0.00% 0.08% 0.00% Overall Branch Prediction Accuracy 88.57% 90.46% 89.36% 91.28% 99.07% 94.18% 97.95%

3 (a) Instruction (b) Completion

20% 20% 10% 10% compress 0% 0% 048 0481216

Wasted Fetch Cycles Buffer Utilization 20% 20% 10% 10% eqntott 0% 0% 048 0481216

 Instruction buffer 20% 20% 10% 10% espresso Benchmark Misprediction I-Cache Miss 0% 0% – Decouples 048 0481216 compress 6.65% 0.01% fetch/dispatch 20% 20% eqntott 11. 78% 0. 08% li 10% 10% 0% 0% espresso 10.84% 0.52%  Completion buffer 048 0481216 20% 20% 10% 10% li 8.92% 0.09% – Supports OOO alvinn 0% 0% alvinn 0.39% 0.02% execution 048 0481216 20% 20% hydro2d 5.24% 0.12% 10% 10% hydro2d 0% 0% tomcatv 0.68% 0.01% 048 0481216 20% 20% 10% 10% tomcatv 0% 0% 048 0481216

Dispatch Stalls Issue Stalls

Frequency of dispatch stall cycles. Sources of Dispatch Stalls compress eqntott espresso li alvinn hydro2d tomcatv Frequency of issue stall cycles. Serialization0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% Sources of Issue Stalls Move to special register constraint compress eqntott espresso li alvinn hydro2d tomcatv 0.00% 4.49% 0.94% 3.44% 0.00% 0.95% 0.08% Read port saturation Out of order disallowed 0.26% 0.00% 0.02% 0.00% 0.32% 2.23% 6.73% 0.00% 0.00% 0.00% 0.00% 0.72% 11.03% 1.53% Reservation station saturation Serialization 36.07% 22.36% 31.50% 34.40% 22.81% 42.70% 36.51% Rename buffer saturation 1.69% 1.81% 3.21% 10.81% 0.03% 4.47% 0.01% 24.06% 7.60% 13.93% 17.26% 1.36% 16.98% 34.13% Waiting for source Completion buffer saturation 21.97% 29.30% 37.79% 32.03% 17.74% 22.71% 3.52% 5.54% 3.64% 2.02% 4.27% 21.12% 7.80% 9.03% Waiting for execution unit Another to same unit 9.72% 20.51% 18.31% 10.57% 24.30% 12.01% 7.17% 13.67% 3.28% 7.06% 11.01% 2.81% 1.50% 1.30% No issue stalls No dispatch stalls 62.67% 65.61% 51.94% 46.15% 78.70% 60.29% 93.64% 24.35% 41.40% 33.28% 30.06% 30.09% 17.33% 6.35%

4 Parallelism Achieved (a) Dispatching (b) Issuing (c) Finishing (d) Completion Summary of PowerPC 620 30% 30% 30% 30% 20% 20% 20% 20% 10% 10% 10% 10% compress 0% 0% 0% 0% 02134 021 3456 021 3456 02134  First-generation OOO part 30% 30% 30% 30% 20% 20% 20% 20%

eqntott 10% 10% 10% 10%  Aggressive goals, poor execution 0% 0% 0% 0% 02134 021 3456 021 3456 02134  Interesting contributions 30% 30% 30% 30% 20% 20% 20% 20%

spresso 10% 10% 10% 10% – Two-ppp()hase branch prediction (also in 604) e 0% 0% 0% 0% 02134 021 3456 021 3456 02134 – Short pipeline 30% 30% 30% 30%

li 20% 20% 20% 20% 10% 10% 10% 10% – Weak ordering of memory references 0% 0% 0% 0% 02134 021 3456 021 3456 02134  PowerPC evolution 30% 30% 30% 30% 20% 20% 20% 20% – 1998: Power3 (630FP) alvinn 10% 10% 10% 10% 0% 0% 0% 0% 02134 021 3456 021 3456 02134 – 2001: Power4 30% 30% 30% 30% 20% 20% 20% 20% 10% 10% 10% 10% – 2004: Power5 hydro2d 0% 0% 0% 0% 02134 021 3456 021 3456 02134

30% 30% 30% 30% 20% 20% 20% 20% 10% 10% 10% 10% tomcatv 0% 0% 0% 0% 02134 021 3456 021 3456 02134

620 vs. Power3 vs. Power4 IBM Power4 Attribute 620 Power3 Power4 Frequency 172 MHz 450 MHz 1.3 GHz Pipeline depth 5+ 5+ 15+ Branch predictor Bimodal BHT + Same 3x16 1b combining BTAC Fetch/issue/comple 4/6/4 4/8/4 4/8/5 tion width Rename/physical 8 Int, 8 FP 16 Int, 24 FP 80 Int, 72 FP registers In-flight instructions 16 32 Up to 100 FP Units 1 2 2 Load/store units 1 2 2 Instruction Cache 32K 8w SA 32K 128w SA 64K DM Data Cache 32K 8w SA 64K 128w SA 32K 2w SA  IBM POWER4, began shipping in 2001 L2/L3 size 4M 16M 1.5M/32M – Deep pipeline: 15 stages minimum L2 bandwidth 1GB/s 6.4GB/s 100+ GB/s – Aggressive combining branch prediction Store queue 6 x 8B 16 x 8B 12 x 64B entries – Over 100 instructions in flight, tracked in 20 groups of 5 in ROB MSHRs I:1/D:1 I:2/D:4 I:2/D:8 – Aggressive memory hierarchy, memory bandwidth Hardware prefetch None 4 streams 8 streams

5 Pentium Pro Case Study Chapter 7: Intel’s P6 Architecture Modern Processor Design: Fundamentals of • Microarchitecture Superscalar Processors – Order-3 Superscalar – Out-of-Order execution – Specu lati ve executi on – In-order completion • Design Methodology • Performance Analysis

BTB/ICU Fetch 4 Cycles Goals of P6 Microarchitecture P6 –The BAC/Rename Decode 2 Cycles Big Picture IA-32 Compliant Allocation Dispatch 2 Cycles

Reservation Station (20) 2 Cycles Performance (Frequency - IPC) 4 3 2 1 0 Validation Die Size 2 cyc AGU1 AGU0 IEU1 IEU0 Schedule ROB (40x157) Power JEU Fadd MOB Fmul RRF Imul DCU Div

6 Memory Hierarchy Instruction Fetch L2 Cache (256Kb) Main Memory PCI Other Stream Fetch Requests Buffer Inst. 64 bit Length 16 bytes Decoder Inst. Buf ICache

Next (8Kb) 16 bytes + marks Mux Addr. ddress LthLength ICache A BIU Logic Marks (8KB) 16 L2 Cache To

bytes Fetch Victim (256Kb) Cache Inst. Decode

DCache Inst. Rotator CPU (8Kb) Physical

Addr. Instruction Data Instruction Instruction Data Branch Target on TLB i ct i

• Level 1 instruction and data caches - 2 cycle access time d Prediction Marks e r • Level 2 unified cache - 6 cycle access time P Branch Target Buffer (512) 2 cycle • Separate level 2 cache and memory address/data bus • Level 2 cache fill policy - implications

Instruction Cache Unit Branch Target Buffer Lower 12 bits PHT Bus Way 0 Way 1 Way 3 . . r Stream Buffer Interface r Unit pec.

Fetch Address Rs H B get Add get Add Br. Offset Br. Offset Br. Offset r Upper r 4-bit BHR 4-bit BHR 4-bit BHR a a 128 Sets 20 bits 16 entries/set bit T T Target Addr. etch Addr. Tag - 4 4-bit BHR spec. 4-bit BHR spec. ICache Fetch Addr. Tag Fetch Addr. Tag F (8 Kb) Victim Cache Lower 12 bits Tag Compare Instruction Tag Array Prediction Prediction Data Mux Fetch Return Control & Address Stack Logic Target Addr.

 Pattern History Table (PHT) is not speculatively updated

 A speculative Branch History Register (BHR) and prediction state is maintained Instruction Hit/Miss Data ITLB  Uses speculative prediction state if it exist for that branch

7 Branch Prediction Algorithm Instruction Decode ‐ 1 Macro-Instruction Bytes from IFU 0010 Branch Execution Br. History State 11 Pattern Table Instruction Buffer 16 bytes To Ne xt Machine Address 0010 0000 Calc. 10 0001 Br. Pred. 0010 10 0011 uROM Decoder Decoder Decoder 1 0100 0 1 2 0101 1 Branch 0110 Address 0101 0101 . Calc. . 4 uops 1 uop 1 uop Speculative History Spec. Pred. . 1110 1111 uop Queue (6)

 Current prediction updates the speculative history prior to the next instance of the branch instruction Up to 3 uops Issued to dispatch  Branch instruction detection  Branch History Register (BHR) is updated during branch execution  Branch address calculation - Static prediction and branch always execution  Branch recovery flushes front-end and drains the execution core  One branch decode per cycle (break on branch)  Branch mis-prediction resets the speculative branch history state to match BHR

Instruction Decode ‐ 2 What is a uop? Macro-Instruction Bytes from IFU Small two-operand instruction - Very RISC like. Instruction Buffer 16 bytes To Ne xt Address IA-32 instruction Calc. add (eax),(ebx) MEM(eax) <- MEM(eax) + MEM(ebx)

uROM Decoder Decoder Decoder 0 1 2 Uop decomposition: Branch Address ld guop0, (eax) guop0 <- MEM(eax) Calc. ld guop1, (ebx) guop1 <- MEM(ebx) 4 uops 1 uop 1 uop add guop0,guop1 guop0 <- guop0 + guop1

uop Queue (6) sta eax std guop0 MEM(eax) <- guop0 Up to 3 uops Issued to dispatch

 Instruction Buffer contains up to 16 instructions, which must be decoded and queued before the instruction buffer is re-filled

 Macro-instructions must shift from decoder 2 to decoder 1 to decoder 0

8 Instruction Dispatch ‐ 1 Integer RAT EAX Real Register File (RRF)EBX Reorder Buffer (ROB)

er (3) To EAX ECX 0 ff Mux Reservation EBX 1 Renaming Logic Station 8 ECX 2 GuoP0 3 GuoP1 4

uoP Queue (6) FST0 5 Dispatch Bu Retirement 8 FST1 6 Info Floating Point RAT 7 8 GuoP0 FST0 9 Allocator FST1 12 GuoP1 2 cycles FST2 Register Renaming 39 4 IuoP(0-3) Allocation requirements 9 CC/Events FST7 “3-or-none” Reorder buffer entries Similar to Tomasulo’s Algorithm - Uses ROB entry number as tags Reservation station entry The register alias tables (RAT) maintain a pointer to the most recent data for the Load Buffer or store buffer entry renamed register Dispatch buffer “probably” dispatches all 3 uops before re-fill Execution results are stored in the ROB

Register Renaming ‐ Example Register Renaming –Example cont’d Integer RAT Integer RAT EAX EAX Real Register File (RRF)EBX Reorder Buffer (ROB) Real Register File (RRF)EBX Reorder Buffer (ROB) EAX ECX 0 EAX ECX 0 EBX 1 EBX 1 8 ECX 2 8 ECX 2 GuoP0 3 GuoP0 3 GuoP1 4 GuoP1 4 FST0 5 FST0 5 8 FST1 Comp sub 6 8 FST1 6 Floating Point RAT Alloc 7 Floating Point RAT 7 8 8 GuoP0 FST0 9 GuoP0 FST0 9 GuoP1 FST1 GuoP1 FST1 12 FST2 12 FST2 39 39 4 IuoP(0-3) 4 IuoP(0-3) 9 CC/Events FST7 9 CC/Events FST7 Dispatching: Completing: Dispatching: Completing: add eax, ebx sub eax, ecx add eax, ebx sub eax, ecx add eax, ecx add eax, ecx fxch f0, f1 fxch f0, f1

9 Challenges to Register Renaming Out‐of‐Order Execution Engine Integer RAT RS Reservation Station (20) EAX bypass 2 Cycles Real Register File (RRF)EBX Reorder Buffer (ROB) 4 3 2 1 0 EAX ECX 0 EBX 1 8 ECX 2 GuoP0 3 GuoP1 4 FST0 5 AGU1 AGU0 IEU1 IEU0 FST1 6 8 JEU Fadd Floating Point RAT 7 8 MOB GuoP0 FST0 9 Fmul GuoP1 FST1 12 FST2 Imul 39 DCU (8Kb) 4 IuoP(0-3) Div 9 CC/Events FST7 • In-order branch issue and execution 8-bit code • In-order load/store issue to address generation units mov AL, #data1 Byte addressable registers • Instruction execution and result bus scheduling mov AH, #data2 add AL, #data3 • Is the reservation station “truly” centralized & what is “binding”? add AL, #data4

Reservation Station Memory Ordering Buffer (MOB) AGU0 AGU1 R/S From Dispatch Queue Load Buffer Store Address Store Data (16) Buffer (12) Buffer (12) Conflict Port 4 Port 3 Port 2 Port 1 Port 0 Logic Cycle 1

2 cycle Ccle2Cycle 2 Data Cache Unit (8Kb)

MOB To Execution Units Bypass Control Logic •Cycle 1 2 cycle Load Data Result – Order checking • Load buffer retains loads until completed, for coherency checking – Operand availability • Store forwarding out of store buffers •Cycle 2 • 2 cycle latency through MOB – Writeback bus scheduling • “Store Coloring” - Load instructions are tagged by the last store

10 Instruction Completion Pentium Pro Design Methodology ‐ 1

• Handles all exception/interrupt/trap conditions • Handles branch recovery Performance Model – OOO core drains out right-path instructions, commits to RRF – In parallel, front end starts fetching from target/fall-through Behavioral Model Microarchitecture – However, no renaming is allowed until OOO core is drained 0.75 years 1 year – After draining is done, RAT is reset to point to RRF Structural Model – Avoids checkpointing RAT, recovering to intermediate RAT state Start 1990 Conception • Commits execution results to the architectural state in-order Finish 4Q – Retirement Register File (RRF) 1994 1 year Logic Design – Must handle hazards to RRF (writes/reads in same cycle) Silicon Debug 1.75 years – Must handle hazards to RAT (writes/reads in same cycle) Circuit Design • “Atomic” IA-32 instruction completion – uops are marked as 1st or last in sequence Layout – exception/interrupt/trap boundary • 2 cycle retirement

Performance –Run Times Pentium Pro Performance Analysis UnhaltedUser-Mode User-Mode Processor Processor Cycles Cycles

• Observability .2E+11 – On-chip event counters 1E+11 – Dynamic analysis Total of 27.5 billion cycles

8E+10

• Benchmark Suite 6E+10

– BAPco Sysmark32 - 32-bit Windows NT applications 4E+10 – Winstone97 - 32-bit Windows NT applications 2E+10 – Some SPEC95 benchmarks

0 li avs go Cor el excel wor d Pwave MSVC paradox Wor dPr o PowerPnt SYSexcel SYSflw96 compress PhotoShop PageMaker SYSword7SYSCorel6SYSexcel7 microstation SYSpowerpnt7SYSParadox7 SYSpagemaker6

11 Performance –IPC vs. uPC Performance –IPC vs. uPC

1. 6 InstructionsInstructio ns and and UOPs RetiredUops Per Cycleretired per cycle 100 Inst.Inst. or or UOPS Uops retired retiredper cycle per cycle

1. 4 2 uops/instruction 90 1. 2 Inst retired 80 UOPS retired 1 70 AVS 0. 8 60 PageMaker Paradox 0. 6 50 PowerPoi nt

WordPro # Retired0. 4 Per Cycle 40

0. 2 30

Instructions were Retired 0 20

Percent10 of Cycles where n=0-3 UOPs or SPEC-Li WS-AVS WS- W or d SPEC-Go WS-Excel SYS-ExcelSYS-Word SYS-Excel WS - P W av e SYS-FlW96 WS- MSVC++ WS-Paradox WS-WordPro SYS-Paradox 0 WS- Pow er Pnt WS- Phot oSho p WS-Corel Draw WS-PageMaker SYS-PowerPnt SYS-CorelDraw SYS-PagemakerSPEC-Compress 3 UOPs 2 UOPs 1 UOP 0 UOPs 3 Inst . 2 Inst . 1 In st . 0 Inst . WS-Microstation

Performance –Branch Prediction Branch Mispredict Rate Performance –Cache Misses 16% Branch Mispredict Rate 14%

12% 6.8% avg 2. 5 Cache Misses Per Cycl e Cache Misses Per Cycle 10%

8%

2 IL1 - 0.51% IFU Misses / Cycle DL1 - 1.15% 6% DCU Misses / Cycle 4% L2 - 0.25% % Branches Mispredicted

L2 Misses / Cycle 2% 1. 5

0%

WS-AVS SP EC-Li WS -Exc el WS-WordSYS-ExcelSY S-Word SYS-Excel SPEC-Go WS-P Wa ve SYS-FlW96 1 WS-MSVC++ WS-Paradox WS-WordPro WS -PowerP nt SY S-Paradox WS-CorelDraw SYS-PowerPnt WS -PhotoShop WS-PageMaker SYS-CorelDraw WS -Mi crostation SYS-Pagemaker SPEC-Compress

% Unhalted Cycles BTB Miss Rate 45% BTB Miss Rate

0. 5 40%

35%

30%

25% 0 20%

15%

10% % Branches that Miss in BTB WS-AVS SPEC-Li WS - Ex c el WS - Wor d SP EC- Go 5% WS - PWav e SYS-ExcelSY S-W ord SYS-Excel SYS-FlW96 0% WS- MSVC++ WS-Paradox WS- Wor dP r o SYS-Paradox WS- PowerPnt SYS-PowerPnt WS-PhotoShop WS-CorelDraw WS-PageMaker SYS-CorelDraw WS-Microstation WS-AVS SPEC-Li WS-Excel WS-Word SPEC-Go SYS-Pagemaker SYS-ExcelSY S-Word SYS-Excel SPEC-Compress WS-PWave SYS-FlW96 WS -MS VC+ + WS-Paradox WS-WordPro WS- PowerPnt SYS-Paradox WS-CorelDraw SYS -PowerPnt WS-P hotoShop WS-PageMaker SYS-CorelDraw WS-Microstation SYS-Pagemaker SPEC-Compress

12 Conclusions IA-32 Compliant Performance (Frequency - IPC) 366.0 ISpec92 283.2 FSpec92 8.09 SPECint95 6.70 SPECfp95 Validation Die Size - Fabable Schedule - 1 year late Power -

13