<<

Pro Case Study Chapter 7: ’s Architecture Modern Processor Design: Fundamentals of z Superscalar Processors – Order-3 Superscalar – Out-of-Order execution – – In-order completion z Design Methodology z Performance Analysis

BTB/ICU Fetch 4 Cycles P6 – The Goals of P6 Microarchitecture BAC/Rename Decode 2 Cycles Big Picture IA-32 Compliant Allocation Dispatch 2 Cycles

Reservation Station (20) 2 Cycles Performance (Frequency - IPC) 4 3 2 1 0 Validation Die Size 2 cyc AGU1 AGU0 IEU1 IEU0 Schedule ROB (40x157) Power JEU Fadd MOB Fmul RRF Imul DCU Div

1 Memory Hierarchy Instruction Fetch L2 Cache (256Kb) Main Memory PCI

Other Stream ks r

Fetch a s

Buffer uf Requests e B

Inst. +m t. s s 64 bit 6byt Length e 1 Decoder In s ICache s x 6byt

Next (8Kb) 1 Addr. Mu

Length r ta

ICache o t a Logic a Marks t

BIU a ch Addre t

(8KB) 16 t L2 Cache o To on D bytes Fe Victim (256Kb) Cache Inst. Decode st. R ucti n

DCache I tr s CPU Physical ruction Da (8Kb) t

Addr. In Ins

Instruction n Branch n

s Target o TLB cti i d Predictio Mark z Level 1 instruction and data caches - 2 cycle access time e r P Branch Target Buffer (512) z Level 2 unified cache - 6 cycle access time 2 cycle z Separate level 2 cache and /data

z Level 2 cache fill policy - implications

Instruction Cache Unit Branch Target Buffer Lower 12 bits Way 0 Way 1 Way 3 PHT

Bus t e t t t ec. Stream Buffer Interface ec. ec. Tag Tag Tag p p p s s s . . . ddr. fse fse fse ddr. ddr. HR Unit HR HR f f f dr dr dr R R R tA tB tA tB tA Fetch Address tB H H H .O .O .O Ad Ad Ad bi bi bi B B B - - - 28 Sets Br Upper Br Br 4 4 4 16 entries/s 20 bits 1 bit bit bit Targe Targe Targe - - etch - 4 4 4 ICache Fetch Fetch F (8 Kb) Victim Cache Lower 12 bits Tag Compare Instruction Tag Array Prediction Prediction Data Mux Fetch Return Control & Address Stack Logic Target Addr.

z Pattern History Table (PHT) is not speculatively updated

z A speculative Branch History Register (BHR) and prediction state is maintained Instruction Hit/Miss Data ITLB z Uses speculative prediction state if it exist for that branch

2 Branch Prediction Algorithm Instruction Decode - 1 Macro-Instruction Bytes from IFU 0010 Branch Execution Br. History State 11 Pattern Table Instruction Buffer 16 bytes To Ne xt Machine Address 0010 0000 Calc. 10 0001 Br. Pred. 0010 10 0011 uROM Decoder Decoder Decoder 1 0100 0 1 2 0101 1 Branch 0110 Address 0101 0101 . Calc. . 4 uops 1 uop 1 uop Speculative History Spec. Pred. . 1110 1111 uop Queue (6)

z Current prediction updates the speculative history prior to the next instance of the branch instruction Up to 3 uops Issued to dispatch z Branch instruction detection z Branch History Register (BHR) is updated during branch execution z Branch address calculation - Static prediction and branch always execution z Branch recovery flushes front-end and drains the execution core z One branch decode per cycle (break on branch) z Branch mis-prediction resets the speculative branch history state to match BHR

Instruction Decode - 2 What is a uop? Macro-Instruction Bytes from IFU Small two-operand instruction - Very RISC like. Instruction Buffer 16 bytes To Ne xt Address IA-32 instruction Calc. add (eax),(ebx) MEM(eax) <- MEM(eax) + MEM(ebx)

uROM Decoder Decoder Decoder 0 1 2 Uop decomposition: Branch Address ld guop0, (eax) guop0 <- MEM(eax) Calc. ld guop1, (ebx) guop1 <- MEM(ebx) 4 uops 1 uop 1 uop add guop0,guop1 guop0 <- guop0 + guop1

uop Queue (6) sta eax std guop0 MEM(eax) <- guop0 Up to 3 uops Issued to dispatch

z Instruction Buffer contains up to 16 instructions, which must be decoded and queued before the instruction buffer is re-filled

z Macro-instructions must shift from decoder 2 to decoder 1 to decoder 0

3 Instruction Dispatch - 1 Integer RAT EAX Real (RRF)EBX Reorder Buffer (ROB)

er (3) EAX ECX 0 f To Mux Reservation EBX 1 Renaming Logic Station 8 ECX 2 hBuf

c GuoP0 3 GuoP1 4

uoP Queue (6) FST0 5 Dispat Retirement 8 FST1 6 Info Floating Point RAT 7 8 GuoP0 FST0 9 Allocator FST1 12 GuoP1 2 cycles FST2 Register Renaming 39 4 IuoP(0-3) Allocation requirements 9 CC/Events FST7 “3-or-none” Reorder buffer entries Similar to Tomasulo’s Algorithm - Uses ROB entry number as tags Reservation station entry The register alias tables (RAT) maintain a pointer to the most recent data for the Load Buffer or store buffer entry renamed register Dispatch buffer “probably” dispatches all 3 uops before re-fill Execution results are stored in the ROB

Register Renaming - Example Register Renaming – Example cont’d Integer RAT Integer RAT EAX EAX Real Register File (RRF)EBX Reorder Buffer (ROB) Real Register File (RRF)EBX Reorder Buffer (ROB) EAX ECX 0 EAX ECX 0 EBX 1 EBX 1 8 ECX 2 8 ECX 2 GuoP0 3 GuoP0 3 GuoP1 4 GuoP1 4 FST0 5 FST0 5 8 FST1 Comp sub 6 8 FST1 6 Floating Point RAT Alloc 7 Floating Point RAT 7 8 8 GuoP0 FST0 9 GuoP0 FST0 9 GuoP1 FST1 GuoP1 FST1 12 FST2 12 FST2 39 39 4 IuoP(0-3) 4 IuoP(0-3) 9 CC/Events FST7 9 CC/Events FST7 Dispatching: Completing: Dispatching: Completing: add eax, ebx sub eax, ecx add eax, ebx sub eax, ecx add eax, ecx add eax, ecx fxch f0, f1 fxch f0, f1

4 Challenges to Register Renaming Out-of-Order Execution Engine Integer RAT RS Reservation Station (20) EAX bypass 2 Cycles Real Register File (RRF)EBX Reorder Buffer (ROB) 4 3 2 1 0 EAX ECX 0 EBX 1 8 ECX 2 GuoP0 3 GuoP1 4 FST0 5 AGU1 AGU0 IEU1 IEU0 FST1 6 8 JEU Fadd Floating Point RAT 7 8 MOB GuoP0 FST0 9 Fmul GuoP1 FST1 12 FST2 Imul 39 DCU (8Kb) 4 IuoP(0-3) Div 9 CC/Events FST7 z In-order branch issue and execution 8-bit code z In-order load/store issue to address generation units mov AL, #data1 Byte addressable registers mov AH, #data2 z Instruction execution and result bus scheduling add AL, #data3 z Is the reservation station “truly” centralized & what is “binding”? add AL, #data4

Reservation Station Memory Ordering Buffer (MOB) AGU0 AGU1 R/S From Dispatch Queue Load Buffer Store Address Store Data (16) Buffer (12) Buffer (12) Conflict Port 4 Port 3 Port 2 Port 1 Port 0 Logic Cycle 1

2 cycle Cycle 2 Data Cache Unit (8Kb)

MOB To Execution Units Bypass Control Logic z Cycle 1 2 cycle Load Data Result – Order checking z Load buffer retains loads until completed, for coherency checking – Operand availability z Store forwarding out of store buffers z Cycle 2 – Writeback bus scheduling z 2 cycle latency through MOB z “Store Coloring” - Load instructions are tagged by the last store

5 Instruction Completion Design Methodology - 1

z Handles all exception/interrupt/trap conditions z Handles branch recovery Performance Model – OOO core drains out right-path instructions, commits to RRF – In parallel, front end starts fetching from target/fall-through Behavioral Model Microarchitecture – However, no renaming is allowed until OOO core is drained 0.75 years 1 year – After draining is done, RAT is reset to point to RRF Structural Model – Avoids checkpointing RAT, recovering to intermediate RAT state Start 1990 Conception z Commits execution results to the architectural state in-order Finish 4Q – Retirement Register File (RRF) 1994 1 year Logic Design – Must handle hazards to RRF (writes/reads in same cycle) Silicon Debug 1.75 years – Must handle hazards to RAT (writes/reads in same cycle) Circuit Design z “Atomic” IA-32 instruction completion – uops are marked as 1st or last in sequence Layout – exception/interrupt/trap boundary z 2 cycle retirement

Performance – Run Times Pentium Pro Performance Analysis UnhaltedUser-Mode User-Mode Processor Processor Cycles Cycles

z Observability .2E+11 – On-chip event counters 1E+11 – Dynamic analysis Total of 27.5 billion cycles

8E+10

z Benchmark Suite 6E+10 – BAPco Sysmark32 - 32-bit Windows NT applications 4E+10 – Winstone97 - 32-bit Windows NT applications – Some SPEC95 benchmarks 2E+10

0 li avs go Cor el excel wor d Pwave MSVC paradox Wor dPr o PowerPnt SYSexcel SYSflw96 compress PhotoShop PageMaker SYSword7SYSCorel6SYSexcel7 microstation SYSpowerpnt7SYSParadox7 SYSpagemaker6

6 Performance – IPC vs. uPC Performance – IPC vs. uPC

1. 6 InstructionsInstructio ns and and UOPs RetiredUops Per Cycleretired per cycle 100 Inst.Inst. or or UOPS Uops retired retiredper cycle per cycle

1. 4 2 uops/instruction 90 1. 2 Inst retired 80 UOPS retired 1 70 AVS 0. 8 60 PageMaker Paradox 0. 6 50 PowerPoi nt

WordPro # Retired0. 4 Per Cycle 40

0. 2 30

Instructions were Retired 0 20

Percent10 of Cycles where n=0-3 UOPs or SPEC-Li WS-AVS WS- W or d SPEC-Go WS-Excel SYS-ExcelSYS-Word SYS-Excel WS - P W av e SYS-FlW96 WS- MSVC++ WS-Paradox WS-WordPro SYS-Paradox 0 WS- Pow er Pnt WS- Phot oSho p WS-Corel Draw WS-PageMaker SYS-PowerPnt SYS-CorelDraw SYS-PagemakerSPEC-Compress 3 UOPs 2 UOPs 1 UOP 0 UOPs 3 Inst . 2 Inst . 1 In st . 0 Inst . WS-Microstation

Performance – Branch Prediction Branch Mispredict Rate Performance – Cache Misses 16% Branch Mispredict Rate 14%

12% 6.8% avg 2. 5 Cache Misses Per Cycl e Cache Misses Per Cycle 10%

8%

2 IL1 - 0.51% IFU Misses / Cycle DL1 - 1.15% 6% DCU Misses / Cycle 4% L2 - 0.25% % Branches Mispredicted

L2 Misses / Cycle 2% 1. 5

0%

WS-AVS SP EC-Li WS -Exc el WS-WordSYS-ExcelSY S-Word SYS-Excel SPEC-Go WS-P Wa ve SYS-FlW96 1 WS-MSVC++ WS-Paradox WS-WordPro WS -PowerP nt SY S-Paradox WS-CorelDraw SYS-PowerPnt WS -PhotoShop WS-PageMaker SYS-CorelDraw WS -Mi crostation SYS-Pagemaker SPEC-Compress

% Unhalted Cycles BTB Miss Rate 45% BTB Miss Rate

0. 5 40%

35%

30%

25% 0 20%

15%

10% % Branches that Miss in BTB WS-AVS SPEC-Li WS - Ex c el WS - Wor d SP EC- Go 5% WS - PWav e SYS-ExcelSY S-W ord SYS-Excel SYS-FlW96 0% WS- MSVC++ WS-Paradox WS- Wor dP r o SYS-Paradox WS- PowerPnt SYS-PowerPnt WS-PhotoShop WS-CorelDraw WS-PageMaker SYS-CorelDraw WS-Microstation WS-AVS SPEC-Li WS-Excel WS-Word SPEC-Go SYS-Pagemaker SYS-ExcelSY S-Word SYS-Excel SPEC-Compress WS-PWave SYS-FlW96 WS -MS VC+ + WS-Paradox WS-WordPro WS- PowerPnt SYS-Paradox WS-CorelDraw SYS -PowerPnt WS-P hotoShop WS-PageMaker SYS-CorelDraw WS-Microstation SYS-Pagemaker SPEC-Compress

7 Conclusions IA-32 Compliant Performance (Frequency - IPC) 366.0 ISpec92 283.2 FSpec92 8.09 SPECint95 6.70 SPECfp95 Validation Die Size - Fabable Schedule - 1 year late Power -

8