Pentium Pro Case Study
Prof. Mikko H. Lipasti University of Wisconsin-Madison
Lecture notes based on notes by John P. Shen Updated by Mikko Lipasti Pentium Pro Case Study
• Microarchitecture – Order-3 Superscalar – Out-of-Order execution – Speculative execution – In-order completion • Design Methodology • Performance Analysis • Retrospective Goals of P6 Microarchitecture
IA-32 Compliant Performance (Frequency - IPC) Validation Die Size Schedule Power
BTB/ICU Fetch 4 Cycles P6 – The BAC/Rename Decode 2 Cycles Big Picture Allocation Dispatch 2 Cycles
Reservation Station (20) 2 Cycles 4 3 2 1 0
2 cyc ROB AGU1 AGU0 IEU1 IEU0 (40x157) JEU Fadd MOB Fmul RRF Imul DCU Div Memory Hierarchy
Main Memory PCI
64 bit
ICache BIU (8KB) 16 bytes L2 Cache (256Kb) DCache CPU (8Kb)
• Level 1 instruction and data caches - 2 cycle access time • Level 2 unified cache - 6 cycle access time • Separate level 2 cache and memory address/data bus
Instruction Fetch
L2 Cache (256Kb)
s
Other Stream k
r
a
Fetch f
s
Buffer u
e m
Requests
t
B
y
+
.
Inst.
b
t
s
s
6
e
Length n
t
1
I
y
Decoder
b
s
s ICache
x
6
e
u
1
Next r (8Kb)
d
M
d
Addr.
a
Length r
A
t
o
t a
Logic a Marks
h
t
a
c
t
D a
t
o
e
n
D
To
R
o
F Victim
n
i
.
t
t
Inst. o Decode
c
Cache i
s
t
u
n
c
r
I
t
u
s
Physical r
t
n
s
Addr. I
n
I
n
Instruction Branch
o
i
n
t
s Target
o c
TLB
i i
k
t
d r
c
e a
i
r
d
P M
e
r
P Branch Target Buffer (512) 2 cycle Instruction Cache Unit Lower 12 bits Bus Stream Buffer Interface Unit Fetch Address Upper 20 bits ICache (8 Kb) Victim Cache Lower 12 bits
Instruction Data Mux Tag Array
Instruction Hit/Miss Data ITLB Branch Target Buffer
Way 0 Way 1 Way 3 PHT
t
. . .
g g g
e
c c c
. . .
a a a
s
t t t
e e e
r r r
/
e e e
T T R R T R
p p p
s
d d d
s
s s s
s s s
. . . t
e
d d d
f f f
H H H
r r r
i
f f f
e
r
d d d
A A A
B R B R B R
t
O O O
S
d d d
t t t t t t
n
H H H
. . .
i i i
e e e e
r r r
A A A
8
b b b
B B B
g g g
- - -
2
B B B
6
r r r
h h h
t t t
i i i
4 4 4
a a a
1
c c c 1
t t t
b b b
T T T
e e e
- - -
4 4 4
F F F
Tag Compare
Prediction Prediction Fetch Return Control & Address Stack Logic Target Addr.
Pattern History Table (PHT) is not speculatively updated
A speculative Branch History Register (BHR) and prediction state is maintained
Uses speculative prediction state if it exist for that branch Branch Prediction Algorithm
0010 Branch Execution Br. History State 11 Pattern Table Machine 0 0 1 0 0000 10 0001 Br. Pred. 0010 1 0 0011 1 0100 0101 1 0110 0 1 0 1 0101 . . Speculative History Spec. Pred. . 1110 1111
Current prediction updates the speculative history prior to the next instance of the branch instruction
Branch History Register (BHR) is updated during branch execution
Branch recovery flushes front-end and drains the execution core
Branch mis-prediction resets the speculative branch history state to match BHR Instruction Decode - 1 Macro-Instruction Bytes from IFU
Instruction Buffer 16 bytes To Next Address Calc.
uROM Decoder Decoder Decoder 0 1 2 Branch Address Calc. 4 uops 1 uop 1 uop
uop Queue (6)
Up to 3 uops Issued to dispatch Branch instruction detection
Branch address calculation - Static prediction and branch always execution
One branch decode per cycle (break on branch) Instruction Decode - 2 Macro-Instruction Bytes from IFU
Instruction Buffer 16 bytes To Next Address Calc.
uROM Decoder Decoder Decoder 0 1 2 Branch Address Calc. 4 uops 1 uop 1 uop
uop Queue (6)
Up to 3 uops Issued to dispatch
Instruction Buffer contains up to 16 instructions, which must be decoded and queued before the instruction buffer is re-filled
Macro-instructions must shift from decoder 2 to decoder 1 to decoder 0 What is a uop?
Small two-operand instruction - Very RISC like. IA-32 instruction add (eax),(ebx) MEM(eax) <- MEM(eax) + MEM(ebx)
Uop decomposition: ld guop0, (eax) guop0 <- MEM(eax) ld guop1, (ebx) guop1 <- MEM(ebx) add guop0,guop1 guop0 <- guop0 + guop1 sta eax std guop0 MEM(eax) <- guop0
Instruction Dispatch
)
3
(
)
r
6
(
e
f
To
e
f
u u
Mux Reservation
e
B
Renaming
u
Logic Station
h
Q
c
t
P
a
o
p
u
s
i D Retirement Info
Allocator 2 cycles Register Renaming Allocation requirements “3-or-none” Reorder buffer entries Reservation station entry Load buffer or store buffer entry Dispatch buffer “probably” dispatches all 3 uops before re-fill Register Renaming - 1 Integer RAT EAX Real Register File (RRF) EBX Reorder Buffer (ROB) EAX ECX 0 EBX 1 8 ECX 2 GuoP0 3 GuoP1 4 FST0 5 8 FST1 6 Floating Point RAT 7 8 GuoP0 FST0 9 GuoP1 FST1 12 FST2 39 4 IuoP(0-3) 9 CC/Events FST7
Similar to Tomasulo’s Algorithm - Uses ROB entry number as tags The register alias tables (RAT) maintain a pointer to the most recent data for the renamed register Execution results are stored in the ROB Register Renaming - Example Integer RAT EAX Real Register File (RRF) EBX Reorder Buffer (ROB) EAX ECX 0 EBX 1 8 ECX 2 GuoP0 3 GuoP1 4 FST0 5 8 FST1 Comp sub 6 Floating Point RAT Alloc 7 8 GuoP0 FST0 9 GuoP1 FST1 12 FST2 39 4 IuoP(0-3) 9 CC/Events FST7 Dispatching: Completing: add eax, ebx sub eax, ecx add eax, ecx fxch f0, f1 © Shen, Lipasti 15 Challenges to Register Renaming Integer RAT EAX Real Register File (RRF) EBX Reorder Buffer (ROB) EAX ECX 0 EBX 1 8 ECX 2 GuoP0 3 GuoP1 4 FST0 5 8 FST1 6 Floating Point RAT 7 8 GuoP0 FST0 9 GuoP1 FST1 12 FST2 39 4 IuoP(0-3) 9 CC/Events FST7 8-bit code mov AL, #data1 Byte addressable registers mov AH, #data2 add AL, #data3 add AL, #data4 Out-of-Order Execution Engine RS Reservation Station (20) bypass 2 Cycles 4 3 2 1 0
AGU1 AGU0 IEU1 IEU0 JEU Fadd MOB Fmul Imul DCU (8Kb) Div
• In-order branch issue and execution • In-order load/store issue to address generation units • Instruction execution and result bus scheduling • Is the reservation station “truly” centralized & what is “binding”?
Reservation Station
From Dispatch Queue
Port 4 Port 3 Port 2 Port 1 Port 0 Cycle 1
Cycle 2
To Execution Units • Cycle 1 – Order checking – Operand availability • Cycle 2 – Writeback bus scheduling
Memory Ordering Buffer (MOB) AGU0 AGU1 R/S
Load Buffer Store Address Store Data (16) Buffer (12) Buffer (12) Conflict Logic
2 cycle Data Cache Unit (8Kb)
MOB Bypass Control Logic 2 cycle Load Data Result • Load buffer retains loads until completed, for coherency checking • Store forwarding out of store buffers • 2 cycle latency through MOB • “Store Coloring” - Load instructions are tagged by the last store Instruction Completion
• Handles all exception/interrupt/trap conditions • Handles branch recovery – OOO core drains out right-path instructions, commits to RRF – In parallel, front end starts fetching from target/fall-through – However, no renaming is allowed until OOO core is drained – After draining is done, RAT is reset to point to RRF – Avoids checkpointing RAT, recovering to intermediate RAT state • Commits execution results to the architectural state in-order – Retirement Register File (RRF) – Must handle hazards to RRF (writes/reads in same cycle) – Must handle hazards to RAT (writes/reads in same cycle) • “Atomic” IA-32 instruction completion – uops are marked as 1st or last in sequence – exception/interrupt/trap boundary • 2 cycle retirement Pentium Pro Design Methodology - 1
Performance Model
Behavioral Model Microarchitecture 0.75 years 1 year Structural Model Start 1990 Conception Finish 4Q 1994 1 year Logic Design
Silicon Debug 1.75 years Circuit Design
Layout Pentium Pro Performance Analysis • Observability – On-chip event counters – Dynamic analysis
• Benchmark Suite – BAPco Sysmark32 - 32-bit Windows NT applications – Winstone97 - 32-bit Windows NT applications – Some SPEC95 benchmarks Performance – Run Times
UnhaltedUser-Mode User-Mode Processor Processor Cycles Cycles
1.2E+11
1E+11 Total of 27.5 billion cycles
8E+10
6E+10
4E+10
2E+10
0 li avs go Corel excel word Pwave MSVC paradox WordPro PowerPnt SYSexcel SYSflw96 compress PhotoShop PageMaker SYSword7SYSCorel6SYSexcel7 microstation SYSpowerpnt7SYSParadox7 SYSpagemaker6 Performance – IPC vs. uPC
1.6 InstructionsInstructions and and UOPs UopsRetired Per Cycleretired per cycle
1.4 2 uops/instruction 1.2 Inst retired
UOPS retired 1
0.8
0.6
# Retired0.4 Per Cycle
0.2
0
WS-AVS SPEC-Li SPEC-Go WS-Excel WS-Word SYS-ExcelSYS-Word SYS-Excel WS-PWave SYS-FlW96 WS-MSVC++ WS-Paradox WS-WordPro SYS-Paradox WS-PowerPnt WS-PhotoShop WS-CorelDraw WS-PageMaker SYS-PowerPnt SYS-CorelDraw SYS-PagemakerSPEC-Compress WS-Microstation Performance – IPC vs. uPC
100 Inst.Inst. or or UOPSUops retired retiredper cycle per cycle
90
80
70 AVS
60 PageMaker Paradox 50 PowerPoint
WordPro 40
30
Instructions were Retired 20
Percent10 of Cycles where n=0-3 UOPs or
0 3 UOPs 2 UOPs 1 UOP 0 UOPs 3 Inst. 2 Inst. 1 Inst. 0 Inst. Performance – Cache Misses
2.5 CacheCache Misses Misses Per CyclePer Cycle
2 IL1 - 0.51% IFU Misses / Cycle DL1 - 1.15% DCU Misses / Cycle L2 - 0.25% L2 Misses / Cycle 1.5
1
% Unhalted Cycles
0.5
0
WS-AVS SPEC-Li WS-Excel WS-Word SYS-ExcelSYS-Word SYS-Excel SPEC-Go WS-PWave SYS-FlW96 WS-MSVC++ WS-Paradox WS-WordPro SYS-Paradox WS-PowerPnt SYS-PowerPnt WS-PhotoShop WS-CorelDraw WS-PageMaker SYS-CorelDraw WS-Microstation SYS-Pagemaker SPEC-Compress Performance – Branch Prediction
Branch Mispredict Rate 16% Branch Mispredict Rate
14%
12% 6.8% avg
10%
8%
6%
4% % Branches Mispredicted
2%
0%
WS-AVS SPEC-Li WS-Excel WS-WordSYS-ExcelSYS-Word SYS-Excel SPEC-Go WS-PWave SYS-FlW96 WS-MSVC++ WS-Paradox WS-WordPro WS-PowerPnt SYS-Paradox WS-CorelDraw SYS-PowerPnt WS-PhotoShop WS-PageMaker SYS-CorelDraw WS-Microstation SYS-Pagemaker SPEC-Compress
BTB Miss Rate 45% BTB Miss Rate
40%
35%
30%
25%
20%
15%
10% % Branches that Miss in BTB
5%
0%
WS-AVS SPEC -Li W S-Exce l W S-Wo rd SPEC -Go SYS -ExcelSY S-W ord SYS-E xcel W S-PW ave SYS-F lW 96 WS -MS VC+ + W S-P aradox W S-W ordPro SYS-Paradox WS-PowerPnt SYS-PowerPnt WS-PhotoShopWS-CorelDrawWS-PageMaker SYS-CorelDraw WS-Microstation SYS-Pagemaker SPEC-Compress Conclusions IA-32 Compliant Performance (Frequency - IPC) 366.0 ISpec92 283.2 FSpec92 8.09 SPECint95 6.70 SPECfp95 Validation Die Size - Fabable Schedule - 1 year late Power -
Retrospective • Most commercially successful microarchitecture in history • Evolution – Pentium II/III, Xeon, etc. • Derivatives with on-chip L2, ISA extensions, etc. – Replaced by Pentium 4 as flagship in 2001 • High frequency, deep pipeline, extreme speculation – Resurfaced as Pentium M in 2003 • Initially a response to Transmeta in laptop market • Pentium 4 derivative (90nm Prescott) delayed, slow, hot – Core Duo, Core 2 Duo, Core i7 replaced Pentium 4
© Shen, Lipasti 29 Microarchitectural Updates • Pentium M (Banias), Core Duo (Yonah) – Micro-op fusion (also in AMD K7/K8) • Multiple uops in one: (add eax,[mem] => ld/alu), sta/std • These uops decode/dispatch/commit once, issue twice – Better branch prediction • Loop count predictor • Indirect branch predictor – Slightly deeper pipeline (12 stages) • Extra decode stage for micro-op fusion • Extra stage between issue and execute (for RS/PLRAM read) – Data-capture reservation station (payload RAM) • Clock gated for 32 (int) , 64 (fp), and 128 (SSE) operands
© Shen, Lipasti 30 Microarchitectural Updates • Core 2 Duo (Merom) – 64-bit ISA from AMD K8 – Macro-op fusion • Merge uops from two x86 ops • E.g. cmp, jne => cmpjne – 4-wide decoder (Complex + 3x Simple) • Peak x86 decode throughput is 5 due to macro-op fusion – Loop buffer • Loops that fit in 18-entry instruction queue avoid fetch/decode overhead – Even deeper pipeline (14 stages) – Larger reservation station (32), instruction window (96) – Memory dependence prediction
© Shen, Lipasti 31 Microarchitectural Updates • Nehalem (Core i7/i5/i3) – RS size 36, ROB 128 – Loop cache up to 28 uops – L2 branch predictor – L2 TLB – I$ and D$ now 32K, L2 back to 256K, inclusive L3 up to 8M – Simultaneous multithreading – RAS now renamed (repaired) – 6 issue, 48 load buffers, 32 store buffers – New system interface (QPI) – finally dropped front-side bus – Integrated memory controller (up to 3 channels) – New STTNI instructions for string/text handling
© Shen, Lipasti 32 Microarchitectural Updates • Sandybridge/Ivy Bridge (2nd-3rd generation Core i7) – On-chip integrated graphics (GPU) – Decoded uop cache up to 1.5K uops, handles loops, but more general – 54-entry RS, 168-entry ROB – Physical register file: 144 FP, 160 integer – 256-bit AVX units: 8 DPFLOP/cycle, 16 SPFLOP/cycle – 2 general AGUs enable 2 ld/cycle, 2 st/cycle or any combination, 2x128-bit load path from L1 D$
© Shen, Lipasti 33 Microarchitectural Updates • Haswell/Broadwell/Skylake: wider & deeper – 8-wide issue (up from 6 wide) – 4th integer ALU, third AGU, second branch unit – 60-entry RS, 192-entry ROB – 72-entry load queue/42-entry store queue – Physical register file: 168 FP, 168 integer – Doubled FP throughput (32 SP/16 DP) – Load/store bandwidth to L1 doubled (64B/32B) – TSX (transactional memory) – Integrated voltage regulator
© Shen, Lipasti 34