The Alpha 21264 Microprocessor: Out-of-Order Execution at 600 MHz
R. E. Kessler Compaq Computer Corporation Shrewsbury, MA
REK August 1998 1
Some Highlights
• Continued Alpha performance leadership
• 600 MHz operation in 0.35 µm CMOS-6, 6 metal layers, 2.2 V
• 15 million transistors, 3.1 cm² die, 587-pin PGA
• SPECint95 of 30+ and SPECfp95 of 50+
• Out-of-order and speculative execution
• 4-way integer issue
• 2-way floating-point issue
• Sophisticated tournament branch prediction
• High-bandwidth memory system (1+ GB/sec)
Alpha 21264: Block Diagram

[Block diagram. Pipeline stages 0-6: Fetch, Map, Queue, Register read, Execute, Dcache. The 64 KB two-way set-associative L1 instruction cache, with next-line and set prediction, supplies 4 instructions per cycle through the branch predictors to the integer and floating-point register maps. The integer side has a 20-entry issue queue, two 80-entry register files, and four execution units (two of which generate addresses); the floating-point side has a 15-entry issue queue, a 72-entry register file, an FP add pipe, an FP multiply pipe, and an FP divide/square-root unit. Up to 80 instructions are in flight, plus 32 loads and 32 stores. Loads and stores access the 64 KB two-way set-associative data cache with 44-bit physical addresses; miss-address logic and a victim buffer connect the caches to the bus interface, which drives a 128-bit L2 cache bus and a 64-bit system bus.]
21264 Instruction Fetch Bandwidth Enablers
• The 64 KB two-way set-associative instruction cache supplies four instructions every cycle
• The next-fetch and set predictors provide the fast cache access time of a direct-mapped cache and eliminate bubbles in non-sequential control flow
• The instruction fetcher speculates through up to 20 branch predictions to supply a continuous stream of instructions
• The tournament branch predictor dynamically selects between local and global history to minimize mispredictions
Instruction Stream Improvements
[Diagram: instruction fetch path. PC generation feeds the two-way set-associative Icache; each fetch reads both sets' tags along with the data, the next-fetch address, and the set prediction. The trained next-fetch/set predictor redirects fetch with a 0-cycle penalty on branches and learns dynamic jumps, giving set-associative behavior with direct-mapped timing; instruction decode and branch prediction verify it. Tag comparison signals hit/miss/set-miss while the predicted next address and set, plus four instructions per cycle, proceed down the pipeline.]
Fetch Stage: Branch Prediction

• Some branch directions can be predicted based on their past behavior: local correlation
• Others can be predicted based on how the program arrived at the branch: global correlation

Local correlation examples:
    if (a % 2 == 0)    pattern TNTN... (predictable from this branch's own history)
    if (b == 0)        pattern NNNN...

Global correlation example:
    if (a == 0)    (predict taken)
    if (b == 0)    (predict taken)
    if (a == b)    (predict taken: when both earlier branches were taken, a and b are both zero, so a == b must hold)
Tournament Branch Prediction
[Diagram: the program counter selects one of 1024 10-bit local histories (1024 × 10); each local history indexes the local prediction table (1024 × 3-bit counters). The global history of recent branch outcomes indexes the global prediction table (4096 × 2-bit counters) and the choice prediction table (4096 × 2-bit counters), which selects between the local and global predictions.]
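The structure above can be sketched in software. The table sizes come from the slide; the saturating-counter thresholds and the update policy are simplified assumptions, not a cycle-accurate model of the 21264:

```python
# Sketch of a tournament predictor with the slide's table sizes.
# Counter thresholds and update policy are simplifying assumptions.
class TournamentPredictor:
    def __init__(self):
        self.local_hist = [0] * 1024     # 1024 x 10-bit per-branch histories
        self.local_pred = [3] * 1024     # 1024 x 3-bit saturating counters
        self.global_pred = [1] * 4096    # 4096 x 2-bit saturating counters
        self.choice = [1] * 4096         # 4096 x 2-bit chooser counters
        self.ghist = 0                   # 12-bit global history

    def predict(self, pc):
        lh = self.local_hist[pc & 1023]
        local_taken = self.local_pred[lh] >= 4
        global_taken = self.global_pred[self.ghist] >= 2
        use_global = self.choice[self.ghist] >= 2
        return global_taken if use_global else local_taken

    def update(self, pc, taken):
        lh = self.local_hist[pc & 1023]
        local_ok = (self.local_pred[lh] >= 4) == taken
        global_ok = (self.global_pred[self.ghist] >= 2) == taken
        if global_ok != local_ok:        # train the chooser toward the winner
            c = self.choice[self.ghist] + (1 if global_ok else -1)
            self.choice[self.ghist] = max(0, min(3, c))
        d = 1 if taken else -1           # train both component predictors
        self.local_pred[lh] = max(0, min(7, self.local_pred[lh] + d))
        self.global_pred[self.ghist] = max(0, min(3, self.global_pred[self.ghist] + d))
        self.local_hist[pc & 1023] = ((lh << 1) | taken) & 1023
        self.ghist = ((self.ghist << 1) | taken) & 4095
```

An alternating branch (TNTN...) that defeats a plain 2-bit counter is captured by the local history, while path-correlated branches fall to the global side; the chooser learns which component to trust.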
Page 4 4 Alpha 21264: Block Diagram
FETCH MAP QUEUE REG EXEC DCACHE Stage: 0 1 2 3 4 5 6 Int Branch Int Reg Exec Predictors Reg Issue File Queue Addr Sys Bus Map (80) Exec (20) L1 Bus 64-bit Data Reg Exec Inter- Cache Bus 80 in-flight instructions File Cache plus 32 loads and 32 stores Addr face 64KB 128-bit (80) Exec Unit Next-Line 2-Set Address Phys Addr 4 Instructions / cycle L1 Ins. 44-bit Cache FP ADD FP Reg 64KB FP Div/Sqrt Issue File Victim 2-Set Reg Queue (72) FP MUL Buffer Map (15) Miss Address
REK August 1998 9
Mapper and Queue Stages
Mapper:
• Renames 4 instructions per cycle (8 source / 4 destination registers)
• 80 integer + 72 floating-point physical registers

Queue stage:
• Integer: 20 entries, quad-issue
• Floating-point: 15 entries, dual-issue
• Instructions issue out of order when their data is ready
• Prioritized from oldest to youngest each cycle
• Instructions leave the queues after they issue
• The queue collapses every cycle as needed
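As a rough illustration of the rename step, here is a minimal sketch of 4-wide renaming against a free list of physical registers. The register count (80 physical) comes from the slide; the FIFO free list and the omission of map checkpointing and register reclamation are simplifying assumptions:

```python
# Minimal 4-wide register-rename sketch. Free-list discipline and the
# lack of checkpointing/reclamation are simplifying assumptions.
NUM_ARCH = 32       # architectural integer registers (ignoring R31 == 0)
NUM_PHYS = 80       # integer physical registers (from the slide)

class Renamer:
    def __init__(self):
        self.map = list(range(NUM_ARCH))              # arch -> phys mapping
        self.free = list(range(NUM_ARCH, NUM_PHYS))   # unallocated phys regs

    def rename(self, group):
        """group: up to 4 (dest, src1, src2) architectural registers.
        Returns physically renamed tuples, honoring dependencies between
        instructions renamed in the same cycle."""
        out = []
        for dest, s1, s2 in group:
            p1, p2 = self.map[s1], self.map[s2]   # read source mappings first
            pd = self.free.pop(0)                 # allocate a fresh phys reg
            self.map[dest] = pd                   # later insts see the new name
            out.append((pd, p1, p2))
        return out

r = Renamer()
# addq r1, r2 -> r3 followed by a dependent addq r3, r3 -> r3:
print(r.rename([(3, 1, 2), (3, 3, 3)]))   # [(32, 1, 2), (33, 32, 32)]
```

Note how the second instruction's sources pick up the physical register just allocated for the first instruction's destination; that is what lets dependent instructions rename in the same cycle.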
Map and Queue Stages

[Diagram: in the mapper, CAMs (80 entries, one per physical register) rename the register numbers of 4 instructions per cycle, backed by saved map state for recovery and a register scoreboard. Renamed instructions enter the 20-entry issue queue, where entries send issue requests to an arbiter and leave the queue when granted.]
Alpha 21264: Block Diagram
Register and Execute Stages

[Diagram: the integer execution units are split into two clusters, each with its own copy of the register file. In each cluster the upper pipe has an add/logic unit plus a shift/branch unit (with the motion-video unit in cluster 0 and the multiplier in cluster 1), and the lower pipe has an add/logic unit that also handles memory addresses. The floating-point execution units share one register file and comprise an FP multiply pipe, an FP add pipe, and an FP divide/SQRT unit.]
Integer Cross-Cluster Instruction Scheduling and Execution
[Diagram: the two integer clusters, as on the previous slide.]

• Instructions are statically pre-slotted to the upper or lower execution pipes
• The issue queue dynamically selects between the left and right clusters
• This has most of the performance of 4-way issue with the simplicity of 2-way issue
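The split between static slotting and dynamic cluster selection can be sketched as follows. The instruction classes and the alternating left/right pick are illustrative stand-ins, not the 21264's actual slotting rules or arbiter:

```python
# Illustrative two-level scheduling: a static upper/lower pipe choice
# made before the queue, and a dynamic left/right cluster choice made
# at issue. Classes and the alternating pick are stand-ins.
import itertools

UPPER_ONLY = {"mul", "mvi"}        # e.g. multiply, motion video
LOWER_ONLY = {"load", "store"}     # memory ops use the lower (address) pipes

def slot(op):
    """Static pre-slotting: pick the upper or lower execution pipe."""
    if op in UPPER_ONLY:
        return "upper"
    if op in LOWER_ONLY:
        return "lower"
    return "either"                # add/logic can use either pipe

def issue(ready_ops):
    """Dynamic part: assign each slotted op a cluster; alternating
    left/right stands in for the real arbitration."""
    clusters = itertools.cycle(["left", "right"])
    return [(op, slot(op), next(clusters)) for op in ready_ops]

print(issue(["mul", "load", "add"]))
# [('mul', 'upper', 'left'), ('load', 'lower', 'right'), ('add', 'either', 'left')]
```

The design point on the slide is exactly this factoring: the expensive decision (which pipe) is made statically, so the issue queue only arbitrates the cheap one (which cluster).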
Alpha 21264: Block Diagram
21264 On-Chip Memory System Features
• Two loads/stores per cycle, in any combination
• 64 KB two-way set-associative L1 data cache (9.6 GB/sec)
  – Phase-pipelined at > 1 GHz (no bank conflicts!)
  – 3-cycle latency (load issue to issue of the consumer) with hit prediction
• Out-of-order and speculative execution
  – Minimizes effective memory latency
• 32 outstanding loads and 32 outstanding stores
  – Maximizes memory-system parallelism
• Speculative stores forward data to subsequent loads
Memory System (Continued)

[Diagram: the memory units in clusters 0 and 1 share an address path and 64-bit data busses to the double-pumped data cache and the bus interface; speculative store data bypasses to loads, while 128-bit fill data and 128-bit victim data move between the cache, the system port, and the L2 port.]

• New references check addresses against existing references
• Speculative store data whose address matches bypasses to loads
• Multiple misses to the same cache block merge
• Hazard detection logic manages out-of-order references
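The store-to-load bypass above can be sketched as a simple store queue. This is an illustrative model assuming exact, full-width address matches; the real hardware also handles partial overlaps and replays mis-speculated references:

```python
# Sketch of speculative store-to-load forwarding: a younger load whose
# address matches an older in-flight store takes its data from the
# store queue instead of the cache. Exact-match addresses assumed.
class StoreQueue:
    def __init__(self):
        self.entries = []                # (address, data), oldest first

    def store(self, addr, data):
        self.entries.append((addr, data))

    def load(self, addr, cache):
        for a, d in reversed(self.entries):   # youngest matching store wins
            if a == addr:
                return d                       # bypassed store data
        return cache.get(addr, 0)              # no match: read the cache

sq = StoreQueue()
cache = {0x100: 7}
sq.store(0x100, 42)                # speculative store, not yet retired
print(sq.load(0x100, cache))       # 42: forwarded from the store queue
print(sq.load(0x108, cache))       # 0: falls through to the cache
```

Searching from youngest to oldest matters when several in-flight stores target the same address: the load must see the most recent older store.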
Low-latency Speculative Issue of Integer Load Data Consumers (Predict Hit)
[Pipeline diagram: LDQ R0, 0(R0) issues and proceeds through queue, register-read, execute, and dcache stages, with the cache lookup in the dcache stage. The dependent ADDQ R0, R0, R1 issues 3 cycles after the LDQ, followed by OP R2, R2, R3; on a miss both are squashed and the ops restart two cycles later.]

When predicting a load hit:
• The ADDQ issues (speculatively) after 3 cycles
• Best performance if the load actually hits (matching the prediction)
• The ADDQ issues before the load hit/miss calculation is known

If the LDQ misses when predicted to hit:
• Squash two cycles (replay the ADDQ and its consumers)
• Force a "mini-replay" (direct from the issue queue)
Low-latency Speculative Issue of Integer Load Data Consumers (Predict Miss)

[Pipeline diagram: LDQ R0, 0(R0) issues, two independent ops (OP1 R2, R2, R3 and OP2 R4, R4, R5) issue in the gap, and the dependent ADDQ R0, R0, R1 issues at the earliest 5 cycles after the LDQ.]

When predicting a load miss:
• The minimum load latency is 5 cycles (more on an actual miss)
• There are no squashes
• Best performance if the load actually misses (as predicted)

The hit/miss predictor:
• MSB of a 4-bit counter (hits increment by 1, misses decrement by 2)
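The hit/miss predictor is simple enough to state directly in code. This sketch follows the counter rule on the slide; the initial counter value is an assumption:

```python
# The slide's hit/miss predictor: the MSB of a 4-bit saturating
# counter, +1 on a hit, -2 on a miss, so misses flip the prediction
# twice as fast as hits flip it back. Initial value is an assumption.
class HitMissPredictor:
    def __init__(self):
        self.counter = 15            # assume we start saturated at "hit"

    def predict_hit(self):
        return self.counter >= 8     # MSB of the 4-bit counter

    def update(self, hit):
        if hit:
            self.counter = min(15, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 2)

p = HitMissPredictor()
for _ in range(4):
    p.update(False)                  # four misses: 15 -> 13 -> 11 -> 9 -> 7
print(p.predict_hit())               # False: consumers now wait 5 cycles
```

The asymmetric update biases the predictor toward "miss" after a few misses, trading some latency on later hits for avoiding repeated two-cycle squashes.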
Dynamic Hazard Avoidance (Before Marking)
Program order (assume R28 == R29):
    LDQ R0, 64(R28)
    STQ R0, 0(R28)     ← store followed by a load to the
    LDQ R1, 0(R29)        same memory address!

First execution order:
    LDQ R0, 64(R28)
    LDQ R1, 0(R29)     ← this (re-ordered) load got the wrong data value!
    …
    STQ R0, 0(R28)
Dynamic Hazard Avoidance (After Marking/Training)

Program order (assume R28 == R29):
    LDQ R0, 64(R28)
    STQ R0, 0(R28)     ← store followed by a load to the same memory address;
    LDQ R1, 0(R29)        this load is marked this time

Subsequent executions (after marking):
    LDQ R0, 64(R28)
    …
    STQ R0, 0(R28)
    LDQ R1, 0(R29)     ← the marked (delayed) load gets the bypassed store data!
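A minimal sketch of this marking/training mechanism, assuming a PC-indexed table of single "wait" bits; the table size, the indexing, and the absence of periodic clearing are illustrative assumptions, not the 21264's actual structure:

```python
# Sketch of the marking/training above: a load that once received
# wrong data by issuing ahead of an older conflicting store is marked,
# and on later executions it waits for prior stores to resolve.
# PC-indexed table of single wait bits is an illustrative assumption.
TABLE_SIZE = 1024

class StoreWaitTable:
    def __init__(self):
        self.marked = [False] * TABLE_SIZE

    def must_wait(self, load_pc):
        return self.marked[load_pc % TABLE_SIZE]

    def train(self, load_pc):
        # Called when a load is caught bypassing a conflicting store.
        self.marked[load_pc % TABLE_SIZE] = True

t = StoreWaitTable()
print(t.must_wait(0x1234))   # False: first execution issues the load early
t.train(0x1234)              # wrong data detected, so mark this load
print(t.must_wait(0x1234))   # True: subsequent executions delay the load
```

The payoff is the slide's second execution order: the marked load waits just long enough to pick up the bypassed store data instead of replaying.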
New Memory Prefetches
• Software-directed prefetches
  – Prefetch
  – Prefetch with modify intent
  – Prefetch, evict next
• Evict block (eject data from the cache)
• Write hint
  – Allocates the block with no data read
  – Useful for full cache-block writes
21264 Off-Chip Memory System Features

• 8 outstanding block fills + 8 victims
• Split L2 cache and system busses (back-side cache)
• High-speed (high-bandwidth) point-to-point channels
  – Clock-forwarding technology, low pin counts
• L2 hit load latency (load issue to consumer issue) = 6 cycles + SRAM latency
• Max L2 cache bandwidth of 16 bytes per 1.5 cycles
  – 6.4 GB/sec with a 400 MHz transfer rate
• L2 miss load latency can be 160 ns (60 ns DRAM)
• Max system bandwidth of 8 bytes per 1.5 cycles
  – 3.2 GB/sec with a 400 MHz transfer rate
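The quoted bandwidths follow directly from the transfer widths and rates; as a quick check, the 600 MHz CPU clock divided by the 1.5-cycle transfer interval gives the 400 MHz rate:

```python
# Checking the slide's bandwidth figures: bytes per transfer times
# the transfer rate.
cpu_clock = 600e6                 # 21264 clock, Hz
rate = cpu_clock / 1.5            # one transfer per 1.5 cycles -> 400 MHz
l2_bw = 16 * rate / 1e9           # 16-byte L2 transfers -> 6.4 GB/sec
sys_bw = 8 * rate / 1e9           # 8-byte system transfers -> 3.2 GB/sec
```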
21264 Pin Bus
[Diagram: the Alpha 21264 has separate address-out and address-in channels to the system chip-set (DRAM and I/O) with a 64-bit system data bus, plus an independent L2 cache port with its own address, tag connections to the tag RAMs, and 128-bit data connections to the L2 data RAMs.]

Independent data busses allow simultaneous data transfers.

L2 cache: 0-16 MB, synchronous, direct-mapped. Peak-bandwidth examples:
  Example 1: Reg-Reg 133 MHz burst RAM     2.1 GB/sec
  Example 2: 'RISC' 200+ MHz late write    3.2+ GB/sec
  Example 3: Dual data 400+ MHz            6.4+ GB/sec
Compaq Alpha / AMD Shared System Pin Bus (External Interface)

System Pin Bus ("Slot A")

[Diagram: the AMD K7, with its cache, connects to the same system chip-set (DRAM and I/O) through address-out and address-in channels and a 64-bit system data bus.]

• High-performance system pin bus
• Shared system chip-set designs
• This is a win-win!
Dual 21264 System Example
[Diagram: two 21264s, each with its own L2 cache, connect over 64-bit busses to 8 data switches; the switches connect over 256-bit paths to two 100 MHz SDRAM memory banks. A control chip manages the memory banks, and two PCI chips provide two 64-bit PCI busses.]
Measured Performance: SPEC95

SPECint95:
  21264 - 600           30
  21164 - 600           18.8
  PA8200 - 240          17.4
  Pentium II - 400      16.5
  UltraSparc II - 360   16.1
  R10000 - 250          14.7
  PowerPC 604e - 332    14.4

SPECfp95:
  21264 - 600           50
  21164 - 600           29.2
  PA8200 - 240          28.5
  POWER2 - 160          26.6
  R10000 - 250          24.5
  UltraSparc II - 360   23.5
  Pentium II - 400      13.7
  PowerPC 604e - 332    12.6
Measured Performance: STREAMS
Max STREAMS bandwidth (MB/sec):
  21264 - 600                     1000
  RS6000 - 397                     883
  UltraSparc II - UE6000 (ass.)    367
  R10000 - 250 - O2000             358
  21164 - 533 - 164SX              292
  Pentium II - 300 - Dell          213
Alpha 21264

[Die photo: the floorplan shows the bus interface unit, instruction cache, and data cache along the edges, with the integer mapper and queue, the two integer execution units (left and right), the floating-point mapper and queue, the floating-point unit, the memory controller, and the data and control busses in the core.]
Summary
• The 21264 will maintain Alpha's performance lead
  – 30+ SPECint95 and 50+ SPECfp95
  – 1+ GB/sec memory bandwidth

• The 21264 proves that high frequency and sophisticated architectural features can coexist
  – High-bandwidth speculative instruction fetch
  – Out-of-order execution
  – 6-way instruction issue
  – Highly parallel out-of-order memory system