The Alpha 21264: Out-of-Order Execution at 600 MHz

R. E. Kessler, COMPAQ Computer Corporation, Shrewsbury, MA

REK, August 1998

Some Highlights

• Continued Alpha performance leadership
  - 600 MHz operation in 0.35 µm CMOS6, 6 metal layers, 2.2 V
  - 15 million transistors, 3.1 cm² die, 587-pin PGA
  - SPECint95 of 30+ and SPECfp95 of 50+
• Out-of-order and speculative execution
  - 4-way integer issue
  - 2-way floating-point issue
• Sophisticated tournament branch prediction
• High-bandwidth memory system (1+ GB/sec)

Alpha 21264: Block Diagram

[Block diagram: pipeline stages 0-6 (Fetch, Map, Queue, Reg, Exec, Dcache). 64 KB 2-set L1 instruction cache with next-line and set prediction, delivering 4 instructions per cycle. Integer side: register map, 20-entry issue queue, 80-entry register files, execution units. Floating-point side: register map, 15-entry issue queue, 72-entry register file, FP Add, FP Mul, FP Div/Sqrt units. 64 KB 2-set L1 data cache with victim buffer and miss-address logic, 44-bit physical addresses. Bus interface with 64-bit system data bus and 128-bit L2 cache bus. Up to 80 in-flight instructions plus 32 loads and 32 stores.]


21264 Instruction Fetch Bandwidth Enablers

• The 64 KB two-way associative instruction cache supplies four instructions every cycle

• The next-fetch and set predictors provide the fast cache access time of a direct-mapped cache and eliminate bubbles in non-sequential control flow

• The instruction fetcher speculates through up to 20 branch predictions to supply a continuous stream of instructions

• The tournament branch predictor dynamically selects between local and global history to minimize mispredicts

Instruction Stream Improvements

[Figure: next-line and set prediction in the instruction cache. Each Icache line carries a predicted next-line address and set alongside its tags and data, so branches incur a 0-cycle penalty. Tag comparators verify hit/miss/set after instruction decode and branch prediction; on a mispredict, the corrected line address and set are supplied. The predictor learns dynamic jumps and gives set-associative behavior with direct-mapped access time; four instructions are delivered per fetch.]

Fetch Stage: Branch Prediction

• Some branch directions can be predicted based on the branch's own past behavior (local correlation):

  if (a%2 == 0)   history: TNTN...
  if (b == 0)     history: NNNN...

• Others can be predicted based on how the program arrived at the branch (global correlation):

  if (a == 0)   (predict taken)
  if (b == 0)   (predict taken)
  if (a == b)   (predict taken)

  If the first two branches were taken, a and b are both zero, so the third branch must also be taken.

Tournament Branch Prediction

[Figure: tournament predictor structure. The program counter indexes a local history table (1024 x 10 bits); each 10-bit local history then indexes a local prediction table (1024 x 3-bit counters). The path (global) history indexes a global prediction table (4096 x 2-bit counters) and a choice prediction table (4096 x 2-bit counters) that selects between the local and global predictions to produce the final prediction.]
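The selection scheme above can be sketched in software. The table sizes follow the slide; the class name, indexing, and exact update policy are illustrative assumptions, not the 21264's documented hardware:

```python
# Sketch of a 21264-style tournament predictor (sizes from the slide;
# indexing and training details are illustrative assumptions).

class TournamentPredictor:
    def __init__(self):
        self.local_history = [0] * 1024   # per-branch 10-bit histories (1024x10)
        self.local_pred = [3] * 1024      # 3-bit counters indexed by local history (1024x3)
        self.global_pred = [1] * 4096     # 2-bit counters indexed by path history (4096x2)
        self.choice_pred = [1] * 4096     # 2-bit counters choosing local vs global (4096x2)
        self.path_history = 0             # outcomes of the last 12 branches

    def predict(self, pc):
        lh = self.local_history[pc % 1024]
        gi = self.path_history & 0xFFF
        local_taken = self.local_pred[lh] >= 4     # MSB of the 3-bit counter
        global_taken = self.global_pred[gi] >= 2   # MSB of the 2-bit counter
        use_global = self.choice_pred[gi] >= 2
        return global_taken if use_global else local_taken

    def update(self, pc, taken):
        li = pc % 1024
        lh = self.local_history[li]
        gi = self.path_history & 0xFFF
        local_ok = (self.local_pred[lh] >= 4) == taken
        global_ok = (self.global_pred[gi] >= 2) == taken
        if local_ok != global_ok:          # train the chooser toward the winner
            step = 1 if global_ok else -1
            self.choice_pred[gi] = min(3, max(0, self.choice_pred[gi] + step))
        step = 1 if taken else -1
        self.local_pred[lh] = min(7, max(0, self.local_pred[lh] + step))
        self.global_pred[gi] = min(3, max(0, self.global_pred[gi] + step))
        self.local_history[li] = ((lh << 1) | int(taken)) & 0x3FF
        self.path_history = ((self.path_history << 1) | int(taken)) & 0xFFF
```

The chooser is trained only when the two component predictors disagree, which is what lets the tournament track whichever correlation (local or global) a given branch actually exhibits.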

Alpha 21264: Block Diagram

Mapper and Queue Stages

• Mapper:
  - Renames 4 instructions per cycle (8 source / 4 destination registers)
  - 80 integer + 72 floating-point physical registers

• Queue stage:
  - Integer: 20 entries, quad-issue
  - Floating-point: 15 entries, dual-issue
  - Instructions issue out of order when their data are ready
  - Prioritized from oldest to youngest each cycle
  - Instructions leave the queues after they issue
  - The queue collapses every cycle as needed
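A minimal software sketch of the rename step: each destination architectural register is mapped to a fresh physical register from a free list, while sources read the current mapping. The class and free-list structure are illustrative; the 21264 itself uses CAMs with saved map state:

```python
# Illustrative sketch of register renaming in the map stage (not the
# 21264's CAM-based implementation).

class RenameMap:
    def __init__(self, n_arch=32, n_phys=80):
        self.map = list(range(n_arch))            # arch reg -> phys reg
        self.free = list(range(n_arch, n_phys))   # unallocated phys regs

    def rename(self, srcs, dest):
        """Rename one instruction: read sources, allocate a new dest."""
        phys_srcs = [self.map[s] for s in srcs]   # sources read current mapping
        old = self.map[dest]                      # freed when the inst retires
        self.map[dest] = self.free.pop(0)         # dest gets a fresh phys reg
        return phys_srcs, self.map[dest], old
```

With 80 integer physical registers against 32 architectural names, up to 48 renamed results can be in flight at once, which is what supports the deep speculation window.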

Map and Queue Stages

[Figure: map and queue stages. The mapper accepts 4 instructions per cycle and renames register numbers through the MAP CAMs (80 entries), with saved map state for misprediction recovery. Renamed instructions enter the 20-entry issue queue, where a register scoreboard tracks operand readiness and an arbiter grants issue requests.]

Alpha 21264: Block Diagram

Register and Execute Stages

[Figure: floating-point and integer execution units. Integer: two clusters (0 and 1), each with its own copy of the register file; each cluster has an upper pipe (add/logic and shift/branch, plus motion video in cluster 0 and multiply in cluster 1) and a lower pipe (add/logic/memory). Floating point: a single register file feeding FP Add, FP Mul, and FP Div/Sqrt pipes.]

Integer Cross-Cluster Instruction Scheduling and Execution

[Figure: the two-cluster integer execution units, as on the previous slide.]

• Instructions are statically pre-slotted to the upper or lower execution pipes
• The issue queue dynamically selects between the left and right clusters
• This gives most of the performance of 4-way issue with the simplicity of 2-way issue

Alpha 21264: Block Diagram

21264 On-Chip Memory System Features

• Two loads/stores per cycle, in any combination
• 64 KB two-way associative L1 data cache (9.6 GB/sec)
  - Phase-pipelined at > 1 GHz (no bank conflicts!)
  - 3-cycle latency (load issue to consumer issue) with hit prediction
• Out-of-order and speculative execution minimizes effective memory latency
• 32 outstanding loads and 32 outstanding stores maximize memory-system parallelism
• Speculative stores forward data to subsequent loads

Memory System (Contd.)

[Figure: memory-system data and address paths. Each cluster has a memory unit with 64-bit data paths; the double-pumped GHz data cache connects over 128-bit paths to the bus interface, which drives the L2 cache (128 bits at 1.5x) and the system (64 bits at 1.5x). Speculative store data is held for bypassing to loads.]

• New references check their addresses against existing references
• Speculative store data that address-matches bypasses to loads
• Multiple misses to the same cache block merge
• Hazard detection logic manages out-of-order references
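The miss-merging behavior can be sketched as a table of outstanding fills keyed by cache-block address; a load that misses to a block already being filled simply joins that fill. The class and return values are illustrative; the 8-entry limit echoes the "8 outstanding block fills" figure from the off-chip slide:

```python
# Illustrative sketch of miss merging: loads that miss to the same
# 64-byte block share one outstanding fill instead of issuing
# duplicate off-chip requests.

BLOCK = 64  # cache block size in bytes

class MissAddressFile:
    def __init__(self, max_misses=8):
        self.pending = {}            # block address -> waiting load ids
        self.max_misses = max_misses

    def load_miss(self, addr, load_id):
        block = addr - (addr % BLOCK)
        if block in self.pending:
            self.pending[block].append(load_id)   # merge with in-flight fill
            return "merged"
        if len(self.pending) >= self.max_misses:
            return "stall"                        # no free miss entry
        self.pending[block] = [load_id]
        return "new fill"

    def fill_done(self, block):
        return self.pending.pop(block, [])        # wake all merged loads
```

Merging is what lets 32 outstanding loads share a much smaller number of off-chip fill requests.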

Low-latency Speculative Issue of Integer Load Data Consumers (Predict Hit)

[Timing diagram: LDQ R0, 0(R0) flows through Q R E D stages, with the cache lookup in D. A dependent ADDQ R0, R0, R1 issues 3 cycles after the load, followed by OP R2, R2, R3. On a miss, both the ADDQ and the following op are squashed and restart two cycles later.]

• When predicting a load hit:
  - The ADDQ issues (speculatively) after 3 cycles
  - Best performance if the load actually hits (matching the prediction)
  - The ADDQ issues before the load's hit/miss outcome is known
• If the LDQ misses when predicted to hit:
  - Squash two cycles (replay the ADDQ and its consumers)
  - Force a "mini-replay" (directly from the issue queue)

Low-latency Speculative Issue of Integer Load Data Consumers (Predict Miss)

[Timing diagram: LDQ R0, 0(R0) issues, followed by independent OP1 R2, R2, R3 and OP2 R4, R4, R5; the dependent ADDQ R0, R0, R1 can issue at the earliest 5 cycles after the load.]

• When predicting a load miss:
  - The minimum load latency is 5 cycles (more on an actual miss)
  - There are no squashes
  - Best performance if the load actually misses (as predicted)
• The hit/miss predictor:
  - MSB of a 4-bit counter (hits increment by 1, misses decrement by 2)
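The hit/miss predictor described above is small enough to sketch exactly as stated: a 4-bit saturating counter whose MSB gives the prediction, with hits adding 1 and misses subtracting 2 (the class name and the saturated initial value are illustrative):

```python
# Sketch of the load hit/miss predictor: a 4-bit saturating counter,
# MSB predicts hit; hits increment by 1, misses decrement by 2.

class HitMissPredictor:
    def __init__(self):
        self.counter = 15            # start saturated toward "predict hit"

    def predict_hit(self):
        return self.counter >= 8     # MSB of the 4-bit counter

    def update(self, was_hit):
        if was_hit:
            self.counter = min(15, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 2)
```

The asymmetric decrement makes the predictor flip to "predict miss" quickly (squashes are expensive) while requiring a sustained run of hits before it resumes the aggressive 3-cycle schedule.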

Dynamic Hazard Avoidance (Before Marking)

Program order (assume R28 == R29):

  LDQ R0, 64(R28)
  STQ R0, 0(R28)    ; store followed by a load to
  LDQ R1, 0(R29)    ; the same memory address!

First execution order:

  LDQ R0, 64(R28)
  LDQ R1, 0(R29)    ; this (re-ordered) load got the wrong data value!
  ...
  STQ R0, 0(R28)

Dynamic Hazard Avoidance (After Marking/Training)

Program order (assume R28 == R29):

  LDQ R0, 64(R28)
  STQ R0, 0(R28)    ; store followed by a load to the same
  LDQ R1, 0(R29)    ; memory address; this load is marked this time

Subsequent executions (after marking):

  LDQ R0, 64(R28)
  ...
  STQ R0, 0(R28)
  LDQ R1, 0(R29)    ; the marked (delayed) load gets the bypassed store data!

New Memory Prefetches

• Software-directed prefetches:
  - Prefetch
  - Prefetch with modify intent
  - Prefetch, evict next
  - Evict block (eject data from the cache)
  - Write hint: allocates the block with no data read; useful for full-cache-block writes

21264 Off-Chip Memory System Features

• 8 outstanding block fills + 8 victims
• Split L2 cache and system buses (back-side cache)
• High-speed, high-bandwidth point-to-point channels
  - Clock-forwarding technology, low pin counts
• L2 hit load latency (load issue to consumer issue) = 6 cycles + SRAM latency
• Max L2 cache bandwidth of 16 bytes per 1.5 cycles
  - 6.4 GB/sec with a 400 MHz transfer rate
• L2 miss load latency can be 160 ns (60 ns DRAM)
• Max system bandwidth of 8 bytes per 1.5 cycles
  - 3.2 GB/sec with a 400 MHz transfer rate
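The peak-bandwidth figures above follow directly from the transfer widths and the 1.5-cycle transfer period at a 600 MHz core clock, as this worked check shows:

```python
# Worked check of the off-chip peak-bandwidth figures: one 16-byte L2
# transfer (or 8-byte system transfer) every 1.5 CPU cycles at 600 MHz.

core_hz = 600e6
transfer_rate = core_hz / 1.5        # 400 MHz transfer rate

l2_bw = transfer_rate * 16           # bytes/sec on the 128-bit L2 bus
sys_bw = transfer_rate * 8           # bytes/sec on the 64-bit system bus

print(l2_bw / 1e9)                   # 6.4 GB/sec peak L2 bandwidth
print(sys_bw / 1e9)                  # 3.2 GB/sec peak system bandwidth
```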

21264 Pin Bus

[Figure: system pin bus and L2 cache port. The 21264 has separate address-out and address-in connections plus a 64-bit data bus to the system chipset (DRAM and I/O), and a private L2 cache port with tag/address lines to the tag and data RAMs over a 128-bit data bus.]

Independent data buses allow simultaneous data transfers.

L2 cache: 0-16 MB, synchronous, direct-mapped. Peak bandwidth examples:

  Example 1: Reg-Reg 133 MHz BurstRAM: 2.1 GB/sec
  Example 2: 'RISC' 200+ MHz Late Write: 3.2+ GB/sec
  Example 3: Dual Data 400+ MHz: 6.4+ GB/sec

COMPAQ Alpha / AMD Shared System Pin Bus (External Interface)

System Pin Bus (“Slot A”)

[Figure: the AMD K7 and its cache attach to the same "Slot A" system pin bus: address out, address in, and 64-bit system data to the system chipset (DRAM and I/O).]

• High-performance system pin bus
• Shared system chipset designs
• This is a win-win!

Dual 21264 System Example

[Figure: dual-21264 system. Two 21264s, each with its own L2 cache, connect over 64-bit links to a control chip and 8 data switches; the switches reach two 100 MHz SDRAM memory banks over 256-bit paths, and two PCI chips each provide a 64-bit PCI bus.]

Measured Performance: SPEC95

SPECint95:

  21264 - 600          30
  21164 - 600          18.8
  PA8200 - 240         17.4
  Pentium II - 400     16.5
  UltraSparc II - 360  16.1
  R10000 - 250         14.7
  PowerPC 604e - 332   14.4

SPECfp95:

  21264 - 600          50
  21164 - 600          29.2
  PA8200 - 240         28.5
  POWER2 - 160         26.6
  R10000 - 250         24.5
  UltraSparc II - 360  23.5
  Pentium II - 400     13.7
  PowerPC 604e - 332   12.6

Measured Performance: STREAMS

Max STREAM bandwidth (MB/sec):

  21264 - 600 (ass.)                   1000
  RS6000-397                            883
  UltraSparc II - UE6000 (ass.)         367
  R10000 - 250 - O2000                  358
  21164 - 533 - 164SX                   292
  Pentium II - 300 - Dell               213

Alpha 21264

[Die photo/floorplan: bus interface unit, floating-point mapper and queue, integer mapper and queue, integer units (left and right), floating-point unit, memory controllers, instruction cache, data cache, datapaths, and instruction/data and control buses.]

Summary

• The 21264 will maintain Alpha's performance lead
  - 30+ SPECint95 and 50+ SPECfp95
  - 1+ GB/sec memory bandwidth

• The 21264 proves that high frequency and sophisticated architectural features can coexist
  - High-bandwidth speculative instruction fetch
  - Out-of-order execution
  - 6-way instruction issue
  - Highly parallel out-of-order memory system
