The Alpha 21264: Out-of-Order Execution at 600 MHz

R. E. Kessler, Compaq Computer Corporation, Shrewsbury, MA

REK August 1998 1

Some Highlights

• Continued Alpha performance leadership
  • 600 MHz operation in 0.35 µm CMOS6, 6 metal layers, 2.2 V
  • 15 million transistors, 3.1 cm², 587-pin PGA
  • SPECint95 of 30+ and SPECfp95 of 50+
• Out-of-order and speculative execution
  • 4-way integer issue
  • 2-way floating-point issue
• Sophisticated tournament branch prediction
• High-bandwidth memory system (1+ GB/sec)

Alpha 21264: Block Diagram

[Figure: Alpha 21264 block diagram. Pipeline stages 0-6: Fetch, Map, Queue, Reg, Exec, Dcache. A 64 KB 2-set L1 instruction cache with next-line and set prediction feeds 4 instructions per cycle to the integer and floating-point register mappers. Integer side: 20-entry issue queue, two 80-entry register files, four execution units. Floating-point side: 15-entry issue queue, 72-entry register file, FP add, FP multiply, and FP divide/square-root units. Up to 80 in-flight instructions, plus 32 loads and 32 stores. A 64 KB 2-set L1 data cache (64-bit busses) and a bus interface unit with victim buffer, miss-address file, and 44-bit physical addresses connect to the 128-bit L2 cache port and the system bus.]

21264 Instruction Fetch Bandwidth Enablers

• The 64 KB two-way set-associative instruction cache supplies four instructions every cycle

• The next-fetch and set predictors provide the fast cache access time of a direct-mapped cache and eliminate bubbles in non-sequential control flow

• The instruction fetcher speculates through up to 20 branch predictions to supply a continuous stream of instructions

• The tournament branch predictor dynamically selects between local and global history to minimize mispredicts
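The next-fetch and set prediction idea above can be sketched in a few lines. This is a hedged toy model: the 16-byte fetch granularity, the sequential default guess, and the train-on-mispredict policy are assumptions for illustration, not the 21264's exact hardware.

```python
# Toy model of next-line/set (way) prediction: each fetch block remembers
# where the *next* fetch went last time (address + cache way), so the next
# access can start immediately, without waiting for branch decode or for
# the tag compares that determine the way.

next_fetch = {}   # fetch address -> (predicted next address, predicted way)

def predict_next(addr):
    # Default guess: sequential fetch from way 0 (granularity is illustrative).
    return next_fetch.get(addr, (addr + 16, 0))

def train(addr, actual_next, actual_way):
    # On a mispredict, remember the actual outcome for next time.
    next_fetch[addr] = (actual_next, actual_way)

# A taken branch at 0x1000 jumping to 0x2000 (found in way 1):
first_guess = predict_next(0x1000)   # sequential guess before training
train(0x1000, 0x2000, 1)             # after training, the jump is predicted
```

After training, a repeat fetch of the same block starts the access at the branch target in the correct way with no bubble, which is the "0-cycle penalty on branches" behavior the slide claims.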

Instruction Stream Improvements

[Figure: instruction fetch path. The PC generation logic gives a 0-cycle penalty on predicted branches and learns dynamic jump targets; instruction decode and branch prediction check the speculative stream. Each Icache line carries a next-fetch address and set prediction, providing set-associative behavior: the predicted set (S0/S1) is read while the tag compares determine hit/miss/set, and the next address + set steer the following fetch of 4 instructions.]

Fetch Stage: Branch Prediction

• Some branch directions can be predicted based on their past behavior: Local Correlation

    if (a % 2 == 0)   T N T N …  (predict taken)
    if (b == 0)       N N N N …

• Others can be predicted based on how the program arrived at the branch: Global Correlation

    if (a == 0)   (predict taken)
    if (b == 0)   (predict taken)
    if (a == b)   (predict taken)

Tournament Branch Prediction

[Figure: tournament branch predictor. The program counter indexes a local history table (1024 x 10 bits); each local history indexes a local prediction table (1024 x 3-bit counters). The global history of recent branch outcomes indexes a global prediction table (4096 x 2-bit counters) and a choice prediction table (4096 x 2-bit counters) that selects between the local and global predictions.]
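The tournament scheme can be sketched as follows. The table geometry here is a toy (4-bit histories, dictionary-backed tables) rather than the real 1024x10 / 1024x3 / 4096x2 / 4096x2 arrangement, and the indexing is simplified, but the structure — a chooser trained only when the two component predictors disagree — matches the slide.

```python
# Toy tournament predictor: a per-branch local component, a shared global
# component, and a choice table that learns which component to trust.

H_BITS = 4
MASK = (1 << H_BITS) - 1

local_hist = {}       # branch PC -> local history shift register
local_ctr = {}        # (PC, local history) -> 2-bit saturating counter
global_hist = 0       # shared history of all recent branch outcomes
global_ctr = {}       # global history -> 2-bit saturating counter
choice_ctr = {}       # global history -> 2-bit "use global?" counter

def _sat(c, up):
    return min(3, c + 1) if up else max(0, c - 1)

def predict(pc):
    lh = local_hist.get(pc, 0)
    lp = local_ctr.get((pc, lh), 1) >= 2
    gp = global_ctr.get(global_hist, 1) >= 2
    use_global = choice_ctr.get(global_hist, 1) >= 2
    return gp if use_global else lp

def update(pc, taken):
    global global_hist
    lh = local_hist.get(pc, 0)
    lp = local_ctr.get((pc, lh), 1) >= 2
    gp = global_ctr.get(global_hist, 1) >= 2
    if lp != gp:   # train the chooser only when the components disagree
        choice_ctr[global_hist] = _sat(choice_ctr.get(global_hist, 1), gp == taken)
    local_ctr[(pc, lh)] = _sat(local_ctr.get((pc, lh), 1), taken)
    global_ctr[global_hist] = _sat(global_ctr.get(global_hist, 1), taken)
    local_hist[pc] = ((lh << 1) | taken) & MASK
    global_hist = ((global_hist << 1) | taken) & MASK
```

Running this on the alternating T/N branch from the previous slide, the predictor mispredicts during warm-up and then tracks the pattern essentially perfectly, which is the local-correlation case the tournament is meant to capture.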

Mapper and Queue Stages

• Mapper:
  • Renames 4 instructions per cycle (8 source / 4 dest registers)
  • 80 integer + 72 floating-point physical registers
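Renaming with a map table and free list can be sketched as below. This toy renames one instruction at a time, whereas the 21264 maps four per cycle (8 sources, 4 destinations); the structure names are illustrative.

```python
# Register rename sketch: an architectural-to-physical map plus a free list
# of unallocated physical registers (32 architectural, 80 physical, as on
# the integer side).

NUM_ARCH, NUM_PHYS = 32, 80
map_table = list(range(NUM_ARCH))             # arch reg -> current phys reg
free_list = list(range(NUM_ARCH, NUM_PHYS))   # unallocated phys regs

def rename(dest, src1, src2):
    ps1, ps2 = map_table[src1], map_table[src2]  # sources read the current map
    pd = free_list.pop(0)                        # dest gets a fresh phys reg
    prev = map_table[dest]                       # freed when this instr retires
    map_table[dest] = pd
    return pd, ps1, ps2, prev

# Two back-to-back writes to R3 get distinct physical registers, so the
# second write can never clobber a value an older reader still needs.
pd_a = rename(3, 1, 2)[0]
pd_b = rename(3, 3, 1)[0]
```

Keeping the previous mapping (`prev`) around until retirement is what lets a mispredicted or faulting path restore the saved map state.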

• Queue stage:
  • Integer: 20 entries / quad-issue
  • Floating point: 15 entries / dual-issue
  • Instructions are issued out of order when their data are ready
  • Prioritized from oldest to youngest each cycle
  • Instructions leave the queues after they issue
  • The queue collapses every cycle as needed
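The queue discipline above can be sketched as a single issue cycle. The instruction encoding (`(source_regs, dest_reg)` tuples), the ready-register set, and same-cycle result bypass are assumptions for illustration.

```python
# One issue cycle: pick up to 4 data-ready instructions, oldest first,
# then collapse the queue around the issued slots.

def issue_cycle(queue, ready, width=4):
    issued, remaining = [], []
    for srcs, dest in queue:                 # queue is ordered oldest -> youngest
        if len(issued) < width and all(s in ready for s in srcs):
            issued.append((srcs, dest))
            ready.add(dest)                  # model same-cycle result bypass
        else:
            remaining.append((srcs, dest))   # not ready (or width used up): stays
    return issued, remaining

# R4 is not ready, so the second instruction waits while the younger third
# instruction issues around it: out-of-order issue, oldest-first priority.
queue = [((1, 2), 3), ((3, 4), 5), ((6, 7), 8)]
issued, queue = issue_cycle(queue, ready={1, 2, 6, 7})
```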

Map and Queue Stages

[Figure: map and queue stages. Incoming register numbers for 4 instructions per cycle are translated by the MAP CAMs (80 entries) into renamed registers, with saved map state for recovery; renamed instructions enter the 20-entry QUEUE, where a register scoreboard tracks operand readiness and an arbiter turns issue requests into grants for the issued instructions.]

Register and Execute Stages

[Figure: execution units. Integer execution is split into clusters 0 and 1, each with its own copy of the register file: cluster 0's upper pipe has motion-video, shift/branch, and add/logic units, cluster 1's has multiply, shift/branch, and add/logic, and each cluster's lower pipe has an add/logic/memory unit. The floating-point register file feeds an FP multiply pipe plus an FP add pipe with FP divide and SQRT units.]

Integer Cross-Cluster Instruction Scheduling and Execution

[Figure: the two integer execution clusters, as on the previous slide.]

• Instructions are statically pre-slotted to the upper or lower execution pipes
• The issue queue dynamically selects between the left and right clusters
• This has most of the performance of 4-way issue with the simplicity of 2-way issue

21264 On-Chip Memory System Features

• Two loads/stores per cycle (any combination)
• 64 KB two-way set-associative L1 data cache (9.6 GB/sec)
  • Phase-pipelined at > 1 GHz (no bank conflicts!)
  • 3-cycle latency (issue to issue of consumer) with hit prediction
• Out-of-order and speculative execution
  • Minimizes effective memory latency
• 32 outstanding loads and 32 outstanding stores
  • Maximizes memory system parallelism
• Speculative stores forward data to subsequent loads

Memory System (Continued)

[Figure: memory-system data and address paths. The memory units in clusters 0 and 1 send addresses to the double-pumped data cache and drive 64-bit data busses; the bus interface unit handles 128-bit fill data, 128-bit L2 data (at 1.5x the CPU rate), 128-bit victim data, speculative store data, and the system port.]

• New references check addresses against existing references
• Speculative store data that matches a load's address bypasses to the load
• Multiple misses to the same cache block merge
• Hazard detection logic manages out-of-order references

Low-latency Speculative Issue of Integer Load Data Consumers (Predict Hit)

[Figure: pipeline timing (Q = queue, R = register read, E = execute, D = dcache). LDQ R0, 0(R0) is followed 3 cycles later by the dependent ADDQ R0, R0, R1 and then another consumer op; on a miss the ADDQ and the consumer op are squashed and the ops restart from the issue queue.]

• When predicting a load hit:
  • The ADDQ issues (speculatively) after 3 cycles
  • Best performance if the load actually hits (matching the prediction)
  • The ADDQ issues before the load hit/miss calculation is known
• If the LDQ misses when predicted to hit:
  • Squash two cycles (replay the ADDQ and its consumers)
  • Force a "mini-replay" (direct from the issue queue)

Low-latency Speculative Issue of Integer Load Data Consumers (Predict Miss)

[Figure: pipeline timing as before. LDQ R0, 0(R0) is followed by two independent ops (OP1, OP2); the dependent ADDQ R0, R0, R1 can issue at the earliest 5 cycles after the LDQ.]

• When predicting a load miss:
  • The minimum load latency is 5 cycles (more on a miss)
  • There are no squashes
  • Best performance if the load actually misses (as predicted)
• The hit/miss predictor:
  • MSB of a 4-bit counter (hits increment by 1, misses decrement by 2)
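The hit/miss predictor on this slide is small enough to sketch directly from its description: the prediction is the MSB of a 4-bit saturating counter that goes up by 1 on a hit and down by 2 on a miss, so a burst of misses flips it to "predict miss" quickly while it earns back "predict hit" more slowly. The class and method names are mine.

```python
# 4-bit saturating hit/miss predictor, per the slide's description.

class HitPredictor:
    def __init__(self):
        self.counter = 15                      # start saturated at "predict hit"

    def predict_hit(self):
        return bool(self.counter & 0b1000)     # MSB of the 4-bit counter

    def update(self, hit):
        if hit:
            self.counter = min(15, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 2)

p = HitPredictor()
for _ in range(4):                             # a burst of misses...
    p.update(False)
# ...drops the counter 15 -> 13 -> 11 -> 9 -> 7: MSB clear, so predict miss.
```

The asymmetric step sizes bias the predictor toward the cheap case: mispredicted hits cost a two-cycle squash, so misses push the counter down twice as fast as hits pull it up.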

Dynamic Hazard Avoidance (Before Marking)

Program order (assume R28 == R29):

    LDQ R0, 64(R28)
    STQ R0, 0(R28)     Store followed by load to the
    LDQ R1, 0(R29)     same memory address!

First execution order:

    LDQ R0, 64(R28)
    LDQ R1, 0(R29)     This (re-ordered) load got the wrong data value!
    …
    STQ R0, 0(R28)

Dynamic Hazard Avoidance (After Marking/Training)

Program order (assume R28 == R29):

    LDQ R0, 64(R28)
    STQ R0, 0(R28)     Store followed by load to the same memory address!
    LDQ R1, 0(R29)     This load is marked this time

Subsequent executions (after marking):

    LDQ R0, 64(R28)
    …
    STQ R0, 0(R28)
    LDQ R1, 0(R29)     The marked (delayed) load gets the bypassed store data!
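The train-and-avoid scheme can be sketched as below. The marked-PC table, the function arguments, and the return values are illustrative stand-ins, not the 21264's actual structures.

```python
# First wrong-order execution of a load past an older store to the same
# address triggers a replay and marks the load's PC; later executions of a
# marked load wait for the store and pick up the bypassed store data.

marked = set()                        # PCs of loads trained to wait for stores

def try_load(load_pc, older_store_done, same_address):
    """One execution of a load that is younger than a pending store."""
    if load_pc in marked or older_store_done:
        return "ok"                   # in order: gets the (bypassed) store data
    if same_address:
        marked.add(load_pc)           # wrong-order access detected: train the mark
        return "replay"               # re-execute, as in "first execution" above
    return "ok"                       # different address: the reordering was safe

first  = try_load(0x40, older_store_done=False, same_address=True)   # "replay"
second = try_load(0x40, older_store_done=False, same_address=True)   # "ok"
```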

New Memory Prefetches

• Software-directed prefetches:
  • Prefetch
  • Prefetch with modify intent
  • Prefetch, evict next
  • Evict block (eject data from the cache)
  • Write hint
    • allocates the block with no data read
    • useful for full-cache-block writes

21264 Off-Chip Memory System Features

• 8 outstanding block fills + 8 victims
• Split L2 cache and system busses (back-side cache)
• High-speed (bandwidth) point-to-point channels
  • clock-forwarding technology, low pin counts
• L2 hit load latency (load issue to consumer issue) = 6 cycles + SRAM latency
• Max L2 cache bandwidth of 16 bytes per 1.5 cycles
  • 6.4 GB/sec with a 400 MHz transfer rate
• L2 miss load latency can be 160 ns (60 ns DRAM)
• Max system bandwidth of 8 bytes per 1.5 cycles
  • 3.2 GB/sec with a 400 MHz transfer rate
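The bandwidth figures follow from the transfer rate: one transfer per 1.5 CPU cycles at 600 MHz is a 400 MHz transfer rate. A quick check of the arithmetic (decimal GB/sec):

```python
# Reproducing the slide's bandwidth arithmetic.
cpu_mhz = 600
transfer_mhz = cpu_mhz / 1.5                 # one transfer per 1.5 cycles -> 400 MHz

l2_bw  = 16 * transfer_mhz / 1000            # 16 bytes/transfer -> 6.4 GB/sec (L2)
sys_bw = 8 * transfer_mhz / 1000             # 8 bytes/transfer  -> 3.2 GB/sec (system)
```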

21264 Pin Bus

[Figure: 21264 pin buses. A system port (address out, address in, and 64-bit data to the system chip-set, DRAM, and I/O) and a separate L2 cache port (tag address to the L2 tag RAMs, 128-bit data to the L2 data RAMs). Independent data busses allow simultaneous data transfers.]

L2 cache: 0-16 MB, synchronous, direct-mapped. Peak bandwidth examples:

    Example 1: Reg-Reg 133 MHz Burst RAM     2.1 GB/sec
    Example 2: 'RISC' 200+ MHz Late Write    3.2+ GB/sec
    Example 3: Dual Data 400+ MHz            6.4+ GB/sec

Compaq Alpha / AMD Shared System Pin Bus (External Interface)

System Pin Bus (“Slot A”)

[Figure: the AMD K7 uses the same system pin bus: address out, address in, and 64-bit system data to a shared system chip-set (DRAM and I/O), with a separate cache port.]

• High-performance system pin bus
• Shared system chip-set designs
• This is a win-win!

Dual 21264 System Example

[Figure: dual-21264 system. Each 21264 has its own L2 cache and a 64-bit connection to a set of 8 data-switch chips; the switches connect via 256-bit paths to two 100 MHz SDRAM memory banks. A control chip and two PCI chips (each with a 64-bit PCI bus) complete the system.]

Measured Performance: SPEC95

SPECint95 (higher is better):

    21264 - 600 MHz          30
    21164 - 600 MHz          18.8
    PA8200 - 240 MHz         17.4
    Pentium II - 400 MHz     16.5
    UltraSparc II - 360 MHz  16.1
    R10000 - 250 MHz         14.7
    PowerPC 604e - 332 MHz   14.4

SPECfp95 (higher is better):

    21264 - 600 MHz          50
    21164 - 600 MHz          29.2
    PA8200 - 240 MHz         28.5
    POWER2 - 160 MHz         26.6
    R10000 - 250 MHz         24.5
    UltraSparc II - 360 MHz  23.5
    Pentium II - 400 MHz     13.7
    PowerPC 604e - 332 MHz   12.6

Measured Performance: STREAMS

Max STREAMS bandwidth (MB/sec):

    21264 - 600                      1000
    RS6000 - 397                      883
    UltraSparc II - UE6000 (ass.)     367
    R10000 - 250 - O2000              358
    21164 - 533 - 164SX               292
    Pentium II - 300 - Dell           213

Alpha 21264

[Die photo: major labeled blocks include the bus interface unit, instruction cache, data cache, memory controllers, integer mapper and queue, floating-point mapper and queue, the left and right integer units, the floating-point unit, and the data and control busses.]

Summary

• The 21264 will maintain Alpha's performance lead
  • 30+ SPECint95 and 50+ SPECfp95
  • 1+ GB/sec memory bandwidth

• The 21264 proves that high frequency and sophisticated architectural features can coexist
  • high-bandwidth speculative instruction fetch
  • out-of-order execution
  • 6-way instruction issue
  • highly parallel out-of-order memory system
