The Alpha 21264: Out-of-Order Execution at 600 MHz

R. E. Kessler, Compaq Computer Corporation, Shrewsbury, MA

REK August 1998 1

Some Highlights

• Continued Alpha performance leadership
  • 600 MHz operation in 0.35 µm CMOS6, 6 metal layers, 2.2 V
  • 15 million transistors, 3.1 cm², 587-pin PGA
  • SPECint95 of 30+ and SPECfp95 of 50+
• Out-of-order and speculative execution
  • 4-way integer issue
  • 2-way floating-point issue
• Sophisticated tournament branch prediction
• High-bandwidth memory system (1+ GB/sec)

Alpha 21264: Block Diagram

[Figure: Alpha 21264 block diagram. Pipeline stages 0-6: Fetch, Map, Queue, Reg, Exec, Dcache. A 64 KB 2-set L1 instruction cache with next-line and set prediction feeds 4 instructions per cycle to the integer and floating-point register mappers. Integer side: 20-entry issue queue, two 80-entry register files, four execution units. Floating-point side: 15-entry issue queue, 72-entry register file, FP add, FP multiply, and FP divide/square-root units. Up to 80 in-flight instructions, plus 32 loads and 32 stores. A 64 KB 2-set L1 data cache (64-bit busses) and a bus interface unit with victim buffer, miss-address file, and 44-bit physical addresses connect to the 128-bit L2 cache port and the system bus.]

21264 Instruction Fetch Bandwidth Enablers

• The 64 KB two-way set-associative instruction cache supplies four instructions every cycle

• The next-fetch and set predictors provide the fast cache access time of a direct-mapped cache and eliminate bubbles in non-sequential control flow

• The instruction fetcher speculates through up to 20 branch predictions to supply a continuous stream of instructions

• The tournament branch predictor dynamically selects between local and global history to minimize mispredicts
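The next-fetch and set prediction idea above can be sketched in a few lines. This is a hedged toy model: the 16-byte fetch granularity, the sequential default guess, and the train-on-mispredict policy are assumptions for illustration, not the 21264's exact hardware.

```python
# Toy model of next-line/set (way) prediction: each fetch block remembers
# where the *next* fetch went last time (address + cache way), so the next
# access can start immediately, without waiting for branch decode or for
# the tag compares that determine the way.

next_fetch = {}   # fetch address -> (predicted next address, predicted way)

def predict_next(addr):
    # Default guess: sequential fetch from way 0 (granularity is illustrative).
    return next_fetch.get(addr, (addr + 16, 0))

def train(addr, actual_next, actual_way):
    # On a mispredict, remember the actual outcome for next time.
    next_fetch[addr] = (actual_next, actual_way)

# A taken branch at 0x1000 jumping to 0x2000 (found in way 1):
first_guess = predict_next(0x1000)   # sequential guess before training
train(0x1000, 0x2000, 1)             # after training, the jump is predicted
```

After training, a repeat fetch of the same block starts the access at the branch target in the correct way with no bubble, which is the "0-cycle penalty on branches" behavior the slide claims.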

Instruction Stream Improvements

[Figure: instruction fetch path. The PC generation logic gives a 0-cycle penalty on predicted branches and learns dynamic jump targets; instruction decode and branch prediction check the speculative stream. Each Icache line carries a next-fetch address and set prediction, providing set-associative behavior: the predicted set (S0/S1) is read while the tag compares determine hit/miss/set, and the next address + set steer the following fetch of 4 instructions.]

Fetch Stage: Branch Prediction

• Some branch directions can be predicted based on their past behavior: Local Correlation

    if (a % 2 == 0)   T N T N …  (predict taken)
    if (b == 0)       N N N N …

• Others can be predicted based on how the program arrived at the branch: Global Correlation

    if (a == 0)   (predict taken)
    if (b == 0)   (predict taken)
    if (a == b)   (predict taken)

Tournament Branch Prediction

[Figure: tournament branch predictor. The program counter indexes a local history table (1024 x 10 bits); each local history indexes a local prediction table (1024 x 3-bit counters). The global history of recent branch outcomes indexes a global prediction table (4096 x 2-bit counters) and a choice prediction table (4096 x 2-bit counters) that selects between the local and global predictions.]
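The tournament scheme can be sketched as follows. The table geometry here is a toy (4-bit histories, dictionary-backed tables) rather than the real 1024x10 / 1024x3 / 4096x2 / 4096x2 arrangement, and the indexing is simplified, but the structure — a chooser trained only when the two component predictors disagree — matches the slide.

```python
# Toy tournament predictor: a per-branch local component, a shared global
# component, and a choice table that learns which component to trust.

H_BITS = 4
MASK = (1 << H_BITS) - 1

local_hist = {}       # branch PC -> local history shift register
local_ctr = {}        # (PC, local history) -> 2-bit saturating counter
global_hist = 0       # shared history of all recent branch outcomes
global_ctr = {}       # global history -> 2-bit saturating counter
choice_ctr = {}       # global history -> 2-bit "use global?" counter

def _sat(c, up):
    return min(3, c + 1) if up else max(0, c - 1)

def predict(pc):
    lh = local_hist.get(pc, 0)
    lp = local_ctr.get((pc, lh), 1) >= 2
    gp = global_ctr.get(global_hist, 1) >= 2
    use_global = choice_ctr.get(global_hist, 1) >= 2
    return gp if use_global else lp

def update(pc, taken):
    global global_hist
    lh = local_hist.get(pc, 0)
    lp = local_ctr.get((pc, lh), 1) >= 2
    gp = global_ctr.get(global_hist, 1) >= 2
    if lp != gp:   # train the chooser only when the components disagree
        choice_ctr[global_hist] = _sat(choice_ctr.get(global_hist, 1), gp == taken)
    local_ctr[(pc, lh)] = _sat(local_ctr.get((pc, lh), 1), taken)
    global_ctr[global_hist] = _sat(global_ctr.get(global_hist, 1), taken)
    local_hist[pc] = ((lh << 1) | taken) & MASK
    global_hist = ((global_hist << 1) | taken) & MASK
```

Running this on the alternating T/N branch from the previous slide, the predictor mispredicts during warm-up and then tracks the pattern essentially perfectly, which is the local-correlation case the tournament is meant to capture.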

Mapper and Queue Stages

• Mapper:
  • Renames 4 instructions per cycle (8 source / 4 dest registers)
  • 80 integer + 72 floating-point physical registers
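Renaming with a map table and free list can be sketched as below. This toy renames one instruction at a time, whereas the 21264 maps four per cycle (8 sources, 4 destinations); the structure names are illustrative.

```python
# Register rename sketch: an architectural-to-physical map plus a free list
# of unallocated physical registers (32 architectural, 80 physical, as on
# the integer side).

NUM_ARCH, NUM_PHYS = 32, 80
map_table = list(range(NUM_ARCH))             # arch reg -> current phys reg
free_list = list(range(NUM_ARCH, NUM_PHYS))   # unallocated phys regs

def rename(dest, src1, src2):
    ps1, ps2 = map_table[src1], map_table[src2]  # sources read the current map
    pd = free_list.pop(0)                        # dest gets a fresh phys reg
    prev = map_table[dest]                       # freed when this instr retires
    map_table[dest] = pd
    return pd, ps1, ps2, prev

# Two back-to-back writes to R3 get distinct physical registers, so the
# second write can never clobber a value an older reader still needs.
pd_a = rename(3, 1, 2)[0]
pd_b = rename(3, 3, 1)[0]
```

Keeping the previous mapping (`prev`) around until retirement is what lets a mispredicted or faulting path restore the saved map state.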

• Queue stage:
  • Integer: 20 entries / quad-issue
  • Floating point: 15 entries / dual-issue
  • Instructions are issued out of order when their data are ready
  • Prioritized from oldest to youngest each cycle
  • Instructions leave the queues after they issue
  • The queue collapses every cycle as needed
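The queue discipline above can be sketched as a single issue cycle. The instruction encoding (`(source_regs, dest_reg)` tuples), the ready-register set, and same-cycle result bypass are assumptions for illustration.

```python
# One issue cycle: pick up to 4 data-ready instructions, oldest first,
# then collapse the queue around the issued slots.

def issue_cycle(queue, ready, width=4):
    issued, remaining = [], []
    for srcs, dest in queue:                 # queue is ordered oldest -> youngest
        if len(issued) < width and all(s in ready for s in srcs):
            issued.append((srcs, dest))
            ready.add(dest)                  # model same-cycle result bypass
        else:
            remaining.append((srcs, dest))   # not ready (or width used up): stays
    return issued, remaining

# R4 is not ready, so the second instruction waits while the younger third
# instruction issues around it: out-of-order issue, oldest-first priority.
queue = [((1, 2), 3), ((3, 4), 5), ((6, 7), 8)]
issued, queue = issue_cycle(queue, ready={1, 2, 6, 7})
```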

Map and Queue Stages

[Figure: map and queue stages. Incoming register numbers for 4 instructions per cycle are translated by the MAP CAMs (80 entries) into renamed registers, with saved map state for recovery; renamed instructions enter the 20-entry QUEUE, where a register scoreboard tracks operand readiness and an arbiter turns issue requests into grants for the issued instructions.]

Register and Execute Stages

[Figure: execution units. Integer execution is split into clusters 0 and 1, each with its own copy of the register file: cluster 0's upper pipe has motion-video, shift/branch, and add/logic units, cluster 1's has multiply, shift/branch, and add/logic, and each cluster's lower pipe has an add/logic/memory unit. The floating-point register file feeds an FP multiply pipe plus an FP add pipe with FP divide and SQRT units.]

Integer Cross-Cluster Instruction Scheduling and Execution

[Figure: the two integer execution clusters, as on the previous slide.]

• Instructions are statically pre-slotted to the upper or lower execution pipes
• The issue queue dynamically selects between the left and right clusters
• This has most of the performance of 4-way issue with the simplicity of 2-way issue

21264 On-Chip Memory System Features

• Two loads/stores per cycle (any combination)
• 64 KB two-way set-associative L1 data cache (9.6 GB/sec)
  • Phase-pipelined at > 1 GHz (no bank conflicts!)
  • 3-cycle latency (issue to issue of consumer) with hit prediction
• Out-of-order and speculative execution
  • Minimizes effective memory latency
• 32 outstanding loads and 32 outstanding stores
  • Maximizes memory system parallelism
• Speculative stores forward data to subsequent loads

Memory System (Continued)

[Figure: memory-system data and address paths. The memory units in clusters 0 and 1 send addresses to the double-pumped data cache and drive 64-bit data busses; the bus interface unit handles 128-bit fill data, 128-bit L2 data (at 1.5x the CPU rate), 128-bit victim data, speculative store data, and the system port.]

• New references check addresses against existing references
• Speculative store data that matches a load's address bypasses to the load
• Multiple misses to the same cache block merge
• Hazard detection logic manages out-of-order references

Low-latency Speculative Issue of Integer Load Data Consumers (Predict Hit)

[Figure: pipeline timing (Q = queue, R = register read, E = execute, D = dcache). LDQ R0, 0(R0) is followed 3 cycles later by the dependent ADDQ R0, R0, R1 and then another consumer op; on a miss the ADDQ and the consumer op are squashed and the ops restart from the issue queue.]

• When predicting a load hit:
  • The ADDQ issues (speculatively) after 3 cycles
  • Best performance if the load actually hits (matching the prediction)
  • The ADDQ issues before the load hit/miss calculation is known
• If the LDQ misses when predicted to hit:
  • Squash two cycles (replay the ADDQ and its consumers)
  • Force a "mini-replay" (direct from the issue queue)

Low-latency Speculative Issue of Integer Load Data Consumers (Predict Miss)

[Figure: pipeline timing as before. LDQ R0, 0(R0) is followed by two independent ops (OP1, OP2); the dependent ADDQ R0, R0, R1 can issue at the earliest 5 cycles after the LDQ.]

• When predicting a load miss:
  • The minimum load latency is 5 cycles (more on a miss)
  • There are no squashes
  • Best performance if the load actually misses (as predicted)
• The hit/miss predictor:
  • MSB of a 4-bit counter (hits increment by 1, misses decrement by 2)
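The hit/miss predictor on this slide is small enough to sketch directly from its description: the prediction is the MSB of a 4-bit saturating counter that goes up by 1 on a hit and down by 2 on a miss, so a burst of misses flips it to "predict miss" quickly while it earns back "predict hit" more slowly. The class and method names are mine.

```python
# 4-bit saturating hit/miss predictor, per the slide's description.

class HitPredictor:
    def __init__(self):
        self.counter = 15                      # start saturated at "predict hit"

    def predict_hit(self):
        return bool(self.counter & 0b1000)     # MSB of the 4-bit counter

    def update(self, hit):
        if hit:
            self.counter = min(15, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 2)

p = HitPredictor()
for _ in range(4):                             # a burst of misses...
    p.update(False)
# ...drops the counter 15 -> 13 -> 11 -> 9 -> 7: MSB clear, so predict miss.
```

The asymmetric step sizes bias the predictor toward the cheap case: mispredicted hits cost a two-cycle squash, so misses push the counter down twice as fast as hits pull it up.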

Dynamic Hazard Avoidance (Before Marking)

Program order (assume R28 == R29):

    LDQ R0, 64(R28)
    STQ R0, 0(R28)     Store followed by load to the
    LDQ R1, 0(R29)     same memory address!

First execution order:

    LDQ R0, 64(R28)
    LDQ R1, 0(R29)     This (re-ordered) load got the wrong data value!
    …
    STQ R0, 0(R28)

Dynamic Hazard Avoidance (After Marking/Training)

Program order (assume R28 == R29):

    LDQ R0, 64(R28)
    STQ R0, 0(R28)     Store followed by load to the same memory address!
    LDQ R1, 0(R29)     This load is marked this time

Subsequent executions (after marking):

    LDQ R0, 64(R28)
    …
    STQ R0, 0(R28)
    LDQ R1, 0(R29)     The marked (delayed) load gets the bypassed store data!
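The train-and-avoid scheme can be sketched as below. The marked-PC table, the function arguments, and the return values are illustrative stand-ins, not the 21264's actual structures.

```python
# First wrong-order execution of a load past an older store to the same
# address triggers a replay and marks the load's PC; later executions of a
# marked load wait for the store and pick up the bypassed store data.

marked = set()                        # PCs of loads trained to wait for stores

def try_load(load_pc, older_store_done, same_address):
    """One execution of a load that is younger than a pending store."""
    if load_pc in marked or older_store_done:
        return "ok"                   # in order: gets the (bypassed) store data
    if same_address:
        marked.add(load_pc)           # wrong-order access detected: train the mark
        return "replay"               # re-execute, as in "first execution" above
    return "ok"                       # different address: the reordering was safe

first  = try_load(0x40, older_store_done=False, same_address=True)   # "replay"
second = try_load(0x40, older_store_done=False, same_address=True)   # "ok"
```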

New Memory Prefetches

• Software-directed prefetches:
  • Prefetch
  • Prefetch with modify intent
  • Prefetch, evict next
  • Evict block (eject data from the cache)
  • Write hint
    • allocates the block with no data read
    • useful for full-cache-block writes

21264 Off-Chip Memory System Features

• 8 outstanding block fills + 8 victims
• Split L2 cache and system busses (back-side cache)
• High-speed (bandwidth) point-to-point channels
  • clock-forwarding technology, low pin counts
• L2 hit load latency (load issue to consumer issue) = 6 cycles + SRAM latency
• Max L2 cache bandwidth of 16 bytes per 1.5 cycles
  • 6.4 GB/sec with a 400 MHz transfer rate
• L2 miss load latency can be 160 ns (60 ns DRAM)
• Max system bandwidth of 8 bytes per 1.5 cycles
  • 3.2 GB/sec with a 400 MHz transfer rate
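The bandwidth figures follow from the transfer rate: one transfer per 1.5 CPU cycles at 600 MHz is a 400 MHz transfer rate. A quick check of the arithmetic (decimal GB/sec):

```python
# Reproducing the slide's bandwidth arithmetic.
cpu_mhz = 600
transfer_mhz = cpu_mhz / 1.5                 # one transfer per 1.5 cycles -> 400 MHz

l2_bw  = 16 * transfer_mhz / 1000            # 16 bytes/transfer -> 6.4 GB/sec (L2)
sys_bw = 8 * transfer_mhz / 1000             # 8 bytes/transfer  -> 3.2 GB/sec (system)
```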

21264 Pin Bus

[Figure: 21264 pin buses. A system port (address out, address in, and 64-bit data to the system chip-set, DRAM, and I/O) and a separate L2 cache port (tag address to the L2 tag RAMs, 128-bit data to the L2 data RAMs). Independent data busses allow simultaneous data transfers.]

L2 cache: 0-16 MB, synchronous, direct-mapped. Peak bandwidth examples:

    Example 1: Reg-Reg 133 MHz Burst RAM     2.1 GB/sec
    Example 2: 'RISC' 200+ MHz Late Write    3.2+ GB/sec
    Example 3: Dual Data 400+ MHz            6.4+ GB/sec

Compaq Alpha / AMD Shared System Pin Bus (External Interface)

System Pin Bus (“Slot A”)

[Figure: the AMD K7 uses the same system pin bus: address out, address in, and 64-bit system data to a shared system chip-set (DRAM and I/O), with a separate cache port.]

• High-performance system pin bus
• Shared system chip-set designs
• This is a win-win!

Dual 21264 System Example

[Figure: dual-21264 system. Each 21264 has its own L2 cache and a 64-bit connection to a set of 8 data-switch chips; the switches connect via 256-bit paths to two 100 MHz SDRAM memory banks. A control chip and two PCI chips (each with a 64-bit PCI bus) complete the system.]

Measured Performance: SPEC95

SPECint95 (higher is better):

    21264 - 600 MHz          30
    21164 - 600 MHz          18.8
    PA8200 - 240 MHz         17.4
    Pentium II - 400 MHz     16.5
    UltraSparc II - 360 MHz  16.1
    R10000 - 250 MHz         14.7
    PowerPC 604e - 332 MHz   14.4

SPECfp95 (higher is better):

    21264 - 600 MHz          50
    21164 - 600 MHz          29.2
    PA8200 - 240 MHz         28.5
    POWER2 - 160 MHz         26.6
    R10000 - 250 MHz         24.5
    UltraSparc II - 360 MHz  23.5
    Pentium II - 400 MHz     13.7
    PowerPC 604e - 332 MHz   12.6

Measured Performance: STREAMS

Max STREAMS bandwidth (MB/sec):

    21264 - 600                      1000
    RS6000 - 397                      883
    UltraSparc II - UE6000 (ass.)     367
    R10000 - 250 - O2000              358
    21164 - 533 - 164SX               292
    Pentium II - 300 - Dell           213

Alpha 21264

[Die photo: major labeled blocks include the bus interface unit, instruction cache, data cache, memory controllers, integer mapper and queue, floating-point mapper and queue, the left and right integer units, the floating-point unit, and the data and control busses.]

Summary

• The 21264 will maintain Alpha's performance lead
  • 30+ SPECint95 and 50+ SPECfp95
  • 1+ GB/sec memory bandwidth

• The 21264 proves that high frequency and sophisticated architectural features can coexist
  • high-bandwidth speculative instruction fetch
  • out-of-order execution
  • 6-way instruction issue
  • highly parallel out-of-order memory system
