Alpha 21264 Microarchitecture

Onur/Aditya, 11/6/2001

Key Features of 21264
• Introduced in February 1998 at 500 MHz
• 15M transistors, 2.2V, 0.35-micron, 6-metal-layer CMOS process
• Implements the 64-bit Alpha ISA
• Out-of-order execution (unlike the 21164)
• 4-wide fetch (like the 21164)
• Maximum of 6 instructions/cycle execution bandwidth
• 7-stage pipeline
• Hybrid two-level branch prediction (tournament predictor)
• Clustered integer pipeline
• 80 in-flight instructions

Overview of the Presentation
• Overview of the 21264 pipeline
• Fetch and branch prediction mechanism
• Register renaming in the 21264
• Clustering
• Memory system

21264 Pipelines
Source: Microprocessor Report, 10/28/96

Pipeline Structure
Source: IEEE Micro, March-April 1999

Instruction Fetch Mechanism
• Two features:
  – Line and way prediction
  – Branch prediction
• The line/way predictor predicts the I-cache line and way that will be accessed in the next cycle
• Line/way prediction takes the branch predictor out of the critical fetch loop
• On cache fills, the line predictor entry for each line points to the next sequential fetch line
• The line predictor is later trained by the branch predictor
• In effect, the line/way predictor acts as a very fast BTB
• The line predictor's prediction is verified in stage 1 (instruction slot). If the line/way prediction is incorrect, the slot stage is flushed and the PC generated from the branch predictor information is used to redirect fetch

Fetch - Line and Way Prediction
Source: IEEE Micro, March-April 1999

Branch Prediction Mechanism
• Hybrid branch predictor
• Global predictor:
  – Good for inter-correlated branches
  – Indexed by the global path history register (taken/not-taken status of the last 12 branches)
  – 4K-entry table of 2-bit counters
• Local predictor:
  – Good for self-correlated branches
  – 10 bits of the PC index a per-address local history table, which in turn indexes a 1K-entry table of 3-bit counters
  – Aliasing among branches is a problem
• Choice predictor:
  – Decides which predictor to use
  – Indexed by the global path history register
  – 4K-entry table of 2-bit counters

Branch Prediction Mechanism
Source: Microprocessor Report, 10/28/96
• Minimum branch penalty: 7 cycles
• Typical branch penalty: 11+ cycles (issue queue delay)
• 48K bits of target addresses stored in the I-cache
• 32-entry return address stack
• Predictor tables are reset on a context switch
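The line/way prediction mechanism above can be illustrated with a small behavioral model: a table indexed by the current I-cache (line, way) whose entry names the next fetch (line, way), initialized to sequential fetch on a cache fill and retrained from the branch predictor when the slot-stage check fails. This is only a sketch of the idea; the table size and method names are illustrative assumptions, not the 21264's actual parameters.

```python
# Simplified model of a line/way predictor: each I-cache (line, way) entry
# holds a guess of the next (line, way) to fetch. The guess is verified in
# the slot stage and retrained from the branch predictor on a mispredict.

class LineWayPredictor:
    def __init__(self, num_lines=512, num_ways=2):
        self.num_lines = num_lines
        # On a cache fill, each entry points at the next sequential line, way 0.
        self.table = {
            (line, way): ((line + 1) % num_lines, 0)
            for line in range(num_lines)
            for way in range(num_ways)
        }

    def predict(self, line, way):
        """Guess the (line, way) that will be fetched in the next cycle."""
        return self.table[(line, way)]

    def train(self, line, way, correct_line, correct_way):
        """Retrain from the branch predictor when the guess was wrong."""
        self.table[(line, way)] = (correct_line, correct_way)


if __name__ == "__main__":
    lp = LineWayPredictor()
    print(lp.predict(10, 0))      # initially (11, 0): sequential fetch
    lp.train(10, 0, 200, 1)       # a taken branch redirects fetch to line 200, way 1
    print(lp.predict(10, 0))      # now (200, 1)
```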
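The predictor sizes listed above (a 12-bit path history indexing 4K 2-bit global and 4K 2-bit choice counters, and 10 PC bits selecting a 10-bit local history that indexes 1K 3-bit counters) translate directly into a toy software model. The sketch below is a functional approximation only, assuming simple bit truncation for the index functions and updating all tables when the branch resolves; it ignores the real design's speculative history updates and context-switch resets.

```python
# Toy model of a 21264-style tournament predictor:
#   global: 12-bit branch history -> 4K-entry table of 2-bit counters
#   local:  10 PC bits -> 10-bit per-branch history -> 1K-entry table of 3-bit counters
#   choice: 12-bit branch history -> 4K-entry table of 2-bit counters (global vs. local)

class TournamentPredictor:
    def __init__(self):
        self.ghist = 0                   # taken/not-taken status of the last 12 branches
        self.global_ctr = [1] * 4096     # 2-bit counters (0..3), predict taken if >= 2
        self.local_hist = [0] * 1024     # 10-bit local histories, indexed by PC bits
        self.local_ctr = [3] * 1024      # 3-bit counters (0..7), predict taken if >= 4
        self.choice_ctr = [1] * 4096     # 2-bit counters: >= 2 means "trust global"

    def predict(self, pc):
        g = self.global_ctr[self.ghist] >= 2
        l = self.local_ctr[self.local_hist[(pc >> 2) & 0x3FF]] >= 4
        return g if self.choice_ctr[self.ghist] >= 2 else l

    def update(self, pc, taken):
        li = (pc >> 2) & 0x3FF
        lh = self.local_hist[li]
        g = self.global_ctr[self.ghist] >= 2
        l = self.local_ctr[lh] >= 4
        # Train the choice predictor only when the two components disagree.
        if g != l:
            c = self.choice_ctr[self.ghist]
            self.choice_ctr[self.ghist] = min(3, c + 1) if g == taken else max(0, c - 1)
        # Train the component counters toward the actual outcome.
        gc, lc = self.global_ctr[self.ghist], self.local_ctr[lh]
        self.global_ctr[self.ghist] = min(3, gc + 1) if taken else max(0, gc - 1)
        self.local_ctr[lh] = min(7, lc + 1) if taken else max(0, lc - 1)
        # Shift the outcome into the local and global histories.
        self.local_hist[li] = ((lh << 1) | int(taken)) & 0x3FF
        self.ghist = ((self.ghist << 1) | int(taken)) & 0xFFF
```

A fetch-time caller would invoke predict(pc) in stage 0 and update(pc, taken) when the branch resolves.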
Instruction Slotting
• Check the line predictor's prediction
• The branch predictor compares the next cache index it generates with the one generated by the line predictor
• Determine which subclusters integer instructions will go to
• Some subclusters are specialized (resource constraints)
• Perform load balancing across the subclusters

Register Renaming
• 31 integer and 31 FP architectural registers
• 41 integer and 41 FP extra physical registers
• Uses a merged rename and architectural register file, one for integer and one for FP
• The same physical register holds the result of an instruction before and after commit
• No separate architectural register file (no data copying on commit)
• The register map table stores the current mappings of architectural registers
• A map silo contains old mappings from up to 20 previous decode cycles (used in case of misprediction)

Register Renaming Logic
Source: Presentation by R. Kessler, August 1998
• On decoding an instruction:
  – Search the map CAMs for the source registers
  – Find the physical registers currently holding the values of the architectural source registers
  – Access the free physical register list
  – Map a free physical register to the architectural destination register
• On completing an instruction:
  – Write the result into the physical destination register
  – Mark the physical destination register as valid in the register scoreboard
  – Broadcast the result to the issue queue entries
  – The physical destination register number is broadcast as the tag
• On committing an instruction:
  – Mark the physical destination register as committed
  – Free the physical register that corresponds to the old mapping of the same architectural register
• On a misprediction/exception:
  – Roll back the map state to what it was when the exception-causing instruction was renamed
  – To do this, instructions must be associated with map entries
  – This is done using inums: each instruction is given an 8-bit unique identifier during register mapping

Physical Register States
• 4 states
• Initially, the n architectural registers are in the architectural register (AR) state
• The rest are available
• When an instruction with a destination register is issued, one of the available registers is allocated as a rename buffer (RB)
• When the instruction finishes execution, its state is set to valid
• On instruction commit, its state is set to AR and the old AR mapping is reclaimed
Source: Sima, D. The Design Space of Register Renaming Techniques. IEEE Micro, September/October 2000.
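The rename events above (decode, completion, commit, and rollback) can be modeled compactly with a map table, a free list, and a silo of saved mappings. The sketch below is an illustrative approximation under stated assumptions: it snapshots the entire map per renamed instruction rather than using the 21264's CAM-based map storage keyed by inums, and it omits the scoreboard and the issue-queue tag broadcast. The register counts follow the slides (31 architectural plus 41 extra).

```python
# Toy model of merged-register-file renaming with rollback. The "silo" here
# stores a full copy of the map for every renamed instruction, which is a
# simplification of the hardware's inum-indexed map storage.

NUM_ARCH = 31
NUM_PHYS = NUM_ARCH + 41          # 31 architectural + 41 extra physical registers

class RenameMap:
    def __init__(self):
        self.map = {r: r for r in range(NUM_ARCH)}    # architectural -> physical
        self.free = list(range(NUM_ARCH, NUM_PHYS))   # currently available physical regs
        self.silo = {}                                # inum -> saved map (for rollback)
        self.old_dest = {}                            # inum -> previous mapping of its dest

    def rename(self, inum, srcs, dst):
        """Decode: look up the sources, allocate a new physical reg for the destination."""
        phys_srcs = [self.map[s] for s in srcs]
        self.silo[inum] = dict(self.map)              # snapshot taken at rename time
        self.old_dest[inum] = self.map[dst]
        self.map[dst] = self.free.pop(0)
        return phys_srcs, self.map[dst]

    def commit(self, inum):
        """Commit: the destination's previous physical register can now be reused."""
        self.free.append(self.old_dest.pop(inum))
        self.silo.pop(inum, None)

    def rollback(self, inum):
        """Misprediction/exception: restore the map as it was when inum was renamed."""
        self.map = self.silo[inum]
        # A fuller model would also return the squashed instructions' physical
        # registers to the free list; omitted here for brevity.


if __name__ == "__main__":
    rm = RenameMap()
    srcs, dst = rm.rename(inum=0, srcs=[1, 2], dst=3)   # e.g. ADDQ r1, r2 -> r3
    print("sources ->", srcs, " destination renamed to ->", dst)
    rm.commit(0)                                        # old mapping of r3 is freed
```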
Integer Issue Queues - Clustering
• 20 entries; issues a maximum of 4 instructions per cycle
• Two arbiters pick the instructions that will issue (one for the upper subclusters, one for the lower subclusters)
• Each queue entry asserts a request to an arbiter when it contains an instruction that can be executed by that subcluster (i.e., its operand values are available within that subcluster)
• 4 request signals (U0, U1, L0, L1)
• The arbiters choose among simultaneous requesters of a subcluster based on the age of the request: older instructions are given priority
• Each arbiter picks 2 of the possible 20 requesters for service
• A given instruction can request only the upper or only the lower subclusters (load balancing based on the assignment done in stage 1)
• Subcluster assignment is static (stage 1); cluster selection at issue is dynamic (stage 2)

Integer/FP Execution Pipes
Source: IEEE Micro, March-April 1999
• Integer cluster communication latency: 1 cycle
• Advantages of clustering:
  – Fewer read/write ports on the register file
  – The register file will not be a cycle-time limiter
• FP issue queue:
  – 15 entries
  – 2 instructions/cycle

Memory References
• Load queue (LDQ)
  – Reorder buffer for loads
  – 32 entries, in order
  – Maintains the state of loads that have issued but not yet retired
• Store queue (STQ)
  – Reorder buffer for stores
  – 32 entries, in order
  – Maintains the state of stores that have issued but not yet been written to the data cache
  – Holds the data associated with store instructions
  – Forwards data to younger matching loads
• Miss address file (MAF)
  – Holds the physical addresses associated with pending L1 cache misses (instruction or data)
  – Maximum of 8 misses outstanding to the off-chip memory system

Load/Store Ordering
• New memory references check their address and age against other in-flight references.
• For example, when a store issues:
  – The LDQ compares the store's address against the addresses of younger loads (CAM search)
  – If the older store issues to the same memory address as an already-issued younger load, the LDQ squashes the load and initiates recovery
• When a load is ready to issue:
  – The STQ compares the load's address against the addresses of older stores
  – If a match is found:
    • If the store data is available, the STQ forwards it to the load
    • Otherwise, load issue is delayed until the store data becomes available

Load/Store Ordering
• When a load is ready to issue:
  – If an older store with an as-yet-unknown address exists in the STQ:
    • Predict that the ready load will not access the same memory location, unless this load was incorrectly ordered before (check the load wait table)
  – Exposes more ILP when the prediction is correct
  – In case of misprediction:
    • Minimum 14-cycle penalty
    • Initiate recovery: the load and all subsequent instructions are squashed and re-executed
    • Mark the load in the load wait table so that next time it waits until all older stores have computed their addresses

Load/Store Ordering Example
Source: IEEE Micro, March-April 1999

Features of the Memory System
• Data cache
  – 64 KB, 2-way set-associative, virtually indexed and physically tagged (translation proceeds in parallel with the cache access)
  – Write-back, read/write allocate
  – 64-byte block size, plus ECC bits
  – Prevents synonym problems by not allowing different virtual addresses that map to the same physical address to co-exist in the cache
  – Load hit/miss prediction to minimize load-use latency (the data cache access completes 3 cycles after issue, plus 1 cycle to return the hit/miss signal to the issue queue)
• Victim buffer (victim address and data files)
  – Contains evicted L1 (data and instruction) and L2 cache lines
  – 8 entries, serial access
• Off-chip L2 cache
  – Minimum data cache miss latency: 13 cycles
  – Up to 16 MB
  – Dedicated access path to the L2 cache

Overall System Diagram
Source: Microprocessor Report, 10/28/96

The Processor Itself

References
• R. E. Kessler. The Alpha 21264 Microprocessor. IEEE Micro, March/April 1999.
• D. Leibholz and R. Razdan. The Alpha 21264: A 500 MHz Out-of-Order Execution Microprocessor. COMPCON '97, 1997.
• Compaq Computer Corporation. Alpha 21264/EV6 Hardware Reference Manual.
• R. Kessler, E. McLellan, and D. Webb. The Alpha 21264 Microprocessor Architecture. International Conference on Computer Design, October 1998.
• B. A. Gieseke et al. A 600 MHz Superscalar RISC Microprocessor with Out-of-Order Execution. International Solid-State Circuits Conference, 1997.
• L. Gwennap. Digital 21264 Sets New Standard. Microprocessor Report, October 28, 1996.
• Dezso Sima. The Design Space of Register Renaming Techniques. IEEE Micro, September/October 2000.
• P. E. Gronowski et al. High Performance Microprocessor Design. IEEE Journal of Solid-State Circuits, May 1998.
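As a closing illustration of the Load/Store Ordering slides above, the behavioral sketch below models the three mechanisms they describe: the LDQ search a store performs against already-issued younger loads, the STQ search a load performs against older stores for forwarding, and the load wait table consulted when an older store's address is still unknown. Queue sizes, partial-overlap cases, and the recovery machinery are deliberately simplified assumptions; ages are plain integers and addresses are compared in full.

```python
# Behavioral sketch of 21264-style load/store ordering (see the Load/Store
# Ordering slides above). Program-order ages are monotonically increasing
# integers; smaller age = older instruction.

class MemoryOrdering:
    def __init__(self):
        self.ldq = []            # issued loads: (age, addr)
        self.stq = {}            # in-flight stores: age -> (addr or None, data or None)
        self.wait_table = set()  # PCs of loads that previously issued too early

    def enter_store(self, age):
        """A store enters the STQ; its address and data are not known yet."""
        self.stq[age] = (None, None)

    def issue_store(self, age, addr, data):
        """A store computes its address: find younger loads that already read it."""
        self.stq[age] = (addr, data)
        return [l_age for (l_age, l_addr) in self.ldq
                if l_age > age and l_addr == addr]      # caller squashes and replays these

    def issue_load(self, age, addr, pc):
        """A load issues: forward from the youngest older matching store, if possible."""
        older = [(s_age, s) for s_age, s in self.stq.items() if s_age < age]
        if pc in self.wait_table and any(s_addr is None for _, (s_addr, _) in older):
            return "wait"                               # marked load waits for older store addresses
        matches = [(s_age, d) for s_age, (s_addr, d) in older if s_addr == addr]
        if matches:
            _, data = max(matches)                      # youngest older matching store wins
            if data is None:
                return "wait"                           # store data not yet available
            self.ldq.append((age, addr))
            return data                                 # forwarded from the STQ
        self.ldq.append((age, addr))
        return "read D-cache"

    def on_order_violation(self, pc):
        """After a squash (minimum 14-cycle penalty), train the load wait table."""
        self.wait_table.add(pc)
```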