Design Objectives of the 0.35Μm Alpha 21164 Microprocessor (A 500Mhz Quad Issue RISC Microprocessor)

Digital Semiconductor Design Objectives of the 0.35µm Alpha 21164 Microprocessor (A 500MHz Quad Issue RISC Microprocessor) Gregg Bouchard Digital Semiconductor Digital Equipment Corporation Hudson, MA 1 Digital Semiconductor Outline • 0.35µm Alpha 21164 Overview • Design Goals • Technology Issues • 21164 Internal Architecture Review • Architectural Enhancements • System Performance Enhancements • Alpha Processor RoadMap • Summary 2 Digital Semiconductor 0.35µm ALPHA 21164 • Technology shrink of 0.5µm design + Architectural Enhancements + System Performance Enhancements 0.5 µm process 0.35 µm process Transistor Count 9.3 Million 9.66 Million Die Size 16.5 mm x 18.1 mm 14.4 mm x 14.5 mm Power Supply 3.3V external 3.3V external 3.3V internal 2.5V internal WC Power Dissipation 50W @ 300 MHz 37W @ 433 MHz Target Cycle Time 300 MHz 433 MHz 3 Digital Semiconductor Design Goals • Reduced Cost – Die size reduction – Remove processing steps • Higher Performance – 433 MHz design target – Architectural enhancements • Reduced Power – Lower Core Operating Voltage (2.0v-2.5v) • Time to Market – Significant leverage from previous design 4 Digital Semiconductor Die Size Analysis X-dim Y-dim NormA Original Die 16.5mm x 18.1mm Original 16.5 18.1 1.00 298 mm2 25% shrink - 4.1 - 4.5 12.4 13.6 0.56 Conversion +0.0 +0.8 12.6 14.5 0.62 Actual shrink 14.4mm x 14.5mm LI Removal +1.8 +0.0 209 mm2 14.4 14.5 0.70 Actual 14.4 14.5 0.70 Ideal 30% shrink 11.5mm x 12.7mm 2 Ideal 30% 11.5 12.7 0.49 146 mm 5 Digital Semiconductor Layout Conversion Strategy • Full 30% linear shrink was not possible • Solution: – 25% linear shrink – Semi-automated conversion of design rules – Polysilicon mask layer pushed to full 30% shrink dimensions for performance • Redesign of caches – Local Interconnect removed 6 Digital Semiconductor Layout Conversion Example Well Well Plug M1C M2C DRC Diffusion Poly Metal 1 Metal 2 Errors 7 Original After Script Final Digital Semiconductor Speed • Single-wire, two phase clocking scheme • 14 gates per cycle including latches • Single global clock grid – Global clock skew<90ps – Local clock skew<25ps • Clock statistics (0.5 µm design) – Clock load = 3.75 nF – Size of final clock inverter = 58 cm – Edge rate = 0.5 ns – Clocking consumes 40% of chip power – di/dt = 50 A 8 Digital Semiconductor Speed Verification • Test circuits were used to determine the speed scaling of different circuit configurations – Predicted average process speed up – Identified “slow” circuit types • New circuits were evaluated in SPICE • Chip sections with major modifications were completely re-verified 9 Digital Semiconductor Power • Significant Power Reduction • Vddi = 2.2v (to 2.5v) • 3.3V only interface to external devices 0.5µm 3.3v 0.35µm 2.5v 50 40 I/O 30 Watts Clock 20 Core 10 0 MHz 300 366 433 500 10 L1 I Cache Pre- CLK I-Box Drivers L2 F- E- L2 Cache Box Box Cache Pad Ring Main CLK (I/O) Drivers M-Box L1 D Cache L2 C-Box Tags Digital Semiconductor Alpha 21164 Features Key Attributes • 4-way issue superscalar – Up to 2 Integer AND 2 Floating Point instructions issued per CPU cycle. • Large on-chip L2 cache – 96KB, writeback, 3-way set associative • Fully Pipelined – 7-stage integer pipeline – 9-stage floating point pipeline • Emphasis on low latency at high clock rate • High-throughput memory subsystem 12 Digital Semiconductor Alpha 21164 Block Diagram Instruction Unit Execution Units Memory Unit Integer Merge 40b Unit Logic Address Write- Four- Integer back Unit Bus Instruction way Write- L2 Interface L3 Cache Cache Issue Through Cache (8KB) Unit Unit FP Adder Data (96KB) Cache (8KB) FP Mult. 128b Data 128-bit internal data bus ITB: 48 entry DTB: 64 entry 13 Digital Semiconductor Instruction Issue Pipeline Review Next Issue 2kx2 Conflict branch PC Checker prediction 0 to floating point Instruction multiply pipeline Buffer Instruction to floating point Cache add pipeline (8KB) X to integer pipeline 0 to integer 1 pipeline 1 Prefetch Buffer Instruction Instruction Slot Issue S0 S1 S2 S3 14 Digital Semiconductor Execution Pipeline Review Int mul Integer Pipeline 0: arith, logical, ld/st, shift Int Regs Integer Pipeline 1: arith, logical, ld, br/jmp FP div FP Pipeline 0: add, subtract, FP compare, FP br Regs FP Pipeline 1: multiply S3 S4 S5 S6 S7 S8 15 Digital Semiconductor On-chip Cache Resources Review 2 System 6 Data 6 Data 3 Instr. Commands Reads Writes Reads 128 bits/cycle S6 S7 S8 S9 96K, 3 Set, Writeback L2 Cache System data Victim 0 data Miss Addr 0 Victim Addr 0 Victim 1 data Miss Addr 1 Victim Addr 1 40 bit PA 16 Digital Semiconductor L3 Cache (off-chip) • L3 cache is a direct-mapped writeback superset of on- Command/Addr 21164 chip L2 cache Index • Up to 2 reads (or Bus L3 L3 Address outstanding read Tag Data File commands) in L3 cache • Programmable wave 128-bit data bus pipelining for L3 cache • Support for Synchronous Flow-Thru SRAMs • L3 cache is optional 17 Digital Semiconductor Off-Chip L3 Cache Options Selectable via on-chip programmable registers u Cache Size - 1 to 64M Byte u Cache Read/Write Speed - 4 to 15 cpu cycles u Read to Write Spacing - 1 to 7 cpu cycles u Write Pulse (Bit Mask) - Up to 9 cpu cycles u Wave pipelining - 0 to 3 cpu cycles u Support for Synchronous SRAM’s 18 Digital Semiconductor Architectural Enhancements New Instructions • Scalar support for Byte and Short data types – LDBU, LDWU load an unsigned byte or short – STB, STW, store an byte or short – SEXTB, SEXTW sign extend a byte or short • Eases porting of device drivers to Alpha • Improves emulation of Intel code on Alpha • Implemented in this and all future Alpha microprocessors 19 Digital Semiconductor Architectural Enhancements New Instructions (continued) •IMPLVER – returns a small integer indicating the core design – used for code scheduling decisions – implemented in all Alpha microprocessors •AMASK – clears bits to indicate which features are present – implemented in all Alpha microprocessors Example: AMASK #1, R0 ;byte/word present BNE R0, emulate ;if not emulate 20 ... Digital Semiconductor System Performance Enhancements 1 21164 oe_we_active_low (eliminates off-chip buffering) big_drv_en (50% more drive) Bcache index L 1 tag_oe,we Tag Store H data_oe,we 0 2 Data Store st_clk1,2 (additional pin) tag_data 2 data 128 3 (1x clock mode) clk_mode<2> I E 4 Bus utilization improvements (Auto Dack) M cmd Write Write addr A0 A1 S O data D00 D01 D02 D03 D10 D11 dack 5 Cache Coherence 21 Digital Semiconductor Pre-Silicon Logic Verification • Used extensive test suite developed for original chip as testing baseline • Random and focused testing • Coverage analysis to ensure excellent test coverage • Three simulation systems used: –RTL – Transistor-level – Gate-level 22 Digital Semiconductor Alpha Microprocessor Road Map 25 20 15 21164 -500MHz 21164 -433MHz 21164 -400MHz SpecInt95 10 21164 -366MHz 21164 -333MHz 21164 -300MHz 5 21064A -275MHz 21064 -200MHz 21064 -150MHz 0 23 1992 1993 1994 1995 1996 1997 1998 Digital Semiconductor Summary • FIRST PASS SILICON - October 1995 • Booted first operating system in 2 days! • Continued Performance Leadership (4+ years) • Look for next generation 30+ SpecInt95 by Q3’97 24 Digital Semiconductor Acknowledgments Peter J. Bannon, Michael S. Bertone, Randel P. Blake Campos, William J. Bowhill, David A. Carlson, Ruben W. Castelino, Dale R. Donchin, Richard M. Fromm, Mary K. Gowan, Paul E. Gronowski, Anil K. Jain, Bruce J. Loughlin, Shekhar Mehta, Jeanne E. Meyer, Robert O. Mueller, Andy Olesin, Tung N. Pham, Ronald P. Preston, Paul I. Rubinfeld 25.

Design Objectives of the 0.35Μm Alpha 21164 Microprocessor (A 500Mhz Quad Issue RISC Microprocessor)

The Design and Verification of the Alphastation 600 5-Series Workstation by John H

Computer Organization EECC 550 • Introduction: Modern Computer Design Levels, Components, Technology Trends, Register Transfer Week 1 Notation (RTN)

Computer Architectures an Overview

VAX VMS at 20

The Alphaserver 4100 Cached Processor Module Maurice B

Data Caches for Superscalar Processors*

Zarka Cvetanovic and R.E. Kessler Compaq Computer Corporation

MICROPROCESSOR EVALUATIONS for SAFETY-CRITICAL, REAL-TIME February 2009 APPLICATIONS: AUTHORITY for EXPENDITURE NO

The Computer History Simulation Project

Digital Semiconductor Alpha 21164 Microprocessor Product Brief

Performance of Various Computers Using Standard Linear Equations Software

Clock Distribution Lecture 25 Clock Distribution Power Distribution