Alpha 21264 Microarchitecture

Onur/Aditya, 11/6/2001

Key Features of 21264
• Introduced in February 1998 at 500 MHz
• 15M transistors, 2.2V, 0.35-micron, 6-metal-layer CMOS process
• Implements the 64-bit Alpha ISA
• Out-of-order execution (unlike the 21164)
• 4-wide fetch (like the 21164)
• Maximum of 6 instructions/cycle execution bandwidth
• 7-stage pipeline
• Hybrid two-level branch prediction (tournament predictor)
• Clustered integer pipeline
• 80 in-flight instructions

Overview of the Presentation
• Overview of the 21264 pipeline
• Fetch and branch prediction mechanism
• Register renaming in the 21264
• Clustering
• Memory system

21264 Pipelines
Source: Microprocessor Report, 10/28/96

Pipeline Structure
Source: IEEE Micro, March-April 1999

Instruction Fetch Mechanism
• Two features:
  – Line and way prediction
  – Branch prediction
• The line/way predictor predicts the I-cache line and way that will be accessed in the next cycle
• Line/way prediction takes the branch predictor out of the critical fetch loop
• On cache fills, the line predictor entry for each line points to the next sequential fetch line
• The line predictor is later trained by the branch predictor
• In effect, the line/way predictor acts as a very fast BTB
• The line predictor's prediction is verified in stage 1 (instruction slot). If the line/way prediction is incorrect, the slot stage is flushed and the PC generated from the branch predictor information is used to redirect fetch

Fetch - Line and Way Prediction
Source: IEEE Micro, March-April 1999

Branch Prediction Mechanism
• Hybrid branch predictor
• Global predictor:
  – Good for inter-correlated branches
  – Indexed by the global path history register (taken/not-taken status of the last 12 branches)
  – 4K-entry table of 2-bit counters
• Local predictor:
  – Good for self-correlated branches
  – 10 bits of the PC index a per-address local history table, which in turn indexes a 1K-entry table of 3-bit counters
  – Aliasing among branches is a problem
• Choice predictor:
  – Decides which predictor to use
  – Indexed by the global path history register
  – 4K-entry table of 2-bit counters

Branch Prediction Mechanism
Source: Microprocessor Report, 10/28/96
• Minimum branch penalty: 7 cycles
• Typical branch penalty: 11+ cycles (issue queue delay)
• 48K bits of target addresses stored in the I-cache
• 32-entry return address stack
• Predictor tables are reset on a context switch
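The line/way prediction mechanism above can be illustrated with a small behavioral model: a table indexed by the current I-cache (line, way) whose entry names the next fetch (line, way), initialized to sequential fetch on a cache fill and retrained from the branch predictor when the slot-stage check fails. This is only a sketch of the idea; the table size and method names are illustrative assumptions, not the 21264's actual parameters.

```python
# Simplified model of a line/way predictor: each I-cache (line, way) entry
# holds a guess of the next (line, way) to fetch. The guess is verified in
# the slot stage and retrained from the branch predictor on a mispredict.

class LineWayPredictor:
    def __init__(self, num_lines=512, num_ways=2):
        self.num_lines = num_lines
        # On a cache fill, each entry points at the next sequential line, way 0.
        self.table = {
            (line, way): ((line + 1) % num_lines, 0)
            for line in range(num_lines)
            for way in range(num_ways)
        }

    def predict(self, line, way):
        """Guess the (line, way) that will be fetched in the next cycle."""
        return self.table[(line, way)]

    def train(self, line, way, correct_line, correct_way):
        """Retrain from the branch predictor when the guess was wrong."""
        self.table[(line, way)] = (correct_line, correct_way)


if __name__ == "__main__":
    lp = LineWayPredictor()
    print(lp.predict(10, 0))      # initially (11, 0): sequential fetch
    lp.train(10, 0, 200, 1)       # a taken branch redirects fetch to line 200, way 1
    print(lp.predict(10, 0))      # now (200, 1)
```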
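The predictor sizes listed above (a 12-bit path history indexing 4K 2-bit global and 4K 2-bit choice counters, and 10 PC bits selecting a 10-bit local history that indexes 1K 3-bit counters) translate directly into a toy software model. The sketch below is a functional approximation only, assuming simple bit truncation for the index functions and updating all tables when the branch resolves; it ignores the real design's speculative history updates and context-switch resets.

```python
# Toy model of a 21264-style tournament predictor:
#   global: 12-bit branch history -> 4K-entry table of 2-bit counters
#   local:  10 PC bits -> 10-bit per-branch history -> 1K-entry table of 3-bit counters
#   choice: 12-bit branch history -> 4K-entry table of 2-bit counters (global vs. local)

class TournamentPredictor:
    def __init__(self):
        self.ghist = 0                   # taken/not-taken status of the last 12 branches
        self.global_ctr = [1] * 4096     # 2-bit counters (0..3), predict taken if >= 2
        self.local_hist = [0] * 1024     # 10-bit local histories, indexed by PC bits
        self.local_ctr = [3] * 1024      # 3-bit counters (0..7), predict taken if >= 4
        self.choice_ctr = [1] * 4096     # 2-bit counters: >= 2 means "trust global"

    def predict(self, pc):
        g = self.global_ctr[self.ghist] >= 2
        l = self.local_ctr[self.local_hist[(pc >> 2) & 0x3FF]] >= 4
        return g if self.choice_ctr[self.ghist] >= 2 else l

    def update(self, pc, taken):
        li = (pc >> 2) & 0x3FF
        lh = self.local_hist[li]
        g = self.global_ctr[self.ghist] >= 2
        l = self.local_ctr[lh] >= 4
        # Train the choice predictor only when the two components disagree.
        if g != l:
            c = self.choice_ctr[self.ghist]
            self.choice_ctr[self.ghist] = min(3, c + 1) if g == taken else max(0, c - 1)
        # Train the component counters toward the actual outcome.
        gc, lc = self.global_ctr[self.ghist], self.local_ctr[lh]
        self.global_ctr[self.ghist] = min(3, gc + 1) if taken else max(0, gc - 1)
        self.local_ctr[lh] = min(7, lc + 1) if taken else max(0, lc - 1)
        # Shift the outcome into the local and global histories.
        self.local_hist[li] = ((lh << 1) | int(taken)) & 0x3FF
        self.ghist = ((self.ghist << 1) | int(taken)) & 0xFFF
```

A fetch-time caller would invoke predict(pc) in stage 0 and update(pc, taken) when the branch resolves.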
Instruction Slotting
• Check the line predictor's prediction
• The branch predictor compares the next cache index it generates with the one generated by the line predictor
• Determine which subclusters integer instructions will go to
• Some subclusters are specialized (resource constraints)
• Perform load balancing across the subclusters

Register Renaming
• 31 integer and 31 FP architectural registers
• 41 integer and 41 FP extra physical registers
• Uses a merged rename and architectural register file, one for integer and one for FP
• The same physical register holds the result of an instruction before and after commit
• No separate architectural register file (no data copying on commit)
• The register map table stores the current mappings of architectural registers
• A map silo contains old mappings from up to 20 previous decode cycles (used in case of misprediction)

Register Renaming Logic
Source: Presentation by R. Kessler, August 1998
• On decoding an instruction:
  – Search the map CAMs for the source registers
  – Find the physical registers currently holding the values of the architectural source registers
  – Access the free physical register list
  – Map a free physical register to the architectural destination register
• On completing an instruction:
  – Write the result into the physical destination register
  – Mark the physical destination register as valid in the register scoreboard
  – Broadcast the result to the issue queue entries
  – The physical destination register number is broadcast as the tag
• On committing an instruction:
  – Mark the physical destination register as committed
  – Free the physical register that corresponds to the old mapping of the same architectural register
• On a misprediction/exception:
  – Roll back the map state to what it was when the exception-causing instruction was renamed
  – To do this, instructions must be associated with map entries
  – This is done using inums: each instruction is given an 8-bit unique identifier during register mapping

Physical Register States
• 4 states
• Initially, the n architectural registers are in the architectural register (AR) state
• The rest are available
• When an instruction with a destination register is issued, one of the available registers is allocated as a rename buffer (RB)
• When the instruction finishes execution, its state is set to valid
• On instruction commit, its state is set to AR and the old AR mapping is reclaimed
Source: Sima, D. The Design Space of Register Renaming Techniques. IEEE Micro, September/October 2000.
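The rename events above (decode, completion, commit, and rollback) can be modeled compactly with a map table, a free list, and a silo of saved mappings. The sketch below is an illustrative approximation under stated assumptions: it snapshots the entire map per renamed instruction rather than using the 21264's CAM-based map storage keyed by inums, and it omits the scoreboard and the issue-queue tag broadcast. The register counts follow the slides (31 architectural plus 41 extra).

```python
# Toy model of merged-register-file renaming with rollback. The "silo" here
# stores a full copy of the map for every renamed instruction, which is a
# simplification of the hardware's inum-indexed map storage.

NUM_ARCH = 31
NUM_PHYS = NUM_ARCH + 41          # 31 architectural + 41 extra physical registers

class RenameMap:
    def __init__(self):
        self.map = {r: r for r in range(NUM_ARCH)}    # architectural -> physical
        self.free = list(range(NUM_ARCH, NUM_PHYS))   # currently available physical regs
        self.silo = {}                                # inum -> saved map (for rollback)
        self.old_dest = {}                            # inum -> previous mapping of its dest

    def rename(self, inum, srcs, dst):
        """Decode: look up the sources, allocate a new physical reg for the destination."""
        phys_srcs = [self.map[s] for s in srcs]
        self.silo[inum] = dict(self.map)              # snapshot taken at rename time
        self.old_dest[inum] = self.map[dst]
        self.map[dst] = self.free.pop(0)
        return phys_srcs, self.map[dst]

    def commit(self, inum):
        """Commit: the destination's previous physical register can now be reused."""
        self.free.append(self.old_dest.pop(inum))
        self.silo.pop(inum, None)

    def rollback(self, inum):
        """Misprediction/exception: restore the map as it was when inum was renamed."""
        self.map = self.silo[inum]
        # A fuller model would also return the squashed instructions' physical
        # registers to the free list; omitted here for brevity.


if __name__ == "__main__":
    rm = RenameMap()
    srcs, dst = rm.rename(inum=0, srcs=[1, 2], dst=3)   # e.g. ADDQ r1, r2 -> r3
    print("sources ->", srcs, " destination renamed to ->", dst)
    rm.commit(0)                                        # old mapping of r3 is freed
```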
Integer Issue Queues - Clustering
• 20 entries; issues a maximum of 4 instructions per cycle
• Two arbiters pick the instructions that will issue (one for the upper subclusters, one for the lower subclusters)
• Each queue entry asserts a request to an arbiter when it contains an instruction that can be executed by that subcluster (i.e., its operand values are available within that subcluster)
• 4 request signals (U0, U1, L0, L1)
• The arbiters choose among simultaneous requesters of a subcluster based on the age of the request: older instructions are given priority
• Each arbiter picks 2 of the possible 20 requesters for service
• A given instruction can request only the upper or only the lower subclusters (load balancing based on the assignment done in stage 1)
• Subcluster assignment is static (stage 1); cluster selection at issue is dynamic (stage 2)

Integer/FP Execution Pipes
Source: IEEE Micro, March-April 1999
• Integer cluster communication latency: 1 cycle
• Advantages of clustering:
  – Fewer read/write ports on the register file
  – The register file will not be a cycle-time limiter
• FP issue queue:
  – 15 entries
  – 2 instructions/cycle

Memory References
• Load queue (LDQ)
  – Reorder buffer for loads
  – 32 entries, in order
  – Maintains the state of loads that have issued but not yet retired
• Store queue (STQ)
  – Reorder buffer for stores
  – 32 entries, in order
  – Maintains the state of stores that have issued but not yet been written to the data cache
  – Holds the data associated with store instructions
  – Forwards data to younger matching loads
• Miss address file (MAF)
  – Holds the physical addresses associated with pending L1 cache misses (instruction or data)
  – Maximum of 8 misses outstanding to the off-chip memory system

Load/Store Ordering
• New memory references check their address and age against other in-flight references.
• For example, when a store issues:
  – The LDQ compares the store's address against the addresses of younger loads (CAM search)
  – If the older store issues to the same memory address as an already-issued younger load, the LDQ squashes the load and initiates recovery
• When a load is ready to issue:
  – The STQ compares the load's address against the addresses of older stores
  – If a match is found:
    • If the store data is available, the STQ forwards it to the load
    • Otherwise, load issue is delayed until the store data becomes available

Load/Store Ordering
• When a load is ready to issue:
  – If an older store with an as-yet-unknown address exists in the STQ:
    • Predict that the ready load will not access the same memory location, unless this load was incorrectly ordered before (check the load wait table)
  – Exposes more ILP when the prediction is correct
  – In case of misprediction:
    • Minimum 14-cycle penalty
    • Initiate recovery: the load and all subsequent instructions are squashed and re-executed
    • Mark the load in the load wait table so that next time it waits until all older stores have computed their addresses

Load/Store Ordering Example
Source: IEEE Micro, March-April 1999

Features of the Memory System
• Data cache
  – 64 KB, 2-way set-associative, virtually indexed and physically tagged (translation proceeds in parallel with the cache access)
  – Write-back, read/write allocate
  – 64-byte block size, plus ECC bits
  – Prevents synonym problems by not allowing different virtual addresses that map to the same physical address to co-exist in the cache
  – Load hit/miss prediction to minimize load-use latency (the data cache access completes 3 cycles after issue, plus 1 cycle to return the hit/miss signal to the issue queue)
• Victim buffer (victim address and data files)
  – Contains evicted L1 (data and instruction) and L2 cache lines
  – 8 entries, serial access
• Off-chip L2 cache
  – Minimum data cache miss latency: 13 cycles
  – Up to 16 MB
  – Dedicated access path to the L2 cache

Overall System Diagram
Source: Microprocessor Report, 10/28/96

The Processor Itself

References
• R. E. Kessler. The Alpha 21264 Microprocessor. IEEE Micro, March/April 1999.
• D. Leibholz and R. Razdan. The Alpha 21264: A 500 MHz Out-of-Order Execution Microprocessor. COMPCON '97, 1997.
• Compaq Computer Corporation. Alpha 21264/EV6 Hardware Reference Manual.
• R. Kessler, E. McLellan, and D. Webb. The Alpha 21264 Microprocessor Architecture. International Conference on Computer Design, October 1998.
• B. A. Gieseke et al. A 600 MHz Superscalar RISC Microprocessor with Out-of-Order Execution. International Solid-State Circuits Conference, 1997.
• L. Gwennap. Digital 21264 Sets New Standard. Microprocessor Report, October 28, 1996.
• Dezso Sima. The Design Space of Register Renaming Techniques. IEEE Micro, September/October 2000.
• P. E. Gronowski et al. High Performance Microprocessor Design. IEEE Journal of Solid-State Circuits, May 1998.
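As a closing illustration of the Load/Store Ordering slides above, the behavioral sketch below models the three mechanisms they describe: the LDQ search a store performs against already-issued younger loads, the STQ search a load performs against older stores for forwarding, and the load wait table consulted when an older store's address is still unknown. Queue sizes, partial-overlap cases, and the recovery machinery are deliberately simplified assumptions; ages are plain integers and addresses are compared in full.

```python
# Behavioral sketch of 21264-style load/store ordering (see the Load/Store
# Ordering slides above). Program-order ages are monotonically increasing
# integers; smaller age = older instruction.

class MemoryOrdering:
    def __init__(self):
        self.ldq = []            # issued loads: (age, addr)
        self.stq = {}            # in-flight stores: age -> (addr or None, data or None)
        self.wait_table = set()  # PCs of loads that previously issued too early

    def enter_store(self, age):
        """A store enters the STQ; its address and data are not known yet."""
        self.stq[age] = (None, None)

    def issue_store(self, age, addr, data):
        """A store computes its address: find younger loads that already read it."""
        self.stq[age] = (addr, data)
        return [l_age for (l_age, l_addr) in self.ldq
                if l_age > age and l_addr == addr]      # caller squashes and replays these

    def issue_load(self, age, addr, pc):
        """A load issues: forward from the youngest older matching store, if possible."""
        older = [(s_age, s) for s_age, s in self.stq.items() if s_age < age]
        if pc in self.wait_table and any(s_addr is None for _, (s_addr, _) in older):
            return "wait"                               # marked load waits for older store addresses
        matches = [(s_age, d) for s_age, (s_addr, d) in older if s_addr == addr]
        if matches:
            _, data = max(matches)                      # youngest older matching store wins
            if data is None:
                return "wait"                           # store data not yet available
            self.ldq.append((age, addr))
            return data                                 # forwarded from the STQ
        self.ldq.append((age, addr))
        return "read D-cache"

    def on_order_violation(self, pc):
        """After a squash (minimum 14-cycle penalty), train the load wait table."""
        self.wait_table.add(pc)
```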