The Supersparc Microprocessor
Total Page:16
File Type:pdf, Size:1020Kb
The SuperSPARC™ Microprocessor Technical White Paper 2550 Garcia Avenue Mountain View, CA 94043 U.S.A. © 1992 Sun Microsystems, Inc.—Printed in the United States of America. 2550 Garcia Avenue, Mountain View, California 94043-1100 U.S.A All rights reserved. This product and related documentation is protected by copyright and distributed under licenses restricting its use, copying, distribution and decompilation. No part of this product or related documentation may be reproduced in any form by any means without prior written authorization of Sun and its licensors, if any. Portions of this product may be derived from the UNIX® and Berkeley 4.3 BSD systems, licensed from UNIX Systems Laboratories, Inc. and the University of California, respectively. Third party font software in this product is protected by copyright and licensed from Sun’s Font Suppliers. RESTRICTED RIGHTS LEGEND: Use, duplication, or disclosure by the government is subject to restrictions as set forth in subparagraph (c)(1)(ii) of the Rights in Technical Data and Computer Software clause at DFARS 252.227-7013 and FAR 52.227-19. The product described in this manual may be protected by one or more U.S. patents, foreign patents, or pending applications. TRADEMARKS Sun, Sun Microsystems, the Sun logo, are trademarks or registered trademarks of Sun Microsystems, Inc. UNIX and OPEN LOOK are registered trademarks of UNIX System Laboratories, Inc. All other product names mentioned herein are the trademarks of their respective owners. All SPARC trademarks, including the SCD Compliant Logo, are trademarks or registered trademarks of SPARC International, Inc. SPARCstation, SPARCserver, SPARCengine, SPARCworks, and SPARCompiler are licensed exclusively to Sun Microsystems, Inc. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc. The OPEN LOOK® and Sun™ Graphical User Interfaces were developed by Sun Microsystems, Inc. for its users and licensees. Sun acknowledges the pioneering efforts of Xerox in researching and developing the concept of visual or graphical user interfaces for the computer industry. Sun holds a non-exclusive license from Xerox to the Xerox Graphical User Interface, which license also covers Sun’s licensees who implement OPEN LOOK GUIs and otherwise comply with Sun’s written license agreements. X Window System is a trademark and product of the Massachusetts Institute of Technology. Please Recycle Contents 1. Introduction to the SuperSPARC Microprocessor . 1 2. Design Philosophy . 3 Design Criteria . 4 Superscalar . 4 Design Choices. 4 3. Microarchitecture . 7 Integer Unit. 7 Pipeline Organization . 7 Instruction Issue Strategy . 9 Branch Strategy . 10 Exception Strategy. 11 Floating-Point Unit . 12 Floating-Point Controller . 12 Floating-Point Adder . 13 Floating-Point Multiplier . 13 iii Performance . 14 Memory Hierarchy . 14 Cache and MMU Operation . 15 Bus Operation. 17 Direct MBus Interface Operation. 17 External Cache Interface . 18 4. Implementation. 19 Process. 19 Clocking . 20 5. Second-Level Cache Controller . 21 Introduction . 21 Multiple Processors . 21 Bus Support . 22 Processor Subsystem . 22 External Second-Level Cache . 23 Features . 24 A. References . 25 iv SuperSPARC Microprocessor—May 1992 Introduction to the SuperSPARC Microprocessor 1 ® The SuperSPARC™ microprocessor is a highly integrated superscalar SPARC microprocessor, fully compatible with the SPARC version 8 architecture [SI 1992]. This paper provides an overview of its architecture, internal operation, and capabilities. The processor contains on a single chip an integer unit, a double- precision floating-point unit, fully consistent instruction and data caches, and a SPARC Reference Memory Management Unit. In addition, a dual-mode bus interface is included that supports either the SPARC standard (MBus) [SI 1990] or an interface optimized for connection to a companion second-level cache controller chip. The chip, with an optional second-level cache controller, is targeted at a wide range of systems from uniprocessor desktop machines to multiprocessing file and compute servers. Both the microprocessor (3.1 million transistors) and the second-level cache controller (2.2 million transistors) are fabricated with Texas Instruments’ Epic2b process: a 0.8 micron, three-layer metal, BiCMOS process. These integrated circuits have die sizes of 16x16mm and 14x14mm respectively. This paper discusses the design strategy to produce a microprocessor suitable for a range of systems and the required trade-offs before focusing on the main functional blocks. Specifically, the operation of the integer pipeline, the floating point unit, and memory hierarchy are discussed. A chapter is devoted to the design and operation of the second-level cache controller. 1 1 2 SuperSPARC Microprocessor—May 1992 Design Philosophy 2 Compute-bound execution rate is a product of three factors: the number of compiler-generated instructions in an execution trace (NI), the sustained rate of instructions completed per cycle (IPC), and the clock rate made possible by the underlying integrated circuit technology [Patterson 1989]. High performance requires advances on all three fronts. All designs will benefit from better compilers and high clock rate. However, cost-effective systems cannot rely exclusively on high clock rate designs. To benefit from economies of scale, mainstream (rather than exotic) technology must be used. SuperSPARC-based systems provide high performance at moderate clock rates by optimizing the IPC. This presents design challenges for managing the processor pipeline and memory hierarchy. Average memory access time must be reduced and the number of instructions issued per cycle must rise; a superscalar architecture is required. 3 2 Design Criteria Superscalar Known bottlenecks to issuing multiple instructions include control dependencies, data dependencies, the number of ports to processor registers, the number of ports to the memory hierarchy, and the number of floating point pipelines. Our own studies suggest that many applications can benefit from a processor that issues three instructions per cycle. Table 2-1 Instruction Mix for Integer and Floating Point Applications Instruction Class Integer Floating Point Integer Arithmetic 50% 25% FP Arithmetic 0% 30% Loads 17% 25% Stores 8% 15% Branches 25% 5% Design Choices The above instruction mix presents many challenges for a three-scalar design executing code at the peak rate. Integer applications will access data memory every cycle, process branches every two cycles, and update memory every three cycles. Floating-point applications will access data memory, and execute floating point and integer arithmetic instructions every cycle. To mitigate these expected performance challenges, the following architectural decisions were made: • On-chip caches • A wide instruction fetch (128 bits) • Extensive forwarding paths • Overlapped memory access with floating point dispatch • Write buffer to decouple the processor from store completion Figure 2-1 shows the functional blocks of the microprocessor. 4 SuperSPARC Microprocessor—May 1992 2 fcontrol FP File fp queue FADD FMUL IU File FRND FNORM Scheduler Exception ALU Shifter group Handler Resource Allocation Forwarding Control Instruction Prefetch Queue 128 Shared MMU Instruction Cache Data Cache iptags Write Buffer dptags Cache Coherent Bus Interface Unit Address36 Data 64 Figure 2-1 Functional Block Diagram Design Philosophy 5 2 6 SuperSPARC Microprocessor—May 1992 Microarchitecture 3 Integer Unit This section describe the operation of the integer pipeline arithmetic logic units (ALUs), and strategies to deal with branches and exceptions. Pipeline Organization The SuperSPARC processor [ISSCC 1992], [Compcon 1992] uses a four-cycle integer pipeline. Each cycle has two-phases. The first cycle (F0, F1) fetches up to four instructions from the first-level instruction cache into an instruction queue. There are three phases of decode (D0, D1, D2), two stages of execution (EO, EI) and a writeback stage (WB). Table 3-1 Pipeline stages Stage Activity F0 I-cache ram access I-cache TLB lookup F1 I-cache match detect 4 instructions sent to the iqueue D0 Issue 1,2 or 3 instructions Select register file indices for LD/ST address registers D1 Read Rfile for LD/ST address registers Resource allocation for ALU instructions Evaluate branch target address 7 3 Table 3-1 Pipeline stages (Continued) Stage Activity D2 Read Rfile for ALU operands Add LD/ST address registers to form effective address E0 1st stage of ALU D cache RAM access D cache TLB lookup FP instruction dispatch E1 2nd stage of ALU D cache match detect Load data available Resolve exceptions WB Write back into the Rfile Retire store into the store buffer The operation of the pipeline is illustrated by the execution of a load instruction with a forwarding path in Figure 3-1. The pipeline stages are detailed in Table 3-1. clock f0f1 d0 d1d2 e0 e1 wb aluop GRP RDA EXEC f0 f1 d0 d1 d2 e0 e1 wb aluop GRP RDA f0 f1 d0 d1 d2 e0 e1 wb load GRP RDM ADDR MEM f0 f1 d0 d1 d2 e0 e1 wb aluop GRP RDA Figure 3-1 Pipelined Execution of Load with Forwarding 8 SuperSPARC Microprocessor—May 1992 3 The floating-point (FP) pipeline is tightly coupled to the integer pipeline. An operation may be started every cycle; latency of most FP operations is three cycles. In the E0 phase, one FP arithmetic instruction is selected for execution and its operands are read during E1. Two stages of execution latency are required for the double-precision FP adder and FP multiplier.