The Hp Pa-8000 Risc Cpu

Total Page:16

File Type:pdf, Size:1020Kb

The Hp Pa-8000 Risc Cpu . THE HP PA-8000 RISC CPU Ashok Kumar he PA-8000 RISC CPU is the first of a bits (40 bits on the PA-8000). A new mode new generation of Hewlett-Packard bit governs address formation, creating Hewlett-Packard Tmicroprocessors. Designed for high- increased flexibility. In 32-bit addressing end systems, it is among the world’s most mode, the processor can take advantage of powerful and fastest microprocessors. It fea- 64-bit computing instructions for faster tures an aggressive, four-way, superscalar throughput. In 64-bit addressing mode, 32- implementation, combining speculative exe- bit instructions and conditions are available cution with on-the-fly instruction reordering. for backward compatibility. The heart of the machine, the instruction In addition, the following extensions help reorder buffer, provides out-of-order execu- optimize performance for virtual memory tion capability. and cache management, branching, and Our primary design objective for the PA- floating-point operations: 8000 was to attain industry-leading perfor- mance in a broad range of applications. In • fast TLB (translation look-aside buffer) addition, we wanted to provide full support insertion instructions, for 64-bit applications. To make the PA-8000 • load and store instructions with 16-bit truly useful, we needed to ensure that the displacement, processor would not only achieve high bench- • memory prefetch instructions, mark performance but would sustain such • support for variable-size pages, performance in large, real-world applications. • halfword instructions for multimedia To achieve this goal, we designed large, exter- support, nal primary caches with the ability to hide • branches with 22-bit displacements, memory latency in hardware. We also imple- • branches with short pointers, mented dynamic instruction reordering in • branch prediction hinting, hardware to maximize instruction-level par- • floating-point multiply-and-accumulate allelism available to the execution units. instructions, and The PA-8000 connects to a high-bandwidth • multiple floating-point compare-result Runway system bus, a 768-Mbyte/s split- bits. transaction bus that allows each processor to generate multiple outstanding memory requests. The processor also provides glue- Key hardware features less support for up to four-way multipro- Since the PA-8000’s completely redesigned cessing via the Runway bus. The PA-8000 core uses no circuitry from previous- implements the new PA (Precision Architec- generation processors, we could design the This aggressive, four- ture) 2.0, a binary-compatible extension of new processor with any microarchitectural the previous PA-RISC architecture. All previ- features necessary to attain high performance. way, superscalar ous code executes on the PA-8000 without Figure 1 (next page) shows a functional-block recompilation or translation. diagram of the PA-8000’s basic control and microprocessor data paths. Architecture enhancements The chip’s most notable feature is the 56- combines speculative PA 2.0 incorporates a number of advanced entry instruction reorder buffer, to our microarchitectural enhancements, most sup- knowledge the industry’s largest, which execution with on-the- porting 64-bit computing. We widened the serves as the central control unit. It supports integer registers and functional units, includ- full register renaming for all instructions in fly instruction ing the shift/merge unit, to 64 bits. PA 2.0 the buffer and tracks instruction inter- supports flat virtual addressing up to 64 bits, dependencies to allow dataflow execution reordering. as well as physical addresses greater than 32 through the entire 56-instruction window. 0272-1732/97/$10.00 © 1997 IEEE March/April 1997 27 . HP PA-8000 Instruction System fetch unit Sort unit bus interface Instruction cache Runway bus 64-bit integer ALUs Instruction Address Shift/ reorder Load/store reorder merge buffer address Data buffer units adders cache ALU Memory 28 buffer buffer entries (28 (28 entries) entries) FMAC units Divide/ square Retire unit Rename root units registers Architecturally specified registers Rename registers Figure 1. Functional block diagram of the PA-8000. The PA-8000 executes at a peak rate of four instructions As pipelines get deeper and a processor’s parallelism per cycle, enabled by a large complement of computational increases, instruction fetch bandwidth and branch predic- units, shown at the left side of Figure 1. For integer opera- tion become increasingly important. To increase fetch band- tion, it includes two 64-bit integer ALUs and two 64-bit width and mitigate the effect of pipeline stalls for shift/merge units. All integer functional units have a single- predicted-taken branches, the PA-8000 incorporates a 32- cycle latency. For floating-point applications, the chip entry branch target address cache. The BTAC is a fully asso- includes dual floating-point multiply-and-accumulate (FMAC) ciative structure that associates a branch instruction’s address units and dual divide/square root units. We optimized the with its target’s address. Whenever the processor encoun- FMAC units for performing the common operation A times ters a predicted-taken branch in the instruction stream, it cre- B plus C. By fusing an add to a multiply, each FMAC can ates a BTAC entry for that branch. The next time the fetch execute two floating-point operations in just three cycles. In unit fetches from the branch’s address, the BTAC signals a hit addition to providing low latency for floating-point opera- and supplies the branch target’s address. The fetch unit can tions, the FMAC units are fully pipelined so that the PA-8000’s then immediately fetch the branch’s target without incurring peak throughput is four floating-point operations per cycle. a penalty, resulting in a zero-state taken-branch penalty for The two divide/square root units are not pipelined, but other branches that hit in the BTAC. To improve the hit rate, the floating-point operations can execute on the FMAC units BTAC holds only predicted-taken branches. If a predicted- while the divide/square root units are busy. A single- untaken branch hits in the BTAC, the entry is deleted. precision divide or square root operation requires 17 cycles; To reduce the number of mispredicted branches, the PA- a double-precision operation requires 31 cycles. 8000 implements two modes of branch prediction: dynamic Such a large array of computational units would be point- and static. Each TLB entry contains a bit to indicate which less if they could not obtain enough data to operate on. To mode the branch prediction hardware should use; therefore, that end, the PA-8000 incorporates two complete load/store the software can select the mode on a page-by-page basis. In pipes, including two address adders, a 96-entry dual-ported dynamic mode, the instruction fetch unit consults a 256-entry TLB, and a dual-ported cache. The right side of Figure 1 branch history table, which stores the results of each branch’s shows the dual load/store units and the memory system last three iterations (either taken or untaken). The instruction interface. The symmetry of dual functional units throughout fetch unit predicts that a given branch’s outcome will be the the processor allows a number of simplifications in data same as the majority of the last three outcomes. In static pre- paths, control logic, and signal routing. In effect, this duali- diction mode, the processor predicts that most conditional ty provides separate even and odd machines. forward branches will be untaken and that most conditional 28 IEEE Micro . SystemSystem busus interfinterfaceace Instr I IntegerInteger n s t registerregister r uction cache interf InstrInstructionuction uction cache interf u c stacstacksks Floating-point units F reorderreorder t i l o o n buffufferer a t c i n a g c - h DataData p e o i IntegerInteger cachecache i n n t t interfinterfaceace e functionalfunctional u r n f ace a ace unitsunits i t c s e InstrInstructionuction addressaddress andand Figure 3. The PA-8000 package. controlcontrol InstrInstructionuction registersregisters TLBTLB fetchetch unitunit pitch routing and local interconnections, two for low-RC (resistance-capacitance) global routing, and a final layer for Figure 2. The PA-8000 die. clock and power supply routing. The processor design includes a three-level clock network, organized as a modified H-tree. The clock syncs serve as pri- backward branches will be taken. For the common compare- mary inputs, received by a central buffer and driven to 12 and-branch instruction, the PA 2.0 architecture defines a secondary clock buffers located at strategic spots around the branch predict bit that indicates whether the processor should chip. These buffers then drive the clock to the major circuit follow this normal prediction convention or the opposite con- areas, where it is received by clock “gaters” (third-level clock vention. Compilers using either heuristic methods or profile- buffers) featuring high gain and a very short in-to-out delay. based optimization can use the static prediction mode to The approximately 7,000 gaters can generate many clock effectively communicate branch probabilities to the hardware. “flavors”: two-phase overlapping or nonoverlapping, invert- ing or noninverting, qualified or nonqualified. The clock Cache design qualification is useful for synchronous register sets and The PA-8000 uses separate, large, single-level, off-chip, dumps, as well as for powering down sections of logic not direct-mapped caches for instructions and data. It supports in use. To minimize clock skew and improve edge rates, we up to 4 Mbytes for instructions and 4 Mbytes for data, using simulated and tuned the clock network extensively. The sim- industry-standard synchronous SRAMs. We provided two ulated final clock skew for this design was no greater than complete copies of the data cache tags so that the processor 170 ps between any points on the die.
Recommended publications
  • Contents [Edit] History
    HP 9000 is the name for a line of workstation and server computer systems produced by the Hewlett-Packard Company (HP). The HP 9000 brand was introduced in 1984 to encompass several existing technical workstations models previously launched in the early 1980s. Contents [hide] • 1 History • 2 Workstation models o 2.1 Series 200 o 2.2 Series 300/400 o 2.3 Series 500 o 2.4 Series 700 . 2.4.1 VME Industrial Workstations o 2.5 B, C, J class • 3 Server models o 3.1 D-class o 3.2 R-class o 3.3 N-class o 3.4 L-class o 3.5 A-class o 3.6 S/X-class o 3.7 V-class • 4 Operating systems • 5 See also • 6 Notes • 7 References • 8 External links [edit] History The first HP 9000 models comprised the HP 9000 Series 200 and Series 500 ranges. These were rebadged existing models, the Series 200 including various Motorola 68000- based workstations such as the HP 9826 and HP 9836, and the Series 500 using HP's FOCUS microprocessor architecture introduced in the HP 9020 workstation. These were followed by the HP 9000 Series 300 and Series 400 workstations which also used 68k- series microprocessors. From the mid-1980s onwards, HP started to switch over to its own microprocessors based on its proprietary PA-RISC ISA, for the Series 600, 700, 800, and later lines. More recent models use either the PA-RISC or its successor, the HP/Intel IA-64 ISA. All of the HP 9000 line run various versions of the HP-UX operating system, except earlier Series 200 models, which ran standalone applications.
    [Show full text]
  • PC Hardware Contents
    PC Hardware Contents 1 Computer hardware 1 1.1 Von Neumann architecture ...................................... 1 1.2 Sales .................................................. 1 1.3 Different systems ........................................... 2 1.3.1 Personal computer ...................................... 2 1.3.2 Mainframe computer ..................................... 3 1.3.3 Departmental computing ................................... 4 1.3.4 Supercomputer ........................................ 4 1.4 See also ................................................ 4 1.5 References ............................................... 4 1.6 External links ............................................. 4 2 Central processing unit 5 2.1 History ................................................. 5 2.1.1 Transistor and integrated circuit CPUs ............................ 6 2.1.2 Microprocessors ....................................... 7 2.2 Operation ............................................... 8 2.2.1 Fetch ............................................. 8 2.2.2 Decode ............................................ 8 2.2.3 Execute ............................................ 9 2.3 Design and implementation ...................................... 9 2.3.1 Control unit .......................................... 9 2.3.2 Arithmetic logic unit ..................................... 9 2.3.3 Integer range ......................................... 10 2.3.4 Clock rate ........................................... 10 2.3.5 Parallelism .........................................
    [Show full text]
  • Integrating DMA Into the Generic Device Model
    Integrating DMA Into the Generic Device Model James E.J. Bottomley SteelEye Technology, Inc. http://www.steeleye.com [email protected] Abstract across architectures prior to PCI, the most prominent be- ing EISA. Finally, some manufacturers of I/O chips de- signed them not to be bus based (the LSI 53c7xx series This paper will introduce the new DMA API for the of SCSI chips being a good example). These chips made generic device model, illustrating how it works and ex- an appearance in an astonishing variety of cards with an plaining the enhancements over the previous DMA Map- equal variety of bus interconnects. ping API. In a later section we will explain (using illus- trations from the PA-RISC platform) how conversion to The major headache for people who write drivers for the new API may be achieved hand in hand with a com- non-PCI or multiple bus devices is that there was no plete implementation of the generic device API for that standard for non-PCI based DMA, even though many of platform. the problems encountered were addressed by the DMA Mapping API. This gave rise to a whole hotchpotch of solutions that differed from architecture to architecture: On Sparc, the DMA Mapping API has a completely 1 Introduction equivalent SBUS API; on PA-RISC, one may obtain a “fake” PCI object for a device residing on a non-PCI bus which may be passed straight into the PCI based DMA Back in 2001, a group of people working on non-x86 API. architectures first began discussing radical changes to the way device drivers make use of DMA.
    [Show full text]
  • Internals Training Guide HP E3000 MPE/Ix Computer Systems
    Internals Training Guide HP e3000 MPE/iX Computer Systems Edition 1 Manufacturing Part Number: 30216-90316 E0101 U.S.A. January 2001 Notice The information contained in this document is subject to change without notice. Hewlett-Packard makes no warranty of any kind with regard to this material, including, but not limited to, the implied warranties of merchantability or fitness for a particular purpose. Hewlett-Packard shall not be liable for errors contained herein or for direct, indirect, special, incidental or consequential damages in connection with the furnishing or use of this material. Hewlett-Packard assumes no responsibility for the use or reliability of its software on equipment that is not furnished by Hewlett-Packard. This document contains proprietary information which is protected by copyright. All rights reserved. Reproduction, adaptation, or translation without prior written permission is prohibited, except as allowed under the copyright laws. Restricted Rights Legend Use, duplication, or disclosure by the U.S. Government is subject to restrictions as set forth in subparagraph (c) (1) (ii) of the Rights in Technical Data and Computer Software clause at DFARS 252.227-7013. Rights for non-DOD U.S. Government Departments and Agencies are as set forth in FAR 52.227-19 (c) (1,2). Acknowledgments UNIX is a registered trademark of The Open Group. Hewlett-Packard Company 3000 Hanover Street Palo Alto, CA 94304 U.S.A. © Copyright 2001 by Hewlett-Packard Company 2 Contents 1. Hardware Overview Monitor and I/O Services 2. PCISCSI Device Adapter Manager (DAM) Internals Training . 38 Additional References . 40 Introduction. 40 PCI Based SCSI Interface Cards .
    [Show full text]
  • Service Handbook
    Service Handbook J Class Workstation HP Part No. A2876–90041 Edition E0498 Update to Service Handbook J Class Workstation HP Part No. A2876–90040 R Hewlett–Packard Company 3404 E. Harmony Rd., Ft. Collins, CO 80528–9599 NOTICE The information contained in this document is subject to change without notice. HEWLETT–PACKARD WARRANTY STATEMENT HP PRODUCT DURATION OF WARRANTY J Class Workstation one year 1. HP warrants HP hardware, accessories and supplies against defects in materials and workmanship for the period specified above. If HP receives notice of such defects during the warranty period, HP will, at its option, either repair or replace products which prove to be defective. Replacement products may be either new or like–new. 2. HP warrants that HP software will not fail to execute its programming instructions, for the period specified above, due to defects in material and workmanship when properly installed and used. If HP receives notice of such defects during the warranty period, HP will replace software media which does not execute its programming instructions due to such defects. 3. HP does not warrant that the operation of HP products will be uninterrupted or error free. If HP is unable, within a reasonable time, to repair or replace any product to a condition as warranted, customer will be entitled to a refund of the purchase price upon prompt return of the product. 4. HP products may contain remanufactured parts equivalent to new in performance or may have been subject to incidental use. 5. The warranty period begins on the date of delivery or on the date of installation if installed by HP.
    [Show full text]
  • Symmetric Multiprocessing Workstations and Servers System-Designed for High Performance and Low Cost
    Symmetric Multiprocessing Workstations and Servers System-Designed for High Performance and Low Cost A new family of workstations and servers provides enhanced system performance in several price classes. The HP 9000 Series 700 J-class workstations provide up to 2-way symmetric multiprocessing, while the HP 9000 Series 800 K-class servers (technical servers, file servers) and HP 3000 Series 9x9KS business-oriented systems provide up to 4-way symmetric multiprocessing. by Matt J. Harline, Brendan A. Voge, Loren P. Staley, and Badir M. Mousa Blending high performance and low cost, a new family of workstations and servers has been designed to help maintain HP’s leadership in system performance, price/performance, system support, and system reliability. This article and the accompanying articles in this issue describe the design and implementation of the HP 9000 J-class workstations, which are high-end workstations running the HP-UX* operating system, the HP 9000 K-class servers, which are a family of midrange technical and business servers running the HP-UX operating system, and the HP 3000 Series 9x9KS servers, which are a family of midrange business servers running the MPE/iX operating system. In this issue, these systems will be referred to collectively as J/K-class systems. The goals of the the design team were to achieve high performance and low cost, while at the same time creating a broad family of systems that would share many of the same components and meet a wide range of customer needs. The challenge was to create a list of requirements that would meet the needs of the three different target markets: the UNIX-system-based workstation market, the UNIX-system-based server market, and Hewlett-Packard’s proprietary MPE/iX-system-based server market.
    [Show full text]
  • Architecture Reference Guide
    Architecture Reference Guide V2500 Server First Edition A5074-96004 V2500 Server Customer Order Number: A5074-90004 June, 1999 Printed in: USA Revision History Edition: First Document Number: A3725-96004 Remarks: Initial release June, 1999. Notice Copyright Hewlett-Packard Company 1999. All Rights Reserved. Reproduction, adaptation, or translation without prior written permission is prohibited, except as allowed under the copyright laws. The information contained in this document is subject to change without notice. Hewlett-Packard makes no warranty of any kind with regard to this material, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. Hewlett-Packard shall not be liable for errors contained herein or for incidental or consequential damages in connection with the furnishing, performance or use of this material. Contents Preface . xv Notational conventions . xvi 1 Introduction . 1 The PA-8500 processor . .2 The node . .3 Control and status registers (CSRs) . .5 Description of functional blocks. .5 Processor agent controller. .5 Routing attachment controller—Hyperplane crossbar . .6 Memory access controller . .7 CUB and core logic bus . .8 Shared memory . .9 Multiple nodes . .11 Coherent toroidal interconnect . .12 Globally shared memory (GSM) . .14 GSM subsystem . .14 Memory interleave . .14 GSM and memory latency . .15 GSM and cache coherence . .17 2 Physical address space . 19 Physical addresses. .20 Node addressing . .22 Node Identifiers . .23 Coherent memory space . .24 Coherent memory layout . .25 Addressing a byte of memory. .25 Memory interleaving . .27 Memory interleave generation. .29 Force node ID function . .30 Memory board, bus and bank index selection . .32 Memory board interleave pattern . .33 Memory bus interleave pattern .
    [Show full text]
  • Hp Rp5400 Series of Entry-Level Unix Servers November 2001 a Technical
    hp rp5400 series of a technical white paper november 2001 entry-level unix servers from Hewlett-Packard table of contents introduction to the hp server rp5400 series......................................................................... 3 the hp server product line ............................................................................................. 7 binary compatibility ..................................................................................................... 8 IPF ready..................................................................................................................... 8 IPF transition................................................................................................................ 8 architecture ..................................................................................................................... 9 low-latency memory access ........................................................................................... 11 speeds and feeds......................................................................................................... 11 I/O subsystem design................................................................................................... 13 internal removable media ............................................................................................. 16 scalability.................................................................................................................... 16 rp5400 series industrial design and packaging..................................................................
    [Show full text]
  • Hp Server Rp7400
    an hp white paper february 2002 hp server rp7400 This white paper describes the design goals and system system architecture architecture of the HP server rp7400 and provides recommendations for implementing business-critical and design guide solutions. table of contents introduction .......................................................................................................................3 overview of the hp server rp7400 ...................................................................... 3 the hp enterprise server product line ................................................................... 4 performance .................................................................................................... 5 design philosophy—balanced high performance.................................................. 6 binary compatibility.......................................................................................... 7 architecture .......................................................................................................................7 twin system bus ................................................................................................ 8 very low latency memory access ........................................................................ 9 speeds and feeds ........................................................................................... 10 I/O subsystem design ..................................................................................... 11 scalability.....................................................................................................
    [Show full text]
  • 739250-1.Pdf
    Eindhoven University of Technology MASTER Modeling and analysis of a cache coherent interconnect Wiener, U. Award date: 2012 Link to publication Disclaimer This document contains a student thesis (bachelor's or master's), as authored by a student at Eindhoven University of Technology. Student theses are made available in the TU/e repository upon obtaining the required degree. The grade received is not published on the document as presented in the repository. The required complexity or quality of research of student theses may vary by program, and the required minimum study period may vary in duration. General rights Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights. • Users may download and print one copy of any publication from the public portal for the purpose of private study or research. • You may not further distribute the material or use it for any profit-making activity or commercial gain Eindhoven University of Technology Faculty of Electrical Engineering Electronic Systems Group Modeling and Analysis of a Cache Coherent Interconnect Thesis Report submitted in partial fulfillment of the requirements for the degree of Master of Science in Embedded Systems Supervisors: by: Prof. Kees Goossens Uri Wiener Dr. Andreas Hansson v2.2, August 2, 2012 Abstract System on Chip (SoC) designs integrate heterogeneous components, increasing in number, performance, and memory-related requirements. Such components share data using distributed shared memory, and rely on caches for improving performance and reducing power consumption.
    [Show full text]
  • Independent PA-RISC and Itanium Reference Book
    OpenPA The book of PA-RISC Paul Weissmann Bonn This document and its content are Copyright © 1999-2021 Paul Weissmann, Bonn, Germany, unless otherwise noted. Berlin, Bonn, Palo Alto, San Francisco No parts of this document may be reproduced or copied without prior written permission. Commercial use of the content is prohibited. OpenPA.net, the information resource on PA-RISC. Second edition Release 2.8 Bonn, May 2021 Other editions: Second edition 2.7: July 2020 Second edition 2.6: January 2018 Second edition 2.5: January 2016 Second edition 2.4: July 2012 Second edition 2.3: July 2009 Second edition 2.2: January 2009 Second edition 2.1: October 2008 Second edition 2.0: May 2008 First edition 1.2: December 2007 First edition 1.1: November 2007 First edition 1.0: July 2006 OpenPA.net (online) is a registered serial publication, ISSN 1866-2757. Paul Weissmann can be reached by e-mail: [email protected]. and online at Insignals Cyber Security and OpenKRITIS. Preface This is the print edition of the OpenPA.net website from Spring 2021. OpenPA is a resource for HP PA-RISC and Itanium computers with technical descriptions of workstations, servers, their hardware architecture and supported operating systems This project is independent of and does not represent The Hewlett Packard Company in any way. This is the Second Edition 2.8. Set with LATEX. Changes in Second Edition 2.8 since the last edition (2020): ê Many revisions and corrections (thanks!) ê Text and language in all chapters ê TeX backend update All other changes are listed in chapter 5.1.
    [Show full text]
  • Sensory Information Processing
    Analysis of Avalanche’s Shared M em ory Architecture Ravindra Kuramkote, John Carter, Alan Davis, Chen-Chi Kuo, Leigh Stoller, Mark Swanson UUCS-97-008 * Computer Systems Laboratory University of Utah ' A bstract In this paper, we describe the design of the Avalanchemultiprocessor’s shared memory subsys­ tem, evaluate its performance, and discuss problems associated with using commodity worksta­ tions and network interconnects as the building blocks of a scalable shared memory multiprocessor. Compared to other scalable shared memory architectures, Avalanchehas a number of novel fea­ tures including its support for the Simple COMA memory architecture and its support for multiple coherency protocols (migratory, delayed write update, and (soon) write invalidate). We describe the performance implications of Avalanche’s architecture, the impact of various novel low-level design options, and describe a number of interesting phenomena we encountered while developing a scalable multiprocessor built on the HP PA-RISC platform. 1 9 Analysis of Avalanche’s Shared M em ory Architecture Ravindra Kuram,kote, John Carter, Alan Davis, Chen-Chi Kuo, Leigh Stoller, Mark Swanson Computer Systems Laboratory University of Utah 1 Introduction , The primary Avalanchedesign goal is to maximize the use of commercial components in the creation of a scalable parallel cluster of workstation multiprocessor that supports both high performance message passing and distributed shared memory. In the current prototype, Avalanchenodes are composed from Hewlett-Packard HP7200 or PA-8000 based symmetric multiprocessing worksta­ tions, a custom device called the W idget, and Myricom’s Myrinet interconnect fabric [6]. Both workstations use a main memory bus known as the Runway [7], a split transaction bus supporting cache coherent transactions.
    [Show full text]