The Hp Pa-8000 Risc Cpu
Total Page:16
File Type:pdf, Size:1020Kb
. THE HP PA-8000 RISC CPU Ashok Kumar he PA-8000 RISC CPU is the first of a bits (40 bits on the PA-8000). A new mode new generation of Hewlett-Packard bit governs address formation, creating Hewlett-Packard Tmicroprocessors. Designed for high- increased flexibility. In 32-bit addressing end systems, it is among the world’s most mode, the processor can take advantage of powerful and fastest microprocessors. It fea- 64-bit computing instructions for faster tures an aggressive, four-way, superscalar throughput. In 64-bit addressing mode, 32- implementation, combining speculative exe- bit instructions and conditions are available cution with on-the-fly instruction reordering. for backward compatibility. The heart of the machine, the instruction In addition, the following extensions help reorder buffer, provides out-of-order execu- optimize performance for virtual memory tion capability. and cache management, branching, and Our primary design objective for the PA- floating-point operations: 8000 was to attain industry-leading perfor- mance in a broad range of applications. In • fast TLB (translation look-aside buffer) addition, we wanted to provide full support insertion instructions, for 64-bit applications. To make the PA-8000 • load and store instructions with 16-bit truly useful, we needed to ensure that the displacement, processor would not only achieve high bench- • memory prefetch instructions, mark performance but would sustain such • support for variable-size pages, performance in large, real-world applications. • halfword instructions for multimedia To achieve this goal, we designed large, exter- support, nal primary caches with the ability to hide • branches with 22-bit displacements, memory latency in hardware. We also imple- • branches with short pointers, mented dynamic instruction reordering in • branch prediction hinting, hardware to maximize instruction-level par- • floating-point multiply-and-accumulate allelism available to the execution units. instructions, and The PA-8000 connects to a high-bandwidth • multiple floating-point compare-result Runway system bus, a 768-Mbyte/s split- bits. transaction bus that allows each processor to generate multiple outstanding memory requests. The processor also provides glue- Key hardware features less support for up to four-way multipro- Since the PA-8000’s completely redesigned cessing via the Runway bus. The PA-8000 core uses no circuitry from previous- implements the new PA (Precision Architec- generation processors, we could design the This aggressive, four- ture) 2.0, a binary-compatible extension of new processor with any microarchitectural the previous PA-RISC architecture. All previ- features necessary to attain high performance. way, superscalar ous code executes on the PA-8000 without Figure 1 (next page) shows a functional-block recompilation or translation. diagram of the PA-8000’s basic control and microprocessor data paths. Architecture enhancements The chip’s most notable feature is the 56- combines speculative PA 2.0 incorporates a number of advanced entry instruction reorder buffer, to our microarchitectural enhancements, most sup- knowledge the industry’s largest, which execution with on-the- porting 64-bit computing. We widened the serves as the central control unit. It supports integer registers and functional units, includ- full register renaming for all instructions in fly instruction ing the shift/merge unit, to 64 bits. PA 2.0 the buffer and tracks instruction inter- supports flat virtual addressing up to 64 bits, dependencies to allow dataflow execution reordering. as well as physical addresses greater than 32 through the entire 56-instruction window. 0272-1732/97/$10.00 © 1997 IEEE March/April 1997 27 . HP PA-8000 Instruction System fetch unit Sort unit bus interface Instruction cache Runway bus 64-bit integer ALUs Instruction Address Shift/ reorder Load/store reorder merge buffer address Data buffer units adders cache ALU Memory 28 buffer buffer entries (28 (28 entries) entries) FMAC units Divide/ square Retire unit Rename root units registers Architecturally specified registers Rename registers Figure 1. Functional block diagram of the PA-8000. The PA-8000 executes at a peak rate of four instructions As pipelines get deeper and a processor’s parallelism per cycle, enabled by a large complement of computational increases, instruction fetch bandwidth and branch predic- units, shown at the left side of Figure 1. For integer opera- tion become increasingly important. To increase fetch band- tion, it includes two 64-bit integer ALUs and two 64-bit width and mitigate the effect of pipeline stalls for shift/merge units. All integer functional units have a single- predicted-taken branches, the PA-8000 incorporates a 32- cycle latency. For floating-point applications, the chip entry branch target address cache. The BTAC is a fully asso- includes dual floating-point multiply-and-accumulate (FMAC) ciative structure that associates a branch instruction’s address units and dual divide/square root units. We optimized the with its target’s address. Whenever the processor encoun- FMAC units for performing the common operation A times ters a predicted-taken branch in the instruction stream, it cre- B plus C. By fusing an add to a multiply, each FMAC can ates a BTAC entry for that branch. The next time the fetch execute two floating-point operations in just three cycles. In unit fetches from the branch’s address, the BTAC signals a hit addition to providing low latency for floating-point opera- and supplies the branch target’s address. The fetch unit can tions, the FMAC units are fully pipelined so that the PA-8000’s then immediately fetch the branch’s target without incurring peak throughput is four floating-point operations per cycle. a penalty, resulting in a zero-state taken-branch penalty for The two divide/square root units are not pipelined, but other branches that hit in the BTAC. To improve the hit rate, the floating-point operations can execute on the FMAC units BTAC holds only predicted-taken branches. If a predicted- while the divide/square root units are busy. A single- untaken branch hits in the BTAC, the entry is deleted. precision divide or square root operation requires 17 cycles; To reduce the number of mispredicted branches, the PA- a double-precision operation requires 31 cycles. 8000 implements two modes of branch prediction: dynamic Such a large array of computational units would be point- and static. Each TLB entry contains a bit to indicate which less if they could not obtain enough data to operate on. To mode the branch prediction hardware should use; therefore, that end, the PA-8000 incorporates two complete load/store the software can select the mode on a page-by-page basis. In pipes, including two address adders, a 96-entry dual-ported dynamic mode, the instruction fetch unit consults a 256-entry TLB, and a dual-ported cache. The right side of Figure 1 branch history table, which stores the results of each branch’s shows the dual load/store units and the memory system last three iterations (either taken or untaken). The instruction interface. The symmetry of dual functional units throughout fetch unit predicts that a given branch’s outcome will be the the processor allows a number of simplifications in data same as the majority of the last three outcomes. In static pre- paths, control logic, and signal routing. In effect, this duali- diction mode, the processor predicts that most conditional ty provides separate even and odd machines. forward branches will be untaken and that most conditional 28 IEEE Micro . SystemSystem busus interfinterfaceace Instr I IntegerInteger n s t registerregister r uction cache interf InstrInstructionuction uction cache interf u c stacstacksks Floating-point units F reorderreorder t i l o o n buffufferer a t c i n a g c - h DataData p e o i IntegerInteger cachecache i n n t t interfinterfaceace e functionalfunctional u r n f ace a ace unitsunits i t c s e InstrInstructionuction addressaddress andand Figure 3. The PA-8000 package. controlcontrol InstrInstructionuction registersregisters TLBTLB fetchetch unitunit pitch routing and local interconnections, two for low-RC (resistance-capacitance) global routing, and a final layer for Figure 2. The PA-8000 die. clock and power supply routing. The processor design includes a three-level clock network, organized as a modified H-tree. The clock syncs serve as pri- backward branches will be taken. For the common compare- mary inputs, received by a central buffer and driven to 12 and-branch instruction, the PA 2.0 architecture defines a secondary clock buffers located at strategic spots around the branch predict bit that indicates whether the processor should chip. These buffers then drive the clock to the major circuit follow this normal prediction convention or the opposite con- areas, where it is received by clock “gaters” (third-level clock vention. Compilers using either heuristic methods or profile- buffers) featuring high gain and a very short in-to-out delay. based optimization can use the static prediction mode to The approximately 7,000 gaters can generate many clock effectively communicate branch probabilities to the hardware. “flavors”: two-phase overlapping or nonoverlapping, invert- ing or noninverting, qualified or nonqualified. The clock Cache design qualification is useful for synchronous register sets and The PA-8000 uses separate, large, single-level, off-chip, dumps, as well as for powering down sections of logic not direct-mapped caches for instructions and data. It supports in use. To minimize clock skew and improve edge rates, we up to 4 Mbytes for instructions and 4 Mbytes for data, using simulated and tuned the clock network extensively. The sim- industry-standard synchronous SRAMs. We provided two ulated final clock skew for this design was no greater than complete copies of the data cache tags so that the processor 170 ps between any points on the die.