Pentium® III Processor Implementation Tradeoffs
Jagannath Keshava and Vladimir Pentkovski: Microprocessor Products Group, Intel Corp.
Overview of the Pentium® III Processor. Intel Technology Journal Q2, 1999

ABSTRACT
This paper discusses the implementation tradeoffs of the Pentium® III processor. The Pentium III processor implements a new extension of the IA-32 instruction set called the Internet Streaming Single-Instruction, Multiple-Data (SIMD) Extensions (Internet SSE). The processor is based on the Pentium® Pro processor microarchitecture.

The initial development goals for the Pentium III processor were to balance performance, cost, and frequency. Descriptions of some of the key aspects of the SIMD Floating Point (FP) architecture and of the memory streaming architecture are given. The utilization and effectiveness of these architectures in decoupling memory accesses from computation, in the context of balancing the 3D pipe, are discussed. Implementation choices in some of the key areas are described. We also give some details of the implementation of Internet SSE execution units, including the development of new FP units, and discuss how we have implemented the 128-bit instructions on the existing 64-bit datapath. We then discuss the details of the memory streaming implementation.

The Pentium III processor is now in production at frequencies of up to 550 MHz. The new instructions in the Internet SSE were added at about a 10% die cost and have enabled the Pentium III processor to offer a richer, more compelling visual experience.

INTRODUCTION
The goal of the Internet SSE development was to enable a better visual experience and to enable new applications such as real-time video encoding and speech recognition [7]. The Pentium® III processor is the first implementation of ISSE. It is based on the P6 microarchitecture, which allows an efficient implementation in terms of die size and effort. The key development goals were the implementation of the Internet SSE while keeping the die size increase to about 10% over the Pentium® II processor and achieving a higher frequency by at least one bin.

Two features of these new applications challenge designers of computer systems. First, the algorithms that the applications are based on are inherently parallel in the sense that the same sequence of operations can be applied concurrently to multiple data elements. The Internet SSE allows us to express this parallelism explicitly, but the hardware needs to be able to translate the parallelism into higher performance. The P6 superscalar out-of-order microarchitecture is capable of utilizing explicit as well as extracted implicit parallelism. However, hardware that supports higher computation throughput improves the performance of these algorithms. The development of such hardware and increasing its utilization were key tasks in the development of the Pentium III processor. Second, in order to feed the parallel computations with data, the system needs to supply high memory bandwidth and hide memory latency.

The implementation section of this paper contains details of some of the techniques we used to provide enhanced throughput of computations and memory while meeting aggressive die-size and frequency goals. The primary purpose of this paper, however, is to describe key implementation techniques used in the processor and the rationale for their development.

ARCHITECTURE
The Pentium® III processor is the first implementation of the Internet SSE. The Internet SSE contains 70 new instructions and a new architectural state. It is the second significant extension of the instruction set since the 80386 and the first to add a new architectural state. MMX™ was the first significant instruction extension, but it did not add any new architectural state. The new instructions fall into different categories:

• SIMD FP instructions that operate on four single precision numbers
• scalar FP instructions
• cacheability instructions including prefetches into different levels of the cache hierarchy
• control instructions
• data conversion instructions
• new media extensions, such as the PSAD and PAVG instructions, which accelerate encoding and decoding, respectively

Adding the new state reduced implementation complexity, eased programming model issues, and allowed SIMD-FP and MMX technology or X87 instructions to be used concurrently. It also addressed ISV and OSV requests. All of the SSE State was separated from the X87-FP State; there is a dedicated interrupt vector to handle the numeric exceptions. There is also a new control/status register, MXCSR, which is used to mask/unmask numerical exception handling, to set rounding mode, to set flush to zero mode, and to view status flags. Applications often require both scalar and packed mode of operations. To address this issue, explicit scalar instructions (in the new SIMD-FP mode) were defined, which for the Pentium III processor execute only a single micro-instruction. Support is provided for two modes of FP arithmetic: IEEE compliant mode for applications that need exact single precision computation and portability, and a Flush-to-Zero (FTZ) mode for high-performance real-time applications.

(Details of the instruction set are given in other papers in this issue of the Intel Technology Journal.)

IMPLEMENTATION
In this section we discuss some of the key features that we developed to increase FP and memory throughput on the Pentium® III processor. We then discuss a couple of techniques we developed to help provide an area-efficient solution. Figure 1 shows the block diagram of the P6 pipeline.

Figure 2 shows the Dispatch/Execute units in the Pentium® II processor. An overview of the P6 architecture and the microarchitecture is given in [5] and [6] where you will also find a description of the blocks shown in Figures 1 and 2.

Figure 2: Pentium II Dispatch/Execute units
Legend: RS - Reservation Station; EU - Execution Unit; FEU - Floating Point EU; IEU - Integer EU; WIRE - Miscellaneous Functions; SIMD 0/1 - MMX™ EU; JEU - Jump EU; AGU - Address Generation Unit; ROB - ReOrder Buffer. Dispatch ports shown: Port 0, Port 1, Port 2 (AGU Load), Ports 3,4 (AGU Store).

Implementation of Internet SSE Execution Units
The Internet SSE was implemented in the following way. The instruction decoder translates 4-wide (128-bit) Internet SSE instructions into a pair of 2-wide (64-bit) internal uops. The execution of the 2-wide uops is supported by 2-wide execution units. Some of the FP execution units were developed by extending the functionality of existing P6 FP units. The 2-wide units double the performance relative to the Pentium II processor. Further, implementing the 128-bit instruction set on the 64-bit datapath limits the changes to the decoder and the utilization of existing and new execution units. We also implemented a few other features to improve the utilization of the FP hardware:

1. The adder and multiplier were placed on different ports. This allows for simultaneous dispatch of 2-wide addition and 2-wide multiplication operations. This doubles the peak performance again when compared to the Pentium II, and hence, it allows 2.2 GFLOP/sec peak at 550 MHz. The new units developed on the Pentium III and modified P6 units are shown in color in Figure 3.
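As a rough illustration of the decode scheme and the peak figure quoted above, the following C sketch models a 4-wide packed add as a pair of 2-wide uops and recomputes the 2.2 GFLOP/sec number. It is only a software model under stated assumptions: the type `xmm_t` and the functions `uop_add_2wide`, `packed_add_4wide`, and `peak_flops` are hypothetical names introduced here, and the split into a "low" and a "high" uop is an assumption about how the two halves are assigned.

```c
/* Hypothetical model (not Intel code): a 128-bit register as four
 * single-precision lanes. */
typedef struct { float lane[4]; } xmm_t;

/* One 2-wide (64-bit) uop: adds the two lanes starting at index `first`. */
static void uop_add_2wide(xmm_t *dst, const xmm_t *src, int first) {
    dst->lane[first]     += src->lane[first];
    dst->lane[first + 1] += src->lane[first + 1];
}

/* A 4-wide (128-bit) packed add modeled as a pair of 2-wide uops.
 * Assumption: one uop covers the low 64 bits, the other the high 64 bits. */
static void packed_add_4wide(xmm_t *dst, const xmm_t *src) {
    uop_add_2wide(dst, src, 0);  /* low half  */
    uop_add_2wide(dst, src, 2);  /* high half */
}

/* Peak rate = lanes per uop * FP ports dispatching per cycle * clock (Hz).
 * With 2-wide uops, an adder and a multiplier on separate ports, and a
 * 550 MHz clock this gives 2 * 2 * 550e6 operations per second. */
static double peak_flops(int lanes_per_uop, int fp_ports, double clock_hz) {
    return (double)lanes_per_uop * (double)fp_ports * clock_hz;
}
```

Calling `peak_flops(2, 2, 550e6)` reproduces the quoted peak: two lanes per uop, two ports issuing FP work each cycle, at 550 MHz.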
Figure 3: Pentium III Dispatch/Execute units
Legend: RS - Reservation Station; EU - Execution Unit; FEU - Floating Point EU; IEU - Integer EU; WIRE - Miscellaneous Functions; SIMD 0/1 - MMX™ EU; Shuffle - ISSE Shuffle Unit; PFADD - ISSE Adder; Recip/Recipsqrt - ISSE Reciprocal; JEU - Jump EU; AGU - Address Generation Unit; ROB - ReOrder Buffer. Dispatch ports shown: Port 0, Port 1, Port 2 (AGU Load), Ports 3,4 (AGU Store).

All the new units have been added on Port 1. The new operations executed on Port 0 have been accomplished by modifying existing units. The multiplier resides on Port 0 and is a modification of the existing FP multiplier. The new packed SP multiplication is performed with a throughput of two SP numbers each cycle with a latency of four cycles. (The X87 SP processor, on the other hand, had a throughput of a single SP number every two cycles and a latency of five cycles.) A new packed SP adder is provided on Port 1. The adder operates on two SP numbers with a throughput of one cycle and a latency of three cycles. The adder also executes all compare, subtract, min/max, and convert instructions. Essentially, we have assigned different units (and therefore different instructions) to Ports 0 and 1 to ensure that a full 4-wide peak execution can be achieved.

2. Hardware support is in place for data reorganization. Effective SIMD computations require adequate SIMD […] support of these instructions to be an important method to improve the utilization of FP units, since it allows for less time to be spent in data reorganization. The new shuffle/logical unit serves this purpose. It shares Port 1 and executes the unpack high and unpack low, move, and logical uops. The 128-bit shuffle operation is performed through three uops: (1) copy temporary, (2) shuffle low, and (3) shuffle high. The shuffle unit also executes packed integer shuffle, PINSRW and PEXTRW, through sharing of the FP shuffle unit.

3. Data is copied faster. IA-32 instructions overwrite one of the operands of the instruction. We knew that this feature would add more MOVE instructions to the code. For instance, consider the following fragment:

    Load memory operand A to register XMM0
    Multiply XMM0 by memory operand B

The second instruction overwrites the content of the register XMM0.
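The cost of the overwritten operand in the fragment above can be made concrete with a small C model. This is a sketch under stated assumptions, not processor or compiler output: `xmm_reg`, `load_ps`, `mul_ps`, `move_ps`, and `mul_keep_a` are hypothetical names introduced here, and the two-operand multiply is modeled simply as `dst = dst * mem`. It shows that keeping A available after the multiply takes one extra register-to-register MOVE.

```c
/* Hypothetical model of an XMM register: four single-precision lanes. */
typedef struct { float lane[4]; } xmm_reg;

/* Load: dst = mem. */
static void load_ps(xmm_reg *dst, const float mem[4]) {
    for (int i = 0; i < 4; i++) dst->lane[i] = mem[i];
}

/* Two-operand multiply: dst = dst * mem. The previous value of dst is lost. */
static void mul_ps(xmm_reg *dst, const float mem[4]) {
    for (int i = 0; i < 4; i++) dst->lane[i] *= mem[i];
}

/* Register-to-register MOVE. */
static void move_ps(xmm_reg *dst, const xmm_reg *src) {
    *dst = *src;
}

/* Computes A*B while keeping A available afterwards. */
static void mul_keep_a(const float a[4], const float b[4],
                       xmm_reg *kept_a, xmm_reg *product) {
    xmm_reg xmm0, xmm1;
    load_ps(&xmm0, a);      /* Load memory operand A to register XMM0 */
    move_ps(&xmm1, &xmm0);  /* extra MOVE so that A survives in XMM1 */
    mul_ps(&xmm0, b);       /* Multiply XMM0 by memory operand B; XMM0 no longer holds A */
    *kept_a = xmm1;
    *product = xmm0;
}
```

Without the `move_ps` step, the value of A would have to be reloaded from memory after the multiply, which is why destructive two-operand instructions tend to increase the number of MOVE instructions in the code.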