TUNING THE PENTIUM PRO MICROARCHITECTURE

David B. Papworth
Intel Corporation

This inside look at a large Intel development project reveals some of the reasoning for goals, changes, trade-offs, and performance simulation that lay behind the Pentium Pro's final form.

IEEE Micro, April 1996. 0272-1732/96/$5.00 © 1996 IEEE

Designing a wholly new microprocessor is difficult and expensive. To justify this effort, a major new microarchitecture must improve performance one and a half or two times over the previous-generation microarchitecture, when evaluated on equivalent process technology. In addition, semiconductor process technology continues to evolve while the processor design is in progress. The previous-generation microarchitecture increases in clock speed and performance due to compactions and conversion to newer technology. A new microarchitecture must "intercept" the process technology to achieve a compounding of process and microarchitectural speedups.

The process technology, degree of pipelining, and amount of effort a team is willing to spend on circuit and layout issues determine the clock speed of a microarchitecture. Typically, a microarchitecture will start with the same clock speed as a prior microarchitecture (adjusted for process technology scaling). This enables the maximum reuse of past designs and circuits, and fits the new design to the existing development tools and methodology. Performance enhancements should come primarily from the microarchitecture and not from clock speed enhancements per se.

Often, a new processor's die area is close to the maximum that can be manufactured. This design choice stems from marketplace competitiveness and efforts to get as much performance as possible in the new microarchitecture. While making the die smaller and cheaper and improving performance are desirable, it is generally not possible to achieve a 1.5-to-2-times-better performance goal without using at least 1.5 to 2 times a prior design's die area.

Finally, new processor designs often incorporate new features. As the performance of the core logic improves, designs must continue to enhance the bus and cache to keep pace with the core. Further, as other technologies (such as multiprocessing) mature, there is a natural tendency to draw them into the processor design as a way of providing additional features and value for the end user.

Mass-market designs
The large installed base and broad range of applications for the Intel architecture place additional constraints on the design, constraints beyond the purely academic ones of performance and clock frequency. We do not have the flexibility to control software applications, compilers, or operating systems in the same way system vendors can. We cannot remove obsolete features and must cater to a wide variety of coding styles. The processor must run thousands of shrink-wrapped applications, rather than only those compiled by a vendor-provided compiler running on a vendor-provided operating system on a vendor-provided platform. These limitations leave fewer avenues for workarounds and leave the processor exposed to a much greater variety of instruction sequences and boundary conditions.

Intel's architecture has accumulated a great deal of history and features in 15 years. The product must deliver world-class performance and also successfully identify and resolve compatibility issues. The microprocessor may be an assemblage of pieces from many different vendors, yet must function reliably and be easy for the general public to use.

Since a new design needs to be manufacturable in high volume from the very beginning, designers cannot allow the design to expand to the limits of the technology. It also must meet stringent environmental and design-life limits. It must deliver high performance using a set of motherboard components costing less than a few hundred dollars.

Meeting these additional design constraints is critical for business success. They add to the complexity of the project and the total effort required, compared to a brand-new instruction set architecture. The additional complexity results in extra staffing and longer schedules. A vital ingredient in long-term success is proper planning and management of the demands of extra complexity; management must ensure that the complexity does not impact the product's long-term performance and availability.

What we actually built
The actual Pentium Pro processor looks much different from our first straw man:

- a 150-MHz clock using 0.6-micron technology,
- a 14-stage pipeline,
- three-instruction decoding per clock cycle,
- three micro-operations (micro-ops) renamed and retired per clock cycle,
- an 8-Kbyte L1 instruction cache,
- an 8-Kbyte L1 data cache,
- one dedicated load port and one store port, and
- 5.5 million transistors.

Our first effort
After due consideration of the performance, area, and mass-market constraints, we knew we would have to implement out-of-order execution to wring more instruction-level parallelism out of existing code. Further, the modest register file of the Intel architecture constrains any compiler. That is, it limits the amount of instruction reordering that a compiler can do to increase superscalar parallelism and basic block size. Clearly, a new microarchitecture would have to provide some way to escape the constraints of false dependencies and provide a form of dynamic code motion in hardware.

To conform to projected Pentium processor goals, we initially targeted a 100-MHz clock speed using 0.6-micron technology. Such a clock speed would have resulted in roughly a 10-stage pipeline. It would have much the same structure as the Pentium processor, with an extra decode stage added for more instruction decode bandwidth. It would also require extra stages to implement register renaming, runtime scheduling, and in-order retirement functions. We expected a two-clock data cache access time (like the Pentium processor) and other core execution units that would strongly resemble the Pentium processor. The straw-man microarchitecture would have had the following components:

- a 100-MHz clock using 0.6-micron technology,
- a 10-stage pipeline,
- four-instruction decoding per clock cycle,
- four micro-operations renamed and retired per clock cycle,
- a 32-Kbyte level-1 (L1) instruction cache,
- a separate 32-Kbyte L1 data cache,
- two general load/store ports, and
- a total of 10 million transistors.

From the outset we planned to include a full-frequency, dedicated L2 cache, with some flavor of advanced packaging connecting the cache to the processor. Our intent was to enable effective shared-memory multiprocessing by removing the processor-to-L2 transactions from the traditional global interconnect, or front-side, bus, and to facilitate board and platform designs that could keep up with the high-speed processor. Remember that in 1990/1991, when we began the project, it was quite a struggle to build 50- and 66-MHz systems. It seemed prudent to provide for a package-level solution to this problem.

The evolution process
Our first efforts centered on designing and simulating a high-performance dynamic-execution engine. We attacked the problems of renaming, scheduling, and dispatching, and designed core structures that implemented the desired functionality. Circuit and layout studies overlapped this effort. We discovered that the basic out-of-order core and the functional units could run at a clock frequency higher than 100 MHz; instruction fetching and decoding in two pipeline stages, and data cache access in two clock cycles, were the main frequency limiters.

One of our first activities was to create a microarchitect's workbench. Basically, this was a performance simulator capable of modeling the general class of dynamic-execution microarchitectures. We didn't base this simulator on silicon structures or detailed modeling of any particular implementation. Instead, it took an execution trace as input and applied various constraints to each instruction, modeling the functions of decoding, renaming, scheduling, dispatching, and retirement. It processed one micro-operation at a time, from decoding until retirement, and at each stage applied the limitations extant in the design being modeled.

This simulator was very flexible in allowing us to model any number of possible architectures. Modifying it was much faster than modifying a detailed, low-level implementation, since there was no need for functional correctness or routing signals from one block to another. We set up this simulator to model our initial straw-man microarchitecture and then performed a sensitivity analysis of the major microarchitectural areas that affect performance, clock speed, and die area. We simulated each change or potential change against at least 2 billion instructions from more than 200 programs.
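The workbench described above (an execution trace in, per-stage constraints applied to each micro-op, an estimated CPI out) can be sketched in miniature. Everything here is illustrative: the constraint values, the micro-op mix, and the cost accounting are invented for the sketch, not taken from Intel's tool.

```python
from dataclasses import dataclass

@dataclass
class Constraints:
    # Illustrative knobs only; the real workbench modeled decoding,
    # renaming, scheduling, dispatching, and retirement in more detail.
    decode_per_clock: int = 4
    retire_per_clock: int = 4
    load_latency: int = 2        # clocks for a load to return data
    mispredict_penalty: int = 5  # fetch-pipeline refill clocks

def simulate(trace, c):
    """Charge each traced micro-op the clocks implied by the constraints."""
    clocks = 0.0
    for uop in trace:
        clocks += 1.0 / c.decode_per_clock   # share of a decode slot
        clocks += 1.0 / c.retire_per_clock   # share of a retirement slot
        if uop["kind"] == "load":
            clocks += c.load_latency - 1     # latency beyond a single clock
        if uop.get("mispredicted"):
            clocks += c.mispredict_penalty
    return clocks / len(trace)               # clocks per micro-op (CPI)

# A made-up 100-micro-op trace: 55 ALU ops, 30 loads, 15 branches,
# of which 2 are mispredicted.
trace = ([{"kind": "alu"}] * 55 + [{"kind": "load"}] * 30 +
         [{"kind": "branch"}] * 13 +
         [{"kind": "branch", "mispredicted": True}] * 2)

base = simulate(trace, Constraints())
deeper = simulate(trace, Constraints(load_latency=3, mispredict_penalty=7))
print(f"CPI {base:.2f} -> {deeper:.2f} with a deeper pipeline")
```

Because a model like this never needs functional correctness, changing a constraint is a one-line edit, which is what makes sweeping many configurations against large traces practical.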

[Figure: Pentium Pro processor block diagram. Key: AGU: address generation unit; BIU: bus interface unit; BTB: branch target buffer; DCU: data cache unit; FEU: floating-point execution unit; ID: instruction decoder; IEU: integer execution unit; IFU: instruction fetch unit (includes I-cache); L2: level-2 cache; MIS: microinstruction sequencer; MIU: memory interface unit; MOB: memory reorder buffer; RAT: register alias table; ROB: reorder buffer; RRF: retirement register file; RS: reservation station.]
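The in-order retirement discipline that the ROB and RRF in this key implement can be shown with a toy model. This is a sketch of the general mechanism only, not Intel's design; all names and structures are invented for illustration.

```python
from collections import deque

class ReorderBuffer:
    """Out-of-order completion, in-order retirement (toy model)."""
    def __init__(self):
        self.entries = deque()           # micro-ops in program order

    def issue(self, uop):
        entry = {"uop": uop, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry):
        entry["done"] = True             # execution may finish in any order

    def retire(self):
        retired = []
        # Retire only from the head, preserving the sequential model.
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["uop"])
        return retired

rob = ReorderBuffer()
a, b, c = (rob.issue(x) for x in ("load b", "load c", "add"))
rob.complete(b)                          # 'load c' finishes first...
print(rob.retire())                      # ...but nothing retires: prints []
rob.complete(a); rob.complete(c)
print(rob.retire())                      # prints ['load b', 'load c', 'add']
```

Retiring only from the head is what reimposes the sequential fault model on top of out-of-order execution.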

Pentium Pro (continued)
Following renaming, the operations wait in a 20-entry reservation station (RS) until all of their operands are data ready and a functional unit is available. As many as five micro-ops per clock can pass from the reservation station to the various execution units. These units perform the desired computation and send the result data back to data-dependent micro-ops in the reservation station, as well as storing the result in the reorder buffer (ROB).

The reorder buffer stores the individual micro-ops in the original program order and retires as many as three per clock into the retirement register file (RRF). This file examines each completed micro-op for the presence of faults or branch mispredictions, and aborts further retirement upon detecting such a discontinuity. This reimposes the sequential fault model and the illusion of a microarchitecture that executes each instruction in strict sequential order.

The L1 data cache unit (DCU) acts as one of the execution units. It can accept a new load or store operation every clock and has a data latency of three clocks for loads. It contains an 8-Kbyte, two-way associative cache array, plus fill buffers to buffer data and track the status of as many as four simultaneously outstanding data cache misses.

The bus interface unit (BIU) processes cache misses. This unit manages the L2 cache and its associated 64-bit, full-frequency bus, as well as the front-side system bus, which typically operates at a fraction of the core frequency, such as 66 MHz on a 200-MHz processor. Transactions on the front-side and dedicated buses are organized as 32-byte cache line transfers. The overall bus architecture permits multiple Pentium Pro processors to be interconnected on the front-side bus to form a glueless, symmetric, shared-memory multiprocessing system.

For a detailed discussion of the microarchitecture, see Colwell and Steck,1 the Intel Web site,2 and Gwennap.3

References
1. R. Colwell and R. Steck, "A 0.6-μm BiCMOS Microprocessor with Dynamic Execution," Proc. Int'l Solid-State Circuits Conf., IEEE, Piscataway, N.J., 1995, pp. 176-177.
2. Intel's World Wide Web site: http://www.intel.com/procs/p6/p6white/index.html.
3. L. Gwennap, "Intel's P6 Uses Decoupled Superscalar Design," Microprocessor Report, Vol. 9, No. 2, Feb. 16, 1995, pp. 9-15.

We studied the effects of L1 cache size, pipeline depth, branch prediction effectiveness, renaming width, reservation station depth and organization, and reorder buffer depth. Quite often, we found that our initial intuition was wrong and that every assumption had to be tested and tuned to what was proven to work.

The trade-off
Based on our circuit studies, we explored what would happen if we boosted the core frequency by 1.5 times over our initial straw man. This required a few simple changes to the reservation station, but clearly we could build the basic core to operate at this frequency. It would allow us to retain single-cycle execution of basic arithmetic operations at a 50 percent higher frequency. The rest of the pipeline would then have to be retuned to use one-third fewer gates per clock than a comparable Pentium microarchitecture.

We first added a clock cycle to data cache lookup, changing it from two to three cycles. We used the performance simulator to model this change and discovered that it resulted in a 7 percent degradation (increase) in clock cycles per instruction. The next change was to rework the instruction fetch/decode/rename pipeline, resulting in an additional two stages in the in-order fetch pipeline. Retirement also required one additional pipe stage. This lengthening resulted in a further clocks-per-instruction (CPI) degradation of about 3 percent. Finally, we pursued a series of logic simplifications to shorten critical speed paths. We applied less aggressive organizations to several microarchitecture areas and experienced another 1 percent in CPI loss.

The high-frequency microarchitecture completes instructions at a 50 percent higher rate than the lower frequency microarchitecture, but requires 11 percent more of the now-faster clocks per 100 instructions to enable this higher frequency. The net performance is (1.5/1.0) * (1.0/1.11) = 1.35, or a 35 percent performance improvement, a very significant gain compared to most microarchitecture enhancements.

[Figure 1. Delivered performance versus clock frequency (100 to 150 MHz).]

In Figure 1 we see that performance generally improves as clock frequency increases. The improvement is not linear or monotonic, however. Designers must make a series of microarchitectural changes or trade-offs to enable the higher frequency. This in turn results in "jaggies," or a CPI degradation and performance loss, at the point at which we make a change. The major drop shown at point 1 represents the 7 percent CPI loss due to the added data cache pipe stage. The series of minor deflections at points 2, 3, and 4 shows the effect of added front-end pipe stages. The overall trend does not continue indefinitely, as the CPI starts to roll off dramatically once the central core no longer maintains one-clock latency for simple operations.

The right of this graph shows a fairly clear performance win, assuming that one picks a point that is not in the valley of one of the CPI dips, and assuming that the project can absorb the additional effort and complexity required to hit higher clock speeds. When we first laid out the straw man, we did not expect the CPI-clock frequency curve to have this shape.
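The arithmetic of this trade-off is compact enough to restate directly; the numbers below are the ones quoted in the text.

```python
# Performance = frequency / CPI. The 1.5x clock came at an 11 percent
# CPI cost (7% data cache stage, ~3% front-end/retirement stages,
# ~1% logic simplifications).
freq_ratio = 1.5
cpi_ratio = 1.11
print(f"net speedup: {freq_ratio / cpi_ratio:.2f}x")   # ~1.35x

# The cycle-time budget behind "one-third fewer gates per clock":
# a 1.5x frequency shrinks each stage's gate budget to 1/1.5 = 2/3.
print(f"gate budget per stage: {1 / freq_ratio:.0%} of the original")
```

The same two-line calculation applies to any proposed frequency push: the clock multiplier only pays off to the extent that it outruns the CPI it costs.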

Our initial intuition suggested that the cost of an extra clock of latency on loads would be more severe than it actually is. Further, past industry experience suggests that high frequency at the expense of deep pipelines often results in a relative standstill in performance for general-purpose integer code. However, our performance simulator showed the graphed behavior and the performance win possible from higher clock speeds. Since we had carefully validated the simulator with micro-benchmarks (to ensure that it really modeled the effect in question), we were inclined to believe its results over our own intuition and went ahead with the modified design. We did, however, work to come up with qualitative explanations of the results, which follow.

Consider a program segment that takes 100 clock cycles at 100 MHz. The baseline microarchitecture takes 1 microsecond to execute this segment. We modify this baseline by adding an extra pipe stage to loads. If 30 percent of all operations are loads, this would add 30 clocks to the segment, which would then take 130 clocks to execute. If the extra pipe stage enables a 50 percent higher frequency, the total execution time becomes 130/150, or 0.867 microseconds. This amounts to a 15 percent performance improvement (1/0.867). This is certainly a higher performance microarchitecture, but hardly the 50 percent improvement one might naively expect from the frequency increase alone. Such a result is typical of past experience with in-order pipelines when we seek the CPI-frequency balance.

The Pentium Pro microarchitecture does not suffer this amount of CPI degradation from increased load latency because it buffers multiple loads, dispatches them out of order, and completes them out of order. About 50 percent of loads are not critical from a dataflow perspective. These loads (typically from the stack or from global data) have their address operands available early. The 20-entry reservation station in the Pentium Pro processor can buffer a large pool of micro-ops, and these "data-ready" load micro-ops can bubble up and issue well ahead of the critical-path operations that need their results. For this class of loads, an extra clock of load latency does not impact performance.

The remaining 50 percent of the loads have a frequent overlap of multiple critical-path loads. For example, the code fragment a = b + c might compile into the sequence

  load b => r1
  load c => r2
  r1 plus r2 => r3

Both b and c are critical-path loads, but even if each takes an extra clock of latency, only one extra clock is added for both, assuming loads are pipelined and nonblocking. This blocking-factor effect varies, depending upon the program mix. But a rule of thumb for the Pentium Pro processor is that additional clocks of load latency cost approximately half of what they do in a strict in-order architecture. Thus the 15 percent of micro-ops that are critical-path loads take an extra clock, but the overlap factor results in one half of the equivalent in-order penalty, or about 7.5 percent. This is close to what we measured in the detailed simulation of actual code.

Now let's look at the effect of additional fetch pipeline stages. If branches are perfectly predicted, the fetch pipeline can be arbitrarily long, at no performance cost. The cost of extra pipe stages comes only on branch mispredictions. If 20 percent of the micro-ops are branches, and branch prediction is 90 percent accurate, then two micro-ops in 100 are mispredictions. Each additional clock in the fetch pipeline will add one clock per misprediction. If the base CPI is about 1, we'll see about two extra clocks per 100 micro-ops, or an additional 2 percent per pipe stage. The actual penalty is somewhat less, because branch prediction is typically better than 90 percent, and there is a compound-interest effect: as the pipeline gets longer, the CPI degrades and the cost of yet another pipe stage diminishes.

Clock frequency versus effort
This all sounds like a fine theoretical basis for building a faster processor, but it comes at a nontrivial cost. Using a higher clock frequency reduces the margin for error in any one pipe stage. The consequence of needing one too many gates to get the required functionality into a given pipe stage is a 9 to 10 percent performance loss, rather than a 4 to 5 percent loss. So the design team must make a number of small microarchitecture changes as the design matures, since it is impossible to perfectly anticipate every critical path and design an ideal pipeline. This results in rework and a longer project schedule. Further, with short pipe stages, many paths cannot absorb the overhead of logic synthesis, increasing the proportion of the chip for which we must hand-design circuits.

Higher clock speeds require much more hand layout and careful routing. The densities achievable by automatic placement and routing are often inadequate to keep parasitic delays within what will fit in a clock cycle. Beyond that, the processor spends a bigger fraction of each clock period on latch delay, set-up time, clock skew, and parasitics than with a slower, wider pipeline. This puts even more pressure on designers to limit the number of gates per pipe stage.

The higher performance that results from higher clock speeds places more pressure on the clock and power distribution budget. The shorter clock period is less able to absorb clock jitter and localized voltage sags, requiring very careful and detailed speed path simulations.

As long as a design team expects, manages, and supports this extra effort, clock speedups provide an excellent path to higher performance. Even if this comes at some CPI degradation, the end result is both a higher performance product and one that hits production earlier than one that attempts to retrofit higher clock frequency into a pipeline not designed for it.
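The two rules of thumb above reduce to one-line estimates. This sketch simply re-derives the article's figures; the fractions are the ones quoted in the text.

```python
def extra_load_clock_cpi_cost(load_frac=0.30, noncritical_frac=0.5,
                              overlap_factor=0.5):
    # Half the loads are data-ready early and hide the extra clock
    # entirely; overlapping critical-path loads (the a = b + c case)
    # roughly halve the remainder.
    return load_frac * (1 - noncritical_frac) * overlap_factor

def extra_fetch_stage_cpi_cost(branch_frac=0.20, mispredict_rate=0.10):
    # Each added fetch stage costs one clock per mispredicted branch.
    return branch_frac * mispredict_rate

print(f"extra load clock:  +{extra_load_clock_cpi_cost():.1%} CPI")
print(f"extra fetch stage: +{extra_fetch_stage_cpi_cost():.1%} CPI")
```

Against a base CPI of about 1, these land on the 7.5 percent load-latency figure and the roughly 2-percent-per-stage fetch figure discussed above.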

The terms "architectural efficiency" or "performance at the same clock" are sometimes taken as metrics of goodness in and of themselves. Perhaps this is one way of apologizing for low clock rates, or a way to imply higher performance when the microarchitecture "someday" reaches a clock rate that is in fact unobtainable for that design with that process technology. Performance at the same clock is not a good microarchitectural goal if it means building bottlenecks into the pipeline that will forever impact clock frequency. Similarly, low latency by itself is not an important goal. Designers must consider the balance between latency, available parallelism in the application, and the impact on clock speed of forcing a lot of functionality into a short clock period.

It is equally meaningless to brag about high clock frequency without considering the CPI and other significant performance trade-offs made to achieve it. In designing the Pentium Pro microarchitecture, we balanced our efforts on increasing frequency and reducing CPI. As architects, we spent the same time working on clock frequency and layout issues as on refining parallel-execution techniques. The true measure of an architecture is delivered performance, which is clock speed/CPI, and not optimal CPI with low clock speed, or great clock speed but poor CPI.

One final interesting result was that the dynamic-execution microarchitecture was actually a major enabler of higher clock frequency. In 1990, many pundits claimed that the complexity of out-of-order techniques would ultimately lead to a clock speed degradation, due to the second-order effects of larger die size and bigger projects with more players. In the case of the Pentium Pro processor, we often found that dynamic execution enabled us to add pipe stages to reduce the number of critical paths. We did not pay the kind of CPI penalty that an in-order microarchitecture would have suffered for the same change. By alleviating some of the datapath barriers to higher clock frequency, we could focus our efforts on the second-order effects that remained.

Tuning area and performance
Another critical tuning parameter is the trade-off between silicon area and CPU performance. As designers of a new microarchitecture, we are always tempted to add more capability and more features to try to hit as high a performance as possible. We try to guesstimate what will fit in a given level of silicon technology, but our early estimates are generally optimistic and the process is not particularly accurate.

As we continued to refine the Pentium Pro microarchitecture, we discovered that, by and large, most applications do not perform as well as possible, being unable to keep all of the functional units busy all of the time. At the same time, better understanding of layout issues revealed that the die size of the original microarchitecture was uncomfortably large for high-volume manufacturing.

We found that the deep buffering provided by the large, uniform reservation station allowed a time-averaging of functional-unit demand. Most program parallelism is somewhat bursty (that is, it occurs in nonuniform clumps spread through the application). The dynamic-execution architecture can average out bursty demands for functional units; it draws micro-ops from a large range of the program and dispatches them whenever they and a functional unit become ready. No particular harm comes from delaying any one micro-op, since a micro-op can execute in several different clocks without affecting the critical path through the flow graph. This contrasts with in-order superscalar approaches, which offer only one opportunity to execute an operation that will not result in adding a clock or more to the execution time. The in-order architecture runs in feast-or-famine mode, its multiple functional units idle much of the time and only coming into play when parallelism is instantaneously available to fit its available templates.

The same phenomenon occurs in the instruction decoder. A decoder for a complex instruction set will typically have restrictions (termed "templates" here) placed upon it. This refers to the number and type of instructions that can be decoded in any one clock period. The Pentium Pro's decoder operates to a 4-1-1 template. It decodes up to three instructions each clock, with the first decoder able to handle most instructions and the other two restricted to single dataflow nodes (micro-ops) such as loads and register-to-register operations. A hypothetical 4-2 template could decode up to two instructions per clock, with the second decoder processing stores and memory-to-register instructions as well as single micro-op instructions.

The Pentium Pro's instruction decoder has a six-micro-op queue on its output, and the reservation station provides a substantial amount of additional buffering. If a template restriction forces a realignment, and only two micro-ops are decoded in a clock, opportunities exist to catch up in subsequent clocks. At an average CPI of about 1, there is no long-term need to sustain a decode rate of three instructions per clock. Given adequate buffering and some overcapacity, the decoder can stay well ahead of the execution dataflow. The disparity between CPI and the maximum decode rate reduces the template restrictions to a negligible impact.

After observing these generic effects, we performed sensitivity studies on other microarchitecture aspects. We trimmed each area of the preliminary microarchitecture until we noted a small performance loss. For example, we observed that each of the two load/store ports was used about 20 percent of the time. We surmised that changing to one dedicated load port and one dedicated store port should not have a large effect on performance.
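That surmise can be checked with back-of-the-envelope utilization arithmetic. The 30/10 load/store mix below is an assumption inferred from the utilization figures quoted in the text, not taken from design data.

```python
load_frac, store_frac = 0.30, 0.10      # assumed micro-op mix

# Two general load/store ports split the combined traffic evenly.
per_shared_port = (load_frac + store_frac) / 2
print(f"two general ports:  {per_shared_port:.0%} busy each")

# Dedicated ports each absorb only their own traffic; both remain far
# from saturation, so the change should cost little performance.
print(f"dedicated load port:  {load_frac:.0%} busy")
print(f"dedicated store port: {store_frac:.0%} busy")
```

With neither dedicated port anywhere near fully occupied, queueing for a port is rare, which is why the simplification is nearly free.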

The load port would operate about 30 percent of the time and the store port about 10 percent of the time. This proved to be the case, with less than a 1 percent performance loss for this change.

Changing from a 4-2-2-2 decode template (four instructions per clock) to a 4-2-2 template (three instructions per clock) also was a no-brainer, with no detectable performance change on 95 percent of programs examined. We also changed the renaming and retirement ports from four micro-ops per clock to three, which resulted in a slightly larger, but still manageable, 2 percent performance loss.

Finally, we reduced the L1 cache size from 16 to 8 Kbytes. In doing this, we took advantage of the full-frequency dedicated bus we had already chosen. Since the L1 cache is backstopped by a full-bandwidth, three-clock L2 cache, the extra L1 misses that result from cutting the L1 cache size cause a relatively minor 1.5 percent performance loss.

The reduction from four- to three-way superscalar operation and the reduction in L1 cache size had some negative impact on chest-thumping bragging rights, but we could not justify the extra capacity by the delivered performance. Further, tenaciously holding on to extra logic would have resulted in significant negative consequences in die area, clock speed, and project schedule.

As the design progressed, we eventually found that even the first round of trimming was not enough. We had to make further reductions to keep die size in a comfortable range and, as it turned out later, maintain clock frequency. This required making further cuts, which resulted in detectable performance loss, rather than the truly negligible losses from the earlier changes. We made two major changes. We cut back the decoder to a 4-1-1 template from a 4-2-2 template. This amounted to about a 3 percent performance loss. We also cut back the branch target buffer from 1,024 to 512 entries, which barely affected SPECint92 results (1 percent) but did hurt transaction processing (5 percent). It was emotionally difficult (at the time) for the microarchitecture team to accept the resulting performance losses, but these results turned out to be critical to keeping the die area reasonable and obtaining a high clock frequency. This kind of careful tuning and flexibility in product goals was essential to the ultimate success of the program.

It is important to consider the actual shape of the area-performance curve. Most high-end CPU designs operate well past the knee of this curve. Efficiency is not a particularly critical goal. For example, the market demands as much performance as possible from a given technology, even when that means using a great deal of silicon area for relatively modest incremental gains.

[Figure 2. Performance versus die area for different decoder designs (relative area, percentage).]

Figure 2 illustrates this effect. This graph charts the performance of various instruction decoder schemes coupled to a fixed execution core. All of the architectures discussed earlier are clearly well past the knee of the performance curve. Moving from point A (a 4-2-2-2 template) to point B (4-2-2) is clearly the right choice, since the performance curve is almost flat. Moving down to point C (4-1-1) shows a detectable performance loss, but it is hardly disastrous in the grand scheme of things. Point D (4-2, one we actually considered at one time) occupies an awkward valley in this curve, barely improved over point E (4-1) for significantly more area and noticeably lower performance than point C.

As we converted the microarchitecture to transistors and then to layout, execution quality became critical. All members of the project team participated in holding the line on clock frequency and area. Some acted as firefighters, handling the hundreds of minor emergencies that arose. This phase is very critical in any major project. If a project takes shortcuts and slips clock frequency or validation to achieve earlier silicon, the design often contains bugs and suffers unrecoverable performance losses. This design phase determines a project's ultimate success. The best planned and most elegant microarchitecture will fail if the design team does not execute implementations well. As CPU architects, we were very fortunate to work with a strong design and layout team that could realize our shared vision in the resulting silicon.

THE PENTIUM PRO PROCESSOR ran DOS, Windows, and Unix within one week of receiving first silicon. We had most major operating systems running within one month. We made a series of small metal fixes to correct minor bugs and speed paths. The A2 material, manufactured using a 0.6-micron process, ran at 133 MHz with a production-quality test program (including 85-degree case temperature and 5 percent voltage margins).

The B0 stepping incorporated several bug and speed path fixes for problems discovered on the A-step silicon, and added frequency ratios to the front-side bus. Our success with early Pentium Pro processor silicon, plus positive early data on the 0.35-micron process, encouraged us to retune the L2 access pipeline. Retuning allowed for dedicated bus frequencies in excess of 200 MHz. We added one clock to the L2 pipeline, splitting the extra time between address delivery and data return, while retaining the full-core clock frequency and the pipelined/nonblocking data access capability. The 0.6-micron B0 silicon became a 150-MHz, production-worthy part and met our goals for performance and frequency using 0.6-micron Pentium processor technology. We optically shrank the design to the newly available 0.35-
micron process, which allowed Intel to add a 200-MHz processor to the product line. Table 1 shows the delivered performance on some industry-standard benchmarks. These results are competitive with every processor built today on any instruction set architecture.

Table 1. Pentium Pro performance.

  Processor (0.6 micron, 150 MHz)    Processor (0.35 micron, 200 MHz)
  6.08 SPECint95                     8.09 SPECint95
  5.42 SPECfp95                      6.75 SPECfp95

The Pentium Pro processor was formally unveiled on November 1, 1995, 10 months after first silicon. Since then, more than 40 systems vendors have announced the availability of computer systems based on the processor. In the future, we will enhance the basic microarchitecture with multimedia features and ever higher clock speeds that will become available as the microarchitecture moves to 0.25-micron process technology and beyond.

David B. Papworth is a principal processor architect for Intel Corporation and one of the senior architects of the Pentium Pro processor. Earlier, he was director of engineering for Multiflow Computer, Inc., and one of the architects of the first commercial VLIW processor. He holds a BSE degree from the University of Michigan.

Direct comments regarding this article to the author at Intel Corporation, JF1-19, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124; papworth@ichips.intel.com.

