IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 46, NO. 1, JANUARY 2011, p. 145

POWER7™, a Highly Parallel, Scalable Multi-Core High End Server Processor

Dieter F. Wendel, Member, IEEE, Ron Kalla, James Warnock, Senior Member, IEEE, Robert Cargnoni, Member, IEEE, Sam G. Chu, Joachim G. Clabes, Daniel Dreps, Member, IEEE, David Hrusecky, Josh Friedrich, Saiful Islam, Jim Kahle, Jens Leenstra, Gaurav Mittal, Jose Paredes, Juergen Pille, Member, IEEE, Phillip J. Restle, Member, IEEE, Balaram Sinharoy, Fellow, IEEE, George Smith, Senior Member, IEEE, William J. Starke, Scott Taylor, Member, IEEE, A. James Van Norstrand, Jr., Stephen Weitzel, Member, IEEE, Phillip G. Williams, and Victor Zyuban, Member, IEEE

Abstract—This paper gives an overview of the latest member of the POWER™ processor family, POWER7™. Eight quad-threaded cores, operating at frequencies up to 4.14 GHz, are integrated together with two memory controllers and high speed system links on a 567 mm² die, employing 1.2B transistors in a 45 nm CMOS SOI technology with 11 layers of low-k copper wiring. The technology features deep trench capacitors which are used to build a 32 MB embedded DRAM L3 based on a 0.067 µm² DRAM cell. The functionally equivalent chip transistor count would have been over 2.7B if the L3 had been implemented with a conventional 6 transistor SRAM cell. (The eDRAM implementation is described in detail in a separate paper in this Journal.) Deep trench capacitors are also used to reduce on-chip voltage island supply noise. This paper describes the organization of the design and the features of the processor core, before moving on to discuss the circuits used for analog elements, clock generation and distribution, and I/O designs. The final section describes the details of the clocked storage elements, including special features for test, debug, and chip frequency tuning.

Index Terms—Clocked storage element design, clock grid, CML circuits, debug features, deep trench capacitor, design for reliability, design for test, differential I/O, digital PLL, duty cycle correction, eight core processor, embedded DRAM, flip-flop design, high-speed I/O, latch, LBIST, L3 cache, microprocessor, multi-core, multiport SRAM, POWER processor, POWER7, pulsed-clock latch, quad-threaded core, SER, SEU, SMP, SOI, vector register file, vector scalar unit.

Manuscript received April 17, 2010; revised July 18, 2010; accepted August 31, 2010. Date of publication November 09, 2010; date of current version December 27, 2010. This paper was approved by Guest Editor Tanay Karnik. This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002.
D. Wendel, J. Leenstra, and J. Pille are with IBM Research and Development GmbH, Boeblingen 71032, Germany (e-mail: [email protected]).
R. Kalla, R. Cargnoni, S. Chu, J. Clabes, D. Dreps, D. Hrusecky, J. Friedrich, S. Islam, J. Kahle, G. Mittal, J. Paredes, W. Starke, S. Taylor, J. Van Norstrand, Jr., S. Weitzel, and P. Williams are with the IBM Systems and Technology Group, Austin, TX 78758 USA (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).
J. Warnock, B. Sinharoy, and G. Smith are with the IBM Systems and Technology Group, Poughkeepsie, NY 12601 USA (e-mail: [email protected]).
P. J. Restle and V. Zyuban are with IBM Research, Yorktown Heights, NY 10598 USA.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JSSC.2010.2080611
0018-9200/$26.00 © 2010 IEEE

I. INTRODUCTION

This paper introduces POWER7™, the next processor of the POWER™ family. It gives an overview of the chip, elaborates on key units including the vector and scalar unit and the load/store unit, and describes the cache hierarchy. Selected implementations from the analog, clock generation and I/O circuitry are shown, and design considerations for the implementation of the clocked storage elements and local clock splitters are discussed.

II. TECHNOLOGY

The 567 mm² POWER7™ chip is fabricated in IBM's 45 nm Silicon-On-Insulator (SOI) CMOS technology. A key innovation introduced at the 45 nm technology node is the deep trench technology used for the embedded DRAM and for on-chip voltage supply decoupling. The single-transistor deep trench cell has an area of 0.067 µm². Two different 6-transistor memory cells are employed in the dual supply SRAM designs [4]. The 0.404 µm² cell was used in the dense L2 cache and L3 directory arrays, and the 0.462 µm² high-performance cell was used by the arrays working at the processor core frequency, such as the L1 caches.

The technology provides 11 levels of copper wiring, optimized for density and performance as shown in Table I.

[TABLE I: THE 11 LEVEL METAL STACK]

SOI technology offers superior soft error immunity, compared to bulk, due to the small device volume sensitive to the effects of incident high-energy particles. SOI-specific circuit techniques, such as device stacking, have been used to further reduce the soft error rate.

Key technology features are summarized in Table II. See [5] for more detailed information.

[TABLE II: KEY TECHNOLOGY FEATURES]

III. CHIP OVERVIEW

The chip is partitioned into independent, modular blocks, enabling parallel design closure and the separation of unused cores from supply voltages. The modular design style also allowed POWER7™ to provide new levels of system flexibility by maintaining 23 frequency domains, 47 different voltage supply domains, and 16 on-chip generated supply voltages.

Fig. 1 illustrates the location of the largest components described in the following. Each of the 8 Cache/Core Partition (CCP) chiplets has its own set of separate supply voltages for logic, SRAM, and DRAM as well as an analog voltage for the Digital Phase-Locked Loop (DPLL). The CCP chiplet also contains two internal charge-pump-derived voltages to supply the DRAM wordlines. Each core, along with its associated L2 and L3 arrays, can operate at a different, dynamically adjustable frequency to balance system performance and energy consumption. Nominal, power-save, turbo and sleep modes are supported. Other major modular blocks shown in Fig. 1 include the local and remote SMP interconnect partitions and the memory interfaces. Asynchronous connections link each core/cache complex, the 2 integrated memory controllers, the off-chip local and global SMP links and the I/O controllers to a [...]

[Fig. 1. Main POWER7™ building blocks.]

[...] starts with the instruction fetch unit (IFU), requesting up to 32 PowerPC instructions from a 256 KB L2 cache. Instructions are transferred from the L2 into a 32 KB instruction cache. The instruction decoder fetches instructions from the I-cache and places them into groups of up to 6 instructions, ready for dispatch. Whenever instructions are fetched from the I-cache they are also scanned for branches. If a branch is encountered, branch prediction logic in the IFU predicts whether or not it is taken and the branch target, if needed. The entire group of instructions is then dispatched to the instruction sequencing unit (ISU).

The ISU has multiple issue queues where instructions are held until all resources are available for execution. Instructions for the fixed point unit (FXU), load store unit (LSU), decimal floating point unit (DFU) and instructions targeted for the merged vector scalar unit (VSU) floating point unit (FPU) are kept in two unified queues. The unified queue is divided into two 24-entry halves. Branches are kept in a 12-entry branch issue queue. Once a branch instruction can be resolved it is sent to the branch execution unit (BRU) located in the IFU. The actual result of the branch is compared with the predicted value. If a misprediction is detected, all younger instructions are flushed and the instruction fetch is redirected to the correct target of the branch. The PowerPC architecture has a rich set of condition register (CR) manipulating instructions. CR ops are kept in an eight-entry CR issue queue and executed by the condition register unit (CRU), also physically integrated in the IFU.

A. The Vector and Scalar Unit (VSU)

The VSU of the POWER7™ processor merges the formerly separate Vector Media eXtension (VMX) unit and the scalar Binary Floating point Unit (BFU) into a single unit for power and area reduction. In addition, the VSU supports the new VSX architecture. With the VSU dual instruction issue capability and the use of new VSX instructions such as "Vector Multiply-Add Double Precision" (128-bit wide data using 2-way SIMD FP 64-bit), the processor can support 8 Double Precision FLOPs per cycle. This doubles the POWER7™ core floating point performance in comparison to the POWER6™ core. The 64 VSX architectural registers are mapped onto the set of 32 VMX architectural registers and the 32 registers of the scalar BFU. This circumvents the need to implement new registers, and provides VSX instructions with direct access to results in the VMX and BFU registers. Finally, in POWER7™ up to 4 threads are able to run on each core in parallel, and therefore 4 times 64 registers plus renames are supported by the Vector Register File (VRF).

The organization of the VSU is illustrated in Fig. 3. Each floating point (FP) unit implements both the support for floating point single/double precision instructions (as found in the VSX, BFU and VMX Vector Float architecture) and the VMX complex integer instructions by means of a common data path.
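
The figure of 8 double precision FLOPs per cycle quoted for the VSU can be checked with simple arithmetic: two instructions issued per cycle, two 64-bit lanes per 128-bit operand, and two floating point operations (one multiply, one add) per multiply-add. The Python sketch below works through that product and the count of architected VSX registers across four threads. All function and parameter names are illustrative assumptions rather than terms from the paper. The chip-level number is only a theoretical upper bound, assuming all eight cores sustain dual multiply-add issue at the 4.14 GHz maximum frequency given in the abstract. Rename registers are left out because the text does not state how many there are.

```python
def peak_dp_flops_per_cycle(issue_width=2, simd_lanes=2, flops_per_fma=2):
    """Peak double precision FLOPs per core per cycle.

    issue_width   -- VSU instructions issued per cycle (dual issue)
    simd_lanes    -- 64-bit lanes in one 128-bit VSX operand
    flops_per_fma -- a multiply-add counts as one multiply plus one add
    """
    return issue_width * simd_lanes * flops_per_fma


def architected_vsx_registers(threads=4, regs_per_thread=64):
    """Architected VSX registers the VRF must hold, excluding renames."""
    return threads * regs_per_thread


def peak_chip_gflops(cores=8, freq_ghz=4.14):
    """Theoretical upper bound for the whole chip, in GFLOPS (assumption:
    every core issues two vector multiply-adds every cycle)."""
    return cores * freq_ghz * peak_dp_flops_per_cycle()


if __name__ == "__main__":
    print(peak_dp_flops_per_cycle())      # 8
    print(architected_vsx_registers())    # 256
    print(round(peak_chip_gflops(), 2))   # 264.96
```

Halving `issue_width` or `simd_lanes` gives 4 FLOPs per cycle, which is consistent with the statement that POWER7™ doubles the per-core floating point performance of POWER6™, although the text does not say which factor changed.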