POWER7™, a Highly Parallel, Scalable Multi-Core High End Server Processor

Dieter F. Wendel, Member, IEEE, Ron Kalla, James Warnock, Senior Member, IEEE, Robert Cargnoni, Member, IEEE, Sam G. Chu, Joachim G. Clabes, Daniel Dreps, Member, IEEE, David Hrusecky, Josh Friedrich, Saiful Islam, Jim Kahle, Jens Leenstra, Gaurav Mittal, Jose Paredes, Juergen Pille, Member, IEEE, Phillip J. Restle, Member, IEEE, Balaram Sinharoy, Fellow, IEEE, George Smith, Senior Member, IEEE, William J. Starke, Scott Taylor, Member, IEEE, A. James Van Norstrand, Jr., Stephen Weitzel, Member, IEEE, Phillip G. Williams, and Victor Zyuban, Member, IEEE

Abstract—This paper gives an overview of the latest member of the POWER™ processor family, POWER7™. Eight quad-threaded cores, operating at frequencies up to 4.14 GHz, are integrated together with two memory controllers and high speed system links on a 567 mm² die, employing 1.2B transistors in a 45 nm CMOS SOI technology with 11 layers of low-k copper wiring. The technology features deep trench capacitors which are used to build a 32 MB embedded DRAM L3 based on a 0.067 μm² DRAM cell. The functionally equivalent chip would have been over 2.7B transistors if the L3 had been implemented with a conventional 6-transistor SRAM cell. (A detailed paper about the eDRAM implementation will be given in a separate paper of this Journal.) Deep trench capacitors are also used to reduce on-chip voltage island supply noise. This paper describes the organization of the design and the features of the processor core, before moving on to discuss the circuits used for analog elements, clock generation and distribution, and I/O designs. The final section describes the details of the clocked storage elements, including special features for test, debug, and chip frequency tuning.

Index Terms—Clocked storage element design, clock grid, CML circuits, debug features, deep trench capacitor, design for reliability, design for test, differential I/O, digital PLL, duty cycle correction, eight core processor, embedded DRAM, flip-flop design, high-speed I/O, latch, LBIST, L3 cache, multi-core, multiport SRAM, POWER processor, POWER7, pulsed-clock latch, quad-threaded core, SER, SEU, SMP, SOI, vector register file, vector scalar unit.

Manuscript received April 17, 2010; revised July 18, 2010; accepted August 31, 2010. Date of publication November 09, 2010; date of current version December 27, 2010. This paper was approved by Guest Editor Tanay Karnik. This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002. D. Wendel, J. Leenstra, and J. Pille are with IBM Research and Development GmbH, Boeblingen 71032, Germany. R. Kalla, R. Cargnoni, S. Chu, J. Clabes, D. Dreps, D. Hrusecky, J. Friedrich, S. Islam, J. Kahle, G. Mittal, J. Paredes, W. Starke, S. Taylor, J. Van Norstrand, Jr., S. Weitzel, and P. Williams are with the IBM Systems and Technology Group, Austin, TX 78758 USA. J. Warnock, B. Sinharoy, and G. Smith are with the IBM Systems and Technology Group, Poughkeepsie, NY 12601 USA. P. J. Restle and V. Zyuban are with IBM Research, Yorktown Heights, NY 10598 USA. Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/JSSC.2010.2080611

I. INTRODUCTION

THE next processor of the POWER™ family, POWER7™, is introduced, giving an overview of the chip, elaborating on key units including the vector and scalar unit and the load/store unit, and describing the cache hierarchy. Selected implementations from the analog, clock generation and I/O circuitry are shown, and design considerations for the implementation of the clocked storage elements and local clock splitters are discussed.

II. TECHNOLOGY

The 567 mm² POWER7™ chip is fabricated in IBM's 45 nm Silicon-On-Insulator (SOI) CMOS technology. A key innovation introduced at the 45 nm technology node is the deep trench technology used for the embedded DRAM and for on-chip voltage supply decoupling. The single-transistor deep trench cell has an area of 0.067 μm². Two different 6-transistor memory cells are employed in the dual supply SRAM designs [4]. The 0.404 μm² cell was used in the dense L2 cache and L3 directory arrays, and the 0.462 μm² high-performance cell was used by the arrays working at the processor core frequency, such as the L1 caches.

The technology provides 11 levels of copper wiring, optimized for density and performance as shown in Table I.

TABLE I: THE 11 LEVEL METAL STACK

SOI technology offers superior soft error immunity, compared to bulk, due to the small device volume sensitive to the effects of incident high-energy particles. SOI-specific circuit techniques, such as device stacking, have been used to further reduce the soft error rate.
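As a rough cross-check of the transistor counts quoted in the abstract, the short sketch below (Python) estimates how many additional transistors a conventional 6T SRAM implementation of the 32 MB L3 would have required; the per-bit cell counts come from this section, while ignoring array overhead is a simplifying assumption, so the result is only a lower bound.

```python
# Back-of-the-envelope check of the eDRAM-vs-SRAM transistor counts quoted above.
# The 32 MB L3 uses 1-transistor deep-trench eDRAM cells; a 6T SRAM cell would
# need five more transistors per bit (peripheral and redundancy circuitry is
# ignored, so this is only a lower bound on the difference).

L3_BYTES = 32 * 1024 * 1024           # 32 MB of data (ECC/directory bits ignored)
BITS = L3_BYTES * 8

extra_per_bit = 6 - 1                  # 6T SRAM cell vs 1T eDRAM cell
extra_transistors = BITS * extra_per_bit

chip_total_edram = 1.2e9               # reported POWER7 transistor count
estimated_sram_chip = chip_total_edram + extra_transistors

print(f"extra transistors for a 6T L3 : {extra_transistors / 1e9:.2f} B")
print(f"estimated chip total with SRAM: {estimated_sram_chip / 1e9:.2f} B")
# Roughly 1.34 B extra cell transistors and ~2.5 B total; the "over 2.7 B"
# figure in the abstract additionally includes array support circuitry.
```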


Key technology features are summarized in Table II. See [5] for more detailed information.

TABLE II: KEY TECHNOLOGY FEATURES

III. CHIP OVERVIEW

The chip is partitioned into independent, modular blocks, enabling parallel design closure and the separation of unused cores from supply voltages. The modular design style also allowed POWER7™ to provide new levels of system flexibility by maintaining 23 frequency domains, 47 different voltage supply domains, and 16 on-chip generated supply voltages. Fig. 1 illustrates the location of the largest components described in the following. Each of the 8 Cache/Core Partition (CCP) chiplets has its own set of separate supply voltages for logic, SRAM, and DRAM, as well as an analog voltage for the Digital Phase-Locked Loop (DPLL). The CCP chiplet also contains two internal charge-pump-derived voltages to supply the DRAM wordlines. Each core, along with its associated L2 and L3 arrays, can operate at a different, dynamically adjustable frequency to balance system performance and energy consumption. Nominal, power-save, turbo and sleep modes are supported. Other major modular blocks shown in Fig. 1 include the local and remote SMP interconnect partitions and the memory interfaces. Asynchronous connections link each core/cache complex, the 2 integrated memory controllers, the off-chip local and global SMP links and the I/O controllers to a central interconnect that operates at a fixed frequency.

Fig. 1. Main POWER7™ building blocks.

IV. THE CORE

The processor core in the POWER7™ chip is comprised of seven separately integrated units as shown in Fig. 2. The units are placed to optimize performance. The flow of instructions starts with the instruction fetch unit (IFU), requesting up to 32 PowerPC instructions from a 256 KB L2 cache. Instructions are transferred from the L2 into a 32 KB instruction cache. The instruction decoder fetches instructions from the I-cache and places them into groups of up to 6 instructions, ready for dispatch. Whenever instructions are fetched from the I-cache they are also scanned for branches. If a branch is encountered, branch prediction logic in the IFU predicts whether or not it is taken and the branch target, if needed. The entire group of instructions is then dispatched to the instruction sequencing unit (ISU). The ISU has multiple issue queues where instructions are held until all resources are available for execution. Instructions for the fixed point unit (FXU), load store unit (LSU), decimal floating point unit (DFU), and instructions targeted for the merged vector scalar unit (VSU) floating point unit (FPU) are kept in two unified queues. The unified queue is divided into two 24-entry halves. Branches are kept in a 12-entry branch issue queue. Once a branch instruction can be resolved it is sent to the branch execution unit (BRU) located in the IFU. The actual result of the branch is compared with the predicted value. If a mispredict is found to have occurred, all younger instructions are flushed and the instruction fetch is redirected to the correct target of the branch. The PowerPC architecture has a rich set of condition register (CR) manipulating instructions. CR ops are kept in an eight-entry CR issue queue and executed by the condition register unit (CRU), also physically integrated in the IFU.

A. The Vector and Scalar Unit (VSU)

The VSU unit of the POWER7™ processor merges the former separate Vector Media eXtension (VMX) unit and the scalar Binary Floating point Unit (BFU) into a single unit for power and area reduction. In addition, the VSU supports the new VSX architecture. With the VSU dual instruction issue capability and the use of new VSX instructions such as "Vector Multiply-Add Double Precision" (128 bit wide data using 2-way SIMD FP 64 bit), the processor can support 8 double precision FLOPs per cycle. This doubles the POWER7™ core floating point performance in comparison to the POWER6™ core. The 64 VSX architectural registers are mapped onto the set of 32 VMX architectural registers and the 32 registers of the scalar BFU. This circumvents the need to implement new registers, and provides VSX instructions with direct access to results in the VMX and BFU registers. Finally, in POWER7™ up to 4 threads are able to run on each core in parallel, and therefore 4 times 64 registers plus renames are supported by the Vector Register File (VRF).

The organization of the VSU is illustrated in Fig. 3. Each floating point (FP) unit implements both the support for floating point single/double precision instructions (as found in the VSX, BFU and VMX Vector Float architecture) and the VMX complex integer instructions by means of a common data path. With the four FP double precision units, two VSX FP instructions can be executed simultaneously on both pipes. For execution of scalar FPU DP/SP instructions two FP units are used. Vector Float (VF) and Vector Complex (VC) instructions can only be issued on pipe 0, and each of the four FP units handles 32 bits for executing these types of instructions.
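The 8-FLOPs-per-cycle figure quoted above follows directly from the issue width and SIMD factors; the short check below spells out that arithmetic (the per-core peak rate at 4.14 GHz is added only as an illustration).

```python
# Peak double-precision throughput implied by the VSU organization described above.

vsx_pipes          = 2   # dual VSX FP issue, one instruction per pipe per cycle
simd_dp_per_pipe   = 2   # 128-bit data path = 2-way SIMD on 64-bit operands
flops_per_fma      = 2   # a fused multiply-add counts as two floating point operations

dp_flops_per_cycle = vsx_pipes * simd_dp_per_pipe * flops_per_fma
print(dp_flops_per_cycle)                       # -> 8 DP FLOPs per cycle per core

freq_hz = 4.14e9                                # maximum core frequency from the abstract
print(f"{dp_flops_per_cycle * freq_hz / 1e9:.1f} GFLOP/s per core at 4.14 GHz")
```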

Fig. 2. The core plus local L3 region.

Fig. 3. VSU merged structure.

The common dataflow provided significant savings in terms of transistor count, area, and power, compared to an implementation with separate VMX and BFU units, but introduced additional wiring complexity. With FPU0/1 on the bottom and FPU2/3 on the top as shown in Fig. 4, the areas with the most wiring complexity are those around the Vector Register File (VRF), the bypass macros and the permute unit in the center. All buses needed careful planning to minimize the timing impacts on critical nets from coupled noise.

Fig. 4. VSU floorplan.

The VRF is a multi-ported register file in the VSU, and its 172 entries of double-bit cells provide the architected registers and renames for up to 4 threads. The VRF supports the dual issue of instructions with 3 read operands via the 6 read ports. The fixed allocation of threads 0/2 and 1/3 to specific cells enabled the implementation of the VRF with double-bit cells as shown in Fig. 5. The thread bit of each instruction selects the sub-cell to read from, as shown by the multiplexer to the right of the cells in Fig. 5. The thread selection scheme could have been extended into quad-bit cells, but as single thread mode requires at least 144 entries for the architected and rename registers, this was found to be too costly area-wise in comparison to the 172-entry double-bit cell implementation. The multiplexing of the pair of 3 read ports is implemented with 2 OR-AND-INVERT (oai22) gates.
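A minimal behavioral sketch of the double-bit cell idea just described: threads 0/2 and 1/3 share one physical cell, and the thread bit of the reading instruction selects which sub-cell drives the read port. The Python below is an illustrative model only, not the circuit; the oai22 gate is represented by equivalent functional select logic.

```python
# Behavioral model of one VRF double-bit storage cell: two logical entries
# (thread pair 0/2 and thread pair 1/3) share one physical cell location.

class DoubleBitCell:
    def __init__(self):
        self.sub = [0, 0]             # sub-cell 0 -> threads 0/2, sub-cell 1 -> threads 1/3

    def write(self, thread, value):
        self.sub[thread & 1] = value  # the thread bit picks the sub-cell

    def read(self, thread):
        # The real design resolves this with an OR-AND-INVERT (oai22) gate per
        # pair of read ports; here the same selection is written functionally.
        return self.sub[thread & 1]

cell = DoubleBitCell()
cell.write(thread=0, value=1)         # threads 0 and 2 land in sub-cell 0
cell.write(thread=3, value=0)         # threads 1 and 3 land in sub-cell 1
assert cell.read(2) == 1 and cell.read(1) == 0
```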

In addition to using the double-bit cells, the VRF also implements two writes per cycle, called a double-pumped write. In the double-pumped write, one data port is written in the first half (c2 phase) of the cycle, which starts with the falling edge of the mesh clock, and the other data port is written during the second half cycle (c1 phase) as shown in Fig. 5. As a result, the VRF cell is implemented physically with 2 write ports instead of 4 write ports, significantly reducing the total height of the array. This makes both local and global read bit line lengths shorter, speeding up the overall read access time by 5%. Each bit of a double-bit cell has separate write selects for each of the logical 4 ports, meaning that threads 0/2 and 1/3 can write their results independently. Read/write conflicts are handled by a VRF global bypass that forwards the data on the write ports to the read ports while the cells are being written. The use of the double-bit cells and the double-pumped write resulted in a combined power savings of approximately 25%. These techniques also reduced the number of write selects and read selects needed by a factor of 2. This enabled the VRF to be built with a short and wide aspect ratio, since VRF entries could be placed in a single row. This in turn reduced the time needed for the result distribution to/from the permute unit, as these routes needed to cross over the VRF instances to reach the other units.

B. The Load/Store Unit (LSU)

The LSU handles POWER7™ memory operations as well as some fixed point operations through the use of dual concurrent channels. A reduced set of ALU operations may be scheduled when a channel is not claimed for other purposes. The channels share a common 3-ported data cache which can perform 2 reads and 1 write in parallel [6]. The dataflow achieves two cycle load-to-use and two cycle back-to-back loads in a 24 FO4 cycle.

The data cache (D$), using a 6T SRAM cell, provided the foundation for the load/store micro-architecture. The D$ has dual read ports to support the dual load channels. The SRAM cells are gathered into 16-row, independently addressable banks. Referring to Fig. 6, the read and write addresses are multiplexed immediately in front of the bank address decoder, thus allowing reads and writes to the D$ to occur while encountering a minimum of banking address read/write collisions. In the event of a read/write collision, the write always takes priority.

The D$ is 8-way set associative with a set-prediction (SETP) late-select mechanism. The SETP is accessed in parallel to the D$, using sum-address decoding, and the 1-of-8 set choice is resolved by means of an effective address hashing and compare algorithm. To allow the D$ space to be shared among multiple operating program threads, the SETP also contains thread identification information to help make the set choice. The SETP's choice is resolved and transported to the D$ for final data selection just ahead of the completion of the D$ array access at the end of the address generation (AGN) cycle.

The two D$ read channels may be addressed from a number of sources in any given access cycle. Referring to Fig. 6, multiplexors in the register file access (RF) cycle serve to select the operand value from amongst 5 sources: the GPR, LSU result 0, LSU result 1, immediate operands from the instruction, or LSU internally generated operands. The latter may support AGN operations for prefetching, MMU, recycling or ABIST operations. For most operations, the LSU has the ability to access data within a cache line on any boundary. Two effective addresses (EA) are calculated to access the D$ complex, and the D$ macros in the LSU are distributed so that half of them store data related to the first EA and the other half store data related to the second.

To minimize wire distances, the D$ macros (DCAC0-3) are arranged in a 2×2 configuration in the LSU floor plan (Fig. 7). Operand multiplexors (OPMUX0/1), data formatters (FMT, PREFMT0/1), and the LSU fixed point execution stacks (ALU0/1) are all "folded" into the center of the cache macros to balance the wire lengths.

When an LSU operand channel is not required by other operations, a reduced set of fixed point instructions may be scheduled. These results are multiplexed with other results (data fetched from memory, data from the LSU's store data queue for a store-forwarding-to-a-load result, and other maintenance operations such as special purpose register reads) and either returned to the FXU or wrapped back to the operand through the opmux macro. Operands are received from the fixed point unit late in the RF cycle (Fig. 6). The remainder of the RF cycle is used to perform operand selection and to start address generation. The addresses are passed through a transparent L1 latch and then distributed to the cache and SETP macros. The D$ and SETP are then accessed in parallel as described earlier, with the resultant late-selected cache data available at the beginning of the result cycle. The pre-format and data format multiplexing are performed and the load results are returned to the FXU. The load results are also wrapped back through the opmux at the same time as new operands would be expected, thus closing the two cycle loop.
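The banking rule described earlier in this subsection (two reads plus one write per cycle, addresses multiplexed at the bank decoder, write wins on a collision) can be summarized in a few lines. The sketch below is a functional illustration only; the bank count is an assumption for the example, not a documented parameter.

```python
# Functional sketch of one D$ access cycle: two read channels and one write
# share independently addressable 16-row banks, and a write takes priority
# over a read that collides with it in the same bank.

ROWS_PER_BANK = 16                    # from the text; the bank count below is assumed

def bank_of(row, num_banks=8):
    return (row // ROWS_PER_BANK) % num_banks

def access_cycle(read_rows, write_row):
    """read_rows: rows requested by the two load channels; write_row: row being written.
    Returns which reads complete this cycle (a colliding read must be retried)."""
    write_bank = bank_of(write_row) if write_row is not None else None
    completed = []
    for row in read_rows:
        if write_bank is not None and bank_of(row) == write_bank:
            completed.append(None)    # collision: the write wins, the read is retried
        else:
            completed.append(row)
    return completed

print(access_cycle(read_rows=[3, 40], write_row=5))    # [None, 40]: read of row 3 collides
print(access_cycle(read_rows=[3, 40], write_row=200))  # [3, 40]: no collision, both reads done
```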

Fig. 5. Double bit cell circuit and timing.

V. THE CACHE HIERARCHY

The cache hierarchy consists of a store-through L1 data cache, an L1 instruction cache, a store-in L2 cache and a hybrid victim L3 cache. The L1 caches are contained in the core, while the L2 cache is located directly below the core, as shown in Fig. 2. The L2 cache consists of high speed SRAM macros that are placed in a position to deliver demand load data at low latency and high bandwidth. The 32 MB shared L3 cache was built from high density DRAM macros [14], [16]. Each processor core was grouped together with a local 4 MB segment of the 32 MB L3 cache. The 256 DRAM macros (32 per core) were placed in an arrangement that was optimized to meet the physical layout constraints of the chip. The resulting layout described below was optimized to achieve the best balance between floorplan requirements and performance related attributes.

As shown in Fig. 2, two columns of eight DRAM macros were placed next to the core in a stack that begins at the top of the core and continues to the bottom of the L2 cache. These sixteen array macros comprise the right two 32 Byte dataflows of the L3 cache and are attached to the data ports on the right side of the L2 cache. The remaining sixteen DRAM macros are placed below the L2 cache and comprise the left two 32 Byte dataflows. Eight were placed close to the L2 cache data ports near the lower left corner, and the remaining eight were placed in the lower right corner of the L3 unit to allow the control logic macros to occupy the central region below the L3 directory SRAM macros.

This arrangement resulted in the requirement for one extra clock cycle to signal the read or write commands to the 'far' DRAM macros relative to the 'near' ones closest to the L2 cache. Also, one extra clock cycle was required to receive the data being read from the far arrays relative to the near ones. With four 32 Byte read data buses from the L3 to L2 cache, it takes four clock cycles to transfer a 128 Byte cache line. Read data from the near DRAM macros are transferred in the first two clock cycles, and read data from the far DRAM macros are transferred in the second two clock cycles. When a read is requested of the L3 cache, the most critical 32 Byte sections are indicated in the address associated with the request. This is used to order the delivery of data within the first and second cycle from the near DRAM blocks and within the third and fourth cycle from the far DRAM macros.

It is desirable to have a uniform latency for any accesses of data stored in the L3 cache. For this reason the 128 Byte cache lines are split in such a way that the first 64 Bytes are stored in near DRAMs and the second 64 Bytes are stored in far DRAMs. This minimizes the latency differences and improves average latency for the statistically more critical first 32 Bytes of a cache line.
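The near/far placement and the resulting four-beat transfer order can be made concrete with a small model. The sketch below reproduces the ordering rule described above (bytes 0-63 in near macros, bytes 64-127 in far macros, and the critical 32-byte sector first within each pair of beats); it is an illustration of the rule only.

```python
# Ordering of the four 32-byte beats of a 128-byte L3 cache line, per the scheme
# above: bytes 0-63 live in "near" DRAM macros (beats sent in cycles 1-2) and
# bytes 64-127 in "far" macros (cycles 3-4, one extra cycle of command and data
# delay).  Within each half, the 32-byte sector containing the requested address
# is returned first.

def beat_order(critical_byte):
    sectors = [0, 1, 2, 3]                       # 32-byte sectors of the 128-byte line
    near, far = sectors[:2], sectors[2:]         # near holds bytes 0-63, far holds 64-127
    crit = critical_byte // 32

    def critical_first(half):
        return sorted(half, key=lambda s: s != crit)

    return critical_first(near) + critical_first(far)

print(beat_order(critical_byte=40))   # [1, 0, 2, 3]: sector 1 first from the near half
print(beat_order(critical_byte=96))   # [0, 1, 3, 2]: sector 3 first from the far half
```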

Fig. 6. LSU micro architecture (1 pipe shown).

VI. PERVASIVE DESIGN ELEMENTS

A. Analog Elements

The most significant use of analog circuits on this chip is for the phase-locked loops (PLLs) and the thermal sensors. The chip uses analog PLLs for chip clock generation and I/O clocking. These PLLs operate from a supply that is exempt from power management. A local regulator provides a 1.2 V supply at up to 25 mA for the most critical circuits. The regulator uses only standard MOS transistors, with the exception of diodes in a bandgap reference circuit, and resistors. The VCO is a CMOS ring design, operating at frequencies in excess of 12 GHz. The VCO output is scaled by a divider to provide suitable chip reference frequencies. To avoid introducing noise, the dividers do not share the analog power supply, but are operated from the standard logic supply. Current mode logic (CML) outputs are provided for I/O use, and CMOS level outputs are provided for digital logic use. The design was optimized for an operating range spanning a factor of 2 in output frequency. Spread spectrum operation is supported for noise control. Both internal and external feedbacks are supported, based on the specific application requirements.

The chip uses real-time thermal readings to support power management features. The sensors can be read in a hardware controlled loop without any external interface, middleware, or microcode. They do not require special voltage supplies and operate from the normal logic supply voltage. The goal was to minimize the sensor area, so the use of charge pumps to provide higher voltages was ruled out. The finished area was approximately 85 μm². The temperature sensor is shown in Fig. 8. The left side of the figure shows a bandgap circuit, capable of low voltage operation, which provides a voltage reference. This reference is used to generate a proportional-to-absolute-temperature (PTAT) current source into a junction diode (circled in the schematic), which acts as the temperature sensing element. A mirrored copy of the current source also drives a digitally controlled variable resistor to ground. The resistor is controlled by logic inputs. The resistor and diode voltages are sent to a comparator, sensing which voltage is larger and sending an output to digital logic, which in turn controls the resistor value as a successive-approximation analog to digital converter.

Using a variable resistor instead of tapping a single resistor allows lower voltage operation overall, at the cost of additional non-linearity of the circuit. The final result is linearized by applying a second order polynomial to the resistance selection when a reading in degrees C is needed. The coefficients of the polynomial are determined uniquely for each sensor during the manufacturing test process by noting the reading at a known temperature. The finished sensor maintains excellent accuracy over a wide temperature range.
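A highly simplified model of the sensor read-out principle, assuming nothing about the actual circuit values: a successive-approximation loop adjusts the digitally controlled resistor code until the resistor drop matches the diode voltage, and a per-part second-order polynomial maps the resulting code to degrees C. Every constant below is a placeholder.

```python
# Toy model of the thermal sensor read-out: a successive-approximation search on
# the variable-resistor code, followed by the second-order polynomial correction
# described above.  Every numeric constant here is a placeholder, not a value
# from the design.

def diode_voltage(temp_c):
    return 0.70 - 0.002 * temp_c            # placeholder diode characteristic (V)

def resistor_voltage(code, i_ptat=1e-5):
    return i_ptat * (code * 500.0)           # placeholder: 500 ohms per code step

def sar_read(temp_c, bits=8):
    """Binary-search the resistor code until V(resistor) best matches V(diode)."""
    code = 0
    for bit in reversed(range(bits)):
        trial = code | (1 << bit)
        # comparator: keep the bit while the resistor voltage stays below the diode voltage
        if resistor_voltage(trial) <= diode_voltage(temp_c):
            code = trial
    return code

def code_to_celsius(code, coeffs=(-1.0e-4, -2.45, 346.0)):
    a, b, c = coeffs                         # per-sensor coefficients from manufacturing test
    return a * code**2 + b * code + c

code = sar_read(85.0)
print(code, round(code_to_celsius(code), 1))  # -> 106 85.2 with these placeholder values
```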

Fig. 7. LSU floorplan.

Fig. 8. Temperature sensor.

B. Clock Generation and Distribution

The chip global clock distribution consists of 23 clock domains, including 21 high frequency domains and two lower frequency domains. Six independently controlled analog PLLs support the high frequency I/O and memory interfaces, the central internal bus logic, and a common clock signal source for all CCP chiplets. Each of the eight CCP chiplets contains an independently controllable digital PLL. The digital PLL was specifically designed for frequency slewing while locked. The two low frequency clock domains are sourced by off-chip clocks.

Clock signals are distributed from the signal source to the various clock domains through a binary distribution tree, using two low impedance 1.2 μm thick top level metal layers and a combination of single and triple stage inverting buffer circuits. The buffer-to-buffer interconnect wire width, space, and return current path widths were optimized to allow maximum buffer-to-buffer spatial separation, thereby minimizing total buffer delay and process sensitivity. Fig. 9 shows the schematic for a shielded wire model used for a 1 mm length of wire on one of the two 1.2 μm thick metal layers. The RLC values are parameterized in terms of the wire width and the space from the wire to the shields, and fitted to results from a proprietary field-solver. The global clock distributions for all 8 cores, starting from the 8 DPLLs, are shown in Fig. 9 below the wire model [13]. The 3-D graph shows the physical location on the chip in x and y and gives the relative delay from the DPLL in z. All clock domains were built as a traditional clock grid. Balanced voltage translation circuits conveyed the clock signals across power supply domain boundaries without duty-cycle distortion.
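The wire model in Fig. 9 is parameterized by width and shield spacing; the sketch below illustrates the idea with a crude, purely hypothetical parameterization. The functional forms and every coefficient are invented for illustration; the real R, L, C values were fitted to a proprietary field solver and are not reproduced here.

```python
# Illustrative stand-in for the per-millimetre shielded-wire model of Fig. 9.
# All functional forms and coefficients below are hypothetical placeholders.

def wire_rlc_per_mm(width_um, space_um):
    r = 25.0 / width_um                                          # ohm/mm, falls with width
    c = 0.18 * (width_um / space_um) + 0.05                      # pF/mm, coupling to shields
    l = 0.45 - 0.10 * (width_um / (width_um + 2 * space_um))     # nH/mm
    return r, l, c

def rc_delay_ps(length_mm, width_um, space_um):
    """Distributed-RC (Elmore-style) delay estimate, ignoring inductance."""
    r, _, c = wire_rlc_per_mm(width_um, space_um)
    return 0.5 * (r * length_mm) * (c * length_mm)               # ohm * pF = ps

# Wider wire and larger shield spacing trade resistance against capacitance:
for w, s in [(1.0, 1.0), (2.0, 1.0), (2.0, 2.0)]:
    print(w, s, round(rc_delay_ps(1.0, w, s), 2), "ps/mm")
```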

Fig. 9. Core clock distributions.

Fig. 10. Modular clock generation.

Each CCP chiplet contains two independent but synchronous grids, supporting a 1:1 core clock domain and a 2:1 cache clock domain covering 26.8 and 16.9 mm², respectively. As shown in Fig. 10, both grids are sourced from a common point, which may be driven by any one of four clock signal sources. One of the clock signal sources is a locally controlled fractional-N PLL [15] allowing for dynamic fine frequency control. A simple state-machine-controlled multiplexor allows the power management controls to dynamically select a different clock signal source for the two grids without producing irregular clock signal pulses during transitions between any two sources. The clock signal then splits into two paths. The first path divides the clock signal frequency by two for the cache clock domain (or "nest clock"), and the second path is unprocessed for the core clock domain. Each of the two clock signal paths contains dynamically controllable programmable delay segments for grid-to-grid skew control, and fine resolution duty cycle correction circuits. Both clock signal distributions required five buffering levels in the tree, including the grid driver.

One of the important design considerations for the CCP clock distribution was the skew between the two grids, since synchronous data transfer between the two clock domains was required. A detailed skew analysis was performed, including effects from die temperature variations, power supply noise and variability, and process variations affecting both wires and devices. Also included in the skew budget were modeling errors covering device, wire, and unmodeled parasitic components.
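Because the nest grid runs at exactly half the core frequency and both grids come from the same source, data can cross the boundary only on the core cycles where the two clocks rise together; the minimal sketch below simply enumerates those coincident edges (the cycle count is arbitrary).

```python
# The core grid (1:1) and cache/nest grid (2:1) are driven from one common
# source, so every second core rising edge lines up with a nest rising edge.
# Synchronous core<->nest transfers are launched and captured on those edges.

CORE_CYCLES = 8

core_edges = list(range(CORE_CYCLES))               # one rising edge per core cycle
nest_edges = [t for t in core_edges if t % 2 == 0]  # divide-by-two "nest clock"

transfer_slots = [t for t in core_edges if t in nest_edges]
print(transfer_slots)      # [0, 2, 4, 6] -> a transfer opportunity every other core cycle
```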

C. High Bandwidth I/O

POWER7™ has two memory controller (MC) units with four ports per controller, adding up to a total of eight ports as shown in Fig. 11. Each port connects to a memory buffer chip (referenced as SN in Fig. 11) with 2 DDR3 ports. Each differential Physical Layer (PHY) has 2 Bytes of read data and 1 Byte of write data. The pin speed is 6.4 Gigabits/s of differential data per lane. The total peak bandwidth across all 8 ports is thus 153.6 GBytes/s. In most high performance commercial workloads there is only one memory buffer per port, and the buffer drives up to four ranks of DRAM per DQ. The memory buffer can reside on the same planar as the POWER7™ chip or on a riser card where the DRAMs and the memory buffer are packaged. In POWER7™, differential I/O implementations were chosen over a former implementation (Elastic Interface 3 (EI3) [7]) to avoid the severe cross talk that has limited past EI3 buses that pass through "cost optimized" DIMM connectors.

Fig. 11. Main memory high level diagram.

The design allows for dynamic lane repair for DIMM connector corrosion in a self-detecting, transparent operation. The cyclic redundancy check (CRC) errors are decoded to detect lanes whose error rate exceeds a certain threshold, and to replace them with a sparing bit on the fly. The interface uses CRC retry and is specified to operate with per lane bit-error rates (BER) as low as 1E-12. In typical applications the packaging channels and the composite jitter budget run at about 1e-18 BER per lane. The design is source synchronous and passes a 3.2 GHz clock from the transmitter (TX) side to the receiver (RX) side. This can allow the oversampling circuitry to be gated in the memory buffer to save power, and also allows the use of non-LC-tank VCOs, with their longer-term jitter, while still maintaining a random jitter performance that is near 1 ps rms.

The MC unit feeds the data into the physical layer (PHY) at 1.6 GHz. The transmitter clock is generated locally with one PLL per MC unit and distributed with low-jitter CML circuitry to the CMOS TX bit slices. An AC coupled CML-to-CMOS circuit is shown in Fig. 12. The transmitter is a full CMOS Source-Series Terminated (SST) design [8]. The driver is calibrated with one off-chip resistor per MC unit. The resultant launch jitter at 1e-12 BER is 25 ps. A divided down copy of the CML clock is fed to the MC unit for the 1.6 GHz clock grid. The FIFO at the MC-PHY interface absorbs any drifts, with programmable read and write pointers to allow minimum latency transfers.

The data is received via a thin PHY and passed on through a continuous time linear equalization (CTLE), or peaking amplifier, directly into a differential cascode voltage switch (DCVS) sampler. Each lane is offset-calibrated at power up only. Decision feedback equalization (DFE) is not needed for the channels serviced. The maximum channel loss is about 14 dB at 3.2 GHz with signal/crosstalk at roughly 18 dB. Finally the data is deserialized to 4 bits per lane and passed to the MC unit. On the clock path, the input is fed into a poly phase filter to generate I and Q (quadrature clock) for the data and edge samplers. The clock path also employs narrowband shunt peaking methods. The phase rotators are implemented in low-power-optimized CML circuitry. A diagram of the receiver slice is shown in Fig. 13. The PHY has several innovative design features. It controls the initialization, scrambling, and lane repair operations and de-skew, quiescent controls, lane invert, load-to-unload delay, and loading analog values such as FFE and peaking values.
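The 153.6 GB/s peak figure quoted at the start of this subsection follows from the lane counts and the 6.4 Gb/s signalling rate; the short check below just multiplies them out (the lane count per port is derived from the stated 2 read bytes plus 1 write byte).

```python
# Peak memory-interface bandwidth implied by the numbers in this subsection.

ports          = 8                   # 2 memory controllers x 4 ports
read_bytes     = 2                   # 2 bytes of read data per port ...
write_bytes    = 1                   # ... plus 1 byte of write data per port
lanes_per_port = (read_bytes + write_bytes) * 8   # 24 differential lanes per port
gbit_per_lane  = 6.4                 # 6.4 Gb/s per lane

total_gbytes_s = ports * lanes_per_port * gbit_per_lane / 8
print(f"{total_gbytes_s:.1f} GB/s")  # -> 153.6 GB/s aggregate read + write bandwidth
```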

Fig. 12. CML-to-CMOS circuit.

Fig. 13. Receiver bit slice diagram.

VII. THE LATCH AND CLOCK CONCEPT

The POWER7™ design used over 2M clocked storage elements (CSEs) in the implementation of the digital logic; thus the design of these elements [17] had a critical impact on the overall chip area, power, and performance. In addition, special features were included to provide enhanced reliability, high DC/AC test coverage, and additional capabilities for hardware debug and frequency optimization. This section will cover the design of the POWER7™ CSEs, focusing on specific design features and how they were used, and describing the techniques used for power-performance optimization.

The vast majority of the clocked storage elements in the digital logic were implemented as pulsed-clock static level-sensitive latches, primarily because of the power advantages compared to a master-slave implementation and also because of the "soft" cycle boundary provided by such a scheme [9]. Two of the basic CSE topologies used are shown in Fig. 14. The first example, Fig. 14(a), was used throughout most of the logic design. This element is really a master-slave latch (MSL), but was usually operated in pulsed mode with d1clk held high and a self-timed pulsed clock used on the lclk port. During scan operation, d1clk is held low, and d2clk is set to be the logical inverse of lclk, avoiding race issues along the scan chain which would otherwise arise with just a single pulsed lclk. The clock waveforms for normal functional operation and scan operation are shown in Figs. 15(a) and (b), respectively.

Since this particular element (Fig. 14(a)) was used so pervasively in the design, it had to meet very strict reliability requirements. It is well known that SOI technologies provide resistance to single-event upsets: our own measurements on CSE designs indicate that SOI provides about a 6X reduction in soft error rate (SER) versus a comparable design in a bulk technology. However, as technology scaling has continued to lower voltages, finer feature scales, and ever-higher levels of integration, the "SOI advantage" is no longer enough to meet the strict standards required. To improve the reliability of the CSE, single transistors were replaced by 2-high stacks of series transistors at critical locations within the CSE, as indicated in Fig. 14(a). If one of two non-conducting series transistors is struck by a particle and turns on, due to activation of the parallel bipolar mechanism [10], the other transistor will still block any current flow, thereby avoiding an upset [11], unless both transistor bodies are hit simultaneously. The cost of the stacked-transistor-based hardening was small, increasing area by less than 10% per CSE and increasing the power/latency of the CSE by only a small amount. This stacking scheme works well in SOI technologies since adjacent transistor bodies are completely isolated from each other, but would not be practical in a bulk technology. To be effective, the individual transistors in the stack would have to be separated from each other by a significant distance. Since the vast majority of these elements were operated with pulsed clocks (lclk is pulsed, d1clk is held high), it was only necessary to harden the L2 part of the structure.
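A toy probability model of why the 2-high stack helps: a single blocking device upsets whenever its body collects enough charge, while the stacked pair upsets only when both bodies are struck in the same event. The per-strike probability below is an arbitrary illustrative number, not measured data.

```python
# Toy model of stacked-transistor SER hardening: the stored value flips only if
# every non-conducting device in the blocking stack is turned on by the strike.
# p_hit is an arbitrary illustrative probability per critical node per event.

import random

random.seed(2)

def upset(stack_height, p_hit):
    # Each device in the stack must be hit (its bipolar turned on) for the value
    # to be corrupted; independence between the bodies is assumed here, which
    # the SOI isolation makes plausible but is still a simplification.
    return all(random.random() < p_hit for _ in range(stack_height))

def rate(stack_height, p_hit=0.05, events=200_000):
    return sum(upset(stack_height, p_hit) for _ in range(events)) / events

single, stacked = rate(1), rate(2)
print(f"relative SER, single vs 2-stack: {single / stacked:.1f}x")
# Roughly 1/p_hit (~20x) in this toy; the hardware improvement reported below is
# at least 5x, since real strikes can deposit charge on both bodies at once.
```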

Fig. 14. Master-slave latch diagram.

Fig. 16 shows a summary of the data from accelerated particle beam experiments. The data was collected with the same logical value on both sides of the L2 transmission gate. In practice, this is the most common situation, although the cross section increases somewhat when the transmission gate comes into play. Overall, the stacking is seen to provide at least a 5X reduction in SER. In general, scan-only CSEs store their data only on the L2 node, so they are not sensitive to hits on the L1 node. This is also true for MSLs operated with pulsed clocks. MSLs operated with traditional half-cycle clocks would be sensitive to hits on the L1 half of the time, and on the L2 for the other half of each cycle, so they would be expected to have a higher upset cross section, since only the L2 stage is hardened. This was not a big concern for POWER7™ because the CSEs were only rarely used in non-pulsed mode.

Fig. 16. Measured upset cross sections.

Fig. 14(b) shows an example of a CSE designed to run only with pulsed clocks. This design has lower latency than the dual-mode design, and was used to help solve the most critical timing issues. Since this design was used less commonly than the master-slave latch shown in Fig. 14(a), and since no sacrifice could be tolerated in overall latency, this structure was not hardened by transistor stacking.

Figs. 17(a) and (b) show the local clock circuitry used to operate the CSEs in Figs. 14(a) and (b), respectively. This design is similar to that used in previous high-frequency microprocessors [12], including the control signals ACT, FORCE, and THOLD_B, but the POWER7™ design includes added functionality for reliability and debug. The new clock waveform control signals, D_MODE_B, DELAY_LCLKR, MPW1, and MPW2, were controlled from local scan-only registers, and were used as described below. In addition, another debug control, ACT_DIS, was also sourced from the same scan-only register bank, and allowed overriding any local clock gating logic in situations where clock gating was used only for power saving, and not for functional holds. This was done through the FORCE input, effectively short-circuiting the clock enable input (ACT).

The signal D_MODE_B could be used to force a specific group of MSL designs to run in non-pulsed mode, with half-cycle complementary master and slave clocks as shown in Fig. 15(c), as a debug aid and work-around if specific problems related to pulsed-mode operation turned up in the hardware. DELAY_LCLKR was used to delay the pulsed-clock output to the latches as shown in Fig. 15(d), for hardware-based cycle time/performance tuning or debug work.
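The per-block debug controls described above amount to a handful of scan-only bits; the sketch below gathers them into a configuration record just to make their roles explicit. The field names mirror the signal names in the text, but treating them as a single register is only an organizational convenience, and True here simply means "the debug feature is engaged", without modeling the physical polarity of each scan bit.

```python
# The scan-only clock-control bits described above, collected into one record
# per circuit block for illustration.

from dataclasses import dataclass

@dataclass
class ClockControlBits:
    d_mode_b:    bool = False   # force the block's MSLs into non-pulsed (master-slave) mode
    delay_lclkr: bool = False   # delay the local clock launch for this block (debug / fmax pareto)
    mpw1:        bool = False   # widen the clock pulse (stresses hold paths, relieves setup)
    mpw2:        bool = False   # narrow the clock pulse (stresses latch writeability)
    act_dis:     bool = False   # override power-saving-only clock gating via the FORCE input

# Example: the setting used when hunting a cycle-limiting path launched from one block.
debug_setting = ClockControlBits(delay_lclkr=True)
print(debug_setting)
```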

Fig. 15. Local clock waveforms.

Finally, the MPW1 and MPW2 control signals provided a means of assuring the reliability of all clocked elements using pulsed clocks. Aging of the pulse generator circuitry or of the clocked storage elements could lead to a writeability error or other race-path failure into a latch, if there are any elements in the design which, for one reason or another, did not have enough margin at time zero. Without special control switches, it is hard to guarantee adequate margin through end of life, since these types of marginalities would be frequency-independent, and might not reliably show up when testing at specific voltage and temperature extremes. To guard against this eventuality, the pulsed-clock circuitry could be stressed with a wider pulse (MPW1 asserted) or a narrower pulse (MPW2 asserted), as shown in Fig. 15(d), before shipping at the nominal pulsewidth setting. In addition, any potential hold time marginality caused by resistance-related issues could be effectively screened out by testing at high voltage with MPW1 enabled (widest pulsewidth setting).

The clock control signals were used during system bring-up and for extensive performance optimization work. The DELAY_LCLKR control was useful for debugging cycle-limiting paths in the early hardware, since it delays the rising edge of the local clock (for pulsed clocks the falling edge is also delayed by the same amount), thereby introducing a fixed amount of delay into the launch of new data from all affected storage elements. This control was implemented with block-level granularity, with a typical circuit block containing on the order of 10 K gates. By activating this extra delay on a block-by-block basis, the criticality of each of the blocks could be determined based on the resultant decrease in the overall chip maximum operating frequency, fmax, for each setting. An example of this "DELAY_LCLKR pareto" is shown in Fig. 18, for a particular chip.

Given a specific situation where fmax was seen to decrease when DELAY_LCLKR was asserted for a particular circuit block, the receiving block for the critical path in question could be isolated by checking on a block-by-block basis to see if delaying the capture clock increased the observed fmax. This could be done with the MPW1 switch, also implemented with the same block-level granularity as DELAY_LCLKR. Finally, the timing slacks in the chip timing model were examined for paths launched and captured from the blocks isolated through the above observations, providing detailed information on all possible critical paths. This analysis was used to determine exactly how much improvement was needed in each of the critical paths for a set of 2nd-pass metal-only design changes.
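The block-by-block "DELAY_LCLKR pareto" procedure lends itself to a short script; the sketch below captures the flow (assert the extra launch delay for one block at a time, remeasure fmax, rank the blocks by the frequency they cost). The measurement routine, block names and sensitivities are hypothetical stand-ins for the lab test harness.

```python
# Sketch of the DELAY_LCLKR pareto flow: enable the extra launch delay one
# circuit block at a time, measure chip fmax, and rank blocks by how much
# frequency the delay costs.  measure_fmax() and all numbers are invented.

BASELINE_FMAX_GHZ = 4.14

def measure_fmax(delayed_block):
    sensitivity = {"lsu_agen": 0.12, "vsu_bypass": 0.07, "ifu_btb": 0.02}   # hypothetical
    return BASELINE_FMAX_GHZ - sensitivity.get(delayed_block, 0.0)

def delay_lclkr_pareto(blocks):
    losses = {b: BASELINE_FMAX_GHZ - measure_fmax(b) for b in blocks}
    return sorted(losses.items(), key=lambda kv: kv[1], reverse=True)

for block, loss in delay_lclkr_pareto(["ifu_btb", "lsu_agen", "vsu_bypass", "fxu_alu"]):
    print(f"{block:12s} costs {loss * 1000:5.0f} MHz when its launch clock is delayed")
```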

Fig. 17. Local clock generators.

Fig. 18. Frequency degradation caused by DELAY_LCLKR assertion.

The MPW1 switch, which delays the falling edge of the clock pulse, was also used to provide permanent setup time relief in specific situations. To verify that sufficient hold time margin still existed in these situations, the system was tested with both the programmed MPW1 settings and DELAY_LCLKR asserted for the affected circuit blocks.

Finally, ACT_DIS was used to disable clock gating in situations where the clock enable signal (ACT) was only used for power reduction (and not functional holds), and was found to still be fmax limiting. The granularity of this ACT_DIS control was fine enough that sparing use of this override had only a negligible impact on total chip power. Together, the timing improvements enabled by the DELAY_LCLKR pareto work with subsequent design tuning and hardware updates, and the use of MPW1 and other similar switches for timing relief, resulted in a 9% chip frequency improvement at constant voltage, with about 6% coming directly from the optimization of the clock control switches and about 3% coming from the 2nd-pass metal-layer-only design improvements. These improvements are shown in Fig. 19.

Fig. 19. Maximum operating frequency improvements.

Fig. 20. Minimum operating Vdd reduction.

The D_MODE_B switch was also useful to improve the operating voltage range in early hardware. In one particular situation, insufficient hold time margins were observed at low voltage due to a timing offset induced between the clock (propagated in the low voltage domain) and the data (launched from a CSE and propagated in a higher-voltage domain). Activating D_MODE_B for the receiving pulsed latch clock driver put it into master-slave mode, effectively reducing its hold time requirement to a more manageable interval. In addition, the data side of this race path was improved by activation of DELAY_LCLKR for the launching element. The combination of these two actions enabled the system to run at much lower voltages than would have otherwise been possible without an additional round of design updates, as shown in Fig. 20. This enabled a higher-quality second-pass design (which included modifications to improve the hold time on this path), since debug work on the early hardware could continue in the low voltage regime.

Finally, MPW1 and MPW2 were important for system reliability as described above. In addition to non-default settings of clock control switches (e.g., DELAY_LCLKR), which could potentially increase the hold time requirement of a specific CSE to close to the fail point, any local process irregularity or defect could result in a situation where a pulsed-clock element had less margin than needed to operate robustly through the whole product lifespan. There is generally no guarantee that such marginality would be adequately discoverable via tests at extended voltage margins. This is evident from the data in Fig. 21, where changes in the minimum operating voltage associated with the assertion of MPW1 and MPW2 are shown. With MPW2 asserted, the pulse-width is reduced, and a general increase in Vmin is observed. As seen in Fig. 21, the actual change in Vmin depends on many random factors which do not track with any obvious process parameters. It is evident that the margin needed for adequate pulsewidth reliability would be difficult to translate into a reasonable Vmin guardband specification: some parts would need relatively large Vmin margins, and without this control switch it would be impossible to know this, or to understand the statistical aspect of the margins needed.

Fig. 21. Minimum operating Vdd changes via MPW1 or MPW2.

It is also interesting to note that the parts saw generally improved Vmin margin with MPW1 asserted (widening the clock pulses), and generally MPW1 did not decrease Vmin. The implication is that data hold time issues were not a significant Vmin limiter for these parts (since hold times are increased when the pulses are widened). With a pulsewidth equal to about 20% of the core cycle time, the longer hold times associated with the pulsed-clock latches did drive a certain amount of short-path padding activity during the chip design process. Empirically, it would seem that the margins used were more than adequate to protect the design at its Vmin targets. These lab data indicate that pulsed-clock writeability issues were more limiting, even at the wider pulse-width setting. These observations thereby provide valuable feedback for future calibration of the statistical timing methodologies and design tools employed by the POWER7™ team.
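The role of MPW1/MPW2 in Vmin screening can be summarized as a simple per-part measurement loop; the sketch below mirrors the argument above (part-to-part Vmin shifts under the stressed pulse widths do not follow a single guardband), with every voltage invented purely for illustration.

```python
# Illustration of the Vmin-margin argument above: measuring each part at the
# nominal, widened (MPW1) and narrowed (MPW2) pulse widths exposes a part-to-part
# spread that a single fixed guardband could not capture.  All numbers invented.

parts = {                      # (Vmin nominal, Vmin with MPW1, Vmin with MPW2) in volts
    "part_A": (0.82, 0.81, 0.86),
    "part_B": (0.80, 0.80, 0.90),
    "part_C": (0.85, 0.84, 0.87),
}

for name, (nom, mpw1, mpw2) in parts.items():
    print(f"{name}: MPW2 penalty {1000 * (mpw2 - nom):+.0f} mV, "
          f"MPW1 shift {1000 * (mpw1 - nom):+.0f} mV")

worst_mpw2_penalty = max(mpw2 - nom for nom, _, mpw2 in parts.values())
print(f"guardband needed to cover MPW2 on the worst part: {1000 * worst_mpw2_penalty:.0f} mV")
```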

VIII. CONCLUSION

An overview of the newest member of the POWER™ family of processors, POWER7™, has been presented. The design is highly parallel, containing 8 superscalar, out-of-order, multi-threaded cores, a large embedded DRAM-based shared L3 cache, two memory controllers and high-speed system links, all on the same die. The processor cores operate at frequencies up to 4.14 GHz, with considerable embedded flexibility for power-saving modes when the maximum performance is not required. The design and circuit techniques employed were aimed at achieving high performance while also breaking new ground with added features for reliability and testability. In short, the POWER7™ processor chip has been designed to enable new levels of performance in the high-end server space, offering scalable solutions for the most difficult technical and commercial applications.

ACKNOWLEDGMENT

Many thanks to the dedicated IBM processor design and manufacturing teams around the globe who worked to make this design become reality.

REFERENCES

[1] R. Kalla and B. Sinharoy, "POWER7: IBM's next generation power microprocessor," in IEEE Symp. High-Performance Chips (Hot Chips 21), 2009.
[2] B. Starke, "POWER7™ design," in Hot Chips, 2009.
[3] D. Wendel et al., "The implementation of POWER7™, a highly parallel, scalable multi-core high end server processor," in IEEE ISSCC Dig. Tech. Papers, 2010.
[4] J. Pille et al., "Implementation of the CELL broadband engine™ in a 65 nm SOI technology featuring dual-supply SRAM arrays supporting 6 GHz at 1.3 V," in IEEE ISSCC Dig. Tech. Papers, 2007.
[5] S. Narasimha et al., "High performance 45-nm SOI technology with enhanced strain, porous low-k BEOL, and immersion lithography," in IEDM, 2006.
[6] J. Pille et al., "A 32 kB 2R/1W L1 data cache in 45 nm SOI technology for the POWER7™ processor," in IEEE ISSCC Dig. Tech. Papers, 2010, paper 19.2.
[7] D. Dreps et al., "The 3rd generation of IBM's elastic interface on POWER6™," in Hot Chips 19, Aug. 2007.
[8] C. Menolfi, T. Toifl, P. Buchmann, M. Kossel, T. Morf, J. Weiss, and M. Schmatz, "A 16 Gb/s source-series terminated transmitter in 65 nm CMOS SOI," in IEEE ISSCC Dig. Tech. Papers, Feb. 2007, pp. 446–614.
[9] T. Xanthopoulos, Ed., Clocking in Modern VLSI Systems. New York: Springer, 2009, ch. 3.
[10] A. J. KleinOsowski et al., "Circuit design and modeling for soft errors," IBM J. Res. Develop., vol. 52, pp. 255–263, May 2008.
[11] J. Cai et al., "SOI series MOSFET for embedded high voltage applications and soft-error immunity," in IEEE Int. SOI Conf. Proc., Oct. 2008, pp. 21–22.
[12] J. Warnock et al., "Circuit design techniques for a first-generation cell broadband engine processor," IEEE J. Solid-State Circuits, vol. 41, no. 8, pp. 1692–1706, Aug. 2006.
[13] P. J. Restle, "Timing uncertainty measurements on the Power5 microprocessor," in IEEE ISSCC Dig. Tech. Papers, Feb. 2004, vol. 1, pp. 354–355.
[14] G. Wang et al., "Scaling deep trench based eDRAM on SOI to 32 nm and beyond," in IEDM, 2008.
[15] A. Rylyakov et al., "A wide range (1 GHz-to-15 GHz) fractional-N all-digital PLL in 45 nm SOI," in Proc. CICC, 2008.
[16] J. Barth et al., "A 45 nm SOI embedded DRAM macro for the POWER7™ processor 32 MByte on-chip L3 cache," in IEEE ISSCC Dig. Tech. Papers, 2010, paper 19.1.
[17] J. Warnock et al., "POWER7™ local clocking and clocked storage elements," in IEEE ISSCC Dig. Tech. Papers, 2010.

Dieter F. Wendel (M'10) received the B.S. degree in electrical engineering from the University of Würzburg, Germany, in 1981. He joined IBM in the same year, working in several processor-design-related technical areas at several IBM locations in the US and Germany. His recent assignments have been circuit lead for the CELL™ processor and POWER7™. Mr. Wendel is an IBM Distinguished Engineer.

Ron Kalla is the chief engineer for IBM POWER7. He has 25 years of processor design experience. He has worked on processors for IBM S/370, M68000, iSeries and pSeries machines. He holds numerous patents on processor architecture. He also has an extensive background in post-silicon hardware bring-up and verification. He has 30 issued US patents and is an IBM master inventor.

James Warnock (M'89–SM'06) received the B.Sc. degree from Ottawa University, Canada, and the Ph.D. degree in physics from the Massachusetts Institute of Technology. Since then, he has been at IBM in Yorktown Heights, NY, where he has worked on high-speed processor designs, including IBM's S/390 G4, POWER4, the Cell Broadband Engine, POWER7 and the zEnterprise 196. His interests include VLSI circuit design tools and methodology, local clocking/latch design, design for test, and design-technology interactions. Dr. Warnock is a Distinguished Engineer in IBM's Systems and Technology Group and a member of the IBM Academy of Technology.

Robert Cargnoni (M'85) received the B.S. and M.S. degrees in electrical engineering from the University of Illinois at Urbana-Champaign. He is the Chief Engineer for the IBM POWER7 cache hierarchy, coherence protocol, SMP interconnect, memory and I/O subsystems. He held key leadership positions in the POWER4 and POWER5 programs as well.

Sam G. Chu received the Master's degree from Syracuse University, Syracuse, NY. He is an IBM Senior Technical Staff Member. He joined IBM in February 1984, working in different processor design areas. He is currently the technical lead for Register File implementations, supporting the IBM Z- and P-processor families.

Saiful Islam graduated from Bangladesh University of Engineering and Technology in 1987 and received the M.S.E.E. and Ph.D. degrees from the University of Texas at Austin in 1990 and 1994, respectively. He worked with Advanced Micro Devices for almost nine years in the K5 and K7 microprocessor development team as a custom circuit designer. He joined IBM in 2003 and started working on the Power6 project. Since then, he has been working in the large register file array and complex custom circuit design area. Currently, he is leading the custom register file team for the next P project.

Joachim G. Clabes received the Ph.D. degree in physics from the Technical University of Clausthal, Germany, in 1978. He joined IBM at the T. J. Watson Research Center in 1983 to work on a wide range of materials science and advanced development projects within IBM. In 1993 he joined the microprocessor development program. Since moving to Austin in 2000, he has worked as a circuit design lead on IBM's POWER5 and POWER7 microprocessors and most recently as PD lead on the PERCS high-performance interconnect chip.

Jim Kahle is a graduate of Rice University, and for more than 25 years at IBM he has held numerous managerial and technical positions. He is a renowned expert in the microprocessor industry, achieving the distinction of IBM Fellow, and currently is the Chief Architect for Power Hybrid systems. Previously he was Chief Technical lead for Power 7. Before that, he led the collaborative design for Cell, which was a partnership with IBM, Sony and Toshiba. He was also Chief Architect of the Power4 core used in IBM servers and Apple's G5. He was project manager for the PowerPC 603 series that are used in Apple laptops and Nintendo game cubes. He has been involved in designs using the Power architecture since its conception. He combines broad processor knowledge with an ability to lead high performance teams and to drive deep client relationships in order to understand future system requirements, achieving breakthrough innovations in chip design.

Daniel Dreps (M'10) received the B.S.E.E. degree from Michigan State University in 1983. He is a Distinguished Engineer working in the IBM Systems and Technology Group. During his IBM career, he has designed and developed transistor models, fiber optic links, ASIC technology custom elements, and high-speed serial links for IBM servers. His interests currently focus on high-speed link development and applications in the entire range of IBM servers. He has published multiple papers and holds more than 40 patents in broad areas of interconnect and server design.

Jens Leenstra received the M.S. degree from the University of Twente, The Netherlands, and the Ph.D. degree from the University of Eindhoven, The Netherlands. He has worked in several areas of development, including logic design and verification of I/O chips, multiprocessor system verification of the IBM S/390 G2 and G3 mainframe computers, the Cell/B.E. processor SPEs and the POWER6 VMX unit, and is the lead of the POWER7 processor VSU unit. He is currently working on next-generation IBM microprocessors. His current interests focus on computer architecture, high-frequency design, low power, and design for testability.

Gaurav Mittal received the B.Tech. degree from the Indian Institute of Technology, Delhi, India, in 1996, and the M.S. degree in electrical engineering from the University of Michigan, Ann Arbor, in 1999. He led the electrical circuit characterization of POWER7, optimizing power, frequency and yield. He is currently leading the L2 Cache Unit circuit design team for next generation POWER processors. He was also circuit lead for the load store unit on POWER6+ and did electrical circuit characterization for POWER6 and POWER6+. He joined IBM, Austin, TX, in 2003. Prior to joining IBM, from 1999 to 2003, he was a Circuit Design Engineer at Sandcraft Inc., Santa Clara, CA, working on the load store unit and TLB.

David Hrusecky is the main engineer for the L1 Cache Load Store Dataflow for IBM POWER7. He was previously involved in cache designs for IBM POWER6. He has 28 years of processor/logic design experience. He has worked on processors for IBM S/370, IBM S/390, and pSeries machines. He also worked extensively on MPEG video processing logic, including digital filters and graphics display processors. He holds numerous patents on processor and video architecture.

Jose Paredes is a Senior Engineer at IBM. He has been with the company for over 20 years and has been working on POWER Microprocessors for 12 years. He is an IBM master inventor, and has over 35 issued patents in the United States and other countries, including Germany, Japan, and China.

Josh Friedrich led power estimation and reduction efforts on the POWER7™ chip. He also played a key role in defining the chip infrastructure and closing the design methodology. On past POWER™ processors, he has led hardware characterization, memory subsystem circuit development, and the design of core execution units. He is currently heading circuit development on one of IBM's future designs.

Juergen Pille (M'10) received the M.S. degree in microelectronics from the University of Hanover, Germany, in 1990. He joined IBM at the IBM Deutschland Research and Development Lab, Boeblingen, Germany, in 1990 to work on advanced VLSI designs. Since then, he has been working in several areas and several locations in the U.S. and Germany. As a Senior Technical Staff Member, he is currently the P-Server high-speed array lead.

William J. Starke received the B.S. in computer science from Michigan Technological University. He is the Chief Architect for the IBM POWER7 cache hierarchy, coherence protocol, SMP interconnect, memory and I/O subsystems. He served in a similar role for POWER6, and held key leadership positions in the POWER4 and POWER5 programs as well. Prior to that, he worked as a hardware performance engineer on a number of IBM mainframe programs. A prolific innovator, Mr. Starke holds over 100 issued U.S. patents, in addition to a large number currently pending.

Phillip J. Restle (M'87) received the B.A. in physics from Oberlin College in 1979, and the Ph.D. in physics from the University of Illinois at Urbana in 1986. He then joined the IBM T. J. Watson Research Center as a Research Staff Member, where he initially worked on CMOS parametric test and modeling, CMOS oxide-trap noise, package testing, and DRAM variable retention time. Since 1993 he has concentrated on tools and designs for VLSI clock distribution networks, contributing to more than a dozen server and game microprocessors, including all recent high-performance IBM servers such as the POWER7 processor, the z10 mainframe processor, the Xbox 360 processor, and the Cell Broadband Engine, and has extended these techniques for high-end ASIC chips and 3-D chip integration. Dr. Restle received IBM awards for the Mainframe G4, G5, and G6 microprocessors, for the POWER4 and POWER5 microprocessors, and for the PowerPC 970 used in the Apple G5 machine, as well as an IBM corporate award for VLSI clock distribution design and methodology. He received the 2005 Pat Goldberg Memorial Best Paper award for a resonant-clock paper, and the 2007 ITC best paper award. He holds more than 12 patents, has written more than 25 papers, and has given keynotes, invited talks, and tutorials on clock distribution, high frequency on-chip interconnects, and technical visualizations in VLSI design.

Scott Taylor (M'10) was the circuit lead for the POWER7 memory subsystem, working both circuit and methodology issues. He followed that microprocessor into the lab, leading the circuit characterization effort. His latest role is the circuit lead of a follow-on processor.

A. James Van Norstrand, Jr. received the B.S.E.E. degree from Syracuse University, Syracuse, NY, in 1982. He is a Senior Technical Staff Member with IBM. He has 26 years in IBM and holds many microprocessor patents. He is the POWER7 Instruction Fetch Unit Lead, and before that worked on game chip microprocessors (Sony Playstation 3) and various Power systems in Austin, TX. He started working in the Mid Hudson valley on Z systems.

Balaram Sinharoy (M'92–SM'96–F'09) is an IBM Distinguished Engineer and the chief architect of IBM's POWER7 processor. Before POWER7, he was the Chief Architect for the POWER5 processor. He has published numerous articles and received over 50 patents in the area of computer architecture, with many more patents pending. Dr. Sinharoy has received several IBM corporate awards for his work in different generations of POWER microprocessors. He is an IBM Master Inventor.

Stephen Weitzel (M'75) received the B.S. degree in electrical engineering from Pennsylvania State University, University Park, PA. He joined IBM as a Test Equipment Engineer in 1974 and held various positions in test engineering and circuit design in the East Fishkill development center, Noyce design center, Somerset design center, and STI design center. He is currently a Senior Technical Staff Member working on high-frequency clock distributions for IBM in the high performance microprocessor center, Austin, TX. He has 20 patents and has co-authored 17 papers.

George Smith (M'80–SM'05) received the Bachelor and Master degrees in electrical engineering from Rensselaer Polytechnic Institute in Troy, NY, in 1980 and 1981, respectively. He is a senior engineer and team leader of the Server Analog Team at IBM's System and Technology Group in Poughkeepsie, New York, which provides PLLs, sensors, and other analog circuits for use in IBM's microprocessors.

Phillip G. Williams graduated from Missouri Institute of Technology in 1977. He is a Senior Engineer with the IBM Systems and Technology Group. He joined IBM in 1977, developing systems. He has designed computer logic in the areas of Instruction Dispatch and Completion, Branch Prediction, Cache and Memory Controllers and I/O. In his current role as design lead, he is embedding DRAM in processor Caches.

Victor Zyuban (S’97–M’00) received the B.S. and M.S. degrees from the Moscow Institute of Physics and Technology, Moscow, Russia, and the Ph.D. in CSE from the University of Notre Dame, Notre Dame, Indiana. Since 2000 he has been with IBM, working on low power circuits and micro-architecture, serving as a core power lead on POWER7, and chip power lead on a follow-on microprocessor. He is also a manager of a low-power circuits group at the IBM T. J. Watson Research Center.