5-Ghz 32-Bit Integer Execution Core in 130-Nm Dual-V/Sub T/ CMOS

IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 37, NO. 11, NOVEMBER 2002 1421 5-GHz 32-bit Integer Execution Core in 130-nm Dual- CMOS Sriram Vangal, Member, IEEE, Mark A. Anders, Nitin Borkar, Erik Seligman, Venkatesh Govindarajulu, Vasantha Erraguntla, Member, IEEE, Howard Wilson, Amaresh Pangal, Venkat Veeramachaneni, James W. Tschanz, Yibin Ye, Dinesh Somasekhar, Member, IEEE, Bradley A. Bloechel, Associate Member, IEEE, Gregory E. Dermer, Ram K. Krishnamurthy, Member, IEEE, K. Soumyanath, Sanu Mathew, Siva G. Narendra, Mircea R. Stan, Senior Member, IEEE, Scott Thompson, Vivek De, Member, IEEE, and Shekhar Borkar Abstract—A 32-bit integer execution core containing a achieving: 1) high noise robustness; 2) low active leakage power Han-Carlson arithmetic-logic unit (ALU), an 8-entry 2 ALU dissipation; 3) maximum low- usage; 4) simplified 50% instruction scheduler loop and a 32-entry 32-bit register file duty-cycle timing scheme with seamless scheduler/ALU is described. In a 130 nm, six-metal, dual- CMOS technology, the 2.3 mmP prototype contains 160 K transistors. Measurements interface time-borrowing; and 5) scalable performance up to demonstrate capability for 5-GHz single-cycle integer execution 10 GHz, measured at 1.7 V, 25 C. Stack node preconditioning at 25 C. Single-ended, leakage-tolerant dynamic scheme used in enables further ALU performance improvement. In addition, ALU and scheduler enables up to 9-wide ORs with 23% critical the RF employs semi-dynamic flip-flops for increased speed path speed improvement and 40% active leakage power reduction and a static design for increased leakage tolerance. On-chip when compared to a conventional Kogge–Stone implementation. On-chip body-bias circuits provide additional performance body-bias circuits are used to improve performance or reduce improvement or leakage tolerance. Stack node preconditioning standby leakage power. The chip also supports full at-speed improves ALU performance by 10%. At 5 GHz, ALU power is result capture and scan-out. 95 mW at 0.95 V and the register file consumes 172 mW at 1.37 V. In the execution core, both the ALU and scheduler are The ALU performance is scalable to 6.5 GHz at 1.1 V and to organized as a loop [3], enabling single-cycle latency and 10 GHz at 1.7 V, 25 C. throughput both for ALU operations and for resolving instruc- Index Terms—CMOS integrated circuits, integrated circuit de- tion dependencies and priorities (Fig. 1). The scheduler updates sign, logic design, microprocessors, very-high-speed integrated cir- interinstruction dependency information each cycle, choosing cuits. the highest priority from among those instructions ready to execute. The chosen instruction controls ALU input selection I. INTRODUCTION as well as RF address. The 32-bit ALU executes add/subtract UT-OF-ORDER execution engines of superscalar operations each cycle, allowing the previous results to be used O processors require: 1) wide instruction schedulers ca- directly in the following cycle. This architecture enables fast pable of scheduling back-to-back instructions into multiple parallel out-of-order execution in superscalar microprocessors. arithmetic-logic units (ALUs) in the execution core; 2) fast ALUs capable of executing these instructions with single-cycle II. PROTOTYPE ARCHITECTURE latency and throughput; and 3) leakage-tolerant register files The core architecture contains: 1) three first in first out (FIFO) capable of feeding the ALU units. A high-speed execution core buffers, one FIFO for instructions and two FIFOs for data; 2) is therefore essential to maximize processor performance [1]. In a tightly coupled RF-ALU-Scheduler loop; 3) a FIFO to cap- this paper, we describe key components of a integer execution ture output results (Fig. 2) [4]. The core executes instructions core: a 32-bit ALU, an 8-entry 2-ALU instruction scheduler stored in the 32-bit wide, 4-deep circular instruction FIFO, op- and a 32-entry 32-bit register file (RF), fabricated in 130 erating at core speed. Data FIFOs (D0-D1) provide the desired nm dual- CMOS technology [2]. High-speed single-ended operands. A central block forwards data and control signals dynamic circuit techniques enable the evaluation of complex to all units. RF-ALU instructions are single-cycle and can be (up to 2 9-way OR) logic operations while simultaneously scheduled back to back. A 416-bit long input scan chain feeds the data and control words. Results are captured at speed by Manuscript received March 15, 2002; revised June 10, 2002. a 32-bit wide, 4-deep result FIFO. Capture timing and interval S. Vangal, M. A. Anders, N. Borkar, E. Seligman, V. Govindarajulu, V. Erra- guntla, H. Wilson, A. Pangal, V. Veeramachaneni, J. W. Tschanz, Y. Ye, D. So- are fully programmable via scan. Output results are scanned out masekhar, B. A. Bloechel, G. E. Dermer, R. K. Krishnamurthy, K. Soumyanath, using a 128-bit scan chain. The scan control block manages op- S. Mathew, S. G. Narendra, V. De, and S. Borkar are with the Intel Corpora- erations of all four FIFOs on chip. Separate power grids for tion, Circuits Research, Intel Labs, Hillsboro, OR 97124-6497 USA (e-mail: [email protected]). ALU, RF and circuits in the rest of the core allow individual S. Thompson is with the Portland Technology Development, Intel Corpora- power measurement of each unit. In addition, RF and ALU units tion, Hillsboro, OR 97124-6497 USA. have on-chip body bias generator circuits to improve perfor- M. R.Stan is with the Department of Electrical Engineering, University of Virginia, Charlottesville, VA22904-4743 USA. mance by applying 450 mV of forward body bias (FBB) to all Digital Object Identifier 10.1109/JSSC.2002.803944 pMOS devices during active operation. Body biases of high- 0018-9200/02$17.00 © 2002 IEEE 1422 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 37, NO. 11, NOVEMBER 2002 Fig. 1. Out-of-order execution core. Fig. 3. Scheduler organization. Fig. 2. Block diagram of integer execution core and test circuits. and low- devices can be controlled separately. On-chip FBB can be disabled; and forward, zero, or reverse body bias can be applied externally to all nMOS and pMOS devices to improve performance or reduce standby leakage power. The chip organization provides flexibility to characterize individual units or the complete core. III. 8-ENTRY 2INSTRUCTION SCHEDULER The instruction scheduler is capable of scheduling dependent Fig. 4. Scheduler bitslice logic. instructions to two 32-bit ALUs, choosing one of eight poten- tially ready instructions to execute in each ALU per cycle. An bit slices, with one ready logic evaluation and one priority en- instruction is ready for execution if it is not dependent upon re- coder operation per bit slice (Fig. 3). The 15 dependencies for sults of any other pending instructions and has not been sched- the 16 instructions currently in the pool, , are evalu- uled in the previous cycle. The scheduler is organized into 16 ated and stored in a 1-bit 240-entry dependency matrix during VANGAL et al.: A 5-GHz 32-bit INTEGER EXECUTION CORE IN 130 nm DUAL- CMOS 1423 (a) (b) Fig. 5. Scheduler designs. (a) Dual-rail domino. (b) Single-rail CSG. Fig. 7. Scheduler priority encoder CSG circuit. Fig. 6. Scheduler ready logic CSG circuit. wide AND paths and realizes the complete critical path with single-ended 2 9-way (Fig. 6) and 8-way (Fig. 7) dynamic OR the previous cycle. The ready logic resolves dependencies be- circuits, respectively. The CSG circuit enables domino-com- tween the 16 instructions in the pool and two external depen- patible dual-rail outputs but requires only a single-rail input. dency signals ( ), essentially requiring an 18-way AND It contains two dynamic nodes, a traditional complementary operation (Fig. 4). An 8-way AND priority encoder then chooses dynamic node and a true dynamic node. Both nodes precharge from among the ready instructions using dynamically controlled using the same clock. During evaluation, one of these two nodes priorities ( ) and drives a 140- m loopback bus into the transitions low, causing the nonswitching node to be actively ready logic and the shared ALU tri-state bus. The ready logic, held by a pMOS device turned on by the evaluating node. These using the priority encoder outputs from all other bit slices, de- cross-coupled pMOS transistors provide additional noise immu- termines if its instruction is dependent on any other instruction. nity, allowing wider OR-gates than those possible when leakage The priority encoder, then, using the ready logic outputs only is compensated only by a normal half-keeper. Dual- opti- from the other 7 bit slices in its portion of the instruction queue, mization is conducted for high performance and to meet target indicates if its instruction is the highest priority. noise margin constraints. High- is used on the 9- and 8-way A domino implementation of the scheduler logic requires domino-OR nMOS pull down transistors and low- is used for a fully dual-rail design, since both true and complementary all other transistors. The complete scheduler path requires only domino-compatible inputs are required for both the ready logic 6 gate-stages, improving critical path performance by 23% over as well as the priority encoder. An optimal dual-rail domino the corresponding dual-rail implementation. Furthermore, the design requires 8 gate stages due to decreasing performance as single-ended design achieves 67% layout area reduction and evaluation stack heights are increased on the complementary 25% loopback interconnect length reduction due to eliminating path [Fig. 5(a)]. Fig. 5(b) shows the single-ended to domino- 50% of the scheduler logic transistors, enabling a dense layout compatible complementary signal generator (CSG) based ready occupying 210 m 210 m.

5-Ghz 32-Bit Integer Execution Core in 130-Nm Dual-V/Sub T/ CMOS

The ASCI Red TOPS Supercomputer

2017 HPC Annual Report Team Would Like to Acknowledge the Invaluable Assistance Provided by John Noe

An Extensible Administration and Configuration Tool for Linux Clusters

Scalability and Performance of Salinas on the Computational Plant

Computer Architectures an Overview

Test, Evaluation, and Build Procedures for Sandia's ASCI Red (Janus) Teraflops Operating System

High Performance Computing

Delivering Insight: the History of the Accelerated Strategic Computing

Sandia's ASCI Red, World's First Teraflop Supercomputer, Is

Performance of Various Computers Using Standard Linear Equations Software

Supercomputers: the Amazing Race Gordon Bell November 2014

Architectural Specification for Massively