IBM Research

The Limits of CMOS Scaling from a Power- Constrained Technology Optimization Perspective

D. J. Frank IBM T.J. Watson Research Center, Yorktown Heights, NY

Purdue University seminar Oct. 4, 2006

Presentation is © 2006 IBM © 2004 IBM Corporation IBM Research | Technology Acknowledgements

ƒ Wilfried Haensch ƒ Ghavam Shahidi ƒ Omer Dokumaci ƒ Mary Wisniewski ƒ Mike Scheuermann ƒ Phillip Restle ƒ Steve Kosonocky ƒ Evan Colgan ƒ Philip Wong ƒ Yuan Taur ƒ Paul Solomon ƒ Bob Dennard

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Outline

1. General Limitations to Scaling 2. Power-Constrained Technology Optimization ƒ Models and Assumptions ƒ Optimization Results 3. Minimum Energy from Optimization 4. Open Questions 5. Summary

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology 1. Limitations to Scaling

1. Quantum mechanical leakage currents 2. Discreteness of matter and energy 3. Material considerations 4. Thermodynamic limitations 5. Practical and environmental constraints on power

Basic idea of Scaling: Adjust dimensions, voltages, & doping to achieve smaller FET with same electrostatic behavior.

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Quantum Mechanical Tunneling Leakage Currents

FET 'ON' FET 'OFF' Vdd Gnd Gnd Gnd Gnd Vdd

Gate insulator tunneling

e- Subthreshold leakage e- Direct source-to-drain tunneling Source Drain-to-body tunneling

e-

Channel Drain

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Discrete dopant fluctuations ‹The number of dopant atoms in the depletion layer of a 1.5 MOSFET has been scaling roughly as Leff . ‹Statistical variation in the number of dopants, N, varies as 1/2 N , causing increasing VT uncertainty for small N. ‹Specific threshold uncertainties depend on the details of the doping profiles. ‹3D simulations are required to accurately evaluate these dopant fluctuations. ‹A preprocessor (called MCMESH3D) was written for FIELDAY: •Checks every Si atom site to see if it is a dopant •Transfers these dopants to the simulation mesh ‹3D FIELDAY simulations of subthreshold current are run

on ~100 different cases to statistically evaluate σVT for any given design. •Use constant mobility model to avoid unphysical mobility dependence on dopant positions.

249,403,263 Si atoms 68,743 donors 13,042 acceptors}

[D. J. Frank, et al., Symp. VLSI Technol., p.169, 1999 and D. J. Frank and H.-S. P. Wong, IWCE, p.2, May 2000]

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Simulated Contact Hole Exposure --Discreteness of photons Photons Deprotected Disolved absorbed polymer polymer

©2003, SPIE Monte Carlo simulation of exposure and development of a 80 nm contact hole using EUV lithography. [J. Cobb, et al., Proc SPIE]

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Material Properties

1. Bandgap. The Si bandgap does not scale, but (a) this is not a major problem, and (b) it can be overcome by forward biasing the body.

2. Dielectric constants. ITRS roadmap requires high-k gate insulators, but there are few materials that come close to satisfying all the demands: High k High barrier (for both electrons and holes, preferably) Stable on Si at anneal temperatures No traps or interface states High reliability Hafnium silicate-based dielectrics are presently the most promising.

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Thermodynamic limitations

ƒ 1. The Boltzmann distribution. This causes subthreshold leakage current. ƒ 2. Irreversible computation => All switching energy is converted to heat. ƒ 3. All leakage currents and IR drops are irreversible => More heat. ƒ 4. Subthreshold slope sets fundamental limits on logic swing, but this limit is not usually important.

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Thermal Current (subthreshold leakage) ƒThe Boltzmann distribution determines the subthreshold

slope and leakage current, VT, and diode leakage currents, too.

e(VG-VT)/ηkT IsubVT = I0 e

ƒVT can only be scaled by reducing the temperature, which is not acceptable for many applications.

ƒSpeed is very sensitive to e-

VT/VDD ratio. Source

Channel Drain

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Practical and Environmental Issues

ƒ Power consumption and heat removal are limited by practical considerations. ƒ Low power applications must be battery powered

‰ Many must be lightweight => power < ~few watts.

‰ Disposable batteries can cost >> $500/watt over life of device.

‰ Rechargeables can cost > $50/watt over life of device. ƒ Home electronics is limited to <~1000W by heating of the room and cost of electricity. ƒ High performance is limited by difficulty of heat removal from chip (~100 W/chip). (Cost of electricity is ~$5/watt over life.)

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology 2. Technology Scaling Limits from System Performance Optimization

ƒ Since the end of scaling is dominated by practical considerations, it is application-dependent, requiring optimization across device, circuit, and architecture. ƒ In the past, device, circuit and architecture design have proceeded in parallel, to increase product throughput and reduce complexity, but in an era of diminishing returns, greater performance can be achieved by optimizing across the boundaries. ƒ As an initial exploration of this regime, a tool has been designed to capture the essential features of each complexity level in order to evaluate the impact of technology options on the performance of future systems.

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Concept Existence of an Optimal Technology

Fixed architectural complexity ¾Practicality imposes power constraints. + Fixed power constraints ¾Electrostatics imposes + Device physics geometric constraints ¾Thermodynamics imposes = Existence of an optimal tech- voltage constraints. nology with maximal performance. ¾Quantum mechanics imposes miniaturization constraints due to tunneling.

Po leakage power wer leakage increases due to tunneling effects

dynamic power Large Miniaturization Small l o g ( Pe Decreasing available dynamic power rf

ormance overcomes speed improvements due to scaling. ) Large Miniaturization Small

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Schematic organization of optimization program

Fixed Variables: parameters initial guess

new values: improved Area Model guess

Wiring Thermal Device Structure statistics Model Constrained IV Model Leakage Model optimizer Wire Capacitance

Delay Leakage Power tolerance adjustments tolerance adjustments

Adjust for Latency Total Power of Long Paths

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Assumptions and Model Details

ƒ Chip-level assumptions ƒ Optimization metric ƒ Device IV curves ƒ Circuit delay ƒ Power dissipation ƒ Thermal model ƒ Wire models ƒ Accounting for variations

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Models and Approximations System Assumptions zProcessor chip is assumed to have a fixed number of cores, each with a specified number of logic gates. zOnly the logic within the cores is considered within the optimizations. zThe clock and memory aspects of the chip are assumed to scale in the same way as the logic (delay, power, and area). zCore-to-core and core-to-memory communication is not dealt with.

Repeaters Memory ock Logic Cl

Fudge Treat in Fudge detail

Treat these by simple scaling from the logic part.

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology How much area do the processor cores take? 100% to 25%, generally decreasing with generation: 70%

Prescott. 125M FETs 100%

Alpha 21264 ('96) 15M FETs, L1 cache only 40%

Dothan, 140M FETs

40%

Power4, 174M FETs ~25% 2 cores, 1.72B FETs

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Area usage within a processor core Approximate area fractions for a high-performance core in leading-edge technology 9.3%

23.3% 7.0%

9.3% data from: 23.3% 7.0% M. Scheuermann 9.3% and M. Wisniewski 23.3% 60% 7.0% 9.3% 1/3 20.2% 23.3% 7.0% 60.5% 2/3 20.2%

13.3% 33% 40.3% Buffers & extra latches 31% 36% 20.2% 13.3% Caches (L1) Buffers & extra latches 12.5% 14.5% Macros Caches 1.2% 90% Caps, Clock dist., Unused Register files Buffers & extra 11.2% 14.5% latches Custom & RLMs Caches Caps, Clock dist., Unused Buffers & extra latches Register files Caches Latches and Register files Processors built with nanotechnology are likely to LCBs Latches and LCBs have similar area usage statistics. Logic Logic in use Nanotechnology may require additional area Unused/caps Unused Logic allocations for defective circuitry. Caps, Clock Unused/caps Estimates of power and computational densities dist., Unused should take into account realistic area efficiencies. Caps, Clock dist., Unused

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Optimization Approaches

1. Engineering approach: Maximize system performance, at fixed power. Use total logic transition rate (LTR),

LTR = Ngates x activity factor/logic depth x 1/Delay Relatively little dependence on architectural details.

2. Business approach: Maximize Return on Investment (ROI).

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology FET Model

Using a general temperature-dependent short-channel FET model in

which VT, tD, and tox are coupled, halo doping effects are included, and VT is set by the doping. Modified alpha power model:

γ s WεI ηkT  ηkT /e   µ(E⊥ ) VGS −VT  ID (VGS ) = eff   µ0  EC Fα   tox e  FI EC LCH   µ0   ηkT /e 

10W FET Lg=28nm

1mW FET Lg=45nm

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Calibrating FET IV Model

zOptimizer FET model is fit to IV curves from 2D Separate Fits Joint Fit

device simulator (FIELDAY). 90nm 45nm 90nm 45nm zAn empirical relation for the effective body doping vs ND 3.26E18 6.00E18 2.29E18 4.35E18 channel length is used, which allows excellent fitting to xe 50.7 36.9 60.6 45.8 xoverlap 5.69 0.29 9.66 -0.55 the FIELDAY data. beta 1.45 1.49 1.465 = zBest fits occur when xovlp is an optimization variable, gamma 0.321 0.132 0.243 = allowing overlap to decrease with generation. n1 -0.399 -0.643 -0.755 = n2 2.33 1.84 2.14 = zEvaluated 10 parameter fits to 90nm and 45nm tan theta 1.375 1.20 1.275 =

technology node IV's separately, and 14 parameter fit mult factor 2.04 1.97 1.933 =

to the data jointly. Rc 0.0126 0.0162 0.0127 0.0161

Chi2 2.54 3.09 9.63 - Correlation plots for the joint fit:

L CH = L G − x ovlp

L n1  CH   xe  Neff = ND L n2 1+  CH   xe 

−γ β I D ~ µmult ()LCH (VG −VT )

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Circuit Delay Estimation Basic circuit elements are: FI=2, FO=1.65 wire-loaded NAND gates for logic inverters for repeaters, FO ~ 1.2 Delay calculations: V (C + C + C ) τ = DD parasitic wire gateload 1 * 2I Deff Current is adjusted to account for noise and variations. τ2 = Rwire (Cwire + Cgateload )

Propagation τ3 = Lwire (c / 2) delay 3/ 4 τ + ()τ4/3 + τ4/3 Final delay corrections are τ = 1 2 3 based on Eble's thesis. 0.5 + ()1−VT /VDD (1+ α) Correction for VT/VDD.

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Power Calculation

PTOT = PDYN + PsubVT + POX + PB2B

α 1 PDYN = NCKT C (VH −VL )VDD τ lD 2 W PsubVT =1.7NCKTVDD Ioff (VT ,VDD, tox , η, LG , L )

W Pox = Acore Dox ( L )VDD Jox (VT ,VDD, tox , η) P = 1 A D (W )V J (F ,V ) B2 B 3 core ox L DD B2B Max DD Note that 1 cross-through α power is not LTR = NCKT lD τ included.

The powers are computed separately for logic and for repeaters. τ = mean delay for a single loaded logic gate

α is activity factor divided by logic depth. Usually ~0.015 in lD recent optimizations.

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Generalized heat sink model

ƒ Two level heat flow model: Heat sources Si wafer ‰ Flow in the silicon wafer ‰ Flow in the heat sink material Interface

ƒ In each layer, the flow can be: Heat spreader ‰ 3D (spherical) for spots smaller than (e.g., SiC or Cu) thickness ‰ 2D (cylindrical) at distances larger than the Interface to final coolant thickness (e.g., air or water) ƒ In silicon layer, inhomogeneous power dissipation is accounted for, to estimate ρSi – thermal sheet resistance of Si wafer maximum junction temperature at hottest point. ρHS – thermal sheet resistance of heat sink

RSi – thermal contact resistance of Si wafer ))

Comparison KK ThisMy m modelodel is red. is red ((

ee R – thermal contact resistance of heat sink of simplified FKEai’ sm datodela is blisue. blue HS analytic Ris Ris

model with ee detailed numerical peraturperatur

model. mm ee TT

Hot spot size (cm)

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Temperature rise constraint 60. 59.

9 60. 00 73 00 60. 8 42. 00 55 60. ƒ To prevent excessive heating, a 7 00 23.

constraint is introduced: 6 81

5 13. nce (arb units) 11 ƒ If the power level would cause the a 4 5.

3 86

maximum chip temperature to 3. 2 16 exceed the constraint value, the 1. Labels are temperature 01 0. 0. 0.

1 34 10 03 rise at each point core area is increased above its Relative Perform 0 nominal estimated size (e.g., by 1 10 100 Total Chip Power (W) diluting the core with cache) until the 2.5 temperature rise is just equal to the

constraint. 2

ƒ This makes longer average wire iplier 1.5 length, but prevents excess ea mult r A temperatures. 1

0.5 1 10 100 Total Chip Power (W)

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Communication and Wiring Models

Assume wire lengths distributed according to Rent's rule.

4FO  2r−3 2r−3 

i ( ) = − ()2 N res) net l l CKT  i 3 + FO  

l R inet ( )d ∫1 l l l LnoRptr = l R

inet ( )d og(number of w l l l ∫1 2 N log(length) l R CKT l Max N Rptr = linet (l)dl l R ∫l R From optimizations: Units are gate pitches. 12 r = Rent exponent, 0.6, here. 10 8

Required 6 4 # Wiring Levels 1E+5 1E+6 1E+7 1E+8 Number of Logic Gates

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Repeater Model Long wires receive repeaters with a spacing that is optimized. Long wire delay can be absorbed into pipeline depth, but the latency causes inefficiency, so we use a latency penalty factor: γ.

100 1 1 Repeater spacing 0.9 90 Repeater width 0.8 0.1 80 0.7 0.6 70 0.5 0.01 0.4 60 Pecon=10 W/cm2 0.3 9S 11S 12S 13S Repeater Width (um) Repeater Spacing (cm) Repeater Spacing (um) 50 0.2 10S 0.01 0.1 1 0.001 0.01 0.1 1 10 100 Latency Penalty Factor CPU Core Power (W)

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Local Variation Modeling

ƒ Variation sources:

‰ Signal Coupling noise

‰ Supply noise

‰ Statistical doping variations

‰ LER gate length variations ƒ Consequences modeled:

‰ Increased static power • combine 1 sigma of doping, length, and noise ‰ Critical path delay distribution • yield-based, using estimated critical path distribution, • and 1 sigma of doping and length, and worst case noise. ‰ Single stage functionality • use worst case (~6 sigma) of doping and length, no noise.

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Optimization Results

ƒ General results ƒ Evaluating specific possible device directions

‰ Increasing mobility

‰ High-k gate dielectric and metal gates

‰ BEOL improvement only

‰ Better heat sinks

‰ Sub-ambient cooling

‰ Multi-processor tradeoffs

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Optimize by generation

Dual core processor with Optimizations over 7 variables: aggressive air cooling tox, Lg, ND, , Vdd, Srpt,

Technology node (nm) 130 90 65 45 32

Wire 1/2 pitch (nm) 175 120 90 70 50

gate overlap (nm) 19 8.74 3 -1.03 -1.5

halo scale len (nm) 61.5 53.4 46.3 40.2 34.9

Rcs (Ohm cm) 0.0129 0.0136 0.0144 0.0152 0.0161

LER sigma Lg @W=1um 0.28 0.28 0.28 0.28 0.28 (nm)

ACLV (nm) 3.9 2.7 1.8 1.3 1

mob. enhancement 1 1.4 1.7 2 2

gate depletion (nm) 0.4 0.4 0.3 0.3 0.01

k_BEOL 3.9 3.5 3.2 2.8 2.5

Gate insulator oxynitride oxynitride oxynitride oxynitride HfO2

Note that the LG, tox, VDD, VT, etc. are NOT preselected. They are solved for by the optimizations.

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Optimize by generation Dual core processor with aggressive air cooling

9 90 nm 65 nm 45 nm 32 nm 8 )

s 7 t

uni 6 b

r High-k gate a 5 insulator

nce ( 4 a

m 3 or f r e 2 Oxynitride P

1 Dual-core processor 0 1 10 100 Total Chip Power (W)

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Optimize by generation Dual core processor with aggressive air cooling

Gate Length vs Power Oxide Thickness vs Power

120 1.6 90 nm 65 nm 45 nm 32 nm 1.5 90 nm 65 nm 45 nm 32 nm 100 1.4 1.3 80 1.2 Oxynitride 1.1 60 1 0.9 40

Gate Length (nm) Gate Length 0.8 High-k, for 32nm 20 0.7 0.6

0 Equivalent Oxynitride Thickness (nm) 0.5 0.01 0.1 1 10 100 0.01 0.1 1 10 100 Total Chip Power (W) Total Chip Power (W) (high-k case assumes 0.3nm barrier

layer, bandedge metal gate, HfO2-like insulator characteristics.)

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Optimize by generation Dual core processor with aggressive air cooling Voltages vs Power 0.9 Vdd,90 Vdd,65 Vdd,45 Vdd,32 0.8 VT,90 VT,65 VT,45 VT,32

0.7

0.6

0.5

0.4

0.3

0.2 Supply and Threshold Votlage (V)

0.1 0.01 0.1 1 10 100 Total Chip Power (W)

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Optimal Power Allocation Fractions

Oxide pwr, rptrs Oxide pwr, logic SubVT pwr, rptrs SubVT pwr, logic Dyn. pwr, rptrs Dyn. pwr, logic 100%

80%

60%

40%

Power Allocation 20%

0% 1 3 10 30 100 300 Chip Power (W)

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Impact of long-wire assumptions

Optimized Performance vs Power

7

6 ) s t i n

u 5

b r a 4 e (

nc 3 a m or

f 2 r e P 1 1x wires 2x wires Rwire=0 0 1 10 100 1000 Total Chip Power (W)

Green case: wires with repeaters are 2x the regular wire. Red case: all wire is the same size (63.6 nm, here, for 45nm node) Blue case: zero wire resistance case is for comparison. (All wires are 2:1, height to width ratio)

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Mobility dependence 1.2 45nm technology Enhanced mobility has greatest benefit at dual core processor 1.15 water cooled high power. Even for large mobility enhancements, 1.1 performance boost is modest: 10-15%. 1.05 Relative Performance

1 W chip 10 W chip 100W chip 1 1 1.5 2 2.5 3 Mobility Enhancement Factor

1.2 1.12 45nm technology 1W 10W 100W dual core processor 1.1 1.15 e nc

a 1.08 m or f r

1.1 e 1.06 P e v i 1.04 lat 32nm technology 1.05 e Relative Performance R 1.5x, air 2.5x, air 2.0x, water 2.5x, water 1.02 8 core processor 2.0x, air 1.5x, water Air cooled 1 1 1 10 100 1 1.5 2 2.5 3 Total Chip Power (W) Mobility Enhancement Factor

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Metal-gate workfunction for high-k and oxynitride

45nm node, dual core processor with wfn=0 wfn=.2 wfn=.4 wfn=0, hi-k wfn=.2, hi-k wfn=.4, hi- 1.5 aggressive air cooling 1.4 1.3 1.2 1.1 1.4 1 5W, oxynitride 5W, high-k 0.9 1.3

50W, oxynitride 50W, high-k poly-Si Gate 0.8 1.2 0.7

Performance Relative to 0.6 1.1 0.5 1 10 100 1 Total Chip Power (W)

0.9 20 wfn=0 wfn=0.4 wfn=0.2, hi-k 0.8 wfn=0.2 wfn=0, hi-k wfn=0.4, hi-k ) s 15 0.7 rb unit a

Performance relative to poly-Si Performance 0.6 10 nce ( a 0.5 orm 0 0.1 0.2 0.3 0.4 0.5 5 Perf Workfunction offset from bandedge (ev)

0 1 10 100 Total Chip Power (W)

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Cooling scenario optimizations

Optimized over 7 variables: Lg, tox, Nd, , Drptr, , Vdd.

7 5 8 core processor design 4 core processor design 6 32nm technology 45nm technology 4 5

Low Cost Air 3 -40C Liquid 4 18C Water Hi Perf Air 18C Water Hi-Perf. Air 3 2 Low-Cost Air -40C Liquid 2 1 Performance (arb units) Performance Performance (arb units) 1

0 0 1 10 100 1000 10000 1 10 100 1000 10000 Total Chip Power (W) Total Chip Power (W)

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Multi-processor trade-offs

The energy / performance tradeoff is very steep at the high end. Lower power, more parallel processors potentially offer more computation for the same total power level.

Energy vs Performance 10 These results are for 4- processor chips with micro- 1 channel water cooling, pulling out all the stops. Loaded Switching Energy (fJ) 0.1 1E+12 1E+13 1E+14 1E+15 1E+16 Total Logic Transistions / sec

9 variables: tox, Lg, ND, , Vdd, wHP, Srpt, , xhalo

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Optimizations for varying number of processor cores

32nm node optimizations Aggressive air cooling Assume: fixed total number of FETs, divided into varying # of cores.

10 Optimized over 7 variables: ) s Lg, tox, Nd, , unit b r Drptr, , a

e ( Vdd. nc a m or f r e P

2 cores 4 cores 8 cores 16 cores 1 1 10 100 1000 Total Chip Power (W)

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology 3. What is the best possible? These optimizations are for hypothetical Optimizations across all generations 4-processor chips with micro-channel water cooling, pulling out all the stops. 9 variables: tox, Lg, ND, , Vdd, Srpt, , wHP, xhalo Caveats: conventional MOSFET structure, high-performance design practices

Mean FET width Gate Length Equiv. oxide thickness Wire Half-pitch Hi-k thickness Supply Voltage Threshold Voltage (VTsat) 1000 0.9 2:1 minimum width-to-length ratio. 0.8

100 0.7 0.6 0.5 10 0.4

Voltage (V) 0.3 1 0.2

Optimal Dimension (nm) 0.1 0.1 0 0.001 0.01 0.1 1 10 100 1000 0.001 0.01 0.1 1 10 100 1000 Total Chip Power (W) Total Chip Power (W)

Wire becomes VERY small at Optimal wire pitch grows for highest lowest power because wire performance cases. Gives lower resistance, and resistance has little impact on the FETs are spreading out because of their the slow speeds. width, anyway.

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Continued Optimizations across all generations

Switching Energy 9 variables: tox, Lg, ND, , Vdd, Srpt, , wHP, xhalo 10

Area of Each CPU Core (500K logic gates + cache, registers, latches, overhead) 10 1

Energy per logic transition (fJ) 0.1 0.001 0.01 0.1 1 10 100 1000 1 Total Chip Power (W)

Logic reaches a Performance maximum density of 3.4G

Core Area (mm2) 1E+16 gates/cm2 within the logic part of the core. 1E+15 0.1 0.001 0.01 0.1 1 10 100 1000 1E+14 Total Chip Power (W)

1E+13

Power constraints limit conventional CMOS scaling Logic transitions/sec 2 to ~3.4G logic gates/cm . 1E+12 The challenge for nanotechnology is to find a way 0.001 0.01 0.1 1 10 100 1000 to do significantly better. Total Chip Power (W)

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Energy vs Performance

10 1 30 00 0 oxynitride, 4way high-k, 16way Numbers on high-k, 4way high-k, 16w, 0 tol 30 each point

1 are the total 0 1 chip power for 3 0.00 that point. 0 0.03 0.3 1 .0 0 0.1 0 .0 1 3 1 Zero process tolerances 0.1 is unrealistic, but serves as a lower bound. Gate

Energy / Transition (fJ) lengths can be 30% smaller, yielding higher density and shorter wires. 0.01 1E+12 1E+13 1E+14 1E+15 1E+16 Logic Transitions / sec

Average loaded switching energy versus performance for cross-generational optimization (9 parameters).

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology Minimum Energy breakdown

ƒ Average logic load capacitance: 0.17fF gate cap, 0.06fF parasitic, 0.40fF wire (3.2um average length) ƒ Minimum supply voltage: ~ 360mV. ƒ Raw logic switching energy: ½CV2 = 0.04fJ ƒ Ratio of total logic power to active power: ~1.4 ƒ Ratio of logic + repeater power to logic power: ~ 1.2 ƒ Long wire latency penalty factor: ~1.6

ƒ All together: effective energy per ‘useful’ logic transition: ~0.1fJ

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology 4. Open questions

ƒ Many aspects of VLSI design are tied to the importance or ‘criticality’ of a net. How can the distribution of ‘importance’ be modeled? Can this be tied to a distribution of electrical and/or logical effort?

‰ More accurate FET width distribution

‰ Multiple VT optimization

‰ Wiring hierarchy optimization ƒ How should SRAM optimization be tied to logic, if at all? ƒ How to optimize further up the design hierarchy into architecture?

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology 5. Summary ƒ General limitations to scaling have been summarized. ƒ Power and temperature rise are dominant limiters. ƒ A set of simplified models have been developed to enable fast turnaround comparative technology optimizations in the presence of power and temperature constraints. ƒ The dependence of optimal technology parameters on application power requirements has been investigated. ƒ The dependence of chip performance on potential technology enhancements has also been investigated. ƒ Performance gains can still be obtained from improved cooling and/or from lower power, slower, more parallel processors. ƒ Minimum loaded switching energy for conventional CMOS is ~0.1 fJ. ƒ Open questions are still under investigation.

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation IBM Research | Silicon Technology The End of Scaling is Optimization Stop when you get to the top. Then, try to switch to a different mountain, e.g., some form of nanotechnology.

But, each technology has its own summit, and we need to try to make sure the new peak is actually stem Performance) higher than the one we are already on.

log(Sy Miniaturization

Purdue University NCN Seminar | © IBM | Oct. 4, 2006 © 2004 IBM Corporation