
CIS501: Introduction to Computer Architecture
Prof. Milo Martin
Unit 1: Technology, Cost, Performance, Power, and Reliability

This Unit

• What is computer architecture
• Forces that shape computer architecture
  • Applications (covered last time)
  • Semiconductor technology
• Evaluation metrics: parameters and technology basis
  • Cost
  • Performance
  • Power
  • Reliability


Readings

• H+P
  • Chapter 1
• Paper
  • G. Moore, "Cramming More Components onto Integrated Circuits"
• Reminders
  • Pre-quiz
  • Paper review
    • Groups of 3-4, send via e-mail to [email protected]
    • Don't worry (much) about power question, as we might not get to it today

What is Computer Architecture? (review)

• Design of interfaces and implementations…
• Under constantly changing set of external forces…
  • Applications: change from above (discussed last time)
  • Technology: changes characteristics from below
  • Inertia: resists changing all levels of system at once
• To satisfy different constraints
  • CIS 501 mostly about performance
  • Cost
  • Power
  • Reliability
• Iterative process driven by empirical evaluation
• The art/science of tradeoffs

Abstraction and Layering

• Abstraction: only way of dealing with complex systems
  • Divide world into objects, each with an…
    • Interface: knobs, behaviors, knobs → behaviors
    • Implementation: "black box" (ignorance+apathy)
  • Only specialists deal with implementation, rest of us with interface
  • Example: car, only mechanics know how implementation works
• Layering: abstraction discipline makes life even simpler
  • Removes need to even know interfaces of most objects
  • Divide objects in system into layers
  • Layer X objects
    • Implemented in terms of interfaces of layer X-1 objects
    • Don't even need to know interfaces of layer X-2 objects
    • But sometimes helps if you do

Abstraction, Layering, and Computers

• Computers are complex systems, built in layers
  • Applications
  • O/S, compiler
  • Firmware, device drivers
  • Processor, memory, raw I/O devices
  • Digital circuits, digital/analog converters
  • Gates
  • Transistors
• 99% of users don't know hardware layers implementation
• 90% of users don't know implementation of any layer
• That's OK, world still works just fine
• But unfortunately, the layers sometimes break down
  • Someone needs to understand what's "under the hood"


CIS501: A Picture

[Figure: layered system stack — Application, OS, Compiler, Firmware, CPU, I/O, Memory, Digital Circuits, Gates & Transistors — with the software layers above the Instruction Set Architecture (ISA) and the hardware layers below it]

• Computer architecture
  • Definition of ISA to facilitate implementation of software layers
• CIS 501 mostly about computer micro-architecture
  • Design CPU, Memory, I/O to implement ISA …

Semiconductor Technology Background

• Transistor
  • invention of the century
• Fabrication

Shaping Force: Technology

• Basic technology element: MOSFET
  • Invention of 20th century
  • MOS: metal-oxide-semiconductor
    • Conductor, insulator, semi-conductor
  • FET: field-effect transistor
    • Solid-state component acts like electrical switch
    • Channel conducts source→drain when voltage applied to gate
• Channel length: characteristic parameter (short → fast)
  • Aka "feature size" or "technology"
  • Currently: 0.09 µm (0.09 micron), 90 nm
  • Continued miniaturization (scaling) known as "Moore's Law"
  • Won't last forever, physical limits approaching (or are they?)

Complementary MOS (CMOS)

[Figure: CMOS inverter — p-transistor between power (1) and the output "node", n-transistor between the output and ground (0), input driving both gates; transistor terminals labeled gate, source, drain, channel]

• Voltages as values
  • Power (VDD) = 1, Ground = 0
• Two kinds of MOSFETs
  • N-transistors
    • Conduct when gate voltage is 1
    • Good at passing 0s
  • P-transistors
    • Conduct when gate voltage is 0
    • Good at passing 1s
• CMOS: complementary n-/p- networks form boolean logic


CMOS Examples

• Example I: inverter
  • Case I: input = 0
    • P-transistor closed, n-transistor open
    • Power charges output (1)
  • Case II: input = 1
    • P-transistor open, n-transistor closed
    • Output discharges to ground (0)
• Example II: look at truth table
  • A,B: 0,0 → 1   0,1 → 1   1,0 → 1   1,1 → 0
  • Result: this is a NAND (NOT AND)
  • NAND is universal (can build any logic function)
• More examples, details
  • http://…/~amir/cse371/lecture_slides/tech.pdf

More About CMOS and Technology

• Two different CMOS families
• SRAM (logic): used to make processors
  • Storage implemented as inverter pairs
  • Optimized for speed
• DRAM (memory): used to make memory
  • Storage implemented as capacitors
  • Optimized for density, cost, power
• Disk is also a "technology", but isn't transistor-based

Aside: VLSI + Manufacturing

• VLSI (very large scale integration)
  • MOSFET manufacturing process
  • As important as invention of MOSFET itself
  • Multi-step photochemical and electrochemical process
  • Fixed cost per step
  • Cost per transistor shrinks with transistor size
• Other production costs
  • Packaging
  • Test
  • Mask set
  • Design

MOSFET Side View

[Figure: MOSFET cross-section — gate over an insulator, source and drain on either side of the channel, all on the substrate]

• MOS: three materials needed to make a transistor
  • Metal - Aluminum, Tungsten, Copper: conductor
  • Oxide - Silicon Dioxide (SiO2): insulator
  • Semiconductor - doped Si: conducts under certain conditions
• FET: field effect (the mechanism) transistor
  • Voltage on gate: current flows source to drain (transistor on)
  • No voltage on gate: no current (transistor off)


Manufacturing Process

• Start with silicon wafer
• "Grow" photo-resist
  • Molecular beam epitaxy
• Burn positive bias mask
  • Ultraviolet light lithography
• Dissolve unburned photo-resist
  • Chemically
• Bomb wafer with negative ions (P)
  • Doping
• Dissolve remaining photo-resist
  • Chemically
• Continue with next layer

Manufacturing Process (continued)

• Grow SiO2
• Grow photo-resist
• Burn "via-level-1" mask
• Dissolve unburned photo-resist
  • And underlying SiO2
• Grow tungsten "vias"
• Dissolve remaining photo-resist
• Continue with next layer

Manufacturing Process (continued)

• Grow SiO2
• Grow photo-resist
• Burn "wire-level-1" mask
• Dissolve unburned photo-resist
  • And underlying SiO2
• Grow copper "wires"
• Dissolve remaining photo-resist
• Continue with next wire layer…
• Typical number of wire layers: 3-6

Defects

[Figure: wafer map marking individual dies as defective or slow]

• Defects can arise
  • Under-/over-doping
  • Over-/under-dissolved insulator
  • Mask mis-alignment
  • Particle contaminants
• Try to minimize defects
  • Process margins
  • Design rules
    • Minimal transistor size, separation
• Or, tolerate defects
  • Redundant or "spare" memory cells


Empirical Evaluation

• Metrics
  • Cost
  • Performance
  • Power
  • Reliability
• Often more important in combination than individually
  • Performance/cost (MIPS/$)
  • Performance/power (MIPS/W)
• Basis for
  • Design decisions
  • Purchasing decisions

Cost

• Metric: $
• In grand scheme: CPU accounts for fraction of cost
  • Some of that is profit (Intel's, Dell's)

             Desktop     Laptop      PDA        Phone
$            $100–$300   $150–$350   $50–$100   $10–$20
% of total   10–30%      10–20%      20–30%     20–30%
Other costs  Memory, display, power supply/battery, disk, packaging

• We are concerned about Intel's cost (transfers to you)
  • Unit cost: costs to manufacture individual chips
  • Startup cost: cost to design chip, build the fab line, marketing

Unit Cost: Integrated Circuit (IC)

• Chips built in multi-step chemical processes on wafers
  • Cost / wafer is constant, f(wafer size, number of steps)
• Chip (die) cost is proportional to area
  • Larger chips means fewer of them
  • Larger chips means fewer working ones
  • Why? Uniform defect density
• Chip cost ~ chip area^α
  • α = 2–3
• Wafer yield: % wafer that is chips
• Die yield: % chips that work
  • Yield is increasingly non-binary - fast vs slow chips

Yield/Cost Examples

• Parameters
  • wafer yield = 90%, α = 2, defect density = 2/cm²

Die size (mm²)   100       144       196       256      324       400
Die yield        23%       19%       16%       12%      11%       10%
6" Wafer         139(31)   90(16)    62(9)     44(5)    32(3)     23(2)
8" Wafer         256(59)   177(32)   124(19)   90(11)   68(7)     52(5)
10" Wafer        431(96)   290(53)   206(32)   153(20)  116(13)   90(9)

               Wafer   Defect   Area   Dies  Yield  Die    Package      Test   Total
               Cost    (/cm²)   (mm²)               Cost   Cost (pins)  Cost
Intel 486DX2   $1200   1.0      81     181   54%    $12    $11(168)     $12    $35
IBM PPC601     $1700   1.3      196    66    27%    $95    $3(304)      $21    $119
DEC Alpha      $1500   1.2      234    53    19%    $149   $30(431)     $23    $202
Intel Pentium  $1500   1.5      296    40    9%     $417   $19(273)     $37    $473
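A back-of-the-envelope sketch (my addition, not from the deck) of how such yield tables are produced, using the standard negative-binomial yield model from Hennessy & Patterson with the slide's parameters; exact entries depend on the model variant and rounding, so the numbers below land near, not exactly on, the table's values:

```python
import math

# Negative-binomial die-yield model (Hennessy & Patterson), with the slide's
# parameters: wafer yield 90%, alpha = 2, defect density 2/cm^2.
def die_yield(defects_per_cm2, die_area_cm2, alpha=2.0, wafer_yield=0.9):
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** -alpha

# Approximate dies per wafer: usable wafer area minus edge loss.
def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    r = wafer_diameter_cm / 2
    return int(math.pi * r * r / die_area_cm2
               - math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2))

area = 1.0                                # 100 mm^2 die = 1 cm^2
y = die_yield(2.0, area)                  # ~0.225, near the 23% table entry
n = dies_per_wafer(15.24, area)           # 6" wafer ~ 15.24 cm diameter
print(f"yield {y:.1%}, {n} dies/wafer, ~{y * n:.0f} good dies")
```

Larger dies get hit twice, as the slide says: fewer dies fit on the wafer, and each die is more likely to contain a defect.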


Startup Costs

• Startup costs: must be amortized over chips sold
  • Research and development: ~$100M per chip
    • 500 person-years @ $200K per
  • Fabrication facilities: ~$2B per new line
    • Clean rooms (bunny suits), lithography, testing equipment
• If you sell 10M chips, startup adds ~$200 to cost of each
• Companies (e.g., Intel) don't make money on new chips
  • They make money on proliferations (shrinks and frequency)
  • No startup cost for these

Moore's Effect on Cost

• Scaling has opposite effects on unit and startup costs
+ Reduces unit integrated circuit cost
  • Either lower cost for same functionality…
  • Or same cost for more functionality
– Increases startup cost
  • More expensive fabrication equipment
  • Takes longer to design, verify, and test chips

Performance

• Two definitions
  • Latency (execution time): time to finish a fixed task
  • Throughput (bandwidth): number of tasks in fixed time
  • Very different: throughput can exploit parallelism, latency cannot
    • Baking bread analogy
  • Often contradictory
  • Choose definition that matches goals (most frequently throughput)
• Car/bus example: move people from A to B, 10 miles
  • Car: capacity = 5, speed = 60 miles/hour
  • Bus: capacity = 60, speed = 20 miles/hour
  • Latency: car = 10 min, bus = 30 min
  • Throughput: car = 15 PPH (count return trip), bus = 60 PPH

Performance Improvement

• Processor A is X times faster than processor B if
  • Latency(P,A) = Latency(P,B) / X
  • Throughput(P,A) = Throughput(P,B) * X
• Processor A is X% faster than processor B if
  • Latency(P,A) = Latency(P,B) / (1+X/100)
  • Throughput(P,A) = Throughput(P,B) * (1+X/100)
• Car/bus example
  • Latency? Car is 3 times (and 200%) faster than bus
  • Throughput? Bus is 4 times (and 300%) faster than car
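A quick sketch (my addition) that reproduces the car/bus numbers, to make the latency-vs-throughput distinction concrete:

```python
# Latency vs. throughput for the car/bus example: a 10-mile trip,
# with round trips needed to carry more people.
def latency_min(miles, mph):
    return miles / mph * 60

def throughput_pph(miles, mph, capacity):
    round_trip_hours = 2 * miles / mph          # count the return trip
    return capacity / round_trip_hours          # people per hour

car_lat, bus_lat = latency_min(10, 60), latency_min(10, 20)   # 10 min, 30 min
car_tput = throughput_pph(10, 60, 5)                          # 15 PPH
bus_tput = throughput_pph(10, 20, 60)                         # 60 PPH
print(f"latency: car {bus_lat / car_lat:.0f}x faster")        # 3x (200% faster)
print(f"throughput: bus {bus_tput / car_tput:.0f}x faster")   # 4x (300% faster)
```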


What Is 'P' in Latency(P,A)?

• Program
  • Latency(A) makes no sense, processor executes some program
  • But which one?
• Actual target workload?
  + Accurate
  – Not portable/repeatable, overly specific, hard to pinpoint problems
• Some representative benchmark program(s)?
  + Portable/repeatable, pretty accurate
  – Hard to pinpoint problems, may not be exactly what you run
• Some small kernel benchmarks (micro-benchmarks)
  + Portable/repeatable, easy to run, easy to pinpoint problems
  – Not representative of complex behaviors of real programs

SPEC Benchmarks

• SPEC (Standard Performance Evaluation Corporation)
  • http://www.spec.org/
  • Consortium of companies that collects, standardizes, and distributes benchmark programs
  • Post SPECmark results for different processors
    • 1 number that represents performance for entire suite
  • Benchmark suites for CPU, Java, I/O, Web, Mail, etc.
  • Updated every few years: so companies don't target benchmarks
• SPEC CPU 2000
  • 12 "integer": gzip, gcc, perl, crafty (chess), vortex (DB), etc.
  • 14 "floating point": mesa (openGL), equake, facerec, etc.
  • Written in C and Fortran (a few in C++)

Other Benchmarks

• Parallel benchmarks
  • SPLASH2 - Stanford Parallel Applications for Shared Memory
  • NAS
  • SPEC's OpenMP benchmarks
  • SPECjbb - Java multithreaded server-like workload
• Transaction Processing Council (TPC)
  • TPC-C: On-line transaction processing (OLTP)
  • TPC-H/R: Decision support systems (DSS)
  • TPC-W: E-commerce database backend workload
  • Have parallelism (intra-query and inter-query)
  • Heavy I/O and memory components

Adding/Averaging Performance Numbers

• You can add latencies, but not throughput
  • Latency(P1+P2, A) = Latency(P1,A) + Latency(P2,A)
  • Throughput(P1+P2,A) != Throughput(P1,A) + Throughput(P2,A)
  • 1 mile @ 30 miles/hour + 1 mile @ 90 miles/hour
    • Average is not 60 miles/hour
    • 0.033 hours at 30 miles/hour + 0.011 hours at 90 miles/hour
    • Average is only 45 miles/hour! (2 miles / (0.033 + 0.011 hours))
  • Throughput(P1+P2,A) = 1 / [(1/Throughput(P1,A)) + (1/Throughput(P2,A))]
• Same goes for means (averages)
  • Arithmetic: (1/N) * Σ(P=1..N) Latency(P)
    • For units that are proportional to time (e.g., latency)
  • Harmonic: N / Σ(P=1..N) 1/Throughput(P)
    • For units that are inversely proportional to time (e.g., throughput)
  • Geometric: N-th root of Π(P=1..N) Speedup(P)
    • For unitless quantities (e.g., speedups)
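A small sketch (my addition, using the slide's own numbers) of the three means and why the harmonic mean is the right one for speeds and throughputs:

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def harmonic_mean(xs):
    return len(xs) / sum(1 / x for x in xs)

def geometric_mean(xs):
    return math.prod(xs) ** (1 / len(xs))

speeds = [30, 90]                     # mph over two equal 1-mile legs
print(arithmetic_mean(speeds))        # 60 -- the wrong average speed
print(harmonic_mean(speeds))          # 45 -- correct: 2 miles / (1/30 + 1/90) hours
print(geometric_mean([1.2, 1.5]))     # ~1.34 -- for unitless speedups
```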

SPECmark

• Reference machine: Sun SPARC 10
• Latency SPECmark
  • For each benchmark
    • Take odd number of samples: on both machines
    • Choose median
    • Take latency ratio (Sun SPARC 10 / your machine)
  • Take GMEAN of ratios over all benchmarks
• Throughput SPECmark
  • Run multiple benchmarks in parallel on multiple-processor system
• Recent (latency) leaders
  • SPECint: Intel 3.4 GHz Pentium4 (1705)
  • SPECfp: IBM 1.9 GHz Power5 (2702)

CPU Performance Equation

• Multiple aspects to performance: helps to isolate them
• Latency(P,A) = seconds / program =
  (instructions / program) * (cycles / instruction) * (seconds / cycle)
• Instructions / program: dynamic instruction count
  • Function of program, compiler, instruction set architecture (ISA)
• Cycles / instruction: CPI
  • Function of program, compiler, ISA, micro-architecture
• Seconds / cycle: clock period
  • Function of micro-architecture, technology parameters
• For low latency (better performance) minimize all three
  • Hard: often pull against one another
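The equation translates directly into code; a minimal sketch (my own illustration, with made-up workload numbers):

```python
# Latency = (insns / program) * (cycles / insn) * (seconds / cycle).
def latency_seconds(insn_count, cpi, clock_hz):
    return insn_count * cpi * (1.0 / clock_hz)

# Hypothetical workload: 1 billion dynamic instructions, CPI 2, 500 MHz clock.
t = latency_seconds(1e9, 2.0, 500e6)
print(f"{t:.1f} s")   # 4.0 s; halving any one factor halves latency
```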

Danger: Partial Performance Metrics

• Micro-architects often ignore dynamic instruction count
  • Typically work in one ISA/one compiler → treat it as fixed
• CPU performance equation becomes
  • seconds / instruction = (cycles / instruction) * (seconds / cycle)
  • This is a latency measure, if we care about throughput …
  • Instructions / second = (instructions / cycle) * (cycles / second)
• MIPS (millions of instructions per second)
  • Instructions / second * 10⁻⁶
  • Cycles / second: clock frequency (in MHz)
  • Example: CPI = 2, clock = 500 MHz, what is MIPS?
    • 0.5 IPC * 500 MHz = 250 MIPS
• Example problem situation:
  • compiler removes instructions, program faster
  • However, "MIPS" goes down (misleading)

MIPS and MFLOPS (MegaFLOPS)

• Problem: MIPS may vary inversely with performance
  – Some optimizations actually add instructions
  – Work per instruction varies (e.g., FP mult vs. integer add)
  – ISAs are not equivalent
• MFLOPS: like MIPS, but counts only FP ops, because…
  + FP ops can't be optimized away
  + FP ops have longest latencies anyway
  + FP ops are same across machines
• May have been valid in 1980, but today…
  – Most programs are "integer", i.e., light on FP
  – Loads from memory take much longer than FP divide
  – Even FP instruction sets are not equivalent
• Upshot: MIPS not perfect, but more useful than MFLOPS
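A sketch (my addition, hypothetical numbers) of the "MIPS goes down while the program gets faster" trap:

```python
# MIPS can move opposite to real performance. Suppose an optimizing compiler
# removes cheap instructions, leaving fewer but (on average) slower ones.
def mips(ipc, clock_hz):
    return ipc * clock_hz / 1e6

def runtime(insns, ipc, clock_hz):
    return insns / (ipc * clock_hz)

clock = 500e6
before = runtime(1.0e9, ipc=0.50, clock_hz=clock)
after  = runtime(0.7e9, ipc=0.40, clock_hz=clock)
print(before, mips(0.50, clock))   # 4.0 s at 250 MIPS
print(after,  mips(0.40, clock))   # 3.5 s at 200 MIPS <- faster, lower MIPS
```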


Danger: Partial Performance Metrics II

• Micro-architects often ignore dynamic instruction count…
• … but general public (mostly) also ignores CPI
  • Equates clock frequency with performance!!
• Which processor would you buy?
  • Processor A: CPI = 2, clock = 500 MHz
  • Processor B: CPI = 1, clock = 300 MHz
  • Probably A, but B is faster (assuming same ISA/compiler)
• Classic example
  • 800 MHz PentiumIII faster than 1 GHz Pentium4
  • Same ISA and compiler

Cycles per Instruction (CPI)

• CIS501 is mostly about improving CPI
  • Cycle/instruction for average instruction
  • IPC = 1/CPI
    • Used more frequently than CPI, but harder to compute with
  • Different instructions have different cycle costs
    • E.g., integer add typically takes 1 cycle, FP divide takes > 10
  • Assumes you know something about instruction frequencies
• CPI example (see the sketch below)
  • A program executes equal integer, FP, and memory operations
  • Cycles per instruction type: integer = 1, memory = 2, FP = 3
  • What is the CPI? (0.33 * 1) + (0.33 * 2) + (0.33 * 3) = 2
  • Caveat: this sort of calculation ignores dependences completely
    • Back-of-the-envelope arguments only
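The frequency-weighted CPI calculation as a sketch (my addition, using the slide's numbers):

```python
# CPI as a weighted average over instruction classes:
# CPI = sum(frequency_i * cycles_i).
def weighted_cpi(mix):                      # mix: {class: (frequency, cycles)}
    return sum(f * c for f, c in mix.values())

mix = {"integer": (1/3, 1), "memory": (1/3, 2), "fp": (1/3, 3)}
print(weighted_cpi(mix))                    # 2.0
```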

Another CPI Example

• Assume a processor with instruction frequencies and costs
  • Integer ALU: 50%, 1 cycle
  • Load: 20%, 5 cycles
  • Store: 10%, 1 cycle
  • Branch: 20%, 2 cycles
• Which change would improve performance more?
  • A. Branch prediction to reduce branch cost to 1 cycle?
  • B. A bigger data cache to reduce load cost to 3 cycles?
• Compute CPI (see the sketch below)
  • Base = 0.5*1 + 0.2*5 + 0.1*1 + 0.2*2 = 2
  • A = 0.5*1 + 0.2*5 + 0.1*1 + 0.2*1 = 1.8
  • B = 0.5*1 + 0.2*3 + 0.1*1 + 0.2*2 = 1.6 (winner)

Increasing Clock Frequency: Pipelining

[Figure: pipelined datapath — PC (+4), instruction memory, register file (s1, s2, d), data memory — with latches separating the stages]

• CPU is a pipeline: compute stages separated by latches
  • http://…/~amir/cse371/lecture_slides/pipeline.pdf
• Clock period: maximum delay of any stage
  • Number of gate levels in stage
  • Delay of individual gates (these days, wire delay more important)
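A sketch of that design comparison (my addition), extending the weighted-CPI helper to evaluate the two options:

```python
# Compare design options by recomputing the frequency-weighted CPI.
def weighted_cpi(mix):
    return sum(f * c for f, c in mix.values())

base = {"alu": (0.5, 1), "load": (0.2, 5), "store": (0.1, 1), "branch": (0.2, 2)}
opt_a = dict(base, branch=(0.2, 1))      # branch prediction: branches cost 1
opt_b = dict(base, load=(0.2, 3))        # bigger data cache: loads cost 3

for name, mix in [("base", base), ("A", opt_a), ("B", opt_b)]:
    cpi = weighted_cpi(mix)
    print(f"{name}: CPI = {cpi:.1f}, speedup = {weighted_cpi(base) / cpi:.2f}x")
# B wins: CPI 1.6 (1.25x) vs. A's CPI 1.8 (1.11x)
```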


Increasing Clock Frequency: Pipelining (continued)

• Reduce pipeline stage delay
  • Reduce logic levels and wire lengths (better design)
  • Complementary to technology efforts (described later)
• Increase number of pipeline stages (multi-stage operations)
  – At some point, actually causes performance to decrease
  • "Optimal" pipeline depth is program and technology specific
• Remember example
  • PentiumIII: 12 stage pipeline, 800 MHz, faster than
  • Pentium4: 22 stage pipeline, 1 GHz
  • Next Intel design: more like PentiumIII
  • Much more about this later

CPI and Clock Frequency

• System components "clocked" independently
  • E.g., increasing processor clock frequency doesn't improve memory performance
  – Often causes CPI to increase
• Example (see the sketch below)
  • Processor A: CPI_CPU = 1, CPI_MEM = 1, clock = 500 MHz
  • What is the speedup if we double clock frequency?
  • Base: CPI = 2 → IPC = 0.5 → MIPS = 250
  • New: CPI = 3 → IPC = 0.33 → MIPS = 333
    • Clock *= 2 → CPI_MEM *= 2
  • Speedup = 333/250 = 1.33 << 2
• What about an infinite clock frequency?
  • Only a 2x speedup (example of Amdahl's Law)
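A sketch of that speedup calculation (my addition), separating CPU cycles from memory cycles:

```python
# Memory latency is fixed in seconds, so raising the core clock inflates
# CPI_MEM: time/insn = (CPI_CPU + CPI_MEM) / clock.
def time_per_insn(cpi_cpu, cpi_mem_at_500mhz, clock_hz):
    cpi_mem = cpi_mem_at_500mhz * (clock_hz / 500e6)  # memory cycles scale with clock
    return (cpi_cpu + cpi_mem) / clock_hz

base = time_per_insn(1, 1, 500e6)        # 4 ns/insn (CPI 2 @ 500 MHz)
fast = time_per_insn(1, 1, 1000e6)       # 3 ns/insn (CPI 3 @ 1 GHz)
print(base / fast)                       # 1.33x, not 2x
# As clock -> infinity, only the 2 ns of memory time remains: 2x max (Amdahl)
```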

Measuring CPI

• How are CPI and execution-time actually measured?
  • Execution time: time command (Unix): wall clock + CPU + system
  • CPI = (CPU time * clock frequency) / dynamic insn count
  • How is dynamic instruction count measured?
• More useful is CPI breakdown (CPI_CPU, CPI_MEM, etc.)
  • So we know what performance problems are and what to fix
• CPI breakdowns
  • Hardware event counters
    • Calculate CPI using counter frequencies/event costs
  • Cycle-level micro-architecture simulation (e.g., SimpleScalar)
    + Measure exactly what you want
    + Measure impact of potential fixes
    • Must model micro-architecture faithfully
    • Method of choice for many micro-architects (and you)

Improving CPI

• CIS501 is more about improving CPI than frequency
  • Historically, clock accounts for 70%+ of performance improvement
    • Achieved via deeper pipelines
  • That will (have to) change
    • Deep pipelining is not power efficient
    • Physical speed limits are approaching
    • 1GHz: 1999, 2GHz: 2001, 3GHz: 2002, 4GHz? almost 2006
• Techniques we will look at
  • Caching, speculation, multiple issue, out-of-order issue
  • Vectors, multiprocessing, more…
• Moore helps because CPI reduction requires transistors
  • The definition of parallelism is "more transistors"
  • But best example is caches


Moore's Effect on Performance

• Moore's Curve: common interpretation of Moore's Law
  • "CPU performance doubles every 18 months"
  • Self fulfilling prophecy
    • 2X every 18 months is ~1% per week
    • Q: Would you add a feature that improved performance 20% if it took 8 months to design and test?
  • Processors under Moore's Curve (arrive too late) fail spectacularly
    • E.g., Intel's Itanium, Sun's Millennium

[Figure: processor performance, 1982–1994 — Intel x86 vs. RISC processors, growing ~35%/yr]
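A quick check of those numbers (my addition):

```python
# "2x every 18 months" compounds to roughly 1% per week, which answers the
# 20%-feature question: the curve alone gains ~36% in 8 months.
weekly_gain = 2 ** (1 / 78) - 1          # 18 months ~ 78 weeks -> ~0.9%/week
curve_gain_8mo = 2 ** (8 / 18)           # ~1.36x
print(f"{weekly_gain:.1%} per week; ship the +20% feature 8 months late?",
      1.20 > curve_gain_8mo)             # False: you fall under the curve
```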

Performance Rules of Thumb

• Make common case fast
  • Sometimes called "Amdahl's Law"
  • Corollary: don't optimize 1% to the detriment of other 99%
• Build a balanced system
  • Don't over-engineer capabilities that cannot be utilized
• Design for actual, not peak, performance
  • For actual performance X, machine capability must be > X

Transistor Speed, Power, and Reliability

• Transistor characteristics and scaling impact:
  • Switching speed
  • Power
  • Reliability
• "Undergrad" gate delay model for architecture
  • Each Not, NAND, NOR, AND, OR gate has delay of "1"
  • Reality is not so simple

Transistors and Wires

[Figure: IBM SOI technology cross-section; © IBM, from slides © Krste Asanović, MIT]

Transistors and Wires (continued)

[Figure: IBM CMOS7, 6 layers of copper wiring; © IBM, from slides © Krste Asanović, MIT]

Simple RC Delay Model

[Figure: CMOS inverter driving a 1→0 output transition through a discharge current I]

• Switching time is an RC circuit (charge or discharge)
  • R - Resistance: slows rate of current flow
    • Depends on material, length, cross-section area
  • C - Capacitance: electrical charge storage
    • Depends on material, area, distance
  • Voltage affects speed, too

Resistance

[Figure: the inverter's 1→0 discharge path, with resistance annotated on the transistor channel and output wire]

• Transistor channel resistance
  • function of Vg (gate voltage)
• Wire resistance (negligible for short wires)

Capacitance

[Figure: the same inverter, with capacitance annotated on the source/drain regions, gate, and output wire]

• Source/Drain capacitance
• Gate capacitance
• Wire capacitance (negligible for short wires)

Which is faster? Why?

Transistor Width

• "Wider" transistors have lower resistance, more drive
  • Specified per-device
• Useful for driving large "loads" like long or off-chip wires

RC Delay Model Ramifications

• Want to reduce resistance
  • "wide" drive transistors (width specified per device)
  • Short wires
• Want to reduce capacitance
  • Number of connected devices
  • Less-wide transistors (gate capacitance of next stage)
  • Short wires

Transistor Scaling

[Figure: MOSFET top view — Gate over the channel between Source and Drain in the Bulk; Width = 4λ, Length = 2λ, minimum length = 2λ. Diagrams © Krste Asanović, MIT]

• Transistor length is key property of a "process generation"
  • 90nm refers to the transistor gate length, same for all transistors
• Shrink transistor length:
  • Lower resistance of channel (shorter)
  • Lower gate/source/drain capacitance
• Result: transistor drive strength linear as gate length shrinks

Wires

[Figure: wire geometry — width, height, length, and pitch between adjacent wires. From slides © Krste Asanović, MIT]

• Resistance fixed by (length × resistivity) / (height × width)
  • bulk aluminum 2.8 µΩ·cm, bulk copper 1.7 µΩ·cm
• Capacitance depends on geometry of surrounding wires and relative permittivity, εr, of dielectric
  • silicon dioxide εr = 3.9, new low-k dielectrics in range 1.2-3.1

Wire Delay

• RC delay of wires
  • Resistance proportional to length
  • Capacitance proportional to length
  • Result: delay of a wire is quadratic in length
• Insert "inverter" repeaters for long wires to
  • Bring it back to linear delay
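A sketch (my addition, in arbitrary units) of why repeaters help: splitting a wire into N repeated segments turns quadratic delay into roughly linear:

```python
# Distributed RC wire delay grows as length^2: R and C are both proportional
# to length, and delay ~ R*C. Repeaters split the wire into short segments
# whose small quadratic delays add up linearly.
R_PER_MM, C_PER_MM = 1.0, 1.0            # arbitrary units for illustration

def wire_delay(length_mm, repeaters=0):
    n = repeaters + 1                                # number of segments
    seg = length_mm / n
    return n * (R_PER_MM * seg) * (C_PER_MM * seg)   # n * seg^2 = L^2 / n

print(wire_delay(10))                # 100.0 -- quadratic in length
print(wire_delay(10, repeaters=9))   # 10.0  -- ~linear with repeaters
# (Ignores the repeaters' own delay, which sets the practical segment size.)
```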

Moore's Effect on RC Delay

• Scaling helps reduce wire and gate delays in some ways, hurts in others
+ Wires become shorter (Length↓ → Resistance↓)
+ Wire "surface areas" become smaller (Capacitance↓)
+ Transistors become shorter (Resistance↓)
+ Transistors become narrower (Capacitance↓, Resistance↑)
– Gate insulator thickness becomes smaller (Capacitance↑)
– Distance between wires becomes smaller (Capacitance↑)

Improving RC Delay

• Exploit good effects of scaling
• Fabrication technology improvements
+ Use copper instead of aluminum for wires (ρ↓ → Resistance↓)
+ Use lower-dielectric insulators (ε↓ → Capacitance↓)
+ Increase Voltage
• Design implications
+ Use bigger cross-section wires (Area↑ → Resistance↓)
  • Typically means taller, otherwise fewer of them
  – Increases "surface area" and capacitance (Capacitance↑)
+ Use wider transistors (Area↑ → Resistance↓)
  – Increases capacitance (not for you, for upstream transistors)
  – Use selectively


Another Constraint: Power and Energy

• Power (Watt or Joule/Second): short-term (peak, max)
  • Mostly a dissipation (heat) concern
  • Power-density (Watt/cm²): important related metric
    – Thermal cycle: power dissipation↑ → power density↑ → temperature↑ → resistance↑ → power dissipation↑…
  • Cost (and form factor): packaging, heat sink, fan, etc.
• Energy (Joule): long-term
  • Mostly a consumption concern
  • Primary issue is battery life (cost, weight of battery, too)
  • Low-power implies low-energy, but not the other way around
• 10 years ago, nobody cared

Sources of Energy Consumption

[Figure: CMOS inverter annotated with current paths — capacitor charging current into load C_L, short-circuit current, subthreshold leakage current, diode leakage current. From slides © Krste Asanović, MIT]

• Dynamic power:
  • Capacitor Charging (85-90% of active power)
    • Energy is ½CV² per transition
  • Short-Circuit Current (10-15% of active power)
    • When both p and n transistors turn on during signal transition
• Static power:
  • Subthreshold Leakage (dominates when inactive)
    • Transistors don't turn off completely
  • Diode Leakage (negligible)
    • Parasitic source and drain diodes leak to substrate
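A sketch (my addition, with hypothetical chip parameters) of the dominant dynamic-power term:

```python
# Dynamic power from capacitor charging: each transition costs 1/2 * C * V^2,
# so P_dyn ~ a * C * V^2 * f, where the 1/2 is folded into the activity
# factor a (fraction of capacitance switching per cycle). All values below
# are made up for illustration.
def dynamic_power_watts(c_farads, v_volts, f_hz, activity=0.1):
    return activity * c_farads * v_volts ** 2 * f_hz

p = dynamic_power_watts(c_farads=100e-9, v_volts=1.3, f_hz=2e9)
print(f"{p:.0f} W")                                 # ~34 W
print(dynamic_power_watts(100e-9, 1.1, 2e9) / p)    # ~0.72: the quadratic
                                                    # payoff of V 1.3 -> 1.1
```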

Moore's Effect on Power

• Scaling has largely good effects on local power
  + Shorter wires/smaller transistors (Length↓ → Capacitance↓)
  + Shorter transistor length (Resistance↓, Capacitance↓)
  – Global effects largely undone by increased transistor counts
• Scaling has a largely negative effect on power density
  + Transistor/wire power decreases linearly
  – Transistor/wire area decreases quadratically
  – Power-density increases linearly
    • Thermal cycle
• Controlled somewhat by reduced VDD (5→3.3→1.6→1.3→1.1)
  • Reduced VDD sacrifices some switching speed

Reducing Power

• Reduce supply voltage (VDD)
  + Reduces dynamic power quadratically and static power linearly
  • But poses a tough choice regarding VT
    – Constant VT slows circuit speed → clock frequency → performance
    – Reduced VT increases static power exponentially
• Reduce clock frequency (f)
  + Reduces dynamic power linearly
  – Doesn't reduce static power
  – Reduces performance linearly
  • Generally doesn't make sense without also reduced VDD…
    • Except that frequency can be adjusted cycle-to-cycle and locally
    • More on this later


Dynamic Voltage Scaling (DVS)

• OS reduces voltage/frequency when peak performance not needed

            Mobile PentiumIII   Transmeta TM5400   Intel X-Scale
            "SpeedStep"         "LongRun"          (StrongARM2)
Frequency   300–1000MHz         200–700MHz         50–800MHz
            (50MHz steps)       (33MHz steps)      (50MHz steps)
Voltage     0.9–1.7V            1.1–1.6V           0.7–1.65V
            (0.1V steps)        (continuous)       (continuous)
High-speed  3400MIPS @ 34W      1600MIPS @ 2W      800MIPS @ 0.9W
Low-power   1100MIPS @ 4.5W     300MIPS @ 0.25W    62MIPS @ 0.01W

± X-Scale is power efficient (6200 MIPS/W), but not IA32 compatible

Reducing Power: Processor Modes

• Modern electrical components have low-power modes
  • Note: no low-power disk mode, magnetic (non-volatile)
• "Standby" mode
  • Turn off internal clock
  • Leave external signal controller and pins on
  • Restart clock on interrupt
  ± Cuts dynamic power linearly, doesn't affect static power
  • Laptops go into this mode between keystrokes
• "Sleep" mode
  • Flush caches, OS may also flush DRAM to disk
  • Turn off processor power plane
    – Needs a "hard" restart
    + Cuts dynamic and static power
  • Laptops go into this mode after ~10 idle minutes
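A rough sanity check (my addition) of the DVS table using the P ∝ V²f model from the previous slides; real chips deviate because leakage and other terms don't scale this way:

```python
# Estimate low-power-mode dynamic power from the high-speed point: P ~ V^2 * f.
def scaled_power(p_high, v_high, f_high, v_low, f_low):
    return p_high * (v_low / v_high) ** 2 * (f_low / f_high)

# Mobile PentiumIII "SpeedStep" endpoints from the DVS table above.
est = scaled_power(p_high=34, v_high=1.7, f_high=1000, v_low=0.9, f_low=300)
print(f"{est:.1f} W")   # ~2.9 W estimated vs. 4.5 W in the table --
                        # right ballpark; the gap is static power and model error
```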


• Mean Time Between Failures (MTBF) • CMOS devices: CPU and memory • How long before you have to reboot or buy a new one • Historically almost perfectly reliable • Not very quantitative yet, people just starting to think about this • Moore has made them less reliable over time

• Two sources of electrical faults • CPU reliability small in grand scheme • Energetic particle strikes (from sun) • Software most unreliable component in a system • Randomly charge nodes, cause to flip, transient • Much more difficult to specify & test • Electro-migration: change in electrical interfaces/properties • Much more of it • Temperature-driven, happens gradually, permanent • Most unreliable hardware component … disk • Subject to mechanical wear • Large, high-energy transistors are immune to these effects – Scaling makes node energy closer to particle energy – Scaling increases power-density which increases temperature • Memory (DRAM) was hit first: denser, smaller devices than SRAM


Moore's Good Effect on Reliability

• The key to providing reliability is redundancy
  • The same scaling that makes devices less reliable…
  • Also increases device density to enable redundancy
• Classic example
  • Error correcting code (ECC) for DRAM
  • ECC also starting to appear for caches
• More reliability techniques later
• Today's big open questions
  • Can we protect logic?
  • Can architectural techniques help hardware reliability?
  • Can architectural techniques help with software reliability?

Summary: A Global Look at Moore

• Device scaling (Moore's Law)
+ Increases performance
  • Reduces transistor/wire delay
  • Gives us more transistors with which to reduce CPI
+ Reduces local power consumption
  – Which is quickly undone by increased integration
  – Aggravates power-density and temperature problems
– Aggravates reliability problem
  + But gives us the transistors to solve it via redundancy
+ Reduces unit cost
  – But increases startup cost
• Will we fall off Moore's Cliff? (for real, this time?)
• What's next: nanotubes, quantum-dots, optical, spin-tronics, DNA?
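To make the ECC idea concrete, a minimal sketch (my addition; real DRAM ECC uses wider SECDED codes over 64-bit words, but the principle is the same) of a Hamming(7,4) single-error-correcting code:

```python
# Hamming(7,4): 4 data bits protected by 3 parity bits; any single flipped
# bit can be located and corrected -- redundancy masking a transient fault.
def hamming74_encode(d):      # d: four data bits [d1, d2, d3, d4]
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4         # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4         # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4         # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c):     # c: 7-bit codeword, possibly one bit flipped
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 0 = no error, else 1-based bit position
    if syndrome:
        c[syndrome - 1] ^= 1          # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]]   # extract d1..d4

word = [1, 0, 1, 1]
code = hamming74_encode(word)
code[4] ^= 1                          # simulate a particle strike
assert hamming74_correct(code) == word
```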

Summary

• What is computer architecture
  • Abstraction and layering: interface and implementation, ISA
  • Shaping forces: application and semiconductor technology
  • Moore's Law
• Cost
  • Unit and startup
• Performance
  • Latency and throughput
  • CPU performance equation: insn count * CPI * clock period
• Power and energy
  • Dynamic and static power
• Reliability

CIS501

[Figure: layered system stack — Application, OS, Compiler, Firmware, CPU, I/O, Memory, Digital Circuits, Gates & Transistors]

• CIS501: Computer Architecture
  • Mostly about micro-architecture
  • Mostly about CPU/Memory
  • Mostly about general-purpose
  • Mostly about performance
  • We'll still only scratch the surface
• Next time
  • Instruction set architecture
