
University of Wisconsin-Madison CS/ECE 752 Advanced Architecture I

Professor Matt Sinclair

Unit 1: Technology, Cost, Performance, & Power

Slides developed by Milo Martin & Amir Roth at the University of Pennsylvania with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood.

Slides enhanced by Milo Martin, Mark Hill, and David Wood with sources that included Profs. Asanovic, Falsafi, Hoe, Lipasti, Sankaralingam, Shen, Smith, Sohi, Vijaykumar, and Wood

Announcements

• Manually enrolled everyone in the course on Piazza
• Using this for all announcements and questions moving forward
• Poll: did everyone receive the email sent to the class mailing list?
• Poll: who is interested in a lecture from the Condor team?
  • Will post poll on Piazza
• Lecture slides posted – see Course Schedule
• Reminder: HW0 due next Wednesday

This Unit

• What is a computer, and what is computer architecture?

• Forces that shape computer architecture
  • Applications (covered last time)
  • Technology

• Evaluation metrics: parameters and technology basis
  • Cost
  • Performance
  • Power
  • Reliability

Readings

• H&P
  • 1.4 – 1.7 (technology)

• Paper
  • G. Moore, “Cramming More Components onto Integrated Circuits”

What is Computer Architecture? (review)

• Design of interfaces and implementations…
• …under a constantly changing set of external forces…
  • Applications: change from above (discussed last time)
  • Technology: changes characteristics from below
  • Inertia: resists changing all levels of system at once
• …to satisfy different constraints
  • This course: mostly about performance
  • Cost
  • Power
  • Reliability

• Iterative process, driven by empirical evaluation
• The art/science of tradeoffs (“it depends”)

Abstraction and Layering

• Abstraction: the only way of dealing with complex systems
  • Divide world into objects, each with an…
    • Interface: knobs, behaviors, knobs → behaviors
    • Implementation: “black box” (ignorance + apathy)
  • Specialists deal with implementations; everyone else deals only with interfaces
  • Example: car drivers vs. mechanics

• Layering: abstraction discipline makes life even easier
  • Removes need to even know interfaces of most objects
  • Divide objects in system into layers
    • Layer X objects are implemented in terms of interfaces of layer X−1 objects
    • Don’t even need to know interfaces of layer X−2 objects

Abstraction, Layering, & Computers

• Computers are complex systems, built in layers
  • Applications
  • O/S, compiler
  • Firmware, device drivers
  • CPU, memory, raw I/O devices
  • Digital circuits, digital/analog converters
  • Gates
  • Transistors
• 99% of users don’t know the implementation of the hardware layers
• 90% of users don’t know the implementation of any layer
• That’s OK – computers still work just fine
  • But unfortunately, the layers sometimes break down
  • Someone needs to understand what’s “under the hood”

Gray box: peering through the layers

• Layers of abstraction in a car
  • Interface (drivers): steering wheel, clutch, shift, brake
  • Implementation (mechanic): engine, fuel injection, transmission
  • But high-performance drivers know the torque curve
    • Achieve maximum performance: keep RPM in the range where torque is maximized

• Similar examples for computers
  • Cache organization/locality
  • Pipeline scheduling/interlocks
  • Power users peek across layers

A Computer Architecture Picture

[Figure: system stack – Application, OS, Compiler, Firmware (software); Instruction Set Architecture (ISA); CPU, I/O, Memory, Digital Circuits, Gates & Transistors (hardware)]

• Computer architecture
  • Definition of the ISA to facilitate implementation of software layers
• This course is mostly about computer micro-architecture
  • Design of CPU, memory, and I/O to implement the ISA…

Technology Basis of Clock Frequency

Semiconductor Technology Background

• The transistor (1947): a key invention of the 20th century
• Fabrication

[Figure: system stack with Digital Circuits and Gates & Transistors highlighted]

Shaping Force: Technology

• Basic technology element: MOSFET
  • MOS: metal-oxide-semiconductor
    • Conductor, insulator, semi-conductor
  • FET: field-effect transistor
    • Solid-state component that acts like an electrical switch
    • Channel conducts source → drain when voltage is applied to gate

[Figure: MOSFET – gate over channel, between source and drain]

• Channel length: characteristic parameter (short → fast)
  • Aka “feature size” or “technology”
  • Currently: 10–12 nm (0.010–0.012 micron)
  • Continued miniaturization (scaling) known as “Moore’s Law”
    • Won’t last forever; physical limits approaching (or are they?)

Transistor implementations are complex

Simple CMOS Interface

• Voltage as values
  • Power (VDD) = 1, Ground = 0

• Two kinds of transistors
  • N-transistors
    • Conduct when gate voltage is 1
    • Good at passing 0s
  • P-transistors
    • Conduct when gate voltage is 0
    • Good at passing 1s

[Figure: CMOS inverter – p-transistor between Power (1) and the output “node”, n-transistor between the output and Ground (0), input driving both gates]

• CMOS: complementary n-/p-networks form boolean logic

CMOS Examples

• Example I: inverter
  • Case I: input = 0
    • P-transistor closed, n-transistor open
    • Power charges output (1)
  • Case II: input = 1
    • P-transistor open, n-transistor closed
    • Output discharges to ground (0)

• Example II: look at the truth table
  • 0, 0 → 1; 0, 1 → 1
  • 1, 0 → 1; 1, 1 → 0
  • Result: this is a NAND (NOT AND)
  • NAND is universal (can build any logic function)
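To make the complementary-network idea concrete, here is a minimal Python sketch (an illustration, not from the slides) that models a 2-input NAND’s pull-up network (two p-transistors in parallel) and pull-down network (two n-transistors in series) and reproduces the truth table above:

```python
# Toy model of a 2-input CMOS NAND gate (a sketch, not a circuit simulator).
def cmos_nand(a: int, b: int) -> int:
    pull_up = (a == 0) or (b == 0)     # parallel p-transistors conduct on gate = 0
    pull_down = (a == 1) and (b == 1)  # series n-transistors conduct on gate = 1
    assert pull_up != pull_down        # complementary: exactly one network conducts
    return 1 if pull_up else 0         # pull-up connects output to power (1)

for a in (0, 1):
    for b in (0, 1):
        print(f"{a}, {b} -> {cmos_nand(a, b)}")  # 0,0->1  0,1->1  1,0->1  1,1->0
```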

Transistor Speed, Power, and Reliability

• Transistor characteristics and scaling impact
  • Switching speed
  • Power
  • Reliability

• “Undergrad” gate delay model for architecture
  • Each NOT, NAND, NOR, AND, OR gate has a delay of “1”
  • Reality: much more complex
    • Full answer requires 2nd-order differential equations
    • Need something easier to reason about

Simple RC Delay Model

• Switching time is modeled as an RC circuit (charge or discharge)
  • R – Resistance: slows rate of current flow
    • Depends on material, length, cross-section area
  • C – Capacitance: electrical charge storage
    • Depends on material, area, distance
  • Voltage affects speed, too

[Figure: one inverter driving another; as the input switches 0→1, current I discharges the output node 1→0]

Resistance

• Transistor channel resistance
  • Function of Vg (gate voltage)

• Wire resistance
  • Negligible for short wires

Capacitance

• Source/drain capacitance
• Gate capacitance
• Wire capacitance
  • Negligible for short wires

RC Delay Model Ramifications

• Want to reduce resistance
  • Wide drive transistors (width specified per device)
  • Short gate length
  • Short wires

• Want to reduce capacitance
  • Fewer connected devices
  • Less-wide transistors (gate capacitance of the next stage)
  • Short wires
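As a rough illustration of the model (the R and C values below are assumed, order-of-magnitude numbers, not figures from the slides), first-order delay is just the product of resistance and capacitance, which makes the sizing tradeoff easy to see:

```python
# Back-of-the-envelope RC gate delay with assumed, order-of-magnitude values.
R_CHANNEL = 10e3   # ohms: assumed effective on-resistance of a small transistor
C_LOAD    = 1e-15  # farads: assumed gate + wire capacitance of the next stage

tau = R_CHANNEL * C_LOAD                         # one RC time constant
print(f"RC time constant: {tau * 1e12:.1f} ps")  # -> 10.0 ps

# Tradeoff from the slide: a 2x-wide driver halves R, but if the next stage
# is also made 2x wide, its gate capacitance doubles and delay is unchanged.
print(f"Both 2x wide: {(R_CHANNEL / 2) * (C_LOAD * 2) * 1e12:.1f} ps")  # 10.0 ps
```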

Improving RC Delay

• Exploit good effects of scaling
• Fabrication technology improvements
  • Use copper instead of aluminum for wires (ρ↓ → Resistance ↓)


Transistors and Wires (© IBM)
From slides © Krste Asanović, MIT

Transistor Geometry: Width

[Figure: transistor geometry – gate, source, drain on bulk Si; width and length labeled. Diagrams © Krste Asanovic, MIT]

• Transistor width: set by the designer for each transistor
• Wider transistors:
  • Lower channel resistance (increases drive strength) – good!
  • But increased gate/source/drain capacitance – bad!
• Result: set width to balance these conflicting effects

Transistor Geometry: Length & Scaling

[Figure: same transistor geometry diagram. Diagrams © Krste Asanovic, MIT]

• Transistor length: characteristic of the “process generation”
  • “45nm” refers to the transistor gate length, the same for all transistors
• Shrink transistor length:
  • Lower channel resistance (shorter) – good!
  • Lower gate/source/drain capacitance – good!
• Result: switching speed improves linearly as gate length shrinks

Wire Geometry

[Figure: wire geometry – length, width, height, and pitch labeled; IBM CMOS7, 6 layers of copper wiring]

• Transistors are 1-dimensional for design purposes: width
• Wires are 4-dimensional: length, width, height, “pitch”
  • Longer wires have more resistance
  • “Thinner” wires have more resistance
  • Closer wire spacing (“pitch”) increases capacitance

From slides © Krste Asanovic, MIT

Increasing Problem: Wire Delay

• RC delay of wires
  • Resistance proportional to: resistivity × length / (cross-section area)
    • Wires with smaller cross sections have higher resistance
    • Resistivity depends on the type of metal (copper vs. aluminum)
  • Capacitance proportional to length
    • And to wire spacing (closer wires have larger capacitance)
    • And to the permittivity or “dielectric constant” (of the material between wires)

• Result: delay of a wire is quadratic in its length
  • Insert “inverter” repeaters for long wires
    • Why? To bring delay back to linear… but the repeaters themselves add delay
• Trend: wires are getting slow relative to transistors
  • And take relatively longer to cross relatively larger chips
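A small sketch of why this happens (the per-mm R and C and the repeater delay below are arbitrary illustrative units, not process data): total wire R and C both grow with length, so un-repeated delay grows as length squared, while repeaters restore linearity at the cost of a fixed overhead per segment:

```python
# Un-repeated wire delay is quadratic in length; repeaters make it linear.
R_PER_MM = 1.0   # assumed wire resistance per mm (arbitrary units)
C_PER_MM = 1.0   # assumed wire capacitance per mm (arbitrary units)
T_REP    = 0.5   # assumed fixed delay added by each repeater

def unrepeated_delay(length_mm: float) -> float:
    return (R_PER_MM * length_mm) * (C_PER_MM * length_mm)  # ~ length^2

def repeated_delay(length_mm: float, segment_mm: float = 1.0) -> float:
    n_segments = int(length_mm / segment_mm)
    return n_segments * (unrepeated_delay(segment_mm) + T_REP)  # ~ length

for length in (1, 2, 4, 8):
    print(length, unrepeated_delay(length), repeated_delay(length))
# Un-repeated: 1, 4, 16, 64 (quadratic); repeated: 1.5, 3.0, 6.0, 12.0 (linear)
```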

More About CMOS and Technology

• Two different CMOS families
  • SRAM (logic): used to make processors
    • Storage implemented as inverter pairs
    • Optimized for speed
  • DRAM (memory): used to make memory
    • Storage implemented as capacitors
    • Optimized for density, cost, power

• Flash is also a technology, but we’ll discuss it later in this course
• Disk is also a “technology”, but isn’t transistor-based

Fabrication & Cost

Cost

• Metric: $

• In the grand scheme: the CPU accounts for a fraction of total cost
  • Some of that is profit (Intel’s, Dell’s)

                 Desktop      Laptop       Netbook     Phone
    $            $100–$300    $150–$350    $50–$100    $10–$20
    % of total   10–30%       10–20%       20–30%      20–30%
    Other costs: memory, display, power supply/battery, storage, software

• We are concerned with chip cost
  • Unit cost: cost to manufacture individual chips
  • Startup cost: cost to design the chip and build the manufacturing facility

Cost versus Price

• Cost: cost to the manufacturer, cost to produce
• What is the relationship of cost to price?
  • Complicated; it has to do with volume and competition

• Commodity: high-volume, un-differentiated, un-branded
  • “Un-differentiated”: copper is copper, wheat is wheat
  • “Un-branded”: consumers aren’t allied to a manufacturer brand
  • Commodity prices track costs closely
  • Example: DRAM (used for main memory) is a commodity
    • Do you even know who manufactures DRAM?

• Microprocessors are not commodities
  • Specialization, compatibility, different cost/performance/power
  • Complex relationship between price and cost

Manufacturing Steps

Source: P&H

Manufacturing Steps

• Multi-step photo-/electro-chemical process
  • More steps, higher unit cost
  • Plus a fixed cost for mass production ($1 million+ for a “mask set”)

Manufacturing Defects

• Defects can arise
  • Under-/over-doping
  • Over-/under-dissolved insulator
  • Mask mis-alignment
  • Particle contaminants

• Try to minimize defects
  • Process margins
  • Design rules
    • Minimal transistor size, separation

• Or, tolerate defects
  • Redundant or “spare” memory cells
  • Can substantially improve yield

[Figure: example dies labeled correct, defective, and slow]

Unit Cost: Integrated Circuit (IC)

• Chips built in multi-step chemical processes on wafers
  • Cost per wafer is roughly constant: f(wafer size, number of steps)
• Chip (die) cost is related to area
  • Larger chips mean fewer of them per wafer
  • Cost is more than linear in area
    • Why? Random defects: larger chips mean fewer working ones
  • Chip cost ~ (chip area)^α, with α = 2 to 3 (see the sketch below)
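A toy cost model makes the superlinearity visible (all constants below – wafer cost, usable area, defect density – are assumed for illustration, not figures from the slides):

```python
import math

# Toy die-cost model: bigger dies are both fewer per wafer and less likely
# to be defect-free, so cost per working die grows superlinearly in area.
WAFER_COST = 10_000.0    # assumed cost per processed wafer, $
WAFER_AREA = 70_000.0    # assumed usable area of a 300 mm wafer, mm^2
DEFECT_DENSITY = 0.005   # assumed random defects per mm^2

def die_cost(area_mm2: float) -> float:
    dies_per_wafer = WAFER_AREA / area_mm2             # fewer dies per wafer
    yield_frac = math.exp(-DEFECT_DENSITY * area_mm2)  # simple Poisson yield
    return WAFER_COST / (dies_per_wafer * yield_frac)  # $ per *working* die

for area in (100, 200, 400):
    print(f"{area:3d} mm^2 die: ${die_cost(area):7.2f}")
# ~ $23.55, $77.66, $422.23: quadrupling the area raises cost ~18x here,
# consistent with the slide's cost ~ area^alpha for alpha between 2 and 3.
```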

• Wafer yield: % of the wafer that becomes chips
• Die yield: % of chips that work
  • Yield is increasingly non-binary – fast vs. slow chips

Additional Unit Cost

• After manufacturing, there are additional unit costs
  • Testing: how do you know the chip is working?
  • Packaging: high-performance packages are expensive
    • Determined by maximum operating temperature
    • And by the number of external pins (off-chip bandwidth)
  • Burn-in: stress-test the chip (detects unreliable chips early)
  • Re-testing: how do you know packaging/burn-in didn’t damage the chip?

Fixed Costs

• For a new chip design
  • Design & verification: ~$100M (500 person-years @ $200K per)
  • Amortized over “proliferations”, e.g., Core i3, i5, i7 variants

• For a new (smaller) technology generation
  • ~$3B for a new fab
  • Amortized over multiple designs
  • Amortized by “rent” from companies that don’t fab themselves

– Moore’s Law generally increases startup cost
  • More expensive fabrication equipment
  • More complex chips take longer to design and verify

All Roads Lead To Multi-Core

+ Multi-cores reduce unit costs
  • Higher yield than same-area single-cores
    • Why? A defect on one of the cores? Sell the remaining cores for less
  • IBM manufactures the CBE (“Cell processor”) with eight cores
    • But PlayStation 3 software is written for seven cores
    • Yield for eight working cores is too low
  • Sun manufactures Niagaras (UltraSPARC T1) with eight cores
    • Also sells six- and four-core versions (for less)

+ Multi-cores can reduce design costs too
  • Replicate existing designs rather than re-design larger single-cores

Technology Scaling

In-Class Activity

• With a partner, answer the following questions:
  • If you were reading this article in 1965, would you believe in Moore’s Law? Why?
  • What does the publication source say about the paper?
  • Why did Moore’s Law change from every year to every 18–24 months?

• In 5 minutes we’ll come back together and discuss as a class

Moore’s Law: Technology Scaling

[Figure: MOSFET cross-section – gate over channel, between source and drain]

• Moore’s Law: aka “technology scaling”
  • Continued miniaturization (esp. reduction in channel length)
  + Improves switching speed, power/transistor, area (cost)/transistor
  – Reduces transistor reliability
• Literally: DRAM density (transistors/area) doubles every 18 months
• Public interpretation: performance doubles every 18 months
  • Not quite right, but scaling helps performance in three ways

Moore’s Effect #1: Transistor Count

• Linear shrink in each dimension
  • 180nm, 130nm, 90nm, 65nm, 45nm, 32nm, …
  • Each generation is a 1.414x linear shrink
  • Shrinking each dimension (2D) results in 2x more transistors (1.414 × 1.414)

• Reduces cost per transistor

• More transistors can increase performance
  • Job of a computer architect: use the ever-increasing number of transistors
  • Examples: caches, exploiting parallelism at all levels
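The per-generation arithmetic from this slide, checked in a few lines of Python:

```python
# Each process generation shrinks both dimensions by ~1.414 (sqrt(2)),
# which doubles the number of transistors that fit in the same area.
nodes_nm = [180, 130, 90, 65, 45, 32]

for prev, cur in zip(nodes_nm, nodes_nm[1:]):
    linear_shrink = prev / cur          # per-dimension shrink factor
    density_gain = linear_shrink ** 2   # 2D shrink -> transistor density gain
    print(f"{prev}nm -> {cur}nm: {linear_shrink:.2f}x linear, "
          f"{density_gain:.2f}x transistors per area")
# Every step is ~1.4x linear, i.e. ~2x transistors in the same die area.
```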

Moore’s Effect #2: RC Delay

• First-order: speed scales proportionally to gate length
  • Has provided much of the performance gain in the past
• Scaling helps wire delays in some ways…
  + Transistors become shorter (Resistance ↓) and narrower (Capacitance ↓)
  + Wires become shorter (Length ↓ → Resistance ↓)
  + Wire “surface areas” become smaller (Capacitance ↓)
• …and hurts in others
  – Transistors become narrower (Resistance ↑)
  – Gate insulator thickness becomes smaller (Capacitance ↑)
  – Wires become thinner (Resistance ↑)
• What to do?
  • Take the good; use wire/transistor sizing & repeaters to mitigate the bad
  • Exploit new materials: Aluminum → Copper, metal gate, high-K

Moore’s Effect #3: Cost

• Mixed impact on unit integrated circuit cost
  + Either lower cost for the same functionality…
  + Or the same cost for more functionality
  – Difficult to achieve high yields

– Increases startup cost
  • More expensive fabrication equipment
  • Takes longer to design, verify, and test chips

– Process variation across the chip is increasing
  • Some transistors slow, some fast
  • Increasingly active research area: dealing with this problem

Moore’s Effect #4: Psychological

• Moore’s Curve: common interpretation of Moore’s Law
  • “CPU performance doubles every 18 months”
  • Self-fulfilling prophecy: 2X every 18 months is ~1% per week
  • Q: Would you add a feature that improved performance 20% if it would delay the chip 8 months?
• Processors under Moore’s Curve (arrive too late) fail spectacularly
  • E.g., Intel’s Itanium, Sun’s Millennium

Moore’s Law in the Future

• Won’t last forever; approaching physical limits
  • “If something must eventually stop, it can’t go on forever”
  • But betting against it has proved foolish in the past
  • Perhaps it will “slow” rather than stop abruptly

• Transistor count will likely continue to scale
  • “Die stacking” is on the cusp of becoming mainstream
  • Uses the third dimension to increase transistor count

• But transistor performance scaling?
  • Running into physical limits
  • Example: gate oxide is less than 10 silicon atoms thick!
    • Can’t decrease it much further
  • Power is becoming a limiting factor

Summary

Technology Summary

• Technology has a first-order impact on computer architecture
  • Cost (die area)
  • Performance (transistor delay, wire delay)
  • Changing rapidly
• Most significant trends for architects (and thus 752) – the rest of the semester:
  • More and more transistors
    • What to do with them? → integration → parallelism
  • Logic is improving faster than memory & cross-chip wires
    • “Memory wall” → caches, more integration
  • Power and energy
    • Voltage vs. frequency, parallelism, special-purpose hardware
• This unit: a quick overview, just scratching the surface

Performance


• Two definitions
  • Latency (execution time): time to finish a fixed task
  • Throughput (bandwidth): number of tasks in a fixed time
  • Very different: throughput can exploit parallelism, latency can’t
    • Ex: baking bread
  • Often contradictory
  • Choose the definition of performance that matches your goals

• Example: move people from A → B, 10 miles
  • Car: capacity = 5, speed = 60 miles/hour
  • Bus: capacity = 60, speed = 20 miles/hour
  • Latency: car = 10 min, bus = 30 min
  • Throughput: car = 15 PPH (counting the return trip), bus = 60 PPH

Performance Improvement

• Processor A is X times faster than processor B if
  • Latency(P, A) = Latency(P, B) / X
  • Throughput(P, A) = Throughput(P, B) × X
• Processor A is X% faster than processor B if
  • Latency(P, A) = Latency(P, B) / (1 + X/100)
  • Throughput(P, A) = Throughput(P, B) × (1 + X/100)

• Car/bus example
  • Latency? The car is 3 times (and 200%) faster than the bus
  • Throughput? The bus is 4 times (and 300%) faster than the car
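The same definitions as a few lines of Python, applied to the car/bus numbers from the previous slide:

```python
# Speedup helpers for the two metrics: latency (lower is better) and
# throughput (higher is better).
def times_faster(metric_b: float, metric_a: float, lower_is_better: bool) -> float:
    return metric_b / metric_a if lower_is_better else metric_a / metric_b

car_lat, bus_lat = 10.0, 30.0      # latency in minutes
car_tput, bus_tput = 15.0, 60.0    # throughput in people per hour (PPH)

x = times_faster(bus_lat, car_lat, lower_is_better=True)
print(f"Latency: car is {x:.0f}x ({(x - 1) * 100:.0f}%) faster")      # 3x, 200%

x = times_faster(car_tput, bus_tput, lower_is_better=False)
print(f"Throughput: bus is {x:.0f}x ({(x - 1) * 100:.0f}%) faster")   # 4x, 300%
```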

What is ‘P’ in Latency(P, A)?

• Program
  • Latency(A) alone makes no sense; a processor executes some program
  • But which one?
• Actual target workload?
  + Accurate
  – Not portable/repeatable, overly specific, hard to pinpoint problems
• Some representative program(s)?
  + Portable/repeatable, pretty accurate
  – Hard to pinpoint problems, may not be exactly what you run
• Some small kernel benchmarks (micro-benchmarks)?
  + Portable/repeatable, easy to run, easy to pinpoint problems
  – Not representative of the complex behaviors of real programs

SPEC Benchmarks

• SPEC (Standard Performance Evaluation Corporation)
  • http://www.spec.org/
  • Consortium that collects, standardizes, and distributes benchmarks
  • Posts SPECmark results for different processors
    • 1 number that represents performance for the entire suite
  • Benchmark suites for CPU, Java, I/O, Web, Mail, etc.
  • Updated every few years, so companies don’t target benchmarks

• SPEC CPU 2006
  • 12 “integer”: bzip2, gcc, perl, hmmer (genomics), h264, etc.
  • 17 “floating point”: wrf (weather), povray, sphinx3 (speech), etc.
  • Written in C/C++ and Fortran
  • SPEC CPU 2017 recently released

SPECmark 2006

• Reference machine: Sun Ultra Enterprise II (@ 296 MHz)
• Latency SPECmark
  • For each benchmark
    • Take an odd number of samples on both machines
    • Choose the median
    • Take the latency ratio (Sun UltraSPARC / your machine)
  • Take the GMEAN of the ratios over all benchmarks
• Throughput SPECmark
  • Run multiple benchmarks in parallel on a multiple-processor system

• Not-so-recent (latency) leaders
  • SPECint: Intel 3.3 GHz Xeon W5590 (34.2)
  • SPECfp: Intel 3.2 GHz Xeon W3570 (39.3)

Other CPU Benchmarks

• Parallel benchmarks
  • SPLASH2: Stanford Parallel Applications for Shared Memory
  • NAS: another parallel benchmark suite
  • SPECopenMP: parallelized versions of SPECfp (2000)
  • SPECjbb: Java multithreaded database-like workload

• Transaction Processing Council (TPC)
  • TPC-C: on-line transaction processing (OLTP)
  • TPC-H/R: decision support systems (DSS)
  • TPC-W: e-commerce database backend workload
  • Have parallelism (intra-query and inter-query)
  • Heavy I/O and memory components

• Recent years: benchmark suites for GPUs (e.g., Rodinia)

Adding/Averaging Performance Numbers

• You can add latencies, but not throughputs
  • Latency(P1+P2, A) = Latency(P1, A) + Latency(P2, A)
  • Throughput(P1+P2, A) != Throughput(P1, A) + Throughput(P2, A)
  • Example: 1 mile @ 30 miles/hour + 1 mile @ 90 miles/hour
    • The average is not 60 miles/hour
    • 0.033 hours at 30 miles/hour + 0.011 hours at 90 miles/hour
    • The average is only 45 miles/hour! (2 miles / (0.033 + 0.011) hours)
  • Throughput(P1+P2, A) = 1 / [(1/Throughput(P1, A)) + (1/Throughput(P2, A))]

• The same goes for means (averages):
  • Arithmetic: (1/N) × Σ_{P=1..N} Latency(P)
    • For units that are proportional to time (e.g., latency)
  • Harmonic: N / Σ_{P=1..N} (1/Throughput(P))
    • For units that are inversely proportional to time (e.g., throughput)
  • Geometric: (∏_{P=1..N} Speedup(P))^(1/N)
    • For unitless quantities (e.g., speedup ratios)
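The three means from this slide in executable form, using the 30/90 mph example to show why the harmonic mean is the right one for rates:

```python
import math

def arithmetic_mean(xs):   # for time-proportional units (e.g., latency)
    return sum(xs) / len(xs)

def harmonic_mean(xs):     # for inverse-time units (e.g., throughput)
    return len(xs) / sum(1 / x for x in xs)

def geometric_mean(xs):    # for unitless ratios (e.g., speedups)
    return math.prod(xs) ** (1 / len(xs))

speeds = [30.0, 90.0]                   # two equal 1-mile legs
print(arithmetic_mean(speeds))          # 60.0 -- NOT the trip's average speed
print(harmonic_mean(speeds))            # 45.0 -- correct: 2 mi / (1/30 + 1/90) h
print(geometric_mean([1.2, 3.0, 0.8]))  # ~1.42 -- e.g., GMEAN of speedup ratios
```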

CPU Performance

Iron Law of Performance

• Multiple aspects to performance: it helps to isolate them
• Latency(P, A) = seconds/program =
  (instructions/program) × (cycles/instruction) × (seconds/cycle)
  • Instructions/program: dynamic insn count
    • Function of program, compiler, instruction set architecture (ISA)
  • Cycles/instruction: CPI
    • Function of program, compiler, ISA, micro-architecture
  • Seconds/cycle: clock period
    • Function of micro-architecture, technology parameters

• For low latency (better performance), minimize all three!
  – Difficult: they often pull against one another
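The Iron Law as a one-line function, evaluated on assumed example numbers (the 1 billion instructions, CPI of 1.5, and 2 GHz clock below are illustrative, not from the slides):

```python
# seconds/program = (insns/program) * (cycles/insn) * (seconds/cycle)
def latency_seconds(insns: float, cpi: float, clock_hz: float) -> float:
    return insns * cpi / clock_hz

t = latency_seconds(insns=1e9, cpi=1.5, clock_hz=2e9)
print(f"{t:.3f} s")  # 0.750 s: 1e9 insns * 1.5 cycles/insn / 2e9 cycles/s
```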

Pitfalls of Partial Performance Metrics

Danger: Partial Performance Metrics

• (Micro-)architects often ignore dynamic instruction count
  • Typically work with one ISA/one compiler → treat it as fixed
  • Not always accurate for multithreaded workloads!

• The CPU performance equation becomes
  • Latency: seconds/insn = (cycles/insn) × (seconds/cycle)
  • Throughput: insns/second = (insns/cycle) × (cycles/second)

• MIPS (millions of instructions per second)
  • (Instructions/second) × 10^-6
  • Cycles/second: clock frequency (in MHz)
  • Example: CPI = 2, clock = 500 MHz → 0.5 insns/cycle × 500 MHz = 250 MIPS

• Pitfall: MIPS may vary inversely with actual performance
  – Compiler removes insns, program gets faster, MIPS goes down
  – Work per instruction varies (e.g., multiply vs. add, FP vs. integer)

MIPS and MFLOPS (MegaFLOPS)

• Problem: MIPS may vary inversely with performance
  • Some optimizations actually add instructions
  • Work per instruction varies (e.g., FP multiply vs. integer add)
  • ISAs are not equivalent
• MFLOPS: like MIPS, but counts only FP ops, because…
  + FP ops can’t be optimized away
  + FP ops have the longest latencies
  + FP ops are the same across machines
• May have been valid in 1980, but now…
  – Many CPU programs are “integer” – light on FP
  – Loads from memory take much longer than an FP divide
  – Even FP instruction sets are not equivalent
• Upshot: neither MIPS nor MFLOPS is broadly useful

Danger: Partial Performance Metrics II

• (Micro-)architects often ignore dynamic instruction count…
• …but the general public (mostly) also ignores CPI
  • Equates clock frequency with performance!

• Which processor would you buy?
  • Processor A: CPI = 2, clock = 5 GHz
  • Processor B: CPI = 1, clock = 3 GHz
  • Probably A, but B is faster (assuming the same ISA/compiler)

• Classic example
  • 800 MHz Pentium III faster than 1 GHz Pentium 4!
  • More recent example: Core i7 faster clock-for-clock than Core 2
  • Same ISA and compiler!
• Meta-point: the danger of partial performance metrics!
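Checking the A-versus-B question above with the per-instruction latency formula:

```python
# seconds/insn = CPI / clock frequency, so compare CPI/f, not frequency alone.
def seconds_per_insn(cpi: float, clock_hz: float) -> float:
    return cpi / clock_hz

a = seconds_per_insn(cpi=2, clock_hz=5e9)   # Processor A: 0.40 ns per insn
b = seconds_per_insn(cpi=1, clock_hz=3e9)   # Processor B: 0.33 ns per insn
print(f"A: {a * 1e9:.2f} ns/insn, B: {b * 1e9:.2f} ns/insn, "
      f"B is {a / b:.1f}x faster")          # B wins despite the slower clock
```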

Cycles per Instruction (CPI)

• This course is mostly about improving CPI
  • Cycles/instruction for the average instruction
  • IPC = 1/CPI
    • Used more frequently than CPI, but harder to compute with
  • Different instructions have different cycle costs
    • E.g., an integer add typically takes 1 cycle; an FP divide takes > 10
  • Assumes you know something about instruction frequencies

• CPI example
  • A program executes equal numbers of integer, FP, and memory operations
  • Cycles per instruction type: integer = 1, memory = 2, FP = 3
  • What is the CPI? (0.33 × 1) + (0.33 × 2) + (0.33 × 3) = 2
  • Caveat: this sort of calculation ignores dependencies completely
    • Still useful for back-of-the-envelope arguments

Another CPI Example

• Assume a processor with the following instruction frequencies and costs:
  • Integer ALU: 50%, 1 cycle
  • Load: 20%, 5 cycles
  • Store: 10%, 1 cycle
  • Branch: 20%, 2 cycles
• Which change would improve performance more?
  • A: branch prediction to reduce branch cost to 1 cycle?
  • B: a bigger data cache to reduce load cost to 3 cycles?

• Compute the CPI (see the sketch below)
  • Base: 0.5×1 + 0.2×5 + 0.1×1 + 0.2×2 = 2.0
  • A: 0.5×1 + 0.2×5 + 0.1×1 + 0.2×1 = 1.8
  • B: 0.5×1 + 0.2×3 + 0.1×1 + 0.2×2 = 1.6 (winner!)
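The same weighted-CPI comparison as a small helper (instruction mix and cycle counts taken directly from the slide):

```python
# CPI = sum over instruction classes of (frequency * cycles per instruction).
def cpi(mix: dict) -> float:
    return sum(freq * cycles for freq, cycles in mix.values())

base  = {"alu": (0.5, 1), "load": (0.2, 5), "store": (0.1, 1), "branch": (0.2, 2)}
opt_a = {**base, "branch": (0.2, 1)}   # branch prediction: branches cost 1 cycle
opt_b = {**base, "load": (0.2, 3)}     # bigger data cache: loads cost 3 cycles

print(cpi(base), cpi(opt_a), cpi(opt_b))   # 2.0 1.8 1.6 -> option B wins
```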

Performance Rules of Thumb

• Design for actual performance, not peak performance
  • Peak performance: “performance you are guaranteed not to exceed”
  • Greater than “actual”, “average”, or “sustained” performance
    • Why? Cache misses, branch mispredictions, limited ILP, etc.
  • For actual performance X, machine capability must be > X

• It is easier to “buy” bandwidth than latency
  • Which is easier: designing a train that (1) holds twice as much cargo, or (2) goes twice as fast?
  • Use bandwidth to reduce latency

• Build a balanced system
  • Don’t over-optimize 1% to the detriment of the other 99%
  • System performance is often determined by the slowest component

Improving CPI

• This course focuses on improving CPI instead of frequency
  • Historically, clock frequency accounted for 70%+ of performance improvement
    • Achieved via deeper pipelines
  • That has been forced to change
    • Deep pipelining is not power-efficient
    • Physical speed limits are approaching
    • 1 GHz: 1999, 2 GHz: 2001, 3 GHz: 2002, 3.8 GHz: 2004, 5 GHz: 2008
    • Core 2: 1.8 – 3.2 GHz in 2008
  • Techniques we’ll look at:
    • Caching, speculation, multiple issue, out-of-order issue
    • Vectors, more…

• Moore helps because CPI reduction requires transistors
  • The definition of parallelism is “more transistors”
  • But the best example is caches

Moore’s Effect on Performance

• Moore’s Curve: common interpretation of Moore’s Law
  • “CPU performance doubles every 18(-24) months”
  • Self-fulfilling prophecy
    • 2X every 18–24 months is ~1% per week
    • Q: Would you add a feature that improved performance 20% if it took 8 months to design and test?
  • Processors under Moore’s Curve (arrive too late) fail spectacularly
    • E.g., Intel’s Itanium, Sun’s Millennium

Performance Rules of Thumb

• Make the common case fast
  • “Amdahl’s Law” (see the sketch below):
    Speedup_overall = 1 / ((1 − fraction_x) + fraction_x / Speedup_x)
  • Corollary: don’t optimize 5% to the detriment of the other 95%:
    Speedup_overall = 1 / ((1 − 5%) + 5%/∞) = 1.05
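Amdahl’s Law as code, reproducing the 5% corollary above:

```python
# Speedup_overall = 1 / ((1 - fraction_x) + fraction_x / speedup_x)
def amdahl_speedup(fraction_x: float, speedup_x: float) -> float:
    return 1.0 / ((1.0 - fraction_x) + fraction_x / speedup_x)

print(amdahl_speedup(0.05, float("inf")))  # 1.0526...: ~1.05x at best
print(amdahl_speedup(0.95, 2.0))           # 1.9048: the common case pays off
```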

• Build a balanced system
  • Don’t over-engineer capabilities that cannot be utilized
  • Try to be “bound” by the most expensive resource (if not everywhere)

• Design for actual, not peak, performance
  • For actual performance X, machine capability must be > X

Amdahl’s Law Graph

Source: Wikipedia

Little’s Law

• Key relationship between latency and bandwidth:
  • Average number in system = arrival rate × average holding time

• Example: how big a wine cellar should I build?
  • My family drinks (and buys) an average of 4 bottles per week
  • On average, I want to age my wine 5 years
  • # bottles in cellar = 4 bottles/week × 52 weeks/year × 5 years = 1040 bottles (!)

More Little’s Law

• How many outstanding cache misses do we need to support?
  • Want to sustain 5 GB/s of bandwidth
  • 64-byte blocks
  • 100 ns miss latency
• Requests in system = arrival rate × time in system
  = (5 GB/s / 64-byte blocks) × 100 ns ≈ 8 misses
• That’s an AVERAGE
  • Need to support many more if we hope to sustain this bandwidth
  • Rule of thumb: 2X
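Both Little’s Law examples, computed directly:

```python
# Average number in system = arrival rate * average time in system.
def littles_law(arrival_rate: float, holding_time: float) -> float:
    return arrival_rate * holding_time

# Wine cellar: 4 bottles/week arriving, each held for 5 years.
print(littles_law(4 * 52, 5))     # 1040 bottles

# Cache misses: 5 GB/s of 64-byte blocks, each miss outstanding for 100 ns.
misses = littles_law(5e9 / 64, 100e-9)
print(round(misses, 1))           # ~7.8 -> about 8 misses in flight on average
```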

Power & Energy

Another Constraint: Power and Energy

• Power (Watts, or Joules/second): a short-term (peak/max) concern
  • Used to be mostly a dissipation (heat) concern; now a $$$ concern too
  • Power density (Watts/cm²): important related metric
    • Thermal cycle: power dissipation ↑ → power density ↑ → temperature ↑ → resistance ↑ → power dissipation ↑ …
  • Cost (and form factor): packaging, heat sink, fan, etc.
• Energy (Joules, or Watt-seconds):
  • Mostly a consumption concern
  • Primary issue(s): battery life, electric bill, environmental impact
  • Low power implies low energy, but not vice versa

• Two sources:
  • Dynamic power: active switching of transistors
  • Static power: leakage of transistors even while inactive

Moore’s Effect on Power

• Scaling has largely good effects on local power
  + Shorter wires/smaller transistors (Length ↓ → Capacitance ↓)
  + Shorter transistor length (Resistance ↓, Capacitance ↓)
  – Wires closer together (Capacitance ↑)
  – Global effects largely undone by increased transistor counts

• Scaling has a largely negative effect on power density
  + Transistor/wire power decreases linearly
  – Transistor/wire density increases quadratically
  – Power density increases linearly
    • Thermal cycle

• Controlled somewhat by reduced VDD (5 → 3 → 1.6 → 1.3 → 1.1)
  • Reduced VDD sacrifices some switching speed

Reducing Power

• Dynamic power is proportional to C × VDD² × f

• Reduce supply voltage (VDD)
  + Reduces dynamic power quadratically and static power linearly
  • But poses a tough choice regarding VT (the threshold voltage)
    – Constant VT slows circuit speed → clock frequency → performance
    – Reduced VT increases static power exponentially

• Reduce clock frequency (f)
  • Reduces dynamic power linearly
  • Doesn’t reduce static power
  • Reduces performance linearly
  • Generally doesn’t make sense without reducing VDD too…
  • Except that frequency can be adjusted cycle-to-cycle and locally
    • Dynamic voltage scaling (DVS) – more on this later
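The payoff of scaling voltage and frequency together, using the P = C·VDD²·f relation (the capacitance and frequency values below are assumed for illustration, not figures from the slides):

```python
# Dynamic power: P = C * VDD^2 * f.
def dynamic_power(c_farads: float, vdd: float, f_hz: float) -> float:
    return c_farads * vdd ** 2 * f_hz

base = dynamic_power(c_farads=1e-9, vdd=1.1, f_hz=3e9)       # ~3.63 W
scaled = dynamic_power(c_farads=1e-9, vdd=1.1 * 0.8, f_hz=3e9 * 0.8)
print(f"baseline {base:.2f} W, scaled {scaled:.2f} W, "
      f"ratio {scaled / base:.2f}")   # ~0.51: V and f together give a cubic win
```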

Reliability


• Mean Time Between Failures (MTBF)
  • How long before you have to reboot or buy a new one

• CPU reliability is small in the grand scheme
  • Software is the most unreliable component in a system
    • Much more difficult to specify & test
    • Much more of it
  • Most unreliable hardware component: the disk
    • Subject to mechanical wear

Moore’s Bad Effect on Reliability

• Wasn’t a problem until 5–10 years ago…
  • Except for transient errors on chips in orbit (satellites)
• …now it is a problem already, and getting worse all the time
  – Transient faults:
    • Small (low-charge) transistors are more easily flipped
    • Even low-energy particles can now flip a bit
  – Permanent faults:
    • Small transistors and wires deform and break more quickly
    • Higher temperatures accelerate the process
• Progression of transient faults
  • Memory (DRAM) was hit first: denser, smaller devices than SRAM
  • Then on-chip memory (SRAM)
  • Logic is starting to have problems…

Moore’s Good Effect on Reliability

• The key to providing reliability is redundancy
  • The same scaling that makes devices less reliable…
  • …also increases device density, enabling redundancy

• Examples
  • Error correcting codes for memory (DRAM) and caches (SRAM)
  • Core-level redundancy: paired execution, hot spares, etc.
  • Intel’s Core i7 (Nehalem) uses 8-transistor SRAM cells
    • Versus the standard 6-transistor cells

• Big open questions
  • Can we protect logic efficiently (without 2x or 3x overhead)?
  • Can architectural techniques help hardware reliability?
  • Can software techniques help?

Summary

Summary: A Global Look at Moore

• Device scaling (Moore’s Law)
  • Increases performance
    • Reduces transistor/wire delay
    • Gives us more transistors with which to reduce CPI
  • Reduces local power consumption
    • Which is quickly undone by increased integration
  • Aggravates power-density and temperature problems
  • Aggravates reliability problems
    • But gives us the transistors to solve them via redundancy
  • Reduces unit cost
    • But increases startup cost

• (When) will we fall off Moore’s Cliff (for real, this time)?
• What’s next: nanotubes, quantum dots, optical, spintronics, DNA?

Summary

• What is computer architecture?
  • Abstraction and layering: interface and implementation, ISA
• Shaping forces: applications and semiconductor technology
  • Moore’s Law
• Cost
  • Unit and startup
• Performance
  • Latency and throughput
  • CPU performance equation: insn count × CPI × clock period
• Power and energy
  • Dynamic and static power
• Reliability
