Administrative

Lecture 1: Course Introduction, Technology Trends, Performance

Professor Alvin R. Lebeck
Compsci 220 / ECE 252, Fall 2004

• Office Hours
  – Office: D308 LSRC
  – Hours: Mon 3:00-4:00, Thurs 1:00-2:00, or by appointment (email)
  – email: [email protected], Phone: 660-6551
• Teaching Assistant: Shobana Ravi
  – Office: D330
  – Hours: TBD
  – email: [email protected], Phone: 660-6589

Slides based on those of: Sorin, Roth, Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti, Katz

© 2004 Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti, Katz CompSci 220 / ECE 252 2

Administrative (Grading)

• 30% Homeworks
  – 4 to 6 homeworks
  – Late < 1 day = 50%; late > 1 day = zero
• 40% Examinations (Midterm 15% + Final 25%)
  – Midterm exam: in class (75 min), closed book
  – Final exam: 3 hours, closed book
• 30% Research Project (work in groups of 2 or 3)
  – No late term projects
• CS Graduate Students: this is a "Quals" course
  – Quals pass based on the midterm and final exams only

Administrative (Continued)

• Academic misconduct
  – University policy will be followed strictly
  – Zero tolerance for cheating and/or plagiarism
• This course requires hard work.

Administrative (Continued)

• Course Web Page: http://www.cs.duke.edu/courses/cps220/fall04
  – Lectures posted there shortly before class (pdf)
  – Homework posted there
  – General information about the course
• Course News Group: duke.cs.cps220
  – Use it to:
    1. read announcements/comments on class or homework,
    2. ask questions (help),
    3. communicate with each other.

SPIDER: Systems Seminar

• Systems & Architecture Seminar
  – Wednesdays 4:00-5:00 in D344
  – duke.cs.os-research (spider newsgroup)
• Presentations on current work
  – Practice talks for conferences
  – Discussion of recent papers
  – Your own research
• Why you should go
  – If you want to work in systems/architecture…
  – Good time to practice public speaking in front of a friendly crowd
  – Learn about current topics


Homework #0

• Need a Duke CS account? Email me ([email protected]) your:
  1. Duke ID
  2. ACPUB account name
• Read Chapters 1 & 2

What is This Course All About?

• State-of-the-art computer hardware design
• Topics
  – Uniprocessor architecture (i.e., the CPU)
  – Memory architecture
  – I/O architecture
  – Brief look at multithreading and multiprocessors
• Fundamentals, current systems, and future systems
• Will read from the textbook, classic papers, and brand-new papers

Course Goals and Expectations

• Course goals
  – Understand how current processors work
  – Understand how to evaluate/compare processors
  – Learn how to use a simulator to perform experiments
  – Learn research skills by performing a term project
• Course expectations
  – Will loosely follow the text
  – Major emphasis on cutting-edge issues
  – Students will read a list of research papers
  – Term project

CPS 220 Course Focus

Understanding the design techniques, machine structures, technology factors, and evaluation methods that will determine the form of computers in the 21st century.

(figure: computer architecture, i.e., instruction set design (ISA), organization, and hardware, at the center of interacting concerns: applications, programming languages, compilers, operating systems, interface design, technology, parallelism, power, measurement & evaluation, and history)


Expected Background

• Basic architecture (ECE 152 / CPS 104)
• Basic OS (ECE 153 / CPS 110)
• Other useful and related courses:
  – Digital system design (ECE 251)
  – VLSI systems (ECE 261)
  – Multiprocessor architecture (ECE 259 / CPS 221)
  – Fault tolerant computing (ECE 254 / CPS 225)
  – Computer networks and systems (CPS 114 & 214)
  – Programming languages & compilers (CS 106 & 206)
  – Advanced OS (CPS 210)

Course Components

Reading materials:
• Computer Architecture: A Quantitative Approach by Hennessy and Patterson, 3rd Edition
• Readings in Computer Architecture by Hill, Jouppi, Sohi
• Recent research papers (online)

Computer Architecture Is …

"…the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation."
  – Amdahl, Blaauw, and Brooks, IBM Journal of R&D, April 1964

Computer Architecture Topics

• Input/output and storage: disks, WORM, tape, RAID, emerging technologies
• Memory hierarchy: DRAM, interleaving, bus protocols, L2 cache, L1 cache, coherence, bandwidth, latency, VLSI
• Instruction set architecture: addressing, protection, exception handling
• Pipelining and instruction-level parallelism: pipelining, hazard resolution, superscalar, reordering, prediction, speculation


Architecture and Other Disciplines Levels of Computer Architecture

architecture Application Software – functional appearance to immediate user » opcodes, addressing modes, architected registers Operating Systems, Compilers, Networking implementation (microarchitecture) – logical structure that performs the architecture Computer Architecture » pipelining, functional units, caches, physical registers realization (circuits) Circuits, Wires, Devices, Network Hardware – physical structure that embodies the implementation » gates, cells, transistors, wires

• Architecture interacts with many other fields • Can’t be studied in a vacuum

Role of the Computer Microarchitect

• Architect: defines the hardware/software interface
• Microarchitect: defines the hardware implementation
  – Usually the same person
• Decisions based on
  – applications
  – performance
  – cost
  – reliability
  – power …

Computer Engineering Methodology

(figure: an iterative design cycle driven by technology trends)


Computer Engineering Methodology (Continued)

An iterative cycle:
1. Evaluate existing systems for bottlenecks (using benchmarks)
2. Simulate new designs and organizations (using workloads, guided by technology trends)
3. Implement the next generation system (constrained by implementation complexity)
… and repeat.

Applications -> Requirements -> Designs

• Scientific: weather prediction, molecular modeling
  – Need: large memory, floating-point arithmetic
  – Examples: CRAY-1, T3E, IBM DeepBlue, BlueGene
• Commercial: inventory, payroll, web serving, e-commerce
  – Need: integer arithmetic, high I/O
  – Examples: clusters, SUN SPARCcenter, Enterprise
• Desktop: multimedia, games, entertainment
  – Need: high data bandwidth, graphics
  – Examples: Intel Pentium4, IBM Power4, Motorola PPC 620
• Mobile: laptops
  – Need: low power (battery), good performance
  – Examples: Intel Mobile Pentium III, Transmeta TM5400
• Embedded: cell phones, automobile engines, door knobs
  – Need: low power (battery + heat), low cost
  – Examples: Compaq/Intel StrongARM, X-Scale, Transmeta TM3200


Why Study Computer Architecture?

• Answer #1: requirements are always changing
  – Aren't computers fast enough already? Are they?
  – Fast enough to do everything we will EVER want? (AI, VR, protein sequencing, ????)
  – Is speed the only goal?
    » Power: heat dissipation + battery life
    » Cost
    » Reliability
    » etc.
  – Designs change even if requirements are fixed, but requirements are not fixed
• Answer #2: the technology playing field is always changing
  – Annual technology improvements (approximate)
    » SRAM (logic): density +25%, speed +20%
    » DRAM (memory): density +60%, speed +4%
    » Disk (magnetic): density +25%, speed +4%
    » Fiber: ??
  – Parameters change, and change relative to one another!

Examples of Changing Designs

Example I: caches
• 1970: 10K transistors, DRAM faster than logic -> bad idea
• 1990: 1M transistors, logic faster than DRAM -> good idea
• Will caches ever be a bad idea again?

Example II: out-of-order execution
• 1985: 100K transistors + no precise interrupts -> bad idea
• 1995: 2M transistors + precise interrupts -> good idea
• 2005: 100M transistors + 10 GHz clock -> bad idea?

• Semiconductor technology is an incredible driving force

Moore's Law

"Cramming More Components onto Integrated Circuits"
  – G. E. Moore, Electronics, 1965

• Observation: (DRAM) transistor density doubles annually
  – Became known as "Moore's Law"
  – Wrong: density doubles every 18 months (he had only 4 data points)
• Corollaries
  – Cost / transistor halves annually (18 months)
  – Power per transistor decreases with scaling
  – Speed increases with scaling
  – Reliability increases with scaling (depends how small!)


Moore's Law

• "Performance doubles every 18 months"
  – The common interpretation of Moore's Law, not its original intent
  – Wrong! "Performance" doubles every ~2 years
• Self-fulfilling prophecy (Moore's Curve)
  – 2X every 2 years = ~3% increase per month
  – 3% per month used to judge performance features
  – If a feature adds 9 months to the schedule, it should add at least 30% to performance (1.03^9 = 1.30 -> 30%)
  – …: under Moore's Curve in a big way

Technology Trends: Capacity

(figure: transistors per chip for Intel and Digital processors, 1971-2006, log scale, with a "graduation window" marked; e.g., Pentium Pro: 5.5 million, Sparc Ultra: 5.2 million, PowerPC 620: 6.9 million, Pentium III: 28 million, Pentium 4: 42 million, Alpha 21464: 250 million)

• CMOS improvements:
  – Die size: 2X every 3 years
  – Line width: halves every 7 years

Processor Performance

(figure: performance in SPECmarks, 1987-1995: Sun-4/260, MIPS M/120, MIPS M2000, IBM RS6000/540, HP 9000/750, DEC AXP 3000, IBM Power 2/590, DEC 21064a, Sun UltraSparc; growth of 1.35X/yr early on, then 1.54X/yr)

Alpha SPECint and SPECfp

(figure: Alpha integer and floating-point SPEC performance, 1995-2004)


Chip Area Reachable in One Clock Cycle

(figure: fraction of chip reached in one cycle vs. feature size, 250 nm down to 35 nm, for clock-scaling assumptions f16, f8, and fSIA; the reachable fraction falls steeply as features shrink)

Power Density

(figure: power density in W/cm^2 vs. feature size, 1.5 microns down to 0.1 microns, log scale; processor power density passes that of a hot plate and heads toward that of a laser diode)

Measurement and Evaluation

Architecture is an iterative process:
• Searching the space of possible designs
• At all levels of computer systems
• Design -> Analysis -> Creativity -> Cost / Performance Analysis -> Design …
• Ideas sort into good ideas, bad ideas, and mediocre ideas

• How do I evaluate an idea?
• Question: what is "better", the Boeing 747 or the Concorde?

Measurement Tools

• Performance, cost, die area, power estimation
• Benchmarks, traces, mixes
• Simulation (many levels)
  – ISA, RT, gate, circuit
• Queuing theory
• Rules of thumb
• Fundamental laws


The Bottom Line: Performance (and Cost)

  Plane              DC to Paris   Speed      Passengers   Throughput (pmph)
  Boeing 747         6.5 hours     610 mph    470          286,700
  BAD/Sud Concorde   3 hours       1350 mph   132          178,200

• Time to run the task (ExTime)
  – Execution time, response time, latency
  – e.g., speed (latency) of the Concorde vs. the Boeing 747
• Tasks per day, hour, week, sec, ns … (Performance)
  – Throughput, bandwidth
  – e.g., throughput of the Boeing 747 vs. the Concorde

"X is n times faster than Y" means:

  ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y) = n

Performance Terminology

"X is n% faster than Y" means:

  ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y) = 1 + n/100

  n = 100 × (Performance(X) − Performance(Y)) / Performance(Y)

Example: Y takes 15 seconds to complete a task, X takes 10 seconds. What % faster is X?

  ExTime(Y) / ExTime(X) = 15 / 10 = 1.5
  n = 100 × (1.5 − 1.0) / 1.0
  n = 50%
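The "n% faster" definition can be checked directly in code (a minimal Python sketch; the helper name is mine, not from the slides):

```python
def percent_faster(time_y, time_x):
    # n such that "X is n% faster than Y": n = 100 * (ExTime(Y)/ExTime(X) - 1)
    return 100.0 * (time_y / time_x - 1.0)

# Y takes 15 s, X takes 10 s:
print(percent_faster(15.0, 10.0))  # 50.0
```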


Amdahl's Law

Speedup due to enhancement E:

  Speedup(E) = ExTime(without E) / ExTime(with E) = Performance(with E) / Performance(without E)

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:

  ExTime(E) = ExTime(without E) × ((1 − F) + F/S)
  Speedup(E) = 1 / ((1 − F) + F/S)

In the book's notation:

  ExTime_new = ExTime_old × ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

  Speedup_overall = ExTime_old / ExTime_new = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

Amdahl's Law: Example

• Floating point instructions improved to run 2X, but only 10% of actual instruction execution time is FP:

  ExTime_new = ExTime_old × (0.9 + 0.1/2) = 0.95 × ExTime_old

  Speedup_overall = 1 / 0.95 = 1.053
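The worked example above generalizes to a one-line helper (a Python sketch of the formula; the function name is my choice):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    # Speedup_overall = 1 / ((1 - F) + F / S)
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# FP sped up 2x, but FP is only 10% of execution time:
print(round(amdahl_speedup(0.10, 2.0), 3))  # 1.053
```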


Corollary: Make The Common Case Fast

• The simple case is usually the most frequent and the easiest to optimize!
• Do simple, fast things in hardware and be sure the rest can be handled correctly in software

Occam's Toothbrush

• All instructions require an instruction fetch; only a fraction require a data fetch/store
  – Optimize instruction access over data access
• Programs exhibit locality
  – Spatial locality and temporal locality
• Access to small memories is faster
  – Provide a storage hierarchy such that the most frequent accesses are to the smallest (closest) memories
  – Reg's -> Cache -> Memory -> Disk / Tape

Metrics of Performance

• Application: answers per month
• Programming language, compiler: operations per second
• ISA: millions of instructions per second (MIPS), millions of (FP) operations per second (MFLOP/s)
• Datapath, control, function units: megabytes per second
• Transistors, wires, pins: cycles per second (clock rate)

Aspects of CPU Performance

  CPU time = Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)

• Instruction count is determined by the program, compiler, and ISA
• CPI is determined by the ISA and the organization (datapath, control, function units)
• Clock rate is determined by the organization and technology (transistors, wires, pins)
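The CPU time equation above multiplies out directly (a Python sketch; parameter names are mine):

```python
def cpu_time_s(instructions, cpi, clock_hz):
    # CPU time = (instructions/program) * (cycles/instruction) * (seconds/cycle)
    return instructions * cpi / clock_hz

# e.g., 1e9 instructions at CPI 1.5 on a 1 GHz clock:
print(cpu_time_s(1e9, 1.5, 1e9))  # 1.5
```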


Marketing Metrics

  MIPS = Instruction Count / (Time × 10^6) = Clock Rate / (CPI × 10^6)

• Machines with different instruction sets?
• Programs with different instruction mixes?
  – Dynamic frequency of instructions
• Uncorrelated with performance

  MFLOPS = FP Operations / (Time × 10^6)

• Machine dependent
• Often not where time is spent
• Normalized FP operation counts: add, sub, compare, mult = 1; divide, sqrt = 4; exp, sin, … = 8

Cycles Per Instruction

"Average cycles per instruction":

  CPI = (CPU time × Clock Rate) / Instruction Count = Cycles / Instruction Count

  CPU time = Cycle Time × Σ_i (CPI_i × I_i)

"Instruction frequency":

  CPI = Σ_i (CPI_i × F_i), where F_i = I_i / Instruction Count

Invest resources where time is spent!
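The two MIPS expressions above agree because CPU time = IC × CPI / clock rate; a quick numeric check (a Python sketch, helper names mine):

```python
def mips_from_time(instruction_count, exec_time_s):
    return instruction_count / (exec_time_s * 1e6)

def mips_from_cpi(clock_hz, cpi):
    return clock_hz / (cpi * 1e6)

ic, cpi, clock = 1e9, 2.0, 1e9
t = ic * cpi / clock                                      # 2.0 seconds
print(mips_from_time(ic, t), mips_from_cpi(clock, cpi))   # 500.0 500.0
```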

Organizational Trade-offs

• Application, programming language, compiler -> instruction mix
• ISA, datapath, control, function units -> CPI
• Datapath, control, function units, transistors, wires, pins -> cycle time

Example: Calculating CPI

Base machine (Reg / Reg), typical mix:

  Op      Freq   Cycles   CPI_i   (% Time)
  ALU     50%    1        0.5     (33%)
  Load    20%    2        0.4     (27%)
  Store   10%    2        0.2     (13%)
  Branch  20%    2        0.4     (27%)
                 Total CPI = 1.5
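The CPI in the table above is just the frequency-weighted sum CPI = Σ F_i × CPI_i (a small Python sketch):

```python
# (frequency, cycles) for the base Reg/Reg machine's typical mix
mix = {"ALU": (0.50, 1), "Load": (0.20, 2), "Store": (0.10, 2), "Branch": (0.20, 2)}

cpi = sum(freq * cycles for freq, cycles in mix.values())
print(round(cpi, 2))  # 1.5
```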


Example

Add register / memory operations to a traditional RISC:
  – One source operand in memory
  – One source operand in register
  – Cycle count of 2
  – Branch cycle count increases to 3

What fraction of the loads must be eliminated for this to pay off?

Base machine (Reg / Reg):

  Op      Freq   Cycles
  ALU     50%    1
  Load    20%    2
  Store   10%    2
  Branch  20%    2

Next Time

• Benchmarks
• Performance metrics
• Cost
• Instruction set architectures

TODO
• Read Chapters 1 & 2
• Email me if you need a CS account
• HW #1 will be up Wednesday, due Sep 7

Administrative

Lecture 2: Benchmarks, Performance Metrics, Cost, Instruction Set Architecture

Professor Alvin R. Lebeck
Compsci 220 / ECE 252, Fall 2004

• Read Chapter 2, Wulf, Transmeta
• Homework #1 due September 7
  – SimpleScalar: read some of the documentation first
  – See the web page for details
  – Questions: contact Shobana ([email protected])
• After that, pipelining… Appendix A + papers


Review: Trends

• Technology trends are one driving force in architectural innovation
  – Moore's Law
  – Chip area reachable in one clock
  – Power density

The Danger of Extrapolation

• Dot-com stock value
• Technology trends
  – Power dissipation?
  – Cost of new fabs?
• Alternative technologies?
  – Carbon nanotubes
  – Optical

Review: Performance

  CPU time = Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)

Amdahl's Law:

  ExTime_new = ExTime_old × ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

  Speedup_overall = ExTime_old / ExTime_new = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

"Average cycles per instruction":

  CPI = (CPU time × Clock Rate) / Instruction Count = Cycles / Instruction Count

  CPU time = Cycle Time × Σ_i (CPI_i × I_i)

"Instruction frequency":

  CPI = Σ_i (CPI_i × F_i), where F_i = I_i / Instruction Count

Invest resources where time is spent!


Example Solution

Add register / memory operations:
  – One source operand in memory
  – One source operand in register
  – Cycle count of 2; branch cycle count increases to 3

What fraction of the loads must be eliminated for this to pay off?

  Exec Time = Instr Cnt × CPI × Clock

              Old                   New
  Op       Freq  Cycles  CPI     Freq    Cycles  CPI
  ALU      .50   1       .5      .5 − X  1       .5 − X
  Load     .20   2       .4      .2 − X  2       .4 − 2X
  Store    .10   2       .2      .1      2       .2
  Branch   .20   2       .4      .2      3       .6
  Reg/Mem                        X       2       2X
           1.00          1.5     1 − X           (1.7 − X)/(1 − X)

CPI_new must be normalized to the new instruction frequency.

Example Solution (Continued)

  Instr Cnt_old × CPI_old × Clock_old = Instr Cnt_new × CPI_new × Clock_new
  1.00 × 1.5 = (1 − X) × (1.7 − X)/(1 − X)
  1.5 = 1.7 − X
  X = 0.2

ALL loads must be eliminated for this to be a win!

Actually Measuring Performance

• How are execution time & CPI actually measured?
  – Execution time: time (Unix cmd): wall-clock, CPU, system
  – CPI = CPU time / (clock frequency × # instructions)
  – More useful: a CPI breakdown (compute, memory stall, etc.), so we know what the performance problems are (what to fix)
• Measuring the CPI breakdown
  – Hardware event counters (PentiumPro, Alpha DCPI)
    » Calculate CPI using instruction frequencies / event costs
  – Cycle-level microarchitecture simulator (e.g., SimpleScalar)
    » Measure exactly what you want
    » Model the microarchitecture faithfully (at least the parts of interest)
    » Method of choice for many architects (yours, too!)
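The break-even algebra can be checked numerically with the example's frequencies (a Python sketch; x is the fraction of all instructions that were loads folded into reg/mem ops, so each reg/mem op replaces one ALU op plus one load):

```python
def new_exec_time(x):
    # Cycles per *original* instruction after converting a fraction x of
    # instructions into reg/mem ops (2 cycles each); branches now take 3 cycles.
    # Equals 1.7 - x; the instruction count shrinks to (1 - x), clock unchanged.
    return (0.5 - x) * 1 + (0.2 - x) * 2 + 0.1 * 2 + 0.2 * 3 + x * 2

old = 0.50 * 1 + 0.20 * 2 + 0.10 * 2 + 0.20 * 2   # 1.5 cycles per instruction
# Break-even: 1.7 - x = 1.5  =>  x = 0.2, i.e. ALL loads must be eliminated
print(round(old, 6), round(new_exec_time(0.2), 6))  # 1.5 1.5
```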


Benchmarks and Benchmarking

• "Program" as unit of work
  – Millions of them, many different kinds; which to use?
• Benchmarks: standard programs for measuring/comparing performance
  – Represent programs people care about
  – Repeatable!!
• Benchmarking process
  » Define workload
  » Extract benchmarks from workload
  » Execute benchmarks on candidate machines
  » Project performance on new machine
  » Run workload on new machine and compare
  » Not close enough -> repeat

Benchmarks: Instruction Mixes

• Instruction mix: instruction type frequencies
• Ignores dependences
• OK for a non-pipelined, scalar processor without caches
  – The way all processors used to be
• Example: Gibson Mix, developed in the 1950s at IBM
  – Load/store: 31%, branches: 17%
  – Compare: 4%, shift: 4%, logical: 2%
  – Fixed add/sub: 6%, float add/sub: 7%
  – Float mult: 4%, float div: 2%, fixed mul: 1%, fixed div: <1%
• Qualitatively, these numbers are still useful today!

Benchmarks: Toys, Kernels, Synthetics

Benchmarks: Real Programs

• Toy benchmarks: little programs that no one really runs
  – e.g., Fibonacci, 8 queens
  – Little value; what real programs do these represent?
  – Scary fact: used to prove the value of RISC in the early 80s
• Kernels: important (frequently executed) pieces of real programs
  – e.g., Livermore loops, Linpack (inner product)
  – Good for focusing on individual features, not the big picture
  – Over-emphasize the target feature (for better or worse)
• Synthetic benchmarks: programs made up for benchmarking
  – e.g., Whetstone, Dhrystone
  – Toy kernels++; which programs do these represent?

• Real programs: the only accurate way to characterize performance
  – Requires considerable work (porting)
• Standard Performance Evaluation Corporation (SPEC)
  – http://www.spec.org
  – Collects, standardizes, and distributes benchmark suites
  – Consortium made up of industry leaders
  – SPEC CPU (CPU-intensive benchmarks)
    » SPEC89, SPEC92, SPEC95, SPEC2000
  – Other benchmark suites
    » SPECjvm, SPECmail, SPECweb
• Other benchmark suite examples: TPC-C, TPC-H for databases


SPEC CPU2000

• 12 integer programs (C, C++)
  – gcc (compiler), perl (interpreter), vortex (database)
  – bzip2, gzip (replace compress), crafty (chess, replaces go)
  – eon (rendering), gap (group theoretic enumerations)
  – twolf, vpr (FPGA place and route)
  – parser (grammar checker), mcf (network optimization)
• 14 floating point programs (C, FORTRAN)
  – swim (shallow water model), mgrid (multigrid field solver)
  – applu (partial diffeq's), apsi (air pollution simulation)
  – wupwise (quantum chromodynamics), mesa (OpenGL library)
  – art (neural network image recognition), equake (wave propagation)
  – fma3d (crash simulation), sixtrack (accelerator design)
  – lucas (primality testing), galgel (fluid dynamics), ammp (chemistry)

Benchmarking Pitfalls

• Benchmark properties mismatched with the features studied
  – e.g., using SPEC for large cache studies
• Careless scaling
  – Using only the first few million instructions (initialization phase)
  – Reducing program data size
• Choosing performance from the wrong application space
  – e.g., in a realtime environment, choosing troff
  – Others: SPECweb, TPC-W (amazon.com)
• Using old benchmarks
  – "Benchmark specials": benchmark-specific optimizations
• Benchmarks must be continuously maintained and updated!


Common Benchmarking Mistakes

• Not validating measurements
• Collecting too much data but doing too little analysis
• Only average behavior represented in test workload
• Loading level (other users) controlled inappropriately
• Caching effects ignored
• Buffer sizes not appropriate
• Inaccuracies due to sampling ignored
• Ignoring monitoring overhead
• Not ensuring same initial conditions
• Not measuring transient (cold start) performance
• Using device utilizations for performance comparisons

Reporting Average Performance

• Averages: one of the things architects frequently get wrong
  – Pay attention now and you won't get them wrong on exams
• Important things about averages (i.e., means)
  – Ideally proportional to execution time (the ultimate metric)
    » Arithmetic mean (AM) for times
    » Harmonic mean (HM) for rates (IPCs)
    » Geometric mean (GM) for ratios (speedups)
  – There is no such thing as the average program
  – Use averages only when absolutely necessary


What Does the Mean Mean?

• Arithmetic mean (AM), or weighted arithmetic mean, tracks execution time:
  Σ_{i=1..N} Time_i / N, or Σ_i (W_i × Time_i)
• Harmonic mean (HM), or weighted harmonic mean, of rates (e.g., MFLOPS) tracks execution time:
  N / Σ_{i=1..N} (1/Rate_i), or 1 / Σ_i (W_i / Rate_i)
  – The arithmetic mean cannot be used for rates (e.g., IPC)
  – 30 MPH for 1 mile + 90 MPH for 1 mile != avg 60 MPH
• Geometric mean (GM): average speedups of N programs
  (Π_{i=1..N} speedup_i)^(1/N)
• The geometric mean of ratios is not proportional to total time!

Geometric Mean Weirdness

• What about averaging ratios (speedups)?
  – HM / AM change depending on which machine is the base

              Machine A   Machine B   B/A    A/B
  Program 1       1          10        10    0.1
  Program 2     1000         100       0.1    10

  AM:  (10 + 0.1)/2 = 5.05, and (0.1 + 10)/2 = 5.05: each base makes the other machine look 5.05x slower!
  HM:  2/(1/10 + 1/0.1) = 0.198 either way: the same contradiction, reversed
  GM:  sqrt(10 × 0.1) = sqrt(0.1 × 10) = 1: GM says they are equal

• But if we take total execution time, B is 9.1 times faster
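The weirdness above is easy to reproduce (a Python sketch using the two programs' execution times):

```python
import math

times_a = [1.0, 1000.0]    # machine A's times on programs 1 and 2
times_b = [10.0, 100.0]    # machine B's times on programs 1 and 2
ratios = [a / b for a, b in zip(times_a, times_b)]    # speedup of B per program

am = sum(ratios) / len(ratios)                   # 5.05: "B is 5.05x faster"?
hm = len(ratios) / sum(1.0 / r for r in ratios)  # ~0.198: "B is 5x slower"?
gm = math.prod(ratios) ** (1.0 / len(ratios))    # 1.0: "they're equal"?
total = sum(times_a) / sum(times_b)              # 9.1: by total time, B wins
print(am, hm, gm, total)
```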

Little's Law

• Key relationship between latency and bandwidth:
  Average number in system = arrival rate × mean holding time
• Example: how big a wine cellar should we build?
  – We drink (and buy) an average of 4 bottles per week
  – On average, I want to age my wine 5 years
  – Bottles in cellar = 4 bottles/week × 52 weeks/year × 5 years = 1040 bottles

System Balance

• Each system component produces & consumes data
• Make sure data supply and demand is balanced
• X demand >= X supply ⇒ computation is "X-bound"
  – e.g., memory-bound, CPU-bound, I/O-bound
• Goal: be bound everywhere at once (why?)
• X can be bandwidth or latency
  – X is bandwidth ⇒ buy more bandwidth
  – X is latency ⇒ a much tougher problem
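The wine-cellar example in code (trivial, but it highlights that the rate and holding-time units must agree; a Python sketch with names of my choosing):

```python
def littles_law(arrival_rate, mean_holding_time):
    # N = lambda * T: average number in the system
    return arrival_rate * mean_holding_time

bottles = littles_law(4, 52 * 5)   # 4 bottles/week, held for 260 weeks
print(bottles)  # 1040
```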


Tradeoffs

"Bandwidth problems can be solved with money. Latency problems are harder, because the speed of light is fixed and you can't bribe God." – David Clark (MIT)

• Can convert some latency problems to bandwidth problems
  – Solve those with money
• The famous "bandwidth/latency tradeoff"
• Architecture is the art of making tradeoffs

Bursty Behavior

• Q: to sustain 2 IPC, how many instructions should the processor be able to
  – fetch per cycle?
  – execute per cycle?
  – complete per cycle?
• A: NOT 2 (more than 2)
  – Dependences will cause stalls (under-utilization)
  – If the desired performance is X, peak performance must be > X
• Programs don't always obey "average" behavior
  – Can't design a processor to handle only average behavior

Cost

• Very important to real designs
  – Startup cost
    » One large investment per chip (or family of chips)
    » Increases with time
  – Unit cost
    » Cost to produce individual copies
    » Decreases with time
  – Only loose correlation to price and profit
• Moore's corollary: the price of a high-performance system is constant
  – Performance doubles every 18 months
  – Cost per function (unit cost) halves every 18 months
  – Assumes startup costs are constant (they aren't)

Startup and Unit Cost

• Startup cost: manufacturing
  – Fabrication plant, clean rooms, lithography, etc. (~$3B)
  – Chip testers/debuggers (~$5M apiece, typically ~200)
  – Few companies can play this game (Intel, IBM, Sun)
  – Equipment gets more expensive as devices shrink
• Startup cost: research and development
  – 300-500 person-years, mostly spent in verification
  – Need more people as designs become more complex
• Unit cost: manufacturing
  – Raw materials, chemicals, process time ($2-5K per wafer)
  – Decreased by improved technology & experience


Unit Cost and Die Size

• Unit cost is most strongly influenced by the physical size of the chip (die)
  » Semiconductors are built on silicon wafers (8")
  » Chemical + photolithographic steps create transistor/wire layers
  » Typical number of metal layers (M) today is 6 (α = ~4)
• Cost per wafer is roughly constant: C0 + C1 × α (~$5000)
• Basic cost per chip is proportional to chip area (mm²)
  » Typical: 150-200 mm²; 50 mm² (embedded) to 300 mm² (Itanium)
  » Typical: 300-600 dies per wafer
• Yield (% of working chips) is inversely proportional to area and α
  » Non-zero defect density (manufacturing defects per unit area)
  » P(working chip) = (1 + (defect density × die area)/α)^(−α)
  – Typical defect density: 0.005 per mm²
  – Typical yield: (1 + (0.005 × 200) / 4)^(−4) = 40%
  – Typical cost per chip: $5000 / (500 × 40%) = $25

Unit Cost -> Price

• If a chip costs $25 to manufacture, why does it cost $500 to buy?
  – Integrated circuit: $25
  – Must still be tested, packaged, and tested again
  – Testing (time == $): $5 per working chip
  – Packaging (ceramic + pins): $30
    » More expensive for more pins or if the chip dissipates a lot of heat
    » Packaging yield < 100% (but high)
  – Post-packaging test: another $5
  – Total for packaged chip: ~$65
  – Spread startup cost over volume ($100-200 per chip)
    » Proliferations (i.e., shrinks) are startup free (help profits)
  – Intel needs to make a profit…
  – … and so does Dell
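The yield and cost formulas above plug in directly (a Python sketch; note the exact yield is closer to 41%, which the slide rounds to 40%):

```python
def die_yield(defect_density, die_area_mm2, alpha=4.0):
    # P(working die) = (1 + defect_density * area / alpha) ** -alpha
    return (1.0 + defect_density * die_area_mm2 / alpha) ** -alpha

def cost_per_good_die(wafer_cost, dies_per_wafer, yield_frac):
    return wafer_cost / (dies_per_wafer * yield_frac)

y = die_yield(0.005, 200)                         # ~0.41
print(round(cost_per_good_die(5000, 500, y), 2))  # 24.41
```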

Reading

• H&P Chapter 1

Summary: Performance

• Next: Instruction Sets
