
Welcome to CS2410!

. This is a grad-level introduction to computer architecture
. Let’s take a look at the course info sheet
. Schedule

CS2410: Computer Architecture

Technology, …, performance, and cost issues

Sangyeun Cho

Computer Science Department, University of Pittsburgh

CS2410: Computer Architecture University of Pittsburgh

Computer architecture?

. A discipline that explores:
• Principles and practices to exploit characteristics of hardware & software artifacts relevant for computer systems
• … hardware; Applications, e.g., … design itself; and Operating systems
• Changing interaction between hardware and software

. Goals
• Sustain the historic performance improvement rate (what is performance?) and expand a computer’s capabilities
• Keep the cost down

Computer architecture?

[Figure: layers between “Application pull” and “… push”. Labels: Software layers; Instruction Set Architecture (“Architecture”); Organization (“…”); VLSI Implementation (“Physical hardware”); architect]

Uniprocessor performance

. Performance = 1 / time
. Time = IC × CPI × CCT
. Instructions/program
• Also called “instruction count” (IC above)
• Represents how many (dynamic) instructions are required to finish the program
• Highly depends on “architecture”
. Clocks/instruction
• Also called CPI (Clocks Per Instruction)
• Depends on pipelining and “microarchitecture” implementation
. Time/clocks
• Also called cycle time (inverse of frequency); CCT above
• Highly depends on circuit & VLSI chip realization

Today’s topics

. Technology trends
• “…”
• Impact of CMOS scaling
. Cost
• IC chip cost
. Performance
• Benchmarks
• Summarizing performance measurements
• Quantitative approach to computer design
. Application trends


Uniprocessor performance trend

Uniprocessor performance hurdles

. Maximum power dissipation
• 100W ~ 150W
. Little instruction-level parallelism left
. Little-changing memory latency

. “We are dedicating all of our future product development to multicore designs. … This is a sea change in computing.”
• Paul Otellini, President, Intel (2004)

How does technology scaling help?

. Time = (inst. count) × (clocks per inst.) × (clock cycle time)

. Faster circuit
• Scaling makes transistors not only smaller but also faster
• Faster clock → smaller clock cycle time

. More transistors
• Larger L2 caches (relatively simple design change)
• Smaller CPI

. Design changes enabled by scaling
• Deep pipelining using more pipeline registers
• Superscalar pipeline using more functional units
• Larger, more sophisticated branch predictors
• …
• Multicores

[Figure: Moore’s note (1965)]
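The three-factor time equation shows where each benefit of scaling enters: a faster circuit shrinks the clock cycle time, and more transistors (caches, predictors, wider pipelines) shrink CPI. The following is a minimal Python sketch; the function name `execution_time` and all workload numbers are hypothetical, chosen only to illustrate the arithmetic.

```python
def execution_time(inst_count, cpi, clock_cycle_time_s):
    """Time = (inst. count) x (clocks per inst.) x (clock cycle time)."""
    return inst_count * cpi * clock_cycle_time_s


# Hypothetical baseline: 1e9 dynamic instructions, CPI of 2.0, 1 GHz clock.
baseline = execution_time(1e9, 2.0, 1e-9)
# Faster circuit: same program and CPI, clock cycle time halved (2 GHz).
faster_clock = execution_time(1e9, 2.0, 0.5e-9)
# More transistors spent on caches/predictors: CPI drops to an assumed 1.5.
lower_cpi = execution_time(1e9, 1.5, 1e-9)
# Both improvements combined.
both = execution_time(1e9, 1.5, 0.5e-9)

for label, t in [("baseline", baseline), ("faster clock", faster_clock),
                 ("lower CPI", lower_cpi), ("both", both)]:
    # Performance = 1 / time, so the gain is the ratio of execution times.
    print(f"{label:>12}: {t:.3f} s, gain over baseline {baseline / t:.2f}x")
```

Because the three factors multiply, the gains from a faster clock and a smaller CPI compound rather than add.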


Switches

. Building block for digital logic
• NAND, NOR, NOT, …

. Technology advances have provided designers with switches that are
• Faster;
• Lower power;
• More reliable (e.g., … vs. …); and
• Smaller.

. Nano-scale technologies will not continue promising the same good properties

History of switches

[Figure captions: Called “…”; Mark I (1944). Vacuum tubes; ENIAC (1946, 18k tubes). Bell lab. (1947); Kilby’s first IC (1957). …-MOS devices]

MOS transistors

. Today’s chips heavily depend on CMOS (complementary MOS)-style logic design

MOS transistor scaling


Impact of MOS transistor scaling

. In general
• Smaller transistors (i.e., density doubling with each new generation)
• Faster transistors (latency ∝ L)
• Roughly constant wire delay (→ relatively slow!)
• Lower supply voltage (→ lower dynamic power)

. Downside
• Increased global wire delay
• Increased power density (W/cm²)
• Increased leakage power
• Increased susceptibility to … and transient errors
• On-chip variation
• Cost of …

Global delay

Power density

Productivity


Component-level performance trend

. Four key components in a computer system
• Disks
• Memory
• Network
• Processors
. Compare ~1980 Archaic (or “Nostalgic”) vs. ~2000 Modern (or “Newfangled”)
• (Patterson)
. Metric
• Bandwidth: # operations or events per unit time
• Latency: elapsed time for a single operation or event

Disk: Archaic vs. Modern

                 CDC Wren I, 1983        Seagate 373453, 2003
Rotation speed   3,600 RPM               15,000 RPM (4x)
Capacity         0.03 GB                 73.4 GB (2,500x)
Tracks/inch      800                     64,000 (80x)
Bits/inch        9,550                   533,000 (60x)
Platters         Three 5.25” platters    Four 2.5” platters
Bandwidth        0.6 MB/s                86 MB/s (140x)
Latency          48.3 ms                 5.7 ms (8x)
Cache            none                    8MB

LANs: Archaic vs. Modern

                   Ethernet 802.3          Ethernet 802.3ae
Year of Standard   1978                    2003
Link speed         10 Mbits/s              10,000 Mbits/s (1000X)
Latency            3000 µsec               190 µsec (15X)
Media              Shared media            Switched media
Cabling            Coaxial cable           Category 5 wire

. Coaxial Cable: Covering, Braided outer conductor, Copper core
. Twisted Pair: Copper, 1mm thick, twisted to avoid antenna effect
. "Cat 5" is 4 twisted pairs in bundle

Memory: Archaic vs. Modern

                 1980 DRAM (asynchronous)       2000 Double Data Rate Synchr. (clocked) DRAM
Capacity         0.06 Mbits/chip                256.00 Mbits/chip (4000X)
Transistors      64,000 xtors, 35 mm²           256,000,000 xtors, 204 mm²
Data bus         16-bit data bus per module,    64-bit data bus per DIMM,
                 16 pins/chip                   66 pins/chip (4X)
Bandwidth        13 Mbytes/sec                  1600 Mbytes/sec (120X)
Latency          225 ns                         52 ns (4X)
Block transfer   (no block transfer)            Block transfers (page mode)


CPUs: Archaic vs. Modern

                 1982 …                         2001 Intel …
Clock            12.5 MHz                       1500 MHz (120X)
Peak rate        2 MIPS (peak)                  4500 MIPS (peak) (2250X)
Latency          320 ns                         15 ns (20X)
Transistors      134,000 xtors, 47 mm²          42,000,000 xtors, 217 mm²
Data bus         16-bit data bus, 68 pins       64-bit data bus, 423 pins
Organization     …, separate FPU chip           3-way superscalar, Dynamic translation to RISC,
                                                Superpipelined (22 stage), Out-of-Order execution
Caches           (no caches)                    On-chip 8KB Data caches, 96KB Instr. Trace cache,
                                                256KB L2 cache

Latency lags bandwidth (last ~20 years)

Improvement in latency vs. improvement in bandwidth:

. CPU
• 21x vs. 2250x
. Ethernet
• 16x vs. 1000x
. Memory (“Memory wall”)
• 4x vs. 120x
. Disk
• 8x vs. 143x

Rule of thumb: latency lagging BW

. In the time that bandwidth doubles, latency improves by no more than a factor of 1.2 to 1.4
• (Capacity improves faster than bandwidth)

. In other words, bandwidth improves by more than the square of the improvement in latency

Cost trend

. Time
• Learning curve
• Change in …

. Volume
• Decreases cost, increases efficiency
• “Shrinking” by deploying …-generation technology (without changing the design itself)

. Commoditization
• Standards push this
• Multiple vendors compete
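The latency-vs-bandwidth rule of thumb can be checked against the archaic and modern figures quoted in the four component comparisons. The sketch below is one way to do that in Python; the table `TRENDS` and the helper `improvements` are illustrative names, and the numbers are copied from the slides (peak MIPS stands in for CPU bandwidth, as in the CPU comparison).

```python
import math

# (component, old latency, new latency, old bandwidth, new bandwidth)
# Units only need to match within a row, because only ratios are used.
TRENDS = [
    ("Disk",     48.3,   5.7,   0.6,   86.0),     # ms, MB/s
    ("Memory",   225.0,  52.0,  13.0,  1600.0),   # ns, Mbytes/sec
    ("Ethernet", 3000.0, 190.0, 10.0,  10000.0),  # usec, Mbits/s
    ("CPU",      320.0,  15.0,  2.0,   4500.0),   # ns, MIPS (peak)
]


def improvements(old_lat, new_lat, old_bw, new_bw):
    """Return (latency gain, bandwidth gain, latency gain per bandwidth doubling)."""
    lat_gain = old_lat / new_lat
    bw_gain = new_bw / old_bw
    doublings = math.log2(bw_gain)            # how many times bandwidth doubled
    per_doubling = lat_gain ** (1.0 / doublings)
    return lat_gain, bw_gain, per_doubling


for name, old_lat, new_lat, old_bw, new_bw in TRENDS:
    lat_gain, bw_gain, per_doubling = improvements(old_lat, new_lat, old_bw, new_bw)
    print(f"{name:8s} latency {lat_gain:6.1f}x  bandwidth {bw_gain:7.1f}x  "
          f"latency gain per BW doubling {per_doubling:.2f}  "
          f"BW > latency^2: {bw_gain > lat_gain ** 2}")
```

The per-doubling factor is the latency gain raised to the power 1/log2(bandwidth gain), which is the quantity the rule bounds at roughly 1.2 to 1.4.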


IC (Integrated Circuit) cost

. Cost of IC = (cost of production) / (final test yield)

. Cost of production
• Cost of die
• Cost of testing die
• Cost of packaging and final test

. Cost of production along the timeline
• NRE (Non-Recurring Engineering) cost: R&D, mask
• Chip production: “Front end”; “Back end” – packaging, etc.
• Test cost

. Cost of die = (cost of wafer) / ((dies per wafer) × (die yield))

IC (Integrated Circuit) cost

\[ \text{Dies per wafer} = \frac{\pi \times (\text{wafer diameter}/2)^2}{\text{die area}} - \frac{\pi \times \text{wafer diameter}}{\sqrt{2 \times \text{die area}}} \]

IC (Integrated Circuit) cost

\[ \text{Die yield} = \text{wafer yield} \times \left(1 + \frac{\text{defect density} \times \text{die area}}{\alpha}\right)^{-\alpha} \]

. defect density = # defects in unit area
. defect density × die area will be then average # of defects per die
. α: manufacturing complexity
. 2006 CMOS: α = 4.0

Performance analysis

. Which computer is faster for what you want to do?
• Time matters
• Workload matters

. Throughput (jobs/sec) vs. latency (sec/job)
• Single processor vs. multiprocessor
• Pentium4 @2GHz vs. Pentium4 @4GHz

. Commonly used techniques
• Direct measurement
• Simulation
• Analytical modeling
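The IC cost formulas (dies per wafer, die yield, cost of die) chain together naturally. Below is a minimal Python sketch of that chain; the function names and the example inputs (a 30 cm wafer, a 1.5 cm² die, 0.4 defects per cm², a wafer cost of 5000) are assumptions for illustration, not data from the slides. Only α = 4.0 comes from the slide.

```python
import math


def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    """Wafer area divided by die area, minus the dies lost along the round edge."""
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
    return wafer_area / die_area_cm2 - edge_loss


def die_yield(defect_density_per_cm2, die_area_cm2, wafer_yield=1.0, alpha=4.0):
    """wafer yield x (1 + defect density x die area / alpha) ** (-alpha)."""
    defects_per_die = defect_density_per_cm2 * die_area_cm2
    return wafer_yield * (1 + defects_per_die / alpha) ** (-alpha)


def cost_of_die(wafer_cost, wafer_diameter_cm, die_area_cm2, defect_density_per_cm2):
    """(cost of wafer) / ((dies per wafer) x (die yield))."""
    good_dies = (dies_per_wafer(wafer_diameter_cm, die_area_cm2)
                 * die_yield(defect_density_per_cm2, die_area_cm2))
    return wafer_cost / good_dies


# Hypothetical inputs; only the relative trend matters here.
for area in (1.0, 1.5, 2.25):
    print(f"die area {area:4.2f} cm^2: "
          f"{dies_per_wafer(30, area):6.1f} dies/wafer, "
          f"yield {die_yield(0.4, area):.3f}, "
          f"cost per die {cost_of_die(5000, 30, area, 0.4):6.2f}")
```

Growing the die hurts twice: fewer dies fit on the wafer, and each die is more likely to contain a defect, so the cost per good die rises faster than the area.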


Performance analysis

. Combination of
• Measurement
• Interpretation

. Overall performance vs. specific aspects
• Choice of metric

. Considerations in performance analysis
• Perturbation
• Accuracy
• Reproducibility
• …

Performance report

. Reproducibility
• Provide all necessary details so that others can reproduce the same result
• … configuration, compiler flags, …

. Single number is attractive, but
• It does not show how a new feature affects different programs
• It may in fact mislead; a technique good for a program may be bad for others

Performance analysis techniques

. Direct measurement
• Can provide the best result – no simplifying assumptions
• Not flexible (difficult to change parameters)
• Prone to perturbation (if instrumented)
• Made much easier these days by using performance counters

. Simulation
• Very flexible
• Time consuming
• Difficult to model details and validate

. Analytical modeling
• Quick insight for overall behaviors
• Limited applicability
• Used to confine simulation …, validate …, etc.

Performance metrics

. (Preferably) single number that essentially extracts a desired characteristic
• Cache hit rate
• AMAT (Average Memory Access Time)
• IPC (Instructions Per Cycle)
• Time (or delay)
• Energy-delay product
• …


Comparing two …

. “X is n times faster than Y”

\[ \frac{\text{Execution time}_Y}{\text{Execution time}_X} = n \]

. Two different …
. Two different options (e.g., memory sizes) on a machine
. …

\[ n = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = \frac{\text{Performance}_X}{\text{Performance}_Y} \]

Benchmarks

. Real programs
. Benchmark suites: a set of real applications
• SPEC CPU 2006 (desktop and servers)
• EEMBC, SPECjvm (embedded)
• TPC-…, TPC-H, SPECjbb, ECperf (servers)
• …
. Kernels: important pieces of code from real applications
• Livermore loops, …
. Toy programs: small programs that we easily understand
• Quicksort
• Sieve of Eratosthenes, …
. Synthetic program: to mimic a program behavior “uniformly”
• Dhrystone
• Whetstone, …
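The “X is n times faster than Y” definition is a ratio of execution times, or equivalently of performances taken in the opposite order. A small Python sketch, assuming a hypothetical helper `times_faster` and made-up execution times:

```python
def times_faster(exec_time_x, exec_time_y):
    """Return n such that X is n times faster than Y: n = time_Y / time_X."""
    return exec_time_y / exec_time_x


# Hypothetical measurements in seconds for the same program on two machines.
time_x, time_y = 10.0, 15.0
n = times_faster(time_x, time_y)

# Performance = 1 / time, so the same n falls out of Performance_X / Performance_Y.
perf_x, perf_y = 1.0 / time_x, 1.0 / time_y
print(f"X is {n:.2f} times faster than Y (performance ratio {perf_x / perf_y:.2f})")
```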

SPEC CPU2006

. 12 integer programs
• 9 use C
• 3 use C++
. 17 floating-point programs
• 3 use C
• 4 use C++
• 6 use Fortran
• 4 use a mixture of C and Fortran
. Package available at /afs/cs.pitt.edu/projects/spec-cpu2006

Summarizing performance results

. Arithmetic mean
• When dealing with times
. Weighted arithmetic mean
. Geometric mean
• When dealing with ratios
• SPEC CPU uses this method

\[ \text{Geometric mean} = \sqrt[n]{\prod_{i=1}^{n} \text{sample}_i} \]

. In the case of SPEC, sample_i is the SPECRatio for program i


SPEC2k scoring method

. Get execution time of each benchmark

. Get a ratio for each benchmark by dividing the reference machine’s time by the measured time
• Sun Ultra 5_10, 300MHz SPARC, 256MB memory
• Its score is 100

. Get a geometric mean of all the computed ratios

Amdahl’s law

. Optimization or parallelization usually applies to a portion
• Places a “limitation” on the scope of an optimization
• Leads us to focus on “common cases”
• “Make common case fast and rare case accurate”

[Figure: Time_before divided into Time_unaffected and Time_affected; Time_after keeps Time_unaffected, and only the affected part shrinks]
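Amdahl’s law follows directly from splitting execution time into an unaffected part and an affected part. The sketch below expresses it in Python; the function names and the example fractions are assumptions chosen for illustration.

```python
def time_after(time_unaffected, time_affected, speedup_of_affected):
    """Only the affected portion shrinks; the unaffected portion stays as it is."""
    return time_unaffected + time_affected / speedup_of_affected


def overall_speedup(fraction_affected, speedup_of_affected):
    """Time_before / Time_after, with Time_before normalized to 1."""
    return 1.0 / time_after(1.0 - fraction_affected,
                            fraction_affected,
                            speedup_of_affected)


# Hypothetical cases: the optimization covers 50% or 90% of the original time.
for fraction in (0.5, 0.9):
    for speedup in (2, 10, 1_000_000):
        print(f"affected fraction {fraction:.0%}, local speedup {speedup:>9,}x "
              f"-> overall {overall_speedup(fraction, speedup):.2f}x")
```

Even an effectively unbounded local speedup cannot push the overall speedup past 1 / (1 − affected fraction), which is why the common case deserves the attention.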

Principle of locality

. Locality found in memory access instructions
• Temporal locality: if an item is referenced, it will tend to be referenced again soon
• Spatial locality: if an item is referenced, items whose addresses are close by tend to be referenced soon
• …

. 90/10 locality rule
• A program executes about 90% of its instructions in 10% of its code

. We will look at how this principle is exploited in various microarchitecture techniques

Performance vs. performance-price


Killer apps?

. … applications
. Games
• 3D graphics
• Physics simulation
. RMS (Recognition, Mining, and Synthesis)
• Speech recognition
• Video mining
• Voice synthesis
• …
. (Cf.) Software defined radio and other mobile applications

Software defined radio

[Figure: wireless standards plotted by degree of mobility (Stationary, Walking, Driving) against user data rate (0.1, 1, 10, 100 Mbps). Labels: GSM, GPRS, EDGE, UMTS, HSxPA, CDMA, EV-DO, EV-DV, 3GPP-LTE, 3G Evolution, Beyond 3G (>2010), DECT, FlashOFDM (802.20), IEEE 802.16e, IEEE 802.16a,d, WLAN (IEEE 802.11b), WLAN (IEEE 802.11a/g/n)]

Multimedia performance needs

(K. Uchiyama, ACSAC ‘07)
