<<

Performance evaluation

ƒ It’s an everyday process…

CS/COE0447: Organization ƒ When you buy food and Assembly Language • Same qqy,yuantity, then you look at cost • Same cost, then you look at quantity • (Assuming same quality!) Chapter 4 ƒ When yyyou buy a notebook • You give weights to different features LCD size? CPU speed? Sangyeun Cho MMiain memory si?ize? Battery life? Dept. of Computer Science • How do you quantify these and come up with a single score that will University of Pittsburgh ggyuide your decision?

CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh 2

Performance evaluation Computer performance

ƒ Often we face an oppptimization problem… ƒ Have you cared about performance seriously when buying a computer? • What were the most important factors when you bought a new PC? ƒ When cost is fixed • MiiMaximize performance (d(and ftfeatures )! ƒ Defining computer performance may not be that simple… • What do you mean when you say a computer has better performance ƒ When performance target is achieved than another? • Minimize cost! ƒ We need a common metric • Remember, however, than a single metric will not say everything about a computer ƒ Give me the highest-performance computer! Don’t care • We may need multiple metrics! about cost… • Supercomputers are of this class ƒ Response time vs.

CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh 3 4 Response time vs. throughput Conventions

Throughput Plane DC to Paris Top Speed Passengers (pmph) ƒ Througgphput

Boeing 747 6.5 hours 610 mph 470 286,700 • E.g., # of transactions processed per second

BAD/Sud • Larger is better 3 hours 1350 mph 132 178,200 Concorde

ƒ Which has higher performance? ƒ If we are primarily concerned with response time, • Time to deliver 1 passenger performance can be defined as 1/T where T is the response • Time to deliver 400 passengers time ƒ Time for 1 job called • Smaller T is better • This is called “response time” or “execution time” or “” • That is, larger (1/T) is better • Time/job ƒ Machine A is N times faster than B ƒ Jobs per day called • N = performance(A) / performance(B) = time(B) / time(A) • This is called “throughput” or “” • When we present this statement, we want to have N>1 to eliminate • Jobs/time confusion

CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh 5 6

Response time vs. throughput Time in a computer

ƒ Time of Concorde vs. Boeing 747 ƒ When yypgyou time a program, what do you measure? • Concord is faster • Total time to complete a task • (6.5 hours / 3 hours) or 2.2 times faster • CPU time ƒ Throughput of Being 747 vs. Concorde • Overheads include disk accesses,,y memory accesses , I/O activities , OS • 286,700 p·mph / 176,200 p·mph overheads, overheads from other programs, … • 1.6 times higher (faster) • “Real time,” “response time,” “elapsed time,” … ƒ “Boeing 747 is 1.6 times (or 60%) faster than Concorde in terms of throughput” ƒ We are interested in time spent by CPU on your program ƒ “Concorde is 2.2 times (or 120%) faster than Boeing 747 in only – there are multiple programs running at the same time terms of flying time (response time)” • “CPU execution time” or “CPU time” • Often divided into system CPU time (in OS) and user CPU time (in user ƒ We will focus primarily on execution time of a single job for program) the remaining discussions

CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh 7 8 Time in a computer Measuring time

ƒ In terms of seconds?

ƒ CPU time: are constructed using digital circuitry synchronized at common “clock” ticks • E.g., Rowers and a captain • Clock ticks determine when events take place ƒ Clock is an endless,,p repetitive u p & down pattern ƒ Clock cycle time = period of a clock cycle (up and down) • CCT = 1 / (or clock frequency) • 1ns if 1GHz clock is used • 0.5ns if 2GHz clock is used • 0.25ns if 4GHz clock is used • ___ if 5GHz clock is used

CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh 9 10

Measuring time in terms of clocks Measuring time in terms of clocks

ƒ CPU execution time can be represented with # of clocks or ƒ # of clock ticks clock ticks • Has to do with how many instructions are executed in the program • E.g., it takes 1,000,000 clock ticks to finish program A • Has to do with how many clock ticks are needed to finish a single instruction on average

ƒ If we know the clock cycle time (equivalently clock rate), we ƒ # clock ticks = # instructions (also called “instruction can easily calculate the actual time count”)×(clocks per instruction or CPI) • Time = (# of clock titikcks )×(l(cloc k cycle time ) • Time = (# of clock ticks) / (clock rate) ƒ Time = (# of clock ticks)×(clock cycle time) ƒ Time = IC×CPI×CCT

ƒ IC may not change if two machines use the same instruction set architecture

CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh 11 12 What impacts time? Workload

ƒ Time = IC×CPI×CCT ƒ A set of ppgrograms run on a com puter would form a workload ƒ IC – # of instructions • Actual collection of applications • Instruction set architecture (()ISA) • Syypgnthetic programs • Compiler used • Kernel loops ƒ CPI (average clock per instruction) • … • Microarchitecture ((g,ppe.g., pipelinin g, cache size , … ) • (Compiler) ƒ To compare two computer systems, a user would typically • (ISA) compare the execution time of the workload on the two ƒ CCT ((qequivalentl y clock rate ) machines • Technology & circuit optimization • Microarchitecture, esp. pipelining • (ISA)

CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh 13 14

Benchmark suites Synthetic benchmarks

ƒ A suite is a set of applications relevant for performance ƒ A syypgnthesized program intended to mimic real pgprograms evaluation • A = B×C+D/E+F; ƒ SPEC (standard performance evaluation corporation) • www.spec.org ƒ One may write a synthetic benchmark to exercise a specific • CPU benchmarks • Server benchmarks sequence of actions • Graphics benchmarks • A program that maximally exercises the multiplication unit in CPU • … ƒ EEMBC (embedded microprocessor benchmark consortium) • Automotive • Consumer • Network • Telecom • Office • …

CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh 15 16 Kernel programs (or loops) Summarizing performance

ƒ An imppportant portion of a real pgprogram Computer A Computer B • Quicksort Program 1 (sec) 1 10 • Matrix multiplication Program 2 (sec) 1000 100

Total time (sec) 1001 110 ƒ Real programs usually consist of many loops and functions and it may be quite difficult to understand and interpret any ƒ Compppguter A is 10 times faster than B for program 1 performance evaluation results if we are only given summary ƒ Computer B is 10 times faster than A for program 2

ƒ In the case kernel programs, we immediately understand ƒ Alt hough th e ab ove statements are correct i ndi v idua lly, t hey what’s going on in the programs present a confusing picture…

CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh 17 18

Arithmetic mean Normalized time

ƒ Assume that we have N ppgrograms Computer A Computer B Ratio Program 1 1 (0.1) 10 (1) 0.1

ƒ Arithmetic mean (AM) = (∑ Timei)/N Program 2 1 (10) 0.1 (1) 10

Sum 2 (10.1) 10.1 (2)

ƒ Weighted arithmetic mean = ∑ Timei×Wi, ∑ Wi = 1 ƒ Normalized time maintains the ratio of the original times ƒ However, depending on the reference, the result can be ƒ AM is a special case of weighted AM where W = 1/N i different! ƒ Summing (e.g., arithmetic mean) the normalized times may not produce the same comparison result

CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh 19 20 Dealing w/ ratios Summarizing performance

Computer A Computer B Ratio

Program 1 1 10 0.1

Program 2 1 (1000) 0.1 (100) 10

GM 1

ƒ When performance ratios are given, geometric mean (GM) can summarize the result more consistently

1/N ƒ GM = (∏ Ratioi)

• GM(Xi)/GM(Yi) = GM(Xi/Yi)

• Taking either the ratio of GM or GM of the ratios yields the same (From Xeon 5100 brochure) result

CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh 21 22

Case study: SPEC benchmark Amdahl’s law

ƒ SPEC CPU2000 benchmark ƒ Your oppqyppytimization technique is usually applicable to only a • 12 integer programs limited portion of program execution • 14 floating-point (computation-intensive) programs • E.g., larger cache, improved CPU frequency, improved FSB frequency, …

ƒ To obtain a “SPECmark” • Run each program on the target machine ƒ Timeimproved = Timeunaffected+Timeaffected/(improvement factor) • Get each program’ s performance ratio by diididividing the pre-provided execution time (of an old SUN workstation) with the execution time ƒ Make the common case fast! obtained • Get GM of the ratios from all the programs

CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh 23 24 Amdahl’s law Fallacies and pitfalls

ƒ Pitfall 1: Exppgecting the im provement of one as pect of a

Timebefore Timeunaffected Timeaffected computer to increase performance by an amount proportional to the size of the improvement

Time Time improvement factor after unaffected ƒ Pitfall 2: Using a subset of the performance equation as a performance metric

ƒ If we extend this basic form, we have

• Timeafter=T1/N1+T2/N2+…+TM/NM where Ti is a portion of Timebefore and Ni is the improvement rat io (or spee dup ) app lidlied to the program execution portion

CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh 25 26

Summary Summary

ƒ Performance evaluation is an imppgygortant stage of any design ƒ Resppyonse time (latency) = time to finish a gjgiven job and engineering process ƒ Throughput = # of jobs done in a second ƒ We are interested in measuring computer performance where the goal is ƒ Time = # of clock cycl es × clklock cycle time • improvement ƒ # of clock cycles = # of instructions × CPI • Hardware improvement • Or both

ƒ We need relevant metrics and a method to summarize measurements ƒ Latency vs. throughput

CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh 27 28 Summary

ƒ Workloads used in performance evaluation are typically drawn from real applications ƒ Synthetic or kernel benchmarks are often useful ƒ A benchmark suite is a set of applications desi gned to aid performance evaluation tasks • Easier experiments, fair comparison, consistent reporting, …

ƒ Summarizing results • Use arithmetic mean (AM) for measured times • Use gg()eometric mean (GM) for ratios ƒ Amdahl’s law • Dictates the actual performance improvement due to a factor applied to a limited portion of a program

CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh 29