Performance evaluation
It’s an everyday process…
CS/COE0447: Computer Organization When you buy food and Assembly Language • Same qqy,yuantity, then you look at cost • Same cost, then you look at quantity • (Assuming same quality!) Chapter 4 When yyyou buy a notebook • You give weights to different features LCD size? CPU speed? Sangyeun Cho MMiain memory si?ize? Battery life? Dept. of Computer Science • How do you quantify these and come up with a single score that will University of Pittsburgh ggyuide your decision?
CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh 2
Performance evaluation Computer performance
Often we face an oppptimization problem… Have you cared about performance seriously when buying a computer? • What were the most important factors when you bought a new PC? When cost is fixed • MiiMaximize performance (d(and ftfeatures )! Defining computer performance may not be that simple… • What do you mean when you say a computer has better performance When performance target is achieved than another? • Minimize cost! We need a common metric • Remember, however, than a single metric will not say everything about a computer Give me the highest-performance computer! Don’t care • We may need multiple metrics! about cost… • Supercomputers are of this class Response time vs. throughput
CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh 3 4 Response time vs. throughput Conventions
Throughput Plane DC to Paris Top Speed Passengers (pmph) Througgphput
Boeing 747 6.5 hours 610 mph 470 286,700 • E.g., # of transactions processed per second
BAD/Sud • Larger is better 3 hours 1350 mph 132 178,200 Concorde
Which has higher performance? If we are primarily concerned with response time, • Time to deliver 1 passenger performance can be defined as 1/T where T is the response • Time to deliver 400 passengers time Time for 1 job called • Smaller T is better • This is called “response time” or “execution time” or “latency” • That is, larger (1/T) is better • Time/job Machine A is N times faster than B Jobs per day called • N = performance(A) / performance(B) = time(B) / time(A) • This is called “throughput” or “bandwidth” • When we present this statement, we want to have N>1 to eliminate • Jobs/time confusion
CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh 5 6
Response time vs. throughput Time in a computer
Time of Concorde vs. Boeing 747 When yypgyou time a program, what do you measure? • Concord is faster • Total time to complete a task • (6.5 hours / 3 hours) or 2.2 times faster • CPU time Throughput of Being 747 vs. Concorde • Overheads include disk accesses,,y memory accesses , I/O activities , OS • 286,700 p·mph / 176,200 p·mph overheads, overheads from other programs, … • 1.6 times higher (faster) • “Real time,” “response time,” “elapsed time,” … “Boeing 747 is 1.6 times (or 60%) faster than Concorde in terms of throughput” We are interested in time spent by CPU on your program “Concorde is 2.2 times (or 120%) faster than Boeing 747 in only – there are multiple programs running at the same time terms of flying time (response time)” • “CPU execution time” or “CPU time” • Often divided into system CPU time (in OS) and user CPU time (in user We will focus primarily on execution time of a single job for program) the remaining discussions
CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh 7 8 Time in a computer Measuring time
In terms of seconds?
CPU time: computers are constructed using digital circuitry synchronized at common “clock” ticks • E.g., Rowers and a captain • Clock ticks determine when events take place Clock is an endless,,p repetitive u p & down pattern Clock cycle time = period of a clock cycle (up and down) • CCT = 1 / clock rate (or clock frequency) • 1ns if 1GHz clock is used • 0.5ns if 2GHz clock is used • 0.25ns if 4GHz clock is used • ___ if 5GHz clock is used
CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh 9 10
Measuring time in terms of clocks Measuring time in terms of clocks
CPU execution time can be represented with # of clocks or # of clock ticks clock ticks • Has to do with how many instructions are executed in the program • E.g., it takes 1,000,000 clock ticks to finish program A • Has to do with how many clock ticks are needed to finish a single instruction on average
If we know the clock cycle time (equivalently clock rate), we # clock ticks = # instructions (also called “instruction can easily calculate the actual time count”)×(clocks per instruction or CPI) • Time = (# of clock titikcks )×(l(cloc k cycle time ) • Time = (# of clock ticks) / (clock rate) Time = (# of clock ticks)×(clock cycle time) Time = IC×CPI×CCT
IC may not change if two machines use the same instruction set architecture
CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh 11 12 What impacts time? Workload
Time = IC×CPI×CCT A set of ppgrograms run on a com puter would form a workload IC – # of instructions • Actual collection of applications • Instruction set architecture (()ISA) • Syypgnthetic programs • Compiler used • Kernel loops CPI (average clock per instruction) • … • Microarchitecture ((g,ppe.g., pipelinin g, cache size , … ) • (Compiler) To compare two computer systems, a user would typically • (ISA) compare the execution time of the workload on the two CCT ((qequivalentl y clock rate ) machines • Technology & circuit optimization • Microarchitecture, esp. pipelining • (ISA)
CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh 13 14
Benchmark suites Synthetic benchmarks
A benchmark suite is a set of applications relevant for performance A syypgnthesized program intended to mimic real pgprograms evaluation • A = B×C+D/E+F; SPEC (standard performance evaluation corporation) • www.spec.org One may write a synthetic benchmark to exercise a specific • CPU benchmarks • Server benchmarks sequence of actions • Graphics benchmarks • A program that maximally exercises the multiplication unit in CPU • … EEMBC (embedded microprocessor benchmark consortium) • Automotive • Consumer • Network • Telecom • Office • …
CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh 15 16 Kernel programs (or loops) Summarizing performance
An imppportant portion of a real pgprogram Computer A Computer B • Quicksort Program 1 (sec) 1 10 • Matrix multiplication Program 2 (sec) 1000 100
Total time (sec) 1001 110 Real programs usually consist of many loops and functions and it may be quite difficult to understand and interpret any Compppguter A is 10 times faster than B for program 1 performance evaluation results if we are only given summary information Computer B is 10 times faster than A for program 2
In the case kernel programs, we immediately understand Alt hough th e ab ove statements are correct i ndi v idua lly, t hey what’s going on in the programs present a confusing picture…
CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh 17 18
Arithmetic mean Normalized time
Assume that we have N ppgrograms Computer A Computer B Ratio Program 1 1 (0.1) 10 (1) 0.1
Arithmetic mean (AM) = (∑ Timei)/N Program 2 1 (10) 0.1 (1) 10
Sum 2 (10.1) 10.1 (2)
Weighted arithmetic mean = ∑ Timei×Wi, ∑ Wi = 1 Normalized time maintains the ratio of the original times However, depending on the reference, the result can be AM is a special case of weighted AM where W = 1/N i different! Summing (e.g., arithmetic mean) the normalized times may not produce the same comparison result
CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh 19 20 Dealing w/ ratios Summarizing performance
Computer A Computer B Ratio
Program 1 1 10 0.1
Program 2 1 (1000) 0.1 (100) 10
GM 1
When performance ratios are given, geometric mean (GM) can summarize the result more consistently
1/N GM = (∏ Ratioi)
• GM(Xi)/GM(Yi) = GM(Xi/Yi)
• Taking either the ratio of GM or GM of the ratios yields the same (From Intel Xeon 5100 brochure) result
CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh 21 22
Case study: SPEC benchmark Amdahl’s law
SPEC CPU2000 benchmark Your oppqyppytimization technique is usually applicable to only a • 12 integer programs limited portion of program execution • 14 floating-point (computation-intensive) programs • E.g., larger cache, improved CPU frequency, improved FSB frequency, …
To obtain a “SPECmark” • Run each program on the target machine Timeimproved = Timeunaffected+Timeaffected/(improvement factor) • Get each program’ s performance ratio by diididividing the pre-provided execution time (of an old SUN workstation) with the execution time Make the common case fast! obtained • Get GM of the ratios from all the programs
CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh 23 24 Amdahl’s law Fallacies and pitfalls
Pitfall 1: Exppgecting the im provement of one as pect of a
Timebefore Timeunaffected Timeaffected computer to increase performance by an amount proportional to the size of the improvement
Time Time improvement factor after unaffected Pitfall 2: Using a subset of the performance equation as a performance metric
If we extend this basic form, we have
• Timeafter=T1/N1+T2/N2+…+TM/NM where Ti is a portion of Timebefore and Ni is the improvement rat io (or spee dup ) app lidlied to the program execution portion
CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh 25 26
Summary Summary
Performance evaluation is an imppgygortant stage of any design Resppyonse time (latency) = time to finish a gjgiven job and engineering process Throughput = # of jobs done in a second We are interested in measuring computer performance where the goal is Time = # of clock cycl es × clklock cycle time • Software improvement # of clock cycles = # of instructions × CPI • Hardware improvement • Or both
We need relevant metrics and a method to summarize measurements Latency vs. throughput
CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh 27 28 Summary
Workloads used in performance evaluation are typically drawn from real applications Synthetic or kernel benchmarks are often useful A benchmark suite is a set of applications desi gned to aid performance evaluation tasks • Easier experiments, fair comparison, consistent reporting, …
Summarizing results • Use arithmetic mean (AM) for measured times • Use gg()eometric mean (GM) for ratios Amdahl’s law • Dictates the actual performance improvement due to a speedup factor applied to a limited portion of a program
CS/CoE0447: Computer Organization and Assembly Language University of Pittsburgh 29