<<

Performance in General CSE 675.02: Introduction to Architecture • "X is n times faster than Y" means: Performance (X) n = –––––––––––––– Performance Performance (Y) of Computer Systems • Speed of Concorde vs. Boeing 747 • of Boeing 747 vs. Concorde Presentation • Cost is also an important parameter in the equation, which is why Concorde was put to pasture! • The Bottom Line:

slides by Gojko Babi Performance and Cost or Cost and Performance? Studying Assignment: Chapter 4 g. babic Presentation C 2

Basic Performance Metrics : Introduction

• Response : the time between the start and the completion • The computer user is interested in response time (or execution of a task (in time units) time) – the time between the start and completion of a given task (program). • Throughput: the total amount of tasks done in a given time period (in number of tasks per unit of time) • The manager of a data processing center is interested in throughput – the total amount of work done in given time. • Example: Car assembly factory: • The computer user wants response time to decrease, while – 4 hours to produce a car (response time), the manager wants throughput increased. – 6 cars per an hour produced (throughput) • Main factors influencing performance of computer system are: In general, there is no relationship between those two metrics, – and memory, – throughput of the car assembly factory may increase to 18 – input/output controllers and peripherals, cars per an hour without changing time to produce one car. – compilers, and – How? – .

g. babic Presentation C 3 g. babic Presentation C 4

CPU Time or CPU Execution Time Analysis of CPU Time

CPU time depends on the program which is executed, CPU time (or CPU Execution time) is the time between the start including: and the end of execution of a given program. This time – the number of instructions executed, accounts for the time CPU is the given program, including operating system routines executed on the program’s – types of instructions executed and their of usage. behalf, and it does not include the time waiting for I/O and are constructed is such way that events in hardware running other programs. are synchronized using a clock. is given in Hz (=1/sec). • CPU time is a true measure of processor/memory performance. A clock rate defines durations of discrete time intervals called clock cycle times or clock cycle periods: • Performance of processor/memory = 1 / CPU_time – clock_cycle_time = 1/clock_rate (in sec) Thus, when we refer to different instruction types (from perfor- mance point of view), we are referring to instructions with different number of clock cycles required (needed) to execute.

g. babic Presentation C 5 g. babic Presentation C 6

1 CPU Time Equation Calculating Components of CPU time • For an existing processor it is easy to obtain the CPU time (i.e. • CPU time = Clock cycles for a program * Clock cycle time the execution time) by measurement, and the clock rate is = Clock cycles for a program / Clock rate known. But, it is difficult to figure out the instruction count or Clock cycles for a program is a total number of clock cycles CPI. needed to execute all instructions of a given program. Newer processors, MIPS processor is such an example, • CPU time = Instruction count * CPI / Clock rate include counters for instructions executed and for clock cycles. Those can be helpful to programmers trying to understand and Instruction count is a number of instructions executed, tune the performance of an application. sometimes referred as the instruction path length. CPI – the average number of clock (for a • Also, different simulation techniques and queuing theory could given execution of a given program) is an important parameter be used to obtain values for components of the execution given as: (CPU) time. CPI = Clock cycles for a program / Instructions count

g. babic Presentation C 7 g. babic Presentation C 8

Analysis of CPU Performance Equation Calculating CPI The table below indicates frequency of all instruction types execu- • CPU time = Instruction count * CPI / Clock rate ted in a “typical” program and, from the reference manual, we are provided with a number of cycles per instruction for each type. • How to improve (i.e. decrease) CPU time:  Instruction Type Frequency Cycles – Clock rate: hardware technology & organization, ALU instruction 50% 4 – CPI: organization, ISA and compiler technology, instruction 30% 5 – Instruction count: ISA & compiler technology. Store instruction 5% 4 Branch instruction 15% 2 Many potential performance improvement techniques primarily improve one component with small or predictable impact on the CPI = 0.5*4 + 0.3*5 + 0.05*4 + 0.15*2 = 4 cycles/instruction other two.

g. babic Presentation C 9 g. babic Presentation C 10

CPU Time: Example 1 CPU Time: Example 1 (continued) • a. Approach 1: Consider an implementation of MIPS ISA with 500 MHz clock and – each ALU instruction takes 3 clock cycles, Clock cycles for a program = (x3 + y2 + z4 + w5) 6 – each branch/jump instruction takes 2 clock cycles, = 910  10 clock cycles – each sw instruction takes 4 clock cycles, CPU_time = Clock cycles for a program / Clock rate – each lw instruction takes 5 clock cycles. = 910  106 / 500  106 = 1.82 sec

Also, consider a program that during its execution executes: • b. Approach 2: – x=200 million ALU instructions CPI = Clock cycles for a program / Instructions count – y=55 million branch/jump instructions CPI = (x3 + y2 + z4 + w5)/ (x + y + z + w) – z=25 million sw instructions = 3.03 clock cycles/ instruction – w=20 million lw instructions CPU time = Instruction count  CPI / Clock rate Find CPU time. Assume sequentially executing CPU. = (x+y+z+w)  3.03 / 500  106 = 300  106  3.03 /500  106 = 1.82 sec g. babic Presentation C 11 g. babic Presentation C 12

2 CPU Time: Example 2 Calculating CPI Consider another implementation of MIPS ISA with 1 GHz clock and CPI = (x3 + y2 + z4 + w5)/ (x + y + z + w) – each ALU instruction takes 4 clock cycles, – each branch/jump instruction takes 3 clock cycles, – each sw instruction takes 5 clock cycles, – each lw instruction takes 6 clock cycles. The calculation may not be necessary correct (and usually it isn’t) Also, consider the same program as in Example 1. since the numbers of cycles per instruction given don’t account Find CPI and CPU time. Assume sequentially executing CPU. for pipeline effects and other advanced design techniques. ------CPI = (x4 + y3 + z5 + w6)/ (x + y + z + w) = 4.03 clock cycles/ instruction CPU time = Instruction count  CPI / Clock rate = (x+y+z+w)  4.03 / 1000  106 = 300  106  4.03 /1000  106 = 1.21 sec g. babic Presentation C 13 g. babic Presentation C 14

Phases in MIPS Instruction Execution Sequential Execution of 3 LW Instructions

• Assumed are the following delays: Memory access = 2 nsec, • We can divide the execution of an instruction into the ALU operation = 2 nsec, access = 1 nsec; following five stages: P r o g r a m e x e c u t i o n 2 4 6 8 1 0 1 2 1 4 1 6 1 8 T i m e – IF: Instruction fetch o r d e r ( i n i n s t r u c t i o n s ) In s tr u ct i on Da t a – ID: Instruction decode and register fetch l w r 1 , 1 0 0 ( r 4 ) R eg A L U R eg fe tc h ac ce ss

– EX: Execution, effective address or branch calculation In s tr u ct i on Da t a l w r 2 , 2 0 0 ( r 5 ) 8 n s R eg A L U R eg fe tc h ac ce s s

– MEM: Memory access (for lw and sw instructions only) In s tr u ct i on l w r 3 , 3 0 0 ( r 6 ) 8 n s fe tc h – WB: Register write back (for ALU and lw instructions) . . . Every lw instruction needs 8 nsec to execute. 8 n s In this course, we shall design a processor that executes instructions sequentially, i.e. as illustrated here.

g. babic Presentation C 15 g. babic Presentation C 16

Pipelining: Its Natural! Sequential Laundry

• Modern processors use advanced hardware design 6 PM 7 8 9 10 11 Midnight techniques, such as pipelining and out of order execution. Time • Dave has four loads of clothes A B C D 30 40 20 30 40 20 30 40 20 30 40 20 to wash, dry, and fold T a A s • Washer takes 30 minutes k B O r d C • Dryer takes 40 minutes e r D Sequential laundry takes 6 hours for 4 loads; • “Folder” takes 20 minutes If Dave learned pipelining, how long would laundry take?

g. babic Presentation C 17 g. babic Presentation C 18

3 Pipelined Laundry Pipeline Executing 3 LW Instructions

• Assuming delays as in the sequential case and pipelined 6 PM 7 8 9 10 11 Midnight processor with a clock cycle time of 2 nsec. P r o g r a m Time e x e c u t io n 2 4 6 8 1 0 1 2 1 4 T i m e o r d e r ( i n i n s t r u c t i o n s ) 30 40 40 40 40 20 In s t r u c t i o n Da ta lw r 1 , 1 0 0 ( r 4 ) Re g A LU Re g fe t c h ac c e s s

A In s tr u c ti o n Da ta T lw r 2 , 2 0 0 ( r 5 ) 2 n s Re g A LU Re g fe t c h ac c e s s a In s t r uc t i o n Da ta s lw r 3 , 3 0 0 ( r 6 ) 2 n s Re g A LU Re g fe t c h ac c es s k B 2 n s 2 n s 2 n s 2 n s 2 n s O C Note that registers are written during the first part of a cycle and r d read during the second part of the same cycle. e D r • Pipelining doesn’t help to execute a single instruction, it may Pipelined laundry takes 3.5 hours for 4 loads; improve performance by increasing instruction throughput; g. babic Presentation C 19 g. babic Presentation C 20

Quantitative Performance Measures Quantitative Performance Measures (continued) • The original performance measure was time to perform an • Another popular, misleading and essentially useless measure individual instruction, e.g. add. was peak MIPS. That is a MIPS obtained using an instruction • Next performance measure was the average instruction time, mix that minimizes the CPI, even if that instruction mix is totally obtained from the relative frequency of instructions in some impractical. Computer manufacturers still occasionally announ- typical instruction mix and times to execute each instruction. Since instruction sets were similar, this was a more accurate ce products using peak MIPS as a metric, often neglecting to comparison. include the work “peak”. • One alternative to execution time as the metric was MIPS – Million . For a given program MIPS • Another popular alternative to execution time was Million rating is simple: Floating Point Operations Per Second – MFLOPS: Number of floating point operations in a program Instruction count Clock rate MFLOPS = –––––––––––––––––––––––––––––––––––––––– MIPS rating = ––––––––––––––6 = –––––––––6 Execution time 106 CPU time * 10 CPI * 10 * The problems with MIPS rating as a performance measure: Because it is based on operations in the program rather than – difficult to compare computers with different instruction sets, on instructions, MFLOPS has a stronger claim than MIPS to – MIPS varies between programs on the same computer, being a fair comparison between different machines. MFLOPS – MIPS can vary inversely with performance! are not applicable outside floating-point performance. g. babic Presentation C 21 g. babic Presentation C 22

Benchmark Suites SPEC Suites

Collections of “representative” programs used to measure the performance of processors. • The SPEC benchmarks are real programs, modified for Benchmarks could be: portability and to minimize the role of I/O in overall benchmark – real programs; performance. Example: Optimizer GNU C compiler. – modified (or scripted) applications; • First in 1989, SPEC89 was introduced with 4 integer programs – kernels – small, key pieces from real programs; and 6 floating point programs, providing a single “SPECmarks”. – synthetic benchmarks – not real programs, but codes try to match the average frequency of operations and operands of • SPEC92 had 5 integer programs and 14 floating point a large set of programs. programs, and provided SPECint92 and SPECfp92. Examples: and benchmarks; • SPEC95 provided SPECint_base95, SPECfp_base95. • SPEC (Standard Performance Evaluation Corporation) was founded in late 1980s to try to improve the state of bench- • SPEC CPU2000 has 12 integer benchmarks and 14 floating marking and make more valid base for comparison of desk point benchmarks, and provides CINT2000 and CFP2000. and server computers. g. babic Presentation C 23 g. babic Presentation C 24

4 Summarizing Performance Summarizing SPEC CPU2000 Performance • The arithmetic mean of the execution times is given as: SPEC CPU2000 summarizes performance using a geometric n mean of ratios, with larger numbers indicating higher 1 – * Timei performance. n i=1 CINT2000 where Time is the execution time for the ith program of a is indicator of integer performance and it is given as: i 12 total of n in the workload (benchmark). 12 CINT2000 = k  1  CPU time basei/CPU timei • The weighted arithmetic mean of execution times is given as: i=1 n Weight Time  i * i where k1 is a coefficient and CPU timei is the CPU time for the i=1 ith integer program of a total of 12 programs in the workload. where Weighti is the frequency of the ith program in the workload. Similarly for floating point performance, CFP2000 is given as: 14 • The geometric mean of execution times is given as: 14 n n CFP2000 = k  n 2  FPExec time basei/FPExec timei where xi = x * x * x * … * x i=1 Timei  1 2 3 n  i=1 i=1 g. babic Presentation C 25 g. babic Presentation C 26

Performance Example (part 1/5) Performance Example (part 2/5)

Note: This example is equivalent to Exercises 4.35, 4.36 and • Machine with No Floating Point Hardware - MNFP does 4.37 in the textbook. not support real number instructions, but all its instructions are identical to non-real number instructions • We are interested in two implementations of two similar of MFP. Each MNFP instruction (including integer but still different ISA, one with and one without special real instructions) takes 2 clock cycles. Thus, MNFP is number instructions. identical to MFP without real number instructions. • Both machines have 1000MHz clock. • Any real number operation (in a program) has to be • Machine With Floating Point Hardware - MFP implements emulated by an appropriate (i.e. real number operations directly with the following compiler has to insert an appropriate sequence of characteristics: integer instructions for each real number operation). The – real number multiply instruction requires 6 clock cycles number of integer instructions needed to implement each – real number add instruction requires 4 clock cycles real number operations is as follows: – real number divide instruction requires 20 clock cycles – real number multiply needs 30 integer instructions Any other instruction (including integer instructions) – real number add needs 20 integer instructions requires 2 clock cycles – real number divide needs 50 integer instructions

g. babic Presentation C 27 g. babic Presentation C 28

Performance Example (part 3/5) Performance Example (part 4/5) Consider Program P with the following mix of operations: – real number multiply 10% b. If Program P on MFP needs 300,000,000 instructions, find – real number add 15% time to execute this program on each machine. – real number divide 5% MFP Number of MNFP Number of – other instructions 70% instructions instructions a. Find MIPS rating for both machines. real mul 30106 900106 real add 45106 900106 CPIMFP = 0.16 + 0.154 + 0.0520 + 0.72 = 3.6 clocks/instr real div 15106 750106 6 6 CPIMNFP = 2 others 21010 21010 6 clock rate 1000*10 Totals 300106 2760106 MIPSMFP rating = ------= ------= 277.777 6 6 6 6 CPI * 10 3.6*10 CPU_timeMFP = 30010  3.6 / 1000  10 = 1.08 sec MIPSMNFP rating = 500 6 6 CPU_timeMNFP = 276010  2 / 1000  10 = 5.52 sec According to MIPS rating, MNFP is better than MFP !? g. babic Presentation C 29 g. babic Presentation C 30

5 Performance Example (part 5/5) c. Calculate MFLOPS for both computers. Number of floating point operations in a program MFLOPS = –––––––––––––––––––––––––––––––––––––––– 6 Execution time * 10

6 6 MFLOPSMFP = 9010 / 1.0810 = 83.3

6 6 MFLOPSMNFP = 9010 / 5.52  10 = 16.3

g. babic Presentation C 31

6