June 03-06, 2002, C-DAC, Pune

Performance Metrics and Scalability Analysis

Day 2, Classroom Lecture 2

PCOPP-2002
Copyright (C) C-DAC, 2002

Lecture Outline

Following topics will be discussed:
• Theoretical concepts on performance and scalability
• Communication overhead: measurement and analysis
• Workload and speed metrics
• Requirements in performance and cost
• Speedup, efficiency, and isoefficiency; the Phase Parallel Model
• Examples

Performance Versus Cost

Questions to be answered
• How should one measure the performance of an application executing on a given computer system?
• What are the factors (parameters) affecting the performance?
• What kind of performance metrics should be used?
• How can one determine the scalability of parallel programs?
• What are the user's requirements in performance and cost?

Performance Versus Cost (Contd..)

How do we measure the performance of a computer system?

Approach: Run the user's application and measure the wall clock (elapsed) time.

Remarks:
• Many people believe that execution time is the only reliable metric to measure computer performance.
• Pitfalls of using execution time as the performance metric: this approach is sometimes difficult to apply, and it could permit misleading interpretations.
• Execution time alone does not give the user much clue to the true performance of the parallel machine.

Performance Requirements

Types of performance requirement

Six types of performance requirements are posed by users:
• Execution time and throughput
• Processing speed
• System throughput
• Utilization
• Cost effectiveness
• Performance/cost ratio

Remark: These requirements could lead to quite different conclusions for the same application on the same computer platform.

Performance Requirements (Contd..)

Processing Speed
• Execution time is critical to some applications.
  Example: In a real-time application, the user cares about whether the job is guaranteed to finish within a time limit.
• For many applications, the users may be interested in achieving a certain processing speed rather than an execution time limit.
  Example: Radar signal processing application.

Performance Requirements (Contd..)

System Throughput
• Throughput is defined to be the number of jobs processed in a unit time.
  Example: STAP Benchmarks, TPC Benchmarks.

Remark: Execution time is not the only performance requirement; other performance requirements are also needed.

Performance Requirements (Contd..)

Utilization and Cost Effectiveness
• Instead of searching for the shortest execution time, the user may want to run his application more cost effectively.
• A high percentage of the CPU hours should be used for useful computation.
• Reduce the time spent in load imbalance and communication overheads.

Performance Requirements (Contd..)

Utilization and Cost Effectiveness: Performance/Cost Ratio
• The performance/cost ratio of a computer system is defined as the ratio of the speed to the purchasing price.
• A good indicator of cost-effectiveness is the utilization factor, which is the ratio of the achieved speed to the peak speed of a given computer.
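The two indicators just defined are simple ratios. A minimal sketch, with made-up numbers (the speeds and price below are assumptions for illustration, not measured values):

```python
# Utilization = achieved speed / peak speed.
# Performance/cost ratio = achieved speed / purchasing price.
# All three input numbers are illustrative assumptions.

achieved_gflops = 12.0     # sustained speed of the application (assumed)
peak_gflops = 40.0         # peak speed of the machine (assumed)
price_dollars = 200_000.0  # purchasing price (assumed)

utilization = achieved_gflops / peak_gflops
perf_per_dollar = achieved_gflops / price_dollars

print(utilization, round(perf_per_dollar * 1e6, 1))  # 0.3 60.0 (Gflop/s per million $)
```

A utilization of 0.3 means the machine sustains 30% of its peak speed on this workload.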

Performance Requirements (Contd..)

Remarks
• Higher utilization corresponds to higher Gflop/s per dollar, provided CPU-hours are charged at a fixed rate.
• A low utilization does not always indicate a poor program or compiler. A good program could have a long execution time due to a large workload, or a low speed due to a slow machine.
• The utilization factor varies from 5% to 38%. Generally, utilization drops as more nodes are used.
• Utilization values generated from the vendor's programs are often highly optimized.

Performance Versus Cost

The misleading peak performance/cost ratio
• Often, users need to use more than one metric in comparing different systems.
• Using the peak performance/cost ratio to compare systems is often misleading.
  Example: The peak performance/cost ratio of some supercomputers is much lower than that of others, but their sustained performance is actually higher. If we use the cost/performance ratio with peak performance alone, the current Pentium PC easily beats more powerful systems.
• The cost-effectiveness measure should not be confused with the performance/cost ratio of a computer system.

Performance Versus Cost (Contd..)

Summary
• The execution time is just one of several performance requirements a user may want to impose.
• The others include speed, throughput, utilization, cost-effectiveness, performance/cost ratio, and scalability.
• Usually, a set of requirements may be imposed.

Workload and Speed Metrics

Three metrics are frequently used to measure the computational workload of a program:
• The execution time: The first metric is tied to a specific computer system. It may change when the C program is executed on a different machine.
• The number of instructions executed: The second metric, tied to an instruction set architecture (ISA), stays unchanged when the program is executed on different machines with the same ISA.
• The number of floating-point operations executed: The third metric is often architecture-independent.

Workload and Speed Metrics (Contd..)

Summary of all three performance metrics:

  Workload Type                        | Workload Unit                                   | Speed Unit
  Execution time                       | Seconds (s), CPU clocks                         | Applications per second
  Instruction count                    | Million or billion instructions                 | MIPS or BIPS
  Floating-point operation (flop) count| Million flop (Mflop), billion flop (Gflop)      | Mflop/s, Gflop/s

Workload and Speed Metrics (Contd..)

Instruction count
• The workload is the instructions that the machine executed, not just the number of instructions in the assembly program text.
• The instruction count may depend on the input data values. For such an input-dependent program, the workload is defined as the instruction count for the worst-case input.
  Example: A sorting program may execute 100000 instructions for a given input, but only 4000 for another. For such an input-dependent program, the workload is defined as the instruction count for the worst-case input.

Workload and Speed Metrics (Contd..)

Instruction count
• Even with a fixed input, the instructions executed could be different on different machines.
• Even with a fixed input, the instruction count of a program could be different on the same machine when different compilers or optimizations are used.
• Finally, a larger instruction count does not necessarily mean the program needs more time to execute.

Workload and Speed Metrics (Contd..)

Execution time
• For a given program on a specific computer system, one can define the workload as the total time taken to execute the program. This execution time should be measured in terms of wall clock time, also known as the elapsed time.
• The basic unit of workload is seconds.
• Execution time depends on many factors, explained below.

Workload and Speed Metrics (Contd..)

Execution time

Factors that affect performance:
• Input data
• Platform (the machine hardware and operating system)
• Language
• Data structure
• Choice of the right kind of algorithm

Workload and Speed Metrics (Contd..)

Execution time: how each factor affects performance
• Platform: The machine hardware and operating system affect performance. Other factors include the memory hierarchy (cache, main memory, disk) and operating system versions.
• Language: The language used to code the application, the compiler/linker options, and the library functions reduce the execution time.
• Algorithm: The algorithm used has a great impact on execution time. For instance, an N-point FFT will take O(N log N) time.
• Data structure: How the data are structured also impacts performance.
• Input: The execution time for many applications does not depend on the input data. Other programs, such as sorting and searching, could take different times on different input data. When using execution time as the workload for such input-dependent programs, we need to use either the worst-case time or the time for a baseline input data set.

Workload and Speed Metrics (Contd..)

Floating point count
• When the application program is simple and its workload is not input-dependent, the workload can be determined by code inspection.
• When the application code is complex, or when the workload varies with different input data (e.g., sorting), one can measure the workload by running the application on a specific machine. This specific run will generate a report of the number of flops or instructions actually executed.
• This approach has been used in the NAS benchmark, where the flop count is determined by an execution run.

Workload and Speed Metrics (Contd..)

Rules for Counting Floating-Point Operations (NAS standards):

  Operation                      Flop Count  Comments on Rules
  X = Y;                         1           An isolated assignment is counted as 1 flop
  A[2*i] = B[j-1] + 1.5*C - 2;   3           Add, subtract, or multiply each count as 1 flop;
                                             index arithmetic is not counted
  If (X > Y) Max = 2.0*X;        2           A comparison is counted as 1 flop
  X = (float) i + 3.0;           2           A type conversion is counted as 1 flop
  X = Y / 3.0 + sqrt(Z);         9           A division or square root is counted as 4 flop
  X = sin(Y) - exp(Z);           17          A sine, exponential, etc. is counted as 8 flop;
                                             the assignment is not separately counted
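The counting rules above can be expressed as a small weight table. The sketch below is an illustrative tally helper, not part of the NAS benchmark software; the operation names are made up for the example.

```python
# Weighted flop counting following the NAS rules in the table above.
# The dict keys are illustrative names, not any benchmark's API.
NAS_FLOP_WEIGHTS = {
    "add": 1, "sub": 1, "mul": 1,  # add, subtract, multiply: 1 flop each
    "cmp": 1,                      # comparison: 1 flop
    "conv": 1,                     # type conversion: 1 flop
    "assign": 1,                   # isolated assignment: 1 flop
    "div": 4, "sqrt": 4,           # division, square root: 4 flop
    "trig": 8,                     # sine, exponential, etc.: 8 flop
}

def flop_count(ops):
    """Sum the NAS flop weights for a dict of operation tallies."""
    return sum(NAS_FLOP_WEIGHTS[op] * n for op, n in ops.items())

# X = Y / 3.0 + sqrt(Z): one divide, one sqrt, one add -> 9 flop
print(flop_count({"div": 1, "sqrt": 1, "add": 1}))  # 9
# X = sin(Y) - exp(Z): two transcendentals, one subtract -> 17 flop
print(flop_count({"trig": 2, "sub": 1}))            # 17
```

This reproduces the 9 and 17 flop entries of the table: the expression's assignment is absorbed into the arithmetic and not counted separately.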

Workload and Speed Metrics (Contd..)

Caveats
• Using execution time or instruction count as the workload metric leads to a strange phenomenon: the workload changes when the same application is run on different systems.
• With these metrics, the workload can be determined only by executing the program.
• The flop count and the instruction count have another advantage: their corresponding speed and utilization measures give some clue as to how well the application is implemented.
• It is possible to achieve 90% utilization in some numerical subroutines. Highly tuned programs could achieve more.

Workload and Speed Metrics (Contd..)

Summary of performance metrics
• To sum up, all three metrics are useful, but especially the flop count and the execution time.
• In practice, the flop count workload is often determined by code inspection, as described in the table.
• The execution time is often measured on a specific machine, under a set of specific conditions including the hardware platform, compiler options, and input data set, etc.

Remark: The flop count metric is more stable.

Computational Characteristics of Parallel Computers

The historical values of performance parameters for PVP, SMP, MPPs, and Clusters are:
• Year of release in market
• Machine size, n nodes (n > 1)
• Clock (MHz)
• Memory capacity
• Peak performance per node (Mflop/s)
• Peak performance (Gflop/s)

Computational Characteristics of Parallel Computers (Contd..)

The Memory Hierarchy

For each layer, the following three parameters play a vital role in extracting performance:
• Capacity (C): How many bytes of data can be held by the device?
• Latency (L): How much time is needed to fetch one word from the device?
• Bandwidth (B): For moving a large amount of data, how many bytes can be transferred from the device in a second?

Computational Characteristics of Parallel Computers (Contd..)

The Memory Hierarchy

The system performance depends on how fast the system can move data between processors and memories. The memory subsystem hierarchy is shown in the figure.

Performance parameters of a typical memory hierarchy:

  Device          Capacity (C)   Latency (L)       Bandwidth (B)
  Registers       < 2 KB         0 cycles          1-32 GB/s
  Level-1 cache   4-256 KB       0-2 cycles        1-16 GB/s
  Level-2 cache   64 KB-4 MB     2-10 cycles       1-4 GB/s
  Main memory     16 MB-16 GB    10-100 cycles     0.4-2 GB/s
  Remote memory   1-100 GB       100-100K cycles   1-300 MB/s
  Disk            1-100 GB       100K-1M cycles    1-16 MB/s
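A simple way to use the (C, L, B) parameters is the usual latency-plus-streaming estimate of transfer time. The sketch below is illustrative only: the clock rate and the specific values picked from the table's ranges are assumptions, not measurements.

```python
# Estimate the time to move a block of nbytes from one hierarchy level:
#   t = latency_cycles / clock_hz + nbytes / bandwidth_bytes_per_s
# The 500 MHz clock and the chosen latency/bandwidth points are assumptions.

def transfer_time(latency_cycles, bandwidth_bps, nbytes, clock_hz=500e6):
    """Latency plus streaming time for nbytes, in seconds."""
    return latency_cycles / clock_hz + nbytes / bandwidth_bps

# 1 MB from main memory (50 cycles, 1 GB/s) vs. from disk (500K cycles, 8 MB/s)
mem = transfer_time(50, 1e9, 1 << 20)
disk = transfer_time(500_000, 8e6, 1 << 20)
print(mem < disk)  # True
```

The gap between adjacent levels is the reason the faster, smaller devices must be kept close to the processor.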

Computational Characteristics of Parallel Computers (Contd..)

• The faster and smaller devices are closer to the processor.
• The devices closest to the processor are the registers, which are in fact part of the processor chip.
• The level-1 cache is usually on the processor chip; the level-2 cache is off chip.
• The level-3 cache is off chip, shared by some processors in an SMP node.
• The main memory includes the local memory in a node, and the global memory for machines with a centralized shared memory.

Parallelism and Interaction Overheads

The time to execute a parallel program is

  T = Tcomp + Tpar + Tinteract

where Tcomp, Tpar, and Tinteract are the times needed to execute the computation, the parallelism, and the interaction operations, respectively. (It is assumed that no overlapping among these operations is involved.)

Parallelism and Interaction Overheads (Contd..)

Sources of parallelism overhead
• Process management, such as creation, termination, context switching, etc.
• Grouping operations, such as creation or destruction of a process group.
• Process inquiry operations, such as asking for process identification, rank, group, group size, etc.

Parallelism Overheads
• Creating a process or a group is expensive on parallel computers.
• This is important for the parallel program which is executed many times to process raw data.

Parallelism and Interaction Overheads (Contd..)

Sources of interaction overheads
• Synchronization, such as barrier, locks, critical regions, and events
• Aggregation, such as reduction and scan
• Communication, such as point-to-point and collective communication, and reading/writing of shared variables
• Idleness due to interaction

Parallelism and Interaction Overheads (Contd..)

Overheads Quantification
• Measurement conditions
• Measurement methods
• Expressions for overhead due to parallelism and communications
• Communication overhead for point-to-point communication (message length m, in bytes)
• Collective communication and collective computation
• Requirement of startup time and asymptotic bandwidth
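Point-to-point communication overhead is commonly modeled as t(m) = t0 + m/r_inf, where t0 is the startup time and r_inf is the asymptotic bandwidth. A minimal sketch of extracting these two parameters from ping-pong timings by least squares; the timing data below is synthetic, made up for illustration:

```python
# Fit the linear overhead model  t(m) = t0 + m / r_inf  to measured
# (message length, time) pairs using ordinary least squares.

def fit_overhead_model(lengths, times):
    """Return (t0, r_inf): startup time in s, asymptotic bandwidth in bytes/s."""
    n = len(lengths)
    mx = sum(lengths) / n
    my = sum(times) / n
    sxx = sum((x - mx) ** 2 for x in lengths)
    sxy = sum((x - mx) * (y - my) for x, y in zip(lengths, times))
    slope = sxy / sxx          # seconds per byte
    t0 = my - slope * mx       # startup time
    return t0, 1.0 / slope

# Synthetic ping-pong timings generated with t0 = 50 us, r_inf = 100 MB/s
msgs = [1024, 4096, 16384, 65536]
times = [50e-6 + m / 100e6 for m in msgs]
t0, r_inf = fit_overhead_model(msgs, times)
print(round(t0 * 1e6), round(r_inf / 1e6))  # 50 100
```

This is the "Method 3" style of presentation: a simple closed-form expression that predicts the overhead for any message length.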

Parallelism and Interaction Overheads (Contd..)

Overhead measurement conditions

The following conditions are used to estimate the overhead involved:
• The data structures used (the data structures are always made small enough to fit into the node memory so that there will not be page faults)
• The programming language used
• The best compiler options should be used
• The message passing library used
• The communication hardware and protocol used
• Use of wall clock elapsed time

Parallelism and Interaction Overheads (Contd..)

Overhead measurement methods
• Measuring point-to-point communication (ping-pong test) between two nodes
• Measuring collective communication involving n nodes

Interpretation of overhead measurement data
• Method 1: Present the results in tabular form
• Method 2: Present the data as curves
• Method 3: Present the data using simple, closed-form expressions that can be used to estimate the overhead for various message lengths and machine sizes

Parallelism and Interaction Overheads (Contd..)

Remarks
• Knowing the overheads can help the programmer decide how best to develop a parallel program on a given parallel computer.
• Parallelism and interaction overheads are often large compared to the basic computation time, and they vary from system to system.
• In some cases parallelism and interaction overheads are significant, and they should be quantified.
• The overhead values also vary greatly from one system to another.

Performance Metrics: Phase Parallel Model

• Consider a sequential program consisting of a sequence of k major computational phases C1, C2, ..., Ck. Each phase Ci has an execution time T(i), a workload W(i), and a degree of parallelism DOP(i).
• We want to develop an efficient parallel program by exploiting each phase Ci with its degree of parallelism DOP(i).
• A phase parallel program interleaves the computational phases with interaction steps between them.

[Figure: The phase parallel model of an application algorithm. Phases C1 through Ck, each annotated with T(i), W(i), and DOP(i), separated by interaction steps.]

Performance Metrics: Phase Parallel Model (Contd..)

Basic Metrics

While executing on p processors, with 1 ≤ p ≤ n:
• The parallel execution time for phase Ci is Tp(i) = T1(i) / min(DOP(i), p).
• The total parallel execution time over p nodes becomes

  Tp = Σ (1 ≤ i ≤ k) T1(i) / min(DOP(i), p) + Tpar + Tinteract

where Tpar and Tinteract denote all parallelism and interaction overheads.

Performance Metrics: Phase Parallel Model (Contd..)

Remarks
1. In many cases, the total overhead is dominated by the communication overhead.
2. The implementation of point-to-point and collective communication, and the computation overhead for various message sizes in MPI, play an important role for this overhead.
3. These metrics will help the application user to know about the performance issues of a particular code.
4. These inequalities are useful to estimate the parallel execution time.

Performance Metrics: Phase Parallel Model (Contd..)

Phase parallel model notation:

  Terminology          Notation   Definition
  Sequential time      T1         T1 = Σ (1 ≤ i ≤ k) T1(i)
  Parallel time,       Tp         Tp = Σ (1 ≤ i ≤ k) T1(i) / min(DOP(i), p) + Tpar + Tinteract
  p-node time
  Critical path        T∞         T∞ = Σ (1 ≤ i ≤ k) T1(i) / DOP(i)
  Total overhead       T0         T0 = pTp - T1
  Speedup              Sp         Sp = T1 / Tp
  p-node efficiency    Ep         Ep = Sp / p = T1 / (pTp)
  p-node speed         Pp         Pp = W / Tp

Performance of Parallel Programs

Available Parallelism
• Many scientific applications exhibit a high degree of parallelism (DOP). Some scientific codes exhibit a high average DOP due to data parallelism. Example: the Perfect Benchmark.
• Parallelism in non-numerical computations is relatively less; there, instruction-level parallelism plays an important role.
• If the architecture and compiler work perfectly, one can expect good performance from the available parallelism.

Performance Metrics of Parallel Systems

• Speedup: The speedup S is defined as the ratio of the serial runtime of the best sequential algorithm for solving a problem to the time Tp taken by the parallel algorithm to solve the same problem on p identical processors.
• Cost: The cost of solving a problem on a parallel system is the product of the parallel runtime and the number of processors used, Cost = p * Tp. The processors used by the parallel algorithm are assumed to be identical to the one used by the sequential algorithm.

Performance Metrics of Parallel Systems (Contd..)

• Efficiency: The ratio of speedup to the number of processors, E = S / p.
• Efficiency can also be expressed as the ratio of the execution time of the fastest known sequential algorithm on a single processor to the cost of solving the same problem on p processors.
• The cost of solving a problem on a single processor is the execution time of the fastest known sequential algorithm.
• Cost-optimal: A parallel system is said to be cost-optimal if the cost of solving a problem on a parallel computer is proportional to the execution time of the fastest known sequential algorithm on a single processor.
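The three definitions above are direct ratios and products. A minimal sketch with illustrative (made-up) timings:

```python
# Speedup, efficiency, and cost as defined in the slides.
# t_serial is the runtime of the best sequential algorithm;
# t_parallel is the runtime of the parallel algorithm on p processors.

def speedup(t_serial, t_parallel):
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    return speedup(t_serial, t_parallel) / p  # E = S / p

def cost(t_parallel, p):
    return p * t_parallel                     # processor-time product

t_s, t_p, p = 100.0, 16.0, 8                  # illustrative numbers
print(speedup(t_s, t_p), efficiency(t_s, t_p, p), cost(t_p, p))
# 6.25 0.78125 128.0
```

Here the cost (128) exceeds the serial time (100) by a constant factor; whether that factor stays bounded as the problem scales is exactly the cost-optimality question.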

Performance Metrics of Parallel Systems (Contd..)

• Cost is sometimes referred to as work or the processor-time product, and a cost-optimal system is also known as a pTp-optimal system.
• A cost-optimal parallel system has an efficiency of θ(1).
• Using fewer than the maximum number of processors to execute a parallel algorithm is called scaling down a parallel system in terms of the number of processors.

Performance Metrics of Parallel Systems (Contd..)

Problem Size
• We define the problem size (W) as the number of basic computation steps in the best sequential algorithm to solve the problem on a single processor.
• A consistent definition of the size of the problem should be such that, regardless of the problem, doubling the problem size always means performing twice the amount of computation.
• With this assumption, the problem size W is equal to the serial runtime Ts of the fastest known algorithm to solve the problem on a sequential computer.

Performance Metrics of Parallel Systems (Contd..)

Degree of Concurrency
• The maximum number of tasks that can be executed simultaneously at any time in a parallel algorithm is called its degree of concurrency.
• The degree of concurrency is a measure of the number of operations that an algorithm can perform in parallel for a problem of size W; it is independent of the parallel architecture.

Performance Metrics of Parallel Systems (Contd..)

Example 1: Adding n numbers on a p-processor hypercube

Consider the problem of adding n numbers on p processors, such that p ≤ n and both n and p are powers of two. Assume that both the addition and communication operations take a constant amount of time.
• Case 1: one element per processor
• Case 2: n is a multiple of p log p

Note: All the case studies help the application user to design and analyze algorithms on parallel computers.

Performance Metrics of Parallel Systems (Contd..)

Example 1, Case 1 - Analysis
• Initially, each processor is assigned one of the numbers to be added and, at the end of the computation, one of the processors stores the sum of all the numbers.
• We can perform this on a hypercube or on a shared-memory multiprocessor in log n steps.
• Since the problem can be solved in θ(n) time on a single processor, its speedup is S = θ(n / log n).

Performance Metrics of Parallel Systems (Contd..)

Effect of Granularity and Data Mapping on Performance

Example 1, Case 1 (one element per processor), continued:
• With this input distribution over the processes, the efficiency of this parallel system is less than θ(1), which indicates that the parallel system is not cost-optimal.
• In practice, we assign larger pieces of input data to processors.

Performance Metrics of Parallel Systems (Contd..)

Example 1, Case 2 - Input where n is a multiple of p log p:

[Figure: A cost-optimal way of computing the sum of 16 numbers on a four-processor hypercube. Panels (a)-(d): each processor first locally adds its four numbers (Σ0-3, Σ4-7, Σ8-11, Σ12-15), and the four partial sums are then combined over the hypercube links.]

Example 1 : Adding n numbers on p processors – an alternate method

• An alternate method for n = 16 and p = 4 is illustrated in the figure.
• Each processor locally adds its n/p numbers in θ(n/p) time; the p partial sums are then combined in log p steps.
• The parallel run time of this algorithm is θ(n/p + log p), and its cost is θ(n + p log p).
• If n > p log p, then the cost is θ(n), which is the same as the serial runtime. The parallel system is cost-optimal.
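This cost model can be sketched numerically. A minimal sketch, assuming (as the slides do) one unit of time per addition and per communication step, with two communication-plus-addition units per reduction step; the function name is illustrative:

```python
import math

def adding_numbers_model(n, p):
    """Cost model for adding n numbers on a p-processor hypercube.

    Assumes unit time per addition and per communication step:
    T_p = n/p + 2*log2(p) (local sums, then a log p-step reduction
    costing one send and one add per step).
    """
    t_serial = n                          # theta(n) on one processor
    t_parallel = n / p + 2 * math.log2(p)
    cost = p * t_parallel                 # processor-time product
    speedup = t_serial / t_parallel
    return t_parallel, cost, speedup

# n = 16 numbers on p = 4 processors, as in the figure
tp, cost, s = adding_numbers_model(16, 4)
print(tp, cost, s)   # 8.0 32.0 2.0
```

For n = 16, p = 4 the cost is 32 = θ(n + p log p); growing n relative to p log p drives the cost toward θ(n), matching the cost-optimality condition above.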

Remarks

• The examples demonstrate how the computation is mapped onto processors to determine the cost-optimality of a parallel system.
• Algorithm design is very important, as is the role of mapping computations onto processors in parallel algorithm design.
• It is possible to make non-cost-optimal systems cost-optimal by scaling down the number of processors.
• Example : Matrix-vector multiplication involving a random sparse matrix.

Scalability and Speedup Analysis

Example 2 : Simple Analysis (assume 2 heterogeneous processors)

• The slow processor dictates the achievable performance.

    F_H, F_L = fraction of the problem done at high and low speed
    T_H, T_L = time to generate a result at high and low speed

    F_H + F_L = 1 ;   T = F_H T_H + F_L T_L

• Effective rate : R = 1/T = 1/(F_H T_H + F_L T_L)

• If F_H → 1 and F_L → 0, then T → T_H and R → R_H = 1/T_H ; conversely, if F_L → 1, then R → R_L = 1/T_L.

Example 2 : Further Analysis

    R = 1/(F_H T_H + F_L T_L) = 1/(F_H T_H + (1 - F_H) T_L)

• Writing r = T_H / T_L, this can be rearranged as

    R = R_L / (1 - F_H (1 - r)),   where R_L = 1/T_L

• To obtain a high rate R when T_H << T_L (r << 1), F_H must be large : the fraction of the work done at high speed must dominate.
• Match the processor with the problem. Herein lies the challenge.
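The effective-rate formula above is easy to evaluate. A minimal sketch (the function name is illustrative) showing how even a small fraction of work left at low speed drags down the rate:

```python
def effective_rate(f_h, t_h, t_l):
    """Effective rate R = 1/T for work split between a fast and a
    slow processor: T = F_H*T_H + F_L*T_L with F_H + F_L = 1."""
    f_l = 1.0 - f_h
    return 1.0 / (f_h * t_h + f_l * t_l)

# Fast processor is 10x faster (T_H = 1, T_L = 10).
# Leaving just 10% of the work at low speed roughly halves the rate:
print(effective_rate(1.0, 1, 10))   # 1.0 (all work at high speed)
print(effective_rate(0.9, 1, 10))   # ~0.526
```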

Example 2 – Case : Fixed-Size Problem

• To achieve good speedup, it is important to know how much the fraction solved at low speed (F_L) contributes to the problem being solved on the parallel computer.

    F_H + F_L = 1 ;   T_1 = F_L + F_H = 1

• Fraction solved at low speed = F_L ; fraction solved at high speed, using p processors, = F_H.

    T_p = F_L + F_H / p

    S_p = T_1 / T_p = 1 / (F_L + F_H / p) = p / (p F_L + F_H)

Superlinear Speedup

• A speedup greater than p is sometimes observed, a phenomenon called superlinear speedup.
• This is usually due either to a non-optimal sequential algorithm or to hardware characteristics that put the sequential algorithm at a disadvantage in the computation.
• Data partitioning, and the use of cache, main memory, and secondary storage, may be the reasons for superlinear speedup.
• Example : The data for a problem might be too large to fit into the main memory of a uniprocessor, thereby degrading its performance due to the use of secondary storage. But when partitioned among several processors, the individual data partitions are small enough to fit into their respective processors' main memories.

Speedup Metrics and Scalability Analysis

• Three performance models based on three speedup metrics are commonly used :
  • Amdahl's law    – fixed problem size (fixed-size speedup)
  • Gustafson's law – fixed-time speedup
  • Sun-Ni's law    – memory-bounded speedup
• The three corresponding approaches to scalability analysis are based on maintaining a constant efficiency, a constant speed, and a constant utilization, respectively.

Amdahl's Law : Fixed Problem Size

• Consider a problem with a fixed workload W. Assume that the workload can be divided into two parts :

    W = α W + (1 - α) W

  where α percent of W is executed sequentially, and the remaining (1 - α) percent is executed by p nodes simultaneously.
• Assuming all overheads are ignored, the fixed-load speedup is defined by

    S_p = W / (α W + (1 - α) W / p) = p / (1 + (p - 1) α)  →  1/α  as  p → ∞
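The fixed-load speedup can be sketched directly from the closed form above (the function name is illustrative); note how the speedup saturates near the 1/α bound no matter how many processors are added:

```python
def amdahl_speedup(alpha, p):
    """Fixed-load (Amdahl) speedup: S_p = p / (1 + (p-1)*alpha)."""
    return p / (1 + (p - 1) * alpha)

# With alpha = 0.05 the upper bound is 1/alpha = 20:
print(amdahl_speedup(0.05, 10))     # ~6.90
print(amdahl_speedup(0.05, 1000))   # ~19.6, already near the bound
```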

Amdahl's Law (continued)

• Example 3 : Determine what portion of the computation can be sequential if we want the processors to be used at 50 percent efficiency.
• Assume that the workload is normalized to 1 and that α is the time taken to execute the serial part of the code. Then the speedup is

    S_p(α) = 1 / (α + (1 - α)/p)

  and setting E = S_p/p = 0.5 gives α = 1/(p - 1).
• Thus, in order to maintain a constant efficiency, the fraction of the computation devoted to sequential processing must be proportional to the inverse of the number of processors.

Amdahl's Law : Implications

1. For a given workload, the maximal speedup has an upper bound of 1/α. In other words, the sequential component of the program is the bottleneck.
2. As α increases, the speedup decreases proportionally.
3. To achieve good speedup, it is important to make the sequential fraction α as small as possible.
4. With all overheads T_0 included, the fixed-load speedup

    S_p = W / (α W + (1 - α) W / p + T_0)

   becomes 1 / (α + T_0 / W) as p → ∞.

Amdahl's Law : Fixed Problem Size (contd.)

[Figure : the sequential component α versus efficiency.]

Amdahl's Law (remarks)

• When a problem consists of the above two portions, we should make the larger portion execute faster. In other words, one should optimize the common case.
• The fixed-workload speedup drops as the sequential bottleneck α and the average overhead increase.
• The problem of the sequential bottleneck cannot be solved just by increasing the number of processors in a system.
• We must not only reduce the sequential bottleneck, but also increase the average granularity in order to reduce the adverse impact of the overhead.
• In other words, the performance of a parallel program is limited not only by the sequential bottleneck, but also by the average overhead.

Gustafson's Law : Scaling for Higher Accuracy

• Under Amdahl's law the problem size (workload) is fixed and cannot scale to match the available computing power as the machine size increases. Thus, Amdahl's law leads to a diminishing return when a larger system is employed to solve a small problem.
• The sequential bottleneck in Amdahl's law can be alleviated by removing the restriction of a fixed problem size.
• Gustafson proposed a fixed-time concept that achieves an improved speedup by scaling the problem size with the machine size.

Gustafson's Law : Scaling for Higher Accuracy (contd.)

• There are many applications that emphasize accuracy more than minimum turnaround time.
• As the machine size is upgraded to obtain more computing power, we may want to increase the problem size in order to create a greater workload, producing a more accurate solution and yet keeping the execution time unchanged.
• This problem of scaling for accuracy has motivated Gustafson to develop the fixed-time speedup model.
• As the machine size increases, the workload (W) should increase.

Gustafson's Law : Scaling for Higher Accuracy (contd.)

Fixed-Time Speedup (S_p*)

• On a single-node machine, the execution time is α W + (1 - α) W = W.
• On a p-node machine, we scale the workload to

    W* = α W + (1 - α) p W

• The parallel execution time of the scaled workload on p nodes is still W.
• However, the sequential time for executing the scaled workload W* is α W + (1 - α) p W.

Gustafson's Law : Scaling for Higher Accuracy (contd.)

• The fixed-time speedup with scaled workload is defined as

    S_p* = (sequential time for scaled-up workload) / (parallel time for scaled-up workload)
         = (α W + (1 - α) p W) / W
         = α + (1 - α) p

• It states that the fixed-time speedup is a linear function of p, if the workload is scaled up to maintain a fixed execution time.
• It achieves an improved speedup by scaling the problem size with the increase in machine size.
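The contrast with Amdahl's bound is stark when evaluated. A minimal sketch (illustrative function name) of the fixed-time speedup formula above:

```python
def gustafson_speedup(alpha, p):
    """Fixed-time (Gustafson) speedup: S_p* = alpha + (1 - alpha)*p."""
    return alpha + (1 - alpha) * p

# Unlike Amdahl's ceiling of 1/alpha, the scaled speedup grows
# linearly in p for any fixed alpha:
print(gustafson_speedup(0.05, 1000))   # 950.05, versus ~19.6 for Amdahl
```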

Gustafson's Law : Scaling for Higher Accuracy (contd.)

• After incorporation of all the overheads T_0, the scaled speedup becomes

    S_p* = (α W + (1 - α) p W) / (W + T_0) = (α + (1 - α) p) / (1 + T_0 / W)

• The generalized Gustafson's law can achieve a linear speedup by controlling the overheads as a decreasing function with respect to the number of processors p. But this is often difficult to achieve !!

Sun and Ni's Law : Memory-Bound Speedup

Motivation

• The idea is to solve the largest possible problem, limited only by the available memory capacity.
• This also demands a scaled workload, providing higher speedup, greater accuracy, and better resource utilization.
• Use the concepts of Amdahl's law and Gustafson's law to maximize the use of both CPU and memory capacities.

Sun and Ni's Law : Memory-Bound Speedup (contd.)

• Let M be the memory capacity of a single node. On a p-node parallel system, the total memory is pM. Given a memory-bounded problem, assume it uses all the memory capacity M on one node and executes with workload W.
• When p nodes are used, assume that the parallel portion of the workload can be scaled up F(p) times. (Here the factor F(p) reflects the increase in workload as the memory capacity increases p times.)
• The scaled workload is W* = α W + (1 - α) F(p) W, and the memory-bound speedup is given by

    S_p* = (α W + (1 - α) F(p) W) / (α W + (1 - α) F(p) W / p)

• Remark : The memory-bound model gives a higher speedup than both the fixed-load speedup and the fixed-time speedup.

Sun and Ni's Law : Memory-Bound Speedup (contd.)

Three Special Cases :

1. F(p) = 1. This corresponds to the case where the problem size is fixed. (Amdahl's law)
2. F(p) = p. This applies to the case where the workload increases p times when the memory is increased p times. (Gustafson's law)
3. F(p) > p. This corresponds to the situation where the computational workload increases faster than the memory requirement.

Remark : The memory-bound model gives a higher speedup than both the fixed-load speedup (S_p) and the fixed-time speedup (S_p*).
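The three special cases can be verified numerically from the memory-bound speedup formula. A minimal sketch, with W normalized to 1 and an illustrative function name; F(p) = p^1.5 stands in for any super-linear scaling factor:

```python
def sun_ni_speedup(alpha, p, F):
    """Memory-bounded (Sun-Ni) speedup with scaling factor F(p):
    S_p* = (a + (1-a)*F(p)) / (a + (1-a)*F(p)/p), with W = 1."""
    g = F(p)
    return (alpha + (1 - alpha) * g) / (alpha + (1 - alpha) * g / p)

alpha, p = 0.05, 100
print(sun_ni_speedup(alpha, p, lambda q: 1))       # Amdahl's law
print(sun_ni_speedup(alpha, p, lambda q: q))       # Gustafson's law
print(sun_ni_speedup(alpha, p, lambda q: q**1.5))  # F(p) > p: higher still
```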

Summary of Performance Metrics

• If the purpose is to reduce the execution time of a fixed-workload problem, the scalability of the system can be defined as fixed-load scaling (Amdahl's law).
• However, parallel systems are often more efficiently used if the problem size scales up with the machine size (Sun and Ni's law).
• Scaling up a problem may require more execution time, even when a larger machine size is used.
• A compromise between fixed-load and fixed-memory scaling is Gustafson's fixed-time scaling.
• In all cases, the achievable speedup is limited by the sequential bottleneck α and the average overhead T_0/W.

Scalability Analysis

• Two phenomena, shown here for a particular parallel system, are common to a large class of parallel systems !!
• First, for a given problem instance, the speedup does not increase linearly as the number of processors increases : the speedup curve flattens. This is a consequence of Amdahl's law ; efficiency drops with an increasing number of processors.
• Second, a larger instance of the same problem yields a higher speedup and efficiency for the same number of processors, although both speedup and efficiency continue to drop with increasing p.
• This ability to maintain efficiency at a fixed value by simultaneously increasing the number of processors and the size of the problem is exhibited by many parallel systems. We call such systems scalable parallel systems.

Scalability Analysis (contd.)

• The scalability of a parallel system is a measure of its capacity to increase speedup in proportion to the number of processors. It reflects a parallel system's ability to utilize increasing processing resources effectively.
• The scalability and cost-optimality of a parallel system are related : a cost-optimal parallel system has an efficiency of θ(1).

Scalability Analysis : Total Overhead

• We define the total overhead, or overhead function, of a parallel system as the part of its cost (processor-time product) that is not incurred by the fastest known serial algorithm on a sequential computer.
• All causes of non-optimal performance are collectively referred to as the overhead due to parallel processing.
• Denote the overhead function of a parallel system by the symbol T_0. T_0 is a function of W and p.
• The relation between the cost (p T_p), the problem size (W) and the overhead function (T_0) is given by

    T_0 = p T_p - W

The Isoefficiency Metric of Scalability

• Parallel execution time can be expressed as a function of the problem size, the overhead function, and the number of processors :

  • Parallel run time : T_p = (W + T_0(W,p)) / p
  • Speedup : S = W / T_p = W p / (W + T_0(W,p))
  • Efficiency : E = S / p = W / (W + T_0(W,p)) = 1 / (1 + T_0(W,p)/W)

• Remarks :
  • If the problem size is kept constant and p is increased, the efficiency decreases because the total overhead T_0 increases with p.
  • If W is increased keeping the number of processors fixed, then for scalable parallel systems the efficiency increases.
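These three expressions can be sketched together. A minimal sketch (illustrative function name), using the hypercube-addition overhead T_0 = 2 p log2 p as a concrete example, which shows the efficiency falling as p grows with W held fixed:

```python
import math

def parallel_metrics(W, p, T0):
    """Parallel run time, speedup and efficiency in terms of the
    problem size W, machine size p and total overhead T0 = p*Tp - W."""
    Tp = (W + T0) / p
    S = W / Tp
    E = S / p            # equals 1 / (1 + T0/W)
    return Tp, S, E

# Fixed W, growing p, overhead T0 = 2*p*log2(p): efficiency drops.
for p in (4, 16, 64):
    print(parallel_metrics(1024, p, 2 * p * math.log2(p)))
```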

The Isoefficiency Metric of Scalability (contd.)

• For scalable parallel systems, efficiency can be maintained at a fixed value (between 0 and 1) if the ratio T_0/W is maintained at a constant value. For a desired value E of efficiency,

    E = 1 / (1 + T_0(W,p)/W)   ⇒   W = (E/(1 - E)) T_0(W,p) = K T_0(W,p)   (K is a constant)

• The smaller the isoefficiency, the less the workload needs to increase relative to the increase in machine size while keeping the same efficiency, and thus the better the scalability of the system.

The Isoefficiency Metric of Scalability (contd.)

To characterize parallel system scalability :

• The efficiency of any system can be expressed as a function of the workload W and the machine size p, i.e. E = f(W, p).
• If we fix the efficiency to some constant (e.g. 50%) and solve the efficiency equation for the workload W, the resulting function is called the isoefficiency function of the system.

Example 1 : Scalability of adding n numbers on p processors

(Input – Case 2 : n is a multiple of p log p)

• The overhead function for the problem of adding n numbers on a p-processor hypercube is

    T_0 = 2 p log p

• Thus the asymptotic isoefficiency function for this parallel system is θ(p log p). This isoefficiency function dictates the growth rate of W required to keep the efficiency fixed as p increases :

    W = 2 K p log p
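The isoefficiency relation can be verified directly. A minimal sketch (illustrative function name) showing that growing W along W = 2 K p log2 p, with K = E/(1 - E) = 1 (i.e. a target efficiency of 0.5), holds the efficiency constant as p grows:

```python
import math

def efficiency(W, p):
    """E = 1 / (1 + T0/W) with T0 = 2*p*log2(p), the overhead of
    adding n numbers on a p-processor hypercube (W = n)."""
    t0 = 2 * p * math.log2(p)
    return 1 / (1 + t0 / W)

# Scale W as the isoefficiency function theta(p log p) dictates:
for p in (4, 64, 1024):
    W = 2 * p * math.log2(p)
    print(p, efficiency(W, p))   # 0.5 every time
```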

The Isospeed and Isoutilization of Scalability

• To characterize parallel system scalability, one can also maintain a constant speed (isospeed) while scaling up both the machine size and the problem size at the same time.
• Isoutilization is similar to isoefficiency : instead of efficiency, one preserves utilization. The utilization of a parallel system is defined as follows :

    U = P_p / (p . P_peak) = W / (p . T_p . P_peak)     (p = number of processors)

• Fix the utilization to a constant (e.g. 25%) and solve the utilization equation for the workload W ; the resulting function is called the isoutilization function of the system.
• As a metric, isoutilization can predict the workload growth rate required relative to the machine-size increase, while maintaining the same utilization.
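The utilization formula is a one-liner to evaluate. A minimal sketch (illustrative function name; the operation counts and peak rate below are made-up numbers), reading U as the achieved speed over the machine's aggregate peak speed:

```python
def utilization(W, p, Tp, peak):
    """U = P_p / (p * P_peak) = W / (p * Tp * P_peak): achieved speed
    as a fraction of the machine's aggregate peak speed."""
    return W / (p * Tp * peak)

# 1e9 operations finished in 2 s on 8 nodes of 1e8 op/s peak each:
print(utilization(1e9, 8, 2.0, 1e8))   # 0.625
```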

Remark :

• An ideal scalability metric should have the following two properties :
  1. It predicts the workload growth rate with respect to the increase of machine size. Systems with a small isoutilization are more scalable than those with a large one.
  2. It is consistent with execution time, i.e. a more scalable system always has a shorter execution time.
• It can be shown that, under very reasonable conditions, a system with a small isoutilization (thus more scalable) always has a shorter execution time.

Conclusions

• Theoretical concepts on performance and scalability have been covered.
• Performance metrics and the measurement of the performance of a parallel program have been explained.
• Workload and speed metrics are explained.
• Types of overhead in parallel computing are discussed.
• The phase-parallel model, speedup, efficiency, isoefficiency, isospeed and isoutilization metrics are explained.

