Scalable Problems and Memory-Bounded Speedup

Xian-He Sun
ICASE, Mail Stop 132C
NASA Langley Research Center
Hampton, VA
sun@icase.edu

Lionel M. Ni
Computer Science Department
Michigan State University
East Lansing, MI
ni@cps.msu.edu

Abstract

In this paper three models of parallel speedup are studied. They are fixed-size speedup, fixed-time speedup, and memory-bounded speedup. The latter two consider the relationship between speedup and problem scalability. Two sets of speedup formulations are derived for these three models. One set considers uneven workload allocation and communication overhead and gives more accurate estimation. Another set considers a simplified case and provides a clear picture of the impact of the sequential portion of an application on the possible performance gain from parallel processing. The simplified fixed-size speedup is Amdahl's law. The simplified fixed-time speedup is Gustafson's scaled speedup. The simplified memory-bounded speedup contains both Amdahl's law and Gustafson's scaled speedup as special cases. This study leads to a better understanding of parallel processing.

Running Head: Scalable Problems and Memory-Bounded Speedup



This research was supported in part by the NSF grant ECS and by the National Aeronautics and Space Administration under NASA contract NAS.


1 Introduction

Although parallel processing has become a common approach for achieving high performance, there is no well-established metric to measure the performance gain of parallel processing. The most commonly used performance metric for parallel processing is speedup, which gives the performance gain of parallel processing versus sequential processing. Traditionally, speedup is defined as the ratio of uniprocessor execution time to execution time on a parallel processor. There are different ways to define the metric execution time. In fixed-size speedup, the amount of work to be executed is independent of the number of processors. Based on this model, Ware [17] summarized Amdahl's arguments to define a speedup formula which is known as Amdahl's law [1]. However, in many applications the amount of work to be performed increases as the number of processors increases, in order to obtain a more accurate or better result. The concept of scaled speedup was proposed by Gustafson et al. at Sandia National Laboratory [6]. Based on this concept, Gustafson suggested a fixed-time speedup [5], which fixes the execution time and is interested in how the problem size can be scaled up. In scaled speedup, both sequential and parallel execution times are measured based on the same amount of work defined by the scaled problem.

Both Amdahl's law and Gustafson's scaled speedup use a single parameter, the sequential portion of an application, to characterize the application. They are simple and give much insight into the potential degradation of parallelism as more processors become available. Amdahl's law has a fixed problem size and is interested in how small the response time could be. It suggests that parallel processing may not gain high speedup. Gustafson approaches the problem from another point of view. He fixes the response time and is interested in how large a problem could be solved within this time. This paper further investigates the scalability of problems. While Gustafson's scalable problems are constrained by the execution time, the capacity of main memory is also a critical metric. For parallel computers, especially for distributed-memory multiprocessors, the size of scalable problems is often determined by the memory available. Shortage of memory is paid for in problem solution time, due to the I/O or message-passing delays, and in programmer time, due to the additional coding required to multiplex the distributed memory. For many applications the amount of memory is an important constraint on scaling problem size. Thus, memory-bounded speedup is the major focus of this paper.

We first study three models of speedup: fixed-size speedup, fixed-time speedup, and memory-bounded speedup. With both uneven workload allocation and communication overhead considered, speedup formulations will be derived for all three models. When communication overhead is not considered and the workload only consists of sequential and perfectly parallel portions, the simplified fixed-size speedup is Amdahl's law, the simplified fixed-time speedup is Gustafson's scaled speedup, and the simplified memory-bounded speedup contains both Amdahl's law and Gustafson's speedup as special cases. Therefore, the three models of speedup, which represent different points of view, are unified.

Based on the concept of scaled speedup, intensive research has been conducted in recent years in the area of performance evaluation. Some other definitions of speedup have also been proposed, such as generalized speedup, cost-related speedup, and superlinear speedup; interested readers can refer to the literature for details.

This paper is organized as follows. In Section 2 we introduce the program model and some basic terminology. More general speedup formulations for the three models of speedup are presented in Section 3. Speedup formulations for simplified cases are studied in Section 4. The influence of the communication-memory trade-off is studied in Section 5. Conclusions and comments are given in Section 6.

2 A Model of Parallel Speedup

To measure different speedup metrics for scalable problems, the underlying machine is assumed to be a scalable multiprocessor. A multiprocessor is considered scalable if, as the number of processors increases, the memory capacity and network bandwidth also increase. Furthermore, all processors are assumed to be homogeneous. Most distributed-memory multiprocessors and multicomputers, such as commercial hypercube and mesh-connected computers, are scalable multiprocessors. Both message-passing and shared-memory programming paradigms have been used in such multiprocessors. To simplify the discussion, our study assumes homogeneous distributed-memory architectures.

The parallelism in an application can be characterized in different ways for different purposes. For simplicity, speedup formulations generally use very few parameters and consider very high-level characterizations of the parallelism. We consider two main degradations of parallelism: uneven allocation (load imbalance) and communication latency. The former degradation is application dependent. The latter depends on both the application and the parallel computer under consideration. To obtain an accurate estimate, both degradations need to be considered. Uneven allocation is measured by the degree of parallelism.

Definition 1. The degree of parallelism of a program is an integer which indicates the maximum number of processors that can be busy computing at a particular instant in time, given an unbounded number of available processors.

The degree of parallelism is a function of time. By drawing the degree of parallelism over the execution time of an application, a graph can be obtained. We refer to this graph as the parallelism profile. Figure 1 is the parallelism profile of a hypothetical divide-and-conquer computation. By accumulating the time spent at each degree of parallelism, the profile can be rearranged to form the shape (see Figure 2) of the application.

Figure 1: Parallelism profile of an application (degree of parallelism plotted over the execution time T).
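To make the profile-to-shape rearrangement concrete, the following sketch (our illustration, with a hypothetical sampled profile) accumulates the time spent at each degree of parallelism and recovers the per-degree work under unit computing capacity:

```python
from collections import Counter

# Hypothetical sampled profile: the degree of parallelism at each unit time
# step of a divide-and-conquer run (split, parallel work, then merge).
profile = [1, 2, 4, 4, 4, 2, 2, 1]

# The "shape": total time t_i spent at each degree of parallelism i.
shape = Counter(profile)
for degree in sorted(shape, reverse=True):
    print(f"degree {degree}: t_{degree} = {shape[degree]} time units")

# With computing capacity Delta = 1, the work done at degree i is W_i = i * t_i.
work = {i: i * t for i, t in shape.items()}
print("W_i:", work, " total W =", sum(work.values()))
```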

Let W be the amount of work of an application. Work can be defined as arithmetic operations, instructions, or whatever is needed to complete the application. Formally, the speedup with N processors and with the total amount of work W is defined as

$$ S_N(W) = \frac{T_1(W)}{T_N(W)}, \qquad (1) $$

where $T_i(W)$ is the time required to complete $W$ amount of work on $i$ processors. Let $W_i$ be the amount of work executed with degree of parallelism $i$, and let $m$ be the maximum degree of parallelism. Thus $W = \sum_{i=1}^{m} W_i$. Assuming each computation takes a constant time to finish on a given processor, the execution time for computing $W_i$ with a single processor is

$$ t_1(W_i) = \frac{W_i}{\Delta}, $$

where $\Delta$ is the computing capacity of each processor. If there are $i$ processors available, the execution time is

$$ t_i(W_i) = \frac{W_i}{i\,\Delta}. $$

With an infinite number of processors available, the execution time will not be further decreased and is

$$ t_\infty(W_i) = \frac{W_i}{i\,\Delta}, \quad \text{for } 1 \le i \le m. $$

Figure 2: Shape of the application (time $t_i$ spent at each degree of parallelism $i$).

Therefore, without considering communication latency, the execution times on a single processor and on an infinite number of processors are

$$ T_1(W) = \sum_{i=1}^{m} \frac{W_i}{\Delta}, \qquad (2) $$

$$ T_\infty(W) = \sum_{i=1}^{m} \frac{W_i}{i\,\Delta}. \qquad (3) $$

The maximum speedup with work W and an infinite number of processors is

$$ S_\infty(W) = \frac{T_1(W)}{T_\infty(W)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} \frac{W_i}{i}}. \qquad (4) $$

Average parallelism is an important factor for speedup and efficiency, and it has been carefully examined in the literature. Average parallelism is equivalent to the maximum speedup $S_\infty$. $S_\infty$ gives the best possible speedup based on the inherent parallelism of an algorithm; no machine-dependent factors are considered. With only a limited number of available processors and with communication latency considered, the speedup will be less than the best speedup $S_\infty(W)$.

If there are $N$ processors available and $N < i$, then some processors have to do $\left\lceil \frac{i}{N} \right\rceil \frac{W_i}{i}$ work and the rest of the processors will do $\left\lfloor \frac{i}{N} \right\rfloor \frac{W_i}{i}$ work. By the definition of degree of parallelism, $W_i$ and $W_j$ cannot be executed simultaneously for $i \neq j$. Thus the elapsed time will be

$$ t_N(W_i) = \left\lceil \frac{i}{N} \right\rceil \frac{W_i}{i\,\Delta}. $$

Hence,

$$ T_N(W) = \sum_{i=1}^{m} \left\lceil \frac{i}{N} \right\rceil \frac{W_i}{i\,\Delta}, \qquad (5) $$

and the speedup is

$$ S_N(W) = \frac{T_1(W)}{T_N(W)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} \left\lceil \frac{i}{N} \right\rceil \frac{W_i}{i}}. \qquad (6) $$

Communication latency is another factor causing performance degradation. Unlike the degree of parallelism, communication latency is machine dependent. It depends on the communication network topology, the routing scheme, the adopted switching technique, and the dynamics of the network traffic. Let $Q_N(W)$ be the communication overhead when $N$ processors are used to complete $W$ amount of work. The actual formulation of $Q_N(W)$ is difficult to derive, as it depends on the communication pattern and the message sizes of the algorithm itself, as well as on the system-dependent communication latency. Note that $Q_N(W)$ is encountered only when there is more than one processor ($N > 1$). Assuming that the degree of parallelism does not change due to communication overhead, the speedup becomes

$$ S_N(W) = \frac{T_1(W)}{T_N(W)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} \left\lceil \frac{i}{N} \right\rceil \frac{W_i}{i} + Q_N(W)}, \qquad (7) $$

where the overhead $Q_N(W)$ is measured in the same work units as the $W_i$ (equivalently, $\Delta$ is normalized to 1).
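As an illustration of Eq. (7), the sketch below (not from the paper) evaluates the fixed-size speedup for a hypothetical workload; the overhead value Q is supplied by the caller, since, as noted above, its true form is algorithm and machine dependent:

```python
from math import ceil

def speedup(W, N, Q=0.0):
    """Fixed-size speedup of Eq. (7): W maps degree of parallelism i -> W_i.

    Q is the communication overhead Q_N(W) in work units; Q = 0 recovers Eq. (6).
    """
    T1 = sum(W.values())                                   # single-processor time (Delta = 1)
    TN = sum(ceil(i / N) * Wi / i for i, Wi in W.items()) + Q
    return T1 / TN

# Hypothetical workload: 10 units sequential, 30 at degree 3, 60 at degree 6.
W = {1: 10.0, 3: 30.0, 6: 60.0}
for N in (1, 2, 4, 8):
    print(N, round(speedup(W, N), 3))
```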

3 Speedup of Scaled Problems

In the last section we developed a general speedup formula and showed how the number of processors and the degradation parameters influence performance. However, speedup does not depend only on these parameters; it also depends on how we view the problem. With different points of view we get different models of speedup and different speedup formulations. One viewpoint emphasizes shortening the time it takes to solve a problem by parallel processing. With more and more computation power available, the problem can, in principle, be solved in less and less time. With more processors available, the system will provide a fast turnaround time and the user will have a shorter waiting time. A speedup formulation based on this philosophy is called fixed-size speedup. In the previous section we implicitly adopted fixed-size speedup; Eq. (7) is the speedup formula for fixed-size speedup. Fixed-size speedup is suitable for the many algorithms in which the problem size cannot be scaled.

For some applications we may have a time limitation, but we may not want to obtain the solution in the shortest possible time. If we have more computation power, we may want to increase the problem size, carry out more operations, and get a more accurate solution. Various finite difference and finite element algorithms for the solution of partial differential equations (PDEs) are typical examples of such scalable problems.

An important issue in scalable problems is the identification of the scalability constraint. One scalability constraint is to keep the execution time unchanged with respect to the uniprocessor execution time. This viewpoint leads to a different model of speedup, called fixed-time speedup. For fixed-time speedup, the workload is scaled up with the number of processors available. Let $W' = \sum_{i=1}^{m'} W'_i$ be the total amount of scaled work, where $W'_i$ is the amount of scaled work executed with degree of parallelism $i$, and let $m'$ be the maximum degree of parallelism of the scaled problem when $N$ processors are available. Note that the maximum degree of parallelism can change as the problem is scaled. In order to keep the same turnaround time as the sequential version, the condition $T_1(W) = T_N(W')$ must be satisfied for $W'$. That is, the following scalability constraint must be satisfied:

$$ \sum_{i=1}^{m'} \left\lceil \frac{i}{N} \right\rceil \frac{W'_i}{i} + Q_N(W') = \sum_{i=1}^{m} W_i. \qquad (8) $$

Thus the general speedup formula for fixed-time speedup is

$$ S'_N(W') = \frac{T_1(W')}{T_N(W')} = \frac{\sum_{i=1}^{m'} W'_i}{\sum_{i=1}^{m'} \left\lceil \frac{i}{N} \right\rceil \frac{W'_i}{i} + Q_N(W')} = \frac{\sum_{i=1}^{m'} W'_i}{\sum_{i=1}^{m} W_i}. \qquad (9) $$
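The following sketch illustrates the fixed-time constraint of Eqs. (8) and (9) under an extra assumption of ours: the scaled workload preserves the shape of the original ($W'_i = s\,W_i$) and $Q_N = 0$, so the scaling factor $s$ can be solved for directly:

```python
from math import ceil

def scale_fixed_time(W, N):
    """Find s so that T_N(W') = T_1(W) when W'_i = s * W_i (shape preserved, Q = 0).

    With Q = 0 the constraint of Eq. (8) is linear in s, so s is obtained directly.
    """
    T1 = sum(W.values())
    TN_unit = sum(ceil(i / N) * Wi / i for i, Wi in W.items())  # T_N of the unscaled W
    s = T1 / TN_unit
    Wp = {i: s * Wi for i, Wi in W.items()}
    # Fixed-time speedup, Eq. (9): total scaled work over the (fixed) time T_1.
    return s, sum(Wp.values()) / T1

W = {1: 10.0, 3: 30.0, 6: 60.0}
for N in (2, 4, 8):
    s, S = scale_fixed_time(W, N)
    print(f"N={N}: scale factor s={s:.3f}, fixed-time speedup S'={S:.3f}")
```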

In many parallel computers the memory size plays an important role in performance. Many large-scale multiprocessors with local memory do not support virtual memory, due to insufficient I/O and network bandwidth. When solving an application with one processor, the problem size is more often bounded by the memory limitation than by the execution time limitation. With more processors available, instead of keeping the execution time fixed we may want to meet the memory size constraint. In other words, if you have adequate memory space and the scaled problem meets the time limit imposed by fixed-time speedup, will you further increase the problem size to obtain an even better or more accurate solution? If the answer is yes, the appropriate model is memory-bounded speedup. Like fixed-time speedup, memory-bounded speedup is a scaled speedup: the problem size scales up with the memory size. The difference is that in fixed-time speedup execution time is the limiting factor, while in memory-bounded speedup memory size is the limiting factor.

With memory size considered as a factor of performance, the requirements of an algorithm consist of two parts. One is the computation requirement, which is the workload, and the other is the memory capacity requirement. For a given algorithm these two requirements are related to each other, and the workload can be viewed as a function of the memory requirement. Let $M$ represent the memory size of each processor, and let $g$ be a function such that $W = g(M)$, or $M = g^{-1}(W)$, where $g^{-1}$ is the inverse function of $g$. An example of the functions $g$ and $g^{-1}$ can be found in Section 5. In a homogeneous scalable parallel computer, the memory capacity on each node is fixed and the total memory available increases linearly with the number of processors. If $W = \sum_{i=1}^{m} W_i$ is the workload for execution on a single processor, the maximum scaled workload with $N$ processors, $W^* = \sum_{i=1}^{m^*} W^*_i$, must satisfy the scalability constraint

$$ W^* = g(NM) = g\!\left(N g^{-1}(W)\right), \qquad (10) $$

where $m^*$ is the maximum degree of parallelism of the scaled problem and $g$ is determined by the algorithm. The memory limitation can be stated as follows: the memory requirement for any active processor is less than or equal to $M = g^{-1}\!\left(\sum_{i=1}^{m} W_i\right)$. Here the main point is that the memory occupied on each processor is limited. With communication overhead considered, Eq. (11) is the general speedup formula for memory-bounded speedup:

$$ S^*_N(W^*) = \frac{\sum_{i=1}^{m^*} W^*_i}{\sum_{i=1}^{m^*} \left\lceil \frac{i}{N} \right\rceil \frac{W^*_i}{i} + Q_N(W^*)}. \qquad (11) $$
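A minimal sketch of Eqs. (10) and (11) for a two-part workload with $Q_N = 0$, using the pair $g(M) = (M/3)^{3/2}$, $g^{-1}(W) = 3W^{2/3}$ that Section 5 derives for matrix multiplication (the workload numbers below are hypothetical):

```python
def g(M):      # workload supported by memory M (matrix multiplication, Section 5)
    return (M / 3.0) ** 1.5

def g_inv(W):  # memory needed to hold workload W
    return 3.0 * W ** (2.0 / 3.0)

def memory_bounded_speedup(W1, WN, N):
    """Eq. (11) for a sequential part W1 plus a perfectly parallel part WN, Q = 0."""
    M = g_inv(WN)                 # memory filled by the original parallel work
    WN_star = g(N * M)            # Eq. (10): scale the parallel work to fill N * M
    return (W1 + WN_star) / (W1 + WN_star / N)

W1, WN = 10.0, 90.0
for N in (2, 4, 8, 16):
    print(N, round(memory_bounded_speedup(W1, WN, N), 3))
```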

4 Simplified Models of Speedup

The three general speedup formulations contain both the uneven-allocation and the communication-latency degradations. They give better upper bounds on the performance of parallel applications. On the other hand, these formulations are problem dependent and difficult to understand: they give detailed information for each application, but lose the global view of the possible performance gain. In this section we make some simplifying assumptions. We assume that the communication overhead is negligible, i.e., $Q_N = 0$, and that the workload only contains two parts, a sequential part and a perfectly parallel part. That is, $W_i = 0$ for $i \neq 1$ and $i \neq N$. We also assume that the sequential part is independent of the system size, i.e., $W_1 = W'_1 = W^*_1$.

Under this simplified case, the general fixed-size speedup formulation, Eq. (7), becomes

$$ S_N(W) = \frac{W_1 + W_N}{W_1 + \frac{W_N}{N}}. \qquad (12) $$

Eq. (12) is known as Amdahl's law. Figure 3 shows that when the number of processors increases, the load on each processor decreases. Eventually the sequential part dominates the performance, and the speedup is bounded by $\frac{W_1 + W_N}{W_1}$. In Figure 3, $T_1$ is the execution time for the sequential portion of the work and $T_N$ is the execution time for the parallel portion of the work.

Figure 3: Amdahl's law (left: amount of work vs. number of processors N; right: elapsed time vs. number of processors N).

For fixed-time speedup under the simplified conditions, the scalability constraint, Eq. (8), becomes

$$ W_1 + W_N = W'_1 + \frac{W'_N}{N}. $$

That is, $W'_N = N W_N$. Since $W'_1 = W_1$, Eq. (9) becomes

$$ S'_N(W') = \frac{W_1 + N W_N}{W_1 + W_N}. \qquad (13) $$

The simplified fixed-time speedup, Eq. (13), is known as Gustafson's scaled speedup [5]. From Eq. (13) we can see that the parallel portion of an application scales up linearly with the system size. The relation of workload and elapsed time for Gustafson's scaled speedup is depicted in Figure 4. We need some preparation before deriving the simplified formulation for memory-bounded speedup.

Definition 2. A function $g$ is a semihomomorphism if there exists a function $\bar{g}$ such that for any real number $c$ and any variable $x$, $g(cx) = \bar{g}(c)\, g(x)$.

Figure 4: Gustafson's scaled speedup (left: amount of work vs. number of processors N; right: elapsed time vs. number of processors N).

One class of semihomomorphisms is the power function $g(x) = x^b$, where $b$ is a rational number. In this case $\bar{g}$ is the same as the function $g$. Another class of semihomomorphisms is the single-term polynomial $g(x) = a x^b$, where $a$ is a real constant and $b$ is a rational number. For this kind of semihomomorphism, $\bar{g}(x) = x^b$, which is not the same as $g(x)$.
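A quick numerical check (our illustration) that the single-term polynomial $g(x) = a x^b$ is a semihomomorphism with $\bar{g}(c) = c^b$:

```python
a, b = 5.0, 1.5          # a single-term polynomial g(x) = a * x**b
g    = lambda x: a * x ** b
gbar = lambda c: c ** b  # the associated g-bar, independent of the constant a

for c in (2.0, 3.0, 10.0):
    for x in (1.0, 7.0, 42.0):
        assert abs(g(c * x) - gbar(c) * g(x)) < 1e-9 * g(c * x)
print("g(cx) = gbar(c) g(x) holds for g(x) = a x^b with gbar(c) = c^b")
```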

Under our assumptions, the sequential portion of the workload, $W_1$, is independent of the system size. If the influence of memory on the sequential portion is not considered, i.e., the memory capacity $M$ is used for the parallel portion only, we have the following theorem.

Theorem 1. If $W_N = g(M)$ for some semihomomorphism $g$, $g(cx) = \bar{g}(c)\,g(x)$, then, with all data being accessible by all available processors and using all available memory space, the simplified memory-bounded speedup is

$$ S^*_N(W^*) = \frac{W_1 + \bar{g}(N)\, W_N}{W_1 + \frac{\bar{g}(N)}{N}\, W_N}. \qquad (14) $$

Proof. Assume that the maximum problem size takes the maximum available memory capacity, $M$, when one processor is used. As mentioned before, when one processor is available the parallel portion of the workload, $W_N$, can be expressed as $W_N = g(M)$. Since all data are accessible by all processors, there is no need to replicate the data. With $N$ processors available, the total available memory capacity is increased to $NM$. The parallel portion of the problem can be scaled up to use all the available memory capacity $NM$. Thus the scaled parallel portion $W^*_N$ is expressed as $W^*_N = g(NM) = \bar{g}(N)\,g(M)$. Therefore $W^*_N = \bar{g}(N)\,W_N$, and

$$ S^*_N(W^*) = \frac{W_1 + W^*_N}{W_1 + \frac{W^*_N}{N}} = \frac{W_1 + \bar{g}(N)\, W_N}{W_1 + \frac{\bar{g}(N)}{N}\, W_N}. \qquad \square $$

Note that in Theorem 1 we made two assumptions in the simplified case. First, since the communication latency is ignored, remote memory accesses take the same time as local memory accesses; this implies that the data are accessible by all available processors. Second, all the available memory space is used for a better solution. These simplified speedup models are useful to demonstrate how the sequential portion of an application, $W_1$, affects the maximum speedup that can be achieved with different numbers of processors. Let $k = \frac{W_1}{W_1 + W_N}$. The simplified fixed-size speedup, fixed-time speedup, and memory-bounded speedup are, respectively,

$$ S_N(W) = \frac{N}{(N-1)\,k + 1}, \qquad (15) $$

$$ S'_N(W') = N - (N-1)\,k, \qquad (16) $$

$$ S^*_N(W^*) = \frac{(1-k)\,\bar{g}(N) + k}{(1-k)\,\frac{\bar{g}(N)}{N} + k}. \qquad (17) $$
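The three simplified formulas are easy to compare numerically. The sketch below (our illustration; the sequential fraction k is hypothetical) evaluates Eqs. (15) through (17) side by side, with $\bar{g}(N) = N^{3/2}$ as in the matrix-multiplication example of Section 5:

```python
def fixed_size(k, N):            # Eq. (15), Amdahl's law
    return N / ((N - 1) * k + 1)

def fixed_time(k, N):            # Eq. (16), Gustafson's scaled speedup
    return N - (N - 1) * k

def memory_bounded(k, N, gbar):  # Eq. (17)
    return ((1 - k) * gbar(N) + k) / ((1 - k) * gbar(N) / N + k)

k = 0.1                          # hypothetical sequential fraction
gbar = lambda N: N ** 1.5        # matrix-multiplication-like growth (Section 5)
for N in (4, 16, 64, 256):
    print(N, round(fixed_size(k, N), 2),
             round(fixed_time(k, N), 2),
             round(memory_bounded(k, N, gbar), 2))
```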

When the number of processors $N$ goes to infinity, Eq. (15) is bounded by the reciprocal of $k$, which gives the maximum value of the fixed-size speedup. Eq. (16) shows that the fixed-time speedup is a linear function of the number of processors, with slope equal to $1 - k$. When $N$ goes to infinity, this speedup can increase without bound. Memory-bounded speedup depends on the function $\bar{g}(N)$. When $\bar{g}(N) = 1$, memory-bounded speedup is the same as fixed-size speedup. When $\bar{g}(N) = N$, memory-bounded speedup is the same as fixed-time speedup. In general the function $\bar{g}(N)$ is application dependent and $\bar{g}(N) \geq N$: when the memory capacity is increased $N$ times, the amount of work usually increases $N$ times or more. It is easy to verify that $S^*_N(W^*) \geq S'_N(W')$ when $\bar{g}(N) \geq N$. Note that all data in memory are likely to be accessed at least once; thus, for scaled problems, $\bar{g}(N) < N$ is unlikely to occur. The sequential portion of the work plays different roles in the three definitions of speedup. In fixed-size speedup, the influence of the sequential portion increases with the system size and eventually dominates the performance. In fixed-time speedup, the influence of the sequential portion is unchanged, which makes the speedup a linear function of the system size. In memory-bounded speedup, since in general $\bar{g}(N) \geq N$, the influence of the sequential portion is reduced when the system size increases, indicating that a better speedup can be achieved with a larger system size.

The function $\bar{g}(N)$ provides a metric to evaluate parallel algorithms. In general, $\bar{g}(N)$ may not be derivable for a given algorithm. Note, however, that any single-term polynomial is a semihomomorphism, and most solvable algorithms have polynomial time computation and memory requirements. If we take an algorithm's computation and storage complexity, the term with the largest power, as its computation and memory requirement, then for any algorithm with polynomial complexity there exists a semihomomorphism $g$ such that $W = g(M)$. The approximating semihomomorphism $g$ will provide a good estimate of the memory-bounded speedup when the number of processors is large. More detailed case studies for the three models of speedup can be found in the literature.

Figure 5 demonstrates the difference between the three models of speedup for a fixed sequential fraction $k$ as $N$ ranges over the number of nodes. For the simplified memory-bounded (SMB) speedup we choose $\bar{g}(N) = N^{3/2}$, which is typical of many matrix operations, to be described later. When $\bar{g}(N) = N$, it is Gustafson's scaled speedup. A combined case, $G(N)$, lying between these extremes, will be studied in the next section.

Figure 5: Amdahl's law, Gustafson's speedup, and SMB speedup for a fixed $k$ (curves shown: Ideal, Amdahl's law, SMB with $\bar{g}(N)$, SMB with $G(N)$, and Gustafson's speedup; speedup vs. number of nodes).

5 Communication-Memory Trade-off

The simplified speedup formulations give the impact of the sequential portion of an application on the maximum speedup. The simplified memory-bounded speedup suggests that maximum speedup is obtained when data are shared by all processors. In practice, however, if communication overhead is considered, the data-sharing approach may not lead to the maximum speedup. In the design of efficient parallel algorithms, the communication cost plays an important role in deciding how a problem should be solved and scaled. One way to reduce the frequency of communication is to replicate some shared data on the processors. Thus a good algorithm design should consider the trade-off between the maximum size to which a problem can scale and the reduction of available memory due to the replication of shared data.

If data replication is allowed, the relation $W^* = g(NM)$ no longer holds. Motivated by Theorem 1, the function $G(N) = \frac{W^*_N}{W_N}$ is defined to represent the ratio of work increment when $N$ processors are available. In terms of $G(N)$, the simplified memory-bounded speedup is generalized below.



Theorem If W is independent of system size W for i N and W GN W for

1 i N

N

some function GN the memorybounded speedup is

W GN W

1 N



S W

N

G(N )



W W Q W

1 N N

N
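The trade-off captured by Eq. (18) can be illustrated numerically. In the sketch below (our illustration), a "local" strategy replicates data ($G(N) = N$, no communication) while a "global" strategy shares data ($G(N) = N^{3/2}$) but pays an overhead; the overhead model $Q = 2N$ is invented purely for illustration:

```python
def smb(W1, WN, N, G, Q):
    """Generalized memory-bounded speedup of Eq. (18)."""
    return (W1 + G * WN) / (W1 + G * WN / N + Q)

W1, WN = 10.0, 90.0
for N in (4, 16, 64):
    local = smb(W1, WN, N, G=N, Q=0.0)             # replicate shared data: no communication
    glob  = smb(W1, WN, N, G=N ** 1.5, Q=2.0 * N)  # share data: assumed (invented) overhead
    print(f"N={N:3d}: local {local:6.2f}, global {glob:6.2f}")
```

Under this assumed overhead, the replication strategy wins even though its G(N) is smaller, matching the observation below that the maximum speedup is not necessarily achieved when $G(N) = \bar{g}(N)$.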

The proof of Theorem 2 is similar to the proof of Theorem 1. Eq. (18) shows that the maximum speedup is not necessarily achieved when $G(N) = \bar{g}(N)$. Note that the communication cost $Q_N(W^*)$ is a unified communication cost. An optimal choice of the function $G(N)$ is both algorithm and architecture dependent and, in general, is difficult to obtain. Also, unlike $\bar{g}(N)$, $G(N)$ might be less than $N$. If $G(N) > N$, memory capacity is likely to be the scalability constraint when $N$ is large; if $G(N) < N$, execution time is likely to be the scalability constraint. The function $G(N)$ thus indicates the likely scalability constraint of an algorithm. The proposed scaled speedup, Eq. (18), may not be easy to fully understand at first glance; hence we use matrix multiplication as an example to illustrate it.

A matrix often represents some discretized continuum, and enlarging the matrix size generally leads to a more accurate solution for the continuum. For matrix multiplication, $C = AB$, there are many ways to partition the matrices $A$ and $B$ to allow parallel processing. Assume that there are $N$ processors available and that $A$ and $B$ are $n \times n$ matrices when executing on a single processor. The computation requirement is $n^3$ and the memory requirement is roughly $3n^2$ (for $A$, $B$, and the result $C$). Thus $W_N = n^3$ and $M = 3n^2$. Two extreme cases of memory-bounded scaled speedup are considered.

Local Computation

In the first case, we assume that the communication cost is extremely high. Thus data should be replicated, if possible, to reduce communication. This can be achieved by partitioning the columns of matrix $B$ into $N$ submatrices $B_0, B_1, \ldots, B_{N-1}$ and replicating the matrix $A$. Thus the $B_i$'s are distributed among all the processors and matrix $A$ is replicated on each processor. Processor $i$ does the multiplication $A B_i = C_i$, $0 \leq i \leq N-1$, independently. Since there is no need for communication, this is referred to as the local computation approach. Figure 6(a) shows the partitioning of $B$ for the case $N = 4$.

Figure 6: Two partitioning schemes of matrices A and B. (a) The matrix B is partitioned: $A\,[B_0\; B_1\; B_2\; B_3] = [C_0\; C_1\; C_2\; C_3]$. (b) Both matrices A and B are partitioned, giving blocks $C_{ij} = A_i B_j$.

If both $A$ and $B$ are allowed to scale along any dimension, and $A$ and $B$ need not be square matrices, the enlarged problem is $A^* B^* = C^*$, where $A^*$ is an $l \times k$ matrix, $B^*$ is a $k \times m$ matrix, and the resulting matrix $C^*$ is an $l \times m$ matrix. Note that the local memory capacity is $M = 3n^2$. It is easy to see that the maximum memory-bounded speedup will be achieved when $l = k = n$ and $m = nN$. In other words, both $B$ and $C$ are scaled up $N$ times along their rows, and $A$ is replicated but not scaled. The amount of computation on each processor is fixed at $n^3$, and $W^*_N = N W_N$. Thus we have $G(N) = N$, and the memory-bounded scaled speedup is

$$ S^*_N(W^*) = \frac{W_1 + N W_N}{W_1 + W_N}, \qquad (19) $$

which is Gustafson's scaled speedup. Thus the best performance of memory-bounded speedup using the local computation model is the same as Gustafson's scaled speedup. In general, the local computation model leads to a speedup less than Gustafson's scaled speedup. For example, if both $A$ and $B$ are restricted to square matrices, the function $G(N)$ becomes

$$ G(N) = \left( \frac{3N}{N+2} \right)^{3/2} \qquad (20) $$

(see Appendix), which is less than $N$ for $N > 1$ and is bounded by $3^{3/2}$. Note that, due to data replication, the memory capacity requirement increases faster than the computation requirement does.
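A few sample values (our quick check) show how Eq. (20) saturates:

```python
G_local = lambda N: (3 * N / (N + 2)) ** 1.5   # Eq. (20), square matrices, A replicated
for N in (1, 2, 8, 64, 1024):
    print(N, round(G_local(N), 3))             # 1.0, 1.837, 3.718, 4.962, 5.181
# The values approach but never reach 3**1.5 ~ 5.196: replication consumes
# the added memory, so the workload stops growing with N.
```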

Global Computation

In the second extreme case, we assume that the communication cost is negligible. Thus there is no need to replicate the data, and a bigger problem can be solved. We partition matrix $A$ into $N$ row blocks and $B$ into $N$ column blocks (see Figure 6(b)). By assigning each pair of submatrices $A_i$ and $B_i$ to one processor, initially all main diagonal blocks of $C$ can be computed. Then the row blocks of $A$ are rotated from one processor to another after each row-column submatrix multiplication. With $N$ processors, $N$ rotations are needed to finish the computation, as shown in Figure 7 for the case $N = 4$. This method is referred to as global computation.
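The rotation scheme is easy to prototype. The sketch below (our single-machine simulation; block counts and matrices are arbitrary) reproduces the schedule of Figure 7 and checks it against a direct product:

```python
import numpy as np

def global_matmul(A, B, N):
    """Simulate the rotation scheme of Figure 7 on one machine.

    A is split into N row blocks and B into N column blocks; "processor" i
    keeps B_i fixed while the row blocks of A rotate past it, so after N
    steps every block C_ji = A_j B_i has been computed.
    """
    rows = np.array_split(np.arange(A.shape[0]), N)   # row index sets of the A_j
    cols = np.array_split(np.arange(B.shape[1]), N)   # column index sets of the B_i
    C = np.zeros((A.shape[0], B.shape[1]))
    for step in range(N):                             # N rotations
        for i in range(N):                            # each i would run on its own processor
            j = (i + step) % N                        # which A block processor i now holds
            C[np.ix_(rows[j], cols[i])] = A[rows[j], :] @ B[:, cols[i]]
    return C

rng = np.random.default_rng(0)
A, B = rng.random((8, 8)), rng.random((8, 8))
assert np.allclose(global_matmul(A, B, 4), A @ B)
print("rotation result matches A @ B")
```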

For the global computation approach, the maximum scaled speedup is achieved when $l = k = m = n\sqrt{N}$ (see Appendix):

$$ S^*_N(W^*) = \frac{W_1 + N^{3/2}\, W_N}{W_1 + N^{1/2}\, W_N}. \qquad (21) $$

The corresponding function is $G(N) = N^{3/2}$. Since $M = 3n^2$, we can write $W_N$ as a function of $M$ as follows:

$$ W_N = g(M) = \left( \frac{M}{3} \right)^{3/2}. \qquad (22) $$

Increasing the total memory capacity to $NM$, we have

$$ W^*_N = g(NM) = \left( \frac{NM}{3} \right)^{3/2} = N^{3/2} \left( \frac{M}{3} \right)^{3/2} = N^{3/2}\, W_N = \bar{g}(N)\, W_N. \qquad (23) $$

The matrix multiplication problem thus has a semihomomorphism between its memory requirement and its computation requirement, with $\bar{g}(N) = N^{3/2}$. Assuming a negligible communication cost, the global computation approach achieves the best possible scaled speedup of the matrix multiplication problem.

Figure 7: Matrix multiplication without data replication. In step $s$ ($s = 1, \ldots, 4$ for $N = 4$), processor $i$ holds the fixed column block $B_i$ and the rotating row block $A_{(i+s-1) \bmod N}$, computing one block of $C$ per step.

We have studied two extreme cases of memory-bounded scaled speedup, based on global computation and on local computation. In general, for most algorithms, part of the data

may be replicated and part of the data may have to be shared. Deriving a speedup formulation for these algorithms is difficult, not only because we face a more complicated situation, but also because the ratio between replicated and shared data is uncertain. The replicated part may not increase as the system size is increased; and if the replicated part does increase, its rate of increase may differ from that of the shared part. Also, an algorithm may start with global computation and, when the system size is increased, replication may be needed as part of the effort to reduce communication overhead. A special combined case of $G(N)$ has been studied carefully; the structure of that study can be used as a guideline for other algorithms.

The influence of communication overhead on the best performance of memory-bounded speedup has been studied above. The study can be extended to fixed-time speedup, where redundant computation could be introduced to reduce the communication overhead. The function $G(N)$ determines the actual achieved speedup. We have shown how the partitioning and scaling of the problem influence the function $G(N)$; in general, finding an optimal function $G(N)$ is a nonlinear optimization problem. The concept of the function $G(N)$ can also be extended to algorithms with multiple degrees of parallelism.

6 Conclusions

It is known that the performance of parallel processing is influenced by the inherent parallelism and communication requirement of the algorithm, by the computation and communication power of the underlying architecture, and by the memory capacity of the parallel computer system. However, how these factors relate to each other and how they influence the performance of parallel processing is generally unknown. Discovering the answers to these unknowns is important for designing efficient parallel algorithms. In this paper one model of speedup, memory-bounded speedup, has been studied carefully. The model contains these factors as its parameters.

As part of the study on performance, two other models of speedup have also been studied: fixed-size speedup and fixed-time speedup. Two sets of speedup formulations have been derived for these two models of speedup and for memory-bounded speedup. The formulations in the first set are generalized speedup formulas. The second set of formulations considers only a special, simplified case. The simplified fixed-size speedup is Amdahl's law; the simplified fixed-time speedup is Gustafson's scaled speedup; and the simplified memory-bounded speedup contains both Amdahl's law and Gustafson's scaled speedup as special cases.

The three models of speedup, fixed-size speedup, fixed-time speedup, and memory-bounded speedup, are based on different viewpoints and are suitable for different classes of algorithms. However, algorithms exist which do not fit any single one of these models but satisfy some combination of them.

Appendix

When communication does not occur (local computation) or its cost is negligible, the memory-bounded speedup equation becomes

$$ S^*_N = \frac{W_1 + G(N)\, W_N}{W_1 + \frac{G(N)}{N}\, W_N}. \qquad (A.1) $$



It is easy to verify that S increases with the function GN Thus for the two extreme cases

N

considered in Section the problem of how to reach the maximum sp eedup b ecomes how to scale

the matrix A and B such that the function GN reaches its maximum value The matrix A and

B can b e scaled in any dimension A general scaled matrix multiplication problem is

A B C

lk k m lm

where b oth A and B are rectangular matrices To achieve an optimal sp eedup we need to decide

the integers l k and m for which that the function GN reaches the maximum value The

following result gives the optimal l k and m for the global computation approach Fig b

given in Section Recall that N is the numb er of pro cessors

Proposition 1. If $A$ and $B$ are $n \times n$ matrices when $N = 1$, then the global computation approach reaches the maximum $G(N)$ when $l = k = m = n\sqrt{N}$, excluding the communication cost. The corresponding $G(N)$ equals $N^{3/2}$, and the maximum speedup is

$$ S^*_N = \frac{W_1 + N^{3/2}\, W_N}{W_1 + N^{1/2}\, W_N}. $$

Proof. By the partition scheme of the global computation approach, the rows of matrix $A$ and the columns of matrix $B$ are distributed among the processors. The workload on each processor is

$$ A_{\frac{l}{N} \times k}\; B_{k \times \frac{m}{N}} = C_{\frac{l}{N} \times \frac{m}{N}}. $$

Since the memory is fully filled,

$$ \frac{lk}{N} + \frac{km}{N} + \frac{lm}{N} = 3n^2. \qquad (A.3) $$

Thus

$$ k = \frac{3n^2 N - lm}{l + m}. \qquad (A.4) $$

The work of the scaled problem is

$$ W^* = lmk = \frac{lm\,(3n^2 N - lm)}{l + m}, \qquad (A.5) $$

and

$$ G(N) = \frac{W^*}{n^3} = \frac{lm\,(3n^2 N - lm)}{(l + m)\, n^3}. \qquad (A.6) $$

Therefore $G(N)$ reaches its maximum value if and only if the function

$$ f(l, m) = \frac{lm\,(3n^2 N - lm)}{l + m} \qquad (A.7) $$

reaches its maximum value. At the maximum, the partial derivatives of $f(l, m)$ satisfy

$$ f'_l(l, m) = \frac{m^2\,(3n^2 N - l^2 - 2lm)}{(l + m)^2} = 0, $$

$$ f'_m(l, m) = \frac{l^2\,(3n^2 N - m^2 - 2lm)}{(l + m)^2} = 0. $$

This leads to

$$ l^2 + 2lm = 3n^2 N, \qquad (A.8) $$

$$ m^2 + 2lm = 3n^2 N. \qquad (A.9) $$

Subtracting one equation from the other gives $m^2 = l^2$, i.e.,

$$ l = m. \qquad (A.10) $$

Combining Eq. (A.8) and Eq. (A.10), we get

$$ l = m = n\sqrt{N}. \qquad (A.11) $$

From Eq. (A.4) we then have $k = n\sqrt{N}$. Thus the enlarged $A$ and $B$ are still square matrices, with dimension $n\sqrt{N}$. By Eq. (A.6) the maximum $G(N)$ is

$$ G(N) = \frac{n\sqrt{N} \cdot n\sqrt{N}\,(3n^2 N - n^2 N)}{2n\sqrt{N} \cdot n^3} = N^{3/2}, $$

which is equal to the memory-work function $\bar{g}(N)$ for the matrix multiplication problem (see Section 5), and the corresponding speedup is

$$ S^*_N = \frac{W_1 + N^{3/2}\, W_N}{W_1 + N^{1/2}\, W_N}. $$

From Theorem 1, this is the best possible performance for the matrix multiplication problem. $\square$
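The optimum of Proposition 1 is easy to confirm numerically (our check): eliminating $k$ through Eq. (A.4) and grid-searching over $l$ and $m$ recovers $l = m = n\sqrt{N}$ and $G(N) = N^{3/2}$:

```python
from math import sqrt

def G(l, m, n, N):
    """G(N) of Eq. (A.6), with k eliminated via the memory constraint (A.4)."""
    k = (3 * n * n * N - l * m) / (l + m)
    return l * m * k / n ** 3 if k > 0 else 0.0

n, N = 10, 4
best = max((G(l, m, n, N), l, m) for l in range(1, 100) for m in range(1, 100))
print("grid optimum (G, l, m):", best)          # expect l = m = n*sqrt(N) = 20
print("closed form: l = m =", n * sqrt(N), ", G =", N ** 1.5)
```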

Using arguments similar to those in Proposition 1, we can find that the optimal dimensions for the local computation approach are $l = k = n$, $m = nN$, and the maximum value of $G(N)$ is $N$ (see Section 5). The scalability of the matrices $A$ and $B$ is application dependent. If $A$ and $B$ must be maintained as square matrices, the following proposition shows the limitation of the local computation approach.

Proposition 2. If $A$ and $B$ are $n \times n$ matrices when $N = 1$ and $l = k = m$ is required, then the maximum value of $G(N)$ for the local computation approach is $\left( \frac{3N}{N+2} \right)^{3/2}$, which is bounded by $3^{3/2}$ and is smaller than $N$ for $N > 1$.

Proof. When $A$ and $B$ are square matrices, the scaled problem is

$$ A_{k \times k}\; B_{k \times k} = C_{k \times k}. $$

If the load is balanced on each processor and $m = \frac{k}{N}$ is an integer, then each processor does the work

$$ A_{k \times k}\; B_{k \times m} = C_{k \times m}. $$

When the memory is fully used,

$$ k^2 + 2km = 3n^2. \qquad (A.12) $$

Since $m = \frac{k}{N}$,

$$ k^2 \left( 1 + \frac{2}{N} \right) = 3n^2. $$

Thus

$$ k = n \sqrt{\frac{3N}{N+2}}. \qquad (A.13) $$

The scaled work is

$$ W^* = k^3 = \left( \frac{3N}{N+2} \right)^{3/2} n^3 = \left( \frac{3N}{N+2} \right)^{3/2} W_N, $$

and

$$ G(N) = \left( \frac{3N}{N+2} \right)^{3/2}. \qquad (A.14) $$

Since $\frac{3N}{N+2} < 3$ for all $N \geq 1$, and since $\left( \frac{3N}{N+2} \right)^{3/2} < N$ if and only if $27N < (N+2)^3$, which holds for $N > 1$, the function $G(N)$ is bounded by $3^{3/2}$ and is smaller than $N$ for $N > 1$. $\square$

References

[1] Amdahl, G. Validity of the single-processor approach to achieving large scale computing capabilities. In Proc. AFIPS Conf.

[2] Barton, M., and Withers, G. Computing performance as a function of the speed, quantity, and cost of the processors. In Proc. Supercomputing.

[3] Dunigan, T. Performance of the Intel iPSC and NCUBE hypercubes. Parallel Computing (Dec.).

[4] Eager, D., Zahorjan, J., and Lazowska, E. Speedup versus efficiency in parallel systems. IEEE Transactions on Computers (March).

[5] Gustafson, J. Reevaluating Amdahl's law. Communications of the ACM (May).

[6] Gustafson, J., Montry, G., and Benner, R. Development of parallel methods for a 1024-processor hypercube. SIAM J. of Sci. and Stat. Computing (July).

[7] Gustafson, J., Rover, D., Elbert, S., and Carter, M. The design of a scalable, fixed-time computer benchmark. J. of Parallel and Distributed Computing.

[8] Karp, A. H., and Flatt, H. P. Measuring parallel processor performance. Communications of the ACM (May).

[9] Kumar, V., and Gupta, A. Analysis of scalability of parallel algorithms and architectures: A survey. In Proc. of Int'l Conf. on Supercomputing (June).

[10] Moler, C. Matrix computation on distributed memory multiprocessors. In Proc. of First Conf. on Hypercube Multiprocessors.

[11] Ni, L., and King, C. On partitioning and mapping for hypercube computing. International Journal of Parallel Programming.

[12] Sevcik, K. Characterizations of parallelism in applications and their use in scheduling. In Proc. of ACM SIGMETRICS and Performance (May).

[13] Sun, X.-H. Parallel computation models: Representation, analysis and applications. Ph.D. Dissertation, Computer Science Department, Michigan State University.

[14] Sun, X.-H., and Gustafson, J. Toward a better parallel performance metric. Parallel Computing (Dec.).

[15] Sun, X.-H., and Ni, L. Another view on parallel speedup. In Proc. of Supercomputing '90, New York, NY.

[16] Sun, X.-H., and Rover, D. Scalability of parallel algorithm-machine combinations. Technical Report, Ames Laboratory, U.S. Department of Energy. Accepted to appear in IEEE TPDS.

[17] Ware, W. The ultimate computer. IEEE Spectrum.

[18] Worley, P. T. The effect of time constraints on scaled speedup. SIAM J. of Sci. and Stat. Computing (Sept.).