Scalable Problems and Memory-Bounded Speedup

Xian-He Sun
ICASE, Mail Stop 132C
NASA Langley Research Center
Hampton, VA
sun@icase.edu

Lionel M. Ni
Computer Science Department
Michigan State University
East Lansing, MI
ni@cps.msu.edu

Abstract

In this paper three models of parallel speedup are studied. They are fixed-size speedup, fixed-time speedup, and memory-bounded speedup. The latter two consider the relationship between speedup and problem scalability. Two sets of speedup formulations are derived for these three models. One set considers uneven workload allocation and communication overhead and gives more accurate estimation. Another set considers a simplified case and provides a clear picture of the impact of the sequential portion of an application on the possible performance gain from parallel processing. The simplified fixed-size speedup is Amdahl's law. The simplified fixed-time speedup is Gustafson's scaled speedup. The simplified memory-bounded speedup contains both Amdahl's law and Gustafson's scaled speedup as special cases. This study leads to a better understanding of parallel processing.

Running Head: Scalable Problems and Memory-Bounded Speedup



This research was supported in part by the NSF grant ECS and by the National Aeronautics and Space Administration under NASA contract NAS.


1 Introduction

Although parallel processing has become a common approach for achieving high performance, there is no well-established metric to measure the performance gain of parallel processing. The most commonly used performance metric for parallel processing is speedup, which gives the performance gain of parallel processing versus sequential processing. Traditionally, speedup is defined as the ratio of uniprocessor execution time to execution time on a parallel processor. There are different ways to define the metric execution time. In fixed-size speedup, the amount of work to be executed is independent of the number of processors. Based on this model, Ware [17] summarized Amdahl's arguments to define a speedup formula which is known as Amdahl's law [1]. However, in many applications the amount of work to be performed increases as the number of processors increases, in order to obtain a more accurate or better result. The concept of scaled speedup was proposed by Gustafson et al. at Sandia National Laboratory [6]. Based on this concept, Gustafson suggested a fixed-time speedup [5], which fixes the execution time and is interested in how the problem size can be scaled up. In scaled speedup, both sequential and parallel execution times are measured based on the same amount of work defined by the scaled problem.

Both Amdahl's law and Gustafson's scaled speedup use a single parameter, the sequential portion of an application, to characterize the application. They are simple and give much insight into the potential degradation of parallelism as more processors become available. Amdahl's law has a fixed problem size and is interested in how small the response time could be. It suggests that parallel processing may not gain high speedup. Gustafson approaches the problem from another point of view. He fixes the response time and is interested in how large a problem could be solved within this time. This paper further investigates the scalability of problems. While Gustafson's scalable problems are constrained by the execution time, the capacity of main memory is also a critical metric. For parallel computers, especially for distributed-memory multiprocessors, the size of scalable problems is often determined by the memory available. Shortage of memory is paid for in problem solution time, due to the I/O or message-passing delays, and in programmer time, due to the additional coding required to multiplex the distributed memory. For many applications the amount of memory is an important constraint on scaling problem size. Thus, memory-bounded speedup is the major focus of this paper.

We first study three models of speedup: fixed-size speedup, fixed-time speedup, and memory-bounded speedup. With both uneven workload allocation and communication overhead considered, speedup formulations will be derived for all three models. When communication overhead is not considered and the workload only consists of sequential and perfectly parallel portions, the simplified fixed-size speedup is Amdahl's law, the simplified fixed-time speedup is Gustafson's scaled speedup, and the simplified memory-bounded speedup contains both Amdahl's law and Gustafson's speedup as special cases. Therefore, the three models of speedup, which represent different points of view, are unified.

Based on the concept of scaled speedup, intensive research has been conducted in recent years in the area of performance evaluation. Some other definitions of speedup have also been proposed, such as generalized speedup, cost-related speedup, and superlinear speedup; interested readers can refer to the literature for details.

This paper is organized as follows. In Section 2 we introduce the program model and some basic terminology. More general speedup formulations for the three models of speedup are presented in Section 3. Speedup formulations for simplified cases are studied in Section 4. The influence of the communication-memory trade-off is studied in Section 5. Conclusions and comments are given in Section 6.

2 A Model of Parallel Speedup

To measure different speedup metrics for scalable problems, the underlying machine is assumed to be a scalable multiprocessor. A multiprocessor is considered scalable if, as the number of processors increases, the memory capacity and network bandwidth also increase. Furthermore, all processors are assumed to be homogeneous. Most distributed-memory multiprocessors and multicomputers, such as commercial hypercube and mesh-connected computers, are scalable multiprocessors. Both message-passing and shared-memory programming paradigms have been used in such multiprocessors. To simplify the discussion, our study assumes homogeneous distributed-memory architectures.

The parallelism in an application can be characterized in different ways for different purposes. For simplicity, speedup formulations generally use very few parameters and consider very high-level characterizations of the parallelism. We consider two main degradations of parallelism: uneven allocation (load imbalance) and communication latency. The former degradation is application dependent. The latter depends on both the application and the parallel computer under consideration. To obtain an accurate estimate, both degradations need to be considered. Uneven allocation is measured by the degree of parallelism.

Definition 1. The degree of parallelism of a program is an integer which indicates the maximum number of processors that can be busy computing at a particular instant in time, given an unbounded number of available processors.

The degree of parallelism is a function of time. By drawing the degree of parallelism over the execution time of an application, a graph can be obtained. We refer to this graph as the parallelism profile. Figure 1 is the parallelism profile of a hypothetical divide-and-conquer computation. By accumulating the time spent at each degree of parallelism, the profile can be rearranged to form the shape (see Figure 2) of the application.

Figure 1: Parallelism profile of an application (degree of parallelism plotted over the execution time T).
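To make the profile-to-shape rearrangement concrete, the following sketch (our illustration, with a hypothetical sampled profile) accumulates the time spent at each degree of parallelism and recovers the per-degree work under unit computing capacity:

```python
from collections import Counter

# Hypothetical sampled profile: the degree of parallelism at each unit time
# step of a divide-and-conquer run (split, parallel work, then merge).
profile = [1, 2, 4, 4, 4, 2, 2, 1]

# The "shape": total time t_i spent at each degree of parallelism i.
shape = Counter(profile)
for degree in sorted(shape, reverse=True):
    print(f"degree {degree}: t_{degree} = {shape[degree]} time units")

# With computing capacity Delta = 1, the work done at degree i is W_i = i * t_i.
work = {i: i * t for i, t in shape.items()}
print("W_i:", work, " total W =", sum(work.values()))
```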

Let W be the amount of work of an application. Work can be defined as arithmetic operations, instructions, or whatever is needed to complete the application. Formally, the speedup with N processors and with the total amount of work W is defined as

$$ S_N(W) = \frac{T_1(W)}{T_N(W)}, \qquad (1) $$

where $T_i(W)$ is the time required to complete $W$ amount of work on $i$ processors. Let $W_i$ be the amount of work executed with degree of parallelism $i$, and let $m$ be the maximum degree of parallelism. Thus $W = \sum_{i=1}^{m} W_i$. Assuming each computation takes a constant time to finish on a given processor, the execution time for computing $W_i$ with a single processor is

$$ t_1(W_i) = \frac{W_i}{\Delta}, $$

where $\Delta$ is the computing capacity of each processor. If there are $i$ processors available, the execution time is

$$ t_i(W_i) = \frac{W_i}{i\,\Delta}. $$

With an infinite number of processors available, the execution time will not be further decreased and is

$$ t_\infty(W_i) = \frac{W_i}{i\,\Delta}, \quad \text{for } 1 \le i \le m. $$

Figure 2: Shape of the application (time $t_i$ spent at each degree of parallelism $i$).

Therefore, without considering communication latency, the execution times on a single processor and on an infinite number of processors are

$$ T_1(W) = \sum_{i=1}^{m} \frac{W_i}{\Delta}, \qquad (2) $$

$$ T_\infty(W) = \sum_{i=1}^{m} \frac{W_i}{i\,\Delta}. \qquad (3) $$

The maximum speedup with work W and an infinite number of processors is

$$ S_\infty(W) = \frac{T_1(W)}{T_\infty(W)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} \frac{W_i}{i}}. \qquad (4) $$

Average parallelism is an important factor for speedup and efficiency, and it has been carefully examined in the literature. Average parallelism is equivalent to the maximum speedup $S_\infty$. $S_\infty$ gives the best possible speedup based on the inherent parallelism of an algorithm; no machine-dependent factors are considered. With only a limited number of available processors and with communication latency considered, the speedup will be less than the best speedup $S_\infty(W)$.

If there are $N$ processors available and $N < i$, then some processors have to do $\left\lceil \frac{i}{N} \right\rceil \frac{W_i}{i}$ work and the rest of the processors will do $\left\lfloor \frac{i}{N} \right\rfloor \frac{W_i}{i}$ work. By the definition of degree of parallelism, $W_i$ and $W_j$ cannot be executed simultaneously for $i \neq j$. Thus the elapsed time will be

$$ t_N(W_i) = \left\lceil \frac{i}{N} \right\rceil \frac{W_i}{i\,\Delta}. $$

Hence,

$$ T_N(W) = \sum_{i=1}^{m} \left\lceil \frac{i}{N} \right\rceil \frac{W_i}{i\,\Delta}, \qquad (5) $$

and the speedup is

$$ S_N(W) = \frac{T_1(W)}{T_N(W)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} \left\lceil \frac{i}{N} \right\rceil \frac{W_i}{i}}. \qquad (6) $$

Communication latency is another factor causing performance degradation. Unlike the degree of parallelism, communication latency is machine dependent. It depends on the communication network topology, the routing scheme, the adopted switching technique, and the dynamics of the network traffic. Let $Q_N(W)$ be the communication overhead when $N$ processors are used to complete $W$ amount of work. The actual formulation of $Q_N(W)$ is difficult to derive, as it depends on the communication pattern and the message sizes of the algorithm itself, as well as on the system-dependent communication latency. Note that $Q_N(W)$ is encountered only when there is more than one processor ($N > 1$). Assuming that the degree of parallelism does not change due to communication overhead, the speedup becomes

$$ S_N(W) = \frac{T_1(W)}{T_N(W)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} \left\lceil \frac{i}{N} \right\rceil \frac{W_i}{i} + Q_N(W)}, \qquad (7) $$

where the overhead $Q_N(W)$ is measured in the same work units as the $W_i$ (equivalently, $\Delta$ is normalized to 1).
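As an illustration of Eq. (7), the sketch below (not from the paper) evaluates the fixed-size speedup for a hypothetical workload; the overhead value Q is supplied by the caller, since, as noted above, its true form is algorithm and machine dependent:

```python
from math import ceil

def speedup(W, N, Q=0.0):
    """Fixed-size speedup of Eq. (7): W maps degree of parallelism i -> W_i.

    Q is the communication overhead Q_N(W) in work units; Q = 0 recovers Eq. (6).
    """
    T1 = sum(W.values())                                   # single-processor time (Delta = 1)
    TN = sum(ceil(i / N) * Wi / i for i, Wi in W.items()) + Q
    return T1 / TN

# Hypothetical workload: 10 units sequential, 30 at degree 3, 60 at degree 6.
W = {1: 10.0, 3: 30.0, 6: 60.0}
for N in (1, 2, 4, 8):
    print(N, round(speedup(W, N), 3))
```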

3 Speedup of Scaled Problems

In the last section we developed a general speedup formula and showed how the number of processors and the degradation parameters influence performance. However, speedup does not depend only on these parameters; it also depends on how we view the problem. With different points of view we get different models of speedup and different speedup formulations. One viewpoint emphasizes shortening the time it takes to solve a problem by parallel processing. With more and more computation power available, the problem can, in principle, be solved in less and less time. With more processors available, the system will provide a fast turnaround time and the user will have a shorter waiting time. A speedup formulation based on this philosophy is called fixed-size speedup. In the previous section we implicitly adopted fixed-size speedup; Eq. (7) is the speedup formula for fixed-size speedup. Fixed-size speedup is suitable for the many algorithms in which the problem size cannot be scaled.

For some applications we may have a time limitation, but we may not want to obtain the solution in the shortest possible time. If we have more computation power, we may want to increase the problem size, carry out more operations, and get a more accurate solution. Various finite difference and finite element algorithms for the solution of partial differential equations (PDEs) are typical examples of such scalable problems.

An important issue in scalable problems is the identification of the scalability constraint. One scalability constraint is to keep the execution time unchanged with respect to the uniprocessor execution time. This viewpoint leads to a different model of speedup, called fixed-time speedup. For fixed-time speedup, the workload is scaled up with the number of processors available. Let $W' = \sum_{i=1}^{m'} W'_i$ be the total amount of scaled work, where $W'_i$ is the amount of scaled work executed with degree of parallelism $i$, and let $m'$ be the maximum degree of parallelism of the scaled problem when $N$ processors are available. Note that the maximum degree of parallelism can change as the problem is scaled. In order to keep the same turnaround time as the sequential version, the condition $T_1(W) = T_N(W')$ must be satisfied for $W'$. That is, the following scalability constraint must be satisfied:

$$ \sum_{i=1}^{m'} \left\lceil \frac{i}{N} \right\rceil \frac{W'_i}{i} + Q_N(W') = \sum_{i=1}^{m} W_i. \qquad (8) $$

Thus the general speedup formula for fixed-time speedup is

$$ S'_N(W') = \frac{T_1(W')}{T_N(W')} = \frac{\sum_{i=1}^{m'} W'_i}{\sum_{i=1}^{m'} \left\lceil \frac{i}{N} \right\rceil \frac{W'_i}{i} + Q_N(W')} = \frac{\sum_{i=1}^{m'} W'_i}{\sum_{i=1}^{m} W_i}. \qquad (9) $$
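The following sketch illustrates the fixed-time constraint of Eqs. (8) and (9) under an extra assumption of ours: the scaled workload preserves the shape of the original ($W'_i = s\,W_i$) and $Q_N = 0$, so the scaling factor $s$ can be solved for directly:

```python
from math import ceil

def scale_fixed_time(W, N):
    """Find s so that T_N(W') = T_1(W) when W'_i = s * W_i (shape preserved, Q = 0).

    With Q = 0 the constraint of Eq. (8) is linear in s, so s is obtained directly.
    """
    T1 = sum(W.values())
    TN_unit = sum(ceil(i / N) * Wi / i for i, Wi in W.items())  # T_N of the unscaled W
    s = T1 / TN_unit
    Wp = {i: s * Wi for i, Wi in W.items()}
    # Fixed-time speedup, Eq. (9): total scaled work over the (fixed) time T_1.
    return s, sum(Wp.values()) / T1

W = {1: 10.0, 3: 30.0, 6: 60.0}
for N in (2, 4, 8):
    s, S = scale_fixed_time(W, N)
    print(f"N={N}: scale factor s={s:.3f}, fixed-time speedup S'={S:.3f}")
```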

In many parallel computers the memory size plays an important role in performance. Many large-scale multiprocessors with local memory do not support virtual memory, due to insufficient I/O and network bandwidth. When solving an application with one processor, the problem size is more often bounded by the memory limitation than by the execution time limitation. With more processors available, instead of keeping the execution time fixed we may want to meet the memory size constraint. In other words, if you have adequate memory space and the scaled problem meets the time limit imposed by fixed-time speedup, will you further increase the problem size to obtain an even better or more accurate solution? If the answer is yes, the appropriate model is memory-bounded speedup. Like fixed-time speedup, memory-bounded speedup is a scaled speedup: the problem size scales up with the memory size. The difference is that in fixed-time speedup execution time is the limiting factor, while in memory-bounded speedup memory size is the limiting factor.

With memory size considered as a factor of performance, the requirements of an algorithm consist of two parts. One is the computation requirement, which is the workload, and the other is the memory capacity requirement. For a given algorithm these two requirements are related to each other, and the workload can be viewed as a function of the memory requirement. Let $M$ represent the memory size of each processor, and let $g$ be a function such that $W = g(M)$, or $M = g^{-1}(W)$, where $g^{-1}$ is the inverse function of $g$. An example of the functions $g$ and $g^{-1}$ can be found in Section 5. In a homogeneous scalable parallel computer, the memory capacity on each node is fixed and the total memory available increases linearly with the number of processors. If $W = \sum_{i=1}^{m} W_i$ is the workload for execution on a single processor, the maximum scaled workload with $N$ processors, $W^* = \sum_{i=1}^{m^*} W^*_i$, must satisfy the scalability constraint

$$ W^* = g(NM) = g\!\left(N g^{-1}(W)\right), \qquad (10) $$

where $m^*$ is the maximum degree of parallelism of the scaled problem and $g$ is determined by the algorithm. The memory limitation can be stated as follows: the memory requirement for any active processor is less than or equal to $M = g^{-1}\!\left(\sum_{i=1}^{m} W_i\right)$. Here the main point is that the memory occupied on each processor is limited. With communication overhead considered, Eq. (11) is the general speedup formula for memory-bounded speedup:

$$ S^*_N(W^*) = \frac{\sum_{i=1}^{m^*} W^*_i}{\sum_{i=1}^{m^*} \left\lceil \frac{i}{N} \right\rceil \frac{W^*_i}{i} + Q_N(W^*)}. \qquad (11) $$
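A minimal sketch of Eqs. (10) and (11) for a two-part workload with $Q_N = 0$, using the pair $g(M) = (M/3)^{3/2}$, $g^{-1}(W) = 3W^{2/3}$ that Section 5 derives for matrix multiplication (the workload numbers below are hypothetical):

```python
def g(M):      # workload supported by memory M (matrix multiplication, Section 5)
    return (M / 3.0) ** 1.5

def g_inv(W):  # memory needed to hold workload W
    return 3.0 * W ** (2.0 / 3.0)

def memory_bounded_speedup(W1, WN, N):
    """Eq. (11) for a sequential part W1 plus a perfectly parallel part WN, Q = 0."""
    M = g_inv(WN)                 # memory filled by the original parallel work
    WN_star = g(N * M)            # Eq. (10): scale the parallel work to fill N * M
    return (W1 + WN_star) / (W1 + WN_star / N)

W1, WN = 10.0, 90.0
for N in (2, 4, 8, 16):
    print(N, round(memory_bounded_speedup(W1, WN, N), 3))
```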

4 Simplified Models of Speedup

The three general speedup formulations contain both the uneven-allocation and the communication-latency degradations. They give better upper bounds on the performance of parallel applications. On the other hand, these formulations are problem dependent and difficult to understand: they give detailed information for each application, but lose the global view of the possible performance gain. In this section we make some simplifying assumptions. We assume that the communication overhead is negligible, i.e., $Q_N = 0$, and that the workload only contains two parts, a sequential part and a perfectly parallel part. That is, $W_i = 0$ for $i \neq 1$ and $i \neq N$. We also assume that the sequential part is independent of the system size, i.e., $W_1 = W'_1 = W^*_1$.

Under this simplified case, the general fixed-size speedup formulation, Eq. (7), becomes

$$ S_N(W) = \frac{W_1 + W_N}{W_1 + \frac{W_N}{N}}. \qquad (12) $$

Eq. (12) is known as Amdahl's law. Figure 3 shows that when the number of processors increases, the load on each processor decreases. Eventually the sequential part dominates the performance, and the speedup is bounded by $\frac{W_1 + W_N}{W_1}$. In Figure 3, $T_1$ is the execution time for the sequential portion of the work and $T_N$ is the execution time for the parallel portion of the work.

Figure 3: Amdahl's law (left: amount of work vs. number of processors N; right: elapsed time vs. number of processors N).

For fixed-time speedup under the simplified conditions, the scalability constraint, Eq. (8), becomes

$$ W_1 + W_N = W'_1 + \frac{W'_N}{N}. $$

That is, $W'_N = N W_N$. Since $W'_1 = W_1$, Eq. (9) becomes

$$ S'_N(W') = \frac{W_1 + N W_N}{W_1 + W_N}. \qquad (13) $$

The simplified fixed-time speedup, Eq. (13), is known as Gustafson's scaled speedup [5]. From Eq. (13) we can see that the parallel portion of an application scales up linearly with the system size. The relation of workload and elapsed time for Gustafson's scaled speedup is depicted in Figure 4. We need some preparation before deriving the simplified formulation for memory-bounded speedup.

Definition 2. A function $g$ is a semihomomorphism if there exists a function $\bar{g}$ such that for any real number $c$ and any variable $x$, $g(cx) = \bar{g}(c)\, g(x)$.

Figure 4: Gustafson's scaled speedup (left: amount of work vs. number of processors N; right: elapsed time vs. number of processors N).

One class of semihomomorphisms is the power function $g(x) = x^b$, where $b$ is a rational number. In this case $\bar{g}$ is the same as the function $g$. Another class of semihomomorphisms is the single-term polynomial $g(x) = a x^b$, where $a$ is a real constant and $b$ is a rational number. For this kind of semihomomorphism, $\bar{g}(x) = x^b$, which is not the same as $g(x)$.
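A quick numerical check (our illustration) that the single-term polynomial $g(x) = a x^b$ is a semihomomorphism with $\bar{g}(c) = c^b$:

```python
a, b = 5.0, 1.5          # a single-term polynomial g(x) = a * x**b
g    = lambda x: a * x ** b
gbar = lambda c: c ** b  # the associated g-bar, independent of the constant a

for c in (2.0, 3.0, 10.0):
    for x in (1.0, 7.0, 42.0):
        assert abs(g(c * x) - gbar(c) * g(x)) < 1e-9 * g(c * x)
print("g(cx) = gbar(c) g(x) holds for g(x) = a x^b with gbar(c) = c^b")
```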

Under our assumptions, the sequential portion of the workload, $W_1$, is independent of the system size. If the influence of memory on the sequential portion is not considered, i.e., the memory capacity $M$ is used for the parallel portion only, we have the following theorem.

Theorem 1. If $W_N = g(M)$ for some semihomomorphism $g$, $g(cx) = \bar{g}(c)\,g(x)$, then, with all data being accessible by all available processors and using all available memory space, the simplified memory-bounded speedup is

$$ S^*_N(W^*) = \frac{W_1 + \bar{g}(N)\, W_N}{W_1 + \frac{\bar{g}(N)}{N}\, W_N}. \qquad (14) $$

Proof. Assume that the maximum problem size takes the maximum available memory capacity, $M$, when one processor is used. As mentioned before, when one processor is available the parallel portion of the workload, $W_N$, can be expressed as $W_N = g(M)$. Since all data are accessible by all processors, there is no need to replicate the data. With $N$ processors available, the total available memory capacity is increased to $NM$. The parallel portion of the problem can be scaled up to use all the available memory capacity $NM$. Thus the scaled parallel portion $W^*_N$ is expressed as $W^*_N = g(NM) = \bar{g}(N)\,g(M)$. Therefore $W^*_N = \bar{g}(N)\,W_N$, and

$$ S^*_N(W^*) = \frac{W_1 + W^*_N}{W_1 + \frac{W^*_N}{N}} = \frac{W_1 + \bar{g}(N)\, W_N}{W_1 + \frac{\bar{g}(N)}{N}\, W_N}. \qquad \square $$

Note that in Theorem 1 we made two assumptions in the simplified case. First, since the communication latency is ignored, remote memory accesses take the same time as local memory accesses; this implies that the data are accessible by all available processors. Second, all the available memory space is used for a better solution. These simplified speedup models are useful to demonstrate how the sequential portion of an application, $W_1$, affects the maximum speedup that can be achieved with different numbers of processors. Let $k = \frac{W_1}{W_1 + W_N}$. The simplified fixed-size speedup, fixed-time speedup, and memory-bounded speedup are, respectively,

$$ S_N(W) = \frac{N}{(N-1)\,k + 1}, \qquad (15) $$

$$ S'_N(W') = N - (N-1)\,k, \qquad (16) $$

$$ S^*_N(W^*) = \frac{(1-k)\,\bar{g}(N) + k}{(1-k)\,\frac{\bar{g}(N)}{N} + k}. \qquad (17) $$
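The three simplified formulas are easy to compare numerically. The sketch below (our illustration; the sequential fraction k is hypothetical) evaluates Eqs. (15) through (17) side by side, with $\bar{g}(N) = N^{3/2}$ as in the matrix-multiplication example of Section 5:

```python
def fixed_size(k, N):            # Eq. (15), Amdahl's law
    return N / ((N - 1) * k + 1)

def fixed_time(k, N):            # Eq. (16), Gustafson's scaled speedup
    return N - (N - 1) * k

def memory_bounded(k, N, gbar):  # Eq. (17)
    return ((1 - k) * gbar(N) + k) / ((1 - k) * gbar(N) / N + k)

k = 0.1                          # hypothetical sequential fraction
gbar = lambda N: N ** 1.5        # matrix-multiplication-like growth (Section 5)
for N in (4, 16, 64, 256):
    print(N, round(fixed_size(k, N), 2),
             round(fixed_time(k, N), 2),
             round(memory_bounded(k, N, gbar), 2))
```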

When the number of processors $N$ goes to infinity, Eq. (15) is bounded by the reciprocal of $k$, which gives the maximum value of the fixed-size speedup. Eq. (16) shows that the fixed-time speedup is a linear function of the number of processors, with slope equal to $1 - k$. When $N$ goes to infinity, this speedup can increase without bound. Memory-bounded speedup depends on the function $\bar{g}(N)$. When $\bar{g}(N) = 1$, memory-bounded speedup is the same as fixed-size speedup. When $\bar{g}(N) = N$, memory-bounded speedup is the same as fixed-time speedup. In general the function $\bar{g}(N)$ is application dependent and $\bar{g}(N) \geq N$: when the memory capacity is increased $N$ times, the amount of work usually increases $N$ times or more. It is easy to verify that $S^*_N(W^*) \geq S'_N(W')$ when $\bar{g}(N) \geq N$. Note that all data in memory are likely to be accessed at least once; thus, for scaled problems, $\bar{g}(N) < N$ is unlikely to occur. The sequential portion of the work plays different roles in the three definitions of speedup. In fixed-size speedup, the influence of the sequential portion increases with the system size and eventually dominates the performance. In fixed-time speedup, the influence of the sequential portion is unchanged, which makes the speedup a linear function of the system size. In memory-bounded speedup, since in general $\bar{g}(N) \geq N$, the influence of the sequential portion is reduced when the system size increases, indicating that a better speedup can be achieved with a larger system size.

The function $\bar{g}(N)$ provides a metric to evaluate parallel algorithms. In general, $\bar{g}(N)$ may not be derivable for a given algorithm. Note, however, that any single-term polynomial is a semihomomorphism, and most solvable algorithms have polynomial time computation and memory requirements. If we take an algorithm's computation and storage complexity, the term with the largest power, as its computation and memory requirement, then for any algorithm with polynomial complexity there exists a semihomomorphism $g$ such that $W = g(M)$. The approximating semihomomorphism $g$ will provide a good estimate of the memory-bounded speedup when the number of processors is large. More detailed case studies for the three models of speedup can be found in the literature.

Figure 5 demonstrates the difference between the three models of speedup for a fixed sequential fraction $k$ as $N$ ranges over the number of nodes. For the simplified memory-bounded (SMB) speedup we choose $\bar{g}(N) = N^{3/2}$, which is typical of many matrix operations, to be described later. When $\bar{g}(N) = N$, it is Gustafson's scaled speedup. A combined case, $G(N)$, lying between these extremes, will be studied in the next section.

Figure 5: Amdahl's law, Gustafson's speedup, and SMB speedup for a fixed $k$ (curves shown: Ideal, Amdahl's law, SMB with $\bar{g}(N)$, SMB with $G(N)$, and Gustafson's speedup; speedup vs. number of nodes).

5 Communication-Memory Trade-off

The simplified speedup formulations give the impact of the sequential portion of an application on the maximum speedup. The simplified memory-bounded speedup suggests that maximum speedup is obtained when data are shared by all processors. In practice, however, if communication overhead is considered, the data-sharing approach may not lead to the maximum speedup. In the design of efficient parallel algorithms, the communication cost plays an important role in deciding how a problem should be solved and scaled. One way to reduce the frequency of communication is to replicate some shared data on the processors. Thus a good algorithm design should consider the trade-off between the maximum size to which a problem can scale and the reduction of available memory due to the replication of shared data.

If data replication is allowed, the relation $W^* = g(NM)$ no longer holds. Motivated by Theorem 1, the function $G(N) = \frac{W^*_N}{W_N}$ is defined to represent the ratio of work increment when $N$ processors are available. In terms of $G(N)$, the simplified memory-bounded speedup is generalized below.



Theorem If W is independent of system size W for i N and W GN W for

1 i N

N

some function GN the memorybounded speedup is

W GN W

1 N



S W

N

G(N )



W W Q W

1 N N

N
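The trade-off captured by Eq. (18) can be illustrated numerically. In the sketch below (our illustration), a "local" strategy replicates data ($G(N) = N$, no communication) while a "global" strategy shares data ($G(N) = N^{3/2}$) but pays an overhead; the overhead model $Q = 2N$ is invented purely for illustration:

```python
def smb(W1, WN, N, G, Q):
    """Generalized memory-bounded speedup of Eq. (18)."""
    return (W1 + G * WN) / (W1 + G * WN / N + Q)

W1, WN = 10.0, 90.0
for N in (4, 16, 64):
    local = smb(W1, WN, N, G=N, Q=0.0)             # replicate shared data: no communication
    glob  = smb(W1, WN, N, G=N ** 1.5, Q=2.0 * N)  # share data: assumed (invented) overhead
    print(f"N={N:3d}: local {local:6.2f}, global {glob:6.2f}")
```

Under this assumed overhead, the replication strategy wins even though its G(N) is smaller, matching the observation below that the maximum speedup is not necessarily achieved when $G(N) = \bar{g}(N)$.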

The proof of Theorem 2 is similar to the proof of Theorem 1. Eq. (18) shows that the maximum speedup is not necessarily achieved when $G(N) = \bar{g}(N)$. Note that the communication cost $Q_N(W^*)$ is a unified communication cost. An optimal choice of the function $G(N)$ is both algorithm and architecture dependent and, in general, is difficult to obtain. Also, unlike $\bar{g}(N)$, $G(N)$ might be less than $N$. If $G(N) > N$, memory capacity is likely to be the scalability constraint when $N$ is large; if $G(N) < N$, execution time is likely to be the scalability constraint. The function $G(N)$ thus indicates the likely scalability constraint of an algorithm. The proposed scaled speedup, Eq. (18), may not be easy to fully understand at first glance; hence we use matrix multiplication as an example to illustrate it.

A matrix often represents some discretized continuum, and enlarging the matrix size generally leads to a more accurate solution for the continuum. For matrix multiplication, $C = AB$, there are many ways to partition the matrices $A$ and $B$ to allow parallel processing. Assume that there are $N$ processors available and that $A$ and $B$ are $n \times n$ matrices when executing on a single processor. The computation requirement is $n^3$ and the memory requirement is roughly $3n^2$ (for $A$, $B$, and the result $C$). Thus $W_N = n^3$ and $M = 3n^2$. Two extreme cases of memory-bounded scaled speedup are considered.

Local Computation

In the first case, we assume that the communication cost is extremely high. Thus data should be replicated, if possible, to reduce communication. This can be achieved by partitioning the columns of matrix $B$ into $N$ submatrices $B_0, B_1, \ldots, B_{N-1}$ and replicating the matrix $A$. Thus the $B_i$'s are distributed among all the processors and matrix $A$ is replicated on each processor. Processor $i$ does the multiplication $A B_i = C_i$, $0 \leq i \leq N-1$, independently. Since there is no need for communication, this is referred to as the local computation approach. Figure 6(a) shows the partitioning of $B$ for the case $N = 4$.

Figure 6: Two partitioning schemes of matrices A and B. (a) The matrix B is partitioned: $A\,[B_0\; B_1\; B_2\; B_3] = [C_0\; C_1\; C_2\; C_3]$. (b) Both matrices A and B are partitioned, giving blocks $C_{ij} = A_i B_j$.

If both $A$ and $B$ are allowed to scale along any dimension, and $A$ and $B$ need not be square matrices, the enlarged problem is $A^* B^* = C^*$, where $A^*$ is an $l \times k$ matrix, $B^*$ is a $k \times m$ matrix, and the resulting matrix $C^*$ is an $l \times m$ matrix. Note that the local memory capacity is $M = 3n^2$. It is easy to see that the maximum memory-bounded speedup will be achieved when $l = k = n$ and $m = nN$. In other words, both $B$ and $C$ are scaled up $N$ times along their rows, and $A$ is replicated but not scaled. The amount of computation on each processor is fixed at $n^3$, and $W^*_N = N W_N$. Thus we have $G(N) = N$, and the memory-bounded scaled speedup is

$$ S^*_N(W^*) = \frac{W_1 + N W_N}{W_1 + W_N}, \qquad (19) $$

which is Gustafson's scaled speedup. Thus the best performance of memory-bounded speedup using the local computation model is the same as Gustafson's scaled speedup. In general, the local computation model leads to a speedup less than Gustafson's scaled speedup. For example, if both $A$ and $B$ are restricted to square matrices, the function $G(N)$ becomes

$$ G(N) = \left( \frac{3N}{N+2} \right)^{3/2} \qquad (20) $$

(see Appendix), which is less than $N$ for $N > 1$ and is bounded by $3^{3/2}$. Note that, due to data replication, the memory capacity requirement increases faster than the computation requirement does.
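A few sample values (our quick check) show how Eq. (20) saturates:

```python
G_local = lambda N: (3 * N / (N + 2)) ** 1.5   # Eq. (20), square matrices, A replicated
for N in (1, 2, 8, 64, 1024):
    print(N, round(G_local(N), 3))             # 1.0, 1.837, 3.718, 4.962, 5.181
# The values approach but never reach 3**1.5 ~ 5.196: replication consumes
# the added memory, so the workload stops growing with N.
```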

Global Computation

In the second extreme case, we assume that the communication cost is negligible. Thus there is no need to replicate the data, and a bigger problem can be solved. We partition matrix $A$ into $N$ row blocks and $B$ into $N$ column blocks (see Figure 6(b)). By assigning each pair of submatrices $A_i$ and $B_i$ to one processor, initially all main diagonal blocks of $C$ can be computed. Then the row blocks of $A$ are rotated from one processor to another after each row-column submatrix multiplication. With $N$ processors, $N$ rotations are needed to finish the computation, as shown in Figure 7 for the case $N = 4$. This method is referred to as global computation.
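The rotation scheme is easy to prototype. The sketch below (our single-machine simulation; block counts and matrices are arbitrary) reproduces the schedule of Figure 7 and checks it against a direct product:

```python
import numpy as np

def global_matmul(A, B, N):
    """Simulate the rotation scheme of Figure 7 on one machine.

    A is split into N row blocks and B into N column blocks; "processor" i
    keeps B_i fixed while the row blocks of A rotate past it, so after N
    steps every block C_ji = A_j B_i has been computed.
    """
    rows = np.array_split(np.arange(A.shape[0]), N)   # row index sets of the A_j
    cols = np.array_split(np.arange(B.shape[1]), N)   # column index sets of the B_i
    C = np.zeros((A.shape[0], B.shape[1]))
    for step in range(N):                             # N rotations
        for i in range(N):                            # each i would run on its own processor
            j = (i + step) % N                        # which A block processor i now holds
            C[np.ix_(rows[j], cols[i])] = A[rows[j], :] @ B[:, cols[i]]
    return C

rng = np.random.default_rng(0)
A, B = rng.random((8, 8)), rng.random((8, 8))
assert np.allclose(global_matmul(A, B, 4), A @ B)
print("rotation result matches A @ B")
```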

For the global computation approach, the maximum scaled speedup is achieved when $l = k = m = n\sqrt{N}$ (see Appendix):

$$ S^*_N(W^*) = \frac{W_1 + N^{3/2}\, W_N}{W_1 + N^{1/2}\, W_N}. \qquad (21) $$

The corresponding function is $G(N) = N^{3/2}$. Since $M = 3n^2$, we can write $W_N$ as a function of $M$ as follows:

$$ W_N = g(M) = \left( \frac{M}{3} \right)^{3/2}. \qquad (22) $$

Increasing the total memory capacity to $NM$, we have

$$ W^*_N = g(NM) = \left( \frac{NM}{3} \right)^{3/2} = N^{3/2} \left( \frac{M}{3} \right)^{3/2} = N^{3/2}\, W_N = \bar{g}(N)\, W_N. \qquad (23) $$

The matrix multiplication problem thus has a semihomomorphism between its memory requirement and its computation requirement, with $\bar{g}(N) = N^{3/2}$. Assuming a negligible communication cost, the global computation approach achieves the best possible scaled speedup of the matrix multiplication problem.

Figure 7: Matrix multiplication without data replication. In step $s$ ($s = 1, \ldots, 4$ for $N = 4$), processor $i$ holds the fixed column block $B_i$ and the rotating row block $A_{(i+s-1) \bmod N}$, computing one block of $C$ per step.

We have studied two extreme cases of memory-bounded scaled speedup, based on global computation and on local computation. In general, for most algorithms, part of the data

may be replicated and part of the data may have to be shared. Deriving a speedup formulation for these algorithms is difficult, not only because we face a more complicated situation, but also because the ratio between replicated and shared data is uncertain. The replicated part may not increase as the system size is increased; and if the replicated part does increase, its rate of increase may differ from that of the shared part. Also, an algorithm may start with global computation and, when the system size is increased, replication may be needed as part of the effort to reduce communication overhead. A special combined case of $G(N)$ has been studied carefully; the structure of that study can be used as a guideline for other algorithms.

The influence of communication overhead on the best performance of memory-bounded speedup has been studied above. The study can be extended to fixed-time speedup, where redundant computation could be introduced to reduce the communication overhead. The function $G(N)$ determines the actual achieved speedup. We have shown how the partitioning and scaling of the problem influence the function $G(N)$; in general, finding an optimal function $G(N)$ is a nonlinear optimization problem. The concept of the function $G(N)$ can also be extended to algorithms with multiple degrees of parallelism.

6 Conclusions

It is known that the performance of parallel processing is influenced by the inherent parallelism and communication requirement of the algorithm, by the computation and communication power of the underlying architecture, and by the memory capacity of the parallel computer system. However, how these factors relate to each other and how they influence the performance of parallel processing is generally unknown. Discovering the answers to these unknowns is important for designing efficient parallel algorithms. In this paper one model of speedup, memory-bounded speedup, has been studied carefully. The model contains these factors as its parameters.

As part of the study on performance, two other models of speedup have also been studied: fixed-size speedup and fixed-time speedup. Two sets of speedup formulations have been derived for these two models of speedup and for memory-bounded speedup. The formulations in the first set are generalized speedup formulas. The second set of formulations considers only a special, simplified case. The simplified fixed-size speedup is Amdahl's law; the simplified fixed-time speedup is Gustafson's scaled speedup; and the simplified memory-bounded speedup contains both Amdahl's law and Gustafson's scaled speedup as special cases.

The three models of speedup, fixed-size speedup, fixed-time speedup, and memory-bounded speedup, are based on different viewpoints and are suitable for different classes of algorithms. However, algorithms exist which do not fit any single one of these models but satisfy some combination of them.

Appendix

When communication does not occur (local computation) or its cost is negligible, the memory-bounded speedup equation becomes

$$ S^*_N = \frac{W_1 + G(N)\, W_N}{W_1 + \frac{G(N)}{N}\, W_N}. \qquad (A.1) $$



It is easy to verify that S increases with the function GN Thus for the two extreme cases

N

considered in Section the problem of how to reach the maximum sp eedup b ecomes how to scale

the matrix A and B such that the function GN reaches its maximum value The matrix A and

B can b e scaled in any dimension A general scaled matrix multiplication problem is

A B C

lk k m lm

where b oth A and B are rectangular matrices To achieve an optimal sp eedup we need to decide

the integers l k and m for which that the function GN reaches the maximum value The

following result gives the optimal l k and m for the global computation approach Fig b

given in Section Recall that N is the numb er of pro cessors

Proposition 1. If $A$ and $B$ are $n \times n$ matrices when $N = 1$, then the global computation approach reaches the maximum $G(N)$ when $l = k = m = n\sqrt{N}$, excluding the communication cost. The corresponding $G(N)$ equals $N^{3/2}$, and the maximum speedup is

$$ S^*_N = \frac{W_1 + N^{3/2}\, W_N}{W_1 + N^{1/2}\, W_N}. $$

Proof. By the partition scheme of the global computation approach, the rows of matrix $A$ and the columns of matrix $B$ are distributed among the processors. The workload on each processor is

$$ A_{\frac{l}{N} \times k}\; B_{k \times \frac{m}{N}} = C_{\frac{l}{N} \times \frac{m}{N}}. $$

Since the memory is fully filled,

$$ \frac{lk}{N} + \frac{km}{N} + \frac{lm}{N} = 3n^2. \qquad (A.3) $$

Thus

$$ k = \frac{3n^2 N - lm}{l + m}. \qquad (A.4) $$

The work of the scaled problem is

$$ W^* = lmk = \frac{lm\,(3n^2 N - lm)}{l + m}, \qquad (A.5) $$

and

$$ G(N) = \frac{W^*}{n^3} = \frac{lm\,(3n^2 N - lm)}{(l + m)\, n^3}. \qquad (A.6) $$

Therefore $G(N)$ reaches its maximum value if and only if the function

$$ f(l, m) = \frac{lm\,(3n^2 N - lm)}{l + m} \qquad (A.7) $$

reaches its maximum value. At the maximum, the partial derivatives of $f(l, m)$ satisfy

$$ f'_l(l, m) = \frac{m^2\,(3n^2 N - l^2 - 2lm)}{(l + m)^2} = 0, $$

$$ f'_m(l, m) = \frac{l^2\,(3n^2 N - m^2 - 2lm)}{(l + m)^2} = 0. $$

This leads to

$$ l^2 + 2lm = 3n^2 N, \qquad (A.8) $$

$$ m^2 + 2lm = 3n^2 N. \qquad (A.9) $$

Subtracting one equation from the other gives $m^2 = l^2$, i.e.,

$$ l = m. \qquad (A.10) $$

Combining Eq. (A.8) and Eq. (A.10), we get

$$ l = m = n\sqrt{N}. \qquad (A.11) $$

From Eq. (A.4) we then have $k = n\sqrt{N}$. Thus the enlarged $A$ and $B$ are still square matrices, with dimension $n\sqrt{N}$. By Eq. (A.6) the maximum $G(N)$ is

$$ G(N) = \frac{n\sqrt{N} \cdot n\sqrt{N}\,(3n^2 N - n^2 N)}{2n\sqrt{N} \cdot n^3} = N^{3/2}, $$

which is equal to the memory-work function $\bar{g}(N)$ for the matrix multiplication problem (see Section 5), and the corresponding speedup is

$$ S^*_N = \frac{W_1 + N^{3/2}\, W_N}{W_1 + N^{1/2}\, W_N}. $$

From Theorem 1, this is the best possible performance for the matrix multiplication problem. $\square$
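The optimum of Proposition 1 is easy to confirm numerically (our check): eliminating $k$ through Eq. (A.4) and grid-searching over $l$ and $m$ recovers $l = m = n\sqrt{N}$ and $G(N) = N^{3/2}$:

```python
from math import sqrt

def G(l, m, n, N):
    """G(N) of Eq. (A.6), with k eliminated via the memory constraint (A.4)."""
    k = (3 * n * n * N - l * m) / (l + m)
    return l * m * k / n ** 3 if k > 0 else 0.0

n, N = 10, 4
best = max((G(l, m, n, N), l, m) for l in range(1, 100) for m in range(1, 100))
print("grid optimum (G, l, m):", best)          # expect l = m = n*sqrt(N) = 20
print("closed form: l = m =", n * sqrt(N), ", G =", N ** 1.5)
```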

Using arguments similar to those in Proposition 1, we can find that the optimal dimensions for the local computation approach are $l = k = n$, $m = nN$, and the maximum value of $G(N)$ is $N$ (see Section 5). The scalability of the matrices $A$ and $B$ is application dependent. If $A$ and $B$ must be maintained as square matrices, the following proposition shows the limitation of the local computation approach.

Proposition 2. If $A$ and $B$ are $n \times n$ matrices when $N = 1$ and $l = k = m$ is required, then the maximum value of $G(N)$ for the local computation approach is $\left( \frac{3N}{N+2} \right)^{3/2}$, which is bounded by $3^{3/2}$ and is smaller than $N$ for $N > 1$.

Proof. When $A$ and $B$ are square matrices, the scaled problem is

$$ A_{k \times k}\; B_{k \times k} = C_{k \times k}. $$

If the load is balanced on each processor and $m = \frac{k}{N}$ is an integer, then each processor does the work

$$ A_{k \times k}\; B_{k \times m} = C_{k \times m}. $$

When the memory is fully used,

$$ k^2 + 2km = 3n^2. \qquad (A.12) $$

Since $m = \frac{k}{N}$,

$$ k^2 \left( 1 + \frac{2}{N} \right) = 3n^2. $$

Thus

$$ k = n \sqrt{\frac{3N}{N+2}}. \qquad (A.13) $$

The scaled work is

$$ W^* = k^3 = \left( \frac{3N}{N+2} \right)^{3/2} n^3 = \left( \frac{3N}{N+2} \right)^{3/2} W_N, $$

and

$$ G(N) = \left( \frac{3N}{N+2} \right)^{3/2}. \qquad (A.14) $$

Since $\frac{3N}{N+2} < 3$ for all $N \geq 1$, and since $\left( \frac{3N}{N+2} \right)^{3/2} < N$ if and only if $27N < (N+2)^3$, which holds for $N > 1$, the function $G(N)$ is bounded by $3^{3/2}$ and is smaller than $N$ for $N > 1$. $\square$

References

[1] Amdahl, G. Validity of the single-processor approach to achieving large scale computing capabilities. In Proc. AFIPS Conf.

[2] Barton, M., and Withers, G. Computing performance as a function of the speed, quantity, and cost of the processors. In Proc. Supercomputing.

[3] Dunigan, T. Performance of the Intel iPSC and NCUBE hypercubes. Parallel Computing (Dec.).

[4] Eager, D., Zahorjan, J., and Lazowska, E. Speedup versus efficiency in parallel systems. IEEE Transactions on Computers (March).

[5] Gustafson, J. Reevaluating Amdahl's law. Communications of the ACM (May).

[6] Gustafson, J., Montry, G., and Benner, R. Development of parallel methods for a 1024-processor hypercube. SIAM J. of Sci. and Stat. Computing (July).

[7] Gustafson, J., Rover, D., Elbert, S., and Carter, M. The design of a scalable, fixed-time computer benchmark. J. of Parallel and Distributed Computing.

[8] Karp, A. H., and Flatt, H. P. Measuring parallel processor performance. Communications of the ACM (May).

[9] Kumar, V., and Gupta, A. Analysis of scalability of parallel algorithms and architectures: A survey. In Proc. of Int'l Conf. on Supercomputing (June).

[10] Moler, C. Matrix computation on distributed memory multiprocessors. In Proc. of First Conf. on Hypercube Multiprocessors.

[11] Ni, L., and King, C. On partitioning and mapping for hypercube computing. International Journal of Parallel Programming.

[12] Sevcik, K. Characterizations of parallelism in applications and their use in scheduling. In Proc. of ACM SIGMETRICS and Performance (May).

[13] Sun, X.-H. Parallel computation models: Representation, analysis and applications. Ph.D. Dissertation, Computer Science Department, Michigan State University.

[14] Sun, X.-H., and Gustafson, J. Toward a better parallel performance metric. Parallel Computing (Dec.).

[15] Sun, X.-H., and Ni, L. Another view on parallel speedup. In Proc. of Supercomputing '90, New York, NY.

[16] Sun, X.-H., and Rover, D. Scalability of parallel algorithm-machine combinations. Technical Report, Ames Laboratory, U.S. Department of Energy. Accepted to appear in IEEE TPDS.

[17] Ware, W. The ultimate computer. IEEE Spectrum.

[18] Worley, P. T. The effect of time constraints on scaled speedup. SIAM J. of Sci. and Stat. Computing (Sept.).