Performance of Various Computers Using Standard Linear Equations Software
Jack J. Dongarra
Computer Science Department, University of Tennessee, Knoxville, TN
and
Mathematical Sciences Section, Oak Ridge National Laboratory, Oak Ridge, TN

CS-89-85
November

Electronic mail address: dongarra@cs.utk.edu
An up-to-date version of this report can be found at http://www.netlib.org/benchmark/performance.ps

This work was supported in part by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. Department of Energy, under Contract DE-AC05-84OR21400, and in part by the Science Alliance, a state-supported program at the University of Tennessee.

Abstract

This report compares the performance of different computer systems in solving dense systems of linear equations. The comparison involves approximately a hundred computers, ranging from a Cray Y-MP to scientific workstations such as the Apollo and Sun to IBM PCs.

Introduction and Objectives

The timing information presented here should in no way be used to judge the overall performance of a computer system. The results reflect only one problem area: solving dense systems of equations.

This report provides performance information on a wide assortment of computers, ranging from the home-used PC up to the most powerful supercomputers. The information has been collected over a period of time and will undergo change as new machines are added and as hardware and software systems improve. The programs used to generate this data can easily be obtained over the Internet. While we make every attempt to verify the results obtained from users and vendors, errors are bound to exist and should be brought to our attention. We encourage users to obtain the programs and run the routines on their machines, reporting any
discrepancies with the numbers listed here.

The first table reports three numbers for each machine listed; in some cases the numbers are missing because of lack of data. All performance numbers reflect arithmetic performed in full precision (usually 64-bit), unless noted. On some machines full precision may be single precision, such as on the Cray, or double precision, such as on the IBM. The first number is for the LINPACK benchmark program for a matrix of order 100 in a Fortran environment. The second number is for solving a system of equations of order 1000, with no restriction on the method or its implementation. The third number is the theoretical peak performance of the machine.

LINPACK programs can be characterized as having a high percentage of floating-point arithmetic operations. The routines involved in this timing study, SGEFA and SGESL, use column-oriented algorithms. That is, the programs usually reference array elements sequentially down a column, not across a row. Column orientation is important in increasing efficiency because of the way Fortran stores arrays. Most floating-point operations in LINPACK take place in a set of subprograms, the Basic Linear Algebra Subprograms (BLAS), which are called repeatedly throughout the calculation. These BLAS, referred to now as Level 1 BLAS, reference one-dimensional arrays rather than two-dimensional arrays.

In the first case, the problem size is relatively small (order 100), and no changes were made to the LINPACK software. Moreover, no attempt was made to use special hardware features or to exploit vector capabilities or multiple processors. The compilers on some machines may, of course, generate optimized code that
itself accesses special features. Thus, many high-performance machines may not have reached their asymptotic execution rates.

In the second case, the problem size is larger (matrix of order 1000), and modifying or replacing the algorithm and software was permitted to achieve as high an execution rate as possible. Thus, the hardware had more opportunity for reaching near-asymptotic rates. An important constraint, however, was that all optimized programs maintain the same relative accuracy as standard techniques, such as the Gaussian elimination used in LINPACK. Furthermore, the driver program supplied with the LINPACK benchmark had to be run to ensure that the same problem is solved. The driver program sets up the matrix, calls the routines to solve the problem, verifies that the answers are correct, and computes the total number of operations to solve the problem (independent of the method) as 2n^3/3 + 2n^2, where n = 1000.

The last column is based not on an actual program run, but on a paper computation to determine the theoretical peak Mflops rate for the machine. This is the number manufacturers often cite; it represents an upper bound on performance. That is, the manufacturer guarantees that programs will not exceed this rate, a sort of "speed of light" for a given computer.

The theoretical peak performance is determined by counting the number of floating-point additions and multiplications (in full precision) that can be completed during a period of time, usually the cycle time of the machine. As an example, the Cray Y-MP has a cycle time of 6 ns. During a cycle, the results of both an addition and a multiplication can be completed: 2 operations per cycle / 6 ns per cycle = 333 Mflops on a single processor. On the Cray Y-MP there are 8 processors; thus the peak performance is 2667 Mflops.

The information in this report is presented to users to provide a range of performance for the various computers and to show the effects of typical Fortran programming and the results that can be obtained through careful programming. The maximum rate of execution
is given for comparison.

The column labeled "Computer" gives the name of the computer hardware on which the program was run. In some cases we have indicated the number of processors in the configuration and, in some cases, the cycle time of the processor in nanoseconds.

The column labeled "LINPACK Benchmark" gives the operating system and compiler used. The run was based on two routines from LINPACK: SGEFA and SGESL were used for single precision, and DGEFA and DGESL were used for double precision. These routines perform standard LU decomposition with partial pivoting and backsubstitution. The timing was done on a matrix of order 100, where no changes are allowed to the Fortran programs.

The column labeled "TPP" (Toward Peak Performance) gives the results of hand optimization; the problem size was of order 1000. The final column, labeled "Theoretical Peak", gives the maximum rate of execution based on the cycle time of the hardware.

The same matrix was used to solve the system of equations. The results were checked for accuracy by calculating a residual for the problem: ||Ax - b|| / (||A|| ||x||).

The term Mflops, used as a rate of execution, stands for millions of floating-point operations completed per second. For solving a system of n equations, 2n^3/3 + 2n^2 operations are performed (we count both additions and multiplications).

The information in the tables was compiled over a period of time. Subsequent systems, software, and hardware changes may alter the timings to some extent. One further note: the following tables should not be taken too seriously. In multiprogramming environments it is often difficult to reliably measure the execution time of a single program. We trust that anyone actually evaluating machines and operating systems will gather more reliable and more representative data.

A Look at Parallel Processing

While collecting the data presented in Table 1, we were able to experiment with parallel processing on a number of computer systems. For these experiments we used either the standard LINPACK algorithm
or an algorithm based on matrix-matrix techniques. In the case of the LINPACK algorithm, the loop around the SAXPY can be performed in parallel. In the matrix-matrix implementation, the matrix product can be split into submatrices and performed in parallel. In either case, the parallelism follows a simple fork-and-join model where each processor gets some number of operations to perform.

For a problem of size 1000 we expect a high degree of parallelism. Thus it is not surprising that we get such high efficiency (see Table 2). The actual percentage of parallelism, of course, depends on the algorithm and on the speed of the uniprocessor on the parallel part relative to the speed of the uniprocessor on the non-parallel part.

Highly Parallel Computing

With the arrival of massively parallel computers, there is a need to benchmark such machines on problems that make sense. The problem size and rule for the runs reflected in Tables 1 and 2 do not permit massively parallel computers to demonstrate their potential performance. The basic flaw is that the problem size is too small. To provide a forum for comparing such machines, the following benchmark was run on a number of massively parallel machines. The benchmark involves solving a system of linear equations, as was done in Tables 1 and 2. However, in this case the problem size is allowed to increase, and the performance numbers reflect the largest problem run on the machine. The ground rules are as follows: solve systems of linear equations by some method, allow the size of the problem to vary, and measure the execution time for each size problem.
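The fork-and-join decomposition described above can be sketched in a few lines. In Gaussian elimination, once the multipliers for pivot column k are formed, the SAXPY update of each remaining column is independent of the others, so the loop over columns can be forked to separate processors and joined before the next stage. The sketch below is ours, written in Python rather than the report's Fortran, with pivoting omitted for brevity; CPython threads illustrate only the decomposition, not a real speedup.

```python
# Hypothetical illustration (not the report's code): fork-and-join
# parallelism over the SAXPY loop in Gaussian elimination. The matrix is
# stored as a list of columns, mirroring Fortran's column orientation.
from concurrent.futures import ThreadPoolExecutor

def eliminate(cols, mapper=map):
    """Forward elimination without pivoting. `mapper` abstracts the
    fork-and-join: the built-in map runs the column updates serially,
    while a ThreadPoolExecutor's map runs them in parallel."""
    n = len(cols)
    for k in range(n - 1):
        # multipliers: the pivot column scaled by the pivot element
        m = [cols[k][i] / cols[k][k] for i in range(k + 1, n)]

        def update(j):  # one SAXPY: col_j[k+1:] -= col_j[k] * m
            a = cols[j][k]
            for t, i in enumerate(range(k + 1, n)):
                cols[j][i] -= a * m[t]

        # fork the independent column updates, join before the next stage
        list(mapper(update, range(k + 1, n)))
    return cols

# serial and "parallel" runs produce the same upper-triangular factor
A = [[2.0, 4.0, 8.0], [1.0, 3.0, 7.0], [1.0, 3.0, 9.0]]  # columns of A
serial = eliminate([c[:] for c in A])
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = eliminate([c[:] for c in A], mapper=pool.map)
assert serial == parallel
```

Because each forked task writes only its own column, the updates are free of data races, which is what makes the simple fork-and-join model adequate here.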
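The benchmark mechanics described earlier can likewise be summarized in a short sketch: factor the matrix with partial pivoting (what SGEFA/DGEFA do), solve with the factors (what SGESL/DGESL do), verify via the scaled residual ||Ax - b|| / (||A|| ||x||), and convert the elapsed time to Mflops with the fixed operation count 2n^3/3 + 2n^2. The Python below is our illustration, not the Fortran benchmark code, and the helper names are invented.

```python
# A minimal stand-in for the LINPACK driver: factor, solve, check, time.
import random
import time

def lu_factor(A):
    """In-place LU factorization with partial pivoting (the role of SGEFA);
    A is a list of rows; returns the pivot indices."""
    n = len(A)
    piv = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        piv[k] = p
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]          # store the multiplier (L factor)
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
    return piv

def lu_solve(A, piv, b):
    """Solve Ax = b using the factors from lu_factor (the role of SGESL)."""
    n = len(A)
    x = b[:]
    for k in range(n):                   # apply the row interchanges to b
        x[k], x[piv[k]] = x[piv[k]], x[k]
    for k in range(n):                   # forward solve with unit-lower L
        for i in range(k + 1, n):
            x[i] -= A[i][k] * x[k]
    for k in range(n - 1, -1, -1):       # back substitution with U
        x[k] = (x[k] - sum(A[k][j] * x[j] for j in range(k + 1, n))) / A[k][k]
    return x

def mflops(n, seconds):
    """The benchmark's fixed operation count, 2n^3/3 + 2n^2, per second."""
    return (2.0 * n**3 / 3.0 + 2.0 * n**2) / seconds / 1.0e6

# driver: set up a random system, solve it, verify via the scaled residual
n = 100
random.seed(0)
A = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
b = [random.uniform(-1.0, 1.0) for _ in range(n)]
A0 = [row[:] for row in A]               # keep a copy for the residual

t0 = time.perf_counter()
piv = lu_factor(A)
x = lu_solve(A, piv, b)
elapsed = time.perf_counter() - t0

r = max(abs(sum(A0[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n))
norm_A = max(sum(abs(a) for a in row) for row in A0)
norm_x = max(abs(v) for v in x)
print("scaled residual:", r / (norm_A * norm_x))
print("Mflops:", mflops(n, elapsed))
```

A pure-Python triple loop is, of course, orders of magnitude slower than the Fortran routines; the point is only the sequence of steps the driver enforces.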
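The paper computation behind the "Theoretical Peak" column is simple enough to state as arithmetic. The sketch below encodes the Cray Y-MP example from the text (6 ns cycle, one addition plus one multiplication per cycle, 8 processors); the function name is ours.

```python
def peak_mflops(cycle_time_ns, ops_per_cycle, processors=1):
    """Theoretical peak rate: floating-point operations completable per
    cycle, divided by the cycle time, scaled by the processor count."""
    cycles_per_second = 1.0e9 / cycle_time_ns
    return ops_per_cycle * cycles_per_second * processors / 1.0e6

print(peak_mflops(6.0, 2))     # one Y-MP processor: ~333 Mflops
print(peak_mflops(6.0, 2, 8))  # eight processors:  ~2667 Mflops
```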