HPL is a portable implementation of the High Performance LINPACK Benchmark for distributed-memory computers.

• Algorithm: recursive panel factorization, multiple lookahead depths, bandwidth-reducing swapping
• Easy to install: only needs MPI + BLAS or VSIPL
• Highly scalable and efficient, from the smallest cluster to the largest supercomputers in the world
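HPL itself distributes the matrix over an MPI process grid and relies on tuned BLAS, but the computation it times is conceptually simple: factor a random dense matrix by LU with partial pivoting, solve Ax = b, and divide the fixed operation count 2/3 n^3 + 2 n^2 by the elapsed time. The single-process C sketch below is an illustration of that computation and the Gflop/s accounting only, not HPL code; the default problem size and the random seeding are arbitrary choices.

/*
 * Single-process sketch of the computation HPL times: LU with partial
 * pivoting on a random dense matrix, triangular solves for A x = b, and
 * performance reported against the customary 2/3*n^3 + 2*n^2 flop count.
 * Not HPL itself: HPL runs this distributed over MPI with tuned BLAS.
 */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define A_(i, j) A[(size_t)(j) * n + (i)]   /* column-major indexing */

static double now_sec(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);            /* C11 wall-clock timer */
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 1000;   /* problem size (illustrative) */
    double *A = malloc((size_t)n * n * sizeof *A);
    double *b = malloc((size_t)n * sizeof *b);
    int *piv  = malloc((size_t)n * sizeof *piv);

    /* Random matrix and right-hand side with entries in [-0.5, 0.5]. */
    srand(42);
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            A_(i, j) = (double)rand() / RAND_MAX - 0.5;
    for (int i = 0; i < n; i++)
        b[i] = (double)rand() / RAND_MAX - 0.5;

    double t0 = now_sec();

    /* Right-looking LU factorization with partial pivoting (unblocked). */
    for (int k = 0; k < n; k++) {
        int p = k;
        for (int i = k + 1; i < n; i++)          /* find the pivot row */
            if (fabs(A_(i, k)) > fabs(A_(p, k))) p = i;
        piv[k] = p;
        for (int j = 0; j < n; j++) {            /* swap rows k and p */
            double tmp = A_(k, j); A_(k, j) = A_(p, j); A_(p, j) = tmp;
        }
        for (int i = k + 1; i < n; i++) {        /* eliminate below the pivot */
            A_(i, k) /= A_(k, k);
            for (int j = k + 1; j < n; j++)
                A_(i, j) -= A_(i, k) * A_(k, j);
        }
    }

    /* Apply the row swaps to b, then forward and backward substitution. */
    for (int k = 0; k < n; k++) {
        double tmp = b[k]; b[k] = b[piv[k]]; b[piv[k]] = tmp;
        for (int i = k + 1; i < n; i++) b[i] -= A_(i, k) * b[k];
    }
    for (int k = n - 1; k >= 0; k--) {
        b[k] /= A_(k, k);
        for (int i = 0; i < k; i++) b[i] -= A_(i, k) * b[k];
    }

    double t = now_sec() - t0;
    /* Fixed LINPACK operation count used for the Gflop/s figure. */
    double flops = (2.0 / 3.0) * n * (double)n * n + 2.0 * (double)n * n;
    printf("n = %d  time = %.3f s  perf = %.3f Gflop/s\n", n, t, flops / t / 1e9);

    free(A); free(b); free(piv);
    return 0;
}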

[Figure: HPL performance (Gflop/s) and time (seconds) versus matrix size on 30,100 cores of the Jaguar XT4 at ORNL]
[Figure: HPL performance (Gflop/s) versus matrix size on 648 cores from Grid5000]

HISTORY OF THE LINPACK BENCHMARK

1974: LINPACK software is released. Solves systems of linear equations in FORTRAN 66.
1977: LINPACK 100 is released. Measures system performance in Mflop/s by solving 100x100 linear systems.
1986: LINPACK 1000 is released. Any language is allowed, and a linear system of size 1000 can be used.
1989: LINPACKD v2 is released. Extends the random number generator from 16384 to 65536.
1991: LINPACK Table 3 (Highly Parallel Computing) is released. A linear system of any size is allowed.

1993: The TOP500 is first released. With the CM-5 running the LINPACK benchmark at nearly 60 Gflop/s.

1996: The 9th TOP500 is released. With the 1st system breaking the 1 Tflop/s barrier: ASCI Red from Sandia National Laboratory.

2000: HPLv1 is released. By Antoine Petitet, Jack Dongarra, Clint Whaley, and Andy Cleary.

2008: The 31st TOP500 is released. With the 1st system breaking the 1 Pflop/s barrier: Roadrunner from Los Alamos National Laboratory.

2008: HPLv2 is released. The new version features a 64-bit random number generator that prevents benchmark failures caused by the generation of singular matrices, which was a problem with the old generator (a rough sketch of such a generator follows the timeline).

2009: Petaflop/s are spreading. The upgrades of the machines hosted at ORNL result in shattering the 2 Pflop/s theoretical peak barrier for Cray's XT5 Jaguar. Also, the first academic system reaches 1 Pflop/s theoretical peak: the University of Tennessee's Kraken.

2010: GPUs are coming. The performance growth at Peta-scale is now fueled by a new breed of GPUs that power systems with over 2 Pflop/s of LINPACK performance.

2011: Multicore strikes back, exceeding 8 Pflop/s using over half a million modified SPARC VIII cores. HPL was used to obtain a number of results in the current TOP500 list, including the #1 entry.

2012: Even more cores are needed to be #1: a million and a half cores were required to break the 15 Pflop/s barrier.

2013: Intel is back on top, with over half of 100 Pflop/s of peak performance coming mostly from its Intel Xeon Phi accelerators.
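The HPLv2 entry above credits a 64-bit pseudo-random generator with eliminating the singular input matrices that the old short-period generator could produce at large problem sizes. The sketch below is a rough illustration of the idea only: it is not HPL's generator, and the multiplier and increment are Knuth's MMIX constants, not HPL's parameters. The point is simply that a 64-bit state has a period long enough that a large matrix does not end up with repeated, linearly dependent entries.

/*
 * Illustrative 64-bit linear congruential generator (LCG) used to fill a
 * dense test matrix with values in [-0.5, 0.5).  The constants are Knuth's
 * MMIX parameters, NOT the constants used inside HPL.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static uint64_t lcg_state = 1ULL;

static void lcg_seed(uint64_t s) { lcg_state = s; }

static uint64_t lcg_next(void)
{
    /* Knuth MMIX constants (illustrative only). */
    lcg_state = lcg_state * 6364136223846793005ULL + 1442695040888963407ULL;
    return lcg_state;
}

/* Uniform double in [-0.5, 0.5) built from the top 53 bits of the state. */
static double lcg_double(void)
{
    return (double)(lcg_next() >> 11) / (double)(1ULL << 53) - 0.5;
}

int main(void)
{
    int n = 4;                                  /* tiny demo matrix */
    double *A = malloc((size_t)n * n * sizeof *A);
    lcg_seed(2023ULL);
    for (size_t i = 0; i < (size_t)n * n; i++)
        A[i] = lcg_double();
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
            printf("% .4f ", A[i + j * n]);     /* column-major print */
        printf("\n");
    }
    free(A);
    return 0;
}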


The original LINPACK Benchmark is, in some sense, an accident. It was originally designed to assist users of the LINPACK package by providing information on the execution times required to solve a system of linear equations. The first "LINPACK Benchmark" report appeared as an appendix in the LINPACK Users' Guide in 1979. Over the years additional performance data was added, more as a hobby than anything else, and today the collection includes over 1300 different computer systems. In addition to the number of computers increasing, the scope of the benchmark has also expanded. The benchmark report describes the performance for solving a general dense matrix problem Ax = b at three levels of problem size and optimization opportunity: the 100x100 problem (inner loop optimization), the 1000x1000 problem (three-loop optimization, i.e. the whole program), and a scalable parallel problem: the HPL.

New statistics were required to reflect the diversification of supercomputers, the enormous performance difference between low-end and high-end models, the increasing availability of massively parallel processing (MPP) systems, and the strong increase in the computing power of the high-end models from workstation suppliers (SMP). To provide this new statistical foundation, the TOP500 list was created in 1993 to assemble and maintain a list of the 500 most powerful computer systems. The list is compiled twice a year with the help of high-performance computing experts, computational scientists, manufacturers, and the Internet community in general. In the list, computers are ranked by their performance on the HPL benchmark.

With the unprecedented increase in performance comes the price of the increased time the benchmark takes to complete. As the chart below shows, the time to completion is increasing with the number of cores. In 2009, the #1 entry reported a run that took over 18 hours. In 2011, this time increased to nearly 30 hours. To deal with this problem, a new version of HPL will offer a partial-run option. This will allow the machine to run the benchmark in only a fraction of the time it takes to run the full version of the benchmark without losing much reported performance. Initial runs on a system with over 30000 cores indicate only a 6% drop in performance with a 50% reduction in time. This becomes a viable alternative to, say, checkpointing for dealing with the rapidly increasing number of cores and the decreasing Mean Time Between Failures (MTBF) of large supercomputer installations.

[Figure: TIME TO RUN FOR #1 ENTRY; time (hours, 0 to 30) versus TOP500 list year, '93 to '13]

As shown below in the performance-time curve, time is given in relative terms and performance as a fraction of the maximum attained performance. In this manner, both time and performance may coexist on the same 0% to 100% scale. In this setting it is now easy to perform a what-if analysis of the data, as indicated with the arrows. The question being answered by the figure is this: if the time to run the benchmark is reduced by 50%, how much will the performance decrease? The answer is encouraging: the resulting performance drop will only be 5%. The sampled factorization provides the added benefit of stressing the entire system: the system matrix occupies the entire memory. Also, the result of a partial execution can be verified as rigorously as is the case for the standard factorization.
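A back-of-the-envelope calculation (my own arithmetic, not HPL's partial-run model) helps explain why a truncated run loses so little: LU factorization of an n x n matrix costs about 2/3 n^3 flops, and factoring only the first p*n columns already performs a fraction 1 - (1-p)^3 of that total, since the untouched trailing submatrix has shrunk to (1-p)n on a side. Stopping halfway through the columns therefore completes 87.5% of the floating-point work. The short C snippet below tabulates this fraction:

/*
 * Back-of-the-envelope check (not HPL's model): fraction of the total
 * 2/3*n^3 LU flops performed once the first p*n columns of an n x n matrix
 * are factored.  Integrating the per-column cost 2*(n-k)^2 from 0 to p*n
 * gives 1 - (1-p)^3 of the total.
 */
#include <math.h>
#include <stdio.h>

static double lu_work_fraction(double p)
{
    return 1.0 - pow(1.0 - p, 3.0);
}

int main(void)
{
    for (double p = 0.1; p <= 1.0001; p += 0.1)
        printf("factored %3.0f%% of columns -> %5.1f%% of LU flops done\n",
               100.0 * p, 100.0 * lu_work_fraction(p));
    return 0;
}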


[Figure: relative performance versus relative time (percentage of the factored portion of the matrix), both on a 0% to 100% scale; the arrows mark the what-if reading that 50% of the time yields 95% of the performance]

HPL 2.1, RELEASED OCTOBER 2012

• The output now reports exact timestamps before and after the execution of the solver function HPL_pdgesv(). This could allow for accurate accounting of running time for data center management purposes, for example for reporting power consumption, which is important for the Green500 project.
• Fixed an out-of-bounds access to arrays in the HPL_spreadN() and HPL_spreadT() functions that could cause segmentation fault signals. It was reported by Stephen Whalen from Cray.
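The kind of bookkeeping HPL 2.1 adds around its solver can be mimicked in user code. The sketch below is only an illustration: my_solver() is a placeholder, not an HPL routine, and the printed format is not HPL's output format. It records a wall-clock timestamp before and after a solver call plus an elapsed time from MPI_Wtime(), the sort of record a data center can correlate with its power logs.

/*
 * Illustration of timestamping a distributed solve: wall-clock timestamps
 * before and after the call, plus elapsed time from MPI_Wtime().
 * my_solver() is a placeholder and the output format is not HPL's.
 */
#include <mpi.h>
#include <stdio.h>
#include <time.h>

static void print_timestamp(const char *label)
{
    char buf[64];
    time_t t = time(NULL);
    strftime(buf, sizeof buf, "%Y-%m-%d %H:%M:%S", localtime(&t));
    printf("%s: %s\n", label, buf);
}

/* Placeholder for the distributed solve being timed. */
static void my_solver(void) { /* ... */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) print_timestamp("solver start");
    double t0 = MPI_Wtime();

    my_solver();

    double t1 = MPI_Wtime();
    if (rank == 0) {
        print_timestamp("solver end");
        printf("elapsed: %.3f s\n", t1 - t0);
    }
    MPI_Finalize();
    return 0;
}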

FIND OUT MORE AT http://icl.cs.utk.edu/hpl