HPC Challenge (HPCC) Benchmark Suite
Jack Dongarra, Piotr Luszczek, Allan Snavely
SC07, Reno, Nevada; Sunday, November 11, 2007; Session S09, 8:30AM-12:00PM, Room A1

Acknowledgements
• This work was supported in part by the Defense Advanced Research Projects Agency (DARPA), the Department of Defense, the National Science Foundation (NSF), and the Department of Energy (DOE) through the DARPA High Productivity Computing Systems (HPCS) program under grant FA8750-04-1-0219 and under Army Contract W15P7T-05-C-D001
• Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government

Introduction
• HPC Challenge Benchmark Suite
– To examine the performance of HPC architectures using kernels with more challenging memory access patterns than HPL
– To augment the TOP500 list
– To provide benchmarks that bound the performance of many real applications as a function of memory access characteristics, e.g., spatial and temporal locality
– To outlive HPCS
• HPC Challenge pushes spatial and temporal boundaries and defines performance bounds

TOP500 and HPCC
• TOP500
– Performance is represented by only a single metric
– Data is available for an extended time period (1993-2005)
– Problem: there can only be one "winner"
• HPCC
– Performance is represented by multiple metrics
– The benchmark is new, so data is available only for a limited time period (2003-2007)
– Problem: there cannot be one "winner"
• Additional metrics and statistics
– Count (single-)vendor systems on each list
– Perform trend analysis on each list
– Count total flops on each list per vendor
– Use external metrics: price, cost of ownership, power, …
– Focus on growth trends over time
• We avoid "composite" benchmarks
• HPCC can be used to show complicated kernel/architecture performance characterizations
– Select some numbers for comparison
– Use Kiviat charts (radar plots): best when showing the differences due to a single independent "variable"
– Over time, also focus on growth trends

High Productivity Computing Systems (HPCS)
• Goal: provide a new generation of economically viable high-productivity computing systems for the national security and industrial user community (2010)
• Impact:
– Performance (time-to-solution): speed up critical national security applications by a factor of 10x to 40x
– Programmability (idea-to-first-solution): reduce the cost and time of developing application solutions
– Portability (transparency): insulate research and operational application software from the system
– Robustness (reliability): apply all known techniques to protect against outside attacks, hardware faults, and programming errors
• HPCS program focus areas: performance characterization and prediction, programming models, hardware technology, system architecture, and software technology, surrounded by industry R&D and analysis & assessment
• Applications: intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling, and biotechnology
• Fill the critical technology and capability gap from today (late-80's HPC technology) to the future (quantum/bio computing)

HPCS Program Phases I-III
• Productivity assessment by mission partners (MIT LL, DOE, DoD, NASA, NSF)
• Industry review milestones 1-7: concept review, system design review, SCR, PDR, CDR, DRR, and early and final demos, with technology assessment reviews in between
• Software releases (HPLS plan, SW Rel 1-3) feed mission-partner peta-scale development-unit procurements, system commitments, application development, and MP language development
• Timeline (CY02-11): Phase I concept study (five vendors funded), Phase II R&D (three funded), Phase III prototype development, with mission-partner program reviews and critical-milestone procurements

HPCS Benchmark and Application Spectrum
• The spectrum runs from execution indicators and system bounds to development indicators:
– 8 HPCchallenge benchmarks: local (DGEMM, STREAM, RandomAccess, 1D FFT) and global (Linpack, PTRANS, RandomAccess, 1D FFT), setting execution bounds
– Micro & kernel benchmarks (~40): a spanning set of kernels covering linear solvers, signal processing, graph analysis, pattern matching, I/O, and others
– Scalable compact applications (~10): discrete math, graph analysis, signal processing, simulation
– 9 simulation applications, plus existing, emerging, and future application codes (simulation, intelligence, reconnaissance; classroom experiment)
• Spectrum of benchmarks provides different views of the system
• HPCchallenge pushes spatial and temporal boundaries; sets performance bounds
• Applications drive system issues; set legacy code performance bounds
• Kernels and compact apps allow deeper analysis of execution and development time

Motivation of the HPCC Design
• The HPCC kernels bound the spatial/temporal locality plane: HPL and DGEMM sit at high temporal and high spatial locality (with PTRANS nearby), STREAM at high spatial but low temporal locality, FFT at high temporal but lower spatial locality, and RandomAccess at low spatial and low temporal locality; mission-partner applications fall inside these bounds

Measuring Spatial and Temporal Locality (1/2)
• Scatter plot (generated by PMaC @ SDSC) of temporal vs. spatial locality for the HPC Challenge benchmarks (HPL, FFT, STREAM, RandomAccess) and select applications (Gamess, Overflow, HYCOM, Test3D, OOCore, RFCTH2, AVUS, CG)
• Spatial and temporal data locality here is for one node/processor, i.e., locally or "in the small"

Measuring Spatial and Temporal Locality (2/2)
• The same plot (generated by PMaC @ SDSC), with performance regions marked:
– High temporal locality: good performance on cache-based systems
– High spatial locality with sufficient temporal locality: satisfactory performance on cache-based systems (spatial locality occurs in registers)
– High spatial locality alone: moderate performance on cache-based systems
– No temporal or spatial locality: poor performance on cache-based systems

Supercomputing Architecture Issues
• Standard parallel computer architecture: CPUs, each with RAM and disk, connected by a network switch
• Corresponding memory hierarchy: registers (instr. operands), cache (blocks), local memory (messages), remote memory (pages), disk
• Performance implications: moving down the hierarchy, latency and capacity increase while bandwidth and programmability decrease
• Standard architecture produces a "steep" multi-layered memory hierarchy
– The programmer must manage this hierarchy to get good performance
• HPCS technical goal
– Produce a system with a "flatter" memory hierarchy that is easier to program

HPCS Performance Targets
• Each HPC Challenge benchmark corresponds to a level of the memory hierarchy, from registers and cache down through local and remote memory; the HPCS targets (with improvement over current systems) are:
– Top500: solves a system Ax = b; 2 Petaflops (8x)
– STREAM: vector operations A = B + s x C; 6.5 Petabyte/s (40x)
– FFT: 1D Fast Fourier Transform, Z = FFT(X), stressing local memory bandwidth and latency; 0.5 Petaflops (200x)
– RandomAccess: random updates T(i) = XOR(T(i), r); 64,000 GUPS (2000x)
• The HPCS program has developed a new suite of benchmarks (HPC Challenge)
• Each benchmark focuses on a different part of the memory hierarchy
• HPCS program performance targets will flatten the memory hierarchy, improve real application performance, and make programming easier

Official HPCC Submission Process
1. Download
2. Install
3. Run
4. Upload results
5. Confirm via email
6. Tune
7. Run
8. Upload results
9. Confirm via email
• Prerequisites: C compiler, BLAS, MPI
• Provide detailed installation and execution environment
• In the tuned run, only some routines can be replaced; data layout needs to be preserved; multiple languages can be used
• Results are immediately available on the web site:
– Interactive HTML
– XML
– Optional: MS Excel, Kiviat charts (radar plots)

HPCC as a Framework (1/2)
• Many of the component benchmarks were widely used before the HPC Challenge suite of benchmarks was assembled
– HPC Challenge has been more than a packaging effort
– Almost all component benchmarks were augmented from their original form to provide consistent verification and reporting
• We stress the importance of running these benchmarks on a single machine, with a single configuration and set of options
– The benchmarks were useful separately for the HPC community, meanwhile
– The unified