The HPC Challenge (HPCC) Benchmark Suite Characterizing a System with Several Specialized Kernels
The HPC Challenge (HPCC) Benchmark Suite
Characterizing a system with several specialized kernels

Rolf Rabenseifner, [email protected]
High Performance Computing Center Stuttgart (HLRS), University of Stuttgart, www.hlrs.de
SPEC Benchmarking Joint US/Europe Colloquium, June 22, 2007, Technical University of Dresden, Dresden, Germany
http://www.hlrs.de/people/rabenseifner/publ/publications.html#SPEC2007 (1/31)

Acknowledgements (2/31)
• Update and extract from SC'06 Tutorial S12: Piotr Luszczek, David Bailey, Jack Dongarra, Jeremy Kepner, Robert Lucas, Rolf Rabenseifner, Daisuke Takahashi: "The HPC Challenge (HPCC) Benchmark Suite", SC06, Tampa, Florida, Sunday, November 12, 2006
• This work was supported in part by the Defense Advanced Research Projects Agency (DARPA), the Department of Defense, the National Science Foundation (NSF), and the Department of Energy (DOE) through the DARPA High Productivity Computing Systems (HPCS) program under grant FA8750-04-1-0219 and under Army Contract W15P7T-05-C-D001
• Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government

Outline (3/31)
• Overview
• The HPCC kernels
• Database & output formats
• HPCC award
• Augmenting TOP500
• Balance Analysis
• Conclusions

Introduction: HPC Challenge Benchmark Suite (4/31)
• Goals:
– To examine the performance of HPC architectures using kernels with more challenging memory access patterns than HPL
– To augment the TOP500 list
– To provide benchmarks that bound the performance of many real applications as a function of memory access characteristics, e.g., spatial and temporal locality

TOP500 and HPCC (5/31)
• TOP500
– Performance is represented by only a single metric
– Data is available for an extended time period (1993-2006)
– Problem: there can only be one "winner"
• HPCC
– Performance is represented by multiple single metrics
– Benchmark is new, so data is available for a limited time period (2003-2007)
– Problem: there cannot be one "winner"; we avoid "composite" benchmarks
• Additional metrics and statistics for TOP500:
– Count (single) vendor systems
– Perform trend analysis on each list
– Count total flops on each list per vendor
– Use external metrics: price, ownership cost, power, …
– Focus on growth trends over time
• HPCC can be used to show complicated kernel/architecture performance characterizations:
– Select some numbers for comparison
– Use kiviat charts (best when showing the differences due to a single independent "variable")
– Compute balance ratios (over time, also focus on growth trends)

High Productivity Computing Systems (HPCS) (6/31)
Goal: provide a new generation of economically viable high productivity computing systems for the national security and industrial user community (2010).
• Performance (time-to-solution): speed up critical national security applications by a factor of 10X to 40X
• Programmability (idea-to-first-solution): reduce cost and time of developing application solutions
• Portability (transparency): insulate research and operational application software from the system
• Robustness (reliability): apply all known techniques to protect against outside attacks, hardware faults, and programming errors
HPCS program focus areas: industry R&D, analysis & assessment, performance characterization & prediction, programming models, hardware technology, system architecture, software technology.
Applications: intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling, and biotechnology.
Fill the critical technology and capability gap: today (late 80's HPC technology) … to … future (quantum/bio computing).

Motivation of the HPCC Design (7/31)
[Figure: select applications (Gamess, Overflow, HYCOM, Test3D, OOCore, AVUS, RFCTH2, CG) and HPCC benchmarks (HPL, FFT, STREAM, RandomAccess) plotted by spatial locality (x-axis) vs. temporal locality (y-axis); generated by PMaC @ SDSC]
• High temporal locality: good performance on cache-based systems (HPL)
• High spatial locality with sufficient temporal locality: satisfactory performance on cache-based systems (FFT)
• High spatial locality only: moderate performance on cache-based systems (STREAM)
• No temporal or spatial locality: poor performance on cache-based systems (RandomAccess)
Note: spatial and temporal data locality here is for one node/processor, i.e., locally or "in the small".
Further information: "Performance Modeling and Characterization" @ San Diego Supercomputer Center, http://www.sdsc.edu/PMaC/

HPCS Performance Targets (8/31)
Each HPC Challenge benchmark stresses a different level of the memory hierarchy (registers, cache, local memory, remote memory, disk). HPCS targets, with improvements compared to BlueGene/L, Nov. 2006:
• Top500/HPL (solves a system Ax = b): 2 Petaflops (8x)
• STREAM (vector operations A = B + s × C): 6.5 Petabyte/s (40x)
• FFT (1D Fast Fourier Transform, Z = FFT(X)): 0.5 Petaflops (200x)
• RandomAccess (random updates, T(i) = XOR(T(i), r)): 64,000 GUPS (2000x)
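The STREAM target above is the "triad" vector operation A = B + s × C. As a point of reference, the inner kernel can be sketched in a few lines of C; this is a minimal sketch only (the function name is ours, and the real benchmark adds timing, repetition, and the other three kernels):

```c
#include <stddef.h>

/* Minimal sketch of the STREAM triad kernel: a[i] = b[i] + s * c[i].
   The real STREAM benchmark times repeated sweeps over large arrays
   to measure sustainable memory bandwidth. */
void stream_triad(double *a, const double *b, const double *c,
                  double s, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] + s * c[i];
}
```

The loop does two loads, one store, and two flops per element, so its performance is bound by memory bandwidth rather than by arithmetic, which is why the HPCS target for it is stated in bytes per second.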
• The HPCS program has developed a new suite of benchmarks (HPC Challenge)
• Each benchmark focuses on a different part of the memory hierarchy
• HPCS program performance targets will flatten the memory hierarchy, improve real application performance, and make programming easier

HPCC as a Framework (1/2) (9/31)
• Many of the component benchmarks were widely used before
– HPCC is more than a packaging effort
– E.g., it provides consistent verification and reporting
• Important: these benchmarks run on a single machine, with a single configuration and options
– The benchmark components remain useful separately for the HPC community
– Meanwhile, the unified HPC Challenge framework creates an unprecedented view of the performance characterization of a system
• A comprehensive view with data captured under the same conditions allows for a variety of analyses depending on end-user needs

HPCC as a Framework (2/2) (10/31)
• A single executable is built to run all of the components
– Easy interaction with batch queues
– All codes are run under the same OS conditions, just as an application would be
• No special mode (page size, etc.) for just one test (say, the Linpack benchmark)
• Each test may still have its own set of compiler flags
– Changing compiler flags in the same executable may inhibit inter-procedural optimization
• Scalable framework (Unified Benchmark Framework)
– By design, the HPC Challenge benchmarks are scalable, with the size of the data sets being a function of the largest HPL matrix for the tested system

HPCC Tests at a Glance (11/31)
1. HPL (High Performance Linpack)
– Solving Ax = b, with A ∈ R^(n×n) and x, b ∈ R^n
2. DGEMM (Double-precision General Matrix-matrix Multiply)
– Computing C ← αAB + βC, with A, B, C ∈ R^(n×n) and α, β ∈ R
– Temporal/spatial locality: similar to HPL
3. STREAM
– Measures sustainable memory bandwidth with vector operations
– COPY: c = a; SCALE: b = αc; ADD: c = a + b; TRIAD: a = b + αc
4. PTRANS (Parallel matrix TRANSpose)
– Computing A = A^T + B
– Temporal/spatial locality: similar to EP-STREAM, but includes global communication
5. RandomAccess
– Calculates a series of integer updates to random locations in memory:
    Ran = 1;
    for (i = 0; i < 4*N; ++i) {
        Ran = (Ran << 1) ^ (((int64_t)Ran < 0) ? 7 : 0);
        Table[Ran & (N-1)] ^= Ran;
    }
– Use at least 64-bit integers
– About half of memory is used for 'Table'
– Parallel look-ahead limited to 1024
6. FFT (Fast Fourier Transform)
– Computing z_k = Σ_j x_j exp(-2π√-1 jk/n), with x, z ∈ C^n
7. b_eff
– Patterns: ping-pong, natural ring, and random ring
– Bandwidth (with 2,000,000-byte messages)
– Latency (with 8-byte messages)

Random Ring Bandwidth (12/31)
• Reflects communication patterns in unstructured grids, and in the 2nd and 3rd dimension of a Cartesian domain decomposition
• On clusters of SMP nodes:
– Some connections are inside the nodes
– Most connections are inter-node
• Global benchmark: all processes participate
• Reported: bandwidth per process
• Accumulated bandwidth := bandwidth per process × #processes (similar to bi-section bandwidth)

HPCC Testing Scenarios (13/31)
1. Local: only a single process computes
2. Embarrassingly parallel: all processes compute and do not communicate (explicitly)
3. Global: all processes compute and communicate
4. Network only
Base vs. Optimized Submission (14/31)
• G-RandomAccess:

System     Processor         Count  Run   G-HPL    G-PTRANS  G-Random  G-FFTE   G-STREAM  EP-STREAM  EP-DGEMM  Random Ring  Random Ring
                                    type  TFlop/s  GB/s      Access    GFlop/s  Triad     Triad      GFlop/s   bandwidth    latency
                                                             Gup/s              GB/s      GB/s                 GB/s         usec
Cray mfeg8 X1E 1.13 GHz      248    opt   3.3889   66.01     1.85475   -1       3280.9    13.229     13.564    0.29886      14.58
Cray X1E   X1E MSP 1.13 GHz  252    base  3.1941   85.204    0.014868  15.54    2440      9.682      14.185    0.36024      14.93

• Base code: latency-based execution
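The random-ring columns in the table report per-process bandwidth; the accumulated figure follows from the definition on the Random Ring Bandwidth slide (bandwidth per process × #processes). A one-line helper (our name, not part of HPCC) applied to the two runs:

```c
/* Accumulated random-ring bandwidth := bandwidth per process x #processes,
   per the definition on the Random Ring Bandwidth slide.
   Helper name is ours, not part of the HPCC code. */
double accumulated_bw(double per_process_gb_s, int nprocs)
{
    return per_process_gb_s * (double)nprocs;
}
```

For the table rows above: 0.29886 GB/s × 248 ≈ 74.1 GB/s for the optimized run, versus 0.36024 GB/s × 252 ≈ 90.8 GB/s for the base run, so the base run sustains the higher aggregate ring bandwidth even though the optimized run wins G-RandomAccess by two orders of magnitude.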