
Performance Analysis

Measuring High-Performance Computing with Real Applications

The platforms the authors describe here performed both the best and the worst in a test of selected applications. “Best performance” significantly depended on the problem being solved, calling into question the value of computer rankings that use simplistic metrics.

Mohamed Sayeed, Hansang Bae, Yili Zheng, Brian Armstrong, Rudolf Eigenmann, and Faisal Saied, Purdue University

Benchmarking can help us quantify the ability of high-performance computing (HPC) systems to perform certain, known tasks. An obvious use of the results is to find a given architecture’s suitability for a selection of computer applications. Other equally important uses include the creation of yardsticks to assess the progress and requirements of future HPC technology. Such measurements must be based on real computer applications and a consistent benchmarking process. Although metrics that measure simple program kernel behavior have worked well in the past, they fail to answer questions that deal with the true state of the art of today’s HPC technology and the directions future HPC development should take. These answers have a direct impact on competitiveness.

To quantify this belief, we compared kernel benchmarks with real application benchmarks. We used high-performance Linpack (HPL) to represent kernel benchmarks; researchers often use HPL in highly visible projects, such as those ranked in the top 500 supercomputers (www.top500.org). For application benchmarks, we selected four computational codes represented in the Standard Performance Evaluation Corporation’s SPEC HPC2002 and SPEC MPI2007 suites (www.spec.org/hpg) and in the US National Science Foundation (NSF) HPC system acquisitions process.

Although we advocate the use of real application benchmarks, it’s important to note that kernel benchmarks have an essential role—a small code fragment that can time a single message exchange, for example, is most appropriate for measuring a message-passing function’s latency. We don’t argue here that small benchmarks are unimportant, but that people often use them to answer the wrong questions. Kernels are appropriate for measuring system components, but observations of real application behavior are essential for assessing HPC technology.
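As an illustration of the kind of kernel measurement described above, the following minimal sketch times a single round-trip message exchange between two MPI ranks. It assumes the mpi4py package and a working MPI installation; the message size and repetition count are arbitrary choices for illustration.

```python
# Minimal ping-pong sketch: estimate point-to-point message latency.
# Assumes mpi4py is installed; run with, e.g., "mpiexec -n 2 python pingpong.py".
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

REPS = 1000                     # number of round trips to average over
buf = np.zeros(1, dtype='b')    # 1-byte message isolates latency from bandwidth

comm.Barrier()
start = MPI.Wtime()
for _ in range(REPS):
    if rank == 0:
        comm.Send(buf, dest=1, tag=0)
        comm.Recv(buf, source=1, tag=1)
    elif rank == 1:
        comm.Recv(buf, source=0, tag=0)
        comm.Send(buf, dest=0, tag=1)
elapsed = MPI.Wtime() - start

if rank == 0:
    # Half the average round-trip time approximates one-way latency.
    print(f"approx. one-way latency: {elapsed / (2 * REPS) * 1e6:.2f} microseconds")
```

A kernel like this answers its narrow question well, but, as argued above, it says little about how a full application will behave.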

Key Benchmarking Questions
Table 1 lists some important questions and indicates the ability of either kernel or full application benchmarks to provide answers. The first question considers the time required to solve today’s important problems. We might learn that a biology application takes a full year to fold a specific protein on a present-day HPC platform or that forecasting the weather with a certain resolution and accuracy takes 10 hours. For obvious reasons, kernel benchmarks don’t address such questions: we can’t infer solution times for real problems from kernel execution times. In contrast, real application benchmarks with data sets that represent actual problems give direct answers.


Table 1. Comparison of kernel versus full application metrics.*

Benchmarking question | Ability to answer (kernels) | Ability to answer (full applications)
1. What time is required to solve important computational problems on today’s high-performance computing platforms? | n/a | +
2. What’s the overall platform performance? | – | +
3. How do system components perform? | + | –
4. What’s the importance of system components relative to each other? | n/a | +
5. What’s the importance of system components relative to upper bounds? | – | +
6. What are the characteristics of important computational problems? | – | +
7. What are the characteristics of important future problems? | – | +

* “+” is a good answer, “–” is a limited answer, and “n/a” is no answer.

The second question deals with relative performance. Although both types of benchmarks provide answers, a system’s overall ranking can hinge on the type of application involved. The same machine can perform the best and the worst, so relying on simple numbers not only provides inaccurate answers, it can lead to the wrong conclusions about machine suitability and HPC technology in general.

The strength of specialized kernel benchmarks is their ability to measure individual machine features by focusing on a particular system aspect. However, the results don’t answer the question of relative component importance—breaking down a real application’s execution time into computation, communication, and I/O, for example, shows each component’s true relative importance. We can’t get this type of information by combining each component’s individual kernel benchmarks. However, kernel metrics can answer the question of what percentage of peak performance the floating-point operations can achieve: such metrics are idealized bounds under a given code pattern. But these results don’t represent the percentage of peak performance a real application can achieve.

The sixth question deals with an important group of issues that benchmarking and performance evaluation must address: the characteristics of computational applications. Kernel benchmarks can’t help here—they’re typically created by identifying real application characteristics and then focusing on one of them.

Benchmarking can help predict the application behavior of data sets and machines that are much larger than what’s feasible today. As the last question illustrates, the value of kernel versus real application benchmarks for such prediction is still controversial.

Challenges of Measuring Performance with Real Applications
Communities ranging from HPC customers to computer manufacturers to research funding agencies to scientific teams have increasingly called for better yardsticks. Let’s look at the stark contrast between this need and conventional wisdom.

Simple Benchmarks Are Overly Easy to Run
One of the greatest challenges of evaluating performance with real applications is the simplicity with which we can run kernel benchmarks. With the investment of minutes to a few hours, we can generate a benchmark report that seemingly puts our machine “on the map.” The simplicity of kernel benchmarks can overcome portability issues (no changes to the source code are necessary to port it to a new machine) as well as software environment issues (the small code is unlikely to engage unproven compiler optimizations or to break programming tools in a beta release). This simplicity might even overcome hardware issues—the kernel is unlikely to approach numerical instability, to which a new processor’s floating-point unit might be susceptible.

However, these are the very features that we want in a true HPC evaluation. We don’t want a machine to show up in the “Best 100” if it takes major code restructuring to port a real application, if porting the application requires major additional debugging, if the tools and compilers aren’t yet mature, or if the hardware is still unstable.

Specialized benchmarks exist for a large number of metrics, including message counting in message-passing interface (MPI) applications, measuring the memory bandwidth of symmetric multiprocessor (SMP) platforms, gauging fork-join overheads in OpenMP implementations, studying system utilization characteristics,1 and many more. The SPEC benchmarking organization (www.spec.org) alone distributes 12 major benchmark suites.


Diverse suites play an essential role because they can help us understand a specific aspect of a system in depth, but when it comes to understanding the behavior of a system as a whole, these detailed measures don’t suffice. Even if a specific benchmark could measure each and every aspect of a system, no formula exists to help us combine the different numbers into an overall system metric.

We Can’t Abstract Realistic Benchmarks from Real Applications
An inviting approach to benchmarking with real applications is to extract important excerpts from real codes, with the aim of preserving the relevant features and omitting the unnecessary ones. Unfortunately, this means we’re making decisions about an application’s less important parts that might be incorrect. For example, we might find a loop that executes 100 computationally near-identical iterations and reduce it to just a few, but this abstraction might render the code useless for evaluating adaptive compiler techniques: repetitive behavior can be crucial for an adaptive algorithm to converge on the best optimization technique. Similarly, data downsizing techniques2 might ensure that a smaller problem’s cache behavior remains the same, but the code would become useless for determining the real problem’s memory footprint.

The difficulty of defining scaled-down benchmarks is evident in the many criteria for benchmark selection suggested in past efforts: codes should contain a balance of various degrees of parallelism, be scalable, be simple yet reflect real-world problems, use a large memory footprint and a large working set, be executable on a large number of processors, exhibit a variety of cache-hit ratios, and be amenable to a variety of programming models for shared-memory, multicore, and cluster architectures. Last but not least, benchmark codes should represent a balanced set of computer applications from diverse fields.

Obviously, nothing could satisfy all of these demands—some of them are directly contradictory. Furthermore, selecting yardsticks by such criteria would dismiss the fact that we want to learn about these properties from real applications. We want to learn how scalable a real application is, for example, not just select a scalable one. Similarly, if an application’s real data sets don’t have large memory footprints, then our evaluation has produced an important result: inflating input data parameters to fill some memory footprint benchmark selection criterion wouldn’t be meaningful.

Today’s Real Applications Might Not Be Tomorrow’s
Large, real applications tend to contain legacy code with programming practices that don’t reflect tomorrow’s software engineering principles and algorithms. This is perhaps the strongest argument against using today’s real applications to determine future HPC needs, but when we ask what future applications should include, the answer isn’t forthcoming. Should we use specific selection criteria? Are we sure that the best of today’s algorithms and software engineering principles will find themselves in tomorrow’s applications? If we choose a certain path and miss, we risk losing on two fronts: abandoning today’s established practices and erring in what tomorrow’s technology will be. Nevertheless, the best predictor of tomorrow could very well be today’s established practice.

Benchmarking Isn’t Eligible for Research Funding
Performance evaluation and benchmarking projects are long-term efforts, so data must be gathered and kept for many years. (The 15-year-old [email protected] mailing list still receives occasional queries.) Unfortunately, this type of task doesn’t easily include the advanced performance-modeling topics that interest scientists, so most efforts that focus on benchmarking infrastructure and its long-term support lack continuity in funding from science sponsors.

Programs that create such services at funding agencies could eventually appear—perhaps this article will motivate future initiatives. An alternative is a combined academic/industrial effort, such as SPEC’s High-Performance Group (HPG), in which industrial benchmarking needs meet academic interests, resulting in suites of real HPC applications.

Maintaining Benchmarking Efforts Is Costly


A performance evaluation and benchmarking effort entails collecting test applications, ensuring portability, developing self-validation procedures, defining benchmark workloads, creating benchmark run and result submission rules, organizing result review committees, disseminating the suite, maintaining result publication sites, and even protecting evaluation metrics from misuse. Dealing with real, large-scale applications further necessitates assisting benchmark users and maintaining the involvement of domain experts in their respective application areas.

The high cost of these tasks is obvious. Many benchmarking efforts cover their costs through initial research grants or volunteer efforts, but this support typically pays for the first round of benchmark definitions, not subsequent steps or publication sites. To date, SPEC is the only organization that maintains full, continuously updated benchmarking efforts. Its funding comes primarily from industrial membership and a comparably small fee for the actual benchmark suites.

Proprietary Application Benchmarks Can’t Serve as Open Yardsticks
So, should realistic benchmarks match the exact applications that will run on the target system of interest? Clearly, prospective customers think their own applications would be the best choice for testing a system’s desired functionality. It might serve the customer best if multiple vendors ran applications in a way that allowed fair comparison, but because most applications are proprietary, neither the scientific community nor the public can verify or scrutinize the generated performance claims. Hence, the value of proprietary benchmarks is limited to simple metrics.

Public benchmarks might be of higher value. Fair benchmark results can be costly to generate, and vendors are under pressure to produce good results in a very short period, often competing internally for machine resources. Unless customers can closely supervise the benchmarking process with significant expertise, they might not get results that they can compare fairly. Without any deceptive intentions on the vendor side, the lack of a clear evaluation methodology often allows shortcuts and “optimizations” that can differ significantly among evaluation groups. In contrast, even though established, public, full-application benchmark results might not have computational patterns identical to a customer’s codes, the performance numbers’ consistency and availability might outweigh this drawback.

Meeting the Challenge: Toward a Benchmarking Methodology
An advanced HPC benchmarking methodology

• creates performance yardsticks based on real applications and openly shared data sets,
• defines metrics that indicate overall problem solution time as well as the performance of important kernels that constitute the application,
• defines rules for running and reporting benchmarks, so that comparisons are fair and any relevant information is fully disclosed, and
• enables the creation of a repository that maintains benchmarking reports over a long time period.

Rules are only useful if they’re enforceable, which emphasizes the need for a benchmark review process. The goal here is to facilitate objective efforts rather than self-evaluations by machine vendors and HPC platform owners. Full disclosure is also crucial: the applications, their data sets, the benchmarking process, and the software and hardware configuration with which the results were obtained must all be open to inspection.

An open repository of fully disclosed benchmark results is essential to fair, consistent HPC evaluation. Such an effort benefits prospective customers of HPC systems as well as researchers attempting to advance HPC technology. It’s even more important in the acquisition and evaluation of HPC systems with public funds.
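To make the idea of full disclosure concrete, the following sketch shows one possible shape for a self-describing benchmark record. It is only an illustration of the kind of information that must be open to inspection; the field names and the JSON format are assumptions made for this sketch, not part of any SPEC or NSF reporting rule. The sample values come from Table 2; fields we don’t know are marked unspecified.

```python
# Hypothetical full-disclosure record for one benchmark run (illustrative only).
import json
from dataclasses import dataclass, asdict, field

@dataclass
class BenchmarkReport:
    application: str            # benchmark code, e.g., "WRF"
    data_set: str               # openly shared input data set
    wall_clock_seconds: float   # overall problem solution time
    processors: int
    hardware: dict = field(default_factory=dict)   # CPU, memory, interconnect, ...
    software: dict = field(default_factory=dict)   # compiler, flags, MPI library, ...
    run_rules_followed: bool = True                # asserts compliance with the suite's rules

report = BenchmarkReport(
    application="WRF", data_set="medium", wall_clock_seconds=742.0, processors=32,
    hardware={"system": "IBM P690", "interconnect": "unspecified"},
    software={"compiler": "unspecified", "mpi": "unspecified"},
)

# A repository could archive such records for long-term, reviewable comparison.
print(json.dumps(asdict(report), indent=2))
```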
Benchmarking Efforts with Real Applications
Performance evaluation efforts with real applications began to emerge in 1988 with the Perfect Benchmarks.3,4 This set of scientific and engineering applications was intended to replace kernel-type benchmarks and the commonly used peak-performance numbers, and represented a significant step in the direction of application-level benchmarking. Although they continue to circulate in the research community, the original Perfect Benchmarks are small compared to today’s real applications (the largest included some 20,000 lines of Fortran 77), and their data sets execute in the order of seconds on today’s machines. No results Web site is available for the Perfect Benchmarks.

Similarly, the ParkBench (for PARallel Kernels and BENCHmarks)5 effort emerged in 1992 with some research funding but didn’t update its initial suites in response to newer generations of HPC systems.


The effort was very ambitious in its goal of delivering a set of benchmarks that range from kernels to full applications. The largest, full-application suite was never created, though.

SPEC HPG
Like the Perfect Benchmarks, SPEC also debuted in 1988. Although it’s largely vendor based, the organization includes a range of academic affiliates. Initially, SPEC focused on benchmarking uniprocessor machines and, in this area, became the leader in providing performance numbers to workstation customers. SPEC suites are also increasingly used by the research community to evaluate the performance of new architecture concepts and software prototypes. Today, SPEC offers a large number of benchmarks that evaluate workstations, graphics capabilities, and high-performance systems. Most notably for HPC evaluation, the SPEC HPG formed in 1994 out of an initiative to merge the Perfect Benchmarks effort’s expertise in HPC evaluation with SPEC’s capability to sustain such efforts long term. Since its foundation, this group has produced several HPC benchmarks, including the HPC suite,6–8 the OMP suite (for OpenMP applications),9 and the MPI suite.

SPEC’s HPG suites are based on widely used computational applications that can be openly distributed to the community. The codes are implemented with the MPI and OpenMP standards for parallel processing, and SPEC provides a result submission and review process, a repository at www.spec.org/hpg, and a continuous benchmark update process. Intended consumers include end users, system vendors, software vendors, and researchers.

Other HPC Benchmarking Efforts
Other attempts to provide HPC benchmarks have emerged over the past decade as well. Notable examples include the Euroben effort in Europe (www.euroben.nl) and NASA’s Parallel Benchmarks (NPB).10 The US Department of Defense’s High Performance Computing Modernization Program (HPCMP; www.hpcmo.hpc.mil) developed its own suites that include synthetic benchmarks and real applications to support its yearly acquisition activities. A related effort is the benchmarking project that’s part of Darpa’s High-Productivity Computing Systems (HPCS) program (www.highproductivity.org). Currently, this effort has suites of kernel benchmarks and several synthetic compact applications. Another important effort is the benchmarking process defined by the NSF for its acquisition of a national HPC system infrastructure. The NSF benchmarks include both a set of kernel benchmarks and real applications for system evaluation. However, neither the HPCS nor the NSF benchmarking efforts have an associated project to maintain a result repository that the public can view.

Figure 1. Top application performers. This aggregate view combines the SPEC HPC2002 suite’s results into an overall rank, with the bars subdivided into each individual benchmark application’s contributions to the rank. The TAP lists also show rankings based on other application suites.

Metrics for HPC Evaluation
The general consensus in the benchmarking community is that overall performance must be evaluated via wall-clock time measurements, but an open and often controversial issue is how to combine measurements from multiple benchmarks into one final number. SPEC HPC’s approach is to leave the decision up to the benchmark report reader—that is, each code is reported separately. Other suites, such as SPEC CPU, SPEC OMP, and SPEC MPI, report the geometric mean of individual program performance results (see the sketch at the end of this section). The NSF benchmarking effort includes both kernel and application metrics, which are reported independently. Kernel benchmarks evaluate a wide variety of system components, so the metrics also vary widely, and to our knowledge, no current benchmarking effort can relate kernel and application benchmarks by, for example, measuring important constituent kernels as part of a real application benchmark run.

An increasingly important class of metrics characterizes the program properties of computational applications (examples include the use and frequency of algorithms, program patterns, and compiler analysis results). Understanding such benchmarking metrics is key to improving the performance of computational applications.
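As a brief illustration of the aggregation issue discussed above, the sketch below combines per-benchmark performance ratios with a geometric mean, the approach the SPEC CPU, OMP, and MPI suites use; reporting each code separately simply means skipping the last step. The ratios shown are made-up values for illustration, not measured results.

```python
# Combining per-benchmark results into one number (illustrative values only).
from math import prod

# Performance ratios (reference time / measured time), one per benchmark code.
ratios = {"codeA": 2.1, "codeB": 0.8, "codeC": 1.5}   # hypothetical numbers

def geometric_mean(values):
    """Geometric mean, the aggregate used by suites such as SPEC CPU."""
    values = list(values)
    return prod(values) ** (1.0 / len(values))

print("per-code ratios:", ratios)                     # SPEC HPC-style: report each code
print("geometric mean:", round(geometric_mean(ratios.values()), 3))
```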


Table 2. Wall-clock execution times and problems solved by application benchmarks on the IBM P690 platform, using 32 processors.

Application | Execution time | Problem description for a medium data set
Seismic (SPECseis) | 625 sec. | Seismic is a suite of codes typical of the seismic processing applications used to find oil and gas. The data set processes seismic traces of 512 × 48 × 128 × 128 (samples per trace × traces per group × groups per line × number of lines, where a trace corresponds to a sensor that has a sampling frequency; the sensors are strung out on multiple cables behind a ship). The total data set size in Phase 1 of SPECseis is 1.5 Gbytes and reduces to 70 Mbytes in Phase 2.
GAMESS (SPECchem) | 3,849 sec. | GAMESS is a general ab initio quantum chemistry package. The data set computes self-consistent field wavefunctions (RHF type) for thymine (C5H7N3O – 16 atoms), one of the bases found in nucleic acids.
WRF (SPECenv) | 742 sec. | WRF is a weather research and forecasting modeling system for the mesoscale (meters to thousands of kilometers). The data set simulates the weather over the continental US for a 24-hour period starting from Saturday, 3 November 2001, at 12:00 a.m. The grid is 260 × 164 × 35 with a 22-km resolution.
MILC | 1,497 sec. | MILC is used for large-scale numerical simulation of quantum chromodynamics to calculate the masses and other basic properties of strongly interacting particles (quarks and gluons). It simulates quantum chromodynamics with improved staggered quarks of two masses and performs a computation over a 4D lattice with 32 points in each dimension, involving approximately 34 million variables for the integral evaluation.

Tools for Gathering Metrics
Obtaining overall timing metrics is relatively straightforward, but tools for gathering detailed execution characteristics are often platform-specific, so it can be difficult to obtain the metrics of interest on a given platform. It’s even more difficult to conduct comparative evaluations that gather a certain metric across several platforms. Among the tools we’ve used in our projects are mpiP, hpmcount, and strace:

• mpiP is a lightweight profiling library for MPI applications that reports the percentage of time spent in MPI. It also includes the time spent in each MPI function.
• IBM’s Hardware Performance Monitor suite includes a simplified interface, hpmcount, which summarizes data from selected hardware counters and computes some useful derived metrics, such as instructions and floating-point operations per cycle.
• We can measure I/O behavior (by recording I/O system call activities) with the strace command; from this output, we can extract statistics for file I/O using a script (a sketch appears at the end of this section).

These tools are but a small sample of a large set of instruments available on myriad platforms today; thus, an important goal is uniformity. Benchmarkers need tools and interfaces to gather relevant performance data consistently across the range of available platforms. Ideally, these tools won’t just report volumes of performance counter results—they’ll also abstract these volumes, thereby creating the end metrics of interest.
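The following sketch shows the kind of post-processing script mentioned in the strace bullet above. It is a simplified sketch under the assumption that strace was run with -f -e trace=read,write and its log saved to a file; it sums the byte counts reported for read and write calls and ignores details a production script would need (distinguishing regular files from sockets, handling interrupted calls, and so on).

```python
# Aggregate file I/O volume from an strace log (simplified, illustrative).
# Assumed usage: strace -f -e trace=read,write -o app.log ./app
#                python iostats.py < app.log
# Example input line:  read(3, "..."..., 65536) = 65536
import re
import sys
from collections import defaultdict

CALL_RE = re.compile(r'^(?:\[pid\s+\d+\]\s+)?(read|write)\(.*\)\s+=\s+(\d+)')

def io_stats(lines):
    """Return total bytes and call counts for read/write calls in an strace log."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for line in lines:
        m = CALL_RE.match(line)
        if m:
            call, nbytes = m.group(1), int(m.group(2))
            totals[call] += nbytes
            counts[call] += 1
    return totals, counts

if __name__ == "__main__":
    totals, counts = io_stats(sys.stdin)
    for call in ("read", "write"):
        print(f"{call}: {counts[call]} calls, {totals[call] / 1e6:.1f} Mbytes")
```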

Ranking HPC Systems
Benchmark reports for real application suites, such as the SPEC HPC codes, can help us create more realistic rankings of HPC platforms (www.purdue.edu/TAPlist). Figure 1 shows the top application performers (TAP) list as of January 2006; this list defines an aggregate metric with which we can combine the results of our three benchmark applications into an overall rank. The metric weights individual application performance results according to their average runtime across different platforms. The list also allows a single benchmark to be used for ranking. The TAP list contains links to the original SPEC benchmark reports.
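The sketch below illustrates one way such a weighted aggregate could be computed: each term is an application’s speed relative to its cross-platform average, which is one way of weighting performance results by average runtime. This is only our reading of the metric described above, with invented runtimes; it is not the TAP list’s published formula.

```python
# Hypothetical aggregate ranking in the spirit of the TAP metric (invented runtimes).
# score(platform) = sum over apps of (average runtime of app across platforms)
#                                    / (runtime of app on this platform).
runtimes = {  # wall-clock seconds, platform -> application -> time (made-up numbers)
    "PlatformA": {"Seismic": 600.0, "WRF": 800.0, "GAMESS": 3500.0},
    "PlatformB": {"Seismic": 700.0, "WRF": 650.0, "GAMESS": 4200.0},
    "PlatformC": {"Seismic": 550.0, "WRF": 900.0, "GAMESS": 3900.0},
}

apps = sorted(next(iter(runtimes.values())))
avg = {a: sum(p[a] for p in runtimes.values()) / len(runtimes) for a in apps}

scores = {
    platform: sum(avg[a] / times[a] for a in apps)
    for platform, times in runtimes.items()
}

for platform, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    parts = ", ".join(f"{a}: {avg[a] / runtimes[platform][a]:.2f}" for a in apps)
    print(f"{platform}: aggregate {score:.2f} ({parts})")
```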

Performance Results
We used four application benchmarks and tested them on three different parallel architectures to evaluate performance; we discuss the performance results in this section in terms of the key benchmarking questions discussed earlier. Table 2 gives a brief description of the four applications along with the problem being solved. The codes GAMESS and WRF are members of the SPEC HPC and NSF benchmark suites (see www.nsf.gov/pubs/2006/nsf0605/nsf0605.jsp); WRF and MILC are also part of SPEC MPI2007. Here, we’ve measured the SPEC HPC versions of GAMESS and WRF and the MILC version of the NSF benchmarks; the Seismic application is part of the SPEC HPC suite. Where appropriate for comparison, we’ve shown results obtained from the HPL kernel benchmarks (the HPL codes are also part of the NSF benchmark suite).

[Figure 2: panels plotting scalability against the number of processors, with curves for the Intel Xeon cluster, SGI Altix, and IBM P690.]
Figure 2. Relative performance of application benchmarks on three platforms. All performance numbers are relative to the execution speed on an Intel Xeon with four processors; we used the medium data set in all executions. For comparison, we show the execution times of the same machines using the HPL kernel benchmarks (with N = 9,900). Note that every machine performed the best on one application and the worst on another. Hence, the relative ranking of these systems critically depends on the problem being solved.

Overall Performance
Absolute application performance is the total runtime for solving an overall application problem. The machines we used included an IBM P690 (www.ccs.ornl.gov/Cheetah/Cheetah.html), an SGI Altix (www.ccs.ornl.gov/Ram/Ram.html), and an Intel Xeon cluster (www.itap.purdue.edu/rcac/news/news.cfm?NewsID=178). The runtimes for these medium data sets on a 32-processor IBM P690 platform ranged from approximately 10 minutes (for Seismic) to more than an hour (for GAMESS).

Figure 2 shows the relative performance of the individual benchmark applications on the three platforms as well as the HPL benchmark’s performance. We took measurements on up to 64 processors, except for Seismic (up to 32; the 64-processor runs didn’t validate).

The four applications had different rankings on the platforms. In terms of execution time on the highest measured processor count, each platform performed both the best and the worst across the different benchmarks.
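To make the normalization in Figure 2 explicit, the following sketch computes relative performance for a single platform the way the caption describes: each time is divided into the four-processor execution time, and anything above the linear reference p/4 counts as superlinear. The times are placeholders, not the measured values behind the figure.

```python
# Relative performance on one platform, normalized to its 4-processor run.
# All numbers below are placeholders, not measurements from this study.
baseline_time = 1000.0                                  # time on 4 processors, seconds
times = {8: 520.0, 16: 240.0, 32: 118.0, 64: 70.0}      # time on p processors, seconds

for p, t in sorted(times.items()):
    relative = baseline_time / t        # performance relative to the 4-processor run
    ideal = p / 4                       # linear-scaling reference
    tag = "superlinear" if relative > ideal else ""
    print(f"{p:3d} processors: {relative:5.2f}x (linear would be {ideal:5.2f}x) {tag}")
```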


[Figure 3: percentage of execution time spent in computation, communication, and I/O for each MPI rank (0–15) of Seismic, WRF, GAMESS, MILC, HPL (N = 9,900), and HPL (N = 39,600).]

Figure 3. Components. When we look closer at the execution time breakdown of the benchmarks into computation, communication, and I/O, we see different application characteristics, such as whether the application is compute, communication, or I/O bound.

In terms of scaling behavior, the IBM P690 platform scaled the best in Seismic, WRF, and MILC, but the worst in GAMESS. The good scaling behavior on MILC was consistent with a relatively high computation-to-communication ratio, which Figure 3 shows. On both the Intel Xeon and the SGI Altix, Seismic scaled only to 16 processors and WRF only to 32. We observed superlinear behavior for GAMESS on SGI Altix and Intel Xeon up to 32 processors and for MILC on 32 and 64 processors. The superlinear behavior might be due to data fitting nicely in L2 cache; we assumed ideal behavior for the four-processor case, as this was the smallest processor count the medium data sets could run on due to memory limitations.

The kernel benchmark results (HPL with N = 9,900) were most similar to those of WRF. Evidently, because the different applications’ performance behavior varied significantly, the kernel benchmark reflected only a small part of the application spectrum.

Component Performance
As we mentioned earlier, system component measurements give insight into the behavior of individual machine features: their relative performance shows a feature’s contribution to a computational problem’s overall solution. Furthermore, component performance relative to some upper bound shows us how efficiently the machine feature is exploited compared to theoretical limits. We used the performance analysis tools mpiP, hpmcount, and strace to measure specific application component characteristics such as communication, computation, and I/O. Figure 3 shows the time breakdown.

Communication characteristics. We measured the communication component overhead for all the application benchmark codes and the HPL benchmark with constant and scaled problem sizes, and found the communication cost to be 5 to 40 percent of the overall execution time. Seismic did the least amount of communication, followed by GAMESS, MILC, and WRF. We measured all four phases of Seismic separately and later aggregated them; ultimately, Seismic showed significant load imbalance. In GAMESS, a single processor communicated 100 percent of the time. WRF, MILC, and HPL were the most balanced, but the difference between the least and most communicating processors was still 50 percent or more.
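The sketch below shows, under simplified assumptions, how per-rank measurements like those behind Figure 3 can be reduced to the percentages and imbalance figures discussed here. It assumes we already have, for each MPI rank, the seconds attributed to MPI communication and to I/O (for example, from mpiP and an strace-derived script); everything else is counted as computation. The input numbers are invented.

```python
# Per-rank breakdown into computation, communication, and I/O (invented numbers).
# comm[r] and io[r] are seconds attributed to MPI and file I/O on rank r;
# total[r] is that rank's wall-clock time.
total = [100.0] * 4
comm  = [12.0, 18.0, 30.0, 9.0]
io    = [2.0, 0.0, 0.0, 0.0]

for r, t in enumerate(total):
    c, i = comm[r], io[r]
    comp = t - c - i                              # remainder counted as computation
    print(f"rank {r}: comp {100*comp/t:4.1f}%  comm {100*c/t:4.1f}%  I/O {100*i/t:4.1f}%")

# Communication imbalance: spread between least- and most-communicating ranks.
imbalance = (max(comm) - min(comm)) / max(comm) * 100
print(f"communication imbalance: {imbalance:.0f}% of the busiest rank's communication time")
```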


HPL communication depended on problem size and exhibited locality characteristics: with an increasing data set size, communication reduced significantly. This feature of the kernel benchmark led to generally good scaling behavior on very large machines, but it also contributed to the difference in rank lists of application versus kernel benchmarks. We observed a significant difference in communication cost overhead across architectures (IBM P690 and Intel Xeon Linux cluster) for the MILC and WRF benchmarks because they’re communication intensive. The IBM P690 system provided higher communication bandwidth than the Intel Xeon system.

I/O characteristics. We measured I/O volume on all the benchmark codes and found that MILC and HPL didn’t perform any I/O, whereas in both WRF and GAMESS, a single processor performed all the I/O. The I/O volumes of these codes (four Seismic phases measured separately, WRF, and GAMESS) ranged from 61 Mbytes to 5.5 Gbytes. The fraction of execution time taken by the I/O was small in both GAMESS and WRF but significant in Seismic, especially in phase 1. The I/O read and write volumes differed, yet they took the same amount of time due to differences in read and write speed.

[Figure 4: total memory size per processor (Mbytes, log scale) versus number of processors on the Intel Xeon cluster for WRF, the four Seismic phases, GAMESS, and HPL.]

Figure 4. Benchmark memory footprints. The total memory size per processor for an Intel Xeon cluster varies widely depending on the benchmark used.

Memory footprints. Figure 4 shows the benchmarks’ memory footprints as a function of the number of processors. Again, we split Seismic into its four execution phases, with the sizes ranging from 20 Mbytes per processor in Seismic’s phase 1 to 1 Gbyte per processor in MILC (due to the large footprint, MILC couldn’t run on less than four processors). MILC, WRF, and Seismic’s phases 3 and 4 exhibited the commonly expected behavior: memory footprint decreased steadily with increasing processor numbers. But in Seismic’s phases 1 and 2 and in GAMESS, the memory footprint was independent of the number of processors. This finding is important because it refutes the common assumption that larger systems will naturally accommodate larger data sets. This assumption is the basis for a benchmark methodology that lets data sets “scale” and thus reduces communication, leading to seemingly improved performance numbers on large systems. Our results show that this path to scalability might not be correct.

Application Characteristics
We can characterize computational applications from many diverse angles—the physical problem being solved, the algorithms used, computer language and source code properties, compiler-applied optimization methods, generated instruction characteristics, and upper limit analyses, to name a few. A systematic methodology of application characterization11 could guide the process of answering the relevant questions in this area. Table 3 shows a small set of such data, providing insight into the core algorithms and programming languages used to compose our test applications.

As we mentioned earlier, understanding the characteristics of today’s applications could help us anticipate the behavior of tomorrow’s applications. Most performance prediction techniques are based on application signatures and machine profiles, combined via convolution methods. In one approach,12 the authors defined synthetic kernels that exhibit interesting code and machine properties and then extrapolated these kernels’ measured behavior to larger data set sizes. Another approach13–15 measured an application’s key characteristics on a current platform and then forecasted the application behavior on scaled data and machine sizes, using predictive formulas. The authors derived the formulas with a least-squares fitting approach15 or via the compiler from the source code, expressing the way input data affects the volume of computation, communication, and I/O.
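As a minimal illustration of the least-squares idea cited above, the sketch below fits a two-term runtime model, T(p) ≈ a/p + b (a scalable part plus a fixed overhead), to a few measured points and then extrapolates to larger processor counts. The model form and the data are our own assumptions for illustration; the cited approaches fit richer, application-specific formulas.

```python
# Least-squares fit of a simple runtime model and a forecast (illustrative only).
import numpy as np

procs = np.array([4, 8, 16, 32], dtype=float)       # measured processor counts
times = np.array([980.0, 520.0, 300.0, 190.0])      # measured wall-clock times, seconds

# Model: T(p) = a * (1/p) + b, solved as a linear least-squares problem.
A = np.column_stack([1.0 / procs, np.ones_like(procs)])
(a, b), *_ = np.linalg.lstsq(A, times, rcond=None)

for p in (64, 128):
    print(f"forecast for {p:3d} processors: {a / p + b:6.1f} s "
          f"(scalable part {a / p:5.1f} s, fixed part {b:5.1f} s)")
```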


Table 3. Benchmark composition in terms of number of files, subroutines, lines, programming languages, and core algorithms.

 | Seismic | WRF | GAMESS | MILC
Files | 59 | 313 | 61 | 152
Subroutines | 354 | 1139 | 835 | 382
Lines | 24,000 | 168,000 | 122,000 | 20,000
Languages | 56% Fortran, 44% C | 85% Fortran, 15% C | 99.5% Fortran, 0.5% C | 100% C
Algorithms | 3D FFT, finite difference, Toeplitz solver, Kirchhoff integral method | CFD (Eulerian equation solver), split-explicit | Eigenvalue solver, dense matrix-matrix multiply | Conjugate gradient

A good benchmarking methodology can save a tremendous amount of resources in terms of human effort, machine cycles, and cost. Such a methodology must consider the relevance and openness of the chosen codes, well-defined rules for executing and reporting the benchmarks, a review process to enforce the rules, and a public repository for the obtained information. For the methodology to be feasible, it must also be supported by adequate tools that enable the user to consistently execute the benchmarks and gather the requisite metrics.

At the very least, reliable benchmarking results can help people make decisions about HPC acquisitions and assist scientists and engineers in system advances. By saving resources and enabling balanced designs and configurations, realistic benchmarking ultimately leads to increased competitiveness in both industry and academia.

Acknowledgments
This work was supported, in part, by the US National Science Foundation under grants 9974976-EIA, 0103582-EIA, and 0429535-CCF. The machines used in our experiments were made available by Purdue University’s Rosen Center for Advanced Computing and the Oak Ridge National Laboratory.

References
1. A.T. Wong et al., “ESP: A System Utilization Benchmark,” Proc. IEEE/ACM SC2000 Conf., IEEE CS Press, 2000; http://citeseer.ist.psu.edu/wong00esp.html.
2. S.C. Woo et al., “The SPLASH-2 Programs: Characterization and Methodological Considerations,” Proc. 22nd Int’l Symp. Computer Architecture, ACM Press, 1995, pp. 24–36.
3. M. Berry et al., “The Perfect Club Benchmarks: Effective Performance Evaluation of Supercomputers,” Int’l J. Supercomputer Applications, vol. 3, no. 3, 1989, pp. 5–40.
4. M. Berry, G. Cybenko, and J. Larson, “Scientific Benchmark Characterizations,” Parallel Computing, vol. 17, Dec. 1991, pp. 1173–1194.
5. R.W. Hockney and M. Berry, “ParkBench Report: Public International Benchmarking for Parallel Computers,” Scientific Programming, vol. 3, no. 2, 1994, pp. 101–146.
6. R. Eigenmann and S. Hassanzadeh, “Benchmarking with Real Industrial Applications: The SPEC High-Performance Group,” IEEE Computational Science & Eng., vol. 3, no. 1, 1996, pp. 18–23.
7. R. Eigenmann et al., “Performance Evaluation and Benchmarking with Realistic Applications,” SPEC HPG Benchmarks: Performance Evaluation with Large-Scale Science and Engineering Applications, MIT Press, 2001, pp. 40–48.
8. M.S. Mueller et al., “SPEC HPG Benchmarks for High Performance Systems,” Int’l J. High-Performance Computing and Networking, vol. 2, no. 1, 2004; http://citeseer.ist.psu.edu/672999.html.
9. V. Aslot et al., “SPEComp: A New Benchmark Suite for Measuring Parallel Computer Performance,” Proc. Workshop OpenMP Applications and Tools, LNCS 2104, Springer-Verlag, 2001, pp. 1–10.
10. D. Bailey et al., The NAS Parallel Benchmarks 2.0, tech. report NAS-95-020, NASA Ames Research Center, Dec. 1995.
11. B. Armstrong and R. Eigenmann, “Benchmarking and Performance Evaluation with Realistic Applications,” A Methodology for Scientific Benchmarking with Large-Scale Applications, MIT Press, 2001, pp. 109–127.
12. D.H. Bailey and A. Snavely, “Performance Modeling: Understanding the Past and Predicting the Future,” Proc. 11th Int’l Euro-Par Conf., LNCS 3648, Springer-Verlag, 2005, pp. 761–770.
13. B. Armstrong and R. Eigenmann, “Performance Forecasting: A Methodology for Characterizing Large Computational Applications,” Proc. Int’l Conf. Parallel Processing, IEEE Press, 1998, pp. 518–526.
14. L. Carrington, A. Snavely, and N. Wolter, “A Performance Prediction Framework for Scientific Applications,” Future Generation Computer Systems, vol. 22, no. 3, 2006, pp. 336–346.
15. M. Mahinthakumar et al., “Performance Evaluation and Modeling of a Parallel Astrophysics Application,” Proc. High Performance Computing Symp., Soc. for Computer Simulation Int’l, 2004; www.scs.org/docInfo.cfm?get=1681.

Mohamed Sayeed is a research scientist in the Computing Research Institute and Rosen Center for Advanced Computing at Purdue University. His research interests include numerical modeling, high-performance computing for scientific computing, and performance modeling. Sayeed has a PhD in civil engineering from North Carolina State University. Contact him at [email protected].

Hansang Bae is working toward a PhD in electrical and computer engineering at Purdue University. His research interests include optimizing compilers and performance analysis of high-performance applications and platforms. Contact him at [email protected].


Yili Zheng is a PhD candidate in the School of Electrical and Computer Engineering at Purdue University. His research interests include cross-layer optimizations and end-to-end solutions for computation-intensive problems such as those in biomedical, financial, and energy applications. Contact him at [email protected].

Brian Armstrong is a PhD candidate in Purdue University’s computational science and engineering program. His research interests are performance analysis of parallel and distributed applications and automatic parallelization of industrial-grade applications. Contact him at [email protected].

Rudolf Eigenmann is a professor in the School of Electrical and Computer Engineering and interim director of the Computing Research Institute at Purdue University. His research interests include compiler optimization, programming methodologies and tools, performance evaluation for high-performance computers and applications, and Internet sharing technology. Eigenmann has a PhD in electrical engineering and computer science from ETH Zurich. Contact him at [email protected].

Faisal Saied is a senior research scientist at Purdue University’s Computing Research Institute and Rosen Center for Advanced Computing. His research interests include numerical analysis, parallel numerical algorithms, numerical methods for the Schrödinger equation, large-scale eigenvalue problems, and performance engineering for large-scale applications. Saied has a PhD in computer science from Yale University. Contact him at [email protected].

