
Performance Analysis

Measuring High-Performance Computing with Real Applications

The platforms the authors describe here performed both the best and the worst in a test of selected applications. “Best performance” significantly depended on the problem being solved, calling into question the value of computer rankings that use simplistic metrics.

Mohamed Sayeed, Hansang Bae, Yili Zheng, Brian Armstrong, Rudolf Eigenmann, and Faisal Saied, Purdue University

Benchmarking can help us quantify the ability of high-performance computing (HPC) systems to perform certain, known tasks. An obvious use of the results is to find a given architecture’s suitability for a selection of computer applications. Other equally important uses include the creation of yardsticks to assess the progress and requirements of future HPC technology. Such measurements must be based on real computer applications and a consistent benchmarking process. Although metrics that measure simple program kernel behavior have worked well in the past, they fail to answer questions that deal with the true state of the art of today’s HPC technology and the directions future HPC development should take. These answers have a direct impact on competitiveness.

To quantify this belief, we compared kernel benchmarks with real application benchmarks. We used high-performance Linpack (HPL) to represent kernel benchmarks; researchers often use HPL in highly visible projects, such as those ranked in the top 500 supercomputers (www.top500.org). For application benchmarks, we selected four computational codes represented in the Standard Performance Evaluation Corporation’s SPEC HPC2002 and SPEC MPI2007 suites (www.spec.org/hpg) and in the US National Science Foundation (NSF) HPC system acquisitions process.

Although we advocate the use of real application benchmarks, it’s important to note that kernel benchmarks have an essential role—a small code fragment that can time a single message exchange, for example, is most appropriate for measuring a message-passing function’s latency. We don’t argue here that small benchmarks are unimportant, but that people often use them to answer the wrong questions. Kernels are appropriate for measuring system components, but observations of real application behavior are essential for assessing HPC technology.
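As an illustration of the kind of kernel measurement described above, the following minimal sketch times a single round-trip message exchange between two MPI ranks. It assumes the mpi4py package and a working MPI installation; the message size and repetition count are arbitrary choices for illustration.

```python
# Minimal ping-pong sketch: estimate point-to-point message latency.
# Assumes mpi4py is installed; run with, e.g., "mpiexec -n 2 python pingpong.py".
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

REPS = 1000                     # number of round trips to average over
buf = np.zeros(1, dtype='b')    # 1-byte message isolates latency from bandwidth

comm.Barrier()
start = MPI.Wtime()
for _ in range(REPS):
    if rank == 0:
        comm.Send(buf, dest=1, tag=0)
        comm.Recv(buf, source=1, tag=1)
    elif rank == 1:
        comm.Recv(buf, source=0, tag=0)
        comm.Send(buf, dest=0, tag=1)
elapsed = MPI.Wtime() - start

if rank == 0:
    # Half the average round-trip time approximates one-way latency.
    print(f"approx. one-way latency: {elapsed / (2 * REPS) * 1e6:.2f} microseconds")
```

A kernel like this answers its narrow question well, but, as argued above, it says little about how a full application will behave.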

Key Benchmarking Questions
Table 1 lists some important questions and indicates the ability of either kernel or full application benchmarks to provide answers. The first question considers the time required to solve today’s important problems. We might learn that a biology application takes a full year to fold a specific protein on a present-day HPC platform or that forecasting the weather with a certain resolution and accuracy takes 10 hours. For obvious reasons, kernel benchmarks don’t address such questions: we can’t infer solution times for real problems from kernel execution times. In contrast, real application benchmarks with data sets that represent actual problems give direct answers.


Table 1. Comparison of kernel versus full application metrics.*

Benchmarking question | Ability to answer (kernels) | Ability to answer (full applications)
1. What time is required to solve important computational problems on today’s high-performance computing platforms? | n/a | +
2. What’s the overall platform performance? | – | +
3. How do system components perform? | + | –
4. What’s the importance of system components relative to each other? | n/a | +
5. What’s the importance of system components relative to upper bounds? | – | +
6. What are the characteristics of important computational problems? | – | +
7. What are the characteristics of important future problems? | – | +

* “+” is a good answer, “–” is a limited answer, and “n/a” is no answer.

The second question deals with relative performance. Although both types of benchmarks provide answers, a system’s overall ranking can hinge on the type of application involved. The same machine can perform the best and the worst, so relying on simple numbers not only provides inaccurate answers, it can lead to the wrong conclusions about machine suitability and HPC technology in general.

The strength of specialized kernel benchmarks is their ability to measure individual machine features by focusing on a particular system aspect. However, the results don’t answer the question of relative component importance—breaking down a real application’s execution time into computation, communication, and I/O, for example, shows each component’s true relative importance. We can’t get this type of information by combining each component’s individual kernel benchmarks. However, kernel metrics can answer the question of what percentage of peak performance the floating-point operations can achieve: such metrics are idealized bounds under a given code pattern. But these results don’t represent the percentage of peak performance a real application can achieve.

The sixth question deals with an important group of issues that benchmarking and performance evaluation must address: the characteristics of computational applications. Kernel benchmarks can’t help here—they’re typically created by identifying real application characteristics and then focusing on one of them.

Benchmarking can help predict the application behavior of data sets and machines that are much larger than what’s feasible today. As the last question illustrates, the value of kernel versus real application benchmarks for such prediction is still controversial.

Challenges of Measuring Performance with Real Applications
Communities ranging from HPC customers to computer manufacturers to research funding agencies to scientific teams have increasingly called for better yardsticks. Let’s look at the stark contrast between this need and conventional wisdom.

Simple Benchmarks Are Overly Easy to Run
One of the greatest challenges of evaluating performance with real applications is the simplicity with which we can run kernel benchmarks. With the investment of minutes to a few hours, we can generate a benchmark report that seemingly puts our machine “on the map.” The simplicity of kernel benchmarks can overcome portability issues (no changes to the source code are necessary to port it to a new machine) as well as software environment issues (the small code is unlikely to engage unproven compiler optimizations or to break programming tools in a beta release). This simplicity might even overcome hardware issues—the kernel is unlikely to approach numerical instability, to which a new processor’s floating-point unit might be susceptible.

However, these are the very features that we want in a true HPC evaluation. We don’t want a machine to show up in the “Best 100” if it takes major code restructuring to port a real application, if porting the application requires major additional debugging, if the tools and compilers aren’t yet mature, or if the hardware is still unstable.

Specialized benchmarks exist for a large number of metrics, including message counting in message-passing interface (MPI) applications, measuring the memory bandwidth of symmetric multiprocessor (SMP) platforms, gauging fork-join overheads in OpenMP implementations, studying system utilization characteristics,1 and many more. The SPEC benchmarking organization (www.spec.org) alone distributes 12 major benchmark suites.


Diverse suites play an essential role because they can help us understand a specific aspect of a system in depth, but when it comes to understanding the behavior of a system as a whole, these detailed measures don’t suffice. Even if a specific benchmark could measure each and every aspect of a system, no formula exists to help us combine the different numbers into an overall system metric.

We Can’t Abstract Realistic Benchmarks from Real Applications
An inviting approach to benchmarking with real applications is to extract important excerpts from real codes, with the aim of preserving the relevant features and omitting the unnecessary ones. Unfortunately, this means we’re making decisions about an application’s less important parts that might be incorrect. For example, we might find a loop that executes 100 computationally near-identical iterations and reduce it to just a few, but this abstraction might render the code useless for evaluating adaptive compiler techniques: repetitive behavior can be crucial for an adaptive algorithm to converge on the best optimization technique. Similarly, data downsizing techniques2 might ensure that a smaller problem’s cache behavior remains the same, but the code would become useless for determining the real problem’s memory footprint.

The difficulty of defining scaled-down benchmarks is evident in the many criteria for benchmark selection suggested in past efforts: codes should contain a balance of various degrees of parallelism, be scalable, be simple yet reflect real-world problems, use a large memory footprint and a large working set, be executable on a large number of processors, exhibit a variety of cache-hit ratios, and be amenable to a variety of programming models for shared-memory, multicore, and cluster architectures. Last but not least, benchmark codes should represent a balanced set of computer applications from diverse fields.

Obviously, nothing could satisfy all of these demands—some of them are directly contradictory. Furthermore, selecting yardsticks by such criteria would dismiss the fact that we want to learn about these properties from real applications. We want to learn how scalable a real application is, for example, not just select a scalable one. Similarly, if an application’s real data sets don’t have large memory footprints, then our evaluation has produced an important result: inflating input data parameters to fill some memory footprint benchmark selection criterion wouldn’t be meaningful.

Today’s Real Applications Might Not Be Tomorrow’s
Large, real applications tend to contain legacy code with programming practices that don’t reflect tomorrow’s software engineering principles and algorithms. This is perhaps the strongest argument against using today’s real applications to determine future HPC needs, but when we ask what future applications should include, the answer isn’t forthcoming. Should we use specific selection criteria? Are we sure that the best of today’s algorithms and software engineering principles will find themselves in tomorrow’s applications? If we choose a certain path and miss, we risk losing on two fronts: abandoning today’s established practices and erring in what tomorrow’s technology will be. Nevertheless, the best predictor of tomorrow could very well be today’s established practice.

Benchmarking Isn’t Eligible for Research Funding
Performance evaluation and benchmarking projects are long-term efforts, so data must be gathered and kept for many years. (The 15-year-old [email protected] mailing list still receives occasional queries.) Unfortunately, this type of task doesn’t easily include the advanced performance-modeling topics that interest scientists, so most efforts that focus on benchmarking infrastructure and its long-term support lack continuity in funding from science sponsors.

Programs that create such services at funding agencies could eventually appear—perhaps this article will motivate future initiatives. An alternative is a combined academic/industrial effort, such as SPEC’s High-Performance Group (HPG), in which industrial benchmarking needs meet academic interests, resulting in suites of real HPC applications.

Maintaining Benchmarking Efforts Is Costly


A performance evaluation and benchmarking effort entails collecting test applications, ensuring portability, developing self-validation procedures, defining benchmark workloads, creating benchmark run and result submission rules, organizing result review committees, disseminating the suite, maintaining result publication sites, and even protecting evaluation metrics from misuse. Dealing with real, large-scale applications further necessitates assisting benchmark users and maintaining the involvement of domain experts in their respective application areas.

The high cost of these tasks is obvious. Many benchmarking efforts cover their costs through initial research grants or volunteer efforts, but this support typically pays for the first round of benchmark definitions, not subsequent steps or publication sites. To date, SPEC is the only organization that maintains full, continuously updated benchmarking efforts. Its funding comes primarily from industrial membership and a comparably small fee for the actual benchmark suites.

Proprietary Application Benchmarks Can’t Serve as Open Yardsticks
So, should realistic benchmarks match the exact applications that will run on the target system of interest? Clearly, prospective customers think their own applications would be the best choice for testing a system’s desired functionality. It might serve the customer best if multiple vendors ran applications in a way that allowed fair comparison, but because most applications are proprietary, neither the scientific community nor the public can verify or scrutinize the generated performance claims. Hence, the value of proprietary benchmarks is limited to simple metrics.

Public benchmarks might be of higher value. Fair benchmark results can be costly to generate, and vendors are under pressure to produce good results in a very short period, often competing internally for machine resources. Unless customers can closely supervise the benchmarking process with significant expertise, they might not get results that they can compare fairly. Without any deceptive intentions on the vendor side, the lack of a clear evaluation methodology often allows shortcuts and “optimizations” that can differ significantly among evaluation groups. In contrast, even though established, public, full-application benchmark results might not have computational patterns identical to a customer’s codes, the performance numbers’ consistency and availability might outweigh this drawback.

Meeting the Challenge: Toward a Benchmarking Methodology
An advanced HPC benchmarking methodology

• creates performance yardsticks based on real applications and openly shared data sets,
• defines metrics that indicate overall problem solution time as well as the performance of important kernels that constitute the application,
• defines rules for running and reporting benchmarks, so that comparisons are fair and any relevant information is fully disclosed, and
• enables the creation of a repository that maintains benchmarking reports over a long time period.

Rules are only useful if they’re enforceable, which emphasizes the need for a benchmark review process. The goal here is to facilitate objective efforts rather than self-evaluations by machine vendors and HPC platform owners. Full disclosure is also crucial: the applications, their data sets, the benchmarking process, and the software and hardware configuration with which the results were obtained must all be open to inspection.

An open repository of fully disclosed benchmark results is essential to fair, consistent HPC evaluation. Such an effort benefits prospective customers of HPC systems as well as researchers attempting to advance HPC technology. It’s even more important in the acquisition and evaluation of HPC systems with public funds.
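To make the idea of full disclosure concrete, the following sketch shows one possible shape for a self-describing benchmark record. It is only an illustration of the kind of information that must be open to inspection; the field names and the JSON format are assumptions made for this sketch, not part of any SPEC or NSF reporting rule. The sample values come from Table 2; fields we don’t know are marked unspecified.

```python
# Hypothetical full-disclosure record for one benchmark run (illustrative only).
import json
from dataclasses import dataclass, asdict, field

@dataclass
class BenchmarkReport:
    application: str            # benchmark code, e.g., "WRF"
    data_set: str               # openly shared input data set
    wall_clock_seconds: float   # overall problem solution time
    processors: int
    hardware: dict = field(default_factory=dict)   # CPU, memory, interconnect, ...
    software: dict = field(default_factory=dict)   # compiler, flags, MPI library, ...
    run_rules_followed: bool = True                # asserts compliance with the suite's rules

report = BenchmarkReport(
    application="WRF", data_set="medium", wall_clock_seconds=742.0, processors=32,
    hardware={"system": "IBM P690", "interconnect": "unspecified"},
    software={"compiler": "unspecified", "mpi": "unspecified"},
)

# A repository could archive such records for long-term, reviewable comparison.
print(json.dumps(asdict(report), indent=2))
```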
Benchmarking Efforts with Real Applications
Performance evaluation efforts with real applications began to emerge in 1988 with the Perfect Benchmarks.3,4 This set of scientific and engineering applications was intended to replace kernel-type benchmarks and the commonly used peak-performance numbers, and represented a significant step in the direction of application-level benchmarking. Although they continue to circulate in the research community, the original Perfect Benchmarks are small compared to today’s real applications (the largest included some 20,000 lines of Fortran 77), and their data sets execute in the order of seconds on today’s machines. No results Web site is available for the Perfect Benchmarks.

Similarly, the ParkBench (for PARallel Kernels and BENCHmarks)5 effort emerged in 1992 with some research funding but didn’t update its initial suites in response to newer generations of HPC systems.


The effort was very ambitious in its goal of delivering a set of benchmarks that range from kernels to full applications. The largest, full-application suite was never created, though.

SPEC HPG
Like the Perfect Benchmarks, SPEC also debuted in 1988. Although it’s largely vendor based, the organization includes a range of academic affiliates. Initially, SPEC focused on benchmarking uniprocessor machines and, in this area, became the leader in providing performance numbers to workstation customers. SPEC suites are also increasingly used by the research community to evaluate the performance of new architecture concepts and software prototypes. Today, SPEC offers a large number of benchmarks that evaluate workstations, graphics capabilities, and high-performance systems. Most notably for HPC evaluation, the SPEC HPG formed in 1994 out of an initiative to merge the Perfect Benchmarks effort’s expertise in HPC evaluation with SPEC’s capability to sustain such efforts long term. Since its foundation, this group has produced several HPC benchmarks, including the HPC suite,6–8 the OMP suite (for OpenMP applications),9 and the MPI suite.

SPEC’s HPG suites are based on widely used computational applications that can be openly distributed to the community. The codes are implemented with the MPI and OpenMP standards for parallel processing, and SPEC provides a result submission and review process, a repository at www.spec.org/hpg, and a continuous benchmark update process. Intended consumers include end users, system vendors, software vendors, and researchers.

Other HPC Benchmarking Efforts
Other attempts to provide HPC benchmarks have emerged over the past decade as well. Notable examples include the Euroben effort in Europe (www.euroben.nl) and NASA’s Parallel Benchmarks (NPB).10 The US Department of Defense’s High Performance Computing Modernization Program (HPCMP; www.hpcmo.hpc.mil) developed its own suites that include synthetic benchmarks and real applications to support its yearly acquisition activities. A related effort is the benchmarking project that’s part of Darpa’s High-Productivity Computing Systems (HPCS) program (www.highproductivity.org). Currently, this effort has suites of kernel benchmarks and several synthetic compact applications. Another important effort is the benchmarking process defined by the NSF for its acquisition of a national HPC system infrastructure. The NSF benchmarks include both a set of kernel benchmarks and real applications for system evaluation. However, neither the HPCS nor the NSF benchmarking efforts have an associated project to maintain a result repository that the public can view.

Figure 1. Top application performers. This aggregate view combines the SPEC HPC2002 suite’s results into an overall rank, with the bars subdivided into each individual benchmark application’s contributions to the rank. The TAP lists also show rankings based on other application suites.

Metrics for HPC Evaluation
The general consensus in the benchmarking community is that overall performance must be evaluated via wall-clock time measurements, but an open and often controversial issue is how to combine measurements from multiple benchmarks into one final number. SPEC HPC’s approach is to leave the decision up to the benchmark report reader—that is, each code is reported separately. Other suites, such as SPEC CPU, SPEC OMP, and SPEC MPI, report the geometric mean of individual program performance results (see the sketch at the end of this section). The NSF benchmarking effort includes both kernel and application metrics, which are reported independently. Kernel benchmarks evaluate a wide variety of system components, so the metrics also vary widely, and to our knowledge, no current benchmarking effort can relate kernel and application benchmarks by, for example, measuring important constituent kernels as part of a real application benchmark run.

An increasingly important class of metrics characterizes the program properties of computational applications (examples include the use and frequency of algorithms, program patterns, and compiler analysis results). Understanding such benchmarking metrics is key to improving the performance of computational applications.
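As a brief illustration of the aggregation issue discussed above, the sketch below combines per-benchmark performance ratios with a geometric mean, the approach the SPEC CPU, OMP, and MPI suites use; reporting each code separately simply means skipping the last step. The ratios shown are made-up values for illustration, not measured results.

```python
# Combining per-benchmark results into one number (illustrative values only).
from math import prod

# Performance ratios (reference time / measured time), one per benchmark code.
ratios = {"codeA": 2.1, "codeB": 0.8, "codeC": 1.5}   # hypothetical numbers

def geometric_mean(values):
    """Geometric mean, the aggregate used by suites such as SPEC CPU."""
    values = list(values)
    return prod(values) ** (1.0 / len(values))

print("per-code ratios:", ratios)                     # SPEC HPC-style: report each code
print("geometric mean:", round(geometric_mean(ratios.values()), 3))
```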


Table 2. Wall-clock execution times and problems solved by application benchmarks on the IBM P690 platform, using 32 processors.

Application | Execution time | Problem description for a medium data set
Seismic (SPECseis) | 625 sec. | Seismic is a suite of codes typical of the seismic processing applications used to find oil and gas. The data set processes seismic traces of 512 × 48 × 128 × 128 (samples per trace × traces per group × groups per line × number of lines, where a trace corresponds to a sensor that has a sampling frequency; the sensors are strung out on multiple cables behind a ship). The total data set size in Phase 1 of SPECseis is 1.5 Gbytes and reduces to 70 Mbytes in Phase 2.
GAMESS (SPECchem) | 3,849 sec. | GAMESS is a general ab initio quantum chemistry package. The data set computes self-consistent field wavefunctions (RHF type) for thymine (C5H7N3O – 16 atoms), one of the bases found in nucleic acids.
WRF (SPECenv) | 742 sec. | WRF is a weather research and forecasting modeling system for the mesoscale (meters to thousands of kilometers). The data set simulates the weather over the continental US for a 24-hour period starting from Saturday, 3 November 2001, at 12:00 a.m. The grid is 260 × 164 × 35 with a 22-km resolution.
MILC | 1,497 sec. | MILC is used for large-scale numerical simulation of quantum chromodynamics to calculate the masses and other basic properties of strongly interacting particles (quarks and gluons). It simulates quantum chromodynamics with improved staggered quarks of two masses and performs a computation over a 4D lattice with 32 points in each dimension, involving approximately 34 million variables for the integral evaluation.

Tools for Gathering Metrics
Obtaining overall timing metrics is relatively straightforward, but tools for gathering detailed execution characteristics are often platform-specific, so it can be difficult to obtain the metrics of interest on a given platform. It’s even more difficult to conduct comparative evaluations that gather a certain metric across several platforms. Among the tools we’ve used in our projects are mpiP, hpmcount, and strace:

• mpiP is a lightweight profiling library for MPI applications that reports the percentage of time spent in MPI. It also includes the time spent in each MPI function.
• IBM’s Hardware Performance Monitor suite includes a simplified interface, hpmcount, which summarizes data from selected hardware counters and computes some useful derived metrics, such as instructions and floating-point operations per cycle.
• We can measure I/O behavior (by recording I/O system call activities) with the strace command; from this output, we can extract statistics for file I/O using a script (a sketch appears at the end of this section).

These tools are but a small sample of a large set of instruments available on myriad platforms today; thus, an important goal is uniformity. Benchmarkers need tools and interfaces to gather relevant performance data consistently across the range of available platforms. Ideally, these tools won’t just report volumes of performance counter results—they’ll also abstract these volumes, thereby creating the end metrics of interest.
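The following sketch shows the kind of post-processing script mentioned in the strace bullet above. It is a simplified sketch under the assumption that strace was run with -f -e trace=read,write and its log saved to a file; it sums the byte counts reported for read and write calls and ignores details a production script would need (distinguishing regular files from sockets, handling interrupted calls, and so on).

```python
# Aggregate file I/O volume from an strace log (simplified, illustrative).
# Assumed usage: strace -f -e trace=read,write -o app.log ./app
#                python iostats.py < app.log
# Example input line:  read(3, "..."..., 65536) = 65536
import re
import sys
from collections import defaultdict

CALL_RE = re.compile(r'^(?:\[pid\s+\d+\]\s+)?(read|write)\(.*\)\s+=\s+(\d+)')

def io_stats(lines):
    """Return total bytes and call counts for read/write calls in an strace log."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for line in lines:
        m = CALL_RE.match(line)
        if m:
            call, nbytes = m.group(1), int(m.group(2))
            totals[call] += nbytes
            counts[call] += 1
    return totals, counts

if __name__ == "__main__":
    totals, counts = io_stats(sys.stdin)
    for call in ("read", "write"):
        print(f"{call}: {counts[call]} calls, {totals[call] / 1e6:.1f} Mbytes")
```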

Ranking HPC Systems
Benchmark reports for real application suites, such as the SPEC HPC codes, can help us create more realistic rankings of HPC platforms (www.purdue.edu/TAPlist). Figure 1 shows the top application performers (TAP) list as of January 2006; this list defines an aggregate metric with which we can combine the results of our three benchmark applications into an overall rank. The metric weights individual application performance results according to their average runtime across different platforms. The list also allows a single benchmark to be used for ranking. The TAP list contains links to the original SPEC benchmark reports.
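The sketch below illustrates one way such a weighted aggregate could be computed: each term is an application’s speed relative to its cross-platform average, which is one way of weighting performance results by average runtime. This is only our reading of the metric described above, with invented runtimes; it is not the TAP list’s published formula.

```python
# Hypothetical aggregate ranking in the spirit of the TAP metric (invented runtimes).
# score(platform) = sum over apps of (average runtime of app across platforms)
#                                    / (runtime of app on this platform).
runtimes = {  # wall-clock seconds, platform -> application -> time (made-up numbers)
    "PlatformA": {"Seismic": 600.0, "WRF": 800.0, "GAMESS": 3500.0},
    "PlatformB": {"Seismic": 700.0, "WRF": 650.0, "GAMESS": 4200.0},
    "PlatformC": {"Seismic": 550.0, "WRF": 900.0, "GAMESS": 3900.0},
}

apps = sorted(next(iter(runtimes.values())))
avg = {a: sum(p[a] for p in runtimes.values()) / len(runtimes) for a in apps}

scores = {
    platform: sum(avg[a] / times[a] for a in apps)
    for platform, times in runtimes.items()
}

for platform, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    parts = ", ".join(f"{a}: {avg[a] / runtimes[platform][a]:.2f}" for a in apps)
    print(f"{platform}: aggregate {score:.2f} ({parts})")
```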

Performance Results
We used four application benchmarks and tested them on three different parallel architectures to evaluate performance; we discuss the performance results in this section in terms of the key benchmarking questions discussed earlier. Table 2 gives a brief description of the four applications along with the problem being solved. The codes GAMESS and WRF are members of the SPEC HPC and NSF benchmark suites (see www.nsf.gov/pubs/2006/nsf0605/nsf0605.jsp); WRF and MILC are also part of SPEC MPI2007. Here, we’ve measured the SPEC HPC versions of GAMESS and WRF and the MILC version of the NSF benchmarks; the Seismic application is part of the SPEC HPC suite. Where appropriate for comparison, we’ve shown results obtained from the HPL kernel benchmarks (the HPL codes are also part of the NSF benchmark suite).

[Figure 2: panels plotting scalability against the number of processors, with curves for the Intel Xeon cluster, SGI Altix, and IBM P690.]
Figure 2. Relative performance of application benchmarks on three platforms. All performance numbers are relative to the execution speed on an Intel Xeon with four processors; we used the medium data set in all executions. For comparison, we show the execution times of the same machines using the HPL kernel benchmarks (with N = 9,900). Note that every machine performed the best on one application and the worst on another. Hence, the relative ranking of these systems critically depends on the problem being solved.

Overall Performance
Absolute application performance is the total runtime for solving an overall application problem. The machines we used included an IBM P690 (www.ccs.ornl.gov/Cheetah/Cheetah.html), an SGI Altix (www.ccs.ornl.gov/Ram/Ram.html), and an Intel Xeon cluster (www.itap.purdue.edu/rcac/news/news.cfm?NewsID=178). The runtimes for these medium data sets on a 32-processor IBM P690 platform ranged from approximately 10 minutes (for Seismic) to more than an hour (for GAMESS).

Figure 2 shows the relative performance of the individual benchmark applications on the three platforms as well as the HPL benchmark’s performance. We took measurements on up to 64 processors, except for Seismic (up to 32; the 64-processor runs didn’t validate).

The four applications had different rankings on the platforms. In terms of execution time on the highest measured processor count, each platform performed both the best and the worst across the different benchmarks.
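To make the normalization in Figure 2 explicit, the following sketch computes relative performance for a single platform the way the caption describes: each time is divided into the four-processor execution time, and anything above the linear reference p/4 counts as superlinear. The times are placeholders, not the measured values behind the figure.

```python
# Relative performance on one platform, normalized to its 4-processor run.
# All numbers below are placeholders, not measurements from this study.
baseline_time = 1000.0                                  # time on 4 processors, seconds
times = {8: 520.0, 16: 240.0, 32: 118.0, 64: 70.0}      # time on p processors, seconds

for p, t in sorted(times.items()):
    relative = baseline_time / t        # performance relative to the 4-processor run
    ideal = p / 4                       # linear-scaling reference
    tag = "superlinear" if relative > ideal else ""
    print(f"{p:3d} processors: {relative:5.2f}x (linear would be {ideal:5.2f}x) {tag}")
```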


[Figure 3: percentage of execution time spent in computation, communication, and I/O for each MPI rank (0–15) of Seismic, WRF, GAMESS, MILC, HPL (N = 9,900), and HPL (N = 39,600).]

Figure 3. Components. When we look closer at the execution time breakdown of the benchmarks into computation, communication, and I/O, we see different application characteristics, such as whether the application is compute, communication, or I/O bound.

In terms of scaling behavior, the IBM P690 platform scaled the best in Seismic, WRF, and MILC, but the worst in GAMESS. The good scaling behavior on MILC was consistent with a relatively high computation-to-communication ratio, which Figure 3 shows. On both the Intel Xeon and the SGI Altix, Seismic scaled only to 16 processors and WRF only to 32. We observed superlinear behavior for GAMESS on SGI Altix and Intel Xeon up to 32 processors and for MILC on 32 and 64 processors. The superlinear behavior might be due to data fitting nicely in L2 cache; we assumed ideal behavior for the four-processor case, as this was the smallest processor count the medium data sets could run on due to memory limitations.

The kernel benchmark results (HPL with N = 9,900) were most similar to those of WRF. Evidently, because the different applications’ performance behavior varied significantly, the kernel benchmark reflected only a small part of the application spectrum.

Component Performance
As we mentioned earlier, system component measurements give insight into the behavior of individual machine features: their relative performance shows a feature’s contribution to a computational problem’s overall solution. Furthermore, component performance relative to some upper bound shows us how efficiently the machine feature is exploited compared to theoretical limits. We used the performance analysis tools mpiP, hpmcount, and strace to measure specific application component characteristics such as communication, computation, and I/O. Figure 3 shows the time breakdown.

Communication characteristics. We measured the communication component overhead for all the application benchmark codes and the HPL benchmark with constant and scaled problem sizes, and found the communication cost to be 5 to 40 percent of the overall execution time. Seismic did the least amount of communication, followed by GAMESS, MILC, and WRF. We measured all four phases of Seismic separately and later aggregated them; ultimately, Seismic showed significant load imbalance. In GAMESS, a single processor communicated 100 percent of the time. WRF, MILC, and HPL were the most balanced, but the difference between the least and most communicating processors was still 50 percent or more.
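The sketch below shows, under simplified assumptions, how per-rank measurements like those behind Figure 3 can be reduced to the percentages and imbalance figures discussed here. It assumes we already have, for each MPI rank, the seconds attributed to MPI communication and to I/O (for example, from mpiP and an strace-derived script); everything else is counted as computation. The input numbers are invented.

```python
# Per-rank breakdown into computation, communication, and I/O (invented numbers).
# comm[r] and io[r] are seconds attributed to MPI and file I/O on rank r;
# total[r] is that rank's wall-clock time.
total = [100.0] * 4
comm  = [12.0, 18.0, 30.0, 9.0]
io    = [2.0, 0.0, 0.0, 0.0]

for r, t in enumerate(total):
    c, i = comm[r], io[r]
    comp = t - c - i                              # remainder counted as computation
    print(f"rank {r}: comp {100*comp/t:4.1f}%  comm {100*c/t:4.1f}%  I/O {100*i/t:4.1f}%")

# Communication imbalance: spread between least- and most-communicating ranks.
imbalance = (max(comm) - min(comm)) / max(comm) * 100
print(f"communication imbalance: {imbalance:.0f}% of the busiest rank's communication time")
```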


HPL communication depended on problem size and exhibited locality characteristics: with an increasing data set size, communication reduced significantly. This feature of the kernel benchmark led to generally good scaling behavior on very large machines, but it also contributed to the difference in rank lists of application versus kernel benchmarks. We observed a significant difference in communication cost overhead across architectures (IBM P690 and Intel Xeon Linux cluster) for the MILC and WRF benchmarks because they’re communication intensive. The IBM P690 system provided higher communication bandwidth than the Intel Xeon system.

I/O characteristics. We measured I/O volume on all the benchmark codes and found that MILC and HPL didn’t perform any I/O, whereas in both WRF and GAMESS, a single processor performed all the I/O. The I/O volumes of these codes (four Seismic phases measured separately, WRF, and GAMESS) ranged from 61 Mbytes to 5.5 Gbytes. The fraction of execution time taken by the I/O was small in both GAMESS and WRF but significant in Seismic, especially in phase 1. The I/O read and write volumes differed, yet they took the same amount of time due to differences in read and write speed.

[Figure 4: total memory size per processor (Mbytes, log scale) versus number of processors on the Intel Xeon cluster for WRF, the four Seismic phases, GAMESS, and HPL.]

Figure 4. Benchmark memory footprints. The total memory size per processor for an Intel Xeon cluster varies widely depending on the benchmark used.

Memory footprints. Figure 4 shows the benchmarks’ memory footprints as a function of the number of processors. Again, we split Seismic into its four execution phases, with the sizes ranging from 20 Mbytes per processor in Seismic’s phase 1 to 1 Gbyte per processor in MILC (due to the large footprint, MILC couldn’t run on less than four processors). MILC, WRF, and Seismic’s phases 3 and 4 exhibited the commonly expected behavior: memory footprint decreased steadily with increasing processor numbers. But in Seismic’s phases 1 and 2 and in GAMESS, the memory footprint was independent of the number of processors. This finding is important because it refutes the common assumption that larger systems will naturally accommodate larger data sets. This assumption is the basis for a benchmark methodology that lets data sets “scale” and thus reduces communication, leading to seemingly improved performance numbers on large systems. Our results show that this path to scalability might not be correct.

Application Characteristics
We can characterize computational applications from many diverse angles—the physical problem being solved, the algorithms used, computer language and source code properties, compiler-applied optimization methods, generated instruction characteristics, and upper limit analyses, to name a few. A systematic methodology of application characterization11 could guide the process of answering the relevant questions in this area. Table 3 shows a small set of such data, providing insight into the core algorithms and programming languages used to compose our test applications.

As we mentioned earlier, understanding the characteristics of today’s applications could help us anticipate the behavior of tomorrow’s applications. Most performance prediction techniques are based on application signatures and machine profiles, combined via convolution methods. In one approach,12 the authors defined synthetic kernels that exhibit interesting code and machine properties and then extrapolated these kernels’ measured behavior to larger data set sizes. Another approach13–15 measured an application’s key characteristics on a current platform and then forecasted the application behavior on scaled data and machine sizes, using predictive formulas. The authors derived the formulas with a least-squares fitting approach15 or via the compiler from the source code, expressing the way input data affects the volume of computation, communication, and I/O.
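As a minimal illustration of the least-squares idea cited above, the sketch below fits a two-term runtime model, T(p) ≈ a/p + b (a scalable part plus a fixed overhead), to a few measured points and then extrapolates to larger processor counts. The model form and the data are our own assumptions for illustration; the cited approaches fit richer, application-specific formulas.

```python
# Least-squares fit of a simple runtime model and a forecast (illustrative only).
import numpy as np

procs = np.array([4, 8, 16, 32], dtype=float)       # measured processor counts
times = np.array([980.0, 520.0, 300.0, 190.0])      # measured wall-clock times, seconds

# Model: T(p) = a * (1/p) + b, solved as a linear least-squares problem.
A = np.column_stack([1.0 / procs, np.ones_like(procs)])
(a, b), *_ = np.linalg.lstsq(A, times, rcond=None)

for p in (64, 128):
    print(f"forecast for {p:3d} processors: {a / p + b:6.1f} s "
          f"(scalable part {a / p:5.1f} s, fixed part {b:5.1f} s)")
```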


Table 3. Benchmark composition in terms of number of files, subroutines, lines, programming languages, and core algorithms.

 | Seismic | WRF | GAMESS | MILC
Files | 59 | 313 | 61 | 152
Subroutines | 354 | 1139 | 835 | 382
Lines | 24,000 | 168,000 | 122,000 | 20,000
Languages | 56% Fortran, 44% C | 85% Fortran, 15% C | 99.5% Fortran, 0.5% C | 100% C
Algorithms | 3D FFT, finite difference, Toeplitz solver, Kirchhoff integral method | CFD (Eulerian equation solver), split-explicit | Eigenvalue solver, dense matrix-matrix multiply | Conjugate gradient

A good benchmarking methodology can save a tremendous amount of resources in terms of human effort, machine cycles, and cost. Such a methodology must consider the relevance and openness of the chosen codes, well-defined rules for executing and reporting the benchmarks, a review process to enforce the rules, and a public repository for the obtained information. For the methodology to be feasible, it must also be supported by adequate tools that enable the user to consistently execute the benchmarks and gather the requisite metrics.

At the very least, reliable benchmarking results can help people make decisions about HPC acquisitions and assist scientists and engineers in system advances. By saving resources and enabling balanced designs and configurations, realistic benchmarking ultimately leads to increased competitiveness in both industry and academia.

Acknowledgments
This work was supported, in part, by the US National Science Foundation under grants 9974976-EIA, 0103582-EIA, and 0429535-CCF. The machines used in our experiments were made available by Purdue University’s Rosen Center for Advanced Computing and the Oak Ridge National Laboratory.

References
1. A.T. Wong et al., “ESP: A System Utilization Benchmark,” Proc. IEEE/ACM SC2000 Conf., IEEE CS Press, 2000; http://citeseer.ist.psu.edu/wong00esp.html.
2. S.C. Woo et al., “The SPLASH-2 Programs: Characterization and Methodological Considerations,” Proc. 22nd Int’l Symp. Computer Architecture, ACM Press, 1995, pp. 24–36.
3. M. Berry et al., “The Perfect Club Benchmarks: Effective Performance Evaluation of Supercomputers,” Int’l J. Supercomputer Applications, vol. 3, no. 3, 1989, pp. 5–40.
4. M. Berry, G. Cybenko, and J. Larson, “Scientific Benchmark Characterizations,” Parallel Computing, vol. 17, Dec. 1991, pp. 1173–1194.
5. R.W. Hockney and M. Berry, “ParkBench Report: Public International Benchmarking for Parallel Computers,” Scientific Programming, vol. 3, no. 2, 1994, pp. 101–146.
6. R. Eigenmann and S. Hassanzadeh, “Benchmarking with Real Industrial Applications: The SPEC High-Performance Group,” IEEE Computational Science & Eng., vol. 3, no. 1, 1996, pp. 18–23.
7. R. Eigenmann et al., “Performance Evaluation and Benchmarking with Realistic Applications,” SPEC HPG Benchmarks: Performance Evaluation with Large-Scale Science and Engineering Applications, MIT Press, 2001, pp. 40–48.
8. M.S. Mueller et al., “SPEC HPG Benchmarks for High Performance Systems,” Int’l J. High-Performance Computing and Networking, vol. 2, no. 1, 2004; http://citeseer.ist.psu.edu/672999.html.
9. V. Aslot et al., “SPEComp: A New Benchmark Suite for Measuring Parallel Computer Performance,” Proc. Workshop OpenMP Applications and Tools, LNCS 2104, Springer-Verlag, 2001, pp. 1–10.
10. D. Bailey et al., The NAS Parallel Benchmarks 2.0, tech. report NAS-95-020, NASA Ames Research Center, Dec. 1995.
11. B. Armstrong and R. Eigenmann, “Benchmarking and Performance Evaluation with Realistic Applications,” A Methodology for Scientific Benchmarking with Large-Scale Applications, MIT Press, 2001, pp. 109–127.
12. D.H. Bailey and A. Snavely, “Performance Modeling: Understanding the Past and Predicting the Future,” Proc. 11th Int’l Euro-Par Conf., LNCS 3648, Springer-Verlag, 2005, pp. 761–770.
13. B. Armstrong and R. Eigenmann, “Performance Forecasting: A Methodology for Characterizing Large Computational Applications,” Proc. Int’l Conf. Parallel Processing, IEEE Press, 1998, pp. 518–526.
14. L. Carrington, A. Snavely, and N. Wolter, “A Performance Prediction Framework for Scientific Applications,” Future Generation Computer Systems, vol. 22, no. 3, 2006, pp. 336–346.
15. M. Mahinthakumar et al., “Performance Evaluation and Modeling of a Parallel Astrophysics Application,” Proc. High Performance Computing Symp., Soc. for Computer Simulation Int’l, 2004; www.scs.org/docInfo.cfm?get=1681.

Mohamed Sayeed is a research scientist in the Computing Research Institute and Rosen Center for Advanced Computing at Purdue University. His research interests include numerical modeling, high-performance computing for scientific computing, and performance modeling. Sayeed has a PhD in civil engineering from North Carolina State University. Contact him at [email protected].

Hansang Bae is working toward a PhD in electrical and computer engineering at Purdue University. His research interests include optimizing compilers and performance analysis of high-performance applications and platforms. Contact him at [email protected].


Yili Zheng is a PhD candidate in the School of Electrical and Computer Engineering at Purdue University. His research interests include cross-layer optimizations and end-to-end solutions for computation-intensive problems such as those in biomedical, financial, and energy applications. Contact him at [email protected].

Brian Armstrong is a PhD candidate in Purdue University’s computational science and engineering program. His research interests are performance analysis of parallel and distributed applications and automatic parallelization of industrial-grade applications. Contact him at [email protected].

Rudolf Eigenmann is a professor in the School of Electrical and Computer Engineering and interim director of the Computing Research Institute at Purdue University. His research interests include compiler optimization, programming methodologies and tools, performance evaluation for high-performance computers and applications, and Internet sharing technology. Eigenmann has a PhD in electrical engineering and computer science from ETH Zurich. Contact him at [email protected].

Faisal Saied is a senior research scientist at Purdue University’s Computing Research Institute and Rosen Center for Advanced Computing. His research interests include numerical analysis, parallel numerical algorithms, numerical methods for the Schrödinger equation, large-scale eigenvalue problems, and performance engineering for large-scale applications. Saied has a PhD in computer science from Yale University. Contact him at [email protected].

