Performance Profiling with Cactus Benchmarks

Sasanka Madiraju
April 6th, 2006

System Science Master's Project Report
Department of Computer Science
Louisiana State University

Acknowledgment

This project has evolved out of a cumulative effort at the Center for Computation & Technology towards the proposal written in reply to the NSF "bigiron" solicitation. Over a span of six months there was a great deal of activity to collect and organize data and to carry out rigorous analysis. I would like to thank all the people who helped me during the course of this project. I would like to thank my committee members, Dr. Gabrielle Allen, Dr. Edward Seidel and Dr. Gerald Baumgartner, for their support and suggestions. I would also like to offer my appreciation to Dr. Thomas Sterling and Dr. Ram Ramanujam for their constant encouragement and help. I would especially like to thank Dr. Erik Schnetter and Dr. Maciej Brodowicz for their constant advice and support, and I am grateful to them for their exceptional guidance. I am also grateful to Mr. Yaakoub El-Khamra and Dr. Steven Brandt for their help and suggestions. Finally, I would like to thank all the people at CCT for being a source of inspiration.

Sasanka Madiraju

Contents

1 Introduction
  1.1 Project Scope
  1.2 Goals
    1.2.1 Phase 1
    1.2.2 Phase 2
  1.3 Contributions and People Involved

2 Cactus Toolkit: Benchmarks
  2.1 Cactus Toolkit
  2.2 Types of Benchmarks
  2.3 Bench BSSN PUGH
  2.4 Bench Whisky Carpet
  2.5 Duration of Run
  2.6 Problems with IPM Results

3 Phase 1: Analysis of Results
  3.1 IBM Power5 Architecture
  3.2 Intel Itanium2 Architecture
  3.3 Intel Xeon Architecture
  3.4 Details of Machines Used
  3.5 Classification of Results
    3.5.1 Single Processor Performance Results
    3.5.2 Scaling Results for different machines

4 Phase 2: Profiling Bench BSSN PUGH
  4.1 Reason for Profiling
  4.2 Tools used
    4.2.1 PAPI: Performance Application Programming Interface
    4.2.2 Valgrind: A Linux based profiling tool
    4.2.3 Valkyrie: A GUI based tool for Valgrind
    4.2.4 gprof: GNU Tool for profiling
    4.2.5 VTune: Intel Performance Analyzer
    4.2.6 Oprofile: Linux based Profiling tool
  4.3 Results and Analysis
    4.3.1 Gprof Results
    4.3.2 Valgrind Results
    4.3.3 Oprofile Results

5 Conclusion

Chapter 1

Introduction

1.1 Project Scope

Modeling real world scenarios requires high performance systems capable of performing an extremely large number of concurrent operations. The increasing complexity of these models implies that there will always be a perennial demand for faster and more robust computing resources. There is a cyclic dependency between applications and resources: applications are driven by the presence of the resources, and the resources are enhanced to accommodate the demands of the applications. Supercomputers and cluster systems are used for performing operations on similar kinds of data, but on a gigantic scale. For example, an average desktop system can perform several million complex calculations per second; supercomputers and clusters, in comparison, can perform a vastly larger number of operations per second than a stand-alone system. In a very simplistic sense, a supercomputer is a collection of a large number of smaller computers, put together to exploit the availability of computational power. It is a well known fact that, over the past few decades, the field of computer engineering has shown exponential growth in microprocessor technology, resulting in the overall development

of computing resources and architectures. Hence there is a need to understand and improve the performance of existing computing resources to accommodate the requirements of these applications. One tricky aspect of using supercomputers is writing applications that can maximize the utilization of the available resources. This is not easy, as there are a number of limiting factors that prevent maximal utilization of the existing resources. There are two solutions to this problem: understand the limiting factors and try to find a way to work around them, or add measured resources to the system to improve its performance. The major goal of this project is to understand the limiting factors and come up with a detailed analysis of the results. In order to write better code, it is imperative to understand the behavior of the application on the given architecture. This is not a simple task, as the number of possible architectures and machine configurations is quite large, bordering on limitless. Each computational resource is unique in terms of its processor architecture, peak performance, type of interconnect, available memory and I/O system. The aim of this project is to study the performance characteristics of Cactus [6] benchmarks on different architectures and analyze the results obtained from the benchmark runs. Cactus is a computational toolkit used for problem solving in different scenarios by scientists and engineers. Cactus was picked for profiling because it is a good example of software with excellent modularity and portability. It is known to run on different architectures and on machines of different sizes; for example, these benchmarks can be run on a laptop or stand-alone system, or just as easily on a supercomputer or a cluster. Cactus also supports many technologies such as HDF5 parallel file I/O, the PETSc scientific library, adaptive mesh refinement, web interfaces and advanced visualization tools. The study of the profiling results also helps in understanding the different aspects of the system architecture that affect the performance of applications. The project also aims to turn on

tracing on these benchmarks so that the time spent in different operations can be measured. A number of profiling tools, such as PAPI [1], IPM [2], gprof [8], Valgrind [4] and Oprofile [3], are used to collect information about the total number of floating point operations performed, the number of Level 1, Level 2 and sometimes Level 3 data and instruction cache misses, and call graph information. This information can be used to tune the performance of the benchmark. The results of the first phase of the project show a need to understand the benchmark in detail. Hence this project not only tries to understand the performance of the code on different architectures but also tries to find avenues for optimizing the benchmark code.

1.2 Goals

This project aims to understand the performance characteristics of the Cactus [6] benchmark Bench BSSN PUGH on different architectures. This analysis helps in understanding the architectural aspects of such complex systems. The study also aims to provide a basis for tuning the actual benchmark so that it can become more efficient. The project is divided into two phases, which are described in the following sections.

1.2.1 Phase 1

The first phase of the project deals with running Cactus [6] benchmarks on different architectures and collecting the run statistics. These results are analyzed and the reasons for the differences in the wall times are studied in detail. This phase of the project was important for collecting results for the recent CCT proposal in reply to the NSF "bigiron" solicitation. In this phase it was found that, among the different NSF benchmarks, the Cactus benchmarks Bench BSSN PUGH and Bench Whisky Carpet were dominated by load and store operations. The reasons for this result are studied in detail in the second phase of the project.

1.2.2 Phase 2

The second phase of the project involves profiling the Cactus [6] benchmark Bench BSSN PUGH. This benchmark was chosen over the Bench Whisky Carpet benchmark as it is already known to scale well and shows more consistent behavior. The Bench Whisky Carpet benchmark also shows consistent results, but when it is run on a large number of processors its performance deteriorates rapidly, making it unsuitable for profiling. Moreover, both benchmarks have similar code, the main difference being the driver used: one uses PUGH (structured unigrid) and the other uses Carpet (adaptive mesh refinement). Profiling tools such as IPM [2], PAPI [1], gprof [8], Valgrind [4] and Oprofile [3] were used to investigate the reasons for the number of load and store operations. The load and store operations become expensive when there are many L2 (or L3, if present) cache misses, since the processor then has to fetch the data or instructions from the next level of the memory hierarchy. Hence, the profiling tools are used to identify those parts of the code which are responsible for the majority of the cache misses at L2 or L3.

1.3 Contributions and People Involved

My contributions to the proposal submitted by CCT for the NSF "bigiron" solicitation have been compiled, and further analysis is added as part of this project. A lot of data and results from the benchmark runs went into the proposal. There were a number of other benchmarks that were run by other members of the team which are not part of this project. Results from the Cactus [6] benchmark runs on the NERSC machine Bassi were provided to me by one of our team members, Dr. Erik Schnetter. Moreover, much of the information about the architectures, the kinds of runs that needed to be performed, and invaluable suggestions about the analysis of the data were provided to me by Dr. Maciej Brodowicz and Dr. Erik Schnetter. All the results from the NCSA and LSU machines were collected by me, and I have compiled a lot of information about different machines at different institutes. I have also compiled a large set of results from profiling Bench BSSN PUGH with various profiling tools. This effort can be used by the Center for future research, and the Performance group in particular would benefit from this data. Most of this information was used by the BigIron proposal team, which comprised Dr. Gabrielle Allen, Dr. Edward Seidel, Dr. Thomas Sterling, Dr. Steven Brandt, Dr. Erik Schnetter, Dr. Maciej Brodowicz and me, with help from many others at the Center for Computation & Technology at LSU. Credit for the results from the runs on the PSC machines, BigBen (Cray XT3) and Lemieux, belongs to Mr. Yaakoub El-Khamra.

Chapter 2

Cactus Toolkit: Benchmarks

2.1 Cactus Toolkit

Benchmarking different architectures can be quite tricky, as it is difficult to find an application which is not optimized to run on one type of architecture. Most real world applications are optimized to run well on specific platforms and hence are not suitable for use as benchmarks across different platforms. In order to measure and understand the true performance of a machine, there is a need for a highly portable application which is not biased towards one particular architecture; it also helps if it is open source. This is where the Cactus [6] Toolkit comes into play, as it is an open source, modular, portable programming environment for collaborative high performance computing. Since we are planning to study architectural issues in HPC, it is relevant to use an application with a high degree of generic parallel computation, with modules providing parallel drivers, coordinates, boundary conditions, elliptic solvers such as PETSc, interpolators, reduction operators, and efficient I/O in different data formats. Interfaces are generic (e.g. an abstract elliptic solver API), making it possible to develop improved modules. Moreover, Cactus has excellent provisions for visualizing simulation data for analysis, as it employs a range of external applications, such as

Amira, IDL or OpenDX. The data can also be analyzed in-line through a web-server module. Cactus [6] was mainly developed by the Max Planck Institute for Gravitational Physics in Potsdam, Washington University, Lawrence Berkeley National Laboratory, the National Center for Supercomputing Applications, the University of Tuebingen and many other scientific groups around the world. Currently, most of the development is done by the Frameworks group at the Center for Computation & Technology and by researchers in Potsdam, Germany. Cactus is being used and developed by many application communities internationally. It is used in numerical relativity, climate modelling, astrophysics, biological computing and chemical engineering, and it acts as a driving framework for a number of computing infrastructure projects, particularly in Grid computing, such as GridLab and GriKSL. Cactus consists of the "flesh" and a set of "thorns", or modules, that can be combined into different configurations. Cactus is extremely portable, modular and robust software, as the programmer has a good degree of freedom to add new thorns to the distribution and create new configurations. Each thorn performs certain tasks, and the same task can be performed by different thorns; for a given configuration, a thorn which performs a certain function can be replaced by another thorn which offers the same functionality. This is the advantage of the modular design, where thorns can be plugged in and out of the flesh to obtain different configurations. This enabled us to easily use the existing PAPIClocks thorn, which provides a high level API for keeping track of the hardware counters in the system. We used the PAPIClocks thorn by adding it to the thornlist of Bench BSSN PUGH and creating a configuration with it; a new parameter file is then used to specify the hardware events to count. Each Cactus configuration needs a parameter file to set the size or resolution of the benchmark, and there are two parameter files for each of our benchmarks, so a total of four benchmarks were run on different architectures. Each parameter file is used to control the simulation, and it has many options which can be set. These options are

present in the param.ccl file which is part of the benchmark. In other words, the parameter file exposes a fixed set of parameters which can be tweaked to perform different operations, and these options are declared in the param.ccl file in the benchmark source code. The .ccl extension stands for Cactus Configuration Language, a high level language used to specify things such as the order in which functions are called during execution, the way different modules interact with each other, the dependencies among the thorns, the active thorns, and, most importantly, the parameters available to the user. Three files keep track of these factors: schedule.ccl, interface.ccl and param.ccl.

2.2 Types of Benchmarks

A benchmarking suite should have a good mix of benchmarks designed to test different aspects of a system. The Cactus [6] toolkit has an excellent suite of benchmarks, and a few of them form a standard subset of the overall benchmarks used by NSF. The benchmarks present in the Cactus Toolkit can be found on the Cactus webpage [6]. In this list, the main benchmarks of interest are Bench BSSN PUGH and Bench Whisky Carpet, as these benchmarks were chosen to complement the NSF benchmarks for the "bigiron" solicitation. PUGH and Carpet are the drivers that provide MPI communication and memory management to the underlying physics simulation. PUGH is known to be a very scalable driver, whereas Carpet is not: from the results we can see that Carpet does not scale well beyond 100 processors. Beyond that point the simulation time becomes very large and the nodes can run out of memory, which kills the job. As far as the purpose of the code is concerned, both benchmarks offer the same functionality. The two benchmarks can be run using different parameter files.

2.3 Bench BSSN PUGH

Bench BSSN PUGH is a Cactus [6] benchmark application of a numerical relativity code using finite differencing on a uniform grid. It uses the CactusEinstein infrastructure to evolve a vacuum spacetime, using the BSSN formulation of the Einstein equations. It employs finite differencing in space and an explicit time integration method, and relies on the driver PUGH for distributing grid functions over processors. PUGH is a mature and efficient memory management and communication driver in Cactus; it uses MPI for communication and has been shown to scale up to thousands of processors on essentially all existing computing architectures, from small notebooks to the world's largest supercomputing installations. This benchmark comes in two sizes, 80l and 100l, which use about 400 MB and 800 MB of main memory per CPU and consume approximately 512 GFlop and 508 GFlop, respectively.

There are two parameter files, Bench BSSN PUGH 80l.par and Bench BSSN PUGH 100l.par. The 80l file specifies 80 grid points per processor and performs 80 iterations, while the 100l file specifies 100 grid points per processor and performs 40 time steps or iterations. These parameters can be modified according to the machine on which the benchmark is run; however, to have a uniform benchmark run on all the machines used for comparison purposes, these parameters remain fixed. On each machine the benchmark is recompiled, as the same binary will not run on all machines, but the executable is run with the same parameter file so that the benchmark ends up performing similar operations. The time it takes to finish the run is then highly dependent on the machine architecture. The set of compiler settings is fixed for each machine, and a standard set of options is used for each architecture.

2.4 Bench Whisky Carpet

This benchmark is a relativistic hydrodynamics code and employs mesh refinement. It is similar to Bench BSSN Carpet, but it solves the relativistic Euler equations

in addition to the Einstein equations, which requires more memory and more floating point operations per grid point. The relativistic Euler equations are implemented in the Whisky code; more information on Whisky can be found on its web pages. This benchmark comes in two sizes, 36l and 48l, which use about 400 MB and 800 MB of main memory per CPU, and consume approximately 265.9 GFlop and 312.8 GFlop, respectively.

2.5 Duration of Run

The problem size increases with the number of processors for both benchmarks. We decided to select the problem size based on the time the simulation takes: it was decided that 20 minutes would be a good target run time, and hence the parameter files for both benchmarks were tuned by changing the resolution and the number of iterations.

2.6 Problems with IPM Results

IPM (Integrated Performance Monitoring) is a low overhead tool developed by David Skinner at Lawrence Berkeley National Laboratory; it was used to obtain the total floating point operations and load/store operations for each benchmark. The analysis of the results from the IPM runs on an IBM Power5 machine reveals some possible bugs in the software. IPM does not produce reliable results when the number of processors increases. This is evident from the fact that, as the number of processors increased, the floating point operations reported for each processor decreased. It should be noted that the benchmarks have increasing problem sizes with increasing numbers of processors, which means the floating point operations should increase along with the number of processors. When per-processor floating point operations are considered, there is a large disparity in the number of floating point operations reported as the number of processors increases. So, it can be concluded that

the IPM results are not reliable when a large number of processors is employed and should not be considered for analysis. However, results from single processor runs using PM API, a library which has access to the hardware counters on the IBM Power5, have shown numbers similar to IPM's. This is useful information, as it indicates that IPM is reliable for small numbers of processors.

Chapter 3

Phase 1: Analysis of Results

In order to understand the numbers shown in the plots presented in this chapter, it is essential to have some introduction to the three main architectures that are being compared. There are specific aspects of these architectures which are responsible for the performance of the benchmarks. It is interesting to note that the Bench BSSN PUGH benchmark performs well on the IBM Power5 architecture compared to the other architectures, the reasons for which are elaborated in this chapter. In the second phase we look at the profiling results, and they clearly show that most of the results here are architecture dependent. For example, the way cache memory is organized on the processor plays a significant part in the performance of a benchmark on that architecture.

3.1 IBM Power5 Architecture

The Power5 is an extension of the IBM Power4 architecture. It is a dual core chip capable of handling both single threaded and multi-threaded applications. There are some interesting differences between the Power4 and the Power5 which directly translate into performance differences. From the IBM pages the following differences can be listed:

• The L3 cache was moved from the memory side of the fabric to the processor side. Whenever there is an L2 cache miss, the processor can fetch the instruction or data from the off-chip L3 cache. L3 is connected to the processor by a bus running at half the processor speed and hence is not a major bottleneck. L3 thus operates as a victim cache for L2: if there is an L2 cache miss, the data or instructions are fetched from the large L3 cache.

• The size of the L3 cache increased from 32MB to 36MB; in addition, the L3 cache in the Power4 needed two chips, compared to one chip for the Power5.

• The memory controller has been moved on chip as the L3 cache moved off the chip. This makes memory access faster and reduces the latency to the L3 cache and to memory.

• The addition of an on-chip L3 cache directory for lookups makes access to the L3 cache extremely fast.

3.2 Intel Itanium2 Architecture

According to Intel's technical documentation, the Intel Itanium2 [5] processor is neither a RISC nor a CISC architecture; instead it uses an Explicitly Parallel Instruction Computing (EPIC) instruction set. It is expected to scale to thousands of processors, and the EPIC design enables tighter coupling between hardware and software. An interesting feature of the Itanium2 instruction set architecture is its use of instruction bundles, groups of three instructions handled at the same time. It can execute two instruction bundles per clock cycle, which means it can execute six instructions per clock cycle, as opposed to the eight instructions per clock cycle fetched by the IBM Power5. The EPIC design consists of a 6-wide, 8-stage deep pipeline. It has six integer and

six multimedia arithmetic and logic units, two extended precision floating point units, two single precision floating point units, two load and two store units, and three branch units. The design of the Itanium assumes that the compiler is in a better position than a hardware scheduler to schedule instructions. The compiler uses static scheduling, as it has extensive knowledge of the code, whereas at run time the hardware has limited control over the order of execution. The biggest disadvantage of the Itanium processor is that an L2 cache miss can be very costly, as it ends up stalling execution because the hardware cannot reschedule around the miss. This dependence on the compiler to perform the scheduling, with no out-of-order execution, is the main drawback. The effect on performance is quite evident in our benchmark results: Bench BSSN PUGH has a large number of load/store operations and hence performs poorly on this architecture. Also, according to [5], code inflation is a big problem with EPIC, as a number of NOPs (no-operations) are inserted into the generated code to form fixed length bundles; this method of fitting code into fixed size bundles is part of the Itanium's EPIC architecture. The IBM Power5, by contrast, fetches 8 instructions per machine cycle and has four FPUs and a large L3 cache. This is one of the reasons why the overall wall time for the benchmarks on the Power5 is lower than on the Itanium architectures.

3.3 Intel Xeon Architecture

The Intel Xeon processor is available in different types of chips, with clock speeds ranging from 2GHz to 3.20GHz. Tungsten (NCSA, 1280 nodes) and Supermike (LSU, 512 nodes) are two machines based on this architecture. The clock speeds of the two machines differ, with processors on Tungsten running at 3.2GHz and those on Supermike at 3.06GHz. The major difference between this architecture and the IBM Power5 is the position and size of the L3 cache: on a Xeon the L3 cache is on chip and is only 2MB, whereas

on a Power5 it is off chip and as large as 36MB. This is the main reason for the performance difference between the two processors when it comes to the Cactus [6] benchmarks, which are computationally intensive. On the whole, Xeon processors are compact and have less cache memory than the IBM Power5.

3.4 Details of Machines Used

Coming up with a machine list that is diverse in its selection of architectures was easy, as accounts at many supercomputing centers were available via CCT researchers, so almost all kinds of architectures were used for benchmarking. The results of the runs are documented, and every run was repeated at least three times to get a consistent value for the wall time. Two types of results are studied in this project, viz. single processor performance and the scaling of the benchmark with the number of processors. Not all machines have both types of results; some do not have scaling numbers because they had too few processors or because it was not possible to get the results through the queuing systems in the time this report was prepared. Intel and IBM based processors are the most common in the list, and hence the focus of this project was mostly on these two types of architectures. Moreover, among the IBM based processors the Power5 architecture is the most commonly used one, and hence the majority of the results analyzed for this project were obtained from benchmark runs on that architecture. Table 3.1 shows the different machines and their architectures.

3.5 Classification of Results

This section explains the types of results that were collected from the benchmarks.

Machine Name   Architecture        Institute   Number of Processors
Supermike      Intel Xeon (IA32)   LSU         1024
Pelican        IBM Power5          LSU         112
LONI           IBM Power5          LSU         112
Nemeaux        Apple G5            LSU         1
XWS4           Intel Xeon (IA32)   CCT         1
GreenGrass     Apple G5            CCT         1
Bassi          IBM Power5          NERSC       888
Tungsten       Intel Xeon (IA32)   NCSA        2560
Cobalt         SGI Altix (IA64)    NCSA        2048
Copper         IBM Power4          NCSA        350
Bigben         Cray XT3            PSC         2068
Lemieux        Alpha Server        PSC         750
BlueGene/L     IBM Power5          PSC         -

Table 3.1: Table showing the different machines

Without using any external software such as PAPI clocks or IPM, the Cactus [6] benchmarks can provide the wall time of the simulation, reported by the "gettimeofday" timer, and the CPU time spent in the simulation, reported by the "getrusage" timer. The most useful result for our purpose is the wall time, as it reflects the actual time that elapsed while running the benchmark, as opposed to the time obtained by counting only the machine cycles the processor spent on the job. Hence our analysis uses the wall time, which is generally the norm in most simulations. When the PAPI counters and IPM are turned on, the results are more detailed and can be used for tracing and profiling. There are two types of results that can be obtained for each machine:

• Single processor wall time to calculate the single processor performance

• Scaling of wall time with the number of processors to analyze the scaling properties of the system

3.5.1 Single Processor Performance Results

The results for the wall times of the four benchmarks are presented in this section. The machines used for running the benchmarks are listed in Table 3.1. Not all the benchmarks

were run on all the machines. Machines at different institutes, with their different architectures and resources, are shown in figure 3.1. This plot can be used as a reference for the resources of the machines when analyzing the results of the benchmark runs. The results of the single processor benchmark runs have been used to calculate the percentage utilization of the peak performance of the CPU. There are a number of factors that distinguish each machine; of these the following are significant:

• Architecture of the processor: Number of cores, type of interconnect between the processors if it is a multi-processor, available cache (it could be shared among multiple cores or each core has its own cache)

• Available memory per processor: could be shared or distributed

• Type of interconnect used between the nodes if it is a symmetric multiprocessor, and the topology, latency and bandwidth of the interconnect. The interconnect does not affect single processor performance; it only affects the scaling results.

• Amount of storage for an IO intensive benchmark

• If the processors on a node are used to run more than one user's jobs, the interconnect between the processors is shared between the jobs. This is another factor which could affect the results of a benchmark run. On the other hand, if the node is used entirely by one user, this effect can be discounted. On systems with dual processor nodes this is generally not an issue, as both processors are generally used by the same user.

Wall time is also referred to as "gettimeofday"; it is the time that elapses between the user starting the job and getting the results of the simulation back. It measures the overall system performance, or response time, rather than just the CPU time. CPU time, on the other hand, is also called "getrusage"; it is the actual time spent by the processor executing the

user's job. It does not include any other delays in the system and hence is a strict measure of the performance of the CPU. For this project we are trying to measure the overall system performance for a given benchmark, and hence the wall time is a good measure for the analysis. CPU time is not considered here; moreover, the values of the CPU times are very close to those of the wall times.
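To make the distinction concrete, the following minimal, standalone C sketch (not part of the benchmark; the workload loop is purely illustrative) reads both timers around a dummy computation: gettimeofday() brackets the wall-clock interval, while getrusage() reports the CPU time charged to the process.

    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    /* Convert a struct timeval to seconds. */
    static double tv_to_sec(struct timeval tv)
    {
        return tv.tv_sec + tv.tv_usec * 1.0e-6;
    }

    int main(void)
    {
        struct timeval t0, t1;
        struct rusage ru;
        double sum = 0.0;
        long i;

        gettimeofday(&t0, NULL);            /* start of wall-clock interval */
        for (i = 1; i <= 100000000L; i++)   /* stand-in for the real work   */
            sum += 1.0 / (double)i;
        gettimeofday(&t1, NULL);            /* end of wall-clock interval   */

        getrusage(RUSAGE_SELF, &ru);        /* CPU time consumed so far     */

        printf("checksum   : %f\n", sum);
        printf("wall time  : %f s\n", tv_to_sec(t1) - tv_to_sec(t0));
        printf("user CPU   : %f s\n", tv_to_sec(ru.ru_utime));
        printf("system CPU : %f s\n", tv_to_sec(ru.ru_stime));
        return 0;
    }

On a lightly loaded single-processor node the wall time and the user CPU time are nearly identical, which is consistent with the observation above that the measured CPU times are very close to the wall times.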

In order to get a good estimate of the wall time, each data point in figure 3.2 is taken from a pool of runs. Each run is performed three times and the best time is reported here. The figure shows the wall times for single processor runs of each of the four benchmarks on different machines. Some of the data points are empty, as the simulations were not run for those cases or the data obtained is incomplete. As we get more results from the runs, we will be adding them to the Cactus [6] database. A lot of the information that was collected for this project is present on an internal web page [7] on the CCT website; right now these web pages are password protected, but they will soon be made public. Figure 3.2 shows all four benchmarks at the same time, which gives a clearer picture of their relative performance. For example, from this figure

it is clear that Bench BSSN PUGH, with both parameter files, in general runs faster than Bench Whisky Carpet. There is another benchmark, Bench BSSN Carpet, which solves the same problem as Bench BSSN PUGH but uses mesh refinement. It is important to note that even though massively parallel systems employ hundreds and even thousands of processors to complete a single job, the single processor performance utilization is very low. Each processor is capable of performing a certain fixed number of floating point operations per second, and this number depends on a variety of factors such as the number of floating point units in the processor, the number of instructions the processor executes in a machine cycle, the amount of L2 or L3 (if present) cache for each processor, etc. These factors affect the performance of the benchmark drastically, as we can

see from figure 3.3, which shows the percentage of peak performance utilized by each benchmark on different machines. We can see that the Bench BSSN PUGH benchmark, with its two parameter files, has a higher percentage of utilization than the Bench Whisky Carpet benchmark. The reason behind this disparity is the difference in the number of floating point operations performed by the benchmarks: Bench BSSN PUGH performs more floating point operations than Bench Whisky Carpet and hence makes better use of the peak performance of a processor. Each processor has a fixed value for the number of floating point operations that can be performed per second, and the utilization depends on the application being run. It is desirable, for a benchmark designed to test the performance of a processor, to have more floating point operations. The number of floating point operations performed by the application depends on factors such as the compiler settings (optimizations) and architecture specific flags used during compilation, which could reduce the number of floating point operations by transforming the application. It can be useful to turn off all optimization on an application and then run it to see the raw number of floating point operations performed; this is not a very reliable number, however, as it also depends on the instruction set architecture of the processor and other architecture dependent features. For testing the scalability of the application on different architectures we have used the machines shown in Table 3.2.

3.5.2 Scaling Results for different machines

An application is said to be scalable if the execution time decreases as the number of processors on which it is running is increased. It is a good metric for a benchmark, and it can be used to test the performance of the benchmark on a large computing resource. In general, scaling requires the run time of the benchmark to decrease as the number of processors used increases. Ideally there should be an inverse relation between the wall time and the number of processors used.

Figure 3.1: Comparison of machines from different institutes, showing for each machine and institute the processor peak performance (GFlops), clock speed (GHz), peak system performance (TFlops) and total memory (TBytes).

Figure 3.2: Single processor wall time results for all four benchmarks. The plot shows the wall time in seconds for Bench_BSSN_PUGH_80l, Bench_BSSN_PUGH_100l, Bench_Whisky_Carpet_36l and Bench_Whisky_Carpet_48l on each machine and architecture.

                  Bassi     Loni      Copper    Tungsten     Cobalt
Institute         NERSC     LSU       NCSA      NCSA         NCSA
Architecture      Power5    Power5    Power4    IA32         IA64
Processor         Power5    Power5    Power4    Intel Xeon   Itanium2
Clock Speed       1.9GHz    1.9GHz    1.3GHz    3.2GHz       1.6GHz
Proc Peak GFlops  7.6       7.6       5.2       6.4          6.4
Dual Core         Yes       Yes       Yes       Yes          Yes
L1 Cache          64I/32D   64I/32D   64I/32D   8KB          -
L2 Cache          1.92MB    1.92MB    1.44MB    512KB        -
L2 Cache Shared   Yes       Yes       Yes       Yes          -
L3 Cache          36MB      36MB      128/8     1MB          9MB
Memory per SMP    32GB      32GB      64GB      3GB          2GB
Processor Count   888       112       350       2560         2048
Interconnect      GigE      GigE      GigE      Myrinet      Infiniband

Table 3.2: Machines used for scaling results

Figure 3.3: Percentage of peak processor performance utilized by each benchmark (Bench_BSSN_PUGH_80l, Bench_BSSN_PUGH_100l, Bench_Whisky_Carpet_36l and Bench_Whisky_Carpet_48l) on each machine and architecture.

Some benchmarks required for the NSF solicitation are known to scale extremely well: if the number of processors is doubled, the wall time of the simulation is reduced by a factor of two. This is almost perfect scaling, and it is a very desirable quality in a benchmark, especially when it holds for any number of processors. From the results obtained, Bench BSSN PUGH scales very well; Bench Whisky Carpet, on the other hand, does not scale with the number of processors.

Figure 3.4 shows the scaling of Bench BSSN PUGH with an increasing number of processors. It can be seen from the graph that the benchmark scales differently on different machines: on the IBM Power5 based architecture the benchmark scales very well, and it also scales on the Intel Xeon and Itanium2 architectures. The same benchmark was run with a different parameter file, and the scaling results are shown in figure 3.5. The benchmark again shows good scaling on all the machines and performs well on the IBM Power5 machines. As the number of processors increases, the time spent sending messages between the processors increases drastically. It is also interesting to note that after a certain point the performance degrades, as the communication time dominates and too many processors are being employed to run the application.

22 Figure 3.4: Scaling results for Bench BSSN PUGH 80l on different machines

Figure 3.5: Scaling results for Bench BSSN PUGH 100l on different machines

Bench BSSN PUGH scales well: as the number of processors increases, the time for the simulation does not increase drastically. For a perfectly scalable application the curve will be a straight line, since the scaling plots are semi-logarithmic with the log scale on the X-axis, and the line will have a negative slope, as the time decreases with an increasing number of processors. In both figure 3.4 and figure 3.5 we can see that the IBM Power5 architecture shows better numbers, and the benchmarks scale well on this architecture. The scaling of an application depends not only on the way the code is written, but also on the architectural features of the entire machine. The interconnect plays a significant part in the scaling of an application. Most cluster systems and supercomputers are of one of the following two types:

• SMP (symmetric multiprocessor): each node has a certain number of processors sharing the same memory, and each node is connected to the others through an interconnect. So each node uses a shared memory model, and between the nodes there is message passing. Some nodes can have up to 32 processors, and to test the scaling on these machines it is important to run the application on more than 32 processors, so that more than one node is used and the interconnect starts playing a role. Even though the Cactus [6] benchmarks use MPI within a node, in general it is better to obtain results using more than one node, as that involves more communication and is therefore a good test of the interconnect.

• Distributed memory architectures: these machines are individual processors connected by an interconnect, where each processor has its own memory and the processors use message passing to communicate. It is harder to write code for a distributed memory machine, as the programmer has to keep track of the location of data among the different processors.

(Plot for figure 3.6: "Performance Utilization with Scaling for Bench_BSSN_PUGH_80l"; curves for Bassi (Power5), Loni (Power5), Tungsten (Xeon) and Cobalt (SGI_Altix); Y-axis: ratio of the wall time on n processors to the wall time on one processor; X-axis: number of processors, log scale.)

Figure 3.6: Scaling results for average utilization of performance for Bench BSSN PUGH 80l benchmark

The scaling results show that as the number of processors employed increases, the time taken for the simulation also increases; the reason is the increased communication time. Consider a problem that takes a certain amount of time on one processor and whose size is not scaled with the number of processors. In such a scenario, as the number of processors increases, the time spent by each processor solving the problem decreases, as the workload gets distributed. Hence the scaling results depend on the benchmark and on the way the code is written. From figure 4.1 it is clear that some of the benchmarks are computation intensive and some are communication intensive. As the computation per processor shrinks, those benchmarks which have a lot of communication show a degradation in performance, since the processors spend more time communicating even though the overall computation per processor has been reduced. On all five machines we can clearly see that as the number of processors increases there is a steady decline in the utilization of the peak performance of each processor. These percentages are obtained as follows:

25 1. The total wall time taken for simulation is known from the runs

2. The total number of floating point operations in each benchmark is already known

3. The average number of floating point operations performed per second by the benchmark is obtained by dividing the total number of floating point operations from item 2 by the total wall time from item 1. This number is the measured average performance of the simulation

4. The exact value of the peak performance that can be obtained from the processor is known from the architecture

5. Item 3 divided by item 4, with the ratio multiplied by 100, gives the percentage of peak performance per processor (see the formula sketched after this list)

6. The above result is for single processor performance; if the scaling results are considered, the procedure changes slightly

7. As the number of processors increases the problem size increases and the number of floating point operations performed also increases.
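Putting the steps above into a single expression (the formula simply restates the procedure; the numbers in the example are hypothetical, not measured values from the runs):

\[
  \text{\% of peak} \;=\; \frac{F_{\mathrm{total}} / T_{\mathrm{wall}}}{p \cdot R_{\mathrm{peak}}} \times 100,
\]

where \(F_{\mathrm{total}}\) is the total number of floating point operations in the benchmark (item 2), \(T_{\mathrm{wall}}\) the measured wall time (item 1), \(p\) the number of processors used, and \(R_{\mathrm{peak}}\) the per-processor peak performance (item 4). For example, if a hypothetical single-processor 80l run (about 512 GFlop) took 1200 seconds on a 7.6 GFlop/s Power5 processor, the utilization would be \((512/1200)/7.6 \times 100 \approx 5.6\%\) of peak.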

The scaling numbers also depend on the compiler settings used and on the nature of the MPI implementation. It was noted that on Nemeaux, an Apple G5 cluster located at LSU, OpenMP produced better simulation times than MPICH: the Cactus [6] benchmarks were run on 4 processors with OpenMP and with MPICH, and the results were quite conclusive, as OpenMP performed twice as fast as MPICH. As mentioned earlier, the scaling results are a function of the performance of the interconnect. As the number of processors increases, the interconnect becomes a bottleneck and the wall time increases. The role of the interconnect in scaling can be summarized as follows:

(Plot for figure 3.7: "Scaling Performance Utilization for Bench_BSSN_PUGH_100l"; curves for Bassi (Power5), Loni (Power5), Tungsten (Xeon) and Cobalt (SGI_Altix); Y-axis: ratio of the wall time on n processors to the wall time on one processor; X-axis: number of processors, log scale.)

Figure 3.7: Scaling results for average percentage utilization of peak performance for Bench BSSN PUGH 100l benchmark

Figure 3.8: Scaling results of benchmark runs on Tungsten (NCSA: Intel Xeon 3.2GHz)

• Bandwidth of the interconnect is a measure of the maximum amount of data that can be transferred over the interconnect per second. As the number of processors increases, the number of messages exchanged also increases, and hence the bandwidth requirement of the interconnect increases. So, it is good to have an interconnect with high bandwidth. There is a limit beyond which the bandwidth cannot be increased because of cost, which means there will always be a certain number of processors beyond which performance degrades.

• Latency of the interconnect is the time taken for data to travel from one processor to another. Even though it does not depend on the data size or the number of messages, it still has a great effect on the performance of the machine when scaling is considered. As the number of messages increases the per-message latency remains constant, but the total time it takes for a processor to receive all the information it requires increases, and this adds to the wall time of the simulation. Hence a desirable property of the interconnect is very low latency. (A simple first-order model combining latency and bandwidth is sketched after this list.)

• There are many types of interconnects, such as Myrinet, Infiniband, Gigabit Ethernet, QsNet, Quadrics, high speed crossbar switches and proprietary interconnects. The interconnect is chosen depending on the system budget and requirements: some are more scalable than others, and some are far more expensive than others. So, the type of interconnect used has a significant impact on the scaling performance of a benchmark on a machine.

• The topology of the interconnect is another factor which affects the scaling numbers. For example, a crossbar switch provides excellent connectivity, but it is very expensive and impractical as the number of processors increases. BlueGene uses a 3D torus topology for its interconnect, as it is known to scale extremely well.
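A standard first-order model ties the latency and bandwidth effects described above together; it is not fitted to the benchmark data, but it summarizes the behaviour at large processor counts:

\[
  T_{\mathrm{msg}} \;\approx\; \alpha + \frac{m}{\beta},
  \qquad
  T_{\mathrm{comm}} \;\approx\; N_{\mathrm{msg}}\,\alpha + \frac{V}{\beta},
\]

where \(\alpha\) is the latency, \(\beta\) the bandwidth, \(m\) the size of a single message, \(N_{\mathrm{msg}}\) the number of messages and \(V\) the total volume of data exchanged. As the number of processors grows, \(N_{\mathrm{msg}}\) grows while the individual messages shrink, so the \(\alpha\) term can come to dominate the communication time, which is why very low latency is listed above as a desirable property of the interconnect.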

Chapter 4

Phase 2: Profiling Bench BSSN PUGH

4.1 Reason for Profiling

During the analysis of the performance results of the NSF benchmarks used for the BigIron proposal, it was found that the Cactus [6] benchmarks had a larger number of load and store operations than the other benchmarks. In order to understand the reason for the large number of load/store operations, it is essential to identify those parts of the benchmark which are responsible for them. According to the 80-20 rule, 80% of the time is spent executing 20% of the code and 20% of the time is spent executing the other 80% of the code in a given program. This is a good indication that most of the program execution takes place in some critical portion of the code. In general this will be a nested loop which runs for a very large count and involves a lot of computation; it could also be the setup phase of the program. In the results we obtained from profiling, it was clear that almost a third of the total execution time is spent in the initialization phase. So, to improve the performance of the benchmark, the setup phase could be broken down into smaller steps. This was one area where the benchmark could be tuned for performance.

4.2 Tools used

A number of profiling tools were used to gather more information about the benchmark. Some tools gave a detailed timing analysis, other tools gave the cache misses at L1, L2 and L3, which directly translate into load/store operations, and some tools also checked for memory leaks in the program. Choosing the right tool for this purpose was tricky, and I had to use a set of tools to get a comprehensive report on the performance of the benchmark. The following tools were available to us:

• PAPI [1]: A PAPIClocks Thorn was already implemented in Cactus [6]

• Valgrind [4]: It contains tools such as cachegrind, memcheck and massif (a heap profiler)

• Valkyrie [10]: A GUI tool for Valgrind [4]

• gprof [8]: A GNU profiling tool

• Intel VTune Performance Analyzer [9]: Only works with Intel architectures

• Oprofile [3]: Works only on Linux kernels 2.2 and above

All the profiling tools mentioned above, except the Intel VTune Performance Analyzer [9], are open source and hence freely available. The choice of a particular tool depended entirely on its features, as some tools had specific features which made them more useful than others. Moreover, there are many characteristics of a performance tool that affect its usefulness for profiling an application. Some profiling tools are machine dependent, while others are machine independent. However, there is a price to be paid for being independent of the operating system and architecture: a tool which is independent does not give architecture specific information and hence can only be used for timing analysis. On the other hand, some profilers take a lot of time and slow down the simulation, as they are

required to count certain machine dependent events. In the sections that follow we explain the features of each tool and the justification for using it.

4.2.1 PAPI: Performance Application Programming Interface

Most profiling tools are plagued by their dependence on the operating system and the machine architecture and hence are not well suited to profiling portable applications. PAPI [1] has the desirable feature of being independent of operating system and architecture. It gives the programmer an API that provides access to a fixed set of hardware counters present on all architectures. Since one of the aims of this project is to profile the Cactus [6] benchmark Bench BSSN PUGH, we used a thorn called PAPIClocks, created by Thomas Schweizer from the AEI. This thorn can be compiled with Cactus, and it gives the programmer access to the hardware counters provided by PAPI. The parameter file which is used as input to a Cactus executable can specify the counters that have to be monitored. This means the user who is profiling the benchmark has to decide which counters are needed, which is an architecture dependent choice; for example, if the simulation is being run on a machine with an L3 cache, the counter which counts L3 cache misses can be turned on. The results obtained from the PAPI runs are not exhaustive, for the following reasons:

• PAPI [1] does not itself highlight the portions of code which are responsible for the cache misses. It gives a total number for the counter, which can be used to get a rough idea of the total number of cache misses.

• It gives the total number of instructions that were executed by the program, but not the portions of code which are responsible for the count.

Hence, PAPI [1] was used to collect information about the total cache misses and the number of instructions, and further analysis required other profiling tools.
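As an illustration of the kind of low-level calls a thorn such as PAPIClocks wraps, the sketch below uses the standard PAPI event-set API to count L2 data cache misses and floating point operations around a dummy loop. It is a minimal standalone example, not code from the thorn, and the two preset events are assumptions: whether PAPI_L2_DCM and PAPI_FP_OPS are available depends on the platform.

    #include <stdio.h>
    #include <stdlib.h>
    #include <papi.h>

    int main(void)
    {
        int events[2] = { PAPI_L2_DCM, PAPI_FP_OPS };  /* L2 data misses, FP operations */
        long long values[2];
        int eventset = PAPI_NULL;
        double sum = 0.0;
        long i;

        /* Initialize the library and build an event set with the two counters. */
        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            exit(1);
        if (PAPI_create_eventset(&eventset) != PAPI_OK ||
            PAPI_add_events(eventset, events, 2) != PAPI_OK)
            exit(1);

        PAPI_start(eventset);
        for (i = 1; i <= 10000000L; i++)    /* stand-in for the evolution loop */
            sum += 1.0 / (double)i;
        PAPI_stop(eventset, values);

        printf("checksum        : %f\n", sum);
        printf("L2 data misses  : %lld\n", values[0]);
        printf("FP operations   : %lld\n", values[1]);
        return 0;
    }

As the bullets above note, totals reported this way say nothing about which functions caused the events; that is what the tools in the following sections were used for.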

(Plot for figure 4.1: fraction of operations, from 0.0 to 1.0, for the NSF benchmarks WRF, OOCORE, MILC, PARATEC, HOMME, BSSN_PUGH, Whisky_Carpet, ADCIRC and PETSc_FUN3D.)

Figure 4.1: Comparison of different benchmarks. Legend: blue = flops, maroon = load/store operations, yellow = other operations.

4.2.2 Valgrind: A Linux based profiling tool

Valgrind is a good performance profiling package, as it contains a set of tools which are very useful for getting precise information about the benchmark. It can be used to get the cache misses caused by each function in the code and the memory leaks, if any, and it can also do heap profiling to get more details about the memory management of the code under test. The following tools are available to the programmer:

• memcheck: A tool which detects memory leaks in the program. This tool was used to see if there were any memory leaks in the Cactus [6] Benchmark.

• cachegrind: An excellent tool used to find all the cache misses due to individual sections of the code. This tool was extensively used for this project. It was used with different problem sizes on different machines.

• addrcheck: Used to find memory leaks at a superficial level. Not used for this project.

• massif: Used for profiling the heap. Not used for this project.

• lackey: Used to create more tools for this software. Not used for this project.

The major disadvantage of Valgrind [4] is that it slows the application to a crawl, as a lot of information is being collected in the background; simulations are known to take more than 10 times the normal time. The Cactus [6] benchmark Bench BSSN PUGH is expected to run for about 20-30 minutes on a normal system using a single processor. When Valgrind was turned on to check for cache misses, the same benchmark ran for about 24 hours. This is quite normal given the design and limitations of the profiling tool; however, the information provided by this tool has been quite useful. Moreover, the output produced by Valgrind needs to be parsed, and some sort of script should be used to extract the relevant information. I had to write a Perl script which parses

the output file and reports all the functions which contribute more than 5% of the cache misses. This threshold can be changed, so that the programmer has control over the amount of information being reported. For example, if the 5% threshold produces very little output, the number can be reduced in order to find more of the functions responsible for cache misses; on the other hand, if too many functions are being reported, the threshold can be increased to find only the most significant functions.
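The Perl script itself is not reproduced here; the following C sketch only illustrates the filtering idea it implements, under the assumption (hypothetical) that the per-function miss counts have already been extracted from the cachegrind output into simple "count name" pairs.

    #include <stdio.h>

    #define MAX_FUNCS 4096
    #define THRESHOLD 5.0   /* report functions above this percentage of all misses */

    int main(void)
    {
        static char   name[MAX_FUNCS][256];
        static double miss[MAX_FUNCS];
        double total = 0.0;
        int n = 0, i;

        /* Read "miss-count function-name" pairs from standard input. */
        while (n < MAX_FUNCS && scanf("%lf %255s", &miss[n], name[n]) == 2)
            total += miss[n++];

        if (total <= 0.0)
            return 0;

        /* Print every function responsible for more than THRESHOLD percent. */
        for (i = 0; i < n; i++) {
            double pct = 100.0 * miss[i] / total;
            if (pct > THRESHOLD)
                printf("%6.2f%%  %s\n", pct, name[i]);
        }
        return 0;
    }

Raising or lowering THRESHOLD has the same effect described above: a smaller value reports more functions, a larger one keeps only the most significant ones.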

4.2.3 Valkyrie: A GUI based tool for Valgrind

Valkyrie is a GUI based front end for Valgrind [4], and all the advantages and disadvantages of Valgrind are evident in this tool as well. It was not used for the project, as the information retrieved would have been redundant.

4.2.4 gprof: GNU Tool for profiling

gprof [8] is a GNU tool for profiling the performance of an application that has been compiled with a GNU compiler. It depends on gcc, gfortran and g77 for generating output. It can also generate call graph information. In our case, the Cactus [6] benchmark has a lot of code which is automatically generated, and the call graph information would be cluttered with the automatically generated binding functions; this information is not really useful. Hence there were only a few runs with gprof. The major drawback of this tool is its dependence on the presence of the GNU compilers on the system; it is compiler specific and not portable to all systems. There could be ways to make it work on other platforms like AIX, but as we got good results from cachegrind, we decided against using gprof extensively.

4.2.5 VTune: Intel Performance Analyzer

VTune is commercial software capable of retrieving comprehensive information from benchmark runs. It has several excellent features that can be used to obtain an exhaustive set of results for the performance analysis of an application. This software was installed on Supermike and is working there. It was not used for this project, as the main objective of identifying trouble spots had already been achieved, so there was no need to use it for further analysis. In the future, this software could be used to get more information about the benchmark once optimization of the code starts. I may work with this software after the project to get more information about the Cactus [6] benchmarks.

4.2.6 Oprofile: Linux based Profiling tool

The major drawback of Valgrind [4] was that it became a huge bottleneck because of the time it takes to finish the simulation; several test runs had to be set up with smaller problem sizes just to see whether the profiler was producing useful information. If Oprofile [3] is installed on the system, on the other hand, it provides similar information very quickly, as it works with the kernel, and it does not hold up the system the way Valgrind does. Oprofile can be used to profile the entire system, including the kernel. It is a very powerful tool, as it has very low overhead and can also profile the interrupt handlers. To use this tool the user needs to specify a set of counters to be turned on. There is a wide variety of counters, ranging from ones that count branch prediction errors to ones that count the number of times the FSB has been accessed. So, the information generated by this tool is very specific and useful. The following are the major features of Oprofile:

• Generates Call graph information

• Captures the performance behavior of the entire system including the kernel

• Very non-intrusive and hence provides faster results

• Provides source annotation and instruction level profiles

There is a GUI tool bundled with the latest version of this software; it can be used to set the counters and to select the specific events that are monitored. The tool works by issuing an interrupt whenever a counter crosses a minimum count. A very low value for the count will produce more samples, but may crash the system because the number of interrupts becomes very high. Hence a good balance must be maintained when choosing the counters and their minimum count values.

4.3 Results and Analysis

The results from the profiling runs are analyzed in this section. First the results from gprof [8] are presented, then the results from the Valgrind [4] runs, and finally the results from the Oprofile [3] runs.

4.3.1 Gprof Results:

Gprof produces two kinds of results: a flat profile and call graph information about the application being profiled. The Cactus [6] benchmark Bench BSSN PUGH was compiled for gprof [8], and the results of the runs with the two parameter files Bench BSSN PUGH 80l and Bench BSSN PUGH 100l are presented in this section. The flat profile shows the time spent in different sections of the code, which helps to isolate those portions of the code responsible for most of the computation. Like many other applications, this benchmark follows the 80-20 rule: it spends more than 80% of its time in less than 20% of the code. This is confirmed by the gprof [8] results for both parameter files.
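
As a quick way to check this claim, the %time column of a flat profile can simply be accumulated until 80% of the run time is covered. The sketch below assumes the flat profile has been saved to a plain-text file (the name flat_profile.txt is made up here) with %time as the first column, as in the tables that follow; it is only meant to show the arithmetic behind the 80-20 observation.

# Sketch: count how many flat-profile entries cover 80% of the run time.
rows = []
for line in open("flat_profile.txt"):         # hypothetical dump of the flat profile
    parts = line.split()
    try:
        pct = float(parts[0])                 # first column is %time
    except (IndexError, ValueError):
        continue                              # skip header and blank lines
    rows.append((pct, parts[-1]))             # (%time, function name)

rows.sort(reverse=True)                       # hottest functions first
covered, needed = 0.0, 0
for pct, name in rows:
    covered += pct
    needed += 1
    if covered >= 80.0:
        break

print("%d of %d functions account for %.1f%% of the time" % (needed, len(rows), covered))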

The flat profile consists of the following information:

• Percentage of time spent in the function

• Actual time spent in the function

• Number of calls made to the function, self and total

• Name of the function

On the other hand the call graph information contains the following:

• Number of times each profiled function is called

• Reports the number of recursive calls by each function

• Also gives the total time spent by the children of this function. By children, we mean all the functions which are called by this function.

• Name of the function

Sometimes the parents of a function cannot be determined; in such cases the gprof [8] call graph reports "spontaneous" as the name of the parent of that function. The call graph information also reports the presence of any cycle, together with all the functions in the cycle. In the flat profiles below, each sample counts as 0.01 seconds.

Table 4.2 shows the flat profile from the Bench BSSN PUGH run with the Bench BSSN PUGH 100l.par parameter file. The flat profile produces similar results for every run, provided the same parameter file is used, because the time spent in each function is determined by the code, which does not change between runs. The call graph shows similarly reproducible results.

%time   self(Sec)   calls   self(Ks/call)   total(Ks/call)   Name
87.19   9530.79        1    9.53            9.53             adm_bssn_sources
 4.05    442.70        1    0.44            0.44             newrad
 2.14    234.03        -    -               -                MoL_RK3Add
 1.84    201.59      240    0.00            0.00             adm_bssn_removetra
 1.29    141.29      242    0.00            0.00             adm_bssn_standardvariables
 1.07    116.86      240    0.00            0.00             MoL_InitialCopy
 1.02    111.11      242    0.00            0.00             adm_bssn_shiftsource
 0.63     68.39       81    0.00            0.00             MoL_SetupIndexArrays

Table 4.1: gprof results

%time   self(Sec)   calls   self(Ks/call)   total(Ks/call)   Name
89.32  10223.07      120    0.09            0.09             adm_bssn_sources
 2.92    334.10     2640    0.00            0.00             newrad
 2.02    230.68      120    0.00            0.00             MoL_RK3Add
 1.46    167.38      122    0.00            0.00             adm_bssn_removetra
 1.16    132.33      122    0.00            0.00             adm_bssn_standardvariables
 1.00    114.86      120    0.00            0.00             MoL_InitRHS
 0.84     95.71      120    0.00            0.00             adm_bssn_shiftsource
 0.80     91.16       40    0.00            0.00             MoL_InitialCopy
 0.19     22.11      120    0.00            0.00             adm_bssn_lapsesource
 0.12     13.37        1    0.01            0.01             adm_bssn_init

Table 4.2: Flat profile results from gprof

4.3.2 Valgrind Results

Valgrind [4] offers a range of tools, of which the most useful here are memcheck and cachegrind. Memcheck finds memory leaks in a program, and cachegrind counts the cache misses incurred while running it. The output generated by Valgrind is very detailed, since it reports the number of cache misses on a function by function basis. The main drawback of the tool is that it slows the simulation to a crawl because of the amount of profiling information it collects; Oprofile was found to produce similar output without slowing down the simulation. The output of the Perl script is presented here. It takes the cachegrind output file as input and parses it for the functions responsible for the largest numbers of cache misses.

Total_Instr = 1134560730685
L1_Instr_Rd_Misses = 407583356
L2_Instr_Rd_Misses = 4056114
Dt_Rds = 469688730466
L1_Dt_Rd_Misses = 67435193050
L2_Dt_Rd_Misses = 4664152671
Dt_Wrts = 111410992236
L1_Dt_Wrt_Misses = 5845650768
L2_Dt_Wrt_Misses = 1273853924

Total_Instr L1_Instr_Rd_Misses L2_Instr_Rd_Misses Dt_Rds L1_Dt_Rd_Misses L2_Dt_Rd_Misses Dt_Wrts L1_Dt_Wrt_Misses L2_Dt_Wrt_Misses

fn=MoL_RK3Add
Percentage of Total Data Read Misses = 21.9557401361916
Level 2 Data Read Misses = 1024049240
0 35840223280 6240 6240 11264066560 1028103400 1024049240 3072044080 6480 6240

fn=adm_bssn_sources_
Percentage of Total Data Read Misses = 49.5664023687358
Level 2 Data Read Misses = 2311852680
0 966587729040 405394641 3398721 418926496320 61895223840 2311852680 92573709840 3480932640 288970800

fn=newrad_
Percentage of Total Data Read Misses = 10.2513436786255
Level 2 Data Read Misses = 478138320
0 27420523680 163440 163200 8314996800 597969840 478138320 246818880 101271720 82897440
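
The miss rates in Table 4.3 are obtained from counters of exactly this kind: a miss rate is simply misses divided by references. The short sketch below applies that arithmetic to the overall totals printed above (note that these totals come from a different, larger run than the two summarized in Table 4.3, so the resulting percentages are not expected to match that table).

# Sketch: derive data cache miss rates from the cachegrind totals listed above.
totals = {
    "Dt_Rds":           469688730466,
    "L1_Dt_Rd_Misses":   67435193050,
    "L2_Dt_Rd_Misses":    4664152671,
    "Dt_Wrts":          111410992236,
    "L1_Dt_Wrt_Misses":   5845650768,
    "L2_Dt_Wrt_Misses":   1273853924,
}

def rate(misses, refs):
    return 100.0 * misses / refs

print("L1 data read miss rate  : %5.2f%%" % rate(totals["L1_Dt_Rd_Misses"], totals["Dt_Rds"]))
print("L1 data write miss rate : %5.2f%%" % rate(totals["L1_Dt_Wrt_Misses"], totals["Dt_Wrts"]))
print("L2 data read miss rate  : %5.2f%%" % rate(totals["L2_Dt_Rd_Misses"], totals["Dt_Rds"]))
print("Overall L1 data miss rate: %5.2f%%" % rate(
    totals["L1_Dt_Rd_Misses"] + totals["L1_Dt_Wrt_Misses"],
    totals["Dt_Rds"] + totals["Dt_Wrts"]))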

4.3.3 Oprofile Results

Unlike Valgrind, Oprofile [3] does not slow down the simulation, since it is built into the kernel. It can also profile the kernel itself, and there is a flag that restricts profiling to a specific application. Only one counter is turned on at a given time with this profiler: these are hardware counters, Oprofile [3] is known to perform better with one counter at a time, and in fact some combinations of counters are not possible. A mask can be set to collect specific events from a counter. The most important feature of Oprofile [3] is that it can collect information about the external libraries used by the application during execution, so we get a complete picture of the entire trace and the call graph information. The depth of the call graph can be specified in the configuration file. Oprofile [3] returns the number of samples collected for a given function: the more samples collected, the more time was spent in that function.

Event                     Run1        Run2
References                78.03e9     350.8e9
L1 Instr Misses           7.83e9      36.92e9
L2 Instr Misses           7142        116,703
L1 Instr Miss Rate(%)     10.3        10.52
L2 Instr Miss Rate(%)     0           0
Data References           57.106e9    259.33e9
Data Read Refs            46.89e9     213.17e9
Data Write Refs           10.215e9    46.16e9
L1 Data Misses            5.855e9     28.06e9
L1 Data Read Misses       5.088e9     24.27e9
L1 Data Write Misses      0.767e9     3.78e9
L2 Data Misses            859,435     801,459,700
L2 Data Read Misses       1829        475,356,853
L2 Data Write Misses      857,606     326,102,847
L1 Data Miss Rate(%)      10.2        10.8
L2 Data Miss Rate(%)      0           0
L2 References             13.69e9     64.99e9
L2 Read Refs              12.922e9    61.2e9
L2 Write Refs             0.767e9     3.78e9
L2 Misses                 866,577     801,576,403
L2 Read Misses            8,971       475,473,556
L2 Write Misses           857,606
L2 Miss Rate              0           0.1

Table 4.3: valgrind results using cachegrind on Bench BSSN PUGH 100l

samples   Percentage   symbol name
223998    94.6481      adm_bssn_sources
3496      1.4772       adm_bssn_standardvariables
2876      1.2152       adm_bssn_removetra
2380      1.0056       newrad
1388      0.5865       MoL_RK3Add
852       0.3600       adm_bssn_init
573       0.2421       adm_bssn_shiftsource
380       0.1606       MoL_InitRHS
378       0.1597       MoL_InitialCopy

Table 4.4: Oprofile results

The results shown here were obtained by turning on the BSQ_CACHE_REFERENCE counter and running the profiler. There are different ways to parse the output from the profiler: using opreport we can extract the call graph information, the annotated source code and other useful results. A sample of the output from the run is shown in Table 4.4.

The following output is from the call graph information of the same oprofile run:

CPU: P4 / Xeon with 2 hyper-threads, speed 3391.89 MHz (estimated)
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit)
with a unit mask of 0x73f (multiple flags) count 100000

Samples on CPU 0      Samples on CPU 1
samples    %          samples    %         symbol name
-------------------------------------------------------------------------------
1099967    100.000    1141195    100.000   CCTKi_BindingsFortranWrapperBSSN_MoL
1043897    94.8471    1080415    94.7265   adm_bssn_sources_
1043897    99.0934    1080415    98.8971   adm_bssn_sources_ [self]
8684       0.8243     10912      0.9988    adm_bssn_shiftsource_
856        0.0813     1130       0.1034    adm_bssn_lapsesource_
10         9.5e-04    5          4.6e-04   .plt
1          9.5e-05    1          9.2e-05   SymBase_SymmetryTableHandleForGrid
0          0          1          9.2e-05   util_tablegetintarray_
-------------------------------------------------------------------------------
21087      100.000    22391      100.000   CCTKi_BindingsFortranWrapperBSSN_MoL
16507      1.4998     17478      1.5324    adm_bssn_removetra_
16507      100.000    17478      100.000   adm_bssn_removetra_ [self]
-------------------------------------------------------------------------------
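
Since the counter was set up with count 100000 (visible in the header above), the profiler records one sample per roughly 100,000 BSQ_CACHE_REFERENCE events, so a rough estimate of the number of cache references attributable to a function is samples × count. The sketch below applies this to a few of the sample counts from Table 4.4; it is only an illustration of how the sample counts can be read, not part of the tooling used for this report.

# Sketch: convert Oprofile sample counts into approximate event counts.
# With "count 100000", each sample stands for roughly 100,000 counted events.
COUNT = 100000

samples = {                              # a few sample counts taken from Table 4.4
    "adm_bssn_sources":          223998,
    "adm_bssn_standardvariables":  3496,
    "newrad":                      2380,
}

for name, n in sorted(samples.items(), key=lambda kv: -kv[1]):
    print("%-28s %8d samples  ~%.2e BSQ_CACHE_REFERENCE events" % (name, n, n * COUNT))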

Chapter 5

Conclusion

The aim of the first phase of this project was to study the performance of the Cactus benchmarks on different machines and architectures and to understand the results. After obtaining the results from the different runs, the second phase of the project required profiling the benchmark code and understanding the reasons for its performance. From the results of the first phase, it can be concluded that the Bench BSSN PUGH benchmark performs well on the IBM Power5 architecture for the following reasons:

• Power5 has four FPUs instead of the two FPUs in the Intel architectures

• Power5 fetches 8 instructions per machine cycle, whereas the Intel architectures fetch only 6 instructions per cycle

• Power5 has a large L3 cache

• Bench BSSN PUGH performs better than Bench Whisky Carpet

The trends in processor architecture suggest an emphasis on more cache, thread-level parallelism and lower power consumption. The next generation of processors will therefore have larger caches, as we have already seen in the case of Power5, which has more cache than Power4. Thread-level parallelism makes the operating system believe that there is more than one processor per core by executing multiple threads on each core. The EPIC design of the Itanium processor has certain drawbacks which make it unsuitable for running applications that inherently contain a large number of load/store operations. Its performance is also severely limited by the use of instruction bundles, which results in bloated assembly code with a lot of NOPs, and there is no provision for out-of-order execution, which is another deterrent to the performance of these applications on the Itanium architecture.

For the second phase of the project a machine-independent analysis was needed to understand the performance of the benchmark code. The overall objective of the project was to look into the performance of the Cactus benchmarks and to try to improve Bench BSSN PUGH by reducing the cache misses, and thereby the number of load/store operations. The profiling results show that the benchmark spends most of its time in a specific set of functions, and that these same functions are responsible for the cache misses during the run. Hence the second objective of identifying the trouble areas of the code has been achieved: we have conclusive results showing exactly where in the code most of the computation time is spent.

Bibliography

[1] http://icl.cs.utk.edu/papi/.

[2] http://ipm-hpc.sourceforge.net/.

[3] http://oprofile.sourceforge.net.

[4] http://valgrind.org/.

[5] http://www.anandtech.com/printarticle.aspx?i=2598.

[6] http://www.cactuscode.org.

[7] http://www.cct.lsu.edu/~smadiraju.

[8] http://www.cs.utah.edu/dept/old/texinfo/as/gprof.html.

[9] http://www.intel.com/cd/software/products/asmo-na/eng/vtune/index.htm.

[10] http://www.open-works.co.uk/projects/valkyrie.html.
