Hardware-Based Performance Monitoring with Vtune Performance Analyzer Under Linux

Hardware-based performance monitoring with VTune Performance Analyzer under Linux Hassan Shojania [email protected] Abstract are excellent sources for more information about perfor- All new modern processors have hardware support for mance monitoring hardware and features of Pentium 4 in monitoring processor performance. In this project, we this area. Of course, [3] provides raw register-level infor- try to explore use of VTune Performance Analyzer for mation for programming Pentium 4 performance monitor- hardware-based performance monitoring of a Linux clus- ing hardware. ter of Pentium 4 Xeon processors. 2 Basics of performance monitoring hard- 1 Introduction ware Performance measurement of any high-performance There can be different approaches for collecting proces- cluster system is very critical for development and de- sor performance data. ployment of efficient applications for such system. All new modern processors have special hardware to monitor 1. Modifying the application to add instrumentation processor performance. This hardware-based performance code for collecting various data like instruction trace measurement has many advantageous over traditional in- and memory reference data. This requires either re- trusive methods of performance measurements based on building it from source code or modifying its exe- adding code for probing execution time of portion of a cutable version; both not favorable usually (especially program. For example, data collected by this hardware for operating system code). Also, these approaches provides performance information on applications, the op- can disturb the applications behavior, bringing ques- erating system, and the processor. These data can guide tions about validity of the collected data. performance improvement efforts by helping programmers 2. Another way to collect processor performance data is tuning the algorithms used, and the code sequences that by using a simulator to model the processor as it ex- implement those algorithms [1]. ecutes the application. This simulation approach can In this project, we intend to explore VTune Performance yield a detailed data on processor blocks like pipeline Analyzer set from Intel for performance measurement of stalls, branch prediction, cache performance, and so standard MPI/OpenMP-based benchmarks under our clus- on. However, processor manufacturers do not usu- ter of Dell PowerEdge 2650/6650 (2/4 way SMP systems ally provide simulators for advanced processor de- with Xeon processors) running Linux. As the first step to- signs and third parties don’t know enough about the wards VTune, we are targeting mainly the proper system hardware detail to build such a simulator. setup/configuration with limited trials of few benchmarks. This should provide enough background for more in depth 3. Using performance-monitoring hardware has several analysis of other benchmarks. distinct advantages over previous approaches. Hav- This report is organized as follows: First we overview ing the processor itself actually collect performance the basics of performance monitoring hardware in Section data as it executes an application have several bene- 2. Features of Pentium 4 performance monitoring hard- fits. First, the application and operating system re- ware is overviewed in Section 3. In Section 4 we describe main largely unmodified. Second, the accuracy of the performance data collection in VTune with the help of con- collected event counts is much higher compared to us- cepts presented in Sections 2 and 3. Section 5 explains dif- ing loose simulators which are not capable of simu- ferent deployment choices of VTune and installation steps. lating exact hardware behavior. Third, performance- Section 6 provides some sample test results. In Section 7 monitoring hardware collects data on the fly as the we present our conclusions. application executes, avoiding the slow simulation- Section 2 and 3 have heavily used [1] and [2]. These based approaches. Fourth, this approach can collect data for both the application and the operating system. can not improve the hardware/software performance. Two These advantages often make hardware performance common approaches are: monitoring the preferred, and sometimes only choice for collecting processor performance data. Time-based Sampling (TBS) approach tries to capture the percentage of time an application spends in its dif- Performance-monitoring hardware typically has two ferent sections through exposing the most frequently components: performance event detectors and event coun- executed portions of the code. It is usually imple- ters. Users can configure performance event detectors to mented through interrupting an application’s execu- detect anyone of several performance events (for example, tion at regular time intervals and recording the pro- cache misses or branch mispredictions). Often, event de- gram counter. At the end of the application, a his- tectors have an event mask field that allows further qualifi- togram will show the number of samples collected for cation of the event. For example, based on processors priv- each section of code. ilege mode (user/supervisor) to separate events generated Event-based Sampling(EBS) produces a histogram of by application from operating system code, or for filtering performance event counts based on code location. accesses to the L2 cache based on cache line’s specific state Now the application is interrupted after a specific (i.e. modified, shared, exclusive, or invalid). number of performance events (a counter reaching Further configuration is usually possible through en- some threshold) rather than at regular time intervals abling event counters only under certain edge and threshold in TBS. Similar to a time-based profile indication of conditions. The edge detection feature is most often used most frequently executed code locations, an event- for events that detect the presence or absence of certain based profile indicates the most frequently executed conditions every cycle, like a pipeline stall. The thresh- code locations that cause a particular performance old feature lets the event counter compare the value it re- event. This requires performance-monitoring hard- ports each cycle to a threshold value and then increment ware support for generating performance monitor in- the counter. The threshold feature is only useful for per- terrupt when a performance event counter overflows. formance events that report values greater than one in each Then the Interrupt Service Routine (ISR) handler cap- cycle, for example for an ”Instructions Completed” event, tures sample data from the program (e.g. program number of cycles when three or more instructions were counter) and re-enables the interrupt for another in- completed (in one cycle) can be counted by using a thresh- terrupt. old of two. 2.1 Performance event monitoring 2.3 Limitations Mainstream processors suffer from a common set of Performance events can be grouped into five cat- problems: egories: program characterization, memory accesses, pipeline stalls, branch prediction, and resource utilization. Too few counters. Limited number of counters restricts Program characterization events show largely processor- monitoring multiple events concurrently. independent attributes of a program like number and type of instructions (for example, loads, stores, floating point, Speculative counts. Some processors can’t distinguish branches, and so on) completed by the program. Mem- between performance events for instructions that do ory access events aid performance analysis of the proces- not complete (speculative execution) from the one sor’s memory hierarchy, like references and misses to vari- who really complete. This is important because per- ous caches and transactions on the processor memory bus. formance events are still generated (like cache misses) Pipeline stall event information helps users analyze how even for speculatively executed instructions that never well the program’s instructions flow through the pipeline. retire. Branch prediction events show performance of branch pre- Sampling delay. The sampling of program counter when diction hardware. Resource utilization events can monitor a hardware event counter overflows (commonly used the usage of certain resources like number of cycles spent with EBS) can not identify the exact instruction that using a floating-point divider. caused the overflow. This degree of accuracy is not 2.2 Performance Profiles required all the time (e.g. the associated module or Though hardware performance counters reveal many in- thread is of more interest) but there are cases when an formation about the software, but this information is very event frequency needs to be related to particular type low level and mainly expose global state of the proces- of instruction. The program counter is sampled by a sor not a particular application behavior. And as long as performance monitor interrupt (PMI) after overflow the source of monitored behavior is not detected, the user of an event counter. However, the current instruction 2 Figure 1: The general structure of the Pentium 4 event counnters and detectors) (from [2]) is usually not the instruction causing the overflow as processors pipeline cause an arbitrary delay between counter overflow and signaling of an interrupt. Lack of data-address profiling. Usually there is no support for collecting profiles of data addresses generated by CPU when a memory hierarchy performance event happens. 3 Pentium 4 Performance-monitoring Fea- tures Intel Pentium 4 has tried

Load more