
Performance Evaluation: Techniques, Tools and Benchmarks

Author Info:

Prof. Lizy Kurian John
Electrical and Computer Engineering Department, ENS 143
The University of Texas at Austin, Austin, TX 78712
[email protected]
Phone: 512 232 1455   Fax: 512 471 2893
http://www.ece.utexas.edu/~ljohn

Contents:

1. Introduction
   Classification of Performance Evaluation Techniques
2. Performance Measurement Techniques
   a. On-chip Performance Monitoring Counters
   b. Off-chip Hardware Measurement
   c. Software Monitoring
   d. Microcoded Instrumentation
3. Performance Modeling Techniques
   a. Simulation
      i. Trace Driven Simulation
      ii. Execution Driven Simulation
      iii. Complete System Simulation
      iv. Stochastic Discrete Event Driven Simulation
      v. Software Profiling Tools
   b. Analytical Modeling

4. Workloads and Benchmarks
   CPU, embedded, Java, transaction processing, web server, e-commerce, mail server, file server, and personal computer benchmarks

1. Introduction

State-of-the-art high performance microprocessors contain tens of millions of transistors and operate at frequencies close to 2 GHz. These processors perform several tasks in overlap, employ significant amounts of speculation, out-of-order execution, and other microarchitectural techniques, and are true marvels of engineering. Designing and evaluating these microprocessors is a major challenge, especially considering the fact that one second of program execution on these processors involves several billion instructions, and analyzing one second of execution may involve dealing with tens of billions of pieces of information.

In general, the design of microprocessors and computer systems involves several steps: (i) understanding the applications and workloads that the systems will be running, (ii) innovating potential designs, (iii) evaluating the performance of the candidate designs, and (iv) selecting the best design. The large number of potential designs and the constantly evolving nature of workloads have resulted in designs being largely ad hoc. In this article, we investigate the major techniques used in the performance evaluation process.

It should be noted that performance evaluation is needed at several stages of the design. In early stages, when the design is being conceived, performance evaluation is used to make early design tradeoffs. Usually, this is accomplished by simulation models, because building prototypes of state-of-the-art microprocessors is expensive and time consuming. Several design decisions are made before any prototyping is done. Once the design is finalized and is being implemented, simulation is used to evaluate the functionality and performance of subsystems. Later, performance measurement is done after the product is available, in order to understand the performance of the actual system under various real-world workloads and to identify modifications to incorporate in future designs.

Performance evaluation can be classified into performance modeling and performance measurement, as illustrated in Table 1. Performance measurement is possible only if the system of interest is available for measurement and only if one has access to the parameters of interest. Performance measurement may further be classified into on-chip hardware monitoring, off-chip hardware monitoring, software monitoring and microcoded instrumentation. Performance modeling is typically used when actual systems are not available for measurement or if the actual systems do not have test points to measure every detail of interest. Performance modeling may further be classified into simulation modeling and analytical modeling. Simulation models may further be classified into numerous categories depending on the mode/level of detail of simulation. Analytical models use probabilistic models, queueing theory, Markov models or Petri nets.

Table 1. A Classification of Performance Evaluation Techniques

Performance Measurement
    Microprocessor On-chip Performance Monitoring Counters
    Off-chip Hardware Monitoring
    Software Monitoring
    Micro-coded Instrumentation
Performance Modeling
    Simulation
        Trace Driven Simulation
        Execution Driven Simulation
        Complete System Simulation
        Event Driven Simulation
        Software Profiling
    Analytical Modeling
        Probabilistic Models
        Queuing Models
        Markov Models
        Petri Net Models

There are several desirable features that performance modeling/measurement techniques and tools should possess.
• They must be accurate. It is easy to build models that are heavily sanitized; however, such models will not be accurate.

• They must be non-invasive. The measurement process must not alter the system or degrade the system's performance.
• They must not be expensive. Building the performance measurement facility should not cost a significant amount of time or money.
• They must be easy to change or extend. Microprocessors and computer systems constantly undergo changes, and it must be easy to extend the modeling/measurement facility to include the upgraded system.
• They must not need the source code of applications. If tools and techniques necessitate source code, it will not be possible to evaluate commercial applications for which source is often not available.
• They should measure all activity, including kernel and user activity. Often it is easy to build tools that measure only user activity. This was acceptable in traditional scientific and engineering workloads; however, in database, web server, and Java workloads, there is significant operating system activity, and it is important to build tools that measure operating system activity as well.
• They should be capable of measuring a wide variety of applications, including those that use signals, exceptions and DLLs (Dynamically Linked Libraries).
• They should be user-friendly. Hard-to-use tools often are under-utilized. Hard-to-use tools also result in more user error.
• They should be fast. If a performance model is very slow, long-running workloads which take hours to run may take days or weeks to run on the model. If an instrumentation tool is slow, it can be invasive.
• Models should provide control over the aspects that are measured. It should be possible to selectively measure what is required.
• Models and tools should handle multiprocessor systems and multithreaded applications. Dual and quad-processor systems are very common nowadays. Applications are becoming increasingly multithreaded, especially with the advent of Java, and it is important that the tool handles these.
• It is desirable for a performance evaluation technique to be able to evaluate the performance of systems that are not yet built.

Many of these requirements are often conflicting. For instance, it is difficult for a mechanism to be fast and accurate. Consider mathematical models. They are fast; however, several simplifying assumptions go into their creation and often they are not accurate. Similarly, it is difficult for a tool to be both non-invasive and user-friendly. Many users like graphical user interfaces (GUIs); however, most instrumentation and simulation tools with GUIs are slow and invasive.

Benchmarks and metrics to be used for performance evaluation have always been interesting and controversial issues. There has been a lot of improvement in benchmark suites since 1988. Before that, evaluation was done largely with small benchmarks such as kernels extracted from applications (eg: Lawrence Livermore Loops) and small programs such as Linpack, Sorting, Sieve of Eratosthenes, the 8-queens problem, Tower of Hanoi, etc. [1]. The Standard Performance Evaluation Cooperative (SPEC) consortium and the Transaction Processing Council (TPC), both formed in 1988, have made available several benchmark suites and benchmarking guidelines to improve the quality of benchmarking. Several state-of-the-art benchmark suites are described in section 4.

Another important issue in performance evaluation is the choice of performance metric. For a system level designer, execution time and throughput are two important performance metrics. Execution time is generally the most important measure of performance. Execution time is the product of the number of instructions executed, the average cycles per instruction (CPI), and the clock period. Throughput is a more important metric in server systems. In servers that serve the banking industry, airline industry, or other similar businesses, what is important is the number of transactions that can be completed in unit time. Such servers, typically called transaction processing systems, use transactions per minute (tpm) as a performance metric. MIPS (Millions of Instructions Per Second) and MFLOPS (Millions of Floating Point Operations Per Second) have been very popular measures of performance in the past. Both of these are very

simple and straightforward to understand and hence have been used often; however, they do not contain all three components of program execution time and hence are incomplete measures of performance. There are also several low level metrics of interest to designers, in order to help them identify performance bottlenecks and tune their designs. Cache hit ratios, branch misprediction ratios, number of off-chip memory accesses, etc. are examples of such measures.
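As a simple illustration of these definitions (this sketch is not from any particular processor; all numbers are made up), the following C fragment computes execution time from its three components and shows that MIPS depends only on two of them, which is why it is an incomplete metric:

    /* Illustrative sketch: execution time and MIPS from assumed, made-up numbers. */
    #include <stdio.h>

    int main(void)
    {
        double instructions = 2.0e9;   /* dynamic instruction count (assumed) */
        double cpi          = 1.2;     /* average cycles per instruction (assumed) */
        double clock_hz     = 1.0e9;   /* 1 GHz clock (assumed) */

        double exec_time = instructions * cpi / clock_hz;      /* seconds */
        double mips      = instructions / (exec_time * 1.0e6); /* = clock_hz / (cpi * 1e6) */

        printf("Execution time = %.3f s\n", exec_time);
        printf("MIPS           = %.1f\n", mips);
        /* MIPS reflects only clock rate and CPI, not how many instructions the
         * compiler generates for the task, so two machines with equal MIPS can
         * have very different execution times. */
        return 0;
    }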

Another major problem is the issue of reporting performance with a single number. A single number is easy to understand and easy for the trade press to use. The use of several benchmarks also makes it necessary to find some kind of a mean. The arithmetic mean, geometric mean and harmonic mean are three ways of finding the central tendency of a group of numbers; however, it should be noted that each of these means should be used under appropriate conditions, depending on the nature of the numbers to be averaged. A simple arithmetic mean can be used to find the average execution time from a set of execution times. The geometric mean can be used to find the central tendency of metrics that are in the form of ratios (eg: speedup), and the harmonic mean can be used to find the central tendency of measures that are in the form of a rate (eg: throughput). Cragon [2] and Smith [3] discuss the use of the appropriate mean for a given set of data. Cragon [2] and Patterson and Hennessy [4] illustrate several mistakes one could possibly make while finding a single performance number.
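The following C sketch (an illustration with made-up numbers, not data from any benchmark) applies each mean to the kind of quantity for which it is appropriate; compile with -lm:

    /* Arithmetic, geometric, and harmonic means of small, made-up data sets. */
    #include <stdio.h>
    #include <math.h>

    static double arithmetic_mean(const double *x, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += x[i];
        return s / n;
    }

    static double geometric_mean(const double *x, int n)
    {
        double s = 0.0;                       /* sum of logs avoids overflow */
        for (int i = 0; i < n; i++) s += log(x[i]);
        return exp(s / n);
    }

    static double harmonic_mean(const double *x, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += 1.0 / x[i];
        return n / s;
    }

    int main(void)
    {
        double times[]    = { 12.0, 30.0, 45.0 };   /* execution times, seconds (made up) */
        double speedups[] = { 1.2, 2.0, 3.5 };      /* speedup ratios (made up) */
        double rates[]    = { 80.0, 120.0, 200.0 }; /* throughputs, e.g. tpm (made up) */

        printf("Average time    (arithmetic): %.2f s\n", arithmetic_mean(times, 3));
        printf("Average speedup (geometric) : %.2f\n",   geometric_mean(speedups, 3));
        printf("Average rate    (harmonic)  : %.2f\n",   harmonic_mean(rates, 3));
        return 0;
    }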

The rest of this article is organized as follows. Section 2 describes performance measurement techniques including hardware on-chip performance monitoring counters on microprocessors. Section 3 describes simulation and analytical modeling of microprocessors and computer systems. Section 4 presents several state-of-the-art benchmark suites for a variety of workloads. Due to limitations of space in this article, we describe some typical examples of tools and techniques and provide the reader with pointers for more information.

2. Performance Measurement

Performance measurement is used for understanding systems that are already built or prototyped. There are two major purposes performance measurement can serve: (i) tune this system or systems to be built, and (ii) tune the application if the source code and algorithms can still be changed. Essentially, the process involves (i) understanding the bottlenecks in the system that has been built, (ii) understanding the applications that are running on the system and the match between the features of the system and the characteristics of the workload, and (iii) innovating design features that will exploit the workload features. Performance measurement can be done via the following means:
• Microprocessor on-chip performance monitoring counters
• Off-chip hardware monitoring
• Software monitoring
• Microcoded instrumentation

2.1 On-chip Performance Monitoring Counters

All state-of-the-art high performance microprocessors, including Intel's Pentium III and Pentium IV, IBM's POWER3 and POWER4 processors, AMD's Athlon, Compaq's Alpha, and Sun's UltraSPARC processors, incorporate on-chip performance monitoring counters which can be used to understand the performance of these microprocessors while they run complex, real-world workloads. This ability has overcome a serious limitation of simulators: that they often could not execute complex workloads. Now, complex run time systems involving multiple software applications can be evaluated and monitored very closely. All microprocessor vendors nowadays release information on their performance monitoring counters, although the counters are not part of the architecture.

For illustration of on-chip performance monitoring, we use the Intel Pentium processors. The microprocessors in the Intel Pentium family contain two performance monitoring counters. These counters can be read with special instructions (eg: RDPMC) on the processor. The counters can be made to measure user and kernel activity in combination or in isolation. A variety of performance events can be measured using the counters [50]. For illustration of the nature of the events that can be measured, Table 2 lists a small subset of the events that can be measured on the Pentium III. While more than 200 distinct events can be

measured on the Pentium III, only 2 events can be measured simultaneously. For design simplicity, most microprocessors limit the number of events that can be simultaneously measured to 4 or 5. At times, certain events are restricted to be accessible only through a particular counter. These steps are necessary to reduce the overhead associated with on-chip performance monitoring. Performance counters do consume on-chip real estate. Unless carefully implemented, they can also impact the processor cycle time.

Table 2. Examples of events that can be measured using performance monitoring counters on an Intel Pentium III processor

EVENT                   Description of Event                                          Event Number in Hex
DATA_MEM_REFS           All loads and stores from/to memory                           43H
DCU_LINES_IN            Total lines allocated in the data cache unit                  45H
IFU_IFETCH              Number of instruction fetches (cacheable and uncacheable)     80H
IFU_IFETCH_MISS         Number of instruction fetch misses                            81H
ITLB_MISS               Number of Instruction TLB misses                              85H
IFU_MEM_STALL           Number of cycles instruction fetch is stalled for any reason  86H
L2_IFETCH               Number of L2 instruction fetches                              28H
L2_LD                   Number of L2 data loads                                       29H
L2_ST                   Number of L2 data stores                                      2AH
L2_LINES_IN             Number of lines allocated in the L2                           24H
L2_RQSTS                Total number of L2 requests                                   2EH
INST_RETIRED            Number of instructions retired                                C0H
UOPS_RETIRED            Number of micro-operations retired                            C2H
INST_DECODED            Number of instructions decoded                                D0H
RESOURCE_STALLS         Number of cycles in which there is a resource related stall   A2H
MMX_INSTR_EXEC          Number of MMX instructions executed                           B0H
BR_INST_RETIRED         Number of branch instructions retired                         C4H
BR_MISS_PRED_RETIRED    Number of mispredicted branches retired                       C5H
BR_TAKEN_RETIRED        Number of taken branches retired                              C9H
BR_INST_DECODED         Number of branch instructions decoded                         E0H
BTB_MISSES              Number of branches for which the BTB did not predict          E2H
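As a rough illustration of how such counters are accessed (a minimal sketch, not a replacement for the tools listed in Table 3), the following C fragment reads a Pentium-class counter with the RDPMC instruction before and after a region of code. It assumes the counter has already been programmed to count the desired event, for example by the operating system or by one of the tools in Table 3, and that user-mode RDPMC access has been enabled; otherwise the instruction faults. It requires an x86 processor and GCC-style inline assembly.

    /* Minimal sketch of reading an already-configured performance counter. */
    #include <stdio.h>

    static inline unsigned long long rdpmc(unsigned int counter)
    {
        unsigned int lo, hi;
        __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
        return ((unsigned long long)hi << 32) | lo;
    }

    int main(void)
    {
        volatile double x = 0.0;
        unsigned long long before = rdpmc(0);   /* counter 0, assumed configured */
        for (int i = 0; i < 1000000; i++)       /* region of code being measured */
            x += i * 0.5;
        unsigned long long after = rdpmc(0);
        printf("events counted in region: %llu\n", after - before);
        return 0;
    }

The difference between the two reads gives the number of occurrences of the programmed event in the measured region, including any other activity the processor performed in that interval.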

There are several tools available to measure performance using performance monitoring counters. Table 3 lists some of the available tools. Intel's VTune software may be used to perform measurements using the Intel processor performance counters [5]. The P6Perf utility is a plug-in for Windows NT performance monitoring [6]. The Compaq DIGITAL Continuous Profiling Infrastructure (DCPI) is a very powerful tool to profile programs on the Alpha processors [7,8]. The performance monitor perf-mon is a small hack that uses the on-chip counters on UltraSPARC-I/II processors to gather statistics [9]. Packages like VTune perform extensive post-processing and present data in graphical forms. However, sometimes, extensive post-processing can result in tools that are somewhat invasive. PMON [10] is counter reading software written by Juan Rubio of the Laboratory for Computer Architecture at the University of Texas. It provides a mechanism to read specified counters with minimal or no perceivable overhead. All these tools measure both user and operating system activity. Since everything executing on the processor is counted, effort should be made to have minimal or no undesired processes running during experimentation. This type of performance measurement can be done on binaries, and no source code is needed.


Table 3. Software packages for performance counter measurement

Tool       Platform      Reference
VTune      IA-32         http://developer.intel.com/software/products/vtune/vtune_oview.htm
P6Perf     IA-32         http://developer.intel.com/vtune/p6perf/index.htm
PMON       IA-32         http://www.ece.utexas.edu/projects/ece/lca/pmon
DCPI       Alpha         http://www.research.digital.com/SRC/dcpi/
                         http://www.research.compaq.com/SRC/dcpi/
Perf-mon   UltraSPARC    http://www.sics.se/~mch/perf-monitor/index.html

2.2 Off-chip hardware measurement

Instrumentation using hardware means can also be done by attaching off-chip hardware, two examples of which are described in this section.

SpeedTracer from AMD: AMD developed this hardware tracing platform to aid in the design of its microprocessors. When an application is being traced, the tracer interrupts the processor on each instruction boundary. The state of the CPU is captured on each interrupt and then transferred to a separate control machine where the trace is stored. The trace contains virtually all valuable pieces of information for each instruction that executes on the processor. Operating system activity can also be traced. However, tracing in this manner can be invasive and may slow down the processor. Although the processor is running slower, external events such as disk and memory accesses still happen in real time, thus appearing very fast to the slowed-down processor. Usually this issue is addressed by adjusting the timer interrupt frequency. Use of this performance monitoring facility can be seen in Merten et al. [11] and Bhargava et al. [12].

Logic Analyzers: Poursepanj and Christie [13] use a Tektronix TLA 700 logic analyzer to analyze 3D graphics workloads on AMD-K6-2 based systems. Detailed logic analyzer traces are limited by restrictions on trace size and are typically used for the most important sections of the program under analysis. Preliminary coarse-level analysis can be done using performance monitoring counters and software instrumentation. Poursepanj and Christie used logic analyzer traces for a few tens of frames, which covered a second or two of smooth motion [13].

2.3 Software Monitoring

Software monitoring is often performed by utilizing architectural features such as a trap instruction or a breakpoint instruction on an actual system, or on a prototype. The VAX processor from Digital (now Compaq) had a T-bit that caused an exception after every instruction. Software monitoring used to be an important mode of performance evaluation before the advent of on-chip performance monitoring counters. The primary advantage of software monitoring is that it is easy to do. However, it has disadvantages: the instrumentation can slow down the application. The overhead of servicing the exception, switching to a data collection process, and performing the necessary tracing can slow down a program by more than 1000 times. Another disadvantage is that software monitoring systems typically handle only user activity.

2.4 Microcoded Instrumentation

Digital (now Compaq) used microcoded instrumentation to obtain traces of the VAX and Alpha architectures. The ATUM tool [14], used extensively by Digital in the late 1980s and early 1990s, uses microcoded instrumentation. This technique lies in between trapping information on each instruction using hardware interrupts (traps) and software trapping. The tracing system essentially modified the VAX microcode to record all instruction and data references in a reserved portion of memory. Unlike software monitoring, ATUM could trace all processes including the operating system. However, this kind of tracing is invasive, and can slow down the system by a factor of 10 without even including the time to write the trace to disk.

3. Performance Modeling

Performance measurement as described in the previous section can be done only if the actual system or a prototype exists. It is expensive to build prototypes for early stage evaluation. Hence one needs to resort to some kind of modeling in order to study systems yet to be built. Performance modeling can be done using simulation models or analytical models.

3.1 Simulation

Simulation has become the de facto performance modeling method in the evaluation of microprocessor architectures. There are several reasons for this. The accuracy of analytical models in the past has been insufficient for the type of design decisions computer architects wish to make (for instance, what kind of caches or branch predictors are needed). Hence cycle accurate simulation has been used extensively by architects. Simulators model existing or future machines or microprocessors. They are essentially a model of the system being simulated, written in a high level computer language such as C or Java, and running on some existing machine. The machine on which the simulator runs is called the host machine and the machine being modeled is called the target machine. Such simulators can be constructed in many ways.

Simulators can be functional simulators or timing simulators. They can be trace driven or execution driven simulators. They can be simulators of components or that of the complete system. Functional simulators simulate functionality of the target processor, and in essence provide a component similar to the one being modeled. The register values of the simulated machine are available in the equivalent registers of the simulator. In addition to the values, the simulators also provide performance information in terms of cycles of execution, cache hit ratios, branch prediction rates, etc. Thus the simulator is a virtual component representing the microprocessor or subsystem being modeled plus a variety of performance information.

If performance evaluation is the only objective, one does not need to model the functionality. For instance, a cache performance simulator does not need to actually store values in the cache; it only needs to store information related to the address of the value being cached. That information is sufficient to determine a future hit or miss. While it is nice to have the values as well, a simulator that models functionality in addition to performance is bound to be slower than a pure performance simulator. Register Transfer Language (RTL) models used for functional verification may also be used for performance simulations, however, these models are very slow for performance estimation with real world workloads, and hence are not discussed in this article.

3.1.1 Trace Driven Simulation

Trace-driven simulation consists of a simulator model whose input is a trace, a sequence of information representing the instruction sequence that would have actually executed on the target machine. A simple trace driven cache simulator needs a trace consisting of address values. Depending on whether the simulator is modeling a unified cache or separate instruction and data caches, the address trace should contain instruction addresses, data addresses, or both.

Cachesim5 and Dinero IV are examples of cache simulators for memory reference traces. Cachesim5 comes from Sun Microsystems along with their Shade package [15]. Dinero IV [16] is available from the University of Wisconsin, Madison. These simulators are not timing simulators. There is no notion of simulated time or cycles, only references. They are not functional simulators. Data and instructions do not move in and out of the caches. The primary result of simulation is hit and miss information. The basic idea is to simulate a memory hierarchy consisting of various caches. The various parameters of each cache can be set separately (architecture, mapping policies, replacement policies, write policy, statistics). During initialization, the configuration to be simulated is built up, one cache at a time, starting with main memory as a special case. After initialization, each reference is fed to the appropriate top-level cache by a single simple function call. Lower levels of the hierarchy are handled automatically. One does not need to store a trace while using cachesim5, because Shade can directly feed the trace into cachesim5.
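To illustrate the structure of such tools (a minimal sketch, not the actual code of cachesim5 or Dinero IV), the following C program simulates a single direct-mapped cache over an address trace read from standard input, one hexadecimal address per line. Like the simulators described above, it stores only tags, not data, and reports hit and miss counts; the cache parameters are arbitrary.

    /* Minimal trace-driven, direct-mapped cache simulator (illustrative only). */
    #include <stdio.h>

    #define LINE_SIZE  32UL      /* bytes per cache line (assumed) */
    #define NUM_LINES  1024UL    /* 1024 lines: a 32 KB direct-mapped cache (assumed) */

    int main(void)
    {
        unsigned long tags[NUM_LINES];
        int valid[NUM_LINES] = { 0 };
        unsigned long addr, hits = 0, misses = 0;

        while (scanf("%lx", &addr) == 1) {
            unsigned long block = addr / LINE_SIZE;
            unsigned long index = block % NUM_LINES;
            unsigned long tag   = block / NUM_LINES;
            if (valid[index] && tags[index] == tag) {
                hits++;                     /* tag match: hit */
            } else {
                misses++;                   /* miss: fill the line */
                valid[index] = 1;
                tags[index]  = tag;
            }
        }
        unsigned long total = hits + misses;
        printf("hits = %lu, misses = %lu, hit ratio = %.3f\n",
               hits, misses, total ? (double)hits / total : 0.0);
        return 0;
    }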

Trace driven simulation is simple and easy to understand. The simulators are easy to debug. Experiments are repeatable because the input information does not change from run to run. However, trace driven simulation has two major problems:
1. Traces can be prohibitively long if entire executions of some real-world applications are considered. The storage needed by the traces may be prohibitively large. Trace size is proportional to the dynamic instruction count of the benchmark.
2. The traces do not represent the actual instruction stream of processors that perform branch prediction. Most trace generators generate traces of only completed or retired instructions in speculative processors. Hence the traces do not contain instructions from the mispredicted path.

The first problem is typically solved using trace sampling and trace reduction techniques. Trace sampling is a method to achieve reduced traces. However, the sampling should be performed in such a way that the resulting trace is representative of the original trace. It may not be sufficient to periodically sample a program execution; the locality properties of the resulting sequence may be widely different from those of the original sequence. Another technique is to skip tracing for a certain interval, then collect for a fixed interval, and then skip again. It may also be necessary to include a warm-up period after the skip interval, to let the caches and other such structures warm up [17]. Several trace sampling techniques are discussed by Crowley and Baer [18]. The QPT trace collection system [19] solves the trace size issue by splitting the tracing process into a trace record generation step and a trace regeneration process. The trace record has a size similar to the static code size, and the trace regeneration expands it to the actual full trace upon demand.
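The skip/warm-up/sample pattern described above can be sketched as follows. This is an illustration only: the interval lengths are made up, and cache_access() is a hypothetical placeholder for the update and statistics routines of a real cache simulator.

    /* Illustrative skip / warm-up / sample processing of an address trace. */
    #include <stdio.h>

    #define SKIP_LEN    9000000UL   /* references skipped per period (assumed) */
    #define WARMUP_LEN   500000UL   /* references used only to warm structures (assumed) */
    #define SAMPLE_LEN   500000UL   /* references counted in the statistics (assumed) */

    static unsigned long sampled_refs = 0;

    /* Placeholder for a real cache model: warm-up references update state but
     * are not counted; sampled references are counted in the statistics. */
    static void cache_access(unsigned long addr, int count_stats)
    {
        (void)addr;                  /* a real simulator would probe the cache here */
        if (count_stats)
            sampled_refs++;
    }

    int main(void)
    {
        unsigned long addr, n = 0;
        const unsigned long period = SKIP_LEN + WARMUP_LEN + SAMPLE_LEN;

        while (scanf("%lx", &addr) == 1) {       /* address trace on standard input */
            unsigned long pos = n % period;
            if (pos < SKIP_LEN) {
                /* skip interval: the reference is ignored entirely */
            } else if (pos < SKIP_LEN + WARMUP_LEN) {
                cache_access(addr, 0);           /* warm up caches, do not count */
            } else {
                cache_access(addr, 1);           /* sample interval: collect statistics */
            }
            n++;
        }
        printf("references in sampled intervals: %lu\n", sampled_refs);
        return 0;
    }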

The second problem can be solved by reconstructing the mispredicted path [20]. An image of the instruction memory space of the application is created by one pass through the trace, and instructions are thereafter fetched from this image as opposed to the trace. While 100% of the mispredicted branch targets may not be in the recreated image, studies show that more than 95% of the targets can be located.

3.1.2 Execution Driven Simulation

There are two meanings in which this term is used by researchers and practitioners. Some refer to simulators that take program executables as input as execution driven simulators. These simulators utilize the actual input executable and not a trace. Hence the size of the input is proportional to the static instruction count and not the dynamic instruction count. Mispredicted branches can be accurately simulated as well. Thus these simulators solve the two major problems faced by trace-driven simulators. The widely used Simplescalar simulator [21] is an example of such an execution driven simulator. With this tool set, the user can simulate real programs on a range of modern processors and systems, using fast execution-driven simulation. There is a fast functional simulator and a detailed, out-of-order issue processor simulator that supports non-blocking caches, speculative execution, and state-of-the-art branch prediction.

Some others consider execution driven simulators to be simulators that rely on actual execution of parts of the code on the host machine (hardware acceleration by the host instead of simulation) [22]. These execution driven simulators do not simulate every individual instruction in the application. Only the instructions that are of interest are simulated. The remaining instructions are directly executed by the host computer. This can be done when the instruction set of the host is the same as that of the machine being simulated. Such simulation involves two stages. In the first stage, or preprocessing stage, the application program is modified by inserting calls to simulator routines at events of interest. For instance, for a memory system simulator, only memory access instructions need to be instrumented. For other instructions, the only important thing is to make sure that they get performed and that their execution time is properly accounted for. The advantage of execution driven simulation is speed. By directly executing most instructions at the machine's execution rate, the simulator can operate orders of magnitude faster than cycle by cycle simulators that emulate each individual instruction. Tango, Proteus and FAST are examples of such simulators [22].
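Conceptually, the preprocessing stage transforms the application as sketched below. This is an illustration only: the routine name sim_mem_ref() is hypothetical and not part of Tango, Proteus, or FAST. Calls to a simulator routine are inserted at memory references, while the rest of the code runs natively on the host.

    /* Sketch of an instrumented program produced by the preprocessing stage. */
    #include <stdio.h>

    static unsigned long simulated_cycles = 0;

    /* Simulator routine invoked at instrumented events; a real memory-system
     * simulator would model the cache hierarchy here. */
    static void sim_mem_ref(const void *addr)
    {
        (void)addr;
        simulated_cycles += 2;           /* assumed cost per reference (made up) */
    }

    int main(void)
    {
        int a[1000], sum = 0;
        for (int i = 0; i < 1000; i++) {
            sim_mem_ref(&a[i]);          /* call inserted by the instrumentation pass */
            a[i] = i;                    /* original store, executed natively */
        }
        for (int i = 0; i < 1000; i++) {
            sim_mem_ref(&a[i]);          /* call inserted by the instrumentation pass */
            sum += a[i];                 /* original load, executed natively */
        }
        printf("sum = %d, simulated memory cycles = %lu\n", sum, simulated_cycles);
        return 0;
    }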

3.1.3 Complete system simulation

Many execution and trace driven simulators only simulate the processor and memory subsystem. Neither I/O activity nor operating system activity is handled in simulators like Simplescalar. But in many

workloads, it is extremely important to consider I/O and operating system activity. Complete system simulators are complete simulation environments that model hardware components with enough detail to boot and run a full-blown commercial operating system. The functionality of the processors, memory subsystem, disks, buses, SCSI/IDE/FC controllers, network controllers, graphics controllers, CD-ROM, serial devices, timers, etc. is modeled accurately in order to achieve this. While the functionality stays the same, different microarchitectures in the processing component can lead to different performance. Most of the complete system simulators use microarchitectural models that can be plugged in and out. For instance, SimOS [23], a popular complete system simulator, provides a simple pipelined processor model and an aggressive superscalar processor model. SimOS and SIMICS [24,25] can simulate uniprocessor and multiprocessor systems. Table 4 lists popular complete system simulators.

Table 4. Examples of complete system simulators

Simulator   Information Site               Instruction Set              Operating System
SimOS       Stanford University            MIPS                         SGI IRIX
            http://simos.stanford.edu/
SIMICS      Virtutech                      x86 (PC), SPARC, and Alpha   Solaris 7 and 8, Red Hat Linux 6.2
            http://www.simics.com                                       (x86, SPARC V9, and Alpha versions),
            http://www.virtutech.com                                    Tru64 (Digital Unix 4.0F), and
                                                                        Windows NT 4.0
Bochs       http://bochs.sourceforge.net   x86                          Windows 95, Windows NT, Linux, FreeBSD

3.1.4 Stochastic Discrete Event Driven Simulation

It is possible to simulate systems in such a way that the input is derived stochastically rather than as a trace/executable from an actual execution. For instance, one can construct a memory system simulator in which the inputs are assumed to arrive according to a Gaussian distribution. Such models can be written in general purpose languages such as C, or using special simulation languages such as SIMSCRIPT. Languages such as SIMSCRIPT have several built-in primitives to allow quick simulation of most kinds of common systems. There are built-in input profiles, resource templates, process templates, queue structures, etc. to facilitate easy simulation of common systems. An example of the use of event-driven simulators using SIMSCRIPT may be seen in the performance evaluation of multiple-bus multiprocessor systems in Kurian et al. [26,27].
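The following C sketch conveys the flavor of such a stochastic simulation without using a special simulation language: memory requests arrive at a single memory module with exponentially distributed interarrival times and a fixed service time, and the simulation reports the average queueing delay. The parameters and distributions are made up for illustration; compile with -lm.

    /* Illustrative stochastic simulation of a single memory module. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    static double exp_random(double mean)     /* exponentially distributed variate */
    {
        double u = (rand() + 1.0) / (RAND_MAX + 2.0);
        return -mean * log(u);
    }

    int main(void)
    {
        const double mean_interarrival = 10.0;   /* cycles between requests (assumed) */
        const double service_time      = 6.0;    /* cycles per request (assumed) */
        const long   num_requests      = 100000;

        double clock = 0.0, server_free_at = 0.0, total_wait = 0.0;
        srand(42);

        for (long i = 0; i < num_requests; i++) {
            clock += exp_random(mean_interarrival);       /* next arrival event */
            double start = clock > server_free_at ? clock : server_free_at;
            total_wait += start - clock;                  /* time spent queued */
            server_free_at = start + service_time;        /* departure event */
        }
        printf("average wait per request = %.2f cycles\n", total_wait / num_requests);
        return 0;
    }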

3.1.5 Program Profilers

There is a class of tools, called software profiling tools, which are similar to simulators and performance measurement tools. These tools are used to generate traces, to obtain the instruction mix, and to obtain a variety of instruction statistics. They can be thought of as software monitoring on a simulator. They take an executable as input and decode and analyze each instruction in it. These program profilers can be used as the front end of simulators. A popular program profiling tool is Shade for the UltraSPARC [15].

Shade

Shade is a fast instruction-set simulator for execution profiling. It is a simulation and tracing tool that provides the features of simulators and tracers in one tool. Shade analyzes the original program instructions and cross-compiles them to sequences of instructions that simulate or trace the original code. Static cross-compilation can produce fast code, but purely static translators cannot simulate and trace all details of dynamically linked code. One can develop a variety of 'analyzers' to process the information generated by Shade and create the performance metrics of interest. For instance, one can use Shade to generate address

traces to feed into a cache analyzer to compute hit rates and miss rates of cache configurations. The Shade analyzer cachesim5 does exactly this.

Jaba

Jaba [46] is a Java Bytecode Analyzer developed at the University of Texas for tracing Java programs. While Java programs can be traced using Shade to obtain profiles of native execution, Jaba can yield profiles at the bytecode level. It uses JVM specification 1.1. It allows the user to gather information about the dynamic execution of a Java application at the Java bytecode level. It provides information on bytecodes executed, load operations, branches executed, branch outcomes, etc. Use of this tool can be found in [47].

A variety of profiling tools exist for different platforms. In addition to describing the working of Shade, Cmelik et al. [15] also compare Shade to several other profiling tools for other platforms. A popular one for the x86 platform is Etch [51]. Conte and Gimarc [52] is a good source of information for those interested in creating profiling tools.

3.2 Analytical Modeling

Analytical performance models, while not popular for microprocessors, are suitable for the evaluation of large computer systems. In large systems where details cannot be modeled accurately enough for cycle accurate simulation, analytical modeling is an appropriate way to obtain approximate performance metrics. Computer systems can generally be considered as a set of hardware and software resources and a set of tasks or jobs competing for the use of those resources. Multicomputer systems and multiprogrammed systems are examples.

Analytical models rely on probabilistic methods, queuing theory, Markov models, or Petri nets to create a model of the computer system. A large body of literature on analytical models of computer systems exists from the 1970s and early 1980s. Heidelberger and Lavenberg [28] published an article summarizing research on computer performance evaluation models. This article contains 205 references, which cover all important work on performance evaluation until 1984. Readers interested in analytical modeling should read this article.

Analytical models are cost-effective because they are based on efficient solutions to mathematical equations. However, in order to have tractable solutions, simplifying assumptions are often made regarding the structure of the model. As a result, analytical models do not capture all the detail typically built into simulation models. It is generally thought that carefully constructed analytical models can provide estimates of average job throughput and device utilization to within 10% accuracy and average response times to within 30% accuracy. This level of accuracy, while insufficient for microarchitectural enhancement studies, is sufficient for capacity planning in multicomputer systems, I/O subsystem performance evaluation in large server farms, and early design evaluations of multiprocessor systems.

There has not been much work on analytical modeling of microprocessors. The level of accuracy needed in tradeoff analysis for microprocessor structures is higher than what typical analytical models can provide. However, some effort in this arena came from Noonburg and Shen [29] and Sorin et al. [30]. Those interested in modeling superscalar processors using analytical models should read Noonburg and Shen's work [29] and Sorin et al.'s work [30]. Noonburg and Shen used a Markov model to model a pipelined processor. Sorin et al. used probabilistic techniques to model a multiprocessor composed of superscalar processors. Queuing theory is also applicable to superscalar processor modeling, since modern superscalar processors contain instruction queues in which instructions wait to be issued to one among a group of functional units.
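As an illustration of the kind of closed-form estimate a queueing model provides (a textbook M/M/1 example with made-up parameters, not a model of any particular processor), the sketch below computes the utilization, average occupancy, and average time in system for a single functional unit fed by a stream of instructions:

    /* Textbook M/M/1 estimates; valid only when lambda < mu. */
    #include <stdio.h>

    int main(void)
    {
        double lambda = 0.8;   /* arrival rate: instructions per cycle (assumed) */
        double mu     = 1.0;   /* service rate of the functional unit (assumed) */

        double rho = lambda / mu;              /* utilization */
        double n   = rho / (1.0 - rho);        /* mean number in system */
        double t   = 1.0 / (mu - lambda);      /* mean time in system (Little's law) */

        printf("utilization          = %.2f\n", rho);
        printf("mean queue occupancy = %.2f\n", n);
        printf("mean time in system  = %.2f cycles\n", t);
        return 0;
    }

Such formulas give quick, approximate answers to questions like how queue occupancy grows as the functional unit approaches saturation, which is the level of insight analytical models are typically used for.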

4. Workloads and Benchmarks

Benchmarks used for performance evaluation of computers should be representative of applications that are run on actual systems. Contemporary computing involves a wide variety of applications, and different

benchmarks are appropriate for systems targeted for different purposes. Table 5 lists several popular benchmarks for different classes of workloads.

Table 5. Popular benchmarks for different categories of workloads

Workload Category                                              Example Benchmark Suite
CPU Benchmarks - Uniprocessor                                  SPEC CPU 2000 [31], Java Grande Forum Benchmarks [32], SciMark [33]
CPU Benchmarks - Parallel Processor                            ASCI [34], SPLASH [35], NASPAR [36]
Multimedia                                                     MediaBench [37]
Embedded                                                       EEMBC benchmarks [38]
Digital Signal Processing                                      BDTI benchmarks [39]
Java - Client side                                             SPECjvm98 [31], CaffeineMark [40]
Java - Server side                                             SPECjBB2000 [31], VolanoMark [41]
Java - Scientific                                              Java Grande Forum Benchmarks [32], SciMark [33]
Transaction Processing - OLTP (On-Line Transaction Processing) TPC-C [42], TPC-W [42]
Transaction Processing - DSS (Decision Support Systems)        TPC-H [42], TPC-R [42]
Web Server                                                     SPEC web99 [31], TPC-W [42], VolanoMark [41]
Electronic commerce - With commercial database                 TPC-W [42]
Electronic commerce - Without commercial database              SPECjBB2000 [31]
Mail-server                                                    SPECmail2000 [31]
Network File Server                                            SPEC SFS 2.0 [31]
Personal Computer                                              SYSMARK [43], Ziff Davis WinBench [44], 3DMarkMAX99 [45]

4.1 CPU Benchmarks

SPEC CPU2000 is the industry-standardized CPU-intensive benchmark suite. The System Performance Evaluation Cooperative (SPEC) was founded in 1988 by a small number of workstation vendors who realized that the marketplace was in desperate need of realistic, standardized performance tests. The basic SPEC methodology is to provide the benchmarker with a standardized suite of source code, based upon existing applications, that has already been ported to a wide variety of platforms by its membership. The benchmarker then takes this source code and compiles it for the system in question. The use of already accepted and ported source code greatly reduces the problem of making apples-to-oranges comparisons. SPEC designed CPU2000 to provide a comparative measure of compute-intensive performance across the widest practical range of hardware. The implementation resulted in source code benchmarks developed from real user applications. These benchmarks measure the performance of the processor, memory, and compiler on the tested system. The suite contains 14 floating point programs written in C/Fortran and 12 integer programs (11 written in C and 1 in C++). The SPEC CPU2000 benchmarks replace the SPEC89, SPEC92 and SPEC95 benchmarks.

The Java Grande Forum Benchmark suite consists of three groups of benchmarks: microbenchmarks that test individual low-level operations (eg: arithmetic, cast, create); kernel benchmarks, which are the

heart of the algorithms of commonly used applications (eg: heapsort, encryption/decryption, FFT, Sparse matrix multiplication, etc); and applications (eg: Raytracer, MonteCarlo simulation, Euler equation solution, Molecular dynamics, etc) [48]. These are compute intensive benchmarks available in Java.

SciMark is a composite Java benchmark measuring the performance of numerical codes occurring in scientific and engineering applications. It consists of five computational kernels: FFT, Gauss-Seidel relaxation, sparse matrix-multiply, Monte Carlo integration, and dense LU factorization. These kernels are chosen to provide an indication of how well the underlying Java Virtual Machines perform on applications utilizing these types of algorithms. The problem sizes are purposely chosen to be small in order to isolate the effects of the memory hierarchy and focus on internal JVM/JIT and CPU issues. A larger version of the benchmark (SciMark 2.0 LARGE) addresses the performance of the memory subsystem with out-of-cache problem sizes.

ASCI The Accelerated Strategic Computing Initiative (ASCI) of the Lawrence Livermore laboratories contains several numeric codes suitable for the evaluation of compute intensive systems. The programs are available from [34].

SPLASH The SPLASH suite was created by Stanford researchers [35]. The suite contains six scientific and engineering applications, all of which are parallel applications.

The NAS Parallel Benchmarks (NPB) are a set of 8 programs designed to help evaluate the performance of parallel supercomputers. The benchmarks, which are derived from computational fluid dynamics (CFD) applications, consist of five kernels and three pseudo-applications.

4.2 Embedded and Media Benchmarks

EEMBC Benchmarks

The EDN Embedded Microprocessor Benchmark Consortium (EEMBC, pronounced "embassy") was formed in April 1997 to develop meaningful performance benchmarks for processors in embedded applications. EEMBC is backed by the majority of the processor industry and has therefore established itself as the industry-standard embedded processor benchmarking forum. EEMBC establishes benchmark standards and provides certified benchmarking results through the EEMBC Certification Labs (ECL) in Texas and California. EEMBC's benchmarks comprise a suite of benchmarks designed to reflect real-world applications, although the suite also includes some synthetic benchmarks. These benchmarks target the automotive/industrial, consumer, networking, office automation, and telecommunications markets. More specifically, these benchmarks target specific applications that include engine control, digital cameras, printers, cellular phones, modems, and similar devices with embedded microprocessors. The EEMBC consortium dissected these applications and derived 37 individual algorithms that constitute EEMBC's Version 1.0 suite of benchmarks.

BDTI Benchmarks

Berkeley Design Technology, Inc. (BDTI) is a technical services company that has focused exclusively on Digital Signal Processing since 1991. BDTI provides the industry standard BDTI Benchmarks™, a proprietary suite of DSP benchmarks. BDTI also develops custom benchmarks to determine performance on specific applications. The benchmarks contain DSP routines such as FIR filter, IIR filter, FFT, dot product, and Viterbi decoder.

MediaBench

The MediaBench benchmark suite consists of several applications belonging to the image processing, communications, and DSP domains. Examples of applications that are included are JPEG, MPEG, GSM, G.721 voice compression, Ghostscript, ADPCM, etc. JPEG is a compression program for images, MPEG involves encoding/decoding for video transmission, Ghostscript is an interpreter for the PostScript language, and ADPCM is adaptive differential pulse code modulation. MediaBench is an academic

effort to assemble several media processing related benchmarks. An example of the use of these benchmarks may be found in [49].

4.3 Java Benchmarks

SPECjvm98 The SPECjvm98 suite consists of a set of programs intended to evaluate performance for the combined hardware (CPU, cache, memory, and other platform-specific performance) and software aspects (efficiency of JVM, the JIT compiler, and OS implementations) of the JVM client platform [31]. The SPECjvm98 uses common computing features such as integer and floating point operations, library calls and I/O, but does not include AWT (window), networking, and graphics. Each benchmark can be run with three different input sizes referred to as S1, S10 and S100. The 7 programs are compression/decompression (compress), expert system (jess), database (db), Java compiler (javac), mpeg3 decoder (mpegaudio), raytracer (mtrt) and a parser (jack).

SPECjbb2000 (Java Business Benchmark) is SPEC's first benchmark for evaluating the performance of server-side Java. The benchmark emulates an electronic commerce workload in a 3-tier system. The benchmark contains business logic and object manipulation, primarily representing the activities of the middle tier in an actual business server. It models a wholesale company with warehouses serving a number of districts. Customers initiate a set of operations such as placing new orders and checking the status of existing orders. It is written in Java, adapting a portable business oriented benchmark called pBOB written by IBM. Although it is a benchmark that emulates business transactions, it is very different from the Transaction Processing Council (TPC) benchmarks. There are no actual clients; they are replaced by driver threads. Similarly, there is no actual database access. Data is stored as binary trees of objects.

The CaffeineMark 2.5 is the latest in the series of CaffeineMark benchmarks. The benchmark suite analyzes Java system performance in eleven different areas, nine of which can be run directly over the internet. It is almost an industry standard Java benchmark. The CaffeineMark can be used for comparing applet viewers, interpreters and JIT compilers from different vendors. The CaffeineMark benchmarks can also be used as a measure of Java applet/application performance across platforms.

VolanoMark is a pure Java server benchmark characterized by long-lasting network connections and high thread counts. It can be divided into two parts, server and client, although they are provided in one package. It is based on a commercial chat server application, VolanoChat, which is used in several countries worldwide. The server accepts connections from the chat client. The chat client simulates many chat rooms and many users in each chat room. The client continuously sends messages to the server and waits for the server to broadcast the messages to the users in the same chat room. VolanoMark creates two threads for each client connection. VolanoMark can be used to test both the speed and the scalability of a system. In the speed test, it is run in an iterative fashion on a single machine. In the scalability test, the server and client are run on separate machines with a high speed network connection.

SciMark, see CPU Benchmarks, section 4.1

Java Grande Forum Benchmarks, see CPU Benchmarks, section 4.1

4.4 Transaction Processing Benchmarks

The Transaction Processing Council (TPC) is a non-profit corporation founded in 1988 to define transaction processing and database benchmarks and to disseminate objective, verifiable TPC performance data to the industry. The term transaction is often applied to a wide variety of business and computer functions. Looked at as a computer function, a transaction could refer to a set of operations including disk reads/writes, operating system calls, or some form of data transfer from one subsystem to another. TPC regards a transaction as it is commonly understood in the business world: a commercial exchange of goods, services, or money. A typical transaction, as defined by the TPC, would include updating a database system for such things as inventory control (goods), airline reservations (services), or banking (money). In these environments, a number of customers or service representatives input and manage their transactions via a terminal or desktop computer connected to a database. Typically, the TPC produces benchmarks that

measure transaction processing (TP) and database (DB) performance in terms of how many transactions a given system and database can perform per unit of time, e.g., transactions per second or transactions per minute. The TPC benchmarks can be classified into two categories, On-Line Transaction Processing (OLTP) and Decision Support Systems (DSS). OLTP systems are used in day-to-day business operations (airline reservations, banks), and are characterized by a large number of clients who continually access and update small portions of the database through short running transactions. Decision support systems are primarily used for business analysis purposes, to understand business trends, and for guiding future business directions. Information from the OLTP side of the business is periodically fed into the DSS database and analyzed. DSS workloads are characterized by long running queries that are primarily read-only and may span a large fraction of the database. There are four benchmarks that are active: TPC-C, TPC-W, TPC-R and TPC-H. These benchmarks can be run with different data sizes, or scale factors. In the smallest case (scale factor = 1), the data size is approximately 1 GB. The earlier TPC benchmarks, namely TPC-A, TPC-B, and TPC-D, have become obsolete.

TPC-C TPC-C is an OLTP benchmark. It simulates a complete computing environment where a population of users executes transactions against a database. The benchmark is centered around the principal activities (transactions) of a business similar to that of a world-wide wholesale supplier. The transactions include entering and delivering orders, recording payments, checking the status of orders, and monitoring the level of stock at the warehouses. While the benchmark portrays the activity of a wholesale supplier, TPC-C is not limited to the activity of any particular business segment, but rather represents any industry that must manage, sell, or distribute a product or service. TPC-C involves a mix of five concurrent transactions of different types and complexity, either executed on-line or queued for deferred execution. There are multiple on-line terminal sessions. The benchmark can be configured to use any commercial database system, such as Oracle, DB2 (IBM) or Informix. Significant disk input and output are involved. The database consists of many tables with a wide variety of sizes, attributes, and relationships. The queries result in contention on data accesses and updates. TPC-C performance is measured in new-order transactions per minute. The primary metrics are the transaction rate (tpmC) and price per transaction ($/tpmC).

TPC-H The TPC Benchmark™H (TPC-H) is a decision support system (DSS) benchmark. It consists of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark is modeled after decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions. There are 22 queries in the benchmark. The performance metric reported by TPC-H is called the TPC-H Composite Query-per-Hour Performance Metric (QphH@Size), and the TPC-H Price/Performance metric is $/QphH@Size. One may not perform optimizations based on a priori knowledge of the queries in TPC-H.

TPC-R The TPC Benchmark™R (TPC-R) is a decision support benchmark similar to TPC-H, but which allows additional optimizations based on advance knowledge of the queries. It consists of a suite of business oriented queries and concurrent data modifications. As in TPC-H, there are 22 queries. The performance metric reported by TPC-R is called the TPC-R Composite Query-per-Hour Performance Metric (QphR@Size), and the TPC-R Price/Performance metric, $/QphR@Size.

TPC-W TPC Benchmark™ W (TPC-W) is a transactional web benchmark. The workload simulates the activities of a business oriented transactional web server in an electronic commerce environment. It supports many of the features of the TPC-C benchmark and has several additional features related to dynamic page generation with database access and updates. Multiple on-line browser sessions and on-line transaction processing are supported. Contention on data accesses and updates are modeled. The performance metric reported by TPC-W is the number of web interactions processed per second (WIPS). Multiple web interactions are used to simulate the activity of a retail store, and each interaction is subject to

a response time constraint. Different profiles can be simulated by varying the ratio of browsing and buying, i.e., simulating customers who are primarily browsing and those who are primarily shopping.

4.5 Web server Benchmarks

SPECweb99 is the SPEC benchmark for evaluating the performance of World Wide Web Servers. It measures a system's ability to act as a web server. The initial effort from SPEC in this direction was SPECweb96, but it contained only static workloads, meaning that the requests were for simply downloading web pages that do not involve any computation. But if one examines the use of the web, it is clear that many downloads involve computation to generate the information the client is requesting. Such web pages are referred to as dynamic web pages. SPECweb99 includes dynamic web pages. The file accesses are made to closely match today’s real-world web server access patterns. The pages also contain dynamic ad rotation using cookies and table lookups.

VolanoMark See Java Benchmarks, section 4.3.

TPC-W See Transaction Processing Benchmarks, section 4.4

4.6 E-commerce benchmarks – see SPECjbb2000 in Java Benchmarks (section 4.3) and TPC-W in Transaction processing benchmarks (section 4.4)

4.7 Mail server benchmarks

SPECmail2001 is a standardized mail server benchmark designed to measure a system’s ability to act as a mail server servicing email requests. The benchmark characterizes throughput and response time of a mail server system under test with realistic network connections, disk storage, and client workloads. The benchmark focuses on the ISP as opposed to Enterprise class of mail servers, with an overall user count in the range of approximately 10,000 to 1,000,000 users. The goal is to enable objective comparisons of mail server products.

4.8 File Server Benchmarks

System File Server Version 2.0 (SFS 2.0) is SPEC’s benchmark for measuring NFS (Network File System) file server performance across different vendor platforms. It contains a workload that was developed based on a survey of more than 1,000 file servers in different application environments.

4.9 PC Benchmarks

A variety of benchmarks are available, primarily from Ziff Davis and BAPCo, to benchmark the Windows-based personal computer. Table 6 lists the most common PC benchmarks. Ziff Davis Winstone and BAPCo SYSMARK are benchmarks that measure overall performance, while the other benchmarks are intended to measure the performance of one subsystem, such as video or audio, or one aspect, such as power.


Table 6. Popular personal computer benchmarks

Benchmark and Description

Business Winstone [44]: A system-level, application-based benchmark that measures a PC's overall performance when running today's top-selling Windows-based 32-bit applications. It runs real 32-bit business applications through a series of scripted activities and uses the time a PC takes to complete those activities to produce its performance scores. The suite includes five Microsoft Office 2000 applications (Access, Excel, FrontPage, PowerPoint, and Word), Microsoft Project 98, Lotus Notes R5, NicoMak WinZip, Norton AntiVirus, and Netscape Communicator.

WinBench99 [44]: A subsystem-level benchmark that measures the performance of a PC's graphics, disk, and video subsystems in a Windows environment.

3DwinBench [44]: Tests the bus used to carry information between the graphics adapter and the processor subsystem. Hardware graphics adapters, drivers, and enhancing technologies such as MMX/SSE are tested.

CD WinBench99 [44]: Measures the performance of a PC's CD-ROM subsystem, which includes the CD drive, controller, and driver, and the system processor.

Audio WinBench 99 [44]: Measures the performance of a PC's audio subsystem, which includes the sound card and its driver, the processor, the DirectSound and DirectSound 3D software, and the speakers.

Battery Mark [44]: Measures battery life on notebook computers.

I-bench [44]: A comprehensive, cross-platform benchmark that tests the performance and capability of Web clients. The benchmark provides a series of tests that measure both how well the client handles features and the degree to which network access speed affects performance.

Web Bench [44]: Measures Web server software performance by running different Web server packages on the same server hardware or by running a given Web server package on different hardware platforms.

NetBench [44]: A portable benchmark program that measures how well a file server handles file I/O requests from clients. NetBench reports throughput and client response time measurements.

3Dmark MAX 99 [45]: From Futuremark Corporation. A 3D benchmark that measures 3D gaming performance. Results are dependent on the CPU, memory architecture, and the 3D accelerator employed.

SYSMARK [43]: Measures a system's real-world performance when running typical business applications. This benchmark suite comprises the retail versions of eight application programs and measures the speed with which the system under test executes pre-determined scripts of user tasks typically performed when using these applications. The performance times of the individual applications are weighted and combined into both category-based performance scores as well as a single overall score. The application programs employed by SYSmark 32 are: Microsoft Word 7.0 and Lotus WordPro 96 (for word processing), Microsoft Excel 7.0 (for spreadsheet), Borland Paradox 7.0 (for database), CorelDraw 6.0 (for desktop graphics), Lotus Freelance Graphics 96 and Microsoft PowerPoint 7.0 (for desktop presentation), and Adobe PageMaker 6.0 (for desktop publishing).

Techniques and tools for performance evaluation improve year by year. For instance, performance monitoring counters were not available to the public until 1997, and benchmarks are updated almost every year. Those interested in experimental performance evaluation should therefore continuously monitor the state of the art. Table 7 provides sources for the benchmarks described in this article, and the references at the end can provide new information on tools and benchmarks. Microprocessor vendors are inclined to show their products in the best light: they project results for benchmarks that run well on their systems, develop special optimizations within their compilers just for the sake of improving benchmark scores, and stretch a benchmark's behavior while staying within the 'legal' limits of the benchmark guidelines. It is therefore extremely important to understand the benchmarks, their features, and the metrics they use in order to interpret performance results correctly.

Table 7. Benchmark Web sites

SPEC CPU 2000: http://www.spec.org
Java Grande Forum Benchmarks: http://www.epcc.ed.ac.uk/javagrande/
SciMark: http://math.nist.gov/scimark2
ASCI: http://www.llnl.gov/asci_benchmarks/asci/asci_code_list.html
NASPAR: http://www.nas.nasa.gov/Software/NPB/
MediaBench: http://www.cs.ucla.edu/~leec/mediabench/
EEMBC benchmarks: http://www.eembc.org
BDTI benchmarks: http://www.bdti.com/
SPECjvm98: http://www.spec.org
CaffeineMark: http://www.pendragon-software.com/pendragon/cm3
SPECjBB2000: http://www.spec.org
VolanoMark: http://www.volano.com/benchmarks.html
TPC-C: http://www.tpc.org
TPC-W: http://www.tpc.org
TPC-H: http://www.tpc.org
TPC-R: http://www.tpc.org
SPECweb99: http://www.spec.org
SPECmail2000: http://www.spec.org
SPEC SFS 2.0: http://www.spec.org
SYSMARK: http://www.bapco.com/
Ziff Davis Benchmarks: http://www.zdnet.com/etestinglabs/filters/benchmarks
3DMark MAX 99: http://www.pcbenchmarks.com

References

[1] Reinhold P. Weicker, An Overview of Common Benchmarks, IEEE Computer, December 1990, pp. 65-75

[2] H. Cragon, Computer Architecture and Implementation, Cambridge University Press, 2000

[3] J. E. Smith, Characterizing Computer Performance with a Single Number, Communications of the ACM, October 1988.

[4] D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann Publishers, 2nd edition, 1998, ISBN 1558604286

[5] Vtune profiling software, http://developer.intel.com/software/products/vtune/vtune_oview.htm

[6] P6perf utility, http://developer.intel.com/vtune/p6perf/index.htm

[7] DCPI Tool home page, http://www.research.digital.com/SRC/dcpi/ and http://www.research.compaq.com/SRC/dcpi/

[8] J. Dean, J. E. Hicks, C. A. Waldspurger, W. E. Weihl, and G. Chrysos, “ProfileMe: Hardware Support for Instruction-Level Profiling on Out-of-Order Processors”, Proceedings of MICRO-30, 1997, pp. 292-302.

[9] Perf-monitor for UltraSparc, http://www.sics.se/~mch/perf-monitor/index.html

[10] PMON, http://www.ece.utexas.edu/projects/ece/lca/pmon

[11] M. C. Merten, A. R. Trick, E. M. Nystrom, R. D. Barnes, and W. W. Hwu, "A hardware-driven profiling scheme for identifying hot spots to support runtime optimization", Proceedings of the 26th International Symposium on Computer Architecture, pp. 136-147, May 1999.

[12] R. Bhargava, J. Rubio, S. Kannan, L. K. John, D. Christie, and L. Klaes, “Understanding the Impact of x86/NT Computing on Microarchitecture”, Book Chapter in Characterization of Contemporary Workloads, pp. 203-228, Kluwer Academic Publishers, 2001, ISBN 0-7923-7315-4

[13] Ali Poursepanj and David Christie, “Generation of 3D Graphics Workload for System Performance Analysis”, Presented at the First Workshop on Workload Characterization; also in Workload Characterization: Methodology and Case Studies, edited by John and Maynard, IEEE CS Press, 1999

[14] A. Agarwal, R. L. Sites and M. Horowitz, “ATUM: A New Technique for Capturing Address Traces Using Microcode”, Proceedings of the 13th International Symposium on Computer Architecture, June 1986, pp. 119-127.

[15] B. Cmelik and D. Keppel, "Shade: A Fast instruction-set simulator for execution profiling", Chapter 2 in “Fast Simulation of Computer Architectures”, by T. M. Conte and C. E. Gimarc, Kluwer Academic Publishers, 1995.

[16] Dinero IV cache simulator, www.cs.wisc.edu/~markhill/DineroIV

[17] P. Bose and T. M. Conte, Performance Analysis and Its Impact on Design, IEEE Computer, May 1998, pp. 41-49

[18] P. Crowley and J-L Baer, "On the Use of Trace Sampling for Architectural Studies of Desktop Applications", Presented at the First Workshop on Workload Characterization; also in Workload Characterization: Methodology and Case Studies, ISBN 0-7695-0450-7, edited by John and Maynard, IEEE CS Press, 1999, pp. 15-24.

[19] J. R. Larus, Efficient Program Tracing, IEEE Computer, May 1993, pp. 52-61

[20] Ravi Bhargava, Lizy K. John, and Francisco Matus, Accurately Modeling Speculative Instruction Fetching in Trace-Driven Simulation, Proceedings of the IEEE Performance, Computers and Communications Conference (IPCCC), Feb. 1999, pp. 65-71.

[21] The Simplescalar simulator suite, http://www.simplescalar.org or http://www.cs.wisc.edu/~mscalar/simplescalar.html

[22] B. Boothe, "Execution Driven Simulation of Shared Memory Multiprocessors", Chapter 6 in “Fast Simulation of Computer Architectures”, by T. M. Conte and C. E. Gimarc, Kluwer Academic Publishers, 1995.

[23] The SimOS complete system simulator, http://simos.stanford.edu/

[24] SIMICS, www.simics.com

[25] SIMICS, Virtutech, http://www.virtutech.com

[26] L. Kurian, Performance Evaluation of Prioritized Multiple-Bus Multiprocessor Systems, M. S. Thesis, University of Texas at El Paso, Dec 1989.

[27] L. K. John, Yu-cheng Liu, A Performance Model for Prioritized Multiple-Bus Multiprocessor Systems, IEEE Transactions on Computers, Vol.45, No.5, pp.580-588, May 1996.


[28] P. Heidelberger and S. S. Lavenberg, Computer Performance Evaluation Methodology, IEEE Transactions on Computers, Dec 1984, pp. 1195-1220

[29] D. B. Noonburg and J. P. Shen, A Framework for Statistical Modeling of Superscalar Processor Performance, Proceedings of the 3rd International Symposium on High Performance Computer Architecture (HPCA), 1997, pp. 298-309.

[30] D. J. Sorin, V. S. Pai, S. V. Adve, M. K. Vernon, and D. A. Wood, "Analytic Evaluation of Shared Memory Systems with ILP Processors," Proceedings of the International Symposium on Computer Architecture, 1998, pp. 380-391.

[31] SPEC Benchmarks, www.spec.org

[32] Java Grande Benchmarks, http://www.epcc.ed.ac.uk/javagrande/

[33] SciMark, http://math.nist.gov/scimark2

[34] ASCI Benchmarks, http://www.llnl.gov/asci_benchmarks/asci/asci_code_list.html

[35] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh and A. Gupta, “The SPLASH-2 Programs: Characterization and Methodological Considerations”, Proceedings of the 22nd International Symposium on Computer Architecture, pp. 24-36, June 1995.

[36] NAS Parallel Benchmarks, http://www.nas.nasa.gov/Software/NPB/

[37] MediaBench benchmarks, http://www.cs.ucla.edu/~leec/mediabench/

[38] EEMBC, www.eembc.org

[39] BDTI, http://www.bdti.com/

[40] The Caffeine benchmarks, http://www.pendragon-software.com/pendragon/cm3

[41] VolanoMark, http://www.volano.com/benchmarks.html

[42] Transactions Processing Council, www.tpc.org

[43] SYSMARK, http://www.bapco.com/

[44] Ziff Davis Benchmarks, www.zdbop.com or www.zdnet.com/etestinglabs/filters/benchmarks

[45] PC Benchmarks, www.pcbenchmarks.com

[46] The Jaba profiling tool, http://www.ece.utexas.edu/projects/ece/lca/jaba.html

[47] R. Radhakrishnan, J. Rubio and L. K. John, Characterization of Java Applications at Bytecode and Ultra-SPARC Machine Code Levels, Proceedings of IEEE International Conference on Computer Design, pp. 281-284

[48] J. A. Mathew, P. D. Coddington, and K. A. Hawick, Analysis and Development of the Java Grande Benchmarks, Proceedings of the ACM 1999 Java Grande Conference, June 1999

[49] C. Lee, M. Potkonjak, and W. H. Mangione-Smith, "MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communication Systems", Proceedings of the 30th International Symposium on Microarchitecture, pp. 330-335


[50] D. Bhandarkar and J. Ding, “Performance Characterization of the Pentium Pro Processor”, Proceedings of the 3rd High Performance Computer Architecture Symposium, 1997, pp. 288-297.

[51] Ted Romer, Geoff Voelker, Dennis Lee, Alec Wolman, Wayne Wong, Hank Levy, Brian Bershad and Brad Chen. "Instrumentation and Optimization of Win32/Intel Executables Using Etch", USENIX, 1997.

[52] T. M. Conte and C. E. Gimarc, “Fast Simulation of Computer Architectures”, Kluwer Academic Publishers, 1995.
