Performance Tuning of Computer Systems
Subhasis Banerjee
CARG, University of Ottawa
Outline
1 Profiling
  - Understand What Computers Execute
  - Program Profiling: Basic Concepts
  - Types of Profiler
2 Using Profile Information
  - Data Collection
  - Case Studies
3 Profile Directed Optimization
  - Software Optimization
  - Hardware Optimization
4 Summary
What is Profiling?
- Profile = "a set of data often in graphic form portraying the significant features of something" (Merriam-Webster)
- The focus here is on dynamic execution (static program analysis belongs to software engineering, e.g., formal methods for software)
- Profiling reveals the interaction between software and the underlying machine architecture
- It indicates areas for improvement ⇒ performance tuning
How is Profiling Done?
- Software tools are used to collect profile data
- These tools are often supported by the operating system
- Some popular profiling tools: gprof, OProfile, Valgrind, Pin
- The profile is collected during program execution ⇒ dynamic profile
- Profile data analysis ⇒ performance engineering
Which Profile Data is Collected?
- Call graph profile at the function level
- Call graph at the basic block level
- Memory performance
- Architectural events, e.g., branch mispredictions, exceptions, cache hits/misses
- Performance counter values
Characterizing Programs: Basic Block
- Programs are best viewed at the architecture level in an intermediate form, e.g., assembly language
- A basic block (BB) is a code sequence with one entry point and one exit point
- BBs are the building blocks of most program analysis tools and optimization policies
- Once a BB is entered, all of its instructions execute in order until control leaves at the end of the BB
- More general definition: a sequence of instructions in which every instruction dominates all the instructions that follow it, and no other instruction executes between two instructions of the sequence
Algorithm to Identify BB
Step 1. Identify the leaders (the first instruction of each basic block) in the code. An instruction is a leader if it falls into any of the following three categories:
- The first instruction of the program
- The target of a conditional or unconditional branch instruction
- The instruction that immediately follows a conditional or unconditional branch instruction
Step 2. Starting from a leader, the basic block consists of that leader and all following instructions up to, but not including, the next leader.
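The two steps can be sketched in a few lines of Python. This is a minimal illustration, not a production disassembler: the `(opcode, target)` instruction encoding and the branch mnemonics in `BRANCH_OPS` are assumptions made for the example.

```python
def find_basic_blocks(instructions):
    """Split a program into basic blocks using the leader rules.

    instructions: list of (opcode, target) pairs, where target is the
    index of the branch destination, or None for non-branch instructions.
    Returns a list of (start, end) index pairs, with end exclusive.
    """
    BRANCH_OPS = {"jmp", "beq", "bne"}  # hypothetical branch mnemonics
    n = len(instructions)
    if n == 0:
        return []

    leaders = {0}  # rule 1: the first instruction
    for i, (op, target) in enumerate(instructions):
        if op in BRANCH_OPS:
            if target is not None and 0 <= target < n:
                leaders.add(target)  # rule 2: the branch target
            if i + 1 < n:
                leaders.add(i + 1)   # rule 3: the instruction after a branch

    # Step 2: each block runs from one leader up to the next leader.
    starts = sorted(leaders)
    ends = starts[1:] + [n]
    return list(zip(starts, ends))


program = [
    ("mov", None),  # 0
    ("cmp", None),  # 1
    ("beq", 5),     # 2
    ("add", None),  # 3
    ("jmp", 6),     # 4
    ("sub", None),  # 5
    ("ret", None),  # 6
]
print(find_basic_blocks(program))  # [(0, 3), (3, 5), (5, 6), (6, 7)]
```

Collecting leaders in a set means an instruction that qualifies under several rules is still counted once.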
Programs in Terms of Directed Graph
- A program is a directed graph with BBs as nodes
- Edges are indicated by branch instructions
[Figure: example control-flow graph with six basic blocks numbered 1 to 6]
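As a rough sketch of how such a graph could be built, the function below derives successor edges from the last instruction of each block. The instruction encoding, the mnemonic sets, and the assumption that every branch target is the start of a block are all choices made for this example.

```python
def build_cfg(instructions, blocks):
    """Return {block_id: sorted list of successor block_ids}.

    instructions: list of (opcode, target_index_or_None) pairs.
    blocks: list of (start, end) index pairs, end exclusive; every
    branch target is assumed to be the start of some block.
    """
    CONDITIONAL = {"beq", "bne"}  # hypothetical mnemonics
    UNCONDITIONAL = {"jmp"}
    TERMINATORS = {"ret"}

    block_of_start = {start: bid for bid, (start, _) in enumerate(blocks)}
    cfg = {}
    for bid, (start, end) in enumerate(blocks):
        op, target = instructions[end - 1]  # last instruction of the block
        successors = set()
        if op in CONDITIONAL or op in UNCONDITIONAL:
            successors.add(block_of_start[target])  # taken edge
        falls_through = op not in UNCONDITIONAL and op not in TERMINATORS
        if falls_through and end in block_of_start:
            successors.add(block_of_start[end])     # fall-through edge
        cfg[bid] = sorted(successors)
    return cfg


program = [
    ("mov", None), ("cmp", None), ("beq", 5),  # block 0
    ("add", None), ("jmp", 6),                 # block 1
    ("sub", None),                             # block 2
    ("ret", None),                             # block 3
]
blocks = [(0, 3), (3, 5), (5, 6), (6, 7)]
print(build_cfg(program, blocks))  # {0: [1, 2], 1: [3], 2: [3], 3: []}
```

Note that a conditional branch contributes two edges (taken and fall-through), while a block that ends without a branch still has a fall-through edge to the next block.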
Profilers
Call graph profiler:
- Shows the call times and call frequency of functions/subroutines
- Static or dynamic, depending on the degree of accuracy required
- A dynamic graph can be built fully context sensitive, i.e., for each call of a subroutine the graph includes a node with the call stack ⇒ large memory requirement
- Widely used in program analysis: Valgrind, gprof, CodeViz, Doxygen
- Drawback: slow execution, intrusive
Event-based profiler:
- Some programming languages support event-based profiling (Java, Python, Ruby)
- The runtime provides various callbacks to a profiling agent
- Profile collection is customizable
- Drawback: slow execution, intrusive
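Python's standard `sys.setprofile` hook is one example of such a runtime callback. The sketch below uses it to build a small dynamic call graph as a count of (caller, callee) edges; the helper name `profile_calls` and the toy functions are invented for the illustration.

```python
import sys
from collections import Counter


def profile_calls(func, *args):
    """Run func(*args) and count (caller, callee) edges of Python-level calls."""
    edges = Counter()

    def agent(frame, event, arg):
        # The runtime invokes this callback on call/return events;
        # "call" is reported for Python functions, "c_call" for builtins.
        if event == "call":
            caller = frame.f_back.f_code.co_name if frame.f_back else "<root>"
            edges[(caller, frame.f_code.co_name)] += 1

    sys.setprofile(agent)
    try:
        result = func(*args)
    finally:
        sys.setprofile(None)  # always detach the agent
    return result, edges


def leaf(x):
    return x * 2


def middle(x):
    return leaf(x) + leaf(x + 1)


def top(n):
    total = 0
    for i in range(n):
        total += middle(i)
    return total


result, edges = profile_calls(top, 3)
for (caller, callee), count in sorted(edges.items()):
    print(f"{caller} -> {callee}: {count}")
```

Because the callback runs on every call, the profiled program slows down noticeably, which is the intrusiveness drawback listed above.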
Statistical profiler:
- Sampling is usually done in hardware (program counter)
- OS interrupts sample the hardware counters
- Sampling is always lossy, so the collected data is not as accurate as with other methods
- Nearly as fast as the original execution
- Advantage: non-intrusive
- Although other software profilers in theory provide more accurate information, statistical profilers have shown near-accurate results with almost no loss of execution speed and without modifying program characteristics
- Tools: AMD CodeAnalyst, Intel VTune, OProfile (open source), gprof, MIPS with JTAG interface
Instrumenting the program:
- Insert instrumentation code at appropriate places to collect the profile
- Types of instrumentation: manual, compiler assisted, runtime instrumentation, binary translation
- gprof (combines instrumentation and sampling) is enabled with the compiler's -pg option
- Pin uses runtime instrumentation
- ATOM (DEC Alpha processors with Tru64 OS) uses binary translation (obsolete)
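The lossy but cheap nature of sampling can be illustrated on a synthetic trace. In the sketch below the "program counter" trace is simply a list of function names, and a sample is taken every `period` entries; the trace, the function names, and the periods are assumptions chosen for the example, not measurements.

```python
from collections import Counter


def exact_profile(trace):
    """Fraction of execution per function, using every trace entry."""
    counts = Counter(trace)
    return {fn: c / len(trace) for fn, c in counts.items()}


def sampled_profile(trace, period):
    """Fraction of execution per function, using every period-th entry."""
    samples = trace[::period]
    counts = Counter(samples)
    return {fn: c / len(samples) for fn, c in counts.items()}


# Synthetic trace: a loop of 10 steps (7 compute, 2 io, 1 init), repeated.
trace = (["compute"] * 7 + ["io"] * 2 + ["init"]) * 1000

print(exact_profile(trace))         # compute 0.7, io 0.2, init 0.1
print(sampled_profile(trace, 97))   # close to the exact profile
print(sampled_profile(trace, 100))  # {'compute': 1.0}
```

With a period of 97, roughly 1% of the entries are inspected and the estimate stays within about one percentage point of the exact profile. A period of 100 is a multiple of the loop length, so every sample lands on the same step and the profile is badly biased; this is why sampling periods are usually chosen so that they do not align with program periodicity.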
Simulation Profiler
- Simulators can be used for detailed profiling, at the cost of slow execution
- Types of simulation: trace-driven and execution-driven
- Some simulators model execution at the cycle level, which enables data dependence analysis
- Simulation is usually done with a smaller input data set that is a realistic representation of the actual data
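A trace-driven simulator replays a recorded sequence of events through a model of the hardware. As a minimal sketch, the function below replays a list of memory addresses through a set-associative cache with LRU replacement; the 32-byte line size, the cache sizes, and the synthetic address trace are assumptions for the example.

```python
from collections import OrderedDict


def miss_rate(addresses, cache_bytes, ways, line_bytes=32):
    """Replay an address trace through a set-associative LRU cache model."""
    num_sets = cache_bytes // (line_bytes * ways)
    sets = [OrderedDict() for _ in range(num_sets)]  # one LRU list per set
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        lru = sets[line % num_sets]
        if line in lru:
            lru.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1
            if len(lru) == ways:
                lru.popitem(last=False)  # evict the least recently used line
            lru[line] = True
    return misses / len(addresses)


# Two addresses that map to the same set of a 1 kB direct-mapped cache.
trace = [0, 1024] * 50
print(miss_rate(trace, cache_bytes=1024, ways=1))  # 1.0
print(miss_rate(trace, cache_bytes=1024, ways=2))  # 0.02
```

The two lines evict each other on every access in the direct-mapped configuration, while the 2-way configuration of the same total size holds both and misses only on the first two accesses.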
How to Collect Meaningful Data
What is useful data for a profiler?
- Trace: the instructions and data collected at runtime, covering all dynamic instances
- Events: architectural events, e.g., cache miss, branch misprediction, page fault
- Counts: architecture-specific metrics, e.g., number of instructions retired, number of cache hits/misses, interrupts/exceptions
- Dataflow analysis: instructions share data for computation; dataflow dependences enforce program order
What to do with the collected data?
- Analyze bottlenecks in program execution
- Identify the performance metric: number of instructions retired in a given time? Power? For a multithreaded program, is it adequately parallel?
- Suggest possible improvements: automatically tune the code? Give hints to the compiler to generate better code? Add architectural support to eliminate or reduce the performance bottleneck?
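Raw counts usually become useful only after they are turned into ratios. The sketch below derives three common metrics (instructions per cycle, cache miss rate, and misses per thousand instructions) from a dictionary of counter values; the counter names and the numbers are hypothetical and do not correspond to any particular processor.

```python
def derived_metrics(counters):
    """Turn raw event counts into rate-style performance metrics."""
    instructions = counters["instructions_retired"]
    cycles = counters["cycles"]
    hits = counters["cache_hits"]
    misses = counters["cache_misses"]
    return {
        "ipc": instructions / cycles,                # instructions per cycle
        "cache_miss_rate": misses / (hits + misses),
        "mpki": 1000 * misses / instructions,        # misses per 1000 instructions
    }


# Hypothetical counter readings for one program run.
sample = {
    "instructions_retired": 2_000_000,
    "cycles": 1_600_000,
    "cache_hits": 450_000,
    "cache_misses": 50_000,
}
print(derived_metrics(sample))
```

For these sample values the result is an IPC of 1.25, a miss rate of 10%, and 25 misses per thousand instructions.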
SPEC2000 Benchmark: Instruction Mix and Cache Profile
[Figure: cache miss rate (%) for cache sizes of 16 kB, 32 kB and 64 kB and associativities of 1-way, 2-way, 4-way and 8-way. Top panel: bzip2, crafty, gcc, gzip, mcf, parser, vortex. Bottom panel: art, equake, mesa, galgel, mgrid.]
SPEC2000 Benchmark: Branch and Power Profile
[Figure: branch prediction accuracy (%) of the combined, gshare and bimodal predictors. Top panel: gcc, mcf, gzip, twolf, bzip2, crafty, vortex, parser, perlbmk. Bottom panel: art, swim, mesa, applu, mgrid, galgel, ammp, facerec, equake.]
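To make the difference between two of the predictors named in the figure concrete, the sketch below measures the accuracy of a bimodal and a gshare predictor on a branch trace. It is a simplified model under stated assumptions: 2-bit saturating counters, a 1024-entry table, a history register as wide as the table index, and a synthetic single-branch trace. The combined predictor is not modeled, and the results say nothing about the SPEC2000 numbers.

```python
def predictor_accuracy(trace, scheme, table_bits=10):
    """Fraction of correctly predicted branches.

    trace: list of (pc, taken) pairs, with taken a bool.
    scheme: "bimodal" (index by PC) or "gshare" (index by PC xor history).
    """
    size = 1 << table_bits
    counters = [1] * size  # 2-bit saturating counters, start weakly not-taken
    history = 0            # global branch history register
    correct = 0
    for pc, taken in trace:
        if scheme == "bimodal":
            index = pc % size
        elif scheme == "gshare":
            index = (pc ^ history) % size
        else:
            raise ValueError(f"unknown scheme: {scheme}")

        predicted_taken = counters[index] >= 2
        correct += predicted_taken == taken

        # Train the counter and shift the outcome into the history.
        if taken:
            counters[index] = min(3, counters[index] + 1)
        else:
            counters[index] = max(0, counters[index] - 1)
        history = ((history << 1) | int(taken)) % size
    return correct / len(trace)


# One branch whose outcome alternates taken / not taken.
trace = [(0x40, i % 2 == 0) for i in range(1000)]
print(predictor_accuracy(trace, "bimodal"))
print(predictor_accuracy(trace, "gshare"))
```

On this alternating pattern the bimodal counter keeps flipping between its two middle states and mispredicts every branch, whereas gshare separates the two cases through the history bits and predicts almost perfectly after a short warm-up.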
Hot Spot Detection
Detection of program hot spots:
- Programs exhibit temporal locality
- Detect the collections of basic blocks that are executed frequently
- The detection mechanism can be improved by hardware support ([1])
- Focus on the code sections that represent hot spots
- Optimize the hot spot code aggressively (loop unrolling, instruction fusion, prefetching)
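A simple software-only version of this idea counts how often each basic block appears in an execution trace and keeps the most frequent blocks until they cover a chosen share of the execution. The sketch below is an illustration under assumptions (block names, the 90% coverage threshold, and the synthetic trace are invented) and is not the hardware scheme of [1].

```python
from collections import Counter


def hot_blocks(bb_trace, coverage=0.9):
    """Return the most-executed blocks that together cover `coverage`
    of all block executions in the trace."""
    counts = Counter(bb_trace)
    total = len(bb_trace)
    hot, covered = [], 0
    for block, count in counts.most_common():
        if covered >= coverage * total:
            break
        hot.append(block)
        covered += count
    return hot


# Synthetic trace: a two-block loop dominates the execution.
trace = ["B1"] + ["B2", "B3"] * 46 + ["B4"] * 4 + ["B5"] * 3
print(sorted(hot_blocks(trace)))  # ['B2', 'B3']
```

Here two of the five blocks account for 92 of 100 block executions, so an optimizer would concentrate its effort on them.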
Profile Guided Optimization in Java
- Java Virtual Machines (JVMs) employ an optimizing compiler
- Profile information is collected while the program executes ([2])
Microarchitecture Optimization
- One architecture never fits all
- Profiling is the key to characterizing program behavior and adapting the architecture
- Architecture reconfiguration is often driven by feedback obtained from the program profile
- Runtime reconfiguration is challenging because of several constraints of dynamic profiles: profiling overhead, storing time-sensitive data
Trace Cache Design
- Traces are sequences of decoded instructions with their operands and data
- A trace cache is an instruction cache that stores sequences of basic blocks as decoded instructions and operands, together with a set of branch predictions
- A hit is registered if all the predicted branch outcomes hold; instruction decoding is then skipped
- Optimization: traces can be stored selectively, depending on how frequently a given trace executes
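The lookup rule can be sketched as a small data structure. The class below is a toy model under assumptions: one trace per start address, branch outcomes stored as a tuple of booleans, and decoded instructions represented by plain strings. Real trace caches differ in indexing, capacity, and fill policy.

```python
class TraceCache:
    """Toy trace cache: one trace per start address."""

    def __init__(self):
        self._traces = {}  # start_pc -> (branch outcomes, decoded instructions)
        self.hits = 0
        self.misses = 0

    def fill(self, start_pc, outcomes, decoded):
        """Store a trace together with the branch outcomes it was built on."""
        self._traces[start_pc] = (tuple(outcomes), list(decoded))

    def lookup(self, start_pc, predicted):
        """Return the decoded trace if the start address is present and
        every predicted branch outcome matches the stored one."""
        entry = self._traces.get(start_pc)
        if entry is not None and entry[0] == tuple(predicted):
            self.hits += 1
            return entry[1]  # decode stage can be skipped
        self.misses += 1
        return None          # fall back to normal fetch and decode


cache = TraceCache()
cache.fill(0x100, [True, False], ["ld", "add", "beq", "sub", "bne"])
print(cache.lookup(0x100, [True, False]))  # the stored trace
print(cache.lookup(0x100, [True, True]))   # None: one prediction differs
print(cache.hits, cache.misses)            # 1 1
```

Selective storage, as in the optimization above, could be added by calling `fill` only for traces whose execution count exceeds a threshold.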
Power Optimization: Adaptive Issue Queue Design
- The issue queue is one of the most power-hungry resources
- Portions of the issue queue can be selectively enabled or disabled depending on the parallelism available in the code section
- A profiler can indicate the degree of parallelism in the code section
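One way a profiler could quantify parallelism is to divide the number of instructions in a window by the length of its longest dependence chain. The sketch below does this and then picks a queue size from the estimate. The dependence encoding, the queue sizes, and the thresholds are hypothetical values for illustration, not parameters of any real design.

```python
def estimate_ilp(deps):
    """Estimate instruction-level parallelism of a code window.

    deps[i] lists the indices of earlier instructions that instruction i
    depends on. ILP = instruction count / longest dependence chain.
    """
    if not deps:
        return 0.0
    depth = []  # depth[i] = length of the longest chain ending at i
    for producers in deps:
        depth.append(1 + max((depth[p] for p in producers), default=0))
    return len(deps) / max(depth)


def choose_queue_entries(ilp, sizes=(16, 32, 64), thresholds=(1.5, 3.0)):
    """Pick the smallest queue size whose ILP threshold is not exceeded
    (hypothetical policy)."""
    for size, limit in zip(sizes, thresholds):
        if ilp < limit:
            return size
    return sizes[-1]


serial = [[], [0], [1], [2]]     # each instruction needs the previous one
parallel = [[], [], [], []]      # fully independent instructions
for window in (serial, parallel):
    ilp = estimate_ilp(window)
    print(ilp, choose_queue_entries(ilp))
```

The serial window yields an ILP of 1.0 and the smallest queue, while the independent window yields 4.0 and the full queue, so queue entries are powered only where the code can use them.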
Summary
- Software and hardware optimizations are primarily guided by profiling information
- Future architecture trend: many simple cores on one chip ⇒ requires scalable solutions to extract thread-level and process-level parallelism from programs
- Profilers will play an important role in defining the model for tuning compilers and architectures
- Hardware support extends the compiler's capability to generate efficient code
- Autotuner: the traditional compiler may be replaced by a feedback-driven autotuner
References
[1] M. C. Martin, C. George, J. Gyllenhaal, and W. Hwu, "A hardware driven profiling scheme for identifying program hot spots to support runtime optimization," Proc. of the Intl. Symp. on Computer Architecture, 1999.
[2] M. Arnold, M. Hind, and B. G. Ryder, "Online feedback-directed optimization of Java," in OOPSLA '02: Proceedings of the 17th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, 2002, pp. 111-129.