In-Depth: Intel® VTune™ Performance Analyzer 9.1 for Linux*

Contents

Is your application too fast? I didn’t think so
Easy to Use
Finding Your Bottleneck Is Easier Than Ever Before
See the Answers on Your Source
Find the Critical Path Using Call Graph Profiling
Low Overhead Sampling Profiling
Filter the data to find your answers
Features
All Architectures
Large Enterprise and HPC Systems
Intel® Itanium® Architecture
New
Profile JavaScript* and Flash* Code
Profile Dynamically Generated Code
Access to VTune Analyzer’s Open Data Model
Access to the latest Experimental Technologies
VTune Analyzer Displays What the Compiler Knows
New More Effective Tuning Methodology Supported
New Events for Tuning Multicore Intel® Processors
New Linux Distributions!
Faster Call Graph: Selective Instrumentation for Java* and Native Code
Supports the Latest Intel® Processors
Powerful User Interface Improvements
Tune Inline Functions
One-click Hotspot Navigation
Branch and Call Navigation Made Easy
Create Meaningful Event Labels
Large Enterprise and HPC Systems
Minimize Bus Traffic in Non-uniform Memory Architecture (NuMA) Systems
New for Itanium® Architecture!
Intel® Premier Support


Is your application too fast? I didn’t think so.

Intel® VTune™ Performance Analyzer for Linux* is a fully Linux-based solution indispensable for making your software run its fastest on single and multicore systems. It analyzes applications without recompilation or linking on handheld through supercomputer systems. It is robust with large applications (over 1 GB of source code¹) and multicore, multiprocessor, and NuMA systems using the latest Intel® processors.

• NEW! Now supports the Intel® Core™ i7 processor

• NEW! Performance Profiling for Dynamically Generated Code, JavaScript, and Flash. Access to VTune Analyzer’s Open Data Model.

Many developers want to maximize application performance. VTune analyzer gives the developer a view of what’s happening as the application is running. It identifies areas that take an inordinate amount of processor time. It also helps identify critical paths in an application where adjustments have maximum benefit. Without VTune analyzer, performance tuning is a guessing game.

Easy to Use

VTune analyzer makes application performance tuning easier with a graphical user interface (GUI) based on the Eclipse* development environment.‡

Finding Your Bottleneck Is Easier Than Ever Before

Complete one simple dialog box to get a list of the top five time-consuming functions.

It’s fast and easy to find your performance bottlenecks with a list of the most active functions. Click on a function name to display the source and show the most time-consuming source statements.

See the Answers on Your Source

Source and assembly views show you exactly which lines of code are taking the most time.

Quickly find the data you need. Click an icon to:

• View Source (shown)
• View mixed source and assembly
• View assembly
• View compiler tuning advice
• Go to hottest line for the selected event
• Go to the next hottest line for the selected event
• Go to next function

See larger image at: http://www.intel.com/cd/software/products/asmo-na/eng/vtune/320771.htm


Find the Critical Path Using Call Graph Profiling

Call Graph determines calling sequences and graphically displays the critical path. It also shows you the context of the bottleneck. To be effective, you often need to know not only where the application is spending its time, but how it got there.

Unlike other offerings, VTune analyzer provides both sampling and call graph analysis. Even if you plan to do mostly call graph analysis, running sampling first lets you identify the modules that need it so you only pay Call Graph’s larger overhead for the modules that need to be analyzed. This can be vital on large projects. Sampling is great for analysis of “loopy” code. Call Graph is usually better for “branchy” code. You need both to get the job done right.

See larger image at: http://cache-www.intel.com/cd/00/00/32/10/321016_321016.GIF
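To make the idea of a critical path concrete, here is a minimal Python sketch. It is not VTune analyzer's own algorithm or API. It assumes a hypothetical call graph stored as a dictionary that maps each caller to its callees and the time spent on each call edge, and it takes one simple reading of "critical path": starting at the root, always follow the most time-consuming edge. All function names and timings are invented for illustration.

```python
def critical_path(call_graph, root):
    """Follow the most time-consuming outgoing edge from root until a leaf."""
    path = [root]
    visited = {root}
    current = root
    while True:
        # Candidate callees of the current function, skipping any function
        # already on the path so that recursion cannot loop forever.
        callees = {
            callee: edge_time
            for callee, edge_time in call_graph.get(current, {}).items()
            if callee not in visited
        }
        if not callees:
            return path
        current = max(callees, key=callees.get)
        visited.add(current)
        path.append(current)


# Hypothetical edge times in milliseconds: caller -> {callee: time}.
example_graph = {
    "main": {"load_input": 120, "solve": 870, "write_output": 40},
    "solve": {"assemble_matrix": 210, "factorize": 640},
    "factorize": {"pivot": 90, "update_block": 530},
}

print(" -> ".join(critical_path(example_graph, "main")))
```

With these made-up numbers the walk goes from main to solve, factorize and update_block, which is the kind of chain a call graph view would highlight.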

View the critical path in red

Quickly locate the critical path and navigate the profiling results easily using both a table and graph view. Click a table entry to highlight the function in the graph, or click the graph to find the table entry.

Low Overhead Sampling Profiling

Event-based sampling finds your bottleneck with very low overhead (typically less than 5 percent). Identify problems such as cache misses and branch mis-predictions. Because it is system-wide, event-based sampling can be used to tune libraries and drivers as well as application programs.

Filter the data to find your answers

The table and bar chart views of sampling results filter your data many different ways to find what you need. View by thread (shown) for load balancing.

• Process
• Thread (shown)
• Module
• Hotspot (function)
• Source View
• CPU
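The different views of sampling data (by process, thread, module, or function) amount to grouping the same samples by a different key. The Python sketch below illustrates that idea only. The record layout, the field names, and the sample values are assumptions made up for this example and do not reflect VTune analyzer's data format.

```python
from collections import Counter

# Hypothetical sample records: (process, thread, module, function).
FIELDS = ("process", "thread", "module", "function")
samples = [
    ("app", "T1", "libsolver.so", "factorize"),
    ("app", "T1", "libsolver.so", "factorize"),
    ("app", "T2", "libsolver.so", "update_block"),
    ("app", "T1", "app", "main"),
    ("app", "T1", "libsolver.so", "update_block"),
    ("app", "T1", "libc.so", "memcpy"),
]


def view_by(records, field, top=5):
    """Return (name, sample count, percent of all samples) for the top entries."""
    index = FIELDS.index(field)
    counts = Counter(record[index] for record in records)
    total = sum(counts.values())
    return [
        (name, count, 100.0 * count / total)
        for name, count in counts.most_common(top)
    ]


# View by thread to check load balance, then by module to find hot code.
for field in ("thread", "module"):
    print(field)
    for name, count, percent in view_by(samples, field):
        print(f"  {name:14s} {count:3d} samples  {percent:5.1f}%")
```

In this invented data set almost all samples land on thread T1, which is the kind of imbalance the thread view is meant to expose.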


Features

All Architectures:

Low Overhead: Accurately identify where the program spends time. Sampling is system wide with negligible overhead (typically less than 5 percent).

Find the Critical Path: Determine function calling sequences and find the critical path using Call Graph.

No Recompile Required: Unlike traditional instrumented profilers that make you recompile or modify your build script, just use your production executables.

Compatibility: VTune™ Performance Analyzer supports the latest Intel® processors (Intel® 64 architecture-based processors, Intel® Itanium® processors, multicore processors) and a wide variety of Linux* distributions.

Programming Language and Compiler Independent: VTune analyzer supports all compilers that follow industry standards (ELF, STABS, DWARF).

Mixed Java* and Native Code: Unlike Java*-only analyzers, VTune analyzer tunes mixed Java and native code¹.

Minimal Memory Footprint: Remote profiling minimizes the performance impact on the target system by running the user interface on a separate system.

Command Line Capability: Automate batch operations.

Large Applications Welcome: VTune analyzer is a robust solution even with large executables². If you have a large application with hundreds of thousands of functions, bring it to VTune analyzer.

Listen to the Compiler’s Advice: An optimizing compiler can do a lot better with just a few tips from you. We’ve integrated the Intel® Compilers with VTune analyzer to make this easy and very effective.

Large Enterprise and HPC Systems:

Minimize Traffic in Non-uniform Memory Architecture (NuMA) Systems by storing sampling data in local CPU memory. This is critical to avoid saturating the interconnect fabric and slowing the system under test.

Designed for High Performance Computing: Large High Performance Computing (HPC) systems have unique requirements supported by VTune analyzer.

• Multiple users can share a large system for simultaneous Call Graph performance analyses.
• Sampling is supported on systems with 128 or more³ processors using local buffering per CPU for minimum inter-node contention. To limit the amount of data collected we recommend selecting a maximum of 64 CPUs for simultaneous data collection.

Intel® Itanium® Architecture:

Eclipse* Based Graphical User Interface: The easy-to-use Eclipse* based graphical user interface in VTune analyzer is now native on Itanium® architecture.

Instruction Filtered Events Pinpoint Bottleneck Locations: Isolate problems like poor pre-fetch and poor memory alignment. Sometimes just choosing an event is not selective enough, because the event can occur both at critical and non-critical times. On Intel Itanium architecture, instruction filtering allows you to collect events only when they occur with a specified op-code.

Minimize Data Collection with CPU Selection: Collect only the data you need. CPU selection lets you control exactly where data is collected: from all the processors, only those in your allocation, or only the processors you specify. This greatly reduces the amount of data you need to collect.

New

Note: Features listed as “New” are new since the last major release 8.0. Some have been previewed in minor updates and beta releases.

Profile JavaScript* and Flash* Code

New profiling support in emerging internet browsers and other script-oriented products allows developers working with new JavaScript* or Flash* JIT technologies to analyze their code. Use the VTune analyzer to optimize for scalable performance of these codes on Windows* and Linux* to ensure the best end user experience with your application. VTune analyzer supports profiling JIT’d code when browser vendors add the required support. This enables deep performance analysis of these additional languages:

• JavaScript / AJAX
• Flash (Action Script)

Check with your browser supplier for details on when their browser will enable support.


Profile Dynamically Generated Code

Many applications today emit their own runtime-generated or just-in-time (JIT) code. New profiling APIs in the VTune analyzer enable performance analysis of dynamic code and allow you to view annotated source code directly from the analysis results.

Access to VTune Analyzer’s Open Data Model

VTune analyzer can now support many different software platforms with performance sampling analysis. Use the new open data model APIs to combine the VTune analyzer’s powerful GUI on Windows* or Linux* with data from your custom collector to analyze any application on a wide range of platforms.

• Collect data on operating systems not directly supported by the VTune analyzer.
• Supported Windows* Operating Systems
• Supported Linux* Distributions
• Collect data on embedded Intel hardware based platforms.

Access to the latest Experimental Technologies

VTune analyzer users have access to the latest experimental performance tuning technologies Intel has to offer. Visit whatif.intel.com and look for Intel® Performance Tuning Utility and Intel® Platform Modeling with Machine Learning. These tools include a number of exciting capabilities including:

• Statistical Call Tree: profiles with low overhead to detect where time is spent in your application

• Basic Block Analysis: displays hotspots with basic block granularity and generates a control flow graph for advanced analysis of an application, even without the source code

• Data Access Profiling: identifies memory hotspots and relates them to code hotspots

• Dependency Plots: visualize the relationships between metrics

• Event Rank: view the list of best predictors of performance using machine learning

VTune Analyzer Displays What the Compiler Knows

An optimizing compiler can do a lot better with just a few tips from you. We’ve integrated the Intel® compilers with Intel® VTune™ Performance Analyzer to make this easy.

The Intel compiler optimization reports contain a wealth of information to make your application faster. VTune analyzer locates your critical, time consuming “hot spot” and filters the compiler optimization report to show only the lines that apply to the code selected. Now you can see what the compiler optimized and choose pragmas to further improve performance.

For example, a single click tells you that the compiler didn’t optimize your critical loop because of an assumed vector dependency. You know there is no dependency, so you insert a pragma telling the compiler to ignore it, which makes the loop faster.

Currently, optimization report filtering works exclusively with Intel® C++ and Fortran Compilers 9.1 and higher, but it utilizes a standard format open to other compilers.

After you find your hotspot using Intel® VTune™ analyzer, select the hot lines of code in the source view and click an icon to see the compiler’s tuning advice.

See larger image at: http://www.intel.com/cd/software/products/asmo-na/eng/312271.htm


New More Effective Tuning Methodology Supported

Pipeline stall accounting radically improves tuning by focusing the user on the instances of possible issues (like cache misses) which actually end up mattering. Core™2 Duo and Core™2 Quad processors have greatly enhanced performance analysis capabilities. These processors support more events, higher precision in event location correlation, and a new and wonderful pipeline stall accounting.

New Events for Tuning Multicore Intel® Processors

New events measure parallelism, core sharing of the bus and cache, and modified data sharing by threads. These identify opportunities to improve threading, tune multicore sharing of the bus and cache, and optimize cache-line usage.

New Linux Distributions!

Check out the details on the latest supported distributions in the system requirements.

Faster Call Graph: Selective Instrumentation for Java* and Native Code

Now you can selectively instrument Java* or native code to improve runtime performance. By gathering data only on the modules being tuned, overhead is reduced and runtime is improved.

Supports the Latest Intel® Processors

Supports the latest Intel® quad-core processors.

Powerful User Interface Improvements

Tune Inline Functions

Tune your inlined code with instance-specific event counts on the source and assembly views. Performance can vary by context, i.e., by where a function is called. VTune analyzer provides event data for each occurrence of an inlined function.

Supports Intel and GNU compilers:

• ICC 8.1 or higher

• GCC 3.2 or higher**

One-click Hotspot Navigation

With event counts next to each source line, you can easily see how hot each line is. But in a large source file, how do you find the hottest spot? Or jump to the next hottest line which may be a thousand lines away? Easy, just select the event you want to navigate by clicking in its column, and then click the Min, Max, Next and Previous icons to quickly browse through your hot spots.

Click the Max icon to scroll to the hottest line in the current source view. Next, Previous and Min buttons quickly take you through the list of hotspots. To navigate a different event, just click the desired column.

See larger image at: http://www.intel.com/cd/software/products/asmo-na/eng/312273.htm

Branch and Call Navigation Made Easy

Instantly follow a branch in disassembly by clicking a menu. No more hunting for the destination, just choose “Go to target” to scroll the display.

Create Meaningful Event Labels

Name your custom events using event aliasing. When you create a custom event, it is often difficult to remember exactly what you did. Event aliasing creates a custom label that is meaningful to you. VTune analyzer then uses this label in all event displays.

Large Enterprise and HPC Systems

Minimize Bus Traffic in Non-uniform Memory Architecture (NuMA) Systems by storing sampling data in local CPU memory. This is critical to avoid saturating the interconnect bus and slowing the system under test.
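The Max, Min, Next and Previous navigation described under One-click Hotspot Navigation can be pictured as stepping through source lines sorted by the count of the selected event. The Python sketch below is only an illustration of that idea. The class, its method names, and the per-line counts are hypothetical, and treating Min as "jump to the line with the smallest count" is an assumption about the button's behavior.

```python
class HotspotNavigator:
    """Step through source lines in descending order of one event's count."""

    def __init__(self, counts_by_line):
        # Hottest line first; ties are broken by the lower line number.
        self.order = sorted(
            counts_by_line, key=lambda line: (-counts_by_line[line], line)
        )
        self.position = 0

    def go_max(self):
        """Jump to the hottest line."""
        self.position = 0
        return self.order[self.position]

    def go_min(self):
        """Jump to the line with the smallest count (assumed meaning of Min)."""
        self.position = len(self.order) - 1
        return self.order[self.position]

    def go_next(self):
        """Move to the next hottest line, stopping at the end of the list."""
        self.position = min(self.position + 1, len(self.order) - 1)
        return self.order[self.position]

    def go_previous(self):
        """Move back toward the hottest line, stopping at the start."""
        self.position = max(self.position - 1, 0)
        return self.order[self.position]


# Hypothetical event counts keyed by source line number.
navigator = HotspotNavigator({12: 40, 57: 900, 310: 15, 1042: 620})
print(navigator.go_max(), navigator.go_next(), navigator.go_next())
```

Selecting a different event column would correspond to building a new navigator from that event's per-line counts.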


New for Itanium® Architecture!

Eclipse* Based Graphical User Interface

VTune analyzer’s easy to use Eclipse* based graphical user interface is now native on Itanium® architecture.

Instruction Filtered Events Pinpoint Bottleneck Location

Itanium® architecture exclusive! Isolate problems like poor pre-fetch and poor memory alignment. Sometimes, just choosing an event is not selective enough, because the event can occur both at critical and non-critical times. On Intel® Itanium® architecture, instruction filtering allows you to collect events only when they occur with a specified op-code.

Minimize Data Collection with CPU Selection

Itanium architecture exclusive! Collect only the data you need. CPU selection lets you control exactly where data is collected: from all the processors, only those in your allocation, or only the processors you specify. This greatly reduces the amount of data you need to collect.

Eclipse* based graphical user interface is now native on Itanium® architecture.

See larger image at: http://www.intel.com/cd/software/products/asmo-na/eng/312275.htm

See larger image at: http://www.intel.com/cd/software/products/asmo-na/eng/308425.htm

Intel® Premier Support

Every purchase of an Intel® Software Development Product includes a year of support services, which provides access to Intel® Premier Support and all product releases during that time. You receive online access to our expert engineering support staff and additional technical documentation.

‡ Technical support for Eclipse is not provided by Intel. For more information on Eclipse, please visit the Eclipse Foundation Web site at: http://www.eclipse.org/

1 Sampling only.

2 Large applications are welcome! For example, the source distribution tree of one large application including the tools and predefined libraries required to do a build (but not the build itself) is about 1.85 GB with over 62,700 files. The execution tree alone is about 870 MB with over 8,200 files.

3 Due to the unique requirements for supporting large systems, if the software will be used on systems with more than 128 cores please contact us before purchase to make special arrangements.

** GCC uses the older Dwarf2 format. In some cases there is not enough information to associate the inlined instance with the correct caller source line. In this case VTune analyzer will guess and associate the contribution of the inlined instance with the nearest caller source line. This may create an event mismatch between Source and Function Views. The newer Dwarf3 format used by ICC 8.1 and higher eliminates this problem by unambiguously associating inlined instances with the caller source line. GCC 4.0.2 may partially support Dwarf3, but it is not complete enough to help with this problem.

© 2009, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. 0209/BLA/CMD/PDF 321522-001