Intel® VTune™ Performance Analyzer 9.1 for Linux* In-Depth

Contents

Is your application too fast? I didn't think so.
Easy to Use
Finding Your Bottleneck Is Easier Than Ever Before
See the Answers on Your Source
Find the Critical Path Using Call Graph Profiling
Low Overhead Sampling Profiling
Filter the data to find your answers
Features
All Architectures
Large Enterprise and HPC Systems
Intel® Itanium® Architecture
New
Profile JavaScript* and Flash* Code
Profile Dynamically Generated Code
Access to VTune Analyzer's Open Data Model
Access to the latest Experimental Technologies
VTune Analyzer Displays What the Compiler Knows
New More Effective Tuning Methodology Supported
New Events for Tuning Multicore Intel® Processors
New Linux Distributions!
Faster Call Graph: Selective Instrumentation for Java* and Native Code
Supports the Latest Intel® Processors
Powerful User Interface Improvements
Tune Inline Functions
One-click Hotspot Navigation
Branch and Call Navigation Made Easy
Create Meaningful Event Labels
Large Enterprise and HPC Systems
Minimize Bus Traffic in Non-uniform Memory Architecture (NUMA) Systems
New for Itanium® Architecture!
Intel® Premier Support

Is your application too fast? I didn't think so.

Intel® VTune™ Performance Analyzer for Linux* is a fully Linux-based solution, indispensable for making your software run its fastest on single and multicore systems. It analyzes applications without recompilation or linking on systems ranging from handhelds to supercomputers. It is robust with large applications (over 1 GB of source code¹) and with multicore, multiprocessor, and NUMA systems using the latest Intel® processors.

• NEW! Now supports the Intel® Core™ i7 processor
• NEW! Performance profiling for dynamically generated code, JavaScript, and Flash. Access to VTune analyzer's Open Data Model.

Easy to Use

VTune analyzer makes application performance tuning easier with a graphical user interface (GUI) based on the Eclipse* development environment‡.

Many developers want to maximize application performance. VTune analyzer gives the developer a view of what's happening as the application is running. It identifies areas that take an inordinate amount of processor time. It also helps identify critical paths in an application where adjustments have maximum benefit. Without VTune analyzer, performance tuning is a guessing game.

Finding Your Bottleneck Is Easier Than Ever Before

Complete one simple dialog box to get a list of the top five time-consuming functions. It's fast and easy to find your performance bottlenecks with a list of the most active functions. Click a function name to display the source and show the most time-consuming source statements.

See the Answers on Your Source

Source and assembly views show you exactly which lines of code are taking the most time. Quickly find the data you need. Click an icon to:

• View source (shown)
• View mixed source and assembly
• View assembly
• View compiler tuning advice
• Go to the hottest line for the selected event
• Go to the next hottest line for the selected event
• Go to the next function

See larger image at: http://www.intel.com/cd/software/products/asmo-na/eng/vtune/320771.htm

Find the Critical Path Using Call Graph Profiling

Call Graph determines calling sequences and graphically displays the critical path. It also shows you the context of the bottleneck. To be effective, you often need to know not only where the application is spending its time, but also how it got there. Unlike other offerings, VTune analyzer provides both sampling and call graph analysis.
Even if you plan to do mostly call graph analysis, running sampling first lets you identify the modules that need it, so you pay Call Graph's larger overhead only for the modules that need to be analyzed. This can be vital on large projects. Sampling is great for analysis of "loopy" code. Call Graph is usually better for "branchy" code. You need both to get the job done right.

View the critical path in red. Quickly locate the critical path and navigate the profiling results easily using both a table and graph view. Click a table entry to highlight the function in the graph, or click the graph to find the table entry.

See larger image at: http://cache-www.intel.com/cd/00/00/32/10/321016_321016.GIF

Low Overhead Sampling Profiling

Event-based sampling finds your bottleneck with very low overhead (typically less than 5 percent). Identify problems such as cache misses and branch mispredictions. Because it is system-wide, event-based sampling can be used to tune libraries and drivers as well as application programs.

Filter the data to find your answers

The table and bar chart views of sampling results filter your data many different ways to find what you need. View by thread (shown) for load balancing.

• Process
• Thread (shown)
• Module
• Hotspot (function)
• Source View
• CPU

Features

All Architectures:

Low Overhead: Accurately identify where the program spends time. Sampling is system-wide with negligible overhead (typically less than 5 percent).

Find the Critical Path: Determine function calling sequences and find the critical path using Call Graph.

No Recompile Required: Unlike traditional instrumented profilers that make you recompile or modify your build script, just use your production executables.

Compatibility: VTune™ Performance Analyzer supports the latest Intel® processors (Intel® 64 architecture-based processors, Intel® Itanium® processors, multicore processors...) and a wide variety of Linux* distributions.

Programming Language and Compiler Independent: VTune analyzer supports all compilers that follow industry standards (ELF, STABS, DWARF).

Mixed Java* and Native Code: Unlike Java*-only analyzers, VTune analyzer tunes mixed Java and native code¹.

Minimal Memory Footprint: Remote profiling minimizes the performance impact on the target system by running the user interface on a separate system.

Command Line Capability: Automate batch operations.

Large Applications Welcome: VTune analyzer is a robust solution even with large executables². If you have a large application with hundreds of thousands of functions, bring it to VTune analyzer.

Listen to the Compiler's Advice: An optimizing compiler can do a lot better with just a few tips from you. We've integrated the Intel® Compilers with VTune analyzer to make this easy and very effective.

Large Enterprise and HPC Systems:

Minimize Traffic in Non-uniform Memory Architecture (NUMA) Systems by storing sampling data in local CPU memory. This is critical to avoid saturating the interconnect fabric and slowing the system under test.

Designed for High Performance Computing: Large High Performance Computing (HPC) systems have unique requirements supported by VTune analyzer.
• Multiple users can share a large system for simultaneous Call Graph performance analyses.
• Sampling is supported on systems with 128 or more³ processors using local buffering per CPU for minimum inter-node contention. To limit the amount of data collected, we recommend selecting a maximum of 64 CPUs for simultaneous data collection.

Intel® Itanium® Architecture:

Eclipse* Based Graphical User Interface: The easy-to-use Eclipse*-based graphical user interface in VTune analyzer is now native on Itanium® architecture.

Instruction Filtered Events Pinpoint Bottleneck Locations: Isolate problems like poor prefetch and poor memory alignment. Sometimes just choosing an event is not selective enough, because the event can occur at both critical and non-critical times. On Intel Itanium architecture, instruction filtering allows you to collect events only when they occur with a specified opcode.

Minimize Data Collection with CPU Selection: Collect only the data you need. CPU selection lets you control exactly where data is collected: from all the processors, only those in your allocation, or only the processors you specify. This greatly reduces the amount of data you need to collect.

New

Note: Features listed as "New" are new since the last major release 8.0. Some have been previewed in minor updates and beta releases.

Profile JavaScript* and Flash* Code

New profiling support in emerging internet browsers and other script-oriented products allows developers working with new JavaScript* or Flash* JIT technologies to analyze their code. Use the VTune analyzer to optimize for scalable performance of these codes on Windows* and Linux* to ensure the best end user experience with your application. VTune analyzer supports profiling JIT'd code when browser vendors add the required support. This enables deep performance analysis of these additional languages:

• JavaScript / AJAX
• Flash (ActionScript)

Check with your browser supplier for details on when their browser will enable support.

Profile Dynamically Generated Code

Many applications today emit their own runtime-generated or just-in-time (JIT) code. New profiling APIs in the VTune analyzer enable performance analysis of dynamic code and allow you to view annotated source code directly from the analysis results.

VTune Analyzer Displays What the Compiler Knows

An optimizing compiler can do a lot better with just a few tips from you. We've integrated the Intel® compilers with Intel® VTune™ Performance Analyzer to make this easy.
The Intel compiler optimization reports contain a wealth of information to make your application faster. VTune analyzer locates your critical, time-consuming "hot spot" and filters the compiler optimization report to show only the lines that apply to the code selected. Now you can see what the compiler optimized and choose pragmas to further improve performance.

Access to VTune Analyzer's Open Data Model

VTune analyzer can now support many different software platforms with performance sampling analysis. Use the new open data model APIs to combine the VTune analyzer's powerful GUI on Windows* or

