Revealing the performance aspects in your code

Three cornerstones of HPC

• Parallelism can be exploited at three levels: message passing, fork/join, SIMD
• Hyperthreading is not quite threading
• Caches have many levels of hierarchy, each with its own latencies and rules
• A popular strategy is to choose one of the first two and let the compiler do the vectorization
• Nodes can have different frequency, arch, # of cores and SIMD width
• Memory is typically non-uniform too
• Latency and bandwidth are different between different nodes
• This view is functional, but performance agnostic
• Different instructions may be available for different types (SP vs. DP)
• Alignment requirements can be strict and conflict with other constraints

First Things First!

Get Baseline Performance (I/III)

Single node: Xeon E5 and Xeon Phi™ separate

Get Baseline Performance (II/III)

Heterogeneous: Xeon E5 + Xeon Phi™

Get Baseline Performance (III/III)

Heterogeneous Cluster: N*(Xeon E5 + Xeon Phi™)

Baseline Results Analysis

• For 1 and 2 nodes we get a 2X speedup for the heterogeneous version vs. Xeon E5.
• For higher node counts Xeon E5 shows a superlinear speedup while Xeon E5 + Xeon Phi™ saturates.
• Potential reasons may be:
- message passing performance
- changing load balance
- suboptimal vectorization for Xeon Phi™

A few BKMs to Try!

BKM: Tune programs with affinity set

• Problem: results are not stable between runs
• Solution: use KMP_AFFINITY to bind threads

Compact

Scatter

Balanced
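The three placement policies named above can be illustrated with a small model. This is only a sketch under simplifying assumptions: `placeThreads` is a hypothetical helper, not part of any OpenMP runtime, and real placement also depends on the machine topology and on the other KMP_AFFINITY modifiers.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified model of thread placement (hypothetical helper, not a runtime API).
// Returns, for each OpenMP thread ID, the index of the physical core it lands on.
std::vector<int> placeThreads(const std::string& policy, int numThreads,
                              int numCores, int hwThreadsPerCore) {
    assert(numThreads <= numCores * hwThreadsPerCore);
    std::vector<int> core(numThreads);
    for (int t = 0; t < numThreads; ++t) {
        if (policy == "compact") {
            // Fill every hardware thread of a core before moving to the next core.
            core[t] = t / hwThreadsPerCore;
        } else if (policy == "scatter") {
            // Round-robin over cores; neighbouring thread IDs end up on different cores.
            core[t] = t % numCores;
        } else {  // "balanced"
            // Spread evenly over cores, but keep neighbouring thread IDs on the same core.
            const int base = numThreads / numCores;   // threads every core receives
            const int extra = numThreads % numCores;  // cores that receive one more
            const int boundary = extra * (base + 1);  // first ID on a "base"-sized core
            core[t] = (t < boundary) ? t / (base + 1)
                                     : extra + (t - boundary) / base;
        }
    }
    return core;
}
```

For 6 threads on 3 cores with 4 hardware threads each, the model gives cores 0,0,0,0,1,1 for compact, 0,1,2,0,1,2 for scatter and 0,0,1,1,2,2 for balanced.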

Check Scalability: Xeon & Xeon Phi

Using timer functions is quick and easy!
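A minimal sketch of such a timer, assuming the standard `std::chrono::steady_clock` is acceptable (OpenMP codes often use `omp_get_wtime()` instead). `bestSeconds` and `sumOfSquares` are hypothetical names; the kernel only stands in for the real computation.

```cpp
#include <cassert>
#include <chrono>
#include <vector>

// Hypothetical kernel standing in for the code under test.
double sumOfSquares(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x * x;
    return s;
}

// Runs `work` `repeats` times and returns the best wall-clock time in seconds.
// Taking the minimum reduces the influence of noise from other processes.
template <typename Work>
double bestSeconds(Work work, int repeats) {
    assert(repeats > 0);
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        work();
        const auto t1 = std::chrono::steady_clock::now();
        const double s = std::chrono::duration<double>(t1 - t0).count();
        if (best < 0.0 || s < best) best = s;
    }
    return best;
}
```

Calling `bestSeconds([&] { result = sumOfSquares(v); }, 5)` for each thread count gives the raw numbers for a scalability plot. Storing the kernel result in a variable keeps the compiler from removing the work.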

Xeon performance doesn’t scale beyond 1 socket. This needs to be investigated!

A 100-thread run on the 60-core KNC gives the best performance. Why?

Software & Services Group, Developer Products Division

Copyright © 2014, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Loop Profiler - Identify Time Consuming Loops / Functions to Optimize

Enables targeting parallelization/optimization efforts to the most significant code areas (hotspot identification)
• Easy to use:
– Use compiler switches to add instrumentation to the application
– Compiler instruments entry and exits of all loops and functions

icc -O1 -profile-functions -profile-loops=all -profile-loops-report=2…

– Running the application generates a report file with resulting counts
– Both a human-readable text file (a table) and an XML file are generated
– Analyze data by looking at the raw text file, or use the GUI viewer shipped with the compiler
• Report file contains information such as:
– Call count of routines
– Self-time of functions / loops
– Total-time of functions / loops
– Average, minimum, maximum iteration count of loops

Loop Profile Data Viewer GUI

Original Parallel Version: Not vectorized… Check Compiler VEC Reports!

10 static inline size_t posToIdx(const size_t width, const Position& pos){
11     return (pos.y * width) + pos.x;
12 }
13
14 static void subtractPSF(const float* psf, ...) {
   ...
25     #pragma omp parallel for default(shared)
26     for (int y = starty; y <= stopy; ++y) {
27         for (int x = startx; x <= stopx; ++x) {
28             residual[posToIdx(residualWidth, Position(x, y))] -= gain * absPeakVal
29                 * psf[posToIdx(psfWidth, Position(x - diffx, y - diffy))];
30         }
31     }
32 }
33 ...

The compiler can’t identify the index pattern, so it is unable to vectorize the loop at line 27: the index computation is too complex.
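One common way to expose the access pattern is to hoist the row offsets out of the inner loop so that it walks both arrays with unit stride. The sketch below is an assumption about how this could look, not necessarily the fix applied in the original study. The function name `subtractPSFRows` and its full signature are hypothetical; the parameter names mirror the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Subtracts gain * absPeakVal * psf from residual over the window
// [startx, stopx] x [starty, stopy]. The signature is assumed for illustration.
void subtractPSFRows(const float* psf, std::size_t psfWidth,
                     float* residual, std::size_t residualWidth,
                     int startx, int stopx, int starty, int stopy,
                     int diffx, int diffy, float gain, float absPeakVal) {
    const float scale = gain * absPeakVal;  // loop-invariant factor
    const int count = stopx - startx + 1;   // elements per row
    #pragma omp parallel for default(shared)
    for (int y = starty; y <= stopy; ++y) {
        // Row base pointers are computed once per row.
        float* resRow =
            residual + static_cast<std::size_t>(y) * residualWidth + startx;
        const float* psfRow =
            psf + static_cast<std::size_t>(y - diffy) * psfWidth + (startx - diffx);
        // Unit-stride inner loop with a trivial index expression.
        for (int i = 0; i < count; ++i) {
            resRow[i] -= scale * psfRow[i];
        }
    }
}
```

Whether the compiler then vectorizes the inner loop still depends on its aliasing analysis, so the vectorization report should be checked again after such a change.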

Replacing the OpenMP critical section with a serial reduction loop: Better Scalability!

Original Code

#pragma omp parallel
{
    // Compute per-thread peak
    float threadAbsMaxVal = 0.0;
    int threadAbsMaxPos = 0;

    #pragma omp for schedule(static)
    for (int i = 0; i < size; ++i) {
        // std::abs from <cmath> selects the floating-point overload
        if (std::abs(image[i]) > threadAbsMaxVal) {
            threadAbsMaxVal = std::abs(image[i]);
            threadAbsMaxPos = i;
        }
    }

    // Find global peak across all threads
    #pragma omp critical
    if (threadAbsMaxVal > maxVal) {
        maxVal = threadAbsMaxVal;
        maxPos = threadAbsMaxPos;
    }
}
maxVal = image[maxPos];

Optimized Code

#pragma omp parallel
{
    // Compute per-thread peak
    float threadAbsMaxVal = 0.0;
    size_t threadAbsMaxPos = 0;

    #pragma omp for schedule(static)
    for (int i = 0; i < size; ++i) {
        if (std::abs(image[i]) > threadAbsMaxVal) {
            threadAbsMaxVal = std::abs(image[i]);
            threadAbsMaxPos = i;
        }
    }

    // Store per-thread peak
    int t_num = omp_get_thread_num();
    temp_Peak[t_num] = threadAbsMaxVal;
    temp_Pos[t_num] = threadAbsMaxPos;
}
// Find global peak across all threads
for (int k = 0; k < num_t; k++) {
    if (temp_Peak[k] > maxVal) {
        maxVal = temp_Peak[k];
        maxPos = temp_Pos[k];
    }
}
maxVal = image[maxPos];

Performance Gain: 1.12X (KNC 48C)
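The optimized variant can be sketched as a self-contained function. The names `Peak` and `findPeak` are hypothetical, and the `_OPENMP` guards are an addition so the sketch also compiles and runs serially when OpenMP is disabled.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

struct Peak {
    float value;      // signed value at the peak position
    std::size_t pos;  // index of the element with the largest magnitude
};

Peak findPeak(const std::vector<float>& image) {
    assert(!image.empty());
    int numThreads = 1;
#ifdef _OPENMP
    numThreads = omp_get_max_threads();
#endif
    std::vector<float> tempPeak(numThreads);       // per-thread peak magnitude
    std::vector<std::size_t> tempPos(numThreads);  // per-thread peak index
    const long size = static_cast<long>(image.size());

    #pragma omp parallel
    {
        float threadAbsMaxVal = 0.0f;
        std::size_t threadAbsMaxPos = 0;
        #pragma omp for schedule(static)
        for (long i = 0; i < size; ++i) {
            const float a = std::abs(image[i]);
            if (a > threadAbsMaxVal) {
                threadAbsMaxVal = a;
                threadAbsMaxPos = static_cast<std::size_t>(i);
            }
        }
        int t = 0;
#ifdef _OPENMP
        t = omp_get_thread_num();
#endif
        // Each thread writes only its own slot, so no lock is needed.
        tempPeak[t] = threadAbsMaxVal;
        tempPos[t] = threadAbsMaxPos;
    }

    // Serial reduction over the per-thread results.
    float maxAbs = 0.0f;
    std::size_t maxPos = 0;
    for (int k = 0; k < numThreads; ++k) {
        if (tempPeak[k] > maxAbs) {
            maxAbs = tempPeak[k];
            maxPos = tempPos[k];
        }
    }
    return {image[maxPos], maxPos};
}
```

The serial loop touches only one entry per thread, so its cost is small compared with the serialization a critical section imposes on many-core parts. Neighbouring slots of the temporary arrays may share a cache line, but each slot is written only once per thread.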

Profiling MPI + OpenMP Heterogeneous Execution…

Profile using Intel® Cluster Studio XE: 10 MPI Xeon + 22 MPI Xeon Phi x 11 OpenMP

Unbalanced Load

Profile using Intel® Cluster Studio XE: 12 MPI Xeon + 20 MPI Xeon Phi x 12 OpenMP

Balanced Load

Profiling using… Intel® VTune™ Amplifier XE

Intel® VTune™ Amplifier XE Performance Profiler

Where is my application…

Spending Time?
• Focus tuning on functions taking time
• See call stacks
• See time on source

Wasting Time?
• See cache misses on your source
• See functions sorted by # of cache misses

Waiting Too Long?
• See locks by wait time
• Red/Green for CPU utilization during wait

• Windows & Linux
• Low overhead
• No special recompiles

Advanced Profiling For Scalable Multicore Performance

Intel® VTune™ Amplifier XE: Tune Applications for Scalable Multicore Performance

• Fast, Accurate Performance Profiles
– Hotspot (Statistical call tree)
– Call counts (Statistical)
– Hardware-Event Sampling
• Thread Profiling
– Visualize thread interactions on timeline
– Balance workloads
• Easy set-up
– Pre-defined performance profiles
– Use a normal production build
• Find Answers Fast
– Filter extraneous data
– View results on the source / assembly
• Compatible
– Microsoft, GCC, Intel compilers
– C/C++, Fortran, Assembly, .NET, Java
– Latest Intel® processors and compatible processors (1)
• Windows or Linux
– Visual Studio Integration (Windows)
– Standalone user i/f and command line
– 32 and 64-bit

(1) IA32 and Intel® 64 architectures. Many features work with compatible processors. Event based sampling requires a genuine Intel® Processor.

Intel® VTune™ Amplifier XE: Get a quick snapshot

4 cores

CPU Usage

Thread Concurrency

Intel® VTune™ Amplifier XE: Identify hotspots

Hottest Functions Hottest Call Stack

Quickly identify what is important

Intel® VTune™ Amplifier XE: Identify threading inefficiency

Coarse Grain Locks

High Lock Contention Low Concurrency

Load Imbalance

Intel® VTune™ Amplifier XE: Find Answers Fast

Adjust Data Grouping

… (Partial list shown) Click [+] for Call Stack

Double Click Function to View Source

Filter by Timeline Selection (or by Grid selection)

Filter by Module & Other Controls

Intel® VTune™ Amplifier XE: Timeline Visualizes Thread Behavior

Transitions; CPU Time; Locks & Waits; Hotspots; Lightweight Hotspots

Hovers:

• Optional: Use API to mark frames and user tasks
• Optional: Add a mark during collection

Intel® VTune™ Amplifier XE: See Profile Data On Source / Asm

Time on Source / Asm

Quick Asm navigation: Select source to highlight Asm

Right click for instruction reference manual Quickly scroll to hot spots.

Click jump to scroll Asm

High-level Features

Intel® VTune™ Amplifier XE Feature Highlights

• Basic Hot Spot Analysis (Statistical Call Graph)
– Locates the time consuming regions of your application
– Provides associated call-stacks that let you know how you got to these time consuming regions
– Call-tree built using these call stacks
• Advanced Hotspot and architecture analysis
– Based on Hardware Event-based Sampling (EBS)
– Pre-defined tuning experiments
• Thread Profiling
– Visualize thread activity and lock transitions in the timeline
– Provides lock profiling capability
– Shows CPU/Core utilization and concurrency information
• GPU Compute Performance Analysis
– Collect GPU data for tuning OpenCL applications. Correlate GPU and CPU activities
• CPU Power Efficiency Analysis
– Wake-up rate and frequency measurement per Core

Intel® VTune™ Amplifier XE Feature Highlights (continued)

• Attach to running processes
– Hotspot and Concurrency analysis modes can attach to running processes
• System wide data collection
– EBS modes allow system wide data collection and the tool provides the ability to filter this data
• GUI
– Standalone GUI available on Windows* and Linux
– Microsoft* Visual Studio integration
• Command Line
– Comprehensive support for regression analysis and remote collection
• Platform & application support
– Windows* and Linux (Android, , Yocto – in the ISS)
– Microsoft* .NET/C# applications
– Java* and mixed applications
– Fortran applications

Intel® VTune™ Amplifier XE Feature Highlights (continued)

• Event multiplexing
– Gather more information with each profiling run
• Timeline correlation of thread and event data
– Populates thread active time with event data collected for that thread
– Ability to filter regions on the timeline
• Advanced Source / Assembler View
– See event data graphed on the source / assembler
– View and analyze assembly as basic blocks
– Review the quality of vectorization in the assembly code display of your hot spot
• Provides pre-defined tuning experiments
– Predefined profiles for quick analysis configuration
– A user profile can be created on the basis of a predefined profile
• User API
– Rich set of user APIs for collection control, event highlighting, code instrumentation, and visualization enhancement

Data Collectors and Analysis Types

Intel® VTune™ Amplifier XE Analysis Types (based on technology)

Software Collector (any x86 processor, any virtual, no driver)
• Basic Hotspots: Which functions use the most time?
• Concurrency: Tune parallelism. Colors show number of cores used.
• Locks and Waits: Tune the #1 cause of slow threaded performance, waiting with idle cores.

Hardware Collector (higher res., lower overhead, system wide)
• Advanced Hotspots: Which functions use the most time? Where to inline? (Statistical call counts)
• General Exploration: Where is the biggest opportunity? Cache misses? Branch mispredictions?
• Advanced Analysis: Dig deep to tune bandwidth, cache misses, access contention, etc.

Intel® VTune™ Amplifier XE Pre-defined Analysis Types

Advanced Hotspot analysis based on the underlying architecture

User mode sampling, Threading, IO, Signaling API instrumentation

3rd Generation Core Architecture (a.k.a SandyBridge) analysis types

4th Generation Core Architecture (a.k.a Haswell) analysis types

GUI Layout

GUI Layout: Creating a Project

GUI Layout: Selecting type of data collection

All available analysis types

Different ways to start the analysis

Helps creating new analysis types

Copy the command line to clipboard

Profile a Running Application

No need to stop and re-launch the app when profiling

Two Techniques:
• Attach to Process:
- Any type of analysis
• Profile System:
- Advanced Hotspots & Custom EBS
- Optional: Filter by process after collection

GUI Layout: Summary View

Clicking on the Summary tab shows a high level summary of the run

Timing for the whole application run

List of 5 Hotspot functions

CPU Usage

GUI Layout: Bottom-Up View

Menu and Tool bars

Analysis Type

Viewpoint currently being used

Tabs within each result

Current grouping

Grid area

Stack Pane

Filter area

Timeline area

GUI Layout: Top-Down View

Clicking on the Top-Down Tree tab changes stack representation in the Grid

Top-level function and its tree

Total Time (self + children’s)

Self Time
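The relation between the two columns of the top-down tree can be sketched with a small recursive function. `CallNode` and `totalTime` are hypothetical names for illustration and not part of any VTune API.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One node of a call tree: a function with the time spent in its own code.
struct CallNode {
    std::string name;
    double selfTime;                 // seconds spent in the function body itself
    std::vector<CallNode> children;  // callees reached from this function
};

// Total time = self time + total time of all children.
double totalTime(const CallNode& node) {
    assert(node.selfTime >= 0.0);
    double total = node.selfTime;
    for (const CallNode& child : node.children) {
        total += totalTime(child);
    }
    return total;
}
```

For a tree in which main (1 s self) calls foo (2 s self, which calls bar with 3 s self) and baz (4 s self), the total time is 10 s for main and 5 s for foo.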

GUI Layout: Caller/Callee View

Select a function in the Bottom-Up and find the caller/callee

List of functions sorted by CPU Time

List of callers and their stacks

List of callees and their stacks

Results Comparison

Intel® VTune™ Amplifier XE Terminology

• Elapsed Time: The total time your target application ran, i.e. wall clock time at end of application minus wall clock time at start of application.

• CPU Time: The amount of time a thread spends executing on a logical processor. For multiple threads, the CPU time of the threads is summed.

• Wait Time: The amount of time that a given thread waited for some event to occur, such as synchronization waits and I/O waits.

Intel® VTune™ Amplifier XE CPU Usage

[Figure: timeline of Thread1, Thread2 and Thread3 over six 1-second intervals, each marked as running or waiting, next to a CPU Usage histogram with bins 0 to 4.]

• Elapsed Time: 6 seconds
• CPU Time: T1 (4s) + T2 (3s) + T3 (3s) = 10 seconds
• Wait Time: T1 (2s) + T2 (2s) + T3 (2s) = 6 seconds
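The arithmetic of this example can be reproduced with a few lines of code. The types and function names below are hypothetical. `averageCpuUsage` is a derived quantity added for illustration (CPU time divided by elapsed time), not a metric name taken from the tool.

```cpp
#include <cassert>
#include <vector>

struct ThreadTimes {
    double running;  // seconds spent executing on a logical processor
    double waiting;  // seconds spent blocked on synchronization or I/O
};

struct Totals {
    double cpuTime;   // sum of running time over all threads
    double waitTime;  // sum of waiting time over all threads
};

// CPU time and wait time are summed over threads, as in the definitions above.
Totals sumTimes(const std::vector<ThreadTimes>& threads) {
    Totals t{0.0, 0.0};
    for (const ThreadTimes& th : threads) {
        t.cpuTime += th.running;
        t.waitTime += th.waiting;
    }
    return t;
}

// Average number of logical processors kept busy during the run.
double averageCpuUsage(double cpuTime, double elapsed) {
    assert(elapsed > 0.0);
    return cpuTime / elapsed;
}
```

With the three threads of the example (4 s, 3 s and 3 s running, 2 s waiting each) the sums are 10 s of CPU time and 6 s of wait time. Over 6 s of elapsed time this corresponds to about 1.67 busy logical processors on average.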

CPU Usage

Summary View: CPU Usage Histogram

Only CPU Time is measured; Wait Time is not counted in Hotspots.

Bottom-Up View: CPU Time

Function     CPU Time by CPU Utilization
My_Func()    10 s

Overhead and spin

[Figure: the same three-thread timeline over six 1-second intervals, with additional segments for threading library internals ("lib") and spin waits, distinguished from user code running and from waiting.]

• Elapsed Time: 6 seconds
• CPU Time: T1 (4s) + T2 (3s) + T3 (3s) = 10 seconds
• Wait Time: T1 (2s) + T2 (2s) + T3 (2s) = 6 seconds
• Overhead and Spin Time: T1 (1s) + T2 (1s) + T3 (1s) = 3 seconds

CPU Usage

Summary View: CPU Usage Histogram

Overhead and Spin Time is not counted for CPU Usage

Bottom-Up View: CPU Time

Function     CPU Time by CPU Utilization     Overhead and Spin Time
My_Func()    10 s                            4 s

Hotspot analysis

• Displays hot functions in your application
• Shows most time consuming call sequences
– Statistical Call Graph
• Includes timeline view of threads in your application

Start the Analysis

Basic Hotspot Analysis

Hotspot analysis: Summary

Note Elapsed Time and CPU Time

Hotspot analysis: Summary (Continued)

Note overall CPU Usage

Note # of CPUs Available on the platform

Hotspots analysis: Hotspot functions

Hotspot Functions Change Viewpoint Adjust Data Grouping

Function CPU time … (Partial list shown)

Call stack Click [+] for Call Stack Thread timeline

Filter by Timeline Selection (or by Grid Selection)

Filter by Module & Other Controls

Hotspots analysis: Hotspot functions by CPU usage

Double Click Function to View Source

Coloring CPU Time by CPU Utilization

Overhead and Spin Time

Overhead and Spin on Timeline

Hotspots analysis: Source View

Source View Assembly View

Self and Total Time on Source / Asm Right click for instruction reference manual

Quick Asm navigation: Click jump to scroll Asm Select source to highlight Asm

Quickly scroll to hot spots. Scroll Bar “Heat Map” is an overview of hot spots

Advanced Hotspot analysis

• Uses Intel’s CPU hardware performance collectors
• Higher resolution of sampling (~1 /ms)
• System wide analysis (all processes running in a system)
• OS modules and drivers profiling (ring 0 level)
• OS context switches and threads synchronization issues

Start the Analysis

Advanced Hotspot Analysis

Select level of data collected

Parallelism: Methodology of performance profiling and tuning

How to optimize the Hotspots?
• Maximize CPU utilization and minimize elapsed time
• Ensure CPU is busy all the time
• All cores busy: parallelism (high concurrency)

[Figure: bar chart comparing Elapsed Time (Serial) with Elapsed Time (N-threads, T1 to T4) and with the 4-thread optimum, Elapsed (Serial) / N; the differences are labeled Gain and Potential Gain.]
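The quantities in this comparison can be written down as three small functions. This is a sketch under the assumption that Gain means serial minus measured parallel elapsed time and Potential Gain means measured parallel elapsed time minus the optimum Elapsed (Serial) / N. All function names are hypothetical.

```cpp
#include <cassert>

// Best possible elapsed time with n threads, assuming perfect scaling.
double optimalElapsed(double serialElapsed, int n) {
    assert(n > 0);
    return serialElapsed / n;
}

// Time already saved by the parallel version.
double gain(double serialElapsed, double parallelElapsed) {
    return serialElapsed - parallelElapsed;
}

// Time that could still be saved if the run scaled perfectly.
double potentialGain(double serialElapsed, double parallelElapsed, int n) {
    return parallelElapsed - optimalElapsed(serialElapsed, n);
}
```

For example, a 12 s serial run that takes 5 s on 4 threads has an optimum of 3 s, a gain of 7 s and a remaining potential gain of 2 s.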

Intel® VTune™ Amplifier XE Terminology

Concurrency is a measurement of the number of active threads.

[Figure: timeline of Thread1, Thread2 and Thread3 over six 1-second intervals, each marked as running or waiting, next to a Concurrency Summary histogram with bins 0 to 4.]
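A concurrency summary of this kind can be derived from per-thread activity. This is a minimal sketch in which time is cut into equal slices. `concurrencyHistogram` is a hypothetical helper, and the activity pattern in the usage note is made up rather than read off the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// active[t][s] is true when thread t is running during time slice s.
// Returns hist, where hist[c] is the number of slices with exactly c running threads.
std::vector<int> concurrencyHistogram(
        const std::vector<std::vector<bool>>& active) {
    const std::size_t numThreads = active.size();
    const std::size_t numSlices = numThreads ? active[0].size() : 0;
    std::vector<int> hist(numThreads + 1, 0);
    for (std::size_t s = 0; s < numSlices; ++s) {
        std::size_t running = 0;
        for (std::size_t t = 0; t < numThreads; ++t) {
            assert(active[t].size() == numSlices);
            if (active[t][s]) ++running;
        }
        ++hist[running];  // one more slice at this concurrency level
    }
    return hist;
}
```

For three threads with activity patterns 110011, 011001 and 010011 over six slices, the histogram is 1 slice at concurrency 0, 2 slices at 1, 1 slice at 2 and 2 slices at 3.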

Intel® VTune™ Amplifier XE Parallelism/Concurrency Analysis

• For Parallelism / Concurrency analysis,
– Stack sampling is done just like in Hotspots analysis
– Wait functions are instrumented (e.g. WaitForSingleObject, EnterCriticalSection)
– Signal functions are instrumented (e.g. SetEvent, LeaveCriticalSection)
– I/O functions are instrumented (e.g. ReadFile, socket)

Start the Analysis

Concurrency Analysis

Concurrency Analysis Summary

Concurrency Levels

Adjustable Metrics

Concurrency Analysis Summary: Concurrency vs. CPU Usage Histogram

Threads might be in active state, but not using CPU

11/26/2014

Concurrency View

Concurrency Level

Overhead

Wait

Thread is waiting

Thread is running

Thread Transitions

Concurrency Timeline

Investigate reasons for transitions

Select and Zoom

Hover over a transition line

Source Code View by Concurrency

Concurrency coloring for CPU Time against source lines

Waiting on locks

[Figure: timeline from the beginning to the end of the main thread over six 1-second intervals; Thread1, Thread2 and Thread3 alternate between running, waiting on a sync object until it is signaled, and idle. Used for calculating Wait and Idle time.]

Intel® VTune™ Amplifier XE Locks and Waits Analysis

• Identifies those threading items that are causing the most thread block time
– Synchronization locks
– Threading APIs
– I/O

Start the Analysis

Locks & Waits Analysis

Locks and Waits View

Grouping by Sync Object

Wait Objects; # Waits; CPU Utilization; Spinning; Stack for the wait object

Locks-and-Waits Source View

Wait count

Waiting time on the Critical Section object

Critical Section

Intel® VTune™ Amplifier XE User APIs

User APIs
• Collection Control API
• Thread Naming API
• User-Defined Synchronization API
• Task API
• User Event API
• Frame API
• JIT Profiling API

Windows & Linux Versions Available

Stand-alone GUI, Command line, Visual Studio Integration

Microsoft Windows* OS
– Windows XP*, Windows 7*, Windows 8 Desktop*
– Windows Server* 2003, 2008
– Microsoft Visual Studio* 2008, 2010 and 2012
– Standalone GUI and command line
– IA32 and Intel® 64

Linux* OS
– RHEL*, Fedora*, SUSE*, CentOS*, Ubuntu*
– Additional distributions may also work
– Standalone GUI and command line
– IA32 and Intel® 64

Single user and floating licenses available

Intel® VTune™ Amplifier XE Command Line Interface - Examples

• Display a list of available analysis types and preset configuration levels

amplxe-cl -collect-list

• Run Hot Spot analysis on target myApp and store result in default-named directory, such as r000hs

amplxe-cl -c hotspots -- myApp

• Run the Parallelism analysis, store the result in directory r001par

amplxe-cl -c parallelism -result-dir r001par -- myApp

Intel® VTune™ Amplifier XE Command Line Interface - gprof-like output

Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804
