PRACE Summer School, CINECA, 8-11 July 2013

Debugging and Profiling on Intel® Xeon Phi™

Hans Pabst, July 2013
Software and Services Group
Intel Corporation

Agenda

Debugging

• Compiler Debug Features

• GNU* Project Debugger

• Intel® Inspector

Profiling

• Compiler Profiling Features

• Intel® VTune™ Amplifier

Demonstration


Copyright© 2013, Intel Corporation. All rights reserved. 10.07.2013
*Other brands and names are the property of their respective owners.

Compiler Debug Features

Static Analysis (SA): icc -diag-enable scn
• Customize analysis level (and other adjustments)
• Textual and Inspector-based reports
• Issue tracking via Inspector GUI

Pointer Checker (PL): icc -check-pointers=rw
• Further option adjustments possible
• No ABI changes despite the added bounds information
• Intrinsics / API for custom memory allocation
• Rigorous checks; failure behavior adjustable
• Debugger integration

Intel® Inspector: Static Analysis

Analysis: 250 error types
• Incorrect directives
• Security errors

Reports and collaboration
• Choose your priority:
  - Minimize false errors
  - Maximize error detection
• Hierarchical navigation
• Share comments with team

Code Complexity Metrics
• Find code likely to be less reliable

The GNU* Project Debugger (GDB)

An enhanced build of GNU* GDB 7.5 is included in the Intel® Manycore Platform Software Stack (Intel® MPSS)

• http://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss

• Source code available via installation option

• Released back to the GNU* community

• Native and cross/remote debugger versions

• C/C++ support, improved support

• Intel® Parallel Debugger Extension

Eclipse* Integration

Eclipse* IDE integration
• Seamless debugging of host and coprocessor
• Simultaneous view of host and coprocessor threads
• Supports offload language extensions (auto-attach to offload process)
• Supports multiple coprocessor cards
• Supports C/C++ and Fortran

Simultaneous and seamless thread debugging.

Eclipse* Integration (cont.)

Install Eclipse IDE for C/C++ Developers
• Available from http://www.eclipse.org

Integrate Intel® Compilers and Intel® Xeon Phi™
• Start Eclipse, select Help > Install New Software
• Uncheck "Group items by category"
• Use Add > Local, and select folder: /opt/intel/composerxe/eclipse_support/cdt8.0/eclipse
• Click Select All and Finish.

Intel® Xeon Phi™ Coprocessor Architecture Features

List all new vector and mask registers
(gdb) info registers zmm
k0 0x0 0
⁞
zmm31 {v16_float = {0x0 <repeats 16 times>}, v8_double = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v64_int8 = {0x0 <repeats 64 times>}, v32_int16 = {0x0 <repeats 32 times>}, v16_int32 = {0x0 <repeats 16 times>}, v8_int64 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_uint128 = {0x0, 0x0, 0x0, 0x0}}

Disassemble instructions
(gdb) disassemble $pc, +10
Dump of assembler code from 0x11 to 0x24:
0x0000000000000011 : vpackstorelps %zmm0,-0x10(%rbp){%k1}
0x0000000000000018 : vbroadcastss -0x10(%rbp),%zmm0

GDB: Intel® Xeon Phi™ Coprocessor

Debug server and host debugger
/usr/linux-k1om-4.7/linux-k1om/usr/bin/gdbserver
/usr/linux-k1om-4.7/bin/x86_64-k1om-linux-gdb

Native debugger (no Parallel DeBug eXtension)
/usr/linux-k1om-4.7/linux-k1om/usr/bin/gdb

Host debugger
/opt/intel/mic/bin/gdb

GDB: Native Debugging

Run GDB* on the Intel® Xeon Phi™ Coprocessor
ssh -t mic0 /path/on/mic/to/gdb

To attach to a running application via the process ID
(gdb) shell pidof my_application
42
(gdb) attach 42

To run an application directly from GDB*
(gdb) file /target/path/to/application
(gdb) start

GDB: Remote Debugging

Run GDB* on your local host
/usr/linux-k1om-4.7/bin/x86_64-k1om-linux-gdb

Start gdbserver on the coprocessor
– To remote debug using standard I/O redirection
  (gdb) target extended-remote | ssh -T mic0 gdbserver --multi -
– To set a custom environment, replace gdbserver by e.g.:
  env LD_LIBRARY_PATH=/tmp:$LD_LIBRARY_PATH gdbserver

Attach to a running application via process ID (pid)
(gdb) file /local/path/to/application
(gdb) attach <pid>

Run an application directly
(gdb) file /local/path/to/application
(gdb) set remote exec-file /target/path/to/application

GDB: Offload Debugging

Debugging into an offloaded code section on the host does not “switch” to a debugger on the target
• No debug synchronization (host / coprocessor)
• GUI integration will provide this “glue” logic; see “Eclipse* Integration”

Debugging offloaded code via command line
1. Wait within the offloaded code section
   volatile int loop = 1;
   do {
       volatile int a = 1;
   } while (loop);
2. Attach to offload process on coprocessor via PID

Note: cross-compiling the entire application and debugging the previously offloaded section natively might be easier.

GDB: Detect and Debug Data Races

Data race: a data race occurs when two ordinary simultaneous accesses to the same scalar, at least one of which is a write, execute in different parallel regions. [Hans-J. Boehm, WG21/N2480]

Tools are needed to detect and debug data races*:

• GNU* GDB with parallel debug extension

• Intel® Inspector

* Remember: single-threaded (sequential) execution cannot reproduce data races, whereas multiple threads can race even on a single core.

GDB: Data Race Symptoms

Varying or wrong results*
• The actual result corresponds to one of the possible (but different) ways to interleave the instructions of the parallel sections (“sequential consistency”)

Memory corruption or crash
• An index (or pointer data) is subject to a data race, e.g., a bookkeeping structure is concurrently modified and left in an inconsistent state (mix of different updates)

* Freedom from data races is a prerequisite for reproducible numerical results, but not the only one; e.g., a deterministic execution order is needed as well.

GDB: Data Race Example

Given: global variables
a = 1
b = 2

Given: two threads
T1: x = a + b
T2: b = 42

Value of x depends on execution order:
If T1 runs before T2 → x = 3
If T2 runs before T1 → x = 43

Data race e.g., “read-write”: T2’s update was not visible to T1’s calculation
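The two outcomes can be reproduced deterministically by replaying both execution orders. The following Python sketch is illustrative only: the names `t1`, `t2`, and `replay` are assumptions that do not appear in the slides, and the interleavings are simulated sequentially rather than with real threads.

```python
def t1(state):
    # T1: x = a + b (reads the shared variable b)
    state["x"] = state["a"] + state["b"]


def t2(state):
    # T2: b = 42 (writes the shared variable b)
    state["b"] = 42


def replay(order):
    """Run the two thread bodies in the given order on fresh globals."""
    state = {"a": 1, "b": 2, "x": None}
    for step in order:
        step(state)
    return state["x"]


x_t1_first = replay([t1, t2])  # T1 reads b before T2 updates it
x_t2_first = replay([t2, t1])  # T1 reads b after T2 updated it
print(x_t1_first, x_t2_first)
```

Without synchronization, a real run may produce either value, which is why the result varies from run to run.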

GDB: Data Race Detection

How to detect data races?

Compile with Intel Compiler:
icpc -debug parallel

Debugger breaks when race has been detected:
(gdb) pdbx enable
(gdb) run
data race detected
1: write shared, 4 bytes from foo.c:36
3: read shared, 4 bytes from foo.c:40

Breakpoint -11, 0x401515 in L_test_..._21 () at foo.c:36
*var = 42; /* bp.write */

Stop in the context of the access that triggers a race condition

GDB: Data Race Debugging

Fine-tune detection and analysis via filter sets
• Add filter to selected filter set
  (gdb) pdbx filter line foo.c:36
  (gdb) pdbx filter code 0x40518..0x40524
  (gdb) pdbx filter var shared
  (gdb) pdbx filter data 0x60f48..0x60f50
  (gdb) pdbx filter reads # read accesses
• Ignore events specified by filters (default behavior)
  (gdb) pdbx fset suppress
• Ignore events not specified by filters
  (gdb) pdbx fset focus
• Get debug command help (pdbx)
  (gdb) help pdbx

Use cases
• Focused debugging, e.g., debug a single symptom
• Limit overhead and control false positives

GDB: Data Race Detection Limitations

Data race detection needs instrumented threading runtimes in order to capture synchronization events.

Enhanced build of GNU* GDB 7.5 supports:
• Intel OpenMP
• Pthreads

How to avoid false positives?
• Use filter mechanism or supported threading runtime
• Use Intel® Inspector
  – Supports all threading runtimes from Intel and Pthreads
  – Good support for other threading runtimes

GDB: Data Race Detection Usage Hints

Optimized code (symptom)
(gdb) run
data race detected
1: write question, 4 bytes from foo.c:36
3: read question, 4 bytes from foo.c:40
Breakpoint -11, 0x401515 in foo () at foo.c:36
36 *answer = 42;
(gdb)

Reported variable may appear to be wrong
• Remember: data races are reported on memory objects
• If the symbol name cannot be resolved, only the address is printed

Recommendation
• Use unoptimized code (-O0): this avoids missing data races (due to removed / optimized-away temporaries, etc.)

Intel® Inspector

Manage results of static analysis
• Issue tracking

Runtime issue analysis
• Memory issue detection
  – OOB accesses (heap, stack)
  – Uninitialized memory usage
  – Inconsistent alloc. and leaks
• Parallelism issue detection
  – Deadlocks (locked forever)
  – Livelocks (locking forever)
  – Data races

Command line and remote issue collection (regression tests)
$ inspxe-cl \
    -no-auto-finalize \
    -collect ti3 \
    -knob scope=extreme \
    -r myresults -- mybin

$ inspxe-cl -finalize \
    -r myresults

$ inspxe-cl \
    -report problems \
    -r myresults


Loop Profiler

Identify Time Consuming Loops/Functions

Compiler switch: /Qprofile-functions, -profile-functions
• Inserts instrumentation calls at function entry and exit points to collect the cycles spent within the function.

Compiler switch: /Qprofile-loops=, -profile-loops=
• Inserts instrumentation calls at function entry and exit points, as well as before and after instrumentable loops of the type listed as the option’s argument.

GUI-based data viewer utility: loopprofileviewer.sh
• Input is the generated XML output file, named loop_prof_.xml

Loop Profiler Text Dump (.dump file)

Loop Profiler Data Viewer GUI

[Figure: Loop Profiler data viewer, showing the Function Profile View and the Loop Profile View]
• Menu allows the user to enable filtering or to display the source code
• Column headers select the sort criteria independently for the function and loop tables

Intel® VTune™ Amplifier 2013 Low Overhead Java* Profiling

Low Overhead & Precise
• Sampling is fast / unobtrusive
• Fast hardware sampling (with optional stacks!)
• Advanced profiles are unique (cache misses, bandwidth…)

Versatile & Easy to Use
• Multiple simultaneous JVMs
• Mixed Java / C++
• See results on the Java source

Intel® VTune™ Amplifier XE Feature Highlights

Hot Spot Analysis (Statistical Call Graph)
Where is the application spending time and how did it get there?
• Faster and more reliable than older instrumentation-based exact call graph

Hardware Event-based Sampling (EBS)
Where are the tuning opportunities? (e.g., cache misses)
• Improved usability
• Pre-defined tuning experiments

Thread Profiling
Where is my concurrency poor and why?
• Thread timeline visualizes thread activity and lock transitions
• Integrated EBS data tells you exactly what’s happening and when

Intel® VTune™ Amplifier XE Feature Highlights

Event multiplexing
• Gather more information with each profiling run

Timeline correlates thread and event data
• See what active threads are doing
• Filter profile results by selecting a region in the timeline

Advanced Source / Assembler View
• See event data graphed on the source / assembler
• View and analyze assembly as basic blocks

Intel® VTune™ Amplifier XE Feature Highlights

Collect System Wide Data & Attach to Running Processes
• EBS collects system-wide data; filter it to find what you need
• Hot Spot and Concurrency Analyses can attach to a running process

GUI & Command Line
• Stand-alone GUI, Command Line, Microsoft* Visual Studio Integration
• GUI makes setup and analysis easy
• Command line for regression analysis and collection on remote systems

Extended Platform Coverage
• Windows* and Linux
• Microsoft* .NET* C# applications

Intel® VTune™ Amplifier XE Analysis Types (based on technology)

Hot Spot Analysis
• Sampling with stacks

Parallelism/Concurrency Analysis
• Locks and Waits finds the problems with parallelism

Hardware Event-based Sampling and Counting
• Lightweight Hotspots (pre-defined)
• Advanced Analysis Types (pre-defined)
  . General Exploration
  . Memory Access
  . Bandwidth
  . Cycles and uOps
• Advanced Analysis Types (created by a user)

Selecting type of data collection

[Figure: analysis type selection dialog, with callouts:]
• All available analysis types
• Different ways to start the analysis
• Copy correct command line syntax to clipboard
• Helps creating new analysis types

Intel® VTune™ Amplifier XE Terminology

CPU Time
The amount of time a thread spends executing on a logical processor. For multiple threads, the CPU time of the threads is summed.

Unused CPU Time
The time during which a CPU was not being used, that is, the time the cores did not spend in CPU time for this application. This may happen, for example, when the cores are blocked.

Wait Time
The amount of time that a given thread waited for some event to occur, such as synchronization waits and I/O waits.

Intel® VTune™ Amplifier XE Terminology

[Figure: timeline of Thread1, Thread2, and Thread3 over six 1-second intervals; legend: thread running, thread waiting]

Elapsed Time: 6 seconds

CPU Time: T1 (4s) + T2 (2s) + T3 (2s) = 8 seconds

Wait Time: T1(2s) + T2(3s) + T3 (2s) = 7 seconds
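The sums above can be checked with a few lines of Python. This is a minimal sketch: the per-thread running and waiting times are the ones stated on the slide, while the dictionary layout and variable names are assumptions made for illustration.

```python
# Per-thread (running, waiting) seconds as stated on the slide.
threads = {"T1": (4, 2), "T2": (2, 3), "T3": (2, 2)}

# CPU time and wait time are summed over all threads.
cpu_time = sum(running for running, _ in threads.values())
wait_time = sum(waiting for _, waiting in threads.values())

print(f"CPU Time: {cpu_time} s, Wait Time: {wait_time} s")
```

Elapsed time is wall-clock time and is therefore not a sum over threads, which is why CPU time can exceed it.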

Hotspot analysis

Displays hot functions in your application

Shows most time-consuming call sequences
• Statistical Call Graph

Includes timeline view of threads in your application

Start the Hotspot Analysis

Hotspot analysis Summary

Hotspots analysis

[Figure: Hotspots view, with callouts:]
• Hotspot functions: double-click a function to view source
• Adjust data grouping
• Function CPU time (partial list shown)
• Call stack: click [+] for call stack
• Thread timeline
• Filter by timeline selection (or by grid selection)
• Filter by module & other controls

Hotspots analysis

Source View / Assembly View

[Figure: source and assembly panes, with callouts:]
• Time on source / asm
• Quick asm navigation: select source to highlight asm
• Click jump to scroll asm
• Quickly scroll to hot spots: the scroll bar “heat map” is an overview of hot spots
• Right click for instruction reference manual

Intel® VTune™ Amplifier XE Parallelism/Concurrency Analysis

For Parallelism / Concurrency analysis,

• Stack sampling is done just like in Hotspots analysis

• Wait functions are instrumented (e.g. WaitForSingleObject, EnterCriticalSection)

• Signal functions are instrumented (e.g. SetEvent, LeaveCriticalSection)

• I/O functions are instrumented (e.g. ReadFile, socket)

Concurrency Analysis

Intel® VTune™ Amplifier XE Concurrency Calculation

[Figure: timeline of Thread1, Thread2, and Thread3 over six 1-second intervals; legend: thread running, thread waiting]

Threads running per interval: 1 2 1 1 2 3

[Figure: Concurrency Summary histogram; axis: concurrency level 0 to 4]
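The concurrency summary can be derived from the per-interval counts of running threads. The Python sketch below uses the six counts from the slide; the time-weighted average is an assumed definition for illustration and is not necessarily how the tool computes its "Average Concurrency" figure.

```python
from collections import Counter

# Number of running threads in each 1-second interval (from the slide).
threads_running = [1, 2, 1, 1, 2, 3]

# Seconds spent at each concurrency level: the bars of the histogram.
seconds_at_level = Counter(threads_running)

# Time-weighted average concurrency over the whole run (assumed definition).
average_concurrency = sum(threads_running) / len(threads_running)

for level in range(5):
    print(level, seconds_at_level[level])
```

In this example the application never reaches a concurrency of 4 and spends half of its run time with only one thread running.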

Concurrency Analysis Summary

Concurrency Levels

Adjustable Metrics

Concurrency View

[Figure: concurrency timeline, with callouts: Wait; Overhead; Concurrency Level; Thread Transitions; Thread is running; Thread is waiting]

Intel® VTune™ Amplifier XE Locks and Waits Analysis

Identifies those threading items that are causing the most thread block time
• Synchronization locks
• Threading APIs
• I/O

Locks and Waits View

Spinning

Wait Waits # Objects CPU Utilization Stack for the wait object

Locks-and-Waits Source View

Auto Reset Event

Intel® VTune™ Amplifier XE Viewpoints and Groupings

Viewpoints
• Each analysis type has pre-defined viewpoints
• Different viewpoints allow the user to analyze the data in different ways, with a different focus in mind

Intel® VTune™ Amplifier XE Viewpoints and Groupings

Groupings
• Each analysis type has pre-defined groupings
• Different groupings allow users to analyze the data in different ways, with a different focus in mind

Viewpoints and Groupings

Viewpoints and Groupings

Timeline Visualizes Thread Behavior

Transitions CPU Time Locks & Waits Hotspots Lightweight Hotspots

Hovers:

Optional: Use API to mark frames and user tasks

Optional: Add a mark during collection

Intel® VTune™ Amplifier XE User APIs

User APIs
• Collection Control API
• Thread Naming API
• User-Defined Synchronization API
• Task API
• User Event API
• Frame API
• JIT Profiling API

Intel® VTune™ Amplifier XE Command Line Interface

Command line (CLI) versions exist on Linux* and Windows*
• CLI use cases:
  – Test code changes for performance regressions
  – Automate execution of performance analyses
• CLI features:
  – Fine-grained control of all analysis types and options
  – Text-based analysis reports
  – Analysis results can be opened in the graphical user interface

Intel® VTune™ Amplifier XE Command Line Interface - Examples

Display a list of available analysis types and preset configuration levels

amplxe-cl -collect-list

Run Hot Spot analysis on target myApp and store result in default-named directory, such as r000hs

amplxe-cl -c hotspots -- myApp

Run the Parallelism analysis, store the result in directory r001par

amplxe-cl -c parallelism -result-dir r001par -- myApp

Intel® VTune™ Amplifier XE Command Line Interface - Reporting

$> amplxe-cl -report summary -r /home/user1/examples/lab2/r003cc

Summary
-------

Average Concurrency: 9.762
Elapsed Time: 158.749
CPU Time: 561.030
Wait Time: 190.342
CPU Usage: 3.636
Executing actions 100 % done

Intel® VTune™ Amplifier XE Command Line Interface – Gprof-like output

Intel® VTune™ Amplifier XE Command Line Interface – CSV output

Example:

$> amplxe-cl -report hotspots -csv-delimiter=comma -format=csv -report-out=testing111 -r r003cc

Function,Module,CPU Time,Idle:CPU Time,Poor:CPU Time,Ok:CPU Time,Ideal:CPU Time,Over:CPU Time
CLHEP::RanecuEngine::flat,test40,50.751,0,0.050,0.081,0.080,50.541
G4UniversalFluctuation::SampleFluctuations,test40,32.730,0,0.030,0.070,0.010,32.620
sqrt,test40,19.060,0,0.010,0.070,0.030,18.951
G4Track::GetVelocity,test40,15.330,0,0.030,0.030,0.040,15.230
G4VoxelNavigation::LevelLocate,test40,14.460,0,0.020,0.010,0.040,14.390
G4Step::UpdateTrack,test40,14.090,0,0,0.030,0.020,14.040
G4NavigationLevelRep::G4NavigationLevelRep,test40,13.721,0,0.030,0.020,0.040,13.631
exp,test40,13.438,0,0.038,0.010,0.060,13.330
log,test40,13.340,0,0.180,0.020,0.110,13.030
G4PhysicsVector::GetValue,test40,11.970,0,0.020,0.020,0.050,11.880
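A CSV report of this kind is convenient for scripted regression checks. The following Python sketch is one possible way to post-process it, under stated assumptions: three rows of the report are embedded as a string instead of being read from the report file, and the variable names are made up for illustration.

```python
import csv
import io

# A few rows copied from the report; a real script would open the report file.
report = """\
Function,Module,CPU Time,Idle:CPU Time,Poor:CPU Time,Ok:CPU Time,Ideal:CPU Time,Over:CPU Time
CLHEP::RanecuEngine::flat,test40,50.751,0,0.050,0.081,0.080,50.541
sqrt,test40,19.060,0,0.010,0.070,0.030,18.951
G4Track::GetVelocity,test40,15.330,0,0.030,0.030,0.040,15.230
"""

rows = list(csv.DictReader(io.StringIO(report)))

# Rank functions by total CPU time, hottest first.
ranked = sorted(rows, key=lambda row: float(row["CPU Time"]), reverse=True)
top = ranked[0]["Function"]

# Share of the hottest function's CPU time in the "Over" utilization bucket.
over_share = float(ranked[0]["Over:CPU Time"]) / float(ranked[0]["CPU Time"])

print(top, round(over_share, 3))
```

Comparing such rankings between two result directories is one simple way to spot performance regressions.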

Remote Data Collection

Collect performance data on one system, analyze and display results on another
• Collect data from one system
• Copy resulting data files to another system
• Open the data files to analyze and display results

One typical model
• Collect on Linux, analyze and display on Windows
  – Similar to “Remote Data Collector” in the VTune™ Analyzer
• Collect data on Linux system using command line tool
  – Doesn’t require a license
• Copy the resulting performance data files to a Windows* system
• Analyze and display results on the Windows* system
  – Requires a license

Collect events in Intel® VTune™ Amplifier XE on offload and native program executions

Application settings:
• Application: ssh
• Parameters: mic0 “”
• Working directory: Usually does not matter
• Don’t forget to set search directories under “All files” to map source

Efficiency Metric: CPI

• Cycles per Instruction

• Can be calculated per hardware thread or per core

• Is a ratio, so it varies widely across applications and should be used carefully.

Number of Hardware    Minimum (Best) Theoretical    Minimum (Best) Theoretical
Threads / Core        CPI per Core                  CPI per Thread
1                     1.0                           1.0
2                     0.5                           1.0
3                     0.5                           1.5
4                     0.5                           2.0

Efficiency Metric: CPI

• Measures how latency affects your application’s execution

Metric            Formula                                               Investigate if
CPI per Thread    CPU_CLK_UNHALTED / INSTRUCTIONS_EXECUTED              > 4.0, or increasing
CPI per Core      (CPI per Thread) / Number of hardware threads used    > 1.0, or increasing

• Look at how optimizations applied to your application affect CPI
• Address high CPIs through any optimizations that aim to reduce latency

Some CPI Data

• Scaling data from a lab workload:

Metric            1 hardware       2 hardware        3 hardware        4 hardware
                  thread / core    threads / core    threads / core    threads / core
CPI per Thread    5.24             8.80              11.18             13.74
CPI per Core      5.24             4.40              3.73              3.43

• Scatter plot of observed CPIs from lab workloads:

[Figure: scatter plot; CPI axis from 0.00 to 9.00]
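The relation between the two CPI rows can be verified directly from the formula "CPI per Core = CPI per Thread / number of hardware threads used". The Python sketch below uses the lab workload numbers from the table; the function and variable names are assumptions for illustration.

```python
# CPI per thread from the lab workload, keyed by hardware threads per core.
cpi_per_thread = {1: 5.24, 2: 8.80, 3: 11.18, 4: 13.74}


def cpi_per_core(cpi_thread, threads_per_core):
    # CPI per Core = (CPI per Thread) / number of hardware threads used
    return cpi_thread / threads_per_core


derived = {n: cpi_per_core(cpi, n) for n, cpi in cpi_per_thread.items()}

# Thresholds from the CPI metric table: investigate if per-thread CPI > 4.0.
needs_attention = {n: cpi > 4.0 for n, cpi in cpi_per_thread.items()}

for n in sorted(derived):
    print(n, round(derived[n], 2), needs_attention[n])
```

Per-thread CPI rises as threads are added, yet per-core CPI falls, so in this workload the extra hardware threads still improve core throughput.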

Efficiency Metric: Compute to Data Access Ratio

• Measures an application’s computational density, and its suitability for Intel® MIC Architecture

Metric                             Formula                                               Investigate if
L1 Compute to Data Access Ratio    VPU_ELEMENTS_ACTIVE / DATA_READ_OR_WRITE              < Vectorization Intensity
L2 Compute to Data Access Ratio    VPU_ELEMENTS_ACTIVE / DATA_READ_MISS_OR_WRITE_MISS    < 100x L1 Compute to Data Access Ratio

• Increase computational density through vectorization and reducing data access (see cache issues, also, DATA ALIGNMENT!)

Problem Area: L1 Cache Usage

• Significantly affects data access latency and therefore application performance

Metric         Formula                                                    Investigate if
L1 Misses      DATA_READ_MISS_OR_WRITE_MISS + L1_DATA_HIT_INFLIGHT_PF1
L1 Hit Rate    (DATA_READ_OR_WRITE – L1 Misses) / DATA_READ_OR_WRITE      < 95%

• Tuning Suggestions:
  – Software prefetching
  – Tile/block data access for cache size
  – Use streaming stores
  – If using 4K access stride, may be experiencing conflict misses
  – Examine Compiler prefetching (Compiler-generated L1 prefetches should not miss)
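As a worked example of the L1 metrics, the following Python sketch evaluates both formulas. The event names are those from the table; the counter values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def l1_hit_rate(data_read_or_write, data_read_miss_or_write_miss,
                l1_data_hit_inflight_pf1):
    # L1 Misses = DATA_READ_MISS_OR_WRITE_MISS + L1_DATA_HIT_INFLIGHT_PF1
    l1_misses = data_read_miss_or_write_miss + l1_data_hit_inflight_pf1
    # L1 Hit Rate = (DATA_READ_OR_WRITE - L1 Misses) / DATA_READ_OR_WRITE
    return (data_read_or_write - l1_misses) / data_read_or_write


# Hypothetical counter values, not measured data.
rate = l1_hit_rate(
    data_read_or_write=1_000_000,
    data_read_miss_or_write_miss=40_000,
    l1_data_hit_inflight_pf1=20_000,
)
investigate = rate < 0.95  # threshold from the table

print(f"L1 hit rate: {rate:.1%}, investigate: {investigate}")
```

With these made-up counts the hit rate is 94%, just below the 95% threshold, so the tuning suggestions above would apply.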

Problem Area: Data Access Latency

• Significantly affects application performance

Metric                      Formula                                                    Investigate if
Estimated Latency Impact    (CPU_CLK_UNHALTED – EXEC_STAGE_CYCLES                      > 145
                            – DATA_READ_OR_WRITE) / DATA_READ_OR_WRITE_MISS

• Tuning Suggestions:
  – Software prefetching
  – Tile/block data access for cache size
  – Use streaming stores
  – Check cache locality: turn off prefetching and use CACHE_FILL events; reduce sharing if needed/possible
  – If using 64K access stride, may be experiencing conflict misses
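The estimated latency impact formula can be sketched in Python as follows. The event names come from the table; the counter values are hypothetical and serve only as an arithmetic illustration.

```python
def estimated_latency_impact(cpu_clk_unhalted, exec_stage_cycles,
                             data_read_or_write, data_read_or_write_miss):
    # Cycles not explained by execution or L1 accesses, per L1 miss.
    stalled = cpu_clk_unhalted - exec_stage_cycles - data_read_or_write
    return stalled / data_read_or_write_miss


# Hypothetical counter values, not measured data.
impact = estimated_latency_impact(
    cpu_clk_unhalted=10_000_000,
    exec_stage_cycles=3_000_000,
    data_read_or_write=1_000_000,
    data_read_or_write_miss=30_000,
)
investigate = impact > 145  # threshold from the table

print(impact, investigate)
```

Here each miss costs an estimated 200 cycles, which exceeds the threshold of 145 and suggests looking at prefetching and cache blocking.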

Problem Area: TLB Usage

• Also affects data access latency and therefore application performance

Metric                           Formula                                     Investigate if
L1 TLB miss ratio                DATA_PAGE_WALK / DATA_READ_OR_WRITE         > 1%
L2 TLB miss ratio                LONG_DATA_PAGE_WALK / DATA_READ_OR_WRITE    > .1%
L1 TLB misses per L2 TLB miss    DATA_PAGE_WALK / LONG_DATA_PAGE_WALK        > 100x

• Tuning Suggestions:
  – Improve cache usage & data access latency
  – If L1 TLB miss/L2 TLB miss is high, try using large pages
  – For loops with multiple streams, try splitting into multiple loops
  – If data access stride is a large power of 2, consider padding between arrays by one 4 KB page
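The three TLB metrics and their thresholds can be evaluated together. This Python sketch uses the event names from the table with hypothetical counter values; the dictionary keys are names invented for illustration.

```python
def tlb_metrics(data_page_walk, long_data_page_walk, data_read_or_write):
    return {
        "l1_tlb_miss_ratio": data_page_walk / data_read_or_write,
        "l2_tlb_miss_ratio": long_data_page_walk / data_read_or_write,
        "l1_per_l2_tlb_miss": data_page_walk / long_data_page_walk,
    }


# Hypothetical counter values, not measured data.
metrics = tlb_metrics(
    data_page_walk=20_000,
    long_data_page_walk=100,
    data_read_or_write=1_000_000,
)

# Thresholds from the table: > 1%, > .1%, > 100x.
flags = {
    "l1_tlb_miss_ratio": metrics["l1_tlb_miss_ratio"] > 0.01,
    "l2_tlb_miss_ratio": metrics["l2_tlb_miss_ratio"] > 0.001,
    "l1_per_l2_tlb_miss": metrics["l1_per_l2_tlb_miss"] > 100,
}

print(metrics, flags)
```

In this made-up case L1 TLB misses are frequent but rarely miss the L2 TLB, the pattern for which the slide suggests trying large pages.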

Problem Area: VPU Usage

• Indicates whether an application is vectorized successfully and efficiently

Metric                     Formula                                             Investigate if
Vectorization Intensity    VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED     < 8 (DP), < 16 (SP)

• Tuning Suggestions:
  – Use the Compiler vectorization report!
  – For data dependencies preventing vectorization, try using Intel® Cilk™ Plus #pragma SIMD (if safe!)
  – Align data and tell the Compiler!
  – Restructure code if possible: Array notations, AOS->SOA
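Vectorization intensity also serves as the threshold for the L1 compute to data access ratio defined earlier in this deck. The Python sketch below computes both; the event names are from the slides, and the counter values are hypothetical.

```python
def vectorization_intensity(vpu_elements_active, vpu_instructions_executed):
    # Average number of active vector elements per VPU instruction.
    return vpu_elements_active / vpu_instructions_executed


def l1_compute_to_data_access(vpu_elements_active, data_read_or_write):
    return vpu_elements_active / data_read_or_write


# Hypothetical counter values for a double-precision kernel.
vpu_elements_active = 6_000_000
intensity = vectorization_intensity(vpu_elements_active, 1_000_000)
l1_ratio = l1_compute_to_data_access(vpu_elements_active, 2_000_000)

poorly_vectorized = intensity < 8       # DP threshold from the table
low_compute_density = l1_ratio < intensity

print(intensity, l1_ratio, poorly_vectorized, low_compute_density)
```

An intensity of 6 means that, on average, two of the eight double-precision lanes are idle per vector instruction.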

Problem Area: Memory Bandwidth

• Can increase data latency in the system or become a performance bottleneck

Metric              Formula                                                        Investigate if
Memory Bandwidth    (UNC_F_CH0_NORMAL_READ + UNC_F_CH0_NORMAL_WRITE +              < 80 GB/sec
                    UNC_F_CH1_NORMAL_READ + UNC_F_CH1_NORMAL_WRITE) × 64 / time    (practical peak 140 GB/sec)
                    (with 8 memory controllers)

• Tuning Suggestions:
  – Improve locality in caches
  – Use streaming stores
  – Improve software prefetching
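The bandwidth formula can be sketched in Python as shown below. Several points are assumptions rather than facts from the slides: the event counts are hypothetical totals already summed over all memory controllers, the factor 64 is read as bytes per counted access, and 1 GB is taken as 10^9 bytes.

```python
def memory_bandwidth_gb_per_s(ch0_read, ch0_write, ch1_read, ch1_write, seconds):
    # (sum of the four UNC_F_CH*_NORMAL_* events) x 64 / time
    bytes_moved = (ch0_read + ch0_write + ch1_read + ch1_write) * 64
    return bytes_moved / seconds / 1e9  # assumes 1 GB = 10^9 bytes


# Hypothetical event totals over a 1-second collection.
bandwidth = memory_bandwidth_gb_per_s(
    ch0_read=400_000_000,
    ch0_write=200_000_000,
    ch1_read=400_000_000,
    ch1_write=200_000_000,
    seconds=1.0,
)
below_threshold = bandwidth < 80  # threshold from the table

print(round(bandwidth, 1), below_threshold)
```

With these numbers the application moves about 76.8 GB/sec, which falls under the 80 GB/sec threshold in the table.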

Event collections on the coprocessor can generate volumes of data

dgemm: on 60+ cores

Tip: Use cpu-mask to reduce data set, while maintaining the same accuracy.


Summary

• Vectorization, Parallelism, and Data locality are critical to good performance for the Intel® Xeon Phi™ Coprocessor

• Event names can be misleading – we recommend using the metrics given in this presentation or our tuning guide at http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding

• Intel® VTune™ Amplifier XE supports collecting all of the above metrics, as well as providing special analysis types like Memory Bandwidth


Intel® VTune™ Amplifier XE: Summary

Vectorization, Parallelism, and Data locality are critical to good performance for the Intel® Xeon Phi™ Coprocessor.

Intel® VTune™ Amplifier XE helps you:
• Find the source code responsible for performance bottlenecks
• Characterize the amount of parallelism in an application
• Determine which synchronization locks or APIs are limiting the parallelism in an application
• Understand problems limiting CPU instruction-level parallelism


Demonstration

Example: matrix-vector multiplication
• Upper triangular matrix
• Structural symmetry

Key points
• Straightforward sequential implementation
• Naïve parallelization of the outer loop*
• Data races in the result vector

The aim is to demonstrate data races, not to show efficient synchronization or performance.

* Column indexes are used to update rows (symmetry!) that are “owned” by another thread.
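The slides do not include the demonstration source, so the following Python sketch is only an assumption about its shape: a symmetric matrix stored as its upper triangle, multiplied sequentially, with a hypothetical helper `rows_written` that lists which result entries each outer-loop iteration updates. It does not run in parallel; it only shows why splitting the outer loop across threads would make several iterations write to the same entries of the result vector.

```python
def symmetric_matvec_upper(upper, x):
    """Compute y = A x for a symmetric A given only by its upper triangle.

    upper[i][j] is used for j >= i; entries below the diagonal are ignored.
    """
    n = len(x)
    y = [0.0] * n
    for i in range(n):  # outer loop: the one a naive parallelization would split
        y[i] += upper[i][i] * x[i]
        for j in range(i + 1, n):
            y[i] += upper[i][j] * x[j]  # update of row i, "owned" by iteration i
            y[j] += upper[i][j] * x[i]  # symmetric update of row j, "owned" by iteration j
    return y


def rows_written(i, n):
    """Indices of y that outer iteration i writes to (row i and all rows j > i)."""
    return set(range(i, n))


# Small made-up example.
upper = [
    [2.0, 1.0, 0.0],
    [0.0, 3.0, 4.0],
    [0.0, 0.0, 5.0],
]
x = [1.0, 2.0, 3.0]
y = symmetric_matvec_upper(upper, x)

# Result entries that iterations 0 and 1 would both update if run concurrently.
shared = rows_written(0, 3) & rows_written(1, 3)
print(y, sorted(shared))
```

Any non-empty overlap between two iterations means unsynchronized concurrent updates to the same element of `y`, which is the kind of data race the demonstration is meant to expose.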


Demonstration: Preparation

$ source /opt/intel/system_studio_2013/bin/iccvars.sh intel64

$ source /opt/intel/system_studio_2013/debugger/gdb/bin/debuggervars.sh intel64

$ source /opt/intel/system_studio_2013/inspector_for_systems/inspxe-vars.sh



Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2013, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804
