PRACE Summer School, CINECA, 8-11 July 2013
Debugging and Profiling on Intel® Xeon Phi™
Hans Pabst, July 2013
Software and Services Group
Intel Corporation
Agenda
Debugging
• Compiler Debug Features
• GNU* Project Debugger
• Intel® Inspector
Profiling
• Compiler Profiling Features
• Intel® VTune™ Amplifier
Demonstration
Copyright© 2013, Intel Corporation. All rights reserved. 10.07.2013 *Other brands and names are the property of their respective owners.
Compiler Debug Features
Static Analysis (SA): icc -diag-enable scn
• Customize analysis level (and other adjustments)
• Textual and Inspector-based reports
• Issue tracking via Inspector GUI
Pointer Checker (PL): icc -check-pointers=rw
• Further option adjustments possible
• No ABI changes despite the added bounds information
• Intrinsics / API for custom memory allocation
• Rigorous checks; failure behavior adjustable
• Debugger integration
Intel® Inspector: Static Analysis
Analysis: 250 error types
• Incorrect directives
• Security errors
Reports and collaboration
• Choose your priority:
– Minimize false errors
– Maximize error detection
• Hierarchical navigation
• Share comments with team
Code Complexity Metrics
• Find code likely to be less reliable
The GNU* Project Debugger (GDB)
An enhanced build of GNU* GDB 7.5 is included in the Intel® Manycore Platform Software Stack (Intel® MPSS)
• http://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss
• Source code available via installation option
• Released back to the GNU* community
• Native and cross/remote debugger versions
• C/C++ support, improved Fortran support
• Intel® Parallel Debugger Extension
Eclipse* Integration
Eclipse* IDE integration
• Seamless debugging of host and coprocessor
• Simultaneous view of host and coprocessor threads
• Supports offload language extensions (auto-attach to offload process)
• Supports multiple coprocessor cards
• Supports C/C++ and Fortran
Simultaneous and seamless thread debugging.
Eclipse* Integration (cont.)
Install Eclipse IDE for C/C++ Developers
• Available from http://www.eclipse.org
Integrate Intel® Compilers and Intel® Xeon Phi™
• Start Eclipse, select Help > Install New Software
• Uncheck "Group items by category"
• Use Add, then Local, and select folder: /opt/intel/composerxe/eclipse_support/cdt8.0/eclipse
• Click Select All and Finish.
Intel® Xeon Phi™ Coprocessor Architecture Features
List all new vector and mask registers
  (gdb) info registers zmm
  k0 0x0 0
  ⁞
  zmm31 {v16_float = {0x0
Disassemble instructions
  (gdb) disassemble $pc, +10
  Dump of assembler code from 0x11 to 0x24:
  0x0000000000000011
GDB: Intel® Xeon Phi™
Coprocessor
Debug server and host debugger
  /usr/linux-k1om-4.7/linux-k1om/usr/bin/gdbserver
  /usr/linux-k1om-4.7/bin/x86_64-k1om-linux-gdb
Native debugger (no Parallel DeBug eXtension)
  /usr/linux-k1om-4.7/linux-k1om/usr/bin/gdb
Host debugger
  /opt/intel/mic/bin/gdb
GDB: Native Debugging
Run GDB* on the Intel® Xeon Phi™ Coprocessor
  ssh -t mic0 /path/on/mic/to/gdb
To attach to a running application via the process ID
  (gdb) shell pidof my_application
  42
  (gdb) attach 42
To run an application directly from GDB*
  (gdb) file /target/path/to/application
  (gdb) start
GDB: Remote Debugging
Run GDB* on your local host
  /usr/linux-k1om-4.7/bin/x86_64-k1om-linux-gdb
Start gdbserver on the coprocessor
– To remote debug using standard I/O redirection
  (gdb) target extended-remote |ssh -T mic0 gdbserver --multi -
– To set a custom environment, replace gdbserver by e.g.:
  env LD_LIBRARY_PATH=/tmp:$LD_LIBRARY_PATH gdbserver
Attach to a running application via process ID (pid)
  (gdb) file /local/path/to/application
  (gdb) attach
GDB: Offload Debugging
Debugging into an offloaded code section on the host does not “switch” to a debugger on the target
• No debug synchronization (host / coprocessor)
• GUI integration will provide this “glue” logic; see “Eclipse* Integration”
Debugging offloaded code via command line
1. Wait within the offloaded code section
   volatile int loop = 1; /* set loop to 0 from the debugger to continue */
   do { volatile int a = 1; } while (loop);
2. Attach to offload process on coprocessor via PID
Note: cross-compiling the entire application and debugging the previously offloaded section natively might be easier.
GDB: Detect and Debug Data Races
Data race: a data race occurs when two ordinary simultaneous accesses to the same scalar, at least one of which is a write, execute in different parallel regions. [Hans-J. Boehm, WG21/N2480]
Tools are needed to detect and debug data races*:
• GNU* GDB with parallel debug extension
• Intel® Inspector
* Remember: single-threaded (sequential) execution cannot reproduce data races, whereas multiple threads can, even on a single core.
GDB: Data Race Symptoms
Varying, or wrong results*
• One of the possible (but different) interleavings of the instructions of the parallel sections reproduces the actual result (“sequential consistency”)
Memory corruption, or crash
• An index (or pointer) is subject to a data race, e.g., a bookkeeping structure is concurrently modified and left in an inconsistent state (mix of different updates)
* The absence of data races is only one prerequisite for reproducible numerical results; e.g., a deterministic execution order is needed as well.
GDB: Data Race Example
Given: global variables
  a = 1
  b = 2
Given: two threads
  T1: x = a + b
  T2: b = 42
Value of x depends on execution order:
  If T1 runs before T2: x = 3
  If T2 runs before T1: x = 43
Data race e.g., “read-write”: T2’s update was not visible to T1’s calculation
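The two outcomes can be reproduced by replaying both orders by hand. The following Python sketch is illustrative only: it runs the two statements sequentially in each order instead of using real threads, and the names `run`, `t1`, and `t2` are hypothetical.

```python
def run(order):
    """Replay the two statements in the given order on fresh shared state."""
    shared = {"a": 1, "b": 2, "x": None}

    def t1():
        shared["x"] = shared["a"] + shared["b"]  # T1: x = a + b

    def t2():
        shared["b"] = 42                          # T2: b = 42

    steps = {"T1": t1, "T2": t2}
    for name in order:
        steps[name]()
    return shared["x"]


x_t1_first = run(["T1", "T2"])  # T1 reads b before T2 writes it
x_t2_first = run(["T2", "T1"])  # T1 reads b after T2 writes it
print(x_t1_first, x_t2_first)
```

With real threads, neither order is guaranteed, which is why the result varies from run to run.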
GDB: Data Race Detection
How to detect data races?
Compile with Intel Compiler: icpc -debug parallel
Debugger breaks when race has been detected:
  (gdb) pdbx enable
  (gdb) run
  data race detected
  1: write shared, 4 bytes from foo.c:36
  3: read shared, 4 bytes from foo.c:40
  Breakpoint -11, 0x401515 in L_test_..._21 () at foo.c:36
  *var = 42; /* bp.write */
Stop in the context of the access that triggers a race condition
GDB: Data Race Debugging
Fine-tune detection and analysis via filter sets
• Add filter to selected filter set
  (gdb) pdbx filter line foo.c:36
  (gdb) pdbx filter code 0x40518..0x40524
  (gdb) pdbx filter var shared
  (gdb) pdbx filter data 0x60f48..0x60f50
  (gdb) pdbx filter reads # read accesses
• Ignore events specified by filters (default behavior)
  (gdb) pdbx fset suppress
• Ignore events not specified by filters
  (gdb) pdbx fset focus
• Get debug command help (pdbx)
  (gdb) help pdbx
Use cases
• Focused debugging, e.g., debug a single symptom
• Limit overhead and control false positives
GDB: Data Race Detection Limitations
Data race detection needs instrumented threading runtimes in order to capture synchronization events.
Enhanced build of GNU* GDB 7.5 supports:
• Intel OpenMP
• Pthreads
How to avoid false positives?
• Use filter mechanism or supported threading runtime
• Use Intel® Inspector
– Supports all threading runtimes from Intel and Pthreads
– Good support for other threading runtimes
GDB: Data Race Detection Usage Hints
Optimized code (symptom)
  (gdb) run
  data race detected
  1: write question, 4 bytes from foo.c:36
  3: read question, 4 bytes from foo.c:40
  Breakpoint -11, 0x401515 in foo () at foo.c:36
  36 *answer = 42;
  (gdb)
Reported variable may appear to be wrong
• Remember: data races are reported on memory objects
• If the symbol name cannot be resolved, only the address is printed
Recommendation
• Use unoptimized code (-O0) so that data races are not missed because temporaries were removed or optimized away
Intel® Inspector
Manage results of static analysis
• Issue tracking
Runtime issue analysis
• Memory issue detection
– OOB accesses (heap, stack)
– Uninitialized memory usage
– Inconsistent alloc. and leaks
• Parallelism issue detection
– Deadlocks (locked forever)
– Livelocks (locking forever)
– Data races
Command line and remote issue collection (regression tests)
  $ inspxe-cl \
      -no-auto-finalize \
      -collect ti3 \
      -knob scope=extreme \
      -r myresults -- mybin
  $ inspxe-cl -finalize \
      -r myresults
  $ inspxe-cl \
      -report problems \
      -r myresults
Loop Profiler
Identify Time Consuming Loops/Functions
Compiler switch: /Qprofile-functions, -profile-functions
• Insert instrumentation calls on function entry and exit points to collect the cycles spent within the function.
Compiler switch: /Qprofile-loops=
Input is generated XML output file, named loop_prof_
loopprofileviewer.sh
Loop Profiler Text Dump (.dump file)
Loop Profiler Data Viewer GUI
[Screenshot callouts]
• Function Profile View
• Menu to allow user to enable filtering or displaying the source code
• Column headers allow selection to control sort criteria independently for function and loop table
• Loop Profile View
Intel® VTune™ Amplifier 2013 Low Overhead Java* Profiling
Low Overhead & Precise
• Sampling is fast / unobtrusive
• Fast hardware sampling (with optional stacks!)
• Advanced profiles are unique (cache misses, bandwidth…)
Versatile & Easy to Use
• Multiple simultaneous JVMs
• Mixed Java / C++
• See results on the Java source
Intel® VTune™ Amplifier XE Feature Highlights
Hot Spot Analysis (Statistical Call Graph)
Where is the application spending time and how did it get there?
• Faster and more reliable than older instrumentation-based exact call graph
Hardware Event-based Sampling (EBS)
Where are the tuning opportunities? (e.g., cache misses)
• Improved usability
• Pre-defined tuning experiments
Thread Profiling
Where is my concurrency poor and why?
• Thread timeline visualizes thread activity and lock transitions
• Integrated EBS data tells you exactly what’s happening and when
Intel® VTune™ Amplifier XE Feature Highlights
Event multiplexing
• Gather more information with each profiling run
Timeline correlates thread and event data
• See what active threads are doing
• Filter profile results by selecting a region in the timeline
Advanced Source / Assembler View
• See event data graphed on the source / assembler
• View and analyze assembly as basic blocks
Intel® VTune™ Amplifier XE Feature Highlights
Collect System Wide Data & Attach to Running Processes
• EBS collects system-wide data; filter it to find what you need
• Hot Spot and Concurrency Analyses can attach to a running process
GUI & Command Line
• Stand-alone GUI, Command Line, Microsoft* Visual Studio Integration
• GUI makes setup and analysis easy
• Command line for regression analysis and collection on remote systems
Extended Platform Coverage
• Windows* and Linux
• Microsoft* .NET* C# applications
Intel® VTune™ Amplifier XE Analysis Types (based on technology)
Hot Spot Analysis
• Sampling with stacks
Parallelism/Concurrency Analysis
• Locks and Waits finds the problems with parallelism
Hardware Event-based Sampling and Counting
• Lightweight Hotspot (pre-defined)
• Advanced Analysis Types (pre-defined)
– General Exploration
– Memory Access
– Bandwidth
– Cycles and uOps
• Advanced Analysis Types (created by a user)
Selecting type of data collection
[Screenshot callouts]
• All available analysis types
• Different ways to start the analysis
• Copy correct command line syntax to clipboard
• Helps creating new analysis types
Intel® VTune™ Amplifier XE Terminology
CPU Time
The amount of time a thread spends executing on a logical processor. For multiple threads, the CPU time of the threads is summed.
Unused CPU Time
The time during which a CPU was not being used, i.e., the time the cores did not spend as CPU time for this application. This may happen, for example, when the cores are blocked.
Wait Time
The amount of time that a given thread waited for some event to occur, such as synchronization waits and I/O waits.
Intel® VTune™ Amplifier XE Terminology
[Figure: timeline of Thread1, Thread2, and Thread3 over six 1-second intervals; each interval is marked as thread running or thread waiting]
Elapsed Time: 6 seconds
CPU Time: T1 (4s) + T2 (2s) + T3 (2s) = 8 seconds
Wait Time: T1(2s) + T2(3s) + T3 (2s) = 7 seconds
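The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that takes the per-thread values from the example; the dictionary and function names are hypothetical.

```python
# Per-thread times in seconds, taken from the example above.
cpu = {"T1": 4, "T2": 2, "T3": 2}
wait = {"T1": 2, "T2": 3, "T3": 2}
elapsed = 6


def total(per_thread):
    """CPU time and wait time are summed over all threads."""
    return sum(per_thread.values())


cpu_time = total(cpu)
wait_time = total(wait)
print(cpu_time, wait_time)
```

Because both metrics are summed over threads, either one can exceed the elapsed time, as it does here.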
Hotspot analysis
Displays hot functions in your application
Shows most time-consuming call sequences
• Statistical Call Graph
Includes timeline view of threads in your application
Start the Hotspot Analysis
Hotspot analysis Summary
Hotspots analysis: Hotspot functions
[Screenshot callouts]
• Hotspot functions: double-click a function to view its source
• Adjust data grouping
• Function CPU time (partial list shown)
• Call stack: click [+] for the call stack
• Thread timeline
• Filter by timeline selection (or by grid selection)
• Filter by module and other controls
Hotspots analysis
[Screenshot callouts]
• Source View and Assembly View
• Time on source / asm
• Quick asm navigation: select source to highlight asm
• Click jump to scroll asm
• Quickly scroll to hot spots: the scroll bar “heat map” is an overview of hot spots
• Right click for instruction reference manual
Intel® VTune™ Amplifier XE Parallelism/Concurrency Analysis
For Parallelism / Concurrency analysis:
• Stack sampling is done just like in Hotspots analysis
• Wait functions are instrumented (e.g. WaitForSingleObject, EnterCriticalSection)
• Signal functions are instrumented (e.g. SetEvent, LeaveCriticalSection)
• I/O functions are instrumented (e.g. ReadFile, socket)
Concurrency Analysis
Intel® VTune™ Amplifier XE Concurrency Calculation
[Figure: timeline of Thread1, Thread2, and Thread3 over six 1-second intervals; each interval is marked as thread running or thread waiting]
Threads running per interval: 1 2 1 1 2 3
[Figure: Concurrency Summary histogram over concurrency levels 0 to 4]
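The concurrency summary is a histogram of how long each number of simultaneously running threads was observed. The sketch below rebuilds it from the per-interval thread counts in the diagram (1, 2, 1, 1, 2, 3), assuming 1-second intervals. The function names are hypothetical, and the time-weighted average shown here is only one plausible definition, not necessarily the one the tool uses.

```python
from collections import Counter


def concurrency_histogram(running_per_interval, interval=1.0):
    """Map each concurrency level to the time spent at that level."""
    hist = Counter()
    for n in running_per_interval:
        hist[n] += interval
    return dict(hist)


def average_concurrency(running_per_interval):
    """Time-weighted mean of the number of running threads."""
    return sum(running_per_interval) / len(running_per_interval)


running = [1, 2, 1, 1, 2, 3]
hist = concurrency_histogram(running)
avg = average_concurrency(running)
print(hist, avg)
```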
Concurrency Analysis Summary
Concurrency Levels
Adjustable Metrics
Concurrency View
[Screenshot callouts] Wait; Overhead; Concurrency Level; Thread Transitions; Thread is running; Thread is waiting
Intel® VTune™ Amplifier XE Locks and Waits Analysis
Identifies those threading items that are causing the most thread block time
• Synchronization locks
• Threading APIs
• I/O
Locks & Waits Analysis
Locks and Waits View
[Screenshot callouts] Spinning; Wait Objects; Waits #; CPU Utilization; Stack for the wait object
Locks-and-Waits Source View
Auto Reset Event
Intel® VTune™ Amplifier XE Viewpoints and Groupings
Viewpoints
• Each analysis type has pre-defined viewpoints
• Different viewpoints allow the user to analyze the data in different ways, with a different focus in mind
Intel® VTune™ Amplifier XE Viewpoints and Groupings
Groupings
• Each analysis type has pre-defined groupings
• Different groupings allow users to analyze data in different ways, with a different focus in mind
Viewpoints and Groupings
Timeline Visualizes Thread Behavior
Transitions CPU Time Locks & Waits Hotspots Lightweight Hotspots
Hovers:
Optional: Use API to mark frames and user tasks
Optional: Add a mark during collection
Intel® VTune™ Amplifier XE User APIs
User APIs
• Collection Control API
• Thread Naming API
• User-Defined Synchronization API
• Task API
• User Event API
• Frame API
• JIT Profiling API
Intel® VTune™ Amplifier XE Command Line Interface
Command line (CLI) versions exist on Linux* and Windows*
• CLI use cases:
– Test code changes for performance regressions
– Automate execution of performance analyses
• CLI features:
– Fine-grained control of all analysis types and options
– Text-based analysis reports
– Analysis results can be opened in the graphical user interface
Intel® VTune™ Amplifier XE Command Line Interface - Examples
Display a list of available analysis types and preset configuration levels
  amplxe-cl -collect-list
Run Hot Spot analysis on target myApp and store result in default-named directory, such as r000hs
  amplxe-cl -c hotspots -- myApp
Run the Parallelism analysis, store the result in directory r001par
  amplxe-cl -c parallelism -result-dir r001par -- myApp
Intel® VTune™ Amplifier XE Command Line Interface - Reporting
$> amplxe-cl -report summary -r /home/user1/examples/lab2/r003cc
Summary
------
Average Concurrency: 9.762
Elapsed Time: 158.749
CPU Time: 561.030
Wait Time: 190.342
CPU Usage: 3.636
Executing actions 100 % done
Intel® VTune™ Amplifier XE Command Line Interface - gprof-like output
Intel® VTune™ Amplifier XE Command Line Interface - CSV output
Example:
$> amplxe-cl -report hotspots -csv-delimiter=comma -format=csv -report-out=testing111 -r r003cc
Function,Module,CPU Time,Idle:CPU Time,Poor:CPU Time,Ok:CPU Time,Ideal:CPU Time,Over:CPU Time
CLHEP::RanecuEngine::flat,test40,50.751,0,0.050,0.081,0.080,50.541
G4UniversalFluctuation::SampleFluctuations,test40,32.730,0,0.030,0.070,0.010,32.620
sqrt,test40,19.060,0,0.010,0.070,0.030,18.951
G4Track::GetVelocity,test40,15.330,0,0.030,0.030,0.040,15.230
G4VoxelNavigation::LevelLocate,test40,14.460,0,0.020,0.010,0.040,14.390
G4Step::UpdateTrack,test40,14.090,0,0,0.030,0.020,14.040
G4NavigationLevelRep::G4NavigationLevelRep,test40,13.721,0,0.030,0.020,0.040,13.631
exp,test40,13.438,0,0.038,0.010,0.060,13.330
log,test40,13.340,0,0.180,0.020,0.110,13.030
G4PhysicsVector::GetValue,test40,11.970,0,0.020,0.020,0.050,11.880
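Because the report is plain CSV, it can be post-processed by a short script, for example in a regression test. The sketch below assumes the header and column order shown in the example. It inlines three rows of that output instead of reading the report file, and `top_functions` is a hypothetical helper name.

```python
import csv
import io

# First rows of the example report above, inlined for illustration.
SAMPLE = """Function,Module,CPU Time,Idle:CPU Time,Poor:CPU Time,Ok:CPU Time,Ideal:CPU Time,Over:CPU Time
CLHEP::RanecuEngine::flat,test40,50.751,0,0.050,0.081,0.080,50.541
G4UniversalFluctuation::SampleFluctuations,test40,32.730,0,0.030,0.070,0.010,32.620
sqrt,test40,19.060,0,0.010,0.070,0.030,18.951
"""


def top_functions(report_text, n=2):
    """Return the n functions with the highest CPU time as (name, seconds)."""
    rows = csv.DictReader(io.StringIO(report_text))
    ranked = sorted(rows, key=lambda r: float(r["CPU Time"]), reverse=True)
    return [(r["Function"], float(r["CPU Time"])) for r in ranked[:n]]


print(top_functions(SAMPLE))
```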
Remote Data Collection
Collect performance data on one system, analyze and display results on another
• Collect data from one system
• Copy resulting data files to another system
• Open the data files to analyze and display results
One typical model
• Collect on Linux, analyze and display on Windows
– Similar to “Remote Data Collector” in the VTune™ Analyzer
• Collect data on Linux system using command line tool
– Doesn’t require a license
• Copy the resulting performance data files to a Windows* system
• Analyze and display results on the Windows* system
– Requires a license
Collect events in Intel® VTune™ Amplifier XE on offload and native program executions
Application settings: • Application: ssh • Parameters: mic0 “
Efficiency Metric: CPI
• Cycles per Instruction
• Can be calculated per hardware thread or per core
• Is a ratio! So varies widely across applications and should be used carefully.
Number of Hardware Threads / Core | Minimum (Best) Theoretical CPI per Core | Minimum (Best) Theoretical CPI per Thread
1 | 1.0 | 1.0
2 | 0.5 | 1.0
3 | 0.5 | 1.5
4 | 0.5 | 2.0
Efficiency Metric: CPI
• Measures how latency affects your application’s execution
Metric | Formula | Investigate if
CPI per Thread | CPU_CLK_UNHALTED / INSTRUCTIONS_EXECUTED | > 4.0, or increasing
CPI per Core | (CPI per Thread) / Number of hardware threads used | > 1.0, or increasing
• Look at how optimizations applied to your application affect CPI
• Address high CPIs through any optimizations that aim to reduce latency
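The two CPI formulas can be evaluated directly from raw event counts. The following Python sketch uses made-up counts for illustration, and the function names are hypothetical. The "or increasing" criterion requires comparing several runs and is not covered here.

```python
def cpi_per_thread(cpu_clk_unhalted, instructions_executed):
    """CPI per thread = CPU_CLK_UNHALTED / INSTRUCTIONS_EXECUTED."""
    return cpu_clk_unhalted / instructions_executed


def cpi_per_core(cpi_thread, hw_threads_used):
    """CPI per core = CPI per thread / number of hardware threads used."""
    return cpi_thread / hw_threads_used


def needs_investigation(cpi_thread, cpi_core):
    """Apply the absolute thresholds only (> 4.0 per thread, > 1.0 per core)."""
    return cpi_thread > 4.0 or cpi_core > 1.0


# Hypothetical event counts for a run with 2 hardware threads per core.
per_thread = cpi_per_thread(8_800_000_000, 1_000_000_000)
per_core = cpi_per_core(per_thread, 2)
print(per_thread, per_core, needs_investigation(per_thread, per_core))
```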
Some CPI Data
• Scaling data from a lab workload:
Metric | 1 hardware thread / core | 2 hardware threads / core | 3 hardware threads / core | 4 hardware threads / core
CPI per Thread | 5.24 | 8.80 | 11.18 | 13.74
CPI per Core | 5.24 | 4.40 | 3.73 | 3.43
• Scatter plot of observed CPIs from lab workloads:
[Figure: scatter plot of CPI values; vertical axis from 0.00 to 9.00]
Efficiency Metric: Compute to Data Access Ratio
• Measures an application’s computational density, and its suitability for Intel® MIC Architecture
Metric | Formula | Investigate if
L1 Compute to Data Access Ratio | VPU_ELEMENTS_ACTIVE / DATA_READ_OR_WRITE | < Vectorization Intensity
L2 Compute to Data Access Ratio | VPU_ELEMENTS_ACTIVE / DATA_READ_MISS_OR_WRITE_MISS | < 100x L1 Compute to Data Access Ratio
• Increase computational density through vectorization and reducing data access (see cache issues, also, DATA ALIGNMENT!)
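As a rough illustration, the two ratios and their "investigate if" conditions can be written as follows. The event counts are invented for the example, the function names are hypothetical, and the vectorization intensity is passed in as a separately measured value.

```python
def l1_compute_to_data_access(vpu_elements_active, data_read_or_write):
    return vpu_elements_active / data_read_or_write


def l2_compute_to_data_access(vpu_elements_active, data_read_miss_or_write_miss):
    return vpu_elements_active / data_read_miss_or_write_miss


def investigate(l1_ratio, l2_ratio, vectorization_intensity):
    """Return which of the two criteria from the table are triggered."""
    return {
        "l1": l1_ratio < vectorization_intensity,
        "l2": l2_ratio < 100 * l1_ratio,
    }


# Hypothetical counts: 8e9 active VPU elements, 2e9 accesses, 5e7 misses.
l1 = l1_compute_to_data_access(8_000_000_000, 2_000_000_000)
l2 = l2_compute_to_data_access(8_000_000_000, 50_000_000)
print(l1, l2, investigate(l1, l2, vectorization_intensity=8.0))
```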
Problem Area: L1 Cache Usage
• Significantly affects data access latency and therefore application performance
Metric | Formula | Investigate if
L1 Misses | DATA_READ_MISS_OR_WRITE_MISS + L1_DATA_HIT_INFLIGHT_PF1 |
L1 Hit Rate | (DATA_READ_OR_WRITE - L1 Misses) / DATA_READ_OR_WRITE | < 95%
• Tuning Suggestions:
– Software prefetching
– Tile/block data access for cache size
– Use streaming stores
– If using 4K access stride, may be experiencing conflict misses
– Examine Compiler prefetching (Compiler-generated L1 prefetches should not miss)
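A small worked example of the L1 hit rate, as a sketch with invented event counts and hypothetical function names:

```python
def l1_misses(data_read_miss_or_write_miss, l1_data_hit_inflight_pf1):
    """L1 misses = DATA_READ_MISS_OR_WRITE_MISS + L1_DATA_HIT_INFLIGHT_PF1."""
    return data_read_miss_or_write_miss + l1_data_hit_inflight_pf1


def l1_hit_rate(data_read_or_write, misses):
    """L1 hit rate = (DATA_READ_OR_WRITE - L1 misses) / DATA_READ_OR_WRITE."""
    return (data_read_or_write - misses) / data_read_or_write


# Hypothetical counts: 1,000,000 accesses, 30,000 + 40,000 misses.
misses = l1_misses(30_000, 40_000)
hit_rate = l1_hit_rate(1_000_000, misses)
print(misses, hit_rate, hit_rate < 0.95)
```

A hit rate of 93% falls below the 95% threshold, so this hypothetical run would be worth a closer look.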
Problem Area: Data Access Latency
• Significantly affects application performance
Metric | Formula | Investigate if
Estimated Latency Impact | (CPU_CLK_UNHALTED - EXEC_STAGE_CYCLES - DATA_READ_OR_WRITE) / DATA_READ_OR_WRITE_MISS | > 145
• Tuning Suggestions:
– Software prefetching
– Tile/block data access for cache size
– Use streaming stores
– Check cache locality: turn off prefetching and use CACHE_FILL events; reduce sharing if needed/possible
– If using 64K access stride, may be experiencing conflict misses
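The estimated latency impact is a single expression over four event counts. The sketch below evaluates it for invented counts; the function name is hypothetical.

```python
def estimated_latency_impact(cpu_clk_unhalted, exec_stage_cycles,
                             data_read_or_write, data_read_or_write_miss):
    """(CPU_CLK_UNHALTED - EXEC_STAGE_CYCLES - DATA_READ_OR_WRITE)
    / DATA_READ_OR_WRITE_MISS, as given in the table above."""
    stalled = cpu_clk_unhalted - exec_stage_cycles - data_read_or_write
    return stalled / data_read_or_write_miss


# Hypothetical counts.
impact = estimated_latency_impact(10_000_000, 4_000_000, 1_000_000, 25_000)
print(impact, impact > 145)
```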
Problem Area: TLB Usage
• Also affects data access latency and therefore application performance
Metric | Formula | Investigate if
L1 TLB miss ratio | DATA_PAGE_WALK / DATA_READ_OR_WRITE | > 1%
L2 TLB miss ratio | LONG_DATA_PAGE_WALK / DATA_READ_OR_WRITE | > .1%
L1 TLB misses per L2 TLB miss | DATA_PAGE_WALK / LONG_DATA_PAGE_WALK | > 100x
• Tuning Suggestions:
– Improve cache usage & data access latency
– If L1 TLB miss/L2 TLB miss is high, try using large pages
– For loops with multiple streams, try splitting into multiple loops
– If data access stride is a large power of 2, consider padding between arrays by one 4 KB page
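The three TLB metrics share the same inputs, so they can be computed together. This is a minimal sketch with invented counts; `tlb_metrics` is a hypothetical helper name.

```python
def tlb_metrics(data_page_walk, long_data_page_walk, data_read_or_write):
    """Compute the three TLB ratios from the table above."""
    return {
        "l1_miss_ratio": data_page_walk / data_read_or_write,
        "l2_miss_ratio": long_data_page_walk / data_read_or_write,
        "l1_per_l2": data_page_walk / long_data_page_walk,
    }


# Hypothetical counts: 20,000 page walks, 100 long page walks, 1,000,000 accesses.
m = tlb_metrics(20_000, 100, 1_000_000)
flags = {
    "l1_miss_ratio": m["l1_miss_ratio"] > 0.01,   # > 1%
    "l2_miss_ratio": m["l2_miss_ratio"] > 0.001,  # > .1%
    "l1_per_l2": m["l1_per_l2"] > 100,            # > 100x
}
print(m, flags)
```

In this hypothetical run, the high ratio of L1 to L2 TLB misses is the case for which the tuning suggestions recommend trying large pages.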
Problem Area: VPU Usage
• Indicates whether an application is vectorized successfully and efficiently
Metric | Formula | Investigate if
Vectorization Intensity | VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED | < 8 (DP), < 16 (SP)
• Tuning Suggestions:
– Use the Compiler vectorization report!
– For data dependencies preventing vectorization, try using Intel® Cilk™ Plus #pragma SIMD (if safe!)
– Align data and tell the Compiler!
– Restructure code if possible: Array notations, AOS->SOA
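Vectorization intensity is the average number of active vector elements per VPU instruction. The sketch below computes it and compares it against the two thresholds from the table; the counts are invented and the names are hypothetical.

```python
# Thresholds from the table above: 8 for double precision, 16 for single.
THRESHOLD = {"dp": 8, "sp": 16}


def vectorization_intensity(vpu_elements_active, vpu_instructions_executed):
    return vpu_elements_active / vpu_instructions_executed


def poorly_vectorized(intensity, precision):
    """True if the intensity is below the threshold for 'dp' or 'sp'."""
    return intensity < THRESHOLD[precision]


# Hypothetical counts for a double-precision kernel.
vi = vectorization_intensity(6_000_000_000, 1_000_000_000)
print(vi, poorly_vectorized(vi, "dp"))
```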
Problem Area: Memory Bandwidth
• Can increase data latency in the system or become a performance bottleneck
Metric | Formula | Investigate if
Memory Bandwidth | (UNC_F_CH0_NORMAL_READ + UNC_F_CH0_NORMAL_WRITE + UNC_F_CH1_NORMAL_READ + UNC_F_CH1_NORMAL_WRITE) x 64 / time (with 8 memory controllers) | < 80 GB/sec (practical peak 140 GB/sec)
• Tuning Suggestions:
– Improve locality in caches
– Use streaming stores
– Improve software prefetching
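The bandwidth formula can be sketched as below. Two assumptions are made here that the slide does not state explicitly: the factor 64 is taken to be the number of bytes transferred per event, and 1 GB is taken as 10^9 bytes. The event counts are invented and are meant as totals across all memory controllers.

```python
BYTES_PER_EVENT = 64  # assumption: the factor 64 in the formula is bytes per event


def memory_bandwidth_gb_per_s(ch0_read, ch0_write, ch1_read, ch1_write, seconds):
    """(sum of the four UNC_F_* events) x 64 / time, converted to GB/s."""
    events = ch0_read + ch0_write + ch1_read + ch1_write
    return events * BYTES_PER_EVENT / seconds / 1e9  # assumption: 1 GB = 1e9 bytes


# Hypothetical totals collected over one second.
bw = memory_bandwidth_gb_per_s(250_000_000, 250_000_000,
                               250_000_000, 250_000_000, 1.0)
print(bw, bw < 80)
```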
Event collections on the coprocessor can generate volumes of data
[Screenshot: dgemm on 60+ cores]
Tip: Use cpu-mask to reduce data set, while maintaining the same accuracy.
Summary
• Vectorization, Parallelism, and Data locality are critical to good performance for the Intel® Xeon Phi™ Coprocessor
• Event names can be misleading; we recommend using the metrics given in this presentation or our tuning guide at http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding
• Intel® VTune™ Amplifier XE supports collecting all of the above metrics, as well as providing special analysis types like Memory Bandwidth
Intel® VTune™ Amplifier XE: Summary
Vectorization, parallelism, and data locality are critical to good performance for the Intel® Xeon Phi™ Coprocessor.
Intel® VTune™ Amplifier XE helps you:
• Find the source code of performance bottlenecks
• Characterize the amount of parallelism in an application
• Determine which synchronization locks or APIs are limiting the parallelism in an application
• Understand problems limiting CPU instruction-level parallelism
Demonstration
Example: matrix-vector multiplication
• Upper triangular matrix
• Structural symmetry
Key points
• Straightforward sequential implementation
• Naïve parallelization of the outer loop*
• Data races in the result vector
The aim is to demonstrate data races, not to show efficient synchronization or performance.
* Column indexes are used to update rows (symmetry!) that are “owned” by another thread.
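The demonstration code itself is not reproduced here, so the following Python sketch only illustrates why the naïve parallelization races. It assumes a dense upper triangle, a block distribution of rows over two threads, and a symmetric update of the form y[i] += a[i][j] * x[j] and y[j] += a[i][j] * x[i]. All names are hypothetical.

```python
def written_rows(row_range, n):
    """Rows of the result vector y written while processing the given rows."""
    rows = set()
    for i in row_range:
        for j in range(i, n):  # upper triangle only
            rows.add(i)        # y[i] += a[i][j] * x[j]
            rows.add(j)        # y[j] += a[i][j] * x[i] (symmetric counterpart)
    return rows


n = 4
t0 = written_rows(range(0, 2), n)  # thread 0 "owns" rows 0..1
t1 = written_rows(range(2, 4), n)  # thread 1 "owns" rows 2..3
conflicts = t0 & t1                # rows written by both threads
print(sorted(t0), sorted(t1), sorted(conflicts))
```

Any row in `conflicts` is updated by both threads without synchronization, which is the kind of access a data race detector reports.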
Demonstration: Preparation
$ source /opt/intel/system_studio_2013/bin/iccvars.sh intel64
$ source /opt/intel/system_studio_2013/debugger/gdb/bin/debuggervars.sh intel64
$ source /opt/intel/system_studio_2013/inspector_for_systems/inspxe-vars.sh
Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright © 2013, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804