Next Generation Developer Tools from

Focusing on new Performance Analysis Tool

4th Parallel Tools Workshop HLRS, September 2010

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Developer Tools of Intel ---Today

Fully Supported Developer Products:

… and numerous unsupported tools like PIN, PTU, AVX Emulator,Emulator, CnC freely available from whatif.intel.com and other public sitessites

2

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 2 Intel ®®® Parallel Studio –––Windows Only New Version 2011 Released Sep 2 ndndnd ! Intel ® Parallel Advisor Intel ® Parallel Composer • Demystifies and speeds parallel • C++ Compiler, libraries, debugger application design plug-in • Direct user where to parallelize • Intel ® Parallel Debugger Extension: • Explorer & Modeler tools give Simplify debugging parallel code parallelism design insight and • A family of Parallel models New! analysis Set of portable, reliable, future proof • Proposes parallelism scheme best parallel models for both data and suited for application task parallelism, includes Intel TBB, • Summary view for decision-making Cilk Plus • Support for Intel Array Building Blocks • Intel ® IPP, OpenMP * included

Intel ® Parallel Inspector Intel ® Parallel Amplifier • Dynamic Memory & thread • Parallel performance analyzer Analysis for serial and parallel • Find both serial and parallel performance bottlenecks code • Scale application performance with • Finds thread data races & more processor cores deadlocks • No special compilers or builds • Finds memory leaks and necessary corruption

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 3 Future: Next Generation Tools Suite for HPC In Beta Testing Today; Planned to be Released in a few Months Complete Package for Linux* and Windows* – Single installer, single license, synchronized launch schedule – Supports the latest Intel® x86 processors, both IA32 and Intel64

Future Compiler Bundle – “Compiler release 12.0” – Fortran, C/C++ Compilers Version – New parallel programming models Cilk Plus and Co -Array Fortran – Vectorization enhancements: GAP, SIMD pragma, Array notation – Intel® Debugger IDB Next Generation Checking Tool – “ThreadChecker Next” – Thread, Memory and Security checking – Support of Parallel Programming Models: Intel ® TBB, OpenMP*, Intel® Cilk Plus – C/C++, Fortran, and .NET support Future Performance Analyzer– “VTune+ThreadProfiler Next”

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 4 Intel’s Family of Parallel Programming Models

Fixed Intel® Parallel Established Research and Function Standards Exploration Building Blocks (PBB) Libraries

Intel ® Intel ® Math Intel ® Threading Kernel MPI Concurrent Building Library (MKL) Collections Blocks (TBB) Intel® Cilk Intel ® Intel® Array Plus Integrated OpenCL* Building Performance OpenMP* Blocks (ArBB) Primitives (IPP)

Intel® Cilk Plus, Intel® TBB: Part of Intel® Parallel Studio (XE)

Intel® Array Building Blocks : Known by code names ‘Intel Ct’ or ‘Intel Firetown”; will start public beta around mid September 2010

Software & Services Group, Developer Products Division 5 Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 5 Future Performance Analysis Tool

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 6 Future Performance Analysis Tool Integrating Features from 4 Existing Performance Tools

Intel® Parallel Amplifier

Intel® Performance Tuning Utility ( PTU ) Unsupported tool freely available from whatif.intel.com

7

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 7 Future Performance Analysis Tool

• Integrates popular and mature features of Intel® VTune™ Performance Analyzer, Intel® Parallel Amplifier, Intel® Thread Profiler and Intel® Performance Tuning Utility – But not a super-set in all cases – Some additional features being worked on and will be added later; some are still being evaluated/might be added to future updates • Standalone GUI on Linux* and Windows* – GUI in all environments based on wxWidgets: Very fast and stable – Same look-and-feel for Linux & Windows – Fast and native implementation on Linux – No sluggish and fragile emulations anymore !! • Comprehensive Command Line interface • New instrumentation technologies for data collection

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 8 Future Performance Analysis Tool Feature Highlights ••EaseEase of use is key focus – Simple configuration of analysis session – Copy and modify existing analysis types to adapt to special needs – Intuitive filtering and display of data collected – Stand-alone GUI but also seamless integration into Microsoft Visual Studio* on Windows* •• Extended Platform Coverage – Windows* and Linux – Microsoft* .NET* C# applications ••AdvancedAdvanced Source / Assembler View – Analysis / event data mapped to the source / assembler code – View and analyze assembler code as basic blocks

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 9 Different ways to All available start the analysis analysis types

Helps creating new analysis Copy the types commandline to clipboard

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 10 Future Performance Analysis Tool Analysis Techniques ••HotHot Spot Analysis Where is the execution time spend? – Includes statistical call graph – Minimal intrusiveness due to statistical data collection ••EventEvent Based Sampling (EBS): Where are architectural events like cache misses happening ? – Performance Event configuration – Event grouping ••ThreadThread Profiling Where is my concurrency poor and why? Which time do I spend in locks ? – Thread timeline visualizes thread activity and lock transitions – Optional integrates EBS data collection and view

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 11 Future Performance Analysis Tool Hotspot Analysis

• For Hotspots analysis, only stack sampling is performed – For each sample, capture execution context, time passed since previous sample and thread CPU time – No specific filtering is done for Hotspots. – The amount of data collected in Hotspots is rather moderate

Screenshots are subject to change

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 12 Future Performance Analysis Tool Hotspot Analysis –––Summary View

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 13 Future Performance Analysis Tool Hotspot Analysis

Callstack

Threads

Timeline

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 14 Future Performance Analysis Tool Parallelism/Concurrency Analysis

• For Parallelism / Concurrency analysis, – Stack sampling is done just like in Hotspots analysis – Wait functions are instrumented (e.g. WaitForSingleObject, EnterCriticalSection) – Signal functions are instrumented (e.g. SetEvent, LeaveCriticalSection) – I/O functions are instrumented (e.g. ReadFile, socket)

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 15 Future Performance Analysis Tool Concurrency Calculation

Thread1 Waiting Thread1

Thread2 Waiting Thread2

Thread3 Waiting Thread3

Thread running

Thread waiting 1sec 1sec 1sec 1sec 1sec 1sec

Threads running 111222 111 111 222 333

Concurrency Summary

0 12 3 4

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 16 Future Performance Analysis Tool Parallelism/Concurrency Analysis –––Summary View

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 17 Future Performance Analysis Tool Parallelism/Concurrency Analysis

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 18 Future Performance Analysis Tool Event Based Sampling

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 19 Future Performance Analysis Tool Event Based Sampling and Counting

• Captures program performance information in terms of hardware specific (PMU) events. • Uses event multiplexing to collect as much information possible during a single run • Initially only Sampling; support for Counting ( EBC ) will be added in future update – EBC gains relevance since most UNCORE events of latest processors from Intel can not be “sampled” – EBC doesn't capture CPU state information – EBC data cannot be attributed to the program flow – EBC mode has lower overhead and collects smaller trace files (tb5 / tb6)

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 20 Event Based Sampling • Pre-defined event sets for supported processors – Top level Triage – Cache analysis and false sharing – Branching issues – Long-latency computation operations – Structural hazards – Working set size – Data access patterns – Memory latency – Memory bandwidth • Event multiplexing • Pre-defined displays (viewpoints) transform data into information – Numerous views, sophisticated filtering – combining best we learned from EBS in Intel® Vtune™ Perf. Analyzer and Intel® PTU • Cycle accounting analysis methodology

21

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 21 Data Collection Techniques

• EBS: Similar to Intel® PTU, based on SEP (Sampling Enabling Platform • Hotspot Analysis: User-mode sampling (aka “stack sampling”) • Concurrency Analysis: Dynamic binary instrumentation of PIN – Only low-intrusive “Probe-Mode” ; slow ‘JIT-Mode’ not used by performance tools • Source modification/recompilation is not needed

VTune Analyzer, Improved PTU Event based sampling (PMU) Future Performance PIN for Analysis Tool Instrumentation, Parallel Amplifier TBS for hotspot and stack sampling)

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 22 Future Performance Analysis Tool Viewpoints and Groupings

• Groupings – Each analysis types have pre-defined groupings – Different groupings allow users analyze data in different ways with different focus in mind

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 23 Frame Analysis

• There are two ways to support frames and frame analysis – Explicit frame marker by program APIs – Recommend approach for version 1.0 – Check out \include\ittnotify.h

typedef struct __itt_frame_t *__itt_frame;

__itt_frame ITTAPI __itt_frame_createA(const char *domain); __ itt_frame ITTAPI __ itt_frame_createW (const wchar_t *domain);

void ITTAPI __itt_frame_begin(__itt_frame frame);

void ITTAPI __itt_frame_end (__itt_frame frame);

– Automated Frame detection using knowledge about graphic API (like DirectX) – Prototype available; no final decision on whether it will make it to initial product release already

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 24 Frame Analysis –––using APIs

itt_frame = __itt_frame_createA("SimDomain");

while(gRunning) { __itt_frame_begin(itt_frame);

start = clock(); //Wait all threads before moving into the next frame WaitForMultipleObjects(FUNCTIONAL_DOMAINS, eSignal, TRUE, INFINITE); stop = clock(); //Give all threads the "go" signal for (int i = 0; i < FUNCTIONAL_DOMAINS; i++) SetEvent(bSignal[i]); if (frame % NETWORKCONNETION_FREQ == 0) { //Start network thread SetEvent(bNetSignal); } __itt_frame_end(itt_frame); }

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 25 Frame Analysis in GUI

Frame boundaries

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 26 Frame Analysis in GUI Data View

The current thresholds are: •Fast > 100 fps •Good 40 < fps < 100 •Slow < 40 fps

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 27 User Events Check out \include\ittnotify.h typedef int __itt_event;

__itt_event LIBITTAPI __itt_event_createA(const char *name, int namelen); __itt_event LIBITTAPI __itt_event_createW(const wchar_t *name, int namelen); int LIBITTAPI __itt_event_start(__itt_event event); int LIBITTAPI __itt_event_end(__itt_event event);

DWORD WINAPI aiWork(LPVOID lpArg) { int tid = *((int*)lpArg); __itt_event aiEvent; aiEvent = __itt_event_createA ("AI Thread Work",14);

while(gRunning) { WaitForSingleObject(bSignal[tid], INFINITE); __itt_event_start(aiEvent); doSomeDataParallelWork(); __itt_event_end(aiEvent); SetEvent(eSignal[tid]); } return 0; } Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 28 User Events in GUI [1]

User defined task

User defined task User defined task

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 29 User Events in GUI [2]

User defined events

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 30 Future Performance Analysis Tool Command Line Interface Examples

D:\Examples\>"c:\Program Files (x86)\Intel\Amplifier XE\bin32\amplxe-cl.exe" Command Line Tool Copyright (C) 2009-2010 Intel Corporation. All rights reserved. ----- Usage ----- amplxe-cl <-action-option> [-modifier-option] [[--] target [target opt ions]]

Where : action-option is one of the following: collect, collect-list, command , finalize, help, import, knob -list, report, report -list, version

modifier-option is one or more of the following: allow-multiple-runs, callee-attribution-mode, csv-delimiter, cumulative-threshold-percent, data-limit, [no-]discard-raw-data, quiet, duration, filter, [no-]follow-child, format, group-by, knob, limit, mrte-mode, report-output, result-dir, resume-after , search-dir, start-paused , target-duration-type, target-pid, target-process, user-data-dir, verbose

To view the results in the IDE, double-click the .amplxe file located in the result directory.

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 31 Future Performance Analysis Tool Command Line Interface Examples

• Display a list of available analysis types and preset configuration levels amplxe-cl –collect-list

• Run Hot Spot analysis on target myApp and store result in default -named directory, such as r000hs inspxe-cl –c hotspots-- myApp

• Run the Parallelism analysis, store the result in directory r001par amplxe-cl -c parallelism -result-dir r001par -- myApp

Software & Services Group, Developer Products Division 32 Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Future Performance Analysis Tool and Legacy Tools: Feature Comparison

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 33 Intel® Vtune™ Performance Analyzer Features not available in initial Version of new Tool Performance Analysis for Java – Support for Java will be added later Exact, graphical call graph – Needed ?

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 34 Intel® Thread Profiler ( for Windows* only ) Features not available in new Tool Critical Path Analysis – Will be added to Future Performance Analysis Tool later

Thread 3 Thread 3 Acquire lock L Release L Wait for L Acquire L terminates

Thread 2 Thread 2 Wait for L Acquire L terminates

Release L Thread 1 Wait for Threads terminates 2 & 3 Threads Thread 1 2 & 3 Done

T0 T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14 T15

The critical path is the longest execution flow

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 35 Intel® Performance Tunining Utility (Intel® PTU) Features not available in new Tool Data Access Profiling – Based on PEBS – Precise Event Based Sampling – Very likely added to future update Control-flow graph / basic block view Heap Memory (“Space-Time”) Profiling – Very likely added to future update Call Count Analysis – Similar to Precise Call Graph of VTune but using PIN for instrumentation

PTU will continue to be available as a free, unsupported tool PTU will continue to be used for introducing new features

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 36 More “Wish---List”-List” Features Event Abstraction – Architecture-independent tuning using events – Hide specific processor-dependent details

Improved support for parallel tasking models – OpenMP * 3.0 tasking – Intel® Threading Building Blocks – Intel® Cilk™ Plus

EBS for VMM/Virtual operating system environments

Software & Services Group, Developer Products Division

Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 37 Summary • Future Performance Analysis Tool is the next generation performance tool – The new performance analysis tool is successor of popular Intel VTune + Thread Profiler – Additional features and technologies – New binary instrumentation engine – Standalone GUI on Linux and Windows* • Let us know about issues you see of course but also please submit feature requests • Existing Intel® Vtune™ Performance Analyzer license with non-expired support access will work for new tool

Software & Services Group, Developer Products Division 38 Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.