Advanced Features of Intel® C++ Composer XE for Linux

Jeff Arnold Intel Corporation 18 February 2011

Software & Services Group

• Preliminaries
• Intel® Parallel Studio XE 2011
• Intel® C++ Composer XE
• Intel® Parallel Building Blocks
  – Intel® Cilk™ Plus
  – Intel® Array Building Blocks
• Performance Libraries
• Intel® VTune™ Amplifier XE 2011
• Intel® Inspector XE 2011

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.

BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Centrino logo, Cilk, Core Inside, FlashFile, i960, InstantIP, Intel, the Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others.

http://intel.com/software/products

Optimization Notice

Intel® compilers, associated libraries and associated development tools may include or utilize options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors. For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to the “Intel® Compiler User and Reference Guides” under “Compiler Options." Many library routines that are part of Intel® compiler products are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® compiler products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors.

Intel® compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.

While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel® and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements. We hope to win your business by striving to offer the best performance of any compiler or library; please let us know if you find we do not. Notice revision #20101101

Developer Products Division Copyright© 2011, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 2011-02-18

Hickory, Dickory, Dock – The ISV Development Clock

[Figure: microarchitecture timeline running from Intel® Pentium® M (Yonah) and Intel® Core™ Microarchitecture (Merom) through Enhanced Intel® Core™ Microarchitecture and Intel® Microarchitecture codename Nehalem to a future Intel® microarchitecture]

Architectural and Micro-Architectural changes require software changes to realize the full benefit

All dates, product descriptions, availability, and plans are forecasts and subject to change without notice.

• The trend toward multi-core mobile, desktop, and server processors is expected to continue into the foreseeable future.
• Software must be ready to take full advantage of it.

[Figure: progression of processor designs]
Dual core
• Symmetric multithreading
Multi-core array
• CMP with ~10 cores
Many-core array
• CMP with 10s-100s low power cores
• Scalar cores
• Capable of TFLOPS+
• Full System-on-Chip
• Servers, workstations, embedded…
Large, scalar cores for high single-thread performance; scalar plus many core for highly threaded workloads

CMP = Chip Multi-Processing

Intel® Parallel Studio XE 2011
Powerful tools to create fast, reliable and secure code

Phase: Advanced Build & Debug
  Productivity Tool: Intel® Composer XE
  Feature: C/C++ and Fortran compilers, performance libraries, and parallel models
  Benefit: Application performance, scalability and quality for current multicore and future many-core systems.

Phase: Advanced Verify
  Productivity Tool: Intel® Inspector XE
  Feature: Memory & threading error checking tool for higher code reliability & quality
  Benefit: Increases productivity and lowers cost, by catching memory and threading defects early

Phase: Advanced Tune
  Productivity Tool: Intel® VTune™ Amplifier XE
  Feature: Performance Profiler to optimize performance and scalability
  Benefit: Removes guesswork, saves time, makes it easier to find performance and scalability bottlenecks. Combines ease of use with deeper insights.

Today’s Focus: Intel® Composer XE

Get Outstanding Application Performance from Intel Compiler Suite Products

New Names, Same Great Tradition of Compilers & Library Performance

Old name → New name

• Intel® C++ Compiler, Professional Edition for Windows* → Intel® C++ Composer XE for Windows*
• Intel® Visual Fortran Compiler, Professional Edition for Windows* → Intel® Visual Fortran Composer XE for Windows*
• Intel® Visual Fortran Compiler, Professional Edition for Windows* with IMSL* → Intel® Visual Fortran Composer XE for Windows* with IMSL*
• Intel® Compiler Suite, Professional Edition for Windows* → Intel® Composer XE for Windows*

• Intel® C++ Compiler, Professional Edition for Linux* → Intel® C++ Composer XE for Linux*
• Intel® Fortran Compiler, Professional Edition for Linux* → Intel® Fortran Composer XE for Linux*
• Intel® Compiler Suite, Professional Edition for Linux* → Intel® Composer XE for Linux*

• Intel® C++ Compiler, Professional Edition for Mac OS X* → Intel® C++ Composer XE for Mac OS X*
• Intel® Fortran Compiler, Professional Edition for Mac OS X* → Intel® Fortran Composer XE for Mac OS X*

Intel Performance-Oriented Compiler Suites
Compilers, Performance Libraries, Debugging Tools: C++, Fortran on Windows, Linux and Mac OS X

Intel® C++ Composer XE 2011
• Intel® C++ Compiler XE 12.0
• Intel® Parallel Debugger Extension
• Intel® Parallel Building Blocks
• Intel® Math Kernel Library
• Intel® Integrated Performance Primitives

Intel® Fortran Composer XE 2011
• Intel® Fortran Compiler XE 12.0
• Intel® Parallel Debugger
• Intel® Math Kernel Library
• Intel® Integrated Performance Primitives

Intel Composer XE 2011
• Combines Intel C++ Composer XE and Intel® Fortran Composer XE
• For Fortran developers who also want Intel C++
• Windows, Linux only

• Windows: Integrates into Microsoft* Visual Studio*, Intel C++/Visual C++ Compatibility
• Linux: Integrates into Eclipse CDT, Intel C++ Compatible with GCC
• Mac OS: Integrates into Xcode Environment, Compatible with GCC
• All: 1 Year Premier Support Renewable Annually

Performance Compatibility Support

• Major release of C/C++ and Fortran compilers v12.0
• Advanced C/C++ parallelism with Intel® Parallel Building Blocks
• Advanced vectorization with SIMD pragmas
• Co-array Fortran and more Fortran 2008 support
• Updated versions of Intel® MKL & Intel® IPP

Parallel Program Debugging

What’s New!

• Improved performance
• Subset of C++0x in support of Visual C++ compatibility
• Support for Visual Studio* 2010 (continuing 2005 & 2008 support)
• Enhanced vectorization capabilities: GAP and SIMD pragmas
• Parallelism Discovery Assistant – enhanced loop profiler
• Expanded parallelism-dev features: Intel® Parallel Building Blocks
• Fortran 2003 support and many Fortran 2008 features, including Co-Array Fortran
• Improved Intel performance libraries integration: Intel® Math Kernel Library, Intel® Integrated Performance Primitives
• New hardware support: Intel® Sandy Bridge
• Many Integrated Core Architecture – MICA – extensions (beta)
• 32-bit and 64-bit support
• Windows*, Linux* and Mac OS X*

Ongoing Commitment to Innovation & Standards

Intel® C++ Compatibility and Performance Leadership

• Intel® Cilk™ Plus: Easy to use language extensions for array syntax that deliver great performance through parallelism and more readable syntax

• Staying on top of the performance heap
  – Enhanced vectorization and auto-parallelization that apply to more situations in code. Developers love seeing this in their build logs.
  – Low overhead loop and function profiling shows hotspots and where to introduce threads

• Guided Auto Parallelism suggests code changes to get the compiler to auto-vectorize and/or auto-parallelize, a great productivity tool that delivers great performance

• More C++0x and C99 standards support for enhanced compatibility with Visual C++

• Even more performance from optimized string intrinsics that use Intel® SSE 4.2 instructions

The Intel C++ Compiler provides the following language conformances
- ANSI/ISO standard for C language compilation (ISO/IEC 9899:1990)
- ANSI/ISO standard (ISO/IEC 14882:1998) for the C++ language

The Fortran Compiler provides the following language conformances
- Fortran 95 language standard
- Fortran 90 language standard
- Fortran 77 language standard
- Fortran IV
- Also includes many features from the Fortran 2003 language standard, as well as numerous popular language extensions.

Source (mostly) and binary compatible
• Mixing and matching binary files created by g++, including third-party libraries
• Generating C++ code compatible with gcc/g++ 3.2 or higher (up to 4.3)
• Improved support for command-line options offered in the GNU compilers
• Support of most GNU C and C++ language extensions
• Glibc 2.3.2, 2.3.4, 2.3.5 or 2.8
• Linux Kernel 2.4.x or 2.6.x

Limitations
• Intel Fortran Compiler for Linux is not binary compatible with GNU g77 or GNU gfortran compiler

Interprocedural Optimization (IPO)

• Cross-module optimization
• IPO is a seamless process. Most optimization actually happens during the link phase
• Benefits of IPO
  – Optimization of large number of frequently used small & medium functions, especially those called in loops
  – Function Inlining
    – Eliminates need for arguments setup, call branch/return overhead
    – Enables opportunities for other optimizations (const prop, DCE, etc.)
  – Dead code elimination, Better register usage
  – Improved alias analysis for better auto-vectorization & loop transformations
• May increase build-time/binary size

Interprocedural Optimizations (IPO)

• ip: Enables inter-procedural optimizations for current source file compilation
  – Linux*: -ip
  – Windows*: /Qip
• ipo: Enables inter-procedural optimizations across files
  – Linux*: -ipo
  – Windows*: /Qipo
  – Can inline functions in separate files
  – Permits inlining and other inter-procedural optimizations among multiple source files. The optional value argument controls the maximum number of link-time compilations (or number of object files) spawned. Default for value is 0 (the compiler chooses).
  – Enhances optimization when used in combination with other compiler features

Other Techniques for Inlining Functions

• Compiler Switches
  – Increase information provided to the compiler
    -ipo, -prof-use (Linux)
    /Qipo, /Qprof-use (Windows)
  – Change Compiler Heuristics
    -inline-factor=n (default=100), /Qinline-factor=n
    -inline-level=0|1|2, /Ob0|1|2
• Inlining source code features
  – GCC C/C++
    __attribute__((always_inline))
    __attribute__((noinline))
  – Microsoft* C/C++
    Keywords: inline, __inline, __forceinline
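The source-level inlining controls listed above can be sketched as follows. This is a minimal illustration, not taken from the original slides: the macro names FORCE_INLINE and NO_INLINE and the functions are hypothetical, and __declspec(noinline) is the Microsoft counterpart of the GCC noinline attribute (it does not appear in the list above).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical portability macros wrapping the compiler-specific spellings.
#if defined(_MSC_VER) && !defined(__GNUC__)
  #define FORCE_INLINE __forceinline
  #define NO_INLINE    __declspec(noinline)
#else
  #define FORCE_INLINE inline __attribute__((always_inline))
  #define NO_INLINE    __attribute__((noinline))
#endif

// Small, hot helper: request inlining at every call site.
FORCE_INLINE double square(double x) { return x * x; }

// Rarely taken path: keep it out of line so hot callers stay small.
NO_INLINE double fallback_value() { return 0.0; }

// The call to square() sits inside the loop, where call overhead matters most.
double sum_of_squares(const double* v, std::size_t n) {
    if (n == 0) return fallback_value();
    assert(v != nullptr);
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += square(v[i]);
    return s;
}
```

Attributes override the compiler's heuristics for individual functions, while the switches above change the heuristics for a whole compilation.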

• Auto-vectorizer exploits SIMD/DLP opportunities
  – Auto-vectorizes sequential operations using SSE and AVX instructions
  – No significant changes to source-code
  – Much easier to learn, debug, maintain
  – Forward looking with respect to compilers and processors
• Optimized code for targeted processor(s)
  – Both Intel and AMD*
  – Mixed processors environment supported as well
• Processor Specific Optimization
  – Targeting specific Intel Processor(s)
  – e.g. for Intel® Core™ i7 use -xSSE4.2
• Auto-dispatch: Processor Optimized Optimization
  – Includes both optimized and generic (SSE2) code-paths
  – e.g. for Intel® Core™ i7 use -axSSE4.2

Group 1: -m such as -msse3
• Optimizes for both Intel® and compatible, non-Intel processors

Group 2: -x such as -xAVX
• Targets Intel® processors only
• Application will not start on non-Intel processors or if instruction set is not available

Group 3: -ax such as -axsse4.2
• Creates default and additional processor-specific paths
• Processor-specific path(s), for Intel® processors only, defined by
• default code path is -msse2 unless explicitly modified
• default code path can be changed using an additional switch from group 1 or 2
• multiple processor-specific paths can be specified
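The idea behind the -ax switches (a default code path plus a processor-specific path chosen at run time) can be imitated by hand. The sketch below is only a conceptual illustration and is not how the Intel compiler implements auto-dispatch: all function names are hypothetical, the "tuned" path is a plain unrolled loop standing in for processor-specific code, and __builtin_cpu_supports is a GCC/Clang builtin assumed to be available on x86 targets.

```cpp
#include <cassert>
#include <cstddef>

// Default path: plain C++, runs on any processor.
static void add_generic(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i] + b[i];
}

// Stand-in for a processor-specific path (here just unrolled by 4).
static void add_tuned(const float* a, const float* b, float* c, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; ++i) c[i] = a[i] + b[i];  // remainder elements
}

using AddFn = void (*)(const float*, const float*, float*, std::size_t);

// True when the running CPU reports SSE4.2; false where the check is unavailable.
static bool cpu_has_sse42() {
#if (defined(__GNUC__) || defined(__clang__)) && !defined(__INTEL_COMPILER) && \
    (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("sse4.2");
#else
    return false;
#endif
}

// Pick the code path once, based on the processor the program is running on.
AddFn select_add() { return cpu_has_sse42() ? add_tuned : add_generic; }

void add(const float* a, const float* b, float* c, std::size_t n) {
    assert(a && b && c);
    static const AddFn fn = select_add();
    fn(a, b, c, n);
}
```

Both paths must produce the same results, so callers never need to know which one was selected.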

Key Intel® Advanced Vector Extensions (Intel® AVX) Features

KEY FEATURES and BENEFITS
• Wider Vectors
  – Increased from 128 bit to 256 bit
  – Benefit: Up to 2x peak FLOPs (floating point operations per second) output with good power efficiency
• Enhanced Data Rearrangement
  – Use the new 256 bit primitives to broadcast, mask loads and permute data
  – Benefit: Organize, access and pull only necessary data more quickly and efficiently
• Three and four Operands, Non Destructive Syntax
  – Designed for efficiency and future extensibility
  – Benefit: Fewer register copies, better register use for both vector and scalar code
• Flexible unaligned memory access support
  – Benefit: More opportunities to fuse load and compute operations

Intel® AVX is a general purpose architecture, expected to supplant SSE in all applications used today

Intel® Advanced Vector Extensions (Intel® AVX): 2X Vector Width
A 256-bit vector extension to SSE

• Intel® AVX extends all 16 XMM registers to 256 bits

[Figure: register YMM0 (256 bits, 2010) with XMM0 (128 bits, 1999) as its lower half]

• Intel AVX works on either
  – The whole 256-bits – for FP instructions
  – The lower 128-bits (like existing SSE instructions)
    – A drop-in replacement for all existing scalar/128-bit SSE instructions
    – The upper part of the register is zeroed out
• Intel AVX targets a high-performance first implementation
  – 256-bit Multiply, Add and Shuffle engines (2X today)
  – 2nd load port

• Scalar mode
  – one instruction produces one result
• SIMD processing
  – with SSE or AVX instructions
  – one instruction can produce multiple results

Scalar: X + Y = X + Y
SIMD:   [x7 x6 x5 x4 x3 x2 x1 x0] + [y7 y6 y5 y4 y3 y2 y1 y0] = [x7+y7 x6+y6 x5+y5 x4+y4 x3+y3 x2+y2 x1+y1 x0+y0]
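The scalar and SIMD modes can be compared side by side with SSE intrinsics. This is a minimal sketch rather than code from the original deck: the function names are hypothetical, it uses the 4-wide SSE form instead of the 8-wide case drawn above, and it falls back to the scalar loop when SSE is not available at compile time.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
  #include <xmmintrin.h>
  #define HAVE_SSE 1
#else
  #define HAVE_SSE 0
#endif

// Scalar mode: each add instruction produces one result.
void add_scalar(const float* x, const float* y, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = x[i] + y[i];
}

// SIMD mode: each _mm_add_ps produces four results at once.
// For brevity this sketch requires n to be a multiple of 4.
void add_simd(const float* x, const float* y, float* out, std::size_t n) {
    assert(n % 4 == 0);
#if HAVE_SSE
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);   // load 4 floats, no alignment required
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(out + i, _mm_add_ps(vx, vy));
    }
#else
    add_scalar(x, y, out, n);              // no SSE: same results, one at a time
#endif
}
```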

SSE
• 4x floats

SSE-2
• 2x doubles
• 16x bytes
• 8x 16-bit shorts
• 4x 32-bit integers
• 2x 64-bit integers
• 1x 128-bit(!) integer

AVX-256 Data Types on “Sandy Bridge”

now
• 8x floats
• 4x doubles

possible future implementations?
• 32x bytes
• 16x 16-bit shorts
• 8x 32-bit integers
• 4x 64-bit integers
• 2x 128-bit(!) integer
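The element counts in the SSE and AVX-256 lists follow directly from the register width divided by the element size. A small sketch of that arithmetic (the helper name lanes is hypothetical, and the checks assume the usual 4-byte float and 8-byte double):

```cpp
#include <cstddef>
#include <cstdint>

// Number of elements of type T that fit in a register of the given width (in bits).
template <typename T>
constexpr std::size_t lanes(std::size_t register_bits) {
    return register_bits / (8 * sizeof(T));
}

// 128-bit SSE registers
static_assert(lanes<float>(128) == 4, "SSE: 4x floats");
static_assert(lanes<double>(128) == 2, "SSE-2: 2x doubles");
static_assert(lanes<std::int16_t>(128) == 8, "SSE-2: 8x 16-bit shorts");

// 256-bit AVX registers
static_assert(lanes<float>(256) == 8, "AVX: 8x floats");
static_assert(lanes<double>(256) == 4, "AVX: 4x doubles");
```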

Compiling for Intel® AVX (high level)

• Compile with -xavx
  – Intel processors only
  – Vectorization works just as for SSE
  – Main speedups are for floating point
    – No integer 256 bit instructions in first generation
    – Up to ~1.8x performance for Linpack
  – Best if 32 byte aligned
  – More loops can be vectorized than with SSE
    – Individually masked data elements
    – More powerful data rearrangement instructions
• -axavx gives both SSE and AVX code paths
  – use -x or -m switches to modify the default SSE code path
    – E.g. -axavx -xsse4.2 to target Nehalem and AVX
• Math libraries may target AVX automatically at runtime

• Found in immintrin.h
• Names typically begin with _mm256_
  – E.g. _mm256_add_pd()
  – SSE intrinsics typically begin with _mm_
• New data types:
  – __m256 holds 8 32-bit floats
  – __m256d holds 4 64-bit doubles
  – __m256i holds integers: 32 8-bit, 16 16-bit, 8 32-bit or 4 64-bit
  – Intrinsics may also use SSE data types __m128i etc.
• Manual cpu dispatch (temporary names; Intel processors only)
  – __declspec(cpu_specific(future_cpu_16))
  – __declspec(cpu_dispatch(future_cpu_16,…))
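A short example of the _mm256_ intrinsics and the __m256d type described above. This is a sketch under stated assumptions: the function name add_pd is hypothetical, the AVX path is compiled only when the compiler defines __AVX__ (for example with -xavx or -mavx), and otherwise the scalar loop handles every element.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__AVX__)
  #include <immintrin.h>
#endif

// c[i] = a[i] + b[i] for doubles, four lanes per instruction when AVX is enabled.
void add_pd(const double* a, const double* b, double* c, std::size_t n) {
    assert(a && b && c);
    std::size_t i = 0;
#if defined(__AVX__)
    for (; i + 4 <= n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);            // 4 doubles, unaligned load
        __m256d vb = _mm256_loadu_pd(b + i);
        _mm256_storeu_pd(c + i, _mm256_add_pd(va, vb));
    }
#endif
    for (; i < n; ++i) c[i] = a[i] + b[i];              // remainder (or whole array without AVX)
}
```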

Automatic Vectorization by Compiler: Translates Loops into SIMD Parallelism
Loop is stripmined (unrolled): strip length of 8 for floats with AVX, cf. 4 for floats with SSE

for (i=0;i<=MAX;i++) c[i]=a[i]+b[i];

A[7] A[6] A[5] A[4] A[3] A[2] A[1] A[0]
 +    +    +    +    +    +    +    +
B[7] B[6] B[5] B[4] B[3] B[2] B[1] B[0]

C[7] C[6] C[5] C[4] C[3] C[2] C[1] C[0]

(128-bit Registers)
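Strip-mining can be written out by hand to show what the vectorizer does to the loop. This is an illustrative sketch only (the compiler performs this transformation itself); the name add_stripmined and the constant kStrip are hypothetical, with kStrip = 8 matching the strip length quoted for floats with AVX.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kStrip = 8;  // 8 floats per 256-bit register

void add_stripmined(const float* a, const float* b, float* c, std::size_t n) {
    assert(a && b && c);
    std::size_t i = 0;
    // Main loop: whole strips. Each inner loop maps onto one vector add.
    for (; i + kStrip <= n; i += kStrip)
        for (std::size_t j = 0; j < kStrip; ++j)
            c[i + j] = a[i + j] + b[i + j];
    // Remainder loop: the elements that do not fill a whole strip.
    for (; i < n; ++i) c[i] = a[i] + b[i];
}
```

The remainder loop is why trip counts that are multiples of the strip length tend to vectorize most cleanly.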

Features of AVX loads on Sandy Bridge

• Performance of vmovupd is as good as vmovapd when the data is 32 byte aligned
  – Therefore, compiler never generates vmovapd, only vmovupd
  – No alignment faults if data is not always aligned
• Performance of 32 byte aligned loads is better than unaligned loads
  – Try to align your data
• Performance of two 16 byte loads may be better than one unaligned 32 byte load
  – Compiler may split 32 byte loads into two 16 byte loads
    – if known to be unaligned, or if 32 byte alignment unknown
• Performance of 16 byte unaligned loads not much worse than aligned 16 byte loads (similar to Nehalem)
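One way to follow the "try to align your data" advice in standard C++11 is alignas. The sketch below uses hypothetical names (AlignedBlock, is_aligned) and standard C++ rather than any Intel-specific alignment syntax.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when p sits on a multiple of `alignment` bytes (alignment must be a power of two).
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// A block whose first element is guaranteed to start on a 32-byte boundary,
// so a 256-bit load from data[0] or data[4] is an aligned load.
struct alignas(32) AlignedBlock {
    double data[8];
};

static_assert(alignof(AlignedBlock) == 32, "block must be 32-byte aligned");
```

Note that alignas applied to a type covers static and stack objects; memory obtained from a general-purpose allocator may need an aligned allocation routine instead.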

Mixing AVX-256 and SSE instructions

• Legacy Intel® SSE instructions preserve the value of the upper 128 bits of a YMM register
  – 128-bit Intel® AVX instructions will zero the upper 128 bits
• There is a performance penalty when switching between 256-bit Intel AVX and SSE
  – due to save/restore of the upper 128 bits
  – With -xavx, compiler will prefer AVX-128 to SSE
• User advice: avoid mixing functions with AVX-256 and functions with SSE that call each other.
  – Where possible, recompile with -xavx
    – Automatically converts SSE intrinsics to AVX-128
    – Automatically converts SSE inline assembly to AVX-128

• GNU* tools
  – gcc 4.4.1 (AVX-128 bit) and later
  – binutils 2.20.51.0.1 and later
    – objdump for disassembly
  – gdb 6.8.50.20090915 and later
• Microsoft* Visual Studio* 2010
  – Compiler and optimizer support
    – /arch:AVX
  – Intrinsics
  – MASM
  – Disassembly
  – Debugger support for YMM registers

• Intel® MKL has had Intel® AVX tunings since MKL 10.2.0
  – mkl_enable_instructions() activated it (64-bit only)
• Further Intel AVX optimizations in MKL 10.3
  – DGEMM & SGEMM optimizations
  – All BLAS level 3 functions
  – LU/Cholesky/QR & eigensolvers in LAPACK
  – FFTs of lengths 2^n
  – VML/VSL
  – no special activation needed
• Intel® IPP has supported Intel AVX since IPP 6.1
  – “g9” code for IA-32, “e9” code for Intel 64
  – Automatic optimization using the compiler switch /Qxavx (-xavx)
  – Certain functions have been hand-optimized for AVX.
    http://software.intel.com/en-us/articles/intel-ipp-functions-optimized-for-intel-avx-intel-advanced-vector-extensions/
• Further Intel AVX optimizations in IPP 7.0
  – Hand-optimized functions for Image Compression

Many routines in the Intel® MKL and IPP libraries are more highly optimized for Intel microprocessors than for non-Intel microprocessors.

• Serial portion of code is automatically translated into multi-threaded code when possible
  – Performs dataflow analysis to verify correct parallel execution
  – Partitions data for threaded code
• Parallel runtime support offers same features as in OpenMP*
  – Handling details of loop iteration modification
  – Thread scheduling
  – Synchronization
• Enabled by -parallel switch

Intel® Guided Auto Parallelism
Let the Compiler Tell You What it Needs

• Motivation
  – Effective, simplified way to add parallelism to applications
  – Use built-in compiler technology to speed parallelism development
• What is GAP?
  – Compiler-based analyzer that provides guidance to developers to change code so it can be compiled to automatically optimize code through vectorization, parallelization, or data transformation
  – Built upon existing auto-vectorization and auto-parallelization technology
• GAP does not
  – Analyze code and find hotspots for threading (see Advisor)
  – Verify threading correctness (use Inspector)
  – Do any performance/hotspot analysis (use VTune Amplifier)

Developer Must Verify Semantics of GAP Recommendations

• Requires optimization level set to -O2 or higher
  – Works with command line options or in the Eclipse IDE
  – Neither IPO nor PGO is required, but advice may change if they are used
  – User may apply all or a subset of the advice provided by GAP
  – When multiple messages apply to a given loop, ALL suggestions for that loop must be applied to get the desired optimization
• User can specify regions of a file or routine that are considered important to optimize
  – Advice will be restricted to that region
  – Default is to provide advice on entire compilation-unit
• Advice may involve
  – Suggestions for source changes that assert new properties
  – Adding pragmas for loop if semantics are satisfied
  – Adding new compilation options
• GAP output is a set of GAP messages, not .exe
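As an illustration of a source change that "asserts a new property", the sketch below adds a no-aliasing promise to a loop. It is an assumed example of the kind of change such advice can lead to, not actual GAP output: the function names and the RESTRICT macro are hypothetical, and __restrict/__restrict__ are compiler extensions. The developer remains responsible for the promise being true.

```cpp
#include <cassert>
#include <cstddef>

#if defined(_MSC_VER) && !defined(__GNUC__)
  #define RESTRICT __restrict
#else
  #define RESTRICT __restrict__
#endif

// The compiler must assume `out` may overlap `in`, which can block vectorization.
void scale_may_alias(const float* in, float* out, float k, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = k * in[i];
}

// New property asserted by the developer: `in` and `out` never overlap.
void scale_no_alias(const float* RESTRICT in, float* RESTRICT out, float k, std::size_t n) {
    assert(in != out);  // cheap sanity check only; it does not detect partial overlap
    for (std::size_t i = 0; i < n; ++i) out[i] = k * in[i];
}
```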

Intel® Guided Auto Parallelism: GAP Workflow

[Figure: GAP workflow]
• Traditional hotspot analysis: Application Source (C/C++/Fortran) → Compiler → Application Binary → Performance Tools → Identify hotspots, problems
• Advice: Application Source + Hotspots → Compiler in advice-mode → Advice messages (compiler suggests source modifications to enable vectorization, parallelization)
• Rebuild: feed modified source back to compiler for optimization: Modified Application Source → Compiler (extra options) → Improved Application Binary

• GAP – Guided Auto Parallelism
  – New in Intel® Parallel Composer 2011
  – -guide switch
  – Provides advice on source changes that could enable parallelism
  – Provides analysis and suggestions; doesn’t actually generate code
• Other reports
  – -vec-report
    – Which loops were vectorized, which were not
    – Why they weren’t vectorized
  – -par-report
  – -opt-report
    – Reports available for a variety of optimizations
    – icc -help reports for more details

• Enabled with -O3
  – With auto-vectorization, it does more aggressive data dependency analysis than at -O2
  – Exploits properties of source code (loops & arrays)
  – Best chance for performing loop transformations
• Performs loop transformations
  – Loop distribution
  – Loop interchange
  – Loop fusion
  – Loop unrolling
  – Data pre-fetching
  – PGO based loop unrolling
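Loop interchange, one of the transformations listed above, can be shown by hand on a row-major matrix. This is an illustrative sketch with hypothetical function names; at -O3 the compiler may perform the interchange itself when it can prove that doing so is legal.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major n x n matrix in a flat vector: element (i, j) is m[i * n + j].

// Before: the inner loop walks down a column, so consecutive accesses are n elements apart.
double sum_strided(const std::vector<double>& m, std::size_t n) {
    assert(m.size() == n * n);
    double s = 0.0;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            s += m[i * n + j];
    return s;
}

// After interchange: the inner loop walks along a row with unit stride,
// which is friendlier to caches, prefetching and vectorization.
double sum_unit_stride(const std::vector<double>& m, std::size_t n) {
    assert(m.size() == n * n);
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            s += m[i * n + j];
    return s;
}
```

Interchange changes the order of the floating-point additions, so for general data the two sums can differ in the last bits.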

• Choice of precision for math functions
  – Lower precision may give better performance
  – In 11.1 and earlier:
    – high precision for libm (~0.55 ulp)
    – lower precision for libsvml (< 4 ulp)

• Bitwise reproducible libraries
  – Identical results on different processors
    – E.g. Intel® Core™2 Duo, Intel® Core™ i7, AMD* processors
  – In prior versions, cpu dispatch could cause differences
  – There may still be differences between
    – IA-32 and Intel64
    – different compiler versions
  – Achieved by calling high accuracy function versions that use instructions available to all processors
  – There will be some cost in performance

• -fimf-precision=
  – Default is off (compiler chooses)
    – Typically high for scalar code, medium for vector code
  – low typically halves the number of mantissa bits
  – high ~0.55 ulp
  – medium < 4 ulp (typically 2)
• -fimf-arch-consistency=
  – Will produce consistent results on all microarchitectures or processors within the same architecture
  – Run-time performance may decrease
  – Default is false (even with -fp-model precise !)

• Can specify at the function level
  – -fimf-precision=[:fnlist]
    – e.g. -fimf-precision=low:func1,func2,func3
  – -fimf-arch-consistency=[:fnlist]
• Can specify desired accuracy in different ways:
  – -fimf-max-error=ulps‡[:fnlist]
    – Maximum relative error
    – E.g. -fimf-max-error=0.6 for high accuracy
  – -fimf-absolute-error=value[:fnlist]
    – Max absolute error specified as a floating-point number
  – -fimf-accuracy-bits=bits[:fnlist]
    – Required accuracy specified as a number of mantissa bits

‡ulps = Units in the Last Place
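To make the ulp figures concrete, the error of a single-precision result can be measured against a double-precision reference. This is a minimal sketch with hypothetical helper names (ulp_of, error_in_ulps); it is not how the compiler or the math libraries define or verify their accuracy bounds.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Spacing between |x| and the next larger float: one unit in the last place at x.
inline double ulp_of(float x) {
    assert(std::isfinite(x));
    const float ax = std::fabs(x);
    const float next = std::nextafter(ax, std::numeric_limits<float>::infinity());
    return static_cast<double>(next) - static_cast<double>(ax);
}

// Error of `approx` against a more precise `reference`, in ulps of the
// reference rounded to float.
inline double error_in_ulps(float approx, double reference) {
    const float ref_f = static_cast<float>(reference);
    return std::fabs(static_cast<double>(approx) - reference) / ulp_of(ref_f);
}
```

For example, error_in_ulps(std::sin(0.5f), std::sin(0.5)) reports how far the float sine is from the double result; a "high" accuracy function in the sense above would be expected to stay near 0.55 ulp.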

• No new run-time libraries, but new entry points
  – High accuracy functions typically have names ending in _ha
  – Low accuracy functions typically have names ending in _ep
    – “extra performance”
    – About half the number of bits of the high accuracy version
  – Bit-wise reproducible functions typically have names starting with __bwr or terminating in _br
    – In some cases, the _ha functions are bit-wise reproducible and so no _br version is needed

• New compile-time library libiml_attr
  – Tells the compiler which libm entry point to call
  – .so or .dll located in bin directory

Loop Profiler: Identify Time Consuming Loops/Functions
Enables targeting parallelization and optimization efforts to most significant code areas (hotspot identification)

• Simple to use
  – Add compiler option to command line to instrument application
    – Compiler adds instrumentation calls to loops and function entry and exit points
  – Run application to get profile report file
    – Both a human-readable text file (a table) and an XML-file are generated
  – Analyze data by looking at raw text file or use GUI viewer shipped with compiler
• Report file contains for example
  – Call count of routines
  – Self-time of functions / loops
  – Total-time of functions / loops
  – Average, minimum, maximum iteration count of loops
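The kind of data the loop profiler collects (call counts and time spent) can be imitated by hand with a small probe at function entry and exit. This is a conceptual sketch only: ProfileEntry, ScopedTimer, g_work_profile and work are hypothetical names, and the instrumentation the compiler actually inserts is not shown in this deck.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// One record per instrumented function or loop.
struct ProfileEntry {
    std::uint64_t calls = 0;
    std::chrono::steady_clock::duration total{0};
};

// Probe: starts timing on construction (entry), records on destruction (exit).
class ScopedTimer {
public:
    explicit ScopedTimer(ProfileEntry& e)
        : entry_(e), start_(std::chrono::steady_clock::now()) {}
    ~ScopedTimer() {
        entry_.total += std::chrono::steady_clock::now() - start_;
        ++entry_.calls;
    }
    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;
private:
    ProfileEntry& entry_;
    std::chrono::steady_clock::time_point start_;
};

ProfileEntry g_work_profile;

// Instrumented function: total time and call count accumulate in g_work_profile.
long work(int n) {
    ScopedTimer probe(g_work_profile);
    assert(n >= 0);
    long s = 0;
    for (int i = 0; i < n; ++i) s += i;
    return s;
}
```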

Loop Profile Data Viewer: GUI Alternative to Plain Text Output

[Figure: screenshot of the loop profile data viewer, showing the Function Profile View and the Loop Profile View]
• Menu to allow user to enable filtering or displaying the source code
• Column headers allow selection to control sort criteria independently for function and loop table

Intel® Parallel Building Blocks
Comprehensive Tools to Deliver Outstanding App Performance

Intel® Cilk™ Plus
  What: Language extensions to simplify task/data parallelism
  Features:
  • 3 simple keywords & array notations for parallelism
  • Support for task and data parallelism
  • Semantics similar to serial code
  Why:
  • Simple way to parallelize your code
  • Sequentially consistent, low overhead, powerful solution
  • Supports C, C++, Windows and Linux

Intel® Threading Building Blocks
  What: Widely used C++ template library for task parallelism
  Features:
  • Parallel algorithms and data structures
  • Scalable memory allocation and task scheduling
  • Synchronization primitives
  Why:
  • Rich feature set for general purpose parallelism
  • Available as open source or commercial license
  • Supports C++, Windows, Linux, Mac OS X, other OSs

Intel® Array Building Blocks
  What: Sophisticated C++ template library for vector parallelism
  Features:
  • Automatically scales to future Intel platforms
  • Use of cores, threads, SIMD, determined at runtime
  Why:
  • Use for flexible vector parallelism
  • JIT & VM technology: flexible and powerful
  • Supports C++, Windows & Linux

Mix & Match to Optimize Your Application’s Performance

• Intel® Cilk™ Plus (language extension to C/C++)
  – New keywords: cilk_for, cilk_spawn, cilk_sync
  – Vector notation for arrays
  – Elemental functions (vector functions)
• Intel® Threading Building Blocks (Intel® TBB)
  – C++ template library
  – C++ Containers for tasks
  – Flexible grain size
  – Sophisticated built-in task scheduling
  – Open source
• Intel® Array Building Blocks (Intel® ArBB)
  – C++ library for parallel array operations
  – Dynamic compilation allows architectural customization
  – Array notation enabling parallel array operations
  – ABI published


Intel Cilk Plus: What is it?
• Compiler assisted solution offering a tasking system via 3 simple keywords
• Includes array notation to specify vector code
• Has a hyper objects library which offers powerful parallel data structures
• Based on 15 years of research at MIT
• Pragmas to force vectorization of loops and specify functions that can be applied to all elements of arrays

Intel Cilk Plus: Key Benefits
• Simple syntax which is very easy to learn and use
• Array notation guarantees fast vector code
• Pragmas guarantee vectorization of loops over arbitrary user code
• Fork/join tasking system is simple to understand and offers safety from errors
• Low overhead tasks offer scalability to high core counts
• Hyper objects enable reductions which give the same answers as serial code
• Mixes with Intel TBB and Intel ArBB for a complete task and vector parallel solution


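• Intel Cilk Plus: Keyword & Hyperobjects Example

    #include <cilk/cilk.h>
    #include <cilk/reducer_list.h>

    cilk::reducer_list_append<int> pos;  // list reducer: parallel appends keep serial order

    void findnum(int *MAX, float *array, float val) {
        cilk_for (int i = 0; i < *MAX; i++)
            if (array[i] == val) pos.push_back(i);
    }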

• Intel Cilk Plus: CEAN (C/C++ Extensions for Array Notation) & Elemental Functions Example

    __declspec(vector) double ef_add(double x, double y) { return x + y; }

    a[:] = ef_add(b[:], c[:]);
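For readers without a Cilk Plus compiler, the two examples above can be read against their serial counterparts. The following is a minimal sketch in standard C++, not Cilk Plus code; the names `findnum_serial` and `add_arrays` are made up for illustration, and `std::vector` stands in for the raw arrays on the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// Serial counterpart of the cilk_for/reducer example: the loop visits indices
// in ascending order, so matching indices are appended in ascending order.
std::list<int> findnum_serial(const std::vector<float>& array, float val) {
    std::list<int> pos;
    for (std::size_t i = 0; i < array.size(); ++i)
        if (array[i] == val) pos.push_back(static_cast<int>(i));
    return pos;
}

// Plain scalar function, applied element by element below.
double ef_add(double x, double y) { return x + y; }

// Serial counterpart of a[:] = ef_add(b[:], c[:]).
std::vector<double> add_arrays(const std::vector<double>& b,
                               const std::vector<double>& c) {
    assert(b.size() == c.size());  // both inputs must have the same length
    std::vector<double> a(b.size());
    for (std::size_t i = 0; i < b.size(); ++i) a[i] = ef_add(b[i], c[i]);
    return a;
}
```

Calling `findnum_serial` on the values 1, 2, 1, 3 with `val` equal to 1 yields the indices 0 and 2, which is the result the reducer version is meant to reproduce regardless of how the loop is split across workers.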


• Seeking task or vector parallelism

• Task-based parallelism with serial semantics is required

• Reduction operations need to give consistent answers as the number of cores varies

• Need a compiled language with no JIT/VM capability

• A fork/join tasking model is sufficient

• Need to guarantee array notation or loops run as high performance vector code

• Vectorize loops over arbitrary user functions applied to entire arrays

Cilk Plus: A powerful yet simple, easy-to-learn, compiler-assisted capability offering low-overhead, high-performance task & vector parallelism
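The point about reductions giving consistent answers as the number of cores varies can be illustrated without any parallel runtime. The sketch below is plain C++, not Cilk Plus; `chunked_sum` is a hypothetical helper that mimics a reduction split into contiguous pieces, one per worker. Because floating-point addition is not associative, the number of pieces can change the result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: split `values` into `chunks` contiguous pieces, sum each
// piece left to right, then add the partial sums in order. chunks == 1 is the
// ordinary serial sum.
double chunked_sum(const std::vector<double>& values, std::size_t chunks) {
    assert(chunks > 0);
    const std::size_t n = values.size();
    const std::size_t per_chunk = (n + chunks - 1) / chunks;  // ceiling division
    double total = 0.0;
    for (std::size_t start = 0; start < n; start += per_chunk) {
        const std::size_t end = (start + per_chunk < n) ? start + per_chunk : n;
        double partial = 0.0;
        for (std::size_t i = start; i < end; ++i) partial += values[i];
        total += partial;
    }
    return total;
}
```

With the input 1e20, 1.0, -1e20, 1.0 in IEEE double precision, one chunk gives 1.0 and two chunks give 0.0, while the exact mathematical sum is 2.0. This only shows the underlying problem; it makes no claim about how the hyperobject library is implemented.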

Intel® Array Building Blocks: In Beta

Intel ArBB: What is it?
• A C++ template library for flexible vector parallelism
• Utilizes a JIT and VM to offer high performance
• Runs vectors on multiple cores

Intel ArBB: Key Benefits
• High performance and flexible vector parallelism
• Built-in data types for commonly used data
• Compile once/run everywhere
• Future proof – accommodates changing vector lengths
• No special compiler – easy to integrate incrementally into existing environments
• Mixes with Intel Cilk Plus and Intel TBB for a complete task and vector parallel solution

Intel® Array Building Blocks: example

    void findnum(dense<f32> array, f32 val, dense<usize>& results) {
        dense<boolean> locations = (array == val);
        dense<usize> matching_indices = indices(0, array.length());
        results = pack(matching_indices, locations);
    }
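The ArBB example above works in three steps: an element-wise comparison produces a mask, an index sequence is generated, and a pack keeps only the indices where the mask is set. The sketch below spells out those steps in standard C++ so the data flow is visible. It is not ArBB code: `equals_mask`, `make_indices`, `pack`, and `findnum` here are illustrative helpers written for this sketch, with `std::vector` in place of ArBB's dense containers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element-wise comparison: mask[i] is true where array[i] == val.
std::vector<bool> equals_mask(const std::vector<float>& array, float val) {
    std::vector<bool> mask(array.size());
    for (std::size_t i = 0; i < array.size(); ++i) mask[i] = (array[i] == val);
    return mask;
}

// Index sequence 0, 1, ..., n-1.
std::vector<std::size_t> make_indices(std::size_t n) {
    std::vector<std::size_t> idx(n);
    for (std::size_t i = 0; i < n; ++i) idx[i] = i;
    return idx;
}

// Pack: keep values[i] only where mask[i] is true, preserving order.
std::vector<std::size_t> pack(const std::vector<std::size_t>& values,
                              const std::vector<bool>& mask) {
    assert(values.size() == mask.size());
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < values.size(); ++i)
        if (mask[i]) out.push_back(values[i]);
    return out;
}

// Same three steps as the slide: mask, indices, pack.
std::vector<std::size_t> findnum(const std::vector<float>& array, float val) {
    const std::vector<bool> locations = equals_mask(array, val);
    const std::vector<std::size_t> matching_indices = make_indices(array.size());
    return pack(matching_indices, locations);
}
```

Each helper is a whole-array operation with no dependence between elements, which is the property that lets a library like ArBB map such operations onto SIMD lanes and cores.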

Intel® Array Building Blocks – when to use

• Seeking a library based vector parallelism solution for C++

• Have array or vector rich calculations

• Seeking a compile once/run everywhere deployment model, based on JIT compilation

• Need deterministic execution

ArBB: A sophisticated C++ template-library-based capability offering vector parallelism using JIT technology for flexible performance & deployment

Guidance: Parallel Building Blocks

Select from a Variety of Powerful Tools to Aid Parallelism


• Scaling performance on Intel processors
• Parallel implementations – shared and distributed memory
  – Extensively threaded math functions with excellent scaling
  – Threading in Vector Math Functions
  – OpenMP* compatibility library supports Microsoft and GNU OpenMP implementations
• Maximize application performance
  – Automatic runtime processor detection ensures great performance on whatever processor your application is running on
  – Optimizations for recent Intel processors
  – Cluster functionality is standard
• Function Domains
  – Linear Algebra: BLAS, LAPACK
  – Linear Algebra: Sparse Solvers
  – Fast Fourier Transforms
  – Vector Math Library
  – Vector Statistical Library

Highly Optimized Math Library for Scientific, Engineering, Financial, and Energy Applications

Intel® Integrated Performance Primitives 7.0

Applications: Digital Media | Web/Enterprise Data | Embedded Communications | Scientific/Technical

High level APIs and Codecs Interfaces and Code Samples

Intel® Integrated Performance Primitives 16 Function Domains

Multimedia
• Image Processing
• Color Conversion
• JPEG/JPEG2000
• Video Coding
• Computer Vision
• Realistic Rendering

Signal Processing
• Signal Processing
• Audio Coding
• Speech Coding
• Speech Recognition
• Vector Operations

Data Processing
• Data Compression
• Data Integrity
• Cryptography
• String Processing
• Matrix Operations

Cross-platform C/C++ API for Code Re-use

Optimized 32-bit and 64-bit Multicore Performance

Intel® Integrated Performance Primitives
What’s New In Version 7.0?

• New performance optimizations for the latest Intel processors: the new Advanced Encryption Standard (AES) and CRC32C instructions are used for dramatic performance increases in cryptography and data compression algorithms on Intel® Core i7 and later processors

• Windows Imaging Component (WIC) API support for faster and easier adoption of IPP image codecs by Windows developers.

• Improved JPEG codec multicore performance scaling (6x on 8 core machines)

• New JPEG-XR codec (aka HD Photo), a new image compression standard
  – 2x the compression level for the same image quality, without need for greater memory or computing resources
  – Supports lossless and lossy compression as well as incremental decompression of specific image regions
  – Supports higher dynamic range and color depth than existing image codecs

• Improved ease of use for Deferred Mode Image Processing (DMIP) via Visual Studio* Domain Specific Language graphical programming user interface

Intel® VTune™ Amplifier XE Performance Profiler

Where is my application…

Spending Time?
• Focus tuning on functions taking time
• See call stacks
• See time on source

Wasting Time?
• See cache misses on your source
• See functions sorted by # of cache misses

Waiting Too Long?
• See locks by wait time
• Red/Green for CPU utilization during wait

• Windows & Linux • Low overhead • No special recompiles

Advanced Profiling For Scalable Multicore Performance

Intel® VTune™ Amplifier XE
Tune Applications for Scalable Multicore Performance

• Fast, Accurate Performance Profiles
  – Hotspot (Statistical call tree)
  – Hardware-Event Based Sampling
• Thread Profiling
  – Visualize thread interactions on timeline
  – Balance workloads
• Easy set-up
  – Pre-defined performance profiles
  – Use a normal production build
• Compatible
  – Microsoft, GCC, Intel compilers
  – C/C++, Fortran, Assembly, .NET
  – Latest Intel® processors and compatible processors (1)
• Find Answers Fast
  – Filter extraneous data
  – View results on the source / assembly
  – Event multiplexing
• Windows or Linux
  – Visual Studio Integration (Windows)
  – Standalone user interface and command line
  – 32 and 64-bit

(1) IA32 and Intel® 64 architectures. Many features work with compatible processors. Event based sampling requires a genuine Intel® Processor.


• User Mode Sampling and Tracing Analysis
  – Dynamically instruments binary (configurable APIs)
  – Uses an OS interrupt for each thread to collect samples, and keeps a sample if the thread was active since the last sample
  – Collects call stack
• Hardware Event-based Sampling Analysis
  – Uses installed driver to configure and collect interrupts from the Performance Monitoring Unit of each Intel CPU core

Double Click from Grid or Timeline
See Profile Data On Source / Asm

Callouts on the source / assembly view:
• Time on Source / Asm
• Quick Asm navigation: Select source to highlight Asm
• Right click for instruction reference manual
• Quickly scroll to hot spots: the scroll bar “Heat Map” is an overview of hot spots
• Click jump to scroll Asm

Intel® VTune™ Amplifier XE
Timeline Visualizes Thread Behavior

[Figure: timeline view. Labels: Transitions; CPU Time; Locks & Waits; Hotspots; Lightweight Hotspots; Hovers]

• Optional: Use API to mark frames and user tasks
• Optional: Add a mark during collection

Profile a Running Application
No need to stop and re-launch the app when profiling

Two Techniques (Attach to Process is currently only available for Windows):
• Attach to Process:
  – Hotspot
  – Concurrency
  – Locks & Waits
• Profile System:
  – Lightweight Hotspots
  – Advanced & Custom EBS
  – Optional: Filter by process after collection


• amplxe-cl is the command line interface.
  – Linux: /opt/intel/vtune_amplifier_xe/bin[32|64]/amplxe-cl
  – Windows: C:\Program Files\Intel\VTune Amplifier XE\bin[32|64]\amplxe-cl.exe
• To get detailed help: amplxe-cl -help
• Get Command Line from GUI
• Command examples:
  1. amplxe-cl -collect-list
  2. amplxe-cl -knob-list=hotspots
  3. amplxe-cl -collect hotspots -- myapp.exe [MyParams]
  4. amplxe-cl -report hotspots

Remote Data Collection
Conveniently analyze data collected on remote systems

Local System: VTune™ Amplifier XE, full user interface
Remote System: lightweight command line collector
(The command line is copied to the remote system; the results file is copied back.)

1. Set up the experiment using GUI locally
2. Copy command line instructions to paste buffer
3. Open remote shell on target machine
4. Paste command line, run collection
5. Copy result file to your local system
6. Open file using local GUI

• Minimal “performance footprint” during collection
• Easy setup using GUI
• Easy analysis of results


• Create Baseline:
    $> amplxe-cl -collect hotspots -r BaseLinePerf -- myapp.exe
    $> amplxe-cl -report hotspots -r BaseLinePerf
• Nightly Performance Regression Testing:
    $> amplxe-cl -collect hotspots -r NightlyResults -- myapp.exe
    $> amplxe-cl -report hotspots -r BaseLinePerf -r NightlyResults
    […stuff Deleted …]
    Module     Process    Result 1:CPU Time  Result 2:CPU Time  Difference:CPU Time
    myapp.exe  myapp.exe  23.141             61.531             -38.391
    …

Intel® VTune™ Amplifier XE
Compare Results Quickly - Sort By Difference

• Quickly identify cause of regressions
  – Run a command line analysis daily
  – Identify the function responsible so you know who to alert
• Compare 2 optimizations
  – What improved?
• Compare 2 systems
  – What didn’t speed up as much?
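The idea of sorting a comparison by difference can be sketched in a few lines. The code below is an illustration in standard C++, not part of VTune Amplifier XE: the `FunctionTimes` struct and the helper names are assumptions, and the CPU times would have to be extracted from two reports by some other means. The difference is computed as baseline minus nightly, so the largest slowdown has the most negative value and sorts first.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical row of a two-result comparison: one function, two CPU times.
struct FunctionTimes {
    std::string name;
    double baseline_cpu_time;
    double nightly_cpu_time;
};

// Baseline minus nightly: negative means the nightly run was slower.
double difference(const FunctionTimes& f) {
    return f.baseline_cpu_time - f.nightly_cpu_time;
}

// Orders rows so the largest slowdown comes first.
bool more_regressed(const FunctionTimes& a, const FunctionTimes& b) {
    return difference(a) < difference(b);
}

void sort_by_difference(std::vector<FunctionTimes>& rows) {
    std::sort(rows.begin(), rows.end(), more_regressed);
}

// The single function whose CPU time grew the most.
const FunctionTimes& worst_regression(const std::vector<FunctionTimes>& rows) {
    assert(!rows.empty());
    return *std::min_element(rows.begin(), rows.end(), more_regressed);
}
```

In a nightly job, the name returned by `worst_regression` is the function to look at first when deciding who to alert.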

Readying Your Application for Intel VTune Amplifier XE

• You should run Amplifier XE on a “Released/Optimized” build.
• Symbols allow you to view the source (not just the assembly)
  – Linux: -g
• Intel Threading Runtimes need instrumented runtimes
  – TBB: Define TBB_USE_THREADING_TOOLS
  – OpenMP: Use Intel Dynamic Version of OpenMP
• Call Stack Mode
  – Requires use of the dynamic version of the C Runtime library to properly attribute system calls
  – Linux: do not use -static

Intel® Inspector XE 2011
Advancing Application Reliability, Code Quality and Security

• Powerful Robust Dynamic Analysis
  – Memory Errors
      – Invalid Memory Accesses
      – Memory Leaks
      – Uninitialized Memory Accesses
      – Improper usage of Memory API(s)
      – Resource Leaks (Windows only)
  – Threading Errors
      – Data Races
      – Deadlock/Lock Hierarchy Violation
      – Cross Stack Memory Accesses
• Productivity Features
  – View Context of Problem (Stack, Multiple Contributing Source Lines)
  – Bug does not have to occur to find it!
  – Suppression, Filtering, and Workflow Management
  – Time Line visualization
  – Visual Studio Integration (Windows)
  – Memory Leak Snapshots (Linux)
  – Break into Debugger on Error (Linux)
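To make the "Data Races" category concrete, here is a minimal sketch of the kind of code a threading-error analysis is aimed at, shown in its corrected form. It assumes a C++11 compiler and the standard thread library, neither of which the slides mention; the function name `count_with_lock` is made up for illustration.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical example: several threads increment one shared counter.
long count_with_lock(int num_threads, int increments_per_thread) {
    assert(num_threads > 0 && increments_per_thread >= 0);
    long counter = 0;
    std::mutex counter_mutex;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, &counter_mutex, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                // Without this lock, ++counter would be an unsynchronized
                // read-modify-write on shared memory, i.e. a data race.
                std::lock_guard<std::mutex> guard(counter_mutex);
                ++counter;
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return counter;
}
```

With the lock in place, four threads doing 1000 increments each always produce 4000. Removing the `lock_guard` line gives a program that often still prints a plausible number, which is why a tool that reports the race without the bug having to manifest is useful.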

Intel® Parallel Inspector XE Process Flow

[Figure: Intel® Inspector XE process flow. Labels: Application.cpp; Compile/Link; Application.exe + .dll(s); Configuration; Execution / JIT Instrumentation; Runtime Data Collector; Reduced Data File (results), r###[t|m]#.insp; Suppression Filter; Filter / Change State / Suppress]

Readying Your Sources
• Intel Inspector XE can analyze any native binary, but some switches (like symbols and debug) make the results easier to read
  – Linux: -O0 -g
• Threading Error Analysis: use of the dynamic version of the C Runtime library will avoid false positives
  – Linux: do not use -static
• Use of switches that implement similar functionality in the binary is not recommended
• Intel Threading Runtimes require switches to reduce false positives in Threading Error Analysis
  – TBB: Define TBB_USE_THREADING_TOOLS
  – Use the dynamic version of the OpenMP compatibility library supplied by the Intel® Compiler


• inspxe-cl is the command line interface.
  – Windows: C:\Program Files\Intel\Inspector XE\bin[32|64]\inspxe-cl.exe
  – Linux: /opt/intel/inspector_xe/bin[32|64]/inspxe-cl
• To get detailed help: inspxe-cl -help
• Get Command Line from GUI
• Command examples:
  1. inspxe-cl -collect-list
  2. inspxe-cl -collect ti2 -- MyApp.exe
  3. inspxe-cl -report problems

More Help is available with the Online Documentation

Remote Data Collection
Conveniently analyze data collected on remote systems

Local System: Inspector XE, full user interface
Remote System: lightweight command line collector
(The command line is copied to the remote system; the results file is copied back.)

1. Set up the experiment using GUI locally
2. Copy command line instructions to paste buffer
3. Open remote shell on target machine
4. Paste command line, run collection
5. Copy result file to your local system
6. Open file using local GUI


• Create Baseline Suppression File:
    $> inspxe-cl -collect ti2 -r BaseLineResults -- App.exe
    $> inspxe-cl -create-suppression-file MyThread.sup -result-dir BaseLineResults
• Nightly Performance Regression Testing:
    $> inspxe-cl -collect ti2 -suppression-file MyThread.sup -r NightlyTestResults -- App.exe
    […Stuff Deleted…]
    0 new problem(s) found

Summary
Intel Performance-Oriented Compiler Suites

Intel® C++ Composer XE 2011 | Intel® Fortran Composer XE 2011 | Intel® Composer XE 2011

• Windows: Integrates into Microsoft* Visual Studio*, Intel C++/Visual C++* compatibility
• Linux: Integrates into Eclipse* CDT, Intel C++ compatible with GCC
• Mac OS: Integrates into Xcode* environment, compatible with GCC
• All: 1 year Premier Support, renewable annually

Performance | Compatibility | Support
