FZJ, May 20th, 2020, Dr. Heinrich Bockhorst

Agenda

Introduction

Processor Architecture Overview

Composer XE – Compiler

Intel Python

APS – Application Performance Snapshot

MPI and ITAC analysis

VTune Amplifier XE - analysis

Advisor XE - Vectorization

Selected Intel® Tools

References

Refer to software.intel.com/articles/optimization-notice for more information regarding performance & optimization choices in Intel software products. Copyright ©, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

The “Free Lunch” is over, really

Processor clock rate growth halted around 2005.

Software must be parallelized to realize all the potential performance.

Source: © 2014, James Reinders, Intel, used with permission

What platform should I use for code modernization?

Juwels

The world is going parallel – stick with sequential code and you will fall behind.

Processor                                                                                     Cores  Threads/Core  Vector Width   Peak Memory Bandwidth
Intel® Xeon® Processor E5-2600 v3 Product Family (formerly codenamed Haswell)                 18     2             256-bit        68 GB/s
Intel® Xeon Phi™ x100 Product Family (formerly codenamed Knights Corner)                      61     4             512-bit        352 GB/s
Intel® Xeon® Processor E5-2600 v4 Product Family (codenamed Broadwell)                        22     2             256-bit        77 GB/s
Intel® Xeon Phi™ x200 Product Family (codenamed Knights Landing)                              72     4             512-bit (x2)   >500 GB/s
… (codenamed Skylake)                                                                         28     2             512-bit (x2)   228 GB/s

Both Xeon and KNL are suitable platforms; KNL provides higher scale & memory bandwidth.
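To compare the platforms in the table on a per-core basis, the peak memory bandwidth can be divided by the core count. The sketch below does this arithmetic with the values as read from the slide. The short platform labels are shorthand introduced here, and the ">500 GB/s" entry for Knights Landing is treated as a lower bound of 500, which is an assumption.

```python
# Values as read from the platform table; the dictionary keys are shorthand labels.
PLATFORMS = {
    "Haswell": {"cores": 18, "bandwidth_gbs": 68},
    "Knights Corner": {"cores": 61, "bandwidth_gbs": 352},
    "Broadwell": {"cores": 22, "bandwidth_gbs": 77},
    "Knights Landing": {"cores": 72, "bandwidth_gbs": 500},  # slide says ">500": lower bound
    "Skylake": {"cores": 28, "bandwidth_gbs": 228},
}


def bandwidth_per_core(name):
    """Peak memory bandwidth in GB/s divided by the number of cores."""
    platform = PLATFORMS[name]
    return platform["bandwidth_gbs"] / platform["cores"]


if __name__ == "__main__":
    for name in PLATFORMS:
        print(f"{name:16s} {bandwidth_per_core(name):5.2f} GB/s per core")
```

These are peak figures only; measured bandwidth per core depends on the access pattern and on how many cores are active.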


Intel® Parallel Studio XE

Profiling, Analysis, and Architecture
▪ Intel® Inspector: Memory and Threading Checking
▪ Intel® Advisor: Vectorization Optimization and Thread Prototyping
▪ Intel® VTune™ Amplifier: Performance Profiler

Cluster Tools
▪ Intel® Cluster Checker: Cluster Diagnostic Expert System
▪ Intel® Trace Analyzer and Collector: MPI Profiler
▪ Intel® MPI Library

Performance Libraries
▪ Intel® Data Analytics Acceleration Library: Optimized for Data Analytics & Machine Learning
▪ Intel® Integrated Performance Primitives: Image, Signal, and Compression Routines
▪ Intel® Math Kernel Library: Optimized Routines for Science, Engineering, and Financial
▪ Intel® Threading Building Blocks: Task-Based Parallel C++ Template Library

Compilers and Python
▪ Intel® C/C++ and Fortran Compilers
▪ Intel® Distribution for Python: Performance Scripting


Intel® oneAPI Toolkits (beta)

Toolkits Tailored to Your Needs Domain-specific sets of tools to get your job done quickly.

Intel® oneAPI Base Toolkit: A core set of high-performance tools for building Data Parallel C++ applications and oneAPI library based applications

▪ Intel® oneAPI HPC Toolkit: Everything HPC developers need to deliver fast C++, Fortran, & OpenMP* applications that scale
▪ Intel® oneAPI IoT Toolkit: Tools for building high-performing, efficient, reliable solutions that run at the network’s edge
▪ Intel® oneAPI DL Framework Developer Toolkit: Tools for developers & researchers who build deep learning frameworks or customize existing ones so applications run faster
▪ Intel® oneAPI Rendering Toolkit: Powerful rendering libraries to create high-performance, high-fidelity visualization applications

Toolkits Powered by oneAPI

▪ Intel® System Bring-Up Toolkit: Tools to debug & tune power & performance in pre- & post-silicon development
▪ Intel® Distribution of OpenVINO™ Toolkit: Tools to build high performance deep learning inference & computer vision applications (production-level tool)
▪ Intel® AI Analytics Toolkit: Tools to build applications that leverage machine learning & deep learning models

Some useful links

▪ Intro: https://software.intel.com/en-us/oneapi
▪ oneAPI initiative: https://www.oneapi.com/
▪ Code migration: https://software.intel.com/en-us/oneapi-programming-guide-migrating-code-to-dpc
▪ Windows and toolkits available on https://software.intel.com/en-us/oneapi
▪ Access to hardware in DevCloud: https://software.intel.com/content/www/us/en/develop/tools/devcloud.html


Common Optimization Options

Option                                            Windows*                  Linux*, OS X*
Disable optimization                              /Od                       -O0
Optimize for speed (no code size increase)        /O1                       -O1
Optimize for speed (default)                      /O2                       -O2
High-level loop optimization                      /O3                       -O3
Create symbols for debugging                      /Zi                       -g
Multi-file inter-procedural optimization          /Qipo                     -ipo
Profile guided optimization (multi-step build)    /Qprof-gen, /Qprof-use    -prof-gen, -prof-use
Optimize for speed across the entire program      /fast                     -fast
  (“prototype switch”)
OpenMP support                                    /Qopenmp                  -qopenmp
Automatic parallelization                         /Qparallel                -parallel

The definition of the fast options changes over time! Currently:
▪ Windows*: /fast is the same as /O3 /Qipo /Qprec-div- /fp:fast=2 /QxHost
▪ Linux*: -fast is the same as -ipo -O3 -no-prec-div -static -fp-model fast=2 -xHost
▪ OS X*: -fast is the same as -ipo -mdynamic-no-pic -O3 -no-prec-div -fp-model fast=2 -xHost

Vectorization

Single Instruction Multiple Data (SIMD):
– Processing a vector with a single operation
– Provides data level parallelism (DLP)
– Because of DLP more efficient than scalar processing

Vector:
– Consists of more than one element
– Elements are of the same scalar data type (e.g. floats, integers, …)

Vector length (VL): number of elements of the vector

[Figure: scalar processing (A + B = C) versus vector processing (Ai + Bi = Ci for VL elements at once)]
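The relation between register width, element type and vector length can be written as a one-line calculation. The following is a minimal sketch; the function names are introduced here for illustration only.

```python
import math


def vector_length(register_bits, element_bits):
    """Number of elements of one scalar type that fit into one SIMD register."""
    if register_bits % element_bits != 0:
        raise ValueError("element size must divide the register width")
    return register_bits // element_bits


def simd_operations(n_elements, vl):
    """SIMD operations needed for n_elements, counting a partial last vector."""
    return math.ceil(n_elements / vl)


if __name__ == "__main__":
    vl = vector_length(512, 64)  # 512-bit register, 64-bit doubles
    print(vl, simd_operations(1000, vl))
```

For example, a 512-bit register holds 8 doubles, so 1000 additions need 125 vector operations instead of 1000 scalar ones. This counts instructions only and says nothing about memory bandwidth or clock frequency.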

Many Ways to Vectorize

Ordered from ease of use (top) to programmer control (bottom):

Compiler: Auto-vectorization (no change of code)

Compiler: Auto-vectorization hints (#pragma vector, …)

Compiler: Intel® Cilk™ Plus Array Notation Extensions

SIMD intrinsic class (e.g.: F32vec, F64vec, …)

Vector intrinsic (e.g.: _mm_fmadd_pd(…), _mm_add_ps(…), …)

Assembler code (e.g.: [v]addps, [v]addss, …)

Basic Vectorization Switches I

Linux*, OS X*: -x<feature>
▪ Might enable Intel processor specific optimizations
▪ Processor check added to “main” routine: the application exits with an appropriate/informative error message if the SIMD feature is missing or the processor is non-Intel
▪ Example: -xCORE-AVX2 (Jureca Xeon HSW)
▪ Example: -xMIC-AVX512 (Jureca Booster KNL)
▪ Example: -xCORE-AVX512 (Juwels Xeon SKL)

Linux*, OS X*: -ax<features>
▪ Multiple code paths: baseline and optimized/processor-specific
▪ Multiple SIMD features/paths possible, e.g.: -axSSE2,CORE-AVX512
▪ Baseline code path defaults to -xSSE2

Basic Vectorization Switches II

Special switch for Linux*, OS X*: -xHost

▪ Compiler checks SIMD features of current host processor (where built on) and makes use of latest SIMD feature available

▪ Code only executes on processors with same SIMD feature or later as on build host

Basic Vectorization Switches III

Special switch in addition to CORE-AVX512: -qopt-zmm-usage=[keyword]

▪ [keyword] = [high | low] ; Note: “low” is the default

▪ Why choose a defensive vectorization level?

The frequency drops in vectorized parts and does not increase immediately after the vectorized loop. Too many small vectorized loops will therefore decrease the performance of the serial part.

Basic Vectorization gcc

Linux*, OS X*: -m<feature> or -march=<cpu>
▪ Might enable processor specific optimizations
▪ Example: -mavx2 (Jureca Xeon HSW)
▪ Example: -march=knl (Jureca Booster KNL)
▪ Example: -march=skylake-avx512 (Juwels Xeon SKL)

Further Information: https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html
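The example switches for the Intel compiler and for gcc can be collected in one small lookup table, which makes it harder to mix up the flags of different systems in a build script. The sketch below is an illustration only: the target labels, the function name and the fixed -O2 are assumptions introduced here, while the architecture flags and -qopt-zmm-usage are copied from the slides.

```python
# Flags copied from the -x and -m/-march examples on the slides.
# The dictionary keys are hypothetical labels, not official system names.
ARCH_FLAGS = {
    "jureca-hsw": {"icc": "-xCORE-AVX2", "gcc": "-mavx2"},
    "jureca-booster-knl": {"icc": "-xMIC-AVX512", "gcc": "-march=knl"},
    "juwels-skl": {"icc": "-xCORE-AVX512", "gcc": "-march=skylake-avx512"},
}


def compile_command(compiler, target, source, zmm_high=False):
    """Return a compile command line as a list of arguments."""
    cmd = [compiler, "-O2", ARCH_FLAGS[target][compiler]]
    # -qopt-zmm-usage is an Intel compiler switch used in addition to CORE-AVX512.
    if zmm_high and compiler == "icc" and target == "juwels-skl":
        cmd.append("-qopt-zmm-usage=high")
    cmd += ["-c", source]
    return cmd


if __name__ == "__main__":
    print(" ".join(compile_command("icc", "juwels-skl", "kernel.c", zmm_high=True)))
    print(" ".join(compile_command("gcc", "jureca-hsw", "kernel.c")))
```

Returning a list instead of a string keeps the command ready for subprocess.run without shell quoting issues.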

Control Vectorization

Verify vectorization:

▪ Globally: Linux*, OS X*: -qopt-report[=n]; check for additional options (man icc)!

▪ Annotated source listing: -qopt-report-annotate=[text | html] generates a source listing with compiler comments

Advanced:
▪ Ignore vector dependencies (IVDEP):
  C/C++: #pragma ivdep
  Fortran: !DIR$ IVDEP

▪ “Enforce” vectorization:
  C/C++: #pragma omp simd
  Fortran: !$omp simd
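Ignoring dependencies or enforcing vectorization is only safe when the loop has no real loop-carried dependency. The sketch below imitates this in plain Python; it is a model of the effect, not of what any compiler actually generates. The loop a[i] = a[i-1] + b[i] is executed once element by element and once in chunks of VL iterations, where all loads of a chunk happen before any store. The function names and the chunk model are assumptions made for illustration.

```python
def scalar_recurrence(a, b):
    """Reference loop order: a[i] = a[i-1] + b[i], one element after another."""
    out = list(a)
    for i in range(1, len(out)):
        out[i] = out[i - 1] + b[i]
    return out


def chunked_recurrence(a, b, vl=4):
    """Same statement, VL iterations at a time: every load of a chunk
    happens before any store of that chunk, as in a SIMD instruction."""
    out = list(a)
    for start in range(1, len(out), vl):
        idx = list(range(start, min(start + vl, len(out))))
        loaded = [out[i - 1] for i in idx]                # "vector load" of a[i-1]
        summed = [x + b[i] for x, i in zip(loaded, idx)]  # "vector add"
        for i, value in zip(idx, summed):                 # "vector store"
            out[i] = value
    return out


if __name__ == "__main__":
    a = [1, 0, 0, 0, 0, 0, 0, 0, 0]
    b = [1] * 9
    print(scalar_recurrence(a, b))
    print(chunked_recurrence(a, b, vl=4))
```

The two results differ because each iteration needs the value written by the previous one. A loop such as c[i] = a[i] + b[i] has no such dependency and gives the same result for any chunk size.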


Python* Landscape

Adoption of Python continues to grow among domain specialists and developers for its productivity benefits

Challenge#1: Domain specialists are not professional software programmers.

Challenge#2: Python performance limits migration to production systems

Intel’s solution is to…
▪ Accelerate Python performance
▪ Enable easy access
▪ Empower the community


Tools for High-Performance Implementation Intel® Parallel Studio XE

Assuming hybrid MPI + Threading Program

Cluster Scalable?
  N → Tune MPI
  Y → Memory Bandwidth Sensitive?
        Y → Optimize Bandwidth
        N → Effective threading?
              N → Thread
              Y → Vectorize
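The decision flow on this slide can be expressed as a small function. The sketch below encodes one reading of the flowchart (not cluster scalable leads to Tune MPI, bandwidth sensitive leads to Optimize Bandwidth, no effective threading leads to Thread, otherwise Vectorize). The Y/N branch assignment is reconstructed from the flattened figure and should be treated as an assumption, as should the function and parameter names.

```python
def next_tuning_step(cluster_scalable, bandwidth_sensitive, effective_threading):
    """Return the next tuning step for a hybrid MPI + threading program.

    The branch order follows one reading of the slide's flowchart.
    """
    if not cluster_scalable:
        return "Tune MPI"
    if bandwidth_sensitive:
        return "Optimize Bandwidth"
    if not effective_threading:
        return "Thread"
    return "Vectorize"


if __name__ == "__main__":
    print(next_tuning_step(cluster_scalable=True,
                           bandwidth_sensitive=False,
                           effective_threading=False))
```

The three answers would typically come from a first measurement, for example an APS run, rather than from guesswork.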

Performance Analysis Tools for Diagnosis Intel® Parallel Studio XE

Cluster Scalable?
  N → Tune MPI: Intel® Trace Analyzer & Collector (ITAC)
  Y → Memory Bandwidth Sensitive?
        Y → Optimize Bandwidth
        N → Effective threading?
              N → Thread
              Y → Vectorize

Tools shown for the Optimize Bandwidth, Thread and Vectorize steps: Intel® VTune™ Amplifier and Intel® Advisor


Application Performance Snapshot (APS)
Data in One Place: MPI + OpenMP + Memory + Floating Point

Quick & Easy Performance Overview
▪ Does the app need performance tuning?
MPI & non-MPI Apps
▪ Distributed MPI with or without threading
▪ Shared memory applications
Popular MPI Implementations Supported
▪ Intel® MPI Library
▪ MPICH & Cray MPI
Richer Metrics on Computation Efficiency
▪ CPU (processor stalls, memory access)
▪ FPU (vectorization metrics)

MPI supported only on Linux*

Application Performance Snapshot (APS)

High-level overview of application performance
Primary optimization areas and next steps in analysis with deep tools
Easy to install, run, explore results with CL or HTML reports
Low collection overhead
Scales to large jobs
Multiple methods to obtain
▪ Part of Intel® Parallel Studio XE, VTune Amplifier standalone
▪ Separate free download (110 MB) from APS web page – https://software.intel.com/sites/products/snapshots/application-snapshot/

APS Usage

Setup Environment
• $ source /apsvars.sh (or Jülich: module load VTune)

Run Application
• $ aps
• MPICH: mpirun aps
• SLURM: srun aps (Jülich: please include #SBATCH --disable-perfparanoid)

Generate Report on Result Folder
• $ aps --report

Generate CL reports with detailed MPI statistics on Result Folder
• $ aps-report –
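The collect and report steps above can be assembled programmatically, for instance from a job-generation script. The sketch below only builds the command strings and does not run anything. The helper names, the example program name and the result folder name are hypothetical, and passing the result folder as a separate argument after --report is an assumption about the aps command line that should be checked against aps --help on the target system.

```python
import shlex


def aps_collect_command(program, args=(), launcher=None):
    """Command line that runs `program` under aps, optionally behind an MPI launcher."""
    cmd = list(launcher or []) + ["aps", program, *args]
    return shlex.join(cmd)


def aps_report_command(result_dir):
    """Command line that generates the summary report for one result folder."""
    return shlex.join(["aps", "--report", result_dir])


if __name__ == "__main__":
    # "./my_app" and "my_aps_result" are placeholder names.
    print(aps_collect_command("./my_app", ["-n", "4"], launcher=["srun"]))
    print(aps_report_command("my_aps_result"))
```

Using shlex.join keeps arguments with spaces correctly quoted when the string is pasted into a batch script.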

Suggested Order of TUNING Steps

1. Application Performance Snapshot (APS). Gives recommendation for next steps e.g.: (Please find more options inside the APS_Playbook.txt)

2. Intel Trace Analyzer and Collector (ITAC) (MPI scalability issues)

3. VTune analysis: Hotspots (profiling)

4. Advisor (Vectorization)

5. VTune analysis HPC: (Adding OpenMP analysis, Bandwidth and some Memory analysis)

6. VTune analysis: Memory Access

0. Check code with Inspector (Threading, Memory) and MPI with Message Checker (part of ITAC)


Boost Distributed Application Performance with Intel® MPI Library Performance, Scalability & Fabric Flexibility

Standards Based Optimized MPI Library for Distributed Computing
▪ Built on open source MPICH Implementation
▪ Tuned for low latency, high bandwidth & scalability
▪ Multi fabric support for flexibility in deployment

Learn More: software.intel.com/intel-mpi-library

Efficiently Profile MPI Applications Intel® Trace Analyzer & Collector

Helps Developers
▪ Visualize & understand parallel application behavior
▪ Evaluate profiling statistics & load balancing
▪ Identify communication hotspots

Features
▪ Event-based approach
▪ Low overhead
▪ Excellent scalability
▪ Powerful aggregation & filtering functions
▪ Idealizer
▪ Scalable

ITAC Analysis: high load imbalance causes MPI_Alltoall time

Online Resources

Intel® MPI Library product page. Intel MPI is free!
▪ www.intel.com/go/mpi
Intel® Trace Analyzer and Collector product page
▪ www.intel.com/go/traceanalyzer
Intel® Clusters and HPC Technology forums
▪ http://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology


Analyze & Tune Application Performance Intel® VTune™ Amplifier—Performance Profiler

Save Time Optimizing Code
▪ Accurately profile C, C++, Fortran*, Python*, Go*, Java*, or any mix
▪ Optimize CPU, threading, memory, cache, storage & more
▪ Save time: rich analysis leads to insight
▪ Take advantage of Priority Support – Connects customers to Intel engineers for confidential inquiries (paid versions)

What’s New in 2019 Release (partial list)
▪ New Platform Profiler! - Longer Data Collection
▪ A more accessible user interface provides a simplified profiling workflow
▪ Smarter, faster Application Performance Snapshot: Analyze CPU utilization of physical cores, pause/resume, more… (Linux*)
▪ Improved JIT profiling for server-side/cloud applications

Learn More: software.intel.com/intel-vtune-amplifier-xe

What’s New for 2019? Intel® VTune Amplifier Performance Profiler

New, Simplified Setup and More Intelligible Results
New Platform Profiler - Longer Data Collection
▪ Find hardware configuration issues
▪ Identify poorly tuned applications
Smarter, Faster Application Performance Snapshot
▪ Smarter: CPU utilization analysis of physical cores
▪ Faster: Lower overhead, data selection, pause/resume
Added Cloud, Container & Linux .NET Support
▪ JIT profiling on LLVM* or HHVM PHP servers
▪ Java* analysis on OpenJDK 9 and Oracle* JDK 9
▪ .NET support on Linux* plus Hyper-V* support
SPDK and DPDK I/O Analysis - Measure “Empty” Polling Cycles
Balance CPU/FPGA Loading
Additional Embedded OSs & Environments


Get Breakthrough Vectorization Performance Intel® Advisor—Vectorization Advisor

Faster Vectorization Optimization
▪ Vectorize where it will pay off most
▪ Quickly ID what is blocking vectorization
▪ Tips for effective vectorization
▪ Safely force compiler vectorization
▪ Optimize memory stride

Data & Guidance You Need
▪ Compiler diagnostics + Performance Data + SIMD efficiency
▪ Detect problems & recommend fixes
▪ Loop-Carried Dependency Analysis
▪ Memory Access Patterns Analysis

Optimize for Intel® AVX-512 with or without access to AVX-512 hardware
http://intel.ly/advisor-xe

Find Effective Optimization Strategies Intel® Advisor: Cache-aware Roofline Analysis

Roofline Performance Insights
▪ Highlights poor performing loops
▪ Shows performance ‘headroom’ for each loop
  – Which can be improved
  – Which are worth improving
▪ Shows likely causes of bottlenecks
▪ Suggests next optimization steps

“I am enthusiastic about the new "integrated roofline" in Intel® Advisor. It is now possible to proceed with a step-by-step approach with the difficult question of memory transfers optimization & vectorization which is of major importance.”
Nicolas Alferez, Software Architect, Onera – The French Aerospace Lab

Up to 35% Faster Performance
Science & Research: National Energy Research Scientific Computing Center

“Optimizing complex applications demands a sense of absolute performance. There are many potential optimization directions. It’s essential to know which direction to take, what factors are limiting performance, and when to stop.”

Dr. Tuomas Koskela, postdoctoral fellow at NERSC

With a single Intel Advisor survey, NERSC was able to discover most of the bottlenecks.

Software & workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark & MobileMark, are measured using specific computer systems, components, software, operations & functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Additional Material

Intel® Advisor – Threading Design & Prototyping:
▪ Product page – overview, features, FAQs, support…
▪ Training materials – movies, tech briefs, documentation…
▪ Reviews
Additional Analysis Tools:
▪ Intel® VTune Amplifier – performance profiler
▪ Intel® Inspector – memory and thread checker / debugger
Additional Development Products:
▪ Intel® Software Development Products

▪ https://software.intel.com/en-us/parallel-studio-xe/documentation/featured-documentation

Playbook - no official documentation

ASCII File containing command lines and instructions for tools usage:

Application Performance Snapshot (APS) Playbook

======

The Playbook contains command lines starting with $

Please change $PRG, $ARGS into the path,name and parameters of your program!

Version 1.0, 18.02.2019

0. Environment
------

load modules
------

$ module load Intel
$ module load IntelMPI
$ module load VTune

For Batch jobs
======

Please include: --disable-perfparanoid

in command line for sbatch or in scripts with #SBATCH --disable-perfparanoid

check for important executables

$ which aps

check version

$ aps --version
…


Debug Memory & Threading with Intel® Inspector Find & Debug Memory Leaks, Corruption, Data Races, Deadlocks

Correctness Tools Increase ROI by 12%-21% [1]
▪ Errors found earlier are less expensive to fix
▪ Races & deadlocks not easily reproduced
▪ Memory errors are hard to find without a tool

Debugger Integration Speeds Diagnosis
▪ Breakpoint set just before the problem
▪ Examine variables and threads with the debugger

Diagnose in hours instead of months

[Figure: Debugger Breakpoints]

Learn More: intel.ly/inspector-xe

[1] Cost Factors – Square Project Analysis - CERT: U.S. Computer Emergency Readiness Team, and Carnegie Mellon CyLab; NIST: National Institute of Standards & Technology: Square Project Results

How to start?

▪ Compile with minimal options and run with APS (will provide tuning tips)
▪ Compile with -xHost and -qopt-report=5 and check timing and APS report
▪ Optional! Compile with -xHost and -no-vec (disables vectorization). Compare with previous timing
▪ Use: VTune Amplifier XE: $ module load VTune/
▪ Use: Advisor XE: $ module load Advisor/
▪ Google for Intel related topics → etc.

▪ Please set thread affinity, e.g.: $ export KMP_AFFINITY=scatter,verbose
  This can speed up OMP programs by up to 10X!

▪ Any questions: [email protected]
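The thread-affinity advice above can be pictured with a deliberately simplified placement model. The sketch below is not the OpenMP runtime's algorithm: it assumes that "scatter" hands out threads round-robin over sockets first and that "compact" fills the hardware threads of one core before moving on. The function name, the tuple layout and the machine shape in the example are all hypothetical.

```python
def place_threads(n_threads, sockets, cores_per_socket, threads_per_core, policy):
    """Return a (socket, core, hw_thread) slot per OpenMP thread id (simplified model)."""
    placement = []
    for t in range(n_threads):
        if policy == "compact":
            # Fill the hardware threads of a core, then the next core, then the next socket.
            hw = t % threads_per_core
            core = (t // threads_per_core) % cores_per_socket
            socket = (t // (threads_per_core * cores_per_socket)) % sockets
        elif policy == "scatter":
            # Round-robin over sockets first, then cores, then hardware threads.
            socket = t % sockets
            core = (t // sockets) % cores_per_socket
            hw = (t // (sockets * cores_per_socket)) % threads_per_core
        else:
            raise ValueError("policy must be 'compact' or 'scatter'")
        placement.append((socket, core, hw))
    return placement


def sockets_used(placement):
    """Number of distinct sockets that received at least one thread."""
    return len({socket for socket, _, _ in placement})


if __name__ == "__main__":
    for policy in ("compact", "scatter"):
        slots = place_threads(4, 2, 2, 2, policy)
        print(policy, slots, "sockets used:", sockets_used(slots))
```

In this model four threads under "compact" share one socket and two cores, while "scatter" spreads them over both sockets and four cores. The verbose modifier in KMP_AFFINITY=scatter,verbose makes the runtime print the binding it actually chose, which is the authoritative check.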

Optimize Efficiently with Valuable Resources

Shortcut Optimization: Attend TEC Webinars! Sign up now

Intel® Parallel Studio XE
▪ Overview, features, support, code samples
▪ Training materials, Tech.Decoded webinars, how-to videos & articles
▪ Reviews & Case Studies
▪ More Intel® Software Development Products

Intel Code Modernization Program
▪ Overview
▪ Live training
▪ Remote Access https://intel.ly/2PdkNhN
