Intel Tuning for Jurecaand JuWELS

Dr. Heinrich Bockhorst – Date: 29.11.2019 Agenda

• Introduction • Processor Architecture Overview • Composer XE – Compiler • Intel Python • APS – Application Performance Snapshot • MPI and ITAC analysis • VTune Amplifier XE - analysis • Advisor XE - Vectorization • Selected Intel® Tools • References

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Processor Architecture Overview The “Free Lunch” is over, really Processor clock rate growth halted around 2005

Source: © 2014, James Reinders, Intel, used with permission must be parallelized to realize all the potential performance

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. What platform should I use for code modernization?

Today (Juwels)

The world is going Intel® Xeon® Processor Intel® Xeon Phi™ x100 Intel® Xeon® Processor Intel® Xeon Phi™ x200 parallel – stick E5-2600 v3 Product Product Family E5-2600 v4 Product Product Family with sequential Family formerly formerly codenamed Family codenamed codenamed … code and you will codenamed Knights Broadwell Knights Skylake fall behind. Haswell Corner Landing

Cores 18 61 22 72 28

Threads/Core 2 4 2 4 2

Vector Width 256-bit 512-bit 256-bit 512-bit (x2) 512-bit (x2)

Peak Memory Bandwidth 68 GB/s 352 GB/s 77 GB/s >500 GB/s 228 GB/s

Both Xeon and KNL are suitable platforms; KNL provides higher scale & memory bandwidth.

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Intel® Parallel Studio XE Intel® Parallel Studio XE

Intel® Inspector Intel® Advisor Vectorization Optimization and Thread Prototyping Memory and Threading Checking Intel® Cluster Checker Cluster Diagnostic Expert System

Architecture Intel® VTune™ Amplifier Performance Profiler Intel® Trace Analyzer and Collector

MPI Profiler

Profiling, Analysis, and Cluster Cluster Tools Intel® Data Analytics Acceleration Library Intel® MPI Library Optimized for Data Analytics & Machine Learning Intel® Integrated Performance Primitives

Image, Signal, and Compression Routines Libraries

Performance Performance Intel® Intel® Threading Building Blocks Optimized Routines for Science, Engineering, and Financial Task-Based Parallel ++ Template Library

Intel® C/C++ and Fortran Compilers Intel® Distribution for Python Performance Scripting

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Intel® CompoSERXE Common OptimizationOptions

Windows* *, OS* X

Disable optimization /Od -O0 Optimize for speed (no code size increase) /O1 -O1

Optimize for speed (default) /O2 -O2 High-level loop optimization /O3 -O3 Create symbols for debugging /Zi -g Multi-file inter-procedural optimization /Qipo -ipo Profile guided optimization (multi-step build) /Qprof-gen -prof-gen /Qprof-use -prof-use Optimize for speed across the entire program /fast -fast same as: /O3 /Qipo same as: (“prototype switch”) /Qprec-div-, /fp:fast=2 Linux: -ipo –O3 -no-prec-div –static –fp- fast options definitions changes over time! /QxHost) model fast=2 -xHost) OS X: -ipo -mdynamic-no-pic -O3 -no-prec- div -fp-model fast=2 -xHost OpenMP support /Qopenmp -qopenmp Automatic parallelization /Qparallel -parallel

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Vectorization

• Single Instruction Multiple Data (SIMD): • Processing vector with a single operation • Provides data level parallelism (DLP) • Because of DLP more efficient than scalar processing • Vector: • Consists of more than one element • Elements are of same scalar data types (e.g. floats, integers, …) • Vector length (VL): Elements of the vector

A B A B AAi i BBi i Ai i Bi i + Scalar Vector + Processing Processing Ci C CCi Ci i VL

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Many Ways to Vectorize

Compiler: Ease of use Auto-vectorization (no change of code)

Compiler: Auto-vectorization hints (#pragma vector, …)

Compiler: Intel® ™ Plus Array Notation Extensions

SIMD intrinsic class (e.g.: F32vec, F64vec, …)

Vector intrinsic (e.g.: _mm_fmadd_pd(…), _mm_add_ps(…), …)

Assembler code (e.g.: [v]addps, [v]addss, …) Programmer control

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Basic Vectorization Switches I

• Linux*, OS X*: -x ▪ Might enable Intel processor specific optimizations ▪ Processor-check added to “main” routine: Application errors in case SIMD feature missing or non-Intel processor with appropriate/informative message ▪ Example: -xCORE-AVX2 (Jureca Xeon HSW) ▪ Example: -xMIC-AVX512 (Jureca Booster KNL) ▪ Example: -xCORE-AVX512 (Juwels Xeon SKL)

• Linux*, OS X*: -ax • Multiple code paths: baseline and optimized/processor-specific • Multiple SIMD features/paths possible, e.g.: -axSSE2,CORE-AVX512 • Baseline code path defaults to –xSSE2

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Basic Vectorization Switches II

• Special switch for Linux*, OS X*: -xHost

▪ Compiler checks SIMD features of current host processor (where built on) and makes use of latest SIMD feature available

▪ Code only executes on processors with same SIMD feature or later as on build host

▪ As for -x if “main” routine is built with –xHost the final executable only runs on Intel processors

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Basic Vectorization Switches III

• Special switch in addition to CORE-AVX512: -qopt-zmm-usage=[keyword]

▪ [keyword] = [high | low] ; Note: “low” is the default

▪ Why choosing a defensive vectorization level?

Frequency drops in vectorized parts. Frequency does not immediately increases after the vectorized loop. Too many small vectorized loops will decrease the performance for the serial part.

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Control Vectorization

• Verify vectorization:

▪ Globally: Linux*, OS X*: -qopt-report[n] check for additional options (man icc)!

▪ Annotated source listing: -qopt-report-annotate=[text | html] generates source listing with compiler comments

• Advanced: • Ignore vector dependencies (IVDEP): C/C++: #pragma ivdep Fortran: !DIR$ IVDEP

• “Enforce” vectorization: C/C++: #pragma omp Fortran: !$omp simd

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. NextGENCompiler

• Compiler will be based on LLVM in the future. Can be enabled in latest intel compiler by:

-qnextgen

• More information:

https://software.intel.com/en-us/articles/early-documentation-for-intel-c-compiler-based-on-the-modern- llvm-framework

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Intel® Distribution for Python* Adoption of Python continues to grow among Python* Landscape domain specialists and developers for its productivity benefits

• Challenge#1: • Domain specialists are not professional software programmers.

• Challenge#2: • Python performance limits migration to production systems

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Adoption of Python continues to grow among Python* Landscape domain specialists and developers for its productivity benefits

• Challenge#1: • Domain specialists are not professional software programmers. Intel’s solution is to… ▪ Accelerate Python performance • Challenge#2: ▪ Enable easy access • Python performance limits migration to production systems ▪ Empower the community

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. “I expected Intel’s numpy to be fast Access Multiple Options for Faster Python* but it is significant that plain old python code is much faster with the Included in Intel® Distribution for Python Intel version too.“ Dr. Donald Kinghorn, • Accelerate with native libraries Puget Systems Review

• NumPy, SciPy, Scikit-Learn, Theano, Pandas, pyDAAL • Intel® MKL, Intel® DAAL • Multi-node parallelism • Exploit vectorization and threading • Mpi4Py, Distarray • Cython + Intel C++ compiler • Intel native libraries: Intel MPI • Numba + Intel LLVM • Integration with Big Data, ML platforms and Better/Composable threading frameworks • Cython, Numba, Pyston • Spark, Hadoop, Trusted Analytics Platform • Threading composability for MKL, CPython, Blaze/Dask, Numba • Better performance profiling • Extensions for profiling mixed Python & native/JIT codes

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Intel® Parallel Studio XE

Intel® Inspector Intel® Advisor Vectorization Optimization and Thread Prototyping Memory and Threading Checking Intel® Cluster Checker Cluster Diagnostic Expert System

Architecture Intel® VTune™ Amplifier Performance Profiler Intel® Trace Analyzer and Collector

MPI Profiler

Profiling, Analysis, and Cluster Cluster Tools Intel® Data Analytics Acceleration Library Intel® MPI Library Optimized for Data Analytics & Machine Learning Intel® Integrated Performance Primitives

Image, Signal, and Compression Routines Libraries

Performance Performance Intel® Math Kernel Library Intel® Threading Building Blocks Optimized Routines for Science, Engineering, and Financial Task-Based Parallel C++ Template Library

Intel® C/C++ and Fortran Compilers Intel® Distribution for Python Performance Scripting

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Which tool should I use? Tools for High-Performance Implementation Intel® Parallel Studio XE

Cluster N Assuming hybrid Scalable Tune MPI MPI + Threading Program ? Y

Memory Effective Y N Vectorize Bandwidth threading Sensitive ? ? N Y

Optimize Thread Bandwidth

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Performance Analysis Tools for Diagnosis Intel® Parallel Studio XE

Cluster N Intel® Trace Analyzer Scalable Tune MPI & Collector (ITAC) ? Intel® MPI Tuner Y

Memory Effective Y N Vectorize Bandwidth threading Sensitive ? ? N Y

Optimize Thread Bandwidth

Intel® Intel® Intel® VTune™ Amplifier Advisor VTune™ Amplifier

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. APS is THE first STEP! Application Performance Snapshot (APS) Data in One Place: MPI+OpenMP+Memory Floating Point

Quick & Easy Performance Overview ▪ Does the app need performance tuning? MPI & non-MPI Apps ▪ Distributed MPI with or without threading ▪ Shared memory applications Popular MPI Implementations Supported ▪ Intel® MPI Library ▪ MPICH & Cray MPI Richer Metrics on Computation Efficiency ▪ CPU (processor stalls, memory access) ▪ FPU (vectorization metrics)

MPI supported only on Linux*

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Application Performance Snapshot (APS)

• High-level overview of application performance • Primary optimization areas and next steps in analysis with deep tools • Easy to install, run, explore results with CL or HTML reports • Low collection overhead • Scales to large jobs • Multiple methods to obtain • Part of Intel® Parallel Studio XE, VTune Amplifier standalone • Separate free download (110Mb) from APS web page • https://software.intel.com/sites/products/snapshots/application-snapshot/

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. APS Usage Setup Environment • $ source /apsvars.sh (or Jülich: module load VTune) Run Application • $ aps • MPICH: >mpirun aps • SLURM: srun aps (Jülich: please include #SBATCH --disable-perfparanoid )

Generate Report on Result Folder • $ aps –report

Generate CL reports with detailed MPI statistics on Result Folder • $ aps-report –

▪ MPI Time per rank ▪ aps-report –t

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Suggested Order of TUNING Steps

1. Application Performance Snapshot (APS). Gives recommendation for next steps e.g.: (Please find more options inside the APS_Playbook.txt)

2. Intel Trace Analyzer and Collector (ITAC) (MPI scalability issues) 3. VTune analysis: Hotspots (profiling) 4. (Vectorization) 5. VTune analysis HPC: (Adding OpenMP analysis, Bandwidth and some Memory analysis) 6. VTune analysis: Memory Access

0. Check Code with (Threading, Memory) and MPI with Message Checker (part of ITAC)

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. ITAC for MPI Analysis Boost Distributed Application Performance with Intel® MPI Library Performance, Scalability & Fabric Flexibility

• Standards Based Optimized MPI Library for Distributed Computing ▪ Built on open source MPICH Implementation ▪ Tuned for low latency, high bandwidth & scalability ▪ Multi fabric support for flexibility in deployment

Learn More: software.intel.com/intel-mpi-library

1See following benchmarks slide for more details

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Profile & Analyze High Performance MPI Applications Intel® Trace Analyzer & Collector • Powerful Profiler, Analysis & Visualization Tool for MPI Applications ▪ Low overhead for accurate profiling, analysis & correctness checking ▪ Easily visualize process interactions, hotspots & load balancing for tuning & optimization ▪ Workflow flexibility: Compile, Link or Run

Learn More: software.intel.com/intel-trace-analyzer

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Efficiently Profile MPI Applications Intel® Trace Analyzer & Collector

• Helps Developers Visualize & understand parallel application behavior Evaluate profiling statistics & load balancing Identify communication hotspots

• Features ▪ Event-based approach ▪ Low overhead ▪ Excellent scalability ▪ Powerful aggregation & filtering functions ▪ Idealizer ▪ Scalable

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. ITAC Analysis High Load • imbalance causes MPI_Alltoall time

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Online Resources

• Intel® MPI Library product page. Intel MPI is free! • www.intel.com/go/mpi • Intel® Trace Analyzer and Collector product page • www.intel.com/go/traceanalyzer • Intel® Clusters and HPC Technology forums • http://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Intel® VTune™ Amplifier Analyze & Tune Application Performance Intel® VTune™ Amplifier—Performance Profiler

• Save Time Optimizing Code • Accurately profile C, C++, Fortran*, Python*, Go*, Java*, or any mix • Optimize CPU, threading, memory, cache, storage & more • Save time: rich analysis leads to insight • Take advantage of Priority Support • Connects customers to Intel engineers for confidential inquiries (paid versions)

• What’s New in 2019 Release (partial list) ▪ New Platform Profiler! - Longer Data Collection ▪ A more accessible user interface provides a simplified profiling workflow ▪ Smarter, faster Application Performance Snapshot: Analyze CPU utilization of physical cores, pause/resume, more… (Linux*) Learn More: software.intel.com/intel-vtune-amplifier-xe • Improved JIT profiling for server-side/cloud applications

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. What’s New for 2019? Intel® VTune Amplifier Performance Profiler

New, Simplified Setup and More Intelligible Results New Platform Profiler - Longer Data Collection ▪ Find hardware configuration issues ▪ Identify poorly tuned applications Smarter, Faster Application Performance Snapshot ▪ Smarter: CPU utilization analysis of physical cores ▪ Faster: Lower overhead, data selection, pause/resume Added Cloud, Container & Linux .NET Support ▪ JIT profiling on LLVM* or HHVM PHP servers ▪ Java* analysis on OpenJDK 9 and Oracle* JDK 9 ▪ .NET support on Linux* plus Hyper-V* support SPDK and DPDK I/O Analysis - Measure “Empty” Polling Cycles Balance CPU/FPGA Loading Additional Embedded OSs & Environments

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Intel® Advisor Get Breakthrough Vectorization Performance Intel® Advisor—Vectorization Advisor

• Faster Vectorization Optimization • Data & Guidance You Need • Vectorize where it will pay off most • Compiler diagnostics + • Quickly ID what is blocking vectorization Performance Data + SIMD efficiency • Tips for effective vectorization • Detect problems & recommend fixes • Safely force compiler vectorization • Loop-Carried Dependency Analysis • Optimize memory stride • Memory Access Patterns Analysis

Optimize for Intel® AVX-512 with or without access to AVX-512 hardware http://intel.ly/advisor-xe

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Find Effective Optimization Strategies Intel® Advisor—Cache-aware Roofline Analysis • Roofline Performance Insights • Highlights poor performing loops • Shows performance ‘headroom’ for each loop • Which can be improved • Which are worth improving • Shows likely causes of bottlenecks “I am enthusiastic about the new "integrated roofline" in Intel® Advisor. It is now possible to proceed with a step-by-step approach • Suggests next optimization with the difficult question of memory transfers optimization & steps Nicolas Alferez, Software Architect vectorization which is of major importance.” Onera – The French Aerospace Lab

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Science & Research National Energy Research Scientific Computing Up to 35% Faster Performance Center

“Optimizing complex applications demands a sense of absolute performance. There are many potential optimization directions. It’s essential to know which direction to take, what factors are limiting performance, and when to stop.”

Dr. Tuomas Koskela, postdoctoral fellow at NERSC

With a single Intel Advisor survey, NERSC was able to discover most of the bottlenecks.

Software & workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark & MobileMark, are measured using specific computer systems, components, software, operations & functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These © 2017 Intel Corporation. All rights reserved. Intel andoptimizations the Intel logo include are SSE2, trademarks SSE3, and of SSSE3 Intel Corporationinstruction sets orand its other subsidiari optimizations.es in Intelthe doesU.S. notand/or guarantee other the countries. availability, *Otherfunctionality, names or effectiveness and of any optimization on brands may be claimed as the property of others. microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more45 information regarding the specific For more complete information about compiler optimizations,instruction seesets ourcoveredOptimization by this notice. Notice . Additional Material

• Intel® Advisor – Threading Design & Prototyping: • Product page – overview, features, FAQs, support… • Training materials – movies, tech briefs, documentation… • Reviews • Additional Analysis Tools: • Intel® VTune Amplifier – performance profiler • Intel® Inspector - memory and thread checker / debugger • Additional Development Products: • Intel® Software Development Products • https://software.intel.com/en-us/parallel-studio-xe/documentation/featured-documentation

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Playbook - no official documentation

ASCII File containing command lines and instructions for tools usage:

Application Performance Snapshot (APS) Playbook

======

The Playbook contains command lines starting with $

Please change $PRG, $ARGS into the path,name and parameters of your program!

Version 1.0, 18.02.2019

0. Environment

------load modules

------

$ module load Intel $ module load IntelMPI $ module load VTune

For Batch jobs ======

Please include: --disable-perfparanoid in command line for sbatch or in scripts with #SBATCH --disable-perfparanoid check for important executables

$ which aps check version

$ aps –version …

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Intel® InSpector Debug Memory & Threading with Intel® Inspector Find & Debug Memory Leaks, Corruption, Data Races, Deadlocks

• Correctness Tools Increase ROI by 12%- Debugger Breakpoints 21%1 • Errors found earlier are less expensive to fix • Races & deadlocks not easily reproduced • Memory errors are hard to find without a tool • Debugger Integration Speeds Diagnosis Diagnose in hours instead of months • Breakpoint set just before the problem • Examine variables and threads with the Learn More: intel.ly/inspector-xe debugger

1Cost Factors – Square Project Analysis - CERT: U.S. Computer Emergency Readiness Team, and Carnegie Mellon CyLab NIST: National Institute of Standards & Technology: Square Project Results

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. How to start?

▪ Compile with minimal options and run with APS (will provide tuning tips) ▪ Compile with -xHost and –opt-report=5 and check timing and APS report ▪ Compile with –xHost and –no-vec disables vectorization. Compare with previous timing ▪ Use: VTune Amplifier XE: $ module load VTune/ ▪ Use: Advisor XE: $ module load Advisor/ ▪ Google for Intel related topics → etc.

▪ Please set thread affinity e.g.: $ export KMP_AFFINITY=scatter,verbose This can speed up OMP programs up to 10X!

▪ Any questions: [email protected]

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Optimize Efficiently with Valuable Resources

Shortcut Optimization Sign up now Intel® Parallel Studio XE Attend TEC Webinars! • Overview, features, support, code samples • Training materials, Tech.Decoded webinars, how-to videos & articles • Reviews & Case Studies • More Intel® Software Development Products

Intel Code Modernization Program • Overview • Live training • Remote Access https://intel.ly/2PdkNhN

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Legal Disclaimer & Optimization Notice

• INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. • Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. • Copyright © 2016, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice.

Documentation

• Intel® 64 and IA-32 Architectures Software Developer Manuals: • Intel® 64 and IA-32 Architectures Software Developer’s Manuals ▪ Volume 1: Basic Architecture ▪ Volume 2: Instruction Set Reference ▪ Volume 3: System Programming Guide • Software Optimization Reference Manual • Related Specifications, Application Notes, and White Papers

• https://www-ssl.intel.com/content/www/us/en/processors/architectures-software-developer- manuals.html?iid=tech_vt_tech+64-32_manuals

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may12/4/2019 be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. APS Command Line Reports –Advanced MPI statistics (2/3)

• Message Size Summary by all ranks • aps-report –m

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. APS Command Line Reports –Advanced MPI statistics (3/3)

• Data Transfers for Rank-to-Rank Communication • aps-report –x

And many others – check • aps-report -help

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Collection Control API (for APS) Since 2018 Update 2

• To measure a particular application phase or exclude initialization/finalization phases use: MPI: • Pause: MPI_Pcontrol(0) • Resume: MPI_Pcontrol(1) MPI or Shared memory applications: • Pause: __itt_pause() • Resume: __itt_resume() • See how to configure the build of your application to use itt API Tip: use aps “-start-paused” option allows to start application without profiling and skip initialization phase

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Visualize Parallelism—Interactively Build, Validate & Analyze Algorithms Intel® Advisor—Flow Graph Analyzer (FGA)

▪ Visually generate code stubs

▪ Generate parallel C++ programs

▪ Click & zoom through your algorithm’s nodes & edges to understand parallel data & program flow

▪ Analyze load balancing, concurrency, & other parallel attributes to fine tune your program

Use Intel® TBB or OpenMP* 5 (draft) OMPT APIs

Optimization Notice Copyright © 2018, Intel Corporation. All rights reserved. 58 *Other names and brands may be claimed as the property of others.