FZ-Jülich, 26 November 2020 Tuning for Juwels and Jureca

Dr. Heinrich Bockhorst - Intel Notices & Disclaimers

Intel technologies may require enabled hardware, or service activation. Learn more at intel.com or from the OEM or retailer.

Your costs and results may vary.

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #20110804. https://software.intel.com/en-us/articles/optimization-notice

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.

Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. See backup for configuration details. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See configuration disclosure for details. No product or component can be absolutely secure.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

One Intel Software & Architecture (OISA) 2 Agenda

▪ Levels of Parallelism ▪ Intel® Parallel Studio XE ▪ Intel® Compiler ▪ Application Performance Snapshot (APS) ▪ MPI Analysis ▪ Vtune Profiler ▪ Advisor ▪ oneAPI

One Intel Software & Architecture (OISA) 3 The “Free Lunch” is over, really Processor clock rate growth halted around 2005

Source: © 2014, James Reinders, Intel, used with permission Software must be parallelized to realize all the potential performance

One Intel Software & Architecture (OISA) 4 Levels of Parallelism

Inter- Node Parallelism

Intra- Node Parallelism

Core / Thread Level Parallelism

Thread-Level Parallelism with SMT

Instruction-level Parallelism

Data Level Parallelism

Accelerator / Offloading Parallelism

One Intel Software & Architecture (OISA) 5 Levels of Parallelism

One Intel Software & Architecture (OISA) 6

What’s Inside Intel® Parallel Studio XE Comprehensive Software Development Tool Suite

Cluster Edition Composer Edition Professional Edition BUILD ANALYZE SCALE Compilers & Libraries Analysis Tools Cluster Tools Intel® Intel® VTune™ Amplifier Intel® MPI Library C / C++, Performance Profiler Message Passing Interface Library Intel® Data Analytics Fortran Acceleration Library Compilers Intel® Inspector Intel® Trace Analyzer & Collector Intel Threading Building Blocks Memory & Thread Debugger MPI Tuning & Analysis C++ Threading Intel® Advisor Intel® Cluster Checker Intel® Integrated Performance Primitives Vectorization Optimization Cluster Diagnostic Expert System Image, Signal & Data Processing Thread Prototyping & Flow Graph Analysis Intel® Distribution for Python High Performance Python

Operating System: Windows, , MacOS1 Intel® Architecture Platforms

1Available only in the Composer Edition.

One Intel Software & Architecture (OISA) 8

10 Common optimization options

Linux* Disable optimization -O0 Optimize for speed (no code size increase) -O1 Optimize for speed (default) -O2 High-level loop optimization -O3 Create symbols for debugging -g Multi-file inter-procedural optimization -ipo Profile guided optimization (multi-step build) -prof-gen -prof-use Optimize for speed across the entire program (“prototype switch”) -fast same as: fast options definitions changes over time! -ipo –O3 -no-prec-div –static –fp-model fast=2 –xHost

OpenMP support -qopenmp Automatic parallelization -parallel https://tinyurl.com/icc-user-guide

One Intel Software & Architecture (OISA) 10 InterProcedural Optimizations Extends optimizations across file boundaries

-ip Only between modules of one source file -ipo Modules of multiple files/whole application

Without IPO With IPO

Compile & Optimize file1.c Compile & Optimize

Compile & Optimize file2.c

Compile & Optimize file3.c file1.c file3.c

Compile & Optimize file4.c file4.c file2.c

One Intel Software & Architecture (OISA) 11 Math Libraries ▪ icc comes with Intel’s optimized math libraries ▪ libimf (scalar) and libsvml (scalar & vector) ▪ Faster than GNU* libm ▪ Driver links libimf automatically, ahead of libm ▪ Additional functions (replace math.h by mathimf.h) ▪ Don’t link to libm explicitly! -lm ▪ May give you the slower libm functions instead ▪ Though the Intel driver may try to prevent this ▪ gcc needs -lm, so it is often found in old makefiles

One Intel Software & Architecture (OISA) 12 SIMD: Single Instruction, Multiple Data for (i=0; i

8 doubles for AVX-512

X X x7 x6 x5 x4 x3 x2 x1 x0 + +

Y Y y7 y6 y5 y4 y3 y2 y1 y0 = =

X + Y X + Y x7+y7 x6+y6 x5+y5 x4+y4 x3+y3 x2+y2 x1+y1 x0+y0

One Intel Software & Architecture (OISA) 13 Basic Vectorization Switches I

Linux*, OS X*: -x ▪ Might enable Intel processor specific optimizations ▪ Processor-check added to “main” routine: Application errors in case SIMD feature missing or non-Intel processor with appropriate/informative message ▪ Example: -xCORE-AVX2 (Jureca Xeon HSW) ▪ Example: -xMIC-AVX512 (Jureca Booster KNL) ▪ Example: -xCORE-AVX512 (Juwels Xeon SKL)

Linux*, OS X*: -ax • Multiple code paths: baseline and optimized/processor-specific • Multiple SIMD features/paths possible, e.g.: -axSSE2,CORE-AVX512 • Baseline code path defaults to –xSSE2

One Intel Software & Architecture (OISA) 14 Basic Vectorization Switches II

Special switch for Linux*, OS X*: -xHost

▪ Compiler checks SIMD features of current host processor (where built on) and makes use of latest SIMD feature available

▪ Code only executes on processors with same SIMD feature or later as on build host

15

One Intel Software & Architecture (OISA) 15 Basic Vectorization Switches III

Special switch in addition to CORE-AVX512: -qopt-zmm-usage=[keyword]

▪ [keyword] = [high | low] ; Note: “low” is the default

▪ Why choosing a defensive vectorization level?

Frequency drops in vectorized parts. Frequency does not immediately increase after the vectorized loop. Too many small vectorized loops will decrease the performance for the serial part.

One Intel Software & Architecture (OISA) 16

Tools for High-Performance Implementation Intel® Parallel Studio XE

N Assuming hybrid Cluster Tune MPI MPI + Threading Program Scalable ? Y

Memory Effective Y N Vectorize Bandwidth threading Sensitive ? ? N Y

Optimize Thread Bandwidth

One Intel Software & Architecture (OISA) 18 Performance Analysis Tools for Diagnosis Intel® Parallel Studio XE

N Intel® Trace Analyzer Cluster Tune MPI & Collector (ITAC) Scalable ? Y

Memory Effective Y Bandwidth N threading Vectorize Sensitive ? ? N Y

Optimize Thread Bandwidth

Intel® Intel® Intel® VTune™ Amplifier Advisor VTune™ Amplifier

One Intel Software & Architecture (OISA) 19 Before dive to a particular tool..

▪ How to assess easily any potential in performance tuning? ▪ What to use on big scale not be overwhelmed with huge trace size, post processing time and collection overhead? ▪ How to quickly evaluate environment settings or incremental code changes? ▪ Which tool should I use first?

▪ Answer: try Application Performance Snapshot (APS)

One Intel Software & Architecture (OISA) 20 APS Usage Setup Environment • $ source /vtune_vars.sh # or load module

Run Application • $ aps • MPI: $ mpirun aps

Generate Report on Result Folder • $ aps –report

Generate CL reports with detailed MPI statistics on Result Folder • $ aps-report –

One Intel Software & Architecture (OISA) 21 Application Performance Snapshot (APS) Data in One Place: MPI+OpenMP+Memory Floating Point

Quick & Easy Performance Overview ▪ Does the app need performance tuning? MPI & non-MPI Apps ▪ Distributed MPI with or without threading ▪ Shared memory applications Popular MPI Implementations Supported ▪ Intel® MPI Library ▪ MPICH & Cray MPI Richer Metrics on Computation Efficiency ▪ CPU (processor stalls, memory access) ▪ FPU (vectorization metrics)

MPI supported only on Linux*

One Intel Software & Architecture (OISA) 22 APS Command Line Reports – Advanced MPI statistics

Data Transfers for Rank-to- Rank Communication • aps-report –x

And many others – check • aps-report -help

23

One Intel Software & Architecture (OISA) 23

Efficiently Profile MPI Applications Intel® Trace Analyzer & Collector

▪ Helps Developers ▪ Visualize & understand parallel application behavior ▪ Evaluate profiling statistics & load balancing ▪ Identify communication hotspots

▪ Features ▪ Event-based approach ▪ Low overhead ▪ Excellent scalability ▪ Powerful aggregation & filtering functions ▪ Idealizer ▪ Scalable

One Intel Software & Architecture (OISA) 25 ITAC Analysis High Load imbalance ▪ causes MPI_Alltoall time

One Intel Software & Architecture (OISA) 26 Online Resources

Intel® MPI Library product page. Intel MPI is free! • www.intel.com/go/mpi Intel® Trace Analyzer and Collector product page • www.intel.com/go/traceanalyzer Intel® Clusters and HPC Technology forums • http://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology

One Intel Software & Architecture (OISA) 27

Analyze & Tune Application Performance Intel® VTune™ Profiler

▪ Save Time Optimizing Code • Accurately profile C, C++, Fortran*, Python*, Go*, Java*, or any mix • Optimize CPU, threading, memory, cache, storage & more • Save time: rich analysis leads to insight • Take advantage of Priority Support • Connects customers to Intel engineers for confidential inquiries (paid versions) ▪ What’s New in 2019 Release (partial list) ▪ New Platform Profiler! - Longer Data Collection ▪ A more accessible user interface provides a simplified profiling workflow ▪ Smarter, faster Application Performance Snapshot: Analyze CPU utilization of physical cores, pause/resume, more… (Linux*) https://software.intel.com/content/www/us/en/develop/tools/ • Improved JIT profiling for server-side/cloud applications -profiler/get-started.html

One Intel Software & Architecture (OISA) 29 Start a new Project

▪ Use GUI ▪ Or Command- Line

Get Command- Line

One Intel Software & Architecture (OISA) 30

Get Breakthrough Vectorization Performance Intel® Advisor—Vectorization Advisor

▪ Faster Vectorization ▪ Data & Guidance You Need Optimization • Compiler diagnostics + • Vectorize where it will pay off most Performance Data + SIMD efficiency • Quickly ID what is blocking vectorization • Detect problems & recommend fixes • Tips for effective vectorization • Loop-Carried Dependency Analysis • Safely force compiler vectorization • Memory Access Patterns Analysis • Optimize memory stride

Optimize for Intel® AVX-512 with or without access to AVX-512 hardware http://intel.ly/advisor-xe

One Intel Software & Architecture (OISA) 32 Find Effective Optimization Strategies Intel® Advisor—Cache-aware Roofline Analysis

▪ Roofline Performance Insights • Highlights poor performing loops

• Shows performance ‘headroom’ for each loop • Which can be improved • Which are worth improving • Shows likely causes of bottlenecks

• Suggests next optimization steps “I am enthusiastic about the new "integrated roofline" in Intel® Advisor. It is now possible to proceed with a step-by- step approach with the difficult question of memory transfers Nicolas Alferez, Software Architect optimization & vectorization which is of major importance.”

Onera – The French Aerospace Lab One Intel Software & Architecture (OISA) 33 Science & Research National Energy Research Scientific Up to 35% Faster Performance Computing Center

“Optimizing complex applications demands a sense of absolute performance. There are many potential optimization directions. It’s essential to know which direction to take, what factors are limiting performance, and when to stop.”

Dr. Tuomas Koskela, postdoctoral fellow at NERSC

34

With a single survey, NERSC was able to discover most of the bottlenecks.

Software & workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark & MobileMark, are measured using specific computer systems, components, software, operations & functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific 34 One Intel Software & Architecture (OISA) to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Additional Material

▪ Intel® Advisor – Threading Design & Prototyping: • Product page – overview, features, FAQs, support… • Training materials – movies, tech briefs, documentation… • Reviews ▪ Additional Analysis Tools: • Intel® VTune Amplifier – performance profiler • Intel® Inspector - memory and thread checker / debugger ▪ Additional Development Products: • Intel® Software Development Products • https://software.intel.com/en-us/parallel-studio-xe/documentation/featured-documentation

One Intel Software & Architecture (OISA) 35 Playbook - no official documentation ▪ ASCII File containing command lines and instructions for tools usage:

▪ Application Performance Snapshot (APS) Playbook

▪ ======

▪ The Playbook contains command lines starting with $

▪ Please change $PRG, $ARGS into the path,name and parameters of your program!

▪ Version 1.0, 18.02.2019

▪ 0. Environment ------

▪ load modules ------

▪ $ module load Intel $ module load IntelMPI $ module load VTune

▪ For Batch jobs ======

▪ Please include: --disable-perfparanoid

▪ in command line for sbatch or in scripts with #SBATCH --disable-perfparanoid

▪ check for important executables

▪ $ which aps

▪ check version

▪ $ aps –version …

One Intel Software & Architecture (OISA) 36 Cross-Architecture Programming for Accelerated Compute, Freedom of Choice for Hardware oneAPI: Industry Initiative & Intel Products

One Intel Software & Architecture group Intel Architecture, Graphics & Software November 2020

All information provided in this deck is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps. Intel® oneAPI Toolkits A complete set of proven developer tools expanded from CPU to XPU

A core set of high-performance tools for building C++, Data Parallel C++ applications & oneAPI library-based Native Code Developers applications

Intel® oneAPI Intel® oneAPI Intel® oneAPI Rendering Tools for HPC Tools for IoT Toolkit

Deliver fast Fortran, Build efficient, reliable Create performant, OpenMP & MPI solutions that run at high-fidelity visualization applications that network’s edge applications Specialized Workloads scale

Intel® AI Analytics Intel® Distribution of Toolkit OpenVINO™ Toolkit Accelerate machine learning & data Deploy high performance science pipelines with optimized DL inference & applications from frameworks & high-performing Toolkit edge to cloud Python libraries Data Scientists & AI Developers

One Intel Software & Architecture (OISA) 38 Intel® oneAPI Toolkits Free Run the tools locally Run the tools in the Cloud Availability Downloads

Get Started Quickly Repositories Code Samples, Quick-start Guides, Webinars, Training DevCloud software.intel.com/oneapi Containers Available on Juwels/Jureca:

$ source /p/usersoftware/paj1720/intel/oneapi/setvars.sh

One Intel Software & Architecture (OISA) 39 Intel® Compilers – DPC++, C/C++ & Fortran

Intel Compiler OpenMP Compiler driver Target Current Status (Technology) Offload* C/C++ (IL0) icc CPU No Production C/C++ (LLVM) icx / icc -qnextgen CPU, GPU* Yes Production/Beta CPU, GPU, DPC++ (LLVM) dpcpp NA Beta FPGA Fortran (IL0) ifort CPU No Production Fortran (LLVM) ifx CPU, GPU* Yes Beta

Cross Compiler Binary Compatible and Linkable

One Intel Software & Architecture (OISA) 40 How to start?

▪ Compile with minimal options and run with APS (will provide tuning tips) ▪ Compile with -xHost and –opt-report=5 and check timing and APS report ▪ Optional! Compile with –xHost and –no-vec disables vectorization. Compare with previous timing ▪ Use: VTune Profiler: $ module load VTune/ ▪ Use: Advisor: $ module load Advisor/ ▪ Google for Intel related topics → etc.

▪ Please set thread affinity e.g.: $ export KMP_AFFINITY=scatter,verbose This can speed up OMP programs up to 10X!

▪ Any questions: [email protected]

One Intel Software & Architecture (OISA) 41 Questions

One Intel Software & Architecture (OISA) 42 43 Backup Additional information

One Intel Software & Architecture (OISA) Intel Confidential 44 Basic Vectorization gcc

Linux*, OS X*: -m or –march= ▪ Might enable processor specific optimizations ▪ Example: -mavx2 (Jureca Xeon HSW) ▪ Example: -march=knl (Jureca Booster KNL) ▪ Example: -march=skylake-avx512 (Juwels Xeon SKL)

Further Information: https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html

One Intel Software & Architecture (OISA) 45 Control Vectorization

Verify vectorization:

▪ Globally: Linux*, OS X*: -qopt-report[n] check for additional options (man icc)!

▪ Annotated source listing: -qopt-report-annotate=[text | html] generates source listing with compiler comments

Advanced: • Ignore vector dependencies (IVDEP): C/C++: #pragma ivdep Fortran: !DIR$ IVDEP

46 • “Enforce” vectorization: C/C++: #pragma omp Fortran: !$omp simd

One Intel Software & Architecture (OISA) 46

Debug Memory & Threading with Intel® Inspector Find & Debug Memory Leaks, Corruption, Data Races, Deadlocks ▪ Correctness Tools Increase ROI by 12%- Debugger Breakpoints 21%1 • Errors found earlier are less expensive to fix • Races & deadlocks not easily reproduced • Memory errors are hard to find without a tool

Diagnose in hours instead of months ▪ Debugger Integration Speeds Diagnosis • Breakpoint set just before the problem Learn More: intel.ly/inspector-xe • Examine variables and threads with the debugger

1Cost Factors – Square Project Analysis - CERT: U.S. Computer Emergency Readiness Team, and Carnegie Mellon CyLab NIST: National Institute of Standards & Technology: Square Project Results

One Intel Software & Architecture (OISA) 48