Debugging, profiling and performance tuning with Arm Forge

03-03-2021

Confidential © Arm 2018

About this Presentation

Who are we?
• Suyash Sharma, Senior AE, based in Manchester UK
• Arm /HPC SW Support Team: [email protected]
• Ryan Hulguin, Senior AE, based in Tennessee US

Presentation objectives
• Overview main features of DDT, MAP and Performance Reports
• Introduction to using Arm debugging and profiling tools for HPC applications development

Agenda

4:15-4:45pm:

• Forge Overview (5 mins)

• Debugging with DDT & demo (10 mins)

• Profiling with MAP & PR (15 mins)

4:45-5:00pm:

• Forge demo (15 mins)

• Further performance tuning with Performance Reports (5 mins)

• Hands-on, Q&A: debugging, profiling your own application (10 mins)

Arm Forge Overview

About Arm (ex-Allinea) interoperable tools for HPC

Arm (ex-Allinea) Tools: leading toolkit for HPC application developers

• Available on 65% of the top 100 HPC systems

• Help maximise application efficiency with Performance Reports

• Help the HPC community design the best applications with Forge

In December 2016 Allinea joined Arm

• Continue to be the trusted leader in HPC tools across every platform

• Our engineering roadmap is aligned with upcoming architectures from every vendor

• We remain 100% committed to providing cross-platform tools for HPC

• Our product and service team will continue to work with you, our customers and partners, and the wider HPC community

Server & HPC Development Solutions from Arm
Best in class commercially supported tools for Linux and high-performance computing

Code Generation for Arm servers: COMPILER FOR LINUX
• C/C++ Compiler
• Fortran Compiler
• Performance Libraries

Performance Engineering for any architecture, at any scale
• Debugger
• Profiler
• Reporting

Server & HPC Solution for Arm servers: Commercially Supported Toolkit for applications development on Linux
• C/C++ Compiler for Linux
• Fortran Compiler for Linux
• Performance Libraries
• Performance Reports
• Debugger
• Profiler

Arm Forge (DDT & MAP)
An interoperable toolkit for debugging and profiling

The de-facto standard for HPC development (commercially supported by Arm)
• Available on the vast majority of the Top500 machines in the world
• Fully supported by Arm on x86, IBM Power, GPUs and Arm v8-A

State-of-the-art debugging and profiling capabilities (fully scalable)
• Powerful and in-depth error detection mechanisms (including memory debugging)
• Low-overhead sampling-based profiler to identify and understand bottlenecks
• Available at any scale (from serial to petaflop-scale applications)

Easy to use by everyone (very user-friendly)
• Unique capabilities to simplify remote interactive sessions
• Innovative approach to present quintessential information to users

Arm Performance Reports
Characterize and understand the performance of HPC application runs

Gathers a rich set of data (commercially supported by Arm)
• Analyses metrics around CPU, memory, IO, hardware counters, etc.
• Possibility for users to add their own metrics

Build a culture of application performance & efficiency awareness (accurate and astute insight)
• Analyses data and reports the information that matters to users
• Provides simple guidance to help improve workloads’ efficiency

Adds value to typical users’ workflows (relevant advice to avoid pitfalls)
• Define application behaviour and performance expectations
• Integrate outputs to various systems for validation (e.g. continuous integration)
• Can be automated completely (no user intervention)

Different ways to run Arm Forge…

• Here (remote launch + reverse connect)
• There (interactive mode + reverse connect)
• There (offline OR interactive mode)

Ultimately, that’s where the tools will run. But what about the GUI?

Run and Ensure Application Correctness with DDT
Scalable tool for interactive and automated debugging

• One can:
  – Use DDT manually & interactively to debug the application
  – Generate a debug report in offline mode that can be shared with others for co-development
  – Integrate Arm Forge into your CI workflows for automated & non-interactive debugging

Examples:
$ ddt --connect mpirun -n 48 ./example
$ ddt --offline mpirun -n 48 ./example

Optimise the Application with MAP
Identify bottlenecks and rewrite some code for better performance

• One can:
  – Profile the application and measure all performance aspects
  – Generate a profile report that can be analysed later or shared with others

Examples:
$ map --connect mpirun -n 48 ./example
$ map --profile mpirun -n 48 ./example

Understand Application Behaviour with Performance Reports
Set a reference for future work

• One can:
  – Analyse performance with the PR-generated stats and hints
  – Summarise from a .map file generated by MAP

Examples:
$ perf-report mpirun -n 16 mmult_c
$ perf-report profile.map

Resources
Web, doc and support

• Forge user guide, release notes/installation, downloads:
  https://developer.arm.com/tools-and-software/server-and-hpc/debug-and-profile/arm-forge
• Forge product info, case studies, webinars:
  https://www.arm.com/products/development-tools/server-and-hpc/forge
• Arm Licensing Server download and installation:
  https://developer.arm.com/tools-and-software/server-and-hpc/help/help-and-tutorials/system-administration/licensing/arm-licence-server

For support, please send an email to [email protected] or submit a case directly from https://support.developer.arm.com/.

Arm DDT
Debugging with Arm DDT

Open-Source Debuggers’ Challenges
Debugging with GDB/LLDB can be less user friendly due to the text-based interface

Figures: GDB workflow; an alternate GDB GUI (DDD)

Arm DDT – The Debugger

Workflow: run with Arm tools, identify a problem, gather info (who, where, how, why), fix

Who had a rogue behaviour?
• Merges stacks from processes and threads

Where did it happen?
• Leaps to source

How did it happen?
• Diagnostic messages
• Some faults evident instantly from source

Why did it happen?
• Unique “Smart Highlighting”
• Sparklines comparing data across processes

DDT capabilities

• Dedicated HPC debugger
  – Fortran, C & C++, Python
• Designed for massively parallel applications
  – Designed for MPI applications
  – Support for OpenMP
• Highly scalable
  – Shown to debug at hundreds of thousands of cores
  – Fast reduction algorithms
• Memory debugging
• Variable comparison
• Distributed arrays
• GPU support
  – For NVIDIA CUDA (9, 10 and 11)

The DDT user interface (on Arm)

Breakpoints, Watchpoints and Tracepoints

• Breakpoints: stop at a given line of code so that the program state can be checked
• Watchpoints: observe variables or expressions, optionally with conditions. DDT stops and reports the value of the variable or expression
• Tracepoints: trace variables at a selected line of code without stopping the application

* Use tracepoints with caution, as they can be resource-intensive

Version Control Information
To track new bugs from the latest changes

View -> Version Control Information

Disassembler View

Tools -> Disassemble or

Check memory usage

Tools -> Overall Memory Stats
Tools -> Current Memory Usage

Memory debugging menu in Arm DDT

Launching DDT

Preparing Code for Use with DDT

As with any debugger, code must be compiled with the debug flag, typically -g. A low optimization level (i.e. -O0) is recommended during debugging for better debug information:

• It avoids compiler code-generation errors, which become more likely with more aggressive optimizations

• Optimization removes (optimizes out) some variables and functions, so they can no longer be inspected
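As a small illustration of the last point, consider the hypothetical C function below (the names `scaled_sum` and `partial` are ours, not from the slides). Built with -O0 -g, the intermediate `partial` is kept in memory and can be inspected or watched at every loop iteration. At higher optimization levels a compiler is free to keep it only in a register or to restructure the loop, in which case a debugger may report the variable as optimized out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: sum the first n elements of data, then scale. */
double scaled_sum(const double *data, size_t n, double scale)
{
    assert(data != NULL || n == 0);   /* precondition on the input */

    double partial = 0.0;             /* a natural target for a watchpoint */
    for (size_t i = 0; i < n; i++) {
        partial += data[i];           /* break here to inspect i and partial */
    }
    return partial * scale;
}
```

Stepping through the loop in a debug build shows `partial` growing element by element, which is usually the quickest way to see where a wrong value first appears.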

Express launch (from where Forge is installed)

With X11 forwarding to launch the ddt GUI (might be slow, or X11 not supported):
$ ddt
$ ddt srun/aprun/mpirun/mpiexec -n 4 example.exe

Without X11 forwarding (faster):
• Offline mode, without interactive debugging:
  $ ddt --offline srun/aprun/mpirun/mpiexec -n 4 example.exe
• Remote connect, with interactive debugging:
  $ ddt --connect
  $ ddt --connect srun/aprun/mpirun/mpiexec -n 4 example.exe

Express launch GUI

Remote connect

• Saves connecting over X11 session
• Communicates over ssh
• Much faster
• Start local (e.g. laptop) instance of Forge
• See remote files as normal
• Configure a remote connection
• Supports multi-hop and ssh configurations
• Specify the location of the Forge install
• Lets you open ‘remote’ files ‘locally’
• Start jobs – through scheduler
• Supports reverse connect

Remote connect with Forge Remote Client

Install the Forge client on your remote laptop/workstation. Download the package from:
https://developer.arm.com/tools-and-software/server-and-hpc/downloads/arm-forge

Arm DDT cheat sheet

• Load the environment module
  $ module load armforge/21.0
• See available command options of DDT:
  $ ddt --help
• Prepare the code
  $ mpicc -O0 -g myapp.c -o myapp.exe
• Start Arm DDT in interactive mode
  $ ddt srun -n 8 ./myapp.exe arg1 arg2
• Or use the reverse connect mechanism
  – Install and use the Forge client on your remote laptop/workstation
  – Then, edit the job script to run the following command and submit:
    ddt --connect srun -n 8 ./myapp.exe arg1 arg2

Arm DDT’s offline mode

• You can run the debugger in non-interactive mode
  – For long-running jobs
  – For automated testing, continuous integration…

• To do so, use the following arguments:
  $ ddt --offline --output=report.html mpirun ./jacobi_omp_mpi_gnu.exe
  – --offline enables non-interactive debugging
  – --output | -o specifies the name and format (HTML or txt) of the report written by the non-interactive debugging session
  – Add --mem-debug to enable memory debugging and memory leak detection

ddt --offline -o jacobi_omp_mpi_gnu_debug.txt \
    --trace-at _jacobi.F90:83,residual \
    srun ./jacobi_omp_mpi_gnu.exe

DDT offline debug file

Python Debugging

Parameter for debugging Python code:
$ ddt python3 %allinea_python_debug% my-script.py
Python with MPI implementation:
$ ddt srun -n 4 python3 %allinea_python_debug% my-mpi-script.py

CUDA Debugging

Compiling CUDA code:
$ nvcc -g -G -O0 my_cudascript.cu -o my_cudascript
To use memory debugging in DDT with CUDA:
$ nvcc -g -G -O0 --cudart shared my_cudascript.cu -o my_cudascript
Launch CUDA debugging in DDT:
$ ddt --connect ./my_cudascript &
Or launch from the DDT GUI.

Arm DDT Examples
Debugging with Arm DDT

Segmentation Fault
In this example, the application crashes with a segmentation fault when run outside of DDT.

What happens when it runs under DDT?

Segmentation Fault in DDT

DDT takes you to the exact line where the segmentation fault occurred, and you can pause and investigate

Invalid Memory Access

Pause and investigate: The array tab is a 13x13 array, but the application is trying to write a value to tab(4198128,0) which causes the segmentation fault. Reasons found: i is not used, and x and y are not initialized
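The same class of bug can be sketched in C. The slides do not show the original source, so everything below is a hypothetical analogue: `set_cell` and `fill_table` are illustrative names, and the fix shown (initializing the indices in the loops and checking bounds before writing) is one reasonable correction, not necessarily the one used in the demo.

```c
#include <assert.h>
#include <stddef.h>

#define TAB_DIM 13   /* the slide describes a 13x13 array */

/* Write value to tab[x][y] only if both indices are in range.
 * Returns 0 on success and -1 if the access would be out of bounds. */
int set_cell(int tab[TAB_DIM][TAB_DIM], long x, long y, int value)
{
    assert(tab != NULL);
    if (x < 0 || x >= TAB_DIM || y < 0 || y >= TAB_DIM) {
        return -1;   /* e.g. x = 4198128 left over from an uninitialized variable */
    }
    tab[x][y] = value;
    return 0;
}

/* Fill the whole array. The indices are initialized by the loops
 * instead of holding whatever happened to be on the stack. */
int fill_table(int tab[TAB_DIM][TAB_DIM], int value)
{
    for (long x = 0; x < TAB_DIM; x++) {
        for (long y = 0; y < TAB_DIM; y++) {
            if (set_cell(tab, x, y, value) != 0) {
                return -1;
            }
        }
    }
    return 0;
}
```

With the check in place, an index such as 4198128 is rejected with an error code instead of writing far outside the array and crashing.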

Profiling with MAP

Arm MAP

What is Arm MAP?
A low-overhead parallel profiler that presents performance data visually.

Why profile an application?
Profiling: a form of dynamic program analysis that measures, for example, the space (memory) or time complexity of a program, the usage of particular instructions, or the frequency and duration of function calls. Most commonly, profiling information serves to aid program optimization. (Wikipedia)

How are profiling tools used?
– Select representative test case(s)
– Profile
– Analyse and find bottlenecks
– Optimise
– Profile again to check performance results and iterate

How to profile?

Different methods:
• Tracing
  – Records and timestamps all operations
  – Intrusive
• Instrumenting
  – Add instructions in the source code to collect data
  – Intrusive
• Sampling
  – Automatically collect data
  – Not intrusive
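To make the "instrumenting" method concrete, here is a minimal sketch in C of what adding instructions to the source looks like. The `region_stats` struct and the function names are hypothetical; the timing uses only the standard `clock()` call. This is hand-written instrumentation for illustration and is not how MAP collects data, since MAP relies on sampling.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Accumulated timing for one instrumented region (hypothetical struct). */
typedef struct {
    double seconds;   /* total CPU time spent in the region */
    long   calls;     /* number of times the region ran */
} region_stats;

/* The work we want to measure: a simple sum of squares. */
static double sum_squares(size_t n)
{
    double total = 0.0;
    for (size_t i = 1; i <= n; i++) {
        total += (double)i * (double)i;
    }
    return total;
}

/* Instrumented wrapper: the clock() calls are the extra instructions
 * added to the source, which is what makes this approach intrusive. */
double timed_sum_squares(size_t n, region_stats *stats)
{
    assert(stats != NULL);
    clock_t start = clock();
    double result = sum_squares(n);
    clock_t end = clock();

    stats->seconds += (double)(end - start) / CLOCKS_PER_SEC;
    stats->calls += 1;
    return result;
}
```

Every measured region needs a wrapper like this and a rebuild, and the timer calls themselves perturb short regions. Sampling avoids both costs, at the price of statistical rather than exact timings.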

Some types of profiles

Hotspot
• One function corresponds to more than 80% of the runtime
• Large speed-up potential
• Best optimisation scenario

Spike
• The application spends most of the time in a few functions
• Speed-up potential depends on the aggregated time
• Variable optimisation time

Flat
• Runtime split evenly between numerous functions, each one with a very small runtime
• Little speed-up potential without algorithmic changes
• Worst optimisation scenario
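The three profile shapes can be expressed as a small classifier over per-function shares of runtime. This is only a sketch: the 80% hotspot threshold comes from the slide, but treating "a few functions" as the top three, and reusing 80% as their combined threshold, are our assumptions. The function and enum names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { PROFILE_HOTSPOT, PROFILE_SPIKE, PROFILE_FLAT } profile_kind;

/* Classify a profile from per-function shares of runtime (fractions
 * that sum to at most 1.0). The 80% hotspot threshold comes from the
 * slide; "a few functions" is assumed here to mean the top three. */
profile_kind classify_profile(const double *share, size_t n)
{
    assert(share != NULL && n > 0);

    /* Track the three largest shares in descending order. */
    double top[3] = {0.0, 0.0, 0.0};
    for (size_t i = 0; i < n; i++) {
        double s = share[i];
        if (s > top[0]) {
            top[2] = top[1]; top[1] = top[0]; top[0] = s;
        } else if (s > top[1]) {
            top[2] = top[1]; top[1] = s;
        } else if (s > top[2]) {
            top[2] = s;
        }
    }

    if (top[0] > 0.80) {
        return PROFILE_HOTSPOT;            /* one dominant function */
    }
    if (top[0] + top[1] + top[2] > 0.80) {
        return PROFILE_SPIKE;              /* a few functions dominate */
    }
    return PROFILE_FLAT;                   /* time spread thinly */
}
```

For example, shares of 40%, 30%, 20% and 10% classify as a spike under these assumptions, while ten functions at 10% each classify as flat.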

Arm MAP: Performance made easy

Low overhead measurement
• Accurate, non-intrusive application performance profiling
• Seamless – no recompilation or relinking required

Easy to use
• Source code viewer pinpoints bottleneck locations
• Zoom in to explore iterations, functions and loops

Deep
• Multiple architectures: Armv8 AArch64, x86, Power, GPU
• Measures CPU, communication, I/O and memory to identify problem causes
• Identifies vectorization and cache performance

Arm MAP – The Profiler

Small data files

<5% slowdown

No instrumentation

No recompilation

Arm MAP cheat sheet

Load the environment module
• $ module load

Prepare the code
• $ -O3 -g myapp.c -o myapp.exe

Run Arm MAP with
• $ map ./myapp.exe arg1 arg2

Or run Arm MAP in “profile” mode, e.g. when submitting a job script
• $ map --profile ./myapp.exe arg1 arg2

Open the results
• On the login node:
• $ map myapp_Xp_Yn_YYYY-MM-DD_HH-MM.map
• (or load the corresponding file using the remote client connected to the remote system or locally)

Introduction to the MAP GUI

Open results:
$ map mmult.map

MAP GUI panels (screenshot): activity graph (processes over time), metric graphs, colour-coded activity stats, source code view, stack/function view

Glean Deep Insight from our Source-Level Profiler

Track memory usage across the entire application over time

Spot MPI and OpenMP imbalance and overhead

Optimize CPU memory and vectorization in loops

Detect and diagnose I/O bottlenecks at real scale

Configurable CPU metrics

• Allows inspection of CPU efficiency for fine performance tuning
• To expose more of the perf events supported by the local machine, generate a probe file with:
  $ /path-to-forge-installation/bin/forge-probe --install=user
• Specify the collection of hardware events in MAP
  – From the GUI
  – From the command-line
• Feature based on Linux Perf
• Available for x86_64, Arm v8-A and Power

Profiling with Caliper (additional features)
Caliper: https://github.com/LLNL/Caliper

• Caliper provides more introspection into the application and its performance
• A new “Selected regions” category is displayed
• Functions that are instrumented with Caliper can be selected
• Iterative patterns within functions appear more clearly
• The timeline in the new Regions view shows how these functions are called

Optimizing memory accesses

Typical memory hierarchy

Latency from next level (cycles)   Level         Size (bytes)
4                                  Registers     192
12                                 L1 Cache      32k
26                                 L2 Cache      256k
230-360                            L3 Cache      2M
?                                  Main memory   2G

Speeding up memory accesses

High performance is possible when:
• There is an opportunity for cache re-use
• Data is local to the core for quick usage
• CPU gets data from memory to cache before it is actually needed

Diagram: data stream between the CPUs, registers, L1 cache, L2 cache, L3 cache and main memory

Memory access patterns

Data locality
• Temporal locality: reuse of data within a short time of its last use
• Spatial locality: use of memory locations close to those already referenced

Temporal locality example
for (i=0 ; i < N; i++) {
  for (loop=0; loop < 10; loop++) {
    … = … x[i] …
  }
}

Spatial locality example
for (i=0 ; i < N*s; i+=s) {
  … = … x[i] …
}
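Spatial locality is easiest to see on a two-dimensional array. The sketch below (function names are ours) sums the same row-major matrix in two loop orders. Both return the same value, but the first touches consecutive addresses in its inner loop while the second strides by a full row, so on large matrices the second is expected to cause many more cache misses.

```c
#include <assert.h>
#include <stddef.h>

/* Sum a rows x cols matrix stored in row-major order (as C arrays are).
 * The inner loop walks consecutive addresses: good spatial locality. */
double sum_row_order(const double *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    double total = 0.0;
    for (size_t i = 0; i < rows; i++) {
        for (size_t j = 0; j < cols; j++) {
            total += m[i * cols + j];
        }
    }
    return total;
}

/* Same result, but the inner loop jumps cols elements at a time, so
 * each access may land on a different cache line: poor spatial locality. */
double sum_column_order(const double *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    double total = 0.0;
    for (size_t j = 0; j < cols; j++) {
        for (size_t i = 0; i < rows; i++) {
            total += m[i * cols + j];
        }
    }
    return total;
}
```

Swapping the two loops (loop interchange) is often the whole fix when a profiler attributes a high memory access share to such a loop nest.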

Memory Accesses and Cache Misses

for(i=0; i

j=0 j=1 HIT i=0, n=4 A

for(i=0; i

Visualize the high memory access issue

Memory access performance correction & tuning

Improving IOs

Checkpoint example

About the example:
• Checkpointing: a technique that helps tolerate failures, so that a long-running application can restart from the latest checkpoint instead of from the beginning
• The basic approach is to copy all the required data from memory to reliable storage and then continue with the execution
• The cost: too many checkpoints can decrease the performance of the application
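The basic scheme described above can be sketched in a few lines of C. This is not the code profiled in the demo: `sim_state`, the function names and the binary file format are all hypothetical, and the "computation" is a placeholder increment. The `interval` parameter is where the cost trade-off shows up.

```c
#include <assert.h>
#include <stdio.h>

/* State worth saving: the iteration counter and the running result. */
typedef struct {
    long   step;
    double value;
} sim_state;

/* Copy the state from memory to a file. Returns 0 on success. */
int save_checkpoint(const char *path, const sim_state *state)
{
    assert(path != NULL && state != NULL);
    FILE *f = fopen(path, "wb");
    if (f == NULL) {
        return -1;
    }
    size_t written = fwrite(state, sizeof *state, 1, f);
    int close_failed = fclose(f);
    return (written == 1 && close_failed == 0) ? 0 : -1;
}

/* Restore the state from a file. Returns 0 on success. */
int load_checkpoint(const char *path, sim_state *state)
{
    assert(path != NULL && state != NULL);
    FILE *f = fopen(path, "rb");
    if (f == NULL) {
        return -1;   /* no checkpoint yet: caller starts from scratch */
    }
    size_t items = fread(state, sizeof *state, 1, f);
    fclose(f);
    return (items == 1) ? 0 : -1;
}

/* Run up to total_steps, checkpointing every interval steps.
 * A larger interval means less I/O but more work lost on failure. */
int run(const char *path, long total_steps, long interval, sim_state *state)
{
    assert(interval > 0);
    if (load_checkpoint(path, state) != 0) {
        state->step = 0;
        state->value = 0.0;
    }
    while (state->step < total_steps) {
        state->value += 1.0;      /* stand-in for the real computation */
        state->step += 1;
        if (state->step % interval == 0 &&
            save_checkpoint(path, state) != 0) {
            return -1;
        }
    }
    return 0;
}
```

Calling `run` again after a crash resumes from the last saved step. Checkpointing on every step would make the run I/O-bound, which is the kind of behaviour a profiler makes visible.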

Profiling observations

App spends most of the time in I/O (fseek and fprintf):

Checkpointing solution profiling
• OpenMP threads are used for compute and writing res.txt
• More compute time

Arm MAP cheat sheet

• Load the environment module
  $ module load armforge/20.1.3

• Prepare the code
  $ mpicc -O3 -g myapp.c -o myapp.exe
  – Optimisation allowed

• Start Arm MAP in profile mode
  $ map --profile srun -n 8 ./myapp.exe arg1 arg2

• Start Arm MAP in interactive mode
  – Express launch (specify mpiexec/mpirun/srun/aprun)
    $ map srun -n 8 ./myapp.exe arg1 arg2
  – Compatibility launch (use one of the MPI implementations that MAP supports)
    $ map -n 8 ./myapp.exe arg1 arg2

• View the MAP file locally
  – Use the remote connect

Arm Performance Reports
Tuning performance

Arm Performance Reports

No source code needed

Less than 5% runtime overhead

Fully scalable

Run regularly – or in regression tests

Explicit and usable output

Arm Performance Reports

Performance Reports is a stand-alone performance profiler
• Less intrusive than MAP
• Removes the time component
• Data aggregated over time for concise metrics

Output in HTML webpage
• Visual representation of metrics
• Also support for txt / JSON / CSV formats

Easy to run
• Uses express launch
• perf-report mpirun -np 8 ./myapp.exe arg1 arg2

Can convert map files to Performance Report
• perf-report my_profile.map

Works with MAP custom metrics
• Through XML ‘partial report’

Demonstrate performance gains
Compare against your baseline and iterate if needed

Before After

Thank You! Danke! Merci! 谢谢! ありがとう! Gracias! Kiitos!