Debugging, profiling and performance tuning with Arm Forge
03-03-2021
Confidential © Arm 2018
About this Presentation
Who are we?
• Suyash Sharma, Senior AE, based in Manchester UK
• Arm Linux/HPC SW Support Team: [email protected]
• Ryan Hulguin, Senior AE, based in Tennessee US
Presentation objectives
• Overview of the main features of DDT, MAP and Performance Reports
• Introduction to using Arm debugging and profiling tools for HPC application development
Agenda
4:15-4:45pm:
• Forge Overview (5 mins)
• Debugging with DDT & demo (10 mins)
• Profiling with MAP & PR (15 mins)
4:45-5:00pm:
• Forge demo (15 mins)
• Further performance tuning with Performance Reports (5 mins)
• Hands-on, Q&A: debugging, profiling your own application (10 mins)
Arm Forge Overview
About Arm (ex-Allinea) interoperable tools for HPC
Arm (ex-Allinea) Tools: leading toolkit for HPC application developers
• Available on 65% of the top 100 HPC systems
• Help maximise application efficiency with Performance Reports
• Help the HPC community design the best applications with Forge
In December 2016 Allinea joined Arm
• Continue to be the trusted leader in HPC tools across every platform
• Our engineering roadmap is aligned with upcoming architectures from every vendor
• We remain 100% committed to providing cross-platform tools for HPC
• Our product and service team will continue to work with you, our customers and partners, and the wider HPC community
Server & HPC Development Solutions from Arm
Best in class commercially supported tools for Linux and high-performance computing
• Code Generation for Arm servers (Compiler for Linux): C/C++ Compiler, Fortran Compiler, Performance Libraries
• Performance Engineering for any architecture, at any scale: Debugger, Profiler, Reporting
• Server & HPC Solution for Arm servers: commercially supported toolkit for applications development on Linux
  – C/C++ Compiler for Linux
  – Fortran Compiler for Linux
  – Performance Libraries
  – Performance Reports
  – Debugger
  – Profiler
Arm Forge (DDT & MAP)
An interoperable toolkit for debugging and profiling
The de-facto standard for HPC development (commercially supported by Arm)
• Available on the vast majority of the Top500 machines in the world
• Fully supported by Arm on x86, IBM Power, Nvidia GPUs and Arm v8-A
State-of-the-art debugging and profiling capabilities (fully scalable)
• Powerful and in-depth error detection mechanisms (including memory debugging)
• Low-overhead sampling-based profiler to identify and understand bottlenecks
• Available at any scale (from serial to petaflop-scale applications)
Easy to use by everyone (very user-friendly)
• Unique capabilities to simplify remote interactive sessions
• Innovative approach to present quintessential information to users
Arm Performance Reports
Characterize and understand the performance of HPC application runs
Gathers a rich set of data (commercially supported by Arm)
• Analyses metrics around CPU, memory, IO, hardware counters, etc.
• Possibility for users to add their own metrics
Build a culture of application performance & efficiency awareness (accurate and astute insight)
• Analyses data and reports the information that matters to users
• Provides simple guidance to help improve workloads’ efficiency
Adds value to typical users’ workflows (relevant advice to avoid pitfalls)
• Define application behaviour and performance expectations
• Integrate outputs to various systems for validation (e.g. continuous integration)
• Can be automated completely (no user intervention)
Different ways to run Arm Forge…
• Here: remote launch + reverse connect
• There: interactive mode + reverse connect
• There: offline OR interactive mode
Ultimately, that’s where the tools will run. But what about the GUI?
Run and Ensure Application Correctness with DDT
Scalable tool for interactive and automated debugging
• One can:
  – Use DDT manually and interactively to debug the application
  – Generate a debug report in offline mode that can be shared with others for co-development
  – Integrate Arm Forge into CI workflows for automated, non-interactive debugging
Examples:
$ ddt --connect mpirun -n 48 ./example
$ ddt --offline mpirun -n 48 ./example
Optimise the Application with MAP
Identify bottlenecks and rewrite some code for better performance
• One can:
  – Profile the application and measure all performance aspects
  – Generate a profile report that can be analysed later or shared with others
Examples:
$ map --connect mpirun -n 48 ./example
$ map --profile mpirun -n 48 ./example
Understand Application Behaviour with Performance Reports
Set a reference for future work
• One can:
  – Analyse performance with the stats and hints generated by Performance Reports
  – Summarise from a .map file generated by MAP
Examples:
$ perf-report mpirun -n 16 mmult_c
$ perf-report profile.map
Resources
Web, doc and support
• Forge user guide, release notes/installation, downloads: https://developer.arm.com/tools-and-software/server-and-hpc/debug-and-profile/arm-forge
• Forge product info, case studies, webinars: https://www.arm.com/products/development-tools/server-and-hpc/forge
• Arm Licensing Server download and installation: https://developer.arm.com/tools-and-software/server-and-hpc/help/help-and-tutorials/system-administration/licensing/arm-licence-server
For support, please email [email protected] or submit a case directly at https://support.developer.arm.com/.
Arm DDT
Debugging with Arm DDT
Open-Source Debuggers’ Challenges
Debugging with GDB/LLDB can be less user-friendly due to the text-based interface
[Figures: GDB workflow; alternate GDB GUI (DDD)]
Arm DDT – The Debugger
Workflow: run with Arm tools, identify a problem, gather info (who, where, how, why), fix
Who had a rogue behaviour?
• Merges stacks from processes and threads
Where did it happen?
• Leaps to source
How did it happen?
• Diagnostic messages
• Some faults evident instantly from source
Why did it happen?
• Unique “Smart Highlighting”
• Sparklines comparing data across processes
DDT capabilities
• Dedicated HPC debugger
  – Fortran, C & C++, Python
• Designed for massively parallel applications
  – Designed for MPI applications
  – Support for OpenMP
• Highly scalable
  – Shown to debug at hundreds of thousands of cores
  – Fast reduction algorithms
• Memory debugging
• Variable comparison
• Distributed arrays
• GPU support
  – For NVIDIA CUDA (9, 10 and 11)
The DDT user interface (on Arm)
Breakpoints, Watchpoints and Tracepoints
Breakpoints: stop at a line of code so that the program state can be checked.
Watchpoints: observe variables or expressions with conditions. DDT stops with a notification about the value of the variable or expression.
Tracepoints: trace variables at a selected line of code without stopping the application.
* Use tracepoints with caution, as they can be resource-consuming.
Version Control Information
To track new bugs from the latest changes
View -> Version Control Information
Disassembler View
Tools -> Disassemble or
Check memory usage
Tools -> Overall Memory Stats
Tools -> Current Memory Usage
Memory debugging menu in Arm DDT
Launching DDT
Preparing Code for Use with DDT
As with any debugger, code must be compiled with the debug flag, typically -g.
A low optimization level (i.e. -O0) is recommended during debugging for better debug information:
• It avoids compiler code-generation errors
• More aggressive optimizations produce more such errors
• Optimization removes some variables and functions, so they cannot be inspected
Express launch (from where Forge is installed)
With X11 forwarding to launch the DDT GUI (might be slow, or X11 might not be supported):
$ ddt
$ ddt srun/aprun/mpirun/mpiexec -n 4 example.exe
Without X11 forwarding (faster):
• Offline mode, without interactive debugging:
$ ddt --offline srun/aprun/mpirun/mpiexec -n 4 example.exe
• Remote connect, with interactive debugging:
$ ddt --connect
$ ddt --connect srun/aprun/mpirun/mpiexec -n 4 example.exe
Express launch GUI
Remote connect
• Saves connecting over X11 session
• Communicates over ssh
• Much faster
• Start local (e.g. laptop) instance of Forge
• See remote files as normal
• Configure a remote connection
• Supports multi-hop and ssh configurations
• Specify the location of the Forge install
• Lets you open ‘remote’ files ‘locally’
• Start jobs through the scheduler
• Supports reverse connect
Remote connect with Forge Remote Client
Install the Forge client on your remote laptop/workstation. Download the package from: https://developer.arm.com/tools-and-software/server-and-hpc/downloads/arm-forge
Arm DDT cheat sheet
• Load the environment module
  – $ module load armforge/21.0
• See available command options of DDT
  – $ ddt --help
• Prepare the code
  – $ mpicc -O0 -g myapp.c -o myapp.exe
• Start Arm DDT in interactive mode
  – $ ddt srun -n 8 ./myapp.exe arg1 arg2
• Or use the reverse connect mechanism
  – Install and use the Forge client on your remote laptop/workstation
  – Then, edit the job script to run the following command and submit:
  – ddt --connect srun -n 8 ./myapp.exe arg1 arg2
Arm DDT’s offline mode
• You can run the debugger in non-interactive mode
  – For long-running jobs
  – For automated testing, continuous integration…
• To do so, use the following arguments:
  – $ ddt --offline --output=report.html mpirun ./jacobi_omp_mpi_gnu.exe
  – --offline enables non-interactive debugging
  – --output | -o specifies the name and format (HTML or plain text) of the output of the non-interactive debugging session
  – Add --mem-debug to enable memory debugging and memory leak detection
$ ddt --offline -o jacobi_omp_mpi_gnu_debug.txt \
      --trace-at _jacobi.F90:83,residual \
      srun ./jacobi_omp_mpi_gnu.exe
DDT offline debug file
Python Debugging
Parameter for debugging Python code:
$ ddt python3 %allinea_python_debug% my-script.py
Python with an MPI implementation:
$ ddt srun -n 4 python3 %allinea_python_debug% my-mpi-script.py
CUDA Debugging
Compiling CUDA code:
$ nvcc -g -G -O0 my_cudascript.cu -o my_cudascript
To use memory debugging in DDT with CUDA:
$ nvcc -g -G -O0 --cudart shared my_cudascript.cu -o my_cudascript
Launch CUDA debugging in DDT:
$ ddt --connect ./my_cudascript &
Or launch from the DDT GUI.
Arm DDT Examples
Debugging with Arm DDT
Segmentation Fault
In this example, the application crashes with a segmentation fault when run outside of DDT.
What happens when it runs under DDT?
Segmentation Fault in DDT
DDT takes you to the exact line where the segmentation fault occurred, so you can pause and investigate.
Invalid Memory Access
Pause and investigate: the array tab is a 13x13 array, but the application is trying to write a value to tab(4198128,0), which causes the segmentation fault.
Reasons found: i is not used, and x and y are not initialized.
Profiling with MAP
Arm MAP
What is Arm MAP?
A low-overhead parallel profiler that visualizes performance data.
Why profile an application?
Profiling: "a form of dynamic program analysis that measures, for example, the space (memory) or time complexity of a program, the usage of particular instructions, or the frequency and duration of function calls. Most commonly, profiling information serves to aid program optimization." (Wikipedia)
How are profiling tools used?
– Select representative test case(s)
– Profile
– Analyse and find bottlenecks
– Optimise
– Profile again to check performance results and iterate
How to profile?
Different methods:
• Tracing
  – Records and timestamps all operations
  – Intrusive
• Instrumenting
  – Adds instructions to the source code to collect data
  – Intrusive
• Sampling
  – Automatically collects data
  – Not intrusive
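To make the "instrumenting" method concrete, here is a minimal Python sketch of what adding measurement code to the source looks like. It is an illustration only, not how Arm MAP works (MAP is sampling-based); the names instrument, stats, compute and driver are hypothetical.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical registry: function name -> [call count, total seconds]
stats = defaultdict(lambda: [0, 0.0])

def instrument(func):
    """Wrap func so that every call is counted and timed.

    This is the intrusive part: the measured program itself is modified.
    """
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            entry = stats[func.__name__]
            entry[0] += 1
            entry[1] += time.perf_counter() - start
    return wrapper

@instrument
def compute(n):
    # Stand-in for a compute kernel
    return sum(i * i for i in range(n))

@instrument
def driver():
    # Calls the kernel five times
    return [compute(1000) for _ in range(5)]

results = driver()
for name, (calls, seconds) in stats.items():
    print(f"{name}: {calls} call(s), {seconds:.6f} s")
```

Every instrumented call pays for two timer reads and a dictionary update, which is why this approach is listed as intrusive; a sampling profiler avoids modifying the code at all.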
Some types of profiles
Hotspot • One function corresponds to more than 80% of the runtime • Large speed-up potential • Best optimisation scenario
Spike • The application spends most of the time in a few functions • Speed-up potential depends on the aggregated time • Variable optimisation time
Flat • Runtime split evenly between numerous functions, each one with a very small runtime • Little speed-up potential without algorithmic changes • Worst optimisation scenario
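The three profile shapes can be told apart from per-function runtime shares. The sketch below is a rough illustration: the 80% hotspot threshold comes from the slide, while the "top 3 functions above 50%" rule for a spike and the function name classify_profile are assumptions made for this example.

```python
def classify_profile(fractions, hotspot_threshold=0.80,
                     spike_top_n=3, spike_threshold=0.50):
    """Classify a profile as 'hotspot', 'spike' or 'flat'.

    fractions: dict mapping function name -> share of total runtime (0..1).
    hotspot_threshold follows the slide (one function above 80%);
    spike_top_n and spike_threshold are illustrative assumptions.
    """
    shares = sorted(fractions.values(), reverse=True)
    if not shares:
        raise ValueError("empty profile")
    if shares[0] > hotspot_threshold:
        return "hotspot"
    if sum(shares[:spike_top_n]) > spike_threshold:
        return "spike"
    return "flat"

hotspot = classify_profile({"solve": 0.85, "io": 0.10, "init": 0.05})
spike = classify_profile({"a": 0.30, "b": 0.25, "c": 0.20, "d": 0.25})
flat = classify_profile({f"f{i}": 0.05 for i in range(20)})
print(hotspot, spike, flat)  # hotspot spike flat
```

The per-function shares would come from a profiler's function view; the thresholds are worth tuning to the application rather than taken as fixed.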
Arm MAP: Performance made easy
Low overhead measurement
• Accurate, non-intrusive application performance profiling
• Seamless: no recompilation or relinking required
Easy to use
• Source code viewer pinpoints bottleneck locations
• Zoom in to explore iterations, functions and loops
Deep
• Multiple architectures: Armv8 AArch64, x86, Power, GPU
• Measures CPU, communication, I/O and memory to identify problem causes
• Identifies vectorization and cache performance
Arm MAP – The Profiler
Small data files
<5% slowdown
No instrumentation
No recompilation
Arm MAP cheat sheet
Load the environment module • $ module load
Prepare the code • $
Run Arm MAP by • $ map
Or run Arm MAP in “profile” mode, e.g. when submitting a job script
• $ map --profile
Open the results
• On the login node:
  – $ map myapp_Xp_Yn_YYYY-MM-DD_HH-MM.map
  – (or load the corresponding file using the remote client connected to the remote system or locally)
Introduction to the MAP GUI
Open results: $ map mmult.map
GUI elements: activity graph (processes over time), metric graphs, colour-coded activity stats, source code view, stack/function view
Glean Deep Insight from our Source-Level Profiler
Track memory usage across the entire application over time
Spot MPI and OpenMP imbalance and overhead
Optimize CPU memory and vectorization in loops
Detect and diagnose I/O bottlenecks at real scale
Configurable CPU metrics
• Allows inspection of CPU efficiency for fine performance tuning
• To use more of the perf events supported by the local machine, generate a probe file with: $ /path-to-forge-installation/bin/forge-probe --install=user
• Specify the collection of hardware events in MAP
  – From the GUI
  – From the command line
• Feature based on Linux Perf
• Available for x86_64, Arm v8-A and Power
Profiling with Caliper (additional features)
Caliper: https://github.com/LLNL/Caliper
• Caliper provides more introspection into the application and its performance
• A new “Selected regions” category is displayed
• Functions that are instrumented with Caliper can be selected
• Iterative patterns within functions appear more clearly
• The timeline in the new Regions view shows how these functions are called
Optimizing memory accesses
Typical memory hierarchy
Level          Size (bytes)   Latency from next level (cycles)
Registers      192            4
L1 Cache       32k            12
L2 Cache       256k           26
L3 Cache       2M             230-360
Main memory    2G             ?
Speeding up memory accesses
High performance is possible when:
• There is an opportunity for cache re-use
• Data is local to the core for quick usage
• The CPU gets data from memory into cache before it is actually needed
Data stream through the hierarchy: CPUs, Registers, L1 Cache, L2 Cache, L3 Cache, Main memory
Memory access patterns
Data locality
• Temporal locality: reuse of data within a short time of its last use
• Spatial locality: use of memory locations close to those already referenced
Temporal locality example
for (i = 0; i < N; i++) {
    for (loop = 0; loop < 10; loop++) {
        … = … x[i] …
    }
}
Spatial locality example
for (i = 0; i < N*s; i += s) {
    … = … x[i] …
}
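The effect of the stride s and of data reuse on cache behaviour can be illustrated with a toy model. The Python sketch below simulates a direct-mapped cache and counts misses for the access patterns of the two loops above; the cache geometry (8 elements per line, 64 lines) and the name count_misses are assumptions chosen for illustration, not properties of any real CPU.

```python
def count_misses(addresses, line_size=8, num_lines=64):
    """Count misses in a toy direct-mapped cache.

    addresses: iterable of array element indices, in access order.
    line_size: elements per cache line (illustrative value).
    num_lines: number of lines in the cache (illustrative value).
    """
    cache = [None] * num_lines  # cache[slot] holds the resident line number
    misses = 0
    for addr in addresses:
        line = addr // line_size
        slot = line % num_lines
        if cache[slot] != line:
            cache[slot] = line  # load the line, evicting the previous one
            misses += 1
    return misses

N = 1024

# Spatial locality: for (i = 0; i < N*s; i += s) with s = 1 and s = 8
unit_stride = count_misses(range(0, N))
stride_8 = count_misses(range(0, N * 8, 8))

# Temporal locality: each x[i] is reused 10 times before moving on
temporal = count_misses(i for i in range(N) for _ in range(10))

print(unit_stride, stride_8, temporal)  # 128 1024 128
```

With unit stride, only one access in eight misses, because neighbouring elements share a cache line. With a stride equal to the line size, every access misses. In the temporal case the 10 240 accesses cost the same 128 misses as a single pass.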
Memory Accesses and Cache Misses
[Figure: loops over i and j (n=4) accessing array A, marking cache hits]
Visualize the high memory access issue
Memory access performance correction & tuning
Improving IOs
Checkpoint example
About the example:
• Checkpointing: a technique that helps tolerate failures, so that a long-running application can restart from the latest checkpoint instead of from the beginning
• The basic approach is to copy all the required data from memory to reliable storage and then continue with the execution
• The cost: too many checkpoints can decrease the performance of the application
Profiling observations
The application spends most of its time in I/O (fseek and fprintf).
Checkpointing solution profiling
• OpenMP threads are used for compute and for writing res.txt
• More compute time
Arm MAP cheat sheet
• Load the environment module
  – $ module load armforge/20.1.3
• Prepare the code
  – $ mpicc -O3 -g myapp.c -o myapp.exe
  – Optimisation allowed
• Start Arm MAP in profile mode
  – $ map --profile srun -n 8 ./myapp.exe arg1 arg2
• Start Arm MAP in interactive mode
  – Express launch (specify mpiexec/mpirun/srun/aprun): $ map srun -n 8 ./myapp.exe arg1 arg2
  – Compatibility launch (use one of the MPIs that MAP supports): $ map -n 8 ./myapp.exe arg1 arg2
• View the MAP file locally
  – Use the remote connect
Arm Performance Reports
Tuning performance
Arm Performance Reports
• No source code needed
• Less than 5% runtime overhead
• Fully scalable
• Run regularly, or in regression tests
• Explicit and usable output
Arm Performance Reports
Performance Reports is a standalone performance profiler
• Less intrusive than MAP
• Removes the time component
• Data aggregated over time for concise metrics
Output in HTML webpage
• Visual representation of metrics
• Also support for txt / JSON / CSV formats
Easy to run
• Uses express launch
• perf-report mpirun -np 8 ./myapp.exe arg1 arg2
Can convert MAP files to Performance Reports
• perf-report my_profile.map
Works with MAP custom metrics
• Through XML ‘partial report’
Demonstrate performance gains
Compare against your baseline and iterate if needed
[Figure: Before / After]
Thank You! Danke! Merci! 谢谢! ありがとう! Gracias! Kiitos!