How to Analyze the Performance of Parallel Codes with Open|SpeedShop
Jim Galarowicz, Krell Institute; Don Maghrak, Krell Institute
LLNL-PRES-503451
Oct 27-28, 2015
Presenters and Extended Team
v Jim Galarowicz, Krell
v Donald Maghrak, Krell
Open|SpeedShop extended team:
v William Hachfeld, Dave Whitney, Dane Gardner: Argo Navis/Krell
v Jennifer Green, David Montoya, Mike Mason, David Shrader: LANL
v Martin Schulz, Matt Legendre and Chris Chambreau: LLNL
v Mahesh Rajan, Anthony Agelastos: SNL
v Dyninst group (Bart Miller: UW & Jeff Hollingsworth: UMD)
v Phil Roth, Mike Brim: ORNL
v Ciera Jaspan: CMU
Outline
Welcome
Introduction to Open|SpeedShop (O|SS) tools
Overview of performance experiments
Ø How to use Open|SpeedShop to gather and display performance data
Ø Sampling experiments
Ø Tracing experiments
Ø How to compare performance data for different application runs
Advanced and New Functionality
v Component Based Tool Framework (CBTF) based O|SS
Ø New lightweight iop and mpip experiments
Ø New: memory experiment
Ø New: CUDA/GPU experiment
Ø New: pthreads experiment
What is new / Roadmap / Future Plans
Supplemental material of interest
Ø Command Line Interface (CLI) tutorial and examples
Section 1: Introduction to Tools and Open|SpeedShop
Open|SpeedShop Tool Set
v Open Source Performance Analysis Tool Framework
Ø Most common performance analysis steps all in one tool
Ø Combines tracing and sampling techniques
Ø Extensible by plugins for data collection and representation
Ø Gathers and displays several types of performance information
v Flexible and Easy to Use
Ø User access through: GUI, command line, Python scripting, convenience scripts
v Scalable Data Collection
Ø Instrumentation of unmodified application binaries
Ø New option for hierarchical online data aggregation
v Supports a wide range of systems
Ø Extensively used and tested on a variety of Linux clusters
Ø Cray, Blue Gene, ARM, Intel MIC, GPU support
Open|SpeedShop Workflow
Workflow: run the unmodified MPI application as usual, then rerun the same command line under an O|SS convenience script for post-mortem analysis.
v MPI application, run as usual:
srun -n4 -N1 smg2000 -n 65 65 65
v Under O|SS (post-mortem mode):
osspcsamp "srun -n4 -N1 smg2000 -n 65 65 65"
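The wrapping pattern above generalizes: each experiment ships a convenience script named oss<experiment> that takes the normal launch command as one quoted argument. A minimal sketch (the experiment and launch command here are just the slide's example):

```shell
# Build the wrapped invocation from the experiment name and the
# unmodified launch command; the echo prints the command to run.
experiment=pcsamp
app_cmd='srun -n4 -N1 smg2000 -n 65 65 65'
oss_cmd="oss${experiment} \"${app_cmd}\""
echo "${oss_cmd}"   # prints: osspcsamp "srun -n4 -N1 smg2000 -n 65 65 65"
```

Swapping `pcsamp` for another experiment name (e.g. `usertime`, `io`) selects a different collector without changing the application command line.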
http://www.openspeedshop.org/
Alternative Interfaces
v Scripting language
Ø Immediate command interface
Ø O|SS interactive command line (CLI)
• openss -cli
v Python module

Experiment commands: expView, expCompare, expStatus
List commands: list -v exp, list -v hosts, list -v src
Session commands: openGui

Python module example:
import openss
my_filename = openss.FileList("myprog.a.out")
my_exptype = openss.ExpTypeList("pcsamp")
my_id = openss.expCreate(my_filename, my_exptype)
openss.expGo()
my_metric_list = openss.MetricList("exclusive")
my_viewtype = openss.ViewTypeList("pcsamp")
result = openss.expView(my_id, my_viewtype, my_metric_list)
Central Concept: Experiments
v Users pick experiments:
Ø What to measure and from which sources?
Ø How to select, view, and analyze the resulting data?
v Two main classes:
Ø Statistical Sampling
• Periodically interrupt execution and record location
• Useful to get an overview
• Low and uniform overhead
Ø Event Tracing
• Gather and store individual application events
• Provides detailed per-event information
• Can lead to huge data volumes
v O|SS can be extended with additional experiments
Sampling Experiments in O|SS
v PC Sampling (pcsamp)
Ø Records the PC repeatedly at a user-defined time interval
Ø Low-overhead overview of the time distribution
Ø Good first step, lightweight overview
v Call Path Profiling (usertime)
Ø PC sampling plus call stacks for each sample
Ø Provides inclusive and exclusive timing data
Ø Use to find hot call paths, who is calling whom
v Hardware Counters (hwc, hwctime, hwcsamp)
Ø Provide profiles of hardware counter events like cache & TLB misses
Ø hwcsamp:
• Periodically samples to capture a profile of the code against the chosen counters
• Default events are PAPI_TOT_INS and PAPI_TOT_CYC
Ø hwc, hwctime:
• Sample a hardware counter until a certain number of events (called the threshold) is recorded, then capture the call stack
• Default event is PAPI_TOT_CYC overflows
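As a hedged sketch, the sampling experiments above map onto convenience scripts like this (the launcher `mpirun` and binary `./smg2000` are just the tutorial's example; each echo prints the invocation, so drop the echo to actually run it):

```shell
# Dry-run sketch of the sampling convenience scripts.
app='mpirun -np 4 ./smg2000 -n 65 65 65'
echo "osspcsamp \"${app}\""     # PC sampling: lightweight time overview
echo "ossusertime \"${app}\""   # call-path profiling: inclusive/exclusive time
echo "osshwcsamp \"${app}\""    # hardware-counter sampling (PAPI events)
```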
Tracing Experiments in O|SS
v Input/Output Tracing (io, iot)
Ø Records invocation of all POSIX I/O events
Ø Provides aggregate and individual timings
Ø Stores function arguments and return code for each call (iot)
v MPI Tracing (mpi, mpit, mpiotf)
Ø Records invocation of all MPI routines
Ø Provides aggregate and individual timings
Ø Stores function arguments and return code for each call (mpit)
Ø Creates Open Trace Format (OTF) output (mpiotf)
v Floating Point Exception Tracing (fpe)
Ø Triggered by any FPE caused by the application
Ø Helps pinpoint numerical problem areas
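The tracing experiments follow the same convenience-script pattern; a dry-run sketch (same assumed example command as above, echoes print what would be run):

```shell
# Dry-run sketch of the tracing convenience scripts.
app='mpirun -np 4 ./smg2000 -n 65 65 65'
echo "ossio \"${app}\""    # POSIX I/O tracing, aggregate + per-call timings
echo "ossiot \"${app}\""   # I/O tracing that also keeps arguments/return codes
echo "ossmpi \"${app}\""   # MPI routine tracing
```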
Performance Analysis in Parallel
v How to deal with concurrency?
Ø Any experiment can be applied to a parallel application
• Important step: aggregation or selection of data
Ø Special experiments target parallelism/synchronization
v O|SS supports MPI and threaded codes
Ø Automatically applied to all tasks/threads
Ø Default views aggregate across all tasks/threads
Ø Data from individual tasks/threads available
Ø Thread support (incl. OpenMP) based on POSIX threads
v Specific parallel experiments (e.g., MPI)
Ø Wrap MPI calls and report
• MPI routine time
• MPI routine parameter information
Ø The mpit experiment also stores function arguments and return code for each call
How to Run a First Experiment in O|SS?
1. Picking the experiment
Ø What do I want to measure?
Ø We will start with pcsamp to get a first overview
2. Launching the application
Ø How do I control my application under O|SS?
Ø Enclose how you normally run your application in quotes
Ø osspcsamp "mpirun -np 256 smg2000 -n 65 65 65"
3. Storing the results
Ø O|SS will create a database
Ø Name: smg2000-pcsamp.openss
4. Exploring the gathered data
Ø How do I interpret the data?
Ø O|SS will print a default report
Ø Open the GUI to analyze data in detail (run: "openss")
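The four steps above can be strung together as a session sketch. The database name follows the <application>-<experiment>.openss pattern from step 3; the `-f` flag for opening an existing database in the GUI is an assumption here (echoes print the commands rather than running them):

```shell
# End-to-end sketch of a first pcsamp run.
app=smg2000
exp=pcsamp
launch="mpirun -np 256 ${app} -n 65 65 65"
echo "oss${exp} \"${launch}\""   # steps 1-2: pick experiment, launch wrapped
db="${app}-${exp}.openss"        # step 3: database created by O|SS
echo "openss -f ${db}"           # step 4: explore the data in the GUI
```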
Example Run with Output (1 of 2)
v osspcsamp "mpirun -np 4 smg2000 -n 65 65 65"
Bash> osspcsamp "mpirun -np 4 ./smg2000 -n 65 65 65"
[openss]: pcsamp experiment using the pcsamp experiment default sampling rate: "100".
[openss]: Using OPENSS_PREFIX installed in /opt/ossoffv2.1u4
[openss]: Setting up offline raw data directory in /opt/shared/offline-oss
[openss]: Running offline pcsamp experiment using the command:
"mpirun -np 4 /opt/ossoffv2.1u4/bin/ossrun "./smg2000 -n 65 65 65" pcsamp"
Running with these driver parameters: (nx, ny, nz) = (65, 65, 65) …