How to Analyze the Performance of Parallel Codes with Open|SpeedShop

Jim Galarowicz, Krell Institute; Don Maghrak, Krell Institute

LLNL-PRES-503451

Oct 27-28, 2015

Presenters and Extended Team

- Jim Galarowicz, Krell
- Donald Maghrak, Krell

Open|SpeedShop extended team:
- William Hachfeld, Dave Whitney, Dane Gardner: Argo Navis/Krell
- Jennifer Green, David Montoya, Mike Mason, David Shrader: LANL
- Martin Schulz, Matt Legendre and Chris Chambreau: LLNL
- Mahesh Rajan, Anthony Agelastos: SNL
- Dyninst group (Bart Miller: UW & Jeff Hollingsworth: UMD)
- Phil Roth, Mike Brim: ORNL
- Ciera Jaspan: CMU

How to Analyze the Performance of Parallel Codes with Open|SpeedShop Oct 27-28, 2015 2

Outline

- Welcome
- Introduction to Open|SpeedShop (O|SS) tools
- Overview of performance experiments
  - How to use Open|SpeedShop to gather and display
  - Sampling experiments
  - Tracing experiments
  - How to compare performance data for different application runs
- Advanced and new functionality
  - Component Based Tool Framework (CBTF) based O|SS
    - New lightweight iop and mpip experiments
    - New: memory experiment
    - New: CUDA/GPU experiment
    - New: pthreads experiment
- What is new / Roadmap / Future plans
- Supplemental, or interest
  - Command Line Interface (CLI) tutorial and examples



Section 1: Introduction to Tools and Open|SpeedShop


Open|SpeedShop Tool Set

- Open Source Performance Analysis Tool Framework
  - Most common performance analysis steps all in one tool
  - Combines tracing and sampling techniques
  - Extensible by plugins for data collection and representation
  - Gathers and displays several types of performance information
- Flexible and easy to use
  - User access through: GUI, Command Line, Python Scripting, convenience scripts
- Scalable data collection
  - Instrumentation of unmodified application binaries
  - New option for hierarchical online data aggregation
- Supports a wide range of systems
  - Extensively used and tested on a variety of clusters
  - Cray, Blue Gene, ARM, MIC, GPU support


Open|SpeedShop Workflow

Run normally:       srun –n4 –N1 smg2000 –n 65 65 65
Run under O|SS:     osspcsamp "srun –n4 –N1 smg2000 –n 65 65 65"
O|SS then performs post-mortem analysis of the MPI application.

http://www.openspeedshop.org/


Alternative Interfaces

- Scripting language
  - Immediate command interface
  - O|SS interactive command line (CLI): openss -cli
  - Experiment commands: expView, expCompare, expStatus
  - List commands: list -v exp, list -v hosts, list -v src
  - Session commands: openGui
- Python module

      import openss
      my_filename = openss.FileList("myprog.a.out")
      my_exptype = openss.ExpTypeList("pcsamp")
      my_id = openss.expCreate(my_filename, my_exptype)
      openss.expGo()
      my_metric_list = openss.MetricList("exclusive")
      my_viewtype = openss.ViewTypeList("pcsamp")
      result = openss.expView(my_id, my_viewtype, my_metric_list)


Central Concept: Experiments
- Users pick experiments:
  - What to measure and from which sources?
  - How to select, view, and analyze the resulting data?
- Two main classes:
  - Statistical Sampling
    - Periodically interrupt execution and record location
    - Useful to get an overview
    - Low and uniform overhead
  - Event Tracing
    - Gather and store individual application events
    - Provides detailed per event information
    - Can lead to huge data volumes
- O|SS can be extended with additional experiments
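The statistical-sampling idea is easy to sketch in plain Python: a timer interrupts execution at a fixed interval and the handler records where the program currently is. This is an illustrative toy, not how the O|SS pcsamp collector is implemented (it samples the program counter of unmodified binaries); all names below are invented.

```python
# Toy statistical sampler: a periodic timer signal records the
# currently executing function and line, like pcsamp's PC samples.
import collections
import signal
import time

samples = collections.Counter()

def record_sample(signum, frame):
    # Record the current "location": function name and line number.
    samples[(frame.f_code.co_name, frame.f_lineno)] += 1

def busy_work(deadline):
    x = 0.0
    while time.time() < deadline:
        x += 1.0  # hot loop the sampler should attribute time to
    return x

signal.signal(signal.SIGALRM, record_sample)
signal.setitimer(signal.ITIMER_REAL, 0.01, 0.01)  # ~100 samples/second
busy_work(time.time() + 0.3)
signal.setitimer(signal.ITIMER_REAL, 0, 0)        # stop sampling

# Aggregating the locations gives a flat overview, as pcsamp does.
for (func, line), count in samples.most_common(3):
    print(func, line, count)
```

Note how the overhead is low and uniform: work is only done at each timer tick, regardless of what the application is doing.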


Sampling Experiments in O|SS
- PC Sampling (pcsamp)
  - Record PC repeatedly at user defined time interval
  - Low overhead overview of time distribution
  - Good first step, lightweight overview
- Call Path Profiling (usertime)
  - PC sampling and call stacks for each sample
  - Provides inclusive and exclusive timing data
  - Use to find hot call paths, who is calling whom
- Hardware Counters (hwc, hwctime, hwcsamp)
  - Provides profile of hardware counter events like cache & TLB misses
  - hwcsamp:
    - Periodically sample to capture profile of the code against the chosen counter
    - Default events are PAPI_TOT_INS and PAPI_TOT_CYC
  - hwc, hwctime:
    - Sample a hardware counter until a certain number of events (called threshold) is recorded, and get the call stack
    - Default event is PAPI_TOT_CYC overflows


Tracing Experiments in O|SS

- Input/Output Tracing (io, iot)
  - Record invocation of all POSIX I/O events
  - Provides aggregate and individual timings
  - Store function arguments and return code for each call (iot)
- MPI Tracing (mpi, mpit, mpio)
  - Record invocation of all MPI routines
  - Provides aggregate and individual timings
  - Store function arguments and return code for each call (mpit)
  - Create Open Trace Format (OTF) output (mpio)
- Floating Point Exception Tracing (fpe)
  - Triggered by any FPE caused by the application
  - Helps pinpoint numerical problem areas


Performance Analysis in Parallel
- How to deal with concurrency?
  - Any experiment can be applied to a parallel application
    - Important step: aggregation or selection of data
  - Special experiments targeting parallelism/synchronization
- O|SS supports MPI and threaded codes
  - Automatically applied to all tasks/threads
  - Default views aggregate across all tasks/threads
  - Data from individual tasks/threads available
  - Thread support (incl. OpenMP) based on POSIX threads
- Specific parallel experiments (e.g., MPI)
  - Wraps MPI calls and reports
    - MPI routine time
    - MPI routine parameter information
  - The mpit experiment also stores function arguments and return code for each call


How to Run a First Experiment in O|SS?

1. Picking the experiment
   - What do I want to measure?
   - We will start with pcsamp to get a first overview
2. Launching the application
   - How do I control my application under O|SS?
   - Enclose how you normally run your application in quotes
   - osspcsamp "mpirun –np 256 smg2000 –n 65 65 65"
3. Storing the results
   - O|SS will create a database
   - Name: smg2000-pcsamp.openss
4. Exploring the gathered data
   - How do I interpret the data?
   - O|SS will print a default report
   - Open the GUI to analyze data in detail (run: "openss")


Example Run with Output (1 of 2)

osspcsamp "mpirun –np 4 smg2000 –n 65 65 65"

  Bash> osspcsamp "mpirun -np 4 ./smg2000 -n 65 65 65"
  [openss]: pcsamp experiment using the pcsamp experiment default sampling rate: "100".
  [openss]: Using OPENSS_PREFIX installed in /opt/ossoffv2.1u4
  [openss]: Setting up offline raw data directory in /opt/shared/offline-oss
  [openss]: Running offline pcsamp experiment using the command:
  "mpirun -np 4 /opt/ossoffv2.1u4/bin/ossrun "./smg2000 -n 65 65 65" pcsamp"

  Running with these driver parameters:
    (nx, ny, nz) = (65, 65, 65)
  … …
  Final Relative Residual Norm = 1.774415e-07
  [openss]: Converting raw data from /opt/shared/offline-oss into temp file X.0.openss

  Processing raw data for smg2000
  Processing processes and threads ...
  Processing performance data ...
  Processing functions and statements ...
  Resolving symbols for /home/jeg/DEMOS/workshop_demos/mpi/smg2000/test/smg2000


Example Run with Output (2 of 2)

osspcsamp "mpirun –np 4 smg2000 –n 65 65 65"

  [openss]: Restoring and displaying default view for:
  /home/jeg/DEMOS/workshop_demos/mpi/smg2000/test/smg2000-pcsamp.openss
  [openss]: The restored experiment identifier is: -x 1

  Exclusive CPU time   % of CPU Time   Function (defining location)
  in seconds.
  7.870000   43.265531   hypre_SMGResidual (smg2000: smg_residual.c,152)
  4.390000   24.134140   hypre_CyclicReduction (smg2000: cyclic_reduction.c,757)
  1.090000    5.992303   mca_btl_vader_check_fboxes (libmpi.so.1.4.0: btl_vader_fbox.h,108)
  0.510000    2.803738   unpack_predefined_data (libopen-pal.so.6.1.1: opal_datatype_unpack.h,41)
  0.380000    2.089060   hypre_SemiInterp (smg2000: semi_interp.c,126)
  0.360000    1.979109   hypre_SemiRestrict (smg2000: semi_restrict.c,125)
  0.350000    1.924134   __memcpy_ssse3_back (libc-2.17.so)
  0.310000    1.704233   pack_predefined_data (libopen-pal.so.6.1.1: opal_datatype_pack.h,38)
  0.210000    1.154480   hypre_SMGAxpy (smg2000: smg_axpy.c,27)
  0.140000    0.769654   hypre_StructAxpy (smg2000: struct_axpy.c,25)
  0.110000    0.604728   hypre_SMGSetStructVectorConstantValues (smg2000: smg.c,379)
  ....

View with GUI: openss –f smg2000-pcsamp.openss


Default Output Report View

Performance data default view: by Function (data is summed over all processes and threads). Toolbar to switch views: select "Functions", click the D-icon.

Graphical Representation


Statement Report Output View

Performance data view choice: Statements. Select "Statements", click the D-icon.

Statement in Program that took the most time


Associate Source & Performance Data

Double click to open the source window. Use window controls to split/arrange windows.

Selected performance data point


Library (LinkedObject) View

Select LinkedObject View type and Click on D-icon

Shows time spent in libraries. Can indicate imbalance. Libraries in the application


Loop View

Select Loops View type and Click on D-icon

Shows time spent in loops. Statement number of start of loop.


First Experiment Run: Summary

- Place the way you run your application normally in quotes and pass it as an argument to osspcsamp, or any of the other experiment convenience scripts: ossio, ossmpi, etc.
  - osspcsamp "srun –N 8 –n 64 ./mpi_application app_args"
- Open|SpeedShop sends a summary profile to stdout
- Open|SpeedShop creates a database file
- Display alternative views of the data with the GUI via:
  - openss –f <database file>
- Display alternative views of the data with the CLI via:
  - openss –cli –f <database file>
- On clusters, need to set OPENSS_RAWDATA_DIR
  - Should point to a directory in a shared file system
  - Usually set/handled in a module or dotkit file
- Start with pcsamp for an overview of performance
- Then, focus on performance issues with other experiments



Section 2: Overview of the O|SS experiment functionality


Identifying Critical Regions

Flat Profile Overview
- Profiles show computationally intensive code regions
  - First views: time spent per function or per statement
- Questions:
  - Are those functions/statements expected?
  - Do they match the computational kernels?
  - Any runtime functions taking a lot of time?
- Identify bottleneck components
  - View the profile aggregated by shared objects (LinkedObject view)
  - Correct/expected modules?
  - Impact of support and runtime libraries



Call Path Profiling (usertime)


Call stack profiling

- Call Stack Profiling
  - Take a sample: an address inside a function
  - Call stack: series of program counter addresses (PCs)
  - Unwinding the stack is walking through those addresses and recording that information for symbol resolution later
  - The leaf function is at the end of the call stack list
- Open|SpeedShop: experiment called usertime
  - Time spent inside a routine vs. its children
  - Key view: butterfly
- Comparisons
  - Between experiments to study improvements/changes
  - Between ranks/threads to understand differences/outliers
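Unwinding can be illustrated in Python, where frame objects stand in for the raw PC addresses O|SS records (symbol resolution happens later in the real tool); the function names below are invented:

```python
# Walk the chain of caller frames at a sample point, leaf first,
# mimicking what stack unwinding at a sample produces.
import inspect

def unwind():
    """Return the current call stack as a list of names, leaf first."""
    stack = []
    frame = inspect.currentframe().f_back  # skip unwind() itself
    while frame is not None:
        stack.append(frame.f_code.co_name)
        frame = frame.f_back
    return stack

def leaf():
    return unwind()   # pretend the sample fires here: leaf of the stack

def middle():
    return leaf()

def top():
    return middle()

print(top())  # leaf function first, then its callers
```

The leaf function appears first, exactly as the slide describes: the sampled location sits at the end of the call chain.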


Adding Context through Stack Traces

(Figure: call tree of Functions A through E)
- Missing information in flat profiles
  - Distinguish routines called from multiple callers
  - Understand the call invocation history
  - Context for performance data
- Critical technique: Stack traces
  - Gather stack trace for each performance sample
  - Aggregate only samples with equal trace
- User perspective:
  - Butterfly views (caller/callee relationships)
  - Hot call paths
    - Paths through the application that take the most time


Inclusive vs. Exclusive Timing

(Figure: call tree of Functions A through E; inclusive time for C, exclusive time for B)
- Stack traces enable calculation of inclusive/exclusive times
  - Time spent inside a function only (exclusive), see Function B
  - Time spent inside a function and its children (inclusive), see Function C and children
- Implementation similar to flat profiles
  - Sample PC information
  - Additionally collect call stack information at every sample
- Tradeoffs
  - Pro: Obtain additional context information
  - Con: Higher overhead / lower sampling rate
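The inclusive/exclusive bookkeeping can be sketched as follows; the sample paths and function names are made up, and each sample is assumed to represent one fixed sampling period:

```python
# Compute exclusive and inclusive times from stack samples:
# exclusive time is charged to the leaf only, inclusive time to
# every function on the sampled path.
from collections import Counter

SAMPLE_PERIOD = 0.01  # seconds represented by each sample

# Each sample: a call path from the root down to the sampled leaf.
samples = [
    ["main", "solve", "residual"],
    ["main", "solve", "residual"],
    ["main", "solve", "interp"],
    ["main", "solve"],
]

exclusive = Counter()
inclusive = Counter()
for path in samples:
    exclusive[path[-1]] += SAMPLE_PERIOD   # leaf only
    for func in set(path):                 # each frame on the path once
        inclusive[func] += SAMPLE_PERIOD

print(round(inclusive["main"], 2))   # 0.04: main is on every path
print(round(exclusive["solve"], 2))  # 0.01: solve was the leaf once
print(round(inclusive["solve"], 2))  # 0.04: solve is on every path
```

Note how `main` has almost no exclusive time but full inclusive time: the signal that its children are where the work happens.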


Interpreting Call Context Data

- Inclusive versus exclusive times
  - If similar: child executions are insignificant
    - May not be useful to profile below this layer
  - If inclusive time significantly greater than exclusive time:
    - Focus attention on the execution times of the children
- Hotpath analysis
  - Which paths take the most time?
  - Path time might be ok & expected, but could point to a problem
- Butterfly analysis (similar to gprof)
  - Could be done on "suspicious" functions
    - Functions with large execution time
    - Functions with large difference between inclusive and exclusive time
    - Functions of interest
    - Functions that "take unexpectedly long"
    - …
  - Shows split of time in callees and callers
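A butterfly view for one pivot function can be sketched from the same stack samples used above: split the pivot's samples among its direct callers and direct callees. All names here are illustrative, not O|SS internals:

```python
# Toy butterfly view: for a chosen pivot function, count how often
# each direct caller invoked it and each direct callee was invoked
# by it, across a set of sampled call paths (root first).
from collections import Counter

samples = [  # one time unit per sample
    ["main", "cycle", "smg_solve", "residual"],
    ["main", "cycle", "smg_solve", "residual"],
    ["main", "cycle", "smg_solve", "restrict"],
    ["main", "setup", "smg_solve"],
]

def butterfly(samples, pivot):
    callers, callees = Counter(), Counter()
    for path in samples:
        if pivot not in path:
            continue
        i = path.index(pivot)
        if i > 0:
            callers[path[i - 1]] += 1     # who called the pivot
        if i + 1 < len(path):
            callees[path[i + 1]] += 1     # whom the pivot called
    return callers, callees

callers, callees = butterfly(samples, "smg_solve")
print(dict(callers))   # {'cycle': 3, 'setup': 1}
print(dict(callees))   # {'residual': 2, 'restrict': 1}
```

Reading it: most of the pivot's time comes in via `cycle`, and most of it flows down into `residual`, which is where to look next.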


In/Exclusive Time in O|SS: Usertime

Basic syntax: ossusertime "how you run your executable normally"

Examples:
  ossusertime "smg2000 –n 50 50 50"
  ossusertime "smg2000 –n 50 50 50" low
- Parameters
  - Sampling frequency (samples per second)
  - Alternative parameter: high (70) | low (18) | default (35)

Recommendation: compile code with –g to get statements!


Reading Inclusive/Exclusive Timings

- Default View
  - Similar to pcsamp view from first example
  - Calculates inclusive versus exclusive times
(Screenshot columns: Exclusive Time, Inclusive Time)


Stack Trace Views: Hot Call Path

Hot Call Path

Access to call paths:
- All call paths (C+)
- All call paths for selected function (C↓)


Stack Trace Views: Butterfly View

v Similar to well known “gprof” tool

Callers of “hypre_SMGSolve”

Pivot routine "hypre_SMGSolve"; callees of "hypre_SMGSolve"



Performance Analysis related to accessing Hardware Counter Information


Identify architectural impact on code inefficiencies

- Timing information shows where you spend your time
  - Hot functions / statements / libraries
  - Hot call paths
- BUT: It doesn't show you why
  - Are the computationally intensive parts efficient?
  - Are the processor architectural components working optimally?
- Answer can be very platform dependent
  - Bottlenecks may differ
  - Cause of missing performance portability
  - Need to tune to architectural parameters
- Next: Investigate hardware/application interaction
  - Efficient use of hardware resources or micro-architectural tuning
  - Architectural units (on/off chip) that are stressed


Hardware Performance Counters

- Architectural Features
  - Typically/mostly packaged inside the CPU
  - Count hardware events transparently without overhead
- Newer platforms also provide system counters
  - Network cards and switches
  - Environmental sensors
- Drawbacks
  - Availability differs between platforms & processors
  - Slight semantic differences between platforms
  - In some cases: requires privileged access & kernel patches
- Open|SpeedShop accesses them through PAPI
  - API for tools + simple runtime tools
  - Abstractions for system specific layers
  - More information: http://icl.cs.utk.edu/papi/


The O|SS HWC Experiments

- Provides access to hardware counters
  - Implemented on top of PAPI
  - Access to PAPI and native counters
  - Examples: cache misses, TLB misses, bus accesses
- Basic model 1: Timer Based Sampling: HWCsamp
  - Samples at a set sampling rate for the chosen events
  - Supports multiple counters
  - Lower statistical accuracy
  - Can be used to estimate a good threshold for hwc/hwctime
- Basic model 2: Thresholding: HWC and HWCtime
  - User selects one counter
  - Run until a fixed number of events have been reached
  - Take PC sample at that location
    - HWCtime also records the stacktrace
  - Reset number of events
  - Ideal number of events (threshold) depends on the application


Examples of Typical Counters

PAPI Name       Description                              Threshold
PAPI_L1_DCM     L1 data cache misses                     high
PAPI_L2_DCM     L2 data cache misses                     high/medium
PAPI_L1_DCA     L1 data cache accesses                   high
PAPI_FPU_IDL    Cycles in which FPUs are idle            high/medium
PAPI_STL_ICY    Cycles with no instruction issue         high/medium
PAPI_BR_MSP     Mispredicted branches                    medium/low
PAPI_FP_INS     Number of floating point instructions    high
PAPI_LD_INS     Number of load instructions              high
PAPI_VEC_INS    Number of vector/SIMD instructions       high/medium
PAPI_HW_INT     Number of hardware interrupts            low
PAPI_TLB_TL     Number of TLB misses                     low

Note: Threshold indications are just rough guidance and depend on the application.
Note: counters are platform dependent (use papi_avail & papi_native_avail)
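Once event totals are in hand, the useful numbers are usually ratios derived from pairs of counters. A sketch with invented counts (the PAPI event names are real; the numbers are not measurements):

```python
# Derive simple metrics from raw hardware counter totals, the kind
# of arithmetic you do by hand on hwcsamp output.
def l1_miss_rate(l1_dcm, l1_dca):
    """PAPI_L1_DCM / PAPI_L1_DCA: fraction of L1 data accesses that miss."""
    return l1_dcm / l1_dca

def ipc(tot_ins, tot_cyc):
    """PAPI_TOT_INS / PAPI_TOT_CYC: instructions retired per cycle."""
    return tot_ins / tot_cyc

# Made-up totals for one run, keyed by PAPI preset name.
counts = {"PAPI_L1_DCM": 5_000_000, "PAPI_L1_DCA": 250_000_000,
          "PAPI_TOT_INS": 9_000_000_000, "PAPI_TOT_CYC": 6_000_000_000}

print(l1_miss_rate(counts["PAPI_L1_DCM"], counts["PAPI_L1_DCA"]))  # 0.02
print(ipc(counts["PAPI_TOT_INS"], counts["PAPI_TOT_CYC"]))         # 1.5
```

A miss rate of 2% and an IPC of 1.5 would read as healthy; a low IPC together with a high miss rate points at memory-bound code.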


Recommendation: start with HWCsamp
- osshwcsamp "<command> <args>" [ default | <PAPI_event_list> | <sampling_rate> ]
  - Sequential job example:
    - osshwcsamp "smg2000"
  - Parallel job example:
    - osshwcsamp "mpirun –np 128 smg2000 –n 50 50 50" PAPI_L1_DCM,PAPI_L1_TCA 50
- default events: PAPI_TOT_CYC and PAPI_TOT_INS
- default sampling_rate: 100
- <PAPI_event_list>: Comma separated PAPI event list (maximum of 6 events that can be combined)
- <sampling_rate>: Integer value sampling rate
- Use event count values to guide selection of thresholds for the HWC, HWCtime experiments for deeper analysis


Selecting the Counters & Sampling Rate

- For osshwcsamp, Open|SpeedShop supports …
  - Derived and non-derived PAPI presets
    - All derived and non-derived events reported by "papi_avail"
    - Also reported by running "osshwcsamp" with no arguments
    - Ability to sample up to six (6) counters at one time; before use, test with papi_event_chooser PRESET
    - If a counter does not appear in the output, there may be a conflict in the hardware counters
  - All native events
    - Architecture specific (incl. naming)
    - Names listed in the PAPI documentation
    - Native events reported by "papi_native_avail"
- Sampling rate depends on the application
  - Overhead vs. accuracy
    - A lower sampling rate causes fewer samples


Possible HWC Combinations To Use*

Xeons
- PAPI_FP_INS,PAPI_LD_INS,PAPI_SR_INS (load/store info, memory bandwidth needs)
- PAPI_L1_DCM,PAPI_L1_TCA (L1 cache hit/miss ratios)
- PAPI_L2_DCM,PAPI_L2_TCA (L2 cache hit/miss ratios)
- LAST_LEVEL_CACHE_MISSES,LAST_LEVEL_CACHE_REFERENCES (L3 cache info)
- MEM_UNCORE_RETIRED:REMOTE_DRAM,MEM_UNCORE_RETIRED:LOCAL_DRAM (local/nonlocal memory access)

For Opterons only
- PAPI_FAD_INS,PAPI_FML_INS (floating point add/multiply)
- PAPI_FDV_INS,PAPI_FSQ_INS (sqrt and divisions)
- PAPI_FP_OPS,PAPI_VEC_INS (floating point and vector instructions)
- READ_REQUEST_TO_L3_CACHE:ALL_CORES,L3_CACHE_MISSES:ALL_CORES (L3 cache)

*Credit: Koushik Ghosh, LLNL

hwcsamp with miniFE (see mantevo.org)

- osshwcsamp "mpiexec –n 72 miniFE.X –nx 614 –ny 614 –nz 614" PAPI_DP_OPS,PAPI_L1_DCM,PAPI_TOT_CYC,PAPI_TOT_INS
- openss –f miniFE.x-hwcsamp.openss

Also have pcsamp information. Up to six events can be displayed; here we have 4.


Deeper Analysis with HWC and HWCtime
- osshwc[time] "<command> <args>" [ default | <PAPI_event> | <threshold> ]
  - Sequential job example:
    - osshwc[time] "smg2000 –n 50 50 50" PAPI_FP_OPS 50000
  - Parallel job example:
    - osshwc[time] "mpirun –np 128 smg2000 –n 50 50 50"
- default: event (PAPI_TOT_CYC), threshold (10000)
- <PAPI_event>: PAPI event name
- <threshold>: PAPI integer threshold
- NOTE: If the output is empty, try lowering the <threshold> value. There may not have been enough PAPI event occurrences to record and present


Viewing hwc Data

v hwc default view: Counter = Total Cycles

Flat hardware counter profile of a single hardware counter event. Exclusive counts only


Viewing hwctime Data

hwctime default view: Counter = L1 Cache Misses

Calling context hardware counter profile of a single hardware counter event. Exclusive/Inclusive counts



Performance Analysis related to application I/O activity


Need for Understanding I/O

- I/O could be a significant percentage of execution time, dependent upon:
  - Checkpoint, analysis output, visualization & I/O frequencies
  - I/O pattern in the application: N-to-1, N-to-N; simultaneous writes or requests
  - Nature of application: data intensive, traditional HPC, out-of-core
  - File system and striping: NFS, Lustre, Panasas, and # of Object Storage Targets (OSTs)
  - I/O libraries: MPI-IO, hdf5, PLFS, …
  - Other jobs stressing the I/O sub-systems
- Obvious candidates to explore first while tuning:
  - Use a parallel file system
  - Optimize for the I/O pattern
  - Match checkpoint I/O frequency to the MTBI of the system
  - Use appropriate libraries


Additional I/O analysis with O|SS
- I/O Tracing (io experiment)
  - Records each event in chronological order
  - Provides call path and time spent in I/O functions
- Extended I/O Tracing (iot experiment)
  - Records each event in chronological order
  - Collects additional information:
    - Function parameters
    - Function return value
  - When to use extended I/O tracing?
    - When you want to trace the exact order of events
    - When you want to see the return values or bytes read or written
    - When you want to see the parameters of the I/O call
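What an iot-style record per event looks like can be sketched by wrapping an I/O function ourselves. This stands in for the libc-level interception O|SS actually performs; the wrapper and record layout are invented for illustration:

```python
# Toy extended I/O trace: record, per call and in chronological
# order, the function name, its arguments, the return value
# (bytes written), and the time spent in the call.
import functools
import os
import tempfile
import time

trace = []  # one record per I/O event

def traced(func):
    @functools.wraps(func)
    def wrapper(*args):
        start = time.time()
        result = func(*args)
        trace.append({"call": func.__name__, "args": args,
                      "retval": result, "seconds": time.time() - start})
        return result
    return wrapper

write = traced(os.write)  # wrap the os-level write as a stand-in

fd, path = tempfile.mkstemp()
write(fd, b"hello ")
write(fd, b"world\n")
os.close(fd)
os.remove(path)

for event in trace:
    print(event["call"], event["retval"], "bytes")
```

The return value here is the byte count, which is exactly why the iot view can report min/max bytes read or written per call path.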


Running I/O Experiments

Offline io/iot experiment on the sweep3d application
Convenience script basic syntax:
  ossio[t] "<executable>" [ default | <I/O function list> ]
- Parameters
  - I/O function list to sample (default is all)
  - creat, creat64, dup, dup2, lseek, lseek64, open, open64, pipe, pread, pread64, pwrite, pwrite64, read, readv, write, writev
Examples:
  ossio "mpirun –np 256 sweep3d.mpi"
  ossiot "srun –N 1 –n 16 sweep3d.mpi" read,readv,write
  ossiot "mpirun –np 256 sweep3d.mpi" read,readv,write


I/O output via GUI

- I/O Default View for the IOR application, "io" experiment

Shows the aggregated time spent in the I/O functions traced during the application.


I/O output via GUI

- I/O Call Path View for the IOR application, "io" experiment

Shows the call paths to the I/O functions traced and the time spent along the paths.


I/O “iot” experiment output via GUI

- I/O Default View for the IOR application, "iot" experiment

Shows the min and max values for bytes read or written.


I/O "iot" experiment output via CLI

# Show the call paths in the application run that wrote the largest number of bytes.
# Using min_bytes instead would show the paths with the minimum number of bytes.

openss>>expview -vcalltrees,fullstack -m max_bytes

  Max_Bytes        Call Stack Function (defining location)
  Read/Written
                   _start (IOR)
                   > @ 562 in __libc_start_main (libmonitor.so.0.0.0: main.c,541)
                   >> @ 258 in __libc_start_main (libc-2.12.so: libc-start.c,96)
                   >>> @ 517 in monitor_main (libmonitor.so.0.0.0: main.c,492)
                   >>>> @ 153 in main (IOR: IOR.c,108)
                   >>>>> @ 2013 in TestIoSys (IOR: IOR.c,1848)
                   >>>>>> @ 2608 in WriteOrRead (IOR: IOR.c,2562)
                   >>>>>>> @ 244 in IOR_Xfer_POSIX (IOR: aiori-POSIX.c,224)
                   >>>>>>>> @ 321 in write (iot-collector-monitor-mrnet-mpi.so: wrappers.c,239)
  262144           >>>>>>>>> @ 82 in write (libc-2.12.so: syscall-template.S,82)



Performance Analysis related to application MPI activity


How can O|SS help for parallel jobs?

- O|SS is designed to work on parallel jobs
  - Support for threading and message passing
  - Automatically tracks all ranks and threads during execution
  - Records/stores performance info per process/rank/thread
- All experiments can be used on parallel jobs
  - O|SS applies the experiment collector to all ranks or threads on all nodes
- MPI specific tracing experiments
  - Tracing of MPI function calls (individual, all, or a specific group)
  - Four forms of MPI tracing experiments


Analysis of Parallel Codes

- By default, experiment collectors are run on all tasks
  - Automatically detect all processes and threads
  - Gathers and stores per thread data
- Viewing data from parallel codes
  - By default all values aggregated (summed) across all ranks
  - Manually include/exclude individual ranks/processes/threads
  - Ability to compare ranks/threads
- Additional analysis options
  - Load balance (min, max, average) across parallel executions
    - Across ranks for hybrid OpenMP/MPI codes
    - Focus on a single rank to see load balance across OpenMP threads
  - Cluster analysis (finding outliers)
    - Automatically creates groups of similar performing ranks or threads
    - Available from the Stats Panel toolbar or context menu
    - Note: can take a long time for large numbers of processors (current version)


Integration with MPI

- O|SS has been tested with a variety of MPIs
  - Including: Open MPI, MVAPICH[2], MPICH2 (Intel, Cray), MPT (SGI)
- Running O|SS experiments on MPI codes
  - Just use the convenience script corresponding to the data you want to gather, and put the command you use to run your application in quotes:
    - osspcsamp "mpirun –np 32 sweep3d.mpi"
    - ossio "srun –N 4 –n 16 sweep3d.mpi"
    - osshwctime "mpirun –np 128 sweep3d.mpi"
    - ossusertime "srun –N 8 –n 128 sweep3d.mpi"
    - osshwc "mpirun –np 128 sweep3d.mpi"


MPI Specific Tracing Experiments

- MPI function tracing
  - Record all MPI call invocations
  - Record call times and call paths (mpi)
    - Convenience script: ossmpi
  - Record call times, call paths and argument info (mpit)
    - Convenience script: ossmpit
- Public format:
  - Full MPI traces in Open Trace Format (OTF)
  - Experiment name: mpio
    - Convenience script: ossmpio


Running MPI Specific Experiments

Offline mpi/mpit/mpio experiment
Convenience script basic syntax:
  ossmpi[t][o] "<mpi executable syntax>" [ default | <MPI function list> | <mpi category> ]
- Parameters
  - Default is all MPI functions Open|SpeedShop traces
  - MPI function list to trace (comma separated)
    - MPI_Send, MPI_Recv, ….
  - mpi_category:
    - "all", "asynchronous_p2p", "collective_com", "datatypes", "environment", "graphs_contexts_comms", "persistent_com", "process_topologies", "synchronous_p2p"
Examples:
  ossmpi "srun –N 4 –n 32 smg2000 –n 50 50 50"
  ossmpi "mpirun –np 4000 nbody" MPI_Send,MPI_Recv


Identifying Load Imbalance With O|SS

- Get an overview of the application
  - Run one of these lightweight experiments
    - pcsamp, usertime, hwc
  - Use this information to verify performance expectations
- Use the load balance view on pcsamp, usertime, hwc
  - Look for performance values outside of the norm
    - Somewhat large difference for the min, max, average values
- Get an overview of the MPI functions used in the application
  - If the MPI libraries are showing up in the load balance for pcsamp, then do an MPI specific experiment
- Use the load balance view on the MPI experiment
  - Look for performance values outside of the norm
    - Somewhat large difference for the min, max, average values
  - Focus on the MPI functions to find potential problems
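The min/max/average computation behind the load balance view is simple to sketch; the per-rank times below are made up, and the `imbalance` ratio (max over average) is a common convention rather than a specific O|SS metric:

```python
# Toy load-balance summary over per-rank times: report min, max,
# average, which ranks hold the extremes, and a max/average ratio.
def load_balance(times_by_rank):
    values = list(times_by_rank.values())
    avg = sum(values) / len(values)
    rank_min = min(times_by_rank, key=times_by_rank.get)
    rank_max = max(times_by_rank, key=times_by_rank.get)
    return {"min": times_by_rank[rank_min], "min_rank": rank_min,
            "max": times_by_rank[rank_max], "max_rank": rank_max,
            "average": avg,
            "imbalance": times_by_rank[rank_max] / avg}

# Invented per-rank times: rank 3 is the straggler to investigate.
mpi_wait_times = {0: 10.1, 1: 10.4, 2: 9.9, 3: 19.6}
stats = load_balance(mpi_wait_times)
print(stats["max_rank"], stats["max"], round(stats["imbalance"], 2))
```

A max well above the average, as for rank 3 here, is exactly the "outside of the norm" signal the slide tells you to chase.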


Load Balance View: NPB: LU

- Load Balance View based on functions (pcsamp)

MPI library showing up high in the list. Max time in rank 255.

With the load balance view we are looking for performance numbers outside the norm of what is expected.

Large differences between min, max and/or average values.

Default Linked Object View: NPB: LU

v Default Aggregated View based on Linked Objects (libraries)

Linked Object View (library view): select "Linked Objects", click the D-icon

NOTE: MPI library consuming large portion of application run time


MPI Tracing Results: Default View v Default Aggregated MPI Experiment View Information Icon Displays Experiment Metadata

Aggregated Results


View Results: Show MPI Callstacks

Unique Call Paths View: click the C+ icon. Unique call paths to MPI_Waitall and other MPI functions.


Using Cluster Analysis in O|SS

- Can use with pcsamp, usertime, hwc
  - Will group like-performing ranks/threads into groups
  - Groups may identify outlier groups of ranks/threads
  - Can examine the performance of a member of the outlier group
  - Can compare that member with a member of an acceptably performing group
- Can use with mpi, mpit
  - Same functionality as above
  - But now focuses on the performance of individual MPI functions
  - Key functions are MPI_Wait, MPI_WaitAll
  - Can look at the call paths to the key functions to analyze why they are being called, to find performance issues
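The idea behind grouping like-performing ranks can be sketched with a crude outlier cut. Real O|SS cluster analysis uses a proper clustering algorithm; this mean/stddev threshold is a simplified stand-in, and the per-rank times are invented:

```python
# Toy outlier detection over per-rank times: flag ranks whose time
# is more than num_devs standard deviations from the mean.
import statistics

def find_outliers(times_by_rank, num_devs=2.0):
    values = list(times_by_rank.values())
    mean = statistics.mean(values)
    dev = statistics.pstdev(values)
    return sorted(rank for rank, t in times_by_rank.items()
                  if abs(t - mean) > num_devs * dev)

# 256 ranks at ~10s each, with one straggler, like rank 255 in the
# NPB LU example on the next slide set.
times = {rank: 10.0 for rank in range(256)}
times[255] = 30.0
print(find_outliers(times))  # [255]
```

Once an outlier group is identified, the natural next step from the slide applies: compare one of its members against a member of a normal group.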


Link. Obj. Cluster Analysis: NPB: LU

v Cluster Analysis View based on Linked Objects (libraries)

In the Cluster Analysis results, Rank 255 shows up as an outlier.



Comparing Performance Data



v Key functionality for any performance analysis
Ø Absolute numbers often don’t help
Ø Need some kind of baseline / number to compare against
v Typical examples
Ø Before/after optimization
Ø Different configurations or inputs
Ø Different ranks, processes or threads
v Very limited support in most tools
Ø Manual operation after multiple runs
Ø Requires lining up profile data
Ø Even harder for traces
v Open|SpeedShop has support to line up profiles
Ø Perform multiple experiments and create multiple databases
Ø Script to load all experiments and create multiple columns


Comparing Performance Data in O|SS

v Convenience Script: osscompare
Ø Compares up to eight Open|SpeedShop databases to each other
• Syntax: osscompare “db1.openss,db2.openss,…” [options]
• The osscompare man page has more details
Ø Produces a side-by-side comparison listing
Ø Metric option parameter:
• Compare based on: time, percent, a hardware counter, etc.
Ø Limit the number of lines with the “rows=nn” option
Ø Specify viewtype=[functions|statements|linkedobjects] to control the view granularity
• Compare at the function, statement, or library level; function level is the default
• By default the comparison is made across the performance of the functions in each database. If the statements option is specified, the comparisons are made by looking at the performance of each statement across all the specified databases; similarly for libraries when linkedobjects is selected as the viewtype parameter.
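The side-by-side listing osscompare produces can be approximated by hand from two per-run CSV exports (`expview -F csv`). The two files below are invented sample data keyed on function name; `join` lines the columns up the way osscompare's `-c 2` / `-c 4` columns do.

```shell
# Hypothetical function,time exports from two runs (pre-sorted on the join key)
cat > run1.csv <<'EOF'
hypre_SMGResidual,3.87
opal_progress,2.03
EOF
cat > run2.csv <<'EOF'
hypre_SMGResidual,3.63
opal_progress,0.15
EOF
# One row per function with the time from each run side by side
join -t, run1.csv run2.csv
```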


Comparison Report in O|SS

osscompare "smg2000-pcsamp.openss,smg2000-pcsamp-1.openss"
[openss]: Legend: -c 2 represents smg2000-pcsamp.openss
[openss]: Legend: -c 4 represents smg2000-pcsamp-1.openss

-c 2, Exclusive CPU  -c 4, Exclusive CPU  Function (defining location)
time in seconds.     time in seconds.
3.870000000          3.630000000          hypre_SMGResidual (smg2000: smg_residual.c,152)
2.610000000          2.860000000          hypre_CyclicReduction (smg2000: cyclic_reduction.c,757)
2.030000000          0.150000000          opal_progress (libopen-pal.so.0.0.0)
1.330000000          0.100000000          mca_btl_sm_component_progress (libmpi.so.0.0.2: topo_unity_component.c,0)
0.280000000          0.210000000          hypre_SemiInterp (smg2000: semi_interp.c,126)
0.280000000          0.040000000          mca_pml_ob1_progress (libmpi.so.0.0.2: topo_unity_component.c,0)



New functionality


Open|SpeedShop and CBTF

New functionality being worked on now.
v OpenMP-specific experiment enabled through CBTF
Ø Uses the OMPT interface being proposed as an OpenMP standard
v Intel MIC and ARM platform support for CBTF
v Filtering (data reduction, analysis) in the MRNet communication nodes
Ø Faster views as data is mined in parallel
v Changes in the CLI to handle the reduced performance data


What are the recent changes to O|SS

v New features and improvements already in O|SS
Ø OpenMP idle/wait time augmentation to the sampling experiments
Ø Added loop granularity displays
Ø View caching in the CLI and GUI
• CLI caching is cross-session; views are saved in the database
• GUI caching is only in the current session; also caches metadata requests
Ø Speed-ups in processing symbols when creating the database
Ø MPI File I/O function tracing
Ø Metric expression support in the CLI (create new data columns)
Ø Initial Spack build support
Ø Conversion to cmake builds for OpenSpeedShop and CBTF from-source builds


What are the recent changes to O|SS

v Component Based Tool Framework (CBTF)
Ø Roll-out of a new version of O|SS that uses a tree-based network (MRNet)
• Transfers data over the network; does not write files like the offline version
• Can do data reduction (in parallel) as the data is streamed up the tree
Ø Five new experiments implemented in this version
• Lightweight I/O profiling (iop)
• Lightweight MPI profiling (mpip)
• Threading experiments (pthreads)
• Memory usage analysis (mem)
• GPU/Accelerator support (cuda)



Advanced and New O|SS Functionality: Component Based Tool Framework based Open|SpeedShop


CBTF Structure and Dependencies

§ Tool-type independent framework
• Performance tools, debugging tools, etc.
§ Completed components
• Base library (libcbtf)
• XML-based component networks (libcbtf-xml)
• MRNet distributed components (libcbtf-mrnet)
• Dependencies: Qt, MRNet, Xerces-C, Boost; cbtf-gui, libcbtf-tcpip front ends
§ Ability for online aggregation of data from the MPI application
• Tree-based communication
• Online data filters
• Configurable



Open|SpeedShop and CBTF

v CBTF will be/is the new platform for O|SS
Ø More scalability
Ø Aggregation functionality for new experiments
Ø Built with a new “instrumentor class”
v Existing experiments will not change
Ø Same workflow
Ø Same convenience scripts
v New experiments enabled through CBTF
Ø Lighter-weight I/O profiling (iop)
Ø Lighter-weight MPI profiling (mpip)
Ø Threading experiments (pthreads)
Ø Memory usage analysis (mem)
Ø GPU/Accelerator support (cuda)


Additional Experiments in O|SS/CBTF

v POSIX thread tracing (pthreads)
Ø Records invocation of all POSIX thread events
Ø Provides aggregate and individual rank, thread, or process timings
v MPI tracing (mpip)
Ø Records invocation of all MPI routines
Ø Provides aggregate and individual rank, thread, or process timings
Ø Lightweight MPI profiling because it does not track individual call details
v Memory tracing (mem)
Ø Records invocation of key memory-related function call events
Ø Provides aggregate and individual rank, thread, or process timings
v Input/Output tracing (iop)
Ø Records invocation of all POSIX I/O events
Ø Provides aggregate and individual rank, thread, or process timings
Ø Lightweight I/O profiling (iop)
v CUDA GPU event tracing (cuda)
Ø Records CUDA events; provides timeline and event timings



Performance Analysis related to lightweight I/O and MPI


Additional I/O analysis with O|SS

v I/O Profiling (iop experiment) *OSS/CBTF only
Ø Lighter-weight I/O tracking experiment
Ø Traces I/O functions but only records the individual call paths, not each individual event with its call path (like usertime)
• No chronological list of every I/O event (call)
Ø Results in smaller database sizes and quicker viewing
• Example: (small IOR MPI 4-process run)
– IOR iot database size: 56320
– IOR iop database size: 51200
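A quick back-of-the-envelope check of the saving, using the two IOR database sizes quoted above:

```shell
# Percent reduction from the full-trace iot database to the call-path-only iop database
awk 'BEGIN{iot=56320; iop=51200; printf "%.1f%% smaller\n", 100*(iot-iop)/iot}'
```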


Running the I/O Profiling Experiment

iop experiment on the sweep3d application
Convenience script basic syntax:
ossiop “executable” [ default | <I/O function list> ]
Ø Parameters
• I/O function list to sample (default is all)
• creat, creat64, dup, dup2, lseek, lseek64, open, open64, pipe, pread, pread64, pwrite, pwrite64, read, readv, write, writev
Examples:
ossiop “mpirun –np 256 sweep3d.mpi”
ossiop “mpirun –np 256 sweep3d.mpi” read,readv,write


MPI Specific Tracing Experiment (mpip)

Lightweight MPI profiling (mpip)
v Traces MPI functions
Ø But only records the individual call paths
Ø Does not record each individual event (MPI call)
• Similar to usertime
v Smaller database files
Ø 1895424 Oct 22 14:19 smg2000-mpip-0.openss
Ø 57498624 Oct 22 14:20 smg2000-mpit-0.openss
v Faster viewing of MPI performance data
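The same arithmetic on the smg2000 sizes above shows how much more dramatic the saving is for MPI tracing:

```shell
# Ratio of the full-trace (mpit) database to the lightweight (mpip) one, sizes from this slide
awk 'BEGIN{mpit=57498624; mpip=1895424; printf "%.0fx smaller\n", mpit/mpip}'
```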


Running MPI Lightweight Experiment

OSS/CBTF mpip experiment
Convenience script basic syntax:
ossmpip “mpi executable syntax” [ default | <MPI function list> | <MPI category> ]
Ø Parameters
• Default is all MPI functions Open|SpeedShop traces
• MPI function list to trace (comma separated)
– MPI_Send, MPI_Recv, …
• mpi_category:
– "all", "asynchronous_p2p", "collective_com", "datatypes", "environment", "graphs_contexts_comms", "persistent_com", "process_topologies", "synchronous_p2p"
Examples:
ossmpip “srun –N 4 –n 32 smg2000 –n 50 50 50”
ossmpip “mpirun –np 4000 nbody” MPI_Send,MPI_Recv



Performance Analysis related to application memory function activity


Memory Hierarchy

v Memory hierarchy
Ø CPU registers and cache
Ø System RAM
Ø Online memory, such as disks, etc.
Ø Offline memory not physically connected to the system
Ø https://en.wikipedia.org/wiki/Memory_hierarchy
v What do we mean by memory?
Ø Memory an application requires from the system RAM
Ø Memory allocated on the heap by system calls, such as malloc and friends

How to Analyze the Performance of Parallel Codes 101 - A Tutorial at SC'15 Oct 27-28, 2015 83

Need for Understanding Memory Usage

v Memory leaks
Ø Is the application releasing memory back to the system?
v Memory footprint
Ø How much memory is the application using?
Ø Finding the high-water mark (HWM) of memory allocated
Ø Out-of-memory (OOM) potential
Ø Swap and paging issues
v Memory allocation patterns
Ø Memory allocations held longer than expected
Ø Allocations that consume large amounts of heap space
Ø Short-lived allocations
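The high-water-mark idea is easy to illustrate: track the running total of live bytes over a stream of allocate/free events and record its peak. The toy trace below is invented (it is not mem experiment output).

```shell
# Toy allocation trace: +bytes for an allocation, -bytes for a free -- made-up events
cat > memtrace.txt <<'EOF'
+1000
+4000
-1000
+2000
-4000
EOF
# Running total of live bytes; the peak of that total is the high-water mark (HWM)
awk '{cur+=$1; if(cur>hwm) hwm=cur} END{print "HWM =", hwm, "bytes"}' memtrace.txt
```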


Example Memory Heap Analysis Tools

v MemP is a parallel heap profiling library
Ø Requires MPI
Ø http://sourceforge.net/projects/memp
v Valgrind provides two heap profilers
Ø Massif is a heap profiler
• http://valgrind.org/docs/manual/ms-manual.html
Ø DHAT is a dynamic heap analysis tool
• http://valgrind.org/docs/manual/dh-manual.html
v Dmalloc - Debug Malloc Library
Ø http://dmalloc.com/
v Google PerfTools heap analysis and leak detection
Ø https://github.com/gperftools/gperftools


O|SS Memory Experiment

v Supports sequential, MPI, and threaded applications
Ø No instrumentation needed in the application
Ø Traces system calls via wrappers:
• malloc
• calloc
• realloc
• free
• memalign and posix_memalign
v Provides metrics for:
Ø Timeline of events that set a new high-water mark (HWM)
Ø List of allocation events (with calling context) that leak
Ø Overview of all unique call trees to traced memory calls, with the max and min allocation and the count of calls on each path
v Example usage
Ø ossmem "./lulesh2.0"
Ø ossmem "srun -N4 -n 64 ./sweep3d.mpi"


O|SS Memory Experiment

v Show the call tree for the allocation event that set the high-water mark

openss>>expview -v trace,calltrees -m start_time -m hwm -m allocation

Start Time of Event      HWM       Call Stack Function (defining location)
                                   _start (lulesh2.0)
                                   >__libc_start_main (libmonitor.so.0.0.0: main.c,541)
                                   >>__libc_start_main (libc-2.18.so)
                                   >>>monitor_main (libmonitor.so.0.0.0: main.c,492)
                                   >>>>main (lulesh2.0: lulesh.cc,2672)
                                   >>>>>GOMP_parallel_start (libgomp.so.1.0.0: parallel.c,123)
                                   >>>>>>gomp_new_team (libgomp.so.1.0.0: team.c,141)
                                   >>>>>>>gomp_malloc (libgomp.so.1.0.0: alloc.c,35)
2015/09/09 10:11:04.023  27016684  >>>>>>>>__libc_malloc (libc-2.18.so)


O|SS Memory Experiment

v Show the call tree to one memory leak

expview -v trace,calltrees -m leak -m allocation mem1

Total Allocation  Call Stack Function (defining location)
                  _start (lulesh2.0)
                  >__libc_start_main (libmonitor.so.0.0.0: main.c,541)
                  >>__libc_start_main (libc-2.18.so)
                  >>>monitor_main (libmonitor.so.0.0.0: main.c,492)
                  >>>>main (lulesh2.0: lulesh.cc,2672)
                  >>>>>GOMP_parallel_start (libgomp.so.1.0.0: parallel.c,123)
                  >>>>>>gomp_new_team (libgomp.so.1.0.0: team.c,141)
                  >>>>>>>gomp_malloc (libgomp.so.1.0.0: alloc.c,35)
10600684          >>>>>>>>__libc_malloc (libc-2.18.so)


O|SS Memory Experiment

v The next slide shows an overview of the unique call paths to memory calls.
v In this example:
Ø The top free and malloc calls are shown with the number of times each path was called.
v For the malloc call:
Ø The max and min allocation seen on this path are displayed.


O|SS Memory Experiment

expview -v trace,calltrees,fullstack -m path_counts -m max_allocation,min_allocation -m overview

Counts     Max         Min         Call Stack Function (defining location)
This Path  Allocation  Allocation
                                   _start (lulesh2.0)
                                   > @ 562 in __libc_start_main (libmonitor.so.0.0.0: main.c,541)
                                   >>__libc_start_main (libc-2.18.so)
                                   >>> @ 517 in monitor_main (libmonitor.so.0.0.0: main.c,492)
                                   >>>> @ 2392 in main (lulesh2.0: lulesh.cc,2672)
                                   >>>>> @ 2249 in EvalEOSForElems(Domain&, double*, int, int*, int) (lulesh2.0: lulesh.cc,2218)
32619                              >>>>>>__cfree (libc-2.18.so)
                                   _start (lulesh2.0)
                                   > @ 562 in __libc_start_main (libmonitor.so.0.0.0: main.c,541)
                                   >>__libc_start_main (libc-2.18.so)
                                   >>> @ 517 in monitor_main (libmonitor.so.0.0.0: main.c,492)
                                   >>>> @ 2392 in main (lulesh2.0: lulesh.cc,2672)
                                   >>>>> @ 2164 in EvalEOSForElems(Domain&, double*, int, int*, int) (lulesh2.0: lulesh.cc,2218)
                                   >>>>>> @ 125 in GOMP_parallel_start (libgomp.so.1.0.0: parallel.c,123)
                                   >>>>>>> @ 156 in gomp_new_team (libgomp.so.1.0.0: team.c,141)
                                   >>>>>>>> @ 37 in gomp_malloc (libgomp.so.1.0.0: alloc.c,35)
32619      2080        2080        >>>>>>>>>__libc_malloc (libc-2.18.so)


Summary and Conclusions

v Benefits of memory heap analysis
Ø Detect leaks
Ø Detect inefficient use of system memory
Ø Find potential OOM, paging, and swapping conditions
Ø Determine the memory footprint over the lifetime of the application run
v Observations about memory analysis tools
Ø Less concerned with the time spent in memory calls
Ø Emphasis is placed on the relationship of allocation calls to free calls
Ø Can generate large amounts of data
Ø Can slow down and impact the application while running



Performance Analysis related to application hardware accelerator activity


Emergence of HPC Heterogeneous Processing

v Heterogeneous computing refers to systems that use more than one kind of processor.

v What led to increased heterogeneous processing in HPC?
Ø Limits on the ability to continue to scale processor frequencies
Ø Power consumption hitting a realistic upper bound
Ø Programmability advances led to more widespread, general usage of the graphics processing unit (GPU)
Ø Advances in manycore, multi-core hardware technology (MIC)
v Heterogeneous accelerator processing: (GPU, MIC)
Ø Data-level parallelism
• Vector units, SIMD execution
• A single instruction operates on multiple data items
Ø Thread-level parallelism
• Multithreading, multi-core, manycore


Overview: Most Notable Hardware Accelerators

v GPU (Graphics Processing Unit)
Ø General-purpose computing on graphics processing units (GPGPU)
Ø Solves problems of the single-instruction, multiple-thread (SIMT) type
Ø Vectors of data where each element of the vector can be treated independently
Ø Offload model, where data is transferred into and out of the GPU
Ø Program using the CUDA language, OpenCL, or directive-based OpenACC
v Intel MIC (Many Integrated Cores)
Ø Has a less specialized architecture than a GPU
Ø Can execute parallel code written for:
• Traditional programming models, including POSIX threads and OpenMP
Ø Initially offload based (transfer data to and from the co-processor)
Ø Now/future: programs run natively


GPGPU Accelerator

GPU versus CPU comparison
v Different goals produce different designs
Ø GPU assumes the workload is highly parallel
Ø CPU must be good at everything, parallel or not

v CPU: minimize the latency experienced by one thread
Ø Big on-chip caches
Ø Sophisticated control logic

v GPU: maximize the throughput of all threads
Ø The number of threads in flight is limited by resources => lots of resources (registers, bandwidth, etc.)
Ø Multi-threading can hide latency => skip the big caches
Ø Shared control logic across many threads
*based on an NVIDIA presentation


GPGPU Accelerator

Mixing GPU and CPU usage in applications: multicore CPU plus manycore GPU

Data must be transferred between the CPU and the GPU in order for the GPU to operate on it and return the new values. *NVIDIA image


Heterogeneous Programming

v There are four main ways to use an accelerator
Ø Explicit programming:
• The programmer writes explicit instructions for the accelerator device to execute, as well as instructions for transferring data to and from the device (e.g. CUDA-C for GPUs or OpenMP+Cilk Plus for Phis). This method requires the most effort and knowledge from programmers because algorithms must be ported to and optimized on the accelerator device.
Ø Accelerator-specific pragmas/directives:
• Accelerator code is automatically generated from your serial code by a compiler (e.g. OpenACC, OpenMP 4.0). For many applications, adding a few lines of code (pragmas/directives) can result in good performance gains on the accelerator.
Ø Accelerator-enabled libraries:
• Only requires the use of the library; no explicit accelerator programming is necessary once the library has been written. The programmer effort is similar to using a non-accelerator-enabled scientific library.
Ø Accelerator-aware applications:
• These software packages have been programmed by other scientists/engineers/software developers to use accelerators and may require little or no programming by the end user.

Credit: http://www.hpc.mcgill.ca/index.php/starthere/81-doc-pages/255-accelerator-overview


Programming for GPGPU

Prominent models for programming the GPGPU augment current languages to access GPU strengths
v NVIDIA CUDA
Ø Scalable parallel programming model
Ø Minimal extensions to the familiar C/C++ environment
Ø Heterogeneous serial-parallel computing
Ø Supports NVIDIA only
v OpenCL (Open Computing Language)
Ø Open source, royalty-free
Ø Portable; can run on different types of devices
Ø Runs on AMD, Intel, and NVIDIA
v OpenACC
Ø Provides directives (hint commands inserted into source)
Ø Directives tell the compiler where to create acceleration (GPU) code without the user modifying or adapting the code


Optimal Heterogeneous Execution

GPGPU considerations for best performance
v How is the parallel scaling for the application overall?
v Can you balance the GPU and CPU workload?
Ø Keep both the GPU and CPU busy for best performance
v Is it profitable to send a piece of work to the GPU?
Ø What is the cost of the transfer of data to and from the GPU?
v How much work is there to be done inside the GPU?
Ø Will the work to be done fully populate and keep the GPU processors busy?
Ø Are there opportunities to chain together operations so the data can stay in the GPU for multiple operations?
v Is there a vectorization opportunity?

Intel MIC considerations for best performance
v The program should be heavily threaded
v Parallel scaling should be high with an OpenMP version


Accelerator Performance Monitoring

How can performance tools help optimize code?
v Is it profitable to send a piece of work to the GPU?
Ø Tools can tell you this by measuring the costs:
• Transferring data to and from the GPU
• How much time is spent in the GPU versus the CPU
v Is there a vectorization opportunity?
Ø Could measure the mathematical operations versus the vector operations occurring in the application
Ø Experiment with compiler optimization levels, re-measure the operations, and compare
v How is the parallel scaling for the application overall?
Ø Use a performance tool to get an idea of real performance versus the expected parallel speed-up
v Provide OpenMP programming-model insights into the source code
Ø Use OpenMP performance analysis to map performance issues to source code
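Once a tool has measured the individual costs, the "is it profitable" question is simple arithmetic. All numbers below are invented for illustration; in practice they would come from measured transfer and kernel timings like those the cuda experiment reports.

```shell
# Hypothetical measured costs in ms: host->device copy, kernel, device->host copy,
# and the CPU-only time for the same piece of work
awk 'BEGIN{h2d=4.0; kern=1.3; d2h=8.0; cpu=25.0;
           gpu=h2d+kern+d2h;
           if (gpu < cpu) print "offload profitable"; else print "keep on CPU"}'
```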


Open|SpeedShop accelerator support

What performance info does Open|SpeedShop provide?
v For GPGPU it reports information to help understand:
Ø Time spent in the GPU device
Ø Cost and size of data transferred to/from the GPU
Ø Balance of CPU versus GPU utilization
Ø Transfer of data between the host and device memory versus the execution of computational kernels
Ø Performance of the internal computational kernel code running on the GPU device
v Open|SpeedShop will be able to monitor CUDA scientific libraries because it operates on application binaries
v Support for CUDA-based applications is provided by tracing actual CUDA events
v OpenACC support is scheduled
v OpenCL support is also scheduled (after OpenACC)


Open|SpeedShop accelerator support

What performance info does Open|SpeedShop provide?
v For Intel MIC (non-offload model):
Ø Reports the same range of performance information that it does for CPU-based applications
Ø Open|SpeedShop will operate on MIC similarly to targeted platforms where the compute node processor is different from the front-end node processor
Ø Only non-offload support is in our current plans
Ø A specific OpenMP experiment is coming soon
• Will help to better support analysis of MIC applications
• OpenMP performance analysis is key to understanding performance


Running the cuda experiment

OSS/CBTF cuda experiment information and example
Convenience script basic syntax:
osscuda “executable”

Examples:
Sequential job example: osscuda “eigenvalues --matrix-size=4096”

Parallel job example: osscuda “mpirun -np 64 -npernode 1 lmp_linux -sf gpu < in.lj”


Open|SpeedShop CUDA CLI Views

>openss –cli -f BlackScholes-cuda-0.openss
Welcome to OpenSpeedShop 2.1
[openss]: The restored experiment identifier is: -x 1

openss>>expview
Exclusive Time (ms)  % of Total Exclusive Time  Exclusive Count  Function (defining location)
         170.478194                 100.000000              131  BlackScholesGPU(float*, float*, float*, float*, float*, float, float, int) (BlackScholes)

openss>>expview -v Exec,Trace
Start Time (d:h:m:s)     Exclusive Time (ms)  % of Total Excl Time  Grid Dims  Block Dims  Call Stack Function (defining location)
2014/12/04 00:47:18.139             1.305890              0.766016  31250,1,1     128,1,1  >BlackScholesGPU(float*, float*, float*, float*, float*, float, float, int) (BlackScholes)
2014/12/04 00:47:18.141             1.297314              0.760985  31250,1,1     128,1,1  >BlackScholesGPU(float*, float*, float*, float*, float*, float, float, int) (BlackScholes)
…
2014/12/04 00:47:18.805             1.300547              0.762882  31250,1,1     128,1,1  >BlackScholesGPU(float*, float*, float*, float*, float*, float, float, int) (BlackScholes)


CUDA CLI Views

openss>>expview -v Xfer,Trace
Start Time (d:h:m:s)     Exclusive Time (ms)  % of Total Exclusive Time      Size  Kind          Call Stack Function (defining location)
2014/12/04 00:47:18.127             4.060334                  14.766928  16000000  HostToDevice  >>cudart::cudaApiMemcpy(void*, void const*, unsigned long, cudaMemcpyKind) (BlackScholes)
2014/12/04 00:47:18.131             3.882354                  14.119637  16000000  HostToDevice  >>cudart::cudaApiMemcpy(void*, void const*, unsigned long, cudaMemcpyKind) (BlackScholes)
2014/12/04 00:47:18.135             3.893426                  14.159904  16000000  HostToDevice  >>cudart::cudaApiMemcpy(void*, void const*, unsigned long, cudaMemcpyKind) (BlackScholes)
2014/12/04 00:47:18.817             8.012517                  29.140524  16000000  DeviceToHost  >>cudart::cudaApiMemcpy(void*, void const*, unsigned long, cudaMemcpyKind) (BlackScholes)
2014/12/04 00:47:18.825             7.647501                  27.813007  16000000  DeviceToHost  >>cudart::cudaApiMemcpy(void*, void const*, unsigned long, cudaMemcpyKind) (BlackScholes)

openss>>expview -v Exec,Trace -m time,grid,block,rpt
Exclusive Time (ms)  Grid Dims  Block Dims  Registers Per Thread  Call Stack Function (defining location)
           1.305890  31250,1,1     128,1,1                    15  >BlackScholesGPU(float*, float*, float*, float*, float*, float, float, int) (BlackScholes)
           1.304930  31250,1,1     128,1,1                    15  >BlackScholesGPU(float*, float*, float*, float*, float*, float, float, int) (BlackScholes)
…
           1.295267  31250,1,1     128,1,1                    15  >BlackScholesGPU(float*, float*, float*, float*, float*, float, float, int) (BlackScholes)


CUDA Trace View (Chronological order)

openss>>expview -v trace
Start Time (d:h:m:s)     Exclusive Time (ms)  % of Total Call Time  Call Stack Function (defining location)
2013/08/21 18:31:21.611            11.172864              1.061071  copy 64 MB from host to device (CUDA)
2013/08/21 18:31:21.622             0.371616              0.035292  copy 2.1 MB from host to device (CUDA)
2013/08/21 18:31:21.623             0.004608              0.000438  copy 16 KB from host to device (CUDA)
2013/08/21 18:31:21.623             0.003424              0.000325  set 4 KB on device (CUDA)
2013/08/21 18:31:21.623             0.003392              0.000322  set 137 KB on device (CUDA)
2013/08/21 18:31:21.623             0.120896              0.011481  compute_degrees(int*, int*, int, int)<<<[256,1,1], [64,1,1]>>> (CUDA)
2013/08/21 18:31:21.623            13.018784              1.236375  QTC_device(float*, char*, char*…, int, bool)<<<[256,1,1], [64,1,1]>>> (CUDA)
2013/08/21 18:31:21.636             0.035232              0.003346  reduce_card_device(int*, int)<<<[1,1,1], [1,1,1]>>> (CUDA)
2013/08/21 18:31:21.636             0.002112              0.000201  copy 8 bytes from device to host (CUDA)
2013/08/21 18:31:21.636             1.375616              0.130640  trim_ungrouped_pnts_indr_array(int, int*, float*…, float, int, bool)<<<[1,1,1], [64,1,1]>>> (CUDA)
2013/08/21 18:31:21.638             0.001344              0.000128  copy 260 bytes from device to host (CUDA)
2013/08/21 18:31:21.638             0.025600              0.002431  update_clustered_pnts_mask(char*, char*, int)<<<[1,1,1], [64,1,1]>>> (CUDA)


Open|SpeedShop Accelerator support

What is the status of this Open|SpeedShop work?
v Completed, but not hardened:
Ø NVIDIA GPU performance data collection
Ø Packaging the data into the Open|SpeedShop database
Ø Retrieving the performance data and forming text-based views
Ø Basic initial support for Intel Phi (MIC) based applications
• Have gathered and displayed data on the MIC test bed at NERSC and at NASA on the maia Intel MIC platform
v Ongoing work (NASA FY15-FY16 SBIR)
Ø Pursue obtaining more information about the code executing inside the GPU device
Ø More research and hardening of support for Intel MIC
Ø Analysis of the performance data and more meaningful views for CUDA/GPU
Ø Research creating GUI-based views based on the initial command line interface (CLI) data views



Performance Analysis related to application POSIX thread activity


OSS/CBTF pthreads experiment

The “pthreads” experiment was created using the CBTF infrastructure
v Gives the opportunity to filter the POSIX thread performance information, reducing and mining the important/worthwhile information while the data is transferring to the client tool
Ø Discussion topic: what is that worthwhile information?
Ø Ideas:
• Report statistics about pthread waits
• Report OMP blocking times
• Attribute information to the proper threads
• Thread numbering improvements
– Use a shorter alias number for the long POSIX pthread numbers
• Report synchronization overhead mapped to the proper thread
v The slides that follow show what the tool provides currently


Running the pthreads experiment

OSS/CBTF pthreads experiment information and example
Convenience script basic syntax:
osspthreads “executable” [ default | <POSIX thread function list> ]
Ø Parameters
• POSIX thread function list to sample (default is all)
• pthread_create, pthread_mutex_init, pthread_mutex_destroy, pthread_mutex_lock, pthread_mutex_trylock, pthread_mutex_unlock, pthread_cond_init, pthread_cond_destroy, pthread_cond_signal, pthread_cond_broadcast, pthread_cond_wait, pthread_cond_timedwait
Examples:
osspthreads “aprun -n 64 -d 8 ./mpithreads_both”
osspthreads “mpirun –np 256 sweep3d.mpi” pthread_mutex_lock


Running the pthreads experiment

OSS/CBTF pthreads experiment default GUI view

Aggregated Time and Number of calls for the POSIX thread functions


Running the pthreads experiment

OSS/CBTF pthreads experiment callpath GUI view

Unique Call Paths to POSIX thread functions


Running the pthreads experiment

OSS/CBTF pthreads experiment loadbalance GUI view

Max, Min, Ave across all threads for all traced POSIX thread functions


Running the pthreads experiment

OSS/CBTF pthreads experiment butterfly GUI view

Caller, Pivot Function, and Callee information from Butterfly View



Section 3: Command Line Interface Usage


Command Line Interface (CLI) Usage

v Command Line Interface features
Ø A “gdb”-like tool for performance data creation and viewing
Ø Same functional capabilities as the graphical user interface (GUI)
• Exception: the GUI can focus on the source line corresponding to statistics
Ø List metadata about your application and the O|SS experiment
Ø Create experiments and run them
Ø Launch the GUI from the CLI via the “opengui” command
Ø View performance data from a database file
• openss –cli –f <database name> to launch
• expview – the key command, with many options
• list – list many items, such as source files, object files, and metrics to view
• expcompare – compare ranks, threads, processes to each other, and more
• cviewcluster – creates groups of like-performing entities to identify outliers
• cview – output the columns of data representing groups created by cviewcluster
• cviewinfo – output which ranks, threads, etc. are in each cview group


Command Line Interface (CLI) Usage

v Command Line Interface features (additional)
Ø Format the performance information view as CSV
• expview -F csv
Ø Selectively view, compare, or analyze by rank, thread, or process
• -r <rank, rank list, or rank ranges>
• -t <thread, thread list, or thread ranges>
• -p <process, process list, or process ranges>
Ø Selectively view, compare, or analyze by specific metrics
• -m <metric or list of metrics>
• Example metrics for sampling: percent, time
• Example metrics for tracing: time, count, percent
Ø Selectively view by specific view type options
• -v <view type or list of view types>
• Example view types: functions, statements, loops, linkedobjects


Command Line Interface (CLI) Usage

Command Line Interface Examples
v openss -cli -f <database file>
v Commands to get started
Ø expstatus
• Gives the metadata information about the experiment
Ø expview
• Displays the default view for the experiment
• Use expview <experiment type>nn to see only nn lines of output
– expview pcsamp20 shows only the top 20 most time-consuming functions
• -v functions : displays data at function-level granularity
• -v statements : displays data at statement-level granularity
• -v linkedobjects : displays data at library-level granularity
• -v loops : displays data at loop-level granularity
• -v calltrees : displays call paths, combining like paths
• -v calltrees,fullstack : displays all unique call paths individually
• -m loadbalance : displays the min, max, and average values across ranks, …


Command Line Interface (CLI) Usage

Command Line Interface Examples
v openss -cli -f <database file>
v Commands to get started
Ø expview (continued from previous page)
• -v trace : for tracing experiments, display a chronological list of events
• -m <metric> : only display the metric(s) provided via the -m option
– Where metric can be: time, percent, a hardware counter, … (see list -v metrics)
• -r <rank id or list of rank ids> : only display data for that rank or ranks
• -t <thread id or list of thread ids> : only display data for that thread or threads
• -h <host id or list of host ids> : only display data for that host or hosts
• -F csv : display performance data as a comma-separated list
Ø expcompare : compare data within the same experiment
• -r 1 -r 2 -m time : compare rank 1 to rank 2 for the time metric
• -h host1 -h host2 : compare host 1 to host 2 for the default metric
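Conceptually, expcompare shows the same metric side by side for two ranks. A minimal Python sketch of that idea (a hypothetical illustration, not the O|SS implementation; the function names and timings are invented):

```python
def compare_ranks(rank_a, rank_b):
    """Return rows of (function, time_a, time_b), largest time first."""
    functions = set(rank_a) | set(rank_b)
    rows = [(f, rank_a.get(f, 0.0), rank_b.get(f, 0.0)) for f in functions]
    rows.sort(key=lambda r: max(r[1], r[2]), reverse=True)
    return rows

# Invented per-function exclusive times (seconds) for two ranks.
rank1 = {"hypre_SMGResidual": 3.92, "opal_progress": 0.31}
rank2 = {"hypre_SMGResidual": 3.75, "opal_progress": 0.45}

for func, t1, t2 in compare_ranks(rank1, rank2):
    print(f"{t1:12.6f} {t2:12.6f}  {func}")
```

A large gap between the two columns for the same function is the signal expcompare is designed to surface.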


Command Line Interface (CLI) Usage

Command Line Interface Examples
v openss -cli -f <database file>
v Commands to get started
Ø list
• -v metrics : display the data types (metrics) that can be displayed via -m
• -v src : display source files associated with the experiment
• -v obj : display object files associated with the experiment
• -v ranks : display ranks associated with the experiment
• -v hosts : display machines associated with the experiment
• -v exp : display the experiment numbers that are currently loaded
• -v savedviews : display the commands that are cached in the database


Viewing hwcsamp data in CLI

openss -cli -f smg2000-hwcsamp-1.openss

# View the default report for this hwcsamp experiment
openss>>[openss]: The restored experiment identifier is: -x 1
openss>>expview

Exclusive CPU time in seconds  % of CPU Time  PAPI_TOT_CYC  PAPI_FP_OPS  Function (defining location)
3.920000000  44.697833523  11772604888  1198486900  hypre_SMGResidual (smg2000: smg_residual.c,152)
2.510000000  28.620296465   7478131309   812850606  hypre_CyclicReduction (smg2000: cyclic_reduction.c,757)
0.310000000   3.534777651    915610917    48863259  opal_progress (libopen-pal.so.0.0.0)
0.300000000   3.420752566    910260309   100529525  hypre_SemiRestrict (smg2000: semi_restrict.c,125)
0.290000000   3.306727480    874155835    48509938  mca_btl_sm_component_progress (libmpi.so.0.0.2)


Viewing hwcsamp data in CLI

# View the linked object (library) view for this hardware counter sampling experiment
openss>>expview -v linkedobjects

Exclusive CPU time in seconds  % of CPU Time  PAPI_TOT_CYC  PAPI_FP_OPS  LinkedObject
7.710000000   87.315968290  22748513124  2396367480  smg2000
0.610000000    6.908267271   1789631493   126423208  libmpi.so.0.0.2
0.310000000    3.510758777    915610917    48863259  libopen-pal.so.0.0.0
0.200000000    2.265005663    521249939    46127342  libc-2.10.2.so
8.830000000  100.000000000  25975005473  2617781289  Report Summary
openss>>
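The "% of CPU Time" column is simply each row's exclusive time divided by the Report Summary total. A quick Python check using the numbers from the view above:

```python
# Reproduce the "% of CPU Time" column from the linked-object view:
# each object's exclusive CPU time divided by the Report Summary total.
times = {
    "smg2000": 7.71,
    "libmpi.so.0.0.2": 0.61,
    "libopen-pal.so.0.0.0": 0.31,
    "libc-2.10.2.so": 0.20,
}
total = sum(times.values())  # 8.83 s, the Report Summary row
percent = {obj: 100.0 * t / total for obj, t in times.items()}
print(f"{percent['smg2000']:.6f}")  # ~87.315968, matching the view
```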


Viewing I/O (iot) data in CLI

# View the default I/O report for this I/O experiment
openss -cli -f sweep3d.mpi-iot.openss
openss>>[openss]: The restored experiment identifier is: -x 1
openss>>expview

I/O Call Time(ms)  % of Total Time  Number of Calls  Function (defining location)
1.241909  90.077151  36  __write (libpthread-2.17.so)
0.076653   5.559734   2  close (libpthread-2.17.so)
0.035452   2.571376   2  read (libpthread-2.17.so)
0.024703   1.791738   2  open64 (libpthread-2.17.so)

# View the default trace (chronological list of I/O function calls) for this I/O experiment
openss>>expview -v trace

Start Time  I/O Call Time(ms)  % of Total Time  Function Dependent Return Value  File/Path Name  Event Identifier(s)  Call Stack Function (defining location)
2014/08/17 09:25:44.368  0.012356  0.896196  13  input       0:140166697882752  >>>>>>>>>>open64 (libpthread-2.17.so)
2014/08/17 09:25:44.368  0.027694  2.008679      input       0:140166697882752  >>>>>>>>>>>>>read (libpthread-2.17.so)
2014/08/17 09:25:44.377  0.053832  3.904500   0  input       0:140166697882752  >>>>>>>>>close (libpthread-2.17.so)
2014/08/17 09:25:44.378  0.012347  0.895543  13  input       0:140166697882752  >>>>>>>>>>open64 (libpthread-2.17.so)
2014/08/17 09:25:44.378  0.007758  0.562697  53  input       0:140166697882752  >>>>>>>>>>>>>read (libpthread-2.17.so)
2014/08/17 09:25:44.378  0.022821  1.655235   0  input       0:140166697882752  >>>>>>>>>close (libpthread-2.17.so)
2014/08/17 09:25:44.378  0.037219  2.699539  62  /dev/pts/1  0:140166697882752  >>>>>>>>>>__write (libpthread-2.17.so)


Viewing I/O (iot) data in CLI

# View the list of metrics (types of performance information) for this I/O experiment

openss>>list -v metrics
iot::average
iot::count
iot::exclusive_details
iot::exclusive_times
iot::inclusive_details
iot::inclusive_times
iot::max
iot::min
iot::nsysarg
iot::pathname
iot::retval
iot::stddev
iot::syscallno
iot::threadAverage
iot::threadMax
iot::threadMin
iot::time
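Several of these metrics (count, min, max, average, stddev) are simple aggregates over the individual call times. A hedged sketch of what they summarize, using invented per-call times (the population form of stddev is assumed here):

```python
import statistics

# Invented per-call wall-clock times (ms) for one I/O function.
call_times_ms = [0.012356, 0.027694, 0.053832, 0.012347]

metrics = {
    "count":   len(call_times_ms),               # iot::count
    "min":     min(call_times_ms),               # iot::min
    "max":     max(call_times_ms),               # iot::max
    "average": statistics.mean(call_times_ms),   # iot::average
    "stddev":  statistics.pstdev(call_times_ms), # iot::stddev (population form assumed)
}
print(metrics["count"], metrics["max"])
```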


Viewing I/O (iot) data in CLI

# View in chronological trace order: start_time, time, and the rank:thread the event occurred in.

openss>>expview -m start_time,time,id -v trace

Start Time(d:h:m:s)  Exclusive I/O Call Time(ms)  Event Identifier(s)  Call Stack Function (defining location)
2014/08/17 09:25:44.368  0.012356  0:140166697882752  >>>>>>>>>>open64 (libpthread-2.17.so)
2014/08/17 09:25:44.368  0.027694  0:140166697882752  >>>>>>>>>>>>>read (libpthread-2.17.so)
2014/08/17 09:25:44.377  0.053832  0:140166697882752  >>>>>>>>>close (libpthread-2.17.so)
2014/08/17 09:25:44.378  0.012347  0:140166697882752  >>>>>>>>>>open64 (libpthread-2.17.so)
2014/08/17 09:25:44.378  0.007758  0:140166697882752  >>>>>>>>>>>>>read (libpthread-2.17.so)
2014/08/17 09:25:44.378  0.022821  0:140166697882752  >>>>>>>>>close (libpthread-2.17.so)
2014/08/17 09:25:44.378  0.037219  0:140166697882752  >>>>>>>>>>__write (libpthread-2.17.so)
2014/08/17 09:25:44.378  0.018545  0:140166697882752  >>>>>>>>>>__write (libpthread-2.17.so)
2014/08/17 09:25:44.378  0.019837  0:140166697882752  >>>>>>>>>>__write (libpthread-2.17.so)
2014/08/17 09:25:44.379  0.035047  0:140166697882752  >>>>>>>>>>__write (libpthread-2.17.so)
…
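For post-processing outside the CLI, a trace line like those above can be split into its fields. This parser is a rough sketch that assumes the whitespace-separated layout shown on the slide:

```python
def parse_trace_line(line):
    """Split one trace line into (start_time, time_ms, identifier, function)."""
    date, clock, time_ms, ident, rest = line.split(None, 4)
    func = rest.lstrip(">")  # drop the call-depth chevrons
    return f"{date} {clock}", float(time_ms), ident, func

line = ("2014/08/17 09:25:44.368 0.012356 0:140166697882752 "
        ">>>>>>>>>>open64 (libpthread-2.17.so)")
start, ms, ident, func = parse_trace_line(line)
print(func)  # open64 (libpthread-2.17.so)
```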


Viewing I/O (iot) data in CLI

# View the load balance (max, min, and average values) across all ranks, threads, or processes.

openss>>expview -m loadbalance

Max I/O Call Time Across Ranks(ms)  Rank of Max  Min I/O Call Time Across Ranks(ms)  Rank of Min  Average I/O Call Time Across Ranks(ms)  Function (defining location)
1.241909  0  1.241909  0  1.241909  __write (libpthread-2.17.so)
0.076653  0  0.076653  0  0.076653  close (libpthread-2.17.so)
0.035452  0  0.035452  0  0.035452  read (libpthread-2.17.so)
0.024703  0  0.024703  0  0.024703  open64 (libpthread-2.17.so)
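What the loadbalance view computes per function — min, max, and average of the metric across ranks, plus the rank holding each extreme — can be sketched as follows (the multi-rank timings are invented; the run above has only rank 0, so its max, min, and average coincide):

```python
def load_balance(times_by_rank):
    """times_by_rank: {rank: time_ms}. Return (max, rank_of_max, min, rank_of_min, avg)."""
    rmax = max(times_by_rank, key=times_by_rank.get)
    rmin = min(times_by_rank, key=times_by_rank.get)
    avg = sum(times_by_rank.values()) / len(times_by_rank)
    return times_by_rank[rmax], rmax, times_by_rank[rmin], rmin, avg

# Hypothetical __write times per rank.
write_times = {0: 1.241909, 1: 0.98, 2: 1.43}
print(load_balance(write_times))
```

A large max-to-min spread for one function is the classic sign of load imbalance worth investigating.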

# View data for only rank nn, in this case rank 0

openss>>expview -r 0

I/O Call Time(ms)  % of Total Time  Number of Calls  Function (defining location)
1.241909  90.077151  36  __write (libpthread-2.17.so)
0.076653   5.559734   2  close (libpthread-2.17.so)
0.035452   2.571376   2  read (libpthread-2.17.so)
0.024703   1.791738   2  open64 (libpthread-2.17.so)


Viewing I/O (iot) data in CLI

# View the top time-consuming call tree in this application run. "iot1" means show only one call stack; "iot<number>" shows that number of call trees.

openss>>expview -v calltrees,fullstack iot1

I/O Call Time(ms)  % of Total Time  Number of Calls  Call Stack Function (defining location)
_start (sweep3d.mpi)
> @ 562 in __libc_start_main (libmonitor.so.0.0.0: main.c,541)
>>__libc_start_main (libc-2.17.so)
>>> @ 517 in monitor_main (libmonitor.so.0.0.0: main.c,492)
>>>>0x4026a2
>>>>> @ 185 in MAIN__ (sweep3d.mpi: driver.f,1)
>>>>>> @ 41 in inner_auto_ (sweep3d.mpi: inner_auto.f,2)
>>>>>>> @ 128 in inner_ (sweep3d.mpi: inner.f,2)
>>>>>>>>_gfortran_st_write_done (libgfortran.so.3.0.0)
>>>>>>>>>0x308db56f
>>>>>>>>>>0x308e65af
>>>>>>>>>>>0x308df8c8
0.871600  63.218195  12  >>>>>>>>>>>>__write (libpthread-2.17.so)


Command Line Interface (CLI) Usage

Open|SpeedShop CLI output in CSV form for spreadsheet use.
v Create an experiment database
EXE=./a.out
export OPENSS_HWCSAMP_EVENTS="PAPI_VEC_DP,FP_COMP_OPS_EXE:SSE_FP_PACKED_DOUBLE"
osshwcsamp "/usr/bin/srun -N 1 -n 1 $EXE"
v Open the database file and use expview -F csv to create a CSV file
openss -cli -f a.out-hwcsamp.openss
openss>>expview -F csv > mydata.csv
v Create the CSV file using a script
echo "exprestore -f a.out-hwcsamp.openss" > ./cmd_file
echo "expview -F csv > mydata.csv" >> ./cmd_file
echo "exit" >> ./cmd_file
#
# Run the openss utility to output the CSV rows.
#
openss -batch < ./cmd_file
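Once mydata.csv exists, it can be post-processed with any CSV reader before (or instead of) importing it into a spreadsheet. A small sketch — the column names here are assumed for illustration, not the exact layout expview emits:

```python
import csv
import io

# Stand-in for the contents of mydata.csv; real expview output has more columns.
csv_text = """time,percent,function
3.92,44.697834,hypre_SMGResidual
2.51,28.620296,hypre_CyclicReduction
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
total = sum(float(r["time"]) for r in rows)
print(f"{total:.2f} s in the top {len(rows)} functions")
```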


GUI export of data to csv form

Getting Open|SpeedShop output into CSV form for spreadsheet use.

openss -f stream.x-hwcsamp-1.openss
Go to the Stats Panel menu
Select the "Report Export Data (csv)" option
In the dialog box, provide a meaningful name such as stream.x-hwcsamp-1.csv
The data will be saved to the file you specified above


Command Line Interface (CLI) Usage

Storing the CSV information in a spreadsheet
v Open your spreadsheet.
v Either
Ø Select Open from the pulldown menu and import stream.x-hwcsamp-1.csv, or
Ø Cut and paste the contents of stream.x-hwcsamp-1.csv into the spreadsheet
v Use the "Data" operation and then "Text to columns"
v Select "comma" to separate the columns
v Save the spreadsheet file


Command Line Interface (CLI) Usage

Plotting the performance info from the spreadsheet (LibreOffice instructions)
v Open your spreadsheet.
v Select Insert, then choose Chart
v Choose a chart type
v Choose data range choices (data series in columns)
v Choose data series (nothing to do here)
v The chart is created. Views can be changed.


How to Analyze the Performance of Parallel Codes with Open|SpeedShop

Section 4 Preferences


Changing Preferences in O|SS

Selecting File->Preferences to customize O|SS

Select Preferences from the File menu


Changing Preferences in O|SS

Enabling the new Command Line Interface (CLI) save/reuse views feature. Looking for friendly-user evaluation.

Click here to enable the save and reuse (cache) CLI views


How to Analyze the Performance of Parallel Codes with Open|SpeedShop

Section 5 What is New / Road Map / Future Work


What are the recent changes to O|SS
v New features and improvements in O|SS
Ø OpenMP idle/wait time augmentation to the sampling experiments
Ø Added loop-granularity displays
Ø View caching in the CLI and GUI
• CLI caching is cross-session; views are saved in the database
• GUI caching is only in the current session; it also caches metadata requests
Ø Speed-ups in processing symbols when creating the database
Ø MPI file I/O function tracing
Ø Metric expression support in the CLI (create new data columns)
Ø Initial Spack build support
Ø Conversion to CMake builds for OpenSpeedShop and CBTF from-source builds


What are the recent changes to O|SS
v Component Based Tool Framework (CBTF)
Ø Roll-out of a new version of O|SS that uses a tree-based network (MRNet)
• Transfers data over the network; does not write files like the offline version
• Can do data reduction (in parallel) as the data is streamed up the tree
Ø Five new experiments implemented in this version
• Lightweight I/O profiling (iop)
• Lightweight MPI profiling (mpip)
• Threading experiments (pthreads)
• Memory usage analysis (mem)
• GPU/Accelerator support (cuda)


Road Map: O|SS in Transition

v Largest bottleneck in existing O|SS
Ø Data handling from instrumentor to analysis
Ø Need a hierarchical approach to conquer large scale
Ø A generic requirement for a wide range of tools
v Component Based Tool Framework (CBTF) project
Ø Generic tool and data transport framework
• Create "black box" components
• Connect these components into a component network
• Distribute these components and/or component networks across a tree-based transport mechanism (MRNet)
Ø Goal: reuse of components
• Aimed at giving the ability to rapidly prototype performance tools
• Create specialized tools to meet the needs of application developers
Ø http://sourceforge.net/p/cbtf/wiki/Home/


CBTF Structure and Dependencies

§ Tool-type independent
• Performance Tools
• Debugging Tools
• etc.
§ Completed components
• Base library (libcbtf)
• XML-based component networks (libcbtf-xml)
• MRNet distributed components (libcbtf-mrnet)
§ Ability for online aggregation
• Tree-based communication
• Online data filters
• Configurable

[Diagram: cbtf-gui and tool clients built atop libcbtf-tcpip, libcbtf-mrnet, libcbtf-xml, and libcbtf, with Qt, MRNet, Xerces-C, and Boost as dependencies, connected to the MPI application]



Open|SpeedShop and CBTF

v CBTF will be/is the new platform for O|SS
Ø More scalability
Ø Aggregation functionality for new experiments
Ø Built with a new "instrumentor class"
v Existing experiments will not change
Ø Same workflow
Ø Same convenience scripts
v New experiments enabled through CBTF
Ø Lighter-weight I/O profiling (iop)
Ø Lighter-weight MPI profiling (mpip)
Ø Threading experiments (pthreads)
Ø Memory usage analysis (mem)
Ø GPU/Accelerator support (cuda)


Open|SpeedShop and CBTF

New functionality being worked on now.
v An OpenMP-specific experiment enabled through CBTF
Ø Uses the OMPT interface being proposed as an OpenMP standard
v Intel MIC and ARM platform support for CBTF
v Filtering (data reduction, analysis) in the MRNet communication nodes
Ø Faster views as data is mined in parallel
v Changes in the CLI to handle the reduced performance data


New GUI – not funded

v Modular GUI framework based on Qt4
Ø Base for multiple tools
Ø Ability for remote connections


Open|SpeedShop and CBTF

Ideas for future improvements
v Power monitoring through PAPI
v Possible NUMA experiment
v Filtering for existing experiments to reduce data and focus on interesting events
v Program logical phase markers with the capability to show iteration information


Sandia chama/skybridge Availability

v OSS/Offline version
Ø tools/openss-openmpi
• For MPI experiments when the application is built with OpenMPI
Ø tools/openss-mvapich2
• For MPI experiments when the application is built with MVAPICH2
v OSS/CBTF version
Ø tools/oss-cbtf-openmpi
• For MPI experiments when the application is built with OpenMPI


LANL Availability

v OSS/Offline version
Ø module load friendly-testing openspeedshop/2.2

v OSS/CBTF version
Ø ?


LLNL Availability
v Open|SpeedShop system-wide dotkit files
v OSS/Offline version (default version)
Ø use openss
Ø use openss-mvapich2*

v OSS/CBTF version (experimental version)
Ø use cbtf
Ø use cbtf-mvapich2*

*For MPI-specific experiments (mpi, mpit, mpiotf) when the application is built to run with mvapich2. The default version is set to mvapich.


Availability
v Current version: 2.2 has been released
v Open|SpeedShop website
Ø http://www.openspeedshop.org/
v Open|SpeedShop help and bug reporting
Ø Direct email: [email protected]
Ø Forum/Group: oss-ques[email protected]
v Feedback
Ø Bug tracking available from the website
Ø Feel free to contact the presenters directly
Ø Support contracts and onsite training available


Open|SpeedShop Documentation

v Build and Installation Instructions
Ø http://www.openspeedshop.org/documentation
• Look for: Open|SpeedShop Version 2.2 Build/Install Guide
v Open|SpeedShop User Guide Documentation
Ø http://www.openspeedshop.org/documentation
• Look for: Open|SpeedShop Version 2.2 Users Guide
v Man pages: OpenSpeedShop, osspcsamp, ossmpi, …
v Quick start guide downloadable from the website
Ø http://www.openspeedshop.org
Ø Click on the "Download Quick Start Guide" button
