How to Analyze the Performance of Parallel Codes with Open|SpeedShop

Jim Galarowicz, Krell Institute; Don Maghrak, Krell Institute

LLNL-PRES-503451

Oct 27-28, 2015

Presenters and Extended Team

- Jim Galarowicz, Krell
- Donald Maghrak, Krell

Open|SpeedShop extended team:
- William Hachfeld, Dave Whitney, Dane Gardner: Argo Navis/Krell
- Jennifer Green, David Montoya, Mike Mason, David Shrader: LANL
- Martin Schulz, Matt Legendre and Chris Chambreau: LLNL
- Mahesh Rajan, Anthony Agelastos: SNL
- Dyninst group (Bart Miller: UW & Jeff Hollingsworth: UMD)
- Phil Roth, Mike Brim: ORNL
- Ciera Jaspan: CMU

How to Analyze the Performance of Parallel Codes with Open|SpeedShop Oct 27-28, 2015 2

Outline

- Welcome
- Introduction to Open|SpeedShop (O|SS) tools
- Overview of performance experiments
  - How to use Open|SpeedShop to gather and display
  - Sampling experiments
  - Tracing experiments
  - How to compare performance data for different application runs
- Advanced and new functionality
  - Component Based Tool Framework (CBTF) based O|SS
    - New lightweight iop and mpip experiments
    - New: memory experiment
    - New: CUDA/GPU experiment
    - New: pthreads experiment
- What is new / Roadmap / Future plans
- Supplemental, or interest
  - Command Line Interface (CLI) tutorial and examples



Section 1: Introduction to Tools and Open|SpeedShop


Open|SpeedShop Tool Set

- Open Source Performance Analysis Tool Framework
  - Most common performance analysis steps all in one tool
  - Combines tracing and sampling techniques
  - Extensible by plugins for data collection and representation
  - Gathers and displays several types of performance information
- Flexible and easy to use
  - User access through: GUI, Command Line, Python Scripting, convenience scripts
- Scalable data collection
  - Instrumentation of unmodified application binaries
  - New option for hierarchical online data aggregation
- Supports a wide range of systems
  - Extensively used and tested on a variety of clusters
  - Cray, Blue Gene, ARM, MIC, GPU support


Open|SpeedShop Workflow

Run normally:       srun –n4 –N1 smg2000 –n 65 65 65
Run under O|SS:     osspcsamp "srun –n4 –N1 smg2000 –n 65 65 65"
O|SS then performs post-mortem analysis of the MPI application.

http://www.openspeedshop.org/


Alternative Interfaces

- Scripting language
  - Immediate command interface
  - O|SS interactive command line (CLI): openss -cli
  - Experiment commands: expView, expCompare, expStatus
  - List commands: list -v exp, list -v hosts, list -v src
  - Session commands: openGui
- Python module

      import openss
      my_filename = openss.FileList("myprog.a.out")
      my_exptype = openss.ExpTypeList("pcsamp")
      my_id = openss.expCreate(my_filename, my_exptype)
      openss.expGo()
      my_metric_list = openss.MetricList("exclusive")
      my_viewtype = openss.ViewTypeList("pcsamp")
      result = openss.expView(my_id, my_viewtype, my_metric_list)


Central Concept: Experiments
- Users pick experiments:
  - What to measure and from which sources?
  - How to select, view, and analyze the resulting data?
- Two main classes:
  - Statistical Sampling
    - Periodically interrupt execution and record location
    - Useful to get an overview
    - Low and uniform overhead
  - Event Tracing
    - Gather and store individual application events
    - Provides detailed per event information
    - Can lead to huge data volumes
- O|SS can be extended with additional experiments
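The statistical-sampling idea is easy to sketch in plain Python: a timer interrupts execution at a fixed interval and the handler records where the program currently is. This is an illustrative toy, not how the O|SS pcsamp collector is implemented (it samples the program counter of unmodified binaries); all names below are invented.

```python
# Toy statistical sampler: a periodic timer signal records the
# currently executing function and line, like pcsamp's PC samples.
import collections
import signal
import time

samples = collections.Counter()

def record_sample(signum, frame):
    # Record the current "location": function name and line number.
    samples[(frame.f_code.co_name, frame.f_lineno)] += 1

def busy_work(deadline):
    x = 0.0
    while time.time() < deadline:
        x += 1.0  # hot loop the sampler should attribute time to
    return x

signal.signal(signal.SIGALRM, record_sample)
signal.setitimer(signal.ITIMER_REAL, 0.01, 0.01)  # ~100 samples/second
busy_work(time.time() + 0.3)
signal.setitimer(signal.ITIMER_REAL, 0, 0)        # stop sampling

# Aggregating the locations gives a flat overview, as pcsamp does.
for (func, line), count in samples.most_common(3):
    print(func, line, count)
```

Note how the overhead is low and uniform: work is only done at each timer tick, regardless of what the application is doing.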


Sampling Experiments in O|SS
- PC Sampling (pcsamp)
  - Record PC repeatedly at user defined time interval
  - Low overhead overview of time distribution
  - Good first step, lightweight overview
- Call Path Profiling (usertime)
  - PC sampling and call stacks for each sample
  - Provides inclusive and exclusive timing data
  - Use to find hot call paths, who is calling whom
- Hardware Counters (hwc, hwctime, hwcsamp)
  - Provides profile of hardware counter events like cache & TLB misses
  - hwcsamp:
    - Periodically sample to capture profile of the code against the chosen counter
    - Default events are PAPI_TOT_INS and PAPI_TOT_CYC
  - hwc, hwctime:
    - Sample a hardware counter until a certain number of events (called threshold) is recorded, and get the call stack
    - Default event is PAPI_TOT_CYC overflows


Tracing Experiments in O|SS

- Input/Output Tracing (io, iot)
  - Record invocation of all POSIX I/O events
  - Provides aggregate and individual timings
  - Store function arguments and return code for each call (iot)
- MPI Tracing (mpi, mpit, mpio)
  - Record invocation of all MPI routines
  - Provides aggregate and individual timings
  - Store function arguments and return code for each call (mpit)
  - Create Open Trace Format (OTF) output (mpio)
- Floating Point Exception Tracing (fpe)
  - Triggered by any FPE caused by the application
  - Helps pinpoint numerical problem areas


Performance Analysis in Parallel
- How to deal with concurrency?
  - Any experiment can be applied to a parallel application
    - Important step: aggregation or selection of data
  - Special experiments targeting parallelism/synchronization
- O|SS supports MPI and threaded codes
  - Automatically applied to all tasks/threads
  - Default views aggregate across all tasks/threads
  - Data from individual tasks/threads available
  - Thread support (incl. OpenMP) based on POSIX threads
- Specific parallel experiments (e.g., MPI)
  - Wraps MPI calls and reports
    - MPI routine time
    - MPI routine parameter information
  - The mpit experiment also stores function arguments and return code for each call


How to Run a First Experiment in O|SS?

1. Picking the experiment
   - What do I want to measure?
   - We will start with pcsamp to get a first overview
2. Launching the application
   - How do I control my application under O|SS?
   - Enclose how you normally run your application in quotes
   - osspcsamp "mpirun –np 256 smg2000 –n 65 65 65"
3. Storing the results
   - O|SS will create a database
   - Name: smg2000-pcsamp.openss
4. Exploring the gathered data
   - How do I interpret the data?
   - O|SS will print a default report
   - Open the GUI to analyze data in detail (run: "openss")


Example Run with Output (1 of 2)

osspcsamp "mpirun –np 4 smg2000 –n 65 65 65"

  Bash> osspcsamp "mpirun -np 4 ./smg2000 -n 65 65 65"
  [openss]: pcsamp experiment using the pcsamp experiment default sampling rate: "100".
  [openss]: Using OPENSS_PREFIX installed in /opt/ossoffv2.1u4
  [openss]: Setting up offline raw data directory in /opt/shared/offline-oss
  [openss]: Running offline pcsamp experiment using the command:
  "mpirun -np 4 /opt/ossoffv2.1u4/bin/ossrun "./smg2000 -n 65 65 65" pcsamp"

  Running with these driver parameters:
    (nx, ny, nz) = (65, 65, 65)
  … …
  Final Relative Residual Norm = 1.774415e-07
  [openss]: Converting raw data from /opt/shared/offline-oss into temp file X.0.openss

  Processing raw data for smg2000
  Processing processes and threads ...
  Processing performance data ...
  Processing functions and statements ...
  Resolving symbols for /home/jeg/DEMOS/workshop_demos/mpi/smg2000/test/smg2000


Example Run with Output (2 of 2)

osspcsamp "mpirun –np 4 smg2000 –n 65 65 65"

  [openss]: Restoring and displaying default view for:
  /home/jeg/DEMOS/workshop_demos/mpi/smg2000/test/smg2000-pcsamp.openss
  [openss]: The restored experiment identifier is: -x 1

  Exclusive CPU time   % of CPU Time   Function (defining location)
  in seconds.
  7.870000   43.265531   hypre_SMGResidual (smg2000: smg_residual.c,152)
  4.390000   24.134140   hypre_CyclicReduction (smg2000: cyclic_reduction.c,757)
  1.090000    5.992303   mca_btl_vader_check_fboxes (libmpi.so.1.4.0: btl_vader_fbox.h,108)
  0.510000    2.803738   unpack_predefined_data (libopen-pal.so.6.1.1: opal_datatype_unpack.h,41)
  0.380000    2.089060   hypre_SemiInterp (smg2000: semi_interp.c,126)
  0.360000    1.979109   hypre_SemiRestrict (smg2000: semi_restrict.c,125)
  0.350000    1.924134   __memcpy_ssse3_back (libc-2.17.so)
  0.310000    1.704233   pack_predefined_data (libopen-pal.so.6.1.1: opal_datatype_pack.h,38)
  0.210000    1.154480   hypre_SMGAxpy (smg2000: smg_axpy.c,27)
  0.140000    0.769654   hypre_StructAxpy (smg2000: struct_axpy.c,25)
  0.110000    0.604728   hypre_SMGSetStructVectorConstantValues (smg2000: smg.c,379)
  ....

View with GUI: openss –f smg2000-pcsamp.openss


Default Output Report View

Performance data default view: by Function (data is summed over all processes and threads). Toolbar to switch views: select "Functions", click the D-icon.

Graphical Representation


Statement Report Output View

Performance data view choice: Statements. Select "Statements", click the D-icon.

Statement in Program that took the most time


Associate Source & Performance Data

Double click to open the source window. Use window controls to split/arrange windows.

Selected performance data point


Library (LinkedObject) View

Select LinkedObject View type and Click on D-icon

Shows time spent in libraries. Can indicate imbalance. Libraries in the application


Loop View

Select Loops View type and Click on D-icon

Shows time spent in loops. Statement number of start of loop.


First Experiment Run: Summary

- Place the way you run your application normally in quotes and pass it as an argument to osspcsamp, or any of the other experiment convenience scripts: ossio, ossmpi, etc.
  - osspcsamp "srun –N 8 –n 64 ./mpi_application app_args"
- Open|SpeedShop sends a summary profile to stdout
- Open|SpeedShop creates a database file
- Display alternative views of the data with the GUI via:
  - openss –f <database file>
- Display alternative views of the data with the CLI via:
  - openss –cli –f <database file>
- On clusters, need to set OPENSS_RAWDATA_DIR
  - Should point to a directory in a shared file system
  - Usually set/handled in a module or dotkit file
- Start with pcsamp for an overview of performance
- Then, focus on performance issues with other experiments



Section 2: Overview of the O|SS experiment functionality


Identifying Critical Regions

Flat Profile Overview
- Profiles show computationally intensive code regions
  - First views: time spent per function or per statement
- Questions:
  - Are those functions/statements expected?
  - Do they match the computational kernels?
  - Any runtime functions taking a lot of time?
- Identify bottleneck components
  - View the profile aggregated by shared objects (LinkedObject view)
  - Correct/expected modules?
  - Impact of support and runtime libraries



Call Path Profiling (usertime)


Call stack profiling

- Call Stack Profiling
  - Take a sample: an address inside a function
  - Call stack: series of program counter addresses (PCs)
  - Unwinding the stack is walking through those addresses and recording that information for symbol resolution later
  - The leaf function is at the end of the call stack list
- Open|SpeedShop: experiment called usertime
  - Time spent inside a routine vs. its children
  - Key view: butterfly
- Comparisons
  - Between experiments to study improvements/changes
  - Between ranks/threads to understand differences/outliers
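Unwinding can be illustrated in Python, where frame objects stand in for the raw PC addresses O|SS records (symbol resolution happens later in the real tool); the function names below are invented:

```python
# Walk the chain of caller frames at a sample point, leaf first,
# mimicking what stack unwinding at a sample produces.
import inspect

def unwind():
    """Return the current call stack as a list of names, leaf first."""
    stack = []
    frame = inspect.currentframe().f_back  # skip unwind() itself
    while frame is not None:
        stack.append(frame.f_code.co_name)
        frame = frame.f_back
    return stack

def leaf():
    return unwind()   # pretend the sample fires here: leaf of the stack

def middle():
    return leaf()

def top():
    return middle()

print(top())  # leaf function first, then its callers
```

The leaf function appears first, exactly as the slide describes: the sampled location sits at the end of the call chain.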


Adding Context through Stack Traces

(Figure: call tree of Functions A through E)
- Missing information in flat profiles
  - Distinguish routines called from multiple callers
  - Understand the call invocation history
  - Context for performance data
- Critical technique: Stack traces
  - Gather stack trace for each performance sample
  - Aggregate only samples with equal trace
- User perspective:
  - Butterfly views (caller/callee relationships)
  - Hot call paths
    - Paths through the application that take the most time


Inclusive vs. Exclusive Timing

(Figure: call tree of Functions A through E; inclusive time for C, exclusive time for B)
- Stack traces enable calculation of inclusive/exclusive times
  - Time spent inside a function only (exclusive), see Function B
  - Time spent inside a function and its children (inclusive), see Function C and children
- Implementation similar to flat profiles
  - Sample PC information
  - Additionally collect call stack information at every sample
- Tradeoffs
  - Pro: Obtain additional context information
  - Con: Higher overhead / lower sampling rate
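The inclusive/exclusive bookkeeping can be sketched as follows; the sample paths and function names are made up, and each sample is assumed to represent one fixed sampling period:

```python
# Compute exclusive and inclusive times from stack samples:
# exclusive time is charged to the leaf only, inclusive time to
# every function on the sampled path.
from collections import Counter

SAMPLE_PERIOD = 0.01  # seconds represented by each sample

# Each sample: a call path from the root down to the sampled leaf.
samples = [
    ["main", "solve", "residual"],
    ["main", "solve", "residual"],
    ["main", "solve", "interp"],
    ["main", "solve"],
]

exclusive = Counter()
inclusive = Counter()
for path in samples:
    exclusive[path[-1]] += SAMPLE_PERIOD   # leaf only
    for func in set(path):                 # each frame on the path once
        inclusive[func] += SAMPLE_PERIOD

print(round(inclusive["main"], 2))   # 0.04: main is on every path
print(round(exclusive["solve"], 2))  # 0.01: solve was the leaf once
print(round(inclusive["solve"], 2))  # 0.04: solve is on every path
```

Note how `main` has almost no exclusive time but full inclusive time: the signal that its children are where the work happens.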


Interpreting Call Context Data

- Inclusive versus exclusive times
  - If similar: child executions are insignificant
    - May not be useful to profile below this layer
  - If inclusive time significantly greater than exclusive time:
    - Focus attention on the execution times of the children
- Hotpath analysis
  - Which paths take the most time?
  - Path time might be ok & expected, but could point to a problem
- Butterfly analysis (similar to gprof)
  - Could be done on "suspicious" functions
    - Functions with large execution time
    - Functions with large difference between inclusive and exclusive time
    - Functions of interest
    - Functions that "take unexpectedly long"
    - …
  - Shows split of time in callees and callers
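A butterfly view for one pivot function can be sketched from the same stack samples used above: split the pivot's samples among its direct callers and direct callees. All names here are illustrative, not O|SS internals:

```python
# Toy butterfly view: for a chosen pivot function, count how often
# each direct caller invoked it and each direct callee was invoked
# by it, across a set of sampled call paths (root first).
from collections import Counter

samples = [  # one time unit per sample
    ["main", "cycle", "smg_solve", "residual"],
    ["main", "cycle", "smg_solve", "residual"],
    ["main", "cycle", "smg_solve", "restrict"],
    ["main", "setup", "smg_solve"],
]

def butterfly(samples, pivot):
    callers, callees = Counter(), Counter()
    for path in samples:
        if pivot not in path:
            continue
        i = path.index(pivot)
        if i > 0:
            callers[path[i - 1]] += 1     # who called the pivot
        if i + 1 < len(path):
            callees[path[i + 1]] += 1     # whom the pivot called
    return callers, callees

callers, callees = butterfly(samples, "smg_solve")
print(dict(callers))   # {'cycle': 3, 'setup': 1}
print(dict(callees))   # {'residual': 2, 'restrict': 1}
```

Reading it: most of the pivot's time comes in via `cycle`, and most of it flows down into `residual`, which is where to look next.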


In/Exclusive Time in O|SS: Usertime

Basic syntax: ossusertime "how you run your executable normally"

Examples:
  ossusertime "smg2000 –n 50 50 50"
  ossusertime "smg2000 –n 50 50 50" low
- Parameters
  - Sampling frequency (samples per second)
  - Alternative parameter: high (70) | low (18) | default (35)

Recommendation: compile code with –g to get statements!


Reading Inclusive/Exclusive Timings

- Default View
  - Similar to pcsamp view from first example
  - Calculates inclusive versus exclusive times
(Screenshot columns: Exclusive Time, Inclusive Time)


Stack Trace Views: Hot Call Path

Hot Call Path

Access to call paths:
- All call paths (C+)
- All call paths for selected function (C↓)


Stack Trace Views: Butterfly View

v Similar to well known “gprof” tool

Callers of “hypre_SMGSolve”

Pivot routine "hypre_SMGSolve"; callees of "hypre_SMGSolve"



Performance Analysis related to accessing Hardware Counter Information


Identify architectural impact on code inefficiencies

- Timing information shows where you spend your time
  - Hot functions / statements / libraries
  - Hot call paths
- BUT: It doesn't show you why
  - Are the computationally intensive parts efficient?
  - Are the processor architectural components working optimally?
- Answer can be very platform dependent
  - Bottlenecks may differ
  - Cause of missing performance portability
  - Need to tune to architectural parameters
- Next: Investigate hardware/application interaction
  - Efficient use of hardware resources or micro-architectural tuning
  - Architectural units (on/off chip) that are stressed


Hardware Performance Counters

- Architectural Features
  - Typically/mostly packaged inside the CPU
  - Count hardware events transparently without overhead
- Newer platforms also provide system counters
  - Network cards and switches
  - Environmental sensors
- Drawbacks
  - Availability differs between platforms & processors
  - Slight semantic differences between platforms
  - In some cases: requires privileged access & kernel patches
- Open|SpeedShop accesses them through PAPI
  - API for tools + simple runtime tools
  - Abstractions for system specific layers
  - More information: http://icl.cs.utk.edu/papi/


The O|SS HWC Experiments

- Provides access to hardware counters
  - Implemented on top of PAPI
  - Access to PAPI and native counters
  - Examples: cache misses, TLB misses, bus accesses
- Basic model 1: Timer Based Sampling: HWCsamp
  - Samples at a set sampling rate for the chosen events
  - Supports multiple counters
  - Lower statistical accuracy
  - Can be used to estimate a good threshold for hwc/hwctime
- Basic model 2: Thresholding: HWC and HWCtime
  - User selects one counter
  - Run until a fixed number of events have been reached
  - Take PC sample at that location
    - HWCtime also records the stacktrace
  - Reset number of events
  - Ideal number of events (threshold) depends on the application


Examples of Typical Counters

PAPI Name       Description                              Threshold
PAPI_L1_DCM     L1 data cache misses                     high
PAPI_L2_DCM     L2 data cache misses                     high/medium
PAPI_L1_DCA     L1 data cache accesses                   high
PAPI_FPU_IDL    Cycles in which FPUs are idle            high/medium
PAPI_STL_ICY    Cycles with no instruction issue         high/medium
PAPI_BR_MSP     Mispredicted branches                    medium/low
PAPI_FP_INS     Number of floating point instructions    high
PAPI_LD_INS     Number of load instructions              high
PAPI_VEC_INS    Number of vector/SIMD instructions       high/medium
PAPI_HW_INT     Number of hardware interrupts            low
PAPI_TLB_TL     Number of TLB misses                     low

Note: Threshold indications are just rough guidance and depend on the application.
Note: counters are platform dependent (use papi_avail & papi_native_avail)
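Once event totals are in hand, the useful numbers are usually ratios derived from pairs of counters. A sketch with invented counts (the PAPI event names are real; the numbers are not measurements):

```python
# Derive simple metrics from raw hardware counter totals, the kind
# of arithmetic you do by hand on hwcsamp output.
def l1_miss_rate(l1_dcm, l1_dca):
    """PAPI_L1_DCM / PAPI_L1_DCA: fraction of L1 data accesses that miss."""
    return l1_dcm / l1_dca

def ipc(tot_ins, tot_cyc):
    """PAPI_TOT_INS / PAPI_TOT_CYC: instructions retired per cycle."""
    return tot_ins / tot_cyc

# Made-up totals for one run, keyed by PAPI preset name.
counts = {"PAPI_L1_DCM": 5_000_000, "PAPI_L1_DCA": 250_000_000,
          "PAPI_TOT_INS": 9_000_000_000, "PAPI_TOT_CYC": 6_000_000_000}

print(l1_miss_rate(counts["PAPI_L1_DCM"], counts["PAPI_L1_DCA"]))  # 0.02
print(ipc(counts["PAPI_TOT_INS"], counts["PAPI_TOT_CYC"]))         # 1.5
```

A miss rate of 2% and an IPC of 1.5 would read as healthy; a low IPC together with a high miss rate points at memory-bound code.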


Recommendation: start with HWCsamp
- osshwcsamp "<command> <args>" [ default | <PAPI_event_list> | <sampling_rate> ]
  - Sequential job example:
    - osshwcsamp "smg2000"
  - Parallel job example:
    - osshwcsamp "mpirun –np 128 smg2000 –n 50 50 50" PAPI_L1_DCM,PAPI_L1_TCA 50
- default events: PAPI_TOT_CYC and PAPI_TOT_INS
- default sampling_rate: 100
- <PAPI_event_list>: Comma separated PAPI event list (maximum of 6 events that can be combined)
- <sampling_rate>: Integer value sampling rate
- Use event count values to guide selection of thresholds for the HWC, HWCtime experiments for deeper analysis


Selecting the Counters & Sampling Rate

- For osshwcsamp, Open|SpeedShop supports …
  - Derived and non-derived PAPI presets
    - All derived and non-derived events reported by "papi_avail"
    - Also reported by running "osshwcsamp" with no arguments
    - Ability to sample up to six (6) counters at one time; before use, test with papi_event_chooser PRESET
    - If a counter does not appear in the output, there may be a conflict in the hardware counters
  - All native events
    - Architecture specific (incl. naming)
    - Names listed in the PAPI documentation
    - Native events reported by "papi_native_avail"
- Sampling rate depends on the application
  - Overhead vs. accuracy
    - A lower sampling rate causes fewer samples


Possible HWC Combinations To Use*

Xeons
- PAPI_FP_INS,PAPI_LD_INS,PAPI_SR_INS (load/store info, memory bandwidth needs)
- PAPI_L1_DCM,PAPI_L1_TCA (L1 cache hit/miss ratios)
- PAPI_L2_DCM,PAPI_L2_TCA (L2 cache hit/miss ratios)
- LAST_LEVEL_CACHE_MISSES,LAST_LEVEL_CACHE_REFERENCES (L3 cache info)
- MEM_UNCORE_RETIRED:REMOTE_DRAM,MEM_UNCORE_RETIRED:LOCAL_DRAM (local/nonlocal memory access)

For Opterons only
- PAPI_FAD_INS,PAPI_FML_INS (floating point add/multiply)
- PAPI_FDV_INS,PAPI_FSQ_INS (sqrt and divisions)
- PAPI_FP_OPS,PAPI_VEC_INS (floating point and vector instructions)
- READ_REQUEST_TO_L3_CACHE:ALL_CORES,L3_CACHE_MISSES:ALL_CORES (L3 cache)

*Credit: Koushik Ghosh, LLNL

hwcsamp with miniFE (see mantevo.org)

- osshwcsamp "mpiexec –n 72 miniFE.X –nx 614 –ny 614 –nz 614" PAPI_DP_OPS,PAPI_L1_DCM,PAPI_TOT_CYC,PAPI_TOT_INS
- openss –f miniFE.x-hwcsamp.openss

Also have pcsamp information. Up to six events can be displayed; here we have 4.


Deeper Analysis with HWC and HWCtime
- osshwc[time] "<command> <args>" [ default | <PAPI_event> | <threshold> ]
  - Sequential job example:
    - osshwc[time] "smg2000 –n 50 50 50" PAPI_FP_OPS 50000
  - Parallel job example:
    - osshwc[time] "mpirun –np 128 smg2000 –n 50 50 50"
- default: event (PAPI_TOT_CYC), threshold (10000)
- <PAPI_event>: PAPI event name
- <threshold>: PAPI integer threshold
- NOTE: If the output is empty, try lowering the <threshold> value. There may not have been enough PAPI event occurrences to record and present


Viewing hwc Data

v hwc default view: Counter = Total Cycles

Flat hardware counter profile of a single hardware counter event. Exclusive counts only


Viewing hwctime Data

hwctime default view: Counter = L1 Cache Misses

Calling context hardware counter profile of a single hardware counter event. Exclusive/Inclusive counts



Performance Analysis related to application I/O activity


Need for Understanding I/O

- I/O could be a significant percentage of execution time, dependent upon:
  - Checkpoint, analysis output, visualization & I/O frequencies
  - I/O pattern in the application: N-to-1, N-to-N; simultaneous writes or requests
  - Nature of application: data intensive, traditional HPC, out-of-core
  - File system and striping: NFS, Lustre, Panasas, and # of Object Storage Targets (OSTs)
  - I/O libraries: MPI-IO, hdf5, PLFS, …
  - Other jobs stressing the I/O sub-systems
- Obvious candidates to explore first while tuning:
  - Use a parallel file system
  - Optimize for the I/O pattern
  - Match checkpoint I/O frequency to the MTBI of the system
  - Use appropriate libraries


Additional I/O analysis with O|SS
- I/O Tracing (io experiment)
  - Records each event in chronological order
  - Provides call path and time spent in I/O functions
- Extended I/O Tracing (iot experiment)
  - Records each event in chronological order
  - Collects additional information:
    - Function parameters
    - Function return value
  - When to use extended I/O tracing?
    - When you want to trace the exact order of events
    - When you want to see the return values or bytes read or written
    - When you want to see the parameters of the I/O call
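What an iot-style record per event looks like can be sketched by wrapping an I/O function ourselves. This stands in for the libc-level interception O|SS actually performs; the wrapper and record layout are invented for illustration:

```python
# Toy extended I/O trace: record, per call and in chronological
# order, the function name, its arguments, the return value
# (bytes written), and the time spent in the call.
import functools
import os
import tempfile
import time

trace = []  # one record per I/O event

def traced(func):
    @functools.wraps(func)
    def wrapper(*args):
        start = time.time()
        result = func(*args)
        trace.append({"call": func.__name__, "args": args,
                      "retval": result, "seconds": time.time() - start})
        return result
    return wrapper

write = traced(os.write)  # wrap the os-level write as a stand-in

fd, path = tempfile.mkstemp()
write(fd, b"hello ")
write(fd, b"world\n")
os.close(fd)
os.remove(path)

for event in trace:
    print(event["call"], event["retval"], "bytes")
```

The return value here is the byte count, which is exactly why the iot view can report min/max bytes read or written per call path.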


Running I/O Experiments

Offline io/iot experiment on the sweep3d application
Convenience script basic syntax:
  ossio[t] "<executable>" [ default | <I/O function list> ]
- Parameters
  - I/O function list to sample (default is all)
  - creat, creat64, dup, dup2, lseek, lseek64, open, open64, pipe, pread, pread64, pwrite, pwrite64, read, readv, write, writev
Examples:
  ossio "mpirun –np 256 sweep3d.mpi"
  ossiot "srun –N 1 –n 16 sweep3d.mpi" read,readv,write
  ossiot "mpirun –np 256 sweep3d.mpi" read,readv,write


I/O output via GUI

- I/O Default View for the IOR application, "io" experiment

Shows the aggregated time spent in the I/O functions traced during the application.


I/O output via GUI

- I/O Call Path View for the IOR application, "io" experiment

Shows the call paths to the I/O functions traced and the time spent along the paths.


I/O “iot” experiment output via GUI

- I/O Default View for the IOR application, "iot" experiment

Shows the min and max values for bytes read or written.


I/O "iot" experiment output via CLI

# Show the call paths in the application run that wrote the largest number of bytes.
# Using min_bytes instead would show the paths with the minimum number of bytes.

openss>>expview -vcalltrees,fullstack -m max_bytes

  Max_Bytes        Call Stack Function (defining location)
  Read/Written
                   _start (IOR)
                   > @ 562 in __libc_start_main (libmonitor.so.0.0.0: main.c,541)
                   >> @ 258 in __libc_start_main (libc-2.12.so: libc-start.c,96)
                   >>> @ 517 in monitor_main (libmonitor.so.0.0.0: main.c,492)
                   >>>> @ 153 in main (IOR: IOR.c,108)
                   >>>>> @ 2013 in TestIoSys (IOR: IOR.c,1848)
                   >>>>>> @ 2608 in WriteOrRead (IOR: IOR.c,2562)
                   >>>>>>> @ 244 in IOR_Xfer_POSIX (IOR: aiori-POSIX.c,224)
                   >>>>>>>> @ 321 in write (iot-collector-monitor-mrnet-mpi.so: wrappers.c,239)
  262144           >>>>>>>>> @ 82 in write (libc-2.12.so: syscall-template.S,82)



Performance Analysis related to application MPI activity


How can O|SS help for parallel jobs?

- O|SS is designed to work on parallel jobs
  - Support for threading and message passing
  - Automatically tracks all ranks and threads during execution
  - Records/stores performance info per process/rank/thread
- All experiments can be used on parallel jobs
  - O|SS applies the experiment collector to all ranks or threads on all nodes
- MPI specific tracing experiments
  - Tracing of MPI function calls (individual, all, or a specific group)
  - Four forms of MPI tracing experiments


Analysis of Parallel Codes

- By default, experiment collectors are run on all tasks
  - Automatically detect all processes and threads
  - Gathers and stores per thread data
- Viewing data from parallel codes
  - By default all values aggregated (summed) across all ranks
  - Manually include/exclude individual ranks/processes/threads
  - Ability to compare ranks/threads
- Additional analysis options
  - Load balance (min, max, average) across parallel executions
    - Across ranks for hybrid OpenMP/MPI codes
    - Focus on a single rank to see load balance across OpenMP threads
  - Cluster analysis (finding outliers)
    - Automatically creates groups of similar performing ranks or threads
    - Available from the Stats Panel toolbar or context menu
    - Note: can take a long time for large numbers of processors (current version)


Integration with MPI

- O|SS has been tested with a variety of MPIs
  - Including: Open MPI, MVAPICH[2], MPICH2 (Intel, Cray), MPT (SGI)
- Running O|SS experiments on MPI codes
  - Just use the convenience script corresponding to the data you want to gather, and put the command you use to run your application in quotes:
    - osspcsamp "mpirun –np 32 sweep3d.mpi"
    - ossio "srun –N 4 –n 16 sweep3d.mpi"
    - osshwctime "mpirun –np 128 sweep3d.mpi"
    - ossusertime "srun –N 8 –n 128 sweep3d.mpi"
    - osshwc "mpirun –np 128 sweep3d.mpi"


MPI Specific Tracing Experiments

- MPI function tracing
  - Record all MPI call invocations
  - Record call times and call paths (mpi)
    - Convenience script: ossmpi
  - Record call times, call paths and argument info (mpit)
    - Convenience script: ossmpit
- Public format:
  - Full MPI traces in Open Trace Format (OTF)
  - Experiment name: mpio
    - Convenience script: ossmpio


Running MPI Specific Experiments

Offline mpi/mpit/mpio experiment
Convenience script basic syntax:
  ossmpi[t][o] "<mpi executable syntax>" [ default | <MPI function list> | <mpi category> ]
- Parameters
  - Default is all MPI functions Open|SpeedShop traces
  - MPI function list to trace (comma separated)
    - MPI_Send, MPI_Recv, ….
  - mpi_category:
    - "all", "asynchronous_p2p", "collective_com", "datatypes", "environment", "graphs_contexts_comms", "persistent_com", "process_topologies", "synchronous_p2p"
Examples:
  ossmpi "srun –N 4 –n 32 smg2000 –n 50 50 50"
  ossmpi "mpirun –np 4000 nbody" MPI_Send,MPI_Recv


Identifying Load Imbalance With O|SS

- Get an overview of the application
  - Run one of these lightweight experiments
    - pcsamp, usertime, hwc
  - Use this information to verify performance expectations
- Use the load balance view on pcsamp, usertime, hwc
  - Look for performance values outside of the norm
    - Somewhat large difference for the min, max, average values
- Get an overview of the MPI functions used in the application
  - If the MPI libraries are showing up in the load balance for pcsamp, then do an MPI specific experiment
- Use the load balance view on the MPI experiment
  - Look for performance values outside of the norm
    - Somewhat large difference for the min, max, average values
  - Focus on the MPI functions to find potential problems
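The min/max/average computation behind the load balance view is simple to sketch; the per-rank times below are made up, and the `imbalance` ratio (max over average) is a common convention rather than a specific O|SS metric:

```python
# Toy load-balance summary over per-rank times: report min, max,
# average, which ranks hold the extremes, and a max/average ratio.
def load_balance(times_by_rank):
    values = list(times_by_rank.values())
    avg = sum(values) / len(values)
    rank_min = min(times_by_rank, key=times_by_rank.get)
    rank_max = max(times_by_rank, key=times_by_rank.get)
    return {"min": times_by_rank[rank_min], "min_rank": rank_min,
            "max": times_by_rank[rank_max], "max_rank": rank_max,
            "average": avg,
            "imbalance": times_by_rank[rank_max] / avg}

# Invented per-rank times: rank 3 is the straggler to investigate.
mpi_wait_times = {0: 10.1, 1: 10.4, 2: 9.9, 3: 19.6}
stats = load_balance(mpi_wait_times)
print(stats["max_rank"], stats["max"], round(stats["imbalance"], 2))
```

A max well above the average, as for rank 3 here, is exactly the "outside of the norm" signal the slide tells you to chase.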


Load Balance View: NPB: LU

- Load Balance View based on functions (pcsamp)

MPI library showing up high in the list. Max time in rank 255.

With the load balance view we are looking for performance numbers outside the norm of what is expected.

Large differences between min, max and/or average values.

Default Linked Object View: NPB: LU

v Default Aggregated View based on Linked Objects (libraries)

Linked Object View (library view): select "Linked Objects", click the D-icon

NOTE: MPI library consuming large portion of application run time


MPI Tracing Results: Default View v Default Aggregated MPI Experiment View Information Icon Displays Experiment Metadata

Aggregated Results


View Results: Show MPI Callstacks

Unique Call Paths View: click the C+ icon. Unique call paths to MPI_Waitall and other MPI functions.


Using Cluster Analysis in O|SS

- Can use with pcsamp, usertime, hwc
  - Will group like-performing ranks/threads into groups
  - Groups may identify outlier groups of ranks/threads
  - Can examine the performance of a member of the outlier group
  - Can compare that member with a member of an acceptably performing group
- Can use with mpi, mpit
  - Same functionality as above
  - But now focuses on the performance of individual MPI functions
  - Key functions are MPI_Wait, MPI_WaitAll
  - Can look at the call paths to the key functions to analyze why they are being called, to find performance issues
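The idea behind grouping like-performing ranks can be sketched with a crude outlier cut. Real O|SS cluster analysis uses a proper clustering algorithm; this mean/stddev threshold is a simplified stand-in, and the per-rank times are invented:

```python
# Toy outlier detection over per-rank times: flag ranks whose time
# is more than num_devs standard deviations from the mean.
import statistics

def find_outliers(times_by_rank, num_devs=2.0):
    values = list(times_by_rank.values())
    mean = statistics.mean(values)
    dev = statistics.pstdev(values)
    return sorted(rank for rank, t in times_by_rank.items()
                  if abs(t - mean) > num_devs * dev)

# 256 ranks at ~10s each, with one straggler, like rank 255 in the
# NPB LU example on the next slide set.
times = {rank: 10.0 for rank in range(256)}
times[255] = 30.0
print(find_outliers(times))  # [255]
```

Once an outlier group is identified, the natural next step from the slide applies: compare one of its members against a member of a normal group.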


Link. Obj. Cluster Analysis: NPB: LU

v Cluster Analysis View based on Linked Objects (libraries)

In the Cluster Analysis results, Rank 255 shows up as an outlier.



Comparing Performance Data



v Key functionality for any performance analysis
Ø Absolute numbers often don’t help
Ø Need some kind of baseline / number to compare against
v Typical examples
Ø Before/after optimization
Ø Different configurations or inputs
Ø Different ranks, processes or threads
v Very limited support in most tools
Ø Manual operation after multiple runs
Ø Requires lining up profile data
Ø Even harder for traces
v Open|SpeedShop has support to line up profiles
Ø Perform multiple experiments and create multiple databases
Ø Script to load all experiments and create multiple columns


Comparing Performance Data in O|SS

v Convenience Script: osscompare
Ø Compares up to eight Open|SpeedShop databases to each other
• Syntax: osscompare “db1.openss,db2.openss,…” [options]
• The osscompare man page has more details
Ø Produces a side-by-side comparison listing
Ø Metric option parameter:
• Compare based on: time, percent, a hardware counter, etc.
Ø Limit the number of lines with the “rows=nn” option
Ø Specify viewtype=[functions|statements|linkedobjects] to control the view granularity
• Compare at the function, statement, or library level; function level is the default
• By default the comparison is made across the performance of the functions in each database. If the statements option is specified, the comparisons are made by looking at the performance of each statement across all the specified databases; similarly for libraries when linkedobjects is selected as the viewtype parameter.
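The side-by-side listing osscompare produces can be approximated by hand from two per-run CSV exports (`expview -F csv`). The two files below are invented sample data keyed on function name; `join` lines the columns up the way osscompare's `-c 2` / `-c 4` columns do.

```shell
# Hypothetical function,time exports from two runs (pre-sorted on the join key)
cat > run1.csv <<'EOF'
hypre_SMGResidual,3.87
opal_progress,2.03
EOF
cat > run2.csv <<'EOF'
hypre_SMGResidual,3.63
opal_progress,0.15
EOF
# One row per function with the time from each run side by side
join -t, run1.csv run2.csv
```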


Comparison Report in O|SS

osscompare "smg2000-pcsamp.openss,smg2000-pcsamp-1.openss"
[openss]: Legend: -c 2 represents smg2000-pcsamp.openss
[openss]: Legend: -c 4 represents smg2000-pcsamp-1.openss

-c 2, Exclusive CPU  -c 4, Exclusive CPU  Function (defining location)
time in seconds.     time in seconds.
3.870000000          3.630000000          hypre_SMGResidual (smg2000: smg_residual.c,152)
2.610000000          2.860000000          hypre_CyclicReduction (smg2000: cyclic_reduction.c,757)
2.030000000          0.150000000          opal_progress (libopen-pal.so.0.0.0)
1.330000000          0.100000000          mca_btl_sm_component_progress (libmpi.so.0.0.2: topo_unity_component.c,0)
0.280000000          0.210000000          hypre_SemiInterp (smg2000: semi_interp.c,126)
0.280000000          0.040000000          mca_pml_ob1_progress (libmpi.so.0.0.2: topo_unity_component.c,0)



New functionality


Open|SpeedShop and CBTF

New functionality being worked on now.
v OpenMP-specific experiment enabled through CBTF
Ø Uses the OMPT interface being proposed as an OpenMP standard
v Intel MIC and ARM platform support for CBTF
v Filtering (data reduction, analysis) in the MRNet communication nodes
Ø Faster views as data is mined in parallel
v Changes in the CLI to handle the reduced performance data


What are the recent changes to O|SS

v New features and improvements already in O|SS
Ø OpenMP idle/wait time augmentation to the sampling experiments
Ø Added loop granularity displays
Ø View caching in the CLI and GUI
• CLI caching is cross-session; views are saved in the database
• GUI caching is only in the current session; also caches metadata requests
Ø Speed-ups in processing symbols when creating the database
Ø MPI File I/O function tracing
Ø Metric expression support in the CLI (create new data columns)
Ø Initial Spack build support
Ø Conversion to cmake builds for OpenSpeedShop and CBTF from-source builds


What are the recent changes to O|SS

v Component Based Tool Framework (CBTF)
Ø Roll-out of a new version of O|SS that uses a tree-based network (MRNet)
• Transfers data over the network; does not write files like the offline version
• Can do data reduction (in parallel) as the data is streamed up the tree
Ø Five new experiments implemented in this version
• Lightweight I/O profiling (iop)
• Lightweight MPI profiling (mpip)
• Threading experiments (pthreads)
• Memory usage analysis (mem)
• GPU/Accelerator support (cuda)



Advanced and New O|SS Functionality: Component Based Tool Framework based Open|SpeedShop


CBTF Structure and Dependencies

§ Tool-type independent framework
• Performance tools, debugging tools, etc.
§ Completed components
• Base library (libcbtf)
• XML-based component networks (libcbtf-xml)
• MRNet distributed components (libcbtf-mrnet)
• Dependencies: Qt, MRNet, Xerces-C, Boost; cbtf-gui, libcbtf-tcpip front ends
§ Ability for online aggregation of data from the MPI application
• Tree-based communication
• Online data filters
• Configurable



Open|SpeedShop and CBTF

v CBTF will be/is the new platform for O|SS
Ø More scalability
Ø Aggregation functionality for new experiments
Ø Built with a new “instrumentor class”
v Existing experiments will not change
Ø Same workflow
Ø Same convenience scripts
v New experiments enabled through CBTF
Ø Lighter-weight I/O profiling (iop)
Ø Lighter-weight MPI profiling (mpip)
Ø Threading experiments (pthreads)
Ø Memory usage analysis (mem)
Ø GPU/Accelerator support (cuda)


Additional Experiments in O|SS/CBTF

v POSIX thread tracing (pthreads)
Ø Records invocation of all POSIX thread events
Ø Provides aggregate and individual rank, thread, or process timings
v MPI tracing (mpip)
Ø Records invocation of all MPI routines
Ø Provides aggregate and individual rank, thread, or process timings
Ø Lightweight MPI profiling because it does not track individual call details
v Memory tracing (mem)
Ø Records invocation of key memory-related function call events
Ø Provides aggregate and individual rank, thread, or process timings
v Input/Output tracing (iop)
Ø Records invocation of all POSIX I/O events
Ø Provides aggregate and individual rank, thread, or process timings
Ø Lightweight I/O profiling (iop)
v CUDA GPU event tracing (cuda)
Ø Records CUDA events; provides timeline and event timings



Performance Analysis related to lightweight I/O and MPI


Additional I/O analysis with O|SS

v I/O Profiling (iop experiment) *OSS/CBTF only
Ø Lighter-weight I/O tracking experiment
Ø Traces I/O functions but only records the individual call paths, not each individual event with its call path (like usertime)
• No chronological list of every I/O event (call)
Ø Results in smaller database sizes and quicker viewing
• Example: (small IOR MPI 4-process run)
– IOR iot database size: 56320
– IOR iop database size: 51200
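A quick back-of-the-envelope check of the saving, using the two IOR database sizes quoted above:

```shell
# Percent reduction from the full-trace iot database to the call-path-only iop database
awk 'BEGIN{iot=56320; iop=51200; printf "%.1f%% smaller\n", 100*(iot-iop)/iot}'
```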


Running the I/O Profiling Experiment

iop experiment on the sweep3d application
Convenience script basic syntax:
ossiop “executable” [ default | <I/O function list> ]
Ø Parameters
• I/O function list to sample (default is all)
• creat, creat64, dup, dup2, lseek, lseek64, open, open64, pipe, pread, pread64, pwrite, pwrite64, read, readv, write, writev
Examples:
ossiop “mpirun –np 256 sweep3d.mpi”
ossiop “mpirun –np 256 sweep3d.mpi” read,readv,write


MPI Specific Tracing Experiment (mpip)

Lightweight MPI profiling (mpip)
v Traces MPI functions
Ø But only records the individual call paths
Ø Does not record each individual event (MPI call)
• Similar to usertime
v Smaller database files
Ø 1895424 Oct 22 14:19 smg2000-mpip-0.openss
Ø 57498624 Oct 22 14:20 smg2000-mpit-0.openss
v Faster viewing of MPI performance data
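The same arithmetic on the smg2000 sizes above shows how much more dramatic the saving is for MPI tracing:

```shell
# Ratio of the full-trace (mpit) database to the lightweight (mpip) one, sizes from this slide
awk 'BEGIN{mpit=57498624; mpip=1895424; printf "%.0fx smaller\n", mpit/mpip}'
```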


Running MPI Lightweight Experiment

OSS/CBTF mpip experiment
Convenience script basic syntax:
ossmpip “mpi executable syntax” [ default | <MPI function list> | <MPI category> ]
Ø Parameters
• Default is all MPI functions Open|SpeedShop traces
• MPI function list to trace (comma separated)
– MPI_Send, MPI_Recv, …
• mpi_category:
– "all", "asynchronous_p2p", "collective_com", "datatypes", "environment", "graphs_contexts_comms", "persistent_com", "process_topologies", "synchronous_p2p"
Examples:
ossmpip “srun –N 4 –n 32 smg2000 –n 50 50 50”
ossmpip “mpirun –np 4000 nbody” MPI_Send,MPI_Recv



Performance Analysis related to application memory function activity


Memory Hierarchy

v Memory hierarchy
Ø CPU registers and cache
Ø System RAM
Ø Online memory, such as disks, etc.
Ø Offline memory not physically connected to the system
Ø https://en.wikipedia.org/wiki/Memory_hierarchy
v What do we mean by memory?
Ø Memory an application requires from the system RAM
Ø Memory allocated on the heap by system calls, such as malloc and friends

How to Analyze the Performance of Parallel Codes 101 - A Tutorial at SC'15 Oct 27-28, 2015 83

Need for Understanding Memory Usage

v Memory leaks
Ø Is the application releasing memory back to the system?
v Memory footprint
Ø How much memory is the application using?
Ø Finding the high-water mark (HWM) of memory allocated
Ø Out-of-memory (OOM) potential
Ø Swap and paging issues
v Memory allocation patterns
Ø Memory allocations held longer than expected
Ø Allocations that consume large amounts of heap space
Ø Short-lived allocations
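The high-water-mark idea is easy to illustrate: track the running total of live bytes over a stream of allocate/free events and record its peak. The toy trace below is invented (it is not mem experiment output).

```shell
# Toy allocation trace: +bytes for an allocation, -bytes for a free -- made-up events
cat > memtrace.txt <<'EOF'
+1000
+4000
-1000
+2000
-4000
EOF
# Running total of live bytes; the peak of that total is the high-water mark (HWM)
awk '{cur+=$1; if(cur>hwm) hwm=cur} END{print "HWM =", hwm, "bytes"}' memtrace.txt
```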


Example Memory Heap Analysis Tools

v MemP is a parallel heap profiling library
Ø Requires MPI
Ø http://sourceforge.net/projects/memp
v Valgrind provides two heap profilers
Ø Massif is a heap profiler
• http://valgrind.org/docs/manual/ms-manual.html
Ø DHAT is a dynamic heap analysis tool
• http://valgrind.org/docs/manual/dh-manual.html
v Dmalloc - Debug Malloc Library
Ø http://dmalloc.com/
v Google PerfTools heap analysis and leak detection
Ø https://github.com/gperftools/gperftools


O|SS Memory Experiment

v Supports sequential, MPI, and threaded applications
Ø No instrumentation needed in the application
Ø Traces system calls via wrappers:
• malloc
• calloc
• realloc
• free
• memalign and posix_memalign
v Provides metrics for:
Ø Timeline of events that set a new high-water mark (HWM)
Ø List of allocation events (with calling context) that leak
Ø Overview of all unique call trees to traced memory calls, with the max and min allocation and the count of calls on each path
v Example usage
Ø ossmem "./lulesh2.0"
Ø ossmem "srun -N4 -n 64 ./sweep3d.mpi"


O|SS Memory Experiment

v Show the call tree for the allocation event that set the high-water mark

openss>>expview -v trace,calltrees -m start_time -m hwm -m allocation

Start Time of Event      HWM       Call Stack Function (defining location)
                                   _start (lulesh2.0)
                                   >__libc_start_main (libmonitor.so.0.0.0: main.c,541)
                                   >>__libc_start_main (libc-2.18.so)
                                   >>>monitor_main (libmonitor.so.0.0.0: main.c,492)
                                   >>>>main (lulesh2.0: lulesh.cc,2672)
                                   >>>>>GOMP_parallel_start (libgomp.so.1.0.0: parallel.c,123)
                                   >>>>>>gomp_new_team (libgomp.so.1.0.0: team.c,141)
                                   >>>>>>>gomp_malloc (libgomp.so.1.0.0: alloc.c,35)
2015/09/09 10:11:04.023  27016684  >>>>>>>>__libc_malloc (libc-2.18.so)


O|SS Memory Experiment

v Show the call tree to one memory leak

expview -v trace,calltrees -m leak -m allocation mem1

Total Allocation  Call Stack Function (defining location)
                  _start (lulesh2.0)
                  >__libc_start_main (libmonitor.so.0.0.0: main.c,541)
                  >>__libc_start_main (libc-2.18.so)
                  >>>monitor_main (libmonitor.so.0.0.0: main.c,492)
                  >>>>main (lulesh2.0: lulesh.cc,2672)
                  >>>>>GOMP_parallel_start (libgomp.so.1.0.0: parallel.c,123)
                  >>>>>>gomp_new_team (libgomp.so.1.0.0: team.c,141)
                  >>>>>>>gomp_malloc (libgomp.so.1.0.0: alloc.c,35)
10600684          >>>>>>>>__libc_malloc (libc-2.18.so)


O|SS Memory Experiment

v The next slide shows an overview of the unique call paths to memory calls.
v In this example:
Ø The top free and malloc calls are shown with the number of times each path was called.
v For the malloc call:
Ø The max and min allocation seen on this path are displayed.


O|SS Memory Experiment

expview -v trace,calltrees,fullstack -m path_counts -m max_allocation,min_allocation -m overview

Counts     Max         Min         Call Stack Function (defining location)
This Path  Allocation  Allocation
                                   _start (lulesh2.0)
                                   > @ 562 in __libc_start_main (libmonitor.so.0.0.0: main.c,541)
                                   >>__libc_start_main (libc-2.18.so)
                                   >>> @ 517 in monitor_main (libmonitor.so.0.0.0: main.c,492)
                                   >>>> @ 2392 in main (lulesh2.0: lulesh.cc,2672)
                                   >>>>> @ 2249 in EvalEOSForElems(Domain&, double*, int, int*, int) (lulesh2.0: lulesh.cc,2218)
32619                              >>>>>>__cfree (libc-2.18.so)
                                   _start (lulesh2.0)
                                   > @ 562 in __libc_start_main (libmonitor.so.0.0.0: main.c,541)
                                   >>__libc_start_main (libc-2.18.so)
                                   >>> @ 517 in monitor_main (libmonitor.so.0.0.0: main.c,492)
                                   >>>> @ 2392 in main (lulesh2.0: lulesh.cc,2672)
                                   >>>>> @ 2164 in EvalEOSForElems(Domain&, double*, int, int*, int) (lulesh2.0: lulesh.cc,2218)
                                   >>>>>> @ 125 in GOMP_parallel_start (libgomp.so.1.0.0: parallel.c,123)
                                   >>>>>>> @ 156 in gomp_new_team (libgomp.so.1.0.0: team.c,141)
                                   >>>>>>>> @ 37 in gomp_malloc (libgomp.so.1.0.0: alloc.c,35)
32619      2080        2080        >>>>>>>>>__libc_malloc (libc-2.18.so)


Summary and Conclusions

v Benefits of memory heap analysis
Ø Detect leaks
Ø Detect inefficient use of system memory
Ø Find potential OOM, paging, and swapping conditions
Ø Determine the memory footprint over the lifetime of the application run
v Observations about memory analysis tools
Ø Less concerned with the time spent in memory calls
Ø Emphasis is placed on the relationship of allocation calls to free calls
Ø Can generate large amounts of data
Ø Can slow down and impact the application while running



Performance Analysis related to application hardware accelerator activity


Emergence of HPC Heterogeneous Processing

v Heterogeneous computing refers to systems that use more than one kind of processor.

v What led to increased heterogeneous processing in HPC?
Ø Limits on the ability to continue to scale processor frequencies
Ø Power consumption hitting a realistic upper bound
Ø Programmability advances led to more widespread, general usage of the graphics processing unit (GPU)
Ø Advances in manycore, multi-core hardware technology (MIC)
v Heterogeneous accelerator processing: (GPU, MIC)
Ø Data-level parallelism
• Vector units, SIMD execution
• A single instruction operates on multiple data items
Ø Thread-level parallelism
• Multithreading, multi-core, manycore


Overview: Most Notable Hardware Accelerators

v GPU (Graphics Processing Unit)
Ø General-purpose computing on graphics processing units (GPGPU)
Ø Solves problems of the single-instruction, multiple-thread (SIMT) type
Ø Vectors of data where each element of the vector can be treated independently
Ø Offload model, where data is transferred into and out of the GPU
Ø Program using the CUDA language, OpenCL, or directive-based OpenACC
v Intel MIC (Many Integrated Cores)
Ø Has a less specialized architecture than a GPU
Ø Can execute parallel code written for:
• Traditional programming models, including POSIX threads and OpenMP
Ø Initially offload based (transfer data to and from the co-processor)
Ø Now/future: programs run natively


GPGPU Accelerator

GPU versus CPU comparison
v Different goals produce different designs
Ø GPU assumes the workload is highly parallel
Ø CPU must be good at everything, parallel or not

v CPU: minimize the latency experienced by one thread
Ø Big on-chip caches
Ø Sophisticated control logic

v GPU: maximize the throughput of all threads
Ø The number of threads in flight is limited by resources => lots of resources (registers, bandwidth, etc.)
Ø Multi-threading can hide latency => skip the big caches
Ø Shared control logic across many threads
*based on an NVIDIA presentation


GPGPU Accelerator

Mixing GPU and CPU usage in applications: multicore CPU plus manycore GPU

Data must be transferred between the CPU and the GPU in order for the GPU to operate on it and return the new values. *NVIDIA image


Heterogeneous Programming

v There are four main ways to use an accelerator
Ø Explicit programming:
• The programmer writes explicit instructions for the accelerator device to execute, as well as instructions for transferring data to and from the device (e.g. CUDA-C for GPUs or OpenMP+Cilk Plus for Phis). This method requires the most effort and knowledge from programmers because algorithms must be ported to and optimized on the accelerator device.
Ø Accelerator-specific pragmas/directives:
• Accelerator code is automatically generated from your serial code by a compiler (e.g. OpenACC, OpenMP 4.0). For many applications, adding a few lines of code (pragmas/directives) can result in good performance gains on the accelerator.
Ø Accelerator-enabled libraries:
• Only requires the use of the library; no explicit accelerator programming is necessary once the library has been written. The programmer effort is similar to using a non-accelerator-enabled scientific library.
Ø Accelerator-aware applications:
• These software packages have been programmed by other scientists/engineers/software developers to use accelerators and may require little or no programming by the end user.

Credit: http://www.hpc.mcgill.ca/index.php/starthere/81-doc-pages/255-accelerator-overview


Programming for GPGPU

Prominent models for programming the GPGPU augment current languages to access GPU strengths
v NVIDIA CUDA
Ø Scalable parallel programming model
Ø Minimal extensions to the familiar C/C++ environment
Ø Heterogeneous serial-parallel computing
Ø Supports NVIDIA only
v OpenCL (Open Computing Language)
Ø Open source, royalty-free
Ø Portable; can run on different types of devices
Ø Runs on AMD, Intel, and NVIDIA
v OpenACC
Ø Provides directives (hint commands inserted into source)
Ø Directives tell the compiler where to create acceleration (GPU) code without the user modifying or adapting the code


Optimal Heterogeneous Execution

GPGPU considerations for best performance
v How is the parallel scaling for the application overall?
v Can you balance the GPU and CPU workload?
Ø Keep both the GPU and CPU busy for best performance
v Is it profitable to send a piece of work to the GPU?
Ø What is the cost of the transfer of data to and from the GPU?
v How much work is there to be done inside the GPU?
Ø Will the work to be done fully populate and keep the GPU processors busy?
Ø Are there opportunities to chain together operations so the data can stay in the GPU for multiple operations?
v Is there a vectorization opportunity?

Intel MIC considerations for best performance
v The program should be heavily threaded
v Parallel scaling should be high with an OpenMP version


Accelerator Performance Monitoring

How can performance tools help optimize code?
v Is it profitable to send a piece of work to the GPU?
Ø Tools can tell you this by measuring the costs:
• Transferring data to and from the GPU
• How much time is spent in the GPU versus the CPU
v Is there a vectorization opportunity?
Ø Could measure the mathematical operations versus the vector operations occurring in the application
Ø Experiment with compiler optimization levels, re-measure the operations, and compare
v How is the parallel scaling for the application overall?
Ø Use a performance tool to get an idea of real performance versus the expected parallel speed-up
v Provide OpenMP programming-model insights into the source code
Ø Use OpenMP performance analysis to map performance issues to source code
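Once a tool has measured the individual costs, the "is it profitable" question is simple arithmetic. All numbers below are invented for illustration; in practice they would come from measured transfer and kernel timings like those the cuda experiment reports.

```shell
# Hypothetical measured costs in ms: host->device copy, kernel, device->host copy,
# and the CPU-only time for the same piece of work
awk 'BEGIN{h2d=4.0; kern=1.3; d2h=8.0; cpu=25.0;
           gpu=h2d+kern+d2h;
           if (gpu < cpu) print "offload profitable"; else print "keep on CPU"}'
```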


Open|SpeedShop accelerator support

What performance info does Open|SpeedShop provide?
v For GPGPU it reports information to help understand:
Ø Time spent in the GPU device
Ø Cost and size of data transferred to/from the GPU
Ø Balance of CPU versus GPU utilization
Ø Transfer of data between the host and device memory versus the execution of computational kernels
Ø Performance of the internal computational kernel code running on the GPU device
v Open|SpeedShop will be able to monitor CUDA scientific libraries because it operates on application binaries
v Support for CUDA-based applications is provided by tracing actual CUDA events
v OpenACC support is scheduled
v OpenCL support is also scheduled (after OpenACC)


Open|SpeedShop accelerator support

What performance info does Open|SpeedShop provide?
v For Intel MIC (non-offload model):
Ø Reports the same range of performance information that it does for CPU-based applications
Ø Open|SpeedShop will operate on MIC similarly to targeted platforms where the compute node processor is different from the front-end node processor
Ø Only non-offload support is in our current plans
Ø A specific OpenMP experiment is coming soon
• Will help to better support analysis of MIC applications
• OpenMP performance analysis is key to understanding performance


Running the cuda experiment

OSS/CBTF cuda experiment information and example
Convenience script basic syntax:
osscuda “executable”

Examples:
Sequential job example: osscuda “eigenvalues --matrix-size=4096”

Parallel job example: osscuda “mpirun -np 64 -npernode 1 lmp_linux -sf gpu < in.lj”


Open|SpeedShop CUDA CLI Views

>openss –cli -f BlackScholes-cuda-0.openss
Welcome to OpenSpeedShop 2.1
[openss]: The restored experiment identifier is: -x 1

openss>>expview
Exclusive Time (ms)  % of Total Exclusive Time  Exclusive Count  Function (defining location)
         170.478194                 100.000000              131  BlackScholesGPU(float*, float*, float*, float*, float*, float, float, int) (BlackScholes)

openss>>expview -v Exec,Trace
Start Time (d:h:m:s)     Exclusive Time (ms)  % of Total Excl Time  Grid Dims  Block Dims  Call Stack Function (defining location)
2014/12/04 00:47:18.139             1.305890              0.766016  31250,1,1     128,1,1  >BlackScholesGPU(float*, float*, float*, float*, float*, float, float, int) (BlackScholes)
2014/12/04 00:47:18.141             1.297314              0.760985  31250,1,1     128,1,1  >BlackScholesGPU(float*, float*, float*, float*, float*, float, float, int) (BlackScholes)
…
2014/12/04 00:47:18.805             1.300547              0.762882  31250,1,1     128,1,1  >BlackScholesGPU(float*, float*, float*, float*, float*, float, float, int) (BlackScholes)


CUDA CLI Views

openss>>expview -v Xfer,Trace
Start Time (d:h:m:s)     Exclusive Time (ms)  % of Total Exclusive Time      Size  Kind          Call Stack Function (defining location)
2014/12/04 00:47:18.127             4.060334                  14.766928  16000000  HostToDevice  >>cudart::cudaApiMemcpy(void*, void const*, unsigned long, cudaMemcpyKind) (BlackScholes)
2014/12/04 00:47:18.131             3.882354                  14.119637  16000000  HostToDevice  >>cudart::cudaApiMemcpy(void*, void const*, unsigned long, cudaMemcpyKind) (BlackScholes)
2014/12/04 00:47:18.135             3.893426                  14.159904  16000000  HostToDevice  >>cudart::cudaApiMemcpy(void*, void const*, unsigned long, cudaMemcpyKind) (BlackScholes)
2014/12/04 00:47:18.817             8.012517                  29.140524  16000000  DeviceToHost  >>cudart::cudaApiMemcpy(void*, void const*, unsigned long, cudaMemcpyKind) (BlackScholes)
2014/12/04 00:47:18.825             7.647501                  27.813007  16000000  DeviceToHost  >>cudart::cudaApiMemcpy(void*, void const*, unsigned long, cudaMemcpyKind) (BlackScholes)

openss>>expview -v Exec,Trace -m time,grid,block,rpt
Exclusive Time (ms)  Grid Dims  Block Dims  Registers Per Thread  Call Stack Function (defining location)
           1.305890  31250,1,1     128,1,1                    15  >BlackScholesGPU(float*, float*, float*, float*, float*, float, float, int) (BlackScholes)
           1.304930  31250,1,1     128,1,1                    15  >BlackScholesGPU(float*, float*, float*, float*, float*, float, float, int) (BlackScholes)
…
           1.295267  31250,1,1     128,1,1                    15  >BlackScholesGPU(float*, float*, float*, float*, float*, float, float, int) (BlackScholes)


CUDA Trace View (Chronological order)

openss>>expview -v trace
Start Time (d:h:m:s)     Exclusive Time (ms)  % of Total Call Time  Call Stack Function (defining location)
2013/08/21 18:31:21.611            11.172864              1.061071  copy 64 MB from host to device (CUDA)
2013/08/21 18:31:21.622             0.371616              0.035292  copy 2.1 MB from host to device (CUDA)
2013/08/21 18:31:21.623             0.004608              0.000438  copy 16 KB from host to device (CUDA)
2013/08/21 18:31:21.623             0.003424              0.000325  set 4 KB on device (CUDA)
2013/08/21 18:31:21.623             0.003392              0.000322  set 137 KB on device (CUDA)
2013/08/21 18:31:21.623             0.120896              0.011481  compute_degrees(int*, int*, int, int)<<<[256,1,1], [64,1,1]>>> (CUDA)
2013/08/21 18:31:21.623            13.018784              1.236375  QTC_device(float*, char*, char*…, int, bool)<<<[256,1,1], [64,1,1]>>> (CUDA)
2013/08/21 18:31:21.636             0.035232              0.003346  reduce_card_device(int*, int)<<<[1,1,1], [1,1,1]>>> (CUDA)
2013/08/21 18:31:21.636             0.002112              0.000201  copy 8 bytes from device to host (CUDA)
2013/08/21 18:31:21.636             1.375616              0.130640  trim_ungrouped_pnts_indr_array(int, int*, float*…, float, int, bool)<<<[1,1,1], [64,1,1]>>> (CUDA)
2013/08/21 18:31:21.638             0.001344              0.000128  copy 260 bytes from device to host (CUDA)
2013/08/21 18:31:21.638             0.025600              0.002431  update_clustered_pnts_mask(char*, char*, int)<<<[1,1,1], [64,1,1]>>> (CUDA)


Open|SpeedShop Accelerator support

What is the status of this Open|SpeedShop work?
v Completed, but not hardened:
Ø NVIDIA GPU performance data collection
Ø Packaging the data into the Open|SpeedShop database
Ø Retrieving the performance data and forming text-based views
Ø Basic initial support for Intel Phi (MIC) based applications
• Have gathered and displayed data on the MIC test bed at NERSC and at NASA on the maia Intel MIC platform
v Ongoing work (NASA FY15-FY16 SBIR)
Ø Pursue obtaining more information about the code executing inside the GPU device
Ø More research and hardening of support for Intel MIC
Ø Analysis of the performance data and more meaningful views for CUDA/GPU
Ø Research creating GUI-based views based on the initial command line interface (CLI) data views



Performance Analysis related to application POSIX thread activity


OSS/CBTF pthreads experiment

The “pthreads” experiment was created using the CBTF infrastructure
v Gives the opportunity to filter the POSIX thread performance information, reducing and mining the important/worthwhile information while the data is transferring to the client tool
Ø Discussion topic: what is that worthwhile information?
Ø Ideas:
• Report statistics about pthread waits
• Report OMP blocking times
• Attribute information to the proper threads
• Thread numbering improvements
– Use a shorter alias number for the long POSIX pthread numbers
• Report synchronization overhead mapped to the proper thread
v The slides that follow show what the tool provides currently


Running the pthreads experiment

OSS/CBTF pthreads experiment information and example
Convenience script basic syntax:
osspthreads “executable” [ default | <POSIX thread function list> ]
Ø Parameters
• POSIX thread function list to sample (default is all)
• pthread_create, pthread_mutex_init, pthread_mutex_destroy, pthread_mutex_lock, pthread_mutex_trylock, pthread_mutex_unlock, pthread_cond_init, pthread_cond_destroy, pthread_cond_signal, pthread_cond_broadcast, pthread_cond_wait, pthread_cond_timedwait
Examples:
osspthreads “aprun -n 64 -d 8 ./mpithreads_both”
osspthreads “mpirun –np 256 sweep3d.mpi” pthread_mutex_lock


Running the pthreads experiment

OSS/CBTF pthreads experiment default GUI view

Aggregated Time and Number of calls for the POSIX thread functions


Running the pthreads experiment

OSS/CBTF pthreads experiment callpath GUI view

Unique Call Paths to POSIX thread functions


Running the pthreads experiment

OSS/CBTF pthreads experiment loadbalance GUI view

Max, Min, Ave across all threads for all traced POSIX thread functions


Running the pthreads experiment

OSS/CBTF pthreads experiment butterfly GUI view

Caller, Pivot Function, and Callee information from Butterfly View



Section 3: Command Line Interface Usage


Command Line Interface (CLI) Usage

v Command Line Interface features
Ø A “gdb”-like tool for performance data creation and viewing
Ø Same functional capabilities as the graphical user interface (GUI)
• Exception: the GUI can focus on the source line corresponding to statistics
Ø List metadata about your application and the O|SS experiment
Ø Create experiments and run them
Ø Launch the GUI from the CLI via the “opengui” command
Ø View performance data from a database file
• openss –cli –f <database name> to launch
• expview – the key command, with many options
• list – list many items, such as source files, object files, and metrics to view
• expcompare – compare ranks, threads, processes to each other, and more
• cviewcluster – creates groups of like-performing entities to identify outliers
• cview – output the columns of data representing groups created by cviewcluster
• cviewinfo – output which ranks, threads, etc. are in each cview group


Command Line Interface (CLI) Usage

v Command Line Interface features (additional)
Ø Format the performance information view as CSV
• expview -F csv
Ø Selectively view, compare, or analyze by rank, thread, or process
• -r <rank, rank list, or rank ranges>
• -t <thread, thread list, or thread ranges>
• -p <process, process list, or process ranges>
Ø Selectively view, compare, or analyze by specific metrics
• -m <metric or list of metrics>
• Example metrics for sampling: percent, time
• Example metrics for tracing: time, count, percent
Ø Selectively view by specific view type options
• -v <view type or list of view types>
• Example view types: functions, statements, loops, linkedobjects


Command Line Interface (CLI) Usage

Command Line Interface Examples
v openss -cli -f <database file>
v Commands to get started
Ø expstatus
• Gives the metadata information about the experiment
Ø expview
• Displays the default view for the experiment
• Use expview <experiment type>nn to see only nn lines of output
– expview pcsamp20 shows only the top 20 most time-consuming functions
• -v functions : displays data at function-level granularity
• -v statements : displays data at statement-level granularity
• -v linkedobjects : displays data at library-level granularity
• -v loops : displays data at loop-level granularity
• -v calltrees : displays call paths, combining like paths
• -v calltrees,fullstack : displays all unique call paths individually
• -m loadbalance : displays the min, max, and average values across ranks, …


Command Line Interface (CLI) Usage

Command Line Interface Examples
v openss -cli -f <database file>
v Commands to get started
Ø expview (continued from previous page)
• -v trace : for tracing experiments, display a chronological list of events
• -m <metric> : only display the metric(s) provided via the -m option
– Where metric can be: time, percent, a hardware counter, … (see list -v metrics)
• -r <rank id or list of rank ids> : only display data for that rank or ranks
• -t <thread id or list of thread ids> : only display data for that thread or threads
• -h <host id or list of host ids> : only display data for that host or hosts
• -F csv : display performance data as a comma-separated list
Ø expcompare : compare data within the same experiment
• -r 1 -r 2 -m time : compare rank 1 to rank 2 for the time metric
• -h host1 -h host2 : compare host 1 to host 2 for the default metric
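Conceptually, expcompare shows the same metric side by side for two ranks. A minimal Python sketch of that idea (a hypothetical illustration, not the O|SS implementation; the function names and timings are invented):

```python
def compare_ranks(rank_a, rank_b):
    """Return rows of (function, time_a, time_b), largest time first."""
    functions = set(rank_a) | set(rank_b)
    rows = [(f, rank_a.get(f, 0.0), rank_b.get(f, 0.0)) for f in functions]
    rows.sort(key=lambda r: max(r[1], r[2]), reverse=True)
    return rows

# Invented per-function exclusive times (seconds) for two ranks.
rank1 = {"hypre_SMGResidual": 3.92, "opal_progress": 0.31}
rank2 = {"hypre_SMGResidual": 3.75, "opal_progress": 0.45}

for func, t1, t2 in compare_ranks(rank1, rank2):
    print(f"{t1:12.6f} {t2:12.6f}  {func}")
```

A large gap between the two columns for the same function is the signal expcompare is designed to surface.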


Command Line Interface (CLI) Usage

Command Line Interface Examples
v openss -cli -f <database file>
v Commands to get started
Ø list
• -v metrics : display the data types (metrics) that can be displayed via -m
• -v src : display source files associated with the experiment
• -v obj : display object files associated with the experiment
• -v ranks : display ranks associated with the experiment
• -v hosts : display machines associated with the experiment
• -v exp : display the experiment numbers that are currently loaded
• -v savedviews : display the commands that are cached in the database


Viewing hwcsamp data in CLI

openss -cli -f smg2000-hwcsamp-1.openss

# View the default report for this hwcsamp experiment
openss>>[openss]: The restored experiment identifier is: -x 1
openss>>expview

Exclusive CPU time in seconds  % of CPU Time  PAPI_TOT_CYC  PAPI_FP_OPS  Function (defining location)
3.920000000  44.697833523  11772604888  1198486900  hypre_SMGResidual (smg2000: smg_residual.c,152)
2.510000000  28.620296465   7478131309   812850606  hypre_CyclicReduction (smg2000: cyclic_reduction.c,757)
0.310000000   3.534777651    915610917    48863259  opal_progress (libopen-pal.so.0.0.0)
0.300000000   3.420752566    910260309   100529525  hypre_SemiRestrict (smg2000: semi_restrict.c,125)
0.290000000   3.306727480    874155835    48509938  mca_btl_sm_component_progress (libmpi.so.0.0.2)


Viewing hwcsamp data in CLI

# View the linked object (library) view for this hardware counter sampling experiment
openss>>expview -v linkedobjects

Exclusive CPU time in seconds  % of CPU Time  PAPI_TOT_CYC  PAPI_FP_OPS  LinkedObject
7.710000000   87.315968290  22748513124  2396367480  smg2000
0.610000000    6.908267271   1789631493   126423208  libmpi.so.0.0.2
0.310000000    3.510758777    915610917    48863259  libopen-pal.so.0.0.0
0.200000000    2.265005663    521249939    46127342  libc-2.10.2.so
8.830000000  100.000000000  25975005473  2617781289  Report Summary
openss>>
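The "% of CPU Time" column is simply each row's exclusive time divided by the Report Summary total. A quick Python check using the numbers from the view above:

```python
# Reproduce the "% of CPU Time" column from the linked-object view:
# each object's exclusive CPU time divided by the Report Summary total.
times = {
    "smg2000": 7.71,
    "libmpi.so.0.0.2": 0.61,
    "libopen-pal.so.0.0.0": 0.31,
    "libc-2.10.2.so": 0.20,
}
total = sum(times.values())  # 8.83 s, the Report Summary row
percent = {obj: 100.0 * t / total for obj, t in times.items()}
print(f"{percent['smg2000']:.6f}")  # ~87.315968, matching the view
```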


Viewing I/O (iot) data in CLI

# View the default I/O report for this I/O experiment
openss -cli -f sweep3d.mpi-iot.openss
openss>>[openss]: The restored experiment identifier is: -x 1
openss>>expview

I/O Call Time(ms)  % of Total Time  Number of Calls  Function (defining location)
1.241909  90.077151  36  __write (libpthread-2.17.so)
0.076653   5.559734   2  close (libpthread-2.17.so)
0.035452   2.571376   2  read (libpthread-2.17.so)
0.024703   1.791738   2  open64 (libpthread-2.17.so)

# View the default trace (chronological list of I/O function calls) for this I/O experiment
openss>>expview -v trace

Start Time  I/O Call Time(ms)  % of Total Time  Function Dependent Return Value  File/Path Name  Event Identifier(s)  Call Stack Function (defining location)
2014/08/17 09:25:44.368  0.012356  0.896196  13  input       0:140166697882752  >>>>>>>>>>open64 (libpthread-2.17.so)
2014/08/17 09:25:44.368  0.027694  2.008679      input       0:140166697882752  >>>>>>>>>>>>>read (libpthread-2.17.so)
2014/08/17 09:25:44.377  0.053832  3.904500   0  input       0:140166697882752  >>>>>>>>>close (libpthread-2.17.so)
2014/08/17 09:25:44.378  0.012347  0.895543  13  input       0:140166697882752  >>>>>>>>>>open64 (libpthread-2.17.so)
2014/08/17 09:25:44.378  0.007758  0.562697  53  input       0:140166697882752  >>>>>>>>>>>>>read (libpthread-2.17.so)
2014/08/17 09:25:44.378  0.022821  1.655235   0  input       0:140166697882752  >>>>>>>>>close (libpthread-2.17.so)
2014/08/17 09:25:44.378  0.037219  2.699539  62  /dev/pts/1  0:140166697882752  >>>>>>>>>>__write (libpthread-2.17.so)


Viewing I/O (iot) data in CLI

# View the list of metrics (types of performance information) for this I/O experiment

openss>>list -v metrics
iot::average
iot::count
iot::exclusive_details
iot::exclusive_times
iot::inclusive_details
iot::inclusive_times
iot::max
iot::min
iot::nsysarg
iot::pathname
iot::retval
iot::stddev
iot::syscallno
iot::threadAverage
iot::threadMax
iot::threadMin
iot::time
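Several of these metrics (count, min, max, average, stddev) are simple aggregates over the individual call times. A hedged sketch of what they summarize, using invented per-call times (the population form of stddev is assumed here):

```python
import statistics

# Invented per-call wall-clock times (ms) for one I/O function.
call_times_ms = [0.012356, 0.027694, 0.053832, 0.012347]

metrics = {
    "count":   len(call_times_ms),               # iot::count
    "min":     min(call_times_ms),               # iot::min
    "max":     max(call_times_ms),               # iot::max
    "average": statistics.mean(call_times_ms),   # iot::average
    "stddev":  statistics.pstdev(call_times_ms), # iot::stddev (population form assumed)
}
print(metrics["count"], metrics["max"])
```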


Viewing I/O (iot) data in CLI

# View in chronological trace order: start_time, time, and the rank:thread the event occurred in.

openss>>expview -m start_time,time,id -v trace

Start Time(d:h:m:s)  Exclusive I/O Call Time(ms)  Event Identifier(s)  Call Stack Function (defining location)
2014/08/17 09:25:44.368  0.012356  0:140166697882752  >>>>>>>>>>open64 (libpthread-2.17.so)
2014/08/17 09:25:44.368  0.027694  0:140166697882752  >>>>>>>>>>>>>read (libpthread-2.17.so)
2014/08/17 09:25:44.377  0.053832  0:140166697882752  >>>>>>>>>close (libpthread-2.17.so)
2014/08/17 09:25:44.378  0.012347  0:140166697882752  >>>>>>>>>>open64 (libpthread-2.17.so)
2014/08/17 09:25:44.378  0.007758  0:140166697882752  >>>>>>>>>>>>>read (libpthread-2.17.so)
2014/08/17 09:25:44.378  0.022821  0:140166697882752  >>>>>>>>>close (libpthread-2.17.so)
2014/08/17 09:25:44.378  0.037219  0:140166697882752  >>>>>>>>>>__write (libpthread-2.17.so)
2014/08/17 09:25:44.378  0.018545  0:140166697882752  >>>>>>>>>>__write (libpthread-2.17.so)
2014/08/17 09:25:44.378  0.019837  0:140166697882752  >>>>>>>>>>__write (libpthread-2.17.so)
2014/08/17 09:25:44.379  0.035047  0:140166697882752  >>>>>>>>>>__write (libpthread-2.17.so)
…
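For post-processing outside the CLI, a trace line like those above can be split into its fields. This parser is a rough sketch that assumes the whitespace-separated layout shown on the slide:

```python
def parse_trace_line(line):
    """Split one trace line into (start_time, time_ms, identifier, function)."""
    date, clock, time_ms, ident, rest = line.split(None, 4)
    func = rest.lstrip(">")  # drop the call-depth chevrons
    return f"{date} {clock}", float(time_ms), ident, func

line = ("2014/08/17 09:25:44.368 0.012356 0:140166697882752 "
        ">>>>>>>>>>open64 (libpthread-2.17.so)")
start, ms, ident, func = parse_trace_line(line)
print(func)  # open64 (libpthread-2.17.so)
```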


Viewing I/O (iot) data in CLI

# View the load balance (max, min, and average values) across all ranks, threads, or processes.

openss>>expview -m loadbalance

Max I/O Call Time Across Ranks(ms)  Rank of Max  Min I/O Call Time Across Ranks(ms)  Rank of Min  Average I/O Call Time Across Ranks(ms)  Function (defining location)
1.241909  0  1.241909  0  1.241909  __write (libpthread-2.17.so)
0.076653  0  0.076653  0  0.076653  close (libpthread-2.17.so)
0.035452  0  0.035452  0  0.035452  read (libpthread-2.17.so)
0.024703  0  0.024703  0  0.024703  open64 (libpthread-2.17.so)
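What the loadbalance view computes per function — min, max, and average of the metric across ranks, plus the rank holding each extreme — can be sketched as follows (the multi-rank timings are invented; the run above has only rank 0, so its max, min, and average coincide):

```python
def load_balance(times_by_rank):
    """times_by_rank: {rank: time_ms}. Return (max, rank_of_max, min, rank_of_min, avg)."""
    rmax = max(times_by_rank, key=times_by_rank.get)
    rmin = min(times_by_rank, key=times_by_rank.get)
    avg = sum(times_by_rank.values()) / len(times_by_rank)
    return times_by_rank[rmax], rmax, times_by_rank[rmin], rmin, avg

# Hypothetical __write times per rank.
write_times = {0: 1.241909, 1: 0.98, 2: 1.43}
print(load_balance(write_times))
```

A large max-to-min spread for one function is the classic sign of load imbalance worth investigating.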

# View data for only rank nn, in this case rank 0

openss>>expview -r 0

I/O Call Time(ms)  % of Total Time  Number of Calls  Function (defining location)
1.241909  90.077151  36  __write (libpthread-2.17.so)
0.076653   5.559734   2  close (libpthread-2.17.so)
0.035452   2.571376   2  read (libpthread-2.17.so)
0.024703   1.791738   2  open64 (libpthread-2.17.so)


Viewing I/O (iot) data in CLI

# View the top time-consuming call tree in this application run. "iot1" means show only one call stack; "iot<number>" shows that number of call trees.

openss>>expview -v calltrees,fullstack iot1

I/O Call Time(ms)  % of Total Time  Number of Calls  Call Stack Function (defining location)
_start (sweep3d.mpi)
> @ 562 in __libc_start_main (libmonitor.so.0.0.0: main.c,541)
>>__libc_start_main (libc-2.17.so)
>>> @ 517 in monitor_main (libmonitor.so.0.0.0: main.c,492)
>>>>0x4026a2
>>>>> @ 185 in MAIN__ (sweep3d.mpi: driver.f,1)
>>>>>> @ 41 in inner_auto_ (sweep3d.mpi: inner_auto.f,2)
>>>>>>> @ 128 in inner_ (sweep3d.mpi: inner.f,2)
>>>>>>>>_gfortran_st_write_done (libgfortran.so.3.0.0)
>>>>>>>>>0x308db56f
>>>>>>>>>>0x308e65af
>>>>>>>>>>>0x308df8c8
0.871600  63.218195  12  >>>>>>>>>>>>__write (libpthread-2.17.so)


Command Line Interface (CLI) Usage

Open|SpeedShop CLI output in CSV form for spreadsheet use.
v Create an experiment database
EXE=./a.out
export OPENSS_HWCSAMP_EVENTS="PAPI_VEC_DP,FP_COMP_OPS_EXE:SSE_FP_PACKED_DOUBLE"
osshwcsamp "/usr/bin/srun -N 1 -n 1 $EXE"
v Open the database file and use expview -F csv to create a CSV file
openss -cli -f a.out-hwcsamp.openss
openss>>expview -F csv > mydata.csv
v Create the CSV file using a script
echo "exprestore -f a.out-hwcsamp.openss" > ./cmd_file
echo "expview -F csv > mydata.csv" >> ./cmd_file
echo "exit" >> ./cmd_file
#
# Run the openss utility to output the CSV rows.
#
openss -batch < ./cmd_file
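Once mydata.csv exists, it can be post-processed with any CSV reader before (or instead of) importing it into a spreadsheet. A small sketch — the column names here are assumed for illustration, not the exact layout expview emits:

```python
import csv
import io

# Stand-in for the contents of mydata.csv; real expview output has more columns.
csv_text = """time,percent,function
3.92,44.697834,hypre_SMGResidual
2.51,28.620296,hypre_CyclicReduction
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
total = sum(float(r["time"]) for r in rows)
print(f"{total:.2f} s in the top {len(rows)} functions")
```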


GUI export of data to csv form

Getting Open|SpeedShop output into CSV form for spreadsheet use.

openss -f stream.x-hwcsamp-1.openss
Go to the Stats Panel menu
Select the "Report Export Data (csv)" option
In the dialog box, provide a meaningful name such as stream.x-hwcsamp-1.csv
The data will be saved to the file you specified above


Command Line Interface (CLI) Usage

Storing the CSV information in a spreadsheet
v Open your spreadsheet.
v Either
Ø Select Open from the pulldown menu and import stream.x-hwcsamp-1.csv, or
Ø Cut and paste the contents of stream.x-hwcsamp-1.csv into the spreadsheet
v Use the "Data" operation and then "Text to columns"
v Select "comma" to separate the columns
v Save the spreadsheet file


Command Line Interface (CLI) Usage

Plotting the performance info from the spreadsheet (LibreOffice instructions)
v Open your spreadsheet.
v Select Insert, then choose Chart
v Choose a chart type
v Choose data range choices (data series in columns)
v Choose data series (nothing to do here)
v The chart is created. Views can be changed.


How to Analyze the Performance of Parallel Codes with Open|SpeedShop

Section 4 Preferences


Changing Preferences in O|SS

Selecting File->Preferences to customize O|SS

Select Preferences from the File menu


Changing Preferences in O|SS

Enabling the new Command Line Interface (CLI) save/reuse views feature. Looking for friendly-user evaluation.

Click here to enable the save and reuse (cache) CLI views


How to Analyze the Performance of Parallel Codes with Open|SpeedShop

Section 5 What is New / Road Map / Future Work


What are the recent changes to O|SS
v New features and improvements in O|SS
Ø OpenMP idle/wait time augmentation to the sampling experiments
Ø Added loop-granularity displays
Ø View caching in the CLI and GUI
• CLI caching is cross-session; views are saved in the database
• GUI caching is only in the current session; it also caches metadata requests
Ø Speed-ups in processing symbols when creating the database
Ø MPI file I/O function tracing
Ø Metric expression support in the CLI (create new data columns)
Ø Initial Spack build support
Ø Conversion to CMake builds for OpenSpeedShop and CBTF from-source builds


What are the recent changes to O|SS
v Component Based Tool Framework (CBTF)
Ø Roll-out of a new version of O|SS that uses a tree-based network (MRNet)
• Transfers data over the network; does not write files like the offline version
• Can do data reduction (in parallel) as the data is streamed up the tree
Ø Five new experiments implemented in this version
• Lightweight I/O profiling (iop)
• Lightweight MPI profiling (mpip)
• Threading experiments (pthreads)
• Memory usage analysis (mem)
• GPU/Accelerator support (cuda)


Road Map: O|SS in Transition

v Largest bottleneck in existing O|SS
Ø Data handling from instrumentor to analysis
Ø Need a hierarchical approach to conquer large scale
Ø A generic requirement for a wide range of tools
v Component Based Tool Framework (CBTF) project
Ø Generic tool and data transport framework
• Create "black box" components
• Connect these components into a component network
• Distribute these components and/or component networks across a tree-based transport mechanism (MRNet)
Ø Goal: reuse of components
• Aimed at giving the ability to rapidly prototype performance tools
• Create specialized tools to meet the needs of application developers
Ø http://sourceforge.net/p/cbtf/wiki/Home/


CBTF Structure and Dependencies

§ Tool-type independent
• Performance Tools
• Debugging Tools
• etc.
§ Completed components
• Base library (libcbtf)
• XML-based component networks (libcbtf-xml)
• MRNet distributed components (libcbtf-mrnet)
§ Ability for online aggregation
• Tree-based communication
• Online data filters
• Configurable

[Diagram: cbtf-gui and tool clients built atop libcbtf-tcpip, libcbtf-mrnet, libcbtf-xml, and libcbtf, with Qt, MRNet, Xerces-C, and Boost as dependencies, connected to the MPI application]



Open|SpeedShop and CBTF

v CBTF will be/is the new platform for O|SS
Ø More scalability
Ø Aggregation functionality for new experiments
Ø Built with a new "instrumentor class"
v Existing experiments will not change
Ø Same workflow
Ø Same convenience scripts
v New experiments enabled through CBTF
Ø Lighter-weight I/O profiling (iop)
Ø Lighter-weight MPI profiling (mpip)
Ø Threading experiments (pthreads)
Ø Memory usage analysis (mem)
Ø GPU/Accelerator support (cuda)


Open|SpeedShop and CBTF

New functionality being worked on now.
v An OpenMP-specific experiment enabled through CBTF
Ø Uses the OMPT interface being proposed as an OpenMP standard
v Intel MIC and ARM platform support for CBTF
v Filtering (data reduction, analysis) in the MRNet communication nodes
Ø Faster views as data is mined in parallel
v Changes in the CLI to handle the reduced performance data


New GUI – not funded

v Modular GUI framework based on Qt4
Ø Base for multiple tools
Ø Ability for remote connections


Open|SpeedShop and CBTF

Ideas for future improvements
v Power monitoring through PAPI
v Possible NUMA experiment
v Filtering for existing experiments to reduce data and focus on interesting events
v Program logical phase markers with the capability to show iteration information


Sandia chama/skybridge Availability

v OSS/Offline version
Ø tools/openss-openmpi
• For MPI experiments when the application is built with OpenMPI
Ø tools/openss-mvapich2
• For MPI experiments when the application is built with MVAPICH2
v OSS/CBTF version
Ø tools/oss-cbtf-openmpi
• For MPI experiments when the application is built with OpenMPI


LANL Availability

v OSS/Offline version
Ø module load friendly-testing openspeedshop/2.2

v OSS/CBTF version
Ø ?


LLNL Availability
v Open|SpeedShop system-wide dotkit files
v OSS/Offline version (default version)
Ø use openss
Ø use openss-mvapich2*

v OSS/CBTF version (experimental version)
Ø use cbtf
Ø use cbtf-mvapich2*

*For MPI-specific experiments (mpi, mpit, mpiotf) when the application is built to run with mvapich2. The default version is set to mvapich.


Availability
v Current version: 2.2 has been released
v Open|SpeedShop website
Ø http://www.openspeedshop.org/
v Open|SpeedShop help and bug reporting
Ø Direct email: [email protected]
Ø Forum/Group: oss-ques[email protected]
v Feedback
Ø Bug tracking available from the website
Ø Feel free to contact the presenters directly
Ø Support contracts and onsite training available


Open|SpeedShop Documentation

v Build and Installation Instructions
Ø http://www.openspeedshop.org/documentation
• Look for: Open|SpeedShop Version 2.2 Build/Install Guide
v Open|SpeedShop User Guide Documentation
Ø http://www.openspeedshop.org/documentation
• Look for: Open|SpeedShop Version 2.2 Users Guide
v Man pages: OpenSpeedShop, osspcsamp, ossmpi, …
v Quick start guide downloadable from the website
Ø http://www.openspeedshop.org
Ø Click on the "Download Quick Start Guide" button
