Performance Programming: Theory, Practice and Case Studies

Module I: Measuring Program Performance

Outline

- Measuring methodology and guidelines
- Measurement tools
- Timing tools
- Profiling tools
- Process monitoring and tracing tools
- System monitoring tools
- Hardware counter measurements
- Monitoring tools
- Code instrumentation
- Parallel performance measurements
- Guidelines and recommendations
- Tools for parallel monitoring
- Summary

Measurement Methodology

- Quantifying performance is the first step in the application tuning process
- Important to set reasonable expectations for optimization
- Measurements should be made repeatedly to identify the parts of the program that need to be optimized
- Proper choice of measurement characteristics suitable for a particular application
- Comparison of measurements to theoretical peak values

What to Measure

- Timing measurements
  - Wall clock time for a single job (turnaround time)
  - Wall clock time for multiple jobs (throughput measurements)
  - Wall clock time for parallel runs (scalability measurements)
- Execution and computation rates
  - MFLOPS (millions of floating-point operations per second)
  - MIPS (millions of instructions per second)
  - IPC (instructions per cycle)
- Resource utilization
  - Memory usage
  - I/O utilization
  - Network usage

Benchmarking Guidelines

- Benchmark runs should adequately represent the use of the application
- Preferably, only one parameter should change at a time
- The overhead of measurement should be considered
- Runs should be made from tmpfs or from a locally mounted UFS file system
- System activities should be monitored
- The systems should not have any other computational jobs running during benchmarking
- System parameters and settings should be documented together with the results of the runs

Measurement Tools

- Functionality
  - Timing tools
  - Profiling tools
  - Monitoring tools
- Usage requirements
  - Tools that can operate on optimized binaries
  - Tools that require recompilation
  - Tools that require instrumentation
- Parallel / serial measurement tools
  - Tools measuring serial performance
  - Tools measuring parallel performance

Timing Entire Program

- Measuring the elapsed (wall-clock) time that passes during program execution
- Example: Solaris time, timex, and ptime

Timing Program Portions

- Fortran 77: etime, dtime (neither is thread safe)
- C, C++, Fortran 90/95: gethrtime
  - High-resolution timer (nanoseconds)
  - Can be called via a C wrapper from Fortran 77
  - Can be used for multithreaded applications
- Platform-specific tools and methods
  - Solaris microstate accounting
  - Fine-grain timing measurements by accessing the UltraSPARC TICK register directly

    .inline readtick,1
    rd %tick, %o1
    stx %o1, [%o0]
    .end

Measurement Overhead

Computing overhead

    #include <stdio.h>
    #include <sys/time.h>   /* gethrtime(), hrtime_t (Solaris) */
    int main(void)
    {
        hrtime_t start, end;    /* gethrtime() returns hrtime_t, not time_t */
        int i, iters = 100000;
        for (i = 0; i < iters; i++) {
            start = gethrtime();
            end = gethrtime();
            /* back-to-back calls: the difference is the cost of one call */
            (void)printf("%lld \n", (long long)(end - start));
        }
        return 0;
    }

[Figure: Distribution of gethrtime() call overhead. x-axis: call overhead (ns), in 5 ns bins from 180-185 to 225-230; y-axis: 0 to 22500.]

Program Profiling with gprof

- Application profiling
  - Special form of timing measurement that shows which functions account for large parts of the application run time
  - Should be used on multiple, representative test cases
- gprof - standard UNIX profiling utility
  - Can be used for profiling executables and shared libraries
  - Based on program counter (PC) sampling at periodic intervals
  - Requires recompilation with -pg (Linux, Solaris, Tru64) or -G (HP-UX)
  - After the run, the data is collected in the gmon.out file
  - Profiling results are displayed with the gprof command

gprof Output

- Output includes
  - Absolute time spent in a function
  - Percentage of total run time spent in a function
  - Number of calls to the function
  - Average time per call
- Functions can be sorted by
  - the time they consume together with their descendants (cumulative or inclusive time)
  - the time spent executing the function itself (self or exclusive time)

      %   cumulative     self                self     total
     time    seconds  seconds     calls   ms/call   ms/call  name
     66.4      65.70    65.70    186116      0.35      0.35  dmmch_ [4]
     15.2      80.72    15.02     20448      0.73      0.73  dmake_ [8]
     10.9      91.51    10.79     16924      0.64      0.64  dgemm_ [9]
     ...

Profiling Using Coverage Analysis

- Coverage analysis tools annotate source code with the number of times each line was executed
  - Basic block profiling
  - Results can be accumulated over multiple runs
  - Information about hot loops in the code and branches taken for quality assurance

                       DO 350 L = LL, LL+LSEC-1
    150483840 ->          F11 = F11 + T1( L-LL+1, I-II+1 )*
                 $              T2( L-LL+1, J-JJ+1 )

- Available on UNIX platforms
  - Linux/GNU:
  - Solaris:
  - IRIX: cvcov, cvxcov
  - Tru64: pixie
  - AIX: tprof

Advanced Profiling Tools

- Measurement parameters and features
  - Measurements based on hardware counters
  - Profiling by
    - functions
    - basic blocks
    - lines of high-level code
    - assembly instructions
  - Source code annotation
  - Capabilities to work with parallel programs: synchronization overhead, load balancing monitoring
- Available tools

    Tool        Vendor           Platforms
    VTune       Intel            NT
    Analyzer    Sun              Solaris
    SpeedShop   SGI              IRIX
    DCPI        DEC/Compaq/HP    Tru64, NT

Example: Sun Performance Analyzer (1 of 3)

Profiling by function and module (no recompilation)

Example: Sun Performance Analyzer (2 of 3)

Annotated source (recompilation with -g) and disassembly (no recompilation)

Example: Sun Performance Analyzer (3 of 3)

Hardware counter overflow profiling

Process Monitoring Tools

- Tracing tools
  - Linux: strace (ltrace for dynamic library calls)
  - Solaris: truss (sotruss for dynamic library calls)
  - IRIX: par
  - Tru64: atom -tool ptrace
- procfs-based tools
  - pmap: prints the address space of the program
  - pldd: lists the dynamic shared objects linked into the process (including ones explicitly attached using dlopen)
  - pstack: prints a stack trace for each LWP in the process
  - pflags: prints the /proc tracing flags
  - ptree: prints the process trees containing the specified pids or users
  - pwait: waits for the specified processes to terminate
  - pcred: prints the credentials (effective, real, and saved UIDs and GIDs)

Example: profiling system calls

- truss on Solaris
  - Reports the number of system calls made by a process and the associated time

System Monitoring Tools

- Tools for various UNIX platforms
  - vmstat, vm_stat, memvis - virtual memory and CPU statistics
  - mpstat, mpvis - parallel memory/CPU statistics
  - netstat, nfsstat, nfsvis - network status and statistics
  - iostat, dkvis - I/O statistics
  - sar - system activity report
  - top, prstat - list of most active processes
  - systat - system activity stats
  - lockstat - kernel lock statistics
  - dkstat - file status information

vmstat - Virtual Memory Statistics

- Available on HP-UX, Tru64, Solaris, Linux, FreeBSD, etc.
- Example on Alpha/Tru64

[Figure: annotated vmstat output on an idle system, with column groups labeled memory usage, paging activity, and CPU usage.]

Hardware Counter Measurements

- Hardware performance counters allow low-overhead runtime measurement of various hardware events
  - Cache references
  - Cache misses
  - Pipeline stalls
  - Branch misprediction statistics
  - D-TLB (Data Translation Lookaside Buffer) misses
  - I-TLB (Instruction Translation Lookaside Buffer) misses
  - Bus statistics, including DMA and cache coherency transactions on multiprocessor systems
  - Others
- Only a few events can be monitored at the same time

Code Instrumentation

- APIs can be used directly in the code
  - High-resolution timing of performance-critical parts of the program
  - Access to hardware performance counters
- Example (Solaris)

    /* sample the counters before the region of interest */
    if (cpc_take_sample(&before) == -1) exit(-1);
    for (k = 0; k < N-1; k++)
        sum = sum + a[k]*b[k];
    /* sample again; the difference gives the event counts for the loop */
    if (cpc_take_sample(&after) == -1) exit(-1);

- Counters are specified by setting the PERFEVENTS environment variable

    example% setenv PERFEVENTS pic0=Load_use,pic1=Load_use_RAW

- Works on UltraSPARC CPUs

Parallel Measurement Methodology

- Same guidelines as in the serial case
  - Parallel benchmarks should be representative of typical uses of the application
  - Benchmarking must be performed so as to ensure repeatable and consistent results
  - Probe effects and tool overheads should be minimized
- Specifics of parallel benchmarking
  - Parallelism vs. concurrency
  - Dedicated mode of benchmarking
  - Number of processors
  - Choice of timer and time criterion
  - Processor-set configuration
  - Processor allocation in clusters

Timing a Parallel Threaded Program

timex can be used for parallel timing

Note that the real time decreases, while the user time, which represents the combined CPU usage of all threads, stays constant

Specific Parallel Timers

- Timing MPI programs
  - The time or timex timers can be used in combination with MPI job submission commands (mprun, mpirun, etc.)
  - For timing portions of an MPI program, one can use the MPI_Wtime function, available in the Fortran, C, and C++ bindings (typically highly accurate)
- Threaded applications can use gethrvtime (Solaris, Tru64 with Solaris Compatibility Library)
  - Shows the user time on a per-thread basis
  - Can be used in combination with gethrtime, which returns the elapsed real (wall-clock) time on a per-thread basis

Parallel System Monitoring

- mpstat - multiprocessor monitoring

[Figure: annotated mpstat output. Labeled columns: CPU ID, cross calls, interrupts, context switches, thread migrations, mutex info, system calls, CPU usage. The first snapshot is the average since boot; the following snapshots are sample measurements.]

Kernel Lock Statistics

- Tools that report kernel lock statistics
  - lockstat - Solaris, IRIX, AIX, Linux
  - lockinfo - Tru64
- Allows one to specify what events to monitor
  - spin on adaptive mutex
  - block on read access to rwlock due to waiting writers
- On some platforms generates gprof-like output

    # lockstat -IWk example_tnf 24
    ...
    Profiling interrupt: 151649 events in 130.282 seconds (1164 events/sec)

    Count indv cuml rcnt     nsec Hottest CPU+PIL  Caller
    --------------------------------------------------------------
    85698  57%  57% 1.00      188 cpu[12]          mutex_vector_enter
    14247   9%  66% 1.00      160 cpu[9]+10        disp_getwork
    12792   8%  74% 1.00      746 cpu[14]          mutex_tryenter
    10359   7%  81% 1.00      280 cpu[5]           (usermode)
     1951   1%  82% 1.00       59 cpu[1]           splx
     1648   1%  84% 1.00      365 cpu[5]+10        _resume_from_idle
     1510   1%  85% 1.00      490 cpu[9]+10        disp
     1259   1%  85% 1.00      255 cpu[15]+10       setfrontdq

Binding a Program to a Set of Processors

- Process monitoring can be difficult on multiprocessor systems due to process migration
- Single-threaded programs
  - One can bind the program to a processor
- Multithreaded programs
  - One can use processor sets
- Commands to set up and use processor sets
  - psrset (HP-UX, Solaris)
  - pset (IRIX)
  - pset_create, pset_assign_cpu, pset_assign_pid, etc. (Tru64)

Summary

- Monitoring performance is essential to optimization
  - If you cannot measure it, you cannot improve it
- Important to select benchmarks carefully and identify parameters to measure
- Select tools suitable for the task
  - System-wide or process-specific?
  - Parallel or serial?
  - Require recompilation or instrumentation?
  - Need source-level information?
  - Need hardware counter information?
