Performance Programming: Theory, Practice and Case Studies
Module I: Measuring Program Performance

Outline
  - Measuring methodology and guidelines
  - Measurement tools
      - Timing tools
      - Profiling tools
      - Process monitoring and tracing tools
      - System monitoring tools
  - Hardware counter measurements
      - Monitoring tools
      - Code instrumentation
  - Parallel performance measurements
      - Guidelines and recommendations
      - Tools for parallel monitoring
  - Summary

Measurement Methodology
  - Quantifying performance is the first step in the application tuning process
  - Important to set reasonable expectations for optimization
  - Measurements should be made repeatedly to identify parts of the program that need to be optimized
  - Proper choice of measurement characteristics suitable for a particular application
  - Comparison of measurements to theoretical peak values

What to Measure
  - Timing measurements
      - Wall clock time for a single job (turnaround time)
      - Wall clock time for multiple jobs (throughput measurements)
      - Wall clock time for parallel runs (scalability measurements)
  - Execution and computation rates
      - MFLOPS (million floating point operations per second)
      - MIPS (million instructions per second)
      - IPC (instructions per cycle)
  - Resource utilization
      - Memory usage
      - I/O utilization
      - Network usage

Benchmarking Guidelines
  - Benchmark runs should adequately represent the use of the application
  - Preferably only one parameter changing at a time
  - Overhead of measurement should be considered
  - Runs from tmpfs or from a locally mounted ufs
  - System activities should be monitored
  - The systems should not have any other computational jobs running during benchmarking
  - System parameters and settings should be documented together with
    the results of the runs.

Measurement Tools
  - Functionality
      - Timing tools
      - Profiling tools
      - Monitoring tools
  - Usage requirements
      - Tools that can operate on optimized binaries
      - Tools that require recompilation
      - Tools that require source code instrumentation
  - Parallel / serial measurement tools
      - Tools measuring serial performance
      - Tools measuring parallel performance

Timing Entire Program
  - Measuring the elapsed (wall-clock) time that passes during the program execution
  - Example: Solaris time, timex, and ptime

Timing Program Portions
  - Fortran 77: etime, dtime (neither is thread safe)
  - C, C++, Fortran 90/95: gethrtime
      - High resolution timer (nanoseconds)
      - Can be called via a C wrapper from Fortran 77
      - Can be used for multithreaded applications
  - Platform-specific tools and methods
      - Solaris microstate accounting
      - Fine-grain timing measurements by accessing the UltraSPARC TICK register directly

            .inline readtick,1
            rd %tick, %o1
            stx %o1, [%o0]
            .end

Measurement Overhead
  - Computing overhead

        #include <stdio.h>
        #include <sys/time.h>   /* gethrtime(), hrtime_t (Solaris) */

        int main(void)
        {
            hrtime_t start, end;   /* gethrtime() returns hrtime_t (ns), not time_t */
            int i, iters = 100000;

            for (i = 0; i < iters; i++) {
                start = gethrtime();
                end = gethrtime();
                (void)printf("%lld\n", (long long)(end - start));
            }
            return 0;
        }

  [Figure: Distribution of gethrtime() call overhead (ns). Histogram with bins from 180-185 ns to 225-230 ns; vertical axis from 0 to 22500.]

Program Profiling with gprof
  - Application profiling
      - Special form of timing measurements that shows which functions account for large parts of application runtimes
      - Should be used on multiple and representative test cases
  - gprof - standard
    UNIX profiling utility
      - Can be used for profiling executables and shared libraries
      - Based on Program Counter (PC) sampling at periodic intervals
      - Requires recompilation with -pg (Linux, Solaris, Tru64) or -G (HP-UX)
      - After the run, the data is collected in the gmon.out file
      - Profiling results are displayed with the gprof command

gprof Output
  - Output includes
      - Absolute time spent in a function
      - Percentage of total run time spent in a function
      - Number of calls to the function
      - Average time per call
  - Functions can be sorted by
      - time they consume together with their descendants (cumulative or inclusive time)
      - time spent executing the function itself (self or exclusive time)

      %   cumulative     self                self     total
   time      seconds  seconds     calls   ms/call   ms/call  name
   66.4        65.70    65.70    186116      0.35      0.35  dmmch_ [4]
   15.2        80.72    15.02     20448      0.73      0.73  dmake_ [8]
   10.9        91.51    10.79     16924      0.64      0.64  dgemm_ [9]
   ...

Profiling Using Coverage Analysis
  - Coverage analysis tools annotate source code with the number of times each line was executed
      - Basic block profiling
      - Results can be accumulated for multiple runs
      - Information about hot loops in the code and branches taken
      - Code coverage for quality assurance

                        DO 350 L = LL, LL+ LSEC- 1
        150483840 ->       F11 = F11 + T1( L- LL+ 1, I- II+ 1 )*
                     $           T2( L- LL+ 1, J- JJ+ 1 )

  - Available on UNIX platforms
      - Linux/GNU: gcov
      - Solaris: tcov
      - IRIX: cvcov, cvxcov
      - Tru64: pixie
      - AIX: tprof

Advanced Profiling Tools
  - Measurement parameters and features
      - Measurements based on hardware counters
      - Profiling by
          - functions
          - basic blocks
          - lines of high level code
          - assembly instructions
      - Source code annotation
      - Capabilities to work with parallel programs
          - synchronization overhead, load balancing monitoring
  - Available tools

        Tool        Vendor            Platforms
        VTune       Intel             NT
        Analyzer    Sun               Solaris
        SpeedShop   SGI               IRIX
        DCPI        DEC/Compaq/HP     Tru64, NT

Example: Sun Performance Analyzer (1 of 3)
  - Profiling by function and module (no recompilation)

Example: Sun Performance Analyzer (2 of 3)
  - Annotated source (recompilation with -g) and disassembly (no recompilation)

Example: Sun Performance Analyzer (3 of 3)
  - Hardware counter overflow profiling

Process Monitoring Tools
  - Tracing tools
      - Linux: strace (ltrace for dynamic library calls)
      - Solaris: truss (sotruss for dynamic library calls)
      - IRIX: par
      - Tru64: atom -tool ptrace
  - procfs-based tools
      - pmap: prints the address space of the program
      - pldd: lists the dynamic shared objects linked into the process (including ones explicitly attached using dlopen)
      - pstack: prints a stack trace for each LWP in the process
      - pflags: prints the /proc tracing flags
      - ptree: prints process trees containing specified pids or users
      - pwait: waits for specified processes to terminate
      - pcred: prints the credentials (effective, real, saved UIDs and GIDs)

Example: profiling system calls
  - truss on Solaris
  - Reports the number of system calls for a process and associated time

System Monitoring Tools
  - Tools for various UNIX platforms
      - vmstat, vm_stat, memvis - virtual memory and CPU statistics
      - mpstat, mpvis - parallel memory/CPU statistics
      - netstat, nfsstat, nfsvis - network status and statistics
      - iostat, dkvis - I/O statistics
      - sar - system activity report
      - top, prstat - list of most active processes
      - systat -
        system activity stats
      - lockstat - kernel lock statistics
      - dkstat - file status information

vmstat - Virtual Memory Statistics
  - Available on HP-UX, Tru64, Solaris, Linux, FreeBSD, etc.
  - Example on Alpha/Tru64

  [Figure: vmstat output annotated with Memory Usage, Paging Activity, and CPU Usage (Idle, System).]

Hardware Counter Measurements
  - Hardware performance counters allow low-overhead runtime measurement of various hardware events
      - Cache references
      - Cache misses
      - Pipeline stalls
      - Branch misprediction statistics
      - D-TLB (Data Translation Lookaside Buffer) misses
      - I-TLB (Instruction Translation Lookaside Buffer) misses
      - Bus statistics, including DMA and cache coherency transactions on multiprocessor systems
      - Others
  - Only a few events can be monitored at the same time

Code Instrumentation
  - APIs can be used directly in the code
      - High-resolution timing of performance-critical parts of the program
      - Access to HW performance counters
  - Example (Solaris)

        if (cpc_take_sample(&before) == -1)
            exit(-1);
        for (k = 0; k < N-1; k++)
            sum = sum + a[k]*b[k];
        if (cpc_take_sample(&after) == -1)
            exit(-1);

  - Counters are specified by setting the PERFEVENTS environment variable

        example% setenv PERFEVENTS pic0=Load_use,pic1=Load_use_RAW

  - Works on UltraSPARC CPUs