0138134553_Gove.book Page 447 Tuesday, December 4, 2007 2:29 PM

Index

Numbers activity reports. See system status reporting 32-bit applications, file handling in, 364–367 addition operations, 9 32-bit architecture, specifying in compiler, address, memory, 16, 18 103 address space for process, reporting on, 32-bit code on UltraSPARC processors, 30 78–79 64-bit architecture, specifying in compiler, addressing modes, SPARC, 24–25 103 aggressively optimized code, compiler 286 processor, 39 options for, 93–95. See also Sun 386 processor, 39 Studio compiler, using SSE instruction sets on, 95–96 algorithmic complexity, 437–442 -xtarget=generic compiler option, 99 alias_level pragma, 143–144 80286 processor, 39 alias pragma, 144–145 80386 processor, 39 aliasing control pragmas, 142–147 SSE instruction sets on, 95–96 aliasing of pointers. See pointer aliasing in -xtarget=generic compiler option, 99 and C++ 8086 processor, 39 align pragma, 136–137 -aligncommon=16 compiler option, 101 alignment of variables, specifying, A 135, 136–137 ,a adornment (SPARC instruction), 25 AMD64 processors, 40 .a files (static libraries), 182 compiler performance, 105 access to memory Amdahl’s Law, 377, 440, 443 error detection for, 261–262, 274 analyzer GUI, 207, 210–212 metrics on (example), 288–290 AND operation (logical), 9 accessing matrix data, 347–348 application profiles. See profiling active_cycle_count counter, 310 performance

447 0138134553_Gove.book Page 448 Tuesday, December 4, 2007 2:29 PM

448 Index

applications ax_miss_count_dm counter, 310 calls from, reporting on, 80–81 ax_miss_count_dm_if counter, 310 compiler options for. See Sun Studio ax_read_count_dm counter, 310 compiler, using debugging, 265–271 running under dbx, 268–271 B disassembling, 89 -B compiler option, 183 looking for process IDs for, 61–62 bandwidth, memory, 14–15 multithreaded. See multithreading source code optimizations, 326–327 processes as, 371. See also processes synthetic metrics for, 292–293 reporting on, 84–92 bandwidth, system bus, 15, 308 disassembles, examining, 89 base pointer (x86 code), 43, 44, 264 file contents, type of, 86–87 bcheck command, 256–257 library linkage, 84–86 big-endian vs. little-endian systems, 41–42 library version information, 87–89 binding, controlling processor use with, metadata in files, 90–92 52–53 segment size, reporting, 90 BIT tool, 224–226 symbols in files, 87 bounds checking, 256–257 runtime linking of, information on, runtime array bounds (Fortran), 259 191–192 Br_completed counter, 308 segments in, reporting size of, 90 Br_taken counter, 308 selecting target machine type for, branch delay slot, 25 103–107 branch instructions, 25 tuning, 442–444. See also optimization; events, performance counters for, profiling performance 300–301, 315–316 ar tool, 184 branch pipes, 7, 9–10 architecture, specifying with compiler, UltraSPARC III processors, 31 99, 104, 105, 106–107 branches archives (static libraries), creating, 184 if statements, 357–364 arguments passed to process, reporting on, loop unrolling and, 320–321 79 mispredicted branches, 10, 107 arrays, VIS instructions for searching, performance counters for, 203–205 300–301, 315–316 assembly language (machine code), 18–19. predictors, 10, 358 See also ISA (information set breakpoints (dbx), setting, 269 architecture) bsdmalloc library, 193–194 assertion flags, 443 buses, 15–16 associativity, cache, 12–13 reporting activity of, 76 atomic directive, 432–433 busstat command, 76–77, 308 atomic operations, 395–396 bypass (storing data to memory), 299, 352 ATS (Automatic Tuning and byte ordering, x64 processors, 41–42 Troubleshooting System), 271–274 attributed time for routines, 214 audit interface, linker, 192 C automatic parallelization, 408–409, 424–425 C- and C++-specific compiler optimizations, Automatic Tuning and Troubleshooting 123–135 System (ATS), 271–274 -xalias_level compiler option, autoscoping, 404–405 127–133, 409 0138134553_Gove.book Page 449 Tuesday, December 4, 2007 2:29 PM

Index 449

-xalias_level=basic option, atomic operations, 395–396 129–130, 426 mutexes with, 394 -C compiler option, 259 optimizing for, 446 C99 floating-point exceptions, 169–170 parallelization, 377 cache configuration, setting, 104, 105 code checking and debugging, cache latency (memory latency), 8–9, 34–35 244–245, 247–276, 274–276 measuring, 288–290, 333–334 ATS for locating optimization bugs, source code optimizations, 333–337 271–274 fetching integer data, 328 compile-time checking, 248–255 synthetic metrics for, 290–292 compiler options for code checking caches, 4, 11–14 C language, 248, 250 access metrics (example), 288–290 C++ language, 250, 251–252 clock speed and, 5 compiler options for debugging, 93–95 data cache events, 283–285 dbx debugger, 262–271 data cache lines, showing, 229 commands mapped to mdb, 276 instruction cache events, 283–285 compiler options for debugging, 262 L3 cache, 302–303 debug information format, 263–264 link time optimization, 115 debugging application (example), miss events, cycles lost to, 287 265–268 prefetch cache, 293–295 running applications under, 268–271 prefetch for cache lines, 335–337 running on core files, 264–265 thrashing in, 12–13, 349–353 including debug information when UltraSPARC III and IV processors, 31–33 compiling, 102–103 call instructions, SPARC, 25 mdb debugger, 274–276 call stack. See stack multithreaded code, 413–416 caller–callee information, 212–214 runtime checking, 256–262 tail-call optimization and, 236–237 code complexity, 437–442 calls from applications, reporting on, 80–81 code coverage information cc command (f90 command), 99 dtrace command, 241–244 character I/O, reporting on, 67 tcov command, 239–241 checking code. See code checking and to validate profile feedback, 114–115 debugging code layout optimizations, 107–116 child processes, 378–379 code parallelization. See parallelization chip multiprocessing (CMP), 373 code scheduling, 105 chip multithreading (CMT), 6, 373. See also specifying, 8–9, 104, 106 multithreading coding rules, 437 atomic operations, 395–396 collect command, 207 mutexes with, 394 collect utility, 207, 208–210, 243, 280 optimizing for, 446 hardware performance counters, 218 parallelization, 377 collision rate, 70 CISC processor, 40–41 command-line performance analysis. See clock speed, 4–5, 15 er_print tool pipelining and, 7 commandline arguments, examining, 79 cloning functions, 325–326 common subexpression elimination (CSE), ClusterTools software, 384–385 325 CMP (chip multiprocessing), 373 communication costs of parallelization, 377. CMT (chip multithreading), 6, 373. See also See also parallelization multithreading compare operations, 9 0138134553_Gove.book Page 450 Tuesday, December 4, 2007 2:29 PM

450 Index

comparisons in floating-point operations, CSE (common subexpression elimination), 158 325 compile-time array bounds checking, 259 current processes. See processes compile-time code checking, 248–255 cycle budget (UltraSPARC T1), 305–307 compile time, optimization level and, 94, 96 Cycle_cnt counter, 282 compiler commentary, 244–245 cycle_count counter, 310 compiling for Performance Analyzer, 210 cycles (clock speed), 4, 5 detecting data races, 412 lost of processor stall events, 299 complex instruction set computing. See lost to cache miss events. See misses, CISC processor cache complexity, algorithmic, 437–442 stalled, 6, 316–317 complexity, parallelism, 444 RAW recycles, 352–354 compressed structure code, 341–342 store queue, 354–357 conditional breakpoints (dbx), 269 UltraSPARC III processors, 34 conditional move statements, 358–360 conditional statements, 10, 357–364. See also branches D configuration of system, reporting on, 49–55 -D_FILE_OFFSET_BITS compiler option, conflict (memory), multiple streams and, 348 366 constants, floating-point in single precision, -D_LARGEFILE_SOURCE compiler option, 172–173 366 contended locks, 395–396 -dalign compiler option, 122–123 context switches, reporting number of, data cache. See also caches 57, 60, 63 events, performance counters for, 283–285, control transfer instructions, SPARC, 25 290, 304, 306, 309, 312–314 cooperating processes, 378–382 memory bandwidth measurements, copies of processes, 378 292–293 copying memory, 338–339 memory latency measurements core files, running debugger on, 264–265 example of, 288–290 core-level performance counters, 307 synthetic metrics for, 290–292 cores, 4, 371 reorganizing data structures, 339–343 multiple, 373 showing data cache lines, 229 correctness of code. See code checking and data density, 15 debugging data prefetch. See prefetch instructions costs of parallelization, 377 data races in multithreaded applications, count flag, Opteron performance counters, 389–392, 412–413, 434 311 data reads after writes, 352–354 count Perl script, 280–281 data space profiling, 226–233 coverage. See code coverage information data streams CPUs (central processing units), 3–4, 371. multiple, optimizing, 348–349 See also processors, in general storing, 329 cpustat command, 73–74, 279, 301, 307 data structures, optimizing, 339–349 cputrack command, algorithmic complexity, 437–442 75–76, 279, 280, 304–305 matrices and accesses, 347–348 RAW recycle stalls, 353–354 multiple streams, 348–349 TLB misses, 351–352 prefetching, 343–346 cross calls, 63 reorganizing, 339–343 crossfile optimization, 108–110 various considerations, 346 0138134553_Gove.book Page 451 Tuesday, December 4, 2007 2:29 PM

Index 451

data TLB misses, 351–352 df command, 71–72 dbx debugger, 262–271 die, CPU, 4 commands mapped to mdb, 276 direct mapped caches, 12 compiler options for debugging, 262 directories (files), reporting disk space used debug information format, 263–264 by, 70 debugging application (example), 265–268 directory mechanism, 16 running applications under, 268–271 UltraSPARC III processors, 36 running on core files, 264–265 dis command, 89 DC_access counter, 313 disabling virtual processors, 51 DC_dtib_L1_miss_L2_ counters, 314 disassembles, examining, 89 DC_miss counter, 304, 309, 313 discover tool, 261–262 multipliers for conversion to cycles, 306 disk drive reporting, 57 DC_rd counter, 283 space utilization, 71–72 DC_rd_miss counter, 283, 290 Dispatch0_2nd_br counter, 300 DC_refill_from_L2 counter, 312 Dispatch0_br_target counter, 300 DC_refill_from_system counter, 312 Dispatch0_IC_miss counter, 285 DC_wr counter, 283 Dispatch0_mispred counter, 300 DC_wr_miss counter, 283 divides (floating-point), hoisting, 163–165 debugging. See code checking and debugging divides (integer), avoiding, 174–177 delay slot, 25 division by zero, 11 dependence analysis, 120–121, 141 handler for (example), 168–169 detailed warnings enabled, 252 zero divided by zero. See NaN developer tools. See also specific tool by name (Not-a-Number) code checking. See code checking and division operations, 9 debugging does_not_read_global_data pragma, floating-point optimization, 149–180. See 137–138 also floating-point calculations errno variable and, 166 compiler options for, 149–173 does_not_write_global_data pragma, integer maths, 174–178 137–138 Kahan Summation Formula, 161–163 errno variable and, 166 multiply accumulate instructions, doors, 380 173–174 double-precision floating-point registers, parameter passing with SPARC V8, SPARC, 24, 29–30 178–180 double-precision values, 150–151 reorganizing data structures, 341–342 not promoting single precision to, informational tools, 49–92. See also 171–172 process- and processor-specific dprofiling, 226–233 tools DTLB_miss counter, 304, 309 application reporting, 84–92 multipliers for conversion to cycles, 306 system configuration reporting, 49–55 dtrace command, 233–234, 241–244 on system status. See system status du command, 72 reporting dump command, 90–92 libraries and linking. See libraries and dumpstabs command, 90–91, 264 linking dwarfdump command, 90–91, 264 performance profiling tools. See profiling dynamic allocation (dynamic scheduling), performance 429–430 Sun Studio compiler. See Sun Studio dynamic libraries, creating, 184–185 compiler, using dynamic linking, 182–183 0138134553_Gove.book Page 452 Tuesday, December 4, 2007 2:29 PM

452 Index

E file descriptor limits, 364–366 -E compiler option, 250 file directories. See directories EC_ic_miss counter, 285 file-level scope, specifying alias level, 144 EC_misses counter, 286 filling registers, 28, 40 EC_rd_miss counter, 286, 290 fill traps, 77 EC_ref counter, 286 loop splitting and, 321–322 elfdump command, 90–92 finalization code in libraries, 187 EMT64 processors, 40 ::find_leaks command (dbx), 276 compiler performance, 105 findbug script (ATS), 273–274 emulated floating-point instructions, fini pragma, 187 reporting number of, 64 first level (L1) caches, UltraSPARC III, endianness, 42 31, 34 equality of NaN as untrue, 156 floating_instructions counter, 310 .er directories, 208 floating-point calculations. See also er_kernel tool, 233–235 floating-point optimization er_print tool, 207, 214–215 aggressive code optimizations and, 94 er_src utility, 244–245, 405 -fast compiler optimizations, 101 errno variable, 165–166 assessing operations with performance errors, checking for. See code checking and counters, 281–283 debugging catching exceptions with dbx, 270 exceptions from floating-point calculations, counting instructions executed, 304 167–170 exception flags, 167–170 exclusive time in routines, 211 executing time, reporting on execution frequency statistics, 240–241 UltraSPARC processors, 217 execution latency, 8 hoisting of divides, 163–165 execution pipes, 7–11 honoring parentheses in, 158–159 named, 380 IEEE-754 standard, 11, 150–151 UltraSPARC III processors, 30–31 inline template versions, 170–171 experiment repository, 208 optimization of, 157–158 exponent (floating-point numbers), 153–154 parentheses, honoring, 165 reductions and, 406 reordering, 159–161 F Kahan Summation Formula, 161–163 f 90 command (cc command), 99 specifying which events cause traps, FA_pipe_completion counter, 282 166–167 false sharing, 397–399 subnormal numbers, 153–155 -fast compiler option, 98–102 flushing to zero, 155 errno variable and, 165 unnecessary, avoiding, 158–159 optimizations in, 100–102 vectorizing, 151–153 mathematical optimizations, 178–180 floating-point constants, in single precision, specifying target architecture, 99, 105 172–173 FGA and FGM pipes (SPARC), 31 floating-point instructions, emulated, 64 file access system routines, reporting on, 67 floating-point optimization, 149–180. See file command, 86–87 also floating-point calculations file contents compiler options for, 149–173 metadata, reporting on, 90–92 integer maths, 174–178 reporting type of, 86–87 Kahan Summation Formula, 161–163 symbols defined in, reporting, 87 multiply accumulate instructions, 173–174 0138134553_Gove.book Page 453 Tuesday, December 4, 2007 2:29 PM

Index 453

parameter passing with SPARC V8, timing duration of, 199–201 178–180 functions (func) command, 214–215 reorganizing data structures, 341–342 fused multiply accumulate instructions, floating-point pipes, 7, 11 173–174 UltraSPARC III processors, 31 floating-point registers, SPARC, 24, 29–30 flush directive, 432–433 G flushing subnormal numbers to zero, 155 -g and -g0 compiler options, FM_pipe_completion counter, 282 102, 210, 262, 263 fmaf function (C99), 174 compiler commentary, 244–245 -fma=fused compiler option, 174 -G compiler option, 184 -fnostore compiler option, 150 GCC compatibility, 147 -fns compiler option, 155 general compiler optimizations, 116–123 within -fast option, 101, 150 in -fast compiler option, 101 fork() function, 378–382 general-purpose registers, x86 processors, Fortran compiler optimizations, 135–136 44–45 controlling amount of warning general system information, reporting on, information, 253 49–51 Fortran runtime array bounds checking, 259 generic target for compilation, 104–105 FP_instr_cnt counter, 304 get_tick function, 199–200 multipliers for conversion to cycles, 306 gethrtime function, 199, 200 fpversion command, 55, 105 gettimeofday function, 199 FR_dispatch_stall_ performance counters, global affects of functions, 138–139 316–317 global data, specifying function’s access to, FR_dispatch_stalls counter, 316 137–138 FR_nothing_to_dispatch counter, 316 global registers, SPARC, 23, 26–27 FR_retired_ performance counters, 315–316 global variables frame pointer (x86 code), 43, 44, 264 link time optimization, 115 free memory, reporting on, 56, 67 pointers and, 125–126 free operations, 194 gprof command, 237–239 -fsimple compiler option, 157–161, 163–165 granularity, mutex, 394 within -fast option, 101, 150 -fsingle compiler option, 171–172 within -fast option, 101, 150 H ftime function, 199 -H compiler option, 250 -ftrap compiler option, 166–167 hardware within -fast option, 101, 150 characteristics of SPARC processors, 55 -ftrap=common option, 101, 150, 166 performance counters, -ftrap=%none option, 101, 150, 166 73–76, 208, 304–305 funcs command, 271 values read from. See volatile variables functions hardware prefetch, 32, 294–295, 297 cloning, 325–326 header files inclusion information, 250 infrequently called, specifying as, heap, specifying page size for, 123 138–139 hoisting loop-invariant code, 324–325 performance of. See profiling performance hoisting of floating-point divides, side effects of, 137–138 163–165 specifying non-access to global data, horizontal scaling, 375–376 137–138 hot regions of code, identifying, 443–444 0138134553_Gove.book Page 454 Tuesday, December 4, 2007 2:29 PM

454 Index

I infrequently used code, moving, 112 I/O activity, reporting, 68–69 init pragma, 187 character I/O, 67 initialization code in libraries, 187 IA-32 instruction set, 40, 42–43 inline template versions of floating-point IC_fetch counter, 312 functions, 170 IC_instr_fetch_stall counter, 316 inlining across source files. See crossfile IC_itib_L1_miss_L2_ counters, 314 optimization IC_miss_cancelled counter, 285 inlining routines, 108 IC_miss counter, 285, 304, 309, 312 copying or moving memory, 338–339 multipliers for conversion to cycles, 306 profile feedback for, 112 IC_ref counter, 285 input registers, SPARC, 23, 24, 26–27 IC_refill_from_L2 counter, 312 register windows and, 27–28 IC_refill_from_system counter, 312 Instr_cnt counter, 282, 309 ICXs (involuntary context switches), Instr_FGU_arithmetic counter, 308 reporting on, 60, 63 Instr_ld counter, 308 identifying processes, 58 Instr_st counter, 309 idle time, reporting, 57 instruction cache IEEE-754 standard, 11, 150–151 events, performance counters for, ieee_flags function, 167 285–286, 304, 306, 309, 312–313 ieee_handler function, 168 layout of, 107 if_r_iu_req_mi_go counter, 310 unrolled loops and, 321 if statements, 357–364 instruction_counts counter, 310 conditional move statements, 358–360 instruction pointer register (x86 processors), misaligned memory accesses (SPARC), 44 361–364 instruction prefetch. See prefetch impdep2_instructions counter, 310 instructions in-order execution processors, 5 instruction scheduling, specifying, inclusive time in routines, 211 8–9, 104, 106 indexing, 18 instruction set, specifying in compiler, information set architecture. See ISA 99, 104, 105, 106–107 informational tools, 49–92 instruction set extensions, x64 processors, application reporting, 84–92 40, 46 process- and processor-specific reporting instruction set reports, 53 bus activity, 76 instructions (processors) commandline arguments, examining, execution speed of, 4–5 79 latency of execution, 8 files held open by processes, 79 integer maths hardware performance counters, with floating-point values, 174–178 73–76, 208, 304–305 prefetch and, 327–328 stack dumps, 79–80 integer operation pipes, 7, 9 timing process execution, 72–73 integer pipes, UltraSPARC III processors, 31 tracing application execution, 80–81 integer registers, SPARC, 26–27 trap activity, 77–78 integer registers, x86, 44 user and system code activity, 82–84 interchanging loops, 322–324 virtual memory mapping, 78–79 interleaving, 35 system configuration reporting, 49–55 interposition of libraries, 189–191 on system status. See system status interprocess cross calls, 63 reporting interrupts, reporting number of, 57 0138134553_Gove.book Page 455 Tuesday, December 4, 2007 2:29 PM

Index 455

Intimate Shared Memory (ISM), 380 L2_dmiss_ld counter, 304, 309 invert flag, Opteron performance counters, multipliers for conversion to cycles, 306 311 L2_imiss counter, 304, 309 involuntary context switches (ICXs), 60, 63 multipliers for conversion to cycles, 306 iostat command, 68–69 L3 cache, 302–303 ISA (information set architecture), 18–19 la_objopen function, 192 libraries for specific, searching for, 186 la_version function, 192 SPARC processors, 19, 23–30 large files, in 32-bit applications, 366–367 supported, outputting, 53–55 latency, memory, 8–9, 34–35 x64 architecture, 42–44 measuring, 288–290, 333–334 extensions to, 40, 46 source code optimizations, 333–337 isalist command, 53, 86 fetching integer data, 328 $ISALIST symbol, 186 synthetic metrics for, 290–292 ISM (Intimate Shared Memory), 380 latency groups (l-groups), 36 ITLB_miss counter, 304, 309 latency of instruction execution, 8 multipliers for conversion to cycles, 306 lazy loading of libraries, 187 IU_Stat_Br_count_taken counter, 300 -lbsmalloc compiler option, 193 IU_Stat_Br_count_untaken counter, 300 -lbtmalloc compiler option, 193 IU_Stat_Br_miss_taken counter, 300 LD_AUDIT environment variable, 193 IU_Stat_Br_miss_untaken counter, 300 LD_DEBUG environment variable, 191–192 LD_LIBRARY_PATH environment variables, 85, 186 K LD_LIBRARY_PATH_32 variable, 186 Kahan Summation Formula, 161–163 LD_LIBRARY_PATH_64 variable, 186 kernel ld linker, 182 exploring system code activity, 82–84 LD_OPTIONS environment variable, memory allocation, reporting on, 67 191–192 statistics on, reporting, 64–68 LD_PRELOAD environment variable, 85, traps. See traps 190–191, 257–258 -Kpic and -KPIC compiler options, 185 LD_PRELOAD_32 variable, 190–191 kstat command, 64 LD_PRELOAD_64 variable, 190–191 ldd command, 84–86 libc library, 183, 193 L libc_per library, 193 -l compiler option, 183 libcpc interface, 280 -L compiler option, 183, 185–186 libfast library, 193–194, 196 L1 caches, 31, 34 libmopt library, 171 L2 caches libmvec library, 152 fetching integer data, 327–328 libraries, specific, 193–198 memory bandwidth measurements, bsdmalloc library, 193–194 292–293 libc library, 183, 193 memory latency measurements libc_per library, 193 example of, 288–290 libfast library, 193–194, 196 synthetic metrics for, 290–292 libmopt library, 171 performance counters, 286–287, 304 libmvec library, 152 UltraSPARC II processors, 34 libumem library, 193, 194, 258–259, 274 UltraSPARC III and IV+ processors, MediaLib library, 202 31–32, 33, 34 memory management libraries, 193–196 0138134553_Gove.book Page 456 Tuesday, December 4, 2007 2:29 PM

456 Index

libraries, specific (Continued) load/store pipe, 8–9 mtmalloc library, 194, 195, 259 UltraSPARC III processors, 31 performance library, 196–197 loads (processors), 9 Rogue Wave and STLport4 libraries, 198 physical and virtual addresses of, 229 Sun Math Library, 201–202 SPARC operations, 24 watchmalloc.so library, 257–258 local registers, SPARC, 23, 26–27 libraries and linking, 87, 181–205. See also register windows and, 27–28 libraries, specific local variables audit interface, 192–193 aligning for optimal layout, 135, 136–137 debug interface, 191–192 placing on stack, 135–136 disassembling libraries, 89 locking objects to multiple changes. See dynamic and static libraries, 184–185 mutexes dynamic and static linking, 182–183 logical operations, 9, 358–360 how to link libraries, 183 loops initialization and finalization code, 187 compiler commentary on, 244–245 lazy loading of libraries, 185–187 dependence analysis. See dependence libraries linked to applications, reporting, analysis 84–86 fusion, 321–322 library calls, 199–205 hoisting of floating-point divides, searching arrays with VIS instructions, 163–165 203–205 interchange and tiling, 322–324 SIMD instructions and MediaLib, 202 invariant hoisting, 324–325 for timing, 199–201 parallelizing (OpenMP), 403–406 using most appropriate, 201–202 splitting, 322 library interposition, 189–191 unrolling and pipelining, link-time optimization, 108, 115–116 140–142, 320–321 locations of libraries, specifying, 185–187 -lsunmath compiler option, 201 overview of linking, 181–182 -lunem compiler option, 193 recognizing standard library functions LWPs (lightweight processes), 59, 76 (compiler), 133–135 segments in, reporting size of, 90 symbol scoping, 188 M versions of libraries, 87–89, 186 -M compiler option, 110 -library=stlport4 compiler option, 198 -m32 compiler option, 106 libumem library, 193, 194, 258–259, 274 -m64 compiler option, 99, 106 lightweight processes (LWPs), 59, 76 machine code (assembly language), 18–19. limit command, 214–215 See also ISA (information set limit stacksize command, 136 architecture) line sizes, caches, 12, 13 macro options, 98 link-time optimization, 108, 115–116 makefiles, dependency information for, 251 linking. See libraries and linking malloc command requests, counting, 81–83 lint command, 248–250 MALLOC_DEBUG environment variable, little-endian vs. big-endian systems, 41–42 258 -lm9x compiler option, 169 malloc operations, 193–196 load balancing/imbalance, 52 debugging options under, 258–259 OpenMP specification and, 429–430 mantissa (floating-point numbers), 153–154 load operations, counting, 126 manual performance optimizations, load_store_instructions counter, 310 avoiding, 441–442 0138134553_Gove.book Page 457 Tuesday, December 4, 2007 2:29 PM

Index 457

manual prefetch, 330–333 mapping information, reporting for store queue performance, 356–357 process, 78–79 structure prefetch, 346 utilization of, reporting, 56–57 mapfiles, 108, 110–111 memory footprint, 40 generating, 222 compiling for 32- or 64-bit architecture, master threads. See Pthreads 103 matrices, optimizing, 347–348 memory management libraries, 258–259 may_not_point_to pragma, 146–147 memory pipes, 7 may_point_to pragma, 145–146 memset command, 338 MC_reads_0_sh counter, 303 Message Passing Interface (MPI), 382–385 mcs command, 92 message queues, 380 mdb debugger, 274–276 metadata in files, reporting on, 90–92 commands mapped to dbx, 276 metadata on performance reports, 223 MediaLib library, 202 microstate accounting, 59 Megahertz Myth, 5 misaligned memory accesses (SPARC), MEMBAR instructions, 16, 36 121–123, 361–364 memcpy command, 338 mispredicted branches, 10, 107 memory performance counters for, access error detection, 261–262, 274 300–301, 315–316 addressing, 16, 18 misses, cache, 287 bandwidth, 14–15 data cache, 283–285 source code optimizations, 326–327 instruction cache, 285–286 synthetic metrics for, 292–293 memory bandwidth measurements, cache latency, 8–9, 34–35 292–293 measuring, 288–290, 333–334 memory latency measurements source code optimizations, example of, 288–290 328, 333–337 synthetic metrics for, 288–290 synthetic metrics for, 290–292 TLB (Translation Lookaside Buffer), caches. See caches 351–352 controller events, 301–302, 303 MMX instruction set extensions, 46 copying and moving, 338–339 modular debugger. See mdb debugger dependencies on, specifying none, 141 modulo operation, 177–178 misaligned accesses (SPARC), moving memory, 338–339 121–123, 361–364 MPI (Message Passing Interface), 382–385 ordering. See TSO (Total Store Ordering) MPI_REDUCE function, 383–384 ordering (x64 processors), 46 MPSS (multiple page size support), 350 paging, 17–18 mpstat command, 62–63 changing what applications request, -mt compiler option, 195 54 errno variable and, 166 reporting on, 57, 67 mtmalloc library, 194, 195, 259 setting page size, 123 multiple data streams, optimizing use of, supported page size, reporting on, 348–349 53–55 multiple execution pipes, 7 thrashing in caches, page size and, multiple page size support (MPSS), 350 350–351 multiple processors, 3–19, 371. See also tagging, 18 processors, in general threaded applications, 385, 399–402 multiplication operations, 9 virtual, 16–18 multiply accumulate instructions, 173–174 0138134553_Gove.book Page 458 Tuesday, December 4, 2007 2:29 PM

458 Index

multithreading, 372, 385–402 combining with libraries. See libraries atomic operations, 395–396 and linking CMT systems. See CMT disassembling, 89 data races, 412–413 segments in, reporting size of, 90 debugging code, 413–416 of_r_iu_req_mi_go counter, 310 false sharing, 397–399 OMP_NUM_THREADS environment memory layout, 399–402 variable, 194, 406, 425 mutexes, 389–395 OpenMP API, 406 parallelization. See parallelization OpenMP specification, 194 profiling multithreaded applications, debug and, 264 410–412 load balancing, 429–430 Pthreads, 385–387 parallelization, 402–403 role-based threads or processes, 445 example of, 424–425 sharing data between threads (example), of loops, 403–406 430–434 sharing variables between threads, Thread Local Storage, 387–389 432–434 threads sharing a processor, 378 operating system calls, reporting on, 80–81 virtualization, 374–375 Opteron processor performance counters, mutexes, 389–395, 430–432, 445 310–317 profiling performance, 410–412 optimization. See also performance algorithms and complexity, 437–442 for CMT processors, 446 N code layout optimizations, 107–116 named pipes, 380 crossfile optimization, 108–110 NaN (Not-a-Number), 11, 155–157 link time optimization, 115–116 comparisons, eliminating, 158 compilation optimizations, nested loops, 322–324 96–102, 116–123. See also Sun netstat command, 70 Studio compiler, using network activity, reporting, 70 C- and C++-specific, 123–135 nice value, process, 59 Fortran-specific, 135–136 nm command, 87 including debug information and, 103 no_side_effect pragma, 138–139, 426 levels for, 93–95 errno variable and, 166 data structures, 339–349 noalias pragma, 146 algorithmic complexity, 437–442 -nofstore compiler option, 101 matrices and accesses, 347–348 nomemorydepend pragma, 142 multiple streams, 348–349 Non-Uniform Memory Access (NUMA), prefetching, 343–346 35, 36 reorganizing, 339–343 nonvolatile variables, 97 various considerations, 346 ,nt adornment (SPARC instruction), 25 floating-point. See floating-point NUMA (Non-Uniform Memory Access), optimization 35, 36 how to apply, 437–446 performance counters. See performance counters O serial code, tuning, 442–444 -O compiler option, 98 serial vs. parallel applications, 418–419 O notation (complexity), 438–440 of source code. See source code object files optimizations 0138134553_Gove.book Page 459 Tuesday, December 4, 2007 2:29 PM

Index 459

tail-call optimization and debug, 235–237 performance counters for, optimized maths library, 171 300–301, 315–316 OR operation (logical), 9 counters. See performance counters ordered segments, 111 dynamic vs. static linking, 182 ordering memory (x64 processors), 46 floating-point calculations. See also $ORIGIN symbol, 186 floating-point optimization out-of-order execution processors, 5–6 integer maths, 174–178 output registers, SPARC, 23, 26–27 Kahan Summation Formula, 161–163 register windows and, 27–28 reordering, 159–161 identifying consuming processes, 58–60 in-order vs. out-of-order processors, 5–6 P load imbalance, 52, 429–430 -P compiler option, 250, 251 mapfiles. See mapfiles packet information, reporting on, 71 memory bandwidth. See bandwidth, padding for thread data, 398 memory -pad=local compiler option, 101 memory latency. See cache latency pagesize command, 53–55 memory paging, 17 paging (memory), 17–18 multithreaded applications. See also changing what applications request, 54 multithreading reporting on, 57, 67 atomic operations, 395–396 setting page size, 123 false sharing, 397–399 supported page size, reporting on, 53–55 mutexes, 393–396 thrashing in caches, page size and, optimizing for CMT processors, 446 350–351 parallelization. See parallelization parallelization, 376–377, 444–446 with prefetch. See prefetch instructions automatic, 408–409, 425–429 processors, 5–6. See also process- and example of, 417–434 processor-specific using OpenMP, 402–407 profile feedback, 108, 111–115 loops, 402–403 profiling tools. See profiling performance section-based, 407 serial code, tuning, 442–444 parentheses, honoring in floating-point structures, 346 calculations, 158–159, 165 subnormal number calculations, 154–155 pargs command, 79 system bus bandwidth, 15, 308 partitioning compute resources, 52 Performance Analyzer, 207–208 paths to libraries, specifying, 185–186 compiling for, 210 pbind command, 53 multithreaded applications, 410 PC_MS_misses counter, 294 performance counters, 218–219, 279–317 PC_port0_rd and PC_port1_rd counters, bus events, 76–77, 308 293–294 comparison with and without prefetch, PC_snoop_inv counter, 294 295–297 PC_soft_hit counter, 294 hardware events, reporting on, 73–76, -pec compiler option, 271–272 208, 304–305 peeling loops, 321–322 Opteron processor, 310–317 percentage sign (%) for SPARC registers, 23 reading, tools for, 279–281 performance, 437–446. See also SPARC64 VI processor, 309–310 informational tools; optimization TLB misses, 351–352 algorithms and complexity, 437–442 UltraSPARC III and IV processors, branch mispredictions, 10, 107 281–302 0138134553_Gove.book Page 460 Tuesday, December 4, 2007 2:29 PM

460 Index

performance counters (Continued) init pragma, 187 UltraSPARC IV and IV+ processors, may_not_point_to pragma, 146–147 302–303 may_point_to pragma, 145–146 UltraSPARC T1 processor, 304–308 no_side_effect pragma, 138–139 UltraSPARC T2 processor, 308–309 errno variable and, 166 performance library, 196–197 noalias pragma, 146 pfiles command, 79 nomemorydepend pragma, 141 pgrep command, 61–62 pipeloop pragma, 140–141 physical memory, 16 rarely_called pragma, 139–140 pic0 and pic1 counters, 281 unroll pragma, 141–142 PID (process ID). See also processes pragmas, 136–142 of application, locating, 61–62 for aliasing control, 142–147 arguments passed to, 79 predicting branches, 10. See also files help open by, reporting, 79 mispredicted branches spawned processed, 378–379 preferred page size, defining, 54 pins, CPU, 3 prefetch cache pipelines, 7, 320–321 performance counters, 293–297 specifying safe degree of, 140–141 UltraSPARC III and IV+ processors, pipeloop pragma, 140–141 31, 33–34 pipes, 7–11 prefetch instructions, 31, 116–118 named, 380 aggressiveness of prefetch insertion, 120. UltraSPARC III processors, 30–31 See also -xprefetch_level compiler pmap command, 78–79 option pointer aliasing in C and C++, 123–133 algorithmic complexity and, 441 diagnosing problems, 126 for cache lines, 335–337 loop invariant hoisting and, 324 enabling prefetch generation, 118–119 restricted pointers, 126–127 manual prefetch, 330–333 specifying degree of, 127–133 store queue performance, 356–357 pointer chasing, 336 structure prefetch, 346 pointers, restricted, 126–127, 443. See also source code optimizations -xalias_level compiler option for integer data, 327–328 POSIX threads (Pthreads), 385–387 with loop unrolling and pipelining, 320 memory layout, 399–402 memory bandwidth and, 326–327 OpenMP specification vs., 402–403 storing data streams, 329 parallelization example, 422–424 speculative, number of, 101, 117 Thread Local Storage, 387–389 structure prefetch, 343–346 ppgsz command, 54 preprocessing source code, 251 #pragma directives priority, process, 59 alias pragma, 144–145 probe effect, 239 alias_level pragma, 143–144 process- and processor-specific reporting align pragma, 136–137 bus activity, 76 does_not_read_global_data pragma, commandline arguments, examining, 79 137–138 files held open by processes, 79 errno variable and, 166 hardware performance counters, does_not_write_global_data pragma, 73–76, 208, 304–305 137–138 stack dumps, 79–80 errno variable and, 166 timing process execution, 72–73 fini pragma, 187 tracing application execution, 80–81 0138134553_Gove.book Page 461 Tuesday, December 4, 2007 2:29 PM

Index 461

trap activity, 77–78 with counters. See performance counters user and system code activity, 82–84 interpreting profiles, 215–217 virtual memory mapping, 78–79 UltraSPARC processors, 217 process ID. See PID (process ID) mapfiles. See mapfiles process resource utilization, reporting, 58–60 memory profiling across patterns processes, 371–372 (dprofiling), 226–233 assigning roles to, 445 multithreaded applications, 410–412, 419 calls from, reporting on, 81 Performance Analyzer, about, 207–208 current, listing, 60–61 Performance Analyzer, compiling for, 210 defined, 371 profile feedback for compilation, files held open by, 79 108, 111–115 multiple, using, 378–385 profile information, gathering cooperating processes, 378–382 dtrace command, 241–244 copies of processes, 378 gprof command, 237–239 MPI (Message Passing Interface), serial code, tuning, 442–444 382–385 spot tool, to generate reports, 223–226 multithreaded. See multithreading tail-call optimization and debug, 235–237 parallelization, 376–377, 444–446 viewing profiles with Analyzer GUI, automatic, 408–409, 425–429 207, 210–212 example of, 417–434 program counter (x86 processors), 44 using OpenMP, 402–407 protocol, reporting activity for each, 70 processor activity, reporting all, 62–63 prstat command, 58–60 processor sets, controlling processor use prtconf command, 51 with, 52–53 prtdiag command, 49–50 processor stall events, 299 prtfru command, 51 processors, in general, 3–19, 371 prtpic1 command, 51 caches. See caches ps command, 60–61 components of, 3–4 psradm command, 51 execution pipes, 7–11 psrinfo command, 51 named, 380 psrset command, 52 UltraSPARC III processors, 30–31 pstack command, 79–80 indexing and memory tagging, 18 ,pt adornment (SPARC instruction), 25 instruction set architecture, 18–19 Pthreads, 385–387 interacting with system, 14–16 memory layout, 399–402 multiple, 374–376 OpenMP specification vs., 402–403 specifying with compiler, parallelization example, 422–424 99, 104, 105, 106–107 Thread Local Storage, 387–389 virtual memory, 16–18 ptime command, 72–73 profiling performance, 207–245. See also pvs command, 87–89 optimization; performance caller–callee information, 212–214 tail-call optimization and, 236–237 Q code coverage information quiet NaNs, 156–157 dtrace command, 241–244 tcov command, 239–241 collecting profiles, 208–210 R command-line tool (er_print), 207, 214–215 -R compiler option, 183, 185–186 compiler commentary, 244–245 race conditions, 389–392, 412–413, 434 0138134553_Gove.book Page 462 Tuesday, December 4, 2007 2:29 PM

462 Index

rarely_called pragma, 139–140 rotation operations, 9 RAW recycles, 352–354 routines Re_DC_miss counter, 287 call stack of, examining, 219–222 Re_DC_missovhd counter, 287, 290 defined in files, identifying, 87 Re_EC_miss counter, 286, 287, 290 inlining, 108 Re_L2_miss counter, 302 copying or moving memory, 338–339 Re_L3_miss counter, 302 profile feedback for, 112 Re_RAW_miss counter, 299, 352–354 in libraries, learning how used, read misses 189–191 data cache, 283–285 library calls, 199–205 instruction cache, 285 searching arrays with VIS instructions, memory bandwidth measurements, 203–205 292–293 SIMD instructions and MediaLib, 202 memory latency measurements for timing, 199–201 example of, 288–290 using most appropriate, 201–202 synthetic metrics for, 290–292 performance of. See optimization; second-level cache, 286 performance; profiling reads after writes, 352–354 performance reduction operations, 404–406 of standard libraries, compiler redundant floating-point calculations, recognition of, 133–135 158–159 time spent in, reporting, 211 register windows, SPARC, 27–29 RSS (resident set size), 58 registers Rstall_FP_use counter, 299 fill and spill traps, 77 Rstall_IU_use counter, 299 multiple data streams and, 349 Rstall_storeQ counter, 299, 355–357 SPARC architecture, 23–25 runtime floating-point registers, 24, 29–30 application linking, information on, integer registers, 26–27 191–192 unrolled loops and, 321 optimization level and, 96 x64 architecture, 40, 43–45 runtime array bounds checking, 259 regs command, 267–268 runtime code checking, 256–262 relative paths to libraries, specifying, 186 runtime linker, 185 reordering floating-point calculations, runtime stack overflow checking, 260–261 159–161 Kahan Summation Formula, 161–163 reorganizing data structures, 339–349 S replacement algorithm (cache), 13 sar command, 64–68 reporting on performance. See spot tool, to SAVE instruction (SPARC), 27–28 generate reports SB_full counter, 304 reporting on system. See informational tools multipliers for conversion to cycles, 306 RESTORE instruction (SPARC), 28–29 scaling to multiple processors, 445 restricted pointers, 126–127, 443. See also horizontal and vertical scaling, 375–376 -xalias_level compiler option using multiple processes, 378–385 RISC (reduced instruction set computing), scheduling processor instructions, 19, 23 8–9, 104, 106 CISC vs., 41 scoping symbols, 188 Rogue Wave library, 198 searching arrays with VIS instructions, role-specific threads, 445 203–205 0138134553_Gove.book Page 463 Tuesday, December 4, 2007 2:29 PM

Index 463

second level (L1) caches Solaris Containers. See Zones fetching integer data, 327–328 Solaris doors, 380 memory bandwidth measurements, source code optimizations, 319–367 292–293 data locality, bandwidth, latency, memory latency measurements 326–339 example of, 288–290 cache latency (memory latency), synthetic metrics for, 290–292 333–337 performance counters, 286–287, 304 copying and moving memory, 338–339 UltraSPARC II processors, 34 integer maths, 327–328 UltraSPARC III and IV+ processors, memory bandwidth, 326–327 31–32, 33, 34 storing streams of data, 329 section-based parallelism, 407 data structures, 339–349 segment registers, x86 processors, 44 matrices and accesses, 347–348 segment size, reporting, 90 multiple streams, 348–349 serial code, tuning, 442–444 prefetching, 343–346 serial tasks reorganizing, 339–343 explained, 376 various considerations, 346 parallelizing (example), 417–434 file handling in 32-bit applications, shared data, protecting with mutexes, 364–367 389–395, 430–432, 445 if statements, 357–364 profiling performance, 410–412 conditional move statements, 358–360 shared libraries. See libraries and linking misaligned memory accesses (SPARC), shift operations, 9 361–364 side effects of function, 138–139 reads after writes, 352–354 SIGBUS errors, 121–122 store queue, 354–357 signaling NaNs, 156–157 thrashing (caches), 349–353 signals, 380 traditional optimizations, 319–326 SIMB instructions, 202 source files, inlining across. See crossfile SIMD instructions. See also SSE and SSE2 optimization instruction set extensions; VIS SPARC architecture, 21–38 instructions 32-bit and 64-bit code, 23–30 vectorizing floating-point computations, history of, 21–22 152–153 instruction set architecture (ISA), simplification of floating-point expressions, 19, 23–30 157–158 link-time optimization, 108, 115–116 sincos function, 201–202 misaligned memory accesses, single-precision floating-point registers, 121–123, 361–364 SPARC, 24, 29–30 page size, 17–18, 123 single-precision values, 150–151 SPARC64 VI processor, 23, 38 not promoting to double precision, performance counters, 309–310 171–172 summarizing hardware characteristics, size command, 90 55 SMP (symmetric multiprocessing), 372 targeting for compilation, 103 snoop command, 71 UltraSPARC processors. See snooping, 16 UltraSPARC processors UltraSPARC III processors, 36 x64 architecture vs., 41, 43, 46 .so files (static libraries), 183 -xtarget=generic compiler option, 99 software prefetch, 32–33, 294–295, 297 sparc_prefetch_ constructs, 330 0138134553_Gove.book Page 464 Tuesday, December 4, 2007 2:29 PM

464 Index

spawning processes, 378–379 store queue, 354–357 speculative stride prediction, 336–337 stores (processors), 9 spilling registers to memory, 28, 40 physical and virtual addresses of, 229 loop splitting and, 321–322 SPARC operations, 24 spill traps, 77 storing streams of data, 329 splitting loops, 322 streams of data, storing, 329 spot tool, to generate reports, 223–226 strength reduction, 9, 325 src command, 215 stride predictor, 336–337 SSE and SSE2 instruction set extensions strong prefetches (UltraSPARC III and IV+), (x64), 46. See also SIMD instructions 33–34 unavailable on 386 processor, 95–96 structure pointers. See pointer aliasing in C stack and C++ default size, multithreading and, structures. See data structures, optimizing 399–400 subblocked caches, 14 default size, OpenMP and, 407–408 subdirectories. See directories interpreting performance profiles for, subexpressions, eliminating common, 219–222 324–325 overflow checking, 260–261 subnormal numbers, 153–155 placing local variables on, 135–136 flushing to zero, 155 stack dumps, 79–80 subtraction operations, 9 stack page size, specifying, 123 Sun HPC ClusterTools software, 384–385 stack pointer (x86 code), 43, 44 Sun Math Library, 201–202 stack space, 28–29 sun_prefetch_ constructs, 330 STACKSIZE environment variable, 260, 407 Sun Studio compiler, using, 93–147 stalled cycles, 6, 316–317 C and C++ pointer aliasing, 123–133 RAW recycles, 352–354 code layout optimizations, 107–116 store queue, 354–357 compatibility with GCC, 147 UltraSPARC III processors, 34 debug information, generating, 102–103 standard library routines, recognizing, optimization, 96–102 133–135 C- and C++-specific optimizations, state, setting up before execution, 187 133–135 static libraries, creating, 184 -fast option. See -fast compiler option static linking, 182–183 Fortran-specific optimizations, status of system, reporting on, 55–72 135–136 all processor activity, 62–63 general compiler optimizations, current processes, listing, 60–61 116–123 disk space utilization, 71–72 -O compiler option, 98 I/O activity, 68–69 volatile variables and, 94, 97 kernel statistics, 64–68 pragmas, 136–142 locating an application’s process ID, in C, for aliasing control, 142–147 61–62 selecting target machine type, 103–107 network activity, 70 types of compiler options, 93–95 packet information, 71 -xtarget=generic option (x86), 95–96 process resource utilization, 58–60 Sun Studio Performance Analyzer. See swap file usage, 57–58 profiling performance virtual memory utilization, 56–57 superscalar processors, 7 STLport4 library, 198 swap command, 57–58 stop command, 268 swap file usage 0138134553_Gove.book Page 465 Tuesday, December 4, 2007 2:29 PM

Index 465

controlling, 57 throughput computing, 378 reporting, 56, 57–58 tick counters, 199–200, 311 swap file usage, reporting, 66 tiling loops, 323–324 symbol scoping, 188 time allocation, reporting on, 57 symbols in files, reporting on, 87 time-based profiling, 207–208 symmetric multiprocessing (SMP), 372 time command, 72–73 synchronization of processors, 15–16 time function, 199 system bandwidth consumption, 308 timex command, 72–73 system buses, 15–16 timing functions, 199–201 system calls, reporting number of, 57, 66–67 timing harness (timing.h), 200–201 system code activity, exploring, 82–84 timing process execution, reporting, 72–73 system configuration reporting, 49–55 TLB (Translation Lookaside Buffer), 17 system libraries. See libraries and linking events, performance counters for, system status reporting, 55–72 304, 306, 309, 314–315 all processor activity, 62–63 layout, 107 current processes, listing, 60–61 multiple data streams and, 348 disk space utilization, 71–72 performance counter, 351–352 I/O activity, 68–69 reporting supported TLB page sizes, kernel statistics, 64–68 53–55 locating an application’s process ID, thrashing, 349–351 61–62 traps, 77 network activity, 70 UltraSPARC III and IV+ processors, 33 packet information, 71 tools. See developer tools; specific tool by process resource utilization, 58–60 name swap file usage, 57–58 tracing application execution, 80–81 virtual memory utilization, 56–57 tracing process execution, 81 system time, reporting on, 57 tracking performance. See developer tools training data for profile feedback, 113–114 T transfers per second, reporting, 66 T1 processor. See UltraSPARC T1 processor Translation Lookaside Buffer. See TLB T2 processor. See UltraSPARC T2 processor Translation Storage Buffer (TSB). See TLB tagging memory, 18 trap_DMMU_miss counter, 310 tail-call optimization and debug, 235–237 trap_IMMU_miss counter, 310 target machine type for compilation, 103 traps tcov command, 239–241 caused by floating-point events, 166–167 .tcov files, 240 to correct memory misalignment, third-level cache, 302–303 363–364 thrashing (caches), 12–13, 349–353 fill and spill traps, 77 Thread Analyzer, 412 reporting activity of, 77–78 Thread Local Storage, 387–389 TLB traps, 77 thread migrations, reporting on, 63 unfinished floating-point traps, 64 threads, 371–372. See also multithreading truss command, 80–81 assigning roles to, 445 TSB (Translation Storage Buffer). See TLB parallelization. See parallelization TSO (Total Store Ordering), 36 sharing data and variables, 430–434 tuning. See optimization sharing processor, 378 types (for variables), aliasing between. See virtualization, 374–375 aliasing control pragmas 0138134553_Gove.book Page 466 Tuesday, December 4, 2007 2:29 PM

466 Index

U variables -u compiler option, 253, 254 aliasing between. See aliasing control ulimit command, 260 pragmas UltraSPARC processors, 21–23. See also alignment of, specifying, 135, 136–137 SPARC architecture global interpreting performance profiles, 217 link time optimization, 115 SPARC64 VI processor, 23, 38 pointers and, 125–126 performance counters, 309–310 sharing between threads, 432–434 UltraSPARC I processors, 22 volatile, 94, 97 UltraSPARC II processors, 22 mutexes, 391–392, 432 UltraSPARC III processors, 22, 30–36 VCXs (voluntary context switches), 60 performance counters, 281–302 vector library, 152 UltraSPARC IV and IV+ processors, 33 vectorizing floating-point computation, performance counters, 281–303 151–153 UltraSPARC T1 processor, 3–4, 22, 37 versions of libraries CMT (chip multithreading), 6 obtaining information on, 87–89 data cache, 13 searching for instruction-set-specific, 186 MEMBAR instructions. See MEMBAR vertical scaling, 375–376 instructions vertical threading, 373 page size, 17–18 virtual addressing, 16 performance counters, 304–308 virtual memory, 16–18 UltraSPARC T2 processor, 22–23, 37–38 mapping information, reporting for performance counters, 308–309 process, 78–79 -xtarget=generic compiler option, 99 utilization of, reporting, 56–57 ::umalog command (dbx), 274–276 virtual processes, enabling, 51 umask flag, Opteron performance counters, virtual processors, 372, 373 311 virtualization, 374–375 UMEM_DEBUG environment variable, 258 VIS instructions (SPARC), 202–205. See also UMEM_LOGGING environment variable, SIMD instructions 258 for searching arrays, 203–205 ::umem_verify command (mdb), 274 vmstat command, 56–57 uncoverage information, 225–226 volatile variables, 94, 97 unfinished floating-point traps, 64 mutexes, 391–392, 432 unnecessary floating-point calculations, voluntary context switches (VCXs), 60 158–159 unroll pragma, 141–142 W unrolling loops, 320–321 for parallelization, 420–422 -w compiler option, 252 unrolling of loops, degree of, 141–142 +w and +w2 compiler options, 252 user code activity, exploring, 82–84 -w0 through -w4 compiler options, 253 user time, reporting on, 57 watchmalloc.so library, 257–258 WC_miss counter, 297–298 WC_scrubbed counter, 297–298 V WC_snoop_cb counter, 297–298 -v compiler compiler option, 248 WC_wb_wo_read counter, 297–298 V8 parameter passing (floating-point weak prefetches (UltraSPARC III and IV+), functions), 178–180 33–34 V9 architecture (SPARC), 22, 30 whereami command, 266 0138134553_Gove.book Page 467 Tuesday, December 4, 2007 2:29 PM

Index 467

worker threads. See Pthreads -xlibmil option and, 170–171 write cache errno variable, 166 performance counters for, 297–298 -xcache compiler option, 104, 105 UltraSPARC III and IV+ processors, -xcheck=stkovf compiler option, 32, 33 260–261, 408 write misses -xchip compiler option, 104, 106 data cache, 283–285 -xcloopinfo compiler option, 404 instruction cache, 285 -xcode compiler option, 185 memory bandwidth measurements, -xcrossfile compiler option, 110 292–293 -xdebugformat compiler option, 262, 263 memory latency measurements -xdepend compiler option, 120–121 example of, 288–290 within -fast option, 101 synthetic metrics for, 288–290 -xdumpmacros compiler option, 252–253 second-level cache, 286 -xe compiler option, 251 -xF compiler option, 110 -xhwcprof compiler option, 226 X -xinstrument compiler option, 412 x64 architecture, 39–46 -xipo compiler option, 110 byte ordering, 41–42 -xlibmil compiler option, 170–171 instruction set extensions, 40, 46 within -fast option, 101, 150 instruction template, 42–43 -xbuiltin option and, 170–171 ISA (information set architecture), 19 errno variable, 166 memory ordering, 46 -xlibmopt compiler option as out-of-order processors, 6 errno variable and, 166 page size, 17, 123 within -fast option, 101, 150, 171 registers, 40, 43–45 -xlic_lib=sunperf compiler option, 197 SPARC architecture vs., 41, 43, 46 -xlinkopt compiler option, 115–116 targeting for compilation, 103 -xlist compiler options (Fortran), 254–255 -xtarget=generic compiler option, 99 -xloopinfo compiler option, 408, 426 x86 processors, 39 -xM compiler option, 250 frame pointer (base pointer), 43, 44, 264 -xmemalign compiler option, 121–123, 362 -xtarget=generic compiler option, 95–96 within -fast option, 101 x87 coprocessor, 46 -xO# optimization levels, 96–97 -xalias_level compiler option, 127–133, 409 -xopenmp compiler option, -xalias_level=basic option, 129–130, 426 194, 264, 402, 413 -xar compiler option, 184 -xpad compiler option, 135 -xarch compiler option, 104, 106–107 -xpagesize compiler option, 123 -xarch=sparcfmaf option, 107, 174 -xpagesize_heap compiler option, 123 -xarch=sparcvis option, 107 -xpagesize_stack compiler option, 123 -xarch=sse option, 119 -xpg compiler option, 237 -xarch= option, 152–153 -xpost compiler option, 251 -xautopar compiler option, 408, 426 -xprefetch compiler option, 117, 326–327 -xbinopt compiler option, 225 -xprefetch_level compiler option, -xbinopt=prepare option, 261 117, 118, 120 -xbuiltin compiler option, 133–135 within -fast option, 101 copying or moving memory, 338 manual prefetch, 330, 332 within -fast option, 101 -xprofile compiler option vectorized computation, 151 -xprofile=collect option, 111–112 0138134553_Gove.book Page 468 Tuesday, December 4, 2007 2:29 PM

468 Index

-xprofile compiler option (Continued) within -fast option, 150 -xprofile=coverage option, 239 -xvector=lib compiler option, 101 -xprofile=use option, 112 -xvector= option, 152–153 -xreduction compiler option, 409 -xvpara compiler option, 404–405, 425 -xregs=frameptr compiler option, 264 within -fast option, 101 -xrestrict compiler option, 127, 143 Y -Xs compiler mode, 171–172 __global scoping specifier, 188 -xs compiler option, 262, 263 __hidden scoping specifier, 188 -xsfpconst compiler option, 172–173 __MATHERR_ERRNO_DONTCARE -xstackvar compiler option, 135–136 preprocessor variable, 165 -Xt compiler mode, 171–172 __thread specifier, 388 -xtarget compiler option -xtarget=generic option, 95–96, 99, 106 -xtarget=generic64 option, 106 Z -xtarget=native option, 98, 99, 101 zero, division by, 11 -xtarget=opteron option, 105, 107 handler for (example), 168–169 -xtarget=ultra3 option, 119, 330, 332 zero divided by zero. See NaN -xtarget_level=basic compiler option, 101 (Not-a-Number) -xtransition compiler option, 248 Zones (Solaris Containers), 374–375 -xvector compiler option, 151–152 -ztext compiler option, 184–185