Download the Index
Total Page:16
File Type:pdf, Size:1020Kb
0138134553_Gove.book Page 447 Tuesday, December 4, 2007 2:29 PM Index Numbers activity reports. See system status reporting 32-bit applications, file handling in, 364–367 addition operations, 9 32-bit architecture, specifying in compiler, address, memory, 16, 18 103 address space for process, reporting on, 32-bit code on UltraSPARC processors, 30 78–79 64-bit architecture, specifying in compiler, addressing modes, SPARC, 24–25 103 aggressively optimized code, compiler 286 processor, 39 options for, 93–95. See also Sun 386 processor, 39 Studio compiler, using SSE instruction sets on, 95–96 algorithmic complexity, 437–442 -xtarget=generic compiler option, 99 alias_level pragma, 143–144 80286 processor, 39 alias pragma, 144–145 80386 processor, 39 aliasing control pragmas, 142–147 SSE instruction sets on, 95–96 aliasing of pointers. See pointer aliasing in C -xtarget=generic compiler option, 99 and C++ 8086 processor, 39 align pragma, 136–137 -aligncommon=16 compiler option, 101 alignment of variables, specifying, A 135, 136–137 ,a adornment (SPARC instruction), 25 AMD64 processors, 40 .a files (static libraries), 182 compiler performance, 105 access to memory Amdahl’s Law, 377, 440, 443 error detection for, 261–262, 274 analyzer GUI, 207, 210–212 metrics on (example), 288–290 AND operation (logical), 9 accessing matrix data, 347–348 application profiles. See profiling active_cycle_count counter, 310 performance 447 0138134553_Gove.book Page 448 Tuesday, December 4, 2007 2:29 PM 448 Index applications ax_miss_count_dm counter, 310 calls from, reporting on, 80–81 ax_miss_count_dm_if counter, 310 compiler options for. See Sun Studio ax_read_count_dm counter, 310 compiler, using debugging, 265–271 running under dbx, 268–271 B disassembling, 89 -B compiler option, 183 looking for process IDs for, 61–62 bandwidth, memory, 14–15 multithreaded. See multithreading source code optimizations, 326–327 processes as, 371. See also processes synthetic metrics for, 292–293 reporting on, 84–92 bandwidth, system bus, 15, 308 disassembles, examining, 89 base pointer (x86 code), 43, 44, 264 file contents, type of, 86–87 bcheck command, 256–257 library linkage, 84–86 big-endian vs. little-endian systems, 41–42 library version information, 87–89 binding, controlling processor use with, metadata in files, 90–92 52–53 segment size, reporting, 90 BIT tool, 224–226 symbols in files, 87 bounds checking, 256–257 runtime linking of, information on, runtime array bounds (Fortran), 259 191–192 Br_completed counter, 308 segments in, reporting size of, 90 Br_taken counter, 308 selecting target machine type for, branch delay slot, 25 103–107 branch instructions, 25 tuning, 442–444. See also optimization; events, performance counters for, profiling performance 300–301, 315–316 ar tool, 184 branch pipes, 7, 9–10 architecture, specifying with compiler, UltraSPARC III processors, 31 99, 104, 105, 106–107 branches archives (static libraries), creating, 184 if statements, 357–364 arguments passed to process, reporting on, loop unrolling and, 320–321 79 mispredicted branches, 10, 107 arrays, VIS instructions for searching, performance counters for, 203–205 300–301, 315–316 assembly language (machine code), 18–19. predictors, 10, 358 See also ISA (information set breakpoints (dbx), setting, 269 architecture) bsdmalloc library, 193–194 assertion flags, 443 buses, 15–16 associativity, cache, 12–13 reporting activity of, 76 atomic directive, 432–433 busstat command, 76–77, 308 atomic operations, 395–396 bypass (storing data to memory), 299, 352 ATS (Automatic Tuning and byte ordering, x64 processors, 41–42 Troubleshooting System), 271–274 attributed time for routines, 214 audit interface, linker, 192 C automatic parallelization, 408–409, 424–425 C- and C++-specific compiler optimizations, Automatic Tuning and Troubleshooting 123–135 System (ATS), 271–274 -xalias_level compiler option, autoscoping, 404–405 127–133, 409 0138134553_Gove.book Page 449 Tuesday, December 4, 2007 2:29 PM Index 449 -xalias_level=basic option, atomic operations, 395–396 129–130, 426 mutexes with, 394 -C compiler option, 259 optimizing for, 446 C99 floating-point exceptions, 169–170 parallelization, 377 cache configuration, setting, 104, 105 code checking and debugging, cache latency (memory latency), 8–9, 34–35 244–245, 247–276, 274–276 measuring, 288–290, 333–334 ATS for locating optimization bugs, source code optimizations, 333–337 271–274 fetching integer data, 328 compile-time checking, 248–255 synthetic metrics for, 290–292 compiler options for code checking caches, 4, 11–14 C language, 248, 250 access metrics (example), 288–290 C++ language, 250, 251–252 clock speed and, 5 compiler options for debugging, 93–95 data cache events, 283–285 dbx debugger, 262–271 data cache lines, showing, 229 commands mapped to mdb, 276 instruction cache events, 283–285 compiler options for debugging, 262 L3 cache, 302–303 debug information format, 263–264 link time optimization, 115 debugging application (example), miss events, cycles lost to, 287 265–268 prefetch cache, 293–295 running applications under, 268–271 prefetch for cache lines, 335–337 running on core files, 264–265 thrashing in, 12–13, 349–353 including debug information when UltraSPARC III and IV processors, 31–33 compiling, 102–103 call instructions, SPARC, 25 mdb debugger, 274–276 call stack. See stack multithreaded code, 413–416 caller–callee information, 212–214 runtime checking, 256–262 tail-call optimization and, 236–237 code complexity, 437–442 calls from applications, reporting on, 80–81 code coverage information cc command (f90 command), 99 dtrace command, 241–244 character I/O, reporting on, 67 tcov command, 239–241 checking code. See code checking and to validate profile feedback, 114–115 debugging code layout optimizations, 107–116 child processes, 378–379 code parallelization. See parallelization chip multiprocessing (CMP), 373 code scheduling, 105 chip multithreading (CMT), 6, 373. See also specifying, 8–9, 104, 106 multithreading coding rules, 437 atomic operations, 395–396 collect command, 207 mutexes with, 394 collect utility, 207, 208–210, 243, 280 optimizing for, 446 hardware performance counters, 218 parallelization, 377 collision rate, 70 CISC processor, 40–41 command-line performance analysis. See clock speed, 4–5, 15 er_print tool pipelining and, 7 commandline arguments, examining, 79 cloning functions, 325–326 common subexpression elimination (CSE), ClusterTools software, 384–385 325 CMP (chip multiprocessing), 373 communication costs of parallelization, 377. CMT (chip multithreading), 6, 373. See also See also parallelization multithreading compare operations, 9 0138134553_Gove.book Page 450 Tuesday, December 4, 2007 2:29 PM 450 Index comparisons in floating-point operations, CSE (common subexpression elimination), 158 325 compile-time array bounds checking, 259 current processes. See processes compile-time code checking, 248–255 cycle budget (UltraSPARC T1), 305–307 compile time, optimization level and, 94, 96 Cycle_cnt counter, 282 compiler commentary, 244–245 cycle_count counter, 310 compiling for Performance Analyzer, 210 cycles (clock speed), 4, 5 detecting data races, 412 lost of processor stall events, 299 complex instruction set computing. See lost to cache miss events. See misses, CISC processor cache complexity, algorithmic, 437–442 stalled, 6, 316–317 complexity, parallelism, 444 RAW recycles, 352–354 compressed structure code, 341–342 store queue, 354–357 conditional breakpoints (dbx), 269 UltraSPARC III processors, 34 conditional move statements, 358–360 conditional statements, 10, 357–364. See also branches D configuration of system, reporting on, 49–55 -D_FILE_OFFSET_BITS compiler option, conflict (memory), multiple streams and, 348 366 constants, floating-point in single precision, -D_LARGEFILE_SOURCE compiler option, 172–173 366 contended locks, 395–396 -dalign compiler option, 122–123 context switches, reporting number of, data cache. See also caches 57, 60, 63 events, performance counters for, 283–285, control transfer instructions, SPARC, 25 290, 304, 306, 309, 312–314 cooperating processes, 378–382 memory bandwidth measurements, copies of processes, 378 292–293 copying memory, 338–339 memory latency measurements core files, running debugger on, 264–265 example of, 288–290 core-level performance counters, 307 synthetic metrics for, 290–292 cores, 4, 371 reorganizing data structures, 339–343 multiple, 373 showing data cache lines, 229 correctness of code. See code checking and data density, 15 debugging data prefetch. See prefetch instructions costs of parallelization, 377 data races in multithreaded applications, count flag, Opteron performance counters, 389–392, 412–413, 434 311 data reads after writes, 352–354 count Perl script, 280–281 data space profiling, 226–233 coverage. See code coverage information data streams CPUs (central processing units), 3–4, 371. multiple, optimizing, 348–349 See also processors, in general storing, 329 cpustat command, 73–74, 279, 301, 307 data structures, optimizing, 339–349 cputrack command, algorithmic complexity, 437–442 75–76, 279, 280, 304–305 matrices and accesses, 347–348 RAW recycle stalls, 353–354 multiple streams, 348–349 TLB misses, 351–352 prefetching, 343–346 cross calls, 63 reorganizing, 339–343 crossfile optimization, 108–110 various considerations, 346 0138134553_Gove.book Page 451 Tuesday, December 4, 2007 2:29 PM Index 451 data TLB misses, 351–352 df command, 71–72 dbx debugger, 262–271 die, CPU, 4 commands mapped to mdb, 276 direct mapped caches, 12 compiler