Copyrighted Material

Blair-Chappell bindex V3 - 03/22/2012

INDEX

Symbols and Numbers Sandy Bridge laptop, 97 speedup, 99 -?, SSA injection, 146 Advisor, 27–31 * (asterisk), reduction(), 105 annotations, 279, 286–290 \ (backslash), continuation marker, 76 replacing, 304–307 = (equals sign), assignment operator, 122 C/C++, 277 - (minus sign) Correctness, 278–279, 295–304 call chains, 282 design, 277–308 reduction(), 105 disadvantages, 278–279 + (plus sign) documentation, 280 call chains, 282 errors, 217 reduction(), 105 lock annotations, 287–288 2nd Generation Intel Core mappings, 305–307 architecture, 10 NQueens example program, 280–281 128-bit SSE variables, 380–381 ROI, 278 serial code, 278 A site annotations, 286–287 Suitability, 279, 290–295 a g, wait(), 200 Summary Report, 304–305 abstraction, 11 Survey, 282–285 acceleration, high-energy physics workflows, 279–280 experiment, 421 aligned_space, 39 addAcc, 400–401 always[assert], 169 addps, 98 AMD, 99 addsdq, 351 Amdahl’s Law, 20–21, 282 addss, 98 Amplifier XE, 8, 19, 26, 45–48, 251–271 AddY(), 89, 111 algorithm analysis, 261–264 addy.c, 89, 128 analysis types, 364 Advanced Vector Extensions (AVX), 10 APIs, 366–369 auto-vectorization, 97 for, 468 Intel, 120 timestamps, 468–469

Amplifier XE (continued) source code, 173 Bandwidth, 364 SSE, 360–361 Bandwidth Breakdown, 364 Sudoku, 394–396 baselines, 252–255 Thread Concurrency, 255–256 cache misses, 359–360 timeline, 258–261 call stacks, 174 viewpoints, 268, 349–350, 364–365 CLI, 253–255, 369–370 analysis types, 218–219. See also specific compiler, 361–363, 386 analysis type Concurrency analysis, 71–73, 251–252, custom, 218, 245–247 414–415, 438–440 predefined, Amplifier XE, 364 hotspots, 270 ANNOTATE_LOCK_ACQUIRE(), 288 CPI, 341–345, 353–354 ANNOTATE_LOCK_RELEASE(), 288 CPU, 355–357 ANNOTATE_SITE_BEGIN(), 286 CPU Usage, 255–256 ANNOTATE_SITE_END(), 286 Custom designed, 364 ANNOTATE_TASK_BEGIN(), 286 Cycles and uOps, 364 ANNOTATE_TASK_END(), 286 data collection, 468, 474 annotations, 29 event-based analysis, 341–374 Advisor, 279, 286–290 frame rate analysis, 368–369 locks, 287–288 Front End Investigation, 364 replacing, 304–307 General Exploration analysis, 352–357, C/C++ compiler, 289 364 Composer XE, 30–31 hardware, 358–363 NQueens, 290 health, 342–345 Annotation Wizard, 289 hotspots, 254, 256–258 anonymous functions. See lambda hotspot analysis, 171–177, 345–352, functions 400, 467 ANSI C-like code, 183 __it_pause(), 366–368 antivirus software, 252 Locks and Waits analysis, 46–48, 251, AOS. See array of structures 268–269, 474–475 APIs, Amplifier XE, 366–369 loops, 172–174 for, 468 loop profiler, 172 timestamps, 468–469 Memory Access, 364 ApplyAccelerationAndAdvancedBodies, n-bodies simulation, 400 401 OpenMP, 76–77, 270–271 ArBB. See Array Building Blocks order-of-magnitude problem, 398 arbb::bind(...), 430 predefined analysis types, 364 ArBB_track_fit, 441 Results tab, 346–347 /arch, 99–100

/arch32, 92 auto-parallelizer /arch:SSE4.1, 100 hotspots, 156 argv[3], 480 hotspot analysis, 165–171 arithmetic overflow, 131 OpenMP, 165 Arr_2_Glob, 465 profiling, 165–168 Array Building Blocks (ArBB), 11, auto-vectorization, 87 427–429 AVX, 97 for, 432 compiler, 99, 103–107 binding, 450–452 enhancing, 99–100 classes.h, 449–450 error messages, 101 Concurrency analysis, 438–440 failure, 101–103 Correctness, 435 GAP, 116–118 covariance matrices, 452 guidelines, 98–99 Kalman filter, 432 IPO, 109–111 kernel, 452–460 loopy code, 97 Lightweight Hotspots analysis, 437 MMX, 97 scalability, 435–438 multipath, 118–120 serial code, 435 optimized code, 116–118 speedup, 435–438 organizing data, 102–103 TBB, 439 processor-specific optimizations, 96–107 track-fitting, 425, 430–440, /Qx, 99 444–460 reports, 91, 100–101 array notations, Cilk Plus, 33, 36 SIMD, 97 elemental functions, 123 speedup, 99 keywords, 13–15 SSE, 32, 96, 97, 99 vectorization, 121–124 turning on, 99 array of structures (AOS), 102, 431 Average Concurrency, 253 assignment operator, 122 AVX. See Advanced Vector Extensions atom, 125 atomic, 39 atomic operations, 236–237 B auto-parallelism. See also guided back-end, 355–356 auto-parallelization Back-end Bound Pipeline Slots, 358 C/C++ compiler, 32 back-end-bound execution, 357 programming guidelines, 168 Bandwidth, Amplifier XE, 364, 407 /Qguide, 181 Bandwidth Breakdown, /Qparallel, 181 Amplifier XE, 364 reports, 91 Barnes-Hut algorithm, 403

Barriers, IDB, 333 compiler, OpenMP, 73 baryons, 419 Dhrystone benchmark, 478–487 baselines, Amplifier XE, 252–255 exceptions, elemental functions, 123 ./bash_profile, 158 legacy code, 478–487 Basic Linear Algebra Subprograms SIMD, 121 (BLAS), 44 SSA, 132 Begin_Time, 465 TBB, 10, 38, 182 benchmark, 467 The C Programming Language (Kernighan benchmarking, 92 and Ritchie), 464 binding, ArBB, 450–452 C runtime library, 17 BinNum, SSE intrinsics, 380–381 C99, 104 BinRow, SSE intrinsics, 381 cache, 18 BIOS, Turbo Boost Technology, 93 cache misses, 359–360 BLAS. See Basic Linear Algebra call Subprograms ArBB, 429 body, 401 kernel, 449–450, 452–453 BODYMAX, 401 call chains, 282 Bool_Glob, 465 call stacks, Amplifier XE, 174 Bottom-up tab, User Mode Call stack with Loops, 284 Hotspots, 347–348 Call Stacks for the two source boundary violations, SSA, 131 snippets, 297 buffer overflows, 51, 131 calling sequence, n-bodies simulation, 410 build specification, SSA, 145–149 cancellation-dominated execution, 357 Build_serial, 312 candidate parallel regions, 283 Build_with_cilk, 312 capture_mode, 183 Build_with_openmp, 312–313 category, SSA, 138 Build_with_tbb, 312 CBM. See Compressed Baryonic Matter C/C++, 182 Advisor, 277 C ArBB, 428 C. See also C/C++ Cilk Plus, 10, 121–122 Dhrystone benchmark, 472–478 cilk_for, 185 global variables, 476–478 compiler, 31–37 hotspots, 350–351 annotations, 289 legacy code, 472–478 optimized code, 87–130 shared variables, 472–475 Composer XE, 26 C++. See also C/C++ Parallel Studio XE, 27 Cilk Plus, 482–487 SSA, 51 class, 479–482 SSE intrinsics, 380

cdqe, 351 reducers, 33, 35, 68–70, 122–123, 190 CERN Collider. See track-fitting compilation errors, 57 Ch_1_Glob, 465 data races, 234 Ch_2_Glob, 465 sections, 195–196 chapter4.h, 89, 128–129 serial programs Check Sum, 68, 74, 89 analysis, 60–62 check sum, 59 debugging, 63–71 chunk size, 78, 79 errors, 63–71 Cilk Plus implementation, 62–63 for, 67, 185–186 tuning, 71–73 reduction, 190 shared variables, 67–68 annotations, 305–307 SSA, 51, 132 array notation Tachyon, 312 elemental functions, 123 TBB, 11–12, 38 matrix multiplication, 123–124 while, 191–192 vectorization, 121–124 wrappers, 483–486 assignment operator, 122 Cilk Thread Stack, 333 C++, 482–487 cilk_for, 11, 14, 34, 185–186, 238 C/C++, 121–122, 182 for, 63, 410 compiler, 32, 33–37 cilk_grainsize, 186 Concurrency analysis, 71–73 cilk::holder, 236 data races, 233–236 wrappers, 483 Dhrystone benchmark, 482–487 CILK_NWORKERS, 34 elemental functions, 123 cilk::reducer_opadd, 234, 413–414 cilk_spawn four-step methodology, 54–73 , 34–35, 236 global variables, 485–486 elemental functions, 123 holders, data races, 234–236 recursive functions, 199 hyperobjects, 483–485 sections, 195–196 while, 191–192 IDB, 333–334 cilk_sync, 34, 236 Intel, 10 class, C++, 479–482 ISAT, 271 classes.h, 441, 447 keywords, array notations, 13–15 ArBB, 449–450 legacy code, 482–487 class.h, 421 linked lists, 209 clean, 467 n-bodies simulation, 399, 410 CLI. See command-line interface nested for loops, 189 clock speed PDE, 43 CPU, 3–4, 5 recursive functions, 199 Turbo Boost Technology, 92

clock ticks, 342 Sudoku, 386 CnC. See Concurrent Collections compilervars.sh, 93 Coarray Fortran, 11 Composer XE collections, ArBB, 428 annotations, 30–31 collision, high-energy physics experiment, C/C++, 26 421 Compressed Baryonic Matter (CBM), combined_track_fit, 441 419–461 command-line interface (CLI) concurrency, 467, 469 Amplifier XE, 253–255, 369–370 Concurrency analysis baselines, 253–255 Amplifier XE, 46, 47, 71–73, 251–252, communication 414–415, 438–440 data communication, Correctness, hotspots, 270 299–300 ArBB, 438–440 IPC, 23 Cilk Plus, 71–73 Compare Total Time, 294 n-bodies simulation, 415 compilation errors, Cilk Plus OpenMP, 76–80 reducers, 57 Concurrent Collections (CnC), 11 compiler, 59 concurrent_bounded_queue, 39 Amplifier XE, 361–363, 386 concurrent_hash_map, 39 auto-vectorization, 99, concurrent_queue, 39 103–107 concurrent_unordered_map, 39 C++, OpenMP, 73 concurrent_vector, 39 C/C++, 31–37 Configure Analysis Type, 245 annotations, 289 Confirmed, SSA problem state, 137, optimized code, 87–130 141, 142 Dhrystone benchmark, 465 consistency, baselines, 252 function calls, 164–165 const long int VERYBIG 100000, 57 hotspot analysis, 158–165 constants, auto-parallelism, 168 intel.noopt.exe, 93 continuation, 35 JIT, 427, 429 continuation marker, 76 lambda functions, 184 control, 33 loop profiler, 156 Core 2 laptop, 90 n-bodies simulation, 399 IPO, 109 PGO, 113–114 optimized code, general options, 94 -Qno-alias-args, 118 PGO, 114 reports, 91 /QxAVX, 100 restrict, 103–104, 118 core_2_duo_sse4_1, 125 SIMD, 97 core_2_duo_ssse3, 125 SSE, 92, 380 core_2nd_gen_avx, 124

core_aes_pclmulqdq, 124 power density, 3–4 core_i7_sse4_2, 124 retirement-dominated execution, 356 Correction analysis, 30 SIMD, 380 Advisor, 278–279, 295–304 Turbo Boost Technology, 92 ArBB, 435 CPU Time by Utilization, 71–72 data collection, 302 CPU Time.Wait Time, 254 data communication, 299–300 CPU Usage, 78 deadlocks, 300 Amplifier XE, 255–256 debugging, 296 timeline, 259 limitations, 301–302 track-fitting, 439 lock annotations, 287 CPUID, 124–125 locks, 300–301 critical sections locksets, 303 hotspots, 260–261 memory, 298–299 OpenMP, data races, 236 modeling, 303 custom analysis types, 218, 245–247 NQueens, 304 Custom designed, Amplifier XE, 364 problems, 298–301 Cycles and uOps, Amplifier XE, 364, 407 references, 303 cycles per instruction (CPI), 341 serial code, 302 Amplifier XE, 341–345, 353–354 Sudoku generator, 392–393 hardware, 358 task execution tree, 302 track-fitting, 435 Correctness Report, 303 NQueens, 296–297 D Correctness Source window, 297–298 data acquisition, high-energy physics covariance matrices, ArBB, 452 experiment, 421 CPI. See cycles per instruction data collection CPU Amplifier XE, 468, 474 addAcc(), 400 Correctness, 302 Amplifier XE, 355–357 limit, 175 back-end, 355–356 Suitability, 294 back-end-bound execution, 357 data communication, Correctness, 299–300 cancellation-dominated execution, 357 data parallelism, 11, 283, 286 clock speed, 3–4, 5 ArBB, 428 dispatch, 119 data races, 15 execution behavior, 356–357 ArBB, 427 front-end, 355–356 Cilk Plus, 233–236 front-end-bound execution, 356–357 deadlocks, 220–221 optimized code, 124–125 debugging, 227, 309

data races (continued) Correctness, 296 detecting, 225–233 data races, 227, 309 fixing, 233–238, 323–331 deadlocks, 309 IDB, 311–331 filters, 332–333 fixing, 325 /Od, 91 Inspector XE, 49, 64, 66, 67, 225–233, OpenMP serial programs, 75–76 467 Parallel Studio XE, 310 Linux, 227, 323 printf(), 8 fixing, 325 runtime, 333–339 mbox, 329–331 serial code, 311, 315–316 n-bodies simulation, 412–414 Visual Studio, PDE, 43 OpenMP, 236–237 workflows, 310–311 fixing, 324–325 __declespec(align), 106 parallel regions, 324 __declspec, 124 PDE, 412–413 __declspec(vector), 13, 123 fixing, 324–325 default.sup, 228 #pragma omp atomic, 261 #define LAST, 255 SSA, 227 #defines, 399 suppression filters, 228–233, 319–323 de-interleaving, 433 TBB, 237–238 Delete(), 134 threading, 220 dependencies Windows, 227, 322–323 auto-parallelism, 168 fixing, 324–325 loop, 110–111 Data Translation Look-aside Buffer vector, 101 (DTLB), 359 Detect Deadlocks analysis, data-level parallelism, 9–10 221–225 datarace, 467 Detect Deadlocks and Data Races analysis, dead code, SSA, 132 225–233 deadlocks, 16 Detect Leaks, 238 ArBB, 427 Detect Memory Problems, 238 Correctness, 300 detection, high-energy physics experiment, data races, 220–221 421 debugging, 309 determinacy races, 15–16 detecting, 221–225 dhry_1.c, 476 Inspector XE, 49, 64, 221–225 dhry_1.cpp, 479 deallocation, 239 dhry_2.c, 464 debugging, 309–339. See also Intel dhry.h, 476 Debugger; Parallel Debugger Extension dhry_l.c, 464 Cilk Plus serial programs, 63–71 dhry_l.cpp, 479

Dhrystone benchmark Enable Task Chunking, 292 C, 472–478 End_Time, 465 C++, 478–487 EnterCriticalSection, 16 Cilk Plus, 482–487 eog, 158 global variables, 464–465, 476–478 errors, 15–19, 217–250 legacy code, 464–465 Advisor, 217 Makefile, 466 auto-vectorization, 101 serial code, 471–472 Cilk Plus serial programs, 63–71 shared variables, 464–465, 472–475 compilation errors, Cilk Plus reducers, 57 dhrystone::main(), 479 Inspector XE, 49–51 Dhrystones_Per_Second, 465 memory Dir(), 134 Inspector XE, 49–50, 238–245 disassembly source view, 48 types, 239–240 distributed parallelism, 9 Memory Error analysis, 218 divide by zero, 131 custom analysis types, 246 _dl_relocate_object, 347 n-bodies simulation, 412–414 documentation, Advisor, 280 PDE, 217 domain-specific libraries, 11 SSA, 134, 217 draw_task, 313 threading, Inspector XE, 49–50, 66 serial, 320 Threading Error analysis, 218, 219–221 driver.cpp, 441, 447–448 custom analysis types, 246–247 dhry_l.cpp, 479 event-based sampling, 46 fitTracks, 442 execution behavior, 356–357 DTLB. See Data Translation Look-aside Exit(), 133 Buffer duration time estimate, 174 DWORD, 56 .dyn, 114 F /Qprof-use, 32 Facility for Antiproton and Ion Research dynamically allocated memory, 131 (FAIR), 420 false positives, 135–136 false sharing, 18–19 E Fast Fourier Transform (FFT), 41–42 MKL, 44 education, 8 File.cpp, 385 ef_add, 13–14 filters Elapsed Time, 253 Correctness Report, 296 elemental functions, Cilk Plus, 33, 36, 123 debugging, 332–333 Enable Parallel Debug Checks, 317 focus, 325–328

filters (continued) #pragma omp for, 476 Kalman filter, 445 #pragma omp parallel, 472 ArBB, 432 reduction, 189–191 fitTracks, 441 schedule, 187 for, 454 speedup, 19 high-energy physics experiment, TBB, 188 421, 425–427 reduction, 191 suppression filters, data races, 228–233, __forceinline, 171 319–323 for_each, 184 filters step, Kalman filters, 426 fork-join parallelism, 286 filter::serial_in_order, ntoken, 207 for-loop, 169 firstprivate, 325 Fortran local_mbox, 327 Parallel Studio XE, 27 threading, 383 SSA, 51, 132 fitTracks frame rate analysis, Amplifier XE, 368–369 driver.cpp, 442 free, 13 Kalman filter, 441 Front End Investigation, Amplifier XE, 364 main(), 442 front-end, 355–356 fitTracksArBB, main, 445 front-end-bound execution, 356–357 fitTracksArBBEntry, 448 functions, 193–197 fit_util.h, 441 elemental, Cilk Plus, 33, 36, 123 Fixed, SSA problem state, 141, 142 lambda, 183–184 floating-point performance, 32 compiler, 184 flow_control fc, 207 TBB, 183 Focus Code Location, 65 recursive, 198–201 focus filters, 325–328 Cilk Plus, 199 foo, 302 OpenMP, 200 serial code, 198–199 for, 185–188 TBB, 200–201 Amplifier XE APIs, 468 TBB, 197 ArBB, 432 function calls Cilk Plus, 67, 185–186 auto-parallelism, 168 cilk_for, 63, 410 compiler, 164–165 deadlocks, 220 Function Call Sites and Loops, 283 hotspot analysis, 254 functors. See lambda functions Kalman filter, 454 lastprivate, 473 G main(), 464 nested loops, 188–189 GAP. See guided auto-parallelization OpenMP, 187–188 GCC. See GNU Compiler Collection reduction, 190 GDI. See Graphics Device Interface

General Exploration analysis H Amplifier XE, 352–357, 364, 407 hard real-time, 22–23 n-objects simulation, 407–409 hardware general options, optimized code, Amplifier XE, 358–363 93–96 CPI, 358 general parallelism, 11 HashAdvance, 410 generator Hash.cpp OpenMP, 382–383 , 410 Sudoku, 379–380 hashed octree, 405–407 optimizing, 384–396 HASHTABLE, 405 Generator.cpp, 385 health, Amplifier XE, 342–345 generic, 125 heaps, 18 Geo.dat, 441 high-energy physics experiment, 420–427 GetLine, 201–202 acceleration, 421 GetPrimes, 266–267, 270 collision, 421 get_value(), 234 data acquisition, 421 reducers, 35 detection, 421 global optimization, 31 Kalman filter, 421, 425–427 global variables serial code, 422 C, 476–478 track reconstruction stages, 421–427 Cilk Plus, 485–486 track-finding, 421, 423–425 Dhrystone benchmark, 464–465, track-fitting, 421, 425–427, 430–440 476–478 vertexing, 421 legacy code, 476–478 high-level constructs, 11 OpenMP, 477 high-level optimization (HLO), 95 wrappers, 483 high-performance computing (HPC), 11 GNU Compiler Collection (GCC), 25, 26, high-performance optimization 399 (HPO), 95 GNU Public License (GPL), 38 hits, 421 gNumPrimes, 265 hitsX, 451 goto, 123, 186 hitsY, 451 GPL. See GNU Public License HLO. See high-level optimization gProgress, 260 hlo, 95 Graphics Device Interface (GDI), 49, 239 holders, Cilk Plus, data races, 234–236 gthum, 158 hotspot, 467, 469 guided auto-parallelization (GAP), 32 hotspots, 155. See also User Mode auto-vectorization, 116–118 Hotspots reports, 91 Amplifier XE, 254, 256–258 Gustavson’s observation, 20–21 Concurrency analysis, 270 gwenview, 158 auto-parallelizer, 156

hotspots (continued) #include, 63, 74, 480 C, 350–351 dhry_1.cpp, 479 critical sections, 260–261 Makefile, 466 printf, 261 SSA, 137 PrintProgress, 256–257 #include , 410 Sudoku #include , 57 generator, 391 #include , 56 solver, 388–389 #include , 56 hotspot analysis. See also Lightweight independent updates, 299 Hotspots analysis init(), 422 Amplifier XE, 46, 60–61, 171–177, initial approximation step, 345–352, 400, 467 Kalman filters, 426 auto-parallelizer, 165–171 injection, SSA, 146–147 compiler, 158–165 inline, 171 inlining, 160 _inline, 171 IPO, 160 __inline__, 171 loops, 156 inlining, 160, 162, 171 octree, 405 inpxe-runsc, 146 HPC. See high-performance computing Inspector XE, 8, 26, 48–51 HPO. See high-performance optimization Custom analysis types, 245–247 hpo, 95 data races, 64, 66, 67, 225–233, 467 hyperobjects, 483–485 deadlocks, 64, 221–225 Hyper-Threading, 5 memory errors, 238–245 baselines, 252 SSA, 131–132 benchmarking, 92 state information, 137 disabling, 92 threading errors, 66 hypervisor, virtualization, 9 Threading Errors analysis, 219 XML, 136 I inspxe, 136 inspxe-gui, 138 ia32, 93 inspxe-inject, 146–149 id, 133 instruction-level parallelism, 9 IDB. See Intel Debugger instructions retired, 342 idb openmp-serialization on, 316 Integrated Performance Primitives (IPP), 11, idb set cilk-serialization on, 316 31, 40–43, 181 idb sharing on, 317 Intel. See also specific topics if, 135 AVX, 120 *.inc, 471 Cilk Plus, 10 incident sharing, 298 domain-specific libraries, 11

MIC, 6 IntervalZero, 23 multipath auto-vectorization, 119 Int_Glob, 465 Parallel Studio XE, 6 intrinsics, SSE, 380–382, 390 research and development, 11 invalid deallocation, 239 standards, 11 invalid memory access, 239 Teraflop Research Chip, 5–6 invalid partial memory access, 239 Intel compiler. See compiler invalid users, SSA, 134–135 Intel Debugger (IDB), 43, 310, 317 Investigated, SSA problem state, 141–142 Barriers, 333 I/O Cilk Plus, 333–334 Locks and Waits analysis, 46 data races, 311–331 virtualization, 9 fixing, 325 IPC. See interprocess communication focus filters, 329 IPO. See interprocedural optimization Linux, 26, 31, 316 ipo_inl, 95 Locks, 333 IPP. See Integrated Performance Primitives OpenMP, 333–339 ippAC, 40 serial code, 315–316 ippCC, 40 Spawn Tree, 333, 336–339 ippCH, 40 Tasks, 333 ippCP, 40 parallel regions, 334–335 ippCV, 40 Taskwaits, 333 ippDC, 40 Teams, 333 ippDI, 40 windows, 333–334 ippGEN, 40 Intel Software Autotuning Tool (ISAT), ippIP, 41 251, 271–272 ippJP, 41 INTEL_LOOP_PROF_XML_DUMP, 159 ippMX, 41 intel.noopt.exe, 93 ippRR, 41 Intel_SSA, 138 ippSC, 41 intent, 33 ippSP, 41 interleaving, 433 ippSR, 41 InterlockedIncrement, 11 ippVC, 41 interprocedural optimization (IPO), 31, 87, ippVM, 41 108–112 irfanview.com, 157 auto-parallelism, 168 ISAT. See Intel Software Autotuning Tool auto-vectorization, 109–111 ISM. See interstellar medium hotspot analysis, 160 ison, 325 /Qipo-, 160 iteration, 209–211 interprocess communication (IPC), 23 __it_pause(), 366–368 interstellar medium (ISM), 397, 398 ittnotify.h, 176

ITT_PAUSE, 366 Large work with overhead, 21 __itt_pause(), 175–176 last level cache, 18 ITT_RESUME, 366 lastprivate, 169, 473, 476 __itt_resume(), 175–176 Launch Application, 346 LeaveCriticalSection, 16 legacy code, 7, 463–488 J C, 472–478 jnle, 351 C++, 478–487 just-in-time (JIT), 427, 429 Cilk Plus, 482–487 Dhrystone benchmark, 464–487 global variables, 476–478 K shared variables, 472–475 Kalman filter, 445 lib-a, 137 for, 454 libittnotify.lib, 176 ArBB, 432 Lightweight Hotspots analysis, 346–352 fitTracks, 441 ArBB, 437 high-energy physics experiment, 421, n-objects simulation, 407–409 425–427 track-fitting, 437 __kcilkrts(), 186 linear(), 105 kernel Linear Algebra PACKage (LAPACK), 44 ArBB, 452–460 LineIn, 203, 205, 207 call, 449–450, 452–453 LineOut, 203, 205, 207 resource leak, 239 linked lists, 208–211 track-fitting, 449–450 Cilk Plus, 209 Kernighan, Brian, 464 OpenMP, 210 keywords, Cilk Plus, 33–34 parallel iteration, 209–211 array notations, 13–15 serial iteration, 209 K&R C code, 464 TBB, 210 Linux Amplifier XE baseline, 255 L compiler, 363 L1 cache, 18 data races, 227, 323 L2 cache, 18 fixing, 325 lambda functions, 183–184 Dhrystone benchmark, 471 compiler, 184 focus filters, 329 TBB, 183 GCC, 26 LAPACK. See Linear Algebra PACKage General Exploration analysis, 354

IDB, 26, 31, 310, 316 counter, 98, 474 ISAT, 271 elemental functions, 123 loop swapping, 362 hotspot analysis, 156 Makefile, 92 OpenMP, 382, 473 n-bodies simulation, 399 swapping, 360–363 hotspot analysis, 402–403 work.c, 110 option-mapping, 155–156 loop parallelism. See data parallelism /Oy-, 155–156 loop profiler PGO, 114 Amplifier XE, 172 PPM, 158 compiler, 156 SSE speedup, 363 inlining, 162 Tachyon, 313, 316 Loop Time, 284 LLC miss, 358 loop trip count, 98 load balancing, 17 LOOPCOUNT, 471 cilk_for, 186 loop-dependency, 110–111 n-bodies simulation, 414–415 LoopOne(), 366 OpenMP, 79–84, 382 loopprofileviewer, 159, 164 local trees, 410 LoopTwo(), 366 local_mbox, 327 loopy code, 97 Locate Memory Problems, 239, 240 Low trip count, 101 locks low-level constructs, 11 annotations, Advisor, 287–288 Correctness, 300–301 IDB, 333 M OpenMP, data races, 236 Suitability Report, 292 main, 62, 429, 445 TBB, 238 main(), 89, 480 Locks and Waits analysis, 46–48, 251, for, 464 268–269, 474–475 fitTracks, 442 locksets, 303 a g, wait(), 200 longjmp, 123 n-bodies simulation, 399 loops, 185–193. See also for; while OpenMP, 200 Amplifier XE, 172–174 packed SSE instructions, 361 auto-vectorization, 98–99 main.c, 152–153 control variables main.c(28) : error #12329, 134 cilk_for, 186 main.c(38) : error #12305, 134 OpenMP, 188 main.c(59) : error #12305, 134

Main.cpp, 385 Inspector XE, 49–50, 238–245 main.cpp, 157, 399, 441, 446–447 types, 239–240 mainCRTStartup, 71, 73, 77 growth, 239 main.h, 399 leaks, 17, 238, 239 MainMenu(), 133, 134 Inspector XE, 50 maintainability, 8 SSA, 131 make clean, 467, 471 OpenMP, 476 Makefile, 89, 129–130, 153, 399 SoA, 433 include, 466 virtualization, 9 legacy code, 466 Memory Access, Amplifier XE, 364, 407 Linux, 92 Memory Error analysis, 218, 246 n-bodies simulation Memory reuse: Observations, 296, 297 hashed octree, 405 Message Passing Interface (MPI), 9, 11, 410 octree, 405 metrics tracking, 145–151 malloc, 13 m_holder(), 483 init(), 422 -mia32, 92 memory, 17–18 MIC. See Many Integrated Core errors, 239 Architecture serial code, 445 Microseconds, 465 mandelbrot.cpp, 157 Microsoft compiler, 58–59 mandelbrot.h, 157 m_index, 433 manual CPU dispatch, 124–125 minimal solution, 379 Many Integrated Core Architecture (MIC), 6 minus-one-plus-two algorithm, 380 many-core computing, 4–6, 8 minus-two-plus-one algorithm, 380, mappings, Advisor, 305–307 390–391 Math Kernel Library (MKL), 11, 31, mirror space, 433 44–45, 181 mismatched allocation/deallocation, 239 matrix multiplication, 123–124 missing allocations, 239 Maximum Program Gain For All Sites, 291 MKL. See Math Kernel Library Maximum Site Gain, 291 _mm_hadd, 361 MAXKEYS, 405 _mm_malloc(), 106 mbox, 329–331 MMX. See MultiMedia eXtensions memcpy, 49, 239 model parameters, 291 memory modeled estimates, Suitability Report, 291 ArBB, 428 modeling Correctness, 298–299 Correctness, 303 errors, 17–19 Suitability, 295

Moore, Gordon E., 5 NewIdx, 388 Moore’s Law, 5 Next_Ptr_Glob, 465 movsdq, 351 nmake benchmark, 468, 471 MPI. See Message Passing Interface , 467 mulsd, 360 [no]assert, 105 mulsdq, 351 NODE, 403 multi-core computing, 4–6 NodeIdx, 387 MultiMedia eXtensions (MMX), 97 NODES, 403 multipath auto-vectorization, non-Intel processors 118–120 AMD, 99 multwo, 123 multipath vectorization, 119 mutex, 39 Nonstandard loop is not a vectorization mutexes, 237–238 candidate, 101 m_value, 433 non-unit strides, 102 myholder, 236, 483 Not a problem, SSA problem state, 137, 141, 142 N Not an inner loop, 101 Not fixed, SSA problem state, 141 n-bodies simulation, 399–400 Not investigated, SSA problem state, calling sequence, 410 141–142 Concurrency analysis, 415 NotePad, 133 data races, 412–414 nowait, 192, 205 errors, 412–414 NQueens hashed octree, 405–407 Advisor, 280–281 hotspot analysis, 400–403 annotations, 290, 307 Linux, hotspot analysis, Correctness analysis, 304 402–403 Correctness Report, 296–297 load balancing, 414–415 Suitability analysis, 295 octree, 403–405 Suitability Report, 291–292 PDE, 412–413 Summary Report, 305 Windows, hotspot analysis, 402–403 Survey Report, 282–283 n-bodies.cpp, 399 nrOfSolutions, 299 n-bodies.h, 399 ntoken, 207 nested loops, 188–189 NULL, 239 nested parallelism, 283, 287 NumInColumn, 388 New, SSA problem state, 141 NumInRow, 388

O data races, 236–237 fixing, 324–325 /O1, 93–94, 168 for-loop, 169 /O2 , 91–94, 108–109 four-step methodology, 73–84 /O3 , 93–94 generator, 382–383 /Ob0, 171 global variables, 477 /Ob1, 171, 174 IDB, 333–339 object declarations, 132 ISAT, 271 octree, 403–405 linked lists, 210 OCTREE, 405 load balancing, 79–84, 382 octree.cpp, 399 locks, data races, 236 octree.h, 399 loops, 382, 473 /Od, 91–92 loop control variables, 188 auto-parallelism, 168 main(), 200 debugging, 91 memory, 476 offline data analysis, 421 n-bodies simulation, 410 omp single, 383 nested for loops, 189 OMP Worker Thread 3, 77 PDE, 43 omp_get_max_threads(), 476 pipelines, 205–206 omp_get_nested(), 189 #pragma omp critical, 11 OMP_NESTED, 189 recursive functions, 200 OMP_NUM_THREAD=1, 225 reduction clause, 76, 105, 190 OMP_NUM_THREADS, 187–188 data races, 237 omp_set_nested(expression), 189 sections, 196–197 omp_set_num_threads(), 187–188, 476 deadlocks, 220 Open Multi-Processing (OpenMP), 11, serial programs 37–38 analysis, 74 for, 187–188 debugging, 75–76 reduction, 190 implementation, 74–75 Amplifier XE, 76–77, 270–271 tuning, 76–84 atomic operations, data races, shared variables, 75–76 236–237 SSA, 51, 132 auto-parallelizer, 165 Sudoku, 382 C++ compiler, 73 generator, 391–392 C/C++, 182 Tachyon, 312, 314–315 compiler, 32 tasks, 382 Concurrency analysis, 76–80 threading, 182 critical sections, data races, 236 while, 192–193

oper, 105 synchronization, 17 operator->, 483 threading, 17 operators /Ox, 93–94 ArBB, 428 /Oy-, 155–156 assignment, 122 optimization reports, 32, 91 /Qopt-report, 95–96 P optimized code packed instructions auto-vectorization, 116–118 SIMD, 124 C/C++ compiler, 87–130 SSE, 98, 360–361 CPU, 124–125 Parallel Advisor. See Advisor disabling, 91–93 parallel choices, Suitability Report, example application, 89–90 292–293, 295 auto-vectorization, 106–107 Parallel Debug Environment, 317 Cilk Plus array notations, 123–124 Parallel Debugger Extension (PDE), 8, 26, general options, 94, 96 31, 43, 310 IPO, 108–109, 111–112 data races, 412–413 PGO, 114–116 fixing, 324–325 unoptimized version, 92–93 errors, 217 general options, 93–96 focus filters, 328 IPO, 108–112 n-bodies simulation, 412–413 PGO, 112–116 Tachyon, 312, 316 processor-specific optimizations, Parallel Inspector XE. See Inspector XE 96–107 parallel iteration, 209–211 producing, 87–130 parallel overhead, 20 seven steps, 90–125 parallel regions source code, 125–130 data races, 324 option-mapping, 155–156 IDB Tasks, 334–335 -opt-report-phase=pgo, 114 Parallel Studio XE, 181 -opts, 156 ./bash_profile, 158 order-of-magnitude problem, 398 C/C++, 27 overclocking, 4 debugging, 310 overflow errors, 217–250 arithmetic, 131 Fortran, 27 buffer, 51, 131 n-bodies simulation, 400 overhead, 20, 21 unoptimized version, 92 profiling, 163–165 parallel_do, 39 Suitability Report, 291 Work(), 193

parallel_for, 38, 39–40, 188 POSIX, 181 parallel_for_each, 39 power density, 3–4 parallel_invoke, 39, 197, 237 PPM, 157, 158 parallel_KF, 445 ppm.cpp, 157 parallel_KF.cpp, 432 pPuzzle, 387 parallel_pipeline, 39, 206–208 pragma, OpenMP, 37 ParallelPrime.cpp, 255, 264, 270 #pragma omp, 325 parallel_reduce, 38, 191 #pragma omp atomic, 260, 261 parallel_scan, 38 #pragma omp critical, 236–237 parallel_sort, 39 gNumPrimes, 265 PARDISO, 44 gProgress, 260 PASSWORD, 134 OpenMP, 11 passwords, SSA, 134–135 #pragma omp for, 187–188, 205, 476 Pause API, 175–176 #pragma omp for schedule(), 187 Paused Time, 347 #pragma omp parallel, 38, 187–188, PDE. See Parallel Debugger Extension 192, 313, 472 pentium_4, 125 #pragma omp single, 38 pentium_4_sse3, 125 #pragma omp task, 38, 192 pentium_iii, 125 #pragma parallel, 168, 169–170 pentium_m, 125 #pragma parallel for, 266 Performance Analyzer #pragma simd, 104–105 n-bodies simulation, 400 #pragma vector aligned, 105–106 Sudoku, 382 #pragma vector always, 101, 104 PGO. See profile-guided optimization #pragma vector unaligned, pgo, 95 105–106 pipelines, 201–208 predefined analysis types, Amplifier XE, OpenMP, 205–206 364 phases, 201 prediction step, Kalman filters, 426 serial code, 203–205 Print.cpp, 385 TBB, 206–208 print.cpp, 399 threading, 202 printf, 8, 101, 481 pointers hotspots, 261 ArBB, 428 PrintProgress, 262 auto-parallelism, 168 reducers, 35 global variables, 476 print.h, 399 high-energy physics experiment, PrintProgress 421–422 hotspots, 256–257 SoA, 433 ParallelPrime.cpp, 264 SSA, 131 printf, 262


privacy
  Inspector XE, 49
  threading, 220
private, 476
  shared variables, 75–76, 473
  threading, 383
private()
  #pragma parallel, 169
  #pragma simd, 104–105
problem states, SSA, 140–142
Problems and Messages, 296
process noise step, Kalman filters, 426
processor-specific optimizations, 96–107
-prof-gen, 114
-prof-gen=srcpos, 114
Profile System, 344, 346
profile-guided optimization (PGO), 31–33, 87, 112–116
  compiler, 113–114
  /Qopt-report-phase, 95
profiling, 156. See also loop profiler
  auto-parallelizer, 165–168
  hotspot analysis with compiler, 158–165
  overhead, 163–165
-prof-use, 114
Project Properties, suppression filters, 230
protostellar cloud, 398
Ptr_Glob, 465
Ptr_Glob_Arr, 476
PutLine, 201–202

Q

QA environment, 149–151
/Qax, 119
/Qdiag-enable:thread, 464
/Qguide, 91
  auto-parallelism, 181
  GAP, 116–117
  /Qvec-report, 116
/Qguide=n, 116
/Qipo, 109–110, 165, 168
/Qipo-, 160
/Qipo, 108
-Qno-alias-args, 118
/Qopenmp, 37
/Qopt-report, 91, 95–96
/Qopt-report-phase, 95
/Qopt-report-phase:pgo, 114, 115
Qpar-adjust-stack, 168
Qpar-affinity, 168
Qparallel, 168
/Qparallel, 165, 181
Qparallel-source-info, 168
Qpar-num-threads, 168
Qpar-report, 168
/Qpar-report, 91, 165
Qpar-runtime-control, 168
Qpar-schedule, 168
Qpar-threshold, 168
/Qprof-gen, 114
/Qprof-gen:srcpos, 114
/Qprofile-functions, 159, 160
/Qprofile-loops, 159
/Qprofile-loops:, 160
/Qprofile-loops-report:, 160
/Qprof-use, 32, 114
Qsort, 287
/Qstd.c99, 104
quarks, 419
queueing_mutex, 238
queueing_rw_mutex, 238
Quick-Reference Guide to Optimization, 90, 109
/Qvec-report, 91, 100–101, 116
/Qvec-report3, 110
/Qved, 99


/Qx
  AMD, 99
  auto-vectorization, 99
  non-Intel processors, 100
/QxAVX, 100

R

range, ArBB, 428
ray tracing. See Tachyon
ReadFiles.cpp, 441, 447
readInput, 441
real sharing, 19
real-time systems, 22–24
recursive backtracking algorithm, 386–387
recursive decomposition, 283–285
recursive functions, 198–201
  Cilk Plus, 199
  OpenMP, 200
  serial code, 198–199
  TBB, 200–201
Reduce Lock Contention, 292–293
Reduce Lock Overhead, 291
Reduce Site Overhead, 291
Reduce Task Overhead, 291
reducers
  Cilk Plus, 33, 35, 68–70, 122–123, 190
  compilation errors, 57
  data races, 234
  views, 234
reducer_basic_string, 36
reducer_list_append, 36
reducer_list_prepend, 36
reducer_max, 36
reducer_max_index, 36
reducer_min, 36
reducer_min_index, 36
reducer_opadd, 35, 36
reducer_ostream, 36
reduction, for, 189–191
reduction(), 105
reduction clause, OpenMP, 76, 105, 190
  data races, 237
reduction variables, 68
redundant clues, 379
redundant code, SSA, 132
references, Correctness, 303
Regression, SSA problem state, 142
regression testing, 149–152
Related Code Location, 65
Relationship Diagram, 297
relatively_global, 302
render_one_pixel, 320
reports
  auto-parallelism, 91
  auto-vectorization, 91, 100–101
  compiler, 91
  GAP, 91
research and development, 11
restrict, 103–104, 118
Results tab, Amplifier XE, 346–347
__resume_(), 366–368
Resume API, 175–176
retirement-dominated execution, 356
return on investment (ROI), 9
  Advisor, 278
  Summary Report, 305
ReturnToMain(), 134
Ritchie, Dennis, 464
ROI. See return on investment
RTX, 23
runSerialBodies, 401
runtime
  C runtime library, 17
  debugging, 333–339
  threading, 258


S

SafeAdd(), 220
Sandy Bridge laptop, 90
  AVX, 97
  /Od and /O2, 92
scalability, 19–22
  ArBB, 435–438
  Maximum Site Gain, 291
  track-fitting, 435–438
Scalable LAPACK (ScaLAPACK), 44
scalable_allocator, 39
scalable_malloc, 13, 18
ScaLAPACK. See Scalable LAPACK
scalar instructions, SSE, 98, 360
scanf(), 133
schedule, 80
  for, 187
  GetPrimes, 266–267, 270
  #pragma parallel for, 266
schedule(static,1), 480
scoped_timer, 446
2nd Generation Intel Core architecture, 10
__sec_reduce_add, 122
__sec_reduce_all_nonzero, 122
__sec_reduce_all_zero, 122
__sec_reduce_any_nonzero, 122
__sec_reduce_max, 122
__sec_reduce_max_ind, 122
__sec_reduce_min, 122
__sec_reduce_min_ind, 122
__sec_reduce_mul, 122
sections, 193–197
  Cilk Plus, 195–196
sections, OpenMP, 196–197, 220
security. See also Static Security analysis
  threats, 131
self time, 163, 282
serial
  draw_task, 320
  #pragma omp, 325
serial code, 7
  Advisor, 278
  ArBB, 435
  Correctness, 302
  debugging, 311, 315–316
  Dhrystone benchmark, 471–472
  high-energy physics experiment, 422
  IDB, 315–316
  malloc, 445
  pipelines, 203–205
  recursive functions, 198–199
  sections, 194–195
  track-fitting, 435, 441–443
serial iteration, linked lists, 209
serial programs
  Cilk Plus, 55–73
    analysis, 60–62
    debugging, 63–71
    implementation, 62–63
  OpenMP
    analysis, 74
    debugging, 75–76
    implementation, 74–75
    tuning, 76–84
serial_KF.cpp, 432, 441, 445
serial_track_fit, 441
Series1()
  loops, 111
  /Qipo, 110
series1(), 89


Series2()
  loops, 111
  /Qipo, 110
series2(), 89
series.c, 89, 128
setjmp, 123
setQueen(), 280–281
shared variables
  C, 472–475
  Cilk Plus, 67–68
  Dhrystone benchmark, 464–465, 472–475
  legacy code, 472–475
  OpenMP, 75–76
  private, 75–76, 473
SIMD. See single instruction, multiple data
SIMD pragma, Cilk Plus, 33, 36
simulated cores, 295
simulation. See also n-bodies simulation
  star formation, 397–398
Simultaneous Multi Threading, 314
single entry-and-exit loop, 98
single instruction, multiple data (SIMD), 9–10, 97. See also Streaming SIMD Extensions
  ArBB, 427
  C++, 121
  CPU, 380
  packed instructions, 124
  timeline, 97
  vectorization, 425
single program multiple data (SPMD), 184, 185
site annotations, Advisor, 286–287
Small work with overhead, 21
SNB
  IPO, 109
  optimized code, general options, 94
  PGO, 114
SOA. See structure of arrays
soft real-time, 22–23
Solve, 388
solve(), 280–281
solver, Sudoku, 379–380, 386–390
source code, 211–214
  Amplifier XE, 173
  ISAT, 271
  optimized code, 125–130
  security threats, 131
  SSA, 131–132, 152–153
  writing, 131–154
Spaces, 184
Spawn Tree, 333, 336–339
speedup, 19–22, 194, 282
  ArBB, 435–438
  auto-vectorization, 99
  SSE, 363
  Suitability Report, 291
  track-fitting, 435–438
Spin Time, 474–475
spin_mutex, 238, 305
spinning threads, 269
spin_rw_mutex, 238
SPMD. See single program multiple data
SqRoot, 201–202
sqroot(), 203
SSA. See Static Security analysis
SSE. See Streaming SIMD Extensions
SSEHasNumber, 390
Standard Template Library (STL), 11
standards, 11
star formation, 397–398
state, SSA, 138
state information, Inspector XE, 137
Statement cannot be vectorized, 101


Static Security analysis (SSA), 32, 51–52
  basics, 134–145
  build specification, 145–149
  conducting, 136–145
  data races, 227
  enabling, 137
  errors, 134, 217
  false positives, 135–136
  include, 137
  injection, 146–147
  Inspector XE, 131–132
  invalid users, 134–135
  metrics tracking, 145–151
  passwords, 134–135
  problem states, 140–142
  QA environment, 149–151
  regression testing, 149–152
  results, 138–140
    directory structure, 147
  source code, 131–132, 152–153
  Visual Studio, 142–145
  workflows, 136
StationsArBB, 431, 449–450, 451
Step, 410
STL. See Standard Template Library
Streaming SIMD Extensions (SSE), 9–10, 97
  Amplifier XE, 360–361
  auto-vectorization, 32, 96, 99
  compiler, 92
  intrinsics, Sudoku, 380–382, 390
  multipath auto-vectorization, 119
  packed instructions, 98, 360–361
  PDE, 43
  scalar instructions, 98, 360
  speedup, 99, 363
  Sudoku, 377
stride values, 102–103
struct
  elemental functions, 123
  memory errors, 240
struct_globals, 476
structure of arrays (SOA), 102, 431, 433–434
Subscript too complex, 101
Sudoku, 377–396
  Amplifier XE, 394–396
  BinNum, 380–381
  challenge of, 378–379
  compiler, 386
  generator, 379–380
    Correctness, 392–393
    hotspots, 391
    minus-two-plus-one algorithm, 390–391
    OpenMP, 391–392
    optimizing, 384–396
  high-level design, 379–380
  minus-one-plus-two algorithm, 380
  minus-two-plus-one algorithm, 380
  OpenMP, 382
  Performance Analyzer, 382
  solver, 379–380, 386–390
    hotspots, 388–389
  SSE, 377
    intrinsics, 380–382, 390
  Visual Studio, 385
Suitability analysis, 279, 290–295
  data collection, 294
  modeling, 295
  NQueens, 295
  task execution tree, 294
  timestamps, 294
Suitability Report, 29–30, 291–294
  NQueens, 291–292
  parallel choices, 292–293, 295
sum, 68
  reduction clause, 76
Summary Report, Advisor, 304–305


sum_of_differences, 429
sumx, 67
sumy++, 111
sumy--, 111
suppression filters, data races, 228–233, 319–323
Survey analysis, 282–285
Survey Report, 282–283
Survey Source window, 284
switch, 101, 123
synchronization
  overhead, 17
  primitives, ArBB, 428
SystemMenu(), 133, 134
system_table, 134, 135

T

Tachyon ray-tracing application, 311–331
  Linux, 313, 316
  OpenMP, 314–315
  PDE, 316
  Windows, 312, 316
Tachyon.common, 312
Target CPU Number, 291, 295
tasks
  IDB, 333
    parallel regions, 334–335
  OpenMP, 382
  TBB, 38
  threading, 183
task chunking, 292
task execution tree
  Correctness, 302
  Suitability, 294
task parallelism, 9, 283
  Qsort, 287
task scheduler, 295
task_group, 200
Taskwaits, 333
TBB. See Threading Building Blocks
tbb_allocator, 39
tbb::atomic, 238
tbb::concurrent_queue, 237
tbb::mutex, 238
TBuf, 452
Teams, IDB, 333
temp, 298–299
templates
  STL, 184, 185
  TBB, 38–39
  wrappers, 236
Teraflop Research Chip, Intel, 5–6
test.c, 152
This program was not built to run on the processor in your system., 97
Thread Concurrency, 78
  Amplifier XE, 255–256
  Histogram, 81–82
  timeline, 259
  track-fitting, 439
Thread Data Sharing Events, 320, 325, 327
  filters, 332
threading
  chunk size, 78, 79
  data races, 220
  errors, Inspector XE, 49–50, 66
  firstprivate, 383
  Hyper-Threading, 5
    baselines, 252
    benchmarking, 92
    disabling, 92
  information, 219–220


threading (continued)
  IPP, 42–43
  OpenMP, 182
  overhead, 17
  pipelines, 202
  privacy, 220
  runtime, 258
  Simultaneous Multi Threading, 314
  spinning threads, 269
  tasks, 183
  views, 69
Threading Building Blocks (TBB), 10, 31, 38–40, 181
  for, 188
    reduction, 191
  annotations, 305–307
  ArBB, 439
  C++, 182
  Cilk Plus, 11–12
  data races, 237–238
  functions, 197
  ISAT, 271
  lambda functions, 183
  linked lists, 210
  locks, 238
  mutexes, 237–238
  n-bodies simulation, 410
  nested for loops, 189
  parallel_invoke, 197
  parallel_pipeline, 206–208
  parallel_reduce, 191
  pipelines, 206–208
  recursive functions, 200–201
  spin_mutex, 305
  Tachyon, 312
  templates, 38–39
  while, 193, 237
Threading Error analysis, 218, 219–221
  custom analysis types, 246–247
Threading Model, 295
thread-level parallelism, 9
thread-safe, 42
  MKL, 44
timeGetTime(), 56, 57, 441
timeline, Amplifier XE, 258–261
timestamps
  Amplifier XE APIs, 468–469
  Suitability, 294
tools, 7–8
Top-down Tree tab, User Mode Hotspots, 348–349
Total, 59, 68, 74, 89
total, 68
  reduction clause, 76
Total Time, 284, 294
track-finding, 421, 423–425
track-fitting, 419–461
  ArBB, 430–440, 444–460
  Correctness, 435
  CPU Usage, 439
  high-energy physics experiment, 421, 425–427, 430–440
  kernel, 449–450
  Lightweight Hotspots analysis, 437
  scalability, 435–438
  serial code, 435, 441–443
  speedup, 435–438
  Thread Concurrency, 439
TracksArBB, 449–450
  binding, 451
Tracks.dat, 441
TREE, 403
trees
  hashed octree, 405–407
  local, 410
  octree, n-bodies simulation, 403–405
  Spawn Tree, 333, 336–339


trees (continued)
  task execution tree
    Correctness, 302
    Suitability, 294
  Top-down Tree tab, User Mode Hotspots, 348–349
TREE_WIDTH, 403
true dependence, 300
Trust Region Solver, 44
try_pop(), 237
Turbo Boost Technology, 90
  baselines, 252
  benchmarking, 92
  BIOS, 93
  disabling, 92
  IPO, 109
  optimized code, general options, 94
  PGO, 114

U

unchecked input, SSA, 131
underclocking, 5
uninitialized memory access, 240
uninitialized partial memory access, 240
uninitialized variables and objects, 131
USE_API, 366
USE_CILK, 399
USE_CILK_REDUCER, 414
User Mode Hotspots, 346–350
  Bottom-up tab, 347–348
  Top-down Tree tab, 348–349
  viewpoints, 349–350
user_table, 133
  system_table, 135
User_Time, 465
usize, 454

V

var, 105
variables
  global
    C, 476–478
    Cilk Plus, 485–486
    Dhrystone benchmark, 464–465, 476–478
    legacy code, 476–478
    OpenMP, 477
    wrappers, 483
  loop control
    cilk_for, 186
    OpenMP, 188
  reduction, Cilk Plus, 68
  shared
    C, 472–475
    Cilk Plus, 67–68
    Dhrystone benchmark, 464–465, 472–475
    legacy code, 472–475
    OpenMP, 75–76
    private, 75–76, 473
  uninitialized, SSA, 131
vector dependency, 101
vectorization. See also auto-vectorization
  alternative means, 120–125
  Cilk Plus array notation, 121–124
  SIMD, 425
  SSE intrinsics, 380
Vectorization possible but seems inefficient, 101
vectorlength(), 104
-verbose, 147
-version, 146


vertexing, 421
VERYBIG, 58, 63, 68
views
  reducers, 35, 234
  threading, 69
viewpoints
  Amplifier XE, 268, 349–350, 364–365
  User Mode Hotspots, 349–350
virtual machine (VM), 428–429
virtualization, 9
visibility, 174
Visual Studio, 26
  compiler, 59
  debugging, PDE, 43
  hotspot analysis, SSA, 142–145
  Sudoku, 385
  unoptimized version, 92
VM. See virtual machine
vmlinux, 347
vtC, 449, 454
vtT, 449, 454
VTUNEDIR, 471

W

Wait Time, 254
Wait Time by Utilization, 261
weight, 138
whatever_udt, 433
while, 191–193
  Cilk Plus, 191–192
  OpenMP, 192–193
  TBB, 193, 237
Win32
  EnterCriticalSection, 16
  InterlockedIncrement, 11
  LeaveCriticalSection, 16
win32, 181
Windows
  Amplifier XE baseline, 254
  compiler, 363
  data races, 227, 322–323
    fixing, 324–325
  Dhrystone benchmark, 471
  focus filters, 328
  General Exploration analysis, 354
  ISAT, 271
  loop swapping, 363
  n-bodies simulation, 402–403
  PGO, 114
  SSE speedup, 363
  Tachyon, 312, 316
winmm.lib, 57, 64
Work(), 198
  lambda functions, 200
  parallel_do, 193
work(), 19, 21–22, 89
Work with no overhead, 21
work.c, 127–128
  loops, 110
workflows
  Advisor, 279–280
  debugging, 310–311
  SSA, 136
WRAPEVERYTHING, 469–470
wrappers
  Cilk Plus, 483–486
  cilk::holder, 483
  global variables, 483
  templates, 236
wtime.c, 89, 128–129, 157


X

X2hits, 451
Xeon workstations, 90
  IPO, 109
  /Od and /O2, 92
  optimized code, general options, 94
  PGO, 114
x[i], 98
XML, 136

Y

Y2hits, 451
y[i], 98

Z

zero_allocator, 39
/Zi, 168
