Copyrighted Material

INDEX

A
aborts, 83
adaptive locking protocol, 466
adaptive scheduling, 422
algorithmic skeleton, 263
algorithm view, 35
allocation, 418
all-or-nothing transaction, 167
Amdahl’s law, 372
analytical model, 442
application-centric models, 44
array, 67
Array-OL, 146
atomicity, 84, 326
automated parallelization, 233
auto-tuning, 194–195

B
bag, 65
bandwidth, 425
benchmarks, 118, 314, 386, 425
block access, 4
BlockLib, 139
Brook, 145, 201
bus interconnection, 431

C
CABAC, 282
cache, 72, 237
cache behavior, 95
cache coherence, 348
cascading aborts, 172
CAVLC, 282
Cell/B.E. processor, 39
Cell Superscalar, 43
Charm++, 45
Cilk, 37
cloud computing, 452
cluster, 113, 311
code optimization, 292
code profiling, 291
collector, 268
combinability, 62
communication, 36
communication congestion, 431
communication links, 444
commutativity, 177
compiler, 109, 388
computational accelerators, 407
computing performance, 3
concurrency, 36, 228, 232
concurrent code, 81
concurrent conflicting transactions, 83
concurrent data structures, 59
concurrent program, 325
concurrent programming, 465
ConcurrentTesting, 340
concurrent transactions, 91, 173
conflicts, 94
connectivity, 237
consensus problem, 62
consistency, 84
containers, 125
contention manager, 91
context switch, 345
Corey, 457
CPU cycles, 352
CPU-intensive application, 346
crossover operator, 303
CUDA, 41, 302
CUDPP, 138
CUFFT library, 108
cyclic dependency deadlock, 328

D
data block, 390
data compression, 386
data exchanges, 133
data layout, 35
data locality, 431
data parallelism, 186
data-parallel skeletons, 122
data transfers, 36, 133
deadlocks, 166
debugging, 341
decoder, 284
decompressions, 388
deque, 66
desk checking, 330
diagnostic tools, 358
dictionary, 68
dining philosopher, 166
distributed desk checking, 331
distributed programming, 451
distributed real-time applications, 464
distributive review, 325
DOACROSS parallelism, 205
DOALL parallelism, 205
dynamic data structure, 218
dynamic scheduling, 421

E
eager update, 88, 207
efficiency, 432
elastic transactions, 179
embarrassingly parallel applications, 13
emitter, 268
encoder, 284
energy consumption, 352
evolutionary algorithms, 301
execution flow, 420
execution plan, 130
execution time, 432

F
fairness, 353, 466
false conflicts, 94, 176
farm paradigm, 271
FastFlow, 148, 262
fat-pointer, 220
Flynn’s taxonomy, 5
fragmentation, 69
frame level, 286
functional-level parallelism, 287
functional parallelism, 4

G
garbage collection, 70
GenerOS, 457
genetic programming, 302
global scheduler, 463
global scheduling, 463
GPGPU, 301
GPMCs, 39
GPUs, 39
granularity, 466
graphics processor, 72
Grid, 452
Grid middleware, 460
Grid OS, 460
Grid systems, 460

H
hash table, 68
H.264/AVC, 282
heterogeneous architectures, 101
heterogeneous multicores, 19
high-performance computing, 431, 451
H.264/MPEG-4, 281
homogeneous components, 411

I
ILP wall, 11
instruction set architecture, 238
Intel TBB, 39
interconflicts, 94
intercore communication, 156
interleaving, 333
intraconflicts, 94
invisible reads, 89
invisible read transactions, 174
I/O latency, 357
island model, 306
isolated parallel program, 336
isolation, 84
iterator, 150

J
Java, 169
JavaGrande, 200
J2EE, 363
joining, 191

K
kernel, 345
kernel function, 132

L
LAMP stack, 365
latency, 264, 369, 465, 466
layered design, 262
lazy update, 88, 207
linearizability, 60, 86
linked lists, 67
Linux, 170, 453
Linux kernel, 465
list, 67
load balancing, 73
local data structures, 466
lock-free, 60
lock-free data structure, 63
locking strategy, 466
lock table, 95
loosely coupled components, 451

M
many-core accelerators, 410
many-core architectures, 30
many-cores scalability, 380
Map, 139
MapOverlap, 139
mappers, 413
mapping, 36, 52
mapping strategy, 435
MapReduce, 139, 246, 411
Map skeleton, 122
massively parallel applications, 29
master, 115
master/worker pattern, 192, 432
memory allocator, 69
memory bandwidth, 431, 466
memory hierarchy, 466
memory latency, 345–346, 355
memory reclamation, 70
memory wall, 10
Mercurium, 109
Message Passing Interface (MPI), 37, 102, 229, 432
message passing libraries, 7
message-passing paradigm, 466
metadata storage, 219–220
microkernels, 453
miscompression rate, 392
MPI. See Message Passing Interface (MPI)
MPI applications, 432
MPI communication, 116
multicore architectures, 9, 101
multicore clusters, 432
multicore nodes, 431
multiobjective evolutionary algorithms, 303
multiobjective optimization, 302
Multiple Instruction Multiple Data, 6
Multiple Instruction Single Data, 6
multiprocessors, 5
multiprogramming, 5
multithreaded applications, 366
multithreading, 8
mutation operator, 304
mutual exclusion, 60, 81

N
Nanos++, 109
nested transaction, 91
NUMA, 355

O
object-orientation, 185
off-chip communication, 407
offline profiling, 211
offload function, 418
offloading, 253
off-the-shelf components, 408
OmpSs programming model, 46, 102
on-chip memory, 391
opacity, 86
OpenCL, 47, 129
OpenCL/CUDA, 102
OpenMP, 37, 102, 212, 230
OPL, 46
optimistic concurrency control, 83
optimizations, 36

P
parallel bug patterns, 325
parallel design patterns, 191, 265
parallel implementation, 344
parallelism, 3
parallel performance, 343
parallel statements, 189
partitioned scheduling, 463
performance, 32
performance aware, 249
performance bottlenecks, 96
performance metrics, 432, 467
performance optimization, 344
performance portability, 243
performance predictions, 243
pessimistic concurrency control, 83
PetaBricks, 246
pipeline parallelism, 186
pipelining, 4, 124
polymorphism, 177
population, 305
portability, 32, 121
POSIX, 293, 467
power wall, 10
predictability, 367
prefetching, 4
priority queue, 67
productivity, 32
programmability, 143
programmability gap, 32
programming models, 6, 32

Q
queue, 66

R
read sharing, 171
read-write ratio, 466
real-time applications, 467
real-time scheduling, 463, 465
Reduce, 139
reducers, 413
Reduction, 123
region tree, 113
regression analysis, 381
repeatability, 367
resource aware, 249
resource configurations, 415
review techniques, 324
runtime system, 102

S
scalability, 228, 363, 367, 432
scalability tests, 366
scalar processor, 4
scheduling, 36, 52
semantics, 84
sensitivity analysis, 395
sequential implementation, 344
Sequoia, 45
serializability, 85
set, 68
shared cache, 431
shared memory, 8, 303
shared-memory communication, 466
shared-memory locking, 466
shared-memory multicore processor, 343
shared memory multiprocessor, 6
shared-memory paradigm, 466
shared-memory parallel programming, 343
shared-memory programming, 231
Simics, 393
single global lock, 85
Single Instruction Single Data, 5
single-system image, 455
skeletal approach, 265
skeleton, 104
skeleton programming, 122
skip list, 68
SLICES, 161
socket, 348
software accelerators, 266
software pipelining, 222
software transactional memory, 81
SP@CE, 44
SpecFP2000, 393
speculative computation, 209
speculative parallelization, 206
speculative thread, 207
speedup, 310, 381, 432, 435
splitting, 191
SPMD applications, 433
stack, 65
STAPL, 246
Star Superscalar, 232
state separation, 208
stream graph, 186
StreamIt, 144, 201
stream parallelism, 263
stream programming, 143, 185
superscalar processor, 4
symmetric multiprocessors, 7
synchronization, 60, 61, 114, 185, 339, 344, 350, 465
synchronization APIs, 358
synchronization overheads, 358
synchronization primitives, 353
synchronous data flow, 143
system performance, 364

T
task dependency graph, 113
task farm, 124
task graph, 116
task parallelism, 186
test environment, 378
testing, 323
thread-based programming, 229
Threading Building Blocks, 138
thread-level parallelism, 205
threads, 37, 345
throughput, 380, 465
timing requirements, 465
trace, 106
transaction, 166
transactional abort, 82
transactional boosting, 177
transactional commit, 82
transactional isolation, 87
transactional memory, 119, 165
transactional polymorphism, 178
transaction models, 177
transaction nesting, 169
transformation, 153
tree, 69
try-to-eat procedure, 167
tuning, 185
two-phase locking, 171

U
unified runtime architecture, 119
unit of transfer, 390

V
video encoding, 281

W
waiting mechanism, 351
WebLogic, 363
work block, 189
work descriptors, 112
worksharings, 106
work unit, 418
write-back policy, 115

X
XJava, 187
XtreemOS, 460

Programming Multicore and Many-core Computing Systems, First Edition. Edited by Sabri Pllana and Fatos Xhafa. © 2017 John Wiley & Sons, Inc. Published 2017 by John Wiley & Sons, Inc.
