Cambridge University Press 978-1-107-17439-9 — Introduction to Parallel Computing Zbigniew J. Czech Index More Information

Index

3D rendering, 207 scaled speedup, 71, 311 size of input data, 65 Academic Computer Center CYFRONET AGH, speedup, 63, 319 Kraków, Poland, xxvi tree traversal, 121 Academic Computer Center, Gdańsk, Poland, xxvi perfect shuffle, 202 accelerator, 298 PRAM model, 72 address array packing, 119 physical, 184, 310 finding minimum element, 73 translation lookaside buffer, 184 finding sum of elements, 75, 76 virtual, 183 matrix–matrix multiplication, 85, 288 address space, 5, 24, 182, 183, 305, 309 prefix computation, 81, 287 algorithm prefix computation with segmentation, 119 approximate, 217 sorting, 83, 228 network model, 92 randomized, 145, 155, 161 matrix–matrix multiplication, 94, 95 round-robin, 145, 153, 155, 161, 262 matrix–vector multiplication, 93, 236 sequential, 1 minimum graph bisection, 217 array packing, 119 prefix computation, 99 bubble sort, 122 reduction, 96 counting sort, 83, 103 sorting, 228 Euclidean, 58, 285 parallel, 63, 123, 125, 316 Horner’s, 36, 39 absolute and relative speedup, 64 insertion sort, 113 array packing, 119 matrix transpose, 119 coarse-grained, 142 memory requirement, 39, 63, 307 Cole’s sorting, 104 merge sort, 104 communication cost, 64, 85, 183 minimum graph bisection, 217 cost, 63, 66, 308 performance metric, 35, 38 efficiency, 63, 66, 157–159, 161, 310 prefix computation, 287 fine-grained, 134, 293 prefix computation with segmentation, 119 matrix transpose, 120 quicksort, 134 matrix–matrix multiplication, 85, 288 running time, 38, 39, 63, 307 memory requirement, 307 size of input data, 38 minimum graph bisection, 217 spatial modeling, 43 odd–even transposition sort, 113, 122, 210 Amdahl’s law, see law: Amdahl’s overhead, 65–67, 70–72, 139, 144 American National Aeronautics and Space parallel running time, 64, 100, 307 Administration (NASA), xxi performance metric, 64, 66, 77, 84, 85, 94, 97, Ames Research Center, xxii 317 Goddard Space Flight Center, 186 portability, 64 API, see application programming interface prefix computation, 80, 98,
118–120, 287 application programming interface, 243, 248, 262, prefix computation with segmentation, 119 273, 275, 281 processor complexity, 74, 77, 81, 100, 318 approximation, 162 reduction, 121 arithmetic-logic unit, 177, 178, 321 scalability, 67, 123, 158–160 floating-point, 207, 208


batch mode, 24 point-to-point, 94, 206, 225 Bellman Richard Ernest, 170 shared memory, 3, 35 Beowulf, see cluster: Beowulf shared variable, 3 bijection, 60 operation, 3 binary tree, see interconnection network: binary synchronous, 11, 156, 221 tree communication channel, 3, 41, 156, 198, 306, 312 bitonic sequence, 113 data transfer rate, 42, 198, 309 bitonic sorting network, see network: sorting: latency, 94 bitonic transmission time, 94 blocking, see semaphore: blocked-queue or communication complexity, see algorithm: parallel: blocked-set communication cost Brent’s theorem, 79 communication properties, see interconnection broadcasting, see message broadcast, and network: communication properties interconnection network: cube: message company broadcast AMD/ATI, 181, 207 buffer, 11 IBM, 207 cyclical, 13 Inmos, 214, 241 bus, see interconnection network: bus Intel, 178, 181, 212 busy waiting, 27 Nvidia, 207 butterfly, see interconnection network: butterfly Sony, 207 Sun, 207 characteristic vector, 128 Toshiba, 207 base sequence, 128 comparator, see network: comparator checklist, 159–161 comparator network, see network: comparator circuit complex plane, 164 integrated, 207 complexity class, 100 logic, 56, 57, 58, 102, 103 L, 118 VLSI, 61, 196 NC, 100, 117, 315 cluster, 24, 183, 184, 207, 212, 215, 243, 306 subclasses NC^j, 100, 315 Beowulf, 186, 212, 306 NL, 118 computer, 185, 306 NP, 100 computing, 294, 310 NP-complete, 102, 144, 145, 147, 161, 216 constellation, 184, 306, 308 P, 100 data references, 185 P-complete, 65, 100–103, 124, 317 data transmission compression and decompression, 207 latency, 141, 187, 312 computation rate, 187 asynchronous, 3, 93, 181, 187, 191, 196, 216 load balancing, 188 synchronization operation, 216 massively parallel, 313 coarse-grained, 137, 142, 187, 311 multicore processor, 185, 299, 306 computer graphics, 43 node, 184, 306 dataflow, 189, 191, 196 packaging distributed, 24, 187, 214 compact, 187 detecting termination of distributed slack, 187
computation, 154 reliability, 188 fine-grained, 134, 137, 153, 155, 158, 186, 191, 207, scalability, 186 309, 311 symmetric multiprocessor, 184, 306 flow, 194 coherence, see memory: cache: data consistency, grain size, 137 and OpenMP: data consistency granularity, 168, 173 collective communication, see MPI library: linear algebra, 43 collective communication, and loosely synchronous, 216 communication between processes: message-passing, 243 collective multithreaded, 208, 212 communication between processes, 2, 157, parallel, 35, 42, 58, 64, 66, 70, 100, 117, 131, 158 135–137, 156, 189, 214, 309 asynchronous, 11, 156 GPUs, 207 blocking, 156, 223 overhead, 66, 140, 144, 159 buffered, 222 pipeline, 128, 175, 317 collective, 159 pipelining, 125, 128, 317 cost, 159, 160, 294 prefix, 80, 118, 120 global, 158 redundant, 140, 142, 158 local, 158 shared-memory, 243 message passing, 3, 35, 41, 184, 214, 215, 220, 306, synchronous, 40, 94, 96, 180, 216 313 computational message delivery time, 3 power, 47 nonblocking, 222 step, 36, 38, 40, 64


computational complexity NEC Cenju-4, 209 memory, 38, 39, 63, 307 NYU Ultracomputer, 209 time, 38, 39, 63, 64, 100, 307 personal (PC), 5, 180, 186, 196 computer Pleiades, xxii Alpha 21364, 62 processor array, 180, 207, 212, 317 BBN Butterfly, 209 Quadrics Apemille, 207 Blue Gene/L, xxiv, 43, 62 Roadrunner, 298 Bull NovaScale, 209 SGI Altix, xxi, 209 C.mmp, 209 SGI Origin 2000, 209 Compaq ES40, xxiii SIMD, 181, 207 Compaq GS160, xxiv activity mask, 181 conventional, 175, 177, 195, 196 control processor, 180 Convex Exemplar, 209 diversification of computations, 181 Cosmic Cube, 61 execution of if-then-else instruction, 181 CPP DAP Gamma II, 207 image processing, 181 Cray T3D, 62 processing element, 180, 317 Cray T3E, 62 processor array, 317 Cray X1E, xxii simulation of atmospheric phenomena, 181 Cray XT4, xxii, xxiii SISD, 175, 311, 321 Cray XT5, xxiii Solomon, 61 dataflow, 189, 213, 309 Sun Ultra HPC Server, 209 control flow, 189, 191 Sun V40z, 236 Cyberflow, 196 supercomputer, 186, 209, 296, 298, 319 data flow, 190 systolic, 198, 213, 320 DDP, 195 Warp and iWARP, 213 EDDN, 196 Thinking Machines CM-1, CM-2, CM-5, 207, 209 EM-5, 195 Tianhe-2, xxi, xxv, 209, 296 Manchester, 195, 213 unconventional architecture, 189 MIT, 195, 213 uniprocessor, 311, 321 Monsoon, 195, 213 vector, 177, 321 SIGMA-1, 195 computer architecture, 35, 175, 307 structure, 191 Cell Broadband Engine, 207 Earth Simulator, xxii conventional, 177 EDVAC, 35 dataflow, 189, 212, 309 Fujitsu VPP 500, 209 MIMD, 182, 311 HP Superdome, 209 MISD, 198, 311 hybrid, 298 parallel, 212 IBM Bluefire, xxiii processor array, 180 IBM eServer 325, 236 SIMD, 180, 181, 311 IBM p575, xxii SISD, 175, 311 IBM p690, xxii systolic, 196, 320 IBM RP3, 209 unconventional, 189 Illiac IV, 61, 207 uniprocessor, 212 Intel DELTA, 62 vector, 321 iPSC, 62 von Neumann, 35, 177, 212 J-machine, 62 computer game console, 180 K, xxiii Sony PlayStation 3, 207 Kendall Square Research KSR1, 212 computer graphics, 207, 209 MasPar MP-1, 207 computer
vision, 207 massively parallel, 313 computing node, see cluster: node Meiko CS-2, 209 concurrent processes, 2, 3, 34, 308 MIMD, 182 contention, 7 distributed memory, 183, 314 creation loosely coupled, 183 dynamic, 216 shared memory, 182, 209, 314 static, 216 symmetric multiprocessor, xxiii, 180, 182, 184, critical section, 8–10, 12, 18, 27, 271, 308 200, 320 deadlock, 7, 10, 14–16, 34, 309 tightly coupled, 182 multithreaded, 5, 244 MISD, 198 mutual exclusion, 7, 10, 271, 315 multicomputer, 183 nonterminating, 6 multiprocessor, 27, 28, 160, 181, 183, 207, 212, 314 parallel execution, 3 distributed memory, 183, 314 priority, 5 loosely coupled, 183 pseudo-parallel execution, 4, 5, 142, 143, 314 shared memory, 182, 209, 281, 282, 314 starvation, 7, 10, 14, 15, 18, 24, 27, 30 tightly coupled, 182 synchronization, 3, 12, 15, 26, 34, 314 nCUBE, 62 condition (event), 13, 17


concurrent program, 1, 3, 24, 34, 307 of a program, 126, 247, 309 correctness, 6, 34 of an algorithm, 140 fairness, 7 depth of a network, see network: depth fairness (weak, strong), 310 depth-first search, 90 liveness, 6, 7, 10, 15, 34, 312 design checklist, 158 safety, 6, 7, 10, 16, 34, 318 designing parallel algorithms, 125 condition synchronization, see concurrent assigning tasks to processors, 140, 143, 145, 146, processes: synchronization: condition 150, 151, 155, 160, 161 (event) block, 146 condition variable, see monitor: condition variable block-cyclic, 146 conditional compilation, 248 cyclic, 146 constant of gravitation, 166 cost, 168, 173 constellation, see cluster: constellation decomposition, 125, 158, 168, 173, 292, 317 contention blockwise, 145 resource, 8, 27, 42, 152, 308 data, 126, 129, 131, 147, 158, 161 context switch, see operating system: context exploratory, 135, 158 switch fine-grained, 158 core, 179, 308 functional, 126, 127, 130, 150, 151, 158, 161 cost, see algorithm: parallel: cost mixed (hybrid), 137, 158 criterion of cost recursive, 133, 158 logarithmic, 39 speculative, 136, 158 uniform, 39 load balancing, 143, 161, 173, 312 critical section, see concurrent processes: critical centralized, 151, 161 section, and problem: critical section decentralized, 152, 161 crossbar switch, see interconnection network: distributed, 152, 161 crossbar switch dynamic, 151, 161 csh shell, 247 static, 145, 161 CSP notation, 214, 241 method, 125 cube, see interconnection network: cube data parallelism, 147, 181 cube connected cycles, see interconnection Foster’s, 156 network: cube connected cycles functional parallelism, 125–130, 137, 145, CUDA environment 149–151, 154, 158, 161 device function (__device__), 208 master-worker, 155, 162, 217, 281 general-purpose GPU processing (GPGPU), pipeline, 128, 155 209 producer and consumer, 155 grid, 209 task pool, 154 host function (__host__), 208 task dependency graph, 127–130, 151 kernel function (__global__), 208
determinism, 58, 103, 118 thread block, 209 differential equation, 166 thread index, 209 directed forest, 120 cycle, 102 distributed program, 3, 24, 310 Euler, 88 distributed shared memory, 183, 212, 310 data transfer data latency, 184, 312 local, 156 message passing, 184, 310, 313 nonlocal, 156 distribution redundant, 158 data between memories, 183 data consistency (coherence), see memory: cache: matrix elements, 94 data consistency, and OpenMP: data domain decomposition, see designing parallel consistency algorithms: decomposition: data data dependency, 65, 176, 189, 247 data parallelism, see designing parallel algorithms: effect method: data parallelism, and designing Amdahl, 65, 123, 305 parallel algorithms: decomposition: data efficiency, see algorithm: parallel: efficiency database, 8, 17, 33 efficiently parallelizable problem, see problem: transaction, 17 efficiently parallelizable dataflow computation model, see model of embarrassingly parallel problem, see problem: computation: parallel: dataflow embarrassingly parallel de Bruijn Nicolaas Govert, 61 embedding, see interconnection network: deadlock, see concurrent processes: deadlock embedding decomposition, see designing parallel algorithms: encapsulation, 19 decomposition environment OpenCL, 208 degree of concurrency equivalence of computation models, see model of average, 126, 130, 139 computation: parallel: equivalence of maximum, 126, 130 models of a problem, 64, 65, 126, 149, 151, 294, 309 Eratosthenes of Cyrene, 126


Ethernet averaging, 130 fast, 187 Gauss, 130 gigabit, 187 lp2, 130 Euclid, 58 filtering, 129 Euler Leonhard, 88, 167 pixel, 129, 164, 300, 303 Euler path, 91, 289 resolution, 165 event synchronization, see concurrent processes: smoothing, 130 synchronization: condition (event) induction, 54, 116 Infiniband, see interconnection network: switch: fairness, see concurrent program: correctness: Infiniband fairness injection, 50, 52, 54 fastest supercomputers, xxi, 181, 209, 296, 298 instruction fat tree, see interconnection network: fat tree atomic, 27 FIFO queue, 7, 10, 152, 156 compare-and-swap, 27 Flynn’s taxonomy, 175, 206, 212, 311 empty, 27 data stream, 175 exchange, 27 instruction stream, 175 fetch-and-add, 27 fork-join parallelism, see OpenMP interface: test-and-set, 26 model of computation: fork-join vector, 177 parallelism instruction list, see processor: instruction list Foster’s method, see designing parallel algorithms: instruction pipelining, 317 method: Foster’s conditional branch, 176 fractal, 164 control hazard, 177 fractal geometry, 164 data dependency, 176 function data hazard, 176 Boolean, 56, 102 integration isoefficiency, 68, 123, 312 method growth rate, 318 rectangles, 162, 241 functional parallelism, see designing parallel trapezoids, 163, 164, 241, 281 algorithms: decomposition: functional, and step, 162 designing parallel algorithms: method: interconnection network, 41, 180, 182, 312 functional parallelism binary tree, 59, 60 functional programming language, see double-rooted, 60 programming language: functional dynamic, 204, 206, 320 static, 204, 206, 320 galaxy, 168 bisection bandwidth, 42, 305 game theory, 163 bisection width, 42, 48, 206, 216, 305 GNU, see project: GNU bus, 182, 199, 305 granularity, 125, 134, 137, 138, 185, 311 access algorithm, 199 coarse, 137, 294 bandwidth, 199, 200 fine, 134, 138, 185, 293 cost, 199 grain size, 142, 311 data transmission latency, 199, 312 medium, 137 scaling, 200 graph butterfly, 46, 48, 203, 206, 209, 306
directed, 88, 120 communication properties, 42, 43, 53, 59 Eulerian, 88 completely-connected, 42, 48, 216 loop, 120 contention vertex adjacency lists, 89 communication resource, 42 graph bisection, see problem: minimum graph cost, 42, 48, 206 bisection crossbar switch, 182, 184, 200, 206, 209, 308 Gray code, 54 cost, 200 greatest common divisor, see algorithm: scaling, 200 sequential: Euclidean cube, 44, 47, 48, 53, 59, 60, 96, 308 k-dimensional, 44, 48, 53, 55, 60 hazard data transfer statement, 97 control, 177 four-dimensional, 285 data, 176 message broadcast, 49, 98, 121 Heron of Alexandria, 189 three-dimensional, 54, 55 Horner William George, 36 cube connected cycles, 46, 60, 309 hypercube, see interconnection network: cube de Bruijn, 61 degree, 42, 48 I/O diameter, 42, 48, 60, 205, 309 operation, 4, 24, 36, 142, 269 dynamic, 199 port, 156 evaluation parameters, 205, 206 image processing, 129, 207 latency, 206, 312 filter edge connectivity, 42, 48, 206


interconnection network (cont.) law embedding, 50, 310 Amdahl’s, 69, 123, 305 congestion, 51, 52, 53 Gustafson–Barsis’s, 70, 123, 311 dilation, 51, 52, 53 Moore’s, 178, 212, 314 expansion, 52, 53 level of abstraction, 37, 38, 42, 58 load factor, 51, 52, 53 library fat tree, 205, 299, 311, 320 BLAS, 208 local area network, 187 DirectX, 208 maximum degree vertex, 206 java.util.concurrent, 9 mesh, 48, 313 MPI, 214 k-dimensional, 43, 320 Pthreads, 9, 244, 282 multidimensional, 53–55 PVM, 187, 214, 241 one-dimensional, 43, 52, 53, 204, 320 linear algebra, 43 three-dimensional, 43, 60 link, see communication channel two-dimensional, 43, 52–54, 55, 60, list, 87, 120 285 Green 500, 298 mesh of trees, 313 Top 500, 181, 209, 296 two-dimensional, 43, 48 list structure, 87 message routing, 41, 199, 318 liveness, see concurrent program: correctness: routing procedure, 41, 206, 318 liveness multistage, 182, 201, 213, 314, 315 load balancing, see designing parallel algorithms: node, 198, 312 load balancing omega, 182, 201, 206, 209, 213, 315 local area network, see interconnection network: cost, 203 local area network data transmission time, 203 locality of data reference, 159, 312 message routing, 202, 318 spatial, 178 proprietary, 187 temporal, 178 resistance to damage, 42 logarithmic cost criterion, see criterion of cost: scaling, 45, 199 logarithmic sparse, 42, 43 logic circuit, see model of computation: parallel: star, 185, 204, 320 logic circuit static, 199 evaluation parameters, 42 Mandelbrot Benoît B., 164 latency, 94 massively parallel processing system, see cluster: switch, 198, 306, 312, 319 massively parallel degree, 198 memory Infiniband, 299 access message broadcast, 199 random, 36 message buffering, 199 sequential, 36 ports, 199 cache, 27, 177, 179, 184, 200, 212, 273, 282, 306, topology, 41, 47, 180, 199, 320 310 torus, 48, 59, 286, 320 ccNUMA, 185, 209 k-dimensional, 320 data consistency, 185, 249, 273, 282, 306 data transfer statement, 94 hit, 178, 306 doubly
twisted, 59 line, 178, 306, 312 message broadcast, 121, 288 miss, 178, 306 one-dimensional, 43, 53, 54, 60, 93, 138, 185, spatial locality, 199, 312 318 temporal locality, 178, 312 three-dimensional, 43 distributed, 34, 183, 309 two-dimensional, 44, 59, 94, 286 DRAM, 178 tree, 204, 320 global, 207 vertex connectivity, 42, 206 access time, 207 wide area network, 187 hierarchical, 141, 177, 184, 311 Interdisciplinary Centre for Mathematical and local, 39, 93, 156, 181, 183, 184 Computational Modelling, University of nonlocal, 142, 184 Warsaw, Poland, xxvi, 236 nonuniform memory access, xxiii, 141, 182, 184, interleaving, see operating system: interleaving 315 isoefficiency function, see function: isoefficiency shared, 3, 6, 34, 39, 183–186, 207, 243, 244, 319 Karp–Flatt metric, see sequential fraction of access time, 207 parallel computation bandwidth, 182, 313 module, 182, 199 L, see complexity class: L uniform memory access, 182, 184, 321 LAN, see interconnection network: local area memory complexity, see algorithm: sequential: network memory requirement, and algorithm: latency, 3, 94, 142, 206, 207, 312 parallel: memory requirement


memory requirement, see algorithm: sequential: MPI_Barrier, 236 memory requirement, and algorithm: MPI_Bcast, 224, 225, 241 parallel: memory requirement MPI_BOR, 227 memory wall, see problem: memory wall MPI_BXOR, 227 mesh, see interconnection network: mesh MPI_BYTE, 221 message broadcast, 241 MPI_CHAR, 221 all-to-all, 121, 288 MPI_COMM_WORLD, 220, 228, 230 one-to-all, 98, 121 MPI_Comm, 221, 222, 225, 228, 230, 231, 234 message passing, see communication between MPI_Comm_free, 230 processes: message passing MPI_Comm_rank, 220, 224 method MPI_Comm_size, 220, 224 divide and conquer, 170 MPI_Comm_split, 228, 230 dynamic programming, 170 MPI_Datatype, 221–223, 225, 230, 231, 234 common subproblems property, 170 MPI_DOUBLE, 221 optimal substructure property, 170 MPI_DOUBLE_INT, 227 Eulerian cycle, 88 MPI_ERROR, 222 Monte Carlo, 162, 241, 281, 304 MPI_Exscan, 80 convergence, 163 MPI_Finalize, 219, 224 Newton’s, 194 MPI_FLOAT, 221, 300 pointer jumping, 87, 120 MPI_FLOAT_INT, 227 Milky Way-2, see computer: Tianhe-2 MPI_Gather, 230, 231, 233, 234, 300 MIMD (MISD), see computer architecture: MPI_Gatherv, 234–236 MIMD (MISD), and computer: MIMD MPI_Get_address, 302 (MISD) MPI_Get_count, 223 model of computation, 35 MPI_Init, 219, 224 parallel MPI_INT, 221, 300 arbitrary CRCW PRAM, 48 MPI_Irecv, 143, 223 BSP, 56 MPI_Isend, 143, 222, 223 combining CRCW PRAM, 40, 120, 288 MPI_LAND, 227 common CRCW PRAM, 40, 48, 81 MPI_LONG, 221 comparator network, 112, 307 MPI_LONG_DOUBLE, 221 comparison of network models, 50 MPI_LONG_DOUBLE_INT, 227 comparison of PRAM models, 47 MPI_LONG_INT, 227 conflicts in shared memory access, 40 MPI_LOR, 227 CRCW PRAM, 40, 47, 82 MPI_LXOR, 227 CREW PRAM, 40, 47, 58, 83, 85, 228 MPI_MAX, 227 dataflow, 189 MPI_MAXLOC, 227 equivalence of models, 58 MPI_MIN, 227 EREW PRAM, 40, 47, 74 MPI_MINLOC, 227 LogGP, 56 MPI_Op, 225 logic circuit, 56, 58, 102, 313 MPI_PACKED, 221 LogP, 56 MPI_PROD, 227 network model, 35, 41, 187, 315 MPI_Recv, 220, 222–225, 300 PRAM, 35, 39, 316
MPI_Reduce, 79, 224, 225, 227, 229, 230, 241 priority CRCW PRAM, 48 MPI_Scan, 80 processor number, 40 MPI_Scatter, 231 shared memory, 316 MPI_Send, 220–222, 224, 225, 300 sorting network, 56, 307 MPI_SHORT, 221 sequential RAM, 35, 58, 189, 318 MPI_SHORT_INT, 227 computational step, 36 MPI_SOURCE, 222 random access memory, 36 MPI_Status, 222, 223 module, 19 MPI_SUCCESS, 220 memory, 182, 199 MPI_SUM, 227 TriBlade, 299 MPI_TAG, 222 monitor, 18, 25, 314 MPI_Test, 143, 222, 223 condition variable, 19, 314 MPI_Type_commit, 303 operations wait and signal, 19 MPI_Type_create_struct, 302 Moore Gordon E., 178 MPI_Wait, 143, 222, 223 MPI library, 79, 187, 224, 241, 243, 282, 299 MPI_Wtime, 235, 236, 239 MPI_Aint, 302 MPI_UNDEFINED, 228 MPI_Allgather, 231 MPI_UNSIGNED, 221 MPI_Allreduce, 227 MPI_UNSIGNED_CHAR, 221 MPI_ANY_SOURCE, 223 MPI_UNSIGNED_LONG, 221 MPI_ANY_TAG, 223 MPI_UNSIGNED_SHORT, 221 MPI_2INT, 227 collective communication, 224, 225, 300 MPI_BAND, 227 communicator, 220, 225, 228, 230, 231, 236


MPI library (cont.) if, 251, 259, 266 derived datatype, 221, 300 lastprivate, 253, 255, 257, 261 handle, 301 mergeable, 259 map, 301 nowait, 255, 257, 262, 269 signature, 301 num_threads, 247, 251, 267 history, 215 ordered, 255 message broadcast, 225 private, 246, 251, 255, 257, 259, 260 model of computation, 215 reduction, 251, 255, 257, 264 MPI-1, 215, 241 schedule, 255, 262 MPI-2, 215, 241 shared, 246, 251, 259, 260 MPI-3, 215 untied, 259 program compilation and execution, 218 cOMPunity organization, 243, 281 mpicc command, 218 construct, 249 mpirun command, 218 atomic, 272 MPP, see cluster: massively parallel barrier, 260, 270 multiprocessor, see computer: multiprocessor critical, 260, 265, 271, 275, 276, 278, 304 multiprogramming, see operating system: flush, 273 multiprogramming master, 269 multitasking, see operating system: multitasking ordered, 271 multithreading, 6, 34, 180, 207, 208, 282 parallel, 250, 254, 267 mutual exclusion, see concurrent processes: sections, 252, 255 mutual exclusion single, 252, 257, 268 taskwait, 270 NC, see complexity class: NC task, 258, 259 network loop, 252, 254, 255, 275, 304 Beneš, 210 worksharing, 252 comparator, 112, 307 critical section, 271, 272, 276, 304 de Bruijn, 61 data consistency, 245 depth, 113 directive, 243 merging, 115 threadprivate, 274 permutation, 210 environment variable, 243, 247 size, 113 OMP_NUM_THREADS, 247, 252, 267 sorting, 113, 307 ICVs, 247 Batcher’s, 113, 116 library function, 243, 247, 248, 250, 251 bitonic, 113 omp_set_num_threads, 247 network model, see model of computation: omp_get_num_threads, 247 parallel: network omp_get_thread_num, 247, 248, 251 Newton Isaac, 166, 194 omp_get_wtick, 278 NL, see complexity class: NL omp_get_wtime, 278 nondeterminism, 118, 144, 192, 196, 249, 251, 253, omp_set_num_threads, 252, 267 257 memory nonuniform memory access, see memory: cache, 245, 249 nonuniform memory access data consistency, 249 NP, see complexity class: NP relaxed-consistency model, 244
NP-complete, see complexity class: NP-complete shared, 244 NUMA, see memory: nonuniform memory access thread’s temporary view of memory, 244 number threadprivate, 245 complex, 164 model of computation, 244 composite, 126, 233 fork-join parallelism, 246 prime, 126 OpenMP versions 2.5, 3.0, 3.1, 4.0, 243 random, 162, 163, 263, 304 program compilation, 252 NYU Ultracomputer, see computer: NYU command pgcc, 252 Ultracomputer race condition, 249, 260 region, 250 Occam’s (or Ockham’s) razor, 214 active, 245 Ockham, see William of Ockham inactive, 245 OpenMP interface, 208 parallel, 245, 246, 247, 250–252, 257, 258, 266, _OPENMP macro, 248 267, 270, 274 Architecture Review Board, 243 runtime system, 248, 258, 259, 263, 264, 270 clause, 246, 251, 255, 257, 259 structured block, 250, 251 collapse, 255 synchronization barrier, 251, 252, 254, 257, 260, copyin, 251, 268 268–270 copyprivate, 257, 268 task, 244, 256, 257 default, 251, 259, 261 child task, 258, 270 final, 259 explicit, 245, 250 firstprivate, 251, 255, 257, 259, 260 implicit, 245, 246, 250, 256, 258


region, 245, 246, 250 sum (Boolean), 80 scheduling point, 258, 270 swap, 38 tied, 245 switch, 193 team of threads, 247, 250 test-and-set, 3 thread, 244 vector, 207, 299 initial, 245 optimal binary search tree, 168, 169–171, 172, 173, master, 245–247, 250, 251, 258, 269, 274 304 number, 247, 248 order synchronization, 249, 251, 252, 254, 270 postorder, 121 unbalanced workloads, 257, 263 preorder, 121, 288 variable topological, 102 nthreads-var, 247 overlapping communication and computation, 52, private, 243, 244, 246, 253 142, 222, 315 shared, 243, 246, 249, 253, 260, 270, 273 threadprivate, 274 P, see complexity class: P operand, 189 P-complete, see complexity class: P-complete operating system, 1, 4, 5, 34, 160, 236, 244, 247, 315 package, 19 concurrency of processes, 4, 34 parallel context switch, 4, 207 computation thesis, 117, 118 interleaving, 2, 4, 8, 9, 25 execution interrupt, 5 instruction, 175 priority, 5 loop, 260 Linux, 34, 186, 299 operations, 191, 207 load balancing, 5 tasks, 136, 160 multiprogramming, 24, 314 programming model, 156 multitasking, 314 Foster’s, 156 process, 5, 34 slackness, 143, 208, 317 Solaris-2, 34 parallel execution of processes, see concurrent task scheduling, 5 processes: parallel execution shortest job next principle, 32 parallel overhead, see algorithm: parallel: thread, 5, 34, 244 overhead time-sharing, 4, 5, 142, 244 parallel program, 3, 24, 307 Unix, 34 correctness, 7, 249 Windows, 34 fine-grained, 185, 186 operation scalability, 215 arithmetic, 38 SPMD method, 216, 242, 248 associative, 41, 79, 118 parallel running time, see algorithm: parallel: atomic, 3, 9, 25, 26, 272 parallel running time binary, 79, 118 parallel slackness, see parallel: slackness combining, 50 parallel time complexity, see algorithm: parallel: communication, 159 parallel running time commutative, 80, 80 parallel work, see algorithm: parallel: cost compare-and-swap, 3 performance metric, see algorithm: parallel: comparison, 38 performance metric, and
algorithm: computational, 48, 79, 159 sequential: performance metric concatenation, 80 pipelined execution of instructions, 175, 212, 317 control, 36 pointer, 87 copy, 119 stack, 5 data transfer, 196 pointer jumping, 87 dominant, 38 polylogarithmic complexity, 58, 100–102 exchange, 3 portability fetch-and-add, 3 MPI library, 215 fixed-point, 144 parallel algorithm, 64, 159 floating-point, xxi, 186, 235, 293, 296, 299, 319 software, 186 gate, 192 POSIX interface, 244, 282 latency, 142 Poznań Supercomputing and Networking Center, logical (Boolean), 36, 38, 40 Poland, xxvi merge, 192, 192 PRAM, see model of computation: parallel: nondeterministic, 192 PRAM product (Boolean), 80 PRAM-on-chip, 61 relational, 192 prefix, 80, 98, 118, 120, 287 SAXPY, 208 prefix computation, see problem: prefix scan, 80 computation, and algorithm: sequential: select, 192 prefix computation, and algorithm: parallel: sink, 192 prefix computation square, 194 preprocessor, 248


principle matrix transpose, 119, 121 Occam’s razor, 214 two-dimensional mesh, 121 optimality, 170 matrix–matrix multiplication, 94, 196 shortest job next, 32 combining CRCW PRAM, 120, 288 zero-one, 114–116, 123 CREW PRAM, 85, 120, 288 problem cube, 288 n-body, 165 matrix–vector multiplication, 92, 196, 236 MPI program, 241 MPI program, 236 OpenMP program, 281 maximum cut of graph, 242 parallel program, 292 memory wall, 177, 212 sequential program, 167 minimum cut, 216 NC = P?, 100, 124, 315 minimum graph bisection, 216, 275 P = NP?, 100 MPI program, 217 array packing, 119 OpenMP program, 275 bin packing, 144 NC, 100, 124, 315 cigarette smokers, 33 NP, 100 circuit value, 102, 124 NP-complete, 161, 216 unrestricted, 124 NP-hard, 163 computing approximation of π, 162, 210 one-lane bridge, 32 MPI program, 241 P, 100 OpenMP program, 281, 304 P-complete, 100–103, 124, 317 sequential program, 162 prefix computation, 98, 118, 120, 287 computing integral value, 161 segmentation, 118 MPI program, 240 sums, 123 OpenMP program, 281 producer and consumer, 11 sequential program, 162 readers and writers, 17, 30 computing Mandelbrot set, 164 reduction, 79, 123, 225, 264 MPI program, 241, 299 one-dimensional mesh, 121 OpenMP program, 281 two-dimensional mesh, 121 sequential program, 165 satisfiability of Boolean expressions, 102 computing number of descendants, 91 scheduling, 173 computing value of polynomial shortest schedule, 144 Horner’s algorithm, 36 size, 38, 64, 158, 160 RAM program, 36 sleeping barber, 33, 284 constructing optimal binary search tree, 168 sorting, 83, 100, 122, 174, 228, 276 OpenMP program, 304 MPI program, 228 parallel program, 168 OpenMP program, 276 constructing optimal binary search tree transforming unrooted tree into rooted tree, 90 sequential program, 171 traveling salesman, 101 critical section, 8, 17, 26, 27 process decision, 101, 124 multithreaded, 282 detecting termination of distributed sequential, 1, 5, 319 computation, 154, 173 processing dining
philosophers, 14 element, 180, 181, 198, 317 ecoregion map, 146, 173 unit, 184 efficiently parallelizable, 58, 100, 173, 315 processor embarrassingly parallel, 126, 173 AMD Opteron, 236 evaluation of multiple-choice test results, AMD Opteron 2210, 298 126 array, 180 finding minimum element, 49, 61, 73, 100 ATI Radeon HD 4000/5000, 207 finding prime numbers, 242, 277 Cell, 207 MPI program, 232 dual-core, 179 network model, 128 extensions OpenMP program, 277 MMX, 181 pipelining method, 128 SSE, 181 PRAM, 128 Fermi, 207 sieve of Eratosthenes, 126, 233, 277 GPU, 208 speedup, 127 graphics processing unit, 181, 207 finding roots of trees, 120 IBM POWER6, xxiii functional, 101, 124 IBM PowerXCell 8i, 298 inherently sequential, 100 instruction list, 26, 37, 181 Königsberg bridges, 88 Intel 80-core, 212 knapsack, 173 Kepler, 207 leader election, 154, 174 Maxwell, 207 list ranking, 87 multicore, 6, 178, 185, 207, 294, 314 load balancing, 5 Pentium, 175


program counter, 5, 6, 36, 181, 244 rank, 104 real, 2 cross, 104 scalar, 177 recursion, 80, 114, 116, 133, 134 streaming, 207 recursive doubling, 80 superscalar, 175 reduction, 101 Tesla, 207 complexity of reduction, 101 UltraSPARC T2, 207 NC, 101, 102 vector, 177, 212, 321 register, 5, 244 pipelined arithmetic-logic unit, 177 arithmetic, 4, 36, 40, 178, 181 virtual, 1, 2 control, 36 processor complexity, see algorithm: parallel: program counter, 4 processor complexity reliability, see cluster: reliability program counter, see processor: program counter replication program parallelization computation, 142, 159, 160 automatic, 247, 282 data, 141, 159, 160 incremental, 248 resistance to damage, 24 programming resource, 5, 14, 244 concurrent, 8, 34 ring, see interconnection network: torus: distributed, 34 one-dimensional dynamic, 170 routing, 41, 42, 199, 202, 204, 206, 215, multithreaded, 34, 244, 282 318 parallel, 34 rule message-passing, 214 firing, 192 shared-memory, 243 thumb, 160 programming language running time, see algorithm: sequential: running Ada, 10, 25, 34 time C, 208, 214, 215, 219, 221, 226, 227, 243, 246, 252 safety, see concurrent program: correctness: safety C++, 208, 215, 243, 246 scalability, 67, 123, 215, 318 C#, 25 scalability of parallel algorithm, see algorithm: Concurrent Euclid, 25 parallel: scalability Concurrent Pascal, 25 scheduling, see problem: scheduling Fortran, 214, 243, 246 semantic, 249 90, 215 semaphore, 8, 24, 128, 318 2008, 215 binary, 15, 25 HPF, 244, 282 blocked-queue, 9, 25 functional, 195 blocked-set, 24 HASAL, 212 busy-wait, 24 Id, 212 general, 9, 13, 15 Java, 25, 34 operations wait and signal, 9 Lapse, 212 simulation of general semaphore, 30, 283 Mesa, 25 split, 14, 30 Modula 3, 25 strong, weak, 25 occam, 214, 241 waiting queue, 9 SISAL 1.2, 195, 212 sequential fraction of parallel computation, 71, VAL, 212 123, 312 project sequential program, 160 GNU, 186 concurrency Open MPI, 215, 241, 299 halting problem, 6 protocol
correctness, 6, 34 post-protocol, 11, 27, 30 assertion, 6 pre-protocol, 11, 27, 30 partial, 6 pseudo-parallel execution of processes, see postcondition, 6 concurrent processes: pseudo-parallel precondition, 6 execution total, 6 pseudo-parallel execution of tasks, see concurrent RAM, 36, 285 processes: pseudo-parallel execution serial process, see process: sequential Pthreads, see library: Pthreads server, 180 PVM, see library: PVM computing, 5, 188, 306 database, 188, 306 race condition, see OpenMP interface: race www, 188, 306 condition set RAM, see model of computation: sequential RAM Mandelbrot, 164, 241, 281, 299 RAM program, see sequential program: RAM set image, 164 random number generator, 304 sieve of Eratosthenes, see problem: finding prime random sample, 162 numbers: sieve of Eratosthenes


SIMD (SISD), see computer architecture: SIMD task, 1, 125, 156, 168, 244, 320 (SISD), and computer: SIMD (SISD) agglomeration, 157–160 SIMT, 207 granularity, 154, 156 simulation coarse, 152 atmospheric phenomena, 181 fine, 156 communication operation, 51 medium, 137 execution of algorithm, 51, 53, 66, size, 145, 152, 157–159 79 task dependency graph, see designing parallel general semaphore, 30, 283 algorithms: task dependency graph motion of bodies, 165 thread, 5, 207, 244, 320 PRAM computation, 48, 49, 59 context switching, 207 processor computation, 53 queue, 207 simulation step, 165 states, 207 simultaneous reads and writes, 48, throughput, see memory: shared: bandwidth 58 time complexity, see algorithm: sequential: running weather phenomena, 149 time size of a network, see network: size time-sharing, see operating system: time-sharing SMP, see computer: MIMD: symmetric TLB, see address: translation lookaside buffer multiprocessor, and cluster: symmetric torus, see interconnection network: torus multiprocessor transputer, 214 software, 186 tree free of charge, 186, 215 binary search, 168 open source, 186 rooted, 90, 120 sorting, see problem: sorting, and algorithm: traversal, 121 parallel: sorting unrooted, 88 sorting network, see network: sorting Turing Alan Mathison, 36, 58 speedup, see algorithm: parallel: speedup Turing machine, 36, 58, 103 spinning, 27 type SPMD, see parallel program: SPMD derived, 221, 301, 302 stack, 5, 6, 244 semaphore, 10 starvation, see concurrent processes: starvation statement UMA, see memory: uniform memory access atomic, 272 unary notation, 58 data transfer, 94, 97, 121 uniform cost criterion, see criterion of cost: parfor, 73 uniform receive, 93, 121 uniform memory access, see memory: uniform send, 93, 121 memory access supercomputer, see computer: supercomputer surjection, 54 variable switch, see interconnection network: local, 74, 76, 81–83, 85, 93, 95, 97, 99, 105 switch vector instruction, 177 symmetric multiprocessor, see
computer: MIMD: video game, 207 symmetric multiprocessor, and cluster: virtual ring, 294 symmetric multiprocessor von Neumann John, 35, 177, 212 synchronization, see concurrent processes: synchronization Wallis John, 210 synchronization barrier, 27 WAN, see interconnection network: wide area centralized, 28 network dissemination, 29 wide area network, see interconnection network: symmetrical, 29 wide area network two-task, 28 William of Ockham, 214 system worst-case time complexity, see algorithm: parallel: computer, 184, 306 worst-case time complexity concurrent, 8 Wrocław Center for Networking and distributed, 24, 34 Supercomputing, Poland, xxvi parallel scalability, 123 zero-one principle, see principle: zero-one