Part II Extreme Models
Fall 2008 Parallel Processing, Extreme Models

About This Presentation
This presentation is intended to support the use of the textbook Introduction to Parallel Processing: Algorithms and Architectures (Plenum Press, 1999, ISBN 0-306-45970-1). It was prepared by the author in connection with teaching the graduate-level course ECE 254B: Advanced Computer Architecture: Parallel Processing, at the University of California, Santa Barbara. Instructors can use these slides in classroom teaching and for other educational purposes. Any other use is strictly prohibited. © Behrooz Parhami
First edition: released Spring 2005; revised Spring 2006; revised Fall 2008.
II Extreme Models
Study the two extremes of parallel computation models:
• Abstract shared-memory model (PRAM): ignores implementation issues
• Concrete circuit model: incorporates hardware details
• Everything else falls between these two extremes
Topics in This Part
Chapter 5  PRAM and Basic Algorithms
Chapter 6  More Shared-Memory Algorithms
Chapter 7  Sorting and Selection Networks
Chapter 8  Other Circuit-Level Examples
5 PRAM and Basic Algorithms
PRAM, a natural extension of RAM (random-access machine): • Present definitions of model and its various submodels • Develop algorithms for key building-block computations
Topics in This Chapter
5.1 PRAM Submodels and Assumptions
5.2 Data Broadcasting
5.3 Semigroup or Fan-in Computation
5.4 Parallel Prefix Computation
5.5 Ranking the Elements of a Linked List
5.6 Matrix Multiplication
5.1 PRAM Submodels and Assumptions

Processor i can do the following in the three phases of one cycle:
1. Fetch a value from address s_i in shared memory
2. Perform computations on data held in local registers
3. Store a value into address d_i in shared memory

Fig. 4.6 Conceptual view of a parallel random-access machine (PRAM): p processors (0 to p–1) connected to m shared memory locations (0 to m–1).
Types of PRAM

PRAM submodels are classified by whether reads and writes to the same memory location are exclusive or concurrent:
• EREW: exclusive-read, exclusive-write; least “powerful”, most “realistic”
• CREW: concurrent-read, exclusive-write; the default submodel
• ERCW: exclusive-read, concurrent-write; not useful
• CRCW: concurrent-read, concurrent-write; most “powerful”, further subdivided

Fig. 5.1 Submodels of the PRAM model.
Types of CRCW PRAM
CRCW submodels are distinguished by the way they treat multiple writes:
• Undefined: the value written is undefined (CRCW-U)
• Detecting: a special code for “detected collision” is written (CRCW-D)
• Common: allowed only if they all store the same value (CRCW-C) [this is sometimes called the consistent-write submodel]
• Random: the value is randomly chosen from those offered (CRCW-R)
• Priority: the processor with the lowest index succeeds (CRCW-P)
• Max / Min: the largest / smallest of the values is written (CRCW-M)
• Reduction: the arithmetic sum (CRCW-S), logical AND (CRCW-A), logical XOR (CRCW-X), or another combination of values is written
Power of CRCW PRAM Submodels
Model U is more powerful than model V if T_U(n) = o(T_V(n)) for some problem.
EREW < CREW < CRCW-D < CRCW-C < CRCW-R < CRCW-P
Theorem 5.1: A p-processor CRCW-P (priority) PRAM can be simulated (emulated) by a p-processor EREW PRAM with slowdown factor Θ(log p).
Intuitive justification for concurrent read emulation (write is similar):
1. Write the p memory addresses in a list
2. Sort the list of addresses in ascending order
3. Remove all duplicate addresses
4. Access data at desired addresses
5. Replicate data via parallel prefix computation

Each of these steps requires constant or O(log p) time.
Implications of the CRCW Hierarchy of Submodels
EREW < CREW < CRCW-D < CRCW-C < CRCW-R < CRCW-P
A p-processor CRCW-P (priority) PRAM can be simulated (emulated) by a p-processor EREW PRAM with slowdown factor Θ(log p).
Our most powerful PRAM CRCW submodel can be emulated by the least powerful submodel with logarithmic slowdown
Efficient parallel algorithms have polylogarithmic running times
Running time still polylogarithmic after slowdown due to emulation
We need not be too concerned with the CRCW submodel used
Simply use whichever submodel is most natural or convenient
Some Elementary PRAM Computations
Initializing an n-vector (base address = B) to all 0s, using p processors and ⌈n/p⌉ segments:

Processor i, 0 ≤ i < p, do
  for j = 0 to ⌈n/p⌉ – 1 do
    if jp + i < n then M[B + jp + i] := 0
  endfor
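The schedule above can be checked by simulating it sequentially. This Python sketch (not from the slides) keeps the slide's symbols p, n, and B, and plays the role of the p simultaneous processors with an inner loop:

```python
import math

def pram_init_zero(M, B, n, p):
    # One outer iteration = one parallel step; the inner loop stands in
    # for the p processors acting simultaneously on disjoint cells.
    for j in range(math.ceil(n / p)):
        for i in range(p):
            if j * p + i < n:
                M[B + j * p + i] = 0

M = [7] * 12
pram_init_zero(M, B=2, n=8, p=3)
print(M)  # [7, 7, 0, 0, 0, 0, 0, 0, 0, 0, 7, 7]
```

Each processor touches a distinct address in every step, so the schedule is EREW-compliant.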
Adding two n-vectors and storing the results in a third (base addresses B′, B″, B)
Convolution of two n-vectors: W_k = Σ_{i+j=k} U_i × V_j (base addresses B_W, B_U, B_V)
5.2 Data Broadcasting

Making p copies of B[0] by recursive doubling:

for k = 0 to ⌈log₂ p⌉ – 1
  Processor j, 0 ≤ j < p, do
    Copy B[j] into B[j + 2^k]
endfor

Fig. 5.2 Data broadcasting in EREW PRAM via recursive doubling.

The algorithm can be modified so that redundant copying does not occur and the array bound is not exceeded (Fig. 5.3 EREW PRAM data broadcasting without redundant copying).
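A sequential sketch of the recursive-doubling broadcast (the function name is hypothetical; the snapshot of B models all processors reading before any of them writes, as one parallel EREW round):

```python
import math

def erew_broadcast(value, p):
    B = [None] * p              # None marks cells holding no copy yet
    B[0] = value
    for k in range(math.ceil(math.log2(p))) if p > 1 else []:
        old = B[:]              # snapshot = simultaneous reads in one round
        for j in range(p):      # processors j = 0..p-1, in parallel
            if old[j] is not None and j + 2 ** k < p:
                B[j + 2 ** k] = old[j]
    return B

print(erew_broadcast(42, 6))    # [42, 42, 42, 42, 42, 42]
```

After round k there are up to 2^(k+1) copies, so ⌈log₂ p⌉ rounds suffice.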
All-to-All Broadcasting on EREW PRAM
EREW PRAM algorithm for all-to-all broadcasting:

Processor j, 0 ≤ j < p, write own data value into B[j]
for k = 1 to p – 1
  Processor j, 0 ≤ j < p, do
    Read the data value in B[(j + k) mod p]
endfor

This O(p)-step algorithm is time-optimal.
Naive EREW PRAM sorting algorithm (using all-to-all broadcasting):

Processor j, 0 ≤ j < p, write 0 into R[j]
for k = 1 to p – 1
  Processor j, 0 ≤ j < p, do
    l := (j + k) mod p
    if S[l] < S[j] or (S[l] = S[j] and l < j)
      then R[j] := R[j] + 1
    endif
endfor
Processor j, 0 ≤ j < p, write S[j] into S[R[j]]

This O(p)-step sorting algorithm is far from optimal; sorting is possible in O(log p) time.
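A sequential simulation of the naive rank-based sort above (function name hypothetical). Processor j computes the rank of S[j] by comparing against every other key it reads during the all-to-all broadcast; the tie-break on index makes the ranks a permutation even with duplicates:

```python
def broadcast_sort(S):
    p = len(S)
    R = [0] * p
    for k in range(1, p):               # p - 1 parallel steps
        for j in range(p):              # all p processors act in each step
            l = (j + k) % p
            if S[l] < S[j] or (S[l] == S[j] and l < j):
                R[j] += 1
    out = [None] * p
    for j in range(p):                  # processor j writes S[j] into slot R[j]
        out[R[j]] = S[j]
    return out

print(broadcast_sort([3, 1, 4, 1, 5]))  # [1, 1, 3, 4, 5]
```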
Class Participation: Broadcast-Based Sorting
Each person writes down an arbitrary nonnegative integer with 3 or fewer digits on a piece of paper. Students take turns broadcasting their numbers by calling them out aloud.

Each student puts an X on paper for every number called out that is smaller than his/her own number, or is equal but was called out before the student's own value.
Each student counts the number of Xs on paper to determine the rank of his/her number
Students call out their numbers in order of the computed rank
5.3 Semigroup or Fan-in Computation

EREW PRAM semigroup computation algorithm (recursive doubling; ⊗ is the semigroup operation):

Processor j, 0 ≤ j < p, copy X[j] into S[j]
s := 1
while s < p do
  Processor j, s ≤ j < p, do
    S[j] := S[j – s] ⊗ S[j]
  s := 2s
endwhile
Broadcast S[p – 1] to all processors

[Trace for p = 10: entry i:j denotes x_i ⊗ … ⊗ x_j; after the final step, row j holds 0:j, so S[9] = 0:9 is the semigroup result.]
5.4 Parallel Prefix Computation
Same as the first part of the semigroup computation (no final broadcasting).

Trace of the S array for p = 10, initially and after each of the four steps (entry i:j denotes x_i ⊗ … ⊗ x_j):

j   init  step1  step2  step3  step4
0   0:0   0:0    0:0    0:0    0:0
1   1:1   0:1    0:1    0:1    0:1
2   2:2   1:2    0:2    0:2    0:2
3   3:3   2:3    0:3    0:3    0:3
4   4:4   3:4    1:4    0:4    0:4
5   5:5   4:5    2:5    0:5    0:5
6   6:6   5:6    3:6    0:6    0:6
7   7:7   6:7    4:7    0:7    0:7
8   8:8   7:8    5:8    1:8    0:8
9   9:9   8:9    6:9    2:9    0:9
Fig. 5.6 Parallel prefix computation in EREW PRAM via recursive doubling.
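The recursive-doubling prefix scheme of Fig. 5.6 can be sketched in Python (a simulation, not slide code; the snapshot models simultaneous reads):

```python
def parallel_prefix(x, op=lambda a, b: a + b):
    S = x[:]
    s = 1
    while s < len(S):
        old = S[:]                       # snapshot = simultaneous reads
        for j in range(s, len(S)):       # processors j >= s act in parallel
            S[j] = op(old[j - s], old[j])
        s *= 2
    return S

print(parallel_prefix([1, 2, 3, 4, 5]))  # [1, 3, 6, 10, 15]
```

The operator only needs to be associative; string concatenation, for example, works: `parallel_prefix(["a", "b", "c"], lambda a, b: a + b)` yields `["a", "ab", "abc"]`.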
A Divide-and-Conquer Parallel-Prefix Algorithm

Combine adjacent pairs of inputs with ⊗, do a parallel prefix computation of size n/2 on the results, then fan the results back out to the remaining positions. Each vertical line in the figure represents a location in shared memory.

T(p) = T(p/2) + 2, so T(p) ≅ 2 log₂ p

In hardware, this scheme is the basis for the Brent-Kung carry-lookahead adder.

Fig. 5.7 Parallel prefix computation using a divide-and-conquer scheme.
Another Divide-and-Conquer Algorithm

Perform parallel prefix computations on the n/2 even-indexed inputs and, in parallel, on the n/2 odd-indexed inputs, then combine the two sets of results. Each vertical line in the figure represents a location in shared memory.

T(p) = T(p/2) + 1, so T(p) = log₂ p

This algorithm is strictly optimal, but it requires commutativity of the operator ⊗.

Fig. 5.8 Another divide-and-conquer scheme for parallel prefix computation.
5.5 Ranking the Elements of a Linked List

Example list, from head to the terminal element: C, F, A, E, B, D
Rank (distance from the terminal element): 5, 4, 3, 2, 1, 0
Distance from head: 1, 2, 3, 4, 5, 6

Fig. 5.9 Example linked list and the ranks of its elements.

List ranking appears to be hopelessly sequential; one cannot get to a list element except through its predecessor!

Fig. 5.10 PRAM data structures representing the linked list and the ranking results (head = 2; the terminal element points to itself):

index  info  next  rank
  0     A     4     3
  1     B     3     1
  2     C     5     5
  3     D     3     0
  4     E     1     2
  5     F     0     4
List Ranking via Recursive Doubling
Many problems that appear to the uninitiated to be unparallelizable are in fact parallelizable; intuition can be quite misleading!

Fig. 5.11 Element ranks initially and after each of the three iterations (head-to-terminal order):

Initially:         1 1 1 1 1 0
After iteration 1: 2 2 2 2 1 0
After iteration 2: 4 4 3 2 1 0
After iteration 3: 5 4 3 2 1 0
PRAM List Ranking Algorithm
PRAM list ranking algorithm (via pointer jumping):

Processor j, 0 ≤ j < p, do  {initialize the partial ranks}
  if next[j] = j
    then rank[j] := 0
    else rank[j] := 1
  endif
while rank[next[head]] ≠ 0
  Processor j, 0 ≤ j < p, do
    rank[j] := rank[j] + rank[next[j]]
    next[j] := next[next[j]]
endwhile

If we do not want to modify the original list, we simply make a copy of it first, in constant time.

Question: Which PRAM submodel is implicit in this algorithm?  Answer: CREW
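A sequential sketch of pointer jumping on the arrays of Fig. 5.10 (function name hypothetical; the snapshots model the simultaneous reads of one parallel iteration):

```python
def list_rank(next_, head):
    n = len(next_)
    nxt = next_[:]                       # work on a copy of the list
    rank = [0 if nxt[j] == j else 1 for j in range(n)]
    while rank[nxt[head]] != 0:
        old_rank, old_next = rank[:], nxt[:]   # simultaneous reads
        for j in range(n):               # processors j = 0..n-1 in parallel
            rank[j] = old_rank[j] + old_rank[old_next[j]]
            nxt[j] = old_next[old_next[j]]
    return rank

# The linked list of Fig. 5.9/5.10: head = 2, chain C->F->A->E->B->D
next_ptrs = [4, 3, 5, 3, 1, 0]           # D (index 3) is terminal: next = 3
print(list_rank(next_ptrs, head=2))      # [3, 1, 5, 0, 2, 4]
```

Each iteration doubles the distance spanned by the next pointers, giving the Θ(log n) iteration count claimed above.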
5.6 Matrix Multiplication
Sequential matrix multiplication for m × m matrices, computing c_ij = Σ_{k=0 to m–1} a_ik b_kj:

for i = 0 to m – 1 do
  for j = 0 to m – 1 do
    t := 0
    for k = 0 to m – 1 do
      t := t + a_ik b_kj
    endfor
    c_ij := t
  endfor
endfor

PRAM solution with m³ processors: each processor does one multiplication (not very efficient).
PRAM Matrix Multiplication with m² Processors
PRAM matrix multiplication using m² processors, numbered (i, j) instead of 0 to m² – 1:

Proc (i, j), 0 ≤ i, j < m, do
begin
  t := 0
  for k = 0 to m – 1 do
    t := t + a_ik b_kj
  endfor
  c_ij := t
end

Θ(m) steps: time-optimal. The CREW model is implicit.

Fig. 5.12 PRAM matrix multiplication; p = m² processors.
PRAM Matrix Multiplication with m Processors
PRAM matrix multiplication using m processors:

Proc i, 0 ≤ i < m, do
  for j = 0 to m – 1 do
    t := 0
    for k = 0 to m – 1 do
      t := t + a_ik b_kj
    endfor
    c_ij := t
  endfor

Θ(m²) steps: time-optimal. The CREW model is implicit, but because the order of multiplications is immaterial, accesses to B can be skewed to allow the EREW model.
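The skewing trick can be made concrete: if processor i starts its inner product at k = i, then at any time step the m processors read m distinct rows of B, so no concurrent reads occur. A Python sketch (not from the slides, function name hypothetical):

```python
def matmul_skewed(A, B):
    m = len(A)
    C = [[0] * m for _ in range(m)]
    for j in range(m):
        for step in range(m):            # one parallel time step
            for i in range(m):           # processors i = 0..m-1 in parallel
                k = (i + step) % m       # skewed index: all k distinct this step
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_skewed(A, B))  # [[19, 22], [43, 50]]
```

The result is unchanged because addition is commutative and associative; only the order of the accumulation differs.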
PRAM Matrix Multiplication with Fewer Processors
The algorithm is similar, except that each processor is in charge of computing m/p rows of C. Θ(m³/p) steps: time-optimal; the EREW model can be used.

A drawback of all algorithms thus far is that only two arithmetic operations (one multiplication and one addition) are performed for each memory access. This is particularly costly for NUMA shared-memory machines.
More Efficient Matrix Multiplication (for NUMA)
Partition the matrices into p square blocks of size q × q, where q = m/√p:

[A B]   [E F]   [AE+BG  AF+BH]
[C D] × [G H] = [CE+DG  CF+DH]

Block matrix multiplication follows the same algorithm as simple matrix multiplication, with blocks in place of scalars. One processor computes the q² elements of the block of C that it holds in local memory, using block-row i of A and block-column j of B.

Fig. 5.13 Partitioning the matrices for block matrix multiplication.
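A sketch of the block scheme (not slide code; here nb = √p is the number of blocks per dimension, and q must divide m):

```python
def block_matmul(A, B, q):
    """Multiply m x m matrices stored as lists of lists, block size q."""
    m = len(A)
    nb = m // q                          # sqrt(p): blocks per dimension
    C = [[0] * m for _ in range(m)]
    for bi in range(nb):                 # "processor" (bi, bj) owns block (bi, bj) of C
        for bj in range(nb):
            for bk in range(nb):         # accumulate A(bi,bk) * B(bk,bj) into C(bi,bj)
                for a in range(q):
                    for b in range(q):
                        C[bi*q + a][bj*q + b] += sum(
                            A[bi*q + a][bk*q + c] * B[bk*q + c][bj*q + b]
                            for c in range(q))
    return C

A = [[1, 2, 0, 1], [0, 1, 2, 3], [4, 0, 1, 2], [1, 1, 1, 1]]
B = [[2, 0, 1, 1], [1, 3, 0, 2], [0, 1, 2, 0], [3, 1, 1, 1]]
print(block_matmul(A, B, q=2))
```

With q = 1 this degenerates to the scalar algorithm; any q dividing m gives the same product, which is why blocking trades no correctness for the improved locality discussed next.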
Details of Block Matrix Multiplication
A multiply-add computation on q × q blocks needs 2q² = 2m²/p memory accesses and 2q³ arithmetic operations, so q arithmetic operations are done per memory access.

Fig. 5.14 How Processor (i, j) operates on an element of A (from block (i, k)) and one block-row of B (from block (k, j)) to update one block-row of C (in block (i, j)).
6 More Shared-Memory Algorithms
Develop PRAM algorithms for more complex problems: • Must present background on the problem in some cases • Discuss some practical issues such as data distribution
Topics in This Chapter
6.1 Sequential Rank-Based Selection
6.2 A Parallel Selection Algorithm
6.3 A Selection-Based Sorting Algorithm
6.4 Alternative Sorting Algorithms
6.5 Convex Hull of a 2D Point Set
6.6 Some Implementation Aspects
8.1 Searching and Dictionary Operations
Parallel p-ary search on PRAM:

Number of steps = log_{p+1}(n + 1) = log₂(n + 1) / log₂(p + 1) = Θ(log n / log p)

Speedup ≅ log p. This is optimal: no comparison-based search algorithm can be faster.

Example in the figure: n = 26, p = 2; processors P0 and P1 probe two elements per step, and the search completes in three steps.

A single search in a sorted list cannot be significantly speeded up through parallel processing, but all hope is not lost:
• Dynamic data (sorting overhead)
• Batch searching (multiple lookups)
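A sequential sketch of (p+1)-ary search (function name hypothetical): in each step the p "processors" probe p equally spaced positions, so the remaining range shrinks by a factor of about p + 1 per step:

```python
def p_ary_search(A, key, p):
    """Search sorted list A with p probes per step; returns (index, steps)."""
    lo, hi, steps = 0, len(A) - 1, 0
    while lo <= hi:
        steps += 1
        size = hi - lo + 1
        # p equally spaced probe positions, examined in parallel
        probes = [lo + (i + 1) * size // (p + 1) for i in range(p)]
        for q in probes:
            if A[q] == key:
                return q, steps
        below = [q for q in probes if A[q] < key]
        above = [q for q in probes if A[q] > key]
        if below:
            lo = max(below) + 1
        if above:
            hi = min(above) - 1
    return -1, steps

A = list(range(0, 52, 2))        # 26 sorted keys, as in the slide's example
print(p_ary_search(A, 2, p=2))   # (1, 3): three steps, matching log_3(27) = 3
```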
6.1 Sequential Rank-Based Selection

Selection: find the (or a) kth smallest among n elements.

Example: the 5th smallest element in the following list is 1:
6 4 5 6 7 1 5 3 8 2 1 0 3 4 5 6 2 1 7 1 4 5 4 9 5
A naive solution through sorting takes O(n log n) time, but a linear-time sequential algorithm can be developed.

Idea: divide the n elements into n/q sublists of q elements and let m be the median of the sublist medians. Partition S into L (elements < m), E (elements = m), and G (elements > m). Since more than n/4 elements are ≤ m and more than n/4 elements are ≥ m, we have max(|L|, |G|) ≤ 3n/4. The search then continues in L (if k ≤ |L|), ends at m (if k ≤ |L| + |E|), or continues in G.
Linear-Time Sequential Selection Algorithm

Sequential rank-based selection algorithm select(S, k):

1. if |S| < q  {q is a small constant}                       {O(n)}
     then sort S and return the kth smallest element of S
     else divide S into |S|/q subsequences of size q
          Sort each subsequence and find its median
          Let the |S|/q medians form the sequence T
   endif
2. m = select(T, |T|/2)  {find the median m of the medians}  {T(n/q)}
3. Create 3 subsequences                                     {O(n)}
     L: elements of S that are < m
     E: elements of S that are = m
     G: elements of S that are > m
4. if |L| ≥ k                                                {T(3n/4)}
     then return select(L, k)
     else if |L| + |E| ≥ k
            then return m
            else return select(G, k – |L| – |E|)
          endif
   endif
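The four steps above translate directly into Python; this sketch (not slide code) uses q = 5 and 1-based k:

```python
def select(S, k, q=5):
    """Return the kth smallest element of S (k >= 1), in linear time."""
    if len(S) < q:
        return sorted(S)[k - 1]
    sublists = [S[i:i + q] for i in range(0, len(S), q)]
    T = [sorted(sub)[len(sub) // 2] for sub in sublists]   # sublist medians
    m = select(T, (len(T) + 1) // 2)                       # median of medians
    L = [x for x in S if x < m]
    E = [x for x in S if x == m]
    G = [x for x in S if x > m]
    if len(L) >= k:
        return select(L, k)
    if len(L) + len(E) >= k:
        return m
    return select(G, k - len(L) - len(E))

S = [6, 4, 5, 6, 7, 1, 5, 3, 8, 2, 1, 0, 3, 4, 5, 6, 2, 1, 7, 1, 4, 5, 4, 9, 5]
print(select(S, 5))   # 1 (the 5th smallest, as in the slide's example)
print(select(S, 9))   # 3
```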
Algorithm Complexity and Examples

We must have q ≥ 5; the running time satisfies T(n) = T(n/q) + T(3n/4) + cn, and for q = 5 the solution is T(n) = 20cn.

Example with q = 5 (n/q = 5 sublists of q = 5 elements):

S = 6 4 5 6 7 | 1 5 3 8 2 | 1 0 3 4 5 | 6 2 1 7 1 | 4 5 4 9 5
Sublist medians: T = 6 3 3 2 5;  m = median of T = 3

L (< m): 1 2 1 0 2 1 1                      |L| = 7
E (= m): 3 3                                |E| = 2
G (> m): 6 4 5 6 7 5 8 4 5 6 7 4 5 4 9 5    |G| = 16

To find the 5th smallest element in S, select the 5th smallest element in L:
  L = 1 2 1 0 2 | 1 1;  T = 1 1;  m = 1;  now |L′| = 1 and |L′| + |E′| = 5 ≥ 5.  Answer: 1

The 9th smallest element of S is 3, since |L| < 9 ≤ |L| + |E|. The 13th smallest element of S is found by selecting the 4th smallest element in G.
6.2 A Parallel Selection Algorithm

Parallel rank-based selection algorithm PRAMselect(S, k, p), with p = O(n^(1–x)):

1. if |S| < 4                                                  {O(n^x)}
     then sort S and return the kth smallest element of S
     else broadcast |S| to all p processors
          divide S into p subsequences S(j) of size |S|/p
          Processor j, 0 ≤ j < p, compute T_j := select(S(j), |S(j)|/2)
   endif
2. m = PRAMselect(T, |T|/2, p)  {median of the medians}        {T(n^(1–x), p)}
3. Broadcast m to all processors and create 3 subsequences     {O(n^x)}
     L: elements of S that are < m
     E: elements of S that are = m
     G: elements of S that are > m
4. if |L| ≥ k                                                  {T(3n/4, p)}
     then return PRAMselect(L, k, p)
     else if |L| + |E| ≥ k
            then return m
            else return PRAMselect(G, k – |L| – |E|, p)
          endif
   endif
Algorithm Complexity and Efficiency
T(n, p) = T(n^(1–x), p) + T(3n/4, p) + cn^x; the solution is O(n^x), as can be verified by substitution.

Speedup = Θ(n)/O(n^x) = Ω(n^(1–x)) = Ω(p)    {remember that p = O(n^(1–x))}
Efficiency = speedup / p = Ω(1)
Work(n, p) = pT(n, p) = Θ(n^(1–x)) × Θ(n^x) = Θ(n)

What happens if we set x to 1 (i.e., use one processor)?
  T(n, 1) = O(n^x) = O(n)

What happens if we set x to 0 (i.e., use n processors)?
  T(n, n) = O(n^x) = O(1)? No, because in the asymptotic analysis we ignored several O(log n) terms compared with the O(n^x) terms.
6.3 A Selection-Based Sorting Algorithm

Parallel selection-based sort PRAMselectionsort(S, p), with p = n^(1–x) and k = 2^(1/x):

1. if |S| < k then return quicksort(S)                         {O(1)}
2. for i = 1 to k – 1 do                                       {O(n^x)}
     m_i := PRAMselect(S, i|S|/k, p)   {let m_0 := –∞; m_k := +∞}
   endfor
3. for i = 0 to k – 1 do                                       {O(n^x)}
     make the sublist T(i) from elements of S in (m_i, m_(i+1))
   endfor
4. for i = 1 to k/2 do in parallel                             {T(n/k, 2p/k)}
     PRAMselectionsort(T(i), 2p/k)
     {p/(k/2) processors used for each of the k/2 subproblems}
   endfor
5. for i = k/2 + 1 to k do in parallel                         {T(n/k, 2p/k)}
     PRAMselectionsort(T(i), 2p/k)
   endfor

–∞ = m_0 | n/k elements | m_1 | n/k elements | m_2 | ... | m_(k–1) | n/k elements | m_k = +∞

Fig. 6.1 Partitioning of the sorted list for selection-based sorting.
Algorithm Complexity and Efficiency
T(n, p) = 2T(n/k, 2p/k) + cn^x; the solution is O(n^x log n), as can be verified by substitution.

Speedup(n, p) = Ω(n log n)/O(n^x log n) = Ω(n^(1–x)) = Ω(p)    {remember that p = O(n^(1–x))}
Efficiency = speedup / p = Ω(1)
Work(n, p) = pT(n, p) = Θ(n^(1–x)) × Θ(n^x log n) = Θ(n log n)

What happens if we set x to 1 (i.e., use one processor)?
  T(n, 1) = O(n^x log n) = O(n log n)
Our asymptotic analysis is valid for x > 0 but not for x = 0; i.e., PRAMselectionsort cannot sort p keys in optimal O(log p) time.
Example of Parallel Sorting
S: 6 4 5 6 7 1 5 3 8 2 1 0 3 4 5 6 2 1 7 0 4 5 4 9 5
Threshold values for k = 4 (i.e., x = ½ and p = n^(1/2) = 5 processors):

m_0 = –∞
m_1 = PRAMselect(S, 6, 5) = 2      (n/k = 25/4 ≅ 6)
m_2 = PRAMselect(S, 13, 5) = 4     (2n/k = 50/4 ≅ 13)
m_3 = PRAMselect(S, 19, 5) = 6     (3n/k = 75/4 ≅ 19)
m_4 = +∞
Partitioned list with the thresholds m_1, m_2, m_3 in place:
T: – – – – – 2 | – – – – – – 4 | – – – – – 6 | – – – – – –

After the recursive sorts of the four sublists:
T: 0 0 1 1 1 2 | 2 3 3 4 4 4 4 | 5 5 5 5 5 6 | 6 6 7 7 8 9
6.4 Alternative Sorting Algorithms
Sorting via random sampling (assume p << √ n) Given a large list S of inputs, a random sample of the elements can be used to find k comparison thresholds It is easier if we pick k = p, so that each of the resulting subproblems is handled by a single processor
Parallel randomized sort PRAMrandomsort(S, p):

1. Processor j, 0 ≤ j < p, pick |S|/p² random samples of its |S|/p elements and store them in its corresponding section of a list T of length |S|/p
2. Processor 0 sort the list T  {comparison threshold m_i is the (i|S|/p²)th element of T}
3. Processor j, 0 ≤ j < p, store its elements falling in (m_i, m_(i+1)) into T(i)
4. Processor j, 0 ≤ j < p, sort the sublist T(j)
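The four steps above can be sketched sequentially (not slide code; the function name is hypothetical, a fixed seed makes the run repeatable, and n is assumed divisible by p²):

```python
import random

def pram_random_sort(S, p, seed=1):
    rng = random.Random(seed)
    n = len(S)
    # Step 1: each "processor" owns a chunk and contributes n/p^2 samples
    chunks = [S[j * (n // p):(j + 1) * (n // p)] for j in range(p)]
    T = [rng.choice(chunk) for chunk in chunks for _ in range(n // p ** 2)]
    # Step 2: sort the n/p samples; every (n/p^2)-th one is a threshold
    T.sort()
    m = [T[i * (n // p ** 2)] for i in range(1, p)]   # p - 1 thresholds
    # Step 3: each element goes to the bucket its value falls in
    buckets = [[] for _ in range(p)]
    for x in S:
        buckets[sum(1 for t in m if x >= t)].append(x)
    # Step 4: each processor sorts its own bucket
    return [x for b in buckets for x in sorted(b)]

data = random.Random(7).sample(range(100), 64)
print(pram_random_sort(data, p=4) == sorted(data))  # True
```

Random sampling only balances the bucket sizes in expectation; correctness does not depend on the sample quality.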
Parallel Radixsort
In binary version of radixsort, we examine every bit of the k-bit keys in turn, starting from the LSB In Step i, bit i is examined, 0 ≤ i < k Records are stably sorted by the value of the ith key bit
Input list   Sort by LSB   Sort by middle bit   Sort by MSB
––––––––     ––––––––      ––––––––––––––––     ––––––––
5 (101)      4 (100)       4 (100)              1 (001)
7 (111)      2 (010)       5 (101)              2 (010)
3 (011)      2 (010)       1 (001)              2 (010)
1 (001)      5 (101)       2 (010)              3 (011)
4 (100)      7 (111)       2 (010)              4 (100)
2 (010)      3 (011)       7 (111)              5 (101)
7 (111)      1 (001)       3 (011)              7 (111)
2 (010)      7 (111)       7 (111)              7 (111)

Question: How are the data movements performed?
Data Movements in Parallel Radixsort
Input list   Compl. of bit 0   Diminished prefix sums   Bit 0   Prefix sums plus 2   Shifted list
––––––––     –––               –––––––––––              –––     –––––––––            ––––––––
5 (101)      0                 –                        1       1 + 2 = 3            4 (100)
7 (111)      0                 –                        1       2 + 2 = 4            2 (010)
3 (011)      0                 –                        1       3 + 2 = 5            2 (010)
1 (001)      0                 –                        1       4 + 2 = 6            5 (101)
4 (100)      1                 0                        0       –                    7 (111)
2 (010)      1                 1                        0       –                    3 (011)
7 (111)      0                 –                        1       5 + 2 = 7            1 (001)
2 (010)      1                 2                        0       –                    7 (111)
Running time consists mainly of the time to perform 2k parallel prefix computations: O(log p) for k constant
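The data movement of one radixsort pass can be sketched with prefix sums, exactly as in the table above (a sequential Python simulation, not slide code):

```python
from itertools import accumulate

def radix_pass(keys, b):
    """One stable pass on bit b, routing each record via prefix sums."""
    bits = [(x >> b) & 1 for x in keys]
    zeros = [1 - v for v in bits]
    # diminished prefix sums over the complemented bits: destinations of 0s
    dest0 = [s - 1 for s in accumulate(zeros)]
    n_zeros = sum(zeros)
    # prefix sums over the bits, offset past the 0s: destinations of 1s
    dest1 = [n_zeros + s - 1 for s in accumulate(bits)]
    out = [0] * len(keys)
    for j, x in enumerate(keys):
        out[dest1[j] if bits[j] else dest0[j]] = x
    return out

def radixsort(keys, nbits):
    for b in range(nbits):               # LSB first; each pass is stable
        keys = radix_pass(keys, b)
    return keys

print(radix_pass([5, 7, 3, 1, 4, 2, 7, 2], 0))  # [4, 2, 2, 5, 7, 3, 1, 7]
print(radixsort([5, 7, 3, 1, 4, 2, 7, 2], 3))   # [1, 2, 2, 3, 4, 5, 7, 7]
```

The first line reproduces the "shifted list" column of the table above.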
6.5 Convex Hull of a 2D Point Set
The convex hull CH(Q) of a 2D point set Q is the smallest convex polygon that contains all the points. The best sequential algorithm for p points requires Ω(p log p) steps.

Fig. 6.2 Defining the convex hull problem (a 16-point set Q and its hull CH(Q)).

Fig. 6.3 Illustrating the properties of the convex hull: for each hull edge, all points fall on one side of the line through that edge, and the hull points can be ordered by angle.
PRAM Convex Hull Algorithm
Parallel convex hull algorithm PRAMconvexhull(S, p):

1. Sort point set by x coordinates
2. Divide sorted list into √p subsets Q(i) of size √p, 0 ≤ i < √p
3. Find convex hull of each subset Q(i) using √p processors
4. Merge the √p convex hulls CH(Q(i)) into the overall hull CH(Q)

The hull consists of an upper hull and a lower hull.

Fig. 6.4 Multiway divide and conquer for the convex hull problem.
Merging of Partial Convex Hulls
Tangent lines between partial hulls are found through binary search in log time.

Analysis: T(p, p) = T(p^(1/2), p^(1/2)) + c log p ≅ 2c log p. The initial sorting also takes O(log p) time.

Two cases arise when merging: (a) no point of CH(Q(i)) is on CH(Q); (b) the points of CH(Q(i)) from A to B are on CH(Q).

Fig. 6.5 Finding points in a partial hull that belong to the combined hull.
6.6 Some Implementation Aspects
This section has been expanded; it will eventually become a separate chapter
Options for the processor-to-memory network:
• Crossbar: complex, expensive
• Bus(es): bottleneck
• MIN (multistage interconnection network)

Fig. 4.3 A parallel processor with global (shared) memory: p processors connected through a processor-to-memory network to m memory modules, plus a processor-to-processor network and parallel I/O.
Processor-to-Memory Network
Crossbar switches offer full permutation capability (they are nonblocking), but are complex and expensive: O(p²) cost.

Even with a permutation network, full PRAM functionality is not realized: two processors cannot access different addresses in the same memory module.

Practical processor-to-memory networks cannot realize all permutations (they are blocking).

[Figure: an 8 × 8 crossbar switch connecting processors P0–P7 to memory modules M0–M7.]
Bus-Based Interconnections
Single-bus system:
• Bandwidth bottleneck
• Bus loading limit
• Scalability: very poor
• Single failure point
• Conceptually simple
• Forced serialization

Multiple-bus system:
• Bandwidth improved
• Bus loading limit
• Scalability: poor
• More robust
• More complex scheduling
• Simple serialization
Back-of-the-Envelope Bus Bandwidth Calculation
Single-bus system:
  Bus frequency: 0.5 GHz
  Data width: 256 b (32 B)
  Memory access: 2 bus cycles
  Bandwidth: (0.5 G / 2) × 32 B = 8 GB/s

  Note: bus cycle = 2 ns and memory cycle = 100 ns, so 1 memory cycle = 50 bus cycles.

Multiple-bus system:
  Peak bandwidth is multiplied by the number of buses (actual bandwidth is likely to be much less).
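The arithmetic above, spelled out (variable names are illustrative, not from the slides):

```python
bus_hz = 0.5e9           # 0.5 GHz bus
width_bytes = 32         # 256-bit data path
cycles_per_access = 2    # each memory access occupies 2 bus cycles

bandwidth = bus_hz / cycles_per_access * width_bytes
print(bandwidth / 1e9)   # 8.0 (GB/s)

bus_cycle_ns, memory_cycle_ns = 2, 100
print(memory_cycle_ns // bus_cycle_ns)  # 50 bus cycles per memory cycle
```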
Hierarchical Bus Interconnection
Buses at different levels are connected through bus switches (gateways); low-level clusters of processors can be combined into a larger, possibly heterogeneous, system.

Fig. 4.9 Example of a hierarchical interconnection architecture.
Removing the Processor-to-Memory Bottleneck
Insert a private cache between each processor and the processor-to-memory network. Challenge: cache coherence.

Fig. 4.4 A parallel processor with global memory and processor caches.
Why Data Caching Works
Hit rate r = fraction of memory accesses satisfied by the cache:

C_eff = C_fast + (1 – r) C_slow

Cache parameters: size, block length (line width), placement policy, replacement policy, write policy.

Fig. 18.1 Data storage and access in a two-way set-associative cache: the address is split into tag, index, and word-offset fields; the index selects a set, the two candidate words and their tags are read out, and the tag comparators drive a mux that delivers the data (or signals a cache miss).
Benefits of Caching Formulated as Amdahl's Law
With hit rate r (fraction of memory accesses satisfied by cache) and C_eff = C_fast + (1 – r) C_slow, the speedup from caching is

S = C_slow / C_eff = 1 / ((1 – r) + C_fast/C_slow)

This corresponds to the miss-rate fraction 1 – r of accesses being unaffected and the hit-rate fraction r (almost 1) being speeded up by a factor C_slow/C_fast.

Generalized form of Amdahl's speedup formula:

S = 1/(f_1/p_1 + f_2/p_2 + . . . + f_m/p_m), with f_1 + f_2 + . . . + f_m = 1

In this case, a fraction 1 – r is slowed down by a factor (C_slow + C_fast)/C_slow, and a fraction r is speeded up by a factor C_slow/C_fast.

[Fig. 18.3 of Parhami's Computer Architecture text (2005) illustrates the hierarchy with an analogy: register file = desktop access in 2 s, cache memory = drawer access in 5 s, main memory = cabinet access in 30 s.]
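The speedup formula with some illustrative numbers (the 1 ns / 50 ns access times are assumptions, not slide data):

```python
def cache_speedup(r, c_fast, c_slow):
    """S = C_slow / C_eff, with C_eff = C_fast + (1 - r) * C_slow."""
    c_eff = c_fast + (1 - r) * c_slow
    return c_slow / c_eff    # equals 1 / ((1 - r) + c_fast / c_slow)

# e.g., 95% hit rate, 1 ns cache, 50 ns memory
print(round(cache_speedup(0.95, 1.0, 50.0), 2))  # 14.29
```

As with Amdahl's law, the miss fraction 1 – r dominates: even a perfect cache (C_fast = 0) cannot push the speedup past 1/(1 – r).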
18.2 Cache Coherence Protocols

With multiple caches, the copies of a given data block may be in several states: multiple consistent (the same value in several caches), single consistent, single inconsistent (modified in one cache), or invalid.

Fig. 18.2 Various types of cached data blocks in a parallel processor with global memory and processor caches.
Example: A Bus-Based Snoopy Protocol
Each transition is labeled with the event that triggers it, followed by the action(s) that must be taken
The three block states are Invalid, Shared (read-only), and Exclusive (read/write):

• Invalid: CPU read miss → put read miss on bus, go to Shared; CPU write miss → put write miss on bus, go to Exclusive
• Shared: CPU read hit → stay in Shared; CPU write hit/miss → put write miss on bus, go to Exclusive; bus write miss for this block → go to Invalid
• Exclusive: CPU read hit or write hit → stay in Exclusive; CPU read miss → write back the block, put read miss on bus; CPU write miss → write back the block, put write miss on bus; bus read miss for this block → write back the block, go to Shared; bus write miss for this block → write back the block, go to Invalid
Fig. 18.3 Finite-state control mechanism for a bus-based snoopy cache coherence protocol.
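A table-driven sketch of a subset of these transitions (the event and action strings paraphrase the figure's labels; this is an illustration, not the full protocol):

```python
# Each entry maps (state, event) to (next_state, bus/memory actions).
TRANSITIONS = {
    ("invalid",   "cpu_read_miss"):  ("shared",    ["put read miss on bus"]),
    ("invalid",   "cpu_write_miss"): ("exclusive", ["put write miss on bus"]),
    ("shared",    "cpu_read_hit"):   ("shared",    []),
    ("shared",    "cpu_write"):      ("exclusive", ["put write miss on bus"]),
    ("shared",    "bus_write_miss"): ("invalid",   []),
    ("exclusive", "cpu_read_hit"):   ("exclusive", []),
    ("exclusive", "cpu_write_hit"):  ("exclusive", []),
    ("exclusive", "bus_read_miss"):  ("shared",    ["write back the block"]),
    ("exclusive", "bus_write_miss"): ("invalid",   ["write back the block"]),
}

def run(events, state="invalid"):
    log = []
    for e in events:
        state, actions = TRANSITIONS[(state, e)]
        log.extend(actions)
    return state, log

print(run(["cpu_read_miss", "cpu_write", "bus_read_miss"]))
```

The sample run reads a block (Invalid to Shared), writes it (Shared to Exclusive), then loses exclusivity to another processor's read miss (back to Shared, with a write-back).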
Implementing a Snoopy Protocol
A second tags-and-state storage unit for the snoop side, duplicating the processor-side tags and state, allows snooping to be done concurrently with normal cache operation. Getting all the implementation timing and details right is nontrivial.

Fig. 27.7 of Parhami's Computer Architecture text: snoopy-protocol hardware with processor-side and snoop-side cache control sharing the cache data array, tag comparators on both sides, and address/command buffers on the system bus.
Distributed Shared Memory
Some terminology:
• NUMA: nonuniform memory access (distributed shared memory)
• UMA: uniform memory access (global shared memory)
• COMA: cache-only memory architecture

Fig. 4.5 A parallel processor with distributed memory: each processor has a local memory, and processors communicate through an interconnection network.
Example: A Directory-Based Protocol
A directory entry for a block is in one of three states, with transitions as follows:

• Uncached: read miss → return data value, sharing set = {c}, go to Shared; write miss → return data value, sharing set = {c}, go to Exclusive
• Shared (read-only): read miss → return data value, sharing set = sharing set + {c}; write miss → invalidate the cached copies, sharing set = {c}, return data value, go to Exclusive
• Exclusive (read/write): read miss → fetch data value from the owner, return data value, sharing set = sharing set + {c}, go to Shared (correction to text); write miss → fetch data value, request invalidation, return data value, sharing set = {c}; data write-back → sharing set = { }, go to Uncached

Fig. 18.4 States and transitions for a directory entry in a directory-based coherence protocol (c denotes the cache sending the message).
Implementing a Directory-Based Protocol
Sharing set implemented as a bit-vector, one bit per cache (simple, but not scalable):

0 1 0 0 0 1 0 0   0 0 0 0 0 0 0 0   0 0 1 1 0 0 0 0   0 0 0 1 0 0 0 0

When there are many more nodes (caches) than the typical size of a sharing set, a list of sharing units may be maintained in the directory.

Alternatively, the sharing set can be maintained as a distributed doubly linked list through the caches, with the memory holding a head pointer for each coherent data block (to be discussed in Section 18.6 in connection with the SCI standard).

[Figure: Processors 0–3 with Caches 0–3; the memory holds noncoherent data blocks and a coherent data block whose head pointer starts the linked list of sharers.]
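The bit-vector variant is a few lines of code (an illustrative sketch, class and method names hypothetical):

```python
class DirectoryEntry:
    """Directory entry with a bit-vector sharing set (bit i = cache i)."""
    def __init__(self):
        self.sharers = 0

    def add(self, cache_id):
        self.sharers |= 1 << cache_id

    def invalidate_all(self):
        ids = [i for i in range(self.sharers.bit_length())
               if (self.sharers >> i) & 1]
        self.sharers = 0
        return ids            # caches that must receive an invalidation

entry = DirectoryEntry()
entry.add(1)
entry.add(5)
print(bin(entry.sharers))     # 0b100010
print(entry.invalidate_all()) # [1, 5]
```

The vector's length grows with the number of caches regardless of how many actually share the block, which is exactly the scalability problem noted above.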
Hiding the Memory Access Latency
By assumption, the PRAM accesses memory locations right when they are needed, so processing must stall until data is fetched. This is not a smart strategy, because memory access time can be hundreds of times the add time.

Method 1: Predict accesses (prefetch)
Method 2: Pipeline multiple accesses

[Figure: Processor 0's access requests and responses overlapped in time through the processor-to-memory network of Fig. 4.3.]

21.3 Vector-Parallel Cray Y-MP
Fig. 21.5 Key elements of the Cray Y-MP processor: eight 64-bit vector registers V0–V7 (each holding 64 elements); vector integer units (add, shift, logic, weight/parity); floating-point units (add, multiply, reciprocal approximation); scalar integer units (add, shift, logic, weight/parity); T and S registers (8 × 64 bits each); central memory; and a 32-bit inter-processor communication path. Address registers, address function units, instruction buffers, and control are not shown.

Cray Y-MP's Interconnection Network

Fig. 21.6 The processor-to-memory interconnection network of Cray Y-MP: processors P0–P7 connect through 1 × 8 section switches and 8 × 8 crossbars to 4 × 4 subsection switches, which feed 256 interleaved memory banks (banks 0, 4, 8, ..., 28; 32, 36, ..., 92; 1, 5, 9, ..., 29; ...; 227, 231, ..., 255).
Butterfly Processor-to-Memory Network

p processors connect to p memory banks through log₂ p columns of 2-by-2 switches (the figure shows p = 16, with 4 switch columns).

Not a full permutation network: e.g., processor 0 cannot be connected to memory bank 2 alongside the two connections shown in the figure.

The network is self-routing: the bank address determines the route, with successive address bits selecting the upper or lower switch output at successive columns (the figure traces a request going to memory bank 3 = 0011).

Fig. 6.9 Example of a multistage memory access network.
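Self-routing can be sketched in a few lines (an illustration, not slide code; the bit order and the 0 = upper / 1 = lower convention are assumptions, since different drawings label the switch outputs differently):

```python
def butterfly_route(bank, columns):
    # Bit i of the bank address (MSB first) picks the switch output in
    # column i; assumed convention: 0 = upper output, 1 = lower output.
    return ["lower" if (bank >> (columns - 1 - c)) & 1 else "upper"
            for c in range(columns)]

print(butterfly_route(0b0011, 4))  # ['upper', 'upper', 'lower', 'lower']
```

The key point survives any labeling convention: the route is a pure function of the destination address, so no global routing decision is needed.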
Butterfly as Multistage Interconnection Network
Fig. 6.9 Example of a multistage memory access network (p processors, log2 p columns of 2-by-2 switches, p memory banks).
Fig. 15.8 Butterfly network used to connect modules that are on the same side (log2 p + 1 columns of 2-by-2 switches).
Generalization of the butterfly network: a high-radix or m-ary butterfly, built of m × m switches, has m^q rows and q + 1 columns (q if wrapped).
Beneš Network
A p-processor Beneš network places 2 log2 p – 1 columns of 2-by-2 switches between the p processors and the p memory banks (p = 8 shown).
Fig. 15.9 Beneš network formed from two back-to-back butterflies.
A 2^q-row Beneš network can route any 2^q × 2^q permutation: it is "rearrangeable."
Routing Paths in a Beneš Network
To which memory modules can we connect processor 4 without rearranging the other paths? What about processor 6?
A Beneš network with 2^q rows has 2^(q+1) inputs, 2^(q+1) outputs, and 2q + 1 columns of switches.
Fig. 15.10 Another example of a Beneš network.
16.6 Multistage Interconnection Networks
Numerous indirect or multistage interconnection networks (MINs) have been proposed for, or used in, parallel computers. They differ in their topological, performance, robustness, and realizability attributes. We have already seen the butterfly, hierarchical bus, Beneš, and ADM networks.
Fig. 4.8 (modified) The sea of indirect interconnection networks.
Self-Routing Permutation Networks
Do there exist self-routing permutation networks? (The butterfly network is self-routing, but it is not a permutation network)
Permutation routing through a MIN is the same problem as sorting
The first stage sorts all lines by the MSB, the second sorts each half by the middle bit, and the third sorts each quarter by the LSB:

Input      After sorting   After sorting by   Output (after
           by the MSB      the middle bit     sorting by the LSB)
7 (111)         0                0               0 (000)
0 (000)         1                1               1 (001)
4 (100)         3                3               2 (010)
6 (110)         2                2               3 (011)
1 (001)         7                4               4 (100)
5 (101)         4                5               5 (101)
3 (011)         6                7               6 (110)
2 (010)         5                6               7 (111)

Fig. 16.14 Example of sorting on a binary radix sort network.
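The three-stage sort of Fig. 16.14 can be simulated with stable bit-splits applied to successively smaller blocks of lines. The function below is a hypothetical sketch of that process (names ours), not a model of the hardware itself.

```python
def radix_sort_network(vals, bits):
    """Simulate an MSB-first radix sort network: at stage s (s = 0 for
    the MSB), each block of n >> s consecutive lines is stably split by
    bit (bits - 1 - s), zeros routed up and ones routed down."""
    n = len(vals)
    for s in range(bits):
        block = n >> s          # block size: n, then n/2, n/4, ...
        bit = bits - 1 - s      # examine the MSB first, then next bit, ...
        out = []
        for start in range(0, n, block):
            chunk = vals[start:start + block]
            out += [v for v in chunk if not (v >> bit) & 1]  # bit = 0
            out += [v for v in chunk if (v >> bit) & 1]      # bit = 1
        vals = out
    return vals

# The input column of Fig. 16.14:
print(radix_sort_network([7, 0, 4, 6, 1, 5, 3, 2], 3))
# [0, 1, 2, 3, 4, 5, 6, 7]
```

After the MSB stage this reproduces the figure's second column (0, 1, 3, 2, 7, 4, 6, 5), and the remaining stages complete the sort within halves and quarters.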
Partial List of Important MINs
Augmented data manipulator (ADM): aka unfolded PM2I (Fig. 15.12)
Banyan: Any MIN with a unique path between any input and any output (e.g., butterfly)
Baseline: Butterfly network with nodes labeled differently
Beneš: Back-to-back butterfly networks, sharing one column (Figs. 15.9-10)
Bidelta: A MIN that is a delta network in either direction
Butterfly: aka unfolded hypercube (Figs. 6.9, 15.4-5)
Data manipulator: Same as ADM, but with switches in a column restricted to same state
Delta: Any MIN for which the outputs of each switch have distinct labels (say 0 and 1 for 2 × 2 switches) and the path label, formed by concatenating the switch output labels leading from an input to an output, depends only on the output
Flip: Reverse of the omega network (inputs and outputs interchanged)
Indirect cube: Same as butterfly or omega
Omega: Multistage shuffle-exchange network; isomorphic to butterfly (Fig. 15.19)
Permutation: Any MIN that can realize all permutations
Rearrangeable: Same as permutation network
Reverse baseline: Baseline network, with the roles of inputs and outputs interchanged
Conflict-Free Memory Access
Try to store the data such that parallel accesses are to different banks. For many data structures, a compiler may perform the memory mapping.

Each matrix column is stored in a different memory module (bank):

Module:  0    1    2    3    4    5
        0,0  0,1  0,2  0,3  0,4  0,5
        1,0  1,1  1,2  1,3  1,4  1,5
        2,0  2,1  2,2  2,3  2,4  2,5
        3,0  3,1  3,2  3,3  3,4  3,5
        4,0  4,1  4,2  4,3  4,4  4,5
        5,0  5,1  5,2  5,3  5,4  5,5

A row can be read out in parallel, but accessing a column leads to conflicts.
Fig. 6.6 Matrix storage in column-major order to allow concurrent accesses to rows.
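A quick way to see the conflict is to tabulate which module each access touches under the Fig. 6.6 layout, where element (i, j) lives in module j. The helper name below is ours, used only for illustration.

```python
def module_of(i, j):
    """Fig. 6.6 layout: element (i, j) of the matrix lives in module j."""
    return j

m = 6
row_access = [module_of(1, j) for j in range(m)]  # read matrix row 1
col_access = [module_of(i, 2) for i in range(m)]  # read matrix column 2

print(row_access)  # [0, 1, 2, 3, 4, 5] -- six distinct modules, conflict-free
print(col_access)  # [2, 2, 2, 2, 2, 2] -- all in module 2, accesses serialize
```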
Skewed Storage Format
Each row i is cyclically rotated by i positions, so element (i, j) resides in module (i + j) mod 6:

Module:  0    1    2    3    4    5
        0,0  0,1  0,2  0,3  0,4  0,5
        1,5  1,0  1,1  1,2  1,3  1,4
        2,4  2,5  2,0  2,1  2,2  2,3
        3,3  3,4  3,5  3,0  3,1  3,2
        4,2  4,3  4,4  4,5  4,0  4,1
        5,1  5,2  5,3  5,4  5,5  5,0
Fig. 6.7 Skewed matrix storage for conflict-free accesses to rows and columns.
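Repeating the same tabulation under the skewed layout shows that both row and column accesses now touch six distinct modules. Again a small illustrative sketch with names of our choosing.

```python
def skewed_module(i, j, m=6):
    """Fig. 6.7 skewed layout: row i is cyclically rotated by i
    positions, so element (i, j) resides in module (i + j) mod m."""
    return (i + j) % m

m = 6
row_access = sorted(skewed_module(1, j) for j in range(m))  # matrix row 1
col_access = sorted(skewed_module(i, 2) for i in range(m))  # matrix column 2

print(row_access)  # [0, 1, 2, 3, 4, 5] -- conflict-free
print(col_access)  # [0, 1, 2, 3, 4, 5] -- conflict-free as well
```

Not every access pattern is saved by this skew; e.g., the main diagonal (i, i) maps to module 2i mod 6, which repeats, motivating the unified treatment of strides that follows.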
A Unified Theory of Conflict-Free Access
A q-D array can be viewed as a vector, with "row"/"column" accesses associated with constant strides. A_ij is viewed as vector element i + jm.

Vector indices:
0   6  12  18  24  30
1   7  13  19  25  31
2   8  14  20  26  32
3   9  15  21  27  33
4  10  16  22  28  34
5  11  17  23  29  35

Fig. 6.8 A 6 × 6 matrix viewed, in column-major order, as a 36-element vector.
Column:       k, k+1, k+2, k+3, k+4, k+5                          Stride = 1
Row:          k, k+m, k+2m, k+3m, k+4m, k+5m                      Stride = m
Diagonal:     k, k+m+1, k+2(m+1), k+3(m+1), k+4(m+1), k+5(m+1)    Stride = m + 1
Antidiagonal: k, k+m–1, k+2(m–1), k+3(m–1), k+4(m–1), k+5(m–1)    Stride = m – 1
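These strided sequences can be generated directly. The snippet assumes the column-major numbering of Fig. 6.8, in which element (i, j) is vector element i + jm with m = 6; the function name is ours.

```python
m = 6  # 6-by-6 matrix; element (i, j) is vector element i + j*m

def access(k, stride, count=6):
    """Vector indices touched by a constant-stride access starting at k."""
    return [k + t * stride for t in range(count)]

print(access(6, 1))      # column j = 1: [6, 7, 8, 9, 10, 11]   (stride 1)
print(access(1, m))      # row i = 1:    [1, 7, 13, 19, 25, 31] (stride m)
print(access(0, m + 1))  # main diagonal: [0, 7, 14, 21, 28, 35] (stride m + 1)
```

The row and diagonal sequences match the entries of the Fig. 6.8 index table, confirming the strides listed above.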
Linear Skewing Schemes
Place vector element i in memory bank a + bi mod B (the word address within a bank is irrelevant to conflict-free access; also, a can be set to 0). A_ij is viewed as vector element i + jm.
Fig. 6.8 A 6 × 6 matrix viewed, in column-major order, as a 36-element vector.
With a linear skewing scheme, vector elements k, k + s, k + 2s, . . . , k + (B – 1)s will be assigned to different memory banks iff sb is relatively prime to the number B of memory banks. A prime value for B ensures this condition, but is not very practical.
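The stated condition can be checked by brute force for small cases. This sketch (names ours) compares direct enumeration of the banks touched against the gcd test, with the irrelevant offsets a and k taken as 0.

```python
from math import gcd

def conflict_free(s, b, B):
    """Linear skewing: element i goes to bank (b * i) % B.  A stride-s
    access of B elements is conflict-free iff all land in distinct banks."""
    banks = {(b * t * s) % B for t in range(B)}
    return len(banks) == B

B = 8
for s in range(1, B + 1):
    for b in range(1, B + 1):
        # Conflict-free exactly when s*b is relatively prime to B:
        assert conflict_free(s, b, B) == (gcd(s * b, B) == 1)

print(conflict_free(3, 1, B))  # True  -- gcd(3, 8) = 1
print(conflict_free(2, 1, B))  # False -- gcd(2, 8) = 2, banks repeat
```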
7 Sorting and Selection Networks
Become familiar with the circuit model of parallel processing: • Go from algorithm to architecture, not vice versa • Use a familiar problem to study various trade-offs
Topics in This Chapter 7.1 What is a Sorting Network? 7.2 Figures of Merit for Sorting Networks 7.3 Design of Sorting Networks 7.4 Batcher Sorting Networks 7.5 Other Classes of Sorting Networks 7.6 Selection Networks
7.1 What is a Sorting Network?
The outputs are a permutation of the inputs satisfying y0 ≤ y1 ≤ . . . ≤ yn–1.
Fig. 7.1 An n-input (non-descending) sorting network, or n-sorter, with inputs x0, x1, . . . , xn–1 and outputs y0, y1, . . . , yn–1.
A 2-sorter routes the smaller of its two inputs to its min output and the larger to its max output.
Fig. 7.2 Block diagram and four different schematic representations for a 2-sorter.
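A 2-sorter is just a compare-exchange step, and a sorting network is a fixed sequence of such steps applied to designated line pairs. The 3-sorter below is a hypothetical small example used to illustrate the idea, not one of the book's constructions.

```python
from itertools import permutations

def two_sorter(a, b):
    """The basic building block: outputs (min, max) of its two inputs."""
    return (a, b) if a <= b else (b, a)

def apply_network(values, comparators):
    """Run a comparator network, given as a list of (i, j) line pairs."""
    v = list(values)
    for i, j in comparators:
        v[i], v[j] = two_sorter(v[i], v[j])
    return v

# A 3-sorter built from three 2-sorters; it sorts every input order:
net3 = [(0, 1), (1, 2), (0, 1)]
assert all(apply_network(p, net3) == [0, 1, 2]
           for p in permutations([0, 1, 2]))
print(apply_network([30, 10, 20], net3))  # [10, 20, 30]
```

Because the comparator sequence is fixed in advance (data-independent), such a network maps directly onto hardware, which is the circuit-model viewpoint of this chapter.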
Building Blocks for Sorting Networks
A 2-sorter can be implemented with bit-parallel inputs or with bit-serial inputs (block diagram and alternate representations shown in the figure).