Part II Extreme Models

Fall 2008 Parallel Processing, Extreme Models Slide 1 About This Presentation

This presentation is intended to support the use of the textbook Introduction to Parallel Processing: Algorithms and Architectures (Plenum Press, 1999, ISBN 0-306-45970-1). It was prepared by the author in connection with teaching the graduate-level course ECE 254B: Advanced Computer Architecture: Parallel Processing, at the University of California, Santa Barbara. Instructors can use these slides in classroom teaching and for other educational purposes. Any other use is strictly prohibited. © Behrooz Parhami

Edition: First. Released: Spring 2005. Revised: Spring 2006. Revised: Fall 2008.

Fall 2008 Parallel Processing, Extreme Models Slide 2 II Extreme Models

Study the two extremes of parallel computation models: • Abstract SM (PRAM); ignores implementation issues • Concrete circuit model; incorporates hardware details • Everything else falls between these two extremes

Topics in This Part
Chapter 5 PRAM and Basic Algorithms
Chapter 6 More Shared-Memory Algorithms
Chapter 7 Sorting and Selection Networks
Chapter 8 Other Circuit-Level Examples

Fall 2008 Parallel Processing, Extreme Models Slide 3 5 PRAM and Basic Algorithms

PRAM, a natural extension of RAM (random-access machine): • Present definitions of model and its various submodels • Develop algorithms for key building-block computations

Topics in This Chapter
5.1 PRAM Submodels and Assumptions
5.2 Data Broadcasting
5.3 Semigroup or Fan-in Computation
5.4 Parallel Prefix Computation
5.5 Ranking the Elements of a Linked List
5.6 Matrix Multiplication

5.1 PRAM Submodels and Assumptions

Processor i can do the following in the three phases of one cycle:
1. Fetch a value from address si in shared memory
2. Perform computations on data held in local registers
3. Store a value into address di in shared memory

Fig. 4.6 Conceptual view of a parallel random-access machine (PRAM).

Types of PRAM

Submodels are defined by whether reads from the same location and writes to the same location are exclusive or concurrent:
EREW: Least "powerful", most "realistic"
ERCW: Not useful
CREW: Default
CRCW: Most "powerful", further subdivided

Fig. 5.1 Submodels of the PRAM model.

Fall 2008 Parallel Processing, Extreme Models Slide 6 Types of CRCW PRAM

CRCW submodels are distinguished by the way they treat multiple writes:

Undefined: The value written is undefined (CRCW-U)
Detecting: A special code for "detected collision" is written (CRCW-D)
Common: Allowed only if they all store the same value (CRCW-C) [this is sometimes called the consistent-write submodel]
Random: The value is randomly chosen from those offered (CRCW-R)
Priority: The processor with the lowest index succeeds (CRCW-P)
Max / Min: The largest / smallest of the values is written (CRCW-M)
Reduction: The arithmetic sum (CRCW-S), logical AND (CRCW-A), logical XOR (CRCW-X), or another combination of the values is written

Fall 2008 Parallel Processing, Extreme Models Slide 7 Power of CRCW PRAM Submodels

Model U is more powerful than model V if T_U(n) = o(T_V(n)) for some problem.

EREW < CREW < CRCW-D < CRCW-C < CRCW-R < CRCW-P

Theorem 5.1: A p-processor CRCW-P (priority) PRAM can be simulated (emulated) by a p-processor EREW PRAM with slowdown factor Θ(log p).

Intuitive justification for concurrent read emulation (write is similar):
1. Write the p memory addresses in a list
2. Sort the list of addresses in ascending order
3. Remove all duplicate addresses
4. Access data at the desired addresses
5. Replicate data via parallel prefix computation
Each of these steps requires constant or O(log p) time.

Fall 2008 Parallel Processing, Extreme Models Slide 8 Implications of the CRCW Hierarchy of Submodels

EREW < CREW < CRCW-D < CRCW-C < CRCW-R < CRCW-P

A p-processor CRCW-P (priority) PRAM can be simulated (emulated) by a p-processor EREW PRAM with slowdown factor Θ(log p).

Our most powerful PRAM CRCW submodel can be emulated by the least powerful submodel with logarithmic slowdown

Efficient parallel algorithms have polylogarithmic running times

Running time still polylogarithmic after slowdown due to emulation

We need not be too concerned with the CRCW submodel used

Simply use whichever submodel is most natural or convenient

Fall 2008 Parallel Processing, Extreme Models Slide 9 Some Elementary PRAM Computations

Initializing an n-vector (base address = B) to all 0s, processed as ⎡n/p⎤ segments of p elements:

for j = 0 to ⎡n/p⎤ – 1
  Processor i, 0 ≤ i < p, do
    if jp + i < n then M[B + jp + i] := 0
endfor

Adding two n-vectors and storing the results in a third (base addresses B′, B″, B)

Convolution of two n-vectors: Wk = ∑i+j=k Ui × Vj (base addresses BW, BU, BV)

5.2 Data Broadcasting

Making p copies of B[0] by recursive doubling:

for k = 0 to ⎡log2 p⎤ – 1
  Proc j, 0 ≤ j < p, do
    Copy B[j] into B[j + 2^k]
endfor

Fig. 5.2 Data broadcasting in EREW PRAM via recursive doubling.

The algorithm can be modified so that redundant copying does not occur and the array bound is not exceeded.

Fig. 5.3 EREW PRAM data broadcasting without redundant copying.
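A minimal Python sketch of this recursive-doubling broadcast (the array name B, the value used, and the read-all-then-write-all lockstep behavior are illustrative assumptions):

```python
import math

def broadcast_recursive_doubling(p, value):
    """Spread B[0] to B[0..p-1] by recursive doubling, as described above."""
    B = [None] * p
    B[0] = value
    for k in range(math.ceil(math.log2(p))):
        reads = list(B)                    # all reads of one PRAM cycle happen first
        for j in range(p):
            if j + 2**k < p:               # guard against exceeding the array bound
                B[j + 2**k] = reads[j]
        # after step k, B[0 .. 2**(k+1) - 1] all hold the value
    return B

print(broadcast_recursive_doubling(11, 42))    # eleven copies of 42
```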

Fall 2008 Parallel Processing, Extreme Models Slide 11 All-to-All Broadcasting on EREW PRAM

EREW PRAM algorithm for all-to-all broadcasting:

Processor j, 0 ≤ j < p, write own data value into B[j]
for k = 1 to p – 1
  Processor j, 0 ≤ j < p, do
    Read the data value in B[(j + k) mod p]
endfor

This O(p)-step algorithm is time-optimal.

Naive EREW PRAM sorting algorithm (using all-to-all broadcasting):

Processor j, 0 ≤ j < p, write 0 into R[j]
for k = 1 to p – 1
  Processor j, 0 ≤ j < p, do
    l := (j + k) mod p
    if S[l] < S[j] or (S[l] = S[j] and l < j)
      then R[j] := R[j] + 1
    endif
endfor
Processor j, 0 ≤ j < p, write S[j] into S[R[j]]

This O(p)-step sorting algorithm is far from optimal; sorting is possible in O(log p) time.

Fall 2008 Parallel Processing, Extreme Models Slide 12 Class Participation: Broadcast-Based Sorting

Each person writes down an arbitrary nonnegative integer with 3 or fewer digits on a piece of paper. Students take turns broadcasting their numbers by calling them out aloud.

Each student puts an X on paper for every number called out that is smaller than his/her own number, or is equal but was called out before the student's own value.

Each student counts the number of Xs on paper to determine the rank of his/her number

Students call out their numbers in order of the computed rank

5.3 Semigroup or Fan-in Computation

EREW PRAM semigroup computation algorithm:

Proc j, 0 ≤ j < p, copy X[j] into S[j]
s := 1
while s < p
  Proc j, s ≤ j < p, do S[j] := S[j – s] ⊗ S[j]
  s := 2s
endwhile
The result ends up in S[p – 1]; broadcast it if all processors need it.

(The figure tabulates, for each step, the range of inputs combined into each S[j]; e.g., 0:2 denotes x[0] ⊗ x[1] ⊗ x[2].)

Fall 2008 Parallel Processing, Extreme Models Slide 14 5.4 Parallel Prefix Computation

Same as the first part of semigroup computation (no final broadcasting); after ⎡log2 p⎤ steps, S[j] holds the prefix result 0:j for every j.

Fig. 5.6 Parallel prefix computation in EREW PRAM via recursive doubling.
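A small Python sketch of this recursive-doubling prefix computation (addition and the lockstep read-then-write behavior are illustrative assumptions; any associative operator would do):

```python
def prefix_recursive_doubling(x, op=lambda a, b: a + b):
    """S[j] ends up holding x[0] op x[1] op ... op x[j] after ceil(log2 p) steps."""
    p = len(x)
    S = list(x)
    s = 1
    while s < p:
        reads = list(S)                    # reads of one PRAM step precede the writes
        for j in range(s, p):              # processor j combines S[j - s] into S[j]
            S[j] = op(reads[j - s], reads[j])
        s *= 2
    return S

print(prefix_recursive_doubling([2, 1, 3, 5, 4]))    # [2, 3, 6, 11, 15]
```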

A Divide-and-Conquer Parallel-Prefix Algorithm

T(p) = T(p/2) + 2, so T(p) ≅ 2 log2 p

Combine adjacent input pairs, run a parallel prefix computation of size n/2 on the results, then produce the remaining outputs; each vertical line in the figure represents a location in shared memory. In hardware, this is the basis for Brent-Kung carry-lookahead adders.

Fig. 5.7 Parallel prefix computation using a divide-and-conquer scheme.

Another Divide-and-Conquer Algorithm

T(p) = T(p/2) + 1 = log2 p

Perform parallel prefix computation separately on the n/2 even-indexed inputs and the n/2 odd-indexed inputs, then combine the two sets of results; each vertical line represents a location in shared memory. This is a strictly optimal algorithm, but it requires commutativity of ⊗.

Fig. 5.8 Another divide-and-conquer scheme for parallel prefix computation.

5.5 Ranking the Elements of a Linked List

Example list: head → C → F → A → E → B → D (terminal element).
Rank (distance from the terminal element): 5 4 3 2 1 0; distance from head: 1 2 3 4 5 6.

Fig. 5.9 Example linked list and the ranks of its elements.

List ranking appears to be hopelessly sequential; one cannot get to a list element except through its predecessor!

Array representation (location: info, next): 0: A, 4;  1: B, 3;  2: C, 5;  3: D, 3 (the terminal element points to itself);  4: E, 1;  5: F, 0;  head = 2.

Fig. 5.10 PRAM data structures representing a linked list and the ranking results.

Fall 2008 Parallel Processing, Extreme Models Slide 18 List Ranking via Recursive Doubling

Many problems that appear to be unparallelizable to the uninitiated are in fact parallelizable; intuition can be quite misleading!

Ranks along the list (head C F A E B D): initially 1 1 1 1 1 0; after one iteration 2 2 2 2 1 0; after two iterations 4 4 3 2 1 0; after three iterations 5 4 3 2 1 0.

Fig. 5.11 Element ranks initially and after each of the three iterations.

Fall 2008 Parallel Processing, Extreme Models Slide 19 PRAM List Ranking Algorithm

PRAM list ranking algorithm (via pointer jumping):

Processor j, 0 ≤ j < p, do  {initialize the partial ranks}
  if next[j] = j
    then rank[j] := 0
    else rank[j] := 1
  endif
while rank[next[head]] ≠ 0
  Processor j, 0 ≤ j < p, do
    rank[j] := rank[j] + rank[next[j]]
    next[j] := next[next[j]]
endwhile

If we do not want to modify the original list, we simply make a copy of it first, in constant time.

Question: Which PRAM submodel is implicit in this algorithm?  Answer: CREW
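A Python sketch of the pointer-jumping algorithm above, run on the list of Fig. 5.10 (the snapshotting emulates the synchronous PRAM steps):

```python
def list_rank(next_, head):
    """rank[j] = distance of element j from the terminal element (which points to itself)."""
    p = len(next_)
    nxt = list(next_)                       # copy, so the original list is preserved
    rank = [0 if nxt[j] == j else 1 for j in range(p)]
    while rank[nxt[head]] != 0:
        r, n = list(rank), list(nxt)        # snapshot, then all processors update together
        for j in range(p):
            rank[j] = r[j] + r[n[j]]
            nxt[j] = n[n[j]]
    return rank

# Locations 0..5 hold A, B, C, D, E, F with next pointers as in Fig. 5.10; head = 2 (C).
print(list_rank([4, 3, 5, 3, 1, 0], head=2))    # [3, 1, 5, 0, 2, 4]
```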

Fall 2008 Parallel Processing, Extreme Models Slide 20 5.6 Matrix Multiplication

Sequential matrix multiplication (m × m matrices):

for i = 0 to m – 1 do
  for j = 0 to m – 1 do
    t := 0
    for k = 0 to m – 1 do
      t := t + aik × bkj
    endfor
    cij := t
  endfor
endfor

cij = Σk=0 to m–1 aik bkj

PRAM solution with m^3 processors: each processor does one multiplication (not very efficient).

Fall 2008 Parallel Processing, Extreme Models Slide 21 PRAM Matrix Multiplication with m 2 Processors

PRAM matrix multiplication using m^2 processors:

Proc (i, j), 0 ≤ i, j < m, do
begin
  t := 0
  for k = 0 to m – 1 do
    t := t + aik × bkj
  endfor
  cij := t
end

Processors are numbered (i, j) instead of 0 to m^2 – 1.
Θ(m) steps: time-optimal; the CREW model is implicit.

Fig. 5.12 PRAM matrix multiplication; p = m2 processors.
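A sequential Python emulation of the m²-processor scheme; each (i, j) iteration stands in for one virtual processor (an illustrative sketch, not a claim about PRAM timing):

```python
def pram_matmul(A, B):
    """Processor (i, j) computes c[i][j] = sum over k of a[i][k] * b[k][j]."""
    m = len(A)
    C = [[0] * m for _ in range(m)]
    for i in range(m):                 # these two loops enumerate the virtual
        for j in range(m):             # processors, which run concurrently on a PRAM
            t = 0
            for k in range(m):
                t += A[i][k] * B[k][j]
            C[i][j] = t
    return C

print(pram_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))    # [[19, 22], [43, 50]]
```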

Fall 2008 Parallel Processing, Extreme Models Slide 22 PRAM Matrix Multiplication with m Processors

PRAM matrix multiplication using m processors:

for j = 0 to m – 1
  Proc i, 0 ≤ i < m, do
    t := 0
    for k = 0 to m – 1 do
      t := t + aik × bkj
    endfor
    cij := t
endfor

Θ(m^2) steps: time-optimal; the CREW model is implicit.
Because the order of multiplications is immaterial, accesses to B can be skewed to allow the EREW model.

Fall 2008 Parallel Processing, Extreme Models Slide 23 PRAM Matrix Multiplication with Fewer Processors

The algorithm is similar, except that each processor is in charge of computing m/p rows of C. Θ(m^3/p) steps: time-optimal; the EREW model can be used.

A drawback of all algorithms thus far is that only two arithmetic operations (one multiplication and one addition) are performed for each memory access. This is particularly costly for NUMA shared-memory machines.

Fall 2008 Parallel Processing, Extreme Models Slide 24 More Efficient Matrix Multiplication (for NUMA)

Partition the matrices into p square blocks of size q × q, where q = m/√p. Block matrix multiplication follows the same algorithm as simple matrix multiplication, with blocks in place of elements:

[A B] × [E F] = [AE+BG  AF+BH]
[C D]   [G H]   [CE+DG  CF+DH]

One processor computes the q × q block of C that it holds in local memory, multiplying a block-band (block row) of A by a block-band (block column) of B.

Fig. 5.13 Partitioning the matrices for block matrix multiplication .

Fall 2008 Parallel Processing, Extreme Models Slide 25 Details of Block Matrix Multiplication

A multiply-add computation on q × q blocks needs 2q^2 = 2m^2/p memory accesses and 2q^3 arithmetic operations, so q arithmetic operations are done per memory access.

Fig. 5.14 How processor (i, j) operates on an element of A and one block-row of B to update one block-row of C.

Fall 2008 Parallel Processing, Extreme Models Slide 26 6 More Shared-Memory Algorithms

Develop PRAM algorithms for more complex problems: • Must present background on the problem in some cases • Discuss some practical issues such as data distribution

Topics in This Chapter
6.1 Sequential Rank-Based Selection
6.2 A Parallel Selection Algorithm
6.3 A Selection-Based Sorting Algorithm
6.4 Alternative Sorting Algorithms
6.5 Convex Hull of a 2D Point Set
6.6 Some Implementation Aspects

Fall 2008 Parallel Processing, Extreme Models Slide 27 8.1 Searching and Dictionary Operations

Parallel p-ary search on PRAM:

log_(p+1)(n + 1) = log2(n + 1) / log2(p + 1) = Θ(log n / log p) steps

Example: n = 26, p = 2.  Speedup ≅ log p.

Optimal: no comparison-based search algorithm can be faster.

A single search in a sorted list can't be significantly speeded up through parallel processing, but all hope is not lost:
• Dynamic data (sorting overhead)
• Batch searching (multiple lookups)

6.1 Sequential Rank-Based Selection

Selection: Find the (or a) kth smallest among n elements.
Example: the 5th smallest element in the following list is 1:
6 4 5 6 7 1 5 3 8 2 1 0 3 4 5 6 2 1 7 1 4 5 4 9 5

Naive solution through sorting takes O(n log n) time, but a linear-time sequential algorithm can be developed.

Idea: partition S around m, the median of the n/q sublist medians, into L (elements < m), E (elements = m), and G (elements > m). If k ≤ |L| the answer lies in L; if k > |L| + |E| it lies in G; otherwise it equals m. The choice of m guarantees max(|L|, |G|) ≤ 3n/4.

Linear-Time Sequential Selection Algorithm

Sequential rank-based selection algorithm select(S, k):

1. if |S| < q {q is a small constant}   {O(n)}
     then sort S and return the kth smallest element of S
     else divide S into |S|/q subsequences of size q
          Sort each subsequence and find its median
          Let the |S|/q medians form the sequence T
   endif
2. m = select(T, |T|/2)  {find the median m of the |S|/q medians}   {T(n/q)}
3. Create 3 subsequences   {O(n)}
     L: Elements of S that are < m
     E: Elements of S that are = m
     G: Elements of S that are > m
4. if |L| ≥ k then return select(L, k)   {T(3n/4)}
   else if |L| + |E| ≥ k then return m
   else return select(G, k – |L| – |E|)
   endif

Algorithm Complexity and Examples

We must have q ≥ 5. The recurrence is T(n) = T(n/q) + T(3n/4) + cn; for q = 5, the solution is T(n) = 20cn.

n/q sublists of q elements:
S = 6 4 5 6 7 | 1 5 3 8 2 | 1 0 3 4 5 | 6 2 1 7 1 | 4 5 4 9 5
T = 6 3 3 2 5 (the sublist medians),  m = 3
L = 1 2 1 0 2 1 1 (|L| = 7),  E = 3 3 (|E| = 2),  G = 6 4 5 6 7 5 8 4 5 6 7 4 5 4 9 5 (|G| = 16)

To find the 5th smallest element in S, select the 5th smallest element in L:
S = 1 2 1 0 2 1 1,  T = 1 1,  m = 1,  L = 0,  E = 1 1 1 1,  G = 2 2.  Answer: 1

The 9th smallest element of S is 3. The 13th smallest element of S is found by selecting the 4th smallest element in G.
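A Python sketch of select(S, k), checked against the worked example above (tie-breaking details, such as which median is taken for even-length lists, are illustrative choices):

```python
def select(S, k, q=5):
    """Return the kth smallest (1-indexed) element of S by median-of-medians."""
    if len(S) < q:
        return sorted(S)[k - 1]
    T = [sorted(S[i:i + q])[(len(S[i:i + q]) - 1) // 2]    # median of each sublist
         for i in range(0, len(S), q)]
    m = select(T, (len(T) + 1) // 2, q)                    # median of the medians
    L = [x for x in S if x < m]
    E = [x for x in S if x == m]
    G = [x for x in S if x > m]
    if k <= len(L):
        return select(L, k, q)
    if k <= len(L) + len(E):
        return m
    return select(G, k - len(L) - len(E), q)

S = [6, 4, 5, 6, 7, 1, 5, 3, 8, 2, 1, 0, 3, 4, 5, 6, 2, 1, 7, 1, 4, 5, 4, 9, 5]
print(select(S, 5), select(S, 9), select(S, 13))           # 1 3 4
```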

6.2 A Parallel Selection Algorithm

Parallel rank-based selection algorithm PRAMselect(S, k, p):

1. if |S| < 4   {O(n^x)}
     then sort S and return the kth smallest element of S
     else broadcast |S| to all p processors
          divide S into p subsequences S(j) of size |S|/p
          Processor j, 0 ≤ j < p, compute Tj := select(S(j), |S(j)|/2)
   endif
2. m = PRAMselect(T, |T|/2, p)  {median of the medians}   {T(n^(1–x), p)}
3. Broadcast m to all processors and create 3 subsequences   {O(n^x)}
     L: Elements of S that are < m
     E: Elements of S that are = m
     G: Elements of S that are > m
4. if |L| ≥ k then return PRAMselect(L, k, p)   {T(3n/4, p)}
   else if |L| + |E| ≥ k then return m
   else return PRAMselect(G, k – |L| – |E|, p)
   endif

Let p = O(n^(1–x)).

Fall 2008 Parallel Processing, Extreme Models Slide 32 Algorithm Complexity and Efficiency

The solution is O(n^x): T(n, p) = T(n^(1–x), p) + T(3n/4, p) + cn^x; verify by substitution.

Speedup = Θ(n) / O(n^x) = Ω(n^(1–x)) = Ω(p)
Efficiency = Ω(1)
Work(n, p) = p T(n, p) = Θ(n^(1–x)) Θ(n^x) = Θ(n)
[Remember p = O(n^(1–x))]

What happens if we set x to 1 (i.e., use one processor)?  T(n, 1) = O(n^x) = O(n)
What happens if we set x to 0 (i.e., use n processors)?  T(n, n) = O(n^x) = O(1)?
No, because in the asymptotic analysis we ignored several O(log n) terms compared with the O(n^x) terms.

6.3 A Selection-Based Sorting Algorithm

Parallel selection-based sort PRAMselectionsort(S, p):

1. if |S| < k then return quicksort(S)   {O(1)}
2. for i = 1 to k – 1 do   {O(n^x)}
     mi := PRAMselect(S, i|S|/k, p)   {let m0 := –∞; mk := +∞}
   endfor
3. for i = 0 to k – 1 do   {O(n^x)}
     make the sublist T(i) from elements of S in (mi, mi+1)
   endfor
4. for i = 1 to k/2 do in parallel   {T(n/k, 2p/k)}
     PRAMselectionsort(T(i), 2p/k)
     {p/(k/2) processors are used for each of the k/2 subproblems}
   endfor
5. for i = k/2 + 1 to k do in parallel   {T(n/k, 2p/k)}
     PRAMselectionsort(T(i), 2p/k)
   endfor

Let p = n^(1–x) and k = 2^(1/x).

–∞   m1   m2   m3   . . .   mk–1   +∞     (each of the k ranges holds n/k elements)

Fig. 6.1 Partitioning of the sorted list for selection-based sorting.

Fall 2008 Parallel Processing, Extreme Models Slide 34 Algorithm Complexity and Efficiency

The solution is O(n^x log n): T(n, p) = 2T(n/k, 2p/k) + cn^x; verify by substitution.

Speedup(n, p) = Ω(n log n) / O(n^x log n) = Ω(n^(1–x)) = Ω(p)
Efficiency = Speedup / p = Ω(1)
Work(n, p) = p T(n, p) = Θ(n^(1–x)) Θ(n^x log n) = Θ(n log n)

What happens if we set x to 1 (i.e., use one processor)?  T(n, 1) = O(n^x log n) = O(n log n).  [Remember p = O(n^(1–x))]

Our asymptotic analysis is valid for x > 0 but not for x = 0; i.e., PRAMselectionsort cannot sort p keys in optimal O(log p) time.

Fall 2008 Parallel Processing, Extreme Models Slide 35 Example of Parallel Sorting

S: 6 4 5 6 7 1 5 3 8 2 1 0 3 4 5 6 2 1 7 0 4 5 4 9 5

Threshold values for k = 4 (i.e., x = ½ and p = n1/2 processors):

m0 = –∞
m1 = PRAMselect(S, 6, 5) = 2     (n/k = 25/4 ≅ 6)
m2 = PRAMselect(S, 13, 5) = 4    (2n/k = 50/4 ≅ 13)
m3 = PRAMselect(S, 19, 5) = 6    (3n/k = 75/4 ≅ 19)
m4 = +∞

m0 m1 m2 m3 m4 T: -----2|------4|-----6|------

T: 0 0 1 1 1 2|2 3 3 4 4 4 4|5 5 5 5 5 6|6 6 7 7 8 9

Fall 2008 Parallel Processing, Extreme Models Slide 36 6.4 Alternative Sorting Algorithms

Sorting via random sampling (assume p << √ n) Given a large list S of inputs, a random sample of the elements can be used to find k comparison thresholds It is easier if we pick k = p, so that each of the resulting subproblems is handled by a single processor

Parallel randomized sort PRAMrandomsort(S, p):
1. Processor j, 0 ≤ j < p, pick |S|/p^2 random samples of its |S|/p elements and store them in its corresponding section of a list T of length |S|/p
2. Processor 0 sort the list T  {comparison threshold mi is the (i|S|/p^2)th element of T}
3. Processor j, 0 ≤ j < p, store its elements falling in (mi, mi+1) into T(i)
4. Processor j, 0 ≤ j < p, sort the sublist T(j)

Fall 2008 Parallel Processing, Extreme Models Slide 37 Parallel Radixsort

In binary version of radixsort, we examine every bit of the k-bit keys in turn, starting from the LSB In Step i, bit i is examined, 0 ≤ i < k Records are stably sorted by the value of the ith key bit

Input list   Sort by LSB   Sort by middle bit   Sort by MSB    (binary forms in parentheses)
5 (101)      4 (100)       4 (100)              1 (001)
7 (111)      2 (010)       5 (101)              2 (010)
3 (011)      2 (010)       1 (001)              2 (010)
1 (001)      5 (101)       2 (010)              3 (011)
4 (100)      7 (111)       2 (010)              4 (100)
2 (010)      3 (011)       7 (111)              5 (101)
7 (111)      1 (001)       3 (011)              7 (111)
2 (010)      7 (111)       7 (111)              7 (111)

Question: How are the data movements performed?

Fall 2008 Parallel Processing, Extreme Models Slide 38 Data Movements in Parallel Radixsort

Input list   Compl. of bit 0   Diminished prefix sums   Bit 0   Prefix sums plus 2   Shifted list
5 (101)      0                 –                        1       1 + 2 = 3            4 (100)
7 (111)      0                 –                        1       2 + 2 = 4            2 (010)
3 (011)      0                 –                        1       3 + 2 = 5            2 (010)
1 (001)      0                 –                        1       4 + 2 = 6            5 (101)
4 (100)      1                 0                        0       –                    7 (111)
2 (010)      1                 1                        0       –                    3 (011)
7 (111)      0                 –                        1       5 + 2 = 7            1 (001)
2 (010)      1                 2                        0       –                    7 (111)

Running time consists mainly of the time to perform 2k parallel prefix computations: O(log p) for k constant
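A Python sketch of one bit-pass of this radixsort, with accumulate standing in for the two parallel prefix computations tabulated above (a sequential emulation; names are illustrative):

```python
from itertools import accumulate

def stable_bit_pass(keys, bit):
    """Stably move keys with bit value 0 ahead of keys with bit value 1."""
    b = [(k >> bit) & 1 for k in keys]
    zero_rank = list(accumulate(1 - x for x in b))     # prefix sums of the complemented bit
    one_rank = list(accumulate(b))                     # prefix sums of the bit itself
    n_zeros = zero_rank[-1]
    out = [0] * len(keys)
    for i, k in enumerate(keys):
        dest = zero_rank[i] - 1 if b[i] == 0 else n_zeros + one_rank[i] - 1
        out[dest] = k
    return out

def radixsort(keys, width):
    for bit in range(width):                           # LSB first
        keys = stable_bit_pass(keys, bit)
    return keys

print(stable_bit_pass([5, 7, 3, 1, 4, 2, 7, 2], 0))    # [4, 2, 2, 5, 7, 3, 1, 7]
print(radixsort([5, 7, 3, 1, 4, 2, 7, 2], 3))          # [1, 2, 2, 3, 4, 5, 7, 7]
```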

Fall 2008 Parallel Processing, Extreme Models Slide 39 6.5 Convex Hull of a 2D Point Set

Best sequential algorithm for p points: Ω(p log p) steps.

Fig. 6.2 Defining the convex hull problem (a point set Q and its convex hull CH(Q)).

Fig. 6.3 Illustrating the properties of the convex hull (all points fall on one side of each hull edge).

Fall 2008 Parallel Processing, Extreme Models Slide 40 PRAM Convex Hull Algorithm

Parallel convex hull algorithm PRAMconvexhull(S, p) 1. Sort point set by x coordinates 2. Divide sorted list into √p subsets Q (i) of size √p, 0 ≤ i < √p 3. Find convex hull of each subset Q (i) using √p processors 4. Merge √p convex hulls CH(Q (i)) into overall hull CH(Q) y y

Fig. 6.4 Multiway divide and conquer for the convex hull problem (upper hull and lower hull of the subsets Q(0), Q(1), Q(2), Q(3)).

Fall 2008 Parallel Processing, Extreme Models Slide 41 Merging of Partial Convex Hulls

Tangent lines between partial hulls are found through binary search in log time.

Analysis: T(p, p) = T(p^(1/2), p^(1/2)) + c log p ≅ 2c log p. The initial sorting also takes O(log p) time.

(a) No point of CH(Q(i)) is on CH(Q)

(b) Points of CH(Q(i)) from A to B are on CH(Q)

Fig. 6.5 Finding points in a partial hull that belong to the combined hull.

Fall 2008 Parallel Processing, Extreme Models Slide 42 6.6 Some Implementation Aspects

This section has been expanded; it will eventually become a separate chapter

Options for the processor-to-memory network: crossbar (complex, expensive), bus(es) (a bottleneck), or a multistage interconnection network (MIN).

Fig. 4.3 A parallel processor with global (shared) memory.

Fall 2008 Parallel Processing, Extreme Models Slide 43 Processor-to-Memory Network

Crossbar switches offer full permutation capability (they are nonblocking), but are complex and expensive: O(p^2) cost.

Even with a permutation network, full PRAM functionality is not realized: two processors cannot access different addresses in the same memory module.

Practical processor-to-memory networks cannot realize all permutations (they are blocking).

(Figure: an 8 × 8 crossbar switch connecting processors P0–P7 to memory modules M0–M7.)

Fall 2008 Parallel Processing, Extreme Models Slide 44 Bus-Based Interconnections

Single-bus system: bandwidth bottleneck; bus loading limit; scalability very poor; single failure point; conceptually simple; forced serialization.

Multiple-bus system: bandwidth improved; bus loading limit; scalability poor; more robust; more complex scheduling; simple serialization.

Fall 2008 Parallel Processing, Extreme Models Slide 45 Back-of-the-Envelope Bus Bandwidth Calculation

Single-bus system: bus frequency 0.5 GHz, data width 256 b (32 B), memory access 2 bus cycles.
Peak bandwidth = (0.5 GHz)/2 × 32 B = 8 GB/s.
Bus cycle = 2 ns; memory cycle = 100 ns, so 1 memory cycle = 50 bus cycles.

Multiple-bus system: peak bandwidth is multiplied by the number of buses (actual bandwidth is likely to be much less).

Fall 2008 Parallel Processing, Extreme Models Slide 46 Hierarchical Bus Interconnection

Bus switch (Gateway)

Low-level cluster Heterogeneous system

Fig. 4.9 Example of a hierarchical interconnection architecture.

Fall 2008 Parallel Processing, Extreme Models Slide 47 Removing the Processor-to-Memory Bottleneck

Challenge: cache coherence.

Fig. 4.4 A parallel processor with global memory and processor caches.

Fall 2008 Parallel Processing, Extreme Models Slide 48 Why Data Caching Works

Hit rate r (fraction of memory accesses satisfied by the cache):
Ceff = Cfast + (1 – r) Cslow

Cache parameters: size; block length (line width); placement policy; replacement policy; write policy.

Fig. 18.1 Data storage and access in a two-way set-associative cache.

Fall 2008 Parallel Processing, Extreme Models Slide 49 Benefits of Caching Formulated as Amdahl’s Law

Hit rate r (fraction of memory accesses satisfied by cache):
Ceff = Cfast + (1 – r) Cslow
S = Cslow / Ceff = 1 / [(1 – r) + Cfast/Cslow]

(Analogy from Fig. 18.3 of Parhami's Computer Architecture text, 2005: register file ~ desktop, 2 s access; cache ~ drawer, 5 s; main memory ~ cabinet, 30 s.)

This corresponds to the miss-rate fraction 1 – r of accesses being unaffected and the hit-rate fraction r (almost 1) being speeded up by a factor Cslow/Cfast. Generalized form of Amdahl's speedup formula:

S = 1/(f1/p1 + f2/p2 + . . . + fm/pm), with f1 + f2 + . . . + fm = 1

In this case, a fraction 1 – r is slowed down by a factor (Cslow + Cfast)/Cslow, and a fraction r is speeded up by a factor Cslow / Cfast
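A tiny numeric illustration of these formulas (the 2 ns and 100 ns access times echo the earlier bus example; the 95% hit rate is an assumed value):

```python
def cache_speedup(r, c_fast, c_slow):
    """Effective access time and speedup from the slide's formulas."""
    c_eff = c_fast + (1 - r) * c_slow
    return c_eff, c_slow / c_eff

c_eff, s = cache_speedup(r=0.95, c_fast=2, c_slow=100)
print(c_eff, round(s, 1))     # 7.0 ns effective access time, speedup of about 14.3
```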

18.2 Cache Coherence Protocols

A cached data block can be in one of several states: multiple consistent copies, a single consistent copy, a single inconsistent copy, or invalid.

Fig. 18.2 Various types of cached data blocks in a parallel processor with global memory and processor caches.

Fall 2008 Parallel Processing, Extreme Models Slide 51 Example: A Bus-Based Snoopy Protocol

Each transition is labeled with the event that triggers it, followed by the action(s) that must be taken

States: Invalid, Shared (read-only), Exclusive (read/write).

CPU read hits, and write hits in Exclusive, leave the state unchanged. A CPU read miss puts a read miss on the bus and leads to Shared (from Exclusive, the block is first written back); a CPU write miss, or a write hit in Shared, puts a write miss on the bus and leads to Exclusive (again with a write-back if the block was Exclusive). A bus read miss for an Exclusive block causes a write-back and a move to Shared; a bus write miss for the block invalidates it (with a write-back if it was Exclusive).

Fig. 18.3 Finite-state control mechanism for a bus-based snoopy cache coherence protocol.

Fall 2008 Parallel Processing, Extreme Models Slide 52 Implementing a Snoopy Protocol

A second set of tags and state storage for the snoop side allows snooping to be done concurrently with normal cache operation. Getting all the implementation timing and details right is nontrivial.

Fig. 27.7 of Parhami's Computer Architecture text (snoopy protocol implementation with duplicate tags and state store).

Fall 2008 Parallel Processing, Extreme Models Slide 53 Distributed Shared Memory

Some terminology:
NUMA   Nonuniform memory access (distributed shared memory)
UMA    Uniform memory access (global shared memory)
COMA   Cache-only memory architecture

Fig. 4.5 A parallel processor with distributed (NUMA) memory.

Fall 2008 Parallel Processing, Extreme Models Slide 54 Example: A Directory-Based Protocol

States of a directory entry: Uncached, Shared (read-only), Exclusive (read/write).

Uncached → Shared on read miss: return data value, sharing set = {c}
Uncached → Exclusive on write miss: return data value, sharing set = {c}
Shared → Shared on read miss: return data value, sharing set = sharing set + {c}
Shared → Exclusive on write miss: invalidate the sharers, sharing set = {c}, return data value
Exclusive → Shared on read miss: fetch the data value, return it, sharing set = sharing set + {c}  [correction to text]
Exclusive → Exclusive on write miss: fetch data value, request invalidation, return data value, sharing set = {c}
Exclusive → Uncached on data write-back: sharing set = { }

Fig. 18.4 States and transitions for a directory entry in a directory-based coherence protocol (c denotes the cache sending the message).

Fall 2008 Parallel Processing, Extreme Models Slide 55 Implementing a Directory-Based Protocol

Sharing set implemented as a bit-vector, one bit per cache (simple, but not scalable).

When there are many more nodes (caches) than the typical size of a sharing set, a list of sharing units may be maintained in the directory

(Figure: a coherent data block in memory carries a head pointer to the list of caches, Cache 0 through Cache 3, that hold copies; noncoherent data blocks have no such list.)

The sharing set can be maintained as a distributed doubly linked list (will discuss in Section 18.6 in connection with the SCI standard)

Fall 2008 Parallel Processing, Extreme Models Slide 56 Hiding the Memory Access Latency

By assumption, PRAM accesses memory locations right when they are needed, so processing must stall until the data is fetched. This is not a smart strategy: memory access time is hundreds of times the add time.

Method 1: Predict accesses (prefetch)
Method 2: Pipeline multiple accesses

21.3 Vector-Parallel Cray Y-MP

Fig. 21.5 Key elements of the Cray Y-MP processor: vector registers V0–V7 (64-bit) with vector integer units (add, shift, logic, weight/parity), floating-point units (add, multiply, reciprocal approximation), scalar integer units, T and S registers (8 × 64-bit each), and the central memory. Address registers, address function units, instruction buffers, and control are not shown.

Cray Y-MP's Interconnection Network

Processors P0–P7 reach the memory banks through 1 × 8 sections, 8 × 8 subsections, and 4 × 4 switches; the banks are interleaved (0, 4, 8, ..., 28; 1, 5, 9, ..., 29; 2, 6, 10, ..., 30; 3, 7, 11, ..., 31; ...; 227, 231, ..., 255).

Fig. 21.6 The processor-to-memory interconnection network of Cray Y-MP.

Fall 2008 Parallel Processing, Extreme Models Slide 59 Butterfly Processor-to-Memory Network Not a full p Processors log 2 p Columns of 2-by-2 Switches p Memory Banks 0000 0 0 1 2 3 0 0000 permutation network 0001 1 1 0001 (e.g., processor 0 0010 2 2 0010 0011 3 3 0011 cannot be connected 0100 4 4 0100 to memory bank 2 0101 5 5 0101 alongside the two 0110 6 6 0110 connections shown) 0111 7 7 0111 1000 1000 8 8 Is self-routing: i.e., 1001 9 9 1001 1010 10 10 1010 the bank address 1011 11 11 1011 determines the route 1100 12 12 1100 1101 13 13 1101 A request going to 1110 14 14 1110 memory bank 3 1111 15 15 1111 (0 0 1 1) is routed: Fig. 6.9 Example of a multistage memory access network. lower upper upper

Fall 2008 Parallel Processing, Extreme Models Slide 60 Butterfly as Multistage Interconnection Network

Fig. 6.9 Example of a multistage memory access network: p processors, log2 p columns of 2-by-2 switches, p memory banks. Fig. 15.8 Butterfly network (log2 p + 1 columns of 2-by-2 switches) used to connect modules that are on the same side.

Generalization of the butterfly network: a high-radix or m-ary butterfly, built of m × m switches, has m^q rows and q + 1 columns (q if wrapped).

Fall 2008 Parallel Processing, Extreme Models Slide 61 Beneš Network

Processors connect to memory banks through 2 log2 p – 1 columns of 2-by-2 switches.

Fig. 15.9 Beneš network formed from two back-to-back butterflies.

A 2^q-row Beneš network can route any 2^q × 2^q permutation; it is "rearrangeable".

Fall 2008 Parallel Processing, Extreme Models Slide 62 Routing Paths in a Beneš Network 0 0 To which 1 1 memory 2 2 modules 3 3 can we 4 4 connect 5 5 proc 4 6 6 without 7 7 rearranging 8 8 the other 9 9 paths? 10 10 11 11 12 12 What 13 13 about 14 14 proc 6? 15 0 1 2 3 4 5 6 15 q+1 2 q+1 Inputs 2 q Rows, 2q + 1 Columns 2 Outputs Fig. 15.10 Another example of a Beneš network.

Fall 2008 Parallel Processing, Extreme Models Slide 63 16.6 Multistage Interconnection Networks

Numerous indirect or multistage interconnection networks (MINs) have been proposed for, or used in, parallel computers. They differ in topological, performance, robustness, and realizability attributes. We have already seen the butterfly, hierarchical bus, Beneš, and ADM networks.

Fig. 4.8 (modified) The sea of indirect interconnection networks.

Fall 2008 Parallel Processing, Extreme Models Slide 64 Self-Routing Permutation Networks

Do there exist self-routing permutation networks? (The butterfly network is self-routing, but it is not a permutation network)

Permutation routing through a MIN is the same problem as sorting

(Example: the inputs 7 (111), 0 (000), 4 (100), 6 (110), 1 (001), 5 (101), 3 (011), 2 (010) are routed to the outputs 0–7 by sorting on the MSB, then the middle bit, then the LSB.)

Fig. 16.14 Example of sorting on a binary network.

Fall 2008 Parallel Processing, Extreme Models Slide 65 Partial List of Important MINs

Augmented data manipulator (ADM): aka unfolded PM2I (Fig. 15.12)
Banyan: Any MIN with a unique path between any input and any output (e.g., butterfly)
Baseline: Butterfly network with nodes labeled differently
Beneš: Back-to-back butterfly networks, sharing one column (Figs. 15.9-10)
Bidelta: A MIN that is a delta network in either direction
Butterfly: aka unfolded hypercube (Figs. 6.9, 15.4-5)
Data manipulator: Same as ADM, but with switches in a column restricted to the same state
Delta: Any MIN for which the outputs of each switch have distinct labels (say 0 and 1 for 2 × 2 switches) and the path label, composed by concatenating the switch output labels leading from an input to an output, depends only on the output
Flip: Reverse of the omega network (inputs and outputs interchanged)
Indirect cube: Same as butterfly or omega
Omega: Multi-stage shuffle-exchange network; isomorphic to butterfly (Fig. 15.19)
Permutation: Any MIN that can realize all permutations
Rearrangeable: Same as permutation network
Reverse baseline: Baseline network, with the roles of inputs and outputs interchanged

Fall 2008 Parallel Processing, Extreme Models Slide 66 Conflict-Free Memory Access

Try to store the data such that parallel accesses are to different banks. For many data structures, a compiler may perform the memory mapping.

If each matrix column is stored in a different memory module (bank), a whole row can be accessed in parallel, but accessing a column leads to conflicts (all of its elements reside in the same module).

Fig. 6.6 Matrix storage in column-major order to allow concurrent accesses to rows.

Fall 2008 Parallel Processing, Extreme Models Slide 67 Skewed Storage Format

Skewed layout: row i is rotated by i positions, so element (i, j) resides in module (i + j) mod 6:

Module:  0    1    2    3    4    5
Row 0:  0,0  0,1  0,2  0,3  0,4  0,5
Row 1:  1,5  1,0  1,1  1,2  1,3  1,4
Row 2:  2,4  2,5  2,0  2,1  2,2  2,3
Row 3:  3,3  3,4  3,5  3,0  3,1  3,2
Row 4:  4,2  4,3  4,4  4,5  4,0  4,1
Row 5:  5,1  5,2  5,3  5,4  5,5  5,0

Fig. 6.7 Skewed matrix storage for conflict-free accesses to rows and columns.

Fall 2008 Parallel Processing, Extreme Models Slide 68 A Unified Theory of Conflict-Free Access

A qD array can be viewed as a vector, with "row"/"column" accesses associated with constant strides. Aij is viewed as vector element i + jm.

Fig. 6.8 A 6 × 6 matrix viewed, in column-major order, as a 36-element vector (column j occupies vector indices 6j through 6j + 5).

Column: k, k+1, k+2, k+3, k+4, k+5 Stride = 1 Row: k, k+m, k+2m, k+3m, k+4m, k+5m Stride = m Diagonal: k, k+m+1, k+2(m+1), k+3(m+1), k+4(m+1), k+5(m+1) Stride = m + 1 Antidiagonal: k, k+m–1, k+2(m–1), k+3(m–1), k+4(m–1), k+5(m–1) Stride = m –1

Fall 2008 Parallel Processing, Extreme Models Slide 69 Linear Skewing Schemes

Place vector element i in memory bank a + bi mod B (the word address within the bank is irrelevant to conflict-free access; also, a can be set to 0). Aij is viewed as vector element i + jm.

Fig. 6.8 A 6 × 6 matrix viewed, in column-major order, as a 36-element vector.

With a linear skewing scheme, vector elements k, k + s, k +2s, . . . , k + (B –1)s will be assigned to different memory banks iff sb is relatively prime with respect to the number B of memory banks. A prime value for B ensures this condition, but is not very practical.
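A quick Python check of this condition for B = 6 banks and the strides listed earlier, taking b = 1 so the condition reduces to the stride being relatively prime to B (the function name is illustrative):

```python
from math import gcd

def banks_hit(k, s, B, a=0, b=1):
    """Banks touched by the accesses k, k+s, ..., k+(B-1)s under bank(i) = (a + b*i) mod B."""
    return {(a + b * (k + t * s)) % B for t in range(B)}

B, m = 6, 6
for name, stride in [("column", 1), ("row", m), ("diagonal", m + 1), ("antidiagonal", m - 1)]:
    conflict_free = len(banks_hit(0, stride, B)) == B
    print(name, stride, conflict_free, gcd(stride, B) == 1)
# Row accesses (stride m = 6 = B) all land in one bank; the other strides are conflict-free.
```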

Fall 2008 Parallel Processing, Extreme Models Slide 70 7 Sorting and Selection Networks

Become familiar with the circuit model of parallel processing: • Go from algorithm to architecture, not vice versa • Use a familiar problem to study various trade-offs

Topics in This Chapter
7.1 What Is a Sorting Network?
7.2 Figures of Merit for Sorting Networks
7.3 Design of Sorting Networks
7.4 Batcher Sorting Networks
7.5 Other Classes of Sorting Networks
7.6 Selection Networks

Fall 2008 Parallel Processing, Extreme Models Slide 71 7.1 What is a Sorting Network?

The outputs are a permutation of the inputs satisfying y0 ≤ y1 ≤ ... ≤ yn–1 (non-descending).

Fig. 7.1 An n-input sorting network or an n-sorter.

A 2-sorter takes input0 and input1 and delivers min on one output and max on the other.

Fig. 7.2 Block diagram and four different schematic representations for a 2-sorter.

Fall 2008 Parallel Processing, Extreme Models Slide 72 Building Blocks for Sorting Networks

A 2-sorter can be implemented with bit-parallel inputs (compare the two k-bit operands and steer min(a, b) and max(a, b) to the outputs) or with MSB-first bit-serial inputs (a small sequential circuit with a reset signal).

Fig. 7.3 Parallel and bit-serial hardware realizations of a 2-sorter.

Proving a Sorting Network Correct

(Example: the 4-sorter below sorts the inputs 3, 2, 5, 1 into 1, 2, 3, 5.)

Fig. 7.4 Block diagram and schematic representation of a 4-sorter.

Method 1: Exhaustive test – Try all n! possible input orders

Method 2: Ad hoc proof – for the example above, note that y0 is smallest, y3 is largest, and the last comparator sorts the other two outputs.
Method 3: Use the zero-one principle – a comparison-based sorting algorithm is correct iff it correctly sorts all 0-1 sequences (2^n tests).
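A Python sketch of the zero-one principle in action: exhaustively checking a 5-comparator 4-sorter on all 2^4 binary inputs (the comparator schedule below is an assumed standard one of the kind shown in Fig. 7.4):

```python
from itertools import product

def apply_network(values, comparators):
    v = list(values)
    for i, j in comparators:               # each comparator orders one pair of lines
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v

four_sorter = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]   # 5 comparators, 3 levels

# Zero-one principle: 2^4 binary tests suffice instead of 4! input orders.
print(all(apply_network(x, four_sorter) == sorted(x)
          for x in product((0, 1), repeat=4)))           # True
```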

Fall 2008 Parallel Processing, Extreme Models Slide 74 Elaboration on the Zero-One Principle

(Example: an invalid 6-sorter applied to an arbitrary input sequence and to the 0-1 sequence derived from it.)

Deriving a 0-1 sequence that is not correctly sorted, given an arbitrary sequence that is not correctly sorted.

Let outputs yi and yi+1 be out of order, that is yi > yi+1

Replace inputs that are strictly less than yi with 0s and all others with 1s The resulting 0-1 sequence will not be correctly sorted either

Fall 2008 Parallel Processing, Extreme Models Slide 75 7.2 Figures of Merit for Sorting Networks

Cost: number of comparators. In the following example, we have 5 comparators.

Delay: number of levels. The following 4-sorter has 3 comparator levels on its critical path.

Cost × Delay: the cost-delay product for this example is 15.

Fig. 7.4 Block diagram and schematic representation of a 4-sorter (shown sorting the inputs 3, 2, 5, 1 into 1, 2, 3, 5).

Fall 2008 Parallel Processing, Extreme Models Slide 76 Cost as a Figure of Merit

n = 9, 25 modules, 9 levels n = 10, 29 modules, 9 levels

n = 12, 39 modules, 9 levels

n = 16, 60 modules, 10 levels Fig. 7.5 Some low-cost sorting networks.

Fall 2008 Parallel Processing, Extreme Models Slide 77 Delay as a Figure of Merit

n = 6, 12 modules, 5 levels

These 3 comparators constitute one level.   n = 9, 25 modules, 8 levels    n = 10, 31 modules, 7 levels

n = 12, 40 modules, 8 levels

n = 16, 61 modules, 9 levels Fig. 7.6 Some fast sorting networks.

Fall 2008 Parallel Processing, Extreme Models Slide 78 Cost-Delay Product as a Figure of Merit

n = 6, 12 modules, 5 levels

n = 9, 25 modules, 8 levels n = 10, 29 modules, 9 levels n = 10, 31 modules, 7 levels

Low-cost 10-sorter from Fig. 7.5 Fast 10-sorter from Fig. 7.6

Cost×Delay = 29×9 = 261 Cost×Delay = 31×7 = 217

The most cost-effective n-sorter may be neither the fastest design nor the lowest-cost design.   (n = 12, 40 modules, 8 levels)

n = 16, 61 modules, 9 levels n = 16, 60 modules, 10 levels

Fall 2008 Parallel Processing, Extreme Models Slide 79 7.3 Design of Sorting Networks

C(n) = n(n – 1)/2     D(n) = n     Cost × Delay = n^2(n – 1)/2 = Θ(n^3)

Rotate by 90 degrees to see the odd-even exchange patterns.

Fig. 7.7 Brick-wall 6-sorter based on odd–even transposition.
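A Python sketch of the brick-wall network: generate the n levels of odd-even transposition comparators and apply them (a sequential emulation; the comparators within one level act in parallel in hardware):

```python
def brick_wall(n):
    """n levels; even levels pair (0,1),(2,3),..., odd levels pair (1,2),(3,4),..."""
    return [[(i, i + 1) for i in range(level % 2, n - 1, 2)] for level in range(n)]

def run(values, levels):
    v = list(values)
    for level in levels:
        for i, j in level:
            if v[i] > v[j]:
                v[i], v[j] = v[j], v[i]
    return v

levels = brick_wall(6)
print(sum(len(l) for l in levels), len(levels))   # 15 comparators = 6*5/2, 6 levels
print(run([5, 1, 4, 2, 6, 3], levels))            # [1, 2, 3, 4, 5, 6]
```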

Fall 2008 Parallel Processing, Extreme Models Slide 80 Insertion Sort and Selection Sort

(Figure: two recursive constructions, each an (n–1)-sorter plus a chain of 2-sorters, corresponding to insertion sort and to selection sort.)

Parallel insertion sort = Parallel selection sort = Parallel bubble sort!

C(n) = n(n – 1)/2     D(n) = 2n – 3     Cost × Delay = Θ(n^3)

Fig. 7.8 Sorting network based on insertion sort or selection sort.

Fall 2008 Parallel Processing, Extreme Models Slide 81 Theoretically Optimal Sorting Networks

AKS sorting networks (Ajtai, Komlós, Szemerédi, 1983) have O(log n) depth and O(n log n) size. Note that even for these optimal networks, the delay-cost product is suboptimal; but this is the best we can do. Existing practical sorting networks have O(log^2 n) latency and O(n log^2 n) cost. Given that log2 n is only 20 for n = 1,000,000, the latter are more practical.

Unfortunately, AKS networks are not practical owing to large (4-digit) constant factors involved; improvements since 1983 not enough

Fall 2008 Parallel Processing, Extreme Models Slide 82 7.4 Batcher Sorting Networks

(Figure: the first sorted sequence x0–x3 and the second sorted sequence y0–y6 enter a (2, 4)-merger for the even-indexed elements and a (2, 3)-merger for the odd-indexed elements; these decompose recursively into (1, 1)- and (1, 2)-mergers, and a final rank of 2-sorters combines the outputs v and w.)

Fig. 7.9 Batcher’s even–odd merging network for 4 + 7 inputs.

Fall 2008 Parallel Processing, Extreme Models Slide 83 Proof of Batcher’s Even-Odd Merge

Use the zero-one principle. Assume the first sorted sequence x has k 0s and the second sorted sequence y has k′ 0s.

Then v (the merge of the even-indexed elements) has k_even = ⎡k/2⎤ + ⎡k′/2⎤ 0s and w (the merge of the odd-indexed elements) has k_odd = ⎣k/2⎦ + ⎣k′/2⎦ 0s.

Case a (k_even = k_odd): the interleaving of v and w is already sorted.
Case b (k_even = k_odd + 1): again already sorted.
Case c (k_even = k_odd + 2): exactly one pair is out of order, and the final rank of comparators fixes it.

Fall 2008 Parallel Processing, Extreme Models Slide 84 Batcher’s Even-Odd Merge Sorting Batcher’s (m, m) even-odd merger, for m a power of 2:

C(m) = 2C(m/2) + m – 1 = (m – 1) + 2(m/2 – 1) + 4(m/4 – 1) + . . . = m log2 m + 1
D(m) = D(m/2) + 1 = log2 m + 1
Cost × Delay = Θ(m log^2 m)

Batcher sorting networks based on the even-odd merge technique (two n/2-sorters followed by an (n/2, n/2)-merger):

C(n) = 2C(n/2) + (n/2) log2(n/2) + 1 ≅ n (log2 n)^2 / 2
D(n) = D(n/2) + log2(n/2) + 1 = D(n/2) + log2 n = log2 n (log2 n + 1)/2
Cost × Delay = Θ(n log^4 n)

Fig. 7.10 The recursive structure of Batcher's even-odd merge sorting network.

Fall 2008 Parallel Processing, Extreme Models Slide 85 Example Batcher’s Even-Odd 8-Sorter

(Figure: two n/2-sorters followed by an (n/2, n/2)-merger; for eight inputs, two 4-sorters feed even and odd (2, 2)-mergers.)

Fig. 7.11 Batcher's even-odd merge sorting network for eight inputs.

Fall 2008 Parallel Processing, Extreme Models Slide 86 Bitonic-Sequence Sorter

Bitonic sequence: one that rises then falls (e.g., 1 3 3 4 6 6 6 2 2 1 0 0), falls then rises (e.g., 8 7 7 6 6 6 5 4 6 8 8 9), or is obtained from such a sequence by cyclic rotation (e.g., 8 9 8 7 7 6 6 6 5 4 6 8, the previous sequence right-rotated by 2).

To sort: shift the right half of the data onto the left half (superimpose the two halves); in each position, keep the smaller value of each pair and ship the larger value to the right. Each half is then a bitonic sequence that can be sorted independently.

Fig. 14.2 Sorting a bitonic sequence on a linear array.
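A Python sketch of this halving scheme (it assumes the sequence length is a power of 2, so each half is again bitonic):

```python
def sort_bitonic(seq):
    """Sort a bitonic sequence by repeatedly superimposing its two halves."""
    n = len(seq)
    if n == 1:
        return list(seq)
    half = n // 2
    lo = [min(seq[i], seq[i + half]) for i in range(half)]   # keep the smaller value
    hi = [max(seq[i], seq[i + half]) for i in range(half)]   # ship the larger to the right
    return sort_bitonic(lo) + sort_bitonic(hi)               # each half is bitonic

print(sort_bitonic([1, 3, 4, 6, 6, 2, 1, 0]))   # [0, 1, 1, 2, 3, 4, 6, 6]
```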

Fall 2008 Parallel Processing, Extreme Models Slide 87 Batcher’s Bitonic Sorting Networks

(Figure: two n/2-sorters feed an n-input bitonic-sequence sorter; the 8-input network is built from 2-input and 4-input bitonic-sequence sorters.)

Fig. 7.12 The recursive structure of Batcher's bitonic sorting network.  Fig. 7.13 Batcher's bitonic sorting network for eight inputs.

Fall 2008 Parallel Processing, Extreme Models Slide 88 7.5 Other Classes of Sorting Networks

Desirable properties:
a. Regular / modular (easier VLSI layout)
b. Simpler circuits via reusing the blocks
c. With an extra block, tolerates some faults (missed exchanges)
d. With 2 extra blocks, provides tolerance to single faults (a missed or incorrect exchange)
e. Multiple passes through a faulty network (graceful degradation)
f. Single-block design becomes fault-tolerant by using an extra stage

Fig. 7.14 Periodic balanced sorting network for eight inputs.

Fall 2008 Parallel Processing, Extreme Models Slide 89 Shearsort-Based Sorting Networks (1)

(Figure: snake-like row sorts, column sorts, and snake-like row sorts, shown next to the corresponding 2-D mesh.)

Fig. 7.15 Design of an 8-sorter based on shearsort on 2×4 mesh.

Shearsort-Based Sorting Networks (2)

(Figure: snake-like row sorts alternating with left-column and right-column sorts on the corresponding 2-D mesh.) This design has some of the same advantages as periodic balanced sorting networks.

Fig. 7.16 Design of an 8-sorter based on shearsort on 2×4 mesh.

Fall 2008 Parallel Processing, Extreme Models Slide 91 7.6 Selection Networks

Direct design may yield simpler/faster selection networks

3rd smallest element

Can remove this block if smallest three inputs needed

Can remove these four comparators.

Deriving an (8, 3)-selector from Batcher's even-odd merge 8-sorter.

Fall 2008 Parallel Processing, Extreme Models Slide 92 Categories of Selection Networks

Unfortunately we know even less about selection networks than we do about sorting networks. One can define three selection problems [Knut81]: I. Select the k smallest values; present in sorted order II. Select kth smallest value III. Select the k smallest values; present in any order Circuit and time complexity: (I) hardest, (III) easiest

Classifiers: selectors that separate the smaller half of the values from the larger half; e.g., an 8-classifier sends the smaller 4 of its 8 inputs to one set of outputs and the larger 4 to the other.

Fall 2008 Parallel Processing, Extreme Models Slide 93 Type-III Selection Networks

(Figure: each line is labeled with the interval of ranks that can appear there; the inputs start at [0, 7], and after four stages the upper outputs are confined to [0, 3] and the lower outputs to [4, 7].)

Fig. 7.17 A type III (8, 4)-selector, i.e., an 8-classifier.

Fall 2008 Parallel Processing, Extreme Models Slide 94 8 Other Circuit-Level Examples

Complement our discussion of sorting and sorting nets with: • Searching, parallel prefix computation, Fourier transform • Dictionary machines, parallel prefix nets, FFT circuits

Topics in This Chapter
8.1 Searching and Dictionary Operations
8.2 A Tree-Structured Dictionary Machine
8.3 Parallel Prefix Computation
8.4 Parallel Prefix Networks
8.5 The Discrete Fourier Transform
8.6 Parallel Architectures for FFT

Fall 2008 Parallel Processing, Extreme Models Slide 95 8.1 Searching and Dictionary Operations

Parallel p-ary search on PRAM:

log_(p+1)(n + 1) = log2(n + 1) / log2(p + 1) = Θ(log n / log p) steps

Example: n = 26, p = 2.  Speedup ≅ log p.

Optimal: no comparison-based search algorithm can be faster.

A single search in a sorted list can't be significantly speeded up through parallel processing, but all hope is not lost:
• Dynamic data (sorting overhead)
• Batch searching (multiple lookups)

Fall 2008 Parallel Processing, Extreme Models Slide 96 Dictionary Operations

Basic dictionary operations: record keys x0, x1, . . . , xn–1 search(y) Find record with key y; return its associated data insert(y, z) Augment list with a record: key = y, data = z delete(y) Remove record with key y; return its associated data

Some or all of the following operations might also be of interest: findmin Find record with smallest key; return data findmax Find record with largest key; return data findmed Find record with median key; return data findbest(y) Find record with key “nearest” to y findnext(y) Find record whose key is right after y in sorted order findprev(y) Find record whose key is right before y in sorted order extractmin Remove record(s) with min key; return data extractmax Remove record(s) with max key; return data extractmed Remove record(s) with median key value; return data

Priority queue operations: findmin, extractmin (or findmax, extractmax)

Fall 2008 Parallel Processing, Extreme Models Slide 97 8.2 A Tree-Structured Dictionary Machine

Searches enter at the input root and are pipelined down the "circle" tree; results are combined in the triangular nodes of the "triangle" tree and emerge at the output root.

search(y): pass the OR of the match signals and the data from the "yes" side
findmin / findmax: pass the smaller / larger of the two keys and its data
findmed: not supported here
findbest(y): pass the larger of the two match-degree indicators along with the associated record

Fig. 8.1 A tree-structured dictionary machine.

Fall 2008 Parallel Processing, Extreme Models Slide 98 Insertion and Deletion in the Tree Machine

insert(y, z): counters keep track of the vacancies in each subtree. Deletion needs a second pass to update the vacancy counters. Redundant insertion (update?) and deletion (no-op?) must also be considered.

Implementation: merge the circle and triangle trees by folding.

Fig. 8.2 Tree machine storing 5 records and containing 3 free slots.

Fall 2008 Parallel Processing, Extreme Models Slide 99 Systolic Data Structures

Each node holds the smallest (S), median (M), and largest (L) value in its subtree. Each subtree is balanced or has one fewer element on the left (a root flag shows this).

Example: 20 elements, 3 in the root, 8 on the left, and 9 on the right; 8 elements split as 3 + 2 + 3.

Fig. 8.3 Systolic data structure for minimum, maximum, and median finding.

Update/access examples for the systolic data structure of Fig. 8.3: Insert 2, Insert 20, Insert 127, Insert 195; Extractmin, Extractmed, Extractmax (with 19 or 20 items).

Fall 2008 Parallel Processing, Extreme Models Slide 100 8.3 Parallel Prefix Computation Example: Prefix sums

Inputs x0, x1, x2, . . ., xi yield the prefix results s0 = x0, s1 = x0 + x1, s2 = x0 + x1 + x2, . . ., si = x0 + x1 + . . . + xi. Sequential time with one processor is O(n); simple pipelining does not help.

(Figure: a function unit computing ⊗ with latches, and a four-stage pipelined version.)

Fig. 8.4 Prefix computation using a latched or pipelined function unit.

Fall 2008 Parallel Processing, Extreme Models Slide 101 Improving the Performance with Pipelining

Ignoring pipelining overhead, it appears that we have achieved a speedup of 4 with 3 “processors.” Can you explain this anomaly? (Problem 8.6a)

(Figure: delayed partial results such as x[i–4] ⊗ x[i–5], x[i–8] ⊗ x[i–9], and x[i–10] ⊗ x[i–11] are recirculated through the pipelined ⊗ unit, producing s[i–12] as x[i] enters.)

Fig. 8.5 High-throughput prefix computation using a pipelined function unit.

8.4 Parallel Prefix Networks

T(n) = T(n/2) + 2 = 2 log2 n – 1
C(n) = C(n/2) + n – 1 = 2n – 2 – log2 n

This is the Brent-Kung parallel prefix network (its delay is actually 2 log2 n – 2).

Fig. 8.6 Prefix sum network built of one n/2-input network and n – 1 adders.

Example of Brent-Kung Parallel Prefix Network

Originally developed by Brent and Kung as part of a VLSI-friendly carry-lookahead adder.

One level of latency

T(n) = 2 log2 n – 2     C(n) = 2n – 2 – log2 n

Fig. 8.8 Brent-Kung parallel prefix graph for n = 16.

Another Divide-and-Conquer Design

Ladner-Fischer construction:
T(n) = T(n/2) + 1 = log2 n
C(n) = 2C(n/2) + n/2 = (n/2) log2 n

Simple Ladner-Fischer parallel prefix network (its delay is optimal, but it has fan-out issues if implemented directly).

Fig. 8.7 Prefix sum network built of two n/2-input networks and n/2 adders.

Fall 2008 Parallel Processing, Extreme Models Slide 105 Example of Kogge-Stone Parallel Prefix Network

T(n) = log2 n
C(n) = (n – 1) + (n – 2) + (n – 4) + . . . + n/2 = n log2 n – n + 1

Optimal in delay, but too complex in number of cells and wiring pattern.

Fig. 8.9 Kogge-Stone parallel prefix graph for n = 16.

Comparison and Hybrid Parallel Prefix Networks

Brent/Kung: 6 levels, 26 cells.   Kogge/Stone: 4 levels, 49 cells.

Han/Carlson (hybrid): 5 levels, 32 cells (Brent-Kung stages surrounding a Kogge-Stone middle section).

Fig. 8.10 A hybrid Brent-Kung / Kogge-Stone parallel prefix graph for n = 16.

Linear-Cost, Optimal Ladner-Fischer Networks

Define a type-x parallel prefix network as one that:

• Produces the leftmost output in optimal log2 n time
• Yields all other outputs with at most x additional delay

Note that even the Brent-Kung network produces the leftmost output in optimal time. We are interested in building a type-0 overall network, but can use type-x networks (x > 0) as component parts.

(Figure: recursive construction of the fastest possible parallel prefix network (type-0) from a type-0 and a type-1 half-size network.)

Fall 2008 Parallel Processing, Extreme Models Slide 108 8.5 The Discrete Fourier Transform

The DFT yields the output sequence y_i based on the input sequence x_i (0 ≤ i < n):

y_i = Σ_{j=0 to n–1} ω_n^(ij) x_j      (naive algorithm: O(n^2) time)

where ω_n is the nth primitive root of unity: ω_n^n = 1 and ω_n^j ≠ 1 for 1 ≤ j < n.

Examples: ω_4 is the imaginary number i, and ω_3 = (−1 + i√3)/2.

The inverse DFT is almost exactly the same computation:
x_i = (1/n) Σ_{j=0 to n–1} ω_n^(−ij) y_j

Fast Fourier Transform (FFT): an O(n log n)-time DFT algorithm that derives y from the half-length sequences u and v that are the DFTs of the even- and odd-indexed inputs, respectively.

y_i = u_i + ω_n^i v_i                    (0 ≤ i < n/2)
y_(i+n/2) = u_i + ω_n^(i+n/2) v_i

T(n) = 2T(n/2) + n = n log2 n sequentially
T(n) = T(n/2) + 1 = log2 n in parallel
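A recursive Python sketch that follows this u/v decomposition directly (n must be a power of 2; the sign convention for ω_n is the one used above):

```python
import cmath

def fft(x):
    """y[i] = u[i] + w**i * v[i],  y[i + n/2] = u[i] + w**(i + n/2) * v[i]."""
    n = len(x)
    if n == 1:
        return list(x)
    u = fft(x[0::2])                      # DFT of the even-indexed inputs
    v = fft(x[1::2])                      # DFT of the odd-indexed inputs
    w = cmath.exp(2j * cmath.pi / n)      # primitive nth root of unity
    y = [0j] * n
    for i in range(n // 2):
        y[i] = u[i] + w**i * v[i]
        y[i + n // 2] = u[i] + w**(i + n // 2) * v[i]
    return y

print([round(abs(c), 3) for c in fft([1, 0, 0, 0, 1, 0, 0, 0])])
# [2.0, 0.0, 2.0, 0.0, 2.0, 0.0, 2.0, 0.0]
```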

Fall 2008 Parallel Processing, Extreme Models Slide 109 Application of DFT to Spectral Analysis

(Figure: tone frequency assignments for touch-tone dialing; row frequencies 697, 770, 852, 941 Hz and column frequencies 1209, 1336, 1477, 1633 Hz identify the keys 1-9, *, 0, #, A-D. The DFT of a received tone reveals its frequency spectrum and hence the key pressed.)

Fall 2008 Parallel Processing, Extreme Models Slide 110 Application of DFT to Smoothing or Filtering

Input signal with noise → DFT → low-pass filter → inverse DFT → recovered smooth signal.

Fall 2008 Parallel Processing, Extreme Models Slide 111 8.6 Parallel Architectures for FFT

u: DFT of the even-indexed inputs; v: DFT of the odd-indexed inputs.
y_i = u_i + ω_n^i v_i (0 ≤ i < n/2);  y_(i+n/2) = u_i + ω_n^(i+n/2) v_i

(Figure: butterfly network for an 8-point FFT, drawn both with bit-reversed input order and natural output order, and with natural input order and bit-reversed output order.)

Fig. 8.11 Butterfly network for an 8-point FFT.

Fall 2008 Parallel Processing, Extreme Models Slide 112 Variants of the Butterfly Architecture

(Figure: the FFT network variant with inputs in natural order and outputs in bit-reversed order, and its shared-hardware realization.)

Fig. 8.12 FFT network variant and its shared-hardware realization.

Fall 2008 Parallel Processing, Extreme Models Slide 113 Computation Scheme for 16-Point FFT

(Figure: computation scheme for a 16-point FFT; a bit-reversal permutation of the inputs is followed by log2 16 = 4 columns of butterfly operations. Each butterfly takes a and b and produces a + bω^j and a − bω^j.)

More Economical FFT Hardware

(Figure: projecting the butterfly network onto a single column yields a linear array of log2 n cells for n-point FFT computation; each cell computes a + bω_n^i and a − bω_n^i under control signals.)

Fig. 8.13 Linear array of log2 n cells for n-point FFT computation.

Fall 2008 Parallel Processing, Extreme Models Slide 115 Another Circuit Model: Artificial Neural Nets

Artificial neuron: inputs, weights, threshold, activation function, output; networks are characterized by their connection topology and learning method (e.g., supervised learning).

Feedforward network: three layers (input, hidden, output); no feedback.
Recurrent network: a simple version, due to Elman, has feedback from the hidden nodes to special nodes at the input layer.
Hopfield network: all connections are bidirectional.

Diagrams from http://www.learnartificialneuralnetworks.com/

Fall 2008 Parallel Processing, Extreme Models Slide 116