
FINDING BETTER SORTING NETWORKS

A dissertation submitted

to Kent State University in partial

fulfillment of the requirements for the

degree of Doctor of Philosophy

by

Sherenaz W. Al-Haj Baddar

May 2009

Dissertation written by

Sherenaz W. Al-Haj Baddar

B.S., University of Jordan, Jordan, 2001

M.S., University of Jordan, Jordan, 2003

Ph.D., Kent State University, 2009

Approved by

Kenneth Batcher, Chair, Doctoral Dissertation Committee

Johnie Baker, Hassan Peyravi, Mark Lewis, Jay Lee, Members, Doctoral Dissertation Committee

Accepted by

Robert Walker, Chair, Department of Computer Science

Timothy Moerland, Dean, College of Arts and Sciences


TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
DEDICATION
ACKNOWLEDGEMENTS

CHAPTER 1 INTRODUCTION
1.1 Problem Statement
1.2 Dissertation Organization

CHAPTER 2 BASIC CONCEPTS
2.1 Knuth Diagrams
2.2 The Zero/One Principle
2.3 Partial Ordering and Hasse Diagrams
2.4 Poset Operations
2.4.1 Poset Arithmetic
2.4.2 Lattices
2.4.3 Example: A 4-key Sorting network
2.5 Bracket Cases
2.6 Preserving Information
2.7 Renaming

CHAPTER 3 PREVIOUS WORK
3.1 Two Models for Designing Parallel Algorithms
3.2 Bitonic Sorting
3.3 Odd-Even Merge Sorting
3.4 The Van Voorhis Network for 16 Keys
3.5 The AKS Networks
3.6 Cole’s Parallel Merge
3.6.1 Definitions
3.6.2 The Main Idea
3.6.3 The Phases of the …
3.6.4 The Practicality of Cole’s Parallel …
3.7 The Best Sorting Networks for Input Sizes ≤ 32
3.7.1 The most-efficient N-key Sorting networks for N ≤ 32
3.7.2 The fastest N-key Sorting networks for N ≤ 32

CHAPTER 4 SORTNET
4.1 The Interface of Sortnet
4.2 Sortnet Commands for Manipulating CEs
4.3 Sortnet Commands for Manipulating Zero/One cases
4.4 How Sortnet Works
4.5 An Example-Treating a 32-Key Network

CHAPTER 5 A THREE-PHASE TECHNIQUE FOR DESIGNING FASTER SORTING NETWORKS
5.1 Phase I: Designing a Single-segment Poset
5.2 Phase II: Designing the Intermediate Steps
5.3 Phase III: Finalizing the Network
5.4 Remarks on the Three-phase Technique

CHAPTER 6 AN 11-STEP NETWORK FOR SORTING 18 KEYS
6.1 Phase I: Designing a Single-segment Poset in Steps 1 through 4
6.2 Phase II: Generating a Shmoo Chart with a Staircase Pattern in Steps 5 through 9
6.3 Phase III: Finalizing the Network in Steps 10 and 11
6.4 An Illustrative Example

CHAPTER 7 A 12-STEP NETWORK FOR SORTING 22 KEYS
7.1 Phase I: Designing a Single-segment Poset in Steps 1 through 4
7.2 Phase II: Generating a Shmoo Chart with a Staircase Pattern in Steps 5 through 9
7.3 Phase III: Finalizing the Network in Steps 10 through 12

CHAPTER 8 CONCLUSIONS AND FUTURE WORK
8.1 Conclusions
8.2 Future Work

APPENDIX A PROOFS OF THEOREMS
A.1 Proof of Theorem 2-1
A.2 Proof of Theorem 2-2
A.3 Proof of Theorem 2-3

REFERENCES

LIST OF FIGURES

Figure 1.1. A compare-exchange element
Figure 2.1. Three different ways for drawing a 4-key Sorting network
Figure 2.2. Hasse diagrams tracking the sorting of 4 keys
Figure 2.3. The Boolean Lattice B
Figure 2.4. The key-poset U+U+U+U and the case-poset B^(U+U+U+U)
Figure 2.5. The key-poset B+B and the case-poset B^(B+B)
Figure 2.6. The key-poset BB and the case-poset B^(BB)
Figure 2.7. The chain of 4 keys and the corresponding chain of 5 zero/one cases
Figure 2.8. Comparing c with d (Before and After)
Figure 2.9. Comparing A with C and B with D (Before and After)
Figure 2.10. Preserving as much information as possible by comparing corresponding keys
Figure 2.11. The 16-key poset before renaming
Figure 3.1. Bitonic sorting of 4 keys
Figure 3.2. The 8-key Odd-even merge Sorting network
Figure 3.3. Van Voorhis’s network for 16 keys
Figure 3.4. The renamed version of the last 5 steps of Van Voorhis’s 16-key network
Figure 3.5. The general structure of a typical layer in an AKS network
Figure 3.6. L covers sorted arrays J and K
Figure 3.7. Experimental results obtained from running two parallel algorithms on a CREW PRAM simulator
Figure 4.1. A 4-key Sorting network and the corresponding CE-list
Figure 4.2. The SHOW.CE display of the first 5 steps of the 18-key network described in Chapter 6
Figure 4.3. The poset table that corresponds to the 5-step CE-list in Fig. 4.2
Figure 4.4. The Shmoo chart that corresponds to the CE-list depicted in Fig. 4.2
Figure 4.5. The output of SHOW.GOODCE that corresponds to the CE-list depicted in Fig. 4.2
Figure 4.6. The output of SHOW.DIFF that corresponds to the CE-list depicted in Fig. 4.2
Figure 4.7. The output of SHOW.BESTCE that corresponds to the CE-list depicted in Fig. 4.2
Figure 4.8. A CE-list that forms a single-segment poset of 32 keys in 5 steps
Figure 6.1. The Knuth diagram of the 11-step network for sorting 18 keys
Figure 6.2. The CE-list of the 18-key network
Figure 6.3. The two-segment poset obtained after applying step 3
Figure 6.4. The Shmoo charts after steps 4 and 5
Figure 6.5. The Shmoo charts after steps 6 and 7
Figure 6.6. The Shmoo charts after steps 8 and 9
Figure 6.7. The Shmoo charts after steps 10 and 11
Figure 6.8. The CE-list of steps 5 through 9 of the experimental 18-key network
Figure 6.9. The Shmoo charts after steps 9, 10, and 11 of the experimental 18-key network
Figure 6.10. The CE-list of steps 10 and 11 of the experimental 18-key network
Figure 7.1. The Knuth diagram of the 12-step network for sorting 22 keys
Figure 7.2. The CE-list of the 22-key network
Figure 7.3. The 3-segment poset obtained after applying the first 3 steps of the network illustrated in Fig. 7.2
Figure 7.4. The Shmoo charts after steps 4 and 5
Figure 7.5. The Shmoo charts after steps 6 and 7
Figure 7.6. The Shmoo charts after steps 8 and 9
Figure 7.7. The Shmoo charts after steps 10, 11, and 12

LIST OF TABLES

Table 2.1. Cases A- and A+ after being sorted by S
Table 2.2. Case A after being sorted by S
Table 2.3. Bracketing case A
Table 3.1. The number of CEs in the most-efficient networks for 1 ≤ N ≤ 16
Table 3.2. The number of CEs in the most-efficient networks for 17 ≤ N ≤ 32
Table 3.3. The number of steps in the fastest networks for 1 ≤ N ≤ 16
Table 3.4. The number of steps in the fastest networks for 17 ≤ N ≤ 32
Table A.1. The Proof of Theorem 2-2
Table A.2. The Proof of min(A, C) < min(B, D)
Table A.3. The Proof of max(A, C) < max(B, D)

DEDICATION

To My Beloved Parents.


ACKNOWLEDGEMENTS

I would like to thank my advisor, Prof. Kenneth E. Batcher, for his help, support, and valuable remarks. In addition, I would like to thank my committee members for their efforts.

Also, I would like to thank the Department of Computer Science at Kent State University for all the support, cooperation, and guidance I received while pursuing my Ph.D. degree.

Finally, I would like to thank my family for their unlimited patience and unconditional fondness.

Sherenaz W. Al-Haj Baddar

May 2009, Kent, Ohio


CHAPTER 1

Introduction

Parallel processors are fast and powerful computing systems that have been developed to help tackle computationally challenging problems. Using parallelism to solve a given problem implies splitting the problem into subtasks and assigning them to the computing components that constitute a parallel system. These components usually communicate in order to accomplish their designated subtasks, which introduces the problem of connecting them efficiently. Several interconnection network schemes have been designed to help solve this problem, among which are multistage interconnection networks [Feng 1981; Quinn 2003]. These widely used networks use a significantly smaller number of switching elements to achieve relatively efficient inter-processor communication [Adams, Agrawal, and Siegel 1987]. Many multistage interconnection networks have been developed, including Butterfly networks, Omega networks, and Sorting networks [Akl 1997].

1.1 Problem Statement

Van Voorhis [Van Voorhis 1972] defines a Sorting network as a circuit with N inputs and N outputs such that for any set of inputs {I1, I2,…, IN}, the resulting output is the set {O1, O2,…, ON}. The output set must be a permutation of the input set {I1, I2,…, IN}. Moreover, for every two keys in the output set Oj and Ok, Oj must be less than or equal to Ok whenever j ≤ k.


Sorting networks are constructed using stages (steps) of basic cells called Compare-exchange Elements (CEs). A CE is a two-key sorting circuit. It accepts two inputs via two input lines, compares them, and outputs the larger key on its high output line and the smaller key on its low output line. It is assumed that two CEs with disjoint inputs can operate in parallel. A typical CE is illustrated in Fig. 1.1 [Batcher 1968].

Fig. 1.1 A compare-exchange element.
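The behavior of a CE, and of a step made of CEs with disjoint inputs, can be sketched in a few lines of Python. This is only an illustrative model and not part of the original design; the names compare_exchange and apply_step are assumptions introduced here.

```python
def compare_exchange(keys, low, high):
    """Apply one CE in place: the smaller key goes to line `low`,
    the larger key goes to line `high`."""
    if keys[low] > keys[high]:
        keys[low], keys[high] = keys[high], keys[low]
    return keys


def apply_step(keys, step):
    """Apply one step, i.e. a list of CEs (low, high) with disjoint lines.
    Because the lines are disjoint, the CEs could run in parallel;
    here they are simply applied one after another."""
    lines = [line for ce in step for line in ce]
    assert len(lines) == len(set(lines)), "CEs in one step must use disjoint lines"
    for low, high in step:
        compare_exchange(keys, low, high)
    return keys


print(apply_step([7, 3, 9, 5], [(0, 1), (2, 3)]))
```

Since the two CEs in the example touch different lines, the order in which they are applied does not affect the result.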

Sorting networks not only provide parallel algorithms for sorting an array of N keys, but also function as contention-resolving switching fabrics that can be connected to other multistage interconnection networks (MINs) [Arthurs, Hui 1989; Lyles 1994]. Sorting networks also have several applications, such as ATM switching, distributed processing, optical implementation of sorting, and the implementation of multi-access memories [Batcher 1968; Salloum, Perrie 1999; Gibson 2002].

A Sorting network is oblivious in the sense that whenever a CE compares K[i] with K[j], the subsequent CEs for the case K[i] < K[j] are the same as for the case K[i] > K[j], but with i and j interchanged [Knuth 1998]. Obliviousness is necessary for the simplicity of the hardware design of a Sorting network. If a Sorting network is not oblivious, then the switching of the CEs will need to be reconfigured according to the outputs of the previous CEs. This complicates the hardware design of the Sorting network.


An N-key Sorting network must use Ω(log N) steps [Knuth 1998]. Expander graphs can be used to design networks that sort N keys in C(log N) steps, where the constant C is in the hundreds or the thousands [Ajtai, Komlós, Szemerédi 1983; Akl 1997; Natvig 1990]. C is so high as to render this design technique impractical, as described in Section 3.5. Merge-sorting can be used to design practical networks that sort N keys in (log N)(1 + log N)/2 steps [Batcher 1968]. There exists a network that sorts 16 keys in only 9 steps, faster than the 10 steps required by merge-sorting [Knuth 1998]. The fastest previously known N-key networks use merge-sorting for N > 16 [Van Voorhis 1971]. Here we describe a technique for designing networks that are faster than merge-sorting. The technique is illustrated with an 18-key network design and a 22-key network design.
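As a quick check of the merge-sorting step count quoted above, the following sketch evaluates (log N)(1 + log N)/2. It assumes that the logarithm is taken base 2 and that N is a power of two; the function name merge_sort_steps is introduced here for illustration only.

```python
def merge_sort_steps(n):
    """Number of steps (log N)(1 + log N)/2 for N a power of two,
    with the logarithm taken base 2 (an assumption of this sketch)."""
    k = n.bit_length() - 1          # k = log2(n) when n is a power of two
    assert n == 2 ** k, "this sketch only handles powers of two"
    return k * (k + 1) // 2


for n in (4, 8, 16, 32):
    print(n, merge_sort_steps(n))
```

For N = 16 the formula gives 4 * 5 / 2 = 10 steps, which matches the 10 steps mentioned in the text.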

1.2 Dissertation Organization

Chapter 2 reviews some basic concepts necessary for discussing the networks illustrated here, while Chapter 3 highlights some previous work in parallel sorting and Sorting networks. Chapter 4 introduces Sortnet, a software tool designed by Batcher to help synthesize and analyze Sorting networks. Chapter 5 describes a three-phase technique for designing faster Sorting networks using Sortnet. Chapters 6 and 7 illustrate the three-phase technique by showing two faster network designs for 18 and 22 keys, respectively. Finally, Chapter 8 concludes this dissertation and highlights future work.

CHAPTER 2

Basic Concepts

This chapter describes some basic concepts, including Knuth diagrams, the zero/one principle, partial ordering, Hasse diagrams, operations on posets, bracket cases, preserving information, and renaming.

2.1 Knuth Diagrams

Knuth diagrams are pictorial representations of Sorting networks. In a typical Knuth diagram, each input key is represented by a horizontal line and each CE is represented by a vertical line connecting the two keys being compared. Keys are fed into the left-most end of the network and they are sorted when they leave at the right end, with the maximum key being the top-most key.

The input of a Sorting network can arrive in any random order. Thus, the Knuth diagram of that network can be redrawn in several ways that are all equivalent. These diagrams are called renamed Knuth diagrams. Figure 2.1 illustrates three different ways for drawing a 4-key Sorting network. Section 2.7 illustrates the concept of renaming.

Fig. 2.1 Three different ways, (a), (b), and (c), for drawing a 4-key Sorting network.


2.2 The Zero/One Principle

In order to show that a particular N-key Sorting network sorts all possible input permutations, we can test each of the N! possible inputs and check whether the network outputs the sorted permutation. However, the zero/one principle reduces this overhead [Knuth 1998]. It states that:

Theorem 2-1:

If an N-key Sorting network sorts all 2^N sequences of N zeroes and ones, then it will also sort any arbitrary sequence of N keys.

The proof of this theorem is given in Appendix A.
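The zero/one principle can be exercised on a small network with the following sketch. The CE-list used is the 4-key network of Section 2.4.3 (steps 0:1 and 2:3, then 0:2 and 1:3, then 1:2); the helper names run, sorts_all_zero_one, and sorts_all_permutations are assumptions made for this illustration.

```python
from itertools import permutations, product

# The 4-key network tracked in Section 2.4.3, one inner list per step.
NETWORK = [[(0, 1), (2, 3)], [(0, 2), (1, 3)], [(1, 2)]]


def run(network, keys):
    """Return the output of the network for the given input keys."""
    keys = list(keys)
    for step in network:
        for low, high in step:
            if keys[low] > keys[high]:
                keys[low], keys[high] = keys[high], keys[low]
    return keys


def sorts_all_zero_one(network, n):
    """Check the 2^N zero/one inputs only."""
    return all(run(network, case) == sorted(case)
               for case in product((0, 1), repeat=n))


def sorts_all_permutations(network, n):
    """Check all N! permutations, for comparison."""
    return all(run(network, perm) == sorted(perm)
               for perm in permutations(range(n)))


print(sorts_all_zero_one(NETWORK, 4), sorts_all_permutations(NETWORK, 4))
```

Both checks agree for this network, and dropping the last step makes the zero/one check fail (for instance on the case 0101), as the principle predicts.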

A zero/one case is a sequence of N binary keys. Let A = {a[0], a[1],…, a[N-1]} be a zero/one case. If a CE compares two keys a[x] and a[y] in A, where x

For all integers j between 0 and N inclusive, we say that a zero/one case, A, is in class Zj iff it contains exactly j zeroes and N-j ones [Al-Haj Baddar, Batcher 2009 b]. Such a case is said to be an odd (even) zero/one case if j is an odd (even) number. The zero/one case, A, in class Zj remains in the same class after being treated by any number of CEs. A CE-list sorts A iff it rearranges its keys such that all the j zeroes are in a[0] through a[j-1] and all the N-j ones are in a[j] through a[N-1]. Consequently, N+1 sorted zero/one cases exist, with one case belonging to each class. Any N-key Sorting network rearranges each binary input permutation into the corresponding sorted case.


Considering zero/one cases for designing and analyzing Sorting networks is advantageous since it reduces the number of input permutations that need to be examined from N! to 2^N. It also helps develop strategies for sorting, like the bracket cases described in Section 2.5. Moreover, using zero/one cases facilitates tracking the progress of a Sorting network visually, as illustrated by the Shmoo chart tool described in Chapter 4.

2.3 Partial Ordering and Hasse Diagrams

Let R be a relation on set S, and let a and b be two keys in S. Also, let aRb denote that the pair (a,b) is in R. A relation R on set S is said to be a partial ordering relation iff R is reflexive, antisymmetric, and transitive [Rosen 2003]. This implies that a pair of keys a and b for which neither aRb nor bRa holds may exist. A partial ordering relation R on set S is a total ordering relation iff for every two keys a and b in S either aRb or bRa holds [Birkhoff 1967]. These two kinds of ordering govern the relation between the keys sorted by a typical Sorting network at different sorting steps. A total ordering is imposed on the keys sorted by such a network at its end. Nevertheless, only a partial ordering relation exists among the keys at any step prior to the last step. A given set of keys that is governed by a partial ordering relation is called a poset.

A Hasse diagram is a tool for illustrating a poset visually, where keys are represented by vertices and relations between them are represented by edges [Rosen 2003]. In a Hasse diagram, key x is said to cover key y iff x ≥ y and there is no other key, z, such that x ≥ z ≥ y. So, we only draw a line from x down to y if x covers y [Birkhoff 1967].
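The cover relation determines exactly which edges are drawn. As a minimal sketch, the function below derives the covering pairs of a finite poset from its ordering relation. The 4-key poset used as input (K[0] below K[1] and K[2], which are both below K[3]) is a hypothetical example chosen for illustration, and the names covers and ORDER are assumptions.

```python
def covers(elements, leq):
    """Return the pairs (x, y) such that x covers y, i.e. y <= x, x != y,
    and no third element z satisfies y <= z <= x."""
    pairs = []
    for x in elements:
        for y in elements:
            if x == y or not leq(y, x):
                continue
            has_between = any(z != x and z != y and leq(y, z) and leq(z, x)
                              for z in elements)
            if not has_between:
                pairs.append((x, y))
    return pairs


# Hypothetical poset on keys 0..3, given as reflexive, transitive pairs (lo, hi).
ORDER = {(k, k) for k in range(4)} | {(0, 1), (0, 2), (1, 3), (2, 3), (0, 3)}

print(covers(range(4), lambda lo, hi: (lo, hi) in ORDER))
```

The pair (3, 0) is in the ordering relation but is not a covering pair, so no line would be drawn directly between those two keys.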


As an example, Fig. 2.2 uses Hasse diagrams to depict the progress of the 4-key Sorting network illustrated in Fig. 2.1(a). Figure 2.2(a) shows that the poset obtained after step 1 has two separate segments. Such a poset is called a multi-segment poset. In the second step, the two segments are combined into one segment forming a single-segment poset as illustrated in Fig. 2.2(b). Finally, Fig. 2.2(c) depicts the totally ordered chain of keys obtained after step 3.

Fig. 2.2 Hasse diagrams tracking the sorting of 4 keys: (a) Step 1, (b) Step 2, (c) Step 3.

2.4 Poset Operations

This section discusses some basic poset arithmetic operations. It also presents some definitions and introduces case-posets. Finally, it tracks the progress of a 4-key Sorting network using both key-posets and case-posets.

2.4.1 Poset Arithmetic

Let X and Y be two posets, then f is a function from X to Y if for every x ∈ X, f(x) ∈ Y. If X contains |X| elements and Y contains |Y| elements then there are |Y|^|X| different functions from X to Y. A function from a poset X to a poset Y is said to be isotone or order-preserving if for every x1 and x2 ∈ X, if x1 ≤ x2 in X then f(x1) ≤ f(x2) in Y.

There are three cardinal arithmetic operations that can be used to combine posets together to form other posets: cardinal sum, cardinal product, and cardinal power [Birkhoff 1967].


Cardinal Sum

Let X and Y be disjoint posets. The cardinal sum of X and Y, denoted X + Y, is the set of all elements in X or Y such that

If x1 and x2 are elements of X where x1 ≤ x2, then x1 ≤ x2 in X + Y.

If y1 and y2 are elements of Y where y1 ≤ y2, then y1 ≤ y2 in X + Y.

If x is an element of X and y is an element of Y then neither x ≤ y nor y ≤ x is in X + Y.

The diagram of X + Y is simply the diagram of X and the diagram of Y placed side-by-side with no lines between X and Y. If X contains |X| elements and Y contains |Y| elements then there are |X| + |Y| elements in X + Y.

Cardinal Product

Let X and Y be disjoint posets. The cardinal product of X and Y, denoted XY, is the set of all couples (x, y) such that x ∈ X and y ∈ Y, where (x1, y1) ≤ (x2, y2) if and only if x1 ≤ x2 in X and y1 ≤ y2 in Y. If X contains |X| elements and Y contains |Y| elements then there are |X| * |Y| elements in XY.

Cardinal Power

Let Y and X be disjoint posets. The cardinal power with base Y and exponent X, denoted by Y^X, is the set of all isotone functions from X to Y. The functions in Y^X are partially ordered by letting f ≤ g, where f and g ∈ Y^X, iff f(x) ≤ g(x) for all x ∈ X.

The following identities are true for any posets, X, Y, and Z [Birkhoff 1967].

1. X + Y = Y + X
2. X + (Y + Z) = (X + Y) + Z
3. XY = YX
4. X(YZ) = (XY)Z
5. X(Y + Z) = XY + XZ
6. (X + Y)Z = XZ + YZ
7. X^(Y + Z) = X^Y X^Z
8. (XY)^Z = X^Z Y^Z
9. (X^Y)^Z = X^(YZ)

2.4.2 Lattices

Joins and Meets

Let P be a poset and let x and y be any two elements: x, y ∈ P. We say that z is an upper bound of x and y if z ∈ P, z ≥ x, and z ≥ y. The elements x and y have a least upper bound or join, denoted x ∨ y, if x ∨ y is an upper bound of x and y, and for all z that are upper bounds of x and y, we have that x ∨ y ≤ z. Similarly, we say that w is a lower bound of x and y if w ∈ P, w ≤ x, and w ≤ y. The elements x and y have a greatest lower bound or meet, denoted x ∧ y, if x ∧ y is a lower bound of x and y, and for all w that are lower bounds of x and y, we have that x ∧ y ≥ w. Consequently, if x ≤ y then x ∨ y = y and x ∧ y = x [Birkhoff 1967].

Lattices

A poset, X, is a lattice if every pair of elements, x and y ∈ X, has a join, x ∨ y, and a meet, x ∧ y. A poset, X, is said to be a chain if it is totally-ordered; i.e., for every pair of elements, x and y ∈ X, either x ≤ y or x ≥ y [Birkhoff 1967]. Thus, every chain is a lattice.

Some properties of lattices are listed here:

1. Every finite lattice has exactly one element, O, where O ≤ x for every x in the lattice.

2. Every finite lattice has exactly one element, I, where I ≥ x for every x in the lattice.

3. If L and M are lattices then their cardinal product, LM, is also a lattice.

4. If L is a lattice and P is a poset then the cardinal power, L^P, is also a lattice.

Figure 2.3 shows the poset of two elements 0 and 1. This poset will be called B for Boolean. B is a chain, so it is also a lattice. The poset of the zero/one cases (i.e. the case-poset) for a poset of keys, X, is just the cardinal power B^X, and each zero/one case for the poset X is an isotone function from X to B. By Property 4 of lattices, it is established that B^X is a lattice. Let U (for unary) denote the poset with exactly one element, u. There are only two functions from U to B: f(u) = 0 and g(u) = 1. Both functions are isotone with f < g, so B^U = B. It can be noticed that if X is a chain of N elements, then the case-poset of X is a chain of (N+1) elements.

Fig. 2.3 The Boolean Lattice B.
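The zero/one cases of a key-poset can be enumerated directly as the isotone functions from the keys to {0, 1}. The sketch below does this by brute force; the representation of a key-poset as a list of pairs (lo, hi), meaning K[lo] ≤ K[hi], and the names zero_one_cases and chain are assumptions made for this illustration.

```python
from itertools import product


def zero_one_cases(n, relations):
    """Enumerate the isotone functions from an n-key poset to {0, 1}.
    `relations` lists pairs (lo, hi) meaning K[lo] <= K[hi]; a case is kept
    only if it never assigns 1 to the lower key and 0 to the higher key."""
    return [case for case in product((0, 1), repeat=n)
            if all(case[lo] <= case[hi] for lo, hi in relations)]


def chain(n):
    """Relations of a chain of n keys: K[0] <= K[1] <= ... <= K[n-1]."""
    return [(i, i + 1) for i in range(n - 1)]


print(len(zero_one_cases(1, [])))              # the one-element poset U
for n in range(1, 6):
    print(n, len(zero_one_cases(n, chain(n))))  # a chain of n keys
```

The one-element poset yields two cases, and a chain of N keys yields N+1 cases, in agreement with the statements above.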

2.4.3 Example: A 4-key Sorting network

In the following example the sorting of 4 keys, as depicted in the network in Fig. 2.1(a), will be tracked again, using both key-posets and case-posets.

Initially, the four keys are unsorted so the key-poset is U + U + U + U as shown in Fig. 2.4(a). By Identity 7 of cardinal arithmetic, B^(U + U + U + U) = B^U B^U B^U B^U = BBBB as shown in Fig. 2.4(b), which represents all the zero/one cases that exist before the sorting process begins.

Fig. 2.4 (a) The key-poset U + U + U + U, with keys K[0], K[1], K[2], and K[3]. (b) The case-poset B^(U + U + U + U) = BBBB.

After performing the comparisons illustrated in step 1 of Fig. 2.1(a), i.e. 0:1 and 2:3, the key-poset becomes B + B as shown in Fig. 2.5(a). To generate the corresponding zero/one cases, Identity 7 of cardinal arithmetic is applied, resulting in B^(B + B) = B^B B^B. B^B is a chain of three zero/one cases, 00, 01, 11, so B^B B^B is a 3 x 3 square as shown in Fig. 2.5(b).

Fig. 2.5 (a) The key-poset B + B, with K[1] above K[0] and K[3] above K[2]. (b) The case-poset B^(B + B) = B^B B^B.

After performing the comparisons in step 2 of Fig. 2.1(a), i.e. 0:2 and 1:3, the key-poset becomes BB as shown in Fig. 2.6(a). The corresponding case-poset is depicted in Fig. 2.6(b).

Fig. 2.6 (a) The key-poset BB, with K[3] at the top, K[1] and K[2] in the middle, and K[0] at the bottom. (b) The case-poset B^(BB).

After performing the last comparison, i.e., 1:2 in step 3, the 4 keys become totally ordered in a chain as shown in Fig. 2.7(a). The corresponding case-poset is also a chain of 5 zero/one cases as shown in Fig. 2.7(b).

Fig. 2.7 (a) The chain of 4 keys, from K[3] at the top down to K[0] at the bottom. (b) The corresponding chain of 5 zero/one cases.
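The shrinking of the case-poset in this example can be reproduced by pushing all 16 zero/one cases through the network and counting the distinct cases that survive each step. This is a brute-force sketch rather than the poset arithmetic used above; the names STEPS, apply_step, and track are introduced here for illustration.

```python
from itertools import product

# Steps of the 4-key network of Fig. 2.1(a), as given in the text.
STEPS = [[(0, 1), (2, 3)], [(0, 2), (1, 3)], [(1, 2)]]


def apply_step(case, step):
    """Apply one step of CEs to a zero/one case and return the new case."""
    case = list(case)
    for low, high in step:
        if case[low] > case[high]:
            case[low], case[high] = case[high], case[low]
    return tuple(case)


def track(steps, n):
    """Return the number of distinct zero/one cases before step 1
    and after each step."""
    cases = set(product((0, 1), repeat=n))
    sizes = [len(cases)]
    for step in steps:
        cases = {apply_step(case, step) for case in cases}
        sizes.append(len(cases))
    return sizes


print(track(STEPS, 4))
```

The counts 16, 9, 6, and 5 correspond to the case-posets of Figs. 2.4(b), 2.5(b), 2.6(b), and 2.7(b), respectively.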

2.5 Bracket Cases

An important partial ordering relation, ≤, can be defined between pairs of zero/one cases as follows:

If A = {a[0],a[1],…, a[N-1]} and B = {b[0], b[1],…, b[N-1]} are two zero/one cases, then we say that A ≤ B iff there is no j, where 0 ≤ j ≤ N-1, for which a[j] = 1 and b[j] = 0.

A theorem that shows that the relation ≤ is preserved by any CE can be established:


Theorem 2-2:

If A = {a[0], a[1],…, a[N-1]} and B = {b[0], b[1],…, b[N-1]} are two zero/one cases, where A ≤ B, and x:y is a CE that compares the keys a[x] with a[y] and b[x] with b[y], then A ≤ B still holds after x:y is applied.

The proof of Theorem 2-2 is provided in Appendix A.
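Theorem 2-2 can also be checked exhaustively for small N. The sketch below tests every pair of zero/one cases in which no position holds a one in the first case and a zero in the second, together with every possible CE, for N = 4. It is a sanity check under these assumptions and not a substitute for the proof; the names precedes, ce, and relation_preserved are illustrative.

```python
from itertools import combinations, product


def precedes(a, b):
    """True iff there is no j with a[j] = 1 and b[j] = 0."""
    return all(x <= y for x, y in zip(a, b))


def ce(case, x, y):
    """Apply the CE x:y (x < y): smaller key to position x, larger to y."""
    case = list(case)
    if case[x] > case[y]:
        case[x], case[y] = case[y], case[x]
    return tuple(case)


def relation_preserved(n):
    """Check that every CE preserves the relation on all related pairs."""
    cases = list(product((0, 1), repeat=n))
    for a, b in product(cases, repeat=2):
        if not precedes(a, b):
            continue
        for x, y in combinations(range(n), 2):
            if not precedes(ce(a, x, y), ce(b, x, y)):
                return False
    return True


print(relation_preserved(4))
```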

Theorem 2-2 provides a tool for simplifying the sorting task. Let A = {a[0], a[1], …, a[N-1]} be any zero/one case in class Zj, where 0 < j < N. Also, let A- be generated by replacing a one-valued bit in A with a zero, and let A+ be generated by replacing a zero-valued bit in A with a one. This implies that A- ≤ A ≤ A+. Let S be a sequence of comparators that sorts A- and A+ in non-decreasing order. The result is depicted in Table 2.1.

Table 2.1 Cases A- and A+ after being sorted by S.

      a[0]  a[1]  …  a[j-2]  a[j-1]  a[j]  a[j+1]  …  a[N-2]  a[N-1]
A+    0     0     …  0       1       1     1       …  1       1
A-    0     0     …  0       0       0     1       …  1       1

Any CE-list preserves the relation A- ≤ A ≤ A+ according to Theorem 2-2. Thus, when S receives A as its input, the output will be in only one of the two states depicted in Table 2.2. Hence, adding a CE that compares the keys a[j-1] and a[j] finishes the sorting of A.

Table 2.2 Case A after being processed by S.

               a[0]  a[1]  …  a[j-2]  a[j-1]  a[j]  a[j+1]  …  a[N-2]  a[N-1]
A (state 1)    0     0     …  0       0       1     1       …  1       1
A (state 2)    0     0     …  0       1       0     1       …  1       1

Let 0 < i < k < N and assume that a CE-list, S, sorts all zero/one cases in classes Zi and Zk. Let A = {a[0], a[1], …, a[N-1]} be any zero/one case in class Zj, where i < j < k. After S processes all the cases in Zi and Zk as well as case A, we have:

(the sorted case with k zeroes) ≤ (A) ≤ (the sorted case with i zeroes)

Table 2.3 describes these zero/one cases after being processed by S. Each key of case A with a ?-mark may have a 0-value or a 1-value. To finish the sorting of all the cases in class Zj, where i < j < k, it is only necessary to sort the keys in {a[i], a[i+1],…, a[k-1]}. We say that the zero/one cases in classes Zi and Zk are bracket cases because sorting all these cases brackets the unsorted keys of all intermediate zero/one cases to {a[i], a[i+1],…, a[k-1]}.

Table 2.3. Bracketing case A

                            a[0]  …  a[i-2]  a[i-1]  a[i]  a[i+1]  …  a[k-2]  a[k-1]  a[k]  a[k+1]  …  a[N-1]
Sorted case with i zeroes   0     …  0       0       1     1       …  1       1       1     1       …  1
Case A                      0     …  0       0       ?     ?       …  ?       ?       1     1       …  1
Sorted case with k zeroes   0     …  0       0       0     0       …  0       0       1     1       …  1

As an example of using the concept of bracket cases, let S be a CE-list that sorts all odd cases of N keys. Adding a single step containing the comparators 1:2, 3:4, 5:6, …etc. after S will sort all even cases, which completes the sorting of the N keys. Here, the zero/one cases in classes Z1, Z3, Z5, …etc. bracket all the even cases.

In a similar fashion, let S be a CE-list that sorts all even cases of N keys; then adding a single step of comparators 0:1, 2:3, 4:5, …etc. after S will sort all odd cases to also complete the sorting of the keys. Here, the zero/one cases in classes Z0, Z2, Z4, …etc. bracket all the odd zero/one cases.

Hence, to design an N-key Sorting network, we do not have to consider all 2^N zero/one cases, just the 2^(N-1) odd or 2^(N-1) even cases. When these cases are sorted, the addition of a single step of CEs will sort all the other cases.
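A small instance of this argument can be checked on the 4-key network of Section 2.4.3. In the sketch below, S is taken to be the first two steps of that network (0:1, 2:3, 0:2, 1:3); this choice of S, and the names run and is_odd, are assumptions made for illustration. For N = 4 the additional step 1:2, 3:4, … reduces to the single comparator 1:2.

```python
from itertools import product

S = [(0, 1), (2, 3), (0, 2), (1, 3)]   # first two steps of the 4-key network
FINAL_STEP = [(1, 2)]                  # the step 1:2, 3:4, ... restricted to 4 keys


def run(ce_list, case):
    """Apply a CE-list to a zero/one case and return the result as a list."""
    case = list(case)
    for low, high in ce_list:
        if case[low] > case[high]:
            case[low], case[high] = case[high], case[low]
    return case


def is_odd(case):
    """A case is odd iff it contains an odd number of zeroes."""
    return case.count(0) % 2 == 1


cases = list(product((0, 1), repeat=4))
odd_sorted = all(run(S, c) == sorted(c) for c in cases if is_odd(c))
even_sorted = all(run(S, c) == sorted(c) for c in cases if not is_odd(c))
all_sorted = all(run(S + FINAL_STEP, c) == sorted(c) for c in cases)
print(odd_sorted, even_sorted, all_sorted)
```

Here S already sorts every odd case but leaves the even case 0101 unsorted; appending the single comparator 1:2 completes the sorting of all 16 cases.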

2.6 Preserving Information

One strategy that can be used to help design better Sorting networks is to preserve the partial ordering relations by comparing the corresponding keys in the poset. The earlier steps of a Sorting network establish some information that can be lost by the CEs of the later steps as illustrated in Fig. 2.8.

Fig. 2.8 Comparing c with d (Before and After).

The right side of Fig. 2.8 shows the new partial orderings if c and d are now compared to find min(c, d) and max(c, d):

Instead of e < c we only have e < max(c, d);

Instead of f < d we only have f < max(c, d);

Instead of c < a we only have min(c, d) < a; and

Instead of d < b we only have min(c, d) < b.


Hence, it is important to pick the pairs of keys to be compared carefully if the information conveyed by the poset is to be preserved. Theorem 2-3 shows a way to preserve information.

Theorem 2-3:

Let A, B, C, and D be any keys, where:

A < B, and

C < D, and

A and C are compared to find min(A, C) and max(A, C), and

B and D are compared to find min(B, D) and max(B, D); then:

min(A, C) < min(B, D); and

min(A, C) < max(A, C); and

min(B, D) < max(B, D); and

max(A, C) < max(B, D).

Figure 2.9 illustrates Theorem 2-3. The proof of this theorem is described in Appendix A.

Fig. 2.9 Comparing A with C and B with D (Before and After).
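The four conclusions of Theorem 2-3 can be verified by enumeration over small key values. The sketch below assumes that the four keys are distinct, and it checks every assignment of the values 0 through 3 that satisfies A < B and C < D; the name theorem_2_3_holds is introduced here for illustration.

```python
from itertools import permutations


def theorem_2_3_holds(a, b, c, d):
    """Given A < B and C < D, compare A with C and B with D,
    then check the four orderings claimed by Theorem 2-3."""
    low_ac, high_ac = min(a, c), max(a, c)
    low_bd, high_bd = min(b, d), max(b, d)
    return (low_ac < low_bd and low_ac < high_ac
            and low_bd < high_bd and high_ac < high_bd)


# All assignments of four distinct keys that satisfy the hypotheses.
checked = [theorem_2_3_holds(a, b, c, d)
           for a, b, c, d in permutations(range(4))
           if a < b and c < d]
print(len(checked), all(checked))
```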

The strategy illustrated in Fig. 2.9 can be generalized. Let X and Y be two isomorphic posets. If we compare the corresponding keys in these two posets, then we will preserve the partial ordering relations within each poset and add new ones that correspond to the new CEs. To illustrate this idea, consider Fig. 2.10 in which the corresponding keys of two 4-key posets are compared as denoted by the dotted edges. These comparisons not only establish new information, but also preserve the information established before, denoted by the solid edges.

Fig. 2.10 Preserving as much information as possible by comparing corresponding keys.

2.7 Renaming

The renaming process aims at using a given Sorting network to generate an equivalent network that exhibits a clearer behavior. This process consists of the following steps:

1. Generate a permutation P of the indices of the keys. Feed the permutation P to the original Sorting network. For each CE x:y in the original network, trace and record the CE’s outputs, denoted by L and H, as they get sorted.

2. After sorting P, rename each CE, x:y, in the original network using its outputs L and H as recorded in step 1.

3. Draw a new Knuth diagram using the renamed CEs. The new network, which looks different, is equivalent to the original one.
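The three steps above can be sketched as follows, under the assumption that a CE x:y is written with x as its low output line and y as its high output line. The function names rename and sorts_all_zero_one, and the permutation used in the example, are illustrative choices rather than part of the original procedure.

```python
from itertools import product


def rename(network, perm):
    """Feed the permutation `perm` to the network and rename each CE x:y
    to L:H, where L and H are the low and high outputs of that CE."""
    lines = list(perm)
    renamed = []
    for x, y in network:
        low, high = min(lines[x], lines[y]), max(lines[x], lines[y])
        lines[x], lines[y] = low, high      # the CE sorts its two keys
        renamed.append((low, high))
    assert lines == sorted(perm), "the original network must sort the permutation"
    return renamed


def sorts_all_zero_one(network, n):
    """Zero/one check used to confirm that a network still sorts."""
    for case in product((0, 1), repeat=n):
        keys = list(case)
        for low, high in network:
            if keys[low] > keys[high]:
                keys[low], keys[high] = keys[high], keys[low]
        if keys != sorted(case):
            return False
    return True


NETWORK = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]   # the 4-key network
renamed = rename(NETWORK, [2, 0, 3, 1])
print(renamed, sorts_all_zero_one(renamed, 4))
```

Feeding the identity permutation leaves every CE unchanged, since no keys are exchanged in that case.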

The renaming process helps better understand the behavior of some Sorting networks. Consider for example Van Voorhis’s 16-key network illustrated in Chapter 3. It is by no means obvious that this network sorts all possible sequences of 16 keys [Knuth 1998]. However, renaming helped better understand the strategy followed by that network. Renaming can also help better select the CE to apply next. Consider applying 4 steps to generate a single-segment poset of 16 keys. This poset places key K[7] in an upper rank (fourth rank) and K[8] in a lower rank (second rank), as depicted in Fig. 2.11. By examining the zero/one cases, we find that many of them have K[7] = 1 and K[8] = 0, whereas only one case has K[7] = 0 and K[8] = 1. Thus, we might think that it is a good idea to compare these two keys. However, if renaming is applied to the network, using the permutation P = {0, 1, 2, 3, 4, 5, 6, 8, 7, 9, 10, 11, 12, 13, 14, 15}, then K[7] and K[8] will swap their ranks, i.e. K[7] will appear in the second rank and K[8] will appear in the fourth rank. The re-examination of the cases will show that the number of cases where K[7] = 1 and K[8] = 0 is only one. Hence, such a comparison is actually less significant.

Fig. 2.11 The 16-key poset before renaming.

CHAPTER 3

Previous Work

This chapter briefly discusses two models for designing parallel sorting algorithms in Section 3.1. It also describes some parallel sorting algorithms, including Bitonic sorting, Odd-even merge sorting, Van Voorhis’s network for sorting 16 keys, the AKS network, and Cole’s parallel merge sort, which are described in Sections 3.2 through 3.6. In addition, the chapter investigates the practicality of the AKS network and Cole’s parallel merge sort by comparing their performance with Batcher’s Bitonic sorting. Finally, Section 3.7 highlights the best Sorting networks for input sizes ≤ 32.

3.1 Two Models for Designing Parallel Algorithms

Two models are usually used for designing parallel sorting algorithms: the circuit model and the PRAM model [Cole 1988]. A circuit, in the circuit model, is a device that receives a number of inputs at one end and produces a number of outputs at the other end. Such a circuit consists of a number of simple interconnected processing elements (PEs) arranged in columns (or stages). Each of these PEs has a constant number of input wires (a constant fan-in) as well as a constant number of output wires (a constant fan-out), and can perform simple arithmetic/logical operations [Akl 1997]. The depth of a circuit is the maximum number of PEs on a path from input to output. Hence, the depth of a circuit represents the maximum amount of time it takes an input to reach the output. The width of a circuit, on the other hand, is the maximum number of PEs that can be accommodated in one stage. The product of the circuit’s depth and width is an upper bound on the circuit size [Akl 1997]. The circuit model is restrictive, due to the fact that the interconnections between the PEs comprising the circuit are fixed a priori. In the context of Sorting networks, this translates to oblivious sorting. Several parallel sorting algorithms that follow this model have been designed, including the Odd-even merge sorting, the Bitonic sorting, and the AKS networks.
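For a Sorting network, the PEs are the CEs, so depth, width, and size can be read off a CE-list once it is arranged into stages. The sketch below uses a simple greedy layering, placing each CE in the earliest stage in which both of its lines are free; this layering rule and the name layer are assumptions of the illustration, not a method taken from the cited sources.

```python
def layer(ce_list, n):
    """Arrange a CE-list on n lines into stages, placing each CE in the
    earliest stage that follows all earlier CEs touching its lines."""
    ready = [0] * n                 # first stage in which each line is free
    stages = []
    for x, y in ce_list:
        s = max(ready[x], ready[y])
        if s == len(stages):
            stages.append([])
        stages[s].append((x, y))
        ready[x] = ready[y] = s + 1
    return stages


NETWORK = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]   # the 4-key network
stages = layer(NETWORK, 4)
depth = len(stages)
width = max(len(stage) for stage in stages)
size = len(NETWORK)
print(depth, width, size, size <= depth * width)
```

For the 4-key network this yields a depth of 3 and a width of 2, so the bound depth x width = 6 indeed exceeds the actual size of 5 CEs.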

The PRAM (Parallel Random Access Machine) model is a natural generalization of the sequential RAM model to parallel computation. The PRAM model comprises several synchronous RAM machines that run in parallel and communicate via a shared memory [Greenlaw, Hoover, Ruzzo 1995]. Each step of a PRAM algorithm contains up to three phases [Akl 1997]:

1. A Read phase, in which up to N processors can read from up to N memory locations simultaneously.

2. A Compute phase, in which up to N processors perform a basic arithmetic or logical operation on their local data.

3. A Write phase, in which up to N processors write simultaneously into up to N memory locations.

The PRAM model is less restrictive, and it does not require sorting to be oblivious. Three variants of this model are commonly used: the CRCW PRAM, the CREW PRAM, and the EREW PRAM. The first of these models allows concurrent access to a given memory location for reading and writing. The second model allows concurrent access for reading only, while the third model permits no concurrent access to a memory location [Cole 1988]. Many parallel sorting algorithms follow this model, the most fundamental of which is Cole’s cost-optimal parallel merge sort. Other PRAM sorting algorithms include Preparata’s O(log N) time algorithm that runs on N log N processors [Preparata 1978; Reif 1993] and Jaja’s O(log N · log log N) time algorithm that runs on θ(N) processors [Jaja 1992].

There exists a gap between theoretical and practical evaluations of parallel algorithms. Asymptotic analysis makes it possible to theoretically analyze the behavior of an algorithm at a very high level, but hides the complexity constants [Brassard, Bratley 1996]. If these constants are relatively large, then they will render parallel algorithms with superior asymptotic behavior impractical for problems of realistic size. As a consequence, complexity constants should not be overlooked when parallel algorithms are designed.

3.2 Bitonic Sorting

A Bitonic sequence of keys consists of a monotonically non-decreasing sequence concatenated with a monotonically non-increasing sequence. Moreover, any circular shift of a Bitonic sequence results in a Bitonic sequence. To illustrate the Bitonic sorting technique, which was developed by Batcher [Batcher 1968], let the input be the Bitonic sequence A = {a_1, a_2, …, a_N, a_{N+1}, …, a_{2N}} that contains an even number of keys, 2N. Applying a step of N CEs of the form i:i+N, where a_i is in {a_1, a_2, …, a_N}, a_{i+N} is in {a_{N+1}, …, a_{2N}}, and 1 ≤ i ≤ N, generates two new subsequences denoted by H and L. H consists of the maximums produced by each comparator, whereas L consists of the minimums produced by the same set of comparators. It can be noticed that H and L are both Bitonic. Additionally, each key in H is larger than or equal to each key in L. This process is applied recursively on H and L until the size of each Bitonic sequence becomes one. Figure 3.1 depicts the Bitonic sorting of 4 keys [Batcher 1968]. This algorithm requires θ(N log² N) comparisons. An N-key Bitonic sorting network requires ½ log N (log N + 1) steps if N is a power of two, with each step using N/2 CEs [Batcher 1968]. Bitonic sorting merges a monotonically increasing sequence with another that is monotonically decreasing. Thus, before sorting can proceed, one sequence must be reversed. Hence, Batcher introduced another sorting network, namely, the Odd-even merge sorting.

Figure 3.1. Bitonic sorting of 4 keys.
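The construction described above can be sketched in a few lines of Python. The sketch below is illustrative only and is not taken from [Batcher 1968]: the function names are hypothetical, a CE lo:hi is assumed to leave the smaller key on wire lo, and the first step of every merge compares each key with its mirror image, which is a common way of avoiding the explicit reversal of one sequence. The zero/one principle of Section 2.2 is used to check the result.

```python
from itertools import product


def bitonic_network(n):
    """Return the steps of a Bitonic sorting network for n = 2**k keys.

    Each step is a list of CEs (lo, hi). A CE is assumed to leave the
    smaller key on wire lo and the larger key on wire hi.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    steps = []
    size = 2
    while size <= n:
        # Merge two sorted blocks of size/2 keys. Comparing key i with its
        # mirror image plays the role of reversing one of the two blocks.
        steps.append([(base + i, base + size - 1 - i)
                      for base in range(0, n, size)
                      for i in range(size // 2)])
        # The remaining steps of the merge use CEs of the form i : i + half.
        half = size // 4
        while half >= 1:
            steps.append([(base + i, base + i + half)
                          for base in range(0, n, 2 * half)
                          for i in range(half)])
            half //= 2
        size *= 2
    return steps


def apply_network(steps, keys):
    """Run the keys through every CE of every step and return the result."""
    keys = list(keys)
    for step in steps:
        for lo, hi in step:
            if keys[lo] > keys[hi]:
                keys[lo], keys[hi] = keys[hi], keys[lo]
    return keys


def sorts_all_zero_one_inputs(steps, n):
    """Zero/one principle: check all 2**n inputs made of zeroes and ones."""
    return all(apply_network(steps, bits) == sorted(bits)
               for bits in product((0, 1), repeat=n))


if __name__ == "__main__":
    for k in (1, 2, 3):
        n = 2 ** k
        net = bitonic_network(n)
        print(n, len(net), sorts_all_zero_one_inputs(net, n))
```

For N = 2, 4, and 8 the number of generated steps agrees with ½ log N (log N + 1), and every step contains N/2 CEs.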

3.3 Odd-Even Merge Sorting

The main idea behind a merging network is to notice that whenever two sorted sequences, A and B, are merged to form a sequence C, the first key in C, i.e. c_1, is either a_1 or b_1. For the rest of the keys in C, a pair of keys of the form (c_{2i}, c_{2i+1}) contains an even-indexed key from either A or B together with an odd-indexed key from either A or B. This pattern suggests the following parallel merge sorting algorithm [Batcher 1968]:

1. Divide the original data into two parts that are almost equal, then sort each part, resulting in the sorted sequences X and Y such that |X| = m and |Y| = n (where N, the original data size, equals m+n).

2. Using two (m/2,n/2)-key merging networks, merge the sequence of odd-indexed keys in X with the corresponding keys in Y, resulting in a sorted sequence D. Also, and in parallel, merge the sequence of even-indexed keys in X with the corresponding keys in Y, resulting in a sorted sequence E.

3. Let O = {o_1, o_2, …, o_{m+n}} denote the sorted output sequence. The first key in O, that is o_1, will be assigned the value of d_1. Let e_i denote the i-th output of E, whereas d_{i+1} denotes the (i+1)-th output of D. The key o_{2i} will be assigned the value of min(e_i, d_{i+1}), whereas o_{2i+1} will be assigned the value of max(e_i, d_{i+1}).

Figure 3.2 depicts an 8-key Odd-even merge sorting network [Van Voorhis 1972].

An (N,N)-key merging network merges two N-key sorted sequences using two (N/2,N/2)-key merging networks followed by a column of N-1 CEs. Hence, it takes an (N,N)-key merging network θ(N log N) CEs to accomplish its task. Thus, the Odd-even merge sorting requires θ(N log² N) CEs. If N is a power of two, then the algorithm requires ½ log N (log N + 1) steps with no more than N/2 CEs in each step [Batcher 1968].
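The recursive structure of the Odd-even merge sorting can be illustrated with the following Python sketch, which is restricted to input sizes that are powers of two. It follows the widely used recursive formulation of Batcher's construction; the function names and the zero-based wire numbering are assumptions made here for illustration. The helper `depth` places every CE in the earliest step allowed by the wires it touches, which is one possible way of counting steps.

```python
from itertools import product


def oddeven_merge(lo, hi, r):
    """Yield the CEs that merge the two sorted halves of wires lo..hi.

    Only the wires lo, lo + r, lo + 2r, ... are involved (hi is inclusive).
    """
    step = 2 * r
    if step < hi - lo:
        yield from oddeven_merge(lo, hi, step)       # wires lo, lo+2r, ...
        yield from oddeven_merge(lo + r, hi, step)   # wires lo+r, lo+3r, ...
        # Final column of CEs between neighbouring outputs of the two merges.
        yield from ((i, i + r) for i in range(lo + r, hi - r, step))
    else:
        yield (lo, lo + r)


def oddeven_merge_sort(lo, hi):
    """Yield the CEs that sort wires lo..hi (hi - lo + 1 is a power of two)."""
    if hi > lo:
        mid = lo + (hi - lo) // 2
        yield from oddeven_merge_sort(lo, mid)
        yield from oddeven_merge_sort(mid + 1, hi)
        yield from oddeven_merge(lo, hi, 1)


def depth(ces):
    """Number of steps when each CE is placed in the earliest possible step."""
    last = {}
    for lo, hi in ces:
        last[lo] = last[hi] = max(last.get(lo, 0), last.get(hi, 0)) + 1
    return max(last.values(), default=0)


def sorts_all_zero_one_inputs(ces, n):
    """Zero/one principle check over all 2**n inputs."""
    for bits in product((0, 1), repeat=n):
        keys = list(bits)
        for lo, hi in ces:
            if keys[lo] > keys[hi]:
                keys[lo], keys[hi] = keys[hi], keys[lo]
        if keys != sorted(keys):
            return False
    return True


if __name__ == "__main__":
    net8 = list(oddeven_merge_sort(0, 7))
    print(len(net8), depth(net8), sorts_all_zero_one_inputs(net8, 8))
```

For 8 keys the sketch yields 19 CEs in 6 steps, and for 16 keys it yields 63 CEs in 10 steps, in agreement with ½ log N (log N + 1).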

Fig. 3.2. The 8-key Odd-even merge sorting network.

3.4 The Van Voorhis Network for 16 keys

In this section, the Van Voorhis Sorting network for 16 keys is analyzed. This network, as depicted in Fig. 3.3 [Knuth 1998], is quite interesting because it requires one step fewer than the Odd-even merge sorting network for the same input size.

Fig. 3.3. Van Voorhis’s Sorting network for 16 keys.

The first 4 steps construct a single-segment poset of 16 keys. However, the logic behind the last 5 steps is not clear. Thus, the renaming strategy, described in Section 2.7, was applied to this network. According to the 16-key single-segment poset obtained after step 4, the permutation P = {0, 1, 2, 8, 4, 5, 6, 12, 3, 9, 10, 11, 7, 13, 14, 15} was used. The last 5 steps of the renamed network, as depicted in Fig. 3.4, show that Van Voorhis’s network used a divide-and-conquer strategy that rearranged the keys into four 4-key groups.

Steps 5 and 6 start sorting keys K[11] through K[14] and K[1] through K[4], and also split the intermediate keys K[5] through K[10] into two partially sorted subsets {K[5], K[6], K[7]} and {K[8], K[9], K[10]}. In step 7, the minimum key in the top-most group, i.e. K[11], is combined with the subset {K[8], K[9], K[10]}, whereas the maximum key in the bottom-most group, i.e. K[4], is combined with the subset {K[5], K[6], K[7]}. The two middle groups of 4 keys are sorted in steps 7 through 9. The CEs 3:4, 7:8, and 11:12, in step 9, straddle the boundaries of the corresponding adjacent groups. It can be noticed that the first 8 steps of this network sorted all odd zero/one cases as well as all the cases with exactly 2 and 14 zeroes, whereas the last step sorted the remaining even zero/one cases.


Fig. 3.4 The renamed version of the last 5 steps of Van Voorhis’s 16-key network.

3.5 The AKS Network

The AKS Sorting network developed by Ajtai, Komlós, and Szemerédi sorts N keys in O(log N) time [Ajtai, Komlos, Szemerdi 1983]. The original variant of this network, which is described here, uses O(N log N) processors to sort in O(log N) time, which is not cost-optimal [Natvig 1990]. Leighton, however, described a cost-optimal variant of the AKS network that uses O(N) processors to sort N keys in O(log N) time [Leighton 1984]. He also indicated that the constant induced by this algorithm is immense (believed to be in the hundreds or the thousands) and that other parallel sorting algorithms run faster as long as N < 10^100. Some efforts were invested in reducing the value of this constant [Paterson 1987]. However, the algorithm remained complicated with a large complexity constant.

The original N-key AKS network consists of O(log N) layers of complete binary trees, each of which consists of O(N) nodes. Every node in these trees contains a separator circuit which splits the list of input keys it receives into four lists, denoted by A1, A2, A3, and A4. A typical node, excluding the root and the leaves, with label i in layer j sends its A1 and A4 lists to the node labeled ⌊i/2⌋ in layer j+1. Node i also sends its A2 list to its child node (labeled 2i) and its A3 list to its other child node (labeled 2i+1), in layer j. A root node, however, sends its A1 and A4 lists to the root of the tree in layer j+1, yet forwards its A2 and A3 lists as a conventional node does. Fig. 3.5 illustrates the general structure of a typical tree in some layer j of a given AKS network [Akl 1997].

Upon receiving a list of keys, a node tries to move extremely high keys to A4 and extremely low keys to A1. These extreme keys are sent to node ⌊i/2⌋ in the next layer to help put them in their correct locations, while the rest of the keys get forwarded to the child nodes for further examination. When the sorting process ends, the leaves of the tree in the last layer will hold the sorted output list.

Fig. 3.5. The general structure of a typical layer in an AKS network.

Separator circuits use expander graphs, which induce a massively large complexity constant that undermines the practicality of the AKS networks. To illustrate this problem, let an N-key AKS network have a relatively small complexity constant such as C = 87, although the actual value of this constant is believed to be at least in the hundreds. Consequently, the number of steps required by this AKS network is C (log N). The Bitonic Sorting network requires K (log N (log N + 1)) steps, where K = ½, as described in Section 3.2. The cross-over point for these two algorithms happens at N = 2^(C/K - 1). Substituting for C and K results in the cross-over point N = 2^173 ≈ 1.2 × 10^52. Hence, the AKS network becomes faster than the Bitonic merge sorting only when N > 1.2 × 10^52 keys.

The mass of a proton is about 1.66 × 10^-24 grams, which is almost the same as the mass of a neutron. The mass of the earth is about 5.98 × 10^24 kilograms, i.e. 5.98 × 10^27 grams. Hence, the earth contains about 3.6 × 10^51 protons and neutrons. Even if computer technology would ever allow each key to be stored in a proton or a neutron, the Bitonic sorting network would still be faster than the AKS network. This shows that the AKS networks are evidently impractical.
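The cross-over point can be reproduced with a short calculation. The Python sketch below equates C log N with K log N (log N + 1), which gives log N = C/K - 1; the function name is hypothetical, and C = 87 and K = ½ are the values assumed in the text.

```python
def crossover_exponent(c, k=0.5):
    """Return log2(N) at which c*log2(N) equals k*log2(N)*(log2(N) + 1)."""
    return c / k - 1


if __name__ == "__main__":
    exponent = crossover_exponent(87)        # C = 87, K = 1/2
    n = 2 ** int(exponent)                   # exact integer value of 2**173
    print(exponent, f"{n:.1e}")
```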

3.6 Cole’s Parallel Merge Sort

Cole’s parallel merge sort is the main parallel sorting algorithm for the PRAM model because it is cost-optimal. Cole’s algorithm assumes that N distinct keys, where N is a power of 2, are received as input and that each key is initially stored in a leaf of a complete binary tree. Each internal node u in the tree is responsible for merging the two sorted arrays submitted by its two child nodes. This way, the algorithm proceeds in a bottom-up fashion and terminates when the root produces the sorted array of N keys.

Cole made two main observations [Cole 1988]. Firstly, merging in the i-th stage can be done in constant time. Secondly, the merging done at the different levels of the binary tree can be pipelined. This pipelining is possible because the merged samples made at a lower level, j, can provide suitable samples for the parent node in the upper level j+1.

3.6.1 Definitions

Here we provide the basic definitions necessary for describing Cole’s algorithm [Brassard, Bratley 1996]. Let a, b, and c be three keys; we say that a and c straddle b iff a ≤ b < c. Let L, an array of sorted keys, be extended by appending -∞ to its beginning and +∞ to its end. Now, the rank of the key k in L, denoted by rank_L(k), is simply defined as the position of k in L. Obviously, rank_L(-∞) = 0 and rank_L(+∞) = |L|+1. We can also define the rank of a key k from a sorted array J with respect to another sorted array L, and call it the cross-rank, denoted by c-rank_L(k). Precisely, let L and J be two sorted arrays, and let b be a key in J and a and c be two consecutive keys in L, such that a and c straddle b. Then, c-rank_L(b) = rank_L(a). We denote by J/L the set of ranks of all keys from the sorted list J with respect to the sorted list L. Also, let x and y be two consecutive keys in L; the interval induced by x and y is denoted by [x,y). A key z belongs to [x,y) iff x and y straddle z. The sorted list L is said to cover sorted list J if each interval induced by every two consecutive keys in L, including -∞ and +∞, contains at most three keys in J, where -∞ and +∞ are excluded.

The symbol & is used to denote the operation of merging two sorted arrays [Brassard, Bratley 1996]. Finally, r(L) denotes the array obtained by taking every fourth key from the sorted array L. If |L| < 4, then r(L) is empty.
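The definitions above translate directly into a few small functions. The following Python sketch is only an illustration: the function names are hypothetical, sorted arrays are represented as Python lists of distinct keys, and r(L) is assumed to take the 4th, 8th, 12th, … keys of L, which is consistent with r(L) being empty when |L| < 4.

```python
from bisect import bisect_right
from collections import Counter


def rank(L, k):
    """Position of key k in the sorted list L, where -infinity has rank 0."""
    return L.index(k) + 1


def cross_rank(L, b):
    """Rank in L of the key a such that a and its successor in L straddle b."""
    return bisect_right(L, b)        # number of keys in L that are <= b


def covers(L, J):
    """True if every interval induced by L holds at most three keys of J."""
    per_interval = Counter(cross_rank(L, b) for b in J)
    return all(count <= 3 for count in per_interval.values())


def r(L):
    """Every fourth key of L (assumed to be the 4th, 8th, 12th, ... keys)."""
    return L[3::4]


if __name__ == "__main__":
    L = [10, 20, 30, 40]
    print(cross_rank(L, 5), cross_rank(L, 10), cross_rank(L, 35))
    print(covers(L, [5, 12, 15, 18, 35]), covers(L, [11, 12, 13, 14]))
    print(r([1, 2, 3, 4, 5, 6, 7, 8]), r([1, 2, 3]))
```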

3.6.2 The Main Idea

Here we describe the constant-time merging operation done in Cole’s algorithm. Let J, K, and L be three sorted arrays, and assume that L covers both J and K. Figure 3.6 illustrates the cover-relation established between these arrays [Brassard, Bratley 1996].

Fig. 3.6 Sorted array L covers sorted arrays J and K.

Each interval induced by every two consecutive keys in L covers at most three keys in J, and the same holds for K, according to the definition of the cover-relation provided in subsection 3.6.1. Thus, if we know J/L, K/L, L/K and L/J, then we can tell which keys from J and K are in which intervals in L. Consequently, the keys from J that lie in the first interval in L can be merged with the corresponding keys in K in O(1) time, since no more than 6 keys satisfy this property according to the cover-relation. This applies to all of the keys in J and K that lie in the rest of the induced intervals of L. If we have a sufficient number of processors, then we can do all of the 6-key merges in parallel. All that remains to be done is to concatenate the results obtained from the separate merges, which can be done in O(1) time as well, if enough processors exist. Hence, the whole merging process can be done in constant time. This merging technique is called merging with help, since the information provided by the L array helped accomplish all the necessary merges.
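A sequential simulation may help to see why merging with help works. In the Python sketch below, which is an illustration rather than Cole's PRAM procedure, the keys of J and K are grouped by the interval of L into which they fall, every group of at most 6 keys is merged on its own, and the partial results are concatenated. The function name is hypothetical, and the cross-ranks are computed here by binary search, whereas the algorithm assumes that J/L and K/L are already known; on a PRAM each interval would be handled by its own processors.

```python
from bisect import bisect_right
from heapq import merge


def merge_with_help(J, K, L):
    """Merge sorted lists J and K, using a sorted list L that covers both."""
    groups_j = [[] for _ in range(len(L) + 1)]
    groups_k = [[] for _ in range(len(L) + 1)]
    for key in J:
        groups_j[bisect_right(L, key)].append(key)   # interval of L holding key
    for key in K:
        groups_k[bisect_right(L, key)].append(key)

    result = []
    for part_j, part_k in zip(groups_j, groups_k):
        if len(part_j) > 3 or len(part_k) > 3:
            raise ValueError("L does not cover both J and K")
        # At most 3 + 3 keys per interval, hence a constant amount of work.
        result.extend(merge(part_j, part_k))
    return result


if __name__ == "__main__":
    J = [1, 4, 9, 12]
    K = [2, 3, 10, 11, 20]
    L = [5, 10, 15]
    print(merge_with_help(J, K, L))
```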

3.6.3 The Phases of the Algorithm

The N keys are initially stored in the leaves of a complete binary tree. An internal node u must compute the array L_u, which is the sorted array that consists of all the keys stored in the leaves of the sub-tree rooted at u. The algorithm consumes O(log N) stages. At each stage t, a typical internal node u generates a sorted subsequence of L_u and stores it in the array A_u(t) (where t = 0, 1, …). The A_u(t) array gets recomputed in each stage until A_u(t) = L_u at some stage t; from then on u is said to be a complete node, and it is said to be an active node otherwise. Three more sorted arrays are involved in the computations at node u during stage t, namely: the B_u(t) array, a sorted array that u submits to its parent node in the tree; B_x(t), a sorted array submitted by x, which is a child node of u; and B_y(t), a sorted array submitted by y, the other child node of u. Each active node goes through three phases at each stage t:

1. Compute B_u(t+1) ← r(A_u(t)), and send it to the parent node of node u.

2. Read the two arrays B_x(t) and B_y(t) submitted by x and y, the two child nodes of node u.

3. Compute a new A_u, that is A_u(t+1) ← B_x(t) & B_y(t), with the help of the array A_u(t).

At stage 0, each node at level 1 reads the two keys stored in its two child nodes (leaf nodes), compares them, and merges them, thus becoming a complete node itself. If a node u is complete, then the phases it goes through get modified. Let’s assume that u becomes complete at stage t; then at stage t+1 it performs phase 1 as described above. At stage t+2, it submits every second key in its A_u(t) array, and at stage t+3, it submits each key in its A_u(t) array. Being a complete node, u does not go through phases 2 and 3 anymore. Consequently, three stages after u becomes complete, its parent node becomes complete, too.

Since the tree is a complete binary tree, it has log N levels. Hence, the algorithm consumes θ(log N) stages. It can be shown that no more than θ(N) processors are required to accomplish the sorting task. Thus, the algorithm’s cost is θ(N log N), which is cost-optimal.

3.6.4 The Practicality of Cole’s Parallel Merge Sort

Despite the fact that it is theoretically optimal, Cole’s parallel merge sort is not really fast in practice. In an interesting paper, Natvig [Natvig 1990] showed that Batcher’s Bitonic merge sorting is faster than Cole’s merge sort as long as N is less than 1.2 × 10^21 keys.

To accomplish his comparative study, Natvig implemented the designated algorithms on a synchronous MIMD-style simulator that resembled a CREW PRAM machine. He pointed out that coding the O(1) merging operation in Cole’s algorithm was fairly complicated and consumed about 781 units, which constitutes a relatively large complexity constant. On the other hand, implementing Batcher’s Bitonic sorting, under the same circumstances, was a much easier task. The complexity constant implied by the algorithm was relatively small (about 42). Hence, Bitonic sorting is much faster for all practical values of N. Figure 3.7 illustrates the time, i.e. the number of parallel PRAM steps, required by Cole’s and the Bitonic sorting algorithms on the CREW PRAM simulator [Natvig 1990]. The horizontal axis, n, represents the input size. As depicted in the figure, Bitonic sorting is faster than Cole’s merge sort in practice.

Fig. 3.7 Experimental results obtained from running two parallel algorithms on a CREW PRAM simulator. The two curves show Cole’s algorithm (O(log N)) and the Bitonic algorithm (O(log² N)).

3.7 The Best Sorting Networks for Input Sizes ≤ 32

Sorting networks follow the circuit model described in Section 3.1. A new N-key network is better than all other known networks for N keys if it either uses fewer comparators than all other N-key networks (i.e. it is more efficient) or uses fewer steps than all other N-key networks (i.e. it is faster). A more-efficient Sorting network for N keys might require more steps than a less efficient network. For example, the most-efficient 16-key network known so far uses 60 CEs and requires 10 steps, whereas the fastest 16-key network uses 61 CEs and requires 9 steps.

3.7.1 The most-efficient N-key networks for N ≤ 32

The best that can be done to reduce the number of comparators in an N-key network is to achieve the information-theoretic lower bound. Each comparator only has two states (exchange or don't exchange) and an N-key Sorting network must sort all N! input permutations of N distinct keys. Thus, the number of comparators must be at least equal to the smallest integer greater than or equal to log(N!), i.e. ⌈log₂(N!)⌉.
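The information-theoretic lower bound is easy to evaluate. The Python sketch below uses a hypothetical function name and exact integer arithmetic: for a positive integer x, (x - 1).bit_length() equals the smallest integer greater than or equal to log₂(x).

```python
from math import factorial


def information_theoretic_bound(n):
    """Smallest integer greater than or equal to log2(n!)."""
    return (factorial(n) - 1).bit_length()


if __name__ == "__main__":
    for n in range(1, 9):
        print(n, information_theoretic_bound(n))
```

For N = 5, 6, 7, and 8 this evaluates to 7, 10, 13, and 16 comparators.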

The obliviousness of Sorting networks raises the lower bound on the number of comparisons for some values of N. For instance, the lower bounds on the total number of CEs necessary for designing sorting networks for N = 5, 6, 7, and 8 are 9, 12, 16, and 19 CEs respectively, instead of the information-theoretic lower bounds of 7, 10, 13, and 16. This has been proven by Floyd and Knuth [Floyd, Knuth 1967; Floyd, Knuth 1973].

The first row in Table 3.1 shows the input size N, where 1 ≤ N ≤ 16. The second row shows the number of CEs used by the most-efficient Sorting networks that have been designed so far, for each input size. The third row shows the best lower bound known so far. It can be noticed that there is a difference between the second and the third rows for 9 ≤ N ≤ 16. Hence, either there exists a better N-key network that uses fewer comparators or there is a greater lower bound.

Table 3.1. The number of CEs in the most-efficient networks for 1 ≤ N ≤ 16.

N                  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
Most efficient     0   1   3   5   9  12  16  19  25  29  35  39  45  51  56  60
Best lower bound   0   1   3   5   9  12  16  19  20  22  26  29  33  37  41  45

The networks for N = 9 and 10, which have their CE counts depicted in Table 3.1, were discovered by Floyd and Waksman respectively [Knuth 1998]. The 39-CE 12-key network was discovered by Shapiro and Green [Knuth 1998]. Juille constructed the 45-CE 13-key network by simulating an evolutionary process of genetic breeding [Juille 1995]. The 60-CE 16-key network was discovered by Green [Knuth 1998]. The CE counts for the networks with N = 11, 14, and 15 were obtained by removing the bottom key from the corresponding (N+1)-key network, together with the CEs that touch that key [Knuth 1998].

In a similar fashion, Table 3.2 shows what is known about the number of CEs used by Sorting networks for 17 ≤ N ≤ 32. The most-efficient network for each of these values was designed by using the most-efficient ⌊N/2⌋-key network together with the most-efficient ⌈N/2⌉-key network as described in Table 3.1. The outputs of these two networks were merged using an Odd-even merging network. For all values of N in this table, there is a difference between the second and the third rows; again, either there exists a better N-key network that uses fewer comparators or there is a greater lower bound.

Table 3.2. The number of CEs in the most-efficient networks for 17 ≤ N ≤ 32.

N                 17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32
Most efficient    73  80  88  93 103 110 118 123 133 140 150 157 166 172 180 185
Best lower bound  49  53  57  62  66  70  75  80  84  89  94  98 103 108 113 118

3.7.2 The fastest N-key networks for N ≤ 32

If N is an even number, then the maximum number of CEs in each step is N/2. If S(N) is the best lower bound on the number of CEs for an N-key network, then the smallest number of steps needed will be ⌈S(N) / (N/2)⌉ steps. Similarly, if N is an odd number, then the maximum number of comparators in each step is (N-1)/2. If S(N) is the best lower bound on the number of CEs for an N-key network, then the smallest number of steps needed will be ⌈S(N) / ((N-1)/2)⌉ steps.

Notice that the lower bound on the number of steps for an N-key network can't be less than the best lower bound on the number of steps for an (N-1)-key network. For example, if N = 6 then S(N) = 12 from Table 3.1 and N/2 = 3. Hence, 4 steps would be a lower bound on the number of steps of a 6-key network. However, it’s been stated in Table 3.1 that S(5) = 9. Moreover, (5-1)/2 = 2, and ⌈9/2⌉ = 5; thus 5 is the lower bound on the number of steps of a 5-key network. This implies that the lower bound on the number of steps of a 6-key network must also be 5.
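The counting argument above, together with the observation that the bound cannot decrease as N grows, can be written as a short Python sketch. The function name is hypothetical, and the values of S(N) are the lower bounds listed in Table 3.1 for 1 ≤ N ≤ 8. The sketch covers only this counting argument; it is not meant to reproduce lower bounds that were obtained by other means.

```python
def step_lower_bounds(ce_bounds):
    """Map each N to a lower bound on the number of steps of an N-key network.

    ce_bounds[N] is S(N), a lower bound on the number of CEs for N keys.
    """
    bounds = {}
    best = 0
    for n in sorted(ce_bounds):
        per_step = n // 2                    # N/2 for even N, (N-1)/2 for odd N
        if per_step:
            counting = -(-ce_bounds[n] // per_step)   # ceiling of the quotient
        else:
            counting = 0                     # a single key needs no CEs
        best = max(best, counting)           # never below the bound for N-1 keys
        bounds[n] = best
    return bounds


if __name__ == "__main__":
    S = {1: 0, 2: 1, 3: 3, 4: 5, 5: 9, 6: 12, 7: 16, 8: 19}
    print(step_lower_bounds(S))
```

For N = 5 and N = 6 the sketch gives 5 steps, as in the example above.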

The first row in Table 3.3 shows the input size N for 1 ≤ N ≤ 16. The second row shows the number of steps in the fastest network known so far for each value of N, and the third row shows the best lower bound known so far. Notice that for 11 ≤ N ≤ 16, there is a difference between the second and the third rows; either there exists a better N-key network that uses fewer steps or there is a greater lower bound. The networks for N = 6, 9, and 12 were discovered by Shapiro, and the networks for N = 10 and 16 were discovered by Van Voorhis [Knuth 1998]. The networks with odd input size N, where 1 ≤ N ≤ 16, are obtained by removing either the top or the bottom key from the corresponding (N+1)-key network together with the CEs that touch that key.

Table 3.3. The number of steps in the fastest networks for 1 ≤ N ≤ 16.

N                  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
Fastest            0   1   3   3   5   5   6   6   7   7   8   8   9   9   9   9
Best lower bound   0   1   3   3   5   5   6   6   7   7   7   7   7   7   7   7

Table 3.4 shows the number of steps in the fastest known N-key networks, where 17 ≤ N ≤ 32, and the corresponding best lower bounds. The fastest network, for each of these values of N, uses the fastest ⌊N/2⌋-key network together with the fastest ⌈N/2⌉-key network according to Table 3.3 and then merges the outputs of both using Odd-even merging.

Table 3.4. The number of steps in the fastest networks for 17 ≤ N ≤ 32.

N                 17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32
Fastest           12  12  12  12  13  13  13  13  14  14  14  14  14  14  14  14
Best lower bound   7   7   7   7   7   7   7   7   7   7   8   8   8   8   8   8

The research described in this dissertation concentrates on designing faster Sorting networks, i.e. designing networks that require fewer steps than the fastest known Sorting networks for some input size N. The input sizes that were examined ranged between 12 and 32 inclusive. Two faster networks for N = 18 and 22 were discovered and are described in Chapters 6 and 7, respectively.

CHAPTER 4

Sortnet

Sortnet is a software tool developed by Batcher to help synthesize and analyze N-key Sorting networks, where N ≤ 32. Some Sortnet commands can be used to build, manipulate, and save a CE-list. Other Sortnet commands can then be used to: generate the corresponding set of zero/one cases; count the number of these cases; display the poset of keys arising from this set of cases; and display the corresponding Shmoo chart [Batcher, Al-Haj Baddar 2008].

This chapter introduces Sortnet. Section 4.1 describes the interface of Sortnet, while Section 4.2 illustrates a subset of its commands that treats CE-lists. Section 4.3 describes a subset of Sortnet commands that treats zero/one cases, and Section 4.4 shows how some parts of Sortnet work. Finally, Section 4.5 illustrates, with the help of an example, how Sortnet handles a given N-key Sorting network.

4.1 The Interface of Sortnet

Whenever Sortnet is waiting to receive a command it prompts the user with sortnet->. A Sortnet command consists of:

1. A mnemonic specifying the operation to be performed;

2. A list of zero or more arguments; and

3. A semi-colon (;), if necessary, to indicate the end of the argument list.

Whitespace (one or more spaces, tabs, or new lines) must separate the mnemonic from the first argument, each argument from the next argument, and the last argument from the semi-colon at the end of the argument list. Each command should be followed by a newline (return key) to make sure that the operating system (OS) sends it to Sortnet. Each mnemonic has one, two, or three elements with a dot (.) separating each element from the next one. The first element specifies an action, the second element (if present) specifies the object to be acted upon, and the third element (if present) is a modifier.

4.2 Sortnet Commands for Manipulating CEs

The commands described here are useful for creating and modifying CE-lists.

ENT.CE numlo numhi … numlo numhi ;

This command enters the list of comparators from the keyboard into the CE-list.

To enter K comparators into the CE-list there must be 2K numbers in the argument list.

The first two numbers specify the low and high indices of the first CE, the next two numbers specify the low and high indices of the second CE, and so on. For example, executing the command ENT.CE 0 1 2 3 0 2 1 3 1 2 ; generates the CE-list that corresponds to the Knuth diagram illustrated in Fig. 4.1(a)[Knuth 1998].

Fig. 4.1(a) A 4-key Sorting network. Fig. 4.1(b) The corresponding CE-list.
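To make the structure of a CE-list concrete, the following Python sketch parses an argument list in the ENT.CE format and groups the comparators into steps. It is not Sortnet's code: the function names are hypothetical, and placing every CE in the earliest step in which both of its keys are free is an assumption about how the steps are formed.

```python
def parse_ce_arguments(text):
    """Turn 'lo hi lo hi ... ;' into a list of (lo, hi) comparators."""
    numbers = [int(token) for token in text.split() if token != ";"]
    if len(numbers) % 2:
        raise ValueError("K comparators need 2K numbers")
    return list(zip(numbers[0::2], numbers[1::2]))


def group_into_steps(ces):
    """Place every CE in the earliest step in which both of its keys are free."""
    last_step = {}                    # key index -> last step that touched it
    steps = []
    for lo, hi in ces:
        step = max(last_step.get(lo, 0), last_step.get(hi, 0)) + 1
        if step > len(steps):
            steps.append([])
        steps[step - 1].append((lo, hi))
        last_step[lo] = last_step[hi] = step
    return steps


def cut_steps(steps, num_last_step):
    """Keep only the comparators in steps 1 through num_last_step."""
    return steps[:num_last_step]


if __name__ == "__main__":
    ce_list = parse_ce_arguments("0 1 2 3 0 2 1 3 1 2 ;")
    for number, step in enumerate(group_into_steps(ce_list), start=1):
        print("step", number, step)
```

For the 4-key example above, the sketch produces three steps holding two, two, and one comparators respectively.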


SHOW.CE

This command displays the current CE-list on the monitor, with comments to show the comparators of each step. Figure 4.1(b) shows the display of the SHOW.CE command after executing the previous ENT.CE command.

WR.CE file_name

This command writes the current CE-list into the specified file. The contents of the file are the same as the output of the SHOW.CE command.

RD.CE file_name

This command reads the CE-list stored in the specified file. This file can be created by the WR.CE command. The comments in such a file are ignored.

CUT.CE.STEPS numlaststep

This command removes all the comparators in the CE-list that are in steps past numlaststep. For example, applying the CUT.CE.STEPS 1 command on the CE-list shown in Fig. 4.1(b) will remove the comparators in step 2 and step 3 leaving the two comparators in step 1.

CLR.CE

This command simply clears out all comparators in the CE-list.

4.3 Sortnet Commands for Manipulating Zero/One Cases

The commands listed in this section are used for generating and analyzing zero/one cases. Figure 4.2 depicts the SHOW.CE display of the first 5 steps of the 18- key network described in Chapter 6. This CE-list will help illustrate the Sortnet commands covered in this section.

Fig. 4.2 The SHOW.CE display of the first 5 steps of the 18-key network described in Chapter 6.

GEN.CASES

After entering a CE-list, using either the ENT.CE or the RD.CE command, the GEN.CASES command is used to generate the corresponding set of zero/one cases. This enables other Sortnet commands, such as SHOW.POSET and SHOW.SHMOO, to produce their designated outputs.
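The effect of generating the zero/one cases can be sketched as follows. The Python code below is an illustration and not Sortnet's implementation: the function name is hypothetical, every one of the 2^N zero/one inputs is simply pushed through the CE-list, and a CE lo:hi is assumed to leave the smaller value on key lo.

```python
from itertools import product


def gen_cases(n, ce_list):
    """Return the set of zero/one cases that remain after the CE-list."""
    cases = set()
    for bits in product((0, 1), repeat=n):
        keys = list(bits)
        for lo, hi in ce_list:
            if keys[lo] > keys[hi]:          # assumed: smaller key goes to lo
                keys[lo], keys[hi] = keys[hi], keys[lo]
        cases.add(tuple(keys))
    return cases


if __name__ == "__main__":
    network = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]
    print(len(gen_cases(4, network[:2])))    # after the first step only
    print(sorted(gen_cases(4, network)))     # after the whole network
```

A network sorts all inputs exactly when N + 1 cases remain, one for each possible number of zeroes.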

SHOW.POSET

This command displays on the monitor a 5-column table describing the key-poset created by the current set of zero/one cases. There is a row in the table for each key, K[k], with:

column 1 showing k, the index of the key, K[k];

column 2 showing the number of keys that are greater than key, K[k];

column 3 showing the number of keys that are less than key, K[k];

column 4 showing the indices of keys that cover key, K[k]; and

column 5 showing the indices of keys that are covered by key, K[k].

The rows are ordered by segment (with horizontal lines separating the segments) with the keys in each segment ordered by their height in the segment. For example, Fig. 4.3 shows the table of the poset obtained after generating the cases of the CE-list illustrated in Fig. 4.2.

Fig. 4.3 The poset table that corresponds to the 5-step CE-list in Fig.4.2.

SHOW.SHMOO

This command displays a Shmoo chart, which is an array of N rows and N+1 columns that has a column for each class Z_j and a row for each key K[i]. Figure 4.4 illustrates the Shmoo chart generated after entering the 5-step CE-list, depicted in Fig. 4.2, and generating the corresponding set of zero/one cases.

The columns in the chart are ordered with class Z_N at the left of the chart and class Z_0 at the right of it. The entry in row i and column j is:

1 : if key K[i], in every case in class Z_j, equals one;

0 : if key K[i], in every case in class Z_j, equals zero; or

- : if there is at least one case in class Z_j where K[i] equals zero and at least one case where K[i] equals one in the same class.

Fig. 4.4 The Shmoo chart that corresponds to the CE-list depicted in Fig. 4.2.

Let the number of zero/one cases in which K[i] equals one be denoted by C-one_i. The keys in the Shmoo chart are ordered from top to bottom by their corresponding C-one values in decreasing order. If two or more keys have the same C-one value, then they are ordered by their indices in decreasing order. A Shmoo chart becomes dash-free when all the keys get sorted, i.e. all the entries in the Shmoo chart are either zeroes or ones.
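The rules above are enough to build a Shmoo chart from a set of zero/one cases. The Python sketch below is illustrative and is not Sortnet's code: the function names are hypothetical, class Z_j is assumed to contain the cases with exactly j zeroes, and a CE lo:hi is assumed to leave the smaller value on key lo.

```python
from itertools import product


def gen_cases(n, ce_list):
    """All zero/one cases that can appear after applying the CE-list."""
    cases = set()
    for bits in product((0, 1), repeat=n):
        keys = list(bits)
        for lo, hi in ce_list:
            if keys[lo] > keys[hi]:          # assumed: smaller key goes to lo
                keys[lo], keys[hi] = keys[hi], keys[lo]
        cases.add(tuple(keys))
    return cases


def shmoo_chart(n, cases):
    """Return (key index, row string) pairs in the order they are displayed."""
    # Assumption: class Z_j holds the cases with exactly j zeroes.
    classes = {j: [c for c in cases if c.count(0) == j] for j in range(n + 1)}
    rows = {}
    for i in range(n):
        entries = []
        for j in range(n, -1, -1):           # Z_N on the left, Z_0 on the right
            values = {case[i] for case in classes[j]}
            entries.append(str(values.pop()) if len(values) == 1 else "-")
        rows[i] = "".join(entries)
    c_one = {i: sum(case[i] for case in cases) for i in range(n)}
    # Decreasing C-one value; ties are broken by decreasing key index.
    order = sorted(range(n), key=lambda i: (c_one[i], i), reverse=True)
    return [(i, rows[i]) for i in order]


if __name__ == "__main__":
    network = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]
    for key, row in shmoo_chart(4, gen_cases(4, network[:2])):
        print(f"K[{key}] {row}")
```

After only the first step of the 4-key network of Fig. 4.1 the chart still contains dashes, whereas the chart of the complete network is dash-free.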

SHOW.GOODCE

This command considers all possible comparisons and, for each, it calculates the number of Shmoo chart dashes that will be removed if that comparison is made. The output of this command is displayed in the form of an upper triangular (N-1)x(N-1) table. The entry in the i-th row and j-th column contains the number of dashes removed from the Shmoo chart if K[i] gets compared with K[j]. A dot is displayed when such a comparison removes no dashes. Figure 4.5 illustrates the output of the SHOW.GOODCE command when it is executed after generating the cases of the CE-list depicted in Fig. 4.2.

Fig. 4.5 The output of SHOW.GOODCE that corresponds to the CE-list depicted in Fig. 4.2

SHOW.DIFF

This command considers all possible comparisons and, for each, it calculates the number of zero/one cases that get modified by the designated comparison. The output of this command is displayed in the form of an upper triangular (N-1)x(N-1) table. The entry at the i-th row and j-th column contains the number of cases that will be modified if K[i] gets compared with K[j]. A zero means that such a comparison modifies no zero/one cases. Figure 4.6 depicts the output of the SHOW.DIFF command when it is applied to the zero/one cases resulting from the CE-list illustrated in Fig. 4.2.
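The two quantities reported by SHOW.GOODCE and SHOW.DIFF can be computed directly from the set of zero/one cases. The Python sketch below is an illustration under stated assumptions and not Sortnet's code: the function names are hypothetical, class Z_j is assumed to contain the cases with exactly j zeroes, and comparing K[i] with K[j] for i < j is assumed to leave the smaller value on K[i]. The example uses the nine cases that remain after the first step of the 4-key network of Fig. 4.1.

```python
def count_dashes(n, cases):
    """Number of (class, key) pairs in which the key takes both values."""
    dashes = 0
    for j in range(n + 1):
        members = [case for case in cases if case.count(0) == j]
        for i in range(n):
            if len({case[i] for case in members}) > 1:
                dashes += 1
    return dashes


def apply_ce(cases, lo, hi):
    """Apply the comparison lo:hi to every case."""
    result = set()
    for case in cases:
        if case[lo] > case[hi]:
            keys = list(case)
            keys[lo], keys[hi] = keys[hi], keys[lo]
            case = tuple(keys)
        result.add(case)
    return result


def good_and_diff(n, cases):
    """For every pair i < j: dashes removed and cases modified by i:j."""
    before = count_dashes(n, cases)
    good, diff = {}, {}
    for i in range(n):
        for j in range(i + 1, n):
            diff[(i, j)] = sum(1 for case in cases if case[i] > case[j])
            good[(i, j)] = before - count_dashes(n, apply_ce(cases, i, j))
    return good, diff


if __name__ == "__main__":
    sorted_pairs = [(0, 0), (0, 1), (1, 1)]
    cases = {a + b for a in sorted_pairs for b in sorted_pairs}
    good, diff = good_and_diff(4, cases)
    for pair in sorted(good):
        print(pair, good[pair], diff[pair])
```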

Fig. 4.6 The output of SHOW.DIFF that corresponds to the CE-list depicted in Fig. 4.2.

SHOW.BESTCE

The best CEs to use in a given situation might be: the ones that eliminate the most dashes from the Shmoo chart; the ones that remove fewer dashes but modify more zero/one cases; or the ones that modify the most cases [Al-Haj Baddar, Batcher 2009 b].

Thus, the SHOW.BESTCE command displays the CEs that belong to each of these categories together with the earliest step in which they can be used. This command uses the SHOW.GOODCE and SHOW.DIFF commands to generate its output. It is then up to the network designer to pick the CEs they think are the best and add them to the CE-list. Fig. 4.7 illustrates the BESTCE values obtained after generating the zero/one cases of the first five steps of the 18-key network described in Fig. 4.2.

Fig. 4.7 The output of SHOW.BESTCE that corresponds to the CE-list in Fig. 4.2.

SEL.CASES num_1 num_2 ... num_m ;

This command defines a subset of the set of zero/one cases. For each num_i in the argument list, the subset contains all zero/one cases in the class Zj with j = num_i.

SHOW.BESTCE.SEL

This command displays on the monitor the BESTCE values, considering only those zero/one cases that are in the classes defined by the most recent SEL.CASES command.

SHOW.CASECNTS

For every class Zj, where 0 ≤ j ≤ N, this command displays on the monitor the number of zero/one cases that belong to that class.

4.4 How Sortnet Works

Sortnet first generates the set of zero/one cases for a given set of CEs, then generates the corresponding key-poset according to the following rule:

For all keys K[i] and K[j], K[i] ≤ K[j] if there is NO zero/one case where K[i] = 1 and K[j] = 0.
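The rule above translates directly into code. The following is a minimal sketch with a hypothetical function name, using tuples of bits for the zero/one cases; it lists every ordered pair of distinct keys that the rule relates.

```python
def key_poset(cases, n):
    """Pairs (i, j), i != j, such that K[i] <= K[j] in every zero/one case."""
    return {(i, j) for i in range(n) for j in range(n)
            if i != j and not any(c[i] == 1 and c[j] == 0 for c in cases)}

# Cases left by the CEs 0:1 and 2:3 on four keys: each compared pair
# can only be 00, 01 or 11.
cases = {(a, b, c, d)
         for a, b in ((0, 0), (0, 1), (1, 1))
         for c, d in ((0, 0), (0, 1), (1, 1))}
relations = key_poset(cases, 4)
```

For these nine cases the only relations found are K[0] ≤ K[1] and K[2] ≤ K[3], i.e. a poset with two 2-key segments.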

This section describes how some parts of Sortnet work.

Each zero/one case


For an N-key Sorting network, the program maintains a set of zero/one cases.

Each zero/one case is a sequence of N bits with each bit showing the value, 0 or 1, of one of the keys for that case. A long integer in C contains 32 bits, so it's convenient to limit N to 32.

The number of zero/one cases

At the start, before any comparators are treated, the N keys are not sorted at all, so theoretically the program starts with 2^N zero/one cases. If N = 32 then 2^N = 2^32 = 4,294,967,296. Sortnet doesn’t start with that many zero/one cases. If the key-poset has two or more segments, Sortnet only has to maintain the zero/one cases corresponding to each segment in a separate linked-list.

Treating posets

At the start, the poset has N segments with each segment containing just one key.

Hence, there are only two zero/one cases for that segment: one with a 0 in that key and another with a 1 in it. Thus, Sortnet starts with N linked-lists with each list containing only two zero/one cases.

Treating each comparator

Sortnet treats the comparators in the CE-list one at a time. To treat comparator Lo:Hi, Sortnet checks whether K[Lo] and K[Hi] are in the same segment of the poset or in two different segments:

• If K[Lo] and K[Hi] are in the same segment of the poset, then Sortnet runs through all zero/one cases for that segment and whenever it finds a case where K[Lo] = 1 and K[Hi] = 0 it swaps the two values.

• If K[Lo] and K[Hi] are in two different segments of the poset, then Sortnet uses an outer loop and an inner loop to combine every zero/one case of one segment with every zero/one case of the other segment. Whenever it finds a combination where K[Lo] = 1 and K[Hi] = 0 it swaps the two values. The two segments are now combined into one segment containing both K[Lo] and K[Hi].
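The two situations above can be sketched as follows. This is an illustration rather than Sortnet's code: the names are hypothetical, each case is an integer whose bit i holds K[i], a Python dictionary keyed by the set of key indices stands in for the per-segment linked-lists, and a Python set removes duplicates instead of the hashing scheme Sortnet uses.

```python
def initial_segments(n):
    """N one-key segments, each with the two cases K[i] = 0 and K[i] = 1."""
    return {frozenset([i]): {0, 1 << i} for i in range(n)}

def treat_comparator(segments, lo, hi):
    """Treat CE lo:hi on {frozenset(keys): set(cases)}, in place."""
    seg_lo = next(s for s in segments if lo in s)
    seg_hi = next(s for s in segments if hi in s)
    combined = segments.pop(seg_lo)
    if seg_hi != seg_lo:
        # Outer and inner loop: every case of one segment is combined
        # with every case of the other segment.
        other = segments.pop(seg_hi)
        combined = {a | b for a in combined for b in other}
    swap = (1 << lo) | (1 << hi)
    result = set()
    for case in combined:
        if ((case >> lo) & 1) and not ((case >> hi) & 1):
            case ^= swap              # K[lo] = 1, K[hi] = 0: swap the two bits
        result.add(case)
    segments[seg_lo | seg_hi] = result

segments = initial_segments(4)
for lo, hi in [(0, 1), (2, 3)]:
    treat_comparator(segments, lo, hi)
after_step_1 = sorted(len(c) for c in segments.values())
for lo, hi in [(0, 2), (1, 3)]:
    treat_comparator(segments, lo, hi)
after_step_2 = sorted(len(c) for c in segments.values())
```

On four keys, the CEs 0:1 and 2:3 leave two segments with three cases each, and the CEs 0:2 and 1:3 then merge them into a single segment with six cases.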

Eliminating duplicate cases

Hashing is used to eliminate duplicate cases. The new cases generated when a

comparator is treated are temporarily stored in P separate linked-lists, where P is a

. Case C is treated like an integer and stored in linked-list C mod P

only if it's not the same as any other case in the same linked-list.
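A minimal sketch of this duplicate check is shown below. The value of P is an assumption chosen only for the example, Python lists stand in for the linked-lists, and the function name is hypothetical.

```python
P = 101   # assumed number of lists; the value used by Sortnet is not stated here

def insert_case(buckets, case):
    """Store `case` in list number (case mod P) unless it is already there."""
    bucket = buckets[case % P]
    if case in bucket:
        return False                  # duplicate case: dropped
    bucket.append(case)
    return True

buckets = [[] for _ in range(P)]
# 0b0011 is offered twice; 0b0101 and 0b0101 + P share a list but differ.
added = [insert_case(buckets, c) for c in (0b0011, 0b0101, 0b0011, 0b0101 + P)]
```

Only the cases that land in the same list have to be compared with one another, which keeps the duplicate check short when the cases spread evenly over the P lists.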

4.5 An Example- Treating a 32-Key Network

To illustrate how Sortnet limits the number of cases it handles, we use the CE-list depicted in Fig. 4.8 that forms a single-segment poset of 32 keys.

Initially each key is in a separate segment, so there are only 2*32 = 64 cases, in 32 linked-lists, instead of 2^32 = 4,294,967,296 cases.

The 16 comparators in Step 1 form 16 linked-lists with three cases in each list; 3*16 = 48 cases instead of 3^16 = 43,046,721. The corresponding key-poset has 16 2-key segments.

The 16 comparators in Step 2 form 8 linked-lists with six cases in each list; 6*8 = 48 cases instead of 6^8 = 1,679,616. The corresponding key-poset has 8 4-key segments.

The 16 comparators in Step 3 form 4 linked-lists with twenty cases in each list; 20*4 = 80 cases instead of 20^4 = 160,000. The corresponding key-poset has 4 8-key segments.

The 16 comparators in Step 4 form two linked-lists with 168 cases in each list; 168*2 = 336 cases instead of 168^2 = 28,224. The corresponding key-poset has 2 16-key segments.

The 16 comparators in Step 5 form one linked-list with 7,581 cases. The corresponding key-poset is a 32-key single-segment poset.

Fig. 4.8 A CE-list that forms a single-segment poset of 32 keys in 5 steps.
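The case counts quoted in this example can be reproduced with a short simulation. Since Fig. 4.8 is not reproduced in the text, the CE-list below is an assumption: step k is taken to compare key i with key i + 2^(k-1) for every i whose bit k-1 is zero. The segment bookkeeping is a sketch with hypothetical names, using integers as cases (bit i holds K[i]) and Python sets in place of linked-lists.

```python
def treat_step(segments, ces):
    """Apply one step of CEs to {frozenset(keys): set(cases)}, in place."""
    for lo, hi in ces:
        seg_lo = next(s for s in segments if lo in s)
        seg_hi = next(s for s in segments if hi in s)
        cases = segments.pop(seg_lo)
        if seg_hi != seg_lo:
            other = segments.pop(seg_hi)
            cases = {a | b for a in cases for b in other}
        swap = (1 << lo) | (1 << hi)
        segments[seg_lo | seg_hi] = {
            c ^ swap if ((c >> lo) & 1) and not ((c >> hi) & 1) else c
            for c in cases
        }

N = 32
segments = {frozenset([i]): {0, 1 << i} for i in range(N)}
history = []
for k in range(5):
    d = 1 << k
    step = [(i, i + d) for i in range(N) if not i & d]   # assumed CE-list
    treat_step(segments, step)
    history.append((len(segments), {len(c) for c in segments.values()}))

for number, (lists, sizes) in enumerate(history, start=1):
    print(f"Step {number}: {lists} list(s), cases per list {sorted(sizes)}")
```

Under this assumed CE-list the simulation yields 3, 6, 20, 168 and 7,581 cases per list after steps 1 through 5, matching the figures given above.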

CHAPTER 5

A Three-Phase Technique for Designing Faster Sorting Networks

This chapter describes a heuristic technique for designing faster Sorting networks using Sortnet. This technique consists of three phases: first designing the single-segment poset, then designing the intermediate steps, and lastly finalizing the network [Al-Haj Baddar, Batcher 2009 b]. Sections 5.1 through 5.3 describe the three phases of this technique, and Section 5.4 gives some remarks on it.

5.1 Phase I: Designing a Single-Segment Poset

The initial steps of a typical N-key Sorting network are usually spent on designing a single-segment poset of the N keys. One way to design such a poset is to design a multi-segment poset of the N keys, then spend one step to combine the multiple segments into a single-segment poset. Hence, this phase aims at designing a single-segment poset that preserves as much information as possible.

Designing an N-key multi-segment poset where N is not a power of 2 can be problematic. Consider, for example, designing an 18-key multi-segment poset. Such a poset can be designed using three 6-key segments, two 9-key segments, or an 8-key segment with a 10-key segment. Each of these methods needs 3 steps. Hence, we must decide which one is the best to use.

Using Sortnet is one way to help make this decision. After generating each of these multi-segment posets using Sortnet, we can examine the resulting number of zero/one cases and use the poset with the smallest number of cases. Experimentation showed that the multi-segment posets that have relatively fewer zero/one cases are more promising.

Thus, to decide which multi-segment poset is more promising for the 18-key network, let’s use Sortnet to determine the number of zero/one cases associated with each possible poset. The poset with three 6-key segments has 1331 zero/one cases, while the poset with two 9-key segments has 2905 cases. The last multi-segment poset, with one 8-key segment and one 10-key segment, has 1140 cases. Hence, the last poset is the most promising multi-segment poset for 18 keys. It was in fact used in designing the 11-step 18-key network described in Chapter 6.

After designing the multi-segment poset, we must combine the keys in these segments to generate a single-segment poset. Different ways exist for accomplishing such a task, among which is to combine the keys in the multi-segment poset in a way that preserves as much information as possible. Experimentation showed that the information preserving comparisons tend to result in relatively fewer zero/one cases. Hence, Sortnet can help the network designer select the CEs that combine the keys of a multi-segment poset to design the most promising single-segment poset.

Many ways exist to pair the keys in the two-segment poset with 1140 cases and generate a single-segment poset in step 4. Experimentation showed that applying 9 CEs, 4 of which compare keys in the two different segments, helped design a faster 18-key network [Al-Haj Baddar, Batcher 2009 a]. This network is described in Chapter 6.

Designing a single-segment poset for an N-key network where N is a power of 2 is much easier. Consider, for example, designing a single-segment poset for a 16-key network. In this case, step one combines the keys in pairs, resulting in an 8-segment poset. Step two combines the two-key segments from step one in pairs, resulting in a 4-segment poset in which each segment has 4 keys. The third step combines the 4-key segments from step two in pairs, resulting in two 8-key segments. Finally, step 4 combines the corresponding keys of these two segments, resulting in a 16-key single-segment poset. Thus, the decisions on how to combine the keys are quite obvious when N is a power of two.

5.2 Phase II: Designing the Intermediate Steps

After obtaining a single-segment poset of N keys, the poset might be hard to work with. This suggests that a different strategy for selecting the CEs should be used. Sortnet provides the SHOW.SHMOO and SHOW.BESTCE commands, which can help select the pairs of CEs to use next. To design each step in this phase, the designer selects the CEs suggested by the BESTCE command, adds these CEs to the CE-list, and views the resulting Shmoo charts. Here, the designer follows a greedy technique in order to improve the patterns in the generated Shmoo charts, i.e. to generate Shmoo charts where more adjacent keys have similar numbers of dashes in their rows. The aim of this phase is to generate a Shmoo chart with M pairs of keys, such that for the majority of the pairs, the two keys within the pair have exactly the same number of dashes in their rows. Such a Shmoo chart is said to have a staircase pattern.

Hence, the designer may select the CEs that appear in the lower lines of the BESTCE command, to remove more dashes, if this strategy results in Shmoo charts with better patterns. Alternatively, they may select the CEs in the upper lines of the BESTCE command, to modify more cases, if this improves the patterns in the Shmoo charts. The designer can alternate between selecting the upper and lower lines of the BESTCE command, as they design several steps, until a Shmoo chart with a staircase pattern is generated.

Experimentation showed that comparing some neighbor keys in the Shmoo chart, in this phase, may help find Shmoo charts with better patterns. Thus, the network designer can alternate between the different CE categories provided by the SHOW.BESTCE command and/or use the Shmoo chart to find good patterns.

Backtracking to an earlier step in this phase is possible, using the CUT.CE.STEPS command, if the designer cannot generate a Shmoo chart with a staircase pattern by the end of phase II. This allows the designer to use a different strategy for selecting the CEs. The number of zero/one cases can help decide whether a given Shmoo chart is better than another. In general, CEs that result in relatively fewer cases tend to generate Shmoo charts with better patterns, especially in the early steps of phase II. If repeated trials in phase II don’t help the designer generate a Shmoo chart with a staircase pattern, then they might backtrack to phase I and try to find a better single-segment poset.

While designing a given step, the BESTCE command might suggest conflicting CEs, i.e. two or more CEs that share at least one key-index. One way to resolve this conflict is to select some cases using the SEL.CASES command and consider the SHOW.BESTCE.SEL CEs to resolve the conflict. Then, use the regular BESTCE command to select the rest of the CEs that can be accommodated within the designated step.

5.3 Phase III: Finalizing the Network

After generating a Shmoo chart with a staircase pattern, CEs that compare each two adjacent keys in the chart are added to the CE-list. This phase aims at generating a dash-free Shmoo chart.

Comparing adjacent keys in the Shmoo chart is repeated until the Shmoo chart becomes dash-free within the designated number of steps. If it turns out that extra steps are needed, then the network designer may backtrack to phase II and generate a Shmoo chart with a better staircase pattern.

5.4 Remarks on the Three-Phase Technique

Consider designing a p-step N-key Sorting network according to the three-phase technique described in this chapter. Experimentation showed that better results were obtained when about log N steps were spent in phase I. Let q denote the number of steps taken by phases II and III, so that q = p – log N. Experimentation with N-key networks, where N ≤ 32, showed that no more than half of the q steps were spent on phase III. Hence, determining p ahead of time helps the designer estimate the number of steps each of the three phases will take. This relieves the designer from exploring many unpromising CE-lists, i.e. CE-lists that will eventually require much more than p steps to sort the input. To help explain the three-phase technique, an illustrative example, in which we consider an attempt to design an 11-step 18-key network, is described in Chapter 6.
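The step estimate can be written out as a small calculation. The sketch below rests on two assumptions that the text does not state explicitly: the logarithm is taken to base 2, and "about log N" is read as log N rounded to the nearest integer. The function name is hypothetical.

```python
import math

def step_budget(n_keys, total_steps):
    """Rough split of a p-step design into phase I, phases II+III, and a phase III cap."""
    phase1 = round(math.log2(n_keys))   # assumed reading of "about log N"
    q = total_steps - phase1            # steps left for phases II and III
    phase3_max = q // 2                 # no more than half of q
    return phase1, q, phase3_max

print(step_budget(18, 11))   # 18 keys, 11 steps
print(step_budget(22, 12))   # 22 keys, 12 steps
```

For an 11-step 18-key design this gives 4 steps for phase I, 7 steps for phases II and III, and at most 3 steps for phase III.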

Experimentation also showed that designing a better single-segment poset implies a smoother phase II. Thus, if the single-segment poset preserves more information, then the outputs of the BESTCE command will help generate a Shmoo chart with a better staircase pattern relatively easily. Moreover, using the three-phase technique facilitated discovering several promising networks. These networks include: 20-key, 24-key, and 26-key networks, that are as fast as the merge-sorting networks but with much fewer CEs in the last step.

CHAPTER 6

An 11-Step Network for Sorting 18 Keys

This chapter describes an 11-step network for sorting 18 keys [Al-Haj Baddar, Batcher 2009 a] in Sections 6.1 through 6.3. The previously known fastest network for 18 keys used merge-sorting and required 12 steps. Section 6.4 then gives an illustrative example of how to apply the three-phase technique to find a faster sorting network for 18 keys. Figure 6.1 illustrates the 11-step network and Fig. 6.2 depicts its CE-list.

Fig. 6.1. The Knuth diagram of the 11-step network for sorting 18 keys.

/* STEP 1 */ 0:1 2:3 4:5 6:7 8:9 10:11 12:13 14:15 16:17

/* STEP 2 */ 0:2 1:3 4:6 5:7 8:10 9:11 12:17 13:14 15:16

/* STEP 3 */ 0:4 1:5 2:6 3:7 9:10 8:12 11:16 13:15 14:17

/* STEP 4 */ 7:16 6:17 3:5 10:14 11:12 9:15 2:4 1:13 0:8

/* STEP 5 */ 16:17 7:14 5:12 3:15 6:13 4:10 2:11 8:9 0:1

/* STEP 6 */ 1:8 14:16 6:9 7:13 5:11 3:10 4:15

/* STEP 7 */ 4:8 14:15 5:9 7:11 1:2 12:16 3:6 10:13

/* STEP 8 */ 5:8 11:14 2:3 12:13 6:7 9:10

/* STEP 9 */ 7:9 3:5 12:14 2:4 13:15 6:8 10:11

/* STEP 10 */ 13:14 11:12 9:10 7:8 5:6 3:4

/* STEP 11 */ 12:13 10:11 8:9 6:7 4:5

Fig. 6.2 The CE-list of the 18-key network.


6.1 Phase I: Designing a Single-segment Poset in Steps 1 through 4

The first two steps generate a 4-segment poset. Three out of the 4 segments contain 4 keys each and the last contains 6 keys. The third step combines two 4-key segments to form a segment of 8 keys and combines the two remaining segments to form a segment of 10 keys. Hence, step 3 results in a two-segment poset with one segment containing 8 keys and the other containing 10 keys. Figure 6.3 describes this two-segment poset [Al-Haj Baddar, Batcher 2009 a].

Fig. 6.3 The two-segment poset obtained after applying step 3.
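The number of zero/one cases of this two-segment poset can be checked with a short simulation of steps 1 through 3. The CE-list below is read from Fig. 6.2; the segment bookkeeping is a sketch with hypothetical names, using integers as cases (bit i holds K[i]) and Python sets in place of Sortnet's linked-lists.

```python
import math

def treat_step(segments, ces):
    """Apply one step of CEs to {frozenset(keys): set(cases)}, in place."""
    for lo, hi in ces:
        seg_lo = next(s for s in segments if lo in s)
        seg_hi = next(s for s in segments if hi in s)
        cases = segments.pop(seg_lo)
        if seg_hi != seg_lo:
            other = segments.pop(seg_hi)
            cases = {a | b for a in cases for b in other}
        swap = (1 << lo) | (1 << hi)
        segments[seg_lo | seg_hi] = {
            c ^ swap if ((c >> lo) & 1) and not ((c >> hi) & 1) else c
            for c in cases
        }

steps = [   # steps 1 to 3 of the 18-key network, as read from Fig. 6.2
    [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9), (10, 11), (12, 13), (14, 15), (16, 17)],
    [(0, 2), (1, 3), (4, 6), (5, 7), (8, 10), (9, 11), (12, 17), (13, 14), (15, 16)],
    [(0, 4), (1, 5), (2, 6), (3, 7), (9, 10), (8, 12), (11, 16), (13, 15), (14, 17)],
]
segments = {frozenset([i]): {0, 1 << i} for i in range(18)}
for ces in steps:
    treat_step(segments, ces)

sizes = {len(keys): len(cases) for keys, cases in segments.items()}
total = math.prod(len(cases) for cases in segments.values())
print(sizes, total)
```

The simulation finds 20 cases for the 8-key segment and 57 cases for the 10-key segment. Because the two segments are independent, the poset as a whole has 20 * 57 = 1140 zero/one cases, the figure quoted in Section 5.1.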

In step 4, a single-segment poset is formed. It is crucial in this phase, to try to preserve as much information as possible. Due to the fact that the poset becomes hard to work with after step 4, Shmoo charts will be used to illustrate the effect of adding CEs to the network. Fig. 6.4(a) depicts the Shmoo chart obtained after step 4.

6.2 Phase II: Generating a Shmoo Chart with a Staircase Pattern in Steps 5 through 9

The strategy used for comparing keys in step 5 is to try to modify as many cases as possible by selecting the CEs in the top-most lines of the BESTCE command. Figure 6.4(b) illustrates the Shmoo chart generated from applying the CEs of step 5.

Fig. 6.4(a). The Shmoo chart after step 4. Fig. 6.4(b). The Shmoo chart after step 5.

Experimentation showed that using the BESTCE command to remove as many dashes as possible in steps 6 and 7 helped find the 11-step solution. Hence, the CEs in the lowest lines of the BESTCE command were used to design steps 6 and 7. Figure 6.5 illustrates the Shmoo charts generated from applying the CEs of these two steps.

Fig. 6.5(a). The Shmoo chart after step 6. Fig. 6.5(b). The Shmoo chart after step 7.

Experimentation showed that removing the most dashes in steps 8 and 9 also helped find the 11-step network. Hence, the CEs in the lowest lines of the BESTCE command were used to design steps 8 and 9. Fig. 6.6 depicts the Shmoo charts of these two steps.


Fig. 6.6(a). The Shmoo chart after step 8. Fig. 6.6(b). The Shmoo chart after step 9.

It can be noticed that the Shmoo chart patterns improved in steps 5 through 8 and that a Shmoo chart with a staircase pattern was generated after step 9.

6.3 Phase III: Finalizing the Network in Steps 10 and 11

It is obvious that the Shmoo chart illustrated in Fig. 6.6(b) has a staircase pattern consisting of 6 pairs of keys. Consequently, the tenth step compares each two keys within a pair. The resulting Shmoo chart is described in Fig. 6.7(a). This figure shows that all the even zero/one cases were sorted by the end of step 10. It also shows that the even cases in classes Z4 through Z14 bracket the odd cases in classes Z5 through Z13. Hence, step 11 adds the CEs necessary for sorting the remaining odd cases, which generates the dash-free Shmoo chart illustrated in Fig. 6.7(b).

Fig. 6.7(a). The Shmoo chart after step 10. Fig. 6.7(b). The Shmoo chart after step 11.


As a by-product of designing this 18-key network, faster N-key networks can be designed where N is a multiple of 18. For example, a faster 36-key network can be designed using odd-even merging, which was described in Section 3.3. This network uses the 11-step network described here to generate two sorted lists of 18 keys, and then applies odd-even merging to generate the sorted output. The new 36-key network requires 17, instead of 18, steps. A faster 11-step network for 17 keys can be easily designed as well. The previously known fastest network that sorts 17 keys was obtained from the 12-step 18-key network by removing either its top or bottom key. In a similar fashion, an 11-step network for 17 keys can be designed. Moreover, faster N-key networks, where N is a multiple of 17, can be designed using odd-even merging.
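The step counts quoted for the merged networks can be reproduced with a one-line formula. The sketch assumes that odd-even merging of two sorted n-key lists takes ceil(log2(2n)) steps; this assumption is consistent with the 17-step and 18-step figures above but is not stated in this section, and the function name is hypothetical.

```python
import math

def merged_network_steps(n_keys, sorter_steps):
    """Steps of a 2n-key sorter built from two n-key sorters and one odd-even merge."""
    merge_steps = math.ceil(math.log2(2 * n_keys))   # assumed depth of the merge
    return sorter_steps + merge_steps

print(merged_network_steps(18, 11))   # built on the 11-step 18-key network
print(merged_network_steps(18, 12))   # built on the earlier 12-step network
```

For 36 keys the calculation gives 17 steps when the 11-step 18-key network is used and 18 steps when the 12-step network is used.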

6.4 An Illustrative Example

This section gives an example of how to apply the three-phase technique to help find faster Sorting networks. Here, we consider an attempt to design an 11-step 18-key network. The network designer decides to do the following:

Phase I: construct a single-segment poset in steps 1 through 4. To design this poset, the designer decides to use the single-segment poset described in Section 6.1.

Phase II: generate a Shmoo chart with a staircase pattern in steps 5 through 9. The designer decides to use the CEs in the lowest lines of the BESTCE command in steps 5, 7, and 9, and the CEs in the top-most lines of the same command in steps 6 and 8, thereby alternating between removing the most dashes and modifying the most cases. Accordingly, the CE-list in Fig. 6.8 is generated.

/* STEP 5 */ 2:9 5:14 3:15 4:10 1:8 7:17 6:12 11:13

/* STEP 6 */ 16:17 0:1 5:13 6:9 4:11 12:15 3:10 7:8

/* STEP 7 */ 3:11 10:12 6:7 8:13 5:9 2:4 14:15

/* STEP 8 */ 7:11 8:10 4:6 13:14 3:5 9:12 1:2 15:16

/* STEP 9 */ 5:8 9:11 2:4 14:15 6:7 10:13

Fig. 6.8 The CE-list of steps 5 through 9 of the experimental 18-key network.

Phase III: finalizing the network in steps 10 and 11. The pattern in the Shmoo chart generated after step 9, illustrated in Fig. 6.9(a), is quite close to a staircase pattern. After entering the CEs that correspond to step 10, the designer notices that it will take more than one step to generate the dash-free Shmoo chart. Figure 6.9(b) illustrates the Shmoo chart generated after step 10. The designer, however, continues as planned and enters the CEs that correspond to step 11 to generate the Shmoo chart depicted in Fig. 6.9(c).

(a) after step 9 (b) after step 10 (c) after step 11

Fig. 6.9 The Shmoo charts of the steps 9, 10, and 11 of the experimental 18-key network.

It is obvious that a twelfth step is needed to finish the sorting task, but only two CEs are needed there. Hence, the designer realizes that the current network is a promising one, and decides to backtrack to phase II. Fig. 6.10 describes the CE-list of steps 10 and 11.

/* STEP 10 */ 12:13 10:11 8:9 5:7 3:6

/* STEP 11 */ 13:14 11:12 9:10 7:8 5:6 3:4

Fig. 6.10 The CE-list of steps 10 and 11 of the experimental 18-key network.

The designer realizes that a different strategy must be used. In fact, several other strategies for designing phase II can lead the designer to an 11-step solution. For instance, if the designer decides to modify the most cases in step 5 and remove as many dashes as possible in steps 6 through 9, then they will obtain the 11-step network described in this chapter.

CHAPTER 7

A 12-Step Network for Sorting 22 Keys

This chapter describes a 12-step solution for sorting 22 keys, in Sections 7.1 through 7.3 [Al-Haj Baddar, Batcher 2009 b]. The previously known fastest network for sorting 22 keys used merge-sorting and required 13 steps. Figure 7.1 illustrates the 12-step network and its corresponding CE-list is depicted in Fig. 7.2.

Fig. 7.1 The Knuth diagram of the 12-step network for sorting 22 keys.

/* STEP 1 */ 0:1 2:3 4:5 6:7 8:9 10:11 12:13 14:15 16:17 18:19 20:21

/* STEP 2 */ 2:4 1:3 0:5 6:8 7:9 10:12 11:13 14:16 15:17 18:20 19:21

/* STEP 3 */ 6:10 7:11 8:12 9:13 14:18 15:19 16:20 17:21 3:5 1:4 0:2

/* STEP 4 */ 9:17 7:15 11:19 8:16 3:12 0:10 1:18 5:20 13:21 6:14 2:4

/* STEP 5 */ 0:7 17:20 3:15 9:18 2:11 4:16 5:10 1:8 12:19 13:14

/* STEP 6 */ 20:21 0:6 3:8 12:18 2:13 14:16 5:9 10:15 4:7 11:17

/* STEP 7 */ 16:20 18:19 15:17 12:14 10:11 7:9 8:13 4:5 1:3 2:6

/* STEP 8 */ 19:20 16:17 15:18 11:14 9:13 10:12 7:8 3:5 4:6 1:2

/* STEP 9 */ 18:19 14:16 13:15 11:12 8:9 5:10 6:7 2:3

/* STEP 10 */ 17:19 16:18 14:15 12:13 9:11 8:10 5:7 3:6 2:4

/* STEP 11 */ 17:18 15:16 13:14 11:12 9:10 7:8 5:6 3:4

/* STEP 12 */ 16:17 14:15 12:13 10:11 8:9 6:7 4:5

Fig. 7.2 The CE-list of the 22-key network.


7.1 Phase I: Designing a Single-segment Poset in Steps 1 through 4

Several ways exist for designing a 22-key single-segment poset. For example, one could design a two-segment poset that consists of a 10-key segment together with a 12-key segment, or a two-segment poset that consists of two 11-key segments. Experimentation showed, however, that designing a 22-key single-segment poset using one 6-key segment together with two 8-key segments helped find a faster solution for this problem. The first 3 steps of the network construct the 3-segment poset illustrated in Fig. 7.3 [Al-Haj Baddar, Batcher 2009 b].

Fig. 7.3 The 3-segment poset obtained after applying the first 3 steps of the network illustrated in Fig. 7.2.

The CEs of step 4 transform the 3-segment poset into a single-segment poset. One criterion that is crucial for designing a faster Sorting network at this phase is to preserve as much information as possible. The poset becomes hard to work with after step 4. Thus, Shmoo charts will be used starting from step 4 to illustrate the effect of adding more CEs to the network. Figure 7.4(a) illustrates the Shmoo chart generated after step 4.

7.2 Phase II: Generating a Shmoo Chart with a Staircase Pattern in Steps 5 through 9

The SHOW.BESTCE command was used to help design step 5 of this network. Experimentation showed that selecting the CEs that removed the most dashes from the Shmoo chart helped find a faster 22-key network. Hence, the CEs in the lowest lines of the BESTCE command were used. The corresponding Shmoo chart is illustrated in Fig. 7.4(b).

Fig. 7.4(a) The Shmoo chart after step 4. Fig. 7.4(b) The Shmoo chart after step 5.

Inspection of the Shmoo chart in Fig. 7.4(b) suggested adding 0:6 and 20:21 to the CEs of step 6. The rest of the CEs were the ones that removed the most dashes, and hence appeared in the lowest lines of the BESTCE command. Fig. 7.5(a) illustrates the Shmoo chart generated after applying the CEs of step 6.

Fig. 7.5(a) The Shmoo chart after step 6. Fig. 7.5(b) The Shmoo chart after step 7.


The top-most and bottom-most pairs of keys in the Shmoo chart generated after step 6 were compared in step 7. The rest of the CEs in this step were the ones that modified the most cases, and hence appeared in the top-most lines of the BESTCE command. This resulted in the Shmoo chart depicted in Fig. 7.5(b).

Experimentation showed that comparing the top-most three and the bottom-most three pairs of keys in the Shmoo chart of step 7 helped find a Shmoo chart with better patterns. The rest of the CEs in step 8 were the ones that removed the most dashes according to the BESTCE command. The resulting Shmoo chart is depicted in Fig. 7.6(a), which suggested adding the CEs 18:19 and 2:3 to the CEs in step 9. The remaining CEs in this step eliminated the most dashes using the lowest lines of the BESTCE command. The corresponding Shmoo chart is illustrated in Fig. 7.6(b).

Fig. 7.6(a) The Shmoo chart after step 8. Fig. 7.6(b) The Shmoo chart after step 9.

7.3 Phase III: Finalizing the Network in Steps 10 through 12

The Shmoo chart obtained after step 9 shows a staircase pattern. Thus, each two keys within a pair were compared to form step 10. The resulting Shmoo chart is illustrated in Fig. 7.7(a). Step 11 was designed in a similar fashion, and its corresponding Shmoo chart is depicted in Fig. 7.7(b). This figure shows that all even zero/one cases were sorted by the end of step 11. It also shows that the even zero/one cases in classes Z4 up to Z18 bracket the odd cases in classes Z5 up to Z17. Hence, adding the CEs of step 12 sorted all the remaining odd cases and generated the dash-free Shmoo chart depicted in Fig. 7.7(c).

(a) after step 10 (b) after step 11 (c) after step 12

Fig. 7.7 The last three steps of the 22-key network.

As a by-product of designing this 22-key network, faster N-key networks can be designed where N is a multiple of 22. For example, a faster 44-key network can be designed using odd-even merging as described in Section 3.3. This network uses the 12-step network described here to generate two sorted lists of 22 keys, and then applies odd-even merging to generate the sorted output. The new 44-key network requires 18, instead of 19, steps. Another network that can be designed easily, using this 12-step network, is a faster network for 21 keys. The previously known fastest network that sorts 21 keys was obtained from the 13-step 22-key network by removing either its top or bottom key. In a similar fashion, a 12-step network for 21 keys can be designed. Moreover, faster N-key networks, where N is a multiple of 21, can be designed using odd-even merging.

CHAPTER 8

Conclusions and Future Work

This chapter summarizes the issues discussed in this dissertation. Section 8.1 presents a set of conclusions, while Section 8.2 highlights some future work.

8.1 Conclusions

According to the discussions in Chapter 3, it is vital not to ignore constants when designing algorithms with poly-logarithmic complexity.

A zero/one case is a sequence of N binary keys. Using zero/one cases to help design Sorting networks simplifies the sorting task, as illustrated in Section 2.5. Additionally, it helps track the progress of the sorting visually using the Shmoo chart tool, as described in Section 4.3.

Renaming, as defined in Section 2.7, can help analyze and better understand the behavior of Sorting networks.

Software tools, like Sortnet, which was described in Chapter 4, can help synthesize and analyze Sorting networks.

This dissertation introduces a technique for designing faster Sorting networks using Sortnet. The technique consists of three phases:

o Phase I, which aims at designing a single-segment poset that preserves as much information as possible.

o Phase II, which aims at generating a Shmoo chart with a staircase pattern, as defined in Section 5.2. In this phase, the network designer uses the SHOW.BESTCE and the SHOW.SHMOO commands to help generate the designated Shmoo chart.

    • The designer can remove the most dashes, modify the most cases, and/or compare neighbor keys in the Shmoo chart, in order to generate the designated Shmoo chart.

    • If the designer fails to find a Shmoo chart with a staircase pattern by the end of phase II, then they can backtrack to earlier steps in this phase and apply different strategies for selecting the CEs.

    • If repeated attempts to generate the designated Shmoo chart fail, then the designer can backtrack to phase I and try to find a better single-segment poset.

o Phase III, in which, after generating the Shmoo chart with the staircase pattern, the designer compares each two keys within a pair. This strategy for comparing keys is repeated until a dash-free Shmoo chart is generated within the designated number of steps.

    • If such a Shmoo chart cannot be generated, then the designer will backtrack to phase II and try to find a Shmoo chart with a better staircase pattern.

This technique helped design two networks for 18 and 22 keys that are faster than the previously known networks for these two input sizes. Other faster N-key networks were obtained as a by-product, for some values of N.

8.2 Future Work

The future work includes:

Finding other faster N-key Sorting networks, where N ≤ 32, using the three-phase technique.

Applying renaming to the 18-key and 22-key networks, to better understand their behavior. Hopefully, this will help generalize these network designs to other larger values of N.

Using the newer version of Sortnet, Sortnet64, to find faster networks, where N ≤ 64, and examining how the three-phase technique can help design such networks.

Investigating the possible ways for fully automating the three-phase technique.

APPENDIX A

Proofs of Theorems

This appendix contains the proofs of Theorems 2-1, 2-2, and 2-3.

A.1 Theorem 2-1 : The Zero/One Principle

If a given N-key sorting network fails to sort some arbitrary sequence of N keys, then it will also fail to sort at least one zero/one case of length N. Thus, let K = {k[0], k[1],…, k[N - 1]} be a sequence of N keys that a Sorting network S fails to sort. This implies that S rearranges K as it tries to sort it, producing a non-sorted sequence K’ = {k’[0], k’[1],…, k’[N - 1]}. Since K’ is unsorted, there exists at least one index j such that k’[j] > k’[j+1]. To generate a corresponding zero/one case that S fails to sort, let F be a function that maps the input set of N keys to the set {0,1}:

F(x) = 0, if x < k’[j]
F(x) = 1, if x ≥ k’[j]

Notice that F is an isotonic function, i.e. if x ≤ y then F(x) ≤ F(y). Let A be a zero/one case generated by replacing each k[i] in K with its corresponding F(k[i]) value, where i = 0, 1,…, N - 1. Hence, A = {F(k[0]), F(k[1]),…, F(k[N - 1])}. Let S receive A as input. Since F is isotonic, it commutes with the minimum and maximum that each CE computes, so the output of S when it receives A as input will be A’ = {F(k’[0]), F(k’[1]),…, F(k’[N - 1])}. But F(k’[j]) = 1 > F(k’[j+1]) = 0. Thus, A is a zero/one case of length N that S fails to sort.
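The construction used in this proof can be tried on a small example. The sketch below uses hypothetical function names and a deliberately incomplete 3-key network chosen only for illustration; it builds the zero/one case A from a sequence the network fails to sort.

```python
def run_network(ce_list, keys):
    """Apply the CEs in order; each CE lo:hi leaves the smaller key at lo."""
    keys = list(keys)
    for lo, hi in ce_list:
        if keys[lo] > keys[hi]:
            keys[lo], keys[hi] = keys[hi], keys[lo]
    return keys

def failing_zero_one_case(ce_list, keys):
    """Given keys the network fails to sort, build a zero/one case it also fails on."""
    out = run_network(ce_list, keys)
    j = next(i for i in range(len(out) - 1) if out[i] > out[i + 1])
    threshold = out[j]                      # k'[j] in the proof
    return [0 if x < threshold else 1 for x in keys]   # F applied to each key

network = [(0, 1), (1, 2)]    # illustrative 3-key network that is not a sorter
keys = [3, 2, 1]
case = failing_zero_one_case(network, keys)
print(run_network(network, keys), case, run_network(network, case))
```

The network turns 3, 2, 1 into 2, 1, 3, so j = 0 and the threshold is 2. The resulting zero/one case 1, 1, 0 is turned into 1, 0, 1, which is also unsorted.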


A.2 Theorem 2-2

The 9 rows in Table A.1 represent the keys with indices x and y in cases A and B before and after applying x:y, assuming that A ≤ B holds before applying x:y, i.e. a[x] ≤ b[x] and a[y] ≤ b[y]. By inspecting each of the possible outcomes, it can be verified that A ≤ B still holds after applying x:y.

Table A.1 The Proof of Theorem 2-2

Before applying x:y          After applying x:y
a[x]  a[y]  b[x]  b[y]       a[x]  a[y]  b[x]  b[y]
0     0     0     0          0     0     0     0
0     0     0     1          0     0     0     1
0     0     1     0          0     0     0     1
0     0     1     1          0     0     1     1
0     1     0     1          0     1     0     1
0     1     1     1          0     1     1     1
1     0     1     0          0     1     0     1
1     0     1     1          0     1     1     1
1     1     1     1          1     1     1     1
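The rows of Table A.1 can also be enumerated mechanically. The sketch below assumes that the relation between cases A and B means a[i] ≤ b[i] at every index, which is what the nine rows reflect. The function names are hypothetical.

```python
from itertools import product


def compare_exchange(first, second):
    """Model the CE x:y on two keys: the minimum goes to x, the maximum to y."""
    return min(first, second), max(first, second)


def theorem_2_2_rows():
    """List (before, after) tuples for every zero/one row with a[x] <= b[x] and a[y] <= b[y]."""
    rows = []
    for ax, ay, bx, by in product((0, 1), repeat=4):
        if ax <= bx and ay <= by:
            new_ax, new_ay = compare_exchange(ax, ay)
            new_bx, new_by = compare_exchange(bx, by)
            rows.append(((ax, ay, bx, by), (new_ax, new_ay, new_bx, new_by)))
    return rows


rows = theorem_2_2_rows()
# The relation must still hold after applying x:y to both cases.
still_holds = all(ax <= bx and ay <= by for _, (ax, ay, bx, by) in rows)
print(len(rows), still_holds)
```

The enumeration yields exactly 9 rows, and the relation holds in every row after the CE is applied.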

A.3 Theorem 2-3

It is assumed that A < B and C < D. The keys A and C are compared to find min(A, C) and max(A, C). Also, the keys B and D are compared to find min(B, D) and max(B, D). Now, we need to prove 4 inequalities:

1. min(A, C) < min(B, D); and

2. min(A, C) < max(A, C); and

3. min(B, D) < max(B, D); and

4. max(A, C) < max(B, D).


Table A.2 depicts the proof of the first inequality, by considering all possible outcomes of comparing A with C and B with D.

Table A.2 The Proof of min(A, C) < min(B, D)

Case               min(A, C)   min(B, D)   Inequality becomes   Proof
A < C and B < D    A           B           A < B                Follows from the assumptions.
A < C and D < B    A           D           A < D                A < C by min(A, C) and C < D by assumption. It follows that A < D.
C < A and B < D    C           B           C < B                C < A by min(A, C) and A < B by assumption. It follows that C < B.
C < A and D < B    C           D           C < D                Follows from the assumptions.

Table A.3 shows the proof of inequality 4, by considering all possible outcomes of comparing A with C and B with D.

Table A.3 The Proof of max(A, C) < max(B, D)

Case               max(A, C)   max(B, D)   Inequality becomes   Proof
A < C and B < D    C           D           C < D                Follows from the assumptions.
A < C and D < B    C           B           C < B                C < D by assumption and D < B by max(B, D). It follows that C < B.
C < A and B < D    A           D           A < D                A < B by assumption and B < D by max(B, D). It follows that A < D.
C < A and D < B    A           B           A < B                Follows from the assumptions.

Inequalities 2 and 3 follow from the definition of a CE as illustrated in Section 1.1.
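As a sanity check on the case analysis, the four inequalities can be tested exhaustively on four distinct keys. The sketch below is an illustration only, not part of the original proof. It assumes distinct keys, as the case analysis in Tables A.2 and A.3 does, and theorem_2_3_holds is a hypothetical name.

```python
from itertools import permutations


def theorem_2_3_holds(a, b, c, d):
    """Check the four inequalities after comparing A with C and B with D."""
    lo_ac, hi_ac = min(a, c), max(a, c)
    lo_bd, hi_bd = min(b, d), max(b, d)
    return (lo_ac < lo_bd          # inequality 1
            and lo_ac < hi_ac      # inequality 2
            and lo_bd < hi_bd      # inequality 3
            and hi_ac < hi_bd)     # inequality 4


checked = 0
for a, b, c, d in permutations(range(4)):
    if a < b and c < d:            # the assumptions of the theorem
        assert theorem_2_3_holds(a, b, c, d)
        checked += 1
print(checked)
```

Only the relative order of the four keys matters, so checking the 6 orderings of {0, 1, 2, 3} that satisfy A < B and C < D covers every case for distinct keys.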

REFERENCES

[Adams, Agrawal, and Seigel 1987] Adams G., Agrawal D., and Siegel H., (1987), "A

Survey and Comparison of Fault-Tolerant Multistage Interconnection Networks",

Computer, vol. 20, no. 6, June, pp. 14-27.

[Ajtai, Komlos, Szemerdi 1983] Ajtai M., Komlós J., Szemerédi E., (1983), "Sorting in

c log n Parallel Steps", Combinatorica, vol. 3, pp. 1-19.

[Akl 1997] Akl S., (1997), "Parallel Computation: Models and Methods", Prentice Hall,

Upper Saddle River, New Jersey, USA, pp. 35-76.

[Al-Haj Baddar, Batcher 2009 a] Al-Haj Baddar S., and Batcher K., (2009), "An 11-Step

Sorting Network for 18 Elements", Parallel Processing Letters, vol. 19, no. 1, pp.

97-104.

[Al-Haj Baddar, Batcher 2009 b] Al-Haj Baddar S., and Batcher K., (2009), "On Designing

Faster Sorting Networks", submitted to the ACM Symposium on Parallelism in

Algorithms and Architectures SPAA 2009, Calgary, Canada, August 11-13.

[Arthurs, Hui 1989] Arthurs E., and Hui Y., (1989), "Batcher-Banyan Packet Switch

with Output Conflict Resolution Scheme", http://www.patentstorm.us/patents/

4817084.html, US Patent.


[Batcher 1968] Batcher K., (1968), "Sorting Networks and their Applications", Spring

Joint Computer Conference, AFIPS Proc., vol. 32, April 30th - May 2nd, pp. 307-

314.

[Batcher, Al-Haj Baddar 2008] Batcher K., and Al-Haj Baddar S., (2008), "Sortnet: A

Program for Building Sorting Networks", Department of Computer Science, Kent

State University, Kent, Ohio, USA, TR-KSU-CS-2008-01.

[Birkhoff 1967] Birkhoff G., (1967), "Lattice Theory", American Mathematical Society,

Colloquium Publications, vol. 25, New York, USA, pp. 9-21.

[Brassard, Bratley 1996] Brassard G., and Bratley P., (1996), "Fundamentals of

Algorithmics", Prentice-Hall Inc., New Jersey, USA, pp. 402-408.

[Cole 1988] Cole R., (1988), "Parallel Merge Sort", SIAM Journal on Computing, vol.

17, no. 4, August, pp. 770-785.

[Floyd, Knuth 1967] Floyd R., and Knuth D., (1967), "Improved Construction for the

Bose-Nelson Sorting Problem", Notices of the American Mathematical Society,

vol. 14, p. 283.

[Floyd, Knuth 1973] Floyd R., and Knuth D., (1973), "The Bose-Nelson Sorting

Problem", A Survey of Combinatorial Theory, pp. 163-172.

[Feng 1981] Feng T., (1981), "A Survey of Interconnection Networks", Computer, vol.

14, no. 12, December, pp. 12-27.

[Gibson 2002] Gibson J., (2002), "The Communication Handbook", CRC Press, 2nd

Edition, pp. 40.10-40.12.


[Greenlaw, Hoover, Ruzzo 1995] Greenlaw R., Hoover J., and Ruzzo W., (1995),

"Limits to Parallel Computation: P-Completeness Theory", Oxford University

Press, USA, pp. 19-33.

[Jaja 1992] Jaja J., (1992), "An Introduction to Parallel Algorithms", Addison-Wesley

Publication Inc., USA, pp. 150-160.

[Juille, 1995] Juille H., (1995), "Incremental Co-Evolution of Organisms: A New

Approach for Optimization and Discovery Strategies", Lecture Notes in Computer

Science, vol. 929, pp. 246-260.

[Knuth 1998] Knuth D., (1998), "The Art of Computer Programming: Volume 3 Sorting

and Searching", Addison-Wesley Longman, USA, 2nd Edition, pp. 225-228.

[Leighton 1984] Leighton T., (1984), "Tight Bounds on the Complexity of Parallel

Sorting", in the Proceedings of the 16th Annual ACM Symposium on Theory of

Computing, New York, USA, April 30th - May 2nd, pp. 71-80.

[Lyles 1994] Lyles J., (1994), "Methods for Building Multi-bit Parallel Batcher/Banyan

Networks", www.patentstorm.us/patents/5327420-claims.html, US Patent.

[Natvig 1990] Natvig L., (1990), "Logarithmic Time Cost Optimal Sorting is Not Yet

Fast in Practice!", Proceedings of SUPERCOMPUTING-90 (IEEE / ACM), New

York, November, pp. 486-494.

[Paterson 1987] Paterson M., (1987), "Improved Sorting Networks with O(log N) Depth",

Dept. of Computer Science, University of Warwick, England, Res. Rep. RR.89.

[Preparata 1978] Preparata F., (1978), "New Parallel-Sorting Schemes", IEEE

Transactions on Computers, vol. 27, no. 7, pp. 669-673.


[Quinn 2003] Quinn M., (2003), "Parallel Programming in C with MPI and OpenMP",

McGraw-Hill Science/Engineering, New York, 1st Edition, pp. 27-62.

[Reif 1993] Reif H., (1993), "Synthesis of Parallel Algorithms", Morgan Kaufmann

Publishers Inc., San Francisco, CA, USA, 1st Edition, pp. 245-247.

[Rosen 2003] Rosen K., (2003), "Discrete Mathematics and its Applications", McGraw-

Hill Companies, USA, 5th Edition, pp. 520-525.

[Salloum, Perrie 1999] Salloum S., and Perrie A., (1999), "Fault Tolerance Analysis of

Odd-Even Transposition Sorting Networks", Proceedings of the IEEE Pacific Rim

Conference on Communications, Computers, and Signal Processing

(PACRIM99), Victoria, B.C. Canada, August 23rd - 25th, pp. 155-157.

[Van Voorhis 1971] Van Voorhis D., (1971), "A Generalization of the Divide-sort-

merge Strategy for Sorting Networks", Stanford University, CA, USA, CS-TR-

71-237

[Van Voorhis 1972] Van Voorhis D., (1972), "Efficient Sorting Networks", Stanford

University, CA, USA, Ph.D. Dissertation.