Design and Implementation of an FPGA-Based Scalable Pipelined Associative SIMD Array with Specialized Variations for Sequence Comparison and MSIMD Operation

A dissertation submitted to Kent State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

by Hong Wang November 2006

Dissertation written by Hong Wang
B.A., Lanzhou University, 1990
M.A., Kent State University, 2002
Ph.D., Kent State University, 2006

Approved by

___Robert A. Walker______, Chair, Doctoral Dissertation Committee
___Johnnie W. Baker______, Members, Doctoral Dissertation Committee
___Kenneth E. Batcher______
___Eugene Gartland, Jr.______

Accepted by

___Robert A. Walker______, Chair, Department of Computer Science
___John R. D. Stalvey______, Dean, College of Arts and Sciences

Table of Contents

1 INTRODUCTION...... 3

1.1 Overview of Associative Computing Research...... 5

1.2 Associative Computing ...... 7
1.2.1 Associative Search and Responder Processing ...... 7
1.2.2 Associative Reduction for Maximum / Minimum Value ...... 9

1.3 Applications of a Scalable Associative Processor ...... 10
1.3.1 Associative Computing for Relational Database Processing ...... 11
1.3.2 Associative Computing for Image Processing ...... 16
1.3.3 Associative Computing for String Matching ...... 19

1.4 Dissertation Organization ...... 22
1.4.1 Developing an Associative Processor ...... 22
1.4.2 Addition of Pipelining to Improve Performance ...... 23
1.4.3 Addition of a Reconfigurable Interconnection Network ...... 23
1.4.4 Supporting Multiple ISs for Increased Control Parallelism ...... 23

2 FIRST PROTOTYPES OF AN ASSOCIATIVE COMPUTING (ASC) PROCESSOR ...... 25

2.1 Basic ASC Processor Architecture...... 25

2.2 First (Incomplete) Prototype of a 4-PE ASC Processor ...... 27

2.3 Implementing a Scalable ASC Processor ...... 29
2.3.1 Implementing Associative Search and Responder Processing ...... 32
2.3.2 Implementing Associative Reduction for Maximum / Minimum Value ...... 34
2.3.3 Implementing a 1-D and 2-D PE Interconnection Network ...... 35

2.4 Performance of the First Scalable ASC Processor...... 36

2.5 Conclusion ...... 37

3 A SCALABLE PIPELINED ASC PROCESSOR WITH RECONFIGURABLE PE INTERCONNECTION NETWORK...... 38

3.1 Pipelined SIMD Array ...... 39
3.1.1 Architecture ...... 39


3.1.2 Pipeline Performance...... 44

3.2 Reconfigurable PE Interconnection Network ...... 44
3.2.1 Need for a Reconfigurable Network in Associative Computing ...... 45
3.2.2 Reconfigurable Network Architecture ...... 46
3.2.3 Reconfigurable Network Performance ...... 49

3.3 Conclusion ...... 49

4 A SPECIALIZED ASC PROCESSOR WITH RECONFIGURABLE 2D MESH FOR SOLVING THE LONGEST COMMON SUBSEQUENCE (LCS) PROBLEM... 50

4.1 Genome Sequence Comparison ...... 51
4.1.1 Genome Sequence Comparison By Finding the Longest Common Subsequence (LCS) ...... 51
4.1.2 Solving the Longest Common Subsequence (LCS) Problem Using a Specialized ASC Processor with Reconfigurable 2D Mesh ...... 52

4.2 PE Interconnection Network ...... 53
4.2.1 Coterie Network ...... 55
4.2.2 Modifying the ASC Processor’s Reconfigurable Network to Support the LCS Algorithm ...... 56

4.3 Solving the LCS Problem on the ASC Processor with Reconfigurable 2D Mesh ...... 57
4.3.1 ASC Processor Exact Match LCS Algorithm ...... 57
4.3.2 ASC Processor Approximate Match LCS Algorithm ...... 60

4.4 Conclusions...... 63

5 AN ASC PROCESSOR TO SUPPORT MULTIPLE INSTRUCTION ASSOCIATIVE COMPUTING (MASC) ...... 64

5.1 Multiple-Instruction-Stream MASC Processor ...... 65
5.1.1 Managing Tasks in the MASC Processor ...... 67
5.1.2 Task Manager (TM) ...... 68
5.1.3 Instruction Stream (IS) ...... 69
5.1.4 Selecting the TM or IS to Control Each PE ...... 70
5.1.5 Program Forking and Joining ...... 72

5.2 MASC Processor Performance...... 75

5.3 Conclusion ...... 75


6 CONCLUSIONS AND FUTURE WORK...... 76

6.1 Conclusions...... 76

6.2 Future Work...... 78


Table of Figures

Fig. 1 Scalable ASC (Associative Computing) Processor ...... 4
Fig. 2 Sample Student Database ...... 9
Fig. 3 Intersection and Union ...... 12
Fig. 4 Cartesian Product and Join ...... 13
Fig. 5 Convolution to Detect a Vertical Edge ...... 16
Fig. 6 String Matching using Associative Computing ...... 20
Fig. 7 A 4-PE ASC Processor Array ...... 26
Fig. 8 A Scalable ASC (Associative Computing) Processor ...... 30
Fig. 9 A Scalable ASC Processor ...... 31
Fig. 10 Network Operations ...... 34
Fig. 11 1-D and 2-D PE Interconnection Network in the ASC Processor ...... 35
Fig. 12 ASC Processor’s Pipelined Architecture ...... 40
Fig. 13 Parallel PE’s Pipelined Architecture ...... 41
Fig. 14 Structure of Data ...... 48
Fig. 15 (a) Coterie Network [13], and (b) for PE Interconnect and Bypass ...... 54
Fig. 16 Text String After Broadcast Along Column Buses ...... 58
Fig. 17 (a) PEs Set Bypass Switches Along Common Subsequences, and (b) PEs Identify the LCS ...... 59
Fig. 18 (a) Common Subsequence Separated by a Gap, and (b) PEs Set Bypass Switches to Span the Gap ...... 62
Fig. 19 Dynamic Allocation of Tasks to ISs ...... 66
Fig. 20 Task Manager and Instruction Stream (IS) State Machines ...... 66
Fig. 21 Task Manager (TM) and Instruction Stream (IS) Pools in the MASC Processor ...... 67
Fig. 22 PE structure ...... 71
Fig. 23 Program and TM / IS Allocation During Program Forking / Joining ...... 73
Fig. 24 Instruction Stream Tree Structure ...... 74


Abstract

Over the years a number of variations on associative computing have been explored. At Kent State University (KSU), associative SIMD computing has its roots at Goodyear Aerospace Corporation, but most recently has focused on exploring the power of the associative computing paradigm as compared to traditional SIMD, and even MIMD, computing. In contrast, Prof. Robert Walker’s research group at KSU has focused on implementing those associative concepts on a single chip by developing a new associative SIMD RISC processor, called the ASC (ASsociative Computing) Processor, using modern FPGA implementation techniques.

This dissertation describes the development of a working, scalable ASC Processor that is pipelined to improve performance, supports a reconfigurable network, can be specialized further for dedicated applications (e.g., genome processing), and can realize the multiple SIMD (MSIMD) paradigm by supporting multiple control units.

As a first step in this processor development, a Control Unit and a 4-PE Array developed previously by Master’s students were integrated into the first working ASC Processor. This processor was then modified to develop the first scalable ASC Processor, which was demonstrated on database processing and image processing.

However, the core of this dissertation research was the design and implementation of a pipelined scalable ASC Processor, which as far as we are aware is the first working single-chip pipelined SIMD associative processor, and perhaps the first single-chip pipelined SIMD processor in general. Compared to the first scalable ASC Processor, this pipelined processor not only gains the advantage of pipelining, but also has a faster clock frequency due to a more efficient implementation.


With that pipelined scalable ASC Processor as a base, two major architectural variations were explored. To support an innovative LCS algorithm for genome sequence comparison, the reconfigurable PE interconnection was modified with some features inspired by the coterie network, plus row and column broadcast buses. To better support control parallelism and increase PE utilization, a multiple-instruction-stream MASC Processor was developed, which dynamically assigns tasks to Processing Elements as the program executes.


1 Introduction

As ASICs (Application-Specific Integrated Circuits) and FPGAs (Field-Programmable Gate Arrays) continue to grow in capacity, a wide range of architectures are being developed to make effective use of the millions (soon to be billions) of transistors available on those chips. System-on-Chip (SoC) architectures are increasingly common, and Multiprocessor SoC (MPSoC) architectures [1] combine tens, possibly hundreds, of powerful processors, a network, and memory on a single chip.

This dissertation explores a third alternative: a SIMD (Single Instruction Stream, Multiple Data) Processor Array on a Chip (PASoC). In contrast to the small number of powerful, independent CISC (Complex Instruction Set Computer) processors supported by an MPSoC, the PASoC supports a large number (hundreds or thousands) of simple RISC (Reduced Instruction Set Computer) processors operating in lock-step SIMD fashion under the direction of a single control unit. While SIMD may have been out of fashion in the 1990s, such SIMD PASoCs are now commercially available [2] and packaged into systems for a variety of applications [3].

These SIMD PASoCs can be further augmented with support for associative computing [4]. Associative computing references memory by content rather than by address. In its simplest form, each memory cell is associated with a flag bit; if a search for key data is successful, this bit is flagged, leaving that memory cell to be processed further as appropriate by the algorithm.

1 Most of this chapter has been published in [6] and [7]. While I designed and implemented the scalable ASC Processor described in this dissertation (Chapter 2 in particular), my co-author Meiduo Wu on [6] proposed the database processing example described in Section 1.3.1 of this dissertation, and my co-author Lei Xie on [6] developed the 1-D and 2-D network described in Section 2.3.3 of this dissertation and implemented the image processing example described in Section 1.3.2. Ping Xu implemented the string matching example described in Section 1.3.3.


Fig. 1 Scalable ASC (Associative Computing) Processor

An associative SIMD PASoC [4, 5] adds associative computing to a SIMD PASoC, assigning a dedicated Processing Element (PE) to each set of memory cells. In associative SIMD computing, PEs search for a key in their local memories, and PEs whose search is successful are designated responders. Masked instructions can then be used to limit further SIMD processing to only those responders. This SIMD associative computing model has been explored by researchers at Kent State University (KSU) for over 30 years. Referred to as the ASC model [5, 8], it is particularly well-suited for applications such as data mining, image processing, and air traffic control.

Professor Robert Walker’s research group at KSU has implemented several prototypes of an FPGA-based associative SIMD PASoC, called the ASC Processor [6]. Like a traditional SIMD architecture, the ASC Processor has one Control Unit (CU) and an array of simple PEs, each containing an 8-bit processor and memory. The Control Unit decodes the instructions and broadcasts control signals to the PE array. Even a small million-gate FPGA can implement one CU and around 70 PEs, at clock speeds comparable to other processor cores on FPGAs.

All versions of the ASC Processor except for the initial non-scalable 4-PE prototype have been implemented using Altera® Quartus® II design software onto Altera APEX20K1000C FPGAs. APEX™ FPGAs range from 0.3 to 1.5 million gates, and are available in a variety of packages and speed grades (making it important to compare our ASC Processor cycle times to comparable processor designs on the same speed grade). Most of the ASC Processor variations described in this dissertation were implemented on the Altera APEX20K1000C FPGA, which contains approximately 1,000,000 gates, implemented as 38,400 Logic Elements (LEs) and 160 Embedded System Blocks (ESBs). While larger devices are available in the APEX family and in other Altera device families, this device was sufficient for my ASC Processor designs.

My ASC Processor designs were described in VHDL (Very High Speed Integrated Circuit Hardware Description Language) at the register-transfer level and compiled onto the APEX FPGA using Altera® Quartus® II design software. This software supports design entry, simulation, and timing analysis to aid in the design process, and programs the APEX FPGA by “compiling” its results onto that FPGA. All of my designs have been verified using the Quartus simulator.

1.1 Overview of Associative Computing Research

A number of variations on associative processing [5, 4] have been explored over the years. One variation [4, 9, 10] used an associative memory to locate data records by content rather than address, and provided various reduction operations via a combinational network. Another variation [11, 12] used SIMD techniques not only to locate multiple data records but also to process those data records in parallel. Still other variations [13] have explored the use of MSIMD (multiple SIMD) techniques to permit processing each data record in a different manner.

Usually oriented toward database or pattern matching of some kind, most early associative processors were implemented using very simple single-bit processors, necessitating bit-serial processing of multi-bit data words. However, some systems explored the use of wider processors (e.g., 4-bit or 8-bit processors) or more powerful processors.

At Kent State University (KSU), associative processing [5, 6] has its roots in associative system development at Goodyear Aerospace Corporation, in particular Goodyear’s STARAN [14] and ASPRO [15] computers. Those early Goodyear associative systems used TTL-based (Transistor-Transistor Logic based) single-bit processors, and were supported by programming languages specially designed to permit efficient SIMD-style associative processing. As various individuals moved from Goodyear to KSU, more recent work at KSU by Professor Johnnie Baker’s research group has continued to explore the power of the associative computing paradigm as compared to traditional SIMD, and even MIMD, computing [16], in particular for the air traffic control problem.

Complementing the work by Prof. Baker’s research group on associative models and algorithmic development, Prof. Robert Walker’s research group has been implementing those associative concepts by developing a new 8-bit associative SIMD RISC processor, called the ASC (ASsociative Computing) Processor, using modern FPGA implementation techniques. Early ASC Processor prototypes, described in Chapter 2 of this dissertation, were limited to only 4 Processing Elements (PEs) as a proof of concept, but the more recent versions described in Chapters 3 through 5 of this dissertation are scalable, pipelined to improve performance, support reconfigurable networks and multiple Control Units, and can be specialized further for dedicated applications (e.g., genome processing).

1.2 Associative Computing

The variation on associative computing being explored at Kent State University (KSU) over the past thirty years can most properly be described as associative SIMD computing (a multiple-SIMD, or MSIMD, variation called MASC is described later in Chapter 5 of this dissertation). Associative computing is particularly well suited to processing records of data in a tabular format.

As illustrated in Fig. 1, each Processing Element (PE) of the SIMD associative computing array can store a record of this tabular data in its memory. If the number of PEs is insufficient compared to the number of data records, multiple virtual PEs can be mapped onto a smaller number of physical PEs. To use a database as an example, each record of the database can be stored in a separate PE as illustrated in Fig. 2.

1.2.1 Associative Search and Responder Processing

One of the central features of associative computing is associative search, which can be implemented using an array of SIMD processing elements (PEs). In an associative search, a search key is broadcast by the Control Unit (CU) and each SIMD PE looks for that key in its local memory at the same time. If the search key is found, the PE is designated a responder and a Responder bit is set to ‘1’ in that PE. To allow for nested searches, the Responder bit is also recorded on a Mask Stack.

Once those PEs with a successful search — the responders — have been identified, they can be processed in a variety of ways. Masked instructions can be used to process all responders in parallel, as Masked instructions are executed SIMD-fashion only by those PEs that have a ‘1’ on the top of their Mask Stack (in contrast, “normal” instructions are Unmasked, meaning they are executed by all PEs). Masked instructions are particularly useful in building complex associative searches, with successive searches restricted to only those processors matching earlier searches.

In other situations, it may be appropriate to process the responders sequentially, or to process only a single responder. To process all responders sequentially in For-loop fashion, a STEP instruction is used. This instruction sets the top of the Mask Stack of one of the responders to ‘1’, sets the top of the Mask Stack of the other responders to ‘0’, and thus limits further processing to only that one responder. Additionally, the STEP instruction clears the responder’s Responder bit, so that further STEP instructions will ignore the processed responder and select only one of the others.

Other Masked instructions can process the responders in While-loop fashion or process only a single responder. In the former case, a FIND instruction is used. Similar to a STEP instruction, the FIND instruction sets the top of the Mask Stack of one of the responders to ‘1’ and the others to ‘0’, limiting processing for now to that one responder. However, the FIND instruction does not clear that responder’s Responder bit, so the responder is considered later by subsequent associative processing. In contrast, the RESOLVE_FIRST instruction (occasionally called a PICK_ONE instruction) also sets the top of the Mask Stack of one of the responders to ‘1’ and the others to ‘0’, but it additionally clears all Responder bits so that no responders are considered by further processing.
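These responder-processing instructions can be modeled in software. The Python sketch below is illustrative only: the class and function names (`PE`, `search`, `step`) are hypothetical, and the sequential loops here each correspond to a single parallel hardware step on the ASC Processor.

```python
class PE:
    """Minimal software model of one Processing Element (hypothetical names)."""
    def __init__(self, record):
        self.record = record      # local memory: one record
        self.responder = False    # Responder bit
        self.mask = [True]        # Mask Stack (last element = top)

def search(pes, field, predicate):
    """Associative search: every PE tests its local record (in parallel in hardware)."""
    for pe in pes:
        hit = predicate(pe.record[field])
        pe.responder = hit
        pe.mask.append(hit)       # record the result on the Mask Stack for nesting

def step(pes):
    """STEP: select one responder and clear its Responder bit (For-loop style).
    Returns the selected PE, or None when no responders remain."""
    first = next((pe for pe in pes if pe.responder), None)
    for pe in pes:
        pe.mask[-1] = pe is first # only the selected responder stays masked in
    if first:
        first.responder = False   # STEP consumes the responder
    return first

pes = [PE({"name": n, "grade": g}) for n, g in
       [("John", 66), ("Gary", 95), ("Peter", 87), ("Tarry", 100)]]
search(pes, "grade", lambda g: g > 90)
selected = []
while (pe := step(pes)) is not None:   # process the responders one at a time
    selected.append(pe.record["name"])
print(selected)   # -> ['Gary', 'Tarry']
```

In the same model, FIND would differ from `step` only in leaving `first.responder` set, while RESOLVE_FIRST would clear every Responder bit.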

As a simple example of associative searching and responder processing, consider the sample student database shown in Fig. 2, which contains 12 Records and 3 Attributes (Student Name, ID, and Grade). An associative search for those students who have a Grade over 90 finds two responders: PE1 and PE4. In both of those PEs, the Responder bit is set to ‘1’ and a ‘1’ is pushed onto the top of the Mask Stack (labeled “RSPD” and “Mask” in the figure).

Now suppose that we want to process those two students one-by-one using the STEP instruction, perhaps printing the contents of their records. Since there is no responder before PE1, PE1 is selected to be processed first. In STEP1, its top of Mask Stack is set to ‘1’ and its Responder bit is cleared. At the same time, all the other PEs’ top of Mask Stack bits are cleared, but their Responder bits are left unchanged for further processing. After processing the record in PE1, the program loops back to start STEP2. Since there is now no responder before PE4, PE4 is selected, and as in STEP1, its top of Mask Stack and Responder bit are updated. After the second STEP there are no more responders, and the program continues executing the next instructions.

     Student Name    ID  Grade | Search    | STEP1     | STEP2
                               | Mask RSPD | Mask RSPD | Mask RSPD
PE0  John Smith      07   66   |  0    0   |  0    0   |  0    0
PE1  Gary Heath      05   95   |  1    1   |  1    0   |  0    0
PE2  Peter Smith     11   87   |  0    0   |  0    0   |  0    0
PE3  John Smith      04   78   |  0    0   |  0    0   |  0    0
PE4  Tarry Stanley   02  100   |  1    1   |  0    1   |  1    0
PE5  Will Hanson     01   84   |  0    0   |  0    0   |  0    0
PE6  Jane Antony     06   64   |  0    0   |  0    0   |  0    0
PE7  Mark Bloggs     13   88   |  0    0   |  0    0   |  0    0
PE8  Gill Pister     09   75   |  0    0   |  0    0   |  0    0
PE9  Min Lee         10   83   |  0    0   |  0    0   |  0    0
PE10 Goby Carmen     03   83   |  0    0   |  0    0   |  0    0
PE11 Gillian Roger   08   26   |  0    0   |  0    0   |  0    0

Fig. 2 Sample Student Database


1.2.2 Associative Reduction for Maximum / Minimum Value

Another central feature of associative computing is its ability to perform certain reduction operations, such as searching for a maximum or minimum value. Falkoff’s algorithm [17] is used to identify those PEs that contain the maximum value in a certain field (the minimum value can be found by complementing the data before processing). Returning to the student database example in Fig. 2, Falkoff’s algorithm could be used to find the student with the highest grade.

Falkoff’s algorithm processes the data field from most significant bit to least significant bit, using a Mask bit to identify PEs that are candidates for holding the maximum value. This bit, initially ‘1’, is ANDed with the most significant bit of the field simultaneously in all PEs to produce a new value for the Mask. If the Mask result is ‘1’, the PE remains a candidate for holding the maximum value and has its Responder bit and top of Mask Stack bit set; otherwise those bits are cleared. If the AND produces ‘0’ for all candidates, the Mask is left unchanged (i.e., bits of lesser significance may still be used to refine the set of candidates). This processing proceeds from the most significant bit to the least significant bit, refining the set of candidates for the maximum value.
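As a software sketch (hypothetical function name; each list comprehension models one parallel hardware step across all PEs), Falkoff’s algorithm might look like:

```python
def falkoff_max(values, width=8):
    """Return the indices of the PEs holding the maximum value.

    Scans bit positions from MSB to LSB. Candidates whose current bit is 0
    are eliminated, unless *every* candidate has 0 in that position, in
    which case the candidate Mask is left unchanged.
    """
    candidates = [True] * len(values)          # initial Mask: all PEs
    for bit in range(width - 1, -1, -1):       # MSB down to LSB
        trial = [c and ((v >> bit) & 1 == 1)   # AND Mask with this bit, all PEs
                 for c, v in zip(candidates, values)]
        if any(trial):                          # keep the refinement only if
            candidates = trial                  # some candidate had a '1'
    return [i for i, c in enumerate(candidates) if c]

grades = [66, 95, 87, 78, 100, 84]
print(falkoff_max(grades))   # -> [4]; PE4 holds the maximum grade (100)
```

The minimum can be found the same way by complementing the data first, as noted above.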

1.3 Applications of a Scalable Associative Processor

Associative computing has been demonstrated as effective on a wide range of applications, ranging from database processing [8, 10] and image processing [11, 18, 19] to string processing [20], computational geometry [21], and even air traffic control [22], and holds great potential in the future for bioinformatics, computational chemistry, etc.


In this section, we will demonstrate the use of our ASC Processor on simple examples of database processing, image processing, and string matching. All three examples [6] are based on the scalable ASC Processor described in Chapter 2 of this dissertation and illustrated in Fig. 1.

While I designed and implemented the scalable ASC Processor described in this dissertation (Chapter 2 in particular), my co-author Meiduo Wu on [6] proposed the database processing example described next in Section 1.3.1, my co-author Lei Xie on [6] implemented the image processing example described in Section 1.3.2, and Ping Xu implemented the string matching example described in Section 1.3.3.

1.3.1 Associative Computing for Relational Database Processing

Although Section 1.2 of this dissertation motivated the use of associative computing for searching a one-table database, associative computing is also effective in processing more complex relational databases. A relational database is a set of relations (or tables), where each table is a set of tuples (or records). Relational algebra is a set of operations defined to operate on one or more of these tables. One of the primary advantages of associative computing for implementing relational algebra is that it does not require tabular data to be sorted for efficiency. Using associative computing, single-table operations such as Insert, Delete, and Select (Search) can be performed on an unsorted table. Moreover, more complex database management operations such as Cartesian Product, Union, Intersection, Difference, and Join can also be performed much more efficiently using associative computing than on a von Neumann machine, as can aggregate functions such as Maximum, Minimum, Sum, and Count. This section will examine how these relational operations are implemented in the ASC Processor.


In ASC, each PE stores one tuple. A specific register in each PE contains a unique relation ID, so that in a multiple-relation database the group of PEs that contain the same relation can be identified by the ID in that register. Using this storage scheme, a select (search) in one relation can be done as described earlier, and an update operation can follow a search whenever necessary. Insert and Delete on the unsorted database are implemented in the ASC Processor as follows. To Insert data into a specific relation, we choose any idle PE in the PE array, write the data into that PE’s memory, and write the table’s ID number into the register. To Delete, we only need to store a ‘0’ as the ID to indicate that the PE does not belong to any relation.

Most database operations take much more than a single selection. More complicated relational algebra operations such as Intersection, Union, and Join require many more steps for their implementation. However, ASC is also effective at speeding up these relational algebra operations, as shown below.
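A minimal software model of this Insert/Delete scheme follows (all names hypothetical; on the real processor, finding an idle PE is a single RESOLVE_FIRST-style hardware step rather than a loop):

```python
class DbPE:
    """Software model of one PE: a relation-ID register plus local memory."""
    def __init__(self):
        self.relation_id = 0   # '0' means this PE belongs to no relation (idle)
        self.memory = None

def insert(pes, relation_id, record):
    """Insert: claim any idle PE, then write the record and the relation ID."""
    pe = next(pe for pe in pes if pe.relation_id == 0)
    pe.memory = record
    pe.relation_id = relation_id
    return pe

def delete(pe):
    """Delete: just store ID 0; the stale data need not be erased."""
    pe.relation_id = 0

pes = [DbPE() for _ in range(4)]
insert(pes, 1, ("04", "239"))
insert(pes, 1, ("11", "111"))
delete(pes[0])
print(sum(pe.relation_id == 1 for pe in pes))   # -> 1 tuple left in relation 1
```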

Relation A (Student ID, Class), stored in PE7–PE12: (04, 239), (11, 111), (07, 239), (07, 124), (05, 124), (04, 111)
Relation B (Student ID, Class), stored in PE13–PE16 and stepped through in Steps 1–4: (05, 111), (04, 111), (07, 124), (11, 124)

Fig. 3 Intersection and Union


The Intersection and Union of two relations can be realized similarly to the Select operation (see Fig. 3, which shows operation results shaded). To compute the Intersection of Relations A and B, first sequentially STEP through Relation B, broadcasting the attributes of each tuple in Relation B to all the data in Relation A for comparison in parallel. Multiple field comparisons can be ANDed together to produce a match. The matching PEs contain the result of the intersection. Depending on what we want to do next, we could either push the result onto the Mask Stack for further processing or set a flag in a register to generate a View.

For Union, we do the reverse. First we set all Mask bits in Tables A and B. During broadcast, once a match is found, we set that tuple’s Mask bit (or register) back to ‘0’ in Table B. To facilitate explanation, we are changing the original tables in Fig. 3. If Tables A and B have NA and NB tuples respectively, a new table can be generated for the Union operation with a worst-case complexity of O(NA + NB), instead of the O(NA * NB) time required on a sequential processor. Further, we can always choose the smaller table to step through.
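The broadcast-and-compare pattern behind Intersection and Union can be sketched as follows. This Python model is illustrative only: the function names are hypothetical, and the inner comparisons written as loops here happen in parallel across the PEs in hardware.

```python
def intersection(rel_a, rel_b):
    """STEP through rel_b; broadcast each tuple and compare against all of
    rel_a at once (the inner loop models the parallel hardware compare).
    O(NB) broadcast steps on the ASC Processor."""
    matched = [False] * len(rel_a)
    for tup in rel_b:                       # sequential STEP through B
        for i, a in enumerate(rel_a):       # parallel compare in hardware
            if a == tup:
                matched[i] = True
    return [a for a, m in zip(rel_a, matched) if m]

def union(rel_a, rel_b):
    """Mark tuples of B that already appear in A (clearing their Mask bits),
    then keep all of A plus the unmarked tuples of B."""
    dup = [False] * len(rel_b)
    for i, tup in enumerate(rel_b):
        if any(a == tup for a in rel_a):    # parallel compare in hardware
            dup[i] = True                   # clear this tuple's Mask bit
    return rel_a + [b for b, d in zip(rel_b, dup) if not d]

A = [("04", "239"), ("11", "111"), ("07", "239"), ("07", "124"), ("05", "124"), ("04", "111")]
B = [("05", "111"), ("04", "111"), ("07", "124"), ("11", "124")]
print(intersection(A, B))   # -> [('07', '124'), ('04', '111')]
print(union(A, B))          # A plus ('05', '111') and ('11', '124')
```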

The Cartesian Product of Relation A and Relation B returns a relation containing all combinations of tuples from the two relations. This operation can be performed in ASC in time complexity O(NA + NB) using the following hardware operations. To perform the Cartesian Product for the two relations in Fig. 4, we first STEP through Relation A to produce NB copies of each tuple in Relation A in a group of new PEs. To distinguish the NB copies of a tuple from Relation A, we set a different ID number (from 0 to NB - 1) in a register for each copy. The ID number can be generated by subtracting the ID number of each PE from the ID number of the first PE among the NB copies. For example, in Fig. 4, after broadcasting the first line of Relation A (Class ID 111), we broadcast the PE ID of the first line of the resulting relation to the three PEs on lines 1, 2, and 3. Each of these three PEs subtracts this number from its own PE ID number to produce the ID number stored in the register. Then we broadcast each tuple in Relation B to the corresponding positions in the result table, so that each tuple in Relation B aligns with each tuple from Relation A as shown in the figure. For example, to broadcast the first line of Relation B to lines 1, 4, and 7 of the result table in Fig. 4, we first mask the PEs that have the same number in the register and then broadcast each tuple in Relation B. Broadcasting Relation A takes O(NA) time and broadcasting Relation B takes O(NB) time, so the whole operation takes O(NA + NB) time. The result of this Cartesian Product is shown in Fig. 4.

Relation A (Class ID, Credit): (111, 3), (124, 2), (239, 3)
Relation B (StudentID, Class): (01, 239), (02, 239), (02, 111)
Result: the nine combined tuples (Class ID, Credit, StudentID, Class), each tagged with a Register value of 0–2; the EquiJoin result rows are those where Class ID equals Class.

Fig. 4 Cartesian Product and Join

The Join operation is used to combine related tuples from two relations into a single tuple. If a pair of tuples (one from each relation) satisfies a given condition, a new tuple is created containing the combined attributes of both original tuples. All possible pairs of tuples are considered. While the condition can be any Boolean operation among attributes of the tuples, we will only illustrate EquiJoin here. For EquiJoin, the condition is that one or more attributes of the tuples are identical. This operation then concatenates tuples from the two tables whenever they meet the condition.


For example, suppose we perform EquiJoin on the two relations in Fig. 4, where the attributes Class ID and Class share some identical class ID numbers. To do the EquiJoin, a Cartesian Product is first performed on the two relations to generate a new set of combined tuples, as shown in Fig. 4. Once the Cartesian Product is done, the values of the attributes Class ID and Class can be compared to find the Join result (results shown in grey). Because the Cartesian Product takes O(NA + NB) time, the EquiJoin operation is also of time complexity O(NA + NB).
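Functionally, the Cartesian Product and EquiJoin just described can be modeled as below (hypothetical names; the replication and broadcasts are O(NA + NB) hardware steps on the ASC Processor, modeled here with ordinary loops):

```python
def cartesian_product(rel_a, rel_b):
    """Every tuple of A replicated len(B) times and aligned with every tuple
    of B, i.e. all NA * NB combined tuples."""
    return [a + b for a in rel_a for b in rel_b]

def equijoin(rel_a, rel_b, col_a, col_b):
    """EquiJoin = Cartesian Product followed by a parallel equality search
    on the two join attributes (col_b is indexed within rel_b's columns)."""
    offset = len(rel_a[0])      # rel_b's columns start here in a combined row
    return [row for row in cartesian_product(rel_a, rel_b)
            if row[col_a] == row[offset + col_b]]

classes  = [("111", 3), ("124", 2), ("239", 3)]           # (Class ID, Credit)
enrolled = [("01", "239"), ("02", "239"), ("02", "111")]  # (StudentID, Class)
print(equijoin(classes, enrolled, 0, 1))
# -> [('111', 3, '02', '111'), ('239', 3, '01', '239'), ('239', 3, '02', '239')]
```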

Many aggregate functions can be performed efficiently in ASC. ASC provides built-in hardware to support maximum and minimum searches through the data in a relation, as described earlier. To Sum data from tuples spread across multiple PEs, we can issue a Search to find all the tuples that are to be summed, and then STEP through those PEs, adding to the sum in each step. Though the summation is sequential, it is not necessary to process all PEs in a relation. The Count function is implemented by summing the Responder bits across all PEs in a relation. We do not yet use the existing network in these operations, but expect to do so in the future to improve their efficiency.
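The Sum and Count scheme can be sketched as follows (hypothetical names; the search is one parallel hardware step, and only the STEP loop over responders is sequential):

```python
def asc_sum(pes, field, predicate):
    """Search for matching tuples, then STEP through the responders, adding
    each value to a running sum held in the Control Unit (a plain Python
    accumulator here). Only responders are visited, not all PEs."""
    responders = [pe for pe in pes if predicate(pe[field])]  # parallel search
    total = 0
    for pe in responders:          # sequential STEP through the responders
        total += pe[field]
    return total, len(responders)  # Count = sum of the Responder bits

records = [{"grade": g} for g in (66, 95, 87, 78, 100, 84)]
total, count = asc_sum(records, "grade", lambda g: g > 90)
print(total, count)   # -> 195 2
```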


Input (solid square), rows: 000000, 011110, 011110, 011110, 011110, 000000. Output (both edges), rows: 000000, 011110, 010010, 010010, 011110, 000000. Vertical Prewitt weights: (-1, 0, 1) in each row; horizontal Prewitt weights: (1, 1, 1), (0, 0, 0), (-1, -1, -1).

Fig. 5 Convolution to Detect a Vertical Edge

1.3.2 Associative Computing for Image Processing

The ASC Processor can also be used for image processing, illustrated here by edge detection using convolution. Often organized as a large number of simple processors arranged in a 2-D mesh, SIMD processors have long been applied to image processing. Since our ASC Processor supports both SIMD operations and 1-D and 2-D interconnection networks, augmented with associative processing, it is even better suited to image processing. This section demonstrates the use of our ASC Processor for edge detection, an image pre-processing operation that identifies edges in an image, where edges are defined as areas where the brightness changes abruptly. Edge detection is typically followed by other image processing operations such as segmentation and object recognition.

We have implemented edge detection on our ASC Processor using convolution on an array of 36 PEs. More specifically, we detect vertical or horizontal edges in the image by convolving each interior pixel of the image with a 3x3 weight matrix called the Prewitt operator.

16

Convolution is defined as the sum of the products of each cell in the weight matrix with its corresponding cell in the input image, so a single convolution operation reads 9 input image cells, performs 9 multiplications, adds those 9 products, and produces a result. The absolute value of that result is then thresholded to produce an image cell in the output image.

Fig. 5 shows a cell in a 1-bit (i.e., black and white) input image convolved with the Prewitt vertical edge mask to produce a cell in the output image. The Prewitt vertical mask, shown at the lower left of the figure, finds the vertical edges in the input image, while the horizontal mask, shown at the lower right of the figure, finds the horizontal edges. To find both edges as shown in the figure, the absolute values of the two results are added together and thresholded (values above ‘1’ replaced with ‘1’), producing the square boundary of the solid input square. Compared to the O(N2) algorithm on a sequential machine (assuming an NxN image), our ASC machine can perform this operation in O(N) time using N PEs connected with a 1-D network, or in constant time using N2 PEs connected with a 2-D network.

In 1-D mode, each row of the image is stored in the memory of a different PE, with each memory location holding one pixel. As shown at the lower left of Fig. 5, this means that each column of the image is stored at corresponding locations in the memory of different PEs. Pseudocode for finding the vertical edges in an NxN image using 1-D mode is as follows (finding both edges is a simple extension):

1. Loop: For N pixels in a row
2.   For all PEs in parallel
3.     Loop: For each row (three cells) in the weight matrix
4.       Multiply the three weight cells with the corresponding three cells in the input image
5.       Add the three products and move the sum to the output cell
6.     End Loop
7.   Add the three sums (one per row in the weight matrix) to produce the output image cell
8. End Loop

The innermost loop iterates 3 times and the outermost loop iterates N times, so overall the algorithm runs in O(N) time using a 1-D network.
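For illustration, the loop structure above can be sketched in software. This is a simulation of the 1-D-mode computation, not the Processor's instruction set; the function name and the way a neighbour's row stands in for a network MOVE are illustrative assumptions.

```python
# Software sketch of 1-D-mode vertical-edge detection with the Prewitt mask.
# Each "PE" holds one row of the image; reading a neighbouring PE's row
# stands in for a MOVE over the 1-D network.
PREWITT_VERTICAL = [[-1, 0, 1],
                    [-1, 0, 1],
                    [-1, 0, 1]]

def vertical_edges(image, threshold=0):
    """image: N rows (one per PE) of N pixels; returns the thresholded output."""
    n = len(image)
    out = [[0] * n for _ in range(n)]
    for col in range(1, n - 1):                  # Loop: for N pixels in a row
        for pe in range(1, n - 1):               # "for all PEs in parallel"
            total = 0
            for r in range(3):                   # one weight-matrix row at a time
                row = image[pe - 1 + r]          # neighbour's row (network MOVE)
                total += sum(PREWITT_VERTICAL[r][c] * row[col - 1 + c]
                             for c in range(3))
            out[pe][col] = 1 if abs(total) > threshold else 0
    return out
```

Interior pixels of a solid vertical step (columns of 0s next to columns of 1s) come out as '1', matching the thresholding described above.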

The detailed execution of this code on our ASC Processor is under the control of the compiler; in our tests it worked as follows. Basically, one row of the image (three cells) is processed at a time using the STEP operation, reading weight matrix cells and broadcasting them to all PEs. PEs read input image cells three at a time and multiply them with the corresponding weight matrix cells. The three products in each PE are then added, and a MOVE instruction moves the sum down if it came from the first line of the weight matrix, up if it came from the third line, or not at all if it came from the center row. This sum is then stored temporarily in the destination PE until it can be added to the other sums.

After the final sum is computed, it is thresholded to either '1' or '0' (in the simple example shown, the threshold is 0, so if the final sum is other than '0' it is set to '1', and otherwise to '0'), producing a new pixel in the output image. Other rows of input image cells are processed similarly.

In 2-D mode, each PE holds a pixel in the image, as shown at the lower right of Fig. 5. This mode requires NxN PEs to process the image as follows:

1. Loop: For each cell in the weight matrix
2.   For all PEs in parallel
3.     Multiply weight cell with all input cells and move to output cells
4.   Add the nine products (one per cell in the weight matrix) to produce the output image
5. End Loop


The outermost loop iterates 9 times, so overall the algorithm runs in constant time (with respect to the number of PEs) using a 2-D network.

The detailed execution of this code is similar to that using the 1-D network, but requires more network movement. Weight matrix cells must be read one by one, and each product must be moved differently through the network in order to reach the center PE. For example, after broadcasting the weight matrix to all PEs, the weight matrix cell at the upper left corner will be multiplied with the corresponding image cells and the product will first move down to the PEs below those image cells and then move right to the center PEs where all nine products will be added together.
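The nine-iteration structure can likewise be sketched in software. This is a hedged simulation of the 2-D-mode data movement, with products that would fall off the mesh edge simply dropped rather than wrapped; the function name is illustrative.

```python
# Software sketch of 2-D-mode convolution: one "PE" per pixel, nine
# iterations, one broadcast weight-matrix cell per iteration.
def convolve_2d(image, mask):
    n = len(image)
    acc = [[0] * n for _ in range(n)]
    for dr in (-1, 0, 1):                        # Loop: for each weight cell
        for dc in (-1, 0, 1):
            w = mask[dr + 1][dc + 1]             # broadcast one weight to all PEs
            for r in range(n):                   # "for all PEs in parallel"
                for c in range(n):
                    rr, cc = r + dr, c + dc      # product computed at (rr, cc)...
                    if 0 <= rr < n and 0 <= cc < n:
                        acc[r][c] += w * image[rr][cc]   # ...moved to center PE
    return acc
```

The outer two loops iterate over the nine weight cells, matching the constant-time bound stated above: the PE loops model work that all NxN PEs perform simultaneously.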

Although these algorithms could be executed in pure SIMD fashion, the use of associative processing features such as the STEP operation does provide some additional power. More advanced image processing, however, will take greater advantage of associative processing. For example, Pixel Planes [23] is a dedicated system for graphics and imaging. This fast processor uses one processor per pixel, similar to our 2-D network arrangement. During scan conversion, a coefficient is first broadcast to all pixels in parallel, then an Enable register is used to identify whether a pixel is inside or outside a polygon by comparing its own data with this coefficient. These methods can be performed by our ASC Processor by searching and using the top of the Mask Stack to enable certain processors (pixels). Many of their other techniques can similarly be adapted to associative processing.

1.3.3 Associative Computing for String Matching

A good solution to string matching is useful in a variety of applications, including packet processing, genome processing, etc. The VLDC associative SIMD algorithm [20] can find all instances of a pattern string within a larger text string. Though we discuss only the exact-match version of the algorithm here, extensions support single-character and variable-length "don't cares". Note that this algorithm requires associative search, responder processing, and a linear interconnection network. Although it does not strictly require a reconfigurable network, that functionality might be required by a MASC version that can search multiple text strings simultaneously.

Fig. 6 demonstrates the VLDC algorithm by example. In this figure, 5 PEs of the ASC Processor are shown. "APE" represents an associative PE, and "R" is its responder bit. Fig. 6a shows the variables initialized as explained below.

Fig. 6 String Matching using Associative Computing (6a at the top left, 6b at the top right, and so on)


The cells to the left of each PE, with names ending in "$", represent parallel variables stored in the PE's local memory: "text$" stores the text string to be searched ("ABAA" in this example), "counter$" indicates the number of previous characters matched, and "match$" flags the beginning of any pattern matches.

In addition, the Control Unit (CU) stores three scalar variables: “patt_string” is the pattern to be found (“AB” here), “patt_length” is its length, and “patt_counter” indicates the number of characters matched in the pattern.

The algorithm processes the pattern string from right to left, beginning with an associative search (see Fig. 6b) for a text$ value of “B” and a counter$ value equal to patt_counter, meaning a “B” preceding 0 previously-matched characters. The single successful PE sets its responder bit (see Fig. 6c).

Limiting further processing to only that responder, that responder adds 1 to its counter$ value and sends the result of 1 to the counter$ variable of its predecessor (see Fig. 6d), using the linear PE interconnection network. This indicates that 1 cell following the predecessor has had a successful match.

The algorithm then processes the next character in the pattern string, performing an associative search (see Fig. 6e) for a text$ value of "A" and a counter$ value equal to patt_counter, meaning an "A" preceding 1 previously-matched character. While there are three "A"s in the text string, only one precedes a "B" — that single successful PE sets its responder bit (see Fig. 6f).


Limiting further processing to only that responder, that responder adds 1 to its counter$ value and sends the result of 2 to the counter$ variable of its predecessor (see Fig. 6g). This indicates that 2 cells following the predecessor have had a successful match.

Once both characters in patt_string have been searched, the algorithm performs one last associative search (see Fig. 6h), looking for a counter$ value equal to patt_length, meaning a cell for which the next 2 preceding cells matched. Although not shown in Fig. 6 due to space limitations, the resulting responder sends a 1 to the match$ variable of its successor, using the linear PE interconnection network, to flag the beginning of a successful pattern match.
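The walkthrough above can be condensed into a small software sketch of the exact-match VLDC search. This is a serialized simulation, not the parallel hardware: the extra slot at index 0 stands in for a PE in front of the text, and the function name is illustrative.

```python
# Software sketch of the exact-match VLDC pattern search.  Each "PE" holds one
# text character (text$) and a counter$ field; counter[i + 1] belongs to the
# PE holding text[i], and counter[0] models a PE in front of the text.
def vldc_match(text, pattern):
    """Return the positions at which `pattern` begins in `text`."""
    n = len(text)
    counter = [0] * (n + 1)
    for patt_counter, ch in enumerate(reversed(pattern)):
        # associative search: text$ == ch AND counter$ == patt_counter
        hits = [i for i in range(n)
                if text[i] == ch and counter[i + 1] == patt_counter]
        # each responder adds 1 to its counter$ value and sends the result
        # to the counter$ of its predecessor over the linear network
        for i in hits:
            counter[i] = patt_counter + 1
    # final search: counter$ == patt_length flags the successor's match$,
    # i.e. the position where a successful match begins
    return [i for i in range(n) if counter[i] == len(pattern)]
```

On the example in Fig. 6, searching "ABAA" for "AB" reports a single match beginning at position 0, and overlapping texts such as "ABAB" report every occurrence.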

This algorithm can be modified to run even faster on our pipelined ASC Processor described in Chapter 3 of this dissertation. However, it is necessary to insert bubbles before the masked operations to avoid data hazards.

1.4 Dissertation Organization

In this dissertation, I describe the development of a scalable, pipelined ASC Processor with various reconfigurable networks and support for multiple control units. I also describe how the basic ASC Processor can be further customized for one specific application, genome processing.

1.4.1 Developing an Associative Processor

In Chapter 2, I begin by describing the basic ASC Processor architecture, and then describe the implementation of the first (incomplete) 4-PE prototype developed previously by two other students in Prof. Walker’s research group. Then I describe the scalable ASC Processor implemented at the beginning of my dissertation research, and show how it implements associative search, responder processing, and reduction for maximum/minimum value.


1.4.2 Addition of Pipelining to Improve Performance

Though the first scalable ASC Processor described in Chapter 2 was a working processor, it was based on code developed earlier (and independently) by two other students and was not a very efficient implementation. In Chapter 3, I describe the implementation of a new version of the scalable ASC Processor which was not only more cleanly implemented but more efficient due to a new five-stage pipeline architecture.

1.4.3 Addition of a Reconfigurable Interconnection Network

The first PE interconnection networks developed for the ASC Processor implemented basic 1-D and 2-D neighbor-to-neighbor communication. However, this network was relatively expensive and its functionality was limited. I later implemented a more powerful reconfigurable PE interconnection network that allows PEs to communicate even when they are not neighbors — functionality important for some associative algorithms such as the string matching algorithm described earlier in Section 1.3.3. This reconfigurable interconnection network is described in Chapter 3 of this dissertation, and a specialized version for solving the longest common subsequence (LCS) string matching problem is described in Chapter 4.

1.4.4 Supporting Multiple ISs for Increased Control Parallelism

Most SIMD research considers only a single Control Unit (CU); multiple SIMD (MSIMD) architectures are much less common. Some models suggest simulating MIMD on a SIMD architecture [24], but those models tend to be either instruction level or -based. The primary idea is to treat instructions as data, meaning the program is stored both in the CU and in each PE's memory. Therefore, each PE might execute completely different instructions.


In contrast, the development of the Multiple-Instruction-Stream ASC model, called MASC, is an active area of research at Kent State University, so as the final element of this dissertation I developed a first prototype of a Multiple Instruction Stream ASC Processor, called the MASC Processor. The processor supports a pool of Task Managers (TMs) and a pool of Instruction Streams (ISs). The TMs dynamically allocate tasks to ISs, and the ISs then broadcast instructions to control the PEs. Like the ASC Processor, this MASC Processor is scalable, both in terms of the PEs as well as the TMs and ISs. This first MASC Processor is described in Chapter 5.


2 First Prototypes of an Associative Computing (ASC) Processor2

The previous chapter introduced the concept of associative computing, described associative search, responder processing, and reduction for maximum / minimum value, and ended by presenting three example applications using associative computing. This chapter describes several early prototypes of an FPGA-based associative SIMD processor, called the ASC Processor. I begin by describing the basic architecture, then describe the first (incomplete) prototype ASC Processor developed by two other students, and conclude the chapter by describing my first implementation of a working, scalable ASC Processor. This first implementation gave me the insights necessary to develop the improved scalable pipelined ASC Processor described in Chapter 3.

2.1 Basic ASC Processor Architecture

The initial prototypes of our ASC (Associative Computing) Processor are byte-serial associative processors, illustrated in Fig. 7. The first prototypes have a single Instruction Stream Control Unit, at various times called the IS Control Unit, the Control Unit (CU), or simply the Instruction Stream (IS). This Control Unit works in conjunction with the Processing Element (PE) Array; while the first prototype [25] was limited to only 4 PEs as a proof of concept, the current version [26] can be scaled to hundreds of PEs on one FPGA. Additional circuitry [27] in the PE array supports associative search, responder resolution, and maximum / minimum associative reduction.

2 Most of this chapter has been published in [26] and [6]. Section 2.1 is my summary of an earlier prototype developed by two other students. While I designed and implemented the scalable ASC Processor described in Section 2.3, my co-author Lei Xie on [6] developed the 1-D and 2-D network described in Section 2.3.3.


The Control Unit fetches and decodes instructions, executes scalar instructions, and sends control signals to the PE array to perform associative operations. The Control Unit contains an Instruction Memory and a Data Memory, an 8-bit ALU for scalar arithmetic, and 16 8-bit General Purpose Registers, of which the 16th is dedicated to holding the PE_ID number (0 to N-1). It communicates data (for example, a broadcast associative search key) to and from the PE array through 16 8-bit Common Registers, which are readable by both the Control Unit and the PEs in the PE Array.

Each Processing Element (PE) cell in the PE array consists of a custom-designed 8-bit RISC-based PE and a local memory which typically stores one or more data records. Each PE contains an 8-bit ALU and 16 8-bit General-Purpose Registers for data processing, plus a 1-bit ALU, 16 1-bit Logical Registers, a Responder bit, and a Mask Stack holding at most 16 1-bit values for associative processing. The PE Array is also supported by Responder Resolution and Maximum/Minimum circuitry. Since the PEs are only 8 bits wide, larger data fields must be processed in byte-serial fashion under the direction of the Control Unit. Finally, the PEs are connected by both a 1-D and a 2-D interconnection network.

Fig. 7 A 4-PE ASC Processor Array

2.2 First (Incomplete) Prototype of a 4-PE ASC Processor

The initial prototype of our ASC Processor [27, 28] is a byte-serial associative processor with a single Instruction Stream Control Unit and a 4-PE Processing Array, along with the responder resolution circuitry and MAX/MIN circuitry necessary to support associative computing. The IS Control Unit directs the processing array to perform associative searches, to search for a maximum or minimum value in a particular field across all processors, and to perform other scalar and parallel arithmetic and logic operations. The IS Control Unit was designed by Yanping Wang and the PE Array was designed by Meiduo Wu. However, though these two components were implemented and tested separately, they were never connected into a complete working ASC Processor.

In this first prototype ASC Processor, the IS Control Unit has its own data memory and 32-bit instruction memory, as well as scalar arithmetic and logic circuitry. The Control Unit fetches and decodes instructions stored in the instruction memory. Scalar operations are executed by the Control Unit directly; for parallel and associative operations, the Control Unit sends control signals to the four PEs directing them to perform the appropriate operation. To perform an associative search, the search key is stored into the Common Register, which is readable by both the Control Unit and the PEs, and instructions are sent to the PEs telling them to compare the data in the Common Register to data in the PE's General Purpose Registers. Since the ASC Processor is byte-serial, the Control Unit loops as necessary to process multiple-byte data.

The Associative Processing Array, shown in Fig. 7, is composed of 4 Processing Element (PE) cells, MAX/MIN circuitry, and responder resolution circuitry. Each PE cell is composed of a PE and a local data memory. The data memory usually contains several variable-size data fields that come from a record of a tabular data structure.

Each PE has an 8-bit ALU, a 1-bit ALU, sixteen 8-bit General Purpose Registers, sixteen 1-bit Logical Registers, a Responder Register, a Mask Stack, and Step/Find/ResolveFirst circuitry. The 8-bit ALU and sixteen 8-bit General Purpose Registers are used for normal arithmetic and logic operations.

The remaining circuitry inside each PE is used to implement associative computing. When an associative search is performed, every PE simultaneously searches a specified General Purpose Register for a particular search key; those PEs that find this key are called responders and are flagged by a '1' in their Responder Register. If this value is then stored in the top of the mask stack, that PE will be selected for further processing by masked instructions, while PEs with a '0' in the top of the mask stack will not be selected. Additional circuitry handles more complex search criteria.

The MIN/MAX Unit is based on Falkoff's algorithm [17], and allows the Processing Array to search for a maximum or minimum value across all PEs. For our initial prototype 4-PE ASC Processor, four 8-bit Shift Registers are used to process the 8-bit data serially, and four 1-bit Mask Registers are used to indicate the maximum/minimum value. The data in the Shift Registers is processed from most significant bit to least significant bit, ANDing each bit with the Mask Register bit (which is initially 1 for all PEs). The results of this AND are ORed together to indicate whether or not there is at least 1 responder. If the OR result is '1', the Mask Register is set to the corresponding AND result. Otherwise, there is no responder, all PEs are tied, and the Mask Register is not updated. The remaining bits are then processed, and when the algorithm terminates all PE(s) that have a '1' in the Mask Register contain the maximum/minimum value.
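Falkoff's bit-serial procedure described above can be sketched as follows. This is a software simulation of the per-PE Shift and Mask Registers, not the actual circuitry; the function name and 8-bit width default are illustrative.

```python
# Software sketch of Falkoff's bit-serial maximum search: each PE holds one
# unsigned value; the returned mask marks the PE(s) holding the maximum.
def falkoff_max(values, width=8):
    mask = [1] * len(values)                    # Mask Registers start at 1
    for bit in range(width - 1, -1, -1):        # most significant bit first
        slice_bits = [(v >> bit) & 1 for v in values]
        and_out = [m & b for m, b in zip(mask, slice_bits)]
        if any(and_out):                        # OR: At_least_one_responder
            mask = and_out                      # keep only PEs with a '1' here
        # otherwise all remaining candidates tie on this bit; mask unchanged
    return mask
```

A minimum search is the same procedure applied to the complemented bits.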

The Responder Resolution Unit collects the responder bit from each of the 4 PEs. After processing, it sends an At_least_one_responder signal to the Control Unit telling it that PE(s) responded. The Responder Resolution Unit also sends a Responders_before_me signal to each PE so that PEs can be processed sequentially, if necessary, by the Step/Find/ResolveFirst circuitry.

2.3 Implementing a Scalable ASC Processor

As mentioned earlier, the IS Control Unit and PE Array described in Section 2.2 were implemented separately and never connected into a working ASC Processor. As the first step in my dissertation research, I connected the components they developed, made the changes necessary to integrate them, and implemented the first working ASC Processor on an Altera FLEX10K70 FPGA. I then extended that initial prototype ASC Processor into a scalable ASC Processor with one Instruction Stream Control Unit and 50 PEs, implemented on an Altera APEX 20K1000E FPGA. That first scalable ASC Processor is described in this section.

The first scalable version of our ASC Processor is illustrated in Fig. 8. In this design, all PEs listen to a single Instruction Stream bus. Each PE also talks to a central Responder Resolution Unit — PEs send responder signals to this unit, and they receive Responders_before_me signals from it.


Unlike the initial 4-PE prototype of the ASC Processor, where the responder circuitry was distributed partially into each PE and partially into the Responder Resolution Unit, that functionality is centralized in the Responder Resolution Unit in this scalable ASC Processor. The Responder Resolution Unit is responsible for processing the responder signal from the Responder Register in each PE, as well as responder signals from the Maximum/Minimum Unit (described next).

Before going to the Responder Resolution Unit from the PEs, these two sets of signals go through a Selector controlled by the Instruction Stream Control Unit. For example, if the current command is to find a responder among PEs, the Instruction Stream Control Unit will select the responder signal from the Responder Register to send to the Responder Resolution Unit. If the current command is to find a maximum value, the Instruction Stream Control Unit will select the Maximum/Minimum Unit's AND output to send to the Responder Resolution Unit.

Fig. 8 A Scalable ASC (Associative Computing) Processor

The At_least_one_responder signal is sent to both the Instruction Stream Control Unit and to the PE's Maximum/Minimum Unit to find a maximum or minimum value. It is also sent to the Step/Find/ResolveFirst circuitry as in the prototype ASC Processor.

In contrast, the Maximum/Minimum Unit is distributed across PEs in this new scalable ASC Processor design, as illustrated in Fig. 9. Each PE now has its own dedicated Maximum/Minimum Unit. In the Maximum/Minimum Unit initially designed by Meiduo Wu [27], there is a shift register, a 1-bit mask register, and one AND gate. The shift register holds the data from the field to be searched for a maximum or minimum value. The 1-bit mask register indicates whether or not the PE holds the extreme value. The AND gate ANDs the current bit slice of the shift register with the current mask register bit and outputs a responder signal, which is sent to a central Responder Resolution Unit. The Responder Resolution Unit then ORs the responder signals from all the PEs together to determine whether or not there is At_least_one_responder, and broadcasts this OR output to all the PEs. A PE updates its mask register with the AND output if it receives a '1', or leaves it the same if it receives a '0'. Compared to the non-scalable ASC Processor, this new design scales more easily and reduces the circuitry outside the PE array.

Fig. 9 A Scalable ASC Processor

2.3.1 Implementing Associative Search and Responder Processing

As described in Chapter 1, a central concept in associative computing is the associative search. To perform associative search in our scalable ASC Processor, the Control Unit broadcasts the search key to all PEs through a Common Register and then directs all PEs (SIMD-fashion) to look for that search key in the specified field of their local memory. Those PEs for which the search is successful are designated responders, and they set their Responder bit and the top of their Mask Stack to '1'.

Most parallel instructions in the scalable ASC Processor's instruction set come in both Masked and Unmasked versions. Unmasked instructions are executed by all PEs, while Masked instructions are executed only by those PEs with a '1' on the top of the Mask Stack (permitting further processing by the responders from a particular associative search). For complex associative searches, intermediate results can be stored in the PE's Logical Registers.
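A minimal software sketch of this search-then-mask pattern follows. The dictionary keys are illustrative stand-ins for the Responder bit, the top of the Mask Stack, the searched field, and an accumulator register; they are not the Processor's register names.

```python
# Software sketch of an associative search followed by a Masked instruction:
# responders set their Responder bit and the top of their Mask Stack, and the
# Masked add then executes only in those PEs.
def associative_search(pes, key):
    for pe in pes:                       # every PE compares simultaneously
        hit = 1 if pe["field"] == key else 0
        pe["responder"] = hit            # Responder bit
        pe["mask_top"] = hit             # top of the Mask Stack

def masked_add(pes, amount):
    for pe in pes:                       # Masked version: responders only
        if pe["mask_top"]:
            pe["acc"] += amount

pes = [{"field": v, "acc": 0, "responder": 0, "mask_top": 0}
       for v in (5, 9, 5, 2)]
associative_search(pes, 5)               # key broadcast via a Common Register
masked_add(pes, 1)                       # executed only by the two responders
```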


To implement the STEP, FIND, and RESOLVE_FIRST instructions, a dedicated Responder Resolution Unit in the PE Array works in conjunction with a STEP/FIND/RESOLVE_FIRST (SFR) Unit in each PE. The Responder Resolution Unit generates a signal for each PE to tell it whether or not there is a PE responder with a lower ID number (lower ID numbers have higher priority); it also tells the Control Unit and PE Array whether or not any responders exist. The SFR Unit in each PE receives a signal from the Responder Resolution Unit and manipulates the Responder bit as well as the top of the Mask Stack in that PE as appropriate.

With this architectural support, the STEP instruction is implemented as follows (the FIND and RESOLVE_FIRST instructions are implemented similarly). First, a STEP instruction is sent to all PEs following an associative search. Since the STEP instruction follows an associative search, it will be ignored by all PEs except the successful responders. In that group of responders, each PE examines the signal sent to it by the Responder Resolution Unit. If a PE sees that there are no responders with a lower ID number than its own, it replaces the top of its Mask Stack with a '1' and clears its Responder bit (recall the responder processing in Fig. 2). The other responders see that a responder with a lower ID number exists and replace the top of their Mask Stacks with '0's, but leave their Responder bits untouched. Subsequent Masked instructions will now be executed only by the one PE with the lowest ID number, though later STEP instructions can process the other responders sequentially.


Fig. 10 Network Operations (data for PE0 through PE(n-1) moves from the NWIN Register to the NWOUT Register under network control signals)

2.3.2 Implementing Associative Reduction for Maximum / Minimum Value

Another key concept in associative computing is the ability to perform reduction operations such as a maximum value search across all PEs. Our scalable ASC Processor implements Falkoff's algorithm using a dedicated shift register in each PE in conjunction with the PE's Mask Stack and the PE Array's Responder Resolution Unit. Before the search is performed, a '1' is pushed onto the top of every PE's Mask Stack; then a MAX instruction performs the actual search. When the MAX instruction finishes, the PE (or PEs) that have the maximum value in the specified field will have their Responder bit and the top of their Mask Stack set to '1'. If the data is more than one byte wide, the Control Unit must loop to process all bytes in turn.

The MAX instruction searches as follows. First, it copies the specified field to a dedicated shift register in each PE. Then each PE processes the data in the shift register from the most significant bit to the least significant bit, ANDing each bit in turn with the top of the Mask Stack. The result is sent to the Responder Resolution Unit, and if responders exist a signal is sent back to the PEs telling them to update their Responder bits and the tops of their Mask Stacks with that result; if there were no responders (i.e., all AND results were '0') the Mask Stacks are not updated.

2.3.3 Implementing a 1-D and 2-D PE Interconnection Network

Although many applications can be implemented using SIMD associative computing with no interconnection network, this first scalable ASC Processor supports both a 1-D and a 2-D PE interconnection network, designed by Lei Xie, for those applications that do require a network, as shown in Fig. 10. Using image processing as an example (see the lower half of Fig. 11), either an entire row of an image can be stored in each PE's memory and the cells can communicate using the 1-D network, or one pixel of the image can be stored per PE and the cells can communicate using the 2-D network.

Fig. 11 1-D and 2-D PE Interconnection Network in the ASC Processor

Lei Xie’s network is implemented as a large 8xN bit wide NWIN register (where N is the number of PEs), an 8xN bit NWOUT register, and routing circuitry as shown in Fig. 11. Data enters the network through the NWIN register, which stores data for PE j in bits from 8j to 8j+7, and then that data is routed to the proper place in the NWOUT register.

Data can be moved over either network under program control. Using the 1-D network, data can be moved up or down one PE, and using the 2-D network data can be moved up, down, left, or right one PE. Wrap-around can be turned on or off as desired.
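A software sketch of a 1-D network move through the NWIN/NWOUT registers follows; it holds one value per PE rather than an 8-bit slice, and the function name and direction convention are illustrative assumptions.

```python
# Software sketch of a 1-D network move between the NWIN and NWOUT registers
# of Fig. 10, with wrap-around optionally enabled.
def move_1d(nwin, direction, wrap=True):
    """direction: +1 moves each value down one PE, -1 moves it up one PE."""
    n = len(nwin)
    nwout = [0] * n
    for j in range(n):
        dest = j + direction
        if 0 <= dest < n:
            nwout[dest] = nwin[j]
        elif wrap:
            nwout[dest % n] = nwin[j]
        # with wrap-around off, data shifted past the end is dropped
    return nwout
```

A 2-D move is the same idea applied along rows or columns of the PE grid.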

2.4 Performance of the First Scalable ASC Processor

Compared with the previous design, this new design is easier to scale up and saves design circuitry outside the PE array. When scaled to larger numbers of PEs, the timing / performance varies: the maximum frequency is about 33.47 MHz with only 5 PEs, 33.63 MHz with 10 PEs, and 26.9 MHz with 50 PEs. For larger FPGAs, it may be necessary to optimize the current design further to avoid too much degradation.

Overall, making these changes to our initial prototype ASC Processor to produce this new scalable ASC Processor reduced the circuitry required for the Responder Resolution Unit and the Maximum/Minimum Unit. Moreover, the new architecture seemed adaptable into a MASC Processor, where each IS Control Unit would need only one Responder Resolution Unit instead of having both its own Responder Resolution Unit and Maximum/Minimum Unit.


2.5 Conclusion

This chapter has described the first (incomplete) prototype ASC Processor developed by two other students, as well as my first implementation of a working, scalable ASC Processor.

Chapter 3 describes an improved scalable pipelined ASC Processor with a reconfigurable interconnection network.


3 A Scalable Pipelined ASC Processor With Reconfigurable PE Interconnection Network3

The previous chapter described several prototypes of an FPGA-based associative SIMD processor, called the ASC Processor [6]. Like a traditional SIMD architecture, the ASC Processor has one Control Unit (CU) and an array of simple PEs, each containing an 8-bit processor and memory. The Control Unit decodes the instructions and broadcasts control signals to the PE array. Even a small million-gate FPGA can implement one CU and around 70 PEs, at clock speeds comparable to other processor cores on FPGAs.

The ASC Processor implements associative operations using a dedicated comparison unit in each PE, a mask stack to indicate when the PE is a responder, and special masked instructions to limit further sequential or parallel processing to only the responders. Additional hardware supports max/min search, giving the PE array the ability to flag as responders those PEs that have a maximum (or minimum) value in a particular memory location.

In any SIMD array, the network design is crucial for PE communication. This network is typically a linear array or mesh, possibly augmented by complicated switches or routers. Herrmann's Asynchronous Reconfigurable Mesh [18] was the first to use a reconfigurable network to connect associative processors, mainly for image processing. Their system allows processing cells to reconfigure the mesh network depending on the content of each cell, thus allowing cells to disconnect one or more of their four connections to their east, west, south, and north neighbors. This support for reconfiguration allows a group of cells to form a local region of connection. In Section 3.2, we describe another type of reconfigurable network that takes advantage of the associative nature of our system.

3 Most of this chapter has been published in [7].


As described in the previous chapter, our first ASC Processor prototypes [6] had a shortcoming typical of virtually all SIMD architectures — any performance improvement comes entirely from increasing the number of PEs, since the PEs lack the pipeline architecture used by most modern processors. While a pipelined architecture has been proposed for SIMD systems [29], there have been no pipelined SIMD processors implemented to date that we are aware of, including the commercial SIMD PASoCs cited above. In this chapter, we describe a pipelined associative SIMD processor array.

The remainder of this chapter is organized as follows. Section 3.1 describes the architecture and functionality of our ASC Processor's pipelined SIMD array. Section 3.2 then describes the implementation of a reconfigurable network and its functionality. Each section also comments briefly on the performance of our new ASC Processor with those additions.

3.1 Pipelined SIMD Array

3.1.1 Pipeline Architecture

There are two main types of pipelining that can be supported by a SIMD system: pipelined broadcast and pipelined PEs [29]. Although broadcast to PEs will be a concern as the number of

PEs increases, this dissertation will assume a relatively small number of PEs (on the order of hundreds of PEs) and concentrate on the pipelining of the PEs, leaving pipelined broadcast for future work. The SIMD array in our ASC Processor, described in this section, has five stages of pipelining: Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access

(MEM), and Data Write Back (WB). Each stage is one clock cycle in length.
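As a rough illustration of the throughput benefit, the five stages can be modeled as a shift register of latches: each clock moves every in-flight instruction forward one stage, so n instructions finish in n + 4 cycles once the pipe fills. This toy Python model (not the RTL) ignores hazards and stalls:

```python
# Toy behavioral model of the five-stage pipeline (IF, ID, EX, MEM, WB):
# each clock tick shifts every in-flight instruction one stage forward,
# so once the pipe fills, one instruction completes per cycle.
def run(program, stages=("IF", "ID", "EX", "MEM", "WB")):
    pipe = [None] * len(stages)          # one inter-stage latch slot per stage
    completed, cycles, pc = [], 0, 0
    while len(completed) < len(program):
        cycles += 1
        # advance: fetch the next instruction into IF, shift the rest along
        pipe = [program[pc] if pc < len(program) else None] + pipe[:-1]
        if pc < len(program):
            pc += 1
        if pipe[-1] is not None:         # instruction in WB retires this cycle
            completed.append(pipe[-1])
    return completed, cycles

run(["i1", "i2", "i3"])   # (['i1', 'i2', 'i3'], 7): 3 instructions in 3 + 4 cycles
```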


The overall structure of the ASC Processor’s architecture and pipeline is shown in Fig. 12.

On the top left side of the figure is the Control Unit (CU), which fetches and decodes instructions. Below the Control Unit is a Sequential PE (SPE), which executes any scalar instructions, and handles control flow such as branches. On the right side of the figure is the

SIMD processor array, consisting of a set of Parallel PEs (PPEs), which executes any parallel instructions. Note that this figure emphasizes the pipelined implementation; other architectural details such as the components needed to support associative computing are not shown and have been described in previous papers [6].

The IF stage and part of the ID stage are implemented in the Control Unit (CU), shown at the top left of Fig. 12. In the IF stage, the Control Unit fetches one instruction from its Instruction

Memory, and stores it in the inter-stage IF/ID latch. In the next clock cycle, the ID stage decodes the instruction and sends any immediate data via a data bus and any appropriate control signals

Fig. 12 ASC Processor’s Pipelined Architecture

(not shown) via an instruction bus to either the Sequential PE or the Parallel PEs in the PE array.

(Recall that in a SIMD system the Control Unit broadcasts hardwired control signals to the PEs, each of which contains only a data path.) The remainder of the ID stage, along with the last three stages, is implemented in the PEs.

The Sequential PE (SPE), shown at the lower left of Fig. 12, performs scalar computation and handles control flow (e.g., branches). It can also broadcast register data via a data bus to the parallel PE array, for example broadcasting a search key to the PEs for associative search. The

SPE has four stages of pipelining: the remainder of the ID stage, the EX stage, the MEM stage, and the WB stage. The last three stages are similar to those in most pipelined sequential processors. The EX stage performs arithmetic and logic operations. The MEM stage contains a

256 x 8-bit data memory, and the WB stage writes back either memory data or ALU data to the register file.

The SPE’s portion of the ID stage is more specific to its role in a parallel system. At the front end of the stage is a register file, which stores decoded instructions and data. Most scalar instructions will read data from that register file into the ID/EX latch. However, if the SPE must

Fig. 13 Parallel PE’s Pipelined Architecture

execute a broadcast instruction, the data is instead placed onto a data bus and broadcast to the parallel PE array. This portion of the ID stage also includes a comparator (not shown) for testing branch addresses.

The parallel PE array is the heart of a SIMD system. In the ASC associative computing model, data are stored in a tabular format across the entire PE array. For example, in a relational database the data are stored in a table, perhaps with each PE storing one row of that table. If there are more rows than PEs, each PE can act as multiple “virtual PEs”. For simplicity, we will assume here that we always have enough PEs for a given application, with the understanding that PEs can act as virtual PEs when necessary.

The architecture of the Parallel PE (PPE) array is shown on the right side of Fig. 12. The relationship of the PE array to the Control Unit and SPE has already been discussed. An instruction bus (not shown) allows control signals to be sent from the Control Unit to the PE array, and a data bus allows immediate data and broadcast data to be sent from the CU and SPE, respectively, to the PE array. Each Parallel PE (PPE) in the PE array also includes a Data Switch to facilitate communication between PEs (see Section 3.2 for more information on the Data Switch and reconfigurable PE network).

The pipelined architecture of a Parallel PE (PPE) is shown in detail in Fig. 13, and consists of four pipelined stages. As in the SPE, the PPE’s portion of the ID stage is more specific to its role in a parallel system, this time with an emphasis on associative computing. The PPE’s ID stage consists of a 16 x 8-bit register file, a Data Switch, a comparator, and an 8 x 1-bit mask stack. The mask stack has two roles: limiting associative computing to only responders, and configuring the interconnection network. In this latter role, the top of the mask stack is sent to


the Data Switch and used to configure the interconnection network, as explained later in Section 3.2.

The mask stack is also used to implement responder processing — a fundamental operation in SIMD associative computing. Whenever an associative search is performed, the search key is sent to the PPEs, that key is compared to local data, and the result is pushed onto the top of the mask stack. The top of the mask stack is used to control the ID/EX latch, so if the top of the mask stack is ‘0’, meaning that PPE is not a responder, the instruction will not go through the

ID/EX latch.
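The mask-stack mechanism can be sketched as follows; the PE class and field names here are illustrative stand-ins, not the processor’s actual register-transfer design:

```python
# Behavioral sketch of associative search with a per-PE mask stack
# (hypothetical data layout; the real PEs operate on register files).
class PE:
    def __init__(self, memory):
        self.memory = memory          # local data, keyed by field name
        self.mask = [1]               # mask stack; top gates the ID/EX latch

    def push_search(self, field, key):
        """Associative search: push 1 if this PE's field matches the key."""
        self.mask.append(1 if self.memory.get(field) == key else 0)

    def execute(self, op):
        """Run op only if this PE is a responder (top of mask stack is 1)."""
        if self.mask[-1]:
            op(self.memory)

pes = [PE({"color": c, "count": 0}) for c in ("red", "blue", "red")]
for pe in pes:
    pe.push_search("color", "red")    # broadcast key "red" to every PE
    pe.execute(lambda m: m.__setitem__("count", m["count"] + 1))
[pe.memory["count"] for pe in pes]    # [1, 0, 1]
```

Using a stack rather than a single bit lets nested searches restore the previous responder set by popping.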

Using the top of the mask stack in this way, further associative processing can be limited only to responders. If a PPE is a responder, instructions and data move through the ID/EX latch

to the EX stage, where ALU arithmetic and logical operations are performed. Then memory

access operation can be performed in the MEM stage if necessary, and finally, the WB stage can

write network data, ALU results, or data read from memory back to the register file.

In these PPEs, pipeline control hazards are also regulated using the mask bit. If the PE is not

a responder, the mask bit will be used to deactivate the ID/EX latch, thus preventing any

following signals or data from going through to the EX or following stages. However, any

preceding instructions will still be allowed to finish. Sequential control hazards can be treated in

the traditional way; combined sequential-parallel hazards can be dealt with by pushing no-op

bubbles through the pipe. Other optimizations, such as register forwarding, could also be added,

but for now will be left for future work.


3.1.2 Pipeline Performance

We have implemented the pipelined ASC Processor on an Altera APEX20K1000C FPGA.

This FPGA is not particularly large by today’s standards, comparable to only one million gates, but sufficient at present for a proof-of-concept. Altera’s web site [30] lists various 8-bit processor cores implemented on this particular FPGA and speed grade, with clock speeds ranging from 30 to 106 MHz, typically 60-68 MHz.

Described in Chapter 2 of this dissertation, our previous non-pipelined ASC Processor [6] was not tuned for performance, mainly being used to test the implementation of associative

SIMD computing, and as such had a clock speed of only 15 MHz. The new pipelined ASC

Processor described here was designed more carefully, with shorter critical paths, and as such has a clock speed of 56.4 MHz, comparable with the 8-bit processors cited by Altera. Moreover, with the 5-stage pipeline in our ASC Processor, peak instruction throughput can approach that of a 300 MHz non-pipelined processor, much faster than any of those processors.

As stated above, our scalable pipelined ASC Processor is implemented on a million-gate

Altera APEX20K1000C FPGA; this FPGA contains 38,400 Logic Elements (LEs). Without a multiplier, the Control Unit and SPE take up about 200 LEs, and each PPE requires about 500 LEs. Thus a one million-gate FPGA of this type can hold 1 Control Unit, 1

SPE, and about 70 PPEs.
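The PE-count estimate follows directly from the LE budget; a quick check using the approximate figures above:

```python
# Rough LE budget on the 38,400-LE APEX20K1000C, using the per-component
# estimates from the text (both are approximate synthesis results).
TOTAL_LES = 38_400
CU_PLUS_SPE = 200     # Control Unit + Sequential PE (no multiplier)
PER_PPE = 500         # one Parallel PE

max_ppes = (TOTAL_LES - CU_PLUS_SPE) // PER_PPE
print(max_ppes)       # 76 ideally; routing and control overhead bring it nearer 70
```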

3.2 Reconfigurable PE Interconnection Network

Virtually every parallel system, whether SIMD or MIMD oriented, supports some form of PE interconnection network. A linear array or 2D array (also referred to as a “mesh”) is quite


common, often with wrap-around at the ends to form a ring or torus. Some systems add row and column broadcast buses. Others add a hypercube for long-distance communication among PEs, or specialized hardware for arbitrary point-to-point communication. As may be expected, there is a tradeoff between added functionality and added hardware.

3.2.1 Need for a Reconfigurable Network in Associative Computing

Associative computing has many advantages, but also has the disadvantage of disrupting any fixed network pattern, since after an associative search the PEs to be processed further are in arbitrary locations in the array. One solution to this problem is to try to avoid using a PE network in associative computing, and it is indeed possible to write many useful associative algorithms [4,

5, 8] that do not use a network. Another solution is to allow arbitrary PEs to connect to their

neighbors to form so-called “coteries” [13], though that solution requires the PEs to be adjacent,

which may not be the case after a random associative search.

For our ASC Processor, we have developed a reconfigurable network to support PE

communication for associative computing. (Technically these are PPEs, using the

implementation-centric terminology of Section 3.1, but we will revert here to the more common

term “PE” for simplicity.) This reconfigurable PE network allows arbitrary PEs in the PE array

to be connected via either a linear array (currently implemented) or a 2D mesh (to be

implemented soon), without the restriction of physical adjacency. More specifically, each PE in

the PE array can choose to stay in the fixed existing network, or opt out of the network so that it

is bypassed by any communication. While this solution does not allow the responders to be

connected into an arbitrary configuration, it does support communication sufficient for many

algorithms.


This reconfigurable PE network is crucial for the MASC Processor described later in Chapter

5 of this dissertation, designed for the more powerful Multiple Instruction Stream ASC (MASC) model. In MASC, multiple Control Units, referred to as Instruction Streams (ISs), are supported, each of which can control arbitrary PEs / responders in the PE array. To support the MASC model, each IS will require a separate reconfigurable PE network, which can reconfigure dynamically as each IS’s responder set changes.

3.2.2 Reconfigurable Network Architecture

The reconfigurable PE network is currently implemented as a linear array, with support for a

2D mesh planned. Using this network, each PE can directly read its neighbor’s register file and store the data into its own ID/EX latch for further processing. For example, consider the following instruction:

Add $left($R1) $R2 $R3

This instruction, executed in lockstep by a set of responder PEs under the direction of the

Control Unit, will cause each PE to add the data from its left neighbor’s register R1 to the data in its own register R2, and store the result into its own register R3, all in one clock cycle. Similarly, data from the left neighbor could be added to broadcast data and stored in the PE’s register file, data from the left and right neighbors could be averaged and stored locally, etc.
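The lockstep read-then-write semantics of this instruction can be sketched over plain Python lists. The register names and the wrap-around at the array ends are assumptions made for illustration, not details of the real ISA:

```python
# Lockstep semantics of "Add $left($R1) $R2 $R3" on a linear PE array.
# Register i of PE p is regs[p][i]; wrap-around at the ends is assumed.
def add_left(regs, src_left, src_own, dst):
    """Each PE adds its left neighbor's src_left register to its own
    src_own register and stores the result in dst, all 'at once':
    all reads happen before any writes, as in synchronous hardware."""
    n = len(regs)
    lefts = [regs[(p - 1) % n][src_left] for p in range(n)]   # read phase
    for p in range(n):                                        # write phase
        regs[p][dst] = lefts[p] + regs[p][src_own]

regs = [{"R1": 10 * p, "R2": p, "R3": 0} for p in range(4)]
add_left(regs, "R1", "R2", "R3")
[r["R3"] for r in regs]   # [30, 1, 12, 23]
```

Separating the read and write phases is what makes the single-cycle "all PEs at once" behavior well defined.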

To transfer data between these different sources efficiently in the network, a dedicated Data

Switch unit (see Fig. 14) is placed before the ID/EX latch in the PEs (the PPEs, using the terminology of Section 3.1). For the linear array network currently implemented in our ASC

Processor, the Data Switch acts as a network unit, connecting each PE’s Data Switch to its two neighbors’ Data Switches.


The Data Switch also implements responder processing — the associative processing functionality described earlier in Section 2.1. If the top of the mask stack is ‘1’, meaning the PE is a responder, the Data Switch routes data from its input ports to its output ports and processes data locally as described below. If the mask bit is ‘0’, however, the Data Switch acts as a bypass, transferring data directly between its left and right neighbors without any local processing.

Thus the whole network is reconfigurable, with only responders “connected” to the network and with non-responders being bypassed.
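The effect of Bypass Mode on a one-dimensional shift can be sketched as follows: with non-responders bypassed, each responder’s "left neighbor" is the nearest responder to its left, however far away it physically sits. This helper is a hypothetical software model, not the Data Switch logic itself:

```python
# Sketch of Bypass Mode on a linear array: data shifted "from the left"
# skips non-responders (mask 0), so each responder receives the value of
# the nearest responder to its left. Hypothetical helper, not the RTL.
def shift_right_over_responders(values, mask):
    out = [None] * len(values)
    last = None                       # most recent responder value seen
    for i, (v, m) in enumerate(zip(values, mask)):
        if m:
            out[i] = last             # receive from nearest left responder
            last = v                  # and offer own value to the right
    return out

shift_right_over_responders([10, 20, 30, 40, 50], [1, 0, 0, 1, 1])
# [None, None, None, 10, 40]
```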

Each Data Switch has six input ports and four output ports, as shown in Fig. 14. Two input ports provide local data from the PE’s register file. One input port connects via a data bus to the

Control Unit’s ID stage to provide immediate data. Another input port connects via a data bus to provide register data broadcast by the Sequential PE. The final two input ports provide data from the PE’s left and right neighbors.

The Data Switch has four output ports. Two output ports connect to the ID/EX latch to provide data for calculation, and two output ports provide data to the left and right neighboring

PEs.

The Data Switch routes data to its output in one of four “modes”. In Computation Mode, the two input data from the register file are connected to the two output data ports for calculation. In

Broadcast Mode, immediate data from the Control Unit’s ID stage or register data broadcast by the Sequential PE is connected to one of the output data ports, and one of the register file inputs is connected to the other output data port.

In Data Movement Mode, there are three options: Left, Right, and Both. In Data Movement

Left Mode, every PE’s left register file input connects to its left neighbor, its right register file


input connects to one ID/EX latch output, and data from its right neighbor connects to the other

ID/EX latch output. Data Movement Right Mode is similar.

Data Movement Both Mode blends the two. Every PE’s left register file input connects to its left neighbor, and every PE’s right register file input connects to its right neighbor.

Further, every PE connects data received from both neighbors to its two ID/EX latch outputs.

Finally, the Data Switch provides one last mode, Bypass Mode, to support responder processing for associative computing as described earlier in Section 2.1. This mode is set by the top of the mask stack. If the top of the mask stack is ‘0’, meaning this PE is not a responder, then the left neighbor’s output is connected directly to the right neighbor’s input, and vice versa. In this way, the PE is bypassed from the network, leaving only responders connected by the network.

After moving through the ID/EX latch and leaving the ID stage, data goes on to EX stage, where ALU arithmetic and logical operations are performed, and then on to the MEM and WB stages as described earlier. Recall that the top of the mask stack also controls the ID/EX latch, as described earlier in Section 3.1 — if the PE is not a responder, then data does not pass through the ID/EX latch, limiting further processing only to responders.

Fig. 14 Structure of Data Switch

3.2.3 Reconfigurable Network Performance

Unfortunately, the performance of our ASC Processor does degrade as the number of PEs is increased when Bypass Mode is present in the reconfigurable network. Implemented on a million-gate Altera APEX20K1000C FPGA as described earlier, a 4-PE processor requires 2152 LEs and runs at 56.4 MHz, comparable to other processors implemented on this FPGA. If we do not include Bypass Mode in the reconfigurable network, the processor frequency remains the same when the number of PEs is increased from 4 to 50. However, with Bypass Mode added, when the number of PEs is increased to 50, the frequency drops to 22 MHz. The delay is due to the long path from the first PE to the last PE in the array, but can perhaps be reduced in the future using a pipelined or other multi-hop architecture.

3.3 Conclusion

This chapter describes the implementation of the second generation of our scalable ASC

Processor, and the addition of pipelining and a reconfigurable network to that processor — both important steps forward for SIMD systems. With this processor as a base, the next chapter describes a modified version of this processor with a specialized network designed to support a string matching algorithm.


4 A Specialized ASC Processor with Reconfigurable 2D Mesh for Solving the Longest Common Subsequence (LCS) Problem4

The previous chapter has described the implementation of a scalable pipelined ASC

Processor with a reconfigurable network suitable for a variety of applications. In this chapter, by contrast, we describe a specialized version of this processor designed to support an innovative

LCS algorithm that is useful in bioinformatics applications such as DNA sequence comparison. This

specialized ASC Processor is not only useful in its own right, but it serves as an example of how

the basic architecture can be further modified for specific applications.

As new genes are sequenced, it is common for molecular biologists to compare the new

gene’s DNA to known sequences. One simple form of DNA sequence comparison is done by

solving the Longest Common Subsequence (LCS) problem. In this chapter, we propose a parallel

algorithm and specialized FPGA-based processor (the associative ASC Processor with

reconfigurable 2D mesh) to solve the exact and approximate match LCS problems. This solution

uses inexpensive hardware and can be reconfigured as new analysis techniques are developed,

making it particularly attractive for processing biosequences.

4 Most of this chapter has been published in [33]. While I designed and implemented the specialized ASC Processor and assisted in the hardware-oriented details of the implementation of the LCS algorithms, the bulk of the LCS algorithm design was done by my co-author Sabegh Singh Virdi.


4.1 Genome Sequence Comparison

One problem of great interest to molecular biologists [31] is the problem of sequence comparison [32], in particular comparing nucleotides of nucleic acids (DNA and RNA). DNA molecules encode genetic information for most organisms, while RNA molecules serve a variety of functions (e.g., messenger RNA carries the information used to synthesize proteins).

Nucleic acids are composed of a sequence of nucleotides. DNA molecules are composed of four types of (deoxyribo) nucleotides —A, G, C, and T — and can be millions of units long.

RNA molecules are composed of four types of (ribo) nucleotides — A, G, C and U — and are shorter but still up to thousands of units long. Each nucleotide is characterized by the base it contains: adenine (A), guanine (G), and cytosine (C) for both DNA and RNA, and thymine (T) for DNA or uracil (U) for RNA (T and U play similar roles). An organism’s genetic information is encoded in the linear ordering of these nucleotides within its DNA molecules.

When a new gene is found, its function is usually unknown, so molecular biologists often compare the new gene’s biosequence to known sequences from a genome bank, under the assumption that the new sequence may be functionally or physically similar to a known gene. As more and more DNA fragments are sequenced, the need for comparing the new sequences to those already known increases.

4.1.1 Genome Sequence Comparison By Finding the Longest Common Subsequence (LCS)

Comparison between biosequences is usually done by aligning (possibly noncontiguous) subsections of the sequences such that corresponding subsections contain either the same symbols or substitutions. The simplest form of a sequence analysis involves solving the Longest


Common Subsequence (LCS) problem [34], where we eliminate the operation of substitution and allow only insertions and deletions.

Given an arbitrary string, a subsequence of the string can be obtained by deleting zero or more symbols, not necessarily consecutive ones. For example, if string X is ACCTTGCTC, then

AGCTC and ACTTTC are subsequences of X (4 and 1+2 deletions, respectively), whereas

TGTCT and CTCG are not.

Although there are typically many common subsequences between two strings, some are longer than others. Given strings of size m and n, the Longest Common Subsequence (LCS) of string X [1..m] and Y [1..n], LCS(X,Y), is a common subsequence of maximal length. For example, if string X is AGACTGAGTGA and string Y is ACTGAT, then A, ACT, and ACTGA are all common subsequences of X and Y, but ACTGA is the longest common subsequence if an exact match is required. However, if the deletion of characters is allowed (finding an approximate match), then deleting the third G in X gives ACTGAT as an even longer common subsequence.

4.1.2 Solving the Longest Common Subsequence (LCS) Problem Using a Specialized ASC Processor with Reconfigurable 2D Mesh

Fast solutions to these LCS problems are needed not only in biosequence comparison, but also in areas such as editing error correction and syntactic pattern recognition. Since genetic databases are quite large, efficient dynamic programming algorithms have been developed [35]. To further reduce the processing time, parallel algorithms based on linear systolic arrays [36] and specialized hardware [37] have been developed.


However, solutions based on specialized hardware tend to be expensive and difficult to change as new analysis techniques are developed.

The LCS problem is typically solved with dynamic programming techniques after filling strings of size m and n into an m x n table. The table elements can be regarded as vertices in a graph, and the dependencies between the table values define the edges. The LCS problem is solved by finding the longest path between the vertex in the upper left corner and the one in the lower right corner of the table.
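For reference, the sequential dynamic-programming formulation just described can be written as follows. This is the textbook algorithm, not the parallel algorithm developed in this chapter; note that it permits deletions in both strings, corresponding to the approximate-match sense used later:

```python
# Classic O(mn) dynamic-programming LCS: fill the table, then backtrack
# from the lower right corner to recover one longest common subsequence.
def lcs(x, y):
    m, n = len(x), len(y)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    out, i, j = [], m, n          # backtrack to recover one LCS
    while i and j:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1]); i -= 1; j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))

lcs("AGACTGAGTGA", "ACTGAT")   # "ACTGAT", as in the Section 4.1.1 example
```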

In this chapter, we describe a new parallel algorithm implemented on reconfigurable hardware to solve the LCS problem. Section 4.2 describes our computing model, our FPGA-based associative SIMD ASC Processor, and the processing element interconnection network required to solve the LCS problem efficiently.

Our LCS algorithm, targeted toward this ASC Processor with this interconnection network, is then described in Section 4.3. Finally, the chapter draws some conclusions on our implementation.

4.2 PE Interconnection Network

Most parallel systems support some form of PE interconnection network. A linear array or a

2D mesh is quite common, often with wrap-around at the ends to form a ring or torus. Support for broadcast (at least from the control unit) and reduction is also common, though some systems augment the mesh with row and column broadcast buses. Other systems add a hypercube for long-distance communication, or specialized hardware for arbitrary point-to-point communication. There is of course a tradeoff between added functionality and added hardware.


Fig. 15 (a) Coterie Network [13], and (b) Switches for PE Interconnect and Bypass

Some parallel systems also add reconfigurability to address particular application domains.

Herrmann’s Asynchronous Reconfigurable Mesh [18] was one of the first reconfigurable networks. Used in an associative processor for image processing, the PEs in his processor could reconfigure the mesh network to form local regions of connection called coteries (described more fully in Section 4.2.1).

Our ASC Processor also implements a reconfigurable linear array and 2D mesh as described in Chapters 2 and 3 of this dissertation. Needed for associative computing, this reconfigurable

PE network allows arbitrary PEs in the PE array to be connected via either a linear array

(described in Chapter 3) or a 2D mesh (implemented to support the LCS algorithm described in this chapter), without the restriction of physical adjacency. More specifically, each PE in the PE array can choose to stay in the fixed existing network, or opt out of the network so that it is bypassed by any communication. This reconfigurable network has been augmented further with some concepts from coteries to more efficiently solve the LCS problem as described later in this chapter.


4.2.1 Coterie Network

The coterie network [13] was developed by Weems in the late 1980s as a reconfigurable network to support image processing. As shown in Fig. 15a, the coterie network allows a 2D array to be partitioned into a set of distinct regions called coteries. A broadcast goes to all members of the coterie, and coteries can communicate to exchange information. In their image processing application, separate coteries were assigned to different features in the image, and multi-associative programming was proposed to process those regions in parallel. However, this model was never fully implemented.

The key to the reconfigurability of the coterie network is the switches that interconnect the

PEs, shown in Fig. 15. PEs in a particular region can close the appropriate switches to broadcast within that coterie, or PEs on the boundaries of a coterie can close the appropriate switches to communicate with adjacent coteries.

One set of switches, shown as bold lines in Fig. 15b, connects a given PE to its North, East,

West, and South neighbors, for either local or coterie-broadcast communication. Another set of switches, mentioned only briefly by Weems et al. but important for our LCS algorithm, allows neighboring PEs to bypass a given PE in their communication — for example, a PE’s North neighbor could bypass the PE to communicate directly with its South neighbor or its West neighbor. (Weems [13] described only the corner-turning bypasses and not the horizontal and vertical bypasses, but they are an obvious extension and again necessary for our LCS algorithm.)


4.2.2 Modifying the ASC Processor’s Reconfigurable Network to Support the LCS Algorithm

The key to the reconfigurability of the ASC Processor [7] is the data switch that allows each

PE to communicate with its neighbors. Since the ASC Processor is an associative SIMD array, it must support associative search (all PEs search for a key in their local memory, and those that find it are designated as responders) as well as associative computing (responders are processed further, either sequentially or in parallel). To support associative computing, the data switch has a bypass mode to allow PE communication to skip non-responders.

To support the LCS algorithm described in Section 4.3, we modified the ASC Processor’s original reconfigurable network from a linear array into a 2D mesh and added some features inspired by the coterie network. More specifically, the bypass mode was augmented to allow both horizontal and vertical bypassing of the PE as well as corner-turning bypasses (e.g., directly from a PE’s South to West neighbors). Our algorithm does not require coteries of arbitrary shape, but only coteries along each row and column, so we simply augmented our ASC Processor’s new

2D mesh with row and column broadcast buses to allow the first processor in the row/column to broadcast to the rest of the row/column. We refer to the resulting network as a reconfigurable 2D mesh to highlight the bypass capability. If the biosequences being compared are too large for the available PE array (which is likely), the sequences are “folded” onto the PEs available.


4.3 Solving the LCS Problem on the ASC Processor with Reconfigurable 2D Mesh

Inspired by the features of the coterie network and by the parallel string matching algorithms with VLDCs developed by K. L. Chung [38] for the m x n MCCRB model, we developed an LCS algorithm that can be efficiently supported by our ASC Processor when modified to support a reconfigurable 2D mesh (with bypass mode) and row/column broadcast buses as described in

Section 4.2.2.

4.3.1 ASC Processor Exact Match LCS Algorithm

This section presents the exact match version of our LCS algorithm for the ASC Processor

with reconfigurable 2D mesh and row/column broadcast buses described in Section 4.2.2. Initially,

we assume the data switches of all the PEs are open, meaning each PE is disconnected from all of its neighbors and no switches are set to bypass the PE. Each PE also has a Match Register

“M”, a Length Register “L”, and a Gap Register “G”, all initially set to 0.

Step 1 — Broadcast the Text String

Broadcast each character of the n-character text string (stored in the first row) from the first PE in each column to all the other PEs in that column using the column broadcast bus.


Fig. 16 Text String After Broadcast Along Column Buses (every row of the PE array now holds A G A C T G A C T G A)

Suppose the text string T= T(1) T(2)…..T(n) has been stored in row 1 of the PE array such that PE(1,j) stores T(j). To follow an example throughout this section, suppose Text string

T=AGACTGACTGA and Pattern string P=ACTGAC.

We broadcast text string T from the first PE in each column to the other PEs in that column using the column broadcast bus, as illustrated in Fig. 16.

Step 2 — Broadcast the Pattern String

Broadcast each character of the m-character pattern string (stored in the first column) from the first PE in each row to all the other PEs in that row using the row broadcast bus.

Similarly, the pattern string P is broadcast from the first PE in each row to the other PEs in that row using the row broadcast bus. Each PE now holds both an element of the text string and an element of the pattern string.

Step 3 — Identify Character Matches

In parallel, Push Mask (push “1” onto the top of the mask stack) if there is a match between the pattern string and text string in the PE.

Each PE now compares its element of the text string to its element of the pattern string, and

sets the Match Register “M” to 1 if they are equal or 0 otherwise. The resulting Match Register

“M” values are shown at the center of the PEs in Fig. 17a.
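Steps 1 through 3 amount to forming the match matrix M in parallel. In miniature (plain Python, illustrative only, with one matrix entry per PE):

```python
# Steps 1-3 in miniature: T is broadcast along columns and P along rows,
# then each PE sets its Match Register M where the two characters agree.
T = "AGACTGACTGA"   # text string, one character per column
P = "ACTGAC"        # pattern string, one character per row

M = [[1 if p == t else 0 for t in T] for p in P]
sum(map(sum, M))    # 17 matched cells in this example
```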


Fig. 17 (a) PEs Set Bypass Switches Along Common Subsequences, and (b) PEs Identify the LCS

Step 4 — Set Bypass Switches Along the Common Subsequences

In parallel, move the top of the mask stack to the West neighbor’s reg1 and South neighbor’s reg2.
BYPASS from West to South in all PEs whose top of the mask stack is 1.
In parallel, Push Mask if reg1 = reg2 = 1.
BYPASS from North to East in all PEs whose top of the mask stack is 1.

This step is the key to the algorithm, where we connect the PEs that correspond to characters

common to both strings. The first line tells the West and South neighbors of a match, and the

second line sets the matched PE to be bypassed from West to South (for example, notice the

West-South bypass switch is set in the PE in row 2, column 4, of Fig. 17a due to the C-C match

in that PE). The last lines then set a North-East bypass in all PEs whose North and South

neighbors have a match. The result is that West-South bypass switches are set in the matched

cells, and North-East bypass switches are set to connect those matched cells along a common

subsequence. Fig. 17b illustrates the result.

Step 5 — Identify the LCS (Sequential Version)

Each PE at the beginning (bottom) of an LCS sends a token (initially 0) to its West neighbor.
A PE receiving a token adds 1 to the token if its Match Register “M” contains 1, and passes the token on if its West-South bypass switch is set, or stores it in its Length Register “L” if it is the end of an LCS.
The PE with the largest value in its Length Register “L” is the start of the LCS.


This is the simplest method for finding the longest of the common subsequences. After line 2 finishes, the running example (see Fig. 17b) gives three long common subsequences ending at

PE[1,3] (ACTGAC, of length 6), PE[1,7] (ACTGA, of length 5), and PE[4,2] (GAC, of length 3).

Of these, the LCS is ACTGAC, ending at PE[1,3]. Note that finding a maximum value across all

PEs is a constant-time operation for a SIMD associative processor, so this step has running time

O(nm).
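A software analogue of this token-counting step: each chain of matched PEs is a diagonal run in the match matrix, and the token arriving at the end of a run has counted that run's length. The sketch below is a model of the result, not of the PE-array token passing, and its strings are illustrative:

```python
def longest_exact_lcs(pattern: str, text: str) -> int:
    """Software analogue of Step 5 (Sequential Version): a token entering
    each diagonal chain of matched PEs is incremented at every match; the
    maximum over all Length Registers is returned."""
    best = 0
    prev = [0] * (len(text) + 1)        # token values entering the row above
    for p in pattern:
        cur = [0] * (len(text) + 1)
        for j, t in enumerate(text, start=1):
            if p == t:
                cur[j] = prev[j - 1] + 1   # pass the token along the chain, +1
                best = max(best, cur[j])   # associative max across all PEs
        prev = cur
    return best
```

The nested loops here take O(nm) time in software; on the PE array the token passing proceeds along all chains at once, and the final maximum is found with the constant-time associative reduction.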

Step 5 — Identify the LCS (Parallel Version)

    Each PE at the beginning (bottom) of an LCS sends its [row, column] ID to its West neighbor
    A PE receiving an ID passes it on, or if it is the end of an LCS subtracts its own ID from the received ID to calculate the length of the path traveled and stores that value in its Length Register “L”
    The PE with the largest value in its Length Register “L” is the start of the LCS

This version of Step 5 uses the switches that connect the PEs in the common subsequences.

The IDs in line 2 propagate in one step at electrical speeds.

4.3.2 ASC Processor Approximate Match LCS Algorithm

The algorithm described above solves the LCS problem for exact match, but not for

approximate match. To illustrate the difference, consider strings ACCAGG and AGACTGAGG,

where the longest common subsequence ACAGG is an approximate match rather than an exact match, and moreover occurs twice in the second string (i.e., using two different sets of character positions in AGACTGAGG).

For approximate match, we do not have a continuous bypass path from the beginning to ending

PE, but we can still find the LCS. We will illustrate this algorithm for pattern string ACCAGG and text string AGACTGAGG, finding approximate match ACAGG.

The first steps in the approximate match LCS algorithm are the same as for exact match:


Step 1 — Broadcast the Text String
Step 2 — Broadcast the Pattern String
Step 3 — Identify Character Matches
Step 4 — Set Bypass Switches Along the Common Subsequences

The problem after these steps is that the switches connect regions of the LCS but do not cross the “gap” between non-contiguous regions of characters. See Fig. 18a, where switches close the

AC and AGG paths, but those two paths are separated by a “gap” across the shaded PEs in the center of the figure. The LCS algorithm must find a way to span this gap.

Step 5 — Set Bypass Switches to Span Gaps in the Common Subsequences

    Each PE at the beginning (bottom) of an LCS sends a token (initially 0) to its West neighbor
    BYPASS from West to South in all PEs that receive this token whose top of the mask stack is 0
    BYPASS Vertically (from North to South) in all PEs above the PE identified in line 2, except the PE in row 1
    Each PE at the start (top) of an LCS sends a token (initially 0) to its South neighbor
    BYPASS from West to South in all PEs that receive this token whose top of the mask stack is 0
    BYPASS Horizontally (from West to East) in all PEs to the right of the PE identified in line 5, except the PE in the last column
    BYPASS from West to South in all PEs whose Vertical and Horizontal bypass switches are set and whose Match Register “M” is 0, and set the Gap Register “G” in those PEs to 1

As in Step 5 (Sequential Version) of the exact match algorithm, we inject tokens at the bottom of each common subsequence. Each time a token reaches a gap, it enters the South port of some PE and then stops there since that PE will not have its West-South bypass switch set to pass along the token.

We now close the West-South bypass switch of that PE, along with the Vertical bypass switches of all PEs above it in this column (except the one PE in the first row), allowing the

token to cross the gap in “search” of the next region in the sequence. This step is repeated as the token propagates through West-South bypass switches across the gap, possibly allowing multiple tokens to reach PEs in the first row.


Fig. 18 (a) Common Subsequence Separated by a Gap, and (b) PEs Set Bypass Switches to Span the Gap

In a similar way, we inject different token symbols from the South port of all those PEs which are at the top of each common subsequence. Each time one of these tokens reaches a gap, it will enter the West port of some PE and then stop there. We now close the West-South bypass switch of that PE, along with the Horizontal bypass switches of all PEs to the right of it in this

row (except the one PE in the last column), again allowing the token to cross the gap in “search”

of the next region in the sequence. This step is repeated as the token propagates through West-South

bypass switches across the gap, possibly allowing multiple tokens to reach PEs in the last column.

This process is illustrated in Fig. 18b, which shows the Vertical and Horizontal bypass switches that are set. (Since multiple tokens are injected simultaneously, more Vertical and Horizontal bypass switches are set than might be expected at first.)

Finally, all PEs whose Vertical and Horizontal bypass switches are set, but

whose Match Register “M” contains 0, set their Gap Register “G” to 1 and close their West-

South bypass switches. These are the PEs corresponding to gaps that can be successfully crossed

by the token.


Step 6 — Identify the LCS

As in Step 5 (Sequential Version) of the exact match algorithm, we inject tokens at the

bottom of each common subsequence and find the LCS.
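End to end, the approximate-match algorithm returns the length of the classic longest common subsequence, since the gap-spanning switches join contiguous runs in order. A software reference model of that result is the standard LCS dynamic program below; it models what the hardware computes, not how the bypass network computes it:

```python
def lcs_length(pattern: str, text: str) -> int:
    """Length of the longest (not necessarily contiguous) common
    subsequence -- the quantity the approximate-match algorithm finds."""
    prev = [0] * (len(text) + 1)
    for p in pattern:
        cur = [0]
        for j, t in enumerate(text, start=1):
            cur.append(prev[j - 1] + 1 if p == t else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

lcs_length("ACCAGG", "AGACTGAGG")  # the chapter's example: ACAGG, length 5
```

Running this model on the chapter's pattern ACCAGG and text AGACTGAGG yields 5, matching the approximate LCS ACAGG found above.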

4.4 Conclusions

This chapter has described a solution to the LCS problem that combines a new parallel

algorithm with an efficient implementation on specialized hardware. We were inspired by the

coterie network, and modified our scalable ASC Processor described in Chapter 3 to add a

reconfigurable 2D mesh with row/column broadcast buses. We then developed an efficient

algorithm to solve the LCS problem on that processor, for both exact and approximate match.

Since the ASC Processor is based on FPGAs, the hardware is inexpensive and can easily be

reconfigured as new analysis techniques are developed.

As a proof of concept, we implemented this specialized ASC Processor with reconfigurable

2D mesh on an Altera APEX1000KC FPGA, which was large enough to hold the

6x11 array of PEs used in the examples of Section 4.3. Larger sequences could be compared by

either “folding” them onto this small PE array (gaining a modest speed improvement) or by

building a larger PE array using larger FPGAs.


5 An ASC Processor to Support Multiple Instruction Stream Associative Computing (MASC)5

Architectures that run multiple instruction streams on a SIMD array have rarely been explored. A few models have been proposed for simulating MIMD on SIMD [24], but they are either instruction-level or thread-based. The primary idea is to treat instructions as data: the program is stored both in the control unit and in each PE’s memory, so each PE might execute completely different instructions. In the previous chapters, I have described single-instruction-stream associative processors and their applications. In this chapter, a multiple-instruction-stream architecture is implemented, built on top of the ASC Processor we have already implemented. It supports both data and control parallelism.

As described in previous chapters, the ASC Processor is a SIMD architecture — all PEs, or all responders, operate under the direction of a single Control Unit / Instruction Stream. While this data parallelism provides improved performance over sequential solutions, when the responders are busy processing local data, all the non-responders are idle.

For search-intensive applications such as data mining and bioinformatics, a SIMD Processor

Array on a Chip may be an effective architecture, but if the application is control-intensive, a

Multiple SIMD (MSIMD) architecture may further increase processor utilization. MSIMD architectures improve processor utilization when opportunities for control parallelism are present by dedicating separate Control Units / Instruction Streams to concurrent basic blocks of code. MSIMD architectures have been developed for shared memory [39], and for shared memory with pipelined instruction streams

5 Most of this chapter has been published [40].


[41]. Also using shared data memory, the MISC project [42] has 4 specialized PEs to extract program streams that can be executed in parallel. However, none of these architectures support associative computing.

In this chapter, we describe an associative MSIMD version of our ASC Processor, called the

MASC Processor. Like our ASC Processor, the MASC Processor is implemented using FPGAs, is easily scalable, and dynamically assigns tasks to Processing Elements as the program executes.

5.1 Multiple-Instruction-Stream MASC Processor

To provide more control parallelism in the ASC model, a Multiple-SIMD (MSIMD) version of the ASC model has been proposed, called the Multiple Instruction Stream ASC (MASC) model.

While algorithms and a runtime environment have been developed [43, 44] for the MASC model, it has not previously been implemented in hardware.

However, a static MASC architecture has been recently proposed [45] by another research group at our university. Their architecture uses one manager IS to allocate tasks to an array of worker ISs that control the PEs. In their proposed architecture, the basic blocks in the code are statically assigned to specific ISs.


Fig. 19 Dynamic Allocation of Tasks to ISs

Fig. 20 Task Manager and Instruction Stream (IS) State Machines

While that static MASC architecture gives the programmer precise control over task allocation, in other situations dynamic task allocation may be more suitable. In this chapter, we describe a scalable MASC Processor that dynamically allocates tasks to ISs.

In our dynamic MASC Processor, tasks are assigned to ISs from a common pool as those ISs become available (see Fig. 21). Once PEs are assigned to an IS, they listen exclusively to instructions and broadcast data from that IS until reassigned. We have implemented a small prototype of this MASC Processor on an Altera FPGA, and have demonstrated its use on sample code.


Fig. 21 Task Manager (TM) and Instruction Stream (IS) Pools in the MASC Processor

5.1.1 Managing Tasks in the MASC Processor

In most MSIMD research, each Control Unit (CU) / Instruction Stream (IS) is treated as a single entity that handles both control and data parallelism. First, for data parallelism, the IS must broadcast data-parallel microinstructions or control signals to the PE array, just as in a SIMD system — we refer to this process as Instruction Broadcast. Second, the IS must handle control parallelism when the program encounters IF_ELSE or SWITCH instructions. In this case, the IS must assign different ISs to each block of code inside the conditional operations — we refer to this process as Task Allocation.

However, combining Task Allocation and Instruction Broadcast into a single unit reduces the efficiency of that unit. Once an IS forks a child (Task Allocation), it must wait until that child rejoins, and so cannot simultaneously control PEs (Instruction Broadcast).


Our MASC Processor separates the functionality for Instruction Broadcast and Task

Allocation into two different types of units: Instruction Stream (IS) units to handle Instruction

Broadcast, and Task Manager (TM) units to handle Task Allocation. With this separation, a TM can be allocating tasks to basic blocks inside IF_ELSE / SWITCH statements at the same time that ISs are directing PEs to process the instructions in various tasks.

The MASC Processor TM / IS structure is shown in Fig. 21. In this figure, 3 TMs and 3 ISs are shown, each grouped into separate pools. Within each pool, if more than one unit is available, the one with the smallest ID is chosen. The IS pool also contains a register file (not shown) that is used to store scalar variables shared among the ISs (introducing a new class of scalar variables, separate from the traditional SIMD scalar variables inside each IS).

Fig. 19 illustrates the process of dynamically allocating tasks to ISs. IS0 begins executing the program and controlling the entire PE array. When it encounters a conditional operation, it transfers program control to the first available TM (TM0), which allocates the two conditional branches to IS0 and the next available IS (IS1). TM0 waits for both tasks to finish; when they complete, it transfers control back to IS0 to continue execution, and both TM0 and IS1 return to their respective pools.
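This allocation discipline can be sketched as a pair of ID-ordered pools. The following is a simplified software model; the class and method names are my own, not the processor's:

```python
class UnitPool:
    """Pool of TM or IS unit IDs; when more than one unit is free,
    the one with the smallest ID is chosen."""
    def __init__(self, prefix: str, count: int):
        self.free = [f"{prefix}{i}" for i in range(count)]

    def acquire(self) -> str:
        return self.free.pop(0)            # smallest ID first

    def release(self, unit: str) -> None:
        self.free.append(unit)
        self.free.sort()

tm_pool, is_pool = UnitPool("TM", 3), UnitPool("IS", 3)
is0 = is_pool.acquire()                    # IS0 starts the program
tm0 = tm_pool.acquire()                    # conditional reached: TM0 allocates
is1 = is_pool.acquire()                    # the second branch goes to IS1
# ... both branches complete and rejoin ...
is_pool.release(is1)
tm_pool.release(tm0)                       # TM0 and IS1 return to their pools
```

Because acquisition always takes the smallest free ID, the same program run on a lightly loaded processor reuses the low-numbered units, matching the behavior described above.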

5.1.2 Task Manager (TM)

The key element of the MASC Processor, each TM is implemented as a four-state finite state machine as shown on the left side of Fig. 20. Each TM is initially in the IDLE state, waiting for an IS to reach a conditional operation such as an IF_ELSE or SWITCH statement. When an IS reaches a conditional operation, it calls a TM (chosen from the pool of idle TMs) to perform task allocation, and the chosen TM then enters the TASK_ALLOCATION state. At the same time,


those PEs previously listening to the IS are directed to start listening to the chosen TM to find out which IS will be issuing instructions to them next.

In the TASK_ALLOCATION state, the TM decodes the conditional operation and sends the

PEs currently listening to it the first condition to test (there is one condition for each branch of the IF_ELSE or SWITCH statement). Those PEs for which the test is successful are marked for assignment to a new IS that will execute the corresponding branch.

The TM then enters a WAIT_FOR_IS state, where it waits for an available IS to assign to that branch. When an IS becomes available, the TM gives it the branch to execute (pointing it to the beginning of the code), and directs a set of PEs to listen to that IS. The TM also increments a counter that keeps track of the number of ISs assigned by the TM. Then the TM returns to the

TASK_ALLOCATION state to process another condition and allocate another task.

Once all conditions have been allocated to ISs, the TM enters the JOIN state. As each IS completes its task and returns to the TM, the counter keeping track of the number of ISs assigned is decremented. When it reaches 0, the TM returns to the IDLE state.

5.1.3 Instruction Stream (IS)

Complementing the TMs, each IS is also implemented as a four-state finite state machine as shown on the right side of Fig. 20. Each IS is initially in the IDLE state, waiting for a TM to assign it a block of code to execute. When assigned a block of code, the IS enters the FETCH state.

In the FETCH state, the IS fetches, decodes, and broadcasts instructions to PEs that are listening to this IS. Each instruction is processed in turn, until the IS reaches a conditional operation such as an IF_ELSE or SWITCH statement. At this point, the IS enters the CALL_TM


state and waits for an idle TM to begin task allocation. As soon as a TM is available, the PEs previously listening to this IS start listening to that TM, and the IS returns to IDLE state.

When an IS executes a branch of a conditional operation, it is also in the FETCH state, where it fetches, decodes, and executes instructions. When it finishes processing the branch, it must tell the TM that called it that it has finished its task. To do so, it enters the JOIN state and signals the

TM that it is ready to return. After the IS successfully signals the TM (and the TM decrements its counter), the IS returns to the IDLE state where it can be called by other TMs. When all ISs have returned, the TM returns to the IDLE state.
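The two state machines can be summarized as transition tables. The state names follow Fig. 20, but the event names are illustrative labels of mine, not signals defined in the dissertation:

```python
# Task Manager: IDLE -> TASK_ALLOCATION -> WAIT_FOR_IS (per branch) -> JOIN -> IDLE
TM_NEXT = {
    ("IDLE",            "is_calls_tm"):   "TASK_ALLOCATION",
    ("TASK_ALLOCATION", "branch_marked"): "WAIT_FOR_IS",
    ("WAIT_FOR_IS",     "is_assigned"):   "TASK_ALLOCATION",
    ("TASK_ALLOCATION", "all_allocated"): "JOIN",
    ("JOIN",            "counter_zero"):  "IDLE",
}

# Instruction Stream: IDLE -> FETCH, then CALL_TM (conditional) or JOIN (task done)
IS_NEXT = {
    ("IDLE",    "task_assigned"): "FETCH",
    ("FETCH",   "conditional"):   "CALL_TM",
    ("CALL_TM", "tm_available"):  "IDLE",
    ("FETCH",   "task_done"):     "JOIN",
    ("JOIN",    "tm_signaled"):   "IDLE",
}

def step(table: dict, state: str, event: str) -> str:
    """Advance a unit's state machine by one event."""
    return table.get((state, event), state)  # other events leave the state unchanged
```

Note the asymmetry the text describes: a TM cycles between TASK_ALLOCATION and WAIT_FOR_IS once per branch, while an IS simply alternates between IDLE and FETCH with a handoff state on either side.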

The architecture of the TMs and ISs complements the pipelined architecture in our previous

ASC Processor, which has carried over to this new MASC Processor. Before a conditional operation is encountered, each IS in the MASC Processor behaves just as it does in the ASC

Processor, fetching and decoding instructions and then passing control to the ID stage of the

Parallel PEs. However, when an IS in the MASC Processor encounters a conditional operation, it issues “bubble” instructions to the ID stage of the Parallel PEs while control is transferred to a TM.

5.1.4 Selecting the TM or IS to Control Each PE

The PE array and basic structure of the Sequential PE and Parallel PE Array in our ASC

Processor is shown in Fig. 13. This processor is pipelined, with the Control Unit containing the

IF (fetch) stage and part of the ID (decode) stage, and the PEs containing the rest of the ID stage, the EX (execute) stage, MEM (memory) stage, and WB (writeback) stage. This pipelined architecture is described more fully in [7].


In the ASC Processor, the Parallel PEs receive two types of signals from the Control Unit and Sequential PE (see Fig. 12). Immediate data or broadcast register data are scalar values broadcast to all PEs. Microinstructions or control signals (not shown in the figure) tell the PEs what operation to perform.

In the MASC Processor, the Parallel PEs are controlled by either TMs or ISs. TMs send control signals to the PEs; ISs send control signals, broadcast immediate data, or broadcast register data from the IS register file or the shared register file in the IS pool. To choose which

TM or IS each PE will listen to, a dedicated CU Selector unit is added to the ID stage of each Parallel PE as shown in Fig. 22. In a sense, the CU Selector is a large multiplexer, choosing which TM or IS will control the PE and to which broadcast network the PE should listen.

For example, consider the MASC Processor structure shown in Fig. 21 with three TMs (TM0,

TM1, and TM2) and three ISs (IS0, IS1, and IS2) and the dynamic allocation of tasks as shown in Fig. 19. All PEs are initially listening to IS0, meaning IS0’s ID is stored in their CU ID

Fig. 22 PE structure


register and thus their CU Selector passes on control signals and data broadcasts from only that one IS.

If IS0 reaches a conditional operation, it will send instructions to each PE directing them to store the ID of the first available TM (TM0, in this example) in their CU ID register. Until directed otherwise, all PEs will now listen only to TM0, with all other control signals or broadcast data blocked by the CU Selector.

TM0 then starts allocating tasks. Assume the conditional instruction is composed of two alternative conditions. TM0 sends the first condition to the PEs listening to it, directing them to perform the comparison using data in their local memories. In each PE where there is a match,

TM0 sets the top of the PE’s mask stack to ‘1’, designating the PE a responder (recall the definitions of associative computing in Section 1.2).

Those PEs that meet the condition now need to be switched to an IS, IS0 in this example.

TM0 enters the WAIT_FOR_IS state, where it issues masked control signals to the PEs listening to it. Only those PEs that are responders (with a ‘1’ on top of their mask stack) will receive those instructions, which store IS0’s ID into their CU ID register, directing them to now listen to IS0.

This process continues with the allocation of the other conditional branch to IS1, and eventually control is transferred back to IS0 once both branches complete.
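The CU ID filtering just described can be modeled in a few lines. This is a hypothetical sketch; the register and field names are mine:

```python
def deliver(pes, sender, action):
    """Model of the CU Selector: a broadcast from `sender` reaches only
    the PEs whose CU ID register currently names that sender."""
    for pe in pes:
        if pe["cu_id"] == sender:
            action(pe)

# Three PEs, all initially listening to IS0; masks mark responders.
pes = [{"cu_id": "IS0", "mask": m} for m in (1, 0, 1)]

# IS0 reaches a conditional: every listener is redirected to TM0.
deliver(pes, "IS0", lambda pe: pe.update(cu_id="TM0"))

# TM0 issues a masked redirect: only responders (mask == 1) switch to IS0.
deliver(pes, "TM0", lambda pe: pe.update(cu_id="IS0") if pe["mask"] else None)
```

After the masked redirect, the responders listen to IS0 while the non-responder is still held by TM0, awaiting assignment to the IS that will execute the other branch.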

5.1.5 Program Forking and Joining

The example program in Fig. 23 shows a more complete example of TM / IS allocation during program forking and joining, along with nested conditional operations. As in the previous example, IS0 is initially executing the program, and when a conditional operation is encountered


Fig. 23 Program and TM / IS Allocation During Program Forking / Joining

program control flow is transferred to TM0. TM0 then allocates the first case to IS0 and the second to IS1, again as in the previous example.

As IS0 executes its branch, it encounters another conditional operation. TM1 is called to allocate the tasks in this conditional, and it allocates the first case to IS0 and the second to IS2

(IS1 is still busy). Then TM1 waits for those two inner tasks to finish and return. As it waits,

IS0, IS1 and IS2 are running in parallel. Note that if IS1 had finished executing its task before


Fig. 24 Instruction Stream Tree Structure

TM1 performed this allocation, IS1 would have been allocated to the second case instead of IS2

— the assignment is dynamic and depends on runtime availability.

While TM1 is waiting for IS0 and IS2 to finish executing the two inner tasks, TM0 is waiting for IS0 and IS1 to finish executing the two outer tasks. As each task finishes, the counter in the corresponding TM is decremented. When both inner tasks are finished, TM1 returns program control to IS0. Then TM0 is called to wait for both outer tasks to finish, and when they do, TM0 returns control to IS0, which completes the program execution.

When conditional operations are nested in this manner, “child” TMs must keep track of their

“parent” TM. When TM1’s two inner tasks are finished and it returns program control to IS0, it must pass the ID of its parent TM, TM0, to IS0, so that when IS0 finishes the leftmost outer task, program control will return to TM0.
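The parent-TM bookkeeping for nested conditionals can be sketched as join counters plus parent pointers. The names below are illustrative, following the example of Fig. 23:

```python
# Each TM tracks its parent and the number of ISs still executing its branches.
parent = {"TM0": None, "TM1": "TM0"}
outstanding = {"TM0": 2, "TM1": 2}

def task_done(tm):
    """An IS reports that one of `tm`'s tasks has finished; when the join
    counter reaches zero, control is handed back toward the parent TM."""
    outstanding[tm] -= 1
    if outstanding[tm] == 0:
        return parent[tm]     # e.g., TM1 joining completely points back to TM0
    return None
```

Once TM1's second inner task completes, the model returns TM0, mirroring how the inner join hands control back outward so the outer join can complete.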

While our initial prototype MASC Processor has only 3 TMs and 3 ISs, the implementation can easily be scaled up as necessary. The number of TMs limits the degree of nesting, and the number of ISs limits the number of tasks executed in parallel, as shown in Fig. 24.


5.2 MASC Processor Performance

We implemented a small prototype of the MASC Processor on an Altera APEX1000KC

FPGA as a proof of concept. This small FPGA has only 1 million gates, or 38,400 Logic

Elements (LEs). For the MASC Processor, each PE requires about 500 LEs, each IS about 200

LEs, and the size of a TM is negligible.

Our prototype MASC Processor, configured with 3 TMs, 3 ISs and 9 PEs, uses about 4000

LEs, and runs at 35 MHz (a clock speed comparable to other small processors on that FPGA).

Because the PE array and number of TMs and ISs is scalable, a MASC Processor with hundreds of PEs and dozens of TMs and ISs could easily be constructed.

5.3 Conclusion

We have described the implementation of a Multiple SIMD (MSIMD) associative architecture called the MASC Processor, which is suitable for search-intensive embedded applications. The processor is scalable and dynamically assigns tasks to instruction streams, improving control parallelism while being limited only by the number of Task Managers and

Instruction Streams (both of which are also scalable).

We have prototyped this MASC Processor and tested it on small pieces of code. Future work will be to construct a larger processor and test it on complete programs (initially a recursive convex hull algorithm that requires a substantial number of PEs and control parallelism). Then we plan to further explore its application in image processing (processing multiple regions in the image simultaneously) or in bioinformatics (building on some of our existing work in genome sequence comparison).


6 Conclusions and Future Work

Over the years a number of variations on associative computing have been explored. At Kent State University (KSU), associative SIMD computing has its roots in work at the Goodyear Aerospace Corporation, but research at KSU has most recently focused on exploring the power of the associative computing paradigm as compared to traditional SIMD, and even MIMD, computing. In contrast, Prof.

Robert Walker’s research group at KSU has focused on implementing those associative concepts on a single chip by developing a new associative SIMD RISC processor, called the ASC

(ASsociative Computing) Processor, using modern FPGA implementation techniques.

My dissertation has described my development of a working, scalable ASC Processor that is

pipelined to improve performance, supports a reconfigurable network, can be specialized further

for dedicated applications (e.g., genome processing), and that can realize the multiple SIMD

(MSIMD) paradigm by supporting multiple control units.

6.1 Conclusions

My first contribution was to construct the first working ASC Processor and the first scalable

ASC Processor. While the two central components of that processor (the Control Unit and the

PE Array) were developed earlier by two Master’s students, they were never integrated into a

working processor. I connected the components they developed, and made the changes

necessary to integrate them into the first working ASC Processor. I then extended that initial

prototype ASC Processor into a scalable ASC Processor, and worked with two other students to

demonstrate its use on database processing and image processing.


While associative processors have been built in the past, and there are SIMD processor arrays currently commercially available, this is the only current implementation of an associative

SIMD processor array that I am aware of. As such, it demonstrates the feasibility of an

Associative SIMD Processor Array on a Chip, a useful complement to the Multiprocessor

Systems-on-a-Chip and SIMD Processor-Arrays-on-a-Chip that are already commercially available.

My second contribution was to develop and implement a pipelined ASC Processor. To the best of my knowledge, this is not only the first pipelined associative SIMD processor, but also the first working single-chip pipelined SIMD processor. Compared to my first scalable ASC

Processor, this new scalable pipelined ASC Processor is not only faster due to the pipelining, which improves the number of instructions processed per clock cycle, but is also faster due to a more efficient implementation, which increases the processor’s clock frequency.

My third contribution was to build on the reconfigurable PE interconnection network developed in conjunction with my pipelined ASC Processor to develop a specialized version of this processor designed to support an innovative LCS algorithm that is useful in bioinformatics applications such as genome sequence comparison. For this purpose, I modified the ASC Processor’s original reconfigurable network from a linear array into a 2D mesh and added some features inspired by the coterie network, plus row and column broadcast buses. This specialized ASC

Processor is not only useful in its own right, but it serves as an example of how the basic architecture can be further modified for specific applications.

My fourth and final contribution was to develop a first prototype of the MASC Processor, a

MSIMD version of ASC to better support control parallelism and increase PE utilization. Like


my ASC Processor, the MASC Processor is easily scalable. Moreover, the MASC Processor dynamically assigns tasks to Processing Elements as the program executes so the same program will run on MASC Processors of different sizes, albeit more efficiently on larger processors.

6.2 Future Work

One motivation for the development of the pipelined version of the ASC Processor was to gain the increased throughput afforded by the pipeline. However, most of the associative computing algorithms developed at Kent State have been relatively small examples that do not require or demonstrate the need for this increased throughput. One area of future work would be to explore more computationally-intensive associative computing algorithms that could benefit from this pipelined ASC Processor. Additionally, Prof. Johnnie Baker has a new research group that is collaborating with researchers from Ohio University to explore the use of associative computing for air traffic control, a real-time application that might benefit from the ASC

Processor’s increased throughput.

Another area for future research would be to further explore the use of the specialized ASC

Processor for genome sequence comparison. Larger processor arrays could be implemented to avoid the need to fold large sequences onto a small array. The hardware implementation could be improved for the approximate match case in an attempt to make it closer to constant time

(with respect to the number of PEs). Finally, the LCS algorithm and implementation could be modified to find the “best” common subsequence instead of simply the longest common subsequence — or modified in other ways of use to specific genome researchers.

Yet another area for future research would be to explore the use of this first MASC Processor for various applications. Larger MASC Processors could be constructed, and the tradeoffs


between the number of Instruction Streams and the number of PEs could be explored. The dynamic MASC Processor described here could be compared to the static MASC Processor and runtime environment proposed by students in Prof. Baker’s research group. To date, there have been very few MASC-specific algorithms developed, but others could be explored and implemented (e.g., the various Convex Hull algorithms developed earlier by Atwah [21] could benefit from the control parallelism afforded by the MASC Processor).

Finally, one last area of future work that would be of great value to future students working on variations of the ASC Processor would be to develop at least an assembler, if not a compiler, for that processor. To date, all the algorithms for this processor were coded by hand in machine language; an assembler would at least ease this tedious task. However, a compiler for the ASC language [5] developed earlier at Kent State by Prof. Jerry Potter would also be of great value, as a number of associative algorithms were written in that language over the years, but translating those algorithms to machine language is a particularly tedious task.


References

1. Jerraya, A. and W. Wolf, Multiprocessor Systems-on-Chips. 2004: Morgan Kaufmann.
2. ClearSpeed Products. http://www.clearspeed.com/products.
3. WorldScape Massively Parallel Computing. http://www.wscapeinc.com/technology.html.
4. Krikelis, A. and C.C. Weems, Associative Processing and Processors. Computer, 1994. 27(11): p. 12-17.
5. Potter, J., Associative Computing. 1992, New York: Plenum Publishing.
6. Wang, H., et al., A Scalable Associative Processor with Applications in Database and Image Processing. In 18th International Parallel and Distributed Processing Symposium (IPDPS'04), Workshop 15. 2004. Santa Fe, New Mexico.
7. Wang, H. and R.A. Walker, A Scalable Pipelined Associative SIMD Array with Reconfigurable PE Interconnection Network for Embedded Applications. In Proceedings of the 17th International Conference on Parallel and Distributed Computing Systems (PDCS'05). 2005. Phoenix, AZ.
8. Potter, J., et al., ASC: An Associative-Computing Paradigm. Computer, 1994. 27(11): p. 19-25.
9. Scherson, I.D. and S. Ilgen, A Reconfigurable Fully Parallel Associative Processor. Journal of Parallel and Distributed Computing, 1989. 6: p. 69-89.
10. Hall, J.S., D.E. Smith, and S.Y. Levy, Database Mining and Matching in the Rutgers CAM. In Associative Processing and Processors, A. Krikelis and C.C. Weems (Eds.). IEEE CS Press, 1997: p. 218-232.
11. Grosspietsch, K.E. and R. Reetz, The Associative Processor System CAPRA: Architecture and Applications. IEEE Micro, 1992. 12(6): p. 58-67.
12. Parhami, B., Search and Data Selection Algorithms for Associative Processors. In Associative Processing and Processors, A. Krikelis and C.C. Weems (Eds.). IEEE CS Press: p. 10-25.
13. Herbordt, M.C., C.C. Weems, and M.J. Scudder, Nonuniform Region Processing on SIMD Arrays Using the Coterie Network. Machine Vision and Applications, 1992. 5(2): p. 105-125.
14. Batcher, K.E., STARAN Parallel Processor System Hardware. 1974 National Computer Conference, AFIPS Conference Proceedings, 1974(43): p. 405-410.
15. Reed, R., The ASPRO Parallel Inference Engine (P.I.E.) – A Real Time Production Rule System. Proc. Conf. Aerospace: p. 85-89.
16. Meilander, W., J. Baker, and M. Jin, Importance of SIMD Reconsidered. In Proc. of the 17th International Parallel and Distributed Processing Symposium (Workshop in Massively Parallel Processing). 2003. Nice, France: IEEE.
17. Falkoff, A.D., Algorithms for Parallel-Search Memories. Journal of the ACM (JACM), 1962. 9(4): p. 488-511.
18. Herrmann, F.P. and C.G. Sodini, A Dynamic Associative Processor for Machine Vision Applications. IEEE Micro, 1992. 12(3): p. 31-41.
19. Duller, A.W.G., et al., An Associative Processor Array for Image Processing. Image and Vision Computing, 1989. 7(2): p. 151-158.


20. Esenwein, M. and J. Baker, VLCD String Matching for Associative Computing and Multiple Broadcast Mesh. In Proc. of the IASTED International Conference on Parallel and Distributed Computing and Systems. 1997.
21. Atwah, M., J. Baker, and S. Akl, An Associative Implementation of Classical Convex Hull Algorithms. In Proc. of the Eighth IASTED International Conference on Parallel and Distributed Computing Systems. 1996.
22. Meilander, W., J. Baker, and M. Jin, Predictable Real-Time Scheduling for Air Traffic Control. In Proc. of the 15th International Conference on Systems Engineering. 2002.
23. Fuchs, H., et al., Fast Spheres, Shadows, Textures, Transparencies, and Image Enhancements in Pixel-Planes. In Proceedings of the 12th Annual Conference on Computer Graphics and Interactive Techniques. 1985.
24. Wu, M.-Y. and W. Shu, MIMD Programs on SIMD Architectures. In 6th Symposium on the Frontiers of Massively Parallel Computation. 1996.
25. Walker, R.A., et al., Implementing Associative Processing: Rethinking Earlier Architectural Decisions. In Proc. of the 15th International Parallel and Distributed Processing Symposium (Workshop in Massively Parallel Processing). 2001.
26. Wang, H. and R.A. Walker, Implementing a Scalable ASC Processor. In Proc. of the 17th International Parallel and Distributed Processing Symposium (Workshop in Massively Parallel Processing). 2003. Nice, France.
27. Wu, M., R.A. Walker, and J. Potter, Implementing Associative Search and Responder Resolution. In Proc. of the 16th International Parallel and Distributed Processing Symposium (Workshop in Massively Parallel Processing). 2002.
28. Walker, R.A., et al., Implementing Associative Processing: Rethinking Earlier Architectural Decisions. In Proc. of the 15th International Parallel and Distributed Processing Symposium (Workshop in Massively Parallel Processing). 2001.
29. Allen, J.D. and D.E. Schimmel, The Impact of Pipelining on SIMD Architectures. In 9th International Parallel Processing Symposium. 1995.
30. Altera, Inc., MegaStore. http://www.altera.com/literature/ds/apex.pdf.
31. Pevzner, P.A. and N.C. Jones, An Introduction to Bioinformatics Algorithms. 2004: MIT Press.
32. Sankoff, D. and J.B. Kruskal (Eds.), Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. 1983: Addison-Wesley.
33. Virdi, S.S., H. Wang, and R.A. Walker, Solving the Longest Common Subsequence (LCS) Problem Using the Associative ASC Processor with Reconfigurable 2D Mesh. In Proceedings of the 17th International Conference on Parallel and Distributed Computing Systems (PDCS'06). 2006. Dallas, TX.
34. Hirschberg, D.S., Algorithms for the Longest Common Subsequence Problem. Journal of the ACM (JACM), 1977. 24(4): p. 664-675.
35. Myers, E.W. and W. Miller, Optimal Alignment in Linear Space. CABIOS, 1988. 4(1): p. 11-17.
36. Lopresti, D.P., Rapid Implementation of a Genetic Sequence Comparator Using Field Programmable Logic Arrays. 1991: p. 138-152.
37. Hunkerpiller, T., et al., Special Purpose VLSI-Based System for the Analysis of Genetic Sequences. Human Genome: 1989-90 Program Report, 1990: p. 101.


38. Chung, K.L., O(1)-Time Parallel String-Matching Algorithm with VLDCs. Pattern Recognition Letters, 1996. 17: p. 475-479.
39. Radoy, C.H. and G.J. Lipovski, Switched Multiple Instruction, Multiple Data Stream Processing. Proceedings of the 2nd Annual Symposium on Computer Architecture, 1974: p. 183-187.
40. Wang, H. and R.A. Walker, Implementing a Multiple-Instruction-Stream Associative MASC Processor. In Proceedings of the 17th International Conference on Parallel and Distributed Computing Systems (PDCS'06). 2006. Dallas, TX.
41. Li, Y. and W. Chu, Design and Implementation of a Multiple-Instruction-Stream Multiple-Execution-Pipeline Architecture. In Seventh International Conference on Parallel and Distributed Computing and Systems. 1995. Washington, D.C.
42. Tyson, G., M. Farrens, and A.R. Pleszkun, MISC: A Multiple Instruction Stream Computer. In Proceedings of the 25th Annual Symposium on Microarchitecture. 1992.
43. Scherger, M., J. Baker, and J. Potter, Multiple Instruction Stream Control for an Associative Model of Parallel Computation. In Proc. of the 16th International Parallel and Distributed Processing Symposium (Workshop in Massively Parallel Processing). 2003.
44. Scherger, M., J. Baker, and J. Potter, An Object Oriented Framework for an Associative Model of Parallel Computation. In Proc. of the 16th International Parallel and Distributed Processing Symposium (Workshop in Advances in Parallel and Distributed Computational Models). 2003.
45. Chantamas, W. and J. Baker, A Multiple Associative Model to Support Branches in Data Parallel Applications Using the Manager-Worker Paradigm. In Proc. of the 19th International Parallel and Distributed Processing Symposium (WMPP Workshop). 2005.
