Design and Analysis of a CUSTOM COMPUTING ARCHITECTURE for the Upgma BIOINFORMATICS ALGORITHM


DESIGN AND ANALYSIS OF A CUSTOM COMPUTING ARCHITECTURE FOR

THE UPGMA BIOINFORMATICS ALGORITHM

by

Sreesa Akella

Bachelor of Science Andhra University, 1998

______

Submitted in Partial Fulfillment of the

Requirements for the Degree of Master of Science in the

Department of Computer Science and Engineering

College of Engineering and Information Technology

University of South Carolina

2003

______
Department of Computer Science and Engineering
Director of Thesis

______
Department of Computer Science and Engineering
2nd Reader

______
Department of Computer Science and Engineering
3rd Reader

______
Dean of the Graduate School

ACKNOWLEDGEMENTS

I would like to express my deepest gratitude to my thesis advisor, Dr. James P. Davis, for the relentless motivation and support he provided, which helped me complete my thesis on time. His unflinching optimism, undying enthusiasm and focus on this project have inspired me to a great degree. His constant advice pushed me to look at problems from a different perspective and helped me visualize concepts in a broader manner.

I would like to extend my appreciation to Dr. Duncan Buell and Dr. John Rose for their continuous guidance and inspiration. Their valuable advice from time to time has given this project an optimal direction.

I would also like to thank my parents and friends, who have been a constant force of motivation and support that sustained me through tough times and helped me achieve this goal.

ABSTRACT

In recent years, reconfigurable custom computing has become an increasingly viable option for implementing high-performance computing applications. Reconfigurable VLSI logic, on which custom computing systems are built, provides several orders of magnitude speed-up in execution performance of algorithms over the execution of these on conventional microprocessor-based systems. In addition, such systems have the flexibility to program--and reprogram via reconfiguration--the actual logic functions of the VLSI circuit with different applications in time and space. Custom computing systems are implemented using FPGA custom-logic devices that are easily and quickly programmed by an end-user. This research presents the design and analysis of a custom computing application architecture for the UPGMA Bioinformatics algorithm implemented on an FPGA-based custom-computing platform. We present the Bioinformatics problem domain and architectures that were implemented and assessed. We also discuss the final architecture created and present results of the system performance, as measured and compared against that of the UPGMA algorithm written in C, running on a single-processor Pentium® PC.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES

INTRODUCTION
  1.1 VON NEUMANN VERSUS RECONFIGURABLE CUSTOM COMPUTING
  1.2 CUSTOM LOGIC DESIGN VERSUS CUSTOM COMPUTING
  1.3 FIELD PROGRAMMABLE GATE ARRAYS
  1.4 APPLICATION PROGRAMMING AND DESIGN STYLES
  1.5 THESIS PROPOSAL
    1.5.1 Thesis Research Objective and Tasks

BACKGROUND
  2.1 PHYLOGENETICS AND TREE-RECONSTRUCTION METHODS
    2.1.1 Background on trees
    2.1.2 Phylogenetic Algorithms
  2.2 THE UPGMA
    2.2.1 Algorithm
    2.2.2 Complexity and Bottlenecks on UPGMA
  2.3 FIELD PROGRAMMABLE GATE ARRAYS
    2.3.1 Input Output Blocks (IOBs)
    2.3.2 Configurable Logic Blocks (CLBs)
    2.3.3 Programmable Routing Matrix
    2.3.4 Resources on a Virtex-E chip
  2.4 RECONFIGURABLE COMPUTING

DISCUSSION OF THE WILDCARD CUSTOM COMPUTING PLATFORM
  3.1 THE ANNAPOLIS WILDCARD™ SYSTEM
  3.2 THE WILDCARD™ SYSTEM VHDL MODEL
  3.3 WILDCARD™ HOST PROGRAMMING
    3.3.1 Opening and Closing the WILDCARD™ board
    3.3.2 Clock Control
    3.3.3 Processing Element and Interrupt Control
    3.3.4 Memory Control
  3.4 PE EMBEDDED APPLICATION INITIALIZATION

CUSTOM COMPUTING DESIGN OF UPGMA
  4.1 VLSI DESIGN FLOW
  4.2 UPGMA PROJECT DESIGN FLOW
  4.3 UPGMA DESIGN
    4.3.1 Design Parameters
    4.3.2 Design Datapath
    4.3.3 Design Architecture
    4.3.1 Adder
    4.3.2 Add Register
    4.3.3 Height Adder
    4.3.4 Height Register
    4.3.5 Multiplier
    4.3.6 Multiplier Register
    4.3.7 Divider Unit
    4.3.8 Divider Register
    4.3.9 Comparator
    4.3.10 Least Distance Register
    4.3.11 Row and Column Registers
    4.3.12 Controller
    4.3.13 Counter Units
    4.3.14 Multiplexers
    4.3.15 Address Generator
    4.3.16 Output Generator
    4.3.17 Height memory
    4.3.18 Off-chip Memory Banks
    4.3.19 Addressing Schemes
    4.3.20 Top-level Block
  4.4 DESIGN VERIFICATION

EXPERIMENTAL DATA SET AND PERFORMANCE MEASUREMENT
  5.1 EXPERIMENTAL APPARATUS FOR UPGMA
  5.2 Generating Random Taxa Test Data Sets
  5.3 Measuring Time

EXPERIMENTAL METHOD AND RESULTS
  6.1 RUNNING THE EXPERIMENTS
  6.2 EXPERIMENTAL RESULTS FOR LATENCY
  6.3 BOUNDING TIME COMPLEXITY
  6.4 BENCHMARKING AGAINST PHYLIP

SUMMARY AND CONCLUSIONS
  7.1 SUMMARY OF RESEARCH CONTRIBUTIONS
  7.2 CONCLUSIONS
  7.3 FUTURE WORK
    7.3.1 Memory size and Memory address schemes
    7.3.2 Latency for a Memory Read
    7.3.3 Device size

BIBLIOGRAPHY

APPENDIX A
  VHDL SOURCE CODE
APPENDIX B
  CUSTOM COMPUTING MACHINE HOST PROGRAM SOURCE CODE

LIST OF TABLES

TABLE 1 Virtex-E Chip Resources
TABLE 2 Address Mapping
TABLE 3 Timing Results for permuted data for 32 taxa dataset
TABLE 4 Latency Values for Datasets at Generated Number of Taxa
TABLE 5 PHYLIP Run-time Raw Dataset
TABLE 6 Data Comparison Between Hardware and Software UPGMA Implementations

LIST OF FIGURES

FIGURE 1 Architecture of an FPGA Device
FIGURE 2 A Phylogenetic Tree showing a Relationship between Four Species
FIGURE 3 Distance Matrix
FIGURE 4 Structure of Xilinx XCV300E Device
FIGURE 5 Virtex-E Input Output Block Architecture
FIGURE 6 A Two-Slice Virtex-E CLB
FIGURE 7 The WILDCARD™ Platform Block Diagram
FIGURE 8 The WILDCARD™ Software Design Hierarchy
FIGURE 9 An HDL-based Design Process Model
FIGURE 10 Design Datapath
FIGURE 11 Block Diagram of UPGMA Architecture
FIGURE 12 The Controller Algorithm
FIGURE 13 Typical Read Cycle from Memory
FIGURE 14 Typical Write Cycle from Memory
FIGURE 15 Distance Matrix
FIGURE 16 Test Data Generator Input Dialog Box
FIGURE 17 Frequency Distribution for Latency versus Taxa Data Set Permutation
FIGURE 18 Mean Latency versus Number of Taxa (Normal Scale)
FIGURE 19 Latency versus Number of Taxa
FIGURE 20 Bounding of Latency by time Complexity Functions
FIGURE 21 Bounding Latency by Time Complexity Functions Computed in Excel
FIGURE 22 PHYLIP C run-time performance
FIGURE 23 PHYLIP C run-time performance with Time-Complexity Bounding
FIGURE 24 Plotting the performance improvement over PHYLIP as Taxa Count grows
FIGURE 25 Plotting the performance difference as Taxa count grows
FIGURE 26 Plotting the performance difference as Taxa count grows (Log Plot)

CHAPTER 1

INTRODUCTION

1.1 Von Neumann versus Reconfigurable Custom Computing

In recent years, reconfigurable custom computing has become an increasingly viable option for implementing applications requiring high-performance or complex computations. It is an area that is not as mature as the use of conventional computing architectures. Traditionally, general-purpose computing involves a serial thread of executing code running on one or more microprocessors. This microprocessor-based computing paradigm is considered "general-purpose" in that the processor can be programmed to run any task—which is an executing application program running on an operating system or monitor program. Once a processor has been designed and fabricated, the single processor’s IC can solve multiple problems at different points of time, by fetching program instructions and data from memory, decoding them to determine an execution plan, then executing each such instruction, in turn.

Reconfigurable computing can also be called "general-purpose", although it uses a different architecture and supporting application development paradigm for computation. Unlike a microprocessor, which has its computation as a set of sequential instructions fetched from system memory, reconfigurable architectures generally compute a function by configuring functional units and wiring them up in space. This allows a parallel computation of operators and direct dataflow from the producers of an intermediate result to the consumers [1, 2].

1.2 Custom Logic Design versus Custom Computing

Application Specific Integrated Circuits (ASICs) could also be used to implement a design and optimize it to achieve high performance employing spatial architectures. ASICs, however, are designed using custom logic techniques, creating design artifacts tailored for a specific application, and thus cannot be reconfigured to perform different applications. Therefore, although these systems provide high performance through application-specific optimization, they are not “general purpose”. One other aspect of ASIC systems is that they have a huge manufacturing cost associated with them.

Reconfigurable systems, on which custom computing systems are built, provide very good performance and the flexibility to program--and reprogram via reconfiguration of the logic functionality--the actual device logic, with different applications in time and space. Additionally, these systems are implemented using FPGA devices that are easily and quickly programmable by an end-user, are available at affordable prices, and thus deliver user-defined functionality at a low cost. The performance and logic density of a single FPGA device have been improving in recent years, leading to more powerful reconfigurable architectures targetable for a wider range of applications. This has opened up the use of FPGAs, typically employed in the creation of logic controllers, as processing elements (PEs) in reconfigurable arrays in applications for high-performance computing.

1.3 Field Programmable Gate Arrays

In the past few years, the reconfigurable device market has grown considerably with the availability of a wide range of devices for VLSI systems--one such device being a Field Programmable Gate Array (FPGA). FPGAs have evolved considerably in the recent past, with the primary development being the ability to download a bitstream representing the digital logic functions onto an array of pre-defined arithmetic, logical and steering resources, so they have become the primary device for building reconfigurable and adaptive machines. They were originally designed as prototype devices used for pre-fabrication design emulation. This design activity was employed to verify the design before fabrication, to avoid the fallout of post-fabrication design error.

The Xilinx FPGA device, which is the primary device we consider in this work, has a standard architecture, shown in Figure 1 [3].

FPGAs consist of an array of resource types: configurable logic blocks (CLBs), input/output blocks (IOBs), and programmable interconnect resources. This standard architecture can be configured, and reconfigured if necessary, by an end user to implement a particular functionality. The logic blocks are used to implement the required logic gate and storage elements of the design. The interconnect can be programmed to appropriately connect the logic blocks to realize a larger functional unit specified for use by the application.

Figure 1. Architecture of an FPGA Device.

For purposes of consideration in this thesis, the design process of the FPGA device has the following steps:

1. Model the design using a hardware description language such as VHDL or through schematic capture.

2. Synthesize this design to generate a netlist.

3. Map the design to the FPGA logic blocks.

4. Place and route the design to choose the specific logic blocks to use on the FPGA and to allocate the wire segments that interconnect these logic blocks.

5. Download the design as a bitstream onto the target FPGA chip.

Steps 2 through 5 are automated and are performed by an assortment of design tools generally provided by the FPGA device vendor. Some of the major FPGA device manufacturers and vendors in the market are Xilinx, Actel, and Altera.

In order to use these devices for reconfigurable computing applications, one has to deal with a number of FPGA issues so as to effectively implement the design. The computational requirements of the application must be identified and its mapping to the FPGA device must be evaluated via estimation. This is no easy task, and there is no standard method to assemble designs. The FPGA tools, which play a major part in this process, are being continuously improved by the vendors to be more efficient in their mapping of design architecture to design resources. The trend is that, over time, the construction of reconfigurable computing systems on FPGAs will be more like software programming than the hardware design process for custom VLSI that exists today.

1.4 Application Programming and Design Styles

The process of converting a specification into an implementation on FPGA devices can be addressed in different ways. Different design styles lead to different interpretations of the specification—a formal or informal description of the application’s algorithm. An algorithm can be thought of as a set of processing steps for transforming data by executing a series of computations [4]. The algorithm needs to be interpreted by a machine to perform the work. Choosing the elements that make up the machine defines its architecture, and this necessitates looking at different architectural, or design, styles.

Traditionally there have been two generic architectural styles: the software paradigm and the hardware paradigm [4]. The software paradigm looks at implementing an algorithm through use of an instruction code sequence that is interpreted by a microprocessor. In contrast, using the hardware paradigm, an algorithm is mapped onto storage and functional units that perform the computation without the use of an intermediate instruction set.

Under the software paradigm, a program for the algorithm is written in a high-level language such as C/C++, which is compiled into a low-level instruction set for the processor to execute on underlying hardware with a fixed architecture. A hardware implementation would look at implementing the design directly onto a hardware device through mapping to storage and functional units, avoiding the instruction-interpretation and operating system overhead present in the software paradigm. This can provide a considerable speed-up, on the order of two orders of magnitude, and thus a much higher-performance solution; however, a VLSI hardware application solution generally comes at a higher cost, since fabricating the implementation on application-specific devices is expensive. At the same time, such application formulations using application-specific VLSI custom logic are not general purpose, thus necessitating different implementations for different algorithms. In contrast, the software model would yield a generally lower-performance solution because of the overhead associated with instruction fetch, decode and execute; however, the solution would generally be cheaper, since microprocessors are mass-produced, reusable, commodity off-the-shelf products, and programming them is not a difficult task. Furthermore, there are more software-trained professionals who can write programs on general-purpose processors than there are design engineers who can design custom-logic VLSI.

FPGAs provide a means to build general-purpose, reconfigurable machines at a lower cost¹. This leads to a new design style that can be referred to as the reconfigurable-computing paradigm, also referred to as the configurable hardware paradigm [4]. This paradigm supports the implementation of algorithms by providing the performance benefits from mapping directly onto a hardware platform at a relatively low cost. Thus, it would be interesting to look at implementing various applications on reconfigurable platforms and evaluating their performance as compared to implementations using the software paradigm.

¹ Such fixed cost is referred to as NRE, or non-recurring engineering cost, which is associated with the specification, design, implementation, mapping, and test of the logic functions implementing an application on a VLSI device substrate. This is in contrast to the variable costs associated with the fabrication and production of finished devices, which are based on the volume of production, itself based on demand.

Such performance would include conventional notions of latency associated with carrying out computation, comparing between an application-specific software solution running on a conventional processor architecture (or even among a collection of processors, thus distributing the algorithm’s execution across multiple, communicating processing elements), and also throughput of the architecture to run streaming computation, if appropriate for the application. However, evaluating performance could also include comparing the design time of the application—comparing the time to architect, design, implement and test the application according to the requisite engineering processes of each paradigm.

1.5 Thesis Proposal

The reconfigurable computing paradigm--and the predominant FPGA device architecture on which such applications can be built--offers us a good medium for implementing complex computational tasks having high throughput, low latency requirements. Many computational tasks spread over a range of application domains have been implemented and evaluated on reconfigurable computing systems [5, 6, 7, 8, 9]. However, different aspects of application architecture and performance must continue to be explored, while many new and novel computational problems must be implemented using reconfigurable custom computing machines before a general understanding of the characteristics of the reconfigurable computing paradigm can be obtained. This would provide a wider set of configurable computing solutions, as well as patterns for mapping between high-level problem-solving architecture and lower-level device architectures, which can be used to assess the cost/benefit ratio for effective and optimal implementation of more general programming problems on reconfigurable platforms.

Our research thesis involves examining one such data point in the space of possible application solutions where high-performance computing using reconfigurable hardware is required for operating on ever-growing data sets. Namely, we are looking at the Phylogenetics domain that provides us with a rich set of algorithms that can be studied to see if they can be implemented efficiently on reconfigurable computing machines to provide orders of magnitude speedup in the algorithm execution over that available on standard Von Neumann processor architectures programmed using conventional programming techniques. In this Bioinformatics domain, the Unweighted Pair-Group Method with Arithmetic means (UPGMA) algorithm used for phylogenetic tree-reconstruction purposes has certain computational complexity that makes it an application of specific interest. Furthermore, it is understood to have a software programmed implementation that is particularly optimal, that is, it cannot be further optimized to achieve significant speedup in performance.

1.5.1 Thesis Research Objective and Tasks

It is therefore our objective to explore the space of possible architectures in custom reconfigurable logic, using FPGA devices as an implementation medium and conventional custom-logic design processes, to implement a different “rendition” of the UPGMA algorithm and measure the performance difference. Although the time complexity of the algorithm is unlikely to change as a result of implementation in FPGA custom-logic hardware, we believe that the use of custom-logic VLSI hardware design techniques should yield up to a two-order-of-magnitude improvement in the execution speed of the UPGMA algorithm over that achieved by the PHYLIP program written by Felsenstein et al. [10].

The tasks involved in exploring this thesis research work are defined as follows:

1) Select the UPGMA algorithm [11], which performs phylogenetic analysis by building an evolutionary tree, as our problem domain.

2) Identify and analyze the various complex computational tasks and bottlenecks.

3) Evaluate the issues that we need to address in implementing this algorithm on a reconfigurable custom-logic architecture.

4) Address various FPGA issues while developing a hardware architecture for the particular problem algorithm at hand.

5) Implement this design on an FPGA-based reconfigurable architecture and device platform.

6) Evaluate its performance by measuring its throughput with an increase in the number of taxa, and benchmark these results against those obtained from a software program (Felsenstein’s PHYLIP) executing on a conventional CPU-based system.

The Annapolis WILDCARD™ system has been chosen as the target reconfigurable platform. The WILDCARD™ FPGA board has a Xilinx Virtex® XCV300E² as a processing element, along with two 256K byte memory units, and external I/O connections. This reconfigurable computing platform was chosen primarily based on cost and the availability of a reasonable set of platform development tools.

² Xilinx and Virtex are registered trademarks of Xilinx Inc.

Thus, this thesis will attempt to modify the upper bound of the time complexity, corresponding to a modification of the time constant associated with the complexity function for the UPGMA algorithm to achieve orders-of-magnitude speedup, while also contending with the space complexity associated with the limited amount of device resources available on the WILDCARD™ platform. In addition, given that we will be moving data between the host computer in which the WILDCARD™ sits and the WILDCARD™ PCI/PCMCIA board itself, we will be required to assess the penalties associated with the communication overhead, with the objective of minimizing this as much as possible.

CHAPTER 2

BACKGROUND

In this chapter, we provide background on the application domain associated with the UPGMA algorithm and its context in the space of Bioinformatics computational problem solving. We also discuss the FPGA device technology, which constitutes the platform on which we will create a reconfigurable computing solution for the UPGMA problem.

2.1 Phylogenetics and Tree-reconstruction Methods

The study of the relationships between groups of organisms is called taxonomy, an ancient and venerable branch of classical biology. The branch of taxonomy that deals with numerical data such as DNA sequences is known as phylogenetics. Phylogenetic analysis was originally developed by biological systematists who wanted to reconstruct evolutionary genealogies of species based on morphological similarities. The results of phylogenetic analysis may be depicted as a hierarchical branching diagram, a "cladogram" or "phylogenetic tree", as shown in Figure 2 [12].

Figure 2: A phylogenetic tree showing a relationship between four species.

2.1.1 Background on trees

The tree represents the genealogical evolution of the different species, linking them through a certain set of similarities and differences. Similarities and differences between organisms can be coded as a set of characters, each with two or more alternative character states. In an alignment of DNA sequences, for example, each aligned site is a separate character, each with four character states, the four nucleotides being adenine, thymine, cytosine, and guanine.

All the trees are assumed to be binary, meaning that each node branches into two daughter edges as shown in Figure 2. The edges meet at a branch node, a node being an endpoint of an edge. Each edge has a certain amount of evolutionary divergence associated with it, quantified by some distance between sequences. These distances are referred to as ‘edge lengths’ or ‘branch lengths’. Terminal nodes or leaves correspond to the observed sequences that might connect up to an ultimate ancestor or ‘root’ of the tree. A true biological phylogeny has a ‘root’ but only some phylogenetic algorithms provide information about the location of the root.

For a specific set of n leaves, the nodes and edges of a tree can be counted as follows: there are (n-1) branch nodes in addition to the n leaves, giving a total of (2n-1) nodes, and one fewer edge than nodes, that is (2n-2) edges, discounting the edge above the root node.
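As a quick check of these counts, the following Python sketch tabulates them for a rooted binary tree. The function name `rooted_binary_tree_counts` is our own illustrative choice and does not appear elsewhere in this thesis.

```python
def rooted_binary_tree_counts(n_leaves: int) -> dict:
    """Count the nodes and edges of a rooted binary tree with n_leaves leaves."""
    if n_leaves < 1:
        raise ValueError("a tree needs at least one leaf")
    internal = n_leaves - 1              # branch nodes, including the root
    total_nodes = n_leaves + internal    # 2n - 1
    edges = total_nodes - 1              # 2n - 2, ignoring any edge above the root
    return {
        "leaves": n_leaves,
        "internal": internal,
        "nodes": total_nodes,
        "edges": edges,
    }


# The four-species tree of Figure 2 has 3 branch nodes, 7 nodes and 6 edges.
print(rooted_binary_tree_counts(4))
```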

2.1.2 Phylogenetic Algorithms

Phylogenetic algorithms cover three main classes of problems [13]: (1) parsimony, which is like a vertex coloring problem of graph theory; (2) distance methods, which aim to find a tree whose path distance matches closely to observed distances; and, (3) likelihood methods, where the likelihood of the data is calculated using Markov transition matrices. Each approach possesses certain problems in terms of the computational bottlenecks that occur.

The advantages of putting a phylogenetic algorithm onto a reconfigurable custom computing platform include the following: (1) eliminating intervening levels of software--such as operating systems--which slow down the execution of the code; and, (2) parallelizing or pipelining the algorithm functions by exploiting the natural capabilities of custom-logic architecture and design. The latter provides far more work per cycle than code written in a native instruction set on a general-purpose microprocessor. As discussed earlier, we believe a speed-up of up to two orders of magnitude should be possible with this approach. Furthermore, the bottlenecks within the algorithms could be avoided by exploiting the underlying hardware resources in reconfigurable machines to optimize specific parts of the algorithm’s execution in ways that general-purpose machines cannot offer.

We select a particular phylogenetic distance method algorithm for this research, namely the UPGMA (Unweighted Pair-Group Method with Arithmetic means) algorithm, whose computational complexities are described below.

2.2 The UPGMA

UPGMA has relevance beyond phylogenetics, since it is a hierarchical clustering method that is both fast and useful with gene expression or micro-array data. The algorithm’s running time complexity is evaluated and compared against that of the hardware implementation, and the results are presented in Chapter 6. The number of objects to be clustered, N, is typically around 10,000 to 50,000 in micro-array applications. Thus, even though software-based phylogenetics applications run this method at a rate of 1 second per run for N = 100 [15], the run time increases by a factor of perhaps 10,000 in micro-array applications, even before we consider memory bottlenecks. Memory causes considerable problems of its own, since memory usage also scales as O(N^2). Thus, this problem might take days to complete with larger taxa data sets. This algorithm is well understood [11, 14, 15], and the software solutions have reached a level of optimization beyond which minimal performance improvement can be obtained. Thus UPGMA is an appropriate candidate for exploring an implementation on a reconfigurable platform using custom-logic architecture and design techniques.
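The scaling argument above can be made concrete with a small back-of-the-envelope calculation. The Python sketch below assumes, as the factor of 10,000 quoted above implies, that run time grows as N^2 from a baseline of 1 second at N = 100; the function `projected_seconds` and its `exponent` parameter are our own illustrative names, and a different growth exponent would give different projections.

```python
def projected_seconds(n, base_n=100, base_seconds=1.0, exponent=2):
    """Extrapolate run time from a baseline, assuming time ~ N**exponent.

    The quadratic default is an assumption taken from the factor of 10,000
    quoted in the text, not a measured property of any implementation.
    """
    return base_seconds * (n / base_n) ** exponent


for n in (100, 10_000, 50_000):
    secs = projected_seconds(n)
    print(f"N = {n:>6}: about {secs:,.0f} s ({secs / 3600:.1f} h)")
```

Under this assumption, N = 10,000 corresponds to 10,000 seconds (a few hours), and N = 50,000 to 250,000 seconds (nearly three days), which is consistent with the statement that larger data sets might take days.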

2.2.1 Algorithm

We define the distance between two clusters Ci and Cj to be the average distance between pairs of sequences from each cluster:

$$d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq} \tag{1}$$

where |Ci| and |Cj| denote the number of sequences in clusters i and j, respectively, and p and q denote the sequences in clusters Ci and Cj, respectively. If Ck is the union of clusters Ci and Cj, and if Cl is another cluster, then

$$d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|} \tag{2}$$

This is the average distance calculation for obtaining the distance from the new cluster Ck to any other cluster Cl.
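To illustrate why the update for a merged cluster can be computed from the two old cluster distances alone, the following Python sketch compares the direct average over all sequence pairs with the standard UPGMA size-weighted update, which divides by |Ci| + |Cj|. The distance values and the function names `average_distance` and `merged_distance` are hypothetical and chosen only for this example.

```python
from itertools import product


def average_distance(cluster_a, cluster_b, d):
    """Mean of d[p][q] over all pairs with p in cluster_a and q in cluster_b."""
    total = sum(d[p][q] for p, q in product(cluster_a, cluster_b))
    return total / (len(cluster_a) * len(cluster_b))


def merged_distance(d_il, d_jl, size_i, size_j):
    """Size-weighted mean of the two old cluster-to-cluster distances."""
    return (d_il * size_i + d_jl * size_j) / (size_i + size_j)


# Hypothetical symmetric distances between four sequences, indexed 0..3.
d = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 8],
    [10, 10, 8, 0],
]
c_i, c_j, c_l = [0, 1], [2], [3]

# Distance from the merged cluster (c_i union c_j) to c_l, computed two ways.
direct = average_distance(c_i + c_j, c_l, d)
updated = merged_distance(
    average_distance(c_i, c_l, d),
    average_distance(c_j, c_l, d),
    len(c_i),
    len(c_j),
)
print(direct, updated)
```

The two values agree, so an implementation only needs to keep the current cluster-to-cluster distances and the cluster sizes, rather than revisiting the original sequence-level distances at every merge.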

The distances are represented in the form of a matrix, shown in Figure 3, with each row or column corresponding to one node. The distance between nodes i and j is stored at position [i, j] of the matrix, so D[i, j] is the distance between nodes i and j.

Figure 3. Distance Matrix

D[i, i] is not a valid distance, since there can be no distance between a node and itself. This is therefore marked as “x” in the matrix.

The steps of the UPGMA algorithm are as given below [14]:

1. Initialization:
   a. Assign each sequence i to its own cluster Ci.
   b. Define one leaf of T for each sequence, and place it at height zero.

2. Iteration:
   a. Determine the two clusters i, j for which dij is minimal. (If there are several equidistant pairs, pick one randomly.)
   b. Define a new cluster k by Ck = Ci ∪ Cj, and define dkl for all l by (2).
   c. Define a node k with daughter nodes i and j, and place it at height dij/2.
   d. Add k to the current clusters and remove i and j.

3. Termination:
   a. When only two clusters i, j remain, place the root at height dij/2.
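The steps above can be sketched in software as follows. This is a minimal Python illustration under our own assumptions, not the PHYLIP code or the hardware design described later: the function name `upgma`, the dictionary-based distance table, the `_node` labels and the example distances are all hypothetical, ties are broken by taking the first minimal pair found rather than a random one, and the new distances use the standard size-weighted average that divides by |Ci| + |Cj|.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a UPGMA tree.

    names: list of distinct leaf labels.
    dist:  dict mapping frozenset({a, b}) to the distance between leaves a and b.
    Returns (tree, root_height), where tree is a nested tuple of leaf labels.
    """
    # Step 1: every sequence starts as its own cluster, a leaf at height zero.
    clusters = {name: (name, 1, 0.0) for name in names}  # id -> (subtree, size, height)
    d = dict(dist)
    next_id = 0
    while len(clusters) > 1:
        # Step 2a: find the closest pair (first minimal pair found, not random).
        i, j = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        d_ij = d[frozenset((i, j))]
        tree_i, size_i, _ = clusters.pop(i)
        tree_j, size_j, _ = clusters.pop(j)
        k = f"_node{next_id}"
        next_id += 1
        # Step 2b: size-weighted average distance from k to every other cluster.
        for other in clusters:
            d[frozenset((k, other))] = (
                d[frozenset((i, other))] * size_i + d[frozenset((j, other))] * size_j
            ) / (size_i + size_j)
        # Steps 2c and 2d: node k at height d_ij / 2 replaces i and j.
        # The final pass of the loop is step 3: it places the root.
        clusters[k] = ((tree_i, tree_j), size_i + size_j, d_ij / 2)
    (tree, _, height), = clusters.values()
    return tree, height


# Hypothetical distances between four taxa.
pairs = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
         ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 8}
dist = {frozenset(p): v for p, v in pairs.items()}
tree, height = upgma(["A", "B", "C", "D"], dist)
print(tree, height)
```

In this example A and B are joined first, then C, and finally D; each pass through the loop performs one minimum search and one round of distance updates, which are the two operations examined as bottlenecks in the next section.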

2.2.2 Complexity and Bottlenecks on UPGMA

We believe the UPGMA algorithm has two bottlenecks. The first is in deciding which of the N(N-1)/2 pairwise distances is minimal at each step of the star-decomposition clustering. Following this, the dimension of the data matrix is reduced by 1, due to the clustering of two objects. This introduces the second bottleneck: the need to calculate an average distance between the two objects (i and j), now treated as a single cluster (k), and all other objects. This involves complex computational units that are costly on general-purpose microprocessors, but which we believe can be implemented efficiently on a reconfigurable custom-logic FPGA device, giving better performance.
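To give a sense of the size of the first bottleneck, the Python sketch below counts the candidate pairs at a single step and the total number of pairs examined over a whole run. It assumes a naive implementation that rescans every remaining pair at every merge step, which is our simplifying assumption and not necessarily how PHYLIP or the hardware design performs the search; the function names are hypothetical.

```python
from math import comb


def candidate_pairs(n_clusters):
    """Number of pairwise distances among n_clusters clusters: N(N-1)/2."""
    return n_clusters * (n_clusters - 1) // 2


def total_pairs_scanned(n_taxa):
    """Pairs examined if every merge step rescans all remaining pairs.

    The cluster count falls from n_taxa to 2, shrinking by one per merge.
    """
    return sum(candidate_pairs(m) for m in range(2, n_taxa + 1))


for n in (4, 32, 100):
    # The running total has the closed form C(n + 1, 3) = (n + 1) n (n - 1) / 6.
    print(n, candidate_pairs(n), total_pairs_scanned(n), comb(n + 1, 3))
```

Under this assumption the total work of the minimum search grows roughly as N^3/6, even though each individual step only scans on the order of N^2/2 entries.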

This research examines the function of the UPGMA algorithm, implementing it as a custom logic architecture. The standard HDL-based design methodology is employed in that we model the algorithm using the VHDL hardware description language, we functionally verify the algorithm’s correctness in the custom logic architecture, and then we synthesize the architecture onto a set of resources to produce a circuit mapped to a target FPGA device’s component library. The resulting circuit is implemented on a Xilinx Virtex-E® FPGA device and is subjected to functional and performance analysis.

However, before we can present the research method undertaken in this effort (including the analysis, architecture and design of the circuit implementing the UPGMA algorithm), we must discuss the characteristics of FPGA devices and their use in reconfigurable computing that give this research a high chance of success.

2.3 Field Programmable Gate Arrays

The evolution of FPGA devices is evidenced by great strides in the underlying technology, such as effective logic gate counts in the millions and the ability to download and alter the logic via a programmable bitstream while the FPGA device is in operation. Several companies have been developing high-performance, high-capacity FPGA devices, targeting larger applications such as those associated with scientific computing. Xilinx, Actel and Altera, the largest vendors producing these devices, hold a leadership position in the market.

Our reconfigurable platform, the Annapolis Wildcard® system, uses a Xilinx® XCV300E device. The Xilinx FPGA devices have a standard set of device architecture features, similar to the one shown in Figure 1 in the previous chapter. We describe the architecture for the Xilinx XCV300E device below.

Figure 4. Structure of the Xilinx® XCV300E device.[16]

Figure 4 provides an architectural overview of the XCV300E device. There are three main components in the device: (1) Input Output Blocks (IOBs); (2) Configurable Logic Blocks (CLBs), including block-programmable RAM (BRAM) memory structures; and (3) the Programmable Routing Matrix.

2.3.1 Input Output Blocks (IOBs)

The input and output blocks on the device provide an interface between the input and output pins and the Configurable Logic Blocks (CLBs). The architecture for these blocks is given in Figure 5. These blocks provide three storage elements that can be used either as edge-triggered D flip-flops or as level-sensitive latches.

Figure 5. Virtex E Input Output Block architecture [16]

2.3.2 Configurable Logic Blocks (CLBs)

The Configurable Logic Blocks provide the functional elements for implementing logic. The basic building block of the CLB is the Logic Cell (LC). Each Virtex-E CLB consists of four LCs. The LC consists of a 4-input function generator, carry logic, and a storage element. The output of the function generator in each LC drives the output of the CLB and the D input of the flip-flop. The architecture for a Virtex E CLB is given in Figure 6. The four LCs are organized as two identical slices, as shown in the figure.

Figure 6. A Two-Slice Virtex E CLB [16]

The function generators are implemented using Look Up Tables (LUTs) that can also be configured as 16x1-bit synchronous RAM. The two LUTs in a slice can be combined to create a 16x2-bit or 32x1-bit synchronous RAM, or a 16x1-bit dual-port synchronous RAM element.

2.3.3 Programmable Routing Matrix

The Virtex-E contains a General Routing Matrix (GRM) that connects the CLBs together to implement the logic chains. The GRM comprises an array of routing switches located at the intersections of the horizontal and vertical routing channels. Each CLB also has local routing resources through which it connects to the GRM. These local and global routing resources can be programmed to generate the best routing for the design being configured onto the device. The Xilinx configuration tools take care of placing and routing the design onto the device’s resources, subject to user-specified constraints.

2.3.4 Resources on a Virtex-E chip

The Xilinx Virtex-E resources and their numbers are given below in Table 1:

Resource                     Number
CLBs                         1536
Slices                       3072
LUTs                         6144
Flip-flops                   6144
Block RAMs (256x16 bits)     32
Block RAMs (256x32 bits)     16
Block RAM bits               131072

Table 1: Virtex-E chip resources

Each CLB has two slices, and there are two LUTs and two flip-flops per slice. The Block RAM allocations are based on how the LUTs are configured. If they are configured as two 16x1-bit RAMs, then we can have 32 of the 256x16-bit block RAMs implemented on the device. If we have two LUTs configured to form a 32x1-bit RAM, then we can have 16 of the 256x32-bit RAMs implemented on the device.
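The figures in Table 1 can be cross-checked with a few lines of arithmetic. The sketch below simply recomputes the slice, LUT and flip-flop counts from the CLB count, and the number of block RAMs from the total block RAM bits; the constant and function names are illustrative only.

```python
# Values taken from Table 1.
CLBS = 1536
SLICES_PER_CLB = 2
LUTS_PER_SLICE = 2
FFS_PER_SLICE = 2
BLOCK_RAM_BITS = 131072

slices = CLBS * SLICES_PER_CLB          # slices on the device
luts = slices * LUTS_PER_SLICE          # 4-input LUTs on the device
flip_flops = slices * FFS_PER_SLICE     # storage elements on the device

def block_rams(depth, width):
    """How many depth x width block RAMs fit in the total block RAM bits."""
    return BLOCK_RAM_BITS // (depth * width)
```

Calling `block_rams(256, 16)` and `block_rams(256, 32)` reproduces the two block RAM rows of the table.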

The Xilinx Virtex-E data manual [16] provides a detailed description of the device architecture along with pin definitions and electrical characteristics. The Virtex E device, with its full complement of resources, provides the designer with a total of 411,955 CMOS transistor gates. This device can thus be used to implement reasonably sizable designs running at moderately high clock speeds.

2.4 Reconfigurable Computing

The constant improvement in FPGA device density and performance has prompted many to look at using these devices for implementing high-performance computing applications. The traditional advantages of these devices are that they can be configured and reconfigured easily, through direct host program control, and at little extra cost (except when reconfiguring during application runtime). The increase in gate count and speed of these devices has also made them an appropriate target for building high-performance, custom computing machines. These machines, also referred to as reconfigurable computing machines, provide the flexibility to program and reprogram systems and at the same time provide high-performance computing at a relatively low cost when compared to the price-performance models of other high-performance platforms, such as supercomputers [2, 5, 6, 7, 8, 9]. Several computing platforms consisting of arrays of FPGA devices have been developed through research and experimentation and are now commercially available.

The DEC Paris Research Laboratory’s Programmable Active Memories (PAM) project was one of the earliest pioneers in reconfigurable computing [1]. The PRL team implemented the RSA encryption algorithm at speeds that had never been achieved, beating supercomputers and even custom discrete IC applications at that time.

SPLASH and Splash 2 are two other reconfigurable architectures developed in the early nineties, Splash 2 being an upgrade of the original SPLASH architecture [17]. Splash 2 consisted of 16 printed circuit boards, each carrying 17 Xilinx XC4000 FPGA chips. Each XC4000 had its own memory bank, which it could read and write independently. A number of high-performance scientific applications were implemented on Splash 2, in domains such as gene sequence matching, fingerprint matching and image processing, at speeds two orders of magnitude greater than those of the fastest supercomputers at that time [17].

Several companies have brought commercial platforms to market over the past few years, attempting to exploit this new computing model. Annapolis Microsystems [18], SRC Computers Inc [19], and Star Bridge Systems [20] are three of the most prominent players in this market. These new reconfigurable architectures are being marketed as platforms that can be used for implementing a wide range of applications from different domains. The Annapolis WILDCARD™ reconfigurable platform that we are using in our research is one of these, albeit a low-end version.

The research described in this thesis examines the architecture, design and implementation of the computationally intensive UPGMA algorithm on a low-end reconfigurable platform and evaluates the performance as contrasted with that obtained by an implementation of the same algorithm using conventional software program execution on a standard Intel® CPU-based personal computer.

As discussed, we have chosen the UPGMA phylogenetics algorithm as the application domain in which we will explore the architecture and design space, and subsequent performance differences, of applying the reconfigurable computing paradigm to this scientific computing problem. Phylogenetics provides us with a rich variety of problems with complex computational tasks that can be studied to see if they can be implemented on reconfigurable machines. Furthermore, the software domain has already been thoroughly explored, and few performance gains can be realized from further software optimization of the UPGMA algorithm in particular.

With this rationale clearly in mind, we progress to our discussion of the problem-solving and analysis of the UPGMA domain to derive a suitable high-level architecture with which to implement the algorithm. In addition, we need to weigh our architecture against the resource and timing constraints of the underlying Xilinx device and the WILDCARD™ platform (including its mechanisms for interfacing with the PC-based host system in which it resides).

CHAPTER 3

DISCUSSION OF THE WILDCARD CUSTOM COMPUTING PLATFORM

In this chapter, we discuss the reconfigurable computing platform we have available to us for purposes of this research. As part of our analysis, we had to thoroughly analyze this platform in order to understand its operating environment, its programming model, and its key features and constraints. All of this was required prior to devising an architecture for our UPGMA solution, because any such architecture would be constrained both by the resources of the Xilinx device resident on the WILDCARD and by the programming model and execution environment provided by the vendor for us to realize our solution.

3.1 The Annapolis WILDCARD™ System

The WILDCARD™ system comes as a PC card and can plug into a PCI/PCMCIA card slot adapter, making it a very portable low-end reconfigurable platform. It has a very compact architecture, with a single Xilinx Virtex XCV300E processing element (PE) and two independent memory modules, one on either side, forming the core of the system. The architectural block diagram is given in Figure 7 below.

Figure 7. The WILDCARD™ Platform Block Diagram [18]

Each of the two memory blocks, referred to as the Right and Left memory banks, is a 64K x 32-bit RAM module, with a 19-bit address bus and a 32-bit data word. The PE can write to and read from the right and left memories independently. The host interface is through a 32-bit CardBus (PCMCIA) controller that operates at a 33 MHz clock frequency. The CardBus controller interfaces with the PC host through the PCI Bus interface, and with the PE through the LAD Bus interface.

Data transfers to and from the PC host are controlled by a set of C program driver calls that interface with the CardBus controller which, in turn, interfaces with the LAD Bus to send data to, and retrieve data from, the PE. Data can be written from the host to the memory through these interfaces by making the C program calls of the Host Application Programming Interface (API) provided by the vendor.

The PE also has certain Input and Output pin connections that enable it to connect to external devices. These pins are helpful when you want your application program to communicate with an external device.

The WILDCARD™ board has a frequency synthesizer that generates one main global clock signal, F_Clk, which is used to derive three other global clocks, namely the P_Clk, M_Clk, and K_Clk. The user can set the clock frequency of the F_Clk using a C routine call from the host. The P_Clk is the PE clock pad signal, and is set to half the frequency set for the F_Clk by the user. The M_Clk is the memory clock pad signal and operates at the same frequency that is set by the user for F_Clk. Finally, the K_Clk is the CardBus/LAD Bus clock pad signal, which always operates at 33 MHz.
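The relationship among the four clocks can be captured in a small helper. This is only an illustration of the ratios stated above; the function name `derive_clocks` and the use of MHz as the unit are assumptions of this example and are not part of the vendor's software.

```python
K_CLK_MHZ = 33.0  # CardBus/LAD Bus clock, fixed regardless of F_Clk

def derive_clocks(f_clk_mhz):
    """Return the frequency of each global clock, in MHz, for a given F_Clk."""
    return {
        "F_Clk": f_clk_mhz,
        "P_Clk": f_clk_mhz / 2,  # PE clock runs at half of F_Clk
        "M_Clk": f_clk_mhz,      # memory clock matches F_Clk
        "K_Clk": K_CLK_MHZ,
    }
```

A practical consequence is that the PE logic sees only half of whatever frequency the host requests, which matters when a design's critical path is compared against the requested clock rate.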

3.2 The WILDCARD™ System VHDL model

The WILDCARD™ system software package provides VHDL models for the whole board that can be used to create a VHDL-based program model and to implement and debug the whole reconfigurable application design. The VHDL model of the system also contains a simulation model of the host system that is used for testing the application from the perspective of custom computing hardware-software co-design.

The VHDL model provides interface components that are used to access all the components on the WILDCARD™ system. There are two types of interface components, namely, the Standard Interfaces and the Mux Interfaces [18].

Standard interfaces are simple interfaces to the devices (PE, memories) on the system and can be used for low-level, specifically tuned applications. The Mux (or multiplexing) interface can be used for programming at a higher-level interface between the LAD Bus, Memory and the PE components. Both of these interfaces allow multiple user application components to share a single resource (such as the LAD Bus or the PE’s memory banks).

The development environment provides VHDL models for the following platform components, which are used for early model integration, hardware-software partitioning analysis, and functional verification and clock cycle-level timing analysis.

• Processing element (PE)

• Right Memory Bank

• Left Memory Bank

• Host

• Clock Generation

• Input and Output Connectors

• PCI controllers

The PE VHDL model is a standard VHDL entity-architecture pair. The entity defines the input and output pads of the PE device. The pad numbers are logical and do not match the physical pin numbers of the FPGA die. This entity definition is fixed and is used as a template for the physical PE while creating the application design. In preparing these components for exploring the space of possible architectures for the UPGMA algorithm, we modify the PE architecture template to embed the application design within it. Furthermore, the Standard and Mux interface models are used to interface the embedded application design with the LAD Bus and the memory banks. Finally, this allows us to take the resultant composite PE model and synthesize the PE image for actually configuring the WILDCARD device.

The VHDL models provided for the memory banks, host, clock generation, I/O connectors and the PCI controllers are purely for simulation purposes. These models are used within the WILDCARD™ simulation model and encapsulate the system’s functionality for use in VHDL simulation, enabling us to functionally verify the PE designs, as well as validate the correctness of the UPGMA algorithm, before synthesizing the actual design units and placing and routing them onto the Virtex device resident on the WILDCARD.

3.3 WILDCARD™ Host Programming

The WILDCARD™ system is composed of three main components: (1) the WILDCARD™ board; (2) the WILDCARD™ device driver; and (3) the Host Application Programming Interface (API). The WILDCARD™ Software Design Hierarchy is given in Figure 8. The host programming is done in the C language, using the standard Host API routines to communicate with the WILDCARD™ board through the Windows®-based device driver.

The device driver provides a low-level hardware interface to the WILDCARD™ board. When the driver is called with the appropriate sequence of driver function codes, it initializes the WILDCARD™ in a sequence of steps by reading its configuration and establishing handler interfaces for memory, interrupts and DMA operations. The WILDCARD™ API presents a generalized view of the hardware resources and control operations. The following operations are performed by calling the API routines:

• Opening and Closing the WILDCARD™ board

• Clock control (frequency)

• Processing Element control (program, reset, register space)

• Memory Interfaces (read/write)

• Interrupt control (PE/FIFO enable/disable)

The C function routines for each of the above operations are discussed below.

Figure 8. The WILDCARD™ Software Design Hierarchy [18].

3.3.1 Opening and Closing the WILDCARD™ board

The host program first makes an “open” call to the WILDCARD™ board before performing any other operations. This initializes the device driver, which, in turn, initiates the interface handlers for access to the board components. The C routine for this is WC_Open( ), and its counterpart is WC_Close( ). For every WC_Open( ) call there should be a corresponding WC_Close( ) call to ensure a clean disconnect and proper de-allocation of resources.

3.3.2 Clock Control

The only clock control operation a host program can perform is setting the frequency of F_Clk. The function call for this is WC_SetClkFrequency( ). The programmable clock module allows user programs to change the clock frequency anytime by calling this routine.

3.3.3 Processing Element and Interrupt Control

The four main operations that are executed against the Processing Element (PE) are: (1) the PE Reset; (2) the PE Program; (3) the PE Interrupt control; and (4) the PE Register space read and write.

The PE Reset operation is used to reset the PE and the embedded application residing on it. The PE program function calls are used to program the PE device with the user-designed application. There are two function calls: (a) PE_ProgramFromBuffer( ), which is used to program the PE from a user buffer space; and, (b) PE_Program( ), which is used to program the PE from a file.

The PE has a register space that we can read from and write to. The register space has an address range of 0x04000 to 0x0FFFF. The two function calls to read and write this register space are: (a) PE_RegRead( ), which reads from the PE register space locations; and (b) PE_RegWrite( ), which writes to the PE register space locations.

When a PE interrupt occurs, the device driver immediately masks the PE interrupt line and informs the calling program that is suspended on the API call that an interrupt has occurred. Interrupt control is done using the following functions [18]:

• WC_IntQueryStatus( ): checks the status of the PE interrupt line via polling. This is useful when the host program is written to do other operations while waiting for the interrupt.

• WC_IntWait( ): waits for the PE interrupt; useful when the host only needs to wait for the PE interrupt before proceeding to perform anything else. The calling program is suspended.

• WC_IntReset( ): after the host program has processed the interrupt, it can reset it and clear the API’s indication of a pending interrupt.

3.3.4 Memory Control

There are two main memory control API calls made by the host C application program: (1) WC_MemRead( ), which reads from the right or the left memory SRAM bank; and (2) WC_MemWrite( ), which writes to the right or the left memory SRAM bank. The calling arguments include the memory bank identifier, the base address and the size of the block of data (that is, the number of DWORDS) to be written or read [17]. The function calls invoke the device driver that, in turn, manages handlers to transfer data over the PCI bus and through the CardBus/LAD Bus interface to the WILDCARD™.

3.4 PE Embedded Application Initialization

The host program must proceed through a set of steps for initializing the application in the PE device before the reconfigurable application can be started. The steps are as follows:

• Open the WILDCARD™ board;

• Initialize the clock by setting it to a particular frequency;

• Enable the PE Reset line, which ensures the PE is reset at least once when the clock starts;

• Disable and clear any pending interrupts left hanging by any previous applications;

• Load the PE image by calling the PE_Program API routine;

• Execute any additional initialization tasks as necessary, such as enabling PE interrupts; and,

• Disable the reset lines to allow normal operation of the PE.

Now the downloaded UPGMA application is running on the WILDCARD™ processing element, and the host program can start its portion of the application processing activity, which consists of transferring taxa and phylogenetic tree data to and from the offloaded UPGMA algorithm application.
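The ordering of the initialization steps can be illustrated with a stand-in object that merely records what it is asked to do. This sketch does not call the vendor API: the class `BoardStandIn`, every method name on it, and the image file name used in the usage note are hypothetical and were chosen for this example only. The real host program would issue the corresponding C API routines in the same order.

```python
class BoardStandIn:
    """Hypothetical stand-in that logs each requested operation by name."""

    def __init__(self):
        self.log = []

    def __getattr__(self, name):
        # Any method not defined on the class is recorded, not executed.
        def call(*args):
            self.log.append(name)
        return call


def initialize(board, clock_mhz, image_path):
    """Walk through the initialization steps in the required order."""
    board.open()                    # open the board
    board.set_clock(clock_mhz)      # set the F_Clk frequency
    board.assert_reset()            # hold the PE in reset while the clock starts
    board.disable_interrupts()      # disable interrupts from earlier applications
    board.clear_interrupts()        # clear anything still pending
    board.program_pe(image_path)    # load the PE image
    board.enable_interrupts()       # optional additional initialization
    board.release_reset()           # let the PE run normally
```

For example, `initialize(BoardStandIn(), 50.0, "upgma_pe.bin")` leaves the eight operations in the stand-in's log in exactly the order listed above. The reset is asserted before programming and released last, so the embedded application never runs from a partially initialized state.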

CHAPTER 4

CUSTOM COMPUTING DESIGN OF UPGMA

In this chapter, we present the design methodology employed to create the custom computing application that offloads the UPGMA algorithm from the Host onto the WILDCARD board for accelerated processing of taxa data.

4.1 VLSI Design Flow

New VLSI design methodologies have emerged every four to six years. Hardware description languages and EDA tools have made it possible to design VLSI systems at higher levels of abstraction. Hardware description languages provide chip designers with the capability of describing the functionality of a design at a higher, more abstract level of representation than a gate-level representation. VHDL and Verilog HDL are the two languages used in the industry for hardware modeling and implementation.

Figure 9 shows an HDL-based design process model representing the various design activities. A brief description of each of the activities in the design process model is given below:

Figure 9. An HDL-based design process model [21]

• System specification is an activity to abstract design information from a problem statement and to define the interface and timing waveforms of the system.

• System partitioning is an activity to hierarchically decompose a system to handle complexity based on system specification, design resource, and feasibility of implementation. Components at the final hierarchical level should facilitate behavioral modeling using HDLs or allow for HDL component reuse. The output of this activity is a valid system partition.

• Modeling or adaptation involves capturing a design component in HDL with high-level timing information and data dependencies, or adapting reusable HDL components from a design library.

• Component simulation verifies the functional behavior and high-level timing of each component using HDL test benches or cycle-based simulation techniques.

• System binding is the structural integration of simulated components based on the system partition. This activity produces a system model for verifying the system specification.

• System simulation verifies system behavior and timing using HDL test benches or cycle-based simulation techniques. This activity produces simulation results that can be verified against the system specification.

• Logic synthesis obtains a gate-level netlist using an automated synthesis tool. Removal of timing information and non-synthesizable constructs, technology mapping, and the definition of area and timing constraints are involved in this activity. The target ASIC library and constraints are chosen to comply with the system specification.

4.2 UPGMA Project Design Flow

The design methodology described above forms the basis for building the application design. The UPGMA algorithm undertaken for this project forms the problem statement. It was analyzed and a system specification generated. The design was partitioned into easily manageable blocks, and each of these modules was modeled in VHDL. The top-level hierarchical model is structural and merely connects the different sub modules to form the final top-level design. We have employed a certain amount of adaptation by reusing Xilinx cores to implement certain modules in the design. This was done mainly to ensure that the design uses no more resources than are available in the Virtex XCV300E chip of the WILDCARD™ board.

The component simulation was conducted on each of the sub modules to verify their functionality. This step eases the final system-level simulation process, as the errors within the sub modules are removed by then. The final top-level structural model was then written and system simulation conducted to verify the functionality of the design as a whole. The ModelSim simulation environment has been used for debugging and testing purposes.

Logic synthesis was conducted mainly for identifying the critical paths and finding the resource usage of the design. The Synplify Pro® 7.3 FPGA synthesis tool was used for this purpose. The synthesized gate-level netlist does not entirely form the final design being implemented on the Processing Element (PE) of the WILDCARD™ system. The functionally verified design was embedded within the PE VHDL architecture model. The PE model was then placed within the WILDCARD™ system simulation model and the final functional testing was conducted. The verified PE model was then synthesized and the EDIF netlist generated. The EDIF netlist was then placed and routed using a make file provided by the WILDCARD™ system. The make file provided calls to the Xilinx place and route tools for generating the final PE image that is used to configure the device. This image is then used to proceed with the WILDCARD™ host programming process described in Chapter 3.

4.3 UPGMA Design

The design architecture was formulated keeping in mind the parameters that govern the data sizes upon which the design operates. The data bit-width constrains the bit widths of registers, the bit widths of datapath elements and the memory requirements. We first define these parameters and then describe the design architecture.

4.3.1 Design Parameters

For a taxa size of n:

• The number of nodes that form the final tree is n + (n-1).

• The number of distance values that need to be stored is (n(n-1) + (n-1)(n-2))/2.

  o For n nodes there are n(n-1)/2 pairwise combinations, so the number of nodal distances is n(n-1)/2.

  o When an internal node is formed, the number of nodes still left to be connected reduces by 2 for the very first internal node and then by 1 thereafter. Initially there are n external nodes. When the first internal node is formed by joining two external nodes, the number of nodes left is n-2, so we need to compute the distance of the new internal node to n-2 different nodes. Thereafter, each internal node formed reduces the count by 1, making the number of nodes to be connected n-3, n-4 and so on. The total number of distances for internal nodes is thus (n-2) + (n-3) + (n-4) + ... + 1, which is equal to (n-1)(n-2)/2.
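The counts above can be tallied directly for any given taxa size. The sketch below does not rely on a closed form for the internal-node distances; it lists the series term by term and sums it, so the totals can be checked for a chosen n. The function names are illustrative only.

```python
def node_count(n):
    """n external nodes plus n-1 internal nodes."""
    return n + (n - 1)

def initial_distances(n):
    """Pairwise distances among the n external nodes."""
    return n * (n - 1) // 2

def internal_distance_terms(n):
    """New distances needed after each merge: n-2 after the first merge,
    n-3 after the second, and so on down to 0 for the final (root) merge."""
    return [n - 1 - merge for merge in range(1, n)]

def total_distances(n):
    """All distance values produced over a complete run."""
    return initial_distances(n) + sum(internal_distance_terms(n))
```

Evaluating `total_distances` for the intended taxa size gives the number of 32-bit words that the distance memory has to hold if no location is reused.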

We define below the data structures used to represent the input, intermediate and output data. There are four types of data upon which the algorithm operates:

• Distance data: A simple 32-bit data word is used to represent this value. The distance data is stored in the left memory bank, which has a 32-bit data width; hence the choice of a 32-bit word for representing the input distance data.

• Nodes: The nodes that form the tree are represented as 10 bits each. The choice of 10 bits was made because the design was initially targeted to implement a 512-taxa dataset.

• Heights: Each node has a height associated with it. This indicates the number of nodes beneath it. This value is represented in 16-bit format. A 512-taxa tree would have a root with the largest height of 512. An initial size of 9 bits was selected in earlier versions of the modeling but was changed to 16 bits when the 16-bit Block RAMs were chosen to implement the Height memory that is used to store the heights of the nodes.

• Tree Output Data: This data represents each node in a tree together with its parent node and the branch length to its parent. The format used is given below:

  Node ID (10 bits) | Parent ID (10 bits) | Branch length (12 bits)

  This format was used to connect up the nodes while generating the final tree. The total bit length is 32 bits. The tree output data is stored in the right memory, which has a data word length of 32 bits.
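The 32-bit tree output word can be illustrated with a pair of packing helpers. The field widths come from the format above; the bit positions are an assumption of this sketch (Node ID in the most significant bits, Branch length in the least significant bits), since the text gives the field order but not the bit numbering. The function names are likewise illustrative.

```python
NODE_BITS, PARENT_BITS, BRANCH_BITS = 10, 10, 12

def pack_tree_word(node_id, parent_id, branch_len):
    """Pack the three fields into one 32-bit word (assumed layout)."""
    if not 0 <= node_id < (1 << NODE_BITS):
        raise ValueError("node_id does not fit in 10 bits")
    if not 0 <= parent_id < (1 << PARENT_BITS):
        raise ValueError("parent_id does not fit in 10 bits")
    if not 0 <= branch_len < (1 << BRANCH_BITS):
        raise ValueError("branch_len does not fit in 12 bits")
    return ((node_id << (PARENT_BITS + BRANCH_BITS))
            | (parent_id << BRANCH_BITS)
            | branch_len)

def unpack_tree_word(word):
    """Recover (node_id, parent_id, branch_len) from a packed word."""
    node_id = (word >> (PARENT_BITS + BRANCH_BITS)) & ((1 << NODE_BITS) - 1)
    parent_id = (word >> BRANCH_BITS) & ((1 << PARENT_BITS) - 1)
    branch_len = word & ((1 << BRANCH_BITS) - 1)
    return node_id, parent_id, branch_len
```

Because a 512-taxa tree has 1023 nodes, every node and parent identifier fits in the 10-bit fields, and a host program can rebuild the tree by unpacking each word read back from the right memory bank.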

4.3.2 Design Datapath

The basic datapath of the design, which is used for calculating the average distances and for obtaining the minimum, is given in Figure 10. The datapath is split into two parts: one finds the least distance, or minimum, and the other calculates the average distances. The datapath on the left of Figure 10, with the less-than operator, is used to find the minimum. It has the distance value and the current minimum as inputs. The current minimum is stored in a register and is fed back into the comparator. The second datapath, on the right, is used to calculate the average distance. It has the distance value d_{ik} and the height of node i as inputs. The multiplier obtains the product of this height and the distance and sends it to the adder. The adder adds this value to an accumulator. The multiplier and adder together obtain the numerator of the average distance equation (2) given in Chapter 2. At the same time that the multiplier and accumulator are computing the numerator, the second adder, which takes the height h_i as input, computes the denominator of equation (2). When these two are computed, the resulting values given below are sent to the divider to obtain the average distance.

Numerator = d_{ik} h_i + d_{jk} h_j

Denominator = h_i + h_j

Average distance = (d_{ik} h_i + d_{jk} h_j) / (h_i + h_j)
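A behavioral model of the averaging datapath helps make the flow of values concrete. The sketch below mirrors the multiplier, the accumulating adder, the height accumulator and the divider, using the 32-bit and 16-bit widths given for these units in this chapter. It is a software illustration rather than the VHDL model: the wrap-around masking and the truncating integer division are assumptions of this example, as is the function name.

```python
MASK32 = 0xFFFFFFFF  # width of the multiplier, adder and Add register
MASK16 = 0xFFFF      # width of the Height Adder and Height register

def average_distance(terms):
    """terms: (distance, height) pairs, e.g. [(d_ik, h_i), (d_jk, h_j)].
    Returns the truncated average distance."""
    add_reg = 0    # numerator accumulator
    hadd_reg = 0   # denominator accumulator
    for dist, h in terms:
        product = (dist * h) & MASK32            # multiplier output
        add_reg = (add_reg + product) & MASK32   # adder + Add register
        hadd_reg = (hadd_reg + h) & MASK16       # Height Adder + register
    return add_reg // hadd_reg                   # divider (integer quotient)
```

Both accumulators advance in the same loop iteration, reflecting the fact that the numerator and denominator are formed concurrently before the divider is started.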

Figure 10. Design Datapath

4.3.3 Design Architecture

The architecture of the design is given in the block diagram shown in Figure 11. The architecture shown is the final one created after several design passes. The two main components that form the backbone of the design are:

• The Controller

• The Address Generator

Most of the design effort has been put into modeling these two components since the controller executes the working of the UPGMA algorithm and the address generation forms the core part of the algorithm’s process. The description of these two components along with that of the other sub components is given below. We start with the simpler ones and then proceed later to the complex components. The datapath components are dealt with first, then moving to control path and finally the memory modules.

4.3.1 Adder

The Adder is modeled using the simple VHDL “+” operator. It has two data inputs and one output, each 32 bits wide.

The basic architecture of the 32-bit Adder would be realized using the Ripple-Carry design. This style of Adder architecture has its carry chain as its critical path; however, for this bit-width, the tradeoff in area versus speed was not significant enough to warrant exploring other, more sophisticated Adder architectures.

[Block diagram showing: Multiplier, MulReg, Adder, AddReg, HAdder, HAddReg, Divider, DivReg, Comparator, Least Distance Register, RowReg, ColReg, Output Selector, Address Generator, Controller, Height Memory, and the address, read and write signals to the Left and Right Memory Banks, with Clk and Reset inputs.]

Figure 11. Block Diagram of UPGMA Architecture.

4.3.2 Add Register

The Add register stores the output value of the Adder module. The register value is fed back into the Adder so that these two components work together as an accumulator. The Adder input is added to the old value stored in the Add register, and the cumulative value is stored into the register. The final value of the Add register forms the numerator of the average distance equation given in Chapter 2. The Controller module sets the enable and clear signals for the register.

4.3.3 Height Adder

The Height Adder is similar in architecture to the basic Adder and has two inputs and one output, each of which is 16 bits wide. This module is used to add the heights of the nodes of the tree.

4.3.4 Height Register

The architecture of the Height Register is similar to the Add Register, except that it is 16 bits wide. It stores the output value of the Height Adder, and its value is fed back into the Height Adder so that they work together as an accumulator. The Height Register’s final cumulative value forms the denominator in the average distance equation given in Chapter 2.

4.3.5 Multiplier

The multiplier is a simple 32-bit multiplier modeled using the standard VHDL “*” operator. The multiplier has two inputs and one output, each 32 bits wide.

4.3.6 Multiplier Register

The Multiplier register stores the Multiplier module’s output. The register component’s architecture and pin configuration are the same as those of the Add Register. From the standpoint of the VHDL model, a single generic entity-architecture description is employed for all the 32-bit register units.

4.3.7 Divider Unit

The Divider was modeled using the shift-subtract division algorithm. The divider forms the critical path of the design, and the controller coordinates its operation to ensure that the design runs at the requisite clock rate. The Divider has two 32-bit inputs and produces a 32-bit quotient and a 1-bit “valid” flag.
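The shift-subtract scheme can be illustrated with a small software model. The sketch below is a generic restoring division on unsigned integers, not the thesis's VHDL; the function name, the treatment of a zero divisor, and the meaning given to the "valid" flag are assumptions made for illustration.

```python
def shift_subtract_divide(dividend, divisor, width=32):
    """Restoring shift-subtract division on unsigned `width`-bit integers.

    Returns (quotient, valid). Each loop iteration models one
    shift-and-trial-subtract step of the hardware divider.
    """
    mask = (1 << width) - 1
    dividend &= mask
    divisor &= mask
    if divisor == 0:
        # Assumption: the flag stays low when no quotient can be produced.
        return 0, False
    remainder = 0
    quotient = 0
    for bit in range(width - 1, -1, -1):
        # Shift the next dividend bit into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        quotient <<= 1
        if remainder >= divisor:
            # Trial subtraction succeeds: keep it and set the quotient bit.
            remainder -= divisor
            quotient |= 1
    return quotient, True
```

A 32-bit operand needs 32 such steps, which is consistent with the divider being the unit whose timing the controller has to coordinate.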

4.3.8 Divider Register

The Divider register simply stores the output value of the Divider unit. Its architecture and pin configuration are similar to those of the Add and Multiplier registers. The controller handles the enable and clear signals.

4.3.9 Comparator

The Comparator is used to compare the distance values and find the minimum. The comparator is modeled using the simple VHDL “>” operator. It compares each new distance value with the previous minimum stored in the Least Distance Register; when it finds a new minimum, it enables the Least Distance Register, the Row Register and the Column Register.

4.3.10 Least Distance Register

The Least Distance Register stores the current minimum while the algorithm continues to search for the minimum of all the distances. The register architecture is similar to that of the generic 32-bit register, with the clear signal being set by the controller and the enable signal being set by the comparator.

4.3.11 Row and Column Registers

The Row and the Column Registers store the row and column values of the current minimum in the distance matrix. Each distance in the matrix is accessed through its row and column values. These two values are used to obtain other distance values while calculating the average distance value. The comparator sets the enable signal and the controller sets the clear signal for these registers.

4.3.12 Controller

The Controller forms the core of the design, making its behavior one of the most complex to model. Its operation is based on the processing steps of the UPGMA algorithm. The steps for the controller are given in Figure 12, below.

Figure 12. The Controller Algorithm.

The three main operations performed by the controller for every single pass through the matrix are as follows:

• Find the new minimum

• Compute the Average distance

• Reduce the matrix size

The controller accomplishes this by stepping through a set of states and repeating the process until all the nodes in the tree have been handled.
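The three operations can be sketched as a software model of one controller pass. This is a minimal illustration, not the thesis's state machine: the dictionary-based storage, the numbering of new clusters, the integer division, the rule that a new cluster's height is the sum of its children's heights, and the example matrix are all assumptions introduced here.

```python
def upgma_pass(dist, height, active):
    """Run one controller pass over the active nodes.

    dist   : dict, frozenset({a, b}) -> distance between nodes a and b
    height : dict, node id -> height (1 for a leaf)
    active : list of node ids not yet joined into a cluster
    Returns (x, y, new): the joined pair and the new cluster id.
    """
    # 1. Find the new minimum. The test `best[0] > d` mirrors the
    #    comparator; on ties the first pair encountered is kept.
    best = None
    for r in range(len(active)):
        for c in range(r + 1, len(active)):
            d = dist[frozenset((active[r], active[c]))]
            if best is None or best[0] > d:
                best = (d, active[r], active[c])
    _, x, y = best

    # 2. Compute the average distance from the new cluster to each other node.
    new = max(height) + 1  # hypothetical numbering for the new cluster
    for i in active:
        if i not in (x, y):
            num = (height[x] * dist[frozenset((x, i))]
                   + height[y] * dist[frozenset((y, i))])
            dist[frozenset((new, i))] = num // (height[x] + height[y])
    height[new] = height[x] + height[y]

    # 3. Reduce the matrix: retire x and y, activate the new cluster.
    active.remove(x)
    active.remove(y)
    active.append(new)
    return x, y, new


def upgma(matrix):
    """Repeat passes until one node remains; return the list of merges."""
    n = len(matrix)
    dist = {frozenset((r, c)): matrix[r][c]
            for r in range(n) for c in range(r + 1, n)}
    height = {k: 1 for k in range(n)}
    active = list(range(n))
    merges = []
    while len(active) > 1:
        merges.append(upgma_pass(dist, height, active))
    return merges


# Illustrative 4-taxon matrix; the values are invented for this sketch.
EXAMPLE = [[0, 4, 10, 2],
           [4, 0, 8, 6],
           [10, 8, 0, 12],
           [2, 6, 12, 0]]
```

For the example matrix, `upgma(EXAMPLE)` joins nodes 0 and 3 first, then node 1 with that cluster, then node 2, giving three merges and 2N - 1 = 7 tree nodes in total.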

4.3.13 Counter Units

The counters form some of the sub-components of the address generator block. The counters are used to select the next address to be generated. The generic block diagram for one of the counters is given below. A counter counts up until the count value becomes equal to the compared value (CV), at which point it sets a “great” flag, indicating that the count limit has been reached; the counter is then reset back to zero. The controller sets the Increment and Clear signals.

4.3.14 Multiplexers

The multiplexers allow the address generator to select the next address. We have 2:1 and 3:1 multiplexer architectures for this purpose.

4.3.15 Address Generator

The address generator is a module that underwent several design cycles. The address generation algorithm is not simple when considered from a hardware perspective. Address generation is conducted for two operations in the design:

• Finding the distance minimum in an instance of the matrix; and,

• Calculating the average distance

For finding the minimum, the address is generated for fetching the next distance value from the memory. The address is generated in the format of “row&column”, with the row and column values concatenated together to represent the actual memory address.

The row and column values represent the row and column of the node-to-node distance matrix, with each row or column representing a node.

For example, the node 1 to node 2 distance can be fetched by concatenating “0000000001” with “0000000010” to generate the 20-bit address “00000000010000000010” before actually accessing that memory location. These two values are obtained by reading a memory that stores the currently active node values.
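The “row&column” concatenation can be sketched as a shift-and-OR on integers. The 10-bit field width is taken from the 20-bit example above; the function names are hypothetical and chosen only for this illustration.

```python
NODE_BITS = 10  # width of each node field, from the 20-bit example above


def concat_address(row, col):
    """Form the 'row&column' address: row in the upper bits, column in the lower."""
    return (row << NODE_BITS) | col


def split_address(addr):
    """Recover the (row, column) pair from a concatenated address."""
    return addr >> NODE_BITS, addr & ((1 << NODE_BITS) - 1)


# The node 1 to node 2 example, printed as a 20-bit binary string.
print(format(concat_address(1, 2), "020b"))
```

Because the two fields never overlap, the node pair can always be read back from the address, which is what lets the design know which two nodes a fetched distance belongs to.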

The earlier version of this memory structure, referred to as “node memory”, was modeled behaviorally and later modified to reduce the size of the module. The earlier model required more resources than were available in one full Virtex XCV300E chip. The model was later modified by implementing the node memory using a 256 x 32-bit Xilinx Block RAM, which uses a single Block RAM resource on the Virtex XCV300E chip. This reduced the resource consumption significantly, making the module work much more efficiently.

The Xilinx Block RAM is a dual-port memory structure [16], so that two different addresses can be read or written, in any combination, at the same time. This enables us to read the two different node values at the same time in order to generate the concatenated address within one clock cycle. The block diagram of the dual-port Block RAM is given above.

The counters, discussed earlier, are used to generate the address values, ‘addra’ and ‘addrb’, for selecting the nodes to generate the next address. After all the distances are read and compared, the minimum is stored in the Least Distance Register.

The average distance calculation requires generating addresses by selecting nodes that have not yet been joined into a cluster. The nodes that have currently been selected to form the new cluster are stored in the Row and Column registers. The average distance equation is given below.

$$D_{(x,y)\to i} = \frac{H_x D_{xi} + H_y D_{yi}}{H_x + H_y}$$

Here x and y are the nodes selected to form the new cluster, i is the node to which the distance from the new cluster is being calculated, Hx and Hy are the heights of nodes x and y respectively, and Dxi and Dyi are the distances of nodes x and y to node i, respectively. The address generation for calculating the average distance is done in the steps outlined as follows:

• The address for obtaining Dxi is generated first by selecting node i’s value and concatenating it with node x’s value stored in the Row Register;

• The address for obtaining Dyi is generated second by selecting node i’s value again, and concatenating it with node y’s value stored in the Column Register;

• These two distances are combined, each weighted by the height of its node, to calculate the average distance D(x, y) → i; and,

• This new distance is then stored into the memory for future reference. The new cluster forms a new node, say j, and the average distance calculated represents the distance of this cluster to node i. Thus, the address for storing this distance would be cluster j’s value concatenated with node i’s value.

The above four steps are performed for all of the nodes that have not yet been connected to the tree. After the average distances of all the nodes have been calculated, the node memory is updated by removing the nodes that have been selected to form the new cluster. The complexity of the process lies in maintaining the node memory and stepping through the process of selecting memory’s addresses to obtain the next address.
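As a worked instance of the average distance equation, assume purely illustrative values (not taken from the thesis data): H_x = 1, H_y = 2, D_xi = 6 and D_yi = 9. The products are accumulated into the numerator held by the Add Register, the heights are summed into the denominator held by the Height Register, and the Divider produces the quotient:

```latex
D_{(x,y)\to i}
  = \frac{H_x D_{xi} + H_y D_{yi}}{H_x + H_y}
  = \frac{1 \cdot 6 + 2 \cdot 9}{1 + 2}
  = \frac{24}{3}
  = 8
```

The larger cluster (y, with height 2) pulls the result toward its own distance of 9, which is the intended effect of weighting by height.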

4.3.16 Output Generator

This module selects the output values that form the output tree data written into the memory. The outputs that this module produces are listed as follows:

• Node type: internal or external leaf node;

• Node ID: the value of the node;

• Parent ID: the value of the parent of the node; and,

• Branch distance: the distance of the node to its parent.

4.3.17 Height memory

The Height memory holds the heights of the nodes in the tree. The external, or leaf, nodes have a height of one, and the internal nodes, those with child nodes, have heights of two or more. The memory architecture in the earlier version of this design was implemented behaviorally, and the post-synthesis results showed high resource usage and slow performance. Thus, after a design review, this module was implemented using four 256x16-bit Xilinx Block RAMs and additional write and read logic to control the Block RAM accesses. The module now uses only four Block RAM resources on the Virtex XCV300E chip.

4.3.18 Off-chip Memory Banks

The left and right memory banks on the WILDCARD™ are used for storing the distance and tree data, respectively. The banks are 64Kx32-bit SRAM modules, and access to them is managed through interface modules provided by the WILDCARD™ system. These components are available as VHDL models that can be used depending on the needs of the application. To write to the left and right memories, we need the interface components provided for these two banks. The application on the PE sends a read or write request through these interface components. The components allow multiprocessing, such that multiple applications could read and write to the memory at the same time. The read and write requests are prioritized, and the designer can choose the kind of prioritization used. This feature has not been used, as we have only one application design running within the PE that reads and writes to the memories.

The read and write operations each take a certain number of clock cycles to complete. Figures 13 and 14 give the timing diagrams for reading from and writing to the memory. The read cycle is such that the first data word takes 4 clock cycles to arrive after the read signal is set.

Figure 13. Typical read cycle from memory [18]

Figure 14. Typical write cycle to memory [18]

As Figure 14 shows, a write takes only one clock cycle to perform. The 4-clock-cycle read latency requires the controller to wait until the first data word arrives before performing the datapath operations. Memory reads thus add latency to the design and reduce performance. For larger taxa sets, more reads are issued, so the accumulated latency grows and significantly affects the speed of the design.

4.3.19 Addressing Schemes

The WILDCARD™ memory banks each have a maximum capacity of 65536 words of 32 bits. This small memory capacity not only limits the number of taxa that can be handled on this system but also affects the way the memory is addressed. The addressing scheme discussed in Section 4.3.15 is not effective when the number of taxa increases to, say, 256. The scheme uses the following methodology. Let us assume we have the distance matrix given in Figure 15 below.


Figure 15. Distance Matrix

The distances 6, 8, and 3 are addressed by D[0, 1], D[0, 2], and D[0, 3]. Thus, while writing data from the host, we place these three values at indexes 1, 2, and 3 of the array and transfer the data to WILDCARD™ memory. These values would be written in addresses 1, 2, and 3 of the memory. Thus value 6 in address 1 of the memory can be referenced using the address “00000000000000000001”, obtained by concatenating the nodes “0000000000” and “0000000001” together, and similarly for locations 2 and 3.

Now, distance 7 is D[1, 2] and thus is written at index 1026 of the C array and transferred to the memory. It can thus be referenced by generating the address “00000000010000000010”, obtained by concatenating nodes “0000000001” and “0000000010” together. These nodes represent the indexes of the matrix D. Thus we use an addressing scheme that is similar to the way we reference the matrix values.

For datasets of 256 taxa or higher, this scheme fails: to obtain the distance between, say, nodes 254 and 255, we need the address “00111111100011111111” (260,351 in decimal), which is much larger than 65536. Also, using this scheme, we waste memory locations. For example, the consecutive distance values 6, 8, 3 are stored one after another in locations 1, 2, 3, respectively, but the distance value 7 suddenly jumps to memory location 1026. This waste of memory locations would reduce as the taxa size increases, but it is still unacceptable.
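The reach and the sparseness of the concatenated scheme can be checked with a short calculation. The sketch below assumes 10-bit node fields, storage of the upper triangle only, and a 65536-word bank; the helper names are hypothetical.

```python
NODE_BITS = 10        # node field width, from the 20-bit address examples
MEMORY_WORDS = 65536  # capacity of one memory bank, in words


def concat_address(row, col):
    """Form the 'row&column' address used by the original scheme."""
    return (row << NODE_BITS) | col


def scheme_stats(n_taxa):
    """Return (largest address touched, fraction of that span actually used)."""
    max_addr = concat_address(n_taxa - 2, n_taxa - 1)
    used = n_taxa * (n_taxa - 1) // 2
    return max_addr, used / (max_addr + 1)


def largest_taxa_that_fits():
    """Largest taxa count whose every concatenated address stays in the bank."""
    n = 2
    # concat_address(n - 1, n) is the largest address needed for n + 1 taxa.
    while concat_address(n - 1, n) < MEMORY_WORDS:
        n += 1
    return n
```

For 256 taxa, `scheme_stats` reports a largest address of 260351 with roughly 12.5% of the spanned locations in use, against about 0.3% for 4 taxa, which matches the observation that the waste shrinks as the taxa count grows. Under these assumptions `largest_taxa_that_fits()` returns 65, which suggests the overflow would begin well below 256 taxa; whether the design was exercised in that range with this scheme is not stated in the text.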

To avoid this problem and to be able to implement larger taxa datasets we have to employ a linear addressing scheme. The catch in this scheme is that our design needs to maintain a record of the node information while fetching every distance value, so that we can know which two nodes have the minimal distance between them. Thus the address generation using concatenation of nodes is important for the design. We therefore resort to an address modification scheme in which we generate the address in the original scheme and then modify it into a linear 16-bit address that does not go beyond our limit of 65536.

The address modification is a complex process: it takes additional clock cycles and requires additional states in the control structure. This slows the design down and noticeably reduces performance. We will discuss the impact of the address modification on performance in later chapters.

Let us assume that in the address “node1&node2”, node1 refers to the row of the matrix and node2 to the column. Thus, for the matrix given in Figure 15, we have the mapping for each address value given in Table 2. The number of taxa is n = 4.

Matrix format    Linear format    Number of values per row
0-1              0                n-1
0-2              1
0-3              2
1-2              3                n-2
1-3              4
2-3              5                n-3

Table 2: Address mapping

From the above table we deduce that each row maps to a base address. Row 0 maps to base address 0: pair 0-1 has linear address 0, pair 0-2 has address 1 (the base incremented once), and pair 0-3 has address 2 (the base incremented twice). For row 0, then, the linear address is the base address plus (column - 1). Row 0 holds (n - 1) values starting at base address 0, so the base address for row 1 is (0 + n - 1), which equals 3 for n = 4. Pair 1-2 therefore has linear address 3, and pair 1-3 has address 3 + (3 - 1 - 1) = 4. In general, we add (column - row - 1) to the row's base address to obtain the correct linear address. Thus the steps used to obtain the linear address are:

• Obtain the base address to which the row maps from a map memory

• Add the value (column - row - 1) to obtain the final address

The above methodology is employed to perform the address modification. The base addresses that each row maps to are written initially in Block RAM. The initialization of the node memory and the row map memory requires further additions to the controller states and thus adds considerable delay to the design. This delay grows for larger taxa datasets. The address modification component is placed within the PE, outside of the UPGMA application component. The address generated by the UPGMA component is fed to the address modification component, and the modified address is fed to the memory interface components.
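The two-step mapping can be sketched in software as a row map lookup followed by an addition. The function names are hypothetical, and the list used for the row map stands in for the Block RAM map memory described above.

```python
def build_row_map(n_taxa):
    """Base linear address of each row; row r holds n_taxa - 1 - r values."""
    bases = []
    base = 0
    for row in range(n_taxa - 1):
        bases.append(base)
        base += n_taxa - 1 - row
    return bases


def linear_address(row, col, row_map):
    """Map a (row, col) pair with row < col to its linear address."""
    return row_map[row] + (col - row - 1)


# Reproduce the Table 2 mapping for n = 4.
row_map = build_row_map(4)
for r in range(4):
    for c in range(r + 1, 4):
        print(f"{r}-{c} -> {linear_address(r, c, row_map)}")
```

For n = 256 the same mapping places the pair 254-255 at address 32639, the last of the N(N-1)/2 = 32,640 locations, so every distance fits within the 65536-word bank.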

4.3.20 Top-level Block

The top-level block in the design provides integration and routing of all the sub modules described above. The final top level design is then placed within the VHDL model for the PE, as a sub component to that model, and is interfaced with the memory and LAD Bus interfaces for handling the transfer of data.

The VHDL models for all the blocks in the design are listed in Appendix A.

4.4 Design Verification

The design verification of the UPGMA design was performed using the ModelSim simulation environment. The VHDL models of the WILDCARD™ system provide a simulation model that can be used to run a host-based simulation. The simulation was done for various taxa datasets. The benchmark dataset used was a 57-taxa dataset for which we had the resultant output tree data generated by the software implementation. The output tree generated by the software simulation of the hardware design was compared with the benchmark data and found to match.

To verify the working of the hardware implementation on the WILDCARD™ system, the data generated by the hardware implementation was compared with the benchmark data. The two results matched exactly.

Data generated through the test data generating software was fed to both software and hardware designs and the resulting output was compared. We found that the two outputs matched, indicating that the hardware design was working correctly.

CHAPTER 5

EXPERIMENTAL DATA SET AND PERFORMANCE MEASUREMENT

5.1 Experimental Apparatus for UPGMA

The WILDCARD™ host-programming environment provides us the capability to program the WILDCARD™ system and also allows us to create templates that are used to write the host program. Looking back at the software design hierarchy explained in Chapter 3, we see that the host “driver” program is written in C. The WILDCARD™ provides the API routines that are used in the host program to perform the following functions: (1) read and write to the on-board SRAM memories; (2) wait for the Virtex® PE to interrupt (or, alternately, poll the status register for completion of a WILDCARD™ controller operation); and (3) process the results of the API-initiated operation.

The UPGMA host program was written based on the example templates provided by Annapolis Microsystems® for setting up a custom computing application for reading and writing the SRAM memory banks, reading and writing data to the Virtex® Processing Element (PE) register space, and for processing PE interrupts. Using these examples as guides, we created a complete host-based experiment “driver” application, employing the above three components, to perform the following host-to-computing server protocol steps:

• Initialize the WILDCARD™ system;

• Program the PE from the image file;

• Set the clock frequency;

• Enable the PE interrupt line;

• De-assert the PE Reset line;

• Read distance data from the file into a distance array;

• Transfer distance data from the distance array to the WILDCARD™ Left memory;

• Write the number of taxa being operated upon into a PE register;

• This write triggers the design to start running; the design asserts a “done” signal after it finishes;

• Wait in the C program until the “done” signal is set;

• Read the output data from the Right memory; and,

• After reading, write the data to a destination file in the PC host file system.

The PE initialization includes “opening” the board by calling the WC_Open( ) routine, applying power to the board, asserting reset lines, and clearing any pending interrupt requests left unprocessed by previous application programs. Once the design has been synthesized, the EDIF file from the synthesis run is transformed into a placed and routed design for the Virtex® FPGA, and the image file is generated, by running the Xilinx® M1 Alliance Series place and route tools.

This image is placed in the C project directory and is used to configure the PE by calling the WC_PeProgramFromFile( ) API routine. After the PE image is loaded onto the device, the PE clock frequency is set by calling the routine WC_SetClkFrequency( ), the interrupt lines are enabled, and the Reset line is de-asserted. The WILDCARD™ board is then ready for transfer of data to the on-board SRAM memories.

The distance data is written into the left memory, while the right SRAM memory is used for storing the output tree data. The number of taxa on which we are operating is written into a single 32-bit register on the PE. The host C program then goes into “sleep” mode, waiting for the PE interrupt to be set. Meanwhile, the UPGMA logic starts executing, and operates on the distance data in order to generate the output tree data.

After the design finishes processing, it generates a “done” signal that is tied to the PE interrupt line. Once the PE interrupt is set, the host C program comes out of its wait state and starts processing the PE interrupt. The host program clears the interrupt and starts reading the Phylogenetic tree data from the Right SRAM memory. Once all the output tree data is read from the right memory and written into an output file, the “driver” program clears all the memory buffers allocated during the execution, and proceeds to “close” the device by calling the WC_Close( ) API routine. The C code for the host-based experiment “driver” program is provided in Appendix D.

5.2 Generating Random Taxa Test Data Sets

A program written in C++, using the MFC programming environment, was used to generate the test data for testing the implementation of the UPGMA algorithm. The program takes as input the following parameters: (1) the number of taxa; (2) the maximum value of inter-node distance; and, (3) the number of repetitions of a single distance value in the data set.

The test data are generated for taxa sizes of 10, 16, 32, 50, 64, 75, 100, 128, 150, 175, 200, 225 and 256. For each taxa size, ten different data sets are randomly generated for that number of taxa. Furthermore, each created data set has its data values subjected to permutations, creating up to 10 permutations per data set per number of taxa. The C++ code for the test data generation is given in Appendix E.

Figure 16. Test Data Generator Input Dialog Box.

Figure 16 presents a screenshot of the dialog box used by the program for generating test data. The taxa size, maximum nodal distance, and the number of repetitions of a particular distance value are given as inputs. When the data has been generated the program pops up a confirmation dialog box.

The data values are generated randomly, making sure that each value is within the maximum nodal distance limit set by the user. Also, the number of repetitions of each value in the data set is constrained to be less than or equal to the repetition limit specified by the user. For each taxa size, ten different datasets are generated, and for each of these ten datasets ten different permutations are generated by changing the positions of the distance values within the distance matrix.
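The generation rules above can be sketched as follows. This is not the thesis's C++/MFC program from Appendix E; the function names, the lower bound of 1 on distances, the seed parameters, and the pool-sampling approach are assumptions made for this illustration.

```python
import random


def generate_distances(n_taxa, max_distance, max_repeats, seed=None):
    """Draw the n(n-1)/2 upper-triangle distances in 1..max_distance,
    letting each value appear at most max_repeats times."""
    needed = n_taxa * (n_taxa - 1) // 2
    if needed > max_distance * max_repeats:
        raise ValueError("repetition limit too small for this many distances")
    rng = random.Random(seed)
    # The pool holds every allowed value max_repeats times, so sampling
    # without replacement enforces the repetition limit directly.
    pool = [v for v in range(1, max_distance + 1) for _ in range(max_repeats)]
    return rng.sample(pool, needed)


def permute(distances, seed=None):
    """Return a copy with the same values placed at shuffled matrix positions."""
    rng = random.Random(seed)
    shuffled = list(distances)
    rng.shuffle(shuffled)
    return shuffled
```

Note that the three inputs are not independent: for 256 taxa, 32,640 values are needed, so the product of the maximum distance and the repetition limit must be at least that large for the constraints to be satisfiable.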

5.3 Measuring Time

The time taken for the UPGMA implementation to execute on the WILDCARD™ system is measured using standard C time function calls. Time measurements are collected for the time taken for the program to transfer the distance data to the memory, generate the tree, and read back the output tree data from the memory. The time is measured in terms of CPU clock ticks using the standard C language clock( ) function call.

The time taken for memory transfer is measured separately in order to analyze the cost of transferring data to and from the WILDCARD™ memory banks. This is done to give us an idea of how the cost affects the performance of the implementation for high values of N, the number of taxa. The current maximum of 256 taxa limits the number of distance values to be written to the memory. Also, the WILDCARD™ memory banks are 65536 (64K) words, with each word being 32 bits in width. This constrains the number of taxa that can be operated on for a given UPGMA run.

Through independent tests written for the WILDCARD™ system, the time taken for writing to each of the memories has been collected and tabulated. Although not shown here, the numbers collected indicate that the cost for writing the entire memory is about 20 CPU clock ticks, while reading back the entire memory takes about 100 clock ticks on the 800 MHz Pentium III processor serving as the experimental workstation. This indicates that reading data from the SRAM memories from the host is a more expensive operation than writing to them.

For the purposes of our experiment, we write distance data and read back tree data from the left and right on-board SRAM memories, respectively. As we are constrained to operate upon a 256-taxa data set, the number of distance values needed to write a complete matrix to the memory is 32,640, while the maximum number of memory locations to be read back for the resulting tree is 511. These two values have been obtained from the following formulae.

Number of distance values = N(N-1)/2

Number of tree nodes = 2N - 1

Thus, considering that our largest data set has fewer values to be written than the maximum capacity of the memory banks, the memory writes take less than the 20 CPU ticks needed for writing the entire memory array. Similarly, the number of output words to be read back from the memory is small compared to the full memory size. Thus, the time taken is much less than the 100 ticks for reading an entire memory. The cost of writing and reading from the on-board SRAM memories therefore does not form a big factor in the performance overhead for our implementation.
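The two formulae can be checked against the figures quoted above with a few lines of arithmetic. The function names are hypothetical, and the capacity calculation considers storage of the distance values only; the 256-taxa limit reported in the text comes from the architecture as a whole, not from this bound.

```python
MEMORY_WORDS = 65536  # words per memory bank


def distance_values(n):
    """Number of distinct pairwise distances for n taxa: N(N-1)/2."""
    return n * (n - 1) // 2


def tree_nodes(n):
    """Number of nodes in the resulting binary tree: 2N - 1."""
    return 2 * n - 1


def max_taxa_for_capacity(words=MEMORY_WORDS):
    """Largest n whose distance values fit in `words` memory locations."""
    n = 1
    while distance_values(n + 1) <= words:
        n += 1
    return n


print(distance_values(256), tree_nodes(256))
print(distance_values(10000))
```

For N = 256 this gives 32,640 distances and 511 tree nodes, as stated. For N = 10,000 the distance count is 49,995,000, several hundred times the capacity of one bank, which is why the transfer cost cannot be ignored at that scale.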

However, if we look at realistic values of N, which can go as high as 10,000, the cost for reading and writing the memories would become important. To obtain an idea of what the cost might be, we can extrapolate the timing data, assuming that we have unlimited memory capacity on our hardware board. We could write to the WILDCARD™ memory banks multiple times to find out the cost for writing more than 65,536 words and use this value to obtain a gross estimate of the cost to write to the memory for large datasets. Similarly, we could read back data from the memory multiple times and obtain an estimate of the cost to read the memory for large datasets.

This cost data could then be used to obtain a gross estimate of the overall performance of the algorithm being implemented on the hardware. This provides a theoretical extrapolation, and not the exact performance cost; however, it provides valuable information on how the algorithm performance might scale to data sets with larger numbers of taxa, and whether the memory access costs would have a significant or minor impact (assuming we had reasonably unlimited memory available). This data is presented in Chapter 7 as part of the discussion of conclusions of this research.

CHAPTER 6

EXPERIMENTAL METHOD AND RESULTS

In this chapter, we present the results of running Phylogenetics data sets against the UPGMA implementation on the WILDCARD™-based reconfigurable custom computing machine. We present the resultant data sets in terms of a bounded clock cycle count using the clocking frequency of the host PC’s CPU clock, which gives us a count of the total number of host clock cycles for a given computation run. We use this, rather than the on-board FPGA clock, because the host count takes into account the communication overhead of getting data to and from the WILDCARD™ board.

6.1 Running the Experiments

We take randomly generated data sets, permute them, and execute them on the WILDCARD™. We then increase the number of taxa considered in the input distance matrix, generate new data sets and permute them, and execute them on the UPGMA processor. The test data for taxa sizes of 10, 16, 32, 50, 64, 75, 100, 128, 150, 175, 200, 225 and 256 were executed. For each taxa size, ten different data sets along with 10 different permutations of certain datasets were run and timing results collected. The results are described in the following sections.

6.2 Experimental Results for Latency

The time taken, measured in CPU clock ticks, is collected for the different datasets and permutations at each taxon size. The average time taken for ten different permutations of each of the ten datasets for each taxon size is given in Tables 4 through 6. Ttotal is the total latency of the platform: the number of clock ticks taken for a complete design run, including the data transfer between the host and the WILDCARD™.

One aspect of defining the data set for purposes of running experiments is permuting the data to assess whether permutation impacts the execution latency. In some implementations of UPGMA in software, permutation might affect the execution of a given data set at some number of taxa. The permutations were randomly generated along with the data sets. However, we wanted to determine whether this aspect of the organization of data would affect the design in some meaningful way before committing the time to run the full set of experiments.

Our expectation was that permuting the data would not be much of a factor in the variation of latency values, because the time to perform actual computations on fixed-width operators is largely independent of the actual data values passed as the operands. From the data collected from the sample permutation runs, this seems to be the case. This is shown in Table 3 and Figure 17 below.

Table 3: Timing Results for permuted datasets

Figure 17. Frequency Distribution for Latency versus Taxa Data Set Permutation.

From this analysis of the permutation, we conclude that we do not need to consider permutation of the data set values for a particular execution run. Therefore, we focus our presentation of the data on the different UPGMA execution runs using randomized data sets for each of the selected number of Phylogenetic taxa.

We next examine the response of the custom computing machine in terms of the Mean Latency (averaging the data set samples) versus the number of taxa. This is shown in several different ways, so as to highlight the statistical convergence of the latency values around the mean values computed across the ten randomized data sets.

The first plot in Figure 18 shows the basic Latency response curve as the number of taxa increases to 256, the maximum number that can be stored in the available memory on the WILDCARD™, given the architecture.

We wanted to evaluate the deviation from the mean over the data sets for each number of taxa, and observe what happens to this deviation as the number of taxa increases to the maximum targeted for this research. What we see in the Latency data for the different data sets, for a given number of taxa, is that the data tends to cluster tightly around the mean, indicating minimal deviation. There is some wider variance as the number of taxa grows, as evidenced by the curve in Figure 19, which gives the Latency in log scale. The variance is slightly larger for the 200- and 256-taxa datasets, as seen in the curve. The rest of the datasets converge well.

We are not able to grow the number of taxa on the current WILDCARD™-based reconfigurable computing platform to see whether there is a real trend in the deviation data. However, we believe that the results would not be affected as the number of taxa grows. This is because hardware computation speed is relatively fixed for fixed data bit-widths: the combinational circuit has a fixed latency, and thus the computations have a fixed latency. This leads us to believe that the results would not differ significantly with an increase in the number of taxa.

[Figure 18 plots not recoverable from the text extraction. Two panels, each titled "Latency versus Number of Taxa". Series: Computational Latency (Mean Values), Latency (Std. Deviation), Latency (Variance). Horizontal axis: Number of Phylogenetic Taxa (10 to 256). Vertical axis: Latency (CPU cycles).]

Figure 18. Mean Latency versus Number of Taxa (Normal Scale).

[Figure 19 plot not recoverable from the text extraction. Title: "Latency versus Number of Taxa (Log Scale)". Series: Computational Latency (Mean Values), Latency (Std. Deviation), Latency (Variance). Horizontal axis: Number of Phylogenetic Taxa (10 to 256). Vertical axis: Latency (CPU cycles), logarithmic.]

Figure 19. Latency versus Number of Taxa (Log Scale).

The deviation we see in our current results, we believe, can be attributed to “noise” on the host PC side, as the host is not dedicated to running the WILDCARD™ program exclusively, but at the same time has other processes running that can skew the count of the clock ticks.

Table 4 below gives the timing results for the WILDCARD™ UPGMA program running datasets of different taxa sizes. It gives the latency in number of clock ticks for each of the ten different datasets for every taxon size. It also gives the mean, standard deviation and variance of the 10 datasets for a given number of taxa.

We then look at the performance of the UPGMA processor and compare it with standard complexity functions to obtain an upper bound in terms of Big-Oh.

Table 4. Latency Values for Data Sets at Generated Number of Taxa.

6.3 Bounding Time Complexity

Given the Latency curve as the number of taxa grows, we want to understand the results in terms of time complexity. Stanat and McAllister [25] provide an appropriate taxonomy against which we can attempt to qualitatively “fit” our performance curve to those of standard time complexity functions. Because our Latency measure incorporates communication overhead, and because we randomize and permute our data sets, we assume we are working with worst-case behavior. We want to understand the behavior in terms of the standard forms of Big-Oh.

Our first attempt is to compare our Latency plot against the base function plots for the O(N), O(N log N), and O(N^2) time complexity patterns. Figure 20 shows the Excel® plots of the data in Table 4, in both normal and logarithmic scale. We attempt a qualitative assessment of time complexity bounds without deriving more precise recurrence expressions, although we are able to generate curve-fitting equations directly on the Excel plots.

What we see from the plots is that, given the limited range of N (number of taxa) covered in this research, we appear bounded by O(N log N) time complexity. However, the other conclusion we draw from this data is that we are too constrained by the lack of a sizable memory space (space complexity) in which to store the larger number of distance matrix values needed to process a greater number of taxa, N.

Therefore, we cannot draw a definitive conclusion about the performance of the computing system for large values of N. However, in the conclusion of this work we explore what we might need to do to grow to considerably larger values of N, into the thousands of taxa. Also, to assess the benefits, we will use comparative data.
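The qualitative comparison against the base functions can be sketched as follows, using the mean hardware latencies listed in Table 6. The base-2 logarithm and the absence of any scaling constant are our assumptions; the text does not state which were used for the Excel reference curves.

```python
import math

# Mean hardware latency (clock ticks) per taxon count, copied from Table 6.
mean_latency = {
    10: 8.4, 16: 8.5, 32: 9.4, 57: 12.0, 64: 14.0, 75: 20.3, 100: 39.6,
    128: 71.6, 150: 110.2, 175: 162.5, 200: 242.6, 225: 342.1, 256: 504.4,
}


def smallest_bound(n, latency):
    """Name the smallest unscaled base function that is >= the latency."""
    candidates = (
        ("N", n),
        ("N log N", n * math.log2(n)),  # assumption: log base 2
        ("N^2", n ** 2),
    )
    for name, bound in candidates:
        if latency <= bound:
            return name
    return "none"


for n, latency in mean_latency.items():
    print(f"N={n:3d} latency={latency:6.1f} bounded by {smallest_bound(n, latency)}")
```

Under these assumptions every measured point lies at or below N log N, and the points from 200 taxa upward lie above N, which agrees with the qualitative reading of the plots.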

[Plots: "Latency = f(N) Time Complexity Bounding", in normal scale and in log scale. Series: Mean Latency, N, N*log(N), N**2. X-axis: Number of Phylogenetic Taxa (10 to 256). Y-axis: Latency (CPU cycles).]

Figure 20. Bounding of Latency by Time Complexity Functions.

[Plot: "Latency = f(N) Time Complexity Bounding". Series: Latency, with two polynomial trendlines, Poly. (Latency). X-axis: Number of Phylogenetic Taxa. Y-axis: Latency (CPU cycles). Cubic trendline: y = 0.577x^3 - 6.2392x^2 + 22.55x - 13.393, R^2 = 0.9982. Quadratic trendline: y = 5.8787x^2 - 47.849x + 83.55, R^2 = 0.9748.]

Figure 21. Bounding Latency by Time Complexity Functions Computed in Excel.

Finally, before leaving this aspect of the analysis, Figure 21 shows how difficult it is to assess the time complexity qualitatively. It uses the Excel® trend analysis of the Latency curve, showing both square and cube polynomial trend curves. Excel uses regression analysis to produce the trendlines, which help us predict the behavior of the Latency curve as the number of taxa increases beyond our current maximum of 256. We do not have enough experimental data ourselves to see what happens to the Latency for larger values of N. For this, we would need to move the design to a larger platform, such as the Star Bridge HC-36m or the SRC 6e, which would be the subject of future research. However, the trendlines give us an idea of how to characterize our upper-bound performance for values of N greater than 256, and of how it might scale. Each trendline also yields an R^2 (R-squared) value. The R^2 value, also known as the coefficient of determination, ranges from 0 to 1 and indicates how closely the values estimated by the trendline match the actual data. A trendline is most reliable when its R^2 value is at or near 1. The cube polynomial trendline provides the best bounding for the Latency curve, as its R^2 value (0.9982) is better than that of the square polynomial trendline (0.9748). So we feel that the algorithm complexity is bounded by O(N^3) for the hardware implementation.
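The trendline comparison can be reproduced approximately with an ordinary least-squares polynomial fit. The sketch below is not the Excel procedure itself; it is a small stand-alone fit over the mean hardware latencies of Table 6. We assume that x is the ordinal position of each data point on the category axis (1 to 13) rather than N itself. This assumption is ours, but evaluating the Figure 21 equations at x = 13 gives about 493 (cubic) and 455 (quadratic), close to the measured 504.4, which is consistent with it.

```python
# Mean hardware latency (clock ticks) from Table 6, in taxon-size order
# 10, 16, 32, 57, 64, 75, 100, 128, 150, 175, 200, 225, 256.
latency = [8.4, 8.5, 9.4, 12.0, 14.0, 20.3, 39.6,
           71.6, 110.2, 162.5, 242.6, 342.1, 504.4]
# Assumption: x is the ordinal position of each point (1..13), not N.
positions = list(range(1, len(latency) + 1))


def polyfit(xs, ys, degree):
    """Least-squares polynomial fit; coeffs[k] multiplies x**k."""
    size = degree + 1
    # Normal equations: a * coeffs = b.
    a = [[sum(x ** (i + j) for x in xs) for j in range(size)]
         for i in range(size)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(size)]
    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, size):
            factor = a[row][col] / a[col][col]
            for k in range(col, size):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * size
    for i in reversed(range(size)):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, size))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def evaluate(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


def r_squared(xs, ys, coeffs):
    """Coefficient of determination of the fit on the given points."""
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - evaluate(coeffs, x)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_res / ss_tot


quadratic = polyfit(positions, latency, 2)
cubic = polyfit(positions, latency, 3)
r2_quadratic = r_squared(positions, latency, quadratic)
r2_cubic = r_squared(positions, latency, cubic)
print(f"quadratic R^2 = {r2_quadratic:.4f}")
print(f"cubic     R^2 = {r2_cubic:.4f}")
```

A higher-degree polynomial always fits the same points at least as well as a lower-degree one, so a better R^2 for the cubic is indicative rather than conclusive.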

We now try to measure the quality of the solution by comparing the performance of the reconfigurable custom computing solution against the baseline execution of PHYLIP, the version of the UPGMA software written in C by Felsenstein et al. [10].

6.4 Benchmarking Against PHYLIP

The software timing data is collected by running the PHYLIP UPGMA C code on the same PC on which the WILDCARD™ host program runs. As before, we execute the experiment and collect run-time data across a range of values of N, with different randomized data sets that have been permuted (using half the number of permutations used for the hardware version, for the sake of brevity). We use the same data sets that were used for the UPGMA algorithm running on the WILDCARD™. Table 5 gives the average time taken for the program to run under five different permutations of each of the ten datasets for each taxon size. The corresponding run-time plots for the software version are given in the figures that follow.

[Plot: "PHYLIP C Run-time Performance". Series: C Run-time Performance, N*log(N), N**2. X-axis: Number of Phylogenetic Taxa (10 to 256). Y-axis: Execution Run-time (CPU cycles).]

Figure 22. PHYLIP C run time performance

Figure 22 compares the PHYLIP C run-time performance with the N log N and N^2 curves. These curves were added using trendlines without predictive analysis. From this plot, we observe that the PHYLIP C run-time performance curve is bounded above by the N^2 curve, yet closely follows the plot for N log N as its lower bound.

We believe that our limited number of taxa does not show the true nature of the curve, and thus we would like a tighter bound that more closely matches the algorithm complexity.

The first plot in Figure 23 shows the plots of Figure 22 on a log scale. On this scale N log N matches the C run-time performance quite closely, but we still cannot accurately predict the complexity of the algorithm for larger values of N. The second plot of Figure 23 shows two polynomial trendlines around the C run-time performance curve. We have used forward prediction and obtained the R^2 values to assess the accuracy of the trendlines. We find that the cube polynomial trendline matches much better, with an R^2 value very close to 1. This tells us that the C algorithm provided by Felsenstein [10] seemingly has a complexity of O(N^3).

[Plots: "PHYLIP C Run-time Performance (Log Scale)" with series C Run-time Performance, N*log(N), N**2; and "PHYLIP C Run-time Performance" with the C Run-time Performance series and two polynomial trendlines, Poly. (C Run-time Performance). X-axis: Number of Phylogenetic Taxa. Y-axis: Execution Run-time (CPU cycles). Cubic trendline: y = 1.8845x^3 - 27.682x^2 + 259.92x - 184.88, R^2 = 0.9962. Quadratic trendline: y = 11.892x^2 + 30.012x + 131.71, R^2 = 0.9863.]

Figure 23. PHYLIP C run time performance with Time-complexity bounding

Table 5. PHYLIP Run-time Raw Data Set.

We now compare the performance of the hardware and software implementations. Table 6 gives the average number of clock ticks taken at each taxa size for both implementations. The results show a significant improvement that grows up to 64 taxa, after which the improvement declines as the taxa size increases to the 256 maximum used in the experiments.

Taxa   Hardware (ticks)   Software (ticks)   Improvement (x)
10     8.4                121                14.4
16     8.5                170.1              20
32     9.4                315.3              33.5
57     12                 541.7              45.1
64     14                 713                50.9
75     20.3               816                40.2
100    39.6               942.1              23.8
128    71.6               1107.5             15.5
150    110.2              1278               11.6
175    162.5              1479               9.1
200    242.6              1788.8             7.4
225    342.1              2250.7             6.6
256    504.4              2659.9             5.3

Table 6. Data Comparison Between Hardware and Software UPGMA Implementations.
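The Improvement column of Table 6 is consistent with the ratio of software ticks to hardware ticks. Assuming that is how it was derived, a minimal sketch of the calculation is:

```python
# Average clock ticks per taxon count, copied from Table 6.
table6 = {
    # taxa: (hardware, software)
    10: (8.4, 121.0), 16: (8.5, 170.1), 32: (9.4, 315.3),
    57: (12.0, 541.7), 64: (14.0, 713.0), 75: (20.3, 816.0),
    100: (39.6, 942.1), 128: (71.6, 1107.5), 150: (110.2, 1278.0),
    175: (162.5, 1479.0), 200: (242.6, 1788.8), 225: (342.1, 2250.7),
    256: (504.4, 2659.9),
}


def improvement(hardware_ticks, software_ticks):
    """Speed-up factor: software run time divided by hardware latency."""
    return software_ticks / hardware_ticks


speedups = {taxa: improvement(hw, sw) for taxa, (hw, sw) in table6.items()}
peak_taxa = max(speedups, key=speedups.get)

for taxa, factor in speedups.items():
    print(f"{taxa:4d} taxa: {factor:5.1f}x")
print(f"peak improvement at {peak_taxa} taxa")
```

The peak falls at 64 taxa, the last size before the improvement begins to decline.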

This behavior in the hardware implementation is explained by the fact that the FPGA-based design, for taxa counts of 75 and above, was adversely affected by the memory addressing scheme (discussed in Chapter 4) introduced in the final architecture modifications of the design resident on the WILDCARD™. The negative impact on performance is also attributed to the four-cycle latency of an SRAM memory read. This latency induces wait states in the control structure, causing the design to run slower. The address modification would not have been necessary had the WILDCARD™ system had larger memory banks, so that the original addressing scheme of concatenating nodal values could still be used.

The improvement over the PHYLIP software implementation is as high as 50 times for a 64-taxa data set. Beyond this size, the on-board memory address modification becomes necessary, causing the design performance to deteriorate. This can be seen as the speed-up drops to 40 times for the 75-taxa data set and to far less for larger data sets, up to the maximum. This deterioration is attributed to the limited memory capacity of the WILDCARD™ board; larger memory banks, if available, should help the design scale more effectively. The size of the Virtex XCV300E chip also prevents us from implementing parallel or pipelined design architectures that might help reduce the latency of the memory addressing to a certain extent.

[Plot: Performance improvement (y-axis, 0 to 60) versus Taxon size (x-axis, 10 to 256). The plotted values are those of the Improvement column of Table 6, peaking at 50.9 for 64 taxa and falling to 5.3 for 256 taxa.]

Figure 24. Plotting the Performance Improvement over PHYLIP as Taxa Count Grows.

Figure 24 provides a plotted view of how the algorithm scales as N grows large. With the maximum data set at 256 taxa, we see that the performance deteriorates once we encounter the increased overhead of memory address computation on data sets of more than 64 taxa.

Thus, due to inherent limitations of the WILDCARD™ hardware board on which our design executes, we could not obtain performance improvements that might otherwise be obtained by applying custom logic/custom computing methodologies. A larger board with a larger memory size would allow scaling to larger taxa counts and certainly provide better performance over the software implementation in PHYLIP.

[Plot: "Performance Comparison: CCM vs Phylip". Series: CCM Latency, Phylip Run-time. X-axis: Number of Phylogenetic Taxa (10 to 256). Y-axis: Performance (CPU cycles), 0 to 3000.]

Figure 25. Plotting the Performance Difference as Taxa Count Grows.

However, if we look at the plot of the performance data itself, and compare the two curves, we see that the performance improvement does indeed seem to scale, as the performance curve for the PHYLIP implementation of UPGMA grows at a faster rate than that of the implementation of UPGMA as a custom computing machine architecture.

[Plot: "Performance Comparison: CCM vs Phylip", log scale. Series: CCM Latency, Phylip Run-time. X-axis: Number of Phylogenetic Taxa (10 to 256). Y-axis: Performance (CPU cycles), logarithmic from 1 to 10000.]

Figure 26. Plotting the Performance Difference as Taxa Count Grows (Log Plot).

If we observe the trend on a logarithmic plot, we see that at the peak performance point for the custom computing implementation on the WILDCARD™ (between 64 and 75 taxa), we are operating close to two orders of magnitude faster than the software PHYLIP implementation. Furthermore, we see that this improvement decreases to a single order of magnitude, with an apparently decreasing trend in order-of-magnitude performance difference as we grow to the limit of 256 taxa. This corroborates the earlier plot showing a final 5X difference in performance between the two implementations.
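To put numbers on the order-of-magnitude reading, the base-10 logarithm of the improvement factors in Table 6 can be computed directly. This is a small illustrative calculation, not part of the original analysis:

```python
import math


def orders_of_magnitude(speedup):
    """Base-10 logarithm of a speed-up factor."""
    return math.log10(speedup)


# Improvement factors taken from Table 6.
peak = orders_of_magnitude(50.9)   # 64 taxa
final = orders_of_magnitude(5.3)   # 256 taxa
print(f"peak: {peak:.2f} orders, final: {final:.2f} orders")
```

The peak improvement corresponds to roughly 1.7 orders of magnitude and the final one to roughly 0.7.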

CHAPTER 7

SUMMARY AND CONCLUSIONS

7.1 Summary of Research Contributions

Custom computing systems built using reconfigurable logic devices can provide several orders of magnitude speed-up in the execution performance of algorithms over their execution on conventional microprocessor-based systems. In addition, such systems have the flexibility to program, and reprogram via reconfiguration, the actual logic functions of the VLSI circuit with different applications in time and space. Custom computing systems are implemented using FPGA custom-logic devices that are easily and quickly programmed by an end-user. This research conducted the design and analysis of a custom computing application architecture for the UPGMA Bioinformatics algorithm implemented on an FPGA-based custom-computing platform. We looked at different architectures of the design in order to achieve better resource usage and to conform to the constraints of the hardware resources, most notably memory, on the WILDCARD™. We discussed the final architecture and presented results of the system performance, as measured and compared against that of the UPGMA algorithm written in C, running on a single-processor Pentium® PC.

7.2 Conclusions

The results presented in Chapter 6 provided insight into the performance of both the hardware and software implementations. The hardware results showed little variance across different permutations of a dataset for a given number of taxa. The timing results also converge towards a mean value, showing very little variance across different datasets for a given number of taxa. The hardware results showed significant improvement over the software implementation, with performance peaking at the 64-taxa datasets. For datasets of 75 taxa and above, the performance began to degrade considerably compared to that of the PHYLIP software implementation, although the custom computing implementation was still between half an order and a full order of magnitude faster. The hardware implementation was 50 times faster than the software implementation for the 64-taxa datasets, indicating a reasonable performance improvement given the architectural limitations of memory addressing cited earlier.

We have also shown, using predictive analysis in Excel, that both implementations are bounded by functions with a time complexity of O(N^3). The polynomial equations generated for both the hardware and software performance curves were of the order of N^3, with the large difference between them lying in the coefficients and constants of each time function. The predictive polynomial equation generated for the software performance curve shown in Chapter 6 had large constant and coefficient values compared to those of the polynomial equation for the hardware performance curve. These predictive polynomial equations do not represent actual values, but they do give us an estimate of how both implementations would behave if we had been able to scale beyond the limit of 256 taxa.

The large values of the coefficients indicate that the software implementation has underlying compile-related and operating system overhead that affects its performance. The hardware implementation of the UPGMA algorithm avoids these sources of overhead by implementing the computational units directly in FPGA-based hardware storage and functional units. This provides a considerable speed-up, facilitating a higher-performance solution, as is evidenced by the results obtained.

However, we see that the hardware performance degrades rapidly for datasets of 75 taxa and above. The performance degradation is attributed to the linear addressing scheme used for the final architecture and to the latency of a single read from the WILDCARD™ on-board SRAM memory banks. A read from the memory banks takes 4 clock cycles, which adds wait states within the control structure that negatively impact the performance of the design. A linear addressing scheme was employed to support taxa counts of 75 and above. The WILDCARD™ memory provides a maximum of 65536 words on each bank, and this limitation forced us to convert the address generated by the original addressing scheme into a linear address. The original addressing scheme generates address values greater than 65536 and so limits the number of taxa that can be implemented, even though a dataset of 256 taxa can be stored within the 65536 memory locations. The linear addressing scheme enables the design to handle larger datasets up to the 256-taxa limit, the maximum taxa size defined as a goal of this research. The address modifications necessary for this purpose induce additional states in the control structure, adversely impacting the performance of the design.

The original addressing scheme would require larger memory capabilities than the WILDCARD™ platform offers. We discuss in the sections below how larger memory banks, as well as certain architecture modifications, might improve the performance.
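The memory constraint can be illustrated with a small address calculation. The comparison unit in Appendix A splits a 20-bit address into a 10-bit row field and a 10-bit column field, which is consistent with an address formed by concatenating two node values. The row-major linear address below is a hypothetical stand-in chosen for illustration; the scheme actually used in the design is the one described in Chapter 4.

```python
FIELD_BITS = 10       # width of each node field, as in comparedst (Appendix A)
BANK_WORDS = 65536    # words per WILDCARD memory bank, from the text


def concatenated_address(row, col):
    """Original-style address: row field concatenated with column field."""
    return (row << FIELD_BITS) | col


def linear_address(row, col, num_taxa):
    """Hypothetical row-major address for a num_taxa x num_taxa matrix."""
    return row * num_taxa + col


def max_rows_concatenated():
    """Number of row indices whose concatenated addresses fit in one bank."""
    return BANK_WORDS >> FIELD_BITS


print(max_rows_concatenated())
print(concatenated_address(64, 0) >= BANK_WORDS)
print(linear_address(255, 255, 256) < BANK_WORDS)
```

Under these assumptions only 64 row indices fit in one bank with concatenated addresses, which would be consistent with the 64-taxa boundary reported above, while a dense 256 x 256 layout fits exactly within 65536 words.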

7.3 Future Work

In the earlier sections we looked at certain issues that hampered the performance of the UPGMA design implemented on the WILDCARD™ system. We list these issues below:

• Memory size limitation and memory address schemes

• Latency for the memory read

• Device size (FPGA resources)

We discuss these issues to see how alleviating these bottlenecks might increase performance.

7.3.1 Memory size and Memory address schemes

The WILDCARD™ system provides two memory banks with 65536 words each. The left memory bank was used for storing the distance matrix data and the right memory bank for storing the tree output data. The limit of 65536 words on the left memory bank necessitates address modification, which degrades the design performance. We could overcome this problem in two ways:

• Devising a better addressing scheme

• Moving to platforms with larger memory banks

The first option would give us a solution that could be implemented on the currently available WILDCARD™ board, but it would be very difficult, as the address modification is complex, as described in Chapter 4. The second option is easier and would require us to explore custom computing platforms that offer larger memory capacity. Larger memory would eliminate the need for address modifications and different addressing schemes.

The time taken to write to and read from the memories on the WILDCARD™ board from the host is also a significant factor in measuring the performance of the design. We have seen that writing an entire memory bank takes 20 clock ticks on an 800 MHz Intel Pentium host system, while reading an entire memory bank takes 100 clock ticks. We have also seen that the time taken increases linearly with the number of writes or reads. Therefore, reading the memory banks three times would take the host 300 clock ticks, and writing them three times would take 60 clock ticks. This linear increase would most certainly affect the performance, and we would need to look at other architectures that might provide better performance in terms of reading and writing from the host.
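The linear host-transfer cost described above can be written as a one-line model. The function name is ours, and the model simply assumes the reported per-bank costs scale linearly with the number of transfers:

```python
WRITE_TICKS_PER_BANK = 20    # full-bank write, from the text
READ_TICKS_PER_BANK = 100    # full-bank read, from the text


def host_transfer_ticks(num_writes, num_reads):
    """Total host clock ticks, assuming cost grows linearly with transfers."""
    return num_writes * WRITE_TICKS_PER_BANK + num_reads * READ_TICKS_PER_BANK


print(host_transfer_ticks(3, 0))  # three bank writes
print(host_transfer_ticks(0, 3))  # three bank reads
```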

7.3.2 Latency for a Memory Read

We have seen in the earlier sections that a memory read in the WILDCARD™ system takes 4 cycles. This hurts the performance by slowing the operation of the design. To overcome this, we have to look at other custom computing platforms that offer better read and write cycle latency. This would remove the additional wait states induced in the control structure and speed up the design.

7.3.3 Device size

The WILDCARD™ has a Xilinx Virtex XCV300E chip on it. The Virtex-E chip has a total of 3072 slices. This is small compared to the Virtex II device, which has a total of 33732 slices; it offers much more space to implement larger designs and would also enable us to look at different architectures for the algorithm under consideration, namely parallel or pipelined architectures. We have seen in the literature that, in general, parallel architectures offer a very good performance improvement [22, 23].

The current implementation of the design takes up 60 percent of the WILDCARD™ Virtex-E chip. A parallel implementation would likely have multiple copies of the design components, such as the datapath and control path, running in parallel. These multiple units would work on sub-parts of the distance matrix. This parallel operation would speed up the design to a large extent, but the multiple parallel units would increase the design size, and there would be some penalty in the communication overhead of the interacting subparts of the problem.

Thus, to implement a parallel architecture we would need a larger device or multiple devices to ensure that we do not run out of resources. However, the speed-up that can be obtained is attractive enough that it warrants an exploration of the trade-off between increased resources, parallelism versus communication overhead, and the impact on computation speed. Therefore, future work should investigate different custom computing architectures offering the requisite resources to implement a parallel architecture of the UPGMA algorithm on a custom computing fabric.
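A rough resource estimate follows from the slice counts and utilization quoted above. The calculation below is only an upper bound under a simplifying assumption of ours: that each parallel copy needs the same number of slices as the current design, with no allowance for routing, shared logic, or inter-unit communication.

```python
import math

VIRTEX_E_SLICES = 3072      # XCV300E, from the text
VIRTEX_II_SLICES = 33732    # Virtex II figure quoted in the text
UTILIZATION = 0.60          # share of the XCV300E used by the current design


def copies_that_fit(total_slices, slices_per_copy):
    """Upper bound on parallel copies, ignoring routing and shared logic."""
    return math.floor(total_slices / slices_per_copy)


slices_per_copy = UTILIZATION * VIRTEX_E_SLICES
print(round(slices_per_copy))
print(copies_that_fit(VIRTEX_E_SLICES, slices_per_copy))
print(copies_that_fit(VIRTEX_II_SLICES, slices_per_copy))
```

On these figures the current device holds a single copy of the design, while the larger device could in principle hold up to 18.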

We have looked at the issues that caused problems in implementing the UPGMA algorithm on the WILDCARD™ system and have discussed how we might resolve them. The options suggested are presented as future work that might be of great interest and might enable us to obtain a performance improvement for the UPGMA algorithm that conceivably could alter the upper bound of the time complexity of the algorithm itself.

BIBLIOGRAPHY

[1] Andre DeHon, Wawrzynek, The case of reconfigurable processors. Berkeley Reconfigurable Architectures Systems, and Software. University of California, Berkeley. http://citeseer.nj.nec.com/dehon97case.html.

[2] Nick Tredennick, The case of reconfigurable computing. Micro Design Resources, Microprocessor Report, Vol. 10, No. 10, Aug 1996.

[3] Stephen Brown and Jonathan Rose, Architecture of FPGAs and CPLDs: A Tutorial, IEEE Design and Test of Computers, Vol. 13, No. 2, pp. 42-57, 1996.

[4] John V. Oldfield, Richard C. Dorf, System Implementation Strategies, Chapter 1, Field Programmable Gate Arrays, Reconfigurable logic for Rapid Prototyping and Implementation of Digital Systems, pp. 1-26, Wiley-Interscience Publishing, 1995.

[5] Paul Graham, Brent Nelson, FPGA based Sonar processing. ACM/SIGDA International Symposium for Field Programmable Gate Arrays, pp. 201-208, February 1998. http://www.dynamicsilicon.com/Articles/Reconfigurable.pdf

[6] Jeffrey Arnold, Kenneth L. Pocek, Genetic Algorithms In Software and In Hardware - A Performance Analysis of Workstation and Custom Computing Machine Implementations, Proceedings of IEEE symposium of Field Programmable Custom Computing Machines, pp. 216-225, April 1996. IEEE Computer Society.

[7] Jason R. Hess, David C. Lee, Scott J. Harper, Mark T. Jones, and Peter M. Athanas, Implementation and Evaluation of a Prototype Reconfigurable Router, Proceedings of IEEE symposium of Field Programmable Custom Computing Machines, pp. 44-50, April 1999. IEEE Computer Society.

[8] R. Petersen, B. L. Hutchings, An Assessment of the Suitability of FPGA-Based Systems for Use in Digital Signal Processing, In 5th International Workshop on Field Programmable Logic and Applications, pp. 293-302, August 1995, Oxford, England.

[9] P.W. Dowd, J.T. McHenry, F.A. Pellegrino, T.M. Carrozzi and W.B. Cocks, An FPGA-Based Coprocessor for ATM Firewalls, Proceedings of the IEEE Symposium on FPGA's for Custom Computing Machines (FCCM97), pp. 30-39, April 1997.

[10] Joe Felsenstein, PHYLIP source code, Department of Genome Sciences, University of Washington, http://evolution.genetics.washington.edu/phylip.html

[11] R. Shamir, UPGMA, Tel Aviv University, http://www.math.tau.ac.il/~rshamir/algmb/00/scribe00/html/lec08/node21.html

[12] Peter H. Weston, Michael D. Crisp, Introduction to Phylogenetic Systematics, Invited Contributions of the Society of Australian Systematic Biologists, SASB, http://www.science.uts.edu.au/sasb/WestonCrsip.html.

[13] James P. Davis, Peter J. Waddell, Sreesa Akella, Methods and Architectures for Realizing Fast Phylogenetic Computation Engines Using VLSI Array Based Logic, Submitted to IEEE Bioinformatics Conference, Aug, 2002.

[14] R. Durbin, S. Eddy, A. Krogh, G. Mitchison, Building phylogenetic trees, Chapter 7, Biological Sequence Analysis, pp. 160-190. Cambridge University Press, 1998.

[15] D.L. Swofford, G.J. Olsen, P.J. Waddell, and D.M. Hillis, Phylogenetic Inference, Chapter 11, Molecular Systematics, pp. 45-572, second edition, (ed. D.M. Hillis, and C. Moritz), Sinauer Associates, Sunderland, MA, 1996.

[16] Xilinx Inc, Virtex-E 1.8V FPGA Complete Datasheet, March 2003.

[17] Duncan A. Buell, Jeffrey M. Arnold, Walter J. Kleinfelder, SPLASH2 FPGAs in a Custom Computing Machine, IEEE Computer Society Press, 1996.

[18] Annapolis Microsystems Inc, Annapolis WILDCARD™ System Reference Manual, Revision 2.6, 2003. www.annapmicro.com

[19] SRC Computers Inc., www.srccomputers.com.

[20] StarBridge Systems, www.starbridgesystems.com

[21] Yutana Jawchinda, Hideaki Kobayashi, Quantifying Design Reuse: An HDL-Based Design Experiment, International HDL Conference, April, 1999.

[22] H. J. Whitehouse, J. M. Speiser, K. Bromley, Signal Processing Applications of Concurrent Array Processor Technology, Chapter 2, VLSI and Modern Signal Processing, Prentice-Hall, Inc., 1985.

[23] Axelrod, R., The Complexity of Cooperation: Agent-Based Models of Competition and Cooperation, Princeton University Press, 1997.

[24] Billsus, D., C. A. Brunk, C. Evans, B. Gladish and M. Pazzani, “Adaptive Interfaces for Ubiquitous Web Access”, Communications of the ACM, Vol. 45, No. 5, May 2002, pp. 34-38.

[25] Stanat, D. F. and D. F. McAllister, Discrete Mathematics in Computer Science, Prentice Hall, Inc., 1977.

APPENDIX A

VHDL SOURCE CODE

--------------------------------------------------------------------
-- Add, Subtract decrement modules needed for
-- Address modification
--------------------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : add1.vhd
-- Entity       : add_1, sub_1, dec_1
-- Architecture : add_1_beh, sub_1_beh, dec_1_beh
--------------------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_signed.all;
use ieee.std_logic_arith.all;

entity add_1 is
  port(
    in1 : in  std_logic_vector(9 downto 0);
    in2 : in  std_logic_vector(9 downto 0);
    opt : out std_logic_vector(9 downto 0)
  );
end add_1;

architecture add_1_beh of add_1 is
begin
  opt <= in1 + in2;
end add_1_beh;

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_signed.all;
use ieee.std_logic_arith.all;

entity sub_1 is
  port(
    in1 : in  std_logic_vector(9 downto 0);
    in2 : in  std_logic_vector(9 downto 0);
    opt : out std_logic_vector(15 downto 0)
  );
end sub_1;

architecture sub_1_beh of sub_1 is
begin
  opt <= ("000000" & in1) - ("000000" & in2);
end sub_1_beh;

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_signed.all;
use ieee.std_logic_arith.all;

entity dec_1 is
  port(
    inp : in  std_logic_vector(15 downto 0);
    opt : out std_logic_vector(15 downto 0)
  );
end dec_1;

architecture dec_1_beh of dec_1 is
begin
  opt <= inp - '1';
end dec_1_beh;

--------------------------------------------------------------------
-- Height Adder
--------------------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : adder_h.vhd
-- Entity       : adder_h
-- Architecture : adder_h
--------------------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_signed.all;
use ieee.std_logic_arith.all;

entity adder_h is
  port(
    Datainp1 : in  std_logic_vector(15 downto 0);
    Datainp2 : in  std_logic_vector(15 downto 0);
    Data_out : out std_logic_vector(15 downto 0)
  );
end adder_h;

architecture adderh_beh of adder_h is
begin
  Data_out <= Datainp1 + Datainp2;
end adderh_beh;

--------------------------------------------------------------------
-- Adder module
--------------------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : addernew.vhd
-- Entity       : adder
-- Architecture : adder_beh
--------------------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_signed.all;
use ieee.std_logic_arith.all;

entity adder is
  port(
    Datainp1 : in  std_logic_vector(31 downto 0);
    Datainp2 : in  std_logic_vector(31 downto 0);
    Data_out : out std_logic_vector(31 downto 0)
  );
end adder;

architecture adder_beh of adder is
begin
  Data_out <= Datainp1 + Datainp2;
end adder_beh;

--------------------------------------------------------------------
-- Address register entity architecture
--------------------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : addr_dd.vhd
-- Entity       : addr_dd
-- Architecture : addr_dd_beh
--------------------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;

entity addr_dd is
  port(
    addr      : in  std_logic_vector(19 downto 0);
    reset     : in  std_logic;
    clk       : in  std_logic;
    addr_dd_s : out std_logic_vector(19 downto 0)
  );
end addr_dd;

architecture addr_dd_beh of addr_dd is
  signal temp, temp1 : std_logic_vector(19 downto 0);
begin
  process(clk, reset)
  begin
    if reset = '1' then
      temp      <= (others => '0');
      temp1     <= (others => '0');
      addr_dd_s <= (others => '0');
    elsif clk = '1' and clk'event then
      temp      <= addr;
      temp1     <= temp;
      addr_dd_s <= temp1;
    end if;
  end process;
end addr_dd_beh;

--------------------------------------------------------------------
-- Height Adder Register entity architecture pair
--------------------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : haddregisterh.vhd
-- Entity       : adderhreg
-- Architecture : addhreg_beh
--------------------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity adderhreg is
  port(
    addout : in  std_logic_vector(15 downto 0);
    reset  : in  std_logic;
    clk    : in  std_logic;
    regen  : in  std_logic;
    regclr : in  std_logic;
    regval : out std_logic_vector(15 downto 0)
  );
end adderhreg;

architecture addhreg_beh of adderhreg is
begin
  process(clk, reset)
  begin
    if reset = '1' then
      regval <= (others => '0');
    elsif clk = '1' and clk'event then
      if regclr = '1' then
        regval <= (others => '0');
      elsif regen = '1' then
        regval <= addout;
      end if;
    end if;
  end process;
end addhreg_beh;

--------------------------------------------------------------------
-- Distance comparison unit entity architecture pair
--------------------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : comparedst.vhd
-- Entity       : comparedst
-- Architecture : comparedst_beh
--------------------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;

entity comparedst is
  port(
    Datainp1     : in  std_logic_vector(31 downto 0);
    valid_dst    : in  std_logic;
    distreg_val  : in  std_logic_vector(31 downto 0);
    first_val    : in  std_logic;
    addr         : in  std_logic_vector(19 downto 0);
    distreginp   : out std_logic_vector(31 downto 0);
    distreg_en   : out std_logic;
    rowreginp    : out std_logic_vector(9 downto 0);
    rowreg_en    : out std_logic;
    colreginp    : out std_logic_vector(9 downto 0);
    colreg_en    : out std_logic;
    addr1_reg_en : out std_logic;
    addr2_reg_en : out std_logic
  );
end comparedst;

architecture comparedst_beh of comparedst is
begin
  process(addr, first_val, Datainp1, distreg_val, valid_dst)
  begin
    if first_val = '1' and valid_dst = '1' then
      rowreginp    <= addr(19 downto 10);
      colreginp    <= addr(9 downto 0);
      distreginp   <= Datainp1;
      rowreg_en    <= '1';
      colreg_en    <= '1';
      distreg_en   <= '1';
      addr1_reg_en <= '1';
      addr2_reg_en <= '1';
    else
      if Datainp1 < distreg_val and valid_dst = '1' then
        rowreginp    <= addr(19 downto 10);
        colreginp    <= addr(9 downto 0);
        distreginp   <= Datainp1;
        rowreg_en    <= '1';
        colreg_en    <= '1';
        distreg_en   <= '1';
        addr1_reg_en <= '1';
        addr2_reg_en <= '1';
      else
        rowreginp    <= (others => '0');
        colreginp    <= (others => '0');
        distreginp   <= (others => '0');
        rowreg_en    <= '0';
        colreg_en    <= '0';
        distreg_en   <= '0';
        addr1_reg_en <= '0';
        addr2_reg_en <= '0';
      end if;
    end if;
  end process;
end comparedst_beh;

--------------------------------------------------
-- Controller entity architecture
--------------------------------------------------
-- Author       : Sreesa Akella
-- File         : controllerverfull9.vhd
-- Entity       : ctrl_blk
-- Architecture : ctrl_beh
--------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;

entity ctrl_blk is
  port(clk : in std_logic;
       reset : in std_logic;
       valid_numsp : in std_logic;
       addr_grt : in std_logic;
       child_cnt_gr : in std_logic;
       count_gr : in std_logic;
       all_nodes_done : in std_logic;
       initialized : in std_logic;
       div_valid : in std_logic;
       a_grt : in std_logic;
       ext_node : in std_logic;
       r_clr : out std_logic;
       r_inc : out std_logic;
       a_clr : out std_logic;
       a_inc : out std_logic;
       c2_read : out std_logic;
       mem_update : out std_logic;
       R_dec : out std_logic;
       Rp_dec : out std_logic;
       c1_incr : out std_logic;
       c2_incr : out std_logic;
       c1p_incr : out std_logic;
       ch_incr : out std_logic;
       c1_load1 : out std_logic;
       c1_load2 : out std_logic;
       c2_load1 : out std_logic;
       c2_load2 : out std_logic;
       c1p_load : out std_logic;
       ch_load : out std_logic;
       c1_clr : out std_logic;
       c2_clr : out std_logic;
       c1p_clr : out std_logic;
       ch_clr : out std_logic;
       row_col_sel : out std_logic_vector(1 downto 0);
       addr2_reg_dec : out std_logic;
       node_write : out std_logic;
       read_mem : out std_logic;
       write_mem : out std_logic;
       read_wmem : out std_logic;
       write_wmem : out std_logic;
       rowreg_clr : out std_logic;
       colreg_clr : out std_logic;
       distreg_clr : out std_logic;
       mulreg_en : out std_logic;
       mulreg_clr : out std_logic;
       addregwclr : out std_logic;
       addregwen : out std_logic;
       addregclr : out std_logic;
       addregen : out std_logic;
       divreg1clr : out std_logic;
       divreg1en : out std_logic;
       initial_run : out std_logic;
       store_cur_addr : out std_logic;
       node_mem_initialize : out std_logic;
       mem_initialize : out std_logic;
       addr_gen1_en : out std_logic;
       addr_gen2_en : out std_logic;
       rmem_read : out std_logic;
       rmem_write : out std_logic;
       ad_reg_en : out std_logic;
       ad_reg_clr : out std_logic;
       row_zero : out std_logic;
       numsp_val : out std_logic;
       valid_td : out std_logic;
       nodeid_sel : out std_logic_vector(1 downto 0);
       n_type_sel : out std_logic;
       incnt_inc : out std_logic;
       done : out std_logic);
end ctrl_blk;

architecture ctrl_beh of ctrl_blk is
  type state is (idle, wait_init, rmem_init, rmem_init1, node_mem_init,
                 mem_init, wait_st, rmem_read_st, addr_mod1, addr_mod2,
                 fetch_dst2, compare_dst, c2_inc_ld_st, wait_st1,
                 addr2_gen_st, fetch_dst, wait_st2, add_dst, mul_dst,
                 div_dst, wait_rmem, write_dist_to_mem, write_mem_wait,
                 c2_incr_st, c2_read_st1, c2_read_st2, c2_read_st22,
                 br_update1, br_update2, rmem_read_st2, addr_mod1_st,
                 addr_mod2_st, tree_map_init1, tree_map_init2,
                 tree_map_int, tree_map1, tree_map2, done_st);
  signal cur_st : state;
  signal count_cycle : integer range 0 to 3;
begin
  process(clk, reset)
  begin
    if reset = '1' then
      store_cur_addr <= '0';
      r_clr <= '0';
      r_inc <= '0';
      a_clr <= '0';
      a_inc <= '0';
      rmem_read <= '0';
      rmem_write <= '0';
      ad_reg_en <= '0';
      ad_reg_clr <= '0';
      row_zero <= '0';
      mem_initialize <= '0';
      node_mem_initialize <= '0';
      count_cycle <= 0;
      read_mem <= '0';
      write_mem <= '0';
      read_wmem <= '0';
      write_wmem <= '0';
      addr_gen1_en <= '0';
      addr_gen2_en <= '0';
      c2_read <= '0';
      mem_update <= '0';
      R_dec <= '0';
      Rp_dec <= '0';
      c1_incr <= '0';
      c2_incr <= '0';
      c1p_incr <= '0';
      ch_incr <= '0';
      c1_load1 <= '0';
      c1_load2 <= '0';
      c2_load1 <= '0';
      c2_load2 <= '0';
      c1p_load <= '0';
      ch_load <= '0';
      c1_clr <= '0';
      c2_clr <= '0';
      c1p_clr <= '0';
      ch_clr <= '0';
      row_col_sel <= "00";
      addr2_reg_dec <= '0';
      node_write <= '0';
      mulreg_en <= '0';
      mulreg_clr <= '0';
      addregwen <= '0';
      addregwclr <= '0';
      addregen <= '0';
      addregclr <= '0';
      distreg_clr <= '0';
      rowreg_clr <= '0';
      colreg_clr <= '0';
      divreg1clr <= '0';
      divreg1en <= '0';
      numsp_val <= '0';
      valid_td <= '0';
      incnt_inc <= '0';
      initial_run <= '0';
      nodeid_sel <= "00";
      n_type_sel <= '0';
      done <= '0';
      cur_st <= idle;
    elsif clk = '1' and clk'event then
      case cur_st is
        when idle =>
          done <= '0';
          node_mem_initialize <= '0';
          if valid_numsp = '1' then
            cur_st <= wait_init;
            c1p_incr <= '1';
            R_dec <= '1';
            Rp_dec <= '1';
            row_zero <= '1';
            rmem_write <= '1';
          else
            cur_st <= idle;
            c1p_incr <= '0';
            row_zero <= '0';
            rmem_write <= '0';
          end if;
        when wait_init =>
          R_dec <= '0';
          c1p_incr <= '0';
          row_zero <= '0';
          rmem_write <= '0';
          a_inc <= '1';
          cur_st <= rmem_init;
        when rmem_init =>
          Rp_dec <= '0';
          a_inc <= '1';
          rmem_write <= '1';
          if ext_node = '1' then
            cur_st <= rmem_init1;
            row_zero <= '1';
            ad_reg_en <= '0';
            r_inc <= '1';
          else
            row_zero <= '0';
            ad_reg_en <= '1';
            r_inc <= '0';
            cur_st <= rmem_init;
          end if;
        when rmem_init1 =>
          row_zero <= '0';
          if a_grt = '1' then
            rmem_write <= '0';
            ad_reg_en <= '0';
            ad_reg_clr <= '1';
            a_inc <= '0';
            r_inc <= '0';
            a_clr <= '1';
            r_clr <= '1';
            write_wmem <= '1';
            cur_st <= node_mem_init;
          else
            rmem_write <= '1';
            ad_reg_en <= '1';
            ad_reg_clr <= '0';
            a_inc <= '1';
            a_clr <= '0';
            r_clr <= '0';
            r_inc <= '1';
            write_wmem <= '0';
            cur_st <= rmem_init1;
          end if;
        when node_mem_init =>
          Rp_dec <= '0';
          c2_incr <= '0';
          c1p_incr <= '0';
          if initialized = '1' then
            node_mem_initialize <= '0';
            write_wmem <= '0';
            cur_st <= mem_init;
            mem_initialize <= '1';
            c1_incr <= '0';
          else
            node_mem_initialize <= '1';
            write_wmem <= '1';
            cur_st <= node_mem_init;
            mem_initialize <= '0';
            c1_incr <= '1';
          end if;
        when mem_init =>
          Rp_dec <= '0';
          mem_initialize <= '0';
          read_mem <= '1';
          node_write <= '0';
          c1_incr <= '0';
          c1p_incr <= '0';
          c2_incr <= '1';
          c2_clr <= '0';
          r_dec <= '0';
          rp_dec <= '0';
          cur_st <= wait_st;
        when wait_st =>
          read_mem <= '1';
          store_cur_addr <= '1';
          initial_run <= '1';
          addr_gen1_en <= '1';
          c2_incr <= '0';
          cur_st <= rmem_read_st;
        when rmem_read_st =>
          rmem_read <= '1';
          store_cur_addr <= '0';
          c2_incr <= '0';
          c2_load1 <= '0';
          c2_load2 <= '0';
          addr_gen1_en <= '0';
          cur_st <= addr_mod1;
        when addr_mod1 =>
          rmem_read <= '0';
          c2_incr <= '0';
          c2_load1 <= '0';
          c2_load2 <= '0';
          cur_st <= addr_mod2;
        when addr_mod2 =>
          cur_st <= fetch_dst2;
        when fetch_dst2 =>
          read_mem <= '1';
          node_write <= '0';
          if count_gr = '1' then
            c2_load1 <= '1';
            c2_load2 <= '1';
            c2_incr <= '0';
          else
            c2_load1 <= '0';
            c2_load2 <= '0';
            c2_incr <= '1';
          end if;
          cur_st <= compare_dst;
        when compare_dst =>
          read_mem <= '1';
          c1_load1 <= '0';
          c1_load2 <= '0';
          c2_load1 <= '0';
          c2_load2 <= '0';
          c2_incr <= '0';
          initial_run <= '0';
          if (addr_grt = '1') then
            cur_st <= wait_st1;
            addr_gen1_en <= '0';
            addr_gen2_en <= '0';
            store_cur_addr <= '0';
          else
            cur_st <= rmem_read_st;
            addr_gen1_en <= '1';
            addr_gen2_en <= '0';
            store_cur_addr <= '1';
          end if;
        when wait_st1 =>
          --wait for four clock cycles so that all data is operated upon
          if count_cycle < 2 then
            cur_st <= wait_st1;
            count_cycle <= count_cycle + 1;
            c1_load1 <= '0';
            c2_load1 <= '0';
            c1_clr <= '1';
            c2_clr <= '1';
            c1p_clr <= '1';
          else
            cur_st <= tree_map_init1;
            count_cycle <= 0;
            c1_load1 <= '1';
            c2_load1 <= '1';
            c1_clr <= '0';
            c2_clr <= '0';
          end if;
        when br_update1 =>
          mulreg_clr <= '0';
          mulreg_en <= '0';
          addregwen <= '0';
          addregwclr <= '0';
          addregen <= '0';
          addregclr <= '0';
          distreg_clr <= '0';
          rowreg_clr <= '0';
          colreg_clr <= '0';
          divreg1clr <= '0';
          incnt_inc <= '0';
          valid_td <= '0';
          c2_read <= '0';
          c1_incr <= '0';
          if count_gr = '1' then
            cur_st <= c2_incr_st;
            c1_load2 <= '1';
            c2_load2 <= '1';
            mem_update <= '1';
            c2_incr <= '0';
          else
            cur_st <= c2_read_st1;
            c1_load2 <= '0';
            c2_load2 <= '0';
            mem_update <= '1';
            c2_incr <= '1';
          end if;
        when c2_read_st1 =>
          c2_incr <= '0';
          c1_incr <= '1';
          c2_read <= '1';
          mem_update <= '0';
          cur_st <= br_update1;
        when c2_incr_st =>
          mem_update <= '0';
          c1_load2 <= '0';
          c2_load2 <= '0';
          c2_incr <= '1';
          cur_st <= c2_read_st2;
        when c2_read_st2 =>
          c2_incr <= '0';
          c1_incr <= '0';
          c2_read <= '1';
          mem_update <= '0';
          cur_st <= br_update2;
        when br_update2 =>
          c2_read <= '0';
          c1_incr <= '0';
          if count_gr = '1' then
            cur_st <= addr2_gen_st;
            addr_gen2_en <= '0';
            mem_update <= '0';
            c2_incr <= '0';
            c2_clr <= '1';
            c1_clr <= '1';
            row_col_sel <= "01";
          else
            cur_st <= c2_read_st22;
            addr_gen2_en <= '0';
            c2_incr <= '1';
            mem_update <= '1';
            row_col_sel <= "00";
          end if;
        when c2_read_st22 =>
          c2_incr <= '0';
          c1_incr <= '1';
          c2_read <= '1';
          mem_update <= '0';
          cur_st <= br_update2;
        when addr2_gen_st =>
          addr_gen2_en <= '1';
          c2_clr <= '0';
          c1_clr <= '0';
          c2_incr <= '0';
          mem_update <= '0';
          cur_st <= rmem_read_st2;
        when rmem_read_st2 =>
          rmem_read <= '1';
          addregwen <= '0';
          addregwclr <= '0';
          addregen <= '0';
          addregclr <= '0';
          distreg_clr <= '0';
          rowreg_clr <= '0';
          colreg_clr <= '0';
          divreg1clr <= '0';
          ch_incr <= '0';
          c2_incr <= '0';
          write_mem <= '0';
          write_wmem <= '0';
          mem_update <= '0';
          cur_st <= addr_mod1_st;
        when addr_mod1_st =>
          rmem_read <= '0';
          cur_st <= addr_mod2_st;
        when addr_mod2_st =>
          cur_st <= fetch_dst;
        when fetch_dst =>
          addregwen <= '0';
          addregwclr <= '0';
          addregen <= '0';
          addregclr <= '0';
          distreg_clr <= '0';
          rowreg_clr <= '0';
          colreg_clr <= '0';
          divreg1clr <= '0';
          read_mem <= '1';
          c2_incr <= '0';
          write_mem <= '0';
          write_wmem <= '0';
          mem_update <= '0';
          read_wmem <= '1';
          addr_gen2_en <= '1';
          rmem_read <= '0';
          cur_st <= wait_st2;
          ch_incr <= '0';
        when wait_st2 =>
          --wait for four clock cycles for the data to arrive
          read_mem <= '0';
          write_mem <= '0';
          read_wmem <= '0';
          write_wmem <= '0';
          addr_gen1_en <= '0';
          if count_cycle < 2 then
            cur_st <= wait_st2;
            count_cycle <= count_cycle + 1;
          else
            cur_st <= mul_dst;
            count_cycle <= 0;
          end if;
          ch_incr <= '0';
        when mul_dst =>
          mulreg_en <= '1';
          valid_td <= '0';
          incnt_inc <= '0';
          done <= '0';
          addr_gen2_en <= '0';
          ch_incr <= '0';
          -- Assumed transition: the listing as extracted assigns no next
          -- state here, which would leave add_dst unreachable.
          cur_st <= add_dst;
        when add_dst =>
          read_mem <= '0';
          write_mem <= '0';
          read_wmem <= '0';
          write_wmem <= '0';
          mulreg_en <= '0';
          addregwen <= '1';
          addregwclr <= '0';
          addregen <= '1';
          addregclr <= '0';
          distreg_clr <= '0';
          rowreg_clr <= '0';
          colreg_clr <= '0';
          divreg1clr <= '0';
          divreg1en <= '0';
          numsp_val <= '0';
          valid_td <= '0';
          done <= '0';
          incnt_inc <= '0';
          if child_cnt_gr = '0' then
            cur_st <= rmem_read_st2;
            row_col_sel <= "10";
            ch_incr <= '1';
          else
            cur_st <= div_dst;
            row_col_sel <= "11";
            ch_incr <= '1';
          end if;
          addr_gen1_en <= '0';
          addr_gen2_en <= '1';
        when div_dst =>
          ch_incr <= '0';
          read_mem <= '0';
          write_mem <= '0';
          read_wmem <= '0';
          write_wmem <= '0';
          addregwen <= '0';
          addregwclr <= '0';
          addregen <= '0';
          addregclr <= '0';
          distreg_clr <= '0';
          rowreg_clr <= '0';
          colreg_clr <= '0';
          divreg1clr <= '0';
          numsp_val <= '0';
          valid_td <= '0';
          done <= '0';
          incnt_inc <= '0';
          if div_valid = '1' then
            cur_st <= wait_rmem;
            divreg1en <= '1';
            rmem_read <= '1';
          else
            cur_st <= div_dst;
            divreg1en <= '0';
            rmem_read <= '0';
          end if;
        when wait_rmem =>
          rmem_read <= '0';
          read_mem <= '0';
          read_wmem <= '0';
          divreg1en <= '0';
          c2_incr <= '1';
          cur_st <= write_dist_to_mem;
        when write_dist_to_mem =>
          rmem_read <= '0';
          read_mem <= '0';
          write_mem <= '1';
          read_wmem <= '0';
          write_wmem <= '1';
          addregwen <= '0';
          addregwclr <= '1';
          addregen <= '0';
          addregclr <= '1';
          mulreg_en <= '0';
          mulreg_clr <= '1';
          distreg_clr <= '0';
          rowreg_clr <= '0';
          colreg_clr <= '0';
          divreg1en <= '0';
          divreg1clr <= '0';
          numsp_val <= '0';
          valid_td <= '0';
          done <= '0';
          incnt_inc <= '0';
          c2_incr <= '0';
          cur_st <= write_mem_wait;
        when write_mem_wait =>
          write_wmem <= '0';
          write_mem <= '0';
          addregwclr <= '0';
          addregclr <= '0';
          mulreg_clr <= '0';
          c2_incr <= '0';
          if count_gr = '1' then
            cur_st <= mem_init;
            addr_gen2_en <= '0';
            node_write <= '1';
            c1p_incr <= '1';
            c1p_clr <= '0';
            c2_clr <= '1';
            r_dec <= '1';
            rp_dec <= '1';
            row_col_sel <= "00";
          else
            addr_gen2_en <= '1';
            cur_st <= rmem_read_st2;
            node_write <= '0';
            c2_clr <= '0';
            c1p_incr <= '0';
            r_dec <= '0';
            rp_dec <= '0';
            row_col_sel <= "01";
          end if;
        when tree_map_init1 =>
          read_mem <= '0';
          initial_run <= '0';
          addr2_reg_dec <= '1';
          write_mem <= '0';
          read_wmem <= '0';
          write_wmem <= '0';
          mulreg_en <= '0';
          mulreg_clr <= '1';
          addregwen <= '0';
          addregwclr <= '1';
          addregen <= '0';
          addregclr <= '1';
          distreg_clr <= '0';
          rowreg_clr <= '0';
          colreg_clr <= '0';
          divreg1en <= '0';
          divreg1clr <= '1';
          numsp_val <= '0';
          nodeid_sel <= "00";
          incnt_inc <= '0';
          n_type_sel <= '0';
          cur_st <= tree_map_init2;
          valid_td <= '1';
          c1_load1 <= '0';
          c2_load1 <= '0';
          c2_incr <= '1';
          done <= '0';
        when tree_map_init2 =>
          c2_incr <= '0';
          addr2_reg_dec <= '0';
          read_mem <= '0';
          write_mem <= '0';
          initial_run <= '0';
          read_wmem <= '0';
          write_wmem <= '0';
          addregwen <= '0';
          addregwclr <= '1';
          addregen <= '0';
          addregclr <= '1';
          distreg_clr <= '0';
          rowreg_clr <= '0';
          colreg_clr <= '0';
          divreg1en <= '0';
          divreg1clr <= '1';
          numsp_val <= '0';
          nodeid_sel <= "01";
          n_type_sel <= '0';
          valid_td <= '1';
          done <= '0';
          if all_nodes_done = '1' then
            cur_st <= tree_map2;
            incnt_inc <= '0';
            c2_read <= '0';
          else
            incnt_inc <= '1';
            cur_st <= br_update1;
            c2_read <= '1';
          end if;
        when tree_map2 =>
          read_mem <= '0';
          initial_run <= '0';
          write_mem <= '0';
          read_wmem <= '0';
          write_wmem <= '0';
          addregwen <= '0';
          addregwclr <= '1';
          addregen <= '0';
          addregclr <= '1';
          divreg1clr <= '1';
          divreg1en <= '0';
          distreg_clr <= '1';
          rowreg_clr <= '1';
          colreg_clr <= '1';
          numsp_val <= '0';
          nodeid_sel <= "10";
          n_type_sel <= '1';
          valid_td <= '1';
          done <= '0';
          incnt_inc <= '0';
          cur_st <= done_st;
        when others =>
          read_mem <= '0';
          initial_run <= '0';
          write_mem <= '0';
          read_wmem <= '0';
          write_wmem <= '0';
          addregwen <= '0';
          addregwclr <= '0';
          addregclr <= '0';
          divreg1clr <= '0';
          divreg1en <= '0';
          numsp_val <= '0';
          valid_td <= '0';
          incnt_inc <= '0';
          done <= '1';
          cur_st <= idle;
      end case;
    end if;
  end process;
end ctrl_beh;

--------------------------------------------------
-- Counter Entity Architecture pairs
--------------------------------------------------
-- Author       : Sreesa Akella
-- File         : counter.vhd
-- Entity       : counter
-- Architecture : counter_beh
--------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity counter is
  port(Clk : in std_logic;
       Res : in std_logic;
       ld : in std_logic;
       clr : in std_logic;
       inp : in std_logic_vector(9 downto 0);
       cv : in std_logic_vector(9 downto 0);
       inc : in std_logic;
       cnt : out std_logic_vector(9 downto 0);
       grt : out std_logic);
end counter;

architecture counter_beh of counter is
  signal count : std_logic_vector(9 downto 0);
begin
  process(clk, res)
  begin
    if res = '1' then
      count <= (others => '0');
    elsif clk = '1' and clk'event then
      if clr = '1' then
        count <= (others => '0');
      elsif ld = '1' then
        count <= inp;
      elsif inc = '1' then
        if count < cv then
          count <= unsigned(count) + '1';
        else
          count <= (others => '0');
        end if;
      end if;
    end if;
  end process;

  process(res, count, cv)
  begin
    if res = '1' then
      grt <= '0';
    elsif count < cv then
      grt <= '0';
    else
      grt <= '1';
    end if;
  end process;

  cnt <= count;
end counter_beh;
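
The counter's wrap-around and its grt flag can be checked in simulation. The following is a minimal testbench sketch, not part of the original design: the names counter_tb, sim, and stop are hypothetical, and it assumes that counter has been compiled into the work library. With cv set to 3, the count is expected to reach 3 (raising grt) and then wrap to 0 on the next increment.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity counter_tb is
end counter_tb;

architecture sim of counter_tb is
  signal clk  : std_logic := '0';
  signal res  : std_logic := '1';
  signal ld   : std_logic := '0';
  signal clr  : std_logic := '0';
  signal inc  : std_logic := '0';
  signal inp  : std_logic_vector(9 downto 0) := (others => '0');
  signal cv   : std_logic_vector(9 downto 0) := "0000000011";  -- compare value 3
  signal cnt  : std_logic_vector(9 downto 0);
  signal grt  : std_logic;
  signal stop : boolean := false;  -- hypothetical flag that halts the clock
begin
  dut : entity work.counter
    port map (Clk => clk, Res => res, ld => ld, clr => clr,
              inp => inp, cv => cv, inc => inc, cnt => cnt, grt => grt);

  -- 10 ns clock; rising edges at 5 ns, 15 ns, 25 ns, ...
  clk <= not clk after 5 ns when not stop else '0';

  stimulus : process
  begin
    wait for 12 ns;                -- keep reset asserted over the first edge
    res <= '0';
    inc <= '1';
    wait until rising_edge(clk);   -- count 0 -> 1
    wait until rising_edge(clk);   -- count 1 -> 2
    wait until rising_edge(clk);   -- count 2 -> 3
    wait for 1 ns;
    assert cnt = "0000000011" and grt = '1'
      report "count should equal cv with grt high" severity error;
    wait until rising_edge(clk);   -- count is no longer below cv, so it wraps
    wait for 1 ns;
    assert cnt = "0000000000" and grt = '0'
      report "count should wrap to zero with grt low" severity error;
    stop <= true;
    wait;
  end process;
end sim;
```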

--------------------------------------------------
-- child node counter
--------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity counterch is
  port(Clk : in std_logic;
       Res : in std_logic;
       ld : in std_logic;
       clr : in std_logic;
       inp : in std_logic_vector(1 downto 0);
       inc : in std_logic;
       cnt : out std_logic_vector(1 downto 0);
       grt : out std_logic);
end counterch;

architecture counterch_beh of counterch is
  constant cv : std_logic_vector(1 downto 0) := "01";
  signal count : std_logic_vector(1 downto 0);
begin
  process(clk, res)
  begin
    if res = '1' then
      count <= (others => '0');
    elsif clk = '1' and clk'event then
      if clr = '1' then
        count <= (others => '0');
      elsif ld = '1' then
        count <= inp;
      elsif inc = '1' then
        if count < cv then
          count <= unsigned(count) + '1';
        else
          count <= (others => '0');
        end if;
      end if;
    end if;
  end process;

  process(res, count)
  begin
    if res = '1' then
      grt <= '0';
    elsif count < cv then
      grt <= '0';
    else
      grt <= '1';
    end if;
  end process;

  cnt <= count;
end counterch_beh;

--------------------------------------------------
-- Divider Entity - Architecture Pair
--------------------------------------------------
-- Author       : Sreesa Akella
-- File         : divider32new1.vhd
-- Entity       : divider
-- Architecture : divider_beh
--------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity divider is
  port(datainp1 : in std_logic_vector(31 downto 0);
       divider : in std_logic_vector(15 downto 0);
       output : out std_logic_vector(31 downto 0);
       valid : out std_logic);
end divider;

architecture divider_beh of divider is
  procedure divide_proc (variable a, b : in unsigned (31 downto 0);
                         variable r : out unsigned (63 downto 0);
                         variable ov : out std_logic) is
    variable temp1 : unsigned (63 downto 0);
    variable temp2, temp3 : unsigned (32 downto 0);
    constant C0 : unsigned := "00000000000000000000000000000000"; -- constant zero
    constant C1 : unsigned := "00000000000000000000000000000001"; -- constant one
  begin
    if (b = C0) then
      r(31 downto 0) := C0;
      r(63 downto 32) := C0;
      ov := '1';
    elsif (a = b) then
      r(31 downto 0) := C1;
      r(63 downto 32) := C0;
      ov := '0';
    elsif (a < b) then
      r(31 downto 0) := C0;
      r(63 downto 32) := a;
      ov := '0';
    else
      temp1(31 downto 0) := a;
      temp1(63 downto 32) := C0;
      temp3 := "0" & b;
      for i in 0 to 31 loop
        temp1(63 downto 1) := temp1(62 downto 0);
        temp1(0) := '0';
        temp2 := "1" & temp1(63 downto 32);
        temp2 := temp2 - temp3;
        if (temp2(32) = '1') then
          temp1(0) := '1';
          temp1(63 downto 32) := temp2(31 downto 0);
        end if;
      end loop; -- i
      r := temp1;
      ov := '0';
    end if;
  end divide_proc;
begin
  process(datainp1, divider)
    variable a_inp, b_inp : unsigned(31 downto 0);
    variable r_sig : unsigned(63 downto 0);
    variable ov_sig : std_logic;
  begin
    a_inp := unsigned(datainp1);
    b_inp := unsigned("0000000000000000" & divider);
    divide_proc(a_inp, b_inp, r_sig, ov_sig);
    output <= std_logic_vector(r_sig(31 downto 0));
    valid <= not ov_sig;
  end process;
end divider_beh;
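
The divider is purely combinational: the loop in divide_proc performs a 32-step shift-and-subtract (restoring) division, leaving the quotient in the low 32 bits of r and the remainder in the high 32 bits. A small testbench sketch can exercise each branch of the procedure. The names divider_tb and sim are hypothetical, the stimulus values are illustrative only, and the sketch assumes that divider has been compiled into the work library.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity divider_tb is
end divider_tb;

architecture sim of divider_tb is
  signal dividend : std_logic_vector(31 downto 0) := (others => '0');
  signal divisor  : std_logic_vector(15 downto 0) := (others => '0');
  signal quotient : std_logic_vector(31 downto 0);
  signal valid    : std_logic;
begin
  dut : entity work.divider
    port map (datainp1 => dividend, divider => divisor,
              output => quotient, valid => valid);

  stimulus : process
  begin
    -- General case, handled by the shift-and-subtract loop: 100 / 7 = 14
    dividend <= x"00000064";
    divisor  <= x"0007";
    wait for 10 ns;
    assert valid = '1' and quotient = x"0000000E"
      report "100 / 7 should give 14" severity error;

    -- Equal operands take the (a = b) branch: 9 / 9 = 1
    dividend <= x"00000009";
    divisor  <= x"0009";
    wait for 10 ns;
    assert valid = '1' and quotient = x"00000001"
      report "9 / 9 should give 1" severity error;

    -- Dividend smaller than divisor takes the (a < b) branch: 5 / 9 = 0
    dividend <= x"00000005";
    divisor  <= x"0009";
    wait for 10 ns;
    assert valid = '1' and quotient = x"00000000"
      report "5 / 9 should give 0" severity error;

    -- Division by zero sets ov, so valid is expected to go low
    dividend <= x"00000064";
    divisor  <= x"0000";
    wait for 10 ns;
    assert valid = '0'
      report "division by zero should clear valid" severity error;

    wait;
  end process;
end sim;
```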

--------------------------------------------------
-- Register for first_val signal
--------------------------------------------------
-- Author       : Sreesa Akella
-- File         : first_val_reg.vhd
-- Entity       : first_val_ddd
-- Architecture : first_val_ddd_beh
--------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;

entity first_val_ddd is
  port(first_val : in std_logic;
       reset : in std_logic;
       clk : in std_logic;
       first_val_ddd_s : out std_logic);
end first_val_ddd;

architecture first_val_ddd_beh of first_val_ddd is
  signal temp : std_logic;
  signal temp1 : std_logic;
begin
  process(clk, reset)
  begin
    if reset = '1' then
      temp <= '0';
      temp1 <= '0';
      first_val_ddd_s <= '0';
    elsif (Rising_Edge( clk )) then
      temp <= first_val;
      temp1 <= temp;
      first_val_ddd_s <= temp1;
    end if;
  end process;
end first_val_ddd_beh;

library ieee;
use ieee.std_logic_1164.all;

entity d_f is
  port(inp : in std_logic;
       reset : in std_logic;
       clk : in std_logic;
       opt : out std_logic);
end d_f;

architecture d_f_beh of d_f is
begin
  process(Clk, Reset)
  begin
    if Reset = '1' then
      opt <= '0';
    elsif (Rising_Edge( clk )) then
      opt <= inp;
    end if;
  end process;
end d_f_beh;

--------------------------------------------------
-- Height Adder Register entity architecture pair
--------------------------------------------------
-- Author       : Sreesa Akella
-- File         : addregisterh.vhd
-- Entity       : adderhreg
-- Architecture : addhreg_beh
--------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity adderhreg is
  port(addout : in std_logic_vector(15 downto 0);
       reset : in std_logic;
       clk : in std_logic;
       regen : in std_logic;
       regclr : in std_logic;
       regval : out std_logic_vector(15 downto 0));
end adderhreg;

architecture addhreg_beh of adderhreg is
begin
  process(clk, reset)
  begin
    if reset = '1' then
      regval <= (others => '0');
    elsif clk = '1' and clk'event then
      if regclr = '1' then
        regval <= (others => '0');
      elsif regen = '1' then
        regval <= addout;
      end if;
    end if;
  end process;
end addhreg_beh;

--------------------------------------------------
-- Height Memory entity - architecture
--------------------------------------------------
-- Author       : Sreesa Akella
-- File         : hmemory.vhd
-- Entity       : hmemory
-- Architecture : hmem_behave
--------------------------------------------------
library ieee, xilinx_lib;
use ieee.std_logic_1164.all;
use xilinx_lib.VIRTEX.all;

entity hmemory is
  port(Clk : in std_logic;
       Reset : in std_logic;
       Read : in std_logic;
       Write : in std_logic;
       numsp : in std_logic_vector(9 downto 0);
       Addr : in std_logic_vector(9 downto 0);
       Data : in std_logic_vector(15 downto 0);
       Data_out : out std_logic_vector(15 downto 0));
end hmemory;

architecture hmem_behave of hmemory is

signal memAddr : integer;
signal Write_En0 : std_logic;
signal Write_En1 : std_logic;
signal Write_En2 : std_logic;
signal Write_En3 : std_logic;
signal DataOut0 : std_logic_vector(15 downto 0 );
signal DataOut1 : std_logic_vector(15 downto 0 );
signal DataOut2 : std_logic_vector(15 downto 0 );
signal DataOut3 : std_logic_vector(15 downto 0 );
signal enable : std_logic;

begin

--**************************************
-- Write logic
--

Write_En0 <= Write when (Addr(9 downto 8) = "00") else '0';

Write_En1 <= Write when (Addr(9 downto 8) = "01") else '0';

Write_En2 <= Write when (Addr(9 downto 8) = "10") else '0';

Write_En3 <= Write when (Addr(9 downto 8) = "11") else '0';

--**************************************
-- Enable logic
--
-- pretty simple or of Read or Write

enable <= Read or Write;

--**************************************
-- Read Logic
--
-- This can be combinational

Data_out <= DataOut0 when Addr(9 downto 8 ) = "00" else
            DataOut1 when Addr(9 downto 8 ) = "01" else
            DataOut2 when Addr(9 downto 8 ) = "10" else
            DataOut3 when Addr(9 downto 8 ) = "11" else
            ( others => '0' );

--**************************************
-- Instantiate 4 256x16 BlockRAMS
--

U_BR0 : RAMB4_S16 port map ( DI => Data, ADDR => Addr(7 downto 0), CLK => Clk, RST => Reset, WE => Write_En0, EN => enable, DO => DataOut0 );

U_BR1 : RAMB4_S16 port map ( DI => Data, ADDR => Addr(7 downto 0), CLK => Clk, RST => Reset, WE => Write_En1, EN => enable, DO => DataOut1 );

U_BR2 : RAMB4_S16 port map ( DI => Data, ADDR => Addr(7 downto 0), CLK => Clk, RST => Reset, WE => Write_En2, EN => enable, DO => DataOut2 );

U_BR3 : RAMB4_S16 port map ( DI => Data, ADDR => Addr(7 downto 0), CLK => Clk, RST => Reset, WE => Write_En3, EN => enable, DO => DataOut3 );

end hmem_behave;

--------------------------------------------------
-- Multiplier entity-architecture
--------------------------------------------------
-- Author       : Sreesa Akella
-- File         : mult.vhd
-- Entity       : mult
-- Architecture : mult_beh
--------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use work.std_logic_prims.all;

entity mult is
  port(Datainp1 : in std_logic_vector(31 downto 0);
       Datainp2 : in std_logic_vector(15 downto 0);
       Data_out : out std_logic_vector(31 downto 0));
end mult;

architecture mult_beh of mult is
begin
  process(Datainp1, Datainp2)
    variable inp1, inp2, outp : integer range 0 to 1000;
  begin
    inp1 := std_logic_vector_to_integer(Datainp1);
    inp2 := std_logic_vector_to_integer(Datainp2);
    outp := inp1 * inp2;
    Data_out <= integer_to_std_logic_vector(outp, 31);
  end process;
end mult_beh;
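
The multiplier listing relies on the project-local package work.std_logic_prims and constrains its integer variables to the range 0 to 1000, so a simulator would be expected to flag a range violation for any operand or product above 1000. As a hedged alternative, the sketch below uses the standard ieee.numeric_std package instead. It is not the design used in this work: the entity name mult_ns, the architecture name mult_ns_beh, and the signal product are hypothetical, while the ports mirror those of mult.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity mult_ns is
  port(Datainp1 : in  std_logic_vector(31 downto 0);
       Datainp2 : in  std_logic_vector(15 downto 0);
       Data_out : out std_logic_vector(31 downto 0));
end mult_ns;

architecture mult_ns_beh of mult_ns is
  -- A 32-bit by 16-bit unsigned multiply yields a 48-bit result
  signal product : unsigned(47 downto 0);
begin
  product <= unsigned(Datainp1) * unsigned(Datainp2);
  -- Keep the low 32 bits; the upper 16 bits are discarded
  Data_out <= std_logic_vector(product(31 downto 0));
end mult_ns_beh;
```

Truncating to 32 bits matches the width of the original Data_out port, so this variant only agrees with the full product when the result fits in 32 bits.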

--------------------------------------------------
-- Multiplexer entity architecture pairs
--------------------------------------------------
-- Author       : Sreesa Akella
-- File         : mux.vhd
-- Entity       : mux2_1, mux3_1, mux2_1_16
-- Architecture : mux2_1_behave, mux3_1_behave,
--                mux_2_1_16_behave
--------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;

entity mux2_1 is
  port(inp1 : in std_logic_vector(9 downto 0);
       inp2 : in std_logic_vector(9 downto 0);
       sel : in std_logic;
       outp : out std_logic_vector(9 downto 0));
end mux2_1;

architecture mux2_1_behave of mux2_1 is
begin
  outp <= inp1 when sel = '0' else inp2;
end mux2_1_behave;

library ieee;
use ieee.std_logic_1164.all;

entity mux2_1_16 is
  port(inp1 : in std_logic_vector(15 downto 0);
       inp2 : in std_logic_vector(15 downto 0);
       sel : in std_logic;
       outp : out std_logic_vector(15 downto 0));
end mux2_1_16;

architecture mux2_1_16behave of mux2_1_16 is
begin
  outp <= inp1 when sel = '0' else inp2;
end mux2_1_16behave;

library ieee;
use ieee.std_logic_1164.all;

entity mux3_1 is
  port(inp1 : in std_logic_vector(9 downto 0);
       inp2 : in std_logic_vector(9 downto 0);
       inp3 : in std_logic_vector(9 downto 0);
       sel : in std_logic_vector(1 downto 0);
       outp : out std_logic_vector(9 downto 0));
end mux3_1;

architecture mux3_1_behave of mux3_1 is
begin
  outp <= inp1 when sel = "01" else
          inp2 when sel = "10" else
          inp3;
end mux3_1_behave;

--------------------------------------------------
-- Register that stores number of species value
--------------------------------------------------
-- Author       : Sreesa Akella
-- File         : numspreg.vhd
-- Entity       : numofspreg
-- Architecture : numofspreg_beh
--------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity numofspreg is
  port(numsp : in std_logic_vector(9 downto 0);
       reset : in std_logic;
       clk : in std_logic;
       valid_in : in std_logic;
       valid : out std_logic;
       regval : out std_logic_vector(9 downto 0));
end numofspreg;

architecture numofspreg_beh of numofspreg is
begin
  process(clk, reset)
  begin
    if reset = '1' then
      valid <= '0';
      regval <= (others => '0');
    elsif ( Rising_Edge( clk )) then
      if valid_in = '1' then
        regval <= numsp;
        Valid <= '1';
      else
        Valid <= '0';
        regval <= (others => '0');
      end if;
    end if;
  end process;
end numofspreg_beh;

------Output Selector and Address Generator ------Author : Sreesa Akella -- File : opt_sel_10.vhd -- Entity : opt_sel -- Architecture : opt_sel_beh ------library ieee, xilinx_lib; use ieee.std_logic_1164.all; use xilinx_lib.VIRTEX.all; use ieee.std_logic_arith.all; use work.std_logic_prims.all; entity opt_sel is port(clk : in std_logic; reset : in std_logic; valid_numsp : in std_logic; numsp : in std_logic_vector(9 downto 0); row_reg : in std_logic_vector(9 downto 0); col_reg : in std_logic_vector(9 downto 0); store_cur_addr : in std_logic; node_mem_initialize : in std_logic; mem_initialize : in std_logic; addr_gen1_en : in std_logic; addr_gen2_en : in std_logic; c2_read : in std_logic; mem_update : in std_logic; addr1_reg_en : in std_logic; addr2_reg_en : in std_logic; R_dec : in std_logic;

120 Rp_dec : in std_logic; c1_incr : in std_logic; c2_incr : in std_logic; c1p_incr : in std_logic; ch_incr : in std_logic; c1_load1 : in std_logic; c1_load2 : in std_logic; c2_load1 : in std_logic; c2_load2 : in std_logic; c1p_load : in std_logic; ch_load : in std_logic; c1_clr : in std_logic; c2_clr : in std_logic; c1p_clr : in std_logic; ch_clr : in std_logic; row_col_sel : in std_logic_vector(1 downto 0); node_write : in std_logic; addr2_reg_dec : in std_logic; distreg_val : in std_logic_vector(31 downto 0); nodeid_sel : in std_logic_vector(1 downto 0); n_type_sel : in std_logic; incnt_inc : in std_logic; initial_run : in std_logic; r_clr : in std_logic; r_inc : in std_logic; a_clr : in std_logic; a_inc : in std_logic; a_grt : out std_logic; ext_node : out std_logic; numsp_1 : out std_logic_vector(9 downto 0); addr_cnt : out std_logic_vector(9 downto 0); row_cnt : out std_logic_vector(9 downto 0); first_val : out std_logic; initialized : out std_logic; addr_grt : out std_logic; child_cnt_gr : out std_logic; count_gr : out std_logic; all_nodes_done : out std_logic; addr : out std_logic_vector(19 downto 0); cnt1 : out std_logic_vector(9 downto 0); nodeid : out std_logic_vector(9 downto 0); n_type : out std_logic; par : out std_logic_vector(9 downto 0); br_len : out std_logic_vector(15 downto 0)); end opt_sel; architecture opt_sel_beh of opt_sel is signal temp_next_int : integer range 0 to 512; signal incnt : integer range 0 to 511; signal numsp1, numsp2 : std_logic_vector(9 downto 0); signal numsp_int : integer range 0 to 511; signal valid_numsp_int : std_logic; signal child_cnt : integer range 0 to 2;

121 signal max_node_cnt : integer range 0 to 511; signal all_nd_done : std_logic;

------Address generator signal declarations ------signal count1_in, c1_comp_val, count1 : std_logic_vector(9 downto 0); signal c1_grt, c1_inc, c1_load : std_logic; signal count1_16 : std_logic_vector(15 downto 0); signal countp_in, c1p_comp_val, count1p : std_logic_vector(9 downto 0); signal c1p_grt, c1p_inc : std_logic; signal count2_in, c2_comp_val, count2 : std_logic_vector(9 downto 0); signal c2_grt, c2_grt_d, c2_grt_p, c2_load : std_logic; signal chcount_in, ch_cnt : std_logic_vector(1 downto 0); signal ch_grt : std_logic; signal addr1_reg, addr2_reg : std_logic_vector(9 downto 0); signal addr1_reg_d, addr2_reg_d : std_logic_vector(9 downto 0); signal addr1_reg_dd, addr2_reg_dd : std_logic_vector(9 downto 0); signal addr1_reg_ddd, addr2_reg_ddd : std_logic_vector(9 downto 0); signal addr1_reg_dddd, addr2_reg_dddd : std_logic_vector(9 downto 0); signal R, Rp : std_logic_vector(9 downto 0); signal count2_in_sel : std_logic_vector(1 downto 0);

------Block ram signals signal din_a, din_b, din_b1 : std_logic_vector(15 downto 0); signal ena, wea, enb, web : std_logic; signal addra, addrb : std_logic_vector(15 downto 0); signal addra0, addra1 : std_logic_vector(15 downto 0); signal addrb0, addrb1 : std_logic_vector(15 downto 0); signal addr_temp : std_logic_vector(9 downto 0); signal addr_grt_t, addr_grt_tt, addr_grt_ttt : std_logic; signal addr_t : std_logic_vector(19 downto 0); signal wea0, wea1, web0, web1 : std_logic; signal ena0, ena1, enb0, enb1 : std_logic; ------Row memory map counters signals ------signal r_cnt : std_logic_vector(9 downto 0); signal a_cval2 : integer range 0 to 511; signal a_cnt : std_logic_vector(9 downto 0); signal e_n : std_logic; ------Address generator component declarations ------component counter port(Clk : in std_logic;

122 Res : in std_logic; ld : in std_logic; clr :in std_logic; inp : in std_logic_vector(9 downto 0); cv : in std_logic_vector(9 downto 0); inc : in std_logic; cnt : out std_logic_vector(9 downto 0); grt : out std_logic); end component; component counterch port(Clk : in std_logic; Res : in std_logic; ld : in std_logic; clr :in std_logic; inp : in std_logic_vector(1 downto 0); inc : in std_logic; cnt : out std_logic_vector(1 downto 0); grt : out std_logic); end component; component mux2_1 port(inp1 : in std_logic_vector(9 downto 0); inp2 : in std_logic_vector(9 downto 0); sel : in std_logic; outp : out std_logic_vector(9 downto 0)); end component; component mux2_1_16 port(inp1 : in std_logic_vector(15 downto 0); inp2 : in std_logic_vector(15 downto 0); sel : in std_logic; outp : out std_logic_vector(15 downto 0)); end component; component mux3_1 port(inp1 : in std_logic_vector(9 downto 0); inp2 : in std_logic_vector(9 downto 0); inp3 : in std_logic_vector(9 downto 0); sel : in std_logic_vector(1 downto 0); outp : out std_logic_vector(9 downto 0)); end component; begin process(incnt) begin temp_next_int <= incnt + 1; end process; process(clk, reset) begin if (reset = '1') then all_nodes_done <= '0'; all_nd_done <= '0'; elsif (clk = '1' and clk'event) then if temp_next_int > max_node_cnt then

123 all_nodes_done <= '1' ; all_nd_done <= '1'; else all_nodes_done <= '0'; all_nd_done <= '0'; end if; end if; end process; process(clk, reset) begin if (reset = '1') then max_node_cnt <= 0; a_cval2 <= 0; elsif (clk = '1' and clk'event) then if valid_numsp_int = '1' then max_node_cnt <= (2* numsp_int); a_cval2 <= 2*numsp_int - 1; end if; end if; end process;

process(clk, reset) begin if (reset = '1') then numsp_int <= 0; valid_numsp_int <= '0'; numsp1 <= (others => '0'); elsif (clk = '1' and clk'event) then if valid_numsp = '1' then numsp1 <= integer_to_std_logic_vector( (std_logic_vector_to_integer(numsp) - 1), 9); numsp_int <= std_logic_vector_to_integer(numsp) - 1; valid_numsp_int <= '1'; end if; end if; end process; numsp_1 <= numsp1; process(clk, reset) begin if (reset = '1') then incnt <= 0; elsif (clk = '1' and clk'event) then if valid_numsp = '1' then incnt <= std_logic_vector_to_integer(numsp); elsif incnt_inc = '1' then incnt <= incnt + 1; end if; end if; end process;

--opt sel processes

124 --n_type output select process(n_type_sel) begin n_type <= n_type_sel; end process;

--nodeid output select process(nodeid_sel, row_reg, col_reg, incnt) begin case nodeid_sel is when "00" => nodeid <= row_reg; when "01" => nodeid <= col_reg; when others => nodeid <= integer_to_std_logic_vector(incnt, 9); end case; end process;

--par select process(incnt) begin par <= integer_to_std_logic_vector(incnt, 9); end process;

--br_len select process(distreg_val) begin br_len <= '0' & distreg_val(15 downto 1); end process;

-- Signal indicating first value to be stored in least distance register process(reset, clk) begin if reset = '1' then first_val <= '0'; elsif clk = '1' and clk'event then if (initial_run = '1') then first_val <= '1'; else first_val <= '0'; end if; end if; end process;

----------------------------------------------------------------
-- Address Generator
----------------------------------------------------------------
U0 : counter port map(Clk, Reset, c1_load, c1_clr, count1_in, c1_comp_val, c1_inc, count1, c1_grt);
U1 : counter port map(Clk, Reset, c1p_load, c1p_clr, countp_in, c1p_comp_val, c1p_inc, count1p, c1p_grt);
U2 : counter port map(Clk, Reset, c2_load, c2_clr, count2_in, c2_comp_val, c2_incr, count2, c2_grt);
u3 : counterch port map(Clk, Reset, ch_load, ch_clr, chcount_in, ch_incr, ch_cnt, ch_grt);
U4 : mux2_1 port map(addr2_reg, addr1_reg, c1_load1, count1_in);
u5 : mux2_1 port map(Rp, R, node_mem_initialize, c1_comp_val);
u6 : mux3_1 port map(addr2_reg, addr1_reg, count1p, count2_in_sel, count2_in);
u7 : mux2_1 port map(R, Rp, addr_gen2_en, c2_comp_val);
u8 : mux2_1_16 port map(addrb, count1_16, node_mem_initialize, din_a);

count1_16 <= "000000" & count1;
count2_in_sel <= c2_load1 & c2_load2;
chcount_in <= "00";
c1p_comp_val <= R;
countp_in <= (others => '0');
c1_inc <= c1_incr or (c2_grt_p and not addr_gen2_en);
c1p_inc <= c1p_incr or (c2_grt_p and not addr_gen2_en);
c1_load <= c1_load1 or c1_load2;
c2_load <= c2_load1 or c2_load2;

----------------------------------------------------------------
-- Ena, Wea, Enb, and Web signals
----------------------------------------------------------------
ena <= node_mem_initialize or mem_initialize or addr_gen1_en or mem_update;
wea <= node_mem_initialize or mem_update;
enb <= mem_initialize or addr_gen1_en or addr_gen2_en or c2_read or node_write;
web <= node_write; --c2_grt and addr_gen2_en;

----------------------------------------------------------------
-- Dia and Dib input data
----------------------------------------------------------------
process (Clk, Reset)
begin
  if Reset = '1' then
    din_b <= (others => '0');
  elsif Clk = '1' and Clk'event then
    if incnt_inc = '1' then
      din_b <= integer_to_std_logic_vector(incnt, 15);
    end if;
  end if;
end process;

--**************************************
-- Write logic
--
--

wea0 <= wea when (count1(9 downto 8) = "00") else '0';
wea1 <= wea when (count1(9 downto 8) = "01") else '0';
web0 <= web when (count2(9 downto 8) = "00") else '0';
web1 <= web when (count2(9 downto 8) = "01") else '0';

--**************************************
-- Enable logic
--
--

ena0 <= ena when (count1(9 downto 8) = "00") else '0';
ena1 <= ena when (count1(9 downto 8) = "01") else '0';
enb0 <= enb when (count1(9 downto 8) = "00") else '0';
enb1 <= enb when (count1(9 downto 8) = "01") else '0';

--**************************************
-- Read Logic
--
--

addra <= addra0 when count1(9 downto 8) = "00" else
         addra1 when count1(9 downto 8) = "01" else
         (others => '0');
addrb <= addrb0 when count2(9 downto 8) = "00" else
         addrb1 when count2(9 downto 8) = "01" else
         (others => '0');

u9 : RAMB4_S16_S16
  port map
  (
    ADDRA => count1(7 downto 0),
    DIA   => din_a,
    WEA   => wea0,
    CLKA  => Clk,
    RSTA  => Reset,
    ENA   => ena0,
    DOA   => addra0,

    ADDRB => count2(7 downto 0),
    DIB   => din_b,
    WEB   => web0,
    CLKB  => Clk,
    RSTB  => Reset,
    ENB   => enb0,
    DOB   => addrb0
  );

u10 : RAMB4_S16_S16
  port map
  (
    ADDRA => count1(7 downto 0),
    DIA   => din_a,
    WEA   => wea1,
    CLKA  => Clk,
    RSTA  => Reset,
    ENA   => ena1,
    DOA   => addra1,

    ADDRB => count2(7 downto 0),
    DIB   => din_b,
    WEB   => web1,
    CLKB  => Clk,
    RSTB  => Reset,
    ENB   => enb1,
    DOB   => addrb1
  );

----------------------------------------------------------------
-- addr1_reg and addr2_reg
----------------------------------------------------------------
process(clk, reset)
begin
  if Reset = '1' then
    addr1_reg <= (others => '0');
  elsif Clk'event and Clk = '1' then
    if addr1_reg_en = '1' then
      addr1_reg <= addr1_reg_d;
    end if;
  end if;
end process;

process (Clk, Reset)
begin
  if Reset = '1' then
    addr1_reg_d <= (others => '0');
    addr2_reg_d <= (others => '0');
  elsif Clk = '1' and Clk'event then
    if store_cur_addr = '1' then
      addr1_reg_d <= count1;
      addr2_reg_d <= count2;
    end if;
  end if;
end process;

process(clk, reset)
begin
  if Reset = '1' then
    addr2_reg <= (others => '0');
  elsif Clk'event and Clk = '1' then
    if addr2_reg_en = '1' then
      addr2_reg <= addr2_reg_d;
    elsif addr2_reg_dec = '1' then
      addr2_reg <= unsigned(addr2_reg) - 1;
    end if;
  end if;
end process;

----------------------------------------------------------------
-- R and Rp registers
----------------------------------------------------------------
process(Clk, Reset)
begin
  if Reset = '1' then
    R <= (others => '0');
  elsif Clk = '1' and Clk'event then
    if valid_numsp = '1' then
      R <= numsp; --(9 downto 0);
    elsif R_dec = '1' then
      R <= unsigned(R) - '1';
    end if;
  end if;
end process;

process(Clk, Reset)
begin
  if Reset = '1' then
    Rp <= (others => '0');
  elsif Clk = '1' and Clk'event then
    if valid_numsp = '1' then
      Rp <= numsp; --(9 downto 0);
    elsif Rp_dec = '1' then
      Rp <= unsigned(Rp) - '1';
    end if;
  end if;
end process;

process(clk, reset)
begin
  if reset = '1' then
    c2_grt_d <= '0';
  elsif clk'event and clk = '1' then
    c2_grt_d <= c2_grt;
  end if;
end process;

c2_grt_p <= c2_grt and not(c2_grt_d);

addr <= addr_temp & addrb(9 downto 0) when addr_gen2_en = '1' else
        addra(9 downto 0) & addrb(9 downto 0);

-- addr_temp should be set
addr_temp <= row_reg when row_col_sel = "01" else
             col_reg when row_col_sel = "10" else
             din_b(9 downto 0);

count_gr <= c2_grt or all_nd_done;
child_cnt_gr <= ch_grt;
cnt1 <= count1;

process(clk, reset)
begin
  if reset = '1' then
    addr_grt <= '0';
    addr_grt_t <= '0';
    addr_grt_tt <= '0';
    addr_grt_ttt <= '0';
  elsif Clk'event and Clk = '1' then
    addr_grt_t <= c1_grt;
    addr_grt_tt <= addr_grt_t;
    addr_grt_ttt <= addr_grt_tt;
    addr_grt <= addr_grt_ttt;
  end if;
end process;

initialized <= c1_grt;

----------------------------------------------------------------
-- counters for row map memory initialization
----------------------------------------------------------------
process(Clk, Reset)
begin
  if Reset = '1' then
    r_cnt <= (others => '0');
  elsif Clk'event and Clk = '1' then
    if r_clr = '1' then
      r_cnt <= (others => '0');
    elsif r_inc = '1' then
      if r_cnt < R then
        r_cnt <= unsigned(r_cnt) + '1';
      else
        r_cnt <= (others => '0');
      end if;
    end if;
  end if;
end process;

row_cnt <= r_cnt;

process(Clk, Reset)
begin
  if Reset = '1' then
    a_cnt <= (others => '0');
  elsif Clk'event and Clk = '1' then
    if a_clr = '1' then
      a_cnt <= (others => '0');
    elsif a_inc = '1' then
      if a_cnt < integer_to_std_logic_vector(a_cval2, 9) then
        a_cnt <= unsigned(a_cnt) + '1';
      else
        a_cnt <= (others => '0');
      end if;
    end if;
  end if;
end process;

addr_cnt <= a_cnt;

process(Reset, a_cnt, a_cval2, numsp1)
begin
  if Reset = '1' then
    e_n <= '0';
    a_grt <= '0';
  elsif a_cnt < numsp1 then
    e_n <= '0';
    a_grt <= '0';
  elsif a_cnt < integer_to_std_logic_vector(a_cval2, 9) then
    e_n <= '1';
    a_grt <= '0';
  else
    e_n <= '1';
    a_grt <= '1';
  end if;
end process;

ext_node <= e_n;

end opt_sel_beh;
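The br_len assignment in the opt_sel architecture above, '0' & distreg_val(15 downto 1), is a one-bit right shift, that is, an unsigned divide by two. This presumably reflects the UPGMA rule that a merged node sits at half the distance between its two clusters. The following self-checking testbench is a minimal sketch of that arithmetic only. The entity name br_len_tb is hypothetical, a 16-bit distance is assumed, and ieee.numeric_std is used here for the integer conversions instead of the thesis's std_logic_arith and std_logic_prims.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical testbench: checks that the shift equals integer division by 2.
entity br_len_tb is
end br_len_tb;

architecture sim of br_len_tb is
  signal distreg_val : std_logic_vector(15 downto 0) := (others => '0');
  signal br_len      : std_logic_vector(15 downto 0);
begin

  -- Same expression as in opt_sel: drop the LSB and shift in a zero at the top.
  br_len <= '0' & distreg_val(15 downto 1);

  check : process
  begin
    for d in 0 to 1000 loop
      distreg_val <= std_logic_vector(to_unsigned(d, 16));
      wait for 1 ns;  -- let the concurrent assignment settle
      assert to_integer(unsigned(br_len)) = d / 2
        report "br_len is not half of the distance"
        severity failure;
    end loop;
    report "br_len halving checks completed";
    wait;
  end process;

end sim;
```

Odd distances lose their least significant bit, so the branch length is rounded down.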

----------------------------------------------------------------
-- Package for defining ieee std_logic primitives
----------------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : pack.vhd
-- Entity       : NA
-- Architecture : NA
----------------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

--
-- package for defining ieee std_logic primitives
--
package std_logic_prims is
  -- constant definitions
  constant bit_width   : integer := 63;
  constant bit_widthx2 : integer := 127;
  --
  -- function std_logic_vector_to_integer converts its
  -- std_logic_vector argument ibus, assumed to consist of '0'
  -- and '1' elements only, into an integer.
  --
  function std_logic_vector_to_integer(ibus: in std_logic_vector) return integer;
  --
  -- function integer_to_std_logic_vector converts its integer
  -- argument val into a std_logic_vector indexed (n downto 0),
  -- i.e. of length n + 1.
  --
  function integer_to_std_logic_vector(val, n: in integer) return std_logic_vector;
end std_logic_prims;

package body std_logic_prims is
  --
  -- function std_logic_vector_to_integer converts its
  -- std_logic_vector argument ibus, assumed to consist of '0'
  -- and '1' elements only, into an integer.
  --
  function std_logic_vector_to_integer(ibus: in std_logic_vector) return integer is
    variable result: integer := 0;
  begin
    for i in ibus'high downto 0 loop
      result := result * 2;
      if ibus(i) = '1' then
        result := result + 1;
      end if;
    end loop;
    return result;
  end std_logic_vector_to_integer;
  --
  -- function integer_to_std_logic_vector converts its integer
  -- argument val into a std_logic_vector indexed (n downto 0),
  -- i.e. of length n + 1.
  --
  function integer_to_std_logic_vector(val, n: in integer) return std_logic_vector is
    variable result: std_logic_vector(n downto 0);
    variable ival: integer := val;
  begin
    ival := val;

    for i in 0 to n loop
      if (ival mod 2) = 1 then
        result(i) := '1';
      else
        result(i) := '0';
      end if;
      ival := ival / 2;
    end loop;
    return result;
  end integer_to_std_logic_vector;
end std_logic_prims;
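A point that is easy to miss in the package above is that the second argument of integer_to_std_logic_vector is the highest bit index, so a call with n = 9 returns a 10-bit vector. The testbench below is a minimal sketch that checks this and the round trip through std_logic_vector_to_integer. The entity name prims_tb is hypothetical, and the two functions are copied locally only so that the sketch compiles on its own; in the design they come from work.std_logic_prims.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench for the two conversion helpers.
entity prims_tb is
end prims_tb;

architecture sim of prims_tb is

  -- Local copy of std_logic_vector_to_integer (MSB-first accumulation).
  function std_logic_vector_to_integer(ibus : in std_logic_vector)
    return integer is
    variable result : integer := 0;
  begin
    for i in ibus'high downto 0 loop
      result := result * 2;
      if ibus(i) = '1' then
        result := result + 1;
      end if;
    end loop;
    return result;
  end std_logic_vector_to_integer;

  -- Local copy of integer_to_std_logic_vector; result is (n downto 0).
  function integer_to_std_logic_vector(val, n : in integer)
    return std_logic_vector is
    variable result : std_logic_vector(n downto 0);
    variable ival   : integer := val;
  begin
    for i in 0 to n loop
      if (ival mod 2) = 1 then
        result(i) := '1';
      else
        result(i) := '0';
      end if;
      ival := ival / 2;
    end loop;
    return result;
  end integer_to_std_logic_vector;

begin

  check : process
    variable v : std_logic_vector(9 downto 0);
  begin
    -- n = 9 yields indices 9 downto 0, i.e. 10 bits.
    v := integer_to_std_logic_vector(5, 9);
    assert v'length = 10 report "expected a 10-bit result" severity failure;
    assert v = "0000000101" report "unexpected bit pattern for 5" severity failure;

    -- Round trip over the full 10-bit range used for node identifiers.
    for k in 0 to 1023 loop
      assert std_logic_vector_to_integer(integer_to_std_logic_vector(k, 9)) = k
        report "round trip failed" severity failure;
    end loop;

    report "conversion checks completed";
    wait;
  end process;

end sim;
```

Values that do not fit in n + 1 bits are silently truncated to their low-order bits, because the loop simply stops after bit n.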

----------------------------------------------------------------
-- Entity       : PE
--
-- Architecture : pe_upgma_arch
--
-- Author       : Sreesa Akella
--
-- Filename     : pe_upgma_arch.vhd
--
-- Description  : PE architecture that implements the UPGMA design
----------------------------------------------------------------

----------------------------------------------------------------
-- Glossary
----------------------------------------------------------------
-- Name Key:
-- =========
-- _AS       : Address Strobe
-- _CE       : Clock Enable
-- _CS       : Chip Select
-- _DS       : Data Strobe
-- _EN       : Enable
-- _OE       : Output Enable
-- _RD       : Read Select
-- _WE       : Write Enable
-- _WR       : Write Select
-- _d[d...]  : Delayed (registered) signal (each 'd' denotes one
--             level of delay)
-- _n        : Active low signals (must be last part of name)
--
-- Port Name                        Dir  Description
-- ===============================  ===  ================================
-- Pads.Clocks.F_Clk                 I   Frequency synthesizer clock
-- Pads.Clocks.M_Clk                 I   Memory clock
-- Pads.Clocks.P_Clk                 I   Processor clock
-- Pads.Clocks.K_Clk                 I   LAD-bus clock
-- Pads.Clocks.IO_Clk                I   External I/O connector clock
-- Pads.Clocks.M_Clk_Out_Pe          O   M_Clk to the PE
-- Pads.Clocks.M_Clk_Out_CB_Ctrl     O   M_Clk to the CardBus controller
-- Pads.Clocks.M_Clk_Out_Right_Mem   O   M_Clk to the right memory bank
-- Pads.Clocks.M_Clk_Out_Left_Mem    O   M_Clk to the left memory bank
-- Pads.Clocks.P_Clk_Out_Pe          O   P_Clk to the PE
-- Pads.Clocks.P_Clk_Out_CB_Ctrl     O   P_Clk to the CardBus controller
-- Pads.Reset                        I   Global PE reset
-- Pads.Audio                        O   Pulse-width modulated audio pad
-- Pads.LAD_Bus.Addr_Data            B   LAD-bus shared address/data bus
-- Pads.LAD_Bus.AS_n                 I   LAD-bus address strobe
-- Pads.LAD_Bus.DS_n                 I   LAD-bus data strobe
-- Pads.LAD_Bus.Ack_n                O   LAD-bus acknowledge strobe
-- Pads.LAD_Bus.Reg_n                I   LAD-bus register select
-- Pads.LAD_Bus.WR_n                 I   LAD-bus write select
-- Pads.LAD_Bus.CS_n                 I   LAD-bus chip select
-- Pads.LAD_Bus.Int_Req_n            O   LAD-bus interrupt request
-- Pads.LAD_Bus.DMA_0_Data_OK_n      O   LAD-bus DMA chan 0 data OK flag
-- Pads.LAD_Bus.DMA_0_Burst_OK_n     O   LAD-bus DMA chan 0 burst OK flag
-- Pads.LAD_Bus.DMA_1_Data_OK_n      O   LAD-bus DMA chan 1 data OK flag
-- Pads.LAD_Bus.DMA_1_Burst_OK_n     O   LAD-bus DMA chan 1 burst OK flag
-- Pads.LAD_Bus.Reg_Data_OK_n        O   LAD-bus reg space data OK flag
-- Pads.LAD_Bus.Reg_Burst_OK_n       O   LAD-bus reg space burst OK flag
-- Pads.LAD_Bus.Force_K_Clk_n        O   LAD-bus K_Clk forced-run select
-- Pads.LAD_Bus.Reserved             -   Reserved for future use
-- Pads.Left_Mem.Addr                O   Left memory address bus
-- Pads.Left_Mem.Data                B   Left memory data bus
-- Pads.Left_Mem.Byte_WR_n           O   Left memory byte write select
-- Pads.Left_Mem.CS_n                O   Left memory chip select
-- Pads.Left_Mem.CE_n                O   Left memory clock enable
-- Pads.Left_Mem.WE_n                O   Left memory write enable
-- Pads.Left_Mem.OE_n                O   Left memory output enable
-- Pads.Left_Mem.Sleep_EN            O   Left memory sleep enable
-- Pads.Left_Mem.Load_EN_n           O   Left memory load enable
-- Pads.Left_Mem.Burst_Mode          O   Left memory burst mode select
-- Pads.Right_Mem.Addr               O   Right memory address bus
-- Pads.Right_Mem.Data               B   Right memory data bus
-- Pads.Right_Mem.Byte_WR_n          O   Right memory byte write select
-- Pads.Right_Mem.CS_n               O   Right memory chip select
-- Pads.Right_Mem.CE_n               O   Right memory clock enable
-- Pads.Right_Mem.WE_n               O   Right memory write enable
-- Pads.Right_Mem.OE_n               O   Right memory output enable
-- Pads.Right_Mem.Sleep_EN           O   Right memory sleep enable
-- Pads.Right_Mem.Load_EN_n          O   Right memory load enable
-- Pads.Right_Mem.Burst_Mode         O   Right memory burst mode select
-- Pads.Left_IO                      B   Left external I/O connector
-- Pads.Right_IO                     B   Right external I/O connector
----------------------------------------------------------------

----------------------------------------------------------------
-- Library Declarations
----------------------------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use work.std_logic_prims.all;

library PE_Lib;
use PE_Lib.PE_Package.all;

library LAD_Mux_Lib;
use LAD_Mux_lib.LAD_Mux_Pkg.all;
use LAD_Mux_lib.LAD_Mem32_Mux_pkg.all;

library Mem_Mux_Lib;
use Mem_Mux_lib.Mem32_Mux_pkg.all;

library DMA_Mux_Lib;
use DMA_Mux_Lib.DMA_Mux_Pkg.all;
use DMA_Mux_Lib.DMA_LAD_Mem32_Mux_Pkg.all;

----------------------------------------------------------------
-- Architecture Declaration
----------------------------------------------------------------
architecture pe_upgma_arch of PE is

----------------------------------------------------------------
-- Glossary
----------------------------------------------------------------
-- Name Key:
-- =========
-- _AS       : Address Strobe
-- _CB       : CardBus
-- _CE       : Clock Enable
-- _CS       : Chip Select
-- _DS       : Data Strobe
-- _EN       : Enable
-- _OE       : Output Enable
-- _PE       : Processing Element
-- _RD       : Read Select
-- _WE       : Write Enable
-- _WR       : Write Select
-- _d[d...]  : Delayed (registered) signal (each 'd' denotes one
--             level of delay)
-- _n        : Active low signals (must be last part of name)
--
-- Name                          Width Dir Description
-- ============================= ===== === ===============================
-- Clocks_In.F_Clk                 1    I  Frequency synthesizer clock
-- Clocks_In.M_Clk                 1    I  Memory clock
-- Clocks_In.P_Clk                 1    I  Processing element clock
-- Clocks_In.K_Clk                 1    I  LAD-bus clock
-- Clocks_In.F_Clk_Locked          1    I  U_Clk CLKDLL locked flag
-- Clocks_In.M_Clk_Locked          1    I  M_Clk CLKDLL locked flag
-- Clocks_In.P_Clk_Locked          1    I  P_Clk CLKDLL locked flag
-- Global_Reset                    1    I  Global reset (or set) signal
-- Audio_Out                       1    O  Pulse-width modulated audio output
-- LAD_Mux_Bus(x).Addr            20    I  LAD bus DWORD address bus input
-- LAD_Mux_Bus(x).Write            1    I  LAD bus write select
-- LAD_Mux_Bus(x).Strobe           1    I  LAD bus register access strobe
-- LAD_Mux_Bus(x).Mem_Strobe       1    I  LAD bus memory access strobe
-- LAD_Mux_Bus(x).DMA_0_Strobe     1    I  LAD bus DMA channel 0 access strobe
-- LAD_Mux_Bus(x).DMA_1_Strobe     1    I  LAD bus DMA channel 1 access strobe
-- LAD_Mux_Bus(x).DMA_0_Done       1    I  DMA CH0 Completed signal
-- LAD_Mux_Bus(x).DMA_1_Done       1    I  DMA CH1 Completed signal
-- LAD_Mux_Bus(x).Reset            1    I  LAD bus reset signal
-- LAD_Mux_Bus(x).Data_In         32    I  LAD bus data bus input
-- LAD_Mux_Bus(x).Data_Out        32    O  LAD bus data bus output
-- LAD_Mux_Bus(x).Akk              1    O  LAD bus transaction acknowledge
-- LAD_Mux_Bus(x).Int_Req          1    O  LAD bus interrupt request
-- LAD_Mux_Bus(x).DMA_0_Stat       2    O  LAD bus DMA Channel 0 status flags
-- LAD_Mux_Bus(x).DMA_1_Stat       2    O  LAD bus DMA Channel 1 status flags
--
-- Left_Mem_Mux(x).Addr           32    O  Left on-board memory address bus
-- Left_Mem_Mux(x).Write           1    O  Left on-board memory write select
-- Left_Mem_Mux(x).Data_Out       32    O  Left on-board memory output data bus
-- Left_Mem_Mux(x).Req             1    O  Left on-board memory access request
-- Left_Mem_Mux(x).Akk             1    O  Left on-board memory access acknowledge
-- Left_Mem_Mux(x).Data_In        32    I  Left on-board memory input data bus
-- Left_Mem_Mux(x).Data_Valid      1    I  Left on-board memory valid read flag
--
-- Right_Mem_Mux(x).Addr          32    O  Right on-board memory address bus
-- Right_Mem_Mux(x).Write          1    O  Right on-board memory write select
-- Right_Mem_Mux(x).Data_Out      32    O  Right on-board memory output data bus
-- Right_Mem_Mux(x).Req            1    O  Right on-board memory access request
-- Right_Mem_Mux(x).Akk            1    O  Right on-board memory access acknowledge
-- Right_Mem_Mux(x).Data_In       32    I  Right on-board memory input data bus
-- Right_Mem_Mux(x).Data_Valid     1    I  Right on-board memory valid read flag
--
-- Left_IO_In.Data_In             13    I  Left I/O connector data input
-- Left_IO_Out.Data_Out           13    O  Left I/O connector data output
-- Left_IO_Out.Data_OE_n          13    O  Left I/O connector data output enable
-- Right_IO_In.Data_In            13    I  Right I/O connector data input
-- Right_IO_Out.Data_Out          13    O  Right I/O connector data output
-- Right_IO_Out.Data_OE_n         13    O  Right I/O connector data output enable
----------------------------------------------------------------

----------------------------------------------------------------
-- Below are all of the standard PE pad interface signals. Simply
-- uncomment the signal(s) that are needed by the PE design. All
-- other unused signals may remain commented out. Be sure to
-- uncomment any component instances used by the interface.
----------------------------------------------------------------
signal Clocks_In    : Clock_Std_IF_In_Type;
signal Global_Reset : Reset_Std_IF_In_Type := '0';
-- signal Audio_Out    : Audio_Std_IF_Out_Type;
-- signal Left_IO_In   : IO_Conn_Std_IF_In_Type;
-- signal Left_IO_Out  : IO_Conn_Std_IF_Out_Type;
-- signal Right_IO_In  : IO_Conn_Std_IF_In_Type;
-- signal Right_IO_Out : IO_Conn_Std_IF_Out_Type;

----------------------------------------------------------------
-- Below are all of the multiplexing PE pad interface signals. Simply
-- uncomment the signal(s) that are needed by the PE design and
-- increase the vector sizes as needed. All other unused signals may
-- remain commented out. Be sure to uncomment any component
-- instances used by the interface.
----------------------------------------------------------------
signal LAD_Mux_Bus   : LAD_Mux_vector(0 to 2);
signal Left_Mem_Mux  : Mem32_Mux_vector(0 to 1);
signal Right_Mem_Mux : Mem32_Mux_vector(0 to 1);
signal LAD_Regs      : LAD_Mux_register_vector(0 to 1);

----------------------------------------------------------------
-- Component declaration of UPGMA top component
----------------------------------------------------------------
component upgma_top is
  port(
    clk         : in  std_logic;
    Reset       : in  std_logic;
    Data_in     : in  std_logic_vector(31 downto 0);
    valid_numsp : in  std_logic;
    valid_dst   : in  std_logic;
    addr_cnt    : out std_logic_vector(9 downto 0);
    row_cnt     : out std_logic_vector(9 downto 0);
    ext_node    : out std_logic;
    rmem_read   : out std_logic;
    rmem_write  : out std_logic;
    ad_reg_en   : out std_logic;
    ad_reg_clr  : out std_logic;
    row_zero    : out std_logic;
    mem_addr    : out std_logic_vector(19 downto 0);
    read_mem    : out std_logic;
    write_mem   : out std_logic;
    numsp_1     : out std_logic_vector(9 downto 0);
    numsp       : in  std_logic_vector(9 downto 0);
    avg_dst     : out std_logic_vector(31 downto 0);
    trout       : out std_logic_vector(36 downto 0);
    valid_td    : out std_logic;
    done        : out std_logic
  );
end component;

----------------------------------------------------------------
-- UPGMA top component signal declarations
----------------------------------------------------------------
signal Data_in              : std_logic_vector(31 downto 0);
signal valid_numsp          : std_logic;
signal valid_dst            : std_logic;
signal mem_addr, mem_addr_t : std_logic_vector(19 downto 0);
signal read_mem             : std_logic;
signal write_mem            : std_logic;
signal numsp                : std_logic_vector(9 downto 0);
signal avg_dst              : std_logic_vector(31 downto 0);
signal trout                : std_logic_vector(36 downto 0);
signal valid_td             : std_logic;
signal done                 : std_logic;
signal val_nsp_d            : std_logic;
signal val_nsp_dd           : std_logic;
signal done_d               : std_logic;
signal rmem_read            : std_logic;
signal rmem_write           : std_logic;
signal ad_reg_en            : std_logic;
signal ad_reg_clr           : std_logic;
signal row_zero             : std_logic;
signal ext_node             : std_logic;
signal row_cnt, addr_cnt    : std_logic_vector(9 downto 0);
signal numsp_1              : std_logic_vector(9 downto 0);

----------------------------------------------------------------
-- Address modification component
----------------------------------------------------------------

component addr_mod
  port (
    Clk        : in  std_logic;
    Reset      : in  std_logic;
    addr_cnt   : in  std_logic_vector(9 downto 0);
    row_cnt    : in  std_logic_vector(9 downto 0);
    mem_addr   : in  std_logic_vector(19 downto 0);
    numsp      : in  std_logic_vector(9 downto 0);
    numsp_1    : in  std_logic_vector(9 downto 0);
    ext_node   : in  std_logic;
    rmem_read  : in  std_logic;
    rmem_write : in  std_logic;
    ad_reg_en  : in  std_logic;
    ad_reg_clr : in  std_logic;
    row_zero   : in  std_logic;
    addr_modif : out std_logic_vector(15 downto 0));
end component;

----------------------------------------------------------------
-- Address modification component signals
----------------------------------------------------------------

signal addr_modif : std_logic_vector(15 downto 0);

----------------------------------------------------------------
-- Memory input signals
-- Left memory and right memory input signals
----------------------------------------------------------------

signal left_mem_data_Addr      : std_logic_vector(31 downto 0);
signal left_mem_data_Write     : std_logic;
signal left_mem_data_Data_Out  : std_logic_vector(31 downto 0);
signal left_mem_data_Req       : std_logic;
signal right_mem_data_Addr     : std_logic_vector(31 downto 0);
signal right_mem_data_Write    : std_logic;
signal right_mem_data_Data_Out : std_logic_vector(31 downto 0);
signal right_mem_data_Req      : std_logic;

signal en : std_logic;

begin

----------------------------------------------------------------
-- The following two components create a block RAM bridge from the LAD
-- bus to the onboard left and right memories. Use the
-- LAD_Mem_Bridge.c/.h source files to write data to these components
-- from the host.
--
-- Each component needs a unique LAD_Mux_Bus, and either a unique
-- Left_Mem_Mux or Right_Mem_Mux.
--
-- The Left Memory is located at address 0x1000 and the Right Memory
-- at address 0x1200.
--
-- Physically these addresses come into the PE as 0x5000 and 0x5200,
-- however the 0x4000 REGISTER base is subtracted from the Address
-- if the USE_OLD_ADDRESSES generic is FALSE. This new address scheme
-- was added to make the WC_PeRegRead and WC_PeRegWrite addresses match
-- the addresses in the VHDL code.
----------------------------------------------------------------

U_Left_Bridge : LAD_Mem32_Bridge
  generic map
  (
    BASE => x"1000"
  )
  port map
  (
    Kclk => Clocks_In.K_clk,
    LAD  => LAD_Mux_Bus(1),

    Mclk => Clocks_In.M_Clk,
    Mem  => Left_Mem_Mux(0)
  );

U_Right_Bridge : LAD_Mem32_Bridge
  generic map
  (
    BASE => x"1200"
  )
  port map
  (
    Kclk => Clocks_In.K_clk,
    LAD  => LAD_Mux_Bus(0),

    Mclk => Clocks_In.M_Clk,
    Mem  => Right_Mem_Mux(0)
  );

----------------------------------------------------------------
-- Instantiate a LAD_Mux_RegFile (L2NUM => 1) driving LAD_Regs(0 to 1).
-- The 32-bit register LAD_Regs(0) stores the value of numsp;
-- LAD_Regs(1) holds the done flag that the host polls.
----------------------------------------------------------------

U_LAD_Mux_Reg : LAD_Mux_RegFile
  generic map
  (
    BASE  => x"2000",
    L2NUM => 1
  )
  port map
  (
    Kclk => Clocks_In.K_Clk,
    LAD  => LAD_Mux_Bus(2),
    Regs => LAD_Regs
  );

-- Tie the register output to the input so it can be read back

LAD_Regs(0).Data_Out <= LAD_Regs(0).Data_In;

----------------------------------------------------------------
-- Below are all of the standard PE pad interface components. Simply
-- uncomment the interface(s) that are needed by the PE design. All
-- other unused interfaces may remain commented out. Be sure to
-- uncomment any signal declarations used by the interface.
----------------------------------------------------------------

-- --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
-- --@@
-- --@@ CLOCK STANDARD Interface. Uncomment this component
-- --@@ to use K,M,P and F clocks. (This should almost
-- --@@ always be uncommented.)
-- --@@
-- --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
--
U_Clocks : Clock_Std_IF
  generic map
  (
    USE_EXT_P_CLK_SOURCE => FALSE,
    REVISION             => REVD
  )
  port map
  (
    Global_Reset => Global_Reset,
    Pads         => Pads.Clocks,
    User_In      => Clocks_In
  );

--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
--@@
--@@ LAD MUX INTERFACE
--@@
--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

U_LAD_MUX : LAD_Mux_IF
  generic map
  (
    USE_OLD_ADDRESSES => FALSE
  )
  port map
  (
    Kclk    => Clocks_In.K_Clk,
    Reset   => Global_Reset,
    Pads    => Pads.LAD_Bus,
    Clients => LAD_Mux_Bus
  );

--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
--@@
--@@ LEFT MEMORY MUX INTERFACE : The two interfaces below
--@@ are mutually exclusive. Uncomment either the
--@@ Mem32_Mux_Priority_IF or the Mem32_Mux_Fair_IF
--@@ to use the left memory bank.
--@@
--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

U_Left_Mem_Mux : Mem32_Mux_Priority_IF
  generic map
  (
    AVOID_OVERFLOW  => TRUE,
    NUM_AKK_FIFOS   => 0,
    ROTATE_PRIORITY => FALSE,
    STICKY_PRIORITY => FALSE,
    REGISTERED_AKKS => FALSE,
    REGISTERED_REQS => FALSE
  )
  port map
  (
    Mclk    => Clocks_In.M_Clk,
    Reset   => Global_Reset,
    Pads    => Pads.Left_Mem,
    Clients => Left_Mem_Mux
  );

-- U_Left_Mem_Mux : Mem32_Mux_Fair_IF
--   generic map
--   (
--     AVOID_OVERFLOW => TRUE,
--     REGISTER_DATA  => FALSE
--   )
--   port map
--   (
--     Mclk    => Clocks_In.M_Clk,
--     Reset   => Global_Reset,
--     Pads    => Pads.Left_Mem,
--     Clients => Left_Mem_Mux
--   );
--
--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
--@@
--@@ RIGHT MEMORY MUX INTERFACE : The two interfaces below
--@@ are mutually exclusive. Uncomment either the
--@@ Mem32_Mux_Priority_IF or the Mem32_Mux_Fair_IF
--@@ to use the right memory bank.
--@@
--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

U_Right_Mem_Mux_IF : Mem32_Mux_Priority_IF
  generic map
  (
    AVOID_OVERFLOW  => TRUE,
    NUM_AKK_FIFOS   => 0,
    ROTATE_PRIORITY => FALSE,
    STICKY_PRIORITY => FALSE,
    REGISTERED_AKKS => FALSE,
    REGISTERED_REQS => FALSE
  )
  port map
  (
    Mclk    => Clocks_In.M_Clk,
    Reset   => Global_Reset,
    Pads    => Pads.Right_Mem,
    Clients => Right_Mem_Mux
  );

-- U_Right_Mem_Mux : Mem32_Mux_Fair_IF
--   generic map
--   (
--     AVOID_OVERFLOW => TRUE,
--     REGISTER_DATA  => FALSE
--   )
--   port map
--   (
--     Mclk    => Clocks_In.M_Clk,
--     Reset   => Global_Reset,
--     Pads    => Pads.Right_Mem,
--     Clients => Right_Mem_Mux
--   );
--
--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
--@@
--@@ RESET INTERFACE : The following component provides
--@@ a global reset to the entire PE. The Global_Reset
--@@ signal is also tied to the GSR port of the
--@@ STARTUP VIRTEX. This component should almost
--@@ always be uncommented.
--@@
--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

U_Reset : Reset_Std_IF
  port map
  (
    Clk     => Clocks_In.K_Clk,
    Pads    => Pads.Reset,
    User_In => Global_Reset
  );

--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
--@@
--@@ UPGMA top component: The following component
--@@ reads the distance data from the left and
--@@ right memories and reconstructs the phylogenetic
--@@ tree using the UPGMA algorithm
--@@
--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
U_UPGMA_TOP : upgma_top
  port map
  (
    clk         => Clocks_In.P_Clk,
    Reset       => Global_Reset,
    Data_in     => Data_in,
    valid_numsp => valid_numsp,
    valid_dst   => valid_dst,
    addr_cnt    => addr_cnt,
    row_cnt     => row_cnt,
    ext_node    => ext_node,
    rmem_read   => rmem_read,
    rmem_write  => rmem_write,
    ad_reg_en   => ad_reg_en,
    ad_reg_clr  => ad_reg_clr,
    row_zero    => row_zero,
    mem_addr    => mem_addr,
    read_mem    => read_mem,
    write_mem   => write_mem,
    numsp_1     => numsp_1,
    numsp       => numsp,
    avg_dst     => avg_dst,
    trout       => trout,
    valid_td    => valid_td,
    done        => done
  );

--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
--@@
--@@ Data_in and
--@@ Valid numsp assignment
--@@
--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

valid_numsp <= val_nsp_d and not val_nsp_dd;

process ( Global_Reset, Clocks_In.P_Clk )
begin
  if ( Global_Reset = '1' ) then
    numsp <= (others => '0');
    val_nsp_d <= '0';
    val_nsp_dd <= '0';
  elsif ( rising_edge ( Clocks_In.P_Clk ) ) then
    val_nsp_d <= '0';

    if (LAD_Regs(0).Strobe = '1') then
      numsp <= LAD_Regs(0).Data_in( 9 downto 0 );
      val_nsp_d <= '1';
    end if;
    val_nsp_dd <= val_nsp_d;
  end if;
end process;

process ( Global_Reset, Clocks_In.K_Clk )
begin
  if ( Global_Reset = '1' ) then
    done_d <= '0';
  elsif ( rising_edge ( Clocks_In.K_Clk ) ) then
    done_d <= done;
  end if;
end process;

----------------------------------------------------------------
-- Data_in
----------------------------------------------------------------

process ( Global_Reset, Clocks_In.P_Clk )
begin
  if ( Global_Reset = '1' ) then
    valid_dst <= '0';
    Data_in <= (others => '0');
  elsif ( rising_edge ( Clocks_In.P_Clk ) ) then
    if (Left_Mem_Mux(1).Data_Valid = '1') then
      valid_dst <= Left_Mem_Mux(1).Data_Valid;
      Data_in <= Left_Mem_Mux(1).Data_in;
    end if;
  end if;
end process;

--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
--@@
--@@ Store the done signal in PE registers
--@@ for the host to poll and read
--@@
--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
process ( Global_Reset, Clocks_In.K_Clk )
begin
  if ( Global_Reset = '1' ) then
    LAD_Regs(1).Data_out <= (others => '0');
  elsif ( rising_edge ( Clocks_In.K_Clk ) ) then
    if (done_d = '1') then
      LAD_Regs(1).Data_out <= "00000000000000000000000000000001";
    end if;
  end if;
end process;

--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
--@@
--@@ Address modification component instantiation
--@@
--@@
--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

U_addr_mod : addr_mod
  port map (
    Clocks_In.P_Clk,
    Global_Reset,
    addr_cnt,
    row_cnt,
    mem_addr,
    numsp,
    numsp_1,
    ext_node,
    rmem_read,
    rmem_write,
    ad_reg_en,
    ad_reg_clr,
    row_zero,
    addr_modif
  );

--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
--@@
--@@ Assign left_mem_data and right_mem_data
--@@ with data from UPGMA design - avg_dst and
--@@ tree_data respectively
--@@
--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
process(Global_Reset, Clocks_In.M_Clk)
  variable temp : integer;
begin
  if ( Global_Reset = '1' ) then
    left_mem_data_Addr <= (others => '0');
    left_mem_data_Write <= '0';
    left_mem_data_Data_Out <= (others => '0');
    left_mem_data_Req <= '0';
    temp := 0;
  elsif ( rising_edge ( Clocks_In.M_Clk ) ) then
    temp := std_logic_vector_to_integer(addr_modif);
    left_mem_data_Addr <= integer_to_std_logic_vector(temp, 31);
    left_mem_data_Write <= write_mem;
    left_mem_data_Data_Out <= avg_dst;
    left_mem_data_Req <= (write_mem or read_mem);
  end if;
end process;

process(Global_Reset, Clocks_In.M_Clk)
begin
  if ( Global_Reset = '1' ) then
    right_mem_data_Addr <= (others => '0');
    right_mem_data_Write <= '0';
    right_mem_data_Data_Out <= (others => '0');
    right_mem_data_Req <= '0';
  elsif ( rising_edge ( Clocks_In.M_Clk ) ) then
    right_mem_data_Addr <= "0000000000000000000000" & trout(35 downto 26);
    right_mem_data_Write <= valid_td;
    right_mem_data_Data_Out <= trout(35 downto 16) & trout(11 downto 0);
    right_mem_data_Req <= valid_td;
  end if;
end process;

--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
--@@
--@@ Assign left_mem_mux(1) and right_mem_mux(1)
--@@ with left_mem_data and right_mem_data
--@@ respectively
--@@
--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

Left_Mem_Mux(1).Addr     <= left_mem_data_Addr;
Left_Mem_Mux(1).Write    <= left_mem_data_Write;
Left_Mem_Mux(1).Data_Out <= left_mem_data_Data_Out; -- when Left_Mem_Mux(1).Akk = '0' else Left_Mem_Mux(1).Data_out;
Left_Mem_Mux(1).Req      <= left_mem_data_Req;

Right_Mem_Mux(1).Addr     <= Right_mem_data_Addr;
Right_Mem_Mux(1).Write    <= Right_mem_data_Write;
Right_Mem_Mux(1).Data_Out <= Right_mem_data_Data_Out; -- when Right_Mem_Mux(1).Akk = '0' else Right_Mem_Mux(1).Data_out;
Right_Mem_Mux(1).Req      <= Right_mem_data_Req;

-- --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
-- --@@
-- --@@ LEFT I/O CONNECTOR INTERFACE : The following
-- --@@ component provides an interface to the left I/O
-- --@@ connector on the WILDCARD(tm). Uncomment the
-- --@@ interface below to use the left I/O connector.
-- --@@
-- --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
--
-- U_Left_IO : IO_Conn_Std_IF
--   port map
--   (
--     Pads     => Pads.Left_IO,
--     User_In  => Left_IO_In,
--     User_Out => Left_IO_Out
--   );
--
-- --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
-- --@@
-- --@@ RIGHT I/O CONNECTOR INTERFACE : The following
-- --@@ component provides an interface to the right I/O
-- --@@ connector on the WILDCARD(tm). Uncomment the
-- --@@ interface below to use the right I/O connector.
-- --@@
-- --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
--
-- U_Right_IO : IO_Conn_Std_IF
--   port map
--   (
--     Pads     => Pads.Right_IO,
--     User_In  => Right_IO_In,
--     User_Out => Right_IO_Out
--   );
--
-- --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
-- --@@
-- --@@ AUDIO INTERFACE : Uncomment the following
-- --@@ interface to use the audio port.
-- --@@
-- --@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
--
-- U_Audio : Audio_Std_IF
--   port map
--   (
--     Clk          => Clocks_In.K_Clk,
--     Global_Reset => Global_Reset,
--     Pads         => Pads.Audio,
--     User_Out     => Audio_Out
--   );

----------------------------------------------------------------
-- NOTE : The following line must remain in all designs
-- to ensure that all of the PE pads are driven.
----------------------------------------------------------------
Init_PE_Pads ( Pads );

end architecture;
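In the PE architecture above, valid_numsp is formed as val_nsp_d and not val_nsp_dd, which turns the register-write strobe into a pulse one clock cycle wide. The standalone entity below is a minimal sketch of that two-flip-flop rising-edge detector. The entity and port names (strobe_to_pulse, strobe, pulse) are hypothetical and do not appear in the thesis, and the sketch assumes the strobe is already synchronous to clk.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical helper: emits a single-cycle pulse on each rising edge
-- of strobe.
entity strobe_to_pulse is
  port (
    clk    : in  std_logic;
    reset  : in  std_logic;
    strobe : in  std_logic;
    pulse  : out std_logic
  );
end strobe_to_pulse;

architecture rtl of strobe_to_pulse is
  signal strobe_d  : std_logic;  -- strobe delayed by one clock
  signal strobe_dd : std_logic;  -- strobe delayed by two clocks
begin

  process (clk, reset)
  begin
    if reset = '1' then
      strobe_d  <= '0';
      strobe_dd <= '0';
    elsif rising_edge(clk) then
      strobe_d  <= strobe;
      strobe_dd <= strobe_d;
    end if;
  end process;

  -- High only in the cycle where the one-clock delay has seen the strobe
  -- and the two-clock delay has not, i.e. for exactly one cycle per edge.
  pulse <= strobe_d and not strobe_dd;

end rtl;
```

However long the strobe stays high, the pulse lasts one cycle, so the downstream registers that load numsp see a single load event per host write.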

----------------------------------------------------------------
-- 32-bit Register entity - architecture
----------------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : reg.vhd
-- Entity       : reg
-- Architecture : reg_beh
----------------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity reg is
  port(
    data   : in  std_logic_vector(31 downto 0);
    reset  : in  std_logic;
    clk    : in  std_logic;
    regen  : in  std_logic;
    regclr : in  std_logic;
    regval : out std_logic_vector(31 downto 0)
  );
end reg;

architecture reg_beh of reg is
begin

  process(clk, reset)
  begin
    if ( reset = '1' ) then
      regval <= (others => '0');
    elsif (Rising_Edge ( clk )) then
      if regclr = '1' then
        regval <= (others => '0');
      elsif regen = '1' then
        regval <= data;
      end if;
    end if;
  end process;

end reg_beh;

----------------------------------------------------------------
-- 10-bit Register entity - architecture
----------------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : reg2.vhd
-- Entity       : reg_2
-- Architecture : reg2_beh
----------------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity reg_2 is
  port(
    data   : in  std_logic_vector(9 downto 0);
    reset  : in  std_logic;
    clk    : in  std_logic;
    regen  : in  std_logic;
    regclr : in  std_logic;
    regval : out std_logic_vector(9 downto 0));
end reg_2;

architecture reg2_beh of reg_2 is
begin

  process(clk, reset)
  begin
    if ( reset = '1' ) then
      regval <= (others => '0');
    elsif ( Rising_Edge( clk )) then
      if regclr = '1' then
        regval <= (others => '0');
      elsif regen = '1' then
        regval <= data;
      end if;
    end if;
  end process;

end reg2_beh;

----------------------------------------------------------------
-- 16-bit Register entity - architecture
----------------------------------------------------------------
-- Author       : Sreesa Akella
-- File         : reg3.vhd
-- Entity       : reg_3
-- Architecture : reg3_beh
----------------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity reg_3 is
  port(
    data   : in  std_logic_vector(15 downto 0);
    reset  : in  std_logic;
    clk    : in  std_logic;
    regen  : in  std_logic;
    regclr : in  std_logic;
    regval : out std_logic_vector(15 downto 0));
end reg_3;

architecture reg3_beh of reg_3 is
begin

  process(clk, reset)
  begin
    if ( reset = '1' ) then
      regval <= (others => '0');
    elsif ( Rising_Edge( clk )) then
      if regclr = '1' then
        regval <= (others => '0');
      elsif regen = '1' then
        regval <= data;
      end if;
    end if;
  end process;

end reg3_beh;
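The three register entities above (reg, reg_2 and reg_3) differ only in data width. As an illustration, and not as part of the thesis design, the same behaviour could be written once with a width generic. The entity name reg_n and the generic WIDTH are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical width-generic register with asynchronous reset,
-- synchronous clear and load enable (clear has priority over load).
entity reg_n is
  generic (
    WIDTH : positive := 32
  );
  port (
    data   : in  std_logic_vector(WIDTH - 1 downto 0);
    reset  : in  std_logic;
    clk    : in  std_logic;
    regen  : in  std_logic;
    regclr : in  std_logic;
    regval : out std_logic_vector(WIDTH - 1 downto 0)
  );
end reg_n;

architecture reg_n_beh of reg_n is
begin

  process (clk, reset)
  begin
    if reset = '1' then
      regval <= (others => '0');
    elsif rising_edge(clk) then
      if regclr = '1' then
        regval <= (others => '0');
      elsif regen = '1' then
        regval <= data;
      end if;
    end if;
  end process;

end reg_n_beh;
```

Under this sketch, an instance with generic map (WIDTH => 10) would stand in for reg_2 and one with WIDTH => 16 for reg_3, while the default of 32 matches reg.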

------Top Component UPGMA entity - architecture ------Author : Sreesa Akella -- File : upgma_top.vhd -- Entity : upgma_top -- Architecture : upgma_struct ------library ieee; use ieee.std_logic_1164.all; use ieee.std_logic_arith.all; use work.std_logic_prims.all; entity upgma_top is port(clk : in std_logic;

149 Reset : in std_logic; Data_in : in std_logic_vector(31 downto 0); valid_numsp : in std_logic; valid_dst : in std_logic; addr_cnt : out std_logic_vector(9 downto 0); row_cnt : out std_logic_vector(9 downto 0); ext_node : out std_logic; rmem_read : out std_logic; rmem_write : out std_logic; ad_reg_en : out std_logic; ad_reg_clr : out std_logic; row_zero : out std_logic; mem_addr : out std_logic_vector(19 downto 0); read_mem : out std_logic; write_mem : out std_logic; numsp_1 : out std_logic_vector(9 downto 0); numsp : in std_logic_vector(9 downto 0); avg_dst : out std_logic_vector(31 downto 0); trout : out std_logic_vector(36 downto 0); valid_td : out std_logic; done : out std_logic); end upgma_top; architecture upgma_struct of upgma_top is component adder port(Datainp1 : in std_logic_vector(31 downto 0); Datainp2 : in std_logic_vector(31 downto 0); Data_out : out std_logic_vector(31 downto 0)); end component; component adder_w port(Datainp1 : in std_logic_vector(15 downto 0); Datainp2 : in std_logic_vector(15 downto 0); Data_out : out std_logic_vector(15 downto 0)); end component; component adderwreg port(addout : in std_logic_vector(15 downto 0); reset : in std_logic; clk : in std_logic; regen : in std_logic; regclr : in std_logic; regval : out std_logic_vector(15 downto 0)); end component;

  component ctrl_blk
    port(clk : in std_logic;
         reset : in std_logic;
         valid_numsp : in std_logic;
         addr_grt : in std_logic;
         child_cnt_gr : in std_logic;
         count_gr : in std_logic;
         all_nodes_done : in std_logic;
         initialized : in std_logic;
         div_valid : in std_logic;
         a_grt : in std_logic;
         ext_node : in std_logic;
         r_clr : out std_logic;
         r_inc : out std_logic;
         a_clr : out std_logic;
         a_inc : out std_logic;
         c2_read : out std_logic;
         mem_update : out std_logic;
         R_dec : out std_logic;
         Rp_dec : out std_logic;
         c1_incr : out std_logic;
         c2_incr : out std_logic;
         c1p_incr : out std_logic;
         ch_incr : out std_logic;
         c1_load1 : out std_logic;
         c1_load2 : out std_logic;
         c2_load1 : out std_logic;
         c2_load2 : out std_logic;
         c1p_load : out std_logic;
         ch_load : out std_logic;
         c1_clr : out std_logic;
         c2_clr : out std_logic;
         c1p_clr : out std_logic;
         ch_clr : out std_logic;
         row_col_sel : out std_logic_vector(1 downto 0);
         addr2_reg_dec : out std_logic;
         node_write : out std_logic;
         read_mem : out std_logic;
         write_mem : out std_logic;
         read_wmem : out std_logic;
         write_wmem : out std_logic;
         rowreg_clr : out std_logic;
         colreg_clr : out std_logic;
         distreg_clr : out std_logic;
         mulreg_en : out std_logic;
         mulreg_clr : out std_logic;
         addregwclr : out std_logic;
         addregwen : out std_logic;
         addregclr : out std_logic;
         addregen : out std_logic;
         divreg1clr : out std_logic;
         divreg1en : out std_logic;
         initial_run : out std_logic;
         store_cur_addr : out std_logic;
         node_mem_initialize : out std_logic;
         mem_initialize : out std_logic;
         addr_gen1_en : out std_logic;
         addr_gen2_en : out std_logic;
         rmem_read : out std_logic;
         rmem_write : out std_logic;
         ad_reg_en : out std_logic;
         ad_reg_clr : out std_logic;
         row_zero : out std_logic;
         numsp_val : out std_logic;
         valid_td : out std_logic;
         nodeid_sel : out std_logic_vector(1 downto 0);
         n_type_sel : out std_logic;
         incnt_inc : out std_logic;
         done : out std_logic );
  end component;

  component opt_sel
    port(clk : in std_logic;
         reset : in std_logic;
         valid_numsp : in std_logic;
         numsp : in std_logic_vector(9 downto 0);
         row_reg : in std_logic_vector(9 downto 0);
         col_reg : in std_logic_vector(9 downto 0);
         store_cur_addr : in std_logic;
         node_mem_initialize : in std_logic;
         mem_initialize : in std_logic;
         addr_gen1_en : in std_logic;
         addr_gen2_en : in std_logic;
         c2_read : in std_logic;
         mem_update : in std_logic;
         addr1_reg_en : in std_logic;
         addr2_reg_en : in std_logic;
         R_dec : in std_logic;
         Rp_dec : in std_logic;
         c1_incr : in std_logic;
         c2_incr : in std_logic;
         c1p_incr : in std_logic;
         ch_incr : in std_logic;
         c1_load1 : in std_logic;
         c1_load2 : in std_logic;
         c2_load1 : in std_logic;
         c2_load2 : in std_logic;
         c1p_load : in std_logic;
         ch_load : in std_logic;
         c1_clr : in std_logic;
         c2_clr : in std_logic;
         c1p_clr : in std_logic;
         ch_clr : in std_logic;
         row_col_sel : in std_logic_vector(1 downto 0);
         addr2_reg_dec : in std_logic;
         node_write : in std_logic;
         distreg_val : in std_logic_vector(31 downto 0);
         nodeid_sel : in std_logic_vector(1 downto 0);
         n_type_sel : in std_logic;
         incnt_inc : in std_logic;
         initial_run : in std_logic;
         r_clr : in std_logic;
         r_inc : in std_logic;
         a_clr : in std_logic;
         a_inc : in std_logic;
         a_grt : out std_logic;
         ext_node : out std_logic;
         numsp_1 : out std_logic_vector(9 downto 0);
         addr_cnt : out std_logic_vector(9 downto 0);
         row_cnt : out std_logic_vector(9 downto 0);
         first_val : out std_logic;
         initialized : out std_logic;
         addr_grt : out std_logic;
         child_cnt_gr : out std_logic;
         count_gr : out std_logic;
         all_nodes_done : out std_logic;
         addr : out std_logic_vector(19 downto 0);
         cnt1 : out std_logic_vector(9 downto 0);
         nodeid : out std_logic_vector(9 downto 0);
         n_type : out std_logic;
         par : out std_logic_vector(9 downto 0);
         br_len : out std_logic_vector(15 downto 0));
  end component;

  component mult
    port(Datainp1 : in std_logic_vector(31 downto 0);
         Datainp2 : in std_logic_vector(15 downto 0);
         Data_out : out std_logic_vector(31 downto 0));
  end component;

  component divider
    port(datainp1 : in std_logic_vector(31 downto 0);
         divider : in std_logic_vector(15 downto 0);
         output : out std_logic_vector(31 downto 0);
         valid : out std_logic);
  end component;

  component comparedst
    port(Datainp1 : in std_logic_vector(31 downto 0);
         valid_dst : in std_logic;
         distreg_val : in std_logic_vector(31 downto 0);
         first_val : in std_logic;
         addr : in std_logic_vector(19 downto 0);
         distreginp : out std_logic_vector(31 downto 0);
         distreg_en : out std_logic;
         rowreginp : out std_logic_vector(9 downto 0);
         rowreg_en : out std_logic;
         colreginp : out std_logic_vector(9 downto 0);
         colreg_en : out std_logic;
         addr1_reg_en : out std_logic;
         addr2_reg_en : out std_logic);
  end component;

  component numofspreg
    port(numsp : in std_logic_vector(9 downto 0);
         reset : in std_logic;
         clk : in std_logic;
         valid_in : in std_logic;
         valid : out std_logic;
         regval : out std_logic_vector(9 downto 0));
  end component;

  component wmemory
    port ( Clk : in std_logic;
           Reset : in std_logic;
           Read : in std_logic;
           Write : in std_logic;
           numsp : in std_logic_vector(9 downto 0);
           Addr : in std_logic_vector(9 downto 0);
           Data : in std_logic_vector(15 downto 0);
           Data_out : out std_logic_vector(15 downto 0) );
  end component;

  component reg
    port(data : in std_logic_vector(31 downto 0);
         reset : in std_logic;
         clk : in std_logic;
         regen : in std_logic;
         regclr : in std_logic;
         regval : out std_logic_vector(31 downto 0));
  end component;

  component reg_2
    port(data : in std_logic_vector(9 downto 0);
         reset : in std_logic;
         clk : in std_logic;
         regen : in std_logic;
         regclr : in std_logic;
         regval : out std_logic_vector(9 downto 0));
  end component;

  component addr_dd
    port(addr : in std_logic_vector(19 downto 0);
         reset : in std_logic;
         clk : in std_logic;
         addr_dd_s : out std_logic_vector(19 downto 0));
  end component;

  component first_val_ddd
    port(first_val : in std_logic;
         reset : in std_logic;
         clk : in std_logic;
         first_val_ddd_s : out std_logic);
  end component;

  component d_f
    port(inp : in std_logic;
         reset : in std_logic;
         clk : in std_logic;
         opt : out std_logic);
  end component;

  -- Multiplier and multiplier reg component signals
  signal mult_out : std_logic_vector(31 downto 0);
  signal mulreg_val : std_logic_vector(31 downto 0);

  -- Adder and adderreg component signals
  signal adderout : std_logic_vector(31 downto 0);
  signal adderregval : std_logic_vector(31 downto 0);

  -- Adderw and adderwreg component signals
  signal adderw_out : std_logic_vector(15 downto 0);
  signal adderw_regval : std_logic_vector(15 downto 0);

  -- Colreg component signals
  signal Colregval : std_logic_vector(9 downto 0);

  -- Controller component signals
  signal store_cur_addr : std_logic;
  signal a_clr, a_inc : std_logic;
  signal r_clr, r_inc : std_logic;
  signal read_wmem : std_logic;
  signal write_wmem : std_logic;
  signal rowreg_en : std_logic;
  signal rowreg_clr : std_logic;
  signal colreg_en : std_logic;
  signal colreg_clr : std_logic;
  signal distreg_clr : std_logic;
  signal distreg_en : std_logic;
  signal mulreg_en : std_logic;
  signal mulreg_clr : std_logic;
  signal adderw_valid : std_logic;
  signal addregwclr : std_logic;
  signal addregwen : std_logic;
  signal addregclr : std_logic;
  signal addregen : std_logic;
  signal divreg1en : std_logic;
  signal divreg1clr : std_logic;
  signal initial_run : std_logic;
  signal numsp_val : std_logic;
  signal node_mem_initialize : std_logic;
  signal mem_initialize : std_logic;
  signal addr_gen1_en : std_logic;
  signal addr_gen2_en : std_logic;
  signal nodeid_sel : std_logic_vector(1 downto 0);
  signal n_type_sel : std_logic;
  signal spcnt_inc : std_logic;
  signal incnt_inc : std_logic;
  signal node_sel_mem_en : std_logic;
  signal parent_mem_en : std_logic;
  signal mem_update : std_logic;
  signal c2_read : std_logic;
  signal R_dec : std_logic;
  signal Rp_dec : std_logic;
  signal c1_incr : std_logic;
  signal c2_incr : std_logic;
  signal c1p_incr : std_logic;
  signal ch_incr : std_logic;
  signal c1_load1 : std_logic;
  signal c1_load2 : std_logic;
  signal c2_load1 : std_logic;
  signal c2_load2 : std_logic;
  signal c1p_load : std_logic;
  signal ch_load : std_logic;
  signal c1_clr : std_logic;
  signal c2_clr : std_logic;
  signal c1p_clr : std_logic;
  signal ch_clr : std_logic;
  signal row_col_sel : std_logic_vector(1 downto 0);
  signal addr2_reg_dec : std_logic;
  signal node_write : std_logic;

  -- Optsel signals
  signal a_grt : std_logic;
  signal ex_n : std_logic;
  signal initialized : std_logic;
  signal addr_grt : std_logic;
  signal child_cnt_gr : std_logic;
  signal count_gr : std_logic;
  signal all_nodes_done : std_logic;
  signal addr : std_logic_vector(19 downto 0);
  signal cnt1 : std_logic_vector(9 downto 0);
  signal first_val : std_logic;
  signal nodeid : std_logic_vector(9 downto 0);
  signal n_type : std_logic;
  signal par : std_logic_vector(9 downto 0);
  signal br_len : std_logic_vector(15 downto 0);

  -- Divider component signals
  signal div_out : std_logic_vector(31 downto 0);
  signal div_valid : std_logic;

  -- Divider reg 1 component signals
  signal div_reg_val1 : std_logic_vector(31 downto 0);

  -- Least Distance reg component signals
  signal distreg_val : std_logic_vector(31 downto 0);

  -- Comparedst component signals
  signal distreginp : std_logic_vector(31 downto 0);
  signal rowreginp : std_logic_vector(9 downto 0);
  signal colreginp : std_logic_vector(9 downto 0);
  signal addr1_reg_en : std_logic;
  signal addr2_reg_en : std_logic;

  -- numspreg component signals
  signal numspreg_val : std_logic_vector(9 downto 0);
  signal numspreg_valid : std_logic;

  -- Rowreg component signals
  signal Rowregval : std_logic_vector(9 downto 0);

  -- Weight memory component signals
  signal WData : std_logic_vector(15 downto 0);
  signal WAddr : std_logic_vector(9 downto 0);
  signal vone : std_logic_vector(15 downto 0);
  signal WData_in : std_logic_vector(15 downto 0);

  -- Addr twice register
  signal addr_dd_s : std_logic_vector(19 downto 0);

  -- first_val registered thrice
  signal first_val_ddd_s : std_logic;

begin

Mul : mult port map(Data_in, WData_in, mult_out);

mulreg : reg port map(mult_out, Reset, Clk, mulreg_en, mulreg_clr, mulreg_val);

U0 : adder port map(mulreg_val, adderregval, adderout); adderreg : reg port map(adderout, Reset, Clk, addregen, addregclr, adderregval);

U011 : adder_w port map(WData_in, adderw_regval, adderw_out);

U012 : adderwreg port map(adderw_out, reset, clk, addregwen, addregwclr, adderw_regval);

Colreg : reg_2 port map(Colreginp, Reset, Clk, colreg_en, colreg_clr, Colregval);

U3 : ctrl_blk port map(
  clk => clk,
  reset => reset,
  valid_numsp => valid_numsp,
  addr_grt => addr_grt,
  child_cnt_gr => child_cnt_gr,
  count_gr => count_gr,
  all_nodes_done => all_nodes_done,
  initialized => initialized,
  div_valid => div_valid,
  a_grt => a_grt,
  ext_node => ex_n,
  r_clr => r_clr,
  r_inc => r_inc,
  a_clr => a_clr,
  a_inc => a_inc,
  c2_read => c2_read,
  mem_update => mem_update,
  R_dec => R_dec,
  Rp_dec => Rp_dec,
  c1_incr => c1_incr,
  c2_incr => c2_incr,
  c1p_incr => c1p_incr,
  ch_incr => ch_incr,
  c1_load1 => c1_load1,
  c1_load2 => c1_load2,
  c2_load1 => c2_load1,
  c2_load2 => c2_load2,
  c1p_load => c1p_load,
  ch_load => ch_load,
  c1_clr => c1_clr,
  c2_clr => c2_clr,
  c1p_clr => c1p_clr,
  ch_clr => ch_clr,
  row_col_sel => row_col_sel,
  addr2_reg_dec => addr2_reg_dec,
  node_write => node_write,
  read_mem => read_mem,
  write_mem => write_mem,
  read_wmem => read_wmem,
  write_wmem => write_wmem,
  rowreg_clr => rowreg_clr,
  colreg_clr => colreg_clr,
  distreg_clr => distreg_clr,
  mulreg_en => mulreg_en,
  mulreg_clr => mulreg_clr,
  addregwclr => addregwclr,
  addregwen => addregwen,
  addregclr => addregclr,
  addregen => addregen,
  divreg1clr => divreg1clr,
  divreg1en => divreg1en,
  initial_run => initial_run,
  store_cur_addr => store_cur_addr,
  node_mem_initialize => node_mem_initialize,
  mem_initialize => mem_initialize,
  addr_gen1_en => addr_gen1_en,
  addr_gen2_en => addr_gen2_en,
  rmem_read => rmem_read,
  rmem_write => rmem_write,
  ad_reg_en => ad_reg_en,
  ad_reg_clr => ad_reg_clr,
  row_zero => row_zero,
  numsp_val => numsp_val,
  valid_td => valid_td,
  nodeid_sel => nodeid_sel,
  n_type_sel => n_type_sel,
  incnt_inc => incnt_inc,
  done => done );

U_Addr_dd : addr_dd port map (addr, reset, clk, addr_dd_s);

U_fv_ddd : first_val_ddd port map (first_val, reset, clk, first_val_ddd_s);

Opt_gen : opt_sel port map (
  clk, reset, valid_numsp, numsp,
  rowregval, colregval,
  store_cur_addr, node_mem_initialize, mem_initialize,
  addr_gen1_en, addr_gen2_en,
  c2_read, mem_update,
  addr1_reg_en, addr2_reg_en,
  R_dec, Rp_dec,
  c1_incr, c2_incr, c1p_incr, ch_incr,
  c1_load1, c1_load2, c2_load1, c2_load2, c1p_load, ch_load,
  c1_clr, c2_clr, c1p_clr, ch_clr,
  row_col_sel, addr2_reg_dec, node_write,
  distreg_val, nodeid_sel, n_type_sel,
  incnt_inc, initial_run,
  r_clr, r_inc, a_clr, a_inc,
  a_grt, ex_n,
  numsp_1, addr_cnt, row_cnt,
  first_val, initialized, addr_grt,
  child_cnt_gr, count_gr, all_nodes_done,
  addr, cnt1, nodeid, n_type, par, br_len);

U4 : divider port map (adderregval, adderw_regval, div_out, div_valid);

divregister1 : reg port map(div_out, Reset, Clk, divreg1en, divreg1clr, avg_dst);

leastdistreg : reg port map(distreginp, Reset, Clk, distreg_en, distreg_clr, distreg_val);

U15 : comparedst port map(Data_in, valid_dst, distreg_val, first_val_ddd_s, addr_dd_s, distreginp, distreg_en, rowreginp, rowreg_en, colreginp, colreg_en, addr1_reg_en, addr2_reg_en);

U10 : numofspreg port map(numsp, Reset, Clk, numsp_val, numspreg_valid, numspreg_val);

Rowreg : reg_2 port map(Rowreginp, Reset, Clk, rowreg_en, rowreg_clr, Rowregval); vone <= "0000000000000001";

WData <= vone when node_mem_initialize = '1' else adderw_regval;

WAddr <= cnt1 when node_mem_initialize = '1' else addr(19 downto 10);

U12 : wmemory port map(Clk, Reset, read_wmem, write_wmem, numsp, WAddr, WData, WData_in); trout <= n_type & nodeid & par & br_len; mem_addr <= addr;

U20 : d_f port map(ex_n, Reset, Clk, ext_node); end upgma_struct;
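The structural wiring above suggests how the average distance is formed: each incoming distance is multiplied by the weight read from the weight memory (`Mul`), the products are accumulated in `adderreg`, the weights are accumulated in `adderwreg`, and the divider produces `avg_dst` from the two sums. The C sketch below models that datapath in software under several assumptions that the listing alone does not confirm: that the weight is the number of leaves in a cluster (the weight memory is initialized with the constant one), that arithmetic wraps at the register widths, and that the divider truncates toward zero. The function name `upgma_weighted_avg` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software model of the multiply, accumulate, and divide path.
 * dist[i]   : distance word (models Data_in, 32 bits)
 * weight[i] : cluster weight (models WData_in, 16 bits)
 * Assumption: the hardware divider truncates and the accumulators
 * wrap at 32 and 16 bits respectively. */
static uint32_t upgma_weighted_avg(const uint32_t *dist,
                                   const uint16_t *weight,
                                   size_t n)
{
    uint32_t dist_acc = 0;    /* models adderregval   */
    uint16_t weight_acc = 0;  /* models adderw_regval */
    size_t i;

    assert(n == 0 || (dist != NULL && weight != NULL));

    for (i = 0; i < n; i++) {
        dist_acc += dist[i] * (uint32_t)weight[i];
        weight_acc = (uint16_t)(weight_acc + weight[i]);
    }

    /* Guard added for the software model only; how the hardware
     * divider behaves for a zero divisor is not shown in the listing. */
    if (weight_acc == 0) {
        return 0;
    }
    return dist_acc / weight_acc;
}
```

For example, merging a cluster of weight 1 at distance 10 with a cluster of weight 3 at distance 20 gives (10 + 60) / 4, which truncates to 17 in this model.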

APPENDIX B

CUSTOM COMPUTING MACHINE HOST PROGRAM SOURCE CODE

UPGMA_ex.h

#ifndef __UPGMATEST_H__
#define __UPGMATEST_H__

/****************************************************
 *
 * Constants and Macros
 *
 ****************************************************/

#define DEFAULT_VERBOSITY ( FALSE )
#define DEFAULT_SLOT_NUMBER ( 0 )
#define DEFAULT_ITERATIONS ( 1 )
#define DEFAULT_FREQUENCY ( 100.0 )

#define IMAGE_FILENAME ( "pe_addr_mod" )
#define IMAGE_FILENAME_REVD ( "pe_addr_mod" )

#define MEM_BASE ( 0x0 )
#define LEFT_MEM_OFFSET ( 0x1000 )
#define RIGHT_MEM_OFFSET ( 0x1200 )

#define MAX_ERR_COUNT ( 32 )

#define NUM_REGISTERS ( 2 )
#define REGISTER_OFFSET ( 0x2000 )

typedef struct _TestInfo_
{
  WC_DeviceNum DeviceNum;
  WC_DevConfig DeviceCfg;
  WC_Version Version;
  DWORD dIterations;
  float fClkFreq;
  BOOLEAN bVerbose;
} WC_TestInfo;

/**************************************************** * * Prototypes * ****************************************************/

WC_RetCode WC_UPGMATest_Main( WC_TestInfo *TestInfo );

WC_RetCode WC_UPGMATest_Init( WC_TestInfo *TestInfo );

WC_RetCode WC_UPGMATest_Run( WC_TestInfo *TestInfo );

WC_RetCode WC_UPGMATest_Shutdown( WC_TestInfo *TestInfo );

WC_RetCode VerifyData(DWORD ref[], DWORD test[], DWORD size);

#endif

UPGMA_ex.c

/**************************************************************************** * * File : UPGMAtest.c * * Project : UPGMA on Wildcard * * Copyright : Sreesa Akella, Reconfigurable Computing Research Lab 2003 * ****************************************************************************/ #include #include #include #if defined(WIN32) #include #endif #include "wcdefs.h" #include "wc_shared.h" #include "UPGMA_ex.h" #include "LAD_Mem_Bridge_WC.h"

/****************************************************************************
 *
 * Function : main
 *
 * Description : Entry point for the WILDCARD
 *   This function is a basic entry point into the test.
 *   It is responsible for
 *   1) Parsing the command line parameters and filling the
 *      TestInfo struct with those parameters
 *   2) Opening the WILDCARD(tm) board
 *   3) Calling the main example procedure
 *   4) Closing the board when the example completes
 *
 ****************************************************************************/
WC_RetCode main( int argc, char *argv [] )
{
WC_RetCode rc = WC_SUCCESS;

int argi;

WC_TestInfo TestInfo;

WC_CardType CardType;
char **TestLoc = NULL;
const char * help_string =
  "Usage: memtest \n"
  " Options:\n"
  " -v Sets verbose mode. Show progress messages.\n"
  " -s Set WILDCARD(tm) device \"slot\" number (default = 0).\n"
  " -i Sets the number of times to perform the example.\n"
  " (default = 1)\n"
  " -f Set the memory clock frequency in MHz (default = 100.0)\n"
  " -h Show this help.\n";

fprintf( stdout, "WILDCARD(tm) UPGMA_Test Example\n");

TestInfo.bVerbose = DEFAULT_VERBOSITY; TestInfo.DeviceNum = DEFAULT_SLOT_NUMBER; TestInfo.dIterations = DEFAULT_ITERATIONS; TestInfo.fClkFreq = DEFAULT_FREQUENCY;

/* Parse the command line parameters */
for ( argi = 1; argi < argc; argi++ )
{
if ( argv[argi][0] == '-' )
{
switch ( toupper(argv[argi][1]) )
{
case 'H': /* Print the help message */
  fprintf ( stdout, "%s\n\n", help_string );
  return(WC_SUCCESS);
  break;

case 'I': /* Set the number of iterations */
  argi++;
  /* Treat a missing value as zero so that argv is never read past its end */
  TestInfo.dIterations = (argi < argc) ? strtoul( argv [ argi ], TestLoc, 0 ) : 0;
  /* Error Check the result.
   * The following test is true if the value is missing, is zero,
   * or could not be converted from the string above */
  if (TestInfo.dIterations == 0)
  {
    fprintf( stdout, "\nWARNING: An invalid or missing iteration value\n");
    fprintf( stdout, " was found after the -i option.\n\n");
    fprintf( stdout, "%s\n\n", help_string );
    return (ERROR_UNKNOWN_SWITCH);
  }

fprintf( stdout, "Setting the iteration value to %d\n",TestInfo.dIterations); break;

case 'S': /* Set the device number */
  argi++;
  /* Reject a missing value before argv is read */
  if (argi >= argc)
  {
    fprintf( stdout, "\n WARNING: Invalid device number!\n");
    return (ERROR_UNKNOWN_SWITCH);
  }
  TestInfo.DeviceNum = strtoul( argv [ argi ], TestLoc, 0 );
  /* The following tests for a valid slot number */
  if (TestInfo.DeviceNum > WC_MAX_DEVICES)
  {
    fprintf( stdout, "\n WARNING: Invalid device number!\n");
    return (ERROR_UNKNOWN_SWITCH);
  }
  else
  {
    fprintf( stdout, " Setting the device number to %d.\n", TestInfo.DeviceNum);
  }
  break;

case 'F': /* Set Frequency */ argi++; if (argi < argc) { TestInfo.fClkFreq = (float) atof( argv [ argi ] ); } else { printf( "\n WARNING: Invalid Frequency option\n" ); printf ( "%s\n\n", help_string ); return(ERROR_UNKNOWN_SWITCH); } if (( TestInfo.fClkFreq < WC_MIN_FCLK_MHZ ) || ( TestInfo.fClkFreq > WC_MAX_FCLK_MHZ )) { printf( "\n WARNING: %3.2f is an invalid Frequency option\n", TestInfo.fClkFreq ); printf ( "%s\n\n", help_string ); return(ERROR_UNKNOWN_SWITCH); } break;

case 'V': /* Show all Errors & set maximum verbosity */ TestInfo.bVerbose=TRUE; fprintf( stdout, " Setting Maximum Verbosity.\n"); break;

default: /* Unknown switch option */
  fprintf ( stderr, "\n WARNING: Unknown option: \"%s\"\n", argv [ argi ] );
  fprintf ( stderr, "%s\n\n", help_string );
  return( ERROR_UNKNOWN_SWITCH );
}
}
else /* Missing the '-' */
{
  fprintf ( stderr, "\n WARNING: Unknown option: \"%s\"\n", argv[argi] );
  fprintf ( stderr, "%s\n\n", help_string );
  return(ERROR_UNKNOWN_SWITCH);
}
}

/* The WILDCARD(tm) MUST be opened before doing any type of * * access to the card. */ if (TestInfo.bVerbose) { fprintf(stdout,"\n Opening Device %d...\n", TestInfo.DeviceNum); }

rc = WC_Open( TestInfo.DeviceNum, 0 );
DISPLAY_ERROR(rc);

/* If you are using both the WILDCARD(tm) and the WILDCARD(tm)-II
 * it is a good idea to check the board type before executing any
 * calls. For this example we must have a WILDCARD(tm). */
rc = WC_GetCardType( TestInfo.DeviceNum, &CardType );
if (rc!=WC_SUCCESS)
{
  DisplayError(rc);
  return 0;
}
else if (CardType != WILDCARD)
{
  printf("\nERROR : This example requires a WILDCARD(tm).\n It will not run on WILDCARD(tm)-II!\n\n");
  return 0;
}

/* Once the board is successfully opened, the test may be run */ rc = WC_UPGMATest_Main( &TestInfo); if (rc!=WC_SUCCESS) { DisplayError(rc); }

/* The WILDCARD(tm) should be closed when the program finishes to * * free driver resources */ rc = WC_Close( TestInfo.DeviceNum ); DISPLAY_ERROR(rc); return(rc); }
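The option parsing in `main` relies on `strtoul` returning zero to detect a bad `-i` value, which cannot distinguish a literal `0` from text that is not a number. One possible refinement, given here only as a sketch and not as part of the original host program, is a small helper that uses the end pointer and `errno` to report conversion failures explicitly. The name `parse_ulong_arg` is hypothetical.

```c
#include <errno.h>
#include <stdlib.h>

/* Convert a command line argument to an unsigned long.
 * Returns 1 and stores the value in *out on success, 0 on failure.
 * Hypothetical helper; not part of the WILDCARD(tm) API. */
static int parse_ulong_arg(const char *text, unsigned long *out)
{
    char *end = NULL;
    unsigned long v;

    /* Missing or empty argument */
    if (text == NULL || *text == '\0') {
        return 0;
    }
    /* strtoul silently negates input that starts with '-', so reject it */
    if (text[0] == '-') {
        return 0;
    }

    errno = 0;
    v = strtoul(text, &end, 0);   /* base 0 accepts decimal, octal, and hex */

    /* No digits consumed, trailing characters, or out of range */
    if (end == text || *end != '\0' || errno == ERANGE) {
        return 0;
    }

    *out = v;
    return 1;
}
```

With such a helper, the `-i` and `-s` cases could first check `argi < argc` and then call `parse_ulong_arg(argv[argi], &value)`, printing the help text whenever it returns 0.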

/**************************************************************************** * * Function : WC_UPGMATest_Main * * Parameters : TestInfo - Test Parameters * * Description : Initializes the WILDCARD(tm) hardware, and runs the example * TestInfo->dIterations times. * ****************************************************************************/ WC_RetCode WC_UPGMATest_Main (WC_TestInfo *TestInfo) { DWORD dIteration, dErrorCount;

WC_RetCode rc = WC_SUCCESS;

/* Print out a few parameters so we know what we are running */ if (TestInfo->bVerbose) {

fprintf(stdout,"\n TEST PARAMETERS:\n");
fprintf(stdout," Clock Frequency = %f\n",TestInfo->fClkFreq);
fprintf(stdout," # of Iterations = %d\n",TestInfo->dIterations);
fprintf(stdout," Device Number = %d\n",TestInfo->DeviceNum);
fprintf(stdout," Verbose Mode = %s\n",TestInfo->bVerbose?"TRUE":"FALSE");
}

/* This routine will put the WILDCARD in a known state.
 * We only need to do this before the first iteration.
 * Each additional iteration only needs to reset the
 * PE to initialize the WILDCARD to a known state because
 * all initialization parameters are kept between resets. */
rc = WC_UPGMATest_Init( TestInfo); //CHECK_RC(rc)

/* Now that the PE is initialized, we run the test
 * TestInfo->dIterations times, counting the number of
 * failures as we go. */
for (dIteration = 0,dErrorCount = 0; dIteration < TestInfo->dIterations; dIteration++)
{
  fprintf(stdout,"\n **** Memory Example Iteration [%d] of [%d] ****\n",dIteration, TestInfo->dIterations);
  rc = WC_UPGMATest_Run(TestInfo);
  if (rc != WC_SUCCESS)
  {
    DisplayError(rc);
    dErrorCount++;
  }
}

/* Let the user know if the example was a success */ fprintf(stdout, "\n Example Complete! [%d] of [%d] Successful", TestInfo->dIterations - dErrorCount, TestInfo->dIterations);

if (dErrorCount) { fprintf(stdout, " ERRORS In Example!\n\n"); } else { fprintf(stdout, " Example SUCCESSFUL!\n\n"); }

/* Return SUCCESS if we have made it this far without * * returning. This means that no fatal errors have * * occurred. If any test errors occurred, they have * * already been printed above after each iteration. */ return (WC_SUCCESS); }

/**************************************************************************** * * Function : WC_UPGMATest_Run * * Parameters : TestInfo - Test Parameters *

 * Description : Runs the Memory Example. The hardware for this example
 *   contains an image with a LAD_Mem_Bridge component for both
 *   the left and the right onboard memories. This gives the
 *   host indirect access to the onboard WILDCARD(tm) memories.
 *
 *   This example will write a random pattern to each of the
 *   memories, read it back, and verify that the read and
 *   write contents are equal.
 *
 ****************************************************************************/
WC_RetCode WC_UPGMATest_Run( WC_TestInfo *TestInfo )
{
WC_Mem_Object *Left_Memory, *Right_Memory;

DWORD dNumDwords, *pReadBuffer, *pWriteBuffer, index, no_of_values, no_of_species, temp1, temp2, *darray, *dataArray, addr, rem, *bin, value, j;

BOOLEAN bIntStatus, done;

FILE *f, *f1;
//time_t tStart, tEnd;
//double diffTime = 0.0;

clock_t tStartClk, tEndClk, tend_wr, tend_done;

WC_RetCode rc;

/* The first step in an application is almost always to reset the PE.
 * Although this is not needed for the first iteration of this
 * example because WC_UPGMATest_Init has already reset the PE, we need
 * to do it again here because subsequent iterations need a fresh PE
 * reset.
 *
 * One assumption made below is that the time between the two
 * WC_PeReset calls is sufficient to reset the PE. In this example,
 * and in general, this is true. The reset line need only be high
 * for at most one clock cycle of the longest period clock. */
fprintf(stdout, "\n Resetting PE ... ");
rc=WC_PeReset( TestInfo->DeviceNum, TRUE );
CHECK_RC(rc);

rc=WC_PeReset( TestInfo->DeviceNum, FALSE ); CHECK_RC(rc); fprintf(stdout, "DONE\n");

rc=WC_IntEnable( TestInfo->DeviceNum, TRUE );
CHECK_RC(rc);

/* Reset the interrupts */ rc = WC_IntReset(TestInfo->DeviceNum); CHECK_RC(rc); rc=WC_PeReset( TestInfo->DeviceNum, FALSE ); CHECK_RC(rc); fprintf(stdout, "DONE\n");

/* Check to make sure that the interrupts are cleared */ fprintf(stdout, " Verifying Interrupts are cleared ... "); WC_IntQueryStatus(TestInfo->DeviceNum, &bIntStatus); if (bIntStatus) { fprintf (stdout, "ERROR\n\n Interrupt ERROR : Interrupts were NOT Cleared\n"); return (WC_ERR_INTERRUPT_TIMEOUT); /* Not a good error code, but will work for now */ } fprintf(stdout, "DONE\n");

/* The simplest method of buffer verification is to have different
 * buffers for reading and writing. Below we allocate and initialize
 * these buffers.
 *
 * First, however, we need to find the memory size so we know how
 * large a buffer to get. This information is stored in the
 * device information structure filled by WC_UPGMATest_Init. */

/* The memory port sizes should be equal, but in case they aren't
 * we allocate buffers assuming the largest memory size. */
fprintf( stdout, " Allocating Buffers ... ");
if (TestInfo->DeviceCfg.MemoryDwords[0] >= TestInfo->DeviceCfg.MemoryDwords[1])
{
  dNumDwords = TestInfo->DeviceCfg.MemoryDwords[0];
}
else
{
  dNumDwords = TestInfo->DeviceCfg.MemoryDwords[1];
}
if (TestInfo->bVerbose) fprintf(stdout, "\n * Allocating Read Buffer ... ");
pReadBuffer = malloc(dNumDwords * sizeof(DWORD));
if (!pReadBuffer) return (ERROR_MEMORY_ALLOC);
/* memset counts bytes, so the DWORD count is scaled by the element size */
memset(pReadBuffer, 0, dNumDwords * sizeof(DWORD));
if (TestInfo->bVerbose) fprintf(stdout,"DONE\n * Allocating Write Buffer ... ");
pWriteBuffer = malloc(dNumDwords * sizeof(DWORD));
if (!pWriteBuffer)
{
  free(pReadBuffer);
  return (ERROR_MEMORY_ALLOC);
}

/* MY COMMENTS

WE NEED TO INSERT CODE HERE FOR READING THE FILE AND WRITING THE TEST DATA INTO THE BUFFERS.
ALSO dNumDwords SHOULD BE SET TO THE NO OF WORDS TO BE WRITTEN BASED ON THE NUMBER OF TAXON

FOR NOW WE CAN SET THE BUFFER FOR 4 TAXON DATA - 6 WORDS

*/

/* memset counts bytes, so each length below is scaled by sizeof(DWORD) */
darray = malloc(dNumDwords * sizeof(DWORD));
if (!darray) return (ERROR_MEMORY_ALLOC);
memset(darray, 0, dNumDwords * sizeof(DWORD));

dataArray = malloc(dNumDwords * sizeof(DWORD));
if (!dataArray) return (ERROR_MEMORY_ALLOC);
memset(dataArray, 0, dNumDwords * sizeof(DWORD));

bin = malloc(10 * sizeof(DWORD));
if (!bin) return (ERROR_MEMORY_ALLOC);
memset(bin, 0, 10 * sizeof(DWORD));

//For 4 taxon data //no_of_values = 6; //no_of_species = 4;

// READ DATA FROM FILE AND PUT IT IN darray
f = fopen("C:/akella/thesis/testdatagen/testdata/taxon16/WCprogram/testdata_16D2.txt", "r");

if(f == NULL){
  fprintf(stdout, "DONE\n Open of test file failed.");
  return 1;
}
fscanf(f, "%d", &value);
no_of_species = value;
no_of_values = (no_of_species * (no_of_species - 1)) / 2;

//printf("\nnoOfSpecies - %d, noOfValues - %d", no_of_species, no_of_values);
for (index = 0; index <= no_of_values; index ++)
  darray[index] = 0;

index = 0;
/* Stop at the first failed conversion or when darray is full */
while(index < dNumDwords && fscanf(f, "%d", &value) == 1){
  darray[index] = value;
  index++;
}
fclose(f);

/// Reading from memory and placing data in array format into dataArray
/*temp1 = 0;
temp2 = 0;

for (index = 0; index < no_of_values; index ++)
{
  // read the file here and get one value
  // for 4 taxon data
  value = darray[index];

  if (temp2 < no_of_species - 1)
    temp2 = temp2 + 1;
  else
  {
    temp1 = temp1 + 1;
    temp2 = temp1 + 1;
  }

  for (j = 0; j < 10; j++)
  {
    bin[j] = 0;
  }

  addr = 0;
  rem = temp1;
  for (j = 0; j < 10 && rem >= 1; j++)
  {
    bin[j] = rem % 2;
    rem = rem / 2;
  }
  for (j = 0; j < 10; j++)
  {
    addr = addr + bin[j]*(pow(2, (10+j)));
  }
  addr = addr + temp2;

  dataArray[addr] = value; //data read from file;
}
*/
dNumDwords = no_of_values;
for (index = 0; index < dNumDwords; index ++)
{
  pWriteBuffer[index] = darray[index];
}

/* Now we need to allocate and initialize the memory structures for * * the left and right memories. The WC_MemCreate function will * * allocate the structure and fill it with the device number, * * memory offset and flags. * * NOTE : LEFT_MEM_OFFSET and RIGHT_MEM_OFFSET, refer to LAD bus * * offsets NOT memory addresses. Specific memory addresses to read * * and write to are passed to the WC_MemRead and WC_MemWrite * * procedures. */ if (TestInfo->bVerbose) fprintf(stdout,"DONE\n * Allocating Left Memory Struct ... "); Left_Memory = WC_Mem_Create( TestInfo->DeviceNum, LEFT_MEM_OFFSET, 0 ); if (!Left_Memory) { free(pReadBuffer); free(pWriteBuffer); return (ERROR_MEMORY_ALLOC); }

if (TestInfo->bVerbose) fprintf(stdout,"DONE\n * Allocating Right Memory Struct ... ");

Right_Memory = WC_Mem_Create( TestInfo->DeviceNum, RIGHT_MEM_OFFSET, 0 );
if (!Right_Memory)
{
  free(pReadBuffer);
  free(pWriteBuffer);
  WC_Mem_Release(Left_Memory);
  return (ERROR_MEMORY_ALLOC);
}

printf ("DONE\n");

/* With the memory structures initialized, we can now read and write * * to the memories. First we will write to the LEFT memory. */

fprintf(stdout, " Testing LEFT Memory ... ");
/* Find the size of this particular memory, and initialize the write
 * buffer. */
//time(&tStart);
tStartClk = clock();

/* The following two calls, WC_Mem_Write and WC_Mem_Read, are defined in
 * LAD_Mem_Bridge.c. They use a specific protocol to interface with the
 * LAD_Mem_Bridge component in the PE to read and write data in memory.
 * See the documentation inside the LAD_Mem_Bridge.c file for details of
 * this memory protocol. */
if (TestInfo->bVerbose) fprintf(stdout, "\n * Writing to memory ... ");
rc = WC_Mem_Write(Left_Memory,MEM_BASE,dNumDwords,pWriteBuffer);
if (rc != WC_SUCCESS)
{
  free(pReadBuffer);
  free(pWriteBuffer);
  WC_Mem_Release(Left_Memory);
  WC_Mem_Release(Right_Memory);
  return (rc);
}

/*WRITE THE NO OF TAXON DATA TO THE LAD Reg TO START THE UPGMA DESIGN */ if (TestInfo->bVerbose) fprintf(stdout, "DONE\n * Writing no of taxon to Registers ... ");

rc = WC_PeRegWrite(TestInfo->DeviceNum, REGISTER_OFFSET, NUM_REGISTERS, &no_of_species);
if (rc != WC_SUCCESS)
{
  fprintf(stdout, "\n * Writing the no of taxons didn't work!!");
}

tend_wr = clock();

value = tend_wr - tStartClk;

fprintf(stdout, "DONE\nTime taken for Memory write is: %d", value);


/* READ THE REG FILE UNTIL DONE SIGNAL HAS BEEN SET */
//printf("DONE\nThe value of darray[1] is %d", darray[1]);
done = FALSE;
while(!done){
  rc = WC_PeRegRead(TestInfo->DeviceNum, REGISTER_OFFSET, NUM_REGISTERS, darray);
  if (rc != WC_SUCCESS)
  {
    fprintf(stdout, "\n * Couldn't read the register!!");
  }
  //printf("\nThe value of darray[1] is %d", darray[1]);
  /*rc = WC_Mem_Read(Right_Memory, MEM_BASE, dNumDwords, pReadBuffer);
  if (rc != WC_SUCCESS)
  {
    free(pReadBuffer);
    free(pWriteBuffer);
    WC_Mem_Release(Left_Memory);
    WC_Mem_Release(Right_Memory);
    return (rc);
  }
  printf("\nThe value of rmem[0] is %d", pReadBuffer[0]);*/
  if (darray[1] == 1){
    tend_done = clock();
    done = TRUE;
  }
  else
    done = FALSE;

}

//tend_done = clock();

value = (tend_done - tend_wr);
//fprintf(stdout, "\ntdone = %d, tendWrite= %d", tend_done, tend_wr);
fprintf(stdout, "\nTime taken for done signal is : %d", value);

/* // WAIT FOR INTERRUPT WHICH INDICATES THAT DESIGN HAS COMPLETED RUNNING fprintf(stdout, " Waiting for interrupt ... "); rc = WC_IntWait(TestInfo->DeviceNum, 1000 ); CHECK_RC(rc); */

/* READ RIGHT MEMORY AFTER INTERRUPT IS GENERATED dNumDwords SHOULD BE SET TO NO OF DWORDS TO BE READ */ dNumDwords = (2*no_of_species) - 1; if (TestInfo->bVerbose) fprintf(stdout, "\n * Reading from memory ... "); rc = WC_Mem_Read(Right_Memory, MEM_BASE, dNumDwords, pReadBuffer); if (rc != WC_SUCCESS) { free(pReadBuffer); free(pWriteBuffer); WC_Mem_Release(Left_Memory); WC_Mem_Release(Right_Memory);

return (rc);
}

//time(&tEnd);
tEndClk = clock();
//diffTime = difftime(tEnd, tStart);

value = tEndClk - tend_done; fprintf(stdout, "DONE\nTime taken for memory read is: %d", value);

value = (tEndClk - tStartClk); printf("\nTotal Time taken: %d", value);

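// PRINTING THE TREE DATA
f1 = fopen("C:/akella/thesis/thesisOct03/OutputData/TreeData_16D1.txt", "w");
/* Write the tree only if the output file could be opened, then close it */
if (f1 != NULL)
{
  for(index = 0; index < dNumDwords; index++)
    fprintf (f1, "\nTaxon - %d: %li", index, pReadBuffer[index]);
  fclose(f1);
}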

/* Release every buffer allocated in this function */
free(pReadBuffer);
free(pWriteBuffer);
free(darray);
free(dataArray);
free(bin);
WC_Mem_Release(Left_Memory);
WC_Mem_Release(Right_Memory);
printf ("\nDONE\n");
return (rc);

}
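The host program reads n(n-1)/2 distance values for n taxa, which is the size of the strict upper triangle of the distance matrix (6 words for 4 taxa, as noted in the comments). The commented-out loop above walks that triangle row by row and builds a 20-bit address whose upper 10 bits hold the row and whose lower 10 bits hold the column. The sketch below restates that mapping with shifts instead of the bit array and `pow`. The names `pair_count`, `pair_addr`, and `pair_from_index` are hypothetical, and the row-major ordering of the input file is assumed from the commented-out code rather than confirmed elsewhere.

```c
#include <assert.h>
#include <stdint.h>

#define COL_BITS 10u   /* lower address field width, as in addr(9 downto 0) */

/* Number of pairwise distances for n taxa: n(n-1)/2. */
static uint32_t pair_count(uint32_t n)
{
    return n * (n - 1u) / 2u;
}

/* Address of entry (row, col) with row < col: row in the upper
 * 10 bits, col in the lower 10 bits. */
static uint32_t pair_addr(uint32_t row, uint32_t col)
{
    assert(row < col && col < (1u << COL_BITS));
    return (row << COL_BITS) + col;
}

/* Convert the k-th value of the flattened upper triangle into its
 * (row, col) position, stepping exactly as the commented-out loop does. */
static void pair_from_index(uint32_t n, uint32_t k,
                            uint32_t *row, uint32_t *col)
{
    uint32_t r = 0, c = 1;

    assert(n >= 2u && k < pair_count(n));
    while (k--) {
        if (c < n - 1u) {
            c++;            /* next column in the same row */
        } else {
            r++;            /* first column of the next row */
            c = r + 1u;
        }
    }
    *row = r;
    *col = c;
}
```

For 4 taxa the values map to (0,1), (0,2), (0,3), (1,2), (1,3), (2,3), so the fourth value in the file would be placed at address (1 << 10) + 2 = 1026 under this scheme.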

/****************************************************************************
 *
 * Function : WC_UPGMATest_Init
 *
 * Notes : This function puts the card into a known state before the
 *   test begins. It is generally a bad idea to assume the
 *   state of the WILDCARD's hardware when a program starts.
 *   Previous programs can leave the hardware in an unknown
 *   state, and its state on power-on is undefined. If an
 *   application requires a specific state of the hardware,
 *   explicitly set that state.
 *
 *   Before running any application the following steps
 *   should be performed in the order given.
 *
 *   1) Toggle Power
 *   2) Assert the processing element reset line
 *   3) Program the processing element
 *   4) Set the clock frequency
 *   5) Configure Interrupts
 *   6) Deassert the processing element reset line
 *
 ****************************************************************************/
WC_RetCode WC_UPGMATest_Init( WC_TestInfo *TestInfo )
{
WC_RetCode

174 rc=WC_SUCCESS;

    /* A great deal of useful information is available from the ID
     * PROM on the WILDCARD(tm), including processing element part
     * type, memory size, speed grade, etc. The two API calls,
     * WC_DeviceInformation and WC_GetVersion are used to retrieve
     * that information. The procedure DisplayConfiguration,
     * defined in wc_shared.c, displays this information to the
     * screen.
     *
     * Below we use the API calls to store the WILDCARD(tm) device
     * and version information in the TestInfo struct for use later
     * in the example, as well as display the information if
     * verbosity is on.
     */
    rc = WC_DeviceInformation( TestInfo->DeviceNum, &(TestInfo->DeviceCfg) );
    CHECK_RC(rc);
    rc = WC_GetVersion( TestInfo->DeviceNum, &(TestInfo->Version));
    CHECK_RC(rc);
    if (TestInfo->bVerbose)
    {
        rc=DisplayConfiguration(TestInfo->DeviceNum);
        CHECK_RC(rc);
    }

    /* It should NOT be assumed that the WILDCARD(tm) processing
     * element currently has power. Below we toggle the power to
     * the processing element, leaving it ON for the remainder of
     * the example.
     */
    if (TestInfo->bVerbose)
    {
        fprintf (stdout, " Toggling processing element's power...\n");
    }
    rc=WC_PeApplyPower ( TestInfo->DeviceNum, FALSE );
    CHECK_RC(rc);
    rc=WC_PeApplyPower ( TestInfo->DeviceNum, TRUE );
    CHECK_RC(rc);
    if (TestInfo->bVerbose)
        fprintf (stdout, " PE power turned on.\n");

    /* The WILDCARD(tm) has a dedicated reset line controlled by
     * the WC_PeReset API call. In general it is advantageous
     * to have the PE in reset when it is being set up. This
     * will prevent the design from starting execution until the
     * WILDCARD(tm) has been correctly initialized.
     *
     * Below we assert the reset line and keep it asserted
     * until the processing element has been programmed, the
     * clock has been set, and interrupts have been initialized.
     *
     * If the Reset_STD_If has been instantiated in the VHDL,
     * this API call will set the signal 'Global_Reset' high.
     */
    if (TestInfo->bVerbose)
        fprintf (stdout, " Asserting PE Reset Line...\n");
    rc=WC_PeReset( TestInfo->DeviceNum, TRUE );
    CHECK_RC(rc);
    if (TestInfo->bVerbose)
        fprintf (stdout, " PE RESET line asserted.\n");

    /* As of the creation of this file there are 4 revisions of
     * the WILDCARD(tm) hardware. (Revs A to D) Below we use
     * the information in TestInfo->Version to determine the
     * revision of the card in this slot.
     */
    rc = WC_GetVersion( TestInfo->DeviceNum, &TestInfo->Version);
    CHECK_RC(rc);
    if (TestInfo->bVerbose)
        fprintf (stdout, " Loading PE Image...\n");
    if (((TestInfo->Version.Hardware & WC_MAJOR_VER_MASK)>>WC_MAJOR_VER_SHIFT) == 4)
    {
        /* REV D WILDCARD(tm)
         *
         * The ProgramPeFromFile procedure, found in wc_shared.c,
         * will append .\\\ to the
         * filename, and load that file into the processing element.
         * For a REV D this path will be
         * .\XCV300E\PKG_BG352\
         */
        rc=ProgramPeFromFile( TestInfo->DeviceNum, IMAGE_FILENAME_REVD );
        CHECK_RC(rc);
    }
    else
    {
        /* REV A-C WILDCARD(tm)
         *
         * The ProgramPeFromFile procedure, found in wc_shared.c,
         * will append .\\\ to the
         * filename, and load that file into the processing element.
         *
         * For a REV C this path will be
         * .\XCV300E\PKG_BG352\
         *
         * For REVs A or B this path will be
         * .\XCV300\PKG_BG352\
         */
        rc=ProgramPeFromFile( TestInfo->DeviceNum, IMAGE_FILENAME );
        CHECK_RC(rc);
    }
    if (TestInfo->bVerbose)
        fprintf (stdout, " PE Image Loaded.\n");

    /* The WILDCARD(tm) has one on-board programmable oscillator.
     * WC_ClkSetFrequency sets the frequency of that clock. We
     * always want to set the clock to the appropriate frequency
     * before running our application.
     */
    if (TestInfo->bVerbose)
        fprintf(stdout, " Initializing the clock to %f...\n", TestInfo->fClkFreq);
    rc=WC_ClkSetFrequency ( TestInfo->DeviceNum, TestInfo->fClkFreq );
    CHECK_RC(rc);
    if (TestInfo->bVerbose)
        fprintf(stdout, " Clock initialized.\n");

    /* This application uses the PE interrupt line to
     * generate an interrupt to the host. Interrupts must
     * be enabled before we can receive an interrupt from
     * the PE.
     */
    if (TestInfo->bVerbose)
        fprintf (stdout, " Masking PE Interrupt...\n");
    rc=WC_IntEnable( TestInfo->DeviceNum, TRUE );
    CHECK_RC(rc);
    if (TestInfo->bVerbose)
        fprintf (stdout, " PE Interrupt Masked.\n");

    /* The order of mask / reset may be important in some
     * circumstances. In our case it is not. We mask,
     * then clear anything that may have happened before
     * the masking operation.
     */
    if (TestInfo->bVerbose)
        fprintf (stdout, " Resetting PE Interrupt...\n");
    rc=WC_IntReset( TestInfo->DeviceNum );
    CHECK_RC(rc);
    if (TestInfo->bVerbose)
        fprintf (stdout, " PE Interrupt Reset.\n");

    /* Lastly, we remove the PE from the RESET state. When
     * the Reset_STD_IF is instantiated in the VHDL, this
     * will set the VHDL signal 'Global_Reset' low.
     */
    if (TestInfo->bVerbose)
        fprintf (stdout, " De-asserting PE Reset Line...\n");
    rc=WC_PeReset( TestInfo->DeviceNum, FALSE );
    CHECK_RC(rc);
    if (TestInfo->bVerbose)
        fprintf (stdout, " PE RESET line de-asserted.\n");

    return(rc);
}
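
Note on the initialization order: the function above follows the six steps listed in its header comment and stops at the first failing call. The following is a minimal sketch of that pattern using stand-in functions only. None of the names below (rc_t, RETURN_ON_ERROR, do_step, init_sequence, reset_log) belong to the WILDCARD API, and RETURN_ON_ERROR is only an assumed shape for an early-return check; the vendor's CHECK_RC macro may be defined differently.

```c
#include <assert.h>
#include <stdio.h>

/* All names here are hypothetical stand-ins, not the WILDCARD API. */
typedef int rc_t;
#define RC_OK 0

/* Assumed early-return check: leave the enclosing function on any error. */
#define RETURN_ON_ERROR(rc) do { if ((rc) != RC_OK) return (rc); } while (0)

#define MAX_STEPS 8
static const char *step_log[MAX_STEPS];
static int step_count = 0;
static int fail_at = -1;   /* index of the step forced to fail, -1 for none */

/* Clear the log and choose which step (if any) should report failure. */
void reset_log(int fail_index)
{
    step_count = 0;
    fail_at = fail_index;
}

/* Record a step name; report failure if this step was chosen to fail. */
static rc_t do_step(const char *name)
{
    assert(step_count < MAX_STEPS);
    step_log[step_count] = name;
    return (step_count++ == fail_at) ? 1 : RC_OK;
}

/* The six steps from the header comment, in the order given there. */
rc_t init_sequence(void)
{
    rc_t rc;
    rc = do_step("toggle power");          RETURN_ON_ERROR(rc);
    rc = do_step("assert reset");          RETURN_ON_ERROR(rc);
    rc = do_step("program PE");            RETURN_ON_ERROR(rc);
    rc = do_step("set clock frequency");   RETURN_ON_ERROR(rc);
    rc = do_step("configure interrupts");  RETURN_ON_ERROR(rc);
    rc = do_step("deassert reset");        RETURN_ON_ERROR(rc);
    return RC_OK;
}

/* Print the steps that were attempted, in order. */
void print_log(void)
{
    int i;
    for (i = 0; i < step_count; i++)
        printf("%d) %s\n", i + 1, step_log[i]);
}
```

Because reset is the second step and its release is the last, a failure in any intermediate step leaves the stand-in processing element held in reset, which matches the intent stated in the comments of the listing.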
