Sreesa Akella

Bachelor of Science Andhra University, 1998


Submitted in Partial Fulfillment of the

Requirements for the Degree of Master of Science in the

Department of Computer Science and Engineering

College of Engineering and Information Technology

University of South Carolina


I would like to express my deepest gratitude to my thesis advisor, Dr. James P. Davis, for the relentless motivation and support he provided, which helped me complete my thesis on time. His unflinching optimism, undying enthusiasm and focus on this project inspired me to a great degree. His constant advice pushed me to look at problems from a different perspective and helped me visualize concepts in a broader manner.

I would like to extend my appreciation to Dr. Duncan Buell and Dr. John Rose for their continuous guidance and inspiration. Their valuable advice from time to time gave this project an optimal direction.

I would also like to thank my parents and friends, who have been a constant force of motivation and support that sustained me through tough times and helped me achieve this goal.


In recent years, reconfigurable custom computing has become an increasingly viable option for implementing high-performance computing applications. Reconfigurable VLSI logic, on which custom computing systems are built, provides several orders of magnitude speed-up in the execution performance of algorithms over their execution on conventional microprocessor-based systems. In addition, such systems have the flexibility to program (and reprogram via reconfiguration) the actual logic functions of the VLSI circuit with different applications in time and space. Custom computing systems are implemented using FPGA custom-logic devices that are easily and quickly programmed by an end user. This research presents the design and analysis of a custom computing application architecture for the UPGMA Bioinformatics algorithm implemented on an FPGA-based custom-computing platform. We present the Bioinformatics problem domain and the architectures that were implemented and assessed. We also discuss the final architecture created and present results of the system performance, as measured and compared against that of the UPGMA algorithm written in C, running on a single-processor Pentium® PC.









1.1 Von Neumann versus Reconfigurable Custom Computing

In recent years, reconfigurable custom computing has become an increasingly viable option for implementing applications requiring high-performance or complex computations. It is an area that is not as mature as the use of conventional computing architectures. Traditionally, general-purpose computing involves a serial thread of executing code running on one or more microprocessors. This microprocessor-based computing paradigm is considered "general-purpose" in that the processor can be programmed to run any task—which is an executing application program running on an operating system or monitor program. Once a processor has been designed and fabricated, the single processor’s IC can solve multiple problems at different points of time, by fetching program instructions and data from memory, decoding them to determine an execution plan, then executing each such instruction, in turn.

Reconfigurable computing can also be called "general-purpose", although it uses a different architecture and supporting application development paradigm for computation. Unlike a microprocessor, which has its computation as a set of sequential instructions fetched from system memory, reconfigurable architectures generally compute a function by configuring functional units and wiring them up in space. This allows parallel computation of operators and direct dataflow from the producers of an intermediate result to the consumers [1, 2].

1.2 Custom Logic Design versus Custom Computing

Application Specific Integrated Circuits (ASICs) could also be used to implement a design and optimize it to achieve high performance employing spatial architectures. ASICs, however, are designed using custom logic techniques, creating design artifacts tailored for a specific application, and thus cannot be reconfigured to perform different applications. Therefore, although these systems provide high performance through application-specific optimization, they are not “general purpose”. One other aspect of ASIC systems is that they have a huge manufacturing cost associated with them.

Reconfigurable systems, on which custom computing systems are built, provide very good performance and the flexibility to program (and reprogram, via reconfiguration of the logic functionality) the actual device logic with different applications in time and space. Additionally, these systems are implemented using FPGA devices that are easily and quickly programmable by an end user, are available at affordable prices, and thus deliver user-defined functionality at a low cost. The performance and logic density of a single FPGA device have been improving in recent years, leading to more powerful reconfigurable architectures targetable for a wider range of applications. This has opened up the use of FPGAs, typically employed in the creation of logic controllers, as processing elements (PEs) in reconfigurable arrays in applications for high-performance computing.

1.3 Field Programmable Gate Arrays

In the past few years, the reconfigurable device market has grown considerably with the availability of a wide range of devices for VLSI systems--one such device being a Field Programmable Gate Array (FPGA). FPGAs have evolved considerably in the recent past, with the primary development being the ability to download a bitstream representing the digital logic functions onto an array of pre-defined arithmetic, logical and steering resources, so they have become the primary device for building reconfigurable and adaptive machines. They were originally designed as prototype devices used for pre-fabrication design emulation. This design activity was employed to verify the design before fabrication, to avoid the fallout of post-fabrication design error.

The Xilinx FPGA devices that are the primary focus of this work have a standard architecture, which is shown in Figure 1 [3].

FPGAs consist of an array of resource types: configurable logic blocks (CLBs), input/output blocks (IOBs), and programmable interconnect resources. This standard architecture can be configured, and reconfigured if necessary, by an end user to implement a particular functionality. The logic blocks are used to implement the required logic gate and storage elements of the design. The interconnect can be programmed to appropriately connect the logic blocks to realize a larger functional unit specified for use by the application.

Figure 1. Architecture of an FPGA Device.

For the purposes of this thesis, the FPGA design process has the following steps:

1. Model the design using a hardware description language such as VHDL, or through schematic capture.

2. Synthesize this design to generate a netlist.

3. Map the design to the FPGA logic blocks.

4. Place and route the design, choosing the specific logic blocks to use on the FPGA and allocating the wire segments that interconnect these logic blocks.

5. Download the design as a bitstream onto the target FPGA chip.

Steps 2 through 5 are automated and are performed by an assortment of design tools generally provided by the FPGA device vendor. Some of the major FPGA device manufacturers and vendors in the market are Xilinx, Actel, and Altera.

In order to use these devices for reconfigurable computing applications, one has to deal with a number of FPGA issues so as to effectively implement the design. The computational requirements of the application must be identified and its mapping to the FPGA device must be evaluated via estimation. This is no easy task, and there is no standard method to assemble designs. The FPGA tools, which play a major part in this process, are being continuously improved by the vendors to be more efficient in their mapping of design architecture to design resources. The trend is that, over time, the construction of reconfigurable computing systems on FPGAs will be more like software programming than the hardware design process for custom VLSI that exists today.

1.4 Application Programming and Design Styles

The process of converting a specification into an implementation on FPGA devices can be addressed in different ways. Different design styles lead to different interpretations of the specification—a formal or informal description of the application’s algorithm. An algorithm can be thought of as a set of processing steps for transforming data by executing a series of computations [4]. The algorithm needs to be interpreted by a machine to perform the work. Choosing the elements that make up the machine defines its architecture, and this necessitates looking at different architectural, or design, styles.

Traditionally there have been two generic architectural styles: the software paradigm and the hardware paradigm [4]. The software paradigm looks at implementing an algorithm through use of an instruction code sequence that is interpreted by a microprocessor. In contrast, using the hardware paradigm, an algorithm is mapped onto storage and functional units that perform the computation without the use of an intermediate instruction set.

Under the software paradigm, a program for the algorithm is written in a high-level language such as C/C++, which is compiled into a low-level instruction set for the processor to execute on underlying hardware with a fixed architecture. A hardware implementation would implement the design directly on a hardware device through mapping to storage and functional units, avoiding the instruction-processing and operating system overhead present in the software paradigm. This can provide considerable speed-up, on the order of two orders of magnitude, and thus a much higher-performance solution; however, a VLSI hardware application solution generally comes at a higher cost, since fabricating the implementation on application-specific devices is expensive.

At the same time, such application formulations using application-specific VLSI custom logic are not general purpose, thus necessitating different implementations for different algorithms. In contrast, the software model would yield a generally lower-performance solution through the overhead associated with instruction fetch, decode and execute; however, the solution would generally be cheaper, since microprocessors are mass-produced, reusable commodity off-the-shelf products, and programming them is not a difficult task. Furthermore, there are more software-trained professionals who can write programs on general-purpose processors than there are design engineers who can design custom-logic VLSI.

FPGAs provide a means to build general-purpose, reconfigurable machines at a lower cost¹. This leads to a new design style that can be referred to as the reconfigurable-computing paradigm, also referred to as the configurable hardware paradigm [4]. This paradigm supports the implementation of algorithms by providing the performance benefits from mapping directly onto a hardware platform at a relatively low cost. Thus, it would be interesting to look at implementing various applications on reconfigurable platforms and evaluating their performance as compared to implementations using the software paradigm.

¹ Such fixed cost is referred to as NRE, or non-recurring engineering costs, which are associated with the specification, design, implementation, mapping, and test of the logic functions implementing an application on a VLSI device substrate. This is in contrast to the variable costs associated with the fabrication and production of finished devices, which are based on the volume of production, itself based on the demand.

Such performance would include conventional notions of latency associated with carrying out computation, comparing between an application-specific software solution running on a conventional processor architecture (or even among a collection of processors, thus distributing the algorithm’s execution across multiple, communicating processing elements), and also throughput of the architecture to run streaming computation, if appropriate for the application. However, evaluating performance could also include comparing the design time of the application—comparing the time to architect, design, implement and test the application according to the requisite engineering processes of each paradigm.

1.5 Thesis Proposal

The reconfigurable computing paradigm, and the predominant FPGA device architecture on which such applications can be built, offers us a good medium for implementing complex computational tasks having high-throughput, low-latency requirements. Many computational tasks spread over a range of application domains have been implemented and evaluated on reconfigurable computing systems [5, 6, 7, 8, 9]. However, different aspects of application architecture and performance must continue to be explored, and many new and novel computational problems must be implemented using reconfigurable custom computing machines, before a general understanding of the characteristics of the reconfigurable computing paradigm can be obtained. This would provide a wider set of configurable computing solutions, as well as patterns for mapping between high-level problem-solving architecture and lower-level device architectures, which can be used to assess the cost/benefit ratio for effective and optimal implementation of more general programming problems on reconfigurable platforms.

Our research thesis examines one such data point in the space of possible application solutions, where high-performance computing using reconfigurable hardware is required for operating on ever-growing data sets. Namely, we look at the Phylogenetics domain, which provides a rich set of algorithms that can be studied to see whether they can be implemented efficiently on reconfigurable computing machines to provide orders-of-magnitude speedup in algorithm execution over that available on standard Von Neumann processor architectures programmed using conventional programming techniques. In this Bioinformatics domain, the Unweighted Pair-Group Method with Arithmetic means (UPGMA) algorithm, used for phylogenetic tree reconstruction, has a computational complexity that makes it an application of specific interest. Furthermore, its software implementation is understood to be close to optimal; that is, it cannot be further optimized to achieve significant speedup in performance.

1.5.1 Thesis Research Objective and Tasks

It is therefore our objective to explore the space of possible architectures in custom reconfigurable logic, using FPGA devices as an implementation medium and conventional custom-logic design processes, to implement a different “rendition” of the UPGMA algorithm and measure the performance difference. Although the time complexity of the algorithm is unlikely to change as a result of implementation in FPGA custom-logic hardware, we believe that the use of custom-logic VLSI hardware design techniques should yield up to a two-order-of-magnitude improvement in the execution speed of the UPGMA algorithm over that of the PHYLIP program written by Felsenstein et al. [10].

The tasks involved in exploring this thesis research work are defined as follows:

1) Select the UPGMA algorithm [11], which performs phylogenetic analysis by building an evolutionary tree, as our problem domain.

2) Identify and analyze the various complex computational tasks and bottlenecks.

3) Evaluate the issues that we need to address in implementing this algorithm on a reconfigurable custom-logic architecture.

4) Address various FPGA issues while developing a hardware architecture for the particular problem algorithm at hand.

5) Implement this design on an FPGA-based reconfigurable architecture and device.

6) Evaluate its performance by measuring its throughput with an increase in the number of taxa, and benchmark these results against those obtained from a software program (Felsenstein’s PHYLIP) executing on a conventional CPU-based system.

The Annapolis WILDCARD™ system has been chosen as the target reconfigurable platform. The WILDCARD™ FPGA board has a Xilinx Virtex® XCV300E² as a processing element, along with two 256-Kbyte memory units and external I/O connections. This reconfigurable computing platform was chosen primarily based on cost and the availability of a reasonable set of platform development tools.

2 Xilinxand Virtex are Registered Trademarks of Xilinx Inc.

Thus, this thesis will attempt to lower the effective upper bound on running time by reducing the time constant associated with the complexity function for the UPGMA algorithm, to achieve orders-of-magnitude speedup, while also contending with the space complexity associated with the limited amount of device resources available on the WILDCARD™ platform. In addition, given that we will be moving data between the host computer in which the WILDCARD™ sits and the WILDCARD™ PCI/PCMCIA board itself, we will be required to assess the penalties associated with the communication overhead, with the objective of minimizing this as much as possible.



In this chapter, we provide background on the application domain associated with the UPGMA algorithm and its context in the space of Bioinformatics computational problem solving. We also discuss the FPGA device technology, which constitutes the platform on which we will create a reconfigurable computing solution for the UPGMA problem.

2.1 Phylogenetics and Tree-reconstruction Methods

The study of the relationships between groups of organisms is called taxonomy, an ancient and venerable branch of classical biology. The branch of taxonomy that deals with numerical data such as DNA sequences is known as phylogenetics. Phylogenetic analysis was originally developed by biological systematists who wanted to reconstruct evolutionary genealogies of species based on morphological similarities. The results of phylogenetic analysis may be depicted as a hierarchical branching diagram, a "cladogram" or "phylogenetic tree", as shown in Figure 2 [12].

Figure 2: A phylogenetic tree showing a relationship between four species.

2.1.1 Background on trees

The tree represents the genealogical evolution of the different species, linking them through a certain set of similarities and differences. Similarities and differences between organisms can be coded as a set of characters, each with two or more alternative character states. In an alignment of DNA sequences, for example, each aligned site is a separate character, each with four character states, the four nucleotides being adenine, thymine, cytosine, and guanine.

All the trees are assumed to be binary, meaning that each node branches into two daughter edges as shown in Figure 2. The edges meet at a branch node, a node being an endpoint of an edge. Each edge has a certain amount of evolutionary divergence associated with it, quantified by some distance between sequences. These distances are referred to as ‘edge lengths’ or ‘branch lengths’. Terminal nodes or leaves correspond to the observed sequences that might connect up to an ultimate ancestor or ‘root’ of the tree. A true biological phylogeny has a ‘root’, but only some phylogenetic algorithms provide information about the location of the root.

For a specific set of n leaves, the nodes and edges of a tree can be counted as follows: there are (n-1) branch nodes in addition to the n leaves, giving a total of (2n-1) nodes and one fewer edges, that is (2n-2), discounting the edge above the root node.
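The counting rule above can be checked with a few lines of Python. This is only an illustrative sketch; the function name `rooted_binary_tree_counts` is hypothetical and not part of the thesis.

```python
def rooted_binary_tree_counts(n_leaves: int) -> dict:
    """Node and edge counts for a rooted binary tree with n_leaves leaves."""
    if n_leaves < 2:
        raise ValueError("a binary tree needs at least two leaves")
    internal = n_leaves - 1      # branch nodes, including the root
    total = n_leaves + internal  # 2n - 1 nodes in all
    edges = total - 1            # 2n - 2, ignoring any edge above the root
    return {"leaves": n_leaves, "internal": internal,
            "nodes": total, "edges": edges}


# The four-species tree of Figure 2 would have 7 nodes and 6 edges.
counts = rooted_binary_tree_counts(4)
print(counts)
```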

2.1.2 Phylogenetic Algorithms

Phylogenetic algorithms cover three main classes of problems [13]: (1) parsimony, which is like a vertex coloring problem of graph theory; (2) distance methods, which aim to find a tree whose path distance matches closely to observed distances; and, (3) likelihood methods, where the likelihood of the data is calculated using Markov transition matrices. Each approach possesses certain problems in terms of the computational bottlenecks that occur.

The advantages of putting a phylogenetic algorithm onto a reconfigurable custom computing platform include the following: (1) eliminating intervening levels of software, such as operating systems, which slow down the execution of the code; and (2) parallelizing or pipelining the algorithm functions by exploiting the natural capabilities of custom-logic architecture and design. The latter provides far more work per cycle than code written in a native instruction set on a general-purpose microprocessor. As discussed earlier, we believe a speedup of up to two orders of magnitude should be possible with this approach. Furthermore, the bottlenecks within the algorithms could be avoided by exploiting the underlying hardware resources in reconfigurable machines to optimize specific parts of the algorithm’s execution in ways that general-purpose machines cannot offer.

We select a particular phylogenetic distance method algorithm for this research, namely the UPGMA (Unweighted Pair-Group Method with Arithmetic means) algorithm, whose computational complexities are described below.

2.2 The UPGMA

UPGMA has relevance beyond phylogenetics, since it is a hierarchical clustering method that is both fast and useful with gene expression or micro-array data. The algorithm’s running-time complexity is evaluated and compared with that of the hardware implementation, and the results are presented in Chapter 6. The value of N, the number of objects to be clustered, is typically around 10,000 to 50,000 in micro-array applications. Thus, even though software-based phylogenetics applications run this method in about 1 second for N = 100 [15], there is an increase by a factor of perhaps 10,000 in micro-array applications, even before we consider memory bottlenecks. This last factor causes considerable problems, since memory usage also scales as O(N^2). Thus, this problem might take days to complete with larger taxa data sets. This algorithm is well understood [11, 14, 15], and the software solutions have reached a level of optimization beyond which minimal performance improvement can be obtained. Thus, UPGMA is an appropriate candidate for exploring an implementation on a reconfigurable platform using custom-logic architecture and design techniques.
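The factor of 10,000 quoted above is what quadratic scaling gives when N grows from 100 to 10,000. The short sketch below reproduces that arithmetic and also estimates the storage of a full distance matrix. The helper names and the 4-byte entry size are assumptions made for illustration only; they are not figures from the thesis.

```python
def quadratic_scale_factor(n_small: int, n_large: int) -> float:
    """Growth of an O(N^2) quantity when N goes from n_small to n_large."""
    return (n_large / n_small) ** 2


def distance_matrix_bytes(n: int, bytes_per_entry: int = 4) -> int:
    """Storage for a full n x n matrix; bytes_per_entry is an assumed size."""
    return n * n * bytes_per_entry


factor = quadratic_scale_factor(100, 10_000)
matrix_bytes = distance_matrix_bytes(10_000)
print(factor)        # 10000.0
print(matrix_bytes)  # 400000000
```

Under these assumptions, a full matrix for N = 10,000 already needs about 400 MB, which gives a sense of why memory usage is treated as a bottleneck in its own right.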

2.2.1 Algorithm

We define the distance between two clusters Ci and Cj to be the average distance between pairs of sequences from each cluster:

$$d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq} \tag{1}$$

where |Ci| and |Cj| denote the number of sequences in clusters i and j, respectively, and p and q denote the sequences in clusters Ci and Cj, respectively. If Ck is the union of clusters Ci and Cj, and if Cl is any other cluster, then

$$d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|} \tag{2}$$

This is the average-distance calculation that gives the distance from the new cluster Ck to any other cluster Cl.
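As a sanity check on the two definitions, the sketch below computes the distance from a merged cluster to a third cluster in two ways: directly, as the average over all sequence pairs, and incrementally, as the size-weighted average of the two earlier cluster distances (the form given by Durbin et al. [14]). The function names and the small distance matrix are hypothetical and used only for illustration.

```python
from itertools import product


def cluster_distance(ci, cj, d):
    """Average of d[p][q] over all p in ci and q in cj (pairwise definition)."""
    return sum(d[p][q] for p, q in product(ci, cj)) / (len(ci) * len(cj))


def merged_distance(d_il, d_jl, size_i, size_j):
    """Size-weighted average giving the distance from Ci U Cj to Cl."""
    return (d_il * size_i + d_jl * size_j) / (size_i + size_j)


# Hypothetical symmetric distances between four sequences 0..3.
d = [
    [0.0, 2.0, 6.0, 10.0],
    [2.0, 0.0, 6.0, 10.0],
    [6.0, 6.0, 0.0, 8.0],
    [10.0, 10.0, 8.0, 0.0],
]
ci, cj, cl = [0, 1], [2], [3]

direct = cluster_distance(ci + cj, cl, d)
updated = merged_distance(cluster_distance(ci, cl, d),
                          cluster_distance(cj, cl, d),
                          len(ci), len(cj))
print(direct, updated)  # both equal 28/3
```

The incremental form matters for an implementation because it needs only the two previous cluster distances and the cluster sizes, not the original sequence-level distances.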

The distances are represented in the form of a matrix, shown in Figure 3, with each row or column corresponding to one node. The distance between nodes i and j is held in position [i, j] of the matrix, so D[i, j] is the distance between nodes i and j.

Figure 3. Distance Matrix

D[i, i] is not a valid distance, since there can be no distance between a node and itself. This is therefore marked as “x” in the matrix.

The steps of UPGMA algorithm are as given below [14]:

1. Initialization:

a. Assign each sequence i to its own cluster Ci.

b. Define one leaf of the tree T for each sequence, and place it at height zero.

2. Iteration:

a. Determine the two clusters i, j for which dij is minimal. (If there are several equidistant pairs, pick one randomly.)

b. Define a new cluster k by Ck = Ci ∪ Cj, and define dkl for all l by (2).

c. Define a node k with daughter nodes i and j, and place it at height dij/2.

d. Add k to the current clusters and remove i and j.

3. Termination:

a. When only two clusters i, j remain, place the root at height dij/2.
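The steps above can be sketched in software as follows. This is a minimal, unoptimized Python illustration, not the PHYLIP code and not the hardware architecture developed in this thesis. It assumes a symmetric distance matrix, takes the height of each new node as half the merge distance (as in the termination step), updates distances with the size-weighted average, and breaks ties by taking the first minimal pair rather than a random one. The function name `upgma` and the sample data are hypothetical.

```python
def upgma(d, labels):
    """Naive UPGMA. Returns (tree, root_height); tree is a nested tuple."""
    # Step 1: every sequence starts as its own cluster: id -> (subtree, size).
    clusters = {i: (labels[i], 1) for i in range(len(labels))}
    # Distances between active clusters, keyed by (smaller id, larger id).
    dist = {(i, j): d[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(labels)
    height = 0.0

    def get(a, b):
        return dist[(a, b) if a < b else (b, a)]

    while len(clusters) > 1:
        # Step 2a: closest pair (first minimum found, not a random tie-break).
        i, j = min(dist, key=dist.get)
        dij = dist[(i, j)]
        (tree_i, n_i), (tree_j, n_j) = clusters[i], clusters[j]
        # Step 2b: size-weighted average distance to every other cluster.
        updated = {
            other: (get(i, other) * n_i + get(j, other) * n_j) / (n_i + n_j)
            for other in clusters if other not in (i, j)
        }
        # Step 2d: remove i and j, then add the new cluster k.
        del clusters[i], clusters[j]
        dist = {pair: v for pair, v in dist.items()
                if i not in pair and j not in pair}
        k = next_id
        next_id += 1
        for other, v in updated.items():
            dist[(other, k)] = v  # k is always the largest id so far
        # Step 2c / 3a: the new node sits at half the merge distance.
        height = dij / 2
        clusters[k] = ((tree_i, tree_j), n_i + n_j)

    tree, _ = next(iter(clusters.values()))
    return tree, height


# Hypothetical distances between four taxa.
d = [
    [0.0, 2.0, 6.0, 10.0],
    [2.0, 0.0, 6.0, 10.0],
    [6.0, 6.0, 0.0, 8.0],
    [10.0, 10.0, 8.0, 0.0],
]
tree, root_height = upgma(d, ["A", "B", "C", "D"])
print(tree, root_height)
```

For this sample matrix, A and B merge first at height 1, then C joins at height 3, and D joins last, which places the root at height 14/3.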

2.2.2 Complexity and Bottlenecks on UPGMA

We believe the UPGMA algorithm has two bottlenecks. The first is in deciding which of the N(N-1)/2 pairwise distances is minimal at each step of the star-decomposition clustering. Following this, the dimension of the data matrix is reduced by 1, due to the clustering of two objects. This introduces the second bottleneck, the need to calculate an average distance between the two objects (i and j), taken as a single cluster (k), and all other objects. This involves complex computational units, costly on general-purpose microprocessors, but which we believe can be implemented efficiently on a reconfigurable custom-logic FPGA device, giving better performance results.
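To give a feel for the size of the first bottleneck, the sketch below counts how many pairwise distances a naive implementation would examine if it rescanned the whole matrix at every merge. This rescanning strategy is an assumption made purely for illustration; it is not a claim about how PHYLIP or the hardware design performs the search, and the function names are hypothetical.

```python
def pairs(m: int) -> int:
    """Number of pairwise distances among m clusters: m(m-1)/2."""
    return m * (m - 1) // 2


def total_pair_scans(n: int) -> int:
    """Distances examined when the matrix is rescanned at every merge,
    as the number of clusters shrinks from n down to 2."""
    return sum(pairs(m) for m in range(2, n + 1))


first_scan = pairs(100)
all_scans = total_pair_scans(100)
print(first_scan)  # 4950
print(all_scans)   # 166650
```

Under this assumption the total grows roughly as N^3/6, so the minimum search dominates quickly as N increases, which is one reason it is a natural target for parallel comparison logic.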

This research examines the function of the UPGMA algorithm, implementing it as a custom-logic architecture. The standard HDL-based design methodology is employed, in that we model the algorithm using the VHDL hardware description language, we functionally verify the algorithm’s correctness in the custom-logic architecture, and then we synthesize the architecture onto a set of resources to produce a circuit mapped to a target FPGA device’s component library. The resulting circuit is implemented on a Xilinx Virtex E® FPGA device and is subjected to functional and performance analysis.

However, before we can present the research method undertaken in this effort (including the analysis, architecture and design of the circuit implementing the UPGMA algorithm), we must discuss the characteristics of FPGA devices and their use in reconfigurable computing that give this research a high chance of success.

2.3 Field Programmable Gate Arrays

The evolution of FPGA devices is evidenced by the great strides in the underlying technology: effective logic gate counts in the millions, and the ability to download and alter the logic via a programmable bitstream while the FPGA device is in operation, to name a few. Several companies have been developing high-performance, high-capacity FPGA devices, targeting larger applications such as those associated with scientific computing. Xilinx, Actel and Altera, the largest of the FPGA vendors producing these devices, have a leadership position in the market. Our reconfigurable platform, the Annapolis Wildcard® system, uses a Xilinx® XCV300E device. The Xilinx FPGA devices have a standard set of device architecture features, similar to the one shown in Figure 1 in the previous chapter. We describe the architecture for the Xilinx XCV300E device below.

Figure 4. Structure of the Xilinx® XCV300E device [16].

Figure 4 provides an architectural overview of the XCV300E device. There are three main components in the device: (1) Input Output Blocks (IOBs); (2) Configurable Logic Blocks (CLBs), including block-programmable RAM (BRAM) memory structures; and (3) the Programmable Routing Matrix.

2.3.1 Input Output Blocks (IOBs)

The input and output blocks on the device provide an interface between the input and output pins and the Configurable Logic Blocks (CLBs). The architecture for these blocks is given in Figure 5. These blocks provide three storage elements that can be used either as edge-triggered D flip-flops or as level-sensitive latches.

Figure 5. Virtex E Input Output Block architecture [16]

2.3.2 Configurable Logic Blocks (CLBs)

The Configurable Logic Blocks provide the functional elements for implementing logic. The basic building block of the CLB is the Logic Cell (LC). Each Virtex-E CLB consists of four LCs. The LC consists of a 4-input function generator, carry logic, and a storage element. The output of the function generator in each LC drives the output of the CLB and the D input of the flip-flop. The architecture for a Virtex-E CLB is given in Figure 6. The four LCs are organized as two identical slices, as shown in the figure.

Figure 6. A Two-Slice Virtex E CLB [16]

The function generators are implemented using Look Up Tables (LUTs) that can also be configured as 16x1-bit synchronous RAM. The two LUTs in a slice can be combined to create a 16x2-bit or 32x1-bit synchronous RAM, or a 16x1-bit dual-port synchronous RAM element.
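The idea that a 4-input function generator is simply a 16-entry, 1-bit-wide memory can be illustrated with a small software model. This is only a behavioural sketch; the class `Lut4` and its methods are hypothetical, and the model ignores timing, carry logic and the actual Xilinx configuration format.

```python
class Lut4:
    """A 4-input function generator modelled as a 16 x 1-bit lookup table."""

    def __init__(self, init: int):
        if not 0 <= init < (1 << 16):
            raise ValueError("init must fit in 16 bits")
        self.bits = init  # bit k holds the output for input address k

    @classmethod
    def from_function(cls, fn):
        """Fill the table by evaluating fn(a, b, c, d) on all 16 inputs."""
        init = 0
        for addr in range(16):
            a, b, c, d = ((addr >> s) & 1 for s in range(4))
            init |= (fn(a, b, c, d) & 1) << addr
        return cls(init)

    def read(self, a, b, c, d):
        """Logic mode: the four inputs form the address of the stored bit."""
        addr = a | (b << 1) | (c << 2) | (d << 3)
        return (self.bits >> addr) & 1

    def write(self, addr, bit):
        """RAM mode: overwrite the single bit stored at a 4-bit address."""
        self.bits = (self.bits & ~(1 << addr)) | ((bit & 1) << addr)


# Any Boolean function of four inputs fits in one table, e.g. 4-input XOR.
xor4 = Lut4.from_function(lambda a, b, c, d: a ^ b ^ c ^ d)
print(hex(xor4.bits))         # 0x6996
print(xor4.read(1, 0, 0, 0))  # 1
```

The same 16 stored bits serve either as a truth table or as a tiny RAM, which is why one LUT can play both roles.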

2.3.3 Programmable Routing Matrix

The Virtex-E includes a General Routing Matrix (GRM) that connects the CLBs together to implement the logic chains. The GRM comprises an array of routing switches located at the intersections of the horizontal and vertical routing channels. Each CLB also has local routing resources through which it connects to the GRM. These local and global routing resources can be programmed to generate the best routing for the design being configured onto the device. The Xilinx configuration tools take care of placing and routing the design onto the device’s resources according to user-specified constraints.

2.3.4 Resources on a Virtex-E chip

The Xilinx Virtex-E resources and their numbers are given below in Table 1:

Resource                   Number
CLBs                       1536
Slices                     3072
LUTs                       6144
Flip-flops                 6144
Block RAMs (256x16 bits)   32
Block RAMs (256x32 bits)   16
Block RAM bits             131072

Table 1: Virtex-E chip resources

Each CLB has two slices, and there are two LUTs and two flip-flops per slice. The Block RAM allocations are based on how the LUTs are configured. If they are configured as two 16x1-bit RAMs, then we can have 32 of the 256x16-bit block RAMs implemented on the device. If we have two LUTs configured to form a 32x1-bit RAM, then we can have 16 of the 256x32-bit RAMs implemented on the device.
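The entries of Table 1 are mutually consistent, which can be confirmed with simple arithmetic. The sketch below recomputes the derived rows from the CLB count and the per-CLB and per-slice figures stated in the text; the variable names are illustrative only.

```python
CLBS = 1536
SLICES_PER_CLB = 2
LUTS_PER_SLICE = 2
FLIPFLOPS_PER_SLICE = 2

slices = CLBS * SLICES_PER_CLB            # 3072
luts = slices * LUTS_PER_SLICE            # 6144
flipflops = slices * FLIPFLOPS_PER_SLICE  # 6144

# Block RAM capacity under the two organisations listed in Table 1.
bits_as_256x16 = 32 * 256 * 16            # 131072
bits_as_256x32 = 16 * 256 * 32            # 131072

print(slices, luts, flipflops, bits_as_256x16, bits_as_256x32)
```

Both block RAM organisations account for the same 131,072 bits, so they are two views of one fixed pool of memory rather than additional capacity.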

The Xilinx Virtex-E data manual [16] provides a detailed description of the device architecture along with pin definitions and electrical characteristics. The Virtex-E device, with its full complement of resources, provides the designer with a total of 411,955 CMOS transistor gates. This device can thus be used to implement reasonably sizable designs running at moderately high clock speeds.

2.4 Reconfigurable Computing

The constant improvement in FPGA device density and performance has prompted many to look at using these devices for implementing high-performance computing applications. The traditional advantages of these devices are that they can be configured and reconfigured easily, through direct host program control, and at little extra cost (except when reconfiguring during application runtime). The increase in gate count and speed of these devices has also made them an appropriate target for building high-performance, custom computing machines. These machines, also referred to as reconfigurable computing machines, provide the flexibility to program and reprogram systems and at the same time provide high-performance computing at a relatively low cost when compared to the price-performance models of other high-performance platforms, such as supercomputers [2, 5, 6, 7, 8, 9]. Several computing platforms consisting of arrays of FPGA devices have been developed through research and experimentation and are currently commercially available.

The DEC Paris Research Laboratory’s Programmable Active Memories (PAM) project was one of the earliest pioneers in reconfigurable computing [1]. The PRL team implemented the RSA encryption algorithm at speeds that had never been achieved, beating supercomputers and even custom discrete IC applications at that time.

SPLASH and Splash 2 are two other reconfigurable architectures developed in the early nineties, Splash 2 being an upgrade of the original SPLASH architecture [17]. The Splash 2 consisted of 16 printed circuit boards, each with 17 Xilinx XC4000 FPGA chips. Each XC4000 had its own memory banks, which it could independently read and write. A number of high-performance scientific applications were implemented on Splash 2, in domains such as gene sequence matching, fingerprint matching and image processing, at speeds two orders of magnitude greater than the fastest supercomputers at that time [17].

Several companies have brought commercial platforms to market over the past few years, attempting to exploit this new computing model. Annapolis Microsystems [18], SRC Computers Inc [19], and Star Bridge Systems [20] are three of the most prominent players in this market. These new reconfigurable architectures are being marketed as platforms that can be used for implementing a wide range of applications from different domains. The Annapolis WILDCARD™ reconfigurable platform that we are using in our research is one of these, albeit a low-end version.

The research described in this thesis examines the architecture, design and implementation of the computationally intensive UPGMA algorithm on a low-end reconfigurable platform and evaluates the performance as contrasted with that obtained by an implementation of the same algorithm using conventional software program execution on a standard Intel® CPU-based personal computer.

As discussed, we have chosen the UPGMA phylogenetics algorithm as the application domain in which we will explore the architecture and design space, and subsequent performance differences, of applying the reconfigurable computing paradigm to this scientific computing problem. Phylogenetics provides us with a rich variety of problems with complex computational tasks that can be studied to see if they can be implemented on reconfigurable machines. Furthermore, the software domain has already been thoroughly explored, and few performance gains can be realized from further software optimization of the UPGMA algorithm in particular.

With this rationale clearly in mind, we progress to our discussion of the problem-solving and analysis of the UPGMA domain to derive a suitable high-level architecture with which to implement the algorithm. In addition, we will need to weigh our architecture against the resource and timing constraints of the underlying Xilinx device and the WILDCARD™ platform (including its mechanisms for interfacing with the PC-based host system in which it resides).



In this chapter, we discuss the reconfigurable computing platform we have available to us for purposes of this research. As part of our analysis, we had to thoroughly analyze this platform in order to understand its operating environment, its programming model, and its key features and constraints. All of this was required prior to devising an architecture for our UPGMA solution, because any such architecture would be constrained both by the resources of the Xilinx device resident on the WILDCARD™ and by the programming model and execution environment provided by the vendor for us to realize our solution.

3.1 The Annapolis WILDCARD™ System

The WILDCARD™ system comes as a PC card and can plug into a PCI/PCMCIA card slot adapter, making it a very portable low-end reconfigurable platform. It has a very compact architecture, with a single Xilinx Virtex XCV300E processing element (PE) and two independent memory modules, one on either side, forming the core of the system. The architectural block diagram is given in Figure 7 below.

Figure 7. The WILDCARD™ Platform Block Diagram [18]

Each of the two memory blocks, referred to as the Right and Left memory banks, is a 64K x 32-bit RAM module, with a 19-bit address bus and a 32-bit data word. The PE can write to and read from the right and left memories independently. The host interface is through a 32-bit CardBus (PCMCIA) controller that operates at a 33 MHz clock frequency. The CardBus controller interfaces with the PC host through the PCI Bus interface, and with the PE through the LAD Bus interface.

Data transfers to and from the PC host are performed through a set of C driver calls that interface with the CardBus controller, which, in turn, interfaces with the LAD Bus to send data to, and retrieve data from, the PE. Data can be written from the host to the memory through these interfaces by making the C calls defined in the vendor's Host Application Programming Interface (API).

The PE also has certain Input and Output pin connections that enable it to connect to external devices. These pins are helpful when you want your application program to communicate with an external device.

The WILDCARD™ board has a frequency synthesizer that generates one main global clock signal, F_Clk, which is used to derive three other global clocks, namely P_Clk, M_Clk, and K_Clk. The user can set the frequency of F_Clk using a C routine call from the host. P_Clk is the PE clock pad signal, and is set to half the frequency set for F_Clk by the user. M_Clk is the memory clock pad signal and operates at the same frequency that is set by the user for F_Clk. Finally, K_Clk is the CardBus/LAD Bus clock pad signal, which always operates at 33 MHz.
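The clock relationships described here can be summarized in a few lines of Python. This is only an illustrative sketch: the function name `derive_clocks` and the use of MHz as the unit are assumptions made for the example, not part of the vendor software.

```python
K_CLK_MHZ = 33.0  # CardBus/LAD Bus clock, fixed at 33 MHz per the text


def derive_clocks(f_clk_mhz):
    """Return the four clock frequencies (MHz) for a chosen F_Clk setting.

    Hypothetical helper: it only restates the relationships given in the
    text (P_Clk = F_Clk / 2, M_Clk = F_Clk, K_Clk = 33 MHz).
    """
    if f_clk_mhz <= 0:
        raise ValueError("F_Clk frequency must be positive")
    return {
        "F_Clk": f_clk_mhz,
        "P_Clk": f_clk_mhz / 2,  # PE clock runs at half of F_Clk
        "M_Clk": f_clk_mhz,      # memory clock follows F_Clk
        "K_Clk": K_CLK_MHZ,      # independent of the user setting
    }
```

For instance, a user setting of 100 MHz for F_Clk would, under these relationships, clock the PE at 50 MHz and the memories at 100 MHz.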

3.2 The WILDCARD™ System VHDL model

The WILDCARD™ system software package provides VHDL models for the whole board that can be used to create a VHDL-based program model and to implement and debug the whole reconfigurable application design. The VHDL model of the system also contains a simulation model of the host system that is used for testing the application from the perspective of custom computing hardware-software co-design.

The VHDL model provides interface components that are used to access all the components on the WILDCARD™ system. There are two types of interface components, namely the Standard Interfaces and the Mux Interfaces [18].

Standard interfaces are simple interfaces to the devices (PE, memories) on the system and can be used for low-level, specifically tuned applications. The Mux (or multiplexing) interface can be used for programming at a higher-level interface between the LAD Bus, Memory and the PE components. Both of these interfaces allow multiple user application components to share a single resource (such as the LAD Bus or the PE’s memory banks).

The development environment provides VHDL models for the following platform components, which are used for early model integration, hardware-software partitioning analysis, and functional verification and clock cycle-level timing analysis.

• Processing element (PE)

• Right Memory Bank

• Left Memory Bank

• Host

• Clock Generation

• Input and Output Connectors

• PCI controllers

The PE VHDL model is a standard VHDL entity-architecture pair. The entity defines the input and output pads of the PE device. The pad numbers are logical and do not match the physical pin numbers of the FPGA die. This entity definition is fixed and is used as a template for the physical PE while creating the application design. In preparing these components for exploring the space of possible architectures for the UPGMA algorithm, we do the following. The PE architecture template is modified to embed the application design within it. The Standard and Mux interface models are then used to connect the embedded application design to the LAD Bus and the memory banks. Finally, this allows us to take the resultant composite PE model and synthesize the PE image for actually configuring the WILDCARD™ device.

The VHDL models provided for the memory banks, host, clock generation, I/O connectors and the PCI controllers are purely for simulation purposes. These models are used within the WILDCARD™ simulation model and encapsulate the system’s functionality for use in VHDL simulation, enabling us to functionally verify the PE designs, as well as validate the correctness of the UPGMA algorithm, before synthesizing the actual design units and placing and routing them onto the Virtex device resident on the WILDCARD™.

3.3 WILDCARD™ Host Programming

The WILDCARD™ system is composed of three main components: (1) the WILDCARD™ board; (2) the WILDCARD™ device driver; and, (3) the Host Application Programming Interface (API). The WILDCARD™ Software Design Hierarchy is given in Figure 8. The host programming is done in the C language using the standard Host API routines to communicate with the WILDCARD™ board through the Windows®-based device driver.

The device driver provides a low-level hardware interface to the WILDCARD™ board. When the driver is invoked with the appropriate driver function codes, it initializes the WILDCARD™ in a sequence of steps by reading its configuration and establishing handler interfaces for memory, interrupts and DMA operations. The WILDCARD™ API presents a generalized view of the hardware resources and control operations. The following operations are performed by calling the API routines:

• Opening and Closing the WILDCARD™ board

• Clock control (frequency)

• Processing Element control (program, reset, register space)

• Memory Interfaces (read/write)

• Interrupt control (PE/FIFO enable/disable)

The C function routines for each of the above operations are discussed below.

Figure 8. The WILDCARD™ Software Design Hierarchy [18].

3.3.1 Opening and Closing the WILDCARD™ board

The host program first makes an “open” call to the WILDCARD™ board before performing any other operations. This initializes the device driver, which, in turn, initiates the interface handlers for access to the board components. The C routine for this is WC_Open( ). The counterpart of WC_Open( ) is WC_Close( ). For every WC_Open( ) there should be a corresponding WC_Close( ) call to ensure a clean disconnect and proper de-allocation of resources.

3.3.2 Clock Control

The only clock control operation a host program can perform is setting the frequency of F_Clk. The function call for this is WC_SetClkFrequency( ). The programmable clock module allows user programs to change the clock frequency anytime by calling this routine.

3.3.3 Processing Element and Interrupt Control

The four main operations that are executed against the Processing Element (PE) are: (1) the PE Reset; (2) the PE Program; (3) the PE Register space read and write; and, (4) the PE Interrupt control.

The PE Reset operation is used to reset the PE and the embedded application residing on it. The PE program function calls are used to program the PE device with the user-designed application. There are two function calls: (a) PE_ProgramFromBuffer( ), which is used to program the PE from a user buffer space; and, (b) PE_Program( ), which is used to program the PE from a file.

The PE has a register space that the host can read from and write to. The register space has an address range of 0x04000 to 0x0FFFF. The two function calls to read and write to this register space are: (a) PE_RegRead( ), which reads from the PE register space locations; and, (b) PE_RegWrite( ), which writes to the PE register space locations.

When a PE interrupt occurs, the device driver immediately masks the PE interrupt line and informs the calling program suspended on the API call that an interrupt has occurred. Interrupt control is done using the following functions [18]:

• WC_IntQueryStatus( ) – checks the status of the PE interrupt line via polling. This is useful when the host program is written to do other operations while waiting for the interrupt.

• WC_IntWait( ) – waits for the PE interrupt; useful when the host only needs to wait for the PE interrupt before proceeding to perform anything else. The calling program is suspended.

• WC_IntReset( ) – after the host program has processed the interrupt, it can reset it and clear the API’s indication of a pending interrupt.

3.3.4 Memory Control

There are two main memory control API calls that are made by the host C application program. They are: (1) WC_MemRead( ), which reads from the right or the left memory SRAM banks; and, (2) WC_MemWrite( ), which writes to the right or the left memory SRAM banks. The calling arguments include the memory bank identifier, the base address and size of the block of data (that is, the number of DWORDS) to be written or read [17]. The function calls invoke the device driver that, in turn, manages handlers to transfer data through the PCI bus through the CardBus/LADBus interface to the


3.4 PE Embedded Application Initialization

The host program must proceed through a set of steps for initializing the application in the PE device before the reconfigurable application can be started. The steps are as follows:

• Open the WILDCARD™ board;

• Initialize the clock by setting it to a particular frequency;

• Enable the PE Reset line – ensures the PE is reset at least once when the clock

• Disable and clear any pending interrupts left hanging by any previous

• Load the PE image by calling the PE_Program( ) API routine;

• Execute any additional initialization tasks as necessary, such as for enabling PE interrupts; and,

• Disable the reset lines to allow normal operation of the PE.
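The ordering of these steps can be illustrated with a small Python sketch. It does not call the vendor API: `MockBoard` is a hypothetical stand-in that merely records which operation was requested, and every method name (`open`, `set_clock`, `assert_reset`, and so on) as well as the image file name is an assumption made for illustration.

```python
class MockBoard:
    """Hypothetical stand-in for the board; it only logs the calls made."""

    def __init__(self):
        self.log = []

    def __getattr__(self, name):
        # Any unknown method name is accepted and recorded in call order.
        def record(*args):
            self.log.append(name)
        return record


def initialize(board, image_path, f_clk_mhz):
    """Walk through the initialization steps in the order listed above."""
    board.open()                          # open the board
    board.set_clock(f_clk_mhz)            # set the clock frequency
    board.assert_reset()                  # hold the PE in reset
    board.disable_and_clear_interrupts()  # discard stale interrupts
    board.program_pe(image_path)          # load the PE image
    board.enable_interrupts()             # optional extra initialization
    board.release_reset()                 # let the PE run normally
```

The point of the sketch is the sequencing: the PE is held in reset and interrupts are cleared before the image is loaded, and the reset is released only as the last step.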

Now the downloaded UPGMA application is running on the WILDCARD™ processing element, and the host program can start its portion of the application processing activity, which consists of transferring taxa and phylogenetic tree data to and from the offloaded UPGMA algorithm application.



In this chapter, we present the design methodology employed to create the custom computing application that offloads the UPGMA algorithm from the Host onto the WILDCARD board for accelerated processing of taxa data.

4.1 VLSI Design Flow

New VLSI design methodologies have emerged every four or six years. Hardware description languages and EDA tools have made it possible to design VLSI systems at higher levels of abstraction. Hardware description languages provide chip designers with the capability of describing the functionality of a design at a higher, more abstract level of representation than a gate level representation. VHDL and Verilog HDL are the two languages used in the industry for hardware modeling and implementation.

Figure 9 shows an HDL-based design process model representing the various design activities. A brief description of each of the activities in the design process model is given below:

• System specification is an activity to abstract design information from a problem statement and define the interface and timing waveforms of the system.

• System partitioning is an activity to hierarchically decompose a system to handle complexity based on system specification, design resource, and feasibility of implementation. Components at the final hierarchical level should facilitate behavioral modeling using HDLs or allow for HDL component reuse. The output of this activity is a valid system partition.

Figure 9. An HDL-based design process model [21]

• Modeling or adaptation involves capturing a design component in HDL with high-level timing information and data dependencies or adapting reusable HDL components from a design library.

• Component simulation verifies functional behavior and high-level timing of each component using HDL test benches or cycle-based simulation techniques.

• System binding is structural integration of simulated components based on the system partition. This activity produces a system model for verifying system

• System simulation verifies system behavior and timing using HDL test benches or cycle-based simulation techniques. This activity produces simulation results that can be verified with system specification.

• Logic synthesis is obtaining a gate-level netlist using an automated synthesis tool. Removal of timing information and non-synthesizable constructs, technology mapping and defining area and timing constraints are involved in this activity. The target ASIC library and constraints are chosen to comply with system

4.2 UPGMA Project Design Flow

The design methodology described above forms the basis for building the application design. The UPGMA algorithm undertaken for this project forms the problem statement. It was analyzed and a system specification generated. The design was partitioned into easily manageable blocks, and each of these modules was modeled in VHDL. The top-level hierarchical model is structural and merely connects the different sub modules to form the final top-level design. We have employed a certain amount of adaptation by reusing Xilinx cores to implement certain modules in the design. This was done mainly to ensure that the design used no more resources than were available in the Virtex XCV300E chip of the WILDCARD™ board.

The component simulation was conducted on each of the sub modules to verify their functionality. This step eases the final system-level simulation process, as the errors within the sub modules have been removed by then. The final top-level structural model was then written and system simulation conducted to verify the functionality of the design as a whole. The ModelSim simulation environment was used for debugging and testing purposes.

Logic synthesis was conducted mainly for identifying the critical paths and finding the resource usage of the design. The Synplify Pro® 7.3 FPGA synthesis tool was used for this purpose. The synthesized gate-level netlist does not entirely form the final design being implemented on the Processing Element (PE) of the WILDCARD™ system. The functionally verified design was embedded within the PE VHDL architecture model. The PE model was then placed within the WILDCARD™ system simulation model and the final functional testing was conducted. The verified PE model was then synthesized and the EDIF netlist generated. The EDIF netlist was then placed and routed using a make file provided by the WILDCARD™ system. The make file provided calls to the Xilinx place and route tools for generating the final PE image that is used to configure the device. This image is then used to proceed with the WILDCARD™ host programming process shown in Chapter 3.

4.3 UPGMA Design

The design architecture was formulated keeping in mind the parameters that govern the data sizes upon which the design operates. The data bit-width constrains the bit widths of registers, the bit widths of datapath elements and the memory requirements. We first define these parameters and then describe the design architecture.

4.3.1 Design Parameters

For a taxa size of n:

• The number of nodes that form the final tree is n + (n-1).

• The number of distance values that need to be stored is (n(n-1) + (n-1)(n-2))/2:

  ◦ For n nodes there are n(n-1)/2 pairwise combinations, so the number of nodal distances is n(n-1)/2.

  ◦ When an internal node is formed, the number of nodes still left to be connected reduces by 2 for the very first internal node and by 1 thereafter. Initially there are n external nodes. When the first internal node is formed by joining two external nodes, the number of nodes left is n-2, so we need to compute the distance of the new internal node to n-2 different nodes. Thereafter, each internal node formed reduces the count by 1, making the number of nodes to be connected n-3, n-4 and so on. The total number of distances for internal nodes is thus (n-2) + (n-3) + (n-4) + … + 1, which is equal to (n-1)(n-2)/2.
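As a quick check of these counts, the short Python sketch below adds up the series (n-2) + (n-3) + … + 1 term by term rather than relying on a closed form; summed this way the series comes to (n-1)(n-2)/2. The function names are illustrative only.

```python
def tree_node_count(n):
    """Nodes in the final tree: n leaves plus n - 1 internal nodes."""
    return n + (n - 1)


def leaf_pair_count(n):
    """Pairwise distances among the n original (external) nodes."""
    return n * (n - 1) // 2


def internal_distance_count(n):
    """Distances created for internal nodes: (n-2) + (n-3) + ... + 1."""
    return sum(range(1, n - 1))


def total_distance_count(n):
    """All distance values stored over the course of the algorithm."""
    return leaf_pair_count(n) + internal_distance_count(n)
```

For n = 4 this gives 7 tree nodes, 6 leaf-to-leaf distances, 3 internal-node distances and 9 stored distances in total.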

We define below the data structures used to represent the input, intermediate and output data. There are four types of data upon which the algorithm operates:

• Distance data – A simple 32-bit data word is used to represent this value. The distance data is stored in the left memory bank, which has a 32-bit data width; hence the choice of a 32-bit word for representing the input distance data.

• Nodes – The nodes that form the tree are represented as 10 bits each. The choice of 10 bits was made as initially the design was targeted to implement a 512-taxa

• Heights – Each node has a height associated with it. This indicates the number of nodes beneath it. This value is represented in 16-bit format. A 512-taxa tree would have a root with the largest height of 512. An initial size of 9 bits was selected in earlier versions of the modeling but was changed to 16 bits when the 16-bit Block RAMs were chosen to implement the Height memory that is used to store the heights of the nodes.

• Tree Output Data – This data represents each node in a tree with its parent node and the branch length to its parent. The format used is given below:

    Node ID (10 bits) | Parent ID (10 bits) | Branch length (12 bits)

  This format was used to connect up the nodes while generating the final tree. The total bit length is 32 bits. The tree output data is stored in the right memory, which has a data word length of 32 bits.
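The 32-bit tree output word can be illustrated with a small packing routine. The field widths (10, 10 and 12 bits) come from the text; the placement of the Node ID in the most significant bits, and the function names, are assumptions made for this sketch.

```python
NODE_BITS, PARENT_BITS, BRANCH_BITS = 10, 10, 12  # widths from the text


def pack_tree_word(node_id, parent_id, branch_length):
    """Pack one tree entry into a 32-bit word (assumed order: node, parent, branch)."""
    fields = (
        ("node_id", node_id, NODE_BITS),
        ("parent_id", parent_id, PARENT_BITS),
        ("branch_length", branch_length, BRANCH_BITS),
    )
    for name, value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name} does not fit in {bits} bits")
    return (
        (node_id << (PARENT_BITS + BRANCH_BITS))
        | (parent_id << BRANCH_BITS)
        | branch_length
    )


def unpack_tree_word(word):
    """Recover (node_id, parent_id, branch_length) from a packed word."""
    node_id = (word >> (PARENT_BITS + BRANCH_BITS)) & ((1 << NODE_BITS) - 1)
    parent_id = (word >> BRANCH_BITS) & ((1 << PARENT_BITS) - 1)
    branch_length = word & ((1 << BRANCH_BITS) - 1)
    return node_id, parent_id, branch_length
```

Because the three fields add up to exactly 32 bits, one tree entry occupies one word of the right memory bank.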

4.3.2 Design Datapath

The basic datapath of the design that is used for calculating the average distances and also obtaining the minima is given below in Figure 10. The datapath is broken into two parts, one used for finding the least distance, or minimum, and the other for calculating the average distances. The datapath on the left of Figure 10, with the less-than operator, is used to find the minimum. It has the distance value and the current minimum as inputs. The current minimum is stored in a register and is fed back into the comparator. The second datapath, on the right, is used to calculate the average distance. It has the distance value $d_{ik}$ and the height of node i as inputs. The multiplier obtains the product of this height and the distance and sends it to the adder. The adder adds this value to an accumulator. The multiplier and adder together obtain the numerator of the average distance equation (2) given in Chapter 2. At the same time that the multiplier and accumulator are computing the numerator, the second adder, which takes the height $h_i$ as input, computes the denominator of equation (2). When these two are computed, the resulting values given below are sent to the divider to obtain the average distance.

$$\text{Numerator} = d_{ik} h_i + d_{jk} h_j$$

$$\text{Denominator} = h_i + h_j$$

$$\text{Average distance} = \frac{d_{ik} h_i + d_{jk} h_j}{h_i + h_j}$$
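A software model of the averaging datapath may help make the data flow concrete. The sketch below assumes that the numerator is the sum of height-weighted distances, in line with the average distance equation of Section 4.3.15, and that the divider returns an integer quotient; the function and variable names are illustrative.

```python
def average_distance(d_ik, h_i, d_jk, h_j):
    """Model the multiply-accumulate, height adder and divider of the datapath."""
    add_reg = 0     # accumulator for the numerator (Add register)
    height_reg = 0  # accumulator for the denominator (Height register)
    for distance, height in ((d_ik, h_i), (d_jk, h_j)):
        add_reg += distance * height  # multiplier output fed into the adder
        height_reg += height          # height adder works in parallel
    # Integer quotient, as assumed for a hardware divider without fractions.
    return add_reg // height_reg
```

With distances 10 and 20 and heights 1 and 3, the numerator is 70, the denominator is 4, and the integer quotient is 17.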

Figure 10. Design Datapath

4.3.3 Design Architecture

The architecture of the design is given in the block diagram shown in Figure 11. The architecture shown is the final one created after several design passes. The two main components that form the backbone of the design are:

• The Controller

• The Address Generator

Most of the design effort has been put into modeling these two components since the controller executes the working of the UPGMA algorithm and the address generation forms the core part of the algorithm’s process. The description of these two components along with that of the other sub components is given below. We start with the simpler ones and then proceed later to the complex components. The datapath components are dealt with first, then moving to control path and finally the memory modules.

4.3.1 Adder

The Adder is modeled using the simple VHDL “+” operator. It has two data inputs and one output, each 32 bits wide.

The basic architecture of the 32-bit Adder would be realized using the Ripple-Carry design. This style of Adder architecture has its carry chain as its critical path; however, for this bit-width, the tradeoff in area versus speed was not significant enough to warrant exploring other, more sophisticated Adder architectures.

[Figure 11 block diagram. Recoverable block labels: Multiplier, MulReg, Adder, AddReg, HAdder, HAddReg, Divider, DivReg, Comparator, Least Distance Reg, RowReg, ColReg, Output Selector, Address Generator (Address Generate Control, Address Modification), Controller, Height Memory, and the Left and Right Memory Banks.]



Figure 11. Block Diagram of UPGMA Architecture.

4.3.2 Add Register

The Add register stores the output value of the Adder module. The register value is fed back into the Adder so that these two components work together as an accumulator. The Adder input is added to the old value stored in the Add register and the cumulative value is stored back into the register. The final value of the Add register forms the numerator of the average distance equation given in Chapter 2. The Controller module sets the enable and clear signals for the register.

4.3.3 Height Adder

The Height Adder is similar in architecture to the basic Adder and has two inputs and one output, each of which is 16 bits wide. This module is used to add the heights of the nodes of the tree.

4.3.4 Height Register

The architecture of the Height Register is similar to the Add Register, except that it is 16 bits wide. It stores the output value of the Height Adder and its value is fed back into the Height Adder so that they work together as an accumulator. The Height Register’s final cumulative value forms the denominator in the average distance equation given in Chapter 2.

4.3.5 Multiplier

The multiplier is a simple 32-bit multiplier modeled using the standard VHDL “*” operator. The multiplier has two inputs and one output, each 32 bits wide.

4.3.6 Multiplier Register

The Multiplier register stores the Multiplier module’s output. The register component’s architecture and pin configuration are the same as those of the Add Register. From the standpoint of the VHDL model, a single generic entity-architecture description is employed for all the 32-bit register units.

4.3.7 Divider Unit

The Divider was modeled using the shift-subtract division algorithm. The divider forms the critical path of the design, and the controller coordinates its operation to ensure that the design runs at the requisite clock rate. The Divider has two 32-bit inputs and has as outputs a 32-bit quotient and 1-bit “valid” flag.
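For readers unfamiliar with shift-subtract division, the following Python sketch shows the restoring form of the algorithm for unsigned operands. It is a behavioral illustration under the assumption of one quotient bit per iteration, not a transcription of the VHDL model; in hardware, the “valid” flag would presumably be raised once all iterations have completed.

```python
def shift_subtract_divide(dividend, divisor, width=32):
    """Unsigned restoring division producing one quotient bit per step."""
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    if not (0 <= dividend < (1 << width) and 0 < divisor < (1 << width)):
        raise ValueError("operands must fit in the given width")
    remainder = 0
    quotient = 0
    for bit in range(width - 1, -1, -1):
        # Shift the next dividend bit into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        quotient <<= 1
        # Subtract the divisor whenever it fits, recording a 1 bit.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
    return quotient, remainder
```

A 32-bit division takes 32 such steps, which is consistent with the divider being the slowest unit in the datapath and needing coordination by the controller.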

4.3.8 Divider Register

The Divider register simply stores the output value of the Divider unit. Its architecture and pin configuration are similar to those of the Add and Multiplier registers. The controller handles the enable and clear signals.

4.3.9 Comparator

The Comparator is used to compare the distance values and find the minimum. The comparator is modeled using the simple VHDL “>” operator. The comparator compares the new distance value with the previous minimum stored in the Least Distance Register; when it finds a new minimum, it enables the Least Distance Register, the Row Register and the Column Register.
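The interplay between the comparator and the three registers it enables can be sketched in Python as a running-minimum search. The use of `None` to represent a freshly cleared register is an assumption of this sketch, since the cleared value used in hardware is not stated here.

```python
def find_minimum(distances):
    """Scan (row, col, value) triples and keep the smallest value seen.

    The three variables mirror the Least Distance, Row and Column registers,
    which are all updated together when the comparator finds a new minimum.
    """
    least_reg = None  # assumed "cleared" state before the first comparison
    row_reg = col_reg = None
    for row, col, value in distances:
        if least_reg is None or least_reg > value:  # ">" as in the text
            least_reg, row_reg, col_reg = value, row, col
    return least_reg, row_reg, col_reg
```

Because the comparison is strict, the first of several equal minima is the one retained.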

4.3.10 Least Distance Register

The Least Distance Register stores the current minimum while the algorithm continues to search for the minimum of all the distances. The register architecture is similar to that of the generic 32-bit register, with the clear signal being set by the controller and the enable signal being set by the comparator.

4.3.11 Row and Column Registers

The Row and the Column Registers store the Row and the Column values of the current minima in the distance matrix. Each distance matrix is accessed through the row and column values. These two values are used to obtain other distance values while calculating the average distance value. The comparator sets the enable signal and the controller sets the clear signal controlling these registers.

4.3.12 Controller

The Controller forms the core of the design, making its behavior one of the most complex to model. Its operation is based on the processing steps of the UPGMA algorithm. The steps for the controller are given in Figure 12, below.

Figure 12. The Controller Algorithm.

The three main operations being performed by the controller for every single pass through the matrix are as follows:

• Find the new minimum

• Compute the average distance

• Reduce the matrix size

The controller accomplishes this by stepping through a set of states and repeating the process until all the nodes in the tree have been handled.
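The three operations repeated by the controller correspond to the main loop of UPGMA, which can be sketched in software as follows. This is a plain Python illustration of the algorithm, not a model of the controller's state machine; the dictionary keyed by `frozenset` pairs and the node numbering scheme (new clusters numbered from n upward) are assumptions made for the example.

```python
def upgma(leaf_distances, n):
    """Run UPGMA on leaves 0..n-1.

    leaf_distances maps frozenset({a, b}) to the distance between a and b.
    Returns a list of (new_node, left, right, distance) merge records.
    """
    dist = dict(leaf_distances)
    height = {i: 1 for i in range(n)}  # leaf nodes have a height of one
    active = list(range(n))
    merges = []
    next_id = n
    while len(active) > 1:
        # 1. Find the new minimum over all active pairs.
        pairs = [(a, b) for idx, a in enumerate(active) for b in active[idx + 1:]]
        x, y = min(pairs, key=lambda p: dist[frozenset(p)])
        d_min = dist[frozenset((x, y))]
        # 2. Compute the average distance from the new cluster to the rest.
        j, next_id = next_id, next_id + 1
        for k in active:
            if k in (x, y):
                continue
            numerator = (height[x] * dist[frozenset((x, k))]
                         + height[y] * dist[frozenset((y, k))])
            dist[frozenset((j, k))] = numerator / (height[x] + height[y])
        height[j] = height[x] + height[y]
        # 3. Reduce the matrix: drop x and y, keep the new cluster j.
        active = [k for k in active if k not in (x, y)] + [j]
        merges.append((j, x, y, d_min))
    return merges
```

Each pass removes two nodes and adds one, so the loop runs n - 1 times for n taxa.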

4.3.13 Counter Units

The counters form some of the sub components of the address generator block. The counters are used to select the next address to be generated. The generic block diagram for one of the counters is given below. A counter counts up until the count value becomes equal to the compare value (CV), at which point it sets a “great” flag, indicating that the count limit has been reached; the counter is then reset back to zero. The controller sets the Increment and Clear signals.
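A behavioral sketch of one such counter is given below in Python. The exact cycle in which the flag is raised and the count returns to zero is an assumption of this sketch; the class and method names are illustrative.

```python
class Counter:
    """Up-counter that flags when it reaches its compare value (CV)."""

    def __init__(self, compare_value):
        self.compare_value = compare_value
        self.count = 0
        self.great = False

    def clear(self):
        """Model the Clear signal from the controller."""
        self.count = 0
        self.great = False

    def increment(self):
        """Model one clock with Increment asserted; return the 'great' flag."""
        self.count += 1
        self.great = self.count == self.compare_value
        if self.great:
            self.count = 0  # wrap back to zero once CV is reached
        return self.great
```

With a compare value of 3, the flag is raised on every third increment and the count restarts from zero.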

4.3.14 Multiplexers

The multiplexers enable the address generator to select the next address. We have 2:1 and 3:1 multiplexer architectures for this purpose.

4.3.15 Address Generator

The address generator is a module that underwent several design cycles. The address generation algorithm is not simple when considered from a hardware perspective. Address generation is conducted for two operations in the design process:

• Finding distance minima in an instance of the matrix; and,

• Calculating average distance

For finding the minima, the address is generated for fetching the next distance value from the memory. The address is generated in the format of “row&column”, with the row and column values concatenated together to represent the actual memory address. The row and column values represent the row and column of the node-to-node distance matrix, with each row or column representing a node.

For example, the node 1 to node 2 distance can be fetched by concatenating “0000000001” with “0000000010” to generate the 20-bit address “00000000010000000010” before actually accessing that memory location. These two values are obtained by reading a memory that stores the currently active node values.
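The “row&column” concatenation amounts to a shift and an OR, as the Python sketch below shows. The 10-bit field width follows the node representation given earlier; the function name is illustrative.

```python
NODE_BITS = 10  # each node value is represented in 10 bits


def make_address(row, col):
    """Concatenate row and column node values into one 20-bit address."""
    limit = 1 << NODE_BITS
    if not (0 <= row < limit and 0 <= col < limit):
        raise ValueError("row and column must each fit in 10 bits")
    return (row << NODE_BITS) | col
```

For the example in the text, `make_address(1, 2)` yields the address whose 20-bit binary form is 00000000010000000010.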

The earlier version of this memory structure, referred to as “node memory”, was modeled behaviorally and later modified to reduce the size of the module. The earlier model took more resources than were available in one full Virtex XCV300E chip. The model was later modified by implementing the node memory using a 256 x 32-bit Xilinx Block RAM that uses a single Block RAM resource on the Virtex XCV300E chip. This reduced the resource consumption significantly, making the module work much more efficiently.

The Xilinx Block RAM is a dual-port memory structure [16], structured so that two different addresses can be written or read, or read and written, in combination at the same time. This enables us to read the two different node values at the same time in order to generate the concatenated address within one clock cycle. The block diagram of the dual-port Block RAM is given above.

The counters, discussed earlier, are used to generate the address values, ‘addra’ and ‘addrb’, for selecting the nodes to generate the next address. After all the distances are read and compared, the minimum is stored in the Least Distance Register.

The average distance calculation requires generating addresses by selecting nodes that have not yet been joined into a cluster. The nodes that have currently been selected to form the new cluster are stored in the Row and Column registers. The average distance equation is given below.

$$D_{(x,y),i} = \frac{H_x D_{xi} + H_y D_{yi}}{H_x + H_y}$$

Here x and y are the new nodes selected to form the new cluster, i is the node to which the distance from the new cluster is being calculated, $H_x$ and $H_y$ are the heights of nodes x and y respectively, and $D_{xi}$ and $D_{yi}$ are the distances of nodes x and y to node i, respectively. The address generation for calculating the average distance is done in the steps outlined as follows:

 The address for obtaining Dxi is generated first by selecting node i’s value and

concatenating with node x’s value stored in the Row Register;

 The address for obtaining Dyi is generated second by selecting node i’s value

again, and concatenating with node y’s value stored in the Column Register;

 These two distances are added to calculate average distance D(x, y)  i ; and,

 This new distance is then stored into the memory for future reference. The new

cluster forms a new node, lets say j, and the average distance calculated

represents the distance of this cluster to node i. Thus, the address for storing this

distance would be cluster j’s value concatenated with node i’s value.

The above four steps are performed for all of the nodes that have not yet been connected to the tree. After the average distances of all the nodes have been calculated, the node memory is updated by removing the nodes that have been selected to form the new cluster. The complexity of the process lies in maintaining the node memory and stepping through the process of selecting memory’s addresses to obtain the next address.
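The four steps above can be sketched in C as a small software model. This is only an illustration under stated assumptions, not the VHDL datapath: the function names are hypothetical, the 10-bit node width is inferred from the 20-bit concatenated addresses quoted in this chapter, the ordering of the two nodes within each address is assumed, and integer division stands in for whatever rounding the hardware applies.

```c
#include <assert.h>
#include <stdint.h>

#define NODE_BITS 10u  /* assumed node width; two nodes form a 20-bit address */

/* Concatenate two node values ("first & second") into one memory address. */
static uint32_t concat_address(uint32_t first, uint32_t second)
{
    assert(first < (1u << NODE_BITS) && second < (1u << NODE_BITS));
    return (first << NODE_BITS) | second;
}

/* Height-weighted average distance from the new cluster (x, y) to node i. */
static uint32_t average_distance(uint32_t hx, uint32_t dxi,
                                 uint32_t hy, uint32_t dyi)
{
    assert(hx + hy > 0);
    return (hx * dxi + hy * dyi) / (hx + hy);  /* integer division: assumption */
}

/* Steps 1 to 4 for one unjoined node i; mem models the distance memory
   and must cover the full 20-bit address range. */
static uint32_t update_cluster_distance(uint32_t *mem,
                                        uint32_t x, uint32_t y,
                                        uint32_t i, uint32_t j,
                                        uint32_t hx, uint32_t hy)
{
    uint32_t dxi = mem[concat_address(i, x)];  /* step 1: node i with Row register */
    uint32_t dyi = mem[concat_address(i, y)];  /* step 2: node i with Column register */
    uint32_t dji = average_distance(hx, dxi, hy, dyi);  /* step 3 */
    mem[concat_address(j, i)] = dji;           /* step 4: cluster j with node i */
    return dji;
}
```

In the design itself these steps are sequenced by the controller over several clock cycles; the model only shows which addresses are formed and how the stored value is derived.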

4.3.16 Output Generator

This module selects the output values that form the output tree data written into the memory. The outputs that this module generates are listed as follows:

• Node type – internal or external (leaf) node;

• Node ID – the value of the node;

• Parent ID – the value of the parent of the node; and,

• Branch distance – the distance of the node to its parent.

4.3.17 Height Memory

The Height memory holds the heights of the nodes in the tree. The external, or leaf, nodes have a height of one, and the internal nodes, those with child nodes, have heights of two or more. The memory architecture in the earlier version of this design was implemented behaviorally, and the post-synthesis results showed high resource usage and slow performance. Thus, after a design review, this module was implemented using four 256x16-bit Xilinx Block RAMs with additional write and read logic to control the Block RAM accesses. The module now uses only four Block RAM resources on the Virtex XCV300E chip.

4.3.18 Off-chip Memory Banks

The left and right memory banks on the WILDCARD™ are used for storing the distance and tree data, respectively. The banks are 64Kx32-bit SRAM modules, and access to them is managed through interface modules provided by the WILDCARD™ system. These components are available as VHDL models that can be used depending on the needs of the application. To write to the left and right memories, we need the interface components provided for these two banks. The application on the PE sends a read or write request through these interface components. The components allow multiprocessing, such that multiple applications could read and write to the memory at the same time. The read and write requests are prioritized, and the designer can choose the kind of prioritization used. This feature has not been used, as we have only one application design running within the PE that reads and writes to the memories.

The read and write operations each take a number of clock cycles to complete. Figures 13 and 14 give the timing diagrams for reading from and writing to the memory. The read cycle is such that the first data word takes 4 clock cycles to arrive after the read signal is set.

Figure 13. Typical read cycle from memory[18]

Figure 14. Typical write cycle to memory [18]

As we can see from Figure 14, the write takes only one clock cycle to be performed. The 4-clock-cycle read latency requires the controller to wait until the first data arrives before performing the datapath operations. The memory read thus introduces latency into the design and affects the performance. For larger taxa sets, more reads are performed, so the accumulated latency grows and drastically affects the speed of the design.

4.3.19 Addressing Schemes

The WILDCARD™ memory banks each have a maximum capacity of 65536 words of 32 bits. This small memory capacity not only limits the number of taxa that can be handled on this system but also affects the way the memory is addressed. The addressing scheme discussed in Section 4.3.15 is not effective when the number of taxa increases to, say, 256. The scheme uses the following methodology. Let us assume we have the distance matrix given in Figure 15 below.


Figure 15. Distance Matrix

The distances 6, 8, and 3 are addressed by D[0, 1], D[0, 2], and D[0, 3]. Thus, while writing data from the host, we place these three values at indexes 1, 2, and 3 of the array and transfer the data to the WILDCARD™ memory. These values are written to addresses 1, 2, and 3 of the memory. Thus the value 6 in address 1 of the memory can be referenced using the address “00000000000000000001,” obtained by concatenating the nodes “0000000000” and “0000000001” together, and similarly for locations 2 and 3.

Now, distance 7 is D[1, 2] and thus is written at index 1026 of the C array and transferred to the memory. It can thus be referenced by generating the address “00000000010000000010,” obtained by concatenating nodes “0000000001” and “0000000010” together. These nodes represent the indexes of the matrix D. Thus we use an addressing scheme that is similar to the way we reference the matrix values.

For datasets of 256 taxa or higher, this scheme fails: to obtain the distance between, say, nodes 254 and 255, we need the address “00111111100011111111,” which represents the value 260351, far beyond the 65536 available locations. Also, using this scheme, we waste memory locations. For example, the consecutive distance values 6, 8, and 3 are stored one after another in locations 1, 2, and 3, respectively, but the distance value 7 suddenly jumps to memory location 1026. This waste of memory locations would reduce as the taxa size increases, but it is still unacceptable.
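The concatenated addressing scheme and its capacity problem can be checked with a few lines of C. This is a minimal sketch: the function and macro names are hypothetical, and the 10-bit node width is taken from the address strings quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NODE_BITS  10u      /* width of one node value in the address */
#define BANK_WORDS 65536u   /* capacity of one memory bank, in 32-bit words */

/* Form the address "row & col" by placing row in the upper 10 bits. */
static uint32_t concat_address(uint32_t row, uint32_t col)
{
    assert(row < (1u << NODE_BITS) && col < (1u << NODE_BITS));
    return (row << NODE_BITS) | col;
}

/* True when the concatenated address falls inside one memory bank. */
static bool fits_in_bank(uint32_t row, uint32_t col)
{
    return concat_address(row, col) < BANK_WORDS;
}
```

With these definitions, concat_address(0, 1) is 1 and concat_address(1, 2) is 1026, matching the locations described above, while concat_address(254, 255) is 260351, so fits_in_bank(254, 255) is false.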

To avoid this problem and to be able to implement larger taxa datasets we have to employ a linear addressing scheme. The catch in this scheme is that our design needs to maintain a record of the node information while fetching every distance value, so that we can know which two nodes have the minimal distance between them. Thus the address generation using concatenation of nodes is important for the design. We therefore resort to an address modification scheme in which we generate the address in the original scheme and then modify it into a linear 16-bit address that does not go beyond our limit of 65536.

The address modification is a complex process that takes additional clock cycles and requires additional states in the control structure. This slows the design down and noticeably affects the performance. We will discuss the impact of the address modification on performance in later chapters.

Let us assume that in the address “node1&node2,” node1 refers to the row of the matrix and node2 to the column. Thus, for the matrix given in Figure 15, we have the mapping for each address value given in Table 2. The number of taxa is n = 4.


Matrix format    Linear format    Number of values per row
0-1              0
0-2              1                n-1
0-3              2
1-2              3                n-2
1-3              4
2-3              5                n-3

Table 2: Address mapping

From the above table we deduce that each row maps to a base address. For example, row 0 maps to base address 0; the value 0-1 then has linear address 0 and the value 0-2 has linear address 1, so for column 2 the base address of the row is incremented by 1. Similarly, for 0-3, the base address 0 is incremented twice. For row 0, therefore, adding (column – 1) to the base address gives the linear address of the required distance value. Row 0 holds (n – 1) values and has a base address of 0, so the base address for row 1 is (0 + n – 1), which equals 3 for n = 4. Thus 1-2 has a linear address of 3, and for 1-3 we add (column – row – 1) = 1 to the base address to obtain the correct linear address, 4. In general, the offset added to the base address of a row is (column – row – 1). The steps used to obtain the linear address are:

• Obtain the base address to which the row maps from a map memory

• Add the value (column – row – 1) to obtain the final address

The above methodology is employed to perform the address modification. The base addresses that each row maps to are written initially into Block RAM. The initialization of the node memory and the row map memory requires further additions to the controller states and thus adds considerable delay to the design. This delay grows for larger taxa datasets. The address modification component is placed within the PE, outside of the UPGMA application component. The address generated by the UPGMA component is fed to the address modification component, and the modified address is fed to the memory interface components.
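The two-step address modification can be sketched in C as follows. This is a software model under stated assumptions: the array base[] stands in for the row map memory held in Block RAM, the function names are hypothetical, and the base addresses are computed from the row lengths n-1, n-2, ... given in Table 2.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TAXA 256u

/* Fill base[r] with the linear address of the first entry of row r
   for an n-taxa upper-triangular distance matrix. */
static void build_row_base(uint32_t n, uint32_t base[])
{
    assert(n >= 2 && n <= MAX_TAXA);
    uint32_t next = 0;
    for (uint32_t row = 0; row + 1 < n; ++row) {
        base[row] = next;
        next += n - 1 - row;   /* row holds n-1-row distance values */
    }
}

/* Linear address of D[row, col] with row < col:
   step 1 looks up the row base, step 2 adds (col - row - 1). */
static uint32_t linear_address(const uint32_t base[], uint32_t row, uint32_t col)
{
    assert(row < col);
    return base[row] + (col - row - 1);
}
```

For n = 4 this reproduces Table 2 (1-2 maps to 3, 1-3 to 4, 2-3 to 5). For n = 256 the last entry, 254-255, maps to 32639, which stays within the 65536-word bank.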

4.3.20 Top-level Block

The top-level block in the design provides integration and routing of all the sub modules described above. The final top level design is then placed within the VHDL model for the PE, as a sub component to that model, and is interfaced with the memory and LAD Bus interfaces for handling the transfer of data.

The VHDL models for all the blocks in the design are listed in Appendix A.

4.4 Design Verification

The design verification of the UPGMA design was performed using the ModelSim simulation environment. The VHDL models of the WILDCARD™ system provide a simulation model that can be used to run a host-based simulation. The simulation was done for various taxa datasets. The benchmark dataset used was a 57-taxa dataset for which we had the output tree data generated by the software implementation. The output tree generated by the software simulation of the hardware design was compared with the benchmark data and found to match quite well.

To verify the working of the hardware implementation on the WILDCARD™ system, the data generated by the hardware implementation was compared with the benchmark data. The two results matched perfectly.

Data generated through the test data generating software was fed to both software and hardware designs and the resulting output was compared. We found that the two outputs matched, indicating that the hardware design was working correctly.



5.1 Experimental Apparatus for UPGMA

The WILDCARD™ host-programming environment gives us the capability to program the WILDCARD™ system and also allows us to create templates that are used to write the host program. Looking back at the software design hierarchy explained in Chapter 3, we see that the host “driver” program is written in C. The WILDCARD™ provides the API routines that are used in the host program to perform the following functions: (1) read and write to the on-board SRAM memories; (2) wait for the Virtex® PE to interrupt (or, alternatively, poll the status register for completion of a WILDCARD™ controller operation); and (3) process the results of the API-initiated operation.

The UPGMA host program was written based on the example templates provided by Annapolis Microsystems® for setting up a custom computing application for reading and writing the SRAM memory banks, reading and writing data to the Virtex® Processing Element (PE) register space, and for processing PE interrupts. Using these examples as guides, we created a complete host-based experiment “driver” application, employing the above three components, to perform the following host-to-computing server protocol steps:

• Initialize the WILDCARD™ system;

• Program the PE from the image file;

• Set the clock frequency;

• Enable the PE interrupt line;

• De-assert the PE Reset line;

• Read distance data from the file into a distance array;

• Transfer distance data from the distance array to the WILDCARD™ Left memory;

• Write the number of taxa being operated upon into a PE register;

• This write triggers the design to start running; the design asserts the “done” signal after it finishes;

• Wait in the C program until the done signal is set;

• Read the tree data from the Right memory; and,

• After reading, output the data into a destination file in the PC host file system.

The PE initialization includes “opening” the board by calling the WC_Open( ) routine, applying power to the board, asserting reset lines, and clearing any pending interrupt requests left unprocessed by previous application programs. Once the design has been synthesized, the EDIF file from the synthesis run is transformed into a placed and routed design for the Virtex® FPGA, and the image file is generated, by running the Xilinx® M1 Alliance Series place and route tools.

This image is placed within the C project directory and is used to configure the PE by calling the WC_PeProgramFromFile( ) API routine. After the PE image is loaded onto the device, the PE clock frequency is set by calling the routine WC_SetClkFrequency( ), the interrupt lines are enabled, and the Reset line is de-asserted. The WILDCARD™ board is then ready for transfer of data to the on-board SRAM memories.

The distance data is written into the left memory, while the right SRAM memory is used for storing the output tree data. The number of taxa on which we are operating is written into a single 32-bit register on the PE. The host C program then goes into “sleep” mode, waiting for the PE interrupt to be set. Meanwhile, the UPGMA logic starts executing, and operates on the distance data in order to generate the output tree data.

After the design finishes processing, it generates a “done” signal that is tied to the PE interrupt line. Once the PE interrupt is set, the host C program comes out of its wait state and starts processing the PE interrupt. The host program clears the interrupt and starts reading the Phylogenetic tree data from the Right SRAM memory. Once all the output tree data is read from the right memory and written into an output file, the “driver” program frees all the memory buffers allocated during the execution, and proceeds to “close” the device by calling the WC_Close( ) API routine. The C code for the host-based experiment “driver” program is provided in Appendix D.

5.2 Generating Random Taxa Test Data Sets

A program written in C++, using the MFC programming environment, was used to generate the test data for testing the implementation of the UPGMA algorithm. The program takes as input the following parameters: (1) the number of taxa; (2) the maximum value of inter-node distance; and, (3) the number of repetitions of a single distance value in the data set.

The test data are generated for taxa sizes of 10, 16, 32, 50, 64, 75, 100, 128, 150, 175, 200, 225 and 256. For each taxa size, ten different data sets are randomly generated for that number of taxa. Furthermore, each created data set has its data values subjected to permutations, creating up to 10 permutations per data set per number of taxa. The C++ code for the test data generation is given in Appendix E.

Figure 16. Test Data Generator Input Dialog Box.

Figure 16 presents a screenshot of the dialog box used by the program for generating test data. The taxa size, maximum nodal distance, and the number of repetitions of a particular distance value are given as inputs. When the data has been generated the program pops up a confirmation dialog box.

The data values are generated randomly, making sure that each value is within the maximum nodal distance limit set by the user. Also, the number of repetitions of each value in the data set is constrained to be less than or equal to the repetition limit specified by the user. For each taxa size, ten different datasets are generated, and for each of these ten datasets ten different permutations are generated by changing the positions of the distance values within the distance matrix.
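The two constraints on the generated values can be illustrated with a short C sketch. It is not the MFC program of Appendix E: the function name, the use of rand() with simple rejection of values that have reached the repetition limit, and the choice of 1 as the smallest distance are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Fill out[0..count-1] with random distances in 1..max_dist so that no value
   occurs more than max_reps times. Returns 0 on success, or -1 when the
   request cannot be met or memory is unavailable. */
static int generate_distances(unsigned *out, size_t count,
                              unsigned max_dist, unsigned max_reps,
                              unsigned seed)
{
    assert(out != NULL);
    if (max_dist == 0 || max_reps == 0 || max_dist > (unsigned)RAND_MAX)
        return -1;
    if ((unsigned long long)max_dist * max_reps < count)
        return -1;                       /* not enough value slots */

    unsigned *used = calloc((size_t)max_dist + 1, sizeof *used);
    if (used == NULL)
        return -1;

    srand(seed);
    size_t filled = 0;
    while (filled < count) {
        unsigned value = 1u + (unsigned)rand() % max_dist;
        if (used[value] < max_reps) {    /* reject values already at the limit */
            used[value]++;
            out[filled++] = value;
        }
    }
    free(used);
    return 0;
}
```

For a 10-taxa data set, count would be 45 (the number of entries above the matrix diagonal). Permuting a generated data set then amounts to shuffling the positions of these values within the matrix.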

5.3 Measuring Time

The time taken for the UPGMA implementation to execute on the WILDCARD™ system is measured in CPU clock ticks using the standard C language clock( ) function call. Time measurements are collected for the time taken for the program to transfer the distance data to the memory, generate the tree, and read back the output tree data from the memory.
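A measurement of this kind can be sketched with the standard clock( ) call as shown below. The structure and function names are hypothetical, and count_calls is only a placeholder phase; in the actual driver each phase would be the corresponding memory transfer or wait on the PE.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Tick counts for the three phases of one run. */
struct run_timing {
    clock_t write_ticks;    /* host to memory: distance data */
    clock_t compute_ticks;  /* tree generation */
    clock_t read_ticks;     /* memory to host: tree data */
};

/* Run one phase and return the clock() ticks that elapsed around it. */
static clock_t time_phase(void (*phase)(void *), void *context)
{
    assert(phase != NULL);
    clock_t start = clock();
    phase(context);
    return clock() - start;
}

/* Total latency of a run, including both transfers. */
static clock_t total_ticks(const struct run_timing *t)
{
    return t->write_ticks + t->compute_ticks + t->read_ticks;
}

/* Placeholder phase for illustration: counts how often it is invoked. */
static void count_calls(void *context)
{
    ++*(int *)context;
}
```

Keeping the three tick counts separate makes it possible to report the transfer cost on its own as well as the total. Dividing a tick count by CLOCKS_PER_SEC converts it to seconds.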

The time taken for memory transfer is measured separately in order to analyze the cost of transferring data to and from the WILDCARD™ memory banks. This is done to give us an idea of how the cost affects the performance of the implementation for high values of N, the number of taxa. The current maximum of 256 taxa limits the number of distance values to be written to the memory. Also, the WILDCARD™ memory banks are 65536 (64K) words, with each word being 32 bits in width. This constrains the number of taxa that can be operated on for a given UPGMA run.

Through independent tests written for the WILDCARD™ system, the time taken for writing to each of the memories has been collected and tabulated. Although not shown here, the numbers collected indicate that the cost of writing the entire memory is about 20 CPU clock ticks, while reading back the entire memory takes about 100 clock ticks on the 800 MHz Pentium III processor serving as the experimental workstation. This indicates that reading data from the SRAM memories from the host is a more expensive operation than writing to them.

For the purposes of our experiment, we write distance data and read back tree data from the left and right on-board SRAM memories, respectively. As we are constrained to operate upon a 256-taxa data set, the number of distance values needed to write a complete matrix to the memory is 32,640, while the maximum number of memory locations to be read back for the resulting tree is 511. These two values have been obtained from the following formulae.

Number of distance values = N(N-1)/2

Number of tree nodes = 2N – 1

Thus, considering that our largest data set has fewer values to write than the maximum capacity of the memory banks, the memory writes take less than the 20 CPU ticks needed to write the entire memory array. Similarly, the number of output words to be read back from the memory is small compared to the full memory size, so the time taken is much less than the 100 ticks needed to read an entire memory. The cost of writing and reading the on-board SRAM memories is therefore not a big factor in the performance overhead of our implementation.
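The two formulae and the comparison against the bank capacity can be written as a few C helpers. The names are hypothetical, and the check assumes one value per 32-bit word in a 65536-word bank, as described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BANK_WORDS 65536u   /* capacity of one memory bank, in 32-bit words */

/* Number of distance values for N taxa: N(N-1)/2. */
static uint32_t distance_count(uint32_t n)
{
    assert(n >= 1 && n <= 65535u);   /* keeps n*(n-1) within 32 bits */
    return n * (n - 1u) / 2u;
}

/* Number of tree nodes for N taxa: 2N - 1. */
static uint32_t tree_node_count(uint32_t n)
{
    assert(n >= 1 && n <= 65535u);
    return 2u * n - 1u;
}

/* True when both the distance data and the tree data fit in one bank each. */
static bool fits_in_banks(uint32_t n)
{
    return distance_count(n) <= BANK_WORDS && tree_node_count(n) <= BANK_WORDS;
}
```

For N = 256 these give 32,640 distance values and 511 tree nodes, both within a bank. For N = 10,000 the distance count is 49,995,000, which is why larger data sets would need more memory or repeated transfers.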

However, if we look at realistic values of N, which can go as high as 10,000, the cost of reading and writing the memories would be important. To obtain an idea of what the cost might be, we can extrapolate the timing data, assuming that we have unlimited memory capacity on our hardware board. We could write to the WILDCARD™ memory banks multiple times to find the cost of writing more than 65,536 words, and use this value to obtain a gross estimate of the cost to write to the memory for large datasets. Similarly, we could read back data from the memory multiple times and obtain an estimate of the cost to read the memory for large datasets.

This cost data could then be used to obtain a gross estimate of the overall performance of the algorithm being implemented on the hardware. This provides a theoretical extrapolation, and not the exact performance cost; however, it provides valuable information on how the algorithm performance might scale to data sets with larger numbers of taxa, and whether the memory access costs would have a significant or minor impact (assuming we had reasonably unlimited memory available). This data is presented in Chapter 7 as part of the discussion of conclusions of this research.



In this chapter, we present the results of running Phylogenetics data sets against the UPGMA implementation on the WILDCARD™-based reconfigurable custom computing machine. We present the resultant data sets in terms of a bounded clock cycle count using the clocking frequency of the host PC’s CPU clock, which gives us a count of the total number of host clock cycles for a given computation run. We use this, as opposed to using the on-board FPGA clock, as the former takes into account the communication overhead of getting data to and from the WILDCARD™ board.

6.1 Running the Experiments

We take randomly generated data sets, permute them, and execute them on the WILDCARD™. We then increase the number of taxa considered in the input distance matrix, generate new data sets and permute them, and execute them on the UPGMA processor. The test data for taxa sizes of 10, 16, 32, 50, 64, 75, 100, 128, 150, 175, 200, 225 and 256 were executed. For each taxa size, ten different data sets along with 10 different permutations of certain datasets were run and timing results collected. The results are described in the following sections.

6.2 Experimental Results for Latency

The time taken for each taxa size, measured in CPU clock ticks, is collected for the different datasets and permutations. The average time taken for ten different permutations of each of the ten datasets for each taxa size is given in Tables 4 through 6. Ttotal is the total latency of the platform: the number of clock ticks taken for a complete design run, including the data transfer between the host and the WILDCARD™.

One aspect of defining the data set for the experiments is permuting the data to assess whether permutation impacts the execution latency. In some software implementations of UPGMA, permutation might affect the execution of a given data set at some number of taxa. The permutations were randomly generated along with the data sets. However, we wanted to determine whether this aspect of the organization of the data would affect the design in some meaningful way before taking the time to run the full set of experiments.

Our expectation was that permuting the data would not be much of a factor in the variation of latency values, because the time to perform actual computations on fixed-width operators is largely independent of the actual data values passed as the operands. From the data collected from the sample permutation runs, this seems to be the case. This is shown in Table 3 and Figure 17 below.

Table 3: Timing Results for permuted datasets

Figure 17. Frequency Distribution for Latency versus Taxa Data Set Permutation.

From this analysis of the permutation, we conclude that we do not need to consider permutation of the data set values for a particular execution run. Therefore, we focus our presentation of the data on the different UPGMA execution runs using randomized data sets for each of the selected numbers of Phylogenetic taxa.

We next examine the response of the custom computing machine in terms of the Mean Latency (averaging the data set samples) versus the number of taxa. This is shown in several different ways, so as to highlight the statistical convergence of the latency values around the mean values computed across the ten randomized data sets.

The first plot in Figure 18 shows the basic Latency response curve as the number of taxa increases to the maximum value of 256, the maximum number that can be stored in the available memory on the WILDCARD™, given the architecture.

We wanted to evaluate the deviation from the mean over the data sets for each number of taxa, and observe what happens to this deviation as the number of taxa increases to the maximum targeted for this research. What we see in the Latency data for the different data sets, for a given number of taxa, is that the data tends to cluster tightly around the mean, indicating minimal deviation. There is somewhat wider variance as the number of taxa grows, as is evident from the curve in Figure 19, which gives the Latency in log scale. The variance is slightly larger for the 200- and 256-taxa datasets, as seen in the curve. The rest of the datasets converge well.

We are not able to grow the number of taxa on the current WILDCARD™-based reconfigurable computing platform to see whether there is a real trend in the deviation data or not. However, we believe that the data results would not be affected as the number of taxa grows. This is because hardware computation speed is relatively fixed for fixed data bit-widths. The combinational circuit has a fixed latency, and thus the computations have a fixed latency. This leads us to believe that the data results would not differ significantly with an increase in the number of taxa.

[Figure 18 plots: two panels, each titled “Latency versus Number of Taxa.” Series: Computational Latency (Mean Values), Latency (Std. Deviation), Latency (Variance). Horizontal axis: Number of Phylogenetic Taxa (10, 16, 32, 57, 64, 75, 100, 128, 150, 175, 200, 225, 256). Vertical axis: Latency (CPU cycles).]

Figure 18. Mean Latency versus Number of Taxa (Normal Scale).

[Figure 19 plot: titled “Latency versus Number of Taxa (Log Scale).” Series: Computational Latency (Mean Values), Latency (Std. Deviation), Latency (Variance). Horizontal axis: Number of Phylogenetic Taxa (10, 16, 32, 57, 64, 75, 100, 128, 150, 175, 200, 225, 256). Vertical axis: Latency (CPU cycles), logarithmic scale.]

Figure 19: Latency versus Number of Taxa

The deviation we see in our current results, we believe, can be attributed to “noise” on the host PC side, as the host is not dedicated to running the WILDCARD™ program exclusively, but at the same time has other processes running that can skew the count of the clock ticks.

Table 4 below gives the timing results for the WILDCARD™ UPGMA program running datasets of different taxa sizes. It gives the latency in clock ticks for each of the ten different datasets for every taxa size. We also give the mean, standard deviation and variance of the 10 datasets for a given number of taxa. We look at the performance of the UPGMA processor and compare it with other complexity functions to obtain an upper bound in terms of Big-Oh.
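The summary statistics reported for each taxa size can be computed as in the following C sketch. The names are hypothetical, and the sample form of the variance (division by count - 1) is an assumption, since the text does not state which form was used. The standard deviation is the square root of the variance returned here, for example via sqrt( ) from <math.h>.

```c
#include <assert.h>
#include <stddef.h>

/* Mean and variance of the latency values for one taxa size. */
struct latency_stats {
    double mean;
    double variance;   /* sample variance: an assumption */
};

static struct latency_stats summarize(const double *ticks, size_t count)
{
    assert(ticks != NULL && count >= 2);

    double sum = 0.0;
    for (size_t k = 0; k < count; ++k)
        sum += ticks[k];
    double mean = sum / (double)count;

    double squares = 0.0;   /* sum of squared deviations from the mean */
    for (size_t k = 0; k < count; ++k) {
        double d = ticks[k] - mean;
        squares += d * d;
    }

    struct latency_stats s = { mean, squares / (double)(count - 1) };
    return s;
}
```

Applied to the ten latency values recorded for a given number of taxa, this yields the mean and variance for that size.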

Table 4. Latency Values for Data Sets at Generated Number of Taxa.

6.3 Bounding Time Complexity

Given the Latency curve as our number of taxa grows, we want to understand the results in terms of the time complexity. Stanat and McAllister [27] provide an appropriate taxonomy on which we can attempt to qualitatively “fit” our resultant performance curve against those of standard time complexity functions. Given that we have selected a means of measuring Latency that incorporates communication overhead, and that we randomize and permute our data sets, we assume we are working with worst-case behavior. We want to understand the behavior in terms of the standard forms of


Our first attempt is to compare our Latency plot against the base function plots for the O(N), O(N log N), and O(N²) time complexity patterns. We show the Excel® plots for the data in Table 4 in Figure 20, for both normal scale and logarithmic scale. We attempt to carry out a qualitative assessment of time complexity bounds without resorting to deriving more precise recurrence expressions, although we are able to generate curve-fitting equations directly on the Excel plots.

What we see from the plots is that, given the limited range of N (number of taxa) covered under the scope of this research, we appear bounded by O(N log N) time complexity. However, the other conclusion we draw from this data is that we are too constrained by the lack of a sizable memory space (space complexity) in which to store a larger number of matrix distance values for processing a greater number of taxa, N. Therefore, we cannot draw a definitive conclusion about the performance of the computing system for large values of N. However, we will explore what we might need to do to grow to considerably larger values of N, into the thousands of taxa, in the conclusion of this work. Also, to assess the benefits, we will use comparative data.

[Figure 20 plots: two panels, titled “Latency = f(N) Time Complexity Bounding” and “Latency = f(N) Time Complexity Bounding (Log Scale).” Series: Mean Latency, N, N*log(N), N**2. Horizontal axis: Number of Phylogenetic Taxa (10, 16, 32, 57, 64, 75, 100, 128, 150, 175, 200, 225, 256). Vertical axis: Latency (CPU cycles).]

Figure 20. Bounding of Latency by Time Complexity Functions.

[Figure 21 plot: titled “Latency = f(N) Time Complexity Bounding.” Series: Latency, with two polynomial trendlines. Cubic trendline: y = 0.577x³ - 6.2392x² + 22.55x - 13.393, R² = 0.9982. Quadratic trendline: y = 5.8787x² - 47.849x + 83.55, R² = 0.9748. Horizontal axis: Number of Phylogenetic Taxa. Vertical axis: Latency (CPU cycles).]

Figure 21. Bounding Latency by Time Complexity Functions Computed in Excel.

Finally, before leaving this aspect of the analysis, we show a different plot in Figure 21, which illustrates how difficult it is to qualitatively assess the time complexity, by using the Excel® trend analysis of the Latency curve with both square and cube polynomial trend curves. The Excel software uses regression analysis to compute the trendlines. The trendlines help us predict the behavior of the Latency curve as the number of taxa increases beyond our current maximum of 256. We do not have enough experimental data ourselves to see what happens to the Latency for larger values of N. For this, we would need to move the design to a larger platform, such as the Star Bridge HC-36m or the SRC 6e, which would be the subject of future research. However, the trendlines give us an idea of how to characterize our upper-bound performance for values of N greater than 256, and how it might scale. From the trendline forward prediction we obtain the R² (R-square) value. The R² value, also known as the coefficient of determination, ranges from 0 to 1 and helps us decide whether the values estimated by the trendline accurately match the actual data. A trendline is most reliable when the R² value is at or near 1. We see that the cube polynomial trendline provides the best bounding for the Latency curve, as its R² value is better than that of the square polynomial trendline. We therefore feel that the algorithm complexity is bounded by O(N³) for the hardware implementation.
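The R² comparison between trendlines can be reproduced outside Excel with the usual definition, R² = 1 - SSres / SStot. The C sketch below is an illustration only: the function names are hypothetical, and it evaluates a polynomial whose coefficients are already known (for example, copied from a trendline equation) rather than performing the regression itself.

```c
#include <assert.h>
#include <stddef.h>

/* Evaluate c[0] + c[1]*x + ... + c[degree]*x^degree by Horner's rule. */
static double poly_eval(const double *c, size_t degree, double x)
{
    double y = c[degree];
    for (size_t k = degree; k-- > 0; )
        y = y * x + c[k];
    return y;
}

/* Coefficient of determination of a fitted polynomial against the data. */
static double r_squared(const double *x, const double *y, size_t count,
                        const double *c, size_t degree)
{
    assert(count >= 2);

    double mean = 0.0;
    for (size_t k = 0; k < count; ++k)
        mean += y[k];
    mean /= (double)count;

    double ss_res = 0.0;   /* squared error of the fit */
    double ss_tot = 0.0;   /* squared deviation from the mean */
    for (size_t k = 0; k < count; ++k) {
        double err = y[k] - poly_eval(c, degree, x[k]);
        double dev = y[k] - mean;
        ss_res += err * err;
        ss_tot += dev * dev;
    }
    assert(ss_tot > 0.0);
    return 1.0 - ss_res / ss_tot;
}
```

A fit that passes through every data point gives R² = 1, and a constant fit at the mean gives R² = 0. A high R² over the measured range does not by itself guarantee that the trendline extrapolates well beyond 256 taxa.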

We now try to measure the quality of the solution by comparing the performance of the reconfigurable custom computing solution against that of the baseline execution of PHYLIP, the version of UPGMA software written in C by Felsenstein et al. [10].

6.4 Benchmarking Against PHYLIP

The software timing data is collected by running the PHYLIP UPGMA C code on the same PC on which the WILDCARD™ host program is run. As before, we execute the experiment and collect run-time data across a range of values of N, with different randomized data sets that have been permuted (using half as many permutations as for the hardware version, for the sake of brevity). For this, we use the same data sets that were used to execute the UPGMA algorithm running on the WILDCARD™. The average time taken for the program to run under five different permutations of each of the ten different datasets for each taxa size is given in Table 5 that follows. The run-time plots of the software version, corresponding to the Latency plots of the hardware version, are given in the figures that follow.

[Figure 22 plot: titled “PHYLIP C Run-time Performance.” Series: C Run-time Performance, N*log(N), N**2. Horizontal axis: Number of Phylogenetic Taxa (10, 16, 32, 57, 64, 75, 100, 128, 150, 175, 200, 225, 256). Vertical axis: Execution Run-time (CPU cycles).]

Figure 22. PHYLIP C run time performance

Figure 22 shows the comparison of the PHYLIP C run-time performance with the N log N and N² curves. These curves were added using trendlines without predictive analysis. From this plot, we observe that the PHYLIP C run-time performance curve is bounded by the N² curve yet closely follows the plot for N log N as its lower bound. We believe that our limited number of taxa does not show the true nature of the curve, and thus we would like to get a better bounding to obtain a closer match for the algorithm complexity.

The first plot in Figure 23 shows the plots of Figure 22 on a log scale. On this scale N(log N) matches the C run-time performance quite closely, but we still cannot accurately predict the complexity of the algorithm for larger values of N. In the second plot of Figure 23 we fit two polynomial trendlines to the C run-time performance curve. We used forward prediction and obtained the R^2 values to assess the accuracy of the trendlines. We find that the cubic polynomial trendline matches much better, with an R^2 value very close to 1. This suggests that the C algorithm provided by Felsenstein [19] has a complexity of O(N^3).

[Figure 23, first plot: "PHYLIP C Run-time Performance (Log Scale)"; series: C Run-time Performance, N*log(N), N**2; x-axis: Number of Phylogenetic Taxa (10 to 256); y-axis: Execution Run-time (CPU cycles), logarithmic, 1 to 1000.]

[Figure 23, second plot: "PHYLIP C Run-time Performance"; series: C Run-time Performance with two polynomial trendlines extended by forward prediction; cubic trendline: y = 1.8845x^3 - 27.682x^2 + 259.92x - 184.88, R^2 = 0.9962; quadratic trendline: y = 11.892x^2 + 30.012x + 131.71, R^2 = 0.9863; x-axis: Number of Phylogenetic Taxa; y-axis: Execution Run-time (CPU cycles), 0 to 16000.]

Figure 23. PHYLIP C run time performance with Time-complexity bounding
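The R^2 comparison described above can be checked with a short calculation. The sketch below evaluates the two trendline equations printed in Figure 23 against the average software run-times reported in Table 6 and computes R^2 as 1 - SS_res / SS_tot. It assumes that x in the trendline equations is the position of each taxon size on the chart's category axis (1 for 10 taxa through 13 for 256 taxa) rather than the taxon count itself; this is an assumption of the sketch, not something the text states, and the function names are illustrative only.

```python
# Average PHYLIP run-times (clock ticks) from the Software column of Table 6,
# listed in order of taxon size: 10, 16, 32, 57, 64, 75, 100, ..., 256.
software_ticks = [121, 170.1, 315.3, 541.7, 713, 816, 942.1,
                  1107.5, 1278, 1479, 1788.8, 2250.7, 2659.9]

# ASSUMPTION: x is the 1-based position on the category axis, not the taxon count.
positions = list(range(1, len(software_ticks) + 1))


def cubic(x):
    """Cubic trendline equation as printed in Figure 23."""
    return 1.8845 * x**3 - 27.682 * x**2 + 259.92 * x - 184.88


def quadratic(x):
    """Quadratic trendline equation as printed in Figure 23."""
    return 11.892 * x**2 + 30.012 * x + 131.71


def r_squared(model, xs, ys):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - model(x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


r2_cubic = r_squared(cubic, positions, software_ticks)
r2_quadratic = r_squared(quadratic, positions, software_ticks)
print(f"cubic trendline     R^2 = {r2_cubic:.4f}")
print(f"quadratic trendline R^2 = {r2_quadratic:.4f}")
```

Under the stated assumption, the computed values should land close to the R^2 figures shown in Figure 23, with the cubic trendline scoring higher than the quadratic one.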

Table 5. PHYLIP Run-time Raw Data Set.

We now look at the performance comparison of the hardware and software implementations. The average number of clock ticks taken for each of the taxa sizes, for both the hardware and software implementations, is given in Table 6. The results show a significant improvement for taxa counts up to 64, but the improvement then declines as the taxa size increases to the maximum of 256 used in the experiments.

Taxa    Hardware    Software    Improvement
10      8.4         121         14.4
16      8.5         170.1       20
32      9.4         315.3       33.5
57      12          541.7       45.1
64      14          713         50.9
75      20.3        816         40.2
100     39.6        942.1       23.8
128     71.6        1107.5      15.5
150     110.2       1278        11.6
175     162.5       1479        9.1
200     242.6       1788.8      7.4
225     342.1       2250.7      6.6
256     504.4       2659.9      5.3

Table 6. Data Comparison Between Hardware and Software UPGMA Implementations.
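The Improvement column in Table 6 is the ratio of the software run-time to the hardware latency. The following minimal sketch recomputes that ratio from the Hardware and Software columns and locates the taxon size at which it peaks; the variable names are illustrative and not taken from the thesis code.

```python
# Rows of Table 6: (taxa, hardware clock ticks, software clock ticks).
table6 = [
    (10, 8.4, 121.0), (16, 8.5, 170.1), (32, 9.4, 315.3), (57, 12.0, 541.7),
    (64, 14.0, 713.0), (75, 20.3, 816.0), (100, 39.6, 942.1),
    (128, 71.6, 1107.5), (150, 110.2, 1278.0), (175, 162.5, 1479.0),
    (200, 242.6, 1788.8), (225, 342.1, 2250.7), (256, 504.4, 2659.9),
]

# Improvement factor = software ticks / hardware ticks, rounded to one decimal
# place as in the table.
ratios = {taxa: round(sw / hw, 1) for taxa, hw, sw in table6}

# Taxon size with the largest improvement factor.
peak_taxa = max(ratios, key=ratios.get)

for taxa, ratio in ratios.items():
    print(f"{taxa:>4} taxa: {ratio:>5.1f}x")
print(f"peak improvement at {peak_taxa} taxa")
```

Recomputing the ratio this way makes it easy to confirm that the improvement peaks at the 64-taxa data sets and falls off for the larger sizes.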

This behavior in the hardware implementation is explained by the fact that the FPGA-based design, when used for taxa counts of 75 and above, was adversely affected by the memory addressing scheme (discussed in Chapter 4) introduced in the final architecture modifications of the design resident on the WILDCARD™. The negative impact on the performance of the design is also attributed to the four-cycle latency of an SRAM memory read. This latency induces wait states in the control structure, causing the design to run slower. The address modification would not have been necessary had the WILDCARD™ system had larger memory banks, in which case the original addressing scheme of concatenating nodal values could still have been used.

The improvement over the PHYLIP software implementation goes as high as 50 times for the 64-taxa data set size. Beyond this size, the on-board memory address modification becomes necessary, causing the design performance to deteriorate. This can be seen as the speed-up drops to 40 times for the 75-taxa data set size and to far less for larger data set sizes, up to the maximum. This deterioration is attributed to the limited memory capacity of the WILDCARD™ board; larger memory banks, if available, should help scale the design more effectively. The size of the Virtex XCV300E chip also prevents us from implementing parallel or pipelined design architectures that might help reduce the latency of the memory addressing to a certain extent.

[Figure 24 plot: performance improvement over PHYLIP by taxon size; x-axis: Taxon size (10 to 256); y-axis: Performance improvement, 0 to 60; data labels: 14.4, 20, 33.5, 45.1, 50.9, 40.2, 23.8, 15.5, 11.6, 9.1, 7.4, 6.6, 5.3.]

Figure 24. Plotting the Performance Improvement over PHYLIP as Taxa Count Grows.

Figure 24 provides a plotted view of how the algorithm scales as N grows large. With the maximum taxa data set at 256, we see that the performance deteriorates once we encounter the increased overhead of memory address computation on data sets of more than 64 taxa.

Thus, due to inherent limitations of the WILDCARD™ hardware board on which our design executes, we could not obtain the performance improvements that might otherwise be obtained by applying custom logic/custom computing methodologies. A larger board with a larger memory size would allow scaling to larger taxa counts and should provide better performance over the software implementation in PHYLIP.

[Figure 25 plot: "Performance Comparison: CCM vs Phylip"; series: CCM Latency, Phylip Run-time; x-axis: Number of Phylogenetic Taxa (10 to 256); y-axis: Performance (cycles), 0 to 2500.]

Figure 25. Plotting the Performance Difference as Taxa Count Grows.

However, if we look at the plot of the performance data itself and compare the two curves, we see that the performance improvement does indeed seem to scale, as the performance curve for the PHYLIP implementation of UPGMA grows at a faster rate than that of the implementation of UPGMA as a custom computing machine architecture.

[Figure 26 plot: "Performance Comparison: CCM vs Phylip"; series: CCM Latency, Phylip Run-time; x-axis: Number of Phylogenetic Taxa (10 to 256); y-axis: Performance (cycles), logarithmic, 1 to 1000.]

Figure 26. Plotting the Performance Difference as Taxa Count Grows (Log Plot).

If we observe the trend as a logarithmic plot, we see that, at the peak performance point for the custom computing implementation on the WILDCARD™ (between 64 and 75 taxa), we are operating close to two orders of magnitude faster than the software PHYLIP implementation. Furthermore, this improvement decreases to a single order of magnitude, with an apparently decreasing trend in the order-of-magnitude performance difference as we grow to the limit of 256 taxa. This corroborates the earlier plot showing a final 5X difference in performance between the two implementations.



7.1 Summary of Research Contributions

Custom computing systems built using reconfigurable logic devices provide several orders of magnitude speed-up in execution performance of algorithms over the execution of these on conventional microprocessor-based systems. In addition, such systems have the flexibility to program--and reprogram via reconfiguration--the actual logic functions of the VLSI circuit with different applications in time and space. Custom computing systems are implemented using FPGA custom-logic devices that are easily and quickly programmed by an end-user. This research conducted the design and analysis of a custom computing application architecture for the UPGMA Bioinformatics algorithm implemented on an FPGA-based custom-computing platform. We examined different architectures of the design in order to achieve better resource usage and to conform to the constraints of the hardware resources, most notably memory, on the WILDCARD™. We discussed the final architecture and presented results of the system performance, as measured and compared against that of the UPGMA algorithm written in C, running on a single-processor Pentium® PC.

7.2 Conclusions

The results presented in Chapter 6 provided us with an insight into the performance of both the hardware and software implementations. The hardware results showed little variance for different permutations of a dataset for a given number of taxa. The timing results also converge towards a mean value, showing very little variance over different datasets for a given number of taxa. The hardware results showed significant improvement over the software implementation, with performance peaking at the 64-taxa datasets. For datasets of 75 taxa and above, the performance began to degrade considerably compared to that of the PHYLIP software implementation, although the custom computing implementation was still between half an order and a full order of magnitude faster. The hardware implementation was 50 times faster than the software implementation for the 64-taxa datasets, indicating a reasonable performance improvement given the architectural limitations of memory addressing cited earlier.

We have also shown, using predictive analysis in Excel, that both implementations are bounded by functions with a time complexity of O(N^3). The polynomial equations generated for both the hardware and software performance curves were of the order of N^3, with the large difference between them lying in the coefficients and constants of each time function. The predictive polynomial equation generated for the software performance curve shown in Chapter 6 had larger constant and coefficient values than the polynomial equation of the hardware performance curve. These predictive polynomial equations do not represent actual values, but they do give us a reasonable estimate of how both implementations would behave if we had been able to scale beyond the limit of 256 taxa.

The large values of the coefficients indicate that the software implementation has underlying compiler-related and operating system overhead that affects its performance. The hardware implementation of the UPGMA algorithm avoids these sources of overhead by implementing the computational units directly on FPGA-based hardware storage and functional units. This provides a considerable speed-up, facilitating a higher-performance solution, as is evidenced by the results obtained.

However, we see that the hardware performance degrades rapidly for datasets of 75 taxa and above. The performance degradation is attributed to the linear addressing scheme used for the final architecture, and to the latency of a single read from the WILDCARD™ on-board SRAM memory banks. A read from the memory banks takes 4 clock cycles, which adds wait states within the control structure that negatively impact the performance of the design. A linear addressing scheme was employed to support taxa counts of 75 and above. The WILDCARD™ memory provides a maximum of 65536 words on each bank, and this limitation forced us to convert the address generated by the original addressing scheme into a linear address. The original addressing scheme generates address values greater than 65536, which limits the number of taxa that can be implemented, even though datasets of 256 taxa could be stored within the 65536 memory locations. The linear addressing scheme enables the design to handle larger datasets up to the 256-taxa limit, the maximum taxa size defined as a goal of this research. The address modifications necessary for this purpose induce additional states in the control structure, adversely impacting the performance of the design.
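The trade-off between the two addressing schemes can be illustrated with a small sketch. The actual schemes are those described in Chapter 4; the functions below are hypothetical stand-ins chosen only to show the general effect. The sketch assumes that the original scheme concatenates two node indices into one address, that node indices may need 9 bits (for example, if merged clusters receive new labels beyond 255), and that the linear scheme packs the lower triangle of the distance matrix. All three are assumptions of this illustration.

```python
BANK_WORDS = 65536  # words per WILDCARD memory bank, as stated in the text


def concatenated_address(i, j, index_bits):
    """Hypothetical concatenation: index i in the high bits, j in the low bits."""
    return (i << index_bits) | j


def linear_address(i, j):
    """Hypothetical packed lower-triangle address for a pair with i > j >= 0.

    Needs a multiply, a shift and an add, which stands in for the extra
    control states that address modification introduces.
    """
    return i * (i - 1) // 2 + j


def words_needed_linear(n_taxa):
    """Number of distinct pairs among n_taxa nodes."""
    return n_taxa * (n_taxa - 1) // 2


# With 9-bit indices (an assumption), concatenation overflows a 65536-word bank.
largest_concat = concatenated_address(511, 511, 9)
print("largest concatenated address:", largest_concat, "fits:", largest_concat < BANK_WORDS)

# The packed pairwise distances of 256 taxa fit comfortably in one bank.
largest_linear = linear_address(255, 254)
print("largest linear address:", largest_linear, "fits:", largest_linear < BANK_WORDS)
print("words needed for 256 taxa:", words_needed_linear(256))
```

The point of the sketch is only that a concatenated address is cheap to form but wastes address space, while a packed linear address uses the bank efficiently at the cost of extra arithmetic per access.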

The original addressing scheme would require a larger memory capacity than the WILDCARD™ platform provides. We discuss in the sections below how larger memory banks--as well as certain architecture modifications--might improve the performance.

7.3 Future Work

In the earlier sections we looked at certain issues that hampered the performance of the UPGMA design implemented on the WILDCARD™ system. We list these issues below:

• Memory size limitation and memory address schemes

• Latency for the memory read

• Device size (FPGA resources)

We discuss these issues to see how alleviating these bottlenecks might be used to increase performance.

7.3.1 Memory size and Memory address schemes

The WILDCARD™ system provides two memory banks with 65536 words each. The left memory bank was used for storing the distance matrix data and the right memory bank for storing the tree output data. The limit of 65536 words on the left memory bank necessitates address modification, which degrades the design performance. We could overcome this problem in two ways:

• Generating a better addressing scheme

• Moving to platforms with larger memory banks

The first option would give us a solution that could be implemented on the currently available WILDCARD™ board, but it would be very difficult, as the address modification is complex, as described in Chapter 4. The second option is easier and would require us to explore other custom computing platforms that offer larger memory capacity. Larger memory capacity would eliminate the need for address modifications and alternative addressing schemes.

The time taken to write to and read from the memories on the WILDCARD™ board from the host is also a significant factor in measuring the performance of the design. We have seen that writing an entire memory bank takes 20 clock ticks on an 800 MHz Intel Pentium host system, while reading an entire memory bank takes 100 clock ticks. We have also seen that the time taken increases linearly with the number of writes or reads. Therefore, to read the memory banks three times the host would take 300 clock ticks, and to write three times it would take 60 clock ticks. This linear increase would most certainly affect the performance, and we would need to look at other architectures that might provide better performance in terms of reading and writing from the host.
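The linear host-transfer cost described above can be written as a one-line model. This is a minimal sketch using only the two figures quoted in the text (20 clock ticks per full-bank write and 100 clock ticks per full-bank read); the function name and constants are illustrative.

```python
WRITE_TICKS_PER_BANK = 20   # host clock ticks to write one full memory bank
READ_TICKS_PER_BANK = 100   # host clock ticks to read one full memory bank


def host_transfer_ticks(n_writes, n_reads):
    """Total host clock ticks for full-bank transfers, assuming linear scaling."""
    return n_writes * WRITE_TICKS_PER_BANK + n_reads * READ_TICKS_PER_BANK


print("three writes:", host_transfer_ticks(3, 0), "ticks")
print("three reads: ", host_transfer_ticks(0, 3), "ticks")
print("one write and one read:", host_transfer_ticks(1, 1), "ticks")
```

Because reads cost five times as much as writes in this model, an architecture that reduces the number of host read-backs would benefit the most.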

7.3.2 Latency for a Memory Read

We have seen in the earlier sections that a memory read in the WILDCARD™ system takes 4 cycles. This hurts the performance by slowing the operation of the design. To overcome this we have to look at other custom computing platforms that offer better read and write cycle latency. This would remove the additional wait states induced in the control structure and speed up the design.
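To give a rough sense of how read latency multiplies into total run-time, the sketch below counts the distance-matrix reads made by a naive UPGMA minimum search and scales that count by the read latency. It assumes that, with k clusters remaining, every one of the k(k-1)/2 pairwise distances is read once to find the minimum; this is a textbook simplification and not necessarily the access pattern of the design described in this thesis.

```python
def pairwise_reads(n_taxa):
    """Reads made by a naive minimum search over all merge steps.

    With k clusters remaining, k*(k-1)/2 distances are scanned; k runs from
    n_taxa down to 2. The closed form is (n^3 - n) / 6, i.e. O(N^3).
    """
    return sum(k * (k - 1) // 2 for k in range(2, n_taxa + 1))


def read_cycles(n_taxa, latency_cycles):
    """Cycles spent on reads alone, assuming no overlap between reads."""
    return pairwise_reads(n_taxa) * latency_cycles


for n in (64, 256):
    four = read_cycles(n, 4)   # 4-cycle read, as on the WILDCARD
    one = read_cycles(n, 1)    # hypothetical single-cycle read
    print(f"{n} taxa: {four} cycles at 4-cycle latency, {one} at 1-cycle latency")
```

Under this simplification the read latency acts as a constant multiplier on a cubic term, which is why a platform with lower read latency would shorten run-time without changing the complexity bound.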

7.3.3 Device size

The WILDCARD™ has a Xilinx Virtex XCV300E chip on it. The Virtex-E chip has a total of 3072 slices. This is small compared to the Virtex II device, which has a total of 33732 slices; the larger device offers much more space to implement larger designs and would also enable us to explore different architectures for the algorithm under consideration, namely parallel or pipelined architectures. We have seen in the literature that, in general, parallel architectures offer a very good performance improvement [22, 23].

The current implementation of the design takes up 60 percent of the WILDCARD™ Virtex-E chip. A parallel implementation would likely have multiple copies of the design components, such as the datapath, control path, etc., running in parallel. These multiple units would work on sub-parts of the distance matrix. This parallel operation would speed up the design to a large extent, but the multiple parallel units would increase the design size, and there would be some penalty in the communication overhead of the interacting subparts of the problem.
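A back-of-the-envelope estimate shows why device size constrains a parallel design. The sketch below uses the slice counts and the 60 percent utilization figure quoted above to estimate how many full copies of the current design would fit on each device. It assumes that every parallel unit needs as many slices as the whole current design and ignores routing, shared logic and inter-unit communication; the estimate is therefore a rough upper bound and not a synthesis result.

```python
import math

XCV300E_SLICES = 3072       # Virtex-E XCV300E, as stated in the text
VIRTEX2_SLICES = 33732      # Virtex II device, as stated in the text
DESIGN_UTILIZATION = 0.60   # fraction of the XCV300E used by the current design


def slices_per_copy():
    """Slices used by one copy of the design, rounded up to a whole slice."""
    return math.ceil(XCV300E_SLICES * DESIGN_UTILIZATION)


def copies_that_fit(device_slices):
    """Whole copies of the design that fit, ignoring routing and shared logic."""
    return device_slices // slices_per_copy()


print("slices per copy:", slices_per_copy())
print("copies on XCV300E:", copies_that_fit(XCV300E_SLICES))
print("copies on Virtex II:", copies_that_fit(VIRTEX2_SLICES))
```

Only one copy fits on the current chip, so any parallel architecture would require a larger device or multiple devices.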

Thus, to implement a parallel architecture we would need a larger device or multiple devices to ensure that we do not run out of resources. However, the potential speed-up is attractive enough to warrant an exploration of the trade-off between increased resources, parallelism versus communication overhead, and the impact on computation speed. Therefore, future work should investigate different custom computing architectures offering the requisite resources to implement a parallel architecture of the UPGMA algorithm on a custom computing fabric.

We have looked at different issues that caused problems in implementing the UPGMA algorithm on the WILDCARD™ system and have also discussed how we might resolve these problems. The options suggested are presented as future work that might be of great interest and might enable us to obtain a performance improvement for the UPGMA algorithm that conceivably could alter the upper bound of the time complexity of the algorithm itself.


