
A Thesis

entitled

Design of A Systolic Array-Based FPGA Parallel Architecture for the BLAST Algorithm

and Its Implementation

by

Xinyu Guo

Submitted to the Graduate Faculty as partial fulfillment of the requirements for the

Master of Science Degree in Engineering

______Dr. Hong Wang, Committee Chair

______Dr. Vijay Devabhaktuni, Committee Member

______Dr. Mansoor Alam, Committee Member

______Dr. Patricia R. Komuniecki, Dean, College of Graduate Studies

The University of Toledo

August 2012

Copyright 2012, Xinyu Guo

This document is copyrighted material. Under copyright law, no parts of this document may be reproduced without the expressed permission of the author.

An Abstract of

Design of A Systolic Array-Based FPGA Parallel Architecture for the BLAST Algorithm and Its Implementation

By

Xinyu Guo

Submitted to the Graduate Faculty as partial fulfillment of the requirements for the Master of Science Degree in Engineering

The University of Toledo August 2012

This thesis proposes a systolic array-based Field Programmable Gate Array (FPGA) parallel architecture for the Basic Local Alignment Search Tool (BLAST) algorithm. BLAST is a heuristic biological sequence alignment algorithm which has been used by bioinformatics experts. In contrast to other designs, which detect at most one hit per clock cycle, this design applies a Multiple Hits Detection Module, a pipelined systolic array, to search for multiple hits in a single clock cycle. Further, a Hits Combination Block, which combines overlapping hits from the systolic array into one hit, is proposed. Together, these modules implement the first step of the BLAST architecture. After the Ungapped Extension Blocks were parallelized to accelerate the second step of BLAST, the new design achieved better results than previously published architectures.


Acknowledgements

I sincerely thank my advisor Dr. Hong Wang and co-advisor Dr. Vijay Devabhaktuni for giving me the opportunity to pursue my research under their guidance. It has been a great learning experience working with them. I would also like to express my deepest gratitude towards them for offering instructive advice at all times, treating me with endurance, and constantly pushing me to do better every time.

Last but not least, I thank my family and friends, who have been a constant source of encouragement and support throughout my journey in the Master's degree program.


Table of Contents

Abstract

Acknowledgements

Table of Contents

List of Tables

List of Figures

List of Abbreviations

1 Introduction

1.1 Background

1.2 Research Approach

1.3 The Documentation Organization

2 The BLAST Algorithm

2.1 BLAST Overview

2.2 Finding Hits

2.3 Ungapped Extension

2.4 Gapped Extension

3 Previous Work

4 Parallel Architecture Design of The BLAST Algorithm with Multiple Hits Detection

4.1 The Architecture Overview

4.2 Multiple Hits Detection Module

4.3 Hits Combination Block

4.4 Parallel Architecture for Multiple Hits Detection Module and Hits Combination Block

4.5 Ungapped Extension Block

4.6 Gapped Extension Block

5 Performance Analysis

5.1 Hardware Performance Comparison

5.2 Comparing Execution Time to Software Versions

6 Conclusion and Future Work

References

A Project Architecture Generated from the Schematic Viewer

A.1 Architecture of the array_component

A.2 Architecture of two systolic arrays in the project

A.3 The connection of array_components

B Source Code of the Project

B.1 Source Code for the Processing Unit

B.2 Source Code for the Clock_Division Block

B.3 Source Code for 3-input AND Gate

B.4 Source Code for Multiple Hits Detection Module

B.4 Testbench Source Code for the Systolic Array Architecture


List of Tables

4.1 5-bit binary description for amino acids in a protein

5.1 Synthesis performance comparison between Tree-BLAST and ours

5.2 Synthesis performance comparison between families of FPGA-based Accelerators and ours

5.3 Experimental database for different

5.4 Execution time of our architecture and software versions


List of Figures

1-1 Exponential growth of biological sequence database on a yearly basis [1]

1-2 An FPGA manufactured by Altera

1-3 Logic blocks distribution in an FPGA

2-1 Illustration of Needleman-Wunsch algorithm

4-1 The structure of the project

4-2 A 2-D 4x4 systolic array

4-3 Architecture of the Multiple Hits Detection Module

4-4 Behavior Simulation of Clock Division Block

4-5 Architecture of Processing Unit of Multiple Hits Detection Module

4-6 The Architecture of Multiple Hits Detection Module

4-7 Systolic array status 1

4-8 Systolic array status 2

4-9 Systolic array status 3

4-10 Hits Combination Block

4-11 Parallel Architecture of Multiple Hits Finder Array and Hits Combination Block

4-12 Ungapped Extension Block

4-13 Extension procedure

4-14 Parallel Architecture for Ungapped Extension Block

4-15 Gapped Extension block

5-1 Speed up of the parallel FPGA architecture vs. software versions

A-1 The screen shot for array_component from the schematic viewer

A-2 The screen shot for the architecture of two systolic arrays from the schematic viewer

A-3 The screen shot for the architecture of the connection of array_components from the schematic viewer


List of Abbreviations

BCB…….. Bioinformatics and Computational Biology

BLAST…. Basic Local Alignment Search Tool

DNA……. Deoxyribonucleic Acid

FPGA…… Field Programmable Gate Array

HSPs……. High-scoring Segment Pairs

RNA…….. Ribonucleic Acid

VHDL…... Very High Speed Integrated Circuit Hardware Description Language

WPRBS…. Word Position Record Based Search


Chapter 1

Introduction

1.1 Background

In the Bioinformatics and Computational Biology (BCB) domain, biological sequence alignment, including the comparison of Deoxyribonucleic Acid (DNA), Ribonucleic Acid (RNA) and protein (amino acid) sequences, is a common task in which a query sequence is aligned to subject sequences from a large database to find similarities [1].

Scientists routinely use sequence alignment to infer biological information about a newly discovered sequence from a set of previously known sequences. For instance, if a recently discovered sequence is similar to a known disease gene, then biological information about the functionality of the new sequence can be inferred. This is highly valuable in early disease diagnosis and drug engineering [2]. In addition, biological sequence alignment plays an important role in the study of evolutionary development and the history of species [3].

There are many methods for sequence alignment in the bioinformatics domain [4]. Very short or very similar sequences can be aligned by hand. However, most biological sequence alignment problems involve sequences that are lengthy and highly variable, and numerous sequences usually need to be aligned to solve one particular problem. It is impossible for a human to deal with such problems solely by hand. Instead, scientists construct algorithms to produce high-quality sequence alignments, and occasionally adjust the final results to reflect patterns that are difficult to represent (especially in the case of nucleotide sequences). Biological sequence alignment algorithms are divided into two categories: global alignments and local alignments.

Calculating a global alignment is a form of global optimization that "forces" the alignment to span the entire length of all query sequences [5]. By contrast, local alignments identify regions of similarity within long sequences that are often widely divergent overall. Local alignments are often preferable, but can be more difficult to calculate because identifying the regions of similarity is itself a challenge. A variety of computational algorithms have been applied to the sequence alignment problem, including slow but formally correct methods such as dynamic programming, and efficient heuristic or probabilistic methods, designed for large-scale database search, that do not guarantee finding the best matches [6].

As one of the most important methods for biological sequence alignment, BLAST (Basic Local Alignment Search Tool) [7] has been widely implemented on commodity PC clusters. The algorithm was designed by Eugene Myers, Stephen Altschul, et al. in 1990. It emphasizes speed over sensitivity, which is vital to making the algorithm practical on the huge sequence databases currently available. It is a heuristic algorithm that produces sub-optimal local alignments, but it is much quicker than exhaustive dynamic programming algorithms. Although BLAST is well developed on PC clusters and has an advantage in speed, the exponential growth of the bio-sequence databases (see Figure 1.1) makes it difficult to meet the computational requirements on current PC platforms.

Figure 1-1: Exponential growth of biological sequence database on a yearly basis [1].

A Field Programmable Gate Array (FPGA), shown in Figure 1.2, is an integrated circuit designed to be configured by the customer or designer after manufacturing, which is why it is called "field programmable". The FPGA configuration is generally specified using a hardware description language (HDL), similar to that used for an Application Specific Integrated Circuit (ASIC). FPGAs can be used to implement any logical function that an ASIC could perform. FPGAs contain programmable logic components, called logic blocks, and a hierarchy of reconfigurable interconnects that allow the blocks to be wired together, much like a set of changeable logic gates that can be inter-wired in many different configurations [8]. Figure 1.3 shows the distribution of logic blocks in an FPGA. Logic blocks can be configured to perform complex combinational functions, or to act as simple logic gates such as AND and XOR gates. In most FPGAs, the logic blocks also include memory elements, which may be simple flip-flops or more complete blocks of memory [9].

Figure 1-2: An FPGA manufactured by Altera.

Figure 1-3: Logic blocks distribution in an FPGA.

Compared to general-purpose PC processors, Field Programmable Gate Arrays (FPGAs) are able to provide a much higher degree of bit-level parallelism, which leads to higher computational speed. In addition, FPGAs give engineers and scientists the chance to create custom, reconfigurable systems with onboard processing [10]. Last but not least, FPGAs available today have usable sizes at an acceptable price, which makes them effective for cost savings and time-to-market when making individual configurations of standard products. Due to these advantages, scientists have been investigating parallel architecture designs for the BLAST algorithm on FPGAs to achieve higher computational efficiency.

1.2 Research Approach

This thesis proposes the design and implementation of a systolic array-based parallel architecture on an FPGA for the first steps of the BLAST algorithm. The design is implemented in Very High Speed Integrated Circuit Hardware Description Language (VHDL), making it compatible across a number of FPGA architectures (e.g. Xilinx, Altera). The development environment is Xilinx ISE Design Suite 12.1, WebPACK edition, which can be downloaded from the Xilinx website for free. The experimental platform is the Xilinx ML509 education board, on which an XC5VLX110T FPGA is embedded. In this architecture, the Multiple Hits Detection Module and the Hits Combination Block are designed to accelerate the first step of BLAST. In addition, the parallel design of the Ungapped Extension Blocks improves the implementation of the second step of BLAST.


1.3 The Documentation Organization

The rest of this thesis is organized as follows. Chapter 2 presents an overview of the BLAST algorithm and Chapter 3 reviews related work. Chapter 4 details the design of our FPGA architecture. Chapter 5 analyzes the architecture's performance, including storage resource requirements, FPGA resource consumption and estimated speed-ups against previous architectures and software, based on synthesis, functional simulation and timing simulation. Finally, conclusions and future work are presented in Chapter 6.


Chapter 2

The BLAST Algorithm

2.1 BLAST Overview

In bioinformatics, Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences. A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold.

Biological sequence alignment algorithms can be exhaustive, meaning they give optimal alignments. Two examples are the Needleman-Wunsch [11] and Smith-Waterman [12] algorithms. Algorithms like BLAST are heuristic and give sub-optimal results [13].

Although heuristic algorithms produce local alignments that are not always optimal, they are normally much quicker than exhaustive dynamic programming algorithms.

Exhaustive algorithms typically have a complexity of $O(mn)$, while heuristic algorithms often have a lower complexity [14], where m is the size of the database sequence and n is the size of the query sequence. The execution time of biological sequence alignment is crucial to bioinformatics scientists, especially because the size of the bio-sequence databases has grown exponentially over the years. BLAST is a very popular heuristic algorithm that provides high computational speed [15].

The three main steps of the BLAST algorithm are finding hits, ungapped extension and gapped extension. These are described as follows.

2.2 Finding Hits

In this step, we use a protein sequence which consists of eight residues (amino acids) as an example shown below:

PQGEFGVY

The query sequence is divided into overlapping k-words as illustrated below with k = 3:

Word 0: PQG

Word 1: QGE

Word 2: GEF

Word 3: EFG

Word 4: FGV

Word 5: GVY

Six words are extracted from this sequence. In general, (m-k+1) words are extracted from a sequence of size m. After dividing the query sequence, these k-words and a score matrix are used to find the possible matching words, namely those whose score is at least the threshold value T. Finally, subject sequences in the database are scanned one by one to find exact matches to those possible matching words.

Previous work shows that this step can be carried out by high-level application software running on a PC [16]. By designing an architecture which implements every step of BLAST on an FPGA, the interface between software and hardware can be avoided, saving execution time. Instead of finding exact matches to those possible matching words, the implementation in this thesis searches for exact matches of the k-words directly on the FPGA. This is called “finding hits”.
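The word-splitting and exact-match search described above can be sketched in software as follows. This is only an illustrative Python model, not the FPGA design: the function names `k_words` and `find_hits`, the dictionary-based lookup and the subject string "AQGEFGW" are assumptions introduced for the example.

```python
def k_words(sequence, k=3):
    """Return the overlapping k-words of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def find_hits(query, subject, k=3):
    """Return (query_position, subject_position) pairs of exact k-word matches."""
    # Record where each k-word occurs in the query.
    positions = {}
    for q_pos, word in enumerate(k_words(query, k)):
        positions.setdefault(word, []).append(q_pos)
    # Scan the subject and report every exact match.
    hits = []
    for s_pos, word in enumerate(k_words(subject, k)):
        for q_pos in positions.get(word, []):
            hits.append((q_pos, s_pos))
    return hits


words = k_words("PQGEFGVY")
hits = find_hits("PQGEFGVY", "AQGEFGW")
print(words)
print(hits)
```

For the example query, `k_words` reproduces the six words listed above. The table lookup used here is a software convenience; the hardware described in Chapter 4 compares characters directly in a systolic array instead.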

2.3 Ungapped Extension

In this step, each hit recorded in the previous step is extended by aligning the query with a subject sequence in both directions without any gap. A score matrix, for example BLOSUM50 [16], is used to compute residue pair alignment scores. The extension is terminated when the total score of the alignment falls below a threshold value. The results of ungapped extension are called high-scoring segment pairs (HSPs).
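A minimal software sketch of this extension step is given below. It is written under stated assumptions rather than taken from the implementation: a simple +1/-1 match/mismatch score stands in for a real score matrix such as BLOSUM50, the best-scoring extent seen so far is kept as the result, and the names `ungapped_extend`, `threshold`, `match` and `mismatch` are hypothetical.

```python
def ungapped_extend(query, subject, q_pos, s_pos, k=3,
                    threshold=0, match=1, mismatch=-1):
    """Extend a k-word hit in both directions without gaps.

    Returns (query_start, subject_start, length, score) of the best segment.
    """
    def pair_score(a, b):
        # Stand-in for a score matrix lookup (assumption for this sketch).
        return match if a == b else mismatch

    # Score of the initial hit itself.
    best = sum(pair_score(query[q_pos + i], subject[s_pos + i]) for i in range(k))

    # Extend to the right until the running score drops below the threshold.
    right, running, i = 0, best, k
    while q_pos + i < len(query) and s_pos + i < len(subject):
        running += pair_score(query[q_pos + i], subject[s_pos + i])
        if running < threshold:
            break
        if running > best:
            best, right = running, i - k + 1
        i += 1

    # Extend to the left, starting from the best score found so far.
    left, running, i = 0, best, 1
    while q_pos - i >= 0 and s_pos - i >= 0:
        running += pair_score(query[q_pos - i], subject[s_pos - i])
        if running < threshold:
            break
        if running > best:
            best, left = running, i
        i += 1

    return q_pos - left, s_pos - left, k + left + right, best


hsp = ungapped_extend("PQGEFGVY", "AQGEFGW", 2, 2)
print(hsp)
```

In this example the hit "GEF" grows by one residue on each side, giving the segment "QGEFG".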

2.4 Gapped Extension

In this stage, the Smith-Waterman or Needleman-Wunsch algorithm is applied to perform gapped alignment in both directions on the collected HSPs. After this step, a collection of homologous sequences is obtained, each assigned a score.

We choose the Needleman-Wunsch algorithm for the gapped extension implementation because of its ability to find the optimal global gapped alignment between the query sequence and the subject sequence [17]. Suppose that there are two sequences, Q and S, whose lengths are M and N, respectively. Q[1:i] denotes the first i characters in sequence Q and S[1:j] denotes the first j characters in sequence S. The similarity score between Q[1:i] and S[1:j] can be obtained from the similarity scores computed for Q[1:i-1] and S[1:j-1]. A dynamic programming score matrix D is built, where each element of the matrix denotes the best alignment score between the Q[1:i] segment of Q and the S[1:j] segment of S.


The first step of the Needleman-Wunsch algorithm is to create the score matrix D with M+1 columns and N+1 rows, where M and N are the lengths of sequences Q and S. The first row and the first column are assigned the value zero. The computational complexity of the initialization step is O(M + N).

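The second step consists of filling the matrix. $D(i,j)$ denotes cell $(i, j)$ of the score matrix D. $D(i,j)$ is defined as the maximum of the similarity scores computed at positions $(i-1, j-1)$, $(i, j-1)$ and $(i-1, j)$, combined with the substitution score $s(x_i, y_j)$ or the gap penalty $d$. The following equation is used to compute the value of cell $(i, j)$:

$$
D(i,j) = \max
\begin{cases}
D(i-1,j-1) + s(x_i, y_j) \\
D(i-1,j) + d \\
D(i,j-1) + d
\end{cases}
$$

In this equation, $x_i$ is the ith character of Q (sequence X), $y_j$ is the jth character of S (sequence Y), and $s(x_i, y_j)$ is the scoring matrix score for $x_i$ and $y_j$. $s(x_i, y_j) = 1$ if the ith character in Q is equal to the jth character in S (MATCH score); otherwise $s(x_i, y_j) = 0$ (MISMATCH penalty). d is the gap penalty, d = 0. $D(i-1,j) + d$ is the score of an alignment between $x_i$ and a gap in Y. $D(i,j-1) + d$ is the score of an alignment between a gap in X and $y_j$.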

The last step of this algorithm is trace back, which forms the best global alignment between Q and S. It starts from the bottom-right cell of the matrix and ends at the upper-left cell. At each step, the next alignment pair, which is added to the front of the alignment segment, is chosen according to three rules, where $x_i$ is the ith character of Q (sequence X) and $y_j$ is the jth character of S (sequence Y): (1) if cell(i, j) was derived from cell(i-1, j-1), the pair $x_i$ and $y_j$ is added to the front of the alignment segment; (2) if cell(i, j) was derived from cell(i-1, j), $x_i$ and a gap in Y are added to the front of the alignment segment; (3) if cell(i, j) was derived from cell(i, j-1), a gap in X and $y_j$ are added to the front of the alignment segment. In some cases, the three input cells yield the same similarity score, so it is ambiguous which cell cell(i, j) was derived from. The diagonal cell is chosen in this situation to shorten the path from the bottom-right cell to the upper-left cell. The shortest alignment segment presents the best global alignment between two sequences. Figure 2.1 shows an example which illustrates the Needleman-Wunsch steps. It corresponds to the following alignment:

T _ G G C C A G T C C G

T T G _ C _ A _ T _ _ G

         T  G  G  C  C  A  G  T  C  C  G

      0  0  0  0  0  0  0  0  0  0  0  0

   T  0  1  1  1  1  1  1  1  1  1  1  1

   T  0  1  1  1  1  1  1  1  2  2  2  2

   G  0  1  2  2  2  2  2  2  2  2  2  3

   C  0  1  2  2  3  3  3  3  3  3  3  3

   A  0  1  2  2  3  3  4  4  4  4  4  4

   T  0  1  2  2  3  3  4  4  5  5  5  5

   G  0  1  2  3  3  3  4  5  5  5  5  6

Figure 2-1: Illustration of Needleman-Wunsch algorithm.
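The three steps (initialization, matrix filling and trace back) can be sketched in Python as follows. This is an illustrative software model only; the function name and the default parameters are assumptions. The defaults (match 1, mismatch 0, gap 0) are the values that reproduce the matrix shown in Figure 2-1, and ties during trace back are resolved in favor of the diagonal cell, as described above.

```python
def needleman_wunsch(q, s, match=1, mismatch=0, gap=0):
    """Fill the score matrix and trace back one global alignment.

    q runs along the columns and s along the rows, as in Figure 2-1.
    Returns (matrix, aligned_q, aligned_s); '_' marks a gap.
    """
    rows, cols = len(s) + 1, len(q) + 1
    # Step 1: first row and first column are set to zero.
    d = [[0] * cols for _ in range(rows)]

    # Step 2: fill each cell from its diagonal, upper and left neighbors.
    for r in range(1, rows):
        for c in range(1, cols):
            sub = match if s[r - 1] == q[c - 1] else mismatch
            d[r][c] = max(d[r - 1][c - 1] + sub,
                          d[r - 1][c] + gap,
                          d[r][c - 1] + gap)

    # Step 3: trace back from the bottom-right cell to the upper-left cell.
    aligned_q, aligned_s = [], []
    r, c = rows - 1, cols - 1
    while r > 0 or c > 0:
        sub = match if (r > 0 and c > 0 and s[r - 1] == q[c - 1]) else mismatch
        if r > 0 and c > 0 and d[r][c] == d[r - 1][c - 1] + sub:
            aligned_q.append(q[c - 1])      # diagonal: align two characters
            aligned_s.append(s[r - 1])
            r, c = r - 1, c - 1
        elif r == 0 or (c > 0 and d[r][c] == d[r][c - 1] + gap):
            aligned_q.append(q[c - 1])      # left: gap in s
            aligned_s.append("_")
            c -= 1
        else:
            aligned_q.append("_")           # up: gap in q
            aligned_s.append(s[r - 1])
            r -= 1
    return d, "".join(reversed(aligned_q)), "".join(reversed(aligned_s))


matrix, top, bottom = needleman_wunsch("TGGCCAGTCCG", "TTGCATG")
print(matrix[-1][-1])
print(top)
print(bottom)
```

Because several alignments can share the same score, the alignment printed by this sketch may differ from the one shown above while still reaching the same bottom-right score.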


Chapter 3

Previous Work

FPGAs not only provide high speed-ups and high bit-level data parallelism compared to regular processors, but also offer the convenience of re-programmability. More and more bioinformatics scientists have designed parallel FPGA architectures for the BLAST algorithm in order to solve the biological sequence alignment problem [18]. Many architectures have already been proposed, such as Mercury BLASTn [19][20][21], TreeBLAST [14], Mercury BLASTp [22], RC-BLAST [23], FPGA/FLASH Accelerator [24], Multiengine BLASTn Accelerator [25][26], and industrial systems such as BEE2 [27] and Mitrion [28]. Most of them adopt the Word Position Record Based Search (WPRBS) method, which constructs storage tables to record the position of each word in the query sequence, then sends those words to the accelerator one by one, where the storage table is searched to find a hit. Although this approach is widely used, it still suffers from time and storage limitations.

Using WPRBS, at most one hit can be found per clock cycle because only one word can be searched per clock cycle. The number of FPGA memory ports also limits the WPRBS search capacity, regardless of whether the storage tables are stored in internal or external memory. As an example of storing tables in external memory, Mercury BLASTn [19] and Mitrion [28] build a hash table from the query. During the search process, an accelerator checks words in the subject sequence database against the hash table one by one. An external SRAM attached to the FPGA is used to store the hash table, since the internal block RAMs do not have enough storage space to hold tables for long query sequences. This causes a timing issue: the access delay to the external SRAM leads to a long pipeline cycle time. FPGA/FLASH [24] can minimize this timing issue. The database is formatted as an index structure in which every word is associated with its position in the sequence and its adjacent environment, so that short ungapped alignments can be computed simultaneously, avoiding many random accesses to the database. However, the database index is very large: a 150 GB index for the human genome is needed when storing a 40 amino acid substring environment. This requires 50 times more storage than the raw data [29]. In the future, with the exponential growth of the databases, the storage resource requirement will become a serious issue. To improve search efficiency, the Multiengine BLASTn Accelerator [25] adopts 64 identical computing units in a single chip to compare the query with 64 subject sequences in the database at the same time. Mercury BLASTp [22] proposes a two-hit generator for accelerating the first stage of BLASTp. Both of these methods are derived from WPRBS.

Another search strategy is based on the systolic array and does not build any table [30][31]. Systolic arrays have been implemented for exhaustive algorithms such as Needleman-Wunsch [16]. Implementing BLAST using a systolic array is a relatively new research area [14].


Chapter 4

Parallel Architecture Design of The BLAST Algorithm with Multiple Hits Detection

4.1 The Architecture Overview

As Figure 4.1 shows, the proposed BLAST search system contains three layers. The top layer is a host computer which runs high-level application software. The middle layer consists of two SDRAMs: one is used as the subject sequence database and the other as a results container which receives HSP lists from the basic layer. The basic layer is the systolic array-based parallel architecture fitted on the FPGA. This layer scans all subject sequences from the database against the query sequence in order to produce an HSP list. After that, the list is sent to the host computer software, which assigns statistical significance as specified by bioinformatics scientists.

In the top layer of this system, a Dell Precision T1500 desktop computer with an Intel Core i7 processor and 4 GB of memory runs the host software. In the middle layer, two 1 GB SDRAMs are attached to the FPGA where the basic layer is located. The basic layer is the systolic array-based parallel architecture that implements the BLAST algorithm. This layer is mapped to one FPGA chip (Xilinx Virtex-5 XC5VLX110T) which is embedded on the Xilinx ML509 Evaluation Platform. It contains six different blocks with various functions. The Initialization Set & Data Control Block is used to initialize the status of registers for the Hits Finder Array and to forward the subject sequence as well as the query sequence to the Hits Finder Array. The Query Sequence Memory, which stores the query sequence, is constructed from internal block RAMs. The four other blocks implement the three steps of the BLAST algorithm in parallel. The Hits Finder Arrays can detect multiple hits per clock cycle. The Hits Combination Blocks find combinable overlapping hits and combine them. For example, they can find the adjoining overlapping hits “TKEL” and “KELP” and combine them into “TKELP”. The Ungapped Extension Blocks extend hits in both directions without gaps until the score meets a value T. The Gapped Extension Block performs gapped extension on HSPs through an exhaustive algorithm such as Needleman-Wunsch. The block outputs HSPs and alignment scores.

[Figure 4-1 diagram. Top layer: microprocessor, host software and memory, exchanging the subject sequence and the HSP list with the middle layer. Middle layer: SDRAM 1 and SDRAM 2. Basic layer (FPGA): Initialization Set & Data Control Block, Query Sequence Memory, Multiple Hits Finder Array, Hits Combination Block, Ungapped Extension Block and Gapped Extension Block, connected by the initialization bit stream and sequences, hits information, address and context signals, and HSPs.]

Figure 4-1 : The structure of the project.
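To make the role of the Hits Combination Block concrete, the following Python sketch merges overlapping hits such as “TKEL” and “KELP” into one longer hit. It is a software illustration under stated assumptions, not the block's hardware design: hits are represented as (query start, subject start) pairs of fixed word length k, only hits on the same diagonal are merged, and the name `combine_hits` is hypothetical.

```python
def combine_hits(hits, k=3):
    """Merge overlapping k-word hits that lie on the same diagonal.

    hits: iterable of (query_start, subject_start) pairs.
    Returns sorted (query_start, subject_start, length) segments.
    """
    # Hits can only be combined when they share subject_start - query_start.
    by_diagonal = {}
    for q_start, s_start in hits:
        by_diagonal.setdefault(s_start - q_start, []).append(q_start)

    combined = []
    for diagonal, q_starts in by_diagonal.items():
        q_starts.sort()
        start, end = q_starts[0], q_starts[0] + k
        for q in q_starts[1:]:
            if q < end:
                # The next hit overlaps the current segment: grow it.
                end = max(end, q + k)
            else:
                # Disjoint hit: close the current segment and start a new one.
                combined.append((start, start + diagonal, end - start))
                start, end = q, q + k
        combined.append((start, start + diagonal, end - start))
    return sorted(combined)


# "TKEL" at query 0 / subject 10 and "KELP" at query 1 / subject 11 (k = 4)
merged = combine_hits([(0, 10), (1, 11)], k=4)
print(merged)
```

The two 4-character hits in the usage example become a single 5-character segment, which corresponds to “TKELP”.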


4.2 Multiple Hits Detection Module

The Multiple Hits Detection Module is built with a systolic array. A systolic array is a 2D array of processing units which rhythmically computes and processes data. It has many useful features, such as modularity, regularity, local interconnection, a high degree of pipelining and highly synchronized operation [32][33]. These features make systolic arrays a suitable architecture for computationally intensive tasks. The general architecture of a systolic array is shown in Figure 4.2.

Figure 4-2 : A 2-D 4x4 systolic array.

Arrows represent the directions of data flow. In this architecture, data flows synchronously across the 2D array network, usually with different data flowing in different directions. Each unit at each step takes in data from one or more neighbors (e.g. North and West), processes the data, and, in the next step, outputs results in the opposite direction (e.g. South and East).

4.2.1 Multiple Hits Detection Module Architecture

The Multiple Hits Detection Module is used to detect 3-word hits and record the hits' addresses in the query and the subject sequence. Compared to the WPRBS method, which can detect at most one hit per clock cycle, this design can detect multiple hits in a single clock cycle. The architecture of the Multiple Hits Detection Module is shown in Figure 4.3.

[Figure 4-3 diagram: a clk_division block (16:1) generating array_clk; the query sequence and the subject sequence entering a row of processing units numbered 0 to 31; 16-bit registers collecting the AND gate outputs; Hits Information Extraction Units producing the hit address in the query and the hit address in the subject.]

Figure 4-3: Architecture of the Multiple Hits Detection Module.

As the figure illustrates, the systolic array has 32 processing units, and every three units are connected to one 3-input AND gate. The outputs of every 16 gates are connected to a 16-bit register. The value of each register is sent to the corresponding Hits Information Extraction Unit, which records the hit addresses in the query and the subject sequence. The systolic array and the Hits Information Extraction Units are driven by two different clocks, so a Clock Division Block is designed to generate the necessary clocks. The behavior simulation of the Clock Division Block is shown in Figure 4.4.

Figure 4-4: Behavior Simulation of Clock Division Block.

clock_en (array_clk in Figure 4.3) drives the Hits Information Extraction Units; clock_out is the clock source of the systolic array. Working together, the two clocks allow the architecture to operate as a pipeline without any pause. When the design is running, at each rising edge of clock_out, the query or subject sequence moves forward by one processing unit; at each rising edge of clock_en, the bit value at a certain position in each register is checked to detect the location of hits.

The whole architecture works as follows. First, a query sequence with 32 characters is forwarded into the systolic array so that each processing unit holds one character from the query sequence. Then, the subject sequence is driven into the systolic array, one character at each rising edge of the internal clock. Meanwhile, the incoming subject character and the query character held by the unit are compared: if they are identical, a logic '1' is generated; otherwise, a logic '0' is generated. The comparison result is an input of a 3-input AND gate; every three processing units map to one AND gate. A hit is detected when the gate outputs a logic '1'. Thus, the systolic array with 3-input AND gates can detect multiple hits at one rising edge of the internal clock. The architecture of the processing unit in the systolic array is illustrated in Figure 4.5.

[Figure 4-5 diagram: a processing unit with Subject id and Query id inputs and incremented Subject id +1 and Query id +1 outputs, two 5-bit shift registers, and a Compare block that outputs '0' or '1'.]

Figure 4-5: Architecture of Processing Unit of Multiple Hits Detection Module.

In the implementation, twenty 5-bit binary numbers (“00001” to “10100”) are defined to represent the 20 amino acids in a protein. Shift registers shift characters to the adjacent processing unit so that the query and the subject sequence can pass through the systolic array. When a new character from the query or the subject sequence is shifted into the unit, the corresponding id register increases its value by 1. Each processing unit of the systolic array compares the two characters it holds at each rising edge of the internal clock.
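The behavior described above can be modeled in software as follows. This Python sketch is an illustration under stated assumptions, not the VHDL design: the query is held in place with its last character in the unit where the subject enters, each AND gate watches three adjacent units, a hit is reported by the index of the last character of the matching word in each sequence, and the function name `systolic_hit_scan` is hypothetical. The two sequences are the ones used in the behavior simulation of Section 4.2.2.

```python
def systolic_hit_scan(query, subject, k=3):
    """Software model of a linear systolic hit detector.

    Returns a list of (clock, query_index, subject_index) tuples, where the
    indices point at the last character of each matching k-word.
    """
    n = len(query)
    hits = []
    # One loop iteration models one rising edge of the systolic array clock.
    for t in range(len(subject) + n - 1):
        # Unit p holds query[n-1-p]; the subject character that entered
        # p clocks ago, subject[t-p], is currently passing through it.
        match = []
        for p in range(n):
            j = t - p
            match.append(0 <= j < len(subject) and subject[j] == query[n - 1 - p])
        # Each k-input AND gate combines the outputs of k adjacent units.
        for p in range(n - k + 1):
            if all(match[p:p + k]):
                hits.append((t, n - 1 - p, t - p))
    return hits


QUERY = "GAVLIPFYWSTCMNQPEKRHGAVLIPFYWSTC"
SUBJECT = "STCGAVLIPFYWSTQPERKYWSSWYFPILLIP"
hits = systolic_hit_scan(QUERY, SUBJECT)
for clock, q_index, s_index in hits[:3]:
    print(clock, q_index, s_index)
```

In this model, all gates are evaluated on the same clock edge, so several hits on the same diagonal are reported together, which is the property that distinguishes the design from one-word-per-cycle table lookup.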

From Figure 4.3, we can see that the outputs of the 32 3-input AND gates go into two 16-bit registers. The Hits Information Extraction Unit detects the hits' locations and records them. The algorithm running in this module is presented as the following pseudo code:

clk_count = 0;   // counts the rising edges of the external clock (0 to 15)
hits_vector;     // register which holds the output bits of 16 3-input AND gates

IF (array_clk'event AND array_clk = '1') THEN   // external clock rising edge
    IF (hits_vector(clk_count) = '1') THEN
        Record the hit location in the query and subject sequence;
    END IF;
    IF (clk_count < 15) THEN
        clk_count = clk_count + 1;   // move on to the next bit
    ELSE
        clk_count = 0;               // all 16 bits checked, start over
    END IF;
END IF;

The Multiple Hits Detection Module is a parallel, pipelined architecture. The systolic array with 32 processing units cooperates with 32 3-input AND gates to detect hits in both sequences. The responsibility of the Hits Information Extraction Unit is to record the locations of those hits. The two processes in the module are driven by two clocks.
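The bit-scanning behavior of one Hits Information Extraction Unit can be mimicked in software as shown below. This Python sketch only models the counter logic of the pseudo code: one loop iteration stands for one rising edge of the faster clock, the function name `extract_hits` is hypothetical, and how a bit position is translated into query and subject addresses is left out because it depends on details of the hardware.

```python
def extract_hits(hits_vector):
    """Scan a register of AND-gate outputs, one bit per fast-clock edge.

    Returns (recorded_positions, final_clk_count).
    """
    clk_count = 0
    recorded = []
    for _ in range(len(hits_vector)):
        # Check the bit selected by the counter on this clock edge.
        if hits_vector[clk_count] == 1:
            recorded.append(clk_count)
        # Advance the counter, wrapping around after the last bit.
        if clk_count < len(hits_vector) - 1:
            clk_count += 1
        else:
            clk_count = 0
    return recorded, clk_count


vector = [0] * 16
vector[2] = 1
vector[15] = 1
result = extract_hits(vector)
print(result)
```

After 16 edges the counter is back at zero, so the unit is ready for the next register value delivered by the systolic array.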

4.2.2 Behavior Simulation of the Multiple Hits Detection Module

We have designed several experiments for the behavior simulation of the Multiple Hits Detection Module. In these experiments, each 5-bit binary code represents one kind of amino acid in a protein. Table 4.1 lists the names of the amino acids and their 5-bit binary codes.

Table 4.1: 5-bit binary description for amino acids in a protein.

Name of the amino acid    One-letter code    5-bit binary

Glycine                   G                  00001

Alanine                   A                  00010

Valine                    V                  00011

Leucine                   L                  00100

Isoleucine                I                  00101

Proline                   P                  00110

Phenylalanine             F                  00111

Tyrosine                  Y                  01000

Tryptophan                W                  01001

Serine                    S                  01010

Threonine                 T                  01011

Cysteine                  C                  01100

Methionine                M                  01101

Asparagine                N                  01110

Glutamine                 Q                  01111

Aspartic acid             D                  10000

Glutamic acid             E                  10001

Lysine                    K                  10010

Arginine                  R                  10011

Histidine                 H                  10100
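The encoding in Table 4.1 assigns consecutive 5-bit codes to the amino acids in the order listed, so it can be expressed compactly in software. The following Python sketch is only an illustration; the names `AMINO_ACIDS`, `ENCODE`, `DECODE` and `encode` are assumptions introduced here.

```python
# One-letter codes in the order of Table 4.1 (codes 00001 to 10100).
AMINO_ACIDS = "GAVLIPFYWSTCMNQDEKRH"

# Map each letter to its 5-bit binary string, and back.
ENCODE = {aa: format(i, "05b") for i, aa in enumerate(AMINO_ACIDS, start=1)}
DECODE = {code: aa for aa, code in ENCODE.items()}


def encode(sequence):
    """Return the list of 5-bit codes for a protein sequence."""
    return [ENCODE[aa] for aa in sequence]


codes = encode("STC")
print(codes)
```

The code "00000" and the codes above "10100" are unused by this table.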

In this project, the Multiple Hits Detection Module was designed in VHDL under Xilinx ISE Design Suite. The architecture of this module was obtained from the Schematic Viewer, which is integrated in the Design Suite. It is shown in Figure 4.6.

Figure 4-6: The Architecture of Multiple Hits Detection Module.


The Multiple Hits Detection Module is a systolic array-based architecture. Figure 4.6 shows the connection of the processing units. The connection details are shown in Appendix A, Figure A.3. Several experiments were designed to demonstrate the correct behavior of the architecture. Take one of these experiments as an example: a protein segment with 32 amino acids, “GAVLIPFYWSTCMNQPEKRHGAVLIPFYWSTC”, was taken as the query sequence; another protein segment with 32 amino acids, “STCGAVLIPFYWSTQPERKYWSSWYFPILLIP”, was taken as the subject sequence.

The query and subject sequences are shown below. There are three hits, which are highlighted with different colors.

Query:   G A V L I P F Y W S T C M N Q P E K R H G A V L I P F Y W S T C

Subject: S T C G A V L I P F Y W S T Q P E R K Y W S S W Y F P I L L I P

To show the hit positions clearly, the three systolic array statuses in which hits exist are listed below. Each character of the query sequence resides in one processing unit of the systolic array; each character of the subject sequence is forwarded through the systolic array.

Systolic array status 1:

Query:   C T S W Y F P I L V A G H R K E P Q N M C T S W Y F P I L V A G
Subject: C T S

Hit address in query sequence is 31, in subject sequence is 2.

Systolic array status 2:

Query:   C T S W Y F P I L V A G H R K E P Q N M C T S W Y F P I L V A G
Subject: P I L L I P F Y W S S W Y K R E P Q T S W Y F P I L V A G C T S

Hit address in query sequence is 16, in subject sequence is 16.

Systolic array status 3:

Query:   C T S W Y F P I L V A G H R K E P Q N M C T S W Y F P I L V A G
Subject:                         P I L L I P F Y W S S W Y K R E P Q T S

Hit address in query sequence is 9, in subject sequence is 21.

A testbench was written in VHDL under Xilinx ISE to obtain the following simulation waveforms. Figures 4.7, 4.8 and 4.9 show the register status in the Multiple Hits Detection Module when a hit was detected.

Figure 4-7: Systolic array status 1.

In Figure 4.7, at 1167 ns, a hit is detected in the module. The value of register hit_sub_location_1 is “00010”, and the value of register hit_query_location_1 is “11111”. These two values are the hit addresses in the subject sequence and the query sequence, respectively.

Figure 4-8: Systolic array status 2.

In Figure 4.8, at 2095 ns, a hit is detected in the module. The value of register hit_sub_location_1 is “10000”, and the value of register hit_query_location_1 is “10000”. These two values are the hit addresses in the subject sequence and the query sequence, respectively.

Figure 4-9: Systolic array status 3.

In Figure 4.9, at 2479 ns, a hit is detected in the module. The value of register hit_sub_location_2 is “10101”, and the value of register hit_query_location_2 is “01001”. These two values are the hit addresses in the subject sequence and the query sequence, respectively.

These three figures demonstrate the correctness of Multiple Hits Detection Module behavior.
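The register values reported in Figures 4.7 to 4.9 can be cross-checked with a small software model. The sketch below is an assumption-laden stand-in for the systolic array: it compares every 3-letter word of the query with every 3-letter word of the subject and reports the 1-based address of the middle letter of each matching word, which is the addressing convention implied by the register values. The lowercase "l" in the extracted sequences is read as isoleucine (I).

```python
QUERY = "GAVLIPFYWSTCMNQPEKRHGAVLIPFYWSTC"
SUBJECT = "STCGAVLIPFYWSTQPERKYWSSWYFPILLIP"
WORD_LEN = 3


def find_hits(query, subject, word_len=WORD_LEN):
    """Return (query_address, subject_address) for every matching word pair.

    Addresses are 1-based and point at the middle letter of the word.
    This exhaustive comparison is a software model, not the hardware design.
    """
    middle = 1 + word_len // 2  # offset from 0-based start to 1-based middle
    hits = []
    for q in range(len(query) - word_len + 1):
        for s in range(len(subject) - word_len + 1):
            if query[q:q + word_len] == subject[s:s + word_len]:
                hits.append((q + middle, s + middle))
    return hits


hits = find_hits(QUERY, SUBJECT)
for expected in [(31, 2), (16, 16), (9, 21)]:
    print(expected, expected in hits)
```

The three pairs correspond to the register values “11111”/“00010”, “10000”/“10000” and “01001”/“10101”. Because the model is exhaustive, it also reports every other matching word pair between the two sequences, not only the three shown in the figures.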

4.3 Hits Combination Block

If there is high similarity between the query and subject sequences, the Multiple Hits Detection Module may output a large number of hits per clock cycle. The Ungapped Extension Block is slower than the Multiple Hits Detection Module and the systolic array in this situation. As a result, the processing speeds of these two stages are unbalanced.

Previous research [34][35] suggested that in many cases there is high similarity between the query sequence and the subject sequence. Many overlapping hits can be reported by adjacent components at the same time. Processing those hits not only wastes execution cycles in the Multiple Hits Detection Module, but also consumes additional extension time in the Ungapped Extension Block because one hit is extended twice. For instance, two adjacent hits “ATK” and “TKP” are found, but they are actually only one hit, “ATKP”. If they are not combined into one hit, they would be recorded twice in the Multiple Hits Detection Module and extended twice in the Ungapped Extension Block. In this section, we implement the Hits Combination Block to address this issue. It detects overlapping hits and merges them, which reduces redundant hits while maintaining the sensitivity of BLAST.

This block contains a Hits First-In-First-Out (FIFO) buffer, which stores hit location addresses from both the query and subject sequences. The algorithm implementation engine executes the hits combination algorithm and then forwards the hit addresses and hit length to the Ungapped Extension Block. The data flow in this block is shown in Figure 4.10.

[Figure: hit addresses in Q and S enter the Hits FIFO, which holds the i-th, (i+1)-th, (i+2)-th, ... hit addresses in S and Q; the algorithm implementation engine of the Hits Combination Block reads them and outputs the addresses of hits.]

Figure 4-10: Hits Combination Block.

The order of the hit addresses in the Hits FIFO buffer follows the direction of the stream flow. For example, the address of “AKL” is stored in the FIFO buffer first, followed by that of “KLP”. The combination algorithm is triggered when there is more than one record in the Hits FIFO buffer. Here, the hit addresses in the subject sequence are recorded and forwarded to the following operations. First, the algorithm implementation engine calculates the starting and ending addresses of the i-th and (i+1)-th hits in the subject sequence. In this thesis, the starting address is the recorded address minus 1, because the recorded address is the address of the middle character of a 3-character hit. The ending address is the address of the last character in a sequence, so (ending address = starting address + sequence length - 1). Here, the sequence length is equal to 3. Then, a comparison between the i-th hit's ending address and the (i+1)-th hit's starting address is performed. If the i-th hit's ending address is greater than or equal to the (i+1)-th hit's starting address, the two hits overlap. The algorithm then determines the new addresses of the merged hit in both the subject sequence and the query sequence, calculates its length, and records the new address and hit length at the i-th place in the Hits FIFO buffer. If the i-th hit's ending address is less than the (i+1)-th hit's starting address, the (i+1)-th hit is chosen to be compared with the (i+2)-th hit. This operation continues until the Hits FIFO buffer is empty.
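The combination procedure can be sketched in software as follows. This is a minimal Python model under stated assumptions, not the engine's implementation: hits arrive as (query address, subject address) pairs of the middle character in stream order, consecutive overlapping hits are assumed to lie on the same alignment diagonal (as “ATK” and “TKP” do), and the function name `combine_hits` is hypothetical.

```python
from collections import deque

WORD_LEN = 3  # length of an uncombined hit


def combine_hits(recorded):
    """Merge overlapping hits taken from a FIFO of recorded middle addresses.

    Returns (query_start, subject_start, length) for each combined hit.
    Only subject addresses are compared, as in the description above.
    """
    # starting address = recorded (middle) address - 1
    fifo = deque((q - 1, s - 1, WORD_LEN) for q, s in recorded)
    combined = []
    while fifo:
        q_start, s_start, length = fifo.popleft()
        while fifo:
            _, next_start, next_len = fifo[0]
            s_end = s_start + length - 1  # ending = starting + length - 1
            if s_end < next_start:
                break  # no overlap: move on to the next hit
            new_end = max(s_end, next_start + next_len - 1)
            length = new_end - s_start + 1
            fifo.popleft()
        combined.append((q_start, s_start, length))
    return combined


# "ATK" and "TKP" recorded at adjacent addresses, plus one distant hit.
print(combine_hits([(11, 5), (12, 6), (20, 30)]))
```

In this example the first two hits are merged into one hit of length 4, while the distant hit is passed through unchanged with length 3.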

4.4 Parallel Architecture for Multiple Hits Detection Module and Hits Combination Block

Because the Multiple Hits Detection Module can detect multiple hits in one internal clock cycle, multiple modules working together in parallel speed up hit detection. To construct a longer array, many Multiple Hits Detection Modules are connected serially. Figure 4.11 shows how they are connected.

[Figure: a row of serially connected Multiple Hits Detection Modules, each feeding its own Hits Combination Block and local Hits FIFO; FIFO Combination modules merge the local Hits FIFOs in stages into a single Hits FIFO.]

Figure 4-11: Parallel Architecture of Multiple Hits Finder Array and Hits Combination Block.

The large systolic array contains many Multiple Hits Detection Modules. Each module is paired with one Hits Combination Block, which records and combines the hits detected by its module and then sends the hit information to a local Hits FIFO. FIFO Combination modules then combine the Hits FIFOs and deliver the hit information to larger Hits FIFOs. In this architecture, the Multiple Hits Detection Modules are connected serially to extend the array comparison size, and because each Hits Combination Block maps to one Multiple Hits Detection Module, all blocks can detect and merge overlapping hits simultaneously. The parallel implementation trades on-chip resources for speed.
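The staged merging of local Hits FIFOs can be pictured as a pairwise reduction. The sketch below is only an illustration: the function name `combine_fifos` is hypothetical, and the policy of draining the left FIFO before the right one is an assumption, since the arbitration order of the FIFO Combination modules is not specified here.

```python
from collections import deque


def combine_fifos(fifos):
    """Merge local hit FIFOs pairwise, level by level, into one FIFO.

    Each level halves the number of FIFOs, as in a binary reduction tree.
    Left-before-right ordering is an illustrative assumption.
    """
    level = [deque(fifo) for fifo in fifos]
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            merged = deque(level[i])
            if i + 1 < len(level):
                merged.extend(level[i + 1])
            next_level.append(merged)
        level = next_level
    return level[0] if level else deque()


# Four local FIFOs reduce to two, then to one.
print(list(combine_fifos([["h1"], ["h2", "h3"], [], ["h4"]])))
```

With N local FIFOs the reduction needs about log2(N) levels, which is one way to read the trade of on-chip resources for speed mentioned above.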

4.5 Ungapped Extension Block

This block implements the ungapped extension step of the BLAST algorithm explained before. A FIFO is used to store the hit information. The score matrix BLOSUM50 [16][34] is applied to find the alignment score for a pair of words. If this block detects a hit in the FIFO, the addresses of the hit word in the query sequence and the subject sequence and the hit length are extracted from the FIFO to compute both extension points. After this step, the Ungapped Extension Algorithm Implementation Engine can carry out the ungapped extension step. The data flow in the Ungapped Extension Block is shown in Figure 4.12. In the figure, the hit information is forwarded to the FIFO. This information includes the addresses of the hits in the query sequence and the subject sequence, as well as their content. It is important for the ungapped extension because the algorithm running in the algorithm implementation engine calculates the start points of the extension in the query sequence and the subject sequence.

[Figure: hit information enters the FIFO; the algorithm implementation engine takes hit addresses from the FIFO, alignment score information from the BLOSUM50 score matrix, and context from the query sequence memory and the Initialization Set & Data Control Block; the Ungapped Extension Block outputs the end point addresses of HSPs.]

Figure 4-12: Ungapped Extension Block.

The extension procedure in the Algorithm Implementation Engine is illustrated in Figure 4.13:

Query Sequence:    L  V  N  K  T  P  V  W  Y
Subject Sequence:  D  V  C  K  T  P  L  V  R
Alignment score:  -3  4 -3  5  2  7  3 -3 -3

Figure 4-13: Extension procedure.

The hit KTP is extended as follows. After computing the addresses of the characters “K” and “P”, which are the starting addresses of the extension, the algorithm drives the extension in both directions. Each pair of residues has an alignment score, which can be found in the BLOSUM50 score matrix. In Figure 4.13, the alignment score of the pair “L” and “D” is -3, and the score of the pair “V” and “V” is 4. During the extension procedure, the total score of these aligned pairs is computed. The ungapped extension step stops when the total score is more than a threshold. Then the endpoints of this high-scoring segment in both directions on both the query and subject sequences are recorded and forwarded to the Gapped Extension Block for gapped alignment.
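To make the two-directional extension concrete, the following Python sketch extends the hit KTP using the per-column scores shown in Figure 4.13. Note the swapped-in stopping rule: the sketch uses the X-drop criterion of standard BLAST (stop when the running score falls more than `x_drop` below the best score seen), which may differ from the threshold rule of the engine described above. The function name and the value `x_drop=5` are illustrative assumptions.

```python
def ungapped_extend(scores, hit_start, hit_end, x_drop):
    """Extend a hit to the right and then to the left over per-column scores.

    Returns (left, right, best_score) with 0-based inclusive column indices.
    Uses X-drop termination, an assumption borrowed from standard BLAST.
    """
    best = total = sum(scores[hit_start:hit_end + 1])
    right = hit_end
    for i in range(hit_end + 1, len(scores)):
        total += scores[i]
        if total > best:
            best, right = total, i
        elif best - total > x_drop:
            break
    total = best  # continue from the best right-extended segment
    left = hit_start
    for i in range(hit_start - 1, -1, -1):
        total += scores[i]
        if total > best:
            best, left = total, i
        elif best - total > x_drop:
            break
    return left, right, best


# Column scores from Figure 4.13; the hit K-T-P occupies columns 3..5.
FIGURE_4_13_SCORES = [-3, 4, -3, 5, 2, 7, 3, -3, -3]
print(ungapped_extend(FIGURE_4_13_SCORES, 3, 5, x_drop=5))
```

With these inputs the segment grows from columns 3..5 (score 14) to columns 1..6 (score 18): the pair V/L on the right and the pair V/V on the left raise the score, while the flanking negative pairs do not.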

In this step, hits sent from the Hits FIFO of the Hits Combination Block are extended to either side to identify an HSP. The extension process needs to be triggered by at least one hit detection. Mercury BLASTn pre-filters the hit's characters because only one hit can be searched in one clock cycle. FPGA/FLASH can get the subsequence directly because it constructs an index for each word. Compared to the methods applied in Mercury BLASTn and FPGA/FLASH, the Multiple Hits Detection Module is more efficient since it can detect multiple hits in one clock cycle. However, the serial ungapped extension process is slower than the Multiple Hits Detection Module. To speed up this step, we proposed two methods in the sections above: adding a Hits Combination Block between the Multiple Hits Detection Module and the Ungapped Extension Block, and dividing the long component array into several parallel component array groups. As shown in Figure 4.14, we propose one more parallel architecture, for the Ungapped Extension Block. The Hits Combination Block detects and merges overlapping hits for the next step of the BLAST algorithm. It acts like a filter that passes only valid hits on for extension.

[Figure: several Hits FIFOs, each passing hit information to its own Ungapped Extension Block; every Ungapped Extension Block outputs HSP addresses and reads context from the Subject Sequence Memory & Query Sequence Memory.]

Figure 4-14: Parallel Architecture for Ungapped Extension Block.

In this architecture, each Hits FIFO of a Hits Combination Block connects to one Ungapped Extension Block. Each block connects to the query sequence memory as well as the Initialization Set & Data Control Block because the context of the hits is necessary for extension. The system may slow down if hits from all Hits FIFOs are queued in only one Ungapped Extension Block. In this architecture, hit information is distributed over different paths, which saves processing time.

4.6 Gapped Extension Block

In this block, HSPs from the FIFO buffer are used to perform gapped extension. The Gapped Algorithm Implementation Engine is triggered once the block detects that there is a pair of segments in the FIFO. We propose using the Needleman-Wunsch algorithm to perform the gapped extension. Figure 4.15 shows the data flow in the Gapped Extension Block.


[Figure: HSPs enter the FIFO of the Gapped Extension Block; the algorithm implementation engine reads HSPs from the FIFO and outputs HSPs and their alignment scores.]

Figure 4-15: Gapped Extension block

This thesis focuses on the first two steps of the BLAST algorithm, finding hits and ungapped extension, because they consume most of the execution time of the BLAST algorithm. Some efficient architectures [35][36] have already been designed to implement the Needleman-Wunsch algorithm on FPGAs. In the future, we will implement the algorithm after doing more research on these architectures.
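For reference, the Needleman-Wunsch recurrence proposed for the gapped extension can be sketched in a few lines of Python. This is a software illustration only, not a hardware design: it uses a linear gap penalty and simple match/mismatch scores, all of which are assumed values; a real implementation would presumably take substitution scores from a matrix such as BLOSUM50.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b.

    Keeps only two rows of the dynamic-programming table. The scoring
    parameters are illustrative assumptions.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("KTP", "KP"))
```

In the example, aligning "KTP" with "KP" gives two matches and one gap. The cells along each anti-diagonal of the table are independent of one another, which is the property that systolic implementations of this recurrence typically exploit.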


Chapter 5

Performance Analysis

We have mapped the Multiple Hits Detection Module, Hits Combination Block and Ungapped Extension Block architecture onto the Xilinx XC5VLX110T FPGA, which is embedded on the ML509 education board, also referred to as the XUPV5-LX110T board. The architecture was designed under Xilinx ISE Design Suite 12.1 in Very High Speed Integrated Circuit Hardware Description Language (VHDL).

In our system, the Xilinx ML509 education board has a PCI Express (x1) interface which can provide 500 MB/s throughput. The board is connected to the host computer, a Dell Precision T1500 PC with a 2.80 GHz Intel Core i7 microprocessor and 4 GB of memory. NCBI BLAST software runs on the host. In our experiment, a systolic array-based parallel architecture with 2048 processing units was implemented on the Xilinx ML509. In addition to this implementation, we also mapped systolic array-based architectures with 1024, 2048 and 3072 processing units to different FPGA devices under Xilinx ISE Design Suite. All performance data came from the design report. The clock values came from the post-place-and-route timing analysis tool integrated in ISE. We analyze the architecture performance from two aspects: comparing our architecture with previous hardware architectures, and comparing our architecture against BLAST software versions. For the hardware comparison, we focus on three parts: synthesis performance, storage requirement and word scanning speed. For the software comparison, we focus on the execution time analysis.

5.1 Hardware Performance Comparison

5.1.1 Synthesis Performance

Tree-BLAST adopts a systolic array architecture. Because every four components need a BRAM to index the scoring matrix, the input array size the system can process is restricted by the BRAM. As a result, the query size is limited to 600 on the XC2VP70 and 1024 on the XC4VLX160. Xia et al. [37] implemented FPGA-based accelerators for the BLAST algorithm, called families of FPGA-based accelerators, which are also systolic array-based. The processing unit in their architecture is more complicated than ours: it needs more registers to get match information from two adjacent units. In addition, the data stream in the array needs to be shut down to record hit addresses. As a result, it needs more registers to store the systolic array status in order to resume it. In our architecture, WPRBS is replaced by the Multiple Hits Detection Module to find hits, so that the array size is no longer limited by BRAM but by LUTs. We can build 2048 components on the XC5VLX110T, which is two times the array size of Tree-BLAST. Based on the synthesis report, the design uses 55% of the slices. Compared with Tree-BLAST built on the XC4VLX160 with 78% of the slices, our design has a longer array and consumes fewer resources. For the synthesis performance comparison, we also implemented our architecture with 2048 components on the XC2VP70 and XC4VLX160 FPGAs. The results are shown in Table 5.1:

Table 5.1: Synthesis performance comparison between Tree-BLAST and ours.

              Our Architecture                       Tree-BLAST
FPGA          XC5VLX110T   XC2VP70   XC4VLX160       XC2VP70   XC4VLX160
Array Size    2048         2048      2048            600       1024
Slice (%)     55           87        59.2            --        78
Memory (%)    20.3         42        28              --        88
Clock (MHz)   136          145       158             110       178

We constructed two systolic arrays with 1024 and 3072 processing units to compare synthesis performance with the families of FPGA-based accelerators. The results are shown in Table 5.2:

Table 5.2: Synthesis performance comparison between families of FPGA-based Accelerators and ours.

              Our Architecture                       Families of FPGA-based Accelerators
FPGA          XC5VLX110T   XC2VP70   XC4VLX160       XC2VP70   XC4VLX160
Array Size    2048         1024      3072            1024      3072
Slice (%)     55           58        69              60        71
Memory (%)    20.3         10.1      33.5            11        38
Clock (MHz)   136          125       166             140       189

5.1.2 Storage Requirement

Systolic array-based architectures require less storage than many WPRBS architectures. The storage consumption of our architecture mainly consists of the sequence memory and the hits FIFOs. The memory usage is 2300 Kbits in total, which is 13.7% of the memory capacity of the XC5VLX110T. Since many WPRBS architectures need to record the position of each word in the query sequence, they generally consume much more storage. For instance, RC-BLAST uses 64K × 64 bits of memory to store the query index in the Xilinx 4085XLA. Mercury and Mitrion use external SRAM to store the hash table because the internal memory is not enough for the record. The slow external memory access has a negative impact on the efficiency of the algorithm implementation. Compared to these methods, our architecture saves storage, avoids the complexity of memory access and simplifies the FPGA layout and routing.

5.1.3 Word Scanning Speed

Most WPRBS architectures can search at most one word per clock cycle. The word-scanning speed of Mercury BLASTn is 96M matches/s. The speed of the Multi-engine BLASTn Accelerator, which applies 64 identical parallel engines, reaches 6400M matches/s. The word scanning speed of our Multiple Hits Detection Module approach reaches 14450M matches/s, which is more than twice that of the Multi-engine BLASTn Accelerator.
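The ratios behind this comparison follow directly from the quoted speeds. The short Python check below uses only the figures stated above; the dictionary and function names are illustrative.

```python
# Word scanning speeds in millions of matches per second, as quoted above.
OUR_SPEED = 14450
RELATED_WORK = {
    "Mercury BLASTn": 96,
    "Multi-engine BLASTn Accelerator": 6400,
}


def speed_ratio(ours, other):
    """Return how many times faster `ours` is than `other`."""
    return ours / other


for name, speed in RELATED_WORK.items():
    print(f"{name}: {speed_ratio(OUR_SPEED, speed):.2f}x")
```

The ratio is about 2.26 against the Multi-engine BLASTn Accelerator and about 150 against Mercury BLASTn.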

5.2 Comparing Execution Time to Software Versions

Our design is mapped to the Xilinx XC5VLX110T FPGA to accelerate the first two steps of NCBI BLAST family programs. Query and subject sequences for experiments of different algorithms come from different databases. The situation is shown in Table 5.3.

Table 5.3: Experimental database for different algorithms.

Software Version    Experiments Data Source    Experiments Data Size
BLASTp/BLASTx       Swiss-Prot                 532146 sequences, 188719038 characters
TBLASTn/TBLASTx     Drosoph.nt                 1170 sequences, 122655632 characters

As Table 5.3 shows, Swiss-Prot from EBI [38] and Drosoph.nt from NCBI BLAST are the two databases used in the experiments. Swiss-Prot is a protein sequence database, while Drosoph.nt is a database of DNA sequences. The Swiss-Prot used in our experiments is UniProtKB/Swiss-Prot protein knowledgebase release 2012_11. BLASTp is used for searching a protein sequence against a protein database. BLASTx is used for searching a DNA sequence against a protein database; queries picked from Drosoph.nt are translated into six-frame protein sequences before being searched against Swiss-Prot. TBLASTn is used for searching a protein sequence against a DNA database. TBLASTx is used for searching the six-frame translation of a DNA sequence against the six-frame translation of a DNA database.

For each algorithm, we built 6 arrays with different sizes (128 to 3K) to carry queries of different lengths. For each length, 15 queries were picked randomly from the relevant database. Every query's running time was averaged over 3 identical runs. The final execution time for a given array is the average of the running times of the 15 queries. Table 5.4 shows the execution time of the algorithms on our architecture and in the software versions.

Table 5.4: Execution time of our architecture and software versions.

Query     BLASTp            TBLASTn           BLASTx            TBLASTx
Length    ST(ms)   HT(ms)   ST(ms)   HT(ms)   ST(ms)   HT(ms)   ST(ms)   HT(ms)
128       1901     1050     3203     153      1594     1030     4031     163
256       3087     1058     3641     167      2641     1042     5906     198
512       5603     1092     5156     202      4378     1074     9978     270
1K        9327     1150     9891     254      7828     1135     16266    351
2K        17814    1220     15438    360      13187    1210     27703    462
3K        25132    1490     20328    436      18875    1331     37500    536

The execution times of the software versions in this table come from previous work [37]. The hardware execution time (HT) is less than the software execution time (ST) in all cases. The BLASTp software version spends only 1901 ms searching the database with a query of length 128, while it takes 25132 ms with a query of length 3K. This is because a larger index is necessary when the sequence is longer. Index construction not only needs more time but also increases the size of the search object. In our architecture, however, the execution time does not increase so rapidly. This is because the search time is equal to the sum of the time for the database stream to flow through the array and the extension time of valid hits. The streaming time is determined by the size of the database, while the extension time of valid hits is influenced by the architecture design. In this experiment, the size of the database does not change. The Hits Combination Block and the parallel design reduce the number of valid hits and their extension time. The architecture therefore demonstrates more speedup with longer sequences. For the same reason, the execution times of TBLASTn, BLASTx and TBLASTx implemented on the FPGA do not increase sharply.

From Table 5.4 we can see that the TBLASTn software version is slower than the BLASTp software version for the same query length. The reason is that all DNA sequences are translated into the six possible proteins in the TBLASTn software version before searching [39]. The translation time is not counted for the hardware version in our experiments because the translation is a one-time job and its results can be used for many other applications. For the TBLASTx software version, both the queries and the database are translated into the six possible proteins before searching. The translation time is not counted for the same reason as in TBLASTn. The speedup is shown in Figure 5.1.

[Figure: speedup (vertical axis, 0 to 80) versus query length (128, 256, 512, 1K, 2K, 3K) for BLASTp, TBLASTn, BLASTx and TBLASTx.]

Figure 5-1: Speed up of the parallel FPGA architecture vs. software versions.
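The speedup values plotted in Figure 5.1 can be recomputed from Table 5.4 as ST divided by HT. The Python sketch below copies the table values; the variable and function names are illustrative.

```python
QUERY_LENGTHS = ["128", "256", "512", "1K", "2K", "3K"]

# (software time ST, hardware time HT) in ms per query length, from Table 5.4.
TABLE_5_4 = {
    "BLASTp": ([1901, 3087, 5603, 9327, 17814, 25132],
               [1050, 1058, 1092, 1150, 1220, 1490]),
    "TBLASTn": ([3203, 3641, 5156, 9891, 15438, 20328],
                [153, 167, 202, 254, 360, 436]),
    "BLASTx": ([1594, 2641, 4378, 7828, 13187, 18875],
               [1030, 1042, 1074, 1135, 1210, 1331]),
    "TBLASTx": ([4031, 5906, 9978, 16266, 27703, 37500],
                [163, 198, 270, 351, 462, 536]),
}


def speedups(program):
    """Return the list of ST/HT ratios for one program, by query length."""
    software, hardware = TABLE_5_4[program]
    return [st / ht for st, ht in zip(software, hardware)]


for program in TABLE_5_4:
    row = ", ".join(f"{length}: {ratio:.1f}"
                    for length, ratio in zip(QUERY_LENGTHS, speedups(program)))
    print(f"{program}: {row}")
```

For example, BLASTp goes from about 1.8 at length 128 to about 16.9 at length 3K, and TBLASTx reaches about 70 at length 3K, consistent with the vertical axis range of the figure.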

PatternHunter, which is implemented in Java, applies a spaced seed model to accelerate BLAST search on a desktop. At that time, few hardware BLAST accelerators had been designed. Though its word scan speed was superior to that of other BLAST software versions of the time, it is slower than current hardware accelerators for BLAST. PatternHunter has a word scan speed of at most 12M/s, while ours reaches 14450M matches/s.

Chapter 6

Conclusion and Future Work

In this thesis, we presented a systolic array-based FPGA parallel architecture for the BLAST algorithm. The architecture contains four different kinds of blocks: the Multiple Hits Detection Module, Hits Combination Block, Ungapped Extension Block, and Gapped Extension Block. Finding hits is the most computationally time-consuming step in BLAST. We designed the Multiple Hits Detection Module to improve the efficiency of this bottleneck step. The Multiple Hits Detection Module, based on the systolic array, can find more than one hit in a single clock cycle. It is faster than the WPRBS searching approach used in the other architectures cited in this thesis. The Hits Combination Block between the Multiple Hits Detection Module and the Ungapped Extension Block merges adjacent overlapping hits into one hit in order to speed up the extension step. Lastly, we proposed a parallel architecture for the Ungapped Extension Block, as illustrated in Figure 4.14. We have implemented the Multiple Hits Detection Module, the Hits Combination Block and the Ungapped Extension Block.

Based on the ISE synthesis report, we compared the performance of our architecture to that of other related work. The results showed advantages of our architecture in four aspects. First, compared to Tree-BLAST, an array twice as long can be built in our architecture with fewer FPGA resources. Second, our architecture needs less memory space. Third, the word scanning speed of our architecture is faster than that of Mercury BLASTn. Last, compared to BLASTp software, our architecture is more suitable for dealing with longer sequences. In the future we hope to design an efficient architecture for the Gapped Extension Block. To meet the requirements of more users, we also want to modify our design to support nucleotide searches. The Multiple Hits Detection Module needs to be changed; for instance, the 3-input AND gate should be changed to an 11-input AND gate. The clock_division block needs to be redesigned to guarantee that the pipeline works correctly. More details will be addressed in the further design.

References

1. Durbin, R., Eddy, S., Krogh, A., and Mitchison, G., “Biological Sequence Analysis: Probabilistic Models for Proteins and Nucleic Acids”, Cambridge University Press, Cambridge UK, 1998.

2. Hein, J. “A New Methodology that simultaneously aligns and reconstructs ancestral sequence for any number of homologous sequences, when a phylogeny is given”. Journal of Molecular Biology, 6, pp.649-668, 1989.

3. Bain, W., "MULTAN: a program to align multiple DNA sequences", Nucleic Acids Res, 14, pp.159-179, 1986.

4. E. Sotiriades, C. Kozanitis and A. Dollas, “Some Initial Results on Hardware BLAST Acceleration with a Reconfigurable Architecture.” In proceedings 20th International Parallel and Distributed Processing Symposium, IPDPS 2006, p.251., at the 5th IEEE International Workshop on High Performance Computational Biology (HiCOMB2006), Rhodes, Greece, 25 – 29 April, 2006.

5. W. Pearson and D. Lipman, “Improved tools for biological sequence analysis,” Proc. Natl. Acad. Sci USA, vol. 85, pp.2444-2448, 1988.

6. D. Hoang et al., “FPGA Implementation of Systolic Sequence Alignment,” in Proceedings of the 2nd International Workshop on Field Programmable Logic and Applications, Lecture Notes in Computer Science, vol. 705, pp. 183 – 191, 1992.

7. Altschul, S.F., Gish, W., Miller, W., Myers, E.W and Lipman, D.J., “Basic Local Alignment Search Tool”, Journal of Molecular Biology, 215, pp.403-410, 1990.

8. Oliver, T. Schmidt, B. and Maskell, D., “Reconfigurable Architectures for Bio-sequence Database Scanning on FPGAs”, IEEE Trans. Circuits Syst. II, vol. 52, no. 12, pp. 851-855, 2005.

9. Oehmen, C. and Nieplocha, J., “ScalaBLAST: A Scalable Implementation of BLAST for High-Performance Data-Intensive Bioinformatics Analysis”, IEEE Trans. On Parallel and Distributed Systems, 2006.

10. Hoang, D., et al., “Searching Genetic Databases on Splash2”, In: Proc. IEEE Workshop on FPGAs for Custom Computing Machines, pp. 185-191, 1993.

11. Needleman, S. and Wunsch, C., “A general method applicable to the search for similarities in the amino acid sequence of two sequences”, Journal of Molecular Biology, 48(3), pp.443-453, 1970.

12. Smith, T.F. and Waterman, M.S., “Identification of Common Molecular Subsequence”, J.Mol.Bio, 147, pp195-197, 1981.

13. Altschul, S. and Madden T. et al., "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res, 25(17):3389-402, 1997.

14. Herbordt, M.C., Model, J., Gu, Y.F., Sukhwani, B. and VanCourt, T., “Single Pass, BLAST-Like, Approximate String Matching on FPGAs”, 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pp217-226, 2006.

15. Court, T.V. and Herbordt, M.C., “Families of FPGA-Based Accelerators for Approximate String Matching”, Microprocessors and Microsystems 31, pp135-145, 2007.

16. Kasap, Server., Benkrid, Khaled and Liu, Ying. “Design and Implementation of an FPGA-based Core for Gapped BLAST Sequence Alignment with the Two-Hit Method”, Engineering Letters, 16 (3), pp.443-452, 2008.

17. Gusfield, Dan., “Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology”, Cambridge University Press, Cambridge UK, 1997.

18. Oliver, T. et al. “Hyper Customized Processors for Bio-Sequence Database Scanning on FPGAs”, Proceedings of ACM/SIGDA 13th international symposium on Field-programmable gate arrays (FPGA), pp229-237, 2005.

19. Buhler, J.D. and Lancaster, J.M., “Mercury BLASTN: Faster DNA Sequence Comparison Using a Streaming Hardware Architecture”, 3rd Annual Reconfigurable Systems Summer Institute, 2007.

20. Chamberlain, R.D., Cytron, R.K., Franklin, M.A., and Indeck, R.S., “The Mercury system: Exploiting truly fast hardware for data search”, International Workshop on Storage Network Architecture and Parallel I/Os, pp.65-72, September 2003.

21. Lancaster, J.M. and Jacob, A.C., “Mercury BLASTN”, 44th Design Automation Conference, 2007.

22. Jacob, A., Joseph, L., Buhler, J., Harris, B., and Chamberlain, R.D., “Mercury BLASTP: Accelerating Protein Sequence Alignment”, ACM Transaction Reconfigurable Technol Syst., 1(2): 9, June 2008.

23. Datta, S., Beeraka, P., Sass, R., "RC-BLASTn: Implementation and Evaluation of the BLASTn Scan Function", 17th IEEE Symposium on Field Programmable Custom Computing Machines, pp 88-95, April 2009.

24. Lavenier, D., Georges, G. and Liu, X., “A reconfigurable index flash memory tailored to seed-based genomic sequence comparison algorithms”, VLSI Signal Processing, 48(3), pp 255–269, 2007.

25. Sotiriades, E., Dollas, A., “Design Space Exploration for the BLAST Algorithm Implementation”, 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 2007.

26. Sotiriades, E., Dollas, A., "A General Reconfigurable Architecture for the BLAST Algorithm" Journal of VLSI Signal Processing Systems, 48(3), September 2007.

27. Chang, C., “BLAST Implementation on BEE2”, Electrical Engineering and Computer Science University of California at Berkeley, 2005.

28. Mitrion.Inc, NCBI BLAST Accelerator, 2007.

29. Lavenier, D., Xinchun, L. and Georges, G., “Seed-based Genomic Sequence Comparison using a FPGA/FLASH Accelerator”, IEEE International Conference on Field Programmable Technology, pp.41-48, 2006.

30. Tselepis, N., Bekakos, M.P., Milovanovic, I.Z. and Milovanovic, E.I., "FPGA Implementation of Optimal Planar Systolic Arrays for Orthogonal Matrix Multiplication", 4th International Conference: Sciences of Electronic, Technologies of Information and Telecommunication, March 2007.

31. Wang, X.J. and Leeser, M., “FPGA Based Systolic Array Implementation of QR Transformation Using Givens Rotations”, Available: http://www.ece.neu.edu/groups/rcl/projects/.

32. Herbordt, M.C., Model, J., Sukhwani, B., Gu, Y.F. and Tom, V.C., "Single pass Streaming BLAST on FPGAs", , 33(10-11), November 2007.

33. Abelsson, H., Sandberg, C. and Mohl, S., "Accelerating on NCBI BLAST", CUG

Conference, Seattle, Washington, May 2007.

34. Rao, A. A. and Visakhapatanam, "A Tool for Pair-Wise Alignment Algorithm",

Management Review: An Internaitonal Journal, 2(1), pp28-40, June 2007.

35. Che, S., Li, J., Sheaffer, J.W., Skadron, K. and Lach, J., "Accelerating Compute-

Intensive Applications With GPUs and FPGAs", Symposium on Application Specific

Processors, pp101-107, July 2008.

36. Lavenier, D., Xinchun, L. and Georges, G., “Seed-based Genomic Sequence

Comparison using a FPGA/FLASH Accelerator”, in Proceedings of IEEE

International Conference on Field Programmable Technology (FPT‟ 06), pp 41-48

2006.

37. Xia, F., Dou, Y. and Xu, J.Q., “Families of FPGA-Based Accelerators for BLAST

Algorithm with Multi-seeds Detection and Parallel Extension”, CCIS 13, pp.43-57,

2008.

38. ExPAsy Bioinformatics Resource Portal, [online]. Available :

http://web.expasy.org/docs/relnotes/relstat.html

39. Paolieri, M., Bonesana, I. and Santambrogio, M.D., “A Parallel and Pipelined

Architecture for Regular Expression Matching”, in Proc. IFIP-VLSI, Oct. 2007.


Appendix A

Project Architecture Generated from the Schematic Viewer

A.1 Architecture of the array_component

Figure A-1 The screen shot for array_component from the schematic viewer


A.2 Architecture of two systolic arrays in the project

Figure A-2 The screen shot for the architecture of two systolic arrays from the schematic viewer.


A.3 The connection of array_components

Figure A-3 The screen shot for the architecture of the connection of array_components from the schematic viewer.


Appendix B

Source Code of the Project

B.1 Source Code for the Processing Unit

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;

entity array_component is
  port(
    query_char_in  : in    std_logic_vector(4 downto 0); -- 5-bit input for the query sequence character
    query_char_out : inout std_logic_vector(4 downto 0); -- 5-bit output for the query sequence character
    sub_char_in    : in    std_logic_vector(4 downto 0); -- 5-bit input for the subject sequence character
    sub_char_out   : out   std_logic_vector(4 downto 0); -- 5-bit output for the subject sequence character
    com_clk        : in    std_logic; -- the clock for the processing unit
    reset_q        : in    std_logic; -- enables the processing unit to receive the query sequence character
    reset_s        : in    std_logic; -- enables the processing unit to receive the subject sequence character
    match          : out   std_logic; -- comparison result between the subject and query sequence characters
    query_id       : out   std_logic_vector(4 downto 0); -- the storage for the query id
    sub_id         : out   std_logic_vector(4 downto 0)); -- the storage for the subject id
end array_component;

architecture Behavioral of array_component is

  signal q_id : std_logic_vector(4 downto 0) := "00000";
  signal s_id : std_logic_vector(4 downto 0) := "00000";
begin


  process (com_clk, reset_q) -- the process for receiving the query sequence character
  begin
    if (reset_q = '1') then -- enable the unit to receive
      if (com_clk'event and com_clk = '1') then
        if ((0

  process (com_clk, reset_s) -- the process for receiving the subject sequence character
  begin
    if (reset_s = '1') then -- enable the unit to receive the subject sequence character
      if (com_clk'event and com_clk = '1') then
        sub_char_out <= sub_char_in;
        if ((0

        if (sub_char_in = query_char_out) then -- check the match between these two characters
          match <= '1';
        else
          match <= '0';
        end if;
      end if;
    end if;
  end process;
end Behavioral;


B.2 Source Code for the Clock_Division Block

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use ieee.std_logic_unsigned.all;

entity clock_division is
  port(
    clock_en  : in  std_logic;
    clock_out : out std_logic);
end clock_division;

architecture Behavioral of clock_division is
  signal count      : std_logic_vector(3 downto 0) := "0000";
  signal clk_holder : std_logic := '0';
begin
  -- the process for generating a new clock (clock_out) whose clock
  -- cycle is 16 times that of the original one (clock_en)
  process(clock_en)
  begin
    if (clock_en'event and clock_en = '1') then
      if (count = 7) then
        count <= "0000";
        clk_holder <= not clk_holder;
      else
        count <= count + 1;
      end if;
    end if;
  end process;
  clock_out <= clk_holder;
end Behavioral;

B.3 Source Code for 3-input AND Gate

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- the following code describes a 3-input AND gate
entity and_3 is
  port (
    left       : in  std_logic;
    mid        : in  std_logic;
    right      : in  std_logic;
    hit_detect : out std_logic);
end and_3;


architecture Behavioral of and_3 is
begin
  hit_detect <= left and mid and right;
end Behavioral;

B.4 Source Code for Multiple Hits Detection Module

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;

-- building a systolic array with 32 processing units; this array is driven
-- by the clock generated from the Clock_Division Block
entity systoic_array is
  port (
    query_datastream_in  : in    std_logic_vector(4 downto 0);
    query_datastream_out : inout std_logic_vector(4 downto 0);
    sub_datastream_in    : in    std_logic_vector(4 downto 0);
    sub_datastream_out   : out   std_logic_vector(4 downto 0);
    array_clk            : in    std_logic;
    reset_query          : in    std_logic;
    reset_sub            : in    std_logic;
    enable               : in    std_logic;
    hit_query_location1  : out   std_logic_vector(4 downto 0);
    hit_sub_location1    : out   std_logic_vector(4 downto 0);
    hit_query_location2  : out   std_logic_vector(4 downto 0);
    hit_sub_location2    : out   std_logic_vector(4 downto 0));
end systoic_array;

architecture Behavioral of systoic_array is

  -- declare all components in this architecture
  component array_component is
    port (
      query_char_in  : in    std_logic_vector(4 downto 0);
      query_char_out : inout std_logic_vector(4 downto 0);
      sub_char_in    : in    std_logic_vector(4 downto 0);
      sub_char_out   : out   std_logic_vector(4 downto 0);
      com_clk        : in    std_logic;
      reset_q        : in    std_logic;
      reset_s        : in    std_logic;
      match          : out   std_logic;
      query_id       : out   std_logic_vector(4 downto 0);
      sub_id         : out   std_logic_vector(4 downto 0));
  end component;


  component and_3 is
    port (
      left       : in  std_logic;
      mid        : in  std_logic;
      right      : in  std_logic;
      hit_detect : out std_logic);
  end component;

  component clock_division is
    port(
      clock_en  : in  std_logic;
      clock_out : out std_logic);
  end component;

  -- internal registers that hold values for the connections between the
  -- processing units and between the Clock_Division Block and the systolic array
  type fivebits_array is array (0 to 15) of std_logic_vector(4 downto 0);
  signal query_loc_holder_1   : fivebits_array;
  signal query_loc_holder_2   : fivebits_array;
  signal sub_loc_holder_1     : fivebits_array;
  signal sub_loc_holder_2     : fivebits_array;
  signal query_temp_1         : fivebits_array;
  signal sub_temp_1           : fivebits_array;
  signal query_temp_2         : fivebits_array;
  signal sub_temp_2           : fivebits_array;
  signal match_holder_1       : std_logic_vector(0 to 15);
  signal match_holder_2       : std_logic_vector(0 to 15);
  signal hits_vector_1        : std_logic_vector(0 to 15);
  signal hits_vector_2        : std_logic_vector(0 to 15);
  signal clk_holder           : std_logic;
  signal clk_count_1          : std_logic_vector(3 downto 0) := "1111";
  signal clk_count_2          : std_logic_vector(3 downto 0) := "1111";
  signal hit_query_location_1 : std_logic_vector(4 downto 0) := "11111";
  signal hit_sub_location_1   : std_logic_vector(4 downto 0) := "11111";
  signal hit_query_location_2 : std_logic_vector(4 downto 0) := "11111";
  signal hit_sub_location_2   : std_logic_vector(4 downto 0) := "11111";
begin
  clock_divide : clock_division port map (
    clock_en  => array_clk,
    clock_out => clk_holder);

  compare_array : for i in 0 to 31 generate -- connect 32 processing units
    firstloop : if (i = 0) generate
      array_1 : array_component port map (
        query_char_in => query_datastream_in,


        com_clk => clk_holder,
        query_char_out => query_temp_1(i),
        sub_char_in => sub_datastream_in,
        sub_char_out => sub_temp_1(i),
        match => match_holder_1(i),
        reset_q => reset_query,
        reset_s => reset_sub,
        query_id => query_loc_holder_1(i),
        sub_id => sub_loc_holder_1(i));
    end generate firstloop;
    secondloop : if ((i > 0) and (i < 15)) generate
      array_1 : array_component port map (
        query_char_in => query_temp_1(i-1),
        com_clk => clk_holder,
        query_char_out => query_temp_1(i),
        sub_char_in => sub_temp_1(i-1),
        sub_char_out => sub_temp_1(i),
        match => match_holder_1(i),
        reset_q => reset_query,
        reset_s => reset_sub,
        query_id => query_loc_holder_1(i),
        sub_id => sub_loc_holder_1(i));
    end generate secondloop;
    thirdloop : if (i = 15) generate
      array_1 : array_component port map (
        query_char_in => query_temp_1(i-1),
        com_clk => clk_holder,
        query_char_out => query_temp_1(i),
        sub_char_in => sub_temp_1(i-1),
        sub_char_out => sub_temp_1(i),
        match => match_holder_1(i),
        reset_q => reset_query,
        reset_s => reset_sub,
        query_id => query_loc_holder_1(i),
        sub_id => sub_loc_holder_1(i));
    end generate thirdloop;
    fourthloop : if (i = 16) generate
      array_2 : array_component port map (
        query_char_in => query_temp_1(15),
        com_clk => clk_holder,
        query_char_out => query_temp_2(i-16),
        sub_char_in => sub_temp_1(15),
        sub_char_out => sub_temp_2(i-16),
        match => match_holder_2(i-16),
        reset_q => reset_query,
        reset_s => reset_sub,


        query_id => query_loc_holder_2(i-16),
        sub_id => sub_loc_holder_2(i-16));
    end generate fourthloop;
    fifthloop : if ((i > 16) and (i < 31)) generate
      array_2 : array_component port map (
        query_char_in => query_temp_2(i-16-1),
        com_clk => clk_holder,
        query_char_out => query_temp_2(i-16),
        sub_char_in => sub_temp_2(i-16-1),
        sub_char_out => sub_temp_2(i-16),
        match => match_holder_2(i-16),
        reset_q => reset_query,
        reset_s => reset_sub,
        query_id => query_loc_holder_2(i-16),
        sub_id => sub_loc_holder_2(i-16));
    end generate fifthloop;

    sixthloop : if (i = 31) generate
      array_2 : array_component port map (
        query_char_in => query_temp_2(i-16-1),
        com_clk => clk_holder,
        query_char_out => query_datastream_out,
        sub_char_in => sub_temp_2(i-16-1),
        sub_char_out => sub_datastream_out,
        match => match_holder_2(i-16),
        reset_q => reset_query,
        reset_s => reset_sub,
        query_id => query_loc_holder_2(i-16),
        sub_id => sub_loc_holder_2(i-16));
    end generate sixthloop;
  end generate compare_array;

  -- map 32 3-input AND gates to the systolic array with 32 processing units
  hit_detection : for i in 0 to 31 generate
    firstloop : if (i < 14) generate
      hit_detect : and_3 port map (
        left => match_holder_1(i),
        mid => match_holder_1(i+1),
        right => match_holder_1(i+2),
        hit_detect => hits_vector_1(i));
    end generate firstloop;

    secondloop : if (i = 14) generate
      hit_detect : and_3 port map (
        left => match_holder_1(i),
        mid => match_holder_1(i+1),
        right => match_holder_2(0),
        hit_detect => hits_vector_1(i));
    end generate secondloop;


    thirdloop : if (i = 15) generate
      hit_detect : and_3 port map (
        left => match_holder_1(i),
        mid => match_holder_2(0),
        right => match_holder_2(1),
        hit_detect => hits_vector_1(i));
    end generate thirdloop;

    fourthloop : if ((i > 15) and (i < 30)) generate
      hit_detect : and_3 port map (
        left => match_holder_2(i-16),
        mid => match_holder_2(i-16+1),
        right => match_holder_2(i-16+2),
        hit_detect => hits_vector_2(i-16));
    end generate fourthloop;
    fifthloop : if ((i = 30) or (i = 31)) generate
      hit_detect : and_3 port map (
        left => '0',
        mid => '0',
        right => '0',
        hit_detect => hits_vector_2(i-16));
    end generate fifthloop;
  end generate hit_detection;

  -- detect signals from the first 16 3-input AND gates to find hits
  process(array_clk, hits_vector_1, enable)
  begin
    if (enable = '1') then
      if (array_clk'event and array_clk = '1') then
        if ((clk_count_1 < "1111") or (clk_count_1 = "1111")) then
          if (clk_count_1 < "1111") then
            if (hits_vector_1(conv_integer(clk_count_1)) = '1') then
              hit_query_location_1 <= query_loc_holder_1(conv_integer(clk_count_1));
              hit_sub_location_1 <= sub_loc_holder_1(conv_integer(clk_count_1));
              clk_count_1 <= clk_count_1 + 1;
            else
              clk_count_1 <= clk_count_1 + 1;
            end if;
          else
            if (hits_vector_1(conv_integer(clk_count_1)) = '1') then
              hit_query_location_1 <= query_loc_holder_1(conv_integer(clk_count_1));
              hit_sub_location_1 <= sub_loc_holder_1(conv_integer(clk_count_1));
            end if;
            clk_count_1 <= "0000";
          end if;
        end if;


      end if;
    end if;
  end process;

  -- detect signals from the second 16 3-input AND gates to find hits
  process(array_clk, hits_vector_2, enable)
  begin
    if (enable = '1') then
      if (array_clk'event and array_clk = '1') then
        if ((clk_count_2 < "1111") or (clk_count_2 = "1111")) then
          if (clk_count_2 < "1111") then
            if (hits_vector_2(conv_integer(clk_count_2)) = '1') then
              hit_query_location_2 <= query_loc_holder_2(conv_integer(clk_count_2));
              hit_sub_location_2 <= sub_loc_holder_2(conv_integer(clk_count_2));
              clk_count_2 <= clk_count_2 + 1;
            else
              clk_count_2 <= clk_count_2 + 1;
            end if;
          else
            if (hits_vector_2(conv_integer(clk_count_2)) = '1') then
              hit_query_location_2 <= query_loc_holder_2(conv_integer(clk_count_2));
              hit_sub_location_2 <= sub_loc_holder_2(conv_integer(clk_count_2));
            end if;
            clk_count_2 <= "0000";
          end if;
        end if;
      end if;
    end if;
  end process;
end Behavioral;

B.5 Testbench Source Code for the Systolic Array Architecture

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

-- Uncomment the following library declaration if using
-- arithmetic functions with Signed or Unsigned values
--USE ieee.numeric_std.ALL;

ENTITY test_parallel_architecture IS
END test_parallel_architecture;


ARCHITECTURE behavior OF test_parallel_architecture IS

-- Component Declaration for the Unit Under Test (UUT)

  COMPONENT parallel_architecture -- two systolic arrays, each one has 32 processing units
    PORT (
      a_one_qin  : IN    std_logic_vector(4 downto 0);
      a_one_qout : INOUT std_logic_vector(4 downto 0);
      a_two_qin  : IN    std_logic_vector(4 downto 0);
      a_two_qout : INOUT std_logic_vector(4 downto 0);
      a_one_sin  : IN    std_logic_vector(4 downto 0);
      a_one_sout : INOUT std_logic_vector(4 downto 0);
      a_two_sin  : IN    std_logic_vector(4 downto 0);
      a_two_sout : INOUT std_logic_vector(4 downto 0);
      control_q  : IN    std_logic;
      control_s  : IN    std_logic;
      e_hits_det : IN    std_logic;
      archi_clk  : IN    std_logic);
  END COMPONENT;

  --Inputs
  type fivebits_array is array (0 to 31) of std_logic_vector(4 downto 0);

  -- query sequence 1
  signal a_one_qinput : fivebits_array := (
    "00001","00010","00011","00100","00101","00110","00111","01000",
    "01001","01010","01011","01100","01101","01110","01111","10000",
    "10001","10010","10011","10100","00001","00010","00011","00100",
    "00101","00110","00111","01000","01001","01010","01011","01100");

  -- subject sequence 1
  signal a_one_sinput : fivebits_array := (
    "01010","01011","01100","00001","00010","00011","00100","00101",
    "00110","00111","01000","01001","01010","01011","01111","10000",
    "10001","10011","10010","01000","01001","01010","01010","01001",
    "01000","00111","00110","00101","00100","00100","00101","00110");

  -- query sequence 2
  signal a_two_qinput : fivebits_array := (
    "01111","01110","01101","01100","01011","01010","01001","01000",
    "00111","00110","00101","00100","00011","00010","00001","00111",
    "01100","01011","01010","01001","01000","00111","00110","00111",
    "01000","01001","01010","01010","00010","00011","00100","00101");

  -- subject sequence 2
  signal a_two_sinput : fivebits_array := (
    "01100","01101","01110","01111","01001","01010","01010","00111",
    "00110","01000","01100","01011","01001","10000","00001","00111",
    "01100","10001","10010","10101","00001","01000","10000","00100",
    "00011","00010","00001","00010","00011","10010","01010","00001");

  signal a_one_qin : std_logic_vector(4 downto 0) := (others => '0');
  signal a_two_qin : std_logic_vector(4 downto 0) := (others => '0');


  signal a_one_sin  : std_logic_vector(4 downto 0) := (others => '0');
  signal a_two_sin  : std_logic_vector(4 downto 0) := (others => '0');
  signal control_q  : std_logic := '0';
  signal control_s  : std_logic := '0';
  signal e_hits_det : std_logic := '0';
  signal archi_clk  : std_logic := '0';

  --BiDirs
  signal a_one_qout : std_logic_vector(4 downto 0);
  signal a_two_qout : std_logic_vector(4 downto 0);
  signal a_one_sout : std_logic_vector(4 downto 0);
  signal a_two_sout : std_logic_vector(4 downto 0);

  -- Clock period definitions
  constant archi_clk_period : time := 2 ps;

BEGIN

  -- Instantiate the Unit Under Test (UUT)
  uut : parallel_architecture PORT MAP (
    a_one_qin  => a_one_qin,
    a_one_qout => a_one_qout,
    a_two_qin  => a_two_qin,
    a_two_qout => a_two_qout,
    a_one_sin  => a_one_sin,
    a_one_sout => a_one_sout,
    a_two_sin  => a_two_sin,
    a_two_sout => a_two_sout,
    control_q  => control_q,
    control_s  => control_s,
    e_hits_det => e_hits_det,
    archi_clk  => archi_clk);

  -- Clock process definitions
  archi_clk_process : process
  begin
    archi_clk <= '0';
    wait for archi_clk_period/2;
    archi_clk <= '1';
    wait for archi_clk_period/2;
  end process;

  -- Stimulus process
  stim_proc : process
  begin


    wait for archi_clk_period*16;
    wait for 15 ps; -- 47 ps
    control_q <= '1';
    for i in 0 to 31 loop
      a_one_qin <= a_one_qinput(i);
      a_two_qin <= a_two_qinput(i);
      wait for archi_clk_period*16;
    end loop; -- 47+32*32 = 1024+47 = 1071 ps
    control_q <= '0';
    wait for 32 ps; -- 1103 ps
    control_s <= '1';
    e_hits_det <= '1';
    for i in 0 to 31 loop
      a_one_sin <= a_one_sinput(i);
      a_two_sin <= a_two_sinput(i);
      wait for archi_clk_period*16; -- 1103+1024-32 = 2095 ps
    end loop;
    wait;
  end process;
END;
