ABSTRACT

A COMPREHENSIVE AND HIGH-PERFORMANCE MOTIF FINDING APPROACH ON HETEROGENEOUS SYSTEMS

Unknown regulatory motif finding on DNA sequences is a crucial task for understanding gene expression, and the task requires both accuracy and efficiency. We propose DMF, a combinatorial approach that uses hash-based heuristics to skip unnecessary computations while retaining maximum accuracy. Parallelized versions of our DMF approach, called PDMF, have been developed for CPU, GPU and heterogeneous computing architectures in order to achieve the maximum performance. PDMF also incorporates SIMD instructions to further accelerate the task of unknown motif search. Our experimental results show that the multicore version (PDMFm) achieved an 8.87x speedup over DMF. The GPU version (PDMFg) achieved a 41.48x and 9.95x average speedup over the serial version and PDMFm, respectively. Our SIMD-enhanced heterogeneous approach (PDMFh) achieved a 3.42x speedup over our fastest GPU model (PDMFg1). The proposed approach was tested for performance against popular approximate and suffix-based approaches with various sized real-world datasets, and the experimental results showed that the proposed approach achieved the maximum accuracy within a practical time bound for motif lengths 6~14.

Sanjay Soundarajan May 2020

A COMPREHENSIVE AND HIGH-PERFORMANCE MOTIF FINDING APPROACH ON HETEROGENEOUS SYSTEMS

by

Sanjay Soundarajan

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

in the College of Science and Mathematics California State University, Fresno

May 2020

© 2020 Sanjay Soundarajan

APPROVED

For the Department of Computer Science:

We, the undersigned, certify that the thesis of the following student meets the required standards of scholarship, format, and style of the university and the student's graduate degree program for the awarding of the master's degree.

Sanjay Soundarajan Thesis Author

Jin Park (Chair) Computer Science

Hubert Cecotti Computer Science

David Ruby Computer Science

For the University Graduate Committee:

Dean, Division of Graduate Studies

AUTHORIZATION FOR REPRODUCTION OF MASTER’S THESIS

I grant permission for the reproduction of this thesis in part or in its entirety without further authorization from me, on the condition that the person or agency requesting reproduction absorbs the cost and provides proper acknowledgment of authorship.

X Permission to reproduce this thesis in part or in its entirety must be obtained from me.

Signature of thesis author:

ACKNOWLEDGMENTS

First, and most of all, I would like to thank my mentor Dr. Jin Park, for his guidance throughout my graduate education. Without his expertise, knowledge and patience this thesis would not have been possible. I would like to thank my committee members Dr. Hubert Cecotti and Dr. David Ruby for their suggestions and encouragement. I would also like to extend my thanks to my family and friends who have supported me throughout my academic career. I wouldn’t be able to complete my education without them. In addition, I would like to thank Dr. Jenny Banh for her unwavering confidence in me and for supporting me at every step of my journey. Last of all, a special thank you goes to my research colleague and friend, Michelle Salomon, for always putting a smile on my face.

TABLE OF CONTENTS

Page

LIST OF TABLES ...... vii

LIST OF FIGURES ...... viii

INTRODUCTION ...... 1

UNKNOWN REGULATORY MOTIF FINDING ...... 4

Problem Definition...... 4

Solution Approaches ...... 6

PROPOSED APPROACH: DMF (DICTIONARY MOTIF FINDER) ...... 10

Background...... 10

Hash-based Heuristic Approach (DMF) ...... 18

PERFORMANCE OF DMF ...... 23

ACHIEVING HIGH PERFORMANCE ...... 29

Background...... 29

Parallel Dictionary Motif Finder (PDMF) ...... 34

PDMFm (MultiCore version) ...... 34

PDMFg (GPU version) ...... 39

PDMFracing (Heterogeneous Model) ...... 47

Enhanced Heterogeneous Model with SIMD Vectors - PDMFh ...... 49

PERFORMANCE OF PDMF ...... 59

CONCLUSION ...... 66

REFERENCES ...... 68

LIST OF TABLES

Page

Table 1. DMF vs. Branch-and-Bound Execution Time ...... 25

Table 2. Strength of Hypothetical Consensus Patterns ...... 26

Table 3. Accuracy: MEME vs. DMF ...... 27

Table 4. DMF vs. SPELLER vs. WEEDER Execution Time...... 28

Table 5. DMF vs. PDMFm2 vs. PDMFg1 Execution Times ...... 62

Table 6. PDMFg1 vs. PDMFracing vs. PDMFh Execution Times ...... 64

LIST OF FIGURES

Page

Figure 1. Consensus score of a motif ...... 5

Figure 2. Tree representation of a motif search ...... 13

Figure 3. Bypass paths on the L-mer tree ...... 15

Figure 4. Branch and bound performance vs. k and dataset size ...... 24

Figure 5. DMF vs. Branch-and-Bound performance ...... 25

Figure 6. Efficiency: DMF vs. MEME ...... 27

Figure 7. PDMFm1 computation model ...... 36

Figure 8. PDMFm2 block division computation model ...... 38

Figure 9. PDMFm2 cyclic division computation model ...... 38

Figure 10. PDMFm3 computation model ...... 39

Figure 11. GPU parallel architecture ...... 41

Figure 12. PDMFg1 computation model ...... 44

Figure 13. PDMFg2 computation model ...... 46

Figure 14. PDMFracing computation model ...... 48

Figure 15. PDMFh computation model ...... 49

Figure 16. SIMD Vector execution on multiple sequences ...... 54

Figure 17. SSE-based min operation ...... 56

Figure 18. SSE vector set operation ...... 57

Figure 19. PDMFm performance: m1 vs. m2 vs. m3 ...... 60

Figure 20. Performance of PDMFm3 with different thread allocations ...... 61

Figure 21. PDMFg1 vs. PDMFm2 performance ...... 62

Figure 22. Performance: PDMF GPU vs. Heterogeneous Models ...... 63

Figure 23. PDMFh scalability comparison ...... 65

INTRODUCTION

In the field of bioinformatics, finding recurring patterns of nucleotide base pairs in genomic data is a crucial task. These recurring patterns, or transcription factor binding sites, are referred to as motifs. Motifs represent possible protein interaction sites within the genome where various chemical reactions can take place. The occurrence of a motif within multiple genetically significant areas can highlight the relationship between the pattern and its effect on diseases such as cancer in organisms [1].

With the cost of DNA sequencing falling rapidly in the last two decades, the amount of biological data that exists in digital format has been rising almost exponentially. This increase in the size of genomic databases has caused problems in searching and processing times, as some of these datasets are tens or even hundreds of gigabytes in size. A popular protein database, nr, sits at over 150GB as of March 2020. Early work in the area of motif searching relied on simple combinatorial approaches [1] that faced many performance penalties in the face of large datasets. Naive combinatorial algorithms tend to have execution time complexities spanning multiple polynomial orders of magnitude. Despite the accuracy of these approaches, practical running times were crucial for researchers in the field. A solution to this problem was the use of heuristic motif searching algorithms that utilize statistics and other approximation approaches [1] to retrieve statistically relevant motifs. These approaches tackled the run time concerns but did not fully handle the accuracy portion of the problem. Since approximating the solution motifs did not always lead to accurate answers, a better solution to the motif searching problem was needed. Furthermore, a more complex version of the motif searching problem, introduced in 2002, utilized hamming distance-based metrics to filter solution motifs. Referred to as Planted Motif Search, its distance and quorum variables increased the complexity of the search, with even longer execution times [1].

With the introduction of multiple core CPUs, GPGPUs and FPGAs, increasing processing power was starting to become available to researchers in multiple fields. These devices provide massive amounts of computing power using smart organizations of hardware-level components, offering large amounts of parallelization to tasks that are able to harness it. Biological applications of parallelization saw a great increase in the field of bioinformatics as users were now able to use parallel computing as a means of accelerating their applications. Exact parallel combinatorial approaches were now starting to be used in a variety of subfields such as pairwise sequence alignment and motif searching, to name a few. Larger motifs in modern databases were able to be searched in reasonable times using accurate combinatorial approaches.

In this thesis, we introduce a novel combinatorial approach that uses smart heuristics to reduce the search space and multiple parallelization strategies to effectively handle large motif lengths and databases. This proposed approach utilizes the branch-and-bound technique to filter solution motifs within the 4^L search space, where L refers to the length of the motif and 4 is the size of the DNA alphabet. Parallelization is handled using a heterogeneous approach where a GPU and a multi-core CPU process possible solution motifs along a load-balanced execution path. Inside each CPU core, Streaming SIMD Extensions (SSE) and their later variants are used to accelerate the motif search process. All three parallelization techniques (multicore, GPU and SIMD) are used simultaneously in our system.

The next section delves into the background of the (planted) motif searching problem and the heuristics we have developed to effectively reduce the search space of the problem. The section following that expands on our parallel approach and how we developed a static heterogeneous execution path to effectively keep all the computing resources well fed with tasks without resorting to a master-slave based parallelization system. The final two sections will explore our testing methodologies and results before I conclude this thesis.

UNKNOWN REGULATORY MOTIF FINDING

Problem Definition

Searching for a motif in a database of DNA sequences can be abstracted to the following string-matching problem:

Given a set of DNA sequences (s1, s2, s3, …, sn) with sequence lengths (m1, m2, m3, …, mn), each from an alphabet of Σ, and a motif length of L; Find motif string x such that |x| = L and the sum of hamming distances of x over every sequence in the database is minimized across the set of all possible motif strings. (1)

The hamming distance corresponds to the number of differences between a pair of strings. This operation can be viewed as an XOR operation between two characters or words in the database. In the following example the hamming distance, d, between a pair of two strings of equal length is shown.

Seq1:              A C G G C T A G C
Seq2:              C C G G T C A G C
Hamming distance:  1 0 0 0 1 1 0 0 0

Total Hamming Distance: 3
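As an illustration of this operation, a minimal C++ sketch of the pairwise computation is shown below (the function name is ours and not part of the DMF source):

#include <cassert>
#include <cstddef>
#include <string>

// Hamming distance between two equal-length strings:
// counts the positions at which the characters differ.
std::size_t hammingDistance(const std::string& a, const std::string& b) {
    assert(a.size() == b.size());
    std::size_t dist = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) ++dist;   // mismatch at position i
    }
    return dist;
}

// Example from above: hammingDistance("ACGGCTAGC", "CCGGTCAGC") returns 3.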

In a set of sequences of various lengths, this problem requires the aligning of motifs across multiple sequences in order to find the true hamming distance of a motif. As shown in Figure 1, the motif search problem is expanded when a motif is needed from a more real-world biological database.

Figure 1. Consensus score of a motif

Due to the increased number of locations the common motif could be in, one must look at every possible motif of length L, within the alphabet Σ, in the dataset. With a possible solution list of 4^L, the motif search problem is considered to be NP-complete due to its exponentially increasing search space with respect to motif length. The planted motif search problem [1] adds an additional variable where a maximum number of possible mutations is specified. To describe this problem, we can modify our former definition of the motif searching problem to include this new parameter as follows:

Given a set of DNA sequences (s1, s2, s3, …, sn) with sequence lengths (m1, m2, m3, …, mn), each from an alphabet of Σ, and a motif length of L; Find motif string x such that |x| = L and the hamming distance of x with any sequence is not more than d while the total hamming distance of x over all the sequences is minimized across the set of all possible motif strings. (2)

To better enforce the distance d across a database, the planted motif search problem includes an additional quorum variable that allows for a small leeway in the number of sequences requiring the distance d. The quorum, q, states the percentage of sequences within the database that must contain the motif within a distance of at most d. Adding this to our motif search problem definition gives us the following updated definition:

Given a set of DNA sequences (s1, s2, s3, …, sn) with sequence lengths (m1, m2, m3, …, mn), each from an alphabet of Σ, and a motif length of L; Find motif string x such that |x| = L and q percent of sequences in the database have at most a hamming distance of d. (3)

This final problem definition is usually abbreviated as the (L, d)-q problem, where L corresponds to the motif length, d corresponds to the maximum allowed distance, and q corresponds to the percentage of sequences allowed to have a maximum distance of d. The "planted" in the planted motif search problem refers to the introduction of a randomly generated motif within a randomly generated DNA database of n sequences with length m. The generated motif is mutated in d places before being planted in each sequence in the dataset. For the purpose of this thesis, the sequence database will not be mutated or have sequences inserted artificially. All database searches will be performed on naturally occurring data within real-world datasets and their smaller subsets.

In this thesis, we add one final parameter to this motif search problem. Since we are trying to find top-ranking motifs, we add the variable k to signify the number of top-ranking motifs we are requesting. Therefore, at the end of a motif search, we expect the top-k scoring motifs within a database, where a higher rank is given to the motifs with lower total hamming distance. This is an important distinction, because many problem definitions only request the single best scoring motif in the regular motif search or the best planted motif search problem.

Solution Approaches

This section will explore related literature in the area of motif searching and some of the various ways in which the solution to the motif search has been showcased in the last three decades. These approaches can be categorized into statistical (approximation) approaches [1-10] and combinatorial (exact) algorithms [11-22]. Statistical or approximate approaches often return the best consensus-scored motifs in the database using heuristically derived methods. Despite their speed in execution time, one cannot guarantee that the true best motif will be found on every run. Gibbs Motif Sampler [2] uses Gibbs Sampling to select motifs. The key feature relies on the probabilities of unobserved positions being inferred from the application of the Bayesian theorem on observed sequence data. This iterative sampling method can get stuck in local optima, but the authors have included methods of phase shifting to reduce this risk. However, this algorithm gives different results on different runs [1]. A more advanced version of the Gibbs sampling method can be found in the AlignAce program [3]. Based on the Expectation-Maximization model, MEME [4] is an approximate approach that repeatedly performs expectation and maximization steps until it converges to locally optimal motifs, using a sample-driven method to find starting points of convergence. This also gives the benefit of increasing the chance of globally optimal motifs being found early on. MEME's greedy selection of starting points can lead to lower runtimes, but the statistical approach cannot guarantee that the truly optimal motif will be found. Using a Nested Sampling inference strategy, NestedMICA [5] is an approximate scalable pattern-discovery system for finding motifs in biological databases.

NestedMICA has been designed to scale to large sets of data on multiprocessor machines and clusters. CONSENSUS [6] is another well-known approximation approach, developed to perform motif searches on unaligned DNA sequences. Using frequency matrices to represent the possibility of nucleotide occurrence, CONSENSUS creates statistical profiles to determine the probability of common motifs within a dataset. For each possible motif, a probability profile matrix is created. Each of these matrices is combined with L-mers in successive sequences to form new matrices. The lowest probability matrix in each combination procedure is saved to describe the consensus pattern at the end of the algorithm's iterations. More statistical approaches can be found in [7-10]. Combinatorial approaches are able to find the truly optimal motif, but the number of possible solutions they must consider is very large. Some early pattern-driven approaches shown in [11-12] were able to process all candidate patterns to find the best motif, but the time and space complexities of these approaches are extremely high. Sample-driven approaches use the database itself to consider possible solution motifs [4]. By only considering motifs sampled from the sequences themselves, the search space is reduced by a significant amount. However, this strategy still runs a risk of missing the true best motif due to the algorithm not considering solution motifs outside of the sequence parameters. Multiprofiler [13] uses multi-positional profiles to limit the number of possible patterns in this extended sample-driven approach.

A group of branch-and-bound pattern-driven approaches for solving the planted motif search problem uses suffix or mismatch trees to return all possible candidate motifs without providing the ranking of each possible motif [14-18]. All these approaches utilize the (L, d)-q parameters where L stands for motif length, d stands for maximum allowed mutations and q represents the quorum variable. SPELLER [14] generates the motif models by increasing length and by simulating the traversal of the lexicographic tree of all possible objects over the search space. This algorithm has a space complexity of O(n). MITRA [17] uses a data structure known as a mismatch tree to split the search space of all possible patterns into disjoint subspaces that start with a given prefix.

By categorizing subspaces as weak based on the number of neighborhoods generated, MITRA is able to reduce its memory usage (a major disadvantage for suffix tree-based methods).

A web accessible approach called WEEDER-Web[15] uses an accelerated exhaustive search where pre-processed input sequences are organized in a suffix tree indexing structure. This approximation approach allows a user to automatically search for varying length motifs with predefined distance metrics for each specified length. Multiple scans can be requested in the program at the cost of lower quorum values.

RISOTTO [16] is a more modern suffix tree-based approach, developed from an earlier implementation (called RISO [23-24]), where maximal extensibility information is stored to prevent the expansion of non-solution motifs. This approach returns every possible solution motif without ranking. Motif [18] is an exhaustive suffix tree algorithm that has been implemented to work on ChIP-enriched sequences. This approach also attempts to overcome noise generated by the sequences as a possible source of error.

A family of sorting and enumeration-based approaches under the prefix PMS is showcased in works [19-22]. These works are focused on the planted motif search problem using all the (L, d)-q parameters for all search algorithms.

PMS8, for instance, uses pattern-driven and sample-driven methods to prune the d-neighbor search spaces. To improve speed, the authors compress the L-mers in their pruning matrix using 16-bit integers. The hamming distance operations are performed using logical operators, and a cache locality scheme is used for faster access to their pruning matrix. This approach also includes a parallel implementation with a master-slave load balancing and control scheme using the OpenMPI framework in C++.

PROPOSED APPROACH: DMF (DICTIONARY MOTIF FINDER)

Background

In this thesis, we explore two different motif definitions. As stated in the previous section, we will be utilizing the definitions Eq. (1) and Eq. (3) as our search parameters. We will also be searching for the top-k scoring motifs, so a ranked and filtered solution list must be returned. The strength of a motif is determined by the scoring function of total hamming distance shown in Eq. (4) [25], where M is the candidate motif, p_i is the i-th sequence, and δ(p_i, M) is the minimum hamming distance of motif M over all positions in sequence p_i.

∑_{i=1}^{t} δ(p_i, M)    (4)

When summed over the entire database, the total hamming distance correlates to the consensus score with which each motif should be ranked. This relation is shown in Eq. (5), where (L × t) is the maximum possible consensus score.

total distance = (L × t) − consensus score    (5)
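To make Eq. (4) and Eq. (5) concrete, the following C++ sketch (our own illustration rather than the exact DMF implementation) computes δ(p_i, M) as the minimum hamming distance of M over all L-length windows of a sequence and sums it over all t sequences:

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Minimum hamming distance of motif M over all windows of sequence seq (the delta term in Eq. 4).
int minDistanceInSequence(const std::string& seq, const std::string& M) {
    const std::size_t L = M.size();
    int best = static_cast<int>(L);                  // distance can never exceed L
    for (std::size_t pos = 0; pos + L <= seq.size(); ++pos) {
        int d = 0;
        for (std::size_t j = 0; j < L && d < best; ++j)
            if (seq[pos + j] != M[j]) ++d;           // count mismatches in this window
        best = std::min(best, d);
    }
    return best;
}

// Total distance of Eq. (4): the sum of per-sequence minimum distances.
int totalDistance(const std::vector<std::string>& dna, const std::string& M) {
    int total = 0;
    for (const std::string& seq : dna) total += minDistanceInSequence(seq, M);
    return total;
}

// Eq. (5): for t sequences, consensus score = L * t - totalDistance(dna, M).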

The motif search problem can be brute forced by looking through all possible starting positions within a list of sequences. Algorithm 1 shows the brute force motif search on a database of sequences.

Algorithm 1: BRUTEFORCEMOTIFSEARCH (DNA, t, n, L)
1   bestScore ← 0
2   for each (s1, …, st) from (1, …, 1) to (n − l + 1, …, n − l + 1)
3       if Score (s, DNA) > bestScore
4           bestScore ← Score (s, DNA)
5           bestMotif ← (s1, s2, …, st)
6   return bestMotif

At each index there are (n − l + 1) choices for starting the search. At each of those points there are (n − l + 1) choices for the next round of the algorithm, and so on. For a database of t sequences, there are (n − l + 1)^t positions within which the motif search algorithm can be run. For each one of these iterations, the scoring function takes O(l) operations. Therefore, the overall complexity of the algorithm is O(l · n^t). The number of possible motifs for a given length L is based on the alphabet that the patterns are encoded in. In our case of DNA motifs, |Σ| = 4. This means that for any given L the solution space can include 4^L motifs. For L = 3, these can be represented as:

AAA AAC AAG AAT ACA … TGT TTA TTC TTG TTT

In an alphabet of DNA nucleotides, Σ = {A, C, G, T}, we can represent possible patterns with numerical representations of Σ = {1, 2, 3, 4}. The new representation can be packed within bytes more efficiently for better computing performance. For L = 3, the solution motif list can be represented as:

(1,1,1) (1,1,2) (1,1,3) (1,1,4) (1,2,1) … (4,3,4) (4,4,1) (4,4,2) (4,4,3) (4,4,4)
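As a side note, the remark above about packing the numerical representation into bytes can be illustrated with a small sketch (ours, not DMF's actual encoding) that maps each nucleotide to the values 0-3 rather than the 1-4 used above, so that a motif of length up to 16 fits in a single 32-bit word:

#include <cstdint>
#include <string>

// Encode an L-mer (L <= 16) as a 32-bit integer using 2 bits per nucleotide:
// A -> 0, C -> 1, G -> 2, T -> 3.
std::uint32_t packLmer(const std::string& lmer) {
    std::uint32_t code = 0;
    for (char c : lmer) {
        std::uint32_t symbol = 0;
        switch (c) {
            case 'A': symbol = 0; break;
            case 'C': symbol = 1; break;
            case 'G': symbol = 2; break;
            case 'T': symbol = 3; break;
        }
        code = (code << 2) | symbol;   // append two bits for this nucleotide
    }
    return code;
}

// Example: packLmer("ACG") yields binary 000110, i.e. 6.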

Each L-mer, a, can be shown as an array of length L where each array slot refers to the character at that position:

a = (a1, a2, a3, …, aL)

To iterate through this list, we use the numerical representation of the solution list as our iterator. Given a motif within this new alphabet, we can move to the next motif in the list using an algorithm (Algorithm 2) we will refer to as NEXTLEAF (a, L, k).

Algorithm 2: NEXTLEAF (a, L, k)
1   for i ← L to 1
2       if ai < k
3           ai ← ai + 1
4           return a
5       ai ← 1
6   return a

In this algorithm, the array containing the current motif will wrap around to the next available solution, with the iterations going from right to left. This behavior is similar to a base-4 counting system where the digit values wrap around as we reach higher numbers.
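For illustration only (this is not taken from the DMF source), a direct C++ translation of NEXTLEAF, treating the array a as a base-k counter, could look like the following:

#include <vector>

// Advance the length-L array a (values 1..k) to the next L-mer, wrapping
// around to (1, ..., 1) after the last one, exactly as in Algorithm 2.
void nextLeaf(std::vector<int>& a, int L, int k) {
    for (int i = L - 1; i >= 0; --i) {   // rightmost position first (0-based indexing here)
        if (a[i] < k) {
            ++a[i];                       // increment this position and stop
            return;
        }
        a[i] = 1;                         // carry over to the position on the left
    }
}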

With the NEXTLEAF function defined in our system, we can now generate a list of the solution motifs in ascending order. The ALLLEAVES (L, k) function, shown in Algorithm 3, will generate this list for the user. Even though it might seem like the loop will go on indefinitely, the final element will always wrap around back to the start of the solution list, so an infinite loop is not present in this system.

Algorithm 3: ALLLEAVES (L, k)
1   a ← (1, ..., 1)
2   while forever
3       output a
4       a ← NEXTLEAF (a, L, k)
5       if a = (1, 1, …, 1)
6           return

With the introduction of the NEXTLEAF method, we have a way of simplifying the brute force motif search problem slightly by finding the median string from the 4^L choices in the possible solution list, with a time complexity of O(4^L · n · t) and a space complexity of O(n · t), where n is the average length of the sequences. The pseudocode for the new brute force method is shown in Algorithm 4.

Algorithm 4: BRUTEFORCEMOTIFSEARCH (DNA, t, n, L)
1   s ← (1, ..., 1)
2   bestScore ← Score (s, DNA)
3   while forever
4       s ← NEXTLEAF (s, t, n − l + 1)
5       if Score (s, DNA) > bestScore
6           bestScore ← Score (s, DNA)
7           bestMotif ← (s1, s2, …, st)
8       if s = (1, 1, …, 1)
9           return bestMotif

The possible motifs in a solution list can be represented in a tree to better understand a key property within these lists. The entire list of L-mers can be shown in a tree with ∑_{i=1}^{L} k^i nodes (excluding the root node), where k = 4 (= |Σ|) represents the number of child nodes that any non-leaf node in the system will contain. At the leaf level, each node corresponds to a possible motif in the tree, and the depth of the tree is represented by the length of the motif. A visual representation of this concept is shown in Figure 2.

Figure 2. Tree representation of a motif search

With this tree representation of the motif solution list, we can update our NEXTLEAF algorithm to be able to traverse to the parent nodes as well. For this system, we use preorder traversal as our method of moving from node to node. This new NEXTVERTEX (a, i, L, k) algorithm takes a new parameter i, where i represents the level of the tree we are currently in. The updated algorithm is shown in Algorithm 5.

Algorithm 5: NEXTVERTEX (a, i, L, k)
1   if i < L
2       ai+1 ← 1
3       return (a, i + 1)
4   else
5       for j ← L to 1
6           if aj < k
7               aj ← aj + 1
8               return (a, j)
9   return (a, 0)

The brute force motif search algorithm can be simplified to utilize the tree-based representation for traversing motifs. Even though this approach iterates through every possible solution motif, it will be a good starting point for showcasing the benefits of skipping hopeless solutions later on. The simple motif search algorithm is shown in Algorithm 6.

Algorithm 6: SIMPLEMOTIFSEARCH (DNA, t, n, l)
1   s ← (1, . . . , 1)
2   bestScore ← 0
3   i ← 1
4   while i > 0
5       if i < t
6           (s, i) ← NEXTVERTEX (s, i, t, n − l + 1)
7       else
8           if Score(s, DNA) > bestScore
9               bestScore ← Score(s, DNA)
10              bestMotif ← (s1, s2, . . . , st)
11          (s, i) ← NEXTVERTEX (s, i, t, n − l + 1)
12  return bestMotif

A key distinction to make with this representation of the motif solution tree is that only the path that we can follow has been represented here. Since we can reach every node in this tree, the entire motif solution list exists in a storage-less form. Traversing the list in order can be useful in trying to find a way to improve our brute force search algorithm. Since our traversal path reaches all the nodes within a level before moving up the tree, it is beneficial for us to find a way to filter our computation by not exploring a specific subtree. If we were to rule out a node at the lower levels, all nodes within that subtree are skipped, leading to improved overall processing time. This approach is called branch-and-bound.

To perform a bypass of a useless subtree, the BYPASS (a, i, L, k) algorithm is used. This algorithm lets us jump levels, if needed, at any point in the tree traversal. The pseudocode for this approach is shown in Algorithm 7.

Algorithm 7: BYPASS (a, i, L, k)
1   for j ← i to 1
2       if aj < k
3           aj ← aj + 1
4           return (a, j)
5   return (a, 0)

Some possible bypass paths in the search tree are shown in Figure 3.

Figure 3. Bypass paths on the L-mer tree
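As a minimal C++ sketch of the two traversal helpers (Algorithms 5 and 7), written here purely for illustration with the current prefix stored in a[1..i] and symbol values in 1..k, the pair of routines could be implemented as:

#include <vector>

// Preorder step on the virtual L-mer tree (Algorithm 5).
// a holds the current prefix in a[1..i] (index 0 unused); the return value is
// the new level, and 0 means the traversal is finished.
int nextVertex(std::vector<int>& a, int i, int L, int k) {
    if (i < L) {                      // descend: extend the prefix with the first symbol
        a[i + 1] = 1;
        return i + 1;
    }
    for (int j = L; j >= 1; --j) {    // at a leaf: advance like a base-k counter
        if (a[j] < k) {
            ++a[j];
            return j;
        }
    }
    return 0;
}

// Skip the whole subtree rooted at the current prefix (Algorithm 7).
// (The L parameter of Algorithm 7 is not needed here and is omitted.)
int bypass(std::vector<int>& a, int i, int k) {
    for (int j = i; j >= 1; --j) {
        if (a[j] < k) {
            ++a[j];
            return j;
        }
    }
    return 0;
}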

The bypass procedure relies on a heuristic to filter subtrees and prevent unnecessary traversal. The basis of this filtering procedure relies on the same mechanism used to create the multilevel tree. In this representation of the tree, we can see that the actual motifs within the internal nodes are built up one letter at a time. By employing a similar understanding, we can see that our hamming distance metric can also be summed as we go deeper into the tree. For example, if a certain internal node had hamming distance x, every subtree underneath that node would also have a hamming distance of at least x. Since we are adding characters to the motif, the hamming distance of any descendant is guaranteed to be equal to or greater than that of the parent node. If we know that a certain subtree will not lead to a better hamming distance value, we can save time and computation resources by not exploring this node's children. The bypass function is very useful in the levels of the tree closer to the root due to the number of possible motif solutions that we can skip. This skipping mechanism is the basis of the branch-and-bound approach and is employed in the implementation of this algorithm in other areas of study. One example use of this algorithm is in solving the NP-hard knapsack problem. By integrating the hamming distance function within our traversal, we can now rewrite our motif search algorithm to include the bypass function. The pseudocode for this basic algorithm is shown in Algorithm 8.

Algorithm 8: BRANCHANDBOUNDMOTIFSEARCH (DNA, t, n, l)
1   s ← (1, . . . , 1)
2   bestScore ← 0
3   i ← 1
4   while i > 0
5       if i < t
6           optimisticScore ← Score(s, i, DNA) + (t − i) · l
7           if optimisticScore < bestScore
8               (s, i) ← BYPASS (s, i, t, n − l + 1)
9           else
10              (s, i) ← NEXTVERTEX (s, i, t, n − l + 1)
11      else
12          if Score(s, DNA) > bestScore
13              bestScore ← Score(s, DNA)
14              bestMotif ← (s1, s2, . . . , st)
15          (s, i) ← NEXTVERTEX (s, i, t, n − l + 1)
16  return bestMotif

However, we must keep in mind that the worst-case time complexity of our algorithm is not improved with this approach. In a problem instance where no skipping can take place, we will still traverse into every leaf node with very little skipping by the algorithm. With the large solution space of 4^L for our DNA alphabet, this approach does not yield a practical time bound for execution on longer motif strings.

The planted motif definition requires a minor addition to the basic motif search algorithm. Since we are looking for a motif that satisfies both d and q parameters as well, we must include these definitions within our algorithm. The updated version of this is shown in Algorithm 9.

Algorithm 9: BRANCHANDBOUNDPLANTEDMOTIFSEARCH (DNA, t, n, l, d, q)
1   s ← (1, . . . , 1)
2   bestScore ← 0
3   i ← 1
4   while i > 0
5       if i < t
6           optimisticScore ← Score(s, i, DNA) + (t − i) · l
7           if optimisticScore < bestScore
8               (s, i) ← BYPASS (s, i, t, n − l + 1)
9           else
10              (s, i) ← NEXTVERTEX (s, i, t, n − l + 1)
11      else
12          if Score(s, DNA, d) > bestScore
13              if Quorum(s, DNA, d) > q
14                  bestScore ← Score(s, DNA)
15                  bestMotif ← (s1, s2, . . . , st)
16          (s, i) ← NEXTVERTEX (s, i, t, n − l + 1)
17  return bestMotif

A key distinction to make with the definition of the planted motif search algorithm is that the mutation and quorum values only apply to the leaf-level nodes. Using the bypass method within the internal nodes can lead to incorrect results. The d parameter stands for the maximum number of allowed mutations on the final solution motifs. If we were to employ a maximum distance metric on each internal level of the tree, the maximum distance would accumulate with every additional level in the tree. Therefore, for a tree of depth 8, with d = 2, we could possibly see a maximum distance of 16 by the time the algorithm traverses to the bottom of the tree. This is not the intended result when submitting a motif search request.

Hash-based Heuristic Approach (DMF)

To better utilize the branch-and-bound approach to motif finding, we use a couple of hashing-based heuristics to improve the bypass ratio. Since the branch-and-bound algorithms start at the bottom level of the tree, finding good candidate motifs to act as the lowest bound distance limit (best hamming distance) before any traversal occurs will speed up the algorithm. The first heuristic looks at the number of motif occurrences in the database before determining the hypothetical best motif. A sample-driven method is utilized to find a solution motif that occurs the most in the input database. Since the search problem is ideally trying to find the pattern that occurs most frequently in the database, it follows general reasoning that this pattern is either the solution or close to the intended solution. Algorithm 10 showcases the mechanism used to locate the motif that occurs most frequently. Since this approach is implemented in C++, we use the unordered_map data structure present in the language. This is analogous to the dictionary in other languages and will occasionally be referred to as such in this thesis. Since my approach relies on hashing-based heuristics to provide efficient computation, the resulting tool showcased here is named DMF (Dictionary Motif Finder).

Algorithm 10: FINDHYPOTHETICALPATTERN (DNA, t, L)
// building an unordered_map UM with
// One Occurrence Per Sequence (OOPS) model
1   for each sequence_i
2       scan L-mer tiles in sequence_i with OOPS
3       if (L-mer found in UM)
4           UM[L-mer]++        //increment count
5       else
6           UM[L-mer] ← 1      //new pattern with count 1
//determine the hypothetical pattern (best or best-k)
7   hypothetical consensus pattern(s) ← best (or best-k) counted pattern(s) from UM
8   compute total_distance(s) of the hypothetical pattern(s)
9   return the best (or best-k) hypothetical pattern(s) with score(s)
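A simplified C++ sketch of this dictionary-building step (heuristic 1) is shown below; it is an illustration under the assumption that the OOPS model counts each distinct L-mer at most once per sequence, and the helper name is ours:

#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Build the frequency dictionary UM used by heuristic 1: for every sequence,
// each distinct L-mer is counted at most once (One Occurrence Per Sequence).
std::unordered_map<std::string, int>
buildLmerDictionary(const std::vector<std::string>& dna, std::size_t L) {
    std::unordered_map<std::string, int> UM;
    for (const std::string& seq : dna) {
        std::unordered_set<std::string> seen;             // enforces OOPS per sequence
        for (std::size_t pos = 0; pos + L <= seq.size(); ++pos) {
            std::string lmer = seq.substr(pos, L);
            if (seen.insert(lmer).second) ++UM[lmer];     // first occurrence in this sequence
        }
    }
    return UM;
}

// The hypothetical consensus pattern(s) are then the most frequently counted
// entries of UM, and their true total distances are computed with Eq. (4).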

The first heuristic is shown in line 7 of Algorithm 10. The time complexity of this algorithm is O(p + L · n · t), where n is the average sequence length and p refers to the computation of the total hamming distance for the hypothetical best motif. Each unordered_map operation takes O(1), so it is not considered in the overall time complexity. Even though this is the simplest motif search definition, we can show this search in terms of the planted motif search problem definition where d = 0 and q = 0. Our second heuristic utilizes an approximation method to determine which subtrees can be skipped in the algorithm based on the unordered_map created in Algorithm 10. The concept relies on a hypothetical optimum total hamming distance measure that is directly tied to the frequency of occurrence of a pattern in a database. Algorithm 11 shows the hypothetical score computation for a given L-mer.

Algorithm 11: OPTIMISTICDISTANCECOMPUTE (UM, L-mer, t)
//Assume: UM is unordered_map
1   search UM for L-mer
2   if found
3       optimistic_dist = (t – count)   //assume L-mer has 1 mismatch in each remaining seq.
4   else
5       optimistic_dist = t             //assume L-mer has 1 mismatch in each seq.
6   return optimistic_dist
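A corresponding C++ sketch of this optimistic-distance lookup (heuristic 2), reusing the dictionary built in the previous sketch, might look like this:

#include <string>
#include <unordered_map>

// Optimistic (best-case) total distance for an L-mer: every sequence that does
// not contain the L-mer exactly is assumed to contribute exactly one mismatch.
int optimisticDistance(const std::unordered_map<std::string, int>& UM,
                       const std::string& lmer, int t) {
    auto it = UM.find(lmer);                        // O(1) expected lookup
    int count = (it != UM.end()) ? it->second : 0;  // exact occurrences (OOPS count)
    return t - count;                               // matches both branches of Algorithm 11
}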

Since we have a count of how many times a motif occurs for each sequence in a database, we can assume every sequence that the motif does not occur in has at minimum a hamming distance of 1. When we subtract the number of sequences containing the motif from the total number of sequences, we get the hypothetical best distance value that we can use as a motif filter in the branch-and-bound approach. Since this heuristic assumes the best possible case for a motif in terms of total hamming distance (real-world databases usually have a distance of more than 1), we will not miss any possible solution motifs in our DMF algorithm. Finally, some additional optimizations to the total hamming distance algorithm have also been included in the unordered_map creation phase. For each discovered motif in the database, a log of which sequence it came from is also recorded. This prevents us from having to recheck a sequence for a motif in our total hamming distance computation. The updated hypothetical pattern creation is shown in Algorithm 12.

Algorithm 12: FINDHYPOTHETICALPATTERN (DNA, t, L)
// building an unordered_map UM with
// One Occurrence Per Sequence (OOPS) model
1   for each sequence_i
2       scan L-mer tiles in sequence_i with OOPS
3       if (L-mer found in UM)
4           UM[L-mer]++        //increment count
5       else
6           UM[L-mer] ← 1      //new pattern with count 1
7       location_map[i] ← true
//determine the hypothetical pattern (best or best-k)
8   hypothetical consensus pattern(s) ← best (or best-k) counted pattern(s) from UM
9   compute total_distance(s) of the hypothetical pattern(s)
10  return the best (or best-k) hypothetical pattern(s) with score(s)

Using these two hashing-based heuristics, we can now rewrite our original branch-and-bound algorithm with the intention of achieving high bypassing ratios. The complete DMF algorithm is shown in Algorithm 13.

Algorithm 13: DICTIONARY MOTIF FINDER (DNA, t, L)
//Assume: Total_dist_compute (v, DNA) returns total hamming distance between pattern v and all DNA sequences with OOPS
1   best_distance/pattern ← FINDHYPOTHETICALPATTERN (DNA, t, L)
2   a ← (1, 1, …, 1)    //starts from the 1st leaf vertex, AA...AA
3   i ← L               //starts from the leaf level
4   while (i > 0)
5       if (i < L)      //non-leaf vertex
6           prefix ← nucleotide symbols corresponding to (a1, a2, ..., ai)
7           optimistic_dist ← Total_dist_compute (prefix, DNA)
8           if (optimistic_dist ≥ best_distance)
9               (a, i) ← BYPASS (a, i, L, 4)    //skips subtree
10          else
11              (a, i) ← NEXTVERTEX (a, i, L, 4)
12      else            //leaf vertex
13          word ← nucleotide symbols corresponding to (a1, a2, ..., aL)
14          optimistic_dist ← OPTIMISTICDISTANCECOMPUTE (UM, word, t)
15          if (optimistic_dist ≥ best_distance)
16              skip this L-mer
17          else        //need to compute total hamming distance
18              if (x ← TOTALDISTCOMPUTE (word, DNA) < best_distance)
19                  best_distance ← x
20                  best_pattern ← word
21          (a, i) ← NEXTVERTEX (a, i, L, 4)
22  return best_distance/pattern

In Algorithm 13, line 1 finds the initial hypothetical best pattern with its optimistic best distance and builds the unordered map (dictionary) containing the frequency of pattern occurrences. Within the internal nodes of the tree, lines 8 and 9 refer to the bypassing of nodes based on the hypothetical pattern found earlier. If a solution motif is found to be within the same distance range as the optimistic approximation, the algorithm will traverse into the subtree to confirm whether it is better than the current best motif. Line 14 refers to the optimistic score measure using the frequency of motif occurrences. This leads to an O(1) call to the dictionary structure to determine if the algorithm should attempt a full hamming distance computation. Lines 15 and 16 refer to the skipping procedure that results from the optimistic distance calculation. DMF can expand the number of requested motifs in an efficient manner. If the user requires multiple consensus patterns, the algorithm uses priority queues to determine the best possible motifs, with each solution motif ranked with respect to the other solution motifs. This problem definition is referred to as the best-k motif search. This is a key point when compared to other motif searching solutions, because the algorithm finds the truly global best-k motifs without a loss in accuracy. Since our heuristics only affect the number of skipped unnecessary computations, and not the scoring or motif generation procedure itself, DMF is the best possible solution for a combinatorial approach that returns the best-k motifs when searching for motifs in a database. The implementation details and the performance of DMF are shown in the next section.
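To illustrate how such a ranked best-k list can be maintained, the short C++ sketch below keeps the k motifs with the smallest total distance in a std::priority_queue (a max-heap whose top is the current worst of the best-k); it is our own illustration and does not reflect DMF's exact data layout:

#include <climits>
#include <cstddef>
#include <queue>
#include <string>
#include <utility>

// Keep the k motifs with the smallest total distance seen so far.
// The heap's top element is the current worst of the best-k, which serves as
// the cut-off when deciding whether a new candidate should be kept.
struct TopKMotifs {
    std::size_t k;
    std::priority_queue<std::pair<int, std::string>> heap;  // <total distance, motif>

    explicit TopKMotifs(std::size_t k_) : k(k_) {}

    int worstDistance() const {
        return heap.size() < k ? INT_MAX : heap.top().first;
    }

    void offer(int distance, const std::string& motif) {
        if (heap.size() < k) {
            heap.push({distance, motif});
        } else if (distance < heap.top().first) {
            heap.pop();                       // evict the current worst motif
            heap.push({distance, motif});
        }
    }
};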

PERFORMANCE OF DMF

To measure the performance of the proposed solution, DMF was implemented in C++ using the unordered_map and priority queue libraries. These modules are well-tested and optimized data structures within the language and suit our needs very well. The performance tests were conducted on a system that contains an Intel Core i7-4790 processor (4 cores, 3.6 GHz) with 16GB of system memory. Ubuntu 18.04 LTS was used as the operating system. An Nvidia Quadro K1200 GPU sits as the graphical acceleration device but is only used as a display output in this phase of performance testing. The HMP (Human Microbiome Project) database [49] was used as the input data within the system. In certain instances of testing, smaller dataset portions are used to determine whether database size has an effect on the execution time of the DMF algorithm. In order to create a consistent testing scheme, the original dataset is split into partitions of 2^n (1 ≤ n ≤ 8), where the partitions from each split are chosen at random. For the test cases in this section, we were searching for motifs of length L = 6 and the top-scoring best-1, 5 and 10 motif strings. Figure 4 shows the performance of the original branch-and-bound algorithm with respect to finding the top-k best motifs. For this test, all the database partitions are used as input data. We see that increasing the k value causes an increase in the execution time of the algorithm. This can be attributed to the additional number of solution motifs requested requiring more comparison operations during the execution of the algorithm. Since k is larger than 1, a priority queue that holds ranking information also needs additional sorting operations every time a possible solution motif is found. Since there is a difference between the best and worst consensus patterns observed during the process of finding the best hypothetical optimal pattern, our heuristic uses 2k comparisons to have a better chance of finding the solution motifs. This corresponds to finding the 20 best optimum patterns from the dictionary structure in heuristic 1 when the search problem asks for the best-10. Using this doubled searching strategy allows us to increase our chances of finding the true best motifs present in the database.

Figure 4. Branch and bound performance vs. k and dataset size

Figure 5 shows a comparison of skipped L-mer ratios between DMF and the original branch-and-bound algorithm (BaB). This experiment was run with L = 6 and three different k values on 8 different-sized databases. The performance gain of the DMF algorithm is also shown in this graph, as a higher bypassing ratio leads to a better execution time overall. The values shown in the graph are averaged values, and DMF achieves a very high performance gain in all the tested cases. On average, DMF has a 97.5% bypassing ratio, which corresponds to, on average, 97.5% of the 4^6 possible solution motifs skipping the total distance processing. The regular branch-and-bound algorithm only shows, on average, about a 27.5% bypassing ratio. As in the case of the original branch-and-bound algorithm, increasing k leads to a reduced bypassing ratio overall. This experiment was also used to verify the accuracy of DMF with respect to the original branch-and-bound algorithm. Execution times for Figure 5 are shown in Table 1.

Figure 5. DMF vs. Branch-and-Bound performance

Table 1. DMF vs. Branch-and-Bound Execution Time

Execution Time (sec.)
Dataset      BaB best-1   DMF best-1   BaB best-5   DMF best-5   BaB best-10   DMF best-10
DS           3773.5       154.27       3973.9       210.38       4040.7        311.43
DS (1/2)     1951.1       129.26       2028.7       151.88       2024.6        189.22
DS (1/4)     1002.5       63.44        1028.3       111.61       1028.4        138.65
DS (1/8)     495.60       50.17        511.32       72.73        511.87        99.85
DS (1/16)    203.21       3.99         212.77       4.44         234.34        6.82
DS (1/32)    110.08       1.93         118.99       3.37         125.03        4.27

DMF has a speedup, on average, of 28x, 21x, and 16x for the best-1, best-5, and best-10 cases respectively. Overall, DMF has an average speedup of 22.48x over the original branch-and-bound algorithm. A large performance gain is found at the DS (1/16) and DS (1/32) partitions, which can be attributed to the random partition selection process. In these two cases, DMF has an average speedup of 42.5x over the original branch-and-bound algorithm. One of the heuristics helping DMF achieve this speedup is the process of finding the hypothetical best motifs at the start of the algorithm. To see the benefit of this heuristic, Table 2 shows the strength of the hypothetical motif patterns when compared to the final solution list. L-mer length = 6 and randomly selected partitions are used in this experiment. The table displays the position of each hypothetically best-ranked motif with respect to the final ranking. Most of the hypothetical best-k motifs are present in the same order, confirming that our hypothesis of approximating the solution list based on the frequency of motifs is supported by the experimental results.

Table 2. Strength of Hypothetical Consensus Patterns

Final Rankings of Hypothetical Consensus Patterns
Dataset          best-5            best-10
DS (1/16) - 1    1, 2, 3, 4, 5     1, 2, 3, 4, 5, 6, 7, 8, 9, 10
DS (1/16) - 2    1, 2, 3, 5, 7     1, 2, 3, 4, 5, 7, 8, 10, 11, 12
DS (1/32) - 1    1, 2, 3, 4, 5     1, 2, 3, 4, 5, 6, 7, 8, 9, 10
DS (1/32) - 2    1, 2, 3, 4, 5     1, 2, 3, 4, 5, 6, 7, 8, 9, 10
DS (1/64) - 1    1, 2, 6, 7, 11    1, 2, 3, 4, 6, 7, 8, 9, 10, 11
DS (1/64) - 2    1, 2, 3, 4, 5     1, 2, 3, 4, 5, 6, 7, 8, 9, 10
DS (1/128) - 1   1, 2, 3, 4, 5     1, 2, 3, 4, 5, 6, 7, 8, 9, 10
DS (1/128) - 2   1, 2, 3, 4, 7     1, 2, 3, 4, 5, 6, 7, 8, 9, 10

In fact, we find that our approximation heuristic has better accuracy than MEME [4], a popularly used statistical approach for finding motifs. Table 3 showcases an accuracy comparison between MEME and our DMF approach. The table shows the number of solution motifs produced by DMF that are also present in the MEME output after execution. Both tools are run with L-mer length = 6 and search for the best-5 and best-10 motifs. MEME was unable to maintain the same accuracy as DMF, which provides the truly global best solution motifs in each case.

Table 3. Accuracy: MEME vs. DMF

Output size & matches   DS   DS (1/2)   DS (1/4)   DS (1/8)   DS (1/16)   DS (1/32)   DS (1/64)   DS (1/128)
best-5  # matches       2    0          0          1          0           1           0           1
best-10 # matches       2    1          0          1          1           2           1           2

The execution times of both DMF and MEME for finding the best-k (k = 1, 5, 10) motifs of length L = 6 with different sized databases are shown in Figure 6. MEME shows a consistent execution time regardless of the database size, but DMF has better performance on the smaller-sized datasets despite being a combinatorial approach.

Figure 6. Efficiency: DMF vs. MEME

A performance comparison between DMF, SPELLER and WEEDER is shown in Table 4. This comparison uses the planted motif search problem definition, i.e. the (L, d)-q problem, where L = 6 and d = 2 (2 maximum mismatches). Both SISMA SPELLER [D39] and WEEDER [D31] are suffix tree-based approaches that focus on the planted motif search problem definition. Suffix tree-based approaches suffer from poor execution times [27], and processing large DNA datasets can lead to a need for excessive memory and storage space during runtime. As shown in Table 4, SISMA SPELLER was unable to complete the search on our full-size database (12,488 sequences) because it ran out of storage space on our test machine. On average, DMF had a 9.9x and 2.94x speedup over SISMA SPELLER and WEEDER respectively. These tools do not rank the final solution motifs but rather return every qualified motif, so further execution time (not included in the table) is needed to sort the solution lists they return.

Table 4. DMF vs. SPELLER vs. WEEDER Execution Time

Execution Time (sec.)
Dataset      DMF      SISMA SPELLER   WEEDER
DS           183.66   -               675.37
DS (1/2)     129.29   4782.47         212.24
DS (1/4)     86.14    942.72          82.68
DS (1/8)     61.32    210.22          36.22
DS (1/16)    4.40     49.07           17.08
DS (1/32)    3.33     10.43           8.83
DS (1/64)    0.94     2.33            4.02
DS (1/128)   0.39     0.61            2.29

We can see the performance gain of DMF, which uses the two hash-based heuristics of finding the hypothetical best motifs at the start of the algorithm and approximating the hypothetical best hamming distance, while maintaining the accuracy of our combinatorial approach. DMF also shows better performance and accuracy than MEME, a statistical approximation approach, and than SISMA SPELLER and WEEDER, two suffix tree-based planted motif search algorithms.

ACHIEVING HIGH PERFORMANCE

As with other pattern-driven combinatorial motif searching algorithms, DMF suffers when processing larger length motif strings due to the maximum accuracy provided by the algorithm. However, DMF is scalable regarding the amount of computing resources that can be utilized to help with the processing of motifs. The structure of the branch-and-bound algorithm tree allows us to divide the workload into even sections that can be parallelized to allow for a more efficient form of computation. Our total hamming distance computation function is also data parallel, meaning that we can easily split our workload for using additional computing resources. This was the motivation behind developing high performance computing versions of DMF that retain the accuracy of the original algorithm while being able to both process longer length motifs as well as larger databases.

Background

With the introduction of parallel computing architectures like multi- and many-core CPUs and CUDA or OpenCL based GPUs, formerly exhaustive and accurate but time-consuming approaches have been parallelized to take advantage of the additional computing resources. Statistical approaches have also utilized the extra resource availability to further improve their execution times for large motif lengths with fewer approximation heuristics. In the category of the statistical approaches, parallel versions of the MEME [26] algorithm, paraMEME [30], GPU-MEME [31], and mCUDA-MEME [32], are implemented on the XP/S system, a single GPU, and multiple GPUs, respectively.

Using the Intel Paragon XP/S parallel computer, paraMEME [30] creates statistical models of motifs that are based on probability distributions, and the workload is split by assigning a unique subset of initial models to each processor to run in SPMD mode. Each processor compares the best initial models, and the highest scoring model is passed to every processor. The initial model creation and the expectation-maximization operation are parallelized as well. This approach supports MPI and is claimed to be scalable over a large cluster network. GPU-MEME [31] showcases an OpenGL based GPU acceleration approach to accelerate the MEME algorithm. This version of MEME supports both OOPS (One Occurrence Per Sequence) and ZOOPS (Zero or One Occurrence Per Sequence) models and makes use of fast texture memory to store each sequence.

The calculation of score and reduction operations are handled by GPU threads. Single-GPU and two-GPU systems show one and two orders of magnitude speedup over the sequential version of MEME, respectively. mCUDA-MEME [32] is a heterogeneous CUDA GPU version of MEME that uses a hybrid combination of CUDA, OpenMP, and MPI and supports both OOPS and ZOOPS models for motif searching. Master control and worker kernel threads are used within the GPU, while the heterogeneous version can be parallelized only on one workstation. This algorithm is focused on multiple GPU clusters, with symmetric workload distribution being the priority.

Most of the parallel approaches have been developed for the combinatorial approaches under the problem definition of the planted (L, d)-q motif finding. PCVoting [33], an OpenMP accelerated version of CVoting [34], uses a simple divide and merge parallel scheme. In the divide phase, each L-mer from a sequence is assigned to a processor dynamically to generate the set of motifs. The generated candidate motifs are combined into a larger set in the merge phase, and radix sort is used to remove any redundant motifs. This scheme is claimed to have an average efficiency of over 90% due to the independent nature of the original CVoting algorithm. The authors also state that their scalability, on multicore architectures, is linear.

qPMS9 [22] is a parallel exact searching algorithm based on the works of PMS8, where the authors improve the previous approach by modifying the tuple L-mer generation portion of their program to maximize the common neighborhood space on expansion. PHEPPMSprune [35] parallelizes HEPPMSprune [36] by accelerating the candidate motif creation and validation operations. This approach utilizes a hybrid exact pattern motif search, where mechanisms of candidate motif generation from other algorithms like Voting [37] and PMSP [38] and pattern matching from PMSprune [39] have been included. This approach's runtime is greatly determined by the user's requested quorum percentage, so the authors leave that decision up to the end user. The acceleration is done by assigning a set of L-mers to each processor to generate the set of neighboring motifs and then evaluating all candidate motifs in parallel across the processors.

The time complexity of PMSprune is O(t(n − l + 1)^2 (l + p_2d · ∑_{i=1}^{2d−d′+1} C(l, i) · 3^i)), where p_2d is the probability that the hamming distance between two strings is at most 2d. mSPELLER [40] accelerates the suffix tree-based approach SPELLER by parallelizing the spelling operation. Node lists containing both explicit and implicit nodes at each level i are generated. Using a parallelized algorithm, the authors are able to divide this workload across multiple processors using a dynamic load balancing scheme to prevent uneven workloads. This approach does not accelerate the suffix tree creation process. PMS6MC [41] is a multicore accelerated version of PMS6 [42]. It uses an inner-outer level parallelism scheme with Pthread libraries, where the outer level parallelism corresponds to the L-mer block assignment while the inner level parallelism corresponds to the individual steps of the inner loop executed in parallel. Threads are partitioned into thread blocks and L-mers are assigned to these blocks, where each thread is responsible for finding its neighbors. The authors state that modifying thread block sizes and threads per block did not result in better performance due to more stalls for memory access. The BitBased [43] algorithm adopts the following schemes to improve performance. The first is incremental support, which minimizes the problem to a smaller (L, d) problem and incrementally works up to the actual size. In addition, the motif search space is reduced, and a filtering scheme is employed in the candidate motif selection process. Each core receives a 4^L matrix, with scalability shown to be linear across multiple cores. However, the amount of memory required is a drawback, with larger runtimes being attributed to limited system memory. A GPU version of this algorithm is also shown in [44]. Due to the increased performance degradation with multiple memory accesses, repartitioning and reordering of data structures for the shared memory modules of the GPUs has been performed to prevent memory bank conflicts.

A MapReduce version of the PMSP algorithm, named PMSPMR [45], is implemented on a cloud system. In this approach the workload is distributed across p nodes, so each node is only responsible for 1/p of the total workload. The Map phase is used to divide the task and send it to each node in the cluster. The Reduce phase is then used to aggregate the results from each of the nodes back into the main node. The authors of the approach described in [46] use a hybrid combination of MPI and POSIX threads in a 4-node SMP system for solving the planted motif search problem. A sample-driven approach is used for selecting triplet L-mer patterns, while a pattern-driven method is used for computing the d-neighborhoods. Three L-mers that are close-proximity neighbors are grouped together to generate the d-neighbor set using a recursive approach. To find motifs, intersections on the common d-neighbors are computed with bit vectors. The final solution motifs are created by performing a union on the results of the previous step. In this approach, one CPU is reserved as the master node and the remaining are labelled as worker nodes. Each worker node sends all of its solution motifs, at the end of its schedule, to the master node for the final union step. One downside of this method is that some processors end up starving, since the execution of neighbors is sensitive to the volume of neighborhoods being generated, while other processors might also have unbalanced workloads. gSPELLER [40] is the GPU version of the suffix tree-based planted motif finding algorithm SPELLER. To overcome the high memory requirements of the original algorithm, the new approach uses dynamic allocation of sequences within the GPU memory. Filtering is also used to limit the number of candidate motifs entering the GPU so that suffix tree expansion is minimized. Since GPUs have high global memory latency, all accesses to the device global memory are coalesced to make the maximum use of the throughput of the transfer medium. These memory limitations, however, make gSPELLER perform worse than mSPELLER, due to the CPU's access to more free memory. Multiple approaches to accelerating the brute force motif algorithm on parallel architectures can be found in [47].

OpenMP and MPI to divide the workload on a multicore processor. They also showcased a CUDA-based approach that outperformed the CPU version for both few-long and many-short sequences. To prevent the GPU from timing out, they use multiple kernel calls where the motif search space is divided into 8 subspaces and the subspaces themselves are then divided into blocks of size 512. Each GPU thread within a block searches for a motif.

A different approach to load balancing motif searches is shown by the authors of [48]. The target of this approach is heterogeneous systems where workload and computation power are unbalanced across different systems (clusters containing multicore, manycore and GPU devices within the same network). To address this difference, the authors suggest a custom task scheduler that splits workloads based on the target architecture. The entire workload is split into chunks and each chunk is assigned to an architecture based on precalculated ratios for each system. To showcase the benefit of this approach they highlighted the speedup of the system over the brute force version of the motif search problem.

Parallel Dictionary Motif Finder (PDMF)

There exist many different high-performance computing architectures that we can use to accelerate the DMF algorithm. In this thesis, we explore multicore CPUs and GPUs as a means of providing multiple computing cores that can work on the motif search. Three proposed models that run on CPU only, on GPU only, and on a heterogeneous platform (both CPU and GPU), each parallelizing different points of the DMF algorithm, are explored in the next three sections.

PDMFm (MultiCore version)

One of the simplest architectures for parallelizing algorithms is the multicore CPU present in almost every modern computer. Commodity desktop processors usually provide some form of multicore processing with anywhere between 2-4 individual cores that can process data independently of each other. Furthermore, many CPUs also support simultaneous multithreading (SMT), which effectively doubles the number of hardware threads available on a chip. Since all the cores are visible to the operating system, a multicore-focused approach can access system memory over a very fast memory bus, leading to a simple parallel solution that many algorithms can treat as just one procedural call for large computing power requests.

Due to the nature of the motif searching problem we cannot employ a straightforward data parallelism strategy, because a globally optimal motif is required. Splitting our datasets into partitions would not allow us to use a traversal tree that is able to filter motifs based on a branch-and-bound strategy. With DMF, we can look at two possible ways to accelerate the algorithm that are beneficial to our goal. The storage-less, tree traversal-based nature of the algorithm allows us to partition our solution space so that partitions can be worked on independently at the same time. This form of parallelization can also be detrimental to the algorithm's core purpose, as adding partitions might degrade the bypassing effectiveness between partitions. A simpler form of parallelization can be applied to the total hamming distance computation module. Since each total hamming distance calculation needs to be done on every sequence in the dataset over multiple iterations, accelerating the most computation-intensive procedure will lead to better execution times.

For all multicore processing in our C++ DMF program, we use the OpenMP parallel processing library. The library is optimized and tested with performance in mind while remaining simple for the end user. OpenMP uses native Pthread parallel processing primitives in its base layer, so it is well suited to any machine that is capable of running multithreaded instructions.

Within the OpenMP language, a few important directives used in the pseudocode are described here. omp parallel num_threads { … } refers to a block of code that is ready to be run on multiple processing cores. The num_threads clause gives the number of threads that should be created to run the code within the { … }. Each thread has its own thread data and storage space, and threads execute independently of each other. Since all threads are connected to the same global system memory, they are all able to access any valid slot within the program space, allowing for global data processing. To synchronize data accesses between threads, shared variables can be used to both affect and inherit scope from the parent code or module. To prevent race conditions caused by multiple threads accessing the same memory slot, we can use an omp critical { … } directive that serializes access for all code within the { … }.

PDMFm1 – Multicore Model 1

The first multicore model in the program partitions the solution space and allocates computing resources to each individual partition. Figure 7 shows a possible split of the L-mer search tree that uses 8 threads. In each partition, a thread will utilize the

NEXTVERTEX and BYPASS functions to traverse its section of the search tree within the boundaries of the start and end L-mers.
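To make these directives concrete, the following is a minimal, hypothetical C++/OpenMP sketch of the per-partition thread setup just described; the function and variable names (searchPartition, bestDistance) are invented for illustration and this is not the actual PDMF source.

#include <omp.h>
#include <climits>
#include <cstdio>

// Placeholder for the per-partition tree traversal; returns the best total
// hamming distance found inside [start, end).
static long long searchPartition(long long start, long long end) {
    return start % 97;  // dummy work for the sketch
}

int main() {
    const int threadCount = 8;
    const long long spaceSize = 1LL << 16;           // stands in for 4^L
    long long bestDistance = LLONG_MAX;

    #pragma omp parallel num_threads(threadCount)
    {
        int tid = omp_get_thread_num();
        long long chunk = spaceSize / threadCount;   // local search space size
        long long start = tid * chunk;
        long long end   = (tid == threadCount - 1) ? spaceSize : start + chunk;

        long long localBest = searchPartition(start, end);

        // Serialize updates to the shared best score, as in Algorithm 15.
        #pragma omp critical
        {
            if (localBest < bestDistance) bestDistance = localBest;
        }
    }
    std::printf("best distance: %lld\n", bestDistance);
    return 0;
}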

Figure 7. PDMFm1 computation model

In order to determine where each partition’s start and end points lie, irrespective of the number of threads, an algorithm that splits the entire search space into even portions was created. For example, for a motif search problem of motif length L = 3 and two threads, the first thread would handle all the motifs from AAA…CTT and the second thread would compute on motifs between GAA…TTT. Since we use an array-based numerical representation for a motif, the algorithm used to find the starting motif for a thread is shown in Algorithm 14.

Algorithm 14: DISPLACEMENTCOMP (thread_size, L, result)
//thread_size is the local searching space size
//result is an array of values, each ranged 0...3 for A...T
1 if (thread_size < 4)
2     result[L] ← thread_size
3 else
4     for i ← 1 to L
5         result[L - i] ← thread_size % 4   // alphabet size 4
6         thread_size ← thread_size / 4     // alphabet size 4
7 return result
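A compact C++ rendering of the same idea is shown below purely as an illustration: a linear offset into the 4^L search space is converted to base-4 digits, one digit per motif position. The function name and template parameter are assumptions, not the PDMF source.

#include <array>
#include <cstdio>

template <int L>
std::array<int, L> displacementComp(long long offset) {
    std::array<int, L> result{};            // all positions start at 0 (i.e., 'A')
    for (int i = L - 1; i >= 0 && offset > 0; --i) {
        result[i] = static_cast<int>(offset % 4);   // alphabet size 4
        offset /= 4;
    }
    return result;
}

int main() {
    // For L = 3 and two threads, the second thread starts at offset 4^3 / 2 = 32,
    // which maps to digits (2, 0, 0) -> "GAA", matching the example in the text.
    auto start = displacementComp<3>(32);
    std::printf("%d %d %d\n", start[0], start[1], start[2]);   // prints 2 0 0
    return 0;
}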

The PDMF algorithm using the multicore parallelization strategy described in this section is shown in Algorithm 15. Lines 6 and 7 refer to the start and end motif generation phase. The update of the best scoring motif (lines 31-33) is wrapped in an omp critical directive because it accesses the global best score, preventing race conditions.

Algorithm 15: PDMFM1 (DNA, t, L)
//Assume: TOTALDISTCOMPUTE (v, DNA) returns total hamming distance
//        between pattern v and all DNA sequences with OOPS
1 best_distance/pattern ← FINDHYPOTHETICALPATTERN (DNA, t, L)
2 omp parallel num_threads (thread_count)
3 {
4     thread_id ← omp_get_thread_num()
5     thread_size ← 4^L / thread_count
6     a = e ← (1, 1, ..., 1) //initial local start, end L-mers, AA...AA
7     a ← a + DISPLACEMENTCOMP (thread_id * thread_size, L, a)
8     e ← e + DISPLACEMENTCOMP ((thread_id + 1) * thread_size - 1, L, e)
9     i ← L //starts from the leaf level
10    while (i > 0)
11        if (i < L) //non-leaf vertex
12            prefix ← nucl. symbols corresponding to (a1, a2, …, ai)
13            end_prefix ← nucl. symbols corresponding to (e1, e2, ..., ei)
14            if (prefix.substring(0, i) > end_prefix.substring(0, i))
15                break //terminates thread
16            optimistic_dist ← TOTALDISTCOMPUTE (prefix, DNA)
17            if (optimistic_dist > best_distance)
18                (a, i) ← BYPASS (a, i, L, 4) //skips subtree
19            else
20                (a, i) ← NEXTVERTEX (a, i, L, 4) //finds next vertex
21        else //leaf vertex
22            word ← nucl. symbols corresponding to (a1, a2, …, aL)
23            end_word ← nucl. symbols corresponding to (e1, e2, ..., eL)
24            if (word.substring(0, i) > end_word.substring(0, i))
25                break //terminates thread
26            optimistic_dist ← OPTIMISTICDISTCOMPUTE (UM, word, t)
27            if (optimistic_dist ≥ best_distance)
28                skip this L-mer
29            else //need to compute total hamming distance
30                if (x ← TOTALDISTCOMPUTE (word, DNA) < best_distance)
31                    omp critical
32                        best_distance ← x
33                        best_pattern ← word
34            (a, i) ← NEXTVERTEX (a, i, L, 4) //finds next vertex
35 }//omp
36 return best_distance/pattern

PDMFm2 – Multicore Model 2

Each vertex in the tree requires a total hamming distance calculation before it can

make the decision of whether to bypass or traverse the subtree. The second multicore-based parallelization strategy looks at accelerating the total hamming distance computation module, which is a computational bottleneck in DMF. A data parallel approach

can be used in this approach as the hamming distance between a motif and a set of sequences can be split up based on the number of threads available.

Figure 8. PDMFm2 block division computation model
Figure 9. PDMFm2 cyclic division computation model

We can use two different strategies for assigning sequences to threads in this approach. A simple block division method where each thread is responsible for a set of sequences (shown in Figure 8) or a cyclic allocation strategy (shown in Figure 9) can be

used for the total hamming distance computation. For the block division strategy, we use the OpenMP directive omp parallel for num_threads { … }, where num_threads stands for the number of blocks into which we want to partition the iterations. This directive automatically splits the for loop iterations into equal partitions without any additional code.
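As a hedged illustration of this block-division strategy, the sketch below sums the per-sequence best hamming distances with an OpenMP parallel for and a reduction, so each thread keeps its own running total. The data layout and names (totalDistCompute, bestHammingInSequence) are assumptions, not the PDMF implementation.

#include <omp.h>
#include <string>
#include <vector>
#include <cstdio>

static int bestHammingInSequence(const std::string& seq, const std::string& motif) {
    const int L = static_cast<int>(motif.size());
    int best = L;                                    // worst possible value
    for (size_t s = 0; s + L <= seq.size(); ++s) {
        int d = 0;
        for (int j = 0; j < L; ++j) d += (seq[s + j] != motif[j]);
        if (d < best) best = d;
    }
    return best;
}

int totalDistCompute(const std::vector<std::string>& dna, const std::string& motif) {
    int total = 0;
    // OpenMP splits the iteration range into blocks, one block per thread.
    #pragma omp parallel for num_threads(8) reduction(+ : total)
    for (int i = 0; i < static_cast<int>(dna.size()); ++i)
        total += bestHammingInSequence(dna[i], motif);
    return total;
}

int main() {
    std::vector<std::string> dna = {"ACGTACGT", "TTTTACGG", "ACGAACGA"};
    std::printf("total distance: %d\n", totalDistCompute(dna, "ACG"));
    return 0;
}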

Additional runtime optimizations are automatically added by the OpenMP library. With both strategies, each thread is responsible for keeping track of its own running hamming

distance total, and the totals are summed at the end of the computation.

PDMFm3 – Multicore Model 3

A hybrid model consisting of the approaches shown in both PDMFm1 and

PDMFm2 was also developed and tested. In this approach, a combination of threads for both search space partitioning and total hamming distance computation was utilized. As shown in Figure 10, this approach requires a large number of threads and can lead to fewer idle threads during runtime.


Figure 10. PDMFm3 computation model

A case can be made that Model 1 and Model 2 are the two extremes of the hybrid model, where either only search space partitioning or only database partitioning is used. The performance of multiple combinations of threads was tested for this approach.

PDMFg (GPU version)

Early GPUs provided a way of accelerating graphics calculations in a computer system to relieve the additional load from the CPU. The nature of graphical calculations required the same instruction to be run over many pieces of data (the pixels on the screen) at a much larger data size. To perform such a large number of calculations on the GPU, specialized hardware was added to the device as a means of accelerating the computations. Furthermore, a global memory storage space was added to give the ALU (Arithmetic Logic Unit) chips on the device access to fast random-access memory. Programmable pixel and vertex shader units allowed faster floating point and looping calculations to be implemented with relative ease. These additions allowed GPUs to become very specialized in the area of graphical acceleration.

Researchers in the late 2000s noticed that the architecture of GPUs was ideal for high performance computing applications in various research fields. With GPUs present in many regular desktop computers, a large computing resource was available at a relatively low cost. However, early efforts in GPU programming had to use graphical storage spaces to run code. Data had to be passed into the device as texture maps and other forms of data storage meant for creating images. The Nvidia CUDA platform, introduced in 2007, allowed the processing units (called CUDA cores) to be used with standard programming language definitions. This platform was based on the

C and C++ languages due to their strict type definitions and static data types. Within a GPU, threads execute code on specialized hardware. The architecture of this device is built with highly parallel data processing in mind. Within a GPU, a group of 32 threads is called a warp, and a thread block can have up to 1024 threads at a time. Each thread block is assigned to a streaming multiprocessor (SM) on the GPU. These SMs can process warps of threads at any given time. Threads within a warp run in SIMD (Single Instruction Multiple Data) mode, where a single instruction is run across all the given data at the same time. Figure 11 shows the organization of threads in a GPU system. To access the threads in a block, CUDA uses a three-dimensional indexing scheme that is referred to by x, y, and z coordinates.

Figure 11. GPU parallel architecture

For example, a thread block can be instantiated with (32, 32, 1), which creates up to 1024 threads. Furthermore, we can allocate many thread blocks, each with up to 1024 threads, in a CUDA kernel. There is a hard limit of 65535 blocks in any kernel call. Since each one of these blocks can have many threads, certain programming applications scale well to this parallelization style, as blocks can be mapped to data structures and threads to elements within the data structures. There is no inter-block communication allowed in the CUDA programming paradigm, but threads within a block can communicate with the help of small and limited shared data structures. One final point to mention is that, since warps are run on an SM at a time, it is ideal to pick thread counts that are multiples of 32. This allows each SM to queue threads in an order that is ideal for fast and efficient processing.

For PDMF, we can utilize the large number of available threads to improve the runtime of our computation-intensive algorithm. Since the goal is to build a heterogeneous approach that combines the best CPU and GPU based models, two GPU-only models were developed to handle the task. An important precursor to developing the models was the handling of data inside the GPU memory. Since the GPU has its own device memory, separate from and inaccessible to system main memory, computation data has to be explicitly copied to the device's local memory. To avoid wasting time, we created a double buffering algorithm that copies database sequences to the GPU at the same time they are read into CPU memory. This allows us to hide the memory latency during data transfer and reduces overhead later in the program. Algorithm 16 showcases this mechanism in detail.

Algorithm 16: DOUBLEBUFFERING (dataset, seq_array, device_array)
1 i ← 1
2 read_flag ← true
3 buffer_size ← x //variable size
4 if read_thread
5     while (sequence ← dataset)
6         seq_array[i] ← sequence
7         i ← i + 1
8     read_flag ← false
9 else //write thread
10    buffer_start ← 0
11    while (read_flag || buffer_start + buffer_size < dataset.length)
12        device_copy(device_array, seq_array, buffer_start, buffer_size)
13        buffer_start ← buffer_start + buffer_size
14    buffer_size ← dataset.length - buffer_start //move excess data
15    if (buffer_size != 0)
16        device_copy(device_array, seq_array, buffer_start, buffer_size)
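For illustration only, the sketch below shows one way the double-buffering idea could be expressed with a reader thread and a writer thread copying chunks to a CUDA device buffer. The chunk size, the fixed-width sequence layout, and all names are assumptions rather than the actual PDMF implementation.

#include <cuda_runtime.h>
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

constexpr int kSeqLen   = 600;     // assumed fixed sequence length
constexpr int kSeqCount = 1024;    // assumed dataset size
constexpr int kChunk    = 128;     // sequences copied per transfer

int main() {
    std::vector<char> host(kSeqCount * kSeqLen);
    char* device = nullptr;
    cudaMalloc(&device, host.size());

    std::atomic<int> ready(0);     // number of sequences parsed so far
    std::atomic<bool> done(false);

    std::thread reader([&] {
        for (int i = 0; i < kSeqCount; ++i) {
            // ... parse sequence i from the dataset into host memory ...
            ready.store(i + 1);
        }
        done.store(true);
    });

    std::thread writer([&] {
        int copied = 0;
        while (copied < kSeqCount) {
            if (ready.load() - copied >= kChunk || done.load()) {
                int n = std::min(kChunk, ready.load() - copied);
                if (n == 0) continue;
                cudaMemcpy(device + copied * kSeqLen, host.data() + copied * kSeqLen,
                           n * kSeqLen, cudaMemcpyHostToDevice);
                copied += n;       // overlap transfers with the reader thread
            }
        }
    });

    reader.join();
    writer.join();
    cudaFree(device);
    return 0;
}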

Since we need to manage the number of GPU threads and blocks we create in a kernel call, we use a maximized block memory usage policy to help with the overall efficiency of the program. This involves creating more threads per block within fewer overall blocks, as opposed to more blocks with fewer threads per block. As noted previously, Nvidia GPUs use a 32-thread warp as the basic building block for processing data. Any multiple of this number allows for an efficient processing cycle on the device. In this thesis, we empirically elected to use 64 threads per block for all processing, with this number being adjustable as we see fit for processing larger database sizes. To determine the number of blocks needed in PDMF, we used the following calculation:

block_size ← (db.size() + thread_count − 1) / thread_count

This allows us to use the minimum number of blocks needed based on the current number of threads per block. The main goal of the GPU is to accelerate the total hamming distance computation in DMF. Since this portion of the algorithm is inherently data parallel, the GPU is one of the best devices for this purpose. The GPU version of the simple motif search total hamming distance computation module is shown in Algorithm 17. We use a simple cyclic assignment methodology to reuse threads if we need to; in a real-world database, we would rarely need to, since the number of sequences in a dataset is well below the thread limitations of the GPU.

Algorithm 17: GPUTOTALDISTANCE (a, i, dataset, L, tot_dist)
1 seq_num ← thread_id + block_id * block_dimension //thread's data access position
2 tot_dist ← 0
3 while (current sequence is in dataset)
4     sequence ← dataset[seq_num]
5     counter ← 0 //substring start position
6     best_hamm_dist ← L //initially worst value
7     while (counter < sequence.length – L + 1)
8         word ← sequence.substr(counter, L)
9         if (hamm_dist(word, a) < best_hamm_dist)
10            best_hamm_dist ← hamm_dist(word, a)
11        counter ← counter + 1
12    atomicAdd: tot_dist ← tot_dist + best_hamm_dist
13    seq_num ← seq_num + block_dimension * grid_dimension
14 return tot_dist
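A hedged CUDA rendering of Algorithm 17 is shown below. It assumes sequences are stored in a flat, fixed-width device array; the kernel name and data layout are illustrative and not taken from the PDMF source.

#include <cuda_runtime.h>

__global__ void gpuTotalDistance(const char* sequences, int seqCount, int seqLen,
                                 const char* motif, int L, int* totalDist) {
    int seqNum = blockIdx.x * blockDim.x + threadIdx.x;   // thread's data position
    while (seqNum < seqCount) {
        const char* seq = sequences + seqNum * seqLen;
        int best = L;                                     // initially worst value
        for (int start = 0; start <= seqLen - L; ++start) {
            int d = 0;
            for (int j = 0; j < L; ++j) d += (seq[start + j] != motif[j]);
            if (d < best) best = d;
        }
        atomicAdd(totalDist, best);                       // accumulate per-sequence best
        seqNum += blockDim.x * gridDim.x;                 // cyclic (grid-stride) reuse of threads
    }
}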

PDMFg1 – GPU Model 1

The first GPU model for PDMF is a simple adaptation of the PDMFm algorithms. It uses one CPU thread for kernel control and tree traversal while the total hamming distance function runs under GPU control. Figure 12 shows the computation model of this GPU-based solution. The PDMFg1 algorithm is shown in Algorithm 18.


Figure 12. PDMFg1 computation model

Algorithm 18: PDMFG1 (DNA, t, L)
//Assume: GPUTOTALDISTCOMPUTE (v, DNA) returns total hamming distance
//        between pattern v and all DNA sequences with OOPS
//        kernel called with b blocks and th threads per block
1 best_distance/pattern ← FINDHYPOTHETICALPATTERN (DNA, t, L)
2 a ← (1, 1, …, 1) //starts from the 1st leaf vertex, AA...AA
3 i ← L //starts from the leaf level
4 while (i > 0)
5     if (i < L) //non-leaf vertex
6         prefix ← nucleotide symbols corresponding to (a1, a2, ..., ai)
7         optimistic_dist ← TOTALDISTCOMPUTE (prefix, DNA)
8         if (optimistic_dist ≥ best_distance)
9             (a, i) ← BYPASS (a, i, L, 4) //skips subtree
10        else
11            (a, i) ← NEXTVERTEX (a, i, L, 4)
12    else //leaf vertex
13        word ← nucleotide symbols corresponding to (a1, a2, ..., aL)
14        optimistic_dist ← OPTIMISTICDISTCOMPUTE (UM, word, t)
15        if (optimistic_dist ≥ best_distance)
16            skip this L-mer
17        else //need to compute total hamming distance
18            if (x ← GPUTOTALDISTCOMPUTE<<<b, th>>> (word, DNA) < best_distance)
19                best_distance ← x
20                best_pattern ← word
21        (a, i) ← NEXTVERTEX (a, i, L, 4)
22 return best_distance/pattern

Line 18 refers to the kernel call itself where we instruct the GPU total hamming distance function to start computation. The values within the <<< >>> on line 18 refer to the number of blocks and threads we want the kernel call to run with. This model’s structure is similar to the PDMFm2 model where only the total hamming distance computation is accelerated.
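For reference, a possible host-side launch for this model is sketched below; it applies the 64-threads-per-block policy and the block-count formula given earlier. The kernel signature is the one assumed in the previous sketch, and all names are placeholders rather than the PDMF source.

#include <cuda_runtime.h>

// Assumed to be the kernel from the previous sketch, defined elsewhere.
__global__ void gpuTotalDistance(const char*, int, int, const char*, int, int*);

int launchTotalDistance(const char* dSeqs, int seqCount, int seqLen,
                        const char* dMotif, int L) {
    const int threadsPerBlock = 64;                                   // multiple of the 32-thread warp
    const int blocks = (seqCount + threadsPerBlock - 1) / threadsPerBlock;

    int* dTotal = nullptr;
    cudaMalloc(&dTotal, sizeof(int));
    cudaMemset(dTotal, 0, sizeof(int));

    gpuTotalDistance<<<blocks, threadsPerBlock>>>(dSeqs, seqCount, seqLen, dMotif, L, dTotal);

    int total = 0;
    cudaMemcpy(&total, dTotal, sizeof(int), cudaMemcpyDeviceToHost);  // waits for the kernel to finish
    cudaFree(dTotal);
    return total;
}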

PDMFg2 – GPU Model 2

The second GPU model is based on the PDMFm3 model, where we combine solution search space partitioning with total hamming distance computation. The GPU is able to run multiple instances of the same kernel call, provided there is enough memory to hold all the necessary data. Since PDMFg1 leaves multicore threads idle, the rationale behind this approach is to allow more threads to start their own kernel call instances while also following the tree traversal approach. One concern with this approach is the likelihood of a reduced bypass ratio, since possible bypass paths are interrupted by the vertical partitioning of the tree. The GPU also has a limit on the number of simultaneous kernel calls it can run at the same time. Since thread blocks are processed on the streaming multiprocessors, a GPU will be forced to run warps in a serial manner if all the available SMs are currently processing data. Figure 13 shows the computation model of this approach.

For this PDMF model, we need to explicitly define a stream that the kernel call will run on. If a stream value is not provided in the function call, CUDA will issue every instance of the function on the default stream, which serializes the kernel calls. This is equivalent to running the PDMF program serially. PDMFg2's pseudocode is shown in Algorithm 19.
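The sketch below illustrates, under assumed names, how separate CUDA streams can be created and passed as the launch parameter so that kernel instances are eligible to overlap instead of serializing on the default stream; it is not the PDMFg2 implementation itself.

#include <cuda_runtime.h>

// Assumed to be the kernel sketched earlier, defined elsewhere.
__global__ void gpuTotalDistance(const char*, int, int, const char*, int, int*);

void launchOnStreams(const char* dSeqs, int seqCount, int seqLen,
                     const char* dMotifs, int L, int* dTotals, int streamCount) {
    cudaStream_t* streams = new cudaStream_t[streamCount];
    for (int s = 0; s < streamCount; ++s) cudaStreamCreate(&streams[s]);

    const int threadsPerBlock = 64;
    const int blocks = (seqCount + threadsPerBlock - 1) / threadsPerBlock;

    // One candidate motif per stream; the fourth launch parameter selects the stream.
    for (int s = 0; s < streamCount; ++s)
        gpuTotalDistance<<<blocks, threadsPerBlock, 0, streams[s]>>>(
            dSeqs, seqCount, seqLen, dMotifs + s * L, L, dTotals + s);

    for (int s = 0; s < streamCount; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
    delete[] streams;
}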


Figure 13. PDMFg2 computation model

Algorithm 19: PDMFG2 (DNA, t, L)
//Assume: GPUTOTALDISTCOMPUTE (v, DNA) returns total hamming distance
//        between pattern v and all DNA sequences with OOPS
//        kernel called with b blocks, st streams and th threads per block
1 best_distance/pattern ← FINDHYPOTHETICALPATTERN (DNA, t, L)
2 omp parallel num_threads (thread_count) {
3     thread_id ← omp_get_thread_num()
4     thread_size ← 4^L / thread_count
5     a = e ← (1, 1, ..., 1) //initial local start, end L-mers, AA…AA
6     a ← a + DISPLACEMENTCOMP (thread_id * thread_size, L, a)
7     e ← e + DISPLACEMENTCOMP ((thread_id + 1) * thread_size - 1, L, e)
8     i ← L //starts from the leaf level
9     while (i > 0)
10        if (i < L) //non-leaf vertex
11            prefix ← nucl. symbols corresponding to (a1, a2, …, ai)
12            end_prefix ← nucl. symbols corresponding to (e1, e2, ..., ei)
13            if (prefix.substring(0, i) > end_prefix.substring(0, i))
14                break //terminates thread
15            optimistic_dist ← TOTALDISTCOMPUTE (prefix, DNA)
16            if (optimistic_dist > best_distance)
17                (a, i) ← BYPASS (a, i, L, 4) //skips subtree
18            else
19                (a, i) ← NEXTVERTEX (a, i, L, 4) //finds next vertex
20        else //leaf vertex
21            word ← nucl. symbols corresponding to (a1, a2, …, aL)
22            end_word ← nucl. symbols corresponding to (e1, e2, ..., eL)
23            if (word.substring(0, i) > end_word.substring(0, i))
24                break //terminates thread
25            optimistic_dist ← OPTIMISTICDISTCOMPUTE (UM, word, t)
26            if (optimistic_dist ≥ best_distance)
27                skip this L-mer
28            else //need to compute total hamming distance
29                if (x ← GPUTOTALDISTCOMPUTE<<<b, th, st>>> (word, DNA) < best_distance)
30                    omp critical
31                        best_distance ← x
32                        best_pattern ← word
33            (a, i) ← NEXTVERTEX (a, i, L, 4) //finds next vertex
34 }//omp
35 return best_distance/pattern

The third parameter in the <<< >>> in line 29 refers to the stream that the kernel call is supposed to run in. Arguably, this model could be called heterogeneous, but in this thesis we decided not to do so. Since the CPU threads only work on tree traversal and kernel launch points, we classify this model as a GPU model because most of the computational work is done on the GPU.

PDMFracing (Heterogeneous Model)

In order to better utilize the resources at hand, we designed a new heterogeneous model that makes use of the available CPU threads on the system. Since we have two separate architectures on our system, we decided to use the slower

CPU architecture to process its share of the solution space at its own fastest speed. Figure 14 shows the concept of the computation model used in PDMFracing. By splitting the tree space into two halves, we have the CPU and GPU working on different sections of the tree. Since one CPU thread is used for tree traversal and kernel control, the remaining threads are used for the total hamming distance computation of the second half. Since the GPU's computation power is much higher than that of the CPU, the race ends with the GPU's win at the center point of the search tree. At this point, the CPU gives up control of the tree and the GPU immediately resumes execution at the point where the CPU left off. This role change considers BYPASS and

NEXTVERTEX instructions in order to seamlessly transition into the CPU's execution path. The purpose of this model is to alleviate the GPU load as much as possible: the CPU's execution time is hidden because it becomes the performance gain over the basic GPU computation model.

Figure 14. PDMFracing computation model

By sharing the total workload between the two computing devices using the proposed dynamic load balancing scheme, we can achieve the highest performance gain in the heterogeneous system. To achieve a better load balancing result, we could repeatedly divide the remaining tree space into two halves and execute the racing-based mechanism again until the end of execution. Although this might be beneficial for longer motifs, where a larger searching space is possible, more control overhead is required and the reduced bypass ratio is a concern. Therefore, we limit our execution to a single swap.

Enhanced Heterogeneous Model with SIMD Vectors - PDMFh

After developing PDMFracing, we decided to use a different methodology to improve the workload balancing of the system. A converging processing scheme is better at running the two architectures at full speed, allowing the most work to be done by both systems since each one runs without any hindrance. To facilitate this, the two systems start on opposite sides of the search tree and work towards each other. This mirroring allows both sides to keep their bypassing effectiveness, as each entity assumes it can traverse the entire tree. The program ends when they cross each other, and the best motif has been found. The concept of the computation model used in PDMFh is shown in Figure 15.

Figure 15. PDMFh computation model

In order to improve the computation speed of the CPU, we decided to implement CPU-based SSE instructions using vectors. SSE (Streaming SIMD Extensions) is a SIMD

(Single Instruction Multiple Data) instruction set that can be executed on the x86 architecture. Introduced by Intel in 1999, it allowed CPUs to use the floating-point hardware present in the device to perform an instruction on multiple data items at a time. This can be thought of as data parallel operation in a typical parallel system. In

SSE instructions, however, the data needs to be packed into a single 128-bit data structure in order to be processed. Furthermore, only a limited number of SSE instructions are present in the instruction set. Since these commands are meant for highly parallel workloads such as graphics or image processing, finding other applications that benefit from SSE instructions can be challenging. However, PDMF's total hamming distance calculation only requires equality and addition operators, which allows us to massively parallelize the workload within our motif search. With SSE instructions, we work on packed data structures known as vectors. These vectors are loaded into dedicated vector registers in the CPU, and each instruction executes in roughly one clock cycle. Integers in the C++ platform are encoded with 32 bits, and SSE instructions are usually run on 128-bit or 256-bit packed vectors, which allows us to pack 4 or 8 integers into a vector respectively. In a single CPU core, using 256-bit vectors, we are able to process 8 data items in one instruction. With an SMT-capable CPU, we effectively double the amount of computation power available due to the additional processing achieved by interleaving instructions in an execution cycle. On a modern Core i7 carrying up to 6 hyper-threaded cores, we can effectively run an instruction on 96 items of data (6 cores * 2 for hyper-threading * 8 data items). This is a massive increase from the regular 12 threads we would be able to run using PDMFm.

The example below shows five regular integer instructions:

vu ← v1u + v2u
vw ← v1w + v2w
vx ← v1x + v2x
vy ← v1y + v2y
vz ← v1z + v2z

We can condense these five operations into one SIMD instruction. SSE instructions can only be executed on vectors, so we need to load the integers into registers on the CPU.

vecLoad v1 xmm0      // load v1 into the xmm0 register
vecLoad v2 xmm1      // load v2 into the xmm1 register
vecAdd xmm1 xmm0     // add all integer components and store the output in xmm1
vecExtract v xmm1    // load the components of xmm1 into v
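The same condensation can be expressed with real SSE intrinsics. The sketch below, with assumed variable names, performs four 32-bit additions in a single vector instruction; the intrinsics loosely mirror the vecLoad/vecAdd/vecExtract pseudo-instructions above.

#include <emmintrin.h>   // SSE2
#include <cstdint>
#include <cstdio>

int main() {
    alignas(16) int32_t v1[4] = {1, 2, 3, 4};
    alignas(16) int32_t v2[4] = {10, 20, 30, 40};
    alignas(16) int32_t v[4];

    __m128i xmm0 = _mm_load_si128(reinterpret_cast<const __m128i*>(v1)); // vecLoad v1
    __m128i xmm1 = _mm_load_si128(reinterpret_cast<const __m128i*>(v2)); // vecLoad v2
    xmm1 = _mm_add_epi32(xmm1, xmm0);                                    // vecAdd: 4 adds at once
    _mm_store_si128(reinterpret_cast<__m128i*>(v), xmm1);                // vecExtract

    std::printf("%d %d %d %d\n", v[0], v[1], v[2], v[3]);                // 11 22 33 44
    return 0;
}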

This form of parallelism requires additional computation in the preparation of vectors and other non-intuitive ways to perform simple instructions, but the amount of processing derived from the packing of these integers far outweighs the cost. In our PDMF system, since we can use 8-bit integers, due to motif lengths not being higher than

128, and pack 128-bit vectors we can run 16 data items with the same instruction.

Since we are only accelerating the total hamming distance function, we can reuse the GPU total hamming distance function of PDMFg2 on our heterogeneous system. However, we need to rewrite how our next vertex and bypass algorithms are run. Since they work in reverse, the next motif in our solution space needs to be the predecessor of the current motif. The new BYPASSREVERSE and NEXTVERTEXREVERSE functions are shown in Algorithms 20 and 21, respectively.

Algorithm 20: BYPASSREVERSE (a, i, L, k)
1 for j ← i to 1
2     if aj > 1
3         aj ← aj - 1
4         return (a, j)
5 return (a, 0)

Algorithm 21: NEXTVERTEXREVERSE (a, i, L, k)
1 if i < L
2     ai+1 ← 4
3     return (a, i + 1)
4 else
5     for j ← L to 1
6         if aj > 1
7             aj ← aj - 1
8             return (a, j)
9 return (a, 0)

The CPU and GPU sides of our PDMFh algorithm differ slightly. Algorithms 22 and 23 show the execution cycle of each, respectively.

Algorithm 22: PDMFHCPU (DNA, t, L)
//Assume: TOTALDISTCOMPUTE (v, DNA) returns total hamming distance
//        between pattern v and all DNA sequences with OOPS
1 best_distance/pattern ← FINDHYPOTHETICALPATTERN (DNA, t, L)
2 omp parallel num_threads (thread_count) {
3     thread_id ← omp_get_thread_num()
4     thread_size ← 4^L / thread_count
5     a = e ← (1, 1, ..., 1) //initial local start, end L-mers, AA…AA
6     a ← a + DISPLACEMENTCOMP (thread_id * thread_size, L, a)
7     e ← e + DISPLACEMENTCOMP ((thread_id + 1) * thread_size - 1, L, e)
8     i ← L //starts from the leaf level
9     while (i > 0)
10        if (i < L) //non-leaf vertex
11            prefix ← nucl. symbols corresponding to (a1, a2, …, ai)
12            end_prefix ← nucl. symbols corresponding to (e1, e2, ..., ei)
13            if (prefix.substring(0, i) > end_prefix.substring(0, i))
14                break //terminates thread
15            optimistic_dist ← TOTALDISTCOMPUTE (prefix, DNA)
16            if (optimistic_dist > best_distance)
17                (a, i) ← BYPASS (a, i, L, 4) //skips subtree
18            else
19                (a, i) ← NEXTVERTEX (a, i, L, 4) //finds next vertex
20        else //leaf vertex
21            word ← nucl. symbols corresponding to (a1, a2, …, aL)
22            if (word.substring(0, i) > GPU_word) //check for GPU position
23                break //terminates thread
24            optimistic_dist ← OPTIMISTICDISTCOMPUTE (UM, word, t)
25            if (optimistic_dist ≥ best_distance)
26                skip this L-mer
27            else //need to compute total hamming distance
28                if ((xd, xq) ← SSETOTALDISTCOMPUTE (word, d, DNA) < best_distance)
29                    if (xq > q) //check quorum
30                        omp critical
31                            best_distance ← xd
32                            best_pattern ← word
33            (a, i) ← NEXTVERTEX (a, i, L, 4) //finds next vertex
34 }//omp
35 return best_distance/pattern

Algorithm 23: PDMFHGPU (DNA, t, L)
//Assume: GPUTOTALDISTCOMPUTE (v, DNA) returns total hamming distance
//        between pattern v and all DNA sequences with OOPS
//        kernel called with b blocks and th threads per block
1 best_distance/pattern ← FINDHYPOTHETICALPATTERN (DNA, t, L)
2 omp parallel num_threads (thread_count) {
3     thread_id ← omp_get_thread_num()
4     thread_size ← 4^L / thread_count
5     a = e ← (1, 1, ..., 1) //initial local start, end L-mers, AA…AA
6     a ← a + DISPLACEMENTCOMP (thread_id * thread_size, L, a)
7     e ← e + DISPLACEMENTCOMP ((thread_id + 1) * thread_size - 1, L, e)
8     i ← L //starts from the leaf level
9     while (i > 0)
10        if (i < L) //non-leaf vertex
11            prefix ← nucl. symbols corresponding to (a1, a2, …, ai)
12            end_prefix ← nucl. symbols corresponding to (e1, e2, ..., ei)
13            if (prefix.substring(0, i) > end_prefix.substring(0, i))
14                break //terminates thread
15            optimistic_dist ← TOTALDISTCOMPUTE (prefix, DNA)
16            if (optimistic_dist > best_distance)
17                (a, i) ← BYPASSREVERSE (a, i, L, 4) //skips subtree
18            else
19                (a, i) ← NEXTVERTEXREVERSE (a, i, L, 4) //finds next vertex
20        else //leaf vertex
21            word ← nucl. symbols corresponding to (a1, a2, …, aL)
22            if (word.substring(0, i) < CPU_word) //check for CPU position
23                break //terminates thread
24            optimistic_dist ← OPTIMISTICDISTCOMPUTE (UM, word, t)
25            if (optimistic_dist ≥ best_distance)
26                skip this L-mer
27            else //need to compute total hamming distance
28                if ((xd, xq) ← GPUTOTALDISTCOMPUTE<<<b, th>>> (word, DNA) < best_distance)
29                    if (xq > q) //check quorum
30                        omp critical
31                            best_distance ← xd
32                            best_pattern ← word
33            (a, i) ← NEXTVERTEXREVERSE (a, i, L, 4) //finds next vertex
34 }//omp
35 return best_distance/pattern

We utilize two separate threads to handle the CPU and GPU instances of our

PDMFh algorithm. The threads are independent of one another and constantly check on each other's progress to prevent duplicate processing.

The CPU total hamming distance function has been rewritten to handle SIMD-based processing. In this approach, we use a channel-based method to handle the sequences passing through the vectors. Figure 16 shows an 8-channel example with the sequences currently being processed. The sequences in light grey are next in line on a channel and will be processed when the current sequence is complete.

Figure 16. SIMD Vector execution on multiple sequences

The SSE instruction set only accounts for about 70 vector instructions present on a CPU. SSE2, SSE3 and SSE4 are updates to the original instruction set that add support for additional instructions that PDMF requires. Certain modern CPUs also have access to

wider 256-bit and 512-bit vectors through the AVX and AVX-512 instruction sets, but our hardware is not capable of running these instructions. Since the majority of our instructions come from the SSE4 package, we will show those commands as native instructions and refer to them as SSE or SIMD instructions in this thesis.

For our total hamming distance function, we need to create some helper functions within our SSE computation cycle. Since the hamming distance looks at the differences between the motif string and our database sequence, we need to use a SIMD inequality instruction. However, such an instruction does not exist in the SSE instruction set. Therefore, we must use the equality operator and subtract the difference to account for the inequalities. Algorithm 24 shows the SIMD-based total hamming distance function.

Algorithm 24: SSETOTALDISTCOMPUTE (DNA, t, L)
1 for (i ← 0 to channel_count)
2     . . . //initialize channel array with channel object
3 while (si < t)
4     motifVctr ← Mj
5     . . . //load sequences into vector units
6     resultVctr ← vctrNotEqual (motifVctr, seqVctr) //vector comparison instruction
7     totalVctr ← vctrAdd (resultVctr, totalVctr) //add output to tile total
8     if (j = L)
9         bestDistanceVctr ← vectorMin (totalVctr, bestDistanceVctr, i)
10    if (end of sequence)
11        bestDistance_i ← extract (bestDistanceVctr, i)
12        if (bestDistance_i < d)
13            totDist ← totDist + bestDistance_i
14            xq ← xq + 1
15        . . . //load new sequence into channel
16 if (xq < q)
17     totDist ← L * t
18 return totDist

There is no per-element minimum operation among the SSE instructions we rely on. To overcome this limitation, a new method of finding the minimum values in a pair of vectors is shown in Figure 17. The pseudocode for this approach is shown in Algorithm 25.

Algorithm 25: VECTORMIN (tileTotalVctr, bestDstVctr)
//tileTotalVctr contains the current hamming distance
//bestDstVctr contains the best hamming distance so far
1 step1Vctr ← vectorEquals (tileTotalVctr, bestDstVctr)
2 step1Vctr ← vectorAnd (step1Vctr, tileTotalVctr)
3 step2Vctr ← vectorLessThan (tileTotalVctr, bestDstVctr)
4 step2Vctr ← vectorAnd (step2Vctr, tileTotalVctr)
5 step3Vctr ← vectorLessThan (bestDstVctr, tileTotalVctr)
6 step3Vctr ← vectorAnd (step3Vctr, bestDstVctr)
7 tempVctr ← vectorAdd (step1Vctr, step2Vctr)
8 bestDistanceVctr ← vectorAdd (step3Vctr, tempVctr)

Figure 17. SSE-based min operation

By using 8 SSE instructions, we can find the minimum values between any two vectors using bit-based mathematics. In line 1 we find the positions with the same value in both vectors and mark all the bits within those positions with 1. Line 2 performs a logical AND with the input vector, so the result holds the values that are equal in both vectors. Line 3 performs a less-than comparison of the two vectors, marking with 1 every position where the tile total vector is less than the best distance vector. When a logical AND is performed with the first vector, only the values that are smaller are kept in the result vector. The reverse of this operation is also conducted with the best distance vector, and the results are added to the results of the previous two steps. Since each step keeps either the lower value or the equal value, running this module finds and keeps the per-position minimum of the two vectors.
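As an illustration of this technique with actual SSE intrinsics, the sketch below selects the per-lane minimum of two vectors of packed 8-bit distances; it mirrors the equal/less-than/AND/ADD steps above and uses illustrative names only. Signed byte comparisons are sufficient here because hamming distances stay well below 128.

#include <emmintrin.h>   // SSE2
#include <cstdint>
#include <cstdio>

__m128i vectorMin(__m128i tileTotal, __m128i bestDst) {
    __m128i eq  = _mm_and_si128(_mm_cmpeq_epi8(tileTotal, bestDst), tileTotal); // equal lanes
    __m128i ltA = _mm_and_si128(_mm_cmplt_epi8(tileTotal, bestDst), tileTotal); // lanes where tile < best
    __m128i ltB = _mm_and_si128(_mm_cmplt_epi8(bestDst, tileTotal), bestDst);   // lanes where best < tile
    return _mm_add_epi8(_mm_add_epi8(eq, ltA), ltB);  // exactly one term is non-zero per lane
}

int main() {
    alignas(16) int8_t a[16] = {3, 7, 2, 9, 3, 7, 2, 9, 3, 7, 2, 9, 3, 7, 2, 9};
    alignas(16) int8_t b[16] = {5, 1, 2, 8, 5, 1, 2, 8, 5, 1, 2, 8, 5, 1, 2, 8};
    alignas(16) int8_t out[16];

    __m128i m = vectorMin(_mm_load_si128(reinterpret_cast<const __m128i*>(a)),
                          _mm_load_si128(reinterpret_cast<const __m128i*>(b)));
    _mm_store_si128(reinterpret_cast<__m128i*>(out), m);
    std::printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  // 3 1 2 8
    return 0;
}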

The SSE instructions we use also provide no direct method of setting a specific vector position to a value. Since a vector's contents need to be accessed by an explicit call instruction (with an index referring to the element we would like to read), we need to create a way to set a value in the vector. It is easy to load a vector from an array, but we only need to modify a single value, so rewriting the vector contents from a modified array is costly. To overcome this limitation, we use some SSE-based bit mathematics to set a value in a vector. Algorithm 26 shows the instructions we need to use to complete this task.

Algorithm 26: SETVALUE (aVctr, i, val)
//aVctr corresponds to the running total
//i stands for the location in the vector
//val stands for the value to be inserted
1 setMaskVctr ← loadVector(0)
2 clearMaskVctr ← loadVector(0xFF)
3 valVctr ← loadVector(val)
4 setMaskVctr[i] ← 0xFF
5 clearMaskVctr[i] ← 0
6 resultVctr1 ← vectorAnd (clearMaskVctr, aVctr)
7 resultVctr2 ← vectorAnd (setMaskVctr, valVctr)
8 return vectorAdd (resultVctr1, resultVctr2)

Figure 18 shows an example of this operation. In this example, we want to set the third position in the vector with the value 5. We use arrays to load the mask vectors as these are easily modifiable before loading.
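An illustrative intrinsic-level version of this mask-based set operation might look like the following sketch; the helper name setValue and the mask-array approach are assumptions made for demonstration, not the PDMF source.

#include <emmintrin.h>
#include <cstdint>
#include <cstring>

__m128i setValue(__m128i a, int i, uint8_t val) {
    alignas(16) uint8_t clearMask[16];
    alignas(16) uint8_t setMask[16];
    std::memset(clearMask, 0xFF, 16);       // keep every byte...
    std::memset(setMask, 0x00, 16);
    clearMask[i] = 0x00;                    // ...except position i
    setMask[i]   = 0xFF;                    // only position i receives the new value

    __m128i clearVctr = _mm_load_si128(reinterpret_cast<const __m128i*>(clearMask));
    __m128i setVctr   = _mm_load_si128(reinterpret_cast<const __m128i*>(setMask));
    __m128i valVctr   = _mm_set1_epi8(static_cast<char>(val));

    __m128i kept     = _mm_and_si128(clearVctr, a);        // original bytes, position i zeroed
    __m128i inserted = _mm_and_si128(setVctr, valVctr);    // new value only at position i
    return _mm_add_epi8(kept, inserted);                   // combine (positions never overlap)
}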

Figure 18. SSE vector set operation

Our last helper function is the vector extract module. Since we are packing 16 integers into a 128-bit data structure, we need to be able to extract a specific value from the vector as well. SSE provides an extract instruction called _mm_extract_epi8

(vector, index) but the index variable must be defined at compile time. This translates to the need for immediate values in that field so that appropriate machine code can be generated. A simple way to overcome this issue is to explicitly define the index 16 times in a case statement as follows:

Algorithm 27: VECTOREXTRACT (__m128i a, const int index)
1 switch (index)
2     case 0: return (uint8_t)_mm_extract_epi8(a, 0)
3     case 1: return (uint8_t)_mm_extract_epi8(a, 1)
4     case 2: return (uint8_t)_mm_extract_epi8(a, 2)
…
16    case 14: return (uint8_t)_mm_extract_epi8(a, 14)
17    case 15: return (uint8_t)_mm_extract_epi8(a, 15)

However, this function is not ideal for computation as the switch statement requires unnecessary equality checks for a single extract statement. We have created an

SSE based function that can extract a value much faster using SIMD instructions instead. This procedure can be written in three lines. The algorithm is written using C++ instructions and shown in Algorithm 28.

Algorithm 28: VECTOREXTRACT (__m128i a, i)
1 __m128i index ← _mm_cvtsi32_si128(i)
2 __m128i value ← _mm_shuffle_epi8(a, index)
3 return (uint8_t) _mm_cvtsi128_si32(value)

Line 1 copies the 32-bit integer i to the lowest element of the index vector and zeros its upper elements. Line 2 shuffles the packed 8-bit integers in the input vector according to the shuffle mask in the index vector. The last line of the algorithm then copies the lower 32-bit integer of the value vector and returns it to the function caller.

PERFORMANCE OF PDMF

To measure the performance of the proposed high-performance solution, PDMF was implemented in C++ using the same built-in unordered_map and priority_queue containers as DMF. The multicore models of PDMFm use the OpenMP library for multicore-based parallelization. The GPU models of PDMFg use the NVIDIA CUDA platform to create and execute kernel-level code on the GPU device. Our SSE-based SIMD computation model of PDMFh uses Intel SSE2, SSE3 and SSE4 instructions. These modules are well tested and optimized within the language and suit our needs very well. The performance tests were conducted on two systems. The first machine contains an Intel Core i7-4790 processor (4 cores, 3.6 GHz) with 16 GB of system memory, running Ubuntu 18.04 LTS. An Nvidia Quadro K1200 GPU with 4 GB of device memory is used for the GPU models on this machine. Our second computing system contains two Intel Xeon E5-2670 processors (16 cores, 2.6 GHz) with 32 GB of system memory, running OpenSUSE 13.1. The second machine has an Nvidia Tesla M2075 GPU with 6 GB of device memory.

The HMP (Human Microbiome Project) database was used as the input data for the system. In certain tests, smaller dataset portions are used to determine whether database size has an effect on the execution time of the DMF algorithm. In order to create a consistent testing scheme, the original dataset is split into 2^n partitions (1 ≤ n ≤ 8), with the partitions for each split chosen at random.

Figure 19 shows the performance of the multicore PDMF models in relation to each other. This experiment was run on the system with 16 cores (32 threads) using the

DS (1/8) dataset for varying L-mer lengths. PDMFm1 has the worst performance out of the three models. This can be attributed to the decreased bypassing ratio caused by the splitting of the search space into vertical partitions. Adding more threads leads to more fine-grained search space partitions and higher overall execution time. PDMFm2 has the best performance, with an amplified performance gain when using more CPU threads.

This suggests that using our computation resources to accelerate the total distance computation module is the best strategy among the multicore CPU models. The performance of the hybrid approach, PDMFm3, resides between PDMFm1 and PDMFm2. Since we are using a combination of both the search space split and the total distance computation acceleration, we see performance that corresponds to the mid-level between the two individual multicore approaches. For L-mer length = 8, PDMFm2 showed an average speedup of 8.87x and 4.63x on the 4-core and 16-core systems respectively.

Figure 19. PDMFm performance: m1 vs. m2 vs. m3

Figure 20 shows the performance of multiple combinations of threads on our 16-core system for variations of our hybrid multicore approach PDMFm3. L-mer length = 8 and varying sized datasets are used in this comparison. The best performance gain is shown by the usage of one search space tree traversal thread and 16 total distance computation threads, further confirming the benefits of parallelizing the computation-intensive module.

Figure 20. Performance of PDMFm3 with different thread allocations

Figure 21 shows the performance of our baseline GPU approach (PDMFg1) with respect to the fastest multicore model (PDMFm2) for various sized datasets. The base GPU model uses one CPU thread for kernel control and search space tree traversal while the total distance computation is handled by the GPU itself. This comparison is done on the system containing the Core-i7 CPU and Quadro K1200 GPU. For the shorter motif length, L = 6, the performance difference between the two models is not very large, but the performance gain improves for the longer motif length.

The performance gain is further amplified when using larger datasets, as there are more total distance computations with more sequences. Since the GPU's computation is much faster than the CPU's, the total execution time is lower. As shown in the graph, the performance gap between PDMFm and PDMFg is more pronounced when searching for longer motifs and in larger datasets.

Figure 21. PDMFg1 vs. PDMFm2 performance

Table 5 shows the execution time of DMF, PDMFm2, and PDMFg1 for L = 8 and varying sized datasets. The average speedup of PDMFm2 over DMF is 4.63x while the average speedup of PDMFg1 over DMF and PDMFm2 is 41.46x and 9.95x respectively.

Table 5. DMF vs. PDMFm2 vs. PDMFg1 Execution Times

             DMF (serial)     PDMFm2 (Multicore - Model 2)           PDMFg1 (GPU - Model 1)
Dataset      Run time (min.)  Run time (min.)  Speedup over DMF      Run time (min.)  Speedup over DMF  Speedup over PDMFm2
DS           926.32           177.64           5.21x                 22.09            41.93x            8.04x
DS (1/2)     447.57           85.46            5.24x                 9.41             47.56x            9.08x
DS (1/4)     202.14           39.72            5.09x                 4.27             47.34x            9.30x
DS (1/8)     51.47            22.09            2.33x                 1.24             47.51x            17.81x
DS (1/16)    14.24            2.70             5.27x                 0.49             29.06x            5.51x

Figure 22 shows the performance comparison of our first heterogeneous model (PDMFracing) with respect to our two GPU models. This experiment was conducted on the 16-core Xeon system with the Tesla M2075 GPU using L-mer length = 8 and the DS (1/8) database. PDMFg1's execution time does not change because it is the baseline GPU model and does not use more than one thread for the search tree.

PDMFg2 shows worsening performance as threads are added to the model. This can be attributed to the lower bypassing ratio caused by the search space split performed in this computation strategy. Furthermore, additional control and communication overhead related to managing multiple parallel kernels can worsen the performance. We also see a large deviation in performance after 8 threads in this model. The Nvidia Tesla M2075 GPU can only handle 8 concurrent kernel executions at one time. Starting more kernel threads leads to worse performance since all the kernel calls will be queued up on the

GPU waiting for their processing turn. In this case our baseline GPU model (PDMFg1) has the best performance amongst the PDMF GPU models. However, our racing-based computation strategy has the best performance amongst the three models. This can be attributed to the performance gain from the CPU execution strategy in the search space.

Since the motifs handled by the CPU do not need to be recomputed on the GPU, the difference between the two values on the graph is solely due to the racing-based dynamic load balancing method.

Figure 22. Performance: PDMF GPU vs. Heterogeneous Models

Table 6 shows the execution time and speedup of our PDMFh, PDMFracing and

PDMFg1 models for various L-mer lengths on the DS (1/8) database, on the system containing the Core-i7 CPU and Quadro K1200 GPU. We see increased efficiency with longer L-mers on our SIMD-based heterogeneous PDMF model. PDMFh has an average speedup of 3.42x and 2.48x in relation to PDMFg1 and PDMFracing, respectively. The increase in performance gain for longer L-mers can be attributed to the ability to run more SIMD instructions using the SSE-based model, leading to fewer computation cycles wasted on regular instructions. The massive throughput we get from the addition of SIMD computation also helps in this instance.

Table 6. PDMFg1 vs. PDMFracing vs. PDMFh Execution Times

Execution Time (min.)
                          L = 6     L = 8    L = 10, d = 2   L = 12, d = 3   L = 14, d = 4
PDMFg1                    0.0505    1.37     22.19           326.96          11349.43
PDMFracing (PDMFh1)       0.046     1.04     15.89           223.31          7466.73
PDMFh (PDMFh2)            0.0201    0.52     7.76            104.35          1895.11
Speedup (racing over g1)  1.10x     1.32x    1.40x           1.46x           1.52x
Speedup (h2 over racing)  2.28x     2.00x    2.04x           2.14x           3.94x
Speedup (h2 over g1)      2.51x     2.63x    2.86x           3.13x           5.99x

Figure 23 compares the scalability of our second heterogeneous model

(PDMFh) that uses SIMD instructions with respect to various sized databases. We find that the execution time of the SSE-based approach does not double when the size of the database is doubled. In fact, for L-mer length = 8, our PDMF model shows only a 1.5x increase in execution time on average, and for L-mer length = 10, an increase of 1.31x on average is observed. The performance loss caused by doubling the input size has been mitigated by the use of SIMD instructions in our heterogeneous model PDMFh.

Figure 23. PDMFh scalability comparison

CONCLUSION

We have designed and implemented a new combinatorial approach to the motif searching problem in this thesis. The proposed approach DMF uses hash-based heuristics to bypass unnecessary computations by approximating the hypothetical best motifs and total distance using hashing data structures. Our DMF algorithm shows a speedup of

22.48x on average over the regular branch-and-bound motif searching approach. The DMF approach was also compared against the popular motif searching tools MEME, SISMA, SPELLER, and WEEDER on real-world datasets. DMF was on average 9.9x and 2.94x faster than SPELLER and WEEDER. Not only was DMF faster than the mentioned motif searching tools, it was also able to return the accurate best-k motifs within a database. Our experience demonstrates that the proposed hash-based heuristics used in DMF are highly efficient in reducing the search space of tree-based branch-and-bound combinatorial approaches to motif finding. Utilizing parallel computation methods, we also developed parallel computation models of the DMF algorithm, called PDMF. CPU-only, GPU-only, and heterogeneous (CPU and GPU) computation models have been developed with specific target architectures in mind. Two new dynamic load balancing and computation strategies (racing-based and mirror-based) have also been implemented in the heterogeneous models. Furthermore, SIMD instructions have been integrated into our second heterogeneous model for accelerated CPU processing. Among the multicore models, PDMFm2, which focuses on accelerating the total distance computation task, showed the best performance with average speedups of 4.63x and 8.87x over the serial version

(DMF) on a system with 4 cores and a system with 16 cores, respectively. The baseline GPU model (PDMFg1) showed average speedups of 41.48x and 9.95x over the serial version (DMF) and PDMFm2, respectively, on a system with a 4-core CPU and a GPU.

Our first heterogeneous model (PDMFracing) showed an average 1.36x speedup over the baseline GPU model (PDMFg1) by using the racing-based dynamic load balancing method. Our fastest heterogeneous model (PDMFh) showed average speedups of 3.42x and 2.48x over the fastest GPU model (PDMFg1) and PDMFracing, respectively. In our practice, we observed that the performance gain of our GPU and heterogeneous models is amplified with increasing motif and dataset sizes. With the developed high-performance computation models, we could find motifs of lengths 6 to 14 within a reasonable time bound for all the databases that we tested.

REFERENCES

[1] W.K. Sung, “Algorithms in Bioinformatics: A Practical Introduction,” CRC Press, 2010.

[2] C.E. Lawrence, S.F. Altschul, M.S. Boguski, et al., “Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment,” Science, Vol. 262, No. 5131, pp. 208-214, 1993.

[3] J.D. Hughes, P. W. Estep, S. Tavazoie and G. M. Church, “Computational Identification of Cis-regulatory Elements Associated with Groups of Functionally Related Genes in Saccharomyces cerevisiae,” Journal of Molecular Biology, Vol. 296, No. 5, pp. 1205-1214, 2000.

[4] T. L. Bailey and C. Elkan, “Unsupervised learning of multiple motifs in biopolymers using expectation maximization,” Machine Learning, Vol. 21, No. 1, pp. 51-80, 1995.

[5] T. A. Down and T. J. P. Hubbard, “NestedMICA: sensitive inference of over- represented motifs in nucleic acid sequence,” Nucleic Acids Research, Vol. 33, No. 1, pp. 1445–1453, 2005.

[6] G. Z. Hertz, G. W. Hartzell, III, and G. D. Stormo, “Identification of consensus patterns in unaligned DNA sequences known to be functionally related,” Computer Applications in Biosciences (CABIOS), Vol. 6, No. 2, pp. 81-92, 1990.

[7] M. K. Das and H.-K. Dai, “A survey of DNA motif finding algorithms,” BMC Bioinformatics, Vol. 8, No. 7, pp. S21, 2007.

[8] J. Hu, B. Li, and D. Kihara, “Limitations and potentials of current motif discovery algorithms,” Nucleic Acids Research, Vol. 33, No. 15, pp. 4899–913, 2005.

[9] S. Vijayvargiya and P. Shukla, “Regulatory Motif identification in Biological Sequences: An Overview of Computational Methodologies,” Advances in Enzyme Biotechnology, Chapter 8, Springer, 2013.

[10] Angela Makolo, “A Comparative Analysis of Motif Discovery Algorithms,” Computational Biology and Bioinformatics, Vol. 4, No. 1, pp. 1-9, 2016.

[11] M. S. Waterman, R. Arratia and D. J. Galas, “Pattern Recognition in Several Sequences: Consensus and Alignment,” Bulletin of Mathematical Biology, Vol. 46, No. 4, pp. 515-527, 1984.

[12] J. V. Helden, B. Andre, and J. Collado-Vides, “Extracting Regulatory Sites from the Upstream Region of Yeast Genes by Computational Analysis of Oligonucleotide Frequencies,” Journal of Molecular Biology, Vol. 281, pp. 827-842, 1998.

[13] U. Keich and P. Pevzner, “Finding motifs in the twilight zone,” Bioinformatics, Vol. 18, No. 1, pp. 1374–1381, 2002.

[14] M.-F. Sagot, “Spelling approximate repeated or common motifs using a suffix tree,” LNCS 1380, pp. 111-127, 1998.

[15] G. Pavesi, P. Mereghetti, G. Mauri, and G. Pesole, “Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes,” Nucleic Acids Research, Vol. 32, No. 1, pp. W199–W203, 2004.

[16] N. Pisanti, A. M. Carvalho, L. Marsan, and M.-F. Sagot, “Risotto: Fast extraction of motifs with mismatches,” in LATIN 2006: Theoretical Informatics, pp. 757–768, 2006.

[17] E. Eskin and P. A. Pevzner, “Finding composite regulatory patterns in DNA sequences,” Bioinformatics, Vol. 18, No. 1, pp. S354–S363, 2002.

[18] C. Jia, M. B. Carson, Y. Wang, Y. Lin, and H. Lu, “A new exhaustive method and strategy for finding motifs in chip-enriched regions,” PLOS ONE, Vol. 9, No. 1, pp. 1–13, 2014.

[19] H. Dinh and S. Rajasekaran, “PMS: A panoptic motif search tool,” PLOS ONE, Vol. 8, No. 1, pp. 1 -7, 2013.

[20] H. Dinh, S. Rajasekaran, and V. K. Kundeti, “PMS5: an efficient exact algorithm for the (l, d)-motif finding problem,” BMC Bioinformatics, Vol. 12, No. 1, pp. 410, 2011.

[21] M. Nicolae and S. Rajasekaran, “Efficient sequential and parallel algorithms for planted motif search,” BMC Bioinformatics, Vol. 15, No. 1, pp. 34, 2014.

[22] M. Nicolae and S. Rajasekaran, “qPMS9: An efficient algorithm for quorum planted motif search,” Scientific Reports, Vol. 5, No. 1, pp. 7813 EP, 2015.

[23] Carvalho A, Freitas A, Oliveira A, Sagot M: Efficient Extraction of Structured Motifs Using Box-links. String Processing and Information Retrieval Conference. 2004, 267-278.

[24] Carvalho A, Freitas A, Oliveira A, Sagot M: A highly scalable algorithm for the extraction of cis-regulatory regions. Asia-Pacific Bioinformatics Conference 2005:273-283

[25] M. S. Waterman, R. Arratia and D. J. Galas, “Pattern Recognition in Several Sequences: Consensus and Alignment,” Bulletin of Mathematical Biology, Vol. 46, No. 4, pp. 515-527, 1984.

[26] T. L. Bailey, N. Williams, C. Misleh, and W. W. Li, “MEME: discovering and analyzing DNA and protein sequence motifs,” Nucleic Acids Research, Vol. 34, pp. W369–W373, 2006.

[27] M. Federico, M. Leoncini, M. Montangero and P. Valente, “Direct vs 2- stage approaches to structured motif finding,” Algorithms for Molecular Biology, 7:20, 2012.

[28] G. Pavesi, P. Mereghetti, G. Mauri and G. Pesole, “Weeder web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes,” Nucleic Acids Research, Vol.32 W199–W203, 2004.

[29] Q. Yu, D. Wei and H. Huo, “SamSelect: a sample sequence selection algorithm for quorum planted motif search on large DNA datasets,” BMC Bioinformatics, 19:228, 2018.

[30] G. N. William, T. L. Bailey, and C. P. Elkan. “ParaMEME: a parallel implementation and a web interface for a DNA and protein motif discovery tool.” Bioinformatics, Vol. 12, No. 4, pp. 303-310, 1996.

[31] C. Chen, B. Schmidt, L. Weiguo, and W. Muller-Wittig, “GPU-MEME: Using graphics hardware

[32] Y. Liu, B. Schmidt, and D. L. Maskell, “An Ultrafast Scalable Many-Core Motif Discovery Algorithm for Multiple GPUs,” 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, Shanghai, pp. 428- 434, 2011.

[33] M. M. Abbas, Q. M. Malluhi and P. Balakrishnan, "Scalable Multi-core Implementation for Motif Finding Problem," 2014 IEEE 13th International Symposium on Parallel and Distributed Computing, Marseilles, pp. 178-183, 2014.

[34] Xu, Yun & Jiaoyun, Yang & Zhao, Yuzhong & Shang, Yi. An improved voting algorithm for planted (1,d) motif search. Information Sciences. 237. 305–312

[35] M. M. Abbas, M. Abouelhoda, and H. M. Bahig, “A hybrid method for the exact planted (l, d) motif finding problem and its parallelization,” BMC Bioinformatics, Vol. 13, No. 17, pp. S10, 2012.

[36] Abbas, M., Bahig, M., Abouelhoda, H., & Mohie-Eldin, M. (2014). Parallelizing exact motif finding algorithms on multi-core. The Journal of Supercomputing, 69(2), 814-826

[37] Chin FYL, Leung HCM: Voting algorithms for discovering long motifs. Proceedings of Third Asia Pacific Bioinformatics Conference. 2005, 261-271. 72 72 [38] Davila J, Balla S, Rajasekaran S: Space and time efficient algorithms for planted motif search. Proceedings of Second International Workshop on Bioinformatics Research and Applications (LNCS 3992). 2006, 822-829.

[39] Davila J, Balla S, Rajasekaran S: Fastand practical algorithms for planted (l, d) motif search. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2007, 544-552.

[40] N. S. Dasari, D. Ranjan and M. Zubair, “High performance implementation of planted motif problem using suffix trees,” 2011 International Conference on High Performance Computing & Simulation, Istanbul, pp. 200-206, 2011.

[41] S. Bandyopadhyay, S. Sahni, and S. Rajasekaran, “PMS6MC: A multicore algorithm for motif discovery," Algorithms, Vol. 6, No. 4, pp. 805-823, 2013.

[42] S. Bandyopadhyay and S. Sahni. Psm6: A fast algorithm for motif discovery. In Second IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS), 2012.

[43] N. S. Dasari, R. Desh and Z. M, “An efficient multicore implementation of planted motif problem,” 2010 International Conference on High Performance Computing & Simulation, Caen, pp. 9-15, 2010.

[44] N. S. Dasari, D. Ranjan and M. Zubair, “High-performance implementation of planted motif problem on multicore and GPU,” Concurrency and Computation: Practice and Experience, Vol. 25, No. 1, pp. 1340-1355, 2013.

[45] H. Huo, S. Lin, Q. Yu, Y. Zhang and V. Stojkovic, “A MapReduce-based Algorithm for Motif Search,” 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, pp. 2052-2060, 2012.

[46] S. Mohanty and B. Sahoo, “Improved Exact Parallel Algorithm for Planted (l, d) Motif Search,” Asian Journal of Information Technology, Vol. 15, No. 1, pp. 4835- 4843, 2016.

[47] M. M. Al-Qutt, H. Khaled, R. ElGohary, et. al., “Accelerating Motif Finding Problem Using Skip Brute-Force on CPUs and GPU’s Architect-ures,” Proc. of the 2017 Int’l Conf. on Parallel and Distributed Processing Technology and Application, pp. 155-161, 2017.

[48] M. M. Al-Qutt, H. Khaled, R. ElGohary, et. al., “Accelerating Motif Finding Problem Using Skip Brute-Force on CPUs and GPU’s Architect-ures,” Proc. of the 2017 Int’l Conf. on Parallel and Distributed Processing Technology and Application, pp. 155-161, 2017.

Fresno State Non-exclusive Distribution License (Keep for your records) (to archive your thesis/dissertation electronically via the Fresno State Digital Repository)

By submitting this license, you (the author or copyright holder) grant to the Fresno State Digital Repository the non-exclusive right to reproduce, translate (as defined in the next paragraph), and/or distribute your submission (including the abstract) worldwide in print and electronic format and in any medium, including but not limited to audio or video.

You agree that Fresno State may, without changing the content, translate the submission to any medium or format for the purpose of preservation.

You also agree that the submission is your original work, and that you have the right to grant the rights contained in this license. You also represent that your submission does not, to the best of your knowledge, infringe upon anyone’s copyright.

If the submission reproduces material for which you do not hold copyright and that would not be considered fair use under copyright law, you represent that you have obtained the unrestricted permission of the copyright owner to grant Fresno State the rights required by this license, and that such third-party material is clearly identified and acknowledged within the text or content of the submission.

If the submission is based upon work that has been sponsored or supported by an agency or organization other than Fresno State, you represent that you have fulfilled any right of review or other obligations required by such contract or agreement.

Fresno State will clearly identify your name as the author or owner of the submission and will not make any alteration, other than as allowed by this license, to your submission. By typing your name and date in the fields below, you indicate your agreement to the terms of this license.

Publish/embargo options (type X in one of the boxes).

X Make my thesis or dissertation available to the Fresno State Digital Repository immediately upon submission.

Embargo my thesis or dissertation for a period of 2 years from date of graduation. After 2 years, I understand that my work will automatically become part of the university’s public institutional repository unless I choose to renew this embargo here: [email protected]

Embargo my thesis or dissertation for a period of 5 years from date of graduation. After 5 years, I understand that my work will automatically become part of the university’s public institutional repository unless I choose to renew this embargo here: [email protected]

Sanjay Soundarajan

Type full name as it appears on submission

May 8, 2020

Date