Using Weighted Set Cover to Identify Biologically Significant Motifs

A thesis presented to

the faculty of the Russ College of Engineering and Technology of Ohio University

In partial fulfillment

of the requirements for the degree

Master of Science

Robert J.M. Schmidt

December 2015

© 2015 Robert J.M. Schmidt. All Rights Reserved.

This thesis titled

Using Weighted Set Cover to Identify Biologically Significant Motifs

by

ROBERT J.M. SCHMIDT

has been approved for

the School of Electrical Engineering and Computer Science

and the Russ College of Engineering and Technology by

Lonnie R. Welch

Stuckey Professor of Electrical Engineering and Computer Science

Dennis Irwin

Dean, Russ College of Engineering and Technology

ABSTRACT

SCHMIDT, ROBERT J.M., M.S., December 2015, Computer Science

Using Weighted Set Cover to Identify Biologically Significant Motifs

Director of Thesis: Lonnie R. Welch

One of the greatest challenges of mankind is understanding how living organisms operate, and a key step towards meeting this challenge is identifying how genes are regulated. Promoter regions play a key role in the regulation of genes via sequences of DNA base pairs known as transcription factor binding sites. When a transcription factor binding site is activated, the genes associated with the transcription factor binding site are transcribed, the first step towards creating proteins. The identification of transcription factor binding sites has come a long way with the advancements of next generation sequencing technologies and projects like ENCODE, but still relies on motif discovery algorithms to pinpoint the exact binding sites. In this thesis, the motif discovery problem is explored and a novel method based on weighted set cover is presented to identify the minimal set of motifs, with objective functions, that discriminately cover a set of DNA sequences. The results show that some motif set cover methods can identify biologically significant motifs more accurately than simply selecting the top scoring motifs. However, the weighted set cover algorithms did not perform exceptionally well when compared to standard selection methods, which is attributed to the use of a discriminative motif discovery application. Detailed results can be found at http://motifpipeline.com.

ACKNOWLEDGMENTS

I would like to express my gratitude to Dr. Lonnie Welch, as an advisor and teacher throughout graduate school, for providing solid direction in my research, countless ideas and suggestions, continuous support, and for his work in obtaining the Choose Ohio First for Bioinformatics scholarships. Without Dr. Welch this thesis would not be possible.

I would like to thank Dr. Frank Drews for listening to countless research presentations, providing useful guidance and numerous helpful suggestions along the way.

I would like to thank Dr. David Juedes for teaching me the majority of what I know about the hardness of problems and the different tools available for solving computationally difficult problems. I also want to thank Dr. Sonsoles de Lacalle for all of her hard work on the rat estrous cycle project, allowing me to work with and analyze her data, and providing crucial insight into the biology behind the numbers.

I also want to give a huge thanks to my friend Rami Al-Ouran, who has given me countless hours of his time and provided me with so much help and guidance throughout graduate school. I also want to give a huge thanks to Xiaoyu Liang for all of her help throughout graduate school and for always answering my questions. I also want to thank Richard Wolfe, Jeffery Jones, Yichao Li, Ashwini Naik, and the whole bioinformatics lab for all of their help and support throughout graduate school. I also want to give a big thanks to Choose Ohio First for Bioinformatics for help funding me throughout graduate school.

Finally, I want to thank my love, Kasia, for always being by my side.

TABLE OF CONTENTS

Abstract
Acknowledgments
List of Tables
List of Figures
Chapter 1: Introduction
1.1 Background
1.2 Problem Statement
Chapter 2: Literature Review
2.1 Motif Discovery
2.1.1 Non-discriminative Motif Discovery Algorithms
2.1.2 Discriminative Motif Discovery Algorithms
2.2 Set Cover
2.2.1 Set Cover Problem
2.2.2 Hitting Set Problem
Chapter 3: Algorithmic Approaches
3.1 Set Cover Methods
3.1.1 Weighted Greedy Set Cover
3.1.2 Weighted Relaxed Greedy Set Cover
3.1.3 Weighted Modified Greedy Approach Set Cover
3.1.4 Weighted Hill Climbing Set Cover
3.1.5 Weighted Random Set Cover
3.1.6 Weighted Simulated Annealing Set Cover
3.1.7 Integer Linear Programming Formulation
3.1.8 Linear Programming using Branch and Cut
3.1.9 Linear Programming Relaxation with Randomized Rounding
3.2 Filter Methods
3.2.1 Greedy Removal Method
3.3 Weight Schemes
3.3.1 Solution Based Weight Schemes
3.3.2 Greedy Based Weight Schemes
3.4 Metrics
3.4.1 Comparisons
3.4.2 Classification Metrics
3.4.3 Ranking
3.5 Baseline Motif Selection Methods
Chapter 4: Case Studies
4.1 ENCODE Case Study
4.2 Rat Estrous Cycle Case Study
4.3 Brugia Malayi Case Study
Chapter 5: Evaluation of Algorithms
5.1 Weighted Greedy Set Cover
5.1.1 Results
5.1.2 Discussion
5.2 Weighted Relaxed Greedy Set Cover
5.2.1 Results
5.2.2 Discussion
5.3 Weighted Modified Greedy Set Cover
5.3.1 Results
5.3.2 Discussion
5.4 Weighted Hill Climbing Set Cover
5.4.1 Results
5.4.2 Discussion
5.5 Weighted Random Set Cover
5.5.1 Results
5.5.2 Discussion
5.6 Weighted Simulated Annealing Set Cover
5.6.1 Fast Temperature Function Results
5.6.2 Exponential Temperature Function Results
5.6.3 Boltzmann Temperature Function Results
5.6.4 Discussion
5.7 Linear Programming Using Branch and Cut Set Cover
5.7.1 Results
5.7.2 Discussion
5.8 Linear Programming Relaxation Using Randomized Rounding Set Cover
5.8.1 Results
5.8.2 Discussion
5.9 Top Ten Scoring Motifs
5.9.1 Results
5.9.2 Discussion
5.10 Top Ten Sequence Coverage Motifs
5.10.1 Results
5.10.2 Discussion
5.11 Top Ten Information Content Motifs
5.11.1 Results
5.11.2 Discussion
5.12 Using All Motifs
5.12.1 Results
5.12.2 Discussion
Chapter 6: Case Study Results
6.1 Rat Estrous Cycle Results
6.1.1 Differentially Expressed Genes
6.1.2 Fuzzy Clustering
6.1.3 Ontology Analysis
6.1.4 Transcription Factor Analysis
6.2 Brugia Malayi Results
Chapter 7: Conclusions
7.1 Summary of Work
7.2 Future Work
References

LIST OF TABLES

Table 1: TFBSs identified across brain regions

Table 2: TFBSs identified in the basal forebrain

Table 3: TFBSs identified in the frontal cortex

Table 4: TFBSs identified in the hippocampus

Table 5: Top expressed genes across brain regions

Table 6: Top expressed genes in the basal forebrain

Table 7: Top expressed genes in the frontal cortex

Table 8: Top expressed genes in the hippocampus

LIST OF FIGURES

Figure 1: Model of transcription

Figure 2: Example of motif logo

Figure 3: Johnson's greedy set cover algorithm

Figure 4: Number of motifs over average accuracy

Figure 5: Selection methods without filtering

Figure 6: Selection methods with 2% filtering

Figure 7: Selection methods with 5% filtering

Figure 8: Selection methods with 10% filtering

Figure 9: Gene clusters across brain regions

Figure 10: Gene clusters in the basal forebrain

Figure 11: Gene clusters in the frontal cortex

Figure 12: Gene clusters in the hippocampus

Figure 13: Top discriminative motifs in the L3 stage

Figure 14: Top motifs in the L3 life cycle stage

CHAPTER 1: INTRODUCTION

1.1 Background

Every known living organism uses DNA to store the genetic information necessary for life. This blueprint to life is utilized by organisms through the transcription of DNA into RNA. RNA is then used within the cell or translated into proteins. Proteins carry out many important tasks in the cell, including gene regulation, catalysis, and molecular transport.

This process of transcribing DNA into RNA, and translating RNA into protein is known as the central dogma of molecular biology. This process is heavily regulated through a variety of mechanisms so that organisms only produce the necessary RNA and proteins when they are required by the organism. One regulatory mechanism uses trans-acting proteins known as transcription factors to control which sections of DNA are transcribed into RNA [1].

Transcription factors work together with RNA polymerases to form a module of protein machinery upstream of the transcription start site in a DNA region known as the core promoter (Figure 1). These trans-acting transcription factors can then bind to the cis-acting DNA regions that they touch, including the core promoter region, but also to distant enhancer and silencer regions. The cis-acting DNA regions that the trans-acting transcription factors bind to are known as transcription factor binding sites (TFBSs).

The identification and location of TFBSs is an important task in determining how genes work together and regulate one another. When a transcription factor binds to DNA, it favors regions of 5 to 20 nucleotides in length that have sequences similar to each other [2]. An individual DNA binding site is known as a word, and a set of DNA binding sites is known as a motif. Motifs are usually represented as probability weight matrices (PWMs), where each entry in the PWM represents the probability that a transcription factor binds to an A, C, T, or G nucleotide at that position in the motif. Motifs are often displayed as motif logos (Figure 2), where each nucleotide’s height is based on the information content of that nucleotide at a position, to give a visual representation to the abstract concept of a motif. The identification of a motif for a specific transcription factor allows researchers to identify other putative binding sites of a transcription factor in DNA that share the same nucleotide pattern as the motif. Traditionally, two techniques have been used for discovering TFBSs: the application of lab techniques that identify regions of DNA bound to a transcription factor, and the application of computational analysis that identifies overrepresented subsequences in a set of sequences.
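The relationship between words, a PWM, and the per-position information content shown in a motif logo can be illustrated with a minimal sketch. The binding-site words below are hypothetical, the function names are chosen for illustration only, and no small-sample correction or background model is applied, so the values are a simplification of what logo software typically reports.

```python
import math
from collections import Counter

NUCLEOTIDES = "ACGT"

def build_pwm(words):
    """Return one {nucleotide: probability} dict per motif position."""
    length = len(words[0])
    if any(len(w) != length for w in words):
        raise ValueError("all words must have the same length")
    pwm = []
    for pos in range(length):
        # Count how often each nucleotide occurs at this position.
        counts = Counter(w[pos] for w in words)
        pwm.append({n: counts[n] / len(words) for n in NUCLEOTIDES})
    return pwm

def information_content(column):
    """Information content of one PWM column in bits (at most 2 for DNA)."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy

# Hypothetical aligned binding sites (words) for a single motif.
words = ["TATAAT", "TATAAA", "TATGAT", "TACAAT"]
pwm = build_pwm(words)
bits = [information_content(col) for col in pwm]
```

A fully conserved position, such as the first column here, carries the maximum of 2 bits, while positions with mixed nucleotides carry less.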

Figure 1. Model of transcription with transcription factors and RNA polymerase [1]

Figure 2. Example of a motif logo. The information content of each nucleotide is represented in bits.

The most common lab based approach to identifying TFBSs uses chromatin immunoprecipitation. This process works by creating an antibody to a specific transcription factor so that once the DNA has been broken into pieces, only the regions bound by the specific transcription factor can be immunoprecipitated out of the solution. Once the target transcription factor and bound DNA sequence are extracted, the DNA sequences can be sequenced using high-throughput sequencing, known as ChIP-seq, or bound to probes on a microarray to identify the regions of DNA the transcription factors bind to, known as ChIP-chip. Once the location and identity of the sequence that a transcription factor binds to is identified, it can be modified in a living organism to verify that a transcription factor does bind at a specific region in the genome.

Traditionally, computational motif discovery algorithms take a single set of input sequences and attempt to identify the most statistically overrepresented patterns of nucleotides through a variety of methods, including word enumeration, probabilistic algorithms, and machine learning algorithms, among others [2]. These types of algorithms take a set of DNA regions and return the putative PWMs and motifs that were identified. Many different methods of analyzing DNA sequences for possible TFBSs have been investigated. The Multiple Em for Motif Elicitation algorithm, MEME, uses mixture models of PWMs and expectation maximization of the parameters to iteratively identify motifs [3]. Another approach, the Weeder algorithm, uses a suffix tree and enumerates all possible motifs to identify those most statistically significant [4]. The XXmotif algorithm calculates P-values of single sites in PWMs to identify the most statistically significant motifs [5]. The WordSeeker algorithm uses a Markov model to also identify the most statistically significant motifs [6]. What all of these algorithms have in common is that they take an input set of DNA sequences and return lists of putative motifs.

A relatively new computational method for motif discovery, introduced by Sinha in 2003, is discriminative motif discovery [7]. Discriminative motif discovery algorithms take a foreground set of sequences, thought to contain some motif, and a background set of sequences, not thought to contain that motif, and identify motifs that are characteristic of the foreground set but not of the background set. Many discriminative motif discovery algorithms have also been investigated. The Discriminating Matrix Enumerator algorithm, DME, searches exhaustively for PWMs and then refines the PWM using local search [8]. The DECOnvolved Discriminative motif discovery algorithm, DECOD, uses k-mer counts of the foreground and background sequences to identify discriminative PWMs and motifs [9]. The Seeder algorithm uses the probability of k-mer counts to identify motif seeds, which are expanded to identify significant motifs [10]. The DIscriminative PWM Search algorithm, DIPS, randomly selects sites and improves them, then selects and reports the highest scoring PWM [21]. All of these algorithms require both a foreground and a background data set of sequences and often return a long list of putative motifs.

Traditional motif discovery methods and discriminative motif discovery methods often identify long lists of putative motifs. Recent research has gone into minimizing the number of putative motifs that are generated from traditional motif discovery algorithms through the use of minimum set cover [11]. Using minimum set cover to identify the minimum set of motifs needed to cover all the sequences eases the difficult task of identifying which putative motifs are actually important and which are just noise. Several different methods have been researched to solve the problem of finding the minimum set cover for a set of motifs, including greedy algorithms, local search algorithms, and integer linear programming approaches. However, there is still room for improvement.

The research into minimum set cover has been helpful to traditional motif discovery, though it hasn’t been as helpful to the newer field of discriminative motif discovery. The problem with applying a minimum set cover algorithm to the output of a discriminative motif discovery algorithm is that it doesn’t make use of the background data set of sequences. Without using the background data set, the minimum set of motifs that cover all the foreground sequences might also cover many of the background sequences, which is not ideal. Reducing the number of putative motifs is crucial because it is cost prohibitive to test all possible putative motifs in a wet lab. Therefore, it comes down to a guessing game for the researcher as to which putative motifs likely have a biological function. Some helpful statistics, such as p-values and q-values, are provided by different motif discovery tools, but there is no general application for identifying the minimum set of motifs that discriminately cover the foreground set over the background set.

The solution presented in this thesis for the problem of identifying the minimum set of motifs that discriminately cover the foreground set of sequences versus the background set of sequences is weighted set cover. With weighted set cover, weights can be assigned to the foreground and background sets of sequences such that the algorithm preferentially selects motifs that cover fewer background sequences. A simple weighting scheme is to set all background sequences to a cost of 1 and all foreground sequences to a cost of 0. The use of weights is not limited to the number of sequences in the foreground and background data sets. Other weighting schemes could include p-values, e-values, information content, and other statistical measures.
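As a minimal sketch of the simple weighting scheme, the weight of a motif can be taken as the summed cost of the sequences it covers, so that with a background cost of 1 and a foreground cost of 0 the weight equals the number of background sequences covered. Summing per-sequence costs into a motif weight is an assumption made here for illustration, and the function name, sequence identifiers, and motif coverage sets are hypothetical.

```python
def motif_weight(covered, foreground, background, fg_cost=0.0, bg_cost=1.0):
    """Sum the per-sequence costs of every sequence a motif covers.

    Assumption: a motif's weight is the total cost of its covered sequences.
    """
    covered = set(covered)
    return (fg_cost * len(covered & foreground)
            + bg_cost * len(covered & background))

# Hypothetical sequence identifiers and motif coverage.
foreground = {"f1", "f2", "f3"}
background = {"b1", "b2"}
weight_a = motif_weight({"f1", "f2", "b1"}, foreground, background)
weight_b = motif_weight({"f2", "f3"}, foreground, background)
```

Under this scheme a motif that occurs only in foreground sequences has weight 0, so a cover built from such motifs is preferred over one that also touches background sequences.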

1.2 Problem Statement

Formally, the weighted set cover problem can be stated as follows.

Given a set of sequences $S = \{s_1, s_2, \ldots, s_n\}$, a set of motifs $M = \{m_1, m_2, \ldots, m_k\}$ where each $m_i \subseteq S$, and a cost function $w: M \to \mathbb{R}$, find a set cover $C \subseteq M$ such that

$$\bigcup_{m_i \in C} m_i = S$$

and

$$\sum_{m_i \in C} w(m_i)$$

is minimized.

Weighted set cover has many advantages over minimum set cover for the application of identifying sets of useful motifs. By minimizing the sum of the weights of the motifs instead of the number of motifs, different weighting schemes can be chosen, tweaked, and substituted depending on the parameters. This flexibility allows for the discovery of the minimum set of motifs that discriminate the foreground from the background, and also of the minimum set of motifs under a custom weighting scheme based on any statistical value. Unfortunately, the optimization version of the weighted set cover problem is NP-hard, and no polynomial time algorithm is known that finds the optimal solution [12]. Therefore, a variety of approaches are explored that find approximate, possibly non-optimal solutions to this problem in a reasonable amount of time.
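To make the problem statement concrete, the sketch below solves a tiny instance exactly by enumerating every subset of motifs. It is an illustration of the definition only, not a method used in this thesis: the running time grows exponentially with the number of motifs, which is why heuristic approaches are needed. The motif names, coverage sets, and weights are hypothetical.

```python
from itertools import combinations

def exact_weighted_set_cover(universe, motifs, weight):
    """Return (cover, cost) with minimum total weight by brute force.

    motifs: dict mapping motif name -> set of covered sequences
    weight: dict mapping motif name -> weight w(m_i)
    """
    universe = set(universe)
    best_cover, best_cost = None, float("inf")
    names = sorted(motifs)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            covered = set().union(*(motifs[m] for m in subset))
            if covered >= universe:  # C covers all of S
                cost = sum(weight[m] for m in subset)
                if cost < best_cost:
                    best_cover, best_cost = set(subset), cost
    return best_cover, best_cost

# Hypothetical instance with four sequences and four motifs.
S = {"s1", "s2", "s3", "s4"}
M = {"m1": {"s1", "s2", "s3"}, "m2": {"s1", "s2"},
     "m3": {"s3", "s4"}, "m4": {"s4"}}
w = {"m1": 3.0, "m2": 1.0, "m3": 1.0, "m4": 0.5}
cover, cost = exact_weighted_set_cover(S, M, w)
```

In this instance the cheapest cover consists of m2 and m3, even though m1 covers more sequences than any other single motif.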

CHAPTER 2: LITERATURE REVIEW

2.1 Motif Discovery

DNA motif discovery is the process of identifying biologically significant nucleotide sequences (motifs) in a set of larger nucleotide sequences. DNA motif discovery algorithms attempt to identify biologically significant motifs by identifying statistically overrepresented nucleotide sequences [13]. For the purposes of this thesis, motif discovery algorithms can roughly be divided into two broad categories: non-discriminative (traditional) motif discovery algorithms and discriminative motif discovery algorithms.

2.1.1 Non-discriminative Motif Discovery Algorithms

Non-discriminative motif discovery algorithms take a single set of sequences, which is thought to have some shared functionality, as input and attempt to identify short sequences of nucleotides that are overrepresented relative to background models of nucleotides (usually specific to an organism). Some notable non-discriminative motif discovery algorithms are:

MEME [3]: The Multiple Em for Motif Elicitation algorithm randomly selects words from the set of sequences, and uses expectation maximization with the best word identified to fit a model on the input sequences and converge on a motif. After a motif is located it deletes the instances of this motif and attempts to find another motif. MEME is one of the most popular algorithms and is updated and maintained.

YMF [14]–[16]: The Yeast Motif Finder searches the set of input sequences and constructs all possible motifs of a specified length and calculates a z-score for each motif using a Markov chain constructed from the complement of the promoter regions of the yeast genome. The motifs with the highest z-scores are then output.

AlignACE [17], [18]: AlignACE selects random start locations among the input sequences and uses Gibbs sampling, similar to the method described in Lawrence ’93, combined with maximum a posteriori (MAP) probability to identify motifs. It was also designed to use a background model specific to the yeast genome [19].

Weeder [4]: Weeder searches a suffix tree created on the input sequences by recursively “expanding” the motifs being searched and excluding paths above an error threshold. The resulting motifs are then filtered based on the probability of finding that motif based on the motif length, number of input sequences, and the error threshold. The resulting motifs are then sorted based on the significance of the motif.

XXmotif [5], [20]: Exhaustive evaluation of matrix motifs, the XXmotif algorithm, enumerates all subsequences of length five, tandem repeats of length six and palindromes of length six, and extends them to improve the calculated e-value. The subsequences are matched on the input sequences, converted into probability weight matrices, merged by similarity, and optimized by their enriched p-values.

WordSeeker [6]: WordSeeker makes use of parallel computing to enumerate words of a user defined length using either a radix tree, a suffix tree or a suffix array. A Markov model is used to score each word. The resulting words are clustered using Hamming distance or edit distance and converted into a probability weight matrix.

2.1.2 Discriminative Motif Discovery Algorithms

Discriminative motif discovery algorithms take two sets of sequences as input: a foreground (positive) set of sequences, which is thought to have some shared functionality, and a background (negative) set of sequences, which is thought to have some unrelated functionality compared to the foreground set [21]. Statistically overrepresented motifs are then identified by comparing the occurrences of motifs in the foreground set of sequences to the occurrences in the background set of sequences; a high foreground to background occurrence ratio may indicate a motif that is biologically significant [7]. In both traditional and discriminative motif discovery categories there are numerous ways of generating motifs and incorporating other data into the motif discovery process, which leads to a variety of motif discovery algorithms. Some notable discriminative motif discovery algorithms are:

DME [8]: The Discriminating Matrix Enumerator accepts a foreground set of DNA sequences, a background set of DNA sequences, a substring width, and a number of motifs, and outputs a set of differentially expressed motifs. The algorithm begins by conducting a grid-like search for motifs of the input size, filtering out those below an information content threshold determined by the width, and selecting the motif with the highest score based on a simplified likelihood equation. A local search is then performed, consisting of iterating over motifs similar to the selected motif and replacing the selected motif when a higher likelihood motif is identified. This process repeats until a threshold is reached or the selected motif is identified as having the highest score. The algorithm then removes the final motif and repeats the algorithm based on the number of motifs parameter.

Seeder [10]: The Seeder algorithm enumerates all strings of an input length and uses Hamming distance to find the closest substring in every foreground sequence, as well as the Hamming distance to a random background sequence. A P-value is calculated for each word based on its existence in the foreground and background sequences. An initial motif is created based on the word with the lowest P-value, the word acting as a seed. This motif is then expanded to the input width by selecting the closest site for each foreground sequence until a threshold is reached or the motif’s score cannot be improved.

DREME [22]: The Discriminative Regular Expression Motif Elicitation algorithm specializes in finding short motifs in ChIP-seq data. The algorithm enumerates all strings between length 4 and 8 in the foreground and background sequences. The estimated P-value for each word is then calculated based on its existence in the foreground and background sets. For the top 100 most significant words, motifs are created with up to one IUPAC nucleotide ambiguity that all have estimated significant P-values. This process is repeated using the resulting set of motifs until no more ambiguities can be added. The top motif is then removed from the data output and the algorithm attempts to find more motifs based on an E-value threshold.

DECOD [9]: The DECOnvolved Discriminative motif discovery algorithm calculates the frequency of all substrings in the foreground and background sets of sequences, based on an input length. To accommodate overlapping substrings, a “convolved motif component” is calculated such that the probability weight matrix of the motif includes the surrounding nucleotides. Using this motif model, a score is calculated by multiplying the probability of the motif by the difference between its foreground frequency and its background frequency, to identify motifs that are differentially expressed in the foreground and background sets. Hill climbing is then used in conjunction with the above objective function to iteratively find and remove motifs.

DIPS [21]: The DIscriminative PWM Search algorithm is Sinha’s response to the discriminative motif discovery problem he presented in his previous paper [7]. The algorithm works by randomly selecting a set of substrings of a specified length from the foreground sequences, creating a probability weight matrix (or matrices) from them, and calculating the probability of each substring given the probability weight matrix. The substrings with probabilities above a threshold are then counted in the foreground and background sequences, and these counts are normalized and subtracted to obtain the “discriminative-score”. A hill climbing algorithm is then used to optimize the probability weight matrix by deleting substrings that lower the score and adding substrings that increase the score. This process stops after improvements are no longer made, and the algorithm can be repeated to search for additional motifs by adding the discovered probability weight matrix to the input set. It is also important to note that this algorithm can take a considerable amount of time to complete.
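The foreground versus background comparison that the algorithms above share can be illustrated with a simplified score: the fraction of foreground sequences containing a word minus the fraction of background sequences containing it. This is a sketch only. It uses exact string matching rather than PWM probabilities and thresholds, it is not the scoring function of any of the tools described, and the sequences are hypothetical.

```python
def occurrence_fraction(word, sequences):
    """Fraction of sequences that contain the word at least once."""
    if not sequences:
        return 0.0
    return sum(word in seq for seq in sequences) / len(sequences)

def discriminative_score(word, foreground, background):
    """Normalized foreground occurrence minus normalized background occurrence."""
    return (occurrence_fraction(word, foreground)
            - occurrence_fraction(word, background))

# Hypothetical foreground and background sequences.
foreground = ["ACGTGACC", "TTGTGACA", "GGGTGACT"]
background = ["ACGTACGT", "TTTTGACA", "CCCCCCCC"]
score_long = discriminative_score("GTGAC", foreground, background)
score_short = discriminative_score("TGAC", foreground, background)
```

The longer word occurs in every foreground sequence and in no background sequence, so it scores higher than the shorter word, which also appears in one background sequence.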

2.2 Set Cover

2.2.1 Set Cover Problem

The set cover optimization problem focuses on identifying the minimum collection of sets required to contain all the elements in the universal set of elements. One of the first attempts at identifying the complexity of this problem was by Karp in his 1972 paper, “Reducibility Among Combinatorial Problems”, where Karp identified the decision version of the set cover problem (does there exist a collection of at most k sets that covers the universal set?) as being NP-complete, that is, belonging to both NP and the NP-hard problems [12]. The set cover problem was further defined with two theorems by Stein in his 1974 paper, “Two Combinatorial Covering Theorems”, where he applied set cover to matrices [23]. The problem was visited again by Johnson in his 1974 paper, “Approximation Algorithms for Combinatorial Problems”, where he presents two heuristic based algorithms for approximating an optimal solution to set cover, including the polynomial time greedy approximation algorithm (Figure 3) [24]. Near the same time as Johnson, Lovász presented the same greedy approximation algorithm for the set cover problem, and proposed a linear programming relaxation of the set cover problem [25]. In 1994, Lund and Yannakakis [26] proved via reduction that the set cover problem cannot be approximated within a factor of $c \log_2 N$, where c is a constant less than 1/4, unless the set of problems that can be solved in nondeterministic polynomial (NP) time is contained in the set of problems that can be decided by a deterministic Turing machine in $n^{\mathrm{poly}\log n}$ time. Therefore, unless P=NP there is not much more room for improvement over Johnson’s greedy set cover method [27].
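The linear programming relaxation mentioned above can be written out using the notation of Section 1.2. The indicator variables $x_i$, one per set $m_i$, are notation introduced here for illustration. The unweighted set cover problem is the integer program

```latex
\begin{align*}
\text{minimize}\quad & \sum_{i=1}^{k} x_i \\
\text{subject to}\quad & \sum_{i \,:\, s_j \in m_i} x_i \ge 1 \quad \text{for all } s_j \in S, \\
& x_i \in \{0, 1\} \quad \text{for } i = 1, \ldots, k.
\end{align*}
```

The relaxation replaces the constraint $x_i \in \{0, 1\}$ with $0 \le x_i \le 1$, which makes the problem solvable in polynomial time at the cost of possibly fractional solutions. The weighted variant replaces the objective with $\sum_{i=1}^{k} w(m_i)\, x_i$.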

Figure 3. Johnson’s greedy algorithm for approximating a solution to the set cover optimization problem in polynomial time [24].

A paper titled “Discovering Gene Regulatory Elements Using Coverage-based Heuristics” is being prepared for submission by members of the Ohio University Bioinformatics Lab that explores unweighted set cover methods for identifying biologically significant sets of motifs. In the paper, unweighted set cover methods are proposed as a way to reduce the set of motifs generated in motif discovery. That work ties closely with this thesis and was the inspiration for exploring weighted set cover as a method of identifying biologically significant motifs.

2.2.2 Hitting Set Problem

The hitting set problem centers on trying to identify the minimum set of vertices that intersects every edge of a hypergraph, meaning that its intersection with each edge is nonempty. This minimum set of vertices is called the hitting set or the transversal. The hitting set problem is the graphical equivalent to the set cover problem. Set cover and hitting set can map between each other by setting the vertices in a hypergraph to be the sets in set cover and setting the edges in a hypergraph to be the elements of the universal set in set cover. A vertex then intersects an edge (in the hitting set problem) if the corresponding set contains that element (in the set cover problem). Lu and Lu (2013) used the weighted k-hitting set problem to identify transcription factor modules by searching for genes that are required to be covered by a set of k transcription factors [28]. An exponential time algorithm was presented to solve for the minimum weight set of transcription factors to cover their set of genes, and since the input data had few subsets, optimal solutions were identified.

CHAPTER 3: ALGORITHMIC APPROACHES

Several algorithmic approaches were applied to identify subsets of motifs that discriminately cover foreground sequences over background sequences. These methods include a weighted greedy approach, a weighted hill climbing approach, a weighted simulated annealing approach, a weighted branch and cut approach, a weighted relaxed integer linear programming approach, a weighted random approach, a weighted modified greedy approach, and a weighted relaxed greedy approach. For comparison purposes, unweighted versions of all of these algorithms were also tested to determine any benefits gained from a weighted approach. A variety of algorithms were tested in an attempt to identify the most effective, in terms of running time and biological results.

3.1 Set Cover Methods

3.1.1 Weighted Greedy Set Cover

Greedy algorithms attempt to identify good solutions by iteratively making locally optimal choices. Greedy solutions for many problems do not guarantee globally optimal solutions. The basis for the weighted greedy set cover algorithm is to always select the motif that covers the most uncovered sequences with the smallest weight per sequence. This is implemented by iteratively selecting the motif with the lowest weight divided by the number of uncovered sequences it covers, until all foreground sequences are covered. The implementation accepts a parameterized weight function, which can be any of the greedy weight schemes described below. Unless otherwise stated, the lowest ratio of background to foreground sequences weight scheme was used. This algorithm has a well-known polynomial run time and is not guaranteed to identify the optimal solution, although it does converge on a local optimum. Due to its polynomial running time, this algorithm is one of the fastest algorithms tested. The greedy algorithm has also been shown to be close to the best approximation achievable by a polynomial time algorithm, assuming that NP does not have quasi-polynomial time algorithms [29], [30].
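The selection rule described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this thesis: it assumes each motif has a fixed weight supplied in a dictionary, whereas the actual implementation accepts a parameterized weight function, and the motif names, coverage sets, and weights are hypothetical.

```python
def weighted_greedy_set_cover(foreground, motifs, weight):
    """Greedily pick the motif with the lowest weight per newly covered sequence.

    motifs: dict mapping motif name -> set of foreground sequences it covers
    weight: dict mapping motif name -> nonnegative weight (assumed fixed here)
    """
    uncovered = set(foreground)
    chosen = []
    while uncovered:
        best, best_ratio = None, float("inf")
        for name in sorted(motifs):  # sorted for deterministic tie-breaking
            newly_covered = len(motifs[name] & uncovered)
            if newly_covered == 0:
                continue
            ratio = weight[name] / newly_covered
            if ratio < best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            raise ValueError("the motifs cannot cover every foreground sequence")
        chosen.append(best)
        uncovered -= motifs[best]
    return chosen

# Hypothetical instance.
foreground = {"f1", "f2", "f3", "f4"}
motifs = {"A": {"f1", "f2", "f3"}, "B": {"f1", "f2"},
          "C": {"f3", "f4"}, "D": {"f4"}}
weight = {"A": 3.0, "B": 1.0, "C": 2.0, "D": 1.5}
greedy_cover = weighted_greedy_set_cover(foreground, motifs, weight)
```

Because the ratio is recomputed against the sequences still uncovered, motif C (ratio 1.0 over two new sequences) is preferred to motif D (ratio 1.5 over one new sequence) in the second iteration.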

3.1.2 Weighted Relaxed Greedy Set Cover

The relaxed greedy algorithm is a variant of the weighted greedy algorithm, defined above, where the motifs are sorted by their weight, calculated by the weight schemes defined below, in ascending order, and then iteratively selected only if the motif covers any uncovered sequences. This differs from the weighted greedy algorithm which recalculates the weights of every motif every iteration. In the case where the weight scheme is equal to one over the number of foreground sequences, equivalent to un-weighted set cover, motifs that have a large number of foreground sequences are favored over motifs that cover the most uncovered foreground sequences. This algorithm accepts a parameterized weight function which accepts any of the greedy weight schemes defined below. The algorithm runs in polynomial time like the weighted greedy algorithm. This algorithm is run alongside the weighted greedy algorithm to compare the effects of using the original calculated weights versus using weights recalculated each iteration.
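For contrast with the standard greedy method, the relaxed variant can be sketched as a single pass over motifs sorted once by a fixed weight. The dictionary layout and names below are illustrative assumptions, not the thesis code.

```python
from typing import Dict, List, Set


def relaxed_greedy_cover(
    fg_cover: Dict[str, Set[str]],
    weights: Dict[str, float],
) -> List[str]:
    """Sort motifs once by weight, then keep each motif that adds coverage."""
    uncovered = set().union(*fg_cover.values())
    chosen: List[str] = []
    # Weights are computed up front and never recalculated.
    for motif in sorted(fg_cover, key=lambda m: (weights[m], m)):
        if not uncovered:
            break
        if fg_cover[motif] & uncovered:
            chosen.append(motif)
            uncovered -= fg_cover[motif]
    return chosen


fg = {"a": {"s1", "s2"}, "b": {"s2"}, "c": {"s3"}}
w = {"a": 0.5, "b": 0.2, "c": 0.9}
solution = relaxed_greedy_cover(fg, w)
```

Here motif b is taken first because of its low fixed weight, even though motif a alone would have covered both s1 and s2.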

3.1.3 Weighted Modified Greedy Approach Set Cover

The weighted modified greedy approach is a variant of the weighted greedy algorithm in which certain required motifs are added before running the standard weighted greedy algorithm and certain redundant motifs are deleted afterward. Required motifs are added by selecting all motifs that contain a foreground sequence covered by only one motif, and redundant motifs are identified as all motifs that contain no foreground sequence covered only once. This set cover method tries to slightly improve the standard weighted set cover method by adding certain required motifs and removing certain redundant motifs.

This algorithm accepts a parameterized greedy weight scheme, defined below, and runs in polynomial time similar to the standard weighted greedy set cover method.
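The two extra steps of the modified greedy method can be sketched as the helper functions below. The names and data layout are hypothetical. Removing redundant motifs one at a time with a recount after each removal is an assumption of this sketch, made so that two motifs that are redundant only with respect to each other are not both deleted.

```python
from collections import Counter
from typing import Dict, List, Set


def required_motifs(fg_cover: Dict[str, Set[str]]) -> List[str]:
    """Motifs that are the only cover for at least one foreground sequence."""
    times_covered = Counter(s for seqs in fg_cover.values() for s in seqs)
    return sorted(
        m for m, seqs in fg_cover.items()
        if any(times_covered[s] == 1 for s in seqs)
    )


def drop_redundant(solution: List[str], fg_cover: Dict[str, Set[str]]) -> List[str]:
    """Remove motifs whose sequences are all covered by other kept motifs."""
    kept = list(solution)
    for motif in list(solution):
        counts = Counter(s for m in kept for s in fg_cover[m])
        # Redundant: none of its sequences is covered exactly once.
        if all(counts[s] > 1 for s in fg_cover[motif]):
            kept.remove(motif)
    return kept


fg = {"a": {"s1", "s2"}, "b": {"s2", "s3"}, "c": {"s3"}}
must_have = required_motifs(fg)
pruned = drop_redundant(["a", "b", "c"], fg)
```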

3.1.4 Weighted Hill Climbing Set Cover

Hill climbing is a local search algorithm that attempts to identify a solution by iteratively altering elements and accepting better solutions. Hill climbing algorithms can often become stuck on local optima as they do not accept worse solutions, as opposed to algorithms like simulated annealing which can accept worse solutions. The implemented weighted hill climbing set cover algorithm begins with a random solution to the set cover problem, generated using the random set cover method described later, and then randomly applies one of three operations for a number of iterations, after which the final solution is returned.

The three operations are: deletion of a motif, swapping a motif in the solution with a motif not in the solution, and swapping two motifs in the solution with two motifs not in the solution. If the applied operation improves the solution, the change is kept; if it does not, the operation is reversed.

The implemented weighted set cover hill climbing algorithm also makes use of a fixed number of random restarts in an attempt to find a better solution. Only the best solution identified from the random restarts is returned. The running time of this algorithm depends greatly on the number of random restarts and the number of operation iterations that are chosen. Unless otherwise stated, the implemented hill climbing algorithm used two hundred random restarts and one thousand attempted iterations with the lowest ratio of background to foreground sequences weight scheme. The implemented hill climbing solution has a parameterized objective function which can be any of the solution based weight schemes described below.

3.1.5 Weighted Random Set Cover

The weighted random set cover algorithm is intended to generate solutions that are passed to other algorithms, and act as a baseline to compare how effective other algorithms are. The weighted random set cover algorithm shuffles all the motifs based on a random number generator seeded with the current time, and then selects motifs in order as long as they add at least one foreground sequence until a set cover is found. A fixed number of random restarts can also be passed to this implementation and only the solution found with the lowest weight is returned. This implementation accepts a parameterized solution based weight scheme, described below. Unless otherwise stated all of the solutions identified by the weighted random set cover method used two hundred random restarts and the lowest ratio of background to foreground sequences weight scheme.
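A minimal sketch of the random set cover method with restarts is given below. The function signature, the optional seed parameter (added here so the example is reproducible; the text describes seeding with the current time) and the use of solution length as the toy weight are assumptions for illustration.

```python
import random
from typing import Callable, Dict, List, Optional, Set


def random_cover(
    fg_cover: Dict[str, Set[str]],
    solution_weight: Callable[[List[str]], float],
    restarts: int = 200,
    seed: Optional[int] = None,
) -> List[str]:
    """Shuffle motifs, keep those that add coverage, return the lightest cover."""
    rng = random.Random(seed)  # seed=None lets Python seed from the OS or clock
    universe = set().union(*fg_cover.values())
    best: Optional[List[str]] = None
    for _ in range(restarts):
        order = sorted(fg_cover)
        rng.shuffle(order)
        covered: Set[str] = set()
        picked: List[str] = []
        for motif in order:
            if covered == universe:
                break
            if fg_cover[motif] - covered:  # adds at least one new sequence
                picked.append(motif)
                covered |= fg_cover[motif]
        if best is None or solution_weight(picked) < solution_weight(best):
            best = picked
    return best if best is not None else []


fg = {"big": {"s1", "s2", "s3"}, "x": {"s1"}, "y": {"s2"}, "z": {"s3"}}
solution = random_cover(fg, solution_weight=len, restarts=200, seed=1)
```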

3.1.6 Weighted Simulated Annealing Set Cover

Simulated annealing algorithms search for the global optimum by iteratively accepting and rejecting random neighbor states based on a temperature function and an acceptance probability. This allows for many worse solutions to be accepted when the temperature is hot and fewer worse solutions to be accepted as the temperature cools. The implemented weighted simulated annealing set cover algorithm starts with a random solution generated by the random set cover algorithm, and navigates between neighbor states with a temperature function using three randomly selected operations: swapping a motif used in the solution with a motif not used in the solution, deleting a used motif from the solution, and adding an unused motif to the solution. At all time points a set cover is maintained. These three operations allow all possible solution states to be reached; this can be shown by adding all unused motifs to the solution and then deleting used motifs until the desired solution is obtained. A plateau parameter is used to allow the temperature to remain constant for a set number of iterations before the temperature function begins to cool. Three temperature functions were implemented:

the Boltzmann temperature function $T = \frac{T_0}{\log(k)}$, an exponential temperature function $T = T_0 \times 0.95^{k}$, and a fast temperature function $T = \frac{T_0}{k}$ [31], [32], where $T_0$ denotes the initial temperature. In all cases k is the annealing parameter, which is set as the current iteration number with a minimum value of two to ensure that no temperature is undefined. The acceptance probability of selecting a worse neighbor state is calculated as

$$\mathit{AcceptanceProbability} = \frac{1}{1 + e^{\Delta / T}}$$

which produces a value between 0 and $\frac{1}{2}$; therefore a random double between 0 and $\frac{1}{2}$ is generated and compared to the acceptance probability to determine whether a worse solution is selected.

Throughout the algorithm the best solution is stored and is the value returned. The implemented algorithm stops when either the maximum number of iterations is reached or the algorithm stalls, defined by the value $\frac{\sum \Delta \mathit{Weight}}{\mathit{WeightChangeCounter}}$ going below the stall limit. The algorithm is parameterized to accept a solution based weight scheme, defined below, and can use random restarts, in which case only the best solution is returned. Unless otherwise defined, the simulated annealing algorithm uses the Boltzmann temperature function, one hundred plateau steps, no random restarts, a stall limit of 0.000001 and the number of background sequences covered as the solution based weight scheme.
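The temperature schedules and the acceptance test can be sketched as below. The function names, the symbol t0 for the initial temperature and the overflow guard are assumptions of this sketch; delta is taken to be the (positive) increase in solution weight when moving to a worse neighbor.

```python
import math
import random


def boltzmann(t0: float, k: int) -> float:
    """T = t0 / log(k), with k clamped to at least 2 so log(k) > 0."""
    return t0 / math.log(max(k, 2))


def exponential(t0: float, k: int) -> float:
    """T = t0 * 0.95**k."""
    return t0 * 0.95 ** max(k, 2)


def fast(t0: float, k: int) -> float:
    """T = t0 / k."""
    return t0 / max(k, 2)


def acceptance_probability(delta: float, temperature: float) -> float:
    """1 / (1 + e^(delta / T)); at most 0.5 when delta >= 0."""
    exponent = delta / temperature
    if exponent > 700.0:  # math.exp would overflow; probability is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(exponent))


def accept_worse(delta: float, temperature: float, rng: random.Random) -> bool:
    # The draw is taken between 0 and 0.5 to match the range of the probability.
    return rng.uniform(0.0, 0.5) < acceptance_probability(delta, temperature)


rng = random.Random(0)
hot = acceptance_probability(1.0, boltzmann(100.0, 2))
cold = acceptance_probability(1.0, boltzmann(100.0, 10_000))
decision = accept_worse(1.0, fast(100.0, 50), rng)
```

As the schedule cools, the same weight increase yields a smaller acceptance probability, so fewer worse solutions are accepted late in the run.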

3.1.7 Integer Linear Programming Formulation

Fortunately, there exists an integer linear programming (ILP) formulation of the weighted set cover problem. By using an ILP formulation of weighted set cover, existing software libraries, such as GLPK and Gurobi Optimizer, can be leveraged to solve the weighted set cover problem [33], [34]. The integer linear programming formulation can be defined as finding a {0,1} vector $\vec{x}$ of length $n = |\mathit{motifs}|$ such that $\vec{c} \cdot \vec{x}$ is minimized subject to $A\vec{x} \le \vec{b}$, where $A$ is an $m \times n$ adjacency matrix representing the existence of n motifs in m sequences such that

$$a_{ij} = \begin{cases} -1 & \text{if } n_j \in m_i \\ 0 & \text{otherwise} \end{cases}$$

$\vec{b} = -\vec{1}$, of length m, defines all foreground sequences as required, and $\vec{c}$ is a vector of motif weights, of length n, defined by the weight scheme used.
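To make the formulation concrete, the sketch below builds A, b and c for a toy instance and solves it by exhaustive enumeration. The enumeration merely stands in for a real ILP solver such as GLPK or Gurobi and is only viable for a handful of motifs; all names are illustrative assumptions.

```python
from itertools import product
from typing import Dict, List, Optional, Sequence, Set, Tuple


def build_ilp(
    motifs: Sequence[str],
    sequences: Sequence[str],
    fg_cover: Dict[str, Set[str]],
    weights: Dict[str, float],
) -> Tuple[List[List[int]], List[int], List[float]]:
    """Return (A, b, c) for: minimise c.x subject to A x <= b, x in {0,1}^n."""
    # Row i is a sequence, column j a motif; -1 marks "motif j occurs in sequence i".
    A = [[-1 if s in fg_cover[m] else 0 for m in motifs] for s in sequences]
    b = [-1] * len(sequences)  # every foreground sequence must be covered
    c = [weights[m] for m in motifs]
    return A, b, c


def solve_by_enumeration(
    A: List[List[int]], b: List[int], c: List[float]
) -> Tuple[Optional[List[int]], float]:
    """Try every 0/1 vector; a stand-in for an ILP solver on tiny inputs."""
    best_x, best_cost = None, float("inf")
    for x in product((0, 1), repeat=len(c)):
        feasible = all(
            sum(a * xi for a, xi in zip(row, x)) <= bi for row, bi in zip(A, b)
        )
        cost = sum(ci * xi for ci, xi in zip(c, x))
        if feasible and cost < best_cost:
            best_x, best_cost = list(x), cost
    return best_x, best_cost


motifs = ["m1", "m2", "m3"]
sequences = ["s1", "s2", "s3"]
fg = {"m1": {"s1", "s2"}, "m2": {"s2", "s3"}, "m3": {"s1", "s2", "s3"}}
w = {"m1": 1.0, "m2": 1.0, "m3": 2.5}
A, b, c = build_ilp(motifs, sequences, fg, w)
x, cost = solve_by_enumeration(A, b, c)
```

With these weights, selecting m1 and m2 together is cheaper than selecting m3 alone, even though m3 covers every sequence by itself.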

3.1.8 Linear Programming using Branch and Cut

Branch and cut algorithms solve for the optimal solution to integer linear programs by branching through the possible solutions, while cutting off branches that cannot give better solutions. The decision to cut branches is determined by comparing the best known solution to the linear programming relaxation solution of the current branch. Since the linear programming relaxation is guaranteed to produce a solution smaller than or equal to the optimal solution, for a minimization problem like set cover, it can be used as a lower bound for the best possible solution of a branch. If the solution of the linear programming relaxation is worse than the best known solution, the branch is cut and no longer explored, while if the solution is better, the branch is explored further. The GNU Linear Programming Kit (GLPK) was initially used to solve the integer linear programming formulation of weighted set cover, defined above, using its branch and cut algorithm. The GLPK software package was found to take an extremely long time to identify optimal solutions in practice, so it was replaced with Gurobi Optimizer 6.05, which was significantly faster. Unfortunately, solving for the optimal solution to the weighted set cover problem using branch and cut takes exponential time and in practice can take a very long time to identify a solution; therefore this approach is not always feasible. This was the case for four of the experiments in the ENCODE data set, for which the Gurobi Optimizer could not identify the optimal solution within 24 hours of running on an i5-2600k Intel processor.

3.1.9 Linear Programming Relaxation with Randomized Rounding

Linear programming relaxation with randomized rounding can produce a solution for an integer linear program within a factor of $\ln(n)$ of the optimal integer solution. By solving the linear programming relaxation of the integer linear program, which yields a vector with entries in [0,1], the values can be randomly rounded to a 1 or a 0, with the value at each position used as the probability of rounding to 1, for a number of iterations until a solution is found. The linear programming relaxation implementation uses Gurobi Optimizer 6.05 with the weighted set cover formulation defined above. Randomized rounding is then applied to the solution for one thousand iterations, and the best solution found is returned. The linear programming relaxation is solvable in polynomial time, but the optimal solution is not guaranteed. In practice the algorithm finishes quickly and produces good results.
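The rounding step can be sketched as follows. The fractional vector is supplied directly as input here (in practice it would come from an LP solver), and the function name, the seed parameter and the rule of rounding each entry to 1 with probability equal to its fractional value are assumptions of this illustration.

```python
import random
from typing import Dict, List, Optional, Set


def randomized_rounding(
    fractional: Dict[str, float],
    fg_cover: Dict[str, Set[str]],
    weights: Dict[str, float],
    iterations: int = 1000,
    seed: Optional[int] = 0,
) -> Optional[List[str]]:
    """Round an LP-relaxation solution repeatedly; keep the cheapest valid cover."""
    rng = random.Random(seed)
    universe = set().union(*fg_cover.values())
    best, best_cost = None, float("inf")
    for _ in range(iterations):
        # Each motif is selected with probability equal to its fractional value.
        picked = [m for m in sorted(fractional) if rng.random() < fractional[m]]
        covered = set().union(*(fg_cover[m] for m in picked))
        cost = sum(weights[m] for m in picked)
        if covered == universe and cost < best_cost:
            best, best_cost = picked, cost
    return best  # None if no rounding produced a cover


fg = {"m1": {"s1", "s2"}, "m2": {"s3"}, "m3": {"s3"}}
lp_solution = {"m1": 1.0, "m2": 0.5, "m3": 0.5}  # hypothetical relaxation output
unit = {"m1": 1.0, "m2": 1.0, "m3": 1.0}
solution = randomized_rounding(lp_solution, fg, unit)
```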

3.2 Filter Methods

Filter methods aim to reduce a solution set of motifs, with complete sequence coverage, to a smaller set of motifs with slightly reduced sequence coverage. The motivation behind such algorithms is a phenomenon, which was repeatedly observed, in which the number of motifs in the solution set doubles, triples or quadruples in order to cover the last ten percent of the sequences. This suggests that a fraction of the sequences may not contain a strong binding site motif. Using this observation as motivation, these algorithms were designed to reduce the set of motifs in the solution sets by selectively removing motifs based on a percentage threshold of the set of sequences covered.

3.2.1 Greedy Removal Method

The premise behind the greedy removal filter algorithm is to greedily select and delete motifs that cover a small proportion of sequences. This algorithm is implemented by assigning each motif a sequence coverage weight equal to the number of unique sequences a motif covers divided by the total number of sequences covered in the input solution set.

The motif with the smallest unique coverage ratio below the unique coverage threshold is then deleted and the rest of the motif weights are recalculated. This process continues until there are no motifs remaining with weights below the unique coverage threshold. Unless otherwise defined, a unique coverage ratio of two percent was used as the threshold for this algorithm. This algorithm can result in the removal of all of the motifs if the solution set consists of many motifs that all have sequence coverage ratios less than the unique coverage threshold. This type of solution set is uncommon and may indicate that no strong motifs exist in the solution.
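A sketch of the greedy removal filter is shown below. It reads "unique sequences a motif covers" as the sequences covered by that motif and by no other motif still in the solution, and keeps the denominator fixed at the coverage of the input solution; both readings, along with the names used, are assumptions of this example.

```python
from typing import Dict, List, Set


def greedy_removal(
    solution: List[str],
    fg_cover: Dict[str, Set[str]],
    threshold: float = 0.02,
) -> List[str]:
    """Repeatedly delete the motif with the lowest unique coverage ratio."""
    kept = list(solution)
    total = len(set().union(*(fg_cover[m] for m in solution)))
    while kept and total:
        ratios = {}
        for m in kept:
            others = set().union(*(fg_cover[o] for o in kept if o != m))
            ratios[m] = len(fg_cover[m] - others) / total
        worst = min(sorted(kept), key=lambda m: ratios[m])
        if ratios[worst] >= threshold:
            break  # every remaining motif meets the threshold
        kept.remove(worst)  # ratios are recalculated on the next pass
    return kept


fg = {
    "a": {f"s{i}" for i in range(1, 6)},   # s1..s5
    "b": {f"s{i}" for i in range(5, 10)},  # s5..s9
    "c": {"s10"},
}
filtered = greedy_removal(["a", "b", "c"], fg, threshold=0.15)
```

In this toy solution motif c uniquely covers one of ten sequences, which falls below the 15 percent threshold chosen for the example, so it is removed.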

3.3 Weight Schemes

3.3.1 Solution Based Weight Schemes

Solution based weight schemes are defined as schemes that are passed a set of motifs which solve the set cover problem and return a weight for the entire solution for use in a minimization algorithm. In total five weight schemes were considered, where the weight was calculated as: the number of selected motifs, the number of background sequences covered, the number of selected motifs multiplied by the background sequences covered, the number of background sequences covered over the average information content of the selected motifs, and the number of background sequences covered over the average score of the selected motifs. The number of selected motifs weight scheme selects for solutions that use the fewest motifs. The number of background sequences covered weight scheme selects for solutions that cover the fewest background sequences. The number of selected motifs multiplied by the background sequences covered weight scheme attempts to select for solutions that use the fewest motifs and cover the fewest background sequences. The number of background sequences covered over the average information content of the selected motifs selects for motifs that cover few background sequences and have a high information content. The number of background sequences covered over the average score of the selected motifs weight scheme selects for motifs that cover few background sequences and have a high average motif score, determined by the motif discovery application.
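The five solution based weights can be computed together as in the sketch below. The dictionary inputs and key names are hypothetical, and the sketch assumes a non-empty solution with positive information content and motif scores.

```python
from typing import Dict, List, Set


def solution_weights(
    solution: List[str],
    bg_cover: Dict[str, Set[str]],
    ic: Dict[str, float],
    score: Dict[str, float],
) -> Dict[str, float]:
    """Return all five solution based weights for one candidate solution."""
    n = len(solution)
    # Background sequences hit by any selected motif, counted once each.
    bg = len(set().union(*(bg_cover[m] for m in solution)))
    mean_ic = sum(ic[m] for m in solution) / n
    mean_score = sum(score[m] for m in solution) / n
    return {
        "motifs": float(n),
        "bg": float(bg),
        "motifs_x_bg": float(n * bg),
        "bg_over_mean_ic": bg / mean_ic,
        "bg_over_mean_score": bg / mean_score,
    }


bg_cover = {"a": {"b1", "b2"}, "b": {"b2", "b3"}}
weights = solution_weights(
    ["a", "b"], bg_cover, ic={"a": 10.0, "b": 14.0}, score={"a": 1.0, "b": 2.0}
)
```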

3.3.2 Greedy Based Weight Schemes

Greedy based weight schemes are defined as schemes that return a weight value for a single motif for use in a minimization algorithm. Five greedy weight schemes were explored to select for certain motif features: the number of background sequences, the ratio of background sequences over foreground sequences, the ratio of background sequences over information content over foreground sequences, the ratio of background sequences over motif score over foreground sequences and the ratio of one over the foreground sequences. The ratio of background sequences over foreground sequences weight scheme selects for motifs that discriminate between background and foreground sequences. The ratio of background sequences over information content over foreground sequences selects for motifs that discriminate between background and foreground promoters while also having a high information content. The ratio of background sequences over motif score over foreground sequences selects for motifs that discriminate between foreground and background sequences but also have a high motif score as calculated by the motif discovery application. Finally, the ratio of one over the foreground sequences selects for motifs that cover the most foreground sequences. This metric has an edge case: when zero foreground sequences are covered, positive infinity is returned. This weight reduces the weighted set cover algorithm to the un-weighted version of the algorithm.
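The per-motif schemes can be sketched as a small table of functions. The class and key names are hypothetical. The chained ratios are read left to right (background divided by information content, then by foreground count), and the last scheme is written as one over the foreground count, following the un-weighted case described in Section 3.1.2 and the zero-foreground edge case; both readings are assumptions, as is returning infinity for every per-foreground scheme when no foreground sequence is covered.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MotifStats:
    fg: int       # foreground sequences covered
    bg: int       # background sequences covered
    ic: float     # information content, assumed positive
    score: float  # score from the motif discovery tool, assumed positive


def _per_fg(value: float, fg: int) -> float:
    # Zero foreground coverage maps to +infinity so the motif is never preferred.
    return value / fg if fg > 0 else math.inf


GREEDY_SCHEMES = {
    "bg": lambda m: float(m.bg),
    "bg_over_fg": lambda m: _per_fg(m.bg, m.fg),
    "bg_over_ic_over_fg": lambda m: _per_fg(m.bg / m.ic, m.fg),
    "bg_over_score_over_fg": lambda m: _per_fg(m.bg / m.score, m.fg),
    "one_over_fg": lambda m: _per_fg(1.0, m.fg),
}

motif = MotifStats(fg=4, bg=2, ic=8.0, score=2.0)
values = {name: scheme(motif) for name, scheme in GREEDY_SCHEMES.items()}
```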

3.4 Metrics

To quantify each set cover method’s ability to identify biologically significant motifs, the set of known motifs was compared to the set of predicted motifs. From these comparisons the number of true positives, false positives, true negatives and false negatives were calculated along with other classification metrics. In total five comparisons were made between these two groups: comparing the individual nucleotides between the two sets, comparing the individual binding sites (words) between the two sets, comparing the selected set of motifs against the known motifs using the set of discovered motifs as the universe, comparing the selected set of motifs against the known set of motifs using the mapping of discovered motifs to known motifs as the universe, and comparing the selected set of motifs against the known set of motifs using the set of known motifs as the universe. Each of these comparisons has its advantages and disadvantages.

3.4.1 Comparisons

The five comparisons made between the known and predicted motifs each serve a different purpose. Using FIMO to map the set of known motifs for the target transcription factor onto the testing and background sequences allows for comparisons to be made between individual nucleotides and individual binding sites (words). Comparing individual nucleotides between the known and predicted sets of motifs serves the purpose of identifying how well the motif discovery and set coverage processes do at identifying biologically significant nucleotides. Comparing the individual binding sites (words) between the known and predicted motifs identifies how well the motif discovery and set coverage processes do at identifying each motif binding site (word). Since all nucleotides and binding sites (words) are considered, it is important to recognize that these comparisons can be heavily influenced by the motif discovery process.

Using TOMTOM, a motif comparison tool, to compare the set of all known motifs against the set of predicted motifs allows for direct comparisons between motifs without using nucleotides or words. The comparison between the selected set of motifs and known set of motifs using the set of discovered motifs as the universe serves to identify the capability of the set cover methods to identify the biologically significant motif somewhat independently of the motif discovery process. Since the comparison universe is restricted to the set of discovered motifs, the individual set cover methods are not as strongly penalized for poor motifs discovered by the motif discovery application as they would be if the entire set of known motifs were used for the universe. The comparison between the selected set of motifs and the known set of motifs using the mapping of discovered motifs to known motifs as the universe serves the purpose of identifying how well each set cover method does at identifying known motifs of the target transcription factor in terms of the set of all known motifs, across many transcription factors. This comparison also attempts to remove some of the bias associated with the motif discovery process by only using the mapping of discovered motifs to known motifs as the universe. The final comparison between the selected set of motifs and the known set of motifs using the set of known motifs as the universe serves the purpose of identifying how well the set of known motifs was classified. This comparison depends somewhat heavily on the set of all known motifs, as similar motifs of different transcription factors could bias this comparison. Overall, these comparisons each serve a slightly different purpose, each with their advantages and disadvantages.

3.4.2 Classification Metrics

For each comparison made, several classification metrics were calculated from the number of true positives, true negatives, false positives and false negatives, including accuracy, sensitivity (recall), specificity, precision, balanced accuracy, informedness and F1 score. The metrics were calculated as follows:

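$$\mathit{Sensitivity} = \mathit{Recall} = \frac{\text{True Positive Count}}{\text{True Positive Count} + \text{False Negative Count}}$$

$$\mathit{Specificity} = \frac{\text{True Negative Count}}{\text{True Negative Count} + \text{False Positive Count}}$$

$$\mathit{Precision} = \frac{\text{True Positive Count}}{\text{True Positive Count} + \text{False Positive Count}}$$

$$\mathit{Accuracy} = \frac{\text{True Positive Count} + \text{True Negative Count}}{\text{Total Count}}$$

$$\mathit{Balanced\ Accuracy} = \frac{\mathit{Sensitivity} + \mathit{Specificity}}{2}$$

$$\mathit{Informedness} = \mathit{Sensitivity} + \mathit{Specificity} - 1$$

$$F_1\ \mathit{Score} = 2 \times \frac{\mathit{Precision} \times \mathit{Sensitivity}}{\mathit{Precision} + \mathit{Sensitivity}}$$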
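The metrics above can be computed from the four counts as in the following sketch. The function names are hypothetical. Any ratio with a zero denominator is returned as zero, and the derived metrics (balanced accuracy, informedness, F1) are then computed from those zeroed values; that propagation rule is an assumption of this sketch.

```python
from typing import Dict


def _ratio(numerator: float, denominator: float) -> float:
    """Division that yields 0.0 instead of failing on a zero denominator."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    sensitivity = _ratio(tp, tp + fn)
    specificity = _ratio(tn, tn + fp)
    precision = _ratio(tp, tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "informedness": sensitivity + specificity - 1,
        "f1": _ratio(2 * precision * sensitivity, precision + sensitivity),
    }


metrics = classification_metrics(tp=3, tn=9, fp=3, fn=1)
```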

When calculating the averages of these metrics across multiple experiments if any of these metrics were undefined, via division by zero, it was replaced with zero instead. The definitions used for true positives, false positives, true negatives and false negatives for each of the five comparison methods were defined as follows:

FIMO - Comparing individual nucleotides

True Positives: Number of nucleotides in set of predicted nucleotides and in set of known nucleotides for target transcription factor.

True Negatives: Number of nucleotides not in set of predicted nucleotides and not in nucleotides for target transcription factor.

False Positives: Number of nucleotides in predicted set of nucleotides and not in nucleotides for target transcription factor.

False Negatives: Number of nucleotides not in predicted set of nucleotides and in nucleotides for target transcription factor.

FIMO - Comparing individual binding sites (words)

True Positives: Number of binding sites in set of predicted binding sites and in set of known binding sites for target transcription factor.

True Negatives: Number of binding sites not in set of predicted binding sites and not in binding sites for target transcription factor.

False Positives: Number of binding sites in predicted set of binding sites and not in binding sites for target transcription factor.

False Negatives: Number of binding sites not in predicted set of binding sites and in binding sites for target transcription factor.

TOMTOM - Comparing motifs using the set of discovered motifs as the universe

True Positives: Number of discovered motifs that match predicted motifs and match at least one occurrence of the target transcription factor's motifs.

True Negatives: Number of discovered motifs that do not match predicted motifs and do not match at least one occurrence of the target transcription factor's motifs.

False Positives: Number of discovered motifs that match predicted motifs and do not match at least one occurrence of the target transcription factor's motifs.

False Negatives: Number of discovered motifs that do not match predicted motifs and match at least one occurrence of the target transcription factor's motifs.

TOMTOM - Comparing motifs using the mapped known motifs as the universe

True Positives: Number of discovered motifs mapped to known motifs that match mapped predicted motifs and match the target transcription factor.

True Negatives: Number of discovered motifs mapped to known motifs that do not match mapped predicted motifs and do not match the target transcription factor.

False Positives: Number of discovered motifs mapped to known motifs that match mapped predicted motifs and do not match the target transcription factor.

False Negatives: Number of discovered motifs mapped to known motifs that do not match mapped predicted motifs and match the target transcription factor.

TOMTOM - Comparing motifs using the set of known motifs as the universe

True Positives: Number of known motifs that match mapped predicted motifs and match the target transcription factor.

True Negatives: Number of known motifs that do not match mapped predicted motifs and do not match the target transcription factor.

False Positives: Number of known motifs that match mapped predicted motifs and do not match the target transcription factor.

False Negatives: Number of known motifs that do not match mapped predicted motifs and match the target transcription factor.

The first two comparisons, both utilizing FIMO, define known nucleotides as the set of nucleotides that were mapped from the known biological motif PWM onto the set of sequences using FIMO. They define known binding sites as the set of binding sites that were mapped from the known biological motif PWM onto the set of sequences using FIMO. They define predicted nucleotides as the set of nucleotides mapped from the solution set of predicted motifs onto the sequences using FIMO, and finally they define predicted binding sites as the set of binding sites mapped from the solution set of predicted motifs onto the sequences using FIMO. For the last three comparisons, all using TOMTOM, the set of known motifs is the set of known biological motifs used in Kheradpour’s 2013 paper [35] mapped to the set of sequences using FIMO, the set of predicted motifs is the solution set of predicted motifs identified by the set cover algorithms mapped to the set of sequences using FIMO, and the set of discovered motifs is the set of all motifs identified by the motif discovery application mapped to the sequences using FIMO.

3.4.3 Ranking

One of the main comparisons made between the different objective functions and filters is rank. Rank is calculated as the position of an element in an ordering of numerical data, with ties resulting in the same position. For instance, the set of numbers {23,44,35,6,23,11} ranked in ascending order would result in the ranking {3,6,5,1,3,2}. Notice that there is no number with a ranking of 4, since there are two 23s in the set, resulting in a tie at third position and therefore two rankings of 3.
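This ranking rule can be sketched in a few lines of Python; the function name is an illustrative choice. Looking up the first position of each value in the sorted list gives tied values the same rank and leaves the following rank unused.

```python
from typing import List, Sequence


def rank_ascending(values: Sequence[float]) -> List[int]:
    """Rank values in ascending order; ties share the lowest applicable rank."""
    ordered = sorted(values)
    # list.index returns the first match, so equal values get the same position.
    return [ordered.index(v) + 1 for v in values]


ranks = rank_ascending([23, 44, 35, 6, 23, 11])
```

For the example in the text, this yields the ranking 3, 6, 5, 1, 3, 2, with no element ranked fourth.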

3.5 Baseline Motif Selection Methods

Several motif selection methods were also implemented to act as a baseline to which the set cover methods could be compared. These methods include selecting the top 10 motifs with the highest score output by DME, selecting the top 10 motifs which cover the most foreground sequences, selecting the top 10 motifs with the highest information content, and selecting all motifs. Selecting the top 10 highest scoring motifs is based on the commonly used method of selecting the top

CHAPTER 4: CASE STUDIES

In this chapter three case studies are presented that utilize the set cover methods presented in this thesis: the ENCODE case study, the rat estrous cycle case study, and the Brugia malayi case study. The ENCODE case study was chosen to benchmark the effectiveness of set cover algorithms in identifying biologically significant motifs. The Brugia malayi case study utilizes the weighted set cover methods in the identification of motifs that exist only in the foreground. The rat estrous cycle case study uses the weighted set cover methods to identify biologically significant motifs related to the estrous cycle of rats. Together these case studies demonstrate the effectiveness and utility of applying weighted set cover methods to motif discovery.

4.1 ENCODE Case Study

ENCODE data from 402 ChIP-seq experiments from was used to assay the effectiveness of the weighted set cover methods to identify biological motifs. Peak regions extracted from the were input into the discriminative matrix enumerator (DME) motif discovery application to identify putative motifs. Subsets of the putative motifs were identified using the weighted set cover methods and were compared against known transcription factor databases. The resulting motif database comparisons demonstrate the capability of the weighted set cover algorithms to identify the correct biologically significant motifs.

The narrow peak files for each of the ENCODE ChIP-seq experiments were downloaded and ordered by their e-values. The start and the stop coordinates of the top 250 most significant peaks were then extracted from the hg19 unmasked genome and formatted into fasta format for input into DME. The next 1000 peaks, after the top 250, were also extracted from the human genome for use as a testing data set. A total of 500 random intergenic background regions, defined by Kheradpour, were then extracted from the human genome and formatted for use as a background data set in DME [35].

For each ChIP-seq experiment DME was executed three times for motif lengths 8, 10 and 12, with the top 250 sequences as the foreground and the 500 intergenic sequences as the background. Each DME run searched for 100 single-stranded motifs, resulting in a total of at most 300 motifs per ChIP-seq experiment. All 300 motifs were then mapped onto the 1000 sequences in the testing data set and the 500 background sequences using FIMO with an e-value of 0.05. The motifs were mapped onto the testing data set to remove bias that could arise from the use of FIMO when calculating the classification metrics.

All mapped motifs for each ChIP-seq experiment were combined into a single set that was passed as input to the weighted set cover algorithms. The weighted set cover algorithms run include: weighted greedy set cover, weighted relaxed greedy set cover, weighted modified greedy set cover, weighted hill climbing set cover, weighted simulated annealing set cover, weighted random set cover, integer linear programming using branch and cut and relaxed integer linear programming using randomized rounding. Each of the three set cover algorithms, weighted greedy set cover, weighted relaxed greedy set cover and weighted modified greedy set cover, was run using the five greedy weight schemes, defined above. All five of the solution weight schemes, defined above, were also used with weighted hill climbing set cover, weighted simulated annealing and weighted random cover. Weighted simulated annealing was also run for the three temperature functions, defined above.

The resulting sets of motifs identified by the weighted set cover algorithms were then analyzed to identify their effectiveness in identifying biologically significant motifs.

Every set of motifs identified by the weighted set coverage algorithms was compared against the set of known transcription factor binding sites used by Kheradpour 2013, using TOMTOM with an e-value threshold of 0.05 [35]. Likewise, all motifs identified by the weighted set cover methods were also compared against the known transcription factor binding sites identified by Kheradpour using FIMO, with an e-value threshold of 0.05, to map the known database of transcription factors onto the 1000 testing sequences.

Using both TOMTOM and FIMO allows for comparisons between the known transcription factor binding sites and the discovered motifs at the nucleotide level, binding site (word) level, and motif level. By comparing motifs discovered in the peak regions of ChIP-seq experiments that target specific transcription factors against the known transcription factors, the classification capabilities of the weighted set cover methods can be evaluated in terms of true positives, true negatives, false positives and false negatives. From these classifications, accuracy, precision, sensitivity and specificity can be derived for each method of weighted set cover.

4.2 Rat Estrous Cycle Case Study

With the average human lifespan increasing, women are spending a larger portion of their lives postmenopause. Therefore, studying and understanding the female hormone cycle is becoming more and more important. To better understand the effects of the loss of hormone cyclicity, microarray experiments were conducted on twelve three-month old female Fischer 344 rats. Brain tissue samples were taken from three different brain regions, the basal forebrain, the hippocampus, and the frontal cortex, at three time points during the estrous cycle: estrous 10:00 AM (E10), estrous 6:00 PM (E6), and diestrous 10:00 AM (D10). RNA was extracted from these samples using the Absolutely RNA miniprep Kit and analyzed on an ND-1000 Spectrophotometer. The resulting data from the microarray chips was statistically analyzed for the identification of differentially expressed genes and statistically overrepresented motifs in the gene promoter regions. These motifs were then analyzed using the weighted set cover methods to identify sets of biologically significant motifs.

Preprocessing of the data for the thirty-four microarray chips was conducted using robust multichip analysis (RMA) in Bioconductor v2.12 package affy v1.38.1 [36], [37]. Quality analysis of the data using outlier detection for relative log expression (RLE) and outlier detection for normalized unscaled standard error (NUSE) in Bioconductor package arrayQualityMetrics v3.16.0 revealed no outliers in the data before RMA preprocessing [38]. Similarly, no outliers were detected after RMA preprocessing using a Hoeffding’s statistic threshold of 0.15. Differentially expressed probe sets across the three estrous cycle time points (E10, E6 and D10) for each region of the brain (basal forebrain, frontal cortex and hippocampus) were identified using the Bayesian estimation of temporal regulation (BETR) algorithm [39]. The probability of differential expression for each probe set was calculated using Bioconductor package betr v1.16.0 with an FDR of 0.5. A probability cutoff threshold of 0.99 was used in BETR to identify differentially expressed probe sets.

Fuzzy clustering of the differentially expressed probes identified by the BETR algorithm was conducted using the Bioconductor package Mfuzz 2.18.0 [40]. Fuzzy clustering was selected over hard clustering because time-course data, as in this case, can have significant overlap of probe sets between time points. The number of initial clusters was set to nine after empty cluster testing and gene ontology (GO) term analysis.
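To illustrate why fuzzy clustering suits overlapping probe sets, the sketch below computes fuzzy c-means membership values, the clustering model on which Mfuzz is built, for a single expression profile. The fuzzifier value `m=2.0` and the function name are placeholders chosen for illustration; the settings used in the study are not stated here.

```python
import math


def fuzzy_memberships(point, centroids, m=2.0):
    """Fuzzy c-means membership of one expression profile in each cluster.

    point:     expression values at each time point
    centroids: one profile per cluster centre
    m:         fuzzifier (> 1); 2.0 is an illustrative placeholder
    """
    dists = [math.dist(point, c) for c in centroids]
    # A profile sitting exactly on a centroid belongs fully to it.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
        for d_i in dists
    ]
```

Unlike a hard assignment, the memberships sum to one across clusters, so a probe set lying between two cluster centres contributes to both rather than being forced into one.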

All differentially expressed probe sets and clusters were mapped to corresponding rat genes using the rat2303 annotation database. Enriched GO terms for differentially expressed rat genes were identified using hypergeometric probabilities via Bioconductor package GOstats v2.26.0 [41]. A probability of 0.05 was used to determine GO term enrichment for a set of genes. Overrepresented KEGG pathways, BioCarta pathways and GO terms between the possible pairwise combinations of the three time points (E10, E6, D10) across the whole brain and each region of the brain were identified using Bioconductor package Gage v2.10.0.
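The hypergeometric probability behind GO term enrichment can be sketched as follows. This shows only the basic upper-tail test; GOstats offers further options that are not reproduced here, and the argument names are illustrative rather than part of any package.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """Probability of seeing at least `hits` annotated genes by chance.

    population: number of genes in the background universe
    annotated:  genes in the universe carrying the GO term
    sample:     number of differentially expressed genes
    hits:       differentially expressed genes carrying the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total
```

Under this sketch, a GO term would be called enriched for a gene set when the returned probability falls below the 0.05 cutoff mentioned above.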

The promoter sequences of the genes in the clusters identified by Mfuzz were extracted from the rat genome, rn5, for input as the foreground data set in DME, while the promoter regions of the least significant genes, identified via the BETR algorithm, were extracted for use as the background data set. DME was executed on each of the clusters of promoter sequences to identify the significantly overrepresented DNA elements in the promoter regions of the genes in each cluster [8]. DME was run three times, for motif lengths 8, 10 and 12, producing 100 putative single-stranded motifs per run for a total of 300 putative motifs per cluster.

The sets of motifs identified for each cluster were combined into a single set that was passed as input to the weighted set cover algorithms. The weighted set cover algorithms run include: weighted greedy set cover, weighted relaxed greedy set cover, weighted modified greedy set cover, weighted hill climbing set cover, weighted simulated annealing set cover, weighted random set cover, branch and cut, and relaxed linear programming with randomized rounding. Five of these algorithms (weighted greedy set cover, weighted relaxed greedy set cover, weighted modified greedy set cover, branch and cut, and relaxed linear programming with randomized rounding) were each run using the five greedy weight schemes defined above. The other three (weighted hill climbing set cover, weighted simulated annealing set cover and weighted random set cover) were each run on all five solution-based weight schemes. Weighted simulated annealing was also run on all three temperature functions defined above. The sets of motifs identified by the weighted set cover algorithms were queried against the JASPAR database of known transcription factor binding sites of vertebrates [42]. The known transcription factor binding sites that matched the putative motifs were presented along with the putative transcription factors identified for each cluster.

4.3 Brugia Malayi Case Study

Brugia malayi is a nematode parasite that causes Lymphatic Filariasis, more commonly known as Elephantiasis, which is characterized by intense swelling of limbs. The disease can have debilitating effects on the human body, and is thought to infect over 120 million people throughout the world. In 2007 the Brugia malayi genome was sequenced, and in 2011 the transcriptomes of each of the five lifecycle stages of the worm were sequenced to gain a better understanding of the worm’s biology [43], [44]. One lifecycle stage is particularly important for Brugia malayi: lifecycle stage 3 (L3). In L3, Brugia malayi enters the human body, usually through contaminated water. If the L3 lifecycle stage were disrupted, the worm would not be able to grow inside the human host and infect them with Lymphatic Filariasis. Therefore, identifying regulatory elements in Brugia malayi that are expressed in the L3 stage is of great interest for the identification of druggable targets. Through the identification of differentially expressed genes between the L3 and L4 lifecycle stages, promoter regions can be extracted and statistically analyzed for overrepresented DNA elements.

To identify the statistically overrepresented DNA elements in the L3 stage, the promoter regions of the differentially expressed genes in the L3 stage were extracted for use as a foreground dataset in DME. The promoter regions of the differentially expressed genes in all other stages were originally used as the background data set, but due to size limitations, the background data set was reduced to only the differentially expressed genes in the L4 stage. Once the foreground and background datasets were obtained, DME was run on these sets for motif lengths 8, 10 and 12. Each DME run searched for 100 single stranded motifs, resulting in a total of at most 300 motifs.

Since the L3 stage is specifically targeted for its biological importance, ideally any selected motifs would occur minimally in the other lifecycle stages. This requirement suits weighted set cover very well, as weighted set cover has the ability to minimize the number of background sequence occurrences. The weighted set cover algorithms run on the motifs identified in the L3 stage include: weighted greedy set cover, weighted relaxed greedy set cover, weighted modified greedy set cover, weighted hill climbing set cover, weighted simulated annealing set cover, weighted random set cover, branch and cut, and relaxed linear programming with randomized rounding. Five of these algorithms (weighted greedy set cover, weighted relaxed greedy set cover, weighted modified greedy set cover, integer linear programming using branch and cut, and relaxed integer linear programming with randomized rounding) were each run using the five greedy weight schemes defined above. The other three (weighted hill climbing set cover, weighted simulated annealing set cover and weighted random set cover) were each run on all five solution-based weight schemes. Weighted simulated annealing was also run on all three temperature functions defined above.

CHAPTER 5: EVALUATION OF ALGORITHMS

The motif selection methods were evaluated on the ENCODE ChIP-seq experiments against the known biological motifs in the JASPAR, TRANSFAC, BULYK and Jolma databases identified by Kheradpour 2014 [35]. Among the unfiltered weighted set cover algorithms, not including the benchmark algorithms, the branch and cut algorithms all tied for the highest average accuracy of 70.465% across 398 ChIP-seq experiments. The weighted simulated annealing algorithms followed closely behind, with the fewest motifs objective function and the Boltzmann temperature function achieving an average accuracy of 70.04% across 402 ChIP-seq experiments (the branch and cut algorithms could not finish 4 experiments). Overall, the set cover methods with objective functions that either minimize the number of motifs or minimize the number of foreground sequences performed well without filtering.

Interestingly, the baseline selection methods (the top 10 highest scoring motifs, the top 10 highest sequence coverage motifs and the top 10 highest information content motifs) did significantly better than the weighted set cover methods in average accuracy and precision. The best method without the use of filtering was selecting the top 10 highest coverage motifs.

Overall, the weighted set cover methods without filtering required 38 motifs on average to achieve 100% sequence coverage. Using a relatively large number of motifs to achieve full sequence coverage negatively impacts the average accuracy of the selection method. This relationship can be clearly shown in Figure 4 where there is a correlation between declining average percent accuracy and the number of motifs selected.

[Chart: "Number of Motifs Compared to Average Accuracy". Left axis: number of motifs selected (0 to 300); right axis: percentage (0% to 100%); horizontal axis: motif selection method. Series: average number of motifs and average accuracy.]

Figure 4. Average number of motifs selected per motif selection method compared to the average accuracy of the motif selection method.

The identification of this downward trend in accuracy and specificity with the corresponding increase in the number of motifs was the motivation for adding a filtering step to further reduce the size of the selected sets of motifs. After a two percent greedy delete filter was applied to all the weighted set cover algorithms, accuracy and specificity were greatly increased compared to the weighted set cover methods without filtering. The average number of motifs per solution set was reduced from roughly 30 to 5, while the average percentage of foreground sequences covered was reduced by only 7.78% and the average percentage of background sequences covered was reduced by 21.9%. This suggests that covering the last 7.78% of foreground sequences requires roughly six times as many motifs on average, which indicates that 100% sequence coverage may not be ideal when identifying sets of biologically significant motifs.
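The greedy delete filter itself is defined earlier in the thesis and is not restated in this chapter. The sketch below is therefore only one plausible reading, stated as an assumption: repeatedly drop the motif whose removal loses the smallest share of foreground coverage, and stop once the next removal would cost at least the threshold. All names are hypothetical.

```python
def greedy_delete_filter(selected, motif_fg, n_foreground, threshold=0.02):
    """Drop motifs whose unique foreground coverage is below `threshold`.

    Assumed behaviour, not the thesis definition.
    selected:     motifs chosen by a set cover method
    motif_fg:     motif -> set of foreground sequence ids it occurs in
    n_foreground: total number of foreground sequences
    """
    kept = list(selected)
    while len(kept) > 1:
        # Sequences covered only by this motif are lost if it is removed.
        def unique_loss(motif):
            others = set().union(*(motif_fg[m] for m in kept if m != motif))
            return len(motif_fg[motif] - others) / n_foreground

        weakest = min(kept, key=unique_loss)
        if unique_loss(weakest) >= threshold:
            break
        kept.remove(weakest)
    return kept
```

Under this reading, raising the threshold from 2% to 5% or 10% trades a little more foreground coverage for a smaller motif set.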

With two percent greedy filtering, the weighted set cover algorithm with the highest average accuracy was the weighted relaxed greedy set cover algorithm with the lowest ratio objective function, at an accuracy of 73.7%. This method was closely followed by the weighted greedy set cover algorithm with the highest score objective function, at an accuracy of 73.6%. Overall, the greedy, relaxed greedy and restricted greedy algorithms performed very well when combined with two percent greedy filtering. The weighted greedy algorithms with filtering outperformed the unweighted versions of the algorithms, in which only foreground sequence coverage is considered. Even with the increased effectiveness of the set cover methods, the baseline algorithms outperformed all the weighted set cover methods: the unfiltered top 10 coverage motifs method and the unfiltered top 10 score motifs method still outperformed the weighted set cover algorithms with two percent filtering, with accuracies of 74.9% and 74.6% respectively.

When the two percent greedy filtering algorithm was applied to the baseline selection methods, all of their accuracies improved. The top 10 coverage motifs method and the top 10 score motifs method, each with two percent filtering, performed the best out of all the methods with two percent filtering, including the weighted set cover methods, with average accuracies of 75.3% and 75.0% respectively. These are the first two methods shown to outperform the unfiltered baseline methods.

Using five percent filtering, the weighted greedy set cover method using the highest score objective function had the highest accuracy of the weighted set cover methods, at 74.3%. The weighted modified greedy method with the lowest ratio objective function had the highest precision among the weighted set cover methods, at 44.5%. The weighted greedy algorithms again performed very well among the weighted set cover methods. The unfiltered top 10 coverage motifs baseline and the unfiltered top 10 score motifs baseline outperformed all the weighted set cover methods with five percent filtering. With five percent filtering applied to the top 10 coverage motifs baseline and the top 10 score motifs baseline, their accuracies improved to 75.1% and 74.9% respectively. While five percent filtering improved the accuracy of these two baseline methods compared to no filtering, accuracy fell compared to two percent filtering. Overall, the top 10 coverage motifs method had the highest accuracy and precision among the motif selection methods with five percent filtering.

Using a ten percent greedy filter, the weighted modified greedy set cover algorithm with the lowest foreground to background objective function achieved the highest accuracy and precision among the weighted set cover algorithms, with an accuracy of 74.6% and a precision of 47.2%. The top 10 coverage motifs baseline without filtering and the top 10 score motifs baseline without filtering both outperformed the weighted set cover methods with ten percent filtering, with accuracies of 74.9% and 74.6% respectively. The top 10 coverage motifs with greedy ten percent filtering and the top 10 score motifs with greedy ten percent filtering both outperformed their unfiltered counterparts, with accuracies of 75.0% and 74.7% respectively. However, these two methods underperformed compared to their two percent and five percent counterparts. With ten percent filtering, the accuracies of all of the weighted set cover methods improved compared to two percent filtering, five percent filtering and no filtering, while the accuracies of the top 10 coverage motifs and the top 10 score motifs worsened compared to both two percent filtering and five percent filtering.

The best motif selection method identified across all selection methods and filters is the top 10 coverage motifs with two percent greedy filtering. Overall, the top 10 coverage motifs method performed very well with and without filtering. Since this method selects the top coverage motifs, it also maintains a relatively high sequence coverage compared to the other methods. Interestingly, the top 10 coverage motifs method outperformed the common practice of selecting the top scoring motifs, showing that sequence coverage can more accurately select biologically significant motifs than the scores produced by the motif discovery application. Among the weighted set cover algorithms, the weighted greedy algorithms with ten percent filtering performed the best, with the best objective functions being the lowest foreground to background ratio objective function and the highest score objective function. Overall, the weighted set cover methods performed more poorly than the baseline methods, which could be attributed to the use of a discriminative motif discovery application which already takes the background dataset into consideration.
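The global comparison across filters amounts to ranking every (method, filter) pair by its average metric. The short sketch below does this for the two strongest baselines, using only the average accuracies quoted in the paragraphs above; the helper name `rank_globally` is illustrative.

```python
# Average accuracies (%) reported above for the two strongest baselines.
reported = {
    ("top 10 coverage", "none"): 74.9,
    ("top 10 coverage", "2%"): 75.3,
    ("top 10 coverage", "5%"): 75.1,
    ("top 10 coverage", "10%"): 75.0,
    ("top 10 score", "none"): 74.6,
    ("top 10 score", "2%"): 75.0,
    ("top 10 score", "5%"): 74.9,
    ("top 10 score", "10%"): 74.7,
}


def rank_globally(scores):
    """Return (method, filter) pairs ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


ranking = rank_globally(reported)
```

The first entry of `ranking` is the top 10 coverage motifs method with two percent filtering, matching the conclusion stated above.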

Detailed comparisons of average accuracy, average precision, average sensitivity, average specificity, foreground sequence coverage and number of motifs selected can be found for each algorithm, objective function and filtering level in Figures 5, 6, 7 and 8. All the ENCODE comparisons and results can be viewed in detail at http://motifpipeline.com.

[Chart: "Comparison of Metrics for Selection Methods Without Filtering". Left axis: number of selected motifs (0 to 400); right axis: percentage (0% to 100%); horizontal axis: motif selection methods. Series: average number of motifs, average accuracy, average specificity, average sensitivity, average precision and average coverage.]

Figure 5. Comparison of the unfiltered set cover motif selection methods and baseline selection methods, displaying the average accuracy, specificity, sensitivity, precision and percentage coverage.

[Chart: "Comparison of Metrics for Selection Methods With 2% Filtering". Left axis: number of selected motifs (0 to 25); right axis: percentage (0% to 100%); horizontal axis: motif selection method. Series: average number of motifs, average accuracy, average specificity, average sensitivity, average precision and average coverage.]

Figure 6. Comparison of the set cover motif selection methods and baseline selection methods with two percent greedy filtering, displaying the average accuracy, specificity, sensitivity, precision and percentage coverage.

[Chart: "Comparison of Metrics for Selection Methods With 5% Filtering". Left axis: number of selected motifs (0 to 20); right axis: percentage (0% to 100%); horizontal axis: motif selection method. Series: average number of motifs, average accuracy, average specificity, average sensitivity, average precision and average coverage.]

Figure 7. Comparison of the set cover motif selection methods and baseline selection methods with five percent greedy filtering, displaying the average accuracy, specificity, sensitivity, precision and percentage coverage.

[Chart: "Comparison of Metrics for Selection Methods With 10% Filtering". Left axis: number of selected motifs (0 to 20); right axis: percentage (0% to 100%); horizontal axis: motif selection method. Series: average number of motifs, average accuracy, average specificity, average sensitivity, average precision and average coverage.]

Figure 8. Comparison of the set cover motif selection methods and baseline selection methods with ten percent greedy filtering, displaying the average accuracy, specificity, sensitivity, precision and percentage coverage.

5.1 Weighted Greedy Set Cover

5.1.1 Results

The weighted greedy set cover algorithm was used to identify sets of motifs in the ENCODE dataset using five different objective functions. The resultant sets of motifs were filtered using a greedy algorithm to further reduce the size of the motif sets with 2%, 5% and 10% filtering parameters. Metrics were then calculated using the five comparison methods, described above, to compare the effectiveness of the different objective functions and different filtering parameters at identifying the biologically significant set of motifs. Each objective function for the weighted greedy set cover algorithm was then ranked within each filtering set and globally, across all filters. The five different objective functions used in conjunction with the weighted greedy set cover algorithm attempted to: select the motifs with the lowest background sequence count, select for the highest information content, select for the highest DME score, select the motifs with the lowest ratio of background sequence occurrences over foreground sequence occurrences and select the motifs with the most foreground sequence occurrences.
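For orientation, the sketch below shows the textbook weighted greedy set cover heuristic with a cost-per-newly-covered-sequence rule. Using the background sequence count as the weight makes it loosely analogous to the lowest ratio objective, but this is an assumption for illustration and not the thesis implementation; all names are hypothetical.

```python
def weighted_greedy_cover(motif_fg, motif_weight, foreground):
    """Pick motifs until every coverable foreground sequence is covered.

    motif_fg:     motif -> set of foreground sequence ids it occurs in
    motif_weight: motif -> non-negative cost (e.g. background sequence count)
    foreground:   set of all foreground sequence ids
    """
    uncovered = set(foreground)
    chosen = []
    while uncovered:
        best, best_ratio = None, None
        for motif, seqs in motif_fg.items():
            gain = len(seqs & uncovered)
            if gain == 0:
                continue  # motif adds no new foreground sequences
            ratio = motif_weight[motif] / gain
            if best_ratio is None or ratio < best_ratio:
                best, best_ratio = motif, ratio
        if best is None:  # remaining sequences contain no candidate motif
            break
        chosen.append(best)
        uncovered -= motif_fg[best]
    return chosen, uncovered
```

Swapping the weight for another quantity changes what the greedy step favours, which is the role the five objective functions play in the comparisons that follow.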

Without filtering, the weighted greedy set cover algorithm’s five objective functions, in the order listed above, were ranked out of the 56 selection methods without filtering, with each metric averaged across the five comparison methods. By highest to lowest average accuracy, they ranked 55th, 45th, 44th, 42nd and 23rd respectively. By highest to lowest average precision, they ranked 55th, 46th, 28th, 33rd and 31st respectively. By highest to lowest average foreground sequence coverage, all five objective functions ranked 1st. By lowest to highest average background sequence coverage, they ranked 55th, 7th, 18th, 6th and 37th respectively.

With 2% greedy filtering, the weighted greedy set cover algorithm’s five objective functions, in the order listed above, were ranked out of the 56 selection methods with 2% greedy filtering, with each metric averaged across the five comparison methods. By highest to lowest average accuracy, they ranked 33rd, 12th, 4th, 13th and 56th respectively. By highest to lowest average precision, they ranked 53rd, 8th, 5th, 10th and 49th respectively. By highest to lowest average foreground sequence coverage, they ranked 50th, 48th, 42nd, 46th and 8th respectively. By lowest to highest average background sequence coverage, they ranked 48th, 4th, 10th, 6th and 34th respectively.

With 5% greedy filtering, the weighted greedy set cover algorithm’s five objective functions, in the order listed above, were ranked out of the 56 selection methods with 5% greedy filtering, with each metric averaged across the five comparison methods. By highest to lowest average accuracy, they ranked 27th, 14th, 3rd, 12th and 47th respectively. By highest to lowest average precision, they ranked 49th, 8th, 4th, 6th and 52nd respectively. By highest to lowest average foreground sequence coverage, they ranked 50th, 48th, 35th, 44th and 1st respectively. By lowest to highest average background sequence coverage, they ranked 54th, 6th, 9th, 5th and 34th respectively.

With 10% greedy filtering, the weighted greedy set cover algorithm’s five objective functions, in the order listed above, were ranked out of the 56 selection methods with 10% greedy filtering, with each metric averaged across the five comparison methods. By highest to lowest average accuracy, they ranked 12th, 10th, 13th, 9th and 32nd respectively. By highest to lowest average precision, they ranked 30th, 6th, 7th, 3rd and 51st respectively. By highest to lowest average foreground sequence coverage, they ranked 52nd, 49th, 30th, 42nd and 1st respectively. By lowest to highest average background sequence coverage, they ranked 54th, 3rd, 7th, 6th and 39th respectively.

When ranked globally, across all filter parameters, by highest to lowest average accuracy across the five comparison methods, described above, the top three objective function / filtering combinations for the weighted greedy set cover algorithm were: selecting for motifs with the lowest ratio of background sequence occurrences over foreground sequence occurrences with 10% greedy filtering, selecting for the highest information content with 10% greedy filtering and selecting for motifs with the lowest background sequence count with 10% greedy filtering. They ranked 15th, 16th and 18th respectively, out of 224 methods, with average accuracies of 0.7452, 0.7451 and 0.7446. When ranked by highest to lowest average precision globally, across all filters, the top three objective function / filtering combinations were: selecting for motifs with the lowest ratio of background sequence occurrences over foreground sequence occurrences with 10% greedy filtering, selecting for the highest information content with 10% greedy filtering and selecting for the highest DME score with 10% greedy filtering. They ranked 3rd, 7th and 8th respectively, out of 224 methods, with average precisions of 0.4687, 0.4632 and 0.4620.

5.1.2 Discussion

The weighted greedy set cover method performed very well in combination with filtering compared to the other set cover methods. However, the weighted greedy method did not perform well compared to other set cover methods when no filtering was used. For each level of filtering at least one variant of the greedy weighted set cover method ranked in the top 5 highest accuracy motif selection methods, showing the effectiveness of pairing the greedy algorithm with filtering. The objective functions which selected the lowest ratio of foreground sequences to background sequences, and selected the highest score per sequence added, performed the best. The objective function to select the highest information content per sequence had mediocre performance throughout, and the objective functions to select for the most foreground sequences and the fewest background sequences performed the worst with filtering. Interestingly, the objective function to cover the most foreground sequences performed the worst with filtering, but performed the best without filtering.

5.2 Weighted Relaxed Greedy Set Cover

5.2.1 Results

The weighted relaxed greedy set cover algorithm was used to identify sets of motifs in the ENCODE dataset using five different objective functions. The resultant sets of motifs were filtered using a greedy algorithm to further reduce the size of the motif sets with 2%, 5% and 10% filtering parameters. Metrics were then calculated using the five comparison methods, described above, to compare the effectiveness of the different objective functions and different filtering parameters at identifying the biologically significant set of motifs.

Each objective function for the weighted relaxed greedy set cover algorithm was then ranked within each filtering set and globally, across all filters. The five different objective functions used in conjunction with the weighted relaxed greedy set cover algorithm attempted to: select the motifs with the lowest background sequence count, select for the highest information content, select for the highest DME score, select the motifs with the lowest ratio of background sequence occurrences over foreground sequence occurrences and select the motifs with the most foreground sequence occurrences.

Without filtering, the weighted relaxed greedy set cover algorithm’s five objective functions, in the order listed above, were ranked out of the 56 selection methods without filtering, with each metric averaged across the five comparison methods. By highest to lowest average accuracy, they ranked 53rd, 52nd, 49th, 51st and 38th respectively. By highest to lowest average precision, they ranked 53rd, 45th, 25th, 30th and 5th respectively. By highest to lowest average foreground sequence coverage, all five objective functions ranked 1st.

With 2% greedy filtering, the weighted relaxed greedy set cover algorithm’s five objective functions, in the order listed above, were ranked out of the 56 selection methods with 2% greedy filtering, with each metric averaged across the five comparison methods. By highest to lowest average accuracy, they ranked 45th, 10th, 5th, 3rd and 16th respectively. By highest to lowest average precision, they ranked 44th, 4th, 11th, 3rd and 24th respectively. By highest to lowest average foreground sequence coverage, they ranked 53rd, 36th, 32nd, 39th and 19th respectively.

With 5% greedy filtering, the weighted relaxed greedy set cover algorithm’s five objective functions, in the order listed above, were ranked out of the 56 selection methods with 5% greedy filtering, with each metric averaged across the five comparison methods. By highest to lowest average accuracy, they ranked 30th, 24th, 5th, 6th and 28th respectively. By highest to lowest average precision, they ranked 19th, 11th, 12th, 9th and 54th respectively. By highest to lowest average foreground sequence coverage, they ranked 54th, 36th, 27th, 30th and 9th respectively.

With 10% greedy filtering, the weighted relaxed greedy set cover algorithm’s five objective functions, in the order listed above, were ranked out of the 56 selection methods with 10% greedy filtering, with each metric averaged across the five comparison methods. By highest to lowest average accuracy, they ranked 43rd, 15th, 4th, 16th and 51st respectively. By highest to lowest average precision, they ranked 43rd, 18th, 25th, 21st and 54th respectively. By highest to lowest average foreground sequence coverage, they ranked 55th, 38th, 25th, 40th and 3rd respectively.

When ranked globally, across all filter parameters, by highest to lowest average accuracy across the five comparison methods, described above, the top three objective function / filtering combinations for the weighted relaxed greedy set cover algorithm were: selecting for the highest DME score with 10% greedy filtering, selecting for the highest information content with 10% greedy filtering and selecting for motifs with the lowest ratio of background sequence occurrences over foreground sequence occurrences with 10% greedy filtering. They ranked 10th, 21st and 22nd respectively, out of 224 methods, with average accuracies of 0.7458, 0.7446 and 0.7444. When ranked by highest to lowest average precision globally, across all filters, the top three objective function / filtering combinations were: selecting for the highest information content with 10% greedy filtering, selecting for motifs with the lowest ratio of background sequence occurrences over foreground sequence occurrences with 10% greedy filtering and selecting for the highest DME score with 10% greedy filtering. They ranked 21st, 24th and 28th respectively, out of 224 methods, with average precisions of 0.4588, 0.4575 and 0.4557.

5.2.2 Discussion

The weighted relaxed greedy set cover method performed slightly better than the weighted greedy set cover method when filtering was applied. Like the weighted greedy set cover method, the weighted relaxed greedy method performed poorly compared to the other weighted set cover methods, in terms of accuracy and precision, without filtering, and performed well with filtering. This set cover method also had at least one variant in the top five accuracy rankings for each type of filtering tested. The objective function which obtained the highest accuracy for the weighted relaxed greedy set cover method selected for the motifs with the highest DME score using 10% filtering, and achieved an accuracy of 0.7458. Overall, selecting for the highest DME score performed the best across the different levels of filtering, while the objective functions which selected for the lowest ratio between foreground and background sequences, and for the highest information content per sequence, performed inconsistently, doing well at some filtering levels and worse at others. In summary, the weighted relaxed greedy algorithm performed well, similar to the other weighted greedy algorithms.

5.3 Weighted Modified Greedy Set Cover

5.3.1 Results

The weighted modified greedy set cover algorithm was used to identify sets of motifs in the ENCODE dataset using five different objective functions. The resultant sets of motifs were filtered using a greedy algorithm to further reduce the size of the motif sets with 2%, 5% and 10% filtering parameters. Metrics were then calculated using the five comparison methods, described above, to compare the effectiveness of the different objective functions and different filtering parameters at identifying the biologically significant set of motifs. Each objective function for the weighted modified greedy set cover algorithm was then ranked within each filtering set and globally, across all filters.
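The two-level ranking described above (within each filtering set, and globally across all filters) can be sketched as follows. This is only an illustration: the function name, the method labels and the metric values are hypothetical, and ties are broken by input order, which may differ from how ties were handled in the actual experiments. With 56 selection methods and four filter settings (none, 2%, 5%, 10%), the global ranking covers 224 combinations.

```python
from collections import defaultdict
from statistics import mean


def rank_methods(scores):
    """Rank (method, filter) combinations by mean metric value, 1 = best.

    scores maps (method_name, filter_level) to a list holding one metric
    value (e.g. accuracy) per comparison method.
    """
    averages = {key: mean(values) for key, values in scores.items()}

    # Global ranking: every combination competes against every other one.
    global_order = sorted(averages, key=averages.get, reverse=True)
    global_ranks = {key: i + 1 for i, key in enumerate(global_order)}

    # Within-filter ranking: only combinations with the same filter compete.
    by_filter = defaultdict(list)
    for key in averages:
        by_filter[key[1]].append(key)
    within_ranks = {}
    for keys in by_filter.values():
        ordered = sorted(keys, key=averages.get, reverse=True)
        for i, key in enumerate(ordered):
            within_ranks[key] = i + 1
    return within_ranks, global_ranks


# Made-up values for two hypothetical methods and two filter settings.
toy_scores = {
    ("method_a", "none"): [0.70, 0.72],
    ("method_b", "none"): [0.74, 0.76],
    ("method_a", "10%"): [0.78, 0.80],
    ("method_b", "10%"): [0.73, 0.73],
}
within_ranks, global_ranks = rank_methods(toy_scores)
```

In this toy input, method_a with 10% filtering ranks first globally, while method_b ranks first among the unfiltered results, which mirrors how a method's within-filter rank and global rank can differ.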

The five different objective functions used in conjunction with the weighted modified greedy set cover algorithm attempted to: select the motifs with the lowest background sequence count, select for the highest information content, select for the highest DME score, select the motifs with the lowest ratio of background sequence occurrences over foreground sequence occurrences and select the motifs with the most foreground sequence occurrences.
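As a rough illustration, the five per-motif objective functions can be written as cost functions where a lower value is preferred. The `Motif` record, its field names and the example values are assumptions made for this sketch; information content and DME score are treated as precomputed numbers, and the way the algorithm combines a cost with coverage is not shown.

```python
from dataclasses import dataclass


@dataclass
class Motif:
    name: str
    foreground_hits: int        # foreground sequences containing the motif
    background_hits: int        # background sequences containing the motif
    information_content: float  # assumed to be precomputed
    dme_score: float            # assumed to be precomputed


def background_foreground_ratio(motif):
    # A motif with no foreground occurrences gets the worst possible ratio.
    if motif.foreground_hits == 0:
        return float("inf")
    return motif.background_hits / motif.foreground_hits


# Lower cost is better; quantities to be maximised are negated.
OBJECTIVES = {
    "lowest_background": lambda m: m.background_hits,
    "highest_information_content": lambda m: -m.information_content,
    "highest_dme_score": lambda m: -m.dme_score,
    "lowest_background_foreground_ratio": background_foreground_ratio,
    "most_foreground": lambda m: -m.foreground_hits,
}


def best_motif(motifs, objective):
    """Return the motif with the lowest cost under the named objective."""
    return min(motifs, key=OBJECTIVES[objective])


# Hypothetical motifs, for illustration only.
toy_motifs = [
    Motif("m1", foreground_hits=40, background_hits=10,
          information_content=12.0, dme_score=3.5),
    Motif("m2", foreground_hits=25, background_hits=2,
          information_content=9.0, dme_score=5.0),
    Motif("m3", foreground_hits=60, background_hits=30,
          information_content=14.5, dme_score=2.0),
]
picks = {name: best_motif(toy_motifs, name).name for name in OBJECTIVES}
```

Even on three toy motifs the objectives disagree (m2 wins on background count, ratio and DME score, m3 on information content and foreground count), which is why each objective function is ranked separately.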

When ranked by highest to lowest average accuracy across the five comparison methods the unfiltered results of the weighted modified greedy set cover algorithm’s objective functions ranked 54th, 40th, 39th, 41st and 22nd respectively, out of the 56 selection methods without filtering. When ranked by highest to lowest average precision across the five comparison methods the unfiltered results of the weighted modified greedy set cover algorithm’s objective functions ranked 54th, 44th, 27th, 40th and 29th respectively, out of the 56 selection methods without filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the unfiltered results of the weighted modified greedy set cover algorithm’s objective functions ranked 1st for all objective functions, out of the 56 selection methods without filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 2% greedy filter results of the weighted modified greedy set cover algorithm’s objective functions ranked 19th, 7th, 8th, 6th and 47th respectively, out of the 56 selection methods with 2% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 2% greedy filter results of the weighted modified greedy set cover algorithm’s objective functions ranked 50th, 6th, 9th, 7th and 52nd respectively, out of the 56 selection methods with 2% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 2% greedy filter results of the weighted modified greedy set cover algorithm’s objective functions ranked 51st, 49th, 43rd, 47th and 1st respectively, out of the 56 selection methods with 2% greedy filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 5% greedy filter results of the weighted modified greedy set cover algorithm’s objective functions ranked 10th, 9th, 4th, 7th and 26th respectively, out of the 56 selection methods with 5% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 5% greedy filter results of the weighted modified greedy set cover algorithm’s objective functions ranked 38th, 7th, 5th, 3rd and 53rd respectively, out of the 56 selection methods with 5% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 5% greedy filter results of the weighted modified greedy set cover algorithm’s objective functions ranked 51st, 49th, 34th, 41st and 2nd respectively, out of the 56 selection methods with 5% greedy filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 10% greedy filter results of the weighted modified greedy set cover algorithm’s objective functions ranked 7th, 5th, 14th, 3rd and 8th respectively, out of the 56 selection methods with 10% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 10% greedy filter results of the weighted modified greedy set cover algorithm’s objective functions ranked 22nd, 4th, 8th, 2nd and 37th respectively, out of the 56 selection methods with 10% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 10% greedy filter results of the weighted modified greedy set cover algorithm’s objective functions ranked 51st, 50th, 21st, 44th and 2nd respectively, out of the 56 selection methods with 10% greedy filtering.

When ranked globally, across all filter parameters, by highest to lowest average accuracy across the five comparison methods, described above, the top three objective function / filtering combinations for the weighted modified greedy set cover algorithm were: selecting for motifs with the lowest ratio of background sequence occurrences over foreground sequence occurrences with 10% greedy filtering, selecting for the highest information content with 10% greedy filtering and selecting for motifs with the lowest background sequence count with 10% greedy filtering. They ranked 9th, 11th and 13th respectively, out of 224 methods, with average accuracies of 0.7467, 0.7458 and 0.7456.

When ranked by highest to lowest average precision globally, across all filters, the top three objective function / filtering combinations were: selecting for motifs with the lowest ratio of background sequence occurrences over foreground sequence occurrences with 10% greedy filtering, selecting for the highest information content with 10% greedy filtering and selecting for the highest DME score with 10% greedy filtering. They ranked 2nd, 5th and 9th respectively, out of 224 methods, with average precisions of 0.4729, 0.4674 and 0.4606.

5.3.2 Discussion

The weighted modified greedy algorithm performed well compared to the other weighted set cover algorithms, actually outperforming some of the unfiltered baseline methods in precision. The weighted modified greedy set cover algorithm ranked among the top ten in terms of precision and accuracy when filtering was added. Like all the weighted set cover methods, the weighted modified greedy algorithm benefited greatly from greedy filtering. The top three variants used in combination with filtering, with the highest accuracy and precision, selected for the lowest ratio of background to foreground sequences, the highest score per sequence and the highest information content per sequence. The objective function that selected for the fewest background sequences performed poorly with and without filtering. Like the other weighted greedy algorithms, the objective function which maximizes foreground sequence coverage performed the best when no filtering was used. Overall, the weighted modified greedy algorithm performed in line with the other greedy algorithms.
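To make the effect of greedy filtering concrete, the sketch below shows one possible reading of it. It is an assumption, not the definition used in this thesis: motifs are re-selected greedily by the number of still-uncovered foreground sequences they add, and selection stops once the best remaining motif would add less than the filtering parameter (2%, 5% or 10%) of the foreground set. All names and data are hypothetical.

```python
def greedy_filter(motif_cover, n_foreground, threshold):
    """Shrink a motif set by dropping motifs with a small marginal gain.

    motif_cover maps a motif name to the set of foreground sequence ids
    it occurs in; threshold is a fraction of the foreground set (0.10
    for an assumed "10% filter").
    """
    covered = set()
    kept = []
    remaining = dict(motif_cover)
    while remaining:
        # Pick the motif that covers the most still-uncovered sequences.
        name = max(remaining, key=lambda m: len(remaining[m] - covered))
        gain = len(remaining[name] - covered)
        if gain == 0 or gain / n_foreground < threshold:
            break  # every remaining motif adds too little coverage
        kept.append(name)
        covered |= remaining.pop(name)
    return kept, covered


# Hypothetical cover of 20 foreground sequences (ids 0..19).
toy_cover = {
    "a": set(range(0, 10)),
    "b": set(range(8, 14)),
    "c": {14},
    "d": {15, 16, 17},
}
kept_10, covered_10 = greedy_filter(toy_cover, n_foreground=20, threshold=0.10)
kept_02, covered_02 = greedy_filter(toy_cover, n_foreground=20, threshold=0.02)
```

Under this reading, a stricter filter yields a smaller motif set at the cost of foreground coverage, which is consistent with the filtered results no longer ranking 1st in foreground sequence coverage.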

5.4 Weighted Hill Climbing Set Cover

5.4.1 Results

The weighted hill climbing set cover algorithm was used to identify sets of motifs in the ENCODE dataset using five different objective functions. The resultant sets of motifs were filtered using a greedy algorithm to further reduce the size of the motif sets with 2%, 5% and 10% filtering parameters. Metrics were then calculated using the five comparison methods, described above, to compare the effectiveness of the different objective functions and different filtering parameters at identifying the biologically significant set of motifs.

Each objective function for the weighted hill climbing set cover algorithm was then ranked within each filtering set and globally, across all filters. The five different objective functions used in conjunction with the weighted hill climbing set cover algorithm attempted to: select the solution with the fewest motifs, select the solution with the lowest number of motifs multiplied by the number of background sequences, select the solution with the fewest background sequences, select for the highest information content and select for the highest DME score.
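Unlike the per-motif objectives of the greedy algorithms, the first three objectives here score a whole candidate solution. A minimal sketch follows, assuming that "the number of background sequences" means the number of distinct background sequences hit by any motif in the solution; a sum over motifs would be an equally plausible reading. The function names and data are hypothetical.

```python
def total_background(solution, background_hits):
    """Distinct background sequences hit by any motif in the solution."""
    hit = set()
    for motif in solution:
        hit |= background_hits[motif]
    return len(hit)


def fewest_motifs(solution, background_hits):
    return len(solution)


def motifs_times_background(solution, background_hits):
    return len(solution) * total_background(solution, background_hits)


def fewest_background(solution, background_hits):
    return total_background(solution, background_hits)


# Two hypothetical solutions that both cover the foreground set.
toy_background = {"m1": {1, 2}, "m2": {2, 3}, "m3": {1, 2, 3, 4, 5}}
solution_a = ["m1", "m2"]
solution_b = ["m3"]

costs_a = [f(solution_a, toy_background)
           for f in (fewest_motifs, motifs_times_background, fewest_background)]
costs_b = [f(solution_b, toy_background)
           for f in (fewest_motifs, motifs_times_background, fewest_background)]
```

Here solution_b is preferred by the first two objectives and solution_a by the third, so the search can settle on different motif sets depending on the objective function.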

When ranked by highest to lowest average accuracy across the five comparison methods the unfiltered results of the weighted hill climbing set cover algorithm’s objective functions ranked 20th, 19th, 21st, 30th and 24th respectively, out of the 56 selection methods without filtering. When ranked by highest to lowest average precision across the five comparison methods the unfiltered results of the weighted hill climbing set cover algorithm’s objective functions ranked 26th, 20th, 18th, 32nd and 13th respectively, out of the 56 selection methods without filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the unfiltered results of the weighted hill climbing set cover algorithm’s objective functions ranked 1st for all objective functions, out of the 56 selection methods without filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 2% greedy filter results of the weighted hill climbing set cover algorithm’s objective functions ranked 51st, 37th, 15th, 36th and 48th respectively, out of the 56 selection methods with 2% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 2% greedy filter results of the weighted hill climbing set cover algorithm’s objective functions ranked 46th, 25th, 14th, 19th and 20th respectively, out of the 56 selection methods with 2% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 2% greedy filter results of the weighted hill climbing set cover algorithm’s objective functions ranked 18th, 23rd, 31st, 33rd and 24th respectively, out of the 56 selection methods with 2% greedy filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 5% greedy filter results of the weighted hill climbing set cover algorithm’s objective functions ranked 41st, 33rd, 13th, 48th and 42nd respectively, out of the 56 selection methods with 5% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 5% greedy filter results of the weighted hill climbing set cover algorithm’s objective functions ranked 39th, 33rd, 13th, 30th and 34th respectively, out of the 56 selection methods with 5% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 5% greedy filter results of the weighted hill climbing set cover algorithm’s objective functions ranked 22nd, 20th, 33rd, 40th and 26th respectively, out of the 56 selection methods with 5% greedy filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 10% greedy filter results of the weighted hill climbing set cover algorithm’s objective functions ranked 53rd, 35th, 6th, 48th and 49th respectively, out of the 56 selection methods with 10% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 10% greedy filter results of the weighted hill climbing set cover algorithm’s objective functions ranked 53rd, 36th, 11th, 34th and 35th respectively, out of the 56 selection methods with 10% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 10% greedy filter results of the weighted hill climbing set cover algorithm’s objective functions ranked 16th, 19th, 32nd, 41st and 18th respectively, out of the 56 selection methods with 10% greedy filtering.

When ranked globally, across all filter parameters, by highest to lowest average accuracy across the five comparison methods, described above, the top three objective function / filtering combinations for the weighted hill climbing set cover algorithm were: selecting for solutions with the fewest background sequences with 10% greedy filtering, selecting for solutions with the lowest number of motifs multiplied by the number of background sequences with 10% greedy filtering and selecting for the highest information content with 10% greedy filtering. They ranked 12th, 43rd and 56th respectively, out of 224 methods, with average accuracies of 0.7457, 0.7429 and 0.7422.

When ranked by highest to lowest average precision globally, across all filters, the top three objective function / filtering combinations were: selecting for solutions with the fewest background sequences with 10% greedy filtering, selecting for the highest information content with 10% greedy filtering and selecting for the highest DME score with 10% greedy filtering. They ranked 13th, 37th and 38th respectively, out of 224 methods, with average precisions of 0.4594, 0.4528 and 0.4519.

5.4.2 Discussion

The weighted hill climbing method did not perform very well overall when compared to the other weighted set cover algorithms or the baseline methods. The best objective function for the weighted hill climbing method selected for the fewest total background sequences using 10% filtering. Besides this one objective function / filtering combination, all the hill climbing methods were outperformed by the greedy methods when filtering was applied. These results fall in line with the other methods which attempt to obtain near optimal or optimal solutions before filtering is applied. It appears that obtaining an optimal solution for an objective function may initially improve the accuracy and precision of identifying a biologically significant motif with a complete foreground coverage requirement, but when greedy filtering is applied to these sets of motifs the solutions do not hold up very well. These near optimal solutions may also only be performing well without any filters because they usually select a smaller number of motifs.

5.5 Weighted Random Set Cover

5.5.1 Results

The weighted random set cover algorithm was used to identify sets of motifs in the ENCODE dataset using five different objective functions. The resultant sets of motifs were filtered using a greedy algorithm to further reduce the size of the motif sets with 2%, 5% and 10% filtering parameters. Metrics were then calculated using the five comparison methods, described above, to compare the effectiveness of the different objective functions and different filtering parameters at identifying the biologically significant set of motifs.

Each objective function for the weighted random set cover algorithm was then ranked within each filtering set and globally, across all filters. The five different objective functions used in conjunction with the weighted random set cover algorithm attempted to: select the solution with the fewest motifs, select the solution with the lowest number of motifs multiplied by the number of background sequences, select the solution with the fewest background sequences, select for the highest information content and select for the highest DME score.
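A random baseline of this kind might look like the sketch below. It rests on two assumptions that are not confirmed by the text: a random cover is built by adding motifs in shuffled order until every foreground sequence is covered, and the objective function is applied by drawing several random covers and keeping the one with the lowest cost. All names and data are hypothetical.

```python
import random


def random_set_cover(motif_cover, foreground, rng):
    """Add motifs in random order until all foreground ids are covered."""
    order = list(motif_cover)
    rng.shuffle(order)
    covered, chosen = set(), []
    for name in order:
        if covered >= foreground:
            break
        if motif_cover[name] - covered:  # skip motifs that add nothing new
            chosen.append(name)
            covered |= motif_cover[name]
    if not covered >= foreground:
        raise ValueError("motifs cannot cover every foreground sequence")
    return chosen


def best_random_cover(motif_cover, foreground, cost, trials, seed=0):
    """Draw several random covers and keep the one with the lowest cost."""
    rng = random.Random(seed)
    candidates = [random_set_cover(motif_cover, foreground, rng)
                  for _ in range(trials)]
    return min(candidates, key=cost)


# Hypothetical data; cost=len corresponds to "fewest motifs".
toy_cover = {"m1": {1, 2, 3}, "m2": {3, 4}, "m3": {4, 5}, "m4": {1, 2, 3, 4, 5}}
toy_foreground = {1, 2, 3, 4, 5}
best = best_random_cover(toy_cover, toy_foreground, cost=len, trials=50)
best_covered = set().union(*(toy_cover[name] for name in best))
```

Because every cover is complete by construction, such a baseline always reaches full foreground coverage before filtering, whatever the objective function.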

When ranked by highest to lowest average accuracy across the five comparison methods the unfiltered results of the weighted random set cover algorithm’s objective functions ranked 46th, 43rd, 47th, 48th and 50th respectively, out of the 56 selection methods without filtering. When ranked by highest to lowest average precision across the five comparison methods the unfiltered results of the weighted random set cover algorithm’s objective functions ranked 49th, 48th, 50th, 51st and 52nd respectively, out of the 56 selection methods without filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the unfiltered results of the weighted random set cover algorithm’s objective functions ranked 1st for all objective functions, out of the 56 selection methods without filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 2% greedy filter results of the weighted random set cover algorithm’s objective functions ranked 49th, 35th, 34th, 22nd and 24th respectively, out of the 56 selection methods with 2% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 2% greedy filter results of the weighted random set cover algorithm’s objective functions ranked 54th, 47th, 45th, 42nd and 34th respectively, out of the 56 selection methods with 2% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 2% greedy filter results of the weighted random set cover algorithm’s objective functions ranked 35th, 34th, 45th, 44th and 41st respectively, out of the 56 selection methods with 2% greedy filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 5% greedy filter results of the weighted random set cover algorithm’s objective functions ranked 54th, 43rd, 31st, 22nd and 19th respectively, out of the 56 selection methods with 5% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 5% greedy filter results of the weighted random set cover algorithm’s objective functions ranked 51st, 50th, 43rd, 37th and 40th respectively, out of the 56 selection methods with 5% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 5% greedy filter results of the weighted random set cover algorithm’s objective functions ranked 39th, 37th, 47th, 43rd and 38th respectively, out of the 56 selection methods with 5% greedy filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 10% greedy filter results of the weighted random set cover algorithm’s objective functions ranked 50th, 42nd, 30th, 38th and 21st respectively, out of the 56 selection methods with 10% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 10% greedy filter results of the weighted random set cover algorithm’s objective functions ranked 52nd, 46th, 19th, 50th and 27th respectively, out of the 56 selection methods with 10% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 10% greedy filter results of the weighted random set cover algorithm’s objective functions ranked 39th, 37th, 45th, 46th and 36th respectively, out of the 56 selection methods with 10% greedy filtering.

When ranked globally, across all filter parameters, by highest to lowest average accuracy across the five comparison methods, described above, the top three objective function / filtering combinations for the weighted random set cover algorithm were: selecting for the highest DME score with 10% greedy filtering, selecting for solutions with the fewest background sequences with 10% greedy filtering and selecting for the highest information content with 10% greedy filtering. They ranked 27th, 37th and 46th respectively, out of 224 methods, with average accuracies of 0.7438, 0.7434 and 0.7428.

When ranked by highest to lowest average precision globally, across all filters, the top three objective function / filtering combinations were: selecting for solutions with the fewest background sequences with 10% greedy filtering, selecting for the highest DME score with 10% greedy filtering and selecting for solutions with the lowest number of motifs multiplied by the number of background sequences with 10% greedy filtering. They ranked 22nd, 30th and 49th respectively, out of 224 methods, with average precisions of 0.4577, 0.4547 and 0.4485.

5.5.2 Discussion

The weighted random set cover method was used as a baseline to compare against the weighted set cover methods. As expected, this method performed poorly overall, especially when no filtering was applied. Even without filtering, however, this method outperformed many of the weighted relaxed set cover methods in terms of accuracy and precision. This method also outperformed many of the greedy algorithms which attempted to select for motifs with the lowest background sequence count. When greedy filtering was applied to weighted random set cover, this method still performed poorly, but surprisingly performed better in terms of accuracy and precision than some of the weighted set cover methods, in particular the weighted set cover methods that attempted to obtain near optimal or optimal solutions. The objective function which performed the best for this method attempted to obtain the highest ratio of foreground to background sequences with 10% filtering, and surprisingly ranked 22nd out of all the 224 algorithm and objective function combinations.

5.6 Weighted Simulated Annealing Set Cover

5.6.1 Fast Temperature Function Results

The weighted simulated annealing set cover with a fast temperature function algorithm was used to identify sets of motifs in the ENCODE dataset using five different objective functions. The resultant sets of motifs were filtered using a greedy algorithm to further reduce the size of the motif sets with 2%, 5% and 10% filtering parameters. Metrics were then calculated using the five comparison methods, described above, to compare the effectiveness of the different objective functions and different filtering parameters at identifying the biologically significant set of motifs. Each objective function for the weighted simulated annealing set cover with a fast temperature function algorithm was then ranked within each filtering set and globally, across all filters. The five different objective functions used in conjunction with the weighted simulated annealing set cover with a fast temperature function algorithm attempted to: select the solution with the fewest motifs, select the solution with the lowest number of motifs multiplied by the number of background sequences, select the solution with the fewest background sequences, select for the highest information content and select for the highest DME score.

When ranked by highest to lowest average accuracy across the five comparison methods the unfiltered results of the weighted simulated annealing set cover with a fast temperature function algorithm’s objective functions ranked 13th, 17th, 18th, 36th and 34th respectively, out of the 56 selection methods without filtering. When ranked by highest to lowest average precision across the five comparison methods the unfiltered results of the weighted simulated annealing set cover with a fast temperature function algorithm’s objective functions ranked 23rd, 22nd, 21st, 43rd and 6th respectively, out of the 56 selection methods without filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the unfiltered results of the weighted simulated annealing set cover with a fast temperature function algorithm’s objective functions ranked 1st for all objective functions, out of the 56 selection methods without filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 2% greedy filter results of the weighted simulated annealing set cover with a fast temperature function algorithm’s objective functions ranked 53rd, 46th, 17th, 31st and 14th respectively, out of the 56 selection methods with 2% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 2% greedy filter results of the weighted simulated annealing set cover with a fast temperature function algorithm’s objective functions ranked 51st, 23rd, 21st, 17th and 16th respectively, out of the 56 selection methods with 2% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 2% greedy filter results of the weighted simulated annealing set cover with a fast temperature function algorithm’s objective functions ranked 11th, 22nd, 27th, 37th and 29th respectively, out of the 56 selection methods with 2% greedy filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 5% greedy filter results of the weighted simulated annealing set cover with a fast temperature function algorithm’s objective functions ranked 51st, 21st, 16th, 18th and 17th respectively, out of the 56 selection methods with 5% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 5% greedy filter results of the weighted simulated annealing set cover with a fast temperature function algorithm’s objective functions ranked 46th, 20th, 17th, 14th and 16th respectively, out of the 56 selection methods with 5% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 5% greedy filter results of the weighted simulated annealing set cover with a fast temperature function algorithm’s objective functions ranked 11th, 28th, 32nd, 45th and 21st respectively, out of the 56 selection methods with 5% greedy filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 10% greedy filter results of the weighted simulated annealing set cover with a fast temperature function algorithm’s objective functions ranked 52nd, 31st, 20th, 33rd and 17th respectively, out of the 56 selection methods with 10% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 10% greedy filter results of the weighted simulated annealing set cover with a fast temperature function algorithm’s objective functions ranked 47th, 32nd, 28th, 10th and 20th respectively, out of the 56 selection methods with 10% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 10% greedy filter results of the weighted simulated annealing set cover with a fast temperature function algorithm’s objective functions ranked 6th, 28th, 35th, 47th and 15th respectively, out of the 56 selection methods with 10% greedy filtering.

When ranked globally, across all filter parameters, by highest to lowest average accuracy across the five comparison methods, described above, the top three objective function / filtering combinations for the weighted simulated annealing set cover with a fast temperature function algorithm were: selecting for the highest DME score with 10% greedy filtering, selecting for solutions with the fewest background sequences with 10% greedy filtering and selecting for solutions with the lowest number of motifs multiplied by the number of background sequences with 10% greedy filtering. They ranked 23rd, 26th and 38th respectively, out of 224 methods, with average accuracies of 0.7444, 0.7438 and 0.7433.

When ranked by highest to lowest average precision globally, across all filters, the top three objective function / filtering combinations were: selecting for the highest information content with 10% greedy filtering, selecting for the highest DME score with 10% greedy filtering and selecting for solutions with the fewest background sequences with 10% greedy filtering. They ranked 11th, 23rd and 31st respectively, out of 224 methods, with average precisions of 0.4598, 0.4576 and 0.4544.

5.6.2 Exponential Temperature Function Results

The weighted simulated annealing set cover with an exponential temperature function algorithm was used to identify sets of motifs in the ENCODE dataset using five different objective functions. The resultant sets of motifs were filtered using a greedy algorithm to further reduce the size of the motif sets with 2%, 5% and 10% filtering parameters. Metrics were then calculated using the five comparison methods, described above, to compare the effectiveness of the different objective functions and different filtering parameters at identifying the biologically significant set of motifs. Each objective function for the weighted simulated annealing set cover with an exponential temperature function algorithm was then ranked within each filtering set and globally, across all filters.

The five different objective functions used in conjunction with the weighted simulated annealing set cover with an exponential temperature function algorithm attempted to: select the solution with the fewest motifs, select the solution with the lowest number of motifs multiplied by the number of background sequences, select the solution with the fewest background sequences, select for the highest information content and select for the highest DME score.
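The difference between a fast and an exponential temperature function can be illustrated with two commonly used cooling schedules. The exact forms and parameters used in this thesis are not restated here, so the formulas below (T0 / (1 + k) for fast cooling, T0 * alpha^k for exponential cooling), the default alpha and the Metropolis acceptance rule are assumptions chosen for illustration.

```python
import math


def fast_temperature(t0, k):
    """Assumed fast schedule: temperature falls as 1 / (1 + k)."""
    return t0 / (1 + k)


def exponential_temperature(t0, k, alpha=0.95):
    """Assumed exponential schedule: temperature shrinks by alpha per step."""
    return t0 * alpha ** k


def acceptance_probability(delta, temperature):
    """Metropolis criterion for a cost change delta (negative = better)."""
    if delta <= 0:
        return 1.0  # improvements and sideways moves are always accepted
    return math.exp(-delta / temperature)


# Compare how quickly each schedule cools from the same starting point.
fast_curve = [fast_temperature(100.0, k) for k in range(10)]
exp_curve = [exponential_temperature(100.0, k, alpha=0.5) for k in range(10)]
p_worse = acceptance_probability(1.0, fast_curve[-1])
```

With schedules of this kind, the rate at which the temperature drops controls how long worse solutions are still accepted, and therefore how far the search can move away from its starting cover before it settles.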

When ranked by highest to lowest average accuracy across the five comparison methods the unfiltered results of the weighted simulated annealing set cover with an exponential temperature function algorithm’s objective functions ranked 11th, 12th, 16th, 37th and 32nd respectively, out of the 56 selection methods without filtering. When ranked by highest to lowest average precision across the five comparison methods the unfiltered results of the weighted simulated annealing set cover with an exponential temperature function algorithm’s objective functions ranked 15th, 14th, 17th, 42nd and 3rd respectively, out of the 56 selection methods without filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the unfiltered results of the weighted simulated annealing set cover with an exponential temperature function algorithm’s objective functions ranked 1st for all objective functions, out of the 56 selection methods without filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 2% greedy filter results of the weighted simulated annealing set cover with an exponential temperature function algorithm’s objective functions ranked 29th, 38th, 30th, 52nd and 32nd respectively, out of the 56 selection methods with 2% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 2% greedy filter results of the weighted simulated annealing set cover with an exponential temperature function algorithm’s objective functions ranked 33rd, 43rd, 29th, 22nd and 12th respectively, out of the 56 selection methods with 2% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 2% greedy filter results of the weighted simulated annealing set cover with an exponential temperature function algorithm’s objective functions ranked 10th, 21st, 28th, 40th and 25th respectively, out of the 56 selection methods with 2% greedy filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 5% greedy filter results of the weighted simulated annealing set cover with an exponential temperature function algorithm’s objective functions ranked 25th, 23rd, 32nd, 55th and 34th respectively, out of the 56 selection methods with 5% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 5% greedy filter results of the weighted simulated annealing set cover with an exponential temperature function algorithm’s objective functions ranked 31st, 36th, 22nd, 21st and 10th respectively, out of the 56 selection methods with 5% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 5% greedy filter results of the weighted simulated annealing set cover with an exponential temperature function algorithm’s objective functions ranked 10th, 23rd, 31st, 42nd and 24th respectively, out of the 56 selection methods with 5% greedy filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 10% greedy filter results of the weighted simulated annealing set cover with an exponential temperature function algorithm’s objective functions ranked 29th, 36th, 44th, 55th and 37th respectively, out of the 56 selection methods with 10% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 10% greedy filter results of the weighted simulated annealing set cover with an exponential temperature function algorithm’s objective functions ranked 29th, 45th, 41st, 31st and 9th respectively, out of the 56 selection methods with 10% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 10% greedy filter results of the weighted simulated annealing set cover with an exponential temperature function algorithm’s objective functions ranked 14th, 31st, 33rd, 43rd and 4th respectively, out of the 56 selection methods with 10% greedy filtering.

When ranked globally, across all filter parameters, by highest to lowest average accuracy across the five comparison methods, described above, the top three objective function / filtering combinations for the weighted simulated annealing set cover with an exponential temperature function algorithm were: selecting for solutions with the fewest motifs with 10% greedy filtering, selecting for solutions with the lowest number of motifs multiplied by the number of background sequences with 10% greedy filtering and selecting for the highest DME score with 10% greedy filtering. They ranked 35th, 44th and 45th respectively, out of 224 methods, with average accuracies of 0.7436, 0.7429 and 0.7429. When ranked by highest to lowest average precision globally, across all filters, the top three objective function / filtering combinations were: selecting for the highest DME score with 10% greedy filtering, selecting for solutions with the fewest motifs with 10% greedy filtering and selecting for the highest information content with 10% greedy filtering. They ranked 10th, 32nd and 34th respectively, out of 224 methods, with average precisions of 0.4601, 0.4542 and 0.4536.
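The rankings reported throughout this chapter come from averaging a metric over the five comparison methods and then ordering the selection methods by that average. The following Python sketch illustrates that procedure under stated assumptions: accuracy and precision are taken to be the standard confusion-matrix definitions, tied averages are assumed to share a rank (which appears consistent with statements such as "ranked 4th for all objective functions"), and the function names and the count layout are hypothetical rather than taken from the thesis code.

```python
from statistics import mean

def accuracy(tp, fp, tn, fn):
    """Standard accuracy: correct decisions over all decisions."""
    return (tp + tn) / (tp + fp + tn + fn)

def precision(tp, fp, tn, fn):
    """Standard precision: true positives over all predicted positives."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def rank_methods(counts, metric, higher_is_better=True):
    """Rank selection methods by a metric averaged over comparison methods.

    counts maps a (hypothetical) selection-method name to a list of
    (tp, fp, tn, fn) tuples, one per comparison method. Returns a list of
    (rank, name, average) tuples; methods with equal averages share a rank.
    """
    averages = {name: mean(metric(*row) for row in rows)
                for name, rows in counts.items()}
    ordered = sorted(averages.items(), key=lambda kv: kv[1],
                     reverse=higher_is_better)
    ranked = []
    for position, (name, avg) in enumerate(ordered, start=1):
        # Assumption: ties receive the rank of the first tied method.
        rank = ranked[-1][0] if ranked and avg == ranked[-1][2] else position
        ranked.append((rank, name, avg))
    return ranked
```

Passing `higher_is_better=False` gives the lowest-to-highest ordering used for background sequence coverage.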

5.6.3 Boltzmann Temperature Function Results

The weighted simulated annealing set cover with a Boltzmann temperature function algorithm was used to identify sets of motifs in the ENCODE dataset using five different objective functions. The resultant sets of motifs were filtered using a greedy algorithm to further reduce the size of the motif sets with 2%, 5% and 10% filtering parameters. Metrics were then calculated using the five comparison methods, described above, to compare the effectiveness of the different objective functions and different filtering parameters at identifying the biologically significant set of motifs. Each objective function for the weighted simulated annealing set cover with a Boltzmann temperature function algorithm was then ranked within each filtering set and globally, across all filters. The five different objective functions used in conjunction with the weighted simulated annealing set cover with a Boltzmann temperature function algorithm attempted to: select the solution with the fewest motifs, select the solution with the lowest number of motifs multiplied by the number of background sequences, select the solution with the fewest background sequences, select for the highest information content and select for the highest DME score.
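The greedy filter is defined earlier in the thesis and is not restated in this section. As a rough illustration only, the Python sketch below assumes one plausible reading of the filtering parameter: motifs are kept greedily by the number of still-uncovered foreground sequences they cover, and selection stops once the best remaining motif would add fewer than the given fraction (for example 0.02, 0.05 or 0.10) of the foreground sequences. The function name, the input layout and this stopping rule are assumptions, not the thesis implementation.

```python
def greedy_filter(motif_hits, num_sequences, threshold):
    """Greedily keep motifs until the marginal coverage gain is too small.

    motif_hits maps a motif name to the set of foreground sequence ids it
    occurs in. threshold is the assumed filtering parameter expressed as a
    fraction of num_sequences.
    """
    uncovered = set().union(*motif_hits.values())
    remaining = dict(motif_hits)
    min_gain = threshold * num_sequences
    kept = []
    while remaining and uncovered:
        # Ties are broken alphabetically so the result is deterministic.
        best = max(sorted(remaining),
                   key=lambda m: len(remaining[m] & uncovered))
        gain = len(remaining[best] & uncovered)
        if gain == 0 or gain < min_gain:
            break
        kept.append(best)
        uncovered -= remaining.pop(best)
    return kept
```

Under this reading, a larger threshold stops earlier and therefore yields a smaller motif set.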

When ranked by highest to lowest average accuracy across the five comparison methods the unfiltered results of the weighted simulated annealing set cover with a Boltzmann temperature function algorithm’s objective functions ranked 10th, 15th, 14th, 35th and 33rd respectively, out of the 56 selection methods without filtering. When ranked by highest to lowest average precision across the five comparison methods the unfiltered results of the weighted simulated annealing set cover with a Boltzmann temperature function algorithm’s objective functions ranked 16th, 19th, 24th, 41st and 4th respectively, out of the 56 selection methods without filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the unfiltered results of the weighted simulated annealing set cover with a Boltzmann temperature function algorithm’s objective functions ranked 1st for all objective functions, out of the 56 selection methods without filtering. When ranked by lowest to highest average background sequence coverage across the five comparison methods the unfiltered results of the weighted simulated annealing set cover with a Boltzmann temperature function algorithm’s objective functions ranked 32nd, 17th, 10th, 11th and 23rd respectively, out of the 56 selection methods without filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 2% greedy filter results of the weighted simulated annealing set cover with a Boltzmann temperature function algorithm’s objective functions ranked 50th, 55th, 9th, 18th and 23rd respectively, out of the 56 selection methods with 2% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 2% greedy filter results of the weighted simulated annealing set cover with a Boltzmann temperature function algorithm’s objective functions ranked 48th, 35th, 13th, 18th and 15th respectively, out of the 56 selection methods with 2% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 2% greedy filter results of the weighted simulated annealing set cover with a Boltzmann temperature function algorithm’s objective functions ranked 9th, 20th, 26th, 38th and 30th respectively, out of the 56 selection methods with 2% greedy filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 5% greedy filter results of the weighted simulated annealing set cover with a Boltzmann temperature function algorithm’s objective functions ranked 15th, 44th, 8th, 20th and 29th respectively, out of the 56 selection methods with 5% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 5% greedy filter results of the weighted simulated annealing set cover with a Boltzmann temperature function algorithm’s objective functions ranked 32nd, 35th, 18th, 15th and 23rd respectively, out of the 56 selection methods with 5% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 5% greedy filter results of the weighted simulated annealing set cover with a Boltzmann temperature function algorithm’s objective functions ranked 18th, 19th, 29th, 46th and 25th respectively, out of the 56 selection methods with 5% greedy filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 10% greedy filter results of the weighted simulated annealing set cover with a Boltzmann temperature function algorithm’s objective functions ranked 19th, 54th, 11th, 34th and 22nd respectively, out of the 56 selection methods with 10% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 10% greedy filter results of the weighted simulated annealing set cover with a Boltzmann temperature function algorithm’s objective functions ranked 33rd, 49th, 23rd, 26th and 24th respectively, out of the 56 selection methods with 10% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 10% greedy filter results of the weighted simulated annealing set cover with a Boltzmann temperature function algorithm’s objective functions ranked 5th, 20th, 34th, 48th and 13th respectively, out of the 56 selection methods with 10% greedy filtering.

When ranked globally, across all filter parameters, by highest to lowest average accuracy across the five comparison methods, described above, the top three objective function / filtering combinations for the weighted simulated annealing set cover with a Boltzmann temperature function algorithm were: selecting for solutions with the fewest background sequences with 10% greedy filtering, selecting for solutions with the fewest motifs with 10% greedy filtering and selecting for the highest DME score with 10% greedy filtering. They ranked 17th, 25th and 28th respectively, out of 224 methods, with average accuracies of 0.7448, 0.7444 and 0.7437. When ranked by highest to lowest average precision globally, across all filters, the top three objective function / filtering combinations were: selecting for solutions with the fewest background sequences with 10% greedy filtering, selecting for the highest DME score with 10% greedy filtering and selecting for the highest information content with 10% greedy filtering. They ranked 26th, 27th and 29th respectively, out of 224 methods, with average precisions of 0.4563, 0.4562 and 0.4556.

5.6.4 Discussion

The weighted simulated annealing set cover methods performed well in terms of accuracy and precision without any filtering, being beaten in terms of accuracy and precision only by the weighted branch and cut methods. With filtering applied, however, the weighted simulated annealing algorithms performed poorly when compared to the greedy algorithms. The weighted simulated annealing methods obtained optimal or near optimal solutions for many of the experiments, with the exponential temperature function selecting on average 15.9 motifs per solution and the optimal branch and cut method selecting 15.4 motifs per solution. The branch and cut average does not include the four experiments in which no optimal solution could be obtained within 24 hours. The simulated annealing algorithms also outperformed the relaxed integer linear programming with randomized rounding solutions without filtering. Unfortunately, even though these algorithms performed well without filtering, they did not perform well when filtering was applied, similar to other algorithms that attempted to identify optimal solutions. Of the three temperature functions that were tested, the exponential temperature function identified the solutions closest to optimal, followed by the fast temperature function and finally by the Boltzmann temperature function.
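The three temperature functions differ only in how quickly the temperature falls as the search proceeds. The Python sketch below shows the textbook forms of the exponential, fast and Boltzmann cooling schedules together with the usual Metropolis acceptance rule. The exact formulas, the starting temperature and the decay factor `alpha` used in the thesis experiments are not restated in this section, so every parameter here is an assumption chosen for illustration.

```python
import math

def exponential_temperature(t0, k, alpha=0.95):
    """Exponential schedule: T_k = T0 * alpha**k (alpha is an assumed value)."""
    return t0 * alpha ** k

def fast_temperature(t0, k):
    """Fast schedule: T_k = T0 / (1 + k)."""
    return t0 / (1 + k)

def boltzmann_temperature(t0, k):
    """Boltzmann schedule: T_k = T0 / ln(1 + k), defined for k >= 1."""
    return t0 / math.log(1 + k)

def acceptance_probability(delta, temperature):
    """Metropolis rule: always accept improvements (delta <= 0); accept a
    worse solution with probability exp(-delta / T)."""
    if delta <= 0:
        return 1.0
    return math.exp(-delta / temperature)

if __name__ == "__main__":
    for k in (1, 10, 100):
        print(k,
              round(exponential_temperature(1.0, k), 4),
              round(fast_temperature(1.0, k), 4),
              round(boltzmann_temperature(1.0, k), 4))
```

With these assumed parameters the Boltzmann schedule is still comparatively hot after 100 iterations, while the exponential schedule has cooled the most, so the three variants accept worsening moves at quite different rates late in the search.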

5.7 Linear Programming Using Branch and Cut Set Cover

5.7.1 Results

The linear programming using branch and cut algorithm was used to identify sets of motifs in the ENCODE dataset using six different objective functions. Optimal solutions for four experiments could not be identified out of the 402 experiments tested in the ENCODE data within 24 hours of runtime, and are not included in these results. For each of the 398 data sets tested the resultant sets of motifs were filtered using a greedy algorithm to further reduce the size of the motif sets with 2%, 5% and 10% filtering parameters. Metrics were then calculated using the five comparison methods, described above, to compare the effectiveness of the different objective functions and different filtering parameters at identifying the biologically significant set of motifs. Each objective function for the linear programming using branch and cut algorithm was then ranked within each filtering set and globally, across all filters. The six different objective functions used in conjunction with the linear programming using branch and cut algorithm attempted to: select the motifs with the lowest background sequence count, select the fewest motifs, select for the highest information content, select for the highest DME score, select the motifs with the lowest ratio of background sequence occurrences over foreground sequence occurrences and select the motifs with the most foreground sequence occurrences.

When ranked by highest to lowest average accuracy across the five comparison methods the unfiltered results of the linear programming using branch and cut algorithm’s objective functions ranked 4th for all objective functions, out of the 56 selection methods without filtering. When ranked by highest to lowest average precision across the five comparison methods the unfiltered results of the linear programming using branch and cut algorithm’s objective functions ranked 7th for all objective functions, out of the 56 selection methods without filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the unfiltered results of the linear programming using branch and cut algorithm’s objective functions ranked 48th for all objective functions, out of the 56 selection methods without filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 2% greedy filter results of the linear programming using branch and cut algorithm’s objective functions ranked 39th for all objective functions, out of the 56 selection methods with 2% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 2% greedy filter results of the linear programming using branch and cut algorithm’s objective functions ranked 36th for all objective functions, out of the 56 selection methods with 2% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 2% greedy filter results of the linear programming using branch and cut algorithm’s objective functions ranked 2nd for all objective functions, out of the 56 selection methods with 2% greedy filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 5% greedy filter results of the linear programming using branch and cut algorithm’s objective functions ranked 35th for all objective functions, out of the 56 selection methods with 5% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 5% greedy filter results of the linear programming using branch and cut algorithm’s objective functions ranked 24th for all objective functions, out of the 56 selection methods with 5% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 5% greedy filter results of the linear programming using branch and cut algorithm’s objective functions ranked 3rd for all objective functions, out of the 56 selection methods with 5% greedy filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 10% greedy filter results of the linear programming using branch and cut algorithm’s objective functions ranked 23rd for all objective functions, out of the 56 selection methods with 10% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 10% greedy filter results of the linear programming using branch and cut algorithm’s objective functions ranked 12th for all objective functions, out of the 56 selection methods with 10% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 10% greedy filter results of the linear programming using branch and cut algorithm’s objective functions ranked 7th for all objective functions, out of the 56 selection methods with 10% greedy filtering.

When ranked globally, by highest to lowest average accuracy across the five comparison methods, described above, the 10% greedy filtering performed the best with a ranking of 29th for all objective functions, out of 224 methods, with an average accuracy of 0.7437. When ranked by highest to lowest average precision globally the 10% greedy filtering performed the best with an average precision of 0.4588 ranking 15th for all the objective functions out of 224 methods.

5.7.2 Discussion

The linear programming using branch and cut method performed the best among the weighted set cover methods without filtering, on average identifying a solution of just 15.4 motifs that covered all foreground sequences. Interestingly, the same set of motifs was identified across all objective functions: the set with the fewest motifs. This is reasonable considering all of the objective functions apply pressure to identify the smallest set of motifs. It is also important to note that in four of the experiments the optimal solution could not be identified within 24 hours, so the results for the other set cover methods include four experiments that the branch and cut results do not. Unfortunately, the set of motifs identified by the branch and cut method did not fare well compared to the greedy algorithms when filtering was applied, like the other weighted set cover methods which attempted to identify the optimal or near optimal set of motifs. Even with the greedy methods outperforming the branch and cut method when filtering was applied, the accuracy and precision of the branch and cut method were improved when filtering was added. This is attributed to a smaller set of motifs being selected.
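To make the optimization target concrete, the following Python sketch finds a minimum-weight set of motifs that covers every foreground sequence. It is not a branch and cut implementation: it substitutes a plain exhaustive search over subsets, which reaches the same optimum on toy inputs but is only practical for a handful of motifs. The function name, the weights and the input layout are illustrative assumptions.

```python
from itertools import combinations

def min_weight_cover(motif_hits, weights):
    """Return (motifs, cost) for a minimum-weight cover, by brute force.

    motif_hits maps a motif name to the set of foreground sequence ids it
    covers; weights maps a motif name to its (assumed positive) cost.
    """
    universe = set().union(*motif_hits.values())
    motifs = sorted(motif_hits)
    best, best_cost = None, float("inf")
    for size in range(1, len(motifs) + 1):
        for subset in combinations(motifs, size):
            covered = set().union(*(motif_hits[m] for m in subset))
            if covered != universe:
                continue  # not a valid cover
            cost = sum(weights[m] for m in subset)
            if cost < best_cost:
                best, best_cost = subset, cost
    return best, best_cost
```

With unit weights the objective reduces to selecting the fewest motifs; other weightings correspond to other objective functions, although the specific weights used in the experiments are not reproduced here.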

5.8 Linear Programming Relaxation Using Randomized Rounding Set Cover

5.8.1 Results

The linear programming relaxation using randomized rounding algorithm was used to identify sets of motifs in the ENCODE dataset using six different objective functions. The resultant sets of motifs were filtered using a greedy algorithm to further reduce the size of the motif sets with 2%, 5% and 10% filtering parameters. Metrics were then calculated using the five comparison methods, described above, to compare the effectiveness of the different objective functions and different filtering parameters at identifying the biologically significant set of motifs. Each objective function for the linear programming relaxation using randomized rounding algorithm was then ranked within each filtering set and globally, across all filters. The six different objective functions used in conjunction with the linear programming relaxation using randomized rounding algorithm attempted to: select the motifs with the lowest background sequence count, select the fewest motifs, select for the highest information content, select for the highest DME score, select the motifs with the lowest ratio of background sequence occurrences over foreground sequence occurrences and select the motifs with the most foreground sequence occurrences.

When ranked by highest to lowest average accuracy across the five comparison methods the unfiltered results of the linear programming relaxation using randomized rounding algorithm’s objective functions ranked 27th, 26th, 25th, 31st, 28th and 29th respectively, out of the 56 selection methods without filtering. When ranked by highest to lowest average precision across the five comparison methods the unfiltered results of the linear programming relaxation using randomized rounding algorithm’s objective functions ranked 35th, 34th, 37th, 39th, 36th and 38th respectively, out of the 56 selection methods without filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the unfiltered results of the linear programming relaxation using randomized rounding algorithm’s objective functions ranked 1st for all objective functions, out of the 56 selection methods without filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 2% greedy filter results of the linear programming relaxation using randomized rounding algorithm’s objective functions ranked 26th, 27th, 20th, 21st, 28th and 25th respectively, out of the 56 selection methods with 2% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 2% greedy filter results of the linear programming relaxation using randomized rounding algorithm’s objective functions ranked 31st, 30th, 26th, 28th, 32nd and 27th respectively, out of the 56 selection methods with 2% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 2% greedy filter results of the linear programming relaxation using randomized rounding algorithm’s objective functions ranked 12th, 15th, 13th, 17th, 16th and 14th respectively, out of the 56 selection methods with 2% greedy filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 5% greedy filter results of the linear programming relaxation using randomized rounding algorithm’s objective functions ranked 52nd, 46th, 45th, 49th, 53rd and 50th respectively, out of the 56 selection methods with 5% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 5% greedy filter results of the linear programming relaxation using randomized rounding algorithm’s objective functions ranked 41st, 44th, 42nd, 48th, 47th and 45th respectively, out of the 56 selection methods with 5% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 5% greedy filter results of the linear programming relaxation using randomized rounding algorithm’s objective functions ranked 12th, 15th, 17th, 14th, 16th and 13th respectively, out of the 56 selection methods with 5% greedy filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 10% greedy filter results of the linear programming relaxation using randomized rounding algorithm’s objective functions ranked 45th, 40th, 39th, 41st, 46th and 47th respectively, out of the 56 selection methods with 10% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 10% greedy filter results of the linear programming relaxation using randomized rounding algorithm’s objective functions ranked 38th, 39th, 40th, 48th, 44th and 42nd respectively, out of the 56 selection methods with 10% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 10% greedy filter results of the linear programming relaxation using randomized rounding algorithm’s objective functions ranked 27th, 29th, 24th, 22nd, 23rd and 26th respectively, out of the 56 selection methods with 10% greedy filtering.

When ranked globally, across all filter parameters, by highest to lowest average accuracy across the five comparison methods, described above, the top three objective function / filtering combinations for the linear programming relaxation using randomized rounding algorithm were: selecting for the highest information content with 10% greedy filtering, selecting for the solution with the fewest motifs with 10% greedy filtering and selecting for the highest DME score with 10% greedy filtering. They ranked 47th, 48th and 49th respectively, out of 224 methods, with average accuracies of 0.7428, 0.7427 and 0.7427. When ranked by highest to lowest average precision globally, across all filters, the top three objective function / filtering combinations were: selecting for motifs with the lowest background sequence count with 10% greedy filtering, selecting for the solution with the fewest motifs with 10% greedy filtering and selecting for the highest information content with 10% greedy filtering. They ranked 41st, 42nd and 43rd respectively, out of 224 methods, with average precisions of 0.4494, 0.4493 and 0.4490.

5.8.2 Discussion

The linear programming relaxation using randomized rounding method did not perform very well with or without filtering when compared to the other weighted set cover solutions. Without filtering, the randomized rounding method outperformed, in terms of accuracy and precision, the greedy methods, but performed worse than the branch and cut methods, the simulated annealing methods and the hill climbing methods. The linear programming relaxation method achieved its best accuracy using a 10% greedy filter and selecting for the highest information content, and achieved its highest precision using a 10% greedy filter and selecting for the fewest background sequences.
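Randomized rounding converts the fractional solution of the relaxed linear program into a concrete set of motifs by treating each fractional value as a selection probability. The Python sketch below assumes the fractional values have already been produced by an LP solver (not shown) and uses a common variant that repeats rounding passes until every foreground sequence is covered. The rounding scheme actually used in the thesis may differ, and all names here are hypothetical.

```python
import random

def randomized_rounding(fractional, motif_hits, rng, max_rounds=100):
    """Round a fractional set cover solution into a valid cover.

    fractional maps a motif name to its LP value in [0, 1]; motif_hits maps
    a motif name to the set of foreground sequence ids it covers; rng is a
    random.Random instance so that runs are reproducible.
    """
    universe = set().union(*motif_hits.values())
    chosen, covered = set(), set()
    for _ in range(max_rounds):
        for motif in sorted(fractional):
            # Select each unchosen motif with probability equal to its LP value.
            if motif not in chosen and rng.random() < fractional[motif]:
                chosen.add(motif)
                covered |= motif_hits[motif]
        if covered == universe:
            return chosen
    raise RuntimeError("no cover found within max_rounds rounding passes")
```

Because motifs are included by chance, the rounded cover can contain more motifs than an exact solution would.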

5.9 Top Ten Scoring Motifs

5.9.1 Results

The top ten scoring motifs algorithm was used to identify sets of motifs in the ENCODE dataset that act as a baseline to compare the other methods to. The resultant sets of motifs were filtered using a greedy algorithm to further reduce the size of the motif sets with 2%, 5% and 10% filtering parameters. Metrics were then calculated using the five comparison methods, described above, to compare the effectiveness of the filtering parameters at identifying the biologically significant set of motifs. The top ten scoring motifs algorithm was then ranked within each filtering set and globally, across all filters.

When ranked by highest to lowest average accuracy across the five comparison methods the unfiltered results of the top ten scoring motifs algorithm placed in 2nd out of the 56 selection methods without filtering. When ranked by highest to lowest average precision across the five comparison methods the unfiltered results of the top ten scoring motifs algorithm placed in 2nd out of the 56 selection methods without filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the unfiltered results of the top ten scoring motifs algorithm placed in 55th out of the 56 selection methods without filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 2% greedy filter results of the top ten scoring motifs algorithm placed in 2nd out of the 56 selection methods with 2% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 2% greedy filter results of the top ten scoring motifs algorithm placed in 2nd out of the 56 selection methods with 2% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 2% greedy filter results of the top ten scoring motifs algorithm placed in 55th out of the 56 selection methods with 2% greedy filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 5% greedy filter results of the top ten scoring motifs algorithm placed in 2nd out of the 56 selection methods with 5% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 5% greedy filter results of the top ten scoring motifs algorithm placed in 2nd out of the 56 selection methods with 5% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 5% greedy filter results of the top ten scoring motifs algorithm placed in 55th out of the 56 selection methods with 5% greedy filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 10% greedy filter results of the top ten scoring motifs algorithm placed in 2nd out of the 56 selection methods with 10% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 10% greedy filter results of the top ten scoring motifs algorithm placed in 5th out of the 56 selection methods with 10% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 10% greedy filter results of the top ten scoring motifs algorithm placed in 53rd out of the 56 selection methods with 10% greedy filtering.

When ranked globally by highest to lowest average accuracy across the five comparison methods, described above, the top ten scoring motifs algorithm ranked 8th, 4th, 6th and 7th out of 224 methods, for unfiltered, 2% filtering, 5% filtering and 10% filtering respectively, with average accuracies of 0.7467, 0.7504, 0.7492 and 0.7480. When ranked globally by highest to lowest average precision across the five comparison methods, described above, the top ten scoring motifs algorithm ranked 106th, 52nd, 14th and 6th out of 224 methods, for unfiltered, 2% filtering, 5% filtering and 10% filtering respectively, with average precisions of 0.4220, 0.4473, 0.4592 and 0.4668.

5.9.2 Discussion

The top ten scoring motifs method performed very well, second only to selecting the top ten highest sequence coverage motifs. The method ranked second in accuracy with and without filtering. For the majority of the filters this method also performed second best in precision and best in specificity. Overall this method performs very well at identifying biologically significant motifs, and outperforms all the weighted set cover methods. This result is somewhat expected, as this method is one of the most commonly chosen methods for selecting motifs after the motif discovery phase. It is important to note that using a greedy filter did improve the results compared to just selecting the top 10 scoring motifs.

5.10 Top Ten Sequence Coverage Motifs

5.10.1 Results

The top ten sequence coverage motifs algorithm was used to identify sets of motifs in the ENCODE dataset that act as a baseline to compare the other methods to. The resultant sets of motifs were filtered using a greedy algorithm to further reduce the size of the motif sets with 2%, 5% and 10% filtering parameters. Metrics were then calculated using the five comparison methods, described above, to compare the effectiveness of the filtering parameters at identifying the biologically significant set of motifs. The top ten sequence coverage motifs algorithm was then ranked within each filtering set and globally, across all filters.
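Both top ten baselines amount to sorting the discovered motifs by a single value and keeping the first ten. A minimal Python sketch is given below; the helper name `top_k` and the toy data layout are assumptions for illustration, and the thesis pipeline may break ties differently.

```python
def top_k(items, key, k=10):
    """Return the k items with the largest key value.

    sorted() is stable, so items with equal key values keep their input order.
    """
    return sorted(items, key=key, reverse=True)[:k]

if __name__ == "__main__":
    # Hypothetical data: motif name -> set of foreground sequence ids covered.
    motif_hits = {f"m{i:02d}": set(range(i)) for i in range(1, 13)}
    by_coverage = top_k(motif_hits, key=lambda m: len(motif_hits[m]))
    print(by_coverage)
```

Passing the score reported by the motif discovery application as `key` gives the top ten scoring baseline, while passing the foreground sequence count gives the top ten sequence coverage baseline.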

When ranked by highest to lowest average accuracy across the five comparison methods the unfiltered results of the top ten sequence coverage motifs algorithm placed in 1st out of the 56 selection methods without filtering. When ranked by highest to lowest average precision across the five comparison methods the unfiltered results of the top ten sequence coverage motifs algorithm placed in 1st out of the 56 selection methods without filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the unfiltered results of the top ten sequence coverage motifs algorithm placed in 54th out of the 56 selection methods without filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 2% greedy filter results of the top ten sequence coverage motifs algorithm placed in 1st out of the 56 selection methods with 2% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 2% greedy filter results of the top ten sequence coverage motifs algorithm placed in 1st out of the 56 selection methods with 2% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 2% greedy filter results of the top ten sequence coverage motifs algorithm placed in 54th out of the 56 selection methods with 2% greedy filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 5% greedy filter results of the top ten sequence coverage motifs algorithm placed in 1st out of the 56 selection methods with 5% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 5% greedy filter results of the top ten sequence coverage motifs algorithm placed in 1st out of the 56 selection methods with 5% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 5% greedy filter results of the top ten sequence coverage motifs algorithm placed in 53rd out of the 56 selection methods with 5% greedy filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 10% greedy filter results of the top ten sequence coverage motifs algorithm placed in 1st out of the 56 selection methods with 10% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 10% greedy filter results of the top ten sequence coverage motifs algorithm placed in 1st out of the 56 selection methods with 10% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 10% greedy filter results of the top ten sequence coverage motifs algorithm placed in 17th out of the 56 selection methods with 10% greedy filtering.

When ranked globally by highest to lowest average accuracy across the five comparison methods, described above, the top ten sequence coverage motifs algorithm ranked 5th, 1st, 2nd and 3rd out of 224 methods, for unfiltered, 2% filtering, 5% filtering and 10% filtering, with average accuracies of 0.7492, 0.7538, 0.7520 and 0.7505, respectively. When ranked globally by highest to lowest average precision across the five comparison methods, described above, the top ten sequence coverage motifs algorithm ranked 68th, 12th, 4th and 1st out of 224 methods, for unfiltered, 2% filtering, 5% filtering and 10% filtering, with average precisions of 0.4360, 0.4596, 0.4685 and 0.4748, respectively.
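The selection rule evaluated in this section is simple enough to state in a few lines. The following Python sketch is illustrative only: the function name, the dictionary layout and the toy motifs are assumptions made for this example and are not the implementation used to produce the results above.

```python
from typing import Dict, List, Set


def top_k_by_coverage(coverage: Dict[str, Set[str]], k: int = 10) -> List[str]:
    """Return the k motifs that cover the most foreground sequences."""
    # Sort by descending number of covered sequences; ties are broken by
    # motif name so that the result is deterministic.
    ranked = sorted(coverage, key=lambda m: (-len(coverage[m]), m))
    return ranked[:k]


# Hypothetical toy data: motif -> IDs of the foreground sequences containing it.
foreground_hits = {
    "ACGTAC": {"s1", "s2", "s3"},
    "TTGACA": {"s2"},
    "GGCCAA": {"s1", "s4"},
}
selected = top_k_by_coverage(foreground_hits, k=2)
```

With the toy data, the two motifs covering three and two sequences are kept and the motif covering a single sequence is dropped. No motif score from the discovery tool is consulted, which is why the method is cheap to compute.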

5.10.2 Discussion

Selecting the top 10 motifs with the highest foreground sequence coverage is the best method for identifying biologically significant motifs, with and without filtering. In terms of accuracy and precision this method ranked first within every filtering set, outperforming even the scores output by the motif discovery application. The method was improved further by using a greedy filter to reduce the number of motifs selected, achieving a maximum average accuracy of 75.376% with a 2% greedy filter threshold. It is also extremely easy and fast to compute and, based on these results, is a good alternative to using the scores generated by motif discovery applications.

5.11 Top Ten Information Content Motifs

5.11.1 Results

The top ten information content motifs algorithm was used to identify sets of motifs in the ENCODE dataset that act as a baseline to compare the other methods to. The resultant sets of motifs were filtered using a greedy algorithm to further reduce the size of the motif sets with 2%, 5% and 10% filtering parameters. Metrics were then calculated using the five comparison methods, described above, to compare the effectiveness of the filtering parameters at identifying the biologically significant set of motifs. The top ten information content motifs algorithm was then ranked within each filtering set and globally, across all filters.

When ranked by highest to lowest average accuracy across the five comparison methods the unfiltered results of the top ten information content motifs algorithm placed in 3rd out of the 56 selection methods without filtering. When ranked by highest to lowest average precision across the five comparison methods the unfiltered results of the top ten information content motifs algorithm placed in 47th out of the 56 selection methods without filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the unfiltered results of the top ten information content motifs algorithm placed in 56th out of the 56 selection methods without filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 2% greedy filter results of the top ten information content motifs algorithm placed in 54th out of the 56 selection methods with 2% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 2% greedy filter results of the top ten information content motifs algorithm placed in 56th out of the 56 selection methods with 2% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 2% greedy filter results of the top ten information content motifs algorithm placed in 56th out of the 56 selection methods with 2% greedy filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 5% greedy filter results of the top ten information content motifs algorithm placed in 56th out of the 56 selection methods with 5% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 5% greedy filter results of the top ten information content motifs algorithm placed in 56th out of the 56 selection methods with 5% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 5% greedy filter results of the top ten information content motifs algorithm placed in 56th out of the 56 selection methods with 5% greedy filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 10% greedy filter results of the top ten information content motifs algorithm placed in 56th out of the 56 selection methods with 10% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 10% greedy filter results of the top ten information content motifs algorithm placed in 56th out of the 56 selection methods with 10% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 10% greedy filter results of the top ten information content motifs algorithm placed in 56th out of the 56 selection methods with 10% greedy filtering.

When ranked globally by highest to lowest average accuracy across the five comparison methods, described above, the top ten information content motifs algorithm ranked 171st, 168th, 146th and 132nd out of 224 methods, for unfiltered, 2% filtering, 5% filtering and 10% filtering, with average accuracies of 0.7217, 0.7294, 0.7320 and 0.7331, respectively. When ranked globally by highest to lowest average precision across the five comparison methods, described above, the top ten information content motifs algorithm ranked 215th, 173rd, 169th and 168th out of 224 methods, for unfiltered, 2% filtering, 5% filtering and 10% filtering, with average precisions of 0.2903, 0.3275, 0.3501 and 0.3645, respectively.

5.11.2 Discussion

Selecting the top 10 motifs with the highest information content performed well in terms of accuracy compared to the other methods without filtering, ranking 3rd of 56, but poorly in terms of precision, ranking 47th of 56. When filtering was applied, it performed very poorly on both measures compared to the other methods. Globally, the algorithm reached its highest accuracy rank, 132nd out of 224 methods, with 10% filtering, and its highest precision rank, 168th out of 224 methods, also with 10% filtering. Overall, selecting the top 10 information content motifs is not a good solution for identifying biologically significant motifs.
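For reference, the information content of a motif can be computed from its position probability matrix. The sketch below assumes a uniform background over the four nucleotides and applies no small-sample correction, so each column contributes between 0 and 2 bits; the exact formula used in this study may differ. The function names and the two toy matrices are hypothetical.

```python
import math
from typing import Dict, List

Column = Dict[str, float]  # base -> probability at one motif position


def information_content(pwm: List[Column]) -> float:
    """Total information content in bits, assuming a uniform background."""
    total = 0.0
    for col in pwm:
        # Shannon entropy of the column; zero-probability bases contribute 0.
        entropy = -sum(p * math.log2(p) for p in col.values() if p > 0)
        total += 2.0 - entropy  # 2 bits is the maximum for a 4-letter alphabet
    return total


def top_k_by_information(pwms: Dict[str, List[Column]], k: int = 10) -> List[str]:
    """Return the k motif names with the highest information content."""
    return sorted(pwms, key=lambda m: (-information_content(pwms[m]), m))[:k]


# Hypothetical two-position motifs: one fully conserved, one fully degenerate.
conserved = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}] * 2
degenerate = [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}] * 2
best = top_k_by_information({"conserved": conserved, "degenerate": degenerate}, k=1)
```

Note that this score depends only on how conserved the motif is, not on how many foreground or background sequences contain it, which is consistent with the poor foreground sequence coverage ranks reported above.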

5.12 Using All Motifs

5.12.1 Results

The using all motifs algorithm was used to identify sets of motifs in the ENCODE dataset that act as a baseline to compare the other methods to. The resultant sets of motifs were filtered using a greedy algorithm to further reduce the size of the motif sets with 2%, 5% and 10% filtering parameters. Metrics were then calculated using the five comparison methods, described above, to compare the effectiveness of the filtering parameters at identifying the biologically significant set of motifs. The using all motifs algorithm was then ranked within each filtering set and globally, across all filters.

When ranked by highest to lowest average accuracy across the five comparison methods the unfiltered results of the using all motifs algorithm placed in 56th out of the 56 selection methods without filtering. When ranked by highest to lowest average precision across the five comparison methods the unfiltered results of the using all motifs algorithm placed in 56th out of the 56 selection methods without filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the unfiltered results of the using all motifs algorithm placed in 1st out of the 56 selection methods without filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 2% greedy filter results of the using all motifs algorithm placed in 11th out of the 56 selection methods with 2% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 2% greedy filter results of the using all motifs algorithm placed in 55th out of the 56 selection methods with 2% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 2% greedy filter results of the using all motifs algorithm placed in 52nd out of the 56 selection methods with 2% greedy filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 5% greedy filter results of the using all motifs algorithm placed in 11th out of the 56 selection methods with 5% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 5% greedy filter results of the using all motifs algorithm placed in 55th out of the 56 selection methods with 5% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 5% greedy filter results of the using all motifs algorithm placed in 52nd out of the 56 selection methods with 5% greedy filtering.

When ranked by highest to lowest average accuracy across the five comparison methods the 10% greedy filter results of the using all motifs algorithm placed in 18th out of the 56 selection methods with 10% greedy filtering. When ranked by highest to lowest average precision across the five comparison methods the 10% greedy filter results of the using all motifs algorithm placed in 55th out of the 56 selection methods with 10% greedy filtering. When ranked by highest to lowest average foreground sequence coverage across the five comparison methods the 10% greedy filter results of the using all motifs algorithm placed in 54th out of the 56 selection methods with 10% greedy filtering.

When ranked globally by highest to lowest average accuracy across the five comparison methods, described above, the using all motifs algorithm ranked 224th, 123rd, 68th and 24th out of 224 methods, for unfiltered, 2% filtering, 5% filtering and 10% filtering, with average accuracies of 0.4060, 0.7343, 0.7413 and 0.7444, respectively. When ranked globally by highest to lowest average precision across the five comparison methods, described above, the using all motifs algorithm ranked 224th, 167th, 123rd and 69th out of 224 methods, for unfiltered, 2% filtering, 5% filtering and 10% filtering, with average precisions of 0.1346, 0.3715, 0.4040 and 0.4351, respectively.

5.12.2 Discussion

The using all motifs method was used as a baseline against which to compare the weighted set cover methods. As expected, this method performed the worst in terms of accuracy and precision without any filters. When filters were applied, the solutions became competitive with the weighted set cover methods in terms of accuracy, but not precision. This is expected, since the greedy delete filter would select for the motifs with the most foreground sequences and the fewest background sequences. This result also validates the effectiveness of the greedy delete filter at reducing the number of motifs. The top accuracy obtained for this method, 74.44%, was achieved using a 10% greedy filter; by comparison, the unfiltered top 10 coverage motifs method achieved an average accuracy of 74.92%, only marginally better.

CHAPTER 6: CASE STUDY RESULTS

6.1 Rat Estrous Cycle Results

6.1.1 Differentially Expressed Genes

Using ANOVA to compare the sampled time points E6, E10 and D10 across all brain regions revealed 198 probes and 106 genes to be differentially expressed with a nominal p-value under the 0.05 threshold. Each brain region was also considered separately, with 58 probes and 31 genes differentially expressed in the basal forebrain, 177 probes and 110 genes differentially expressed in the frontal cortex, and 39 probes and 21 genes differentially expressed in the hippocampus. A complete listing of differentially expressed genes for each brain region can be found in supplemental data files 1-4.

The BETR algorithm, for time-course microarray data, identified 465 probes and 273 genes differentially expressed, with probability greater than 0.99, across the three time points E6, E10 and D10 in all brain regions. Each of the three brain regions individually exhibited a larger number of differentially expressed probes and genes compared to the probes and genes expressed across all brain regions. 693 probes and 422 genes were differentially expressed in the frontal cortex. 798 probes and 529 genes were differentially expressed in the basal forebrain. 1642 probes and 1054 genes were differentially expressed in the hippocampus. The hippocampus contained the most differentially expressed probes and genes when using a BETR probability threshold of 0.99. The top genes with BETR probabilities of 1.00 are shown in Tables 5-8 for each of the corresponding brain regions described.

6.1.2 Fuzzy Clustering

Fuzzy clustering was used to identify biologically significant gene expression patterns, in the differentially expressed genes, across the three time points sampled. Nine expression patterns were identified for each of the specific areas of the brain and across the brain as a whole. Across the brain as a whole, clusters of size 14, 88, 4, 30, 14, 5, 19, 50 and 49 genes were identified corresponding to clusters 1-9, Figure 9. In the basal forebrain, clusters of size 5, 49, 40, 37, 85, 97, 65, 38 and 113 genes were identified corresponding to clusters 1-9, Figure 10. In the frontal cortex, clusters of size 16, 41, 29, 96, 53, 24, 110, 4 and 50 genes were identified corresponding to clusters 1-9, Figure 11. In the hippocampus, clusters of size 113, 72, 209, 119, 26, 31, 115, 252 and 116 genes were identified corresponding to clusters 1-9, Figure 12.
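In fuzzy clustering each gene receives a membership value for every cluster rather than a single hard assignment. As an illustration, the sketch below computes memberships with the standard fuzzy c-means formula for a fixed set of cluster centers. The fuzzifier m = 2, the Euclidean distance and the example profiles are assumptions made for this sketch; the clustering software and parameters used in this study are not reproduced here.

```python
import math
from typing import List, Sequence


def fuzzy_memberships(
    profile: Sequence[float],
    centers: Sequence[Sequence[float]],
    m: float = 2.0,
) -> List[float]:
    """Fuzzy c-means membership of one expression profile in each cluster."""
    dists = [math.dist(profile, c) for c in centers]
    # A profile lying exactly on a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    # u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1)); the memberships sum to 1.
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]


# Hypothetical cluster centers over the three time points (E6, E10, D10).
centers = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
u = fuzzy_memberships((0.5, 0.0, 0.0), centers)
```

For the example profile, the memberships are 0.9 for the nearer center and 0.1 for the farther one. Assigning each gene to the cluster in which its membership is largest gives cluster sizes of the kind listed above.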

6.1.3 Gene Ontology Analysis

Gene ontology terms were identified for each differentially expressed set of genes identified by the BETR algorithm. The analysis across all brain regions identified 192 GO terms to be statistically significant with a p-value less than 0.05. The analyses for each of the three specific brain regions identified 242 significant GO terms in the frontal cortex, 180 significant GO terms in the basal forebrain, and 276 significant GO terms in the hippocampus. The top GO terms across the entire brain included RNA splicing, organelle organization, cytoskeleton organization and cell-substrate adhesion. The top GO terms for the frontal cortex included neuron morphogenesis, skeletal muscle organ development and tube morphogenesis. The top GO terms for the basal forebrain included regulation of peptidyl-threonine phosphorylation, regulation of Notch signaling pathway, perception of sound and blood circulation. The top GO terms in the hippocampus included signaling, cellular homeostasis and nervous system development. KEGG pathways were also identified by GAGE for each of the nine comparisons, three per brain region, but are not presented here.

6.1.4 Transcription Factor Analysis

For each brain region, known transcription factor binding sites (TFBSs) were compared to putative TFBSs identified in the promoter regions of differentially expressed clustered genes. Five of the nine clusters identified across all brain regions matched at least one known transcription factor, with the matching transcription factors listed in Table 1. In the basal forebrain, three of the clusters mapped to known transcription factors (Table 2). In the frontal cortex, 5 clusters mapped to known transcription factors (Table 3), and in the hippocampus, 7 of the clusters mapped to known transcription factors (Table 4). In every brain region, proteins from the FOX family were present in at least one of the clusters identified. The top two gene ontology terms for each cluster can also be found in Tables 1-4.


Figure 9. The nine clusters identified by fuzzy clustering of the differentially expressed probes, with a probability larger than 0.99, identified by BETR across all brain regions. Each cluster has the gene with the highest membership value plotted beside the corresponding cluster to illustrate the pattern expressed by the cluster.

Figure 10. The nine clusters identified by fuzzy clustering of the differentially expressed probes, with a probability larger than 0.99, identified by BETR in the basal forebrain. Each cluster has the gene with the highest membership value plotted beside the corresponding cluster to illustrate the pattern expressed by the cluster.


Figure 11. The nine clusters identified by fuzzy clustering of the differentially expressed probes, with a probability larger than 0.99, identified by BETR in the frontal cortex. Each cluster has the gene with the highest membership value plotted beside the corresponding cluster to illustrate the pattern expressed by the cluster.


Figure 12. The nine clusters identified by fuzzy clustering of the differentially expressed probes, with a probability larger than 0.99, in the hippocampus. Each cluster has the gene with the highest membership value plotted beside the corresponding cluster to illustrate the pattern expressed by the cluster.

Table 1

The known TFBS matches for each cluster identified across all brain regions

Probe Cluster | Known TFBS Match | E Value | GO Term #1 | GO Term #2
ALL Cluster 1 |  |  | tRNA 5'-leader removal | negative regulation of protein glycosylation in Golgi
ALL Cluster 2 | MA0480.1 FOXO1 | 1.60E-02 | GTP catabolic process | guanosine-containing compound metabolic process
ALL Cluster 3 |  |  | nuclear speck organization | positive regulation of reciprocal meiotic recombination
ALL Cluster 4 |  |  | positive regulation of axon extension | mRNA localization resulting in posttranscriptional regulation of gene expression
ALL Cluster 5 | UP00021_1 ZFP281_PRIMARY | 3.66E-02 | regulation of epithelial cell proliferation | homophilic cell adhesion
ALL Cluster 6 |  |  | inactivation of MAPKK activity | protein O-linked mannosylation
ALL Cluster 7 | UP00043_2 BCL6B_SECONDARY | 1.14E-03 | regulation of cell-substrate junction assembly | cellular senescence
 | UP00033_2 ZFP410_SECONDARY | 1.61E-03 |  |
 | MA0516.1 SP2 | 1.16E-02 |  |
 | UP00093_1 KLF7_PRIMARY | 1.35E-02 |  |
 | ZNF740_FULL | 1.86E-02 |  |
 | MA0079.3 SP1 | 3.32E-02 |  |
 | ZNF740_DBD | 3.88E-02 |  |
 | MA0599.1 KLF5 | 4.79E-02 |  |
ALL Cluster 8 | UP00134_1 HOXB13_3479.1 | 5.62E-03 | cellular response to peptide hormone stimulus | response to insulin stimulus
 | HOXC10_DBD_2 | 1.99E-02 |  |
 | HOXC13_DBD | 2.36E-02 |  |
 | HOXA13_FULL | 4.80E-02 |  |
 | HOXD13_DBD | 4.80E-02 |  |
ALL Cluster 9 | FOXC2_DBD_2 | 3.41E-02 | UDP-D-xylose biosynthetic process | negative regulation of collagen binding

Table 2

The known TFBS matches for each cluster identified in the basal forebrain

Probe Cluster | Known TFBS Match | E Value | GO Term #1 | GO Term #2
Basal Cluster 1 |  |  | NADH oxidation | glycerol-3-phosphate catabolic process
Basal Cluster 2 |  |  | prolyl-tRNA aminoacylation | positive regulation of peptidyl-threonine phosphorylation
Basal Cluster 3 | MA0599.1 KLF5 | 1.43E-03 | morphogenesis of a branching structure | morphogenesis of a branching structure
 | MA0079.3 SP1 | 2.18E-03 |  |
 | UP00021_1 ZFP281_PRIMARY | 2.36E-02 |  |
 | MA0516.1 SP2 | 3.25E-02 |  |
 | MA0073.1 RREB1 | 3.42E-02 |  |
 | MA0039.2 KLF4 | 4.01E-02 |  |
Basal Cluster 4 |  |  | nuclear matrix anchoring at nuclear membrane | cytoskeletal anchoring at nuclear membrane
Basal Cluster 5 | FOXC1_DBD | 6.33E-03 | phospholipid dephosphorylation | prostaglandin production involved in inflammatory response
 | FOXC2_DBD_2 | 8.32E-03 |  |
 | FOXC1_DBD | 9.50E-03 |  |
 | FOXL1_FULL_2 | 1.57E-02 |  |
Basal Cluster 6 |  |  | multicellular organismal process | cellular response to retinoic acid
Basal Cluster 7 |  |  | membrane organization | positive regulation of GTPase activity
Basal Cluster 8 | MA0598.1 EHF | 2.32E-02 | negative regulation of neuron migration | cell motility involved in cerebral cortex radial glia guided migration
Basal Cluster 9 |  |  | protein-chromophore linkage | protein N-linked glycosylation via asparagine

Table 3

The known TFBS matches for each cluster identified in the frontal cortex

Probe Cluster | Known TFBS Match | E Value | GO Term #1 | GO Term #2
Frontal Cluster 1 | MA0469.1 E2F3 | 6.59E-03 | mechanoreceptor differentiation | negative regulation of nitric oxide mediated signal transduction
 | MA0024.2 E2F1 | 2.18E-02 |  |
Frontal Cluster 2 |  |  | positive regulation of myotube differentiation | regulation of skeletal muscle cell differentiation
Frontal Cluster 3 |  |  | morphogenesis of a branching structure | regulation of cell-substrate junction assembly
Frontal Cluster 4 | FOXC2_DBD_2 | 8.00E-04 | regulation of ventricular cardiac muscle cell membrane repolarization | membrane repolarization involved in regulation of action potential
 | FOXC1_DBD | 8.30E-04 |  |
 | FOXC1_DBD | 1.18E-03 |  |
 | FOXC1_DBD | 2.55E-03 |  |
 | FOXC2_DBD_2 | 2.58E-03 |  |
 | FOXL1_FULL_2 | 6.31E-03 |  |
 | FOXC1_DBD | 8.02E-03 |  |
 | FOXL1_FULL_2 | 1.13E-02 |  |
 | MA0041.1 FOXD3 | 2.96E-02 |  |
 | MA0042.1 FOXI1 | 3.76E-02 |  |
Frontal Cluster 5 | MA0079.3 SP1 | 5.22E-04 | biological adhesion | positive regulation of apoptotic process
 | MA0162.2 EGR1 | 2.61E-03 |  |
 | UP00021_1 ZFP281_PRIMARY | 2.84E-03 |  |
 | SP1_DBD | 5.64E-03 |  |
 | MA0599.1 KLF5 | 7.36E-03 |  |
 | MA0516.1 SP2 | 2.60E-02 |  |
Frontal Cluster 6 |  |  | amylopectin biosynthetic process | hemolysis by symbiont of host erythrocytes
Frontal Cluster 7 |  |  | lung-associated mesenchyme development | coronal suture morphogenesis
Frontal Cluster 8 |  |  | histamine uptake | blood vessel maturation
Frontal Cluster 9 | MA0079.3 SP1 | 9.61E-06 | skeletal muscle organ development | skeletal muscle cell differentiation
 | MA0162.2 EGR1 | 7.89E-05 |  |
 | SP1_DBD | 2.98E-04 |  |
 | MA0516.1 SP2 | 4.09E-04 |  |
 | KLF16_DBD | 7.78E-04 |  |
 | MA0599.1 KLF5 | 2.40E-03 |  |
 | MA0528.1 ZNF263 | 3.10E-03 |  |
 | SP3_DBD | 3.21E-03 |  |
 | UP00021_1 ZFP281_PRIMARY | 5.19E-03 |  |
 | UP00093_1 KLF7_PRIMARY | 6.05E-03 |  |
 | MA0057.1 MZF1_5-13 | 1.00E-02 |  |
 | UP00002_1 SP4_PRIMARY | 1.01E-02 |  |
 | SP4_FULL | 1.09E-02 |  |
 | MA0528.1 ZNF263 | 1.23E-02 |  |
 | SP8_DBD | 1.34E-02 |  |
 | MA0528.1 ZNF263 | 1.67E-02 |  |
 | MA0079.3 SP1 | 2.38E-02 |  |
 | MA0599.1 KLF5 | 2.54E-02 |  |
 | MA0039.2 KLF4 | 3.12E-02 |  |
 | KLF14_DBD | 4.06E-02 |  |

Table 4

The known TFBS matches for each cluster identified in the hippocampus

Probe Cluster | Known TFBS Match | E Value | GO Term #1 | GO Term #2
Hippo Cluster 1 | MA0528.1 ZNF263 | 1.24E-03 | signaling | locomotory behavior
 | MA0079.3 SP1 | 4.81E-02 |  |
 | UP00021_1 ZFP281_PRIMARY | 4.85E-02 |  |
Hippo Cluster 2 | MA0599.1 KLF5 | 8.03E-03 | presynaptic membrane assembly | regulation of epidermal growth factor receptor signaling pathway
 | MA0039.2 KLF4 | 3.03E-02 |  |
Hippo Cluster 3 | MA0079.3 SP1 | 3.84E-05 | brain development | cell communication
 | MA0162.2 EGR1 | 1.61E-04 |  |
 | SP1_DBD | 1.78E-04 |  |
 | MA0516.1 SP2 | 5.91E-04 |  |
 | KLF16_DBD | 7.72E-04 |  |
 | MA0599.1 KLF5 | 9.96E-04 |  |
 | SP3_DBD | 1.70E-03 |  |
 | UP00093_1 KLF7_PRIMARY | 3.71E-03 |  |
 | MA0039.2 KLF4 | 6.45E-03 |  |
 | MA0599.1 KLF5 | 8.46E-03 |  |
 | SP4_FULL | 9.66E-03 |  |
 | UP00002_1 SP4_PRIMARY | 1.05E-02 |  |
 | UP00021_1 ZFP281_PRIMARY | 1.27E-02 |  |
 | MA0039.2 KLF4 | 1.39E-02 |  |
 | SP8_DBD | 1.41E-02 |  |
 | MA0079.3 SP1 | 1.44E-02 |  |
 | MA0599.1 KLF5 | 1.67E-02 |  |
 | UP00099_2 ASCL2_SECONDARY | 3.89E-02 |  |
 | KLF14_DBD | 4.03E-02 |  |
 | MA0042.1 FOXI1 | 4.28E-02 |  |
 | UP00033_2 ZFP410_SECONDARY | 4.75E-02 |  |
 | UP00043_2 BCL6B_SECONDARY | 4.81E-02 |  |
Hippo Cluster 4 | MA0528.1 ZNF263 | 2.63E-03 | cellular sodium ion homeostasis | negative regulation of myelination
 | MA0522.1 TCF3 | 2.06E-02 |  |
 | MA0079.3 SP1 | 2.19E-02 |  |
Hippo Cluster 5 |  |  | biological adhesion | positive regulation of reciprocal meiotic recombination
Hippo Cluster 6 | MA0599.1 KLF5 | 7.55E-05 | collagen fibril organization | adenylate cyclase-activating serotonin receptor signaling pathway
 | MA0079.3 SP1 | 3.84E-04 |  |
 | MA0039.2 KLF4 | 4.87E-04 |  |
 | MA0516.1 SP2 | 3.91E-03 |  |
 | SP1_DBD | 5.91E-03 |  |
 | UP00093_1 KLF7_PRIMARY | 1.14E-02 |  |
 | KLF16_DBD | 1.90E-02 |  |
 | MA0162.2 EGR1 | 2.35E-02 |  |
 | MA0493.1 KLF1 | 3.83E-02 |  |
 | SP3_DBD | 4.80E-02 |  |
Hippo Cluster 7 | MA0079.3 SP1 | 1.99E-06 | positive regulation of cell-cell adhesion | phospholipase C-activating G-protein coupled receptor signaling pathway
 | MA0162.2 EGR1 | 6.88E-05 |  |
 | MA0516.1 SP2 | 1.07E-04 |  |
 | SP1_DBD | 4.78E-04 |  |
 | MA0599.1 KLF5 | 1.02E-03 |  |
 | KLF16_DBD | 1.92E-03 |  |
 | UP00021_1 ZFP281_PRIMARY | 5.58E-03 |  |
 | UP00002_1 SP4_PRIMARY | 5.79E-03 |  |
 | SP3_DBD | 6.11E-03 |  |
 | UP00093_1 KLF7_PRIMARY | 1.02E-02 |  |
 | UP00033_2 ZFP410_SECONDARY | 1.16E-02 |  |
 | SP4_FULL | 1.93E-02 |  |
 | SP8_DBD | 2.15E-02 |  |
 | MA0039.2 KLF4 | 3.00E-02 |  |
 | MA0469.1 E2F3 | 3.21E-02 |  |
 | UP00043_2 BCL6B_SECONDARY | 3.70E-02 |  |
 | UP00007_1 EGR1_PRIMARY | 4.08E-02 |  |
Hippo Cluster 8 |  |  | retrograde transport endosome to Golgi | central nervous system myelin maintenance
Hippo Cluster 9 | UP00061_2 FOXL1_SECONDARY | 4.37E-02 | negative regulation of lymphocyte apoptotic process | myelination
 | UP00037_1 ZFP105_PRIMARY | 4.70E-02 |  |

Table 5

The top differentially expressed probes across all brain regions

Gene Symbol | BETR Probability | Probe Id | Entrez Id | Gene Name
Cnbd2 | 1.00 | 1382561_at | 296311 | cyclic nucleotide binding domain containing 2
Parvb | 1.00 | 1388891_at | 362973 | parvin, beta
Kcnip2 | 1.00 | 1370773_a_at | 56817 | Kv channel-interacting protein 2
Pfn2 | 1.00 | 1367970_at | 81531 | profilin 2
Lrp11 | 1.00 | 1375193_at | 292462 | low density lipoprotein receptor-related protein 11
Mmp14 | 1.00 | 1367860_a_at | 81707 | matrix metallopeptidase 14 (membrane-inserted)
Kdr | 1.00 | 1367948_a_at | 25589 | kinase insert domain receptor
Scn2b | 1.00 | 1369311_at | 25349 | sodium channel, voltage-gated, type II, beta
Tm4sf1 | 1.00 | 1373847_at | 295061 | transmembrane 4 L six family member 1
Pcdh19 | 1.00 | 1382130_at | 317183 | protocadherin 19
Dbp | 1.00 | 1387874_at | 24309 | D site of albumin promoter (albumin D-box) binding protein
Armcx3 | 1.00 | 1397960_at | 367902 | armadillo repeat containing, X-linked 3

Table 6

The top differentially expressed probes in the basal forebrain

Gene Symbol | BETR Probability | Probe Id | Entrez Id | Gene Name
Fam133b | 1.00 | 1395620_at | 362320 | family with sequence similarity 133, member B
Ccsap | 1.00 | 1380816_at | 307926 | centriole, cilia and spindle-associated protein
Ppp3r1 | 1.00 | 1369152_at | 29748 | protein phosphatase 3, regulatory subunit B, alpha
Leprel2 | 1.00 | 1373750_at | 297595 | leprecan-like 2
Ankrd34b | 1.00 | 1393283_at | 499506 | ankyrin repeat domain 34B
Kif5a | 1.00 | 1375875_at | 314906 | kinesin family member 5A
Wfs1 | 1.00 | 1387356_at | 83725 | Wolfram syndrome 1 (wolframin)
Actn1 | 1.00 | 1389189_at | 81634 | actinin, alpha 1
Sdf2l1 | 1.00 | 1373043_at | 680945 | stromal cell-derived factor 2-like 1
Mapk12 | 1.00 | 1398297_at | 60352 | mitogen-activated protein kinase 12
Smoc1 | 1.00 | 1388545_at | 314280 | SPARC related modular calcium binding 1
Tbx18 | 1.00 | 1391630_at | 315870 | T-box18
Slc6a20 | 1.00 | 1369705_at | 113918 | solute carrier family 6 (proline IMINO transporter), member 20
Tef | 1.00 | 1369919_at | 29362 | thyrotrophic embryonic factor
Gadd45g | 1.00 | 1388792_at | 291005 | growth arrest and DNA-damage-inducible, gamma
Aurkc | 1.00 | 1375008_at | 292554 | aurora kinase C
Rif1 | 1.00 | 1395264_at | 295602 | RAP1 interacting factor homolog (yeast)
Rgs2 | 1.00 | 1387074_at | 84583 | regulator of G-protein signaling 2
Ogn | 1.00 | 1385248_a_at | 291015 | osteoglycin
Gnrh1 | 1.00 | 1387535_at | 25194 | gonadotropin-releasing hormone 1 (luteinizing-releasing hormone)
Tmem255a | 1.00 | 1394295_at | 313453 | transmembrane protein 255A
Cdh1 | 1.00 | 1386947_at | 83502 | cadherin 1
Col1a2 | 1.00 | 1387854_at | 84352 | collagen, type I, alpha 2
Rprm | 1.00 | 1390672_at | 680110 | reprimo, TP53 dependent G2 arrest mediator candidate
Slc6a20 | 1.00 | 1369704_at | 113918 | solute carrier family 6 (proline IMINO transporter), member 20
Pank1 | 1.00 | 1382924_at | 294088 | pantothenate kinase 1
Igf2 | 1.00 | 1367571_a_at | 24483 | insulin-like growth factor 2
Fmod | 1.00 | 1367700_at | 64507 | fibromodulin
Sst | 1.00 | 1367762_at | 24797 | somatostatin
Ptgds | 1.00 | 1367851_at | 25526 | prostaglandin D2 synthase (brain)
Pth2r | 1.00 | 1369461_at | 81753 | parathyroid hormone 2 receptor
Ctsk | 1.00 | 1369947_at | 29175 | cathepsin K
Tpm1 | 1.00 | 1370288_a_at | 24851 | tropomyosin 1, alpha
Colec12 | 1.00 | 1372818_at | 361289 | collectin sub-family member 12
Cpxm2 | 1.00 | 1373148_at | 293566 | carboxypeptidase X (M14 family), member 2
Celf5 | 1.00 | 1373945_at | 314647 | CUGBP, Elav-like family member 5
RGD1566401 | 1.00 | 1377008_at | 500717 | similar to GTL2, imprinted maternally expressed untranslated
Fam19a4 | 1.00 | 1378557_at | 689043 | family with sequence similarity 19 (chemokine (C-C motif)-like), member A4
Ccdc109b | 1.00 | 1385426_at | 295462 | coiled-coil domain containing 109B
Grifin | 1.00 | 1386936_at | 117130 | galectin-related inter-fiber protein
Bmp6 | 1.00 | 1388201_at | 25644 | bone morphogenetic protein 6
Ogn | 1.00 | 1390450_a_at | 291015 | osteoglycin
Slc13a4 | 1.00 | 1390532_at | 503568 | solute carrier family 13 (sodium/sulfate symporters), member 4
Zbtb41 | 1.00 | 1398217_at | 289052 | zinc finger and BTB domain containing 41

Table 7

The top differentially expressed probes in the frontal cortex

Gene Symbol | BETR Probability | Probe Id | Entrez Id | Gene Name
Rreb1 | 1.00 | 1388710_at | 306873 | ras responsive element binding protein 1
RT1-Da | 1.00 | 1370883_at | 294269 | RT1 class II, locus Da
Ryk | 1.00 | 1371101_at | 140585 | receptor-like tyrosine kinase
Ankrd37 | 1.00 | 1373685_at | 361149 | ankyrin repeat domain 37
Nr4a1 | 1.00 | 1386935_at | 79240 | nuclear receptor subfamily 4, group A, member 1
Nefh | 1.00 | 1370815_at | 24587 | neurofilament, heavy polypeptide
Rcan2 | 1.00 | 1374235_at | 140666 | regulator of calcineurin 2
Cabp1 | 1.00 | 1369886_a_at | 171051 | calcium binding protein 1
Aldh1a1 | 1.00 | 1387022_at | 24188 | aldehyde dehydrogenase 1 family, member A1
Zfp68 | 1.00 | 1395553_at | 304337 | zinc finger protein 68
Mtus1 | 1.00 | 1380321_at | 306487 | microtubule associated tumor suppressor 1
Nr4a2 | 1.00 | 1369007_at | 54278 | nuclear receptor subfamily 4, group A, member 2
Egr2 | 1.00 | 1387306_a_at | 114090 | early growth response 2
Fam19a2 | 1.00 | 1379872_at | 680647 | family with sequence similarity 19 (chemokine (C-C motif)-like), member A2
Large | 1.00 | 1379722_at | 361368 | like-glycosyltransferase
Dbp | 1.00 | 1387874_at | 24309 | D site of albumin promoter (albumin D-box) binding protein
Ermn | 1.00 | 1395447_at | 295619 | ermin, ERM-like protein
Cpne9 | 1.00 | 1377168_at | 297516 | copine family member IX
Scn1b | 1.00 | 1367959_a_at | 29686 | sodium channel, voltage-gated, type I, beta
Cd93 | 1.00 | 1368393_at | 84398 | CD93 molecule
Dcn | 1.00 | 1370956_at | 29139 | decorin
Nr4a3 | 1.00 | 1369067_at | 58853 | nuclear receptor subfamily 4, group A, member 3
Homer1 | 1.00 | 1370454_at | 29546 | homer homolog 1 (Drosophila)
Scn4b | 1.00 | 1373188_at | 315611 | sodium channel, voltage-gated, type IV, beta
Hhatl | 1.00 | 1383413_at | 301073 | hedgehog acyltransferase-like
Pcdh17 | 1.00 | 1384509_s_at | 306055 | protocadherin 17

Table 8

The top differentially expressed probes in the hippocampus



6.2 Brugia malayi Results

The purpose of the Brugia malayi case study is to identify a set of motifs that discriminate the L3 stage of Brugia malayi from the other stages, specifically the AF and L4 stages. To identify discriminative motifs, DME was run three times, once for each motif length of 8, 10 and 12, with parameters set to output one hundred motifs per run. The resulting sets of motifs were combined into a single set, and every set coverage method was run on this combined set of motifs. These results were then filtered with 2% filtering to further reduce the size of the sets of motifs. Motif logos were generated for the resulting sets of motifs, and foreground and background sequence coverage percentages were calculated for each motif.

In total, only 324 of the 334 foreground promoter sequences were covered by any motif. This is primarily due to the very short length of some of the promoter sequences, some of which are as short as 14 nucleotides. The majority of motifs discovered appeared in both foreground and background sequences, but some motifs appeared exclusively in the foreground data set. The four motifs with the highest count in the foreground and zero background occurrences are shown in Figure 13. These four motifs are very AT rich, with few GCs. They were also searched via TOMTOM against the JASPAR Core 2014 vertebrates database, the UniPROBE mouse database and the Jolma 2013 database, but only the fourth motif, ATTGWTAAWTMA, had a statistically significant match: the motif associated with the Hoxc4 gene, with an E-value of 0.024. The top motifs identified using branch and cut with a 2% greedy filter are shown in Figure 14. These motifs represent putative TFBSs that may not be exclusive to the L3 stage of Brugia malayi, but are likely to be biologically significant. All the motifs identified for each selection method are available at http://motifpipeline.com.
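Motifs such as ATTGWTAAWTMA are written with IUPAC ambiguity codes, where W stands for A or T and M stands for A or C. The sketch below shows one way to find motifs that occur in foreground sequences but never in background sequences. It is an illustration under stated assumptions: it counts sequences containing at least one match rather than total occurrences, it scans the forward strand only, and the function names and toy sequences are hypothetical rather than taken from the pipeline used in this study.

```python
import re
from typing import Iterable, List, Tuple

# Standard IUPAC nucleotide ambiguity codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_regex(motif: str) -> re.Pattern:
    """Compile an IUPAC consensus string into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in motif.upper()))


def count_covered(motif: str, sequences: Iterable[str]) -> int:
    """Number of sequences with at least one forward-strand match."""
    pattern = motif_regex(motif)
    return sum(1 for seq in sequences if pattern.search(seq.upper()))


def foreground_exclusive(
    motifs: Iterable[str], foreground: List[str], background: List[str]
) -> List[Tuple[str, int]]:
    """Motifs absent from the background, ordered by foreground count."""
    results = []
    for motif in motifs:
        if count_covered(motif, background) == 0:
            fg = count_covered(motif, foreground)
            if fg > 0:
                results.append((motif, fg))
    return sorted(results, key=lambda r: (-r[1], r[0]))


# Hypothetical toy sequences, not data from the case study.
fg_seqs = ["GGATTGATAAATCAGG", "ATTGTTAATTAA", "CCCC"]
bg_seqs = ["ATTGGTAAATCA", "TTCCCCTT"]
exclusive = foreground_exclusive(["ATTGWTAAWTMA", "CCCC"], fg_seqs, bg_seqs)
```

In the toy data the motif ATTGWTAAWTMA matches two foreground sequences and no background sequence, while CCCC is discarded because it also occurs in the background.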

Figure 13. The motifs identified in the L3 life cycle stage with the most occurrences in the foreground sequences and no occurrences in the background sequences.

Figure 14. The motifs identified in the L3 life cycle stage using weighted branch and cut set cover with a lowest foreground to background ratio objective function with 2% greedy filtering.

CHAPTER 7: CONCLUSIONS

7.1 Summary of Work

In this body of work, weighted set cover was analyzed as a possible solution for identifying a small subset of biologically significant motifs as a post-processing step to motif discovery. A fairly comprehensive set of algorithms was analyzed using data from the ENCODE project to assess the viability of each algorithm at identifying the biologically significant motifs. In total, over 50 algorithm / objective function combinations were tested, revealing that some sequence coverage based approaches can identify biologically significant motifs more accurately than the traditional method of selecting the top scoring motifs. This result is shown by the top performing algorithm, in which motifs with the highest foreground sequence coverage were selected. This method was further improved via a greedy delete algorithm, which removes motifs that cover few foreground sequences.

Interestingly, using weighted set cover methods did not significantly aid in identifying biologically significant motifs. This result is attributed to the fact that the motif discovery application used in this study, DME, already discriminates between foreground and background sequences during the motif discovery phase.

Two case studies, one focusing on the expressed genes in the rat estrous cycle and one on the expressed genes in the Brugia malayi life cycle, were also examined using the weighted set cover methods presented in this thesis. In the rat estrous cycle study, distinct clusters of genes and their putative motifs were identified across different brain regions throughout the estrous cycle. In the Brugia malayi case study, putative motifs were identified that strongly discriminate between the L3 life cycle stage and the L4 and adult life cycle stages.

7.2 Future Work

In future work it would be interesting to explore how well weighted set cover methods perform in combination with non-discriminative motif discovery applications. The scope of the ENCODE data analysis was limited to the DME motif discovery application, which discriminates between the foreground and background data sets. This may have contributed to the poor performance of the weighted set cover methods, whereas a non-discriminative motif discovery application may have benefited from the weighted set cover methods.

Depth of coverage has interesting properties when dealing with modules of transcription factors that work together to transcribe a gene. It could be further expanded using a k-set cover algorithm, which would require each sequence to be covered by a fixed number k of motifs (where applicable). This change could be easily applied to the integer linear programming constraints that were defined for the branch and cut method and the relaxed integer linear programming method. Instead of using a fixed constant of at least 1 motif per sequence, a variable could be passed in to identify the minimum sets of motifs that cover each sequence at least k times. The sets of motifs could then be viewed as modules that may work together.
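The k-coverage idea can be sketched on a toy instance. The sketch below assumes the standard set cover formulation with the right-hand side of each coverage constraint changed from 1 to k, that is, minimize sum_j w_j x_j subject to sum over motifs j covering sequence i of x_j >= k for every sequence i, with x_j in {0, 1}. It solves the problem by exhaustive search rather than branch and cut, so it is only suitable for very small inputs; the function name, the data layout and the example values are hypothetical.

```python
from itertools import combinations


def min_weight_k_cover(coverage, sequences, k=1, weights=None):
    """Minimum-weight motif set covering every sequence at least k times.

    coverage maps each motif name to the set of sequences it covers.
    weights maps each motif name to its cost (default: 1 per motif).
    Exhaustive search over all subsets; returns None if infeasible.
    """
    motifs = sorted(coverage)
    if weights is None:
        weights = {m: 1 for m in motifs}
    best, best_cost = None, None
    for size in range(len(motifs) + 1):
        for chosen in combinations(motifs, size):
            # Coverage constraint: each sequence is hit by >= k chosen motifs.
            feasible = all(
                sum(seq in coverage[m] for m in chosen) >= k for seq in sequences
            )
            if not feasible:
                continue
            cost = sum(weights[m] for m in chosen)
            if best_cost is None or cost < best_cost:
                best, best_cost = set(chosen), cost
    return best


# Toy instance invented for illustration only.
coverage = {
    "m1": {"s1", "s2"},
    "m2": {"s2", "s3"},
    "m3": {"s1", "s3"},
    "m4": {"s1", "s2", "s3"},
}
sequences = ["s1", "s2", "s3"]

print(min_weight_k_cover(coverage, sequences, k=1))
print(min_weight_k_cover(coverage, sequences, k=2))
```

With k = 1 a single motif suffices in this toy instance, while k = 2 requires three motifs, which illustrates how raising k trades a larger motif set for deeper coverage of each sequence.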

Using branch and cut and relaxed integer linear programming to solve the weighted set cover problem can identify the set of motifs that has the fewest occurrences in background sequences, but it does not identify the set of motifs that covers the fewest background sequences. To demonstrate this property, consider the case where there are 2 motifs and 3 sequences: 1 foreground and 2 background sequences. One motif covers both background sequences once and the other motif covers one background sequence 3 times. Assume both motifs cover the foreground sequence, and use the simple weighting scheme where each occurrence in a background sequence is weighted with a value of 1 and each occurrence in a foreground sequence is weighted with a value of 0. The first motif then has a total weight of 2 and the second a total weight of 3, so the motif that covers both background sequences would be selected over the motif that covers only one background sequence.

One approach to mitigate this problem would be to count only the first occurrence of a motif in each background sequence rather than counting all occurrences. This might better accommodate sequences that are incorrectly identified as background rather than foreground. Ideally, better models which account for this variation would be further explored.
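The difference between the two weighting schemes can be checked on the two-motif example described above. This is a minimal sketch; the function names and the dictionary layout (background sequence name mapped to occurrence count) are assumptions made for illustration.

```python
def occurrence_weight(hits):
    """Total number of motif occurrences across background sequences."""
    return sum(hits.values())


def sequence_weight(hits):
    """Number of distinct background sequences with at least one occurrence."""
    return sum(1 for count in hits.values() if count > 0)


# Background occurrence counts from the two-motif example in the text.
motif_a = {"bg1": 1, "bg2": 1}  # covers both background sequences once
motif_b = {"bg1": 3, "bg2": 0}  # covers one background sequence three times

for name, weight in (("occurrences", occurrence_weight),
                     ("sequences", sequence_weight)):
    chosen = "A" if weight(motif_a) < weight(motif_b) else "B"
    print(name, weight(motif_a), weight(motif_b), "-> motif", chosen)
```

Counting every occurrence gives weights of 2 and 3 and selects the motif that touches both background sequences, whereas counting each background sequence once gives weights of 2 and 1 and selects the motif that touches only one.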

Identifying a set of motifs that covers all sequences can sometimes result in a set of motifs that is relatively large and not feasible for testing in a laboratory setting. Therefore, identifying a set of motifs that only covers a portion of the sequences instead of all the sequences may be preferred. Since all of the methods presented in this thesis are designed to cover all sequences by default, with the exception of the filtering algorithms, developing algorithms that specifically search for high sequence coverage but not complete set coverage is an area of future work.
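One possible starting point for partial coverage is a greedy selection that stops once a target fraction of the sequences is covered. The sketch below is unweighted and is not one of the algorithms evaluated in this thesis; the function name, the `target` parameter and the toy data are assumptions made for illustration.

```python
def greedy_partial_cover(coverage, sequences, target=0.9):
    """Greedily pick motifs until at least `target` of the sequences are covered.

    coverage maps each motif name to the set of sequences it covers.
    Returns (motifs in the order picked, set of covered sequences).
    """
    universe = set(sequences)
    covered, chosen = set(), []
    if not universe:
        return chosen, covered
    remaining = dict(coverage)
    while len(covered) / len(universe) < target and remaining:
        # Pick the motif adding the most uncovered sequences (ties: name order).
        best = max(sorted(remaining),
                   key=lambda m: len((remaining[m] & universe) - covered))
        gain = (remaining.pop(best) & universe) - covered
        if not gain:
            break  # no remaining motif covers anything new
        chosen.append(best)
        covered |= gain
    return chosen, covered


# Toy instance invented for illustration only.
sequences = [f"s{i}" for i in range(1, 11)]
coverage = {
    "m1": {"s1", "s2", "s3", "s4", "s5"},
    "m2": {"s6", "s7", "s8"},
    "m3": {"s9"},
    "m4": {"s10"},
    "m5": {"s1", "s6"},
}

print(greedy_partial_cover(coverage, sequences, target=0.8)[0])
print(greedy_partial_cover(coverage, sequences, target=1.0)[0])
```

In this toy instance, relaxing the target from 100% to 80% halves the number of selected motifs, which is the kind of trade-off that would matter when the motif set has to be small enough for laboratory testing.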

Additional methods for solving the weighted set cover problem can also be explored. Some additional methods could include genetic algorithms, neural networks, tabu searches and particle swarms.

