Motif Selection: Identification of Regulatory Elements using Sequence Coverage Based Models and Evolutionary Algorithms

A dissertation presented to the faculty of the Russ College of Engineering and Technology of Ohio University

In partial fulfillment of the requirements for the degree Doctor of Philosophy

Rami Al-Ouran

December 2015

© 2015 Rami Al-Ouran. All Rights Reserved.

This dissertation titled Motif Selection: Identification of Gene Regulatory Elements using Sequence Coverage Based Models and Evolutionary Algorithms

by RAMI AL-OURAN

has been approved for the School of Electrical Engineering and Computer Science and the Russ College of Engineering and Technology by

Lonnie Welch
Stuckey Professor of Electrical Engineering and Computer Science

Dennis Irwin
Dean, Russ College of Engineering and Technology

Abstract

AL-OURAN, RAMI, Ph.D., December 2015, Electrical Engineering and Computer Science

Motif Selection: Identification of Gene Regulatory Elements using Sequence Coverage Based Models and Evolutionary Algorithms (120 pp.)

Director of Dissertation: Lonnie Welch

The accuracy of identifying binding sites (motifs) has increased with the use of technologies such as chromatin immunoprecipitation followed by sequencing (ChIP-seq), but it remains low enough that bioinformaticians and biologists struggle to choose the right methods for identifying such regulatory elements. Current motif discovery methods typically produce lengthy lists of putative transcription factor binding sites, and a significant challenge lies in how to mine these lists to select a manageable set of candidate sites for experimental validation. Additionally, despite the importance of covering large numbers of genomic sequences, current motif discovery methods do not consider the sequence coverage percentage. To address these problems, the motif selection problem is introduced and solved using a greedy algorithm built on a coverage based model and a multi-objective evolutionary algorithm. The motif selection problem aims to produce a concise list of significant motifs which is both accurate and covers a high percentage of the genomic input sequences. The proposed motif selection methods were evaluated using ChIP-seq data from the ENCyclopedia of DNA Elements (ENCODE) project. In addition, the proposed methods were used to identify putative transcription factor binding sites in two case studies: stage specific binding sites in Brugia malayi, and tissue specific binding sites in hydroxyproline-rich glycoprotein (HRGP) genes in Arabidopsis thaliana.

To my beloved parents

Acknowledgments

I am deeply grateful to my adviser, Dr. Lonnie Welch, for his continuous support and advice. Dr. Welch helped me improve both as a researcher and as a person, and this work would not have been possible without his continuous guidance and encouragement. I would like to thank Dr. Frank Drews for his helpful discussions and suggestions. I would like to thank the committee members for their time and discussions. Many thanks to all current and previous members of the Ohio University bioinformatics lab, especially Xiaoyu Liang, Yichao Li, Jens Lichtenberg, Kyle Kurz, and Matthew Wiley. I would also like to thank Ohio University for its financial support. Finally, words are not enough to thank my dear parents for their love, encouragement, and patience all these years.

Table of Contents

Abstract

Dedication

Acknowledgments

List of Tables

List of Figures

List of Acronyms and Terms

1 Introduction
  1.1 Motivation and the motif selection problem
  1.2 Contributions
  1.3 Gene regulation
  1.4 Motif discovery
  1.5 Discriminative motif discovery
  1.6 ChIP-seq motif discovery
  1.7 Ensemble motif discovery

2 Identification of Gene Regulatory Elements using Coverage-based Heuristics
  2.1 Introduction
  2.2 Methods
    2.2.1 Formal problem definition
    2.2.2 Relaxed Integer Linear Programming (RILP) approximation algorithm
    2.2.3 Bounded exact search algorithm
    2.2.4 Greedy algorithm
  2.3 Results and discussion
    2.3.1 Evaluation methodology
    2.3.2 Evaluation Results
    2.3.3 Putative functional genomic elements discovered by our methods
  2.4 Conclusions

3 Discriminative Motif Selection using Multi-Objective Optimization (MOP) Methods
  3.1 Background
    3.1.1 Multi-objective optimization (MOP)
    3.1.2 Pareto optimal solutions
    3.1.3 Finding Pareto optimal solutions using evolutionary algorithms (EAs)
    3.1.4 Post Pareto analysis
  3.2 Methods
    3.2.1 The discriminative motif selection problem formal definition
    3.2.2 The Positive Negative Partial Set Cover (PNPSC) problem
    3.2.3 Mapping the motif selection problem to the PNPSC problem
    3.2.4 Solving the discriminative motif selection problem using multi-objective optimization
    3.2.5 Using an MOEA to solve multi-objective problems
      3.2.5.1 Filtering the features before applying MOEA
  3.3 Results and discussion
    3.3.1 Evaluation using ENCODE data
    3.3.2 Application to case studies
  3.4 Conclusions
  3.5 Variations of the set cover problem
    3.5.1 The Weighted Set Cover problem
    3.5.2 The Red Blue Set Cover (RBSC) problem
    3.5.3 The Set Multicover Problem
    3.5.4 The Multiset Multicover (MSMC) problem
    3.5.5 The Partial Set Cover (PSC) problem

4 Guidelines for Motif Discovery and Motif Selection
  4.1 Introduction
  4.2 Motif discovery
    4.2.1 Generative vs discriminative
    4.2.2 Large data sets
    4.2.3 Small data sets
    4.2.4 Motif representation
    4.2.5 Motif scanning
    4.2.6 Reporting the properties of discovered motifs
    4.2.7 Filtering predicted motifs
    4.2.8 Motif selection
      4.2.8.1 Motif selection without background data
      4.2.8.2 Motif selection with background data
    4.2.9 Interpreting the results of motif selection
      4.2.9.1 Motif selection without background data
      4.2.9.2 Motif selection with background data
  4.3 Conclusions

5 Identification of Stage Specific Regulatory Elements in Brugia malayi using Discriminative Motif Selection
  5.1 Introduction
  5.2 Results and discussion
    5.2.1 The foreground and background data sets
    5.2.2 Motif prediction
    5.2.3 Motif selection
      5.2.3.1 Motif selection for motifs with Ωb = 10 and Ωf = 5
      5.2.3.2 Motif selection for motifs with Ωb = 10 and Ωf = 10
      5.2.3.3 Motif selection for motifs with Ωb = 20 and Ωf = 5
      5.2.3.4 Motif selection for motifs with Ωb = 20 and Ωf = 10
  5.3 Methods
    5.3.1 Motif discovery
    5.3.2 Motif filtering
    5.3.3 Motif selection
  5.4 Conclusions

6 Identification of Tissue Specific Regulatory Elements in Hydroxyproline-Rich Glycoprotein (HRGP) Genes in Arabidopsis thaliana
  6.1 Introduction
  6.2 Results and discussion
    6.2.1 The foreground and background data sets
    6.2.2 Motif prediction
    6.2.3 Selecting motifs with zero background coverage
    6.2.4 Filtering discovered motifs
    6.2.5 Motif selection for motifs with Ωb = 20 and Ωf = 20
    6.2.6 Recommendations
  6.3 Methods
    6.3.1 HRGP gene identification
    6.3.2 Motif discovery
    6.3.3 Motif filtering
    6.3.4 Motif selection
  6.4 Conclusions

7 Conclusions and Future Work

References

List of Tables


2.1 Number of features used (Mean, Median, SD) across 38 TF groups
2.2 Sequence sensitivity (Mean, Median, SD) across 38 TF groups
2.3 Number of features and sSn for five motif selection methods across the 38 TF groups. P is the total number of peaks selected per TF group and P(%) is the percentage of peaks with motif occurrences. N is the number of features selected by each method.
2.4 Novel motifs discovered by the greedy algorithm.
2.5 Novel motifs discovered by the greedy algorithm.
2.6 Novel motifs discovered by the greedy algorithm.
2.7 Novel motifs discovered by the greedy algorithm.
2.8 Novel motifs discovered by the greedy algorithm.
2.9 Novel motifs discovered by the greedy algorithm.
2.10 Novel motifs discovered by the greedy algorithm.

3.1 Number of features used (Mean, Median, SD).
3.2 Variations of the set cover problem

4.1 Number of motifs which pass the filtering thresholds Ωb and Ωf. Ωb is the maximum background coverage (%) per motif. Ωf is the minimum foreground coverage (%) per motif.
4.2 Motif solution A selected from figure 4.2. This solution consists of 4 motifs with a cumulative foreground coverage of 100% and a cumulative background coverage of 4.7%. Fore Cov (%) is the motif foreground coverage, Fore Num is the number of foreground sequences covered by the motif. Back Cov (%) is the motif background coverage, Back Num is the number of background sequences covered by the motif.

5.1 Number of motifs discovered with different background coverage thresholds. For each set of motifs, the maximum motif foreground coverage in that set is shown.
5.2 Number of motifs discovered using filtering thresholds Ωb and Ωf. Ωb is the maximum background coverage per motif. Ωf is the minimum foreground coverage per motif.
5.3 Discovered motifs with zero background coverage
5.4 Motif solution H selected from figure 5.3. This solution consists of 12 motifs with a cumulative foreground coverage of 72.8% and a cumulative background coverage of 30.1%. Fore Cov (%) is the motif foreground coverage, Fore Num is the number of foreground sequences. Back Cov (%) is the motif background coverage, Back Num is the number of background sequences.
5.5 Motif solution A selected from figure 5.4. This solution consists of 10 motifs with a cumulative foreground coverage of 75.1% and a cumulative background coverage of 46.1%. Each motif has a minimum foreground coverage of 10%. Fore Cov (%) is the motif foreground coverage, Fore Num is the number of foreground sequences. Back Cov (%) is the motif background coverage, Back Num is the number of background sequences.
5.6 Motif solution I selected from figure 5.5. This solution consists of 10 motifs with a cumulative foreground coverage of 70.4% and cumulative background coverage of 31.6%. Fore Cov (%) is the motif foreground coverage, Fore Num is the number of foreground sequences. Back Cov (%) is the motif background coverage, Back Num is the number of background sequences.
5.7 Motif solution D selected from figure 5.6. This solution consists of 5 motifs with a cumulative foreground coverage of 62.6% and a cumulative background coverage of 35.9%. Fore Cov (%) is the motif foreground coverage, Fore Num is the number of foreground sequences. Back Cov (%) is the motif background coverage, Back Num is the number of background sequences.

6.1 Number of motifs discovered with maximum background coverage thresholds. For each set of motifs the maximum motif foreground coverage in that set is shown.
6.2 Discovered motifs with 100% foreground coverage.
6.3 Motif solution A selected from figure 6.2. This solution consists of 8 motifs with a cumulative foreground coverage of 100% and zero cumulative background coverage. Fore Cov (%) is the motif foreground coverage, Fore Num is the number of foreground sequences covered by the motif. Back Cov (%) is the motif background coverage, Back Num is the number of background sequences covered by the motif.
6.4 Number of motifs which passed the filtering thresholds Ωb and Ωf. Ωb is the maximum background coverage per motif. Ωf is the minimum foreground coverage per motif.
6.5 Motif solution A selected from figure 6.3. This solution consists of 4 motifs with a cumulative foreground coverage of 100% and cumulative background coverage of 4.7%. Fore Cov (%) is the motif foreground coverage, Fore Num is the number of foreground sequences covered by the motif. Back Cov (%) is the motif background coverage, Back Num is the number of background sequences covered by the motif.
6.6 Motif solution D selected from figure 6.3. This solution consists of 1 motif with a foreground coverage of 73.7% and background coverage of 12.8%. Fore Cov (%) is the motif foreground coverage, Fore Num is the number of foreground sequences covered by the motif. Back Cov (%) is the motif background coverage, Back Num is the number of background sequences covered by the motif.
6.7 Motif solution C selected from figure 6.3. This solution consists of 3 motifs with a foreground coverage of 94.7% and background coverage of 4.6%. Fore Cov (%) is the motif foreground coverage, Fore Num is the number of foreground sequences covered by the motif. Back Cov (%) is the motif background coverage, Back Num is the number of background sequences covered by the motif.

List of Figures


2.1 Evaluation pipeline applied to the ENCODE ChIP-seq data. 1000 random peaks were selected per experiment per factor group for the training and testing data. Peaks selected for training data were not included in the testing data. The discovered motifs were reported in [1] using an ensemble of motif discovery tools. FIMO [2] was used for motif scanning.
2.2 Number of features and sequence sensitivity comparison.
2.3 Motifs selected by the greedy algorithm for factor group TCF12.
2.4 Cell line specific motifs selected by the greedy algorithm for factor group TCF12.

3.1 The Pareto front. Adapted from [3]
3.2 Number of features comparison.
3.3 Sequence sensitivity, sequence specificity, and accuracy comparisons.

4.1 Foreground and background coverage of 849 discovered motifs.
4.2 Motif solutions selected by the discriminative motif selection method. Each solution consists of a set of motifs. The sequence coverage is the cumulative coverage of all the motifs in the solution.

5.1 AT and GC content in foreground (L3) and background (L4 and AF) data sets.
5.2 Foreground and background coverage of all 871 discovered motifs.
5.3 Motif solutions selected by the discriminative motif selection method. Each solution consists of a set of motifs. The sequence coverage is the cumulative coverage of all the motifs in the solution.
5.4 Motif solutions selected by the discriminative motif selection method. Each solution consists of a set of motifs. The sequence coverage is the cumulative coverage of all the motifs in the solution.
5.5 Motif solutions selected by the discriminative motif selection method. Each solution consists of a set of motifs. The sequence coverage is the cumulative coverage of all the motifs in the solution.
5.6 Motif solutions selected by the discriminative motif selection method. Each solution consists of a set of motifs. The sequence coverage is the cumulative coverage of all the motifs in the solution.

6.1 Foreground and background coverage of all 849 discovered motifs.
6.2 Motif solutions selected by the discriminative motif selection method for motifs with zero background coverage. The Pareto front shows the possible solutions where the cost value is explained under methods. Each solution consists of a set of motifs. The sequence coverage is the cumulative coverage of all the motifs in the solution.

6.3 Motif solutions selected by the discriminative motif selection method applied to 317 motifs which pass the Ωb = 20 and Ωf = 20 filtering thresholds. Each solution consists of a set of motifs. The sequence coverage is the cumulative coverage of all the motifs in the solution.

List of Acronyms and Terms

ENCODE: Encyclopedia of DNA Elements
TF: Transcription Factor
TFBS: Transcription Factor Binding Site
NGS: Next Generation Sequencing
PWM: Position Weight Matrix
Feature: A motif
MOP: Multi-objective Optimization Problem
MOEA: Multi-objective Evolutionary Algorithm
HRGP: Hydroxyproline-Rich Glycoprotein

1 Introduction

Human disease association studies are often gene-centric and focus on identifying variants in genes. However, numerous diseases are caused by alterations in the non-coding regions of the genome, which can affect gene regulation. Thus, the identification of genomic regulatory elements can be helpful in understanding the biology of disease [4]. With the availability of technologies such as RNA-seq and microarrays, genes that are active in specific conditions are being identified, providing the opportunity to discover the regulatory elements that control these genes by searching for common patterns in the promoter regions of the co-expressed genes. Additionally, chromatin immunoprecipitation followed by sequencing (ChIP-seq) experiments identify all regions in a genome that are bound by a particular transcription factor (TF), offering another pattern mining opportunity. In the last few years the accuracy of identifying genomic regulatory elements has increased with the use of new technologies, but it is still not high enough, and bioinformaticians and biologists struggle to choose the right method or tool for identifying the regulatory elements.

Gene transcription initiation is controlled by several regulatory elements including TFs and Transcription Factor Binding Sites (TFBSs). A TF is a protein that binds to specific DNA regions across the genome called TFBSs. TFs and TFBSs, along with other regulatory elements, initiate and control gene transcription. Identification of TFBSs is a well-defined problem in computational biology, and the discovery of TFBSs is referred to as the motif discovery problem. Motif discovery attempts to find over-represented patterns (motifs) in a set of DNA sequences. In motif discovery the input DNA sequences could represent promoter regions of genes found to be co-expressed by RNA-seq or microarray experiments.
In the case of a ChIP-seq experiment, which locates all the TF binding regions across the genome, the set of DNA sequences consists of the sequences bound by the same TF across the genome. In discriminative motif discovery, two sets of sequences referred to as the foreground (positive) and background (negative) data sets are compared, where the background data set is explicitly provided by the user. The objective is to identify motifs which are enriched in the foreground data set relative to the background data set. When performing motif discovery, it is important to consider that there are multiple modes or strategies of TF-DNA binding. These include high affinity binding (strong motifs), where the binding sites match the TF studied to a high degree; low affinity binding (weak or subtle motifs), where the binding sites do not match the TF studied to a high degree; indirect binding, where the TF binds to the DNA region through another TF; and co-operative binding, where TFBSs are usually clustered close to each other and work in a combinatorial manner to regulate gene transcription [5] [6] [7] [8] [9].

1.1 Motivation and the motif selection problem

Motif discovery is the traditional method for discovering regulatory element binding site patterns, but existing motif discovery methods have several shortcomings. The first is the specificity problem: motif discovery methods produce too many motifs, causing a high false positive rate. The second is the coverage problem: motif discovery methods fail to find a single motif (or a small set of motifs) that covers all or the maximum number of the genomic sequences of interest (e.g., the binding regions from a ChIP-seq experiment). The third is the inability to discover the multiple modes or strategies of TF-DNA binding.

Based on these shortcomings, there is a need for methods that are capable of mining the large list of candidate motifs to produce a manageable set of significant motifs. We refer to this problem of choosing a subset of significant motifs from a set of candidate motifs as the motif selection problem. A coverage based model is introduced to solve the motif selection problem and produce a list of significant motifs that is accurate and covers a high percentage of the genomic input sequences. In the case of discriminative motif selection, the motif selection problem is formulated as a multi-objective optimization problem and solved using a multi-objective evolutionary algorithm.
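To make the coverage idea concrete, the following minimal sketch selects motifs greedily by the number of not-yet-covered sequences each one adds. It is an illustration only and should not be read as the algorithm evaluated in this dissertation, which is defined formally in chapter 2. The function name `greedy_select`, the `target` parameter, and the toy motif-to-sequence map are hypothetical.

```python
def greedy_select(motif_hits, n_sequences, target=1.0):
    """Greedily pick motifs until `target` fraction of sequences is covered.

    motif_hits: dict mapping motif name -> set of sequence indices it occurs in.
    (Hypothetical interface, chosen for illustration.)
    """
    covered = set()
    selected = []
    remaining = dict(motif_hits)
    while remaining and len(covered) < target * n_sequences:
        # Pick the motif that adds the most uncovered sequences
        # (ties are broken by motif name for a deterministic result).
        best = max(sorted(remaining), key=lambda m: len(remaining[m] - covered))
        gain = remaining.pop(best) - covered
        if not gain:
            break  # no remaining motif improves coverage
        covered |= gain
        selected.append(best)
    return selected, len(covered) / n_sequences


# Toy example: 6 sequences, 4 candidate motifs (hypothetical data).
hits = {
    "m1": {0, 1, 2},
    "m2": {2, 3},
    "m3": {4},
    "m4": {0, 1},
}
selected, coverage = greedy_select(hits, n_sequences=6)
```

In this toy input, motif `m4` is never selected because every sequence it covers is already covered by `m1`, which is the kind of redundancy a coverage based selection is meant to remove.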

1.2 Contributions

The contributions of this dissertation include the following:

• Implementation of coverage based heuristics to solve the motif selection problem (chapter 2).

• Analysis of ENCODE TF ChIP-seq data and discovery of new biological insights (chapter 2).

• Implementation of an evolutionary algorithm to solve the discriminative motif selection problem (chapter 3).

• Development of general guidelines for motif discovery and motif selection (chapter 4).

• Identification of stage specific regulatory elements in Brugia malayi (chapter 5).

• Identification of tissue specific regulatory elements in Hydroxyproline-rich glycoprotein (HRGP) genes in Arabidopsis thaliana (chapter 6).

1.3 Gene regulation

Gene transcription, the first step in gene expression, is a complex process that involves several regulatory elements working in a combinatorial manner. The transcription regulation mechanism works according to a set of rules to coordinate and initiate the process of mapping the DNA to a final gene product and to carry out specific biological processes [10] [11]. With advances in technologies such as next generation sequencing (NGS) our understanding of gene transcription regulation is increasing; yet there is no simple model that can explain the sophisticated process of transcription regulation [5].

Understanding this process requires the identification of the functional elements involved and understanding how these functional elements work together to regulate gene transcription [11]. Transcription factors (TFs) and their DNA binding sites (TFBSs) are two components directly related to gene transcription regulation. Identifying TFBSs will assist in deciphering the transcription regulation mechanism by linking the activation and deactivation of genes with the specific TFs that turn them on (active) or off (silent). Creating a network of TFs and their target genes will lead to an improved understanding of gene regulation mechanisms by providing a blueprint that maps the interactions that occur between genes and TFs. Identifying TFBSs is considered a challenging problem in both computer science and molecular biology, since TFBSs are degenerate and vary in length, and TFs work in a combinatorial manner [12]. Motif discovery is one of several methods used to help discover these TFBSs.

1.4 Motif discovery

Motif discovery is the process of finding short DNA patterns (6-16 DNA bases) that are overrepresented in a set of DNA sequences which share a biological function (e.g., promoters of co-expressed genes) [13] or are bound by the same TF (as determined by ChIP-seq experiments). Motif discovery methods are divided into two classes: generative methods and discriminative methods. Generative methods discover motifs by contrasting their enrichment in a set of sequences with generated statistical background models (for example, a Markov background model generated from the input sequences [14]). Some well-known generative discovery methods include MEME [15] and Weeder [12]. In discriminative motif discovery methods, two sets of sequences (denoted foreground and background, or positive and negative) are compared to find motifs that are overrepresented in the foreground set and underrepresented in the background set. Discriminative motif discovery was first introduced by Sinha [16], where a motif was considered a feature of the input sequences, and features that best discriminate the two sets were identified by applying classification techniques. Features can be ranked based on their power to discriminate between members of the two classes. Discriminative motif discovery methods include XXmotif [14], DECOD [17], DEME [18], and DME [19].
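As an illustration of how a generative method can contrast a word with a background model built from the input itself, the sketch below estimates a zero-order Markov background (independent single-base frequencies) and compares the observed count of a word with its expected count. This is a simplification made for illustration: published tools typically use higher-order Markov models and formal significance tests, and the function names and toy sequences here are hypothetical.

```python
from collections import Counter


def base_frequencies(sequences):
    """Zero-order Markov background: single-base frequencies of the input."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}


def observed_count(word, sequences):
    """Count (possibly overlapping) occurrences of `word` on the given strand."""
    k = len(word)
    return sum(
        1
        for seq in sequences
        for i in range(len(seq) - k + 1)
        if seq[i:i + k] == word
    )


def expected_count(word, sequences, freqs):
    """Expected occurrences of `word` if bases were drawn independently."""
    p = 1.0
    for base in word:
        p *= freqs.get(base, 0.0)
    # Number of start positions at which the word could occur.
    positions = sum(max(len(seq) - len(word) + 1, 0) for seq in sequences)
    return p * positions


seqs = ["ACGTACGT", "TTACGTAA", "GGACGTCC"]  # hypothetical input sequences
freqs = base_frequencies(seqs)
obs = observed_count("ACGT", seqs)
exp = expected_count("ACGT", seqs, freqs)
enrichment = obs / exp  # values well above 1 suggest overrepresentation
```

A ratio of observed to expected counts is only a rough indicator; it ignores the reverse strand and overlapping-occurrence statistics, which a real tool would need to handle.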

1.5 Discriminative motif discovery

In discriminative motif discovery, two sets of sequences (denoted foreground and background, or positive and negative) are compared to find motifs overrepresented in the foreground set and underrepresented in the background set. Several discriminative motif discovery methods exist: recent methods include XXmotif [6] and DECOD [8], and older methods include CMF [9], DEME [10], and DME [11]. These methods differ in their ability to recover true TFBSs, and they share the following shortcomings: they produce a long list of putative motifs without providing a systematic way of choosing the most significant motifs, and they produce only one solution (a point solution) instead of a solution space.

DECOD (DECOnvolved discriminative motif discovery) [8] is one of the recent discriminative motif discovery methods. DECOD starts by enumerating all words of a specific length k from the foreground and background sequence sets. The results of the enumeration are stored in a count table, and all further processing is based on this count table. Using the count table, DECOD aims to find a Position Weight Matrix (PWM) that matches a higher number of k-mers from the foreground set than from the background set. Using a count table without considering the context information of the k-mers leads to an inaccurate and convolved PWM, since many k-mers are shifted versions of the same k-mer. To account for this problem the authors propose a deconvolution method where a convolved motif component is used instead of a simple motif component. The initial set of k-mers is chosen based on this convolved function to build the PWM. Following that, k-mers are added to the PWM iteratively using a hill climbing search method until convergence. DECOD has a faster option (DECOD-speedup) that processes longer motifs by examining only those k-mers that have the highest count difference between the foreground and background data sets.
DECOD was found to be comparable to existing methods in terms of accuracy. DECOD examines one word length at a time, which requires more processing time and makes it more difficult to choose the most significant motifs of different lengths. Additionally, as the word length increases the processing time increases as well. Finally, DECOD does not report discriminative motif modules.

XXmotif (eXhaustive evaluation of matriX motifs) [6] is a PWM-based motif discovery method that combines a pattern based enumerative method with a PWM iterative refinement method. XXmotif can quickly calculate PWM enrichment p-values using a fast branch and bound algorithm and order statistics. XXmotif has two main stages: the pattern stage, which generates the seed words, and the PWM stage, which merges and refines the PWMs resulting from the seed words. XXmotif performed better than other generative motif discovery tools on a number of benchmark studies, but its drawbacks are its long computational time and its tendency to report repeats as top significant motifs.

CMF (Contrast Motif Finder) [9] is a discriminative motif discovery method that identifies differentially enriched motifs using two steps: seed creation and iterative PWM updating. In the seed creation step, all words of length k are enumerated and a z-score is calculated for each word. For each enumerated word, neighbors (words with at most m mismatches from the seed word) and sub-neighbors (the set of neighbor words that have mismatches at the same position) are found. A word is considered a neighbor or a sub-neighbor only if it is enriched in the foreground set. For each seed word a count matrix (k × 4) is created that summarizes the seed using only the best sub-neighbors, which have the highest z-score. Two matrices are created, representing the sub-neighboring sites in the foreground and background data sets.
Using the two matrices, the given sequences are scanned and the matrices are updated with new sites that match the matrix PWM. CMF was applied to find TFBSs for a number of TFs obtained from ChIP-seq and ChIP-chip experiments, and its performance was evaluated based on its false discovery rate (FDR). Compared to two discriminative tools (DME [11] and FIRE [12]) and one generative tool (BioProspector [13]), CMF had the lowest FDR in the majority of the cases. One of the drawbacks of CMF is its high computation time, especially for longer motifs.

DEME (Discriminatively Enhanced Motif Elicitation) [10] identifies discriminative motifs by maximizing a discriminative objective function using conjugate gradient optimization. DEME starts with a global search step to find initial string seeds using a substring search step and pattern branching. The substring search finds all substrings of length k occurring in the foreground set. Each substring is given a score using the objective function, and the best scoring substrings are used for branch searching, where positions in the substring are mutated to produce new strings. The top substrings are then mapped to motif PWMs and used as input to the local search phase. Local search refines the motifs, tries to maximize the discriminative objective function for each motif, and outputs the best motifs. DEME's performance was only compared to MEME [14], and, as indicated by the authors, DEME suffers from long computational time and can only report one motif per dataset.

DME (Discriminating Matrix Enumerator) [11] uses an enumerative algorithm to search a discrete space of matrices which represent possible motifs. Each matrix is scored using a relative overrepresentation score that is based on a log likelihood ratio. A local search is then used to refine the original matrices by finding similar matrices which have a higher overrepresentation score than the original matrix.
Finally, the discovered motif instances are removed from the input and the procedure is repeated. DME has very low computational time compared to other methods since it searches only a finite set of matrices. Across several benchmark datasets, DME ranked as one of the top performing methods with very low computational time, but the maximum motif length that DME searches for is 15, and the dependency on the predefined set of matrices might leave out some significant motifs.
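Several of the methods above begin by enumerating all words of length k in the foreground and background sets. The sketch below illustrates only that shared first step: it builds a k-mer count table for each set and ranks words by their count difference, loosely in the spirit of the pre-filtering described for DECOD-speedup. It does not reproduce the deconvolution, scoring, or PWM refinement of any published tool; the function names and sequences are hypothetical.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every length-k word (k-mer) in a set of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def top_discriminative_kmers(foreground, background, k, n=3):
    """Rank foreground k-mers by foreground count minus background count."""
    fg = kmer_counts(foreground, k)
    bg = kmer_counts(background, k)
    # A Counter returns 0 for words that never occur in the background.
    diff = {word: fg[word] - bg[word] for word in fg}
    # Sort by descending difference, then alphabetically for a stable order.
    return sorted(diff.items(), key=lambda item: (-item[1], item[0]))[:n]


foreground = ["TTGACGTCA", "AGACGTCTT", "CCGACGTAA"]  # hypothetical
background = ["TTTTTTTTT", "ACACACACA", "GGGTTTGGG"]  # hypothetical
top = top_discriminative_kmers(foreground, background, k=4)
```

Note that the two top-ranked words in this toy example, ACGT and GACG, are shifted versions of the same underlying site, which is the convolution effect that DECOD's deconvolution step is designed to address.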

1.6 ChIP-seq motif discovery

ChIP-seq is a technique used to identify the in vivo interactions that occur between DNA and proteins. The ChIP-seq experiment starts by crosslinking the proteins and the DNA inside the nucleus, followed by fragmenting the protein-DNA interacting regions into fragments of 50-2000 base pairs, then immunoprecipitating the protein-DNA regions by adding an antibody specific to the protein of interest. Finally, the immunoprecipitated DNA regions are sequenced using next-generation sequencing (NGS). The final DNA sequences produced by ChIP-seq (referred to as ChIP-seq peaks), which are 50-2000 base pairs in length, are analyzed using motif discovery tools to identify the overrepresented patterns (motifs) in the peaks. Motif discovery for ChIP-seq data can be used to confirm a known TF by confirming the presence of its binding motif in the ChIP-seq sequences, and secondary motifs could be discovered that might work in a combinatorial manner with the primary TF [15] [16].

With the use of the ChIP-seq technology new challenges emerge for the motif discovery problem. First, the input is large: ChIP-seq experiments produce thousands of sequences, and the original motif discovery methods do not scale well to large input. Second, most motif discovery methods focus on finding only one core motif and ignore the secondary motifs that exist. Finally, most of the tools produce too many motifs to choose from [17]. The most recent ChIP-seq motif discovery methods are described below.

DREME (discriminative regular expression motif elicitation) [17] is a motif discovery tool that discovers short core motifs in large data sets like ChIP-seq data. DREME is designed as a complementary motif tool that discovers the basic core motifs, which could be extended and refined using other tools. DREME searches for simple regular expressions and limits its search to short motifs, which makes searching large data sets fast.
The Fishers exact test is used to find the significance of the motifs. 13 ChIP-seq data sets were used for evaluation where 100 base pairs centered around the ChIP-seq peaks where used as the foreground data set. Dinucleotide shuffling was used to generate the background data sets. Sensitivity and runtime were the performance measures and DREME had the lowest runtime and the highest sensitivity compared to MEME [14], Amadeus [24], Trawler [25], nestedMICA [26] and WEEDER [4]. DREME suffers from searching for short motifs only and should be used along with other motif discovery tools, but its speed is a big advantage compared to existing tools. motifRG [18] is a de novo discriminative motif discovery tool designed to discover motifs in large sequence data like ChIP-seq. motifRG uses a logistic regression model to score the motifs and it starts by enumerating all the words of a specific length then scores them using a z-score obtained using the logistic regression model. The top scored words are used as seeds and then extended on each side and finally there is a motif refinement stage. For evaluation purposes, motifRG was compared to DREME [17] using 207 ENCODE ChIP-seq datasets collected from two groups. This dataset covers 82 unique TFs and 25 cell types. motifRG identified 78% of the known TFs in 148 ENCODE experiments which is comparable to the performance of DREME but ran 40% faster. The authors did not report the computational time and scalability of the method where it was only mentioned that motifRG is faster than DREME. SIOMICS (systematic identification of motifs in Chipseq data) [19] is a de novo motif discovery tool that discovers motifs from all peak regions of a ChIP-seq experiment. SIOMICS starts by enumerating all k-mers of specific length k and produces motifs by grouping all similar k-mers if they are different at one position. This produces candidate motifs which are used to produce motif modules. 
The motifs are ranked using an MAP (maximum a posterior probability) score. Motif modules are then identified using a frequent pattern mining approach and conditional probability. To 24 measure the significance of the modules a Poisson clumping heuristic strategy is applied. 13 ChIP-seq experiments and 13 random data sets generated by shuffling the 13 ChIP-seq experiments using permutation of nucleotide bases in the 13 ChIP-seq sequences were used for evaluation. SIMOICS was compared in terms of sensitivity, specificity, and runtime to DREME [17] and Peak-motifs [20]. Sensitivity was calculated by finding how many known TFs were found. STAMP [21] was used to find matches between the reported motifs and known TFs in TRANSFAC and JASPAR. Specificity was measured using the random data sets where SIMOICS did not report any motifs found in the randomly generated data sites (false rate = 0). Finally, SIOMICS was compared in terms of running time where SIOMICS was faster than DREME but slower than Peak-motifs. In terms of sensitivity, about 76.0% of the predicted motifs were similar to known motifs from TRANSFAC and JASPAR and the sensitivity results were comparable to DREME and Peak-motifs. SIOMICS and DREME had a zero false discovery rate since they did not report any motifs in the random data sets. Some of SIOMICS shortcomings is its use of only one word length at a time and the user has to choose the m parameter which is the final number of motifs to report. In [22] the authors introduce factorbook.org which is a wiki page for a set of ChIP-seq TFs reported in the ENCODE project. Factorbook stores information for 457 ChIP-seq datasets representing 119 TFs. MEME-Chip was used for de novo motif discovery using the top 500 peaks per each ChIP-seq TF experiment as input. The top five motifs were reported in Factorbook and matches to known TFs are reported as well. 
The top 5 motifs were scanned across the whole peak file using FIMO [23] to check the enrichment of the motifs. Factorbook.org provides an easy way to access the TFs used.
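The significance test that DREME applies to a candidate motif can be illustrated with a small calculation. The following is a minimal sketch of a one-sided Fisher's exact test on foreground versus background sequence counts, using only the Python standard library. The function name and its parameters are hypothetical, and the sketch is not DREME's implementation.

```python
from math import comb


def fisher_exact_enrichment(fg_hits: int, fg_total: int,
                            bg_hits: int, bg_total: int) -> float:
    """One-sided Fisher's exact test p-value for motif enrichment.

    Returns the probability of seeing at least `fg_hits` foreground
    sequences containing the motif, given the row and column totals of
    the 2x2 table (upper tail of the hypergeometric distribution).
    """
    total = fg_total + bg_total      # all sequences
    hits = fg_hits + bg_hits         # all sequences containing the motif
    denom = comb(total, fg_total)
    p_value = 0.0
    # Sum the probabilities of all tables at least as extreme as the observed one.
    for k in range(fg_hits, min(fg_total, hits) + 1):
        p_value += comb(hits, k) * comb(total - hits, fg_total - k) / denom
    return p_value
```

For example, a motif present in all 3 of 3 foreground sequences and in none of 3 background sequences yields a p-value of 1/20, since only one of the 20 ways to place three motif-containing sequences among six puts them all in the foreground.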

1.7 Ensemble motif discovery

Based on results obtained by Tompa et al. [20] in assessing the performance of several motif discovery tools, it was found that the sensitivity in recovering motifs was very low. To address this problem, it was suggested to use an ensemble of motif discovery tools. In ensemble methods, multiple tools are integrated to produce a list of candidate motifs. Motif selection methods are then used to select a subset of motifs from the output of the suite of methods.

In [21] and [22] the authors introduced SCOPE, where three algorithms (BEAM, PRISM, and SPACER) are used to perform motif discovery and a scoring metric called the Sig score is used to rank the motifs. Sig is a statistical significance score that is based on three objective functions (motif overrepresentation, motif coverage, and motif positional bias). The motif coverage score is used to determine the statistical significance of only a single motif, to assist in ranking the predicted motifs. This score compares the number of regions that contain the motif in the input set to the total number of regions that contain the motif in its corresponding genome.

In [1], five motif discovery tools were used (AlignACE [23], MDScan [24], MEME [15], Trawler [25], and Weeder [12]). The motifs were ranked using an enrichment score that is computed by dividing the number of discovered motif instances by the number of shuffled control motif instances across the genomic regions studied. Another motif ensemble method is MotifLab [26], where motif discovery is performed by popular tools chosen by the user and a p-value for over-representation is calculated for the top motifs produced by each tool.

In Ensemble Motif Discovery (EMD) [27], five motif discovery tools are used (AlignACE [23], BioProspector [28], MDScan [24], MEME [15], and MotifSampler [29]). The ensemble approach to selecting motifs consists of five steps: collecting, grouping, voting, smoothing, and extracting. EMD runs the tools multiple times, where some tools are run with different parameters each time to produce different sets of motifs per run. The motifs are then collected and grouped based on their scores. The motifs are then mapped onto the input sequences, and votes for each position across the sequences are counted based on how many motifs overlap it. The final sites reported are the sites with the highest votes.

W-ChIPMotifs [30] uses three motif discovery tools (MEME [15], MaMF [31], and Weeder [12]) to discover motifs in ChIP-seq data. A bootstrap resampling method was used for calculating the top candidate motif scores. In this method, a control data set is generated by randomly shuffling the original input sequences, and the motif PWMs are scanned across the control data set with a minimum PWM score of 0.5. The top PWM scores are retrieved and a p-value is calculated for each PWM score using a Fisher test. Finally, Bonferroni correction was applied to the motif p-values.

In CompleteMOTIFs [32], three motif discovery methods (MEME [15], Weeder [12], and ChIPMunk [33]) are used. The top 10 motifs are selected from each tool based on each tool's scoring method. The candidate motifs from all the tools are then scanned across a background data set generated by shuffling the original input data while preserving its nucleotide frequency. Based on the number of motif occurrences in the input and background data sets, a p-value is calculated and then corrected for false discovery rate to produce q-values. CompleteMOTIFs finally reports the top 10 motifs based on their q-values.

GimmeMotifs [34] incorporates nine motif discovery tools to discover motifs across ChIP-seq data. The tools are: BioProspector [28], GADEM [35], Improbizer [36], MDmodule [37], MEME [15], MoAn [38], MotifSampler [29], Trawler [25] and Weeder [12]. The input data is divided into a prediction set (20% of the original input set, selected randomly) and a validation set. Additionally, two background data sets are generated to calculate motif statistical significance. One background data set is randomly generated from the input data while maintaining the dinucleotide frequency, and the second background data set is selected from the genome studied. The statistical significance of the non-redundant motifs is then calculated using the validation set and the two background sets. The following statistical scores are calculated: absolute enrichment, hypergeometric p-value, ROC-AUC graph, and the Mean Normalized Conditional Probability (MNCP). Finally, all the candidate motifs are clustered using a Weighted Information Content (WIC) similarity score and the non-redundant motifs are reported.

In MotifVoter [39], 10 motif discovery methods are used: MITRA [40], Weeder [12], SPACE [41], AlignACE [23], ANN-Spec [42], BioProspector [28], Improbizer [36], MDScan [24], MEME [15] and MotifSampler [29]. MotifVoter includes two stages: motif filtering and site extraction. In the motif filtering step, similar motifs are clustered to remove redundant motifs. The clustering (filtering) step attempts to group the motifs based on two criteria: discriminative, where a set of motifs in the same group should share many binding sites, and consensus, where a high number of the motif discovery tools should predict the binding sites in the group. In the site extraction step, the goal is to identify the binding sites with the highest confidence based on how many motif methods report the site, where a binding site should be shared by at least two motif discovery methods. The final high-confidence binding sites are aligned using MUSCLE [43] and a PWM is generated.

2 Identification of Gene Regulatory Elements using Coverage-based Heuristics

Data mining algorithms and sequencing methods (such as RNA-seq and ChIP-seq) are being combined to discover genomic regulatory motifs that relate to a variety of phenotypes. However, motif discovery algorithms often produce very long lists of putative transcription factor binding sites, hindering the discovery of phenotype-related regulatory elements by making it difficult to select a manageable set of candidate motifs for experimental validation. To address this issue, the motif selection problem is introduced and coverage-based search heuristics are provided for its solution. Analysis of 203 ChIP-seq experiments from the ENCyclopedia of DNA Elements project shows that our algorithms produce motifs that have high sensitivity and specificity, and reveals new insights about the regulatory code of the human genome. The greedy algorithm performs the best, selecting a median of 2 motifs per ChIP-seq transcription factor group while achieving a median sensitivity of 77%.

2.1 Introduction

Human disease association studies are often gene-centric and focus on identifying variants in genes. However, numerous diseases are caused by alterations in the non-coding, regulatory regions of the genome. Thus, discovery of genomic regulatory elements is not only important for understanding the biology of genomes, it is also critical for understanding the biology of disease [4]. RNA-seq, microarray and ChIP-seq experiments are used to discover disease-associated changes in gene expression and in transcription factor binding. Such experiments identify genomic areas (e.g., gene promoters and transcription factor binding regions) wherein disease-associated regulatory elements may be found.

Motif discovery, the de novo computational method for finding putative regulatory element binding sites, has several shortcomings. The specificity problem occurs when motif discovery methods produce too many motifs, causing a high false positive rate. The coverage problem occurs when motif discovery methods fail to find a single motif (or a small set of motifs) that covers all of the genomic sequences of interest (e.g., the binding regions from a ChIP-seq experiment). These issues can be addressed by solving the motif selection problem, i.e., picking a small set of significant motifs from a large collection of discovered motifs. Since one might view each subset of the discovered motifs as a hypothesis concerning transcription factor binding or gene co-expression, by the principle of Occam’s Razor, the simplest such hypothesis is preferred. Hence, the output of our method is viewed as a likely genomic mechanism to explain the common regulatory (or binding) properties of a sequence set. While motif discovery ensembles (reviewed in the introduction chapter) have been developed to address the problems identified by Tompa et al. [20], they tend to exacerbate the specificity problem and they do not consider the coverage problem. These challenges are addressed by explicitly defining and solving the motif selection problem, which incorporates both objectives. In this chapter, the motif selection problem is formally defined, novel methods for solving the problem are presented, and the effectiveness of the methods is demonstrated by analyzing ChIP-seq data from the ENCyclopedia of DNA Elements (ENCODE) project [44]. In Section 2.2, the motif selection problem is formally defined and algorithms for the motif selection problem are presented. The effectiveness of the algorithms is demonstrated in Section 2.3 by providing analysis results for the ENCODE data.

2.2 Methods

Whether using a single motif discovery method or an ensemble of motif discovery methods, one faces the challenge of selecting a biologically important subset of motifs from a large set of candidate motifs. This section provides a formal description of the motif selection problem and presents algorithms that solve the problem. The source code of the algorithms is available at https://github.com/RamiOran/SeqCov.git.

2.2.1 Formal problem definition

Given a set of motifs $M = \{m_1, m_2, \ldots, m_k\}$ and a set of sequences $S = \{S_1, S_2, \ldots, S_n\}$, the motif selection problem can be defined as follows:

$$\text{Minimize} \quad \sum_{j=1}^{k} x_j \tag{2.1}$$

$$\text{Subject to:} \quad \sum_{j=1}^{k} a_{ij} x_j \geq 1, \quad i = 1, \ldots, n \tag{2.2}$$

Where $x_j$ is defined as:

$$x_j = \begin{cases} 1 & \text{if } m_j \text{ is part of the solution} \\ 0 & \text{otherwise} \end{cases} \tag{2.3}$$

Where $A$ is an $n \times k$ matrix representing the coverage of sequence set $S$ by motif set $M$:

$$a_{ij} = \begin{cases} 1 & \text{if } S_i \in S_{m_j} \\ 0 & \text{otherwise} \end{cases} \tag{2.4}$$

Where $S_{m_j} \subseteq S$ is the set of sequences covered by motif $m_j$. Note that equation 2.2 guarantees that each sequence in $S$ is covered by at least one selected motif. Note also that the motif selection problem can be modeled by the Set Covering Problem (SCP), and that the SCP decision problem is NP-complete and the SCP optimization problem is NP-hard [45].
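The formulation in equations 2.1 to 2.4 can be made concrete with a toy instance. The following is a minimal Python sketch, assuming that motif coverage is available as a dictionary from motif name to the set of sequences it covers. The names `coverage_matrix`, `is_feasible`, `objective`, and the toy data are hypothetical and are not taken from the SeqCov source code.

```python
from typing import Dict, List, Set


def coverage_matrix(sequences: List[str],
                    covered_by: Dict[str, Set[str]],
                    motifs: List[str]) -> List[List[int]]:
    """Build A (n x k): A[i][j] = 1 if sequence i is in S_{m_j} (eq. 2.4)."""
    return [[1 if seq in covered_by[m] else 0 for m in motifs]
            for seq in sequences]


def is_feasible(A: List[List[int]], x: List[int]) -> bool:
    """Check constraint (2.2): every row i has sum_j a_ij * x_j >= 1."""
    return all(sum(a * xj for a, xj in zip(row, x)) >= 1 for row in A)


def objective(x: List[int]) -> int:
    """Objective (2.1): the number of selected motifs."""
    return sum(x)


# Toy instance: four sequences and three motifs.
sequences = ["s1", "s2", "s3", "s4"]
motifs = ["m1", "m2", "m3"]
covered_by = {"m1": {"s1", "s2"}, "m2": {"s2", "s3"}, "m3": {"s3", "s4"}}
A = coverage_matrix(sequences, covered_by, motifs)
```

In this toy instance, selecting m1 and m3 (x = [1, 0, 1]) satisfies every constraint with an objective value of 2, whereas selecting only m2 leaves s1 and s4 uncovered.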

A general solution procedure for the motif selection problem finds $F_{min}$, a minimally sized set of motifs that covers all sequences in $S$. A feature set $F \subseteq M$ is generated by incrementally adding features (motifs) to the set, based on the heuristic rule of the algorithm used, such that if a new motif $m_i$ is added to $F$, then the corresponding set of sequences covered by $m_i$ is added to $S_F$ (the set of sequences covered by feature set $F$). This implies that it is never beneficial to add any motif that does not increase the size of $S_F$. The procedure terminates when $S_F = S$.

2.2.2 Relaxed Integer Linear Programming (RILP) approximation algorithm

The Set Cover Problem, as well as the motif selection problem, can be cast as a 0-1 integer linear program, wherein the goal is to find a 0-1 vector $\vec{x}$ of length $k$ satisfying the constraints $A\vec{x} \geq \vec{b}$ such that $Z = \vec{c} \cdot \vec{x}$ is minimized, where

1. $\vec{c}$ is a vector of all 1’s of length $k$

2. $\vec{b}$ is a vector of all 1’s of length $n$

It is relatively straightforward to prove that the minimum set cover has $K$ sets if and only if the minimized value of $Z$ is $K$. To see this, notice that $\vec{x}[j] = 1$ corresponds to the motif $m_j$ being part of the set cover, and $\vec{x}[j] = 0$ corresponds to the motif $m_j$ being left out of the set cover. Hence, $A\vec{x} \geq \vec{b}$ if and only if, for each $S_i \in S$, at least one of the motifs $m_j$ with $\vec{x}[j] = 1$ covers $S_i$. Hence, if $\vec{x}$ satisfies the constraint $A\vec{x} \geq \vec{b}$, then $\vec{x}$ corresponds to a valid set cover. The additional requirement that $\vec{c} \cdot \vec{x}$ is minimized means that the solution to the integer linear program provides the optimal set cover. Hence, solvers for integer linear programs, such as those found in the GNU Linear Programming Kit (GLPK) [46], can be used to find the optimal solution to the set cover problem. However, these solvers may take a long time to find the optimal solution.

Now, while 0-1 integer linear programming is also NP-complete [45], it is possible to relax the constraint that $x_j \in \{0, 1\}$ and allow $x_j \in [0, 1]$. This relaxation converts the integer linear program into a linear program. Since linear programming can be solved in polynomial time in the worst case (e.g., by the Ellipsoid Method), GLPK [46] can be used to solve the relaxed version. Furthermore, the relaxed version can be used to provide an approximate solution via randomized rounding [47]. The standard randomized rounding approach proceeds as follows: (i) construct the optimal solution $\vec{x}$ to the relaxed version of the ILP problem, (ii) select motif $m_j$ to be part of the cover $C$ with probability $\vec{x}[j]$, (iii) repeat step (ii) until $C$ is a set cover. As shown in [47], this algorithm produces, with high probability, a set cover that is within $O(\log n)$ times the size of the optimal solution. This is the approach used here to build good set covers.
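Steps (ii) and (iii) of the randomized rounding procedure can be sketched as follows. This is a minimal Python illustration that assumes the fractional solution of the relaxed program has already been computed by an LP solver and is passed in as `x_frac`; the LP solve itself is not shown. The function name, its parameters, and the `max_rounds` safeguard are hypothetical and are not part of the original method.

```python
import random
from typing import List, Sequence, Set


def randomized_rounding(x_frac: Sequence[float],
                        covers: Sequence[Set[int]],
                        n_sequences: int,
                        rng: random.Random,
                        max_rounds: int = 1000) -> List[int]:
    """Round a fractional LP solution into a set cover.

    x_frac[j] is the relaxed value of x_j, and covers[j] holds the
    indices of the sequences covered by motif j.
    """
    chosen: Set[int] = set()
    covered: Set[int] = set()
    for _ in range(max_rounds):
        if len(covered) == n_sequences:
            break
        # Step (ii): include motif j in the cover with probability x_frac[j].
        for j, p in enumerate(x_frac):
            if j not in chosen and rng.random() < p:
                chosen.add(j)
                covered |= covers[j]
    if len(covered) != n_sequences:
        raise RuntimeError("no cover found; is x_frac a feasible LP solution?")
    return sorted(chosen)
```

As a usage example, three motifs that each cover two of three sequences admit the fractional optimum x = (0.5, 0.5, 0.5); calling `randomized_rounding([0.5, 0.5, 0.5], [{0, 1}, {1, 2}, {0, 2}], 3, random.Random(0))` returns the indices of a valid cover, which necessarily contains at least two motifs.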

2.2.3 Bounded exact search algorithm

The GLPK toolkit [46] also provides a branch-and-cut algorithm which attempts to find exact solutions to certain stated integer linear programming problems by utilizing accepted trial solution methods (such as the simplex or primal-dual interior-point methods) and then successively computing cutting planes to reduce the size of the search space. This technique is applicable to our problem of interest since the set cover problem can be described using only linear constraints, and the search space is convex. Branch and cut behaves heuristically in terms of the search space reduction, but provides exact answers to the linear programming problem. The principal drawback of branch and cut is that it has exponential runtime in the worst case, and therefore may not return an answer for specific problem instances within a reasonable time frame. The ILP characterization is provided in the previous section.
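The idea of exact search with pruning can be illustrated on small instances without an LP solver. The sketch below is a plain branch-and-bound over motif subsets, written in Python; it is a simplified stand-in and not GLPK's branch-and-cut, since it uses no LP relaxation and no cutting planes. The function name and its parameters are hypothetical.

```python
from typing import List, Optional, Sequence, Set


def exact_min_cover(covers: Sequence[Set[int]], n_sequences: int) -> List[int]:
    """Return the indices of a minimum-size set of motifs covering all sequences.

    covers[j] holds the indices of the sequences covered by motif j.
    """
    best: Optional[List[int]] = None

    def search(uncovered: Set[int], chosen: List[int]) -> None:
        nonlocal best
        if best is not None and len(chosen) >= len(best):
            return                      # bound: cannot beat the incumbent
        if not uncovered:
            best = list(chosen)         # new, strictly smaller cover
            return
        # Branch on one uncovered sequence: some motif covering it must be chosen.
        target = min(uncovered)
        for j, cov in enumerate(covers):
            if target in cov:
                chosen.append(j)
                search(uncovered - cov, chosen)
                chosen.pop()

    search(set(range(n_sequences)), [])
    if best is None:
        raise ValueError("the motifs do not cover every sequence")
    return best
```

Like branch and cut, this search is exact but exponential in the worst case, which is why it is only practical here as an illustration on small inputs.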

2.2.4 Greedy algorithm

Greedy algorithms try to generate good solutions by employing simple rules. The strategy chosen here is to employ a “maximum uncovered-first” rule. According to this rule, a feature set $F$ is constructed by incrementally adding motifs such that, at every iteration, the motif $m_i$ that covers the largest number of uncovered sequences in $S$ is added to $F$.

Two filtering steps are applied during the greedy motif selection process. The first filtering step is used to avoid selection of redundant features. During the iteration process, if feature $m_i$ is similar to a previously selected feature $m_k$, then feature $m_i$ is discarded and the search continues with the next feature. The similarity of two motifs, $m_i$ and $m_k$, is calculated using Tomtom [48], which assigns an E-value that characterizes the significance of the similarity. The significance of similarity between two motifs $m_i$ and $m_k$ is defined as $\varepsilon(m_i, m_k)$. If $\varepsilon(m_i, m_k) < 0.05$, then the two motifs $m_i$ and $m_k$ are considered similar.

The second filtering step avoids selecting features which provide small incremental benefit, choosing only features which add a minimum number of uncovered sequences. Let $S_u$ be the set of uncovered sequences, let $|S_{m_i} \cap S_u| / |S|$ be the percentage of sequences newly covered by feature $m_i$, and let $\Delta$ be the minimum percentage of new sequences that must be added to the set cover. If $|S_{m_i} \cap S_u| / |S| < \Delta$, the greedy algorithm terminates (because no further improvement greater than $\Delta$ is possible by selecting any of the remaining motifs). This is beneficial since some features only add a small percentage of uncovered sequences.

Although the filtering steps might result in partial coverage instead of full coverage, they produce a feature set which includes non-redundant features and avoids selection of features that add a small number of uncovered sequences. Algorithm 1 shows the pseudocode for the greedy algorithm. The time complexity of the greedy algorithm is $O(|S| \, |M|)$, where $S$ is the set of sequences and $M$ is the set of motifs.

2.3 Results and discussion

This section presents the results of applying our motif selection methods to the ENCODE ChIP-seq data [44]. Our results are compared to those described in [1] and [30].

Algorithm 1 Motif selection using the greedy algorithm.

procedure Greedy algorithm(S, M, ∆)
    S_u = S
    F = ∅
    M_s = M
    j = 0
    while S_u ≠ ∅ and j < |M| do
        Select an m_i ∈ M_s s.t. |S_mi ∩ S_u| is maximized
        M_s = M_s − {m_i}
        j = j + 1
        if there exists m_k ∈ F such that ε(m_k, m_i) < 0.05 then
            continue
        if |S_mi ∩ S_u| / |S| < ∆ then
            break
        else
            S_u = S_u − S_mi
            F = F ∪ {m_i}
    Return F
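Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the released SeqCov code: the coverage of each motif is assumed to be precomputed as a set of sequence identifiers, and the Tomtom comparison is replaced by a caller-supplied function `similarity_evalue`, which is hypothetical. Ties between equally good motifs are broken by name, a choice the pseudocode leaves open.

```python
from typing import Callable, Dict, List, Set


def greedy_motif_selection(
    sequences: Set[str],
    covered_by: Dict[str, Set[str]],
    similarity_evalue: Callable[[str, str], float],
    delta: float = 0.05,
    evalue_cutoff: float = 0.05,
) -> List[str]:
    """Select motifs by the maximum uncovered-first rule of Algorithm 1."""
    uncovered = set(sequences)        # S_u
    selected: List[str] = []          # F
    candidates = set(covered_by)      # M_s
    while uncovered and candidates:
        # Pick the candidate covering the most uncovered sequences
        # (ties broken by name so the result is deterministic).
        best = max(sorted(candidates),
                   key=lambda m: len(covered_by[m] & uncovered))
        candidates.remove(best)
        # Filter 1: skip motifs similar to one that is already selected.
        if any(similarity_evalue(prev, best) < evalue_cutoff
               for prev in selected):
            continue
        # Filter 2: stop when the incremental coverage drops below delta.
        gain = covered_by[best] & uncovered
        if len(gain) / len(sequences) < delta:
            break
        uncovered -= gain
        selected.append(best)
    return selected
```

Because the candidate with the largest incremental coverage is examined first, a break at the second filter is safe: every remaining candidate adds no more new sequences than the one that failed the threshold.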

In [1], the authors grouped 427 ChIP-seq experiments from the ENCODE project into 84 factor groups (based on homology and the presence of known motifs). For each ChIP-seq experiment in each factor group, the ChIP-seq peaks were divided into two parts, one for motif discovery by an ensemble of motif discovery tools and one for enrichment score calculation. The top 10 enriched motifs were selected for each factor group. Of the 84 factor groups, 56 groups have known TFBSs. Our methods typically selected fewer motifs per factor group than reported in [1], and the TFBSs selected by our methods often cover higher percentages of the ENCODE ChIP-seq binding regions than did the motifs reported in [1]. Our methods are validated by their ability to rediscover known binding motifs for 38 factor groups. Interestingly, our methods rediscovered known motifs for one factor group (TCF12) for which the method in [1] failed to find known motifs.

The remainder of this section summarizes our key results. First, our evaluation pipeline is described and our evaluation metrics are defined. Focusing on the aforementioned 38 factor groups, the effectiveness of our motif selection methods is compared to the effectiveness of the method of Kheradpour and Kellis [1], as well as to the motif selection method used in a motif discovery ensemble (W-ChIPMotifs [30]). Finally, we present new putative functional genomic elements discovered by our methods.

2.3.1 Evaluation methodology

Figure 2.1 shows the pipeline used for evaluating the motif selection methods. For each ChIP-seq experiment in each factor group, 1000 randomly selected peaks were used as training data and 1000 randomly selected peaks were used as testing data. The sets of all discovered motifs for each TF group were obtained from [1], and were provided as input to our motif selection algorithms and to the motif selection algorithm of W-ChIPMotifs. Filtering thresholds of $\Delta = 5\%$ and $\varepsilon(m_i, m_k) < 0.05$ were used for the greedy algorithm. The motif selection methods were evaluated in terms of the following metrics:

1. Number of features selected (N): This measure indicates the number of motifs chosen by a motif selection method.

2. Sequence sensitivity (sSn): This measure indicates the percentage of input sequences (ChIP-seq peaks) that were identified by the selected set of features reported by a method. Sequence sensitivity, sSn, is defined as sSn = TPs/(TPs + FNs), where the number of true positives, TPs, is the number of sequences containing at least one selected motif (determined by using FIMO [2]) and the number of false negatives, FNs, is the number of sequences with no occurrence of any selected motif.

3. Sensitivity in recovering known motifs (mSn): For each of the 56 TF groups with known motifs (see [1]), the known motifs were compared to the selected motifs. Motif sensitivity, mSn, is defined as mSn = TPm/(TPm + FNm), where TPm is the number of TF groups with known motifs that are covered by the selected motifs, and FNm is the number of known motifs not matched by the predicted motifs (determined using Tomtom [48], with E-value threshold = 0.05).
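The two sensitivity measures defined above can be computed as in the following Python sketch. It assumes that motif occurrences have already been determined by a scanner and are supplied as a dictionary from motif name to the set of sequences containing it; the function names and data layout are hypothetical.

```python
from typing import Dict, Iterable, Set


def sequence_sensitivity(sequences: Iterable[str],
                         hits: Dict[str, Set[str]],
                         selected: Iterable[str]) -> float:
    """sSn = TPs / (TPs + FNs), returned as a fraction in [0, 1].

    hits[m] is the set of sequences with at least one occurrence of
    motif m (for example, as reported by a motif scanner).
    """
    seqs = set(sequences)
    covered: Set[str] = set()
    for m in selected:
        covered |= hits.get(m, set()) & seqs
    tp = len(covered)            # sequences with at least one selected motif
    fn = len(seqs) - tp          # sequences with no selected motif
    return tp / (tp + fn) if seqs else 0.0


def motif_sensitivity(groups_recovered: int, groups_with_known: int) -> float:
    """mSn = TPm / (TPm + FNm) over the TF groups with known motifs."""
    return groups_recovered / groups_with_known
```

For instance, recovering known motifs for 37 of the 56 TF groups with known motifs gives `motif_sensitivity(37, 56)`, which is 66.1% after rounding to one decimal place.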

[Figure 2.1 flowchart. Box labels: ChIP-seq peaks for a TF factor group; Motif discovery [1]; Testing data; Discovered motifs [1]; Training data; Enrichment analysis [1]; Coverage-based heuristic; W-ChIPMotifs [30]; Motifs selected by enrichment [1]; Motifs selected by Coverage-based heuristic; Motifs selected by W-ChIPMotifs [30]; Find motif occurrences of selected motifs using FIMO [2]; Report sequence sensitivity.]

Figure 2.1: Evaluation pipeline applied to the ENCODE ChIP-seq data. 1000 random peaks were selected per experiment per factor group for the training and testing data. Peaks selected for training data were not included in the testing data. The discovered motifs were reported in [1] using an ensemble of motif discovery tools. FIMO [2] was used for motif scanning.

2.3.2 Evaluation Results

Table 2.1 provides a summary comparison between the methods in terms of the number of features selected. The greedy algorithm produces the smallest average number of features, followed by the enrichment method. The other three algorithms produce much larger feature sets. Figure 2.2 (a) provides a comparison between the greedy, enrichment, and W-ChIPMotifs methods, where it is clear that the greedy algorithm has the lowest median number of features. The small number of features selected by the greedy algorithm and the concurrent high sSn indicate the strong specificity of the selected motifs.

The sSn measure is used to find the percentage of ChIP-seq peaks covered by the selected motifs. Table 2.2 shows a high-level comparison of all methods, and Figure 2.2 (b) shows a comparison between the greedy, enrichment, and W-ChIPMotifs methods. In terms of sSn, the RILP and bounded exact search algorithms performed the best. However, the higher sensitivity was obtained at the expense of selecting a very large number of motifs. The greedy algorithm and the enrichment method have much better motif-to-sensitivity ratios, and the greedy algorithm has the most favorable ratio overall. Table 2.3 shows the number of features and sSn values across the 38 TF groups.

The enrichment method reported known motifs for 37 TF groups (mSn = 66.1%). The RILP and bounded exact search algorithms reported known motifs for 38 TF groups (mSn = 67.9%). The W-ChIPMotifs method reported known motifs for 35 TF groups (mSn = 62.5%). The greedy algorithm achieved 64.3% mSn (reporting known motifs for 36 TF groups). All methods performed similarly with respect to the gold standard. It is important to note that the greedy algorithm achieved this performance with fewer motifs, on average, than the other algorithms.

In terms of running time, the greedy algorithm was the fastest, with an average run time of 103 seconds, a median of 57 seconds, and a standard deviation of 123 seconds. The RILP and the bounded exact search algorithms had average run times of 141 and 129 seconds, medians of 90 and 85 seconds, and standard deviations of 153 and 136 seconds, respectively.

Table 2.1: Number of features used (Mean, Median, SD) across 38 TF groups

Method         Mean   Median     SD
Greedy          2.5        2   0.86
Enrichment      4          3   2.2
W-ChIPMotifs    8.2        6   6.5
RILP           30.7       30   15.2
Bounded        30.7       30   15.2

[Figure 2.2 plots. Panel (a): Number of features comparison. Panel (b): Sequence sensitivity comparison. Vertical axes: number of features and sequence sensitivity (%). Methods shown in each panel: Greedy, Enrichment, W-ChIPMotifs.]

Figure 2.2: Number of features and sequence sensitivity comparison.

Table 2.2: Sequence sensitivity (Mean, Median, SD) across 38 TF groups

Method         Mean   Median     SD
RILP           93.1     94.7    4.8
Bounded        93.1     94.7    4.8
Greedy         77.0     76.7    8.5
W-ChIPMotifs   74.0     75.3   15.8
Enrichment     69.4     70.9   15.9

Table 2.3: Number of features and sSn for five motif selection methods across the 38 TF groups. P is the total number of peaks selected per TF group and P(%) is the percentage of peaks with motif occurrences. N is the number of features selected by each method.

                            Greedy RILP/Bounded W-ChIPMotifs   Enrichment
TF group      P   P(%)    N sSn(%)     N sSn(%)     N sSn(%)     N sSn(%)
EGR1       2600   97.0    1   82.8    29   96.4     9   91.3     7   86.8
NRF1       4200   98.9    1   91.5    22   98.6    16   97.3     3   95.6
ATF3       2400   89.5    2   74.3    33   89.2     7   70.6     4   70.5
BHLHE40    1000   81.2    2   63.0    11   79.0     2   53.3     2   58.8
CEBPB      4000   90.0    2   64.7    32   88.5     4   60.5     2   30.6
E2F        8000   98.8    2   86.9    51   98.8    11   93.8     8   91.5
ELF1       3000   93.3    2   78.8    23   91.9     6   78.7     3   76.3
ETS        8200   98.2    2   87.5    55   97.8    18   89.8     9   90.1
FOXA       5000   92.0    2   63.0    35   92.3     6   63.0     5   58.4
HNF4       3000   96.5    2   76.1    26   96.1     4   72.5     5   77.1
MAF        4000   97.2    2   77.7    28   97.2     8   82.7     2   59.5
NFE2       1200   96.1    2   93.3    11   96.1     4   88.5     4   87.3
NFKB      10200   97.3    2   73.3    54   96.6    20   85.8     4   68.9
NFY        2000   96.8    2   93.2    13   97.3     4   92.0     1   83.2
POU2F2     4000   85.5    2   59.3    35   85.2     5   60.7     2   55.2
POU5F1     1000   88.4    2   76.4     7   86.0     3   68.1     2   74.6
PRDM1      1000   91.1    2   79.7     6   91.8     2   28.6     2   73.6
REST      10000   97.4    2   81.1    56   96.8    31   94.2    10   92.1
SPI1       3000   98.7    2   87.7    20   98.9     7   94.1     3   84.1
SRF        5000   91.1    2   70.3    40   90.7    11   81.3     2   57.5
TFAP2      2000   96.8    2   85.8    17   97.0     3   88.0     2   83.8
YY1        9200   95.4    2   79.1    49   95.2    18   88.0     5   83.1
ZEB1       1000   85.9    2   72.8    13   88.8     2   51.5     1   40.6
EBF1       2000   87.0    3   75.6    17   87.9     4   68.0     2   59.5
MEF2       2000   86.0    3   64.0    18   85.2     4   57.9     3   50.5
MXI1       2000   88.2    3   75.4    20   89.3     4   47.1     2   39.1
NR2C2      1600   93.6    3   86.8    20   92.5     5   56.6     3   67.9
PAX5       4000   96.2    3   74.7    41   95.7     8   77.0     5   71.4
RFX5       3200   92.0    3   76.1    33   92.1     6   67.9     3   54.1
SP1        4000   88.5    3   73.6    34   88.0    10   75.9     3   68.3
TCF12      2200   96.3    3   69.6    29   96.2     4   55.3     6   72.3
ESRRA      4200   90.4    4   65.4    44   89.2     9   74.5     4   62.3
GATA       8000   99.0    4   77.1    53   98.9    18   88.1     6   66.7
IRF        2650   97.9    4   80.5    31   97.7     4   75.5     6   85.3
NR3C1      4250   95.7    4   78.7    47   95.4     5   67.9     6   72.6
RXRA       3050   94.5    4   71.8    40   94.3     6   65.1     5   61.0
STAT       7200   99.0    4   77.4    60   98.7    20   86.6     7   79.6
TCF7L2     2000   90.1    4   83.0    12   90.9     4   75.1     2   45.6

2.3.3 Putative functional genomic elements discovered by our methods

The greedy method identified a number of putative regulatory elements. First, we present our findings for the TCF12 factor group, for which the previous study (see [1]) failed to rediscover a known motif. This is followed by a presentation of previously unreported motifs for the remaining 37 factor groups for which our methods were validated with respect to the gold standard. The TCF12 TF (other names include HTF4 and HEB) is a member of the basic helix-loop-helix (bHLH) protein family and a member of the E-protein class, which binds to the E-box sequence CANNTG [49] [50] [51]. The greedy algorithm reported 3 motifs for the TCF12 group, with an sSn of 69.6%. The RILP and bounded exact search algorithms reported 29 motifs with an sSn of 96.2%. The W-ChIPMotifs method reported 4 motifs with an sSn of 55.3%. In [1], the enrichment method reported 6 motifs with an sSn of 72.3%. Figure 2.3 shows the matches found by comparing the motifs predicted by the greedy algorithm for TCF12 against the JASPAR database. In Figure 2.3, the second selected motif matches other TFs from the JASPAR database which are likely co-factors of TCF12. In 4.7% of the TCF12 factor group peaks, the first and second motifs co-occurred, with a mean distance between them of 59.2 base pairs. To study this factor group further, motif selection was performed on each of the TF ChIP-seq experiments individually, to identify cell line specific TFBSs. The TCF12 factor group consists of three experiments across three cell lines (HepG2, H1-hESC, and GM12878). Figure 2.4 shows the motifs selected by the greedy algorithm for each experiment. The TCF12 binding regions of the two normal cell lines (GM12878 and H1-hESC) contain similar motifs, but the motifs selected for the binding regions of the hepatocellular carcinoma cell line (HepG2) are different. This suggests the presence of genomic regulatory elements that may be linked to hepatocellular carcinoma.

We examined the TCF12 motifs for overlapping single nucleotide polymorphisms (SNPs) using the RegulomeDB database [52] [4]. The first motif had 63 overlapping SNPs, the second motif had 164 overlapping SNPs, and the third motif had 23 overlapping SNPs. The second motif had one match in the genome-wide association study (GWAS) catalog: rs2293152 [53]. The disease associated with this SNP is multiple sclerosis [53] and the associated gene is STAT3.

For the remaining 37 factor groups, Tables 2.4-2.10 show the motifs that were selected by the greedy algorithm but were not selected by the enrichment method. We believe that these previously unreported motifs are important functional elements of the human genome, due to their ability to provide the simplest explanations for the ChIP-seq binding experiments.

sSn(%)   JASPAR match               Species, class, family

45.8     MA0522.1 (Tcf3)            Mus musculus, Other Alpha-Helix, High Mobility Group (Box)
         MA0521.1 (Tcf12)           Mus musculus, Zipper-Type, Helix-Loop-Helix
         MA0500.1 (Myog)            Mus musculus, Zipper-Type, Helix-Loop-Helix
         MA0499.1 (Myod1)           Mus musculus, Zipper-Type, Helix-Loop-Helix

11.7     MA0517.1 (STAT2::STAT1)    Homo sapiens, Other, STAT
         MA0537.1 (BLMP-1)          Caenorhabditis elegans, Zinc-coordinating, BetaBetaAlpha-zinc finger
         MA0050.2 (IRF1)            Homo sapiens, Winged Helix-Turn-Helix, IRF
         MA0277.1 (AZF1)            Saccharomyces cerevisiae, Zinc-coordinating, BetaBetaAlpha-zinc finger
         MA0508.1 (PRDM1)           Homo sapiens, Zinc-coordinating, BetaBetaAlpha-zinc finger
         MA0554.1 (SOC1)            Arabidopsis thaliana, Other Alpha-Helix, MADS

Figure 2.3: Motifs selected by the greedy algorithm for factor group TCF12.

Cell line: HepG2
sSn(%)   Cumulative coverage (%)
30.5     30.5
30       55
27       64.5
17.5     71

Cell line: GM12878
sSn(%)   Cumulative coverage (%)
50.7     50.7
44.9     69.6
15.7     77.2

Cell line: H1-hESC
sSn(%)   Cumulative coverage (%)
53.9     53.9
45.4     70.7
48.9     76.8

Figure 2.4: Cell line specific motifs selected by the greedy algorithm for factor group TCF12.

2.4 Conclusions

This chapter presents heuristics that employ the concept of sequence coverage to solve the motif selection problem, yielding a small, concise set of motifs with high coverage of the input sequences. Three motif selection algorithms were implemented and compared: greedy, relaxed integer linear programming (RILP), and bounded exact search.

The proposed algorithms were also compared to two existing motif selection methods. The methods were compared in terms of the number of features (motifs) selected and the sequence sensitivity achieved by the chosen motifs. Although the RILP and bounded exact search algorithms achieve the highest sequence sensitivity, they do so at the expense of selecting a large number of motifs. Thus, the greedy algorithm is recommended because it produces a small set of motifs that provides high sequence coverage, enhancing the feasibility of laboratory validation of the reported motifs.

Note: The RILP and bounded exact search sections are part of joint work with members of the bioinformatics lab. Parts of this chapter have been published in 'Discovering Gene Regulatory Elements Using Coverage-Based Heuristics' by Rami Al-Ouran, Robert Schmidt, Ashwini Naik, Jeffrey Jones, Frank Drews, David Juedes, Laura Elnitski, and Lonnie Welch, in IEEE/TCBB.

Table 2.4: Novel motifs discovered by the greedy algorithm.

TF group Motif sSn(%) Top 3 JASPAR matches

BHLHE40 48.0 MA0376.1 RTG3;MA0281.1 CBF1;MA0004.1 Arnt

CEBPB 55.6 MA0102.3 CEBPA;MA0466.1 CEBPB;MA0043.1 HLF

22.1 MA0162.2 EGR1;MA0516.1 SP2;MA0068.1 Pax4

EBF1 56.2 MA0154.2 EBF1;MA0524.1 TFAP2C

35.5 -

33.5 -

EGR1 82.7 MA0162.2 EGR1;MA0472.1 EGR2;MA0516.1 SP2

ELF1 59.4 MA0473.1 ELF1;MA0076.2 ELK4;MA0062.2 GABPA

57.0 MA0068.1 Pax4

ESRRA 39.4 MA0112.2 ESR1;MA0258.2 ESR2;MA0066.1 PPARG

29.5 -

Table 2.5: Novel motifs discovered by the greedy algorithm.

TF group Motif sSn(%) Top 3 JASPAR matches

ETS 73.0 MA0162.2 EGR1

65.2 MA0062.2 GABPA;MA0076.2 ELK4;MA0098.2 Ets1

FOXA 50.7 MA0148.3 FOXA1;MA0047.2 Foxa2;MA0546.1 PHA-4

20.7 MA0068.1 Pax4

GATA 40.3 MA0528.1 ZNF263;MA0068.1 Pax4;MA0516.1 SP2

38.0 MA0482.1 Gata4;MA0035.3 Gata1;MA0036.2 GATA2

39.0 MA0528.1 ZNF263

32.3 -

HNF4 66.9 MA0114.2 HNF4A;MA0484.1 HNF4G;MA0017.1 NR2F1

35.8 -

IRF 49.1 MA0516.1 SP2;MA0079.3 SP1;MA0162.2 EGR1

40.6 MA0528.1 ZNF263

12.4 MA0462.1 BATF::JUN;MA0489.1 JUN (var.2);MA0099.2 JUN::FOS

Table 2.6: Novel motifs discovered by the greedy algorithm.

TF group Motif sSn(%) Top 3 JASPAR matches

MAF 70.8 MA0495.1 MAFF;MA0496.1 MAFK;MA0501.1 NFE2::MAF

18.8 MA0528.1 ZNF263

MEF2 39.1 MA0052.2 MEF2A;MA0497.1 MEF2C;MA0558.1 FLC

24.8 MA0068.1 Pax4;MA0528.1 ZNF263

MXI1 55.9 MA0469.1 ;MA0516.1 SP2

23.9 MA0560.1 PIF3;MA0059.1 ::MAX;MA0104.3 Mycn

19.6 MA0509.1 Rfx1;MA0600.1 RFX2;MA0510.1 RFX5

NFE2 67.6 MA0501.1 NFE2::MAF;MA0530.1 CNC::-S;MA0591.1 Bach1::Mafk

29.8 MA0093.2 USF1;MA0409.1 TYE7;MA0526.1 USF2

NFKB 57.3 MA0105.3 NFKB1;MA0107.1 RELA;MA0101.1 REL

40.7 MA0425.1 YGR067C;MA0068.1 Pax4;MA0516.1 SP2

Table 2.7: Novel motifs discovered by the greedy algorithm.

TF group Motif sSn(%) Top 3 JASPAR matches

NFY 86.8 MA0315.1 HAP4;MA0316.1 HAP5;MA0060.2 NFYA

40.7 MA0068.1 Pax4;MA0535.1 Mad;MA0146.2 Zfx

NR2C2 59.3 -

47.4 MA0504.1 NR2C2;MA0115.1 NR1H2::RXRA;MA0065.2 PPARG::RXRA

46.5 MA0076.2 ELK4;MA0062.2 GABPA;MA0028.1 ELK1

NR3C1 41.0 MA0113.2 NR3C1;MA0007.2 AR

36.3 MA0516.1 SP2

24.1 MA0476.1 FOS;MA0491.1 JUND;MA0477.1 FOSL1

33.6 MA0469.1 E2F3

NRF1 91.7 MA0506.1 NRF1;MA0565.1 FUS3

PAX5 43.5 -

27.4 MA0050.2 IRF1;MA0080.3 Spi1;MA0508.1 PRDM1

Table 2.8: Novel motifs discovered by the greedy algorithm.

TF group Motif sSn(%) Top 3 JASPAR matches

POU2F2 36.7 MA0469.1 E2F3;MA0068.1 Pax4;MA0535.1 Mad

30.6 MA0507.1 POU2F2;MA0142.1 Pou5f1::;MA0453.1 nub

POU5F1 20.6 MA0516.1 SP2;MA0528.1 ZNF263

PRDM1 69.8 MA0508.1 PRDM1;MA0050.2 IRF1

REST 62.8 MA0138.2 REST

RFX5 45.9 MA0516.1 SP2;MA0425.1 YGR067C;MA0469.1 E2F3

38.6 MA0510.1 RFX5;MA0600.1 RFX2;MA0509.1 Rfx1

RXRA 39.0 MA0159.1 RXR::RAR DR5;MA0528.1 ZNF263

27.6 MA0114.2 HNF4A;MA0065.2 PPARG::RXRA;MA0484.1 HNF4G

38.0 MA0528.1 ZNF263

Table 2.9: Novel motifs discovered by the greedy algorithm.

TF group Motif sSn(%) Top 3 JASPAR matches

SP1 55.2 MA0079.3 SP1;MA0068.1 Pax4

SPI1 80.5 MA0080.3 Spi1;MA0050.2 IRF1;MA0473.1 ELF1

33.4 MA0528.1 ZNF263;MA0080.3 Spi1;MA0516.1 SP2

SRF 43.2 MA0083.2 SRF;MA0584.1 SEP1;MA0555.1 SVP

33.3 MA0535.1 Mad

STAT 36.8 MA0518.1 Stat4;MA0137.3 STAT1;MA0144.2 STAT3

29.5 -

20.6 MA0517.1 STAT2::STAT1;MA0050.2 IRF1;MA0508.1 PRDM1

TCF12 47.0 MA0522.1 Tcf3;MA0500.1 Myog;MA0521.1 Tcf12

43.2 -

12.3 MA0517.1 STAT2::STAT1;MA0537.1 BLMP-1;MA0050.2 IRF1

Table 2.10: Novel motifs discovered by the greedy algorithm.

TF group Motif sSn(%) Top 3 JASPAR matches

TCF7L2 44.9 MA0599.1 KLF5;MA0068.1 Pax4

38.8 MA0523.1 TCF7L2

40.6 MA0469.1 E2F3

TFAP2 58.8 -

YY1 58.8 MA0068.1 Pax4;MA0162.2 EGR1;MA0535.1 Mad

53.7 MA0095.2 YY1;MA0567.1 ERF1

ZEB1 49.1 -

43.3 MA0103.2 ZEB1;MA0583.1 RAV1 (var.2);MA0086.1 sna

3 Discriminative Motif Selection using Multi-Objective Optimization (MOP) Methods

In the previous chapter, a number of motif selection algorithms were introduced to select the most significant motifs from a list of candidate motifs produced by one or more motif discovery methods. The motif selection methods addressed two problems: the specificity problem and the coverage problem. Although the proposed methods performed well when applied to ChIP-seq data, one drawback was their inability to handle cases with a background data set, as in discriminative motif discovery. As mentioned in the introduction chapter, two types of motif discovery methods exist: generative and discriminative. In discriminative motif discovery, there are two data sets, referred to as the foreground (positive) and background (negative) data sets, and the objective is to identify motifs enriched in the foreground data set relative to the background data set. Motif selection in the presence of background data is defined as selecting the smallest set of motifs which covers the maximum number of foreground sequences and the minimum number of background sequences. In this chapter, a new motif selection method is introduced which accounts for the existence of background data. The motif selection problem is formulated as a multi-objective optimization problem (MOP) and solved using a multi-objective evolutionary algorithm (MOEA).

3.1 Background

Since the motif selection problem with background data has multiple conflicting objectives, multi-objective optimization can be used to solve it. In this section we review the basics of multi-objective optimization and the methods used to solve such optimization problems.

3.1.1 Multi-objective optimization (MOP)

Multi-objective optimization problems (MOPs) are a class of problems with multiple, often conflicting, objectives, where the goal is to optimize all the objectives simultaneously. The objective functions can be of minimization type, maximization type, or a combination of both [54] [55] [56]. Because the objectives typically conflict, there is in general no single solution that is optimal for all of them, and finding a global optimum of an MOP is considered computationally hard. Instead, an MOP has a set of solutions with trade-offs between them. These solutions can be found using Pareto optimality theory [54] [57], as explained in section 3.1.2. An MOP is formally defined as follows (assuming all the objective functions are to be minimized) [58] [55]:

\[
\text{Minimize } f(x) = [f_1(x), f_2(x), \ldots, f_k(x)]^T, \tag{3.1}
\]

subject to $x \in X$, where:

• $x$: an $n$-dimensional decision variable vector, $x \in \mathbb{R}^n$, $x = [x_1, x_2, \ldots, x_n]^T$. $x$ represents the values selected to solve the optimization problem.

• $X$: the feasible (parameter or solution) space, $X \subseteq \mathbb{R}^n$. $X$ is determined by a set of equality and inequality constraints.

• $f_i$: an objective function.

• $f$: the vector of objective functions, $f : X \to \mathbb{R}^k$.

• $k$: the number of objective functions, $k \geq 2$.

• $\mathbb{R}^n$: the decision variable space.

• $\mathbb{R}^k$: the objective function space.

3.1.2 Pareto optimal solutions

In MOPs instead of finding a single optimum solution, the goal is finding a set of good solutions with compromises or trade-offs between the different solutions. This is due to the existence of multiple conflicting objectives instead of a single objective function [54]. The set of good solutions found is referred to as a Pareto optimal solution set. There are two main approaches for solving an MOP [55]:

• Solving an MOP as a single-objective optimization problem. This is achieved by combining the objective functions into a single objective function by assigning weights to each objective function. Another method would be to solve for one objective function and consider the other objective functions as constraints. The problem with this approach is determining the weights that should be assigned for each objective function and identifying the thresholds for the constraints.

• Producing a set of good solutions with trade-offs between the objective functions. This solution set is referred to as the Pareto optimal solution set.
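The difficulty of choosing weights in the first approach can be illustrated with a small Python sketch. The candidate solutions and the weight values below are hypothetical and were chosen only for illustration; each solution is an objective vector (cost, number of features) with both objectives minimized.

```python
from typing import List, Sequence, Tuple

def weighted_sum_choice(solutions: List[Tuple[float, ...]],
                        weights: Sequence[float]) -> Tuple[float, ...]:
    """Return the solution minimizing the weighted sum of its objectives."""
    def score(objectives: Tuple[float, ...]) -> float:
        return sum(w * v for w, v in zip(weights, objectives))
    return min(solutions, key=score)

# Hypothetical (cost, number of features) pairs.
solutions = [(0.40, 1), (0.25, 2), (0.20, 4)]

# Different weights on the second objective select different solutions.
few_features = weighted_sum_choice(solutions, (1.0, 0.2))
balanced = weighted_sum_choice(solutions, (1.0, 0.1))
low_cost = weighted_sum_choice(solutions, (1.0, 0.01))
```

Each of the three weightings selects a different solution, which is why the weights must be chosen with care when an MOP is collapsed into a single objective.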

To understand the theory of Pareto optimality a number of definitions need to be introduced (assuming all the objective functions to be minimized)[55] [56] [59]:

• Pareto dominance: given two feasible solution vectors a and b, solution a dominates solution b (denoted $a \prec b$) if and only if [56]:

\[
\forall i \in \{1, \ldots, k\} : f_i(a) \leq f_i(b)
\]

and

\[
\exists i \in \{1, \ldots, k\} : f_i(a) < f_i(b)
\]

• Pareto optimal solution: a solution that is not dominated by any other solution in the solution space X is referred to as a Pareto optimal solution (also referred to as a non-inferior or an efficient solution). Formally [56]: solution a is a Pareto optimal solution if there is no other solution $b \in X$ such that $f(b) \prec f(a)$.

• Pareto optimal set ($P_{opt}$): the set of all feasible non-dominated solutions in X. Formally [56]:

\[
P_{opt} = \{ a \in X \mid \nexists\, b \in X : f(b) \prec f(a) \} \tag{3.2}
\]

• Pareto front ($PF_{opt}$): the objective function values in the objective space which correspond to the non-dominated solutions. Formally [56]:

\[
PF_{opt} = \{ z = (f_1(a), \ldots, f_k(a)) \mid a \in P_{opt} \} \tag{3.3}
\]
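The definitions above can be sketched in a few lines of Python. This is a minimal illustration that assumes all objectives are minimized and that the solution space is a small finite list of objective vectors; the function names and the example values are hypothetical.

```python
from typing import List, Sequence

def dominates(fa: Sequence[float], fb: Sequence[float]) -> bool:
    """True if objective vector fa Pareto-dominates fb (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(fa, fb))
    strictly_better = any(x < y for x, y in zip(fa, fb))
    return no_worse and strictly_better

def pareto_front(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Return the objective vectors that no other vector in points dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (cost, number of features) pairs.
candidates = [(0.40, 1), (0.25, 2), (0.30, 2), (0.20, 4), (0.25, 5)]
front = pareto_front(candidates)
```

Here (0.30, 2) and (0.25, 5) are both dominated by (0.25, 2), so only three of the five candidates remain on the front. The pairwise comparison is quadratic in the number of candidates, which is adequate for an illustration but not for large populations.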

Solutions in the Pareto optimal solution set represent solutions to the MOP [55]. As shown in figure 3.1, all solutions on the Pareto front are optimal solutions. The decision maker (analyst) can then select a solution which meets the requirements of the problem studied.

Figure 3.1: The Pareto front, plotted with axes f1(x) and f2(x). The front bounds the feasible solution space X, and all points on it are called non-inferior or non-dominated points. Adapted from [3]

3.1.3 Finding Pareto optimal solutions using evolutionary algorithms (EAs)

Evolutionary algorithms (EAs) are well suited to solving MOPs because each run of the algorithm maintains a set of candidate solutions rather than a single solution. As a result, multiple members of the Pareto optimal set can be found in a single run of the algorithm [60] [61]. EAs used to solve MOPs are referred to as multi-objective evolutionary algorithms (MOEAs).

Evolutionary algorithms follow the evolution process that biological organisms undergo in nature. Individuals in natural populations experience natural selection, and the fittest individuals survive. Individuals with good genes have a larger chance of surviving in their environment (they are more fit), whereas less fit individuals are eliminated from the population. Individuals that remain in the population pass their good genes to their offspring through recombination (mating), which results in new individuals with better characteristics and higher chances of survival. With each new generation, individuals become more adapted to their environment and more fit [62] [63] [64].

An evolutionary algorithm mimics this evolution process. It is broadly defined as any model that is based on a population of individuals (possible solutions) and that uses selection and recombination operations to produce a set of sample points in a search space [63]. An evolutionary algorithm starts with a population of individuals, where each individual is represented as a chromosome that consists of a set of genes. Each chromosome is a possible solution to the problem being solved and is evaluated using an objective function to determine its fitness. Chromosomes are recombined to produce new offspring. The new offspring are then mutated and evaluated using the fitness function. Chromosomes with high fitness scores remain in the population, and the recombination and evaluation process is repeated until a satisfactory solution is found or the maximum number of iterations is reached [62] [63] [64] [65] [55] [59] [60]. Algorithm 2 shows the basic steps of an evolutionary algorithm [62].

One of the most widely used evolutionary algorithms for solving multi-objective problems is the nondominated sorting genetic algorithm II (NSGA-II) [66]. NSGA-II is characterized by its fast non-dominated sorting approach and its ability to ensure diversity and preserve elitism in the solutions produced. NSGA-II starts by creating an initial random population Q1 of size N. Each individual in Q1 is assigned a fitness value equal to its nondomination level, with level 1 being the best, so that fitness is minimized. The individuals (chromosomes) in population Q1 are then sorted based on nondomination. Binary tournament selection is used to select two parents based on nondomination ranking, followed by recombination and mutation to create children (offspring) for the second generation Q2. To maintain elitism, all individuals in the current population Q2 and the previous population Q1 are compared, which ensures that the fittest individuals remain in the selection process. Since two individuals with the same nondomination level may be compared, a second metric is introduced, referred to as the crowding distance. The crowding distance measures how close an individual is to its neighboring individuals. Individuals in less crowded areas are preferred, and therefore ranked higher, since this leads to higher diversity. The two metrics (nondomination and crowding distance) are combined into one operator called the crowded-comparison operator (≺n), which is used for comparing individuals. New generations are created until the requested maximum number of generations is reached [66] [67]. An implementation of NSGA-II is included in the MOEA framework [68], which was used as part of the implementation of the motif selection method. The time complexity of the NSGA-II algorithm is O(kN²), where k is the number of objectives and N is the population size.
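To make the crowding distance concrete, the following Python sketch computes it for the members of a single nondomination level. It follows the commonly described NSGA-II procedure (boundary individuals receive an infinite distance, interior individuals accumulate the normalized gap between their two neighbors for each objective). It is an illustration only and is not the code of the MOEA framework; the function name and example values are hypothetical.

```python
import math
from typing import List, Sequence

def crowding_distance(front: List[Sequence[float]]) -> List[float]:
    """Crowding distance of each objective vector within one nondomination level."""
    n = len(front)
    if n == 0:
        return []
    distance = [0.0] * n
    for m in range(len(front[0])):
        # Indices of the individuals sorted by objective m.
        order = sorted(range(n), key=lambda i: front[i][m])
        f_min, f_max = front[order[0]][m], front[order[-1]][m]
        # Boundary individuals are always kept, so they get an infinite distance.
        distance[order[0]] = distance[order[-1]] = math.inf
        if f_max == f_min:
            continue  # all values equal: this objective adds no spread
        for pos in range(1, n - 1):
            gap = front[order[pos + 1]][m] - front[order[pos - 1]][m]
            distance[order[pos]] += gap / (f_max - f_min)
    return distance

# Hypothetical (cost, number of features) pairs on the same level.
level = [(0.40, 1), (0.25, 2), (0.20, 4)]
distances = crowding_distance(level)
```

When two individuals share a nondomination level, the one with the larger crowding distance would be preferred.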

Algorithm 2 Evolutionary algorithm basic steps.

procedure EvolutionaryAlgorithm(num_iterations)
    Generate an initial population
    Evaluate the fitness of every individual (chromosome) in the population
    j = 0
    while a satisfactory result is not reached and j < num_iterations do
        Select two individuals (parents) from the population
        Recombine the parents to produce children
        Mutate the children
        Evaluate the fitness of the children
        Replace some or all of the population by the children
        j = j + 1
    end while
end procedure
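The steps of Algorithm 2 can be sketched as a small single-objective genetic algorithm over bit strings. This is a minimal sketch under several assumptions that are not part of the algorithm as stated: binary tournament selection, one-point crossover, bit-flip mutation, and replacement of the worst individual whenever a child is fitter. The function name, parameter defaults, and the toy fitness function (the number of ones in the chromosome) are hypothetical.

```python
import random
from typing import Callable, List, Optional, Tuple

Chromosome = List[int]

def evolve(fitness: Callable[[Chromosome], float], length: int,
           pop_size: int = 20, num_iterations: int = 200,
           mutation_rate: float = 0.05, target: Optional[float] = None,
           seed: int = 0) -> Tuple[Chromosome, float]:
    """Maximize fitness over bit strings of the given length (length >= 2)."""
    rng = random.Random(seed)
    # Generate an initial population and evaluate every individual.
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    scores = [fitness(c) for c in population]

    def tournament() -> Chromosome:
        a, b = rng.sample(range(pop_size), 2)
        return population[a] if scores[a] >= scores[b] else population[b]

    j = 0
    while j < num_iterations and (target is None or max(scores) < target):
        p1, p2 = tournament(), tournament()           # select two parents
        cut = rng.randint(1, length - 1)              # one-point recombination
        children = [p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]]
        for child in children:
            for i in range(length):                   # bit-flip mutation
                if rng.random() < mutation_rate:
                    child[i] = 1 - child[i]
            child_score = fitness(child)              # evaluate the child
            worst = min(range(pop_size), key=scores.__getitem__)
            if child_score > scores[worst]:           # replace the worst individual
                population[worst], scores[worst] = child, child_score
        j += 1
    best = max(range(pop_size), key=scores.__getitem__)
    return population[best], scores[best]

best, best_score = evolve(fitness=sum, length=12)
```

Because a child only ever replaces the currently worst individual, the best fitness in the population never decreases from one iteration to the next.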

3.1.4 Post Pareto analysis

Although producing multiple solutions is one of the advantages of solving MOPs, the decision maker might still need help in narrowing down the Pareto optimal solutions to a smaller set of solutions to select from. Additionally, the decision maker may prefer to study only one solution. In such cases, post Pareto analysis can help in filtering and narrowing down the solutions to one solution or to a small set of solutions. One method for narrowing down Pareto optimal solutions is the light beam search method introduced in [69]. In the light beam search method, the decision maker provides preference information such that solutions from the Pareto front which are close to the preference values are selected. The decision maker provides an aspiration point (AP) and a reservation point (RP). The aspiration point represents the most preferred values for each objective function, while the reservation point represents the least preferred values for each objective function [70]. Based on the AP and RP, the direction for searching Pareto optimal solutions is determined, and all solutions which meet the AP and RP constraints are selected. If only one solution is desired, then the middle point between the AP and RP points is selected [69] [70].
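The selection step described above can be sketched as follows. This is a simplified illustration rather than the full light beam search procedure of [69]: it keeps the Pareto solutions whose objective values lie between the AP and the RP and, when a single solution is wanted, returns the one closest to the midpoint of the two points. Interpreting "the middle point" as the nearest solution, and scaling each objective by the AP to RP range before measuring distance, are assumptions made here; the function name and the example values are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]

def select_by_preference(front: List[Point], aspiration: Sequence[float],
                         reservation: Sequence[float]
                         ) -> Tuple[List[Point], Optional[Point]]:
    """Filter a Pareto front by AP/RP (minimization) and pick one solution."""
    acceptable = [p for p in front
                  if all(a <= v <= r for v, a, r in zip(p, aspiration, reservation))]
    if not acceptable:
        return [], None
    midpoint = [(a + r) / 2 for a, r in zip(aspiration, reservation)]
    # Scale by the AP-to-RP range so objectives with large units do not dominate.
    span = [(r - a) or 1.0 for a, r in zip(aspiration, reservation)]

    def distance(p: Point) -> float:
        return math.sqrt(sum(((v - m) / s) ** 2 for v, m, s in zip(p, midpoint, span)))

    return acceptable, min(acceptable, key=distance)

# Hypothetical Pareto front of (cost, number of features) pairs.
front = [(0.40, 1), (0.25, 2), (0.20, 4), (0.10, 12)]
acceptable, chosen = select_by_preference(front, (0.0, 0.0), (0.3, 10.0))
```

In this example two solutions fall outside the reservation point, one on cost and one on the number of features, and of the remaining two the solution (0.20, 4) lies closest to the midpoint.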

3.2 Methods

In this section we formally define the discriminative motif selection problem and demonstrate its similarity to another combinatorial problem which is the Positive Negative Partial Set Cover (PNPSC) problem. Using properties of the PNPSC problem, the discriminative motif selection problem is formulated as a multi-objective problem and solved using an evolutionary algorithm.

3.2.1 The discriminative motif selection problem formal definition

The motif selection problem with background data can be formally defined as follows. Given a set of foreground (positive) genomic sequences $P = \{p_1, p_2, \ldots, p_\pi\}$, a set of background (negative) genomic sequences $N = \{n_1, n_2, \ldots, n_\nu\}$, and a set of discovered motifs $M = \{m_1, m_2, \ldots, m_k\}$, let $P_{m_i}$ and $N_{m_i}$ be the sets of foreground and background sequences covered by motif $m_i$, where $P_{m_i} = \{p_{i_1}, p_{i_2}, \ldots, p_{i_h}\}$ and $N_{m_i} = \{n_{i_1}, n_{i_2}, \ldots, n_{i_h}\}$. Find the smallest subset of features (motifs) $F^{min} = \{m_{i_1}, m_{i_2}, \ldots, m_{i_d}\} \subseteq M$ that covers the maximum number of P sequences and the minimum number of N sequences. The motif selection problem has three objectives:

• Maximizing the foreground sequence coverage.

• Minimizing the background sequence coverage.

• Achieving the above two objectives using the smallest number of features.

Since the motif selection problem consists of multiple conflicting objectives, multi-objective optimization methods can be used to solve it. To reduce the complexity of the motif selection problem, the first two objectives (foreground and background sequence coverage) are combined into one objective using a cost function used in solving the Positive Negative Partial Set Cover (PNPSC) combinatorial problem [71]. The PNPSC problem is a variation of the set cover problem and is explained in the next section.

3.2.2 The Positive Negative Partial Set Cover (PNPSC) problem

The PNPSC problem was first introduced in [71]. Given disjoint sets P (the positive elements) and N (the negative elements) and a family of subsets $S = \{S_1, \ldots, S_m\} \subseteq 2^{P \cup N}$, find a subfamily $C \subseteq S$ that minimizes [71] [72]:

\[
cost(P, N, C) = |P \setminus (\bigcup C)| + |N \cap (\bigcup C)|. \tag{3.4}
\]

The cost is the sum of the number of uncovered positive elements P, and the number of negative elements covered N. The aim of the PNPSC problem is to cover the maximum number of positive elements P and the minimum number of negative elements N [71] [72]. In the PNPSC problem, the constraint of covering all the positive elements P is relaxed which results in a partial coverage of P. The objective is identifying a subfamily of sets C that achieves the best balance of covering the maximum number of positive elements and the minimum number of negative elements [71]. Based on results obtained in the previous chapter using the greedy algorithm, relaxing the full coverage constraint produces a manageable set of motifs which still produces high coverage of the foreground sequence data set.
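As a small worked illustration of eq. 3.4, the following Python sketch computes the PNPSC cost for a toy instance. The element names and the three subsets are hypothetical and serve only to show how the two terms of the cost trade off.

```python
from typing import Hashable, Iterable, Set

def pnpsc_cost(positives: Set[Hashable], negatives: Set[Hashable],
               chosen: Iterable[Set[Hashable]]) -> int:
    """Eq. 3.4: uncovered positive elements plus covered negative elements."""
    covered = set().union(*chosen)  # union of all sets in the subfamily C
    return len(positives - covered) + len(negatives & covered)

# Hypothetical toy instance.
P = {"p1", "p2", "p3", "p4", "p5"}
N = {"n1", "n2", "n3"}
S1 = {"p1", "p2", "n1"}
S2 = {"p3", "p4"}
S3 = {"p4", "p5", "n2", "n3"}

cost_partial = pnpsc_cost(P, N, [S1, S2])      # leaves p5 uncovered, covers n1
cost_full = pnpsc_cost(P, N, [S1, S2, S3])     # covers all of P, but also all of N
```

In this instance the partial cover {S1, S2} has a lower cost (2) than the full cover {S1, S2, S3} (3), which is the effect of relaxing the full coverage constraint.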

3.2.3 Mapping the motif selection problem to the PNPSC problem

The goal of the motif selection problem is to find the smallest feature set $F^{min}$ which covers the maximum number of foreground sequences and the minimum number of background sequences. The motif selection problem can be formulated as the PNPSC problem described above as follows: the motif set M corresponds to the family of subsets S in the PNPSC problem, and the sets of foreground and background genomic sequences correspond to the positive and negative elements P and N used in PNPSC. Finally, the feature set $F \subseteq M$ corresponds to the subfamily $C \subseteq S$. The cost of a feature set F is defined as:

\[
cost(P, N, F) = \frac{|P \setminus (\bigcup F)|}{|P|} + \frac{|N \cap (\bigcup F)|}{|N|} \tag{3.5}
\]

The cost is the sum of the fraction of foreground sequences left uncovered and the fraction of background sequences covered. Each count is normalized by the total number of P or N elements since the foreground and background data sets are not necessarily the same size. Feature sets with lower costs are preferred since they cover more foreground sequences and fewer background sequences. The cost function in eq. 3.5 can be modified to penalize feature sets which have higher background coverage by adding a penalty value (α), referred to as the weight balance of coverage between the foreground and background data sets [73]. Increasing the α value leads to selecting feature sets with lower background coverage. The modified cost function with α is:

\[
cost(P, N, F) = \frac{(1 - \alpha)\,|P \setminus (\bigcup F)|}{|P|} + \frac{\alpha\,|N \cap (\bigcup F)|}{|N|} \tag{3.6}
\]

As shown in eq. 3.6, the foreground and background coverage of a feature set F are combined into one cost function cost(P, N, F), which reflects the foreground and background coverage obtained. Using this cost function along with the size of the feature set F, the motif selection problem can be cast as a multi-objective optimization problem with two objectives:

• The cost of a feature set F as defined in eq. 3.6 (minimization).

• The size of the feature set F (minimization).
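The two objectives can be sketched in Python as follows. The sketch assumes that the coverage of each motif is available as a set of sequence identifiers (in the evaluation, motif occurrences are determined with FIMO); the dictionary layout, the function names, and the toy data are hypothetical.

```python
from typing import Dict, List, Set, Tuple

def weighted_cost(foreground: Set[str], background: Set[str],
                  coverage: Dict[str, Set[str]], features: List[str],
                  alpha: float = 0.5) -> float:
    """Eq. 3.6; assumes non-empty foreground and background sets."""
    covered = set().union(*(coverage[m] for m in features))
    uncovered_fg = len(foreground - covered) / len(foreground)
    covered_bg = len(background & covered) / len(background)
    return (1 - alpha) * uncovered_fg + alpha * covered_bg

def objectives(foreground: Set[str], background: Set[str],
               coverage: Dict[str, Set[str]], features: List[str],
               alpha: float = 0.5) -> Tuple[float, int]:
    """Both minimization objectives: (cost of F, size of F)."""
    return weighted_cost(foreground, background, coverage, features, alpha), len(features)

# Hypothetical toy data: sequence identifiers covered by each motif.
fg = {"f1", "f2", "f3", "f4"}
bg = {"b1", "b2", "b3", "b4"}
coverage = {
    "m1": {"f1", "f2", "b1"},
    "m2": {"f3"},
    "m3": {"f3", "f4", "b2", "b3"},
}

specific = objectives(fg, bg, coverage, ["m1", "m2"], alpha=0.5)
broad_a5 = objectives(fg, bg, coverage, ["m1", "m3"], alpha=0.5)
broad_a6 = objectives(fg, bg, coverage, ["m1", "m3"], alpha=0.6)
```

The feature set {m1, m3} covers every foreground sequence but also three of the four background sequences, so its cost rises from 0.375 to 0.45 when α is increased from 0.5 to 0.6, while the more specific set {m1, m2} has a cost of 0.25.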

3.2.4 Solving the discriminative motif selection problem using multi-objective optimization

Since there are two conflicting objectives in the discriminative motif selection problem, there is no single optimum solution but rather a set of good solutions with trade-offs between the two objectives. The aim is to optimize the two objectives that characterize the motif selection problem and find a subcollection of features that is small in size and low in cost. The two objectives are the cost of the feature set F (defined in eq. 3.6) and the number of features in F. Thus, the discriminative motif selection problem with objective functions f1 and f2 is defined as:

\[
\text{Minimize } f_1 = \frac{(1 - \alpha)\,|P \setminus (\bigcup F)|}{|P|} + \frac{\alpha\,|N \cap (\bigcup F)|}{|N|} \tag{3.7}
\]

\[
\text{Minimize } f_2 = |F| \tag{3.8}
\]

subject to:

\[
\forall m_j \in F : added(m_j) > \Delta \tag{3.9}
\]

\[
\forall m_j \in F : \frac{|N_{m_j}|}{|N|} < \theta \tag{3.10}
\]

Constraint 3.9 was introduced in chapter 2 and is used to avoid selecting features which do not add a minimum percentage of new sequences to the solution. Constraint 3.10 is used to avoid selecting features with high background sequence coverage, which helps in reducing the overall background coverage of the solution produced.

3.2.5 Using an MOEA to solve multi-objective problems

Each solution (individual in the population) to the motif selection problem is represented as a binary vector of length k, where k is the total number of motifs discovered. If a motif is included in the solution, its position in the vector is assigned a value of 1; otherwise it is assigned 0. The MOEA framework [68], with NSGA-II as the MOEA algorithm, was used to solve the multi-objective problem using the objectives and constraints defined in the previous section.
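The binary representation can be illustrated with a short Python sketch that decodes a chromosome into a feature set and checks constraint 3.10 for it. The sketch is independent of the MOEA framework; the function names and the toy data are hypothetical, θ is expressed here as a fraction between 0 and 1, and constraint 3.9 is omitted because added(m_j) is defined in chapter 2.

```python
from typing import Dict, List, Sequence, Set

def decode(chromosome: Sequence[int], motifs: Sequence[str]) -> List[str]:
    """Return the motifs whose bit is 1 in the binary solution vector."""
    if len(chromosome) != len(motifs):
        raise ValueError("chromosome length must equal the number of motifs")
    return [m for bit, m in zip(chromosome, motifs) if bit == 1]

def satisfies_background_limit(features: Sequence[str],
                               background_hits: Dict[str, Set[str]],
                               num_background: int, theta: float) -> bool:
    """Constraint 3.10: every selected motif covers less than theta of N."""
    return all(len(background_hits[m]) / num_background < theta for m in features)

# Hypothetical toy data.
motifs = ["m1", "m2", "m3", "m4"]
background_hits = {"m1": {"b1"}, "m2": set(), "m3": {"b2", "b3"}, "m4": {"b1", "b4"}}

features = decode([1, 0, 1, 0], motifs)   # the solution selects m1 and m3
f2 = len(features)                        # second objective: size of F
strict = satisfies_background_limit(features, background_hits, 4, theta=0.5)
relaxed = satisfies_background_limit(features, background_hits, 4, theta=0.6)
```

With four background sequences, m3 covers exactly half of them, so the solution violates the constraint for θ = 0.5 (the inequality is strict) and satisfies it for θ = 0.6.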

3.2.5.1 Filtering the features before applying MOEA

To reduce the space of possible solutions, the features can be filtered before calling the MOEA. This is similar to feature selection methods used in machine learning to eliminate noisy features which are not desired in the final solutions [74]. Two pre-filtering values can be used. The first, $\Omega_f$, sets the minimum positive sequence coverage desired per feature, which helps in eliminating features with very low foreground coverage. This does not violate the coverage requirement of the PNPSC problem since partial coverage of the foreground sequences is acceptable. The second, $\Omega_b$, eliminates features which have high background coverage. An advantage of these pre-filtering steps is a smaller solution space, which reduces the MOEA running time.
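The two pre-filters can be sketched in Python as follows. The sketch assumes both thresholds are given as fractions of the respective data set and that the comparisons are inclusive; neither detail is specified above, and the function name and toy data are hypothetical.

```python
from typing import Dict, List, Set

def prefilter(fg_hits: Dict[str, Set[str]], bg_hits: Dict[str, Set[str]],
              num_fg: int, num_bg: int,
              omega_f: float, omega_b: float) -> List[str]:
    """Keep motifs with foreground coverage >= omega_f and background coverage <= omega_b."""
    kept = []
    for motif, covered_fg in fg_hits.items():
        fg_fraction = len(covered_fg) / num_fg
        bg_fraction = len(bg_hits.get(motif, set())) / num_bg
        if fg_fraction >= omega_f and bg_fraction <= omega_b:
            kept.append(motif)
    return kept

# Hypothetical toy data with 10 foreground and 10 background sequences.
fg_hits = {
    "m1": {f"f{i}" for i in range(6)},   # 60% foreground coverage
    "m2": {"f0"},                        # 10% foreground coverage
    "m3": {f"f{i}" for i in range(7)},   # 70% foreground coverage
}
bg_hits = {
    "m1": {"b0"},                        # 10% background coverage
    "m3": {f"b{i}" for i in range(8)},   # 80% background coverage
}

kept = prefilter(fg_hits, bg_hits, num_fg=10, num_bg=10, omega_f=0.2, omega_b=0.5)
```

Here m2 is removed by the foreground filter and m3 by the background filter, so the MOEA would search over solutions built from m1 only.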

3.3 Results and discussion

In this section the discriminative motif selection method will be evaluated and compared to other motif selection methods. The discriminative motif selection method implemented using the NSGA-II algorithm will be referred to as DSEA (Discriminative motif Selection using Evolutionary Algorithm). Application of DSEA to two case studies will be covered in chapters 5 and 6.

3.3.1 Evaluation using ENCODE data

To evaluate the performance of DSEA, it was compared to the motif selection methods introduced in chapter 2 (greedy, enrichment, RILP, bounded exact search, and W-ChIPMotifs) using the ENCODE ChIP-seq data (explained in chapter 2). The methods were compared in terms of the following metrics:

1. Number of features selected (N): This measure indicates the number of motifs chosen by a motif selection method.

2. Sequence sensitivity (sSn): Also referred to as the true positive rate. This measure indicates the percentage of foreground input sequences (ChIP-seq peaks) that were identified by the selected set of features reported by a method. Sequence sensitivity, sSn, is defined as sSn = TPs/(TPs + FNs), where the number of true positives, TPs, is the number of sequences containing at least one selected motif (determined by using FIMO [2]) and the number of false negatives, FNs, is the number of sequences with no occurrence of any selected motif.

3. Sequence specificity (sSp): Also referred to as the true negative rate. This measure indicates the percentage of background sequences that were correctly identified as background, that is, that did not contain any occurrence of a selected feature. High specificity indicates that the selected motifs have low coverage in the background data set. It is defined as sSp = TNs/(TNs + FPs), where the number of true negatives, TNs, is the number of background sequences with no occurrence of any selected motif and the number of false positives, FPs, is the number of background sequences containing at least one selected motif.

4. Accuracy (Acc): Accuracy is the percentage of sequences that were correctly classified as foreground or background out of the total number of foreground and background sequences. It is defined as Acc = (TPs + TNs)/(TPs + TNs + FPs + FNs).
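The three rate metrics can be computed directly from four counts, as in the following Python sketch. The function name and the example counts are hypothetical; in the evaluation the counts come from scanning the foreground and background sequences with FIMO.

```python
from typing import Dict

def evaluate(fg_with_motif: int, fg_total: int,
             bg_with_motif: int, bg_total: int) -> Dict[str, float]:
    """Compute sSn, sSp and Acc from sequence counts (as fractions, not percentages)."""
    tps = fg_with_motif               # foreground sequences with at least one selected motif
    fns = fg_total - fg_with_motif    # foreground sequences with no selected motif
    fps = bg_with_motif               # background sequences with at least one selected motif
    tns = bg_total - bg_with_motif    # background sequences with no selected motif
    return {
        "sSn": tps / (tps + fns),
        "sSp": tns / (tns + fps),
        "Acc": (tps + tns) / (tps + tns + fps + fns),
    }

# Hypothetical counts: 75 of 100 peaks and 10 of 50 background sequences contain a selected motif.
metrics = evaluate(fg_with_motif=75, fg_total=100, bg_with_motif=10, bg_total=50)
```

For these counts the sketch gives sSn = 0.75, sSp = 0.80, and Acc = 115/150, or about 0.77.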

The evaluations were performed using two α values (0.5 and 0.6). Increasing α to 0.6 penalizes solutions with high background coverage. ∆ was set to 5%. θ was set to 100%, which means there was no maximum background coverage constraint per single motif. The number of iterations was set to 10,000. No pre-filtering was applied to the predicted motifs, to ensure a fair comparison with the other methods, which do not apply pre-filtering.

Since DSEA produces multiple solutions whereas the other methods produce only one, it is necessary to select one solution from all the possible solutions. The light beam search method explained in section 3.1.4 was used. The aspiration point AP (desired values) for the two objectives (f1, f2) was set to zero (the minimum possible value) since both objectives are of minimization type. The reservation point RP (worst accepted values) for f1 was set to the cost associated with a minimum foreground coverage of 60%; that is, the minimum positive coverage of a solution accepted by the decision maker is 60%. The RP for f2 was set to 10, which indicates that the maximum number of motifs accepted by the decision maker is 10.

Table 3.1 shows the average number of features selected by the different methods. DSEA (α = 0.6) selects the smallest average number of features (1.8 features). DSEA (α = 0.5) has the second smallest average (2.4 features). Increasing α penalizes solutions with higher background coverage; as a result, the selection of motifs is stricter, and only sets of motifs which lower the background coverage as a whole are selected. The largest average number of features selected was 24, by the RILP and bounded exact search methods. The numbers of features are also compared in figure 3.2.

Figure 3.3 shows a comparison between the methods in terms of sSn, sSp, and accuracy. The DSEA methods have the highest sSp and accuracy values, which indicates that the motifs selected by DSEA are specific and have lower background coverage. In terms of sSn, the DSEA methods are not the highest, but they perform comparably and are higher than the enrichment and W-ChIPMotifs methods. The lower sSn of DSEA compared to the greedy, RILP, and bounded exact search methods is due to DSEA's strict selection process, in which motifs with lower background coverage are always preferred. Increasing the DSEA α value penalizes solutions with higher background coverage, and as a result DSEA with α = 0.6 has a higher sSp.

Table 3.1: Number of features used (Mean, Median, SD).

Method            Mean   Median   SD
DSEA (α = 0.6)    1.8    2        0.83
DSEA (α = 0.5)    2.4    3        0.79
Greedy            2.59   2        0.89
Enrichment        3.4    3        1.7
W-ChIPMotifs      5.5    4        3.04
RILP              24     22       11.5
Bounded           24     22       11.5

Figure 3.2: Number of features comparison. (Y-axis: number of features; methods shown: Greedy, DSEA_A6, DSEA_A5, Enrichment, W-ChIPMotifs.)

Figure 3.3: Sequence sensitivity, sequence specificity, and accuracy comparisons. (Three panels with y-axes sequence sensitivity (%), sequence specificity (%), and accuracy (%); methods shown in each panel: RILP, Greedy, Bounded, DSEA_A6, DSEA_A5, Enrichment, W-ChIPMotifs.)

3.3.2 Application to case studies

The discriminative motif selection method described in this chapter is used to identify motifs in two case studies: the Brugia malayi case study in chapter 5 and the hydroxyproline-rich glycoprotein (HRGP) case study in chapter 6.

3.4 Conclusions

This chapter presented an evolutionary algorithm approach to solve the motif selection problem with background data. The motif selection problem was formulated as a multi-objective optimization problem (MOP) and solved using a multi-objective evolutionary algorithm (MOEA). Solving an MOP produces a set of significant solutions instead of a single solution. As a result, analysts are provided with different solutions to analyze and have the ability to select the solution which meets the problem requirements.

The proposed method was evaluated using ENCODE ChIP-seq data, where it was shown that the motifs selected were more accurate and had higher specificity. The proposed method has also been applied to two case studies presented in chapters 5 and 6.

3.5 Variations of the set cover problem

In this section, other variations of the set cover problem are introduced. Table 3.2 shows a summary of these variations. Let:

• U: universe of elements containing n elements. U = {u1, u2, ..., un}.

• S = {S1, S2, ..., Sk}: collection of sets over U, where each Si ⊆ U.

• S′: sub-collection of sets with minimum cardinality which covers all elements in U. S′ ⊆ S.

• P: positive elements (or blue elements, B).

• N: negative elements (or red elements, R).

• p: fraction of elements to be covered where 0 < p ≤ 1.

Table 3.2: Variations of the set cover problem

| Problem | Description | Motif selection mapping | References |
|---|---|---|---|
| Positive-Negative Set Cover, or the red blue set cover (RBSC) problem | Cover all P elements while covering the minimum number of N elements. | Discriminative motif selection with full foreground sequence coverage. | [75] |
| Set Multi Cover | Find S′ ⊆ S such that every element ui ∈ U is covered at least d times. | Depth of coverage. | [76] [77] [78] |
| Multiset-multicover | Given a collection of multisets where a multiset Si contains a specified number of copies of each element ui ∈ U, find S′ such that every ui is covered at least d times. | Multiple motif instances per sequence. | [76] |
| Set Partial Cover | Find the smallest set family S′ which covers at least a fraction p of the elements in U. | Partial sequence coverage. | [79] |
| Positive-Negative Partial Set Cover | Cover the maximum number of P elements while covering the minimum number of N elements. The requirement of covering all P is relaxed. | Discriminative motif selection with partial foreground sequence coverage. | [71] |

3.5.1 The Weighted Set Cover problem

Given a universe U of n elements and a collection of sets S = {S1, S2, ..., Sm}, where each Si is a subset of U and has a weight (cost) wi, find a sub-collection S′ ⊆ S of minimum total weight such that S′ covers all elements in U.
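The weighted variant is commonly approached with a greedy heuristic that repeatedly picks the set with the lowest weight per newly covered element. The following Python sketch illustrates that heuristic on a toy instance. The function name `greedy_weighted_set_cover` and the example sets and weights are illustrative assumptions, and the heuristic returns an approximate cover rather than a guaranteed optimum.

```python
def greedy_weighted_set_cover(universe, sets, weights):
    """Greedy heuristic: repeatedly pick the set with the lowest
    weight per newly covered element until the universe is covered."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best, best_ratio = None, float("inf")
        for name, members in sets.items():
            gain = len(uncovered & members)
            if gain == 0:
                continue  # this set adds nothing new
            ratio = weights[name] / gain
            if ratio < best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            raise ValueError("the sets do not cover the universe")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen


# Toy instance (hypothetical values)
universe = {1, 2, 3, 4, 5}
sets = {"S1": {1, 2, 3}, "S2": {2, 4}, "S3": {3, 4, 5}, "S4": {4, 5}}
weights = {"S1": 3.0, "S2": 1.0, "S3": 4.0, "S4": 1.0}

cover = greedy_weighted_set_cover(universe, sets, weights)
total_weight = sum(weights[name] for name in cover)
print(cover, total_weight)
```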

3.5.2 The Red Blue Set Cover (RBSC) problem

Given disjoint sets R and B, where R is a finite set of red elements and B is a finite set of blue elements, and a collection $S = \{S_1, ..., S_m\} \subseteq 2^{R \cup B}$, find a sub-collection C ⊆ S that covers all the blue elements B and covers the minimum number of red elements R [75]. The cost of a solution C is defined as [75]:

$$\mathrm{cost}(R, C) = \left| R \cap \Big( \bigcup C \Big) \right| \tag{3.11}$$
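As a small illustration of equation (3.11), the following Python sketch computes the cost of a candidate sub-collection and checks whether it covers all blue elements. The function names `rbsc_cost` and `covers_all_blue` and the toy elements are assumptions made for this example only.

```python
def rbsc_cost(red, sub_collection):
    """Number of red elements covered by the union of the chosen sets."""
    covered = set().union(*sub_collection)
    return len(red & covered)


def covers_all_blue(blue, sub_collection):
    """True when every blue element appears in at least one chosen set."""
    covered = set().union(*sub_collection)
    return blue <= covered


# Toy instance (hypothetical): red = background, blue = foreground
red = {"r1", "r2", "r3"}
blue = {"b1", "b2"}
C = [{"b1", "r1"}, {"b2", "r1", "r2"}]

cost = rbsc_cost(red, C)
feasible = covers_all_blue(blue, C)
print(cost, feasible)
```

In the motif selection mapping of table 3.2, each set corresponds to the sequences hit by one motif, so the cost is the number of background sequences covered by the selected motifs.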

3.5.3 The Set Multicover Problem

In the set multicover problem, given a universe U of n elements and a collection of sets S = {S1, S2, ..., Sm}, where each Si is a subset of U, find a subset S′ ⊆ S of minimum cardinality such that each element ui ∈ U is covered at least d times [76]. The set multicover problem is a variation of the original set cover problem where each element has to be covered a minimum number of times d instead of only once [80]. The set multicover problem has been used to find gene and protein networks as shown in [78].
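The multicover requirement can be checked by counting how many chosen sets contain each element. The Python sketch below is a minimal feasibility check, not a solver; the names `coverage_depth` and `is_multicover` and the toy data are assumptions for illustration.

```python
def coverage_depth(universe, chosen_sets):
    """Count, for every element, how many chosen sets contain it."""
    depth = {u: 0 for u in universe}
    for members in chosen_sets:
        for u in members:
            if u in depth:
                depth[u] += 1
    return depth


def is_multicover(universe, chosen_sets, d):
    """True when every element is covered at least d times."""
    return all(count >= d for count in coverage_depth(universe, chosen_sets).values())


# Toy instance (hypothetical): sequences as elements, motifs as sets
sequences = {"seq1", "seq2", "seq3"}
motif_sets = [{"seq1", "seq2"}, {"seq2", "seq3"}, {"seq1", "seq3"}]

ok_d2 = is_multicover(sequences, motif_sets, 2)
ok_d3 = is_multicover(sequences, motif_sets, 3)
print(ok_d2, ok_d3)
```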

3.5.4 The Multiset Multicover (MSMC) problem

In the multiset multicover problem, a collection of multisets instead of a collection of sets is given, where a multiset Si contains a specified number of copies of each element ui ∈ U. The objective is finding S′ such that every ui is covered at least d times.

3.5.5 The Partial Set Cover (PSC) problem

Given a universe U of n elements and a collection of sets S = {S1, S2, ..., Sm}, where each Si is a subset of U, and a minimum percentage of coverage p where 0 < p ≤ 1, the objective is finding a subset S′ ⊆ S of minimum cardinality such that S′ covers at least p·|U| elements [79]. The PSC problem has been used to solve a number of bioinformatics problems such as identifying proteins in mass spectrometry data [81] and combinatorial therapy discovery [73].

4 Guidelines for Motif Discovery and Motif Selection

4.1 Introduction

Based on the assessment of motif discovery tools performed by Tompa et al. [20], it was concluded that the sensitivity of the motif discovery tools was very low. Therefore, using an ensemble of motif discovery tools is recommended instead of a single tool. Since motif discovery produces a long list of possible motifs, motif selection is required as well. In this chapter, guidelines for motif discovery and motif selection are introduced in terms of which tools to use and the tools' parameter settings. The input to motif discovery consists of a set of DNA sequences in FASTA format (referred to as the foreground data set) which share a common biological function, and the objective is identifying common patterns (motifs) in these sequences (generative motif discovery). If a background data set is provided, then motif discovery aims to identify patterns which are enriched in the foreground data set relative to the background data set, referred to as discriminative motif discovery.

4.2 Motif discovery

The selection of the motif discovery tools and their parameters depends on the type of discovery performed (generative or discriminative) and the length of motifs expected to appear in the data. Since many tools do not scale well with the increase in data size, the size of the input data plays a role as well. If there exists a foreground data set only, then generative motif discovery tools are used. If there is a foreground data set and a background data set, then generative and discriminative motif discovery tools are used. There are no standards that determine whether the data sets are large or small, but if there are about 500 or more sequences (of length 1000 bases, which is the typical length selected for promoters), then it is recommended to run fewer tools and search for short motifs.

4.2.1 Generative vs discriminative

If only foreground data exists, then GimmeMotifs [34] is recommended for use. GimmeMotifs includes several motif discovery tools and choosing the tools is dependent on the size of data. If foreground and background data sets exist, then GimmeMotifs along with two discriminative motif discovery tools are recommended for use (DME [82] and DECOD [17]).

4.2.2 Large data sets

If the data set size is large then the following tools and parameters are recommended for use:

• GimmeMotifs: the following tools are recommended: MEME, Weeder. The length of motifs searched for is 5-8 (the small option in GimmeMotifs).

• DME: search for motifs of length 8, 10, 12, 14 with 200 motifs predicted per motif length. The number of motifs (200) is the number of motifs recommended by DME for accurate motif significance calculation.

• DECOD: search for motifs of length 8. Report 10 motifs per motif length which is the default value used by DECOD.

Again if only foreground data is available, only GimmeMotifs is used. With the existence of foreground and background data, all three tools (GimmeMotifs, DME, and DECOD) are recommended for use.

4.2.3 Small data sets

If the data set size is small then the following tools and parameters are recommended for use:

• GimmeMotifs: the following tools are recommended: MEME, Weeder, MDmodule, Improbizer, and BioProspector. The length of motifs searched for is 5-12 (the medium option in GimmeMotifs).

• DME: search for motifs of length 8, 10, 12, 14 with 200 motifs predicted per motif length. The number of motifs (200) is the number of motifs recommended by DME for accurate motif significance calculation.

• DECOD: search for motifs of length 8 and 10. Report 10 motifs per motif length which is the default value used by the authors.

If only foreground data is available then GimmeMotifs is used, otherwise all three tools (GimmeMotifs, DME, and DECOD) are used.

4.2.4 Motif representation

After the motif discovery process has completed, all the predicted motifs by all the tools are grouped into one file. It is recommended to represent the motifs with a Position Weight Matrix (PWM) using the MEME PWM format. Details about the MEME PWM format can be found at: http://meme-suite.org/doc/meme-format.html. Each motif name should be preceded by the name of the tool it was predicted by.

4.2.5 Motif scanning

All the predicted motifs are scanned across the foreground data set and the background data set (if it exists) using FIMO [2] with a p-value of 0.0001. The output of FIMO should be represented as a motif hit (occurrence) file. The motif hit file should list the name of each predicted motif preceded by '>' and, below each motif, the names of the sequences it occurs in, one sequence per line. Below is an example of two predicted motifs in a motif occurrence (hit) file:

    >Motif 1 name
    Sequence 1 ID
    Sequence 2 ID
    >Motif 2 name
    Sequence 2 ID
    Sequence 3 ID

Two motif hit files should be generated, one for the foreground data set and one for the background data set in the case of discriminative motif discovery.
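The hit file layout described above can be read with a few lines of Python. The sketch below is one possible reader, assuming each motif header starts with '>' and each following non-empty line is a sequence ID; the function name `parse_hit_file` is an assumption for this example.

```python
def parse_hit_file(lines):
    """Map each motif name to the set of sequence IDs it occurs in."""
    hits = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            current = line[1:].strip()
            hits.setdefault(current, set())
        elif current is None:
            raise ValueError("sequence ID found before any motif header")
        else:
            hits[current].add(line)
    return hits


sample = """>Motif 1 name
Sequence 1 ID
Sequence 2 ID
>Motif 2 name
Sequence 2 ID
Sequence 3 ID
"""

hits = parse_hit_file(sample.splitlines())
for motif, sequences in hits.items():
    print(motif, sorted(sequences))
```

Storing the sequence IDs as sets makes the later coverage calculations (unions and counts) straightforward.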

4.2.6 Reporting the properties of discovered motifs

To understand the properties of the discovered motifs and to obtain a high level overview before motif selection, the motif hit file (or the two hit files in the case of discriminative motif discovery) produced in section 4.2.5 is analyzed. The hit file(s) are read and a file which reports each motif's foreground and background coverage and the number of sequences hit is produced. The file will have five columns: Motif, foreCov (%), foreNumSeqs, backCov (%), backNumSeqs. This file provides an idea of the overall coverage of the predicted motifs, and a bar graph showing the foreground and background coverage can be generated. Figure 4.1 shows an example.

Figure 4.1: Foreground and background coverage of 849 discovered motifs. The bar chart, titled "Sequence coverage for all discovered motifs", plots the foreground and background sequence coverage (%) of each motif.
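The coverage report described in this section can be derived directly from the two hit files. The following Python sketch assumes the hits are already loaded as dictionaries mapping motif names to sets of sequence IDs; the function name `coverage_report`, the dictionary keys, and the toy data are illustrative assumptions.

```python
def coverage_report(fore_hits, back_hits, n_fore, n_back):
    """Build one row per motif with coverage percentages and sequence counts."""
    rows = []
    for motif in sorted(set(fore_hits) | set(back_hits)):
        fore_n = len(fore_hits.get(motif, ()))
        back_n = len(back_hits.get(motif, ()))
        rows.append({
            "Motif": motif,
            "foreCov (%)": round(100.0 * fore_n / n_fore, 1),
            "foreNumSeqs": fore_n,
            "backCov (%)": round(100.0 * back_n / n_back, 1) if n_back else 0.0,
            "backNumSeqs": back_n,
        })
    return rows


# Toy data (hypothetical): 4 foreground and 5 background sequences
fore_hits = {"m1": {"s1", "s2", "s3"}, "m2": {"s2"}}
back_hits = {"m1": {"b1"}}

report = coverage_report(fore_hits, back_hits, n_fore=4, n_back=5)
for row in report:
    print(row)
```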

4.2.7 Filtering predicted motifs

Using different pre-filtering values (Ωb, Ωf), the predicted motifs can be filtered to a smaller set. Ωb is the maximum background coverage (%) per motif. Ωf is the minimum foreground coverage (%) per motif. A table can be produced, as shown in table 4.1, which provides a high level overview of the number of motifs to supply as input to the motif selection step.

Table 4.1: Number of motifs which pass the filtering thresholds Ωb and Ωf. Ωb is the maximum background coverage (%) per motif. Ωf is the minimum foreground coverage (%) per motif.

| Ωb (%) | Ωf (%) | Number of motifs | Maximum motif foreground coverage (%) |
|---|---|---|---|
| 20 | 20 | 317 | 78.9 |
| 10 | 20 | 178 | 78.9 |
| 5 | 10 | 244 | 52.6 |
| 5 | 20 | 85 | 52.6 |
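The pre-filtering step amounts to two threshold comparisons per motif. The Python sketch below shows one way to apply it, assuming the thresholds are inclusive (a motif exactly at a threshold passes); that choice, the function name `filter_motifs`, and the coverage values are assumptions for illustration.

```python
def filter_motifs(rows, omega_b, omega_f):
    """Keep motifs with background coverage <= omega_b
    and foreground coverage >= omega_f (both in percent)."""
    return [
        name
        for name, fore_cov, back_cov in rows
        if back_cov <= omega_b and fore_cov >= omega_f
    ]


# (motif, foreground coverage %, background coverage %), hypothetical values
rows = [
    ("m1", 52.6, 1.2),
    ("m2", 31.6, 25.0),   # fails: background coverage too high
    ("m3", 8.0, 0.0),     # fails: foreground coverage too low
    ("m4", 20.0, 20.0),   # passes under the inclusive assumption
]

kept = filter_motifs(rows, omega_b=20, omega_f=20)
print(kept)
```

Running the filter for several (Ωb, Ωf) pairs and recording `len(kept)` gives the kind of overview shown in table 4.1.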

4.2.8 Motif selection

Based on the filtering thresholds in the previous section, motif selection can be applied to different sets of predicted motifs to select the smallest subset of motifs, which covers the maximum number of foreground sequences and the minimum number of background sequences. The subset of motifs selected by the motif selection method is referred to as a motif solution. The set of motifs in a motif solution collectively cover a percentage of the foreground and background data sets.
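Because motifs can hit the same sequences, the cumulative coverage of a motif solution is the size of the union of the covered sequences, not the sum of the per-motif coverages. The Python sketch below illustrates this; the function name `cumulative_coverage` and the toy hits are assumptions for this example.

```python
def cumulative_coverage(solution, hits, n_sequences):
    """Percentage of sequences covered by at least one motif in the solution."""
    covered = set()
    for motif in solution:
        covered |= hits.get(motif, set())
    return 100.0 * len(covered) / n_sequences


# Toy data (hypothetical): 5 foreground sequences, two motifs sharing s3
hits = {"m1": {"s1", "s2", "s3"}, "m2": {"s3", "s4"}}

cum_cov = cumulative_coverage(["m1", "m2"], hits, n_sequences=5)
print(cum_cov)
```

Here the individual coverages are 60% and 40%, but the cumulative coverage is 80% because sequence s3 is counted once.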

4.2.8.1 Motif selection without background data

If there is no background data, then the greedy algorithm from chapter 2 is recommended. The following parameters are recommended for the greedy algorithm: ∆ = 5% and  = 0.05. If using ∆ = 5% does not produce high cumulative foreground coverage (< 50%), then ∆ = 3% or ∆ = 1% could be used.
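To make the role of ∆ concrete, the following Python sketch shows a generic greedy coverage heuristic. It is not the chapter 2 algorithm itself: it assumes that ∆ is the minimum gain in cumulative foreground coverage (in percent) that a motif must add to be selected, and it does not model the second parameter. The function name `greedy_select` and the toy hits are also assumptions.

```python
def greedy_select(hits, n_sequences, delta_pct):
    """Repeatedly add the motif covering the most uncovered sequences;
    stop when the best gain falls below delta_pct percent."""
    covered = set()
    solution = []
    remaining = dict(hits)
    while remaining:
        # sorted() makes tie-breaking deterministic (alphabetical)
        motif = max(sorted(remaining), key=lambda m: len(remaining[m] - covered))
        gain = len(remaining[motif] - covered)
        if gain == 0 or 100.0 * gain / n_sequences < delta_pct:
            break
        solution.append(motif)
        covered |= remaining.pop(motif)
    return solution, 100.0 * len(covered) / n_sequences


# Toy data (hypothetical): 10 foreground sequences
hits = {
    "m1": {"s1", "s2", "s3", "s4", "s5"},
    "m2": {"s4", "s5", "s6", "s7"},
    "m3": {"s7", "s8"},
    "m4": {"s1", "s9"},
}

strict = greedy_select(hits, n_sequences=10, delta_pct=15)
relaxed = greedy_select(hits, n_sequences=10, delta_pct=5)
print(strict)
print(relaxed)
```

Under this assumption, lowering ∆ admits motifs with smaller marginal gains, which raises the cumulative coverage at the price of a larger solution.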

4.2.8.2 Motif selection with background data

If there is foreground and background data, then the discriminative motif selection method (DSEA) introduced in chapter 3 is recommended. The following parameters are recommended: ∆ = 5% and number of iterations = 100,000. If the number of input motifs is small then 10,000 iterations is sufficient. If the motifs input to DSEA are pre-filtered using Ωb and Ωf, then θ = 100 is used, since the pre-filtering removes the need for an internal constraint in DSEA. If no pre-filtering is performed then θ could be set to 10% or 20%. Finally, the α parameter controls the motif solution as a whole in terms of coverage. The recommended α is 0.6. If α = 0.6 does not produce any motif solution with cumulative foreground coverage > 50%, then α can be set to 0.5. Setting α to values < 0.5 will produce solutions with high background coverage, so it is not recommended.

4.2.9 Interpreting the results of motif selection

As mentioned before, motif selection produces motif solutions which consist of a subset of motifs. Thus, interpreting the results will be based on the motif solution and on the individual motifs composing the motif solution.

4.2.9.1 Motif selection without background data

The motif selection method recommended for cases without background data is the greedy algorithm. The greedy algorithm produces one motif solution which covers the maximum number of foreground sequences. The following metrics should be reported:

• The sequence sensitivity (sSn) of the motif solution.

• Number of motifs found by the motif solution.

• Sequence sensitivity (sSn) of each motif (foreground coverage).

• Cumulative foreground coverage of each motif.

When studying the solution reported, a motif solution with high sSn and a low number of motifs is desired. If the motif solution sSn is < 60%, then ∆ could be relaxed to 3% or 1%.

4.2.9.2 Motif selection with background data

The motif selection method recommended for cases with background data is the DSEA method introduced in chapter 3. Since DSEA is a multi-objective evolutionary algorithm, multiple motif solutions are produced in each run. It is recommended to show the solutions with their objectives as a bar chart and a Pareto front as shown in figure 4.2. The graph should show, for each motif solution, the number of motifs used and the cumulative foreground and background coverage. One or more solutions could be selected and the individual motifs in that solution can be reported in a table as shown in table 4.2. The same metrics used in section 4.2.9.1 should be reported along with sequence specificity (sSp) and accuracy. When analyzing the solutions provided in figure 4.2, the preferred solutions are those which have high foreground coverage and low background coverage. Generally, solutions with background coverage < 20% are preferred. If such solutions do not exist, then the pre-filtering thresholds Ωb and Ωf could be changed.

Figure 4.2: Motif solutions selected by the discriminative motif selection method. Each solution consists of a set of motifs. The sequence coverage is the cumulative coverage of all the motifs in the solution. One panel shows the Pareto front (cost versus number of motifs) for solutions A to D; the other shows the foreground coverage (%), background coverage (%), and number of motifs of each solution, with Ωb = 20% and Ωf = 20% per motif, α = 0.6, and ∆ = 5%.
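When several motif solutions are reported, the non-dominated ones can be picked out and then screened against the analyst's background coverage limit. The Python sketch below assumes three criteria (fewer motifs, higher foreground coverage, lower background coverage), which may differ from the exact objectives used by DSEA; the function names and the solution values are hypothetical.

```python
def dominates(a, b):
    """True when solution a is no worse than b on every criterion
    and strictly better on at least one."""
    no_worse = a["n"] <= b["n"] and a["fg"] >= b["fg"] and a["bg"] <= b["bg"]
    better = a["n"] < b["n"] or a["fg"] > b["fg"] or a["bg"] < b["bg"]
    return no_worse and better


def pareto_front(solutions):
    """Keep the solutions that no other solution dominates."""
    return [s for s in solutions if not any(dominates(o, s) for o in solutions)]


# Hypothetical solutions: n = number of motifs, fg/bg = cumulative coverage (%)
solutions = [
    {"name": "W", "n": 4, "fg": 100.0, "bg": 5.0},
    {"name": "X", "n": 3, "fg": 80.0, "bg": 10.0},
    {"name": "Y", "n": 4, "fg": 90.0, "bg": 12.0},  # dominated by W
    {"name": "Z", "n": 1, "fg": 50.0, "bg": 2.0},
]

front = [s["name"] for s in pareto_front(solutions)]
preferred = [s["name"] for s in pareto_front(solutions) if s["bg"] < 20.0]
print(front, preferred)
```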

Table 4.2: Motif solution A selected from figure 4.2. This solution consists of 4 motifs with a cumulative foreground coverage of 100% and a cumulative background coverage of 4.7%. Fore Cov (%) is the motif foreground coverage, Fore Num is the number of foreground sequences covered by the motif. Back Cov (%) is the motif background coverage, Back Num is the number of background sequences covered by the motif.

Motif Fore Cov (%) Fore Num Back Cov (%) Back Num

52.6 10.0 1.2 1.0

31.6 6.0 2.3 2.0

26.3 5.0 0.0 0.0

21.1 4.0 1.2 1.0

4.3 Conclusions

This chapter introduced general guidelines for motif discovery and for the motif selection methods introduced in chapters 2 and 3. There are no universal parameters which work for every data set, hence experimenting with different parameters is recommended. Producing an overview of the sequence coverage of motifs, and pre-filtering the motifs before performing motif selection, helps in selecting the appropriate parameters and reducing the run time of the motif selection methods.

5 Identification of Stage Specific Regulatory Elements in Brugia malayi using Discriminative Motif Selection

5.1 Introduction

Lymphatic filariasis (LF), commonly known as elephantiasis, is a neglected tropical disease caused by the nematode Brugia malayi. LF affects over 120 million people worldwide [83] and is characterized by the swelling of the limbs. Although the B. malayi genome has been sequenced [84], its regulatory elements have not been characterized yet. Each stage of the B. malayi life cycle has a unique gene transcriptional signature, and the stage specific genes have been identified using RNA-seq experiments in [85]. Putative regulatory element binding sites that are activated during the third stage larvae (L3) of the B. malayi life cycle, which is the infective stage, are identified using motif discovery followed by motif selection. The motif selection problem is formulated as a multi-objective optimization problem (MOP) and solved using a multi-objective evolutionary algorithm (MOEA) as introduced in chapter 3. Identifying the B. malayi stage specific regulatory elements will help in developing intervention strategies for the control of LF. Moreover, characterizing the regulatory elements will provide an improved understanding of the development of B. malayi.

5.2 Results and discussion

The promoter regions of genes expressed in stage L3 of B. malayi are referred to as the foreground (positive) data set, and promoter regions of genes expressed in stages L4 and adult female (AF) are referred to as the background (negative) data set. An ensemble of motif discovery tools was used for motif prediction. Motif discovery tools produce a large list of candidate motifs, thus motif selection was applied to identify a subset of significant motifs which have high sequence coverage in the foreground data set (L3) relative to the background data set (L4 and AF). The objective is selecting a set of motifs (referred to as a motif solution) which have high coverage in the foreground data set (L3) and low coverage in the background data set (L4 and AF). The subset of selected motifs should also be small in size to allow cheaper and faster lab validation. Since selecting motifs with such properties involves multiple conflicting objectives, the motif selection problem was formulated as a multi-objective optimization problem (MOP) and solved using a multi-objective evolutionary algorithm (MOEA) as discussed in chapter 3. Solving an MOP produces a set of optimal solutions instead of a single solution, referred to as Pareto optimal solutions.

5.2.1 The foreground and background data sets

There were 334 genes overexpressed in stage L3 and 256 genes overexpressed in stages L4 and AF, as reported in [85] using RNA-seq experiments. The maximum gene promoter length studied was 500 bases. In stage L3, the minimum promoter length was 14 bases and the average promoter length was 478 bases. In stages L4 and AF, the minimum promoter length was 16 bases and the average promoter length was 484 bases. The GC and AT content were similar in the L3 and the (L4, AF) data sets as shown in figure 5.1.
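A comparison of base composition between two promoter sets can be made with a short script. The Python sketch below reads FASTA-formatted text and reports the GC percentage over the A, C, G, and T bases; it is an illustrative calculation with made-up sequences, not the procedure used to produce figure 5.1, and the function names are assumptions.

```python
def read_fasta(lines):
    """Return a dict mapping each sequence ID to its upper-case sequence."""
    records = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}


def gc_percent(sequences):
    """GC content (%) over all A/C/G/T bases; other symbols are ignored."""
    gc = sum(s.count("G") + s.count("C") for s in sequences)
    at = sum(s.count("A") + s.count("T") for s in sequences)
    total = gc + at
    return 100.0 * gc / total if total else 0.0


# Made-up promoter sequences for illustration only
fasta_text = """>p1 example promoter
ATGC
GGCC
>p2
ATAT
"""

records = read_fasta(fasta_text.splitlines())
gc = gc_percent(records.values())
print(round(gc, 1), round(100.0 - gc, 1))
```

Running the same calculation on the foreground and background FASTA files gives the two GC/AT pairs to compare.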

Figure 5.1: AT and GC content in foreground (L3) and background (L4 and AF) data sets.

Table 5.1: Number of motifs discovered with different background coverage thresholds. For each set of motifs, the maximum motif foreground coverage in that set is shown.

| Maximum background coverage (%) | Number of motifs | Maximum motif foreground coverage (%) |
|---|---|---|
| 0 | 11 | 5.1 |
| 1 | 60 | 5.4 |
| 2 | 122 | 6.6 |
| 5 | 314 | 10.2 |
| 10 | 420 | 18.6 |
| 20 | 597 | 28.7 |

5.2.2 Motif prediction

Figure 5.2 shows an overview of the foreground and background sequence coverage for 871 motifs discovered using an ensemble of motif discovery tools (described in section 5.3). The maximum foreground coverage is 57.2% (191 sequences) and the minimum background coverage is zero. Table 5.1 shows the number of motifs found with different background coverage thresholds. Only 11 motifs (shown in table 5.3) had zero background coverage, where the maximum foreground coverage in those 11 motifs was 5.1% (17 sequences). 597 motifs had background coverage < 20%, where the maximum foreground coverage in that set was 28.7%. 274 motifs had background coverage > 20%. To reduce the number of motifs and eliminate noisy motifs, a set of filtering thresholds was used. The first filtering threshold Ωb sets the maximum background coverage per motif. The second filtering threshold Ωf sets the minimum foreground coverage per motif. Table 5.2 shows the number of motifs which passed the different filtering thresholds Ωb and Ωf.

Figure 5.2: Foreground and background coverage of all 871 discovered motifs. The bar chart, titled "Sequence coverage for all discovered motifs", plots the foreground and background sequence coverage (%) of each motif.

Table 5.2: Number of motifs discovered using filtering thresholds Ωb and Ωf. Ωb is the maximum background coverage per motif. Ωf is the minimum foreground coverage per motif.

| Ωb (%) | Ωf (%) | Number of motifs | Maximum motif foreground coverage (%) |
|---|---|---|---|
| 10 | 5 | 177 | 18.6 |
| 10 | 10 | 29 | 18.6 |
| 20 | 5 | 354 | 28.7 |
| 20 | 10 | 192 | 28.7 |

Table 5.3: Discovered motifs with zero background coverage

Motif Foreground coverage (%) Number of sequences

5.1 17

2.7 9

1.8 6

1.5 5

1.5 5

1.5 5

1.2 4

1.2 4

0.6 2

0.6 2

0.6 2

5.2.3 Motif selection

Filtering the discovered motifs as shown in the previous section helped in reducing the number of discovered motifs to be studied, but the number is still high and it is desirable to have a small set of motifs which could be validated in the lab. Thus, discriminative motif selection (introduced in chapter 3) was used to select the smallest set of significant motifs. Motif selection was applied to each filtered list of motifs shown in table 5.2.

5.2.3.1 Motif selection for motifs with Ωb = 10 and Ωf = 5

Using Ωb = 10 and Ωf = 5, there were 177 discovered motifs. Applying motif selection to this set of motifs, 19 motif solutions were found. The 19 motif solutions ranked by their cumulative foreground coverage are shown in figure 5.3. In each motif solution, the cumulative foreground coverage is higher than the cumulative background coverage. Solution A with 19 motifs had the highest cumulative foreground coverage (84.4%) and a cumulative background coverage of 39.5%. Although solution A produces a high foreground coverage, the background coverage is considered high and other solutions are recommended. If the decision maker (analyst) prefers solutions with a maximum cumulative background coverage of 20%, then there are three solutions which meet this requirement (Q, R, and S). Although these three solutions have low cumulative background coverage, their cumulative foreground coverage is not considered high (maximum was 36.8%). Note that each motif in the solutions introduced has a maximum background coverage of 10% (Ωb = 10) and minimum foreground coverage of 5% (Ωf = 5). Solution H with 12 motifs represents a solution with high cumulative foreground coverage (72.8%) and acceptable cumulative background coverage (30.1%). Again, each motif in solution H has a maximum background coverage of 10% (Ωb = 10) and minimum foreground coverage of 5% (Ωf = 5). Motifs of solution H are shown in table 5.4.

Figure 5.3: Motif solutions selected by the discriminative motif selection method. Each solution consists of a set of motifs. The sequence coverage is the cumulative coverage of all the motifs in the solution. The chart shows the foreground coverage (%), background coverage (%), and number of motifs for solutions A to S, with Ωb = 10% and Ωf = 5% per motif.

Table 5.4: Motif solution H selected from figure 5.3. This solution consists of 12 motifs with a cumulative foreground coverage of 72.8% and a cumulative background coverage of 30.1%. Fore Cov (%) is the motif foreground coverage, Fore Num is the number of foreground sequences. Back Cov (%) is the motif background coverage, Back Num is the number of background sequences.

Motif Fore Cov (%) Fore Num Back Cov (%) Back Num

18.6 62.0 9.8 25.0

13.8 46.0 7.8 20.0

10.2 34.0 4.3 11.0

9.0 30.0 3.1 8.0

8.1 27.0 3.9 10.0

6.9 23.0 2.0 5.0

6.6 22.0 1.2 3.0

6.6 22.0 3.1 8.0

5.7 19.0 2.0 5.0

5.4 18.0 1.6 4.0

5.1 17.0 0.0 0.0

Figure 5.4: Motif solutions selected by the discriminative motif selection method. Each solution consists of a set of motifs. The sequence coverage is the cumulative coverage of all the motifs in the solution. The chart shows the foreground coverage (%), background coverage (%), and number of motifs for solutions A to J, with Ωb = 10% and Ωf = 10% per motif.

5.2.3.2 Motif selection for motifs with Ωb = 10 and Ωf = 10

Using Ωb = 10 and Ωf = 10, there were 29 discovered motifs. Applying motif selection to this set of motifs, 11 motif solutions were found. The motif solutions are shown in figure 5.4. The maximum cumulative foreground coverage was 75.1% and the minimum cumulative background coverage was 9.8%. Compared to the solutions presented in the previous section with Ωb = 10 and Ωf = 5, two solutions have cumulative background coverage > 40%, which is considered high. If the analyst has a preference for foreground coverage over background coverage, then solution A from figure 5.4 fits such requirements. Solution A (shown in table 5.5) has 10 motifs with a cumulative foreground coverage of 75.4% and a cumulative background coverage of 46.9%. Each motif in solution A has a minimum foreground coverage of 10%.

Table 5.5: Motif solution A selected from figure 5.4. This solution consists of 10 motifs with a cumulative foreground coverage of 75.1% and a cumulative background coverage of 46.1%. Each motif has a minimum foreground coverage of 10%. Fore Cov (%) is the motif foreground coverage, Fore Num is the number of foreground sequences. Back Cov (%) is the motif background coverage, Back Num is the number of background sequences.

Motif Fore Cov (%) Fore Num Back Cov (%) Back Num

14.4 48.0 7.4 19.0

12.6 42.0 6.6 17.0

12.6 42.0 7.4 19.0

12.3 41.0 7.0 18.0

12.0 40.0 6.3 16.0

12.0 40.0 7.8 20.0

12.0 40.0 7.4 19.0

12.0 40.0 9.8 25.0

10.8 36.0 7.8 20.0

10.2 34.0 6.6 17.0

Figure 5.5: Motif solutions selected by the discriminative motif selection method. Each solution consists of a set of motifs. The sequence coverage is the cumulative coverage of all the motifs in the solution. The chart shows the foreground coverage (%), background coverage (%), and number of motifs for solutions A to R, with Ωb = 20% and Ωf = 5% per motif.

5.2.3.3 Motif selection for motifs with Ωb = 20 and Ωf = 5

Using Ωb = 20 and Ωf = 5, the maximum background coverage is relaxed to 20%, which results in 345 discovered motifs. Applying motif selection to this set of motifs, 18 motif solutions were found as shown in figure 5.5. The maximum cumulative foreground coverage was 83.2% and the maximum cumulative background coverage was 40.2%. Relaxing Ωb to 20 results in more motifs to select from, some of which have high foreground coverage. This helps in increasing the cumulative foreground coverage. Solution I with 10 motifs is shown in table 5.6. The maximum motif foreground coverage in solution I is 22.5% and the maximum motif background coverage is 14.1%.

Table 5.6: Motif solution I selected from figure 5.5. This solution consists of 10 motifs with a cumulative foreground coverage of 70.4% and cumulative background coverage of 31.6%. Fore Cov (%) is the motif foreground coverage, Fore Num is the number of foreground sequences. Back Cov (%) is the motif background coverage, Back Num is the number of background sequences.

Motif Fore Cov (%) Fore Num Back Cov (%) Back Num

22.5 75.0 14.1 36.0

18.6 62.0 9.8 25.0

12.0 40.0 5.5 14.0

9.0 30.0 3.1 8.0

8.1 27.0 3.5 9.0

6.9 23.0 2.0 5.0

6.9 23.0 2.3 6.0

6.3 21.0 3.9 10.0

5.4 18.0 1.6 4.0

5.1 17.0 0.0 0.0

5.2.3.4 Motif selection for motifs with Ωb = 20 and Ωf = 10

Using Ωb = 20 and Ωf = 10, there were 192 discovered motifs. Applying motif selection to this set of motifs, 8 motif solutions were found. The motif solutions are shown in figure 5.6. The solutions are characterized by a smaller number of motifs selected per motif solution, where the maximum number of motifs was 8. Solution D with 5 motifs is shown in table 5.7. This solution has a cumulative foreground coverage of 62.6% and a cumulative background coverage of 35.9%.

Figure 5.6: Motif solutions selected by the discriminative motif selection method. Each solution consists of a set of motifs. The sequence coverage is the cumulative coverage of all the motifs in the solution. The chart shows the foreground coverage (%), background coverage (%), and number of motifs for solutions A to H, with Ωb = 20% and Ωf = 10% per motif.

Table 5.7: Motif solution D selected from figure 5.6. This solution consists of 5 motifs with a cumulative foreground coverage of 62.6% and a cumulative background coverage of 35.9%. Fore Cov (%) is the motif foreground coverage, Fore Num is the number of foreground sequences. Back Cov (%) is the motif background coverage, Back Num is the number of background sequences.

Motif Fore Cov (%) Fore Num Back Cov (%) Back Num

21.3 71.0 12.9 33.0

21.3 71.0 11.3 29.0

18.6 62.0 9.8 25.0

12.6 42.0 6.6 17.0

12.0 40.0 5.5 14.0

5.3 Methods

5.3.1 Motif discovery

Motif discovery was performed using an ensemble motif discovery tool (GimmeMotifs [34]) and two discriminative motif discovery tools: DME [82] and DECOD [17]. The motif discovery tools used in GimmeMotifs were: Weeder [12], MEME [15], Trawler [25], MDmodule [37], BioProspector [28], Improbizer [36], Homer [86], and AMD [87], using their default parameters and the large analysis option where motifs of length 6 to 15 were searched. DME and DECOD were used to predict motifs of lengths 8, 10, 12, and 14. The maximum number of motifs predicted per motif length for DME was 200. For DECOD, 20 motifs were predicted per motif length. The total number of motifs predicted was 932 where GimmeMotifs predicted 222 motifs, DME predicted 690 motifs, and DECOD predicted 80 motifs. The number of motifs with occurrences in the L3 promoter regions was 871 motifs, found using FIMO [2] with a p-value of 0.0001. All the 871 motifs predicted were used as input to the discriminative motif selection method described in chapter 3 to select the smallest set of significant motifs.

5.3.2 Motif filtering

To reduce the number of motifs input to the motif selection algorithm, the discovered motifs were filtered based on their foreground and background coverage. Table 5.2 shows the number of motifs that passed the different background and foreground thresholds Ωb and Ωf.

5.3.3 Motif selection

Motif selection was performed using the MOEA method introduced in chapter 3. The penalty value α was set to 0.5 and the number of iterations was set to 100,000. ∆ was set to zero, since the individual motifs did not have very high foreground coverage. Setting ∆ to 5% resulted in solutions where the maximum cumulative foreground coverage was around 40%. The α and ∆ values were relaxed because the predicted motifs have a low foreground to background coverage ratio, as shown in figure 5.2.

5.4 Conclusions

The control of gene transcription initiation and gene regulation are complex biological mechanisms that involve transcription factors (TFs) and transcription factor binding sites (TFBSs). In B. malayi, these binding sites have not been characterized. Motif discovery followed by motif selection was performed across the promoter regions of genes activated during the infective stage of the worm (L3) to identify potential regulatory element binding sites. Using motif selection, a set of significant motifs instead of a single motif was identified. Identifying the B. malayi stage specific regulatory elements will help in developing intervention strategies for the control of lymphatic filariasis (LF). Moreover, characterizing the regulatory elements will provide an improved understanding of the development of B. malayi.

6 Identification of Tissue Specific Regulatory Elements in Hydroxyproline-Rich Glycoprotein (HRGP) Genes in Arabidopsis thaliana

6.1 Introduction

Hydroxyproline-rich glycoproteins (HRGPs) are plant cell wall proteins with functions in plant growth and development. HRGPs are categorized into three groups based on glycosylation levels: hyperglycosylated arabinogalactan proteins (AGPs), moderately glycosylated extensins (EXTs), and lightly glycosylated proline-rich proteins (PRPs) [88]. In total, 166 HRGP genes were identified and categorized as 85 AGPs, 63 EXTs, and 18 PRPs [88]. In this chapter, putative regulatory element binding sites in HRGP genes that are expressed solely in the pollen tissue (pollen specific) in Arabidopsis thaliana are identified using motif discovery and selection. Identifying unique regulatory elements which are employed in a specific tissue will help in understanding the mechanisms of tissue specific expression. Additionally, identifying the regulatory elements will help in controlling the expression of HRGP genes. Overexpression of HRGP genes has been linked to the defense mechanism used by plants against pathogen attacks [89] [90]. Underexpression of HRGP genes allows easier cellulose extraction from the plant cell wall, which may benefit the biofuel industry [91] [90].

6.2 Results and discussion

To identify unique regulatory element binding sites controlling HRGP genes in the Arabidopsis thaliana pollen tissue, motif discovery and motif selection were performed. The promoter regions of HRGP genes expressed solely in the pollen tissue are referred to as the foreground data set, and the promoter regions of HRGP genes not expressed in pollen are referred to as the background data set. Motif prediction was performed using an ensemble of motif discovery tools. Since motif discovery tools produce a large list of candidate motifs, discriminative motif selection was used to identify the most significant motifs which have high sequence coverage in the foreground data set relative to the background data set. The objective is selecting a set of motifs (referred to as a motif solution) characterized by the following: 1) high sequence coverage in the foreground data set (HRGP pollen specific genes), 2) low sequence coverage in the background data set (HRGP non pollen), and 3) a small number of motifs (small motif solution size). Since the motif selection problem consists of multiple conflicting objectives, the problem was formulated as a multi-objective optimization problem (MOP) and solved using a multi-objective evolutionary algorithm (MOEA) as discussed in chapter 3. Solving MOPs produces a set of optimal solutions, referred to as Pareto optimal solutions, instead of a single solution. In the following sections, a motif solution refers to a subset of motifs selected from a list of predicted motifs.
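
The notion of Pareto optimality over the three objectives listed above can be illustrated with a short sketch. This is not the MOEA of chapter 3, which relies on a cost function that is not reproduced here; it is a generic dominance check. The `Solution` type, the solution names, and all coverage values are hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple


class Solution(NamedTuple):
    name: str
    fore_cov: float  # cumulative foreground coverage (%), to be maximized
    back_cov: float  # cumulative background coverage (%), to be minimized
    n_motifs: int    # number of motifs in the solution, to be minimized


def dominates(a: Solution, b: Solution) -> bool:
    """True if a is at least as good as b on every objective and strictly better on one."""
    no_worse = (a.fore_cov >= b.fore_cov
                and a.back_cov <= b.back_cov
                and a.n_motifs <= b.n_motifs)
    strictly_better = (a.fore_cov > b.fore_cov
                       or a.back_cov < b.back_cov
                       or a.n_motifs < b.n_motifs)
    return no_worse and strictly_better


def pareto_front(solutions: List[Solution]) -> List[Solution]:
    """Keep the solutions that no other solution dominates."""
    return [s for s in solutions
            if not any(dominates(other, s) for other in solutions)]


# Hypothetical candidate solutions (illustrative values only).
candidates = [
    Solution("P", 100.0, 5.0, 4),
    Solution("Q", 75.0, 13.0, 1),
    Solution("R", 95.0, 5.0, 3),
    Solution("S", 70.0, 15.0, 2),  # worse than Q on all three objectives
]
front = pareto_front(candidates)
print([s.name for s in front])
```

In this toy input, P, Q, and R trade coverage against solution size and none dominates another, so all three would be offered to the analyst, while S is discarded.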

6.2.1 The foreground and background data sets

There were 19 HRGP genes overexpressed only in the pollen tissue in Arabidopsis thaliana and 86 HRGP genes not expressed in pollen. The expression of HRGP genes was determined using the eFP browser [92] [90]. The promoter regions of the genes were downloaded from the TAIR database [93] [90]; each promoter region was 1000 bases long.

6.2.2 Motif prediction

Using an ensemble of motif discovery tools (explained under methods), 849 motifs were predicted. The foreground sequence coverage of the predicted motifs ranges from 5.3% to 100%. The background sequence coverage ranges from 0% to 88.4%. Figure 6.1 shows the foreground and background sequence coverage of all the predicted motifs. The majority of predicted motifs have higher foreground coverage than background coverage.

Table 6.1 shows the number of motifs found with different maximum background coverage thresholds. There were 32 motifs with zero background coverage, and the maximum foreground coverage among those 32 motifs was 26.3% (5 sequences). Five motifs had 100% foreground coverage (table 6.2), and all five are repeats of the same DNA base. Although these five motifs cover 100% of the foreground, their background coverage is high (60.5% to 68.6%). The objective is identifying motifs which are unique to the pollen tissue relative to the non-pollen tissue. The following sections illustrate how motifs unique to the pollen tissue can be selected from the large list of candidate motifs using the discriminative motif selection method introduced in chapter 3.

[Plot for figure 6.1: "Sequence coverage for all discovered motifs"; y-axis: sequence coverage (%); x-axis: discovered motifs; series: foreground sequence coverage, background sequence coverage.]

Figure 6.1: Foreground and background coverage of all 849 discovered motifs.

Table 6.1: Number of motifs discovered with maximum background coverage thresholds. For each set of motifs the maximum motif foreground coverage in that set is shown.

Maximum background coverage (%)    Number of motifs    Maximum motif foreground coverage (%)
0                                  32                  26.3
2                                  89                  52.6
5                                  310                 52.6
10                                 484                 57.9
20                                 676                 78.9

Table 6.2: Discovered motifs with 100% foreground coverage.

Motif Background coverage (%)

60.5

60.5

66.3

68.6

68.6

6.2.3 Selecting motifs with zero background coverage

There were 32 discovered motifs with zero background coverage. These motifs represent putative regulatory elements which are unique to the HRGP pollen tissue. The maximum motif foreground coverage among the motifs with zero background coverage was 26.3% and the lowest motif foreground coverage was 5.3%. Motif selection was used to select the smallest set of motifs from the list of 32 motifs which covers the maximum percentage of foreground sequences. Figure 6.2 shows the 8 motif solutions (subsets of motifs) that were produced. Five of the 8 motif solutions have cumulative foreground coverage > 50%. All of these solutions represent putative motifs which are unique to the HRGP pollen tissue since the background coverage is zero. Solution A (motifs shown in table 6.3) consists of 8 motifs which collectively cover 100% of the foreground data. The maximum foreground coverage per motif in solution A was 26.3% (the first and second motifs) and the minimum was 5.3%. None of the motifs is a repeat, and their information content is high. A set of only 8 motifs which covers all the foreground sequences is a manageable set that could be validated in the lab, but the foreground coverage per motif is < 50%. To find individual motifs with higher foreground coverage, the zero background coverage constraint was relaxed and motif selection was re-applied.
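
Cumulative coverage is computed over the union of the sequences covered by the motifs in a solution, so it is generally smaller than the sum of the per-motif coverages. The following minimal sketch illustrates this, assuming each motif is represented by the set of sequence indices in which it occurs. The function name, the motif names, and the hit sets are hypothetical; only the total of 19 foreground sequences is taken from the text.

```python
def cumulative_coverage(solution, motif_hits, n_sequences):
    """Percentage of sequences that contain at least one motif of the solution."""
    covered = set()
    for motif in solution:
        covered |= motif_hits[motif]  # union, so shared sequences count once
    return 100.0 * len(covered) / n_sequences


# Hypothetical hit sets over 19 foreground sequences (indices 0..18).
motif_hits = {
    "motif_1": {0, 1, 2, 3, 4},
    "motif_2": {3, 4, 5, 6, 7},   # shares sequences 3 and 4 with motif_1
    "motif_3": {8, 9, 10, 11},
}
n_foreground = 19

per_motif = {name: round(100.0 * len(hits) / n_foreground, 1)
             for name, hits in motif_hits.items()}
total = round(cumulative_coverage(list(motif_hits), motif_hits, n_foreground), 1)

print(per_motif)  # per-motif coverage in percent
print(total)      # cumulative coverage in percent
```

Here the per-motif coverages add up to 73.7%, but the cumulative coverage is 63.2% because two sequences are covered by both motif_1 and motif_2.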

[Plot panels for figure 6.2: (1) Pareto front, cost versus number of motifs, solutions A to H; (2) sequence coverage (%) and number of motifs for solutions A to H; series: foreground coverage, background coverage, number of motifs.]

Figure 6.2: Motif solutions selected by the discriminative motif selection method for motifs with zero background coverage. The Pareto front shows the possible solutions where the cost value is explained under methods. Each solution consists of a set of motifs. The sequence coverage is the cumulative coverage of all the motifs in the solution.

Table 6.3: Motif solution A selected from figure 6.2. This solution consists of 8 motifs with a cumulative foreground coverage of 100% and zero cumulative background coverage. Fore Cov (%) is the motif foreground coverage, Fore Num is the number of foreground sequences covered by the motif. Back Cov (%) is the motif background coverage, Back Num is the number of background sequences covered by the motif.

Motif Fore Cov (%) Fore Num Back Cov (%) Back Num

26.3 5.0 0.0 0.0

26.3 5.0 0.0 0.0

21.1 4.0 0.0 0.0

15.8 3.0 0.0 0.0

15.8 3.0 0.0 0.0

10.5 2.0 0.0 0.0

10.5 2.0 0.0 0.0

5.3 1.0 0.0 0.0

6.2.4 Filtering discovered motifs

Using the filtering values Ωb and Ωf explained in chapter 3, the 849 predicted motifs could be reduced to a smaller set of motifs based on their individual foreground and background sequence coverage. Motifs with either high background coverage or very low foreground coverage are eliminated, which results in a smaller number of motifs to select from. Table 6.4 shows the number of motifs found using different filtering thresholds Ωb and Ωf. The next section shows the results of motif selection applied to the filtered set of motifs.

Table 6.4: Number of motifs which passed the filtering thresholds Ωb and Ωf. Ωb is the maximum background coverage per motif. Ωf is the minimum foreground coverage per motif.

Ωb (%)    Ωf (%)    Number of motifs    Maximum motif foreground coverage (%)
20        20        317                 78.9
10        20        178                 78.9
5         10        244                 52.6
5         20        85                  52.6

6.2.5 Motif selection for motifs with Ωb = 20% and Ωf = 20%

Using Ωb = 20% and Ωf = 20%, there were 317 discovered motifs. Discriminative motif selection was used to select the smallest set of motifs out of the 317 possible motifs which covers the maximum number of foreground sequences and the minimum number of background sequences. Figure 6.3 shows the motif solutions obtained using motif selection. Four solutions were produced, all with cumulative foreground coverage > 50%. Solution A achieves 100% cumulative foreground coverage using only 4 motifs, and at the same time the cumulative background coverage is very low (4.7%). Solution A motifs are shown in table 6.5. The first selected motif has a foreground coverage of 52.6% and only 1.2% background coverage. All motifs in solution A have a background coverage < 3%, which indicates the uniqueness of the selected motifs to the pollen tissue. Two more solutions (D and C) are shown in tables 6.6 and 6.7. Solution D consists of one motif only with 73.7% foreground coverage and 12.8% background coverage. If the goal is finding a single motif only, then solution D is a viable solution. Solution C consists of 3 motifs where the first two motifs have high foreground coverage. Both of these motifs represent putative motifs that might explain the HRGP gene expression in the pollen tissue.
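
The figures reported for solution A can be checked with simple arithmetic. The sketch below uses only the per-motif counts from table 6.5 and the data set sizes (19 foreground and 86 background sequences); the variable and function names are illustrative. Because the actual sets of covered sequences are not listed, the code derives only lower and upper bounds on the number of distinct sequences covered.

```python
N_FORE, N_BACK = 19, 86       # data set sizes from section 6.2.1
fore_num = [10, 6, 5, 4]      # Fore Num column of table 6.5
back_num = [1, 2, 0, 1]       # Back Num column of table 6.5


def pct(count, total):
    """Coverage as a percentage rounded to one decimal place."""
    return round(100.0 * count / total, 1)


fore_cov = [pct(n, N_FORE) for n in fore_num]
back_cov = [pct(n, N_BACK) for n in back_num]

# Distinct sequences covered by the union of the motifs lie between the
# largest single count and the sum of the counts (capped at the set size).
fore_bounds = (max(fore_num), min(sum(fore_num), N_FORE))
back_bounds = (max(back_num), min(sum(back_num), N_BACK))
back_upper_pct = pct(back_bounds[1], N_BACK)

print(fore_cov, back_cov)
print(fore_bounds, back_bounds, back_upper_pct)
```

The per-motif percentages reproduce the table. The foreground counts sum to 25, which exceeds the 19 foreground sequences, so some sequences are covered by more than one motif. The background upper bound of 4 sequences (4.7%) equals the reported cumulative background coverage, which implies that the four background occurrences fall on four different sequences.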

[Plot panels for figure 6.3: (1) Pareto front, cost versus number of motifs, solutions A to D; (2) "Ωb = 20% and Ωf = 20% per motif. α = 0.6 and ∆ = 5%": sequence coverage (%) and number of motifs for solutions A to D; series: foreground coverage, background coverage, number of motifs.]

Figure 6.3: Motif solutions selected by the discriminative motif selection method applied to 317 motifs which pass the Ωb = 20% and Ωf = 20% filtering thresholds. Each solution consists of a set of motifs. The sequence coverage is the cumulative coverage of all the motifs in the solution.

Table 6.5: Motif solution A selected from figure 6.3. This solution consists of 4 motifs with a cumulative foreground coverage of 100% and cumulative background coverage of 4.7%. Fore Cov (%) is the motif foreground coverage, Fore Num is the number of foreground sequences covered by the motif. Back Cov (%) is the motif background coverage, Back Num is the number of background sequences covered by the motif.

Motif Fore Cov (%) Fore Num Back Cov (%) Back Num

52.6 10.0 1.2 1.0

31.6 6.0 2.3 2.0

26.3 5.0 0.0 0.0

21.1 4.0 1.2 1.0

Table 6.6: Motif solution D selected from figure 6.3. This solution consists of 1 motif with a foreground coverage of 73.7% and background coverage of 12.8%. Fore Cov (%) is the motif foreground coverage, Fore Num is the number of foreground sequences covered by the motif. Back Cov (%) is the motif background coverage, Back Num is the number of background sequences covered by the motif.

Motif Fore Cov (%) Fore Num Back Cov (%) Back Num

73.7 14.0 12.8 11.0

Table 6.7: Motif solution C selected from figure 6.3. This solution consists of 3 motifs with a cumulative foreground coverage of 94.7% and a cumulative background coverage of 4.6%. Fore Cov (%) is the motif foreground coverage, Fore Num is the number of foreground sequences covered by the motif. Back Cov (%) is the motif background coverage, Back Num is the number of background sequences covered by the motif.

Motif Fore Cov (%) Fore Num Back Cov (%) Back Num

52.6 10.0 1.2 1.0

42.1 8.0 2.3 2.0

21.1 4.0 1.2 1.0

6.2.6 Recommendations

A number of motif solutions were presented in the previous sections. The solutions have different properties, and the question is which of them should be recommended for lab validation. The answer depends on the hypothesis or the research question, and providing multiple solutions allows the decision maker (the analyst) to compare them. In this study, the goal is identifying HRGP regulatory elements which are unique to the pollen tissue. If the objective is finding motifs which are exclusive to the pollen tissue, then the zero background coverage motifs are recommended. If the zero background coverage requirement is relaxed, then the motif solutions from section 6.2.5 are recommended. These motifs have higher foreground coverage and at the same time low background coverage.

6.3 Methods

6.3.1 HRGP gene identification

Using the BIO OHIO tool [88], 166 HRGP genes were identified: 85 AGPs, 63 EXTs, and 18 PRPs. The expression levels of the HRGP genes were determined using the eFP browser [92] [90]. The genes were labeled as one of the following: pollen specific, pollen plus, or pollen none. There were 19 HRGP genes which were pollen specific and were used as the foreground data set. The background data set consisted of 86 HRGP genes which were not expressed in pollen (pollen none).

6.3.2 Motif discovery

Motif discovery was performed using an ensemble motif discovery tool (GimmeMotifs [34]) and two discriminative motif discovery tools: DME [82] and DECOD [17]. The motif discovery tools used in GimmeMotifs were Weeder [12], MEME [15], Trawler [25], MDmodule [37], BioProspector [28], Improbizer [36], Homer [86], and AMD [87], run with their default parameters and the large analysis option, where motifs of length 6 to 15 were searched. DME and DECOD were used to predict motifs of lengths 8, 10, 12, and 14. The maximum number of motifs predicted per motif length for DME was 200. For DECOD, 20 motifs were predicted per motif length. The total number of motifs predicted was 966, where GimmeMotifs predicted 213 motifs, DME predicted 673 motifs, and DECOD predicted 80 motifs. Of these, 849 motifs had occurrences in the foreground data set (pollen specific), as determined using FIMO [2] with a p-value threshold of 0.0001. The 849 predicted motifs were used as input to the discriminative motif selection method introduced in chapter 3. The goal is identifying motifs which are unique to the HRGP pollen tissue.

6.3.3 Motif filtering

The list of predicted motifs (849) includes many noisy motifs which either cover a very small percentage of the foreground data or cover a high percentage of the background data. To reduce the list of candidate motifs, the motifs were filtered using two thresholds, Ωb and Ωf. Ωb is the maximum background coverage per motif; any motif with background coverage > Ωb was removed. Ωf is the minimum foreground coverage per motif; any motif with foreground coverage < Ωf was removed. Table 6.4 shows the number of motifs selected using different threshold values.
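
The filtering rule can be written as a one-line predicate. The sketch below is a minimal illustration, assuming each motif is a record with precomputed foreground and background coverage percentages. The function name, the record layout, and the four example motifs are hypothetical.

```python
def filter_motifs(motifs, omega_b, omega_f):
    """Keep motifs with background coverage <= omega_b and foreground coverage >= omega_f.

    Coverages and thresholds are percentages.
    """
    return [m for m in motifs
            if m["back_cov"] <= omega_b and m["fore_cov"] >= omega_f]


# Hypothetical motif records (illustrative values only).
motifs = [
    {"name": "m1", "fore_cov": 52.6, "back_cov": 1.2},
    {"name": "m2", "fore_cov": 100.0, "back_cov": 60.5},  # background too high
    {"name": "m3", "fore_cov": 5.3, "back_cov": 0.0},     # foreground too low
    {"name": "m4", "fore_cov": 21.1, "back_cov": 12.8},
]

kept_20_20 = [m["name"] for m in filter_motifs(motifs, omega_b=20, omega_f=20)]
kept_5_20 = [m["name"] for m in filter_motifs(motifs, omega_b=5, omega_f=20)]
print(kept_20_20, kept_5_20)
```

Tightening Ωb from 20% to 5% removes m4 in this toy example, mirroring how the stricter rows of table 6.4 retain fewer motifs.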

6.3.4 Motif selection

Motif selection was performed using the discriminative motif selection method introduced in chapter 3. Two data sets were provided (HRGP pollen specific and HRGP pollen none) and the objective was to select the most discriminative motifs. The penalty value α was set to 0.6, ∆ was set to 5%, and the number of iterations was set to 100,000. The cost function is defined in chapter 3.

6.4 Conclusions

Plant HRGP genes are involved in plant development and growth. A number of HRGP genes have been identified, and these genes have different expression values depending on the tissue they are expressed in. To understand the mechanisms of tissue specific expression of HRGP genes, motif discovery and discriminative motif selection were used to identify regulatory element binding sites unique to the pollen tissue. A set of 32 motifs found only in the pollen specific genes was reported. These represent unique regulatory elements which might explain the expression of HRGP genes in pollen. Since validating a large number of motifs is expensive and cumbersome, discriminative motif selection was used to select the smallest subset of motifs with high sequence coverage. One subset of discovered motifs consisted of 4 motifs which covered 100% of the foreground sequences and had a low background coverage (4.7%). Producing a small subset of motifs will help in reducing the cost of lab validation and speed up the process of biological validation.

7 Conclusions and Future Work

Motif discovery is a well studied problem in the field of computational biology. Several motif discovery methods have been implemented, but sensitivity in recovering known motifs remains a challenge. With new technologies such as ChIP-seq, the accuracy of discovering motifs has increased, but new challenges have emerged as well. Based on an assessment of several motif discovery tools, it is recommended to use an ensemble of motif discovery tools instead of a single tool. This increases sensitivity but reduces specificity, since too many motifs are produced, and new methods are needed to select the most significant motifs from the list of candidates. The motif selection problem was introduced to address this need. It attempts to select the smallest set of motifs with high sequence coverage, which allows easier and cheaper validation in the lab.

Two motif selection methods were introduced based on coverage based heuristics. The first method focuses on selecting motifs in cases where no background data is provided. Three algorithms were introduced: greedy, relaxed integer linear programming (RILP), and bounded exact search. The algorithms were evaluated using ENCODE ChIP-seq data and compared to two existing motif selection methods. Based on the evaluation, the greedy method was recommended since it produces the smallest set of motifs with high sequence coverage. To consider the background (negative) data in motif selection, the discriminative motif selection problem was introduced. It was formulated as a multi-objective optimization problem (MOP) and solved using a multi-objective evolutionary algorithm (MOEA). The discriminative motif selection method was evaluated using the ENCODE ChIP-seq data and compared to all the previously introduced methods. It produced a small set of motifs with higher specificity.
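
The coverage based greedy idea summarized above can be sketched with the textbook greedy heuristic for (partial) set cover. This is an illustration under stated assumptions, not necessarily the exact algorithm evaluated in this dissertation. The function name, the `target` parameter, and the motif hit sets are hypothetical.

```python
def greedy_select(motif_hits, n_sequences, target=1.0):
    """Repeatedly pick the motif that covers the most still-uncovered sequences.

    motif_hits maps a motif name to the set of sequence indices it occurs in.
    target is the fraction of sequences to cover (1.0 gives the basic set
    cover setting, smaller values give the partial set cover setting).
    """
    covered, chosen = set(), []
    remaining = dict(motif_hits)
    while len(covered) < target * n_sequences and remaining:
        best = max(remaining, key=lambda m: len(remaining[m] - covered))
        gain = remaining.pop(best) - covered
        if not gain:  # no remaining motif adds new sequences
            break
        chosen.append(best)
        covered |= gain
    return chosen, covered


# Hypothetical hit sets over 6 sequences (indices 0..5).
motif_hits = {
    "m1": {0, 1, 2},
    "m2": {2, 3},
    "m3": {3, 4, 5},
    "m4": {0, 5},
}
chosen, covered = greedy_select(motif_hits, n_sequences=6)
print(chosen, sorted(covered))
```

In this example two motifs (m1 and m3) suffice to cover all six sequences, so m2 and m4 are never selected.
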
The proposed methods were applied to two case studies: 1) the identification of stage specific regulatory elements in Brugia malayi, and 2) the identification of tissue specific regulatory elements in hydroxyproline-rich glycoprotein (HRGP) genes in Arabidopsis thaliana. The two motif selection methods introduced in this dissertation could be extended in several directions:

• The motif selection problem was formulated as the basic set cover problem and the partial set cover problem. Other variations of the set cover problem exist as well. Examples include the set multi-cover problem, where sequences have to be covered a minimum number of times.

• To discover other motifs which act as co-factors, the depth of coverage principle could be employed such that after the first set of significant solutions has been found, the search can be performed again to find other significant motifs.

• The algorithms introduced have a number of parameters which have to be set to obtain the best results. Automating the selection of these parameters would help analysts and reduce the time spent running multiple jobs to find suitable values.

• The motifs input to the motif selection methods could be clustered before applying motif selection. This will increase the quality of motifs and increase the coverage of single motifs.

• In the discriminative motif selection method, only one evolutionary algorithm was used. Other evolutionary algorithms could be explored and added to the method, which would increase the options available to users.

References

[1] P. Kheradpour and M. Kellis, “Systematic discovery and characterization of regulatory motifs in ENCODE TF binding experiments.” Nucleic acids research, pp. 1–12, Dec. 2013.

[2] C. E. Grant, T. L. Bailey, and W. S. Noble, “FIMO: scanning for occurrences of a given motif.” Bioinformatics, vol. 27, no. 7, pp. 1017–8, Apr. 2011.

[3] M. Caramia and P. Dell’Olmo, Multi-objective management in freight logistics: Increasing capacity, service level and safety with optimization algorithms. Springer Science & Business Media, 2008.

[4] M. A. Schaub, A. P. Boyle, A. Kundaje, S. Batzoglou, and M. Snyder, “Linking disease associations with regulatory information in the human genome,” Genome research, vol. 22, no. 9, pp. 1748–1759, 2012.

[5] M. Slattery, T. Zhou, L. Yang, A. C. D. Machado, R. Gordân, and R. Rohs, “Absence of a simple code: how transcription factors read the genome,” Trends in biochemical sciences, vol. 39, no. 9, pp. 381–399, 2014.

[6] A. Mathelier, W. Shi, and W. W. Wasserman, “Identification of altered cis-regulatory elements in human disease,” Trends in Genetics, vol. 31, no. 2, pp. 67–76, 2015.

[7] T. Siggers and R. Gordân, “Protein–DNA binding: complexities and multi-protein codes,” Nucleic acids research, p. gkt1112, 2013.

[8] I. V. Kulakovskiy and V. J. Makeev, “Motif discovery and motif finding in ChIP-Seq data,” Genome Analysis: Current Procedures and Applications, p. 83, 2014.

[9] G. Badis, M. F. Berger, A. A. Philippakis, S. Talukder, A. R. Gehrke, S. A. Jaeger, E. T. Chan, G. Metzler, A. Vedenko, X. Chen et al., “Diversity and complexity in DNA recognition by transcription factors,” Science, vol. 324, no. 5935, pp. 1720–1723, 2009.

[10] Z. Dai, D. Guo, X. Dai, and Y. Xiong, “Genome-wide analysis of transcription factor binding sites and their characteristic DNA structures,” BMC genomics, vol. 16, no. Suppl 3, p. S8, 2015.

[11] S. Weingarten-Gabbay and E. Segal, “The grammar of transcriptional regulation,” Human genetics, vol. 133, no. 6, pp. 701–711, 2014.

[12] G. Pavesi, G. Mauri, and G. Pesole, “An algorithm for finding signals of unknown length in DNA sequences,” Bioinformatics, vol. 17, no. suppl 1, pp. S207–S214, 2001.

[13] F. Zambelli, G. Pesole, and G. Pavesi, “Motif discovery and transcription factor binding sites before and after the next-generation sequencing era.” Briefings in bioinformatics, vol. 14, no. 2, pp. 225–37, Mar. 2013.

[14] H. Hartmann, E. W. Guthöhrlein, M. Siebert, S. Luehr, and J. Söding, “P-value-based regulatory motif discovery using positional weight matrices.” Genome research, vol. 23, no. 1, pp. 181–94, Jan. 2013.

[15] T. L. Bailey and C. Elkan, “Fitting a mixture model by expectation maximization to discover motifs in biopolymers.” Proceedings of International Conference on Intelligent Systems for Molecular Biology; ISMB. International Conference on Intelligent Systems for Molecular Biology, vol. 2, pp. 28–36, Jan. 1994.

[16] S. Sinha, “Discriminative motifs.” Journal of computational biology, vol. 10, no. 3-4, pp. 599–615, Jan. 2003.

[17] P. Huggins, S. Zhong, I. Shiff, R. Beckerman, O. Laptenko, C. Prives, M. H. Schulz, I. Simon, and Z. Bar-Joseph, “DECOD: Fast and Accurate Discriminative DNA Motif Finding.” Bioinformatics, vol. 27, no. 17, pp. 2361–2367, Jul. 2011.

[18] E. Redhead and T. L. Bailey, “Discriminative motif discovery in DNA and protein sequences using the DEME algorithm.” BMC bioinformatics, vol. 8, p. 385, Jan. 2007.

[19] A. D. Smith, P. Sumazin, and M. Q. Zhang, “Identifying tissue-selective transcription factor binding sites in vertebrate promoters.” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 5, pp. 1560–5, Feb. 2005.

[20] M. Tompa, N. Li, T. L. Bailey, G. M. Church, B. De Moor, E. Eskin, A. V. Favorov, M. C. Frith, Y. Fu, W. J. Kent et al., “Assessing computational tools for the discovery of transcription factor binding sites,” Nature biotechnology, vol. 23, no. 1, pp. 137–144, 2005.

[21] A. Chakravarty, J. M. Carlson, R. S. Khetani, and R. H. Gross, “A novel ensemble learning method for de novo computational identification of DNA binding sites,” BMC bioinformatics, vol. 8, no. 1, p. 249, 2007.

[22] V. Martyanov and R. H. Gross, “Using SCOPE to identify potential regulatory motifs in coregulated genes.” Journal of visualized experiments: JoVE, pp. 1–7, 2011.

[23] F. P. Roth, J. D. Hughes, P. W. Estep, and G. M. Church, “Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation,” Nature biotechnology, vol. 16, no. 10, pp. 939–945, 1998.

[24] X. S. Liu, D. L. Brutlag, and J. S. Liu, “An algorithm for finding protein–DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments,” Nature biotechnology, vol. 20, no. 8, pp. 835–839, 2002.

[25] L. Ettwiller, B. Paten, M. Ramialison, E. Birney, and J. Wittbrodt, “Trawler: de novo regulatory motif discovery pipeline for chromatin immunoprecipitation,” Nature Methods, vol. 4, no. 7, pp. 563–565, 2007.

[26] K. Klepper and F. Drabløs, “MotifLab: a tools and data integration workbench for motif discovery and regulatory sequence analysis.” BMC bioinformatics, vol. 14, p. 9, 2013.

[27] J. Hu, Y. D. Yang, and D. Kihara, “EMD: an ensemble algorithm for discovering regulatory motifs in DNA sequences,” BMC bioinformatics, vol. 7, no. 1, p. 342, 2006.

[28] X. Liu, D. L. Brutlag, J. S. Liu et al., “BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes.” in Pacific symposium on biocomputing, vol. 6, no. 2001, 2001, pp. 127–138.

[29] G. Thijs, K. Marchal, M. Lescot, S. Rombauts, B. De Moor, P. Rouzé, and Y. Moreau, “A Gibbs sampling method to detect overrepresented motifs in the upstream regions of coexpressed genes,” Journal of Computational Biology, vol. 9, no. 2, pp. 447–464, 2002.

[30] V. X. Jin, J. Apostolos, N. S. V. R. Nagisetty, and P. J. Farnham, “W-ChIPMotifs: a web application tool for de novo motif discovery from ChIP-based high-throughput data,” Bioinformatics, vol. 25, no. 23, pp. 3191–3193, 2009.

[31] L. S. Hon and A. N. Jain, “A deterministic motif finding algorithm with application to the human genome,” Bioinformatics, vol. 22, no. 9, pp. 1047–1054, 2006.

[32] L. Kuttippurathu, M. Hsing, Y. Liu, B. Schmidt, D. L. Maskell, K. Lee, A. He, W. T. Pu, and S. W. Kong, “CompleteMOTIFs: DNA motif discovery platform for transcription factor binding experiments,” Bioinformatics, vol. 27, no. 5, pp. 715–717, 2011.

[33] I. V. Kulakovskiy, V. Boeva, A. V. Favorov, and V. Makeev, “Deep and wide digging for binding motifs in ChIP-Seq data,” Bioinformatics, vol. 26, no. 20, pp. 2622–2623, 2010.

[34] S. J. van Heeringen and G. J. C. Veenstra, “GimmeMotifs: a de novo motif prediction pipeline for ChIP-sequencing experiments,” Bioinformatics, vol. 27, no. 2, pp. 270–271, 2011.

[35] L. Li, “GADEM: a genetic algorithm guided formation of spaced dyads coupled with an EM algorithm for motif discovery,” Journal of Computational Biology, vol. 16, no. 2, pp. 317–329, 2009.

[36] W. Ao, J. Gaudet, W. J. Kent, S. Muttumu, and S. E. Mango, “Environmentally induced foregut remodeling by PHA-4/FoxA and DAF-12/NHR,” Science, vol. 305, no. 5691, pp. 1743–1746, 2004.

[37] X. S. Liu, D. L. Brutlag, and J. S. Liu, “An algorithm for finding protein–DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments,” Nature biotechnology, vol. 20, no. 8, pp. 835–839, 2002.

[38] E. Valen, A. Sandelin, O. Winther, and A. Krogh, “Discovery of regulatory elements is improved by a discriminatory approach,” PLoS computational biology, vol. 5, no. 11, p. e1000562, 2009.

[39] E. Wijaya, S.-M. Yiu, N. T. Son, R. Kanagasabai, and W.-K. Sung, “MotifVoter: a novel ensemble method for fine-grained integration of generic motif finders,” Bioinformatics, vol. 24, no. 20, pp. 2288–2295, 2008.

[40] E. Eskin and P. A. Pevzner, “Finding composite regulatory patterns in DNA sequences,” Bioinformatics, vol. 18, no. suppl 1, pp. S354–S363, 2002.

[41] E. Wijaya, K. Rajaraman, S.-M. Yiu, and W.-K. Sung, “Detection of generic spaced motifs using submotif pattern mining,” Bioinformatics, vol. 23, no. 12, pp. 1476–1485, 2007.

[42] C. Workman and G. Stormo, “ANN-Spec: a method for discovering transcription factor binding sites with improved specificity,” in Pac Symp Biocomput, vol. 5, 2000, pp. 464–475.

[43] R. C. Edgar, “MUSCLE: multiple sequence alignment with high accuracy and high throughput,” Nucleic acids research, vol. 32, no. 5, pp. 1792–1797, 2004.

[44] The ENCODE Consortium, “An integrated encyclopedia of DNA elements in the human genome.” Nature, vol. 489, no. 7414, pp. 57–74, Sep. 2012.

[45] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, 1st ed. W. H. Freeman, 1979.

[46] GLPK GNU linear programming kit. [Online]. Available: http://www.gnu.org/software/glpk/

[47] V. V. Vazirani, Approximation Algorithms. New York, NY, USA: Springer-Verlag New York, Inc., 2001.

[48] S. Gupta, J. A. Stamatoyannopoulos, T. L. Bailey, and W. S. Noble, “Quantifying similarity between motifs.” Genome biology, vol. 8, no. 2, p. R24, Jan. 2007.

[49] C.-C. Lee, W.-S. Chen, C.-C. Chen, L.-L. Chen, Y.-S. Lin, C.-S. Fan, and T.-S. Huang, “TCF12 protein functions as transcriptional repressor of E-cadherin, and its overexpression is correlated with metastasis of colorectal cancer,” Journal of Biological Chemistry, vol. 287, no. 4, pp. 2798–2809, 2012.

[50] J.-S. Hu, E. Olson, and R. Kingston, “HEB, a helix-loop-helix protein related to E2A and ITF2 that can modulate the DNA-binding ability of myogenic regulatory factors,” Molecular and cellular biology, vol. 12, no. 3, pp. 1031–1042, 1992.

[51] Y. Zhang, J. Babin, A. L. Feldhaus, H. Singh, P. A. Sharp, and M. Bina, “HTF4: a new human helix-loop-helix protein,” Nucleic acids research, vol. 19, no. 16, p. 4555, 1991.

[52] A. P. Boyle, E. L. Hong, M. Hariharan, Y. Cheng, M. A. Schaub, M. Kasowski, K. J. Karczewski, J. Park, B. C. Hitz, S. Weng et al., “Annotation of functional variation in personal genomes using RegulomeDB,” Genome research, vol. 22, no. 9, pp. 1790–1797, 2012.

[53] T. Burdett, P. Hall, E. Hasting, L. Hindorff, H. Junkins, A. Klemm, J. MacArthur, T. Manolio, J. Morales, H. Parkinson, and D. Welter, “The NHGRI-EBI Catalog of published genome-wide association studies.” www.ebi.ac.uk/gwas, accessed: 2015-06-25.

[54] C. C. Coello, G. B. Lamont, and D. A. Van Veldhuizen, Evolutionary algorithms for solving multi-objective problems. Springer Science & Business Media, 2007.

[55] A. Konak, D. W. Coit, and A. E. Smith, “Multi-objective optimization using genetic algorithms: A tutorial,” Reliability Engineering & System Safety, vol. 91, no. 9, pp. 992–1007, 2006.

[56] C. A. C. Coello, “Multi-objective evolutionary algorithms in real-world applications: Some recent results and current challenges,” in Advances in Evolutionary and Deterministic Methods for Design, Optimization and Control in Engineering and Sciences. Springer, 2015, pp. 3–18.

[57] T. Bäck, Evolutionary algorithms in theory and practice. Oxford Univ. Press, 1996.

[58] A. L. Jaimes and C. A. C. Coello, “Many-objective problems: Challenges and methods,” in Springer Handbook of Computational Intelligence. Springer, 2015, pp. 1033–1046.

[59] E. Zitzler, K. Deb, and L. Thiele, “Comparison of multiobjective evolutionary algorithms: Empirical results,” Evolutionary computation, vol. 8, no. 2, pp. 173–195, 2000.

[60] M. Bhuvaneswari and G. Subashini, “Introduction to multi-objective evolutionary algorithms,” in Application of Evolutionary Algorithms for Multi-objective Optimization in VLSI and Embedded Systems. Springer, 2015, pp. 1–20.

[61] L. Jain, R. Goldberg, and A. Abraham, Evolutionary multiobjective optimization: theoretical advances and applications. Springer, 2005.

[62] J. E. Beasley and P. C. Chu, “A genetic algorithm for the set covering problem,” European Journal of Operational Research, vol. 94, no. 2, pp. 392–404, 1996.

[63] U. Aickelin, “An indirect genetic algorithm for set covering problems,” Journal of the Operational Research Society, pp. 1118–1126, 2002.

[64] K. Al-Sultan, M. Hussain, and J. Nizami, “A genetic algorithm for the set covering problem,” Journal of the Operational Research Society, pp. 702–709, 1996.

[65] G. R. Zavala, A. J. Nebro, F. Luna, and C. A. C. Coello, “A survey of multi-objective metaheuristics applied to structural optimization,” Structural and Multidisciplinary Optimization, vol. 49, no. 4, pp. 537–558, 2014.

[66] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: NSGA-II,” Evolutionary Computation, IEEE Transactions on, vol. 6, no. 2, pp. 182–197, 2002.

[67] A. Seshadri, “A fast elitist multiobjective genetic algorithm: NSGA-II,” MATLAB Central, File Exchange, MathWorks, 2007.

[68] D. Hadka and P. Reed, “Diagnostic assessment of search controls and failure modes in many-objective evolutionary optimization,” Evolutionary Computation, vol. 20, no. 3, pp. 423–452, 2012.

[69] K. Deb and A. Kumar, “Light beam search based multi-objective optimization using evolutionary algorithms,” in Evolutionary Computation, 2007. CEC 2007. IEEE Congress on. IEEE, 2007, pp. 2125–2132.

[70] S. Sudeng and N. Wattanapongsakorn, “Post Pareto-optimal pruning algorithm for multiple objective optimization using specific extended angle dominance,” Engineering Applications of Artificial Intelligence, vol. 38, pp. 221–236, 2015.

[71] P. Miettinen, “On the positive–negative partial set cover problem,” Information Processing Letters, vol. 108, no. 4, pp. 219–221, 2008.

[72] ——, “Matrix decomposition methods for data mining: Computational complexity and algorithms,” Ph.D. dissertation, University of Helsinki, 2009.

[73] K. Pang, Y.-W. Wan, W. T. Choi, L. A. Donehower, J. Sun, D. Pant, and Z. Liu, “Combinatorial therapy discovery using mixed integer linear programming,” Bioinformatics, p. btu046, 2014.

[74] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene selection for cancer classification using support vector machines,” Machine learning, vol. 46, no. 1-3, pp. 389–422, 2002.

[75] R. D. Carr, S. Doddi, G. Konjevod, and M. V. Marathe, “On the red-blue set cover problem,” in SODA. Citeseer, 2000, pp. 345–353.

[76] Q.-S. Hua, D. Yu, F. C. Lau, and Y. Wang, “Exact algorithms for set multicover and multiset multicover problems,” in Algorithms and Computation. Springer, 2009, pp. 34–44.

[77] N. Bansal and K. Pruhs, “Weighted geometric set multi-cover via quasi-uniform sampling,” in Algorithms–ESA 2012. Springer, 2012, pp. 145–156.

[78] P. Berman, B. DasGupta, and E. Sontag, “Randomized approximation algorithms for set multicover problems with applications to reverse engineering of protein and gene networks,” Discrete Applied Mathematics, vol. 155, no. 6, pp. 733–749, 2007.

[79] T. Elomaa and J. Kujala, “Covering analysis of the greedy algorithm for partial cover,” in Algorithms and Applications. Springer, 2010, pp. 102–113.

[80] T. Fujito and H. Kurahashi, “A better-than-greedy algorithm for k-set multicover,” in Approximation and Online Algorithms. Springer, 2006, pp. 176–189.

[81] Z. He, C. Yang, and W. Yu, “A partial set covering model for protein mixture identification using mass spectrometry data,” IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), vol. 8, no. 2, pp. 368–380, 2011.

[82] A. D. Smith, P. Sumazin, and M. Q. Zhang, “Identifying tissue-selective transcription factor binding sites in vertebrate promoters,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 5, pp. 1560–1565, 2005.

[83] “CDC,” http://www.cdc.gov/parasites/lymphaticfilariasis/epi.html, accessed: July-23-2015.

[84] E. Ghedin, S. Wang, D. Spiro, E. Caler, Q. Zhao, J. Crabtree, J. E. Allen, A. L. Delcher, D. B. Guiliano, D. Miranda-Saavedra et al., “Draft genome of the filarial nematode parasite Brugia malayi,” Science, vol. 317, no. 5845, pp. 1756–1760, 2007.

[85] Y.-J. Choi, E. Ghedin, M. Berriman, J. McQuillan, N. Holroyd, G. F. Mayhew, B. M. Christensen, and M. L. Michalski, “A deep sequencing approach to comparatively analyze the transcriptome of lifecycle stages of the filarial worm, Brugia malayi,” PLoS Negl Trop Dis, vol. 5, no. 12, p. e1409, 2011.

[86] S. Heinz, C. Benner, N. Spann, E. Bertolino, Y. C. Lin, P. Laslo, J. X. Cheng, C. Murre, H. Singh, and C. K. Glass, “Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities,” Molecular cell, vol. 38, no. 4, pp. 576–589, 2010.

[87] J. Shi, W. Yang, M. Chen, Y. Du, J. Zhang, and K. Wang, “AMD, an automated motif discovery tool using stepwise refinement of gapped consensuses,” PloS one, vol. 6, no. 9, 2011.

[88] A. M. Showalter, B. D. Keppler, J. Lichtenberg, D. Gu, and L. R. Welch, “A bioinformatics approach to the identification, classification, and analysis of hydroxyproline-rich glycoproteins,” Plant Physiology, pp. pp–110, 2010.

[89] S. Deepak, S. Shailasree, R. K. Kini, B. Hause, S. H. Shetty, and A. Mithöfer, “Role of hydroxyproline-rich glycoproteins in resistance of pearl millet against downy mildew pathogen Sclerospora graminicola,” Planta, vol. 226, no. 2, pp. 323–333, 2007.

[90] R. A. Wolfe, “In silico discovery of pollen-specific cis-regulatory elements in the Arabidopsis hydroxyproline-rich glycoprotein gene family,” Ph.D. dissertation, Ohio University, 2014.

[91] D. Somma, H. Lobkowicz, and J. P. Deason, “Growing America’s fuel: an analysis of corn and cellulosic ethanol feasibility in the United States,” Clean Technologies and Environmental Policy, vol. 12, no. 4, pp. 373–380, 2010.

[92] D. Winter, B. Vinegar, H. Nahal, R. Ammar, G. V. Wilson, and N. J. Provart, “An electronic fluorescent pictograph browser for exploring and analyzing large-scale biological data sets,” PloS one, vol. 2, no. 8, pp. e718–e718, 2007. [Online]. Available: http://bar.utoronto.ca/efp/cgi-bin/efpWeb.cgi

[93] The Arabidopsis Information Resource (TAIR). [Online]. Available: http://www.arabidopsis.org