Motif Selection Using Simulated Annealing Algorithm with Application to Identify Regulatory Elements

A thesis presented to the faculty of the Russ College of Engineering and Technology of Ohio University

In partial fulfillment of the requirements for the degree Master of Science

Liang Chen August 2018 © 2018 Liang Chen. All Rights Reserved.

This thesis titled Motif Selection Using Simulated Annealing Algorithm with Application to Identify Regulatory Elements

by LIANG CHEN

has been approved for the Department of Electrical Engineering and Computer Science and the Russ College of Engineering and Technology by

Lonnie Welch Professor of Electrical Engineering and Computer Science

Dennis Irwin Dean, Russ College of Engineering and Technology

Abstract

CHEN, LIANG, M.S., August 2018, Computer Science Master Program

Motif Selection Using Simulated Annealing Algorithm with Application to Identify Regulatory Elements (106 pp.)

Director of Thesis: Lonnie Welch

Modern research on gene regulation and disorder-related pathways utilizes tools such as microarrays and RNA-Seq to analyze changes in the expression levels of large sets of genes. In silico motif discovery performed on the resulting profile data generates a large set of candidate motifs (usually hundreds or thousands). Picking a set of biologically meaningful motifs from the candidate set is a challenging biological and computational problem. As a computational problem, it can be modeled as the motif selection problem (MSP). Solutions to the motif selection problem give biologists direct help in finding transcription factors (TFs) that are strongly related to specific pathways and in gaining insight into the relationships between genes. This study implemented an algorithm based on the simulated annealing (SA) optimization algorithm for the motif selection problem and investigated its properties on real-world datasets (ENCODE project data). The evaluation on ENCODE datasets indicates that the simulated annealing algorithm is well suited to solving the motif selection problem. Its performance can be tuned through several parameters to meet specific requirements. Future improvement may be achieved by extending the algorithm model (adaptive simulated annealing) and applying a high-dimensional cost function.

Dedication

To my family, and my parents.

Acknowledgments

First, I would like to thank my advisor, Dr. Lonnie Welch, for his mentoring and support in my daily study and research project. I would also like to thank my graduate committee members, Dr. Frank Drews and Dr. Razvan Bunescu, for their support, help, comments, and suggestions for my research. I also want to thank Dr. Karen Coschigano for serving as college representative for my thesis defense. Special thanks to graduate students Rami Al-Ouran, Yi-Chao Li, and Yating Liu in Dr. Welch’s lab; graduate student alumni Jens Schmidt, Robert Schmidt, and Krystine Garcia in Dr. Welch’s lab; and graduate students Bibo Shi and Zhe-Wei Wang in Dr. Jundong Liu’s lab.

Table of Contents

Abstract
Dedication
Acknowledgments
List of Tables
List of Figures
List of Acronyms

1 Introduction
  1.1 Background
  1.2 Biological Motivation
  1.3 Foundations of Computational Modeling and Optimization Algorithm
  1.4 Problem Statement
  1.5 Contributions

2 Methods
  2.1 Motif Selection Problem
  2.2 Set Cover Problem (SCP)
  2.3 Mapping Motif Selection Problem to Set Cover Problem
  2.4 SA Relaxed Version
  2.5 Simulated Annealing Algorithm
  2.6 Implementation for Solving MSP
  2.7 Adjustable Parameters of SA Implementation for MSP

3 Evaluation Using ENCODE Data
  3.1 Overview
  3.2 Datasets
  3.3 Parameters
  3.4 Results
  3.5 Analysis on Results
  3.6 Biological Insights of Selected Motifs

4 Conclusion and Future Work
  4.1 Conclusion
  4.2 Future Work

References
Appendix A: Source Code
Appendix B: Supplementary Contents
Appendix C: Disclaimer

List of Tables

2.1 Parameter Settings for Simulated Annealing Algorithm
3.1 Parameter Settings for ENCODE Datasets
B.1 ENCODE TF Group Datasets
B.2 Feature Set Size Result
B.3 Sequence Sensitivity Result
B.4 Motifs selected by SAr85 from BATF group
B.5 Examples of TOMTOM reported alignments
B.6 Motifs selected by SAr85 from PBX3 group
B.7 Examples of TOMTOM reported alignments

List of Figures

1.1 General Pipeline for Motif Selection
2.1 Flowchart for Simulated Annealing
2.2 Temperature Curve for Exponential Cooling
2.3 Class Relationships
3.1 Overview of ENCODE Project
3.2 Boxplot for Feature Set Size
3.3 Line plot for Feature Set Size
3.4 Boxplot for Sequence Sensitivity (sSn)
3.5 Line plot for Sequence Sensitivity (sSn)
3.6 Comprehensive comparison: SA
3.7 Comprehensive comparison: SAr85
3.8 Comprehensive comparison: SAr70

List of Acronyms

ChIP Chromatin Immunoprecipitation
CPL Common Public License
DECOD DECOnvolved Discriminative motif discovery
DME Discriminating Matrix Enumerator
DNA DeoxyriboNucleic Acid
DP Dynamic Programming
ENCODE Encyclopedia of DNA Elements
FIMO Find Individual Motif Occurrences
GNU GNU’s Not Unix
GPL General Public License
HGP Human Genome Project
ILP Integer Linear Programming
LP Linear Programming
MEME Multiple EM for Motif Elicitation
MSP Motif Selection Problem
NCBI National Center for Biotechnology Information
NGS Next Generation Sequencing
NP Non-deterministic Polynomial
PWM Position Weight Matrix
RILP Relaxed Integer Linear Programming
RNA RiboNucleic Acid
SA Simulated Annealing
SCP Set Cover Problem
TF Transcription Factor
TFBS Transcription Factor Binding Site
TSS Transcription Start Site
UTR UnTranslated Region

1 Introduction

This research project focuses on the implementation and evaluation of a simulated annealing optimization algorithm for the motif selection problem, with application to ENCODE datasets.

1.1 Background

Biologists have established that every species of living being on Earth has its own genetic code, which stores the information needed to construct the organism and to control the metabolic processes essential to its survival, development, and reproduction [1, 2]. To investigate the internal mechanisms of these genetic codes and decode the information they carry, extensive work has been done: on the structure and properties of deoxyribonucleic acid (DNA) molecules [3], the amino acid sequences of proteins [4], and classical genetics theories [5], through to modern views of the genome and genes and various projects and achievements on genomic information such as the Human Genome Project (HGP) [6], the International HapMap Project [7], and the ENCODE project [8]. With continuous effort and international collaboration, many species have had their whole genomes sequenced, including Drosophila melanogaster (fruit fly, model species) [9], Caenorhabditis elegans (worm, model species) [10], Escherichia coli (bacterium, model species) [11], Arabidopsis thaliana (model plant species) [12], Oryza sativa (rice, food crop) [13], and Homo sapiens (human) [14]. With technical advances and more specific sequencing targets [15, 16], new problems have emerged, such as storing and interpreting these biological datasets. Scientists are no longer satisfied with raw genomic information such as DNA and RNA sequences; they are more interested in how these genomic elements interact with each other and with a changing environment. For example, BRAF mutations [17–19] have been widely accepted as an indicator for certain types of cancer such as melanoma [20, 21] and colorectal cancer [22–25]. Another example is the association between EGFR mutations and prostate cancer [26].
With the emergence of genomic testing methods and their practice in clinical medicine (some commercial genomic tests [27, 28] are already available to physicians and patients), the demand for interpreting genomic data and applying the information to improve patient treatment has increased dramatically. Research on gene interactions, however, is not as direct as neuroscience research on acute reactions in living animals (another active topic in basic science, which may reveal the mechanisms and rules behind intelligent work such as thinking and learning). Neuroscientists can insert tiny electrodes into neural tissue such as the cerebral cortex or peripheral ganglia to record the electrical signals of functioning cells (“neurons”) [29–31], and they can use the timing and strength of these signals across different groups of neurons to establish how the groups interact; some of the predicted relationships may be supported by anatomical structures [32]. In contrast to electrophysiology studies, molecular genetic research usually depends on extracting samples from targeted models (animals, plants, or bacteria, with additional treatments or conditions and optional genetic modifications), sequencing the samples to measure the expression levels of genes and biomarkers, and applying tools to analyze and interpret the results [33, 34]. For instance, bioinformatics tools such as BLAST [35, 36], FASTA [37], and ClustalW [38] are widely used for sequence alignment to compare the similarity between biological sequences. Early studies [39] on transcription factors were limited to simple models and exploratory methodology, while in the next generation sequencing (NGS) era [16, 40] determining gene regulation changes and identifying known and unknown genes are much easier and faster.
Computer programs [41] have been developed for assembling the short reads (hundreds of thousands of short DNA or RNA fragment sequences reported by the sequencing instruments) from chromatin immunoprecipitation sequencing (ChIP-Seq) experiments into the sequences of targeted genes. With faster and more accurate sequencing technologies, the volume of sequencing data increases dramatically. Advanced biological problems also require in-depth mathematical modeling and information encoding as well as advanced computational methodology.

1.2 Biological Motivation

In modern molecular genetics, the basic science study of gene regulatory networks and transcription factors is important because its outcomes enrich the knowledge foundation of molecular genetics and benefit translational studies on gene-centric medicine.

1.2.1 Gene Regulation

In the classical central dogma of biology, DNA (genotype) is transcribed into RNA, and RNA is translated into protein (phenotype). The biological functions provided by proteins are therefore controlled through the DNA that encodes them, which means that turning genes on or off regulates the corresponding functions. Genes can be turned on or off by various factors such as environmental stimuli and changes in the physiological status of the individual organism, so that sophisticated metabolic processes can be triggered and controlled on demand and with minimal overhead. Classical examples of gene regulation include the lac operon (lactose operon) and the trp operon (tryptophan operon) in E. coli. The lactose operon, also known as the lac operon, is a regulatory system for the lac genes, which encode the enzymes for consuming lactose from the environment. These enzymes transport lactose from the environment into the intracellular space of E. coli (β-galactoside permease) and metabolize lactose to provide carbon and energy (β-galactosidase). E. coli may use lactose as an alternative carbon source (energy source) when glucose (the primary and preferred carbon source) is absent from the environment. When glucose is abundant in the environment, E. coli uses glucose. The protein encoded by the lacI gene binds to the lac gene promoter region, and the transcription of the lac genes is inhibited. When only lactose is available in the environment, lactose molecules are transported into E. coli cells and consumed as the primary carbon source. During lactose metabolism, some allolactose molecules are generated by occasional transglycosylation of lactose by β-galactosidase, which raises the intracellular concentration of allolactose. Allolactose can bind to the LacI protein and release it from the promoter region of the lac genes, which leaves room for RNA polymerase to bind the promoter region and start transcription. [42–44]

The trp operon is the regulatory system for the genes that encode the enzymes for synthesizing tryptophan (an amino acid that E. coli can synthesize itself) from other substances. Since synthesizing tryptophan consumes energy and requires other substances, this process is rarely activated when tryptophan can be obtained from the environment. The intracellular concentration of tryptophan controls whether the genes related to tryptophan synthesis and the associated downstream genes are activated. When the environment supplies sufficient tryptophan, tryptophan binds to the transcription factor, a protein encoded by the trpR gene, and forms a complex. The binding of tryptophan alters the conformation of the transcription factor, which allows the complex to bind to the sequence region adjacent to the tryptophan gene promoter and inhibit the transcription of these genes. When tryptophan is no longer available from the environment, E. coli consumes tryptophan in its metabolic processes, which causes the intracellular concentration of tryptophan to drop. With fewer tryptophan molecules in the intracellular environment, the equilibrium moves towards dissociation of the transcription factor complex. The inhibition caused by the complex is then removed, and the genes for synthesizing tryptophan are expressed. [45, 46]

From the above examples of the lac operon and the trp operon, it can be concluded that transcription factors (TFs) play the key role in controlling the transcription of genes. The control mechanism is the selective binding of a transcription factor to sequence regions within the effective range of the gene promoter, which activates or deactivates the promoter. In the lac model, the lac repressor is the transcription factor that inhibits the genes related to lactose metabolism.[47] In the trp model, the tryptophan repressor is the transcription factor that controls the expression of the related genes.[46] Besides transcription factors, their binding sites in DNA sequences are also involved in the process. Transcription factor binding sites (TFBSs, also known as “motifs”) are the sequence regions to which transcription factors (TFs) bind in order to alter the transcription level of the associated genes. Eukaryotic transcription factor binding sites may reside in gene promoters, upstream promoter elements (UPEs), regulatory elements, and enhancers[48], and the interactions between gene promoters and transcription factors in eukaryotic gene regulation are far more complicated than in prokaryotic systems such as the lac operon and the trp operon. For example, BRAF serves as an indicator of, and a control point in, cancer development, since it can interact with or regulate more than 20 genes in the pathways involved in melanoma or colorectal cancer[17–25]. Many genes and transcription factors are involved in eukaryotic gene regulation, and each element affects the regulation pathways of some other elements. This interdependent cross-regulation forms an interconnected network called a gene regulatory network.
For example, previous studies[49, 50] report that many prostate cancer related biomarkers can be selected from the sequence pool, and many of these biomarkers are reported to be either interdependent or structurally similar to many other biomarkers or genomic sequences (genes). Due to the complexity of gene regulatory networks and the essential role of transcription factor binding sites (TFBSs), discovering and identifying the potential motifs (putative TFBSs) are important steps in gene regulatory studies. Two processes, motif discovery and motif selection, are used to perform these steps.

1.2.2 Motif Discovery

Motif discovery is the task of finding the potential genomic sequence patterns (motifs) that may be binding sites for the transcription factors participating in certain gene regulation processes. Previous studies have produced many methods and models for motif discovery, along with corresponding software implementations. The classical motif discovery methods, also known as de novo motif discovery, are based on structural and/or functional patterns and properties of the sequences, which fit the biological models directly. De novo motif discovery tools such as AlignACE[51], Weeder[52], MDscan[53], Trawler[54], and MEME[55] use mathematical models that encode sequence structure and/or functional properties to search for exact or similar matches within an input genomic sequence pool and extract motifs from them. The advantages of de novo methods are that they relate directly to binding structure and functionality and have high sensitivity, while the drawbacks are relatively low specificity, the need for hard-coded mathematical models, and limited search targets.[56] Because de novo methods do not distinguish between regulatory-relevant and non-relevant sequences[56], a method called discriminative motif discovery was developed. Discriminative motif discovery is a motif discovery methodology that uses both positive (foreground) and negative (background) sets of sequences to search for motifs.[57, 58] The advantages of discriminative methods over non-discriminative (de novo) methods are that they: a) identify co-factors that are involved in the target gene regulation process, b) detect changes in regulatory behavior between different (test) conditions, c) reduce decoy motifs caused by pattern variants in different sequences, and d) do not rely heavily on well-annotated motifs, which raises the limit on the number of detectable motifs. Discriminative motif discovery tools include DECOD[59] and DEME[60].
In the next generation sequencing (NGS) era, and especially with the emergence of ChIP-Seq datasets for various transcription factors, the classical methods face several challenges: huge datasets (and long runtimes), poorly annotated motifs, and limited knowledge about the motif set compared with the huge number of sequence reads. Software tools such as DREME[61], SIOMICS[62], and Dimont[63] have been developed to handle huge ChIP-Seq datasets with improved analysis speed and processing capacity. Since many software tools are designed for specific aims, aggregating complementary algorithms and assembling the results from multiple tools can be a good methodological choice for accurate results.[64] Ensemble motif discovery methods are combined methods that use both traditional and discriminative motif discovery tools, bringing together the strengths of each tool and reducing their weaknesses.[65, 66] An example of an ensemble motif discovery tool is SCOPE[67, 68]. The motif discovery task performed in this thesis also uses several different motif discovery tools, such as DREME and DECOD, and assembles a joint candidate motif set.

1.2.3 Motif Selection

Motif discovery tools can generate many potential motifs that may have biological meaning in the context of gene regulation in living cells and tissues by analyzing and clustering the DNA sequences of regulated genes and the untranslated regions (UTRs) around the gene coding regions. The sheer size of the candidate motif set offers little practical value to biologists because: a) it is costly or even impractical to validate all the candidate motifs through biological experiments; b) discovered motifs may not have any actual function in, or relation to, the regulation of the genes. Therefore a strong set of motifs must be selected from the large candidate set before biological validation, and this task is called motif selection. The major goal of the motif selection problem is to pick motifs from the large set of candidate motifs obtained from biological experiments such as ChIP-Seq and gene expression profiling. A previous study[69] developed an enrichment method that merges similar motifs to achieve a better quality result in extracting over-represented motifs from the large candidate sets provided by ENCODE ChIP-Seq results. In this thesis, the motif selection method is based on the sequence coverage model. The sequence coverage model assumes a correlation between the presence of a motif within a sequence region and the regulatory effect of the motif on that sequence.[70]

1.2.4 General Pipeline for Motif Selection

For the sequence coverage based motif selection study conducted in this thesis, the major steps of the logical pipeline are sequence retrieval, motif discovery, motif scanning, and motif selection. A diagram of the general pipeline for motif selection is shown in Figure 1.1. Each step of this pipeline is explained in the following paragraphs.

[Figure 1.1 diagram. Boxes, top to bottom: ChIP sequences / Gene promoters; Peaks; Motif Discovery; Motifs, M / Sequences covered by motifs, S; Motif Selection; Selected motifs, F ⊆ M]

Figure 1.1: General Pipeline for Motif Selection. Biology experiments provide the input data (peaks from ChIP-Seq and/or gene promoter sequences). The input data is first used by motif discovery tools to discover short nucleotide sequences as potential candidate motifs (M). The candidates are then scanned against the input sequences to find the occurrences of each candidate in the sequences (“find the coverage of each candidate”, S), and both the candidates and their covered sequences are read by the motif selection tool to pick a relatively small subset (F, the feature set) of the candidates.

Sequence retrieval is based on biological experiments. These experiments examine changes in the expression level of a large set of genes that may or may not be related to certain biological factors such as environmental factors, stages of development, phases of the life cycle, or stages of a disorder. Once the change in expression level exceeds a predefined value, the names of the corresponding genes are collected. The nucleotide sequences of the promoter regions (and/or 5’-UTRs) of the collected genes are then retrieved from available databases such as GenBank[71] of NCBI. These nucleotide sequences serve as the input data for the motif discovery step. For ChIP-Seq experiments, the reported peaks serve the same role as the gene promoter sequences. During the motif discovery step, the motif discovery tools read the nucleotide sequences and apply different methods to discover short nucleotide sequences as potential candidate motifs. The discovered motifs are usually stored as PWMs (Position Weight Matrices). The PWM format stores the weighted probability of each of the four nucleotides at each position within the motif sequence. In the motif scanning step, the candidates are scanned against the input sequences to find the occurrences of each candidate in the sequences (“find the coverage of each candidate”), and both the candidates and their covered sequences are read by the motif selection tool. The motif selection step picks a relatively small subset (“feature set”) of the candidates. The selection criteria of existing motif selection methods are based on several different models.
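The PWM representation and the motif scanning step described above can be illustrated with a small sketch. Everything in it is hypothetical and chosen for illustration only: the four-position matrix, the uniform background frequency, the log-odds scoring, the threshold of 4.0, and the function names `window_score` and `covers`. It is not the scanning tool used in this thesis; it only shows how a PWM can be turned into a per-motif set of covered sequences.

```python
import math

# Hypothetical 4-position motif: one probability per nucleotide at each position.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform background frequency for every nucleotide


def window_score(pwm, window):
    """Log-odds score of one window against the uniform background."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(pwm, window))


def covers(pwm, sequence, threshold):
    """True if any window of the sequence scores at or above the threshold."""
    width = len(pwm)
    return any(
        window_score(pwm, sequence[i:i + width]) >= threshold
        for i in range(len(sequence) - width + 1)
    )


# Toy input sequences; the names and contents are made up.
sequences = {"seq1": "TTAGCTAA", "seq2": "CCCCCCCC", "seq3": "AGCT"}
coverage = {name for name, seq in sequences.items() if covers(PWM, seq, threshold=4.0)}
print(sorted(coverage))
```

The resulting set plays the role of S for one motif in Figure 1.1: the sequences in which the candidate occurs.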

1.3 Foundations of Computational Modeling and Optimization Algorithm

The computational model used in this thesis for motif selection is the sequence coverage based motif selection method. The motif selection problem is defined as in a previous study[70]: given a set of motifs and a set of sequences, each of which contains at least one of the given motifs, find the subset of the motif set such that (a) the subset has minimum cardinality and (b) every sequence contains at least one motif from the selected subset. Since the motif selection problem maps directly onto the set cover problem, an NP-hard combinatorial optimization problem, an exact algorithm may not be feasible for large inputs. Therefore constructing an approximation algorithm for the motif selection problem is the primary choice.

1.3.1 Set Cover Problem (SCP)

The set cover problem (SCP) is a classical NP-hard problem.[72, 73] If a problem A can be formulated as an SCP, the search for an optimal (or near-optimal) solution to problem A can be expressed as a 0-1 integer linear programming problem (0-1 ILP).[74] Many algorithms[75–79] have been developed for solving SCP. Variants of the set cover problem include partial set cover (PSC) and positive negative partial set cover (PNPSC)[80], which extend the basic SCP model and are useful for modeling special cases of real-world problems. There is a long list of optimization algorithms for finding feasible or optimal solutions to NP-hard problems: greedy algorithms, hill climbing, linear programming (LP) and integer linear programming (ILP), dynamic programming (DP), simulated annealing (SA), genetic algorithms (GA), etc. Different optimization algorithms have different features and properties, and each may suit a specific subset of these problems. A survey of the major algorithms developed for solving SCP is given in [81].
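For reference, the 0-1 ILP view of the unweighted set cover problem mentioned above can be written in the standard textbook form below. The indicator variables $x_j$ are notation introduced here for illustration (with $x_j = 1$ meaning that subset $S_j$ is selected); they are not taken from the cited works.

```latex
\begin{aligned}
\text{minimize}\quad & \sum_{j=1}^{k} x_j \\
\text{subject to}\quad & \sum_{j \,:\, e \in S_j} x_j \ge 1 \quad \text{for every } e \in U, \\
& x_j \in \{0, 1\} \quad \text{for } j = 1, \ldots, k.
\end{aligned}
```

Replacing the last constraint with $0 \le x_j \le 1$ gives the linear programming relaxation of the same problem.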

1.3.2 Simulated Annealing Algorithm

The idea of the simulated annealing algorithm comes from the heat treatment process used in metalworking to reconstruct the internal crystal structure of solid objects[82]. This process consists of heating the solid object and then slowly cooling it down; the cooling step is called the “annealing process”. In statistical thermodynamics, the crystal annealing process starts at a high initial temperature (usually the melting point of the material). At the initial temperature, the thermal energy of the object is high: the bases (the basic mass units of the solid, usually the atoms or molecules that make it up) may move randomly (Brownian motion) within the object, and the object can easily be reshaped by mechanical effort. As time passes, heat is transferred away from the object and its temperature decreases. Temperature determines how freely the bases can move, and as it drops the atomic forces restrict the bases to a limited range of positions within the crystal lattice of the material. Crystal structures also begin to form at random positions within the object, and these initial crystal structures are called crystal lattices. As the temperature decreases further, the crystal structure expands (this process is called crystal growth). During crystal growth, the crystal structure is built up from primitive cells, and the type of crystal structure is determined by factors such as the type of the bases, the atomic packing factor, and the cooling speed.[83] The simulated annealing algorithm is the mathematical emulation of this annealing process[84, 85]. The foundation of its mathematical model is the description of how the energy level changes over the annealing time.
The global search through the search space is guided as a simulation of the crystal annealing process: the energy level (annealing temperature) determines how freely the bases can move, which corresponds to the maximum variation (mutation) that can be derived from the current solution (also known as the size of the neighborhood in the local search step of the simulated annealing algorithm). The initial temperature, stop temperature, and cooling speed determine the maximum number of iterations of the search process. In general, the simulated annealing algorithm has the following advantages: it returns good quality solutions; it is a general method that applies to many different types of problems and is flexible to extend; and it is easy to implement for a specific problem.[86] Some drawbacks have also been observed in research and application: many parameters need to be configured and tuned in order to get better solutions, and the runtime is too long under some circumstances.[86]
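How the initial temperature, stop temperature, and cooling speed bound the number of iterations can be made concrete with a short sketch. It assumes an exponential (geometric) cooling schedule in which the temperature is multiplied by a constant factor at every step; the function name `cooling_steps` and the parameter name `alpha` are hypothetical and introduced only for this illustration.

```python
def cooling_steps(t_initial, t_stop, alpha):
    """Count temperature updates until the temperature falls to t_stop or below."""
    if not (0.0 < alpha < 1.0):
        raise ValueError("alpha must lie strictly between 0 and 1")
    if t_initial <= 0 or t_stop <= 0:
        raise ValueError("temperatures must be positive")
    steps = 0
    temperature = t_initial
    while temperature > t_stop:
        temperature *= alpha  # exponential (geometric) cooling, an assumed schedule
        steps += 1
    return steps


print(cooling_steps(100.0, 1.0, 0.5))   # 7
print(cooling_steps(100.0, 1.0, 0.95))  # slower cooling needs more steps
```

Under this assumed schedule, a cooling factor closer to 1 lengthens the search, which is one reason the runtime depends so strongly on parameter tuning.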

1.4 Problem Statement

The problem studied in this thesis is how to build exact and approximate solutions to the motif selection problem (MSP) efficiently, because computational solutions to the motif selection problem can be applied to real-world biological experiments that explore the regulatory networks between genes under various environmental settings or developmental stages of targeted animal models.

The proposed work in this thesis is to use the simulated annealing optimization algorithm to build solutions for the motif selection problem and to evaluate the solutions with ENCODE transcription factor datasets.

1.5 Contributions

The contributions of this thesis include the development of a new motif selection method using the simulated annealing (SA) algorithm and an evaluation of the performance and properties of the SA implementation for MSP. Because the motif selection problem is NP-complete, various exact and approximation algorithms have been used to build motif selection methods[70]. Applying the simulated annealing algorithm to motif selection is another attempt at building efficient methods to solve the motif selection problem. The investigation of the performance and properties of the simulated annealing based motif selection method will contribute to this state-of-the-art problem.

2 Methods

2.1 Motif Selection Problem

The general idea of the motif selection problem is to find the minimum set of motifs that covers the universe set of given sequences. The input of the motif selection problem is a set of motifs and the corresponding sequences, each of which contains one or more motifs from the given motif set. In this thesis, the universe set is determined by other tools in the pipeline (for example, motif finding tools, which provide the universe set of sequences). The formal definition of the motif selection problem is as follows: given a universe set $U$ of $n$ elements (sequences), a collection of subsets of $U$ (a set of motifs), $S = \{S_1, \ldots, S_k\}$ (each motif represents a subset of $U$), and a cost function $c : S \to \mathbb{Q}^+$, find a minimum cost subcollection $S'$ of $S$ that covers all elements of $U$ (find a minimum cost set of motifs that covers all the sequences).

$$c_{MSP}(S') = |S'| = \text{number of motifs that are selected in the solution } S'$$

2.2 Set Cover Problem (SCP)

The Set Cover Problem (SCP) is a well-known NP-complete problem in computer science.[72, 87] The following definition of the Set Cover Problem is adapted from [87]: given a universe set $U$ of $n$ elements and a collection of subsets of $U$, $C = \{S_1, S_2, \ldots, S_k\}$, with $S_i \subseteq U$ for $i = 1, \ldots, k$ and $\bigcup_{i=1}^{k} S_i = U$, find a subset $F$ of $C$ ($F \subseteq C$) such that $\bigcup_{S_x \in F} S_x = U$ and $|F|$ is minimized. The corresponding cost function for the above definition is

$$c_{SCP}(F) = |F|$$

This cost function follows the form and style of the cost functions described in [80].
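To make the definition concrete, the sketch below applies the classical greedy heuristic to a toy instance. This is a textbook approximation strategy, not the method developed in this thesis, and it is not guaranteed to return a minimum cover. The function name `greedy_set_cover` and the example data are hypothetical.

```python
def greedy_set_cover(universe, subsets):
    """Return names of chosen subsets, picking the largest new coverage each round."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the subset that covers the most still-uncovered elements.
        best = max(subsets, key=lambda name: len(subsets[name] & uncovered))
        gained = subsets[best] & uncovered
        if not gained:
            raise ValueError("the subsets do not cover the universe")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Toy instance: U has five elements and C has four subsets.
U = {1, 2, 3, 4, 5}
C = {"S1": {1, 2, 3}, "S2": {2, 4}, "S3": {3, 4}, "S4": {4, 5}}
F = greedy_set_cover(U, C)
print(F, len(F))  # the cost c_SCP(F) is the number of chosen subsets
```

On this instance the heuristic first takes S1 (three new elements) and then S4 (the remaining two), so the cover has cost 2.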

2.3 Mapping Motif Selection Problem to Set Cover Problem

Many computational problems can be reduced to known problems[72, 73] that can be solved within a more predictable time constraint. For example, using ILP may solve some NP problems in pseudo-polynomial time. Even though the running time is still large for big problems, the available methodology and tools can make some real world problems solvable[87]. Therefore mapping new problems to known problems is an important and essential step in problem solving. The motif selection problem (MSP) can be mapped to the set cover problem (SCP) as follows: a) all the given sequences make up the universe set $U$, where $n$ is the total number of sequences; b) all the motifs make up the collection of subsets of $U$, $S = \{S_1, \dots, S_k\}$, where $k$ is the total number of motifs; c) a cost function $c : S \to \mathbb{Q}^{+}$ is used to assign a weight to each motif; d) finding a minimum set of motifs that covers all the sequences is equivalent to finding a minimum cost subcollection of $S$ that covers all elements of $U$. The cost function used in this model is as follows:

\[
c(X) =
\begin{cases}
10^{5}, & \text{if } X = \emptyset \\
|X|, & \text{if } \left|\bigcup_{i=1}^{k} S_i X_i\right| = |U| \\
|X| + 1 - \dfrac{\left|\bigcup_{i=1}^{k} S_i X_i\right|}{|U|}, & \text{otherwise}
\end{cases}
\]

In the above cost function a technique called “poisoned reverse” (similar to route poisoning [88]) is used to avoid the empty set ($X = \emptyset$).

2.4 SA Relaxed Version

For the relaxed version, $r \in (0, 1]$ denotes the relaxation factor. The cost function is shown below:

\[
c(X) =
\begin{cases}
10^{5}, & \text{if } X = \emptyset \\
|X|, & \text{if } \left|\bigcup_{i=1}^{k} S_i X_i\right| \ge r\,|U| \\
|X| + 1 - \dfrac{\left|\bigcup_{i=1}^{k} S_i X_i\right|}{|U|}, & \text{otherwise}
\end{cases}
\]

2.5 Simulated Annealing Algorithm

The simulated annealing algorithm is an optimization algorithm for searching for optimal solutions in the search space of combinatorial optimization problems. It can also be used as an approximation algorithm when approximate solutions are sufficient for real world applications. Other optimization algorithms have also been built for NP-complete problems, but the simulated annealing algorithm has some properties that are especially valuable for practical application: a) the algorithm principle is easy to understand and easy to implement, b) the maximum running time is predictable, c) it may run under limited resources (such as limited run time or computation resources), d) it can generate approximate solutions based on certain criteria. A generalized flowchart for the simulated annealing algorithm is shown in Figure 2.1.

2.5.1 Pseudo-code for Simulated Annealing Algorithm

The algorithm pseudo-code (shown as Algorithm 2.1) is quoted from the documentation[89] of the METSlib framework. The meaning of each line in the pseudo-code (Algorithm 2.1) is explained as follows:

Line 1: set the value of the working solution (s′′) to the initial solution;

Line 2: set the value of the best-ever solution (s∗) to the initial solution;

Line 3: set the value of the current temperature (T) to the initial temperature T0;

Line 4: check whether the current temperature T is still above zero (0): if true, run the loop;

[Flowchart elements, in order: start; assess initial solution; generate new solutions (“neighborhood”); assess new solutions; accept new solution? (no/yes); update scores; adjust temperature; terminate search? (no/yes); exit]

Figure 2.1: Generalized Flowchart for Simulated Annealing Algorithm

Line 5: generate the neighborhood (N(s′′)) of the working solution, and iterate through the neighborhood;

Line 6: pick a random value u from the uniform distribution in the range [0, 1] (U01);

Algorithm 2.1: SimulatedAnnealing

     1  s′′ ← s0
     2  s∗ ← s0
     3  T ← T0
     4  while T > 0 do
     5      for all s′ ∈ N(s′′) do
     6          pick random u in U01
     7          if min{1, exp(−(f(s′) − f(s′′)) / T)} > u then
     8              s′′ ← s′    /* always accept improving moves, accept also non-improving moves with probability exp(−(f(s′) − f(s′′)) / T) */
     9              break       /* stop exploring the current neighborhood after accepting a point */
    10      if f(s′′) < f(s∗) then
    11          s∗ ← s′′
    12      T ← update(T)
    13  return s∗

Line 7: test whether the value of u is below a threshold that is calculated from the current temperature T and the cost difference between the currently assessed solution (s′) and the working solution (s′′): if true, run the following code block;

Line 8: set the value of the working solution (s′′) to the currently assessed solution (s′);

Line 9: stop the iteration over the current neighborhood;

Line 10: test whether the cost of the current solution (s′′) is lower than that of the best-ever solution (s∗): if true, run the following code block;

Line 11: set the value of the best-ever solution (s∗) to the current solution (s′′);

Line 12: adjust the temperature T with the function update();

Line 13: return the value of the best-ever solution (s∗).

Since the temperature T is always decreased regardless of whether a better solution has been found, the simulated annealing algorithm is guaranteed to stop and return the best solution it encountered while traversing the search space of the given problem. The generation of the neighborhood of a given solution, the cooling scheme of the temperature, and the calculation of the solution cost may all affect the performance and outcome of this stochastic, heuristic search algorithm. If the parameters are carefully chosen, the search path of the simulated annealing algorithm in the search space of the given problem may pass through the global optimum, in which case the algorithm returns the global optimum as the final solution to that problem.
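The loop structure of Algorithm 2.1 can be sketched in a few lines of Python. This is an illustrative sketch, not the METSlib implementation: the function name `simulated_annealing`, the toy cost function, and the neighborhood functions are hypothetical. One deliberate deviation is labeled in the code: the loop stops at a small positive `t_min` instead of zero, because a geometric cooling rule only approaches zero.

```python
import math
import random
from typing import Callable, Iterable, Optional, TypeVar

S = TypeVar("S")


def simulated_annealing(s0: S,
                        cost: Callable[[S], float],
                        neighbors: Callable[[S, random.Random], Iterable[S]],
                        t0: float,
                        update: Callable[[float], float],
                        t_min: float = 1e-3,
                        rng: Optional[random.Random] = None) -> S:
    """Minimize `cost`, following the structure of Algorithm 2.1."""
    rng = rng or random.Random()
    working = s0              # line 1: working solution
    best = s0                 # line 2: best-ever solution
    t = t0                    # line 3: current temperature
    while t > t_min:          # line 4 (assumption: stop at t_min rather than 0)
        for candidate in neighbors(working, rng):            # line 5
            u = rng.random()                                  # line 6: u in [0, 1)
            delta = cost(candidate) - cost(working)
            # line 7: min(1, exp(-delta / t)); improving moves get probability 1
            p = 1.0 if delta <= 0 else math.exp(-delta / t)
            if p > u:
                working = candidate                           # line 8
                break                                         # line 9
        if cost(working) < cost(best):                        # line 10
            best = working                                    # line 11
        t = update(t)                                         # line 12
    return best                                               # line 13


def toy_cost(x: int) -> float:
    """Hypothetical cost with a single minimum at x = 7."""
    return float((x - 7) ** 2)


def step_neighbors(x: int, rng: random.Random) -> list:
    """Neighbors x - 1 and x + 1 inside [0, 20], visited in random order."""
    options = [n for n in (x - 1, x + 1) if 0 <= n <= 20]
    rng.shuffle(options)
    return options


def only_up(x: int, rng: random.Random) -> list:
    """Deterministic neighborhood used to check the loop: step up until 7."""
    return [x + 1] if x < 7 else []


result = simulated_annealing(0, toy_cost, step_neighbors, t0=10.0,
                             update=lambda t: 0.9 * t, rng=random.Random(0))
```

Because the best-ever solution is only replaced by strictly cheaper solutions, the returned solution can never cost more than the initial one, whatever path the random search takes.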

2.5.2 Important Parameters

As stated in the previous section, some parameters of the simulated annealing algorithm have a great impact on the running time and on the final search result for the given problem. They are described in detail in this section.

2.5.2.1 Cooling Schedule

The cooling schedule is an important part of the simulated annealing algorithm. It regulates the probability of picking worse solutions at each iteration step. As the iteration continues, the cooling schedule decreases the value of the temperature according to a predefined formula, and the value of the temperature directly determines how likely a random draw is to accept a worse solution. Generally two types of cooling schedules are used: exponential cooling[84] and linear cooling[90]. Exponential cooling uses an exponential formula to decrease the value of the temperature, which is based on the energy level calculation formula.[84] For example, the temperature formula can be $T_{i+1} = A\,T_i$, where $0 < A < 1$ and $i$ is the iteration step. With this formula, the temperature decrease rate (delta temperature per step) is high at the beginning stage and becomes low as the iteration step grows (see Figure 2.2). This property is useful for prolonging the optimization search process and fine-tuning the result.

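[Plot: temperature T (vertical axis) versus iteration step i (horizontal axis), with curves for A = 0.9, A = 0.75, and A = 0.5]

Figure 2.2: Generic Exponential Cooling Temperature Curve. Formula: $T_{i+1} = A\,T_i$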

Linear cooling schedule uses a linear formula to decrease the value of the temperature. The temperature drop at each iteration step is fixed, while its value is carefully calculated to guarantee that the best result will be returned. Under some circumstances, linear cooling schedule may save computational time and/or achieve user-defined criteria for final results.[90]
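The difference between the two schedules can be seen by tracing a few temperature updates. The sketch below is only an illustration: the function names and the chosen constants are hypothetical and are not the METSlib cooling classes.

```python
from typing import Callable, List


def exponential_cooling(t: float, a: float = 0.9) -> float:
    """One exponential cooling step: T_{i+1} = A * T_i with 0 < A < 1."""
    if not 0.0 < a < 1.0:
        raise ValueError("A must lie strictly between 0 and 1")
    return a * t


def linear_cooling(t: float, step: float) -> float:
    """One linear cooling step: subtract a fixed amount, never going below zero."""
    return max(0.0, t - step)


def temperature_trace(t0: float,
                      update: Callable[[float], float],
                      steps: int) -> List[float]:
    """Return the temperatures T_0, T_1, ..., T_steps produced by `update`."""
    trace = [t0]
    for _ in range(steps):
        trace.append(update(trace[-1]))
    return trace


exp_trace = temperature_trace(100.0, lambda t: exponential_cooling(t, 0.5), 3)
lin_trace = temperature_trace(10.0, lambda t: linear_cooling(t, 4.0), 3)
```

In the exponential trace the drop per step shrinks as the temperature falls (50, then 25, then 12.5 in this example), whereas the linear trace drops by the same amount each step until it reaches zero.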

2.5.2.2 Selecting Criteria

For the acceptance criterion, previous studies[84, 85, 91] used a negative exponential distribution with parameter $1/c_k$, and the probability of accepting a new solution in the iteration steps is calculated as follows (adapted from [91]):

\[
P_{c_k}\{\text{accept } j\} =
\begin{cases}
1, & \text{if } f(j) \le f(i) \\
\exp\left(\dfrac{f(i) - f(j)}{c_k}\right), & \text{if } f(j) > f(i)
\end{cases}
\]

where $c_k$ is constant for the $k$-th iteration, $i$ is the solution gained before the $k$-th iteration, $j$ is the new solution gained in the $k$-th iteration, and $f$ is the cost function.

The Boltzmann distribution is used for the drawing operation (“sweeptaking”), which determines whether a worse solution should be picked in case no better solutions were found in the current neighborhood.
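A short numerical sketch may help to read the acceptance rule. The function name `accept_probability` is hypothetical; the arguments `f_i`, `f_j`, and `c_k` mirror the symbols $f(i)$, $f(j)$, and $c_k$ used above.

```python
import math


def accept_probability(f_i: float, f_j: float, c_k: float) -> float:
    """Probability of accepting new solution j given current solution i (minimization)."""
    if c_k <= 0:
        raise ValueError("c_k must be positive")
    if f_j <= f_i:
        return 1.0                      # never reject a solution that is not worse
    return math.exp((f_i - f_j) / c_k)  # worse solution: probability decays with the cost gap


p_better = accept_probability(5.0, 4.0, 1.0)   # improving move
p_worse_hot = accept_probability(5.0, 6.0, 10.0)   # worse move, high control parameter
p_worse_cold = accept_probability(5.0, 6.0, 0.1)   # worse move, low control parameter
```

For the same cost gap, a larger $c_k$ gives a higher chance of accepting the worse solution, which is why worse solutions are accepted more often early in the search than late.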

2.5.2.3 Termination Criteria

For some specific applications, the simulated annealing process may be terminated before the temperature reaches zero (0). The additional mechanism that terminates the loop before the temperature has completely cooled down is called the termination criteria. At each run of the loop, both the value of the temperature and the termination criteria are checked. If the termination criteria are met, the search process is halted and the current best-ever solution is returned.

2.6 Implementation for Solving MSP

The implementation of the simulated annealing algorithm for solving the MSP (motif selection problem) built for this thesis project utilizes a freely available programming framework, which is described in detail below.

2.6.1 Framework Overview

METSlib is a freely available C++ implementation of a metaheuristic algorithm framework[92]. This framework provides abstract data types (C++ classes) for heuristic algorithms such as simulated annealing. Actual problems need to be encoded as classes derived from the METSlib abstract base classes; the corresponding algorithm class method is then called to start the global search process and retrieve the final results. For example, the simulated annealing search process can be invoked by calling the search() member function (code is shown in Code A.1.1) of a mets::simulated_annealing class instance. The logical relationships of the basic classes used in the simulated annealing algorithm (mets::simulated_annealing) implemented by METSlib are shown in Figure 2.3. The algorithm framework generates random derivations (mets::move) with the neighborhood generator (mets::move_manager), and the neighboring solutions are generated by applying the derivations to the current solution (mets::evaluable_solution). The collection of neighboring solutions is then evaluated by comparing their costs with the cost of the current solution. If a solution is better than the current solution, the better solution is used as the current solution in the next iteration of the search.

[Diagram elements: current solution S′i; mets::move objects; populate(); apply(); neighboring solutions S′j, S′k, S′l, S′x; evaluate(); costs Cost j, Cost k, Cost l]

Figure 2.3: Logical Relationships of METSlib Framework Classes

2.6.2 Implementation of Motif Selection Problem Model

For the simulated annealing solver for MSP implemented in this thesis research project, several derived classes are created to encode the MSP.

Class mets::simulated_annealing is the delegator for the simulated annealing algorithm, derived from class mets::abstract_search. It must be instantiated and bound to the actual problem definition class. The search method performs the actual annealing optimization process on the given problem. The result is saved in the pre-allocated solution class instances.

An individual solution to MSP is represented by class SAsol (Code A.2.1), derived from class mets::evaluable_solution, which is the class representing each solution in the search space. The class SAsol is also used for describing the optimization problem and storing solutions during the iteration. For instance, the cost function is implemented as a member function of class SAsol, which is a requirement of the METSlib framework in order to perform the heuristic search correctly.

The delegator for neighborhood generation and local search is represented by class Nhf (Code A.2.3), derived from class mets::move_manager, which is called internally by the algorithm classes to generate neighboring solutions based on the current solution.

The data manipulation instruction for generating a neighboring solution from the current solution is represented by the class Walk (Code A.2.4), derived from class mets::move, which is used for creating a neighboring solution, evaluating a neighboring solution, and performing the actual movement operation during the annealing process. The complete code of the above mentioned classes is shown in Appendix A.2.
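To illustrate the division of labor between a neighborhood generator and a move, the following Python sketch builds a neighborhood by toggling one motif at a time. Both the toggle move and the name `toggle_neighbors` are assumptions made for illustration: the text does not state which move the Walk class applies, and this sketch does not use or reproduce the METSlib API.

```python
from typing import FrozenSet, Iterator, Sequence


def toggle_neighbors(current: FrozenSet[str],
                     all_motifs: Sequence[str]) -> Iterator[FrozenSet[str]]:
    """Yield one neighboring solution per motif by adding it if absent or removing it if present."""
    for motif in all_motifs:
        # Symmetric difference flips the membership of exactly one motif.
        yield current ^ frozenset([motif])


# Hypothetical candidate motif names.
candidates = ["m1", "m2", "m3"]
neighborhood = list(toggle_neighbors(frozenset({"m1"}), candidates))
```

Under this assumed move, the neighborhood of any solution contains exactly as many solutions as there are candidate motifs.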

2.6.3 Implementation of Cost Function

Some modifications must be made to the implemented cost function for the METSlib-based simulated annealing MSP solver in order to enforce the constraints of the original SCP. The original MSP has the objective function

\[ c_{MSP}(X) = |X| \]

subject to the constraint

\[ \sum_{S \,:\, e \in S} X_S \ge 1 \quad \text{for all } e \in U \]

Since the METSlib framework can only minimize the value of the implemented cost function, the constraints need to be embedded in the cost function by intentionally increasing the value of the cost function if the constraints are not obeyed by the current solution. The constraints of MSP are implemented as a penalty value in the following cost function, which is used by the METSlib framework:

\[ \mathrm{cost}(X) = |X| + c \cdot \left( |U| - \left|\bigcup_{i=1}^{k} S_i X_i\right| \right) \]

where

\[ c = 10^{\lceil 1 + \log_{10} k \rceil} \]

In the above cost function, a high penalty factor is multiplied by the number of uncovered sequences (the count of the sequences in $U$ that are not covered by $\bigcup_{i=1}^{k} S_i X_i$) to enforce the full coverage constraint of MSP.
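The effect of the penalty factor can be checked with a small sketch. The Python fragment below evaluates the penalized cost on a hypothetical toy instance; the function names and data are illustrative and are not taken from class SAsol.

```python
import math
from typing import Dict, Set


def penalty_factor(k: int) -> int:
    """Penalty factor c = 10 ** ceil(1 + log10(k)) for k candidate motifs (k >= 1)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return 10 ** math.ceil(1 + math.log10(k))


def penalized_cost(selected: Set[str],
                   motifs: Dict[str, Set[str]],
                   universe: Set[str]) -> int:
    """Number of selected motifs plus c times the number of uncovered sequences."""
    covered = set().union(*(motifs[m] for m in selected)) & universe
    uncovered = len(universe) - len(covered)
    return len(selected) + penalty_factor(len(motifs)) * uncovered


# Hypothetical toy instance with k = 4 candidate motifs.
universe = {"seq1", "seq2", "seq3", "seq4"}
motifs = {
    "m1": {"seq1", "seq2"},
    "m2": {"seq2", "seq3"},
    "m3": {"seq3", "seq4"},
    "m4": {"seq4"},
}
```

Since $c \ge 10k$, leaving even a single sequence uncovered costs more than selecting all $k$ motifs, so any full cover is cheaper than any incomplete selection.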

2.7 Adjustable Parameters of SA Implementation for MSP

Basic algorithm parameters exposed by METSlib for simulated annealing and the settings used in this thesis research project are shown in Table 2.1.

2.7.1 Iteration Termination criteria

Iteration termination criteria are used for determining whether the heuristic search process should be stopped after each iteration. Available options provided by METSlib are iteration-based criteria, threshold-based criteria, improvement-based criteria, and infinite loop criteria.

Table 2.1: Parameter Settings for Simulated Annealing Algorithm

    Parameter                               Setting
    iteration termination criteria          threshold termination criteria
    neighborhood size                       adaptive to input size
    cooling schedule                        exponential cooling
    Tstart (start temperature)              50K
    Tend (end temperature)                  zero(0)
    K (Boltzmann distribution parameter)    1 (use framework default value)

Iteration-based criteria (class mets::iteration_termination_criteria): the search process terminates when the number of iterations reaches the pre-defined iteration limit. This termination criterion stops the search process at a clearly defined limit, which makes it easy to control the upper limit on resources and running time.

Threshold-based criteria (class mets::threshold_termination_criteria): the iteration stops when the cost value of the current solution reaches the pre-defined threshold (the cost value of the current solution is smaller than the threshold). This termination criterion can be used to stop the iteration based on special requirements from end users, for example, a certain level of coverage.

Improvement-based criteria (class mets::noimprove_termination_criteria): the search process terminates if no solution better than the current solution can be found within a pre-defined number of iterations. This termination criterion can be used to safely terminate the search iterations.

Infinite loop criteria (class mets::forever_termination_criteria): the search iteration is never stopped. This termination criterion can be used for prolonging the search process in a large search space.

2.7.2 Cooling Schedule

The cooling schedule defines the scheme for decreasing the temperature as the iteration goes on. The temperature determines the probability of choosing a worse-than-current solution, which is the property of the simulated annealing algorithm that allows it to escape local optima in the search space. Reducing the temperature (“cooling”) as the search process goes on prevents worse solutions from being picked at the later stage of the iteration and helps ensure that a good solution is returned when the search process ends. Available options provided by METSlib are exponential cooling and linear cooling. Exponential cooling (class mets::exponential_cooling) uses an exponential formula to decrease the value of the temperature. The METSlib implementation of the exponential cooling schedule is shown in Code A.1.2. The linear cooling schedule (class mets::linear_cooling) uses a linear formula to decrease the value of the temperature. The METSlib implementation of the linear cooling schedule is shown in Code A.1.3.

2.7.3 Temperature Tstart and Tend

Temperature is an important parameter for simulated annealing algorithm. The iteration will not continue if the value of current temperature is lower than or equal to the value of stopping temperature. Since the value of current temperature decreases at the end of each iteration step, the simulated annealing algorithm is guaranteed to stop and provide the best solution it gets from the annealing process. Usually Tstart denotes the value of the starting temperature, and Tend denotes the value of the stopping temperature.

2.7.4 Boltzmann Distribution Parameter K

The Boltzmann distribution is used for the drawing operation (“sweeptaking”), which determines whether a worse solution should be picked in case no better solutions were found in the current neighborhood.

The following formula for the Boltzmann distribution is converted from the METSlib source code (Code A.1.1):

\[ p = e^{-\frac{\Delta}{K \cdot T}} \]

where $T$ denotes the current temperature, $\Delta$ denotes the cost value difference between the current solution and the evaluated neighboring solution, and $K$ denotes the algorithm parameter.

3 Evaluation Using ENCODE Data

This chapter covers the evaluation of simulated annealing implementation of MSP solver with ENCODE datasets.

3.1 Overview

The Encyclopedia of DNA Elements project, also known as the “ENCODE project”, is a comprehensive international collaborative research project focusing on the human genome.[93] The practice of testing novel algorithms and methods against ENCODE datasets and comparing the results with previous studies is generally considered a “gold standard” when designing and developing solutions for motif discovery, motif selection, and derived real world applications. The ENCODE Project has large scale datasets that come from real world experiments, because this project aims to identify all functional elements at various gene regulatory levels by applying many technologies[94] (Figure 3.1). Using ENCODE Project data instead of synthesized data to evaluate the solvers for the motif selection problem (MSP) avoids the pitfalls of data synthesizing methods and indicates the performance on real world applications.

3.2 Datasets

The sequence data used in this study are from the ENCODE Project ChIP-Seq transcription factor datasets[93, 94], and the selections of transcription factor groups were reported by a previous study[69]. The 51 core ENCODE transcription factor groups were picked in order to compare the results with a previous study[70]. The basic information about these 51 ENCODE transcription factor groups is shown in Table B.1. According to the general pipeline described in Section 1.2.4, some data pre-processing steps (motif discovery, motif scanning) are needed to obtain the input datasets for the simulated annealing MSP solver. The discovery of potential motifs from ENCODE sequence data was performed with an ensemble method that utilizes multiple motif discovery tools (MDscan[53], MEME[55], Trawler[54], Weeder[52], and some ensemble motif discovery tools described in references[69, 95]) to generate the motif collections and merge them together. The occurrences of each given motif in the sequences were collected by FIMO from the MEME toolkit[96]. Motif selection was performed on the training sequence set. After the motif selection step, the selected motifs were scanned against the testing sequence set with FIMO, and the final coverage was calculated for the feature set (selected motifs) against the testing sequence set. This training-testing paradigm was also used by a previous study[70].
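The coverage calculation on the testing sequence set described above can be sketched as follows. The sketch assumes that the motif scanning results have already been reduced to a mapping from motif name to the set of test sequences it occurs in; that data layout, the function name `sequence_sensitivity`, and the toy values are hypothetical.

```python
from typing import Dict, Iterable, Set


def sequence_sensitivity(selected: Iterable[str],
                         hits_on_test: Dict[str, Set[str]],
                         test_sequences: Set[str]) -> float:
    """Fraction of test sequences that contain at least one of the selected motifs."""
    if not test_sequences:
        raise ValueError("the testing sequence set must not be empty")
    covered: Set[str] = set()
    for motif in selected:
        # Motifs without any reported occurrence contribute nothing.
        covered |= hits_on_test.get(motif, set())
    return len(covered & test_sequences) / len(test_sequences)


# Hypothetical scanning results on a testing set of four sequences.
test_sequences = {"t1", "t2", "t3", "t4"}
hits_on_test = {
    "m1": {"t1", "t2"},
    "m2": {"t2", "t3"},
}
ssn = sequence_sensitivity(["m1", "m2"], hits_on_test, test_sequences)
```

In this toy example the two selected motifs together occur in three of the four test sequences, giving a coverage of 0.75.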

Figure 3.1: Overview of ENCODE Project. The major datasets and methodologies for collecting and analyzing the data. (Image from https://www.encodeproject.org/)

3.3 Parameters

Basic algorithm parameters exposed by METSlib for simulated annealing and the settings used in this pilot study are shown in Table 3.1. For each dataset, simulated annealing process was performed 100 times in order to gain possible baseline results. The sequence coverage is used as an independent variable for evaluating the performance characteristics of simulated annealing algorithm. Three levels (100%, 85%, and 70%) were chosen in this thesis study. 100% sequence coverage means all the sequences within the training sequence set must be covered by the feature set (selected motifs). 85% sequence coverage means at least 85% of all the sequences within the training sequence set must be covered by the feature set (selected motifs), and 70% sequence coverage means at least 70% of all the sequences within the training sequence set must be covered by the feature set (selected motifs). The sequence coverage setting enforces the minimum sequence coverage on the training sequence set, because enforcing the exact sequence coverage value is NP-complete which is a huge cost that should not be introduced in this circumstance.
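The three coverage levels can be related to the relaxed cost function of Section 2.4 with a short sketch. The fragment below is an illustration on hypothetical toy data: `relaxed_cost` and the motif labels are not taken from the thesis code, and the coverage test is written as a ratio comparison, which is equivalent to comparing the covered count with $r\,|U|$.

```python
from typing import Dict, Set


def relaxed_cost(selected: Set[str],
                 motifs: Dict[str, Set[str]],
                 universe: Set[str],
                 r: float = 1.0) -> float:
    """Relaxed cost: a selection is feasible once it covers at least a fraction r of the universe."""
    if not 0.0 < r <= 1.0:
        raise ValueError("r must lie in (0, 1]")
    if not selected:
        return 1e5
    covered = set().union(*(motifs[m] for m in selected)) & universe
    coverage = len(covered) / len(universe)
    if coverage >= r:
        return float(len(selected))
    return len(selected) + 1 - coverage


# Hypothetical toy instance: ten sequences, three motifs covering 7, 2, and 1 of them.
universe = {f"s{i}" for i in range(10)}
motifs = {
    "a": {f"s{i}" for i in range(7)},
    "b": {"s7", "s8"},
    "c": {"s9"},
}
```

In this toy instance one motif is enough at the 70% level, two motifs are needed at the 85% level, and all three are needed for 100% coverage, which mirrors the trade-off between feature set size and required coverage.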

Table 3.1: Parameter Settings for ENCODE Datasets

    Parameter                               Value
    iteration termination criteria          threshold termination
    neighborhood size                       the size of the input motif set
    cooling schedule                        exponential
    Tstart (start temperature)              1.2e5
    Tend (end temperature)                  0
    K (Boltzmann distribution parameter)    1.0

3.4 Results

The complete testing results of ENCODE datasets are shown in Appendix B.1. Two major aspects of the results, feature set size (the size of the selected motif sets) and sequence sensitivity (sSn, the coverage on testing sequence set), are compared with the results of Greedy and RILP methods from previous study[70] in order to evaluate the properties and performance of simulated annealing MSP solver. The complete comparison on the results are shown in Table B.2 and B.3.

[Boxplot: feature set size, axis from 0 to 70, for SA, SAr85, SAr70, Greedy, and RILP]

Figure 3.2: Boxplot of the feature set size (number of selected motifs). SA, simulated annealing; RILP, relaxed integer linear programming. (Data of Greedy and RILP methods were from [70])

[Line plot: feature set size (# of selected motifs), axis from 0 to 60, for each TF group, with lines for SA, SAr85, SAr70, Greedy, and RILP]

Figure 3.3: Line plot of the feature set size (number of selected motifs). SA, simulated annealing; RILP, relaxed integer linear programming. (Data of Greedy and RILP methods were from [70])

For overall performance on feature set size (number of selected motifs), the result is shown in Figure 3.2. Feature set size on individual transcription factor group is shown in Figure 3.3. The overall performance on sequence sensitivity (sSn) is shown in Figure 3.4, and sequence sensitivity on individual transcription factor groups is shown in Figure 3.5.

3.5 Analysis on Results

The results show that the simulated annealing based MSP solver implemented in this thesis project works correctly. The performance (feature set size, sequence sensitivity) of the simulated annealing algorithm is between that of Greedy and RILP. The simulated annealing algorithm returns feasible solutions for each transcription factor group dataset, and the running time is acceptable. The results given by the simulated annealing algorithm are close to those from the RILP algorithm (see the SA and RILP lines in Figure 3.3 and Figure 3.5).

[Boxplot: sequence sensitivity (sSn), axis from 0.0 to 1.0, for SA, SAr85, SAr70, Greedy, and RILP]

Figure 3.4: Boxplot of the sequence sensitivity (sSn, coverage on testing sequence set). SA, simulated annealing; RILP, relaxed integer linear programming. (Data of Greedy and RILP methods were from [70])

The characteristics of the simulated annealing algorithm on MSP are also reflected in the results.

Stochastic search: the traversal of the search space is stochastic, and sometimes the search process ends at a local optimum.

Feature set oriented optimization: accepting a feasible solution depends on the cost of this solution instead of on including or excluding a specific motif. From the modeling of MSP and the implementation design, the simulated annealing algorithm evaluates the cost of the given feasible solution and determines whether this solution should be accepted. This behavior is similar (or equivalent) to RILP, but different from the Greedy algorithm, which is individual motif oriented: the Greedy algorithm checks the coverage of each motif and the incremental coverage gained when adding another motif to the feature set. Different optimization strategies result in different behaviors and results, as shown in Figure 3.3 and Figure 3.5.

[Line plot: sSn (coverage), axis from 0.0 to 1.0, for each TF group, with lines for SA, SAr85, SAr70, Greedy, and RILP]

Figure 3.5: Line plot of the sequence sensitivity (sSn, coverage on testing sequence set). SA, simulated annealing; RILP, relaxed integer linear programming. (Data of Greedy and RILP methods were from [70])

In this evaluation of the simulated annealing MSP solver, some pilot investigations on partial set cover (PSC) modeling were also performed, and the results are denoted as “SAr85” and “SAr70”. The PSC modeling was achieved by adjusting the sequence coverage term of the cost function so that it is capped at the pre-defined level. The investigations tried to extract more representative motifs by reducing the required sequence coverage on the training sequence set. “SAr85” requires that the feasible solution (selected motif set) has at least 85% coverage on the training sequence set (at least 85% of the sequences in the training sequence set are covered by the selected motif set), and “SAr70” requires that the feasible solution has at least 70% coverage on the training sequence set. Accordingly, “SA” requires that the feasible solution has 100% sequence coverage on the training sequence set. The results of the general comparison of “SAr85”, “SAr70”, Greedy, and RILP are shown in Figure 3.2 (feature set size) and Figure 3.4 (sequence sensitivity). As the sequence coverage limit decreases, the feature set size can be reduced dramatically, and the sequence sensitivity also decreases. For each individual transcription factor group, the feature set size and sequence sensitivity for “SAr85” and “SAr70” have different trends.
The results for feature set size move toward the Greedy results, and the quantities become comparable with the Greedy results (shown in Figure 3.3), which is promising. On the other hand, the results for sequence sensitivity drop substantially, and only “SAr85” is comparable with the Greedy method (shown in Figure 3.5). A comprehensive comparison of both feature set size and sequence sensitivity for the three algorithms indicates the performance characteristics of SA under different relaxation levels (Figures 3.6, 3.7, and 3.8). The trend across SA relaxation levels is shown clearly in these figures: when no relaxation is allowed, the results reported by SA are similar to those reported by RILP; as the relaxation increases, the results reported by SA shift toward those reported by Greedy, but the SA results become worse when the relaxation is too loose. This comprehensive comparison also suggests that “SAr85” is a relatively good choice (or compromise) between the two aspects (feature set size and sequence sensitivity).
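For reference, the motif-oriented strategy that the analysis above contrasts with simulated annealing can be sketched as the classic greedy set cover heuristic. This is an illustrative sketch only: it is not necessarily the Greedy implementation evaluated in [70], and the function name, tie-breaking rule, and toy data are assumptions.

```python
from typing import Dict, List, Set


def greedy_cover(motifs: Dict[str, Set[str]], universe: Set[str]) -> List[str]:
    """Repeatedly pick the motif that covers the most still-uncovered sequences."""
    uncovered = set(universe)
    chosen: List[str] = []
    while uncovered:
        # Sorting the names makes tie-breaking deterministic (first name wins).
        best = max(sorted(motifs),
                   key=lambda m: len(motifs[m] & uncovered),
                   default=None)
        if best is None or not (motifs[best] & uncovered):
            break  # no motif adds coverage; the remaining sequences cannot be covered
        chosen.append(best)
        uncovered -= motifs[best]
    return chosen


# Hypothetical toy instance.
universe = {"s1", "s2", "s3", "s4"}
motifs = {
    "m1": {"s1", "s2", "s3"},
    "m2": {"s3", "s4"},
    "m3": {"s4"},
    "m4": {"s1"},
}
selection = greedy_cover(motifs, universe)
```

Each greedy step looks only at the incremental coverage of a single motif, whereas the simulated annealing solver accepts or rejects a whole candidate feature set based on its cost.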

[Scatter plot: feature set size (m), axis from 0 to 60, versus sequence sensitivity (sSn), axis from 0.2 to 1.0, for SA, Greedy, and RILP]

Figure 3.6: Combined comparison with feature set size (m) and sequence sensitivity (sSn) for SA. The points that represent the solutions provided by different algorithms (Greedy, RILP, and SA) to the same TF group are connected with dotted line.

3.6 Biological Insights of Selected Motifs

In order to investigate the potential biological insights for the motifs selected by the simulated annealing MSP solver, the selected motifs of some groups of the test dataset were examined by querying them against a publicly available motif database with TOMTOM[97]. The criteria for picking motif groups were based on the performance of SA compared with the Greedy and RILP methods: feature set size and sequence sensitivity of the selected motifs. Motif groups “BATF” and “PBX3” were picked as examples here: motif group BATF was one of the groups for which SA achieves very good performance compared with Greedy and RILP, while motif group PBX3 was one of a few groups for which SA returns the worst solutions compared with Greedy and RILP.

[Scatter plot: feature set size (m), axis from 0 to 60, versus sequence sensitivity (sSn), axis from 0.2 to 1.0, for SAr85, Greedy, and RILP]

Figure 3.7: Combined comparison with feature set size (m) and sequence sensitivity (sSn) for SAr85. The points that represent the solutions provided by different algorithms (Greedy, RILP, and SA) to the same TF group are connected with dotted line.

[Scatter plot: feature set size (m), axis from 0 to 60, versus sequence sensitivity (sSn), axis from 0.2 to 1.0, for SAr70, Greedy, and RILP]

Figure 3.8: Combined comparison with feature set size (m) and sequence sensitivity (sSn) for SAr70. The points that represent the solutions provided by different algorithms (Greedy, RILP, and SA) to the same TF group are connected with dotted line.

The TOMTOM query results for these two groups were informative. For the good performance group (BATF), the SA selected motifs are shown in Table B.4. These motifs have strong alignments with the matched motifs (shown in Table B.5), and the biological correlations between motifs are significant: TOMTOM reports matches from the same group, BATF, and from closely related groups (for example BATF3). For the bad performance group (PBX3), the SA selected motifs are shown in Table B.6. Even though the selected motifs for this group have high numbers of matches reported by TOMTOM, the actual alignments to known motifs are poor (see Table B.7). The motifs picked by SA contain GC-rich sequence (for instance, the motif “PBX3 AlignACE 3”), which is common in regulatory elements and conserved regions of the genome. Therefore the selected motifs can easily have high numbers of matches reported by TOMTOM, and many matches are from motifs that have less biological correlation with PBX3. This circumstance indicates that although the selected motifs can have good coverage on the training sequence set (since that is the criterion the SA solver uses to pick those motifs), they may have poor sequence sensitivity on the testing sequence set due to their non-specific sequence characteristics. The result shows that the SA solver for MSP is agnostic to biological background, because the modeling of MSP discussed in this thesis work is based on the coverage model. This kind of limitation can only be resolved when biological background information is encoded in the MSP model.

4 Conclusion and Future Work

4.1 Conclusion

The motif selection problem is closely tied to biological applications that investigate genetic regulatory networks in living organisms. Building solvers for the motif selection problem (MSP) can benefit biological research by picking the most representative motifs. Because MSP is an NP-hard problem, exact methods are either infeasible or impractical, and approximation algorithms such as greedy and simulated annealing are the available options. In this thesis, a simulated annealing based MSP solver was built on top of the METSlib framework. Several simulated annealing parameters are exposed to users, which allows end users to set up application-specific configurations. The implementation was evaluated with real-world experimental data from the ENCODE Project ChIP-Seq TF datasets, and the evaluation results were compared with the results of previous studies. The evaluation shows that the simulated annealing based MSP solver is functional. Two aspects of the results (feature set size and sequence sensitivity, sSn) were used for the comparison with other algorithms. The comparison with the greedy and relaxed integer linear programming (RILP) algorithms revealed some behavioral characteristics of the simulated annealing MSP solver: its optimization behavior is similar to that of RILP, its consumption of memory and CPU resources is predictable, and it runs quickly. The cooling speed determines the upper limit on the number of search iterations. In this thesis, the exponential cooling schedule prolonged the search process and attempted to get the best solution from the search space. By contrast, the linear cooling schedule cut out too soon and often returned worse solutions than the exponential cooling schedule. For real-world applications, the detailed parameters of the cooling schedules may be adjusted to fit the actual requirements, such as the approximation ratio.

In addition to the Set Cover Problem (SCP), which is equivalent to the sequence coverage model used in this simulated annealing implementation, some pilot investigations of the Partial Set Cover problem (PSC) model were carried out in this thesis, and the corresponding results were also presented. For partial sequence coverage, the stochastic property of the simulated annealing algorithm shows up and dominates the final optimization results. This behavior may be observed on the datasets of the ETS, NRF1, and PBX3 TF groups (see Figure 3.5). For these three TF groups, the feature sets (selected motifs) have much lower sequence coverage on the corresponding testing sequence sets. One potential explanation for the low sequence coverage on the testing sequence sets is that the neighborhood parameter of the simulated annealing algorithm was set to the candidate motif set size, which affects the variety of changes that may be made to the current solution. Although these results with PSC modeling may be counter-productive, they may be useful for removing poorly representative motifs from the candidate motif collection as a “reverse filter”.
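The relationship between the cooling schedule and the number of search iterations described above can be illustrated with a small sketch. It is not the METSlib-based implementation from this thesis: the function names, the penalty-based cost, the single-flip neighborhood, and all parameter values (initial temperature, cooling rate, stopping temperature) are assumptions chosen only for illustration.

```python
import math
import random


def exponential_schedule(t0, alpha):
    """Yield t0, t0*alpha, t0*alpha**2, ... (assumes 0 < alpha < 1)."""
    t = t0
    while True:
        yield t
        t *= alpha


def linear_schedule(t0, delta):
    """Yield t0, t0-delta, t0-2*delta, ... (assumes delta > 0)."""
    t = t0
    while True:
        yield t
        t -= delta


def cost(selected, subsets, universe, penalty):
    """Number of selected subsets plus a penalty per uncovered element."""
    covered = set()
    for i in selected:
        covered |= subsets[i]
    return len(selected) + penalty * len(universe - covered)


def anneal(subsets, universe, schedule, t_min=1e-3, penalty=10.0, seed=0):
    """Toy simulated annealing for set cover; returns (best, iterations)."""
    rng = random.Random(seed)
    current = set(range(len(subsets)))   # start with every subset selected
    current_cost = cost(current, subsets, universe, penalty)
    best, best_cost = set(current), current_cost
    iterations = 0
    for t in schedule:
        if t <= t_min:                   # the schedule bounds the search length
            break
        iterations += 1
        candidate = set(current)
        candidate ^= {rng.randrange(len(subsets))}   # flip one subset in or out
        cand_cost = cost(candidate, subsets, universe, penalty)
        delta = cand_cost - current_cost
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_cost = candidate, cand_cost
            if current_cost < best_cost:
                best, best_cost = set(current), current_cost
    return best, iterations
```

Under these assumed settings, calling `anneal` with `exponential_schedule(10.0, 0.99)` allows 917 iterations before the temperature falls below `t_min`, whereas `linear_schedule(10.0, 0.1)` stops after 100. The linear schedule therefore gives the search far fewer chances to leave a poor solution, which is consistent with the behavior summarized above.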

4.2 Future Work

Future work on simulated annealing based motif selection may focus on the following directions: algorithm model, cost function, neighborhood size, and cooling schedule. Enhancements in these aspects may help improve algorithm performance and achieve results closer to the global optimum.

This thesis builds a simulated annealing based motif selection problem solver for the sequence coverage model, which is equivalent to the Set Cover Problem (SCP). Replacing SCP with the Partial Set Cover problem (PSC) may be the immediate next step in this direction of study. As indicated by the pilot results presented in this thesis, for partial sequence coverage the stochastic property of the simulated annealing algorithm shows up and dominates the final optimization results. Finding methods to get stable results and avoid local optima is the key to success. Building a reverse filter that uses the partial sequence coverage model to remove motifs with low coverage on the testing sequence set would be another potential application of the simulated annealing algorithm to the motif selection problem. This methodology may become effective when considering the discriminative perspective of motif selection, in which both the foreground sequence coverage and the background sequence coverage are used to pick motifs from the candidate set. Extending the current implementation to consider both foreground and background sequence coverage (a model equivalent to positive-negative partial set cover, PNPSC) is therefore one possible direction for future work. From another point of view, remodeling MSP with biological background information can extend the capability of the simulated annealing algorithm so that the biological correlation of motifs may be evaluated when picking motifs.

The cost function used in this thesis combines the sequence coverage and the feature set size. This approach may become an obstacle when adding background coverage to the cost function.
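The difficulty of folding background coverage into a single scalar cost can be illustrated with a minimal sketch. The formula, the weights `w_size` and `w_bg`, and the names `coverage` and `combined_cost` are hypothetical and are not the cost function implemented in this thesis; they only show how the preferred feature set can change once a background term is added.

```python
def coverage(selected, hits, n_sequences):
    """Fraction of sequences hit by at least one selected motif."""
    if n_sequences == 0:
        return 0.0
    covered = set()
    for motif in selected:
        covered |= hits.get(motif, set())
    return len(covered) / n_sequences


def combined_cost(selected, fg_hits, n_fg, bg_hits=None, n_bg=0,
                  w_size=0.01, w_bg=1.0):
    """Scalar cost (lower is better), an assumed illustrative form:

    (1 - foreground coverage) + w_size * |selected| + w_bg * background coverage

    The weights are arbitrary assumptions, not values from the thesis.
    """
    cost = 1.0 - coverage(selected, fg_hits, n_fg)
    cost += w_size * len(selected)
    if bg_hits is not None:
        cost += w_bg * coverage(selected, bg_hits, n_bg)
    return cost
```

As a usage example, suppose a motif `m3` hits all four foreground sequences and all four background sequences, while `m1` and `m2` together hit all four foreground sequences but only one background sequence. Without the background term, `{m3}` has the lower cost because it is the smaller set; with the background term, `{m1, m2}` is preferred. The outcome depends entirely on the chosen weights, which is one reason a scalar combination may become hard to tune as more dimensions are added.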
For high-dimensional cost function parameters (foreground coverage, background coverage, feature set size, etc.), the cost function needs to be reconstructed, because the high dimensionality of the input data should be utilized in calculating the cost of a given feasible solution. Some previous studies [98] may be a good starting point.

Fine-tuning the neighborhood size may be an application-specific setting that affects the performance of the simulated annealing MSP solver. With concurrent computation, running the simulated annealing algorithm on the same input set with different neighborhood sizes in parallel may speed up the solution search and harvest alternative solutions.

The cooling schedule plays the key role in controlling the optimization process. Prolonging the cooling period at intermediate temperature levels may help the search process escape local optima. Adaptive simulated annealing (ASA) [99–102] can control the cooling schedule within the process, which makes it a good fit for non-linear global search spaces. Adding a warm-up process at the beginning of the simulated annealing process is another approach [103–105]. A warm-up process allows extreme mutations to be generated, which helps the search process escape from locally optimal solutions and search widely across the search space.

References

[1] F. Griffith. The Significance of Pneumococcal Types. J Hyg (Lond), 27(2):113–159, Jan 1928. 11

[2] O. T. Avery, C. M. Macleod, and M. McCarty. Studies on the Chemical Nature of the Substance Inducing Transformation of Pneumococcal Types : Induction of Transformation by a Desoxyribonucleic Acid Fraction Isolated from Pneumococcus Type III. J Exp Med, 79(2):137–158, Feb 1944. 11

[3] James. D. Watson and Francis. H. Crick. Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature, 171(4356):737–738, Apr 1953. 11

[4] Frederick. Sanger and E O P. Thompson. The amino-acid sequence in the glycyl chain of insulin. I. The identification of lower peptides from partial hydrolysates. Biochem J, 53(3):353–366, Feb 1953. 11

[5] T. H. Morgan. Sex Limited Inheritance in Drosophila. Science, 32(812):120–122, Jul 1910. 11

[6] Jeremy Schmutz, Jeremy Wheeler, Jane Grimwood, Mark Dickson, Joan Yang, Chenier Caoile, Eva Bajorek, Stacey Black, Yee Man Chan, Mirian Denys, Julio Escobar, Dave Flowers, Dea Fotopulos, Carmen Garcia, Maria Gomez, Eidelyn Gonzales, Lauren Haydu, Frederick Lopez, Lucia Ramirez, James Retterer, Alex Rodriguez, Stephanie Rogers, Angelica Salazar, Ming Tsai, and Richard M. Myers. Quality assessment of the human genome sequence. Nature, 429(6990):365–368, May 2004. 11

[7] International HapMap Consortium. The International HapMap Project. Nature, 426(6968):789–796, December 2003. 11

[8] Hongzhu Qu and Xiangdong Fang. A Brief Review on the Human Encyclopedia of DNA Elements (ENCODE) Project. Genomics, Proteomics & Bioinformatics, 11(3):135–141, Jun 2013. 11

[9] M. D. Adams, S. E. Celniker, R. A. Holt, C. A. Evans, J. D. Gocayne, P. G. Amanatides, S. E. Scherer, P. W. Li, R. A. Hoskins, R. F. Galle, R. A. George, S. E. Lewis, S. Richards, M. Ashburner, S. N. Henderson, G. G. Sutton, J. R. Wortman, M. D. Yandell, Q. Zhang, L. X. Chen, R. C. Brandon, Y. H. Rogers, R. G. Blazej, M. Champe, B. D. Pfeiffer, K. H. Wan, C. Doyle, E. G. Baxter, G. Helt, C. R. Nelson, G. L. Gabor, J. F. Abril, A. Agbayani, H. J. An, C. Andrews-Pfannkoch, D. Baldwin, R. M. Ballew, A. Basu, J. Baxendale, L. Bayraktaroglu, E. M. Beasley, K. Y. Beeson, P. V. Benos, B. P. Berman, D. Bhandari, S. Bolshakov, D. Borkova, M. R. Botchan, J. Bouck, P. Brokstein, P. Brottier, K. C. Burtis, D. A. Busam, H. Butler, E. Cadieu, A. Center, I. Chandra, J. M. Cherry, S. Cawley, C. Dahlke, L. B. Davenport, P. Davies, B. de Pablos, A. Delcher, Z. Deng, A. D. Mays, I. Dew, S. M. Dietz, K. Dodson, L. E. Doup, M. Downes, S. Dugan-Rocha, B. C. Dunkov, P. Dunn, K. J. Durbin, C. C. Evangelista, C. Ferraz, S. Ferriera, W. Fleischmann, C. Fosler, A. E. Gabrielian, N. S. Garg, W. M. Gelbart, K. Glasser, A. Glodek, F. Gong, J. H. Gorrell, Z. Gu, P. Guan, M. Harris, N. L. Harris, D. Harvey, T. J. Heiman, J. R. Hernandez, J. Houck, D. Hostin, K. A. Houston, T. J. Howland, M. H. Wei, C. Ibegwam, M. Jalali, F. Kalush, G. H. Karpen, Z. Ke, J. A. Kennison, K. A. Ketchum, B. E. Kimmel, C. D. Kodira, C. Kraft, S. Kravitz, D. Kulp, Z. Lai, P. Lasko, Y. Lei, A. A. Levitsky, J. Li, Z. Li, Y. Liang, X. Lin, X. Liu, B. Mattei, T. C. McIntosh, M. P. McLeod, D. McPherson, G. Merkulov, N. V. Milshina, C. Mobarry, J. Morris, A. Moshrefi, S. M. Mount, M. Moy, B. Murphy, L. Murphy, D. M. Muzny, D. L. Nelson, D. R. Nelson, K. A. Nelson, K. Nixon, D. R. Nusskern, J. M. Pacleb, M. Palazzolo, G. S. Pittman, S. Pan, J. Pollard, V. Puri, M. G. Reese, K. Reinert, K. Remington, R. D. Saunders, F. Scheeler, H. Shen, B. C. Shue, I. Sidén-Kiamos, M. Simpson, M. P. Skupski, T. Smith, E. Spier, A. C. Spradling, M. Stapleton, R. Strong, E. Sun, R. Svirskas, C. Tector, R. Turner, E. Venter, A. H. Wang, X. Wang, Z. Y. Wang, D. A. Wassarman, G. M. Weinstock, J. Weissenbach, S. M. Williams, T. Woodage, K. C. Worley, D. Wu, S. Yang, Q. A. Yao, J. Ye, R. F. Yeh, J. S. Zaveri, M. Zhan, G. Zhang, Q. Zhao, L. Zheng, X. H. Zheng, F. N. Zhong, W. Zhong, X. Zhou, S. Zhu, X. Zhu, H. O. Smith, R. A. Gibbs, E. W. Myers, G. M. Rubin, and J. C. Venter. The genome sequence of Drosophila melanogaster. Science, 287(5461):2185–2195, Mar 2000. 11

[10] C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science, 282(5396):2012–2018, Dec 1998. 11

[11] F. R. Blattner, G Plunkett, 3rd, C. A. Bloch, N. T. Perna, V. Burland, M. Riley, J. Collado-Vides, J. D. Glasner, C. K. Rode, G. F. Mayhew, J. Gregor, N. W. Davis, H. A. Kirkpatrick, M. A. Goeden, D. J. Rose, B. Mau, and Y. Shao. The complete genome sequence of Escherichia coli K-12. Science, 277(5331):1453–1462, Sep 1997. 11

[12] Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature, 408(6814):796–815, Dec 2000. 11

[13] Stephen A. Goff, Darrell Ricke, Tien-Hung Lan, Gernot Presting, Ronglin Wang, Molly Dunn, Jane Glazebrook, Allen Sessions, Paul Oeller, Hemant Varma, David Hadley, Don Hutchison, Chris Martin, Fumiaki Katagiri, B Markus Lange, Todd Moughamer, Yu Xia, Paul Budworth, Jingping Zhong, Trini Miguel, Uta Paszkowski, Shiping Zhang, Michelle Colbert, Wei-lin Sun, Lili Chen, Bret Cooper, Sylvia Park, Todd Charles Wood, Long Mao, Peter Quail, Rod Wing, Ralph Dean, Yeisoo Yu, Andrey Zharkikh, Richard Shen, Sudhir Sahasrabudhe, Alun Thomas, Rob Cannings, Alexander Gutin, Dmitry Pruss, Julia Reid, Sean Tavtigian, Jeff Mitchell, Glenn Eldredge, Terri Scholl, Rose Mary Miller, Satish Bhatnagar, Nils Adey, Todd Rubano, Nadeem Tusneem, Rosann Robinson, Jane Feldhaus, Teresita Macalma, Arnold Oliphant, and Steven Briggs. A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science, 296(5565):92–100, Apr 2002. 11

[14] J. C. Venter, M. D. Adams, E. W. Myers, P. W. Li, R. J. Mural, G. G. Sutton, H. O. Smith, M. Yandell, C. A. Evans, R. A. Holt, J. D. Gocayne, P. Amanatides, R. M. Ballew, D. H. Huson, J. R. Wortman, Q. Zhang, C. D. Kodira, X. H. Zheng, L. Chen, M. Skupski, G. Subramanian, P. D. Thomas, J. Zhang, G. L. Gabor Miklos, C. Nelson, S. Broder, A. G. Clark, J. Nadeau, V. A. McKusick, N. Zinder, A. J. Levine, R. J. Roberts, M. Simon, C. Slayman, M. Hunkapiller, R. Bolanos, A. Delcher, I. Dew, D. Fasulo, M. Flanigan, L. Florea, A. Halpern, S. Hannenhalli, S. Kravitz, S. Levy, C. Mobarry, K. Reinert, K. Remington, J. Abu-Threideh, E. Beasley, K. Biddick, V. Bonazzi, R. Brandon, M. Cargill, I. Chandramouliswaran, R. Charlab, K. Chaturvedi, Z. Deng, V. Di Francesco, P. Dunn, K. Eilbeck, C. Evangelista, A. E. Gabrielian, W. Gan, W. Ge, F. Gong, Z. Gu, P. Guan, T. J. Heiman, M. E. Higgins, R. R. Ji, Z. Ke, K. A. Ketchum, Z. Lai, Y. Lei, Z. Li, J. Li, Y. Liang, X. Lin, F. Lu, G. V. Merkulov, N. Milshina, H. M. Moore, A. K. Naik, V. A. Narayan, B. Neelam, D. Nusskern, D. B. Rusch, S. Salzberg, W. Shao, B. Shue, J. Sun, Z. Wang, A. Wang, X. Wang, J. Wang, M. Wei, R. Wides, C. Xiao, C. Yan, A. Yao, J. Ye, M. Zhan, W. Zhang, H. Zhang, Q. Zhao, L. Zheng, F. Zhong, W. Zhong, S. Zhu, S. Zhao, D. Gilbert, S. Baumhueter, G. Spier, C. Carter, A. Cravchik, T. Woodage, F. Ali, H. An, A. Awe, D. Baldwin, H. Baden, M. Barnstead, I. Barrow, K. Beeson, D. Busam, A. Carver, A. Center, M. L. Cheng, L. Curry, S. Danaher, L. Davenport, R. Desilets, S. Dietz, K. Dodson, L. Doup, S. Ferriera, N. Garg, A. Gluecksmann, B. Hart, J. Haynes, C. Haynes, C. Heiner, S. Hladun, D. Hostin, J. Houck, T. Howland, C. Ibegwam, J. Johnson, F. Kalush, L. Kline, S. Koduru, A. Love, F. Mann, D. May, S. McCawley, T. McIntosh, I. McMullen, M. Moy, L. Moy, B. Murphy, K. Nelson, C. Pfannkoch, E. Pratts, V. Puri, H. Qureshi, M. Reardon, R. Rodriguez, Y. H. Rogers, D. Romblad, B. Ruhfel, R. Scott, C. Sitter, M. 
Smallwood, E. Stewart, R. Strong, E. Suh, R. Thomas, N. N. Tint, S. Tse, C. Vech, G. Wang, J. Wetter, S. Williams, M. Williams, S. Windsor, E. Winn-Deen, K. Wolfe, J. Zaveri, K. Zaveri, J. F. Abril, R. Guigó, M. J. Campbell, K. V. Sjolander, B. Karlak, A. Kejariwal, H. Mi, B. Lazareva, T. Hatton, A. Narechania, K. Diemer, A. Muruganujan, N. Guo, S. Sato, V. Bafna, S. Istrail, R. Lippert, R. Schwartz, B. Walenz, S. Yooseph, D. Allen, A. Basu, J. Baxendale, L. Blick, M. Caminha, J. Carnes-Stine, P. Caulk, Y. H. Chiang, M. Coyne, C. Dahlke, A. Mays, M. Dombroski, M. Donnelly, D. Ely, S. Esparham, C. Fosler, H. Gire, S. Glanowski, K. Glasser, A. Glodek, M. Gorokhov, K. Graham, B. Gropman, M. Harris, J. Heil, S. Henderson, J. Hoover, D. Jennings, C. Jordan, J. Jordan, J. Kasha, L. Kagan, C. Kraft, A. Levitsky, M. Lewis, X. Liu, J. Lopez, D. Ma, W. Majoros, J. McDaniel, S. Murphy, M. Newman, T. Nguyen, N. Nguyen, M. Nodell, S. Pan, J. Peck, M. Peterson, W. Rowe, R. Sanders, J. Scott,

M. Simpson, T. Smith, A. Sprague, T. Stockwell, R. Turner, E. Venter, M. Wang, M. Wen, D. Wu, M. Wu, A. Xia, A. Zandieh, and X. Zhu. The sequence of the human genome. Science, 291(5507):1304–1351, Feb 2001. 11

[15] Tracy Tucker, Marco Marra, and Jan M. Friedman. Massively parallel sequencing: the next big thing in genetic medicine. Am J Hum Genet, 85(2):142–154, Aug 2009. 11

[16] Ayman Grada and Kate Weinbrecht. Next-Generation Sequencing: Methodology and Application. Journal of Investigative Dermatology, 133(8):e11, Aug 2013. 11, 12

[17] Helen Davies, Graham R Bignell, Charles Cox, Philip Stephens, Sarah Edkins, Sheila Clegg, Jon Teague, Hayley Woffendin, Mathew J Garnett, William Bottomley, Neil Davis, Ed Dicks, Rebecca Ewing, Yvonne Floyd, Kristian Gray, Sarah Hall, Rachel Hawes, Jaime Hughes, Vivian Kosmidou, Andrew Menzies, Catherine Mould, Adrian Parker, Claire Stevens, Stephen Watt, Steven Hooper, Rebecca Wilson, Hiran Jayatilake, Barry A Gusterson, Colin Cooper, Janet Shipley, Darren Hargrave, Katherine Pritchard-Jones, Norman Maitland, Georgia Chenevix-Trench, Gregory J Riggins, Darell D Bigner, Giuseppe Palmieri, Antonio Cossu, Adrienne Flanagan, Andrew Nicholson, Judy W C Ho, Suet Y Leung, Siu T Yuen, Barbara L Weber, Hilliard F Seigler, Timothy L Darrow, Hugh Paterson, Richard Marais, Christopher J Marshall, Richard Wooster, Michael R Stratton, and P Andrew Futreal. Mutations of the braf gene in human cancer. Nature, 417:949–954, June 2002. 11, 15

[18] Emma R Cantwell-Dorris, John J O’Leary, and Orla M Sheils. Brafv600e: implications for carcinogenesis and molecular therapy. Molecular cancer therapeutics, 10:385–394, March 2011.

[19] Lauren L Ritterhouse and Justine A Barletta. Braf v600e mutation-specific antibody: A review. Seminars in diagnostic pathology, 32:400–408, September 2015. 11

[20] Janet L Maldonado, Jane Fridlyand, Hetal Patel, Ajay N Jain, Klaus Busam, Toshiro Kageshita, Tomomichi Ono, Donna G Albertson, Dan Pinkel, and Boris C Bastian. Determinants of braf mutations in primary melanomas. Journal of the National Cancer Institute, 95:1878–1890, December 2003. 11

[21] Paolo A Ascierto, John M Kirkwood, Jean-Jacques Grob, Ester Simeone, Antonio M Grimaldi, Michele Maio, Giuseppe Palmieri, Alessandro Testori, Francesco M Marincola, and Nicola Mozzillo. The role of braf v600 mutation in melanoma. Journal of translational medicine, 10:85, July 2012. 11

[22] Jana Vandrovcova, Kristina Lagerstedt-Robinsson, Lars Påhlman, and Annika Lindblom. Somatic braf-v600e mutations in familial colorectal cancer. Cancer epidemiology, biomarkers & prevention : a publication of the American Association for Cancer Research, cosponsored by the American Society of Preventive Oncology, 15:2270–2273, November 2006. 12

[23] Kevin J Spring, Zhen Zhen Zhao, Rozemary Karamatic, Michael D Walsh, Vicki L J Whitehall, Tanya Pike, Lisa A Simms, Joanne Young, Michael James, Grant W Montgomery, Mark Appleyard, David Hewett, Kazutomo Togashi, Jeremy R Jass, and Barbara A Leggett. High prevalence of sessile serrated adenomas with braf mutations: a prospective study of patients undergoing colonoscopy. Gastroenterology, 131:1400–1407, November 2006.

[24] Sérgia Velho, Cátia Moutinho, Luís Cirnes, Cristina Albuquerque, Richard Hamelin, Fernando Schmitt, Fátima Carneiro, Carla Oliveira, and Raquel Seruca. Braf, kras and pik3ca mutations in colorectal serrated polyps and cancer: primary or secondary genetic events in colorectal carcinogenesis? BMC cancer, 8:255, September 2008.

[25] Tyler A Wish, Angela J Hyde, Patrick S Parfrey, Jane S Green, H Banfield Younghusband, Michelle I Simms, Dan G Fontaine, Elizabeth L Dicks, Susan N Stuckless, Steven Gallinger, John R McLaughlin, Michael O Woods, and Roger C Green. Increased cancer predisposition in family members of colorectal cancer patients harboring the p.v600e braf mutation: a population-based study. Cancer epidemiology, biomarkers & prevention : a publication of the American Association for Cancer Research, cosponsored by the American Society of Preventive Oncology, 19:1831–1839, July 2010. 12, 15

[26] Thierry Fozing, Claudia Scheuer, and Samuel Samnick. Synthesis and initial tumor affinity testing of iodine-123 labelled egfr-affine agents as potential imaging probes for hormone-refractory prostate cancer. European journal of medicinal chemistry, 45:3780–3786, September 2010. 12

[27] Garrett M Frampton, Alex Fichtenholtz, Geoff A Otto, Kai Wang, Sean R Downing, Jie He, Michael Schnall-Levin, Jared White, Eric M Sanford, Peter An, James Sun, Frank Juhn, Kristina Brennan, Kiel Iwanik, Ashley Maillet, Jamie Buell, Emily White, Mandy Zhao, Sohail Balasubramanian, Selmira Terzic, Tina Richards, Vera Banning, Lazaro Garcia, Kristen Mahoney, Zac Zwirko, Amy Donahue, Himisha Beltran, Juan Miguel Mosquera, Mark A Rubin, Snjezana Dogan, Cyrus V Hedvat, Michael F Berger, Lajos Pusztai, Matthias Lechner, Chris Boshoff, Mirna Jarosz, Christine Vietz, Alex Parker, Vincent A Miller, Jeffrey S Ross, John Curran, Maureen T Cronin, Philip J Stephens, Doron Lipson, and Roman Yelensky. Development and validation of a clinical cancer genomic profiling test based on massively parallel dna sequencing. Nature biotechnology, 31:1023–1031, November 2013. 12

[28] Marina N Nikiforova, Abigail I Wald, Somak Roy, Mary Beth Durso, and Yuri E Nikiforov. Targeted next-generation sequencing panel (thyroseq) for detection of mutations in thyroid cancer. The Journal of clinical endocrinology and metabolism, 98(11):E1852–60, Nov 2013. 12

[29] A. Williamson and D. D. Spencer. Electrophysiological characterization of CA2 pyramidal cells from epileptic humans. Hippocampus, 4(2):226–237, Apr 1994. 12

[30] Fei Gao, Jiping Zhang, Xinde Sun, and Liang Chen. The effect of postnatal exposure to noise on sound level processing by auditory cortex neurons of rats in adulthood. Physiol Behav, 97(3-4):369–373, Jun 2009.

[31] Vikash Gilja, Cindy A. Chestek, Paul Nuyujukian, Justin Foster, and Krishna V. Shenoy. Autonomous head-mounted electrophysiology systems for freely behaving primates. Curr Opin Neurobiol, 20(5):676–686, Oct 2010. 12

[32] Keigo Kohara, Michele Pignatelli, Alexander J Rivest, Hae-Yoon Jung, Takashi Kitamura, Junghyup Suh, Dominic Frank, Koichiro Kajikawa, Nathan Mise, Yuichi Obata, et al. Cell type-specific genetic and optogenetic tools reveal hippocampal CA2 circuits. Nat Neurosci, 17(2):269–279, Dec 2013. 12

[33] Rong Mao, Xiaowen Wang, Edward L Spitznagel, Jr, Laurence P. Frelin, Jason C. Ting, Huashi Ding, Jung-whan Kim, Ingo Ruczinski, Thomas J. Downey, and Jonathan Pevsner. Primary and secondary transcriptional effects in the developing human Down syndrome brain and heart. Genome Biol, 6(13):R107, 2005. 12

[34] Jens Schmidt. Discovery of Putative STAT5 Transcription Factor Binding Sites in Mice with Diabetic Nephropathy. Master’s thesis, Ohio University, December 2013. 12

[35] S F Altschul, W Gish, W Miller, E W Myers, and D J Lipman. Basic local alignment search tool. Journal of molecular biology, 215:403–410, October 1990. 12

[36] D J Lipman and W R Pearson. Rapid and sensitive protein similarity searches. Science (New York, N.Y.), 227:1435–1441, March 1985. 12

[37] W R Pearson. Rapid and sensitive sequence comparison with fastp and fasta. Methods in enzymology, 183:63–98, 1990. 12

[38] J D Thompson, D G Higgins, and T J Gibson. Clustal w: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic acids research, 22:4673–4680, November 1994. 12

[39] David. S. Latchman. Transcription factors: an overview. Int J Exp Pathol, 74(5):417–422, Oct 1993. 12

[40] Federico Zambelli, Graziano Pesole, and Giulio Pavesi. Motif discovery and transcription factor binding sites before and after the next-generation sequencing era. Brief Bioinform, 14(2):225–237, Mar 2013. 12

[41] Cole Trapnell, Lior Pachter, and Steven L. Salzberg. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics, 25(9):1105–1111, May 2009. 12

[42] F Jacob and J Monod. Genetic regulatory mechanisms in the synthesis of proteins. Journal of molecular biology, 3:318–356, June 1961. 14

[43] Peter H Von Hippel, Arnold Revzin, Carol A Gross, and Amy C Wang. Non-specific dna binding of genome regulating proteins as a biological control mechanism: 1. the lac operon: Equilibrium aspects. Proc Natl Acad Sci U S A, 71(12):4808–12, Dec 1974.

[44] S Oehler, E R Eismann, H Krämer, and B Müller-Hill. The three operators of the lac operon cooperate in repression. The EMBO Journal, 9(4):973–9, Apr 1990. 14

[45] C Yanofsky. Attenuation in the control of expression of bacterial operons. Nature, 289(5800):751–8, Feb 1981. 14

[46] Moisés Santillán and Michael C Mackey. Dynamic regulation of the tryptophan operon: A modeling study and comparison with experimental data. Proc Natl Acad Sci U S A, 98(4):1364–9, Feb 2001. 14, 15

[47] Robert Daber, Steven Stayrook, Allison Rosenberg, and Mitchell Lewis. Structural analysis of lac repressor bound to allosteric effectors. Journal of Molecular Biology, 370(4):609–19, Jul 2007. 15

[48] David S. Latchman. Eukaryotic Transcription Factors. Elsevier, 5th edition, 2008. 15

[49] Xiaoju Wang, Jianjun Yu, Arun Sreekumar, Sooryanarayana Varambally, Ronglai Shen, Donald Giacherio, Rohit Mehra, James E Montie, Kenneth J Pienta, Martin G Sanda, Philip W Kantoff, Mark A Rubin, John T Wei, Debashis Ghosh, and Arul M Chinnaiyan. Autoantibody signatures in prostate cancer. N Engl J Med, 353(12):1224–35, Sep 2005. 15

[50] Matthew Schipper, George Wang, Nick Giles, and Jeanne Ohrnberger. Novel prostate cancer biomarkers derived from autoantibody signatures. Transl Oncol, 8(2):106–11, Apr 2015. 15

[51] F P Roth, J D Hughes, P W Estep, and G M Church. Finding dna regulatory motifs within unaligned noncoding sequences clustered by whole-genome mrna quantitation. Nat Biotechnol, 16(10):939–45, Oct 1998. 16

[52] G. Pavesi, G. Mauri, and G. Pesole. An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics, 17 Suppl 1:S207–S214, 2001. 16, 39

[53] X Shirley Liu, Douglas L. Brutlag, and Jun S. Liu. An algorithm for finding protein-dna binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat Biotechnol, 20(8):835–839, Aug 2002. 16, 39

[54] Laurence Ettwiller, Benedict Paten, Mirana Ramialison, Ewan Birney, and Joachim Wittbrodt. Trawler: de novo regulatory motif discovery pipeline for chromatin immunoprecipitation. Nat Methods, 4(7):563–565, Jul 2007. 16, 39

[55] Timothy L. Bailey, Mikael Boden, Fabian A. Buske, Martin Frith, Charles E. Grant, Luca Clementi, Jingyuan Ren, Wilfred W. Li, and William S. Noble. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res, 37(Web Server issue):W202–W208, Jul 2009. 16, 39

[56] Zizhen Yao, Kyle L. Macquarrie, Abraham P. Fong, Stephen J. Tapscott, Walter L. Ruzzo, and Robert C. Gentleman. Discriminative motif analysis of high-throughput dataset. Bioinformatics, 30(6):775–783, Mar 2014. 16

[57] Saurabh Sinha. Discriminative motifs. J Comput Biol, 10(3-4):599–615, 2003. 16

[58] Eivind Valen, Albin Sandelin, Ole Winther, and Anders Krogh. Discovery of regulatory elements is improved by a discriminatory approach. PLoS Comput Biol, 5(11):e1000562, Nov 2009. 16

[59] Peter Huggins, Shan Zhong, Idit Shiff, Rachel Beckerman, Oleg Laptenko, Carol Prives, Marcel H. Schulz, Itamar Simon, and Ziv Bar-Joseph. DECOD: fast and accurate discriminative dna motif finding. Bioinformatics, 27(17):2361–2367, Sep 2011. 17

[60] Emma Redhead and Timothy L. Bailey. Discriminative motif discovery in dna and protein sequences using the deme algorithm. BMC Bioinformatics, 8:385, 2007. 17

[61] Timothy L. Bailey. DREME: motif discovery in transcription factor ChIP-seq data. Bioinformatics, 27(12):1653–1659, Jun 2011. 17

[62] Jun Ding, Haiyan Hu, and Xiaoman Li. SIOMICS: a novel approach for systematic identification of motifs in ChIP-seq data. Nucleic Acids Res, 42(5):e35, Mar 2014. 17

[63] Jan Grau, Stefan Posch, Ivo Grosse, and Jens Keilwagen. A general approach for discriminative de novo motif discovery from high-throughput data. Nucleic Acids Res, 41(21):e197, Nov 2013. 17

[64] Andrei Lihu and Ștefan Holban. A review of ensemble methods for de novo motif discovery in chip-seq data. Brief Bioinform, Apr 2015. 17

[65] Martin Tompa, Nan Li, Timothy L Bailey, George M Church, Bart De Moor, Eleazar Eskin, Alexander V Favorov, Martin C Frith, Yutao Fu, W James Kent, Vsevolod J Makeev, Andrei A Mironov, William Stafford Noble, Giulio Pavesi, Graziano Pesole, Mireille Regnier, Nicolas Simonis, Saurabh Sinha, Gert Thijs, Jacques van Helden, Mathias Vandenbogaert, Zhiping Weng, Christopher Workman, Chun Ye, and Zhou Zhu. Assessing computational tools for the discovery of transcription factor binding sites. Nature biotechnology, 23(1):137–44, Jan 2005. 17

[66] Nan Li and Martin Tompa. Analysis of computational approaches for motif discovery. Algorithms for Molecular Biology, 1:8, 2006. 17

[67] Arijit Chakravarty, Jonathan M. Carlson, Radhika S. Khetani, and Robert H. Gross. A novel ensemble learning method for de novo computational identification of dna binding sites. BMC Bioinformatics, 8:249, 2007. 17

[68] Viktor Martyanov and Robert H. Gross. Using scope to identify potential regulatory motifs in coregulated genes. J Vis Exp, (51), 2011. 17

[69] Pouya Kheradpour and Manolis Kellis. Systematic discovery and characterization of regulatory motifs in ENCODE TF binding experiments. Nucleic Acids Research, 42(5):2976–2987, Mar 2014. 18, 38, 39

[70] Rami Al-Ouran, Robert Schmidt, Ashwini Naik, Jeffrey Jones, Frank Drews, David Juedes, Laura Elnitski, and Lonnie Welch. Discovering gene regulatory elements using coverage-based heuristics. IEEE/ACM Trans Comput Biol Bioinform, Oct 2015. 18, 20, 23, 38, 39, 41, 42, 43, 44, 84

[71] Dennis A. Benson, Mark Cavanaugh, Karen Clark, Ilene Karsch-Mizrachi, David J. Lipman, James Ostell, and Eric W. Sayers. Genbank. Nucleic Acids Res, 41(Database issue):D36–D42, Jan 2013. 18

[72] Richard M. Karp. Complexity of Computer Computations, chapter Reducibility Among Combinatorial Problems, pages 85–103. Plenum, New York, USA, 1972. 20, 24, 25

[73] Stephen A. Cook. The complexity of theorem-proving procedures. In Proceedings of the Third Annual ACM Symposium on Theory of Computing, STOC ’71, pages 151–158, New York, NY, USA, 1971. ACM. 20, 25

[74] Karla Hoffman and Manfred Padberg. Set covering, packing and partitioning problems, pages 3482–3486. Springer US, Boston, MA, 2009. 20

[75] Dorit S. Hochbaum. Approximation algorithms for the set covering and vertex cover problems. SIAM Journal on Computing, 11(3):555–556, July 1982. 20

[76] J.E Beasley. An algorithm for set covering problem. European Journal of Operational Research, 31:85–93, 1987.

[77] Mario Marchand and John Shawe-Taylor. The set covering machine. Journal of Machine Learning Research, 3:723–746, December 2002.

[78] Guanghui Lan, Gail W. DePuy, and Gary E. Whitehouse. An effective and simple heuristic for the set covering problem. European Journal of Operational Research, 176:1387–1403, 2007.

[79] Fabrizio Grandoni, Anupam Gupta, Stefano Leonardi, Pauli Miettinen, Piotr Sankowski, and Mohit Singh. Set covering with our eyes closed. SIAM Journal on Computing, 42(3):808–830, 2013. 20

[80] Pauli Miettinen. On the positive–negative partial set cover problem. Information Processing Letters, 108(4):219–221, 2008. 21, 24

[81] Alberto Caprara, Matteo Fischetti, and Paolo Toth. Algorithms for the set covering problem. Annals of Operations Research, 98:353–371, 1998. 21

[82] P.J. van Laarhoven and E.H. Aarts. Simulated Annealing: Theory and Applications. Ellis Horwood series in mathematics and its applications. Springer, 1987. 21

[83] Charles Kittel. Introduction to solid state physics. John Wiley & Sons, Inc., New York, NY, USA, 8th edition, 2005. 22

[84] Scott Kirkpatrick, C. Daniel Gelatt, and Mario P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983. 22, 29, 30

[85] V. Černý. Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm. Journal of Optimization Theory and Applications, 45(1):41–51, Jan 1985. 22, 30

[86] Johann Dréo, Alain Pétrowski, Patrick Siarry, and Eric Taillard. Simulated Annealing, pages 23–46. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006. 22

[87] Vijay V. Vazirani. Approximation Algorithms. Springer-Verlag New York, Inc., New York, NY, USA, 2001. 24, 25

[88] RFC 1058: Routing information protocol, June 1988. 25

[89] Mirko Maischberger. COIN–OR METSlib: a Metaheuristics Framework in Modern C++. Computational Infrastructure for Operations Research (COIN-OR), June 2011. 26

[90] R.E. Randelman and Gary S. Grest. N-city traveling salesman problem: Optimization by simulated annealings. Journal of Statistical Physics, 45(5-6):885–890, 1986. 29, 30

[91] Emile H.L. Aarts, Jan H.M. Korst, and Peter J.M. van Laarhoven. Local Search in Combinatorial Optimization, chapter Simulated Annealing, pages 91–120. John Wiley & Sons Ltd., 1997. 30

[92] Mirko Maischberger. METSlib. Computational Infrastructure for Operations Research (COIN-OR), COIN-OR Foundation, Inc., 40 York Road, Suite 300, Towson MD 21204, May 2011. Version 0.5.3 was released on 2011-May-05. 31, 67

[93] Christoph M. Koch, Robert M. Andrews, Paul Flicek, Shane C. Dillon, Ulaş Karaöz, Gayle K. Clelland, Sarah Wilcox, David M. Beare, Joanna C. Fowler, Phillippe Couttet, Keith D. James, Gregory C. Lefebvre, Alexander W. Bruce, Oliver M. Dovey, Peter D. Ellis, Pawandeep Dhami, Cordelia F. Langford, Zhiping Weng, Ewan Birney, Nigel P. Carter, David Vetrie, and Ian Dunham. The landscape of histone modifications across 1% of the human genome in five human cell lines. Genome Res, 17(6):691–707, Jun 2007. 38

[94] ENCODE Project Consortium, Ewan Birney, John A. Stamatoyannopoulos, Anindya Dutta, Roderic Guigó, Thomas R. Gingeras, Elliott H. Margulies, Zhiping Weng, Michael Snyder, Emmanouil T. Dermitzakis, Robert E. Thurman, Michael S. Kuehn, Christopher M. Taylor, Shane Neph, Christoph M. Koch, Saurabh Asthana, Ankit Malhotra, Ivan Adzhubei, Jason A. Greenbaum, Robert M. Andrews, Paul Flicek, Patrick J. Boyle, Hua Cao, Nigel P. Carter, Gayle K. Clelland, Sean Davis, Nathan Day, Pawandeep Dhami, Shane C. Dillon, Michael O. Dorschner, Heike Fiegler, Paul G. Giresi, Jeff Goldy, Michael Hawrylycz, Andrew Haydock, Richard Humbert, Keith D. James, Brett E. Johnson, Ericka M. Johnson, Tristan T. Frum, Elizabeth R. Rosenzweig, Neerja Karnani, Kirsten Lee, Gregory C. Lefebvre, Patrick A. Navas, Fidencio Neri, Stephen C J. Parker, Peter J. Sabo, Richard Sandstrom, Anthony Shafer, David Vetrie, Molly Weaver, Sarah Wilcox, Man Yu, Francis S. Collins, Job Dekker, Jason D. Lieb, Thomas D. Tullius, Gregory E. Crawford, Shamil Sunyaev, William S. Noble, Ian Dunham, France Denoeud, Alexandre Reymond, Philipp Kapranov, Joel Rozowsky, Deyou Zheng, Robert Castelo, Adam Frankish, Jennifer Harrow, Srinka Ghosh, Albin Sandelin, Ivo L. Hofacker, Robert Baertsch, Damian Keefe, Sujit Dike, Jill Cheng, Heather A. Hirsch, Edward A. Sekinger, Julien Lagarde, Josep F. Abril, Atif Shahab, Christoph Flamm, Claudia Fried, Jörg Hackermüller, Jana Hertel, Manja Lindemeyer, Kristin Missal, Andrea Tanzer, Stefan Washietl, Jan Korbel, Olof Emanuelsson, Jakob S. Pedersen, Nancy Holroyd, Ruth Taylor, David Swarbreck, Nicholas Matthews, Mark C. Dickson, Daryl J. Thomas, Matthew T. Weirauch, James Gilbert, Jorg Drenkow, Ian Bell, XiaoDong Zhao, K. G. Srinivasan, Wing-Kin Sung, Hong Sain Ooi, Kuo Ping Chiu, Sylvain Foissac, Tyler Alioto, Michael Brent, Lior Pachter, Michael L. Tress, , Siew Woh Choo, Chiou Yu Choo, Catherine Ucla, Caroline Manzano, Carine Wyss, Evelyn Cheung, Taane G. Clark, James B. Brown, Madhavan Ganesh, Sandeep Patel, Hari Tammana, Jacqueline Chrast, Charlotte N. Henrichsen, Chikatoshi Kai, Jun Kawai, Ugrappa Nagalakshmi, Jiaqian Wu, Zheng Lian, Jin Lian, Peter Newburger, Xueqing Zhang, Peter Bickel, John S. Mattick, Piero Carninci, Yoshihide Hayashizaki, Sherman Weissman, , Richard M. Myers, Jane Rogers, Peter F. Stadler, Todd M. Lowe, Chia-Lin Wei, Yijun Ruan, Kevin Struhl, Mark Gerstein, Stylianos E. Antonarakis, Yutao Fu, Eric D. Green, Ulaş Karaöz, Adam Siepel, James Taylor, Laura A. Liefer, Kris A. Wetterstrand, Peter J. Good, Elise A. Feingold, Mark S. Guyer, Gregory M. Cooper, George Asimenos, Colin N. Dewey, Minmei Hou, Sergey Nikolaev, Juan I. Montoya-Burgos, Ari Löytynoja, Simon Whelan, Fabio Pardi, Tim Massingham, Haiyan Huang, Nancy R. Zhang, Ian Holmes, James C. Mullikin, Abel Ureta-Vidal, Benedict Paten, Michael Seringhaus, Deanna Church, Kate Rosenbloom, W James Kent, Eric A. Stone, N. I. S. C Comparative Sequencing Program, Baylor College of Medicine Human Genome Sequencing Center, Washington University Genome Sequencing Center, , Children’s Hospital Oakland Research Institute, Serafim Batzoglou, Nick Goldman, Ross C. Hardison, , , Arend Sidow, Nathan D. Trinklein, Zhengdong D. Zhang, Leah Barrera, Rhona Stuart, David C. King, Adam Ameur, Stefan Enroth, Mark C. Bieda, Jonghwan Kim, Akshay A. Bhinge, Nan Jiang, Jun Liu, Fei Yao, Vinsensius B. Vega, Charlie W H. Lee, Patrick Ng, Atif Shahab, Annie Yang, Zarmik Moqtaderi, Zhou Zhu, Xiaoqin Xu, Sharon Squazzo, Matthew J. Oberley, David Inman, Michael A. Singer, Todd A. Richmond, Kyle J. Munn, Alvaro Rada-Iglesias, Ola Wallerman, Jan Komorowski, Joanna C. Fowler, Phillippe Couttet, Alexander W. Bruce, Oliver M. Dovey, Peter D. Ellis, Cordelia F. Langford, David A. Nix, Ghia Euskirchen, Stephen Hartman, Alexander E. Urban, Peter Kraus, Sara Van Calcar, Nate Heintzman, Tae Hoon Kim, Kun Wang, Chunxu Qu, Gary Hon, Rosa Luna, Christopher K. Glass, M Geoff Rosenfeld, Shelley Force Aldred, Sara J. Cooper, Anason Halees, Jane M. Lin, Hennady P. Shulha, Xiaoling Zhang, Mousheng Xu, Jaafar N S. Haidar, Yong Yu, Yijun Ruan, Vishwanath R. Iyer, Roland D. Green, Claes Wadelius, Peggy J. Farnham, Bing Ren, Rachel A. Harte, Angie S. Hinrichs, Heather Trumbower, Hiram Clawson, Jennifer Hillman-Jackson, Ann S. Zweig, Kayla Smith, Archana Thakkapallayil, Galt Barber, Robert M. Kuhn, Donna Karolchik, Lluis Armengol, Christine P. Bird, Paul I W. de Bakker, Andrew D. Kern, Nuria Lopez-Bigas, Joel D. Martin, Barbara E. Stranger, Abigail Woodroffe, Eugene Davydov, Antigone Dimas, Eduardo Eyras, Ingileif B. Hallgrímsdóttir, Julian Huppert, Michael C. Zody, Gonçalo R. Abecasis, Xavier Estivill, Gerard G. Bouffard, Xiaobin Guan, Nancy F. Hansen, Jacquelyn R. Idol, Valerie V B. Maduro, Baishali Maskeri, Jennifer C. McDowell, Morgan Park, Pamela J. Thomas, Alice C. Young, Robert W. Blakesley, Donna M. Muzny, Erica Sodergren, David A. Wheeler, Kim C. Worley, Huaiyang Jiang, George M. Weinstock, Richard A. Gibbs, Tina Graves, Robert Fulton, Elaine R. Mardis, Richard K. Wilson, Michele Clamp, James Cuff, Sante Gnerre, David B. Jaffe, Jean L. Chang, Kerstin Lindblad-Toh, Eric S. Lander, Maxim Koriabine, Mikhail Nefedov, Kazutoyo Osoegawa, Yuko Yoshinaga, Baoli Zhu, and Pieter J. de Jong. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447(7146):799–816, Jun 2007.

[95] Simon J. van Heeringen and Gert Jan C. Veenstra. GimmeMotifs: a de novo motif prediction pipeline for ChIP-sequencing experiments. Bioinformatics, 27(2):270–271, Jan 2011.

[96] Charles E. Grant, Timothy L. Bailey, and William Stafford Noble. FIMO: scanning for occurrences of a given motif. Bioinformatics, 27(7):1017–1018, Apr 2011.

[97] Shobhit Gupta, John A. Stamatoyannopoulos, Timothy L. Bailey, and William Stafford Noble. Quantifying similarity between motifs. Genome Biol, 8(2):R24, 2007.

[98] B Suman and P Kumar. A survey of simulated annealing as a tool for single and multiobjective optimization. Journal of the Operational Research Society, 57:1143–1160, October 2006.

[99] Lester Ingber. Very fast simulated re-annealing. Mathl. Comput. Modelling, 12(8):967–973, 1989.

[100] Lester Ingber. Simulated annealing: Practice versus theory. Mathl. Comput. Modelling, 18(11):29–57, 1993.

[101] Guanglu Gong, Yong Liu, and Minping Qian. An adaptive simulated annealing algorithm. Stochastic Processes and their Applications, 94(1):95–103, 2001.

[102] Vincent A. Cicirello. Variable annealing length and parallelism in simulated annealing. In Proceedings of the Tenth International Symposium on Combinatorial Search (SoCS 2017), pages 2–10. AAAI Press, June 2017.

[103] Ihor O. Bohachevsky, Mark E. Johnson, and Myron L. Stein. Generalized simulated annealing for function optimization. Technometrics, 28(3):209–217, August 1986.

[104] Peter J. M. van Laarhoven and Emile H. L. Aarts. Simulated Annealing: Theory and Applications. Springer Netherlands, 1987.

[105] Vincent A. Cicirello. On the design of an adaptive simulated annealing algorithm. In Proceedings of the International Conference on Principles and Practice of Constraint Programming First Workshop on Autonomous Search, September 2007. AAAI Press.

Appendix A: Source Code

A.1 METSlib Code

The source code in this section illustrates the internal mechanism of the simulated annealing implementation in METSlib [92]. The METSlib codebase is licensed under the GNU General Public License, version 3 or later. The author of METSlib also states in the source files that the code can be distributed under the Common Public License 1.0. Disclaimer: the source code in this section is taken from the METSlib codebase; the author of this thesis did not contribute to METSlib.

A.1.1 Search Procedure

Code from metslib/simulated-annealing.hh

    template<typename move_manager_t>
    void
    mets::simulated_annealing<move_manager_t>::search()
      throw(no_moves_error)
    {
      typedef abstract_search<move_manager_t> base_t;

      current_temp_m = starting_temp_m;
      while(!termination_criteria_m(base_t::working_solution_m)
            && current_temp_m > stop_temp_m)
        {
          gol_type actual_cost =
            static_cast<mets::evaluable_solution&>(base_t::working_solution_m)
            .cost_function();
          gol_type best_cost =
            static_cast<mets::evaluable_solution&>(base_t::working_solution_m)
            .cost_function();

          base_t::moves_m.refresh(base_t::working_solution_m);
          for(typename move_manager_t::iterator movit = base_t::moves_m.begin();
              movit != base_t::moves_m.end(); ++movit)
            {
              // apply move and record proposed cost function
              gol_type cost = (*movit)->evaluate(base_t::working_solution_m);

              double delta = ((double)(cost-actual_cost));
              if(delta < 0 || gen() < exp(-delta/(K_m*current_temp_m)))
                {
                  // accepted: apply, record, exit for and lower temperature
                  (*movit)->apply(base_t::working_solution_m);
                  base_t::current_move_m = movit;

                  if(base_t::solution_recorder_m.accept(base_t::working_solution_m))
                    {
                      base_t::step_m = base_t::IMPROVEMENT_MADE;
                      this->notify();
                    }
                  base_t::step_m = base_t::MOVE_MADE;
                  this->notify();
                  break;
                }
            } // end for each move

          current_temp_m =
            cooling_schedule_m(current_temp_m, base_t::working_solution_m);
        }
    }
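The acceptance test at the heart of the listing above can be isolated as a small function. The following is a minimal sketch, not METSlib code: the function names are hypothetical, and the uniform random draw is passed in as the parameter `u` (an assumption made here so that the behavior is deterministic), whereas the listing obtains it from `gen()`.

```cpp
#include <cassert>
#include <cmath>

// Metropolis-style acceptance test mirroring the condition in the listing:
// accept when the move lowers the cost (delta < 0), or when a uniform draw
// u in [0, 1) falls below exp(-delta / (k * temperature)).
// Passing u explicitly is an assumption of this sketch; the listing calls gen().
bool accept_move(double delta, double temperature, double k, double u) {
    if (delta < 0.0) {
        return true;  // improving moves are always accepted
    }
    return u < std::exp(-delta / (k * temperature));
}

// Probability that a move with cost change `delta` is accepted.
double acceptance_probability(double delta, double temperature, double k) {
    return delta < 0.0 ? 1.0 : std::exp(-delta / (k * temperature));
}
```

With `delta = 1`, `k = 1`, and a temperature of 1, the acceptance probability is exp(-1), roughly 0.37; raising the temperature increases it, which is why worsening moves are accepted more often early in the search.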

A.1.2 Exponential Cooling Schedule

Code from metslib/simulated-annealing.hh

    /// @brief Original ECS proposed by Kirkpatrick
    class exponential_cooling
      : public abstract_cooling_schedule
    {
    public:
      exponential_cooling(double alpha = 0.95)
        : abstract_cooling_schedule(), factor_m(alpha)
      { if(alpha >= 1) throw std::runtime_error("alpha must be < 1"); }
      double
      operator()(double temp, feasible_solution& fs)
      { return temp*factor_m; }
    protected:
      double factor_m;
    };

A.1.3 Linear Cooling Schedule

Code from metslib/simulated-annealing.hh

    /// @brief Alternative LCS proposed by Randelman and Grest
    class linear_cooling
      : public abstract_cooling_schedule
    {
    public:
      linear_cooling(double delta = 0.1)
        : abstract_cooling_schedule(), decrement_m(delta)
      { if(delta <= 0) throw std::runtime_error("delta must be > 0"); }
      double
      operator()(double temp, feasible_solution& fs)
      { return std::max(0.0, temp-decrement_m); }
    protected:
      double decrement_m;
    };
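To compare the exponential schedule of A.1.2 with the linear schedule above, it can help to count how many cooling steps each allows before the temperature reaches the stop temperature. The sketch below is illustrative only: the function names are hypothetical, and it assumes the termination criterion never fires, so the outer search loop ends purely because the temperature is no longer above the stop temperature.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Number of cooling steps an exponential schedule (T <- alpha * T) takes
// to bring the temperature from start_temp down to stop_temp or below.
std::size_t exponential_steps(double start_temp, double stop_temp, double alpha) {
    if (alpha <= 0.0 || alpha >= 1.0 || stop_temp <= 0.0) {
        throw std::invalid_argument("need 0 < alpha < 1 and stop_temp > 0");
    }
    std::size_t steps = 0;
    for (double t = start_temp; t > stop_temp; t *= alpha) {
        ++steps;
    }
    return steps;
}

// Same count for a linear schedule (T <- max(0, T - delta)).
std::size_t linear_steps(double start_temp, double stop_temp, double delta) {
    if (delta <= 0.0 || stop_temp < 0.0) {
        throw std::invalid_argument("need delta > 0 and stop_temp >= 0");
    }
    std::size_t steps = 0;
    for (double t = start_temp; t > stop_temp; t = std::max(0.0, t - delta)) {
        ++steps;
    }
    return steps;
}
```

For example, starting at a temperature of 100 with a stop temperature of 1, the default `alpha = 0.95` gives 90 steps, while a linear decrement of 0.5 gives 198 steps. The exponential schedule spends proportionally more of its steps at low temperatures, where few worsening moves are accepted.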

A.2 Implementation for Motif Selection Problem

The source code in this section was developed by the author of this thesis and is licensed under the GNU General Public License, version 3 or later. A copy of the GNU General Public License can be retrieved from https://www.gnu.org/licenses/.

A.2.1 Problem and Solution Definition

sol.hh

    class SAsol : public mets::evaluable_solution {

    public:
      double new_score;
      double cur_score;
      SAsol();
      ~SAsol();
      void copy_from( const mets::copyable& );
      mets::gol_type cost_function() const;
      SAsol( const SAsol& );
      flag getSelection() const;
      vector getMap( int ) const;
      size_t getCount( int ) const;
      void setSelection( flag& );
      void init( vector&, vector&, size_t, size_t );
      int getSolsize() const;

    private:
      vector Fore;
      vector Back;
      size_t fore_cnt;
      size_t back_cnt;
      flag Selection;
      void copy( vector&, vector* );
      friend std::ostream& operator<<( std::ostream& os, SAsol& p ){
        for( size_t i = 0; i < p.Selection.size(); i++ ){
          os << p.Selection.get(i) << ", ";
        }
        return os;
      };
    };

    SAsol::SAsol(){
      this->new_score = 0;
      this->cur_score = 0;
      return;
    }

    SAsol::~SAsol(){
      this->Fore.clear();
      this->Back.clear();
    }

    void SAsol::init( vector &fore, vector &back, size_t fc, size_t bc ) {
      this->copy( fore, &(this->Fore) );
      this->copy( back, &(this->Back) );
      this->fore_cnt = fc;
      this->back_cnt = bc;
      this->Selection.setlen( this->Fore.size() );
      return;
    }

    SAsol::SAsol( const SAsol &obj ){
      this->copy_from( obj );
    }

    void SAsol::copy( vector &src, vector *dest ){
      dest->clear();
      size_t i, cnt;
      cnt = src.size();
      for( i = 0; i < cnt; i++ ){
        dest->push_back( src[i] );
      }
      return;
    }

    flag SAsol::getSelection() const {
      return this->Selection;
    }

    vector SAsol::getMap( int bFore ) const {
      if( bFore ){
        return this->Fore;
      }else{
        return this->Back;
      }
    }

    size_t SAsol::getCount( int bFore ) const {
      if( bFore ){
        return this->fore_cnt;
      }else{
        return this->back_cnt;
      }
    }

    void SAsol::setSelection( flag &f ){
      this->Selection.copy( f );
      return;
    }

    void SAsol::copy_from( const mets::copyable& obj ){
      const SAsol& model = dynamic_cast<const SAsol&>( obj );
      vector tf = model.getMap(1);
      vector tb = model.getMap(0);
      this->init( tf, tb, model.getCount(1), model.getCount(0) );
      flag ts = model.getSelection();
      this->setSelection( ts );
      return;
    }

    mets::gol_type SAsol::cost_function() const {
      gol_type res;
      int fcov = coverage( this->Selection, this->Fore );
      int sum = this->Selection.getsum(); if( sum==0 ){ sum = 1000; }
      int cnt = this->Selection.size();
      res = (gol_type)( ( this->fore_cnt == fcov ) ? ( sum ) : ( PFACTOR * sum ) );
      return res;
    }

    int SAsol::getSolsize() const {
      return this->Selection.getsum();
    }
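The cost function in the listing above depends on a `coverage` helper and a `PFACTOR` constant that are not shown in this appendix. The following is a minimal sketch of how such a cost could be computed, under stated assumptions: the data layout (`HitLists`), the body of `coverage`, and the penalty value of 10000 are all hypothetical choices made for illustration and are not taken from the thesis code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical data layout: hits[m] lists the indices of the foreground
// sequences in which candidate motif m occurs.
using HitLists = std::vector<std::vector<std::size_t>>;

// Number of foreground sequences hit by at least one selected motif.
// Assumes hits.size() == selection.size() and all indices < sequence_count.
std::size_t coverage(const std::vector<bool>& selection, const HitLists& hits,
                     std::size_t sequence_count) {
    std::vector<bool> covered(sequence_count, false);
    for (std::size_t m = 0; m < selection.size(); ++m) {
        if (!selection[m]) continue;
        for (std::size_t s : hits[m]) covered[s] = true;
    }
    std::size_t total = 0;
    for (bool c : covered) total += c ? 1 : 0;
    return total;
}

// Cost in the spirit of SAsol::cost_function(): the number of selected
// motifs, multiplied by a penalty factor when coverage is incomplete.
// The default penalty of 10000 is an assumed stand-in for PFACTOR.
long cost(const std::vector<bool>& selection, const HitLists& hits,
          std::size_t sequence_count, long penalty = 10000) {
    long selected = 0;
    for (bool b : selection) selected += b ? 1 : 0;
    if (selected == 0) selected = 1000;  // an empty selection is made costly, as in the listing
    const bool full = coverage(selection, hits, sequence_count) == sequence_count;
    return full ? selected : penalty * selected;
}
```

Under this formulation, any selection that covers every foreground sequence costs less than any selection that does not, so the search is driven first toward full coverage and then toward fewer motifs.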

A.2.2 Cost Function

cof.hh

    class COF: public chenliang::omb::SAsol {

      mets::gol_type COF::rc_cost(){
        mets::gol_type res = 0;
        ////
        int remained = this->ft - this->fc;
        // penalty factor
        int pfactor = ( remained == 0 ) ? 1 : 10000 ;

        //=2015-0909-2117.09edt on dhn
        res = (mets::gol_type)(
          ( this->msel > 0 ) ? ( pfactor * this->msel ) : ( 10000 * this->mtotal )
        );
        ////
        return res;
      }

      mets::gol_type COF::rc_relax(){
        //=2015-0909-2121.55edt on dhn
        mets::gol_type res = 0;
        ////
        int remained = this->ft - this->fc;
        if( (this->fc * 100) >= (this->ft * this->fc_relax) ){
          remained = 0;
        }
        // penalty factor
        int pfactor = ( remained == 0 ) ? 1 : 10000 ;

        res = (mets::gol_type)(
          ( this->msel > 0 ) ? ( pfactor * this->msel ) : ( 10000 * this->mtotal )
        );
        ////
        return res;
      }

      int COF::setrelaxfore( const int relax_fc ){
        //=2015-0909-2140.50edt on dhn
        if( (relax_fc < 0) || (relax_fc > 100) ){
          throw std::range_error("COF:setrelaxfore():invalid fore coverage relax value.");
        }else{
          this->fc_relax = relax_fc;
        }
        ////
        return this->fc_relax;
      }

      mets::gol_type COF::cost_function() const {
        mets::gol_type res;
        res = (mets::gol_type)( ( this->fc_relax == 0 ) ? ( this->rc_cost() ) : ( this->rc_relax() ) );
        return res;
      }
    };
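The strict and relaxed cost functions above differ only in when the penalty factor is dropped. The sketch below restates that logic as a single free function. It is an illustration rather than the thesis implementation: the `Counts` struct and the function name are hypothetical, and the reading of the member names (`ft` as the total number of foreground sequences, `fc` as the number covered, `msel` as the number of selected motifs, `mtotal` as the number of candidate motifs) is an assumption inferred from how the listing uses them.

```cpp
#include <cassert>
#include <stdexcept>

// Assumed meaning of the members used in the COF listing.
struct Counts {
    int ft;      // total foreground sequences (assumed)
    int fc;      // foreground sequences covered by the selection (assumed)
    int msel;    // number of selected motifs (assumed)
    int mtotal;  // number of candidate motifs (assumed)
};

// relax_percent == 0 reproduces the strict rule (full coverage required);
// otherwise coverage of at least relax_percent percent is enough.
long relaxed_cost(const Counts& c, int relax_percent) {
    if (relax_percent < 0 || relax_percent > 100) {
        throw std::range_error("relax_percent must be in [0, 100]");
    }
    const bool enough = (relax_percent == 0)
        ? (c.fc == c.ft)
        : (c.fc * 100 >= c.ft * relax_percent);
    const long pfactor = enough ? 1 : 10000;  // penalty factor from the listing
    return (c.msel > 0) ? pfactor * c.msel : 10000L * c.mtotal;
}
```

For instance, with 1000 foreground sequences of which 850 are covered by 5 selected motifs, a threshold of 85 yields a cost of 5, whereas the strict rule or a threshold of 90 yields 50000. The SAr85 and SAr70 columns in Appendix B presumably correspond to thresholds of 85 and 70.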

A.2.3 Neighborhood

nf.hh

    class Nhf : public mets::move_manager {
    protected:
      std::deque moves_m;
      Nhf( const move_manager& );

    public:
      Nhf();
      ~Nhf();
      Nhf( size_t );
      void refresh( mets::feasible_solution& );
      typedef std::deque::iterator iterator;
      iterator begin() { return moves_m.begin(); }
      iterator end() { return moves_m.end(); }
      typedef size_t size_type;
      size_type size() const {
        size_type res;
        res = this->moves_m.size();
        return res;
      }

    private:
      size_t ns;
      size_t generate( mets::feasible_solution& );
      bool novelty( const Walk* , const size_t );
    };

    Nhf::Nhf(){
      this->ns = NF_SIZE;
    }

    Nhf::Nhf( size_t asize ){
      this->ns = asize;
    }

    Nhf::Nhf(const move_manager &mvm ){
      this->ns = mvm.size();
      std::cerr<<"nhf:nhf(mvm) not fully implemented!"<<std::endl;
    }

    Nhf::~Nhf(){
      for( iterator ii = this->begin(); ii != this->end(); ii++ ){ delete (*ii); }
      this->moves_m.clear();
    }

    size_t Nhf::generate( mets::feasible_solution& s ){
      size_t res = 0;
      SAsol& model = dynamic_cast<SAsol&>( s );
      size_t selsize = model.getSelection().size();
      size_t i;
      for( i = 0 ; i < this->ns ; i++ ){
        Walk *newstep = new Walk( selsize, i );
        this->moves_m.push_back( newstep );
        res++;
      }
      return res;
    }

    void Nhf::refresh( mets::feasible_solution& s ){
      for( iterator ii = this->begin(); ii != this->end(); ii++ ){ delete (*ii); }
      this->moves_m.clear();
      if( this->ns ){
        // nothing
      }else{
        this->ns = NF_SIZE;
      }
      this->generate( s );
    }
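The neighborhood generated above consists of single-bit flips of the selection vector. The following sketch, with hypothetical function names and `std::vector<bool>` standing in for the `flag` type, shows which positions a neighborhood of a given size targets when move `i` is constructed as `Walk( selsize, i )`, that is, with position `i % selsize`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Positions targeted by a neighborhood of `neighborhood_size` moves over a
// selection of `selection_size` bits: move i flips bit (i % selection_size).
std::vector<std::size_t> move_positions(std::size_t selection_size,
                                        std::size_t neighborhood_size) {
    std::vector<std::size_t> positions;
    if (selection_size == 0) return positions;  // avoid modulo by zero
    positions.reserve(neighborhood_size);
    for (std::size_t i = 0; i < neighborhood_size; ++i) {
        positions.push_back(i % selection_size);
    }
    return positions;
}

// Neighbor of `selection` obtained by flipping the bit at `pos`.
std::vector<bool> flip_bit(std::vector<bool> selection, std::size_t pos) {
    selection[pos] = !selection[pos];
    return selection;
}
```

Read this way, the moves are generated in index order, and the search loop in A.1.1 stops at the first accepted move, so lower-indexed motifs are proposed first in each iteration. If the neighborhood size were smaller than the number of candidate motifs, the higher-indexed bits would not be proposed at all; the value of `NF_SIZE` is not shown in this appendix.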

A.2.4 Solution Generation

walk.hh

    class Walk : mets::move {
    public:
      ~Walk();
      mets::gol_type evaluate( const mets::feasible_solution& ) const ;
      void apply( mets::feasible_solution& ) const ;
      Walk( size_t );
      Walk( size_t, size_t );
      size_t getpos() const ;
      bool operator==( const mets::move& ) const ;
    private:
      size_t picker( size_t );
      size_t pos;
      size_t changeBits( const SAsol*, SAsol* ) const;
    };

    Walk::~Walk(){
    }

    Walk::Walk( size_t x ){
      if( x ){
        this->pos = this->picker( x );
      }else{
        this->pos = this->picker( LEN );
      }
    }

    Walk::Walk( size_t length, size_t newpos ){
      this->pos = newpos % length;
    }

    size_t Walk::picker( size_t length ){
      size_t res = 0;
      std::tr1::uniform_int ui(0, length-1);
      std::tr1::mt19937 rng( time( NULL ) );
      res = ui(rng);
      return res;
    }

    size_t Walk::changeBits( const SAsol *src, SAsol *dest ) const {
      size_t res = this->pos;
      flag ff = src->getSelection();
      ff.set( this->pos, flip( ff.get(this->pos) ) );
      dest->setSelection( ff );
      return res;
    }

    mets::gol_type Walk::evaluate( const mets::feasible_solution& cs ) const {
      gol_type res = 0;
      const SAsol& model = dynamic_cast<const SAsol&>( cs );
      SAsol *newmodel = new SAsol( model );
      this->changeBits( &model, newmodel );
      res = newmodel->cost_function();
      delete newmodel;
      return res;
    }

    void Walk::apply( mets::feasible_solution& s ) const {
      SAsol& model = dynamic_cast<SAsol&>( s );
      this->changeBits( &model, &model );
      return;
    }

    size_t Walk::getpos() const {
      return this->pos;
    }

    bool Walk::operator==( const mets::move& o ) const {
      bool res = false;
      try{
        const Walk& other = dynamic_cast<const Walk&>( o );
        res = ( this->pos == other.getpos() );
      } catch (std::bad_cast& e) {
        std::cerr << "Walk: bad cast" << std::endl;
        return false;
      }
      return res;
    }
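One detail of `Walk::picker` is worth noting: it constructs a new generator seeded with `time( NULL )` on every call, and because that clock has one-second resolution, calls made within the same second start from the same seed. A possible alternative, sketched below with C++11 `<random>` instead of TR1, is to pass in one long-lived engine. The function name and signature are hypothetical, and this is not how the thesis code is written.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <stdexcept>

// Draws a position in [0, length - 1] from an engine owned by the caller.
// Reusing one engine across calls avoids re-seeding on every draw.
std::size_t pick_position(std::mt19937& rng, std::size_t length) {
    if (length == 0) {
        throw std::invalid_argument("length must be positive");
    }
    std::uniform_int_distribution<std::size_t> dist(0, length - 1);
    return dist(rng);
}
```

Two engines constructed with the same seed yield the same first draw, which is the situation that repeated seeding within one second would reproduce; a single shared engine advances its state between draws instead.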

A.3 Plotting

The source code in this section was developed by the author of this thesis.

plot.r

    ### data plotting script for Master of Science Thesis.
    ### Copyright (C) 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018 Liang Chen
    ###
    ### This program is free software: you can redistribute it and/or modify
    ### it under the terms of the GNU General Public License as published by
    ### the Free Software Foundation, either version 3 of the License, or
    ### (at your option) any later version.
    ###
    ### This program is distributed in the hope that it will be useful,
    ### but WITHOUT ANY WARRANTY; without even the implied warranty of
    ### MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    ### GNU General Public License for more details.
    ###
    ### You should have received a copy of the GNU General Public License
    ### along with this program. If not, see <https://www.gnu.org/licenses/>.

    library(xtable)

    fs.box <- function( Dt ){
      dtv <- data.frame(
        SA=Dt$SA.fs,
        Greedy=Dt$G.fs,
        RILP=Dt$RILP.fs
      )
      max.covered <- max( c(dtv$SA, dtv$Greedy, dtv$RILP) )
      ##
      boxplot( dtv,
               ylim=c( 0, max.covered ),
               main="mSn" )
    }

    fs.line <- function( Dt ){
      dtv <- data.frame(
        GROUP=factor(Dt$tf),
        MSN=( numeric(length=length(Dt$SA.ssn)) -100 )
      )
      max.covered <- max( c(Dt$SA.fs, Dt$G.fs, Dt$RILP.fs) )
      ##
      plot( dtv,
            type="n",
            ylim=c( 0, max.covered ),
            main="mSn for each TF group" )
      lines( Dt$SA.fs, col="red" )
      lines( Dt$G.fs, col="blue" )
      lines( Dt$RILP.fs, col="green" )
      legend( "topright",
              legend=c("SA", "Greedy", "RILP"),
              col=c("red", "blue", "green"),
              lty=1
      )
    }

    cov.box <- function( Dt ){
      dtv <- data.frame(
        SA=Dt$SA.ssn,
        Greedy=Dt$G.ssn,
        RILP=Dt$RILP.ssn
      )
      boxplot( dtv, ylim=c(0,1), main="sSn" )
    }

    cov.line <- function( Dt ){
      dtv <- data.frame(
        GROUP=factor(Dt$tf),
        SSN=( numeric(length=length(Dt$SA.ssn)) -100 )
      )
      ##
      plot( dtv, type="n", ylim=c(0,1), main="sSn for each TF group" )
      lines( Dt$SA.ssn, col="red" )
      lines( Dt$G.ssn, col="blue" )
      lines( Dt$RILP.ssn, col="green" )
      legend( "bottomright",
              legend=c("SA", "Greedy", "RILP"),
              col=c("red", "blue", "green"),
              lty=1
      )
    }

    fa2 <- "tf.source2.tsv"

    dt <- read.table( file=fa2,
                      header=T,
                      comment.char="#",
                      sep=","
    )

    message("mSn (f.size)")
    fs.box( dt )
    cov.box( dt )

    message("sSn (fore.coverage)")
    fs.line( dt )
    cov.line( dt )

    ###---eof---###

Appendix B: Supplementary Contents

B.1 ENCODE TF Group Datasets

Table B.1: Basic Information about 51 Core ENCODE TF Groups

TF group   candidate motif set   training sequence set   testing sequence set
ATF3       53                    2400                    2400
BATF       12                    1000                    1000
BCL        67                    5400                    5400
BDP1       51                    2496                    2500
BHLHE40    13                    1000                    1000
BRCA1      20                    1050                    1050
CEBPB      44                    3999                    4000
E2F        83                    7993                    8000
EBF1       22                    2000                    2000
EGR1       63                    2580                    2600
ELF1       29                    3000                    3000
EP300      72                    8200                    8200
ESRRA      54                    4093                    4200
ETS        105                   8199                    8200
FOXA       49                    5000                    5000
GATA       82                    7996                    8000
HEY1       20                    2000                    2000
HNF4       35                    3000                    3000
IRF        47                    2650                    2650
MAF        43                    4000                    4000
MEF2       20                    2000                    2000
MXI1       23                    2000                    2000
NANOG      11                    1000                    1000
NFE2       22                    1200                    1200
NFKB       109                   10188                   10200
NFY        24                    2000                    2000
NR2C2      35                    1598                    1600
NR3C1      64                    4250                    4250
NRF1       58                    4198                    4200
PAX5       48                    4000                    4000
PBX3       12                    999                     1000
POU2F2     47                    3831                    4000
POU5F1     10                    1000                    1000
PRDM1      7                     1000                    1000
REST       108                   9910                    10000
RFX5       46                    3200                    3200
RXRA       48                    3050                    3050
SIX5       29                    2998                    3000
SP1        47                    4000                    4000
SPI1       30                    3000                    3000
SRF        57                    5000                    5000
STAT       135                   7200                    7200
TAL1       12                    1000                    1000
TCF12      34                    2200                    2200
TCF7L2     14                    1999                    2000
TFAP2      22                    1999                    2000
YY1        75                    9196                    9200
ZBTB33     48                    800                     800
ZBTB7A     13                    1000                    1000
ZEB1       13                    1000                    1000
ZNF143     11                    1000                    1000

B.2 Results on ENCODE TF Groups

Data for the Greedy and RILP methods are from [70].

Table B.2: Feature Set Size (m) Result

TF group   Greedy   RILP   SA   SAr85   SAr70
ATF3       2        33     33   4       4
BATF       3        11     11   5       2
BCL        4        54     55   25      16
BDP1       4        34     34   14      4
BHLHE40    2        11     11   5       4
BRCA1      2        16     16   4       2
CEBPB      2        32     32   6       3
E2F        2        51     51   7       1
EBF1       3        17     17   6       2
EGR1       1        29     30   9       6
ELF1       2        23     23   5       3
EP300      5        60     60   21      12
ESRRA      4        44     44   16      13
ETS        2        55     56   4       4
FOXA       2        35     35   14      8
GATA       4        53     55   11      4
HEY1       2        20     20   5       2
HNF4       2        26     26   9       3
IRF        4        31     31   6       6
MAF        2        28     28   3       1
MEF2       3        18     18   8       5
MXI1       3        20     20   8       4
NANOG      3        11     11   6       5
NFE2       2        11     11   2       2
NFKB       2        54     56   9       5
NFY        2        13     13   2       1
NR2C2      3        20     20   7       6
NR3C1      4        47     47   9       6
NRF1       1        22     25   2       1
PAX5       3        41     41   8       5
PBX3       3        12     12   3       2
POU2F2     2        35     35   15      4
POU5F1     2        7      7    2       1
PRDM1      2        6      6    2       2
REST       2        56     56   13      5
RFX5       3        33     33   6       3
RXRA       4        40     40   10      6
SIX5       2        18     18   3       2
SP1        3        34     34   9       6
SPI1       2        20     20   4       2
SRF        2        40     40   6       2
STAT       4        60     65   13      4
TAL1       3        11     11   4       3
TCF12      3        29     29   9       5
TCF7L2     4        12     12   5       3
TFAP2      2        17     17   4       2
YY1        2        49     49   7       2
ZBTB33     2        20     20   4       2
ZBTB7A     3        12     12   4       2
ZEB1       2        13     13   7       4
ZNF143     2        10     10   4       3

Table B.3: Sequence Sensitivity (sSn) Result

TF group   Greedy   RILP   SA     SAr85   SAr70
ATF3       0.74     0.89   0.89   0.65    0.64
BATF       0.58     0.79   0.80   0.71    0.59
BCL        0.66     0.95   0.94   0.80    0.73
BDP1       0.82     0.96   0.96   0.83    0.67
BHLHE40    0.63     0.79   0.79   0.68    0.56
BRCA1      0.49     0.63   0.63   0.51    0.20
CEBPB      0.65     0.89   0.88   0.75    0.62
E2F        0.87     0.99   0.99   0.82    0.70
EBF1       0.76     0.88   0.88   0.75    0.64
EGR1       0.83     0.96   0.96   0.85    0.64
ELF1       0.79     0.92   0.91   0.77    0.61
EP300      0.73     0.97   0.97   0.81    0.70
ESRRA      0.65     0.89   0.89   0.77    0.65
ETS        0.88     0.98   0.98   0.69    0.68
FOXA       0.63     0.92   0.91   0.80    0.64
GATA       0.77     0.99   0.99   0.83    0.64
HEY1       0.73     0.88   0.89   0.77    0.64
HNF4       0.76     0.96   0.97   0.83    0.72
IRF        0.81     0.98   0.97   0.85    0.76
MAF        0.78     0.97   0.97   0.84    0.72
MEF2       0.64     0.85   0.85   0.69    0.31
MXI1       0.75     0.89   0.90   0.80    0.68
NANOG      0.57     0.83   0.84   0.71    0.61
NFE2       0.93     0.96   0.96   0.88    0.88
NFKB       0.73     0.97   0.95   0.77    0.69
NFY        0.93     0.97   0.97   0.84    0.70
NR2C2      0.87     0.93   0.93   0.78    0.63
NR3C1      0.79     0.95   0.95   0.80    0.75
NRF1       0.92     0.99   0.98   0.10    0.10
PAX5       0.75     0.96   0.96   0.78    0.58
PBX3       0.59     0.72   0.73   0.38    0.41
POU2F2     0.59     0.85   0.85   0.67    0.58
POU5F1     0.76     0.86   0.86   0.75    0.71
PRDM1      0.80     0.92   0.92   0.80    0.74
REST       0.81     0.97   0.96   0.88    0.70
RFX5       0.76     0.92   0.92   0.80    0.69
RXRA       0.72     0.94   0.93   0.69    0.63
SIX5       0.80     0.92   0.90   0.79    0.74
SP1        0.74     0.88   0.88   0.76    0.62
SPI1       0.88     0.99   0.99   0.85    0.72
SRF        0.70     0.91   0.90   0.77    0.68
STAT       0.77     0.99   0.98   0.84    0.72
TAL1       0.67     0.87   0.88   0.77    0.67
TCF12      0.70     0.96   0.96   0.73    0.58
TCF7L2     0.83     0.91   0.91   0.79    0.69
TFAP2      0.86     0.97   0.97   0.88    0.73
YY1        0.79     0.95   0.95   0.80    0.54
ZBTB33     0.72     0.84   0.84   0.73    0.69
ZBTB7A     0.91     0.96   0.96   0.83    0.70
ZEB1       0.73     0.89   0.89   0.74    0.60
ZNF143     0.46     0.58   0.56   0.43    0.23

B.3 Detailed Results on Selected Motifs

B.3.1 Example of good performance group

Table B.4: Motifs selected by SAr85 from BATF group

motif            logo   training   testing   TOMTOM
BATF MEME 3             420        452       42
BATF MEME 1             374        404       41
BATF MEME 2             376        390       47
BATF Trawler 2          114        106       1
BATF Trawler 1          109        129       6

Table B.5: Examples of TOMTOM reported alignments

SA selected motif   matched motif   alignment
BATF MEME 1         BATF
BATF MEME 1         JUNB
BATF MEME 1         JUN
BATF MEME 1         JUND
BATF MEME 1         FOSL2
BATF MEME 1         BACH2
BATF MEME 1         FOSL1
BATF MEME 1         NFE2
BATF MEME 1         FOS
BATF MEME 1         NF2L2
BATF MEME 1         FOSB
BATF MEME 1         BACH1
BATF MEME 1         MAFK
BATF MEME 1         NF2L1
BATF MEME 1         MAFG
BATF MEME 1         MAFF
BATF MEME 1         HXA9
BATF MEME 1         RORG
BATF MEME 1         BATF3
BATF MEME 1         MEIS1
BATF MEME 1         MAFB
BATF MEME 1         PDX1
BATF MEME 1         HXB6
BATF MEME 1         ZN554
BATF MEME 1         NRL
BATF MEME 1         HXB7
BATF MEME 1         PAX2
BATF MEME 1
BATF MEME 1         NR2E1
BATF MEME 1         ATF1
BATF MEME 1         HXB8
BATF MEME 1         MAF
BATF MEME 1         NR1D1
BATF MEME 1         HXC6
BATF MEME 1         HXC8
BATF MEME 1         ZFP28
BATF MEME 1         CREB1
BATF MEME 1         PBX2
BATF MEME 1         CUX2
BATF MEME 1         PBX1
BATF MEME 1         HXA5
BATF Trawler 2      MAFG
BATF Trawler 1      RUNX3
BATF Trawler 1      RUNX1
BATF Trawler 1      RUNX2
BATF Trawler 1      PEBB
BATF Trawler 1      GFI1
BATF Trawler 1      GFI1B

B.3.2 Example of bad performance group

Table B.6: Motifs selected by SAr85 from PBX3 group

motif             logo   training   testing   TOMTOM
PBX3 AlignACE 3          272        140       43
PBX3 MEME 1              177        242       56
PBX3 MDscan 2            142        113       31

Table B.7: Examples of TOMTOM reported alignments

SA selected motif   matched motif   alignment
PBX3 AlignACE 3     TFDP1
PBX3 AlignACE 3     MAZ
PBX3 AlignACE 3     SP4
PBX3 AlignACE 3     ZFX
PBX3 AlignACE 3     SP1
PBX3 AlignACE 3     SP3
PBX3 AlignACE 3     SP2
PBX3 AlignACE 3     ETV1
PBX3 AlignACE 3     CLOCK
PBX3 AlignACE 3     TBX15
PBX3 AlignACE 3     PURA
PBX3 AlignACE 3     EGR1
PBX3 AlignACE 3     SPIC
PBX3 AlignACE 3     KLF15
PBX3 AlignACE 3     ZBT7A
PBX3 AlignACE 3     KLF16
PBX3 AlignACE 3     GATA1
PBX3 AlignACE 3     THAP1
PBX3 AlignACE 3     TF7L1
PBX3 AlignACE 3     PAX5
PBX3 AlignACE 3     TAL1
PBX3 AlignACE 3     RARG
PBX3 AlignACE 3     PLAG1
PBX3 AlignACE 3     IRX2
PBX3 AlignACE 3     MNT
PBX3 AlignACE 3     ASCL2
PBX3 AlignACE 3     SRBP2
PBX3 AlignACE 3     USF2
PBX3 AlignACE 3     PROX1
PBX3 AlignACE 3     WT1
PBX3 AlignACE 3     RARA
PBX3 AlignACE 3     ZN148
PBX3 AlignACE 3     KLF6
PBX3 AlignACE 3     TFCP2
PBX3 AlignACE 3     AP2B
PBX3 AlignACE 3     SP1
PBX3 AlignACE 3     RREB1
PBX3 AlignACE 3
PBX3 AlignACE 3     ARNT2
PBX3 AlignACE 3
PBX3 AlignACE 3     FOXK1
PBX3 AlignACE 3     EGR1
PBX3 AlignACE 3     MAFA

Appendix C: Disclaimer

The content of this thesis is based on research that the author performed as a member of the Bioinformatics Lab at Ohio University. The author's previous and current employers made no direct or indirect contributions to the research. No protected information to which the author may have been exposed during employment with previous or current employers is disclosed or included in this thesis.

Trademarks, registered trademarks, product names, brand names, and vendor names are the property of their respective owners. References to them in this thesis are for informational purposes only. No explicit or implicit endorsement of them is made by composing and publishing this thesis.

The views and ideas presented in this thesis are: a) the author's intellectual output based on this thesis research project, and b) content referenced from the literature. These views and ideas do not represent the views or positions of the author's previous and current employers.
