Topics in Signal Processing: applications in genomics and genetics

Abdulkadir Elmas

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences

COLUMBIA UNIVERSITY

2016

© 2016
Abdulkadir Elmas
All Rights Reserved

ABSTRACT

Topics in Signal Processing: applications in genomics and genetics

Abdulkadir Elmas

The information in genomic and genetic data is shaped by various complex processes, and appropriate mathematical modeling is required to study both the underlying processes and the data. This dissertation focuses on the formulation of mathematical models for selected problems in genomics and genetics, and on the development of algorithms that solve them efficiently. A Bayesian approach to transcription factor (TF) motif discovery is examined, and extensions are proposed to deal with the many interdependent parameters of TF-DNA binding. The problem is described in statistical terms, and a sequential Monte Carlo sampling method is employed to estimate the unknown parameters. In particular, a class-based resampling approach is applied for the accurate estimation of a set of intrinsic properties of the DNA binding sites. Through statistical analysis of gene expression, a motif-based computational approach is developed for the inference of novel regulatory networks in a given bacterial genome. To deal with high false-discovery rates in genome-wide TF binding predictions, discriminative learning approaches are examined in the context of sequence classification, and a novel mathematical model is introduced to the family of kernel-based Support Vector Machine classifiers. Furthermore, the problem of haplotype phasing is examined based on genetic data obtained from cost-effective genotyping technologies. Based on the identification and augmentation of a small and relatively more informative genotype set, a sparse dictionary selection algorithm is developed to infer the haplotype pairs for the sampled population. In a related context, to detect redundant information in single nucleotide polymorphism (SNP) sites, the problem of representative (tag) SNP selection is introduced. An information-theoretic heuristic is designed for the accurate selection of tag SNPs that capture the genetic diversity in a large sample set from multiple populations. The method is based on a multi-locus mutual information measure, reflecting a biological principle in population genetics, namely linkage disequilibrium.

Table of Contents

List of Figures

List of Tables

1 Introduction
    1.1 Thesis overview

2 Reconstruction of novel transcription factor regulons through inference of their binding sites
    2.1 Background
        2.1.1 Motif-based inference of novel regulons of transcription factors
    2.2 System model
        2.2.1 High-confidence co-regulated gene set estimation via sparse biclustering
        2.2.2 Filtering high-confidence gene set
        2.2.3 Motif discovery with BAMBI2b
    2.3 Results and discussion

3 The folded k-spectrum kernel: a machine learning approach to detecting transcription factor binding sites with nucleotide dependencies
    3.1 Introduction
        3.1.1 approach
        3.1.2 Support Vector Machines approach
    3.2 Approach
        3.2.1 SVM for sequence classification
        3.2.2 Folded k-spectrum kernel
    3.3 Methods
        3.3.1 Datasets
        3.3.2 Identifying top-enriched features
        3.3.3 Filtering out potential false positive gapped k-mers (false discovery-based feature elimination)
        3.3.4 Comparing top-enriched features with those in JASPAR
        3.3.5 Cross-validation
    3.4 Results
        3.4.1 Filtering motif-based TFBS predictions by SVM classification
        3.4.2 Classifying enhancer sequences and detecting functional subgroups
        3.4.3 Removing False Positives
        3.4.4 Enriched motifs correspond to known TFBSs
        3.4.5 Cross-validation
    3.5 Discussion

4 Maximum parsimony xor haplotyping by sparse dictionary selection
    4.1 Introduction
    4.2 System model
        4.2.1 Xor Haplotyping by Sparse Dictionary Selection (XHSD)
        4.2.2 Extensions
    4.3 Results and discussion
        4.3.1 Synthetic data
        4.3.2 Missing data
        4.3.3 CFTR gene database
        4.3.4 Typing errors
        4.3.5 ANRIL database

5 Discovering genome-wide tag SNPs based on the mutual information of the variants
    5.1 Introduction
    5.2 Methods
        5.2.1 The attractor metaSNP
        5.2.2 Attractor metaSNP estimation for a particular seed
        5.2.3 Finding all attractor metaSNPs and selecting the tag SNPs
        5.2.4 Selecting tag SNPs
    5.3 Results
        5.3.1 Data sets
        5.3.2 Performance evaluation
        5.3.3 Genotype data sets from several ENCODE regions and the human chromosome 22
        5.3.4 Haplotype data sets from ENCODE regions
        5.3.5 Missing genotype datasets from ENCODE regions
        5.3.6 Genotype data sets from 1000 Genomes Project (1KGP)
        5.3.7 Complexity
    5.4 Discussion

6 Conclusions

Bibliography

List of Figures

2.1 Motif-based inference of novel regulons of transcription factors.
2.2 Scatter plot of the features corresponding to expression rates of genes lexA and dinI. The data is obtained from [40].
2.3 The two-block structure of a length-M motif described by the blocks (B) and the flankers (F).
2.4 Gene expression heatmap of the estimated high-confidence set: dinB, yafN, dinG, sulA, dinI, umuD, ydjM, yebG, recX, recA, lexA, recN.
2.5 LexA motif estimated by the proposed approach.
2.6 PurR motif estimated by the proposed approach.
2.7 Fur motif estimated by the proposed approach.
2.8 ROC of reconstruction results based on BAMBI2b-estimated motifs (above) vs RegPrecise motifs (below).
2.9 Hypothetical Dde0289 motif estimated by the proposed approach.
2.10 motif estimated by the proposed approach.

3.1 SVM with feature elimination based on the discovery of false enrichments.
3.2 Discovery of the different groups of input sequences, based on the difference in FN calls accumulated over multiple cross-validation tests.
3.3 Performance of the SVM-based filtering approaches, contiguous vs. folded k-spectrum kernels with/without feature elimination (FE).
3.4 Classification performance of the k-spectrum kernel vs folded k-spectrum kernel.
3.5 Classification performance of the k-spectrum kernel vs folded k-spectrum kernel after feature elimination.
3.6 JASPAR motifs (top) and the weblogos of the enriched gapped k-mers (below). Shown are weblogos corresponding to the TFBSs for A) BCD, B) CTCF, C) BRK, and D) MAD.
3.7 Superimposed ROC curves of the 100 cross-validation tests. Contiguous vs. folded k-spectrum kernel without feature elimination (above) and with feature elimination (below). Note that many of the ROC curves overlap due to the very small number of false positives found.

4.1 Ambiguity resolution for PPXH method. Informative regular genotypes G_I are determined by the MTI algorithm, and they are used as control inputs for bit flipping on the initial inference result H.
4.2 Ambiguity resolution in XHSD or in XOR-HAPLOGEN (XHAP). Informative regular genotypes G_I are determined by the MTI algorithm, and they are used as inputs to augment the initial data set by replacing xor-genotypes of the individuals I ⊆ {1,...,N}.
4.3 Potential inference quality on short (L < 14) synthetic data.
4.4 Performance on long (5 ≤ L ≤ 46) synthetic data by bit flipping via 2 regular genotypes.
4.5 Performance on long (5 ≤ L ≤ 46) synthetic data from 50 individuals by employing different numbers of regular genotypes.
4.6 Performance under low rates of missing data, long (5 ≤ L ≤ 46) synthetic data.
4.7 Performance under high rates of missing data, long (5 ≤ L ≤ 46) synthetic data.
4.8 Performance under different percentages of missing data.
4.9 Performance on CFTR gene database with different population sizes with/without missing data.
4.10 Running times on CFTR gene database with different population sizes with/without missing data.
4.11 Performance on CFTR gene database for different population sizes, with P_err = 2%, with/without missing data.
4.12 Performance on ANRIL gene database with different population sizes with/without missing data.

5.1 3 tag SNPs (red) are selected by clustering the 12 SNPs according to genotypes from 6 individuals (rows).
5.2 Coverage rates in imputed genotypes (HapMap).
5.3 Coverage rates in imputed Chromosome 22 genotypes (HapMap).
5.4 Coverage rates in phased haplotypes (HapMap).
5.5 Coverage rates in missing genotypes (HapMap).
5.6 Coverage rates in 1KGP genotypes, Chromosomes 21 and 22.
5.7 Supplementary Figure 1.
5.8 Coverage rates in 1KGP genotypes, Chromosomes 1-20.
5.9 Comparison of the coverage rates in 1KGP chromosomal segments. The average performance curve is estimated from 20 consecutive blocks of 50,000 SNPs. For each given cost (c) the error bars represent the best and worst coverage rates obtained from the tagging results in different blocks.
5.10 Running times in imputed genotypes data tested in ENCODE regions (HapMap).
5.11 Convergence of the algorithm for an estimated attractor (only the first 500 SNPs in the ENCODE ENr113 region are displayed). The tag SNP is marked by the red circle through the iterations (w0, ..., w7), and is selected according to its proximity to the genomic median of the co-top-ranking SNPs.

List of Tables

2.1 The best hits of the estimated (BAMBI2b) vs true (RegPrecise) LexA motifs in SwissRegulon database.
2.2 Novel binding sites for LexA regulon.
2.3 Functional enrichment in reconstructed LexA regulon.
2.4 The best hits of the estimated (BAMBI2b) vs true (RegPrecise) PurR motifs in SwissRegulon database.
2.5 Novel binding sites for PurR regulon.
2.6 The best hits of the estimated (BAMBI2b) vs true (RegPrecise) Fur motifs in SwissRegulon database.
2.7 Functional enrichment in reconstructed Fur regulon.
2.8 The best hits of the estimated (BAMBI2b) vs true (RegulonDB) Fis motifs in SwissRegulon database.

3.1 5 TFBS data sets used in the proposed SVM approach.
3.2 The rate of ChIP hits in the given TFBS predictions (before and after the SVM filtering).

5.1 Details of HapMap data sets used in this chapter.
5.2 Coverage rates in imputed genotype data (HapMap).
5.3 Minimum number of tag SNPs that reach >95% coverage in imputed genotype data (HapMap).
5.4 Coverage rates in missing genotype data (HapMap).

Acknowledgments

First and foremost, I would like to thank my advisor Prof. Xiaodong Wang, who has made this thesis possible. I am indebted to him for his constant support throughout my doctoral program. He has always guided me with patience and trust, teaching the very essence of a life-long learning path. I am privileged to have had him as a mentor over the past years, solidifying my path as an independent researcher. My deep gratitude is also due to Prof. Dimitris Anastassiou, who has given me constant advice, encouragement and support during our collaboration in my last year in the program, which helped me design my future research career. His friendly approach to the most difficult challenges in modern health science continues to inspire me. I am especially thankful to Prof. Jacqueline Dresch, with whom I had the most fruitful discussions in our collaborative work. With her advice and guidance I found my vision to continue a career in this interdisciplinary research. I am also thankful to Dr. Michael Samoilov, with whom I had the chance to collaborate on several projects, and I wish him success in his promising contributions to research. I would like to extend my thanks to the thesis committee members, including Prof. Christine Hendon and Prof. John Wright, for taking the time to read and evaluate this thesis. I am thankful to my friends at Columbia, Guido, Taihsien and Oyetunji, with whom I had fruitful discussions and collaborations. I would also like to thank my friends Yasin and Shang, with whom I have walked through the end of the PhD. My special thanks are reserved for my dear parents, Nazım and Ayşegül, my brothers Abdulkerim and Abdulbaki, my dear sister Asude, and especially my dear fiancée Ayşegül, who are the unknown collaborators of this thesis. Their endless love and support was my main motivation to walk through all the difficulties I faced during this journey. There are no words that I can use to express my gratitude and love for them.

To my dear parents, Nazım & Ayşegül

CHAPTER 1. INTRODUCTION

Chapter 1

Introduction

The execution of biological processes depends on the precise and highly orchestrated control of the spatial and temporal expression of genes. Regulatory instructions such as where, when and to what extent the genes should be expressed are encoded within the DNA [1]. Information processing in the DNA is sustained by the timely binding of regulatory elements, mainly transcription factors (TFs), to specific regulatory sequences such as promoters or enhancers, which facilitate the switching of genes on and off in space and time [2]. The main building blocks of regulatory sequences are the TF binding sites (TFBSs), short DNA sequences typically consisting of 4 to 30 nucleotides, which have distinct specificities such as base composition and nucleotide arrangement. They are specifically bound by one or more DNA-binding proteins, mainly TFs, and serve as a medium for recruiting the transcriptional machinery that governs the expression of the associated genes.

In recent years, the advent of high-throughput and quantitative technologies has made it possible to decipher the transcriptional machinery and to characterize the TFBSs involved. The approaches broadly fall into two main categories: measuring the occupancy of sites through in vivo binding experiments for a target TF of interest, to identify its binding preference along the genome; and measuring specific TF-DNA binding affinities through properly designed in vitro experiments, for a thorough examination of the assumptions made about TF binding characteristics.

In the first case, the binding events are captured by methods such as chromatin immunoprecipitation followed by microarray hybridization (ChIP-chip) or by high-throughput sequencing (ChIP-seq), to gain a more systems-level insight into the TF’s binding preferences. By comparing measurements obtained from different cell types or conditions, some dynamics of the TFs can be derived, such as co-binding and interplay with other TFs in a gene regulatory network [3]. Furthermore, through the analysis of ChIP signals obtained for a given TF, one can reveal the location of binding sites at a certain resolution. Aided by in silico experiments, i.e., computational algorithms that perform statistical inference on these TF binding sequences, some nucleotide-level characteristics of TF binding can be estimated, such as the position weight matrix (PWM), which represents the occurrence frequencies of the different nucleotides at each position of the binding sites across the bound genomic regions [4]. This is known as the motif discovery problem: one aims to computationally derive the precise locations of the bound TFs and to infer the corresponding binding motif (PWM) by exploiting the intrinsic sequence features of TFBSs [5].

Studies using the second approach aim for a more quantitative description of the TFBSs. A rich database of affinity scores can be measured in vitro for a large number of tested sequences. The test sequences can be designed in silico for various purposes, e.g., [6]. Through the use of statistical methods, one can evaluate the characteristics of TF binding by studying the sequence context of each tested example and the corresponding affinity measure. Conclusions can then be drawn about the assumptions made about TF binding, such as the position independence assumption underlying the popular PWM representation, in which the base pairs are assumed to contribute independently to the overall affinity of the TF-DNA binding complex [4].
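The PWM and the position independence assumption described above can be made concrete with a minimal sketch. It assumes a set of already aligned, equal-length binding sites; the function names, the pseudocount and the toy sites are illustrative choices and are not taken from the thesis.

```python
ALPHABET = "ACGT"


def position_weight_matrix(sites, pseudocount=1.0):
    """Estimate per-position nucleotide frequencies from aligned sites.

    Returns a list with one dict per motif position, mapping each
    nucleotide to its (pseudocount-smoothed) relative frequency.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for pos in range(length):
        column = {base: pseudocount for base in ALPHABET}
        for site in sites:
            column[site[pos].upper()] += 1.0
        pwm.append({base: count / total for base, count in column.items()})
    return pwm


def site_probability(pwm, site):
    """Probability of a site under position independence:
    the product of the per-position frequencies."""
    prob = 1.0
    for column, base in zip(pwm, site.upper()):
        prob *= column[base]
    return prob


# Hypothetical aligned binding sites, for illustration only.
toy_sites = ["ACGT", "ACGA", "ACGT", "TCGT"]
pwm = position_weight_matrix(toy_sites)
prob = site_probability(pwm, "ACGT")
```

Because `site_probability` multiplies one term per position, it cannot express dependencies between positions, which is exactly the limitation that the more complex binding models aim to relax.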
Methods based on models that can relax such assumptions are promising. Recent efforts to examine more complex TF binding models demonstrate the need to revise the base-pair independence assumption [7, 8]. In particular, certain TFs with non-generic binding domains (e.g., with nucleotide interdependency) have been shown to play important functional roles in the regulation of gene expression, e.g., CTCF, which mediates enhancer-promoter looping [9]. Such discoveries will shed light on how TFs function in gene expression.

Regulatory DNA sequences also play a key role in modern genetics. Changes in regulatory sequences are known to alter gene expression patterns, the fundamental process in cell development that enables a given gene to be used for a variety of physiological processes. However, deviations from the desired expression patterns may result in disease, where the characteristics of the underlying disease state may be associated with changes in regulatory regions, in particular with single nucleotide polymorphisms (SNPs) [10].

The basic unit of genetic variation is the single nucleotide polymorphism. SNPs occur at specific positions in the genome, and the information they encode varies from one individual to another. Many such polymorphic sites are common to the majority of human populations. A typical SNP carries two common “alleles”, meaning that two of the four possible nucleotides are commonly observed in a population. The variation between these two alleles in the population is summarized by the frequency of the less common allele, i.e., the minor allele frequency (MAF). SNPs may have a MAF as low as 1%, meaning that only 1% of the individuals have the specific “alternative” allele at that locus while the remaining population has the “reference” allele.
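As a small illustration of the MAF definition, the sketch below counts alleles over sampled chromosome copies at a single biallelic SNP. The function name and the toy sample are assumptions made for this example; counting per chromosome copy (two per diploid individual) is the usual convention and is assumed here.

```python
from collections import Counter


def minor_allele_frequency(alleles):
    """Return the frequency of the less common allele at one SNP.

    `alleles` holds one allele symbol per sampled chromosome copy,
    so a diploid individual contributes two entries.
    """
    counts = Counter(alleles)
    if len(counts) > 2:
        raise ValueError("expected a biallelic SNP")
    if len(counts) < 2:
        return 0.0  # the site is monomorphic in this sample
    return min(counts.values()) / sum(counts.values())


# Hypothetical sample: 5 diploid individuals, i.e. 10 chromosome copies.
toy_alleles = list("AAAAAAAGGA")
maf = minor_allele_frequency(toy_alleles)
```

In this toy sample the allele G appears on 2 of the 10 copies, so it is the minor allele.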
Projects such as HapMap and the 1000 Genomes Project focus on cataloguing all genetic variants observed in several human populations, with observed frequencies as low as 1% [11, 12]. The common variation observed in individual SNPs differs from the rare variants of genetic disorders such as cystic fibrosis, which are mainly caused by very rare mutations occurring in the critical regions of DNA that influence the malfunction. Nonetheless, genome-wide association studies have revealed that SNPs may affect disease formation by acting cooperatively rather than individually [13]. The causal variants of genetic diseases may be identified by examining the SNPs in a case-control study, where one is interested in detecting multiple SNP markers that discriminate disease status across the given samples. Such methods are called “linkage analysis”, and have achieved notable success for particular rare diseases such as cystic fibrosis [14] and Huntington disease [15].

A central problem in population genetics is the systematic analysis of associations between SNPs. An empirical observation is the co-occurrence of certain alleles in a set of SNPs within the population. This dependency between SNPs is termed “linkage disequilibrium” (LD), reflecting the chromosomal linkage of sites that are inherited together and remain physically linked through generations, and that would otherwise be in “linkage equilibrium” (LE) due to recombination events [16]. Such discrete patterns in allelic information enable mathematical formulations for examining the underlying linkage between SNPs. Widely known measures of association include the pairwise LD metrics r² and D′ [11, 17], which measure the deviation between the observed frequency of two co-occurring alleles and the frequency expected if they were independent (in LE).
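The pairwise metrics mentioned above can be sketched as follows, using the standard textbook definitions D = p_AB − p_A·p_B, r² = D² / (p_A(1−p_A)·p_B(1−p_B)) and D′ = |D| / D_max. The sketch assumes phased haplotypes at two biallelic SNPs with alleles coded 0/1; the function name and the toy data are illustrative and not taken from the thesis.

```python
def pairwise_ld(haplotypes):
    """Compute (D, D', r^2) for two biallelic SNPs.

    `haplotypes` is a list of (a, b) pairs, one per sampled chromosome,
    where a and b are the 0/1 alleles at the first and second SNP.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n  # frequency of allele 1 at SNP A
    p_b = sum(b for _, b in haplotypes) / n  # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # deviation from independence (LE)
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    r2 = d * d / denom
    # D is normalized by its largest attainable magnitude given the
    # allele frequencies; the bound depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    return d, d_prime, r2


# Hypothetical samples: perfectly linked alleles vs. independent alleles.
linked = [(0, 0)] * 6 + [(1, 1)] * 4
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]
d_linked, dprime_linked, r2_linked = pairwise_ld(linked)
d_indep, dprime_indep, r2_indep = pairwise_ld(independent)
```

For the perfectly linked sample both D′ and r² reach 1 (complete LD), whereas for the independent sample D, D′ and r² are all 0 (LE).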
Furthermore, multi-locus LD measures defined over multiple SNPs provide a more sensible approach for estimating joint associations [18], and are also more biologically relevant [19]. However, they result in increased complexity due to the combinatorial analysis of numerous SNPs, which requires efficient strategies for exploring the vast amount of genetic information [20]. One strategy is to select a subset of (representative) SNPs to reduce the dimensionality of the SNP data. A main observation from LD analysis is that nearby SNPs may have identical allelic variation across the same population of individuals, indicating that they are in complete LD and bear redundant information, so that one of these SNPs is sufficient to capture the variation at the nearby sites. Selecting such representative SNPs is called the “SNP tagging” problem.

Algorithms developed for solving a biological problem are often assisted by biological principles learned from empirical data. For example, co-occurring alleles are often observed in nearby SNPs and are inherited together on a single chromosome (i.e., in complete LD). Such a sequence of alleles is called a “haplotype”, and haplotypes are usually contained in blocks of contiguous SNPs. This leads to limited sequence (haplotype) diversity within each block, where some haplotypes are observed more frequently than other possible haplotypes [21]. This is known as the “maximum parsimony principle”.

Humans, like other diploid organisms, carry two copies of each chromosome in their cells, inherited from the two parental genomes. Thus, for a particular SNP site a pair of alleles is observed, and this conflated information is called a “genotype”. Most sequencing techniques can only determine the genotype sequence of an individual, without specifying the “phase”, or chromosome of origin, of each allele at a SNP site.
Direct measurement of the chromosomes of origin requires expensive and time-consuming procedures. Therefore, in most genetic data the haplotype sequences cannot be directly measured, and algorithms are developed to infer the haplotype phase from the genotype data. This problem is called “haplotype inference” (or haplotype phasing): one aims to infer a pair of haplotype sequences for each genotype sequence in a population of individuals, such that the same haplotype sequences are matched to as many individuals as possible, reflecting the underlying “maximum parsimony principle”.

This thesis focuses on solving the aforementioned problems, which are connected by the observations and insights each provides to the others. Depending on the available data and knowledge, each problem is stated as a statistical inference or optimization problem and is approached with efficient computational algorithms developed for that purpose.
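The phase ambiguity behind the haplotype inference problem can be illustrated with a short sketch that enumerates every haplotype pair consistent with one genotype. The 0/1/2 genotype coding (number of alternative alleles per SNP) and the function name are assumptions made for this example; this is a brute-force illustration, not one of the phasing algorithms developed in the thesis.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """List all unordered haplotype pairs that explain a genotype.

    `genotype` gives, per SNP, the number of alternative alleles:
    0 (homozygous reference), 1 (heterozygous), 2 (homozygous alternative).
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    pairs = set()
    for choice in product((0, 1), repeat=len(het_sites)):
        # Homozygous sites are fixed: 0 -> allele 0, 2 -> allele 1.
        h1 = [g // 2 for g in genotype]
        h2 = list(h1)
        # Each heterozygous site can be phased in two ways.
        for site, allele in zip(het_sites, choice):
            h1[site], h2[site] = allele, 1 - allele
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


# Hypothetical genotype with two heterozygous sites.
toy_pairs = compatible_haplotype_pairs((1, 2, 1))
```

A genotype with k heterozygous sites (k ≥ 1) admits 2^(k−1) distinct pairs, which is why additional criteria such as maximum parsimony across the population are needed to choose among them.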

1.1 Thesis overview

The main body of the dissertation is organized into four chapters. The first two (Chapters 2 and 3) address problems in genomics: the inference of novel TF regulons based on the discovery of unknown TF motifs, and the reduction of the false positive rate in predicted TF binding sites. The last two (Chapters 4 and 5) address problems in genetics: haplotype phasing from the less informative “xor” genotypes, using the maximum parsimony principle and exploiting the structure of haplotype blocks; and a block-free solution to the representative SNP selection (tagging) problem, which obviates the computationally intensive block inference step of the haplotype phasing problem and, more generally, addresses the curse of dimensionality that arises in genetic research.

In Chapter 2, we tackled the problem of identifying the regulatory networks (regulons) of bacterial TFs through a series of computational steps: estimation of the co-expressed gene set (upstream regulatory sites) for a given TF gene, discovery of the underlying TF motif, and genome-wide scanning with the motif estimate to detect putative regulatory elements, i.e., TFBSs and genes. The key objective in all of these steps is the derivation of an accurate TF motif, i.e., the motif discovery problem. We adopted the classical PWM model for the statistical representation of the putative TFBSs that are likely to be shared among the upstream regions of the co-expressed genes.

In novel applications, a priori knowledge of the model parameters is unavailable, including the PWM length, the base composition, and the underlying motif core (the bases with high information content) that mostly characterizes TF binding. We modeled such quantities as unobservable variables in a hidden Markov model (HMM) and formulated a Bayesian inference method [22]. In particular, in bacteria most regulatory sites bound by a TF have intrinsic sequence features, i.e., the spatial similarity of nucleotides observed between the TFBS half-sites.
We modeled this by defining the PWM as consisting of two separate blocks, where the location of each block and the type of symmetry the blocks exhibit are estimated within the Bayesian framework. The approach was applied to a novel genome, and the predictions provided new insights into a hypothetical TF motif. When tested against manually curated databases, the reconstructed regulons are consistent with the results of other approaches, although for certain TFs the method may similarly suffer from high rates of false positives in the genome-wide prediction of binding sites.

In Chapter 3, to reduce the rate of false positives in the predicted TFBSs, we aimed at a sequence classification-based solution through a discriminative learning model. TFBSs have sequence characteristics that enable their discrimination from the genomic background. The sequences bound by a particular TF across the genome exhibit a certain degree of sequence similarity, which can be captured by the shared occurrences of fixed-length subsequence patterns known as “k-mers”. Discriminative approaches are well suited to this type of problem, in particular support vector machines (SVMs) with properly defined string kernels [23]. A k-mer is a contiguous sequence of k nucleotides that may be present at multiple locations in a TFBS. Thus, the observed frequencies of a set of particular k-mers (i.e., the k-mer spectrum) may represent the underlying TF’s binding preferences by modeling the “contiguous dependency” of k nucleotides, as opposed to the classical PWM models. We examined the predictive power of the SVM classifier based on the k-mer spectrum kernel for discriminating TFBSs from background, and then proposed an improved spectrum kernel that efficiently incorporates “non-contiguous” dependency into k-mers.
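The contiguous k-mer spectrum and the corresponding kernel can be sketched in a few lines. This shows only the standard (contiguous) spectrum kernel as an inner product of k-mer count vectors; it is not the folded k-spectrum kernel proposed in Chapter 3, and the function names and toy sequences are illustrative assumptions.

```python
from collections import Counter


def kmer_spectrum(sequence, k):
    """Count every contiguous k-mer occurring in the sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(seq_a, seq_b, k):
    """Inner product of the two k-mer count vectors.

    Only k-mers present in both sequences contribute to the sum.
    """
    spec_a = kmer_spectrum(seq_a, k)
    spec_b = kmer_spectrum(seq_b, k)
    return sum(count * spec_b[kmer] for kmer, count in spec_a.items())


# Hypothetical sequences, for illustration only.
spectrum = kmer_spectrum("ACGTAC", 2)
similarity = spectrum_kernel("ACGTAC", "ACGG", 2)
```

A kernel value computed this way can be supplied to an SVM as a precomputed similarity between sequences, so the k-mer feature vectors never have to be enumerated over the full 4^k space.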
Specifically, the dependency in the contiguous k-mer model is broken by gaps introduced in specific numbers and arrangements, and the k-mer spectrum is expanded accordingly with the occurrences of these “gapped” k-mers. When applied to a particular cis-regulatory module (CRM) sequence set, the approach outperforms the traditional SVM classifier and raises new biological hypotheses regarding the nature of protein-DNA binding. Due to the large feature space, however, the learned classifier often overfits noise in the training data. To deal with this, we proposed a feature elimination procedure based on the “false discovery” of spurious k-mer (or gapped k-mer) features.

In Chapter 4, we deal with a specific case of the haplotype phasing problem that is based on a less informative form of genotype sequences (“xor genotypes”), which result from certain cost-effective sequencing technologies [24]. An xor genotype, named after the logical “xor” operation, is obtained by a technique [25] that captures only the similarity between the alleles at a SNP locus, i.e., whether the alleles are identical (homozygous) or different (heterozygous). Ambiguity occurs in the homozygous xor genotypes (as well as in the heterozygous xor genotypes), where the pair of identical alleles could be either reference (0|0) or alternate (1|1). This is called the “bit-flip degree of freedom”: for a given SNP, by choosing (flipping) between the two possible haplotype solutions (0|0 and 1|1) one can equivalently solve all remaining xor genotypes in the population. This ambiguity can be resolved with reasonable extra computational effort. Based on the maximum parsimony principle, we developed a sparse dictionary selection algorithm that finds a smallest set of (phasing) haplotype sequences sufficient to explain every xor genotype sequence in the given population [26].
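The xor genotype and the bit-flip degree of freedom described above can be illustrated with a minimal sketch. It assumes haplotypes coded as lists of 0/1 alleles; the function names and the toy population are hypothetical, and the sketch only demonstrates the ambiguity, not the sparse dictionary selection algorithm itself.

```python
def xor_genotype(hap1, hap2):
    """Per-SNP xor of two haplotypes: 0 = homozygous, 1 = heterozygous."""
    return [a ^ b for a, b in zip(hap1, hap2)]


def population_xor(pairs):
    """Xor genotypes of every individual's haplotype pair."""
    return [xor_genotype(h1, h2) for h1, h2 in pairs]


def flip_site(pairs, site):
    """Flip the allele at one SNP in every haplotype of the population."""
    def flip(hap):
        return [1 - a if i == site else a for i, a in enumerate(hap)]
    return [(flip(h1), flip(h2)) for h1, h2 in pairs]


# Hypothetical population of two individuals over three SNPs.
toy_population = [([0, 1, 0], [0, 0, 1]), ([1, 1, 0], [0, 1, 0])]
observed = population_xor(toy_population)
flipped = flip_site(toy_population, site=0)
```

Flipping a SNP across the whole population leaves every xor genotype unchanged, so the xor data alone cannot distinguish the two solutions; this is why a few regular genotypes are needed to fix each ambiguous SNP.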
Assisted by a small set of regular genotypes, the algorithm can resolve the ambiguity occurring at the homozygous xor genotypes. A minimum tree intersection algorithm is also used to efficiently determine the smallest subgroup of samples whose regular genotypes carry the most information and cover all ambiguous loci (homozygous xor genotypes) in the given data. Experiments with synthetic and real data sets show that the proposed algorithm outperforms other state-of-the-art algorithms in most cases, especially on data sets with a large number of samples.

Finally, the representative (tag) SNP selection problem is presented in Chapter 5. In most genome-wide association studies, the genotypes contain millions of SNPs, and examining all associations (even in the pairwise case) for accurate SNP tagging is computationally intractable. We developed an efficient “attractor” algorithm that iteratively estimates a set of (tagging) SNPs such that each represents a distinct cluster of mutually associated variants. Information theory is used to measure the degree of association between individual variants and sets of linked variants. Compared to conventional approaches, the main contributions of the proposed algorithm are the definitions of the “attractor” and the “metaSNP” as simplified representations of “multi-locus LD” and “variation” in the haplotype blocks, respectively, which provide the rich information underlying the joint linkage disequilibrium of the estimated SNPs. Furthermore, these quantities can be iteratively estimated up to a certain precision, allowing efficient heuristic designs that can be applied to very large data sets. In particular, our sliding-window heuristic employs the vast amount of genetic information in several layers, and scales efficiently to genome-wide analysis of several thousand individuals.
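As a rough illustration of how information theory can quantify the association between two variants, the sketch below computes the empirical (plug-in) mutual information between two SNP genotype vectors. This is only the simple pairwise case under assumed 0/1/2 genotype coding; the multi-locus measure and the attractor and metaSNP definitions of Chapter 5 are not reproduced here, and the function name and toy data are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Plug-in estimate of mutual information (in bits) between two
    equally long sequences of discrete genotype values."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("inputs must be non-empty and of equal length")
    joint = Counter(zip(x, y))
    count_x, count_y = Counter(x), Counter(y)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a, b) * log2( p(a, b) / (p(a) * p(b)) ), written with counts
        mi += (c / n) * log2(c * n / (count_x[a] * count_y[b]))
    return mi


# Hypothetical genotype vectors over six individuals.
snp_a = [0, 1, 2, 0, 1, 2]
snp_b = [0, 1, 2, 0, 1, 2]  # identical to snp_a: fully redundant
mi_redundant = mutual_information(snp_a, snp_b)
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Two SNPs with identical genotype vectors share all of their information (the value equals the entropy of either SNP), so one of them can serve as a tag for the other, while independent SNPs share none.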
Rigorous experiments were carried out on the benchmark HapMap data sets, and the approach outperformed the existing state-of-the-art algorithms in terms of both accuracy (sample coverage per cost) and computational complexity. A novel application to the popular 1000 Genomes Project database resulted in a catalogue of tagging variants that can capture all genotype diversity in this multi-population sample set in a cost-effective manner.

Each chapter is self-contained, is presented with its own set of notations, and can be read independently.

CHAPTER 2. RECONSTRUCTION OF NOVEL TRANSCRIPTION FACTOR REGULONS THROUGH INFERENCE OF THEIR BINDING SITES

Chapter 2

Reconstruction of novel transcription factor regulons through inference of their binding sites

2.1 Background

In most sequenced genomes a significant proportion (3-6%) of all genes are known to encode transcription factors [27], essential DNA-binding components that regulate the transcriptional activity of target genes. The promoter regions where TFs specifically bind to the genome are usually located in intergenic sites. Extensive sequencing of the genomes of various organisms has revealed a large degree of conservation of intergenic regions across different species, often occurring among moderately distant relatives. This is the main intuition behind comparative genomics approaches, where one aims to reconstruct regulatory networks by exploiting the evolutionary conservation of regulatory features. The assumption is that if a TF-encoding gene is preserved in a set of closely related species, the respective target genes that are regulated via cognate TF binding sites also tend to be preserved [28]. The regulatory elements identified for each genome, i.e., the TFBSs and their target genes, constitute the “regulon” of the given TF.

Although most known regulators abide by evolutionary conservation, many TF-encoding genes can be organism-specific for various reasons, and their orthologs may not exist in closely related species. In particular, the discovery of horizontal gene transfer can explain the occurrence of nonconserved regulatory members [29]. The intuition that “true sites occur upstream of orthologous genes, false sites are scattered at random” [28] can therefore miss organism-specific interactions by treating them as false predictions. Hence, comparative genomics approaches have limitations, and alternative techniques are needed to identify organism-specific regulatory interactions [30].
In this chapter, we attack this problem from a more general computational perspective by aiming at single-genome TF regulon reconstruction, which makes our approach also suitable for novel organisms. We demonstrate our results for the TFs LexA, PurR and Fur in the model bacterium Escherichia coli K12 by comparing them to their respective regulons in the manually curated RegPrecise database [31]. The extended predictions (those not captured by RegPrecise) are presented with annotations provided by the GO [32] and Protein Interactions (http://www.pir.uniprot.org/) databases. Putative regulon genes reported with high biological significance have expanded the known regulons of LexA, PurR and Fur. Furthermore, the results for a novel genome, Desulfovibrio alaskensis, revealed a Fis-type motif for the hypothetical regulator Dde0289.

2.1.1 Motif-based inference of novel regulons of transcription factors

The cis-acting regulatory elements of genes are usually located in the upstream regions of their coding sequences, where gene expression is controlled by sequence-specific binding of the TFs. Co-expressed genes that have similar TF binding patterns in their regulatory regions can be good candidates for a putative regulon. The binding preference (motif) of a TF can be described by a matrix that represents the frequency of nucleotides observed in each position of the known binding sites. Among others, the position weight matrix (PWM) is a well-suited representation of motifs for statistical evaluation of the corresponding binding sites [33], and it is also a more sensitive metric for TFBS recognition [34].

Recently, more complex models have been introduced for modeling TF-DNA binding affinities. In [35], it is shown that DNA structural features can be calculated from the nucleotide sequences in motif databases, and later the authors of [36] proposed that certain 3D DNA shape information can be derived from high-throughput approaches. In [37], epigenetic factors (methylation, histone modification, etc.) are considered in TF binding, where the authors investigated certain location- and cell-type-specific relationships between epigenetic modifications and binding affinities. Although these studies expand the knowledge for modeling TF binding affinity, the proposed methods may not be readily employed in every genome. In this chapter, we focus on a more general regulon recovery approach based on the discovery of sequence motifs, which can be broadly applied by using only the genome sequence and corresponding gene expression data sets.

We present an integrated method for motif-based inference of novel regulons of transcription factors (Figure 2.1). For a given TF (and its coding gene), the putatively co-regulated gene set is estimated by utilizing available gene expression and knockout fitness data sets in the proposed biclustering method. The approach proceeds by performing motif discovery in the upstream sequences of this high-confidence gene set. For this, we developed a probabilistic algorithm, BAMBI2b, that can estimate, from the supplied sequences, the main regulatory factor’s unknown weight matrix of unknown length and unknown intrinsic sequence symmetry. Once the motif is obtained, the entire genome is scanned with it for TFBS prediction, where the putative TF-DNA binding affinities are estimated by statistical over-representation of the elucidated motif. Finally, the candidate genes located downstream of the predicted TFBSs are checked and identified as members of the putative regulon.
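As a rough illustration of the genome scanning step, the following Python sketch scores every window of a sequence against a log-odds weight matrix. The function names, the pseudocount, the uniform background, and the score threshold are assumptions made for this example only; the scoring used in this chapter (statistical over-representation of the motif) may differ from plain log-odds scoring.

```python
import math

BASES = "ACGT"

def log_odds_pwm(sites, background=0.25, pseudocount=1.0):
    """Build a log-odds weight matrix from aligned, equal-length binding sites."""
    length = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for m in range(length):
        column = [site[m] for site in sites]
        pwm.append({b: math.log2((column.count(b) + pseudocount) / total / background)
                    for b in BASES})
    return pwm

def scan(sequence, pwm, threshold):
    """Return (start, score) for every window scoring at least `threshold`."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = sum(pwm[m][sequence[start + m]] for m in range(width))
        if score >= threshold:
            hits.append((start, score))
    return hits

# Toy sites and sequence (hypothetical, not data from this study).
pwm = log_odds_pwm(["ACG", "ACG", "ACG"])
hits = scan("TTACGTT", pwm, threshold=3.0)
```

In this toy case only the window starting at position 2 (0-based) matches the motif; a real scan would also cover the reverse strand and use a calibrated threshold.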

Figure 2.1: Motif-based inference of novel regulons of transcription factors

By using the proposed approaches we estimated binding motifs of the given transcription factors, and reconstructed their putative binding sites and regulons. We compared our results (i.e., estimated motifs, binding sites and regulated genes) with the RegPrecise database, which is manually curated by an approach [38] most relevant to our work. We used the well-studied transcription factor regulons LexA, PurR, and Fur in the model organism E. coli to validate our predictions. LexA is a protein that, under non-stress conditions, represses the SOS response genes involved in repairing DNA damage. The manually curated LexA regulon in the RegPrecise database consists of 30 genes that are regulated by 26 . PurR is an important repressor for the transcriptional regulation of purine . Its regulon includes genes participating in the biosynthesis of purine/pyrimidine nucleotides. FUR consists of a family of TFs including the metal ion-dependent regulators Fur, Mur, Zur, and Nur, which are responsible for homeostasis of the metal ions in the organism. For each studied TF we used relevant gene expression assays and a knockout fitness dataset when available. We also applied our approach to a novel genome, D. alaskensis, to predict the binding behavior of one of its hypothetical regulators, Dde0289. In fact, we discovered a rare type of binding motif which is structurally weak and not exposed by most sequence-based motif finding tools. We further validated this prediction by applying the same approach to the estimated motif’s main presumed regulator (Fis), which has been annotated in E. coli [39]. The details of the applications and results are described as follows.

2.2 System model

2.2.1 High-confidence co-regulated gene set estimation via sparse biclustering

Given the genome-wide expression profiles we want to estimate a group of genes that are putatively co-regulated with the coding gene of a TF of interest. Co-regulation of a pair of genes can be quantified by the expression coherency across the data points (conditions). Let $M \in \mathbb{R}^{n \times p}$ be the gene expression data consisting of n genes and p conditions (features). Let $m_i$ denote the i-th row of $M$. For a given subset of features (a subset of $\{1, \ldots, p\}$), the observation pair $(m_i, m_j)$ can be represented as a beam around the linear regression

[Figure 2.2 plot: x-axis lexA, y-axis dinI; legend: Features, lexA mean, lexA std, dinI mean, dinI std]

Figure 2.2: Scatter plot of the features corresponding to expression rates of genes lexA and dinI. The data is obtained from [40].

line with slope $a_{ij}$ and intercept $b_{ij}$ (e.g., Fig. 2.2). Then the feature selection problem is to select a subset of features that minimizes the sum of residuals from the best regression. For the bicluster generated for the i-th observation (e.g., a TF’s coding gene), feature selection can be represented by a vector $s_i \in \{0,1\}^p$ such that $s_{ik} = 1$ if the feature k is selected, and 0 otherwise. To avoid trivial solutions (not selecting any feature) a sparse penalty term is introduced to the problem, i.e., $\beta_1 \|\mathbf{1} - s_i\|_1$, where $\beta_1$ is the regularizer coefficient and $\mathbf{1}$ is the vector of 1s. Therefore, the feature selection problem for the i-th gene can be represented as,

\[
\min_{s_i, a_{ij}, b_{ij}} \sum_{k=1}^{p} s_{ik} (m_{ik} - a_{ij} m_{jk} - b_{ij})^2 + \beta_1 \|\mathbf{1} - s_i\|_1, \tag{2.1}
\]

where $s_{ik} \in \{0,1\}$.

To find coherent observations under the conditions given by $s_i$, a gene selection vector $w_i \in \{0,1\}^n$ is introduced, such that $w_{ij} = 1$ if there is a strong linear coherence between the observation pair $(m_i, m_j)$, and $w_{ij} = 0$ otherwise. Similarly, another sparse regularizer is added to penalize trivial biclusters, i.e., $\beta_2 \|\mathbf{1} - w_i\|_1$. To favor co-up-regulated and co-down-regulated observation samples, an importance matrix $D \in \mathbb{R}^{n \times n \times p}$ is introduced, where $d_{ijk} \in [0,1]$ indicates the importance of the k-th feature for the observation pair $(m_i, m_j)$, which is computed as a function of the Euclidean distance of the k-th feature to the central point $(\bar{m}_i, \bar{m}_j)$ in the scatter plot. Relaxing $w_{ij} \in \{0,1\}$ to $w_{ij} \in [0,1]$ and $s_{ik} \in \{0,1\}$ to $s_{ik} \in [0,1]$, the gene expression biclustering problem for the i-th gene can then be cast as,

\[
\min_{w_i, s_i, a_{ij}, b_{ij}} \sum_{j} w_{ij} \sum_{k} \frac{s_{ik}}{d_{ijk}} (m_{ik} - a_{ij} m_{jk} - b_{ij})^2 + \beta_1 \|\mathbf{1} - s_i\|_1 + \beta_2 \|\mathbf{1} - w_i\|_1, \tag{2.2}
\]

where $w_{ij} \in [0,1]$, $s_{ik} \in [0,1]$ [41].

By optimizing (2.2) for a given regulatory gene i and expression matrix $M$, one can obtain a bicluster where the genes within the bicluster exhibit high linear coherency with each other. One may expect that the high coherency in gene expression could indicate a common transcriptional activity. However, this correlation may occur either by chance (e.g., resulting from experimental noise) or because of indirect effects underlying the experiments, so integrating additional biological evidence into the problem can be crucial. In particular, we can employ the fitness data to check whether the knockout strains of the bicluster genes show similar responses (fitness) to a particular stress, i.e., we may assume that the co-regulated genes, when knocked out, are more likely to cause similar growth rates for the organism. Following this intuition, the algorithm looks for the knockouts that have similar growth patterns under a subset of stress conditions. Let $F \in \mathbb{R}^{n \times q}$ be the fitness data consisting of n genes and q conditions. For the observation pair $(f_i, f_j)$ in $F$ we determine the feature selection by

\[
\min_{\hat{s}_i, a_{ij}, b_{ij}} \sum_{\ell=1}^{q} \hat{s}_{i\ell} (f_{i\ell} - a_{ij} f_{j\ell} - b_{ij})^2 + \hat{\beta}_1 \|\mathbf{1} - \hat{s}_i\|_1,
\]

where $\hat{s}_{i\ell} \in \{0,1\}$, and $\hat{\beta}_1$ controls the feature selection in $F$. We optimize the observation selection such that

\[
\sum_{\ell} \frac{\hat{s}_{i\ell}}{\hat{d}_{ij\ell}} (f_{i\ell} - a_{ij} f_{j\ell} - b_{ij})^2 + \hat{\beta}_1 \|\mathbf{1} - \hat{s}_i\|_1 + \hat{\beta}_2 \|\mathbf{1} - w_i\|_1,
\]

where $\hat{\beta}_2$ controls the gene selection and $\hat{d}_{ij\ell}$ is the importance of the $\ell$-th feature for the observation $(f_i, f_j)$.

For a given regulatory gene i, gene expression matrix $M$, and fitness data matrix $F$, the biclustering problem can then be written as

\[
\min_{w_i, s_i, \hat{s}_i, a_{ij}, b_{ij}} \sum_{j} w_{ij} \left( \sum_{k} \frac{s_{ik}}{d_{ijk}} (m_{ik} - a_{ij} m_{jk} - b_{ij})^2 + \sum_{\ell} \frac{\hat{s}_{i\ell}}{\hat{d}_{ij\ell}} (f_{i\ell} - a_{ij} f_{j\ell} - b_{ij})^2 \right) + \beta_1 \|\mathbf{1} - s_i\|_1 + \hat{\beta}_1 \|\mathbf{1} - \hat{s}_i\|_1 + (\beta_2 + \hat{\beta}_2) \|\mathbf{1} - w_i\|_1, \tag{2.3}
\]

where $w_{ij} \in [0,1]$, $s_{ik} \in [0,1]$, and $\hat{s}_{i\ell} \in [0,1]$. Since (2.3) is not jointly convex in $W$, $S$, $\hat{S}$, $A$, and $B$, we propose an iterative procedure for solving it as follows.

Initialization: $A$ is initialized to $\mathbf{1} \cdot \mathbf{1}^T$ since the genes’ expression levels (rows) are normalized to [0,1]. $B$ is initialized from the linear regression line for each observation $(m_i, m_j)$ passing through the central point $(\bar{m}_i, \bar{m}_j)$ with slope $a_{ij}$. $S$ and $\hat{S}$ are initialized by selecting the feature points whose importance distance to the regression line remains within a certain threshold, i.e., $s_{ik} = 1$ if the k-th data point’s distance to the central point $(\bar{m}_i, \bar{m}_j)$ is larger than some predefined threshold. $W$ is computed from the initial values of $S$, $\hat{S}$, $A$, and $B$. Each variable is then updated through optimizing its cost function when the other variables are fixed.

Updating $W$: When $S$, $\hat{S}$, $A$, and $B$ are fixed, (2.3) becomes a convex (linear) objective of $W$, i.e.,

\[
\min_{w_{ij}} \; w_{ij} \left( \sum_{k=1}^{p} \frac{s_{ik}}{d_{ijk}} (m_{ik} - a_{ij} m_{jk} - b_{ij})^2 + \sum_{\ell=1}^{q} \frac{\hat{s}_{i\ell}}{\hat{d}_{ij\ell}} (f_{i\ell} - a_{ij} f_{j\ell} - b_{ij})^2 - (\beta_2 + \hat{\beta}_2) \right) \quad \text{s.t. } w_{ij} \in [0,1], \tag{2.4}
\]

which can be solved as $w_{ij} = 1$ if $\sum_{k=1}^{p} \frac{s_{ik}}{d_{ijk}} (m_{ik} - a_{ij} m_{jk} - b_{ij})^2 + \sum_{\ell=1}^{q} \frac{\hat{s}_{i\ell}}{\hat{d}_{ij\ell}} (f_{i\ell} - a_{ij} f_{j\ell} - b_{ij})^2 < (\beta_2 + \hat{\beta}_2)$, and $w_{ij} = 0$ otherwise.

Updating $S$: When $W$, $\hat{S}$, $A$, and $B$ are fixed, the objective function is a convex (linear) function of $S$, i.e.,

\[
\min_{s_{ik}} \; s_{ik} \left( \sum_{j} \frac{w_{ij}}{d_{ijk}} (m_{ik} - a_{ij} m_{jk} - b_{ij})^2 - \beta_1 \right) \quad \text{s.t. } s_{ik} \in [0,1], \tag{2.5}
\]

then $S$ can be updated in closed form such that $s_{ik} = 1$ if $\sum_{j} \frac{w_{ij}}{d_{ijk}} (m_{ik} - a_{ij} m_{jk} - b_{ij})^2 < \beta_1$, and $s_{ik} = 0$ otherwise.

Updating $\hat{S}$: Similarly, when $W$, $S$, $A$, and $B$ are fixed, the objective function is convex in $\hat{S}$, i.e.,

\[
\min_{\hat{s}_{i\ell}} \; \hat{s}_{i\ell} \left( \sum_{j} \frac{w_{ij}}{\hat{d}_{ij\ell}} (f_{i\ell} - a_{ij} f_{j\ell} - b_{ij})^2 - \hat{\beta}_1 \right) \quad \text{s.t. } \hat{s}_{i\ell} \in [0,1], \tag{2.6}
\]

and $\hat{S}$ is updated in closed form such that $\hat{s}_{i\ell} = 1$ if $\sum_{j} \frac{w_{ij}}{\hat{d}_{ij\ell}} (f_{i\ell} - a_{ij} f_{j\ell} - b_{ij})^2 < \hat{\beta}_1$, and $\hat{s}_{i\ell} = 0$ otherwise.

Updating $A$ and $B$: When $W$, $S$, and $\hat{S}$ are fixed, $A$ and $B$ can be solved as a least-squares linear regression problem.

The procedure iterates by updating $W$, $S$, $\hat{S}$, $A$, and $B$ until the objective function converges. After convergence, we check whether the bicluster has sufficient members (intuitively between 5 and 50 genes) to discover a regulatory motif from their upstream sequences.

Before motif discovery, pre-processing may be necessary to refine those inputs. As one can expect, the obtained high-confidence set may contain false positive genes, and the corresponding upstream sequences can thereby deteriorate the motif estimation. Even some true positive genes may lack the corresponding regulatory sites in their upstream sequences, which makes the motif finding problem ill-defined. Next, we will demonstrate a filtering procedure to eliminate those potentially uninformative genes from the obtained set.
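The closed-form threshold updates for the gene selection and feature selection vectors described in this section can be sketched in Python as follows. This is a minimal illustration under simplifying assumptions: it uses only the expression term (the fitness term and its regularizers are omitted), the function names are hypothetical, and the toy matrix is not data from this study.

```python
import numpy as np

def weighted_residuals(m_i, M, a, b, D):
    """R[j, k] = (m_ik - a_j * m_jk - b_j)^2 / d_jk for the fixed gene i."""
    return (m_i[None, :] - a[:, None] * M - b[:, None]) ** 2 / D

def update_w(R, s, beta2):
    """Gene selection: w_j = 1 if sum_k s_k * R[j, k] < beta2, else 0."""
    return (R @ s < beta2).astype(float)

def update_s(R, w, beta1):
    """Feature selection: s_k = 1 if sum_j w_j * R[j, k] < beta1, else 0."""
    return (w @ R < beta1).astype(float)

# Toy data: genes 0 and 1 are coherent with gene 0, gene 2 is not.
M = np.array([[0.0, 1.0, 2.0, 3.0],
              [0.0, 1.0, 2.0, 3.0],
              [3.0, 0.0, 3.0, 0.0]])
a, b, D = np.ones(3), np.zeros(3), np.ones_like(M)
R = weighted_residuals(M[0], M, a, b, D)
w = update_w(R, s=np.ones(4), beta2=1.0)
s = update_s(R, w, beta1=0.5)
```

In a full iteration these two updates would alternate with a least-squares refit of the slopes and intercepts until the objective stops changing.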

2.2.2 Filtering high-confidence gene set

We aim to eliminate genes that presumably lack the regulatory patterns in their upstream regions, which is generally observed in gene groups that belong to the same operon. If reference knowledge about the TF is available (e.g., , motif, etc.) we could easily detect such genes by scanning their upstream regions. However, for novel genomes such prior information is usually not available. On the other hand, for most sequenced genomes operon prediction data are available. In such data, the adjacently co-regulated genes are grouped into operons, whereby the common regulatory sites are located upstream of the first-in-operon genes. Using this information, an intuitive elimination procedure is to first include only the first-in-operon genes and then sequentially filter in the downstream genes as necessary. Our filtering procedure is outlined as follows.

Assume that the high-confidence set $\mathcal{H}$ consists of the genes appearing in three different operons named a, b, and c, e.g., $\mathcal{H} = \{a_1, a_2, a_3, b_1, b_2, c_1, c_2, c_3\}$, where $a_1$ is the first-in-operon gene in the operon a and $a_2$ is the second-in-operon gene, and so forth for the other operons b and c.

• Determine the first-in-operon genes $\mathcal{F} = \{a_1, b_1, c_1\}$ and the downstream genes $\mathcal{D} = \{a_2, a_3, b_2, c_2, c_3\}$ in the high-confidence set by using the operon predictions database [42]. Start the new high-confidence set with only the first-in-operon genes, e.g., $\mathcal{H} = \mathcal{F} = \{a_1, b_1, c_1\}$.

• For each operon in $\mathcal{D}$, calculate the difference S in co-expression score between a pair of adjacent genes, where the co-expression score is defined as “the coherence of gene expressions (correlation) with the given TF gene”, e.g., $S(a_2) = \mathrm{score}(a_1) - \mathrm{score}(a_2)$, where $\mathrm{score}(x) = \mathrm{Corr}(x, \text{TF gene})$.

• Add the downstream genes with score S < threshold, and check if (i) there are $\geq 10$ genes in $\mathcal{H}$ or (ii) 50% of the downstream genes are filtered in.

• If (i) or (ii) is satisfied, return $\mathcal{H}$. Otherwise, continue adding more downstream genes by incrementing the threshold within a given range and then comparing the next set of gene pairs in the downstream (see Algorithm 1).

Setting a sensible thresholdRange is important, e.g., starting with a negative value may be intuitive. A negative S occurs when the gene’s correlation score is higher than that of a preceding gene, which indicates that the gene is likely directly-regulated by a TF and the regulatory pattern could be present in the corresponding gene’s upstream region.
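The filtering procedure (Algorithm 1) can be sketched in Python as follows. This is an illustrative reading rather than the implementation used in this work: the function name, the dictionary-based inputs, and the toy scores are assumptions, and the threshold increment 0.1 / (1 + number of genes added in the last pass) is one interpretation of the update in Algorithm 1.

```python
def filter_downstream(operons, score, threshold_range, min_genes=10, step=0.1):
    """operons: name -> ordered gene list (first-in-operon gene first).
    score: gene -> correlation with the TF gene."""
    selected = [genes[0] for genes in operons.values()]               # H = F
    remaining = {g for genes in operons.values() for g in genes[1:]}  # D
    n_downstream = len(remaining)
    low, high = threshold_range
    position, threshold = 1, low   # position 1 is the second gene of each operon
    longest = max(len(genes) for genes in operons.values())
    while position < longest:
        size_before = len(selected)
        for genes in operons.values():
            if position < len(genes) and genes[position] in remaining:
                diff = score[genes[position - 1]] - score[genes[position]]
                if diff < threshold:
                    selected.append(genes[position])
                    remaining.discard(genes[position])
        # Stop when enough genes are selected or half of the downstream genes are in.
        if len(selected) >= min_genes or len(remaining) <= n_downstream / 2:
            return selected
        threshold += step / (1 + len(selected) - size_before)
        if threshold >= high:
            position, threshold = position + 1, low
    return selected

# Hypothetical operons and correlation scores.
operons = {"a": ["a1", "a2", "a3"], "b": ["b1", "b2"], "c": ["c1", "c2", "c3"]}
score = {"a1": 0.9, "a2": 0.95, "a3": 0.2, "b1": 0.8, "b2": 0.85,
         "c1": 0.7, "c2": 0.75, "c3": 0.1}
result = filter_downstream(operons, score, threshold_range=(-0.1, 0.5))
```

Here the second-in-operon genes have a negative score difference and are filtered in once the threshold reaches zero, while the weakly correlated third genes stay out.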

Algorithm 1 Downstream gene filter
Require: set $\mathcal{H} = \mathcal{F}$, and $\mathcal{D}_0 = \mathcal{D}$
Require: set i = 2, and threshold = min(thresholdRange)
1: for every operon $x \in \mathcal{D}$, e.g., $x \in \{a, b, c\}$, do
2:   Calculate $S(x_i) = \mathrm{score}(x_{i-1}) - \mathrm{score}(x_i)$
3:   if $S(x_i)$ < threshold then
4:     $\mathcal{H} = \mathcal{H} \cup x_i$
5:     $\mathcal{D} = \mathcal{D} \setminus x_i$
6:   end if
7: end for
8: if $|\mathcal{H}| \geq 10$ then
9:   return $\mathcal{H}$
10: else if $|\mathcal{D}| \leq |\mathcal{D}_0| / 2$ then
11:   return $\mathcal{H}$
12: else
13:   threshold = threshold + $\frac{0.1}{1 + |\mathcal{H}_{new}| - |\mathcal{H}_{old}|}$
14:   if threshold >= max(thresholdRange) then
15:     i = i + 1
16:     threshold = min(thresholdRange)
17:   end if
18:   jump to 1
19: end if

2.2.3 Motif discovery with BAMBI2b

Genes regulated by the same TF usually share a common nucleotide pattern in their promoter regions, where the TF specifically recognizes this pattern and binds. The locations of such TF binding sites and the properties of the corresponding motif can be computationally derived by motif discovery algorithms. Here, we employ the Bayesian Algorithm for Multiple Biological Instances of motif discovery (BAMBI [43]) to estimate the unknown motif for the putatively co-regulated genes obtained in the previous section. BAMBI is designed to discover nucleotide motifs via estimating the number, length and locations of the binding instances in a collection of sequences, where each sequence may contain one, several or no instances of the motif. The algorithm is based on the sequential Monte Carlo (SMC) technique, which processes one input sequence at a time efficiently in a sequential manner. The system is modeled as an HMM, where the hidden state at time t corresponds to the current input’s state vector $x_t$ composed of the number of motif

instances $n_t$ and their locations in the current sequence (observation) $s_t$. As the iteration proceeds, given t sequences $S_t = \{s_1, \ldots, s_t\}$, BAMBI aims to identify the number of motif instances and their locations for each sequence, i.e., $X_t = \{x_1, \ldots, x_t\}$. Since the distribution of the state vectors given observations, $p(X_t \mid S_t)$, is not known and the samples are usually not available for this posterior, an approximation is implemented by taking weighted samples (particles), $X_t^k$, $k = 1, \ldots, K$, from a trial density $q(X_t \mid S_t)$ within the sequential importance sampling (SIS) framework [44]. The sequential algorithm is obtained by setting $q(X_t^k \mid S_t) = q(X_{t-1}^k \mid S_{t-1}) \, q(x_t^k \mid X_{t-1}^k, S_t)$, and computing the sample weights

\[
w_t^k \propto w_{t-1}^k \, \frac{p(s_t \mid X_t^k, S_{t-1}) \, p(x_t^k \mid X_{t-1}^k, S_{t-1})}{q(x_t^k \mid X_{t-1}^k, S_t)}. \tag{2.7}
\]

A resampling procedure [45] is employed to reduce the increasing variance of ineffective samples with very small weights resulting from the phenomenon known as the degeneracy problem [46].

The length-M motif is described as a $4 \times M$ PWM, i.e., $\theta = [\theta_1, \ldots, \theta_M]$, where $\theta_j$, $j = 1, \ldots, M$, represents the unknown nucleotide distributions (i.e., A, C, G, T/U) in each position of the motif that will be estimated. Since each promoter sequence can have one, several or no binding site (instance) of the motif, their abundance also needs to be estimated. Assuming N instances per sequence as the upper bound, the distribution of the number of instances is described by an unknown vector $\lambda = [\lambda_0 \ldots \lambda_N]$, where $\lambda_j$ is the proportion of sequences with j instances of the motif.

TF binding sites have intrinsic properties that should be taken into account in the model, such as the location of conserved bases in the half-sites and their symmetry. We modified BAMBI to incorporate such properties of the TFBS. In particular, we can use the general two-block model [47] to describe the palindromic or direct/inverted repeat patterns in the half-sites.
Estimating the location and size of the core segments will be introduced later under the “Two-block model” section. To model the motif symmetry we consider three types, i.e., (i) palindromic, (ii) inverted-repeat and (iii) direct-repeat symmetry. We represent the distribution of the symmetry of instances as an unknown vector, i.e., $\sigma = [\sigma_0, \sigma_1, \sigma_2]$, where $\sigma_0$ represents the proportion of instances with palindromic symmetry, and $\sigma_1$ ($\sigma_2$) represents the proportion of instances with inverted(direct)-repeat symmetry.
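The resampling step used against the degeneracy problem can be illustrated with a short Python sketch. The effective-sample-size trigger, the multinomial draw, and all names below are assumptions chosen for illustration; the resampling scheme of [45] used in this chapter is class-based and is not reproduced here.

```python
import numpy as np

def effective_sample_size(weights):
    """ESS = 1 / sum(w_k^2) for normalised weights; small values signal degeneracy."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

def resample_if_degenerate(particles, weights, rng, ess_fraction=0.5):
    """Multinomial resampling, triggered when ESS drops below ess_fraction * K."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    K = len(w)
    if effective_sample_size(w) >= ess_fraction * K:
        return list(particles), w
    idx = rng.choice(K, size=K, p=w)
    return [particles[i] for i in idx], np.full(K, 1.0 / K)

rng = np.random.default_rng(0)
kept, kept_w = resample_if_degenerate(["p0", "p1", "p2", "p3"], [1, 1, 1, 1], rng)
new, new_w = resample_if_degenerate(["p0", "p1", "p2", "p3"], [1, 0, 0, 0], rng)
```

With uniform weights nothing changes, whereas a fully degenerate weight vector is replaced by copies of the single surviving particle with equal weights.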

At each step t, given the first t sequences $S_t$, the sequential algorithm estimates the number of instances in each sequence and their locations, $X_t$, within the Bayesian framework. The motif matrix $\theta$ is modeled as Dirichlet random vectors. The distribution of the number of instances $\lambda$ and the distribution of the symmetry of instances $\sigma$ are also modeled as Dirichlet random vectors. A sufficient statistic $\alpha^t$ is used to characterize the distribution of $\theta$, and it can be updated sequentially, i.e., $\alpha^t = \{\alpha_{m,j}^t\}_{m=1 \ldots M}^{j=1 \ldots 4}$, where $\alpha_m^t = [\alpha_{m,1}^t \ldots \alpha_{m,4}^t]^T$ is the parameter vector of the m-th Dirichlet distribution corresponding to the m-th column $\theta_m$ of the PWM. Then the parameters of $p(\theta \mid X_t, S_t)$ can be sequentially updated from $p(\theta \mid X_{t-1}, S_{t-1})$ as

\[
\alpha_{m,j}^{t} = \alpha_{m,j}^{t-1} + n(a_{t,x_t}(m)),
\]

where $a_{t,x_t}$ is the subsequence of M letters in the sequence $s_t$ whose location is given by $x_t$. Similarly, a sufficient statistic is selected for the distribution of the number of instances $\lambda$, which is Dirichlet distributed with parameters $\gamma^t = [\gamma_0^t \ldots \gamma_N^t]$. The sequential update is obtained by $\gamma^t = \gamma^{t-1} + j(x_t)$, where $j(x_t)$ is a vector of zeros except for a 1 indicating the number of instances of the motif in the t-th sequence. A sufficient statistic for the distribution of the symmetry of instances $\sigma$ is also defined. $\sigma$ is Dirichlet distributed with parameters $\delta^t = [\delta_0^t \; \delta_1^t \; \delta_2^t]$, which can be updated sequentially by $\delta^t = \delta^{t-1} + u(a_{t,x_t})$, where

\[
u(a_{t,x_t}) = [u_0 \; u_1 \; u_2] = \left[ \sum_{m=1}^{M/2} I\{a_{t,x_t}(m) = \overline{a_{t,x_t}(M-m)}\}, \;\; \sum_{m=1}^{M/2} I\{a_{t,x_t}(m) = a_{t,x_t}(M-m)\}, \;\; \sum_{m=1}^{M/2} I\{a_{t,x_t}(m) = a_{t,x_t}(m + \tfrac{M}{2})\} \right].
\]

$I\{\cdot\}$ is the indicator function, which takes the value 1 if a nucleotide pair observed in $a_{t,x_t}$ obeys the given symmetry type (palindromic, inverted-repeat, or direct-repeat), and the operator $\overline{(\cdot)}$ takes the Watson-Crick complement of the given letter.

The influence of these variables can be averaged out in the SIS procedure, whereby the measurement model depends on $\theta$ and $\sigma$, and the state transition depends on $\lambda$. The importance distribution can be chosen with $q(x_t \mid X_{t-1}^k, S_t) = p(x_t \mid X_{t-1}^k, S_t)$, where

\[
\begin{aligned}
p(x_t \mid X_{t-1}^k, S_t) &\propto p(s_t \mid x_t, X_{t-1}^k, S_{t-1}) \, p(x_t \mid X_{t-1}^k, S_{t-1}) \\
&= \int p(s_t \mid x_t, X_{t-1}^k, S_{t-1}, \theta) \, p(\theta \mid x_t, X_{t-1}^k, S_{t-1}) \, d\theta \; \int p(x_t \mid X_{t-1}^k, S_{t-1}, \lambda) \, p(\lambda \mid X_{t-1}^k, S_{t-1}) \, d\lambda.
\end{aligned} \tag{2.8}
\]

Then, for each particle, $k = 1, \ldots, K$, the importance weight can be updated as

\[
w_t^k \propto w_{t-1}^k \sum_{x_t} p(s_t \mid x_t, X_{t-1}^k, S_{t-1}) \, p(x_t \mid X_{t-1}^k, S_{t-1}). \tag{2.9}
\]

Given the PWM $\theta$, the background distribution $\theta_0$, the state $x_t$ at time t, and the symmetry distribution $\sigma$, we can express in closed form the likelihood of the sequence $s_t$ in (2.8) as

\[
\int p(s_t \mid x_t, X_{t-1}^k, S_{t-1}, \theta) \, p(\theta \mid x_t, X_{t-1}^k, S_{t-1}) \, d\theta = \theta_0^{\,n(a_{t,x_t}^c)} \prod_{m=1}^{M} E\!\left[ \theta_m^{\,n(a_{t,x_t}(m))} \, p(a_{t,x_t} \mid \theta) \right], \tag{2.10}
\]

where $\theta^{n} = \prod_{j=1}^{4} \theta_j^{n_j}$, $n(a) = [n_1, \ldots, n_4]$ with $n_j$, $j = 1, \ldots, 4$, being the number of times the j-th letter appears in the instance $a$, and $a_{t,x_t}^c$ is the sequence in $s_t$ resulting from removing the instance $a_{t,x_t}$.
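The sequential updates of the Dirichlet sufficient statistics for the PWM, the number of instances, and the symmetry type can be sketched in Python as follows. The names are hypothetical, the sites are toy strings, and the 0-based mirror index (position m paired with position M-1-m) is an assumption about how the symmetric partner of a base is indexed.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
BASES = "ACGT"

def symmetry_counts(site):
    """Return [u0, u1, u2]: palindromic, inverted-repeat and direct-repeat matches."""
    length, half = len(site), len(site) // 2
    u0 = sum(site[m] == COMPLEMENT[site[length - 1 - m]] for m in range(half))
    u1 = sum(site[m] == site[length - 1 - m] for m in range(half))
    u2 = sum(site[m] == site[m + half] for m in range(half))
    return [u0, u1, u2]

def update_statistics(alpha, gamma, delta, sites):
    """Add the counts contributed by the motif instances found in one sequence."""
    for site in sites:
        for m, base in enumerate(site):
            alpha[m][BASES.index(base)] += 1   # PWM column counts
        for s, u in enumerate(symmetry_counts(site)):
            delta[s] += u                      # symmetry-type counts
    gamma[len(sites)] += 1                     # number of instances in this sequence
    return alpha, gamma, delta

alpha = [[1, 1, 1, 1] for _ in range(4)]   # uniform prior counts, motif length 4
gamma = [1, 1, 1]                          # at most 2 instances per sequence
delta = [1, 1, 1]
update_statistics(alpha, gamma, delta, ["ACGT"])
```

The toy site ACGT equals its own reverse complement, so only the palindromic count is incremented.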

The last term in (2.10) is the likelihood of the instance $a_{t,x_t}$ given the PWM $\theta$ and the motif’s underlying symmetry $\sigma$. The influence of the symmetry can be averaged out by computing the integral $p(a_{t,x_t} \mid \theta) = \int p(a_{t,x_t} \mid \theta, \sigma) \, p(\sigma \mid \theta) \, d\sigma$. For a given instance $a_{t,i}$, as a measure of dissimilarity (asymmetry) we use the Kullback-Leibler (KL) divergence between the PWM’s base variations indicated by the symmetry type observed in $a_{t,i}$. Since the KL divergence is a non-symmetric distance measure, we define a symmetrized similarity metric and write the closed-form expression for $p(a_{t,x_t} \mid \theta)$ as follows.

\[
p(a_{t,x_t} \mid \theta) = \int p(a_{t,x_t} \mid \theta, \sigma) \, p(\sigma \mid \theta) \, d\sigma = p(a_{t,x_t} \mid \theta, s) \, E[\sigma_s] = \frac{1}{1 + H(\theta_a^s, \bar{\theta}_a^s) + H(\bar{\theta}_a^s, \theta_a^s)} \cdot \frac{\delta_s}{\sum_{s=0}^{2} \delta_s}, \tag{2.11}
\]

where

\[
\begin{aligned}
H(\theta_a^0, \bar{\theta}_a^0) &= -\sum_{m=1}^{M/2} I\{a_{t,x_t}(m) = \overline{a_{t,x_t}(M-m)}\} \sum_{j=1}^{4} \theta_m(j) \log(\theta_{M-m}(4-j)), \\
H(\theta_a^1, \bar{\theta}_a^1) &= -\sum_{m=1}^{M/2} I\{a_{t,x_t}(m) = a_{t,x_t}(M-m)\} \sum_{j=1}^{4} \theta_m(j) \log(\theta_{M-m}(j)), \\
H(\theta_a^2, \bar{\theta}_a^2) &= -\sum_{m=1}^{M/2} I\{a_{t,x_t}(m) = a_{t,x_t}(m + \tfrac{M}{2})\} \sum_{j=1}^{4} \theta_m(j) \log(\theta_{m+\frac{M}{2}}(j)),
\end{aligned}
\]

and $s = 0, 1, 2$ refers to the given symmetry type, i.e., palindromic, inverted-repeat, and direct-repeat, respectively. We normalize these cross-entropy measures by the number of symmetric bases in $a_{t,x_t}$, i.e., $u_s(a_{t,x_t})$. By restricting the calculation to only block sites $\mathcal{B}$, i.e., $m \in \mathcal{B} \subset \{1, \ldots, M/2\}$, the density is then multiplied by $1 + \frac{u_s(a_{t,x_t})}{|\mathcal{B}|}$ and log-normalized to average out the influence of the block length $|\mathcal{B}|$, which replaces (2.11) by

\[
p(a_{t,x_t} \mid \theta) = \log\!\left( \tau + \frac{1 + \frac{u_s(a_{t,x_t})}{|\mathcal{B}|}}{1 + H(\theta_a^s, \bar{\theta}_a^s) + H(\bar{\theta}_a^s, \theta_a^s)} \right) \frac{\delta_s}{\sum_{s=0}^{2} \delta_s}, \tag{2.12}
\]

where $\tau$ is a constant that avoids taking log(0) and serves as a tolerance threshold for asymmetric instances, i.e., the larger $\tau$ is, the more likely the algorithm is to sample an (asymmetric) instance that possesses a larger cross-entropy. By using (2.12) the expectation term in (2.10) can be approximated as

\[
E\!\left[ \theta_m^{\,n(a_{t,x_t}(m))} \, p(a_{t,x_t} \mid \theta) \right] \approx \theta_m^{\,n(a_{t,x_t}(m))} \log\!\left( \tau + \frac{1 + \frac{u_s(a_{t,x_t})}{|\mathcal{B}|}}{1 + H(\theta_a^s, \bar{\theta}_a^s) + H(\bar{\theta}_a^s, \theta_a^s)} \right) \frac{\delta_s}{\sum_{s=0}^{2} \delta_s}. \tag{2.13}
\]

The second integral in (2.8) is computed analogously.

\[
\int p(x_t \mid X_{t-1}^k, S_{t-1}, \lambda) \, p(\lambda \mid X_{t-1}^k, S_{t-1}) \, d\lambda = p(i_1 \ldots i_n \mid n) \, E[\lambda_n] = p(i_1 \ldots i_n \mid n) \, \frac{\gamma_n^{t-1}}{\sum_{i=0}^{N} \gamma_i^{t-1}}, \tag{2.14}
\]

where $p(i_1 \ldots i_n \mid n)$ is distributed uniformly.

Given a set of promoter sequences $S_t = \{s_1, \ldots, s_t\}$, the BAMBI2b algorithm for discovering symmetric two-block motifs is then summarized as follows. For each sequence, and for each particle,

• Construct the importance distribution by computing (2.8), (2.10), (2.13), and (2.14), and enumerating all possible sample extensions

  $X_t^k(n, i_1, \ldots, i_n) = [X_{t-1}^k \; [n \; i_1 \ldots i_n]^T]$;

• Draw sample $x_t^k$ from the importance distribution and set

  $X_t^k = [X_{t-1}^k \; x_t^k]$;

  and update the particle weight using (2.9);

• Update the sufficient statistics $\alpha^t$, $\gamma^t$, and $\delta^t$; then, resample if needed.
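The draw-and-reweight step in the list above can be illustrated with a minimal Python sketch for a single particle. It assumes the unnormalised scores of all enumerated extensions (likelihood times prior, as in (2.8)) have already been computed; the function name and the toy scores are hypothetical.

```python
import numpy as np

def sis_step(weight, extension_scores, rng):
    """One sequential step for one particle.

    extension_scores[e] is the unnormalised value
    p(s_t | x_t, ...) * p(x_t | ...) of candidate extension e.
    """
    scores = np.asarray(extension_scores, dtype=float)
    total = scores.sum()
    # Draw the extension from the normalised importance distribution.
    chosen = rng.choice(len(scores), p=scores / total)
    # Weight update as in (2.9): multiply by the sum over all extensions.
    return chosen, weight * total

rng = np.random.default_rng(0)
chosen, new_weight = sis_step(0.5, [0.0, 2.0, 0.0], rng)
```

Because the importance distribution is proportional to the same scores, the incremental weight reduces to their sum and does not depend on which extension was drawn.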

Two-block model

The bacterial TF binding sites are usually symmetric in sequence, where the half-sites obey a palindromic or direct/inverted repeat symmetry. The more informative bases in each half-site are generally conserved in the middle. One can model such a motif structure as a pair of informative symmetric sequences (blocks $\mathcal{B}$) separated by a run of uninformative bases (gap), and a pair of uninformative flanker sequences ($\mathcal{F}$) located at each end (Fig. 2.3). To estimate this structure within the SMC framework we used the class-based resampling scheme presented in [45] and augmented the state vector as $z_t = [x_t^T, m, b, f]^T$, where m, b, and f are the parameters for motif length M, block length B, and flanker length F, respectively. The classes are enumerated as $\Lambda_{M,B,F} = \{(M, B, F) \mid B \in \{1, \ldots, \frac{M}{2}\}, \; F \in \{0, \ldots, \frac{M}{2} - B\}\}$, where each class represents a different scenario of the possible motif structures. The Bayesian algorithm for the two-block motif model is similar to the single-block model given in [47] except that the closed-form expressions are modified for the non-block region, i.e., the background distribution $\theta_0$ is used for the uninformative bases $\{1, \ldots, M\} \setminus \mathcal{B}$.

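[Figure 2.3 diagram: motif positions 1 to M, with segments labeled F, B, B, F from left to right]

Figure 2.3: The two-block structure of a length-M motif described by the blocks ($\mathcal{B}$) and the flankers ($\mathcal{F}$).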
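The enumeration of the structure classes can be sketched in Python as follows. The function name is hypothetical, motif lengths are assumed to be even, and the derived gap length M - 2(B + F) is an inference from the two-block layout rather than a quantity defined in the text.

```python
def enumerate_classes(motif_lengths):
    """List all (M, B, F) classes with B in 1..M/2 and F in 0..M/2 - B."""
    classes = []
    for M in motif_lengths:
        half = M // 2
        for B in range(1, half + 1):
            for F in range(0, half - B + 1):
                classes.append((M, B, F))
    return classes

def gap_length(M, B, F):
    """Uninformative bases left between the two blocks (assumed layout F-B-gap-B-F)."""
    return M - 2 * (B + F)

classes = enumerate_classes([4])
```

For a motif of length 4 this yields three classes, and the number of classes grows quadratically with the motif length, which is why particles are allocated per class in the resampling step.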

The class-based resampling procedure is summarized as follows.

- Choose the number of particles for each class according to a multinomial distribution with parameters

  \[
  P_{m,b,f} \sim \{\hat{P}(m, b, f \mid S_t)\}_{(m,b,f) \in \Lambda_{M,B,F}}. \tag{2.15}
  \]

- For every class $\gamma \in \Lambda_{M,B,F}$, if $P_\gamma < P_{thresh}$ then set $P_\gamma = P_{thresh}$.

- Reduce the number of particles starting from the class with the largest number of particles until $\sum_\gamma P_\gamma = P$.

- For every class draw $P_\gamma$ new particles from the set of previous particles with probabilities proportional to their weights.

- Assign equal weights to each new sample within a class, $w_t^k = \hat{P}(m, b, f \mid S_t) / P_{m,b,f}$.
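The particle allocation across classes can be sketched in Python as follows. This is a simplified illustration: the function names are hypothetical, particles are removed one at a time from the currently largest class, and the total number of particles is assumed to be at least the minimum count times the number of classes.

```python
import numpy as np

def allocate_particles(class_probs, total, minimum, rng):
    """Multinomial allocation, floor at `minimum`, then trim the largest classes."""
    counts = rng.multinomial(total, class_probs)
    counts = np.maximum(counts, minimum)       # protect low-probability classes
    while counts.sum() > total:
        counts[np.argmax(counts)] -= 1         # take from the largest class
    return counts

def within_class_weight(class_prob, class_count):
    """Equal weight for every particle in a class: P_hat(class) / P_class."""
    return class_prob / class_count

rng = np.random.default_rng(0)
probs = [0.98, 0.01, 0.01]
counts = allocate_particles(probs, total=100, minimum=5, rng=rng)
weights = [within_class_weight(p, c) for p, c in zip(probs, counts)]
```

The floor keeps unlikely motif structures alive in early iterations, while the per-class weights compensate for the particles they are given beyond their estimated probability.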

Optimal data feeding

Because of the sequential nature of BAMBI2b, the order in which the sequences are utilized has a direct impact on the accuracy of estimation. An optimal data feeding should then be defined in order to prioritize the most useful sequences to supply to BAMBI2b. The downstream genes can be supplied in the same order in which they are filtered in by Algorithm 1. To rank-order the first-in-operon genes $\mathcal{F}$, we can use a similar approach. The genes that are expressed more coherently with the given transcription factor may indicate a proximity to the TF’s binding instances. Following this intuition, we feed the upstream sequences of the first-in-operon genes $\{x \in \mathcal{F}\}$ in a decreasing order based on the correlation scores, i.e., $\mathrm{score}(x) = \mathrm{Corr}(x, \text{TF gene})$, as follows. Let $M \in \mathbb{R}^{|\mathcal{F}| \times p}$ be the gene expression matrix corresponding to $|\mathcal{F}|$ high-confidence genes and p different conditions. Suppose that $m_i \in \mathbb{R}^p$ is the vector of expression rates for the given transcription factor gene located in the i-th row of $M$. The correlation scores for the genes $x = 1, \ldots, |\mathcal{F}|$ are calculated by Pearson’s correlation

\[
\mathrm{score}(x) = \frac{\sum_{k=1}^{p} (m_{ik} - \bar{m}_i)(m_{xk} - \bar{m}_x)}{\sqrt{\sum_{k=1}^{p} (m_{ik} - \bar{m}_i)^2} \, \sqrt{\sum_{k=1}^{p} (m_{xk} - \bar{m}_x)^2}},
\]

Figure 2.4: Gene expression heatmap of the estimated high-confidence set: dinB, yafN, dinG, sulA, dinI, umuD, ydjM, yebG, recX, recA, lexA, recN.

where $\bar{m}_i$ and $\bar{m}_x$ are the empirical means of the expression vectors $m_i$ and $m_x$ of the genes i and x, respectively. The upstream sequences are then arranged in the order of descending absolute scores $|\mathrm{score}(x)|$ of the corresponding genes in $\mathcal{F}$.
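The ordering by absolute Pearson correlation can be sketched in Python with NumPy. The function name and the toy expression matrix are assumptions for illustration; `np.corrcoef` computes the same Pearson coefficient as the formula for score(x).

```python
import numpy as np

def feeding_order(expr, tf_row):
    """Rank rows of `expr` by |Pearson correlation| with the TF gene's row."""
    tf = expr[tf_row]
    scores = np.array([np.corrcoef(tf, row)[0, 1] for row in expr])
    order = np.argsort(-np.abs(scores), kind="stable")   # descending |score|
    return order, scores

# Toy matrix: rows are genes, columns are conditions; row 0 is the TF gene.
expr = np.array([[1.0, 2.0, 3.0, 4.0],
                 [4.0, 3.0, 2.0, 1.0],
                 [1.0, 3.0, 2.0, 4.0],
                 [2.0, 4.0, 6.0, 8.0]])
order, scores = feeding_order(expr, tf_row=0)
```

Using the absolute value places a strongly anti-correlated gene (row 1) ahead of a weakly correlated one (row 2), consistent with ordering by |score(x)|.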

2.3 Results and discussion

Predictions in the model organism E. coli K12 validate our approach’s sensitivity

To assess our approach’s performance we reconstructed the regulons of the LexA, PurR and Fur transcription factors, which are among the well-studied regulons in the Escherichia coli K12 strain. In the literature, these transcription factor motifs are known to have palindromic sequence symmetry. The informative sites in the LexA motif are conserved in the middle of each half-site sequence, while for PurR and Fur they are mostly scattered across the motif and appear more informative near the motif half-site. In Figures 2.5-2.7 these structures can also be seen in the motifs estimated by the proposed approach.

We estimated a set of genes putatively co-regulated with LexA’s coding gene lexA by using the proposed biclustering method with the gene expression data from [40], corresponding to 266 experiments, and the fitness data from [48], corresponding to growth rates under the antibiotic stress conditions tetracycline, doxycycline, and minocycline. The estimated high-confidence set for LexA consists of 12 putatively co-regulated genes, 11 of which are members of the RegPrecise regulon. Figure 2.4 shows the corresponding gene expression heatmap of the estimated bicluster consisting of 12 genes and 260 conditions.

By using 300-bp upstream sequences, motif discovery with the proposed BAMBI2b algorithm yielded a 20-bp motif (Figure 2.5) with the consensus sequence identical to that of the RegPrecise motif. Table 2.1 shows the similarity of both motifs (BAMBI2b vs RegPrecise) with the curated motif database (SwissRegulon [49]). It is seen that both motifs exhibit high similarity with the known weight matrix LexA 20-6, and the BAMBI2b estimate’s similarity is statistically more significant. By scanning the E. coli genome with this motif we predicted 60 different putative binding sites, 10 of which are located in intragenic regions (open reading frames). After downstream analyses, assisted by MicrobesOnline prediction data for adjacent genes [42], we identified a set of 90 genes as the putative regulon (see Supplementary Data I).

                    BAMBI2b motif           RegPrecise motif
Target ID           LexA 20-6               LexA 20-6
Optimal offset      2                       2
p-value             2.05557e-11             2.97639e-09
e-value             1.99391e-09             2.88709e-07
q-value             3.96654e-09             5.74338e-07
Overlap             18                      18
Query consensus     TACTGTATATATAAACAGTA    TACTGTATATATATACAGTA
Target consensus    TATACTGTATATAAAAACAG    TATACTGTATATAAAAACAG
Orientation         +                       +

Table 2.1: The best hits of the estimated (BAMBI2b) vs true (RegPrecise) LexA motifs in SwissRegulon database.

Figure 2.5: LexA motif estimated by the proposed approach.

Comparison with the RegPrecise database showed that 27 genes are members of the known LexA regulon (corresponding to 93%), which are predicted through the same binding sites. We call this group the true positives (TP). Four novel binding sites are also found for the true positives dinB, ydjM, ruvA, and lexA (Table 2.2), in addition to their RegPrecise binding sites. The remaining 63 genes are not identified in RegPrecise, 17 of which have intragenic binding sites.

Locus    Gene    Position    Score    TFBS sequence
b0231    dinB    -111        7.6      AGCTGGATAAGCAGCAGGTG
b1728    ydjM    -52         10       CACTGTATAAAAATCCTATA
b1861    ruvA    -51         8.6      TGCTGTATGATAAAAAAATG
b4042    lexA    -88         8.7      AACTGCACAATAAACCAGAG

Table 2.2: Novel binding sites for LexA regulon.

We analyzed this putative regulon of 90 genes for biological significance by using the Database for Annotation, Visualization and Integrated Discovery (DAVID) [50, 51] and identified certain significant genes that are novel to the RegPrecise database. A cluster of 29 genes, including alkB, dinJ, ada, yagL, and rmuC, was annotated with the highest functional enrichment score, 25.5 (see Supplementary Data I). Notice that alkB, ada, and yagL were predicted via intragenic TFBSs, which are excluded from the RegPrecise regulon. Protein interaction terms (SP-PIR) indicated high significance for the group including alkB and ada for the DNA repair and DNA damage terms. For the group including alkB, ada, and dinJ, the reported Gene Ontology (GO) enrichment terms were cellular response to stress, DNA repair, and response to DNA damage stimulus. Table 2.3 shows the corresponding false discovery rates (FDRs, Benjamini). Moreover, dinJ, yafQ, rpoD, molR, and insK are identified in the experimental database RegulonDB [52] as genes regulated by LexA.

Terms                                     alkB, ada    alkB, ada, dinJ
DNA repair (SP-PIR)                       5.1E-28
DNA damage (SP-PIR)                       6.2E-28
Cellular response to stress (GO)                       7.5E-23
DNA repair (GO)                                        1.6E-21
Response to DNA damage stimulus (GO)                   1.6E-21

Table 2.3: Functional enrichment in reconstructed LexA regulon.

We obtained a similar performance for the putative PurR regulon. Based on the gene expression data in [53], the proposed biclustering method estimated a high-confidence set that consists of 5 true positive genes, i.e., purR, cvpA, purC, purM, and purN. From the upstream sequences, BAMBI2b discovered a 16-bp motif (Figure 2.6) which is comparable to the RegPrecise motif (Table 2.4). After TFBS prediction and subsequent regulon reconstruction, we obtained a putative regulon consisting of 158 genes regulated via 93 non-intragenic sites (Supplementary Data II). Of these, 33 genes are identified in the RegPrecise regulon and are regulated by the same binding sites. Our approach found 3 additional binding sites for the true positive genes serA, yieG (purP), and yjcD (Table 2.5).

                    BAMBI2b motif        RegPrecise motif
Target ID           PurR 17-3            PurR 17-3
Optimal offset      0                    0
p-value             4.59804e-10          3.53226e-12
e-value             4.4601e-08           3.4263e-10
q-value             8.82639e-08          6.74503e-10
Overlap             16                   16
Query consensus     ACGCAAACGTTTACCT     ACGCAAACGTTTGCGT
Target consensus    ACGCAAACGTTTTCCTT    ACGCAAACGTTTTCCTT
Orientation         +                    +

Table 2.4: The best hits of the estimated (BAMBI2b) vs true (RegPrecise) PurR motifs in SwissRegulon database.

Figure 2.6: PurR motif estimated by the proposed approach.

Locus    Gene    Position    Score    TFBS sequence
b2913    serA    -95         8.1      ATATGAACGTTTGCGT
b3714    yieG    -115        8.4      ACGGCAACGATTGCGT
b4064    yjcD    -76         7.6      AAGATAACGTTTCGCT

Table 2.5: Novel binding sites for PurR regulon.

GO annotations for this putative regulon suggested a cluster of genes with significant functional enrichment. The genes nudB, cadA, cadB, cysZ, gltF, gltS, rhtC, and ygeW were members of this cluster, which is annotated by the GO term nitrogen compound biosynthetic process (see Supplementary Data II).

The reconstruction of the Fur regulon also indicated novel predictions. We employed the gene expression data in [54] and the fitness data in [48] with growth rates under antibiotic stress conditions. The proposed biclustering approach estimated a high-confidence set containing 56 genes, of which 19 are members of the Fur regulon in the RegPrecise database. From the upstream sequences of these 56 genes BAMBI2b discovered a 19-bp motif (Figure 2.7, Table 2.6). Although it slightly differs from the RegPrecise Fur motif (i.e., the binding sites align with a 3-bp shift and have the same consensus sequence), the motif still conforms to the same palindromic symmetry and is able to recover the same RegPrecise binding sites (see Supplementary Data III).

                    BAMBI2b motif            RegPrecise motif
Target ID           Fur 21-4                 Fur 21-4
Optimal offset      4                        1
p-value             3.01028e-07              2.00946e-10
e-value             2.91997e-05              1.94918e-08
q-value             5.77853e-05              3.00056e-08
Overlap             17                       19
Query consensus     AATGATTATCATTATCATT      GATAATGATTATCATTATC
Target consensus    TGATAATGATAATAATTATCA    TGATAATGATTATCATTATCA
Orientation         -                        +

Table 2.6: The best hits of the estimated (BAMBI2b) vs true (RegPrecise) Fur motifs in SwissRegulon database.

After screening the E. coli genome, we obtained a putative regulon consisting of 236 genes that covers all identified Fur operons in the RegPrecise database. Among them, 32 genes have intragenic binding sites. The annotation analysis identified a group of 71 genes with the highest enrichment score (10.6) and associated them with metal ion/iron-related functional terms, in particular the protein interaction terms iron, iron transport, transport, and ion transport (see Supplementary Data III). Among this group we pinpointed two genes, fes and fhuE (Table 2.7), that are novel to the RegPrecise database. In the literature, fes and fhuE are notable for making iron available for metabolic use [55] and regulating ferric iron (Fe3+) uptake via coprogen [56], respectively. Both are identified in RegulonDB as members of the Fur regulon.

Figure 2.7: Fur motif estimated by the proposed approach.

Terms                      fes, fhuE
Iron (SP-PIR)              1.5E-16
Iron transport (SP-PIR)    2.3E-16
Transport (SP-PIR)         6.8E-15
Ion transport (SP-PIR)     1.9E-13

Table 2.7: Functional enrichment in reconstructed Fur regulon.

To evaluate the reconstruction accuracy in terms of specificity/sensitivity performance, we applied our approach to different datasets, i.e., LexA, PurR, Fur, Crp, and Fnr, and assessed the impact of our estimated motifs (BAMBI2b) vs the true (RegPrecise) motifs on the reconstruction results. Figure 2.8 shows the corresponding ROC curves, where each data point represents a putative regulon with the corresponding true positive rate (TPR) and false positive rate (FPR) with respect to the (true) regulon in the RegPrecise database, i.e., $\mathrm{TPR} = \frac{TP}{TP + FN}$ and $\mathrm{FPR} = \frac{FP}{FP + TN}$. These measures depend on the predefined site score threshold, i.e., the recovery rate (TPR) increases from left to right as the site score threshold is lowered and more true binding sites are recovered. Since the performance of our approach depends on the complexity of the TF’s regulatory network, we expect better performance for relatively smaller regulons. It can be seen from Figure 2.8 that the recovery rates are lower for the larger regulons Crp and Fnr. This is in accord with our expectations since the Crp and Fnr family TFs are among the 7 global regulators that control 50% of all regulated genes in E. coli [57].
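The TPR and FPR definitions above can be sketched in a few lines. This is a minimal illustration, not the evaluation code used in this work; the function name `roc_point`, the gene identifiers, and the choice of gene universe are assumptions made for the example.

```python
def roc_point(predicted, true_regulon, all_genes):
    """Return (TPR, FPR) for one putative regulon.

    All three arguments are sets of gene identifiers; all_genes is the
    universe over which negatives are counted (an assumption of this sketch).
    """
    universe = set(all_genes)
    predicted = set(predicted) & universe
    true_regulon = set(true_regulon) & universe
    tp = len(predicted & true_regulon)
    fp = len(predicted - true_regulon)
    fn = len(true_regulon - predicted)
    tn = len(universe) - tp - fp - fn
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Toy example with hypothetical gene names g0..g9
genes = {f"g{i}" for i in range(10)}
truth = {"g0", "g1", "g2", "g3"}
pred = {"g0", "g1", "g2", "g7"}
tpr, fpr = roc_point(pred, truth, genes)
```

Repeating this calculation for regulons predicted at successively lower site score thresholds would yield one ROC data point per threshold.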

[Figure 2.8 plot: two ROC panels, each with FPR (0 to 1) on the horizontal axis and TPR (0.6 to 1) on the vertical axis; legend: LexA, PurR, Crp, Fur, Fnr.]

Figure 2.8: ROC of reconstruction results based on BAMBI2b-estimated motifs (above) vs RegPrecise motifs (below).

Results for the hypothetical proteins indicate good predictions for non-generic TF binding motifs

We used our approach to predict the motifs of hypothetical proteins of the novel organism Desulfovibrio alaskensis [58]. It is an anaerobic sulfate-reducing bacterium that is notable for its ability to produce hydrogen sulfide, a chemically reactive product toxic to plants, animals, and humans. Dde0289 is one of the hypothetical DNA-binding proteins in D. alaskensis and is annotated as a Sigma-54-dependent transcriptional activator. It is presumed in the literature to belong to the Fis-type helix-turn-helix motifs.

For this regulator, we used all available gene expression and fitness data sets from the MicrobesOnline database [59] in the proposed biclustering method. The estimated high-confidence set consisted of 11 co-expressed genes, i.e., Dde0289, Dde0312, Dde2767, Dde2741, Dde2343, Dde1987, Dde1148, Dde2317, Dde0481, Dde2075, and Dde1935. From the upstream sequences, BAMBI2b discovered instances containing A-tracts, appearing significantly in the middle of the motif (Figure 2.9). Although such a structure is not generic in TF motifs, it is known that the intrinsic sequence-dependent protein-DNA conformations can result in high-affinity binding events. A recent study proposes that A-tracts are the preferred Fis-binding sites in E. coli, and in particular, A6-tracts provide the strongest binding signals [39]. A6-tracts are known to induce intrinsic curvature in segments of DNA [60], which enhances the local region’s exposure to the transcription machinery.

Figure 2.9: Hypothetical Dde0289 motif estimated by the proposed approach.

To validate our approach’s sensitivity for recovering such rare binding motifs, we reconstructed E. coli’s Fis motif and compared our results with those deduced from the ChIP-chip binding data in [39]. We used the gene expression data set in [61] and estimated a high-confidence set of 100 genes by using the proposed biclustering method and Algorithm 1. BAMBI2b found several motif estimates that consist of variably conserved A-tracts flanked by G residues. In particular, the estimates partially recovered the Fis motif’s consensus sequence GCTGAAAAAA, with the highest information content conserved at GCTGAAAA (Figure 2.10), which corresponds to the consensus half-site sequence of the Fis motif. This is in accord with the findings in [39], where the different Fis motif subtypes (non-palindromic and palindromic) share the most common bases in their consensus half-site. In fact, motif comparison of our estimate with the SwissRegulon database resulted in a significant hit to one of the Fis weight matrices, i.e., Fis 26-32 (Table 2.8). For comparison, we used the true Fis weight matrix obtained from RegulonDB, and the motif comparison assigned this motif to the same weight matrix Fis 26-32 with a similar score.

Figure 2.10: Fis motif estimated by the proposed approach.

Refining the methods

We applied certain constraints to refine our predictions. Knockout fitness data sets are integrated to refine the high-confidence gene set estimation by adding biological relevance. Integrating appropriately selected growth conditions often helped eliminate false positive genes. In motif discovery, we imposed the two-block motif structure to account for palindromic or inverted/direct-repeat symmetry patterns of the TFBSs. We also used a heuristic to define an optimal data feeding order for the motif discovery problem based on the co-expression of local genes. Further constraints can be imposed in the reconstruction program to limit the extent of predictions, such as searching for genes only in the downstream direction on the strand to which the TF putatively binds.

                    BAMBI2b motif                 RegulonDB motif
Target ID           Fis 26-32                     Fis 26-32
Optimal offset      7                             3
p-value             6.29038e-06                   3.21289e-08
e-value             0.000610167                   3.1165e-06
q-value             0.00121382                    6.16615e-06
Overlap             10                            15
Query consensus     CGCTGAAAAA                    GCTTATTTTTTAAGC
Target consensus    GTTCTGTTGCTGAAAAAATAACCAAA    TTTGGCTATTTTTTCAGCAACAGAAC
Orientation         +                             -

Table 2.8: The best hits of the estimated (BAMBI2b) vs true (RegulonDB) Fis motifs in SwissRegulon database.

On the other hand, some restrictions could be relaxed to refine the results in particular cases. For instance, one can obtain different estimates by performing multiple runs of BAMBI2b with scrambled data order, in particular when the co-expression patterns are not very determinative. If a reference motif is available, e.g., when reconstructing/expanding known regulons, one can directly use it as the prior PWM (θ) in the proposed motif discovery algorithm BAMBI2b. Such procedures will likely reduce the negative effect of the false positive sequences, which can diminish the high-confidence set.

Chapter 3

The folded k-spectrum kernel: a machine learning approach to detecting transcription factor binding sites with nucleotide dependencies

3.1 Introduction

Motif-based binding site prediction methods have certain limitations, in particular the rate of false positives [62]. Since the predicted binding sites are not constrained to the partially conserved upstream sequences of the regulog-related gene orthologs [31], they can be scattered across the genome and may correspond to many false binding sites. Although comparative genomics approaches can mitigate the rate of false positives, they have specific limitations for detecting novel (organism-specific) TF-DNA interactions [63].

To deal with the rate of false-positive predictions from the motif screening, we developed a machine learning-based filtering method. The approach re-classifies the TFBS predictions based on a supervised learning method trained on the high-confidence TF binding sites. That is, the sequences that constitute the PWM of the TF motif are used to train a string kernel-based support vector machine (SVM) classifier, and the putative TFBS predictions are then classified by the learned SVM.

DNA motif discovery and sequence classification to identify portions of the genome with specific biological function(s) has been a central problem in [...], addressed by numerous approaches based on [...] [64, 65], profiling consensus patterns of motifs [66, 67], and hidden Markov models [68, 69]. In this chapter, we draw on ideas introduced in Position Weight Matrix-based approaches to develop a novel, unbiased support vector machine approach.

3.1.1 Position Weight Matrix approach

Since the 1980s, one of the most popular bioinformatic approaches for predicting transcription factor binding sites has involved constructing a Position Weight Matrix (PWM) and scanning a DNA sequence for subsequences with a high probability of binding the TF of interest [70–72]. Standard implementations of this approach rely on the assumption that nucleotide positions within a TF binding site are independent of each other [73, 74]. Even with this simplifying assumption, these models have proven effective in predicting binding sites in a variety of different species [73–77].

In an attempt to implement a more systematic and unbiased approach to predicting TF binding sites, numerous extensions to this approach have been proposed that relax the assumption of independence between contiguous nucleotides in the binding site. Common extensions include dinucleotide and k-mer models, which allow for dependence between adjacent nucleotides or a contiguous string of k nucleotides, respectively [78–80]. More recently, an algorithm referred to as MARZ was developed that combinatorially considers all possible “gapped” nucleotide dependencies across a fixed number of nucleotides [7]. By considering all possible dependencies, this algorithm includes traditional PWM models, dinucleotide and k-mer models, as well as models with noncontiguous (i.e., gapped) dependencies. When tested on a set of well-characterized TFs in Drosophila, MARZ showed that gapped models often outperform traditional PWM-based models [7, 81]. Although this may not be the case for all TFs or in all species, these studies have highlighted the importance of using an unbiased approach and considering all possible combinations of nucleotide dependence.
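To make the independence assumption concrete, the following is a minimal sketch of a standard log-odds PWM scan. It is not the implementation of any of the cited tools; the pseudocount, the uniform background, the function names, and the toy sites are assumptions chosen for illustration.

```python
import math

ALPHABET = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=None):
    """Log-odds PWM from equal-length binding sites (one dict per position)."""
    background = background or {b: 0.25 for b in ALPHABET}  # assumed uniform
    width = len(sites[0])
    pwm = []
    for pos in range(width):
        counts = {b: pseudocount for b in ALPHABET}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background[b]) for b in ALPHABET})
    return pwm


def scan(sequence, pwm):
    """Score every window; positions are independent, so column scores add."""
    width = len(pwm)
    return [
        (start, sum(pwm[i][sequence[start + i]] for i in range(width)))
        for start in range(len(sequence) - width + 1)
    ]


sites = ["TACTGTAT", "TACTGTAT", "AACTGTAT", "TACTGGAT"]  # toy sites, not curated data
pwm = build_pwm(sites)
hits = scan("GGGTACTGTATCCC", pwm)
best_start, best_score = max(hits, key=lambda h: h[1])
```

Because each column contributes to the window score separately, a model of this kind cannot reward a pair of nucleotides that co-occur more often than their individual frequencies suggest, which is the limitation the dinucleotide, k-mer, and gapped extensions address.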

3.1.2 Support Vector Machines approach

In recent years, discriminative approaches, such as Support Vector Machines (SVM), have also been introduced and shown to be the best performing methods for sequence classification [82]. In the SVM classifier approach, the input sequences from different classes are considered as labeled examples and a learning algorithm is trained to find an optimal decision boundary between the different classes. The decision boundary (determined by a hyperplane in a multidimensional feature space) optimally separates the feature representations of the labeled examples. In this supervised approach, the unlabeled (test) sequences are later given to the algorithm, and the algorithm uses the learned decision boundary to predict labels/classes for these test sequences.

One of the earliest SVM approaches for biological sequence classification, originally implemented on protein sequences made up of strings of amino acids, was the Fisher kernel [83]. This approach is based on a computationally demanding generative model that requires one to build hidden Markov model profiles for each positive training sequence to obtain the feature vector representations. In a later study [Liao et al., 2002], the SVM-pairwise kernel was introduced, in which the pairwise alignments between the training sequences are used as the SVM features. This methodology is similar to the Fisher kernel approach and is computationally expensive as well. In [23], a more efficient “spectrum” kernel was introduced for the classification of protein sequences, where the feature vector consists of the occurrences of all contiguous k-mer subsequences in a given sequence. In subsequent works, the authors extended their approach to “inexact” subsequence matches for improved classification performance, allowing up to a certain number of mismatches when counting the k-mer occurrences [84, 85].
The k-mer spectrum is an effective representation of DNA sequences in terms of discriminating functional segments (e.g., TFBS, or promoter/enhancer regions) from the genomic background. The TFBSs can be characterized by the specific composition of the k-mer subsequences inherent to the binding preference of a protein, where the consensus motif sequence is defined as the most observed k-mer in the corresponding binding sites, followed by those that differ from the consensus sequence to various degrees.

It has been suggested that protein binding sites have varying levels of nucleotide interdependencies, e.g., the nucleotide patterns in certain (non-adjacent) bases within binding sites may appear more often than the patterns in other bases [8, 81, 86]. This suggests a (gapped) dependency in some non-adjacent bases within the TFBSs, which can be modeled by a “gapped” k-mer. That is, the gapped k-mer is a length-k sequence of nucleotides that breaks the dependency at bases that cannot form a significant consensus. For example, we can assume interdependency between the first and the last nucleotides in the following 3 binding sites, {“AACA”, “ACGA”, “AGTA”}, where a “gapped” k-mer represented by “ANNA” (where each ‘N’ represents a gap) is a better fit to the majority of TFBSs (i.e., 3/3) than any other “contiguous” k-mer.

Following this intuition, to account for the proteins that possess nucleotide interdependency, a more suitable representation of the DNA binding sites, in terms of sequence discrimination, should incorporate the gapped k-mer compositions besides the regular k-mers.
In that respect, the composition (enrichment) of the specific gapped k-mers in the analyzed sequence (relative to the enrichments obtained from the training data) can help identify such proteins that possess interdependency in their preferred binding sites. In the following section, we describe a generalized discriminative approach for detecting functional DNA sequences including those that may possess nucleotide interdependency in the TFs’ binding sites.
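The “ANNA” example from this section can be checked with a short sketch. The helper names `matches` and `coverage` are hypothetical and only serve this illustration; the three sites are the ones given in the text.

```python
from itertools import product


def matches(pattern, site):
    """True if site agrees with pattern at every non-gap ('N') position."""
    return len(pattern) == len(site) and all(
        p == "N" or p == s for p, s in zip(pattern, site)
    )


def coverage(pattern, sites):
    """Fraction of sites matched by a (gapped or contiguous) k-mer pattern."""
    return sum(matches(pattern, s) for s in sites) / len(sites)


sites = ["AACA", "ACGA", "AGTA"]
gapped_cov = coverage("ANNA", sites)
# Best coverage achievable by any contiguous 4-mer over the same sites
best_contiguous = max(
    coverage("".join(p), sites) for p in product("ACGT", repeat=4)
)
```

Here the gapped pattern covers all three sites, while no contiguous 4-mer covers more than one of them.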

3.2 Approach

3.2.1 SVM for sequence classification

Support vector machine (SVM) is a discriminative learning method that finds an optimal decision boundary that separates, in the case of binary classification, the positive and negative data sets represented in a high-dimensional vector space [87]. Consider the training data set of labeled input vectors $(x_i, y_i)$, $i = 1, \ldots, t$, where $x_i \in \mathbb{R}^N$ is the projection of the $i$-th input data into a high-dimensional feature space (obtained by a known mapping function, $\Psi$), and $y_i \in \{-1, +1\}$ is the respective class label. In this classification problem, a basic (linear) decision boundary can be found by minimizing $\|w\|^2$ subject to $y_i(x_i^T w + b) \geq 1$, $i = 1, \ldots, t$, where $w \in \mathbb{R}^N$ is the normal vector to the decision boundary $\mathcal{B}$ (which is a hyperplane, i.e., $\mathcal{B} = \{x \in \mathbb{R}^N : x^T w + b = 0\}$), $b$ is the classifier bias, and the term $x_i^T w$ represents the inner product between the vectors $x_i$ and $w$. For practical considerations, a dual problem is solved instead to find the decision boundary, by forming the Lagrangian (with multipliers $\alpha_i$) for the above quadratic programming problem [82]:

$$\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j, \tag{3.1}$$

$$\text{subject to } \alpha_i \geq 0 \;\; \forall i,$$

whereby the normal vector $w$ can be optimally constructed from $w = \sum_i y_i \alpha_i x_i$. Here, the learned variables $(w, b)$ serve as the parameters of the classification rule. A test example $z$ is then classified by the expression

$$f(z) = z^T w + b = \sum_i y_i \alpha_i z^T x_i + b,$$

where $f(z)$ represents the distance of $z$ from the learned decision boundary, and its sign gives the estimated class label. The inner product in (3.1) yields a measure of similarity in the feature space $\mathcal{F}$, where the coordinates are defined by the vector elements. The similarity between any two input data $(\chi_i, \chi_j)$ can be generalized by using the kernel functions $K(\chi_i, \chi_j)$ [82].

A convenient kernel function for sequence classification is the k-spectrum kernel [23], which describes the similarity of sequences by their (k-mer) subsequence compositions of a fixed length $k$. Consider that the training data $\chi$ is a character sequence belonging to an input space $\mathcal{X}$ consisting of all finite sequences from an alphabet $\mathcal{A}$ of size $\ell$. For a given $k \geq 1$, it is projected to a frequency vector called the “k-spectrum”, i.e., $x = \Psi_k(\chi)$, that represents the occurrences of all possible k-mer subsequences in $\chi$:

$$\Psi_k(\chi) = \big( \phi_\alpha(\chi) \big)_{\alpha \in \mathcal{A}^k},$$

where $\phi_\alpha(\chi)$ is the number of times $\alpha$ occurs in $\chi$, and $\mathcal{A}^k$ represents the set of all possible length-$k$ sequences from the alphabet $\mathcal{A}$.
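The k-spectrum map and the decision function can be sketched as follows. This is a minimal illustration in which the spectrum is stored sparsely as raw counts; the function names, the multipliers, and the bias are hypothetical values rather than the output of an actual SVM solver.

```python
from collections import Counter


def k_spectrum(seq, k):
    """Counts of every contiguous k-mer in seq (sparse form of the k-spectrum)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(seq_a, seq_b, k):
    """Inner product of the two k-spectra."""
    spec_a, spec_b = k_spectrum(seq_a, k), k_spectrum(seq_b, k)
    return sum(count * spec_b[kmer] for kmer, count in spec_a.items())


def decision_value(z, train_seqs, labels, alphas, bias, k):
    """f(z) = sum_i y_i * alpha_i * K(z, x_i) + b, with K the spectrum kernel."""
    return bias + sum(
        y * a * spectrum_kernel(z, x, k)
        for x, y, a in zip(train_seqs, labels, alphas)
    )


spec = k_spectrum("ACGACG", 3)
k_val = spectrum_kernel("ACGACG", "TACGT", 3)
# Hypothetical multipliers and bias, chosen by hand for illustration only
f_val = decision_value("ACGACG", ["TACGT", "GGGGG"], [1, -1], [0.5, 0.5], -0.5, 3)
```

Storing only the observed k-mers avoids allocating the full vector over all length-k sequences, which matters as k grows; the sign of `f_val` would give the predicted class.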

In [23], the mapping function $\Psi_k$ of the k-spectrum kernel considers only the subsequences of “contiguous” (dependent) nucleotides (i.e., k-mers). In later related work, the authors introduced the mismatch kernel to allow for a certain number of mismatches in the k-mer occurrences. Recently, a more systematic approach was proposed for detecting regulatory sequences possessing certain degrees of nucleotide interdependency through the search of sequence motifs called “gapped k-mers” [7]. In our study, we extend the concept of the k-spectrum given in [23] by incorporating all types of nucleotide interdependency in k-mers to improve the predictive power of the feature set. That is, we extend the feature set of k-mers with the gapped k-mers given in [7], which ignore all possible subsets of nucleotides. As explained next, the mapping function only evaluates the data points under the k-spectrum and computes the extended (gapped) features efficiently by folding this spectrum along the coordinates specific to the gapped k-mer features.

3.2.2 Folded k-spectrum kernel

We use binary notation to refer to the nucleotide dependence in a subsequence $\alpha$, where ‘1’ represents the dependent nucleotides and ‘0’ represents the gap. For example, given $k = 3$ (3-mer), the binary value of the decimal number 5 is 101; therefore 5 encodes the 3-mer sequences of which only the 1st and the 3rd nucleotides are interdependent. We denote this dependency model by a set of gapped k-mer features $\mathcal{A}^3_5$, which consists of 16 sequences varying on the nucleotides (1,3), i.e., $\mathcal{A}^3_5 = \{ANA, ANC, ANG, ANT, CNA, \ldots, TNG, TNT\}$, where the gaps are represented by ‘N’. Similarly, the decimal number 7 yields the binary sequence 111, and the corresponding $\mathcal{A}^3_7$ represents the set of all (contiguous) 3-mer sequences. For $k = 3$, the remaining possible feature sets are $\mathcal{A}^3_3$ and $\mathcal{A}^3_1$, which consist of all possible dimer (011) and monomer (001) sequences, respectively.

As described above, there are 4 different feature sets (gapped k-mer models) for $k = 3$, i.e., monomer, dimer, 1-gapped, and 3-mer. In general, this number grows exponentially with $k$, i.e., there are $2^{k-1}$ different feature sets (obtained by placing gaps at each possible location in $\{1, \ldots, k-1\}$). However, in practice the length $k = 6$ is a sufficient (and optimal) choice for the purpose of sequence discrimination [88][89], which we also used in this study, and it limits the number of feature sets of the different gapped k-mer models to 32. In the following, we present a general formulation of the folded k-spectrum kernel for any $k$.

Using the same notation, we define the feature map of a particular feature set $\mathcal{A}^k_{2m-1}$, $m = 1, \ldots, 2^{k-1}$, by

$$\Phi_{(k,m)}(\chi) = \big( \phi_\alpha(\chi) \big)_{\alpha \in \mathcal{A}^k_{2m-1}}, \tag{3.2}$$

where $\Phi_{(k,m)}$ maps the input data into the $m$-th feature set, which can be denoted by $\mathcal{A}^k_{2m-1} = \big( [\alpha\alpha\ldots\alpha] \otimes (2m-1)_{(2)} \big)_{\alpha \in \mathcal{A}}$, with $(l)_{(2)}$ representing the decimal $l$ as a binary sequence, e.g., $(5)_{(2)} = 101$. In equation (3.2), we used the relative frequency mapping for $\phi_\alpha(\chi)$ to cancel out the influence of different sequence lengths and different background frequencies of $\alpha$, i.e., it represents the number of times $\alpha$ occurs in $\chi$ divided by the number of possible k-mers ($|\chi| - k + 1$), relative to the same measure observed in the genome:

$$\phi_\alpha(\chi) = \frac{\#\,\alpha \text{ occurs in } \chi}{|\chi| - k + 1} \left( \frac{\#\,\alpha \text{ occurs in genome}}{|\text{genome}| - k + 1} \right)^{-1}.$$

Notice that in (3.2), a value of $m = 2^{k-1}$ corresponds to the last feature set (which is the set of contiguous k-mers), where the term $2m - 1$ yields the length-$k$ sequence of 1s in the binary representation, i.e., $(2m-1)_{(2)} = (2 \times 2^{k-1} - 1)_{(2)} = (2^k - 1)_{(2)} = \{1\}^k = [11 \ldots 1]$. Similarly, $m = 1$ represents the monomer model with the feature set

$$\mathcal{A}^k_{2m-1} = \mathcal{A}^k_1 = \{N \ldots NA, \; N \ldots NC, \; N \ldots NG, \; N \ldots NT\},$$

and $m = 2$ represents the dimer model with the feature set

$$\mathcal{A}^k_{2m-1} = \mathcal{A}^k_3 = \{N \ldots NAA, \; N \ldots NAC, \; N \ldots NAG, \; N \ldots NAT, \; N \ldots NCA, \; \ldots, \; N \ldots NTT\}.$$

Any other value $2 < m < 2^{k-1}$ corresponds to the gapped feature sets, given that $k > 2$. Considering all possible gapped k-mers, i.e., $m \in \{1, \ldots, 2^{k-1}\}$, the extended k-spectrum feature map is then given by

$$\Psi_k(\chi) = \big( \Phi_{(k,m)}(\chi) \big)_{m=1}^{2^{k-1}}. \tag{3.3}$$

For efficiency, we only evaluate the contiguous k-mer features under the k-spectrum map, i.e., $\Phi_{(k,2^{k-1})}(\chi)$; the mapping of $\chi$ to any other feature set ($m < 2^{k-1}$) is then calculated as a linear sum of the feature-specific values in this k-spectrum map, imitating a folding process over those coordinates. For example, given $k = 3$, the frequency of the feature sequence “ANA” in $\mathcal{A}^3_5$ is computed by

$$\phi_{ANA}(\chi) = \sum_{\alpha \in \mathcal{A}} \phi_{A\alpha A}(\chi) = \phi_{AAA}(\chi) + \phi_{ACA}(\chi) + \phi_{AGA}(\chi) + \phi_{ATA}(\chi),$$

where $[\phi_{AAA}(\chi), \phi_{ACA}(\chi), \phi_{AGA}(\chi), \phi_{ATA}(\chi)]$ are obtained from the indexed elements in $\Phi_{(k,2^{k-1})}(\chi)$. In other words, the mapping function of a gapped k-mer feature (e.g., “ANA”) is the (unweighted) linear combination of the mapping functions of all contiguous k-mer features which differ from that gapped k-mer sequence in the gapped locations.
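The folding step can be sketched as below. For simplicity this illustration uses raw k-mer counts instead of the genome-normalized relative frequencies of equation (3.2), and the function names are hypothetical; it shows only how each of the 2^(k-1) feature sets can be derived from the single contiguous k-spectrum.

```python
from collections import Counter


def k_spectrum(seq, k):
    """Counts of every contiguous k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def mask_for(k, m):
    """Dependency mask (2m - 1) written with k bits, e.g. k=3, m=3 -> '101'."""
    return format(2 * m - 1, f"0{k}b")


def fold(spectrum, mask):
    """Sum contiguous k-mer counts that agree on the '1' positions of mask."""
    folded = Counter()
    for kmer, count in spectrum.items():
        gapped = "".join(c if bit == "1" else "N" for c, bit in zip(kmer, mask))
        folded[gapped] += count
    return folded


def folded_spectrum(seq, k):
    """All 2**(k-1) feature sets, each folded from one contiguous spectrum."""
    base = k_spectrum(seq, k)
    return {
        mask_for(k, m): fold(base, mask_for(k, m))
        for m in range(1, 2 ** (k - 1) + 1)
    }


features = folded_spectrum("AAACAGATA", 3)
ana_count = features["101"]["ANA"]  # AAA + ACA + AGA + ATA
```

The input sequence is scanned once; every gapped feature set is then obtained by summing entries of the contiguous spectrum, mirroring the ANA example above.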

3.3 Methods

3.3.1 Datasets

We tested the proposed SVM kernel on the binary classification problem, i.e., discriminating a single class of positive (functional) DNA sequences from negative (background) DNA sequences. We applied the proposed method to the putative TFBSs predicted in Chapter 2. More precisely, we used as the positive sets the putative binding sites of lexA, purR, fur, crp, and fnr, consisting of 143, 1736, 236, 3582, and 3826 sequences, respectively. We generate the negative set by retaining the same distribution of sequence lengths and nucleotide compositions as in the positive set. That is, for each sequence in the positive set, we generate 10 different scrambled versions using random (uniformly selected) permutations of the nucleotides of the original sequence.

As an extension, we tested the approach in a more general setting. We used the set of 127 CRMs studied in [90], belonging to 114 genes of interest in early Drosophila development that exhibit differential expression patterns along the anterior-posterior (AP) axis. The CRM sequences were originally obtained from the database in [91] and extended by 100 bp of flanking sequence in each direction of their in-situ verified peaks. The negative sequence set is generated as described above. All positive and negative data sets used in this general application are available from http://www.columbia.edu/~ae2321/DmelCRMs.zip.
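The negative-set construction can be sketched as follows. This is an illustration under stated assumptions: the function name and the fixed seed are hypothetical, the two input sequences are the BAMBI2b consensus sequences from Tables 2.1 and 2.4 used here only as toy inputs, and the sketch does not enforce that the 10 permutations of a sequence are pairwise distinct.

```python
import random


def scrambled_negatives(positives, copies=10, seed=0):
    """For each positive sequence, return `copies` random permutations of it.

    Each permutation keeps the length and nucleotide composition of its source.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    negatives = []
    for seq in positives:
        for _ in range(copies):
            letters = list(seq)
            rng.shuffle(letters)  # uniform random permutation
            negatives.append("".join(letters))
    return negatives


positives = ["TACTGTATATATAAACAGTA", "ACGCAAACGTTTACCT"]
negatives = scrambled_negatives(positives)
```

Shuffling within each sequence preserves mononucleotide composition exactly, so a classifier cannot separate the two sets by base frequencies alone and has to rely on higher-order k-mer features.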

3.3.2 Identifying top-enriched features

Through SVM training, one can investigate the enrichment of the individual features (gapped k-mers) observed in a given input sequence. For this, we use all positive and nega- tive data sets and train the SVM with the proposed kernel. The elements w(n), n = 1,...,N in the resulting decision (weight) vector w = [w(1), . . . , w(N)] RN may indicate the rela- ∈ tive discriminative power of the corresponding features, whereby the class of a test example mapped into this feature space x = [x(1), . . . , x(N)] RN is obtained through the sum ∈ of the weighted frequencies of its sequence features, i.e. the sign of the inner product T N w x = n=1 w(n)x(n). Following this intuition, given the test sequence x we define the enrichmentP score of the n-th feature in this sequence by r(n) = w(n)x(n) and the score vector corresponding to all features by r = [r(1), . . . , r(N)] RN . ∈ Considering the DNA sequence data, different subset of features may be enriched in different sequences due to the underlying degeneracy of the binding sequence. In addition, since the nature of DNA binding prefers a “consensus” rather than the exact binding se- quence, k-mers containing a preferred core sequence with certain mismatches may also be enriched. Thereby, it is intuitive to expect that a number of top-enriched features estimated for a given sequence may represent binding sites for the underlying regulatory factor(s) driv- ing the primary function of the given (e.g., the cis regulatory module, CRM). To investigate this, we calculate the enrichment vectors for every positive sequence used in the SVM training (i.e., CRM data set), i.e., r , i = 1,...,P . For each given sequence, { i } we select the top-enriched features residing above a cutoff weight (0.005) and those belonging to the same feature set (gapped k-mer model) are used to construct the relevant “gapped k-mer motif”. 
For example, suppose that in the top-enrichment results there are only two features, ANT and TNT, belonging to the set $\mathcal{A}_5^3$; then the corresponding gapped k-mer motif will be A(T)NT. Such a motif may in fact be a fragment of a real DNA binding motif (e.g., a transcription factor binding site), so one can use this k-mer motif to search for possible hits in a known motif database. With this motivation, we generated all such motif fragments estimated for each given sequence.
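The scoring and motif-construction steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the weights and feature frequencies are invented toy values, the helper names are hypothetical, and the merged motif is written in bracket notation ([AT]NT) rather than the A(T)NT notation used in the text. The gap model string uses 'm' for a matched position and 'k' for a gap.

```python
from collections import defaultdict

def enrichment_scores(w, x):
    """r(n) = w(n) * x(n) for every feature n (features are dictionary keys)."""
    return {f: w[f] * x.get(f, 0.0) for f in w}

def top_enriched(r, cutoff=0.005):
    """Keep the features whose enrichment score exceeds the cutoff."""
    return {f: s for f, s in r.items() if s > cutoff}

def gap_model(kmer):
    """Gapped k-mer model: 'k' where the k-mer has a gap (N), 'm' elsewhere."""
    return "".join("k" if c == "N" else "m" for c in kmer)

def merge_into_motifs(features):
    """Group features by gap model and list the letters seen at each position."""
    groups = defaultdict(list)
    for f in features:
        groups[gap_model(f)].append(f)
    motifs = {}
    for model, kmers in groups.items():
        columns = []
        for column in zip(*kmers):
            letters = sorted(set(column))
            columns.append(letters[0] if len(letters) == 1
                           else "[" + "".join(letters) + "]")
        motifs[model] = "".join(columns)
    return motifs

# Toy weights and frequencies (hypothetical values for illustration).
w = {"ANT": 0.02, "TNT": 0.03, "CCG": 0.01}
x = {"ANT": 0.5, "TNT": 0.4, "CCG": 0.1}

r = enrichment_scores(w, x)
top = top_enriched(r)
motifs = merge_into_motifs(top)
print(motifs)
```

With these toy values only ANT and TNT pass the cutoff, and both share the model mkm, so they merge into a single motif fragment.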

3.3.3 Filtering out potential false positive gapped k-mers (false discovery-based feature elimination)

It is known that certain sequence patterns may be found more frequently in the genomic background, entailing the danger, for the above procedure, of falsely discovering such patterns (features) and predicting the subsequent (false) motif hits [92–94]. To address this, we set out to eliminate the features that putatively lead to false enrichments. This is done by replacing the positive inputs with a set of “false” positive sequences to identify an ensemble of corresponding falsely-enriched features (Figure 3.1). That is, we replace each positive training sequence $\chi_i$, $i = 1, \ldots, t$, with 10 scrambled versions of it ($\chi_j^f$, $j = i, \ldots, i+9$), and use this expanded (false) positive set for training the SVM with the proposed kernel; then we estimate the top-enriched gapped k-mers (with scores $r_j^f(n)$, $n = 1, \ldots, N$) for each of the false positive sequences $\chi_j^f$, $j = 1, \ldots, 10 \times t$, and discard those residing below the cutoff weight (0.005), i.e., we set $r_j^f(n) = 0$ if $r_j^f(n) < 0.005$.

We further discard “insignificant” false positives via Tomtom by filtering out any gapped k-mer if the common motif of the same gapped model does not lead to a significant JASPAR hit. That is, for each false positive input $\chi_j^f$, $j = 1, \ldots, 10 \times t$, we construct all enriched gapped k-mer motifs (with different gapped models) and search the motif database via Tomtom. If a gapped k-mer motif significantly matches a JASPAR motif (p-value < 0.001), we keep this gapped k-mer motif and all its sequences (gapped k-mers), assuming that they consistently appear in the genomic background; otherwise, if the motif is not found in JASPAR, we discard that motif and all its gapped k-mers, considering them “insignificant” (noisy) false positives. These gapped k-mers (features) are referred to as “false” and are used to remove the corresponding features found among the top features of the positive training sequences. That is, we filter out the n-th gapped k-mer (set $r_i(n) = 0$) if it belongs to one of the “gapped models” of the falsely-enriched features (corresponding to $r_j^f(n)$) in the respective scrambled copies $j = i, \ldots, i+9$. After removal, the remaining gapped k-mers (with enrichment scores $r_i^h(n) > 0.005$) in each input sequence are assumed to be “high-confidence” predictions, and we constrain the feature set to this collection, i.e., $\Psi_k^h = \left( \bigcup_i \mathcal{I}(r_i^h) \right) \cap \Psi_k$, where $\mathcal{I}(r_i^h)$ represents the indices of the remaining features with $r_i^h(n) > 0.005$, $n \in \{1, \ldots, N\}$.
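The elimination step can be sketched with plain set operations. This is only an illustration under assumptions: features are removed by exact identity (the set difference drawn in Figure 3.1), the Tomtom/JASPAR significance check is represented by a hypothetical predicate `is_significant` that defaults to accepting every feature, and all scores for the scrambled copies are invented.

```python
CUTOFF = 0.005

def falsely_enriched(false_scores, is_significant=lambda feature: True):
    """Union of features enriched (score >= cutoff) in any scrambled copy.

    is_significant is a hypothetical stand-in for the Tomtom/JASPAR check.
    """
    false_set = set()
    for r_f in false_scores:
        false_set |= {f for f, s in r_f.items()
                      if s >= CUTOFF and is_significant(f)}
    return false_set

def high_confidence(r_i, false_set):
    """Scores r_i^h: features above the cutoff that are not falsely enriched."""
    return {f: s for f, s in r_i.items() if s > CUTOFF and f not in false_set}

def constrained_feature_set(all_high_conf, psi_k):
    """Psi_k^h: union of the retained features, intersected with Psi_k."""
    kept = set()
    for r_h in all_high_conf:
        kept |= set(r_h)
    return kept & set(psi_k)

# Positive-sequence scores (feature names as listed in Figure 3.1;
# the low-scoring AAAAAA entry is invented).
r_i = {"CANTTG": 0.00531, "CCATTT": 0.0053, "TTTNGC": 0.00529, "AAAAAA": 0.001}
# Scores in two scrambled copies of the same sequence (invented values).
r_f = [{"CCATTT": 0.006, "TTTNGC": 0.002}, {"GGGGGG": 0.007}]
psi_k = {"CANTTG", "CCATTT", "TTTNGC", "AAAAAA", "GGGGGG"}

false_set = falsely_enriched(r_f)
r_h = high_confidence(r_i, false_set)
psi_h = constrained_feature_set([r_h], psi_k)
print(sorted(psi_h))
```

A model-level variant, which removes every feature sharing a gapped model with a false feature, would replace the identity test `f not in false_set` with a comparison of gap models.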

[Figure 3.1: flow diagram. FEATURE ELIMINATION: each positive training sequence χ_i, i = 1, ..., t, is replaced by 10 scrambled copies χ_j^f, and an SVM on Ψ_k yields the false enrichment scores r_j^f(n) > 0.005; an SVM on Ψ_k trained with the positive and negative sequences yields r_i(n) > 0.005 (example features and scores: CANTTG 0.00531, CCATTT 0.0053, TTTNGC 0.00529, TGTGCT 0.00523, TGGCCA 0.00519, TTTGAG 0.00518, NTAAGC 0.00518, AAGCGG 0.00515, CATGGG 0.00513); the difference r_i \ r_j^f gives r_i^h and the constrained feature set Ψ_k^h. FINAL PREDICTION: SVM on Ψ_k^h.]

Figure 3.1: SVM with feature elimination based on the discovery of false enrichments

3.3.4 Comparing top-enriched features with those in JASPAR

After filtering, we compared the individual gapped k-mers against a relevant motif database (i.e., JASPAR’s core database for insects) [95] to elucidate the putative regulatory factors. For this task, we used the motif comparison tool Tomtom [96], which calculates a statistical measure (p-value) to quantify the similarity between two motifs. For each gapped k-mer sequence, we obtained all the motif hits found by Tomtom and retained those with a significant p-value ($< 10^{-3}$).

3.3.5 Cross-validation

We assess the performance of the proposed folded k-spectrum kernel through cross-validation tests. During this procedure, a true positive was defined as a CRM from the original set of 127 ‘positive’ CRMs that was not used in the initial training step but was detected by the algorithm (i.e., correctly classified as a positive sequence). Similarly, a false negative was defined as a CRM from the original set of 127 ‘positive’ CRMs that was not used in the initial training step and was not detected by the algorithm (i.e., incorrectly classified as a negative sequence). Thus, any CRM with a low number of false negatives after this analysis represents a CRM that, even when left out of the training data, can still be detected as a positive sequence.

Given a set of input sequences (i.e., the CRM data set) with positive and negative class labels, we randomly divide them into 10 disjoint subsets while retaining (approximately) equal proportions of positive and negative sequences. At each fold, a different subset is held out for testing and the other 9 subsets are used for training the SVM with the proposed folded k-spectrum kernel. We note that the false positive features are eliminated from the feature set and the feature map is constrained to $\Psi_k^h$ following the procedure described in section 3.2.2. After processing the 10 folds, each input sequence has been tested (classified) exactly once. We then collect the class predictions and determine the true/false predictions for the positive and negative sequences. The corresponding ROC curve and Area Under the Curve (AUC) score are then generated (Figure 3.2).

We re-evaluate these predictions in repeated experiments in order to average out the influence of the random partitioning into test subsets. We repeat the above procedure 100 times, randomly re-dividing the data in each experiment, and then check the cumulative “false negative” count of each input sequence, i.e., the number of times a trained classifier fails to detect that sequence. This evaluation elucidates the predictive power of the “gapped k-mer” features on the individual input sequences.
We want to identify the group of CRMs on which the classifier performs with low accuracy, i.e., CRMs that the classifier “consistently” fails to detect across the repeated experiments, resulting in a high number of false negative (FN) calls. The consistency can be defined by setting a threshold on the FN call rate, e.g., FN > 10/100. For example, if a classifier (falsely) predicts a positive test sequence as negative in at most 10% of the repeated (randomized) cross-validation experiments, we can assume that the classifier is still able to predict that sequence with high (90%) accuracy. Figure 3.2 displays the (sequence-specific) prediction accuracy of the classifiers, the contiguous k-spectrum kernel and the folded k-spectrum kernel, based on the 10% FN call rate cutoff. Subsequently, one can determine the groups of CRMs on which the two methods (contiguous k-mer vs. folded k-mer) perform differently or similarly, and investigate the underlying effect of each classifier in these CRM groups.
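The repeated cross-validation and FN tally can be sketched as below. The sketch is not the implementation used in this work: `train_and_predict` is a hypothetical callable standing in for SVM training and prediction, and the toy classifier and sequences exist only to make the example runnable.

```python
import random

def stratified_folds(labels, n_folds=10, rng=None):
    """Split indices into n_folds disjoint subsets with balanced class proportions."""
    rng = rng or random.Random()
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds

def fn_counts(seqs, labels, train_and_predict, n_repeats=100, n_folds=10, seed=0):
    """Count, per sequence, how often a held-out positive is called negative."""
    rng = random.Random(seed)
    counts = [0] * len(seqs)
    for _ in range(n_repeats):
        for test_idx in stratified_folds(labels, n_folds, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(seqs)) if i not in held_out]
            preds = train_and_predict(
                [seqs[i] for i in train_idx],
                [labels[i] for i in train_idx],
                [seqs[i] for i in test_idx],
            )
            for i, p in zip(test_idx, preds):
                if labels[i] == 1 and p != 1:
                    counts[i] += 1
    return counts

def consistently_missed(counts, labels, n_repeats=100, cutoff=0.1):
    """Positive sequences whose FN call rate exceeds the cutoff."""
    return [i for i, c in enumerate(counts)
            if labels[i] == 1 and c / n_repeats > cutoff]

def toy_classifier(train_x, train_y, test_x):
    """Hypothetical stand-in for the SVM: calls a sequence positive if it contains GATA."""
    return [1 if "GATA" in s else 0 for s in test_x]

seqs = ["GATAAG"] * 19 + ["CCCCCC"] + ["ACGTAC"] * 20
labels = [1] * 20 + [0] * 20
counts = fn_counts(seqs, labels, toy_classifier, n_repeats=5)
missed = consistently_missed(counts, labels, n_repeats=5)
print(missed)
```

In the toy data, only the positive sequence lacking GATA (index 19) is missed in every repeat, so it is the only one flagged.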

3.4 Results

3.4.1 Filtering motif-based TFBS predictions by SVM classification

We applied the proposed SVM approach (Figure 3.1) to further classify the putative binding sites resulting from our motif-based inference approach, i.e., the TFBSs identified in the putative regulons of the lexA, purR, fur, crp, and fnr TFs presented in Chapter 2. These binding sites (and the putatively regulated genes they are paired with) are labeled as true positive or false positive according to the RegPrecise database, i.e., a putative TF binding site is called true (or false) if the downstream gene it is paired with is (or is not) identified as a member of the TF’s regulon in RegPrecise. As the training data, we used the corresponding motif’s binding sites, which were discovered by the BAMBI2b algorithm from the upstream sequences of high-confidence co-expressed genes. We did not use other TFBS/motif databases (e.g., RegPrecise, JASPAR, SwissRegulon, etc.) in SVM training, in order to establish a consistent framework that relies only on the computational methods and results presented in Chapter 2; this makes the proposed SVM-based filtering method an additional tool applicable to novel genomes. The details of the data sets used in the SVM training and testing are summarized in Table 3.1.

Motif    Training sequences    Predicted TFBSs
         (+)      (-)          (TP)     (FP)
lexA     15       150          26       117
purR     6        60           28       1708
fur      33       330          30       206
crp      16       160          233      3349
fnr      32       320          79       3747

Table 3.1: 5 TFBS data sets used in the proposed SVM approach.

[Figure 3.2: flow diagram. The positive and negative sequences are randomly divided into subsets for 10-fold cross-validation (folds f = 1, ..., 10); in each fold an SVM on Ψ_k^h is trained and tested, producing the ROC curve and the FN calls. The tests are repeated t = 1, ..., 100 times (multiple tests), and the procedure is repeated with the contiguous k-spectrum SVM for both training and testing. An FN cutoff of 0.1 separates the sequences into Group 1 and Group 2.]

Figure 3.2: Discovery of the different groups of input sequences, based on the difference in FN calls accumulated over multiple cross-validation tests.

For a given data set, e.g., lexA, the positive training sequences $\chi_i$, $i = 1, \ldots, 15$, are obtained from the lexA motif’s binding instances. The negative training sequences $\chi_j^n$, $j = 1, \ldots, 150$, and the false positive training sequences $\chi_j^f$, $j = 1, \ldots, 150$, are generated by scrambling the positive training sequences, as described in Section 3.3.3. The SVM procedure (with feature elimination) described in Figure 3.1 is applied to the data sets $(\chi_i)_{i=1}^{15}$, $(\chi_j^n)_{j=1}^{150}$, $(\chi_j^f)_{j=1}^{150}$, and the classifier parameters $(w^h, b^h)$ are obtained from the final predictions. Based on $(w^h, b^h)$, the putative TFBSs (26 TPs and 117 FPs) are reclassified. Those among the 26 TPs that are classified as “positive” remain labeled “TP”, and the others (classified as “negative”) are filtered out (as “false negatives”); similarly, those among the 117 FPs that are classified as “positive” remain labeled “FP”, and the others (classified as “negative”) are filtered out (as “true negatives”). In other words, the filtering procedure reduces the TFBS predictions to a smaller, more stringent, and more accurate subset. The accuracy is then evaluated by the rate of true positives among the predicted TFBSs, i.e., the positive predictive value $\mathrm{PPV} = \frac{TP}{TP + FP}$. Figure 3.3 shows the corresponding results before and after the filtering (BAMBI2b predictions and SVM kernels, respectively) for the different SVM approaches, the contiguous k-spectrum and the proposed folded k-spectrum. All filtering approaches substantially improve the accuracy of the TFBS predictions, and the folded k-spectrum SVM with feature elimination attains the highest average PPV.
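As a worked check of the PPV definition, the following sketch computes the PPV of the unfiltered predictions from the (TP, FP) counts in Table 3.1. The variable names are illustrative; only the counts come from the table.

```python
def ppv(tp, fp):
    """Positive predictive value: TP / (TP + FP)."""
    return tp / (tp + fp)

# (TP, FP) counts of the predicted TFBSs before filtering, from Table 3.1.
predicted = {
    "lexA": (26, 117),
    "purR": (28, 1708),
    "fur": (30, 206),
    "crp": (233, 3349),
    "fnr": (79, 3747),
}

baseline = {motif: ppv(tp, fp) for motif, (tp, fp) in predicted.items()}
average = sum(baseline.values()) / len(baseline)

for motif, value in baseline.items():
    print(f"{motif}: {value:.3f}")
print(f"average: {average:.3f}")
```

The average over the five motifs is approximately 0.082, which agrees with the value reported for the BAMBI2b predictions in the legend of Figure 3.3.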

3.4.1.1 Filtering TFBSs results in better fit to the ChIP-seq peaks

The fnr TFBSs yield the lowest performance in terms of FP rate (and PPV) both before and after the filtering. In Chapter 2, the fnr regulon was estimated (through motif discovery and screening) based on microarray gene expression data sets obtained under several stress conditions (i.e., aerobic knock-out, changes of global gene expression in E. coli during an oxygen shift, and anaerobic knock-out) [97]. Since stress-specific expression data can only reveal the subset of genes involved in the fnr activity perturbed by the given stress, it is expected that the TF motif derived from the upstream regions of the co-expressed genes, and the subsequent genome-wide TFBS predictions, may not represent all levels of interactions, and can thereby fit poorly to the taxonomy-level regulog estimates (i.e., the RegPrecise regulog [31] constructed by evolutionary conservation of the known fnr binding sites). This poor fit may perhaps be better explained by the complex regulation of anaerobiosis by fnr, where many factors play a role in the genome’s accessibility to fnr binding and in the gene expression pattern [98]. A recent study in E. coli [99] argues that only a subset of the predicted fnr binding sites are occupied in vivo, where the active regulon can expand or contract depending on the environmental conditions.

With this motivation, we obtained the fnr ChIP-seq signals measured in E. coli [99], and examined our TFBS predictions (before and after the filtering) in terms of their overlap with the ChIP signal peaks (i.e., the TFBS’s center lying within 100 bp of a signal peak). Of the 79 true positive (w.r.t. RegPrecise) TFBSs, 40% (32 sites) overlapped the ChIP-seq peaks. Among the false positives (3747 TFBSs) this rate was very low, i.e., 3.6% (135 sites). After the SVM filtering (with the folded k-spectrum), the hit rate for true positives increased to 67% (i.e., 2 out of 3 remaining sites), and for the false positives it increased to 50% (i.e., 5 out of 10 remaining sites). The contiguous k-spectrum SVM filter reached similar or lower hit rates, i.e., 67% for TP sites and 38% for FP sites (Table 3.2).

[Figure 3.3: bar chart of the positive predictive value (y-axis, 0 to 0.9) for lexA, purR, fur, crp, and fnr (x-axis). Legend, with the average PPV in parentheses: k-spectrum SVM (0.540), k-spectrum SVM FE (0.546), folded k-spectrum SVM (0.432), folded k-spectrum SVM FE (0.558), BAMBI2b predictions (0.082).]

Figure 3.3: Performance of the SVM-based filtering approaches, contiguous vs. folded k-spectrum kernels with/without feature elimination (FE).

Filter                   Rate of ChIP hits in TFBSs
                         (TP)       (FP)
No filter                32/79      135/3747
Folded k-spectrum SVM    2/3        5/10
k-spectrum SVM           2/3        5/13

Table 3.2: The rate of ChIP hits in the given TFBS predictions (before and after the SVM filtering).
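The overlap criterion and the rates in Table 3.2 can be illustrated with a short sketch. The coordinates below are invented, the function names are hypothetical, and treating a distance of exactly 100 bp as a hit is an assumption about the boundary case.

```python
def overlaps_peak(center, peaks, window=100):
    """True if the TFBS center lies within `window` bp of any peak position."""
    return any(abs(center - p) <= window for p in peaks)

def hit_rate(centers, peaks, window=100):
    """Return (number of TFBSs hitting a peak, total number of TFBSs)."""
    hits = sum(overlaps_peak(c, peaks, window) for c in centers)
    return hits, len(centers)

# Invented genomic coordinates, for illustration only.
peaks = [1000, 5000]
centers = [950, 1100, 1101, 4890, 5100]
toy = hit_rate(centers, peaks)
print(toy)

# Hit counts from Table 3.2 as (hits, total), converted to rates.
table = {
    ("No filter", "TP"): (32, 79),
    ("No filter", "FP"): (135, 3747),
    ("Folded k-spectrum SVM", "TP"): (2, 3),
    ("Folded k-spectrum SVM", "FP"): (5, 10),
    ("k-spectrum SVM", "TP"): (2, 3),
    ("k-spectrum SVM", "FP"): (5, 13),
}
rates = {key: hits / total for key, (hits, total) in table.items()}
for key, rate in rates.items():
    print(key, f"{rate:.3f}")
```

Because only 3 TP sites and 10 to 13 FP sites remain after filtering, the filtered rates rest on very small counts and should be read accordingly.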

3.4.2 Classifying enhancer sequences and detecting functional subgroups

We also applied our approach to a more general case of regulatory sequence classification, i.e., transcriptional enhancers playing a role in early Drosophila development. We implemented the folded k-spectrum kernel approach using a set of 127 CRMs responsible for the regulation of 114 genes exhibiting differential expression during early Drosophila development. These CRMs are referred to as our ‘positive sequences’, as each of them has been experimentally shown to regulate gene expression along the anterior-posterior axis, and thus each is thought to contain motif(s) corresponding to binding sites for transcription factors (TFs) present in the early embryo [100]. Those referred to as ‘negative sequences’ during our SVM training procedure were constructed by randomly scrambling the positive sequences while retaining each sequence’s original nucleotide distribution (see Methods for more details). Based on the cross-validation tests described in Figure 3.2 (with a 10% FN call rate cutoff), our algorithm identified three subsets of these CRMs:

Group 1 : those that are correctly detected as belonging to the ‘positive sequence set’ by gapped k-mers (i.e., folded k-spectrum kernel), but not detected by contiguous k-mers (i.e., contiguous k-spectrum kernel),

Group 2 : those that are incorrectly identified as belonging to the ‘negative sequence set’ by both gapped k-mers and contiguous k-mers,

Group 3 : those that are correctly detected as belonging to the ‘positive sequence set’ by both contiguous k-mers and gapped k-mers.

We observed that 7 CRMs fall into Group 1, 7 fall into Group 2, and the remaining 113 fall into Group 3 (Figure 3.4). One should note that none of the 127 CRMs were correctly detected by contiguous k-mers but not by gapped k-mers due to the fact that contiguous k-mers are included when considering all possible gapped k-mers. The Group 2 CRMs can be interpreted as False Negatives (FN), since they are not detected using either approach. However, the subset of CRMs contained in Group 1 may contain crucially important TFBSs with gapped motifs, which are undetectable using the standard approaches. Therefore, we focus our attention on those motifs (gapped k-mers) enriched in the Group 1 CRMs (Table 1A), and include the enriched gapped k-mers in the other CRMs in the Supplementary Information (Table S1A).
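The group definitions can be expressed as a small decision rule on the accumulated FN counts of the two classifiers. This is a sketch with hypothetical names and invented counts; it assumes the 10% FN call rate cutoff is applied as a strict inequality (FN > 10/100).

```python
def assign_group(fn_folded, fn_contig, n_repeats=100, cutoff=0.1):
    """Assign a CRM to a group from its FN counts under the two kernels."""
    missed_folded = fn_folded / n_repeats > cutoff
    missed_contig = fn_contig / n_repeats > cutoff
    if not missed_folded and missed_contig:
        return "Group 1"   # detected by gapped k-mers only
    if missed_folded and missed_contig:
        return "Group 2"   # missed by both
    if not missed_folded and not missed_contig:
        return "Group 3"   # detected by both
    return "unassigned"    # detected by contiguous k-mers only (not observed)

# Invented FN counts (folded, contiguous) out of 100 repeats.
examples = [(3, 60), (80, 95), (0, 10), (50, 2)]
groups = [assign_group(f, c) for f, c in examples]
print(groups)
```

The fourth outcome is kept as an explicit fallback even though no CRM fell into it in the reported experiments.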

[Figure 3.4: number of FN-predicted tests (y-axis, 0 to 100) per CRM ID (x-axis) for the 6-spectrum kernel and the folded 6-spectrum kernel.]

Figure 3.4: Classification performance of the k-spectrum kernel vs folded k-spectrum kernel.

The results in Table 1A give a comprehensive picture of all of the gapped k-mers found to be enriched (those with weight > 0.005) in the Group 1 CRMs. However, to further investigate the importance of particular gapped k-mer models, we combined these results into Table 2A, which gives each gapped k-mer model found to be enriched, the number of enriched gapped k-mers that fell into that gapped k-mer model, and the cumulative weight of those enriched gapped k-mers. Note that the most highly enriched gapped k-mers in each of these CRMs are those containing only a small number of gaps (i.e. most have at most one gap, the letter ‘k’), although each of these CRMs were identified as belonging to Group 1.
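The per-model summary described above (count and cumulative weight for each gapped k-mer model) can be sketched as follows. The features and weights are the example values listed in Figure 3.1, used here purely for illustration; they are not the contents of Table 2A, and the function names are hypothetical.

```python
from collections import defaultdict

def gap_model(kmer):
    """Gapped k-mer model: 'k' marks a gap (N), 'm' a matched position."""
    return "".join("k" if c == "N" else "m" for c in kmer)

def summarize_by_model(enriched):
    """Map each model to (number of enriched k-mers, cumulative weight)."""
    counts = defaultdict(int)
    weights = defaultdict(float)
    for kmer, weight in enriched.items():
        model = gap_model(kmer)
        counts[model] += 1
        weights[model] += weight
    return {m: (counts[m], weights[m]) for m in counts}

# Example features and scores as listed in Figure 3.1.
enriched = {
    "CANTTG": 0.00531, "CCATTT": 0.0053, "TTTNGC": 0.00529,
    "TGTGCT": 0.00523, "TGGCCA": 0.00519, "TTTGAG": 0.00518,
    "NTAAGC": 0.00518, "AAGCGG": 0.00515, "CATGGG": 0.00513,
}

summary = summarize_by_model(enriched)
for model, (n, total) in sorted(summary.items()):
    print(model, n, round(total, 5))
```

Sorting the summary by cumulative weight instead of by model name would rank the models by their overall contribution.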

3.4.3 Removing False Positives

There are some sequence features that may be found to be more abundant than others, even in the genomic background sequence, due to the sequence composition, although they do not represent TFBSs. Thus, in an attempt to remove these ‘false positive’ motifs (spurious gapped k-mers), we used a filtering procedure (see Methods for more details). This procedure eliminated approximately 88% of the gapped k-mer features that were previously found to be enriched. Tables 1B, 2B, and all remaining figures correspond to these filtered results. After filtering, 9 CRMs fall into Group 1, 1 falls into Group 2, and the remaining 117 CRMs fall into Group 3 (Figure 3.5). Note that 3 CRMs moved from Group 2 to Group 1 after filtering (i.e., the filtering procedure allowed them to be correctly detected as belonging to the ‘positive sequence set’ by gapped k-mers, while they were still missed by contiguous k-mers), whereas only 1 CRM moved from Group 1 to Group 2 (i.e., the filtering procedure caused the gapped k-mers to lose their ability to correctly identify the CRM). Thus, overall the filtering procedure improves the performance of the gapped k-mers when compared to the contiguous k-mers.

[Figure 3.5: number of FN-predicted tests (y-axis, 0 to 100) per CRM ID (x-axis) for the 6-spectrum kernel (FE) and the folded 6-spectrum kernel (FE).]

Figure 3.5: Classification performance of the k-spectrum kernel vs folded k-spectrum kernel after feature elimination.

Table 1B shows the gapped k-mers found to be enriched in the Group 1 CRMs using the 127 positive CRM sequences for training, after eliminating the features characterized as ‘false’ during this procedure. The overall number of enriched gapped k-mers in the Group 1 CRMs decreases after filtering (from an average of approximately 176 enriched gapped k-mers per CRM to 90 per CRM, Tables 1A vs. 1B), but many enriched gapped k-mers still remain. Again, to investigate the importance of particular gapped k-mer models, we combined the filtered results into Table 2B, which gives each gapped k-mer model found to be enriched, the number of enriched gapped k-mers within that model, and the cumulative weight of those enriched gapped k-mers. An interesting observation is that the models found to be highly enriched after filtering have a small number of gaps, “k” (i.e., mmmkmm, mkmmmm, mmmmkm, mmkmmm, etc.). However, one should note that the gapped k-mers that belong to the contiguous model mmmmmm were characterized as false enrichments in all cases.

3.4.4 Enriched motifs correspond to known TFBSs

For the enriched motifs found after filtering in the 9 Group 1 CRMs, we set out to gain some insight into which TFs may bind these motifs in vivo. The improved predictive power obtained by the gapped k-mer models in those CRMs suggests the presence of TFs with strong nucleotide interdependencies in their binding sites. To identify such potential TFs, we analyzed these enriched gapped k-mers using the insect motif database, JASPAR, along with the widely used comparison tool, Tomtom, quantifying the similarity between our enriched motifs (gapped k-mer sequences) and known motifs for regulatory factors [95, 96]. We analyzed all 817 enriched gapped k-mers found in the Group 1 CRMs, and found 136 significant hits using JASPAR (p-value $< 10^{-3}$, see Table 3B). Here, we highlight three important findings illustrated by these results:

1. A large number of the enriched gapped k-mers correspond to binding sites of TFs known to regulate genes involved in early AP patterning in Drosophila.

2. Both BRK and MAD were found to be among the most frequently found TFs with binding sites corresponding to enriched gapped k-mers.

3. CTCF binding sites are also frequently found to correspond to enriched gapped k-mers.

3.4.4.1 TFs known to regulate AP patterning genes

The hits found by all gapped k-mers correspond to 29 different TFs, 14 of which are known to be involved in regulating AP patterning genes [101]. These include BCD, BTD, GSC, OC, KR, TTK, KNI, TLL, HKB, H, SLP, OPA, ODD, and RUN. It is not surprising that such a large number of the significant hits were for TFs known to regulate AP patterning genes, since the 127 CRMs used for the analysis are known to regulate genes that are differentially expressed along the AP axis. However, it is worth noting that some interesting gapped dependency patterns emerged through this analysis.

As an example, we focus on BCD, the TF found with the second highest number of occurrences in Group 1 CRMs. The enriched gapped k-mers that produced significant Tomtom matches to the known BCD motif were aligned and used to construct the weblogo in Figure 3.6A. One should note that this logo is very similar to JASPAR’s BCD motif, in which the gaps significantly align to the 4th base and the (flanking) 7th base of the JASPAR motif. Although the majority (7) of these aligned enriched gapped k-mers have the contiguous 5-mer model (kmmmmm), the remainder of them (3) have a consistent appearance of the gap in the 4th base, suggesting a non-adjacent (gapped) dependency between nucleotides 2 and 3, and 5 and 6. One possible explanation for this gapped dependency may be the three-dimensional shape of the DNA bound by BCD. When the BCD binding sites in the JASPAR database, originally identified using bacterial one-hybrid experiments [102], were analyzed using the recently developed and validated high-throughput approach DNAshape [103], they were predicted to have a significantly smaller minor groove width at the 4th base [104]. We believe the gap in the 4th base (and the entailed dependency between the surrounding nucleotides) may be due to this reduced minor groove width.

3.4.4.2 BRK and MAD

BRK and MAD are two TFs known to be involved in Dpp signaling in early Drosophila development [105]. In our analysis, MAD and BRK appear among the most frequently found TFs with binding sites corresponding to enriched gapped k-mers (11 and 9 hits, respectively) and are jointly found in multiple CRMs (iab-4, Stat92E 1, prd 1). It has been shown that they can compete for binding sites, both in vitro and in vivo, affecting the stimulation of the Ubx enhancer [106]. Thus, it is not surprising to find their binding sites enriched on the same enhancers.

The gapped k-mers that produce hits to BRK motifs were used to generate the logo in Figure 3.6C. This logo conforms with JASPAR’s BRK motif, where the gaps (N) appear in various bases, most prominently at the 4th and 6th bases. The same alignment for MAD’s gapped k-mers strongly suggests a single-gapped dependency pattern located at the 10th base (Figure 3.6D).

[Figure 3.6: four panels, A) bcd, B) CTCF, C) brk, D) Mad; each panel shows sequence logos (y-axis: bits; x-axis: base position).]

Figure 3.6: JASPAR motifs (top) and the weblogos of the enriched gapped k-mers (below). Shown are weblogos corresponding to the TFBSs for A) BCD, B) CTCF, C) BRK, and D) MAD.

3.4.4.3 CTCF

Our results show that CTCF is also frequently found when comparing binding sites from JASPAR with the enriched gapped k-mers (8 hits, Table 3B). This is quite an interesting result, as CTCF has been shown to be involved in enhancer-promoter looping interactions, facilitating transcriptional activation [9, 107, 108]. More specifically, White and colleagues have conducted experiments supporting the function of CTCF in promoting the interaction between enhancers and the transcriptional start site of the Ultrabithorax gene in the Drosophila bithorax complex [109].

The enriched gapped k-mers that produced significant Tomtom matches to the known CTCF motif were aligned and used to construct the weblogo in Figure 3.6B. One should note that this logo is very similar to JASPAR’s CTCF motif, and that the most significant gap appears at the 8th base of the CTCF weblogo. This again correlates with the TFBS shape estimated by the DNAshape algorithm, in which the shortest minor groove width is estimated at the 8th base of the binding site [104].

3.4.5 Cross-validation

To validate our model, we conducted a set of cross-validation experiments. This was done by determining the number of true positive/false negative class predictions of sequences through a random leave-sets-out analysis (see Methods for more details).

We repeated the leave-sets-out analysis 100 times using both the traditional k-spectrum kernel and the folded k-spectrum kernel (with feature elimination) during the initial training, leaving out subsets of the positive as well as negative (scrambled) CRMs for testing; the resulting numbers of false negatives are shown in Figure 3.5. One should note that the SVM with the folded k-spectrum performs better than the traditional SVM with the k-spectrum in all CRMs.

The overall performance of the classifiers is evaluated through the ROC curves, i.e., true positive rate $\frac{TP}{TP + FN}$ vs. false positive rate $\frac{FP}{FP + TN}$. Figure 3.7 shows the superimposed ROC curves of the 100 (binary) classification results, with the corresponding average area under the ROC curve (AUROC) scores. The folded k-spectrum approach clearly outperforms the contiguous k-spectrum in terms of AUROC scores, with the feature elimination procedure further improving the classifier’s performance.
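For reference, the ROC construction and the AUROC can be sketched in a few lines. The decision scores and labels below are invented toy values, and the function names are hypothetical; the sketch sweeps a threshold over the scores and integrates the curve with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Process all examples tied at this score together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))  # FP/(FP+TN), TP/(TP+FN)
    return points

def auroc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy decision scores and class labels (1 = positive, 0 = negative).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
area = auroc(points)
print(round(area, 4))
```

In this toy case 8 of the 9 positive-negative pairs are ranked correctly, so the area equals 8/9.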

3.5 Discussion

Approaches to binding site prediction from DNA sequence data have been developed and improved upon for decades. The development and implementation of the novel approach laid out in this manuscript has led us to some very interesting conclusions. Most striking, and probably most important to the general bioinformatics community, are the advantages that this new approach, which incorporates gapped dependency, has shown over existing SVM approaches. To highlight the success of this approach, note that 126 of the 127 CRMs tested were correctly identified as belonging to the ‘positive sequence set’ by this new algorithm, while 9 of those were not correctly identified by the previous contiguous k-mer approach, and the only remaining CRM was incorrectly identified by both approaches.

[Figure 3.7: four ROC panels (x-axis: false positive rate, 0 to 0.1; y-axis: true positive rate, 0.9 to 1). Top row: Contiguous 6-spectrum, average AUC = 0.967; Folded 6-spectrum, average AUC = 0.975. Bottom row: Contiguous 6-spectrum (FE), average AUC = 0.980; Folded 6-spectrum (FE), average AUC = 0.995.]

Figure 3.7: Superimposed ROC curves of the 100 cross-validation tests. Contiguous vs. folded k-spectrum kernel without feature elimination (above) and with feature elimination (below). Note that many of the ROC curves overlap due to the very small number of false positives found.

Thus, our new algorithm outperformed the previous algorithm over the complete set of 127 CRMs (Figure 4). We also found, through a rigorous set of cross-validation tests, that our novel approach outperformed the previous approach in terms of the AUROC measurements (average of 0.995 vs. 0.98, Figure 6). Beyond showing the advantage this approach holds over previous approaches, we have also illustrated its utility on the specific set of CRMs tested, validating the method and raising new biological hypotheses regarding the nature of DNA-protein binding. The 127 CHAPTER 3. THE FOLDED K-SPECTRUM KERNEL: A MACHINE LEARNING APPROACH TO DETECTING TRANSCRIPTION FACTOR BINDING SITES WITH NUCLEOTIDE DEPENDENCIES 59

CRMs chosen for this study belong to 114 genes which exhibit differential expression pat- terns along the AP-axis in early Drosophila embryos. When comparing the gapped k-mers found to be enriched using our algorithm to those in the insect motif database, JASPAR, we found that 14 of the 29 TFs found were indeed known to be involved in regulating AP patterning genes. Although not all TFs found were known to be involved in the regulation of AP patterning genes, the appearance of many of these other TFs can still be explained in a reasonable biological context. For example, BRK and MAD are thought to compete for binding sites effecting Dpp signaling events in early Drosophila development, and our results support this as we have found them enriched on the same enhancers. We have also found enrichment of CTCF within these 127 known CRMs, which has been shown to be involved in enhancer-promoter looping. These result all lead us to believe that the motifs we are identifying are true TF binding sites. To begin to disentangle the idea of gapped dependency, we chose to focus on examples from the most enriched motifs found to correspond to known TF binding sites. Using a newly developed approach, DNAshape, to predict the shape of the DNA at particular binding sites, we found that for both BCD and CTCF, the nucleotides in which our algorithm predicted prominent gaps aligned with those DNAshape predicted to have the shortest minor groove width. One might hypothesize that this is the connection between the success of a bioinformatic approach incorporating gapped-dependency and the molecular basis of TF binding specificity. This illustrates the need for more detailed studies using algorithms, such as the one developed in this study, that incorporate gapped-dependency into models for binding site identification, along with experimental data on binding events. 
This study has introduced a novel approach to motif identification, which builds upon previous approaches in machine learning to allow for a less biased approach to binding site discovery. We have shown its ability to predict TF binding sites on a set of 127 CRMs, and have offered some insight into the molecular basis for its success. In the future, it will be very interesting to see similar analyses performed on various other sets of sequences, possibly including sequences from different species and different time points in development. Such studies could help in answering a deeper biological question of whether universal rules exist governing binding site preference, strength, and flexibility.

Chapter 4

Maximum parsimony xor haplotyping by sparse dictionary selection

4.1 Introduction

A human genome is a sequence of nucleotides that can differ from one individual to another (approximately 0.1% difference between any two individuals) for various reasons, such as insertions/deletions of fractions of the sequence or, most commonly, the substitution/mutation of single nucleotides at commonly observed sites called single nucleotide polymorphisms (SNPs) [110]. In most SNPs only two of the four nucleotides are observed. The information on nucleotide variation extracted from these SNP sites (loci) is encoded as a sequence called a “haplotype”. That is, for a particular SNP site one symbol is used for one of the observed nucleotides (e.g., the most commonly observed nucleotide variant, the reference/major allele) and another symbol is used for the other (e.g., the least observed nucleotide variant, the alternate/minor allele). Because of their informative and hereditary nature, identifying the haplotypes of individuals has been an important subject in various medical and scientific studies, such as gene-related disease discovery and drug design [111, 112], population history research [113], etc. Nonetheless, current experimental techniques are not low-cost and efficient enough for directly sequencing the haplotypes of an individual; identifying them is therefore mostly based on indirect approaches, e.g., using computational methods to infer haplotypes from an alternative, cost-effective form of data called the “genotype”.

The entire human genome consists of 23 distinct chromosomes, each appearing in two copies (autosomes), except for chromosome 23 (allosome), which consists of two copies of chromosome X in females or one chromosome X and one chromosome Y in males. Along each chromosome there is a pair of two distinct sequences (haplotypes) inherited from the parents, i.e., one from the maternal genome and the other from the paternal genome.
The genotype is sequenced by identifying the types of alleles (nucleotide variants) across the SNP locations (loci) in chromosomes. At a particular locus of a chromosome, if both haplotypes have the same allele we call this site in the genotype homozygous and denote it with the type of allele in both haplotypes, either reference or alternate; otherwise, if the haplotypes have different alleles (one reference and one alternate), we call this site heterozygous. When identifying haplotypes for a given genotype (haplotype phasing), ambiguity occurs at the heterozygous sites, since there is no information about which haplotype has the reference allele and which haplotype has the alternate allele. Clearly, genotypes are less informative than haplotypes, as they are ambiguous at heterozygous sites due to the possible permutations, and computational methods can be employed to identify which allele comes from which haplotype. Recently, more cost-effective alternative methods have been used for genotype sequencing [24], e.g., the widely used denaturing high-performance liquid chromatography (DHPLC) [25]. With certain applications of such methods one can only determine whether an individual is homozygous or heterozygous at a given SNP site, but cannot distinguish the type of allele at the homozygous sites. The sequenced data is thereby less informative than regular genotypes, as it only represents the sites at which the haplotypes differ (an XOR operation). This less informative form of genotype is named xor-genotype. One can solve the haplotype inference problem based on xor-genotypes, i.e., xor-haplotyping, with a reasonable extra computational effort.

Methods for solving the haplotype inference problem given regular genotypes can be summarized in two categories: combinatorial methods, which usually state an explicit objective function and propose methods for optimizing it, and statistical methods, which rely on statistical modeling of the problem. Various methods have been published for the haplotype inference problem [114–120]; however, the xor-haplotyping problem has remained mostly under-investigated. Two particular methods are suitable for xor-haplotyping problems: parsimony haplotyping, which is based on the maximum parsimony principle, and perfect phylogeny haplotyping, which relies on a population genetics assumption called the infinite sites/alleles model [121], i.e., it assumes that allele sequences are long enough so that a particular allele will have a mutation only once in the . The perfect phylogeny (PP) model utilizes the infinite sites assumption by building a tree of individuals (haplotypes) in which all individuals evolve, with no recurrent mutation, from one common ancestor. An approximate solution to the xor-haplotyping problem under the PP model was introduced in [122], where xor-haplotype inference was cast as a graph realization problem [123, 124]. However, the method proposed in [122] (GREAL) is not well suited to xor-genotypes with a large number of SNPs, i.e., it is usually limited to 30 SNPs [125], and it has not been extended to missing-data cases.

On the other hand, it is known that in a population of individuals certain haplotypes are frequently found in certain genomic regions [126]. This fact leads to the parsimony principle, which states that the genotypes of a population of individuals are generated by the least number of distinct haplotypes. Identifying such a smallest set of haplotypes is called the Pure (Maximum) Parsimony Problem, which is NP-hard [127]. An integer linear programming method that finds a pure parsimony solution to this problem was introduced in [128], and in [129] a branch-and-bound method was used to solve the pure parsimony problem.
In [130] a method called XOR-HAPLOGEN was proposed for solving the haplotype inference problem in the case of xor-genotype data. This method can find accurate solutions for xor-genotypes with a large number of SNPs. Another parsimony method was introduced in [131] for xor-haplotype inference by representing it as a graph realization problem called pure parsimony xor haplotyping (PPXH).

In [43] a novel framework for (regular) haplotyping was proposed by interpreting the parsimony principle as a sparse representation of the genotypes. Two approaches are presented: maximizing a sparseness condition on the haplotype frequency vector determined by the inferred haplotypes, and casting the sparsity of this frequency vector as a sparse dictionary selection problem. The latter approach is solved by an efficient greedy method, SHSD, in which the haplotypes explaining the given genotypes are determined according to a sparse selection from the set of compatible haplotypes. The method constructively determines the solution for each individual while selecting haplotypes from this set, and it has a convergence guarantee.

For the xor-haplotyping problem, there is increased ambiguity due to the XOR operation between haplotypes, i.e., the process of xor-genotyping that determines whether the alleles in the two haplotypes differ at a particular site (heterozygous) or are the same (homozygous). However, this ambiguity can be resolved with the assistance of regular genotypes. Regular genotypes can either be used as post-processing inputs for eliminating set-equivalent solutions of a particular inference, or they can be used to refine the inference while constructing the solution. The tractability of the maximum parsimony haplotyping problem in the xor-genotype case is still open [131].
In this chapter, we propose a modified version of the SHSD algorithm, called XHSD, that can efficiently find a solution to the maximum parsimony xor-haplotyping problem and resolve the ambiguity with the help of a small number of regular genotypes. For a given set of xor-genotypes, the haplotype pairs for each individual are selected from the set of compatible haplotypes by a sparse dictionary selection method. The selection of dictionary columns from the set of compatible haplotypes and the sparse representation of the xor-genotypes are formulated as a joint combinatorial optimization problem. The objective function of this problem maximizes a variance reduction metric over all individuals. Our algorithm is a low-complexity greedy method that terminates once the solution is fully determined. To resolve the ambiguity and to improve the inference accuracy, we employ a small number of regular genotypes as constraints on the set of compatible haplotypes to help resolve the phase of homozygous alleles.

4.2 System model

In an SNP locus only 2 nucleotides are observed, and a single bit is sufficient to represent the nucleotide variants, such that 0 encodes the major allele and 1 encodes the minor allele. The haplotype of an individual can thereby be represented by a binary vector that shows the SNP variants across the individual's chromosome. The (conflated) genotype can then be thought of as a ternary vector where a 0 (2) indicates that the site is homozygous and both haplotypes have major 0|0 (minor 1|1) alleles, and 1 indicates that the site is heterozygous and the haplotypes have different alleles, 0|1 or 1|0. Notice that when encoding homozygous and heterozygous sites we use a notation different from the literature in order to express a genotype vector as the sum of two haplotypes: a minor-homozygous SNP is encoded with 2 and a heterozygous SNP is encoded with 1, so that a 2 in the genotype is given by (the sum of) two minor alleles, and a 1 in the genotype is given by (the sum of) one major and one minor allele.

In general, given a length-L genotype vector, $k \le L$ of the loci can be heterozygous and thereby ambiguous. In each of the k sites one haplotype can take two values (0 or 1) and the other haplotype takes the complement value. Considering all k heterozygous sites, one haplotype can then be one of the $2^k$ possible binary sequences, and the other haplotype will be the complement (inverted values) of that sequence. Therefore, for solving a genotype with k heterozygous sites, the pair of haplotypes is drawn from a set of $2^k$ distinct binary vectors of length L.

On the other hand, in the xor-haplotyping problem the conflated data, the xor-genotype, is less informative than the regular genotype because the information about the type of allele at homozygous sites is lost. The xor-genotype is itself a binary vector, where for a given site 1 indicates a heterozygous SNP, i.e., the two haplotypes have different alleles at this site.
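The two encodings described above can be illustrated with a minimal Python sketch. The function names and the example haplotypes are hypothetical and chosen only for illustration; the sketch assumes the 0/1/2 sum encoding of regular genotypes used in this section.

```python
from itertools import product

def genotype(h1, h2):
    """Regular genotype: site-wise sum (0 = major hom., 1 = het., 2 = minor hom.)."""
    return [a + b for a, b in zip(h1, h2)]

def xor_genotype(h1, h2):
    """Xor-genotype: site-wise XOR (1 = heterozygous, 0 = homozygous, allele hidden)."""
    return [a ^ b for a, b in zip(h1, h2)]

def compatible_haplotypes(g):
    """All haplotypes z such that g - z is binary; 2**k of them for k heterozygous sites."""
    choices = [(0,) if s == 0 else (1,) if s == 2 else (0, 1) for s in g]
    return [list(z) for z in product(*choices)]

# Hypothetical haplotype pair over L = 4 SNPs.
h1, h2 = [0, 1, 1, 0], [0, 0, 1, 1]
g = genotype(h1, h2)                    # [0, 1, 2, 1], k = 2 heterozygous sites
x = xor_genotype(h1, h2)                # [0, 1, 0, 1], homozygous alleles hidden
candidates = compatible_haplotypes(g)   # 2**2 = 4 candidate haplotypes
```

For the xor-genotype `x`, by contrast, no site restricts the candidates, so all `2**4 = 16` binary vectors remain possible.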
The xor-genotype can be represented by the XOR sum of the two haplotypes; likewise, for a given site 0 indicates a homozygous SNP where the haplotypes have the same allele, but without any distinction as to whether the allele is major or minor. That is, the xor-genotype contains the information of whether a particular SNP site has homozygous alleles, but the type of allele at those homozygous sites is not identified. Every site of an xor-genotype is ambiguous, and each site of the corresponding haplotype can take two values. Therefore, a length-L xor-genotype can be explained by a pair of haplotypes drawn from a set of $2^L$ distinct binary vectors of length L. Hence, because of the additional ambiguity at homozygous sites, the number of possible solutions for an xor-genotype is significantly (in fact, exponentially) larger than that for a regular genotype of the same size.

Besides being NP-hard, the xor-haplotyping problem has no unique solution. The nature of the XOR operation results in a phenomenon called the bit flip degree of freedom [122], i.e., for a particular solution set $H$ consisting of length-L haplotypes, one can produce equivalent solution sets by inverting a certain SNP $i \le L$ (or a set of SNPs $S \subseteq \{1, \ldots, L\}$) in all haplotypes of $H$. Notice that inverting (complementing) an SNP across all haplotypes has no effect on the xor-genotypes they generate, because even the alleles explaining the homozygous sites of xor-genotypes are not distinguished (hidden). More specifically, assume that $h_i^1(\ell) \in \{0, 1\}$ and $h_i^2(\ell) \in \{0, 1\}$ represent the haplotypes of the i-th individual at the $\ell$-th SNP and that they generate that individual's xor-genotype $x_i(\ell)$ such that $x_i(\ell) = h_i^1(\ell) \oplus h_i^2(\ell)$. Then the complemented SNPs of the haplotypes also explain that SNP of the same xor-genotype, i.e., $x_i(\ell) = \bar{h}_i^1(\ell) \oplus \bar{h}_i^2(\ell)$. It then follows that for a particular set $H$ of length-L haplotypes that solves a given set of xor-genotypes, there are at most $\sum_{i=1}^{L} \binom{L}{i} = 2^L - 1$ equivalent sets $H'_i$, $i = 1, \ldots, 2^L - 1$, to $H$, where each equivalent set $H'_i$ also solves that given set of xor-genotypes.
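The bit flip degree of freedom can be checked numerically with a short Python sketch. The haplotype set `H`, the phasing `pairs`, and the flipped site set `S` below are hypothetical illustration values, not data from this study.

```python
def xor_genotype(h1, h2):
    """Site-wise XOR of two binary haplotypes."""
    return [a ^ b for a, b in zip(h1, h2)]

def flip(h, sites):
    """Complement haplotype h at the given set of 0-based SNP positions."""
    return [b ^ 1 if l in sites else b for l, b in enumerate(h)]

# Hypothetical solution set H and phasing (individual -> indices into H).
H = [[0, 0, 1], [0, 1, 0], [0, 0, 0]]
pairs = [(0, 2), (1, 2), (0, 1)]
X = [xor_genotype(H[a], H[b]) for a, b in pairs]

# Flip SNPs 0 and 2 in every haplotype of H.
S = {0, 2}
H_flipped = [flip(h, S) for h in H]
X_flipped = [xor_genotype(H_flipped[a], H_flipped[b]) for a, b in pairs]

# Number of nonempty subsets of the L sites, i.e., 2^L - 1.
n_equivalent = 2 ** len(H[0]) - 1
```

Here `X_flipped` equals `X` although `H_flipped` differs from `H`, which is exactly why xor-genotypes alone cannot distinguish between set-equivalent solutions.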

Problem statement

For each SNP $\ell = 1, \ldots, L$, the xor-genotype is given by the XOR-sum of two haplotypes such that

$$x_i(\ell) = h_i^1(\ell) \oplus h_i^2(\ell), \quad \ell = 1, \ldots, L, \tag{4.1}$$

where $x_i(\ell) \in \{0, 1\}$ is the xor-genotype of the i-th individual in SNP $\ell$, and $h_i^j(\ell) \in \{0, 1\}$ is the j-th haplotype of the i-th individual in SNP $\ell$. Let $x_i = [x_i(1) \ldots x_i(L)]^T$ be the xor-genotype of the i-th individual; then (4.1) can be written as

$$x_i = h_i^1 \oplus h_i^2 \tag{4.2}$$

where $h_i^j = [h_i^j(1) \ldots h_i^j(L)]^T$ is the j-th (j = 1, 2) haplotype of the i-th individual consisting of L SNPs. In this representation, we say that the xor-genotype of the i-th individual $x_i$ is phased by the haplotype pair $\{h_i^1, h_i^2\}$.

In regular haplotyping, a putative haplotype $z \in \{0, 1\}^L$ is called compatible with a genotype $g \in \{0, 1, 2\}^L$ if $(g - z) \in \{0, 1\}^L$, and such a haplotype is a possible solution that can explain that genotype. That is, the haplotype pair $\{z, (g - z)\}$ is one of the possible solutions to the genotype $g$. Therefore, for every given genotype $g_i$ it is essential to determine a set $\mathcal{H}_i$ of compatible haplotypes when searching for possible solutions. The union of the sets $\mathcal{H}_1, \ldots, \mathcal{H}_N$ for N individuals forms the matrix $Z \in \{0, 1\}^{L \times M}$, where M is the total number of distinct compatible haplotypes.

In xor-haplotyping, on the other hand, it is trivial to see that any haplotype $z \in \{0, 1\}^L$ is compatible (consistent) with any xor-genotype $x$, i.e., $x = z \oplus z'$, since there always exists a haplotype $z' \in \{0, 1\}^L$ such that $z' = x \oplus z$. Therefore, the set of compatible haplotypes $\mathcal{H}_i$ for a given length-L xor-genotype $x_i$ consists of all possible binary vectors of length L, i.e., $\mathcal{H}_1 = \mathcal{H}_2 = \cdots = \mathcal{H}_N = \{0, 1\}^{L \times 2^L} \triangleq Z$. Because of this compatibility between the xor-genotypes and the candidate haplotypes, an SNP site can always be explained by either of the two alleles, and thus unambiguous SNPs no longer exist. Notice that, in particular, an xor-genotype with all-homozygous SNPs is still ambiguous and needs to be solved up to bit flipping. However, we know that such an xor-genotype is always explained by a pair of identical haplotypes, which correspond to the same column of $Z$.
On the other hand, if there is at least one heterozygous SNP in the xor-genotype, then its phasing haplotypes are not identical and correspond to different columns of Z. The xor-genotype of the i-th individual can be expressed as

$$x_i = (Z v_i)_2 \tag{4.3}$$

where $(\cdot)_2$ represents the component-wise modulo-2 operation, and $v_i \in \{0, 1, 2\}^M$, $\mathbf{1}^T v_i = 2$, is the sparse vector indicating the haplotype locations as the indices of the matrix $Z$ of consistent haplotypes. Notice that the modulo-2 operation in (4.3) is equivalent to the XOR operation between the two haplotypes selected by $v_i$.
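As a small illustration of the indicator-vector representation in (4.3), the following Python sketch builds the matrix of all length-L binary haplotypes and evaluates the component-wise modulo-2 product. The function names are hypothetical, Z is stored as a list of columns, and the column ordering (first SNP varying fastest) is an arbitrary choice made for this sketch.

```python
from itertools import product

def all_haplotypes(L):
    """Columns of Z: every binary vector of length L (first SNP varies fastest)."""
    return [list(reversed(bits)) for bits in product((0, 1), repeat=L)]

def reconstruct(Z_cols, v):
    """Component-wise (Z v) mod 2, with Z given as a list of its columns."""
    L = len(Z_cols[0])
    return [sum(v[m] * Z_cols[m][l] for m in range(len(Z_cols))) % 2
            for l in range(L)]

Z = all_haplotypes(3)           # M = 2**3 = 8 columns

# Two different haplotypes selected once each (entries sum to 2).
v = [0] * 8
v[2] = v[4] = 1                 # columns [0,1,0] and [0,0,1]
x_het = reconstruct(Z, v)       # [0, 1, 1]

# One haplotype selected twice: an all-homozygous xor-genotype.
w = [0] * 8
w[4] = 2
x_hom = reconstruct(Z, w)       # [0, 0, 0]
```

The second case shows why an indicator entry of 2 always yields the all-zero xor-genotype: any haplotype XORed with itself vanishes.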

Given $Z$, finding the indicator vector $v_i$ for an individual is equivalent to inferring its haplotype pair $\{h_i^1, h_i^2\}$. The maximum parsimony principle suggests that a given set of xor-genotypes should be explained by the smallest number of distinct haplotypes. Therefore, given the set of xor-genotypes for N individuals $\{x_i, i = 1, \ldots, N\}$, one needs to infer the haplotype pairs for each individual $\{h_i^1, h_i^2, i = 1, \ldots, N\}$, so that the union of all inferred haplotypes forms the smallest possible set. In other words, the xor-haplotyping problem is to infer $v_i$, $i = 1, \ldots, N$, given $Z$ while selecting as few columns of $Z$ as possible.

4.2.1 Xor Haplotyping by Sparse Dictionary Selection (XHSD)

If an (all-homozygous) xor-genotype is explained by only one haplotype, i.e., $x_i = h^s \oplus h^s$, where the haplotype $h^s$ is the s-th column of $Z$, then the indicator vector multiplies that haplotype by 2, i.e., $v(s) = 2$ and $v(j) = 0$ for $j \in \{1 \ldots 2^L\} \setminus \{s\}$. Otherwise, if the xor-genotype is explained by two different haplotypes, $x_i = h_i^m \oplus h_i^n$, $m \neq n$, then they are indicated by the vector $v$ such that $v(m) = v(n) = 1$ and $v(j) = 0$ for $j \in \{1 \ldots 2^L\} \setminus \{m, n\}$. Hence, we can rewrite (4.3) in the following more compact form

$$x_i = (Z_{\mathcal{A}_i} \tilde{v}_i)_2 \tag{4.4}$$

where $\mathcal{A}_i$ is a set of indices corresponding to the nonzero elements of $v_i$, $Z_{\mathcal{A}_i}$ is the submatrix of $Z$ consisting of the columns indexed by $\mathcal{A}_i$, and $\tilde{v}_i$ contains the nonzero elements of $v_i$.

For each observed xor-genotype $x_i$, the phasing haplotypes are located in the columns of $Z$ indexed by $\mathcal{A}_i$. The union of these column indices, i.e., $\mathcal{D} = \cup_i \mathcal{A}_i$, forms the dictionary of haplotypes that suffices to construct all given xor-genotypes. The maximum parsimony principle then dictates that the dictionary $\mathcal{D}$ should contain the least possible number of elements that can reconstruct all observed xor-genotypes. The set of haplotypes indicated by such a sparse dictionary is given by $H = Z_{\mathcal{D}}$, where $Z_{\mathcal{D}}$ is the submatrix of $Z$ consisting of the columns indexed by $\mathcal{D}$. Then $H$ is a solution set to the maximum parsimony haplotyping problem for the given set of xor-genotypes $\{x_i, i = 1, \ldots, N\}$.

To solve the xor-haplotyping problem, we aim to find the sparse dictionary $\mathcal{D}$ that minimizes the average distance between the observed xor-genotypes and the closest approximations constructed from the haplotypes in $Z_{\mathcal{D}}$. Since there is no prior information about the dictionary $\mathcal{D}$ or the indices $\mathcal{A}_i$ for proper reconstruction of each xor-genotype, determining $\mathcal{D}$ and $\mathcal{A}_i$ leads to a combinatorial problem. This joint optimization problem can be efficiently solved by a greedy method that we explain next.

For an observed xor-genotype the reconstruction accuracy can be interpreted as the Euclidean distance between the observation and its closest approximation in $Z$, i.e.,

$$L_i(\mathcal{A}) = \min_{\tilde{v}_i} \| x_i - (Z_{\mathcal{A}} \tilde{v}_i)_2 \|^2, \tag{4.5}$$

where $\mathcal{A}$ represents the indices of the haplotypes in $Z$ used to approximate $x_i$. Notice that an exact solution will satisfy $L_i(\mathcal{A}) = 0$. For a given dictionary $\mathcal{D}$, the indices $\mathcal{A}_i$ for reconstructing each xor-genotype will be determined by restricting $\mathcal{A}_i$ to be a subset of $\mathcal{D}$ such that

$$\mathcal{A}_i = \arg\min_{\mathcal{A} \subseteq \mathcal{D},\, |\mathcal{A}| \le 2} L_i(\mathcal{A}). \tag{4.6}$$

The individual cost function in (4.5) is then translated into a fitness function associated with a given dictionary $\mathcal{D}$, i.e.,

$$F_i(\mathcal{D}) = \| x_i \|^2 - \min_{\mathcal{A} \subseteq \mathcal{D},\, |\mathcal{A}| \le 2} L_i(\mathcal{A}). \tag{4.7}$$

Finally, the fitness value of $\mathcal{D}$ is averaged over all individuals to measure the overall reconstruction accuracy

$$F(\mathcal{D}) = \frac{1}{N} \sum_{i=1}^{N} F_i(\mathcal{D}).$$

For a given cardinality (sparsity) of $n$, the best dictionary is therefore given by

$$\mathcal{D}_n^* = \arg\max_{|\mathcal{D}| \le n} F(\mathcal{D}), \tag{4.8}$$

and the sparsest dictionary that is sufficient to reconstruct all observed xor-genotypes is determined by

$$\mathcal{D}^* = \min_n \left\{ \mathcal{D}_n^* : F(\mathcal{D}_n^*) = \frac{1}{N} \sum_{i=1}^{N} \| x_i \|^2 \right\}. \tag{4.9}$$

Notice that determining both $\mathcal{D}$ for a given $n$ in (4.8) and $\mathcal{A}$ for a given $\mathcal{D}$ in (4.7) is a combinatorial problem. In [132], it is shown that such combinatorial problems can be approximately solved efficiently by a simple greedy method if the objective function satisfies a fundamental property called submodularity. In [43], it is shown that the dictionary selection problem for (regular) haplotype inference has a cost function that is approximately submodular, and when a greedy method is used to optimize this cost function it can efficiently find an approximate solution with a theoretical guarantee [133].

For xor-haplotype inference, on the other hand, the problem is fundamentally different. That is, the submodularity property may not hold for the cost function in (4.5) due to the XOR operation, and thereby the theoretical guarantee does not hold for the greedy method either. Nonetheless, we still use a greedy heuristic similar to SHSD in [43] in order to maximize the variance reduction metric in (4.5) over the set of observations.

In our algorithm, Xor Haplotyping by Sparse Dictionary Selection (XHSD), we start with an empty dictionary set $\mathcal{D}_0 = \emptyset$. Then at each iteration $\ell$, among the consistent haplotypes that are not already in the dictionary $\mathcal{D}_{\ell-1}$, i.e., in $\{1 \ldots 2^L\} \setminus \mathcal{D}_{\ell-1}$, we iteratively add the haplotype that contributes to the dictionary $\mathcal{D}_{\ell-1}$ with the maximal marginal gain. That is, at iteration $\ell$ the haplotype $h^m \in Z_{\{1 \ldots 2^L\} \setminus \mathcal{D}_{\ell-1}}$ is added to $\mathcal{D}_{\ell-1}$ if it satisfies

$$m = \arg\max_{k \in \{1 \ldots 2^L\} \setminus \mathcal{D}_{\ell-1}} F(\mathcal{D}_{\ell-1} \cup \{k\}). \tag{4.10}$$

Computing (4.10) requires solving (4.5) and (4.6) for each $k$. In (4.6), for each individual $i$, $\mathcal{A}_i$ is found by computing the Euclidean distance (4.5) between $x_i$ and the possible reconstructions given by the pairwise xor-sums of all columns in $Z_{\mathcal{D}}$, and picking the columns that minimize (4.6). Whenever the indices $\mathcal{A}_i$ yield zero in (4.5), we can explain that individual with the corresponding haplotypes in $Z$, i.e., $x_i = (Z_{\mathcal{A}_i} \tilde{v}_i)_2$. The dictionary $\mathcal{D}_\ell$ keeps growing until all xor-genotypes are explained, i.e., $F(\mathcal{D}_\ell) = \frac{1}{N} \sum_{i=1}^{N} F_i(\mathcal{D}_\ell) = \frac{1}{N} \sum_{i=1}^{N} \| x_i \|^2$.

Notice that in the XHSD algorithm the number of compatible haplotypes $|Z|$ increases exponentially in comparison to the regular haplotyping problem with SHSD. However, when regular genotype information is available, we can reduce $Z$ by utilizing it in the cost function (4.5). The necessary modifications are discussed in the next section, XHSD with regular genotypes. Another fundamental difference in xor-haplotyping is that the xor-genotypes do not provide unambiguous genotype information with which one could initialize the dictionary with the corresponding haplotypes and improve the reconstruction accuracy. Nonetheless, with a bias weight, the modified cost function can exploit the available regular genotypes even when they are not unambiguous.

Summary of XHSD algorithm:

• Initialization.

  – $Z = \{0, 1\}^{L \times 2^L}$.

  – $n \leftarrow 1$.

  – $\mathcal{D}_{n-1}^* = \emptyset$.

• Iterate until all xor-genotypes are explained, i.e., $F(\mathcal{D}_n^*) = \frac{1}{N} \sum_{i=1}^{N} \| x_i \|^2$.

  – Perform the greedy search.

    ∗ For $\forall j \in \{1, \ldots, 2^L\} \setminus \mathcal{D}_{n-1}^*$, compute $F(\mathcal{D}_{n-1}^* \cup \{j\})$.

    ∗ Let $j^* = \arg\max_{j \in \{1 \ldots 2^L\} \setminus \mathcal{D}_{n-1}^*} F(\mathcal{D}_{n-1}^* \cup \{j\})$.

    ∗ Set $\mathcal{D}_n^* = \mathcal{D}_{n-1}^* \cup \{j^*\}$.

    ∗ Check if any xor-genotype is explained by the addition of the new element $h^{j^*}$, i.e., if (4.5) is zero. If so, the inferred haplotype pair for the individual with such an xor-genotype is $[h^{j^*}, x_i \oplus h^{j^*}]$.

  – $n \leftarrow n + 1$.

Given the xor-genotypes of a set of individuals, this algorithm finds the haplotype pair of each individual based on the maximum parsimony principle. As an example, consider the following demonstration. Let $x_1$, $x_2$ and $x_3$ be the xor-genotypes of three individuals, each corresponding to three SNPs, i.e.,

$$x_1 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}, \quad x_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \quad x_3 = \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix}.$$

The set of compatible haplotypes for these individuals will consist of all length-3 binary vectors, i.e.,

$$Z = \begin{bmatrix} 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \end{bmatrix}.$$

After initializing $Z$, and starting with the empty dictionary $\mathcal{D}_0$, the algorithm performs the greedy search by adding one haplotype from $Z$ (the one with the maximal marginal gain) at a time.

At iteration $n = 1$, (4.10) calculates $m = \arg\max(0, 0, 0, 0, 0, 0, 0, 0)$ and $m$ is randomly picked as 5 among the equal maximum values; then the corresponding haplotype $Z_5 = [0\,0\,1]^T$ is added to the dictionary, i.e., $\mathcal{D}_1 = [5]$, and $Z_{\mathcal{D}_1} = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$. Similarly, at $n = 2$, $m = \arg\max(0.33, 0, 0.66, 0, 0, 0.33, 0) = 3$ is calculated and the haplotype $Z_3 = [0\,1\,0]^T$ is added to the dictionary, i.e., $\mathcal{D}_2 = [5, 3]$, and $Z_{\mathcal{D}_2} = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \end{bmatrix}$. $x_3$ is explained by the addition of the new haplotype, i.e., $x_3 = [0\,0\,1]^T \oplus [0\,1\,0]^T$, yet the other xor-genotypes are not explained. At $n = 3$, $m = \arg\max(1.33, 0.66, 0.66, 0.66, 0.66, 1.33, 0.66) = 1$ is calculated and the haplotype $Z_1 = [0\,0\,0]^T$ is added to the dictionary, i.e., $\mathcal{D}_3 = [5, 3, 1]$, and $Z_{\mathcal{D}_3} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{bmatrix}$. The other two xor-genotypes are explained by the new addition, i.e., $x_1 = [0\,0\,1]^T \oplus [0\,0\,0]^T$, $x_2 = [0\,1\,0]^T \oplus [0\,0\,0]^T$, and the algorithm converges at $n = 3$ via calculating $F(\mathcal{D}_3) = \frac{1}{3} \sum_{i=1}^{3} F_i(\mathcal{D}_3) = \frac{4}{3} = \frac{1}{3} \sum_{i=1}^{3} \| x_i \|^2$.

This simple example demonstrates how the proposed greedy approach can efficiently construct sparse solutions, where three xor-genotypes are explained by only three haplotypes within three iterations. Nonetheless, the solution set has the ambiguity of being one of the equivalent sets of the true solution due to the bit flip degree of freedom, which should be resolved.
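The greedy search can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in this work: the function names are hypothetical, column indices are 0-based, and ties in the marginal gain are broken by taking the first maximizer rather than a random one, so the order in which haplotypes enter the dictionary (and the intermediate gain values) may differ from the demonstration above.

```python
from itertools import combinations_with_replacement, product

def all_haplotypes(L):
    """Columns of Z: every binary vector of length L (first SNP varies fastest)."""
    return [list(reversed(bits)) for bits in product((0, 1), repeat=L)]

def xor(a, b):
    return [p ^ q for p, q in zip(a, b)]

def fitness_i(x, Z, D):
    """F_i(D) as in (4.7): ||x||^2 minus the best loss over pairs drawn from D."""
    norm = sum(b * b for b in x)
    losses = (sum((p - q) ** 2 for p, q in zip(x, xor(Z[a], Z[b])))
              for a, b in combinations_with_replacement(D, 2))
    return norm - min(losses, default=norm)

def fitness(X, Z, D):
    """F(D): fitness averaged over all individuals."""
    return sum(fitness_i(x, Z, D) for x in X) / len(X)

def xhsd(X):
    """Greedy dictionary growth until every xor-genotype is reconstructed exactly."""
    Z = all_haplotypes(len(X[0]))
    target = sum(sum(b * b for b in x) for x in X) / len(X)
    D = []
    while fitness(X, Z, D) < target - 1e-12:
        rest = [k for k in range(len(Z)) if k not in D]
        # First maximizer of the marginal gain (assumed tie-breaking rule).
        D.append(max(rest, key=lambda k: fitness(X, Z, D + [k])))
    return D, Z

def phase(x, Z, D):
    """Return a pair of dictionary indices whose XOR equals x, or None."""
    for a, b in combinations_with_replacement(D, 2):
        if xor(Z[a], Z[b]) == x:
            return a, b
    return None

X = [[0, 0, 1], [0, 1, 0], [0, 1, 1]]
D, Z = xhsd(X)                          # D == [0, 2, 4]
pairs = [phase(x, Z, D) for x in X]     # [(0, 4), (0, 2), (2, 4)]
```

With this tie-breaking rule the sketch selects the 0-based columns 0, 2 and 4, i.e., the haplotypes [000], [010] and [001], which is the same three-haplotype set as in the demonstration, reached in a different order.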

4.2.1.1 Resolving bit flip degree of freedom

In [122] it is shown that the xor perfect phylogeny problem can be solved up to bit flipping based on the characteristics of the given xor-genotypes. Let $X \in \{0, 1\}^{L \times N}$ be the xor-genotypes matrix of N individuals such that $X = [x_1\; x_2 \ldots x_N]$. Denote $\chi_i$ as the set of heterozygous loci for the i-th individual, i.e., $\chi_i = \{\ell : x_i(\ell) = 1\}$, where $x_i(\ell)$ is the $\ell$-th SNP in $x_i$. If there exists a set of individuals $\mathcal{I} \subseteq \{1, \ldots, N\}$ whose xor-genotypes have empty intersection, i.e., $\cap_{\mathcal{I}} \chi_{\mathcal{I}} = \emptyset$, then with the knowledge of the regular genotypes $G_{\mathcal{I}} \in \{0, 1, 2\}^{L \times |\mathcal{I}|}$ of those individuals one can remove all bit flip degrees of freedom. The empty intersection indicates that every SNP will have a homozygous allele in at least one of the individuals in $\mathcal{I}$, and therefore that SNP can be resolved by revealing the type of allele in the corresponding regular genotype. Following this, a post-processing method is suggested in [122] that can remove the bit flip degree of freedom across the loci where a set of xor-genotypes have empty intersection. By bit flipping on a given solution $H$, one attempts to choose among the set-equivalent solutions $\{H'_i, i = 1, \ldots, 2^L - 1\}$, and this choice is decided by the given regular genotypes (Figure 4.1).

[Figure 4.1 (block diagram): X → MTI → {I}; G and {I} → G_I; X → PPXH → H → Bit flipping (controlled by G_I) → H′.]

Figure 4.1: Ambiguity resolution for PPXH method. Informative regular genotypes $G_{\mathcal{I}}$ are determined by the MTI algorithm, and they are used as control inputs for bit flipping on the initial inference result $H$.

However, this post-processing method has certain limitations. Notice that, for large L, the set-equivalent solutions are highly specific to the inferred $H$, e.g., for a given set of xor-genotypes it is very likely that any two different inferences $H_1$ and $H_2$ that are not set-equivalent have very different set-equivalent solutions. Bit flipping on different inferences likely leads to different results, and thereby the bit flipping accuracy largely depends on the initial inference $H$, which is made without the prior knowledge on homozygous SNPs, i.e., regular genotypes. Besides, utilizing more regular genotypes in post-processing (when available) does not necessarily improve the bit flipping accuracy. Basically, deciding among the appropriate bit flippings for a particular locus requires the knowledge of that homozygous SNP from a regular genotype. Intuitively, revealing a set of homozygous SNPs by employing the least number of regular genotypes, e.g., as provided by the MTI method, will be necessary and sufficient for removing the bit flip degree of freedom across those SNPs. On the other hand, a larger number of regular genotypes will not be any more informative due to possible inconsistencies in the type of homozygous allele for an SNP site across the given regular genotypes.

Furthermore, notice that flipping the bits on some loci across all the haplotypes in $H$ does not affect the parsimony of the solution. The final solution $H'$ will have the same parsimony as $H$ regardless of the set of loci that are flipped. From the maximum parsimony point of view, refining an xor-haplotyping solution via the bit flipping method does not necessarily lead to the global optimum unless the initial inference is set-equivalent to the globally optimal solution.

Therefore, instead of using regular genotypes to post-process a solution, a more intuitive way could be to aim at resolving the bit flip degree of freedom while constructing the solution. In particular, in the XHSD framework, the regular genotypes can be used as constraints when solving the homozygous sites of an xor-genotype. In this sense, given a set of individuals' xor-genotypes, we determine the individuals that have the most informative regular genotypes and pre-process the data set by replacing the xor-genotypes of those individuals with their regular genotypes. The MTI algorithm [122] is useful for finding the least number of such individuals that will be adequate to reveal the homozygous alleles for each of the L SNPs. In the proposed XHSD framework, we employ the MTI method to find which individuals should be replaced with regular genotypes, and after replacing them the new data set is presented to the XHSD algorithm (Figure 4.2).

[Figure 4.2 (block diagram): X → MTI → {I}; G and {I} → G_I; X and G_I → Augment data set → {X, G_I} → XHSD / XHAP → H′.]

Figure 4.2: Ambiguity resolution in XHSD or in XOR-HAPLOGEN (XHAP). Informative regular genotypes $G_{\mathcal{I}}$ are determined by the MTI algorithm, and they are used as inputs to augment the initial data set by replacing xor-genotypes of the individuals $\mathcal{I} \subseteq \{1, \ldots, N\}$.

In most cases the xor-genotypes in $X$ have an empty intersection, and for each run MTI outputs 2 or 3 individuals, i.e., $|\mathcal{I}| \leq 3$; then $G_{\mathcal{I}}$ has at most 3 regular genotypes. One can obtain a larger $G_{\mathcal{I}}$ by performing multiple runs of MTI with $X$ and collecting the distinct regular genotypes given by each run. Next we explain the necessary modifications to the XHSD algorithm for utilizing the regular genotypes.

4.2.1.2 XHSD with regular genotypes

The information provided by regular genotypes is used to reveal the type of allele in homozygous sites of an individual so that we can improve the reconstruction accuracy in (4.5), and build the dictionary $\mathcal{D}$ with more reliable haplotypes. That is, when a regular genotype $g_i$ is observed in the i-th individual we employ the variance reduction metric that is given for regular genotypes such that

$$L_i(\mathcal{A}) = \min_{\tilde{v}_i} \| g_i - Z_{\mathcal{A}} \tilde{v}_i \|^2 , \tag{4.11}$$

where $Z$ is the set of haplotypes that are compatible with the i-th individual's genotype $g_i$, and $\mathcal{A}$ contains the indices of the haplotypes in $Z$ that are used to approximate $g_i$. In this representation the approximation accuracy is potentially higher when compared to the xor-genotypes, since the homozygous SNPs in $g_i$ are unambiguous. The haplotypes that are used to approximate those SNPs will be more reliable candidates when building the dictionary $\mathcal{D}$.

To exploit this fact, we can introduce a weight $b_i$ in the cost function $L_i(\mathcal{A})$ so that the algorithm gives a higher priority to the variance reduction of those individuals that are given by regular genotypes (since we can tolerate more error in the reconstruction of xor-genotypes), and the dictionary will more likely grow with the haplotypes that are compatible with the given regular genotypes. The biased variance reduction metric for each individual is then given by

$$L_i(\mathcal{A}) = \begin{cases} b_i \min_{\tilde{v}_i} \| g_i - Z_{\mathcal{A}} \tilde{v}_i \|^2 , & \text{given } g_i \\ \min_{\tilde{v}_i} \| x_i - (Z_{\mathcal{A}} \tilde{v}_i)_2 \|^2 , & \text{given } x_i . \end{cases} \tag{4.12}$$

The weight parameter $b_i$ could be set proportional to the average rate of homozygous SNPs per genotype, assuming that the more homozygous sites a regular genotype contains, the more informative it will be. We experimentally set $b_i = 4$, as it yielded good performance with both synthetic and real databases.
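The biased cost in (4.12) can be illustrated with a minimal Python sketch. It rests on assumptions that are not stated in this section: the vector $\tilde{v}_i$ is taken to select a pair of haplotypes (possibly the same one twice) from the columns of $Z_{\mathcal{A}}$, the minimization is done by brute force, and the function name `individual_cost` is hypothetical. The actual XHSD search may be organized differently.

```python
from itertools import combinations_with_replacement

import numpy as np


def individual_cost(obs, Z_A, is_regular, b=4.0):
    """Brute-force evaluation of the biased cost (4.12) for one individual.

    obs        : length-L vector; a regular genotype in {0, 1, 2} or an
                 xor-genotype in {0, 1}
    Z_A        : L x K binary matrix whose columns are the selected haplotypes
    is_regular : True if obs is a regular genotype, False for an xor-genotype
    b          : weight b_i for regular genotypes (the text reports b_i = 4)

    Assumption: the approximation is the sum of two selected haplotypes,
    reduced modulo 2 in the xor case.
    """
    best = np.inf
    for a, c in combinations_with_replacement(range(Z_A.shape[1]), 2):
        pair_sum = Z_A[:, a] + Z_A[:, c]
        approx = pair_sum if is_regular else pair_sum % 2
        best = min(best, float(np.sum((obs - approx) ** 2)))
    # Regular genotypes are up-weighted; xor-genotypes keep unit weight.
    return b * best if is_regular else best
```

With `b > 1`, a reconstruction error on a regular genotype costs more than the same error on an xor-genotype, which is the prioritization described above.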

4.2.2 Extensions

4.2.2.1 Long xor-genotypes

Note that the size of $Z$ grows exponentially with the length $L$ due to the compatibility between haplotypes and xor-genotypes. That is, finding the solution of a length-$L$ xor-genotype requires performing the greedy search over $Z$, which consists of $2^L$ haplotypes. To mitigate the computational complexity we employ the partition-ligation method [134] as in [43], where the block partitioning is based on identifying the recombination hot spots [135] existing between the haplotype blocks [136]. After partitioning, the contiguous SNP sequences will be divided into blocks where within each block the haplotype diversity will be minimized. The haplotype diversity of a given block is measured by its Shannon entropy.

The block partitioning by minimizing the total Shannon entropy proceeds as follows. Let $\{\tilde{h}^{lm}_1, \ldots, \tilde{h}^{lm}_{\tilde{K}_{lm}}\}$ be the $\tilde{K}_{lm}$ haplotypes that explain all the xor-genotypes $x_i^{lm}$, $i = 1, \ldots, N$, in the block that starts at locus $l$ and ends at locus $m$, i.e., $1 \leq l \leq m \leq L$, and let $\tilde{f}^{lm} = [\tilde{f}^{lm}_1, \ldots, \tilde{f}^{lm}_{\tilde{K}_{lm}}]$ be the haplotype frequency vector for this block. Each $\tilde{f}^{lm}_k$, $k = 1, \ldots, \tilde{K}_{lm}$, is represented by the density of the nonzero values of the indicator vectors $\{v^{lm}_1, \ldots, v^{lm}_N\}$ for the given block $x_i^{lm}$, i.e.,

$$\tilde{f}^{lm} \triangleq \frac{1}{2N} \sum_{n=1}^{N} v_n^{lm} .$$

The entropy of the haplotype block $\tilde{h}_k^{lm}$ is then given by

$$E(l, m) = - \sum_{k=1}^{\tilde{K}_{lm}} \tilde{f}_k^{lm} \log \tilde{f}_k^{lm} ,$$

and the total entropy of $Q$ blocks, where each block $[l_q : m_q]$, $q = 1, \ldots, Q$, has an upper bound of length $W$, i.e., $m_q - l_q + 1 \leq W$, is given by

$$\mathcal{E} = \sum_{q=1}^{Q} E(l_q, m_q) .$$

To determine the initial and ending loci of each block $[l_q : m_q]$, $q = 1, \ldots, Q$, that minimize $\mathcal{E}$ we use the recursive method explained in [43], i.e., for each ending locus $1 \leq m \leq L$ we determine the block $[l_m^* : m]$, with $m - l_m^* + 1 \leq W$, that contributes the lowest entropy, and then backtrack the best initial points $l_m^*$ for each consecutive block by starting with the block $[l_L^*, L]$.
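The recursion and backtracking described above can be sketched as a small dynamic program. This is only an illustration under stated assumptions: the block entropy is computed from the empirical frequencies of the distinct row patterns of a hypothetical haplotype matrix `H`, as a stand-in for the frequencies obtained from the indicator vectors, and the function names are not taken from [43].

```python
from collections import Counter

import numpy as np


def block_entropy(H, l, m):
    """Shannon entropy of the distinct row patterns of H in columns l..m
    (0-based, inclusive). Stand-in for E(l, m)."""
    counts = Counter(tuple(row) for row in H[:, l:m + 1])
    freqs = np.array(list(counts.values()), dtype=float) / H.shape[0]
    return float(-np.sum(freqs * np.log(freqs)))


def partition_blocks(H, W):
    """Split columns 0..L-1 into consecutive blocks of length <= W so that
    the sum of block entropies is minimal."""
    L = H.shape[1]
    best = [0.0] + [np.inf] * L   # best[m]: minimal total entropy of columns 0..m-1
    start = [0] * (L + 1)         # start[m]: first column of the block ending at m-1
    for m in range(1, L + 1):
        for l in range(max(0, m - W), m):
            cost = best[l] + block_entropy(H, l, m - 1)
            if cost < best[m]:
                best[m], start[m] = cost, l
    blocks, m = [], L
    while m > 0:                  # backtrack from the last locus
        blocks.append((start[m], m - 1))
        m = start[m]
    return blocks[::-1], best[L]
```

The returned blocks are 0-based, inclusive column ranges; the second return value is the minimized total entropy.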

4.2.2.2 Missing data

Genotyping errors (outliers) often occur when the observed genotype of an individual differs from the original sequence for various reasons [137, 138]. A particular type of genotyping error is the case when some loci are not observed or are missed during sequencing or other application processes. Although experimental methods dealing with some types of errors have been proposed, erroneous genotypes are often produced with significant missing/error rates [139]. Therefore, it is of high importance for an xor-haplotyping technique to be able to resolve such databases with missing sites. We next present a modification to XHSD in order to perform xor-haplotyping for sequences subject to missing data.

Let g˜i be the incomplete genotype of the i-th individual where the loci with missing information in gi are removed. Similarly, let x˜i represent the xor-genotype of the i-th individual where the missing loci are removed. As the rate of missing loci increases the sequences become less informative. Following the suggestion in [43], we introduce another weight w(.) to give less weight to the less informative individuals when evaluating (4.12) in order to improve the reliability of haplotype inference, i.e.,

$$L_i(\mathcal{A}) = \begin{cases} w(\tilde{g}_i) \, b_i \min_{\tilde{v}_i} \| \tilde{g}_i - \tilde{Z}^i_{\mathcal{A}} \tilde{v}_i \|^2 , & \text{given } \tilde{g}_i \\ w(\tilde{x}_i) \min_{\tilde{v}_i} \| \tilde{x}_i - (\tilde{Z}^i_{\mathcal{A}} \tilde{v}_i)_2 \|^2 , & \text{given } \tilde{x}_i , \end{cases} \tag{4.13}$$

where $\tilde{Z}^i_{\mathcal{A}}$ is the matrix $Z_{\mathcal{A}}$ with the rows corresponding to the missing loci of the i-th individual removed. The weight is selected as a nondecreasing function of the total information content in the sequence such that

$$w(\tilde{x}_i) = \dim(\tilde{x}_i)^2 , \tag{4.14}$$

where $\dim(\tilde{x}_i)$ gives the dimension of $\tilde{x}_i$. Different weight functions could be employed to exploit the distribution of missing sites. Since, in our experiments, the missing sites are uniformly distributed across the SNPs and individuals, the function in (4.14) gave a good performance.

The proposed method does not account for the direct inference of the missing sites, i.e., imputing missing genotypes [140]. However, the missing values in each xor-genotype can be recovered from the solution by simply looking at the haplotype pairs which are specifically inferred for each individual. Since the proposed method has robust performance against missing data, as presented in the next section, the inferred solution will be sufficient to type missing genotype sites.
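A minimal sketch of the missing-data cost in (4.13) with the weight in (4.14) is given below. Several details are assumptions made for illustration only: missing loci are encoded as `np.nan`, the same squared-dimension weight is applied to regular genotypes and xor-genotypes, the minimization is a brute-force search over pairs of selected haplotypes, and the function name is hypothetical.

```python
from itertools import combinations_with_replacement

import numpy as np


def weighted_cost_with_missing(obs, Z_A, is_regular, b=4.0):
    """Sketch of (4.13)-(4.14) for one individual.

    obs : float vector of length L; missing loci are np.nan (an assumption)
    Z_A : L x K binary matrix whose columns are the selected haplotypes
    """
    observed = ~np.isnan(obs)
    obs_t = obs[observed]        # sequence with the missing loci removed
    Z_t = Z_A[observed, :]       # rows of the missing loci removed
    weight = obs_t.size ** 2     # w = dim^2, as in (4.14)
    best = np.inf
    for a, c in combinations_with_replacement(range(Z_t.shape[1]), 2):
        pair_sum = Z_t[:, a] + Z_t[:, c]
        approx = pair_sum if is_regular else pair_sum % 2
        best = min(best, float(np.sum((obs_t - approx) ** 2)))
    return weight * (b if is_regular else 1.0) * best
```

Because the weight grows with the number of observed loci, an individual with many missing sites contributes less to the total cost than a fully observed one with the same residual.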

4.3 Results and discussion

We tested the performance of several xor-haplotyping methods with a number of metrics.

First we measured the probability of error (Pe), i.e., the percentage of individuals whose inferred pair of haplotypes differs from the original pair. This measure is sensible for assessing the inference quality in the regular haplotyping problem, since the alleles corresponding to homozygous loci are known and only the heterozygous loci are ambiguous; performance therefore depends on the inference accuracy on heterozygous loci. In xor-haplotyping, however, there are a large number of solutions equivalent to the original one up to bit flipping, and it is very likely that a solution set differs from the original phasing on at least one SNP. In particular, for a given xor-genotype, even a single SNP difference (namely a bit flip) between the pair of inferred haplotypes and the pair of haplotypes that originally gave rise to that xor-genotype is counted as a mis-inference. A more sensible metric, therefore, would take into account the percentage of such SNPs where the inference differs from the true phasing. In that sense, the switch error rate (swr) [141] is a proper metric that counts the minimum number of switches required for heterozygous loci to change to the correct alleles of the original haplotypes. It gives a sense of how closely the inference was made, i.e., as a ratio of the total mis-inferred heterozygous loci $mis_i^{het}$ in all individuals $i = 1, \ldots, N$ to the worst-case number of switches (half of the number of heterozygous loci in each individual, $\chi_i / 2$), i.e.,

$$swr = \frac{\sum_{i=1}^{N} mis_i^{het}}{\sum_{i=1}^{N} \frac{\chi_i}{2}} .$$

Moreover, to assess the accuracy on homozygous sites, we employ the prediction error rate (errp) [130], computed as the fraction of incorrectly predicted hidden-homozygous sites out of all hidden-homozygous sites, i.e.,

$$err_p = \frac{\sum_{i=1}^{N} mis_i^{hom}}{\sum_{i=1}^{N} (L - \chi_i)} .$$

We performed xor-haplotyping on various data sets, with and without missing information on loci: synthetic data sets with different recombination rates simulated by a coalescence-based program of [142], a database consisting of the SNPs in the CFTR gene that is associated with cystic fibrosis (CF) disorder [14], and another database (ANRIL) containing the SNPs that have relatively lower linkage disequilibrium (high polymorphism). We tested different xor-haplotyping methods that are based on different assumptions, including the parsimony graph realization model PPXH [131], the parsimony genetic search model XOR-HAPLOGEN (XHAP) [130], the graph representation model GREAL [122], and an integer programming approach Poly-IP [143]. Among the four methods the last two were ineffective for practical reasons. GREAL failed at finding solutions for data sets with reasonably long sequences (more than 30 SNPs), and the Poly-IP method is often computationally inefficient even when solving a simple problem (e.g., it takes more than 24 hours to solve a set of 50 individuals with 30 SNPs).
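The two accuracy metrics defined above, swr and errp, can be sketched as follows. This is one reading of the definitions rather than the evaluation code used in the experiments: it assumes that each inferred pair is consistent with the individual's xor-genotype, and it counts the mis-inferred heterozygous loci as the smaller of the two Hamming distances obtained from the two possible labelings of the unordered pair. The function name `error_rates` is hypothetical.

```python
import numpy as np


def error_rates(true_pairs, inferred_pairs):
    """Return (swr, err_p) for lists of (h1, h2) binary haplotype pairs,
    one pair per individual."""
    mis_het = mis_hom = het_total = hom_total = 0
    for (t1, t2), (p1, p2) in zip(true_pairs, inferred_pairs):
        t1, t2, p1, p2 = map(np.asarray, (t1, t2, p1, p2))
        het = t1 != t2                    # truly heterozygous loci
        chi = int(het.sum())
        # The pair is unordered, so take the better of the two labelings;
        # this is at most chi / 2 when the pair matches the xor-genotype.
        d = int(np.sum(p1[het] != t1[het]))
        mis_het += min(d, chi - d)
        # Hidden-homozygous loci: both inferred alleles must match the truth.
        mis_hom += int(np.sum((p1[~het] != t1[~het]) | (p2[~het] != t1[~het])))
        het_total += chi
        hom_total += t1.size - chi
    swr = mis_het / (het_total / 2) if het_total else 0.0
    err_p = mis_hom / hom_total if hom_total else 0.0
    return swr, err_p
```

Swapping the order of the two haplotypes in an inferred pair leaves both rates unchanged, which matches the unordered nature of a phasing.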

4.3.1 Synthetic data

Based on different recombination rates, three different scenarios are considered in the synthetic data sets: no recombination (r = 0), and recombination with rates r = 4 and r = 40, respectively. The recombination rate is the rate at which the haplotypes of an individual exchange sequence fragments due to events such as crossing-over. This is simulated by a model given in Hudson's software [142]. For each scenario we generated 100 different data sets by random pairing of a set of simulated haplotypes of different lengths (5 ≤ L ≤ 46) for a given population size. This is repeated for different population sizes as well, N ∈ {10, 20, 30, 40, 50}.

In Figure 4.3, the performances of different methods on short data sets (L < 14) are displayed, based only on xor-genotypes. The quality of inference is determined exhaustively after removing all bit-flip degrees of freedom by looking for the best equivalent set of a particular inference, i.e., performing an exhaustive search to find the bit flipping that gives a result closest to the true phasing of the xor-genotypes. Such an evaluation shows the best inference performance of the different methods without the help of regular genotypes. Compared to the other methods, XHSD can potentially resolve a set of xor-genotypes with comparably low error rates. Moreover, XHSD achieves the lowest switch error rates, especially for large data sets, indicating a better accuracy (i.e., similarity with the true haplotypes) for the initial inference given only the xor-genotypes.

Figure 4.3: Potential inference quality on short (L < 14) synthetic data.

To evaluate the inference quality when regular genotype data are available, we first determined only a limited number of regular genotypes by the MTI method, i.e., the smallest set of regular genotypes that have an empty intersection on the heterozygous SNPs, and then resolved the ambiguity by bit flipping on the initial inference according to these regular genotypes (Figure 4.4). This test evaluates how the methods deal with the bit-flip degree of freedom under very limited regular genotype data that, in theory, suffice to resolve all SNPs. Given the long xor-genotype data sets (5 ≤ L ≤ 46), block partitioning is applied in XHSD by limiting the maximum block size to W = 8 SNPs. From Figure 4.4, we can say that XHSD has the best potential to make an inference with high accuracy when the regular genotypes are introduced. We also applied the proposed XHSD framework represented in Figure 4.2 to the same data set, where 2 xor-genotypes are replaced with the regular genotypes.

Note that the proposed XHSD achieves a significant decrease in Pe rates despite the small augmentation of data by only 2 regular genotypes, compared to using them in the post-processing, i.e., XHSD (bit flipping).

Figure 4.4: Performance on long (5 ≤ L ≤ 46) synthetic data by bit flipping via 2 regular genotypes.

It is worth noting that algorithms based on segmentation may deteriorate when processing long xor-genotype sequences, especially with increasing recombination rates, where the detection of haplotype blocks is complicated [144]. We used block partitioning (segmentation) in XHSD to reduce complexity when processing long xor-genotype sequences. In Figure 4.4 the segmentation effect is noticeable particularly at very high recombination rates, i.e., r = 40. However, in the general scenario, i.e., r ≤ 4, the segmentation effect is not significant for the proposed method's performance, and it outperforms XOR-HAPLOGEN in most data sets containing typical recombination rates.

For more practical results we added regular genotypes in each method with different percentages of the population and allowed the methods to remove ambiguity on their own, except for PPXH. Since PPXH cannot make use of regular genotypes directly, we applied bit flipping using the MTI solver to remove ambiguity for this method. To regularly genotype a given percentage of the population, the regular genotypes are determined by running the MTI method several times until the number of distinct regular genotypes obtained reaches the given percentage of the total number of individuals.

Figure 4.5 shows performances on the synthetic data of a large population of 50 individuals with zero recombination rate, where the fraction of the population given by regular genotypes ranges from 10% (5 individuals) to 100% (50 individuals). XHSD outperforms the other methods in almost all cases. Particularly after 20% of the population is given by regular genotypes, XHSD can immediately utilize regular genotypes and significantly improve the accuracy on both homozygous (errp) and heterozygous sites (swr). We can conclude that the parsimony principle of the XHSD method is well-suited for inferring the heterozygous sites, and for predicting the homozygous sites it usually suffices to have a small percentage of regular genotypes.

4.3.2 Missing data

We investigated the capability of the various methods to deal with missing data under different circumstances. Since the methods performed similarly under zero recombination rate, we used the same data sets with no recombination to generate the database with missing entries. A SNP site of an individual is defined as "missing" with a probability of Pmiss, and the data sets for different percentages of missing SNPs are generated accordingly. The PPXH method is excluded since it cannot handle missing data. In XHSD the block partitioning is applied as before with a maximum block size of W = 8 SNPs.

Figure 4.5: Performance on long (5 ≤ L ≤ 46) synthetic data from 50 individuals by employing different numbers of regular genotypes.

Figures 4.6 and 4.7 show the performances in different scenarios of partial regular genotyping under different rates of missing data. As in the previous plots, each point represents the average value of the corresponding metric over 100 realizations, i.e., 100 different sets with SNP lengths varying between 5 and 46. In most cases, XOR-HAPLOGEN and XHSD are insensitive to the increased number of missing sites. XOR-HAPLOGEN is more accurate for small groups of individuals. Nonetheless, when more individuals are available in the database (N > 30), XHSD displays a better performance in all circumstances.

We examined the dependency of the methods on the missing data rate for a population with a large number of individuals. That is, we used the xor-genotypes from 50 individuals, replaced 30% and 50% of the population with regular genotypes, and performed xor-haplotype inference under different missing data rates ranging from 0.5% to 5%. As seen in Figure 4.8, both methods are robust against missing data. On the other hand, XHSD is less dependent on regular genotypes and it can achieve better error rates than XOR-HAPLOGEN while employing fewer regular genotypes. XOR-HAPLOGEN needs approximately 20% more regular genotypes to reach the same Pe level as XHSD, e.g., regular genotyping by 30% in XHSD is comparable to that of 50% in XOR-HAPLOGEN.

Figure 4.6: Performance under low rates of missing data, long (5 ≤ L ≤ 46) synthetic data.

Figure 4.7: Performance under high rates of missing data, long (5 ≤ L ≤ 46) synthetic data.

Figure 4.8: Performance under different percentages of missing data.

4.3.3 CFTR gene database

Cystic fibrosis (CF) is an autosomal recessive disorder caused by mutations in the gene that encodes the cystic fibrosis transmembrane conductance regulator protein (CFTR). In [14], various mutations on 23 polymorphic locations on chromosome 7 are detected as the disease loci for CF. We used this database, corresponding to 29 distinct haplotypes, to generate random xor-genotypes. By combining the haplotype pairs at random we generated the xor-genotypes for a given number of individuals N, and repeated the process for different population sizes, i.e., N ∈ {100, 200, 300, 400}. In this database, the data sets with a small number of individuals present high haplotype diversities, i.e., many of the distinct haplotypes are used only once in the generation of individuals. Therefore, the larger data sets that have low haplotype diversities are expected to be solved with higher accuracy by biologically-oriented methods, such as XOR-HAPLOGEN, which obtains its inference according to a multi-locus linkage disequilibrium (LD)-based block identification model.

We tested the performance of each method on this database with and without missing sites, i.e., Pmiss ∈ {0, 5%}. The PPXH method was excluded from the missing data analysis since it cannot deal with missing data. XHSD is applied with block partitioning and a maximum block length of W = 8 SNPs as before. It is seen in Figure 4.9 that XHSD outperforms the other methods for various population sizes, with significantly lower error rates. As the xor-genotypes are taken from more individuals, the inference accuracy immediately improves in XHSD and XOR-HAPLOGEN, whereas PPXH does not benefit from the additional data.

Figure 4.9: Performance on CFTR gene database with different population sizes with/without missing data.

Figure 4.10 shows the average running times of each method performing on this database. It is observed that XHSD has similar computational complexity as the size of data set grows, and it shows comparable running times with XOR-HAPLOGEN. Although PPXH performs significantly faster, it cannot mitigate the high error rates and is not able to provide accurate inferences.

4.3.4 Typing errors

Figure 4.10: Running times on CFTR gene database with different population sizes with/without missing data.

Combinatorial optimization techniques are known for their sensitivity to genotyping errors [145]. We therefore tested the effect of typing errors on the proposed algorithm using the CFTR gene database. We defined a SNP site of an individual as erroneous with a probability of Perr, and typed the site as either homozygous or heterozygous with equal probabilities. We then ran the algorithms without providing the knowledge of the erroneous site positions. We excluded the PPXH method due to its low performance on the CFTR database. Figure 4.11 illustrates the algorithms' performance on typing errors with Perr = 2%. It is seen that XOR-HAPLOGEN is more robust against typing errors because of its statistical nature. Nonetheless, the proposed XHSD algorithm can deal with erroneous data containing about 2% typing errors, with a small increase in the error rates compared to the results without typing errors.
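The data simulation described for the CFTR experiments (random pairing of haplotypes, followed by random typing errors) can be sketched as below. The function names and the use of NumPy's `Generator` are illustrative assumptions; the sketch pairs haplotypes with replacement and treats an erroneous xor-genotype site as a fresh call that is 0 (homozygous) or 1 (heterozygous) with equal probability.

```python
import numpy as np


def make_xor_genotypes(haplotypes, n_individuals, rng):
    """Pair rows of a binary haplotype matrix at random (with replacement)
    and return the xor-genotype of each pair."""
    idx = rng.integers(0, len(haplotypes), size=(n_individuals, 2))
    return haplotypes[idx[:, 0]] ^ haplotypes[idx[:, 1]]


def add_typing_errors(X, p_err, rng):
    """Mark each site as erroneous with probability p_err and re-type it as
    homozygous (0) or heterozygous (1) with equal probability."""
    erroneous = rng.random(X.shape) < p_err
    random_calls = rng.integers(0, 2, size=X.shape)
    return np.where(erroneous, random_calls, X), erroneous
```

The returned mask records which sites were re-typed; it would be withheld from the haplotyping methods, as in the experiment above, and used only for evaluation.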

4.3.5 ANRIL database

The performance of haplotyping methods can deteriorate on databases with decreasing linkage disequilibrium (LD) rates. A SNP database with low pairwise-LD scores is investigated in an association study given in [146] for susceptibility to certain types of leukemia. This database includes 16 SNPs from chromosome 9p21 associated with several diseases and a SNP locus encoding the anti-sense non-coding RNA in the INK4 locus (ANRIL) [147]. We used the corresponding haplotype data from the HAPMAP database (http://hapmap.ncbi.nlm.nih.gov/) collected from 90 European individuals. We generated the xor-genotypes for the individuals by using their haplotype pairs and tested the algorithms on this database. It is seen from Figure 4.12 that the algorithms deteriorate when inferring the haplotypes with low-LD SNPs. XHSD shows very similar performance to XOR-HAPLOGEN, and both methods outperform PPXH on this database.

Figure 4.11: Performance on CFTR gene database for different population sizes, with Perr = 2%, with/without missing data.

Figure 4.12: Performance on ANRIL gene database with different population sizes with/without missing data.

Notice that the algorithms cannot mitigate the error rates with an increasing number of individuals. This can be explained by the occurrence of very high haplotype diversity in the corresponding low-LD SNP regions. The number of distinct haplotypes explaining the given number of individuals presumably remains at high diversity as the number of individuals grows, whereas the methods based on the maximum parsimony principle fail to incorporate this fact. They tend to find parsimonious (low-diversity) solutions in all population sizes, with a decreasing ratio (ρ) of "total number of distinct haplotypes explaining the given set of individuals" to "total number of given individuals" as the population size grows. It is worth noting that, in the XHSD results in Figure 4.12 (Pmiss = 0), this ratio decreases as ρ = [1.3, 0.95, 0.83, 0.72, 0.66] for the populations with 10, 20, 30, 40, 50 individuals, whereas the same ratio for the true phasing (ground truth data) is in fact much higher, i.e., ρ = [1.7, 1.48, 1.34, 1.27, 1.24], respectively, thereby causing the parsimony-based haplotyping methods to deteriorate on this database. On the other hand, in the high-LD CFTR database, the same ratio for the true phasing is very low due to low haplotype diversity, i.e., ρ = [0.29, 0.14, 0.1, 0.07] for the populations with 100, 200, 300, 400 individuals, and the XHSD method is good at achieving very similar rates, i.e., ρ = [0.43, 0.15, 0.1, 0.07], respectively.
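The diversity ratio ρ used in the comparison above is straightforward to compute from any phasing. The following sketch assumes that a phasing is given as a list of haplotype pairs, one pair per individual; the function name `diversity_ratio` is hypothetical.

```python
def diversity_ratio(haplotype_pairs):
    """rho = (number of distinct haplotypes used) / (number of individuals).

    haplotype_pairs: list of (h1, h2), each haplotype a sequence of 0/1 values.
    """
    distinct = {tuple(h) for pair in haplotype_pairs for h in pair}
    return len(distinct) / len(haplotype_pairs)
```

Since each individual carries two haplotypes, ρ can range from 1/N (a single shared haplotype) up to 2 (no haplotype is reused), which is why values above 1 in the ANRIL ground truth indicate very high diversity.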

Chapter 5

Discovering genome-wide tag SNPs based on the mutual information of the variants

5.1 Introduction

The basic unit of genetic variation is the single nucleotide polymorphism (SNP), which refers to single base-pair changes in the DNA sequence of an individual's chromosome [148, 149]. SNPs are located in various sites of a pair of near-identical chromosomes. Most experimental techniques can determine an unordered pair of allele readings for each SNP site to build an individual's genotype sequence. Given a population of individuals, a high degree of correlation is observed among the nearby allelic variations (linkage disequilibrium, LD), whereby most of the SNP sites convey redundant information and may be omitted for cost-effectiveness during genotyping [150]. For this reason, many studies have aimed to find such "representative" SNPs (tag SNPs) that can provide sufficient information about their nearby variants that are not genotyped. More formally, given the genotype sequences consisting of N SNPs obtained from a population of P individuals, the number of SNPs that capture the genomic diversity (haplotype diversity) in that population may be greatly reduced to a subset of representative SNPs, where each such SNP will represent a cluster of redundant variants. This may be described as a clustering problem where N SNPs are divided into a number of distinct clusters according to some measure of similarity between the observations $s_i$, $i = 1, \ldots, N$, taken over P samples, i.e., $s_i = [s_i(1), \ldots, s_i(P)]$. Each cluster is characterized by the similarity (redundancy) between its observation vectors $s_i$, and a centroid vector (tag SNP) will be the "representative" of that cluster (Figure 5.1).

The SNP tagging approaches can be categorized into block-based and block-free methods. The block-based methods exploit prior information about haplotype block structures [21], and identify the optimal subset of SNPs (i.e., tag SNPs, also commonly referred to as haplotype tagging SNPs (htSNPs)) in order to capture most of the haplotype diversity in a given block [126, 151]. However, block-based methods may suffer from inaccuracies caused by the block partitioning results [152]. A potentially better alternative may be (block-free) genome-wide methods. In genome-wide approaches two strategies are generally used for selecting representative SNPs, i.e., haplotype reconstruction-based methods and LD-based methods. The former involves a series of post-analyses (wrapper methods) for refining an initial inference through improving the haplotype reconstruction accuracy. Given a particular solution, the representative (tagged) SNPs are considered as informative sites, and the allelic information on non-tagged sites (i.e., haplotypes) is predicted by a machine learning algorithm. In this methodology, the accurate prediction of haplotypes indicates that the given tag SNPs contain enough information about other SNPs and are sufficient for genotyping [153]. An optimization procedure (wrapper method) is run by employing the given informative SNPs until a desired accuracy level is obtained. Various strategies can be used for the initial selection of informative SNPs, e.g., regression-based [153], correlation-based [154, 155], etc. One limitation of wrapper-based methods, though, is the reconstruction scheme, which entails considerable computational complexity and can be impractical for high-throughput data. On the other hand, LD-based methods aim to identify regions of SNPs with high linkage disequilibrium through the discovery of recombination hotspots [21, 156].
In genetic studies it has been observed that there is a block-like structure between two adjacent hotspots where limited (or no) recombination events occur, and the SNPs within the block are often inherited together (i.e., linked), carrying redundant information [157, 158]. In other words, such a block possesses very low haplotype diversity across the population and the information carried by the respective SNPs becomes highly redundant [126], suggesting that some subset of the SNPs can be sufficient to represent the diversity of the haplotype patterns observed in this block [151]. In a large genomic region the set of linked SNPs (blocks) can be identified through estimating the structure of haplotype blocks, e.g., by maximizing some measure of correlation (LD) among the variants [159, 160]. Then the SNPs representing each block can be picked in numerous ways (e.g., greedy forward-selection algorithm [161], LD-based or wrapper optimization based selection algorithms [162], etc.).

cluster A cluster B cluster C

0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 0 0 1 0 1 1 1
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 0 0 0 0 1 1 1

Figure 5.1: 3 tag SNPs (red) are selected by clustering the 12 SNPs according to genotypes from 6 individuals (rows).

SNP tagging approaches are conventionally cast as a two-phased optimization problem, i.e., inferring highly-linked SNP sets (or haplotype blocks), and selecting representative SNPs based on the inference result. Under such a framework various algorithms have been proposed, including a forward/backward selection based greedy algorithm [161], a greedy pair-wise selection/prioritization algorithm [163], a mutation/survival based genetic algorithm [164], and a clustering/elimination based greedy algorithm [162]. In this chapter, we introduce a block-free solution that can jointly estimate the putatively-linked SNP sets and their tag SNPs, through exploiting a measure of mutual information estimated among the SNPs and the sets of linked SNPs. Information theoretic approaches have been used in earlier studies as a base for quantifying haplotype diversity and for SNP selection by maximally retained information content [165–167].

In this chapter, the mutual information is used to measure the degree of association between the individual variants and the sets of linked variants. For this, we employed the computational algorithm in [168] presented for the analysis of gene expression data sets, which is an unsupervised iterative approach that converges to the core ("heart") of coexpression of a given arbitrary gene in the data. When applied to the SNP tagging problem, the method discovers distinct patterns of mutually associated SNPs through iteratively updating each pattern, and converges in a reasonable time that is proportional to the data size. The details are given in the next section.

5.2 Methods

5.2.1 The attractor metaSNP

A set of linked SNPs in a high-LD block can be thought of as a cluster of variants of high mutual association. Of practical interest in this cluster may be to find the most representative SNP (tag SNP). For this, pair-wise or multi-locus LD metrics are widely used for quantifying the association between SNPs [17, 161, 169, 170], and the representative SNPs are selected accordingly. However, such pair-wise (or multi-locus) analyses among the individual (or subsets of) SNPs may fail to incorporate the broad association of each SNP with the "heart" of the cluster/block [158]. In this sense, we can define a consensus "metaSNP" (defined below) to represent the broad variation in the cluster, e.g., some sort of average value of the known allelic information over the variants. Then we can simply rank all individual SNPs by their "pair-wise" association with that metaSNP, where the few SNPs with the largest association measure may represent the core of the underlying linkage.

When analyzing gene expression data, a metagene is a hypothetical gene whose expression level is a weighted average of the expression levels of particular individual genes. In [168], the authors present an iterative ("attractor") algorithm, in which each "seed" gene leads to a successive sequence of metagenes. If the seed gene is a member of a co-expression signature, then this process converges to an "attractor metagene" representing the heart of co-expression. Although the algorithm is unsupervised, attractor metagenes proved to represent important biomolecular events and were used successfully in prognostic models for breast cancer [171, 172]. The algorithm has also been used to define signatures of mutually associated features ("attractor metafeatures") from data other than gene expression, such as methylation and protein activity levels [173]. The attractor program is available as an R package under the Synapse ID syn1446295, and has also been adopted as a function (metafeatures) in MATLAB's toolbox [174].

In the case of SNP data, a "metaSNP," whose value is defined as a weighted average of the tri-level values of particular individual SNPs, cannot be thought of as a hypothetical SNP because it is continuous-valued. Nevertheless, the above methodology is still directly applicable, and the resulting "attractor metaSNP" can represent a particular haplotype block (corresponding to a cluster of SNPs) due to high LD, indicating the presence of a joint (rather than pair-wise) linkage. This is biologically relevant since the co-inheritance of SNPs naturally occurs in contiguous stretches, and this "joint" association is gradually degraded with generational age as it is broken apart by recombination events [16]. From this point of view, a given cluster's attractor metaSNP could be a suitable proxy encoding the broad allelic variation in the corresponding high-LD SNP region, whereby a tag SNP can be selected among the essential contributors of that metaSNP (i.e., the few SNPs with the largest association value). Therefore, we employed the above algorithm (Synapse ID syn1446295), which is described as follows.

5.2.2 Attractor metaSNP estimation for a particular seed

Without any prior information, the problem of finding tag SNPs in a given genomic region can be cast as finding a number of distinct metaSNPs which can represent all common variations in layers (blocks) in the respective region. Given $N$ SNPs and $P$ samples, the vector $\mathcal{M}$ with the values of a metaSNP is the weighted average of all SNP vectors $s_i \in \{0, 1, 2\}^P$, $i = 1, \ldots, N$, characterized by the set of weights $w = [w(1), w(2), \ldots, w(N)]$, i.e., $\mathcal{M} = \sum_{i=1}^{N} w(i) s_i$. Each individual SNP in the data can play the role of a “seed”, and can be processed using the above algorithm to estimate its attractor metaSNP and the associated weight vector, as follows. When the $k$-th SNP is used as seed, the corresponding metaSNP is initialized by using a set of (trivial) weights specific to that seed, i.e., a length-$N$ vector of zeros except for the $k$-th element, which is 1. This choice initializes the metaSNP $\mathcal{M}_k$ as being equal to $s_k$. In the next step, the pair-wise associations between each SNP vector $s_i$ and the metaSNP $\mathcal{M}_k$ are calculated by the similarity metric

$$J(s_i, \mathcal{M}_k) = I^{\alpha}(s_i, \mathcal{M}_k), \qquad (5.1)$$

where $I(x, y) \in [0, 1]$ is a normalized estimate of the mutual information [175] between the two random variables $x$ and $y$, and $\alpha$ is a nonnegative exponent that shapes the similarity metric in a nonlinear manner, pushing smaller values of the normalized mutual information closer to zero. We use $\alpha = 5$, as in [168]. This measure provides the updated set of weights for the next iteration, i.e., $w(i) = J(s_i, \mathcal{M}_k)$, and the new estimate of the metaSNP is obtained from the updated weights. After several iterations the weights tend to stabilize, and this is used to determine convergence: the algorithm stops when the norm of the difference between two consecutive weight vectors drops below a certain threshold, at which point the iterative process is assumed to have converged to the attractor metaSNP (Algorithm 2).
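As a rough illustration of how the exponent shapes the metric in equation (5.1), the arithmetic below applies $\alpha = 5$ to three arbitrary values of the normalized mutual information. The input values are chosen only for illustration and are not taken from the data; they show that strong associations are largely retained while weak ones are pushed close to zero.

```latex
\begin{align*}
I = 0.9 &\;\Rightarrow\; J = 0.9^{5} \approx 0.590, \\
I = 0.5 &\;\Rightarrow\; J = 0.5^{5} \approx 0.031, \\
I = 0.2 &\;\Rightarrow\; J = 0.2^{5} \approx 0.0003.
\end{align*}
```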

Algorithm 2 Attractor metaSNP
1. Start with the $k$-th SNP as seed.
2. Calculate the pairwise associations (weights) between the $k$-th SNP and all other SNPs, i.e., $w(i) = J(s_i, s_k)$, $i = 1, \ldots, N$.
3. Estimate the metaSNP by taking the weighted average of all SNP vectors, i.e., $\mathcal{M}_k = \sum_{i=1}^{N} w(i) s_i$.
4. Calculate the (multi-locus) associations between the metaSNP $\mathcal{M}_k$ and all other SNPs, i.e., $w(i) = J(s_i, \mathcal{M}_k)$, $i = 1, \ldots, N$.
5. Repeat steps 3-4 until the two consecutive weight vectors obtained at step 4 are very similar, i.e., $\sqrt{\sum_{i=1}^{N} (w_{\mathrm{new}}(i) - w_{\mathrm{old}}(i))^2} < \epsilon$, or a predefined maximum number of iterations is reached.
6. Return the attractor $w$ and the metaSNP $\mathcal{M}_k$.
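The steps of Algorithm 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the attractor package itself: the function names (`discretize`, `normalized_mi`, `attractor_metasnp`), the equal-width binning of the continuous metaSNP, the normalization of the mutual information by the smaller of the two entropies, and the default tolerance are all choices made here for illustration, whereas the thesis relies on the estimator of [175].

```python
import math
from collections import Counter


def discretize(values, bins=3):
    """Map values to equal-width bin labels 0..bins-1 (an assumed binning)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def entropy(labels):
    """Plug-in entropy (in nats) of a sequence of hashable labels."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def normalized_mi(x, y, bins=3):
    """Plug-in mutual information scaled to [0, 1].

    Assumption: normalizing by min(H(x), H(y)) stands in for the
    normalized estimator of [175] used in the thesis.
    """
    xb, yb = discretize(x, bins), discretize(y, bins)
    hx, hy = entropy(xb), entropy(yb)
    if min(hx, hy) == 0:
        return 0.0
    mi = hx + hy - entropy(list(zip(xb, yb)))
    return min(max(mi / min(hx, hy), 0.0), 1.0)


def attractor_metasnp(snps, k, alpha=5.0, tol=1e-6, max_iter=100):
    """snps: list of N genotype vectors (values 0/1/2); k: seed index."""

    def weights_against(target):
        # J(s_i, target) = I(s_i, target) ** alpha for every SNP.
        return [normalized_mi(s, target) ** alpha for s in snps]

    def metasnp(w):
        # M = sum_i w(i) * s_i, computed sample by sample.
        return [sum(wi * s[p] for wi, s in zip(w, snps))
                for p in range(len(snps[0]))]

    w = weights_against(snps[k])                  # step 2
    for _ in range(max_iter):
        w_new = weights_against(metasnp(w))       # steps 3-4
        delta = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_new, w)))
        w = w_new
        if delta < tol:                           # step 5
            break
    return w, metasnp(w)                          # step 6


# Toy data: three perfectly linked SNPs plus one unrelated SNP.
block = [0, 1, 2, 0, 1, 2, 0, 1, 2]
unrelated = [0, 0, 0, 1, 1, 1, 2, 2, 2]
weights, meta = attractor_metasnp([block, block, block, unrelated], k=0)
```

In this toy example the three linked SNPs end up with weights close to 1 and the unrelated SNP with a weight close to 0, which is the behavior the exponent is meant to encourage.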

5.2.3 Finding all attractor metaSNPs and selecting the tag SNPs

We can do an exhaustive search by applying the attractor algorithm to all $N$ seeds. In that case, we will find a limited number of attractor metaSNPs, each of which has multiple “attractee” seeds (i.e., seeds that converge to a “particular” attractor) [168]. It would suffice to constrain the “set of seeds to be processed” to a subset of $\{1, \ldots, N\}$ that contains only one attractee seed (from among the equivalent attractees) per attractor, which would efficiently yield the same set of attractor metaSNPs. To overcome such complexities, we developed a sliding-window based heuristic using the two objectives described below, which we observed not to compromise performance.

• First, for a given seed SNP, we constrain the weighted average (metaSNP) and the association calculations to the seed’s “genomic neighborhood” of $X$ local SNPs (we used $X$ = 10,001), i.e., we use $i = k - \frac{X}{2}, \ldots, k + \frac{X}{2}$ in steps 2-4; otherwise, using all $N$ SNPs at the repeated steps 3-4 can be computationally prohibitive for large $N$, making the algorithm intractable (e.g., the 1,000 Genomes Project provides genotype data with $N \sim 84$ million variants).

• Second, to estimate all attractor metaSNPs in the data efficiently, we reduce the number of possible seeds by evaluating their potential to be an attractee seed. For every SNP ($k \in \{1, \ldots, N\}$) in the data, we quickly estimate a “short-attractor” (consisting of 101 SNPs) by using Algorithm 2 with $i = k - \frac{101}{2}, \ldots, k + \frac{101}{2}$ in steps 2-4, and call it an attractee seed if the 5th largest weight in the converged $w$ is larger than 0.5; otherwise, we discard the seed.

Given the above definitions, the sliding-window heuristic is summarized in Algorithm 3.

Algorithm 3 Finding all attractors
1. Estimate all attractee seeds, i.e., seeds whose short-attractor has a 5th largest weight $\geq 0.5$.
2. Run the genomically-localized program for every attractee seed reported from the short-attractors.
3. Return all attractors estimated in step 2.
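The sliding-window heuristic of Algorithm 3 can be sketched as follows. This is an illustrative outline under assumptions rather than the actual implementation: `attractor_fn` is a hypothetical callable that takes a window of SNP vectors and the seed's index within that window and returns a pair (weights, metaSNP), for example an implementation of Algorithm 2. Clipping the window at the ends of the chromosome and rounding the half-width down are also assumptions, since the text does not specify how boundaries are handled.

```python
def window_bounds(k, width, n):
    """Half-open index range of about `width` SNPs centred on k, clipped to [0, n)."""
    half = width // 2
    return max(0, k - half), min(n, k + half + 1)


def find_attractee_seeds(snps, attractor_fn, short_width=101, rank=5, threshold=0.5):
    """Step 1: keep seeds whose short-attractor has a strong rank-th largest weight."""
    seeds = []
    for k in range(len(snps)):
        lo, hi = window_bounds(k, short_width, len(snps))
        weights, _ = attractor_fn(snps[lo:hi], k - lo)
        ranked = sorted(weights, reverse=True)
        # Algorithm 3 states ">= 0.5"; the prose states "larger than 0.5".
        if len(ranked) >= rank and ranked[rank - 1] >= threshold:
            seeds.append(k)
    return seeds


def find_all_attractors(snps, attractor_fn, full_width=10_001, **seed_options):
    """Steps 2-3: run the localized attractor for every attractee seed."""
    attractors = {}
    for k in find_attractee_seeds(snps, attractor_fn, **seed_options):
        lo, hi = window_bounds(k, full_width, len(snps))
        weights, metasnp = attractor_fn(snps[lo:hi], k - lo)
        # Store the window offset so weights can be mapped back to SNP indices.
        attractors[k] = (lo, weights, metasnp)
    return attractors
```

Passing the attractor routine in as an argument keeps the seed filter independent of how the mutual information is estimated, so the same outline works with any implementation of Algorithm 2.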

5.2.4 Selecting tag SNPs

Since the weights used for the attractor metaSNPs are equal to their associations with the individual SNPs, it is straightforward to select the tagging SNP as the one with the largest weight. In case of multiple (co-)top-ranked SNPs with identical (largest) weights, we choose the one that is located closest to the median of their genomic positions. After identifying all tag SNPs, one can order them to assist the selection of informative tag SNP subsets for various genotyping needs (see Results for the analyses of genotype coverage obtained by different choices of tag SNPs). The information value of a tag SNP may correlate with the “strength” of its attractor, which measures the degree of association among the attractor’s top SNPs. To favor a tag SNP with strong mutual associations among its top-ranking variants, we define the strength $S$ of an attractor as “the (unnormalized) mutual information between the $n$-th top SNP and the attractor metaSNP”. By default, we set $n$ to 10, as it leads to good performance in most data sets.
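The selection rule with its tie-break can be written compactly. The sketch below is only an illustration: the function name `select_tag_snp` and its arguments are hypothetical, and when two tied SNPs are equally close to the median position it returns the first one in index order, a case the text does not address.

```python
from statistics import median


def select_tag_snp(weights, positions):
    """Return the index of the tag SNP for one attractor.

    weights:   attractor weight of each SNP
    positions: genomic position of each SNP (same order as weights)
    """
    top_weight = max(weights)
    # All SNPs sharing the largest weight are candidate tags.
    top = [i for i, w in enumerate(weights) if w == top_weight]
    # Tie-break: pick the candidate closest to the median candidate position.
    center = median(positions[i] for i in top)
    return min(top, key=lambda i: abs(positions[i] - center))


# Three SNPs tie for the largest weight; the middle one is selected.
tag = select_tag_snp([0.2, 1.0, 1.0, 1.0, 0.1], [100, 200, 260, 400, 500])
print(tag)  # 2
```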

5.3 Results

In this chapter we conducted a series of experiments on widely-used SNP data sets to assess the proposed method’s performance in terms of efficiency in SNP tagging. We compared our results with the state-of-the-art algorithms designed for the same task on the relevant data sets.

5.3.1 Data sets

HapMap

We used the trio genotype data sets from HapMap’s ENCODE project (http://hapmap.ncbi.nlm.nih.gov/genotypes/latest_ncbi_build34/ENCODE/non-redundant/) [176] belonging to 30 trio families from the CEU population. We focused on four genomic regions, ENm013, ENm014, ENr112 and ENr113, of which the largest (ENr113) contains 2486 SNPs covering a ~500 kb region of Chromosome 4, corresponding to an average marker density of ~1 SNP per 0.2 kb. The details of the data sets are listed in Table 5.1. For a wider application in HapMap, we also used the genotypes corresponding to human Chromosome 22 (http://hapmap.ncbi.nlm.nih.gov/genotypes/latest_phaseIII_ncbi_b36/hapmap_format/consensus) consisting of 60 trio samples from the CEU population. All genotype data sets come with missing values for certain SNPs and samples. We used the IMPUTE2 algorithm [177] to impute missing values, employing as the reference panel the genotypes from the 1000 Genomes Project [178] (https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html). In addition to the imputed data, we ran the algorithms on the raw (unimputed) data sets to assess their performance under missing-data conditions.

We also tested the algorithms’ performance for tagging SNPs in haplotype data sets. For this we used the same ENCODE genotypes and phased the corresponding haplotypes by using the IMPUTE2 algorithm. Due to limitations in inference accuracy we only used the confidently-phased SNPs, which reduced the data sizes of ENm013, ENm014, ENr112 and ENr113 to 1626, 1712, 1366 and 1998 SNPs, respectively.

Table 5.1: Details of HapMap data sets used in this chapter.

Dataset   Chr. no.   No. of SNPs   Marker sparsity (b)   No. of samples
ENm013    7          2069          241.5                 90
ENm014    7          2232          222.7                 90
ENr112    2          1505          332.1                 90
ENr113    4          2486          200.9                 90
Chr22     22         20108         1741.3                165

1000 Genomes Project (1KGP)

In addition to the HapMap database, we tested the algorithms on the genotype data from the recently catalogued 1,000 Genomes Project [179]. These genotypes are constructed from a large group of individuals from multiple populations, combined with genotype imputation on the variants not covered by sequencing reads, which obviated the imputation step in our analyses. In this chapter, we used the genotype data from the latest release, consisting of 84.4 million variants built from 2,504 samples from 26 populations (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) [12]. We tested our algorithm on the whole-genome data by processing one chromosome at a time, so as to scrutinize the tagging results specifically for each chromosome. As a result, we built the 1000 Genomes Tags (1KGT) database (http://www.columbia.edu/~to2232/tagSNP/), displaying the tag SNPs and the SNPs they tag, together with the respective joint association measures (i.e., the attractors depicting the multi-locus LD maps).

5.3.2 Performance evaluation

As the performance metric we used the “coverage rate per tagged SNPs” to represent the genomic diversity (or genotype diversity) captured by a given choice of tag SNPs. The coverage rate ($R$) can be defined as “the ratio between the maximum number of genomic sequences covered ($G_i$) and the total number of samples ($P$)” for the given choice of tag SNPs [162], i.e.,

$$R = \frac{\sum_{i=1}^{t} G_i}{P}, \qquad (5.2)$$

where $t$ is the number of genomic patterns observed on the given selection of tag SNPs. In genotype data, those patterns are the distinct sequences of genotypes, each consisting of the three-level genotype values of an individual at the selected tag SNPs, where 0 encodes the reference homozygous genotype, 1 encodes the heterozygous genotype, and 2 encodes the alternate homozygous genotype. For example, assume that 5 samples (I-V) are genotyped on four SNP loci and the corresponding sequences are “2111”, “2200”, “2201”, “2110”, and “2200”. If the first two SNPs are tagged, only two genotype patterns, “21” and “22”, are observed, on the samples I, IV and II, III, V, respectively. The pattern “21” can represent the genotype sequences “2111” (I) and “2110” (IV), each belonging to exactly 1 individual, so that the maximum coverage of this pattern is $G_1 = 1$. The second pattern “22” represents the genotype sequences “2200” (II, V) and “2201” (III), where the sequence “2200” covers 2 individuals (samples II and V), so the maximum coverage is $G_2 = 2$. Using equation (5.2), the coverage rate is calculated as $R = (G_1 + G_2)/P = (1 + 2)/5 = 0.6$. Intuitively, an optimal tagging approach should find a subset of SNPs possessing a larger number of genomic patterns, with perhaps a broader coverage obtained by each pattern; e.g., the last two SNP loci in this example result in 4 distinct patterns covering 100% of the individuals ($R = 1$). $R$ is a versatile metric that can be readily used with haplotype data to represent the “haplotype diversity” captured by a set of tagged SNPs [162].

In genotype data sets, we compared our results with the state-of-the-art algorithm Tagger [163], which is employed by the HapMap database as a SNP tagging tool. Tagger is a block-free (LD-based) approach that offers multiple LD measures for optimal SNP selection, and is robust for tagging SNPs in samples from multiple populations [180]. The version we experimented with is maintained by the Haploview (4.2) software [181]. For the phased haplotype data, we used another LD-based approach, the ER algorithm [161], which is the work most relevant to the theoretical part of our study. The ER algorithm uses information theory to define a multi-locus LD measure, which is estimated by a form of relative entropy calculation among the variants, and can only work with haplotype data. We ran the metaSNP method versus Tagger and ER on four ENCODE regions and Chromosome 22 in HapMap, and obtained the ranked estimates of tag SNPs produced by each algorithm. For the 1KGP data, we compared our predictions with those of Tagger obtained for several short (50,000 SNPs) segments of Chromosome 22, due to limitations pertaining to the Tagger algorithm. In all simulations we used the recommended default settings of the algorithms unless otherwise noted. We report the coverage rate ($R$) obtained by the top-$c$ tag SNPs of a solution, whereby different values of $c$ are used to illustrate the tradeoff between the coverage rate ($R$) and the genotyping cost ($c$).
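The coverage rate of equation (5.2) is straightforward to compute, and the worked example with samples I-V can be reproduced with a short script. The function name `coverage_rate` and the representation of genotype sequences as strings are choices made here for illustration only.

```python
from collections import Counter, defaultdict


def coverage_rate(sequences, tag_idx):
    """Coverage rate R = (sum of G_i over observed patterns) / P.

    sequences: one genotype (or haplotype) sequence per sample
    tag_idx:   indices of the selected tag SNPs
    """
    # Group the full sequences by the pattern they show at the tag SNPs.
    groups = defaultdict(Counter)
    for seq in sequences:
        pattern = tuple(seq[j] for j in tag_idx)
        groups[pattern][tuple(seq)] += 1
    # G_i: the largest number of samples sharing one full sequence
    # among those that exhibit pattern i.
    covered = sum(max(counts.values()) for counts in groups.values())
    return covered / len(sequences)


samples = ["2111", "2200", "2201", "2110", "2200"]  # samples I-V
print(coverage_rate(samples, [0, 1]))  # 0.6
print(coverage_rate(samples, [2, 3]))  # 1.0
```

Tagging the first two loci reproduces $R = 0.6$, while tagging the last two loci yields four patterns and $R = 1$, matching the example in the text.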

5.3.3 Genotype data sets from several ENCODE regions and the human chromosome 22

Figure 5.2 shows that the algorithms perform favorably on the ENCODE data, reaching a coverage rate of 90% within ~20 tagged SNPs. In all genomic regions the curves tend to saturate after ~15 tag SNPs, which corresponds to 90-100% coverage. MetaSNP outperforms Tagger at all genotyping costs, including the high price range $c \geq 15$ (Table 5.2). From the plots we can say that a coverage rate of 95% is a good tradeoff, since the genotyping costs increase exponentially beyond this rate. At this value, the metaSNP algorithm is more cost-effective than Tagger (Table 5.3).

The algorithms behaved similarly when tagging SNPs chromosome-wide. Figure 5.3 shows the coverage rates in Chromosome 22 for different choices of the tag SNPs. Notably, the metaSNP and Tagger algorithms can cover 90% of the genomic diversity with only 7 and 10 tag SNPs, respectively, and 99% with 10 and 13 tag SNPs, respectively. However, the metaSNP algorithm outperforms Tagger under all conditions. Full coverage is not obtained effectively by Tagger, which requires 36 tag SNPs. The main reason for this is the pairwise-LD measure, which can locate SNPs with high “pairwise”-LD but fails to explore genomic diversity at those loci. On the other hand, metaSNP reaches full coverage in a more cost-effective way (13 tag SNPs) due to its multi-locus nature.

Table 5.2: Coverage rates in imputed genotype data (HapMap).

Algorithm   c    ENm013   ENm014   ENr112   ENr113
Tagger      15   0.59     0.93     0.93     0.80
metaSNP     15   0.80     1.00     1.00     0.98
Tagger      20   0.67     1.00     0.99     0.90
metaSNP     20   0.82     1.00     1.00     0.98

Table 5.3: Minimum number of tag SNPs that reach >95% coverage in imputed genotype data (HapMap).

Algorithm   ENm013   ENm014   ENr112   ENr113
Tagger      >50      16       17       38
metaSNP     34       11       9        13

5.3.4 Haplotype data sets from ENCODE regions

The proposed approach can readily employ haplotype data and perform SNP tagging. We used the phased haplotypes obtained from the imputed ENCODE data and tested the performance of metaSNP against the ER algorithm. In these plots, the coverage rates are calculated from the haplotype sequences by using equation (5.2) for the given selection of top-$c$ tag SNPs. Figure 5.4 shows that the metaSNP algorithm significantly outperforms ER at all genotyping costs after 5 tag SNPs. This is due to the multi-locus LD measure of ER, which may accurately find low-LD tag SNPs and capture the haplotype diversity in sparser SNP regions but becomes ineffective in larger data sets. This can be observed in the denser (high-LD) SNP regions (i.e., the ENr113 and ENm014 plots), where increasing the number of tag SNPs fails to yield a relative gain in coverage.

Figure 5.2: Coverage rates in imputed genotypes (HapMap). [Four panels: ENm013 region, 2069 SNPs; ENm014 region, 2232 SNPs; ENr112 region, 1505 SNPs; ENr113 region, 2486 SNPs. Horizontal axis: number of tag SNPs, c; vertical axis: coverage rate, R. Curves: Tagger and MetaSNP.]

Figure 5.3: Coverage rates in imputed Chromosome 22 genotypes (HapMap). [One panel: Chr22 region, 20108 SNPs. Horizontal axis: number of tag SNPs, c; vertical axis: coverage rate, R. Curves: Tagger and MetaSNP.]

Figure 5.4: Coverage rates in phased haplotypes (HapMap). [Four panels: ENm013 region, 1626 SNPs; ENm014 region, 1712 SNPs; ENr112 region, 1366 SNPs; ENr113 region, 1998 SNPs. Horizontal axis: number of tag SNPs, c; vertical axis: coverage rate, R. Curves: ER and MetaSNP.]

5.3.5 Missing genotype datasets from ENCODE regions

We tested the algorithms under missing data conditions. In this experiment, the coverage rate $R$ cannot incorporate missing alleles, since those sites are typed with a nonspecific value. One can use the imputed genotype values corresponding to a given choice of tag SNPs. However, we simply ignored those missing loci in calculating $R$ to avoid any imputation bias which may affect the algorithms, i.e., in equation (5.2) we excluded missing sites from the analyses when determining the patterns and the number of samples they cover. Figure 5.5 displays the performance of metaSNP against Tagger in the raw ENCODE genotypes, where metaSNP has similar or better rates (Table 5.4). From the plots, we can say that the missing data have little or no impact on both algorithms, which can accurately perform SNP tagging on all sites and capture the majority of genotype diversity on “non-missing” loci.

[Figure 5.5: four panels, one per ENCODE region (ENm013, ENm014, ENr112, ENr113), each plotting the coverage rate R against the number of tag SNPs c for Tagger and MetaSNP.]

Figure 5.5: Coverage rates in missing genotypes (HapMap).

Table 5.4: Coverage rates in missing genotype data (HapMap).

Method    c    ENm013  ENm014  ENr112  ENr113
Tagger    15   0.60    0.97    0.99    0.87
metaSNP   15   0.78    1.00    0.99    0.98
Tagger    20   0.83    0.98    1.00    0.87
metaSNP   20   0.79    1.00    1.00    0.98


5.3.6 Genotype data sets from 1000 Genomes Project (1KGP)

Compared with HapMap, the main advance in the 1KGP data is the greater SNP density (∼50 times denser) obtained from a rich multi-population sample set (2,504 individuals). Combined with genotype imputation, this results in many variants with (ignorably) low frequencies in the overall population, e.g., the variants carrying only the reference homozygous genotype for all samples in the populations. In terms of tagging costs, as one can expect, relatively more tag SNPs are required than in HapMap to reach the same coverage rate in the larger population (i.e., to cover more samples). First, we evaluated our method on the Chromosome 22 genotypes to provide baseline tagging results against metaSNP’s HapMap predictions. Then we used the genotypes of Chromosome 21, the shortest autosome, to further scrutinize our predictions. Figure 5.6 shows that the algorithm finds 11 tag SNPs that cover 95% of the samples and 25 tag SNPs for full coverage in Chromosome 22. It performs better on the Chromosome 21 genotypes, requiring fewer tags (20) for full coverage of the samples. In both results the tag SNPs are spread across the chromosome (Supplementary Figure 1 (Figure 5.7)).

[Figure 5.6: two panels, “Chromosome 21, 1105538 SNPs” and “Chromosome 22, 1103547 SNPs”, each plotting the coverage rate R against the number of tag SNPs c.]

Figure 5.6: Coverage rates in 1KGP genotypes, Chromosomes 21 and 22.

We observed similar performance in the other chromosomes as well. Figure 5.8 displays the coverage rates in the remaining 1KGP chromosomes, where we see that the algorithm needs only ∼15-20 tag SNPs for the full coverage of the samples.

[Figure 5.7: tag SNP tracks for Chromosomes 21 and 22 (chr21_tagSNPs.bed, chr22_tagSNPs.bed).]

Figure 5.7: Supplementary Figure 1.

5.3.6.1 Comparison with Tagger

As a performance comparison, we tested the Tagger algorithm on the same 1KGP data sets. Since Tagger is based on the (offline) pair-wise analysis of all SNPs, it is limited by memory use. To overcome this and to set up a fair comparison, we divided the data into several chromosomal segments (each containing 50,000 SNPs) and processed the individual segments. Figure 5.9 displays the performance curves of both algorithms averaged over these segmented data sets. Approximately 19 tag SNPs estimated by the metaSNP approach are sufficient to capture all genotype diversity in an arbitrary 50,000-SNP genomic region. In contrast, Tagger’s estimates reach a similar performance only at the cost of > 30 tag SNPs. We note that the performance of our algorithm improves with increased data size, i.e., given more variants residing in the neighboring genomic distances. This can be observed from Figure 5.9, as the upper confidence bounds of metaSNP’s coverage rates overlap with the results in Figure 5.6 (Chromosome 22), which are based on the whole-chromosome data.
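The segment-wise evaluation described above can be sketched in a few lines. This is a minimal illustration rather than the code used in the thesis: the function names are hypothetical, the trailing partial segment is dropped by assumption, and the toy coverage values are made up solely to show the shapes involved. The summary mirrors the curves reported in Figure 5.9 (average coverage with best and worst cases per cost c).

```python
import numpy as np

def split_into_segments(n_snps, segment_size=50_000):
    """Return (start, stop) index pairs of consecutive full-size segments.

    Assumption: a trailing segment shorter than `segment_size` is dropped.
    """
    n_full = n_snps // segment_size
    return [(i * segment_size, (i + 1) * segment_size) for i in range(n_full)]

def summarize_curves(curves):
    """Summarize coverage curves over segments.

    curves: array-like of shape (n_segments, n_costs); entry [s, k] is the
    coverage rate R reached in segment s with the k-th tag SNP budget c.
    Returns the mean, worst (min) and best (max) coverage per budget.
    """
    curves = np.asarray(curves, dtype=float)
    return curves.mean(axis=0), curves.min(axis=0), curves.max(axis=0)

# Chromosome 22 holds 1,103,547 SNPs in the 1KGP data set used above.
segments = split_into_segments(1_103_547)

# Hypothetical coverage rates for two segments and three budgets.
toy_curves = [[0.50, 0.80, 0.95],
              [0.60, 0.90, 1.00]]
mean_curve, worst_curve, best_curve = summarize_curves(toy_curves)
```

Each (start, stop) pair would be used to slice the genotype matrix before running a tagging method on that segment; only the resulting coverage curves are combined.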

5.3.7 Complexity

The computational complexity of the algorithms is evaluated in terms of running times on the benchmark HapMap data. We carried out all experiments on hardware with a Core(TM) i7 CPU @2.6GHz, 16GB memory, and Mac OSX 10.10.5. The pairwise LD-based Tagger is the fastest approach in this study, as it can process each ENCODE data set in a minute (Figure 5.10). On the other hand, the multi-locus LD calculation substantially slows down the ER algorithm, although it results in a constant growth of complexity with the data size. MetaSNP’s performance is comparable to Tagger’s and also seems to be unaffected by the data size, displaying a log-linear behavior, which is desirable for a block-free approach based on the multi-locus analysis of SNPs. Moreover, for most attractors the metaSNP algorithm converges in fewer than 20 iterations. Figure 5.11 displays the convergence of the algorithm in 7 iterations when it estimates the attractor from the second seed and its tag SNP (rs1204568) in the ENm014 genotype data.

[Figure 5.8: four panels (Chromosomes 1-5, 6-10, 11-15 and 16-20), each plotting the coverage rate R against the number of tag SNPs c, with one curve per chromosome.]

Figure 5.8: Coverage rates in 1KGP genotypes, Chromosomes 1-20.

[Figure 5.9: single panel, “Chromosome 22, segments of 50000 SNPs”, plotting the coverage rate R against the number of tag SNPs c for Tagger and MetaSNP.]

Figure 5.9: Comparison of the coverage rates in 1KGP chromosomal segments. The average performance curve is estimated from 20 consecutive blocks of 50,000 SNPs. For each given cost (c) the error bars represent the best and worst coverage rates obtained from the tagging results in different blocks.

[Figure 5.10: running time (sec) against the number of SNPs in ENr112, ENm013, ENm014 and ENr113, respectively, for ER, Tagger and MetaSNP.]

Figure 5.10: Running times in imputed genotypes data tested in ENCODE regions (HapMap).

5.3.7.1 Complexity in tagging the 1,000 Genomes genotypes

The genome-wide application of the metaSNP algorithm (step 1 and step 2 in Algorithm 3) was run with cluster support. In step 1, the scan for the “short-attractors” was run on 100 m1.medium instances on Amazon Web Services using StarCluster (http://star.mit.edu/cluster/) and R (https://www.r-project.org/). Each instance is equipped with a vCPU (Intel Xeon Family) and 3.75 GB RAM. In step 2, the refinement of the “short-attractors” was run on a workstation with Dual Intel Xeon E5-2637 (16 cores) and 256GB RAM using R. For the whole genome, the total running times for step 1 and step 2 are 134.37 hours and 153.3 hours, respectively.

[Figure 5.11: stacked panels of the attractor weights w0, ..., w7 (vertical axis from 0 to 1) against the SNP number (0 to 500), for tag SNP rs1204568.]

Figure 5.11: Convergence of the algorithm for an estimated attractor (only the first 500 SNPs in the ENCODE’s ENr113 region are displayed). The tag SNP is marked by the red circle through the iterations (w0, . . ., w7), which is selected according to its proximity to the genomic median of the co-top-ranking SNPs.

5.4 Discussion

This chapter addresses the problem of efficient SNP tagging in very large genomic distances by exploiting the underlying multi-locus LD patterns. We emphasize that our SNP tagging approach is block-free and undemanding in the sense that it is not based on any prior information derived from the empirical data such as haplotype block structures, recombination hotspots, or LD maps. It requires only the genotype (or haplotype) sequences (i.e., trilevel or twolevel information of allelic variation) to perform SNP tagging, and offers a number of parameters to provide flexibility for different genotyping needs. For example, different values of α can be chosen to optimize different objectives of the metaSNP attractor search: in Algorithm 3 we used α = 2 in step 1 to obtain all valuable short-attractors, even those with low degrees of LD, at the expense of getting highly correlated (redundant, overlapping) ones; in step 2, by contrast, we used α = 5 to focus on only the sharpest (distinctive) attractors. Further, the algorithm can be started with a user-defined SNP set as the “primary seeds” to be force-included in the final results. In this case, the α parameter can be adjusted accordingly (i.e., increased) for those seed SNPs to ensure the discovery of “sharp” attractors that mutually exclude each other’s seed SNP. Since the objective is based on a general measure of correlation between the variables (i.e., SNP loci), the employed algorithm can handle phased haplotype sequences as well as genotype data. This is a great advantage over existing algorithms, since procedures that apply statistical inference in data management (e.g., haplotype phasing, genotype imputation etc.) are prone to error and often introduce dependencies and arbitrary patterns into the end results, whereas algorithms capable of handling different data types can avoid such limitations.

As a measure of multi-locus correlation we used the mutual information, which can be efficiently estimated between a continuous-valued metaSNP variable and an individual SNP variant. For this, we employed the numerical method in [182], which is based on partitioning the continuous data into discrete intervals (bins) and assigning each data point to several bins simultaneously by the use of B-spline functions. For computational efficiency, we used 4 bins, which effectively capture the variation in trilevel data, and a spline order of 2 to allow each continuous data point to be represented by two bins at most.

In addition to the ranked tag SNPs, the proposed method reports their metaSNPs and the attractors representing the association landscapes, providing an information-rich output for subsequent analyses. An example output corresponding to the ENr113 genotype results is given in the Supplementary data.
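The B-spline binning used for the mutual information estimate above can be illustrated with a short sketch. It is not the implementation of [182] or of the thesis; it is a simplified version under stated assumptions: the data are rescaled linearly to the bin range, both variables are binned the same way, and all function names are hypothetical. With a spline order of 2 the basis functions are triangular “hat” functions, so each data point receives non-zero weight in at most two neighboring bins, as described in the text.

```python
import numpy as np

def bspline_weights(x, n_bins=4):
    """Order-2 (linear) B-spline bin memberships, shape (len(x), n_bins).

    Assumption: values are rescaled linearly to [0, n_bins - 1]; a constant
    vector is placed entirely in the first bin. Each row sums to 1.
    """
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    z = (x - x.min()) / span * (n_bins - 1) if span > 0 else np.zeros_like(x)
    centers = np.arange(n_bins)
    # Hat function centered on each bin: weight decays linearly with distance.
    return np.clip(1.0 - np.abs(z[:, None] - centers[None, :]), 0.0, 1.0)

def entropy(p):
    """Shannon entropy in bits of a probability vector (zeros ignored)."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information(x, y, n_bins=4):
    """Estimate I(x; y) = H(x) + H(y) - H(x, y) from soft bin memberships."""
    wx = bspline_weights(x, n_bins)
    wy = bspline_weights(y, n_bins)
    p_xy = wx.T @ wy / len(wx)          # joint bin probabilities
    p_x = p_xy.sum(axis=1)              # marginal of x
    p_y = p_xy.sum(axis=0)              # marginal of y
    return entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())

# Toy usage: a trilevel SNP (0/1/2) against a noisy continuous summary of it.
rng = np.random.default_rng(0)
snp = rng.integers(0, 3, size=200)
meta = snp + 0.1 * rng.standard_normal(200)   # hypothetical metaSNP values
mi_related = mutual_information(meta, snp)
mi_constant = mutual_information(meta, np.ones(200))
```

In this sketch a constant variable carries no information, so `mi_constant` is zero up to rounding, while `mi_related` is clearly positive.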
The metaSNP and the converged attractor capture the underlying haplotype structure in several layers, where a number of top-ranking SNPs selected by a specific weight threshold can represent a haplotype block of a particular degree of LD. The main contributions of the proposed algorithm are the definitions of “attractor” and “metaSNP” for the simplified representations of “multi-locus LD” and “variation” in the haplotype blocks, respectively, which may provide valuable information in both dimensions of the SNP data. Furthermore, these quantities can be iteratively estimated up to a certain precision, allowing efficient heuristic designs which can be applied to very large data sets. In particular, our sliding-window heuristic can employ the vast amount of information in several layers, i.e., using the first layer of data (sliding a 101-SNP window) we find all potential attractee seeds that may form a typical attractor, then using the second layer of data (the window of 10,001 SNPs local to a given seed) we find the desired long attractors to capture any long-range multi-locus LD pattern.

In this chapter, we defined the genomic neighborhood of a tag SNP as the set of X = 10,001 nearby variants that will exhibit a substantial LD decay in the corresponding genomic distance (i.e., 500 kb region). Although in 1KGP data sets a large drop in pair-wise LD (r^2 < 0.05) is observed within the 100 kb distance in all populations [12], we extend our LD decay assumption to ∼500 kb (i.e., 10,001 SNPs) since we use a different LD score (the similarity metric J) to be able to capture the multi-locus associations that might be (jointly) present in larger distances.
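The two window layers of the sliding-window heuristic can be sketched as index ranges over the SNPs of a chromosome. This is only an illustration under assumptions not stated in the text: the odd window widths are taken to mean a window centered on a SNP, windows are clipped at the chromosome ends, the first-layer window advances one SNP at a time, and both function names are hypothetical.

```python
def centered_window(center, width, n_snps):
    """Half-open index range [start, stop) of up to `width` SNPs around `center`.

    Assumption: the window is centered on the seed and clipped at the ends
    of the chromosome, so it can be shorter than `width` near the borders.
    """
    half = width // 2
    start = max(0, center - half)
    stop = min(n_snps, center + half + 1)
    return start, stop

def first_layer_windows(n_snps, width=101, step=1):
    """Yield (start, stop) ranges of a window slid along the chromosome."""
    last_start = max(n_snps - width, 0)
    for start in range(0, last_start + 1, step):
        yield start, min(start + width, n_snps)

# First layer: short windows scanned for candidate seeds.
short_windows = list(first_layer_windows(n_snps=1_000, width=101))

# Second layer: the 10,001-SNP neighborhood local to a given seed index.
seed = 250_000
start, stop = centered_window(seed, width=10_001, n_snps=1_103_547)
```

Here `stop - start` equals 10,001 for any seed that lies at least 5,000 SNPs away from both chromosome ends.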

Chapter 6

Conclusions

In this thesis we have addressed several problems in computational genomics and genetics: TF motif discovery and regulon inference supported with an improved scheme for TFBS prediction, haplotype inference from the genotype sequences of a population of unrelated individuals, and the genome-wide selection of representative (tag) SNPs.

In Chapter 2, we introduced a computational approach for the identification of TF regulons and solved the core problem, motif discovery, within a Bayesian framework. The novel approach can estimate the intrinsic features of the TFBSs and represents a more accurate model of bacterial TF-DNA binding. Compared to the conventional comparative genomics approaches, the method produced statistically significant results with comparable accuracy, as validated on the gold-standard data sets. Furthermore, when applied to a novel organism, it revealed some organism-specific regulatory patterns for a hypothetical regulator, which were not captured previously by conventional methods.

In Chapter 3, we examined the problem of TFBS sequence prediction based on the sequence context, and proposed a novel discriminative learning model that incorporates the “non-contiguous” dependency in the binding sites. We demonstrated that the approach outperforms the traditional SVM classifiers, and can provide more biological insight into the TFBS sequences that have nucleotide interdependency in their binding sites. We also proposed a false discovery based feature elimination method to deal with overfitting, which can be applied to any string kernel based machine learning approach, and demonstrated its efficacy in the presented SVM classifiers.

In Chapter 4, we presented a cost-effective solution to the haplotype inference problem based primarily on the xor-genotype data. The problem is NP-hard and does not yield a unique solution due to the bit-flip degree of freedom. These limitations are addressed, respectively, by the proposed efficient greedy heuristic based on the parsimony-inducing sparse dictionary selection algorithm and by a data augmentation strategy. Through experiments with synthetic and real data sets we demonstrated that our approach is superior to the state-of-the-art algorithms at each task, i.e., it resolves the haplotype phases more accurately and requires less regular genotype data.

In Chapter 5, we solved the tag SNP selection problem for genome-wide selection using a large set of samples from different populations, which reflects the challenge posed by the most recent technology and the available data sets. We demonstrated that the common genetic diversity can be captured by only 15-20 tag SNPs per chromosome discovered by the proposed multi-locus LD-based algorithm, which outperforms the existing algorithms in terms of both accuracy (sample coverage per cost) and computational complexity. We showed that our information-theoretic attractor approach is suitable for efficient heuristic designs that scale up to genome-wide analysis, and can work with different data types such as genotypes or haplotypes. As a novel application, we estimated a catalogue of tagging variants for the 1000 Genomes Project database that captures all genotype diversity in the available sample set.

Bibliography

[1] F JACOB and J MONOD. Genetic regulatory mechanisms in the synthesis of proteins. J Mol Biol, 3:318–356, Jun 1961. ISSN 0022-2836 (Print); 0022-2836 (Linking).

[2] M I Arnone and E H Davidson. The hardwiring of development: organization and function of genomic regulatory systems. Development, 124(10):1851–1864, May 1997. ISSN 0950-1991 (Print); 0950-1991 (Linking).

[3] Jie Wang, Jiali Zhuang, Sowmya Iyer, XinYing Lin, Troy W Whitfield, Melissa C Greven, Brian G Pierce, Xianjun Dong, Anshul Kundaje, Yong Cheng, Oliver J Rando, , Richard M Myers, William S Noble, Michael Snyder, and Zhiping Weng. Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res, 22(9):1798–1812, Sep 2012. ISSN 1549-5469 (Electronic); 1088-9051 (Linking). doi: 10.1101/gr.139105.112.

[4] Gary D. Stormo and Yue Zhao. Determining the specificity of protein– interactions. Nat Rev Genet, 11(11):751–760, 11 2010. URL http://dx.doi.org/10.1038/nrg2845.

[5] Morgane Thomas-Chollier, Elodie Darbo, Carl Herrmann, Matthieu Defrance, Denis Thieffry, and Jacques van Helden. A complete workflow for the analysis of full-size chip-seq (and similar) data sets using peak-motifs. Nat Protoc, 7(8):1551–1568, Aug 2012. ISSN 1750-2799 (Electronic); 1750-2799 (Linking). doi: 10.1038/nprot.2012.088.

[6] Abdulkadir Elmas, Guido H Jajamovich, Xiaodong Wang, and Michael S Samoilov. Designing dna barcodes orthogonal in melting temperature by simulated annealing optimization. Nucleic Acid Ther, 23(2):140–151, Apr 2013. ISSN 2159-3345 (Electronic); 2159-3337 (Linking). doi: 10.1089/nat.2012.0394.

[7] Rowan G Zellers, Robert A Drewell, and Jacqueline M Dresch. Marz: an algorithm to combinatorially analyze gapped n-mer models of transcription factor binding. BMC Bioinf, 16:30, 2015.

[8] Anthony Mathelier and Wyeth W. Wasserman. The next generation of transcription factor binding site prediction. J Bioinform Comput Biol., 9(9):e1003214, 2013.

[9] Y Ghavi-Helm, FA Klein, T Pakozdi, L Ciglar, D Noordermeer, W Huber, and EE Furlong. Enhancer loops appear stable during development and are associated with paused polymerase. Nature, 512(7512):96–100, 2014.

[10] Matthew T Maurano, Richard Humbert, Eric Rynes, Robert E Thurman, Eric Haugen, Hao Wang, Alex P Reynolds, Richard Sandstrom, Hongzhu Qu, Jennifer Brody, Anthony Shafer, Fidencio Neri, Kristen Lee, Tanya Kutyavin, Sandra Stehling-Sun, Audra K Johnson, Theresa K Canfield, Erika Giste, Morgan Diegel, Daniel Bates, R Scott Hansen, Shane Neph, Peter J Sabo, Shelly Heimfeld, Antony Raubitschek, Steven Ziegler, Chris Cotsapas, Nona Sotoodehnia, Ian Glass, Shamil R Sunyaev, Rajinder Kaul, and John A Stamatoyannopoulos. Systematic localization of common disease-associated variation in regulatory dna. Science, 337(6099):1190–1195, Sep 2012. ISSN 1095-9203 (Electronic); 0036-8075 (Linking). doi: 10.1126/science.1222794.

[11] The International HapMap Consortium. A haplotype map of the human genome. Nature, 437(7063):1299–1320, 10 2005. URL http://dx.doi.org/10.1038/nature04226.

[12] Adam Auton, Lisa D Brooks, Richard M Durbin, Erik P Garrison, Hyun Min Kang, Jan O Korbel, Jonathan L Marchini, Shane McCarthy, Gil A McVean, and Goncalo R Abecasis. A global reference for human genetic variation. Nature, 526(7571):68–74, Oct 2015. ISSN 1476-4687 (Electronic); 0028-0836 (Linking). doi: 10.1038/nature15393.

[13] Inderpreet Sur, Sari Tuupanen, Thomas Whitington, Lauri A Aaltonen, and Jussi Taipale. Lessons from functional analysis of genome-wide association studies. Cancer Res, 73(14):4180–4184, Jul 2013. ISSN 1538-7445 (Electronic); 0008-5472 (Linking). doi: 10.1158/0008-5472.CAN-13-0789.

[14] B Kerem, J M Rommens, J A Buchanan, D Markiewicz, T K Cox, A Chakravarti, M Buchwald, and L C Tsui. Identification of the cystic fibrosis gene: genetic analysis. Science, 245(4922):1073–1080, Sep 1989. ISSN 0036-8075 (Print); 0036-8075 (Linking).

[15] M E MacDonald, A Novelletto, C Lin, D Tagle, G Barnes, G Bates, S Taylor, B Allitto, M Altherr, and R Myers. The huntington’s disease candidate region exhibits many different haplotypes. Nat Genet, 1(2):99–103, May 1992. ISSN 1061-4036 (Print); 1061-4036 (Linking). doi: 10.1038/ng0592-99.

[16] William S Bush and Jason H Moore. Genome-wide association studies. PLoS Computational Biology, 8(12):e1002822, 12 2012. doi: 10.1371/journal.pcbi.1002822. URL http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531285/.

[17] B Devlin and N Risch. A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics, 29(2):311–322, Sep 1995. ISSN 0888-7543 (Print); 0888-7543 (Linking). doi: 10.1006/geno.1995.9003.

[18] William S Bush, Scott M Dudek, and Marylyn D Ritchie. Biofilter: a knowledge-integration system for the multi-locus analysis of genome-wide association studies. Pac Symp Biocomput, pages 368–379, 2009. ISSN 2335-6936 (Print).

[19] Christine Herold, Michael Steffens, Felix F Brockschmidt, Max P Baur, and Tim Becker. Intersnp: genome-wide interaction analysis guided by a priori information. Bioinformatics, 25(24):3275–3281, Dec 2009. ISSN 1367-4811 (Electronic); 1367-4803 (Linking). doi: 10.1093/bioinformatics/btp596.

[20] Jason H Moore and Marylyn D Ritchie. Studentjama. the challenges of whole-genome approaches to common diseases. JAMA, 291(13):1642–1643, Apr 2004. ISSN 1538-3598 (Electronic); 0098-7484 (Linking). doi: 10.1001/jama.291.13.1642.

[21] Stacey B Gabriel, Stephen F Schaffner, Huy Nguyen, Jamie M Moore, Jessica Roy, Brendan Blumenstiel, John Higgins, Matthew DeFelice, Amy Lochner, Maura Faggart, Shau Neen Liu-Cordero, Charles Rotimi, Adebowale Adeyemo, Richard Cooper, Ryk Ward, Eric S Lander, Mark J Daly, and David Altshuler. The structure of haplotype blocks in the human genome. Science, 296(5576):2225–2229, Jun 2002. ISSN 1095-9203 (Electronic); 0036-8075 (Linking). doi: 10.1126/science.1069424.

[22] Abdulkadir Elmas, Xiaodong Wang, and Michael S Samoilov. Reconstruction of novel transcription factor regulons through inference of their binding sites. BMC Bioinformatics, 16:299, 2015. ISSN 1471-2105 (Electronic); 1471-2105 (Linking). doi: 10.1186/s12859-015-0685-y.

[23] Christina Leslie, Eleazar Eskin, and William Stafford Noble. The spectrum kernel: A string kernel for svm protein classification. Pac Symp Biocomput, 7:564–575, 2002.

[24] Thomas Peters and Reinhard Sedlmeier. Current methods for high-throughput detection of novel dna polymorphisms. Drug Discov Today Technol, 3(2):123–129, Summer 2006. ISSN 1740-6749 (Electronic); 1740-6749 (Linking). doi: 10.1016/j.ddtec.2006.05.002.

[25] W Xiao and P J Oefner. Denaturing high-performance liquid chromatography: A review. Hum Mutat, 17(6):439–474, Jun 2001. ISSN 1098-1004 (Electronic); 1059-7794 (Linking). doi: 10.1002/humu.1130.

[26] Abdulkadir Elmas, Guido H Jajamovich, and Xiaodong Wang. Maximum parsimony xor haplotyping by sparse dictionary selection. BMC Genomics, 14:645, 2013. ISSN 1471-2164 (Electronic); 1471-2164 (Linking). doi: 10.1186/1471-2164-14-645.

[27] G Chua, QD Morris, R Sopko, MD Robinson, O Ryan, ET Chan, BJ Frey, BJ Andrews, G Boone, and TR Hughes. Identifying transcription factor functions and targets by phenotypic activation. Proc Natl Acad Sci USA, 103:12045–12050, 2006.

[28] MS Gelfand, PS Novichkov, ES Novichkov, and AA Mironov. Comparative analysis of regulatory patterns in bacterial genomes. Briefings Bioinf, 1:357–371, 2000.

[29] M Price, P Dehal, and A Arkin. Horizontal gene transfer and the evolution of transcriptional regulation in Escherichia Coli. Genome Biol, 9, 2008.

[30] AE Kazakov, DA Rodionov, MN Price, AP Arkin, I Dubchak, and PS Novichkov. Transcription factor family-based reconstruction of singleton regulons and study of the CRP/FNR, ArsR, and GntR families in Desulfovibrionales genomes. J Bacteriol, 195:29–38, 2013.

[31] PS Novichkov, ON Laikova, ES Novichkova, MS Gelfand, AP Arkin, I Dubchak, and DA Rodionov. Regprecise: a database of curated genomic inferences of transcriptional regulatory interactions in prokaryotes. Nucleic Acids Res, 38:111–118, 2010.

[32] M Ashburner, C Ball, J Blake, D Botstein, H Butler, J Cherry, A Davis, K Dolinski, S Dwight, J Eppig, M Harris, D Hill, TL Issel, A Kasarskis, S Lewis, J Matese, J Richardson, M Ringwald, G Rubin, and G Sherlock. Gene ontology: tool for the unification of biology. Nat Genet, 25, 2000.

[33] MC Frith, Y Fu, L Yu, J Chen, U Hansen, and Z Weng. Detection of functional DNA motifs via statistical over-representation. Nucleic Acids Res, 32:1372–1381, 2004.

[34] P Dhaeseleer. How does DNA sequence motif discovery work? Nat Biotechnol, 24:959–961, 2006.

[35] Lin Yang, Tianyin Zhou, Iris Dror, Anthony Mathelier, Wyeth W. Wasserman, Raluca Gordan, and Remo Rohs. Tfbsshape: a motif database for dna shape features of transcription factor binding sites. Nucleic Acids Research, 42(D1):D148–D155, 2014. doi: 10.1093/nar/gkt1087.

[36] Tianyin Zhou, Ning Shen, Lin Yang, Namiko Abe, John Horton, Richard S Mann, Harmen J Bussemaker, Raluca Gordan, and Remo Rohs. Quantitative modeling of transcription factor binding specificities using DNA shape. Proceedings of the National Academy of Sciences of the United States of America, 112(15):4654–4659, April 2015. ISSN 0027-8424. doi: 10.1073/pnas.1422023112.

[37] Liang Liu, Guangxu Jin, and Xiaobo Zhou. Modeling the relationship of epigenetic modifications to transcription factor binding. Nucleic acids research, 43(8):3873–3885, April 2015. ISSN 0305-1048. doi: 10.1093/nar/gkv255.

[38] Pavel S. Novichkov, Dmitry A. Rodionov, Elena D. Stavrovskaya, Elena S. Novichkova, Alexey E. Kazakov, Mikhail S. Gelfand, Adam P. Arkin, Andrey A. Mironov, and Inna Dubchak. Regpredict: an integrated system for regulon inference in prokaryotes by comparative genomics approach. Nucleic Acids Research, 38(suppl 2):W299–W307, 2010. doi: 10.1093/nar/gkq531.

[39] BK Cho, EM Knight, CL Barrett, and BO Palsson. Genome-wide analysis of Fis binding in Escherichia Coli indicates a causative role for A-/AT-tracts. Genome Res, 18:900–910, 2008.

[40] J Faith, B Hayete, J Thaden, I Mogno, J Wierzbowski, G Cottarel, S Kasif, J Collins, and J Gardner. Large-scale mapping and validation of Escherichia Coli transcriptional regulation from a compendium of expression profiles. PLoS Biol, 5, 2007.

[41] Y Shi, X Liao, X Zhang, G Lin, and D Schuurmans. Sparse learning based linear coherent bi-clustering. Algorithms in Bioinformatics, 7534:346–364, 2012.

[42] MN Price, KH Huang, EJ Alm, and AP Arkin. A novel method for accurate operon predictions in all sequenced prokaryotes. Nucleic Acids Res, 33:880–892, 2005.

[43] Guido H. Jajamovich and Xiaodong Wang. Maximum-parsimony haplotype inference based on sparse representations of genotypes. Trans. Sig. Proc., 60(4):2013–2023, April 2012. ISSN 1053-587X. doi: 10.1109/TSP.2011.2179542. URL http://dx.doi.org/10.1109/TSP.2011.2179542.

[44] B Dong, X Wang, and A Doucet. A new class of soft MIMO demodulation algorithms. Signal Processing, IEEE Transactions on, 51:2752–2763, 2003.

[45] T Vercauteren, D Guo, and X Wang. Joint multiple target tracking and classification in collaborative sensor networks. Selected Areas in Communications, IEEE Journal on, 23:714–723, 2005.

[46] A Doucet and X Wang. Monte Carlo methods for signal processing: a review in the statistical signal processing context. Signal Processing Magazine, IEEE, 22:152–170, 2005.

[47] KC Liang, X Wang, and D Anastassiou. A sequential Monte Carlo method for motif discovery. Signal Processing, IEEE Transactions on, 56:4496–4507, 2008.

[48] Robert J Nichols, Saunak Sen, Yoe Jin Choo, Pedro Beltrao, Matylda Zietek, Rachna Chaba, Sueyoung Lee, Krystyna M Kazmierczak, Karis J Lee, Angela Wong, Michael Shales, Susan Lovett, Malcolm E Winkler, Nevan J Krogan, Athanasios Typas, and Carol A Gross. Phenotypic landscape of a bacterial cell. Cell, 144(1):143–156, January 2011. ISSN 0092-8674. doi: 10.1016/j.cell.2010.11.052.

[49] Mikhail Pachkov, Ionas Erb, Nacho Molina, and Erik van Nimwegen. Swissregulon: a database of genome-wide annotations of regulatory sites. Nucleic Acids Research, 35(suppl 1):D127–D131, 2007. doi: 10.1093/nar/gkl857.

[50] G Dennis, BT Sherman, DA Hosack, J Yang, W Gao, HC Lane, and RA Lempicki. David: database for annotation, visualization, and integrated discovery. Genome Biol, 4:3, 2003.

[51] DW Huang, BT Sherman, and RA Lempicki. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res, 37:1013, 2009.

[52] H Salgado, M Peralta, and Gama S et al. RegulonDB v8.0: omics data sets, evolutionary conservation, regulatory phrases, cross-validated gold standards and more. Nucleic Acids Res, 41, 2013.

[53] S Jozefczuk, S Klie, G Catchpole, J Szymanski, ID Cuadros, D Steinhauser, J Selbig, and L Willmitzer. Metabolomic and transcriptomic stress response of Escherichia Coli. Mol Syst Biol, 6:5, 2010.

[54] E Masse, CK Vanderpool, and S Gottesman. Effect of RyhB small RNA on global iron use in Escherichia Coli. J Bacteriol, 187:6962–6971, 2005.

[55] GS Pettis, TJ Brickman, and MA McIntosh. Transcriptional mapping and nucleotide sequence of the Escherichia Coli fepA-fes enterobactin region. identification of a unique iron-regulated bidirectional promoter. J Biol Chem, 263:857–863, 1988.

[56] M Sauer, K Hantke, and V Braun. Sequence of the fhue outer membrane receptor gene of escherichia coli k12 and properties of mutants. Mol Microbiol, 4:427–437, 1990.

[57] AA Martinez and VJ Collado. Identifying global regulators in transcriptional regulatory networks in bacteria. Curr Opin Microbiol, 6:482–489, 2003.

[58] LJ Hauser, ML Land, SD Brown, and et al. Complete genome sequence and updated annotation of Desulfovibrio Alaskensis G20. J Bacteriol, 193:4268–4269, 2011.

[59] Paramvir S. Dehal, Marcin P. Joachimiak, Morgan N. Price, John T. Bates, Jason K. Baumohl, Dylan Chivian, Greg D. Friedland, Katherine H. Huang, Keith Keller, Pavel S. Novichkov, Inna L. Dubchak, Eric J. Alm, and Adam P. Arkin. Microbesonline: an integrated portal for comparative and functional genomics. Nucleic Acids Research, 2009. doi: 10.1093/nar/gkp919.

[60] HS Koo, HM Wu, and DM Crothers. DNA bending at Adenine·Thymine tracts. Nature, 320:501–506, 1986.

[61] MD Bradley, MB Beach, APJ Koning, TS Pratt, and R Osuna. Effects of Fis on Escherichia Coli gene expression during different growth stages. Microbiology, 153:2922–2940, 2007.

[62] Jianjun Hu, Bin Li, and Daisuke Kihara. Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res, 33(15):4899–4913, 2005. ISSN 1362-4962 (Electronic); 0305-1048 (Linking). doi: 10.1093/nar/gki791.

[63] X Shirley Liu, Douglas L Brutlag, and Jun S Liu. An algorithm for finding protein-dna binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat Biotechnol, 20(8):835–839, Aug 2002. ISSN 1087-0156 (Print); 1087-0156 (Linking). doi: 10.1038/nbt717.

[64] M.S. Waterman, J. Joyce, and M. Eggert. Computer alignment of sequences. Phylogenetic Analysis of DNA Sequences, pages 59–72, 1991.

[65] S F Altschul, W Gish, W Miller, E W Myers, and D J Lipman. Basic local alignment search tool. J Mol Biol, 215(3):403–410, Oct 1990. ISSN 0022-2836 (Print); 0022-2836 (Linking). doi: 10.1016/S0022-2836(05)80360-2.

[66] A Bairoch, P Bucher, and K Hofmann. The prosite database, its status in 1995. Nucleic Acids Res, 24(1):189–196, Jan 1996. ISSN 0305-1048 (Print); 0305-1048 (Linking).

[67] T K Attwood, M E Beck, D R Flower, P Scordis, and J N Selley. The prints protein fingerprint database in its fifth year. Nucleic Acids Res, 26(1):304–308, Jan 1998. ISSN 0305-1048 (Print); 0305-1048 (Linking).

[68] A Krogh, M Brown, I S Mian, K Sjolander, and D Haussler. Hidden markov models in computational biology. applications to protein modeling. J Mol Biol, 235(5):1501–1531, Feb 1994. ISSN 0022-2836 (Print); 0022-2836 (Linking). doi: 10.1006/jmbi.1994.1104.

[69] S R Eddy. Multiple alignment using hidden markov models. Proc Int Conf Intell Syst Mol Biol, 3:114–120, 1995. ISSN 1553-0833 (Print); 1553-0833 (Linking).

[70] Gary D. Stormo, Thomas D. Schneider, Larry Gold, and Andrzej Ehrenfeucht. Use of the ‘perceptron’ algorithm to distinguish translational initiation sites in E. coli. Nucleic Acids Res., 10(9):2997–3011, 1982.

[71] Rodger Staden. Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Res., 12(1Part2):505–519, 1984.

[72] O. G. Berg and P. H. von Hippel. Selection of dna binding sites by regulatory proteins. statistical-mechanical theory and application to operators and promoters. Journal of Molecular Biology, 193(4):723–750, 1987.

[73] T. L. Bailey and M Gribskov. Combining evidence using p-values: application to sequence homology searches. Bioinformatics, 14(1):48–54, 1998.

[74] G. Z. Hertz and G. D. Stormo. Identifying dna and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15(7-8):563–577, 1999.

[75] T. K. Man and G. D. Stormo. Non-independence of mnt repressor-operator interaction determined by a new quantitative multiple fluorescence relative affinity (qumfra) assay. Nucleic Acids Res., 29:2471–2478, 2001.

[76] P. V. Benos, A. S. Lapedes, and G. D. Stormo. Probabilistic code for dna recognition by proteins of the egr family. Journal of Molecular Biology, 323:701–727, 2002.

[77] M Lassig. From biophysics to evolutionary genetics: statistical aspects of gene regulation. BMC Bioinformatics, 8(Suppl 6):S7, 2007.

[78] G. Badis, M. F. Berger, A. A. Philippakis, S. Talukder, A. R. Gehrke, S. A. Jaeger, E. T. Chan, G. Metzler, A. Vedenko, X. Chen, H. Kuznetsov, C. F. Wang, D. Coburn, D. E. Newburger, Q. Morris, T. R. Hughes, and M. L. Bulyk. Diversity and complexity in dna recognition by transcription factors. Science, 324(5935):1720–1723, 2009.

[79] Rahul Siddharthan. Dinucleotide weight matrices for predicting transcription factor binding sites: Generalizing the position weight matrix. PLoS ONE, 5(3):e9722, 2010. URL http://dx.doi.org/10.1371%2Fjournal.pone.0009722.

[80] M. Annala, K. Laurila, H. Lahdesmaki, and M. Nykter. A linear model for transcription factor binding affinity prediction in protein binding microarrays. PLoS One, 6(5):e20059, 2011.

[81] Jacqueline M Dresch, Rowan G Zellers, Daniel K Bork, and Robert A Drewell. Nucleotide interdependency in transcription factor binding sites in the drosophila genome. Gene Regulation and Systems Biology, in press, 2016.

[82] N Cristianini and J Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.

[83] T Jaakkola, M Diekhans, and D Haussler. A discriminative framework for detecting remote protein homologies. J Comput Biol, 7(1-2):95–114, Feb-Apr 2000. ISSN 1066-5277 (Print); 1066-5277 (Linking). doi: 10.1089/10665270050081405.

[84] Christina Leslie, Eleazar Eskin, Jason Weston, and William Stafford Noble. Mismatch string kernels for svm protein classification, 2003.

[85] Christina S. Leslie, Eleazar Eskin, Adiel Cohen, Jason Weston, and William Stafford Noble. Mismatch string kernels for discriminative protein classification. Bioinformatics, 20(4):467–476, 2004. doi: 10.1093/bioinformatics/btg431. URL http://bioinformatics.oxfordjournals.org/content/20/4/467.abstract.

[86] Jose Paolo Magbanua, Estelle Runneburger, Steven Russell, and Robert White. A general pairwise interaction model provides an accurate description of in vivo transcription factor binding sites. PLoS One, 9(6):e99015, 2014.

[87] V.N. Vapnik. Statistical Learning Theory. Springer, 1998.

[88] D Lee, R Karchin, and MA Beer. Discriminative prediction of mammalian enhancers from dna sequence. Genome Res, 21(12):2167–80, 2011.

[89] Genevieve D. Erwin, Nir Oksenberg, Rebecca M. Truty, Dennis Kostka, Karl K. Murphy, Nadav Ahituv, Katherine S. Pollard, and John A. Capra. Integrating diverse datasets improves developmental enhancer prediction. PLoS Comput Biol, 6:e1003677, 2014.

[90] Jessica L Stringham, Adam S Brown, Robert A Drewell, and Jacqueline M Dresch. Flanking sequence context-dependent transcription factor binding in early Drosophila development. BMC Bioinf, 14:298, 2013.

[91] SM Gallo, DT Gerrard, D Miner, M Simich, B Des Soye, CM Bergman, and MS Halfon. Redfly v3.0: toward a comprehensive database of transcriptional regulatory elements in drosophila. Nucleic Acids Res, 39:D118–D123, 2011.

[92] Wyeth W Wasserman and Albin Sandelin. Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet, 5:276–287, 2004.

[93] JV Turatsinze, M Thomas-Chollier, M Defrance, and J van Helden. Using rsat to scan genome sequences for transcription factor binding sites and cis-regulatory modules. Nat Protoc, 3(10):1578–88, 2008.

[94] RC Hardison and J Taylor. Genomic approaches towards finding cis-regulatory modules in animals. Nat Rev Genet, 13(7):469–83, 2012.

[95] A Mathelier, X Zhao, AW Zhang, F Parcy, R Worsley-Hunt, DJ Arenillas, S Buchman, CY Chen, A Chou, H Ienasescu, J Lim, C Shyr, G Tan, M Zhou, B Lenhard, A Sandelin, and WW Wasserman. Jaspar 2014: an extensively expanded and updated open-access database of transcription factor binding profiles. Nucleic Acids Res, 42:D142–7, 2013.

[96] S Gupta, JA Stamatoyannopoulos, TL Bailey, and WS Noble. Quantifying similarity between motifs. Genome Biol, 8(2):R24, 2007.

[97] Markus W Covert, Eric M Knight, Jennifer L Reed, Markus J Herrgard, and Bernhard O Palsson. Integrating high-throughput and computational data elucidates bacterial networks. Nature, 429(6987):92–96, May 2004. ISSN 1476-4687 (Electronic); 0028-0836 (Linking). doi: 10.1038/nature02456.

[98] Anne Barnard, Alan Wolfe, and Stephen Busby. Regulation at complex bacterial promoters: how bacteria use different promoter organizations to produce different regulatory outcomes. Curr Opin Microbiol, 7(2):102–108, Apr 2004. ISSN 1369-5274 (Print); 1369-5274 (Linking). doi: 10.1016/j.mib.2004.02.011.

[99] Kevin S Myers, Huihuang Yan, Irene M Ong, Dongjun Chung, Kun Liang, Frances Tran, Sunduz Keles, Robert Landick, and Patricia J Kiley. Genome-scale analysis of escherichia coli fnr reveals complex features of transcription factor binding. PLoS Genet, 9(6):e1003565, Jun 2013. ISSN 1553-7404 (Electronic); 1553-7390 (Linking). doi: 10.1371/journal.pgen.1003565.

[100] Peter McQuilton, Susan E. St. Pierre, Jim Thurmond, and the FlyBase Consortium. Flybase 101 – the basics of navigating flybase. Nucleic Acids Res, 40:D706–D714, 2012.

[101] MO Starr, MC Ho, EJ Gunther, YK Tu, AS Shur, SE Goetz, MJ Borok, V Kang, and RA Drewell. Molecular dissection of cis-regulatory modules at the drosophila bithorax complex reveals critical transcription factor signature motifs. Dev Biol, 359(2):290–302, 2011.

[102] Marcus B Noyes, Xiangdong Meng, Atsuya Wakabayashi, Saurabh Sinha, Michael H Brodsky, and Scot A Wolfe. A systematic characterization of factors that regulate drosophila segmentation via a bacterial one-hybrid system. Nucleic Acids Res, 36(8):2547–2560, May 2008. ISSN 1362-4962 (Electronic); 0305-1048 (Linking). doi: 10.1093/nar/gkn048.

[103] T Zhou, L Yang, Y Lu, I Dror, AC Dantas Machado, T Ghane, R Di Felice, and R Rohs. Dnashape: a method for the high-throughput prediction of dna structural features on a genomic scale. Nucleic Acids Res, 41:W56–62, 2013.

[104] L. Yang, T. Zhou, I. Dror, A. Mathelier, W. W. Wasserman, R. Gordân, and R. Rohs. Tfbsshape: a motif database for dna shape features of transcription factor binding sites. Nucleic Acids Res, 42(D1):D148–D155, 2014.

[105] Markus Affolter and Konrad Basler. The decapentaplegic morphogen gradient: from pattern formation to growth regulation. Nat Rev Genet, 8(9):663–674, Sep 2007. ISSN 1471-0056 (Print); 1471-0056 (Linking). doi: 10.1038/nrg2166.

[106] Elisabeth Saller and Mariann Bienz. Direct competition between brinker and Drosophila mad in dpp target gene transcription. EMBO reports, 2(4):298–305, 2014.

[107] JE Phillips-Cremins, ME Sauria, A Sanyal, TI Gerasimova, BR Lajoie, JS Bell, CT Ong, TA Hookway, C Guo, Y Sun, MJ Bland, W Wagstaff, S Dalton, TC McDevitt, R Sen, J Dekker, J Taylor, and VG Corces. Architectural protein subclasses shape 3d organization of genomes during lineage commitment. Cell, 153(6):1281–95, 2013.

[108] Martin Oti, Jonas Falck, Martijn A Huynen, and Huiqing Zhou. Ctcf-mediated chromatin loops enclose inducible gene regulatory domains. BMC Genomics, 17:252, 2016.

[109] Jose Paolo Magbanua, Estelle Runneburger, Steven Russell, and Robert White. A variably occupied ctcf binding site in the Ultrabithorax gene in the Drosophila bithorax complex. Mol and Cell Biol, 35(1):318–330, 2015.

[110] A J Brookes. The essence of snps. Gene, 234(2):177–186, Jul 1999. ISSN 0378-1119 (Print); 0378-1119 (Linking).

[111] N Risch and K Merikangas. The future of genetic studies of complex human diseases. Science, 273(5281):1516–1517, Sep 1996. ISSN 0036-8075 (Print); 0036-8075 (Linking).

[112] P Y Kwok and Z Gu. Single nucleotide polymorphism libraries: why and how are we building them? Mol Med Today, 5(12):538–543, Dec 1999. ISSN 1357-4310 (Print); 1357-4310 (Linking).

[113] I C Gray, D A Campbell, and N K Spurr. Single nucleotide polymorphisms as tools in human genetics. Hum Mol Genet, 9(16):2403–2408, Oct 2000. ISSN 0964-6906 (Print); 0964-6906 (Linking).

[114] Vikas Bansal and Vineet Bafna. Hapcut: an efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics, 24(16):i153–9, Aug 2008. ISSN 1367-4811 (Electronic); 1367-4803 (Linking). doi: 10.1093/bioinformatics/btn298.

[115] Dan He, Arthur Choi, Knot Pipatsrisawat, Adnan Darwiche, and Eleazar Eskin. Optimal algorithms for haplotype assembly from whole-genome sequence data. Bioinformatics, 26(12):i183–i190, 06 2010. doi: 10.1093/bioinformatics/btq215. URL http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2881399/.

[116] Yun Li, Cristen J Willer, Jun Ding, Paul Scheet, and Goncalo R Abecasis. Mach: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol, 34(8):816–834, Dec 2010. ISSN 1098-2272 (Electronic); 0741-0395 (Linking). doi: 10.1002/gepi.20533.

[117] Arvind Gupta, Jan Manuch, Ladislav Stacho, and Xiaohong Zhao. Algorithm for haplotype inference via galled-tree networks with simple galls. J Comput Biol, 19(4):439–454, Apr 2012. ISSN 1557-8666 (Electronic); 1066-5277 (Linking). doi: 10.1089/cmb.2010.0145.

[118] Alexandros Iliadis, Dimitris Anastassiou, and Xiaodong Wang. A unified framework for haplotype inference in nuclear families. Ann Hum Genet, 76(4):312–325, Jul 2012. ISSN 1469-1809 (Electronic); 0003-4800 (Linking). doi: 10.1111/j.1469-1809.2012.00715.x.

[119] En-Yu Lai, Wei-Bung Wang, Tao Jiang, and Kun-Pin Wu. A linear-time algorithm for reconstructing zero-recombinant haplotype configuration on a pedigree. BMC Bioinformatics, 13 Suppl 17:S19, 2012. ISSN 1471-2105 (Electronic); 1471-2105 (Linking). doi: 10.1186/1471-2105-13-S17-S19.

[120] Dan He, Buhm Han, and Eleazar Eskin. Hap-seq: an optimal algorithm for haplotype phasing with imputation using sequencing data. J Comput Biol, 20(2):80–92, Feb 2013. ISSN 1557-8666 (Electronic); 1066-5277 (Linking). doi: 10.1089/cmb.2012.0091.

[121] Motoo Kimura and James F Crow. The number of alleles that can be maintained in a finite population. Genetics, 49(4):725–738, 04 1964. URL http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1210609/.

[122] Tamar Barzuza, Jacques S Beckmann, Ron Shamir, and Itsik Pe’er. Computational problems in perfect phylogeny haplotyping: typing without calling the allele. IEEE/ACM Trans Comput Biol Bioinform, 5(1):101–109, Jan-Mar 2008. ISSN 1545-5963 (Print); 1545-5963 (Linking). doi: 10.1109/TCBB.2007.1063.

[123] Vincenzo Liberatore. Matroid decomposition methods for the set maxima problem. In Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’98, pages 400–409, Philadelphia, PA, USA, 1998. Society for Industrial and Applied Mathematics. ISBN 0-89871-410-9. URL http://dl.acm.org/citation.cfm?id=314613.314770.

[124] Dan Gusfield. Haplotyping as perfect phylogeny: Conceptual framework and efficient solutions (extended abstract), 2002.

[125] Tamar Barzuza, Jacques S. Beckmann, Ron Shamir, and Itsik Pe’er. Computational Problems in Perfect Phylogeny Haplotyping: Xor-Genotypes and Tag SNPs, pages 14–31. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004. ISBN 978-3-540-27801-6. doi: 10.1007/978-3-540-27801-6_2. URL http://dx.doi.org/10.1007/978-3-540-27801-6_2.

[126] N Patil, A J Berno, D A Hinds, W A Barrett, J M Doshi, C R Hacker, C R Kautzer, D H Lee, C Marjoribanks, D P McDonough, B T Nguyen, M C Norris, J B Sheehan, N Shen, D Stern, R P Stokowski, D J Thomas, M O Trulson, K R Vyas, K A Frazer, S P Fodor, and D R Cox. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science, 294(5547):1719–1723, Nov 2001. ISSN 0036-8075 (Print); 0036-8075 (Linking). doi: 10.1126/science.1065573.

[127] Giuseppe Lancia, Maria Cristina Pinotti, and Romeo Rizzi. Haplotyping populations by pure parsimony: Complexity of exact and approximation algorithms. INFORMS Journal on Computing, 16(4):348–359, 2004. doi: 10.1287/ijoc.1040.0085. URL http://dx.doi.org/10.1287/ijoc.1040.0085.

[128] Dan Gusfield. Haplotype Inference by Pure Parsimony, pages 144–155. Springer Berlin Heidelberg, Berlin, Heidelberg, 2003. ISBN 978-3-540-44888-4. doi: 10.1007/3-540-44888-8_11. URL http://dx.doi.org/10.1007/3-540-44888-8_11.

[129] Lusheng Wang and Ying Xu. Haplotype inference by maximum parsimony. Bioinformatics, 19(14):1773–1780, 2003. doi: 10.1093/bioinformatics/btg239. URL http://bioinformatics.oxfordjournals.org/content/19/14/1773.abstract.

[130] Nadezhda Sazonova, Edward Sazonov, and E James Harner. Algorithm for haplotype resolution and block partitioning for partial xor-genotype data. J Biomed Inform, 43(1):51–59, Feb 2010. ISSN 1532-0480 (Electronic); 1532-0464 (Linking). doi: 10.1016/j.jbi.2009.08.009.

[131] Paola Bonizzoni, Gianluca Della Vedova, Riccardo Dondi, Yuri Pirola, and Romeo Rizzi. Pure parsimony xor haplotyping. IEEE/ACM Trans Comput Biol Bioinform, 7(4):598–610, Oct-Dec 2010. ISSN 1557-9964 (Electronic); 1545-5963 (Linking). doi: 10.1109/TCBB.2010.52.

[132] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions-I. Mathematical Programming, 14(1):265–294, 1978. ISSN 1436-4646. doi: 10.1007/BF01588971. URL http://dx.doi.org/10.1007/BF01588971.

[133] Andreas Krause and Volkan Cevher. Submodular dictionary selection for sparse representation. In Johannes Fürnkranz and Thorsten Joachims, editors, Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 567–574. Omnipress, 2010. URL http://www.icml2010.org/papers/366.pdf.

[134] Tianhua Niu, Zhaohui S Qin, Xiping Xu, and Jun S Liu. Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. American Journal of Human Genetics, 70(1):157–169, 01 2002. URL http://www.ncbi.nlm.nih.gov/pmc/articles/PMC448439/.

[135] K. c. Liang and X. Wang. A deterministic sequential monte carlo method for haplotype inference. IEEE Journal of Selected Topics in Signal Processing, 2(3):322–331, June 2008. ISSN 1932-4553. doi: 10.1109/JSTSP.2008.923842.

[136] Jody Hey. What’s so hot about recombination hotspots? PLoS Biology, 2(6):e190, 06 2004. doi: 10.1371/journal.pbio.0020190. URL http://www.ncbi.nlm.nih.gov/pmc/articles/PMC423158/.

[137] A Bonin, E Bellemain, P Bronken Eidesen, F Pompanon, C Brochmann, and P Taberlet. How to track and assess genotyping errors in population genetics studies. Mol Ecol, 13(11):3261–3273, Nov 2004. ISSN 0962-1083 (Print); 0962-1083 (Linking). doi: 10.1111/j.1365-294X.2004.02346.x.

[138] Francois Pompanon, Aurelie Bonin, Eva Bellemain, and Pierre Taberlet. Genotyping errors: causes, consequences and solutions. Nat Rev Genet, 6(11):847–846, 11 2005. URL http://dx.doi.org/10.1038/nrg1707.

[139] C A Hackett and L B Broadfoot. Effects of genotyping errors, missing values and segregation distortion in molecular marker data on the construction of linkage maps. Heredity, 90(1):33–38, 2003. URL http://dx.doi.org/10.1038/sj.hdy.6800173.

[140] Jonathan Marchini, Bryan Howie, Simon Myers, Gil McVean, and Peter Donnelly. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet, 39(7):906–913, Jul 2007. ISSN 1061-4036 (Print); 1061-4036 (Linking). doi: 10.1038/ng2088.

[141] Leonardo Tininini, Paola Bertolazzi, Alessandra Godi, and Giuseppe Lancia. Collhaps: a heuristic approach to haplotype inference by parsimony. IEEE/ACM Trans Comput Biol Bioinform, 7(3):511–523, Jul-Sep 2010. ISSN 1557-9964 (Electronic); 1545-5963 (Linking). doi: 10.1109/TCBB.2008.130.

[142] Richard R Hudson. Generating samples under a wright-fisher neutral model of genetic variation. Bioinformatics, 18(2):337–338, Feb 2002. ISSN 1367-4803 (Print); 1367-4803 (Linking).

[143] Bjarni V. Halldórsson, Vineet Bafna, Nathan Edwards, Ross Lippert, Shibu Yooseph, and Sorin Istrail. A Survey of Computational Methods for Determining Haplotypes, pages 26–47. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004. ISBN 978-3-540-24719-7. doi: 10.1007/978-3-540-24719-7_3. URL http://dx.doi.org/10.1007/978-3-540-24719-7_3.

[144] Matthew Stephens and Paul Scheet. Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation. American Journal of Human Genetics, 76(3):449–462, 03 2005. URL http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1196397/.

[145] Laurent Excoffier, Guillaume Laval, and David Balding. Gametic phase estimation over large genomic regions using an adaptive window approach. Hum Genomics, 1(1):7–19, Nov 2003. ISSN 1473-9542 (Print); 1473-9542 (Linking).

[146] Ilaria Iacobucci, Marco Sazzini, Paolo Garagnani, Anna Ferrari, Alessio Boattini, Annalisa Lonetti, Cristina Papayannidis, Vilma Mantovani, Elena Marasco, Emanuela Ottaviani, Simona Soverini, Domenico Girelli, Donata Luiselli, Marco Vignetti, Michele Baccarani, and Giovanni Martinelli. A polymorphism in the chromosome 9p21 anril locus is associated to philadelphia positive acute lymphoblastic leukemia. Leuk Res, 35(8):1052–1059, Aug 2011. ISSN 1873-5835 (Electronic); 0145-2126 (Linking). doi: 10.1016/j.leukres.2011.02.020.

[147] Eric Pasmant, Ingrid Laurendeau, Delphine Heron, Michel Vidaud, Dominique Vidaud, and Ivan Bieche. Characterization of a germ-line deletion, including the entire ink4/arf locus, in a melanoma-neural system tumor family: identification of anril, an antisense noncoding RNA whose expression coclusters with arf. Cancer Res, 67(8):3963–3969, Apr 2007. ISSN 0008-5472 (Print); 0008-5472 (Linking). doi: 10.1158/0008-5472.CAN-06-2004.

[148] F S Collins, M S Guyer, and A Chakravarti. Variations on a theme: cataloging human dna sequence variation. Science, 278(5343):1580–1581, Nov 1997. ISSN 0036-8075 (Print); 0036-8075 (Linking).

[149] L Kruglyak and D A Nickerson. Variation is the spice of life. Nat Genet, 27(3):234–236, Mar 2001. ISSN 1061-4036 (Print); 1061-4036 (Linking). doi: 10.1038/85776.

[150] Libin Deng, Yuezheng Zhang, Jian Kang, Tao Liu, Hongbin Zhao, Yang Gao, Chaohua Li, Hao Pan, Xiaoli Tang, Dunmei Wang, Tianhua Niu, Huanming Yang, and Changqing Zeng. An unusual haplotype structure on human chromosome 8p23 derived from the inversion polymorphism. Hum Mutat, 29(10):1209–1216, Oct 2008. ISSN 1098-1004 (Electronic); 1059-7794 (Linking). doi: 10.1002/humu.20775.

[151] G C Johnson, L Esposito, B J Barratt, A N Smith, J Heward, G Di Genova, H Ueda, H J Cordell, I A Eaves, F Dudbridge, R C Twells, F Payne, W Hughes, S Nutland, H Stevens, P Carr, E Tuomilehto-Wolf, J Tuomilehto, S C Gough, D G Clayton, and J A Todd. Haplotype tagging for the identification of common disease genes. Nat Genet, 29(2):233–237, Oct 2001. ISSN 1061-4036 (Print); 1061-4036 (Linking). doi: 10.1038/ng1001-233.

[152] Daniel O Stram. Software for tag single nucleotide polymorphism selection. Hum Genomics, 2(2):144–151, Jun 2005. ISSN 1479-7364 (Electronic); 1473-9542 (Linking).

[153] Jingwu He and Alexander Zelikovsky. Informative snp selection methods based on snp prediction. IEEE Trans Nanobioscience, 6(1):60–67, Mar 2007. ISSN 1536-1241 (Print); 1536-1241 (Linking).

[154] Xiong Li, Bo Liao, Lijun Cai, Zhi Cao, and Wen Zhu. Informative snps selection based on two-locus and multilocus linkage disequilibrium: criteria of max-correlation and min-redundancy. IEEE/ACM Trans Comput Biol Bioinform, 10(3):688–695, May-Jun 2013. ISSN 1557-9964 (Electronic); 1545-5963 (Linking). doi: 10.1109/TCBB.2013.61.

[155] Guimei Liu, Yue Wang, and Limsoon Wong. Fasttagger: an efficient algorithm for genome-wide tag snp selection using multi-marker linkage disequilibrium. BMC Bioinformatics, 11:66, 2010. ISSN 1471-2105 (Electronic); 1471-2105 (Linking). doi: 10.1186/1471-2105-11-66.

[156] Christopher S Carlson, Michael A Eberle, Mark J Rieder, Qian Yi, Leonid Kruglyak, and Deborah A Nickerson. Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am J Hum Genet, 74(1):106–120, Jan 2004. ISSN 0002-9297 (Print); 0002-9297 (Linking). doi: 10.1086/381000.

[157] M J Daly, J D Rioux, S F Schaffner, T J Hudson, and E S Lander. High-resolution haplotype structure in the human genome. Nat Genet, 29(2):229–232, Oct 2001. ISSN 1061-4036 (Print); 1061-4036 (Linking). doi: 10.1038/ng1001-229.

[158] A J Jeffreys, L Kauppi, and R Neumann. Intensely punctate meiotic recombination in the class ii region of the major histocompatibility complex. Nat Genet, 29(2):217–222, Oct 2001. ISSN 1061-4036 (Print); 1061-4036 (Linking). doi: 10.1038/ng1001-217.

[159] Kui Zhang, Zhaohui S Qin, Jun S Liu, Ting Chen, Michael S Waterman, and Fengzhu Sun. Haplotype block partitioning and tag snp selection using genotype data and their applications to association studies. Genome Research, 14(5):908–916, 05 2004. doi: 10.1101/gr.1837404. URL http://www.ncbi.nlm.nih.gov/pmc/articles/PMC479119/.

[160] Ali Katanforoush, Mehdi Sadeghi, Hamid Pezeshk, and Elahe Elahi. Global haplotype partitioning for maximal associated snp pairs. BMC Bioinformatics, 10:269, 2009. ISSN 1471-2105 (Electronic); 1471-2105 (Linking). doi: 10.1186/1471-2105-10-269.

[161] Zhenqiu Liu and Shili Lin. Multilocus ld measure and tagging snp selection with generalized mutual information. Genet Epidemiol, 29(4):353–364, Dec 2005. ISSN 0741-0395 (Print); 0741-0395 (Linking). doi: 10.1002/gepi.20092.

[162] Bo Liao, Xiong Li, Lijun Cai, Zhi Cao, and Haowen Chen. A hierarchical clustering method of selecting kernel snp to unify informative snp and tag snp. IEEE/ACM Trans Comput Biol Bioinform, 12(1):113–122, Jan-Feb 2015. ISSN 1557-9964 (Electronic); 1545-5963 (Linking). doi: 10.1109/TCBB.2014.2351797.

[163] Paul I W de Bakker, Roman Yelensky, Itsik Pe’er, Stacey B Gabriel, Mark J Daly, and David Altshuler. Efficiency and power in genetic association studies. Nat Genet, 37(11):1217–1223, Nov 2005. ISSN 1061-4036 (Print); 1061-4036 (Linking). doi: 10.1038/ng1669.

[164] Chuan-Kang Ting, Wei-Ting Lin, and Yao-Ting Huang. Multi-objective tag snps selection using evolutionary algorithms. Bioinformatics, 26(11):1446–1452, Jun 2010. ISSN 1367-4811 (Electronic); 1367-4803 (Linking). doi: 10.1093/bioinformatics/btq158.

[165] Richard Judson, Benjamin Salisbury, Julie Schneider, Andreas Windemuth, and J Claiborne Stephens. How many snps does a genome-wide haplotype map require? Pharmacogenomics, 3(3):379–391, May 2002. ISSN 1462-2416 (Print); 1462-2416 (Linking). doi: 10.1517/14622416.3.3.379.

[166] Hadar I Avi-Itzhak, Xiaoping Su, and Francisco M De La Vega. Selection of minimum subsets of single nucleotide polymorphisms to capture haplotype block diversity. Pac Symp Biocomput, pages 466–477, 2003. ISSN 2335-6936 (Print).

[167] Jochen Hampe, Stefan Schreiber, and Michael Krawczak. Entropy-based snp selection for genetic association studies. Hum Genet, 114(1):36–43, Dec 2003. ISSN 0340-6717 (Print); 0340-6717 (Linking). doi: 10.1007/s00439-003-1017-2.

[168] Wei-Yi Cheng, Tai-Hsien Ou Yang, and Dimitris Anastassiou. Biomolecular events in cancer revealed by attractor metagenes. PLoS Comput Biol, 9(2):e1002920, 2013. ISSN 1553-7358 (Electronic); 1553-734X (Linking). doi: 10.1371/journal.pcbi.1002920.

[169] R C Lewontin. On measures of gametic disequilibrium. Genetics, 120(3):849–852, Nov 1988. ISSN 0016-6731 (Print); 0016-6731 (Linking).

[170] K Hao. Genome-wide selection of tag snps using multiple-marker correlation. Bioinformatics, 23(23):3178–3184, Dec 2007. ISSN 1367-4811 (Electronic); 1367-4803 (Linking). doi: 10.1093/bioinformatics/btm496.

[171] Wei-Yi Cheng, Tai-Hsien Ou Yang, and Dimitris Anastassiou. Development of a prognostic model for breast cancer survival in an open challenge environment. Sci Transl Med, 5(181):181ra50, Apr 2013. ISSN 1946-6242 (Electronic); 1946-6234 (Linking). doi: 10.1126/scitranslmed.3005974.

[172] Tai-Hsien Ou Yang, Wei-Yi Cheng, Tian Zheng, Matthew A Maurer, and Dimitris Anastassiou. Breast cancer prognostic biomarker using attractor metagenes and the fgd3-susd3 metagene. Cancer Epidemiol Biomarkers Prev, 23(12):2850–2856, Dec 2014. ISSN 1538-7755 (Electronic); 1055-9965 (Linking). doi: 10.1158/1055-9965.EPI-14-0399.

[173] Wei-Yi Cheng, Tai-Hsien Ou Yang, Hui Shen, Peter W. Laird, and Dimitris Anastassiou. arxiv:1306.2584.

[174] Matlab and machine learning for bioinformatics toolbox: metafeatures, release 2014b. The MathWorks Inc., Natick, Massachusetts, United States, 2014.

[175] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, 2006. ISBN 0471241954.

[176] Kate R Rosenbloom, Cricket A Sloan, Venkat S Malladi, Timothy R Dreszer, Katrina Learned, Vanessa M Kirkup, Matthew C Wong, Morgan Maddren, Ruihua Fang, Steven G Heitner, Brian T Lee, Galt P Barber, Rachel A Harte, Mark Diekhans, Jeffrey C Long, Steven P Wilder, Ann S Zweig, Donna Karolchik, Robert M Kuhn, David Haussler, and W James Kent. Encode data in the ucsc genome browser: year 5 update. Nucleic Acids Res, 41(Database issue):D56–63, Jan 2013. ISSN 1362-4962 (Electronic); 0305-1048 (Linking). doi: 10.1093/nar/gks1172.

[177] Bryan N Howie, Peter Donnelly, and Jonathan Marchini. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet, 5(6):e1000529, Jun 2009. ISSN 1553-7404 (Electronic); 1553-7390 (Linking). doi: 10.1371/journal.pgen.1000529.

[178] Bryan Howie, Jonathan Marchini, and Matthew Stephens. Genotype imputation with thousands of genomes. G3 (Bethesda), 1(6):457–470, Nov 2011. ISSN 2160-1836 (Electronic); 2160-1836 (Linking). doi: 10.1534/g3.111.001198.

[179] Ewan Birney and Nicole Soranzo. Human genomics: The end of the start for population sequencing. Nature, 526(7571):52–53, Oct 2015. ISSN 1476-4687 (Electronic); 0028-0836 (Linking). doi: 10.1038/526052a.

[180] Paul I W De Bakker, Robert R Graham, David Altshuler, Brian E Henderson, and Christopher A Haiman. Transferability of tag snps to capture common genetic variation in dna repair genes across multiple populations. Pac Symp Biocomput, pages 478–486, 2006. ISSN 2335-6936 (Print).

[181] J C Barrett, B Fry, J Maller, and M J Daly. Haploview: analysis and visualization of ld and haplotype maps. Bioinformatics, 21(2):263–265, Jan 2005. ISSN 1367-4803 (Print); 1367-4803 (Linking). doi: 10.1093/bioinformatics/bth457.

[182] Carsten O Daub, Ralf Steuer, Joachim Selbig, and Sebastian Kloska. Estimating mutual information using b-spline functions – an improved similarity measure for analysing gene expression data. BMC Bioinformatics, 5:118, 2004. doi: 10.1186/1471-2105-5-118. URL http://www.ncbi.nlm.nih.gov/pmc/articles/PMC516800/.