
Algorithmic Methods for Multi-Omics Biomarker Discovery

A dissertation presented to the faculty of the Russ College of Engineering and Technology of Ohio University

In partial fulfillment of the requirements for the degree Doctor of Philosophy

Yichao Li December 2018

© 2018 Yichao Li. All Rights Reserved.

This dissertation titled Algorithmic Methods for Multi-Omics Biomarker Discovery

by YICHAO LI

has been approved for the School of Electrical Engineering and Computer Science and the Russ College of Engineering and Technology by

Lonnie Welch Professor of Electrical Engineering and Computer Science

Dennis Irwin
Dean, Russ College of Engineering and Technology

Abstract

LI, YICHAO, Ph.D., December 2018, Electrical Engineering and Computer Science

Algorithmic Methods for Multi-Omics Biomarker Discovery (138 pp.)

Director of Dissertation: Lonnie Welch

The central dogma of molecular biology states that DNA is transcribed into RNA, which is then translated into protein. The flow of genetic information in time and space is orchestrated by complex regulatory mechanisms. With the advent of modern biotechnology, our understanding of proteomics, transcriptomics, and genomics has deepened. However, bioinformatic tools for biomarker discovery in the different types of omics data are still lacking. To address these issues, we developed novel algorithmic methods for three primary omics. Proteins are the main executors of cellular functions. At the proteomic level, we developed machine learning models for early diagnosis of type 2 diabetes based on the abundance of post-translational modifications (PTMs). Our models can interpret data and perform integrative analysis together with clinical parameters such as HbA1C and fasting plasma glucose. In the results, we identified glycated lysine-141 of haptoglobin as a potential biomarker. Gene regulation is conducted by cis-regulatory elements and transcription factors. At the transcriptomic level, we developed the Emotif Alpha bioinformatic pipeline for DNA motif discovery and selection using RNA-seq, ChIP-seq, and gene homology data. We applied this pipeline to multiple species, including human, mouse, plants, and nematodes. The discovered motifs were validated using Gaussia Luciferase (GLuc) reporter assays. The 3D genome architecture in the nucleus involves the spatial organization of nuclear bodies such as the histone locus body (HLB). At the 3D genomics level, we developed a bioinformatic pipeline for characterizing locus-specific interactions. Specifically, we integrated Hi-C, GAM, and SPRITE data and identified a complex chromatin organization signature of the Hist1 cluster in mouse embryonic stem cells (mESC).
In addition, we performed network hub analysis and identified hubs of diverse functions. These hubs contained not only histone and other active genes, but also lamina-associated domains (LADs) and polycomb domains. Motif and motif pair analyses further revealed putative transcription factors that might play important roles in each interaction hub.

Acknowledgments

I am deeply grateful to my adviser Dr. Lonnie Welch, who helped me grow as an independent researcher. This work would not have been possible without his guidance and encouragement. I would like to thank Dr. Ana Pombo for her helpful advice on the 3D genome organization project. I would also like to thank Dr. Razvan Bunescu for his helpful discussions on machine learning questions. Many thanks to my dissertation committee, all of my collaborators, and all members of the Ohio University bioinformatics lab.

Table of Contents

Page

Abstract ...... 3

Acknowledgments ...... 5

List of Tables ...... 9

List of Figures ...... 11

List of Acronyms ...... 15

1 Introduction ...... 16
1.1 Motivation ...... 16
1.2 Contributions ...... 16
1.3 Gene transcriptional regulation ...... 17
1.3.1 Phase 1: prior to gene activation ...... 18
1.3.2 Phase 2: pioneer TFs to activate gene expression ...... 18
1.3.3 Phase 3: assembly of the transcription initiation complex (PIC) ...... 18
1.3.4 Phase 4: transcription initiation, elongation, and termination ...... 18
1.4 Machine learning ...... 19
1.5 Organization of the dissertation ...... 22

2 Discovery of Diagnostic Biomarkers in Type 2 Diabetes ...... 23
2.1 Introduction ...... 23
2.2 Machine learning methods ...... 24
2.2.1 Exploratory data analysis ...... 24
2.2.2 Supervised learning ...... 25
2.2.2.1 Classification ...... 26
2.2.2.2 Regression ...... 34
2.2.2.3 Feature Selection ...... 34
2.2.2.4 Overfitting and Cross Validation ...... 35
2.2.3 Unsupervised Learning ...... 37
2.2.4 Toolboxes in Python ...... 39
2.2.5 Remarks ...... 40
2.3 Results and Discussion ...... 40
2.3.1 Analyses of newly diagnosed T2D patients ...... 41
2.3.1.1 Data description ...... 41
2.3.1.2 Combinatorial analysis of glycated peptides, FPG, and HbA1C ...... 41

2.3.1.3 Classification of T2D patients and controls using SVM-RFE ...... 42
2.3.1.4 Subtypes identification ...... 43
2.3.2 Analyses of long-term controlled T2D patients ...... 46
2.3.3 Analysis of prediabetic patients ...... 48
2.4 Conclusion ...... 48

3 Identification of transcription factor binding sites in multiple species ...... 50
3.1 Introduction ...... 50
3.2 Methods ...... 52
3.2.1 Emotif Alpha: a multi-omics DNA motif discovery and selection pipeline ...... 52
3.2.2 Workflow of the Brugia malayi project ...... 54
3.2.3 Workflow of the Arabidopsis thaliana project ...... 56
3.2.4 Workflow of motif selection in ChIP-seq data ...... 58
3.3 Promoter analysis in Brugia malayi ...... 62
3.3.1 Stage-specific motifs ...... 62
3.3.2 Sex-biased motifs ...... 65
3.4 Analysis of pollen-specific HRGP expression in Arabidopsis thaliana ...... 67
3.4.1 Identification of pollen-specific HRGPs ...... 67
3.4.2 Discovery of putative motifs ...... 69
3.5 In vivo prediction of transcription factor binding sites in 14 cell types ...... 70
3.5.1 Overview of the ENCODE-DREAM Challenge ...... 70
3.5.2 Data Description ...... 71
3.5.3 Multi-omics models ...... 71
3.5.4 Competition Performance ...... 74
3.6 Motif selection in ChIP-seq data ...... 75
3.6.1 Comparison of set cover based methods ...... 75
3.6.2 Shared motifs between the solutions of set cover based methods and the enrichment method ...... 77
3.6.3 Putative cofactors identified by set cover based methods ...... 79
3.7 Conclusion ...... 80

4 Characterization of Locus-Specific Chromatin Interactions ...... 81
4.1 Introduction ...... 81
4.1.1 History ...... 81
4.1.2 The Hi-C method ...... 82
4.1.3 The GAM method ...... 82
4.1.4 The SPRITE method ...... 84
4.1.5 Genome architecture and nuclear organization ...... 85
4.1.5.1 Nucleosomes ...... 85
4.1.5.2 Chromatin loops ...... 85
4.1.5.3 The loop extrusion hypothesis ...... 86

4.1.5.4 TADs, sub-TADs, mega-TADs ...... 87
4.1.5.5 Compartments A and B ...... 87
4.1.5.6 Histone locus body ...... 88
4.1.6 Bioinformatics tools for chromatin interaction analysis ...... 88
4.1.6.1 Identification of chromatin interactions ...... 88
4.1.6.2 Identification of higher-order chromatin structures ...... 89
4.1.6.3 Visualization of chromatin interactions ...... 89
4.1.6.4 Differential interaction analysis ...... 89
4.2 Methods ...... 90
4.2.1 A multi-omics bioinformatics pipeline for analyzing locus-specific chromatin interactions ...... 90
4.2.2 Motif enrichment analysis ...... 92
4.2.3 Motif pair enrichment analysis ...... 93
4.3 Analysis of Hist1-specific chromatin interactions ...... 96
4.3.1 Chromatin interactions between histone genes in the Hist1 clusters ...... 96
4.3.2 Chromatin interactions between lamina-associated domains in the Hist1 cluster ...... 98
4.3.3 Diverse functions of interaction hubs ...... 99
4.3.4 Putative motifs and motif pairs that regulate interaction hubs ...... 101
4.4 Conclusion ...... 106

5 Discussion ...... 108
5.1 Summary ...... 108
5.2 Towards precision medicine ...... 108
5.3 How to build a good machine learning model? ...... 108
5.4 How to find a good motif? ...... 110
5.5 How to hack the 3D genome? ...... 111

References ...... 114

Appendix: Tables of the evaluation results for the four motif selection methods ...... 134

List of Tables

Table Page

3.1 Stage-specific promoter motifs for B. malayi. Motif names are given in the first column. The comparison of stages is shown below each motif name; for example, a comparison of L4 vs. L3 represents the genes up-regulated in L4 compared to L3. Motif discovery is then performed on the promoters of up-regulated genes. aFrequency of a motif in up-regulated gene promoters. bRelative frequency of a motif in up-regulated gene promoters vs. background promoters ...... 63
3.2 Stage-specific promoter motifs for B. malayi (cont.). ∗These two motifs have been validated in vitro ...... 64
3.2 Thirteen pollen-specific HRGP genes. ∗These genes have been reported to be pollen-specific. Level of expression is computed using all the gene expression values in pollen. Extremely high: 3 standard deviations (STD) above the mean. High: 2 STD above the mean but within 3 STD of the mean ...... 68
3.3 Expression of 13 pollen-specific HRGP genes in all examined tissues. The gene expression value for each tissue was averaged across all the samples in that tissue after log2 transformation ...... 68
3.4 List of putative promoter motifs for pollen-specific HRGPs ...... 69
3.5 Summary of the performance metrics evaluated by the ENCODE-DREAM Challenge. These 13 TF-cell-type combinations formed the final competition for the conference round. Our best-ranked model is the CTCF model ...... 74
3.6 Shared motifs between the set cover and enrichment methods ...... 78
3.7 Putative cofactors discovered by the set cover based methods ...... 79

4.1 Methods for identifying differential interactions ...... 90
4.2 Annotation of the top five Hi-C hubs. The genomic locations are shown in the first column. Hubs are ranked by number of contacts ...... 100
4.3 Annotation of the top five SLICE hubs. The genomic locations are shown in the first column. Hubs are ranked by number of contacts ...... 100
4.4 Annotation of the top five SPRITE hubs. The genomic locations are shown in the first column. Hubs are ranked by number of contacts ...... 101
4.5 Enriched promoter motifs in the top five Hi-C interaction hubs. The top two enriched motifs, where present, are shown for each hub ...... 102
4.6 Enriched promoter motifs in the top five SLICE interaction hubs. The top two enriched motifs, where present, are shown for each hub ...... 102
4.7 Enriched promoter motifs in the top five SPRITE interaction hubs. The top two enriched motifs, where present, are shown for each hub ...... 103
4.8 Enriched motif pairs in the promoter-associated interactions from the top five Hi-C interaction hubs. The top two enriched motif pairs, where present, are shown for each interaction type in each hub ...... 104

4.9 Enriched motif pairs in the promoter-associated interactions from the top five SLICE interaction hubs. The top two enriched motif pairs, where present, are shown for each interaction type in each hub ...... 105
4.10 Enriched motif pairs in the promoter-associated interactions from the top five SPRITE interaction hubs. The top two enriched motif pairs, where present, are shown for each interaction type in each hub ...... 106

A.1 Foreground coverage (%) comparison across 55 factors. Each row is a transcription factor group and the columns are the four methods. The best foreground coverage for each factor group is shown in bold. The last row shows the number of wins for each method across the 55 datasets ...... 135
A.2 Number of selected motifs compared across 55 factors. Each row is a transcription factor group and the columns are the four methods. The lowest number of selected motifs for each factor group is shown in bold. The last row shows the number of wins for each method across the 55 datasets ...... 136
A.3 Background coverage (%) comparison across 55 factors. Each row is a transcription factor group and the columns are the four methods. The best background coverage for each factor group is shown in bold. The last row shows the number of wins for each method across the 55 datasets ...... 137
A.4 Error rate (%) comparison across 55 factors. Each row is a transcription factor group and the columns are the four methods. The best error rate for each factor group is shown in bold. The last row shows the number of wins for each method across the 55 datasets ...... 138

List of Figures

Figure Page

1.1 Systematic view of machine learning based bioinformatic data analysis. This flowchart represents 3 main steps: exploratory data analysis, applying machine learning algorithms, and evaluating the performance. In each step, data visualization should be done to provide interpretation and sanity checks ...... 20

2.1 Flowchart of a typical stacking model. The input is the raw data or features learned from unsupervised feature learning. A typical stacking model can have multiple layers of classifiers or regressors. In the last layer, a meta-learner is used to combine all the predicted values (e.g. probabilities) into a single output value ...... 31
2.2 Nested cross validation framework [1]. This procedure has two loops. An outer CV loop splits the data into K1 folds. For each outer fold, one fold is used as the testing set and the remaining folds as the training set. An inner CV loop splits the training set into K2 folds. For each inner fold, one fold is used as the validation set and the remaining folds as the learning set. Then, for each parameter setting, a machine learning model is trained on the learning set and evaluated on the validation set ...... 38
2.3 Scatter plot of HP K141 and HbA1C. T2D patients are represented by blue dots and non-diabetics (NGT, normal glucose tolerance) are denoted by red dots. The cutoff was 6.0% for HbA1C and 30 fmol/mg for HP K141. Three T2D patients were misclassified as non-diabetic (purple arrows). One normal individual was misclassified as T2D (green arrow) ...... 42
2.4 Principal components visualization of the newly diagnosed patients using a set of 15 features selected by SVM-RFE. T2D patients are represented by blue dots and non-diabetics are denoted by red dots. The ellipses are drawn at the 95% confidence interval using the level parameter in ggplot ...... 43
2.5 Clustering stability scores for different numbers of clusters. The scores were calculated using the adjusted Rand score from scikit-learn. The EM algorithm for the Gaussian Mixture Model was run 100 times with a sampling ratio of 0.8 for each number of clusters ...... 44
2.6 Principal components visualization of the 3 subtypes in newly diagnosed patients. The ellipses are drawn at the 95% confidence interval ...... 45
2.7 Box plot of all the features in the 3 subtypes. The colors correspond to the three subtypes in the PCA plot above ...... 45
2.8 Principal components visualization of the long-term controlled patients using a set of 7 features selected by RF-RFE. T2D patients are represented by blue dots and non-diabetics are denoted by red dots. The ellipses are drawn at the 95% confidence interval using the level parameter in ggplot ...... 47

2.9 Principal components visualization of the 3 subtypes in long-term controlled patients. The ellipses are drawn at the 95% confidence interval ...... 47

3.1 The Emotif Alpha pipeline. Users can selectively run each step or a combination of steps by modifying the configuration file. For a complete pipeline, the user should provide a list of differentially expressed genes. The promoter retrieval step generates the foreground promoters and background promoters. Then an ensemble motif discovery is performed. The motif scanning step is used to generate motif mapping information. Users can filter motifs by using a coverage filter, running the motif selection algorithms, or performing an enrichment test. Next, de novo motifs are matched to known TFBSs. Lastly, final high-confidence motifs are reported after conservation analysis ...... 53
3.2 Workflow of Brugia malayi promoter motif identification. There are seven main steps: 1) determination of up-regulated genes, 2) promoter retrieval, 3) ensemble motif discovery, 4) in silico motif selection, 5) motif database query, 6) conservation analysis, 7) experimental validation. Steps 2 to 6 were performed using the Emotif Alpha pipeline ...... 55
3.3 Bioinformatics workflow used to identify putative promoter motifs in the 13 pollen-specific HRGPs in A. thaliana. Pollen-specific HRGPs were identified using the Araport 11 RNA-seq gene expression database. Non-pollen HRGPs were defined as HRGP genes that were not expressed in pollen. Known transcription factor binding sites in A. thaliana were downloaded from plantTFDB [2]. Using gene expression as a filter, 99 TFs were found to be expressed in pollen. Next, the Emotif Alpha pipeline was used to perform ensemble motif discovery, known TFBS matching, conservation analysis, coverage filtering, and redundant motif removal. The number of genes or motifs retained in a given step is shown in parentheses ...... 57
3.4 Motif selection evaluation pipeline using ENCODE datasets. For a given TF, ChIP-seq experiments in different conditions were combined and duplicate peaks were removed. Motif discovery was then performed to obtain candidate motifs. The evaluation datasets contain 10,000 randomly selected peaks, 10,000 randomly selected background sequences, and the discovered motifs. Four motif selection algorithms were evaluated using nested cross-validation ...... 59
3.5 Nested cross-validation (CV) workflow for evaluating motif selection algorithms. A boolean matrix containing the motif occurrence information was given as input. Then the data was split twice in an outer CV and an inner CV. Inside each inner CV, a grid search of the parameters was done. Final optimal parameters were obtained in the outer CV by selecting the most frequent optimal parameters in the inner CV. Lastly, the final selected motifs were obtained using the whole dataset and the optimal parameters ...... 61

3.6 A conserved site of motif M3.2 in three species. The occurrence of M3.2 in the Bm10655 promoter region is conserved in both C. elegans (promoter of WBGene00003590) and O. volvulus (promoter of WBGene00246409) ...... 65
3.7 A putative TATA-box element that is enriched in male up-regulated genes. The occurrences of this motif are conserved in both C. elegans and O. volvulus ...... 66
3.8 Sequence visualization of the occurrences of the three putative motifs in HRGP gene promoter regions. Positions range from the translation start site (i.e. 0) to 1000 bp upstream. Motif 2 and motif 3 seemed to form a module with a gap distance of around 220 bp. This pattern occurs in AT3G18810, AT3G57690, AT1G24520, AT3G01700, and AT4G33970 ...... 70
3.9 An example of multi-omics features used in the final model. The last column is the class label: 1 indicates bound, 0 indicates unbound. The data were highly imbalanced, with an average positive vs. negative ratio of 1:50 ...... 73
3.10 Estimated computational time for the 5 models that we built. The last model was the one submitted to the final competition. Models 1 to 4 were used in the leaderboard round to tune and adjust the models ...... 73
3.11 Boxplots of the 4 evaluation metrics. Median values and all the data points are shown ...... 76

4.1 Timeline of studies of chromatin structure and nuclear organization. This figure was made for the Wikipedia article "Chromosome Conformation Capture" ...... 81
4.2 Workflow of Hi-C, GAM, and SPRITE. All three techniques start with crosslinking of DNA-protein complexes. The Hi-C method then performs DNA ligation and paired-end sequencing. The GAM method performs cryosectioning and laser microdissection to extract DNA fragments that are captured in one nuclear profile. With hundreds of nuclear profiles from different cells, the GAM method then performs statistical inference to infer interaction probabilities. The SPRITE method performs a split-and-pool procedure and repeats the process multiple times. As a result, interacting DNA sequences share the same barcode, which can then be used to infer interaction probabilities ...... 83
4.3 The multi-omics pipeline for integrating Hi-C, GAM, and SPRITE data. Significant interactions were extracted based on an interaction score filter. Locus-specific interactions were extracted based on genomic location using a locus filter. Interactions were annotated using numeric features, such as gene expression and DNA motifs, and categorical features, such as gene annotation, LAD regions, and chromHMM annotation. This pipeline provides visualizations such as heatmaps, Cytoscape [3], and the WashU epigenome browser [4] ...... 91

4.4 An example of motif pair enumeration and occurrence calculation. In this example, two interactions are provided in the interaction table. The occurrences of motif 1 and motif 2 in the interacting windows are shown in the motif mapping table. Step 1 is used to generate another two motif mapping tables. Step 2 is used to generate the motif pair union table. Steps 3 and 4 are used to generate the motif interaction table. In the last step, a matrix subtraction is performed between the motif union table and the motif intersection table ...... 94
4.5 GPU-accelerated motif pair enumeration and occurrence calculation. This is an overview of the motif pair finding algorithm. Inputs are given as an interaction table and a motif mapping table. To count motif pair occurrences, the motif pair union table is generated through step 2 and the motif pair intersection table is generated through steps 3 and 4. The final motif pair occurrence table is generated from the subtraction between the union table and the intersection table ...... 95
4.6 WashU epigenome browser visualization of the chromatin interactions between histone genes. The mouse Hist1 gene cluster contains five subclusters, which are highlighted in green. Chromatin interactions between these clusters for each technique are shown as pink arcs. The CTCF binding signal is shown as the first track, followed by the gene annotation track and the chromatin interaction tracks of Hi-C, SLICE, and SPRITE ...... 97
4.7 WashU epigenome browser visualization of the chromatin interactions between lamina-associated domains (LADs) in the Hist1 cluster. The mouse Hist1 gene cluster contains five subclusters, which are highlighted in green. LAD regions are shown in blue. There is a LAD region (1.1 MB in length) inside the Hist1 cluster, about 19 kb downstream of a Hist1 subcluster. There is another LAD region 83 kb upstream of the Hist1 cluster.
The constitutive LAD track is shown as the first track, followed by the gene annotation track and the chromatin interaction tracks of Hi-C, SLICE, and SPRITE ...... 98
4.8 WashU epigenome browser visualization of the Hist1 cluster in mouse embryonic stem cells. The mouse Hist1 gene cluster contains five subclusters, which are highlighted in green. LAD regions are shown in blue. The polycomb domain defined by H3K27me3 peaks is shown in yellow. The top 5 interaction hubs for Hi-C, SLICE, and SPRITE are shown, along with other characteristics, including markers of open chromatin (i.e. DNase-seq), CTCF binding signals, actively transcribed regions (i.e. GRO-seq), Polycomb domains (i.e. H3K27me3), promoter states (i.e. H3K27me3, RNAPIIS5p, RNAPIIS7p), chromatin states (i.e. chromHMM), and constitutive LADs (i.e. cLAD) ...... 99

List of Acronyms

5-number summary Describing the data using the minimum, the first quartile, the median, the third quartile, and the maximum
ADA American Diabetes Association
auPRC The area under the Precision-Recall Curve
auROC The area under the Receiver Operating Characteristic Curve
BMI Body Mass Index
ChIP-Seq Chromatin Immunoprecipitation Sequencing
CTCF CCCTC-binding Factor
CTD C-Terminal repeat Domain
CV Cross-Validation
DREAM Dialogue for Reverse Engineering Assessment and Methods
ENCODE Encyclopedia of DNA Elements
ESI-MS Electrospray Ionization Mass Spectrometry
FDR False Discovery Rate
FPG Fasting Plasma Glucose
FPI Fasting Plasma Insulin
GAM Genome Architecture Mapping
GMM Gaussian Mixture Model
GPU Graphics Processing Unit
HbA1C Hemoglobin A1C
HIST1 The major histone gene cluster in human
Hist1 The major histone gene cluster in mouse
HLB Histone Locus Body
HOMA-IR Homeostasis Model Assessment as an index of Insulin Resistance
HP K141 Glycated lysine-141 of haptoglobin
HSA Human Serum Albumin
LAD Lamina-Associated Domain
LASSO Least Absolute Shrinkage and Selection Operator
LEF Loop Extruding Factor
MDS Multidimensional Scaling
mESC Mouse embryonic stem cell
PTM Post-Translational Modification
RSAT Regulatory Sequence Analysis Tools
RDHGs Replication-Dependent Histone Genes
SPRITE Split-Pool Recognition of Interactions by Tag Extension
SVM Support Vector Machine
TF Transcription Factor
TFBS Transcription Factor Binding Site
t-SNE T-Distributed Stochastic Neighbor Embedding

1 Introduction

1.1 Motivation

Most diseases are caused by defects in the genome, which can be studied at the three levels of the central dogma: proteomics (i.e. protein), transcriptomics (i.e. RNA), and genomics (i.e. DNA). In recent years, epigenomics and 3D genomics sequencing techniques have become widely used in large-scale international projects, including the ENCODE project [5], the Roadmap Epigenome project [6], and the 4D Nucleome project [7]. Biomarkers that characterize certain biological processes are becoming increasingly important to both basic and clinical research [8]. Biomarker discovery is a fundamental question in both bioinformatics and precision medicine. Typically, a case-control study is conducted in which a set of samples (e.g., patients) that have a disease or a phenotype of interest is compared to a set of samples that are normal or do not exhibit the phenotype. Then, a set of biomarkers that discriminate between the cases and the controls is identified by bioinformatic algorithms. With the availability of low-cost biotechnology, the size and complexity of biological data are increasing. However, bioinformatics algorithms and pipelines for analyzing such a tremendous amount of data are still lacking. Therefore, this dissertation aims to develop new bioinformatics methods that enable biologists to study the genome in greater depth and breadth.
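The case-control comparison described above can be sketched as a per-feature statistical screen. The following toy example is purely illustrative (synthetic data, a planted discriminative feature, and a Bonferroni correction chosen for simplicity; real biomarker analyses involve normalization, confounders, and validation cohorts):

```python
# Illustrative case-control biomarker screen on synthetic data.
# All numbers and the planted "biomarker" feature are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_case, n_ctrl, n_feat = 30, 30, 100
cases = rng.normal(size=(n_case, n_feat))
controls = rng.normal(size=(n_ctrl, n_feat))
cases[:, 0] += 2.0  # plant a strong case-vs-control difference in feature 0

# Per-feature two-sample t-test, then Bonferroni multiple-testing correction.
pvals = np.array([stats.ttest_ind(cases[:, j], controls[:, j]).pvalue
                  for j in range(n_feat)])
adj = np.minimum(pvals * n_feat, 1.0)
hits = np.flatnonzero(adj < 0.05)
print("candidate biomarker features:", hits)
```

With these settings, the planted feature is recovered while the null features are filtered out by the multiple-testing correction; in practice an FDR procedure such as Benjamini-Hochberg is often preferred over Bonferroni.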

1.2 Contributions

The contributions of this dissertation include the following:

• A machine learning model [9–11] that integrates mass spectrometry data, clinical data, and demographic data for early diagnosis of type 2 diabetes.

• A bioinformatics pipeline Emotif Alpha that integrates gene expression data [12] and sequence conservation information, to discover motifs on a set of biologically-related sequences.

• A random forest model that integrates DNA motifs, TF ChIP-seq, DNase-seq, and DNA shape information to predict in vivo binding of 32 transcription factors for 13 cell types in the ENCODE-DREAM challenge [13].

• A systematic evaluation of set-cover based motif selection algorithms using 349 ChIP-seq datasets (manuscript under review at Bioinformatics).

• A bioinformatics pipeline for characterizing histone locus body using an ensemble of 3D genomics sequencing techniques, including Hi-C, GAM, and SPRITE.

• A GPU-accelerated genome-wide motif pair finding algorithm.

1.3 Gene transcriptional regulation

Gene expression is controlled by thousands of transcription factors [14], millions of regulatory elements [15], and complex 3D chromatin structures [16]. Gene regulation mechanisms must function in a precise manner. At the genomic level, cis-regulatory elements are bound by transcription factors, causing the DNA to bend, which in turn forms long-range interactions (e.g. promoter-enhancer loops); different transcription factors also interact with each other to form gene regulatory networks. At the epigenomic level, various histone modifications and DNA methylation can regulate chromatin states to activate or inhibit gene transcription. Gene activation and transcription can be summarized in 4 phases [17]:

1.3.1 Phase 1: prior to gene activation

When a gene is off, its coding sequences and surrounding regulatory sequences are in a repressed chromatin state. DNA methylation is involved in this state. For example, DNA methyltransferases such as Dnmt3a can methylate cytosine, and the methylated cytosine can be bound by methyl-CG binding proteins such as MeCP2 or MBD1. This process produces tightly condensed chromatin. The gene region thus remains heterochromatin unless a pioneer TF binds to it.

1.3.2 Phase 2: pioneer TFs to activate gene expression

When a gene is about to be activated, pioneer TFs, such as FOXA1 and GATA1, bind to the heterochromatin region. Pioneer TFs can directly bind nucleosomal DNA via their DNA-binding and histone-binding domains. In the meantime, pioneer TFs can recruit other transcription factors to modify the chromatin state. Factors that have histone acetyltransferase activity, such as p300/CBP [18], can bind to enhancer regions and acetylate nearby histones, whereas factors that have histone methyltransferase activity, such as Setd1a [19], can methylate H3K4 and modulate the chromatin structure.

1.3.3 Phase 3: assembly of the transcription initiation complex (PIC).

The PIC [20] consists of RNA polymerase II (RNA pol-II), the TATA-box binding protein (TBP), and the general transcription factors TFIIA, TFIIB, TFIID, TFIIE, TFIIF, and TFIIH. In the first step of the assembly process, TBP binds to a TATA box-containing promoter, which helps to recruit TFIIA, TFIIB, TFIID, and TFIIF. Next, TFIIF helps to recruit RNA pol-II. Lastly, TFIIE and TFIIH attach to the complex.

1.3.4 Phase 4: transcription initiation, elongation, and termination.

Initiation starts after TFIIH phosphorylates the C-terminal repeat domain (CTD) of RNA pol-II. First, RNA pol-II transcribes about 20 nt of RNA and then pauses. During this paused period, the capping process takes place. Transcription resumes after elongation factors, such as P-TEFb, further phosphorylate the CTD of RNA pol-II. Gene transcription is terminated when RNA pol-II reaches the polyadenylation signal (AAUAAA).

1.4 Machine learning

Two main learning tasks are involved in machine learning [21, 22]. First, supervised learning tries to learn a transformation function that maps a set of inputs to a set of target variables. Second, unsupervised learning tries to learn a feature distribution without the target variables. Before applying the actual machine learning algorithms, the data might need to be cleaned, which includes feature normalization and missing value imputation. This preprocessing step is often conducted during exploratory data analysis. Data visualization is essential to machine learning since the algorithms are often treated as black boxes. Machine learning practitioners should perform data visualization both before and after running machine learning algorithms. A common issue in the field is overfitting. To avoid this problem, cross-validation should be performed and appropriate performance evaluation metrics should be applied. For example, accuracy is a common default metric; however, when the classes are highly imbalanced, the area under the precision-recall curve (auPRC) should be evaluated instead. A comprehensive review of machine learning algorithms and related topics can be found in Chapter 2. Here, we present a common practice of using machine learning algorithms in bioinformatics (Figure 1.1). This pipeline is applicable not only to biomarker discovery, but also to most machine learning competitions, such as the DREAM (Dialogue for Reverse Engineering Assessment and Methods) challenges [23] and Kaggle competitions. Bioinformaticians are given biological data by a biologist. The bioinformaticians should first discuss the hypotheses and aims with the biologist in order to map the

[Figure 1.1 flowchart: exploratory data analysis (identify data types and learning tasks; explore data distributions, missing values, outliers, human mistakes, feature correlations), machine learning analysis (supervised and unsupervised learning), and performance evaluation (search for optimal parameters; determine appropriate metrics such as accuracy, auROC, auPRC, or the silhouette value), each accompanied by data visualization options such as box, swarm, and violin plots, PCA/MDS/t-SNE plots, scatterplots, distplots, histograms, barplots, heatmaps, correlation plots, Cytoscape, Circos plots, dendrograms, parallel coordinates, Venn diagrams, and genome browser views.]

Figure 1.1: Systematic view of machine-learning-based bioinformatic data analysis. This flowchart represents three main steps: exploratory data analysis, applying machine learning algorithms, and evaluating performance. In each step, data visualization should be performed to provide interpretation and sanity checks.

data to specific machine learning tasks. In the exploratory data analysis step, data types should be identified, which could be categorical, numeric, and/or string. Based on the experimental design, the data could also be time-series. If the data is categorical or string, the bioinformaticians should convert it into numbers by introducing dummy variables. For example, if a feature column contains A, B, or C, one can use three binary dummy variables (i.e., three columns) to represent A, B, and C. Popularized by John Tukey, the 5-number summary of the minimum, the first quartile, the median, the third quartile, and the maximum is a simple yet powerful way to look at the data distribution. If the data is numeric, a 5-number summary should be displayed to see if any outliers are present. In some cases, these outliers can be human mistakes. Missing values should not be ignored; in most cases, replacing them with the median/mean values or zero works fine. Data visualization is essential throughout the whole process. In this step, the box plot and the histogram can be used for single-variable analysis, such as visualizing the distribution of the data. Usually, a normal distribution is expected; however, a bimodal distribution can also indicate interesting biological discoveries. For multivariate analysis, a heatmap of feature correlations can be used to identify correlated features, and a dimensionality reduction visualization can be plotted to identify hidden patterns. Next, machine learning analysis is performed. For example, decision tree, support vector machine (SVM), and random forest algorithms can be used to perform classification, whereas the LASSO (least absolute shrinkage and selection operator) regression algorithm can be used to perform regression. The Gaussian mixture model and k-means algorithms can be used for clustering analysis. In this step, dimensionality reduction methods (e.g.
PCA, MDS, and t-SNE) can be used to visualize the classification boundaries or detected clusters. The dendrogram can also be used to visualize pairwise relationships. Lastly, the performance of machine learning algorithms should be evaluated to avoid overfitting. A common approach is cross-validation, which involves splitting the entire dataset into k folds. For each fold, one uses the given fold for testing and the rest for training. Parameter tuning is critical to most machine learning algorithms. A common mistake in some bioinformatics journals is to compare a proposed algorithm to an untuned machine learning algorithm. Although default parameters are supposed to work reasonably well, it is questionable to publish methods without a thorough evaluation of different parameter settings. Machine learning algorithms can be divided into parametric methods and non-parametric methods. Parametric methods, such as linear regression and neural networks, pre-define a mapping function; the learning procedure finds the best parameters of that function. In contrast, non-parametric algorithms, such as decision trees, k-nearest neighbors, and k-means, do not presume a mapping function. People tend to misunderstand non-parametric algorithms; they often think non-parametric algorithms have no parameters. However, tree-based algorithms (e.g., random forest), for example, do have important parameters (called hyperparameters), such as the maximum depth of a tree (e.g., max_depth) and the minimum number of samples required to split a node (e.g., min_samples_split). Grid search can be used to find optimal hyperparameters. This process is computationally intensive, and identifying the optimal parameters often depends on the experience of the machine learning practitioner.
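As a concrete illustration of the hyperparameter grid search described above, the following sketch tunes the two random forest hyperparameters named in the text with scikit-learn's GridSearchCV on a small synthetic dataset; the grid values are arbitrary examples, not recommendations.

```python
# Hypothetical sketch: grid search over random forest hyperparameters
# on synthetic data; the candidate values are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {
    "max_depth": [3, 5, None],        # maximum depth of each tree
    "min_samples_split": [2, 5, 10],  # minimum samples required to split a node
}
search = GridSearchCV(RandomForestClassifier(n_estimators=50, random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_)            # the winning hyperparameter combination
print(round(search.best_score_, 3))   # its mean cross-validated accuracy
```

Each of the 9 grid points is evaluated with 5-fold cross-validation, which is exactly why the process is computationally intensive: the cost grows multiplicatively with every hyperparameter added to the grid.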

1.5 Organization of the dissertation

This dissertation mainly discusses three biomarker discovery problems, namely diabetic biomarker discovery (Chapter 2), DNA motif discovery (Chapter 3), and motif pair finding (Chapter 4). Literature review, problem statement, and methodology will be presented inside each of the three chapters. In Chapter 5, I will discuss and provide insights into personalized medicine, machine learning, cis-regulatory elements, and 3D genome organization.

2 Discovery of Diagnostic Biomarkers in Type 2 Diabetes

The traditional fasting plasma glucose (FPG) test and the HbA1C test for type 2 diabetes (T2D) identify only 30%-50% of undiagnosed T2D patients. Novel biomarkers are needed for early diagnosis and prognosis. In this chapter, we used machine learning approaches to study the glycation degrees of several plasma proteins in newly diagnosed T2D patients, long-term controlled T2D patients, and prediabetic patients. All of this work has been published in [9–11]. In summary, glycated lysine-141 of haptoglobin (HP K141) was found to be a novel biomarker for both newly diagnosed and long-term controlled T2D patients. When combined with HbA1C, it provides an accuracy of 96% in predicting T2D in newly diagnosed patients [9]. Clustering analysis consistently identifies three subtypes in both groups of T2D patients, suggesting that heterogeneity of T2D phenotypes is common and should be taken into consideration by health professionals. A preliminary analysis of 20 prediabetes samples reveals the prognostic potential of the glycation sites.

2.1 Introduction

Diabetes is a complex metabolic disease characterized by hyperglycemia that may severely damage the eyes, kidneys, and heart. According to the National Diabetes Statistics Report 2017 [24], as of 2015, 30.3 million Americans, or 9.4% of the population, had diabetes, of which 7.2 million people (23.8%) were undiagnosed. There are two main types of diabetes [25]. Type 1 diabetes (T1D) is usually characterized by absolute insulin deficiency resulting from β-cell destruction. T1D patients need to inject insulin to survive. Type 2 diabetes (T2D) accounts for over 90% of all diabetes cases and is mainly characterized by insulin resistance with relative insulin deficiency. T2D has a long asymptomatic stage, called prediabetes, which is characterized by mild hyperglycemia. Moreover, it is estimated that 79 million Americans have prediabetes [26]. It has been found that effective blood glucose control and lifestyle interventions can reduce the development of T2D by 58% [27]. Therefore, early diagnosis and prognosis of diabetes are crucial in the prevention and treatment of diabetes. Standard screening for diabetes [26] includes an FPG level of 126 mg per deciliter or more or an HbA1C level of 6.5% or higher. The current HbA1C diagnostic has 78.7% sensitivity and 94.0% specificity [28]. An FPG level between 100 and 125 mg per deciliter or an HbA1C level between 5.7% and 6.4% is diagnosed as prediabetes. However, these methods identify less than 50% of undiagnosed T2D patients [9, 10]. The diagnostic accuracy of current biomarkers is unsatisfactory, and novel biomarkers are needed for better prediction of diabetes.

2.2 Machine learning methods

Machine learning methods were applied to discover diagnostic and prognostic biomarkers for T2D and prediabetes. In this section, we describe our algorithms and review common machine learning topics and techniques.

2.2.1 Exploratory data analysis

Exploratory data analysis, or EDA, is an important topic in data science that identifies patterns, issues, and outliers. The first task is to identify data types. In bioinformatics, string is a very common type. Inputs of DNA sequences or protein sequences can be encoded using one-hot encoding [29] or the k-mer spectrum [30]. Inputs of gene expression, protein abundance, and biological signals (e.g., ChIP-seq) are numeric values. Inputs of gene annotation and other biological annotations (e.g., regulatory elements) are often categorical. These types of data can be converted into numbers using dummy variables or by simply counting the occurrences in a sliding genomic window. Once the data types are clear, the next step is to understand the data distribution. For example, one can perform the 5-number summary. The summary() function in R and dataframe.describe() in the Python Pandas library can do the work. These summary statistics are used to find outliers and differences between features. Besides univariate analysis, multivariate analysis can be done by exploring pairwise feature correlations or performing dimensionality reduction. Usually, EDA uses multiple data visualization techniques, such as box plots, heatmaps, and dimensionality reduction (e.g., PCA, MDS, and t-SNE). The last step is to determine whether the data need to be cleaned, normalized, or scaled. These decisions come after looking at the data distribution. One should check if there are any unexpected patterns or data points. Data cleaning refers to a series of steps to ensure data quality. Common steps include missing value imputation and outlier removal or correction. Missing values can be replaced by zeros, median, or mean values depending on the data. For example, if a person's age is missing, it is better to use an average value because zero does not make sense for this feature.
More complex imputation methods can be applied for gene expression data [31], microRNA expression data [32], mass spectrometry-based data [33], or mixed-type data [34]. If any preprocessing is done, one should visualize the data distribution again.
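The summary and imputation steps above can be sketched with pandas; the column names and values here are invented for illustration, and the choice of mean versus median imputation follows the reasoning in the text (a zero age would be meaningless).

```python
# Illustrative sketch of simple missing-value handling with pandas
# (column names and values are made up for the example).
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [34, np.nan, 51, 29],
                   "protein_abundance": [0.8, 1.2, np.nan, 0.5]})

print(df.describe())  # count/mean/std plus the 5-number summary per column

# Replace a missing age with the column mean (zero does not make sense here),
# and a missing abundance with the column median (robust to outliers).
df["age"] = df["age"].fillna(df["age"].mean())
df["protein_abundance"] = df["protein_abundance"].fillna(df["protein_abundance"].median())
print(df.isna().sum().sum())  # 0 -- no missing values remain
```

After any such preprocessing, the distribution should be visualized again, as noted above, to confirm the imputation did not distort the data.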

2.2.2 Supervised learning

One of the most common tasks in machine learning is supervised learning. Such a task requires one to learn a transformation function that maps a set of inputs to a set of target variables. Since the ground truth is available, any prediction algorithm that accomplishes the task can be fairly evaluated. For example, the most famous machine learning competition platform, Kaggle, allows users to develop different algorithms and compete with each other, which is not only a real demonstration of the algorithms developed in academia, but also fosters novel ideas that push the machine learning community forward. In bioinformatics, the DREAM challenge [23] provides machine learning competitions specifically for bioinformaticians.

Mathematically speaking, a supervised learning task is defined as follows. Given a dataset D that contains a set of input samples x ∈ X and a set of target labels y ∈ Y, denoted by D = {(x_1, y_1), ..., (x_n, y_n)}, where n is the total number of samples, the objective is to train a model f that minimizes the cost defined by

C(f, D) = Σ_{i=1}^{n} l(f(x_i), y_i)

where l(f(x_i), y_i) is called the loss function. For a sample x_i, f(x_i) is the label predicted by model f. If the target label y ∈ Y is a set of discrete categories, then the learning task is called classification. On the other hand, if y ∈ Y is continuous, then the learning task is called regression.

2.2.2.1 Classification

Decision Tree Tree-based algorithms are some of the most interpretable and powerful machine learning algorithms. They have many advantages:

• Able to visualize the decision trees, enhancing the interpretability.

• Able to handle missing values and categorical data.

• No need to scale or normalize the data.

• Able to be constructed in parallel.

One of the simplest classification algorithms is the decision tree algorithm. A tree is constructed in a greedy manner: (1) first, each feature is evaluated based on information gain (Equation 2.2) and the feature with the largest value is selected as the root; (2) then, the data is partitioned and the tree is grown inside each partition by again selecting the feature with the largest IG value. The ID3 algorithm [35] is described in Algorithm 1. This algorithm grows the tree without pruning, which may overfit the data. Thus, the C4.5 [36] and C5.0 algorithms were later developed to improve the classification performance.

To introduce the ID3 algorithm [35], let us consider a classification task where the target labels can take on c different classes. Then let us first define the entropy of the labels in dataset D, denoted by H(D), as follows:

H(D) = − Σ_{i=1}^{c} p_i log2 p_i    (2.1)

where p_i is the proportion of D belonging to class i. Then, let us define the information gain of a feature S, denoted by IG(S; D), which measures the expected reduction in entropy caused by partitioning the data using S. Information gain is computed from an impurity function (here, entropy); another common choice of impurity function is the Gini index [37].

IG(S; D) = H(D) − Σ_{v ∈ Values(S)} (|D_v| / |D|) H(D_v)    (2.2)

where Values(S) is the set of all possible values for feature S and D_v is the subset of D for which S has value v. The ID3 algorithm is shown below.

Algorithm 1 ID3 (Data D, Features F)
1: If all examples in D have the same class, return that class label.
2: Find the feature S with the largest IG value.
3: Create a root node from S and call it T.
4: Split D into partitions D_1, D_2, ..., D_v based on the values of S.
5: For each D_v, add T_v = ID3(D_v, F − {S}) as a new branch to T.
6: Return T.
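The entropy and information gain computations at the core of ID3 (Equations 2.1 and 2.2) can be sketched in pure Python; the toy dataset and the feature name glucose_high are invented for illustration.

```python
# Minimal sketch of Equations 2.1 and 2.2; data and feature names are toy examples.
from collections import Counter
from math import log2

def entropy(labels):
    """H(D) = -sum_i p_i * log2(p_i)  (Equation 2.1)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, feature):
    """IG(S; D) = H(D) - sum_v |D_v|/|D| * H(D_v)  (Equation 2.2)."""
    labels = [label for _, label in rows]
    gain = entropy(labels)
    n = len(rows)
    for value in {x[feature] for x, _ in rows}:
        subset = [label for x, label in rows if x[feature] == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# Toy dataset: the binary feature perfectly predicts the label, so its
# information gain equals the full label entropy (1 bit here).
rows = [({"glucose_high": 1}, "T2D"), ({"glucose_high": 1}, "T2D"),
        ({"glucose_high": 0}, "healthy"), ({"glucose_high": 0}, "healthy")]
print(information_gain(rows, "glucose_high"))  # 1.0
```

ID3's step 2 simply calls a function like information_gain for every remaining feature and picks the maximizer.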

Ensemble Learning The ensemble method is an important machine learning paradigm that combines multiple models to make predictions. Intuitively, ensemble methods mirror our natural tendency to evaluate multiple opinions before making a final decision. So far, nearly all winning algorithms in Kaggle competitions have been some type of ensemble method [38–40]. The theoretical foundation for ensemble methods dates back to 1785, when Condorcet (1743-1794) wrote the Essay on the Application of Analysis to the Probability of Majority Decisions. This work demonstrates that the probability of a correct majority decision M will be greater than the probability of every single decision p when p is better than a random guess [41]. In 1990, Hansen and Salamon [42] conducted applied research showing that a combination of multiple classifiers is often superior to a single classifier. In the same year, Schapire [43] did a theoretical study on the strength of weak learnability, in which he proved that weak classifiers can be boosted into a strong classifier; this is known as Boosting, one of the most effective ensemble methods [44]. An ensemble framework contains three elements: (1) a dataset; (2) an ensemble generator; and (3) a meta-learner (also known as a combiner). An ensemble generator refers to the process of creating each individual classifier, which is often called a base learner. A combiner refers to the method that combines the set of predictions made by the base learners. There are three representative ensemble methods, namely Boosting, Bagging, and Stacking. The following algorithms are introduced based on a binary classification task where the label is either -1 or 1. Boosting Boosting is a model-dependent ensembling method. The most famous algorithm is AdaBoost [45]; it gives higher weights to misclassified samples and trains each base learner with emphasis on the misclassified samples rather than on all samples equally (Algorithm 2).
Notably, another boosting method, the gradient-boosting XGBoost algorithm [46], has been used in many Kaggle competitions [47]. It is worth noting that XGBoost has GPU support.

The AdaBoost algorithm is shown below. The output of the algorithm is a list of trained base models h_t and a list of associated weights α_t. During the testing phase, given a sample x_i, the predicted label of x_i is sign(Σ_{t=1}^{T} α_t h_t(x_i)).

Algorithm 2 AdaBoost (Data D, Base learner L, Epochs T)
1: Initialize uniform weights: D_1(i) = 1/n for i = 1, ..., n.
For t = 1, ..., T:
2: Sample from D using the distribution D_t, denoted by (D, D_t).
3: Train a model using base learner L, denoted by h_t = L(D, D_t).
4: Measure the prediction error of h_t, denoted by ε_t.
5: Determine the weight of h_t: α_t = 0.5 ln((1 − ε_t) / ε_t).
6: Give higher weights to the misclassified samples: D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)).
7: Normalize D_{t+1} into a probability distribution.
Return {α_1, ..., α_T}, {h_1, ..., h_T}
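The boosting idea of Algorithm 2 can be exercised with scikit-learn's AdaBoostClassifier, which by default boosts decision-tree stumps; this is a sketch on synthetic data, and the library's internals differ in minor details from the pseudocode above.

```python
# Sketch: AdaBoost with T = 50 boosting rounds on synthetic data
# (dataset and parameter choices are illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

clf = AdaBoostClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print(round(scores.mean(), 3))   # mean cross-validated accuracy

clf.fit(X, y)
# estimator_weights_ corresponds to the alpha_t values of Algorithm 2.
print(len(clf.estimator_weights_))
```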

Bagging Bagging is a model-independent ensembling method. A simple Bagging framework trains a set of base learners using bootstrap sampling and then combines them by majority vote to improve accuracy and avoid overfitting. The most famous Bagging algorithm is the random forest algorithm [48], where the base learner is a decision tree. The pseudocode of Bagging is shown below. The output of the algorithm is a list of trained decision trees. During the testing phase, given a sample x_i, the predicted label of x_i is sign(Σ_{t=1}^{T} L_t(x_i)).

Algorithm 3 Random Forest (Data D, Base learner L, Number of trees T)
For t = 1, ..., T:
1: Sample from D using bootstrap sampling.
2: Train a decision tree on the bootstrapped samples, denoted by L_t.
Return {L_1, ..., L_T}

Stacking Stacking is also a model-independent method, which combines different types of base learners. Its structure is like a neural network where each "neuron" is a base learner. The last layer of a stacking model contains only one "neuron", which is called the meta-learner. A stacking model reduces to a bagging model if it has two layers where the first layer contains the same type of base learner and the second layer is a majority vote function. Two-layer stacking models are also called Blending, a name coined by the Netflix Prize winners [49]. One of the most famous stacking methods, StackNet (https://github.com/kaz-Anova/StackNet), was created by Marios Michailidis, a Kaggle grandmaster. A typical stacking model scheme is shown in Figure 2.1. My winning solution1 in the Multiple Myeloma DREAM challenge sub-challenge 1 used a linear stacking model; it was a two-layer stacking model where the last layer used a linear algorithm to combine the values returned by the first layer. A few tips and cautions for using a stacking model are listed below:

• Always start with the simplest one; for example, try a linear stacking model.

• Parameter tuning is crucial in a stacking model since so many types of algorithms are used.

• The diversity of each layer is important; a stacking model should include models that are based on different hypotheses (i.e. the no-free-lunch theorem), such as density-based methods (e.g. kNN), maximal margin models (e.g. SVM), simple parametric methods (e.g. logistic regression and linear regression), tree-based methods (e.g. XGBoost [46] and random forest), sparsity regularized methods (e.g. LASSO), and other methods (e.g. neural networks).

1 https://www.synapse.org/#!Synapse:syn10291644/wiki/489735

• A simple linear model is usually used as the meta-learner. The hypothesis is that a good data transformation from multiple layers should learn a linear representation.

• Unsupervised feature learning methods can be used to increase the performance.


Figure 2.1: Flowchart of a typical stacking model. The input is the raw data or data learned from unsupervised feature learning. A typical stacking model can have multiple layers of classifiers or regressors. In the last layer, a meta-learner is used to combine all the predicted values (e.g., probabilities) into a single output value.
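A minimal two-layer stacking model in the spirit of Figure 2.1 can be sketched with scikit-learn's StackingClassifier; the base-learner lineup and dataset below are illustrative, not the actual DREAM-challenge solution described above.

```python
# Sketch of a two-layer stacking model: diverse base learners combined
# by a simple linear meta-learner (lineup and data are illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),  # tree-based
    ("knn", KNeighborsClassifier()),                                  # density-based
    ("svm", SVC(probability=True, random_state=0)),                   # maximal margin
]
# The meta-learner is a simple linear model, following the tip above.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(), cv=5)
stack.fit(X, y)
print(round(stack.score(X, y), 3))
```

Note how the base learners deliberately come from different model families, reflecting the diversity tip (and the no-free-lunch theorem) in the list above.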

Support Vector Machine SVMs are classic and successful algorithms for kernel-based machine learning [50]. A kernel is defined as follows: a function k : X × X → R is a kernel function if there exists a feature mapping function φ : X → R^n such that k(x, y) = φ(x)^T φ(y). The advantage of kernel functions is that they can map the original data into higher dimensions without explicitly calculating the mapped values. Any machine learning algorithm can use the kernel trick if its objective function can be expressed using a kernel function. For example, the primal problem of SVM is defined as:

Minimize J(w, b) = (1/2) ||w||^2 + C Σ_{i=1}^{n} ξ_i

subject to y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i

ξ_i ≥ 0, ∀i ∈ {1, ..., n}

where x_i is sample i, w and b are the parameters to be optimized, C is a user-defined hyperparameter that controls overfitting, and ξ_i is called the slack variable, which allows x_i to not meet the margin requirement. The above primal problem can be solved using Lagrange multipliers, which yields the dual representation of SVM as follows:

Maximize L(α) = Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j k(x_i, x_j)

subject to 0 ≤ α_i ≤ C

Σ_{i=1}^{n} α_i y_i = 0, ∀i ∈ {1, ..., n}

where the α_i are the Lagrange multipliers.

k(x_i, x_j) is the kernel function; it enables the SVM to separate the data with a hyperplane in the mapped space without increasing the computational complexity. Common SVM kernels include polynomial kernels and radial basis function (RBF, a variant of the Gaussian kernel) kernels [51]. There are two advantages of SVM: (1) the use of kernels and (2) the use of support vectors. The decision plane of an SVM can be explained using just a few support vectors, which makes the prediction of new examples very efficient. However, Google Trends data suggests that the popularity of SVM has decreased, possibly for the following reasons:

• SVM is highly sensitive to the input data. Missing value imputation, outlier removal, feature engineering (i.e., using domain knowledge to create new features from existing ones), and feature normalization are all necessary steps before using SVM.

• SVMs cannot handle large datasets very efficiently due to the high computational cost of the training process [52]. Standard SVM has a training complexity of O(N^3), where N is the training set size [53]. Fortunately, SVMlight [54, 55] has O(N) time complexity for the linear kernel. Still, for non-linear kernels, its training complexity is between O(N^2) and O(N^3).

• Other powerful algorithms, such as deep neural networks and gradient boosting trees, make SVM no longer the first choice for many people. For example, XGBoost [46] does not require feature normalization and can handle missing values. Moreover, its computation can be run on a GPU.
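Putting the caveats above into practice, the following sketch trains an RBF-kernel SVM with feature scaling applied first, and inspects how many support vectors the fitted model keeps; the dataset and C/gamma values are arbitrary examples.

```python
# Sketch: RBF-kernel SVM with the feature scaling the text calls necessary
# (data and hyperparameter values are illustrative).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
print(round(cross_val_score(model, X, y, cv=5).mean(), 3))

# The fitted decision function is defined by a subset of the training
# samples -- the support vectors.
model.fit(X, y)
print(model.named_steps["svc"].support_vectors_.shape[0])
```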

Performance Evaluation Accuracy (ACC), sensitivity (SE, also known as recall), specificity (SP), precision, F1 score, and the Matthews correlation coefficient (MCC) are commonly used to evaluate classification performance. These metrics are evaluated at a single cutoff. To evaluate the general performance of a classifier, two area-under-the-curve values are used: one for the receiver operating characteristic curve and one for the precision-recall curve. We denote these two values as auROC and auPRC, respectively. In the presence of imbalanced datasets (usually more negative samples), the balanced accuracy (BAC) and auPRC should be used. When evaluating multi-class classification problems, the confusion matrix should be reported [56]. The confusion matrix is an N × N matrix, where N is the number of classes. Typically, for a confusion matrix to be readable, the total number of classes should be less than 20. Moreover, important information should be highlighted; for example, misclassified classes should be labeled. If the number of classes is too large, the confusion matrix may not be helpful. In that case, the mean average precision (mAP) or the global average precision (GAP) should be evaluated.
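The imbalance pitfall described above can be seen on a toy prediction set (labels and scores invented for illustration): plain accuracy looks excellent even though it is dominated by the majority class, while balanced accuracy, auROC, and the average-precision approximation of auPRC give a fuller picture.

```python
# Sketch: single-cutoff vs curve-based metrics on a toy imbalanced set
# (labels and scores are invented for illustration).
from sklearn.metrics import (accuracy_score, average_precision_score,
                             balanced_accuracy_score, confusion_matrix,
                             roc_auc_score)

y_true  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]           # 8 negatives, 2 positives
y_score = [0.1, 0.2, 0.15, 0.3, 0.25, 0.4, 0.35, 0.6, 0.7, 0.8]
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]   # single-cutoff predictions

print(accuracy_score(y_true, y_pred))               # 0.9 -- inflated by the negatives
print(balanced_accuracy_score(y_true, y_pred))      # 0.9375 -- fairer under imbalance
print(roc_auc_score(y_true, y_score))               # auROC over all cutoffs
print(average_precision_score(y_true, y_score))     # approximates auPRC
print(confusion_matrix(y_true, y_pred))             # TN/FP over FN/TP
```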

2.2.2.2 Regression

When the target variable is continuous, the learning task is called regression. The simplest method is the ordinary least squares (OLS) formulation of linear regression, where we fit a linear function f(x) = w^T x + b. The objective is to minimize the mean squared error (MSE), defined as C(w) = (1/2) Σ_{i=1}^{n} (f(x_i, w) − y_i)^2. Other evaluation metrics include mean absolute error (MAE), root mean squared error (RMSE), and root mean squared log error (RMSLE). To control overfitting, regularized linear regression is preferred. Based on the form of the regularizer, regularized linear regression can be divided into three types:

• Ridge Regression: the L2 norm of w~ is included in the cost function.

• LASSO Regression: the L1 norm is used.

• Elastic Net: a weighted combination of the L1 and L2 norms is used.

Intuitively, the L2 norm tends to keep every coefficient small but non-zero. In contrast, LASSO regression enforces sparsity; it keeps only a few coefficients and sets all others to zero. The elastic net algorithm is somewhere in between. In general, all of these methods should be tried when searching for the best model. For non-linear regression, all the sophisticated methods discussed in the classification section can also be applied to regression tasks, including the random forest regressor, the XGBoost regressor, the support vector regressor, and stacking models. It is worth noting that non-linear regression can also be solved using piecewise linear regression, such as the least-angle regression algorithm [57] and the isotonic regression algorithm [58].
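The contrast between ridge and LASSO described above can be demonstrated on synthetic data where only 2 of 10 features actually matter; the alpha values are arbitrary examples.

```python
# Sketch: ridge (L2) keeps all coefficients small but non-zero, while
# LASSO (L1) zeroes out the irrelevant ones (data and alphas illustrative).
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only features 0 and 1 are informative; the other 8 are pure noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(int(np.sum(np.abs(ridge.coef_) > 1e-6)))  # ridge: all 10 coefficients survive
print(int(np.sum(np.abs(lasso.coef_) > 1e-6)))  # LASSO: only the informative few survive
```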

2.2.2.3 Feature Selection

Feature selection is another important topic in machine learning. Some algorithms can automatically select the best features while performing classification or regression.

For example, LASSO regression can shrink the coefficients of irrelevant features to zero using the L1 norm. Interestingly, just as physics pushed the development of mathematics, feature selection algorithms advanced because of gene expression data, where tens of thousands of gene expression features are present. The benefits of feature selection are [59]:

• Enhancing the prediction performance by reducing the likelihood of overfitting.

• Creating a concise set of features that are easier to comprehend.

• Providing insights into the underlying mechanisms of the data.

Feature ranking The first way to do feature selection is to evaluate features individually. Such methods include mutual information, the χ² test, the t-test, feature correlations, the signal-to-noise ratio, etc. One can also use a classifier or a regressor to rank features. For example, in a linear SVM, the absolute values of the weights indicate the importance of the features. Feature subset selection A simple way to select a subset is to pick the top k ranked features. However, the top k features are likely to contain redundancy. Also, individually "useless" features can become useful in combination with others. Therefore, the recursive feature elimination (RFE) method is commonly used. Coupled with SVM, the SVM-RFE algorithm has been successfully applied to cancer classification [60]. This algorithm uses a greedy approach; its pseudocode is shown in Algorithm 4.

2.2.2.4 Overfitting and Cross Validation

Overfitting is a common problem in machine learning. This issue can happen in the following scenarios:

Algorithm 4 SVM-RFE (Feature set F)
1: Let S = [] be an ordered list.
While F ≠ ∅:
2: Train a linear SVM using F.
3: Let f denote the feature with the minimal absolute weight.
4: F = F − {f}.
5: Append f to S.
Return S
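Algorithm 4 maps almost line-for-line onto scikit-learn's linear SVM; the sketch below follows the pseudocode directly on a synthetic six-feature dataset.

```python
# Direct sketch of Algorithm 4 (SVM-RFE) with scikit-learn's linear SVM;
# the dataset is synthetic and the feature count is illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=100, n_features=6, n_informative=3,
                           random_state=0)

features = list(range(X.shape[1]))   # F: remaining feature indices
elimination_order = []               # S: ordered list of eliminated features

while features:                                            # While F != empty
    svm = LinearSVC(dual=False).fit(X[:, features], y)     # train linear SVM on F
    weakest = features[int(np.argmin(np.abs(svm.coef_[0])))]
    features.remove(weakest)                               # F = F - {f}
    elimination_order.append(weakest)                      # append f to S

print(elimination_order)  # least important features appear first
```

Reversing the returned list gives a ranking from most to least important, which is how SVM-RFE is typically read.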

• The number of features is much larger than the number of training samples. In this case, many local optima exist, which may cause the learning algorithm to get stuck at a biased and incorrect point.

• The learning model is trained and tested on the same dataset. This is a common mistake among machine learning beginners.

• No regularization function is used. A common symptom of overfitted models is that their parameters take very large values.

The overfitting issue can be alleviated by splitting the data into training and testing sets, a practice known as cross-validation (CV). For example, in a 10-fold CV, the procedure is to split the data into 10 pieces. For each partition i, models are trained with a set of parameters using the other 9 partitions and tested on partition i. Based on the average performance (e.g., accuracy), the best parameter setting is selected. Traditional CV may still overfit the model since the best parameters are selected based on the test dataset. Therefore, a nested CV procedure was developed [1]. A nested CV framework (Figure 2.2) contains an inner loop that is used to select the best model and an outer loop that is used to evaluate the best model. There are two ways to determine the best parameter setting (namely, the best model): one can choose either the most frequently winning parameters or the parameters with the highest performance in the outer loop.
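The nested CV procedure above is concise to express with scikit-learn, where a grid search (the inner loop) is itself cross-validated by an outer loop; the dataset, K1 = K2 = 5, and the C grid are illustrative choices.

```python
# Sketch of nested cross-validation: the inner GridSearchCV selects
# parameters, the outer loop evaluates the selected model
# (K1 = K2 = 5; grid values are illustrative).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5)  # inner loop: model selection
scores = cross_val_score(inner, X, y, cv=5)             # outer loop: model evaluation
print(round(scores.mean(), 3))                          # unbiased performance estimate
```

Because parameters are never chosen using the outer test folds, the reported mean is a fairer estimate of generalization than a single, non-nested CV score.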

2.2.3 Unsupervised Learning

It is hard to say whether supervised learning or unsupervised learning is more difficult. Since the ground truth does not exist in unsupervised learning, people can set up their own metrics and design optimization algorithms to achieve the objective. In such cases, the unsupervised learning tasks might be easier.

Hierarchical Agglomerative Clustering HAC is the most commonly used clustering algorithm. Its advantage is the ability to visualize the relationships (i.e., distances) between samples or groups of samples. It starts with N clusters, where N is the total number of samples. It then iteratively merges the two closest clusters and stops when only one cluster is left. The distance function is usually the Euclidean distance. There are multiple choices for the linkage function; the recommended ones are ward [61] and average.

K-means Clustering The k-means algorithm aims to produce k clusters such that every sample point is close to its cluster centroid. Users should try different k values and visualize the resulting clusters in order to determine which k value makes sense. Kernel k-means should be used when the clusters are not spherical.

Gaussian Mixture Model The Gaussian mixture model (GMM) assumes the samples are generated from k different Gaussian distributions. The clustering problem is thus to maximize the probability of observing the data given the k distributions. This problem is solved using the expectation-maximization (EM) algorithm.

Other Methods In real applications, it is advisable to apply several clustering algorithms before reaching a conclusion. Other powerful methods include spectral clustering and DBSCAN clustering (Density-Based Spatial Clustering of Applications with Noise) [62].

Figure 2.2: Nested cross-validation framework [1]. This procedure has two loops. An outer CV loop splits the data into K1 folds. For each outer fold, use one fold as the testing set and the remaining K1 − 1 folds as the training set. An inner CV loop splits the training set into K2 folds. For each inner fold, use one fold as the validation set and the remaining K2 − 1 folds as the learning set. Then, for each parameter setting, train a machine learning model using the learning set and evaluate the model using the validation set.

Performance Evaluation The silhouette method is a commonly used evaluation metric. The silhouette value, ranging from -1 to 1, is calculated for each data point to measure the closeness to the assigned cluster and a neighbor cluster. A bar plot of all silhouette scores is a good way to visualize the clustering performance. Another good metric is the clustering stability score [63]. The basic idea is to run the same clustering algorithm hundreds of times and calculate the pairwise distances between any two clusters (from the different run). The final score is represented as a mean distance. Users can choose a favorite distance function. For example, in [9, 10], we used the similarity function, adjusted rand score, from the scikit-learn Python library. The Rand index (RI) [64] is a similarity score for any two clusters. The adjusted RI score is then a standardized score

computed as (RI − Expected RI) / (max(RI) − Expected RI).
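The stability idea above can be sketched as follows: rerun the clustering with different initializations and average the pairwise adjusted Rand scores between runs. The data, the number of runs, and varying only the random seed (rather than subsampling) are simplifying assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# Collect one labeling per run; here only the initialization varies
labelings = []
for run in range(10):
    km = KMeans(n_clusters=3, n_init=10, random_state=run).fit(X)
    labelings.append(km.labels_)

# Mean pairwise adjusted Rand score across runs = stability score
scores = [adjusted_rand_score(labelings[i], labelings[j])
          for i in range(len(labelings)) for j in range(i + 1, len(labelings))]
stability = float(np.mean(scores))

print(round(stability, 3), round(silhouette_score(X, labelings[0]), 3))
```

A stability score near 1 indicates that the algorithm recovers essentially the same partition on every run; unstable values of k produce noticeably lower scores.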

2.2.4 Toolboxes in Python

Python is by far the best programming language for data science. To get started, one can practice with the Pandas library. Common functions include: read_csv(), describe(), groupby(), astype(), replace(), the row selection functions .loc[] and .iloc[], the column selection idioms dataframe[] and dataframe.column_name, fillna(), accessing a cell with dataframe.at[row, col] and changing a cell value with .set_value(), set_index(), and ways to iterate over rows or columns. For data visualization, one can use seaborn, matplotlib, and plotly in Python; notably, ggplot2 in R, circos in Perl, and D3.js in JavaScript are other popular data visualization libraries. For machine learning, one can use scikit-learn (general machine learning tasks), mlxtend (stacking models), and deep learning frameworks such as Keras and TensorFlow.
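A minimal tour of the Pandas functions listed above is sketched below on a made-up DataFrame. Note that .set_value() has since been removed from Pandas; assignment through .at is the current equivalent and is used here instead.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4],
                   "group": ["a", "a", "b", "b"],
                   "value": [1.0, None, 3.0, 4.0]})

df["value"] = df["value"].fillna(0.0)        # fill missing cells
df["id"] = df["id"].astype(str)              # change a column's dtype
means = df.groupby("group")["value"].mean()  # aggregate by group

first_row = df.loc[0]                        # label-based row selection
second_row = df.iloc[1]                      # position-based row selection
cell = df.at[2, "group"]                     # read a single cell
df.at[2, "group"] = "c"                      # write a single cell (replaces .set_value())

for _, row in df.iterrows():                 # iterate over rows
    pass

print(means.to_dict(), cell)
```

Chaining these few primitives covers most day-to-day data wrangling; describe() and replace() follow the same DataFrame-in, DataFrame-out pattern.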

For big data analysis, one can use the Python API of Apache Spark. This library implements common machine learning algorithms such as linear regression, logistic regression, SVM, decision tree, random forest, naive Bayes, and k-means. We have set up a GitHub repository for readers to get started using Python, which can be found at https://github.com/YichaoOU/python_notebooks.

2.2.5 Remarks

Machine learning is an active field. Methods are evolving quickly, especially with the advent of Big Data. However, model selection is still a hard problem. Automated machine learning tools have been developed; for example, the Python library TPOT [65] can automatically select the best algorithm and best model using genetic programming. Another remark concerns model size. It has been found that the top-performing models in Kaggle competitions tend to be very large, since they all use ensembles; such models can hardly be applied in the real world. Therefore, in 2015, Geoffrey Hinton et al. proposed a method based on distillation to reduce the model size, known as "dark knowledge" [66]. He presented this work at the Toyota Technological Institute at Chicago (https://www.youtube.com/watch?v=EK61htlw8hY). A TensorFlow implementation of the distillation method using the 2nd YouTube Video Understanding Challenge data can be found at https://github.com/miha-skalic/youtube8mchallange.
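The core of the distillation idea can be sketched with a temperature-scaled softmax: the teacher's logits are softened with a temperature T so that the student can learn the relative similarities between classes. The logits and temperature below are made-up illustrative values, not from [66].

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature; T > 1 flattens the output."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [8.0, 2.0, 0.5]                 # hypothetical teacher logits for 3 classes
hard = softmax(logits, temperature=1.0)  # nearly one-hot target
soft = softmax(logits, temperature=4.0)  # softened target exposing "dark knowledge"

print([round(p, 3) for p in hard])
print([round(p, 3) for p in soft])
```

At T = 1 almost all probability mass sits on the top class; at higher temperatures the runner-up classes receive visible probability, which is the extra signal the student is trained on.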

2.3 Results and Discussion

In this section, we present the results of applying several machine learning algorithms for novel biomarker discovery and subtype identification [9–11].

2.3.1 Analyses of newly diagnosed T2D patients

2.3.1.1 Data description

This dataset consisted of demographic features (e.g. age, height), clinical parameters (e.g. HbA1C, fasting plasma glucose (FPG), homeostasis model assessment as an index of insulin resistance (HOMA-IR)), and glycation sites [67] from 48 newly diagnosed T2D males and 48 non-diabetic males [9]. Glycation is the bonding of a sugar to a specific amino acid in a protein without the help of an enzyme. The glycation degrees of 27 peptides from 9 plasma proteins were quantified using electrospray ionization mass spectrometry (ESI-MS). The total number of features was 65.

2.3.1.2 Combinatorial analysis of glycated peptides, FPG, and HbA1C

To identify a single best biomarker, we asked whether the glycation degree of a peptide can boost the current diagnostic performance of FPG or HbA1C. A decision tree algorithm was applied to classify diabetes vs. non-diabetes using the 27 combinations between FPG and each one of the glycated peptides. The same technique was also applied to HbA1C. Classification performance was evaluated using nested 10-fold cross validation (CV) [1]. The nested CV procedure was implemented using scikit-learn. Specifically, the outer loop CV was implemented using StratifiedKFold with K1 = 10 (see Figure 2.2). The inner loop was implemented using GridSearchCV with K2 = 10. The accuracy metric was maximized in the grid search procedure. The minimum number of samples that a node D needs in order to be split (i.e. min_samples_split) in step 4 of Algorithm 1 was tuned over the values 1, 5, 10, and 20. The impurity function (i.e. criterion) was automatically selected from the Gini Index [37] and information gain (called entropy in scikit-learn) [35]. The maximum depth of a tree (i.e. max_depth) was fixed to 2. The best-fitted model from GridSearchCV was then used to evaluate the testing set in the outer loop CV. Sensitivity, specificity, and accuracy were averaged over the 10 outer loop testing sets. The best biomarker was HP K141 (glycated lysine-141 of haptoglobin); when combined with HbA1C, it yielded an accuracy of 95.8%, a sensitivity of 93.7%, and a specificity of 98.0%. The result suggests that in a clinical test, the two biomarkers, HbA1C and HP K141, can correctly classify 95.8% of individuals as diabetic or non-diabetic. The cutoff points and misclassified individuals are shown in Figure 2.3.
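The nested CV setup above can be sketched in scikit-learn as follows, using a synthetic two-feature dataset in place of the HbA1C and glycated-peptide measurements. The parameter grid mirrors the text, except that scikit-learn requires min_samples_split ≥ 2, so the value 1 is replaced by 2 here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 48 + 48 subject dataset
X, y = make_classification(n_samples=96, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

param_grid = {"min_samples_split": [2, 5, 10, 20],   # 1 is invalid in scikit-learn
              "criterion": ["gini", "entropy"]}
inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # K2 = 10
outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)  # K1 = 10

# Inner loop: grid search over the tree parameters, maximizing accuracy
clf = GridSearchCV(DecisionTreeClassifier(max_depth=2, random_state=0),
                   param_grid, cv=inner, scoring="accuracy")

# Outer loop: unbiased estimate of the tuned model's accuracy
scores = cross_val_score(clf, X, y, cv=outer, scoring="accuracy")
print(round(scores.mean(), 3))
```

Wrapping GridSearchCV inside cross_val_score is the standard way to express the two loops of Figure 2.2: each outer fold sees a model tuned only on its own training data.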

Figure 2.3: Scatter plot of HP K141 and HbA1C. T2D patients are represented by blue dots and non-diabetics (NGT, normal glucose tolerance) are denoted by red dots. The cutoff was 6.0% for HbA1C and 30 fmol/mg for HP K141. Three T2D patients were misclassified as non-diabetic (purple arrows). One normal individual was misclassified as T2D (green arrow).

2.3.1.3 Classification of T2D patients and controls using SVM-RFE

Next, we applied support vector machine recursive feature elimination (SVM-RFE) [60] to maximize the classification accuracy using a small feature set (Figure 2.4). The resulting model achieved an accuracy of 98.7% using a set of 15 features: age; the glycation sites HSA (human serum albumin) K93, K262, K414, and HP K141; the diabetic markers FPG and HbA1C; the obesity markers triglycerides, free fatty acids (FFA), BMI, waist circumference, and waist-to-hip ratio; the inflammatory markers leukocytes and C-reactive protein; and insulin resistance (HOMA-IR) [9]. Interestingly, in terms of feature importance, HP K141 was ranked 4th by the SVM; the top three features were FFA, FPG, and HbA1C, respectively.
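SVM-RFE can be sketched with scikit-learn's RFE wrapper around a linear-kernel SVM; the synthetic data below stands in for the 65 clinical and glycation features and is not the study's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Synthetic stand-in: 96 subjects, 65 features, 15 of them informative
X, y = make_classification(n_samples=96, n_features=65, n_informative=15,
                           random_state=0)

# RFE repeatedly fits the SVM and drops the lowest-weight feature
svm = SVC(kernel="linear", C=1.0)
rfe = RFE(estimator=svm, n_features_to_select=15, step=1).fit(X, y)

selected = [i for i, keep in enumerate(rfe.support_) if keep]
print(len(selected), int(rfe.ranking_.max()))
```

The ranking_ attribute gives the elimination order (1 for all retained features), which is how feature-importance ranks such as HP K141's 4th place can be read off.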

Figure 2.4: Principal components visualization of the newly diagnosed patients using a set of 15 features selected by SVM-RFE. T2D patients are represented by blue dots and non-diabetics are denoted by red dots. The ellipses are drawn at the 95% confidence interval using the level parameter in ggplot.

2.3.1.4 Subtypes identification

Clustering analysis was performed using the Expectation-Maximization (EM) algorithm for the Gaussian Mixture Model. The clustering stability score [63] and the elbow criterion were used to determine an optimal number of clusters in the sample cohort. Figure 2.5 shows the cluster stability score for different numbers of clusters and identifies the optimal number as 3. Figure 2.6 shows the distribution of the 3 clusters. These clusters might indicate different development stages of the 48 newly diagnosed T2D patients, because the main differences between the clusters are driven by changes in glycation degrees rather than by diabetic markers (e.g. HbA1C and FPG) (Figure 2.7).
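A sketch of choosing the number of GMM components by resampling stability follows; the 20 runs, the 0.8 sampling ratio, and the synthetic blobs are illustrative assumptions standing in for the patient data.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=0)
rng = np.random.RandomState(0)

def stability(n_clusters, runs=20, ratio=0.8):
    """Mean pairwise adjusted Rand score over GMMs fitted on subsamples."""
    labelings = []
    for r in range(runs):
        idx = rng.choice(len(X), int(ratio * len(X)), replace=False)
        gmm = GaussianMixture(n_components=n_clusters, random_state=r).fit(X[idx])
        # Predict on the full data so labelings are comparable across runs
        labelings.append(gmm.predict(X))
    scores = [adjusted_rand_score(a, b)
              for i, a in enumerate(labelings) for b in labelings[i + 1:]]
    return float(np.mean(scores))

print({k: round(stability(k), 2) for k in (2, 3, 4)})
```

The candidate count whose labelings barely change under resampling (stability near 1) is selected, which is the criterion behind Figure 2.5.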

Figure 2.5: Clustering stability scores for different numbers of clusters. The scores were calculated using adjusted_rand_score from scikit-learn. The EM algorithm for the Gaussian Mixture Model was run 100 times with a sampling ratio of 0.8 for each number of clusters.

Figure 2.6: Principal components visualization of the 3 subtypes in newly diagnosed patients. The ellipses are drawn at the 95% confidence interval.

Figure 2.7: Box plot of all the features in the 3 subtypes. The colors correspond to the three subtypes in the PCA plot above.

2.3.2 Analyses of long-term controlled T2D patients

This dataset contained the same set of features for 48 long-term controlled T2D patients (> 10 years) and 48 non-diabetics matched for gender, BMI, and age (20 - 70 years old) [10]. Decision tree algorithms were applied to different combinations of glycated peptides and multiple clinical factors (e.g. HbA1C, FPG, fasting plasma insulin (FPI), HOMA-IR, HOMA2, C-peptides) (see supplementary tables in [10]). Unlike the newly diagnosed T2D samples, the simple combination of one glycated peptide and one clinical factor did not provide sufficient diagnostic accuracy for long-term controlled T2D patients, probably because of their better blood glucose management during long-term treatment. Therefore, a random forest with recursive feature elimination (RF-RFE) was applied to classify T2D patients and controls. Most interestingly, a set of 7 features was revealed, including the previously mentioned novel biomarker HP K141; together with 6 diabetic factors (C-peptide, FFA, FPG, FPI, HbA1C, and HOMA-IR), it provided an accuracy of 95% (Figure 2.8). Clustering analysis was performed using the k-means algorithm. The optimal number of clusters was also 3 (Figure 2.9).

Figure 2.8: Principal components visualization of the long-term controlled patients using a set of 7 features selected by RF-RFE. T2D patients are represented by blue dots and non-diabetics are denoted by red dots. The ellipses are drawn at the 95% confidence interval using the level parameter in ggplot.

Figure 2.9: Principal components visualization of the 3 subtypes in long-term controlled patients. The ellipses are drawn at the 95% confidence interval.

2.3.3 Analysis of prediabetic patients

Baseline blood samples from 20 prediabetic patients (all male) were taken at a given time point, and blood samples were taken again approximately 4.1 years later. Clinical parameters and glycated peptides were quantified at both time points. Based on the diagnostic criteria of the American Diabetes Association (ADA) [25] (i.e. HbA1C ≥ 6.5% or FPG ≥ 7.0 mmol/L), 8 prediabetic patients had converted to T2D, 7 remained prediabetic with a trend towards diabetes, and 5 individuals remained stable [10]. The 20 patients were grouped into 3 clusters based on the number of highly glycated peptides from previously reported cutoffs [9]. The RF-RFE method identified 9 glycated peptides that distinguished the clusters with an accuracy of 95%. Interestingly, HSA K262 showed a higher glycation level in 6 of the 7 patients who remained prediabetic with a trend towards diabetes, suggesting prognostic potential [10].

2.4 Conclusion

This chapter presents a machine learning based bioinformatics analysis for biomarker discovery in diabetes. Decision tree algorithms with nested cross-validation identified the glycation level of HP K141 as a novel biomarker; when combined with the established clinical markers HbA1C or FPG, it improved diagnostic accuracy to 95%. Clustering analyses revealed 3 subtypes of T2D, providing insights into the heterogeneity of the disease. Future studies should be conducted to further understand the impact and mechanisms linking the glycation sites and type 2 diabetes. For example, a new deep learning based model has been developed to predict post-translational modifications [68]; we can apply this model to identify new glycation sites that may be as useful as HP K141. One may ask: why is HP K141 more predictive than other glycation sites? To answer this question, we can explore the protein sequence, the gene annotations, and biological pathways. On the other hand, correlation analysis can reveal interesting associations [12]. Thus, it is also important to study the correlations between the glycation sites and clinical factors such as free fatty acids [69]. Machine learning regression methods can be applied to this question, which may provide a better understanding of the associations between glycation sites and type 2 diabetes. Specifically, regularized linear regression, support vector regressors, decision tree regressors, and k-NN regressors can be used and visualized (e.g. the contributions/weights of features, the residual plot, and the regression line). Multivariate multiple regression is also of interest [70] because it allows learning a mapping function that transforms glycation levels into multiple clinical factors.
This method should be able to capture all three types of dependencies: intra-dependencies between glycation sites, intra-dependencies between clinical markers, and inter-dependencies between glycation sites and clinical factors.

3 Identification of transcription factor binding sites in multiple species

It is of both biological and clinical significance to understand gene regulatory mechanisms such as transcription factor binding. Accurate identification of transcription factor binding sites (TFBSs) is an important step toward personalized medicine. In this chapter, we present Emotif Alpha, a DNA motif discovery and selection pipeline that integrates multiple types of biological information. We then present its applications in Brugia malayi and Arabidopsis thaliana; other lab members have applied this pipeline to mouse, protozoa, and bacteria. Next, we highlight our solution to the ENCODE-DREAM in-vivo Transcription Factor Binding Site Prediction Challenge (https://www.synapse.org/ENCODE). Lastly, we present set cover based algorithms to solve the motif selection problem. The new algorithms are employed to analyze 349 ChIP-seq experiments from the ENCODE project [5], yielding a small number of motifs that have strong explanatory power.

3.1 Introduction

Fast and inexpensive DNA sequencing techniques have led to the extensive study of regulatory genomics. In particular, RNA-seq based gene expression datasets have increased significantly over the last ten years. One important problem is to study the upstream regulatory (i.e. promoter) regions of genes in order to understand their expression patterns and the underlying mechanisms. Typically, over-represented sequence patterns, namely DNA motifs, are discovered from a set of co-expressed genes [71]. These motifs are assumed to be TFBSs that regulate the gene expression and thus confer a specific phenotype.

DNA motifs are specific short DNA sequences, often 8-20 nucleotides in length [72], which are statistically overrepresented in a set of sequences with a common biological function. TFBS is another name for DNA motif, and the two terms are interchangeable in most cases. However, there are subtle differences: TFBSs refer to specific binding sites occurring in the genome, while DNA motifs refer to a general pattern (i.e. a group of TFBSs) that exists in the genome. Generally, TFBSs are recognized by transcription factors to initiate, enhance, or repress gene expression, whereas DNA motifs are statistically significant mathematical patterns used in bioinformatics. One of the most common models for characterizing DNA motifs is the position weight matrix (PWM); it is a linear representation of transcription factor binding preference in which each position consists of the occurrence probabilities of the 4 bases. The limitation of this model is the assumption of positional independence. To relax this assumption, Mathelier and Wasserman proposed the transcription factor flexible model (TFFM) [73]; this model uses a Hidden Markov Model (HMM) to construct DNA motifs and was shown to perform as well as PWMs. Motif discovery is a de novo method for identifying TFBSs from a set of genomic regions that are bound by the same transcription factor [74]. The problem of motif discovery, including RNA motifs and protein motifs, has been studied for over 20 years, yet it is still under active development. A large number of motif discovery methods have been developed. The common methods and tools are as follows.

1. Expectation maximization: MEME [75] and Improbizer [76].

2. Gibbs sampling: BioProspector [77] and MotifSampler [78].

3. K-mer enumeration: Weeder [79], DME [80], and DECOD [81].

4. Deep learning: DanQ [82] and DeepFinder [83].

5. Ensemble methods: W-ChIPMotifs [84], GimmeMotifs [85].

6. ChIP-seq data motif discovery: Trawler [86], Posmo [87], ChIPMunk [88], HMS [89], and MEME-ChIP [90].

Discovered motifs are usually scanned for occurrences in the genome using motif scanning algorithms such as FIMO [91], which evaluates the chance of a motif occurrence using a p-value based on a log-likelihood ratio score. Existing motif discovery algorithms often produce a large number of DNA motifs, making further analyses and validations difficult. To solve this problem, we developed the Emotif Alpha pipeline based on experience from the many motif discovery projects our lab has conducted. This pipeline can evaluate motifs using multiple biological features, including gene expression information, known motif database queries, and conservation analysis. In addition, we have also developed set cover based methods that seek to identify a minimal set of regulatory motifs characterizing the sequences of interest.
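The log-likelihood ratio score that PWM scanners such as FIMO build on can be sketched in a few lines; the 4-position PWM and the uniform background below are made-up examples, not FIMO's implementation.

```python
import math

# pwm[i][b] = probability of base b at motif position i (illustrative values)
pwm = [{"A": 0.7,  "C": 0.1,  "G": 0.1,  "T": 0.1},
       {"A": 0.1,  "C": 0.1,  "G": 0.7,  "T": 0.1},
       {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
       {"A": 0.1,  "C": 0.7,  "G": 0.1,  "T": 0.1}]
background = {b: 0.25 for b in "ACGT"}   # uniform background model

def llr_score(site):
    """Sum of log-likelihood ratios over the motif positions."""
    return sum(math.log2(pwm[i][b] / background[b]) for i, b in enumerate(site))

def scan(sequence):
    """Score every window of motif length; higher scores = better matches."""
    k = len(pwm)
    return [(i, llr_score(sequence[i:i + k]))
            for i in range(len(sequence) - k + 1)]

hits = scan("TTAGACGG")
best = max(hits, key=lambda h: h[1])
print(best)
```

A scanner then converts such scores into p-values against the background score distribution and reports windows below a p-value threshold as motif occurrences.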

3.2 Methods

3.2.1 Emotif Alpha: A multi-omics DNA motif discovery and selection pipeline

Identification of TFBSs is important for understanding gene transcriptional regulation. The Emotif Alpha pipeline (Figure 3.1) is the result of many projects. In particular, in the Brugia malayi projects, three of the identified motifs were selected and successfully validated in vitro; specific results are discussed in the results section. The Emotif Alpha pipeline was written in Python. We used a configuration file for users to input files, adjust parameters, and choose the analyses they need. The configuration file parser was written by Chen Liang, one of the former lab members. The procedures for motif discovery and motif scanning were previously written by Rami Al-Ouran, another former lab member.


Figure 3.1: The Emotif Alpha pipeline. Users can selectively run each step or a combination of steps by modifying the configuration file. For a complete pipeline, the user should provide a list of differentially expressed genes. The promoter retrieval step will generate the foreground promoters and background promoters. Then an ensemble motif discovery is performed. The motif scanning step is used to generate motif mapping information. Users can filter motifs by using a coverage filter, running the motif selection algorithms, or performing an enrichment test. Next, de novo motifs are matched to known TFBSs. Lastly, final high-confidence motifs will be reported after conservation analysis. 54

The inputs could vary from gene expression tables to lists of gene names or promoter sequences. Promoter retrieval could be done in two ways: (1) using the Ensembl Biomart APIs [92] or (2) using bedtools [93]. Ensemble motif discovery was the most time-consuming part, since a diverse set of algorithms was used. The input to the motif discovery algorithms contained a set of foreground sequences and a set of background sequences, where the background size was three times the foreground size. The motif discovery processes ran in parallel using Python joblib. The motif selection and filtering procedure contained a series of steps. The coverage filter removed motifs based on user-defined foreground and background sequence coverage thresholds. The motif selection step was used to produce a small set of motifs that characterize the foreground and background sequences; this step used a random forest classifier to assess the ability of each motif to assign gene promoters to their respective classes. Users could select the top k motifs for each of the two impurity measures: Gini Index [37] and entropy [35]. In addition, set cover based motif selection algorithms could also be applied. The enrichment test ranked motifs using statistical tests such as the Z-test and Fisher's exact test (http://www.regulatory-genomics.org/motif-analysis/method/). Identified motifs were then matched to known motifs to infer putative functions and their cognate transcription factors; in this step, gene expression data could be used to filter out TFs that were not expressed in the given condition. Final motifs were reported after conservation analysis [94].

3.2.2 Workflow of the Brugia malayi project

The process for identifying stage-specific and sex-biased motifs consisted of the following steps (Figure 3.2): (1) determination of up-regulated genes, (2) promoter retrieval, (3) ensemble motif discovery, (4) motif filtering, (5) known motif database query, (6) conservation analysis, and (7) experimental validation.


Figure 3.2: Workflow of Brugia malayi promoter motif identification. There are seven main steps: 1) determination of up-regulated genes, 2) promoter retrieval, 3) ensemble motif discovery, 4) in silico motif selection, 5) motif database query, 6) conservation analysis, and 7) experimental validation. Steps 2 to 6 were performed using the Emotif Alpha pipeline.

To characterize stage differences, RNA-seq experiments were performed on L3, L3 Day 6 (L3D6), L3 Day 9 (L3D9), and L4 worms. To explore sex differences, RNA-seq experiments were performed on male and female worms at both 30 days post infection (dpi) and 120 dpi [95]. Sequencing quality control, read alignment, gene abundance estimation, and differential gene expression analysis were done in Dr. Elodie Ghedin's lab. Promoter sequences were retrieved from WormBase ParaSite Biomart [96], capturing the 1000 bp upstream of the translation start site for each gene. Ensemble motif discovery was then performed. A foreground coverage filter (≥ 30%), a random forest classifier, and a Z-test were used to filter motifs. A collection of 163 known nematode motifs was retrieved from the MEME suite [75], including JASPAR CORE 2016 nematodes [97], CIS-BP Brugia malayi [98], and UniPROBE worm [99]. The remaining motifs were matched to known TFBSs using TOMTOM [100] with a motif similarity p-value threshold of 10^-4. Conservation analysis was performed using the method adopted from [94]. Orthologous information between B. malayi, C. elegans, and O. volvulus was retrieved from WormBase ParaSite Biomart [96], and the corresponding promoters were extracted. CLUSTALW2 [101] was used to perform multiple sequence alignment with a gap open penalty of 10 and an extension penalty of 0.1. A motif was defined as conserved if it occurred at the same position in the orthologous promoter alignment of either C. elegans or O. volvulus.

3.2.3 Workflow of the Arabidopsis thaliana project

The overall bioinformatics workflow is shown in Figure 3.3. The first step was to find pollen-specific HRGP (hydroxyproline-rich glycoprotein) genes. Tissue-specific gene expression analysis was performed using 113 RNA-Seq datasets from Araport11 [102] via the Python API intermine.webservice. To determine pollen-specific expression, the tissue specificity index Tau [103] was used; in a recent benchmarking comparison, Tau was found to be the most robust and biologically relevant method [104]. All expression values were log-transformed before calculating Tau, with values less than 1 set to 0 [104]. A pollen-specific HRGP gene is defined as follows: (1) its Tau value is greater than 0.85 [103] and (2) its maximally expressed tissue is pollen. Using these criteria, 13 pollen-specific HRGP genes were identified. The background gene set consisted of 132 HRGP genes that were never expressed in pollen. Next, ensemble motif discovery was used to identify regulatory motifs responsible for pollen-specific HRGP genes; in total, it returned 3519 motifs.
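The Tau index used above can be sketched as follows, using its standard definition Tau = sum(1 − x_i / max(x)) / (N − 1) over N tissues; the expression values are made up, and treating values below 1 as 0 in log space is one reading of the preprocessing described in the text.

```python
import math

def tau(expression):
    """Tissue specificity index: 0 = ubiquitous, 1 = perfectly tissue-specific."""
    # Values < 1 are treated as 0; the rest are log2-transformed (assumption)
    logged = [math.log2(x) if x >= 1 else 0.0 for x in expression]
    peak = max(logged)
    if peak == 0:
        return 0.0                      # gene not expressed anywhere
    n = len(logged)
    return sum(1 - v / peak for v in logged) / (n - 1)

# A gene expressed almost exclusively in one tissue (e.g. pollen)
specific = [512.0, 0.5, 2.0, 0.0, 1.0]
# A gene expressed evenly across all tissues
broad = [32.0, 32.0, 32.0, 32.0, 32.0]

print(round(tau(specific), 2), round(tau(broad), 2))
```

The first gene scores above the 0.85 cutoff used in the workflow, while the uniformly expressed gene scores 0, illustrating why the cutoff separates pollen-specific from broadly expressed HRGPs.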


Figure 3.3: Bioinformatics workflow used to identify putative promoter motifs in the 13 pollen-specific HRGPs in A. thaliana. Pollen-specific HRGPs were identified using the Araport11 RNA-seq gene expression database. Non-pollen HRGPs were defined as HRGP genes that were not expressed in pollen. Known transcription factor binding sites in A. thaliana were downloaded from plantTFDB [2]. Using gene expression as a filter, 99 TFs were found to be expressed in pollen. Next, the Emotif Alpha pipeline was used to perform ensemble motif discovery, known TFBS matching, conservation analysis, coverage filtering, and redundant motif removal. The number of genes or motifs retained in a given step is shown in parentheses.

In the third step, the Arabidopsis thaliana motif database plantTFDB [2] was used, containing 619 TFBSs and their corresponding TFs. Of the 619 TFs, 99 were expressed in pollen. The discovered motifs were then matched to the 99 known TFBSs using TOMTOM [105] with a p-value less than 0.005; after this step, 1341 motifs were retained. Conservation analysis was performed using the method adopted from [94]. Orthologous information between A. thaliana and A. lyrata was retrieved from Ensembl Plant Biomart v39 [92]; in this step, 875 motifs were found to be conserved in A. lyrata. Motifs were then filtered out if they occurred in fewer than 10 promoters of the pollen-specific HRGPs or in more than 19 promoters of the non-pollen HRGPs. Lastly, redundant motifs were removed and 3 putative motifs were identified. Similar motifs were identified using TOMTOM with a p-value less than 0.005; if two motifs were similar, the one with higher foreground coverage was kept.

3.2.4 Workflow of motif selection in ChIP-seq data

Three set cover based methods were evaluated against the enrichment method [106]: a greedy set cover algorithm [107], a tabu search, and a relaxed integer linear programming (RILP) method. The greedy set cover algorithm uses the "maximum uncovered-first" rule [107]; motifs are added to the set until all the sequences are covered. This method does not consider background sequences. The latter two methods find a small set of motifs that covers all the regions of interest while minimizing the number of false positives (i.e. covering the background sequences). To evaluate the algorithms, we used the ChIP-seq datasets from [106]. The authors analyzed 427 ChIP-seq experiments and grouped them into 84 factor groups based on TF homology. Ensemble motif discovery was done using 5 existing tools: MEME [108], AlignACE [109], Trawler [86], MDscan [110], and Weeder [79]. The top 10


Figure 3.4: Motif selection evaluation pipeline using ENCODE datasets. For a given TF, ChIP-seq experiments in different conditions were combined and duplicate peaks were removed. Motif discovery was performed, and the discovered motifs were adopted from [106]. The evaluation datasets contain 10,000 randomly selected peaks, 10,000 randomly selected background sequences, and the discovered motifs. Four motif selection algorithms were evaluated using nested cross-validation.

most enriched motifs for each factor group were reported. The enrichment score is computed based on the fraction of motif instances in the bound regions, corrected using shuffled motifs and a confidence interval for small counts. We focused on a subset of 55 factor-group datasets because the known motifs of these factors are available; each dataset contains pooled regions (q-value ≤ 0.01) across all the ChIP-seq experiments of the given factor. To generate evaluation datasets, 10,000 random peaks were selected per factor-group dataset. A few datasets, including SIX5, ATF3, ZEB1, PBX3, MXI1, ZBTB33, NR2C2, BHLHE40, ZBTB7A, BRCA1, POU5F1, NFE2, PRDM1, HSF, and SREBP, contained fewer than 10,000 peaks, so all their peaks were used. The same number of randomly selected background regions (from [106]) were added to the evaluation datasets; in other words, the evaluation datasets contain a balanced number of foreground and background sequences. Figure 3.4 shows the pipeline used for evaluating the motif selection methods. The sets of all discovered motifs for each factor group were adopted from [106]. The evaluation datasets contain foreground sequences (i.e. bound regions), background sequences, and the corresponding motifs discovered for that factor group. Motif scanning was done using FIMO [111]. Since a motif can either occur or not occur in a sequence, it is natural to produce a boolean matrix to represent the occurrence information; together with the class label, this matrix is the input to the four motif selection methods, as shown in Figure 3.5. The evaluation procedure used a nested cross-validation (CV) approach [112]. Nested CV can reduce bias and give a better estimate of the error than traditional CV methods [113]. The evaluation program was run at the Ohio Supercomputer Center (OSC) [114]. Each algorithm was run on each dataset for 100 hours with 8 cores and 64 GB of memory; the nested CV program runs in parallel.

The tabu search algorithm did not finish 4 datasets due to excessive memory usage, including AP1, CTCF, MYC, and TATA, which contain 244, 853, 372, and 248 motifs, respectively. The motif selection methods were evaluated using the following metrics:

1. Foreground coverage (ForeCov): The fraction of foreground sequences that contain the selected motifs. Higher is better.

2. Background coverage (BackCov): The fraction of background sequences that contain the selected motifs. Lower is better.

3. Error rate: The fraction of uncovered foreground sequences (i.e. false negatives) and covered background sequences (i.e. false positives).


Figure 3.5: Nested cross-validation (CV) workflow for evaluating motif selection algorithms. A boolean matrix containing the motif occurrence information was given as input. The data was then split twice, in an outer CV and an inner CV. Inside each inner CV, a grid search over the parameters was performed. Final optimal parameters were obtained in the outer CV by selecting the most frequent optimal parameters from the inner CV. Lastly, the final selected motifs were obtained using the whole dataset and the optimal parameters.

4. Number of motifs: The number of selected motifs returned by motif selection algorithms. This number should be minimized.
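The greedy "maximum uncovered-first" rule can be sketched on a toy motif-occurrence matrix: at each step, pick the motif covering the most still-uncovered foreground sequences. The motif names and hit sets below are made up for illustration.

```python
def greedy_set_cover(coverage, universe):
    """coverage: motif name -> set of sequence ids it occurs in."""
    uncovered = set(universe)
    selected = []
    while uncovered:
        # Maximum uncovered-first rule: motif with the largest marginal gain
        best = max(coverage, key=lambda m: len(coverage[m] & uncovered))
        gain = coverage[best] & uncovered
        if not gain:          # remaining sequences cannot be covered
            break
        selected.append(best)
        uncovered -= gain
    return selected, uncovered

# Toy occurrence data: 4 motifs over 6 foreground sequences (ids 0-5)
motif_hits = {"M1": {0, 1, 2, 3}, "M2": {3, 4}, "M3": {4, 5}, "M4": {0, 5}}
selected, missed = greedy_set_cover(motif_hits, range(6))
print(selected, missed)
```

As the text notes, this variant ignores background sequences; the tabu search and RILP methods extend the objective with a penalty for covering background sequences.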

3.3 Promoter analysis in Brugia malayi

Brugia malayi is a mosquito-borne filarial worm that causes lymphatic filariasis (LF) in many tropical and subtropical countries. Over 120 million people are affected worldwide. The disease is characterized by swelling of the limbs, elephantiasis, and hydrocele. No effective treatments are available. Hence, it is important to understand the molecular processes in order to develop new drugs.

3.3.1 Stage-specific motifs

Brugia malayi has five lifecycle stages: embryos, immature microfilariae, mature microfilariae, and third- and fourth-stage larvae (L3 and L4). The third-stage larva (L3) is the stage that infects humans via the mosquito vector. The transition from L3 to L4 takes place in the lymphatics and eventually causes the disfiguring symptoms. Characterizing the promoter elements of the infective L3 stage is a fundamental step toward understanding the regulatory pathways of the parasite and would help develop effective intervention strategies for treatment. Although its genome has been released [115], most of its regulatory elements have not been fully explored. To identify stage-specific motifs, we obtained gene expression profiles in L3, L3 Day 6, L3 Day 9, and L4. Differentially expressed genes were characterized using DESeq and EdgeR, and the Emotif Alpha pipeline was used to identify stage-specific motifs in each comparison. In the results, 12 significantly enriched (p-value ≤ 10−3) motifs were found. The 12 selected motifs matched the known binding sites of 6 transcription factors in C. elegans (Table 3.1). Motifs M1.1, M1.2, M1.3, and M1.4 matched a zinc-finger protein, zfh-2, which is involved in hermaphrodite genitalia development, locomotion, nematode larval development, and receptor-mediated endocytosis. Motif M2 matched vab-7, which is required for the identity of the DB motoneurons (one of the three motoneuron subtypes) and for posterior DB axonal polarity. Motifs M3.1 and M3.2 matched elt-3, which is related to aging. Motifs M4.1, M4.2, and M4.3 matched blmp-1. M5 matched a homeobox protein, Bm1_43010, and M6 matched a nuclear receptor, Bm1_35750. Interestingly, the occurrence of M3.2 in the Bm10655 promoter region is conserved in both C. elegans (promoter of WBGene00003590) and O. volvulus (promoter of WBGene00246409) (Figure 3.6). It is worth noting that motifs M5 and M6 were selected for experimental validation: using a Gaussia Luciferase (GLuc) reporter under the control of the HSP70 promoter, the elimination of both motifs reduced fluorescence significantly (unpublished results).

Table 3.1 (continued as Table 3.2): Stage-specific promoter motifs for B. malayi. Motif names are given in the first column; motif discovery was performed on the promoters of the genes up-regulated in each stage comparison. aFrequency of a motif in up-regulated gene promoters (%UG). bRelative frequency of a motif in up-regulated gene promoters vs. background promoters. Motif logos and the gene pairs of conserved sites are omitted here. ∗These two motifs have been validated in vitro.

Name  Matched known motif            TF         Function                                            %UGa  Ratiob  P-value
M1.1  MA0928.1 (C. elegans)          zfh-2      hermaphrodite genitalia development, locomotion,    64    1.63    1.4*10-4
M1.2                                            nematode larval development, and                    84.1  1.31    7.1*10-5
M1.3                                            receptor-mediated endocytosis                       94.4  1.28    7.6*10-6
M1.4                                                                                                84.7  1.55    9.6*10-8
M2    MA0927.1 (C. elegans)          vab-7      required for DB motoneuron identity and polarity    93.3  1.31    2.9*10-5
M3.1  (C. elegans)                   elt-3      controlling hypodermal cell differentiation         80.6  1.33    3.0*10-4
M3.2                                                                                                81.8  1.32    8.3*10-5
M4.1  MA0537.1 (C. elegans)          blmp-1     loss of blmp-1 activity via deletion mutation has   34.8  2.19    4.9*10-4
M4.2                                            been reported to result in small, dumpy animals     56    1.71    1.6*10-4
M4.3                                            with abnormal fat content                           38.9  2.01    1.0*10-5
M5*   M5348_1.02 (CIS-BP inferred)   Bm1_43010  retinal homeobox                                    93.3  1.23    2.9*10-4
M6*   M5221_1.02 (CIS-BP inferred)   Bm1_35750  involved in regulation of transcription,            33.3  2.03    3.0*10-5
                                                DNA-templated, and steroid hormone mediated
                                                signaling pathway

Figure 3.6: A conserved site of motif M3.2 in three species. The occurrence of M3.2 in the Bm10655 promoter region is conserved in both C. elegans (promoter of WBGene00003590) and O. volvulus (promoter of WBGene00246409).

3.3.2 Sex-biased motifs

Most nematodes use an X-chromosome dosage mechanism to determine sex: XO animals are male and XX animals are female. Sex determination is an intriguing study area in Brugia malayi, however, because this species has a Y chromosome. To investigate its sex determination mechanism, we first asked whether there are any differences between the promoter elements of chrX genes and non-chrX genes. The promoter regions of 3217 chrX genes and 7705 non-chrX genes were scanned for 163 known Brugia or nematode motifs constructed from the JASPAR CORE 2016 nematodes, CIS-BP Brugia malayi [98], and UniPROBE worm [99] databases.

No significantly enriched motifs were found. This observation is consistent with the hypothesis that "Perhaps the emergence of a Y chromosome in B. malayi has liberated its X from any function in sex determination, resulting in relaxation of selection against rearrangements of X." [116].

Figure 3.7 panels: the discovered motif and its matching known motif (CIS-BP). Motif statistics: ForeCov 87.4%, BackCov 63.8%, p-value 2.7 ∗ 10−11. Conserved sites: Bm6237 (WBGene00242995), Bm4031 (WBGene00247968), Bm17313 (WBGene00249141), Bm7660 (WBGene00242675), Bm17123 (WBGene00240256), Bm8917 (WBGene00246573).

Figure 3.7: A putative TATA-box element that is enriched in male up-regulated genes. The occurrences of this motif are conserved in both C. elegans and O. volvulus.

To gain more insight, we used the sex-specific gene expression data from [95] and applied the Emotif Alpha pipeline. As a result, a motif matching the TATA-box binding protein (TBP) was identified. This motif was discovered from the 174 up-regulated genes in the male 120 days post-infection (UP M120-dpi) dataset (Figure 3.7); it occurred in 87.4% of the up-regulated M120 genes but in only 63.8% of the randomly selected background genes. TATA-box binding protein sites often occur in the promoters of housekeeping genes. However, this motif is not a typical TATA-box (i.e. TATAAA), suggesting it might play a role in sex-related functions.
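An enrichment comparison of this kind can be sketched as a coverage contrast with a hypergeometric tail p-value. This is a simplified illustration; the exact statistical test used by the Emotif Alpha pipeline may differ.

```python
from math import comb

def coverage_enrichment(fore_hits, fore_total, back_hits, back_total):
    """Coverage of a motif in foreground vs. background promoter sets,
    with a one-sided hypergeometric p-value for over-representation."""
    fore_cov = fore_hits / fore_total
    back_cov = back_hits / back_total
    # Hypergeometric tail: of N pooled sequences, K contain the motif;
    # probability of seeing >= fore_hits covered sequences among the
    # fore_total foreground draws.
    N = fore_total + back_total
    K = fore_hits + back_hits
    n = fore_total
    p = sum(comb(K, k) * comb(N - K, n - k)
            for k in range(fore_hits, min(K, n) + 1)) / comb(N, n)
    return fore_cov, back_cov, p
```

A motif covering most foreground promoters but few background promoters yields a small p-value, as with the 87.4% vs. 63.8% contrast above.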

3.4 Analysis of pollen-specific HRGPs expression in Arabidopsis thaliana

Hydroxyproline-rich glycoproteins (HRGPs) are a superfamily of plant cell wall structural proteins that function in various aspects of plant growth and development, including pollen tube growth. The HRGP superfamily is composed of three families: the hyperglycosylated arabinogalactan-proteins (AGPs), the moderately glycosylated extensins (EXTs), and the lightly glycosylated proline-rich proteins (PRPs). These three families vary not only in their degree of glycosylation, but also in their amino acid compositions and sequence repeats, and in their patterns of post-translational proline hydroxylation and glycosylation. In a previous study, Dr. Allan Showalter's lab identified 166 HRGP genes in A. thaliana, including 85 AGP genes, 59 EXT genes, 18 PRP genes, and 4 AGP/EXT hybrid HRGP genes [117]. The biological problem is to find the promoter motifs responsible for pollen-specific expression of HRGPs. To address this, we performed a bioinformatics analysis based on the Emotif Alpha pipeline. The analysis included characterization of pollen-specific HRGPs, identification of pollen-expressed transcription factors, and de novo motif discovery and selection (Figure 3.3).

3.4.1 Identification of pollen-specific HRGPs

We identified 13 pollen-specific HRGPs, shown in Table 3.2, including 8 extensins (6 PERKs, 1 EXT, and 1 PEX) and 5 arabinogalactan-proteins (4 AGPs and 1 FLA). We noticed that 12 of the pollen-specific HRGPs had already been reported to be pollen-specific by an analysis of gene expression microarrays in [117]. Next, we examined whether the 13 pollen-specific HRGPs were expressed in other tissues. Table 3.3 shows the expression profile. As expected, pollen-specific HRGPs were also found in flower tissues (i.e. stage 12 inflorescence and carpel), indicating that the 13 genes also play a role in reproduction.

Table 3.2: Thirteen pollen-specific HRGP genes. ∗These genes have been reported to be pollen-specific. Level of expression is computed using all the gene expression values in pollen. Extremely high: more than 3 standard deviations (STD) above the mean. High: 2 STD above the mean but within 3 STD of the mean.

TAIR ID     Gene Name  Tissue Specificity Index (Tau)  Expression in Pollen  Level of Expression
AT1G10620*  PERK11     0.97                            7.00                  Extremely high
AT1G49270*  PERK7      0.96                            8.71                  Extremely high
AT4G34440*  PERK5      0.95                            7.04                  Extremely high
AT3G18810*  PERK6      0.94                            9.34                  Extremely high
AT2G24450*  FLA3       0.94                            12.28                 Extremely high
AT1G23540*  PERK12     0.94                            7.14                  Extremely high
AT4G33970*  PEX4       0.91                            9.89                  Extremely high
AT1G54215   EXT32      0.91                            5.90                  High
AT2G18470*  PERK4      0.88                            9.80                  Extremely high
AT1G24520*  AGP50C     0.88                            13.27                 Extremely high
AT3G01700*  AGP11C     0.87                            13.08                 Extremely high
AT5G14380*  AGP6C      0.86                            13.01                 Extremely high
AT3G57690*  AGP23P     0.86                            14.22                 Extremely high

Table 3.3: Expression of 13 pollen-specific HRGP genes in all examined tissues. The gene expression value for each tissue was averaged across all the samples in that tissue after log2 transformation.

Gene ID     Pollen  Aerial  Stage 12 inflorescence  Root  Carpel  Light-grown seedling  Leaf  Shoot apical meristem  Receptacle  Root apical meristem  Dark-grown seedling
AT1G10620   7.0     0.0     2.3                     0.0   0.0     0.0                   0.0   0.0                    0.0         0.0                   0.0
AT1G49270   8.7     0.0     3.1                     0.0   0.1     0.0                   0.0   0.0                    0.0         0.0                   0.0
AT4G34440   7.0     0.0     3.7                     0.0   0.0     0.0                   0.0   0.0                    0.0         0.0                   0.0
AT3G18810   9.3     0.0     4.9                     0.0   0.4     0.0                   0.0   0.0                    0.2         0.0                   0.0
AT2G24450   12.3    0.0     7.1                     0.0   0.0     0.0                   0.0   0.0                    0.1         0.0                   0.0
AT1G23540   7.1     0.0     3.8                     0.0   0.0     0.0                   0.0   0.0                    0.0         0.0                   0.0
AT4G33970   9.9     0.0     4.7                     0.0   1.4     0.0                   0.0   0.0                    1.4         0.0                   0.5
AT1G54215   5.9     0.0     3.3                     0.0   0.2     0.0                   0.0   1.0                    0.0         1.4                   0.0
AT2G18470   9.8     0.0     5.4                     1.9   1.4     0.0                   0.0   0.0                    2.6         0.0                   0.0
AT1G24520   13.3    0.0     8.0                     0.0   3.9     0.0                   0.0   0.0                    4.1         0.0                   0.0
AT3G01700   13.1    0.0     8.7                     0.0   3.9     0.0                   0.0   0.0                    4.0         0.0                   0.0
AT5G14380   13.0    1.2     8.9                     0.0   3.8     0.0                   0.0   0.3                    3.7         0.0                   0.0
AT3G57690   14.2    0.0     10.4                    0.0   5.1     0.0                   0.0   0.0                    5.3         0.0                   0.0


3.4.2 Discovery of putative motifs

An ensemble motif discovery method was applied to the promoter regions (defined as the 1000bp upstream of the translation start site) of the pollen-specific HRGPs. Three over-represented and conserved motifs were discovered (Table 3.4). Two of the motifs matched GATA9 binding motifs. It is known that GATA9 is the closest homolog of GATA12, and the expression level of GATA12 is high in mature pollen grains but low in germinated pollen grains [118].
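The promoter definition used above can be expressed as a small helper. This is a sketch; the coordinate conventions in the actual pipeline may differ.

```python
def promoter_region(tss, strand, upstream=1000):
    """Promoter coordinates, defined here as the `upstream` bp before the
    translation start site, as a 0-based half-open (start, end) interval.
    `tss` is the translation start coordinate on the genome."""
    if strand == "+":
        return (max(0, tss - upstream), tss)
    # On the minus strand, "upstream" lies at larger genome coordinates.
    return (tss, tss + upstream)
```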

Table 3.4: List of putative promoter motifs for pollen-specific HRGPs.

Motif Name  TFBS match (p-value)  TF TAIR ID  Fore Coverage  Back Coverage  Relative Frequency
Motif_1     GATA9 (0.0005)        AT4G32890   76.9%          9.8%           7.8
Motif_2     ZML1 (0.0009)         AT3G21175   84.6%          12.1%          7.0
Motif_3     GATA9 (0.0018)        AT4G32890   76.9%          12.1%          6.3

Next, we asked whether there were any combinatorial patterns among the discovered motifs. We used the feature map tool from RSAT (Regulatory Sequence Analysis Tools) [119] to visualize the occurrences of the motifs in the promoter sequences (Figure 3.8). Interestingly, we found that motif 2 and motif 3 seemed to form a module with a gap distance of around 220bp. This pattern occurs in AT3G18810, AT3G57690, AT1G24520, AT3G01700, and AT4G33970.

Figure 3.8: Sequence visualization of the occurrences of the three putative motifs in HRGP gene promoter regions. Positions run from the translation start site (0) to 1000 bp upstream. Motif 2 and motif 3 seemed to form a module with a gap distance of around 220bp. This pattern occurs in AT3G18810, AT3G57690, AT1G24520, AT3G01700, and AT4G33970.
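A module pattern like this can also be screened for programmatically. The sketch below reports, for each occurrence of one motif, the distance to the nearest downstream occurrence of another motif; it is an illustration, not the RSAT feature map tool.

```python
import bisect

def module_gaps(occ_a, occ_b):
    """Gap (in bp) between each occurrence of motif A and the nearest
    downstream occurrence of motif B within the same promoter.
    occ_a and occ_b are sorted lists of start positions."""
    gaps = []
    for a in occ_a:
        i = bisect.bisect_left(occ_b, a)  # first B occurrence at or after a
        if i < len(occ_b):
            gaps.append(occ_b[i] - a)
    return gaps
```

Promoters where the reported gap clusters around 220bp would match the motif 2 / motif 3 module described above.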

3.5 In vivo prediction of transcription factor binding sites in 14 cell types

3.5.1 Overview of the ENCODE-DREAM Challenge

The ENCODE-DREAM in-vivo Transcription Factor Binding Site Prediction Challenge (https://www.synapse.org/ENCODE) aims to identify the best computational method for predicting the positional binding sites of transcription factors across cell types. The problem involves reverse engineering protein-DNA interactions across 14 human cell types. The difficulty lies in the fact that each cell type has a unique signature, causing protein-DNA interactions to behave differently. To solve the problem, we developed random forest models that integrated four biological features: DNA motifs, DNA shape properties, gene expression profiles, and chromatin accessibility. The competition contained several phases, and we participated in the first phase, the Conference Round. We competed against more than 60 international teams; only 34 teams 'survived' to the final competition. Our solution ranked 14th in the DREAM conference round. Our model worked particularly well, in terms of the auROC and auPRC metrics, for the protein CTCF, an important regulator of 3D genome organization.

3.5.2 Data Description

The ground truth binding positions were determined using conserved ChIP-seq peaks. A label of Bound (B), Unbound (U), or Ambiguous (A) was assigned to each 200bp genomic window, and every two consecutive windows overlapped by 50bp. In total there were 32 transcription factors and 14 cell types, although not every TF-cell-type combination had ChIP-seq data. The human genome was divided into training chromosomes and testing chromosomes (chr1, chr8, and chr21); the data on the testing chromosomes were unseen by every team. The cell types were likewise divided into training cell types, leaderboard cell types, and final competition cell types. During a period of 2 months, each team could submit up to 10 predictions per TF-cell-type combination for the leaderboard round. The data for the final competition cell types were also unseen by every team.
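The windowing scheme described above (fixed 200bp windows, 50bp of overlap between consecutive windows) can be sketched as follows:

```python
def tile_windows(chrom_length, window=200, overlap=50):
    """Yield (start, end) genomic windows of fixed size, where each
    window overlaps the previous one by `overlap` bp."""
    step = window - overlap  # 150bp step for the 200bp/50bp setup
    start = 0
    while start + window <= chrom_length:
        yield (start, start + window)
        start += step
```

Each yielded window would then receive a B/U/A label from the overlapping ChIP-seq peaks.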

3.5.3 Multi-omics models

A good machine learning model depends primarily on the number of training examples and the relevance of the features. We used random forests to train the models because the algorithm is non-parametric (it does not require feature normalization), is non-linear, and runs efficiently on large datasets. All of our code is available on GitHub at https://github.com/YichaoOU/ENCODE-DREAM. Figure 3.9 shows an example of the feature design:

• DNA motifs from 6 databases were used: CIS-BP [98], HOMER [120], HOCOMOCO [121], factorbook [122], -motif [106], and Epigram [123]. Notably, Epigram motifs are discovered from histone marks rather than TF ChIP-seq. The motifs were then ranked by a random forest classifier, and the top 120 TF-motifs and the top 40 Epi-motifs were selected for each TF training set.

• Four in vitro DNA shape features were used: helix twist (HelT), roll (Roll), minor groove width (MGW), and propeller twist (ProT). The mean and max values were extracted using the bigWigAverageOverBed tool from UCSC.

• Open chromatin refers to the DNase-seq signal; again, the mean and max values were extracted using the bigWigAverageOverBed tool from UCSC.

• Gene expression information was used in three different ways: (1) PCA was applied and the first 3 principal components (averaged, since there were two replicates per cell type) were used; (2) for each 200bp window, the expression value (TPM, Transcripts Per Million) of its nearest gene was used; (3) the expression values of the regulators of each TF were used.

• Finally, ten engineered features were added to the model to capture non-linear relationships: the mean motif score multiplied by each of the mean DNase signal, the mean signals of the 4 DNA shape features, the first 3 PCs, the nearest gene's expression level, and the average expression level of the given TF's regulators.
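The ten engineered product features can be sketched as below for a single window. All the feature names here are hypothetical placeholders for the signals described above, not names from the actual code.

```python
def engineered_features(window):
    """Ten illustrative interaction features for one genomic window:
    the mean motif score multiplied by each of the other mean signals.
    `window` maps a feature name (hypothetical) to a float value."""
    m = window["mean_motif_score"]
    partners = ["mean_dnase", "mean_helt", "mean_roll", "mean_mgw",
                "mean_prot", "pc1", "pc2", "pc3",
                "nearest_gene_tpm", "mean_regulator_tpm"]
    return [m * window[p] for p in partners]
```

Product features like these let a tree-based model pick up simple motif-by-signal interactions without explicit feature crosses at training time.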

Figure 3.9: An example of the multi-omics features used in the final model. The last column is the class label: 1 indicates bound, 0 indicates unbound. The data were highly imbalanced, with an average positive-to-negative ratio of 1:50.

Since the final competition round required 13 TF-cell-type predictions, the time needed for our final model to produce these predictions (including training time) was 2650 hours (Figure 3.10, model 5). The most time-consuming part was processing the DNA motifs, since 6 motif databases were used. The computation was done on three computers with 16 cores in total. However, the maximum memory on each computer was only 16GB. Due to this limitation, the number of training samples was only around 1.2 million per TF, whereas the total number of genomic bins was 50 million.

Figure 3.10: Estimated computational time for the 5 models that we built. The last model was the one submitted to the final competition; models 1 to 4 were used in the leaderboard round to tune and adjust the models.

3.5.4 Competition Performance

The predictive performance is shown in Table 3.5. Note that none of the final round cell types overlap with either the training cell types or the leaderboard cell types, which makes the prediction task very challenging. Our model had auROC scores above 90% for most of the TF-cell-type datasets; the lowest score was 85.9%, for HNF4A in the liver cell type. However, the auPRC and the recall at 50% FDR varied significantly across TF-cell-type combinations. The highest auPRC score was 67.3%, for CTCF in the iPSC cell type, and the lowest was 11.4%, for NANOG in the same cell type. The auPRC scores showed the largest variation, which was found to be caused by noise in the ChIP-seq data [124].

Table 3.5: Summary of the performance metrics evaluated by the ENCODE-DREAM Challenge. These 13 TF-cell-type combinations constitute the final competition of the conference round. Our best-ranked model was the CTCF model.

TF     Cell type  # training cell types  Rank  auROC   auPRC   Recall at 50% FDR
CTCF   iPSC       7                      7     0.9815  0.6734  0.6785
CTCF   PC-3       7                      7     0.9167  0.4486  0.4046
HNF4A  liver      1                      12    0.8591  0.3949  0.4056
MAX    liver      8                      18    0.9385  0.2913  0.1743
GABPA  liver      6                      13    0.9329  0.2857  0.2269
REST   liver      6                      14    0.8787  0.2651  0.1740
FOXA2  liver      1                      14    0.9241  0.2290  0.1593
EGR1   liver      4                      10    0.9062  0.2173  0.0984
E2F1   K562       2                      22    0.9350  0.2048  0.0000
TAF1   liver      5                      14    0.9118  0.1901  0.0172
FOXA1  liver      1                      17    0.9154  0.1621  0.0881
JUND   liver      6                      17    0.9202  0.1593  0.0644
NANOG  iPSC       1                      17    0.9222  0.1141  0.0000

We find that the number of training cell types has a weak correlation with auPRC (ρ=0.47) and with recall at 50% FDR (ρ=0.51), but shows no correlation with auROC (ρ=-0.01), suggesting the model does benefit from learning from more cell types. Interestingly, the auPRC performance of our model across the 13 TF/cell-type pairs is highly correlated with that of the two conference round winners, autosome.ru and J-TEAM, with ρ=0.88 and ρ=0.93, respectively. This observation suggests that predictive performance also highly depends on the given TF and cell type.
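Recall at 50% FDR, one of the challenge's metrics, can be computed from ranked predictions as in the sketch below. This is a simple formulation; the challenge's official evaluation code may differ, for example in how ties are handled.

```python
def recall_at_fdr(scores, labels, fdr=0.5):
    """Recall at a given false discovery rate: rank windows by score,
    find the largest prefix whose precision is at least 1 - fdr, and
    report the fraction of all positives recovered in that prefix."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(labels)
    best_recall, tp = 0.0, 0
    for rank, i in enumerate(order, start=1):
        tp += labels[i]
        if tp / rank >= 1 - fdr:  # precision at this score cutoff
            best_recall = max(best_recall, tp / total_pos)
    return best_recall
```

A score of 0.0000, as for E2F1 and NANOG in Table 3.5, means no score cutoff ever reached the required precision.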

3.6 Motif selection in ChIP-seq data

3.6.1 Comparison of set cover based methods

Three set cover algorithms were evaluated on the same 55 TF group datasets used by the enrichment method [106]. Unlike the enrichment method, which calculates an enrichment score for each motif and then selects the top 10 motifs, the set cover methods iteratively optimize a group of selected motifs toward high foreground coverage and low background coverage. The foreground coverage represents the fraction of ChIP-seq bound regions that contain the selected motifs. As shown in Figure 3.11(a), the median foreground coverage of the enrichment method is 66.6%; although this is 1.7% higher than that of the tabu search method, it is 4.8% and 6.3% lower than those of the greedy and RILP methods, respectively. On the foreground coverage metric, the RILP method performed best.

The background coverage shows the fraction of ChIP-seq unbound regions that contain the selected motifs; in other words, it represents the false positive rate, as the motifs are not expected to occur in the background sequences. As shown in Figure 3.11(b), the median background coverage of the enrichment method is 19.8%, which is 3.5% higher than that of both the greedy and RILP methods. On the background coverage metric, the tabu search method performed best.

Figure 3.11: Boxplots of the 4 evaluation metrics: (a) foreground coverage (%), (b) background coverage (%), (c) error rate (%), and (d) number of motifs. Median values and all the data points are shown.

The error rate represents the percentage of misclassified sequences if the selected set of motifs is used to predict the regions bound by a TF. The median error rate of the enrichment method is 29.1% (Figure 3.11(c)). All three set cover based methods have a lower median cost, and the RILP method has the lowest median cost, 22.7%. The median number of selected motifs does not vary much (around 2 or 3 motifs) across these methods (Figure 3.11(d)). However, their ranges differ significantly. For example, the enrichment method has a range from 1 to 10 and the RILP method has a range from 1 to 12. On the other hand, the greedy and tabu search methods pick only 1 to 4 motifs for each TF group. It is worth noting that the RILP method selected 1 to 4 motifs in most cases (52/55). Therefore, the set cover based methods select fewer motifs than the enrichment method, and the tabu search method generally picks the smallest number of motifs.

Our results generally confirm the extent of the sequence coverage problem [107], which describes the observation that simply selecting the top motifs, as the enrichment method does, fails to cover all the sequences of interest. However, this observation is not always true; the severity of the sequence coverage problem varies by transcription factor. For example, the top enriched motifs (i.e. the enrichment method) for NRF1, CTCF, REST, ETS, and EGR1 all cover more than 90% of the bound regions (Table A.1). In all five of these TF groups, the enrichment method reported a larger number of motifs (Table A.2); it selected 10 CTCF motifs, while the greedy method and the RILP method selected 1 and 2 motifs, respectively. The number of discovered motifs is greatly reduced using set cover based methods. The minimal description length (MDL) principle favors hypotheses that describe the biological data using fewer symbols [125]. Similarly, the set cover methods discover fewer motifs, which in turn tend to cover fewer background sequences (Table A.3) and thus produce a lower cost (Table A.4).
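The greedy flavor of set cover based motif selection can be sketched as below: at each step, pick the motif whose newly covered foreground sequences most exceed its newly covered background sequences. This illustrates the general idea only, not the dissertation's exact greedy, tabu search, or RILP implementations.

```python
def greedy_motif_selection(fore_sets, back_sets, n_motifs):
    """Greedy set-cover-style motif selection.

    fore_sets[j] / back_sets[j] are the sets of foreground / background
    sequence ids that contain motif j. Stops when no remaining motif
    improves the coverage objective."""
    selected, fg_cov, bg_cov = [], set(), set()
    candidates = set(range(n_motifs))
    while candidates:
        def gain(j):
            # Newly covered foreground minus newly covered background
            return len(fore_sets[j] - fg_cov) - len(back_sets[j] - bg_cov)
        best = max(candidates, key=gain)
        if gain(best) <= 0:
            break
        selected.append(best)
        fg_cov |= fore_sets[best]
        bg_cov |= back_sets[best]
        candidates.remove(best)
    return selected
```

Because the objective penalizes background coverage, the procedure naturally stops after a few motifs, matching the small solution sizes (1 to 4 motifs) observed for the greedy method above.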

3.6.2 Shared motifs between the solutions of set cover based methods and the enrichment method

The three set cover based methods found the same motifs in 7 factor groups as reported in [106]. As shown in Table 3.6, these shared motifs occur more frequently in the bound regions than in the background regions. For example, TFAP2_disc2 occurs in 76.3% of the TFAP2 binding peaks, yet in only 9.7% of the background sequences.

TAL1_disc1 matches the binding motif of GATA. It has been shown that TAL1 acts as a cofactor for GATA3 [126]. More recently, Moreau et al. identified "GATA1, FLI1, and TAL1 as a minimal and sufficient combination of TFs to induce the formation of MK precursors from hPSCs" [127], which has a great impact on transfusion medicine. PBX3_disc2 matches the known MEIS1 motif [106], which is consistent with their cooperative binding activity [128]. Moreover, it is known that PBX3 and MEIS1 work cooperatively in hematopoietic cells to drive acute myeloid leukemia (AML) [129], suggesting that PBX3_disc2 might play an important role in the progression of AML.

Table 3.6: Shared motifs between the set cover and enrichment methods (motif logos omitted).

Motif Name     ForeCov  BackCov

TFAP2_disc2 76.3% 9.7%

POU5F1_disc1 71.8% 12.2%

REST_disc3 60.7% 8.4%

TAL1_disc1 47.3% 7.0%

ZNF143_disc3 39.8% 11.9%

PAX5_disc1 37.8% 5.6%

PBX3_disc2 37.3% 7.4%


3.6.3 Putative cofactors identified by set cover based methods

Table 3.7: Putative cofactors discovered by the set cover based methods.

Factor group  Discovery tool  ForeCov  BackCov  JASPAR match        Match p-value  Species, class, family
HEY1          MEME            67.6%    22.0%    MA0528.1 (ZNF263)   3.11E-13       Homo sapiens; C2H2 zinc finger factors; more than 3 adjacent zinc finger factors
BRCA1         AlignACE        46.8%    2.8%     MA0527.1 (ZBTB33)   3.98E-06       Homo sapiens; C2H2 zinc finger factors; other factors with up to three adjacent zinc fingers
GATA          MEME            40.8%    34.9%    MA0528.1 (ZNF263)   5.95E-08       Homo sapiens; C2H2 zinc finger factors; more than 3 adjacent zinc finger factors
RXRA          MEME            39.1%    22.8%    MA1149.1 (RXRG)     1.79E-11       Homo sapiens; nuclear receptors with C4 zinc fingers; thyroid hormone receptor-related factors (NR1)
PBX3          AlignACE        31.8%    8.3%     MA0516.1 (SP2)      3.21E-08       Homo sapiens; C2H2 zinc finger factors; three-zinc finger Kruppel-related factors
EP300         MEME            30.1%    25.0%    MA0528.1 (ZNF263)   9.16E-09       Homo sapiens; C2H2 zinc finger factors; more than 3 adjacent zinc finger factors

Next, we asked whether the set cover based methods identified any known motifs that were missed by the enrichment method [106]. To do this, we took the union of the motifs selected by the set cover methods and filtered out those similar to the motifs discovered by the enrichment method. The remaining motifs were matched against the 579 JASPAR 2018 [97] vertebrate non-redundant motifs using Tomtom [130] with a q-value cutoff of 0.01, resulting in six motifs (Table 3.7). Interestingly, the motifs in the HEY1, GATA, and EP300 factor groups all matched the binding motif of ZNF263. It has been reported that HEY1 and ZNF263 are highly expressed (fold change >= 25) in the CD34+ cell line [131], suggesting that they might be cofactors. The ZBTB33 motif found in the BRCA1 bound regions is consistent with the finding that BRCA1 might "bind ZBTB33 to perform their functions in DNA repair and genome maintenance" [122]. Moreover, both BRCA1 and ZBTB33 are strongly associated with TP53 [132], suggesting they might have a cooperative function in cancer. RXRA and RXRG are the retinoic acid receptors RXR-alpha and RXR-gamma, respectively; hence, it is expected to find the binding motif of RXRG in RXRA bound regions. Overall, the motifs identified by the set cover methods shed light on the importance of cofactors in gene regulation.

3.7 Conclusion

DNA motif discovery is an old and classic problem in bioinformatics that has made a significant contribution to our understanding of gene regulation. However, new biological features, combined with DNA motifs, have been found to be more predictive of true in vivo binding sites. These features include DNase hypersensitivity, DNA shape, flanking regions of the binding site, branch-of-origin [133], gene expression, and genetic variation. Moreover, the human genome contains roughly 1400 transcription factors and more than 8000 TF isoforms [134]. Therefore, current methods for the identification and characterization of TFBSs are far from complete. To address some of these issues, this chapter presented the Emotif Alpha pipeline, which integrates gene expression, sequence conservation, and known-motif similarity information for more accurate identification of TFBSs. The pipeline also addresses the large-number-of-motifs issue through its set cover based motif selection algorithms. By providing a small set of high-confidence motifs, it becomes easier to validate them in the lab and thus gain important biological insights. The computational model presented for the ENCODE-DREAM Challenge provided an integrated framework for combining multi-omics information in a machine learning model. The set cover based motif selection algorithms were demonstrated to provide novel insights into transcription factor binding using 349 ChIP-seq experiments.

4 Characterization of Locus-Specific Chromatin Interactions

4.1 Introduction

4.1.1 History

The study of nuclear structure dates back to the 1800s, when Flemming coined the term "chromatin" (Figure 4.1). More properties and structures of chromatin were discovered over the next one hundred years: 1. chromatin is heritable (1883 and 1888); 2. chromatin is made of chromosomes (1889); 3. chromatin has active regions, called euchromatin, and inactive regions, called heterochromatin (1928); 4. the chromatin fiber (1973), nucleosomes (1975), and chromosome territories (1982) were discovered. In the past 20 years, innovations in sequencing methods have pushed the field to the next level, namely the 3D genome. These methods include ChIA-PET (2009) [135], Hi-C (2009) [136], GAM (2017) [16], and SPRITE (2018) [137]. These techniques are relatively new, and the interpretation of their data poses novel challenges [138].


Figure 4.1: Timeline of studies of chromatin structure and nuclear organization. This figure was made for the Wikipedia article “Chromosome Conformation Capture”. 82

4.1.2 The Hi-C method

Hi-C [136] is the most widely used method to explore long-range chromatin interactions on a genome scale. It is based on the chromosome conformation capture (3C) technique [139], which captures one-to-one interactions. Its successors are chromosome conformation capture-on-chip (4C) [140], which captures one-to-all interactions, and chromosome conformation capture carbon copy (5C) [141], which captures many-to-many interactions. The most advanced methods, ChIA-PET [135] and Hi-C [136], capture all-to-all interactions. Currently, most chromatin interaction datasets are generated by Hi-C. The procedure of Hi-C is as follows (Figure 4.2). First, cells are fixed using formaldehyde, which preserves the interactions. Then, the DNA is cut using a restriction enzyme such as HindIII or EcoRI, which cuts the genome approximately every 4kb. Next, the DNA fragments are ligated and amplified using PCR. Lastly, the DNA fragments are sequenced using a paired-end sequencer. Most Hi-C datasets are low resolution (i.e. ≥10kb), and the resolution of a Hi-C experiment depends strongly on the sequencing depth. One criterion is that the bin size (i.e. resolution) should be chosen so that "at least 80% of all possible bins have more than 1,000 contacts", which suggests that to achieve a 10kb resolution in human, a Hi-C experiment needs more than 300 million mapped reads [142].
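The bin-size criterion quoted above can be turned into a simple procedure: try candidate resolutions from finest to coarsest and keep the first one where enough bins exceed the contact threshold. The sketch below is a simplification that bins single contact coordinates on one chromosome.

```python
def smallest_valid_bin(contact_positions, genome_length, bin_sizes,
                       min_contacts=1000, min_fraction=0.8):
    """Pick the smallest bin size (finest resolution) such that at least
    `min_fraction` of bins have more than `min_contacts` contacts.
    `contact_positions` are genomic coordinates of mapped contacts."""
    for size in sorted(bin_sizes):
        n_bins = -(-genome_length // size)  # ceiling division
        counts = [0] * n_bins
        for pos in contact_positions:
            counts[pos // size] += 1
        well_covered = sum(c > min_contacts for c in counts)
        if well_covered / n_bins >= min_fraction:
            return size
    return None  # no candidate resolution satisfies the criterion
```

With deeper sequencing (more contact positions), smaller bin sizes pass the criterion, which is why resolution tracks sequencing depth.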

4.1.3 The GAM method

Although the 3C-based methods have significantly increased our understanding of 3D nuclear structure and genome function, they have important technical limitations, such as restriction site density bias. Moreover, Hi-C can fail to identify interactions within nuclear bodies because the genomic loci involved are too far apart to ligate [137]. Therefore, in 2017,


Figure 4.2: Workflow of Hi-C, GAM, and SPRITE. All three techniques start with crosslinking of DNA and protein. Hi-C then performs DNA ligation and paired-end sequencing. GAM performs cryosectioning and laser microdissection to extract the DNA fragments captured in one nuclear profile; with hundreds of nuclear profiles from different cells, GAM then uses statistical inference to estimate interaction probabilities. SPRITE performs a split-and-pool procedure repeated multiple times, so that interacting DNA sequences receive the same barcode, which can then be used to infer interaction probabilities.

In 2017, Pombo et al. developed genome architecture mapping (GAM), a proximity-ligation-free method, to detect genome-wide chromatin interactions without ligation bias [16]. GAM combines sequencing of many ultrathin nuclear slices to infer 3D genome structure from linear genomic co-segregation. Specifically, cells are fixed with formaldehyde, embedded in sucrose, and frozen (Figure 4.2). The next steps are cryosectioning and laser microdissection. Once an individual nuclear profile is obtained, its genomic DNA is fragmented, amplified, and sequenced. In the GAM paper, the authors identified 408 high-quality nuclear profiles and used them to infer chromatin interactions, radial distributions, and chromatin compaction [16]. It is worth noting that the underlying hypothesis of GAM is that "loci that are closer to each other in the nuclear space (but not necessarily on the linear genome) are detected in the same nuclear profile more often than distant loci".

4.1.4 The SPRITE method

Another proximity-ligation-free method is SPRITE, short for split-pool recognition of interactions by tag extension [137]. SPRITE can simultaneously capture all genomic loci that interact with a protein complex (i.e., multi-way interactions). Using SPRITE, Guttman et al. found that gene-dense, highly transcribed Pol II regions were associated with nuclear speckles, whereas gene-poor, inactive regions were close to the nucleolus. The SPRITE paper pointed out the importance of nuclear bodies, stating that "nuclear bodies act as inter-chromosomal hubs that shape the overall packaging of DNA in the nucleus" [137]. Compared to GAM, SPRITE is easier to perform and does not require whole-genome amplification. The protocol of SPRITE is as follows (Figure 4.2). First, cells are fixed with formaldehyde. Then all complexes are split into a 96-well plate ("split"), tagged with a specific sequence using DNA ligation ("tag"), and pooled into a single well ("pool"). This split-pool tagging step is repeated several times, producing a unique sequence barcode shared by all DNA molecules from one complex. A SPRITE cluster is thus defined as all reads that contain the same barcode sequence. For example, more than 75 SPRITE clusters mapping to 3 distinct regions were found in the major histone cluster in human (HIST1), consistent with the observation that the histone genes are associated with a specific nuclear body called the histone locus body (HLB) [137].

4.1.5 Genome architecture and nuclear organization

4.1.5.1 Nucleosomes

DNA is tightly compacted into chromosomes, and the nucleosome is the basic unit of this packaging. Nucleosomes form the beads-on-a-string structure [143]. One nucleosome comprises about 200bp of DNA, a histone octamer, and a histone linker. The histone octamer consists of two copies each of H2A, H2B, H3, and H4; the histone linker is histone H1. About 146bp of DNA is wrapped around the octamer, and a linker DNA of about 54bp connects one nucleosome to the next [144].

4.1.5.2 Chromatin loops

Chromatin loops are interactions between genomic loci. For example, a gene promoter is a proximal regulatory region that initiates gene transcription, and an enhancer is a distal regulatory region that boosts gene expression. Through protein complexes (e.g., CTCF and cohesin), enhancers that can be megabases away from their target gene are able to contact the target gene's promoter to increase its transcription rate and stability. Chromatin loops can occur both intra-chromosomally and inter-chromosomally. Intra-chromosomal interactions such as promoter-enhancer interactions are widely studied, while X-inactivation and centromere clustering are regulated by inter-chromosomal interactions [145, 146]. Chromatin loops are mediated by transcription factors that bind the DNA sequences at the interacting loci [147]. There are three known types of chromatin loops:

• Cohesin-CTCF-mediated chromatin loops The cohesin complex is highly conserved across species and is composed of Smc1, Smc3, and Rad21/Scc1 subunits. Studies found that cohesin colocalizes with CTCF, and together they form the protein complex that regulates chromatin loops [148]. Moreover, a study that inverted CTCF sites using CRISPR found that the orientation of the CTCF motif is crucial for loop formation. Specifically, the two boundaries of a chromatin loop should contain two CTCF motifs in a convergent orientation; that is, one occurs on the positive strand and the other occurs on the negative strand [149].

• YY1-mediated enhancer-promoter interactions The transcription factor Yin Yang 1 (YY1) is enriched in active promoter-enhancer interactions and a study found that YY1 is an important structural protein for promoter-enhancer loops [150].

• Polycomb-mediated chromatin loops While most chromatin loops in mammals are bound by the cohesin complex, a study in Drosophila discovered an enrichment of interactions mediated by polycomb repressive complex 1 (PRC1) [151]. Their results showed that chromatin loops can also repress gene transcription.

• Other putative mediators An integrative network analysis of TF ChIP-seq and Hi-C data found many other potential chromatin loop complexes. For example, PML−FOXM1−MTA3−STAT5A−CEBPB−RUNX3 might play a role in immune response and MAX−MAZ−MXI1−CHD2−BHLHE40 might be involved in ribonucleoprotein biogenesis [147].
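The convergent-orientation rule for cohesin-CTCF loops can be expressed as a small predicate over the strands of the two anchor motifs (a hypothetical helper for illustration, not part of the cited study):

```python
def is_convergent(left_anchor_strand, right_anchor_strand):
    """A loop's two CTCF motifs are convergent when the upstream (left)
    anchor motif lies on the '+' strand and the downstream (right)
    anchor motif lies on the '-' strand, so both point into the loop."""
    return left_anchor_strand == "+" and right_anchor_strand == "-"

print(is_convergent("+", "-"))  # True  (convergent: loop can form)
print(is_convergent("+", "+"))  # False (tandem orientation)
```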

4.1.5.3 The loop extrusion hypothesis

The loop extrusion model is one of the models that explain the dynamics of chromatin loops [152–154]. The model states that loop extruding factors (LEFs, e.g. the cohesin complex) randomly anchor on the DNA in a stochastic process. DNA on both sides of the anchor is then extruded through the LEF, and extrusion halts in each direction when the LEF reaches opposing CTCF sites [152]. The model has four parameters: velocity, lifetime, separation, and barrier strength.

An animation of the loop extrusion model can be found at http://symposium.cshlp.org/content/suppl/2018/05/04/sqb.2017.82.034710.DC1/Supplemental Movie 1.mp4.
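The extrusion dynamics described above can be sketched as a minimal one-dimensional simulation (a toy model with hypothetical parameter values; the published simulations [152–154] are stochastic and far more detailed):

```python
def extrude_loop(load_pos, n_bins, ctcf_sites, velocity=1, lifetime=50):
    """Toy 1-D loop extrusion: a LEF loads at bin `load_pos` and its two
    legs move outward `velocity` bins per step; a leg halts at a CTCF
    barrier, and extrusion ends when both legs are halted or the LEF's
    lifetime expires. Returns the final (left, right) loop anchors."""
    left = right = load_pos
    for _ in range(lifetime):
        if left not in ctcf_sites and left > 0:
            left = max(0, left - velocity)
        if right not in ctcf_sites and right < n_bins - 1:
            right = min(n_bins - 1, right + velocity)
        if left in ctcf_sites and right in ctcf_sites:
            break  # both legs blocked by opposing CTCF sites
    return left, right

# A LEF loading at bin 50 between CTCF barriers at bins 20 and 80.
print(extrude_loop(50, 100, ctcf_sites={20, 80}))  # (20, 80)
```

Shortening the lifetime stops extrusion before the barriers are reached, yielding a smaller loop.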

4.1.5.4 TADs, sub-TADs, mega-TADs

A fundamental question in 3D genome organization is how chromatin loops form chromosome territories. Analyses of many Hi-C datasets have revealed megabase-sized local interaction domains, termed topologically associating domains (TADs) [155]. A TAD is a genomic segment containing many chromatin loops, in which intra-TAD interactions are more frequent than inter-TAD interactions. In the seminal paper [155], the authors discovered several properties of TADs:

• CTCF motifs are enriched in TAD borders.

• TAD borders are highly conserved between cell types, and even between human and mouse.

• Within each TAD, the promoter-enhancer interactions are highly cell type-specific.

In addition to TADs, finer-scale structures such as sub-TADs [156] and higher-order structures such as mega-TADs [157] have also been shown to play important roles in genome function.
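The defining property of a TAD (intra-TAD contacts more frequent than inter-TAD contacts) can be quantified on a binned contact matrix; the sketch below is a hypothetical helper applied to a toy matrix:

```python
import numpy as np

def intra_inter_ratio(contacts, tad_start, tad_end):
    """Mean contact frequency within a TAD divided by the mean frequency
    between the TAD and the rest of the chromosome. Ratios well above 1
    reflect the defining property of a TAD. `contacts` is a symmetric
    bin-by-bin contact matrix; the TAD spans bins [tad_start, tad_end)."""
    inside = contacts[tad_start:tad_end, tad_start:tad_end]
    left = contacts[tad_start:tad_end, :tad_start]
    right = contacts[tad_start:tad_end, tad_end:]
    intra = inside.mean()
    inter = np.concatenate([left.ravel(), right.ravel()]).mean()
    return intra / inter

# Toy matrix: a dense 5-bin block (bins 2..6) on a sparse background.
m = np.full((10, 10), 1.0)
m[2:7, 2:7] = 10.0
print(intra_inter_ratio(m, 2, 7))  # 10.0
```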

4.1.5.5 Compartments A and B

Another important chromatin structure is the A/B compartment, usually observed on a multi-megabase scale. A/B compartments are identified by eigenvalue decomposition of the observed/expected-normalized contact matrix. The sign of the first eigenvector defines the compartments: the A compartment (positive) correlates with active gene expression, and the B compartment (negative) correlates with repressed genomic regions [136]. It is worth noting that A/B compartments can also be identified from epigenetic data such as DNA methylation and DNase hypersensitivity [158].
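A minimal sketch of the compartment call, assuming the matrix is already observed/expected-normalized (taking the correlation first follows common practice; the eigenvector sign is arbitrary without an external track such as gene density):

```python
import numpy as np

def ab_compartments(observed_expected):
    """Assign A/B compartments from the leading eigenvector of the
    correlation of an observed/expected-normalized contact matrix.
    Positive entries -> compartment A, negative -> compartment B."""
    corr = np.corrcoef(observed_expected)
    vals, vecs = np.linalg.eigh(corr)   # eigenvalues in ascending order
    ev1 = vecs[:, -1]                   # eigenvector of the largest eigenvalue
    return np.where(ev1 >= 0, "A", "B")

# Toy plaid matrix: bins 0-2 and bins 3-5 form two checkerboard groups.
oe = np.array([[2.0, 2, 2, .5, .5, .5],
               [2, 2, 2, .5, .5, .5],
               [2, 2, 2, .5, .5, .5],
               [.5, .5, .5, 2, 2, 2],
               [.5, .5, .5, 2, 2, 2],
               [.5, .5, .5, 2, 2, 2]])
print(ab_compartments(oe))  # the two groups receive opposite labels
```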

4.1.5.6 Histone locus body

Histone genes (HGs) are among the most conserved genes in terms of both sequence and structure. Most HGs have neither a polyA tail nor introns, making them unique among eukaryotic genes. The expression of HGs varies across the cell cycle; those whose expression peaks in S phase are called replication-dependent HGs (RDHGs). RDHGs fall into 5 classes: H1, H2a, H2b, H3, and H4 [159]. The largest histone gene cluster is HIST1 (human, chr6) or Hist1 (mouse, chr13), referred to as the major HG locus. Two smaller clusters, HIST2 (Hist2) and HIST3 (Hist3), are referred to as the minor HG loci. A specific nuclear body, the histone locus body (HLB), is required for the transcription of RDHGs. Cajal bodies are found to co-exist with HLBs during interphase [160].

4.1.6 Bioinformatics tools for chromatin interaction analysis

The analysis of chromatin interactions can generally be divided into the following categories; for a comprehensive review and comparison, see [161–165].

4.1.6.1 Identification of chromatin interactions

One successful method is Fit-Hi-C [166]. It uses a binomial distribution to model the contact probability of a random pair of loci at a given genomic distance. The probability of each observed interaction is estimated by spline fitting on contact counts normalized with the ICE method [167]. Other common tools include HiCCUPS [168], HIPPIE [169], etc.
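The binomial idea can be illustrated in a few lines, assuming the distance-dependent contact probability has already been estimated (the prior here is made up; Fit-Hi-C's spline fitting and refinement steps are omitted):

```python
from math import comb

def interaction_pvalue(observed, total_contacts, prior_p):
    """P(X >= observed) for X ~ Binomial(total_contacts, prior_p): the
    chance of seeing at least this many contacts for a locus pair whose
    distance-dependent contact probability is prior_p (in Fit-Hi-C this
    prior comes from spline fitting over ICE-normalized counts)."""
    cdf = sum(comb(total_contacts, k) * prior_p**k * (1 - prior_p)**(total_contacts - k)
              for k in range(observed))
    return 1.0 - cdf

# A pair observed 30 times when ~10 were expected (10,000 total contacts,
# per-contact probability 0.001) is highly significant.
print(interaction_pvalue(30, 10_000, 0.001))
```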

4.1.6.2 Identification of higher-order chromatin structures

MrTADFinder [170] identifies TADs with a network optimization framework based on the concept of modularity and can call TADs at multiple resolutions. ClusterTAD [171] detects TADs using unsupervised learning, e.g. k-means clustering. Other common tools include TADtree [172], InsulationScore [173], etc. A critical assessment of multiple TAD-calling methods showed great inconsistency between different tools, although each individual tool was quite reliable [174].

4.1.6.3 Visualization of chromatin interactions

Heatmaps are a commonly used visualization for pairwise chromatin interactions. Juicebox [175], written in Java, provides a computationally efficient way to visualize large contact matrices. However, heatmaps are limited in their ability to display other omics datasets, such as ChIP-seq signals and gene annotations. The 3D Genome Browser [176] and the WashU Epigenome Browser [4] can be used instead for more interactive visualization.

4.1.6.4 Differential interaction analysis

Compared to the number of tools for differential gene expression analysis, far fewer tools exist for identifying differential interactions, and a comprehensive comparison of the available algorithms is still needed. Interestingly, most tools for differential interaction analysis are written in R. Table 4.1 shows a list of tools.

Table 4.1: Methods for identifying differential interactions.

Tool              Availability                                                          Language
diffloop [177]    https://bioconductor.org/packages/release/bioc/html/diffloop.html     R
FIND [178]        https://bitbucket.org/nadhir/find                                     R
diffHic [179]     http://www.bioconductor.org/packages/release/bioc/html/diffHic.html   R
HiCcompare [180]  https://bioconductor.org/packages/HiCcompare                          R
chromoR [181]     https://rdrr.io/cran/chromoR/                                         R
HiCdat [182]      https://github.com/MWSchmid/HiCdat                                    R
HOMER [120]       http://homer.ucsd.edu/homer/interactions/                             Perl

4.2 Methods

4.2.1 A multi-omics bioinformatics pipeline for analyzing locus-specific chromatin interactions

To the best of our knowledge, there are currently no tools for analyzing locus-specific or gene-specific chromatin interactions. To fill this gap, we developed the bioinformatics pipeline shown in Figure 4.3. The input was an interaction matrix, which could be Hi-C, GAM, or SPRITE data. Exploratory visualization could be performed on the initial interaction list, including bar plots of the number of interactions per chromosome. Two filtering steps followed. The first was a score filter, which removed interactions scoring below a user-defined threshold in order to retain significant interactions; in this study, Fit-Hi-C q-value ≤ 0.01 and SLICE/SPRITE interaction score > 0 were used. The second was a locus/loci filter: users could provide a specific genomic region, such as the Hist1 region in mouse (i.e. chr13:21700000-24100000), or a list of genomic regions in BED format. After the two filters, the numbers of significant Hist1-specific interactions were 503, 13334, and 626887 for Hi-C, SLICE, and SPRITE, respectively.


Figure 4.3: The multi-omics pipeline for integrating Hi-C, GAM, and SPRITE data. Significant interactions were extracted based on an interaction score filter. Locus-specific interactions were extracted based on genomic location using the locus filter. Interactions were annotated using numeric features, such as gene expression and DNA motifs, and categorical features, such as gene annotation, LAD regions, and chromHMM annotation. The pipeline provided visualizations such as heatmaps, Cytoscape [3], and the WashU epigenome browser [4].

The pipeline provided three visualization methods: heatmaps, the WashU epigenome browser [4], and a network file in GML format for Cytoscape [3]. Next, network hub analysis was performed and the top 5 hubs were retained. In the last step, motif and motif pair analyses were performed for each interaction hub.
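The two filters can be sketched as follows (hypothetical record layout and function name; the actual pipeline also accepts BED input for the locus filter):

```python
def filter_interactions(interactions, min_score, locus):
    """Sketch of the pipeline's two filters: each interaction is
    (chrom1, start1, end1, chrom2, start2, end2, score). Keep
    interactions whose score passes the threshold and where at least
    one anchor overlaps the target locus."""
    chrom, lo, hi = locus

    def overlaps(c, s, e):
        return c == chrom and s < hi and e > lo

    return [x for x in interactions
            if x[6] >= min_score
            and (overlaps(x[0], x[1], x[2]) or overlaps(x[3], x[4], x[5]))]

# The Hist1 locus used in this study, with three toy interaction calls.
hist1 = ("chr13", 21_700_000, 24_100_000)
calls = [("chr13", 21_800_000, 21_840_000, "chr13", 23_900_000, 23_940_000, 5.2),
         ("chr13", 30_000_000, 30_040_000, "chr13", 31_000_000, 31_040_000, 7.1),
         ("chr13", 21_800_000, 21_840_000, "chr13", 23_900_000, 23_940_000, 0.1)]
print(len(filter_interactions(calls, min_score=1.0, locus=hist1)))  # 1
```

Only the first call survives: the second fails the locus filter and the third fails the score filter.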

4.2.2 Motif enrichment analysis

ChromHMM [183] annotations for mouse embryonic stem cells (mESC) were downloaded from https://github.com/guifengwei/ChromHMM mESC mm10 [184] and used to extract promoter regions (i.e. E7), enhancer regions (i.e. E4, E8), and insulator regions (i.e. E1) from the Hist1-specific interactions. A chromHMM genomic segment was used if and only if it fell inside an interaction window. The window sizes were 20kb, 30kb, and 40kb for SPRITE, GAM, and Hi-C data, respectively. Mouse motifs were downloaded from the HOCOMOCO v11 core motif database [121]. Background regions were defined as all regions with the same chromHMM annotation, excluding the foreground regions. For example, there were 22634 promoter regions defined by chromHMM in mESC [184]; if there were x foreground promoters, the background contained 22634 - x promoters. Fisher's exact test (http://www.regulatory-genomics.org/motif-analysis/method/) was used to evaluate motif enrichment by computing the following 2 × 2 contingency table:

• number of foreground sequences that have at least one occurrence of a given motif.

• number of foreground sequences that have no occurrence of the given motif.

• number of background sequences that have at least one occurrence of the given motif.

• number of background sequences that have no occurrence of the given motif.
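The test on this 2 × 2 table can be sketched with a stdlib-only one-sided Fisher's exact test (the counts below are hypothetical; in practice a statistics library would be used):

```python
from math import comb

def fisher_exact_greater(fg_with, fg_without, bg_with, bg_without):
    """One-sided Fisher's exact test on the 2x2 table described above:
    the probability that the motif is at least as over-represented in
    the foreground, from the hypergeometric distribution."""
    n = fg_with + fg_without + bg_with + bg_without
    row = fg_with + fg_without          # total foreground sequences
    col = fg_with + bg_with             # sequences containing the motif
    denom = comb(n, col)
    return sum(comb(row, k) * comb(n - row, col - k)
               for k in range(fg_with, min(row, col) + 1)) / denom

# 40 of 50 foreground promoters contain the motif vs 300 of 1000 background.
print(fisher_exact_greater(40, 10, 300, 700))
```

With 80% foreground coverage against a ~30% background rate, the p-value is vanishingly small, so such a motif would be reported as enriched.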

4.2.3 Motif pair enrichment analysis

One difficulty for motif pair analysis is enumerating all possible pairs and counting their occurrences across all chromatin interactions (i.e., window pairs). For example, 300 motifs yield 45150 motif pairs; with 300k interactions, at least 13,545,000,000 calculations are needed. In this study, we developed an efficient algorithm based on matrix operations to solve this problem. An example of the procedure is shown in Figure 4.4, and an overview of the algorithm is shown in Figure 4.5. The algorithm was implemented on GPU and is extensible to enumerating multiway interacting motifs. To analyze the time complexity of the motif pair finding algorithm, let L be the number of interactions and M the number of motifs. Step 1 takes O(2LM) time. Step 2 performs a matrix multiplication, which takes O(4LM^2). Step 3 takes O(LM) time. Step 4 takes O(M^3) time. Step 5 takes O(M^2) time. Overall, the algorithm has a time complexity of O(LM^2 + M^3). A naive approach to counting motif pair frequencies is to: (1) generate the two motif mapping tables shown in step 1 of Figure 4.5, (2) choose two motif columns and count the motif pair occurrences, and (3) repeat (2) for every pairwise combination of motifs. This approach also has a time complexity of O(LM^2). However, because explicitly enumerating all combinations of two columns cannot easily be parallelized, whereas matrix operations are highly parallelizable, the naive approach is not feasible for large datasets. Moreover, fast matrix multiplication algorithms have been proposed; for example, François Le Gall proposed an algorithm with a time complexity of O(n^2.3728639) [185]. Matrix operations can also be offloaded to a GPU, which reduces the running time of our motif pair finding algorithm by a large constant factor.
Specifically, we used Tesla P100 GPUs at the Ohio Supercomputer Center (OSC) [114], each with 3584 CUDA cores. Together, these properties make our matrix-based motif pair finding algorithm much faster and more appealing than the naive approach.
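The five steps condense into three NumPy matrix operations; the sketch below reproduces the worked example of Figure 4.5, where A and B are the binary motif mapping tables for the two anchor windows (on GPU the same expressions run on device arrays):

```python
import numpy as np

def motif_pair_occurrences(A, B):
    """Count, for every motif pair (i, j), the interactions in which one
    anchor window contains motif i and the other contains motif j.
    A, B: binary (interactions x motifs) motif mapping tables for
    window 1 and window 2 of each interaction.
    Union (step 2):        A^T B + B^T A counts both orderings;
    Intersection (step 4): C^T C with C = A * B (motif present in both
    windows), which the union double-counts;
    Result (step 5):       union minus intersection."""
    A = np.asarray(A)
    B = np.asarray(B)
    union = A.T @ B + B.T @ A
    C = A * B
    return union - C.T @ C

# The worked example of Figure 4.5 (motif columns: AP2A, NFAC2, STA5A, CTCF).
A = np.array([[1, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 1]])
B = np.array([[0, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 1], [0, 0, 0, 1]])
print(motif_pair_occurrences(A, B))  # reproduces the occurrence table in Figure 4.5
```

For example, the (STA5A, CTCF) entry is 2: the pair spans the anchors of the first and fourth interactions.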


Figure 4.4: An example of motif pair enumeration and occurrence calculation. In this example, two interactions are provided in the interaction table, and the occurrences of motif 1 and motif 2 in the interacting windows are shown in the motif mapping table. Step 1 generates two per-window motif mapping tables. Step 2 generates the motif pair union table. Steps 3 and 4 generate the motif pair intersection table. In the last step, a matrix subtraction is performed between the motif pair union table and the motif pair intersection table.


Figure 4.5: GPU-accelerated motif pair enumeration and occurrence calculation. This is an overview of the motif pair finding algorithm. Inputs are an interaction table and a motif mapping table. To count motif pair occurrences, the motif pair union table is generated in step 2, and the motif pair intersection table is generated in steps 3 and 4. The final motif pair occurrence table is the difference between the union table and the intersection table.

To identify enriched motif pairs, a background list of 10000 interactions was randomly sampled from the same type of chromHMM states. The enrichment of motif pairs was determined with Fisher's exact test, following the same procedure as the motif enrichment analysis.

4.3 Analysis of Hist1-specific chromatin interactions

Chromatin interaction hubs are modulated by nuclear bodies that specialize in various cellular functions, such as RNA splicing and viral defense. In particular, the histone locus body (HLB) is associated with highly efficient transcription of histone genes. Previous characterizations of the Hist1 cluster have revealed higher-order chromatin structures such as TADs [137, 186]. Here, we integrated Hi-C, GAM, and SPRITE data and identified a complex chromatin organization signature of the Hist1 cluster in mouse embryonic stem cells (mESC). Specifically, we performed network hub analysis and identified hubs with diverse functions. These hubs contained not only histone genes and other active genes, but also lamina-associated domains (LADs) and Polycomb domains. Motif and motif pair analyses further revealed putative transcription factors that may play important roles in each hub. Our results were corroborated by cross-comparison of three different chromatin interaction profiling techniques. Overall, we provide insights into the chromatin organization of the HLB at a finer granularity.

4.3.1 Chromatin interactions between histone genes in the Hist1 clusters

The mouse Hist1 cluster is located on chromosome 13, from 21.7MB to 24.1MB. We extracted chromatin interactions overlapping the Hist1 cluster from the Fit-Hi-C interaction list, the SLICE matrix, and the SPRITE matrix. The Hist1 cluster is about 2.4 MB long and contains five subclusters (Figure 4.6). Inside the cluster, there is a 1.8 MB gap region between the upstream two and the downstream three Hist1 subclusters. Hi-C showed interactions between four of the five subclusters but did not capture interactions involving Hist1h2aa and Hist1h2ba; these two histone genes are known to be expressed only in testis [159]. SLICE showed interactions between histone genes across the gap region. SPRITE showed that all histone genes interacted with each other, consistent with the finding in [137].

Figure 4.6: WashU epigenome browser visualization of the chromatin interactions between histone genes. The mouse Hist1 gene cluster contains five subclusters, which are highlighted in green. Chromatin interactions between these clusters for each technique are shown as pink arcs. The CTCF binding signal is shown as the first track, followed by the gene annotation track and the chromatin interaction tracks of Hi-C, SLICE, and SPRITE.

4.3.2 Chromatin interactions between lamina-associated domains in the Hist1 cluster

It is interesting that a large LAD region (1.1 MB) lies inside the Hist1 cluster in mouse embryonic stem cells (Figure 4.7), suggesting that when the HLB actively transcribes histone genes it pushes the gap region toward the nuclear lamina. Hi-C, SLICE, and SPRITE all captured interactions within the LAD region inside the Hist1 cluster. However, only SPRITE captured interactions between the LAD region inside the Hist1 cluster and the LAD region upstream of the cluster.

Figure 4.7: WashU epigenome browser visualization of the chromatin interactions between lamina-associated domains (LADs) in the Hist1 cluster. The mouse Hist1 gene cluster contains five subclusters, which are highlighted in green. LAD regions are shown in blue. There is a LAD region (1.1 MB in length) inside the Hist1 cluster, just about 19kb downstream of a Hist1 subcluster, and another LAD region 83kb upstream of the Hist1 cluster. The constitutive LAD track is shown first, followed by the gene annotation track and the chromatin interaction tracks of Hi-C, SLICE, and SPRITE.

4.3.3 Diverse functions of interaction hubs

Next, we visualized and annotated the top five hubs for each technique (Figure 4.8). Interestingly, the top five Hi-C hubs (Table 4.2) did not overlap any histone genes but overlapped one LAD and two Polycomb domains. The top five SLICE hubs (Table 4.3) overlapped two histone gene clusters and two LADs. The top five SPRITE hubs (Table 4.4) did not overlap any histone gene cluster but overlapped one open chromatin region and one Polycomb domain. The common hubs were located close to the 3'-end of the Hist1 cluster. This region contained no active signal in DNase-seq, GRO-seq, H3K27me3, Pol-II S5P, or Pol-II S7P. Since this region had two CTCF peaks, it might function as an insulator.

Figure 4.8: WashU epigenome browser visualization of the Hist1 cluster in mouse embryonic stem cells. The mouse Hist1 gene cluster contains five subclusters, which are highlighted in green. LAD regions are shown in blue. Polycomb domains defined by H3K27me3 peaks are shown in yellow. The top 5 interaction hubs for Hi-C, SLICE, and SPRITE are shown, along with other characteristics, including markers of open chromatin (i.e. DNase-seq), CTCF binding signals, actively transcribed regions (i.e. GRO-seq), Polycomb domains (i.e. H3K27me3), promoter states (i.e. H3K27me3, RNAPIIS5p, RNAPIIS7p), chromatin states (i.e. chromHMM), and nuclear lamina (i.e. cLAD).

Table 4.2: Annotation of the top five Hi-C hubs. The genomic locations are shown in the first column. Hubs are ranked by the number of contacts.

Genomic Coordinates | Contacts | Genes | Gene Expression | LAD | Functional Elements
chr13:23480814-23520814 | 37 | C230035I16Rik, Abt1 | Abt1: 12.84, C230035I16Rik: 1.87 | FALSE | Repressed Region:3, Bivalent Promoter:3, Active Promoter:7, Strong Enhancer:4, Enhancer:2, Intergenic Region:2, Transcriptional Elongation:2, Heterochromatin:1, Transcriptional Transition:1, Insulator:1
chr13:22135581-22175581 | 31 | . | . | TRUE | Intergenic Region:1, Enhancer:1, Strong Enhancer:1, Active Promoter:1, Repressed Region:1
chr13:23870490-23910490 | 30 | Trim38, Slc17a2 | Trim38: 0.21, Slc17a2: 0.0 | FALSE | Enhancer:1, Strong Enhancer:1, Repressed Region:1, Intergenic Region:2, Insulator:1
chr13:23902235-23942235 | 25 | Slc17a2, Slc17a3 | Slc17a2: 0.0, Slc17a3: 0.0 | FALSE | Insulator:1
chr13:23353920-23393920 | 23 | 4933404K08Rik | 4933404K08Rik: 0.0 | FALSE | Heterochromatin:1, Intergenic Region:4, Active Promoter:4, Repressed Region:3, Bivalent Promoter:3, Enhancer:1, Insulator:1

Table 4.3: Annotation of the top five SLICE hubs. The genomic locations are shown in the first column. Hubs are ranked by number of contacts.

Genomic Coordinates | Contacts | Genes | Gene Expression | LAD | Functional Elements
chr13:22710000-22740000 | 450 | Vmn1r206 | Vmn1r206: 0.0 | TRUE | Heterochromatin:1
chr13:23850000-23880000 | 449 | Hist1h4a, Hist1h3a, Hist1h1a, Trim38 | Hist1h4a: 2.15, Hist1h3a: 0.14, Hist1h1a: 2.28, Trim38: 0.21 | FALSE | Intergenic Region:4, Active Promoter:3, Strong Enhancer:4, Enhancer:4, Insulator:1
chr13:23880000-23910000 | 342 | Trim38, Slc17a2 | Trim38: 0.21, Slc17a2: 0.0 | FALSE | Intergenic Region:1, Insulator:1
chr13:22320000-22350000 | 330 | Vmn1r194 | Vmn1r194: 0.0 | TRUE | .
chr13:23760000-23790000 | 329 | Hist1h2ac, Hist1h2bc, Hist1h1t | Hist1h1t: 0.0, Hist1h2ac: 0.02, Hist1h2bc: 14.42 | FALSE | Repressed Region:2, Active Promoter:3, Enhancer:3, Strong Enhancer:2, Intergenic Region:3, Heterochromatin:1, Bivalent Promoter:1

Table 4.4: Annotation of the top five SPRITE hubs. The genomic locations are shown in the first column. Hubs are ranked by the number of contacts.

Genomic Coordinates | Contacts | Genes | Gene Expression | LAD | Functional Elements
chr13:23980000-24000000 | 5672 | Slc17a4, Slc17a1 | Slc17a4: 0.0, Slc17a1: 0.0 | FALSE | Repressed Region:1
chr13:23940000-23960000 | 5669 | Slc17a3 | Slc17a3: 0.0 | FALSE | Repressed Region:1
chr13:24080000-24100000 | 5669 | Scgn | Scgn: 0.07 | FALSE | Bivalent Promoter:1, Strong Enhancer:2, Active Promoter:1, Enhancer:3, Repressed Region:2, Intergenic Region:2
chr13:22100000-22120000 | 5668 | Prss16 | Prss16: 1.43 | FALSE | Bivalent Promoter:2, Active Promoter:3, Enhancer:3, Intergenic Region:2, Repressed Region:3, Strong Enhancer:2
chr13:22000000-22020000 | 5666 | . | . | FALSE | Active Promoter:6, Strong Enhancer:2, Enhancer:1, Intergenic Region:4, Bivalent Promoter:1

4.3.4 Putative motifs and motif pairs that regulate interaction hubs

The promoters captured by each of the top five Hi-C interaction hubs and their contacted regions were enriched for IRF1, NFAC2, NFAC3, and STAT2 motifs (Table 4.5). Both IRF1 and STAT2 can act as transcriptional activators. A diverse set of transcription factor binding motifs was enriched in the SLICE hub interactions (Table 4.6), including BACH1, which can act as both an activator and a repressor. STAT1 and STAT2 were enriched in all SPRITE interaction hubs and their contacted promoter regions (Table 4.7).

Table 4.5: Enriched promoter motifs in the top five Hi-C interaction hubs. The top two enriched motifs, if present, are shown for each hub.

Hi-C Hub  Enriched Motifs  P-value  ForeCov  ForeNum  BackCov
Hub1 IRF1 1.14E-12 72.9% 62 34.8%
Hub1 NFAC2 7.56E-12 49.4% 42 17.0%
Hub2 NFAC3 1.83E-07 49.1% 26 17.7%
Hub2 STAT2 1.39E-06 71.7% 38 38.6%
Hub5 STAT2 9.72E-08 77.1% 37 38.6%
Hub5 IRF1 1.20E-07 72.9% 35 34.9%

Table 4.6: Enriched promoter motifs in the top five SLICE interaction hubs. The top two enriched motifs, if present, are shown for each hub.

SLICE Hub  Enriched Motifs  P-value  ForeCov  ForeNum  BackCov
Hub1 BACH1 0.006969 23.9% 32 14.9%
Hub2 TCF7 0.003383 54.5% 60 40.5%
Hub2 CUX2 0.003458 14.5% 16 6.7%
Hub3 TF7L1 0.000854 27.0% 27 14.3%
Hub3 CUX2 0.001101 16.0% 16 6.7%
Hub5 NFAC4 0.008121 30.2% 26 18.6%

Table 4.7: Enriched promoter motifs in the top five SPRITE interaction hubs. The top two enriched motifs, if present, are shown for each hub.

SPRITE Hub  Enriched Motifs  P-value  ForeCov  ForeNum  BackCov
Hub1 STAT1 0.000202 51.5% 493 45.4%
Hub1 STAT2 0.001122 43.8% 419 38.5%
Hub2 STAT1 0.00018 51.6% 495 45.4%
Hub2 STAT2 0.000702 44.0% 422 38.5%
Hub3 STAT1 0.000202 51.5% 493 45.4%
Hub3 STAT2 0.000415 44.2% 423 38.4%
Hub4 STAT1 0.000269 51.4% 494 45.4%
Hub4 STAT2 0.000799 43.9% 422 38.5%
Hub5 STAT1 0.000265 51.4% 493 45.4%
Hub5 STAT2 0.00048 44.1% 423 38.4%

Next, we examined enriched motif pairs in the promoter-associated interactions for each interaction hub. A motif pair between IRF1 and STAT2 was enriched in the promoter-enhancer interactions of Hi-C hub 1 (Table 4.8); the STRING database suggests a strong association between these two transcription factors [187]. SLICE hub 1 showed an enriched motif pair between BACH1 and SP3 (Table 4.9). Interestingly, both transcription factors can act as activators and repressors. SPRITE hub 4 showed an enriched motif pair between ATF3 and STAT1 (Table 4.10); this protein-protein interaction has been verified in [188].

Table 4.8: Enriched motif pairs in the promoter-associated interactions from the top five Hi-C interaction hubs. The top two enriched motif pairs, if present, are shown for each interaction type in each hub.

Hi-C Hub  Interaction Type  Motif 1  Motif 2  P-value  ForeCov  ForeNum  BackCov
Hub1 Promoter-Heterochromatin NFAC2 LEF1 8.64E-41 62.1% 157 22.2%
Hub1 Promoter-Heterochromatin IRF1 LEF1 8.32E-34 73.1% 185 35.1%
Hub1 Promoter-Enhancer IRF1 FOXJ3 1.45E-136 52.3% 647 18.4%
Hub1 Promoter-Enhancer IRF1 STAT2 2.35E-80 47.2% 584 21.1%
Hub1 Promoter-Insulator IRF1 ZN431 6.84E-06 36.7% 18 11.9%
Hub1 Promoter-Insulator IRF1 SOX4 0.003357 34.7% 17 17.1%
Hub1 Promoter-Promoter IRF1 STAT2 1.50E-117 65.4% 357 18.7%
Hub1 Promoter-Promoter IRF1 NFAC3 5.22E-117 53.3% 291 11.3%
Hub2 Promoter-Heterochromatin FOXO4 NFAC3 4.89E-14 95.2% 20 18.7%
Hub2 Promoter-Heterochromatin SOX4 NFAC3 2.03E-13 100.0% 21 24.8%
Hub2 Promoter-Enhancer STAT2 PO3F1 9.65E-72 64.0% 114 8.8%
Hub2 Promoter-Enhancer IRF9 STAT2 3.92E-66 66.9% 119 11.4%
Hub2 Promoter-Insulator STAT2 GRHL2 4.47E-23 63.8% 37 9.9%
Hub2 Promoter-Insulator GRHL2 NFAC3 3.64E-19 43.1% 25 4.2%
Hub2 Promoter-Promoter STAT2 TBP 2.93E-34 78.8% 41 8.3%
Hub2 Promoter-Promoter STAT2 PO2F1 4.79E-34 76.9% 40 7.7%
Hub5 Promoter-Heterochromatin IRF1 OTX2 2.54E-25 71.2% 74 22.7%
Hub5 Promoter-Heterochromatin IRF1 PO2F1 6.14E-25 68.3% 71 20.8%
Hub5 Promoter-Enhancer STAT2 OTX2 4.54E-40 41.7% 135 11.8%
Hub5 Promoter-Enhancer CUX1 STAT2 4.70E-39 25.9% 84 4.2%
Hub5 Promoter-Insulator PO2F2 STAT2 1.21E-25 66.1% 37 8.8%
Hub5 Promoter-Insulator PO2F2 IRF1 8.00E-25 62.5% 35 7.8%
Hub5 Promoter-Promoter STAT2 PO2F1 2.36E-69 61.4% 108 8.0%
Hub5 Promoter-Promoter IRF1 PO2F1 1.79E-67 58.0% 102 7.0%

Table 4.9: Enriched motif pairs in the promoter-associated interactions from the top five SLICE interaction hubs. The top two enriched motif pairs, if present, are shown for each interaction type in each hub.

SLICE Hub  Interaction Type  Motif 1  Motif 2  P-value  ForeCov  ForeNum  BackCov
Hub1  Promoter-Heterochromatin  BACH1  SP3  2.28E-27  79.9%  107  33.8%
Hub1  Promoter-Heterochromatin  BACH1  KLF3  2.27E-24  75.4%  101  32.1%
Hub2  Promoter-Heterochromatin  SRF  TCF7  2.52E-157  74.2%  512  23.9%
Hub2  Promoter-Heterochromatin  SIX4  TCF7  4.63E-120  77.4%  534  32.6%
Hub2  Promoter-Enhancer  SRF  TCF7  2.73E-130  35.0%  764  12.1%
Hub2  Promoter-Enhancer  PBX2  TCF7  1.22E-118  29.2%  638  9.2%
Hub2  Promoter-Insulator  MEF2D  TCF7  1.43E-49  29.2%  182  8.0%
Hub2  Promoter-Insulator  MEF2A  TCF7  1.91E-46  31.1%  194  9.6%
Hub2  Promoter-Promoter  SRF  TCF7  1.25E-81  56.1%  180  11.0%
Hub2  Promoter-Promoter  DBP  TCF7  7.37E-66  36.8%  118  4.8%
Hub3  Promoter-Insulator  PRRX2  TF7L1  2.75E-25  27.0%  27  1.5%
Hub3  Promoter-Insulator  DBP  TF7L1  1.73E-24  27.0%  27  1.6%
Hub5  Promoter-Heterochromatin  NFAC4  BHE40  3.10E-56  33.1%  214  9.5%
Hub5  Promoter-Heterochromatin  NFAC4  PBX2  9.21E-55  35.7%  231  11.3%
Hub5  Promoter-Enhancer  NFAC4  PBX2  4.10E-53  16.5%  254  4.8%
Hub5  Promoter-Enhancer  NFAC4  SOX2  7.28E-50  21.5%  332  8.1%
Hub5  Promoter-Insulator  NFAC4  OVOL2  7.60E-07  3.5%  16  0.7%
Hub5  Promoter-Insulator  NFAC4  BHE40  2.65E-05  7.0%  32  3.0%
Hub5  Promoter-Promoter  NFAC4  PBX2  2.92E-24  20.9%  52  3.5%
Hub5  Promoter-Promoter  NFAC4  TBP  1.73E-21  20.9%  52  4.0%

Table 4.10: Enriched motif pairs in the promoter-associated interactions from the top five SPRITE interaction hubs. The top two enriched motif pairs, when present, are shown for each interaction type in each hub.

SPRITE Hub  Interaction Type  Motif 1  Motif 2  P-value  ForeCov  ForeNum  BackCov
Hub3  Promoter-Promoter  FOXJ2  STAT1  2.07E-172  57.7%  552  15.5%
Hub3  Promoter-Promoter  STAT1  FOXO4  4.52E-152  56.0%  535  16.3%
Hub3  Promoter-Insulator  FOXJ2  STAT1  6.43E-172  52.1%  754  16.9%
Hub3  Promoter-Insulator  STAT1  FOXO4  2.18E-167  47.7%  690  14.3%
Hub3  Promoter-Enhancer  STAT1  FOXJ3  0  52.8%  4532  23.7%
Hub3  Promoter-Enhancer  STAT1  TCF7  0  61.6%  5288  31.1%
Hub3  Promoter-Heterochromatin  STAT1  E2F2  0  79.6%  1561  27.1%
Hub3  Promoter-Heterochromatin  STAT1  FOXM1  0  84.1%  1649  35.9%
Hub4  Promoter-Promoter  STAT1  FOXM1  0  56.5%  1623  16.9%
Hub4  Promoter-Promoter  NFIL3  STAT1  0  37.6%  1082  5.8%
Hub4  Promoter-Insulator  CTCFL  STAT2  0  77.3%  3357  40.6%
Hub4  Promoter-Insulator  STAT2  CTCF  0  80.0%  3474  41.4%
Hub4  Promoter-Enhancer  IRF9  STAT1  0  39.2%  6348  12.9%
Hub4  Promoter-Enhancer  ATF3  STAT1  0  37.9%  6128  15.9%
Hub4  Promoter-Heterochromatin  STAT1  HSF2  0  65.5%  3839  24.6%
Hub4  Promoter-Heterochromatin  STAT1  NFYA  0  65.4%  3831  28.8%
Hub5  Promoter-Promoter  STA5A  STAT1  0  46.6%  2662  15.8%
Hub5  Promoter-Promoter  NR2C1  STAT1  0  48.3%  2764  18.8%
Hub5  Promoter-Insulator  CTCFL  STAT1  0  77.2%  6720  46.9%
Hub5  Promoter-Insulator  STAT1  CTCF  0  81.2%  7072  47.7%
Hub5  Promoter-Enhancer  STAT1  FOXM1  0  51.0%  13093  21.1%
Hub5  Promoter-Enhancer  NR2C1  STAT2  0  41.0%  10522  16.9%
Hub5  Promoter-Heterochromatin  STA5A  STAT2  0  60.6%  7117  28.6%
Hub5  Promoter-Heterochromatin  STAT1  IRF3  0  85.0%  9978  53.0%

4.4 Conclusion

This chapter presented a bioinformatics pipeline for analyzing locus-specific chromatin interactions. Specifically, we examined the Hist1-specific interactions in mouse embryonic stem cells using Hi-C, GAM, and SPRITE. The WashU epigenome browser [4] visualizations showed complex chromatin interactions between histone genes and between LADs. Network hub analysis revealed interaction hubs associated with histone genes, LADs, open chromatin regions, and Polycomb domains. The most intriguing observation was the location of the common hubs; based on our data, this region might function as an insulator. Motif and motif-pair analyses revealed putative transcription factors that might facilitate chromatin interactions. Future studies should further explore the interactions among all histone gene clusters, including Hist2 and Hist3. Other network statistics, such as betweenness centrality, can be applied to the Hist1-specific interactions. A comparative genomics approach that analyzes interactions in the Hist1 clusters of human and mouse could identify conserved Hist1 interaction hubs.

5 Discussion

5.1 Summary

I have presented novel bioinformatic methods to discover and interpret biomarkers in proteomics, transcriptomics, genomics, and 3D genomics. Machine learning models, especially feature selection methods, are essential for identifying interesting biological features. In this section, I summarize the implications and insights of these results.

5.2 Towards precision medicine

“Tonight, I’m launching a new Precision Medicine Initiative to bring us closer to curing diseases like cancer and diabetes,” said President Barack Obama in the 2015 State of the Union Address. In fact, the concept of precision medicine is not new; current standards for blood transfusion and organ transplantation already take a patient’s personal genetic background into consideration [189]. Modern precision medicine aims to assess disease risk and select the optimal treatment for an individual [190]. Reaching this goal requires understanding and integrating all levels of omics information. Next-generation sequencing and high-throughput mass spectrometry have enabled us to study human disease at an unprecedented depth and breadth. The algorithmic methods presented in this dissertation provide computational tools for analyzing such multi-omics datasets. The next step is to conduct longitudinal studies and validate promising biomarkers in larger cohorts.

5.3 How to build a good machine learning model?

Formulating the research problem. The first step in building a model is to understand the research question. You should read all the necessary background and then cast the biological problem as a machine learning problem. Your domain knowledge will differentiate you from other people and will ultimately help you build a good model.

Understanding the data. Although artificial intelligence is supposed to take care of everything for us, it has never been smart enough to learn from data without human supervision. Understanding the data is essential to building a good machine learning model. Exploratory data analysis techniques can reveal feature distributions, feature correlations, potential outliers, and so on. Another key component is understanding the noise. For example, one can inspect the probabilities output by a model and correlate them with the target variable and the predictors; the goal is to make sense of the residuals and how those differences (or possible noise) are accounted for by the features. Domain expertise also helps. For example, in the ENCODE-DREAM challenge, one can represent DNA sequences using one-hot encoding or the occurrences of DNA motifs. One-hot encoding converts a DNA sequence (i.e., a string) to numbers by representing, for example, A as (1, 0, 0, 0). While one-hot encoding preserves the entire DNA sequence, the resulting features can be very high dimensional and sparse. DNA motifs, on the other hand, are experimentally verified features. Therefore, the use of DNA motifs is preferred, and it is related to transfer learning.

Choosing a good evaluation function. Unlike machine learning competitions, where the organizers have decided on a practical scoring metric to evaluate model performance, in most bioinformatics applications the scientists have to pick their own scoring function. It is advisable to use multiple scoring functions, since each metric has its own pros and cons.
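As a minimal sketch of the one-hot encoding described above (the A, C, G, T column ordering is a conventional but arbitrary choice, and the helper name `one_hot` is ours):

```python
import numpy as np

# Map each base to a column index; the A, C, G, T ordering is an
# arbitrary but fixed convention.
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    """Encode a DNA string as a (len(seq), 4) binary matrix."""
    mat = np.zeros((len(seq), 4), dtype=np.int8)
    for i, base in enumerate(seq.upper()):
        mat[i, BASE_INDEX[base]] = 1
    return mat

# "A" becomes (1, 0, 0, 0), "C" becomes (0, 1, 0, 0), and so on.
print(one_hot("ACGT"))
```

A sequence of length L yields 4L features, which is why this representation becomes high dimensional and sparse for long regions.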
For example, accuracy, sensitivity, and specificity are direct measures of classification performance; however, they can fail if the dataset is highly imbalanced. The auROC value is another commonly used metric, but it does not provide information about false negatives and false positives. In contrast, the auPRC value avoids this problem by considering precision and recall.
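This failure mode can be illustrated with a small simulation; the synthetic scores, the roughly 1% positive rate, and the use of scikit-learn's `average_precision_score` as an auPRC approximation are illustrative assumptions, not part of the original analysis:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n = 10_000
y_true = (rng.random(n) < 0.01).astype(int)   # ~1% positives
y_score = rng.random(n) + 0.3 * y_true        # a deliberately weak classifier

auroc = roc_auc_score(y_true, y_score)
auprc = average_precision_score(y_true, y_score)
print(f"auROC = {auroc:.3f}, auPRC = {auprc:.3f}")
# On data this imbalanced, the auPRC comes out far lower than the auROC,
# exposing the many false positives that the ROC curve hides.
```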

Applying cross-validation. It is easy to get a surprisingly high score if one trains and tests on the same dataset, because machines are good at memorizing data. Therefore, it is crucial to use cross-validation. The way cross-validation is performed does not affect the final performance as long as there is no information leakage. One can accidentally leak training data information into the testing data without noticing it. For example, when doing feature normalization, one can apply normalization to the entire dataset and only later split the data for cross-validation. Such mistakes are not easy to detect.

Building multiple machine learning models. Traditionally, most people would try many models and then select the best one. It may seem redundant to combine all the models, as models trained on the same dataset are highly correlated. However, people have since realized the power of ensemble learning, and creating model ensembles is now a standard approach [191].

Feature engineering. Most people stop after they have tried everything, and by everything they mean hyper-parameter tuning and ensemble learning. Those who can go further use domain knowledge or automatic methods to do feature engineering. Of those people, only a small portion succeed in gold mining; that is how they win machine learning competitions. Feature engineering can be very frustrating, because most engineered features contribute little to the prediction model. However, feature engineering is likely to be the key difference between a machine learning guru and an ordinary data scientist.
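The normalization leak described above can be avoided by fitting the scaler inside each fold. A minimal sketch, assuming scikit-learn and synthetic data: wrapping the scaler and the model in a `Pipeline` ensures the scaler is refit on only the training split of every fold.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Leak-free: StandardScaler sees only the training portion of each fold,
# instead of being fit once on the full dataset before splitting.
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"5-fold auROC: {scores.mean():.3f} +/- {scores.std():.3f}")
```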

5.4 How to find a good motif?

De novo motif discovery vs. motif finding. There are two advantages to de novo motif discovery: 1. de novo motifs are more specific to the input sequences; 2. the binding sites of one TF can have many variants, and de novo motifs might be more sensitive in finding all the variants than scanning with known motifs alone. However, motif finding has its own advantages;

it is easier to interpret and computationally much faster. On the other hand, de novo motifs require more information, and possibly more omics data, to infer their putative functions.
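A minimal sketch of known-motif scanning: slide a position weight matrix (PWM) along a sequence and report the best log-odds window. The 4 x 4 PWM, the uniform background, and the helper name `best_hit` are invented for illustration:

```python
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def best_hit(seq, pwm, background=0.25):
    """Return (best log-odds score, offset) of pwm along seq."""
    width = pwm.shape[1]
    logodds = np.log2(pwm / background)
    best = (-np.inf, -1)
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        score = sum(logodds[BASES[b], j] for j, b in enumerate(window))
        best = max(best, (score, i))
    return best

# Columns = motif positions; rows = A, C, G, T probabilities.
pwm = np.array([[0.7, 0.1, 0.1, 0.1],
                [0.1, 0.7, 0.1, 0.1],
                [0.1, 0.1, 0.7, 0.1],
                [0.1, 0.1, 0.1, 0.7]])
score, pos = best_hit("TTACGTTT", pwm)
print(score, pos)   # the ACGT window at offset 2 scores highest
```

Production tools such as FIMO [91] follow the same log-odds idea but add background models and p-value calibration.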

Why do you perform motif analysis? Without a clear understanding of the research question, motif analysis can be just garbage in, garbage out. Motif analysis has to be performed on a set of related sequences versus a set of randomly selected background sequences. For example, promoter motif discovery is based on the assumption that co-expression of genes is caused by co-regulation, so one expects to find common TFBSs in the promoter sequences that explain, at least in part, the co-expression pattern. However, one can perform motif analysis on a set of unrelated sequences and still identify statistically significant motifs; such results, of course, make no sense. Therefore, it is important to carefully gather the input sequences according to the hypothesis and research question.
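One common way to formalize this related-versus-background comparison is an exact test on motif occurrence counts, in the spirit of the ForeCov and BackCov columns reported earlier in this chapter. The counts below are invented for illustration:

```python
from scipy.stats import fisher_exact

# Hypothetical counts: sequences containing the motif out of the total,
# in the foreground (related) and background (random) sets.
fore_with, fore_total = 157, 253
back_with, back_total = 555, 2500

table = [[fore_with, fore_total - fore_with],
         [back_with, back_total - back_with]]
odds, p = fisher_exact(table, alternative="greater")
print(f"foreground coverage = {fore_with / fore_total:.1%}, "
      f"background coverage = {back_with / back_total:.1%}, p = {p:.2e}")
```

A tiny p-value here only means the motif is overrepresented relative to the chosen background; if the background is poorly chosen, the result is still garbage out.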

What are good indicators for a successful motif validation? From in silico prediction to experimental validation, much information can be used to increase the chance of a successful DNA motif discovery.

• Although it is assumed that the occurrence of a DNA motif indicates TF binding, the ENCODE-DREAM challenge revealed that in vivo TF binding activity is dynamic and depends largely on open chromatin, DNA shape, gene expression, and cell type-specific factors. Associating this information with a DNA motif is therefore likely to help its experimental validation [192].

• Conserved sequences often indicate an important function. Therefore, motif validation is highly likely to succeed if the motif site is conserved in other species.

5.5 How to hack the 3D genome?

Massively parallel genomic and epigenomic profiling techniques have been applied to the human genome in the last few years. The Human Genome Project, completed in 2003, has inspired numerous large-scale international collaborations. The ENCODE project [5] reported that nearly 80% of the human genome is biochemically active, based on RNA-seq, ChIP-seq, and DNase-seq assays across more than 100 cell types. The NIH Roadmap Epigenomics Mapping Consortium has been producing a human epigenomic database [6]. The Cancer Genome Atlas (TCGA) [193] has characterized many cancer types using next-generation sequencing. Both epigenetic markers and cis-regulatory elements play important roles in gene regulation. Moreover, 3D genome organization has been shown to be essential for transcription factor binding and gene regulation. For example, a gene promoter is a proximal regulatory region, whereas an enhancer is a distal regulatory region. Thanks to several proteins and protein complexes (e.g., CTCF and cohesin), enhancers that can be megabases away from their target genes are able to interact with the target gene’s promoter to increase the transcription rate and stability. Innovative methods, such as Hi-C, GAM, and SPRITE, have yielded important insights into the spatial organization of genomes. However, we still lack a comprehensive understanding of 3D genome organization. As suggested by the 4D Nucleome Network [7], we need “a highly synergistic, multidisciplinary and integrated approach in which groups with different expertise and knowledge, ranging from imaging and genomics to computer science and physics, work closely together to study common cell systems using complementary methods”. To hack the 3D genome, it is important to consider the following issues:

• There is a diverse range of assays for studying 3D genomics. When interpreting the data, it is important to consider each assay’s technical limitations.

• There are multiple levels of nuclear organization. It is very likely that different mechanisms operate at different levels.

• There are multiple nuclear bodies. Each of them may have its own unique interaction profile.

• Analysis of the full contact matrix can be computationally infeasible. Therefore, it is important to consider sparse representations of the data and the use of GPUs.

• Integrative analysis is necessary for interpreting 3D genomics data because most contacts arise from random polymer interactions. Therefore, integrating multi-omics information is important to reduce false positives.

References

[1] L. Chen, J. Xuan, C. Wang, I.-M. Shih, Y. Wang, Z. Zhang, E. Hoffman, and R. Clarke, “Knowledge-guided multi-scale independent component analysis for biomarker identification,” BMC bioinformatics, vol. 9, no. 1, p. 416, 2008.

[2] J. Jin, F. Tian, D.-C. Yang, Y.-Q. Meng, L. Kong, J. Luo, and G. Gao, “Planttfdb 4.0: toward a central hub for transcription factors and regulatory interactions in plants,” Nucleic Acids Research, vol. 45, no. D1, pp. D1040–D1045, 2017. [Online]. Available: http://dx.doi.org/10.1093/nar/gkw982

[3] P. Shannon, A. Markiel, O. Ozier, N. S. Baliga, J. T. Wang, D. Ramage, N. Amin, B. Schwikowski, and T. Ideker, “Cytoscape: a software environment for integrated models of biomolecular interaction networks,” Genome research, vol. 13, no. 11, pp. 2498–2504, 2003.

[4] X. Zhou, R. F. Lowdon, D. Li, H. A. Lawson, P. A. Madden, J. F. Costello, and T. Wang, “Exploring long-range genome interactions using the washu epigenome browser,” Nature methods, vol. 10, no. 5, p. 375, 2013.

[5] T. E. P. Consortium, “An integrated encyclopedia of dna elements in the human genome,” Nature, vol. 489, pp. 57 EP –, Sep 2012, article. [Online]. Available: http://dx.doi.org/10.1038/nature11247

[6] A. Kundaje, W. Meuleman, J. Ernst, M. Bilenky, A. Yen, A. Heravi-Moussavi, P. Kheradpour, Z. Zhang, J. Wang, M. J. Ziller, and M. Kellis, “Integrative analysis of 111 reference human epigenomes,” Nature, vol. 518, no. 7539, p. 317, 2015.

[7] J. Dekker, A. S. Belmont, M. Guttman, V. O. Leshyk, J. T. Lis, S. Lomvardas, L. A. Mirny, C. C. O’Shea, P. J. Park, B. Ren, J. C. R. Politz, J. Shendure, S. Zhong, and t. D. N. Network, “The 4d nucleome project,” Nature, vol. 549, pp. 219 EP –, Sep 2017, perspective. [Online]. Available: https://doi.org/10.1038/nature23884

[8] K. Strimbu and J. A. Tavel, “What are biomarkers?” Current opinion in HIV and AIDS, vol. 5, no. 6, pp. 463–466, Nov 2010.

[9] S. Spiller, Y. Li, M. Blüher, L. Welch, and R. Hoffmann, “Glycated lysine-141 in haptoglobin improves the diagnostic accuracy for type 2 diabetes mellitus in combination with glycated hemoglobin HbA1c and fasting plasma glucose,” Clinical Proteomics, vol. 14, no. 1, p. 10, 2017.

[10] S. Spiller, Y. Li, M. Blüher, L. Welch, and R. Hoffmann, “Diagnostic accuracy of protein glycation sites in long-term controlled patients with type 2 diabetes mellitus and their prognostic potential for early diagnosis,” Pharmaceuticals, vol. 11, no. 2, pp. 1–13, 2018.

[11] R. Hoffmann, A. Frolov, S. Spiller, Y. Li, and L. R. Welch, “Method and means for the non-invasive diagnosis of type ii diabetes mellitus,” US Patent Application 15 101 885, 2017.

[12] K. Y. Lee, R. Sharma, G. Gase, S. Ussar, Y. Li, L. Welch, D. E. Berryman, A. Kispert, M. Blüher, and C. R. Kahn, “Tbx15 defines a glycolytic subpopulation and white adipocyte heterogeneity,” Diabetes, p. db170218, 2017.

[13] “ENCODE-DREAM in vivo transcription factor binding site prediction challenge,” SAGE BIONETWORKS, 2016.

[14] T. I. Lee and R. A. Young, “Transcriptional regulation and its misregulation in disease,” Cell, vol. 152, no. 6, pp. 1237–1251, 2013.

[15] K. Lindblad-Toh, M. Garber, O. Zuk, M. F. Lin, B. J. Parker, S. Washietl, P. Kheradpour, J. Ernst, G. Jordan, E. Mauceli, L. D. Ward, C. B. Lowe, A. K. Holloway, M. Clamp, S. Gnerre, J. Alfoldi,¨ K. Beal, J. Chang, H. Clawson, J. Cuff, F. Di Palma, S. Fitzgerald, P. Flicek, M. Guttman, M. J. Hubisz, D. B. Jaffe, I. Jungreis, W. J. Kent, D. Kostka, M. Lara, A. L. Martins, T. Massingham, I. Moltke, B. J. Raney, M. D. Rasmussen, J. Robinson, A. Stark, A. J. Vilella, J. Wen, X. Xie, M. C. Zody, B. I. S. P. Team, W. G. Assembly, J. Baldwin, T. Bloom, C. Whye Chin, D. Heiman, R. Nicol, C. Nusbaum, S. Young, J. Wilkinson, K. C. Worley, C. L. Kovar, D. M. Muzny, R. A. Gibbs, B. C. o. M. H. G. S. C. S. Team, A. Cree, H. H. Dihn, G. Fowler, S. Jhangiani, V. Joshi, S. Lee, L. R. Lewis, L. V. Nazareth, G. Okwuonu, J. Santibanez, W. C. Warren, E. R. , G. M. Weinstock, R. K. , G. I. a. W. University, K. Delehaunty, D. Dooling, C. Fronik, L. Fulton, B. Fulton, T. Graves, P. Minx, E. Sodergren, E. Birney, E. H. Margulies, J. Herrero, E. D. Green, D. Haussler, A. Siepel, N. Goldman, K. S. Pollard, J. S. Pedersen, E. S. Lander, and M. Kellis, “A high-resolution map of human evolutionary constraint using 29 mammals,” Nature, vol. 478, pp. 476 EP –, Oct 2011, article. [Online]. Available: https://doi.org/10.1038/nature10530

[16] R. A. Beagrie, A. Scialdone, M. Schueler, D. C. A. Kraemer, M. Chotalia, S. Q. Xie, M. Barbieri, I. de Santiago, L.-M. Lavitas, M. R. Branco, J. Fraser, J. Dostie, L. Game, N. Dillon, P. A. W. Edwards, M. Nicodemi, and A. Pombo, “Complex multi-enhancer contacts captured by genome architecture mapping,” Nature, vol. 543, pp. 519 EP –, Mar 2017, article. [Online]. Available: https://doi.org/10.1038/nature21411

[17] D. Latchman, Gene Control, 2nd ed. USA: Garland Science, 2015.

[18] H. M. Chan and N. B. La Thangue, “p300/cbp proteins: Hats for transcriptional bridges and scaffolds,” Journal of cell science, vol. 114, no. 13, pp. 2363–2373, 2001.

[19] Y. Li, V. P. Schulz, C. Deng, G. Li, Y. Shen, B. K. Tusi, G. Ma, J. Stees, Y. Qiu, L. A. Steiner, L. Zhou, K. Zhao, J. Bungert, P. G. Gallagher, and S. Huang, “Setd1a and nurf mediate chromatin dynamics and gene regulation during erythroid lineage commitment and differentiation,” Nucleic Acids Research, vol. 44, no. 15, pp. 7173–7188, 2016. [Online]. Available: http://dx.doi.org/10.1093/nar/gkw327

[20] J. Conaway, J. Bradsher, and R. Conaway, “Mechanism of assembly of the rna polymerase ii preinitiation complex. transcription factors delta and epsilon promote stable binding of the transcription apparatus to the initiator element.” Journal of Biological Chemistry, vol. 267, no. 14, pp. 10 142–10 148, 1992.

[21] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Berlin, Heidelberg: Springer-Verlag, 2006.

[22] T. M. Mitchell, Machine Learning, 1st ed. New York, NY, USA: McGraw-Hill, Inc., 1997.

[23] J. Saez-Rodriguez, J. C. Costello, S. H. Friend, M. R. Kellen, L. Mangravite, P. Meyer, T. Norman, and G. Stolovitzky, “Crowdsourcing biomedical research: leveraging communities as innovation engines,” Nature Reviews Genetics, vol. 17, no. 8, p. 470, 2016.

[24] Centers for Disease Control and Prevention, “National diabetes statistics report, 2017,” Atlanta, GA: Centers for Disease Control and Prevention, US Department of Health and Human Services, 2017.

[25] American Diabetes Association, “Diagnosis and classification of diabetes mellitus,” Diabetes Care, vol. 34, no. Supplement 1, pp. S62–S69, 2011.

[26] S. E. Inzucchi, “Diagnosis of diabetes,” New England Journal of Medicine, vol. 367, no. 6, pp. 542–550, 2012.

[27] D. M. Nathan, “Diabetes: Advances in diagnosis and treatment,” JAMA, vol. 314, no. 10, pp. 1052–1062, 2015. [Online]. Available: http://dx.doi.org/10.1001/jama.2015.9536

[28] S. D. Rathod, A. C. Crampin, C. Musicha, N. Kayuni, L. Banda, J. Saul, E. McLean, K. Branson, S. Jaffar, and M. J. Nyirenda, “Glycated haemoglobin a1c (hba1c) for detection of diabetes mellitus and impaired fasting glucose in malawi: a diagnostic accuracy study,” BMJ open, vol. 8, no. 5, p. e020972, 2018.

[29] J. Lanchantin, R. Singh, Z. Lin, and Y. Qi, “Deep motif: Visualizing genomic sequence classifications,” arXiv preprint arXiv:1605.01133, 2016.

[30] P. Ng, “dna2vec: Consistent vector representations of variable-length k-mers,” arXiv preprint arXiv:1701.06279, 2017.

[31] S. Oba, M.-a. Sato, I. Takemasa, M. Monden, K.-i. Matsubara, and S. Ishii, “A bayesian missing value estimation method for gene expression profile data,” Bioinformatics, vol. 19, no. 16, pp. 2088–2096, 2003. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/btg287

[32] Y. Yang, Z. Xu, and D. Song, “Missing value imputation for microrna expression data by using a go-based similarity measure,” BMC Bioinformatics, vol. 17, no. 1, p. S10, Jan 2016. [Online]. Available: https://doi.org/10.1186/s12859-015-0853-0

[33] R. Wei, J. Wang, M. Su, E. Jia, S. Chen, T. Chen, and Y. Ni, “Missing value imputation approach for mass spectrometry-based data,” Scientific reports, vol. 8, no. 1, p. 663, 2018.

[34] D. J. Stekhoven and P. Bühlmann, “MissForest: non-parametric missing value imputation for mixed-type data,” Bioinformatics, vol. 28, no. 1, pp. 112–118, 2012. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/btr597

[35] J. R. Quinlan, “Induction of decision trees,” Machine learning, vol. 1, no. 1, pp. 81–106, 1986.

[36] ——, C4.5: programs for machine learning. Elsevier, 2014.

[37] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees. Wadsworth and Brooks, 1984.

[38] A. Puurula, J. Read, and A. Bifet, “Kaggle lshtc4 winning solution,” 2014.

[39] J. Sill, G. Takacs, L. Mackey, and D. Lin, “Feature-weighted linear stacking,” p. 117.

[40] S. B. Taieb and R. J. Hyndman, “A gradient boosting approach to the kaggle load forecasting competition,” International Journal of Forecasting, vol. 30, no. 2, pp. 382–394, 2014.

[41] L. Rokach and O. Maimon, Data Mining With Decision Trees: Theory and Applications, 2nd ed. World Scientific Publishing Co., Inc., 2014.

[42] L. K. Hansen and P. Salamon, “Neural network ensembles,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 12, no. 10, pp. 993–1001, Oct 1990.

[43] R. E. Schapire, “The strength of weak learnability,” Mach. Learn., vol. 5, no. 2, pp. 197–227, Jul 1990.

[44] Z.-H. Zhou, Ensemble Learning. Springer US, 2009, pp. 270–273.

[45] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” 1996.

[46] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” in Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’16. New York, NY, USA: ACM, 2016, pp. 785–794. [Online]. Available: http://doi.acm.org/10.1145/2939672.2939785

[47] E. Holloway and I. Marks, “High dimensional human guided machine learning,” arXiv preprint arXiv:1609.00904, 2016.

[48] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001. [Online]. Available: http://dx.doi.org/10.1023/A:1010933404324

[49] A. Töscher, M. Jahrer, and R. M. Bell, “The bigchaos solution to the netflix grand prize,” Netflix prize documentation, pp. 1–52, 2009.

[50] C. J. Burges, “A tutorial on support vector machines for pattern recognition,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167, 1998.

[51] C.-C. Chang and C.-J. Lin, “Libsvm: A library for support vector machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, p. 27, 2011.

[52] S. Ertekin and G. Hopper, “Efficient support vector learning for large datasets,” in Grace Hopper Celebration of Women in Computing. Citeseer, 2006.

[53] I. W. Tsang, J. T. Kwok, and P.-M. Cheung, “Core vector machines: Fast svm training on very large data sets,” Journal of Machine Learning Research, vol. 6, no. Apr, pp. 363–392, 2005.

[54] T. Joachims, “Training linear svms in linear time,” in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006, pp. 217–226.

[55] ——, “A support vector method for multivariate performance measures,” in Proceedings of the 22nd international conference on Machine learning. ACM, 2005, pp. 377–384.

[56] Y. Li, X. A. Shen, R. L. Ewing, and J. Li, “Terahertz spectroscopic material identification using approximate entropy and deep neural network,” in 2017 IEEE National Aerospace and Electronics Conference (NAECON), June 2017, pp. 52–56.

[57] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, “Least angle regression,” The Annals of statistics, vol. 32, no. 2, pp. 407–499, 2004.

[58] R. L. Dykstra, “An isotonic regression algorithm,” Journal of Statistical Planning and Inference, vol. 5, no. 4, pp. 355–363, 1981.

[59] I. Guyon, “An introduction to variable and feature selection,” Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.

[60] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene selection for cancer classification using support vector machines,” Mach. Learn., vol. 46, no. 1-3, pp. 389–422, Mar. 2002. [Online]. Available: http://dx.doi.org/10.1023/A:1012487302797

[61] J. H. Ward Jr, “Hierarchical grouping to optimize an objective function,” Journal of the American statistical association, vol. 58, no. 301, pp. 236–244, 1963.

[62] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Kdd, vol. 96, no. 34, 1996, pp. 226–231.

[63] U. von Luxburg, “Clustering stability: An overview,” Found. Trends Mach. Learn., vol. 2, no. 3, pp. 235–274, Mar. 2010. [Online]. Available: http://dx.doi.org/10.1561/2200000008

[64] W. M. Rand, “Objective criteria for the evaluation of clustering methods,” Journal of the American Statistical association, vol. 66, no. 336, pp. 846–850, 1971.

[65] R. Olson, W. Fu, Daniel, Nathan, PG-TUe, G. Jena, S. Raschka, P. Saha, sahilshah1194, sohnam, DanKoretsky, kadarakos, G. Bradway, M. Ficek, A. Varik, Ted, screwed99, kamalasaurus, R. Carnevale, S. Riccardelli, ktkirk, derekjanni, Yatoom, iddober, and F. T. O’Donovan, “rhiever/tpot: Sparse matrix support, early stopping, and checkpointing,” Sep. 2017. [Online]. Available: https://doi.org/10.5281/zenodo.998172

[66] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.

[67] A. Frolov, M. Blüher, and R. Hoffmann, “Glycation sites of human plasma proteins are affected to different extents by hyperglycemic conditions in type 2 diabetes mellitus,” Analytical and bioanalytical chemistry, vol. 406, no. 24, pp. 5755–5763, 2014.

[68] D. Wang and D. Liu, “Musitedeep: A deep-learning framework for protein post-translational modification site prediction,” in Bioinformatics and Biomedicine (BIBM), 2017 IEEE International Conference on. IEEE, 2017, pp. 2327–2327.

[69] S. Spiller, M. Blüher, and R. Hoffmann, “Plasma levels of free fatty acids correlate with type 2 diabetes mellitus,” Diabetes, Obesity and Metabolism, 2018.

[70] E. Alexopoulos, “Introduction to multivariate regression analysis,” Hippokratia, vol. 14, no. Suppl 1, p. 23, 2010.

[71] R. C. Hardison and J. Taylor, “Genomic approaches towards finding cis-regulatory modules in animals,” Nature Reviews Genetics, vol. 13, pp. 469 EP –, Jun 2012, review Article. [Online]. Available: http://dx.doi.org/10.1038/nrg3242

[72] F. Zambelli, G. Pesole, and G. Pavesi, “Motif discovery and transcription factor binding sites before and after the next-generation sequencing era,” Briefings in bioinformatics, vol. 14, no. 2, pp. 225–237, Mar 2013. [Online]. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3603212&tool=pmcentrez&rendertype=abstract

[73] A. Mathelier and W. W. Wasserman, “The next generation of transcription factor binding site prediction,” PLoS computational biology, vol. 9, no. 9, p. e1003214, Jan 2013. [Online]. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3764009&tool=pmcentrez&rendertype=abstract

[74] S. G. Landt, G. K. Marinov, A. Kundaje, P. Kheradpour, F. Pauli, S. Batzoglou, B. E. Bernstein, P. Bickel, J. B. Brown, P. Cayting, Y. Chen, G. DeSalvo, C. Epstein, K. I. Fisher-Aylor, G. Euskirchen, M. Gerstein, J. Gertz, A. J. Hartemink, M. M. Hoffman, V. R. Iyer, Y. L. Jung, S. Karmakar, M. Kellis, P. V. Kharchenko, Q. Li, T. Liu, X. S. Liu, L. Ma, A. Milosavljevic, R. M. Myers, P. J. Park, M. J. Pazin, M. D. Perry, D. Raha, T. E. Reddy, J. Rozowsky, N. Shoresh, A. Sidow, M. Slattery, J. A. Stamatoyannopoulos, M. Y. Tolstorukov, K. P. White, S. Xi, P. J. Farnham, J. D. Lieb, B. J. Wold, and M. Snyder, “Chip-seq guidelines and practices of the encode and modencode consortia,” Genome Research, vol. 22, no. 9, pp. 1813–1831, 2012. [Online]. Available: http://genome.cshlp.org/content/22/9/1813.abstract

[75] T. L. Bailey, N. Williams, C. Misleh, and W. W. Li, “Meme: discovering and analyzing dna and protein sequence motifs,” Nucleic Acids Research, vol. 34, no. suppl 2, pp. W369–W373, 2006. [Online]. Available: http://dx.doi.org/10.1093/nar/gkl198

[76] W. Ao, J. Gaudet, W. J. Kent, S. Muttumu, and S. E. Mango, “Environmentally induced foregut remodeling by pha-4/foxa and daf-12/nhr,” Science, vol. 305, no. 5691, pp. 1743–1746, 2004. [Online]. Available: http://science.sciencemag.org/content/305/5691/1743

[77] X. Liu, D. L. Brutlag, and J. S. Liu, “BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes,” Pac Symp Biocomput, pp. 127–138, 2001.

[78] G. Thijs, K. Marchal, M. Lescot, S. Rombauts, B. De Moor, P. Rouzé, and Y. Moreau, “A gibbs sampling method to detect overrepresented motifs in the upstream regions of coexpressed genes,” Journal of Computational Biology, vol. 9, no. 2, pp. 447–464, 2002.

[79] G. Pavesi, P. Mereghetti, G. Mauri, and G. Pesole, “Weeder web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes,” Nucleic Acids Research, vol. 32, no. suppl 2, pp. W199–W203, 2004. [Online]. Available: http://dx.doi.org/10.1093/nar/gkh465

[80] A. D. Smith, P. Sumazin, and M. Q. Zhang, “Identifying tissue-selective transcription factor binding sites in vertebrate promoters,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 5, pp. 1560–1565, 2005.

[81] P. Huggins, S. Zhong, I. Shiff, R. Beckerman, O. Laptenko, C. Prives, M. H. Schulz, I. Simon, and Z. Bar-Joseph, “Decod: fast and accurate discriminative dna motif finding,” Bioinformatics, vol. 27, no. 17, pp. 2361–2367, 2011.

[82] D. Quang and X. Xie, “Danq: a hybrid convolutional and recurrent deep neural network for quantifying the function of dna sequences,” Nucleic Acids Research, vol. 44, no. 11, p. e107, 2016. [Online]. Available: http://dx.doi.org/10.1093/nar/gkw226

[83] N. K. Lee, F. L. Azizan, Y. S. Wong, and N. Omar, “Deepfinder: An integration of feature-based and deep learning approach for dna motif discovery,” Biotechnology & Biotechnological Equipment, pp. 1–10, 2018.

[84] V. X. Jin, J. Apostolos, N. S. V. R. Nagisetty, and P. J. Farnham, “W-chipmotifs: a web application tool for de novo motif discovery from chip-based high-throughput data,” Bioinformatics, vol. 25, no. 23, pp. 3191–3193, 2009.

[85] S. J. van Heeringen and G. J. C. Veenstra, “Gimmemotifs: a de novo motif prediction pipeline for chip-sequencing experiments,” Bioinformatics, vol. 27, no. 2, pp. 270–271, 2011.

[86] L. Ettwiller, B. Paten, M. Ramialison, E. Birney, and J. Wittbrodt, “Trawler: de novo regulatory motif discovery pipeline for chromatin immunoprecipitation,” Nature Methods, vol. 4, no. 7, pp. 563–565, 2007.

[87] X. Ma, A. Kulkarni, Z. Zhang, Z. Xuan, R. Serfling, and M. Q. Zhang, “A highly efficient and effective motif discovery method for chip-seq/chip-chip data using positional information,” Nucleic acids research, vol. 40, no. 7, pp. e50–e50, 2011.

[88] I. V. Kulakovskiy, V. Boeva, A. V. Favorov, and V. Makeev, “Deep and wide digging for binding motifs in chip-seq data,” Bioinformatics, vol. 26, no. 20, pp. 2622–2623, 2010.

[89] M. Hu, J. Yu, J. M. Taylor, A. M. Chinnaiyan, and Z. S. Qin, “On the detection and refinement of transcription factor binding sites using chip-seq data,” Nucleic acids research, vol. 38, no. 7, pp. 2154–2167, 2010.

[90] P. Machanick and T. L. Bailey, “Meme-chip: motif analysis of large dna datasets,” Bioinformatics, vol. 27, no. 12, pp. 1696–1697, 2011.

[91] C. E. Grant, T. L. Bailey, and W. S. Noble, “Fimo: scanning for occurrences of a given motif,” Bioinformatics, vol. 27, no. 7, pp. 1017–1018, 2011. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/btr064

[92] D. R. Zerbino, P. Achuthan, W. Akanni, M. Amode, D. Barrell, J. Bhai, K. Billis, C. Cummins, A. Gall, C. G. Girón, L. Gil, L. Gordon, L. Haggerty, E. Haskell, T. Hourlier, O. G. Izuogu, S. H. Janacek, T. Juettemann, J. K. To, M. R. Laird, I. Lavidas, Z. Liu, J. E. Loveland, T. Maurel, W. McLaren, B. Moore, J. Mudge, D. N. Murphy, V. Newman, M. Nuhn, D. Ogeh, C. K. Ong, A. Parker, M. Patricio, H. S. Riat, H. Schuilenburg, D. Sheppard, H. Sparrow, K. Taylor, A. Thormann, A. Vullo, B. Walts, A. Zadissa, A. Frankish, S. E. Hunt, M. Kostadima, N. Langridge, F. J. Martin, M. Muffato, E. Perry, M. Ruffier, D. M. Staines, S. J. Trevanion, B. L. Aken, F. Cunningham, A. Yates, and P. Flicek, “Ensembl 2018,” Nucleic Acids Research, vol. 46, no. D1, pp. D754–D761, 2018. [Online]. Available: http://dx.doi.org/10.1093/nar/gkx1098

[93] A. R. Quinlan and I. M. Hall, “Bedtools: a flexible suite of utilities for comparing genomic features,” Bioinformatics, vol. 26, no. 6, pp. 841–842, 2010.

[94] S. Roy, M. Kagda, and H. S. Judelson, “Genome-wide prediction and functional validation of promoter motifs regulating gene expression in spore and infection stages of phytophthora infestans,” PLoS pathogens, vol. 9, no. 3, p. e1003182, 2013.

[95] A. Grote, D. Voronin, T. Ding, A. Twaddle, T. R. Unnasch, S. Lustigman, and E. Ghedin, “Defining brugia malayi and wolbachia symbiosis by stage-specific dual rna-seq,” PLOS Neglected Tropical Diseases, vol. 11, no. 3, pp. 1–21, Mar 2017. [Online]. Available: https://doi.org/10.1371/journal.pntd.0005357

[96] K. L. Howe, B. J. Bolt, S. Cain, J. Chan, W. J. Chen, P. Davis, J. Done, T. Down, S. Gao, C. Grove, T. W. Harris, R. Kishore, R. Lee, J. Lomax, Y. Li, H.-M. Müller, C. Nakamura, P. Nuin, M. Paulini, D. Raciti, G. Schindelman, E. Stanley, M. A. Tuli, K. VanAuken, D. Wang, X. Wang, G. Williams, A. Wright, K. Yook, M. Berriman, P. Kersey, T. Schedl, L. Stein, and P. W. Sternberg, “Wormbase 2016: expanding to enable helminth genomic research,” Nucleic Acids Research, vol. 44, no. D1, pp. D774–D780, 2016. [Online]. Available: http://dx.doi.org/10.1093/nar/gkv1217

[97] A. Khan, O. Fornes, A. Stigliani, M. Gheorghe, J. A. Castro-Mondragon, R. van der Lee, A. Bessy, J. Chèneby, S. R. Kulkarni, G. Tan, D. Baranasic, D. J. Arenillas, A. Sandelin, K. Vandepoele, B. Lenhard, B. Ballester, W. W. Wasserman, F. Parcy, and A. Mathelier, “Jaspar 2018: update of the open-access database of transcription factor binding profiles and its web framework,” Nucleic Acids Research, vol. 46, no. D1, pp. D260–D266, 2018. [Online]. Available: http://dx.doi.org/10.1093/nar/gkx1126

[98] M. Weirauch, A. Yang, M. Albu, A. G. Cote, A. Montenegro-Montero, P. Drewe, H. Najafabadi, S. Lambert, I. Mann, K. Cook, H. Zheng, A. Goity, H. van Bakel, J.-C. Lozano, M. Galli, M. G. Lewsey, E. Huang, T. Mukherjee, X. Chen, J. Reece-Hoyes, S. Govindarajan, G. Shaulsky, A. Walhout, F.-Y. Bouget, G. Rätsch, L. Larrondo, J. Ecker, and T. Hughes, “Determination and inference of eukaryotic transcription factor sequence specificity,” Cell, vol. 158, no. 6, pp. 1431–1443, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0092867414010368

[99] M. F. Berger, A. A. Philippakis, A. M. Qureshi, F. S. He, P. W. Estep III, and M. L. Bulyk, “Compact, universal dna microarrays to comprehensively determine transcription-factor binding site specificities,” Nature biotechnology, vol. 24, no. 11, p. 1429, 2006.

[100] S. Gupta, J. A. Stamatoyannopoulos, T. L. Bailey, and W. S. Noble, “Quantifying similarity between motifs,” Genome Biology, vol. 8, no. 2, p. R24, Feb 2007. [Online]. Available: https://doi.org/10.1186/gb-2007-8-2-r24

[101] M. Larkin, G. Blackshields, N. Brown, R. Chenna, P. McGettigan, H. McWilliam, F. Valentin, I. Wallace, A. Wilm, R. Lopez, J. Thompson, T. Gibson, and D. Higgins, “Clustal w and clustal x version 2.0,” Bioinformatics, vol. 23, no. 21, pp. 2947–2948, 2007. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/btm404

[102] C.-Y. Cheng, V. Krishnakumar, A. P. Chan, F. Thibaud-Nissen, S. Schobel, and C. D. Town, “Araport11: a complete reannotation of the arabidopsis thaliana reference genome,” The Plant Journal, vol. 89, no. 4, pp. 789–804, 2017. [Online]. Available: http://dx.doi.org/10.1111/tpj.13415

[103] I. Yanai, H. Benjamin, M. Shmoish, V. Chalifa-Caspi, M. Shklar, R. Ophir, A. Bar-Even, S. Horn-Saban, M. Safran, E. Domany, D. Lancet, and O. Shmueli, “Genome-wide midrange transcription profiles reveal expression level relationships in human tissue specification,” Bioinformatics, vol. 21, no. 5, pp. 650–659, 2005. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/bti042

[104] N. Kryuchkova-Mostacci and M. Robinson-Rechavi, “A benchmark of gene expression tissue-specificity metrics,” Briefings in Bioinformatics, vol. 18, no. 2, pp. 205–214, 2017. [Online]. Available: http://dx.doi.org/10.1093/bib/bbw008

[105] S. Gupta, J. A. Stamatoyannopoulos, T. L. Bailey, and W. S. Noble, “Quantifying similarity between motifs,” Genome Biology, vol. 8, no. 2, p. R24, Feb 2007. [Online]. Available: https://doi.org/10.1186/gb-2007-8-2-r24

[106] P. Kheradpour and M. Kellis, “Systematic discovery and characterization of regulatory motifs in encode tf binding experiments,” Nucleic acids research, vol. 42, no. 5, pp. 2976–2987, 2014.

[107] R. Al-Ouran, R. Schmidt, A. Naik, J. Jones, F. Drews, D. Juedes, L. Elnitski, and L. Welch, “Discovering gene regulatory elements using coverage-based heuristics,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. PP, no. 99, pp. 1–1, 2016.

[108] T. L. Bailey and C. Elkan, “Fitting a mixture model by expectation maximization to discover motifs in biopolymers,” in Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology. AAAI Press, 1994, pp. 28–36.

[109] J. D. Hughes, P. W. Estep, S. Tavazoie, and G. M. Church, “Computational identification of cis-regulatory elements associated with groups of functionally related genes in saccharomyces cerevisiae,” Journal of molecular biology, vol. 296, no. 5, pp. 1205–1214, 2000.

[110] X. S. Liu, D. L. Brutlag, and J. S. Liu, “An algorithm for finding protein–dna binding sites with applications to chromatin-immunoprecipitation microarray experiments,” Nature biotechnology, vol. 20, no. 8, pp. 835–839, 2002.

[111] C. E. Grant, T. L. Bailey, and W. S. Noble, “Fimo: scanning for occurrences of a given motif,” Bioinformatics, vol. 27, no. 7, pp. 1017–1018, 2011.

[112] L. Chen, J. Xuan, C. Wang, I.-M. Shih, Y. Wang, Z. Zhang, E. Hoffman, and R. Clarke, “Knowledge-guided multi-scale independent component analysis for biomarker identification,” BMC Bioinformatics, vol. 9, no. 1, p. 416, Oct 2008. [Online]. Available: https://doi.org/10.1186/1471-2105-9-416

[113] S. Varma and R. Simon, “Bias in error estimation when using cross-validation for model selection,” BMC Bioinformatics, vol. 7, no. 1, p. 91, Feb 2006. [Online]. Available: https://doi.org/10.1186/1471-2105-7-91

[114] O. S. Center, “Ohio supercomputer center,” http://osc.edu/ark:/19495/f5s1ph73, 1987.

[115] E. Ghedin, S. Wang, D. Spiro, E. Caler, Q. Zhao, J. Crabtree, J. E. Allen, A. L. Delcher, D. B. Guiliano, D. Miranda-Saavedra, S. V. Angiuoli, T. Creasy, P. Amedeo, B. Haas, N. M. El-Sayed, J. R. Wortman, T. Feldblyum, L. Tallon, M. Schatz, M. Shumway, H. Koo, S. L. Salzberg, S. Schobel, M. Pertea, M. Pop, O. White, G. J. Barton, C. K. S. Carlow, M. J. Crawford, J. Daub, M. W. Dimmic, C. F. Estes, J. M. Foster, M. Ganatra, W. F. Gregory, N. M. Johnson, J. Jin, R. Komuniecki, I. Korf, S. Kumar, S. Laney, B.-W. Li, W. Li, T. H. Lindblom, S. Lustigman, D. Ma, C. V. Maina, D. M. A. Martin, J. P. McCarter, L. McReynolds, M. Mitreva, T. B. Nutman, J. Parkinson, J. M. Peregrín-Álvarez, C. Poole, Q. Ren, L. Saunders, A. E. Sluder, K. Smith, M. Stanke, T. R. Unnasch, J. Ware, A. D. Wei, G. Weil, D. J. Williams, Y. Zhang, S. A. Williams, C. Fraser-Liggett, B. Slatko, M. L. Blaxter, and A. L. Scott, “Draft genome of the filarial nematode parasite brugia malayi,” Science, vol. 317, no. 5845, pp. 1756–1760, 2007. [Online]. Available: http://science.sciencemag.org/content/317/5845/1756

[116] A. Coghlan, “Nematode genome evolution,” WormBook, vol. 2005, pp. 1–15, 2005.

[117] A. M. Showalter, B. Keppler, J. Lichtenberg, D. Gu, and L. R. Welch, “A bioinformatics approach to the identification, classification, and analysis of hydroxyproline-rich glycoproteins,” Plant Physiology, vol. 153, no. 2, pp. 485–513, 2010. [Online]. Available: http://www.plantphysiol.org/content/153/2/485

[118] P. Ravindran, V. Verma, P. Stamm, and P. P. Kumar, “A novel rgl2–dof6 complex contributes to primary seed dormancy in arabidopsis thaliana by regulating a gata transcription factor,” Molecular plant, vol. 10, no. 10, pp. 1307–1320, 2017.

[119] N. Nguyen, B. Contreras-Moreira, J. A. Castro-Mondragon, W. Santana-Garcia, R. Ossio, C. Robles-Espinoza, M. Bahin, S. Collombet, P. Vincens, D. Thieffry, J. vanHelden, A. Medina-Rivera, and M. Thomas-Chollier, “Rsat 2018: regulatory sequence analysis tools 20th anniversary,” Nucleic Acids Research, vol. 46, no. W1, pp. W209–W214, 2018. [Online]. Available: http://dx.doi.org/10.1093/nar/gky317

[120] S. Heinz, C. Benner, N. Spann, E. Bertolino, Y. C. Lin, P. Laslo, J. X. Cheng, C. Murre, H. Singh, and C. K. Glass, “Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and b cell identities,” Molecular Cell, vol. 38, no. 4, pp. 576 – 589, 2010. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1097276510003667

[121] I. V. Kulakovskiy, Y. A. Medvedeva, U. Schaefer, A. S. Kasianov, I. E. Vorontsov, V. B. Bajic, and V. J. Makeev, “Hocomoco: a comprehensive collection of human transcription factor binding sites models,” Nucleic acids research, vol. 41, no. D1, pp. D195–D202, 2012.

[122] J. Wang, J. Zhuang, S. Iyer, X. Lin, T. W. Whitfield, M. C. Greven, B. G. Pierce, X. Dong, A. Kundaje, Y. Cheng, O. J. Rando, E. Birney, R. M. Myers, W. S. Noble, M. Snyder, and Z. Weng, “Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors,” Genome Research, vol. 22, no. 9, pp. 1798–1812, 2012. [Online]. Available: http://genome.cshlp.org/content/22/9/1798.abstract

[123] J. W. Whitaker, Z. Chen, and W. Wang, “Predicting the human epigenome from dna motifs,” Nature methods, vol. 12, no. 3, p. 265, 2014.

[124] D. Quang and X. Xie, “Factornet: a deep learning framework for predicting cell type specific transcription factor binding from nucleotide-resolution sequential data,” bioRxiv, p. 151274, 2017. 126

[125] P. Grünwald, “A tutorial introduction to the minimum description length principle,” arXiv preprint math/0406077, 2004.

[126] Y. Ono, N. Fukuhara, and O. Yoshie, “Tal1 and lim-only proteins synergistically induce retinaldehyde dehydrogenase 2 expression in t-cell acute lymphoblastic leukemia by acting as cofactors for gata3,” Molecular and Cellular Biology, vol. 18, no. 12, pp. 6939–6950, 1998. [Online]. Available: http://mcb.asm.org/content/18/12/6939.abstract

[127] T. Moreau, A. L. Evans, L. Vasquez, M. R. Tijssen, Y. Yan, M. W. Trotter, D. Howard, M. Colzani, M. Arumugam, W. H. Wu, A. Dalby, R. Lampela, G. Bouet, C. M. Hobbs, D. C. Pask, H. Payne, T. Ponomaryov, A. Brill, N. Soranzo, W. H. Ouwehand, R. A. Pedersen, and C. Ghevaert, “Large-scale production of megakaryocytes from human pluripotent stem cells by chemically defined forward programming,” Nature Communications, vol. 7, Art. no. 11208, Apr 2016. [Online]. Available: https://doi.org/10.1038/ncomms11208

[128] L. J. Bischof, N. Kagawa, J. J. Moskow, Y. Takahashi, A. Iwamatsu, A. M. Buchberg, and M. R. Waterman, “Members of the meis1 and pbx homeodomain protein families cooperatively bind a cAMP-responsive sequence (crs1) from bovine cyp17,” Journal of Biological Chemistry, vol. 273, no. 14, pp. 7941–7948, 1998.

[129] Z. Li, P. Chen, R. Su, C. Hu, Y. Li, A. Elkahloun, Z. Zuo, S. Gurbuxani, S. Arnovitz, H. Weng, Y. Wang, L. Shenglai, H. Huang, M. Neilly, G. Wang, X. Jiang, P. Liu, J. Jin, and J. Chen, “Pbx3 and meis1 cooperate in hematopoietic cells to drive acute myeloid leukemias characterized by a core of the mll-rearranged disease,” Cancer Research, vol. 76, no. 3, pp. 619–629, Feb 2016.

[130] S. Gupta, J. A. Stamatoyannopoulos, T. L. Bailey, and W. S. Noble, “Quantifying similarity between motifs,” Genome biology, vol. 8, no. 2, p. R24, 2007.

[131] I. Gomes, T. T. Sharma, S. Edassery, N. Fulton, B. G. Mar, and C. A. Westbrook, “Novel transcription factors in human cd34 antigen–positive hematopoietic cells,” Blood, vol. 100, no. 1, pp. 107–119, 2002.

[132] D. Szklarczyk, A. Franceschini, S. Wyder, K. Forslund, D. Heller, J. Huerta-Cepas, M. Simonovic, A. Roth, A. Santos, K. P. Tsafou, M. Kuhn, P. Bork, L. J. Jensen, and C. von Mering, “String v10: protein–protein interaction networks, integrated over the tree of life,” Nucleic Acids Research, vol. 43, no. D1, pp. D447–D452, 2015. [Online]. Available: http://dx.doi.org/10.1093/nar/gku1003

[133] K. D. Yokoyama, Y. Zhang, and J. Ma, “Tracing the evolution of lineage-specific transcription factor binding sites in a birth-death framework,” PLoS computational biology, vol. 10, no. 8, p. e1003771, 2014.

[134] S. Inukai, K. H. Kock, and M. L. Bulyk, “Transcription factor–dna binding: beyond binding site motifs,” Current opinion in genetics & development, vol. 43, pp. 110–119, 2017.

[135] G. Li, M. J. Fullwood, H. Xu, F. H. Mulawadi, S. Velkov, V. Vega, P. N. Ariyaratne, Y. B. Mohamed, H.-S. Ooi, C. Tennakoon, C.-L. Wei, Y. Ruan, and W.-K. Sung, “Chia-pet tool for comprehensive chromatin interaction analysis with paired-end tag sequencing,” Genome Biology, vol. 11, no. 2, p. R22, Feb 2010. [Online]. Available: https://doi.org/10.1186/gb-2010-11-2-r22

[136] E. Lieberman-Aiden, N. L. van Berkum, L. Williams, M. Imakaev, T. Ragoczy, A. Telling, I. Amit, B. R. Lajoie, P. J. Sabo, M. O. Dorschner, R. Sandstrom, B. Bernstein, M. A. Bender, M. Groudine, A. Gnirke, J. Stamatoyannopoulos, L. A. Mirny, E. S. Lander, and J. Dekker, “Comprehensive mapping of long-range interactions reveals folding principles of the human genome,” Science, vol. 326, no. 5950, pp. 289–293, 2009. [Online]. Available: http://science.sciencemag.org/content/326/5950/289

[137] S. A. Quinodoz, N. Ollikainen, B. Tabak, A. Palla, J. M. Schmidt, E. Detmar, M. M. Lai, A. A. Shishkin, P. Bhat, Y. Takei, V. Trinh, E. Aznauryan, P. Russell, C. Cheng, M. Jovanovic, A. Chow, L. Cai, P. McDonel, M. Garber, and M. Guttman, “Higher-order inter-chromosomal hubs shape 3d genome organization in the nucleus,” Cell, vol. 174, no. 3, pp. 744–757.e24, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0092867418306366

[138] J. Dekker, M. A. Marti-Renom, and L. A. Mirny, “Exploring the three-dimensional organization of genomes: interpreting chromatin interaction data,” Nature reviews. Genetics, vol. 14, no. 6, pp. 390–403, Jun 2013.

[139] M. J. Fullwood and Y. Ruan, “Chip-based methods for the identification of long-range chromatin interactions,” Journal of Cellular Biochemistry, vol. 107, no. 1, pp. 30–39, 2009.

[140] M. Simonis, P. Klous, E. Splinter, Y. Moshkin, R. Willemsen, and E. de Wit, “Nuclear organization of active and inactive chromatin domains uncovered by chromosome conformation capture-on-chip (4c),” Nat Genet, vol. 38, 2006. [Online]. Available: http://dx.doi.org/10.1038/ng1896

[141] J. Dostie, T. A. Richmond, R. A. Arnaout, R. R. Selzer, W. L. Lee, and T. A. Honan, “Chromosome conformation capture carbon copy (5c): a massively parallel solution for mapping interactions between genomic elements,” Genome Res, vol. 16, 2006. [Online]. Available: http://dx.doi.org/10.1101/gr.5571506

[142] F. Ay and W. S. Noble, “Analysis methods for studying the 3d architecture of the genome,” Genome Biology, vol. 16, no. 1, p. 183, Sep 2015. [Online]. Available: https://doi.org/10.1186/s13059-015-0745-7

[143] A. L. Olins and D. E. Olins, “Spheroid chromatin units (ν bodies),” Science, vol. 183, no. 4122, pp. 330–332, 1974.

[144] K. Luger, A. W. Mäder, R. K. Richmond, D. F. Sargent, and T. J. Richmond, “Crystal structure of the nucleosome core particle at 2.8 Å resolution.” Nature, vol. 389, no. 6648, pp. 251–260, Sep 1997.

[145] N. Xu, C.-L. Tsai, and J. T. Lee, “Transient homologous chromosome pairing marks the onset of x inactivation,” Science, vol. 311, no. 5764, pp. 1149–1152, 2006. [Online]. Available: http://science.sciencemag.org/content/311/5764/1149

[146] J. Dekker and T. Misteli, “Long-range chromatin interactions,” Cold Spring Harbor Perspectives in Biology, vol. 7, no. 10, 2015. [Online]. Available: http://cshperspectives.cshlp.org/content/7/10/a019356.abstract

[147] K. Zhang, N. Li, R. I. Ainsworth, and W. Wang, “Systematic identification of protein combinations mediating chromatin looping,” Nature communications, vol. 7, p. 12249, 2016.

[148] S. Sofueva, E. Yaffe, W.-C. Chan, D. Georgopoulou, M. Vietri Rudan, H. Mira-Bontenbal, S. M. Pollard, G. P. Schroth, A. Tanay, and S. Hadjur, “Cohesin-mediated interactions organize chromosomal domain architecture,” The EMBO Journal, vol. 32, no. 24, pp. 3119–3129, 2013. [Online]. Available: http://emboj.embopress.org/content/32/24/3119

[149] Y. Guo, Q. Xu, D. Canzio, J. Shou, J. Li, D. Gorkin, I. Jung, H. Wu, Y. Zhai, Y. Tang, Y. Lu, Y. Wu, Z. Jia, W. Li, M. Zhang, B. Ren, A. Krainer, T. Maniatis, and Q. Wu, “Crispr inversion of ctcf sites alters genome topology and enhancer/promoter function,” Cell, vol. 162, no. 4, pp. 900–910, Aug 2015. [Online]. Available: https://doi.org/10.1016/j.cell.2015.07.038

[150] A. S. Weintraub, C. H. Li, A. V. Zamudio, A. A. Sigova, N. M. Hannett, D. S. Day, B. J. Abraham, M. A. Cohen, B. Nabet, D. L. Buckley, Y. E. Guo, D. Hnisz, R. Jaenisch, J. E. Bradner, N. S. Gray, and R. A. Young, “Yy1 is a structural regulator of enhancer-promoter loops,” Cell, vol. 171, no. 7, pp. 1573 – 1588.e28, 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S009286741731317X

[151] K. P. Eagen, E. L. Aiden, and R. D. Kornberg, “Polycomb-mediated chromatin loops revealed by a subkilobase-resolution chromatin interaction map,” Proceedings of the National Academy of Sciences, vol. 114, no. 33, pp. 8764–8769, 2017. [Online]. Available: http://www.pnas.org/content/114/33/8764

[152] G. Fudenberg, N. Abdennur, M. Imakaev, A. Goloborodko, and L. A. Mirny, “Emerging evidence of chromosome folding by loop extrusion,” Cold Spring Harbor Symposia on Quantitative Biology, vol. 82, pp. 45–55, 2017. [Online]. Available: http://symposium.cshlp.org/content/82/45.abstract

[153] J. Nuebler, G. Fudenberg, M. Imakaev, N. Abdennur, and L. A. Mirny, “Chromatin organization by an interplay of loop extrusion and compartmental segregation,” Proceedings of the National Academy of Sciences, 2018. [Online]. Available: http://www.pnas.org/content/early/2018/06/29/1717730115

[154] C. A. Brackley, J. Johnson, D. Michieletto, A. N. Morozov, M. Nicodemi, P. R. Cook, and D. Marenduzzo, “Extrusion without a motor: a new take on the loop extrusion model of genome organization,” Nucleus, vol. 9, no. 1, pp. 95–103, 2018, pMID: 29300120. [Online]. Available: https://doi.org/10.1080/19491034.2017.1421825

[155] J. R. Dixon, S. Selvaraj, F. Yue, A. Kim, Y. Li, Y. Shen, M. Hu, J. S. Liu, and B. Ren, “Topological domains in mammalian genomes identified by analysis of chromatin interactions,” Nature, vol. 485, no. 7398, p. 376, 2012.

[156] J. Phillips-Cremins, M. Sauria, A. Sanyal, T. Gerasimova, B. Lajoie, J. Bell, C.-T. Ong, T. Hookway, C. Guo, Y. Sun, M. Bland, W. Wagstaff, S. Dalton, T. McDevitt, R. Sen, J. Dekker, J. Taylor, and V. Corces, “Architectural protein subclasses shape 3d organization of genomes during lineage commitment,” Cell, vol. 153, no. 6, pp. 1281 – 1295, 2013. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0092867413005291

[157] J. Fraser, C. Ferrai, A. M. Chiariello, M. Schueler, T. Rito, G. Laudanno, M. Barbieri, B. L. Moore, D. C. Kraemer, S. Aitken, S. Q. Xie, K. J. Morris, M. Itoh, H. Kawaji, I. Jaeger, Y. Hayashizaki, P. Carninci, A. R. Forrest, , C. A. Semple, J. Dostie, A. Pombo, and M. Nicodemi, “Hierarchical folding and reorganization of chromosomes are linked to transcriptional changes in cellular differentiation,” Molecular Systems Biology, vol. 11, no. 12, 2015. [Online]. Available: http://msb.embopress.org/content/11/12/852

[158] J.-P. Fortin and K. D. Hansen, “Reconstructing a/b compartments as revealed by hi-c using long-range correlations in epigenetic data,” Genome biology, vol. 16, no. 1, p. 180, 2015.

[159] W. F. Marzluff, P. Gongidi, K. R. Woods, J. Jin, and L. J. Maltais, “The human and mouse replication-dependent histone genes,” Genomics, vol. 80, no. 5, pp. 487–498, 2002.

[160] M. Machyna, S. Kehr, K. Straube, D. Kappei, F. Buchholz, F. Butter, J. Ule, J. Hertel, P. F. Stadler, and K. M. Neugebauer, “The coilin interactome identifies hundreds of small noncoding rnas that traffic through cajal bodies,” Molecular cell, vol. 56, no. 3, pp. 389–399, 2014.

[161] F. Ay and W. S. Noble, “Analysis methods for studying the 3d architecture of the genome,” Genome biology, vol. 16, no. 1, p. 183, 2015.

[162] G. G. Yardımcı and W. S. Noble, “Software tools for visualizing hi-c data,” Genome biology, vol. 18, no. 1, p. 26, 2017.

[163] Z. Han and G. Wei, “Computational tools for hi-c data analysis,” Quantitative Biology, vol. 5, no. 3, pp. 215–225, 2017.

[164] M. Forcato, C. Nicoletti, K. Pal, C. M. Livi, F. Ferrari, and S. Bicciato, “Comparison of computational methods for hi-c data analysis,” Nature methods, vol. 14, no. 7, p. 679, 2017.

[165] C. Nicoletti, M. Forcato, and S. Bicciato, “Computational methods for analyzing genome-wide chromosome conformation capture data,” Current Opinion in Biotechnology, vol. 54, pp. 98 – 105, 2018, analytical Biotechnology. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0958166917302550

[166] F. Ay, T. L. Bailey, and W. S. Noble, “Statistical confidence estimation for hi-c data reveals regulatory chromatin contacts,” Genome Research, 2014. [Online]. Available: http://genome.cshlp.org/content/early/2014/02/05/gr.160374.113.abstract

[167] M. Imakaev, G. Fudenberg, R. P. McCord, N. Naumova, A. Goloborodko, B. R. Lajoie, J. Dekker, and L. A. Mirny, “Iterative correction of hi-c data reveals hallmarks of chromosome organization,” Nature methods, vol. 9, no. 10, p. 999, 2012.

[168] S. S. Rao, M. H. Huntley, N. C. Durand, E. K. Stamenova, I. D. Bochkov, J. T. Robinson, A. L. Sanborn, I. Machol, A. D. Omer, E. S. Lander, and E. L. Aiden, “A 3d map of the human genome at kilobase resolution reveals principles of chromatin looping,” Cell, vol. 159, no. 7, pp. 1665–1680, 2014.

[169] Y.-C. Hwang, C.-F. Lin, O. Valladares, J. Malamon, P. P. Kuksa, Q. Zheng, B. D. Gregory, and L.-S. Wang, “Hippie: a high-throughput identification pipeline for promoter interacting enhancer elements,” Bioinformatics, vol. 31, no. 8, pp. 1290–1292, 2014.

[170] K.-K. Yan, S. Lou, and M. Gerstein, “Mrtadfinder: A network modularity based approach to identify topologically associating domains in multiple resolutions,” PLoS computational biology, vol. 13, no. 7, p. e1005647, 2017.

[171] O. Oluwadare and J. Cheng, “Clustertad: an unsupervised machine learning approach to detecting topologically associated domains of chromosomes from hi-c data,” BMC bioinformatics, vol. 18, no. 1, p. 480, 2017.

[172] C. Weinreb and B. J. Raphael, “Identification of hierarchical chromatin domains,” Bioinformatics, vol. 32, no. 11, pp. 1601–1609, 2015.

[173] E. Crane, Q. Bian, R. P. McCord, B. R. Lajoie, B. S. Wheeler, E. J. Ralston, S. Uzawa, J. Dekker, and B. J. Meyer, “Condensin-driven remodelling of x chromosome topology during dosage compensation,” Nature, vol. 523, no. 7559, p. 240, 2015.

[174] R. Dali and M. Blanchette, “A critical assessment of topologically associating domain prediction tools,” Nucleic acids research, vol. 45, no. 6, pp. 2994–3005, 2017.

[175] N. C. Durand, J. T. Robinson, M. S. Shamim, I. Machol, J. P. Mesirov, E. S. Lander, and E. L. Aiden, “Juicebox provides a visualization system for hi-c contact maps with unlimited zoom,” Cell systems, vol. 3, no. 1, pp. 99–101, 2016.

[176] Y. Wang, B. Zhang, L. Zhang, L. An, J. Xu, D. Li, M. N. Choudhary, Y. Li, M. Hu, R. Hardison, T. Wang, and F. Yue, “The 3d genome browser: a web-based browser for visualizing 3d genome organization and long-range chromatin interactions,” BioRxiv, p. 112268, 2017.

[177] C. A. Lareau and M. J. Aryee, “diffloop: a computational framework for identifying and analyzing differential dna loops from sequencing data,” Bioinformatics, vol. 34, no. 4, pp. 672–674, 2017.

[178] M. N. Djekidel, Y. Chen, and M. Q. Zhang, “Find: differential chromatin interactions detection using a spatial poisson process,” Genome research, 2018.

[179] A. T. Lun and G. K. Smyth, “diffhic: a bioconductor package to detect differential genomic interactions in hi-c data,” BMC bioinformatics, vol. 16, no. 1, p. 258, 2015.

[180] J. Stansfield and M. G. Dozmorov, “Hiccompare: a method for joint normalization of hi-c datasets and differential chromatin interaction detection,” bioRxiv, p. 147850, 2017.

[181] Y. Shavit and P. Lio, “Combining a wavelet change point and the bayes factor for analysing chromosomal interaction data,” Molecular BioSystems, vol. 10, no. 6, pp. 1576–1585, 2014.

[182] M. W. Schmid, S. Grob, and U. Grossniklaus, “Hicdat: a fast and easy-to-use hi-c data analysis tool,” BMC bioinformatics, vol. 16, no. 1, p. 277, 2015.

[183] J. Ernst and M. Kellis, “Chromhmm: automating chromatin-state discovery and characterization,” Nature methods, vol. 9, no. 3, p. 215, 2012.

[184] G. Pintacuda, G. Wei, C. Roustan, B. A. Kirmizitas, N. Solcan, A. Cerase, A. Castello, S. Mohammed, B. Moindrot, T. B. Nesterova, and N. Brockdorff, “hnrnpk recruits pcgf3/5-prc1 to the xist rna b-repeat to establish polycomb-mediated chromosomal silencing,” Molecular cell, vol. 68, no. 5, pp. 955–969, 2017.

[185] F. Le Gall, “Powers of tensors and fast matrix multiplication,” in Proceedings of the 39th international symposium on symbolic and algebraic computation. ACM, 2014, pp. 296–303.

[186] A. J. Fritz, P. N. Ghule, J. R. Boyd, C. E. Tye, N. A. Page, D. Hong, D. J. Shirley, A. S. Weinheimer, A. R. Barutcu, D. L. Gerrard, S. Frietze, A. J. van Wijnen, S. K. Zaidi, A. N. Imbalzano, J. B. Lian, J. L. Stein, and G. S. Stein, “Intranuclear and higher-order chromatin organization of the major histone gene cluster in breast cancer,” Journal of Cellular Physiology, vol. 233, no. 2, pp. 1278–1290. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/jcp.25996

[187] D. Szklarczyk, J. H. Morris, H. Cook, M. Kuhn, S. Wyder, M. Simonovic, A. Santos, N. T. Doncheva, A. Roth, P. Bork, L. J. Jensen, and C. von Mering, “The string database in 2017: quality-controlled protein–protein association networks, made broadly accessible,” Nucleic acids research, p. gkw937, 2016.

[188] J. Y. Kim, S. H. Lee, E. H. Song, Y. M. Park, J.-Y. Lim, D. J. Kim, K.-H. Choi, S. I. Park, B. Gao, and W.-H. Kim, “A critical role of stat1 in streptozotocin-induced diabetic liver injury in mice: controlled by atf3,” Cellular signalling, vol. 21, no. 12, pp. 1758–1767, 2009.

[189] F. S. Collins and H. Varmus, “A new initiative on precision medicine,” New England Journal of Medicine, vol. 372, no. 9, pp. 793–795, 2015, pMID: 25635347. [Online]. Available: https://doi.org/10.1056/NEJMp1500523

[190] J. L. Jameson and D. L. Longo, “Precision medicine – personalized, problematic, and promising,” New England Journal of Medicine, vol. 372, no. 23, pp. 2229–2234, 2015, pMID: 26014593. [Online]. Available: https://doi.org/10.1056/NEJMsb1503104

[191] P. Domingos, “A few useful things to know about machine learning,” Commun. ACM, vol. 55, no. 10, pp. 78–87, Oct. 2012. [Online]. Available: http://doi.acm.org/10.1145/2347736.2347755

[192] S. Zhang, F. Du, and H. Ji, “A novel dna sequence motif in human and mouse genomes,” Scientific reports, vol. 5, p. 10444, 2015.

[193] J. N. Weinstein, E. A. Collisson, G. B. Mills, K. R. M. Shaw, B. A. Ozenberger, K. Ellrott, I. Shmulevich, C. Sander, J. M. Stuart, and C. G. A. R. Network, “The cancer genome atlas pan-cancer analysis project,” Nature genetics, vol. 45, no. 10, p. 1113, 2013.

Appendix: Tables of the evaluation results for the four motif selection methods

Table A.1: Foreground coverage (%) comparison across 55 factors. Each row is a transcription factor group and each column corresponds to one of the four methods. The best foreground coverage for each factor group is shown in bold. The last row shows the number of times each method won across the 55 datasets.

Factor Enrch Grdy RILP Tabu
HSF 18.9 60.1 56.8 39.8
SREBP 26.5 64.9 72.9 64.9
BRCA1 29.6 62.3 68.4 46.2
CEBPB 31.7 54.9 54.9 61.3
ZEB1 40.6 71.4 80.3 71.4
RFX5 42.3 64.3 64.5 51.3
TCF7L2 44.5 81.6 81.5 68.0
BATF 45.1 43.9 57.5 43.9
MXI1 45.5 81.0 82.4 57.5
ZBTB7A 48.3 79.4 88.9 79.4
MEF2 50.2 37.2 56.4 35.1
ZNF143 51.0 47.8 52.3 39.8
SRF 52.1 64.9 65.3 64.9
PBX3 54.1 60.0 60.9 37.3
SIX5 54.7 79.4 79.2 78.5
TAL1 55.3 47.3 64.1 47.3
NANOG 56.1 66.9 54.1 35.0
POU2F2 58.1 67.8 70.4 60.2
BHLHE40 58.2 63.4 69.5 46.5
RXRA 59.4 68.2 77.7 61.7
MAF 60.0 73.3 82.1 72.8
FOXA 60.1 50.6 50.6 49.4
HEY1 60.5 74.1 61.5 59.3
EBF1 61.9 70.5 70.6 56.6
SP1 63.1 63.5 66.5 55.1
GATA 64.3 68.4 82.2 52.4
ZBTB33 65.2 74.7 72.8 70.0
ESRRA 66.6 60.4 60.7 60.1
NR2C2 70.5 90.8 90.8 78.1
NFKB 70.6 74.2 71.8 73.2
TCF12 71.1 67.0 72.1 66.5
ATF3 71.2 80.5 80.3 78.7
PAX5 71.8 75.9 62.9 64.5
NR3C1 74.7 75.5 78.8 71.8
POU5F1 75.6 71.8 71.8 71.8
EP300 75.7 61.4 76.6 43.8
HNF4 76.7 68.3 66.1 65.3
PRDM1 77.1 83.7 77.6 76.5
ELF1 77.4 81.9 80.3 79.9
STAT 78.6 60.9 60.1 62.2
MYC 79.4 72.9 79.8 NA
YY1 79.6 80.1 78.5 73.0
IRF 82.4 45.9 64.0 52.3
SPI1 83.7 80.6 90.3 80.7
TFAP2 83.9 91.5 87.7 76.3
NFY 84.3 87.8 87.8 87.8
TATA 86.4 68.7 79.0 NA
E2F 86.5 77.4 77.4 81.8
AP1 86.5 62.5 62.5 NA
NFE2 87.9 93.3 93.2 93.0
EGR1 89.8 87.0 88.7 84.6
ETS 89.9 87.9 88.7 82.4
REST 92.5 81.1 81.1 83.2
CTCF 94.8 85.5 85.3 NA
NRF1 96.4 93.5 93.5 93.5
Winning Times 14 16 26 2

Table A.2: Number of motifs comparison across 55 factors. Each row is a transcription factor group and each column corresponds to one of the four methods. The lowest number of selected motifs for each factor group is shown in bold. The last row shows the number of times each method won across the 55 datasets.

Factor          Enrch   Grdy   RILP   Tabu
AP1                10      1      1     NA
TATA               10      2      2     NA
REST               10      2      2      3
EP300              10      3      8      2
CTCF               10      1      1     NA
MYC                10      2      4     NA
ETS                 9      2      2      2
E2F                 8      1      1      2
STAT                7      2      2      2
EGR1                7      2      2      1
TCF12               6      2      3      2
NR3C1               6      3      4      3
IRF                 6      1      4      2
GATA                6      3      4      3
FOXA                5      1      1      1
HNF4                5      1      1      1
PAX5                5      3      2      2
YY1                 5      4      4      2
RXRA                5      3      6      3
NFE2                4      2      2      2
ZBTB33              4      3      2      2
SIX5                4      2      2      2
ATF3                4      4      3      3
NFKB                4      2      2      2
ZNF143              4      2      3      1
NANOG               4      3      2      1
ESRRA               4      2      2      2
PBX3                3      3      3      1
BATF                3      1      2      1
SPI1                3      1      2      1
ELF1                3      2      2      2
RFX5                3      4      4      2
SP1                 3      2      2      2
MEF2                3      1      2      1
NRF1                3      1      1      1
NR2C2               3      3      3      2
POU2F2              2      4      3      2
POU5F1              2      1      1      1
ZBTB7A              2      1      2      1
BHLHE40             2      2      3      1
TFAP2               2      3      3      1
TCF7L2              2      4      4      2
CEBPB               2      1      1      2
TAL1                2      1      2      1
EBF1                2      2      2      1
SRF                 2      2      2      2
MXI1                2      4      4      1
MAF                 2      1      2      1
HEY1                2      2      2      1
PRDM1               2      2      1      1
SREBP               1      1      2      1
BRCA1               1      4     11      1
HSF                 1      2      3      1
ZEB1                1      2      4      2
NFY                 1      1      1      1
Winning Times       8     33     24     46

Table A.3: Background coverage (%) comparison across 55 factors. Each row is a transcription factor group and each column is one of the four methods. The best (lowest) background coverage for each factor group is shown in bold. The last row shows the winning times (number of datasets won) for each method across the 55 datasets.

Factor          Enrch   Grdy   RILP   Tabu
EP300            56.6   46.6   55.5   18.0
STAT             50.6   16.3   15.7   14.5
AP1              47.9    6.1    6.1     NA
IRF              47.5   15.4   32.2   10.7
TATA             42.3   23.4   24.4     NA
NANOG            39.4   35.0   22.5   12.6
RXRA             38.8   33.3   39.0   24.5
GATA             38.2   46.2   58.5   18.4
PAX5             36.4   35.2   16.2   14.1
REST             36.4   16.3   16.3   17.8
CTCF             35.8    9.6    9.4     NA
TCF12            34.9   21.4   25.0   20.3
FOXA             33.7   13.1   13.1   10.8
SPI1             33.2   17.0   22.3   15.9
MYC              30.8   14.0   18.4     NA
HNF4             30.4   13.6   10.9    9.7
NR3C1            29.6   28.2   32.3   21.0
MEF2             27.4   17.5   34.8   14.8
SP1              26.8   24.2   28.7   14.8
SIX5             25.8   17.0   17.5   13.3
ESRRA            25.8   19.5   19.6   19.0
ETS              25.6   14.7   15.8    8.0
NFKB             25.4   21.6   19.6   18.2
E2F              22.4    9.6    9.6    9.5
EGR1             21.7   12.6   14.2    9.0
YY1              21.0   17.5   16.7   10.3
ZNF143           21.0   13.3   15.8   11.9
POU2F2           19.8   24.8   27.9   19.3
NFE2             19.5   12.0   12.1   11.8
BATF             19.3    8.2   12.6    8.2
POU5F1           19.1   12.2   12.2   12.2
MAF              17.4    9.0   14.6    9.0
EBF1             15.4   20.1   20.2   11.4
PBX3             15.4   17.5   18.0    7.4
NR2C2            14.6   16.9   17.0    7.4
PRDM1            14.5   18.4   14.9   14.0
RFX5             14.0   23.5   24.6   14.6
TAL1             13.8    7.0   12.5    7.0
TFAP2            13.3   21.2   18.2    9.7
TCF7L2           12.8   31.0   30.8   21.1
BHLHE40          12.3   10.8   13.1    2.3
HEY1             11.9   22.8   43.1    7.2
SRF              10.3   16.1   15.0   13.2
ELF1              9.9   11.5   11.4    8.9
ZEB1              9.5   14.2   20.2   14.2
ATF3              7.8   12.4   11.3    8.9
CEBPB             7.7    4.6    4.6    5.6
MXI1              7.1   19.3   20.1    7.9
NRF1              7.0    2.5    2.5    2.5
ZBTB33            6.0    9.9    8.6    5.3
SREBP             5.5    8.0   10.2    8.0
ZBTB7A            4.6    7.4   15.4    7.4
NFY               4.6    4.1    4.1    4.1
HSF               2.6   17.7   17.2   10.2
BRCA1             0.3    8.8   14.2    3.0
Winning Times      10     11      7     39

Table A.4: Error rate (%) comparison across 55 factors. Each row is a transcription factor group and each column is one of the four methods. The best (lowest) error rate for each factor group is shown in bold. The last row shows the winning times (number of datasets won) for each method across the 55 datasets.

Factor          Enrch   Grdy   RILP   Tabu
HSF              41.9   28.8   30.2   35.2
NANOG            41.6   34.0   34.2   38.8
EP300            40.4   42.6   39.5   37.1
RXRA             39.7   32.6   30.7   31.4
SREBP            39.5   21.6   18.6   21.6
MEF2             38.6   40.2   39.2   39.9
CEBPB            38.0   24.9   24.9   22.2
BATF             37.1   32.2   27.6   32.2
GATA             36.9   38.9   38.2   33.0
FOXA             36.8   31.3   31.3   30.7
STAT             36.0   27.7   27.8   26.1
RFX5             35.8   29.6   30.0   31.6
SIX5             35.5   18.8   19.2   17.4
BRCA1            35.4   23.2   22.9   28.4
ZNF143           35.0   32.8   31.8   36.0
ZEB1             34.4   21.4   20.0   21.4
TCF7L2           34.2   24.7   24.6   26.5
IRF              32.5   34.7   34.1   29.2
PAX5             32.3   29.6   26.6   24.8
SP1              31.9   30.4   31.1   29.9
TCF12            31.9   27.2   26.5   26.9
POU2F2           30.9   28.5   28.7   29.6
MXI1             30.8   19.1   18.9   25.2
PBX3             30.7   28.8   28.6   35.1
AP1              30.7   21.8   21.8     NA
ESRRA            29.6   29.6   29.5   29.5
TAL1             29.2   29.8   24.2   29.8
SRF              29.1   25.6   24.9   24.2
MAF              28.7   17.8   16.3   18.1
ZBTB7A           28.2   14.0   13.2   14.0
TATA             28.0   27.3   22.7     NA
NFKB             27.4   23.7   23.9   22.5
NR3C1            27.4   26.3   26.8   24.6
BHLHE40          27.1   23.7   21.8   27.9
HNF4             26.9   22.7   22.4   22.2
EBF1             26.7   24.8   24.8   27.4
MYC              25.7   20.6   19.3     NA
HEY1             25.7   24.3   40.8   23.9
SPI1             24.8   18.2   16.0   17.6
NR2C2            22.0   13.0   13.1   14.7
REST             22.0   17.6   17.6   17.3
POU5F1           21.7   20.2   20.2   20.2
YY1              20.7   18.7   19.1   18.6
CTCF             20.5   12.1   12.1     NA
ZBTB33           20.4   17.6   17.9   17.6
PRDM1            18.7   17.4   18.7   18.7
ATF3             18.3   16.0   15.5   15.1
E2F              17.9   16.1   16.1   13.8
ETS              17.8   13.4   13.6   12.8
ELF1             16.3   14.8   15.5   14.5
EGR1             15.9   12.8   12.8   12.2
NFE2             15.8    9.4    9.4    9.4
TFAP2            14.7   14.9   15.2   16.7
NFY              10.2    8.2    8.2    8.2
NRF1              5.3    4.5    4.5    4.5
Winning Times       2     14     25     27
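The "Winning Times" row in each table can be computed mechanically from the per-factor scores: the method(s) achieving the best value on a row are credited with a win. The sketch below is a minimal illustration, not the dissertation's code; it assumes that ties credit every tied method and that NA entries are skipped, neither of which is stated in the text.

```python
import math

def winning_times(rows, better="max"):
    """Count per-method wins across rows of per-factor scores.

    rows   -- list of rows, one value per method; None marks an NA entry
    better -- "max" if higher is better (coverage), "min" if lower is
              better (motif count, background coverage, error rate)
    """
    n_methods = len(rows[0])
    wins = [0] * n_methods
    for row in rows:
        valid = [v for v in row if v is not None]
        if not valid:
            continue  # all methods NA on this factor
        best = max(valid) if better == "max" else min(valid)
        for i, v in enumerate(row):
            # Ties credit every tied method; NA entries never win.
            if v is not None and math.isclose(v, best):
                wins[i] += 1
    return wins

# Illustrative rows only (values echo the format of Table A.1):
scores = [
    [18.9, 60.1, 56.8, 39.8],
    [26.5, 64.9, 72.9, 64.9],
    [79.4, 72.9, 79.8, None],   # NA for the last method
]
print(winning_times(scores, better="max"))  # -> [0, 1, 2, 0]
```

Because ties credit every tied method, the four win counts can sum to more than the number of factors, which is why, for example, the Table A.2 "Winning Times" row (8, 33, 24, 46) sums to well over 55.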
