Motif Selection: Identification of Gene Regulatory Elements Using

Motif Selection: Identification of Gene Regulatory Elements using Sequence Coverage Based Models and Evolutionary Algorithms

A dissertation presented to the faculty of the Russ College of Engineering and Technology of Ohio University

In partial fulfillment of the requirements for the degree Doctor of Philosophy

Rami Al-Ouran
December 2015

© 2015 Rami Al-Ouran. All Rights Reserved.

This dissertation titled Motif Selection: Identification of Gene Regulatory Elements using Sequence Coverage Based Models and Evolutionary Algorithms by RAMI AL-OURAN has been approved for the School of Electrical Engineering and Computer Science and the Russ College of Engineering and Technology by

Lonnie Welch
Stuckey Professor of Electrical Engineering and Computer Science

Dennis Irwin
Dean, Russ College of Engineering and Technology

Abstract

AL-OURAN, RAMI, Ph.D., December 2015, Electrical Engineering and Computer Science
Motif Selection: Identification of Gene Regulatory Elements using Sequence Coverage Based Models and Evolutionary Algorithms (120 pp.)
Director of Dissertation: Lonnie Welch

The accuracy of identifying transcription factor binding sites (motifs) has increased with the use of technologies such as chromatin immunoprecipitation followed by sequencing (ChIP-seq), but this accuracy remains low enough that bioinformaticians and biologists struggle in choosing the right methods for identifying such regulatory elements. Current motif discovery methods typically produce lengthy lists of putative transcription factor binding sites, and a significant challenge lies in how to mine these lists to select a manageable set of candidate sites for experimental validation. Additionally, despite the importance of covering large numbers of genomic sequences, current motif discovery methods do not consider the sequence coverage percentage.

To address the aforementioned problems, the motif selection problem is introduced and solved using a coverage based model greedy algorithm and a multi-objective evolutionary algorithm. The motif selection problem aims to produce a concise list of significant motifs which is both accurate and covers a high percentage of the genomic input sequences. The proposed motif selection methods were evaluated using ChIP-seq data from the ENCyclopedia of DNA Elements (ENCODE) project. In addition, the proposed methods were used to identify putative transcription factor binding sites in two case studies: stage specific binding sites in Brugia malayi, and tissue specific binding sites in hydroxyproline-rich glycoprotein (HRGP) genes in Arabidopsis thaliana.

To my beloved parents

Acknowledgments

I am deeply grateful to my adviser Dr. Lonnie Welch for his continuous support and advice. Dr. Welch helped me improve both as a researcher and as a person, and this work would not have been possible without his continuous guidance and encouragement. I would like to thank Dr. Frank Drews for his helpful discussions and suggestions. I would like to thank the committee members for their time and discussions. Many thanks to all current and previous members of the Ohio University bioinformatics lab especially: Xiaoyu Liang, Yichao Li, Jens Lichtenberg, Kyle Kurz, and Matthew Wiley. I would also like to thank Ohio University for their financial support. Finally, words are not enough to thank my dear parents for their love, encouragement, and patience all these years.

Table of Contents

Abstract
Dedication
Acknowledgments
List of Tables
List of Figures
List of Acronyms and Terms
1 Introduction
  1.1 Motivation and the motif selection problem
  1.2 Contributions
  1.3 Gene regulation
  1.4 Motif discovery
  1.5 Discriminative motif discovery
  1.6 ChIP-seq motif discovery
  1.7 Ensemble motif discovery
2 Identification of Gene Regulatory Elements using Coverage-based Heuristics
  2.1 Introduction
  2.2 Methods
    2.2.1 Formal problem definition
    2.2.2 Relaxed Integer Linear Programming (RILP) approximation algorithm
    2.2.3 Bounded exact search algorithm
    2.2.4 Greedy algorithm
  2.3 Results and discussion
    2.3.1 Evaluation methodology
    2.3.2 Evaluation Results
    2.3.3 Putative functional genomic elements discovered by our methods
  2.4 Conclusions
3 Discriminative Motif Selection using Multi-Objective Optimization (MOP) Methods
  3.1 Background
    3.1.1 Multi-objective optimization (MOP)
    3.1.2 Pareto optimal solutions
    3.1.3 Finding Pareto optimal solutions using evolutionary algorithms (EAs)
    3.1.4 Post Pareto analysis
  3.2 Methods
    3.2.1 The discriminative motif selection problem formal definition
    3.2.2 The Positive Negative Partial Set Cover (PNPSC) problem
    3.2.3 Mapping the motif selection problem to the PNPSC problem
    3.2.4 Solving the discriminative motif selection problem using multi-objective optimization
    3.2.5 Using an MOEA to solve multi-objective problems
      3.2.5.1 Filtering the features before applying MOEA
  3.3 Results and discussion
    3.3.1 Evaluation using ENCODE data
    3.3.2 Application to case studies
  3.4 Conclusions
  3.5 Variations of the set cover problem
    3.5.1 The Weighted Set Cover problem
    3.5.2 The Red Blue Set Cover (RBSC) problem
    3.5.3 The Set Multicover Problem
    3.5.4 The Multiset Multicover (MSMC) problem
    3.5.5 The Partial Set Cover (PSC) problem
4 Guidelines for Motif Discovery and Motif Selection
  4.1 Introduction
  4.2 Motif discovery
    4.2.1 Generative vs discriminative
    4.2.2 Large data sets
    4.2.3 Small data sets
    4.2.4 Motif representation
    4.2.5 Motif scanning
    4.2.6 Reporting the properties of discovered motifs
    4.2.7 Filtering predicted motifs
    4.2.8 Motif selection
      4.2.8.1 Motif selection without background data
      4.2.8.2 Motif selection with background data
    4.2.9 Interpreting the results of motif selection
      4.2.9.1 Motif selection without background data
      4.2.9.2 Motif selection with background data
  4.3 Conclusions
5 Identification of Stage Specific Regulatory Elements in Brugia malayi using Discriminative Motif Selection
  5.1 Introduction
  5.2 Results and discussion
    5.2.1 The foreground and background data sets
    5.2.2 Motif prediction
    5.2.3 Motif selection
      5.2.3.1 Motif selection for motifs with Ωb = 10 and Ωf = 5
      5.2.3.2 Motif selection for motifs with Ωb = 10 and Ωf = 10
      5.2.3.3 Motif selection for motifs with Ωb = 20 and Ωf = 5
      5.2.3.4 Motif selection for motifs with Ωb = 20 and Ωf = 10
  5.3 Methods
    5.3.1 Motif discovery
    5.3.2 Motif filtering
    5.3.3 Motif selection
  5.4 Conclusions
6 Identification of Tissue Specific Regulatory Elements in Hydroxyproline-Rich Glycoprotein (HRGP) Genes in Arabidopsis thaliana
  6.1 Introduction
  6.2 Results and discussion
    6.2.1 The foreground and background data sets
    6.2.2 Motif prediction
    6.2.3 Selecting motifs with zero background coverage
    6.2.4 Filtering discovered motifs
    6.2.5 Motif selection for motifs with Ωb = 20 and Ωf = 20
    6.2.6 Recommendations
  6.3 Methods
    6.3.1 HRGP gene identification
    6.3.2 Motif discovery
    6.3.3 Motif filtering
    6.3.4 Motif selection
  6.4 Conclusions
7 Conclusions and Future Work
References

List of Tables

2.1 Number of features used (Mean, Median, SD) across 38 TF groups
2.2 Sequence sensitivity (Mean, Median, SD) across 38 TF groups
2.3 Number of features and sSn for five motif selection methods across the 38 TF groups. P is the total number of peaks selected per TF group and P(%) is the percentage of peaks with motif occurrences. N is the number of features selected by each method.
2.4 Novel motifs discovered by the greedy algorithm.
2.5 Novel motifs discovered by the greedy algorithm.
2.6 Novel motifs discovered by the greedy algorithm.
2.7 Novel motifs discovered by the greedy algorithm.
