Motif Selection Using Simulated Annealing Algorithm with Application to Identify Regulatory Elements
A thesis presented to the faculty of the Russ College of Engineering and Technology of Ohio University

In partial fulfillment of the requirements for the degree Master of Science

Liang Chen

August 2018

© 2018 Liang Chen. All Rights Reserved.

This thesis titled Motif Selection Using Simulated Annealing Algorithm with Application to Identify Regulatory Elements by LIANG CHEN has been approved for the Department of Electrical Engineering and Computer Science and the Russ College of Engineering and Technology by

Lonnie Welch
Professor of Electrical Engineering and Computer Science

Dennis Irwin
Dean, Russ College of Engineering and Technology

Abstract

CHEN, LIANG, M.S., August 2018, Computer Science Master Program
Motif Selection Using Simulated Annealing Algorithm with Application to Identify Regulatory Elements (106 pp.)
Director of Thesis: Lonnie Welch

Modern research on gene regulation and disorder-related pathways uses tools such as microarrays and RNA-Seq to analyze changes in the expression levels of large sets of genes. In silico motif discovery based on gene expression profile data typically generates a large set of candidate motifs (usually hundreds or thousands). Picking a set of biologically meaningful motifs from the candidate motif set is a challenging biological and computational problem. As a computational problem, it can be modeled as the motif selection problem (MSP). Building solutions for the motif selection problem will give biologists direct help in finding transcription factors (TFs) that are strongly related to specific pathways and in gaining insights into the relationships between genes.
This study implemented an algorithm based on the simulated annealing (SA) optimization algorithm for the motif selection problem and investigated its properties using real-world datasets (ENCODE project data). The evaluation results on the ENCODE datasets indicate that the simulated annealing algorithm is well suited to solving the motif selection problem. Its performance can be tuned through several parameters to fit specific requirements. Future improvements may be achieved by extending the algorithm model (adaptive simulated annealing) and applying a high-dimensional cost function.

Dedication

To my family, and my parents.

Acknowledgments

First, I would like to thank my advisor, Dr. Lonnie Welch, for his mentoring and support of my daily study and research project. I would also like to thank my graduate committee members, Dr. Frank Drews and Dr. Razvan Bunescu, for their support, help, comments, and suggestions for my research. I also want to thank Dr. Karen Coschigano for serving as college representative for my thesis defense. Special thanks to graduate students Rami Al-Ouran, Yi-Chao Li, and Yating Liu in Dr. Welch’s lab; graduate student alumni Jens Schmidt, Robert Schmidt, and Krystine Garcia in Dr. Welch’s lab; and graduate students Bibo Shi and Zhe-Wei Wang in Dr. Jundong Liu’s lab.

Table of Contents

Abstract
Dedication
Acknowledgments
List of Tables
List of Figures
List of Acronyms
1 Introduction
  1.1 Background
  1.2 Biological Motivation
  1.3 Foundations of Computational Modeling and Optimization Algorithm
  1.4 Problem Statement
  1.5 Contributions
2 Methods
  2.1 Motif Selection Problem
  2.2 Set Cover Problem (SCP)
  2.3 Mapping Motif Selection Problem to Set Cover Problem
  2.4 SA Relaxed Version
  2.5 Simulated Annealing Algorithm
  2.6 Implementation for Solving MSP
  2.7 Adjustable Parameters of SA Implementation for MSP
3 Evaluation Using ENCODE Data
  3.1 Overview
  3.2 Datasets
  3.3 Parameters
  3.4 Results
  3.5 Analysis on Results
  3.6 Biological Insights of Selected Motifs
4 Conclusion and Future Work
  4.1 Conclusion
  4.2 Future Work
References
Appendix A: Source Code
Appendix B: Supplementary Contents
Appendix C: Disclaimer

List of Tables

2.1 Parameter Settings for Simulated Annealing Algorithm
3.1 Parameter Settings for ENCODE Datasets
B.1 ENCODE TF Group Datasets
B.2 Feature Set Size Result
B.3 Sequence Sensitivity Result
B.4 Motifs selected by SAr85 from BATF group
B.5 Examples of TOMTOM reported alignments
B.6 Motifs selected by SAr85 from PBX3 group
B.7 Examples of TOMTOM reported alignments

List of Figures

1.1 General Pipeline for Motif Selection
2.1 Flowchart for Simulated Annealing
2.2 Temperature Curve for Exponential Cooling
2.3 Class Relationships
3.1 Overview of ENCODE Project
3.2 Boxplot for Feature Set Size
3.3 Line plot for Feature Set Size
3.4 Boxplot for Sequence Sensitivity (sSn)
3.5 Line plot for Sequence Sensitivity (sSn)
3.6 Comprehensive comparison: SA
3.7 Comprehensive comparison: SAr85
3.8 Comprehensive comparison: SAr70

List of Acronyms

ChIP  Chromatin Immunoprecipitation
CPL  Common Public License
DECOD  DECOnvolved Discriminative motif discovery
DME  Discriminating Matrix Enumerator
DNA  DeoxyriboNucleic Acid
DP  Dynamic Programming
ENCODE  Encyclopedia of DNA Elements
FIMO  Find Individual Motif Occurrences
GNU  GNU’s Not Unix
GPL  General Public License
HGP  Human Genome Project
ILP  Integer Linear Programming
LP  Linear Programming
MEME  Multiple EM for Motif Elicitation
MSP  Motif Selection Problem
NCBI  National Center for Biotechnology Information
NGS  Next Generation Sequencing
NP  Non-deterministic Polynomial
PWM  Position Weight Matrix
RILP  Relaxed Integer Linear Programming
RNA  RiboNucleic Acid
SA  Simulated Annealing
SCP  Set Cover Problem
TF  Transcription Factor
TFBS  Transcription Factor Binding Site
TSS  Transcription Start Site
UTR  UnTranslated Region

1 Introduction

This research project focuses on the implementation and evaluation of a simulated annealing optimization algorithm for the motif selection problem, with application to ENCODE datasets.

1.1 Background

Biologists have established that every species of living being on Earth has its own genetic code that stores the information on how to construct itself and control the metabolic processes essential to its survival, development, and reproduction [1, 2]. To investigate the internal mechanisms of these genetic codes and decode the information they encrypt, a great deal of work has been done, ranging from the structure and properties of deoxyribonucleic acid (DNA) molecules [3], the amino acid sequences of proteins [4], and classical genetics theories [5] to modern views of the genome and genes and various projects and achievements in genomic information, such as the Human Genome Project (HGP) [6], the International HapMap Project [7], and the ENCODE project [8].
With continuous efforts and international collaboration, many species, such as Drosophila melanogaster (fruit fly, model species) [9], Caenorhabditis elegans (worm, model species) [10], Escherichia coli (bacteria, model species) [11], Arabidopsis thaliana (model plant species) [12], Oryza sativa (rice, food crop) [13], and Homo sapiens (human being) [14], have had their whole genomes sequenced. With technical advances and more specific sequencing targets [15, 16], new problems have emerged, such as storing and interpreting these biological datasets. Scientists are no longer satisfied with just obtaining raw genomic information such as DNA and RNA sequences; they are more interested in how these genomic elements interact with each other and with the variable environment. For example, BRAF mutations [17–19] have been widely accepted as an indicator for certain types of cancer, such as melanoma [20, 21] and colorectal cancer [22–25]. Another example is the association between EGFR mutations and prostate cancer [26]. With the emergence of genomic testing methods and their practice in clinical medicine (some commercial genomic tests [27, 28] are already available to physicians and patients), the demand for interpreting genomic data and applying that information to improve the medical treatment of patients is increasing dramatically.
Interestingly, research on gene interactions is not as straightforward as neuroscience research on acute reactions in living animals (another active topic in basic science, which may reveal the mechanisms and rules by which human beings perform intelligent work such as thinking and learning). Neuroscientists can insert tiny electrodes into neural tissues such as the cerebral cortex or peripheral neural ganglia to record the electrical signals of currently functioning cells (“neurons”) [29–31], and they can use the temporal and strength relations of these signals between different groups of neurons to establish how the groups interact; some of the predicted relationships may be supported by anatomical structures [32]. Compared with electrophysiology studies, molecular genetic research usually depends on extracting samples from targeted models (animals, plants, or bacteria, with some additional treatments or conditions and optional genetic modifications), sequencing the samples to acquire the expression levels of genes and biomarkers, and applying bioinformatics tools to analyze and interpret the results [33, 34]. For instance, bioinformatics tools such as BLAST [35, 36], FASTA [37], and ClustalW [38] are widely used for sequence alignment to compare the similarity between biological sequences. Early