Graphic Network Based Methods in Discovering TFBS Motifs THESIS

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By Lizhi Li
Graduate Program in Biophysics
The Ohio State University
2012

Master's Examination Committee: Kun Huang, Advisor; Victor Jin

Copyright by Lizhi Li 2012

Abstract

Finding the motifs of transcription factor binding sites (TFBS) is essential to understanding many biological processes in a cell. Current motif discovery algorithms can be divided into three categories: word enumeration methods, probabilistic methods, and the more recently developed graphic network based methods. Graphic network based methods show advantages over the other two categories in prediction accuracy, sensitivity and specificity. This thesis gives an overview of the main motif discovery methods currently in use, with a particular focus on graphic network based methods. In addition, graphic network based algorithms are applied in a study to discover the TFBS motifs of E2F1, a well-known transcription factor.

Dedication

This document is dedicated to my family.

Vita

July 2002: No. 1 Hengyang County High
2006: B.S. Biological Science, Peking University
2009: M.Eng. Biotechnology, Osaka University
2009 to present: Graduate Teaching Associate, Biophysics Graduate Program, The Ohio State University

Fields of Study

Major Field: Biophysics

Table of Contents

Chapter 1: Introduction
1.1 Protein-DNA Interactions
1.2 Transcription Factor
1.3 TFBS and DNA Motif
1.4 Position Weight Matrix (PWM)
Chapter 2: TFBS Motif Discovery Methods
2.1 Probabilistic Methods
2.1.1 MEME (Multiple EM for Motif Elicitation)
2.1.2 AlignACE
2.2 Word Enumeration Based Methods
2.2.1 Weeder
2.2.2 Motif Discovery Scan (MDscan)
2.3 Challenges of Current Motif Discovery Methods
Chapter 3: Graphic Network Based Methods
3.1 MotifCut: A Graphic Network Based Method
3.1.1 Graph Construction
3.1.2 Finding the MDS
3.2 Discussion of MotifCut and Other Graphic Network Based Methods
Chapter 4: TFBS Motifs Discovery Using Weighted Graph
4.1 Motivation
4.2 Work Flow
4.2.1 Sequences Selection and Procession
4.2.2 SSM (Similarity Score Matrix) Calculation
4.2.3 Graph Construction
4.2.4 Merging the Subgraphs
4.2.5 Analysis of the MDS
4.3 Result
Chapter 5: Summary and Future Work
References

List of Tables

Table 1. A Position Weight Matrix
Table 2. The constructed MDS with density score.
Table 3. The PWM of an MDS obtained from the study of E2F1 TFBS using the weighted graph algorithm.

List of Figures

Figure 1. Sample MEME output
Figure 2. Breakdown of an input sequence into a set of 2(k-1)-mers
Figure 3. Examples of the output of eQCM

Chapter 1: Introduction

A transcription factor is a protein that regulates the expression of a gene by binding to a transcription factor binding site (TFBS). A TFBS is usually a sequence between 5 and 20 bp long that shows conservation and follows specific patterns called DNA motifs. In biological research, finding TFBS is an important problem because experimental results confirming transcription factor binding are lacking under most conditions (Naughton et al., 2006). The development of computational methods that detect these binding sites accurately would improve the understanding of complex biological processes such as differentiation, development and oncogenesis.

1.1 Protein-DNA Interactions

A cell is driven by numerous interactions between macromolecules. These include, but are not limited to, protein-protein, protein-nucleic acid and nucleic acid-nucleic acid interactions; among them, protein-DNA interactions play a crucial role. Understanding that role requires examining how proteins interact with DNA, which proteins are present in protein-DNA complexes, and which nucleic acid sequences are required to assemble these complexes (“Overview of Protein-Nucleic Acid Interactions”). Proteins interact with DNA through four major forces: hydrogen bonds, salt bridges, entropic effects and dispersion forces. These forces help a protein bind to DNA in a sequence-specific or non-sequence-specific manner (“Overview of Protein-Nucleic Acid Interactions”).
Understanding the mechanism by which proteins interact with DNA, and identifying the patterns of DNA sequences that interact with a given protein, are crucial to comprehending how protein-DNA interactions regulate cellular processes.

1.2 Transcription Factor

Transcription is the process by which DNA is copied to form a new messenger RNA (mRNA), which is responsible for the synthesis of proteins or for other cellular processes such as RNA interference. The level of gene transcription contributes strongly to a cell’s morphological and functional attributes (“Basal Transcription Factor Information”). Hence, it is very important to understand the mechanism of transcriptional regulation comprehensively. A transcription factor (TF) is a protein that binds to a gene at specific sites and regulates its expression. A TF can act either as an activator, by promoting the recruitment of other transcription-related proteins, or as a repressor, by blocking the expression of the gene (“Basal Transcription Factor Information”). Through this mechanism, transcription factors make the regulation of gene expression possible at the molecular level.

1.3 TFBS and DNA Motif

Because transcriptional regulation relies significantly on TFs binding, in a sequence-specific manner, to DNA sites close to target genes, the evolution and regulatory mechanism of these interactions can be understood by precisely determining where the proteins are bound in a genome. Previous research shows that mutations in transcription factor binding sites (TFBS) result in cell dysfunction. Revealing the mechanism of TFBS regulation is therefore key to understanding transcriptional regulation and phenotypic variability (Mahadevan et al., 2008). There are usually three major steps in identifying TFBS on a genome scale. First, experimentally identify binding sites. Second, construct a sequence model, the pattern of DNA sequences to which a TF specifically binds, to present
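The second step above, building a sequence model from a set of identified binding sites, can be sketched roughly as follows. This is a minimal illustration, not the thesis's own procedure. It assumes the model is a log-odds position weight matrix (the model named in the heading of Section 1.4), a uniform background of 0.25 per base, and a pseudocount of 1. The function names `build_pwm`, `score` and `best_hit`, and the four aligned sites, are hypothetical and invented for the example; they are not experimental data.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (one dict per position) from equal-length aligned sites.

    Assumptions for this sketch: uniform background, additive pseudocount.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        column = [s[pos] for s in sites]
        pwm.append({
            b: math.log2((column.count(b) + pseudocount) / total / background)
            for b in BASES
        })
    return pwm

def score(pwm, window):
    """Sum the log-odds weights of a window with the same width as the PWM."""
    return sum(col[base] for col, base in zip(pwm, window))

def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    hits = [(score(pwm, sequence[i:i + width]), i)
            for i in range(len(sequence) - width + 1)]
    return max(hits)

# Toy aligned sites, made up for illustration only.
sites = ["TTTCGCGC", "TTTGGCGC", "TTTCCCGC", "TTTGGCGG"]
pwm = build_pwm(sites)
top_score, top_start = best_hit(pwm, "AAAATTTCGCGCAAAA")
```

Scanning a longer sequence with `best_hit` shows how such a model could then be used to locate candidate sites. The pseudocount keeps bases that were never observed at a position from receiving a weight of negative infinity.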
