A Generalized Biophysical Model of Transcription Factor Binding Specificity and Its Application on High-Throughput SELEX Data

Washington University in St. Louis Washington University Open Scholarship Arts & Sciences Electronic Theses and Dissertations Arts & Sciences Winter 12-15-2017 A Generalized Biophysical Model of Transcription Factor Binding Specificity and Its Application on High-Throughput SELEX Data Shuxiang Ruan Washington University in St. Louis Follow this and additional works at: https://openscholarship.wustl.edu/art_sci_etds Part of the Bioinformatics Commons Recommended Citation Ruan, Shuxiang, "A Generalized Biophysical Model of Transcription Factor Binding Specificity and Its Application on High- Throughput SELEX Data" (2017). Arts & Sciences Electronic Theses and Dissertations. 1189. https://openscholarship.wustl.edu/art_sci_etds/1189 This Dissertation is brought to you for free and open access by the Arts & Sciences at Washington University Open Scholarship. It has been accepted for inclusion in Arts & Sciences Electronic Theses and Dissertations by an authorized administrator of Washington University Open Scholarship. For more information, please contact [email protected]. WASHINGTON UNIVERSITY IN ST. LOUIS Division of Biology and Biomedical Sciences Computational and Systems Biology Dissertation Examination Committee: Gary D. Stormo, Chair Jeremy D. Buhler Barak A. Cohen S. Joshua Swamidass Ting Wang A Generalized Biophysical Model of Transcription Factor Binding Specificity and Its Application on High-Throughput SELEX Data by Shuxiang Ruan A dissertation presented to The Graduate School of Washington University in partial fulfillment of the requirements for the degree of Doctor of Philosophy December 2017 St. Louis, Missouri © 2017, Shuxiang Ruan Table of Contents List of Figures ................................................................................................................................ iv List of Tables .................................................................................................................................. v Acknowledgments.......................................................................................................................... vi Abstract ......................................................................................................................................... vii Chapter 1: Introduction ................................................................................................................... 1 1.1 Matrix-Based Models of Transcription Factor Binding Specificity................................. 4 1.2 High-Throughput Methods for Measuring Binding Specificity ....................................... 6 1.3 Contribution of DNA Shape Features to Binding Specificity .......................................... 9 Chapter 2: Inherent Limitations of Probabilistic Models ............................................................. 12 2.1 Models for Transcription Factor Binding Specificity .................................................... 12 2.1.1 Probabilistic Model ............................................................................................................. 12 2.1.2 Biophysical Model .............................................................................................................. 13 2.2 Methods .......................................................................................................................... 14 2.2.1 Simulation Procedure .......................................................................................................... 14 2.2.2 The Estimated Probabilistic Models ................................................................................... 16 2.2.3 The Fitted Biophysical Model ............................................................................................. 16 2.3 Results ............................................................................................................................ 17 2.3.1 Results in the Absence of Measurement Noise ................................................................... 17 2.3.2 Results in the Presence of Measurement Noise .................................................................. 24 2.4 Discussion ...................................................................................................................... 27 Chapter 3: Estimation of Binding Motifs Using HT-SELEX Data .............................................. 30 3.1 Materials ......................................................................................................................... 31 3.2 Generalized Biophysical Model ..................................................................................... 32 3.3 BEESEM Algorithm ...................................................................................................... 34 3.3.1 Parameter Estimation .......................................................................................................... 34 3.3.2 Confidence Intervals ........................................................................................................... 36 3.3.3 Motif Length Selection ....................................................................................................... 36 3.3.4 Initialization of BEESEM with Seed Sequences ................................................................ 37 3.3.5 Evaluation of Binding Models ............................................................................................ 38 ii 3.4 Results ............................................................................................................................ 40 3.4.1 Characterization of BEESEM PWMs ................................................................................. 40 3.4.2 Evaluation Results............................................................................................................... 41 3.5 Discussion ...................................................................................................................... 46 Chapter 4: Contribution of DNA Shape Features to Binding Specificity ..................................... 50 4.1 Materials ......................................................................................................................... 50 4.1.1 JASPAR PFMs .................................................................................................................... 50 4.1.2 ChIP-seq Datasets ............................................................................................................... 51 4.1.3 DNA Shape Features ........................................................................................................... 51 4.2 Methods .......................................................................................................................... 51 4.2.1 Motif Discovery Algorithms Evaluated .............................................................................. 51 4.2.2 Training and Testing Binding Models ................................................................................ 54 4.3 Results ............................................................................................................................ 55 4.4 Discussion ...................................................................................................................... 60 Future Plans .................................................................................................................................. 62 References ..................................................................................................................................... 63 Appendix A: Biophysical Model of BEESEM ............................................................................. 74 Appendix B: Proof of BEESEM Being an Expectation Maximization Algorithm ...................... 82 iii List of Figures Figure 1.1 The workflow of high-throughput SELEX ................................................................7 Figure 2.1 The energy matrix, derived probabilistic models and corresponding logos of a typical simulation .....................................................................................................18 Figure 2.2 The non-linear relationship between binding energy and probability ......................19 Figure 2.3 The correlation between the predicted and true all sequence distributions ..............21 Figure 2.4 The correlation between the predicted and true top 1% sequence distributions ......23 Figure 3.1 The HT-SELEX experiments and the J2013 PFMs .................................................31 Figure 3.2 The results of the PBM and ChIP-seq evaluation tests ............................................45 Figure 3.3 The median relative affinities (MRAs) predicted by different binding models .......49 Figure 4.1 Comparison of the AUPRC scores generated by the DNAshapedTFBS-based methods with and without DNA shape features .......................................................58 Figure 4.2 Comparison of the AUPRC scores generated by DNAshapedTFBS_4bit + shape with other methods ...................................................................................................59 iv List of Tables Table 2.1 The rank correlation between the predicted and true all sequence distributions for probabilistic models in the absence of noise ............................................................22 Table 2.2 The rank correlation between the predicted and true top 1% sequence distributions for probabilistic models in the absence of noise ......................................................24

A Generalized Biophysical Model of Transcription Factor Binding Specificity and Its Application on High-Throughput SELEX Data

Department of Energy Office of Health and Environmental Research SEQUENCING the HUMAN GENOME Summary Report of the Santa Fe Workshop March 3-4, 1986

Mapping Our Genes—Genome Projects: How Big? How Fast?

Characterizing the Dna-Binding Site Specificities of Cis2his2 Zinc Fingers

ISMB 99 August 6 – 10, 1999 Heidelberg, Germany the Seventh

Download File

Baldi Bioinformatics

Gene Regulatory Network Inference Using Machine Learning Techniques

Highlights (PDF)

2014 ISCB Accomplishment by a Senior Scientist Award: Gene Myers

Protein-DNA Recognition Models for the Homeodomain and C2H2 Zinc Finger Transcription Factor Families Ryan Christensen Washington University in St

Reading Genomes Bit by Bit

Footer: a Quantitative Comparative Genomics Method for Efficient Recognition of Cis-Regulatory Elements