Discriminative Learning for Probabilistic Sequence Analysis

Freie Universität Berlin Fachbereich Mathematik und Informatik Dissertation zur Erlangung des akademischen Grades des Doktors der Naturwissenschaften (Dr. rer. nat.) Discriminative learning for probabilistic sequence analysis Jonas Maaskola June 2014 Erstgutachter: Prof. Dr. Martin Vingron Transcriptional Regulation Group Computational Molecular Biology Department Max Planck Institute for Molecular Genetics Honorary Professor of Bioinformatics Department of Mathematics and Computer Science Freie Universität Berlin Zweitgutachter: Prof. Dr. Nikolaus Rajewsky Laboratory for Systems Biology of Gene Regulatory Elements Berlin Institute for Medical Systems Biology Max-Delbrück-Centrum für Molekulare Medizin, Berlin-Buch Professor of Systems Biology Charité Medical School Humboldt Universität Berlin Disputation: 16. April 2015 Contents Abstract 1 Summary 3 Methodology ..................................... 3 Results and contributions .............................. 6 I Biological Background 15 1 Systems biology of gene regulation 17 1.1 State of life science research .......................... 17 1.2 Gene regulatory processes ........................... 18 1.3 Sequencing technologies ............................ 22 2 Published computational biology research 25 2.1 List of publications ............................... 25 2.2 Computational analysis of deep-sequencing data .............. 26 2.3 Population genomics of drosophilid miRNAs ................. 29 2.4 Small RNAs in C. elegans embryogenesis ................... 34 2.5 Small RNAs in S. purpuratus early development ............... 42 2.6 ChIP-Sequencing a cell-cycle regulator and a helicase ............ 45 2.7 Data curation for post-transcriptional research ................ 51 2.8 Splicing regulation ............................... 51 II Probabilistic Sequence Analysis 53 3 Motif and binding site models 57 3.1 Binding motif models .............................. 57 3.2 Binding site models ............................... 64 4 Hidden Markov models 67 4.1 Formal definition ................................ 67 4.2 Fundamental problems and basic inference algorithms ........... 68 4.3 Viterbi algorithm ................................ 69 4.4 Forward-backward algorithm ......................... 71 i Contents 4.5 Scaled forward-backward algorithm ...................... 73 4.6 Likelihood gradient ............................... 75 5 Binding site hidden Markov model 81 5.1 An HMM for binding sites ........................... 81 5.2 Posterior motif occurrence probability .................... 82 6 Learning algorithms 85 6.1 Baum-Welch algorithm ............................. 85 6.2 Gradient learning ................................ 88 6.3 Complexity ................................... 89 III Discriminative Learning for Probabilistic Sequence Analysis 91 7 Statistics for discriminative learning 95 7.1 Contrasts .................................... 95 7.2 Contingency tables of features across contrasts ................ 96 8 Table-based discriminative objective functions 99 8.1 Difference of occurrence frequency ...................... 99 8.2 Matthews correlation coefficient ........................ 100 8.3 Fisher’s exact test ................................ 101 8.4 Normalized enrichment score ......................... 102 8.5 Pearson’s χ2 test for independence ...................... 102 8.6 Mutual information of condition and occurrence ............... 102 9 Probabilistic discriminative objective functions 105 9.1 Difference of log likelihood ........................... 105 9.2 Multiple model classification ......................... 106 9.3 Probability of correct classification ...................... 106 10 Significance of association 109 10.1 Mutual information, likelihood ratio, and χ2 test ............... 109 10.2 Multiple testing correction for motif discovery problems ........... 110 10.3 Significance of association significance filtering ............... 111 11 Discrete optimization of discriminative objectives 113 11.1 Enumerating residually most discriminative words ............. 113 11.2 Identifying discriminative words with degeneracy .............. 114 12 Hybrid learning 119 12.1 Signal and context parameters ......................... 119 12.2 Learning scheme ................................ 119 12.3 Multi-objective learning ............................ 120 ii Contents 13 Discovering multiple motifs 123 13.1 Measures of conditional association ...................... 123 13.2 Discovering multiple motifs .......................... 124 14 Related work 129 14.1 Overview .................................... 129 14.2 DREME: Discriminative Regular Expression Motif Elicitation ........ 129 14.3 YMF: Yeast Motif Finder ............................ 130 14.4 CMF: Contrast Motif Finder ........................... 131 14.5 DME: Discriminating Matrix Enumerator ................... 132 14.6 DIPS: Discriminative PWM Search ....................... 134 14.7 DECOD: Deconvolved Discriminative Motif Discovery ............ 136 14.8 MoAn: Motif Annealer ............................. 137 14.9 DEME: Discriminatively Enhanced Motif Elicitation ............. 139 14.10 Dispom ..................................... 140 14.11 FIRE: Finding Informative Regulatory Elements ............... 141 14.12 Further motif discovery tools .......................... 142 IV Empirical Study of Motif Discovery Methodology 143 15 Synthetic test data 147 15.1 Generation of synthetic data .......................... 147 15.2 Recognizability ................................. 148 16 Performance metrics 151 16.1 Supervised performance metrics ........................ 151 16.2 Summarization ................................. 153 17 Results on synthetic data 155 17.1 Evaluated motif discovery tools ........................ 155 17.2 Summary performance ............................. 156 17.3 Comparing signal-only and discriminative learning ............. 159 17.4 Discriminative filtering ............................. 162 18 PUF RNA-binding protein family 163 18.1 Materials .................................... 164 18.2 Methods ..................................... 165 18.3 Discriminative motifs in PUF RBP data .................... 166 18.4 Comparing MICO, MMIE, and ML learning .................. 168 18.5 Dilution and word-based analyses ....................... 168 19 Alternative splicing regulator RBM10 173 19.1 Materials .................................... 174 19.2 Motif discovery for RBM10 reveals splicing-relevant motifs ......... 174 19.3 Previously reported RBM10 motifs not corroborated ............. 176 iii Contents 20 Mouse embryonic stem cell transcription factors 179 20.1 Materials and methods ............................. 179 20.2 Discriminative motifs in ChIP-Seq data .................... 180 20.3 Spatial distribution of motif occurrences ................... 185 20.4 Contrasting Nanog and Tcf3 against other ChIP-Seq data .......... 186 20.5 Comparing results of DREME, FIRE, and Discrover .............. 186 V Discussion 191 21 Supervised motif discovery performance experiments 193 21.1 Influence of parameters varied in synthetic data ............... 193 21.2 Generative, signal-only learning ........................ 195 21.3 Discriminative learning ............................ 196 21.4 Robustness of hybrid learning ......................... 198 21.5 Comparison to published motif discovery methods .............. 199 22 Sequence motifs in biological data 201 22.1 RIP-Chip and PAR-CLIP data of PUF family RBPs ............... 201 22.2 Alternative splicing regulator RBM10 ..................... 202 22.3 ChIP-Seq data of mouse ESC TFs ........................ 204 23 Outlook 207 23.1 Global optimization .............................. 207 23.2 Rank-based learning - Average rank information ............... 208 23.3 Faster learning ................................. 208 23.4 Additional information sources ........................ 209 23.5 Other applications ............................... 210 24 Conclusions 211 Appendices 213 A Proof of correctness of the scaling procedure 215 A.1 Correctness for the forward matrix recursion ................. 215 A.2 Correctness for the backward matrix recursion ................ 216 B Runtime of HMM inference algorithms 217 C Information theory 219 C.1 Communication systems ............................ 219 C.2 Fundamental quantities of information theory ................ 219 D Likelihood ratio, mutual information and χ2 statistic 223 D.1 The likelihood ratio statistic for goodness of fit ................ 223 D.2 G-test ...................................... 225 iv Contents D.3 Likelihood ratio and mutual information ................... 225 E Limits of Matthews correlation coefficient 227 F Gradient calculus 229 G Running parameters 231 H Synthetic data experiments 235 I PUF RBP family data 245 J Alternative splicing regulator RBM10 253 K Mouse ESC ChIP-Seq data 261 L Research anecdotes 265 L.1 Motif discovery anecdotes ........................... 265 L.2 A fishy smell in ChIP-Seq data ......................... 267 Bibliography 269 Statutory Declaration 299 Zusammenfassung 301 This is revision 66a1c19 of the manuscript, generated July 15, 2015. v List of Figures 1.1 Biogenesis of an mRNA ............................. 21 1.2 Structure of 4-thiouridine and 6-thioguanosine ............... 23 1.3 ChIP-Seq and PAR-CLIP methodologies .................... 24 2.1 Small RNA expression in early development of C. elegans and S. purpuratus 37 2.2 E2F3 and HELLS regulate many common targets, most notably MLL1 .... 46 3.1

Discriminative Learning for Probabilistic Sequence Analysis

The Evolution of Lisp

Robert Alan Saunders

Introduction to the Literature on Programming Language Design Gary T

The University of Chicago Transaction and Message

Commencement1985.Pdf (6.060Mb)

The Evolution of Smalltalk from Smalltalk-72 Through Squeak

Wizards of The

The Lenfest Ocean Program

Oral History of Richard Greenblatt

The Connection Machine

Experimenting with Programming Languages

NBS FIPS Software Documentation