Domino Tiling, Gene Recognition, and Mice
Domino Tiling, Gene Recognition, and Mice

by

Lior Samuel Pachter

B.S. in Mathematics, California Institute of Technology (1994)

Submitted to the Department of Mathematics in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 1999

© Lior Pachter, MCMXCIX. All rights reserved.

The author hereby grants to MIT permission to reproduce and distribute publicly paper and electronic copies of this thesis document in whole or in part, and to grant others the right to do so.

Author: Department of Mathematics, May 4, 1999
Certified by: Bonnie A. Berger, Associate Professor of Applied Mathematics, Thesis Supervisor
Accepted by: Michael Sipser, Chairman, Applied Mathematics Committee
Accepted by: Richard Melrose, Chairman, Department Committee on Graduate Students

Domino Tiling, Gene Recognition, and Mice

by Lior Samuel Pachter

Submitted to the Department of Mathematics on May 17, 1999, in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Abstract

The first part of this thesis outlines the details of a computational program to identify genes and their coding regions in human DNA. Our main result is a new algorithm for identifying genes based on comparisons between orthologous human and mouse genes. Using our new technique we are able to improve on the current best gene recognition results. Testing on a collection of 117 genes for which we have human and mouse orthologs, we find that we predict 84% of the coding exons in genes correctly on both ends. Our nucleotide sensitivity and specificity are 95% and 98%, respectively. Most importantly, our algorithms are applicable to large scale annotation problems. The methods are completely scalable.
We are able to take into account multiple or incomplete genes in a genomic region, splice sites without the usual GT/AG consensus, as well as genes on either strand.

In addition to our algorithmic results, we also detail a number of computational studies relevant to the biological phenomena associated with splicing. We discuss the implications of directionality in splice site detection, statistical characteristics of splice sites and exons, as well as how to apply this information to the gene recognition problem.

The second part of the thesis is devoted to combinatorial problems that originate from domino tiling questions. Our main results are upper and lower bounds for forcing numbers of matchings on square grids, as well as the first combinatorial proof that the number of domino tilings of a $2n \times 2n$ square grid is of the form $2^n(2k+1)^2$. Our approach to both problems is concrete and combinatorial, relying on the same set of tools and techniques. We also discuss a number of new problems and conjectures.

Thesis Supervisor: Bonnie A. Berger
Title: Associate Professor of Applied Mathematics

For my grandfather, Jacob (Yankale) Pesate.

Acknowledgments

I thank my mom and dad first, not that "thanks" can express my gratitude for what they have contributed to me. I hope that the readers of this thesis can know what I mean.

The work described herein is the result of, and epitome of, collaborative research. I am deeply indebted and grateful to all who worked with me on the various aspects of what has become a thesis.

The guidance and thoughts of my advisor, professor Bonnie Berger, are evident throughout my work. I thank her for her help and for the unwavering support I received during my studies. I am grateful to professor Daniel Kleitman for suggesting the use of dictionaries in gene recognition, and for the many contributions he made during the ensuing developments.
The key idea of applying comparative genomics principles to gene recognition was suggested by professor Eric Lander, who subsequently contributed many excellent ideas in critical moments. Above all, I thank Serafim Batzoglou who has been my colleague and friend for the last two years. His contributions and efforts directly enabled the completion of much of the gene recognition project described in this thesis. My thanks also go to Val Spitkovsky, who created and developed the dictionary described in Chapter 5. Eric Banks, Bill Wallis, Theodore Tonchev and John Dunagan, all contributed, both by coding and thinking, in countless ways to the research I did. Thank you all.

Of course, I cannot omit my thanks to those who turned my love affair with mathematics into a marriage. I thank my father for showing me what mathematics is, my friend and mentor Nitu Kitchloo for helping me realize that I can be a mathematician, professor Daniel Kleitman for showing me how to be one, and my students throughout the years for giving me a reason to continue being one. Special thanks go to my mother who taught me about those things in life that are much more important than mathematics. There have been many others who have inspired me and taught me, and of course along with those mentioned above, they all deserve much more than a thank you note in my thesis.

The influence and help of all of my friends is evident throughout this thesis. Fortunately, I have too many friends to list. Still, I cannot resist the temptation to specially thank Dave Amundsen (I did beat him at pigskin), Dave Finberg (the computer and combinatorics stud), Jing Li (thanks for the angst), Nitu Kitchloo (brother), Tal Malkin (vodka madame?), Mats Nigam (thanks Linda!), Boris Schlittgen (see Jing Li) and Glenn Tesler (who TeXed most of the difficult stuff in this thesis). Thanks to Beth Hardesty for ungrouching me many times, often during critical moments of my graduate student career.
Finally, I'd like to thank my love Son Preminger for all her help, understanding and fanchuking.

Contents

Notation

I Gene Recognition

Overview

1 Biology Background and Goals
  1.1 The Genetic Dogma
  1.2 The Splicing Cycle
  1.3 Biological Signals and Patterns in DNA
  1.4 What do you do with 100KB of human genomic DNA
  1.5 The Computational Challenges in Gene Annotation
    1.5.1 Exon Prediction
    1.5.2 Other Problems

2 Previous Work on Gene Annotation
  2.1 Similarity Searching and Gene Annotation
  2.2 Statistical Approaches
  2.3 Homology Approaches
  2.4 Hybrids
  2.5 Results

3 Identification of Introns and Exons
  3.1 Splice Sites
    3.1.1 Pairwise Correlations
    3.1.2 The GENSCAN splice site detector
    3.1.3 Left Rules
  3.2 Introns
    3.2.1 Length Distribution
    3.2.2 Pair Correlations in Introns
    3.2.3 G+C effects
    3.2.4 G triplets near the donor splice site
  3.3 Exons
    3.3.1 Length Distribution
    3.3.2 Pair Correlations in Exons
    3.3.3 The Frame

4 Assembling a Parse
  4.1 Introduction
  4.2 Complexity of the Problem
    4.2.1 A visit with Fibonacci
    4.2.2 Average case analysis
    4.2.3 Mitigating factors
  4.3 A Dynamic Programming Approach
    4.3.1 General Framework
    4.3.2 Frame Consistent Dynamic Programming
    4.3.3 Technical Issues

5 Dictionary Approaches
  5.1 Introduction
  5.2 Methods
    5.2.1 Dictionary Lookups and Fragment Matching
    5.2.2 Dynamic Programming
  5.3 Results and Discussion
    5.3.1 Output of the Program
    5.3.2 Alternative Splice Sites
    5.3.3 Exon Prediction
    5.3.4 Other Applications
    5.3.5 Discussion
    5.3.6 Running Times

6 Comparative Genomics
  6.1 Introduction
    6.1.1 The Rosetta Stone
    6.1.2 A New Paradigm for Gene Annotation
  6.2 Alignments
    6.2.1 Background
    6.2.2 Nested Alignments
  6.3 Finding Coding Exons
    6.3.1 Removing Regions with Bad Alignment
    6.3.2 Scoring a Pair of Exons
    6.3.3 Piecing together Exons
  6.4 Results
  6.5 Discussion

II Combinatorics

Overview

7 Forcing Matchings
  7.1 Introduction
  7.2 Preliminaries
  7.3 The Upper Bound
  7.4 The Lower Bound