Automated Methods to Infer Ancient Homology and Synteny
Total Page:16
File Type:pdf, Size:1020Kb
AUTOMATED METHODS TO INFER ANCIENT HOMOLOGY AND SYNTENY by JULIAN M. CATCHEN A DISSERTATION Presented to the Department of Computer and Information Science and the Graduate School of the University of Oregon in partial fulfillment of the requirements for the degree of Doctor of Philosophy June 2009 ii \Automated Methods to Infer Ancient Homology and Synteny," a thesis prepared by Julian M. Catchen in partial fulfillment of the requirements for the Doctor of Philosophy degree in the Department of Computer and Information Science. This thesis has been approved and accepted by: Dr. John S. Conery, Chair of the Examining Committee Date Committee in charge: Dr. John S. Conery, Chair Dr. John H. Postlethwait Dr. Virginia M. Lo Dr. Arthur M. Farley Dr. William A. Cresko Accepted by: Dean of the Graduate School iii Copyright 2009 Julian M. Catchen iv An Abstract of the Thesis of Julian M. Catchen for the degree of Doctor of Philosophy in the Department of Computer and Information Science to be taken June 2009 Title: AUTOMATED METHODS TO INFER ANCIENT HOMOLOGY AND SYNTENY Approved: Dr. John S. Conery, Chair Establishing homologous (evolutionary) relationships among a set of genes allows us to hypothesize about their histories: how are they related, how have they changed over time, and are those changes the source of novel features? Likewise, aggregating related genes into larger, structurally conserved regions of the genome allows us to infer the evolutionary history of the genome itself: how have the chromosomes changed in number, gene content, and gene order over time? Establishing homology between genes is important for the construction of human disease models in other organisms, such as the zebrafish, by identifying and manipulating the zebrafish copies of genes involved in the human disease. To make such inferences, researchers compare the genomes of extant species. However, the dynamic nature of genomes, in gene content and chromosomal architecture, presents a major technical challenge to correctly identify homologous genes. This thesis presents a system to infer ancient homology between genes that takes into account a v major but previously overlooked source of architectural change in genomes: whole-genome duplication. Additionally, the system integrates genomic conservation of synteny (gene order on chromosomes), providing a new source of evidence in homology assignment that complements existing methods. The work applied these algorithms to several genomes to infer the evolutionary history of genes, gene families, and chromosomes in several case studies and to study several unique architectural features of post-duplication genomes, such as Ohnologs gone missing. vi CURRICULUM VITAE NAME OF AUTHOR: Julian M. Catchen PLACE OF BIRTH: Bronxville, New York DATE OF BIRTH: June 3, 1978 GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED: University of Oregon Pennsylvania State University DEGREES AWARDED: Doctor of Philosophy in Computer and Information Science, 2009, University of Oregon Master of Science in Computer and Information Science, 2006, University of Oregon Bachelor of Science in Computer Science, 2000, Pennsylvania State University AREAS OF SPECIAL INTEREST: Evolution of genome architecture Whole genome duplication Conserved synteny Genetic networks Bioinformatics vii PROFESSIONAL EXPERIENCE: Graduate Research Fellow, Evolution, Development, and Genomics IGERT Program, University of Oregon, Eugene, OR, 2006 - present Graduate Research Fellow, Postlethwait Lab, University of Ore- gon, Eugene, OR, 2003 - 2006 Teaching Assistant, Department of Computer and Information Science, University of Oregon, 2002 - 2003 Software Engineer, Intel Corporation, Chandler, AZ, 2000 - 2002 Software Developer, IBM Corporation, Poughkeepsie, NY, 1999 GRANTS, AWARDS AND HONORS: Upsilon Pi Epsilon Honor Society for the Computing Sciences, October, 2006 PUBLICATIONS: C. Sullivan, J. Charette, J. Catchen, C. Lage, G. Giasson, J. Postleth- wait, P. Millard, and C. Kim. Zebrash toll-like receptor-4 gene history is predictive of divergent functions. Submitted, 2009. J. Catchen, J. Conery, and J. Postlethwait. Automated identifi- cation of conserved synteny after whole genome duplication. Genome Research, In Press. 2009. R. Jovelin, Y. Yan, X. He, J. Catchen, A. Amores, H. Yokoi, C. Ca~nestro,J. Postlethwait. Evolution of developmental regulation in the vertebrate FgfD subfamily. Journal of Experimental Zoology Part B: Molecular and Developmental Evolution, In Press. 2009. viii C. Ca~nestro,J. Catchen, A. Rodr´ıguez-Mar´ı,H. Yokoi, and J. Postleth- wait. Consequences of lineage-specific gene loss on functional evolution of surviving ohnologs in vertebrate genomes: ALDH1A and retinoic acid signaling. PLoS Genetics, 5(5):e1000496, 2009. H. Yokoi, Y. Yan, M. Miller, R. BreMiller, J. Catchen, E. John- son, and J. Postlethwait. Expression proling of zebrash sox9 mutants reveals that Sox9 is required for retinal differentiation. Developmental Biology, 329(1):1{15, 2009. J. Catchen, J. Conery, and J. Postlethwait. Inferring ancestral gene order. Methods in Molecular Biology, 452:365{383, 2008. J. Bridgham, J. Brown, A. Rodr´ıguez-Mar´ı, J. Catchen, and J. Thornton. Evolution of a new function by degenerative mutation in cephalochordate steroid receptors. PLoS Genetics, 4(9):e1000191, 2008. J. Conery, J. Catchen, and M. Lynch. Rule-based workow man- agement for bioinformatics. VLDB Journal, 14(3):318{329, 2005. ix ACKNOWLEDGMENTS I am grateful to Dr. John Conery for providing me with the opportunity to enter the field of computational biology; for investing his time and energy in my training; for his technical skills; for Ireland. I feel truly privileged to have had the opportunity to work with Dr. John Postlethwait, who took a risk and gave me a seat at his lab bench; who provided me with half a decade of quiet mentoring and support; who taught me how to do science. I would like to thank Cristian Ca~nestro,Angel Amores, Tom Titus, and all the members of the Postlethwait Lab, for their instruction, collaboration, and friendship. The IGERT program in Evolution, Development, and Genomics provided me with much of my education in biology { I became a better student the day I began attending Friday afternoon journal club. I owe thanks to Star Holmberg for shepherding me through this arduous process and for providing support when I needed it most. I would like to acknowledge those institutions that supported me financially, including the National Institutes of Health and the National Science Foundation. Finally, I would like to thank my friends and family: Mom, Dad, Aaron, for calling me when I didn't call you; Anna, for picking me up off the ground; Kevin, for a thousand coffees; and, the Graduate Teaching Fellows Federation, AFT Local 3544, for making me feel like a citizen. x TABLE OF CONTENTS Chapter Page I. INTRODUCTION..............................1 1.1 Gene Relationships...........................2 1.2 Whole-Genome Duplication......................4 1.3 Assigning Orthology.......................... 11 1.4 Ohnologs Gone Missing......................... 14 1.5 Conserved Synteny........................... 19 1.6 Contributions and Outline....................... 21 II. RELATED WORK.............................. 25 2.1 Stand-alone Tools............................ 26 2.2 Whole-Genome Studies of Conserved Synteny............ 34 2.3 Studies Related to Ohnologs Gone Missing.............. 37 III. THE RBH ANALYSIS PIPELINE..................... 44 3.1 Methods................................. 45 3.2 Results.................................. 60 3.3 Case Study: Inferring Ancestral Gene Order............. 76 3.4 Summary................................ 84 IV. THE SYNTENY DATABASE........................ 86 4.1 Methods................................. 88 4.2 Case Study: The ARNTL Gene Family................ 104 4.3 Case Study: The MSX Gene Family.................. 125 xi Chapter Page V. IDENTIFYING OHNOLOGS GONE MISSING.............. 143 5.1 Methods................................. 147 5.2 Results.................................. 156 5.3 Summary................................ 167 VI. CONCLUSION................................ 169 6.1 Future Work............................... 171 APPENDICES................................... 174 A. IMPORTANT BIOLOGICAL CONCEPTS................ 174 B. SINGLE LINKAGE CLUSTERING ALGORITHM............ 180 C. SLIDING WINDOW ALGORITHM.................... 182 D. MICRO-SYNTENY ALGORITHM..................... 184 BIBLIOGRAPHY................................. 185 xii LIST OF FIGURES Figure Page 1.1 The evolutionary history of a hypothetical gene..............3 1.2 An illustration of subfunctionalization...................6 1.3 Whole-Genome Duplications in the chordate lineages...........7 1.4 Hox Clusters: the signature of chordate whole-genome duplications...9 1.5 The Reciprocal Best Hit Algorithm..................... 12 1.6 Differential gene loss following whole genome duplication creates ohnologs gone missing ................................. 16 1.7 Four categories of conservation....................... 19 3.1 Anchoring paralogous genes to the outgroup................ 45 3.2 RBH Analysis Pipeline Scheme....................... 47 3.3 Output of the Local Minimum Alignment algorithm............ 49 3.4 Examples of the BLAST Clustering algorithm............... 51 3.5 The single linkage clustering algorithm of the RBH Analysis Pipeline.. 55 3.6 Summary of RBH Analysis Pipeline Results................ 61 3.7 Danio rerio primary genome anchored to the Homo sapiens outgroup