A Taxonomy and Genome Cartology of Maize Ltr Retrotransposons
Total Page:16
File Type:pdf, Size:1020Kb
A TAXONOMY AND GENOME CARTOLOGY OF MAIZE LTR RETROTRANSPOSONS by JAMES CHRISTOPHER ESTILL (Under the Direction of Jeffrey L. Bennetzen) ABSTRACT The genome cartology approach to genomics is a spatially explicit framework for the study of patterns in the distribution and abundance of sequence features within and among genomes. The ultimate goal of this approach is to identify the demographic and selective processes that have given rise to these extant patterns. However, tools for the accurate annotation and taxonomic assignment of sequence features must first be implemented before these ultimate goals can be realized. I have designed, implemented and assessed the accuracy of novel annotation and taxonomic classification software and applied these tools to a genome cartology of maize LTR retrotransposons (LRPs). The DAWGPAWS pipeline facilitates combined evidence human curation of ab initio and similarity search based computational results. I verified the value of DAWGPAWS by using this pipeline to annotate genes and transposable elements in 220 BAC insertions from the hexaploid wheat genome. To illustrate that these techniques can scale to entire genomes, the pipeline was applied to the annotation of LRPs in the B73 maize genome and discovered over 31,000 intact elements. The RepMiner suite of programs allows for the clustering of sequences into families based on networks of shared homology. I applied the RepMiner approach to the database of intact maize LRPs annotated by DAWGPAWS. RepMiner further illuminated previously identified family relationships, indicated an unrecognized split in the Huck family, and recognized over 350 new families of LRPs. Affinity propagation based clustering of intact LRPs identified a subset of ~500 exemplar sequences that can serve as a representative database of all maize LRPs. The exemplar database of maize LRPs was used to map the location of intact and fragmented LRPs in the assembled genome of maize. These LRPs comprised over 75% of the genome, and are nonrandomly distributed with a preferential accumulation in pericentromeric heterochromatin. Surprisingly, the regions of the genome with the highest accumulation of LRPs had the lowest diversity of LRP families. These results indicate that genome cartology will provide new insight into genome dynamics, and that continued development of this approach to genomics will further enlighten the study of genome evolution. INDEX WORDS: Affinity Propagation, Bioinformatics, Biological Clustering, Copia, Genome Annotation, Genome Cartology, Genome Evolution, Graph Theory, Gypsy, LTR Retrotransposon, Maize, Poaceae, Transposon, Transposable Element, Zea mays A TAXONOMY AND GENOME CARTOLOGY OF MAIZE LTR RETROTRANSPOSONS by JAMES CHRISTOPHER ESTILL BS, Western Kentucky University, 1996 A Dissertation Submitted to the Graduate Faculty of The University of Georgia in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY ATHENS, GEORGIA 2010 © 2010 James Christopher Estill All Rights Reserved A TAXONOMY AND GENOME CARTOLOGY OF MAIZE LTR RETROTRANSPOSONS by JAMES CHRISTOPHER ESTILL Major Professor: Jeffery L. Bennetzen Committee: Kelly A. Dyer Jim H. Leebens-Mack Russell L. Malmberg Xiaoyu Zhang Electronic Version Approved: Maureen Grasso Dean of the Graduate School The University of Georgia December 2010 DEDICATION To my family; Regina, Jack and Neko thank you. Regina has been both a patient spouse and an insightful scientific collaborator. Jack has provided numerous insights into life and science that are completely unequaled in their depth and candor. Neko motivates wakefulness in the morning, keeps me engaged in work throughout the day and has smiles for me through the night. My parents Frank and Libby Estill have always supported my schoolwork and science and for that I am very thankful. iv ACKNOWLEDGEMENTS I would first like to thank Jeff Bennetzen who served as my mentor and graduate committee chair. Jeff has been very supportive of my career and is one the most imaginative and knowledgeable scientist I have worked with. I would also like to thank Jim Leebens-Mack for fruitful collaboration over the past year. Each member of my committee has made valuable contributions to my dissertation and career development. Hilmar Lapp and William Piel served as mentors for a phyloinformatics Perl coding project in the 2007 Google Summer of Code. They provided general programming examples and advice that were incorporated into the Perl command line and database interface projects. Multiple people also helped improve the software that are described in this dissertation by acting as early adopters of the programs as they were in development. I would like to thank Katrien Devos, Jim Leebens-Mack, Antonio Costa de Oliveira, Xiangyang Xu, Ansuya Jogi, Jennifer Hawkins and Hao Wang for their useful comments and bug reports that have been incorporated in the implementation of DAWGPAWS. Regina Baucom, Phillip SanMiguel, and Evan Staton provided helpful feedback in the early development of the RepMiner suite of programs. My dissertation research also benefited from stimulating conversations concerning the biology of transposable elements with fellow members of the Bennetzen lab group at UGA. This dissertation could not have been completed without the support of family and friends. This work was supported by NSF grants DBI-0501814, DBI-0607123, and DBI-0821263. v This study was also supported in part by resources and technical expertise from the University of Georgia Research Computing Center, a partnership between the Office of the Vice President for Research and the Office of the Chief Information Officer. vi TABLE OF CONTENTS Page ACKNOWLEDGEMENTS ........................................................................................................v CHAPTER 1 INTRODUCTION AND LITERATURE REVIEW..................................................1 Overview............................................................................................................1 The Biology of Plant LTR Retrotransposons.......................................................2 Intragenomic Distribution of Transposable Elements..........................................8 Rationale for a Genome Cartology of Maize LTR Retrotransposons .................18 The Annotation of Maize LTR Retrotransposons with DAWGPAWS...............20 Network Methods for Analysis of Maize LTR Retrotransposons in RepMiner..22 Genome Cartology of Maize LTR Retrotransposons with GenCart ...................25 Overview of the Dissertation Chapters..............................................................26 References........................................................................................................41 2 THE DAWGPAWS PIPELINE FOR THE ANNOTATION OF GENES AND TRANSPOSABLE ELEMENTS IN PLANT GENOMES ......................................57 Abstract............................................................................................................58 Background ......................................................................................................59 Implementation.................................................................................................61 Results and Discussion......................................................................................64 Conclusions......................................................................................................68 vii Availability and Requirements..........................................................................70 References........................................................................................................78 3 REPMINER AND ITS USE FOR LTR RETROTRANSPOSON ANALYSIS IN MAIZE...................................................................................................................82 Abstract............................................................................................................83 Introduction ......................................................................................................83 Materials and Methods......................................................................................87 Results..............................................................................................................92 Discussion........................................................................................................94 Acknowledgements...........................................................................................99 References...................................................................................................... 108 4 AFFINITY PROPAGATION CLUSTERING EFFICIENTY AND EFFECTIVELY DEFINES REPRESENTATIVE EXEMPLARS FROM LARGE MOLECULAR SEQUENCE DATABASES OF TRANSPOSABLE ELEMENTS........................ 111 Abstract.......................................................................................................... 112 Introduction .................................................................................................... 112 Materials and Methods.................................................................................... 115 Results............................................................................................................ 119 Discussion...................................................................................................... 120 References...................................................................................................... 126 5 EXCEPTIONAL