IDENTIFICATION AND ANNOTATION OF TRANSPOSABLE ELEMENTS AND AGENT- AND GIS-BASED MODELING OF PATHOGEN TRANSMISSION
A Dissertation

Submitted to the Graduate School of the University of Notre Dame in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

by

Ryan C. Kennedy

Gregory R. Madey, Co-Director
Frank H. Collins, Co-Director

Graduate Program in Computer Science and Engineering
Notre Dame, Indiana
January 2011

Abstract

by Ryan C. Kennedy

The work presented here has two primary components: 1) the identification and annotation of transposable elements (TEs) and 2) a spatially-aware agent-based model of pathogen transmission.

Recent advances in sequencing technology have resulted in an explosion of genomic data. The identification of TEs is an important part of every genome project. This dissertation presents an automated homology-based approach to identify TEs, implemented as TESeeker, that produces consensus TEs up to 98% identical to manually annotated sequences. It also offers a design and implementation plan to allow for the inclusion of TEs on VectorBase's community annotation pipeline.

Agent-based modeling is very adept at modeling natural phenomena. Coupling geographical information system (GIS) data with agent-based modeling further increases the utility of such simulations. This dissertation presents a GIS aware agent-based model of pathogen transmission as well as methods and recommendations for incorporating GIS data into a simulation. The model, named LiNK, was specifically developed to study the impact of landscape on pathogen transmission.

DEDICATION

To my family and friends

CONTENTS

FIGURES
TABLES
ACKNOWLEDGMENTS

CHAPTER 1: INTRODUCTION
  1.1 Overview
  1.2 Identification and Annotation of Transposable Elements
  1.3 Agent- and GIS-based Modeling of Pathogen Transmission
  1.4 Goals
  1.5 Organization
  1.6 Contributions

CHAPTER 2: TRANSPOSABLE ELEMENT AND BIOINFORMATICS BACKGROUND
  2.1 Introduction
  2.2 Molecular Biology
  2.3 Bioinformatics
    2.3.1 VectorBase
  2.4 Transposable Elements
  2.5 Transposable Element Identification
    2.5.1 De novo Discovery
    2.5.2 Structure-based Discovery
    2.5.3 Comparative Genomic Methods
    2.5.4 Homology-based Discovery
  2.6 Annotation
    2.6.1 DAS
    2.6.2 Ensembl
      2.6.2.1 Ensembl Genebuild
    2.6.3 Chado
    2.6.4 Hibernate
    2.6.5 VectorBase Community Annotation Pipeline
      2.6.5.1 Planned Updates to the VectorBase Community Annotation Pipeline
  2.7 Transposable Element Annotation
    2.7.1 VisualRepbase
  2.8 Summary

CHAPTER 3: AUTOMATED HOMOLOGY-BASED APPROACH FOR THE IDENTIFICATION OF TRANSPOSABLE ELEMENTS
  3.1 Introduction
  3.2 Approach for Identification of Transposable Elements
    3.2.1 Dependencies
      3.2.1.1 Library of Representative Sequences
      3.2.1.2 BLAST
      3.2.1.3 DNASTAR SeqMan II
      3.2.1.4 CAP3
      3.2.1.5 ClustalW2
      3.2.1.6 BioPerl
    3.2.2 General Description of Approach
      3.2.2.1 Identify Coding Region
      3.2.2.2 Encompass Complete Transposable Element
      3.2.2.3 Generate Consensus
      3.2.2.4 Identify Complete Transposable Element
    3.2.3 Implementation
    3.2.4 Advantages
    3.2.5 Limitations
  3.3 Results
    3.3.1 Pediculus humanus humanus
      3.3.1.1 Class I Elements
      3.3.1.2 Class II Elements
    3.3.2 Culex quinquefasciatus
    3.3.3 Anopheles gambiae PEST Genome
    3.3.4 Other Organisms
  3.4 Conclusion

CHAPTER 4: DESIGN AND PROOF-OF-CONCEPT PLAN FOR COMMUNITY ANNOTATION OF TRANSPOSABLE ELEMENTS ON VECTORBASE
  4.1 Introduction
  4.2 Transposable Elements and the VectorBase Community Annotation Pipeline
    4.2.1 Similarities to the VectorBase Community Annotation Pipeline
    4.2.2 Differences from the VectorBase Community Annotation Pipeline
    4.2.3 Transposable Element Representation in Chado
    4.2.4 Proof-of-Concept
  4.3 Design and Implementation Plan
  4.4 Conclusion

CHAPTER 5: SIMULATION AND MODELING BACKGROUND
  5.1 Introduction
  5.2 Simulation and Modeling
    5.2.1 Advantages and Disadvantages
    5.2.2 Building a Simulation Model
    5.2.3 Simulation Model Types
    5.2.4 Agent-based Modeling
    5.2.5 Equation-based Modeling
  5.3 Geographic Information Systems
    5.3.1 Raster Data
    5.3.2 Vector Data
  5.4 Integrating Geographic Information System Data into Agent-based Modeling
  5.5 Summary

CHAPTER 6: A GIS AWARE AGENT-BASED MODEL OF PATHOGEN TRANSMISSION
  6.1 Introduction
  6.2 LiNK Simulation Model
    6.2.1 Model Background
    6.2.2 Conceptual Model
    6.2.3 ODD Protocol Description of LiNK
      6.2.3.1 Purpose
      6.2.3.2 State Variables and Scales
      6.2.3.3 Process Overview and Scheduling
      6.2.3.4 Design Concepts
      6.2.3.5 Initialization
      6.2.3.6 Input
      6.2.3.7 Submodels
    6.2.4 Implementation
    6.2.5 Verification and Validation
  6.3 Geographic Information System Data and Agent-Based Modeling
    6.3.1 Approximating Geographic Information System Data in Simulations
    6.3.2 Raster Queries
    6.3.3 Spatial Queries
      6.3.3.1 Simplified Spatial Queries
    6.3.4 Precalculated Query Matrix
    6.3.5 GIS Aware Agents
      6.3.5.1 Movement
  6.4 Results
    6.4.1 Performance
  6.5 Analyzing Massive Amounts of Simulation Data
    6.5.1 LiNKStat
  6.6 Conclusion

CHAPTER 7: CONCLUSION
  7.1 Overview
  7.2 Automated Homology-based Approach for the Identification of Transposable Elements
    7.2.1 Future Work
  7.3 Community Annotation of Transposable Elements on VectorBase
    7.3.1 Future Work
  7.4 GIS Aware Agent-based Model of Pathogen Transmission
    7.4.1 Future Work
  7.5 Contributions

APPENDIX A: AUTOMATED APPROACH WALKTHROUGH
  A.1 Representative Amino Acid Coding Regions
  A.2 Identify Coding Region
    A.2.1 tblastn Search
    A.2.2 Extract Sequences from the Genome
    A.2.3 CAP3 Assembly
      A.2.3.1 CAP3 Contigs
      A.2.3.2 CAP3 Contigs Quality Scores
  A.3 Encompass Complete Transposable Element
  A.4 Generate Consensus
  A.5 Identify Complete Transposable Element
    A.5.1 CAP3 Assembly
    A.5.2 CAP3 Contigs Quality File
    A.5.3 Trimmed CAP3 Contigs

APPENDIX B: TESeeker WEBSITE