IDENTIFICATION AND ANNOTATION OF TRANSPOSABLE ELEMENTS AND AGENT- AND GIS-BASED MODELING OF PATHOGEN TRANSMISSION

A Dissertation

Submitted to the Graduate School of the University of Notre Dame in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

by

Ryan C. Kennedy

Gregory R. Madey, Co-Director

Frank H. Collins, Co-Director

Graduate Program in Computer Science and Engineering

Notre Dame, Indiana

January 2011

Abstract by Ryan C. Kennedy

The work presented here has two primary components: 1) the identification and annotation of transposable elements (TEs) and 2) a spatially aware agent-based model of pathogen transmission.

Recent advances in sequencing technology have resulted in an explosion of genomic data. The identification of TEs is an important part of every genome project. This dissertation presents an automated homology-based approach to identify TEs, implemented as TESeeker, that produces consensus TEs up to 98% identical to manually annotated sequences. It also offers a design and implementation plan to allow for the inclusion of TEs on VectorBase’s community annotation pipeline.

Agent-based modeling is well suited to modeling natural phenomena. Coupling geographical information system (GIS) data with agent-based modeling further increases the utility of such simulations. This dissertation presents a GIS aware agent-based model of pathogen transmission as well as methods and recommendations for incorporating GIS data into a simulation. The model, named LiNK, was specifically developed to study the impact of landscape on pathogen transmission.

DEDICATION

To my family and friends

CONTENTS

FIGURES

TABLES

ACKNOWLEDGMENTS

CHAPTER 1: INTRODUCTION
  1.1 Overview
  1.2 Identification and Annotation of Transposable Elements
  1.3 Agent- and GIS-based Modeling of Pathogen Transmission
  1.4 Goals
  1.5 Organization
  1.6 Contributions

CHAPTER 2: TRANSPOSABLE ELEMENT AND BIOINFORMATICS BACKGROUND
  2.1 Introduction
  2.2 Molecular Biology
  2.3 Bioinformatics
    2.3.1 VectorBase
  2.4 Transposable Elements
  2.5 Transposable Element Identification
    2.5.1 De novo Discovery
    2.5.2 Structure-based Discovery
    2.5.3 Comparative Genomic Methods
    2.5.4 Homology-based Discovery
  2.6 Annotation
    2.6.1 DAS
    2.6.2 Ensembl
      2.6.2.1 Ensembl Genebuild
    2.6.3 Chado
    2.6.4 Hibernate
    2.6.5 VectorBase Community Annotation Pipeline
      2.6.5.1 Planned Updates to the VectorBase Community Annotation Pipeline
  2.7 Transposable Element Annotation
    2.7.1 VisualRepbase
  2.8 Summary

CHAPTER 3: AUTOMATED HOMOLOGY-BASED APPROACH FOR THE IDENTIFICATION OF TRANSPOSABLE ELEMENTS
  3.1 Introduction
  3.2 Approach for Identification of Transposable Elements
    3.2.1 Dependencies
      3.2.1.1 Library of Representative Sequences
      3.2.1.2 BLAST
      3.2.1.3 DNASTAR SeqMan II
      3.2.1.4 CAP3
      3.2.1.5 ClustalW2
      3.2.1.6 BioPerl
    3.2.2 General Description of Approach
      3.2.2.1 Identify Coding Region
      3.2.2.2 Encompass Complete Transposable Element
      3.2.2.3 Generate Consensus
      3.2.2.4 Identify Complete Transposable Element
    3.2.3 Implementation
    3.2.4 Advantages
    3.2.5 Limitations
  3.3 Results
    3.3.1 Pediculus humanus humanus
      3.3.1.1 Class I Elements
      3.3.1.2 Class II Elements
    3.3.2 Culex quinquefasciatus
    3.3.3 Anopheles gambiae PEST Genome
    3.3.4 Other Organisms
  3.4 Conclusion

CHAPTER 4: DESIGN AND PROOF-OF-CONCEPT PLAN FOR COMMUNITY ANNOTATION OF TRANSPOSABLE ELEMENTS ON VECTORBASE
  4.1 Introduction
  4.2 Transposable Elements and the VectorBase Community Annotation Pipeline
    4.2.1 Similarities to the VectorBase Community Annotation Pipeline
    4.2.2 Differences from the VectorBase Community Annotation Pipeline
    4.2.3 Transposable Element Representation in Chado
    4.2.4 Proof-of-Concept
  4.3 Design and Implementation Plan
  4.4 Conclusion

CHAPTER 5: SIMULATION AND MODELING BACKGROUND
  5.1 Introduction
  5.2 Simulation and Modeling
    5.2.1 Advantages and Disadvantages
    5.2.2 Building a Simulation Model
    5.2.3 Simulation Model Types
    5.2.4 Agent-based Modeling
    5.2.5 Equation-based Modeling
  5.3 Geographic Information Systems
    5.3.1 Raster Data
    5.3.2 Vector Data
  5.4 Integrating Geographic Information System Data into Agent-based Modeling
  5.5 Summary

CHAPTER 6: A GIS AWARE AGENT-BASED MODEL OF PATHOGEN TRANSMISSION
  6.1 Introduction
  6.2 LiNK Simulation Model
    6.2.1 Model Background
    6.2.2 Conceptual Model
    6.2.3 ODD Protocol Description of LiNK
      6.2.3.1 Purpose
      6.2.3.2 State Variables and Scales
      6.2.3.3 Process Overview and Scheduling
      6.2.3.4 Design Concepts
      6.2.3.5 Initialization
      6.2.3.6 Input
      6.2.3.7 Submodels
    6.2.4 Implementation
    6.2.5 Verification and Validation
  6.3 Geographic Information System Data and Agent-Based Modeling
    6.3.1 Approximating Geographic Information System Data in Simulations
    6.3.2 Raster Queries
    6.3.3 Spatial Queries
      6.3.3.1 Simplified Spatial Queries
    6.3.4 Precalculated Query Matrix
    6.3.5 GIS Aware Agents
      6.3.5.1 Movement
  6.4 Results
    6.4.1 Performance
  6.5 Analyzing Massive Amounts of Simulation Data
    6.5.1 LiNKStat
  6.6 Conclusion

CHAPTER 7: CONCLUSION
  7.1 Overview
  7.2 Automated Homology-based Approach for the Identification of Transposable Elements
    7.2.1 Future Work
  7.3 Community Annotation of Transposable Elements on VectorBase
    7.3.1 Future Work
  7.4 GIS Aware Agent-based Model of Pathogen Transmission
    7.4.1 Future Work
  7.5 Contributions

APPENDIX A: AUTOMATED APPROACH WALKTHROUGH
  A.1 Representative Amino Acid Coding Regions
  A.2 Identify Coding Region
    A.2.1 tblastn Search
    A.2.2 Extract Sequences from the Genome
    A.2.3 CAP3 Assembly
      A.2.3.1 CAP3 Contigs
      A.2.3.2 CAP3 Contigs Quality Scores
  A.3 Encompass Complete Transposable Element
  A.4 Generate Consensus
  A.5 Identify Complete Transposable Element
    A.5.1 CAP3 Assembly
    A.5.2 CAP3 Contigs Quality File
    A.5.3 Trimmed CAP3 Contigs

APPENDIX B: TESeeker WEBSITE

APPENDIX C: TESeeker USER MANUAL
  C.1 Installation
  C.2 Usage
  C.3 Example Search
  C.4 Additional Tools
  C.5 Technology

APPENDIX D: SELECTED AUTOMATED APPROACH SOURCE CODE
  D.1 Combine BLAST Hits
  D.2 Extract Sequences
  D.3 Trim CAP3 Contigs
  D.4 Generate Consensus

APPENDIX E: TRANSPOSABLE ELEMENTS IDENTIFIED
  E.1 P. humanus humanus
    E.1.1 Non-LTRs
      E.1.1.1 Hope-like SART
      E.1.1.2 Dong-like R4
    E.1.2 LTRs
      E.1.2.1 Mdg1 ty3/gypsy
    E.1.3 Transposons
      E.1.3.1 mariner
      E.1.3.2 MITE1
      E.1.3.3 MITE2
  E.2 C. quinquefasciatus
    E.2.1 Non-LTRs
      E.2.1.1 CR1
      E.2.1.2 I
      E.2.1.3 Jockey
      E.2.1.4 L1
      E.2.1.5 L2
      E.2.1.6 LOA
      E.2.1.7 Loner
      E.2.1.8 Outcast
      E.2.1.9 R1
      E.2.1.10 RTE
      E.2.1.11 Unclassified LINE
  E.3 D. melanogaster
    E.3.1 Transposons
      E.3.1.1 mariner

APPENDIX F: SCRIPT USED TO IDENTIFY MITES

REFERENCES

FIGURES

1.1 Dissertation Components

2.1 Central Dogma of Molecular Biology
2.2 Genetic Code
2.3 Global and Local Sequence Alignment
2.4 Typical mariner Class II Transposon Structure
2.5 Transposable Element (TE) Classification Scheme and Structures
2.6 Ensembl Location-based View on VectorBase
2.7 Ensembl Gene-based View on VectorBase
2.8 Ensembl Transcript-based View on VectorBase
2.9 VectorBase Gene Submission Form
2.10 VectorBase Community Annotation Pipeline Data Flow
2.11 VisualRepbase Interface

3.1 Approach Schematic
3.2 Methods of Combination
3.3 P. humanus humanus mariner element
3.4 C. quinquefasciatus Jockey element

4.1 Client-side TE Submission Process
4.2 TE Submission Form
4.3 Entity-Relationship Diagram of Selected Chado Tables
4.4 TE Start and Submit Page
4.5 TE Details Page
4.6 TE Structure Page
4.7 Proof-of-Concept Configuration

5.1 Raster vs. Vector Data

6.1 Female Macaque and Infant
6.2 Uluwatu Temple Site
6.3 Life Cycle Transition Diagram
6.4 LiNK Control Panel
6.5 LiNK Display
6.6 Temple Site Display
6.7 Temporal Relationship of Pathogen Parameters and Related Events
6.8 Pathogen Transition Diagram
6.9 Verification and Validation Techniques for Agent-based Models
6.10 Spatial Data Approximation
6.11 Macaque Movement
6.12 Comparison of Total Number of Infections grouped by Landscape and Population Size at Varying Sites
6.13 Pathogen Spread to Varying Temple Sites
6.14 Performance Comparison of Varying Query Methods
6.15 Scalability with Respect to Initial Number of Dispersed Macaques and Amount of GIS data
6.16 LiNKStat
6.17 LiNKStat Pathogen Transmission Graph

B.1 TESeeker Website

C.1 TESeeker Desktop
C.2 TESeeker Genomes Folder
C.3 TESeeker TELibrary
C.4 TESeeker Documentation
C.5 TESeeker Web Interface
C.6 TESeeker BLAST Interface
C.7 TESeeker Extract Interface
C.8 TESeeker Default Parameters
C.9 Web Interface File Browser
C.10 ClustalX Alignment with Annotated Element

TABLES

3.1 Pediculus humanus humanus NON-LONG TERMINAL REPEAT (NON-LTR) RESULTS
3.2 Culex quinquefasciatus RESULTS

6.1 MOVEMENT VALUES FOR DISPERSING MACAQUES
6.2 STATE VARIABLES
6.3 PERFORMANCE COMPARISON OF GUI LOAD TIME
6.4 PERFORMANCE COMPARISON OF TIME STEPS/S
6.5 SCALABILITY COMPARISON OF TIME STEPS/S
6.6 ADVANTAGES AND DISADVANTAGES (1 - POOR; 5 - EXCELLENT)

ACKNOWLEDGMENTS

I would like to thank my advisors, Dr. Frank Collins and Dr. Greg Madey, for their direction, patience, and encouragement. I would especially like to thank Dr. Greg Madey, with whom I have collaborated since I was an undergraduate, for his constant support and for providing me the opportunity to pursue this degree.

Thank you also to my committee members: Dr. Scott Emrich for his bioinformatics guidance, Dr. Agustín Fuentes for his direction on the LiNK project, and Dr. Tijana Milenković for her valuable contributions on this dissertation. Additionally, I am grateful to Dr. Nora Besansky for serving as my outside chair and to Dr. Hope Hollocher for her direction on the LiNK simulation model and for serving as the outside chair for my proposal.

Thank you to our administrative assistant, Ms. Joyce Yeats, for her invaluable assistance over the years. Special thanks to Dr. Scott Christley for his unwavering support, insight, and contributions. Thank you also to Maria Unger and Jenica Abrudan for sharing their biological expertise while also contributing to much of this work. Thank you to Kelly Lane for her collaboration on the LiNK simulation model.

Finally, I would like to thank my family and friends, particularly my parents Bill and Martha, and Carrianne Scheib. This work would not be possible without them.

This research was supported in part by NIAID/NIH contracts HHSN272200900039C and HHSN266200400039C for “VectorBase: A Bioinformatics Resource Center for Invertebrate Vectors of Human Pathogens” [143] and NSF grants BCS#0639787 and BCS#0629787. Selected bioinformatics simulations were performed on the Notre Dame Biocomplexity Cluster supported in part by NSF MRI Grant No. DBI-0420980. Additional computational resources were provided in part by the Notre Dame Center for Research Computing [142].

CHAPTER 1

INTRODUCTION

1.1 Overview

The work presented in this dissertation consists of two related parts. The first part, Chapters 2, 3, and 4, concerns the discovery and annotation of transposable elements (TEs). The second part, Chapters 5 and 6, involves the development of a simulation model that utilizes agent-based modeling (ABM) and geographic information system (GIS) data to model pathogen spread. Each of these parts could be categorized under the computational biology realm, shown visually in Figure 1.1. The following sections provide a brief introduction to each chapter, including our motivations.

1.2 Identification and Annotation of Transposable Elements

Transposable elements (TEs) are a type of repetitive sequence that has been found in nearly all eukaryotic genomes. They have the ability to move about and replicate within a genome and are believed to play a major role in genome evolution [82, 100, 130]. Largely because of their diversity and mobility, TEs are difficult to identify. We present an automated homology-based approach for the identification of TEs. This approach utilizes a comprehensive library of representative sequences as the basis for our search. We make heavy use of common bioinformatics technologies, namely BLAST, ClustalW2, CAP3, and BioPerl. Our approach, implemented as TESeeker, is designed to be easier to use than existing approaches and to produce high-quality consensus TEs, allowing for quicker genome annotation.

We also present a design and implementation plan for the inclusion of TEs in VectorBase’s community annotation pipeline for genes. Existing TE annotation websites, such as TEfam [134] and Repbase [119], lack the detailed data that is available for genes. Extending VectorBase’s community annotation pipeline to include TEs would fill this gap and give researchers resources and detailed information not available elsewhere.

[Figure 1.1 diagram: Computational Biology comprises Bioinformatics (TE Identification, TE Annotation) and Agent-based Modeling (Model of Pathogen Transmission); all components connect to Global Health.]

Figure 1.1. Dissertation Components. The first part of this dissertation lies within the Bioinformatics realm, while the second is categorized as Agent-based modeling. Each part of this dissertation shares global health implications.

1.3 Agent- and GIS-based Modeling of Pathogen Transmission

There are numerous advantages to using a simulation for scientific study [10], including the ability to model and predict the behavior of a real-world system without altering the actual system. A simulation that can utilize real-world data, such as geographic data, has even more potential to be valuable. We present an ABM that utilizes GIS data to simulate pathogen transmission. In particular, we are interested in the effect landscape has on pathogen transmission. Our simulation, named LiNK, has been developed to model pathogen transmission among macaque monkeys on the island of Bali, Indonesia. GIS data has been incorporated into LiNK to allow for spatially aware macaques. We are unaware of any other epidemiological studies that couple ABM with GIS data, or of any work studying efficient ways for mobile agents to interact with their environment. As such, we explore different means to include GIS data in an agent-based simulation and offer suggestions for a variety of applications.

1.4 Goals

This dissertation makes the following contributions:

• Development and implementation of an automated approach to detect transposable elements.

• Design and implementation plan for the incorporation of TEs into the VectorBase community annotation pipeline.

• Development and implementation of a GIS aware agent-based model of pathogen transmission.

1.5 Organization

The remainder of this dissertation is organized as follows. Chapter 2 introduces biological concepts necessary to the understanding of the first part of this dissertation. Chapter 3 describes our approach for the automatic identification of TEs, as well as the implementation of our approach, TESeeker. This chapter also presents the results of our approach, applied to a number of genomes, including comparisons to published data. Chapter 4 describes the community annotation pipeline for genes on VectorBase and how to extend it to allow for TEs. We present a design and implementation plan and describe a preliminary implementation. Chapter 5 introduces agent-based simulation and GIS. Chapter 6 thoroughly describes our model of pathogen transmission, LiNK. Chapter 7 summarizes the contributions of this dissertation and proposes future work. We conclude this document with supplementary information: Appendix A presents a walkthrough of the inner workings of TESeeker, Appendix B presents the TESeeker website, Appendix C presents the TESeeker user manual, Appendix D presents selected source code for TESeeker, Appendix E presents selected transposable elements, and Appendix F presents our script to detect MITEs.

1.6 Contributions

Applications of our approach to identify TEs have been included in the Pediculus humanus humanus and Culex quinquefasciatus genome papers [5, 83]. In particular, we authored all TE-related sections of the P. humanus humanus paper and contributed to the non-LTR section of the TE analysis in the C. quinquefasciatus paper. A paper describing our approach and its implementation is under review [80]. LiNK has been described in detail in an invited journal manuscript [76] and in conference proceedings [79]. A manuscript detailing initial biological implications of the model is in preparation [88].

CHAPTER 2

TRANSPOSABLE ELEMENT AND BIOINFORMATICS BACKGROUND¹

2.1 Introduction

Much of this dissertation relies on a basic understanding of molecular biology in addition to a familiarity with transposable elements and the field of bioinformatics. This chapter introduces these biological and bioinformatics concepts.

2.2 Molecular Biology

Cells form the basic components of life and serve varying purposes, but most have the ability to replicate. Each cell contains the genetic information and the mechanisms it needs to make a complete copy of itself, a process called replication [69]. Jones and Pevzner liken this to a car factory that gathers raw materials, prepares them, and assembles a copy of itself, all while continuing to make cars [69]. Because cells are the basic reaction vessels of the body, understanding their inner workings leads to a greater overall understanding of how the body functions.

¹ Portions of this chapter were previously reported in Kennedy [75].

Although cells come in a myriad of shapes and sizes, each has three common components: DNA, RNA, and protein molecules. DNA, or deoxyribonucleic acid, is often described as the building block of life, as it contains the genetic material governing how a cell operates. DNA, composed of the nucleotides adenine, guanine, cytosine, and thymine, is a double-stranded, helical molecule. RNA, or ribonucleic acid, is instead composed of a single strand of the nucleotides adenine, guanine, cytosine, and uracil. RNA is used to carry copies of pieces of the DNA strand to other locations in the cell. Proteins are molecules made up of amino acids, which we describe later. Many proteins act as enzymes and can be thought of as the laborers of the cell, because they perform functions ranging from assembling strands of nucleotides to signaling other cells.

The double-stranded helical structure of DNA lends itself to replication. To replicate, the two strands of a chromosome's DNA “unzip” from one another, and an enzyme called DNA polymerase, which is prevalent throughout the cell, attaches itself to one of the strands. It then moves along the strand of DNA, attracting complementary nucleotides. These nucleotides hydrogen bond to the strand being copied, and the process continues until the chromosome is copied. The same process happens concurrently on the other original strand, completing replication.

It is also important to understand the Central Dogma of Molecular Biology, which outlines the general process by which proteins are generated from DNA, shown in Figure 2.1. First, DNA “unzips” and an enzyme called RNA polymerase binds complementary nucleotides to one strand of DNA, starting at the promoter region. This “unzipping” continues until the RNA polymerase reaches the terminator region, at which point the RNA strand breaks off and the DNA strands “zip” back together, completing transcription. Next, ribosomes translate the RNA to produce a polypeptide chain. The RNA strand produced by transcription is more specifically called messenger RNA, or mRNA. Another type of RNA, transfer RNA, or tRNA, continually floats around within the cell. Ribosomes move along the mRNA strand, reading groups of three nucleotides, called codons, at a time. Each codon encodes an amino acid. There are sixty-four possible combinations of nucleotides for a codon, yet there are only twenty different amino acids, as multiple combinations can code for the same amino acid. Partially for this reason, amino acids are the preferred components of our transposable element library, described in detail in Chapter 3. Figure 2.2 shows which codons refer to which amino acids. As the ribosomes move along the mRNA, they decipher the codons, and tRNA molecules carrying the matching anticodons deliver the corresponding amino acids. These amino acids are assembled into a chain, called a polypeptide chain, which is folded to make up a protein. This process proceeds from the start codon and continues until a stop codon is encountered.

DNA --(transcription)--> RNA --(translation)--> protein

Figure 2.1. Central Dogma of Molecular Biology. This is the process by which DNA undergoes transcription to produce RNA, which in turn undergoes translation to produce protein.
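The transcription and translation steps described above can be illustrated with a short Python sketch. It is only a minimal illustration and is not part of TESeeker: the function names `transcribe` and `translate` are hypothetical, transcription is simplified to rewriting the coding strand with uracil in place of thymine, and the codon table is the standard genetic code written with one-letter amino acid codes ('*' marks a stop codon).

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one-letter amino acid codes, in UCAG order
# (first base varies slowest, third base fastest). '*' marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # codons starting with U
    "LLLLPPPPHHQQRRRR"   # codons starting with C
    "IIIMTTTTNNKKSSRR"   # codons starting with A
    "VVVVAAAADDEEGGGG"   # codons starting with G
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand):
    """Simplified transcription: the mRNA matches the coding strand with U for T."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Read codons from the first AUG until a stop codon or the end of the mRNA."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing to translate
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]  # assumes only A, C, G, U
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("CCATGGCTTGGTAAGG")
    print(mrna)             # CCAUGGCUUGGUAAGG
    print(translate(mrna))  # MAW
```

Because the table maps sixty-one sense codons onto twenty amino acids, several different nucleotide sequences translate to the same protein, which is one reason amino acid sequences make more robust search queries than nucleotide sequences.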

2.3 Bioinformatics

Bioinformatics is the application of computer science techniques to solve biological problems. Bioinformatics aims to develop a better understanding of the function of genes through the use of advanced, yet easy-to-use and often web-based, interfaces. Computer science plays a central role in bioinformatics because the study and analysis of large amounts of genetic data, which is otherwise time-consuming and prone to error, readily lends itself to the computer science discipline. We next briefly describe several relevant bioinformatics research areas, most of which are utilized in Chapters 3 and 4.

First                    Second base
base   U                   C               A                   G
U      UUU Phenylalanine   UCU Serine      UAU Tyrosine        UGU Cysteine
       UUC Phenylalanine   UCC Serine      UAC Tyrosine        UGC Cysteine
       UUA Leucine         UCA Serine      UAA Stop            UGA Stop
       UUG Leucine         UCG Serine      UAG Stop            UGG Tryptophan
C      CUU Leucine         CCU Proline     CAU Histidine       CGU Arginine
       CUC Leucine         CCC Proline     CAC Histidine       CGC Arginine
       CUA Leucine         CCA Proline     CAA Glutamine       CGA Arginine
       CUG Leucine         CCG Proline     CAG Glutamine       CGG Arginine
A      AUU Isoleucine      ACU Threonine   AAU Asparagine      AGU Serine
       AUC Isoleucine      ACC Threonine   AAC Asparagine      AGC Serine
       AUA Isoleucine      ACA Threonine   AAA Lysine          AGA Arginine
       AUG Methionine      ACG Threonine   AAG Lysine          AGG Arginine
G      GUU Valine          GCU Alanine     GAU Aspartic acid   GGU Glycine
       GUC Valine          GCC Alanine     GAC Aspartic acid   GGC Glycine
       GUA Valine          GCA Alanine     GAA Glutamic acid   GGA Glycine
       GUG Valine          GCG Alanine     GAG Glutamic acid   GGG Glycine

Figure 2.2. Genetic Code. A codon is represented by a set of three nucleotides. Each codon codes for an amino acid.

Genome Annotation Genome annotation refers to locating genes in a sequence and then giving biological meaning to those regions. Genome annotation is often classified into functional and structural annotation. Functional annotation refers to deciphering the function of a gene, such as how a gene is expressed. Structural annotation identifies characteristics of a gene, such as where a coding region is located.

Sequence Alignment Sequence alignment is the comparison of two or more sequences to one another. The goal of sequence alignment is to find similarities between or among the sequences. There are three types of sequence alignment: global, local, and semiglobal alignment. Global alignment involves finding the best alignment of two sequences using every nucleotide or amino acid in each sequence, while local alignment concentrates on aligning shorter fragments in highly conserved regions. The Needleman-Wunsch algorithm is typically used for global alignment, and the Smith-Waterman algorithm is used for local alignment. Figure 2.3 shows example global and local alignments for two sequences. Semiglobal alignment aims to align two sequences such that only one sequence needs to be used in its entirety while only part of the other sequence is aligned. Aligning more than two sequences together is called multiple sequence alignment.

Input Sequences
CAATCAGATCTCAT
CAATGATCAT

Global Alignment
CAATCAGATCTCAT
|||| |    ||||
CAATGA----TCAT

Local Alignment
CAATCAGATC
||||  ||||
CAAT--GATC

Figure 2.3. Global and Local Sequence Alignment. Here, we show examples of global and local alignment of the same two sequences. Global alignment aims to match best over the entire sequence, which may result in the insertion of multiple gaps, as shown here. Additionally, both sequences are found in their entirety. While global alignment produces the best alignment that utilizes all nucleotides in each sequence, local alignment does not necessarily utilize all nucleotides and uses shorter segments. Local alignment produces a different alignment because it preferentially aligns shorter fragments.

Sequencing Sequencing is the process used to find the order of nucleotides in DNA. A common method to perform this is known as Sanger sequencing [125]. Another common technique is to perform shotgun sequencing on the Sanger sequencing results [102, 125]. In shotgun sequencing, a sequence is divided into many small, random fragments and then sequenced using the chain termination method. This method was used in the sequencing of the human genome [145]. High-throughput techniques, such as 454, Illumina, and SOLiD sequencing, are now the most commonly used DNA sequencing methods [96].

Genome Assembly Genome assembly refers to the process of assembling many short DNA sequences together to form the original chromosome(s) they once composed. The short sequences are often generated by shotgun sequencing.

Gene Expression Translating information from nucleotide DNA sequence into protein or RNA is referred to as gene expression. This technique helps elucidate the function of such a sequence and is reflected phenotypically, meaning the effect is observable in the organism.
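The difference between the global and local alignment described under Sequence Alignment can be made concrete with a small dynamic programming sketch. This is a hedged illustration only: the function name `alignment_score` and the scoring values (match +1, mismatch -1, linear gap penalty -1) are assumptions chosen for simplicity and are not the parameters of any tool used in this dissertation. With `local=False` the recurrence is that of Needleman-Wunsch; with `local=True` it is that of Smith-Waterman. Only the optimal score is returned, without a traceback.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Optimal alignment score of a and b under a simple linear gap penalty.

    local=False: global alignment (Needleman-Wunsch recurrence).
    local=True:  local alignment (Smith-Waterman recurrence).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # Global alignment must pay for leading gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so scores never drop below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell

    # Global: score of aligning both full sequences. Local: best cell anywhere.
    return best if local else score[-1][-1]


if __name__ == "__main__":
    seq1, seq2 = "CAATCAGATCTCAT", "CAATGATCAT"  # the sequences from Figure 2.3
    print("global:", alignment_score(seq1, seq2))
    print("local: ", alignment_score(seq1, seq2, local=True))
```

The two modes differ only in the boundary conditions and in the floor at zero, which is why local alignment is free to ignore poorly matching ends while global alignment must account for every residue.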

2.3.1 VectorBase

VectorBase [91, 92] is a bioinformatics resource that serves as a web-based facilitator to a wealth of information and tools pertaining to invertebrate vectors of human pathogens. At VectorBase, researchers can, among other things, browse genomic data, contribute to community annotation of a genome, run bioinformatics tools such as BLAST, ClustalW, or HMMER, and obtain relevant information about the vector organisms. VectorBase is an NIAID Bioinformatics Resource Center (BRC) and serves as a facilitator and motivator for much of the work in this dissertation.

2.4 Transposable Elements

Transposable elements (TEs) are a type of repetitive sequence that has been found in nearly all eukaryotic (nucleus-containing) genomes. First discovered and analyzed by McClintock in the 1950s [98], TEs have the ability to move about and replicate within a genome. Due to their mobile and replicative nature, TEs often occupy large portions of genomes. TEs are estimated to represent 47% of the genome of the yellow fever mosquito, Aedes aegypti [105], 35% of the genome of the frog Xenopus tropicalis [61], and 45% of the human genome, Homo sapiens [65]. This prevalence of TEs poses a major difficulty in sequence assembly, as repeat regions are prone to misassembly [101, 111].

TEs can impact host genomes in a number of ways. They are believed to play a major role in genome evolution [82, 100, 130], as they can insert themselves into, mutate, and move genes, thereby influencing gene expression. In turn, TEs can cause gene variation and transfer genetic material [12, 28, 129, 138].

The process by which TEs move about a genome is called transposition. TEs are classified according to their transposition mechanism into Class I and Class II elements. Class I TEs, or retrotransposons, transpose through an RNA intermediate. They are transcribed to RNA and then reverse transcribed back into DNA by a reverse transcriptase enzyme, typically encoded by the TE itself, the so-called “copy-and-paste” mechanism. The presence or absence of long terminal repeats (LTRs) further classifies Class I TEs into non-LTR and LTR elements. Class II TEs, or transposons, are DNA-mediated and transpose through the use of a transposase enzyme. Class II TEs are typically bounded on each end by terminal inverted repeats (TIRs), which flank the element and serve as the recognition sequence for the transposase. The transposase follows a “cut-and-paste” mechanism, as it cuts out the TE from the host DNA and allows it to insert at a new site in the host DNA. Many TEs have preferential insertion sites, and the method by which TEs move about genomes often produces artifacts flanking the TEs, called target site duplications (TSDs). Both Class I and II TEs are further divided into families, each with distinguishing characteristics. We follow the classification scheme described by Tu [139], summarized in Figure 2.5.

Region     TSD    TIR           UTR         TRANSPOSASE         UTR         TIR           TSD
Sequence   TA     ACGC...GTAA   TAC...CAT   GTACAGC...AATTACG   GAT...GAT   TTAC...GCGT   TA
Length     2 bp   20-30 bp      ~100 bp     ~900 bp             ~100 bp     20-30 bp      2 bp

Figure 2.4. Typical mariner Class II Transposon Structure. Mariner transposons are characterized by a single transposase flanked by terminal inverted repeats (TIRs) of 20-30 base pairs (bp) and a preferential target site duplication (TSD). There are generally about 100 bp of untranslated region (UTR) flanking the transposase on each side as well.
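Terminal inverted repeats lend themselves to a simple computational check: the start of the element should match the reverse complement of its end. The sketch below is a minimal illustration under stated assumptions and is not the detection logic used by TESeeker or by our MITE script. The names `reverse_complement` and `tir_length` are hypothetical, only perfect (mismatch-free) repeats are counted, and the example sequence is invented.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def tir_length(element, max_len=30):
    """Length of the perfect terminal inverted repeat at the ends of element.

    Position k from the 5' end is compared with position k of the reverse
    complement, which corresponds to the complement of position k from the
    3' end. Counting stops at the first mismatch.
    """
    element = element.upper()
    rc = reverse_complement(element)
    limit = min(max_len, len(element) // 2)
    length = 0
    for k in range(limit):
        if element[k] != rc[k]:
            break
        length += 1
    return length


if __name__ == "__main__":
    left_tir = "ACGCGTAA"                       # invented 8 bp repeat
    interior = "GGGGGGGGGG"                     # stand-in for the coding region
    element = left_tir + interior + reverse_complement(left_tir)
    print(reverse_complement(left_tir))         # TTACGCGT
    print(tir_length(element))                  # 8
```

Real TIRs are often imperfect, so a practical detector would tolerate a small number of mismatches or use a local alignment between the two ends rather than an exact comparison.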

[Figure 2.5 schematic:
Class I TEs
  LTR Retrotransposon, 5-9 kb: LTR, ORF (gag, PR, IN, RT, RH), LTR
  non-LTR Retrotransposon (LINE), 5-8 kb: ORF I (gag), ORF II (pol: APE, RT, RH), (GAA)n
  SINE, < 500 bp: promoters, tRNA-related region, non-coding region, (GAA)n
Class II TEs
  "cut-and-paste" DNA Transposon, typically under 5 kb: TSD, TIR, ORF, TIR, TSD
  MITE, typically under 500 bp: TSD, TIR, TIR, TSD
  Helitron, around 7-9 kb: Helicase (A ... T)]

Figure 2.5. TE Classification Scheme and Structures. The figure above, adapted from [139], shows the typical division of TE classes, as well as major families within each. PR, IN, RT, RH, and APE refer to enzymes. TSD refers to target site duplications and ORF refers to open reading frame. ORFs are stretches of coding sequence uninterrupted by a stop codon. This figure is not drawn to scale.

Class I RNA-mediated retrotransposons are divided into several main families, which we next describe.

LTR Retrotransposons– Members of the LTR retrotransposon group are typically 5-9 kb (kilo base pairs) long and have a 200 to 500 bp (base pairs) LTR on both ends. LTR retrotransposons encode polymerase (pol) and group-specific antigen (gag) proteins [22, 139]. Example families of LTR retrotransposons include Ty1/copia, Ty3/gypsy, and BEL.

Non-LTR Retrotransposons– The non-LTR retrotransposons include long interspersed nuclear elements (LINEs). LINEs are generally 5-8 kb long with open reading frames (ORFs) that contain the coding sequence necessary for transposition [22, 129, 139]. Non-LTRs also have a pol region. L1, L2, R1, Jockey, and Penelope are several example non-LTR retrotransposons.

SINEs– SINEs, or short interspersed nuclear elements, are similar to LINEs, but much shorter, with a typical length of under 500 bp [139]. SINEs do not code for proteins and instead use transposition mechanisms from other retrotransposons.

Class II Transposons are mediated by DNA transposition. Transposons cut and paste themselves within genomes without the use of an RNA intermediate. Descriptions of several families of transposons follow.

mariner– The mariner transposons are characterized by a DDD motif and generally contain a single-exon transposase. A motif is a short, conserved sequence pattern of amino acids; in this case, the three conserved residues are all aspartic acids, each represented by a ‘D.’ Mariners are widespread in invertebrates and are well characterized. We show an annotated mariner transposon in Appendix E.1.3.1.

P– The P element transposons were first discovered in the fruit fly Drosophila melanogaster genome and were the first transposons to be used as a gene vector [73, 122]. Since the initial discovery of P elements, they have been found in several other species, most notably the malaria mosquito Anopheles gambiae [126].

piggyBac– The piggyBac family of transposons was first discovered in the cabbage looper moth Trichoplusia ni [23]. This transposon has since been found in a number of organisms and has proven effective as a gene vector for a variety of organisms, including Aedes aegypti [95].

Tc1 – Like all transposons, Tc1 transposons are flanked by inverted repeat sequences. They are valuable gene vectors, extensively used in the analysis of Caenorhabditis elegans [110]. Tc1 transposons are typically recognized by their DDE motif.

pogo– These transposable elements are members of the Tc1 superfamily. They are very similar to Tc1 transposons, but lack the DDE catalytic motif.

MITEs– MITEs, or miniature inverted-repeat transposable elements, are small elements often under 500 bp. Their mechanism of transposition is not entirely known and they do not encode a protein.

Helitrons– Helitrons replicate using a rolling-circle mechanism and encode helicase-like proteins.

2.5 Transposable Element Identification

There are several difficulties with TE identification. TEs do not adhere to a universal structure; instead, some families of TEs follow specific structures. An example would be the TIR-transposase-TIR general structure of a Class II transposon, such as in the mariner element. Additionally, the structure of TEs can mutate over time. For example, TEs may preferentially insert themselves in similar regions of the genome, or even within one another, leading to many nested and fragmented copies. While autonomous, or active, TEs possess intact reading frames which serve as mechanisms for transposition, the majority of TEs are non-autonomous, or not active. Non-autonomous TEs can often still be transposed, using the transcription machinery of other elements in their class, but they cannot transpose themselves. For these reasons, a general approach cannot be used to identify all TEs. Instead, several approaches are used with varying levels of effectiveness. The automatic computational identification of TEs is not as robust or mature as analogous methods currently used for genes [115].

Bergman and Quesneville [13] describe many TE discovery methods and classify existing TE discovery techniques into de novo, structure-based, comparative genomic, and homology-based. Saha et al. and Lerat more recently reviewed approaches to identify TEs [93, 124] and classify identification techniques into similar groups: ab initio, signature-based, and library-based techniques. We next describe the approaches according to the Bergman and Quesneville classification.

2.5.1 De novo Discovery

De novo TE discovery approaches look for similar sequences found at multiple positions within a genome. Once identified, the sequences are typically clustered, filtered, and characterized. While computationally expensive, this approach can identify novel TEs and is most effective in discovering TEs with high prevalence within a genome. De novo techniques are not as effective in identifying degraded TEs, meaning TEs with mutated or incomplete structure. Example de novo tools include PILER [40] and RECON [11].
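As a minimal illustration of the core idea behind de novo discovery (finding sequences that recur at multiple positions), the sketch below indexes every k-mer in a sequence and reports those seen at least a given number of times. It is not the algorithm used by PILER or RECON; the function name, the k-mer length, and the copy-number cutoff are assumptions chosen only for the example.

```python
from collections import defaultdict


def repeated_kmers(genome, k=12, min_copies=3):
    """Return {k-mer: [start positions]} for k-mers seen at least min_copies times.

    Hypothetical helper for illustration; real de novo tools also cluster,
    filter, and characterize the candidate repeats.
    """
    positions = defaultdict(list)
    for i in range(len(genome) - k + 1):
        positions[genome[i:i + k]].append(i)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) >= min_copies}


# Toy sequence: the unit "ACGTTGCA" is planted three times among filler bases.
toy = "ACGTTGCA" + "GGG" + "ACGTTGCA" + "CCCCC" + "ACGTTGCA"
repeats = repeated_kmers(toy, k=8, min_copies=3)
```

Exact k-mer matching also shows why degraded TEs are hard for this family of methods: a single mutation inside a copy removes it from the count.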

2.5.2 Structure-based Discovery

Structure-based approaches, such as LTR_STRUC [97], typically work well to identify complete TEs that conform to a defined and conserved structure. In this case, LTR_STRUC is effective at finding retrotransposons with LTRs at each end of the element. Structure-based methods are less useful when searching for degraded TEs or for TEs without a conserved structural characteristic.
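To make the notion of a structural signature concrete, the following sketch checks a candidate element for terminal inverted repeats, the hallmark of many Class II transposons described in Section 2.4. It is an illustrative assumption, not the method of any tool cited here: the function names, the 30 bp search limit, and the mismatch tolerance are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def tir_length(element, max_len=30, max_mismatches=0):
    """Longest n <= max_len such that the first n bases match the reverse
    complement of the last n bases, allowing max_mismatches differences."""
    best = 0
    limit = min(max_len, len(element) // 2)
    for n in range(1, limit + 1):
        left = element[:n]
        right = reverse_complement(element[-n:])
        mismatches = sum(a != b for a, b in zip(left, right))
        if mismatches <= max_mismatches:
            best = n
    return best


# Toy element: an 8 bp TIR, an internal region, and the inverted copy of the TIR.
tir = "ACGCTTGA"
element = tir + "GGGGATCCCCTTTT" + reverse_complement(tir)
```

A degraded copy whose ends have mutated would report a short or zero-length TIR here, which mirrors why structure-based methods are less useful for degraded TEs.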

2.5.3 Comparative Genomic Methods

A comparative genomic discovery method described by Caspi and Pachter [24] uses multiple sequence alignments of closely related genomes to detect large changes between the genomes. The idea is that differences in the genomes, called insertion regions, could be TEs or caused by TEs. Such differences are analyzed and classified. This approach is useful when related genomes are readily available and can identify new families of TEs. Common ancestral TEs will likely not be identified by this approach.

2.5.4 Homology-based Discovery

Homology-based approaches utilize known TEs as a means to discover similar TEs in genomes. This is typically done by manually using alignment programs, such as BLAST [1], to align known TEs to the genome in question and then carefully analyzing the results. Biedler and Tu [14] reference a suite of homology-based TE-related programs to identify and characterize TEs, and Quesneville et al. offer the BLASTER suite of tools [116] to detect TEs. Although there are few homology-based tools, and although they struggle to identify TEs unrelated to known elements, they are normally the most accurate in identifying known TEs, as well as in detecting degraded TEs. Existing homology-based approaches also sometimes utilize hidden Markov models (HMMs) [2]. Such approaches are effective for closely related genomes, but struggle with distantly related species, as the models tend to capture more irrelevant data when searching for diverse sequences. Additionally, the homology-based approaches currently available are the fewest in number and the least automated. Moreover, many are not geared to output high-quality consensus sequences. For these reasons, our automated approach to identify TEs, described in Chapter 3, is homology-based.

2.6 Annotation

The annotation of genomic data refers to giving meaning to such data, namely describing the functions served by specific regions of DNA. Annotation is most often applied to finding the structure and function of genes, a process that can be very time-consuming in the lab. Several computational tools are commonly used in the automatic annotation of genomes. These tools can be categorized as structure- or homology-based and are often used in conjunction with other tools. Genscan [20] is a gene identification tool that utilizes the structure of introns (non-coding regions) and exons (coding regions) in its computation. GeneWise [15] is another tool, based on protein sequence similarity, that has been shown to be very effective in locating known genes [16]. These tools often utilize hidden Markov models (HMMs) [38, 39, 102]. Ensembl’s automatic gene annotation system is one of the better-known annotation systems; however, it is far from the “gold standard” of annotation [30]. Such a standard is labor-intensive and takes years to complete. Genes vary in many ways, making their automatic annotation difficult. Kohany et al. [84] note that while automatic annotation has the potential for very high throughput and is not susceptible to user error or bias, it is difficult to reconstruct sequences, particularly TEs (largely due to their fragmentation), using automated techniques. The VectorBase community annotation pipeline eases many of these concerns.

Gene experts have the ability to annotate genes through the community pipeline, as well as link their scientific publications to their efforts. Their work is visible to the world through VectorBase, and these experts are publicly recognized. VectorBase’s current system utilizes the DAS, Ensembl, Chado, and Hibernate technologies, each of which is used in our design and implementation plan to annotate TEs, described in Chapter 4. We elaborate upon each technology below.

2.6.1 DAS

DAS, or distributed annotation system [35, 68], is the protocol used to dynamically display data in the VectorBase Ensembl genome browser. DAS utilizes a client-server setup through which a client interacts with multiple servers. Its main advantage is that the annotation information can be located in multiple databases and on multiple servers, but can be gathered and used by a single client (VectorBase).

2.6.2 Ensembl

Ensembl exists with the goal to produce and maintain automatic annotations on selected eukaryotic genomes [44]. Ensembl regularly increases supported content; as of release 59, Ensembl supports 56 species [45]. For supported organisms, Ensembl offers detailed information. Researchers can visually browse to locations within a genome and then view detailed information, such as for a gene. The annotation information is displayed through the various views of Ensembl’s rich genome browser. The location-based view shows details for regions of the genome. The gene-based view shows detailed gene information. Transcript information is also available in the transcript-view. This rich genomic annotation data is produced through DAS tracks. Figures 2.6, 2.7, and 2.8 show the VectorBase implementation of each of these views.

2.6.2.1 Ensembl Genebuild

Ensembl uses an automated pipeline to annotate genes in newly sequenced genomes [31]. Their pipeline is largely homology-based and heavily relies on protein alignments with GeneWise [15]. A major advantage of GeneWise is that it allows for frameshifts in the coding regions. Frameshifts are regions where the reading frame has changed due to an insertion or deletion. Other alignment tools such as exonerate are also utilized. Protein sources are obtained from UniProt [141] and RefSeq [113]. A detailed description of the Ensembl pipeline is available in Curwen et al. [31].

Figure 2.6. Ensembl Location-based View on VectorBase. Here, we show the VectorBase implementation of Ensembl’s location-based view on the 2L chromosome of the Anopheles gambiae genome.

Figure 2.7. Ensembl Gene-based View on VectorBase. The figure above shows the VectorBase implementation of Ensembl’s gene-based view, showing gene AGAP003395 in Anopheles gambiae.

Figure 2.8. Ensembl Transcript-based View on VectorBase. The VectorBase implementation of Ensembl’s transcript-based view for Anopheles gambiae gene AGAP003395 is shown above.

2.6.3 Chado

Chado is a part of the GMOD (generic model organism database) [53] project. GMOD is a collection of open source software tools for creating and managing genome-scale biological databases. Chado is a modular schema that is designed to allow the addition of new modules for new data types. VectorBase uses Chado to host its complex biological data, currently stored in more than 250 tables. TEs can have a more complicated structure than “regular” genes, and the existing Chado schema is complex and abstract enough to be able to accommodate them. There is limited built-in TE-specific support within Chado; in fact, GMOD encourages adapting TEs to Chado [27]. This prompted us to adapt Chado to our specific needs, as outlined in Section 4.2.3.

2.6.4 Hibernate

Hibernate [62] is a library for Java that provides a framework for mapping Java classes to database tables using XML. This allows for the automatic updating, retrieval, creation, and deletion of object data. The use of Hibernate eliminates the need to write SQL for such operations.

2.6.5 VectorBase Community Annotation Pipeline

VectorBase has implemented a community annotation pipeline (CAP) for genes, which has the ability to accept four types of annotation information:

1. Gene Models: Users can submit gene sequence to be incorporated into gene builds.

2. Publications: Sequence data can be linked to the literature.

3. Controlled Vocabulary Terms: Controlled vocabulary terms can be associated with genes.

4. Comments: General comments are available for publication.

Figure 2.9. VectorBase Gene Submission Form. The figure above shows a portion of the spreadsheet researchers populate with gene data to submit to VectorBase.

This data is submitted by full-time annotators and community researchers. Data is collected through spreadsheets, a portion of one of which is shown in Figure 2.9. The spreadsheets are uploaded through the VectorBase website and the genome-specific curator is notified of the submission. If approved, the data is aligned to the genome with exonerate and incorporated into VectorBase, making it available to the community. Once the data is submitted, links are available for researchers to associate publications or controlled vocabulary terms with the data, as well as to add comments. Additionally, community-submitted genomic data is displayed through Ensembl’s browser on VectorBase via DAS. Community annotation data on VectorBase flows according to the schematic shown in Figure 2.10, adapted from Bruggner [19], and also briefly described in Butler [21]. Many technologies are utilized, most of which have been described previously. VectorBase uses PostgreSQL databases [112], which communicate with DAS through Hibernate. SOAP [131], an XML-based protocol for exchanging messages, interfaces with Hibernate and the VectorBase web pages.

[Figure 2.10 schematic; components labeled in the figure: PostgreSQL; Manual Annotation Database; Apache Tomcat & Axis; Java Hibernate Chado API; FeatureSubmission Functions; FeatureSubmissionQuery Functions; FeatureSubmission Web Service Interface; FeatureSubmissionQuery Web Service Interface; DAS Server; SOAP; Apache HTTP Server & PHP; SubmissionForm.xls; Community Annotation Submission & Review Interface; Community Genomic Annotation; Genome Browser Gene View; Gene DAS Report; Genome Browser Contig View]

Figure 2.10. VectorBase Community Annotation Pipeline Data Flow. We show the flow of information on VectorBase for community gene annotation, adapted from Bruggner [19].

2.6.5.1 Planned Updates to the VectorBase Community Annotation Pipeline

VectorBase has been heavily updated since the initial implementation of CAP. At the time of this work, the original VectorBase CAP has not been restored to working order. In the long term, and particularly with VectorBase 2.0, CAP is likely to evolve further. Potential new features to CAP include an improved interface for the submission and presentation of data, additional means to submit data, as well as the ability to search for community submitted data. Additional plans for the new CAP are currently under development and should eventually include TEs.

2.7 Transposable Element Annotation

There has been much work done on the annotation of genes [16, 20, 30, 84, 87], but little on the annotation of TEs. TEfam [134], Repbase [119], and WikiPoson [148] currently serve as the best online TE resources. TEfam currently caters to TEs from the Anopheles gambiae, Aedes aegypti, and Culex quinquefasciatus mosquito species, as well as the Ixodes scapularis tick. TEs from the body louse, Pediculus humanus humanus, are available internally, with plans to make them public in early 2011. On TEfam, users can submit and view representative sequences. There is no structural display provided, but the TEs are well-annotated. The submission process is regulated through user accounts. Meanwhile, Repbase has a large database of consensus TEs from many taxa. Registered users may export TEs in the EMBL, FASTA, or IG file format and can submit properly formatted TEs by emailing the editor-in-chief at the site. WikiPoson is a newer site that has various information about TEs, including descriptions of how TEs are classified, descriptions of many elements, and TE-related news. Stand-alone tools, such as Apollo [94], also exist, but they largely rely on biological expertise and their data are often not available online.

2.7.1 VisualRepbase

VisualRepbase [136] offers an interface for the study of TEs available on Repbase. VisualRepbase is available as a downloadable Java archive and has several important features. First, researchers can search for TEs by family and genome and download sequences in the FASTA format. Second, VisualRepbase can display the location and orientation of its TEs on the selected genome, as well as any available annotation data. Third, VisualRepbase can also display the location of any properly formatted annotation data available for a given genome. Lastly, VisualRepbase lists the number of occurrences by chromosome for selected TEs.

This mapping was performed with Censor [72] and is stored in a table format that includes coordinates within the genome. Figure 2.11 shows the distribution of the MARINERN3_AG transposon on the 2R chromosome of Anopheles gambiae. While useful, there are drawbacks to VisualRepbase. The visual display of TEs lacks structural information. While the sequence orientation is shown, structural TE features such as TIRs are not shown. VisualRepbase is most useful for showing the distribution of TEs within a genome, as well as their proximity to any previously annotated data. Rich genome browsing features, such as those available in Ensembl’s Genome Browser, would increase the utility of VisualRepbase. Additionally, while VisualRepbase allows for the entry of new sequences, it appears that users must also submit the associated annotation information, including genomic coordinates, which is not always straightforward.

Figure 2.11. VisualRepbase Interface. The figure above shows the distribution of the MARINERN3_AG transposon on the 2R chromosome of Anopheles gambiae. As shown in the figure, there are 109 occurrences on the chromosome with at least 50% identity to the consensus.

2.8 Summary

This chapter has presented background information necessary for an understanding of Chapters 3 and 4. In particular, we have discussed basic molecular biology and introduced several bioinformatics research areas, including annotation. We have described TEs and techniques utilized to detect them within genomes. Lastly, we have described TE annotation techniques. Chapter 3 uses the information this chapter has presented to describe an approach to identify TEs and Chapter 4 describes a plan for the annotation of TEs on VectorBase.

CHAPTER 3

AUTOMATED HOMOLOGY-BASED APPROACH FOR THE IDENTIFICATION OF TRANSPOSABLE ELEMENTS1

3.1 Introduction

The identification of TEs is an important part of every genome project. Unfortunately, the automatic identification of TEs in novel genomes is far from mature. In particular, there is a lack of automated homology-based approaches that produce high-quality consensus TEs [13]. As the number of sequenced genomes has rapidly risen, the identification of TEs has received greater attention from the scientific community. The ability to identify TEs automatically and effectively in a manner similar to the methods used for genes is also of increasing importance. There exist many difficulties in identifying TEs, including their tendency to degrade over time and the fact that many do not adhere to a conserved structure. In this chapter, we describe an easy-to-use, automated homology-based approach to discover high-quality putative TEs. We apply this approach to recently sequenced arthropod genomes and identify consensus TEs up to 98% identical to manually annotated TEs. The implementation of our approach, TESeeker, is available for download as a virtual appliance.

1 Results of this chapter have appeared in the TE sections of Arensburger et al. [5] and Kirkness et al. [83]. Our approach, with selected results, is under review in Kennedy et al. [80].

3.2 Approach for Identification of TEs

Our approach targets the identification of TEs with homology (similarity) to known TEs. While our approach has many potential applications, we focus on characterizing TEs in novel genomes. The approach utilizes a diverse library of representative TEs as a basis for BLAST searches against the genome in question. The hits then undergo multiple iterations of processing before we produce a high-quality consensus TE. This section introduces and describes our approach.

3.2.1 Dependencies

This approach relies upon several notable bioinformatics tools, described below.

3.2.1.1 Library of Representative Sequences

Our modular homology-based approach relies on a thorough and high-quality library of representative TEs, organized by family. When strong information is available, amino acid coding regions (reverse transcriptases for Class I TEs and transposases for Class II TEs) are the preferred components of the library. Nucleotide sequences can also be used, but such sequences do not allow for as much nucleotide variance during the search. Sequences for our library were chosen manually from TEfam [134], NCBI [104], Repbase [71], and the literature. LTR reverse transcriptases within the representative library were chosen with the assistance of Jose Manuel C. Tubío [140]. Sequences with complete amino acid coding regions were preferentially chosen, and a wide variety of related sequences was assembled for each family. Currently, the library consists of 475 representative coding regions from a variety of organisms, covering the major TE families. Further details on the provided library are available within the FASTA files and online [137]. Because the library consists of sequences in the FASTA format, researchers can easily modify the library or create their own library for use in the approach.

3.2.1.2 BLAST

We utilize BLAST [17, 102], or basic local alignment search tool, to perform sequence similarity searches. BLAST is one of the most widely used algorithms in bioinformatics and works well for our purposes.

3.2.1.3 DNASTAR SeqMan II

While not used in the automated approach, DNASTAR SeqMan II [33] played a central role in the development of this approach. DNASTAR SeqMan II works similarly to an assembler in that it allows one to set various parameters, such as match size, minimum match percentage, or minimum sequence length, to produce contigs, which are consensus sequences generated from multiple sequences. We utilized DNASTAR SeqMan II extensively in early versions of this approach.

3.2.1.4 CAP3

CAP3 [64] is a popular and mature sequence assembly tool. In most cases, CAP3 produces better quality contigs than the phrap assembler [57]. Its ability to clip low-quality regions of the input sequences is an added benefit.

3.2.1.5 ClustalW2

ClustalW2 [89] is the newest version of the most widely used multiple sequence alignment program. ClustalW2 offers a balance of speed and accuracy, while also supporting the ability to produce phylogenetic trees.

3.2.1.6 BioPerl

We use Perl, or Practical Extraction and Report Language, and an extension of Perl called BioPerl [133] to perform the majority of our sequence analysis. Perl is an interpreted scripting language that lends itself well to the bioinformatics field, largely because of its parsing capabilities. BioPerl has many of the common bioinformatics applications built into modules, which makes it very powerful.

3.2.2 General Description of Approach

Our approach varies slightly depending on whether the representative TEs are amino acid or nucleotide sequences. The main difference is that amino acid searches require only a translated nucleotide genome search, tblastn, while nucleotide sequences require translation of both themselves and the host genome, tblastx. We next describe the approach that starts with an amino acid library of TEs, shown graphically in Figure 3.1. A walkthrough of our approach for the mariner TE in P. humanus humanus is described in Appendix A.

The approach begins with BLAST searches against the genome using representative TEs for the chosen family. Resulting BLAST hits are combined if they overlap or are very close together, and are then extracted from the genome. We next assemble the hits with CAP3 in an attempt to get a viable representation of the coding sequence. We use the CAP3 results to do another BLAST search against the genome and process the hits in the same manner. However, when extracting the sequences from the genome, we add flanking regions. The length of the flanking region is dependent on the type of TE and is chosen to enable us to capture the entire TE. These extracted results are then aligned and a consensus is generated. We use the consensus to perform a final BLAST search, again combining, extracting, and assembling the sequences. CAP3 then produces the high-quality, full-length consensus TE. We next describe the approach in more detail.

[Figure 3.1 schematic; labels recovered from the figure: TE Library, Genome, BLAST, combine, CAP3, trim, ClustalW2, generate consensus, Consensus TE]

Figure 3.1. Approach Schematic. The approach is composed of multiple, iterative steps. The general flow is as follows. A TE family is used in a BLAST search against the genome. Hits are then combined, extracted from the genome, and assembled with CAP3. Next, the sequences are trimmed and again used in a BLAST search against the genome. The results are then used to produce a multiple sequence alignment in ClustalW2. We generate a consensus from the alignment, and then perform a final BLAST search against the genome. We again combine, extract, and assemble with CAP3. Finally, the consensus TE is generated from CAP3.
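The choice of search program described above (tblastn for an amino acid library, tblastx for a nucleotide library) can be sketched as a small helper that builds the command line. This is only an illustration: it assumes the NCBI BLAST+ command-line programs and their `-query`, `-db`, `-evalue`, and `-outfmt` options, whereas the text does not state which BLAST distribution TESeeker invokes. The function name and the file and database names are hypothetical.

```python
# Map the library type to the BLAST program named in the text.
BLAST_PROGRAM = {"protein": "tblastn", "nucleotide": "tblastx"}


def blast_command(library_fasta, genome_db, library_type, max_evalue=1e-20):
    """Build (but do not run) a BLAST+ style command for the first search.

    library_type is "protein" or "nucleotide"; max_evalue mirrors the
    1E-20 cutoff used for the coding-region search.
    """
    try:
        program = BLAST_PROGRAM[library_type]
    except KeyError:
        raise ValueError(f"unknown library type: {library_type!r}") from None
    return [program, "-query", library_fasta, "-db", genome_db,
            "-evalue", str(max_evalue), "-outfmt", "6"]


# Hypothetical file and database names, for illustration only.
cmd = blast_command("mariner_library.fasta", "genome_db", "protein")
```

The returned list could be passed to `subprocess.run` on a machine where BLAST+ is installed; tabular output (`-outfmt 6`) is chosen here only because it is easy to parse.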

3.2.2.1 Identify Coding Region

The coding region is generally the most conserved region across TEs within a genome, as it must be complete to produce a functional protein. We begin with local sequence alignments using BLAST. Nucleotide-based blastn searches are not as effective in identifying TEs and are not used; the nucleotide sequence for a given TE may vary considerably, while the translated amino acid sequence is more likely to be conserved. Instead, tblastn searches are used to identify the coding region. BLAST produces a set of hits for each TE query against the genome, and we consider hits with an expectation value (e-value) less than 1E-20 for our approach. Lower e-values indicate more significant hits. This cutoff was determined from our empirical data to limit the hits to the most probable TEs while also eliminating most false positives, and it can be manually adjusted.

Due to slight sequence variations, BLAST results are often rich in short, nearly-adjacent hits. We process BLAST results such that all hits are combined if they are within a specified distance of one another, 50 bp by default, and originate from the same query sequence. Hits with overlapping coordinates are combined as well. These combinations increase the quality of our hits and the potential to capture more complete sequences. In the case where there is a gap between sequences, we also include the intervening sequence data in our hit. Figure 3.2 shows combination scenarios. Once all possible combinations are performed, hits are extracted from the genome.

[Figure 3.2 schematic: five panels, each showing hit sequences A and B consolidated into a single sequence C]

Figure 3.2. Methods of Combination. This figure shows the five general combination scenarios used in our scripts. In each case, hit sequences A and B are consolidated into a single sequence C, which represents a section of nucleotides from the genome. We have shown combinations of overlaps, nested sequences, and sequences separated by a short, prespecified distance.

At this point, we have a set of possible coding sequences, both complete and partial, many of which are copies or partial copies of one another. To consolidate and improve our results, we assemble the sequences with the CAP3 assembly program [64]. CAP3 produces contigs and singleton sequences. Singleton sequences are sequences that did not assemble with other sequences. CAP3 also generates accompanying quality scores for the contigs, based upon the underlying sequences that produced the consensus. We use the quality scores to trim the sequences such that the highest quality sequence remains. To do this, we iterate through a contig, keeping track of the cumulative sum of quality scores for a given number of consecutive nucleotides, called the sliding window, which is 20 bp by default. When the average quality score per nucleotide in this sliding window exceeds a threshold, typically 18, we consider the corresponding sequence to be high quality. If the average value drops below the threshold, the sequence is ignored. Once we have read the entire sequence, there will likely be gaps in the sequence where there is little commonality. In these cases, we only keep the low-quality regions if they are of short length and have adjacent high-quality sequences. These results are then reassembled in CAP3, trimmed, and considered the best potential complete coding region.

In the case that CAP3 produces only singletons, we perform the aforementioned analysis with them. We then aim to extend the sequence to encompass the entire TE. Pseudocode for the steps described in this section of our approach is shown in Algorithm 1.
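The sliding-window trimming described above can be sketched as follows. This is a minimal interpretation under stated assumptions rather than the TESeeker code: ties with the threshold are treated as high quality, a low-quality run is kept only if it is at most `max_gap` bases long and flanked by high-quality sequence on both sides (the text says only "short"), and the longest retained stretch is returned. The function names and `max_gap` are hypothetical.

```python
def high_quality_mask(quals, window=20, threshold=18):
    """Flag every base covered by at least one window whose mean quality
    reaches the threshold (ties count as high quality: an assumption)."""
    n = len(quals)
    mask = [False] * n
    if n < window:
        return mask
    running = sum(quals[:window])          # cumulative sum for the first window
    for start in range(n - window + 1):
        if start > 0:                      # slide the window by one base
            running += quals[start + window - 1] - quals[start - 1]
        if running / window >= threshold:
            mask[start:start + window] = [True] * window
    return mask


def trim_contig(seq, quals, window=20, threshold=18, max_gap=10):
    """Return the longest stretch of seq judged high quality."""
    mask = high_quality_mask(quals, window, threshold)
    n = len(mask)
    i = 0
    while i < n:                           # bridge short internal low-quality runs
        if mask[i]:
            i += 1
            continue
        j = i
        while j < n and not mask[j]:
            j += 1
        if i > 0 and j < n and j - i <= max_gap:
            mask[i:j] = [True] * (j - i)
        i = j
    best_start, best_len, run_start = 0, 0, None
    for idx, keep in enumerate(mask + [False]):   # sentinel closes the last run
        if keep and run_start is None:
            run_start = idx
        elif not keep and run_start is not None:
            if idx - run_start > best_len:
                best_start, best_len = run_start, idx - run_start
            run_start = None
    return seq[best_start:best_start + best_len]


# Toy contig: a 40 bp high-quality core between two 25 bp low-quality tails.
seq = "N" * 25 + "ACGT" * 10 + "N" * 25
quals = [5] * 25 + [40] * 40 + [5] * 25
trimmed = trim_contig(seq, quals)
```

Because a whole window is kept whenever its mean passes the threshold, a few low-quality bases next to the core survive trimming in this toy example; a stricter per-base rule would be an equally plausible reading of the text.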

3.2.2.2 Encompass Complete Transposable Element

Once the putative coding region has been identified, we create a consensus for the complete TE. We perform a blastn search with each contig and singleton from the previous (CAP3) step to find the instances of the TE within the genome. We again process these hits as before and extract them from the genome, but this time we also extract flanking regions on both sides of the viable hits in an attempt to capture the entire TE. This extracted set of instances can then be used to generate a consensus sequence.
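Extracting a hit together with its flanking regions amounts to widening the hit coordinates and clamping them to the ends of the sequence. The sketch below assumes 1-based inclusive coordinates, as BLAST reports them, and a single flank length applied to both sides; the function name is hypothetical, and the flank length itself depends on the TE type and is not specified in this section.

```python
def extract_with_flanks(genome_seq, start, end, flank):
    """Return genome_seq[start..end] (1-based, inclusive) plus up to `flank`
    bases on each side, clamped to the sequence boundaries."""
    left = max(0, start - 1 - flank)          # convert to a 0-based slice start
    right = min(len(genome_seq), end + flank)
    return genome_seq[left:right]


# Toy sequence: the hit covers the five C bases (positions 6-10).
toy_genome = "AAAAACCCCCGGGGGTTTTT"
with_flanks = extract_with_flanks(toy_genome, 6, 10, flank=3)
```

Clamping matters for hits near the end of a contig or scaffold, where the requested flank would otherwise run past the available sequence.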

Algorithm 1 P = IdentifyPutativeSequences(Q, S, evalue, distance)
  Let Q be the set of representative TEs
  Let S be the genome
  Let P be the set of putative hits
  Let evalue be the maximum e-value of a potential hit
  Let distance be the maximum distance between potential hits
  // Search genome and sort hits according to location
  for all q ∈ Q do
    Hq ← BLAST(q, S)
    Hq ← sort(Hq, position)
  end for
  // Combine overlapping hits
  for all q ∈ Q do
    for all h ∈ Hq do
      if h.evalue ≤ evalue then
        for all i ∈ Hq do
          if i.evalue ≤ evalue then
            if abs(h.location − i.location) ≤ distance then
              h ← (h + i)
            end if
          end if
        end for
      end if
    end for
  end for
  // Extract putative TEs from genome
  for all q ∈ Q do
    for all h ∈ Hq do
      Pq ← Pq ∪ extract(h, S)
    end for
  end for
  // Assemble consensus TEs
  for all q ∈ Q do
    Pq ← trim(CAP3(Pq))
  end for
  return P
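The combination step of Algorithm 1 can be written as a single pass over sorted hits. The sketch below is one possible reading, not the TESeeker implementation: the `Hit` record and its field names are hypothetical, coordinates are assumed to be normalized so that start is not greater than end, strand is ignored, and hits are merged only within the same subject sequence (an assumption; the text requires only that they share a query).

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query: str      # representative TE that produced the hit
    subject: str    # contig or scaffold name (assumed field)
    start: int
    end: int
    evalue: float


def combine_hits(hits, max_evalue=1e-20, distance=50):
    """Merge overlapping, nested, or nearby hits from the same query.

    Hits with e-value >= max_evalue are discarded first. Merged hits span
    the intervening sequence, as described in the text.
    """
    kept = sorted((h for h in hits if h.evalue < max_evalue),
                  key=lambda h: (h.query, h.subject, h.start))
    merged = []
    for h in kept:
        last = merged[-1] if merged else None
        if (last is not None and last.query == h.query
                and last.subject == h.subject
                and h.start - last.end <= distance):
            last.end = max(last.end, h.end)            # extend over gap or overlap
            last.evalue = min(last.evalue, h.evalue)   # keep the best e-value
        else:
            merged.append(Hit(h.query, h.subject, h.start, h.end, h.evalue))
    return merged


hits = [
    Hit("mar1", "chr1", 100, 200, 1e-30),
    Hit("mar1", "chr1", 230, 300, 1e-25),   # 30 bp from the first hit: merged
    Hit("mar1", "chr1", 150, 180, 1e-40),   # nested in the first hit: merged
    Hit("mar1", "chr1", 500, 600, 1e-22),   # 200 bp away: kept separate
    Hit("mar1", "chr1", 610, 700, 1e-5),    # fails the e-value cutoff
]
merged = combine_hits(hits)
```

Sorting first lets each hit be compared only with the most recent merged interval, which avoids the nested loop over all hit pairs in the pseudocode.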

3.2.2.3 Generate Consensus

The extracted near full-length sequences from the previous step are inherently very similar on a nucleotide-by-nucleotide basis. To generate a consensus from this set of sequences, we perform a multiple sequence alignment with ClustalW2 [89]. A consensus sequence from the multiple sequence alignment is generated as follows. We record counts for each nucleotide at each position in the alignment file. If a gap is encountered, counts for each nucleotide are incremented. If the percentage for any nucleotide at a given position exceeds a given threshold, 49% by default, that nucleotide is used for that position in the consensus. We now have a consensus sequence for the TE that is the most likely sequence to occur in the genome and we need to verify that it is complete.
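The consensus rule described above can be sketched per alignment column. Several details are not fixed by the text and are assumptions here: the percentage is taken relative to the number of aligned sequences, a column with no single winning nucleotide above the threshold is written as "N", and any character other than A, C, G, T, or "-" is ignored. The function name is hypothetical.

```python
def consensus(alignment, threshold=0.49):
    """Column-wise consensus of equal-length aligned sequences.

    A gap ('-') increments the count of every nucleotide, so gapped
    sequences never vote against the majority base in a column.
    """
    if not alignment:
        return ""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("aligned sequences must have equal length")
    n = len(alignment)
    out = []
    for col in range(length):
        counts = dict.fromkeys("ACGT", 0)
        for row in alignment:
            ch = row[col].upper()
            if ch == "-":
                for base in counts:
                    counts[base] += 1
            elif ch in counts:
                counts[ch] += 1
        ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
        (best, best_n), (_, second_n) = ranked[0], ranked[1]
        # Require a unique winner above the threshold; otherwise emit N.
        if best_n / n > threshold and best_n > second_n:
            out.append(best)
        else:
            out.append("N")
    return "".join(out)


aligned = ["ACGT-A", "ACGTTA", "AC-TTA", "TCGTTC"]
result = consensus(aligned)
```

Counting a gap toward every nucleotide lets the consensus extend through columns where some copies are truncated, which suits the goal of recovering a full-length element from fragmented copies.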

3.2.2.4 Identify Complete Transposable Element

To validate and improve the consensus sequence, we look for similar copies of it in the genome with a blastn search. We again process the BLAST hits as previously described and extract them from the genome, generally adding short flanking sequences. The resulting extracted sequences are again iteratively examined with CAP3 and trimmed. CAP3 produces a sequence which represents the best estimate for a representative putative TE in the novel genome. Manual inspection of the putative TE is advisable, both in terms of validity and classification. Once validated, this TE can then be utilized to calculate the density of its particular family within the genome and to find individual instances.

3.2.3 Implementation

Our approach is implemented as TESeeker and was purposely designed to be modular, while relying upon common bioinformatics tools, namely BLAST, CAP3, and ClustalW2, as well as BioPerl [133]. TESeeker is released as a VirtualBox [144] virtual appliance. The local web browser interface to TESeeker offers the main gateway to the core TESeeker functionality; however, TESeeker can also be run through the command line. A researcher need only provide basic parameters, such as TE family, host genome, closeness to combine, minimum BLAST hit length, flank length, CAP3 window size, CAP3 quality score threshold, and the nucleotide percentage threshold for consensus generation. We offer suggested parameters that were determined through extensive testing of the approach on various TE families. These tests were largely performed on arthropod genomes.

Suggested parameters include combining BLAST hits within 50 bp, a CAP3 window size of 20 bp and quality score threshold of 18, a 49% nucleotide commonality cutoff for consensus generation, and a 70% minimum length cutoff (with respect to the query) for the final BLAST search. Further details on suggested parameters, as well as the means to perform a sample run, are provided with the virtual appliance. While TESeeker is not parallelized, researchers can easily run multiple instances of it while varying parameters and TE families, offering scalability.

3.2.4 Advantages

Our approach offers many advantages to researchers. First, TESeeker allows for the fast and accurate detection of TEs. As demonstrated in several genomes, across multiple TE families, TESeeker effectively identifies TEs. In addition to TE identification, our approach offers opportunities to reexamine and validate previous research. Second, TESeeker is very easy to use; we provide TESeeker as a virtual appliance, completely configured. Researchers need only provide a few parameters to begin searching. Parameters are easily modified and multiple iterations of the approach can be run simultaneously. Third, TESeeker is general. While we primarily evaluated our approach on Class II TEs in arthropod genomes, the parameters can be adjusted to allow for the effective detection of a variety of TE families in any genome, including genomes that contain only degraded TEs. Less stringent parameters will be more effective in detecting such degraded TEs, but will also increase the number of false positives. As mentioned previously, we have utilized various stages of this approach to identify non-LTR and LTR TEs in a number of genome projects. Last, our approach eases the burden on expert annotators, decreasing genome annotation time.

3.2.5 Limitations

While robust, this approach has several limitations. First, results are highly dependent on the quality of the sequences in the library and whether the novel genome contains TEs with homology to those in the library. The library must contain a thorough representation of TEs for a given family, preferably amino acid coding regions. The provided library has performed well, but extensive testing has not been performed on LTR elements. Additionally, this approach is not designed to detect TEs without a coding region, such as SINEs or MITEs. Second, the approach is most effective for TEs that exist in multiple copies throughout the genome. While TESeeker has been shown to find TEs that have only a single full-length instance, the quality of its output may suffer in such cases, and altering the parameters to compensate can require considerable extra effort from the researcher. Last, results from TESeeker must be closely examined. An ongoing issue with TEs concerns their classification. If a search is seeded with mariner sequences, it may produce consensus TEs that are not true mariners, but are rather mariner-like TEs. For this study, TEs were classified through examination of their amino acid coding regions.

3.3 Results

Our approach was developed over the course of several TE detection projects on several arthropod genomes [5, 83], but was not originally automated. DNASTAR SeqMan II [33] was used in place of CAP3 and ClustalW2. DNASTAR SeqMan II produced viable results, but it required extensive interaction from a researcher. Sequences had to be manually examined and trimmed in the program, a process which took considerable time and required a trained researcher. This manual approach produced results that we consider a high-quality annotation of TEs. We used these results to partially validate TESeeker against the Pediculus humanus humanus genome, described below. For example, running the approach with default parameters for a mariner Class II element in P. humanus humanus and for a Jockey Class I element in Culex quinquefasciatus produced consensus TEs that were each more than 98% identical to the corresponding manually produced element. Additionally, the elements were correctly trimmed. These consensus sequences were generated with amino acid coding sequences: transposase in the mariner element and both open reading frames of the reverse transcriptase in the Jockey element. Figures 3.3 and 3.4 show alignments of the automated approach’s consensus versus manually annotated elements. We also evaluated our approach against published results from the Anopheles gambiae PEST genome, as well as a number of other genomes. Various stages of our methodology have been applied to a number of genome projects, which we describe next. In all cases, we utilized our library of representative coding regions. If we were searching an annotated genome with representative coding regions already in our library, we removed them before running TESeeker.

Automated  AA-TATTGGGTTGGCAAATAAGTA...AATATCTTTTGCCAACCCAATA
              |||||||||||||||||||||   ||||||||||||||||||||||
Manual        TATTGGGTTGGCAAATAAGTA...AATATCTTTTGCCAACCCAATA
              5’                                          3’

Figure 3.3. P. humanus humanus mariner element. The alignment of the consensus sequence from our approach and the manually annotated element is shown above. As evident from the figure, our approach correctly identifies the element and trims the ends almost perfectly.

Automated  TTT...TTTTTTTTTTTAATTTATATTTAT...GAAGGTTCGCAAGACACTG
                 ||||||||||||||||||||||||   |||||||||||||||||||
Manual           TTTTTTTTTTTAATTTATATTTAT...GAAGGTTCGCAAGACACTGAT
                 5’                                            3’

Figure 3.4. C. quinquefasciatus Jockey element. The alignment of the consensus sequence from our approach and the manually annotated element is shown above. Extra thymines on the 5’ end are typical of the corresponding poly(A) tail, characteristic of Jockey elements.

3.3.1 Pediculus humanus humanus

The body louse, Pediculus humanus humanus, is the primary vector of typhus and several other diseases [109]. It has the smallest presently sequenced insect genome at roughly 110 Mb (mega base pairs). TESeeker was able to identify all Class I and II TEs, with the exception of MITEs, reported in Kirkness et al. [83]. A separate tool was developed to detect MITEs and is not described here. Unlike many other arthropod genomes, only 1% of the P. humanus humanus genome is made up of TEs. Our approach’s ability to discover TEs of varying families, across classes, in a genome with so few TEs demonstrates its utility. Following is a description of our results for each class of elements, which is summarized in Table 3.1. Additional detail is provided in the P. humanus humanus genome paper [83].

TABLE 3.1

Pediculus humanus humanus NON-LTR RESULTS

Class I    Family       Element    Length (bp)  Full-length Copies  Copies  Density
non-LTR    SART         Hope-like  4655         1                   522     0.18%
           R4           Dong-like  5266         4                   1739    0.45%
LTR        ty3/gypsy    Mdg1       5395         2                   976     0.28%

Class II   Family       Element    Length (bp)  Full-length Copies  Copies  Density
           mariner/Tc1  mariner    1276         24                  216     0.09%
           MITE         MITE1      623          4                   39      0.02%
                        MITE2      169          16                  66      0.007%

TOTAL                                                                       1.027%

3.3.1.1 Class I Elements

LTR Retrotransposons– Only one element of the LTR retrotransposon family is well-represented in the P. humanus humanus genome. Phylogenetic analysis shows that it belongs to the Mdg1 lineage of LTR retrotransposons (Ty3/gypsy clade). There were no active copies found; the canonical copy has point mutations in the gag-like domain. There are only two full-length copies in the genome, which suggests that these genomic insertions are relatively recent and that selective pressure is very efficient in purging functional copies from the genome. The other copies are present in the form of solo-LTRs and partial to highly deleted proviral copies, demonstrating that solo-LTR formation (by recombination between the two LTRs of the same copy) and deletions are important mechanisms in the inactivation and/or elimination of TEs from this genome. Another characteristic of this element is that the target site is always ATAT, and many of the copies are located in poly-AT regions (possible heterochromatin), where recombination rate may be lower and, therefore, selection pressure is also lower, permitting fragmented copies to evolve like pseudogenes over time until selection finally eliminates them.

Non-LTR Retrotransposons– Two distinct types of non-LTRs were reconstructed and identified. The longest element is 5266 bp. It has 52% homology to the A. gambiae Dong reverse transcriptase (R4), possesses a single reading frame, and has a TAA target site [86]. Four full-length copies and many partial-length copies were found in the genome.

The second element that was reconstructed was about 4655 bp long; however, it was difficult to determine the boundaries of the element outside of its coding region. This transposable element is not represented in the genome as a full-length copy. However, several copies exist in the genome with interrupted reading frames. The highest homology was found to the Bombyx mori (50% homology) and the Papilio xuthus (48% homology) Hope reverse transcriptase protein [85]. Probable loss of target site specificity and a trend of insertions across all genomic locations was reported by Kojima and Fujiwara [85] for Hope-like elements. However, it was observed that some copies of Hope-like elements in P. humanus humanus have a targeting sequence similar to 28S rDNA.

3.3.1.2 Class II Elements

mariner/Tc1 – From the many inspected Class II families, only one mariner/Tc1 element was identified. It is a 1276 bp mariner transposon, with 33 bp TIRs. We show the consensus for this element in Appendix E.1.3.1. Its transposase has highest homology to Apis mellifera Ammar1 transposase and Ceratitis capitata Ccmar2 transposase. Reconstruction of the mariner revealed that the element has two reading frames and that some copies have a 24 bp deletion in the coding region, which caused further reading frame interruptions. No autonomous elements were found, but 24 full-length copies and many deteriorated ones are still present in the P. humanus humanus genome.

MITE– Two MITEs were identified in the P. humanus humanus genome. The first is 623 bp long, present in 4 copies, with a 12 bp TIR. The second is 169 bp long, present in 16 copies, with a 20 bp TIR. Dot plot analysis revealed that the 623 bp element consists of 4-5 repeats within itself and that the 169 bp element has 2 repeats. No homologies with other P. humanus humanus TEs were identified. The MITEs were identified by a separate script we developed that aims to find inverted repeats within specified distances within the genome. This script is available in Appendix F.
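To make the idea of searching for inverted repeats within specified distances concrete, the following Python sketch reports spans whose first and last bases are exact reverse complements of each other. It is not the script from Appendix F; the function names, the exact-match requirement, and the fixed repeat length are simplifying assumptions for illustration only.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_inverted_repeats(genome, tir_len, min_len, max_len):
    """Return (start, end) spans, 0-based and half-open, whose first and
    last tir_len bases are exact reverse complements of each other.

    Assumptions of this sketch: repeats match exactly, have a fixed
    length, and the two repeats of a span do not overlap.
    """
    genome = genome.upper()
    last_start = len(genome) - tir_len
    # Index every k-mer by the positions at which it starts.
    index = defaultdict(list)
    for pos in range(last_start + 1):
        index[genome[pos:pos + tir_len]].append(pos)

    spans = []
    for start in range(last_start + 1):
        left = genome[start:start + tir_len]
        for right_pos in index.get(reverse_complement(left), ()):
            end = right_pos + tir_len
            non_overlapping = right_pos >= start + tir_len
            if non_overlapping and min_len <= end - start <= max_len:
                spans.append((start, end))
    return spans
```

Candidate spans found this way would still require inspection, since short exact inverted repeats also occur by chance in AT-rich sequence.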

3.3.2 Culex quinquefasciatus

The Culex quinquefasciatus mosquito is the primary vector of the West Nile virus and St. Louis encephalitis. We searched its roughly 580 Mb genome for non-LTR retrotransposons and identified 11 of the 17 known families of non-LTRs, together occupying 4.4% of the total genome. Among these, full-length copies of the CR1, I, Jockey, L1, L2, LOA, Loner, and R1 families were found. The Loner and Outcast families are unique to mosquitoes. There is evidence of recent activity in the CR1, Jockey, L1, and L2 elements. Across all TE families, TEs occupy roughly 29% of the genome, comparable to similar mosquito species. Our full results, also presented in the C. quinquefasciatus genome paper [5], are shown in Table 3.2.

TABLE 3.2

Culex quinquefasciatus RESULTS

Class I   Family             Number of Elements  Copies  Density
non-LTR   CR1                31                  973     0.28%
          I                  11                  63      0.02%
          Jockey             16                  5028    1.77%
          L1                 57                  662     0.15%
          L2                 9                   1416    0.61%
          LOA                9                   184     0.09%
          Loner              2                   127     0.12%
          Outcast            4                   15      0.00%
          R1                 32                  250     0.14%
          RTE                8                   892     0.38%
          Unclassified LINE  30                  11,117  0.88%

TOTAL                                                    4.45%

3.3.3 Anopheles gambiae PEST Genome

Anopheles gambiae serves as the main vector of malaria [63]. The PEST strain genome consists of roughly 273 Mb and has been extensively studied. Class II P elements within the genome have been especially closely examined. Sarkar et al. originally identified 6 distinct P elements [126]. More recently, Oliveira de Carvalho et al. identified 4 additional P elements [106], while Quesneville et al. described 9 clades (subfamilies) that are at least 30% divergent at the nucleotide level [117]. In all, previous research has described 12 clades of P elements in A. gambiae that are more than 30% divergent at the nucleotide level.

TESeeker detected 11 out of the 12 P elements within A. gambiae, as well as an additional 2 partial hits that showed strong similarity to P element transposase, but that were more than 30% divergent at the nucleotide level. The lone element that TESeeker missed, AgaP14, is most divergent from the other elements, which may explain its absence and which also suggests our library does not fully represent the P element family. Additionally, TESeeker produced consensus sequences with TIRs on every P element where they had been previously reported. Searches for additional Class II TE families were also successful. In particular, of the 13 piggyBac elements identified in Sarkar et al. [127], we identified 10, including TIRs where previously described. Again, the elements TESeeker missed were most divergent from the other sequences. TESeeker did especially well with mariner elements. TESeeker identified each of the 5 elements at TEfam, each with complete TIRs and 4 with the expected TSDs.

3.3.4 Other Organisms

TESeeker was also validated on select elements in a variety of organisms. Of particular note, we detected a previously unreported putative mariner element in the well-studied Drosophila melanogaster genome. The 1061 bp element has TIRs 26 bp in length, with 3 mismatches, but with no apparent TSDs. A single full-length copy, as well as several partial hits, exist within the genome. Its transposase has high homology to transposases from related insects, such as Chymomyza amoena and Cladodiopsis seyrigi. Searches for this element in existing TE annotations for D. melanogaster produced no hits. Please refer to Section E.3.1.1 of Appendix E for an annotated version of this putative element. Further investigation in collaboration with FlyBase [36] is warranted to validate this result.

Additionally, TESeeker was used to search for mariner elements in the human (Homo sapiens), frog (Xenopus tropicalis), and chicken (Gallus gallus) genomes. Mariner elements are known to exist in all three genomes, and TESeeker found them in each.

3.4 Conclusion

As the number of sequenced genomes rises, the necessity to identify TEs within them also grows. TEs are an important evolutionary force present in the majority of these genomes. While there exist mature, effective, and automated gene identification systems, the tools available for the identification of TEs are not as robust. Particularly, current homology-based approaches are typically very interactive, requiring numerous user decisions and many separate tools. The approach described herein successfully identifies TEs in novel genomes in an automated and easy to use package, offering researchers the ability to quickly produce high-quality consensus TEs. TESeeker was developed and refined over the course of several TE identification projects and works best to detect TEs with homology to known TEs. We are able to generate high-quality putative TEs as well as characterize the prevalence of TEs in many genomes. We provide TESeeker as a web-based tool within a VirtualBox virtual appliance, while also providing our representative TE library both within the virtual appliance and as a separate download. While its local web interface automates the underlying logic, each TESeeker step can be manually started through the command line, offering additional flexibility. Additionally, we provide documentation and test cases to evaluate the approach with the virtual appliance. The ability to automatically analyze a genome alleviates the exhaustive, error-prone, and time-consuming task of manually inspecting and manipulating results. The performance of the approach varies, but is largely dependent on the length of the TE family (longer sequences take longer to assemble) and its abundance in the genome.

CHAPTER 4

DESIGN AND PROOF-OF-CONCEPT PLAN FOR COMMUNITY ANNOTATION OF TRANSPOSABLE ELEMENTS ON VECTORBASE

4.1 Introduction

This chapter presents our design and implementation plan for an online community annotation platform for TEs, designed to work in conjunction with VectorBase [91, 92]. Although TEs often represent a very high percentage of genomic data within a genome, repositories of TE data are lacking and remain unstandardized, specifically for vectors of human pathogens. Moreover, existing TE repositories typically lack the user-friendliness that is afforded to other genomic data. For example, at NCBI [104], there is an extensive amount of information available for genes; one can search for them in multiple ways, as well as visualize and easily browse many details. In comparison, the primary online resources for TEs, namely TEfam [134], RepBase [119], and WikiPoson [148], offer far less information. TEfam allows users to submit TEs and some associated information about them, but only supports four organisms and offers no structural display options for TEs. However, TEfam does have the capability for users to submit detailed TE information, such as the amino acid open reading frames (ORFs), terminal inverted repeats (TIRs), long terminal repeats (LTRs), and target site duplications (TSDs). RepBase has a much larger database of TEs for many organisms but simply offers them in the standard FASTA and EMBL file formats, most often only in nucleotide form. Both sites have a review process for TE submission, yet neither offers any kind of structural display. Additionally, neither offers community feedback concerning the hosted TEs. The third site, WikiPoson, offers user feedback in the form of a MediaWiki [99]. While not standardized and quite limited, WikiPoson offers researchers the ability to submit information and offers classification guidelines.

The ability to store and visualize the structure of a TE, as well as allow for community feedback, is important for several reasons. First, it allows researchers unfamiliar with certain types of TEs to be quickly exposed to other types. Second, the opportunity for user feedback is critical. Much of the existing TE research is not moderated. The ability for feedback from outsiders increases credibility. Lastly, a moderated repository for TE data would encourage more standardized research of TEs.

Our design and implementation plan utilizes the existing VectorBase [91, 92] bioinformatics resource center for vectors of human pathogens. VectorBase stores genomic data for a variety of insects and offers the capability to browse and display genomic data, run scientific analysis tools, obtain information about the organisms, and provide community feedback. VectorBase also provides a means for community annotation of genes. It currently does this through a user-submitted form for the gene-related data. This data is then reviewed and, if approved, added to an online database. Currently, the data is accessible to all researchers. Additionally, the data can be browsed on VectorBase through Ensembl’s Genome Browser [43].

Our plan builds upon VectorBase’s manual annotation of genes by adding the ability to describe TEs. We utilize and adapt their core methods and extend them to work with the unique structure of TEs. We utilize the Chado [26] database schema to store the TEs and build a display based on the PHP GD library for TE structural display, as Ensembl does not currently support the display of TEs. Our primary goal is to complement the existing features of VectorBase with a mechanism for the submission and display of TEs. This proposed system would fill many gaps in the aforementioned existing systems and improve upon the quality and spread of information across the scientific community.

4.2 Transposable Elements and the VectorBase Community Annotation Pipeline

Adding the capability to support TEs through VectorBase’s community annotation pipeline (CAP) requires several changes. The following sections describe how TEs would fit into the VectorBase CAP.

4.2.1 Similarities to the VectorBase Community Annotation Pipeline

Extending CAP to support TEs requires a number of steps. Fortunately, many of the technologies developed for CAP can be utilized for TEs. Other than a TE-specific submission spreadsheet, the interface and submission process would be nearly identical to that of CAP, making submission easier for the user. Submissions would be expert-regulated, and the TE information would be available in the Ensembl browser via DAS. From the user standpoint, the following steps (summarized in Figure 4.1) would be followed:

1. Download SubmissionForm.xls

2. Enter TE data

3. Upload SubmissionForm.xls

4. Wait for approval from curator

5. If approved by the curator, data goes live; otherwise, it must be modified by user

The TE-specific submission spreadsheet for Class II TEs is summarized below (some fields, such as ORF, can have multiple instances). A portion of the spreadsheet is shown in Figure 4.2. Researchers are able to enter as many TE instances as they would like for a given submission.

Transposon Symbol: Unique name for the TE.

Family Name: Family to which the TE belongs.

Organism: Organism in which the TE was found.

Transposon Description: Description of the TE, as well as any unique properties.

DNA Transposon: Complete sequence data for the TE.

Target Site Duplication: Target site duplication for the TE.

5’ Start: Genomic location where the transposon starts.

3’ End: Genomic location where the transposon ends.

Strand: Strand on which the TE was found.

5’ TIR: TIR on the 5’ end.

5’ TIR Start: Genomic location where the 5’ TIR starts.

3’ TIR: TIR on the 3’ end.

3’ TIR Start: Genomic location where the 3’ TIR starts.

ORF: ORF sequence data.

ORF Start: Genomic location where the ORF starts.

Figure 4.1. Client-side TE Submission Process. Similar to the VectorBase CAP, researchers download, fill out, and submit a TE-specific submission form. Approved data goes online and is eventually incorporated into the Ensembl browser.

Figure 4.2. TE Submission Form. The figure above shows a snippet of the spreadsheet to submit TEs. Researchers can submit as many TEs at a time as they desire.

There are advantages to using an offline form. First, researchers can keep their information stored locally, making it easy to edit or add to their data. Second, our form allows researchers to enter as much data as they have for as many TEs as they have, eliminating the burden of manually going through an online submission process multiple times. Once the form has been submitted, phpExcelReader [108] is used to parse the data. phpExcelReader is an open source PHP library that works with Excel files. The library is used to read the file’s contents and display what it has parsed out. If the researcher clicks approve, the data is inserted into the Chado database with a status of “under review.” This is the same basic technique used by the VectorBase CAP.
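As a rough illustration of what a parsed spreadsheet row might look like before insertion, the following Python sketch models a subset of the submission fields and a few sanity checks. The class name, attribute names, and the checks themselves are hypothetical and are not taken from the VectorBase CAP or phpExcelReader; only the field meanings and the “under review” status come from the text.

```python
from dataclasses import dataclass, field

VALID_BASES = set("ACGTN")


@dataclass
class TESubmission:
    """One parsed row of the TE submission form (hypothetical subset)."""
    symbol: str      # Transposon Symbol
    family: str      # Family Name
    organism: str    # Organism
    sequence: str    # DNA Transposon
    start: int       # 5' Start (assumed 1-based, inclusive)
    end: int         # 3' End (assumed 1-based, inclusive)
    strand: str      # Strand, assumed to be "+" or "-"
    # New submissions always start out awaiting curator review.
    status: str = field(default="under review", init=False)

    def problems(self):
        """Return human-readable issues; an empty list means no issue found."""
        issues = []
        if not self.symbol.strip():
            issues.append("missing transposon symbol")
        if self.strand not in ("+", "-"):
            issues.append("strand must be '+' or '-'")
        if self.start < 1 or self.end < self.start:
            issues.append("coordinates must satisfy 1 <= start <= end")
        elif self.end - self.start + 1 != len(self.sequence):
            issues.append("sequence length does not match coordinates")
        if set(self.sequence.upper()) - VALID_BASES:
            issues.append("sequence contains non-nucleotide characters")
        return issues
```

Checks of this kind could be shown to the researcher alongside the parsed contents, before the approve step, so that obvious inconsistencies are caught before a curator sees the record.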

4.2.2 Differences from the VectorBase Community Annotation Pipeline

Extending the VectorBase CAP to support TEs also requires several changes. While the submission process is largely the same from the client side, thereby facilitating submission in a user-friendly format, the aforementioned changes (from the spreadsheet) are made when storing TEs in Chado. Additionally, the spreadsheet allows for the submission of TE instances, the coordinates of which are also provided. As a result, and unlike the alignment of genes, exonerate will not be required to align the TE to the genome. The ability to submit consensus TEs will also be supported. In this case, an alignment algorithm, such as BLAST, will be used to generate instances of the consensus TE within the genome. This data will be generated “on-the-fly” and could be displayed via a different DAS track than the normally submitted and curated data.

4.2.3 Transposable Element Representation in Chado

We utilized Chado’s central module, the Chado sequence module, to store information about TEs. The fundamental table within this module is the feature table, which is used for describing biological sequence features. Chado defines a feature to be a region of a biological polymer, which typically means a DNA, RNA, or polypeptide molecule. A region can be the entire extent of the molecule or a junction between two bases. Features can be classified according to ontology, localized relative to other features, and form part, whole, and other relationships with other features [54]. We store the model (in our case the consensus) of the TE and all its functional parts in the feature table. This is accomplished by identifying the relation of each functional part to the “main” consensus via the feature_relationship table, as well as the location of the smaller parts within the main part via the featureloc table. We also make a connection to existing data in Chado. For example, in the feature table, we need to specify the organism where the particular TE was found, so we utilize foreign keys connecting the TE, in this case, to the organism, cvterm, and dbxref tables. The featureprop table is utilized to set the status of the data (for example, “under review”). Figure 4.3 shows a graphical representation of some of the tables we utilized.

FEATURE (feature_id, dbxref_id, organism_id, name, uniquename, residues, seqlen, md5checksum, type_id, is_analysis, timeaccessioned, timelastmodified)

FEATURELOC (featureloc_id, feature_id, srcfeature_id, fmin, is_fmin_partial, fmax, is_fmax_partial, strand, phase, residue_info, locgroup, rank)

FEATURE_SYNONYM (feature_synonym_id, synonym_id, feature_id, pub_id, is_current, is_internal)

FEATURE_RELATIONSHIP (feature_relationship_id, subject_id, object_id, type_id, rank)

• Primary key
• Foreign key which will be linked to a PK in the part of the structure which we are creating
• Foreign key NOT linked to a PK in the part of the structure which we are creating

[Diagram: entity boxes for FEATURE, FEATURELOC, FEATURE_SYNONYM, FEATURE_RELATIONSHIP, FEATURE_CVTERM, CVTERM, ORGANISM, and DBXREF, joined by relationship lines labeled 1 and n.]

• Boxes with solid borders represent relations which we will build
• Boxes with dashed borders represent relations which we will reference in the existing VectorBase database

Figure 4.3. Entity-Relationship Diagram of Selected Chado Tables. The figure above shows selected changes to the Chado schema to account for TEs.
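The storage pattern described above can be illustrated with a small, self-contained Python sketch using an in-memory SQLite database. The tables below are heavily simplified stand-ins for the Chado tables of the same names: in particular, a plain text type column replaces Chado's type_id reference to cvterm, and the organism and dbxref links are omitted. The function names and the “transposable_element” label are assumptions of this example.

```python
import sqlite3

# Simplified stand-ins for Chado tables (assumption: text "type" columns
# replace Chado's type_id foreign keys into cvterm).
SCHEMA = """
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL UNIQUE,
    type       TEXT NOT NULL,
    residues   TEXT
);
CREATE TABLE featureloc (
    featureloc_id INTEGER PRIMARY KEY,
    feature_id    INTEGER NOT NULL REFERENCES feature(feature_id),
    srcfeature_id INTEGER NOT NULL REFERENCES feature(feature_id),
    fmin   INTEGER NOT NULL,
    fmax   INTEGER NOT NULL,
    strand INTEGER
);
CREATE TABLE feature_relationship (
    subject_id INTEGER NOT NULL REFERENCES feature(feature_id),
    object_id  INTEGER NOT NULL REFERENCES feature(feature_id),
    type       TEXT NOT NULL
);
CREATE TABLE featureprop (
    feature_id INTEGER NOT NULL REFERENCES feature(feature_id),
    type  TEXT NOT NULL,
    value TEXT
);
"""


def new_database():
    """Create an in-memory database holding the simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def store_te(conn, name, residues, parts):
    """Store a consensus TE and its parts; return the TE's feature_id.

    parts: iterable of (part_name, part_type, fmin, fmax), located on the
    consensus in interbase (0-based, half-open) coordinates.
    """
    cur = conn.cursor()
    cur.execute("INSERT INTO feature (uniquename, type, residues) VALUES (?, ?, ?)",
                (name, "transposable_element", residues))
    te_id = cur.lastrowid
    cur.execute("INSERT INTO featureprop (feature_id, type, value) VALUES (?, ?, ?)",
                (te_id, "status", "under review"))
    for part_name, part_type, fmin, fmax in parts:
        cur.execute("INSERT INTO feature (uniquename, type, residues) VALUES (?, ?, ?)",
                    (part_name, part_type, residues[fmin:fmax]))
        part_id = cur.lastrowid
        # Locate the part on the consensus and record that it is part of it.
        cur.execute("INSERT INTO featureloc (feature_id, srcfeature_id, fmin, fmax, strand) "
                    "VALUES (?, ?, ?, ?, 1)", (part_id, te_id, fmin, fmax))
        cur.execute("INSERT INTO feature_relationship (subject_id, object_id, type) "
                    "VALUES (?, ?, 'part_of')", (part_id, te_id))
    conn.commit()
    return te_id
```

Keeping each functional part as its own feature row, rather than as extra columns on the TE, is what allows fields such as ORF to occur any number of times for a single element.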

Figure 4.4. TE Start and Submit Page. Here, users can select the TE they wish to view information about or submit a new TE for review.

4.2.4 Proof-of-Concept

A sample interface independent of VectorBase has been developed. This interface allows for the submission of TE instances into a clone of the VectorBase Chado database. The basic interface allows researchers to view or submit TEs, as shown in Figure 4.4. Basic information about the TE can be displayed, as in Figure 4.5, including a structural display. The structural display uses the PHP GD Graphics Library [48] to dynamically create a visualization of the TE, a sample of which is shown in Figure 4.6. This data would eventually be made available via a link in the Ensembl browser. Figure 4.7 depicts the configuration.
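The core of a to-scale structural display is a mapping from base-pair coordinates to pixel coordinates. The following Python sketch shows that mapping by emitting a small SVG image; it substitutes SVG for the GD library used in the proof-of-concept, and the function name, default dimensions, and colors are assumptions made for illustration.

```python
from html import escape


def te_structure_svg(te_length, parts, width=600, height=40):
    """Draw a TE to scale as an SVG string.

    te_length: length of the TE in bp.
    parts: list of (label, start, end, color) with 1-based inclusive
           coordinates on the TE.
    """
    if te_length <= 0:
        raise ValueError("te_length must be positive")
    scale = width / te_length  # pixels per base pair
    # A thin horizontal line represents the full extent of the element.
    shapes = [f'<rect x="0" y="{height // 2 - 2}" width="{width}" '
              f'height="4" fill="black"/>']
    for label, start, end, color in parts:
        x = (start - 1) * scale
        w = (end - start + 1) * scale
        shapes.append(
            f'<rect x="{x:.1f}" y="5" width="{w:.1f}" height="{height - 10}" '
            f'fill="{escape(color)}"><title>{escape(label)}</title></rect>'
        )
    return (f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" '
            f'height="{height}">' + "".join(shapes) + "</svg>")
```

For example, the 33 bp TIRs of a 1276 bp mariner element would each be drawn roughly 15.5 pixels wide on a 600 pixel canvas, with the coding region filling most of the space between them.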

Figure 4.5. TE Details Page. This figure shows the information pertinent to each TE. It also has an option to display the TE structure, shown in Figure 4.6.


Figure 4.6. TE Structure Page. Here, we show the display of the TE, to scale. The large center region is the open reading frame. On either end, one can see the terminal inverted repeats (TIRs).

Figure 4.7. Proof-of-Concept Configuration. Here, we show the configuration of the data flow and display, independent of VectorBase.

4.3 Design and Implementation Plan

Work has been performed independent of VectorBase on a clone of the Chado database that utilizes the general design described in the previous sections. To implement the ability for the VectorBase CAP to accept TEs, the following steps should be initiated:

1. Add TE-specific Submission Interface to VectorBase. This can be done through modifications (or edited duplications) of existing files. Namely, files such as UserTools.php, ManualModel.php, and SubmitAnnotation.php could be used as templates to create files to allow for TE submission.

2. Import TEs into Chado. This is largely done through the CAP org.vectorbase.www.cap.importer package. The .java files in this package handle the parsing and insertion of the contents of the spreadsheet into Chado.

3. Export TEs to Ensembl Browser. The CAP org.vectorbase.www.cap.exporter package allows for the display of CAP data in the Ensembl browser. A link to the structural view of the TE would need to be provided through the Ensembl code.

The logic of much of the CAP code would remain unchanged, such as the usage of Hibernate. However, portions of the existing code that utilize exonerate would not be necessary, and the mechanisms by which TEs are inserted into Chado must follow the schema previously described. The VectorBase CAP has many underlying caveats; the implementation is relatively straightforward, yet is difficult to initially follow because of many interdependencies. Allowing for the acceptance of TEs into the VectorBase CAP first requires the CAP to be restored to working order and then edited and extended for TEs. Support for user-submitted consensus TE sequences should be added once the CAP allows for the acceptance and display of TE instances. Once this is implemented, consensus sequences could be used in blastn searches against the genome and the results dynamically displayed via a DAS track in the Ensembl browser. Additionally, TEs would be expert-regulated on an organism-by-organism basis, much like the CAP.

4.4 Conclusion

This chapter has described common annotation strategies as well as the technologies used in the VectorBase CAP. We have described the VectorBase CAP in detail, and offered solutions for extending it to allow for TEs. As a proof-of-concept, we have cloned the VectorBase Chado database and successfully accepted user submissions of TEs from the web, while also parsing and inserting them into Chado. The Chado database has also been used to generate a structural display of the TE. Our approach extends the VectorBase CAP to allow for TEs while utilizing the technologies currently in place. Such an annotation system for TEs has not been implemented to date, as current systems serve mainly as TE repositories, offering no structural display or community feedback. The community annotation of TEs complements the VectorBase CAP for genes while also strengthening the utility of VectorBase.

CHAPTER 5

SIMULATION AND MODELING BACKGROUND¹

5.1 Introduction

Simulations of real-world phenomena have the potential to be extremely valuable to researchers, particularly in the public health realm. Rather than relying on complex equations that are the basis for many scientific models, agent-based models (ABMs) rely on more natural behavioral rules [60]. This leads to a more direct translation from natural phenomena to a simulation model. It is logical to integrate spatial data into some simulation environments; however, as Gilbert [50] pointed out, utilizing geographical information system (GIS) data for dynamic agents is a difficult challenge that has not yet been adequately solved. Although GIS data has successfully been integrated into ABMs for several years, the ability to run complex simulations with thousands of GIS aware agents is computationally challenging. This chapter explores simulations, with an emphasis on ABM and its utilization of GIS data.

5.2 Simulation and Modeling

A simulation is an imitation of a real-world process [10]. This imitation is usually done with a computer through the use of a conceptual model. A conceptual model generally refers to the computer representation of the system that a researcher has chosen to model. The common goal of simulations is to accurately represent the behavior of a real-world system while providing feedback and insight in a manner that would otherwise be infeasible. For example, an experiment that would take months to complete in the laboratory may take only hours or days to complete with a computer simulation. Also, simulations are especially useful for studying scenarios that would be unethical in real life, such as infecting a population with a pathogen. It is appropriate to think of simulations as part of the scientific method: we use them to help us check our assumptions or hypotheses as well as to possibly predict future behavior. Sharing the same goal as the scientific method, we utilize simulations to help us acquire new knowledge.

¹ Portions of this chapter were previously reported in Kennedy [75].

The literature presents a multitude of reasons why computer simulations are valuable [10, 103, 128], a collection of which is listed below:

1. Simulations allow for the timely study of phenomena that would otherwise be impractical. For example, the evolution of a species over a long period of time can be simulated in far less time than the actual experiments would take to perform.

2. Simulations can model theoretical behavior that cannot be replicated in the laboratory. An example of this would be a simulation model that tracked the historical migration patterns of icebergs or continental drift.

3. Simulation inputs can be modified to determine the outcome or effect on a real-world system without harming the real-world system. This would be applicable if a researcher wanted to simulate the spread of a pathogen across a population without harming the population.

4. Experimentation with simulations can confirm understanding. For instance, a simulation model that mimics the population dynamics of a group of animals could allow researchers to examine particular entities of the model and follow them over time, thus furthering the understanding of the system.

5. Simulations can be used as prototypes for new experiments before real-world implementation. For example, a disaster recovery team could simulate any sort of disaster as well as their response tactics, allowing them to choose the best approach.

6. Modern systems are sometimes so complex that their internal workings can only be studied through simulations. Banks et al. [10] refer to a complex factory system in which the internal interactions are so complex that simulations offer the only solution.

As evidenced above, simulations are a powerful tool for researchers; however, there are cases where a simulation would not be appropriate. Banks and Gibson [8] list ten rules for when simulations should not be used. Those most relevant to our purposes are summarized below:

1. Simulations should not be used when common sense can solve the problem or when the problem can be solved analytically in reasonable time.

2. Simulations should not be used when the cost of developing the simulation model exceeds the cost of experimentation.

3. Simulations are not useful when system behavior is too complex or unknown.

5.2.1 Advantages and Disadvantages

There are many advantages to using simulations for scientific study [10]. Aside from the fact that simulations allow one to model the behavior of a real-world system without harming or altering it, simulations typically run and produce results faster than the real-world system being studied, if such a system exists. Additionally, simulations are useful in testing the influence of different variables both on the system as a whole and in regard to one another. Furthermore, a simulation is helpful when performing hypothetical tests or when testing situations that would be unethical or impractical in the real world.

Simulation studies also have some inherent disadvantages. Banks et al. [10] list four specific disadvantages. Namely, simulation models are difficult both to 1) build and to 2) interpret. While this is true to an extent, experienced programmers will find model-building manageable. In addition, 3) interpreting and analyzing the results of a simulation may take some time, but, in many cases, this amount of time will be less than if the scientist had done the actual real-world experiment. Lastly, 4) simulations are sometimes incorrectly used when analytical solutions are more practical. Although valid disadvantages to using a simulation exist, building a simulation can be extremely useful to scientists as long as the simulation fulfills the requirements previously mentioned.

5.2.2 Building a Simulation Model

We have already described simulations as being built upon a model. In most cases, scientists start with a conceptual model, or a model with which they intend to accurately represent the system they are studying. This conceptual model typically goes through many phases and revisions as the simulation is being built. Often, scientists will recognize a problem with their conceptual model or discover a way to improve it and then implement the change. Once the scientist has sufficient confidence in the conceptual model, it will transition into simply being called the model, which will be used as the representation of the system the scientists are studying. This representation is for the study of the system through simulation. Building a model that exactly matches a real-world phenomenon is extremely difficult, if not impossible.

Inherent randomness often appears in simulation models. While many factors cause this randomness, the main cause is that real-world systems are far too complex to accurately and comprehensively represent through a computer simulation model. First, randomness is included in simulation models to cover our limited understanding or uncertainty. In many of the systems we model, we have little idea about the underlying mechanics. We build simulation models to try to help us to understand these characteristics and to experiment with them. If done properly, we will learn about real-world systems through our simulation models. Second, randomness is included for decision-making. If we have a simulation that models ants foraging for food, we have to give the ants the ability to make decisions. If the simulation continuously prompted the entities to perform the same action or encounter the same obstacles each time, nothing would be learned after the first run; randomness is included for this purpose. In the ants example above, the introduction of a random walk would add realism to the model. Lastly, measurement error or quantum effects are accounted for by randomness. Simulations cannot have the precision of real-world systems because of the limitations both of computers and of our own knowledge. They also cannot represent entities as accurately as a real-world system, so we include an inherent randomness. These examples are not meant to be looked at as limitations of simulation models, but as reasons why simulation models are created the way they are. In fact, this randomness is part of what makes simulations unique and powerful, while also representing how the world actually operates.

5.2.3 Simulation Model Types

The literature [10, 90] has divided simulation models into the following three overlapping subcategories:

Static vs. Dynamic: Static simulation models are representative of a system at a specific time. An example of a static system is one that solves complex analytical problems that are infeasible with other methods. Dynamic simulation models are representative of a system over time, such as population dynamics.

Deterministic vs. Stochastic: Deterministic simulation models produce results determined by the provided inputs. In such simulation models, probability does not play a role. An example would be a simulation that models a student going to class at a specified time every day. Stochastic simulation models involve random variables and produce different results with each random seed. Our model with the student would be stochastic if we add a certain probability as to when and whether the student will arrive at class.

Continuous vs. Discrete: Continuous simulation models characterize systems constantly over time. An example would be the population dynamics in a predator-prey simulation model. Discrete simulation models characterize systems at specific points in time. An example would be people paying tolls at a toll booth.
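The student example used for the deterministic and stochastic categories can be sketched in a few lines of Python. This is only an illustration: the class time, attendance probability, and maximum delay are hypothetical values chosen for the sketch, not values from any model described here.

```python
import random

SCHEDULED_ARRIVAL = 9.0  # hypothetical class start time, in hours


def deterministic_arrival():
    """Deterministic: the student arrives at the scheduled time on every run."""
    return SCHEDULED_ARRIVAL


def stochastic_arrival(rng, p_attend=0.9, max_delay=0.5):
    """Stochastic: the student may skip class (None) or arrive up to max_delay hours late."""
    if rng.random() >= p_attend:
        return None
    return SCHEDULED_ARRIVAL + rng.uniform(0.0, max_delay)


def simulate_week(seed, days=5):
    """Run five class days using a random number generator built from the given seed."""
    rng = random.Random(seed)
    return [stochastic_arrival(rng) for _ in range(days)]
```

Repeating `simulate_week` with the same seed reproduces the same week, while changing the seed will generally produce a different one; the deterministic version never varies.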

For the purposes of this study, we further classify simulation models into the following subcategories:

Agent-based vs. Equation-based: Agent-based simulation models have individual entities, called agents, that drive the simulation. They are good at modeling systems with emergent properties. Equation-based simulation models are adept at modeling mathematically based phenomena. We next elaborate on these two subcategories.

5.2.4 Agent-based Modeling

Agent-based simulations, also known as individual-based simulations, have recently gained popularity [121] and are proving to be very powerful. In an agent-based simulation, an agent can be thought of as any acting component in the system. Each agent is treated as an entity, having its own properties and behaviors. These can be influenced by a variety of factors, including the environment and other agents. The interactions between agents and their environment over time often lead to emergent properties within the system. Time is typically represented in the form of time steps; namely, each agent usually has a chance to change its properties and interact with other agents and the environment once every time step. A time step can represent any amount of time. An advantage of agent-based simulations is that they are easily extensible; adding agents to the model is a well-defined process. Additionally, agent-based simulations are rather intuitive to code, as they are modeled in the same manner that we tend to think about systems. Agent-based models have been applied to many areas, including social network models and models of pathogen spread. Our model, named LiNK, is an agent-based model.
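The time-step structure described above can be illustrated with a minimal Python sketch. It is a generic toy, not the LiNK implementation: the `Agent` class, the random-walk rule, and the "infect anyone sharing a cell" interaction are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Agent:
    """A generic agent with its own properties: a grid position and an infection flag."""
    x: int
    y: int
    infected: bool = False

    def step(self, rng, width, height):
        """One time step of behavior: a random-walk move clamped to the grid."""
        self.x = min(max(self.x + rng.choice((-1, 0, 1)), 0), width - 1)
        self.y = min(max(self.y + rng.choice((-1, 0, 1)), 0), height - 1)


def run(agents, steps, width, height, seed=0):
    """Advance every agent once per time step, then apply agent-to-agent interaction."""
    rng = random.Random(seed)
    for _ in range(steps):
        for agent in agents:
            agent.step(rng, width, height)
        # Interaction rule (assumed): sharing a cell with an infected agent infects you.
        infected_cells = {(a.x, a.y) for a in agents if a.infected}
        for agent in agents:
            if (agent.x, agent.y) in infected_cells:
                agent.infected = True
    return agents
```

Extending the model with more agents amounts to appending further `Agent` objects to the list passed to `run`, which is the kind of well-defined extension the paragraph refers to.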

5.2.5 Equation-based Modeling

Equation-based simulations are more mature than agent-based models and are adept at modeling systems governed by underlying mathematical properties or formulas. This is somewhat of a limitation, as more complex systems that cannot be approximated by equations are difficult to model. Also, changing overall properties of an equation-based simulation is often difficult, as it may require a new mathematical model. However, modifying parameters in an equation-based simulation is relatively simple. In this respect, equation-based simulations are rather simple and straightforward. In general, equation-based simulations are very good at modeling known systems with aggregate behaviors or systems simply governed by mathematical rules.

5.3 Geographic Information Systems

A GIS is a system that is used to manipulate and store spatial data. For example, a GIS could consist of a map of the counties of Michigan and the corresponding population data. Coupled with the proper software, users could query the data for counties with a population greater than 10,000 or for counties with an area larger than 500 square miles. Applications of GIS technology span many fields, including environmental impact assessment, scientific investigations, urban planning, and resource management [52]. ArcGIS [4], GRASS GIS [56], and Quantum GIS [114] are several popular GIS software tools. GIS data is usually stored in either raster or vector format; next, we elaborate on each format. Figure 5.1 visually compares raster and vector data on a portion of Bali, Indonesia.
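The county queries mentioned above amount to filtering records on their attributes. The following Python sketch shows the idea with plain dictionaries rather than any particular GIS package; the county names and figures are made-up placeholders, not real Michigan data.

```python
# Placeholder attribute records; names and numbers are invented for illustration only.
counties = [
    {"name": "County A", "population": 8_500, "area_sq_mi": 620.0},
    {"name": "County B", "population": 152_000, "area_sq_mi": 480.0},
    {"name": "County C", "population": 9_900, "area_sq_mi": 310.0},
]


def query(records, predicate):
    """Return the names of all records for which the predicate holds."""
    return [record["name"] for record in records if predicate(record)]


# Counties with a population greater than 10,000.
populous = query(counties, lambda r: r["population"] > 10_000)

# Counties with an area larger than 500 square miles.
large = query(counties, lambda r: r["area_sq_mi"] > 500)
```

In a full GIS each record would also carry the county's geometry, so the same filter could be combined with spatial conditions.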

5.3.1 Raster Data

Raster data is characterized as a collection of pixels, or cells. Many cells make up a single raster file. These cells are stored in a matrix-like manner, namely in rows and columns. Each cell has its own attributes and associated data. Raster files are generally less computationally expensive than vector files. However, they require more storage space.
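Part of the reason raster lookups are cheap is that finding the cell for a coordinate is simple arithmetic followed by an array index. The Python sketch below assumes a hypothetical two-row raster of landscape labels, a lower-left origin, and square cells; none of these details come from a specific raster format.

```python
def cell_index(x, y, origin_x, origin_y, cell_size):
    """Map a coordinate to (row, col), with the origin at the grid's lower-left corner."""
    col = int((x - origin_x) // cell_size)
    row = int((y - origin_y) // cell_size)
    return row, col


# Hypothetical raster: each cell stores one attribute (a landscape label).
raster = [
    ["forest", "forest", "river"],
    ["rice", "road", "river"],
]


def landscape_at(grid, x, y, origin_x=0.0, origin_y=0.0, cell_size=100.0):
    """Return the attribute of the cell containing (x, y), or None if outside the grid."""
    row, col = cell_index(x, y, origin_x, origin_y, cell_size)
    if 0 <= row < len(grid) and 0 <= col < len(grid[0]):
        return grid[row][col]
    return None
```

The cost of the lookup does not depend on how complex the underlying landscape is, but every cell must be stored, which is where the extra storage goes.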

5.3.2 Vector Data

Vector data is coordinate-based and usually represents data as points, lines, or polygons. Vector files can more realistically represent spatial data in smaller storage space than raster files. GIS data is collected as coordinates, so there is typically much more precision when compared to raster files. Querying complex polygon-based vector files can be expensive, so the data is often approximated.
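To see why polygon queries can be expensive, consider testing whether a point lies inside a polygon. The Python sketch below uses the standard ray-casting test, whose cost grows with the number of vertices, alongside a bounding-box check as one simple form of approximation. It is a generic illustration and is not claimed to be the method used by any GIS toolkit named in this chapter.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices given in order."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate at which this edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def bounding_box(polygon):
    """Axis-aligned bounding box: a cheap approximation of the polygon."""
    xs = [px for px, _ in polygon]
    ys = [py for _, py in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def in_bounding_box(x, y, box):
    """Constant-time check that can over-report points as inside the polygon."""
    min_x, min_y, max_x, max_y = box
    return min_x <= x <= max_x and min_y <= y <= max_y
```

The bounding-box check is fast but imprecise: for a triangle, a point near the corner of the box opposite the hypotenuse passes the box test while lying outside the polygon.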

5.4 Integrating Geographic Information System Data into Agent-based Modeling

There have been several studies in which ABM has been combined with GIS data [25, 29, 51, 66, 146]. Few of these models have focused on infectious diseases, while even fewer have agents that intelligently move based on their current environment. Castle et al. [25] mention numerous toolkits and applications for coupling ABM and GIS yet fail to go beyond the incorporation of GIS data into a model and into the realm of its effective use. Crooks [29] more deeply describes the realm of space within ABM and offers example applications but does not specifically address the underlying issue of how agents can most efficiently access GIS data. Anwar et al. [3] describe a model built upon GIS data, but one that does not directly query it. Some models imply space, such as NOSOSIM [135], but few dynamically interact with GIS data. Gimblett [52], Keeling et al. [74], and Brown et al. [18] describe aspects of the integration of ABMs and GIS data, but do not go into detail regarding approaches to efficiently create GIS aware agents. Moreover, standard means of linking agents with GIS data are computationally expensive and are therefore not feasible for complex, large-scale simulation models. In many cases, only particular parts of a GIS are necessary for an ABM; utilizing a feature-rich GIS toolkit, such as ArcGIS [4], at simulation run-time is not typically advisable. We aim to advance the field through efficient and fast approaches to dynamically working with GIS data within an ABM.

Figure 5.1. Panels (a) Raster Data and (b) Vector Data show the northwest corner of Bali as represented by a raster and a vector file. Here, the raster file is less precise than the vector file.

5.5 Summary

This chapter has introduced simulation and modeling techniques, while focusing on ABM. We have also discussed GIS and its popular data formats, offering advantages and disadvantages. The difficulties in integrating GIS data with an ABM have also been described. Chapter 6 describes our simulation that integrates ABM and GIS data to model pathogen spread.

CHAPTER 6

A GIS AWARE AGENT-BASED MODEL OF PATHOGEN TRANSMISSION¹

6.1 Introduction

In this chapter, we describe an epidemiological model that incorporates spatial data as an influence on agent behavior and pathogen spread. In particular, we create an epidemiological model to simulate pathogen spread amongst long-tailed macaque monkeys, Macaca fascicularis, on the Indonesian island of Bali. GIS data is incorporated into our simulation, and we offer insight on how to efficiently integrate GIS data into a model, depending on the model’s complexity and needs. We note optimizations made along the way and compare our methods to conventional approaches. We conclude with results for our model. This work is performed with global public health goals in mind and could also be applied to model infectious diseases carried by arthropod vectors.

6.2 LiNK Simulation Model

We have created a model, the implementation of which is named LiNK after its creators (Lane, Niederweiser, and Kennedy) and further described in Lane [88], to aid in the understanding of pathogen transmission patterns. This model was designed to simulate the spread of infection amongst long-tailed macaques, shown in Figure 6.1, on the Indonesian island of Bali. We have coupled detailed GIS data with a detailed understanding of the macaque population to create a rich simulation. LiNK is deployed on a computing cluster at the University of Notre Dame [142]. Development of the model has been performed through the interdisciplinary collaboration of biologists, anthropologists, and computer scientists.

¹ Results from this chapter have appeared in Kennedy et al. [76, 79].

6.2.1 Model Background

Several zoonotic diseases have recently emerged on the Asian landscape; macaques have been implicated as both hosts and reservoirs in these disease emergences in . Anthropogenic landscape changes have increased the incidence of human to non-human primate interaction, potentially leading to bi-directional pathogen transmission events [32, 41, 88]. In our model, we evaluate how landscape changes might influence pathogen transmission patterns, based on the behavior and dispersal patterns of long-tailed macaques across the island of Bali. Our long-term aim is to answer the following research questions:

1. What are potential rates and routes of pathogen transmission in macaques across the island?

2. How do pathogen life history parameters impact this transmission?

3. Do the answers change with the inclusion of humans as a component of the landscape?

Landscape plays a very important role in these questions, necessitating the use of GIS data in our simulation.

A unique system of temples, one of which is shown in Figure 6.2, has existed on Bali for centuries; these temples and their associated forests act as refugia for the large populations of long-tailed macaques [47, 147]. The island itself is fairly small at 130 km × 80 km, yet it is an ideal size for study. Each of its roughly 40 temple populations consists of between 30 and 400 macaques. Existing behavioral and preliminary genetic evidence has documented the matrifocal society of the macaques, resulting in strong female philopatry [42, 46, 47, 88]. Females remain at their natal (birth) temples throughout their lives, and social status is inherited maternally. Typically, subdominant and subadult males disperse from their natal temple at around age seven, traveling to non-natal temple populations. Actual dispersal distances and rates are unknown.

The ability of long-tailed macaques to coexist with humans has enabled a number of macaque populations to thrive in areas where other primate species have become extinct [47]. On Bali, human land-use patterns have resulted in a mosaic of riparian forest, small forest patches, agricultural lands, and urban areas across much of the island. The broad distribution of macaque populations on Bali suggests that the macaques are utilizing the human-modified landscape as it currently exists. Due to the protection and resource availability at temples, macaques are able to thrive in moderately high densities alongside high-density human populations. This co-existence, particularly surrounding the temples, has created an ideal study environment for evaluating how both primate behavior and anthropogenic landscape changes influence pathogen transmission [41].

Figure 6.1. Adult Female Macaque and Infant. Photo courtesy of A. Fuentes.

6.2.2 Conceptual Model

Figure 6.2. Uluwatu Temple Site. This image shows the southern Balinese temple at Uluwatu.

The conceptual model was developed by K.E. Lane, with support from A. Fuentes and H. Hollocher. This research group has closely studied macaques and an array of pathogens for a number of years. The basic model consists of a display of Bali with temple sites and macaques. Users can also view the contents of a given temple and provide multiple model and pathogen parameter options. We next introduce the core components of our model and discuss them in greater detail in Section 6.2.3.

Agents: Our agents are macaques, each with their own properties, such as location, sex, age, natal temple, and infection status. Macaques move according to their surrounding environment, and males have the ability to enter and leave temples. Our model can support thousands of agents, easily supporting the roughly 10,000 macaques on Bali. We show a simplified transition diagram for the life cycle of our macaques in Figure 6.3.

Behavior: Macaques have the ability to move through their environment, interact with other macaques, reproduce, and die. Movement is dictated by their surrounding environment; macaques query their neighborhood and move appropriately. Macaques within a temple move randomly, with no GIS influence. All macaques have the ability to carry pathogens and can transmit pathogens when within a specified distance of one another. Reproduction is handled by allowing female macaques to produce offspring, with inherited traits, after they reach a specified age. As macaques age, they have a higher probability of dying.

Interface: Researchers interact with the model through a simple control panel, shown in Figure 6.4, that allows them to modify simulation parameters. Once the parameters are set, the user can begin running the simulation. The simulation is displayed using OpenMap [107] and is shown in Figure 6.5. Users can also see macaques within temples, as shown in Figure 6.6.

Figure 6.3. Life Cycle Transition Diagram. Macaques are always born in temple sites. Female macaques spend their entire lives within their natal temple. Mature male macaques disperse throughout the island through varying landscape with the ability to join other, non-natal, temples.

Figure 6.4. LiNK Control Panel. Here, we show the parameters a user can modify when running a simulation. GIS layers can be enabled or disabled, and pathogen parameters can be set.

Pathogens: LiNK has the ability to simulate a wide array of pathogens through the incorporation of several important pathogen parameters. The infectivity parameter refers to how infectious the pathogen is, while infectiousness is the distance within which a macaque must be of another macaque to be able to transmit the pathogen. Latency represents how long a macaque takes to become symptomatic after becoming infected, and virulence represents the deadliness of the pathogen. Acquired immunity refers to the amount of time a macaque is immune to contracting a pathogen after having been previously infected. Clearance time is the amount of time a macaque takes to be cleared of a pathogen. Finally, natural resistance represents the proportion of macaques that are immune to a given pathogen. Selected pathogen-related variables and their temporal relationships are shown in Figure 6.7.

A transition diagram for these variables is shown in Figure 6.8. LiNK has the ability to model one unique pathogen during a given simulation run.

Space: The macaques move about on 2D grids that represent temple sites and the island. The island grids are extrapolated from GIS data, at a customized granularity. For our purposes, a grid cell has sides of roughly 100 m, leading to over one million possible locations. Each grid is called a layer; we have a total of eight landscape layers: cities, forests, lakes, rice fields, rivers, roads, temples, and the actual island (called coast). We have three additional layers that serve as buffers that represent the impact of humans and water on infectivity. These eleven layers are melded together and use the same coordinate system. The coast and temple layers are required, while the remaining layers can be turned on or off.

Time: One time step in our simulation corresponds to 12 real-world hours.
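The figure of over one million locations follows from the island extent given in Section 6.2.1 and the roughly 100 m cell size. The Python sketch below works through that arithmetic and shows one way to express the rule that the coast and temple layers are always on. The constant and function names are hypothetical, and treating the full 130 km × 80 km rectangle as the grid is an assumption of this sketch.

```python
ISLAND_WIDTH_M = 130_000   # 130 km island extent
ISLAND_HEIGHT_M = 80_000   # 80 km island extent
CELL_SIZE_M = 100          # "roughly 100 m" grid cell side

cols = ISLAND_WIDTH_M // CELL_SIZE_M    # 1300 columns
rows = ISLAND_HEIGHT_M // CELL_SIZE_M   # 800 rows
total_cells = rows * cols               # 1,040,000 cells

# The eight landscape layers named in the text; the buffer layers are omitted here.
LAYERS = ["cities", "forests", "lakes", "rice fields", "rivers", "roads", "temples", "coast"]
REQUIRED = {"coast", "temples"}


def enabled_layers(optional_on):
    """Required layers are always enabled; the others only if the user turns them on."""
    return [name for name in LAYERS if name in REQUIRED or name in optional_on]
```

For example, `enabled_layers({"forests"})` keeps the forest layer together with the two required layers.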

Figure 6.5. LiNK Display. The figure above shows the display of our simulation. Here, we have the forests, lakes, and rivers layers enabled, as well as the actual island and the temple sites. Temples are shown as squares on the map. Green temples have no pathogens, while red temples have pathogens present within. Macaques are shown on the island as circles; they are green if they are healthy, pink if they are infected and not symptomatic, and red if infected and symptomatic. This screen capture has one infected temple and several infected macaques.

Figure 6.6. Temple Site Display. Here, we show the interior of a temple site. Male macaques are shown as solid circles and females as hollow circles. Macaques are green if healthy, pink if infected and not symptomatic, and red if infected and symptomatic. The user can choose which temple site to display at run-time.

Figure 6.7. Temporal Relationship of Pathogen Parameters and Related Events. The diagram above shows the relationships of the pathogen parameters in our simulation. Depending on the parameters used, macaques can become permanently immune to the modeled pathogen.

Figure 6.8. Pathogen Transition Diagram. Macaques generally begin as susceptible and then transition to other states after being infected. Macaques with a symptomatic infection can become reinfected and macaques can reinfect themselves (autoinfection). An acquired immunity is gained after most infections, but may be lost after a given amount of time.
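The pathogen parameters and transitions described in this section can be sketched as a small state function in Python. This is a simplified reading rather than the LiNK code: it assumes latency and clearance time are both counted in time steps from the moment of infection, treats infectivity as a percent chance per contact, and omits reinfection, autoinfection, virulence, and natural resistance. The default values are those listed for the pathogen in Table 6.2.

```python
import random
from enum import Enum


class State(Enum):
    SUSCEPTIBLE = "susceptible"
    LATENT = "latent"            # infected but not yet symptomatic
    SYMPTOMATIC = "symptomatic"
    IMMUNE = "immune"            # acquired immunity after clearance


def state_after(steps_since_infection, latency=4, clearance_time=28, acquired_immunity=120):
    """State of a macaque a given number of time steps after infection (None = never infected)."""
    if steps_since_infection is None:
        return State.SUSCEPTIBLE
    if steps_since_infection < latency:
        return State.LATENT
    if steps_since_infection < clearance_time:
        return State.SYMPTOMATIC
    if steps_since_infection < clearance_time + acquired_immunity:
        return State.IMMUNE
    return State.SUSCEPTIBLE     # acquired immunity has lapsed


def try_infect(rng, source, target, distance_cells, infectiousness=1, infectivity=10):
    """Attempt transmission from source to target; returns True if the target is infected."""
    if source not in (State.LATENT, State.SYMPTOMATIC):
        return False             # only infected macaques transmit
    if target is not State.SUSCEPTIBLE:
        return False             # immune or already infected (reinfection omitted here)
    if distance_cells > infectiousness:
        return False             # outside the ring of infectiousness
    return rng.random() < infectivity / 100.0
```

Under these assumptions a macaque infected at step 0 is latent for 4 steps, symptomatic until step 28, immune for a further 120 steps, and susceptible again afterwards.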

6.2.3 ODD Protocol Description of LiNK

Grimm et al. proposed [58] and recently updated [59] the ODD protocol for describing agent-based models, which consists of 1) the model Overview, 2) Design concepts, and 3) Details. The overview block consists of the purpose, state variables and scales, and process overview and scheduling elements. The details block is further divided into the initialization, input, and submodels elements.

This section describes LiNK according to the original ODD protocol.

6.2.3.1 Purpose

The purpose of the LiNK simulation model is to help understand the effect of landscape on the spread of pathogens among macaque monkeys on Bali, Indonesia.

6.2.3.2 State Variables and Scales

The spatially explicit model consists of agents representing macaque monkeys on the island of Bali, Indonesia. ESRI shapefiles serve as the backbone for the GIS in the model. Layers representing landscape include cities, forests, lakes, rice fields, rivers, and roads, as well as the island of Bali. We also utilize a layer representing the geographic location of 42 temple sites on the island and 3 additional layers we created that serve as buffers, namely to represent the impact of humans and water on infectivity. We abstract the shapefiles to a grid-based system on which movement amongst the layers is probability-based and relies, by default, on a Moore neighborhood, which is discussed further in Section 6.3.5.1. At each time step, macaques evaluate potential new positions, noting their current landscape and directional bias. Each new position is assigned a value, which is then normalized. The macaque then has the opportunity to move. Each range of values in Table 6.1 represents a weighted probability that a macaque will move from one landscape to another.

TABLE 6.1

MOVEMENT VALUES FOR DISPERSING MACAQUES

From \ To    City   Coast  Forest  Lake  Rice Field  River  Road
City         10-30  15-20  40-70   0     10-30       15-45  0-20
Coast        5-20   20-30  40-70   0     10-30       15-45  5-20
Forest       5-20   15-20  10-30   0     10-30       15-45  0-20
Rice Field   5-20   15-20  40-70   0     15-40       15-45  0-20
River        5-20   15-20  40-70   0     10-30       20-55  0-20
Road         5-20   15-30  40-70   0     10-30       15-45  5-30
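The assign-normalize-move sequence described above can be sketched in Python for a macaque currently standing in forest. The ranges are copied from the Forest row of Table 6.1. Drawing each weight uniformly from its range is an assumption of this sketch, since the text states only that values are assigned and normalized; directional bias is left out, and the function and variable names are hypothetical.

```python
import random

# (low, high) movement values from the Forest row of Table 6.1.
FROM_FOREST = {
    "city": (5, 20), "coast": (15, 20), "forest": (10, 30), "lake": (0, 0),
    "rice field": (10, 30), "river": (15, 45), "road": (0, 20),
}

# The eight surrounding cells of a Moore neighborhood, as (row, col) offsets.
MOORE_OFFSETS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]


def choose_move(rng, neighbors, ranges):
    """Pick a neighboring cell; neighbors maps each offset to that cell's landscape type."""
    offsets = list(neighbors)
    # Assumption: each candidate's value is drawn uniformly from its Table 6.1 range.
    weights = [rng.uniform(*ranges[neighbors[offset]]) for offset in offsets]
    total = sum(weights)
    if total == 0:
        return (0, 0)            # no reachable cell (e.g., surrounded by lake): stay put
    probabilities = [w / total for w in weights]   # normalize to sum to 1
    return rng.choices(offsets, weights=probabilities, k=1)[0]
```

Because lake cells carry a value of 0, they are never selected, while river and forest cells tend to attract more moves than cities or roads.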

State variables for LiNK are described in Table 6.2. A time step of 12 hours was chosen in conjunction with a grid cell size of 111 meters to obtain the appropriate level of precision based on our knowledge of macaque behavior. Movement probabilities were also chosen in accordance with studied macaque behavior.

TABLE 6.2

STATE VARIABLES

Variable                    Value

Model
  Dispersal Deaths per Day  7.14E-4 (2% every 2 weeks)
  Autoinfection             True
  Initial Infected Temples  1
  Natural Resistance        1% of population
  Temples                   Temples populated with realistic numbers
  Time step                 12 hours
  Grid cell size            111 meters

Macaque
  Sex                       Temples: 75% female, 25% male
                            Dispersing: 100% male
  Age                       50% adult (8-18 yrs male; 8-20 yrs female),
                            50% juvenile (0-8 yrs)
  Latitude                  Random within island bounds
  Longitude                 Random within island bounds
  Natal Temple              Random
  Directional Bias          Random
  Current Landscape         Based on latitude and longitude
  Infected                  True if infected
  Sick Steps                Number of time steps infected to date
  Symptomatic               True if symptomatic

Pathogen
  Infectiousness            1 grid cell
  Infectivity               10 (0-100 range)
  Virulence                 80 (0-100 range)
  Clearance Time            28 time steps
  Natural Resistance        1% of population
  Latency                   4 time steps
  Acquired Immunity         120 time steps
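The dispersal death value in Table 6.2 can be checked with a short calculation. The tabulated 7.14E-4 matches 2% divided evenly across the 28 twelve-hour time steps in two weeks. Reading it as a per-time-step rate is an assumption of this Python sketch, since the table labels the variable per day, and the variable names are hypothetical.

```python
HOURS_PER_STEP = 12                               # one time step, per Table 6.2
steps_per_two_weeks = 14 * 24 // HOURS_PER_STEP   # 28 time steps in two weeks

# Spread the 2% two-week mortality evenly over those time steps.
rate_per_step = 0.02 / steps_per_two_weeks

# Applying that rate independently at every step gives a cumulative two-week
# mortality slightly below 2%, because survivors compound across steps.
cumulative_two_weeks = 1.0 - (1.0 - rate_per_step) ** steps_per_two_weeks
```

Formatting `rate_per_step` to three significant figures reproduces the 7.14E-4 shown in the table.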

6.2.3.3 Process Overview and Scheduling

The LiNK model is event-driven. At each time step, a specified number of events are scheduled and executed, macaque by macaque. Macaques are handled in two groups: those dispersing and those within temples. Dispersing macaques are processed first. We begin by incrementing the macaque’s age and then allowing the macaque to move according to the movement function. Next, each infected macaque has the opportunity to transmit infection and to die from infection. Death is also possible as a result of dispersal deaths per day, virulence, or macaque age. Finally, each dispersing macaque has the opportunity to enter a temple, depending on his proximity to it.

Within temples, the process is similar. We begin by aging the macaques and next remove them if their age or sickness meets the removal criteria. If a macaque’s previous coordinates exceed those of the temple bounds and if the macaque is a male of appropriate age, the macaque leaves the temple according to a given probability. Female macaques between 3 and 13 years of age have a 25% annual chance of giving birth. Finally, we simulate the pathogen and randomly move macaques within the temples.
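Because the model advances in 12-hour time steps, an annual probability such as the 25% birth chance has to be applied on some schedule. The text does not state how LiNK schedules this check, so the Python sketch below shows just one possibility: converting the annual chance into an equivalent per-time-step probability. The function names and the choice of 730 steps per year are assumptions.

```python
import random

STEPS_PER_YEAR = 365 * 2   # two 12-hour time steps per day


def per_step_probability(annual_probability, steps_per_year=STEPS_PER_YEAR):
    """Per-step probability whose compounding over a year equals the annual probability."""
    return 1.0 - (1.0 - annual_probability) ** (1.0 / steps_per_year)


def gives_birth(rng, age_years, annual_probability=0.25):
    """Stochastic birth check for one female at one time step (ages 3 to 13 only)."""
    if not 3 <= age_years <= 13:
        return False
    return rng.random() < per_step_probability(annual_probability)
```

The conversion solves 1 - (1 - p)^730 = 0.25 for p, so checking every time step leaves the chance of at least one birth event per year at 25%. Note that, unlike a single yearly check, this schedule would allow more than one birth event in a year unless the model prevents it.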

6.2.3.4 Design Concepts

Emergence: Influenced by the landscape, patterns of disease spread across Bali emerge over time.

Sensing: Macaques know their current and surrounding landscape, which they use to make movement decisions.

Interactions: Macaques interact with other macaques only to transmit pathogens. When a macaque is within the ring of infectiousness of an infected macaque, it has the possibility to become infected.

Stochasticity: Survival in the model is stochastic; pathogens and the dispersal death rate directly affect survival rate. Movement is also probability-based. Certain landscapes are more desirable than others, and macaques move with a directional bias, both of which factor into movement decisions. Births are stochastic such that females between the ages of 3 and 13 have a 25% chance to give birth each year. Offspring are equally likely to be male or female. Finally, macaques located within temples often attempt to move beyond the bounds of the temple. This is permissible only a small percentage of the time and only for males of a specified age.

Observation Data is collected based on events. Namely, each infection, death, birth, and transition between a temple and the landscape is recorded in the output file. The model is observed through its GUI (graphical user interface) and also through analysis of the output file. We have written a separate program named LiNKStat (described in Section 6.5.1) that presents and performs basic analysis of the output.
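The birth rule described under Stochasticity can be illustrated with a minimal Python sketch. LiNK itself is written in Java, so this is not the model's code; the function name, the one-call-per-simulated-year convention, and the 'F'/'M' labels are assumptions made for illustration. Only the 25% annual chance, the 3 to 13 year age window, and the equal sex ratio come from the text.

```python
import random

BIRTH_PROB_PER_YEAR = 0.25  # annual birth chance stated in the text
MIN_BREEDING_AGE = 3        # years, stated in the text
MAX_BREEDING_AGE = 13       # years, stated in the text


def maybe_give_birth(age_years, is_female, rng):
    """Return the newborn's sex ('F' or 'M'), or None if no birth occurs.

    Hypothetical helper: one call represents one simulated year.
    """
    if not is_female:
        return None
    if not (MIN_BREEDING_AGE <= age_years <= MAX_BREEDING_AGE):
        return None
    if rng.random() >= BIRTH_PROB_PER_YEAR:
        return None
    # Offspring are male or female with equal probability.
    return "F" if rng.random() < 0.5 else "M"
```

Passing the random number generator in explicitly (for example `random.Random(42)`) keeps runs reproducible, which helps when tracing individual entities during verification.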

6.2.3.5 Initialization

Upon initialization, several things are constant. First, the number of macaques within each temple site is always the same and is based on scientific data. Second, the landscape layers available are always the same; however, the number of landscape layers that are enabled varies. The initial geographic placement of macaques, both inside the temples and dispersing, is random. The initial values of the parameters were chosen based upon observation and prior studies. Pathogen parameters are varied according to the characteristics of a given pathogen.

6.2.3.6 Input

The input to the model includes the GIS shapefiles representing the various landscape features of Bali. These were collected as part of a dissertation [132].

6.2.3.7 Submodels

Pathogens When a macaque becomes infected, it traverses through a variety of pathogen-related states. Upon infection, a macaque enters a latent state, which refers to how long it takes the macaque to become symptomatic. A latent macaque is also able to transmit the pathogen to other macaques. After completing the symptomatic phase, a macaque will become free of infection and clear of the pathogen, meaning it can no longer transmit the pathogen. The macaque will also enter an acquired immunity phase during which it will not be able to become infected. Transmission of the pathogen between macaques depends on infectiousness and infectivity. Infectiousness refers to the transmission ring which both macaques have to be within to transfer infection; infectivity is the chance that the infection will take place. Virulence reflects the deadliness of the pathogen. Figure 6.7 shows the temporal relationships for selected pathogen-related states.

Movement The higher the virulence of the pathogen infecting a macaque, the smaller the chance that the macaque will move. While movement within temples is random, movement amongst dispersing macaques is complex. In its simplest form, macaques move about a Moore neighborhood, namely the eight immediately surrounding grid locations. At each time step, dispersing macaques consider their previous movement direction, their current landscape, and the landscape in their Moore neighborhood to determine their next location. We utilize the numbers in Table 6.1 to quantify the likelihood of a macaque leaving one landscape for another. This is combined with the macaque’s current direction of travel, and the new location (if any) is determined stochastically. The mechanism of movement is independent of the number of layers enabled for any given simulation run.
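The progression of states described in the Pathogens submodel can be sketched as a small state function. This is an illustrative Python sketch rather than LiNK's Java implementation. The duration parameters, the function names, and the assumption that a macaque becomes susceptible again once acquired immunity ends are hypothetical; the text only states that latent macaques can transmit and that immune macaques cannot be infected.

```python
import random
from enum import Enum


class State(Enum):
    SUSCEPTIBLE = "susceptible"
    LATENT = "latent"
    SYMPTOMATIC = "symptomatic"
    IMMUNE = "immune"


# Per the text, a latent macaque can already transmit the pathogen.
INFECTIOUS = {State.LATENT, State.SYMPTOMATIC}


def state_at(steps_since_infection, latent_steps, symptomatic_steps, immune_steps):
    """Return the pathogen state a given number of time steps after infection."""
    t = steps_since_infection
    if t < latent_steps:
        return State.LATENT
    if t < latent_steps + symptomatic_steps:
        return State.SYMPTOMATIC
    if t < latent_steps + symptomatic_steps + immune_steps:
        return State.IMMUNE
    # Assumption: immunity wanes and the macaque can be infected again.
    return State.SUSCEPTIBLE


def attempt_transmission(distance, infectiousness_radius, infectivity, rng):
    """Infection requires being inside the ring and passing a random draw."""
    return distance <= infectiousness_radius and rng.random() < infectivity
```

Here `infectiousness_radius` stands in for the transmission ring and `infectivity` for the per-contact chance of infection; virulence, which affects death and movement, is left out of this sketch.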

6.2.4 Implementation

There are several tools and technologies utilized in LiNK. The model is coded in Java [67] with the Repast simulation toolkit [118]. We utilize Repast and OpenMap [107] to display the model and GeoTools [49] and JTS Topology Suite [70] to interact with the spatial information. The choice of tools used in this study was primarily driven by the necessity to process and visualize GIS data and to be cross-platform and open-source.

6.2.5 Verification and Validation

Simulations are credible only once they have passed some form of verification and validation analysis. Verification refers to building the model right, meaning that the simulation model matches the abstract model. Validation refers to building the right model, meaning the correct abstract model was chosen. ABMs must undergo and pass several subjective and quantitative verification and validation techniques to be considered useful models [7, 9, 81, 149]. Figure 6.9 shows common techniques for verifying and validating ABMs, adapted from Kennedy et al. [75].

The LiNK model was developed in conjunction with domain experts from multiple fields and has undergone extensive face validation, both through its display and evaluation of its output. We have also checked for internal validity and traced entities of the model. Much of this work has been performed through the use of LiNKStat, which we describe in Section 6.5.1. We are currently collecting additional real-world data that we will use in conjunction with the existing data to continue docking LiNK and to examine LiNK’s predictive power.

Figure 6.9. Verification and Validation Techniques for Agent-based Models. Here, we show techniques we used and plan to use for the verification and validation of LiNK.

6.3 GIS Data and Agent-Based Modeling

In this section, we describe common methods to utilize GIS data in an agent-based simulation environment. We also describe our improvements to these techniques. We conclude this section with details on our spatially aware agents.

6.3.1 Approximating GIS Data in Simulations

When an ABM environment is built upon GIS data, queries can be expensive, particularly with complex data or movement. As a general rule, the more complex the GIS data, the more difficult it is to efficiently utilize it within an ABM. Additionally, the more GIS data that is available, such as multiple landscape features, the more time-consuming it will be for agents to query. Put simply, at each time step, an agent needs to query its unknown surroundings and make a decision regarding its next move. The more GIS data there is, the longer this will take. A common solution is to approximate GIS data to the level of granularity required for a given model. As such, the amount of GIS data is decreased while the integrity of the data required is maintained. We next describe several ways to access GIS data from a simulation, offering advantages and disadvantages for each.

6.3.2 Raster Queries

Raster-based (cell-based) spatial queries made through a spatial package can be costly, as the mechanisms by which agents access this data are typically not optimized for use in simulations. Additionally, storing and loading potentially large raster data files is inefficient at simulation run-time, particularly when not all of the data is necessary. Raster files are also not ideal for representing complex GIS data where fine-scale granularity is required. An advantage of utilizing raster data in an ABM is that it easily maps to traditional ABM grid spaces.

6.3.3 Spatial Queries

Spatial queries on vector-based (coordinate) GIS data are the most accurate way an agent can interact with GIS data. Here, an agent simply performs mathematical queries on the loaded GIS data to determine its surroundings. While very accurate, the cost of performing a spatial query increases as the complexity of the data increases. For example, it may be mathematically simple to query a rectangle to see whether an agent is contained within it; however, it is mathematically very expensive to do the same query on a large polygon. Repeatedly performing such queries is especially expensive, and this problem is exacerbated as the number of agents and the amount of spatial data increase. While indexing spatial data alleviates some redundancy, queries are still expensive.
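The difference in cost between the rectangle and polygon queries mentioned above can be illustrated with a short sketch. This is generic Python, not the GeoTools or JTS calls used in LiNK, and the function names are hypothetical. The rectangle test is a constant number of comparisons, whereas the ray-casting polygon test visits every edge, so its cost grows with the number of vertices.

```python
def in_rectangle(x, y, min_x, min_y, max_x, max_y):
    """Constant-time containment test for an axis-aligned rectangle."""
    return min_x <= x <= max_x and min_y <= y <= max_y


def in_polygon(x, y, vertices):
    """Ray-casting containment test; cost is linear in len(vertices).

    vertices is a list of (x, y) tuples describing a simple polygon.
    Points exactly on the boundary are not handled specially.
    """
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Does this edge straddle the horizontal line through (x, y)?
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside
```

For a coastline with roughly 10,000 vertices, every agent at every time step would pay for about 10,000 edge checks per layer with this approach, which is the redundancy the following sections aim to remove.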

6.3.3.1 Simplified Spatial Queries

The performance of spatial queries can be improved if the vector data is approximated in a manner such that the number of vertices in a line or polygon is decreased, while maintaining an appropriate level of data integrity. The Douglas-Peucker algorithm [34] is commonly used to perform such simplifications. This technique offers a performance gain over traditional spatial queries, but at a cost of less accurate spatial data. However, repeatedly performing similar or identical spatial queries remains redundant, a problem that can be remedied. Figure 6.10 shows a roughly 99% reduction in data points (from approximately 10,000 to 100) that maintains considerable data integrity for Bali’s outline.

(a) 10,000 Data Points

(b) 100 Data Points

Figure 6.10. Panels (a) and (b) represent Bali, Indonesia with approximately 10,000 and 100 data points, respectively. Here, we reduce the number of points by approximately 99%, but still retain considerable data integrity.
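The kind of vertex reduction shown in Figure 6.10 can be sketched with a compact recursive version of the Douglas-Peucker algorithm. This Python sketch is for illustration only and is not the routine used to produce the figure. The tolerance value is a free parameter chosen by the modeler, and this variant measures distance to the chord segment rather than to the infinite line.

```python
import math


def _point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, tolerance):
    """Return a simplified copy of a polyline given as (x, y) tuples."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    max_dist, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = _point_segment_distance(points[i], first, last)
        if d > max_dist:
            max_dist, index = d, i
    if max_dist <= tolerance:
        # Every interior point is close enough to the chord: drop them all.
        return [first, last]
    # Otherwise keep the farthest point and simplify both halves.
    left = douglas_peucker(points[: index + 1], tolerance)
    right = douglas_peucker(points[index:], tolerance)
    return left[:-1] + right
```

A larger tolerance removes more vertices and speeds up later spatial queries at the cost of accuracy. For very long and irregular outlines, an iterative formulation may be preferable to avoid deep recursion.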

6.3.4 Precalculated Query Matrix

Recognizing the drawbacks of earlier techniques, we developed and utilized a technique involving precalculated query matrices to create spatially aware agents. This technique relies on the advantages of raster data while utilizing the accuracy of vector data. Here, vector files are used in conjunction with spatial queries to build arrays of spatial data. Specifically, we iterate through the vector data, at a specified granularity, and perform spatial queries at each point. The result of the query is stored in the matrix for that specific layer as a Boolean value which specifies whether a given landscape is present. This process is shown in Algorithm 2 and is performed for all available spatial data. The run-time for Algorithm 2 is O(xyl), where x and y are the number of latitude and longitude values and l is the number of matrices. The number of matrices refers to the number of landscape layers in use. While time-consuming, the expensive queries only need to be performed once for a given granularity, prior to simulation run-time. We utilize serialization to load the arrays into the simulation, and agents can access the data in constant time. The main disadvantage of this method is that arrays of finer granularity take longer to build and are larger, with slightly longer query times. The advantages include agents that can more quickly query their environment and a simulation that scales well, both in terms of the amount of GIS data available and in the number of agents. Researchers also have the advantage of choosing a granularity to fit their needs. Currently, we use multiple precalculated query matrices in LiNK.

Algorithm 2 BuildPrecalculatedQueryMatrix
  Let X be the set of latitude values
  Let Y be the set of longitude values
  Let L be the set of GIS layers
  Let M be the Precalculated Query Matrix for a layer
  for all x ∈ X do
    for all y ∈ Y do
      for all l ∈ L do
        M_l(x, y) ← SpatialQuery(l, x, y)
      end for
    end for
  end for
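Algorithm 2 can be sketched in Python as follows. This is an illustrative sketch, not the Java code used in LiNK. The `spatial_query` callable is a placeholder for whatever GIS library performs the real containment test, and `in_bounding_box` is a deliberately trivial stand-in so that the example is self-contained.

```python
def in_bounding_box(box, lat, lon):
    """Stand-in spatial query: is (lat, lon) inside an axis-aligned box?"""
    min_lat, min_lon, max_lat, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def build_query_matrices(lat_values, lon_values, layers, spatial_query):
    """Build one Boolean matrix per GIS layer (the loops of Algorithm 2).

    layers: dict mapping a layer name to its geometry (opaque here)
    spatial_query: callable (geometry, lat, lon) -> bool
    Run-time is O(x * y * l) spatial queries, paid once before the simulation.
    """
    matrices = {
        name: [[False] * len(lon_values) for _ in lat_values]
        for name in layers
    }
    for i, lat in enumerate(lat_values):
        for j, lon in enumerate(lon_values):
            for name, geometry in layers.items():
                matrices[name][i][j] = bool(spatial_query(geometry, lat, lon))
    return matrices
```

Once built, an agent at grid cell `(i, j)` checks a layer with a single indexed read such as `matrices["lake"][i][j]`, which is the constant-time access described above. The matrices could be written to disk and reloaded with any serialization mechanism; Python's `pickle` module would be one option in this sketch.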

6.3.5 GIS Aware Agents

In traditional ABMs, agents move about a grid-like structure. GIS aware agents move about the same structure, but in a manner such that each move is influenced by the surrounding environment, including nearby agents. A simple example would be allowing agents to move preferentially into one landscape over another. Previously, we listed ways by which agents can query their environment. Our agents are able to adequately and efficiently survey their surroundings, making use of that data to become spatially aware. We utilize precalculated query matrices for movement decisions. To display this movement on the native vector data, we use hash tables to “map” the native GIS latitude and longitude points to our matrices, and vice versa. This mapping avoids repetitive calculations, while allowing agents to find their real-world coordinates quickly. This also assists in enabling agents to move with complex rules, which we next describe.
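The two-way mapping between real-world coordinates and matrix cells can be sketched with a pair of dictionaries, Python's hash tables. This is an assumed layout for illustration and not LiNK's actual data structure. The regular grid spacing, the rounding to six decimal places, and the function name are all hypothetical.

```python
def build_coordinate_maps(min_lat, min_lon, step, n_rows, n_cols):
    """Return (to_index, to_coord) hash tables for a regular grid.

    to_index maps a (lat, lon) pair to its (row, col) matrix cell;
    to_coord maps a (row, col) matrix cell back to its (lat, lon) pair.
    Coordinates are rounded so that equal grid points hash identically.
    """
    to_index = {}
    to_coord = {}
    for row in range(n_rows):
        for col in range(n_cols):
            coord = (round(min_lat + row * step, 6),
                     round(min_lon + col * step, 6))
            to_index[coord] = (row, col)
            to_coord[(row, col)] = coord
    return to_index, to_coord
```

Both tables are filled once, so converting between a display coordinate and a matrix cell during the simulation is a single lookup rather than a repeated calculation.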

6.3.5.1 Movement

Adding movement to agents in a GIS-based environment is challenging. With raster data, agents must perform tedious queries through the GIS system to determine the surrounding landscape. Spatial queries are inefficient too, as the queries can be redundant and take considerable time. Utilizing precalculated query matrices enables us to create many agents with complex and realistic movements that are computed rapidly.

In traditional ABM cellular automata spaces, agent behavior is based on a von Neumann or Moore neighborhood. Specifically, a von Neumann neighborhood comprises the four cells immediately adjacent to the current cell in a traditional square grid. A Moore neighborhood extends this to the eight surrounding cells, including those diagonally adjacent. Performing spatial queries on such spaces would be tedious and inefficient, particularly if the neighborhood was extended beyond a Moore neighborhood.

In our model, spatial movement is based on a Moore neighborhood, with allowance for larger neighborhoods. To move intelligently, agents must know the landscape they are currently in as well as the surrounding landscape. To represent possible transitions from one cell to another, we use a matrix of probabilistic movement values. This matrix consists of values representing the likelihood that an agent would move from a given landscape to another, shown in Table 6.1. These values were determined after discussions with domain experts. Calculations are performed for each of the cells in the Moore neighborhood. A directional bias is also added to the agents so they are more likely to continue in the same general direction. Once the values for the surrounding cells have been calculated, they are normalized. We then use probabilities to determine the next location for the agent, if it moves at all. These calculations are performed quickly, as the look-ups for the surrounding cells can be performed in constant time, allowing for realistic movement among agents. Figure 6.11 shows a simplified version of our movement on an example grid, and Algorithm 3 describes dispersed movement algorithmically (its run-time depends on the number of possible new locations). WeightedSelectOnAdjust refers to selecting the new location based upon the normalized probabilities.

Intelligent agents can be classified as simple reflex, model-based reflex, goal-based, utility-based, or learning agents [123]. Based on the movement decisions described previously, our agents could be classified as utility-based, but with stochastic utility functions and decisions. This classification fits our agents because they make decisions based upon utility: they are more content in certain landscapes, and their contentment is determined by their previous location and current landscape.

Algorithm 3 DispersedMovement
  Let L_{t+1} be the set of possible locations for the next time step
  Let l_{t+1} be the new location
  Let b_1 be the directional bias
  Let b_2 be the landscape bias
  for each time step t do
    for all l ∈ L_{t+1} do
      l ← b_1 + b_2
    end for
    l_{t+1} ← WeightedSelectOnAdjust(l ∈ L_{t+1})
  end for
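One time step of Algorithm 3 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the LiNK implementation. The landscape grid, the `transition_weight` table (a stand-in for the values in Table 6.1, which are not reproduced here), and the additive `direction_bonus` are hypothetical, and the option of not moving at all is reduced to the case where every neighbor has zero weight.

```python
import random

# The eight (row, col) offsets of a Moore neighborhood.
MOORE_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                 (0, 1), (1, -1), (1, 0), (1, 1)]


def choose_next_cell(position, previous_direction, landscape,
                     transition_weight, direction_bonus, rng):
    """Pick the next cell by weighted random selection over the neighbors.

    landscape: 2D list of landscape names
    transition_weight: dict keyed by (from_landscape, to_landscape)
    previous_direction: (d_row, d_col) of the last move, or None
    """
    row, col = position
    current = landscape[row][col]
    candidates, weights = [], []
    for d_row, d_col in MOORE_OFFSETS:
        r, c = row + d_row, col + d_col
        if not (0 <= r < len(landscape) and 0 <= c < len(landscape[0])):
            continue  # neighbor lies outside the grid
        b2 = transition_weight[(current, landscape[r][c])]  # landscape bias
        same_heading = (d_row, d_col) == previous_direction
        b1 = direction_bonus if same_heading else 0.0        # directional bias
        candidates.append((r, c))
        weights.append(b1 + b2)
    total = sum(weights)
    if total == 0:
        return position  # no admissible neighbor: stay in place
    probabilities = [w / total for w in weights]  # normalize the weights
    return rng.choices(candidates, weights=probabilities, k=1)[0]
```

Because each neighbor's landscape is read from a precalculated matrix, the cost of a move depends only on the size of the neighborhood, not on the complexity of the underlying GIS data.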

Figure 6.11. Macaque Movement. The graphic above shows how a macaque M determines where to move in a landscape consisting of forests (green) and a river (blue). There are movement probabilities associated with landscape features. For example, a macaque would be more likely to enter a forest than a river. Here, we base movement on the immediate surrounding cells; however, it can be based on an arbitrary number of cells in an outward direction.

6.4 Results

LiNK has started to demonstrate the importance of landscape in the scope of epidemiological modeling [88]. The model has been improved in terms of speed and scalability through an abstraction of typical GIS data representation. We have shown the ability to have many agents interact with complex spatial data in a time frame adequate for a simulation while still addressing the research questions mentioned in Section 6.2.1 at a high level. Additionally, we have started to show the impact of landscape on pathogen transmission, as shown in Figures 6.12 and 6.13, which is thus far in accordance with real-world data from Roberts and Janovy [120]. To date, it appears that virulence is the dominant factor in terms of pathogen spread. Further sensitivity analysis and additional verification and validation are ongoing.

[Figure 6.12, panel (a): Total Number of Infections by Landscape. Bar chart of Number of Infections (axis range 0 to 1,200,000,000) for Histo and Dispar at temple sites 1, 27, 28, and 38, grouped by landscape (Heterogeneous, Homogeneous).]

[Figure 6.12, panel (b): Total Number of Infections by Population Size. Bar chart of Number of Infections (axis range 0 to 1,200,000,000) for Histo and Dispar at temple sites 1, 28, 27, and 38, grouped by population size (Large Population, Small Population).]

Figure 6.12. Panels (a) and (b) show the total number of infections at four temple sites. Temple sites 1 and 27 are in heterogeneous areas, meaning there are many landscape types present. Temple sites 28 and 38 are in homogeneous areas. Additionally, temple sites 1 and 28 consist of a small population, while temple sites 27 and 38 consist of a large population. Dispar refers to Entamoeba dispar and is an avirulent parasite, while Histo refers to Entamoeba histolytica and is highly virulent. Panel (a) shows that the diversity of landscape in which the pathogen is spread has little effect on the total number of infections. Panel (b) shows the same data grouped by population, showing that population has little effect on the total number of infections. From this data, we conclude that virulence has the highest impact on the total number of infections, while landscape has relatively little impact.

Figure 6.13. Pathogen Spread to Varying Temple Sites. The figure above shows the number of infected macaques that reach temple sites throughout the island after having vacated the temple denoted with the red star. The western part of the island is highly homogeneous, allowing for the pathogen to spread further. The pathogen likely does not spread to the north central part of the island due to its heterogeneous landscape.

6.4.1 Performance

The model has utilized the techniques described in Sections 6.3.2-6.3.4 to interact with GIS data. We started with hefty raster-based queries and refined our method until we achieved the balance of specificity and speed we desired. Table 6.3 shows the initial GUI load time for the model for each technique, and Table 6.4 and Figure 6.14 show the number of time steps simulated per second for each query mechanism. These tables and figure show averages over multiple simulation runs with either the coast and lakes layers or the coast, lakes, and forests layers enabled, all with the same number of initial agents. Spatial queries were predictably slowest, as the raw vector files contain an enormous amount of detail, making calculations expensive. Utilizing raster data offers a significant improvement but with the drawback of a long initial startup time. Our simplified spatial query greatly improves upon the traditional spatial query, but performance drops significantly as more layers are added. Utilizing precalculated query matrices produces the fastest simulation, with even greater gains when the display is disabled. Table 6.5 and its corresponding Figure 6.15 show the scalability, in terms of number of agents, for the raster and precalculated query matrix methods. The precalculated query matrix method scales very well as the amount of GIS data increases and adequately as the number of agents increases; overall, it offers the best and most scalable results. All performance tests were run on a single core as a single thread on a Core 2 Duo 2.0 GHz laptop, highlighting further potential in scalability. Numbers listed in the figures are averages of 10 simulation runs.

Additionally, LiNK has been adapted to run on a high-performance computing cluster, making it easy to automate, greatly increasing its utility.

TABLE 6.3
PERFORMANCE COMPARISON OF GUI LOAD TIME

GUI Load Time (s)

Query Method                  Coast, Lakes    Coast, Lakes, Forests
Spatial Query                 3.5             3.5
Raster Query                  35              42
Simplified Spatial Query      1.8             2.5
Precalculated Query Matrix    1.6             2

TABLE 6.4
PERFORMANCE COMPARISON OF TIME STEPS/S

Time steps/s

Query Method                           Coast, Lakes         Coast, Lakes, Forests
Spatial Query                          1.6                  0.15
Raster Query                           18.5 (11x faster)    19 (126x)
Simplified Spatial Query               39.5 (25x)           15.8 (105x)
Precalculated Query Matrix             126.2 (79x)          124.1 (827x)
Precalculated Query Matrix, non-GUI    669.6 (419x)         650.2 (4335x)

[Figure 6.14: chart of Timesteps/s (logarithmic axis, 0.1 to 1000) for the Spatial, Raster, Simplified Spatial, Precalculated Query Matrix, and Precalculated Query Matrix, non-GUI methods. Series: Coast, Lakes; Coast, Lakes, Forests.]

Figure 6.14. Performance Comparison of Varying Query Methods. The figure shows that we obtained roughly an order of magnitude performance increase in going from spatial queries to raster and simplified spatial queries, and then almost another order of magnitude in going from raster queries to precalculated query matrices. Finally, disabling the GUI offers nearly another order of magnitude improvement. It is also notable that enabling more layers in non-GUI mode adds almost no performance hit. We show the figure above with a logarithmic scale.

TABLE 6.5
SCALABILITY COMPARISON OF TIME STEPS/S

Time steps/s, by Number of Initial Dispersed Macaques

Query Method                                      10       100      1000
Raster Query (3 Layers)                           51.3     29       19.9
Raster Query (7 Layers)                           33.6     27.6     11
Precalculated Query Matrix (3 Layers)             140.7    131.4    83.8
Precalculated Query Matrix (7 Layers)             137.5    129.5    82.9
Precalculated Query Matrix, non-GUI (3 Layers)    669.8    487.8    154
Precalculated Query Matrix, non-GUI (7 Layers)    680.4    529.6    158.2

[Figure 6.15: chart of Timesteps/s (logarithmic axis, 10 to 1000) against Number of Initial Dispersed Macaques (10, 100, 1000). Series: Raster Query (3 Layers); Raster Query (7 Layers); Precalculated Query Matrix (3 Layers); Precalculated Query Matrix (7 Layers); Precalculated Query Matrix, non-GUI (3 Layers); Precalculated Query Matrix, non-GUI (7 Layers).]

Figure 6.15. Scalability with Respect to Initial Number of Dispersed Macaques and Amount of GIS data. Here, we show simulations starting with 10, 100, and 1000 dispersed macaques across different querying mechanisms. The precalculated query matrix method performs best in all cases, even better with 1000 agents than other methods with 10 agents. The figure is shown on a logarithmic scale.

6.5 Analyzing Massive Amounts of Simulation Data

LiNK is a complex model; as such, it creates enormous amounts of output, up to terabytes for a given experiment. To support scientific insight and validation, LiNK tracks a wide array of events, including infections, births, deaths, and when a macaque enters or leaves a temple. When simulations are run over a long period of time, it is not uncommon to have tens of millions of events, or more. We have created an interactive graphical tool, originally named LiNKStat, to analyze output from LiNK.

6.5.1 LiNKStat

Written in Perl and Tcl, LiNKStat parses through output files and builds graphs to gather statistics about the model. Much of the initial analysis and graph building is done automatically following a simulation run. For example, LiNKStat allows users to track the route of infection from a given macaque, obtaining statistics such as number of macaques directly or indirectly infected. Such statistics help subject matter experts collect insight from LiNK. A screen capture of LiNKStat is shown in Figure 6.16 and an example graph from LiNKStat is shown in Figure 6.17. LiNKStat is efficient, with a run-time mainly dependent on the number of infection events and their degree of proliferation. The techniques used in LiNKStat have been generalized and published as P-SAM [6].
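The kind of statistic described above, the number of macaques directly or indirectly infected by a given macaque, can be sketched as a breadth-first traversal of the infection graph. LiNKStat itself is written in Perl and Tcl, so this Python sketch is not its code. The event tuple layout and the function name are hypothetical, and for simplicity the traversal ignores the time order of events.

```python
from collections import defaultdict, deque


def infection_counts(events, source):
    """Count macaques infected directly and indirectly by `source`.

    events: iterable of (time_step, infector_id, infected_id) tuples
    Returns (number_directly_infected, number_indirectly_infected).
    """
    children = defaultdict(set)
    for _, infector, infected in events:
        children[infector].add(infected)
    # Autoinfection does not count as infecting another macaque.
    direct = set(children[source]) - {source}
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    indirect = seen - direct - {source}
    return len(direct), len(indirect)
```

The traversal visits each infection event at most once after the graph is built, which is consistent with a run-time that depends mainly on the number of infection events and how far they proliferate.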

Figure 6.16. LiNKStat. This screen capture shows one of the analysis tabs of LiNKStat. The left column displays an interactive list of macaques in the simulation that updates the middle right panel with specific infection statistics. These statistics form graphs, an example of which is shown in Figure 6.17. LiNKStat has been and will continue to be very helpful in the verification and validation of LiNK.

Figure 6.17. LiNKStat Pathogen Transmission Graph. The graph above allows us to visually track pathogen transmission, helping with validation and interpretation of output. Nodes refer to macaques, with the naming convention being natal temple number concatenated with an id concatenated with a sex identifier. For example, the topmost node would be parsed as a female macaque with temple 27 as its natal temple and 2969 as its id. Transitions are infection events, listed with the time step and location where the infection occurred. Starting at the top, macaque 27.2969.0 infected macaque 27.2775.0 at time step 1, in temple 27. Macaque 27.2775.0 went on to infect four other macaques, and was also reinfected by macaque 27.2870.1. Autoinfection is possible, as indicated by nodes 27.2863.0 and 27.2805.1.

6.6 Conclusion

When designing an ABM with GIS aware agents, there are a number of factors that should be considered. Scalability in terms of the number of agents is generally most important. Other important issues include the complexity of the GIS data and the amount of GIS data that the model will rely upon. An adept modeler will utilize the GIS data at a granularity appropriate for the model at hand. In terms of speed, raster data scales reasonably with increasing GIS complexity, but not as well with an increase in the number of agents. Spatial queries scale poorly with an increase in the amount of GIS data and complexity, as well as with an increase in the number of agents. Regarding accuracy, utilizing vector data via spatial queries offers the highest accuracy, but at the highest performance cost. Raster data and our precalculated query matrix method offer varying levels of accuracy, while offering faster speed. Table 6.6 summarizes general ratings for each approach. Possible ratings are 1-5, from Poor to Excellent. Accuracy of GIS data refers to the faithfulness to the original GIS data, while the amount of GIS data refers to the ability of each technique to handle multiple layers of GIS data. Our precalculated query matrix method scales best in terms of number of agents and particularly in the amount of GIS data present.

We have presented a complex model of pathogen transmission that utilizes GIS data. This model has started to demonstrate the importance of integrating spatial data into models of pathogen transmission. We have created an efficient and effective mechanism to allow our agents to become GIS aware. Future extensions to the model include adding the ability to model different pathogens simultaneously, deploying a web-based front end to the model, and allowing for the use of custom GIS data. We would also like to explore running our simulation on graphics processing units, as described in D’Souza et al. [37]. Finally, we plan to further verify and validate the LiNK model through real-world data.

TABLE 6.6
ADVANTAGES AND DISADVANTAGES (1 - POOR; 5 - EXCELLENT)

                          Raster    Spatial    Simplified       Precalculated
                          Query     Query      Spatial Query    Query Matrix
Accuracy of GIS Data      3         5          4                4
Amount of GIS Data        3         1          2                5
Complexity of GIS Data    2         5          4                4
Load Time                 1         4          4                5
Memory Requirement        2         4          4                5
Number of Agents          4         1          2                4
Time steps/s              4         1          2                5

CHAPTER 7

CONCLUSION

7.1 Overview

This dissertation has described the significance of TEs both in general and with respect to their detection within newly sequenced genomes. We described an automated homology-based approach for the identification of high quality TEs. We next described a design and implementation plan for the annotation of TEs on VectorBase. We later described a GIS aware agent-based model of pathogen transmission. Together, we have created numerous approaches and models that have important public health implications. We elaborate on our conclusions in the following sections.

7.2 Automated Homology-based Approach for the Identification of Transposable Elements

Chapter 2 introduced TEs and described strategies to detect them. Chapter 3 described our approach to the identification of TEs. The approach, implemented as TESeeker, was tested on multiple families of TEs across a variety of organisms. Overall, results were very good, with resulting consensus TEs as much as 98% identical to previously annotated elements. TESeeker is available as a downloadable virtual machine. This work has been submitted to BMC Bioinformatics [80], while results of this approach have appeared in print [5, 83].

7.2.1 Future Work

Due to the nature of TEs, there will likely never be an all-encompassing approach to discover them. Instead, existing approaches will be used in conjunction with other approaches. For example, LTR TEs can be detected by both structure-based and homology-based approaches. The utilization of multiple tools and approaches to detect TEs produces the most robust results. With TESeeker, several improvements could be implemented. First, incorporating the capability to detect LTRs in Class I and TIRs in Class II consensus elements would allow us to more correctly trim our consensus sequences. Second, the ability for TESeeker to automatically determine the size of flanking sequence could be implemented on a family by family basis. Last, TESeeker could be extended to allow for the detection of MITEs.

TESeeker also opens up numerous opportunities to further study TEs within the organisms hosted on VectorBase. For example, a comparative study of the mariner elements within the Anopheline mosquito complex could be performed. Additionally, a comparative study of TEs within Anopheles gambiae, Culex quinquefasciatus, and Aedes aegypti could be performed. Initial work has been performed on the comparative study of TEs within the Anopheles gambiae M & S forms; TESeeker could help validate and complete this study.

7.3 Community Annotation of Transposable Elements on VectorBase

We introduced the VectorBase CAP in Chapter 2, along with its technologies and implementation. A design and implementation plan for the community annotation of TEs on VectorBase was presented in Chapter 4, including a preliminary version that demonstrated the ability to store TEs within the Chado database schema and to dynamically create a structural display of the TE.

7.3.1 Future Work

Once the VectorBase CAP is restored to working order and TE instances are able to be submitted by the community, work can begin to allow for the submission and display of consensus TE sequences. The submission and display of consensus sequences is more complex because consensus sequences need to be aligned against the genome to determine the location of instances within the genome. This could be done with a BLAST search and results could be displayed in the Ensembl genome browser. Future additions could allow for the use of TESeeker to produce an annotation of TEs that are submitted through the CAP and displayed in the Ensembl genome browser.

7.4 GIS Aware Agent-based Model of Pathogen Transmission

Chapter 5 introduced simulations and described their applicability. We also introduced GIS and discussed its utility within an agent-based model. We combined GIS and agent-based modeling to create a simulation model for the transmission of pathogens amongst long-tailed macaques on Bali, Indonesia, described in Chapter 6. Macaques in our model are GIS aware and utilize their surroundings to make movement decisions. Performance improvements were made through iterative improvements in accessing GIS data. This work culminated in an invited and refereed journal publication [76], in refereed conference proceedings [79], and was also presented at peer-reviewed conferences [77, 78]. Applications of this work are also expected to appear in Lane et al. [88].

7.4.1 Future Work

The utility of LiNK could be improved with the ability to utilize custom GIS data. While this can currently be done manually, the ability to do so “on-the-fly” would be very useful. This could be accompanied by the capability to add additional agents with custom behavioral rules to the model.

The most significant improvement to the LiNK model would be the incorporation of a web-based interface to define and submit simulations, as well as to view simulation results. The advantages of this include the ability to run simulations on the CRC High-Performance Computing Cluster (HPCC) [142] at Notre Dame and the ability to store simulation results in a database (rather than a text file). Simulations run much more quickly on the HPCC, and richer analysis of the simulation data could be performed through software designed to work with massive amounts of data.

7.5 Contributions

This dissertation has described the following contributions:

• Development and implementation of an automated approach to detect transposable elements.

• Design and implementation plan for the incorporation of TEs into the VectorBase community annotation pipeline, including a preliminary version implemented independent of VectorBase.

• Development and implementation of a GIS aware agent-based model of pathogen transmission.

Results from this dissertation have appeared in the following refereed publications:

• Arensburger, P., Megy, K., Waterhouse, R.M., Abrudan, J., Amedeo, P., Antelo, B., Bartholomay, L., Bidwell, S., Caler, E., Camara, F., Campbell, C.L., Campbell, K.S., Casola, C., Castro, M.T., Chandramouliswaran, I., Chapman, S.B., Christley, S., Costas, J., Eisenstadt, E., Feschotte, C., Fraser-Liggett, C., Guigo, R., Haas, B., Hammond, M., Hansson, B.S., Hemingway, J., Hill, S.R., Howarth, C., Ignell, R., Kennedy, R.C. et al., “Sequencing of Culex quinquefasciatus Establishes a Platform for Mosquito Comparative Genomics,” Science, 330(6000):86-88, October 2010.

• Kirkness, E.F., Haas, B.J., Sun, W., Braig, H.R., Perotti, M.A., Clark, J.M., Lee, S.H., Robertson, H.M., Kennedy, R.C. et al., “Genome Sequences of the Human Body Louse and its Primary Endosymbiont: Insights into the permanent parasitic lifestyle,” Proceedings of the National Academy of Sciences, 107(27):12168-12173, July 2010.

• Kennedy, R.C., Lane, K.E., Arifin, S. M. Niaz, Fuentes, A., Hollocher, H., Madey, G.R., “A GIS Aware Agent-Based Model of Pathogen Transmission,” International Journal of Intelligent Control and Systems, 14(1):51-61, March 2009. (invited)

• Nene, V., Wortman, J.R., Lawson, D., Haas, B., Kodira, C., Tu, Z.J., Loftus, B., Xi, Z., Megy, K., Grabherr, M., Ren, Q., Zdobnov, E.M., Lobo, N.F., Campbell, K.S., Brown, S.E., Bonaldo, M.F., Zhu, J., Sinkins, S.P., Hogenkamp, D.G., Amedeo, P., Arensburger, P., Atkinson, P.W., Bidwell, S., Biedler, J., Birney, E., Bruggner, R.V., Costas, J., Coy, M.R., Crabtree, J., Crawford, M., Debruyn, B., Decaprio, D., Eiglmeier, K., Eisenstadt, E., El-Dorry, H., Gelbart, W.M., Gomes, S.L., Hammond, M., Hannick, L.I., Hogan, J.R., Holmes, M.H., Jaffe, D., Johnston, J.S., Kennedy, R.C. et al., “Genome sequence of Aedes aegypti, a major arbovirus vector,” Science, 316(5832):1718-23, June 2007.

• Lawson, D., Arensburger, P., Atkinson, P., Besansky, N.J., Bruggner, R.V., Butler, R., Campbell, K.S., Christophides, G.K., Christley, S., Dialynas, E., Emmert, D., Hammond, M., Hill, C.A., Kennedy, R.C. et al., “VectorBase: a home for invertebrate vectors of human pathogens,” Nucleic Acids Research, 35(Database issue):D503-505, January 2007.

• Kennedy, R.C., Lane, K.E., Fuentes, A., Hollocher, H., Madey, G., “Spatially Aware Agents: An effective and efficient use of GIS data with an Agent-based Model,” In proceedings of Agent-Directed Simulation (ADS 2009), Spring Simulation Multiconference 2009, San Diego, CA, March 2009.

The following manuscripts are under review or in preparation:

• Kennedy, R.C., Unger, M.F., Christley, S., Collins, F.H., Madey, G.R., “An automated homology-based approach for identifying transposable elements,” BMC Bioinformatics. (Under review)

• Lane, K.E., Kennedy, R.C., Miller, L.A., Madey, G., Hollocher, H., Fuentes, A., “Exploring the use of agent-based models in understanding patterns of pathogen transmission.” (In preparation)

APPENDIX A

AUTOMATED APPROACH WALKTHROUGH

In this appendix, we utilize our approach described in Chapter 3 to identify the mariner Class II TE from P. humanus humanus. We show iterative results along the way.

A.1 Representative Amino Acid Coding Regions

We begin with 26 transposase sequences from various mariner elements:

>gi|600840|gb|AAC46948.1| mariner transposase [Chrysoperla plorabunda]
MEKKEFRVLIKYCFLKGKNTVEAKTWLDNEFPDSAPGKSTIIDWYAKFKRGEMSTEDGERSGRPKEVVTD
ENIKKIHKMILNDRKMKLIEIAEALKISKERVGHIIHQYLDMRKLCAKWVPRELTFDQKQQRVDDSERCL
QLLTRNTPEFLRRYVTMDETWLHHYTPEFNRQSAEWTATGEPSPKRGKTQKSAGKVMASVFWDAHGIIFI
DYLEKGKTINSDYYMALLERLKVEIAAKRPHMKKKKVLFHQDNAPCHKSLRTMAKIHELGFELLPHPPYS
PDLAPSDFFLFSDLKRMLAGKKFGCNEEVIAETEAYFEAKPKEYYQNGIKKLEGRYNRCIALEGNYVE
>gi|19570323|dbj|BAB86288.1| mariner transposase [Apis cerana]
MQDQKEHFRHILLFYYRKGKNAVQARKKLCEIYGEGILTVRQCQNWFSKFRSDNFDIKDAPRSGRPVEAD
EDKIKALIEANRRITTREIATRLNLSNSTVHDHMKRLGFVSKLDIWVPHVLKEKDLLCRIDICDSLLKRE
ENDPFLKRIVTGDEKWIVYDNIKRELNEPAQRTSKNNIHKKVMLSVWWDFKGVVFFELLPNNCTINSEVY
CNQLDKLNNSIKQKRPELINRKGVVFHHDNAKAHMSLMTRQKLLQLGWEVLPHPPYSPDLAPSDYHLFRS
LQNSLNDKTFTSNEDVKNYLDQFFANKDQKFYERGIMLLPKRWQYVLDHNGQYVIK
>gi|5353885|gb|AAD42284.1| mariner transposase [Bombyx mori]
WVPHELSEKNLNDRIIICTSLLAHNKIEPFLDRIITGYEKWITYENIIRKRAFYEPGKPAPSTSKPKLSL
NKRMLCIWWNIRRPMHFELLKPNERLNSERHCQQFDKLKTALQEKRPAMFNRKDIILLHDNARPHAALGT
RQKAAELG
>gi|1698455|gb|AAC52011.1| mariner transposase [Homo sapiens]
MNSAKIEARTNIKFMVKLGWKNGEITDALRKVYGDNAPKKSAVYKWITRFKKGRDDVEDEARSGRPSTSI
CEEKINLVRALIEEDRRLTAETIANTTDISIGSAYTILTEKLKLSKLSTRWVPKPLRPDQLQTRAELSME
ILNKWDQDPEAFLRRIVTGDETWLYQYDPEDKAQSKQWLPRGGSGPVKAKADWSRAKVMATVFWDAQGIL
LVDFLEGQRTITSAYYESVLRKLAKALAEKRPGKLHQRVLLHHDNAPAHSSHQTRAILREFRWEIIRHPP
YSPDLAPSDFFLFPNLKKSLKGTHFSSVNNVKKTALTWLNSQDPQFFRDGLNGWYHRLQKCLELDGAYVE
K
>gi|1399036|gb|AAB17945.1| mariner transposase [Ceratitis capitata]
MDNEKDHMLYEFRKGKTVGAATKDIREVYSDRAPALRTVKKWFAKFRSGDFNLEDRPRSGRPCELDNDVL
RISVANNSRISTKEVASELNVNKPTAFRRLKKVGYTLKLDKWVPHQLSEKNKVDRMSTAISLLRRVKNEP
FLDRLLTGDEKWILYNNVQRKRTWKQAHEGAEPMSKGGLHPMMVLLCIWWDIRGVIYFELLPAGETITAN
KYCQQLVELKKAIDEKRPIFANRKGVLFHYDNARPHVAKPTLAKLKEMNWEIMPHSPYSPDIAPSDYHLF
RSLQNNLNGKKFKNVEDVKSHLDNFFNEKPRDFYESGIRKLVERWEWIAEHDGEYIID
>gi|2564437|gb|AAC28162.1| mariner transposase [Glossina palpalis]
NENQKNRRFEVSSSLLLRNNDDPFLNRIVTCDEKWILYDNRRRSAQWLDADEAPQHFPKPKLHQKKIMVT
VWWSAVGLIHHSFLNPGETITAEKYCQQIDEMHQRLQQKQPALVNRKGPILLHDNARPHVSMITRQKLYE
LGYETLDHPP
>gi|2564433|gb|AAC28160.1| mariner transposase [Pycnoscelus surinamensis]
SDGLKCTRVEWCTEMLKRFNNGDSRRVSDIVTGEETWIYQFDLKTKCQSSVWVFPDEQPPTKVKRQRSVG
KKMVATFFSKSGHLATVVLEDQRTVTVKWYTEVWLPQVFSKIQEKRPRTGLRGILLHHDNASSHTANATI
AFLEKMPMKLMTHTA
>gi|2564426|gb|AAC28158.1| mariner transposase [Plebeia frontalis]
NAKNLHDRVTICTSLLARNKNDPFLDRIITGDEKWITYENIVRKRASCEPGQPAPSTFKPSLSLNKRMLC
IWWEVQGPIHYVFLKPNEKLNSERYCQQMDDLNKELKKKRPAVFNRKHIILHHDNARPHTAFGTRQMIAE
LGWEILSHPP
>gi|2564423|gb|AAC28157.1| mariner transposase [Stomoxys uruma]
TFDQKQQRVDDSEWCLQLLTRNTPEFLHRYVTMDETFLHHYTPESNRQSAEWTAIGEPTPKRGKDQKSAG
KVMASVFWDARGIIFIDYLEKGKTINSDYYMALLERLKVEIAAKRPHMKKKKVLFHQDNAPCHKSLRTMV
KIHELGFELLPHPP
>gi|2564419|gb|AAC28156.1| mariner transposase [Cryptolestes ferrugineus]
TEKNMMDRISICEALTKRNKIDPFLKRMATGNEKWITYDNRVRKRSWSKSGEAPQTVVKPGLTARKVLLC
IWWDWKGIIYYELLPYGQTLNSDLYCQQLYRLKIAIDHKRPELTNRRGVVFHQDNPRPHTSTVTRQKLRE
LGWEVLMHPP
>gi|2564414|gb|AAC28155.1| mariner transposase [Delphinia picta]
KEIHLTNRINACDMHLKRNEFDPFLKRIITGDEKWIVYNNVNRKRSWSKHGEPAQTTSKADIHQKKVMLS
VWWDWKGVVYFELLPRNQTINSDVYCQQLDKLNAAIKEKRPELINRKGVIFHQDNARPHTSLMTRQKLGQ
LGWEVLMHPP
>gi|2564412|gb|AAC28154.1| mariner transposase [Culex restuans]
NDRQMENRKTVCEMLLQRFERKSFLHRIVTGDEKWIYYENPKRKKSWLSPGEAGPSTAKPNHFGRKTMLC
VWWDQDGVVYHELLKPGETVNTARYRQQIINLNYALIEKRPEWARRHGKVILQHDNAPSHTAKPVKDALK
TLNWEILSHPP
>gi|2564407|gb|AAC28153.1| mariner transposase [Mantispa pulchella]
TERQMENRKVTCEMLLQRYKRKSFLYRIVTGDEKWIYLENPKRKKSWVSPGEASTSTARPNRFGRKAMLC
VWWDQTGVIYFELLKPGETVNAVRYQQQIKDLSRAIAENRPEYQERQKKVILLHDNAPSHKSKVVRDTLE
KLQWEVLDHAA
>gi|2564401|gb|AAC28151.1| mariner transposase [Bittacus strigosus]
NDGQQENRKTTCEMLLARQKRKSFLHRIVTGDEKWIYFVNLKRKRSYVDPGQPAQLSPRPNRFGRKTMLC
VFWDQRGVIWYELLKPGETVNGQRYQQQLANLNRALRQKRSEYETRHDKVIFLDDNAPSHRTKQTRELVE
SYSWQPLPHPP
>gi|2564397|gb|AAC28150.1| mariner transposase [Tribolium madens]
TLDEKKARVNWCKKMLTKFNNGQSNHVFDIVTGDETWIYRYEPETKRQSAQWVFPYEENPTKLKRPKSVG
RKMIAAFFSRSGYIATIPLEDRKTVNANWYTSICLPQVFEKVREKRPRSEIILHHDNASSHTAGETLDFL
NVSGIKIMTHPP
>gi|2564394|gb|AAC28149.1| mariner transposase [Poecilia reticulata]
SEANRQMRVDCCVTLLNRHNNEGILNRIITCDEKWILYDNRKRSSQWLNPGEPAKSCPKRKFTKKKLLVS
VWWTSAGVVHYSFLKSGQTITADIYCQQLQTMMEKLAAKQPRLVNRSRPLLLQDNARPHTAQRTATKLEE
LQLECLRHPP
>gi|2564369|gb|AAC28140.1| mariner transposase [Andrena erigeniae]
SEENKRRRIDTAASLLSRFKRKSFLHKIIAGDEKWVLYDNPKRQKSWVSPGEPSTSMAKPSIHAKKVMLS
IWWDFKGVIHYELLVPGKTITADYYQQQLMNLHDELERKRPFTGQGTRHVILQHDNARPHVAQGTRNTIY
ALGWEVMSHAA
>gi|2564360|gb|AAC28136.1| mariner transposase [Atteva punctella]
TERNLMNRVLICDSLLRRNETESFLKKLITGDETWITYDKNVRKRSWSKAGQASQTVAKPGLTRNKVMLC
AWWDWKGIIHYELLPPGRTIDSELYCEQMMRLKQKAERKRPELINRRGVVFHHDNARPHTSIATQQKLRE
FGWGVLMHPP
>gi|2564392|gb|AAC28148.1| mariner transposase [Nabis sp. HMR-1997a]
TPQQSAKRLEICRNLLENPFDLRFCHRIVTCDEKWVYWRNPNTNKQWLDYGQTALPVAARGQFEKKSMLC
VFWNFEGVIHHEFVPDGCSINSELYCEQLERLYSKISERYPALINRKGVLLQQDNARPHTSHRTKEKFTE
LHGFELLPHPP
>gi|2564387|gb|AAC28147.1| mariner transposase [Epicauta funebris]
SEKNLNDRVVICTSLLARNNVEPFLNRMITGDEKWITYENILRKRAYCESGKPSPSTSKPNLNLNKRMLC
IWWDIRGPIHYELLKPNKKLNSEKYCQQLDNLTTAVQEKRPAMFNRRDIILHHDNARPHTALGTRQKIAE
LGWEILSHPP
>gi|2564376|gb|AAC28142.1| mariner transposase [Buenoa sp. HMR-1997]
TSDQKQQRIDDSEQCLKMFNRNKSEFLRRYVTMDETWLHHFTPESSRQSAEWTAYDEPNPKRAKTQQSAG
KVMASVFLDAHGIIFIDYLEKGKTINSDYYIALLERLKDEIAEKRPHLKKKRVLFHQDNAPCHKSMKTMA
KLNELGYELLPHPP
>gi|1816499|gb|AAC47445.1| mariner transposase [Cymodusa distincta]
TTRNLISRIEICDTLLKRNKMDPFLKRLITGDEKWIKYKNVKRKRSWLKPGEVPQTTTKPELTASKVMLS
VWWDWKGIVYYEILEPGQTVDSGLYCQQLTRLQEAIQKKRPELVNRKSIEFHHDNARPHTSLMTRQKLTE
FGWEILLHPP
>gi|520556|gb|AAA20470.1| mariner transposase [Tetranychus urticae]
PPGQMEHRVMACRFNLQMHRKTRELIQRTISIDETWVSLYMEPEKEQAKGWYYPDEQPEEVPRQNIHGNK
RMLIMGMDYNGIAFFELLPEKTTVDGQTYKGFLERHVRHWLGTRASKHLWLLHDNARPHKHQVVREWLER
HEITLWHHPP
>gi|3093971|gb|AAC15448.1| mariner transposase [Heliothis subflexa]
MLKLYENGTSNNINNIVSGDETWLYYFDVPSKNKNKVWLFENEQTPVQVRKSRSVKKKMIAVFFTRRGIL
ERIVLESQRTVTASWYINDCLPKVFQKLQEIRPNSRMDTWHFHHDNASAHRARDTVEFLNTSGVKVLEHP
AYTPDL
>gi|520553|gb|AAA20469.1| mariner transposase [Metaseiulus occidentalis]
SERQKEVRLTVCRELLSRYKNKSFLYRIITSDEKWIYYDNPGRKRSWVSPGEPAEKSVRRNRFGKKTMLC
VWWDQRGVIYHELLKPGETVDTARYQQQLIDLNRAVKEKRPNWDQVRNRVILLHDNAPCHTSKPTQETLS
ALNWEVLTHPA
>gi|3142710|gb|AAC16889.1| mariner transposase [Heliothis virescens]
KWRKKMLHMYENGTSNNINNIVTGDETWLYYFDLPSKNKNKVWLFENEQTPVQVRKSRSVKKKMIAVFFT
RRGILERVLLESQRTVTASWYINECLPKVFQRLQEIRPNSRMDTWHFHHDNAPAHRARDTVEFLNSSGVR
VLDHPAYPPDLPQ
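The transposase library above is in FASTA format. As a minimal sketch of how such a library could be loaded before the search step, the following reader collects each record into a dictionary. The function name `read_fasta` and the toy records are illustrative assumptions and are not part of TESeeker.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence}.

    Sequence lines belonging to one record are concatenated; blank lines
    are ignored. Lines that appear before the first header are skipped.
    """
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]          # drop the leading '>'
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


# Toy input (hypothetical records, not taken from the library above).
toy = [">seq1 example", "MEKK", "EFRV", "", ">seq2", "WVPH"]
library = read_fasta(toy)
print(len(library), "records")
```

Applied to the library in this section, a reader of this kind would be expected to report 26 records, one per transposase.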

A.2 Identify Coding Region

A.2.1 tblastn Search

We utilize the library of TEs from the previous section to perform a tblastn search against the genome. Here, we present a subset of the tblastn hits.

TBLASTN 2.2.23+

Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Database: phumanus.CONTIGS-USDA.PhumU1.fa 8,555 sequences; 108,367,968 total letters

Query= gi|600840|gb|AAC46948.1| mariner transposase [Chrysoperla plorabunda]
Length=348

                                                               Score     E
Sequences producing significant alignments:                   (Bits)  Value

AAZO01005188.1 Pediculus humanus USDA contig 1103172084976     136    7e-32
AAZO01006015.1 Pediculus humanus USDA contig 1103172085458     132    1e-30
AAZO01003584.1 Pediculus humanus USDA contig 1103172096529     110    1e-28
AAZO01007101.1 Pediculus humanus USDA contig 1103172086085     122    1e-27
AAZO01003080.1 Pediculus humanus USDA contig 1103172096313     120    9e-27
AAZO01005603.1 Pediculus humanus USDA contig 1103172085202     100    3e-26
AAZO01001198.1 Pediculus humanus USDA contig 1103172094763     116    1e-25
AAZO01004437.1 Pediculus humanus USDA contig 1103172096885     114    4e-25
AAZO01001978.1 Pediculus humanus USDA contig 1103172095787     113    8e-25
AAZO01005816.1 Pediculus humanus USDA contig 1103172085330     112    1e-24
AAZO01001816.1 Pediculus humanus USDA contig 1103172095715     94.1   4e-24
AAZO01007534.1 Pediculus humanus USDA contig 1103172086338     111    4e-24
AAZO01007070.1 Pediculus humanus USDA contig 1103172086064     111    4e-24
AAZO01006787.1 Pediculus humanus USDA contig 1103172085910     111    4e-24
AAZO01005899.1 Pediculus humanus USDA contig 1103172085374     111    4e-24
AAZO01007995.1 Pediculus humanus USDA contig 1103172088564     110    9e-24
AAZO01006175.1 Pediculus humanus USDA contig 1103172085571     110    9e-24
AAZO01000215.1 Pediculus humanus USDA contig 1103172094998     110    1e-23
AAZO01003840.1 Pediculus humanus USDA contig 1103172096640     109    1e-23
AAZO01006286.1 Pediculus humanus USDA contig 1103172085643     108    2e-23
AAZO01005421.1 Pediculus humanus USDA contig 1103172085111     108    2e-23
AAZO01000288.1 Pediculus humanus USDA contig 1103172095038     108    2e-23
AAZO01007414.1 Pediculus humanus USDA contig 1103172086274     108    2e-23
AAZO01007386.1 Pediculus humanus USDA contig 1103172086255     108    3e-23
AAZO01006519.1 Pediculus humanus USDA contig 1103172090088     107    3e-23
AAZO01007033.1 Pediculus humanus USDA contig 1103172094563     107    4e-23
AAZO01007487.1 Pediculus humanus USDA contig 1103172086313     107    6e-23
AAZO01004198.1 Pediculus humanus USDA contig 1103172096805     107    6e-23
AAZO01003892.1 Pediculus humanus USDA contig 1103172096659     107    6e-23
AAZO01001012.1 Pediculus humanus USDA contig 1103172095359     106    1e-22
AAZO01001082.1 Pediculus humanus USDA contig 1103172095391     106    1e-22
AAZO01004190.1 Pediculus humanus USDA contig 1103172096798     106    1e-22
AAZO01003375.1 Pediculus humanus USDA contig 1103172096437     105    2e-22
AAZO01006313.1 Pediculus humanus USDA contig 1103172085657     104    3e-22
AAZO01003096.1 Pediculus humanus USDA contig 1103172096321     104    4e-22
AAZO01007094.1 Pediculus humanus USDA contig 1103172086080     104    5e-22
AAZO01008517.1 Pediculus humanus USDA contig 1103172093434     104    5e-22
AAZO01006528.1 Pediculus humanus USDA contig 1103172085776     102    2e-21
AAZO01005218.1 Pediculus humanus USDA contig 1103172084995     100    7e-21

> AAZO01005188.1 Pediculus humanus USDA contig 1103172084976 Length=113412

Score = 136 bits (318), Expect = 7e-32, Method: Compositional matrix adjust. Identities = 106/329 (32%), Positives = 153/329 (46%), Gaps = 21/329 (6%) Frame = -1

Query 2 EKKEFRVLIKYCFLKGKNTVEAKTWLDNEFPDSAPGKSTIIDWYAKFKRGEMSTEDGERS 61 + F ++ + F KG N +A L D A +W+AKF+ G+ S ++ ERS Sbjct 112767 QSEHFLHILLFYF*KGVNASQANKKLWVV*GDEALTERQCQNWFAKFRSGDFSLQNEERS 112588

Query 62 GRPKEVVTDENIKKIHKMILNDRKMKLIEIAEALKISKERVGHIIHQYLDMRKLCAK-WV 120 GR EV DE IK +I DR +I L +S V + + +KL A W Sbjct 112587 GRQLEV-KDEQIKA---LIDYDRYSSTKDIVKKLDVSHTCVKNRLRRLGCQKKLDALLW- 112423

Query 121 PRELTFDQKQQRVDDSERCLQLL--TRNTPEFLRRYVTMDETWLHHYTPEFNRQSAEWTA 178 V+++ L+ T+ F R VT DE W + +F R+ + W Sbjct 112422 ------GTLVNEATWSLRYAS*TQCK*PFFERMVTGDEKWVVY--DDFLRKRS-WFR 112279

Query 179 TGEPSPKRGKTQKSAGKVMASVFWDAHGIIFIDYLEKGKTINSDYYMALLERLKVEIAAK 238 G + + + KV+ S WD GI+ + L + +TINS+ Y+ L L I K Sbjct 112278 QGNRHQQLLRLTFTKKKVLLSFWWDYKGIVNFELLPRCQTINSEVYIRQLTNLSDTIQEK 112099

Query 239 RPHMKKKK-VLFHQDNAPCHKSLRTMAKIHELGFELLPHPPYSPDLAPSDFFLFSDLKRM 297 RP + K + FH+ NA +L T K+ ELG L HPPYSP LAP ++ F LK Sbjct 112098 RPELANSKGIVFHHHNARPSLTLATGQKLLELGWNVLLHPPYSPKLAPNNYHFFRFLKNF 111919

Query 298 LAGKKFGCNEEVIAETEAYFEAKPKEYYQ 326 L G+KF + EV E +F K KE Y+ Sbjct 111918 LNGQKFQNDNEVKTALEQFFAPKTKELYE 111832

> AAZO01006015.1 Pediculus humanus USDA contig 1103172085458 Length=109781

Score = 132 bits (308), Expect = 1e-30, Method: Compositional matrix adjust. Identities = 94/288 (32%), Positives = 138/288 (47%), Gaps = 21/288 (7%) Frame = +1

Query 43 DWYAKFKRGEMSTEDGERSGRPKEVVTDENIKKIHKMILNDRKMKLIEIAEALKISKERV 102 +W+AKF+ G+ S ++ ERSGR EV DE IK +I DR I L +S+ V Sbjct 96649 NWFAKFRSGDFSLQNEERSGRQLEV-KDEQIKA---LIDYDRHSSTKYIIKKLDVSRTCV 96816

Query 103 GHIIHQYLDMRKLCAK-WVPRELTFDQKQQRVDDSERCLQLL--TRNTPEFLRRYVTMDE 159 + + +KL A W V+++ L+ T F R VT DE Sbjct 96817 KNCLRRLECQKKLDALLW------GTLVNEATWSLRYAS*TECK*PFFERMVTEDE 96966

Query 160 TWLHHYTPEFNRQSAEWTATGEPSPKRGKTQKSAGKVMASVFWDAHGIIFIDYLEKGKTI 219 W + +F R+ + + G +P K K+++S WD GI+ + L + +TI Sbjct 96967 KWVVY--DDFLRKKS*-SRQGKQAPTTSKVDIKQKKILSSFWWDYKGIVNFELLPRCQTI 97137

Query 220 NSDYYMALLERLKVEIAAKRPHMKKKK-VLFHQDNAPCHKSLRTMAKIHELGFELLPHPP 278 NS+ Y+ L L I KRP + K + FH+ NA +L T K+ ELG L HP Sbjct 97138 NSEVYIRQLTNLNDTIQEKRPELANSKGIVFHHHNARPSPTLATGQKLLELGWNVLLHPS 97317

Query 279 YSPDLAPSDFFLFSDLKRMLAGKKFGCNEEVIAETEAYFEAKPKEYYQ 326 YSP L P ++ F LK L G+KF + EV + +F K KE+Y+ Sbjct 97318 YSPKLPPNNYHFFRSLKNFLNGQKFQNDNEVKTALDQFFAPKTKEFYE 97461

> AAZO01003584.1 Pediculus humanus USDA contig 1103172096529 Length=67685

Score = 110 bits (254), Expect(3) = 1e-28, Method: Compositional matrix adjust. Identities = 55/145 (37%), Positives = 78/145 (53%), Gaps = 1/145 (0%) Frame = -2

Query 183 SPKRGKTQKSAGKVMASVFWDAHGIIFIDYLEKGKTINSDYYMALLERLKVEIAAKRPHM 242 +P K K+++S WD GI+ + L + +TIN + Y+ L L I KR + Sbjct 16786 APTTSKVDIKQKKILSSFWWDYKGIVNFELLPRNQTINLEVYIRQLTNLNDTIQEKRLEL 16607

Query 243 KKKK-VLFHQDNAPCHKSLRTMAKIHELGFELLPHPPYSPDLAPSDFFLFSDLKRMLAGK 301 +K + FH+DNA SL T K+ ELG + L HPPYSP LAP ++ F LK L G+ Sbjct 16606 ANRKGIVFHHDNARPSPSLATGQKLLELGWDVLLHPPYSPKLAPNNYHFFRSLKNFLNGQ 16427

Query 302 KFGCNEEVIAETEAYFEAKPKEYYQ 326 KF + EV +F K KE+Y+ Sbjct 16426 KFQNDNEVKTALNQFFAPKTKEFYE 16352

Score = 31.8 bits (67), Expect(3) = 1e-28, Method: Compositional matrix adjust. Identities = 21/54 (38%), Positives = 29/54 (53%), Gaps = 4/54 (7%) Frame = -1

Query 45 YAKFKRGEMSTEDGERSGRPKEVVTDENIKKIHKMILNDRKMKLIEIAEALKIS 98 +AKF G+ S + E SG EV D++ +K I NDR +IAE L +S Sbjct 17150 FAKFYSGDFSLKNEECSGCLVEV--DDDQRK--AVIVNDRHSSTRDIAEKLDVS 17001

Score = 25.1 bits (51), Expect(3) = 1e-28, Method: Compositional matrix adjust. Identities = 21/68 (30%), Positives = 32/68 (47%), Gaps = 11/68 (16%) Frame = -3

Query 117 AKWVPRE---LTFDQKQQRVDDSERCLQLLTRNTPE-FLRRYVTMDETWLHHYTPEFNRQ 172 W+P+E +T +Q C LL RN + FL+ VT DE W + +F R+ Sbjct 16974 VSWIPKEACCITLGSVRQL----GLCDMLLKRNANDPFLKEMVTGDEKWVVY--DDFLRK 16813

Query 173 SAEWTATG 180 + W+ G Sbjct 16812 RS-WSRQG 16792

...

Matrix: BLOSUM90 Gap Penalties: Existence: 10, Extension: 1 Neighboring words threshold: 13 Window for multiple hits: 40
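A search like the one reported above could be scripted roughly as follows. This is a sketch under stated assumptions, not the exact invocation used in this work: the query and output file names and the E-value cutoff are hypothetical, the database name and the BLOSUM90 matrix are taken from the output header and footer shown above, and the database is assumed to have been built beforehand with `makeblastdb`. The helper `parse_hit_row` reads one row of the hit summary table.

```python
import subprocess
from typing import List, Tuple


def tblastn_command(query_fasta: str, database: str, out_path: str,
                    evalue: float = 1e-5,
                    matrix: str = "BLOSUM90") -> List[str]:
    """Build the argument list for a BLAST+ tblastn run (nothing is executed)."""
    return [
        "tblastn",
        "-query", query_fasta,    # protein library (hypothetical file name)
        "-db", database,          # nucleotide database built with makeblastdb
        "-out", out_path,
        "-evalue", str(evalue),   # hypothetical cutoff, not stated in the text
        "-matrix", matrix,
    ]


def run_tblastn(command: List[str]) -> None:
    """Run the command; needs BLAST+ on PATH and raises on a non-zero exit."""
    subprocess.run(command, check=True)


def parse_hit_row(row: str) -> Tuple[str, float, float]:
    """Split a summary row into (subject id, bit score, E-value)."""
    description, bits, evalue = row.rsplit(None, 2)
    return description.split()[0], float(bits), float(evalue)


cmd = tblastn_command("mariner_transposases.fa",
                      "phumanus.CONTIGS-USDA.PhumU1.fa",
                      "mariner.tblastn.txt")
hit = parse_hit_row(
    "AAZO01005188.1 Pediculus humanus USDA contig 1103172084976 136 7e-32")
print(" ".join(cmd))
print(hit)
```

Building the command separately from running it keeps the sketch usable on machines without BLAST+ installed; `run_tblastn(cmd)` would perform the actual search.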

A.2.2 Extract Sequences from the Genome

After the tblastn search, we combine hits that originate from the same query sequence and lie within 50 bp of each other. When the combined hits are extracted from the genome, the intervening sequence is included. A subset of the sequences extracted in this way is shown below:

>AAZO01000215.1-0 40432 40911 f
ATGGGGAGCCAAAGCGAGCATTTCCTCCACATTTTACTTTTTTATTTCTGAAAGGGTGTT
AATGCTTCCCAGGCCAATAAAATGTTGTGAGCTGTGTAGGGGGATGAAACCTTAATAGAA
CGGCAGTGTCAAAACTGGTTTGCGAAATTCCGTTCTGGAGATTTTCCTTTAAAAAATGAG
GAGCGCTCCGGGCGTCCATTGGAGGTTAACGATGAGCAAATAAAGGCCCTCATTGATTAT
GATCGGCATAGTTCGACGAAGGACATTGTAAAGAAGCTAGATGTGTCACATACGTGCGTC
AAAAACTGTCTGCGGCGTCTTGGGTGCCAAAAAAAGCTTTATGCGTTACTTTGGGGAACG
TTAGTTAACGAGGCGACTTGGTTTTTGCGATATGCTTCTTAAACGCAATGCGAATGGCCC
TTTTTTGAAAGAATGGTCACCGGAGATGAAAAGTGGGTTGTCTATGATGACTTTTTGAGA
>AAZO01000215.1-1 40432 40923 f
ATGGGGAGCCAAAGCGAGCATTTCCTCCACATTTTACTTTTTTATTTCTGAAAGGGTGTT
AATGCTTCCCAGGCCAATAAAATGTTGTGAGCTGTGTAGGGGGATGAAACCTTAATAGAA
CGGCAGTGTCAAAACTGGTTTGCGAAATTCCGTTCTGGAGATTTTCCTTTAAAAAATGAG
GAGCGCTCCGGGCGTCCATTGGAGGTTAACGATGAGCAAATAAAGGCCCTCATTGATTAT
GATCGGCATAGTTCGACGAAGGACATTGTAAAGAAGCTAGATGTGTCACATACGTGCGTC
AAAAACTGTCTGCGGCGTCTTGGGTGCCAAAAAAAGCTTTATGCGTTACTTTGGGGAACG
TTAGTTAACGAGGCGACTTGGTTTTTGCGATATGCTTCTTAAACGCAATGCGAATGGCCC
TTTTTTGAAAGAATGGTCACCGGAGATGAAAAGTGGGTTGTCTATGATGACTTTTTGAGA
AAAAGATCCTGG
>AAZO01000215.1-14 40902 41231 f
CTTTTTGAGAAAAAGATCCTGGTCTAGGCAAGGAAAGAGGCACCAACAACTTCTTTGAAT
GACATTCACCAAAAAAAGGTATTGTTATCATTTTGGTGGGATTACAAAGGCATAGTCTAC
TTTGAGCTGCTGCCACGAAATCAGATCATAAATTCAGAGGTTTACATTCGACAATTGACA
AATTTAAATGATACCATCCAAGAAAAACGATCGGAGCTAGCCAATAGCAAAGAAATTGTC
TTTCACCACGATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGGAG
CTAGGCTGGAATGTTTTGCTGCACCCTCCA
>AAZO01000215.1-15 40929 41231 f
GCAAGGAAAGAGGCACCAACAACTTCTTTGAATGACATTCACCAAAAAAAGGTATTGTTA
TCATTTTGGTGGGATTACAAAGGCATAGTCTACTTTGAGCTGCTGCCACGAAATCAGATC
ATAAATTCAGAGGTTTACATTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAA
CGATCGGAGCTAGCCAATAGCAAAGAAATTGTCTTTCACCACGATAATGCCAGGCCCTCC
CCATCTTTAGCCACTGGACAAAAACTACTGGAGCTAGGCTGGAATGTTTTGCTGCACCCT
CCA
>AAZO01000215.1-16 40929 41441 f
GCAAGGAAAGAGGCACCAACAACTTCTTTGAATGACATTCACCAAAAAAAGGTATTGTTA
TCATTTTGGTGGGATTACAAAGGCATAGTCTACTTTGAGCTGCTGCCACGAAATCAGATC
ATAAATTCAGAGGTTTACATTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAA
CGATCGGAGCTAGCCAATAGCAAAGAAATTGTCTTTCACCACGATAATGCCAGGCCCTCC
CCATCTTTAGCCACTGGACAAAAACTACTGGAGCTAGGCTGGAATGTTTTGCTGCACCCT
CCATATAGTCCCAAACTAGCTCCAAATAATTATCATTTTTTCCGATTCCTAAAAAATTTT
TTAAACGGAGAAAAATTCCAAAACGACAATGAGGTCAAAACTGCATTGGAGCAGTTTTTT
GCTCCTAAAACTAAAGAGTTTTATGAAAAAAGGAAAATGATACTACCCAAAAAATGTCAA
AAGGTCACCAATAATAATGGACATAATATAATA
>AAZO01000215.1-17 40935 41207 f
AAAGAGGCACCAACAACTTCTTTGAATGACATTCACCAAAAAAAGGTATTGTTATCATTT
TGGTGGGATTACAAAGGCATAGTCTACTTTGAGCTGCTGCCACGAAATCAGATCATAAAT
TCAGAGGTTTACATTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAACGATCG
GAGCTAGCCAATAGCAAAGAAATTGTCTTTCACCACGATAATGCCAGGCCCTCCCCATCT
TTAGCCACTGGACAAAAACTACTGGAGCTAGGC
>AAZO01000215.1-18 40938 41231 f
GAGGCACCAACAACTTCTTTGAATGACATTCACCAAAAAAAGGTATTGTTATCATTTTGG
TGGGATTACAAAGGCATAGTCTACTTTGAGCTGCTGCCACGAAATCAGATCATAAATTCA
GAGGTTTACATTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAACGATCGGAG
CTAGCCAATAGCAAAGAAATTGTCTTTCACCACGATAATGCCAGGCCCTCCCCATCTTTA
GCCACTGGACAAAAACTACTGGAGCTAGGCTGGAATGTTTTGCTGCACCCTCCA
>AAZO01000215.1-19 40938 41231 f
GAGGCACCAACAACTTCTTTGAATGACATTCACCAAAAAAAGGTATTGTTATCATTTTGG
TGGGATTACAAAGGCATAGTCTACTTTGAGCTGCTGCCACGAAATCAGATCATAAATTCA
GAGGTTTACATTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAACGATCGGAG
CTAGCCAATAGCAAAGAAATTGTCTTTCACCACGATAATGCCAGGCCCTCCCCATCTTTA
GCCACTGGACAAAAACTACTGGAGCTAGGCTGGAATGTTTTGCTGCACCCTCCA
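The combining and extraction steps described at the start of this section can be sketched as follows. This is an illustrative reconstruction rather than the TESeeker code: the `Hit` record, the function names, and the requirement that combined hits share a contig and strand are assumptions. The "within 50 bp" rule is interpreted here as the next hit starting at most 50 positions after the current one ends. Coordinates are treated as 1-based and inclusive, which is consistent with the headers above (for example, 40432 to 40911 spans 480 bases).

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    query: str
    contig: str
    start: int    # 1-based, inclusive, start <= end
    end: int
    strand: str   # "f" (forward) or "r" (reverse)


def merge_hits(hits: List[Hit], max_gap: int = 50) -> List[Hit]:
    """Combine hits from the same query, contig and strand that lie
    within max_gap bp of each other; the merged span covers the gap."""
    groups = defaultdict(list)
    for h in hits:
        groups[(h.query, h.contig, h.strand)].append(h)
    merged: List[Hit] = []
    for key in sorted(groups):
        ordered = sorted(groups[key], key=lambda h: h.start)
        current = ordered[0]
        for h in ordered[1:]:
            if h.start - current.end <= max_gap:
                current = current._replace(end=max(current.end, h.end))
            else:
                merged.append(current)
                current = h
        merged.append(current)
    return merged


_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract(genome: Dict[str, str], hit: Hit) -> str:
    """Return the genomic sequence for a hit, including any intervening
    sequence; reverse-strand hits are reverse-complemented."""
    seq = genome[hit.contig][hit.start - 1:hit.end]
    if hit.strand == "r":
        seq = seq.translate(_COMPLEMENT)[::-1]
    return seq


hits = [Hit("q1", "c1", 10, 20, "f"), Hit("q1", "c1", 60, 80, "f"),
        Hit("q1", "c1", 200, 210, "f"), Hit("q2", "c1", 25, 30, "f")]
merged = merge_hits(hits)
print(merged)
```

In this toy input the first two hits from `q1` are combined into a single span from 10 to 80, while the hit at 200 to 210 and the hit from `q2` remain separate.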

A.2.3 CAP3 Assembly

The extracted sequences are fed into the CAP3 assembler, which produces contigs and singletons from the sequences. The contigs file is accompanied by a quality score file that gives the quality of each base pair in each contig. We utilize the quality scores to trim the contigs to encompass the TE, without irrelevant adjacent sequence. In the following sections, we show the raw CAP3 contigs and their accompanying quality scores. In this case, the quality scores for each contig never drop below the threshold for the required number of consecutive positions; therefore, the contigs are not trimmed.
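One way to picture the trimming rule is the sketch below. It is an assumption-laden illustration, not the procedure defined in Chapter 3: the threshold of 20, the run length of 10, and the choice to keep the longest stretch between sustained low-quality runs are all hypothetical. What it does reflect from the text is that isolated low scores do not trigger trimming; only a run of consecutive low scores does.

```python
from typing import List, Tuple


def low_quality_runs(quals: List[int], threshold: int,
                     min_run: int) -> List[Tuple[int, int]]:
    """Return half-open [start, end) ranges in which at least min_run
    consecutive scores fall below threshold."""
    runs: List[Tuple[int, int]] = []
    start = None
    # A sentinel equal to the threshold closes a run that reaches the end.
    for i, q in enumerate(list(quals) + [threshold]):
        if q < threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i))
            start = None
    return runs


def trim_contig(seq: str, quals: List[int], threshold: int = 20,
                min_run: int = 10) -> str:
    """Keep the longest stretch of seq not interrupted by a sustained
    low-quality run; return seq unchanged if there is no such run."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    best = (0, 0)
    previous_end = 0
    for run_start, run_end in (low_quality_runs(quals, threshold, min_run)
                               + [(len(seq), len(seq))]):
        if run_start - previous_end > best[1] - best[0]:
            best = (previous_end, run_start)
        previous_end = run_end
    return seq[best[0]:best[1]]


# Two isolated low scores: shorter than min_run, so nothing is trimmed.
print(trim_contig("ACGT", [97, 4, 4, 97], threshold=20, min_run=3))
# A run of three low scores splits the contig; the longer side is kept.
print(trim_contig("AAACCCGGGG", [97, 97, 97, 5, 5, 5, 97, 97, 97, 97],
                  threshold=20, min_run=3))
```

Under these assumed parameters, contigs whose quality dips only briefly would pass through unchanged, which matches the outcome reported for this example.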

A.2.3.1 CAP3 Contigs

>Contig1 ATGGGGAGCCAGAGCGAGCATTTCCTCCACATTTTACTTTTTTATTTTTGAAAGGGTGTT AATGCTTCCCAGGCCAATAAAAAGTTGTGGGTCGTGTAGGGGGATGAAGCCTTAATAGAA CGGCAGTGTCAAAACTGGTTTGCGAAATTCCGTTCTGGAGATTTTTCTTTGCAAAATGAG GAGTGCTCCGGGCGCCAATTGGAGGTTAAAGATGAGCAAATAAAGGCCCTCATTGATTAT GATCGGCATAGTTCGACTAAGGACATTGTAAAGAAGCTAGATGTGTCACATACGTGCGTC AAAAACCGTCTGCGGCGTCTTGGGTGCCAAAAGAAGCTTGATGCGTTACTTTGGGGAACG TTAGTTAACGAGGCGACTTGGTCTTTGCGATATGCTTCTTAAACGCAATGCAAATGACCC TTTTTTGAAAGAATGGTCACCGGAGATGAAAAGTGGGTTGTCTATGATGACTTTTTGAGA AAAAGATCCTGGTTTAGGCAAGGAAACAGGCACCAACAACTTCTAAGGCTGACATTCACC AAAAAAAGGTATTGTTATCATTTTGGTGGGATTACAAAGGCATAGTCTACTTTGAGCTGC TGCCACGATCAGACCATAAATTCAGAGGTTTACATTCGACAATTGACTAATTTAAATGAT ACCATCCAAGAAAAACGACCGGAGCTAGCCAATAGCAAAGGAATTGTCTTTCACCACCAT AATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGGAGCTAGGCTGGAAT GTTTTGCTGCACCCTCCATATAGTCCCAAACTAGCTCCAAATAATTATCATTTTTTCCGA TTCCTAAAAAATTTTTTAAACGGACAAAAATTCCAAAACGACAATGAGGTCAAAACTGCA TTGGAGCAGTTTTTTGCTCCTAAAACTAAAGAGTTGTATAAAAAAAGGAAAATGATACCA CCCGAAAAATGTCAAAAGGTCACTAATAATAATAAACATAATATAATAGAT >Contig2 ATGGGGAGCCAAAGCGAGCATTTCCTTCACATTTTGCTTTTTTATTTCTGAAAGGGTGTT AATGCTTCACAAGCCAATAAAAAGTTGTGGGCCGTGTAGGGGGATGAAACCTTAATAGAA CGGCAGTGTCAAAACTGGTTTGCGAAATTCCGTTCTGGAGATTTTCCTTTGAAAAATGAG GAGTGCTCCGGGCGTCCATTGGAGGTTAATGATGAGCAAATAAAGGCCCTCATTGATTAT GATCGGCATAGTTCGACTAAGGACATTGTAAAGAAGCTAGATGAGTCACATACGTGCGTC

137 AAAAACCGTCTGCGGCGTCTTGGGTACCAAAAGAAGCTTGATGGGTCTTTGCGATATGCT TCTTAAATGCAATGCGAATGACCCTTTTTTGAAAGAATGGTCACCGGAGATGAAAAGTGG GTTGTTTATGATGACTTTTTGAGAAAAAGATCCTGGTCTAGGCAAGGAAACAGGCACCTT AGAACTTCTAAGGCTGACATTCAGCAAAAAAGTTATTGTTATCATTTTGGTGGGATTACA AAGGCATAGTCAACTTTGAGCTGCTGCCACGAAGTCAGACCATAAATTCAGAGGTTTACA TTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAACGATCGGAGCTAGCCAATA GCAAAGGAATTGTCTTTCACCACGATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGAC AAAAACTACTGGAGCTAGGCTGGGATGTTTTGCTGCACCCTCCATATAGTCCCAAACTAG CTCCAAATAATTATCATTTTTTCCGATCCCTAAAAAATTTTTTGAACGGACAAAAATTCC AAAACGACAATGAGGTCAAAACTGCATTGGACCAGTTTTTTGCTCCTAAAACTAAAGAGT TTTATGAAAAAAGGAAAATGATT >Contig3 ATGGGGAGCCAAAGCGAGCATTTCCTCCACATTTTGCTTTTTTATTTCTGAAAGGGTGTT AATGCTTCACAAGCCAATAAAAAGTTGTGGGCCGTGTAGGGGGATGAAGCCTTAATAGAA CGGCAGTGTCAAAACTGGTTTGCGAAATTCCGTTCTGGAGATTTTCCTTTGAAAAATGAG GAGCGCTCCGGGCGTCCATTGGAGGTTAATGATGAGCAAATAAAGGCCCTCATTGATTAT GATCGGCATAGTTCGACAAGGGACATTGTAAAGAAGCTAGATGTGTCACATACGTGCGTC AAAAACCGTCTGCGACGTCTTGGGTACCAAAAGAAGCTTGATGCGTTACTTTGGGGAACG TTAGTTAACGAGGCGACTTGGTCTTTGCGATATGCTTCTTAAACGCAATGCGAATGACCC TTTTTTGAAAGAATGGTCATCGAAGATGAAAAGCGGGTTGTTTATGATGACTTTTTGAGA AAAAGATCCTGGTCTAGGTAAGGAAACAGGCACCTTAGAACTTCTAAGGCTGACATTCAC CAAAAAAGTTATTGTTATCATTTTGGTGGGCTTACAAAGGCATAGTCAACTTTGAGCTGC TGCTGCCACGAAGTCAGATCATAAATTCAGAGGTTTACATTCGACAATTGACAAATTTAA ATGATACCATCCAAGAAAAACGATCATAGCTAGCCAATAGCAAAGGAATTGTCTTTCACT ATGATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGAAGCTAGGCT GGAATGTTTTGCTGCATCCTCCTTAAAGTCCCGACCTAACTCGAAGTGAGAATCATTTTT CCCGATTCCTAAAAAATTTTTTGAACGGACAAAAATTCCAAAACGACAATGAGGTCAATT GTACTACATTGGACCAGTTTTTTGCT >Contig4 ATGGGGAGCCAGAGCGAGCATTTCCTCCACATTTTACTTTTTTATTTTTGAAAGAGTGTT AATGCTTCCCAGGCCAATAAAAAGTTGTGGGTCGTGTAGGGGGATGAAGCTTTAACAGAA CGGCAGCGTCAAAACTGGTTTGCGAAATTCCGTTCTGGAGATTTTTCTTTGCAAAATGAG GAGCGCTCCGGGCGCCAATTGGAGGTTAAAGATGAGCAAATAAAGGCCCTCATTGATTAT GATCGGCATAGTTCGACGAAGTACATTATAAAGAAGCTAGATGTGTCACGTACGTGCGTC AAAAACTGTCTGCGGCGTCTTGAGTGCCAAAAAAAGCTTGATGCGTTACTTTGGGGAACG 
TTAGTTAACGAGGCGACTTGGTCTTTGCGATATGCTTCTTAAACGGAGTGCAAATGACCC ATTTTTGAAAGAATGGTCACCGAAGATGAAAAGTGGGTTGTCTATGATGACTTTTTGAGG AAAAAATCCTGATCTAGGCAAGGGAAACAGGCACCAACAACTTCTAAGGTTGACATAAAG CAAAAAAAGATCTTGTCATCATTTTGGTGGGATTACAAAGGCATAGTCAACTTTGAACTG CTGCCACGATGTCAGACCATAAATTCAGAGGTTTACATTCGACAATTGACAAATTTAAAT GATACCATCCAAGAAAAACGACCGGAACTAGCCAATAGCAAAGGAATTGTCTTTCACCAC CATAATGCCAGGCCCTCCCCAACTTTAGCCACTGGACAAAAACTACTGGAGCTAGGCTGG AATGTTTTGCTGCACCCTTCATATAGTCCCAAACTACCTCCAAATAATTATCATTTTTTC CGATCCCTAAAAAATTTTTTGAACGGACAAAAATTCCAAAACGACAATGAGGTCAAAACT GCATTGGACCAGTTTTTTGCTCCTAAAACTAAAGAGTTTTATGAAAAAAGGAAAATGATA

138 CTACCCGAAAAATGTCAAAAGGTCACTAATAATAATAAACATAATATAATA >Contig5 TTATTTTTGAAAGGGTGTTAATGCTTCACAAGTTCATAAAAAGTTGTGGGCTGTGTATGT ACGGTGATAAAGCCTTAATAGAACGGCAGTGTCAAAACTGCTTTGAGAAATACAGTTCTG GAGATTTTCCTTTGAAAAATGAGAACCGCTCCAGGCATCCCGTGGAGGTGAATGTCAGTC ATAAATAAAGGTTCTCATTGATTATGATCGGCATAATTCGACTATGGACATTGTAAAGAA GCTAGATGTGTCACATACGTGCGTCAAAAACCGTCTACGGCGTCTTGGGTGCCAAAAAAA GCTTGATGCGTTACTTTGGGGAACGTTAGTTAACGAGGCGACTTGGTCTTTGCGATATGC TTCTTAAACGCAATGCGAATGGCCCTTTTTTGAAAGAATGGTCAGCGGAGATGAAAAGTG CATTGTCTATGATGACTTTTTGAGAAAAAAGATCCTGGTCTAGGCAAGGAAACAGGCACC AACAACTTCTAAGACTGACATTCACCAAAAAAAGGTATTGTTATCATTTTGGTGGGATTA CAAAGGCATAGTCAACTTTGAGCTGCTGCCACGAAGTCAGACCATAAATTCAGAGGTTTA CATTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAACGACCGGAACTAGCCAA TAGCAAAGGAATTGTCTTTCATTACCATAATGCCAGGCCCTCCCCATCTTTAGCCACTGG ACAAAAACTACTGGAGCTAGGCTGGGATGTTTTGCTGCACCCTCCATATAGTCCCAAACT AGCTCCAAATAATTATCATTTTTTCCGATCCCTAAAAAATTTTTTGAACGGACAAAAATT CCAAAACGACAATGAGGTCAAAACTGCATTGGACCAGTTTTTTGCTCCTAAAACTAAAGA GTTTTATGAAAAAAGGAAAATGATACTACCCGAAAAATGTCAAAAGGTCATTAATAATAA TGGACATAATATAATAGAT >Contig6 ATGGAGAGCCAAAGCGAGCATTTCCTCCACATTTTCGTTTTTTATTTTTGAAAGGGTGTT AATGCTTCACAAGCCAATACAAAGTTGTGGGCTGTGTAGGGTGATGAAGCTTTAATAGAA CTGCAGTGTCAAAACTGGTTTGCGAAATTCCGTTCTGGAGATTTTTCTTTGAAAGATAAG GAGCGCTCTGGGGGTCCAGTGGAGGTCGATGATGACCAAATAAAGGCCCTAATTGTTAAT GATCGGCATAGTTCGACAAGGGACATTGCAAAGAAGCTAGATGTGTCACATAAGTGCGTC CAAAACCGTCTGCGGCGTCTTGGGTACCAAAAGAAGCTTGATGCATAACCTTGGGGAGTG TGAGTTAACGAGGCGACTTGATCTTTGCGATATGCTTCTTAAACGCAATGCGAATGACCC CTTTTTTGAAAGAATGGTCACCGGAGATGAAAAGTGGGTTGTCTATGACAACATTTTGAG GAAAAGATCCTGGTCTAGGCAAGGGAAACAGGCACCAACAACTTCTAAGGCTGACATTAA CCAAAAAAAGGTACTGTTATCAGTTTGGTGGGATTACAAAGGCATAGTCTACTTTGAGCT GCTGCTGCCACGAAATCAGACCATAAATTCAGAGGTTAATATTTGACAATTGAGAAATTT AAATGATACCATCCAAGAAAAACGACCGGAACTAGCCAATAGAAAAGGAAGAATTGTCTT TCAACACGATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGGAGCT AGGCTGGAATGTTTTGCTGCACCCTCCTTATAGTCCCGACCTAGCTCAAAGCGAGTATCA TTATTTCCGATCACTAAAAAATTTTTTGAACGGACAAAAATTCCAA >Contig7 
CTAGATGTGTCACATACGTACGTCGAAAACCGTCTGCGGCGTCTTGGGTACCAAAAGAAG CACAGGAGATGAAAAGTGGGTTGTCTATGACAACATTTTAAGGAAAAGATCCTGGTCTAG GCAAGGGAAACAGGCACCAACAACTTTTAAGGCTGACATAAACCAAAAAAAAGTATTGTT ATCATTTTGGTGGGATTACAAAGGCATAGTTTACTTTGAGCTGTTCCTACAAAGTCAGAC CATAAATTCAGAGGTTAACATTCAACAATTGACGAATTTAAATAATGCCATCCAAGAAAA ACGACCGGAGCTAGCCAATAGAAAAGGAATTGTCTTTCATCACAATAATGCCAGGCTCCA CACATCTTTAGACATCAGACAAAAACTACTGGAACTAGGCTGGGTTGTTTTGCCGCATCC TTCTTATAGTCCCAACTTAGCTCGAAGTGAGTTACATGTGTTTCGATCACTAAAAAAATT TTTGAACGGACAAAAAATCCAAATGAGGTCAAAACTGCATTGGAGCAGTTTTTTGCTCCT

AAAAATAAAGAGTTTTATAAAACATGGGATGATGATACTACCCGAAAAATGGCGAAAGAT CATTGATAACAATGGACATAATATAATG

A.2.3.2 CAP3 Contigs Quality Scores

>Contig1 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 82 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 11 97 97 97 97 82 97 97 97 97 97 97 97 97 82 97 97 82 97 82 82 82 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 82 82 97 97 97 97 97 97 97 97 97 17 97 97 97 97 97 97 97 97 97 97 97 97 87 97 97 97 97 82 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 11 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 82 97 97 97 11 97 97 97 17 97 97 97 17 97 97 97 97 97 97 97 97 97 97 17 97 97 97 97 97 97 97 97 97 97 17 97 97 97 97 97 97 97 97 17 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 17 97 97 97 97 97 97 97 97 97 97 97 97 17 97 97 82 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 4 4 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 4 4 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 82 97 97 97 97 82 97 97 97 97 97 97 97 97 97 97 97 97 82 97 97 97 97 97 82 82 97 97 97 97 97 82 82 97 97 82 97 97 97 82 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 82 97 97 97 97 97 82 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 82 97 97 97 97 97 97 97 97 97 97 97 97 11 17 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 82 82 82 97 97 97 97 97 97 82 97 97 97 97 97 97 97 97 97 97 82 97 97 97 97 97 97 97 97 97 97 97 97 97 82 97 11 82 97 97 97 97 97 97 97 97 97 82 82 82 82 82 82 82 82 82 82 97 97 97 82 97 17 97 97 97 97 97 97 97 97 97 82 17 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 82 82 82 97 97 11 97 97 97 97 97 97 97 17 97 82 97 97 97 97 97 97 97 
82 82 97 97 97 97 97 97 97 97 97 97 97 17 97 82 97 97 11 97 97 97 82 97 97 97 97 97 97 97 82 97 97 97 97 82 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 82 17 97 97 97 97 97 82 97 97 97 97 97 97 97 97 97 97 4 97 4 97 4 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97

141 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 4 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 4 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 95 97 97 97 >Contig2 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 30 97 97 4 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 70 97 97 97 97 97 97 50 97 97 97 70 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 70 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 70 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 90 70 97 97 97 97 97 97 97 97 97 97 97 70 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 70 97 97 97 97 97 97 97 97 97 90 97 97 97 90 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 80 50 97 97 97 97 97 80 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 90 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 42 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 42 97 97 97 97 97 42 97 97 97 97 97 97 97 42 
97 42 42 97 97 97 97 97 42 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 40 42 97 40 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 71 97 97 97 97 97 97 97 97 15 97 97 5 42 97 97 5 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 84 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 74 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 77 77 97 77 97 97 77 97 97 97 97 97 97 97 97 97 77 97 97 77 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 30 97 97 97 97 97 97 97 97 97 97 97 97 87 97 97

142 97 97 97 97 97 97 48 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 77 97 97 97 97 97 97 77 77 77 77 77 60 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 77 50 97 97 82 97 97 97 97 97 97 97 97 97 97 97 97 97 82 97 62 97 97 97 97 97 97 97 97 97 52 90 90 90 65 90 90 90 90 67 38 90 67 90 90 90 52 67 90 90 67 90 90 67 90 67 90 67 90 90 90 90 90 90 90 90 90 90 90 90 53 90 90 90 90 27 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 67 20 90 90 90 90 65 90 90 90 90 90 90 90 82 82 82 77 77 55 23 75 75 75 75 75 70 70 70 47 70 70 70 70 47 70 70 70 70 70 70 70 70 35 35 70 70 35 70 70 70 70 70 70 70 70 45 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 47 70 65 65 65 40 55 55 55 55 55 55 55 55 55 55 35 >Contig3 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 5 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 15 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 15 30 30 30 30 30 30 30 30 15 30 30 30 30 30 30 5 30 15 30 30 30 30 30 35 35 35 35 37 37 37 37 37 37 15 37 37 37 37 37 37 37 37 42 42 42 42 27 42 42 42 42 42 42 42 42 42 42 27 42 42 42 42 42 42 42 42 20 42 42 42 42 42 42 42 42 42 42 42 42 20 42 42 42 42 42 42 42 42 42 42 42 27 27 42 42 20 42 20 42 5 42 42 20 42 42 20 42 42 42 42 42 42 42 42 5 42 42 42 42 42 42 42 20 20 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 27 42 42 42 42 27 27 42 42 42 42 42 20 42 42 42 42 42 42 20 42 20 42 5 42 5 42 42 42 42 17 42 42 42 5 42 42 42 42 42 42 42 5 42 42 42 27 42 42 20 42 42 5 42 42 42 42 42 27 42 42 27 27 42 42 42 42 42 42 42 42 20 27 42 42 42 42 20 42 20 5 27 20 42 42 42 42 42 42 42 42 20 42 37 37 15 37 37 37 37 37 15 37 17 37 37 37 17 37 42 11 42 11 42 20 11 42 42 42 42 42 47 47 47 25 47 47 25 47 47 57 57 57 57 57 42 42 35 35 67 42 67 67 67 67 67 67 67 67 67 67 52 67 27 67 67 67 67 67 67 67 67 67 0 67 67 67 67 67 
67 67 67 67 67 67 67 67 67 67 67 67 67 67 45 67 67 67 67 67 67 67 67 67 65 65 65 60 60 60 60 60 60 60 60 60 15 60 35 15 60 60 60 60 60 60 60 60 60 60 15 60 60 60 60 60 60 60 45 45 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 45 60 60 60 60 55 55 55 55 55 55 10 55 55 55 55 55 55 30 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 60 60 60 60 60 60 60 60 10 60 60 0 0 0 60 65 10 10 15 67 45 57 57 57 95 95 95 95 95 95 95 95 95 95 95 95 95 95 95 95 95 92 95 95 95 95 95 95 95 50 95 95 95 95 92 92 92 92 92 92 92 92 92 92 92 92 40 92 92 92 92 92 92 92 92 92 92 92 92

143 92 92 40 40 40 92 92 92 92 25 92 92 40 92 92 92 92 92 25 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 25 92 25 40 92 92 92 92 92 92 92 92 92 92 92 92 40 25 40 25 25 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 40 40 92 40 40 92 92 92 92 92 92 92 92 92 92 92 92 40 25 92 92 40 92 92 40 40 40 92 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 62 10 62 62 10 62 62 62 62 62 62 62 62 10 62 62 62 62 62 62 62 62 62 62 62 62 62 10 62 62 62 35 50 35 20 20 5 20 20 20 20 20 20 5 20 5 20 20 22 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 >Contig4 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 5 20 20 20 20 20 20 20 5 20 20 20 20 5 20 20 20 20 20 20 20 20 20 20 5 20 20 20 20 20 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 10 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 10 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 10 25 25 25 10 25 25 25 25 25 25 25 25 25 10 25 25 25 25 10 25 25 25 25 25 25 25 25 25 25 10 25 25 25 25 25 10 25 25 10 25 25 25 25 25 25 25 25 25 25 25 25 25 10 25 25 25 25 25 25 25 25 25 10 25 25 25 25 25 10 25 25 25 25 25 25 10 25 25 10 25 25 25 25 25 
25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 10 10 5 5 5 5 45 40 45 45 45 45 45 45 45 45 45 45 45 45 45 45 45 45 45 45 45 45 45 45 45 50 50 35 50 50 50 50 50 72 97 97 5 97 97 97 97 97 97 97 97 97 97 97 97 97 93 97 97 97 97 97 97 97 82 97 93 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 82 93 97 97 82 97 97 97 93 93 97 97 97 97 97 97 97 82 97 97 97 97 97 97 82 97 97 97 97 97 97 97 97 93 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 95 95 95

144 97 85 97 97 97 97 97 97 97 30 97 30 97 97 97 97 30 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 30 97 97 97 97 97 97 97 97 97 97 97 97 30 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 30 97 97 97 97 30 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 30 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 5 90 5 50 50 25 50 50 50 50 50 50 25 50 25 25 50 50 4 50 50 50 4 50 50 50 50 25 50 25 25 50 50 50 50 50 50 50 50 50 25 50 50 25 50 50 25 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 25 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 25 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 45 45 45 45 45 45 45 45 45 45 45 45 45 45 45 45 45 45 45 45 45 20 45 45 45 45 45 45 20 45 45 45 45 45 45 45 45 45 45 45 45 20 45 45 45 45 45 45 45 45 45 20 4 20 40 40 40 35 35 10 35 35 35 35 35 35 >Contig5 10 10 10 10 10 10 10 10 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 
15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 35 35 35 35 35 35 35 35 35 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 15 40 40 40 40 40 40 40 40 40 40 40 40 40 15 40 40 40 40 15 85 85 85 95 95 95 95 95 95 95 95 95 95 95 95 95 95 95

145 95 95 95 40 70 95 95 95 95 95 95 95 95 95 95 95 95 95 95 95 70 70 95 95 95 95 95 95 95 95 95 95 95 95 95 95 95 95 95 95 95 95 95 95 95 95 95 95 90 25 90 90 90 90 95 95 95 95 90 90 90 90 90 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 45 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 36 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 42 97 97 97 97 97 97 97 36 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 57 97 57 97 97 97 36 97 97 97 97 97 36 42 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 42 97 97 42 97 97 97 97 97 97 97 97 97 97 97 97 97 97 42 97 97 97 42 42 97 36 97 97 97 97 97 42 97 97 97 97 42 97 42 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 42 42 97 97 42 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 42 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 42 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 42 97 97 97 97 97 97 97 97 97 97 82 97 97 97 97 97 97 97 97 97 82 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 5 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 82 97 97 5 82 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 17 97 72 97 97 97 97 97 97 97 97 97 97 97 97 75 97 77 67 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 37 97 97 97 97 97 57 97 97 97 97 97 97 97 97 90 60 85 60 70 85 85 85 85 85 85 85 85 85 85 75 75 75 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 15 60 60 35 60 60 35 60 60 60 60 60 60 60 60 60 60 60 60 35 60 60 60 55 55 55 30 30 25 >Contig6 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 
15 15 15 15 15 15 15 15 15 20 20 20 20 20 5 20 20 20 20 20 20 20 20 20 20 5 5 20 20 20 20 20 20 20 20 5 20 5 20 5 20 20 20 22 22 22 22 22 7 22 30 30 22 30 30 30 7 30 7 22 37 22 37 37 37 37 22 7 37 37 37 37 37 30 37 37 37 37 37 37 37 37 22 37 37 37 37 37 37 37 3 37 22 7 37 30 37 37 37 37 22 30 37 37 37 37 30 37 7 37 37 22 37 37 37 37 37 37 37 37 37 37 30 37 37 37 37 37 22 37 37 22 37 30 37 15 37 37 37 37 30 30 37 22 5 37 7 37 37 22 37 7 37 7 37 37 37 37 37 37 15 15 7 7 0 37 37 37 37 3 37 37 22 37 7 15 37 37 15 37 37 37 37 30 15 30 7 37 22 37 37 37 37 37 37 37 37 37 15 15 37 37 37 37 37 37 37 22 5 22 22 22 5 22 22 22

146 5 22 22 22 22 22 22 5 7 7 37 37 30 37 7 3 22 37 37 30 37 37 15 37 37 37 30 22 30 37 37 37 37 37 35 12 35 35 35 35 0 2 35 35 35 0 12 2 35 0 0 12 35 35 35 15 40 25 25 25 10 32 25 32 32 32 25 2 25 32 40 10 2 45 45 45 45 45 45 45 5 45 45 45 45 45 45 32 45 45 45 45 45 45 45 45 45 32 45 45 45 45 15 15 0 15 2 15 15 20 37 20 50 50 50 50 42 2 97 94 27 97 97 97 97 97 64 94 97 94 94 97 80 97 97 97 97 94 97 97 97 97 97 12 97 94 97 97 94 97 97 97 94 97 97 97 97 94 97 97 94 97 94 97 97 94 94 94 97 94 97 97 94 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 63 97 97 97 82 64 97 97 97 42 97 97 97 97 42 87 72 94 97 97 97 94 53 97 94 97 97 97 97 97 94 97 97 97 97 97 97 97 97 97 97 97 97 97 94 97 97 97 97 97 13 97 97 97 97 97 97 97 97 97 97 90 97 97 87 97 97 35 26 97 97 97 97 97 80 97 97 97 97 97 97 97 97 97 92 97 97 32 97 97 97 97 97 97 97 97 97 97 97 97 70 75 97 97 97 65 97 70 95 97 95 95 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 55 45 97 55 97 97 97 97 97 97 97 77 77 77 25 77 77 77 15 77 20 70 70 70 35 70 97 85 85 85 85 55 85 90 45 45 97 97 97 97 97 97 97 97 97 20 97 97 20 97 97 97 97 97 50 97 97 97 97 97 97 97 97 97 97 97 97 97 77 77 97 20 97 97 97 97 77 97 97 97 97 97 97 97 97 97 97 97 97 97 97 77 77 77 97 97 77 97 77 97 77 97 97 97 20 97 77 77 97 97 97 97 97 97 97 97 97 97 97 50 97 20 50 20 20 50 20 97 97 97 97 97 97 97 97 97 97 97 97 97 77 77 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 50 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 77 77 97 97 97 80 80 97 97 97 97 27 27 27 27 27 27 27 5 27 5 27 27 27 27 27 27 27 27 27 5 27 27 27 5 27 27 5 27 27 27 5 27 27 27 5 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 10 10 >Contig7 10 10 10 10 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 20 20 92 92 92 92 92 
70 92 92 70 92 92 70 92 37 92 92 92 92 92 92 15 92 92 92 92 92 92 92 92 92 92 92 92 97 42 97 42 20 42 42 42 42 97 97 97 97 42 97 97 97 97 97 97 97 97 97 97 97 75 97 97 97 97 97 97 97 97 20 42 97 97 97 75 97 97 97 97 97 97 97 20 75 42 97 97 97 97 97 97 97 95 40 80 40 95 95 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 35 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 35 35 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 35 90 35 90 90 90 90 35 90 90 90 90 90 90 90 90 90 35 90 90 90 90 90 90

147 85 85 85 85 85 85 85 85 85 85 85 85 80 80 80 80 80 80 80 80 60 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 50 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 50 75 75 75 50 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 50 50 50 75 50 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 50 75 75 75 75 75 75 70 70 70 45 45 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 35 35 35 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 25 25 25

A.3 Encompass Complete TE

We now aim to extend the TE beyond the coding sequence and find its instances in the genome.

Using the contigs from the previous step, we perform a blastn search, followed by combining and extracting hits. At this point, when we perform extractions, we add an extra 200 bp on both ends of the sequence. We now have instances in the genome with flanks.
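The combine-and-extract step can be illustrated with a short Python sketch. This is not the TESeeker code, which is written in BioPerl (see Appendix D); the function names, the 1-based inclusive coordinates, and the 100 bp joining distance in the example are assumptions made for illustration. Only the 200 bp flank length comes from the text above.

```python
from typing import List, Tuple

# Assumption: a hit is a 1-based inclusive (start, end) pair on one sequence and strand.
Interval = Tuple[int, int]


def merge_close_hits(hits: List[Interval], max_gap: int) -> List[Interval]:
    """Join hits that overlap or lie within max_gap bp of the previous hit."""
    merged: List[Interval] = []
    for start, end in sorted(hits):
        if merged and start - merged[-1][1] <= max_gap:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def add_flanks(hit: Interval, flank: int, seq_len: int) -> Interval:
    """Extend a hit by flank bp on both sides, clamped to the sequence bounds."""
    start, end = hit
    return max(1, start - flank), min(seq_len, end + flank)


if __name__ == "__main__":
    # Hypothetical hits; max_gap=100 is an illustrative value, not a TESeeker default.
    combined = merge_close_hits([(250, 400), (100, 200), (1000, 1100)], max_gap=100)
    print([add_flanks(hit, flank=200, seq_len=5000) for hit in combined])
```

Clamping to the sequence bounds matters for hits near the end of a supercontig, where a full 200 bp flank may not exist.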

A.4 Generate Consensus

Once we have extracted the instances from the genome with flanks, we perform a multiple sequence alignment using ClustalW2. Using the alignment, we generate a consensus sequence, shown below:

>Contig
TTTT-AGTA-GTTTAC-TTT-TG----A-A-A-A--AACTTTTG-T-AT-TTATTA-TTG
GTTCCTTCTG-----A--TTC-GTACCAAAATTTCA-T--A--TCTTA---AAT-GT-CG
-GAGCAAAAGA-A-T--AAAC------GGGAGCCAAAGCGAGC------ATTTCCTCC
ACATTTT-CTTTT----TTATTTTTGAAAGGG-TGTTAATG-CTTC-CA-GCCAATAAAA
AGTTGTGGG--G----TGTAGG------G-GATGAAGCCT-TAATAGAACGGCA-GTGTC
AAAA--CTGGTTTGCGAAATTCCGTTCTGGAGATTTT-CTTTGAAAAATGAG---GAG--
---GCTCCGGGCGTC-ATTGGAGGTTAATGATGAGCA------AATAAAGGCCCTCATTG
ATTATGATCGGCATAGTT-CGAC-AAGGACATTGTAAAGAAGCTAGATGTGTCACATACG
-TGCG---TCAAAAACCGTCT--GCGGCGTCTTGGGT-----CCAAAAGAAGCTTGATGC
-T-AC-TTGGGGA--GT------AG-C-AC---TTGGTCTTTGCGATATGCTTC---
TTAAACGCAA------CGAATGACCCTTTTTTGAAA---GAATGGTCACCGG----AGA
TGAAAA---GTGGGTTGTCT-ATGAT--GACTTTTTGAGA-----AAAAG-ATCCT---G
GTCTAGGCAA---G-AAACAGGCACCA-ACAACTTCTAAGGCTG-ACATTCACCAAAAAA
A------GGTATTGTTA-TCATTTTGGTGGGATTACAAAGGCATAGTC------ACTTT
-GAGC----TGC-GCCACGAAGTCAGACCATAAATTCAGA--GGTTTACATTCGACAATT
GACAAAT----TTAAATGATACCATCC------AAGAAAAACGACCG------GAGCTAGCCAATAGCA---AAGGAATT------GTCTT-TCACCA
C-ATAATG----CCAGGCCCT------CCCC---
-ATCTTTAGCCACTGG--ACAAA-----AACTACTGGAGCTAGGCTGG-ATGTTTTGCTG
CACCCTCC---TATAGTCCCAAACTAGCTCCAAATAATTATCA------TTTTTTCCG
AT--CTAAAAAATTTTTT-AA-GGACAAAAATTC----AAAA-GACAATGAGGTC------ACTGCATTGGA-CAGTTTTTT-GCTCCT--AAAACTAAA-GAGTTTTAT------GAA
AAAA-----G-A-AATGATACTACC--CGAAAAAT------A-AA--T
-A-TAATA-ATA------ACATAAT-T---ATA-ATAA----A-TAATTT---ATAATT
AATAAAT------TTT-TT-TTTT-T-A-AAA-TTC--A-A-T--A----T-T------AATA
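How a consensus is derived from a multiple sequence alignment can be sketched as a per-column majority vote. The Python below is only an illustration: the rule by which TESeeker calls each column is not stated in this appendix, and the function name, the 0.5 cutoff, and the use of N for undecided columns are assumptions. Gap characters take part in the vote, which is one way a consensus can retain dashes as in the listing above.

```python
from collections import Counter
from typing import List


def majority_consensus(aligned: List[str], min_fraction: float = 0.5) -> str:
    """Return one consensus character per alignment column.

    A column yields its most common character (gaps included) when that
    character makes up more than min_fraction of the rows, otherwise 'N'.
    The cutoff and the 'N' placeholder are illustrative assumptions.
    """
    if not aligned:
        return ""
    width = len(aligned[0])
    if any(len(row) != width for row in aligned):
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for column in zip(*aligned):
        char, count = Counter(c.upper() for c in column).most_common(1)[0]
        consensus.append(char if count / len(column) > min_fraction else "N")
    return "".join(consensus)


if __name__ == "__main__":
    # Hypothetical three-row alignment, not taken from the louse genome.
    print(majority_consensus(["ATG-C", "ATGAC", "TTG-C"]))
```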

A.5 Identify Complete TE

We now use the consensus sequence as the query in a blastn search against the genome to find its instances. The hits are again combined and extracted, this time with 50 bp flanks. To be considered, a hit must span at least 90% of the query length before flanks are added. We next assemble the hits iteratively with CAP3.
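The 90% length requirement amounts to a simple filter applied before flanks are added. The following Python sketch shows the idea; the Hit type, the function name, and the inclusive coordinate convention are assumptions for illustration and do not mirror the BioPerl scripts in Appendix D.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """Hypothetical record for a combined BLAST hit (1-based, inclusive)."""
    seq_id: str
    start: int
    end: int


def filter_by_query_length(hits: List[Hit], query_len: int,
                           min_fraction: float = 0.90) -> List[Hit]:
    """Keep hits whose unflanked length is at least min_fraction of the query."""
    min_len = min_fraction * query_len
    return [h for h in hits if (h.end - h.start + 1) >= min_len]


if __name__ == "__main__":
    # Illustrative coordinates only; the query length here is made up.
    candidates = [Hit("s1", 1, 1280), Hit("s1", 5000, 6000), Hit("s2", 10, 1200)]
    for hit in filter_by_query_length(candidates, query_len=1280):
        print(hit)
```

Filtering on the unflanked length keeps the 50 bp flanks from helping a short fragment pass the threshold.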

A.5.1 CAP3 Assembly

The results from the CAP3 contigs file are shown below.

>Contig1
ATTTATTGGGTTGGCCAATAAGTAACTGCGGATTTTACCAACAGATAGTTTGTTTATTTT
TTTGAGTACGTTTACGTTTTTGTACAGACATGAACTTTTGATATGTTATTACTTGGTTCC
TTCTGTAACATTCGGTACCAAAATTTCATTGAACTCTTAAATAGTACGCGAGCAAAAGAC
ATTTAAACATGGGGAGCCAAAGCGAGCATTTCCTCCACATTTTACTTTTTTATTTTTGAA
AGGGTGTTAATGCTTCCCAGGCCAATAAAAAGTTGTGGGCCGTGTAGGGGGATGAAGCCT
TAACAGAACGGCAGTGTCAAAACTGGTTTGCGAAATTCCGTTCTGGAGATTTTTCTTTGC
AAAATGAGGAGCGCTCCGGGCGCCAATTGGAGGTTAAAGATGAGCAAATAAAGGCCCTCA
TTGATTATGATCGGCATAGTTCGACGAAGGACATTGTAAAGAAGCTAGATGTGTCACATA
CGTGCGTCAAAAACCGTCTGCGGCGTCTTGGGTGCCAAAAGAAGCTTGATGCGTTACTTT
GGGGAACGTTAGTTAACGAGGCGACTTGGTCTTTGCGATATGCTTCTTAAACGCAATGCA
AATGACCCTTTTTTGAAAGAATGGTCACCGGAGATGAAAAGTGGGTTGTCTATGATGACT
TTTTGAGAAAAAGATCCTGGTTTAGGCAAGGAAACAGGCACCAACAACTTCTAAGACTGA
CATTCACCAAAAAAAGGTATTGTTATCATTTTGGTGGGATTACAAAGGCATAGTCAACTT
TGAGCTGCTGCCACGAAGTCAGACCATAAATTCAGAGGTTTACATTCGACAATTGACAAA
TTTAAATGATACCATCCAAGAAAAACGACCGGAACTAGCCAATAGCAAAGGAATTGTCTT
TCACCACCATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGGAGCT
AGGCTGGAATGTTTTGCTGCACCCTCCATATAGTCCCAAACTAGCTCCAAATAATTATCA
TTTTTTCCGATTCCTAAAAAATTTTTTAAACGGACAAAAATTCCAAAACGACAATGAGGT
CAAAACTGCATTGGACCAGTTTTTTGCTCCTAAAACTAAAGAGTTTTATGAAAAAAGGAA
AATGATACTACCCGAAAAATGTCAAAAGGTCACTAATAATAATGGACATAATATAATAGA
TAAAAATAATTTTGCATAATTAATAAATCGTTTTTTGTTTTCTTAAAAAATTCGTAAATA
TCTTTTTGCCAACCCAATA

A.5.2 CAP3 Contigs Quality File

The quality file for the contigs in the previous subsection is shown below.

>Contig1 35 25 5 72 72 72 72 72 72 72 72 72 72 72 72 37 72 57 72 72 72 72 72 72 72 72 72 72 72 72 57 72 72 57 72 72 72 72 65 72 57 65 50 57 72 56 72 72 72 72 72 72 72 20 5 42 57 72 72 72 72 57 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 47 72 72 72 72 72 72 32 72 72 72 72 72 72 72 72 72 72 72 72 72 20 72 72 72 65 72 72 72 72 72 72 72 72 72 72 72 72 72 72 50 72 72 57 72 72 72 72 72 72 72 72 72 65 37 72 72 72 72 72 72 62 72 72 72 72 72 72 72 72 72 35 72 72 20 72 72 47 72 72 72 72 72 72 72 72 10 72 72 72 72 72 72 72 72 72 57 72 72 72 10 72 21 65 35 72 72 57 72 72 72 72 72 72 72 72 72 72 72 72 0 72 72 72 72 72 72 72 72 72 57 72 72 72 72 72 50 72 72 72 72 72 72 72 20 72 72 72 72 72 72 72 72 72 72 72 37 72 72 72 72 72 56 47 72 72 37 72 65 72 72 72 65 72 72 72 72 20 72 72 20 72 35 35 35 72 72 72 72 72 72 57 72 72 72 72 72 72 57 72 0 41 72 72 57 72 72 65 72 72 32 72 72 72 72 65 72 30 72 57 72 65 65 65 15 65 72 72 72 47 72 65 72 72 72 57 72 72 72 72 72 72 65 72 72 65 72 72 72 65 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 57 72 72 72 72 72 72 72 72 31 57 72 72 72 57 21 72 72 72 72 72 72 72 72 72 72 72 65 72 72 72 72 72 72 72 65 72 72 37 65 37 72 72 72 72 72 72 72 72 72 72 72 72 25 72 72 72 72 72 72 72 72 72 72 72 72 72 57 72 72 72 72 72 72 72 72 72 72 72 72 72 72 65 72 72 72 72 72 72 72 20 72 72 65 56 72 72 72 72 57 72 20 72 72 72 57 72 72 57 72 72 57 56 72 72 72 72 72 72 65 72 72 72 72 72 72 72 57 72 72 72 72 57 57 72 72 72 72 72 72 72 72 72 72 72 72 72 72 62 72 20 65 72 72 72 56 72 47 57 72 72 72 72 72 72 72 47 65 72 57 72 47 72 72 72 72 11 72 72 65 72 72 72 57 72 72 72 72 57 72 72 72 72 72 72 72 72 57 72 72 72 65 72 65 72 65 72 72 65 65 72 72 72 65 65 65 65 65 50 65 65 72 72 72 72 72 47 72 72 72 72 47 72 72 72 72 72 72 72 72 72 50 72 72 72 72 72 72 72 57 72 57 72 72 72 0 72 72 72 72 37 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 65 72 65 72 57 72 65 65 72 47 40 72 57 72 72 72 72 72 72 72 72 72 50 65 65 72 72 72 72 47 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 
72 72 47 72 72 72 72 57 72 72 65 72 57 57 57 72 5 72 72 57 72 72 72 72 72 72 72 57 57 42 57 72 57 57 57 72 47 57 72 57 72 72 72 72 72 72 72 57 57 72 27 27 72 72 72 72 72 72 47 47 72 47 72 72 72 72 72 57 72 50 72 35 72 57 57 72 72 72 47 72 72 57 72 57 72 72 72 72 65 72 72 57 72 57 72

151 72 72 72 57 72 72 72 65 72 72 72 72 72 65 72 25 72 72 72 72 72 72 72 57 72 72 72 72 72 72 72 65 72 65 72 72 0 31 72 72 72 72 72 37 65 72 72 72 72 72 72 72 72 72 37 72 72 57 72 72 72 72 72 72 72 72 72 57 72 72 72 72 72 57 72 72 72 20 72 72 72 72 72 72 72 20 72 72 65 72 72 72 72 72 72 72 72 72 72 57 72 72 72 72 72 72 72 72 37 57 72 57 72 17 72 72 72 72 72 65 72 72 72 72 72 72 65 72 72 72 57 72 72 72 72 72 72 72 72 72 72 72 72 35 35 72 57 37 72 72 72 72 72 72 72 72 72 72 72 72 57 72 72 72 5 72 20 72 10 72 72 72 65 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 57 72 72 72 20 72 57 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 47 62 72 72 72 72 72 72 72 72 72 65 72 72 72 57 72 72 57 72 72 72 57 72 72 72 72 72 72 72 72 72 72 65 72 72 72 72 72 72 72 72 65 57 72 72 10 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 25 72 72 65 65 72 72 47 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 65 72 37 72 72 57 72 72 72 72 72 72 72 62 72 72 65 72 72 72 72 72 72 72 72 0 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 57 65 72 20 72 72 72 57 72 72 72 65 72 65 72 72 72 72 72 72 72 72 72 72 72 72 57 72 72 72 72 37 72 72 72 72 72 72 57 57 72 72 72 72 72 72 72 72 72 72 40 30 57 72 72 72 72 72 72 72 65 20 20 72 72 72 72 52 67 60 67 67 67 52 67 52 42 67 52 67 67 67 67 67 67 67 67 67 67 67 5 5 60 67 52 67 67 67 67 67 57 67 67 42 67 67 60 67 15 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 60 67 67 67 5 67 67 20 42 67 42 67 67 67 67 67 67 67 67 60 67 67 67 67 67 67

A.5.3 Trimmed CAP3 Contigs

Here we list the final contig, which represents the consensus TE. In this case, nothing was trimmed from the previous step, as the quality scores were all above the threshold.

>Contig1-0 0 1280 f
ATTTATTGGGTTGGCCAATAAGTAACTGCGGATTTTACCAACAGATAGTTTGTTTATTTT
TTTGAGTACGTTTACGTTTTTGTACAGACATGAACTTTTGATATGTTATTACTTGGTTCC
TTCTGTAACATTCGGTACCAAAATTTCATTGAACTCTTAAATAGTACGCGAGCAAAAGAC
ATTTAAACATGGGGAGCCAAAGCGAGCATTTCCTCCACATTTTACTTTTTTATTTTTGAA
AGGGTGTTAATGCTTCCCAGGCCAATAAAAAGTTGTGGGCCGTGTAGGGGGATGAAGCCT
TAACAGAACGGCAGTGTCAAAACTGGTTTGCGAAATTCCGTTCTGGAGATTTTTCTTTGC
AAAATGAGGAGCGCTCCGGGCGCCAATTGGAGGTTAAAGATGAGCAAATAAAGGCCCTCA
TTGATTATGATCGGCATAGTTCGACGAAGGACATTGTAAAGAAGCTAGATGTGTCACATA
CGTGCGTCAAAAACCGTCTGCGGCGTCTTGGGTGCCAAAAGAAGCTTGATGCGTTACTTT
GGGGAACGTTAGTTAACGAGGCGACTTGGTCTTTGCGATATGCTTCTTAAACGCAATGCA
AATGACCCTTTTTTGAAAGAATGGTCACCGGAGATGAAAAGTGGGTTGTCTATGATGACT
TTTTGAGAAAAAGATCCTGGTTTAGGCAAGGAAACAGGCACCAACAACTTCTAAGACTGA
CATTCACCAAAAAAAGGTATTGTTATCATTTTGGTGGGATTACAAAGGCATAGTCAACTT
TGAGCTGCTGCCACGAAGTCAGACCATAAATTCAGAGGTTTACATTCGACAATTGACAAA
TTTAAATGATACCATCCAAGAAAAACGACCGGAACTAGCCAATAGCAAAGGAATTGTCTT
TCACCACCATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGGAGCT
AGGCTGGAATGTTTTGCTGCACCCTCCATATAGTCCCAAACTAGCTCCAAATAATTATCA
TTTTTTCCGATTCCTAAAAAATTTTTTAAACGGACAAAAATTCCAAAACGACAATGAGGT
CAAAACTGCATTGGACCAGTTTTTTGCTCCTAAAACTAAAGAGTTTTATGAAAAAAGGAA
AATGATACTACCCGAAAAATGTCAAAAGGTCACTAATAATAATGGACATAATATAATAGA
TAAAAATAATTTTGCATAATTAATAAATCGTTTTTTGTTTTCTTAAAAAATTCGTAAATA
TCTTTTTGCCAACCCAATA
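The trimming rule itself is not spelled out in this appendix. As one plausible reading, the Python sketch below removes bases from each end of a contig while their quality score is below a threshold and leaves interior positions untouched. The function name and the threshold of 20 in the example are hypothetical and should not be taken as TESeeker's actual settings.

```python
from typing import List, Tuple


def trim_low_quality_ends(seq: str, quals: List[int],
                          threshold: int) -> Tuple[str, List[int]]:
    """Strip leading and trailing bases whose quality is below threshold.

    Interior low-quality positions are kept; only the ends are trimmed.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must have the same length")
    start, end = 0, len(seq)
    while start < end and quals[start] < threshold:
        start += 1
    while end > start and quals[end - 1] < threshold:
        end -= 1
    return seq[start:end], quals[start:end]


if __name__ == "__main__":
    # Made-up sequence and scores; threshold=20 is an illustrative assumption.
    print(trim_low_quality_ends("ACGTACGT", [5, 10, 40, 40, 3, 40, 12, 8], threshold=20))
```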

APPENDIX B

TESeeker WEBSITE

The TESeeker website is located at http://www.nd.edu/~teseeker and includes the virtual appliance, representative TE library, and documentation. Figure B.1 shows a screen capture of the home page.

Figure B.1. TESeeker Website. The website allows researchers to download the TESeeker virtual appliance, as well as view the documentation. We also provide the library of representative TEs for download.

APPENDIX C

TESeeker USER MANUAL

TESeeker is available as a VirtualBox [144] virtual appliance in the Open Virtualization Format (OVF). TESeeker requires at least 5 GB of free hard disk space and at least 1.5 GB of RAM on the host machine. TESeeker can dynamically allocate up to 40 GB of hard disk space for use in the virtual appliance. TESeeker is licensed under the GNU General Public License (GPL) v3 [55].

C.1 Installation

TESeeker can run on any operating system that supports the VirtualBox virtualization software package, currently available for Windows, OS X, Linux, and Solaris.

To install TESeeker, follow these steps:

1. Download and install VirtualBox from http://www.virtualbox.org.

2. Download the two TESeeker virtual appliance files from http://www.nd.edu/~teseeker.

3. Open VirtualBox.

4. Click File then Import Appliance... and complete the wizard, selecting the TESeeker .ovf file as the source. Be sure both downloaded TESeeker files are in the same directory.

C.2 Usage

After installation, start TESeeker by opening VirtualBox, clicking teseeker in the left frame, and then clicking Start. The virtual appliance hosting TESeeker will then boot.¹ As shown in Figures C.1-C.7, the booted appliance contains seven desktop items: the Genomes and TELibrary folders, shortcuts to bring up the documentation and web interfaces, and the license. The TESeeker interface is shown in Figure C.5. Hovering the mouse over a parameter name will provide a more detailed description. All genome and library files must be placed in the folders on the desktop and must be in the FASTA file format with a .fa, .fas, or .fasta file extension. We have included the Pediculus humanus humanus genome and our representative TE library within the virtual appliance.
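Before copying files into the Genomes or TELibrary folders, it can be useful to check that they meet the naming and format requirements just described. The Python sketch below is a convenience check written for illustration, not part of TESeeker; the function name is hypothetical, and the case-sensitive extension match and the test on the first non-blank line are assumptions about what counts as acceptable input.

```python
from pathlib import Path

# Extensions named in the manual; matching them case-sensitively is an assumption.
ALLOWED_EXTENSIONS = {".fa", ".fas", ".fasta"}


def looks_like_fasta(path: Path) -> bool:
    """Return True if the file has an allowed extension and its first
    non-blank line is a FASTA header beginning with '>'."""
    if path.suffix not in ALLOWED_EXTENSIONS:
        return False
    with path.open() as handle:
        for line in handle:
            if line.strip():
                return line.startswith(">")
    return False  # empty file


if __name__ == "__main__":
    import sys

    # Usage: python check_fasta.py file1.fa file2.fasta ...
    for name in sys.argv[1:]:
        print(name, "ok" if looks_like_fasta(Path(name)) else "not usable")
```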

¹ Some Linux distributions automatically enable the KVM kernel extension. If this is the case, disable it with the command sudo modprobe -r kvm_intel. To restore the KVM kernel extension, run sudo modprobe kvm_intel.

Figure C.1. TESeeker Desktop. This figure shows the desktop.

Figure C.2. TESeeker Genomes Folder. Researchers can place FASTA genome data in this folder.

Figure C.3. TESeeker TELibrary. This figure shows the folder for the representative TEs. Researchers can also place FASTA sequence data in this folder.

Figure C.4. TESeeker Documentation. Here, we show a screen capture of the HTML TESeeker Documentation.

Figure C.5. TESeeker Web Interface. This figure shows the TESeeker web interface. Researchers can alter the default parameters as desired. Library and genome files in the desktop folders are selectable through drop-down menus.

Figure C.6. TESeeker BLAST Interface. Here, we show the BLAST interface. The BLAST Database drop-down menu is populated via the genomes available in the Genomes folder.

Figure C.7. TESeeker Extract Interface. This figure shows the TESeeker extract interface. Researchers can extract specified sequence data from any genome in the Genomes folder.

Clicking the TESeeker shortcut on the desktop will load the web interface. Here, researchers can modify the default parameters, most notably the BLAST Query Library, BLAST Database, and the Desktop Output Folder Name. Hovering over a parameter name will provide a detailed tooltip description. Once the parameters have been set, clicking submit will briefly show the selected parameters and then start the search. The browser will display Job X is Running, where X represents the job id number, and will continually refresh the page until the job completes, at which point the page will notify the user. When the job has finished, researchers can navigate to the specified output folder on the desktop to view the results.

If the researcher elects to find only the coding region, results are organized as follows within the specified output folder: the codingRegion_files folder contains intermediary output, the output folder contains all the singlets and contigs produced, and the remaining files represent the contigs and singlets produced from CAP3. For example, a file called cap2c_out.fas contains the contig sequences from the second iteration of CAP3, while cap1s_out.fas contains the singlet sequences produced from the first iteration of CAP3. If a consensus sequence is desired, the results are organized as follows within the specified output folder: the codingRegion_files folder contains intermediary output from the coding region search, the consen_files folder contains intermediary files from the consensus search, and the output folder contains the contig and singlet sequences produced from each sequence that was fed into the consensus search. Additionally, all contig and singlet sequences are available in single FASTA files in the specified output folder.
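The naming convention for the per-iteration CAP3 files lends itself to a small parser when results are post-processed outside the appliance. The Python sketch below is an illustration only: it assumes the names follow the pattern cap, iteration number, c or s, then out.fas, joined by an underscore (a space is also accepted), and the Cap3Output type and function name are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Cap3Output(NamedTuple):
    """Hypothetical description of one per-iteration CAP3 result file."""
    iteration: int
    kind: str  # "contigs" or "singlets"


# Assumed pattern: cap<iteration><c|s>_out.fas, based on the two examples in the text.
_CAP3_NAME = re.compile(r"^cap(\d+)([cs])[ _]out\.fas$")


def parse_cap3_output_name(filename: str) -> Optional[Cap3Output]:
    """Return the iteration and sequence kind, or None if the name does not match."""
    match = _CAP3_NAME.match(filename)
    if match is None:
        return None
    kind = "contigs" if match.group(2) == "c" else "singlets"
    return Cap3Output(int(match.group(1)), kind)


if __name__ == "__main__":
    for name in ("cap2c_out.fas", "cap1s_out.fas", "notes.txt"):
        print(name, parse_cap3_output_name(name))
```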

C.3 Example Search

TESeeker is distributed with the Pediculus humanus humanus genome as well as our library of representative TEs. We next describe how one could obtain a high-quality consensus element for the Pediculus humanus humanus mariner element, once the virtual appliance has been loaded.

1. Launch TESeeker. Double-click the TESeeker shortcut on the desktop.

2. Confirm Parameters. Ensure mariner_ac.fa is selected for the BLAST Query Library and that the phumanus.SUPERCONTIGS-USDA.PhumUA.fa genome is selected for the BLAST Database. Also click Find Consensus? to enable a consensus search. The screen should now look as shown in Figure C.8. The status for TESeeker will be continuously updated through the web interface until the job completes.

3. Inspect Results. When the job is finished, click the link to the specified output folder, louseOut, and inspect the results. The web view of this folder is shown in Figure C.9. As mentioned in the previous section, the main consensus results will be in up to three FASTA files: consensus_contigs.fas, consensus_iter1_singlets.fas, and consensus_singlets.fas. The best hits are generally in the consensus_contigs.fas file, while the least likely candidates are generally in the consensus_iter1_singlets.fas file. In this case, the first contig in consensus_contigs.fas, Contig1-0 6 1309 f, contains a sequence 99% identical to the manually annotated element, differing mainly in its roughly 10 extra nucleotides on both ends. Figure C.10 shows the ends of the aligned sequences.

Figure C.8. TESeeker Default Parameters. This figure shows the TESeeker web interface with the default parameters set for a search for the mariner transposon in P. humanus humanus.

C.4 Additional Tools

There are also BLAST and Extract shortcuts on the desktop. These web interfaces offer additional functionality by making it simpler to do a custom BLAST search or sequence extraction using the files in the Genomes folder.

Figure C.9. Web Interface File Browser. This figure shows the contents of the main output folder, louseOut. FASTA sequences are in the .fas files, shown here as consensus_contigs.fas, consensus_iter1_singlets.fas, and consensus_singlets.fas.

(a) 5’ End

(b) 3’ End

Figure C.10. ClustalX Alignment with Annotated Element. Panels (a) and (b) show the 5’ and 3’ ends of the annotated mariner (mariner) and the top consensus sequence produced by TESeeker (Contig1-0) when run with the default parameters. The sequences are 99% identical. The extra sequence on both ends of Contig1-0 can be reduced with stricter parameters.
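A percent identity figure such as the one quoted for this alignment depends on how gapped columns are treated. The Python sketch below shows one common convention, in which columns where either sequence has a gap are skipped, so that extra sequence at the ends does not lower the score. This is an assumption for illustration; the text does not state which convention was used to obtain the 99% value, and the function name is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of matching bases over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # assumption: gapped columns are excluded from the count
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    # Toy alignment: the second sequence has two extra bases on each end.
    print(percent_identity("--ACGTACGT--", "TTACGTACCTAA"))
```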

C.5 Technology

TESeeker utilizes a variety of technologies. The core bioinformatics tools, BLAST, CAP3, ClustalW2, and BioPerl, were mentioned previously in Section 3.2.3 and are united through bash scripts. Researchers interact with TESeeker through a web-based form implemented in HTML/PHP and served by the lighttpd web server. The form interacts with the local scripts and utilizes a PostgreSQL database and CGI/Perl to notify researchers when a job has completed. TESeeker is installed on Ubuntu 10.04 LTS. The administrative password for user teseeker is teseeker.

APPENDIX D

SELECTED AUTOMATED APPROACH SOURCE CODE

This appendix presents selected BioPerl scripts written for our automated approach.

D.1 Combine BLAST Hits

###################################################################
### This is part of TESeeker, licensed under the GNU General Public
### License (GPL) v3.
###
### Authors: Ryan Kennedy and Scott Christley
###
### Arguments:
###   ARGV[0]: closeness in order for segments to be joined
###   ARGV[1]: minimum length a segment must be to be joined
###   ARGV[2]: closeness segments must be in the query to be joined
###   ARGV[3]: maximum length percent of query of combined sequence
###            if joined because of ARGV[2]
###   ARGV[4]: length combined sequence must be relative to the query
###   ARGV[5]: length of flanks to add to each side
###   ARGV[6]: BLAST file to process
###
####################################################################
use Bio::SeqIO;
use Bio::SearchIO;
use List::Util qw[min max];

$closeLen     = $ARGV[0];
$minSegLen    = $ARGV[1];
$querySepDist = $ARGV[2];
$queryMaxPerc = $ARGV[3];
$minLenPerc   = $ARGV[4];
$flankLen     = $ARGV[5];
$blast        = $ARGV[6];

$bres = Bio::SearchIO->new(-format => 'blast', -file => $blast);

# Collect HSPs into per-scaffold hashes, one hash per strand
%TEfor=(); $forNum=1;
%TErev=(); $revNum=1;
while(my $result = $bres->next_result) {
  $minLen=$minLenPerc * $result->query_length;
  $queryMax=$queryMaxPerc * $result->query_length;
  while(my $hit = $result->next_hit) {
    while(my $hsp = $hit->next_hsp) {
      $orientation = $hsp->strand('hit');
      if($orientation>0) { #forward strand
        $hitStart=$hsp->start('hit');
        $hitEnd=$hsp->end('hit');
        $QStart=$hsp->start('query');
        $QEnd=$hsp->end('query');
        $overlap=0;
        $QJoin=0;

        if(abs($hitStart-$hitEnd)>=$minSegLen) { #only combine segments of a specified length
          $TEfor{$hit->name}{$forNum}{"start"}  = $hsp->start('hit');
          $TEfor{$hit->name}{$forNum}{"end"}    = $hsp->end('hit');
          $TEfor{$hit->name}{$forNum}{"qStart"} = $hsp->start('query');
          $TEfor{$hit->name}{$forNum}{"qEnd"}   = $hsp->end('query');
          $TEfor{$hit->name}{$forNum}{"minLen"} = $minLen;
          $TEfor{$hit->name}{$forNum}{"query"}  = $result->query_accession . $result->query_length;
          ++$forNum;
        } #end minSegLen check
      } else { #reverse strand
        $hitStart=$hsp->start('hit');
        $hitEnd=$hsp->end('hit');
        $QStart=$hsp->start('query');
        $QEnd=$hsp->end('query');
        $overlap=0;
        $QJoin=0;

        if(abs($hitStart-$hitEnd)>=$minSegLen) { #only combine segments of a specified length
          $TErev{$hit->name}{$revNum}{"start"}  = $hsp->start('hit');
          $TErev{$hit->name}{$revNum}{"end"}    = $hsp->end('hit');
          $TErev{$hit->name}{$revNum}{"qStart"} = $hsp->start('query');
          $TErev{$hit->name}{$revNum}{"qEnd"}   = $hsp->end('query');
          $TErev{$hit->name}{$revNum}{"minLen"} = $minLen;
          $TErev{$hit->name}{$revNum}{"query"}  = $result->query_accession . $result->query_length;
          ++$revNum;
        } #end minSegLen check
      } #end orientation if
    } #end hsp
  } #end hit
} #end result
$bres->close();

#JOIN
for $scaffold ( sort keys %TEfor ) {
  $i=0;
  $#Forscaf= -1;
  $#teForStart=-1;
  $#teForEnd=-1;
  $#teForQStart=-1;
  $#teForQEnd=-1;
  $#teForMinLen=-1;

  # go through each scaffold and collect into arrays
  for $TE ( keys %{ $TEfor{$scaffold} }) {
    $Forscaf[$i]=$scaffold;
    $teForStart[$i]=$TEfor{$scaffold}{$TE}{"start"};
    $teForEnd[$i]=$TEfor{$scaffold}{$TE}{"end"};
    $teForQStart[$i]=$TEfor{$scaffold}{$TE}{"qStart"};
    $teForQEnd[$i]=$TEfor{$scaffold}{$TE}{"qEnd"};
    $teForMinLen[$i]=$TEfor{$scaffold}{$TE}{"minLen"};
    $teForQuery[$i]=$TEfor{$scaffold}{$TE}{"query"};
    $i++;
  }

  # sort and get indexes
  # (numeric <=> rather than string cmp, since start positions are numbers)
  @list_order = sort { $teForStart[$a] <=> $teForStart[$b] } 0 .. $#teForStart;

  # push sorted entries onto our final arrays
  for my $i ( 0 .. $#list_order ) {
    push @joinScaffold, $Forscaf[$list_order[$i]];
    push @joinStart, $teForStart[$list_order[$i]];
    push @joinEnd, $teForEnd[$list_order[$i]];
    push @joinQStart, $teForQStart[$list_order[$i]];
    push @joinQEnd, $teForQEnd[$list_order[$i]];
    push @joinMinLen, $teForMinLen[$list_order[$i]];
    push @joinQuery, $teForQuery[$list_order[$i]];
  }
}

@Forscaf=@joinScaffold;
@teForStart=@joinStart;
@teForEnd=@joinEnd;
@teForMinLen=@joinMinLen;
@teForQuery=@joinQuery;

# merge neighboring forward-strand segments that lie within $joinDist
$joinDist=$closeLen;
for($i=0;$i<=(@Forscaf+0);$i++) {
  if($Forscaf[$i+1]) {
    if(abs($teForStart[$i+1]-$teForEnd[$i])<=$joinDist
       && $Forscaf[$i] eq $Forscaf[$i+1]
       && $teForQuery[$i] eq $teForQuery[$i+1]) {
      $teForEnd[$i]=max($teForEnd[$i],$teForEnd[$i+1]);
      splice @Forscaf, $i+1, 1;
      splice @teForStart, $i+1, 1;
      splice @teForEnd, $i+1, 1;
      splice @teForMinLen, $i+1, 1;
      splice @teForQuery, $i+1, 1;
      $i-=1; #unless $i==0;
    } elsif(abs($teForStart[$i]-$teForEnd[$i+1])<=$joinDist
       && $Forscaf[$i] eq $Forscaf[$i+1]
       && $teForQuery[$i] eq $teForQuery[$i+1]) {
      $teForEnd[$i]=max($teForEnd[$i],$teForEnd[$i+1]);
      splice @Forscaf, $i+1, 1;
      splice @teForStart, $i+1, 1;
      splice @teForEnd, $i+1, 1;
      splice @teForMinLen, $i+1, 1;
      splice @teForQuery, $i+1, 1;
      $i-=1; #unless $i==0;
    } elsif(abs($teForStart[$i]-$teForStart[$i+1])<=$joinDist
       && $Forscaf[$i] eq $Forscaf[$i+1]
       && $teForQuery[$i] eq $teForQuery[$i+1]) {
      $teForEnd[$i]=max($teForEnd[$i],$teForEnd[$i+1]);
      splice @Forscaf, $i+1, 1;
      splice @teForStart, $i+1, 1;
      splice @teForEnd, $i+1, 1;
      splice @teForMinLen, $i+1, 1;
      splice @teForQuery, $i+1, 1;
      $i-=1; #unless $i==0;
    } elsif(abs($teForEnd[$i]-$teForEnd[$i+1])<=$joinDist
       && $Forscaf[$i] eq $Forscaf[$i+1]
       && $teForQuery[$i] eq $teForQuery[$i+1]) {
      $teForEnd[$i]=max($teForEnd[$i],$teForEnd[$i+1]);
      splice @Forscaf, $i+1, 1;
      splice @teForStart, $i+1, 1;
      splice @teForEnd, $i+1, 1;
      splice @teForMinLen, $i+1, 1;
      splice @teForQuery, $i+1, 1;
      $i-=1; #unless $i==0;
    }
  }
}

$#joinScaffold=-1;
$#joinStart=-1;
$#joinEnd=-1;
$#joinMinLen=-1;
$#joinQuery=-1;
$#list_order=-1;

#REVERSE JOIN
for $scaffold ( sort keys %TErev ) {
  $i=0;
  $#Revscaf= -1;
  $#teRevStart=-1;
  $#teRevEnd=-1;
  $#teRevQStart=-1;
  $#teRevQEnd=-1;
  $#teRevMinLen=-1;

  # go through each scaffold and collect into arrays
  for $TE ( keys %{ $TErev{$scaffold} }) {
    $Revscaf[$i]=$scaffold;
    $teRevStart[$i]=$TErev{$scaffold}{$TE}{"start"};
    $teRevEnd[$i]=$TErev{$scaffold}{$TE}{"end"};
    $teRevQStart[$i]=$TErev{$scaffold}{$TE}{"qStart"};
    $teRevQEnd[$i]=$TErev{$scaffold}{$TE}{"qEnd"};
    $teRevMinLen[$i]=$TErev{$scaffold}{$TE}{"minLen"};
    $teRevQuery[$i]=$TErev{$scaffold}{$TE}{"query"};
    $i++;
  }

  # sort and get indexes
  # (numeric <=> rather than string cmp, since start positions are numbers)
  @list_order = sort { $teRevStart[$a] <=> $teRevStart[$b] } 0 .. $#teRevStart;

  # push sorted entries onto our final arrays
  for my $i ( 0 .. $#list_order ) {
    push @joinScaffold, $Revscaf[$list_order[$i]];
    push @joinStart, $teRevStart[$list_order[$i]];
    push @joinEnd, $teRevEnd[$list_order[$i]];
    push @joinQStart, $teRevQStart[$list_order[$i]];
    push @joinQEnd, $teRevQEnd[$list_order[$i]];
    push @joinMinLen, $teRevMinLen[$list_order[$i]];
    push @joinQuery, $teRevQuery[$list_order[$i]];
  }
}

@Revscaf=@joinScaffold;
@teRevStart=@joinStart;
@teRevEnd=@joinEnd;
@teRevMinLen=@joinMinLen;
@teRevQuery=@joinQuery;

# merge neighboring reverse-strand segments that lie within $joinDist
$joinDist=$closeLen;
for($i=0;$i<=(@Revscaf+0);$i++) {
  if($Revscaf[$i+1]) {
    if(abs($teRevStart[$i+1]-$teRevEnd[$i])<=$joinDist
       && $Revscaf[$i] eq $Revscaf[$i+1]
       && $teRevQuery[$i] eq $teRevQuery[$i+1]) {
      $teRevEnd[$i]=max($teRevEnd[$i],$teRevEnd[$i+1]);
      splice @Revscaf, $i+1, 1;
      splice @teRevStart, $i+1, 1;
      splice @teRevEnd, $i+1, 1;
      splice @teRevMinLen, $i+1, 1;
      splice @teRevQuery, $i+1, 1;
      $i-=1; #unless $i==0;
    } elsif(abs($teRevStart[$i]-$teRevEnd[$i+1])<=$joinDist
       && $Revscaf[$i] eq $Revscaf[$i+1]
       && $teRevQuery[$i] eq $teRevQuery[$i+1]) {
      $teRevEnd[$i]=max($teRevEnd[$i],$teRevEnd[$i+1]);
      splice @Revscaf, $i+1, 1;
      splice @teRevStart, $i+1, 1;
      splice @teRevEnd, $i+1, 1;
      splice @teRevMinLen, $i+1, 1;
      splice @teRevQuery, $i+1, 1;
      $i-=1; #unless $i==0;
    } elsif(abs($teRevStart[$i]-$teRevStart[$i+1])<=$joinDist
       && $Revscaf[$i] eq $Revscaf[$i+1]
       && $teRevQuery[$i] eq $teRevQuery[$i+1]) {
      $teRevEnd[$i]=max($teRevEnd[$i],$teRevEnd[$i+1]);
      splice @Revscaf, $i+1, 1;
      splice @teRevStart, $i+1, 1;
      splice @teRevEnd, $i+1, 1;
      splice @teRevMinLen, $i+1, 1;
      splice @teRevQuery, $i+1, 1;
      $i-=1; #unless $i==0;
    } elsif(abs($teRevEnd[$i]-$teRevEnd[$i+1])<=$joinDist
       && $Revscaf[$i] eq $Revscaf[$i+1]
       && $teRevQuery[$i] eq $teRevQuery[$i+1]) {
      $teRevEnd[$i]=max($teRevEnd[$i],$teRevEnd[$i+1]);
      splice @Revscaf, $i+1, 1;
      splice @teRevStart, $i+1, 1;
      splice @teRevEnd, $i+1, 1;
      splice @teRevMinLen, $i+1, 1;
      splice @teRevQuery, $i+1, 1;
      $i-=1; #unless $i==0;
    }
  }
}

#print one get_fasta2.pl command per joined segment that is long enough
$ct=0;
for($i=0;$i<=(@Forscaf+0);$i++) {
  if(($Forscaf[$i] ne "") && (abs($teForStart[$i]-$teForEnd[$i]) >= $teForMinLen[$i])) {
    print "perl get_fasta2.pl " . $Forscaf[$i] . " $ct ";
    print $teForStart[$i] . " " . $teForEnd[$i] . " " . $flankLen . " f $teForStart[$i] $teForEnd[$i]\n";
  }
  $ct++;
}
for($i=0;$i<=(@Revscaf+0);$i++) {
  if(($Revscaf[$i] ne "") && (abs($teRevStart[$i]-$teRevEnd[$i]) >= $teRevMinLen[$i])) {
    print "perl get_fasta2.pl " . $Revscaf[$i] . " $ct ";
    print $teRevStart[$i] . " " . $teRevEnd[$i] . " " . $flankLen . " r $teRevStart[$i] $teRevEnd[$i]\n";
  }
  $ct++;
}
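The joining step in the listing above can be illustrated on toy data. The following is a minimal sketch, not the TESeeker implementation: the subroutine name merge_hits, the hash keys, and the coordinates are hypothetical, and the sketch checks only the gap between consecutive segments on a scaffold, whereas the script above also requires matching queries and tests several start/end combinations.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Merge hits on the same scaffold whose gap is at most $join_dist.
# Each hit is a hash ref: { scaffold => ..., start => ..., end => ... }.
# (Hypothetical helper for illustration only.)
sub merge_hits {
    my ($join_dist, @hits) = @_;

    # Sort by scaffold name, then numerically by start coordinate.
    my @sorted = sort {
        $a->{scaffold} cmp $b->{scaffold} || $a->{start} <=> $b->{start}
    } @hits;

    my @merged;
    for my $hit (@sorted) {
        my $last = $merged[-1];
        if ($last
            && $last->{scaffold} eq $hit->{scaffold}
            && $hit->{start} - $last->{end} <= $join_dist) {
            # Close enough (or overlapping): extend the previous interval.
            $last->{end} = max($last->{end}, $hit->{end});
        } else {
            push @merged, { %$hit };    # copy so the input is not modified
        }
    }
    return @merged;
}

my @hits = (
    { scaffold => 'scaf1', start => 100,  end => 400 },
    { scaffold => 'scaf1', start => 450,  end => 900 },   # 50 bp gap: joined
    { scaffold => 'scaf1', start => 5000, end => 5300 },  # far away: separate
    { scaffold => 'scaf2', start => 120,  end => 380 },   # other scaffold
);
for my $m (merge_hits(100, @hits)) {
    print "$m->{scaffold} $m->{start} $m->{end}\n";
}
```

With a join distance of 100, the first two scaf1 hits collapse into a single interval from 100 to 900, while the distant hit and the scaf2 hit are reported unchanged.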

D.2 Extract Sequences

###################################################################
### This is part of TESeeker, licensed under the GNU General Public
### License (GPL) v3.
###
### Author: Ryan Kennedy
###
### Arguments:
###   ARGV[0]: BLAST database (genome)
###   ARGV[1]: input file that contains get_fasta queries
###
####################################################################
use Bio::SeqIO;
use Bio::SearchIO;

# load every scaffold of the genome into a hash keyed by sequence id
$GENOMEDB = $ARGV[0];
$in = Bio::SeqIO->new(-file => $GENOMEDB, -format => 'Fasta');
$ii=0; $eeend=0;
%cpipiens=();
while($seq=$in->next_seq()) {
  $cpipiens{$seq->id()}=$seq;
  #print "scaff: " . $seq->id() . "\n";
  $ii++;
}

# search for a specific scaffold id
$filename=$ARGV[1];
$strand=""; $seq="";
open(INFILE,$filename) or die "Cannot open $filename: $!";
while(<INFILE>) {
  # each line has the form: perl get_fasta2.pl scaffold number start end flank strand
  ($p,$g,$scaffold,$scaffoldNum,$pos,$end,$flank,$strand)=split();
  $seq=$cpipiens{$scaffold};
  if ($strand eq "r") {
    $revseq = $seq->revcom();
    $olen = $end - $pos+2*$flank;
    $start = $seq->length() - $pos - $olen + $flank;
    if($start<0) { $start=0; }
    if($olen>$seq->length()) { $olen=$seq->length(); }
    $seqstr = substr($revseq->seq(), $start, $olen+1);
    $start=$pos-$flank;
    $eeend=$start+$olen;
    print ">" . $scaffold . "-" . "$scaffoldNum $start $eeend $strand \n";
    print $seqstr . "\n";
  } elsif ($strand eq "f") {
    $olen = $end - $pos + 2*$flank;
    $start = $pos-$flank-1;
    if($start<0){ $start=0; }
    if($olen>$seq->length()) { $olen=$seq->length(); }
    # seq() returns the raw sequence string from the sequence object,
    # which also carries additional information about the sequence
    $seqstr = substr($seq->seq(), $start, $olen+1);
    $eeend=$olen+$start+1;
    $start=$pos-$flank;
    print ">" . $scaffold . "-" . "$scaffoldNum $start $eeend $strand \n";
    print $seqstr . "\n";
  }
}
close(INFILE);
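The idea behind the extraction step, taking a hit region plus flanking sequence and reverse complementing it for reverse-strand hits, can be sketched without BioPerl. This is a simplified illustration under stated assumptions: the subroutine names revcom and extract_region are hypothetical, coordinates are treated as 1-based and inclusive on the forward strand, and the exact offset arithmetic of the script above is not reproduced.

```perl
use strict;
use warnings;

# Reverse-complement a DNA string (A<->T, C<->G; other symbols are left as-is).
sub revcom {
    my ($dna) = @_;
    my $rc = scalar reverse $dna;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# Extract the 1-based inclusive region [$start, $end] plus $flank bases on
# each side, clamped to the sequence bounds. Strand 'r' returns the reverse
# complement of that region. (Hypothetical helper for illustration only.)
sub extract_region {
    my ($seq, $start, $end, $flank, $strand) = @_;
    my $from = $start - $flank;
    $from = 1 if $from < 1;
    my $to = $end + $flank;
    $to = length($seq) if $to > length($seq);
    my $region = substr($seq, $from - 1, $to - $from + 1);
    return $strand eq 'r' ? revcom($region) : $region;
}

my $scaffold = 'AAACCCGGGTTT';
print extract_region($scaffold, 4, 6, 1, 'f'), "\n";   # ACCCG
print extract_region($scaffold, 4, 6, 1, 'r'), "\n";   # CGGGT
```

Clamping the flanks to the scaffold bounds mirrors the checks on $start and $olen in the script, which keep a hit near the end of a scaffold from requesting sequence that does not exist.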

D.3 Trim CAP3 Contigs

###################################################################
### This is part of TESeeker, licensed under the GNU General Public
### License (GPL) v3.
###
### Author: Ryan Kennedy
###
### Arguments:
###   ARGV[0]: quality score file
###   ARGV[1]: maximum distance between regions for them to be joined
###   ARGV[2]: sliding window size
###   ARGV[3]: CAP3 quality baseline
###   ARGV[4]: CAP3 threshold quality multiplier
###   ARGV[5]: minimum length of processed CAP3 sequence
###
####################################################################
use Bio::SeqIO;
use Bio::SearchIO;

$qualfile=$ARGV[0];
$joinDist=$ARGV[1];
$windowSize=$ARGV[2];
$quality=$ARGV[3];
$val=$ARGV[4];
$minLen=0;
$minLen=$ARGV[5];

$in = Bio::SeqIO->new(-file => $qualfile, -format => 'qual');
my $sum=0;
my $outside=1;
my $teNum=0;
for($i=0;$i<$windowSize;$i++) { #define array of set size and fill with zeroes
  $window[$i]=0;
}
while($seq=$in->next_seq()) {
  for($i=0;$i<$seq->length();$i++) {
    unshift(@window,$seq->qual()->[$i]); #add new quality score
    $sum+=$seq->qual()->[$i];
    $sum-=pop(@window); #remove oldest quality score and update sum
    if($sum>=($quality * $windowSize * $val) && $outside==1) {
      #window sum rose above the threshold: a high-quality region starts
      $outside=0;
      $scaf[$teNum]=$seq->id();
      $te[$teNum]=$teNum;
      $len[$teNum]=$seq->length();
      if($i-$windowSize<0) {
        $start[$teNum] = 0;
      } else {
        $start[$teNum]=$i-$windowSize;
      }
    }
    if($outside==0 && ($sum<($quality*$val*$windowSize)) ) {
      #window sum fell below the threshold: the region ends
      $end[$teNum]=$i;
      $teNum++;
      $outside=1;
    }
  } #end scaffold
  if($outside==0) {
    $end[$teNum]=$seq->length();
    $teNum++;
  }
  $outside=1;
  $sum=0;
  for($i=0;$i<$windowSize;$i++) { #define array of set size and fill with zeroes
    $window[$i]=0;
  }
}

#perform joins
for($i=0;$i<=(@scaf+0);$i++) {
  if($start[$i+1]) {
    if(abs($end[$i]-$start[$i+1])<$joinDist && $scaf[$i] eq $scaf[$i+1]) {
      $end[$i]=$end[$i+1];
      splice @start, $i+1, 1;
      splice @end, $i+1, 1;
      splice @scaf, $i+1, 1;
      splice @len, $i+1, 1;
      splice @te, $i+1, 1;
      $i-=1; #unless $i==0;
    }
  }
}

#output for get_fasta2
for($i=0;$i<(@scaf+0);$i++) {
  if(abs($start[$i]-$end[$i]) > $len[$i]*$minLen) {
    print "perl get_fasta2.pl " . $scaf[$i] . " " . $te[$i] . " " . $start[$i] . " " . $end[$i] . " 0 f \n";
  }
}
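The sliding-window pass in the listing above can be illustrated on a plain array of quality scores. The following is a minimal sketch under stated assumptions: the subroutine name high_quality_regions and the toy scores are hypothetical, a single per-base threshold stands in for the product of the quality baseline and multiplier, and the join and minimum-length steps are omitted.

```perl
use strict;
use warnings;

# Return [start, end] index pairs for stretches where the sum of the last
# $window_size quality scores is at least $threshold * $window_size.
# (Hypothetical helper for illustration only.)
sub high_quality_regions {
    my ($quals, $window_size, $threshold) = @_;
    my @window = (0) x $window_size;    # window starts filled with zeroes
    my ($sum, $inside, $start) = (0, 0, 0);
    my @regions;
    for my $i (0 .. $#$quals) {
        unshift @window, $quals->[$i];  # add the newest score
        $sum += $quals->[$i];
        $sum -= pop @window;            # drop the oldest score
        if (!$inside && $sum >= $threshold * $window_size) {
            # Region opens one window length back, clamped at zero.
            $inside = 1;
            $start  = $i - $window_size;
            $start  = 0 if $start < 0;
        } elsif ($inside && $sum < $threshold * $window_size) {
            push @regions, [$start, $i];
            $inside = 0;
        }
    }
    # A region still open at the end of the sequence runs to its length.
    push @regions, [$start, scalar @$quals] if $inside;
    return @regions;
}

my @quals = (0, 0, 0, 40, 40, 40, 40, 40, 0, 0, 0, 0);
for my $r (high_quality_regions(\@quals, 3, 30)) {
    print "region $r->[0]..$r->[1]\n";   # region 2..8
}
```

Keeping a running sum and updating it with one addition and one subtraction per base avoids re-summing the window at every position.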

D.4 Generate Consensus

###################################################################
### This is part of TESeeker, licensed under the GNU General Public
### License (GPL) v3.
###
### Author: Ryan Kennedy
###
### Arguments:
###   ARGV[0]: alignment file to create consensus from
###   ARGV[1]: percent of nt that must be common for consensus
###            (compared against a fraction between 0 and 1)
###
####################################################################
use Bio::SeqIO;
use Bio::SearchIO;

$alnFile=$ARGV[0];    #file to make consensus from
$percThresh=$ARGV[1]; #percent of nt that must be common to go to consensus
my @A=0;
my @T=0;
my @G=0;
my @C=0;
my @consensus="";
my $numSeqs=0;
my $length=0;

#count the bases seen in each alignment column
$in = Bio::SeqIO->new(-file => $alnFile, -format => 'fasta');
while($seq=$in->next_seq()) {
  $numSeqs++;
  my @seq_array=$seq->seq() =~ /./sg;
  $length=$seq->length();
  for($i=0;$i<$seq->length();$i++) {
    if(uc $seq_array[$i] eq "A") {
      $A[$i]++;
    } elsif (uc $seq_array[$i] eq "T") {
      $T[$i]++;
    } elsif (uc $seq_array[$i] eq "G") {
      $G[$i]++;
    } elsif (uc $seq_array[$i] eq "C") {
      $C[$i]++;
    } elsif (uc $seq_array[$i] eq "N" || $seq_array[$i] eq "-") {
      #unknown bases and gaps count toward every nucleotide
      $A[$i]++;
      $T[$i]++;
      $G[$i]++;
      $C[$i]++;
    }
  }
}
for($i=0;$i<$length;$i++) { #calculate percentages
  my $total=$A[$i]+$T[$i]+$G[$i]+$C[$i];
  $total=1 if $total==0; #avoid division by zero when a column holds only unrecognized symbols
  $Ap[$i]=$A[$i]/$total;
  $Tp[$i]=$T[$i]/$total;
  $Gp[$i]=$G[$i]/$total;
  $Cp[$i]=$C[$i]/$total;
}
for($i=0;$i<$length;$i++) { #find commonalities
  if($Ap[$i] > $percThresh && $Ap[$i] > $Tp[$i] && $Ap[$i] > $Gp[$i] && $Ap[$i] > $Cp[$i]) {
    push(@consensus, "A");
  } elsif ($Tp[$i] > $percThresh && $Tp[$i] > $Ap[$i] && $Tp[$i] > $Gp[$i] && $Tp[$i] > $Cp[$i]) {
    push(@consensus, "T");
  } elsif ($Gp[$i] > $percThresh && $Gp[$i] > $Ap[$i] && $Gp[$i] > $Tp[$i] && $Gp[$i] > $Cp[$i]) {
    push(@consensus, "G");
  } elsif ($Cp[$i] > $percThresh && $Cp[$i] > $Ap[$i] && $Cp[$i] > $Tp[$i] && $Cp[$i] > $Gp[$i]) {
    push(@consensus, "C");
  } else {
    push(@consensus, "-");
  }
}

#trim the 5' end until a run of resolved positions is reached
$i=0;
until($consensus[$i] ne "-" && $consensus[$i+1] ne "-" && $consensus[$i+2] ne "-" && $consensus[$i+3] ne "-") {
  shift(@consensus);
}

#trim the 3' end in the same way
$i=(@consensus+0);
#print $i;
until($consensus[$i] ne "-" && $consensus[$i-1] ne "-" && $consensus[$i-2] ne "-" && $consensus[$i-3] ne "-") {
  pop(@consensus);
  $i=(@consensus+0)-1;
}
print ">Contig\n";
print @consensus;
print "\n";
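The per-column vote in the listing above can be shown on a toy alignment. The following is a minimal sketch rather than the script itself: the subroutine name consensus and the example sequences are hypothetical, the input is assumed to be equal-length aligned strings, and the end-trimming step is omitted. As in the script, N and gap characters add one count to every nucleotide, and a base is called only if its fraction strictly exceeds the threshold and it is the unique maximum.

```perl
use strict;
use warnings;

# Build a majority consensus from equal-length aligned sequences.
# (Hypothetical helper for illustration only.)
sub consensus {
    my ($threshold, @aligned) = @_;
    my $length = length $aligned[0];
    my $result = '';
    for my $i (0 .. $length - 1) {
        my %count = (A => 0, C => 0, G => 0, T => 0);
        for my $seq (@aligned) {
            my $base = uc substr($seq, $i, 1);
            if (exists $count{$base}) {
                $count{$base}++;
            } elsif ($base eq 'N' || $base eq '-') {
                $count{$_}++ for keys %count;   # counts toward every base
            }
        }
        my $total = $count{A} + $count{C} + $count{G} + $count{T};
        # Rank bases by count; ties are broken alphabetically for determinism.
        my ($best, $second) =
            sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
        if ($total > 0
            && $count{$best} / $total > $threshold
            && $count{$best} > $count{$second}) {
            $result .= $best;
        } else {
            $result .= '-';     # no base is common enough
        }
    }
    return $result;
}

my @aligned = ('ACGT-A', 'ACGTTA', 'ACCTTA');
print consensus(0.5, @aligned), "\n";   # ACGT-A
print consensus(0.4, @aligned), "\n";   # ACGTTA
```

The fifth column shows how a gap dilutes support: T receives 3 of 6 counts, which does not strictly exceed 0.5, so the position is left unresolved at that threshold but is called T at 0.4.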

APPENDIX E

TRANSPOSABLE ELEMENTS IDENTIFIED

This appendix presents all full-length consensus TEs we identified in the P. humanus humanus genome and selected sequences from the C. quinquefasciatus genome. We offer a detailed annotation of the mariner element from P. humanus humanus and an annotated version of the putative mariner from D. melanogaster.
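The records in this appendix use a flat layout with LOCUS, DEFINITION, and ORIGIN fields followed by numbered blocks of ten bases, and the consensus sequences contain ambiguity codes such as R and Y alongside A, C, G, and T. As a convenience for readers who wish to reuse the sequences, the following is a minimal sketch of how such a record might be turned back into a raw sequence and checked against its declared length. The subroutine name parse_record and the toy record are hypothetical and are not part of TESeeker.

```perl
use strict;
use warnings;

# Parse one LOCUS/ORIGIN record into its name, declared length, raw
# sequence, and number of non-ACGT symbols.
# (Hypothetical helper for illustration only.)
sub parse_record {
    my ($text) = @_;
    my ($name, $declared) = $text =~ /LOCUS\s+(\S+)\s+(\d+)\s+bp/
        or return;
    my ($body) = $text =~ /ORIGIN(.*?)(?:\/\/|\z)/s
        or return;
    $body =~ s/[\d\s]+//g;    # drop coordinates and whitespace
    my $ambiguous = () = $body =~ /[^ACGTacgt]/g;
    return {
        name      => $name,
        declared  => $declared,
        sequence  => $body,
        ambiguous => $ambiguous,
    };
}

# Toy record in the same layout as the entries in this appendix.
my $toy = 'LOCUS toy 12 bp DEFINITION toy, 12 bases. ORIGIN 1 ACGTRYACGT 11 AC //';
my $rec = parse_record($toy);
printf "%s: declared %d bp, parsed %d bp, %d ambiguity codes\n",
    $rec->{name}, $rec->{declared}, length($rec->{sequence}), $rec->{ambiguous};
```

Comparing the parsed length with the declared length is a quick way to detect a record that was truncated or corrupted when copied.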

E.1 P. humanus humanus

E.1.1 Non-LTRs

E.1.1.1 Hope-like SART

LOCUS hope-like 4655 bp DEFINITION hope-like, 4655 bases, 1B9 checksum. ORIGIN 1 GGCCGATGGG GGTCACGGCC TCGTGTCCCG AGAGGCGAGT CGACCTGCGC 51 GTGGAAGGTC TTTGCGAGTC RGCGACGTCG TCGATGGTCG CAGCGGAATT 101 AGCTAGGGTA GGGGGGTGCT CGTCATCGGA CATCTCTGTG CGCTTCCCGA 151 CCGGCGGAGG CCCACATYGT WACTCGTSYG CGTACGCATC CGTCCCGGCG 201 GCCGTCGCGT CCGTYCTCTT RAAGGCRGGC GCGGTTGTCR TCGGGTGGRT 251 GCSTGCGAGG GTGGTCCTGC TGCCGAAACR CAGGACCAWC TGCGTCAGRT 301 GCCTACAYCY GGGGCATGTA GGCTCGCGTT GTCCCAATGT CAGGGAAAGG 351 GGAGATACGG CGCTGGACCG CTGCTTCAGG TGCGGGGAGA AGGGGCACCG 401 GGCGCGAGAA TACGCAAACA AAATCCGCTG CGCTCCCTGT TCGGAGGCGG 451 GCCGGCAGGC TCAACACCGG CTCGGTGGGT CTTCGTGCGG GGCGCCGGCC 501 GTCAGGGGTA GGCCCATTCG GTGGGAATCC GATACCAGGA ACACGGGGGG 551 AAACGTTAGG GGGGCAGCCT CCGCATGTCA GCCGGGAACG AACGGCTCCG 601 CCGGTAGTTC TCTGTGAGCA TGAGGAGGCT TCTCCAAATA AATCTGAATA 651 GGTGCAAAGG GGCACAGGAT CTCTTGCTTA ACTTGGTAGG GACGTGGGGC 701 GTCGGGATCG CGATGATCTC GGAACCCCAC GTTGTTCCGT CGGGGGAAGG

183 751 GAGTTGTTCC GCTGGCACTT GGTTCGTATC CGAGGACGGC GGGGCCGCGA 801 TCTTCTTCTC GAGCTTGGCG CCGGGTCCGA AACCTCGGGA ATTGTCCCGT 851 GGGGTCGACT TCGTTATTGC AGGCTGGGGA AACTCGGTCC TGGTATCGGT 901 ATACGCGTCG CCAAATCGTC CATTRMRAGT ATTTGAAGAT CAGCTGCGRG 951 CGATTTCGAG GGAGCTGGAC AGGTTAGGGG ACTCGCGTCC GATTCTGATG 1001 TCCGGGGACT TCAACGCGAA ACAYRCATCG TGGTCAGGTT ACGCGACTAA 1051 CGCGCGCGGC CGATTGTTAA GACAGTGGTT AGACGAGCGT CATCTCATMG 1101 TGTGGAACGC AAAAGGAKTC AAGACGTGCG TGCGGTCCAC CGGCGGGTYC 1151 AYGATCGATC TCACGATTTC ATCCGGTTCK ACGGCGRCGC GGGTCTCCGA 1201 CTGGGAGGTC CTTTCCGACT TCGAGTCCCT CAGCGATCAC GCCTACGTGG 1251 CGTTCTSGTT CGGGGATYCT YGGAGCTCGG GGGGGGGCAG CCTCCTCGCC 1301 GGACARRCCT CGCGTCGTCY CRCCCCGRTG GAGTCTGCGG TCGTTSGATG 1351 ACGAACGGCT CGCCGAATGT GTCGGCTCGC GGCTGAGGTC TGGAGARATG 1401 ACGTYGCTCC CGGCCCGAMA AGACGAAACG CACACGATAT GGCGCGGGGA 1451 GTTCGGGACG ACTTAMGACG GATCTCGGAC TCCTGCATGA AGAGAATCGG 1501 CTCGCGCTCG ACGACTCGGA AACGGSCGAA GTACTGGTRG ACGGACGAGA 1551 TAGGAGAATT GTGGCAGGAG TCCTGTCTCG CACGTCGCCR GTTCGTCGGC 1601 AAAAAAAGAC AAYTRATCAA ACGGGGAGGA CKTTACGAGG ACATCGCGAA 1651 CGACGACGAA CTCGCGCKTT TAAAGGCGGA GTGGAGRGCT TCCCGGGYCC 1701 GGGTCCGACG AGCGATATGG AGTTCTAAGG RGACGTGYTG RAAAGAACTC 1751 TTGAGCGAAG CGGATTCGGA TCCGTGGGGT CTCCCRTTCC GTCTYGTGAC 1801 CAACAAACTR AGRGGTTCGT YGGSACCGCT GACGGCCGGC ATGACGGAGR 1851 AGTTTCTCGC CGAAGTGATC GCAGAGCTCT TCCCTCGCGT CGARCCGTTC 1901 GCGTTTCCCC CGCCGGATYT CTCGCACGAC GCGAACCGCG ACCACGASAY 1951 CGAAGTGACG GAGGGGGAGG TCCGCCTTCT CCTCGACGCG GCCTCGAAGA 2001 GACGCTCGGC TCCGGGCMCG AACGGCGTGC ATTACCGCAT TATCGGCAAC 2051 TCTGCCGGGG TGCTGTGTGC GCGTCTCTCG GCGCTCTACA CCGCGTGTTT 2101 CCGAGAGGCG ACGTTTCCGA CGCCGTGGAA AGAAGCAAAC TTGGTACTGC 2151 TGGACAAACC GGGTAGGGAT CCGACGACGC CGAGCGCGTA CCGGCCGATT 2201 TGTCTTCTCG ATGTGGARGG CAAGTTGTTC GAGCGCGTGA TAGCGTCYCG 2251 GATMGACGAA CACTTACGAT CAGGGGGAGG GAGGAACGAC CTCTCTCCAA 2301 ATCAATACGG CTTCCAGACG GGACGTTCGA CGACCGACGC ACTCGACCGC 2351 GTGTGCGCCG GAATCCGGKA CACGCTTCAC CGGGGCGGCG TCGCGATCGC 2401 CGTCTCCATC 
GACATCAAAA ACGCGTTCAA CACGGTACCT TGGTCCGCGA 2451 TAAGGGACGG GCTTAMGTCG AAGTCCGTGM CAGACTACCT CGTGTCGGTG 2501 ATAGRGTCCT TCCTCTCGGA GAGGAARATC GCGTACGAGA AACCGGACGG 2551 AWCGACGGGA AGGGCGGACG TGTTCTRCGG CGTGCCTCAG GGATCGGTTC 2601 TAGGRCCACT CCTGTGGAAC ATCGCCTACG ACCGYGTCCT CACCCGGACC 2651 GTTCTTCCCG AAGGCGTWTC GTTGACTTGC TACGCCGACG ACACTCTGCT 2701 CCTCGCGACC GGTCGCGGGT GGGCGGAGGT TCGCGAACGY GCCGAAACRG 2751 GACTTAACGC RACGGTGGAG SCCATCCGGG ACACCGGTCT CCGGGTGTCT 2801 CTGCCGAAAA CGGAAGCTTG CGGGTTTCAC CGTCCKCGGA ATCCGCGTCC 2851 TCGCGACTTR TCGATAATCG TGGACAACGT CRGAATCGGA GTGGGTAGTT 2901 CACTTAAATA CTTGGGTTTG GTTCTCGAYT CCGGTCTTCA CTTCGGGGAA 2951 CATCTGGCRC GCYTCGGACC GARRATCCGA GCCGTCAGYG CGACGTTAGG 3001 RCGTCTGATK CCGAACCTGC GYGGMYCGCA GGTCAAGGTC AGRCGATTAT 3051 RCRCGACGGT CGTTCACTCC GTGGYCCTGT AYGGAGCTCC GATCTGGGCC 3101 GAGTCGSTTT CGGCAGCTCG GCCTCTTCGC GAGAAGGCAA TCCAACTCCA 3151 RAAGGCGTCT CTGAACAAGG TGGCGATGGC GTACAGGGAC GTCGYGGCGG 3201 AGGTGTCTTG TCTTCTGTYG GGCACTCCCC CGCTYGATCT CCTCGCGATG

184 3251 GARAGACTCG TYCTCTATCG CGAACGGGAT CGCGGCGGGC CTAGGACAGC 3301 TCGTCACCGC ARGGAACTTC GCTCTCAAAC GATCAACACG TGGCAGGCGC 3351 GATTCACCGA CGGGCGTAAG CGTTACGGGG GGGAAATCAT CAACGTTCTC 3401 GGCCCCCGAG TGGGGGAATG GGTRGGGCGR GGTCACGGAA ACTTGACTTA 3451 TCGCCTCACT CAGRTCCTCA CRGGTCACRG CGTGTTCGGC TCGTACTTGG 3501 CGCGCATCGG CAGAGAGGAG ACGGCGRAGT GTTGGTTCTG CGGTGCGCYC 3551 GAGGACGACG TCGARCACAC GGTCGCGATA TGTYCCAYGY GRGASGTCCA 3601 YAGRCAGCGG CTAGTCGAGG TCATYGGACC CGACTTGTYM ATACGCGGTT 3651 TAGTGAACGG GTTGCTCCGC GGACCTCGGG AATGGTCCGT GATCTCGCGA 3701 TTTTCGGAAA CGGTCCTACG GACGAAAGAG GACCGCGAAC GCGAAAGAGA 3751 AAAATCCGGA GTTCGGCGTC GCGCGCTCCA GGAAAAAAGA AAAAGGAAGA 3801 AAAACAACGG CGGCAGCGAC GGCGAAAAGG AAACAGCCAA CGCCTGATCA 3851 TCCGTGGATG ATGGACGATC GCATCCTCAG GAAACAAGCG AGTTTCCCAT 3901 CCCCTTCCCC CAGCGAAGGA CCGGAGGACA CTCCCGAAAG GGACTTGCCG 3951 AACGTCCTGG GAAATGCGTT CTCGTCTTCG AGGTGGCAGC AGAACGAAGA 4001 CTAGGGCGAA TCTCCTACCA AGTAATCGTG TAGAACGGTA ACGTTCTGGC 4051 TTCCCGGAGA GGGGACTGAC TGAAATGTAG GTACAACTCC CACGAACGTC 4101 CCCCCCTCTT CATATAAGCC ACGCTATTGC GACATGCTCA GGAGTTTTAG 4151 TGGGTATGGT TCCCGGTTAG GCCTTCTTTT CGAATCCCAT ACCGAGCCCC 4201 ACACTCCCTC TTCGGAGGGG TTGTGCGTAA CTTGCATTTC TCCTGAGTTA 4251 ACAAAAAAAA AAAAAAAAAA TTTTTTTTTT TTAAAAAAGG TTGGCTTGCT 4301 CAGAAAGGGT CAGCGACCCG TTCTGAGTCT CTGGCTGGGG AGTCCTCTCC 4351 GGGTTCGAGA CCTTGAGCGA TCACGTTGGC GGTGAATGAA TGTATGGATT 4401 TCTTTTTTTT TTGGAATGCA CATCACCATC ATTCTTATAA GGTTCTTATA 4451 AGGTACACTC CAAGATGAAT ACTGCATTAA ATTATATATT TATATGTATA 4501 TGGGTAATTA TGGAAGGCAG AACCGAAAGT AAAACCTCGG GAACGTCTCC 4551 CAGATTGTCA TTGGGAGAAA AGAATTCCAA TTAGGTGGAG AGGATCTTTG 4601 CACTCCACTG TTCTATCCAC TATTTACATT TATTTCAAGC GGGAATCCAT 4651 TGTTA //

E.1.1.2 Dong-like R4

LOCUS dong-like 5266 bp DEFINITION dong-like, 5266 bases, 28B checksum. ORIGIN 1 GAAAARGAGG GCGCTGGTGT TCCATGGTTG ACAAGCCTTT TAGTCATATC 51 GTTTTCTTTA ATTAATTTTA ATAGTTTTTT TTTAGGGTTT ATTTATAAGT 101 TATATATCAA AAGTTTTGTA ATTTAGTCTA GTTTACAGTG ATATAATTAG 151 TTTTTTAAAT TATTTAGTAA TAACTATATT ATCTCAAGTT TAAATTTACA 201 AATTGTTCAA CGATTCCATG TGACAAAAGG AAGCGTGGGC AATAAAAAAC 251 CTTAGAAGGC ACCTTAGTAC ATATAAATTA AACGGATAAG TACCTCATTT 301 TAAGACTTAA ATTTCAATTG TAAAACGTTA GGCTAACAAA GAAATTGAAC 351 ACATAACCAA AAAATTTTTT TTATTTTTAT TTTTTTTttt tATATTTCTT 401 TGATTAATTA TAAGTGATTT GACTATTTAT ATCTTTATCT TATTTAACAA 451 AATAAGAATT AAAATATTTT TACAATAATA TAGGTTGATA CATGATTTTT 501 TCAGGGTGTA ATTTCTATAC CCTTATTGTT AAGTGTTAAC TTGTGATTGT 551 ATGTTATTTT AGCCTTGTCT TAAGCTAACT TAATTCATTG TAGTAAAGTT 601 TATTAATAAT AACATAGTGT AACTAATACC TTTCTTAAAT CTGAATCAAA 651 TTAGGAAAGT ATAGTAGTAG TCTTTATATT AAGTGATATT AGTGTGAAAG 701 GGTACAACAG ATTTCTCTTT AACCAGACTT CATTTAATAA ATTTTGCATC 751 TTTTGAATTA CTTATAAGAG TTAAATATCA TTGTCATACT AAATAAGAAT 801 AGAGGTAGTT CAATATAAAT ACTCATTATT TATTGGTTTA ATGTAATGTA 851 TAGCTGGATA TAACATCTCT GTACTCCTGT GTTAATCTAC TGATTGTTTT 901 AATACAGAAT ATTTTGCCAT ATACTTTATG CAAATCTCCA TTTTGATCTA 951 CATTAACACT AGCAGAGATC AATTTCTTTT ATACTACTTA ATAATAGTAC 1001 AGAATTGTAA TTTGTATGTT GTGCATACAG TTAAGTGTAG TCTATGTAAT 1051 CTCTTTTATC CTTAAATTAG TTTGTTTTGT TACTGCTTAG TTTATAATCT 1101 TGATAGTGTA ATATTGCCTG TAAATTTGAT TCAAGATCAA TTAATTAAGT 1151 TATAAATAAG TAGCCTTTTA GTATTTGTTT CTAATGTAAA AATTGTGAAA 1201 AAGCATTATA CATTTGAGGG ACCACAACTT AAACAAACCT TATACAACTT 1251 AAGTTGAAGT CTAATTGCAT ATTATAACTA AATAGTCTAG CCAAATATCT 1301 CTACCATACT CATTGAAAAT AGGGATAGCT TAACAAAATA GGTTAATAAG 1351 ATAGTTAGTA ATATCTGATT TAATATATCA ATTGGGTAAA AAATTGGTTC 1401 TGTCTAATCC AGTTGTCTGA TAACTAGTGT TCTAATTGTA TAATATTTAA 1451 CCTTACATTA AATTGTTTCT GCCTTATTGC AGCATATTTT ACATGATTTA 1501 CAGCAAATTG ACTTTTATAA GTATAAGAGA CATAAGTGAT AAGTTTAATT 1551 AAAAATTTTA TAATAGTGAA CTCTAAATAC TGAATTAGAA AATGTTTTGT 1601 TATAAAGGTG 
AATGAATTTA TAAATTTATA TATAGGTAAA AAGAGCAAAA 1651 TGAGAACAAG ACAAAGTAAG AACCAAAACA AATCTTGTTC TACTGTCGAC 1701 CTGAGGCAGC TCGACGAGAA TGTGAATTTC ACTGCTTCTG ACCCTGGCCA 1751 CTCCAACGAT GTTAGTCACC GGAGCCCAGT TCAGGAGACA ACTCGTTCCC 1801 ATCGAATAAG ATGGACTCAG GAAGACCTTC AAGAATTGAT GTGGTGCTAT 1851 TTTTACTCTC AAAAATTTGG TTCAGGATCG GAAAGTGACA CCTTTAAAAT 1901 CTGGAGAGGG AGAAACCCAA ACAGCAGGAA GGACATGACA TCGAAGAAGT 1951 TGGCTGCCCA AAGAAGATAT ATAATAAAAA CAAAAAAAAT TGAAAATGAT 2001 AAATTAGAAG AAATTAAGAA AAATGTAGAC TCTTCGTGCA GTAATGTTAT 2051 ACGAGATAAT GCACAAATAA TAACAGAATT AAATGAGTCT AACCACAAGA 2101 GTAAAACCGA TACTGACGAG CAATTGTTGA CTGATGCAGA AATGAAGAGC 2151 ATCGAAGAGA GACTAATTGA AGAGATAAAA AAAGTAAAGA TGTGCCCATT

186 2201 AATAAACAGG GAGCCATTAA GGAAAATTTA CAAAAATAAA AAGGCAACTG 2251 AAGTTTTACA TCTTATAGAT AACACCCTCA TAAATGTATT AGAAAAAGTG 2301 GTAGACATTA ATTTAACCAC AATTAATGAA ATTATATATG CTGCGGGGGT 2351 AGTTGCCACA GATATTATAC TTGGGCCAAG AAAAGAAGCT CGTCATAAAG 2401 GAATGGAAAC AAAAAAGTCA ACATCACCAA TATGGATACA AAGAATAGAG 2451 GGAAAAATAG AAAGAATAAG ATTACATATC TCCCTGGTAT CCGAAATGAA 2501 GAAGAATAAT AATTTAAAGA AGAGGACAAT AAAGAAACTG GATCACCTTA 2551 AAAGGATCTA TAAATTGAAG ACCATGGAAG ATATAGAACT TACCATGGAA 2601 ACTTTAAAAC AAAAGGTATT GCTCTATTCT CAGAGAATAA GAAGATATAA 2651 GAAGAGAGAG CAGTTTTGGC GACAAAATAA ATTGTTTGAA TCGGACCCAA 2701 AGAAATTTTA CAGAACAATT AGAGAGCAGA ACATACAAAA TGGATTCAGC 2751 ACCTTAAATG TAGAAAAAAT GGCAGACTTT TGGTCAAATA TTTGGGAAAA 2801 ATCTCATCCA TTAAATAAGA ACTCTACATG GATGAATAAA GAAAAAGAAG 2851 CACATGCCTG GATTGCTTCT TCAACAATGT CGGATGTAAG AATGGCGGAT 2901 TTAGAAACAT GCCTAAAAAA CACGGCCAAT TGGAAAAGTC CCGGATTAGA 2951 TAGGGTACAA AACTTCTGGA TTAAGAATTT TACAAGTACC CATAAGTATT 3001 TACTGGTATC CATAAATAAA TTAATAATGG GACGGCAAGA AATGCCAGAA 3051 TGGATAACCA CAGGTAAAAC GTATTTGCTG CCAAAAAAAT CAGGAGCTAC 3101 GGAACCGAAA GATTTTCGGC CGATAACCTG TTTGCCCACA ATGTATAAAA 3151 TAATAACAGC TATTATTGCT GAAAAGATTT ATGGGCATTT AAGAAAAAAT 3201 AACATTTTTC CTCCTGAACA ATATGGATGC AGAAAAGGGT CTTACGGCTG 3251 CAAGGAGGTT TTATTAATAA ATAAATTGAT CATGGCCAGT GCAAAACAAA 3301 AGAGGAAGAA TTTAAGCATG GCATGGATTG ACTATCAAAA GGCCTTTGAT 3351 AGTGTGCCTC ACGAATGGAT TATTGAGGCA TTGAAAATAT ATAAAGTAGA 3401 CCCTAATATT ACAGCGTTCT GCGAGAAGAG TATGAAAAAT TGGTGCACCC 3451 AGCTGGAAGT GCAAAAATAC TCTTCTAGAA AAATATTTAT AAAAAGAGGA 3501 ATTTTTCAGG GAGATTCATT GTCGCCACTT TTATTTTGCA TGTCTTTAAT 3551 TCCTCTATCC AGACAGCTTA ATATCAAGGA TCAAGGATAT GAGTTGGTAC 3601 CGGGAGGCAG GAAAATTACC CATATGCTAT ATATGGATGA CTTAAAAATT 3651 TATGCCAAAA ATGAAGAGGA GTTAAATAAA ATGTTACGGA CGGTTCAAAC 3701 CTTTTCCTCT GACATCAACA TGAAATTTGG GTTAGAGAAA TGTGCCAGAA 3751 TAAATATTGT CAGAGGAAAG TTAAAACAAA AGCAAAATAT AGAAGACTCC 3801 GAAGAAGAAC TTATTAAAGA ATTGGACCCT GGATCATCAT ACAAATATCT 3851 GGGGATTGAA 
GAAAATTTTG GGATAGCCAA CAAGGAAATT AAACCTCGAT 3901 TGAAAAGAGA ATATTTTAAA AGATTGAGGC TTATATTACA GTCGGAACTG 3951 AATGGAAGAA ATAAGATAAC CGCCGTCGGC ACATTAGCAG TTCCTGTAAT 4001 AGAATATAGT TTTGGCCTCG TAGACTGGAC GAAAGAAGAA ATCACGCACC 4051 TAGATAGAAG GACAAGAAAA ATATTAACCA TGAATGGTGC GTTACATCCA 4101 AARGCTGATG TGGATAGATT GTACGTCAGC AGGAAAGATG GAGGAAGAGG 4151 ATTACGACAA ATAGAAGCGG CATACCAGAA TGCCATWATT GGAATGGGAA 4201 AATACATAGA ATCCCATCRA GAGGACCCYA TCTTAGCCCA AGTTATACAT 4251 GCAGAAGAAA AAACTACAAA AAAAGGAGTT CTGAAAAGGG CAAAACAAAT 4301 CGTCCAAGAA AATAAAGAAA ACGAGATAAT GGAAGARGGG CAACTTGCAA 4351 CTTATAATAG CAAAGCCCAA TCTCAGAAGA AATTAATAGG CAAATGGGAA 4401 CAGAAAAAAT TACATGGGCA ATACCTAAAA AGAATAAATG CCGAAGATAT 4451 TAATAAGAAG AGCACGCACA ATTGGCTACG ACGTGGAAAA CTTAAAATTG 4501 AAACAGAAGC GTTTATTACA GCGGCTCAAG ATCAGGCATT GCGGACCCAT 4551 AACTATGAAA AAGTAATTCT CAAAGTCCGC CAAGATGACA AGTGCCGAAT 4601 TTGCCAATCW CAATCGGAAA CCATCGATCA TTTAATTTCC GGTTGTCCGA 4651 TACTGGCAAA ACATGAATAC TTAGAAAGGC ACAATAAAAT ATGTCAATAT

187 4701 CTCCATTGGA ATATATGCCG AGAATATGGA ATGGATGGAT TACCCAAGGA 4751 GTGGTACAAC CACATTCCAA GCCCGGTTAC GACAGTAGGT CCATGCACAG 4801 TTCTATATGA TCAACAAATC CACACTGATA GAACTGTGCC AGCTAACAAA 4851 CCGGATATCA TCCTCAGGCA TAATGGGGAA AAATGGTGTA AGTTAATTGA 4901 GGTATCCGTG CCGGCAGAAA AAAACACCAC AGCCAAAGAA GCAGACAAAA 4951 GGCTGAAATA CAGAAATTTG GAAATTGAAA TAACCAGGAT GTGGGGAACA 5001 AAAACTGAAA CGATTCCGGT CATCGTGGGA GCATTGGGAG CCATGCCAMA 5051 TTCAATAAAA GGAAATTTAA AGAAGATTAT GAAGAACCTA AAAGAAGAAA 5101 CCATCCAGGA AATCGCACTA TGTGGGACGG CCCACATTCT TCGGAAAATA 5151 TTATAAATAG CACCACCGAT ATTCGAGTAT TTGTCCCTAA GGAATCGGGT 5201 AGAGACCTGG GATAAATGCT TAAAAACCCC GACAAATAAG AGCCTTGTCG 5251 TGTTTAAGAG TGACCA //

E.1.2 LTRs

E.1.2.1 Mdg1 ty3/gypsy

LOCUS mdg1 5395 bp DEFINITION mdg1, 5395 bases, 9DF checksum. ORIGIN 1 TGTTATGATC CCGTACTCAA TATTCACATT TTCAAATTTT TATAAACAAA 51 AGACTGGCGA CGAACTCGAA TTGTTCTTCT TTAGAAGCGA AGCTTCTAAA 101 GACTCCTTCT TTTATTTTAT TTGCAAGTGT TTGCTAGTTC CCATTGTTGC 151 TCGTTCGAAA CAGTTGACTG ACGAAAGACC AAAGTAATAT AGGCAAGCTA 201 CACAGCTTGG CGTnGGTTCA TTTGATTTTT AGTCTAGCTG TAATAAGTTC 251 TTGTTTATTA ATATTAGTTT TGTtagtGAT AGTTTTAAGT GCTTGTTCAT 301 TACTTACAAA AATAAAGAAC GAACCGAATA TAACATATTA AAATTTTGGC 351 GCAGTCGACG AAGGATCATT CTGAACGACA AGAAACCGGA GAATTCTTAA 401 GTAATTCGTT TCATGTTGGT GGGTTTTAAC TCTTTATCAG AATTTCTTTA 451 TACTGTAAGT AGTAGTAATA GGTAACTCAT TTttTGAACA ATGCCTAAAC 501 TTGATCAACC TGTTTTCAAC CCATCCCAAA TGTCCCTTAG TTTCTTGTGT 551 AGTCTAGTTC CTAATTCCTT TTCCGGAGAA AGGAACCAAC TTAAtGCATT 601 TATctcTGAT TGTGATAGTG TTATCGAAAT GTCCTCTGAA GAgAATAAGT 651 ATCCACTCTT TAGATTCATT TATTCCAGAA TCACTGGTAA agCCAgGgaT 701 CGAATTTCTa TATACCATTT TGATAATTGG GATGACGTAA AAaCGAAACT 751 CATAGAATTA TATcAAGACA AGAAACCCCA TAGTCAACTA ATGGAGGAGT 801 TGACTAGTTG CAGGCAAAAA TCAAATGAAT CGGTAACGGA ATTTTATGAA 851 AGATTGGAGA ATTTGTCCCG TGAAATTGTA TCCAACTTAA AGGTGGATGT 901 CAAGGATAAA AGAGCTCACT CAATAAAAAT AGATTACATT AACGAGATTG 951 CTTTGGGTCG TTTTATTTAC CATTCAAATT CCGGAATTTC CCAAGCGCTG 1001 AGATGGAGGA ATTTCGACAG TATCAACTCA GCATATTCAG CAGCCATTGC 1051 TGAAGAAAAA TTTCTAGAAA TGAGAAGGCG TGATAGGTGC TGTAATTGTG 1101 ATTCTAGAAA TTCTAATGCT TCAAAACTAT TGTCTCCAGC TCAAATTCCA 1151 AAATTCTGCA AGTACTGTAA AAAGTCAGGA CATCTTTTAG ATGAATGTTT 1201 CCGAAGAAAG AAAAtTAATG AGCTCCGAAA AAAACAAAGT AAAGCAAGTG 1251 TTAATTTAAA CTTACATCAG TCCCCAGTGG TAGACACCGC ACtGGAGGAG 1301 TCAaTAGGtC aACTGAAGGT GTCAGAAATA TAAAATTcTT tGTTGTTAAt 1351 GATAATtTAA ATTTCATTAA AGTATCGTCT AATAACTCCA AAAATCCgGG 1401 TAAGTCgCTG AAATTCATCT TAGATACCGG TTCCAGtGTA AACATtATAA 1451 AaGCGTGTAA GTTAACtCCT GAAACTAAAT TTAACTCCaA gGAAAAAaTC 1501 AAaCTTCAgG GAATCAGTCA tagTCAgCAA gCcTTGGAAA CAGTCGGTtC 1551 tTGcgTAATa CCacTAagAA TcGAGGtaaA AtaatataCC aCAAAATTTC 1601 AtATTCTtaA TCAaaCAACT 
AATATcCCat AtGATGGTcT GCTAGGAAAA 1651 GAATTTTTTa TAaAAcAcTC tGCATGCATC GAATATGAAA CCAACACcGT 1701 AAAATTGAAA GAAATTTCTA AACCACTAGT TGTGTATTCC GAGGAGACTT 1751 CATCTTCAAA CAACCTTCAC CTTAAGGCTA GAACAGAAAC CaTAGTTAAA 1801 ATAAACATCC TCAACCCtGA AATAAAGGAA GGAATAGTGC CAaACACAAA 1851 AATTATGGAA GGCGTCTATT TGTCaCGTGC CATCGTTAAA GTAAACAAtA 1901 ACAACGAGGC TTACGCCACC ATTTTGAATA CAAGAACGAC TGATcAAACA 1951 ATTGAACCAA TCACaGTTcG CCTTGAGAAA GCTTCAAAAC AGTGTTTcCA 2001 AATAAAAAGT CTAGAACCAC GACATAAAAG GAAATCAATA ATAAGTAACC

189 2051 ATTTGAGGTT AGAGCATCTT AATGAAGAAG AAAAAACATC AATAGTTAAA 2101 ATTTGTGAAT CTTATTCcGA TATTTTCCAC TTACCCGGAG ATTATtTAAG 2151 TTCGACAGAG GCAATTGAGC ATAAAATCAA TACTATTAAT GAAAATCCTA 2201 TTTATACTAA AACTTACAGA TATCCnGAAA TACACAAAAG AGAGGTTAAC 2251 AAACAGGTGA CCGATATGCT GAAACAAAAC ATTATAAGAC CTTCCAATTC 2301 TCCTTGGTCG TCTCCATTAT GGGTTGTTCC AAAAAAATCT GATGCTTCTG 2351 GAGAGAAGAA ATGGAGAGTA GTAATTGATT ATCGAAAATT GAATGAAGTT 2401 ACAGTAGATG ACAAATATCC TATTCCTAAT ATTGAAGAAA TTTTAGACCA 2451 GTTAGGGCGT TCAAAATATT TTACTACTCT AGACTTGGCT TCCGGTTTTC 2501 ACCAAATTCC ATTGAATGAC GATGACAGTA AAAAGACAGC TTTTACAACT 2551 CCTTTTGGAC ACTACGAGTA CACGCGTATG CCTTTTGGAC TGAAAAATGC 2601 CCCAGCTACT TTTCAAAGAC TAATGAATAC CGTTTTATCC GGTTTACAGG 2651 GACTACAATG TTTTGTTTAC CTTGACGACA TCGTGATCCA TGCTGCTAGT 2701 ATTCAGGAAC ACGAAATTAA GTTAAGGAAA ATATTCGACA GACTCCGACT 2751 AAATAATCTC AAATTACAGC CGGACAAGTG TGAGTTCCTA AGAAGAGAAG 2801 TCGTTTACTT GGGACATACT ATCAGTGACG TAGGTGTCAG ACCAAACCCA 2851 GACAAAGTAC AAGGAATTAA CTCATTCCCT ATCCCGAAAA GTACTAAGGA 2901 CATAAAGTCT TTCTTAGGTC TGGTAGGTTA TTATCGTAGG TTTATAAAAG 2951 GATTTGCCAA AATTGCAAAG CCCTTGACAA TTCTTCTTAA GAAAAACCAG 3001 GACTTCAAAT GGACAAACAA AGAGCAAGAA GCATTTGAAA AATTCAAGGA 3051 AATCCTTTCC ACACAACCCA TAnTACAGTA CCCCGATTTT AATAGCGAGT 3101 TTGTCTTGAC AACGGATGCA TCGAATTTCG CAATAGGAGC CGTCCTCAGT 3151 CAnGGGGAAA TAGGAAAAGA CTTnCCAATT GCCTATGCAT CTAGAACTCT 3201 GAATGAGTCC GAATTAAACT ATAGTGTAAT nGAnAAnGAA CTCCTAGCnA 3251 TCGTATGGAG CGTCAAACAT TTTCGACCCT ACCTCTTCGG AAGGAAATTC 3301 ACTATTGTCA CAGATCATAG ACCTCTAACC TGGTTGTTTA ATTGCAGGGA 3351 GCCCAATAGT AGGTTGGTAC GGTGGAGAAT TAAATTAGAA GAATATGATT 3401 ATAAGATTAT TTATAAGAAG GGAAGCCTAA ATGCAAATGC CGATGCACTC 3451 TCCAGAAATG TCTATATAAA CGTACCATCG TCCGACAAAT CCGAAAGAAT 3501 CCATCCCAAA AGTAAGGAAG AGATTAAGAA AATTTTGGTA GAAAACCATG 3551 ATTCTAAACT GGCTGGACAT TGTGGATTTT GTAAAACCTA TCAAAGAATC 3601 AAACAACGTT ATTACTGGAA AACAATGAAG AATGATATAA AAAACTATAT 3651 AAGAAGTTGC AAGTCCTGCC AAGTTAACAA AATTAATTTT AAACCTATTA 3701 AGGTGCCCAT 
GGAAATAACA ACTACCTCTA AGCAAGCTTT TGAAAAATTG 3751 GCCATAGATG TCATGGGACC ATTGCCAACC ACTAGTGAGG GAAATAAGTT 3801 CATTCTGACC ATGCAAGATG ACTTAACCAA GTATTCATAC GCGGAGCCCA 3851 TACCGAACCA TGAGGCGAGA ACAATAGCTT CGAAGATCTC CAAATTCGTG 3901 ACGTTATTCG GAATCCCTAA ATTTATTCTG ACAGATCAAG GCACTGATTT 3951 TACATCGAAC ATAATTAAAA ACTTAATGAA ATTATTTAAA ACAAATCACA 4001 TAAAATCTAC TCCATATCAT CCGCAAACTA ATGGAGCATT GGAGAGATCT 4051 CATCTGACCC TGAAGGAATA TTTAAAACAT TATATAAACG AGAGACAAGA 4101 CGATTGGGAC GAATTTATTC CTTTTGCCAT GTATTCCTTC AATTCGCACA 4151 CGCATACTTC CACAGGTTAC ACTCCATATG AGCTTTTATT TGGCAAGAAA 4201 CCGTTCATAC CGAATTCATT GATAAGAAAA TCTACACTCA GAAAATTCAT 4251 TGATAAACCA TTTAATAAAA ATTACGATGA TTATATTAGT GATTTAAAGG 4301 AGAAGATTCA AATCAGTCAA AAATTAGCCA GAGAGAATCT AATAAAACAT 4351 AAGGTGAAGT CAAAACAATA TTACGACAAT AAGATCAATA TTCACGATTA 4401 TAAGATTGGA GATTTGGTTT ATATAAAAAA CAATTTAACT AAAATAGGAA 4451 TCAATAAAAA ACTTAGTCCA AAATTTAAAG GCCCATACGA GATCAAGAAA 4501 ATATCTGGGA ATAACGTTTA TTTAAAAATC CGTAACAAAT TAGTTACTTA

4551 TCATGTTAAC AACACAAAAC CGTGTTCTGG GTAGTTGATC TAAAGAGAGA 4601 AATCCAAAAT AATAATAATT TAATCCTTTT ATTTTCCCAT ATAGCTAATC 4651 ATCATTTATT TTATTATTAA TCCATTTATT CTTTTTCTAA TCATTATTTA 4701 TGTTTCCTAT ATACAATTCT TGTATTTTTC AGAACTTATG TCAATTTTGT 4751 TTCGAATTTC ATGTCTTTTG ATTTTATTCT TTTATCTTTA CTCTTTATGT 4801 AACATTGTTC ACTAACGAGT ATTGCCTAGG AAAGGTTCTG CTACTTGCAG 4851 CTATAAAGCT TAAAATTATG TTCCTGATAA GAAGCCAATG TGAAGAGATA 4901 CAACAGTCTA CCATACAGAA GACATAGCAC CTACTACTAC TCAAGGAGTC 4951 TCAACCTGGA CGGAACCATT CAGAACCCGA CATAGGAAGC TGTCAAATCC 5001 GAACAAGAGG AGAAAACAAA TTCTTTTTCA AATAATTGTT TTCCGTCTTA 5051 AGGGGGGAGG TGTTATGATC CCGTACTCAA TATTCACATT TTCAAATTTT 5101 TATAAACAAA AGACTGGCGA CGAACTCGAA TTGTTCTTCT TTAGAAGCGA 5151 AGCTTCTAAA GACTCCTTCT TTTATTTTAT TTGCAAGTGT TTGCTAGTTC 5201 CCATTGTTGC TCGTTCGAAA CAGTTGACTG ACGAAAGACC AAAGTAATAT 5251 AGGCAAGCTA CACAGCTTGG CGTCGGTTCA TTTGATTTTT AGTCTAGCTG 5301 TAATAAGTTC TTGTTTATTA ATATTAGTTT TGTTAGTGAT AGTTTTAAGT 5351 GCTTGTTCAT TACTTACAAA AATAAAGAAC GAACCGAATA TAACA //

E.1.3 Transposons

E.1.3.1 mariner

DEFINITION  Pediculus humanus humanus mariner consensus sequence
FEATURES             Location/Qualifiers
     source          1..1276
                     /organism="Pediculus humanus humanus"
                     /mol_type="genomic DNA"
                     /transposon="mariner transposon"
     misc_feature    <1..2
                     /note="target site duplication"
     repeat_region   3..33
                     /note="left terminal inverted repeat"
     ORF1            285..584
                     translation=GDEALIERQCQNWFAKFRSGDFSLQNEECSGRQLEVKDEQIKALI
                     DYDRHSSTKDIVKKLDVSHTCVKNRLRRLGCQKKLDALLWGTLVNEATWSLRYAS
                     /product="mariner transposase"
     ORF2            683..1213
                     translation=ARKQAPTTSKTDIHQKKVLLSFWWDYKGIVNFELLPRCQTINSEV
                     YIRQLTNLNDTIQEKRPELANSKGIVFHHHNARPSPSLATGQKLLELGWNVLLHPPY
                     SPKLAPNNYHFFRFLKNFLNGQKFQNDNEVKTALEQFFAPKTKEFYEKRKMILPEKC
                     QKVTNNNKHNIIDKNNLT
                     /product="mariner transposase"
     repeat_region   1244..1274
                     /note="right terminal inverted repeat"
     misc_feature    1274..1276
                     /note="target site duplication"

ORIGIN 1 TATTGGGTTG GCAAATAAGT AACTGCGGAT TTTACCAACA GATAGTTTGT 51 TATTTTTTTT GAGTACGTTT ACGTTTTTGT ACAGACATGA ACTTTTGATA 101 TGTTATTACT TGGTTCCTTC TGTAACATTC GGTACCAAAA TTTCATTGAA 151 CTCTTAAATA GTACGCGAGC AAAAGATAAT TAAACATGGG GAGCCAGAGC 201 GAGCATTTCC TCCACATTTT ACTTTTTTAT TTTTGAAAGA GTGTTAATGC 251 TTCCCAGGCC AATAAAAAGT TGTGGGTCGT GTAGGGGGAT GAAGCCTTAA 301 TAGAACGGCA GTGTCAAAAC TGGTTTGCGA AATTCCGTTC TGGAGATTTT 351 TCTTTGCAAA ATGAGGAGTG CTCCGGGCGC CAATTGGAGG TTAAAGATGA 401 GCAAATAAAG GCCCTCATTG ATTATGATCG GCATAGTTCG ACTAAGGACA 451 TTGTAAAGAA GCTAGATGTG TCACATACGT GCGTCAAAAA CCGTCTGCGG 501 CGTCTTGGGT GCCAAAAGAA GCTTGATGCG TTACTTTGGG GAACGTTAGT 551 TAACGAGGCG ACTTGGTCTT TGCGATATGC TTCTTAAACG CAATGCAAAT 601 GACCCTTTTT TGAAAGAATG GTCACCGGAG ATGAAAAGTG GGTTGTCTAT 651 GATGACTTTT TGAGAAAAAG ATCCTGGTTT AGGCAAGGAA ACAGGCACCA 701 ACAACTTCTA AGACTGACAT TCACCAAAAA AAGGTATTGT TATCATTTTG 751 GTGGGATTAC AAAGGCATAG TCAACTTTGA GCTGCTGCCA CGATGTCAGA 801 CCATAAATTC AGAGGTTTAC ATTCGACAAT TGACAAATTT AAATGATACC 851 ATCCAAGAAA AACGACCGGA ACTAGCCAAT AGCAAAGGAA TTGTCTTTCA

901 CCACCATAAT GCCAGGCCCT CCCCATCTTT AGCCACTGGA CAAAAACTAC 951 TGGAGCTAGG CTGGAATGTT TTGCTGCACC CTCCATATAG TCCCAAACTA 1001 GCTCCAAATA ATTATCATTT TTTCCGATTC CTAAAAAATT TTTTAAACGG 1051 ACAAAAATTC CAAAACGACA ATGAGGTCAA AACTGCATTG GAGCAGTTTT 1101 TTGCTCCTAA AACTAAAGAG TTTTATGAAA AAAGGAAAAT GATACTACCC 1151 GAAAAATGTC AAAAGGTCAC TAATAATAAT AAACATAATA TAATAGATAA 1201 AAATAATTTG ACATAATTAA TAAATCGTTT TTTGTTTTCT TAAAAAATTC 1251 GTAAATATCT TTTTGCCAAC CCAATA //
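As a consistency check on the feature table, the ORF coordinates can be translated directly from the consensus sequence. The sketch below is illustrative only and is not the procedure used to produce this annotation: it assumes the standard genetic code, and the names `translate`, `feature_span`, `FRAGMENT`, and `FRAGMENT_START` are hypothetical. It holds bases 251..350 of the mariner ORIGIN block and translates the first ten codons of ORF1 (285..314).

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate DNA in frame 1; codons with ambiguity codes become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3)
    )


def feature_span(seq, start, end, offset=1):
    """Return bases start..end (1-based, inclusive, as in the feature table).

    `offset` is the record coordinate of seq[0], so a fragment of a
    record can be indexed with the record's own coordinates.
    """
    return seq[start - offset:end - offset + 1]


# Bases 251..350 of the mariner consensus, copied from the ORIGIN block.
FRAGMENT_START = 251
FRAGMENT = (
    "TTCCCAGGCCAATAAAAAGTTGTGGGTCGTGTAGGGGGATGAAGCCTTAA"
    "TAGAACGGCAGTGTCAAAACTGGTTTGCGAAATTCCGTTCTGGAGATTTT"
)

orf1_start = feature_span(FRAGMENT, 285, 314, offset=FRAGMENT_START)
print(translate(orf1_start))  # GDEALIERQC
```

The printed peptide matches the first ten residues of the ORF1 translation listed in the feature table. Passing the full 1276-base sequence with `offset=1` would allow the same check over the whole of 285..584.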

E.1.3.2 MITE1

LOCUS       MITE1 623 bp
DEFINITION  MITE1, 623 bases, A64 checksum.
ORIGIN
        1 CTCCGACGTC GGAACCCCGC GACTCGATGG GAGCCGCGAT CTCGCGGTTT
       51 CGGGTGGAGG TGGGGGAGGG CGCGAAAAAT TTTTACTTTT TTTTTTCGGA
      101 ATTTTGACCG CGGTCGGAGA CTCTCCGACG TCGGAaCSCS GaCSMCKCGA
      151 YGGGAGCCGC GATCTCGCGG TTTCGGRTGG AGGTGGGGGA GGGCGCGAAA
      201 AATTTTTACT TTTTTTTTTT TGAATTTTGA CCGCGGTCCC GGACTCTCCG
      251 ACGTCGGAaC SCSGaCSMCK CGAYGGGAGC CGCGATCTCG CGGTTTCGGG
      301 TGGAGGTGGG GGAGGGCGCG AAAAATTTTT mTTTTTTTTT TYGGAATTTT
      351 GACCGCGGTC GGAGACTCTC CGACGTCGGA ACCCCGCGAC TCGATGGGAG
      401 CCGCGATCTC GCGGTTTCGG GTGGAGGTGG GGGAGGGCGC RAAAAAWWWT
      451 TacTTTTTTT TTTCGGAATT TTGACCGCGG TCGGAGACTC TCCGACGTCG
      501 GAaCSCSGaC SMCKCGAYGG GAGCCGCGMT CTCGCGGTTT CGGRTGTAGR
      551 TGGGGGAGGG CGCGAAAAAT TTTTTtTTTT TTTTtCGGAA TTYGACCGCG
      601 GTCCGAGACT CTCCGACGTC GGA
//

E.1.3.3 MITE2

LOCUS       MITE2 169 bp
DEFINITION  MITE2, 169 bases, 2038 checksum.
ORIGIN
        1 TAGGTCGACC TTGAATMCAA GGCCATTGGT TTTACATTTC RATTTTGTGA
       51 AATTTTTCAT GGTCATGATT TTTCAACCAA GGTCGACCTT GAATACAAGG
      101 CCACCAGTTT TAAAATTTTA TTTTGTGAAA ATTTTTCATG GTCACGTTTT
      151 TGCATTCAAG GTCGACCTT
//
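The consensus records in this appendix mix position numbers with blocks of ten bases, and several contain IUPAC ambiguity codes (for example M and R in MITE2). The following sketch shows one way to recover the bare sequence from an ORIGIN block, confirm its length against the LOCUS line, and list the ambiguous positions. It is an illustration under stated assumptions rather than part of TESeeker: the function names `parse_origin` and `ambiguous_positions` are hypothetical, and the IUPAC table is the standard nomenclature.

```python
import re

# Standard IUPAC nucleotide ambiguity codes and the bases each may represent.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def parse_origin(block):
    """Return the bare sequence from the text that follows the ORIGIN keyword.

    Position numbers, whitespace, and the '//' terminator are dropped;
    letter case is preserved.
    """
    return re.sub(r"[^A-Za-z]", "", block)


def ambiguous_positions(seq):
    """List (1-based position, code, possible bases) for each ambiguity code."""
    return [
        (pos, base, IUPAC[base.upper()])
        for pos, base in enumerate(seq, start=1)
        if base.upper() in IUPAC
    ]


# Sequence rows of the MITE2 record, copied from the listing above.
MITE2_ORIGIN = """
1 TAGGTCGACC TTGAATMCAA GGCCATTGGT TTTACATTTC RATTTTGTGA
51 AATTTTTCAT GGTCATGATT TTTCAACCAA GGTCGACCTT GAATACAAGG
101 CCACCAGTTT TAAAATTTTA TTTTGTGAAA ATTTTTCATG GTCACGTTTT
151 TGCATTCAAG GTCGACCTT //
"""

seq = parse_origin(MITE2_ORIGIN)
print(len(seq))                  # 169, matching the LOCUS line
print(ambiguous_positions(seq))  # [(17, 'M', 'AC'), (41, 'R', 'AG')]
```

Because `parse_origin` keeps every letter, it should be given only the sequence rows; passing the LOCUS or DEFINITION lines would add their letters to the sequence.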

E.2 C. quinquefasciatus

Because more than 100 non-LTR TEs were identified in C. quinquefasciatus, we list only one element per family.

E.2.1 Non-LTRs

E.2.1.1 CR1

LOCUS Cp_CR1_Ele11 3116 bp DEFINITION Cp_CR1_Ele11, 3116 bases, CE7 checksum. ORIGIN 1 CCACGAATCT GCTGCTTTAY TAYCAGAACG TTGGAGGCAT TAATACCACT 51 ATCGCCAACT ACGCCCTYGC AATCTCTTCY GCCTCMTACG ACCTGTACGC 101 AWTMWCTGAR ACGTGGTTGA CYTCWGCTAC TCTRTCTGGT CAAATYTTYG 151 GTCCCGAATA YGAAGTATTC CGTGGAGATC GGACMGYCTC GAACAGYWKT 201 AAAGRRTCAG GCGGRGGAGT YCTGCTTGCC GTCCGCTCSA AMCTAAAGCC 251 ACGCCAAYTR TTCCCWCCAR ATTGTACCGT TCCRGAGCAA GTMTGGGTYG 301 CAGTTCCACT CGCTGCATCY ACGATGTTYG TGTGTGTTAT CTACATTCCT 351 CCYAAATTTG ACAACGATAA GCCGCTGTTC GATCAGCACA GACATTCTTT 401 GACGTGGATA GTCTCCAAAA TGAAAGTGAA CGATAGTGTT ATGGTCCTCG 451 GTGACTTCAA CTTCCCAGCC ATTCGCTGGA CGCGSACMCC GACGAACAAA 501 CTGMTTCCAA ACYTAGCCCT YACTCCGACC AAYGMGTTAA AGCACAAMCT 551 CCTGGATGAS TATTCYACCG CAAACCTTAG CCAACTGAAY GACATGYGCA 601 ACAACTCAAR CAACGTTCTY GACCTRTGCT TTGCCAGCTC WGRKACACCG 651 ATCAACTWTA CYCTTCTMCC AGCWCCTYTR CCKTTGGTKA AAGACGTGCG 701 SCACCACYTK CCRTTTCTYG TWTCSATWTC YTGCACGRYG CTCSMTTTTC 751 GTGAWGTYGC TGGYAAYWCK TTYATGGACT AYCGWAAGGG AAACTAYGAT 801 GRCATGAACA ACTTCCTGAC CAACATTAAY TGGMACCAAC TWYTGGCCAA 851 CCTTGACGCC GACACAGCYG CTGMTACTTG GACAGGTGTT CTGACGGATG 901 CCATCAACAC CTTCGTTCCA AGGAAACWGC GCCAGCCTCC AAGAYATCCA 951 CCGTGGTCAA CACMTCGAYT GCAGATTYTG AAGTCCAGGA AACGMGCTGC 1001 CCTCAAGAAA TWCGCCAAAC AYCCGACAGA TCGATGGAGA AACCATTATA 1051 GGTCAAGAAA CCGGAAGTAC AGTATCCTGA ACAAACAACT TTTTCATCGC 1101 CACCAACACC GAATCCAAAG CCGATTGAAA CGAGACCCCA AGAAGTTCTG 1151 GAAYCAYGTA AACGAGCAGC GGAAAGAGAC AGGTCTRCCA ACTGCGATGA 1201 TACTCGACGG TGARGAGGCY ACTTCCACCG AGAGTATAAG CGATCTKTTT 1251 CGTCGCCAGT TCAGCAGCGT ATTCACCAAC GAAGCAGTAG RGGAAACGCA 1301 TATTGCTAAG GCTGCTAGCA ACGTTYCACT GCGACCTCCC ATYGGACCTC 1351 ACCCGGTGGT CACTTCCGAG TCCATCCGTC GTGCCTGCGC CTCTCTCAAA 1401 GGTTCTACCA GCTGCGGYCC AGACGGCATC CCMGCGTTTG TGCTMAAAAA 1451 GTGTTGYGAT GCACTCGCGG AACCAYTGGC TCAACTYTTC AAYACCTCGC 1501 TTGCTACTGG AGTTTTCCCG TGTTGCTGGA AGAAGTCYTW TGTKTTCCCA 1551 GTYCACAAGA AGGGCCCMAA ACGTGATGTC CGGAACTATC GYGGAATTGC 1601 TGCCCTCTGC 
GCAGTYAGCA ARCTGTTCGA AGTWATCGTG CTGGATTTYA

196 1651 TYAAGTTCAA CTGCTGTGAC YATGTCGCCC WGGAACARCA CGGCTTCATG 1701 GCGAAACGTT CCACYAACTC YAACTTGGTC TCYTACTCGT CCTTCATTCT 1751 WCGAACCATG CAGCAACGGA AGCAGATCGA TGCCATCTAT ACGGACCTAT 1801 CAGCGGCYTT CGACAAGCTG AACCACCGYA TYGCYGTTGC KAAACTGGAA 1851 CGMCTAGGYT TCGGCGGGCC CATGCTYGAT TGGCTWCGCT CCTATCTCAC 1901 TGGMCGTGAA ATGAGCGTYA AAATCGGTGA CGTGATTTCC GCTGCTTTYT 1951 CTGTTTTTTC AGGCRTTCCR CAAGGAAGCC ATCTGGGCCC TCTGATCTTC 2001 CTCCTCTACA TGAACGACGT GCATCATCTg YTTAGGCTGT CACAAACTGT 2051 CGTATGCGGA TGAYATCAAR CTGTTCRYCG TTGTCGAGAA TGATACCGAC 2101 TGCCAGWTTC TTCAGGAGCA GCTCRACCKG TTCGCCAACT GGTGCTCCGA 2151 MAACAGGATG GTTCTGAACG CTTCCAAGTG CTCGGTYATC TCTTTCACAC 2201 GCAAGCGCAA YACMATKTCY TTYSACTACA CACTTTCAAA CACCACCATA 2251 CCYAGGACCT CYTGTGTGAA AGATTTAGGT GTGATGCTGG ATAGCAAAAT 2301 GACGTTTRCT GACCACATYA CGTATACAGT CTCCAAGGCT TCCAAAACTC 2351 TTGGCTTCAT CTTYAGRATA GCTAAAAACT TCCGGGATTT AGGCTGTCTC 2401 AAAGCTCTTT ATTGTTCGTT GGTYCGCTCT ACTTTRGAGT AYTGTTGTAY 2451 TGTTTGGGCT CCCTTCTACC AGAACGCGAT TCAACGCGTG GAGTCGGTSC 2501 ARCGGAAGTT CGTTAAGTAC GCGCAACGTC ACATTATCTG GCCTGATCCC 2551 GCCAATCCRC CGAGTTACSC AGAGCGCTGT AAAATGCTTA ATCTCGAACT 2601 TCTTACAGTA AGRCGTGACG TTKCCAAGGC GACYTTCGTT GCAGATCTCC 2651 TTCGWTCGTC CATCGATTGT CCTGCCGTTT TGCAAMTGGT MAACATAAAC 2701 ACTCGCCCTC GCGTACTCCG CAATCACTCR TTCTTGACTG TCCRCAGGGC 2751 TCTCACAAAC TAYGGGCAGA ACGAACCGGT TTCWAGTATG TGTCGTGTTT 2801 TTAACTTGTG CTCAGATCTG TTTGACTTTG ACATCTCCCG TGACACAATC 2851 AAAAAACGAT TCCTTAATCA CCTGAAATCC CATCCCTAAc CtgAcgATAC 2901 ACACGTAGAT TTTAGAACTG TGATATTTWT GTTAATTTAT TGAGTTAGTT 2951 TTAAGAGTGA ACCCGTCTTG TATCATTTGA GTTTTGTGTA CTTGTTGATG 3001 CGATAAGATG AGGTGGTTTT GTGCCTTTTT GAGAAAGTGT CTTKAAYRAT 3051 ACCAGACACA GCTCAAGGGG GCTTTTGTCC ACCTCCAATA AAGAAAAAAT 3101 AACaAAAATA AATAAA //

E.2.1.2 I

LOCUS Cp_I_Ele1 3837 bp DEFINITION Cp_I_Ele1, 3837 bases, 16BF checksum. ORIGIN 1 TTTTTTTTTT GTATTTATTT AGGACCTTTT AACTATGAGT CTTTCGGGTC 51 CTGTTWATGA ATCTTTACAC TTTTCCATAC ATGTCAGTAG CCTTTAAAAA 101 TTTCAACACA TTTTTCGCCA TATCTCGATC ATCAGCCAAG GCTTCCCTGA 151 CGTTGGATGG TACTCCAGCT CGGCGTCTCT GCTCTTCAAA CTCAGGGCAT 201 TCCGCTAAGA TGTGTTTCAC CGTCAGCGCC AAATTGCACC TTGTGCAACG 251 CGGCGCACTG TCCTTCTCCA GCAGATACTG ATGGGTTAGC AAGGTGTGTC 301 CTATCCTGAG TCTTGTCAGA ATGACGTCTT CCTTCCTCGA TCCAACAAAT 351 ACATCTCGAT AAGGTAGTAC CGAGTTCTTC ACCTCTCGTA GTTTGTTGCC 401 CACCTTTCTG CTCCATTCCG CATTCCAGCG CCAAACAGTC CTCTTCTTGA 451 TCACCGTCCT GAACTCCTTG AATTCTACGC TTCTGTCCCA GATGTTTCTG 501 TCGTTCAATG ACTGCTTCGC TTCTTCATCT GCCTTCTCGT TGCCTGCTAT 551 TCCTACATGA CTTTTCACCC ACATGTAGAT TATCTCCGTG CCATTACATT 601 CTGCCTGATT GTGTAGGATG TTGATCTCRT CCTTCCATCT GCATTTGGTT 651 TTTCGTTTAC CCAAGGCCGT GATGGCACTC AAGGAATCTG TACAGACGAG 701 GTAGGTCCCC ACGCGATTCT GACCAATTAT CCACCTGAGC GCTTCAGTGA 751 TCGCACCACA CTCCGCAGCA AAGATACTGC TCAGATCACT GATTCTTCTT 801 CGAACAACCA AGTCCCCTCT AACCATCGCG TAACCAACCC TGCCATCTTT 851 CTTTGATCCA TCGGTGAAAA TTGTTTCACA CAAGCGATAT TCCGTGTTCC 901 TTCGGGACAC GAAGAGCTCC TTCAGTTGTG TTGAAGTTGC TCCGGCTCGT 951 TCAGCTTCGA GTAATGTTTT GTCTATCCGG ATCCGTCTGC GTTCCCAGGG 1001 GGGACACAAT GGTAGTGTGA ATATCTTCAG CTCGGGTAAC GGTAGTTCCA 1051 GCTCTTCAAG TATCGCCTTT CCTCTTGTCT CAGCAGTTTC CACTCCACGT 1101 AGCGGACCTC GATGCCTAGT CGAATTCCAC TCCTCTCCTG AACTGTTGCT 1151 ATCGTAGCTG CTCTCGGTAC TGCTTCCTGC TGATTGTGAG TCGTCTTCTG 1201 CTGGCTGGGT TMTGCTCTGT GCTGATATGG CTGCTTTTCT GGCGGCGTAG 1251 ATCGCTGTCC GCTGTTCAAA GAACACTCTC AGGCTCGGGA TTCCTGTTTC 1301 AGCTTGGAGA CTATCTACAG GGCTGGTACG GAACGCACCA CAGATAGCTC 1351 TCAAGCCTGT GTTGTGTGTT GGCTCGAGGA TCTTCAGGAC GTTGTCACTG 1401 ACGGCAGCTG TTATAGGTGC TGCGTACAGR ATTTTCTCCA ACACTGTGGC 1451 TCTGTACAGC TTGATTAGAG TCTTCCTATC GCCACCCCAT GACCTACATG 1501 CTACACAACG GATWAGCTGG ACTCGTTGTC GGCAAGCCGC TTTCACTTCC 1551 TCGCAGTGTG TCTTGAACAG TAAATGTTGA TCCAGGATAA CGCCCAAACA 1601 TCGGTGCTGC 
TTTTTGGTTG GGATCAGGGT CCCATTCAGC TCCAGTGCAG 1651 CTCTGCTCGG TGGTTTCCTC GTTCCGTAGC TTCTAAAGAT GACMGTAGCG 1701 CTTTTCTCCG CAGAAATTTT AAATCCCGTT GAGCTCTGCC AGCATTCCAC 1751 CGCCTTCAGT GCAGCTTGCA AGTCGTTCTC AACTTCTTCA ACATYTCGTC 1801 CACTGGCCAG CAGTACAACA TCATCTGCGT ACAAGAGTGT TGTTATGCTG 1851 GGCGGCATCC GTGCTACCAA KGTGTTGATA GCCACCAAGA ACAAGGTGAC 1901 GCTCAGTACT GATCCTTGGC AGAGGCCTGT TTCCATGATC TTGCTYTGCG 1951 ACAGCTGGCC GTTCACAAAA ACTCTAAATG AGCGATTTTC CAGCATTCGG 2001 TCCAAGAATT TCAACATCAG ACCTTCTATG TTCCAGTCAC GCAGTTGGTT 2051 CAGGACCAGT CTTCTCCAGG TGGTATCGTA CGCCTTGGTM ACATCCAGAA 2101 AAATCCCCTG AACGTACTCC TTCTTGTTCC AGGCTGCTCG TACCACTTTC 2151 TCGAGCTCAG CCAAGTGGTC AACGGTAGTT TTTCCTTTCC GGAAGGCAAA

198 2201 CTGGTGAGGA TGCAGTAATC TTCTTGTCTC GATGATGTGA ACTAGACGAT 2251 TGTTGACCAT TCGCTCAAAC ACCTTTCCCA GGCAACTGTT YAAGAAGATT 2301 GGCCTATAGT TACGAGGATT TGCTTTTTCA CCACAATTTT TGAAGATAGG 2351 GATCACCAGC GATTCCGTCC ATTCGGGTGG GTATACTGAA CCGAGCCACA 2401 GCCGGTTGTA CGCGTCAAGA AGCTGTCGTT TGCAATCCAA CGGTAACTTT 2451 ACGATCATGG AATAGTGGAC AGTGTCAGGG CCTGGTGATG AGCCACGCAA 2501 ACCTGCCGTT GCCTCTTCAA ATTCAGTGAA TAAAAAGGGC TTGTTGTACT 2551 CAGCACTCGT GTCAGCTGGR ATTGACAGAG GAGTTTGCTC AACCTGATCC 2601 TTGAATGCGC GAAACTCGGG GCTGTAAGCA TCGTTACTAG AGATGGAAGC 2651 GAAAGARCTA GCCAGCGCCT CAGCAATATC TTCATCTTTA GACACAGTAC 2701 CTCCTTCGGC TGTTATTGCA CTGATACGGT TCGTTTTGCR CTTGCCTTGG 2751 ATCCGGCGGA AGTTCTCCCA CACTTCTTTA ACAGGTGTTT GCACTGTGAA 2801 GTCGTTGACG AAGTTGATCC ACGAGGTCCG TTGTGCTTCC TGGATGACTT 2851 TGCGTGCATG MGAGCGGGCT GCTCTGAATT CTGCTGCAAG AGCCTCTTGA 2901 TTCTCCAAGT TGTTCCGTTT GGCTGATTGC ARTGCACGCA AGGCTTTCTT 2951 CCTGCTCTTG ACAGCGGAGG CCACCTCTTG GTTCCACCAA GGCACCGCCC 3001 TCTTGTTCAC CATCCCCGTC GTTTTCGGTA TAGTTTGTTC CGCTGCTTCA 3051 AGTATTTTCT GGGTAATGCT GGAGATTTGT TGTGTAGCTG TAGGCATKAG 3101 GGGGAACTGC ACTGTTGTCT GGAATGTCTC CCAGTCTGCT TCCTCGATCT 3151 TCCATTTGGG TCGCGTTCGG ATGGCATGGG CCGTGTCAGG GAGAGTTAGC 3201 AATARTGGTA GATGGTCACT ACCGTAGGMG TCTTGTAGTA CAGAAAAATC 3251 TAGTTGGTCT ATTATTTCTG TTGAGCAGCA TGCCACGTCA ATACTCGTGA 3301 GAGTTCCGGT CGCAACACTA ATGTGGGTTG GATCTGTTTT ATTCAGGACC 3351 ACCAGGTTGC ATTCGTTCAA TACTTCCTCG AACATTAGTC CTCGGGCGCT 3401 GAACGTTTCC GATCCCCAYA GTGGGTGGTG GGCGTTTATA TCTCCCACGA 3451 GTAGGCGTGG CTGGGGGACT TGGTTTATCA GGTGTACAAT WTCCGAGGCT 3501 TGTATGGCTT GCCCCGGTGG CAGATATRTA TTGATAATGG TCAGRTTAAA 3551 TGGGGGACCT ACCTTAACAG CGATGGCTTC CAGGTTGGTG TCCAGGTCGA 3601 TTTCTTCACT GTCCAGTTCA GGTTTCACKC CAACCAAAAC TCCTCCGGCT 3651 GCACGGTCGC CACCARCTCT CGGTCGGTAG TATATGCTGT AGCCGTTGAT 3701 GTTCGCCTGG TCCTTAGATG AGCACATGGT TTCTTGGAGG CATAGTACAG 3751 ATGGGTTGAT TTTGCTGCAT AGTATTTTCA AATTTGGTTT ACTGGTTCTA 3801 AGTCCTTGGG TGTTCCATGA AATAATGTTC GAGTAGT //

E.2.1.3 Jockey

LOCUS Cp_Jockey_Ele4 4487 bp DEFINITION Cp_Jockey_Ele4, 4487 bases, 6C9 checksum. ORIGIN 1 TTTTTTTTTT TAATTTATAT TTATTCAAAT TTCTTTTCCA TGTACATTCA 51 TTCAGTTAAA ATATTATTGA GTGTCCAATC ACAAACGATG ACTTTTCACC 101 TCAATTTTAA ATACTAGCAA CTTTCATTTA TTCATGAAAT ATTGTAGCTT 151 TCGCTATTCA GTGATTTCAA ATGTAGGAGG TCCTACATGT ACAAAAGGGA 201 AAAGGGATAC CTTAAAACTA ACTTATAAAC TATATAAAGA GCGGATCAAT 251 GCAGCTGAAG ACTGCAATGA TTTTTGTCGA AATGCATCAA TTATCTTATT 301 GGACATAACA TCCAAAGTGT CAACTTCGGC TAATTGATGA AGTTCACTGG 351 TGCTGAACCA GGGAGGAAGT TTCAGAATCA TTTTCAGAAT TTTGTTCTGA 401 ATCCTCTGAA GTTTTTTCTT CCTGGTTAAG CAACAGCTTG TCCAGATCGG 451 CACAGCATAA AGCATGGCAG GTCTGAAAAT TTGTTTATAA ATTAACAGTT 501 TATTCTTGAG ACAAAGTCTA GAATTCCTGT TTATAAGTGG ATACAAACAT 551 TTAATATATT TGTTACATTT AACCTGKATA CTTTCAATGT GATCCTTGTA 601 AGTAAGGTTT TTGTMWtAAA CCAAGTCCAA GATATTTCAC TTGatCCtCC 651 CACTTTAAaT TTACMTCATT CATCTTTATA ATGTGATGAC TTTTTGGTTT 701 AAGAAAATCA GCCCTTGGTT TGTGAGGGAA AATAATAAGT TGAGTTTTTG 751 CAGCATTTGG AGTAATTTTC CATTCTTTCA AATAAGAATT GAAAATATCC 801 AAGCTTTTTT GTAATCTTCT TGTGATGACA CGAAGGCTTC TGCCTTTGGC 851 GGAGATGCTT GTGTCATCAG CAAAAAGTGA TTTCTGACAT CCTGGGGGCA 901 AATCAGGCAA GTCAGAAGTA AAAATATTGT ATAAAATTGG ACCCAAAATG 951 CTTCCTTGAG GGAYGCCAGC ACGTACAGGT AGTTGATCAG ATTTGCTATT 1001 CTGATAACAT ACCTGCAGAG TACGATCCGT CAAATAATTT TGAATAATTT 1051 TCACGATATA AATCGGAAAA TTAAACCTTT TCAATTTCGC AATCAAACCT 1101 TTATGCCAAA CACTGTCAAA TGCTTTTTCT ATGTCTAGAA GAGCAGCGCC 1151 AGTAGAATAG CCCTCAGATT TGTTGCTTCG AATTAAATTT GAAACTCTCA 1201 ACAACTGATG AGTAGTTGAA TGCCCAAGGC GAAATCCAAA CTGCTCATCA 1251 GCGAAAATTG AATTTTCATT AATGTGCGTC ATCATTCTAT TAAGAATTAT 1301 TCTTTCGAAT AATTTACTAA TAGATGAAAG CAAACTAATG GGCCGATAGC 1351 TTGAGGCTTC AGCAGGATTT TTATCCGGTT TCAAAATCGG AATTACTTTG 1401 GCATTTTTCC AACTACTGGG AAAATATGCC AAATCAAAAC ATTTGTTGAA 1451 AATTTTGACC AAGCWACTTA AAGTTGCTTC TGGTAATTTT TTAATTAAAA 1501 TGTAAAAAAT GCCATCCTCA CCAGGGGCTT TCATATTTTT AAATTTTTTG 1551 ATAATAGATT TTATTTCATT CAGATCCGTA TTAAAAACTT CATCTGATGA 1601 
AAATTCTTGT TCAACAATAT TCTGAAATTC TATTGAAATT TGATTTTCAA 1651 TAGGACTCAA AACATTCAAG TTGAAATTAT GAGCACTCTC AAACTGCTGA 1701 GCAAGTTTTT GAGCTTTTTC CCCATTAGTT AATAGAATAT TATCACCATC 1751 TTTTAAAGAA GGGATGGGTT TTTGAGGTTT CTTAAGAACC TTTGAAAGTT 1801 TCCAAAAAGG TTTGGAATAA GGTTTAATTT GTTCGACATC TCTTGCGAAC 1851 TTTTCATTTC GCAGGAGAGT GAATCTGTGG TCAATAACCT TTTGCAAATC 1901 TTTTTGAATT CGCTTCAGTG CAGGATCACG AGAACGTTGA TACTGTCTTC 1951 GGCGAACATT TTTCAGACGA ATCAGAAGCT GAAGATCGTC ATCAATAATG 2001 GGAGAATCAA ATTTGACTTG GACTTTAGGA ATAGCAATAT TCCTAGCATC 2051 CAAAATTGCA TTAGTTAAAG ATTCCAAGGC TGAATCAATA TCAGCTTTGG 2101 TTTCTAAAAC AAAATCATGA TTTAAATTAT TCTCAATATG ATGCTGATAC 2151 CTGTCCCAAT TAGCTTTGTG GTAATTAAAC ACAGAACTAT TGGGTCTGGT

200 2201 AACTGCTTCA TGAGAAAGTG AAAAAGTTAC TGGAAGATGA TCAGAATCAA 2251 AATCAGCATG AGTCACTAAA GGACCACAAT ACTGACTTTG ATTTGTCAAM 2301 ACCAAATCAA TTGTTGATGG ATTTCTAACA GAAGAAAAGC AAGTTGGCCC 2351 ATTCGGGTAT AAAACCGAAT AAAGACCAGA AGTGCAATCT CTGAACAGAA 2401 TTTTACCATT GGAATTTACT TTTGAATTAT TCCAAGATTG GTGTTTGGCA 2451 TTAAAATCAC CGATGATCAA AAATCGAGAT CTATGCCGAG TAAGTTTATT 2501 CAAATCCCCT TTGAAATAAT TTTTATTTTC CCCAGTGCAT TGGAAWGGCA 2551 AATATGCAGC TGCAATCATA ATTTTCCCAA AAGAAGTTTC AAGTTCAATG 2601 CCCAAACTTT CAATAACTTT TAACTTAAAG TCACGTAACG TGCTATAAGT 2651 CATACTACGG TGGATAACTA TTGCAACTCC ACCGCCATTT CGATTCATTC 2701 GGTTATTAGT TATAACTTTA TAATCTGGAT CACTTTTCAA ATAAGTGCCA 2751 GTTTTTAAAA ATGTTTCGGT TATAACAGCA ACATGCACGT TATGAACTCG 2801 TAAAAAGTTG AAAAATTCAT TTTCTTTCGC TTTTAAAGAG CGAGCATTAA 2851 AATTCATAAT ATTGATGGAA TTACTTAGAT CCATGATTAA ACTTCAGGGT 2901 AAGAACAACA TCATTCGCAA ATTTTAATCC AATCTGGATT GCTTCCATCA 2951 TGGATGTAGC ATTACTCATT GTTTGAATCA AACCAAACAG TGAGTTTTGC 3001 AAAAAAGTCA TTTTTTCAAA CGTAACATCG CCGAGATCAG AAGATCCCAA 3051 AGCGTTGCCA GCAGAAAAAT TTTCAAATGA GATTTGAGGT ACCTGCCCAA 3101 TTTCAGAAAG ATTGGTAGAG GATTTAAAAT TCGTGGATGA ACCCGAAACG 3151 ACGTTAGCAT AAGAAATGCC ATTGTTGTTA CCTAACTTTT CCACGGTAGG 3201 GGTATTTCTA GAATTGTTCG AGTGAGACAG CACGAACGTT TGATTTAAAG 3251 ATGCAGGTAC AACCTGACTT TGAGAAAATT TCGGTTTGGA TTTCGGCTGA 3301 TGCTTAGCAC GAGAATCCAA AACCTTTTTT CTGATGGGGC AATCCCAGAA 3351 ATTTGATTTG TGATTTCCAC CACAATTTGC ACATTTAAAT TGGGTGACTT 3401 CTTTCACGGG ACAATTGTCC TTGTCGTGAG AAGAATCCCC GCAAACCATG 3451 CATTTTGGAA CCATGGCGCA ATGATCAGTA CCGTGACCGA ATGCCTGGCA 3501 ACGCCGGCAC TGGGTCAGAT TCTGGCCATT ACCGCCATGT TTCTTAAAAT 3551 GCTCCCACTT TACCCGTACA TGGAACAAAA ACTGAACTTT GTCCAAAAGT 3601 TTCAAATTGT TGATTTCATT TCTGTTGAAA TGAATCAGAT AAAATTGTGA 3651 AGTCAAACCA AAGCGAGAAA TATTCCCGTT TGATTTTTTC TTCATTGGTA 3701 TTACTTGGGA TGGGGCAAAG CCAAGCAACA CCTTAAGTTC GTTTTTGATC 3751 TCATCCACCG ACAAGTCGTT GGAGAGACCT TTCAAGACCG CCTTGAATGG 3801 CCGAGCATTC TTGGTCTCAT ACGTGTAGAA ATTGTGTTTG TGGTTTTTCA 3851 AATAACCAAC 
AAAAGTTTGG TGATCTTGTA AAGATTCCGT CAACAAGCGA 3901 CATTCTCCTC TTCGACCAAG CTGGAACGAA ACYTTCAAAT TGCAAGTTTC 3951 CTTGCAATTC TTCAGTTGCG TTCGAAAGCT GGCCAAATCG GAGACGGAAG 4001 TCACWACAAT TGGCGGAGCC TTTACTCGTT TCTCGACGGC AGAAGGCTCA 4051 GTACGAGGAG AAGGTTCCTT GTCATCAGTT TCGGATAAAA CACCGAAACT 4101 GTTTGTCAAT GGAATTGGAG GATTGACCTC ACATTCAGAA TCAGAATCAG 4151 ACCTCAGAAG AGGCTGTTTT CTTTTTCTGT TTCCGTTAGC CGGTTTAGCG 4201 TTCAAACGCT TCACCGACGT AACGACCAAA TCTTCAGAAG ATTTGCGTTT 4251 ACCTTTGTTT TGACGCATTT TGCAAGCAAA GTTCTCTTAA AAAGATGGCT 4301 TCGTTTGTAA AATACAACAA AATTTCAGGT GGGGTTAGTC TTGAAAAGAC 4351 TGTTTAGAAT TTTGGAAAAT AACTCAGGTA GTCTTTAAAA AGACTGTGAT 4401 TTTGTTTTGA AATAACTCTG AGCTTAGGTA GTAAAAAATA CCGCAGCTCT 4451 AGTGTCCGTT CACCACGAAG GTTCGCAAGA CACTGAT //

E.2.1.4 L1

LOCUS Cp_L1_Ele39 3228 bp DEFINITION Cp_L1_Ele39, 3228 bases, D32 checksum. ORIGIN 1 TACAACGTAG CCACCATCAA CACTAACGCA ATATCCAACG AAAACAAGTT 51 AAACGCACTA CGGACACTTG TCCGACTACT CGRCCTTGAC GTAGTGTTGT 101 TGCAAGAGGT CGAGAGCAAC CAATTTTCGA TCCCTGGCTT CAACACCTAC 151 ACAAATGTAA ACGAAACTAA AAGAGGAACA GCAATTGCCG TGAAACAACA 201 CATGCTGGTG AGCAATGTTC AGCGTAGCCT GGACAGTAGA ATACTCACAC 251 TCAAGGTCAA CAACTGCGTT ACGATCTGCA ATGTCTATGC GCCTTCTGGA 301 GTCCAAAGCT ATCAGTCACG AGAGAGTATG TTCAACCAAT CTTTGCCTTT 351 CTATCTCCAA AACGCGGGGG AGTATGTACT TGTTGGGGGT GACTTCAACT 401 GTGTCGTGTC AGCCAGGGAT GCCACGGGTA CAAACAGTCA AAGCATCGCG 451 CTGAGAGTCC TTGTACAGAA CATGAACCTG AAGGACACTT GGCAGATTAT 501 GAATGGAACT CGGACGGAGT TCAGCTTCAT TCGAGCAAAC TCAATGTCTC 551 GTTTGGACAG AATTTACGTG TCGTCAAACA TCTGCTCACA GGTGCGCACA 601 ACGTCCTTCC ATGTGAACTC CTTCTCGGAC CACAAAGTCT ACAAAACAAG 651 AGTCTGTTTA CCAGATCTTG GAAGGGCAGC TGGCAATGGC TACTGGTCTA 701 TGCGAACGCA CACACTCACT GACGAGAACA TAATCGAGTT TGAGCATAAG 751 TGGAACTGGT GGACAAGACA AAGACGGGAC TATAACAGCT GGATGAGCTG 801 GTGGCTTGAG TACGCAAAAC CGCGCATCAA GACTTTCTTC AAGAGAAAAA 851 CCAACGAAGC ATTCCGTGCA TTCAACGCGG AAAATGAGTA CCTGTACGCT 901 CAACTGAGGG AGGCATATGA CTGTTTGTAC CTGAACCCGA ATGCTCTTGC 951 CGATGTAAAC CACATTAAAG GGAGGATGCT GCGACTGCAA CGTGACTTCT 1001 CTTCGAGCTA CCAGCGTCTT AATGATCCTG TCGTTGCCGG GGAGCACATC 1051 TCGTCCTTTC AACTTGGAGC CAGGATGAAA AGGAAGAAAA ATTCGTTCAT 1101 CTCTAAAATC ACCGACGGAG TCCAGACGCA GCCTTTGGAT GCAGCAGAAA 1151 TAGAAGCACA CATTCACCAG TATTTCCAGT CCCTGTACTC TGCTGGAGAC 1201 GTAGCTGATC CTGACGGTGC AACAACCAAC CGGGCTATTC CATCTGACTC 1251 GGTGCCGAAC GCGCAGGTAA TGGAGGAAAT AACAACTGAG CAGCTGTATA 1301 ACATCATCAA AACGAGTGCA TCGCGCAAAG CTCCGGGGAA CGATGGAATA 1351 CCCAAAGAGT TTTACGTGCG GACCTTCCAC GTAATACACA GACAACTAAA 1401 TCTGGTGATC AATGAAGCGC TGAACGGGAA CATCCCCCAG AAGCTGGTTG 1451 AAGGTGTTGT AGTATTGTGC CACAAAAAAG GTGGCAACAA TACTATCAAA 1501 TCCTACCGAC CTCTCACGAT GCTCAATTTT GACTACAAAA TCCTAAGCCG 1551 AATACTCAAA ACCCGAATTG AGGAGATCAT GGTCCGGCAC GACATCCTCA 1601 CACCCTCTCA 
AAAGTGCTCA AACGGCAAAA GGAATATATT TGAAGCTCTT 1651 CTTGCCGTCA AAGATCGAAT TGCCCAGATC AAGCACACAC ACATACAAGG 1701 AACGCTCGTA TCATTTGATC TTGATCACGC ATTCGATCGA GTTGAACACT 1751 CTTATCTGTT TCGGGTTATG GACGATATGG GCTTCAACAG GGCACTTATA 1801 CAGCTGCTGC GCACTATCAT GGACCACTCA CGCTCTCGTG TGCTAGTAAA 1851 CGGGCATTTG TCTCCAGAGT TCGAGATACG GCGCTCGGTT CGGCAAGGGG 1901 ATCCGATGAG CATGCATCTC TTCGTTCTCC ATCTGCACCC GCTGCTGGAG 1951 AAAATACGCA CACTCTGCAA CGACCAGCTA GACCTCTCCA CCGCATATGC 2001 CGACGATATA TCTGTTATCG TGGTTGATAA CACGAAGTTA CCAACACTCA 2051 AACAACTCTT CTTCGACTTT GGACGGTATT CTGGGGCCGT CCTCAACCTC 2101 GAGAAAACAG TTGCAATGAA CATAGGAAGA AGCAGCGAAA ACCTACCCTG 2151 GCCGTCGATG GAAACGCGTG TGAAGATCTT GGGAATCAAT TTCTTCAATG

2201 ACCACAAGCA GATGATACAG TTCAACTGGG ACGAAGTGAT CCGAAAAACT 2251 ACGCAGCTAA TGTGGATGTA TAAAGCGCGA AACCTTACGT TGATCCAAAA 2301 GGTTACCGTG CTGAACATGT TTGTGACCTC AAAACTGTGG TTCGTGGCAT 2351 CTGTGTTGAG CATACGCAAT CAAGACATAG CAAGAATCAC AAGACAACTT 2401 GGGTTCTTCC TATGGGGTCG CCAGCTGAGA GTTCCAATGG AGCAAATTTG 2451 TCAACCTATT GCAAAGGGAG GGCTGAATCT GCATCTTCCC ATGCACAAAT 2501 GCAGAGCACT ACTGGTCAAC CGATTCCTGT GTACGATTGC CGAAACTCCC 2551 TTCGCCGAGC ACCTGTACGG CCTGGTTAAC AATGGAGGAT CCCTACCAGC 2601 AACATACCCT TGTCTACGGC CGACGTGGAC CACTATTCGA GAACTTCCCC 2651 AGCAGCTACG AGACAACCCG TGTTCGAGCA GCATCGAAAG TCATCTTCTG 2701 CAAGCTTTGC CAACCCCGAA GGTAGTGGTG AACAACCCCA GAGCATCGTG 2751 GAGAAGCGTG TGGCGAAACG TACGAGCAAG GAGTCTCACG TCGTTGGAAA 2801 AATCCACGTT CTATCTGCTT GTGAATGGCA AGCTGCCTCA CGCGGCCCTG 2851 TTGTTCCGGC AACATCGGAT CAGTAGTGCT TTTTGCATTC ATTGTCCGAA 2901 CGAAACAGAA GATCTAGAAC ATAAGCTAAG CAAGTGCCGT AAAATAAGCC 2951 ATTTGTGGAA CCACCTTAAA CCAAAATTAG AATCCATTTT GGACCGAAGA 3001 GTAGAGTTCA AAAACTTTCA AATCCCTGAA TTCAGGGCAA TAAGAATGGC 3051 AAATGTAGAG AGATGTCTAA AATTGTTTAT CAACTACGTA AACTTTATTT 3101 TAGATACAAA AAATGATTTT ACGACTCAAG CACTTGATTT TTTACTAAAT 3151 TGTAACTGCC CATAATATGT ATCTCTGCAA ACTGTAACAA AACTAAATAA 3201 ACGTGTTAAA AAAAAAAAAA AAAAAAAA //

E.2.1.5 L2

LOCUS Cp_L2_Ele4 2824 bp DEFINITION Cp_L2_Ele4, 2824 bases, 358 checksum. ORIGIN 1 TCTAAAAACT AACTATTATT AAAATGCTCT AAACATGCTT TCTTAAATCC 51 TATACTCGAA TTTAACAATT GTAGATTAGG CGGCAATTGA TTCCATTCTG 101 CAATACCTCT AACAAAAAAA GAATTGGCAT AGTGAGATGA TCTAAATCTT 151 GGCAAGATGA ATTTTCTTCC TCTTCGGCTT CTCATAGGAG TTATTTTTGA 201 GGTGATGTAT GGAGGGGACT GATTGACAAT AAATTTATGT AAGAGTAGAA 251 CTGATCTAAG CACACCAAAA CGAGAAAACG GACAACCAAT TAGGATATTC 301 TGTCTGTGAG TCACGCTTGC AAAACGATTT AAACCAAACA CAAAACGTAC 351 ACAACAGTTT AAAGCAACTT TTAATTTATT ATAAGCAAAA GTGGACGATT 401 GTTTCAACAC AAAGTCACAG GTTATAAAGT GCGGTAAAAT TAAAGCTTTA 451 AATAATTTTA GTTTAATCTT AAAGTCGAGA TGATTTGCTT TGAGACGAAG 501 GGATCTTAAA GCAGCATAAA CTTTTCCACA TTGCATCAAA ACAAATTCGT 551 CCCACTCGAA TTTACTGGTA ATTTTCACTC CTAAGCTGAT AACTGTATCG 601 TTATAATGAA CTAATTGATT ATTTAAAAAT ATGTCGGGTT TAAACACAGG 651 TCTTTTTGAT CTTGAAATAC AAATAGCTTG CGTTTTTGAT GCATTAAGTT 701 TTAGATTGTT TGAAATAGCC CAATTGGCAA CAATGGTCAT ATTTGTATTT 751 ATCATGTTAG ATATTTCAAG TTGTGGTTTA TTGGTACAGT CAAAATATAT 801 CTGTACGTCA TCTGCGAACA AGTGAACACC ACATCCTTGT AGGTGTGACG 851 GTAGATCATT AATGTATAAT GAGAAAAGTA ATGGACCAAG AACGGATCCT 901 TGTGGCACTC CTGAAAGAAC CGGAAGCAAC TGAGAAAATA TTCCGTCGTT 951 AAAAACAGTT TGAAATCTAT TTGACAAATA AGATCGTACT AAACGTACTG 1001 CATCAGGACA AAAACCAAAA ATCATTTTTA ATTTATCACA CAAAGTTTTA 1051 TGACAAATAG TATCRAATGC TTTAGAATAG TCTAAAAGAA TAAGCACAAC 1101 GTATCCACCT GAGTCTATCA CTAAACCAAT ATCATCACAT ATCTTTAGCA 1151 TTGCTGTTTT AGTACTATGT CTTGGGCGAT ATCCCGACTG GCAAGGTGTT 1201 AATAAATCAA AACGAGAAAC AAAATCCGTA ATTTGTTTTT TAATAACTCT 1251 TTCAAAAGCT TTAGAAAGCG CAGATAAAAT GCTAATTGGA CGTAAATTGT 1301 CCAAACTAAT TTTAGGACCT TTCTTTTTAA TAGGTACAAC TTTCGATATT 1351 TTCCAAACTT TGGGGTAAAT ACAAGTTTTA ATTATTTTAT TAAAAATATG 1401 TCGAATGGGT TCAACTATGA GGGGAATTAT AGCTTTGCAA AATTTAATTG 1451 GAATTTCGTC CAGACCAACT GCATTAGATT TTATTTCATA TATTGCATTA 1501 ACAACATGAT ATGATTCAAT AGGTTTAAAA GCAAAAGCCT CAGGTACAGG 1551 TGCGATAGCA TTCAATGTTG TGGATGAACT GATCCCAGTT GAAAAATTAC 1601 TTGCAAAGAA 
ATGATTGATT TGATCTGCAT TTAGGTTGTT TTTTTCGTTT 1651 TTATTTTTTT TCTTAAATCC AATAGCATTG AGTTTATTCC ATAATTCTTT 1701 CGAGTTCAAA TTTTGAGCAA ACATTGTGCT GAAATAATTT TTCTTGGCAT 1751 TGTTGATCAT ATTTGTTGCC TTATTTCTAA GCCTTTTATA AAGCGTGAGG 1801 TCATTATCAT GTTTAAATTG TTTCCATTTA GAAAAAGCTA AATCTCTATC 1851 AACAATTACT TTAGACAAGT TTGCATTAAA CCAAGGTTTA TTATGGGGTT 1901 TTTGGATAAA ATTCCGTAAT GGAACGCAGT TATCATGCAG ATTTTTAATG 1951 TTTGTATTGA AAAAATCTAC TAAGATATCT GGATTGTCAA GGCTAAAAAA 2001 ATAATTCCAA TCGATCTGAT CAAACATTGT TAAAAGCAGA GACGAATTCA 2051 TGTGACCATA ATCACGAAAA CAAATATTTT CATTAACATT CGATACATCA 2101 ACATCAACAG ACGCAAAAAT TAAATCATGA TTGGACATAA ACGGTACAGA 2151 CACTTGGTTA AAACGCAAAA TAAATTCTTT TTTGTTAGTT AATAGAAGAT

2201 CTATCAAAGA TGCTCCTGTT CTGTGAAAAT GTGTAGGAAT TTCTCCAACA 2251 GGTTTCAGAG ACAAACTTTG TAAACACTTT AAAAAACGTC TTGTGTTAGA 2301 AGAGTTCAGG TCAAGAAGAT TGGTATTGAA ATCGCCCATG ATAATCATGT 2351 GTTCATGCTG TAAATTACTA CGGGAAAGCA AGTCGATCAT CAACTGAGAG 2401 CAATCGGAAT TTGGAGTATT GTAGATAAAT CCCATGAGTA AACTTTCATG 2451 ATTTACAGAA ACTTCGACGA AAATAAATTC AGTTGGTGCA TTGTCAATAT 2501 CATGTGATGA TATGATGACT TTACAATTTA AAAAGTTTTT AACATACAAC 2551 AAGATACCAC CTCCAATTCT TCCTTTACGA TCACATCTTA TAATTTTATA 2601 ACCATCAATT TCAAGTACGT CGTTTGGTAT TTCCATTTTC AACCATGACT 2651 CACAAACACA CATGACGTCG ACTTTGGAAA TGTATGCAAT ATTGCGAAGT 2701 TCATCCAATT TAGTTAATTT TCGAGCACAT ATACTTTGGC TATTCATACA 2751 GCATACAGAA AGTTTGTTGT GAATGAGAGC AGATTTCATC ACAATACCTG 2801 GAATACTCAA ATTGTTAGAG TTAT //

E.2.1.6 LOA

LOCUS Cp_LOA_Ele7 3751 bp DEFINITION Cp_LOA_Ele7, 3751 bases, 9C checksum. ORIGIN 1 TTTTTTTTTT TTTTTTTTTT GTTGAGCCTT ATGACCACTG CGACCAATTA 51 GAGGAATATT GTGGTGTGAC TCTTTTTTTT TGTAGCAAAT CAGGTTAACC 101 CAACACCATG GAACGATGAG AGGCAGCTCC TGTTTACTGC ATGTTGTCCC 151 AGTTAGGGTC GACTTGTTTT ATAAAGTCTA TCACCCTACT AGGACGGAAA 201 TGGTTGATAT TGGAGGGATG TAAGATCCCT TTCCCAAAAA ATAATATTCT 251 AGTTTGAATC AGGGCACCAC AATTACACAA AAGGTGCTCT GATGTTTCAT 301 TCTCAATATT ACAAAATCTA CAATTTGCGT TTGGAATTTT CCTCATATTA 351 AATAAGTGGT ATTTGCATGG GCAGTGACCA GTTAGTAAAC CAGTTATCAC 401 CCTTAAATTT TTCTTATCTA GTTTTAATAG TTCTCGAGAT AGTTTCCCAC 451 CTGGAAAGAT AAATCTTTTT GCTTGTCTGG AAGTATCTGT AGAATTCCAG 501 ATTGAATTGA TTGTTAATTG TTCCCAACGT TTTAGCTCCA TTTTCAGAGA 551 ACAATCTGGA ACACCAAAGA AAGGTTCTGG GCCAATGAAG TCAGTTTCAG 601 ATCCTTTCCT TGCCAGACTG TCCGCGTTTT CGTTTCCTAA AATTCCACAA 651 TGACCAGGAA CCCAGTAAAG ATTGACCGAG TTCTTTTGGG ACAATTGCTT 701 TAGTAATTTT ATTCCTTCCC ATACTAGTCT TGACTGGCAG GTATACCCTT 751 TCAGAGCATT TAGTGCTGCC AAGCTGTCGG AAAAGATACA AATATTAGCT 801 AATCTATAAT TTCTTTTGAG GCAAATGTGC AAACATTCAA TTATTGCATA 851 AACTTCTGCT TGAAAAACTG TTGGAAATTG TCCCATTGGA ACAGATGTCT 901 TAATTCCTGG TCCAAACACT CCAGCTCCCG TTTTATTATT CATTCTAGAA 951 CCATCCGTGT AGAACAGAAT AGAGCCTGGA CGAAGATTAG GCCCACCATT 1001 ATCCCACTCG TAGCGACTTT TAGCGTCCAC TGTGAACGGA ATGTCATAGT 1051 TAGCCTTTGT CGGCATCCAG TCTACTATTG AGTTTAACGC CGGATTCAAG 1101 GTAAAGTTTT TAAGGACTTC TAGGTGCCCT GTCAAATCTT CATCCTGAAT 1151 TCTTGTGGCT CGATTCATTA ATAAAGCACT CTTAAGGGCT TCTAATTCTA 1201 TGAATTTATG AAGAGGAAGT AAATTTAACA AGGAGTTTAG TGCAGCAGAT 1251 GAAGTGCTGT GCTTAGCTCC AGTTATTGAA ATGGTGGCTA ATCTTTGCAG 1301 CTTAGCTAAT TTAATTTGGG CAGTTGATTC TTTAACTTTC GGCCACCACA 1351 CAAGTGACGC ATACGTAATT CTTGTTCTAA CTATGCATGT ATATATCCAA 1401 TAAATCATTT TAGGGCGGAG ACCCCACTTT TTTCCAAAWG TTTTTTTACT 1451 CAGCCATAGA GAATTAATTC CTTTGTTTAA TACATTATTC AGGTGAGCAT 1501 TCCAATTAAG TTTCCTATCT AAAGTAATAC CAAGATATTT CACTTCGTTA 1551 GAATACTCAA TAGTTGTCCC ACGAATAGCA AGATTAGGTA GTACTAGTGA 1601 CCTTCGTCTT 
GTGAACGGTA TTATAACTGT TTTAGAGGGA TTTATACCAA 1651 GACCCTCATT TTTRCACCAT TTTGAGATAA AATTCAAGGC CGATTGCATT 1701 CGGTTACAAA TTTCGCTACC TATTTTTCCC CGAACTATTA TACAGATGTC 1751 ATCCGCATAG CTGATCAGTT CAAAGCCGAG ACTCGCCAAC TTTTTCAGAA 1801 GATCGTCGAC TACTAATGAC CATAGCAGAG GTGATAGAAC TCCACCTTGT 1851 GGACATCCCC TCCGTGGTCT GATTACCAAT CGTTCTCCTC CAAGGTCGGC 1901 AGATATCTCC CTTTCAATAA GCATTGTTTC AATCCATTTG ATAATCCATG 1951 AACCAAAGCC CCTTGATTCC ATGGCGGAAC GCATAGAGCT GTGAGATGCA 2001 TTATCAAAAG CTCCTTCAAT GTCTAGGAAA GCAGCCAGAG CTACTTCCTT 2051 GGCTTTGAGG GACTTCTCAA TTTTAGAAAC TAAACAGTTT AAAGCAGTTA 2101 TAGTAGATTT ACCACTTTGG TATGCAAATT GGCTATTACT WAGAGGAAAT 2151 TTGATCAAGA ACGtTTTTTT TtACAAAATC ATCAAGAATT TTTTCCATAG

206 2201 TTTTCAATAA AATAGAAGAG AGACTAATTG GACGGAAGGA TTTGGGACTT 2251 GTTTTGTCCC TTTTCCCAAC TTTTGGAATA AATATGACCC GTACCTCTCT 2301 CCAAGCTTTT GGGATATAGC CAAGAGCGAT ACTGGCTCGA AAAATGTCTG 2351 TTAATTTAGC ATTGATAATA TCTTTACCCT GCTGGAGAAG AGCAGGAAAG 2401 ATGCCATCCA TCCCTGCGGA TTTAAATGGT TTGAAAGAGC TCACCGCCCA 2451 GTCAACCTTG GAACGAGTAA ATATTTCACA GGCTTGATTA TGGGCAAGTA 2501 CAGACCAATC TTTTGTCAGT CTTTCGCCCA CGGCCTGATC TATACAACTT 2551 ATTACCTTGG TCGAGCCAGG AAAGTGAGAT TCCATCATTA AATCCAGAGT 2601 TTCAGATGCA TTTTTTGTAA AACACCCATC TGCTTTTTTA AGATTTCCTA 2651 GACCGTTTGA ATGATCTTTG GAAAGAGCTT TTTGTAATCT TGCGACTATT 2701 GGAGTTTCCT CGATTTTTTC ACAAGAACGT CTCCAGTCGT TGCGTTTAGA 2751 ACGTTTTATT TCTTTATTGT ATTCAGTCAG GGCTTTTTTA TATGAGTCCC 2801 AATCTCCAGA TTTTTTTGCT CTGTTGAACA ACCTTCTAGA TTTTTTCCTT 2851 AACGCAGCTA ATTTTTCATT CCACCAAGAT ACCTCACGAT TTGAAGATCT 2901 TTGCTGGACT GGGCAACTAG CGGAGTAAGA ATTGGTWATA GCAATAATTA 2951 ATTCAGACGA AGCCTTCTCC AACTGATCAA CAGTACGGAT TTCGGCTTCG 3001 GAATTGAAAT CGTAGTCATT TAGAAGAAGT TTATAGGAGT CCCARTTGGT 3051 TTTCTTGGGA TTYCTATAAC TTCCAACTAT CATTTCACCT GCATTGTATT 3101 CGAATTCAAT TTGTTTATGG TCTGAGAGTG ATATTTCATC AGAAACCTTC 3151 CAATTTATAA CTCTTTCAGA AATAACTTCG CTACAAAACG TGAGGTCAAG 3201 AACCTCCTGT CTTGTTGCTG TAACAAAAGT GGGATTATCA CCATTGTTAC 3251 AAATGTCAAT ATTGTTAGTT GTTATGTAAT TGAGCAAATT CTCACCTCTA 3301 TTGTTAATGT TAGTGCTACC CCATACTGTG TGATGAGCAT TRGCGTCGCA 3351 CCCAACAATA AATTGTTTAT TTATTTTTTT GCAGTAAGAG ATAAACATTG 3401 TTACATCAGM TGGAGGGGCC TCCTCGGTGT CTCCGGGGAA GTACGCAGAT 3451 GCAATACACA CCCAGGTACT TCCTCTGACC GTAGGGACCT CCATTTGAAT 3501 TGCAACAATG TCTCTTGAAA TAAATTCGGT TATTGGAATA AAATTTGTAT 3551 CATTTTTTAC WATCATAGCT GCTCTAGGGT TGGGAGAACT ATTATCRTAT 3601 AATAGTTTAC CCATTTGAGC TGAAATTCCT TTAATTTTAC CCTTATTTAT 3651 CCAGGGTTCT TGAATGAGCC CTACGTCAAT ATTATTTTTA CTAAACCTTC 3701 TGCTAAGAAC GGCTGACGCT CCTTTGGCAT GATGAAGATT AATTTGAATA 3751 A //

E.2.1.7 Loner

LOCUS Cp_Loner_Ele1 5618 bp DEFINITION Cp_Loner_Ele1, 5618 bases, 78D checksum. ORIGIN 1 CaGTCGGTAT CTGATCGAGG TACAAGCAAG ACGTGTTTGT TTTCGCTAGA 51 GCACCGAAGC TTCTAATCGT TCAACGGATC AAGGGATCCA CCGCCGACTT 101 CTCGAAGGTT GACCTTACGA GTGCGTACGT GACACGCGTG ACAGTTCGTG 151 AACGTGAATA GTGCGTGTAA GCCAATTATC GCCGGTCAAG TTCACCATCA 201 GCAGCCGAAC CACAAAGgCA aCcctccAaa cCGCGAAGCG TGTTTTGGAG 251 TTTTCGAAAA CGTTTACACA AGGCATTGTC TAACCGCGAG AGCTAGACAA 301 AGGAGCACCA CCGAATACAA TAGCAAACGG ACGGTTTGTT TCATCCTCAT 351 CAACTCGGCG TGGCAAAATC TAGAGTTAAG TAGAAAATCA TCTCCCCCTC 401 CCCCCATGTC TGGGGGCGAA GGCGGCATGG AAAGCGATGG CGAAGATGAC 451 GAGGCTTCCA ATGTGCGCAC AAAGATCTAC CCTAATGGCT CAACAGGGCC 501 GTTTATTGTC TTCTTTCGGC CCAAATTGAA ACCCCTAAAC CTGATCAGCA 551 TCACCCGAGA TCTAACGAGA AAGTTTTCTG GCGTATCCGA AATAAAACGT 601 GTCCATGCTA ACAAGATCCG CGTAGTTGTG AATAACATCT CTCACGCCAA 651 TGAAATTGTC ACCTGCGAGC TTTTCACGCT CGAATATCGA GTTTACATCC 701 CTTCACGCAT CGTGGAGTGT GATGGTGTTG TCACAGAAGA GGGCTTAACC 751 CTAGATGAGC TGTATGAGTG CCGTGGTTAC TTCAGAAATC CTGCCGTGAA 801 CCCCGTGAAG ATCATTGAGG TTAAACAACT GTTCTCCTCC TCCACACAGG 851 ATGGCAAAAC GGTTTACTCC CCTTCGAACT CTTTCCGAGT GACCTTTGAA 901 GGATCCGCTC TGCCAAAATA CATTGAGATA GACAAGGCTC GTCTACCTGT 951 TCGACTTTTT GTCCCCAAGG TAATGAACTG CCAAAAATGC AAACAACTCG 1001 GCCATACCAC AGCTTACTGT TGCAACAAGG CTAGATGCAT CAAATGCGGT 1051 GAAGAGCACG ATGATAGTAG CTGTACGCAA GCTGCTACCA AGTGCCTCTA 1101 CTGCGACGAG GACGCCCTTC ACAAACTCTC GGATTGTCCG ACGTATAAGC 1151 AGCGTCAGGA GAAACTTAAG CTTTCTCTGA AGCAACGATC GAATCGTACT 1201 TTTGCGGAAA TGCTCAAACA AGCCACCGAA CCACTCAATT CTGGAAACAT 1251 CTACAATATA CTGCCTTCCG ACGAGACGGT CGCCGACTCG ATCAATGCGG 1301 GCGCGTCAAC GTCCGGGACG GGTAACTCGA GGAAAAGGAA CAATGGATCA 1351 CCAAGCATCC GCCGAAAAGA AATAAAGCTA TCCCCACAAC AAGACAGGAT 1401 CCCTAATTTT CAGCCAACTC CCCCTGGAAT CAACCCCCCC GGTTTCCCCC 1451 CATTGCCAAG GCCCCCACCT CTGACCCCCA AACCAAATCC TAACAAACCT 1501 AAGCAAGGAT TAATCGGTTT CACAGTGTTG ATTAACCAAA TTCTCGATGC 1551 GCTCCAGATT TCCACCGGTG TTCGAACCGT GGTAATCACT CTGATCCCTT 1601 TCGTTCGGAC 
ATTTTTGATC AAATTATCTG AACAATGGCC CCTTATTTCA 1651 ACAATCATAT CCTTCGATGG ATAATTCAAC GTCGAAGATG AATGAAGAAA 1701 TTTCTGTTCT CCAGTGGAAC TGCAGGAGCA TTGTTCCAAA ATTAGATTCT 1751 CTTAAAATAT TAGCTCACGA AACTAAATGT GAAGTATTTG CTCTCTGTGA 1801 GACATGGCTT CCACCCAACG ATGATGGTCT GAATTTTCCC AATTTTAATA 1851 TCATTACCAA AAATAGAGAC GACTCCTACG GAGGGGTTTT GTTAGGCATA 1901 AGACACGGTT TAACATTCCA AAGATTGAAT CTTCCTTCTC AGCCTGGAAT 1951 TGAAGTAGTT GCGATTCAGG TTCAAATTAA GAATAAATGT TTTTCAATAG 2001 CTTCTGTATA TATCCCGCCC AAARCAAGTG TTAATCGTCA ACAGTTAAAA 2051 AACATCGTTG AAATGATGCC TGAGCCAAGA CTTATTCTCG GCGACTTCAA 2101 TTCTCATGGG ACAGGATGGG GTGAATTGTA CGACGACAAT CGAGCAAATC 2151 TTATATATGA CTTATGYGAT GAATTTAATC TAACTATTAA GAACAGTGGT

208 2201 GAAATAACTC GAATTGCTAG ACCTCCTGCA AGGGAAAGTA GATTGGAYTT 2251 GTCAATTTGC TCAAGAACAC TCTCAATAGA TTGCACCTGG AACGTAATTC 2301 AAGATCCCCA TGGTAGCGAT CACCTTCCTA TTTTGATTTC AATTGCGACA 2351 GGAAATCAAC CTKTAGAACC AGTYAGCTAT ACATACGATC TTACGAAAAA 2401 TATAGATTGG AAAAGATATG CTCTCATTAT CACCGAGGCG ATTGAATCAA 2451 TAGATCCTCT TACCCCCCAA GAAGAATACA CCTTCCTTGC AAATCTCATC 2501 CACAGTAGCG CGATCCAAGC TCAAACAAAA CCAATACCAT CWGCTTCTTC 2551 CCGAATGCGA CCTCCATCTT TATGGTGGGA CAAGGAGTGC TCGGAAGTGT 2601 ACTCTGAGAA ATCAAATTCT TTCAAAATTT ACAGACGAAC GGGTCAAATT 2651 GAGTCTTACG AACAGTACCT CCTTTTGGAG ATTAAGTTCC AAAaTTTAGT 2701 AAAATGTAAA AAACGAAAYT ATTgGCGAAC GTTTGTTGAT GGGCTTTCAC 2751 GCGAAACCTC CATGCGTACT CTTTGGACTA CAGCAAGAAG AATGAGAAAC 2801 CGAGCTCCCA AAAACGCTAG TGAAGAGTAT TCTGATCGGT GGTTGCATAA 2851 TTTTGCCAGA AAAGTGTGCC CCGACTCCAC GATTCCCAAA CAGAAAAGGT 2901 ATTCGAATGA TCTTGTATTC CCGGAACTAT CATCCGCGTT CTCGATGATA 2951 GAATTCTCGG TCGCTCTCCT TTCATGCAAT AACACTGCCT CTGGAATGGA 3001 TGGAATTAAA TTTAATCTCC TGAAAAATTT GCCTTCCGWT GCAAAATGTC 3051 GACTATTAAA CTTATTCAAT ATTTTCCTTG AACAAAACAT CGTCCCAGAA 3101 GTCTGGAGAC AAGTCAGAGT TATAGCTATT CAAAAACCGG GTAAGCCGGC 3151 CACCGATCAC CATTCATATA GGCCCATTTG TATGCTATCG TGCGTGCGAA 3201 AGTTATTGGA AAAAATGATA CTTTTCAGAT TGGATAAATG GATGGAATCA 3251 AACGGATTAT TATCAGATAC TCAGTTTGGA TTTCGTAGGG GCAAGGGAAC 3301 GCARGATTGT TTAGCGCTGC TTTCAACCGA AATTCAACTA GCTTTCGCTA 3351 AAAAAGMACA AATGGCTTCA ATTTTCTTAG ATGTAAAGGG AGCATTTGAT 3401 TCAGTGTGCA TCGAGGTGCT AGCAGATAAA CTCCACAAAA GTGGACTCCC 3451 ACCTTTATTG AACAATTTTT TGTATAACTT ACTCTCGGAA AAACACATGA 3501 ATTTCATTCA TGGTAACGTG ACAATCACAA GATCTAGCTT TATGGGCCTT 3551 CCTCAAGGAT CATGTYTAAG CCCTCTCTTG TACAATTTCT ATGTAAATGC 3601 AATTGACTCT TGCCTCGATA ACGGGTGCAC AATAAGACAA TTGGCAGATG 3651 ATTGCGTTGT ATCAGTTACT GGTCAGTCGG CCAACCATCT TTCTGAACCT 3701 CTGCAGAACA CTTTAAACAA TTTATCTCGC TGGGCTATGG AATTAGGAAT 3751 CGAGTTCTCA ACTGAGAAAA CGGAAATGGT CGTCTTCTCC AGAAAGCACA 3801 ACCCCCCCTC ACTGAAGCTG TACCTACTGG GAAAACTTAT AATACAGTCC 3851 CTGGTTTTCA 
AATATCTCGG TATTTGGTTT GACTCGAAAG GTACTTGGGC 3901 TTGTCAAATA AGATACCTGA AACAGAAATG CCAACAGAGA ATAAACTTCC 3951 TCCGAACAAT CACGGGTACG TGGTGGGGCG CACATCCCAC GGACCTCATT 4001 AGGCTATACC AAACGACGAT ACGTTCAGTA TTGGAATATG GATGTTTTTG 4051 CTTTCAATCC GCCGCGAAAA TCCACATGAT CAAACTTGAA AGAATACAGT 4101 ATCGTTGTCT GCGCATTGCC TTAGGATGCA TGCACTCAAC TCATACGCTG 4151 AGCCTAGAGG TACTTGCAGG CGTTCTTCCG CTGAAAACCA GATTGTATCA 4201 GCTCGCTCAC AGAACGTTGA TTCGTTGTGA GATTAGGAAT CCATTAGTGA 4251 TCCAGAACTT CGATCTTCTT CTCGACAAAA ATCCTCAGAC TAGGTTTATG 4301 ACTATCTATC ACAACCACAT AACCAAGGAA ATCTCACCTT CAAACTTTAC 4351 TCCCAACCGC AGCACAATAA GCAGCACGCA TAACCCATCA GTTTTATTTG 4401 ATTTATCTAT GCAACAAGAA ATCAAGATGA TACCAGCAAG TCAACGTTCG 4451 CAATTAGTAC CGCATATTTT TTTGTCTAAA TATAACCATA TTAAGGCGGA 4501 AAACATGTTC TACACAGACG GATCGCTAAT CGAAGGGTCC ACAGGCTTCG 4551 GGGTATTTAA TACGAAAGTA AGTGCCTTCC ACAAACTCCA AAATCCTGCT 4601 ACAGTATACG TAGCAGAACA AGCTGCAATT CATTATGCAC TAGGGATCAT 4651 TAACCTGCAG CCACAAGATC ACTACTACAT ATTTTCTGAC AGCCTTAGTA

4701 CAATTGAGGC TCTCCGGTCG TTGAAATCAC CCAATTCCTC GTCGTTCTTT 4751 TTTCATAAAA TTAAAGAAAT CATGAGTTTA CTGGTAGAGA AAAAATACAA 4801 AATTACTCTT GTTTGGATCC CTTCTCATTG TTCTGTATTA GGAAATGAGA 4851 AAGCGGACTC GTTGGCAAAG CAAGGTGCCT TGGAAGGATC CACTTACGAT 4901 CGTATTATCA CTTATGACGA ATATTTTACA ATCCCTCGTC AAGAATCTCT 4951 TGTAAGCTGG CAAACCAAAT GGGACAAAAG CGAAATGGGT CGATGGCTTT 5001 ACTCTATCAG GCCAAAAGTT TCTACAACTT CGTGGTTCAA ACACATGAAT 5051 GTTGAAAGGG ATTTCATACG CGTAATATCA AGATTAATGT CAAACCACTA 5101 CCTACTCAAC GCTCACTTAT ATCGGATTAA CTTAAAAGAT GACAATCTCT 5151 GCGGTTGTGG AGAGGGTTAT CACGATATCG AACATATTGT TTGGAACTGT 5201 CCAGAGAACC TTCACGCTAG ATCTCAACTC TTAGACTCCC TTAGGGCCCA 5251 AGGAAGACAA TCAGACTTCC CTGTTCGTGA CATTTTGGCA AGTCAAGATG 5301 TGCCATATCT TCTCTGCTTG TACCGCTTTC TAAAGTCAAT TAAAGTGCAC 5351 CTGTAACAGC ATCAATCTCG CAAGCATCGC CACCCTGCAA CCTAGCAATA 5401 GTAACATCTG ATAAAAACTA GAACCTTAGC CCGCACAGAA GCAAAAGTCC 5451 GTCCTTAAAC ATAATGTATT ATTAACCTCG AAACAGCCGC GAGTATTCGG 5501 CTTTCCCCCT TTACTAACCC TAGCTTTAAG TAATTATGTA AAAATGATAT 5551 CCGGCTCCGT AAAACTTTGG TAGATGAGCC TAAATAAATA AAGACAGTTA 5601 TAAAAAAAAA AAAAtAAT //

E.2.1.8 Outcast

LOCUS Cp_Outcast_Ele3 2336 bp DEFINITION Cp_Outcast_Ele3, 2336 bases, 2063 checksum. ORIGIN 1 AACTACATCC AGCTACAAAA AGCCAGAGCA CTGTTTAAAA AAGCTGTCAG 51 GAAGGCGAAA CGGGAACACG TAGCTGAGCT GACGGGAAAG ATTGACGAGT 101 CGACACCTCC MAAACAGCTA TGGAACATCG TCAAGGGAAT AGAYACGGCG 151 TTGGCTGRGG GCAGTAAAAA GAGGGCGATC CTGGAGCGCT CCAAAGGAGA 201 GGAGTTTATG GAGCACTACT TCAGTGGAAG ATGTGGTACA GTGCAGTTGC 251 CGAACTACGA GACAGCCCGG GACTTGGAAG GTTTCGAAAT GGCGCTYAAG 301 GACGGCGAAG TGCTTAAMGC ACTGAAAAGA ACAAAAAATC ACTCGGCGCC 351 GGGRGAAAAT CAAGTCTCGT ACGACATTRT TAARCAATTG CCGCTAGGTC 401 TGCAGCTCAA GTTCGCAGAA ATGCTGAGCA GAGTATTCGC GACTGAAAAC 451 ATTCCTGAAA GGTGGCGCAT CACTGAAGTA CGACCGATTC CAAAGAAAGG 501 AGYGAACCCC AACCTACCRA ACTCGTGGAG ACCCATTGCG CTCATGAATA 551 TCGAGATAAA GCTGATCAAC AGTGTAGTGA AAGACCGACT GGCGGCGATC 601 GCGGAGCTAA ATGGTYTGAT CCCGGATYTG TCTTTTGGTT TCCGGAAGAA 651 TGTRTCATCG GTAACCTGCG TGAACTATGT CGTGAATGCT GTACGAGAGG 701 CGAAGGAGTA CAACAAYGAA GTCATCGTAG CATTTCTCGA CGTGAAGATG 751 GCGTATGACA CCGTGAACAC GACTAAGCTG CTTCAGATCT TGGCARGGCT 801 GGGTATCCCG GAAAAACTGA CATCGTGGCT CTACGAGTAT CTCAGATGTC 851 GCGTGTTACG ACTACAAACG GAGGACGGAG TCGTAGAACA AGTRATCTCY 901 GAGGGCCTAT CACAAGGTTG TCCGGCAGCA CCGACACTTT ACAACTTCTA 951 CACGGCTGGG TTACACGATC TCTCAAACGA AACGTGCAAG TTRGTGCAGT 1001 TTGCTGATGA TTTCGCCGTY ATCGCAACAG GTGCCTCCCT CGAGCTGGCG 1051 GAACAACGGT TGAACGGTTT CCTCGATGTT TTGGCAGGSC GGCTRAAAGA 1101 GCTGGACATG GAAGTAAGCC CATCCAAGTG CGCTGCGATC GCCTTCACCG 1151 GAAAAAGGAT CGACCATCTC AGAGTCAAGA TGCAGGGGCA AGYGGTTCAG 1201 ATCGTCAACA CCCACAAGTA TCTGGGRTAC ACCTTGGACC GGRCCTTGAA 1251 ACACAGAAAA CACATCGAAA CCGTGACCGC CAAAGCCGGA GAGAAACTTG 1301 GWCTGCTCAA GYTACTATCG AGGAAAACAA GTGGTGCGAA TCCGGCAACC 1351 TTGGTCAAGG TGGGAAATGC GATTGTTCGG AGCCGGATGG AGTACGGAGC 1401 CACGATCTAC GGGAATGCCG CCAAATCAAA TCTGGGAAAG CTGCAGGTGT 1451 TACAAAATTC GTACATCAGA ATCGCCATGG GATATGTACG AAGCACACCC 1501 ATCCACGTGA TGTTGGCCGA AGCTGRCCAA ATCCCGACAA GTCTTAGAAY 1551 AGAGGCTCTT ACCAAGAGRG AACTGATCCG AAGTACGTAC TTCAGGACGC 1601 
CGTTGCTRCG CTTTATAAGC GACACGCTAT CGAGGGAGAT TCCAAACGGA 1651 TCGTACCTGA CGGAAATGGC GGACAAGCAT GCGGATATCC TGTACCAACT 1701 GCACCCYTCA GACAAGGATG TCGCACAGGA GGCAAGAATG AGCTACTTCA 1751 GCAACTTTGA CCTGGAAGAC TAYGTACAGC ACACRCTAGG AAARGAAACA 1801 CKGAAAAAGG AGAACAYAAA CGAAGCAGTT TGGAGGCAGA ATTTTCATGA 1851 AGTGGCCAAT GGAAAGTACA AAGACCACAA GCAGATATWC ACAGATGCTT 1901 CGAAGACGCC CGGAGGGACA GCGCTGGCGG TCTACGACTC GAGCGAGGAG 1951 GCGACCTACA CGGAGAGCAT TAACGACAAC TACTCAATCA TGAACGCAGA 2001 GCTGCGGGCT ATTTGCATCG CAGTTGAGCA TGTGAAGCAA AAACAGTACG 2051 AAAAGGCGGT CATCTACACG GATTCCAAGG CGGCTTGTCA GAGCYTGCTA 2101 AACCWAAATG CACTGCGAGA GAACTTTATC GTTTGGAACA TTTACAAGGA 2151 GATTCAAARC ATGCGGAGAG GCTCGCTGAG AATCCAATGG ATCCCCAGYC

2201 ACGTCGGAAT ACGAGGAAAT GAAATTGCGG ACCAAGCAGC GAAAGCGAAG 2251 TCATACGAGA AGCAGACGGA GTTCATTGGA ATTACACTTG GAGATGCTAG 2301 AGTACTTTGC CATGAAGAAA TCTGGTACAA TTGGaG //

E.2.1.9 R1

LOCUS Cp_R1_Ele1 5425 bp DEFINITION Cp_R1_Ele1, 5425 bases, 116E checksum. ORIGIN 1 CGTGGGGACA GGTGAGGGTC CCTGCGGAGC TTAGCTGCCA GCTACCGGGC 51 GGGTTGCAGT AGGCGGATAG CTGTCGGCGA TTGCATACAT TCATTGCATC 101 GCCCCCGGAC CAGCAGCGGG AGGATGTCTA GGACGTGGCG GAATTGAACA 151 AGGGCTCTGT TTAATTTCTT CGAAWAAAAA AACCACATGG GTCCGTAAAC 201 ACCTCTGTCA AGCGATCGAA CGCCGCTATA AGTGTTTTAG CCCAAAACCT 251 CACCAAACCC CGAATCCAAG ATGTGAYGCG ACCCGTGTCG AGGGATGCAT 301 GGCTGGGGGG GTTCAACAAA TTCCCAGTCG ATAACGGAGC CTGTGGGGCA 351 CAGGGGCGAA CCCCACACGT AATTTGCCCT TACTGCGTCA CGGCAGGGCA 401 CTGGCGCAGC GGACCGTATT TCCCTAGCGA CTCGTGGGAT TCAAAATGAA 451 GACAAGTGAA ACAAACCAAC AAAAGGGTCC CGATGCGCCC CAAGCATCGG 501 AAAACACGGA GGTAGAAGAG GACGCAAACG TCGAAATGGC AGGAGGCGAG 551 TCAAACGGCG GCGACACAGC AGGCGGAGTG GCGAGCGCGT TCCGTGGAAG 601 CGGGAAGGTG TTGAGATCCC CAGTGTTGAA CCAGGCGGCT GCTTCGAGTC 651 AGCAGATAGG AGTAATTGGA GAGGAGACTC CCAAGTCATC CTTGTTGAAC 701 TTCGCCGGCA GTACCCCTCA GGACGGAGTC CTGCTCGGAa GGACCGcGTT 751 GCAGGAGGTC AGAAGGAGGG TCAACGAACT CTTTGATTTC ATCAAGGACA 801 AAAACAACGT CCACACCAGA ATCAAGCAGA TGGTGAATGG AGTCAAGGCA 851 GCCATGAATG CCGCAGAGCG CGAAAACAGC TCGCTGGTGG TGACGCGGAA 901 TTCACTGAAG CTCAGAGCTG AAAGAGCCGA AGAAACGCTG AAGGCAAAAC 951 TGGAGGAGGA AGCGCTACGG GAGAAAGAAC CGAAAACGCC GCCCGGCCCA 1001 AGCTCTAAAA GGGACAGGGA AACGCCTGGA GAGGAGGAGG ACGCAAAGAA 1051 GCAGAAGCAG GGGAATGGAG ACAGTCCGGA CCCAGCGAAG GAGCCAGAAC 1101 CAGACCCAGG GAAGGAGAAG GAATGGGAGA AGGTCAAGAA AAAGAAGCGG 1151 AAGAAAAAAG GGAAGCAGAA CGAGGACACC CAAAAACCCA AGTTTCGCAG 1201 GGAGCGTAAC AAAGGCGAGG CTTTGGTGGT CGAGGTGAAG GAAGGTGTTT 1251 CGTACGCAGA CCTCCTCCGG AAAGTACGAA CCGATCCGGA ACTCAAGGAG 1301 CTTGGCGAGA ACGTGGTTAA AACCAGGCGC ACTCAAACCG GAGCGATGCT 1351 TTTTGAGCTG AAGAATGATC CCGCGGTCAA GAGCTCAGCT TTTAAGTCCC 1401 TCGTCGAGAA AGCCGTAGGC TACGAGTCGA AGGTAAGGGC GCTATCGCCG 1451 GAGACAACGA TTGAGTGCAG GAACCTGGAC GAGATCACGA CGGAGGAAGA 1501 GCTAGAAGAT GCGCTGATCG TTCTTCTGGA TGACCGTACG ACACCGATGG 1551 CAATCCGGTT GAGGAAAGCC TACGGCGGCA CGCAAATTGC GTCGATCCGA 1601 CTATCGACGC 
CTTCGGCGTC TAAGCTGCTG GAAGCCGGCA AGGTCAAAGT 1651 AGGGTGGTCG GTGTGCCCAC TGAGGCCTGT TCCTCGAGTG ACCCAGCAGA 1701 TGACGAGGTG TTTCCGCTGT ATGGGTTTCG GCCACCAGGC GAGAAATTGC 1751 GACGGTCCCG ATCGAACCAA CAGTTGCAGA AGGTGTGGTA GAGAAGGCCA 1801 CATGGCAAGA GACTGCAAAA ATAAGCCGAA GTGCGTGCTC TGTAAAGAAG 1851 GCGATGGCAA TAGCCATGCG ACGGGTGGCT TTAATTGCCC GGTGTACACA 1901 GAAGCTGGCC TCGGGCAAAA AGTAATGGAG GTGTCCCAGG TGAACCTCAA 1951 TCACTGCGAC ACTGCACAGC AACTGCTGTG GCAGTCGACC GCGGAGACGG 2001 GGTGTGACGT GGCAATTATT GCAGAACCGT ACCGAGTTCC ACACGACAAC 2051 GGAAACTGGG CCGCGGATAC AGCAAGAATG GCGGCGATAC ACGTGATGGG 2101 GCGGTACCCC ATACAGGAAG TGGTCTCGAG GGCGTTTGAA GGATTCGTGA 2151 TCGCCAAAGT AAACGGAACC TTCTTCTGTA GCTGCTATGC TCCCCCAAGA

213 2201 TGGACCTTGG AGCAGTTTCA GCAGATGCTG GATAGTCTGA CCGACGAACT 2251 GATCGGACGA AGCCCGATCG TTATCGGAGG TGACTTCAAC GCGTGGGCGG 2301 TCGAGTGGGG TAGCAGATGC ACCAATGCTA GGGGGCATAG CCTAATGGAA 2351 GCTCTGGCAA AGCTAGACGT TAGGCTGGCG AATCGCGGAA CCAGCAGTAC 2401 CTTCCGCAAA GACGGTCGTG AGTCCATTAT CGACGTTACG TTCTGTAGCC 2451 CGCGACTGGC GGCCGACATG AACTGGAGGG TGAGTGAGGA CTATACCCAT 2501 AGCGATCACC AAGCGATCCG GTACAGCATC GGGAGACGAG CCCCTGTACC 2551 AGATAGGAGC AGCCGGTCCT ACGGAAGGAA ATGGAAGCTG CAGTACTTCG 2601 ACGAGGGTCT CTTCGTGGAA GCGCTCCATT GGTGTGATGG TCCCCAAGAC 2651 TTGAGTGCCG ACGTGCTAAC AGCACAACTG GTGACAGCAT GCGACACAAC 2701 CATGCCGCGG AGACTGGAGC CAAGGAACTG TCGTCGTCCA GCCTACTGGT 2751 GGAATGAAGA ACTCGGTACC CTTCGGGCAA GTTGCCTCAG CGCCAGAAGA 2801 CGAGTCCAGA GAGCAAGATC CGAAGCAACT AGAGAGGAGT GCAGAGAGGA 2851 GTACCGGTCT GCAAAGGCCG CGCTCAAGAA AGCGATCAAA TGCAGCAAGA 2901 CAAACTGCTT CAAGGAGTTA TGCCAAGACG CTGATGCAAA CCCTTGGGGG 2951 AGCGCATATC GTGTCGCGAT GGCGAAGATC AGAGGCCCAT CGATGGTGGC 3001 TGAAACGTGT CCCGACAAGC TGAAGGTCAT TGTGGAAGGG CTCTTCCCAA 3051 GACATGACCC AACGACATGG CCTCCTACAC CGTACAACGA CGAAGGGGGT 3101 AGCAACGCCG AAGGTCATCT GATCACCAAC GAAGAACTTG TGGCAGTAGC 3151 GAAGAGATTG AAGGTGAAGA AAGCTCCCGG CCCGGATGGA ATCCCGAATT 3201 TCGCCCTGAA ATCGGCGGTT CAAGCATTCC CGGACAGGTT TCGAACAGTC 3251 CTGCAGAAAT GCCTGGACGA AGGACACTTC CCCGACCCGT GGAAGGTTCA 3301 AAAGCTCGTG TTGCTGCCGA AGCCAGGCAA ACCACCGGGG GACCCATCAT 3351 CGTATAGGCC TATATGTTTG CTGGACACCC TCGGAAAGCT TCTGGAACGG 3401 ATCATCCTTA ACCGGCTGAC CAAGTACACG GAGAGCGAGC ATGGCTTAGC 3451 AGCGAGGCAG TTCGGCTTCC GTAAAGGGAG ATCCACGGTG GACGCCATCC 3501 GGAAAGTGGT CGAGAAAGCC GACGAAGCGC GGAGGAAAAA ACGCAGGGGG 3551 AACCGTTGCT GCGCAATAGT CACGATTGAC GTCAAGAACG CGTTCAACAG 3601 TGCGAGCTGG GCGGCCATAG CAGCAGCGCT GCACAAAATG AAGGTGCCTG 3651 ACTATTTGTG CATGATCTTG AAGAGCTACT TCGAGAACCG CGTGCTGGTC 3701 TACGACACTG CCGATGGACA AAAAACCGTT GTTGTTACCG CGGGAGTTCC 3751 ACAGGGATCC ATTCTGGGTT CAGCACTGTG GAACGGAATG TATGACGGAG 3801 TGTTGACACT GGGGCTACCC AACGGCGTAG AGATTGTTGG CTTTGCAGAC 3851 GACATAGTGC 
TGACGGTAAC CGGCGAAAAT GTCGAGGAGG TCGAAATGCT 3901 GGCTATGGAG GCAATCGCAA TGATCGAGAA CTGGATGCTC GAGGTGAAGC 3951 TGCGGATCGC TCACCACAAG ACGGAGATGG TACTGGTTAG TAACCACAAA 4001 AAGGTGCAGC AGGCCCAGAT ACACGTTGGG GAACACGTAG TGCACTCGAA 4051 GAGAGCGCTC AAGTACCTCG GGGTGATGGT GGATGACCGG CTGAACTTCA 4101 ACAGCCACGT CGATTACGCC TGCGAGAAGG CGGCTAAGGC GATCATGGCA 4151 CTGTCGAGGA TGATGCCGAA CAACGCTGGA CCCAGGAGCA GTAGGCGCCG 4201 CCTCTTGGCA AGTGTCGCGA CGTCCATACT TAGGTACGGC GGACCGGTAT 4251 GGTGGACGGC GCTGGGGACG AAGCGAAATC GAGCGCTGCT CGACAGAACG 4301 CAGAGACTGA TGGCCATGCG GGTTGCAAGC GCGTACAGGA CCATYTCGTC 4351 GGAAGCAGTT GGCGTCATAG CCGGAATGAT CCCCATCGGC ATCACACTGG 4401 AGGAGGACAC CGTGCGCTAC ACCCGRAGAG GCACGAGAGG TATCCGGGAA 4451 GCTGCGAGAG CCGAATCGCT GGCAAGGTGG CAACGTGAGT GGGACACCAC 4501 GGAGAAAGGC AGATGGACGC ATCGGCTTAT CCCGTCCGTA TCCACGTGGG 4551 TGAGCAGAAG GCAYGGAGAG GTCACCTTCC ACCTCACACA GTTCCTGTCG 4601 GGCCATGGCT GCTTCAGGAA GTACCTGCAC AGGTTYGGAC ATGCAGAGTC 4651 TCCTCTCTGT CCGGACTGCG TCGATTGCGA GGAAACACCG GAGCACGTGG

4701 TGTTCGCCTG CCCTCGCTTC GAGGCAGCGC GAAGCGAAAT GCTGGCCATT 4751 ATCGGAGCRG ACACCAGCCC GGATAATGTG GTGCGAAGAA TGTGCAGCGA 4801 CATYGCCAAG TGGAATGCGG TCGTCGGAGC GGTGACGCAG ATCACTTCGG 4851 CTCTCCAGCG GAAATGGAGA GACGATCAGA GGAGGAACGA CTAGGAGCCT 4901 AGTCGAAAAC CCACGAGTGT GGCTGTGAAG GAGAGCACGT TATGATGGTC 4951 GGCTCTACCA AATCGGTACA CGTCTCGATG GTCACAGGAG TCGAGAACCC 5001 ACGAGTGTGG CTGTGAAGGA GAGCACGTTA TGACGGTTGG CTCTACCAAA 5051 TCGGTACACG TCTCGATGGT CCAAGGAGAG GGCTGCATAT GACTAGCCGA 5101 TCAAAAGCAA CGCGATTCTT GGGCGCGGTT AAACCCTCGC ATGGACTCAT 5151 ATGTATGTRG AACAGGAAAT GGTTCTAGYA CCCGGCATGG ATCCTGTAAG 5201 TAGACTAGTG CAGAAAATGC AACGCCTCCC CCCGAAGTTA TACCGAAAGG 5251 TGGTCCCGGG GGGAMAAGGG CACGGCGTTC AAGGACTGGT TTAGTGGGTC 5301 GGGAAAACTC TTTTTGTTTT CCCAACCCCA CACTACCTGA GAAATGAATT 5351 CTCAGGTGTC TGGTAGCAGA TTCCGACCTT GTAAAAAAAA AAACACACAC 5401 ACACACACAC ACACACACAC ACACA //

E.2.1.10 RTE

LOCUS /tmp/readseq.in.19664 4069 bp DEFINITION /tmp/readseq.in.19664 [Unknown form], 4069 bases, 1CD7 checksum. ORIGIN 1 Cp_RTE_Ele MCGTGCACAT CGGTATTGTT TTTCGTACCG TTCCCGCGAA 51 AGTGCAAAAT TTACCACGAA AATACGCGTT AGTGGGGTGT GAAATTCATT 101 GCATCCGTGT TCTGAACCCG TCTCCTGGCC CGGGAAAGTG ATAGTCCGGT 151 GTTTTAACGA CCGCGAGTGA GACAAATCAA TCACCAGATT TGACCGAATA 201 TCGCGGTTCC GCCACGGGCA ATTGAAAAAA AGTACAAAAG AACTGCTAAG 251 GCATCGTCGT CGTCGTCGCG ATCGATGGTC GTTGTATTGT GTCGYGCGGC 301 GTGTGGAAGT GTACCGCGWA GACCATACTG ASGYCGTGTC GTCGTCGTCC 351 TCGTCAGCYG TGDTCTGTTG CTASGAGAGW MGTGYAWAGA AGAGAAKCCG 401 TWGGTGCAAC GGWGGGGTGA AKTGCTAAGG ARAGCKACAG AAAAAAAAGT 451 GCTGCAAAAG TTTTTATTGG TRAAAWTAAA AAGAAGGAAW VAAKCAARCD 501 GAAGMGTCMA CCCTGCTRCR TACACAGCCC CCCCCCCCCC CCTCTCTAGT 551 TCCGTTTYTG GGCGTCTGCA CCCCAKGTTA GGGGCGGCCC AAACGGACTA 601 GGTKGTCGAT TCCATTCTCA CCCGTGAGCG GCTGTTCCCC ACGTTAGGGG 651 CGGCTCATGA AAGGAAGTGA GCCGACACCC CCCCCCCCCT CCCCCCACCC 701 TTCTTGAGCG TCTGTTCCCC AGGTTAGGGG CGGCTCRAAG AAACCGGTGT 751 CCTGCCTCYA TCGTCGAGGT AAGCGTCTGT TCTCCAKGTT AGGGGCGGCT 801 TACAGCAGGA TAGAGTTCGG ASCCCCCACC CCCTCCCCCC CGAGCGTCTG 851 TTCCCCAGGT TAGGGGCGGC TCGAAACAGC GTCTGTACCC CAGGTTAGGG 901 GCGGCTGAGT AAAAGTCCYT GTGTCGGCGT GGGACTKTAA ACAGTACCGG 951 CACGATGGTC CTCCGGCGAG ACAGGGGGTT GGTGCAGGCC ACACGAACCC 1001 GCCGTAAAAC ACCAGTGCAG GAAGCACACG ATGCGAGCCG GACCAATCGG 1051 CACGGAACTG GACATCWTAT GAGGTCCCAC GATTGGAAGC TCGGGACGTG 1101 GAATTGCAGG TCTCTCAAAT TTGACGGGAG TATCCGCATA CTTTCCGACA 1151 TATTGAGGGT CCGCAAGTTC AGCATCGTAG CGCTGCAGGA GGTTKGCTGG 1201 ATAGGCGCGG AAGAGGTACA AGCGTACCCA AGGATTGGGC TGTACAATCT 1251 ACCAGAGCCG CGGCGAAAAC AAGAGGCTGG GGACAGCCTT TATAGTGCTG 1301 GGCGAAATGC GCGATCGCGT GATTGGGTGG ACCCCGCTCA CCGACCGAAT 1351 GTGCGTGCTG AGGATTAAAG GCCGTTTCTT CAACATTAGC ATCATAAACG 1401 TGCACAGCCC GCACTCAGGA AGCGAAGATG ACGACAAGGA CGCATTTTAC 1451 GAGCAGCTGA ACTGGACGTA CAACAGCTGC CCAAAACATG ACGTCAAAAT 1501 CGTCATCGGA GATTTTAACG CTCAGGTTGG CCAGGAGGAG GAATTCAGAC 1551 CGGTGATAGG AAAGTTCAGC GCCCACGTAC 
GCACGAACGA AAACGGCCTG 1601 CGACTGATCG ACTTCGCCAC CTCCAAAAAC ATGGCCGTAC GAAGTACCTG 1651 CTTCCAGCAC AACCTCCGAG ACAAGTACAC CTGGAGATCA CCGCAAGGAA 1701 CGGAATCACA AATCGACCAC GTCGTAATCG ACGGTAGACA CTTTTCCGAC 1751 ATCATCGACG TCAGGACCTA TCGCGGCGCC AACGTCGACT CGGACCACTA 1801 TCTGGTGATG GTGAAAATGC GCCAACGACT TTCCCTGGCG AAAAGCGTTC 1851 GGTACCGCCG CCCTCCGCGG TTGGATCTGG AGCGGCTTAA GTTACCGGAA 1901 GTCGCATCCC GGTACGCGCA TTCGCTGGAG GCTGCGTTGC CAGGGGAGGG 1951 TGAGCTGTTG GAAGCTCCCC TCGAGGACTG CTGGAGGAGC GTCAAGGCAG 2001 CCATCACCAA CGCAGCGGAA AGCACCATCG GATTTGTGGA ACGAGGACGA 2051 CGGAACGATT GGTTCGACGA GGAGTGTCGA GCGATTTTGG AGGAGAAGAA 2101 TGCAGCACGG AGGGCAATGC TGCAGTACAA TCTCCGTGAT TACGAGGAGG 2151 CGTATGGACA GAAGCGAAGG CAGCAGCACC AGCTCTTCCG AGCAAAAGTG

216 2201 CGCCACCAGG AAGAGTTGGA GTTTGAGGAC ATGGAGCAGC TGCATCGCTC 2251 AAACGAAACG CGCAAGTTCT ACAAGAAGCT CAACGGATCC CGMAACGGCT 2301 TCACGCCGCG AGTCGAAATG TGCCGGGATA AAAATGGAGC TATCTTGACG 2351 AACGAGCGTG AGGTGATTGA CAGGTGGAAG CAGCACTTCG ATGAACACCT 2401 GAATRGCGCA GAAGCAGAGG CAGGGGTCCA AGGCGGCAGG AGAGAGGACT 2451 TCATCGGTAC AGCGGGAGAA GGAGAGGAGC CAGTTCCCAC GATGAGGGAA 2501 GTTAAGGATG CCATCAAGAA GCTGAAGAAC AACAAAGCAG CGGGTAAGGA 2551 TGGTATCGGT GCTGAACTCA TCAAGATGGG CCCGGAGAAG CTGGCGTCCT 2601 GTCTGCACCG ACTGATAGTC AGGGTCTGGG AGTCAGAACA GCTACCGGAG 2651 GAGTGGAAAG AGGGAGTAAT ATGCCCGATC TACAAGAAGG GGGACAAGTT 2701 AGATTGTGAG AACTACCGTG CCATCACAAT CCTCAACGCG GCCTACAAAG 2751 TGTTCTCCCA GATCCTCTTC AGCCGCCTAT CGCCAATAGC GGAAGGTTTT 2801 GTTGGAAGTT ATCAAGCCGG ATTCGTCATG GGGAGATCAA CAACCGACCA 2851 AATCTTCACT GTGCGACAAA TCCTCCAAAA GTGTCGCGAG TACCAAGTCC 2901 CCACGCACCA CCTTTTCATC GACTTCAAAG CCGCGTACGA CTCAGTCGAT 2951 CGCGAAGAGC TATGGAAAAT TATGGACGAG AACGGTTTTC CCGGGAAGCT 3001 GATCAGACTG ATCAAGATGA CGATGGATGG GGCTAGGTGT TGTGTGAAGA 3051 TATCGGGTGC GGAATCGGAC TCGTTTACTT CACTTGGGGG GCTTCGGCAA 3101 GGCGATGGGA TCTCTTGYCT YTGTTTCAAT GTCGTGCTAG AAGGTGTTAT 3151 GAGACGAGCG GGCTTCAATA TGCGGGGCAC GATCTTCAGC AAGTCCAACC 3201 ARTTCATCTG CTWCGCCGAC GACATGGACA TTGTTGGCAG AACGTTCAAG 3251 GCGGTTGCKG ATGCGTACAC CGRCTTGAAG CGGGAAGCAG AGAAGGTTGG 3301 GCTAAGGGTG AATGTGGCGA AGACAAAGTA CCTGCTGGCA GGAGGAACCG 3351 AGTCCCTTAG GGCTCGCATT GGACCRAGCG TTACRATCGA CGGGGACGAA 3401 TTCGAGGTRG TGGAGGAGTT TGTATACCTC GGATCGTTGG TAACGTCGGA 3451 CAACAGCTGC AGCAGGGAAA TTCGGAGGCG CATCATCGCT GGAAGTCGTG 3501 CCTATTTCGG TCTYCACAAG AGCCTAAGGT CCCGGAAATT CTCCCTACAT 3551 ACGAAGTGTT CCATCTACAA GTCGCTGATA AGACCGGTCG TCCTCTACGG 3601 GCACGAGACG TGGACAATGC TCGARGAGGA CYTACGAGCG CTAARCGTYT 3651 TCGAACGTCG AGTGCTAAGG ACCATCTTTG GCGGCGTATA TGAGAACGAC 3701 GGATGGCGGC GGAGAATGAA CCACGARCTT GCRCAACTCT ACAACGAACC 3751 AAGCATCCGG AARGTCGCGA AGGCTGGACG GTTGCAGTGG GCGGGTCATG 3801 TTGCAAGGAT GCCGGAACGA GCCGASCAMT TGAGCCAACG GAACCAGAAG 3851 ATCAATCCTG 
CGAAGTTGGT GTTTGTGTCG GAGCCGGTAG GAACAAGACG 3901 TAGGGGGGTG CAACGTGCGA GGTGGGTGGA CCAAGTGGAG ARCGATYTRG 3951 AAAGTGTGGG TGCGCCGCGA AATTGGAGAM AWGCAGCCAT GGACCGAGCT 4001 TGTTGGCGGA GAATCGTGCA GCAGGYCAAG CTAATGGTGT AGCGCCAAYA 4051 AAAGTAAAGT AAAGTAAGT //

E.2.1.11 Unclassified LINE

LOCUS       L1_Contig_59 405 bp
DEFINITION  L1_Contig_59, 405 bases, 1F4B checksum.
ORIGIN
        1 ACAATAGCCT AAGTATTCGC AGAGAAGTTT TAGTATTTGA AGTCTTTGTC
       51 TTCTTCCGGC TCCCTCTAGG GCAGGTCTCA GCAGGTTCTC GAATGAGATA
      101 GAGCGCCTTY CTCGCAAGAT CACAGTGACC TTGTTTTGAA ACAKCAGCCA
      151 CAYGTCTGCC ACSCGTGAAC ATTCACTAAA TTTATGCTGC AGCGTTTCCA
      201 CAGCAGCTCC ACAGTGAAGG CAACTCTMAC TGTTCACCCT CTGTATCGTG
      251 TGAAGCAACT TACGATGTTC TGTTTTTTCG TTCACGAACA TGTACAGTTG
      301 ATTCTGTTGT GCAGAAGTGA GCCCTCTCAA AGCGATGTTT TTTTTCCAAA
      351 TTCTCCGCCA GTTGACCGCT GGGTTAGCTT GCTGAACCTT CGGCTGTTCG
      401 GTTTG
//
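The records in this appendix follow a simple LOCUS / DEFINITION / ORIGIN layout, so the declared length of an entry can be checked against the bases actually listed. The following is a minimal Python sketch of such a check; it is an illustration only and was not part of the annotation pipeline. The function name parse_flat_record and the toy record are hypothetical, and the parser assumes that every sequence block consists solely of IUPAC nucleotide codes, so coordinates, page numbers, and stray identifiers inside the ORIGIN section are skipped.

```python
import re

# One block of sequence: IUPAC nucleotide codes only, upper or lower case.
IUPAC_BLOCK = re.compile(r"[ACGTURYSWKMBDHVN]+", re.IGNORECASE)


def parse_flat_record(text):
    """Return (name, declared_length, sequence) for one flattened record."""
    header, _, body = text.partition("ORIGIN")
    locus = re.search(r"LOCUS\s+(\S+)\s+(\d+)\s+bp", header)
    name = locus.group(1) if locus else None
    declared = int(locus.group(2)) if locus else None
    # Everything after the "//" terminator belongs to the next record.
    body = body.split("//", 1)[0]
    # Keep only tokens made of nucleotide codes; this drops row coordinates.
    blocks = [tok for tok in body.split() if IUPAC_BLOCK.fullmatch(tok)]
    return name, declared, "".join(blocks)


# Hypothetical toy record in the same layout as the entries above.
toy_record = (
    "LOCUS Toy_Ele1 25 bp DEFINITION Toy_Ele1, 25 bases. "
    "ORIGIN 1 ACGTACGTAC GGTTAAccGG 21 TTRAC //"
)

name, declared, sequence = parse_flat_record(toy_record)
print(name, declared, len(sequence), declared == len(sequence))
```

Because tokens containing digits or underscores are ignored, the same function tolerates entries in which a page number or an element name has slipped into the sequence rows; a length mismatch then flags the record for manual inspection.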

E.3 D. melanogaster

E.3.1 Transposons

E.3.1.1 mariner

DEFINITION  Putative Drosophila melanogaster mariner sequence
SOURCE      Drosophila melanogaster flybase.org, dmel-all-chromosome-r5.29.fasta
FEATURES             Location/Qualifiers
     source          1..1043
                     /organism="Drosophila melanogaster"
                     /mol_type="genomic DNA"
                     /transposon="putative mariner transposon"
     repeat_region   1..26
                     /note="left terminal inverted repeat"
     ORF1            66..780
                     translation=LCNCILSSVSYQLLLLAECQNWFRKFRSGDFSLKHEPRSGRLYEV
                     DDDLIKALIELDRHVNKQEIGEKFNIPKSTVYYHIKRLVKKFDIWVPHVLKEIHLTH
                     RINACDMQLKCNEFDPFLKRITSGKEKWIVYNNVSRKRSWSKHGEPAQTTSKADIHQ
                     KKVMLSVWWDWKGVVYFELLPRNQTINSDVYCHQLNKLNTRRSDQNWSIVKVSYSTR
                     ITLDCTHLWSLSKNCVSLGRNF
                     /product="putative mariner transposase"
     repeat_region   1018..1043
                     /note="right terminal inverted repeat"
ORIGIN
        1 TGCCCAAAAA GTAATTGCGG ATTTTTCATA TAGTCGGCGT TGACAAATTT
       51 TTTCAACGGC TTGTGACTTT GTAATTGCAT TCTTTCATCT GTCAGTTATC
      101 AGCTGTTACT ATTAGCTGAG TGTCAAAATT GGTTTCGCAA ATTCCGTTCT
      151 GGAGATTTTT CACTTAAACA TGAGCCCCGT TCAGGTCGGC TATATGAAGT
      201 TGATGATGAC CTAATCAAAG CATTAATCGA ATTGGATCGT CATGTAAATA
      251 AGCAGGAGAT AGGAGAGAAG TTTAATATAC CAAAATCAAC CGTTTACTAT
      301 CACATAAAAA GACTAGTGAA AAAGTTTGAT ATTTGGGTAC CACATGTATT
      351 GAAAGAAATT CATTTAACAC ACCGAATAAA TGCTTGTGAT ATGCAACTTA
      401 AATGCAATGA ATTCGATCCG TTTTTAAAAC GAATCACATC TGGAAAGGAA
      451 AAATGGATTG TTTACAACAA CGTTAGTCGA AAACGATCAT GGTCCAAGCA
      501 TGGTGAACCA GCTCAAACCA CTTCAAAGGC TGATATCCAC CAAAAGAAGG
      551 TTATGCTGTC TGTTTGGTGG GATTGGAAGG GTGTCGTATA TTTTGAACTG
      601 CTTCCAAGGA ACCAAACGAT TAATTCGGAT GTTTACTGTC ACCAATTGAA
      651 CAAATTGAAT ACAAGGAGAA GCGACCAGAA TTGGTCAATC GTAAAGGTGT
      701 CATATTCCAC CAGGATAACG CTAGACTGCA CACATCTTTG GTCACTATCC
      751 AAAAACTGTG TGAGCTTAGG TAGGAACTTT TGATGCATCC ACCGTATAGC
      801 CCTGACCTGG AACCATCAGA CTACCATTTA TTTCGATCTT TGCAGAACTC
      851 CTTAAATGGT AAAACTTTCG GGAATGATGA GGCTATAAAA TCGCACTTGG
      901 TTCAGTTTTT TGCAGATAAA GGCCAGAAGT TCTATTGACC GTGGAATAGA
      951 AAAAAGGTTA TCGAAAAAAA TGGCAATTCA TTCTAAGTAT TATTAAAAAT
     1001 GCATTTACTT TCTTTTAAAA AATCGGAAAT TATTTTTTGG GCA
//
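The two repeat_region features of the mariner record can be checked directly: if positions 1..26 and 1018..1043 are terminal inverted repeats, the left repeat should closely match the reverse complement of the right one. The short Python sketch below illustrates that comparison. It is not part of the original annotation workflow, the helper names are hypothetical, and the two strings were copied by hand from the ORIGIN block of the record above.

```python
# Complement table for unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def mismatch_positions(a, b):
    """1-based positions at which two equal-length strings differ."""
    return [i + 1 for i, (x, y) in enumerate(zip(a, b)) if x != y]


left_tir = "TGCCCAAAAAGTAATTGCGGATTTTT"   # bases 1..26 of the mariner record
right_tir = "AAAAATCGGAAATTATTTTTTGGGCA"  # bases 1018..1043 of the mariner record

right_rc = reverse_complement(right_tir)
mismatches = mismatch_positions(left_tir, right_rc)

print(left_tir)
print(right_rc)
print(len(mismatches), "mismatches at positions", mismatches)
```

For these two strings the comparison reports 3 mismatches out of 26 positions (positions 11, 17, and 19), so the annotated repeats are imperfect inverted repeats.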

APPENDIX F

SCRIPT USED TO IDENTIFY MITES

This appendix presents the BioPerl script we used to identify MITEs (miniature inverted-repeat transposable elements) in Pediculus humanus humanus.

#Ryan Kennedy and Scott Christley
use lib '/opt/bioperl';
use Bio::SeqIO;
use Bio::SearchIO;
use Fcntl qw(:flock);   # provides LOCK_EX and LOCK_UN for flock()

$len=0;
$minlen=$ARGV[0];           # minimum repeat length
$file=$ARGV[1];             # input FASTA file
$allowMismatch = $ARGV[2];  # number of mismatches allowed
$maxDistance=$ARGV[3];      # maximum distance (size of the search window)
$name=$ARGV[4];             # prefix of the output file
$minDistance = $ARGV[5];    # minimum distance between the two repeats

$len=$minlen;
$counter=0;

# write the run parameters at the top of the output file
open(OUTDAT, ">$name.fa");
flock(OUTDAT, LOCK_EX);
print OUTDAT "======\n";
print OUTDAT "PARAMETERS:\n";
print OUTDAT "\tMinimum Distance:\t $minDistance\n";
print OUTDAT "\tMaximum Distance:\t $maxDistance\n";
print OUTDAT "\tMismatches Allowed:\t $allowMismatch\n";
print OUTDAT "\tMinimum Length:\t $minlen\n";
print OUTDAT "======\n";
flock(OUTDAT, LOCK_UN);
close(OUTDAT);

$in = Bio::SeqIO->new(-file => $file, -format => 'Fasta');

# $str1 and $str2 are never assigned in this listing and $max is not used below
if(length($str1)>length($str2)) {
    $max=length($str1);
} else {
    $max=length($str2);
}

$num=1;
while($seq=$in->next_seq()) {
    if ($seq->length() < 80) {
        #starting sequence too short
        next;
    }

    $seq1=$seq->seq();
    $rvseq=$seq->revcom();
    $rseq=$rvseq->seq();
    $seq_full2=$seq1;
    $rseq_full=$rseq;

    # search through scaffold, forward strand
    $lastMatch = "ZEWQSDV";
    for ($k = 0; $k < length($seq1); $k++) {
        #print "k: " . $k . "\n";

        # extract substrings
        $distance = length($seq1) - $k;
        if ($distance > $maxDistance) { $distance = $maxDistance; }

        $seqSub = substr($seq1, $k, $distance);
        $rseqSub = substr($rseq_full, length($rseq_full) - $distance - $k, $distance);

        # forward strand
        $i = 0;
        #for ($i = 0; $i < $distance; $i++) {

        # reverse strand
        for ($j = 0; $j < $distance; $j++) {

            # compare strings and allow for mismatches
            $len = 0;
            $numMismatch = 0;
            $match = "";
            while ((substr($seqSub,$i + $len, 1) eq substr($rseqSub,$j + $len, 1))
                   || ($numMismatch < $allowMismatch)) {
                if (substr($seqSub,$i + $len, 1) ne substr($rseqSub,$j + $len, 1)) {
                    ++$numMismatch;
                }
                $len++;
                $match=substr($seqSub,$i,$len);
                if( (($i + $len) > $distance) || (($j + $len) > $distance) ) {
                    last;
                }
            }

            # if the distance between the two repeats is too small then skip
            $back = $distance - $j + $k - $len;
            $front = $i + $k;
            if ($front > $back) { next; }
            if (abs($front - $back) < $minDistance) { next; }

            if (($match ne "") && (length($match) >= $minlen)) {
                #print $match . " $i\n";

                # trim out simple repeats
                my $cntA = 0; my $cntT = 0; my $cntG = 0; my $cntC = 0;
                for ($l = 0; $l < length($match); ++$l) {
                    $seqstr = substr($match, $l, 1);
                    if ($seqstr eq "A") { ++$cntA; }
                    if ($seqstr eq "T") { ++$cntT; }
                    if ($seqstr eq "G") { ++$cntG; }
                    if ($seqstr eq "C") { ++$cntC; }
                }
                if (($cntA == 0) || ($cntT == 0) || ($cntG == 0) || ($cntC == 0)) {
                    # simple repeat
                } else {
                    # check the other sequence
                    $cntA = 0; $cntT = 0; $cntG = 0; $cntC = 0;
                    $match=substr($rseqSub,$j,$len);
                    #print $match . " $j\n";
                    for ($l = 0; $l < length($match); ++$l) {
                        $seqstr = substr($match, $l, 1);
                        if ($seqstr eq "A") { ++$cntA; }
                        if ($seqstr eq "T") { ++$cntT; }
                        if ($seqstr eq "G") { ++$cntG; }
                        if ($seqstr eq "C") { ++$cntC; }
                    }
                    if (($cntA == 0) || ($cntT == 0) || ($cntG == 0) || ($cntC == 0)) {
                        #print "Simple Repeat " . $seq->id() . "\n";
                        # simple repeat
                    } else {
                        #print $lastMatch . "\n";
                        ($seqNew,$passed) = loop(substr($seqSub,$i,$len), $lastMatch);
                        if($passed==1) {
                            $lastMatch = $seqNew;
                            open(OUTDAT, ">>$name.fa");
                            flock(OUTDAT, LOCK_EX);
                            print OUTDAT ">" . $seq->id() . " $counter $front $back $len\tRPT1:" . $seqNew . "\tRPT2:" . substr($seq1,$back,$len) . "\n";
                            flock(OUTDAT, LOCK_UN);
                            close(OUTDAT);
                            $counter++;
                        }
                    }
                }
            }
            $len=$minlen;
            $match="";
        }
    }
    $num++;
}

sub loop {
    $check=0;
    $pass=0;
    my($sequence, $last_sequence) = @_;
    $keeper=$sequence;   # untrimmed input, saved after the arguments are unpacked
    #print $sequence . " " . $last_sequence . "\n";
    # strip homopolymer and dinucleotide runs from both ends of the repeat
    while($sequence =~ /(A{4}|T{4}|G{4}|C{4})$/i ||
          $sequence =~ /^(A{4}|T{4}|G{4}|C{4})/i ||
          $sequence =~ /(ATAT|GCGC|GAGA|CTCT|CACA|TGTG)$/i ||
          $sequence =~ /^(ATAT|GCGC|GAGA|CTCT|CACA|TGTG)/i ) {
        #Removed
        $check=1;
        $sequence = $` . $';
    } #end while
    if(length($sequence) >= $minlen) {
        if($check>0) { $sequence=$keeper; }
        if($last_sequence =~ $sequence) { $pass=0; }
        else { $pass=1; }
        #$pass=1;
    } else {
        $pass=0; #Sequence full of Repeats or is a substring
    }
    return $sequence, $pass;
} #end loop
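The core idea of the script, pairing a stretch of sequence with a nearby stretch that matches its reverse complement, can be condensed into a few lines. The Python sketch below is a simplified illustration and not a port of the Perl listing: it assumes fixed-length repeat arms, does not extend matches, and omits the trimming of low-complexity ends. The function and parameter names (find_inverted_repeats, arm_len, max_mismatch, min_gap, max_span) are hypothetical. As in the script, arms that lack any of the four bases are discarded as simple repeats.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an A/C/G/T string (case-insensitive input)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def has_all_four_bases(seq):
    """Composition filter: reject arms that lack A, C, G, or T."""
    return all(base in seq for base in "ACGT")


def find_inverted_repeats(seq, arm_len, max_mismatch, min_gap, max_span):
    """Return (left_start, right_start, mismatches) for candidate repeat pairs.

    arm_len      length of each repeat arm (fixed in this sketch)
    max_mismatch mismatches tolerated between the arms
    min_gap      minimum number of bases between the two arms
    max_span     maximum length of the whole element, arms included
    """
    seq = seq.upper()
    hits = []
    for left in range(len(seq) - arm_len + 1):
        arm = seq[left:left + arm_len]
        if not has_all_four_bases(arm):
            continue
        first_right = left + arm_len + min_gap
        last_right = min(len(seq), left + max_span) - arm_len
        for right in range(first_right, last_right + 1):
            other = seq[right:right + arm_len]
            if not has_all_four_bases(other):
                continue
            partner = reverse_complement(other)
            mismatches = sum(a != b for a, b in zip(arm, partner))
            if mismatches <= max_mismatch:
                hits.append((left, right, mismatches))
    return hits


# Toy element: a 10 bp arm, a 20 bp spacer, and the arm's reverse complement.
toy_arm = "ACGTTGCAAG"
toy_seq = toy_arm + "T" * 20 + reverse_complement(toy_arm)
print(find_inverted_repeats(toy_seq, arm_len=10, max_mismatch=0,
                            min_gap=5, max_span=40))   # [(0, 30, 0)]
```

The nested loops make this quadratic in the window size, which is acceptable for an illustration; the max_span bound plays the same role as the maximum distance argument of the script by limiting how far apart the two arms may lie.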

REFERENCES

1. S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389–3402, July 1997.

2. O. Andrieu, A. S. Fiston, D. Anxolabéhère, and H. Quesneville. Detection of transposable elements by their compositional bias. BMC Bioinformatics, 5(94), July 2004.

3. S. M. Anwar, M. Musiani, G. McDermid, and D. Marceau. How Do Human Activities Shape Wolves’ Behavior In The Central Rocky Mountain Region, Alberta, Canada? In L. Yilmaz, editor, Proceedings of the 2009 Agent-Directed Simulation Symposium. The Society for Modeling and Simulation International, March 2009.

4. ArcGIS. http://www.esri.com/software/arcgis.

5. P. Arensburger, K. Megy, R. M. Waterhouse, J. Abrudan, P. Amedeo, B. Antelo, L. Bartholomay, S. Bidwell, E. Caler, F. Camara, C. L. Campbell, K. S. Campbell, C. Casola, M. T. Castro, I. Chandramouliswaran, S. B. Chapman, S. Christley, J. Costas, E. Eisenstadt, C. Feschotte, C. Fraser-Liggett, R. Guigo, B. Haas, M. Hammond, B. S. Hansson, J. Hemingway, S. R. Hill, C. Howarth, R. Ignell, R. C. Kennedy, C. D. Kodira, N. F. Lobo, C. Mao, G. Mayhew, K. Michel, A. Mori, N. Liu, H. Naveira, V. Nene, N. Nguyen, M. D. Pearson, E. J. Pritham, D. Puiu, Y. Qi, H. Ranson, J. M. C. Ribeiro, H. M. Robertson, D. W. Severson, M. Shumway, M. Stanke, R. L. Strausberg, C. Sun, G. Sutton, Z. J. Tu, J. M. C. Tubio, M. F. Unger, D. L. Vanlandingham, A. J. Vilella, O. White, J. R. White, C. S. Wondji, J. Wortman, E. M. Zdobnov, B. Birren, B. M. Christensen, F. H. Collins, A. Cornel, G. Dimopoulos, L. I. Hannick, S. Higgs, G. C. Lanzaro, D. Lawson, N. H. Lee, M. A. T. Muskavitch, A. S. Raikhel, and P. W. Atkinson. Sequence of Culex quinquefasciatus Establishes a Platform for vector Mosquito Comparative Genomics. Science, 330(6000):86–88, October 2010.

6. S. M. N. Arifin, R. C. Kennedy, K. E. Lane, G. R. Madey, A. Fuentes, and H. Hollocher. P-SAM: A Post-Simulation Analysis Module for Agent-Based Models. In Proceedings of the International Simulation Multiconference (ISMc2010): Summer Computer Simulation Conference (SCSC2010), 2010.

7. O. Balci. Handbook of Simulation: Principles, Methodology, Advances, Applications, and Practice, chapter Verification, Validation, and Testing. John Wiley & Sons, New York, NY, 1998.

8. J. Banks and R. R. Gibson. Don’t simulate when... 10 rules for determining when simulation is not appropriate. IIE Solutions, September 1997.

9. J. Banks and J. S. Carson II. Introduction to discrete-event simulation. In Proceedings of the 1986 Winter Simulation Conference, pages 17–23, 1986.

10. J. Banks, J. S. Carson II, B. L. Nelson, and D. M. Nicol. Discrete-Event System Simulation. Pearson Education, Inc., Upper Saddle River, NJ, fourth edition, 2005.

11. Z. Bao and S. Eddy. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Research, 12(8):1269–1276, August 2002.

12. E. A. Bennett, L. E. Coleman, C. Tsui, W. S. Pittard, and S. E. Devine. Natural genetic variation caused by transposable elements in humans. Genetics, 168:933–951, October 2004.

13. C. M. Bergman and H. Quesneville. Discovering and detecting transposable elements in genome sequences. Briefings in Bioinformatics, 8(6):382–392, November 2007.

14. J. Biedler and Z. Tu. Non-LTR Retrotransposons in the African Malaria Mosquito, Anopheles gambiae: Unprecedented Diversity and Evidence of Recent Activity. Molecular Biology and Evolution, 20(11):1811–1825, 2003.

15. E. Birney, M. Clamp, and R. Durbin. GeneWise and Genomewise. Genome Research, 14:988–995, 2004.

16. E. Birney and R. Durbin. Using GeneWise in the Drosophila Annotation Experiment. Genome Research, 10:547–548, 2004.

17. BLAST. http://www.ncbi.nlm.nih.gov/blast.

18. D. Brown, R. Riolo, D. Robinson, M. North, and W. Rand. Spatial Process and Data Models: Toward Integration of Agent-based Models and GIS. Journal of Geographic Systems, Special Issue on Space-Time Information Systems, 7(1):25–47, 2005.

19. R. V. Bruggner. A system for integration and management of community annotation for vectorbase.org. Master’s Thesis, University of Notre Dame, 2007.

20. C. Burge and S. Karlin. Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology, 268:78–94, 1997.

21. R. E. Butler. The Design and Development of VectorBase: A Bioinformatic Resource Center for Invertebrate Vectors of Human Pathogens. Master’s Thesis, University of Notre Dame, 2010.

22. P. Capy, C. Bazin, D. Higuet, and T. Langin. Dynamics and Evolution of Transposable Elements. Landes Bioscience, Austin, Texas, 1998.

23. L. Cary, M. Goebel, B. Corsaro, H. Wang, E. Rosen, and M. Fraser. Transposon mutagenesis of baculoviruses: analysis of Trichoplusia ni transposon ifp2 insertions within the fp-locus of nuclear polyhedrosis viruses. Virology, 172(1):156–169, September 1989.

24. A. Caspi and L. Pachter. Identification of transposable elements using multiple alignments of related genomes. Genome Research, 16:260–270, February 2006.

25. C. Castle, A. Crooks, P. Longley, and M. Batty. Agent-based modelling and simulation using repast: A gallery of gis applications from CASA. In G. Priestnall and P. Alpin, editors, Proceedings of the 14th Geographical Information Systems Research UK Conference, pages 237–239, 2006.

26. Chado. http://www.gmod.org/wiki/index.php/chado.

27. Chado Best Practices. http://gmod.org/wiki/index.php/Chado_Best_Practices#Transposons.

28. N. L. Craig, R. Craigie, M. Gellert, and A. M. Lambowitz, editors. Mobile DNA II. ASM Press, Washington, DC, 2002.

29. A. T. Crooks. UCL working paper series: The repast simulation/modelling system for geospatial simulation. Available at http://www.casa.ucl.ac.uk/working_papers/paper123.pdf, September 2007.

30. V. Curwen, E. Eyras, T. D. Andrews, L. Clarke, E. Mongin, S. M. Searle, and M. Clamp. The ensembl automatic gene annotation system. Genome Research, 14:942–950, 2004.

31. V. Curwen, E. Eyras, T. D. Andrews, L. Clarke, E. Mongin, S. M. Searle, and M. Clamp. The Ensembl Automatic Gene Annotation System. Genome Research, 14(5):942–950, 2004.

32. P. Daszak, A. Cunningham, and A. Hyatt. Anthropogenic environmental change and the emergence of infectious disease in wildlife. Acta Tropica, 78:103–116, 2001.

33. DNASTAR SeqMan. http://www.dnastar.com/products/seqmanpro.php.

34. Douglas-Peucker Algorithm. http://geometryalgorithms.com/Archive/algorithm_0205/#Douglas-Peucker%20algorithm.

35. R. D. Dowell, R. M. Jokerst, A. Day, S. R. Eddy, and L. Stein. The Distributed Annotation System. BMC Bioinformatics, 2(7), 2001.

36. R. Drysdale and the FlyBase Consortium. FlyBase: a database for the Drosophila Research Community. Methods in Molecular Biology, 420:45–49, 2008.

37. R. M. D’Souza, M. Lysenko, S. Marino, and D. Kirschner. Data-Parallel Algorithms for Agent-Based Model Simulation of Tuberculosis On Graphics Processing Units. In L. Yilmaz, editor, Proceedings of the 2009 Agent-Directed Simulation Symposium. The Society for Modeling and Simulation International, March 2009.

38. R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, United Kingdom, 2003.

39. S. R. Eddy. Profile hidden Markov models. Bioinformatics Review, 14(9):755–763, 1998.

40. R. C. Edgar and E. W. Myers. PILER: identification and classification of genomic repeats. Bioinformatics, 21 Suppl. 1:i152–i158, March 2005.

41. L. J. Engel, G. A. Engel, M. A. Schillaci, A. Rompis, A. Putra, K. G. Suaryana, A. Fuentes, B. Beer, S. Hicks, R. White, B. Wilson, and J. S. Allan. Primate-to-human retroviral transmission in Asia. Emerging Infectious Diseases, 11(7), July 2005.

42. J. E. Fa and D. G. Lindburg, editors. Evolution and Ecology of Macaque Societies. Cambridge University Press, 2005.

43. P. Flicek, B. Aken, K. Beal, B. Ballester, M. Caccamo, Y. Chen, L. Clarke, G. Coates, F. Cunningham, T. Cutts, T. Down, S. Dyer, T. Eyre, S. Fitzgerald, J. Fernandez-Banet, S. Gräf, S. Haider, M. Hammond, R. Holland, K. Howe, K. Howe, N. Johnson, A. Jenkinson, A. Kähäri, D. Keefe, F. Kokocinski, E. Kulesha, D. Lawson, I. Longden, K. Megy, P. Meidl, B. Overduin, A. Parker, B. Pritchard, A. Prlic, S. Rice, D. Rios, M. Schuster, I. Sealy, G. Slater, D. Smedley, G. Spudich, S. Trevanion, A. Vilella, J. Vogel, S. White, M. Wood, E. Birney, T. Cox, V. Curwen, R. Durbin, X. Fernandez-Suarez, J. Herrero, T. Hubbard, A. Kasprzyk, G. Proctor, J. Smith, A. Ureta-Vidal, and S. Searle. Ensembl 2008. Nucleic Acids Research, 36:D707–D714, January 2008.

44. P. Flicek, B. L. Aken, B. Ballester, K. Beal, E. Bragin, S. Brent, Y. Chen, P. Clapham, G. Coates, S. Fairley, S. Fitzgerald, J. Fernandez-Banet, L. Gordon, S. Graf, S. Haider, M. Hammond, K. Howe, A. Jenkinson, N. Johnson, A. Kahari, D. Keefe, S. Keenan, R. Kinsella, F. Kokocinski, G. Koscielny, E. Kulesha, D. Lawson, I. Longden, T. Massingham, W. McLaren, K. Megy, B. Overduin, B. Pritchard, D. Rios, M. Ruffier, M. Schuster, G. Slater, D. Smedley, G. Spudich, Y. A. Tang, S. Trevanion, A. Vilella, J. Vogel, S. White, S. P. Wilder, A. Zadissa, E. Birney, F. Cunningham, I. Dunham, R. Durbin, X. M. Fernandez-Suarez, J. Herrero, T. J. P. Hubbard, A. Parker, G. Proctor, J. Smith, and S. M. J. Searle. Ensembl’s 10th year. Nucleic Acids Research, 38:D557–562, 2010.

45. P. Flicek, M. R. Amode, D. Barrell, K. Beal, S. Brent, Y. Chen, P. Clapham, G. Coates, S. Fairley, S. Fitzgerald, L. Gordon, M. Hendrix, T. Hourlier, N. Johnson, A. Kähäri, D. Keefe, S. Keenan, R. Kinsella, F. Kokocinski, E. Kulesha, P. Larsson, I. Longden, W. McLaren, B. Overduin, B. Pritchard, H. S. Riat, D. Rios, G. R. S. Ritchie, M. Ruffier, M. Schuster, D. Sobral, G. Spudich, Y. A. Tang, S. Trevanion, J. Vandrovcova, A. J. Vilella, S. White, S. P. Wilder, A. Zadissa, J. Zamora, B. L. Aken, E. Birney, F. Cunningham, I. Dunham, R. Durbin, X. M. Fernández-Suarez, J. Herrero, T. J. P. Hubbard, A. Parker, G. Proctor, J. Vogel, and S. M. J. Searle. Ensembl 2011. Nucleic Acids Research. Epub ahead of print.

46. J. Fooden. Systematic review of Southeast Asian longtail macaques, Macaca fascicularis. Fieldiana Zoology, 81, 1995.

47. A. Fuentes, M. Southern, and K. G. Suaryana. Monkey forests and human landscapes: Is extensive sympatry sustainable for Homo sapiens and Macaca fascicularis on Bali? In J. D. Patterson and J. Wallis, editors, Commensalism and Conflict: The Primate-Human Interface. American Society of Primatology Publications, 2005.

48. GD Graphics Library. http://www.libgd.org/.

49. GeoTools. http://geotools.codehaus.org.

50. N. Gilbert. Agent-based Models. SAGE Publications, Thousand Oaks, CA, 2008.

51. H. R. Gimblett, editor. Integrating Geographic Information Systems and Agent-based Modeling Techniques for Simulating Social and Ecological Processes. Oxford University Press, 2002.

52. H. R. Gimblett. Integrating geographic information systems and agent-based technologies for modeling and simulating social and ecological phenomena. In H. R. Gimblett, editor, Integrating Geographic Information Systems and Agent-based Modeling Techniques for Simulating Social and Ecological Processes. Oxford University Press, 2002.

53. GMOD. http://www.gmod.org/.

54. GMOD Names of Features. http://www.gmod.org/wiki/index.php/Chado_Sequence_Module#Names_of_Features.

55. GNU General Public License (GPL) v3. http://www.gnu.org/licenses/gpl.html.

56. GRASS: Geographic Resources Analysis Support System. http://grass.osgeo.org.

57. P. Green. http://www.phrap.org/phredphrapconsed.html.

58. V. Grimm, U. Berger, F. Bastiansen, S. Eliassen, V. Ginot, J. Giske, J. Goss-Custard, T. Grand, S. K. Heinz, G. Huse, A. Huth, J. U. Jepsen, C. Jørgensen, W. M. Mooij, B. Müller, G. Pe’er, C. Piou, S. F. Railsback, A. M. Robbins, M. M. Robbins, E. Rossmanith, N. Rüger, E. Strand, S. Souissi, R. A. Stillman, R. Vabø, U. Visser, and D. L. DeAngelis. A standard protocol for describing individual-based and agent-based models. Ecological Modelling, 198(1):115–126, 2006.

59. V. Grimm, U. Berger, D. L. DeAngelis, J. G. Polhill, J. Giske, and S. F. Railsback. The ODD protocol: A review and first update. Ecological Modelling, 221:2760–2768, 2010.

60. V. Grimm and S. F. Railsback. Individual-based Modeling and Ecology. Princeton University Press, Princeton, NJ, 2005.

61. U. Hellsten et al. The genome of the Western clawed frog Xenopus tropicalis. Science, 328(5978):633–636, April 2010.

62. Hibernate. http://www.hibernate.org.

63. R. A. Holt, G. M. Subramanian, A. Halpern, G. G. Sutton, R. Charlab, D. R. Nusskern, P. Wincker, A. G. Clark, J. M. C. Ribeiro, R. Wides, S. L. Salzberg, B. Loftus, M. Yandell, W. H. Majoros, D. B. Rusch, Z. Lai, C. L. Kraft, J. F. Abril, V. Anthouard, P. Arensburger, P. W. Atkinson, H. Baden, V. de Berardinis, D. Baldwin, V. Benes, J. Biedler, C. Blass, R. Bolanos, D. Boscus, M. Barnstead, S. Cai, A. Center, K. Chatuverdi, G. K. Christophides, M. A. Chrystal, M. Clamp, A. Cravchik, V. Curwen, A. Dana, A. Delcher, I. Dew, C. A. Evans, M. Flanigan, A. Grundschober-Freimoser, L. Friedli, Z. Gu, P. Guan, R. Guigo, M. E. Hillenmeyer, S. L. Hladun, J. R. Hogan, Y. S. Hong, J. Hoover, O. Jaillon, Z. Ke, C. Kodira, E. Kokoza, A. Koutsos, I. Letunic, A. Levitsky, Y. Liang, J.-J. Lin, N. F. Lobo, J. R. Lopez, J. A. Malek, T. C. McIntosh, S. Meister, J. Miller, C. Mobarry, E. Mongin, S. D. Murphy, D. A. O’Brochta, C. Pfannkoch, R. Qi, M. A. Regier, K. Remington, H. Shao, M. V. Sharakhova, C. D. Sitter, J. Shetty, T. J. Smith, R. Strong, J. Sun, D. Thomasova, L. Q. Ton, P. Topalis, Z. Tu, M. F. Unger, B. Walenz, A. Wang, J. Wang, M. Wang, X. Wang, K. J. Woodford, J. R. Wortman, M. Wu, A. Yao, E. M. Zdobnov, H. Zhang, Q. Zhao, S. Zhao, S. C. Zhu, I. Zhimulev, M. Coluzzi, A. della Torre, C. W. Roth, C. Louis, F. Kalush, R. J. Mural, E. W. Myers, M. D. Adams, H. O. Smith, S. Broder, M. J. Gardner, C. M. Fraser, E. Birney, P. Bork, P. T. Brey, J. C. Venter, J. Weissenbach, F. C. Kafatos, F. H. Collins, and S. L. Hoffman. The genome sequence of the malaria mosquito Anopheles gambiae. Science, 298(5591), October 2002.

64. X. Huang and A. Madan. CAP3: A DNA Sequence Assembly Program. Genome Research, 9:868–877, 1999.

65. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature, 409(6822):860–921, 2001.

66. R. M. Itami. Mobile agents with spatial intelligence. In H. R. Gimblett, editor, Integrating Geographic Information Systems and Agent-based Modeling Techniques for Simulating Social and Ecological Processes. Oxford University Press, 2002.

67. Java. http://java.sun.com.

68. A. Jenkinson, M. Albrecht, E. Birney, H. Blankenburg, T. Down, R. Finn, H. Hermjakob, T. Hubbard, R. Jimenez, P. Jones, A. Kahari, E. Kulesha, J. Macias, G. Reeves, and A. Prlic. Integrating biological data - the Distributed Annotation System. BMC Bioinformatics, 9(Suppl 8):S3, 2008.

69. N. C. Jones and P. A. Pevzner. An Introduction to Bioinformatics Algorithms. The MIT Press, Cambridge, MA, 2004.

70. JTS Topology Suite. http://www.vividsolutions.com/jts/jtshome.htm.

71. J. Jurka, V. V. Kapitonov, A. Pavlicek, P. Klonowski, O. Kohany, and J. Walichiewicz. Repbase update, a database of eukaryotic repetitive elements. Cytogenetic and Genome Research, 110:462–467, 2005.

72. J. Jurka, P. Klonowski, V. Dagman, and P. Pelton. Censor–a program for identification and elimination of repetitive elements from DNA sequences. Computers and Chemistry, 20(1):119–121, 1996.

73. K. Kaiser, J. W. Sentry, and D. J. Finnegan. Eukaryotic transposable elements as tools to study gene structure and function. In D. J. Sherratt, editor, Mobile Genetic Elements. Oxford University Press, 1995.

74. M. Keeling, M. Woolhouse, R. May, G. Davies, and B. Grenfell. Modelling vaccination strategies against foot-and-mouth disease. Nature, 421:136–142, January 2003.

75. R. C. Kennedy. Verification and Validation of Agent-based and Equation-based Simulations and Bioinformatics Computing: Identifying Transposable Elements in the Aedes aegypti Genome. Master’s Thesis, University of Notre Dame, April 2006.

76. R. C. Kennedy, K. E. Lane, S. M. N. Arifin, A. Fuentes, H. Hollocher, and G. R. Madey. A GIS Aware Agent-Based Model of Pathogen Transmission. International Journal of Intelligent Control and Systems, 14(1):51–61, March 2009. Invited.

77. R. C. Kennedy, K. E. Lane, S. M. N. Arifin, H. Hollocher, A. Fuentes, and G. R. Madey. Effectively integrating GIS data into an agent-based epidemiological model. In NICO Complexity Conference, September 2009. Poster Award Winner.

78. R. C. Kennedy, K. E. Lane, S. M. N. Arifin, H. Hollocher, A. Fuentes, and G. R. Madey. Simulation and Analysis of Pathogen Transmission in an Agent- and GIS-based Model. In North American Association for Computational Social and Organization Science 2009 Conference, Tempe, AZ, October 2009.

79. R. C. Kennedy, K. E. Lane, A. Fuentes, H. Hollocher, and G. Madey. Spatially Aware Agents: An effective and efficient use of GIS data within an Agent-based Model. In L. Yilmaz, editor, Proceedings of the 2009 Agent-Directed Simulation Symposium. The Society for Modeling and Simulation International, March 2009.

80. R. C. Kennedy, M. F. Unger, S. Christley, F. H. Collins, and G. R. Madey. An automated homology-based approach for identifying transposable elements. BMC Bioinformatics, 2010. Under review.

81. R. C. Kennedy, X. Xiang, T. F. Cosimano, L. A. Arthurs, P. A. Maurice, and S. E. Cabaniss. Verification and validation of agent-based and equation-based simulations: A comparison. In L. Yilmaz, editor, Proceedings of the 2006 Agent-Directed Simulation Symposium. The Society for Modeling and Simulation International, April 2006.

82. M. G. Kidwell and D. Lisch. Transposable elements as sources of variation in animals and plants. Proceedings of the National Academy of Sciences USA, 94:7704–7711, July 1997.

83. E. F. Kirkness, B. J. Haas, W. Sun, H. R. Braig, M. A. Perotti, J. M. Clark, S. H. Lee, H. M. Robertson, R. C. Kennedy, E. Elhaik, D. Gerlach, E. V. Kriventseva, C. G. Elsik, D. Graur, C. A. Hill, J. A. Veenstra, B. Walenz, J. M. C. Tubío, J. M. C. Ribeiro, J. Rozas, J. S. Johnston, J. T. Reese, A. Popadic, M. Tojo, D. Raoult, D. L. Reed, Y. Tomoyasu, E. Krause, O. Mittapalli, V. M. Margam, H.-M. Li, J. M. Meyer, R. M. Johnson, J. Romero-Severson, J. P. VanZee, D. Alvarez-Ponce, F. G. Vieira, M. Aguadé, S. Guirao-Rico, J. M. Anzola, K. S. Yoon, J. P. Strycharz, M. F. Unger, S. Christley, N. F. Lobo, M. J. Seufferheld, N. Wang, G. A. Dasch, C. J. Struchiner, G. Madey, L. I. Hannick, S. Bidwell, V. Joardar, E. Caler, R. Shao, S. C. Barker, S. Cameron, R. V. Bruggner, A. Regier, J. Johnson, L. Viswanathan, T. R. Utterback, G. G. Sutton, D. Lawson, R. M. Waterhouse, J. C. Venter, R. L. Strausberg, M. R. Berenbaum, F. H. Collins, E. M. Zdobnov, and B. R. Pittendrigh. Genome sequences of the human body louse and its primary endosymbiont provide insights into the permanent parasitic lifestyle. Proceedings of the National Academy of Sciences, 107(27):12168–12173, July 2010.

84. O. Kohany, A. J. Gentles, L. Hankus, and J. Jurka. Annotation, submission and screening of repetitive elements in Repbase: RepbaseSubmitter and Censor. BMC Bioinformatics, 7(474), 2006.

85. K. K. Kojima and H. Fujiwara. Evolution of Target Specificity in R1 Clade Non-LTR Retrotransposons. Molecular Biology and Evolution, 20(3):351–361, 2003.

86. K. K. Kojima and H. Fujiwara. Cross-Genome Screening of Novel Sequence-Specific Non-LTR Retrotransposons: Various Multicopy RNA Genes and Microsatellites Are Selected as Targets. Molecular Biology and Evolution, 21(2):201–217, 2004.

87. I. Korf. Gene finding in novel genomes. BMC Bioinformatics, 5(59), 2004.

88. K. E. Lane, R. C. Kennedy, L. A. Miller, G. Madey, H. Hollocher, and A. Fuentes. Exploring the use of agent-based models in understanding patterns of pathogen transmission. In preparation.

89. M. Larkin, G. Blackshields, N. Brown, R. Chenna, P. McGettigan, H. McWilliam, F. Valentin, I. Wallace, A. Wilm, R. Lopez, J. Thompson, T. Gibson, and D. Higgins. Clustal W and Clustal X version 2.0. Bioinformatics, 23(21):2947–2948, November 2007.

90. A. M. Law and W. D. Kelton. Simulation Modeling and Analysis. McGraw-Hill, Boston, MA, third edition, 2000.

91. D. Lawson, P. Arensburger, P. Atkinson, N. J. Besansky, R. V. Bruggner, R. Butler, K. S. Campbell, G. K. Christophides, S. Christley, E. Dialynas, D. Emmert, M. Hammond, C. A. Hill, R. C. Kennedy, N. F. Lobo, R. M. MacCallum, G. Madey, K. Megy, S. Redmond, S. Russo, D. W. Severson, E. O. Stinson, P. Topalis, E. M. Zdobnov, E. Birney, W. M. Gelbart, F. C. Kafatos, C. Louis, and F. H. Collins. VectorBase: a home for invertebrate vectors of human pathogens. Nucleic Acids Research, 35:D503–D505, 2007.

92. D. Lawson, P. Arensburger, P. Atkinson, N. J. Besansky, R. V. Bruggner, R. Butler, K. S. Campbell, G. K. Christophides, S. Christley, E. Dialynas, M. Hammond, C. A. Hill, N. Konopinski, N. F. Lobo, R. M. MacCallum, G. Madey, K. Megy, J. Meyer, S. Redmond, D. W. Severson, E. O. Stinson, P. Topalis, E. Birney, W. M. Gelbart, F. C. Kafatos, C. Louis, and F. H. Collins. VectorBase: a data resource for invertebrate vector genomics. Nucleic Acids Research, 37:D583–587, 2009.

93. E. Lerat. Identifying repeats and transposable elements in sequenced genomes: how to find your way through the dense forest of programs. Heredity, 104:520–533, 2010.

94. S. Lewis, S. Searle, N. Harris, M. Gibson, V. Iyer, J. Richter, C. Wiel, L. Bayraktaroglu, E. Birney, M. Crosby, J. Kaminker, B. Matthews, S. Prochnik, C. Smith, J. Tupy, G. Rubin, S. Misra, C. Mungall, and M. Clamp. Apollo: a sequence annotation editor. Genome Biology, 3(12), 2002.

95. N. Lobo, A. Hua-Van, X. Li, B. Nolen, and J. M.J. Fraser. Germ line transformation of the yellow fever mosquito, Aedes aegypti, mediated by transpositional insertion of a piggyBac vector. Insect Molecular Biology, 11(2):133–139, April 2002.

96. E. R. Mardis. Next-Generation DNA Sequencing Methods. Annual Review of Genomics and Human Genetics, 9:387–402, September 2008.

97. E. M. McCarthy and J. F. McDonald. LTR STRUC: a novel search and identification program for LTR retrotransposons. Bioinformatics, 19(3):362–367, February 2003.

98. B. McClintock. The discovery and characterization of transposable elements: The collected papers of Barbara McClintock. Garland Publishing, Inc., New York, NY, 1987.

99. MediaWiki. http://www.mediawiki.org/wiki/MediaWiki.

100. P. Medstrand, L. N. van de Lagemaat, C. A. Dunn, J. R. Landry, D. Svenback, and D. L. Mager. Impact of transposable elements on the evolution of mammalian gene regulation. Cytogenetic and Genome Research, 110:342–352, 2005.

101. J. R. Miller, S. Koren, and G. Sutton. Assembly algorithms for next-generation sequencing data. Genomics, 95(6):315–327, 2010.

102. D. M. Mount. Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, second edition, 2004.

103. T. Naylor, J. Balintfy, D. Burdick, and K. Chu. Computer Simulation Techniques. John Wiley, New York, NY, 1966.

104. NCBI: National Center for Biotechnology Information. http://www.ncbi.nih.gov.

105. V. Nene, J. R. Wortman, D. Lawson, B. Haas, C. Kodira, Z. J. Tu, B. Loftus, Z. Xi, K. Megy, M. Grabherr, Q. Ren, E. M. Zdobnov, N. F. Lobo, K. S. Campbell, S. E. Brown, M. F. Bonaldo, J. Zhu, S. P. Sinkins, D. G. Hogenkamp, P. Amedeo, P. Arensburger, P. W. Atkinson, S. Bidwell, J. Biedler, E. Birney, R. V. Bruggner, J. Costas, M. R. Coy, J. Crabtree, M. Crawford, B. deBruyn, D. DeCaprio, K. Eiglmeier, E. Eisenstadt, H. El-Dorry, W. M. Gelbart, S. L. Gomes, M. Hammond, L. I. Hannick, J. R. Hogan, M. H. Holmes, D. Jaffe, J. S. Johnston, R. C. Kennedy, H. Koo, S. Kravitz, E. V. Kriventseva, D. Kulp, K. LaButti, E. Lee, S. Li, D. D. Lovin, C. Mao, E. Mauceli, C. F. M. Menck, J. R. Miller, P. Montgomery, A. Mori, A. L. Nascimento, H. F. Naveira, C. Nusbaum, S. O’Leary, J. Orvis, M. Pertea, H. Quesneville, K. R. Reidenbach, Y.-H. Rogers, C. W. Roth, J. R. Schneider, M. Schatz, M. Shumway, M. Stanke, E. O. Stinson, J. M. C. Tubio, J. P. VanZee, S. Verjovski-Almeida, D. Werner, O. White, S. Wyder, Q. Zeng, Q. Zhao, Y. Zhao, C. A. Hill, A. S. Raikhel, M. B. Soares, D. L. Knudson, N. H. Lee, J. Galagan, S. L. Salzberg, I. T. Paulsen, G. Dimopoulos, F. H. Collins, B. Birren, C. M. Fraser-Liggett, and D. W. Severson. Genome sequence of Aedes aegypti, a major arbovirus vector. Science, 316(5832):1718–1723, June 2007.

106. M. Oliveira de Carvalho, J. Silva, and E. Loreto. Analyses of P-like transposable element sequences from the genome of Anopheles gambiae. Insect Molecular Biology, 13(1):55–63, 2006.

107. OpenMap. http://openmap.bbn.com.

108. phpExcelReader. http://sourceforge.net/projects/phpexcelreader.

109. B. R. Pittendrigh, J. M. Clark, J. S. Johnston, S. H. Lee, J. Romero-Severson, and G. A. Dasch. Sequencing of a new target genome: the Pediculus humanus humanus (Phthiraptera: Pediculidae) genome project. Journal of Medical Entomology, 43(6):1101–1111, November 2006.

110. R. H. Plasterk, Z. Izsvák, and Z. Ivics. Resident aliens: the Tc1/mariner superfamily of transposable elements. Trends in Genetics, 15(8), August 1999.

111. M. Pop, S. L. Salzberg, and M. Shumway. Genome sequence assembly: algorithms and issues. Computer, 35:47–54, 2002.

112. PostgreSQL. http://www.postgresql.org/.

113. K. D. Pruitt, T. Tatusova, W. Klimke, and D. R. Maglott. NCBI Reference Sequences: current status, policy and new initiatives. Nucleic Acids Research, 37:D32–36, 2009.

114. QGIS: Quantum GIS. http://www.qgis.org.

115. H. Quesneville, C. M. Bergman, O. Andrieu, D. Autard, D. Nouaud, M. Ashburner, and D. Anxolabehere. Combined evidence annotation of transposable elements in genome sequences. PLoS Computational Biology, 1, 2005.

116. H. Quesneville, D. Nouaud, and D. Anxolabéhère. Detection of new transposable element families in Drosophila melanogaster and Anopheles gambiae genomes. Journal of Molecular Evolution, 57, 2003.

117. H. Quesneville, D. Nouaud, and D. Anxolabéhère. P elements and MITE relatives in the whole genome sequence of Anopheles gambiae. BMC Genomics, 7(214), 2006.

118. Repast. http://sourceforge.repast.net.

119. Repbase. http://www.girinst.org/repbase/index.html.

120. L. Roberts and John Janovy, Jr. Gerald D. Schmidt & Larry S. Roberts’ Foundations of Parasitology. McGraw-Hill, eighth edition, 2009.

121. L. M. Rocha. From artificial life to semiotic agent models: Review and research directions. Available at http://informatics.indiana.edu/rocha/ps/agent_review.pdf, Los Alamos National Laboratory Complex Systems Modeling Team, 1999.

122. G. Rubin and A. Spradling. Genetic transformation of Drosophila with transposable element vectors. Science, 218(4570):348–353, October 1982.

123. S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Pearson Education, Inc., Upper Saddle River, NJ, 2003.

124. S. Saha, S. Bridges, and Z. V. Magbanua. Computational approaches and tools used in identification of dispersed repetitive DNA sequences. Tropical Plant Biology, 1:85–96, 2008.

125. F. Sanger, S. Nicklen, and A. Coulson. DNA sequencing with chain-terminating inhibitors. In Proceedings of the National Academy of Sciences of the United States of America, volume 74, pages 5463–5467, December 1977.

126. A. Sarkar, R. Sengupta, J. Krzywinski, X. Wang, C. Roth, and F. Collins. P elements are found in the genomes of nematoceran insects of the genus Anopheles. Insect Biochemistry and Molecular Biology, 33(4):381–387, April 2003.

127. A. Sarkar, C. Sim, Y. Hong, J. Hogan, M. Fraser, H. Robertson, and F. Collins. Molecular evolutionary analysis of the widespread piggyBac transposon family and related “domesticated” sequences. Molecular Genetics and Genomics, 270(2):173–180, 2003.

128. R. E. Shannon. Introduction to the art and science of simulation. In Proceedings of the 1998 Winter Simulation Conference, pages 7–14, 1998.

129. J. A. Shapiro. The discovery and significance of mobile genetic elements. In D. J. Sherratt, editor, Mobile Genetic Elements. Oxford University Press, 1995.

130. R. K. Slotkin and R. Martienssen. Transposable elements and the epigenetic regulation of the genome. Nature Reviews Genetics, 8(4):272–285, April 2007.

131. SOAP. http://www.w3.org/TR/soap/.

132. M. W. Southern. An Assessment of Potential Habitat Corridors and Landscape Ecology for Long-Tailed Macaques (Macaca fascicularis) on Bali, Indonesia. Master’s Thesis, Central Washington University, June 2002.

133. J. E. Stajich, D. Block, K. Boulez, S. E. Brenner, S. A. Chervitz, C. Dagdigian, G. Fuellen, J. G. Gilbert, I. Korf, H. Lapp, H. Lehväslaiho, C. Matsalla, C. J. Mungall, B. I. Osborne, M. R. Pocock, P. Schattner, M. Senger, L. D. Stein, E. Stupka, M. D. Wilkinson, and E. Birney. The Bioperl Toolkit: Perl Modules for the Life Sciences. Genome Research, 12(10):1611–1618, October 2002.

134. TEfam. http://tefam.biochem.vt.edu/tefam/.

135. L. Temime, Y. Pannet, L. Kardas, L. Opatowski, D. Guillemot, and P. Y. Boëlle. NOSOSIM: an agent-based model of pathogen circulation in a hospital ward. In L. Yilmaz, editor, Proceedings of the 2009 Agent-Directed Simulation Symposium. The Society for Modeling and Simulation International, March 2009.

136. S. Tempel, M. Jurka, and J. Jurka. VisualRepbase: an interface for the study of occurrences of transposable element families. BMC Bioinformatics, 8(345), 2008.

137. TESeeker. http://www.nd.edu/~teseeker.

138. Z. Tu and C. Coates. Mosquito transposable elements. Insect Biochemistry and Molecular Biology, 34:631–644, 2004.

139. Z. Tu and S. Li. Mobile genetic elements of malaria vectors and other mosquitoes. In P. J. Brindley, editor, Mobile Genetic Elements in Metazoan Parasites. Landes Bioscience, September 2008.

140. J. M. C. Tubío, H. Naveira, and J. Costas. Structural and Evolutionary Analyses of the Ty3/gypsy Group of LTR Retrotransposons in the Genome of Anopheles gambiae. Molecular Biology and Evolution, 22(1):29–39, 2005.

141. UniProt Consortium. The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Research, 38:D142–148, 2010.

142. University of Notre Dame Center for Research Computing. http://crc.nd.edu.

143. VectorBase: A Bioinformatics Resource Center for Invertebrate Vectors of Human Pathogens. http://www.vectorbase.org.

144. VirtualBox. http://www.virtualbox.org.

145. J. L. Weber and E. W. Myers. Human whole-genome shotgun sequencing. Genome Research, 7:401–409, 1997.

146. J. D. Westervelt. Geographic information systems and agent-based modeling. In H. R. Gimblett, editor, Integrating Geographic Information Systems and Agent-based Modeling Techniques for Simulating Social and Ecological Processes. Oxford University Press, 2002.

147. B. P. Wheatley. The Sacred Monkeys of Bali. Waveland Press, 1999.

148. WikiPoson. http://www.bioinformatics.org/wikiposon/doku.php.

149. X. Xiang, R. Kennedy, G. Madey, and S. Cabaniss. Verification and validation of agent-based scientific simulation models. In L. Yilmaz, editor, Proceedings of the 2005 Agent-Directed Simulation Symposium, volume 37, pages 47–55. The Society for Modeling and Simulation International, April 2005.
