Repeatmodeler2 for Automated Genomic Discovery of Transposable Element Families

RepeatModeler2 for automated genomic discovery of transposable element families Jullien M. Flynna,1, Robert Hubleyb,1, Clément Gouberta, Jeb Rosenb, Andrew G. Clarka,2, Cédric Feschottea,2, and Arian F. Smitb,2 aDepartment of Molecular Biology and Genetics, Cornell University, Ithaca, NY 14853; and bInstitute for Systems Biology, Seattle, WA 98109 Contributed by Andrew G. Clark, March 5, 2020 (sent for review December 2, 2019; reviewed by Irina R. Arkhipova and Molly C. Gale Hammell) The accelerating pace of genome sequencing throughout the tree org/classification): Class I retroelements replicate and transpose of life is driving the need for improved unsupervised annotation of via an RNA intermediate; while class II elements (or DNA genome components such as transposable elements (TEs). Because transposons) are mobilized via a DNA intermediate. Class I el- the types and sequences of TEs are highly variable across species, ements include long and short interspersed elements (LINEs, automated TE discovery and annotation are challenging and time- SINEs) and long terminal repeat (LTR) retrotransposons. The consuming tasks. A critical first step is the de novo identification most common class II transposons are TIR (terminal inverted and accurate compilation of sequence models representing all of repeats) elements, which transpose via a “cut-and-paste” mech- the unique TE families dispersed in the genome. Here we intro- anism (13). But other class II elements, such as Helitrons, also duce RepeatModeler2, a pipeline that greatly facilitates this pro- use replicative mechanisms (14–16). Within each class, TE se- cess. This program brings substantial improvements over the quences are extremely diverse and evolve rapidly (4, 10, 17). original version of RepeatModeler, one of the most widely used Additionally, once integrated in the host genome, each element tools for TE discovery. In particular, this version incorporates a is subject to mutations, such as point mutations, and a vast array module for structural discovery of complete long terminal repeat of rearrangements, including internal deletions, truncations, and (LTR) retroelements, which are widespread in eukaryotic genomes nested insertions. The vast sequence diversity of TEs combined but recalcitrant to automated identification because of their size with the complexity of mutations that occur after insertion makes and sequence complexity. We benchmarked RepeatModeler2 on three model species with diverse TE landscapes and high-quality, automated TE identification and classification a daunting task. GENETICS manually curated TE libraries: Drosophila melanogaster (fruit fly), The most elementary level of classification of TEs is the Danio rerio (zebrafish), and Oryza sativa (rice). In these three spe- family, which designates interspersed genomic copies derived cies, RepeatModeler2 identified approximately 3 times more con- from the amplification of an ancestral progenitor sequence (10). sensus sequences matching with >95% sequence identity and sequence coverage to the manually curated sequences than the Significance original RepeatModeler. As expected, the greatest improvement is for LTR retroelements. Thus, RepeatModeler2 represents a valu- Genome sequences are being produced for more and more able addition to the genome annotation toolkit that will enhance eukaryotic species. The bulk of these genomes are composed of the identification and study of TEs in eukaryotic genome sequences. parasitic, self-mobilizing transposable elements (TEs) that play RepeatModeler2 is available as source code or a containerized pack- important roles in organismal evolution. Thus there is a age under an open license (https://github.com/Dfam-consortium/ pressing need for developing software that can accurately RepeatModeler, http://www.repeatmasker.org/RepeatModeler/). identify the diverse set of TEs dispersed in genome sequences. Here we introduce RepeatModeler2, an easy-to-use package genome annotation | mobile genetic elements | transposon families for the production of reference TE libraries which can be ap- plied to any eukaryotic species. Through several major im- ost eukaryotic genomes contain a large number of in- provements over the previous version, RepeatModeler2 is able Mterspersed repeats that by and large represent copies of to produce libraries that recapitulate the known composition transposable elements (TEs) at varying stages of evolutionary of three model species with some of the most complex TE decay (1–5). TEs are genomic sequences capable of mobilization landscapes. Thus RepeatModeler2 will greatly enhance the and replication, generating complex patterns of repeats that discovery and annotation of TEs in genome sequences. account for up to 85% of eukaryotic genome content (6). Dif- Author contributions: J.M.F., R.H., C.G., C.F., and A.F.S. designed research; J.M.F., R.H., and ferent organisms have diverse TE landscapes, including a wide J.R. performed research; A.G.C., C.F., and A.F.S. supervised the research; and J.M.F., R.H., range of abundances, activity levels, and sequence degradation C.G., J.R., A.G.C., C.F., and A.F.S. wrote the paper. levels (7, 8). As mutagens and major contributors to the orga- Reviewers: I.R.A., Marine Biological Laboratory; and M.C.G.H., Cold Spring Harbor nization, rearrangement, and regulation of the genome, TEs Laboratory. have had a profound impact on organismal evolution (reviewed Competing interest statement: C.F. and M.C.G.H. are coauthors on a 2018 review article: in ref. 4). Our understanding of the biological impact of TEs has https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1577-z. grown steadily through the study of both model and nonmodel Published under the PNAS license. organisms from which whole-genome sequences can now be Data deposition: The RepeatModeler2 software is available at https://github.com/Dfam- routinely assembled. With each new species sequenced comes consortium/RepeatModeler,andhttp://www.repeatmasker.org/RepeatModeler/. Scripts the challenge of identifying its unique set of TE families, which used for evaluating the benchmarking results are available here: https://github.com/ jmf422/TE_annotation/. The TE libraries produced by RepeatModeler1 and RepeatMod- remains a tedious and largely manual endeavor. Yet, the accu- eler2 that were used for benchmarking are available here: https://github.com/jmf422/TE_ rate identification of TEs and other repeats is a prerequisite to annotation/master/benchmark_libraries. nearly all other genomic analysis, including the annotation of 1J.M.F. and R.H. contributed equally to this work. genes (9). 2To whom correspondence may be addressed. Email: [email protected], cf458@ What makes TEs so potent in remodeling the genome but also cornell.edu, or [email protected]. challenging to annotate is their diversity in structures and se- This article contains supporting information online at https://www.pnas.org/lookup/suppl/ quences, which greatly vary across species (3, 4). There are two doi:10.1073/pnas.1921046117/-/DCSupplemental. major classes of TEs (reviewed in refs. 10–12; https://www.dfam. First published April 16, 2020. www.pnas.org/cgi/doi/10.1073/pnas.1921046117 PNAS | April 28, 2020 | vol. 117 | no. 17 | 9451–9457 Downloaded by guest on September 28, 2021 Each TE family can be represented by a consensus sequence Methods approximating that of the ancestral progenitor. Such consensus RepeatModeler2 Overview. RepeatModeler is a pipeline for automated de sequence can be recreated from a multiple alignment of indi- novo identification of TEs that employs two distinct discovery algorithms, vidual genomic copies (or “seeds”) from which each ancestral RepeatScout (32) and RECON (33), followed by consensus building and nucleotide can be inferred based on a majority rule along the classification steps. In addition, RepeatModeler2 now includes the LTRharvest (30) and LTR_retriever (34) tools. Our tool takes advantage of the alignment. Similarly, the seed alignment may be used to generate unique strengths of each approach as well as providing a tractable solution a profile Hidden Markov Model (HMM) for each family. Con- to analyzing large datasets such as whole-genome assemblies. For instance, sensus TE sequences and HMMs are used for many downstream RepeatScout uses high-frequency sequence word counts to identify in- applications in the study and annotation of genomes. Notably, terspersed repeat seeds (short regions of putative homology) and then they are used to annotate or “mask” the genome using performs an iterative extension of a multiple alignment around the aligned RepeatMasker, which is a prerequisite for gene annotation (9). seeds, similar to the seed and extend phase of the BLAST pairwise alignment algorithm. While RepeatScout’s implementation of this algorithm is fast, the Consensus sequences are generally stored in widely used data- program input is currently limited to ∼1 Gbp and the alignment scoring bases such as Repbase (18) or along with seed alignments and system (+1/−1, and nonaffine gap penalty) limits the divergence of discov- HMMs in Dfam (19). Seed alignments and accurate sequence ered families. Despite these limitations, RepeatScout serves well as a fast models are critical for reconstructing the evolutionary history of method to discover the youngest and most abundant families given a small TEs and are used for a variety of biological studies, including the sequence sample from a genome. RECON, on the other hand, provides so- study of

Repeatmodeler2 for Automated Genomic Discovery of Transposable Element Families

Multiple Origins of Viral Capsid Proteins from Cellular Ancestors

Best Practice Recommendations for Internal Validation of Human Short Tandem Repeat Profiling on Capillary Electrophoresis Platforms

Table 1. Unclassified Proteins Encoded by Polintons Polintons

Interfering with Retrotransposition by Two Types of CRISPR Effectors

Complete Article

Retroviral Insertions Into a Herpesvirus Are Clustered At

Microsatellites Within the Feline Androgen Receptor Are

A Field Guide to Eukaryotic Transposable Elements

Tandemly Arrayed Genes in Vertebrate Genomes

Evolution of Orthologous Tandemly Arrayed Gene Clusters

Human Immunodeficiency Virus Type 1 Two-Long Terminal Repeat

Correlated Variation and Population Differentiation in Satellite DNA Abundance Among Lines of Drosophila Melanogaster