Repeatmodeler2 for Automated Genomic Discovery of Transposable Element Families

Total Page:16

File Type:pdf, Size:1020Kb

Repeatmodeler2 for Automated Genomic Discovery of Transposable Element Families RepeatModeler2 for automated genomic discovery of transposable element families Jullien M. Flynna,1, Robert Hubleyb,1, Clément Gouberta, Jeb Rosenb, Andrew G. Clarka,2, Cédric Feschottea,2, and Arian F. Smitb,2 aDepartment of Molecular Biology and Genetics, Cornell University, Ithaca, NY 14853; and bInstitute for Systems Biology, Seattle, WA 98109 Contributed by Andrew G. Clark, March 5, 2020 (sent for review December 2, 2019; reviewed by Irina R. Arkhipova and Molly C. Gale Hammell) The accelerating pace of genome sequencing throughout the tree org/classification): Class I retroelements replicate and transpose of life is driving the need for improved unsupervised annotation of via an RNA intermediate; while class II elements (or DNA genome components such as transposable elements (TEs). Because transposons) are mobilized via a DNA intermediate. Class I el- the types and sequences of TEs are highly variable across species, ements include long and short interspersed elements (LINEs, automated TE discovery and annotation are challenging and time- SINEs) and long terminal repeat (LTR) retrotransposons. The consuming tasks. A critical first step is the de novo identification most common class II transposons are TIR (terminal inverted and accurate compilation of sequence models representing all of repeats) elements, which transpose via a “cut-and-paste” mech- the unique TE families dispersed in the genome. Here we intro- anism (13). But other class II elements, such as Helitrons, also duce RepeatModeler2, a pipeline that greatly facilitates this pro- use replicative mechanisms (14–16). Within each class, TE se- cess. This program brings substantial improvements over the quences are extremely diverse and evolve rapidly (4, 10, 17). original version of RepeatModeler, one of the most widely used Additionally, once integrated in the host genome, each element tools for TE discovery. In particular, this version incorporates a is subject to mutations, such as point mutations, and a vast array module for structural discovery of complete long terminal repeat of rearrangements, including internal deletions, truncations, and (LTR) retroelements, which are widespread in eukaryotic genomes nested insertions. The vast sequence diversity of TEs combined but recalcitrant to automated identification because of their size with the complexity of mutations that occur after insertion makes and sequence complexity. We benchmarked RepeatModeler2 on three model species with diverse TE landscapes and high-quality, automated TE identification and classification a daunting task. GENETICS manually curated TE libraries: Drosophila melanogaster (fruit fly), The most elementary level of classification of TEs is the Danio rerio (zebrafish), and Oryza sativa (rice). In these three spe- family, which designates interspersed genomic copies derived cies, RepeatModeler2 identified approximately 3 times more con- from the amplification of an ancestral progenitor sequence (10). sensus sequences matching with >95% sequence identity and sequence coverage to the manually curated sequences than the Significance original RepeatModeler. As expected, the greatest improvement is for LTR retroelements. Thus, RepeatModeler2 represents a valu- Genome sequences are being produced for more and more able addition to the genome annotation toolkit that will enhance eukaryotic species. The bulk of these genomes are composed of the identification and study of TEs in eukaryotic genome sequences. parasitic, self-mobilizing transposable elements (TEs) that play RepeatModeler2 is available as source code or a containerized pack- important roles in organismal evolution. Thus there is a age under an open license (https://github.com/Dfam-consortium/ pressing need for developing software that can accurately RepeatModeler, http://www.repeatmasker.org/RepeatModeler/). identify the diverse set of TEs dispersed in genome sequences. Here we introduce RepeatModeler2, an easy-to-use package genome annotation | mobile genetic elements | transposon families for the production of reference TE libraries which can be ap- plied to any eukaryotic species. Through several major im- ost eukaryotic genomes contain a large number of in- provements over the previous version, RepeatModeler2 is able Mterspersed repeats that by and large represent copies of to produce libraries that recapitulate the known composition transposable elements (TEs) at varying stages of evolutionary of three model species with some of the most complex TE decay (1–5). TEs are genomic sequences capable of mobilization landscapes. Thus RepeatModeler2 will greatly enhance the and replication, generating complex patterns of repeats that discovery and annotation of TEs in genome sequences. account for up to 85% of eukaryotic genome content (6). Dif- Author contributions: J.M.F., R.H., C.G., C.F., and A.F.S. designed research; J.M.F., R.H., and ferent organisms have diverse TE landscapes, including a wide J.R. performed research; A.G.C., C.F., and A.F.S. supervised the research; and J.M.F., R.H., range of abundances, activity levels, and sequence degradation C.G., J.R., A.G.C., C.F., and A.F.S. wrote the paper. levels (7, 8). As mutagens and major contributors to the orga- Reviewers: I.R.A., Marine Biological Laboratory; and M.C.G.H., Cold Spring Harbor nization, rearrangement, and regulation of the genome, TEs Laboratory. have had a profound impact on organismal evolution (reviewed Competing interest statement: C.F. and M.C.G.H. are coauthors on a 2018 review article: in ref. 4). Our understanding of the biological impact of TEs has https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1577-z. grown steadily through the study of both model and nonmodel Published under the PNAS license. organisms from which whole-genome sequences can now be Data deposition: The RepeatModeler2 software is available at https://github.com/Dfam- routinely assembled. With each new species sequenced comes consortium/RepeatModeler,andhttp://www.repeatmasker.org/RepeatModeler/. Scripts the challenge of identifying its unique set of TE families, which used for evaluating the benchmarking results are available here: https://github.com/ jmf422/TE_annotation/. The TE libraries produced by RepeatModeler1 and RepeatMod- remains a tedious and largely manual endeavor. Yet, the accu- eler2 that were used for benchmarking are available here: https://github.com/jmf422/TE_ rate identification of TEs and other repeats is a prerequisite to annotation/master/benchmark_libraries. nearly all other genomic analysis, including the annotation of 1J.M.F. and R.H. contributed equally to this work. genes (9). 2To whom correspondence may be addressed. Email: [email protected], cf458@ What makes TEs so potent in remodeling the genome but also cornell.edu, or [email protected]. challenging to annotate is their diversity in structures and se- This article contains supporting information online at https://www.pnas.org/lookup/suppl/ quences, which greatly vary across species (3, 4). There are two doi:10.1073/pnas.1921046117/-/DCSupplemental. major classes of TEs (reviewed in refs. 10–12; https://www.dfam. First published April 16, 2020. www.pnas.org/cgi/doi/10.1073/pnas.1921046117 PNAS | April 28, 2020 | vol. 117 | no. 17 | 9451–9457 Downloaded by guest on September 28, 2021 Each TE family can be represented by a consensus sequence Methods approximating that of the ancestral progenitor. Such consensus RepeatModeler2 Overview. RepeatModeler is a pipeline for automated de sequence can be recreated from a multiple alignment of indi- novo identification of TEs that employs two distinct discovery algorithms, vidual genomic copies (or “seeds”) from which each ancestral RepeatScout (32) and RECON (33), followed by consensus building and nucleotide can be inferred based on a majority rule along the classification steps. In addition, RepeatModeler2 now includes the LTRharvest (30) and LTR_retriever (34) tools. Our tool takes advantage of the alignment. Similarly, the seed alignment may be used to generate unique strengths of each approach as well as providing a tractable solution a profile Hidden Markov Model (HMM) for each family. Con- to analyzing large datasets such as whole-genome assemblies. For instance, sensus TE sequences and HMMs are used for many downstream RepeatScout uses high-frequency sequence word counts to identify in- applications in the study and annotation of genomes. Notably, terspersed repeat seeds (short regions of putative homology) and then they are used to annotate or “mask” the genome using performs an iterative extension of a multiple alignment around the aligned RepeatMasker, which is a prerequisite for gene annotation (9). seeds, similar to the seed and extend phase of the BLAST pairwise alignment algorithm. While RepeatScout’s implementation of this algorithm is fast, the Consensus sequences are generally stored in widely used data- program input is currently limited to ∼1 Gbp and the alignment scoring bases such as Repbase (18) or along with seed alignments and system (+1/−1, and nonaffine gap penalty) limits the divergence of discov- HMMs in Dfam (19). Seed alignments and accurate sequence ered families. Despite these limitations, RepeatScout serves well as a fast models are critical for reconstructing the evolutionary history of method to discover the youngest and most abundant families given a small TEs and are used for a variety of biological studies, including the sequence sample from a genome. RECON, on the other hand, provides so- study of
Recommended publications
  • Multiple Origins of Viral Capsid Proteins from Cellular Ancestors
    Multiple origins of viral capsid proteins from PNAS PLUS cellular ancestors Mart Krupovica,1 and Eugene V. Kooninb,1 aInstitut Pasteur, Department of Microbiology, Unité Biologie Moléculaire du Gène chez les Extrêmophiles, 75015 Paris, France; and bNational Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894 Contributed by Eugene V. Koonin, February 3, 2017 (sent for review December 21, 2016; reviewed by C. Martin Lawrence and Kenneth Stedman) Viruses are the most abundant biological entities on earth and show genome replication. Understanding the origin of any virus group is remarkable diversity of genome sequences, replication and expres- possible only if the provenances of both components are elucidated sion strategies, and virion structures. Evolutionary genomics of (11). Given that viral replication proteins often have no closely viruses revealed many unexpected connections but the general related homologs in known cellular organisms (6, 12), it has been scenario(s) for the evolution of the virosphere remains a matter of suggested that many of these proteins evolved in the precellular intense debate among proponents of the cellular regression, escaped world (4, 6) or in primordial, now extinct, cellular lineages (5, 10, genes, and primordial virus world hypotheses. A comprehensive 13). The ability to transfer the genetic information encased within sequence and structure analysis of major virion proteins indicates capsids—the protective proteinaceous shells that comprise the that they evolved on about 20 independent occasions, and in some of cores of virus particles (virions)—is unique to bona fide viruses and these cases likely ancestors are identifiable among the proteins of distinguishes them from other types of selfish genetic elements cellular organisms.
    [Show full text]
  • Best Practice Recommendations for Internal Validation of Human Short Tandem Repeat Profiling on Capillary Electrophoresis Platforms
    Best Practice Recommendations for Internal Validation of Human Short Tandem Repeat Profiling on Capillary Electrophoresis Platforms Best Practice Recommendations for Internal Validation of Human Short Tandem Repeat Profiling on Capillary Electrophoresis Platforms Biological Methods Subcommittee Biology/DNA Scientific Area Committee Organization of Scientific Area Committees (OSAC) for Forensic Science Best Practice Recommendations for Internal Validation of Human Short Tandem Repeat Profiling on Capillary Electrophoresis Platforms OSAC Proposed Standard Best Practice Recommendations for Internal Validation of Human Short Tandem Repeat Profiling on Capillary Electrophoresis Platforms Prepared by Biological Methods Subcommittee Version: 1.0 Disclaimer: This document has been developed by the Biological Methods Subcommittee of the Organization of Scientific Area Committees (OSAC) for Forensic Science through a consensus process and is proposed for further development through a Standard Developing Organization (SDO). This document is being made available so that the forensic science community and interested parties can consider the recommendations of the OSAC pertaining to applicable forensic science practices. The document was developed with input from experts in a broad array of forensic science disciplines as well as scientific research, measurement science, statistics, law, and policy. This document has not been published by an SDO. Its contents are subject to change during the standards development process. All interested groups or individuals are strongly encouraged to submit comments on this proposed document during the open comment period administered by the Academy Standards Board (www.asbstandardsboard.org). Best Practice Recommendations for Internal Validation of Human Short Tandem Repeat Profiling on Capillary Electrophoresis Platforms Foreword This document outlines best practice recommendations for the internal validation of human short tandem repeat DNA profiling on capillary electrophoresis platforms utilized in forensic laboratories.
    [Show full text]
  • Table 1. Unclassified Proteins Encoded by Polintons Polintons
    Table 1. Unclassified proteins encoded by Polintons Protein Polintons Length, Similar Description aa GenBank Family Species proteins* Polinton-1_DR Fish 280 Polinton-2_DR Fish 282 Polinton-1_XT Frog 248 Polinton-2_XT Frog 247 Polinton-1_SPU Lizard - Polinton-1_CI Sea squirt 249 A profile derived from the PX multiple alignment PX Polinton-2_CI Sea squirt 248 — does not match (based on PSI-BLAST) any proteins Polinton-1_SP Sea urchin 252 that are not encoded by Polintons. Polinton-2_SP Sea urchin 255 Polinton-3_SP Sea urchin 253 Polinton-5_SP Sea urchin 255 Polinton-1_TC Beatle 291 Polinton-1_DY Fruit fly 182 Polinton-1_DR Fish 435 Polinton-2_DR Fish 435 Polinton-1_XT Frog 436 Polinton-2_XT Frog 435 Polinton-1_SPU Lizard 435 This protein is the most conserved one among the Polinton-1_CI Sea squirt 431 Polinton-encoded unclassified proteins. Its Polinton-2_CI Sea squirt 428 conservation is comparable to that of POLB and PY Polinton-1_SP Sea urchin 442 — INT. Polinton-2_SP Sea urchin 442 A profile derived from the PX multiple alignment Polinton-3_SP Sea urchin 444 does not match any proteins that are not encoded by Polinton-4_SP Sea urchin 444 Polintons. Polinton-5_SP Sea urchin 444 Polinton-1_TC Beatle 437 Polinton-1_DY Fruit fly 430 Polinton-1_CB Nematode 287 Polinton-1_DR Fish 151 Polinton-2_DR Fish 156 Polinton-1_XT Frog 146 Polinton-2_XT Frog 147 Polinton-1_CI Sea squirt 141 Polinton-2_CI Sea squirt 140 A profile derived from the PX multiple alignment PW Polinton-1_SP Sea urchin 121 — does not match any proteins that are not encoded by Polinton-2_SP Sea urchin 120 Polintons.
    [Show full text]
  • Interfering with Retrotransposition by Two Types of CRISPR Effectors
    Zhang et al. Cell Discovery (2020) 6:30 Cell Discovery https://doi.org/10.1038/s41421-020-0164-0 www.nature.com/celldisc ARTICLE Open Access Interfering with retrotransposition by two types of CRISPR effectors: Cas12a and Cas13a Niubing Zhang1,2, Xinyun Jing1, Yuanhua Liu3, Minjie Chen1,2, Xianfeng Zhu2, Jing Jiang2, Hongbing Wang4, Xuan Li1 and Pei Hao3 Abstract CRISPRs are a promising tool being explored in combating exogenous retroviral pathogens and in disabling endogenous retroviruses for organ transplantation. The Cas12a and Cas13a systems offer novel mechanisms of CRISPR actions that have not been evaluated for retrovirus interference. Particularly, a latest study revealed that the activated Cas13a provided bacterial hosts with a “passive protection” mechanism to defend against DNA phage infection by inducing cell growth arrest in infected cells, which is especially significant as it endows Cas13a, a RNA-targeting CRISPR effector, with mount defense against both RNA and DNA invaders. Here, by refitting long terminal repeat retrotransposon Tf1 as a model system, which shares common features with retrovirus regarding their replication mechanism and life cycle, we repurposed CRISPR-Cas12a and -Cas13a to interfere with Tf1 retrotransposition, and evaluated their different mechanisms of action. Cas12a exhibited strong inhibition on retrotransposition, allowing marginal Tf1 transposition that was likely the result of a lasting pool of Tf1 RNA/cDNA intermediates protected within virus-like particles. The residual activities, however, were completely eliminated with new constructs for persistent crRNA targeting. On the other hand, targeting Cas13a to Tf1 RNA intermediates significantly inhibited Tf1 retrotransposition. However, unlike in bacterial hosts, the sustained activation of Cas13a by Tf1 transcripts did not cause cell growth arrest in S.
    [Show full text]
  • Complete Article
    The EMBO Journal Vol. I No. 12 pp. 1539-1544, 1982 Long terminal repeat-like elements flank a human immunoglobulin epsilon pseudogene that lacks introns Shintaro Ueda', Sumiko Nakai, Yasuyoshi Nishida, lack the entire IVS have been found in the gene families of the Hiroshi Hisajima, and Tasuku Honjo* mouse a-globin (Nishioka et al., 1980; Vanin et al., 1980), the lambda chain (Hollis et al., 1982), Department of Genetics, Osaka University Medical School, Osaka 530, human immunoglobulin Japan and the human ,B-tubulin (Wilde et al., 1982a, 1982b). The mouse a-globin processed gene is flanked by long terminal Communicated by K.Rajewsky Received on 30 September 1982 repeats (LTRs) of retrovirus-like intracisternal A particles on both sides, although their orientation is opposite to each There are at least three immunoglobulin epsilon genes (C,1, other (Lueders et al., 1982). The human processed genes CE2, and CE) in the human genome. The nucleotide sequences described above have poly(A)-like tails -20 bases 3' to the of the expressed epsilon gene (CE,) and one (CE) of the two putative poly(A) addition signal and are flanked by direct epsilon pseudogenes were compared. The results show that repeats of several bases on both sides (Hollis et al., 1982; the CE3 gene lacks the three intervening sequences entirely and Wilde et al., 1982a, 1982b). Such direct repeats, which were has a 31-base A-rich sequence 16 bases 3' to the putative also found in human small nuclear RNA pseudogenes poly(A) addition signal, indicating that the CE3 gene is a pro- (Arsdell et al., 1981), might have been formed by repair of cessed gene.
    [Show full text]
  • Retroviral Insertions Into a Herpesvirus Are Clustered At
    Proc. Natl. Acad. Sci. USA Vol. 90, pp. 3855-3859, May 1993 Biochemistry Retroviral insertions into a herpesvirus are clustered at the junctions of the short repeat and short unique sequences (Marek disease virus/reticuloendotheliosis virus/long terminal repeat/insertional mutagenesis/retroviral integration) DAN JONES*, ROBERT ISFORTt, RICHARD WITTERt, RHONDA KoST*, AND HSING-JIEN KUNG*§ *Department of Molecular Biology and Microbiology, Case Western Reserve University, Cleveland, OH 44106; tGenetic Toxicology Section, Human and Environmental Safety Division, The Procter and Gamble Company, Miami Valley Laboratories, Cincinnati, OH 45239; and tU.S. Department of Agriculture Agricultural Research Service, Avian Disease and Oncology Laboratory, 3606 East Mount Hope, East Lansing, MI 48823 Communicated by Frederick C. Robbins, January 11, 1993 (receivedfor review November 3, 1992) ABSTRACT We previously described the integration of a integrations are associated with large structural changes in nonacute retrovirus, reticuloendotheliosis virus (REV), into the MDV genome in both the Us and Rs regions (11). the genome of a herpesvirus, Marek disease virus (MDV), By coinfection of duck embryo fibroblasts (DEFs) with following both long-term and short-term coinfection in cul- both REV and MDV, we demonstrated that this process of tured fibroblasts. The long-term coinfection occurred in the retroviral integration can also occur within several passages course of attenuating the oncogenicity ofthe JM strain ofMDV of cocultivation (10). The structure of the inserted REV and was sustained for >100 passages. The short-term coinfec- sequences resembles those of cellular integrated proviruses, tion, which spanned only 16 passages, was designed to recreate with loss of the terminal nucleotides of the retrovirus and a the insertion phenomenon under controlled conditions.
    [Show full text]
  • Microsatellites Within the Feline Androgen Receptor Are
    JFM0010.1177/1098612X16634386Journal of Feline Medicine and SurgeryFarwick et al 634386research-article2016 Original Article Journal of Feline Medicine and Surgery 1 –7 Microsatellites within the feline © The Author(s) 2016 Reprints and permissions: androgen receptor are suitable for X sagepub.co.uk/journalsPermissions.nav DOI: 10.1177/1098612X16634386 chromosome-linked clonality testing jfms.com in archival material Nadine M Farwick1, Robert Klopfleisch1, Achim D Gruber1 and Alexander Th A Weiss2 Abstract Objectives A hallmark of neoplasms is their origin from a single cell; that is, clonality. Many techniques have been developed in human medicine to utilise this feature of tumours for diagnostic purposes. One approach is X chromosome-linked clonality testing using polymorphisms of genes encoded by genes on the X chromosome. The aim of this study was to determine if the feline androgen receptor gene was suitable for X chromosome-linked clonality testing. Methods The feline androgen receptor gene, was characterised and used to test clonality of feline lymphomas by PCR and polyacrylamide gel electrophoresis, using archival formalin-fixed, paraffin-embedded material. Results Clonality of the feline lymphomas under study was confirmed and the gene locus was shown to represent a suitable target in clonality testing. Conclusions and relevance Because there are some pitfalls using X chromosome-linked clonality testing, further studies are necessary to establish this technique in the cat. Accepted: 26 January 2016 Introduction Most tumours arise
    [Show full text]
  • A Field Guide to Eukaryotic Transposable Elements
    GE54CH23_Feschotte ARjats.cls September 12, 2020 7:34 Annual Review of Genetics A Field Guide to Eukaryotic Transposable Elements Jonathan N. Wells and Cédric Feschotte Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York 14850; email: [email protected], [email protected] Annu. Rev. Genet. 2020. 54:23.1–23.23 Keywords The Annual Review of Genetics is online at transposons, retrotransposons, transposition mechanisms, transposable genet.annualreviews.org element origins, genome evolution https://doi.org/10.1146/annurev-genet-040620- 022145 Abstract Annu. Rev. Genet. 2020.54. Downloaded from www.annualreviews.org Access provided by Cornell University on 09/26/20. For personal use only. Copyright © 2020 by Annual Reviews. Transposable elements (TEs) are mobile DNA sequences that propagate All rights reserved within genomes. Through diverse invasion strategies, TEs have come to oc- cupy a substantial fraction of nearly all eukaryotic genomes, and they rep- resent a major source of genetic variation and novelty. Here we review the defining features of each major group of eukaryotic TEs and explore their evolutionary origins and relationships. We discuss how the unique biology of different TEs influences their propagation and distribution within and across genomes. Environmental and genetic factors acting at the level of the host species further modulate the activity, diversification, and fate of TEs, producing the dramatic variation in TE content observed across eukaryotes. We argue that cataloging TE diversity and dissecting the idiosyncratic be- havior of individual elements are crucial to expanding our comprehension of their impact on the biology of genomes and the evolution of species. 23.1 Review in Advance first posted on , September 21, 2020.
    [Show full text]
  • Tandemly Arrayed Genes in Vertebrate Genomes
    Tandemly Arrayed Genes in Vertebrate Genomes Deng Pan1 and Liqing Zhang2∗ 1Department of Computer Science, Virginia Tech, 2050 Torgerson Hall, Blacksburg, VA 24061-0106, USA 2Program in Genetics, Bioinformatics, and Computational Biology ∗To whom correspondence should be addressed; E-mail: [email protected] July 9, 2008 Abstract Tandemly arrayed genes (TAGs) are duplicated genes that are linked as neighbors on a chromosome, many of which have important physio- logical and biochemical functions. Here we performed a survey of these genes in 11 available vertebrate genomes. TAGs account for an average of about 14% of all genes in these vertebrate genomes, and about 25% of all duplications. The majority of TAGs (72-94%) have parallel tran- scription orientation (i.e. are encoded on the same strand), in contrast to the genome, which has about 50% of its genes in parallel transcrip- tion orientation. The majority of tandem arrays have only two members. In all species, the proportion of genes that belong to TAGs tends to be higher in large gene families than in small ones; together with our re- cent finding that tandem duplication played a more important role than retroposition in large families, this fact suggests that among all types of duplication mechanisms, tandem duplication is the predominant mecha- nism of duplication, especially in large families. Species-specific tandem arrays (SSTAs) are identified and results of statistical tests suggest that natural selection played a role in maintaining some of the SSTAs. Our 1 work will provide more insight into the evolutionary history of TAGs in vertebrates. 2 1 Introduction Although the importance of duplicated genes in providing raw materials for genetic innovation has been recognized since the 1930s and is highlighted in Ohno’s book Evolution by gene duplication (Ohno, 1970), it is only recently that the availability of numerous genomic sequences has made it possible to quantitatively estimate how many genes in a genome are generated by gene duplication.
    [Show full text]
  • Evolution of Orthologous Tandemly Arrayed Gene Clusters
    Tremblay Savard et al. BMC Bioinformatics 2011, 12(Suppl 9):S2 http://www.biomedcentral.com/1471-2105/12/S9/S2 PROCEEDINGS Open Access Evolution of orthologous tandemly arrayed gene clusters Olivier Tremblay Savard1*, Denis Bertrand2*, Nadia El-Mabrouk1* From Ninth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Com- parative Genomics Galway, Ireland. 8-10 October 2011 Abstract Background: Tandemly Arrayed Gene (TAG) clusters are groups of paralogous genes that are found adjacent on a chromosome. TAGs represent an important repertoire of genes in eukaryotes. In addition to tandem duplication events, TAG clusters are affected during their evolution by other mechanisms, such as inversion and deletion events, that affect the order and orientation of genes. The DILTAG algorithm developed in [1] makes it possible to infer a set of optimal evolutionary histories explaining the evolution of a single TAG cluster, from an ancestral single gene, through tandem duplications (simple or multiple, direct or inverted), deletions and inversion events. Results: We present a general methodology, which is an extension of DILTAG, for the study of the evolutionary history of a set of orthologous TAG clusters in multiple species. In addition to the speciation events reflected by the phylogenetic tree of the considered species, the evolutionary events that are taken into account are simple or multiple tandem duplications, direct or inverted, simple or multiple deletions, and inversions. We analysed the performance of our algorithm on simulated data sets and we applied it to the protocadherin gene clusters of human, chimpanzee, mouse and rat. Conclusions: Our results obtained on simulated data sets showed a good performance in inferring the total number and size distribution of duplication events.
    [Show full text]
  • Human Immunodeficiency Virus Type 1 Two-Long Terminal Repeat
    Isabel Olivares, et al.: HIV-1 Two-Long Terminal Repeat Circles Contents available at PubMed www.aidsreviews.com PERMANYER AIDS Rev. 2016;18:23-31 www.permanyer.com Human Immunodeficiency Virus Type 1 Two-Long Terminal Repeat Circles: A Subject for Debate Isabel Olivares, María Pernas, Concepción Casado and Cecilio López-Galindez Department of Molecular Virology, Centro Nacional de Microbiología, Instituto de Salud Carlos III, Majadahonda, Madrid, Spain © Permanyer Publications 2016 Abstract . HIV-1 infections are characterized by the integration of the reverse transcribed genomic RNA into the host chromosomes making up the provirus. In addition to the integrated proviral DNA, there are other forms of linear and circular unintegrated viral DNA in HIV-1-infected cells. One of these forms, known as two-long terminal repeat circles, has been extensively studied and characterized both in in vitro infected cells and in cells of the publisher from patients. Detection of two-long terminal repeat circles has been proposed as a marker of antire troviral treatment efficacy or ongoing replication in patients with undetectable viral load. But not all authors agree with this use because of the uncertainty about the lifespan of the two-long terminal repeat circles. We review the major studies estimating the half-life of the two-long terminal repeat circles as well as those proposing its detection as a marker of ongoing replication or therapeutic efficacy. We also review the charac- teristic of these circular forms and the difficulties in its detection and quantification. The variety of approaches and methods used in the two-long terminal repeat quantification as well as the low reliability of some methods make the comparison between results difficult.
    [Show full text]
  • Correlated Variation and Population Differentiation in Satellite DNA Abundance Among Lines of Drosophila Melanogaster
    Correlated variation and population differentiation in satellite DNA abundance among lines of Drosophila melanogaster Kevin H.-C. Wei, Jennifer K. Grenier, Daniel A. Barbash, and Andrew G. Clark1 Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY 14853-2703 Contributed by Andrew G. Clark, November 18, 2014 (sent for review October 1, 2014; reviewed by Giovanni Bosco and Keith A. Maggert) Tandemly repeating satellite DNA elements in heterochromatin repeats of simple sequences (≤10 bp); the most abundant include occupy a substantial portion of many eukaryotic genomes. Al- AAGAG (aka GAGA-satellite), AACATAGAAT (aka 2L3L), though often characterized as genomic parasites deleterious to and AATAT (11, 14, 15). In comparison, the genome of its sister the host, they also can be crucial for essential processes such as species Drosophila simulans,fromwhichD. melanogaster di- chromosome segregation. Adding to their interest, satellite DNA verged ∼2.5 mya, is estimated to have only 5% satellite DNA, elements evolve at high rates; among Drosophila, closely related more than 10-fold less AAGAG, and little to no AACATA- species often differ drastically in both the types and abundances GAAT (16). For further contrast, nearly 50% of the Drosophila of satellite repeats. However, due to technical challenges, the evo- virilis and less than 0.5% of Drosophila erecta genomes are sat- lutionary mechanisms driving this rapid turnover remain unclear. ellite DNA (16, 17). Strikingly, such rapid changes in genomes have been implicated in postzygotic isolation of species in the Here we characterize natural variation in simple-sequence repeats – of 2–10 bp from inbred Drosophila melanogaster lines derived form of hybrid incompatibility in several species of flies (18 20), from multiple populations, using a method we developed called demonstrating the critical role satellite DNA has on the evolu- k-Seek that analyzes unassembled Illumina sequence reads.
    [Show full text]