Identifying Repeats and Transposable Elements in Sequenced Genomes: How to ﬁnd Your Way Through the Dense Forest of Programs

Heredity (2010) 104, 520–533 & 2010 Macmillan Publishers Limited All rights reserved 0018-067X/10 $32.00 www.nature.com/hdy REVIEW Identifying repeats and transposable elements in sequenced genomes: how to find your way through the dense forest of programs E Lerat Universite´ de Lyon, F-6900 Lyon; Universite´ Lyon 1, Lyon, France; CNRS, UMR5558, Laboratoire de Biome´trie et Biologie Evolutive, F-69622 Villeurbanne, France The production of genome sequences has led to another describe here the main characteristics of the programs, their important advance in their annotation, which is closely linked main goals and the difficulties they can entail. The drawbacks to the exact determination of their content in terms of repeats, of the different methods are also highlighted to help biologists among which are transposable elements (TEs). The evolu- who are unfamiliar with algorithmic methods to understand tionary implications and the presence of coding regions in this methodology better. Globally, using several different some TEs can confuse gene annotation, and also hinder the programs and carrying out a cross comparison of their results process of genome assembly, making particularly crucial to be has the best chance of finding reliable results as any single able to annotate and classify them correctly in genome program. However, this makes it essential to verify the results sequences. This review is intended to provide an overview as provided by each program independently. The ideal solution comprehensive as possible of the automated methods would be to test all programs against the same data set to currently used to annotate and classify TEs in sequenced obtain a true comparison of their actual performance. genomes. Different categories of programs exist according to Heredity (2010) 104, 520–533; doi:10.1038/hdy.2009.165; their methodology and the repeat, which they can identify. I published online 25 November 2009 Keywords: transposable elements; bioinformatics; detection; annotation Introduction Repeats in genomes are classified on the basis of their sequence characteristics and of how they are formed. Over the past decade, genome sequencing has speeded One category consists of tandem repeats, and includes up with the improvement of the various techniques used. any sequences found in consecutive copies along a DNA The elucidation of genome sequences has made it strand. Several different categories of tandem repeats obvious that one important step in the deciphering of have been defined, depending on the number of repeats these sequences involves the accurate determination of and on the size of the repeated units. This group includes their repeat content. Repeats, and more particularly microsatellites or simple sequence repeats (short repeat transposable elements (TEs), were initially considered to units consisting of 1–6 nucleotides) and minisatellites constitute only a negligible part of eukaryotic genomes, (longer repeat units consisting of 10–60 nucleotides). although long before sequencing began, it was known Another category, on which this review will mainly that these elements can sometimes account for a major focus, is constituted by elements that are found dis- proportion of genomes (Britten and Kohne, 1968). persed across the whole genome, and which consists We now know that, depending on the organism, the mainly of TEs. TEs can be classified according to the proportion of TEs in the genome can differ widely, intermediate they use to move (Finnegan, 1989). Class-I ranging from a few percent (3% in the yeast Sacchar- TEs use an RNA intermediate to transpose by a ‘copy omyces cerevisiae; Kim et al., 1998) to a huge proportion and paste’ mechanism, whereas class-II TEs use a DNA encompassing almost the entire genome (480% in the intermediate, to transpose by a ‘cut and paste’ mechan- maize; SanMiguel et al., 1998), the human genome itself ism. Within each of these classes, TEs are further being particularly rich in repeats (which make up about subdivided on the basis of the structural features of 45%) (The International Human Genome Sequencing their sequences. The long terminal repeat (LTR) retro- Consortium, 2001). transposons are class-I repeats with direct repeats called LTRs at their extremities and have coding capacities (Figure 1). This class of elements is distinguished from Correspondence: Dr E Lerat, Laboratoire Biome´trie et Biologie Evolutive, that of the non-LTR retrotransposons, which consists of Universite´ Claude Bernard-Lyon 1, UMR-CNRS 5558-Bat Mendel, two main subclasses: the long interspersed nuclear 43 bd du 11 novembre 1918, Villeurbanne cedex 69622, France. E-mail: [email protected] elements (LINEs), which have coding capacities, and Received 15 September 2009; revised 23 October 2009; accepted 26 the short interspersed nuclear elements (SINEs), which October 2009; published online 25 November 2009 do not (Figure 1). The class-II TEs consist of the DNA Transposable element detection tools E Lerat 521 Class I: retrotransposons LTR-retrotransposons (5 to 9kb) PBS pol PPT gag PR INT RT RH env Non-LTR retrotransposons LINEs (5 to 8 kb) ORF1 ORF2 (A)n SINEs (80 to 500 bp) (A)n Class II: DNA transposons TIR (< 5 kb) Target site Dupication (TSD) transposase Terminal Inverted Repeat (TIR) MITE (500 pb) Long Terminal Repeat (LTR) Hairpin loop Helitron (5 kb) A- TC RPAI Helicase CTRR - T Figure 1 The different types of TEs. The LTR retrotransposons have a primer binding site (PBS) on the 3’ side, and a polypurine tract (PPT) on the 5’ size. Some also present a third ORF coding for env. The pol gene is formed by different domains coding for a protease (PR), an integrase (INT), a reverse transcriptase (RT) and an RnaseH (RH), respectively. The non-LTR retrotransposons have a poly A tail at their 3’ extremity. The LINEs display two ORFs whereas the SINEs have no coding capacity. The autonomous helitrons possess a coding capacity for a helicase and an RPA-like (RAPl) single-stranded DNA-binding protein. transposons (Figure 1). More recently, it has been The problem of identifying repeats in sequences is proposed that miniature inverted repeat transposable a recurrent difficulty in algorithmics and the automated elements (MITEs), DNA-based elements but which move detection of such elements is no trivial task. It is through a ‘copy and paste’ mechanism, could represent particularly difficult to determine the real boundaries a new subclass of class-II elements (Wicker et al., 2007). of these sequences accurately. They have indeed been The capacity of moving enables a given element to present within the genome for a long time, and even replicate itself, thus giving rise to a family that is though copies that belong to a given family are similar in represented by several copies of the same element. sequence, they are not identical because of evolutionary Given the relationship between elements, some families mechanisms that generate point mutations, rearrange- are phylogenetically related, which makes it possible to ments and indels. These mechanisms result in fragmen- construct the evolutionary history of these elements ted, divergent and mosaic copies that are difficult to (Capy et al., 1997). identify by similarity approaches. Another biological In addition to their numerical importance in genomes, characteristic that makes it difficult to identify their which is what makes these elements responsible for the boundaries is that TEs sometimes insert preferentially increase in genome size in most species, TEs are now into other TE copies to form nested elements (SanMiguel known to have a major part in genome evolution (Bie´mont et al., 1996; Kaminker et al., 2002). Depending on the and Vieira, 2006). Their role includes genome rearrange- family, the number of copies of a TE can range from one ment because of homologous recombination among copies or two copies for an ancient or not very active element to of a given family, gene innovation through various several million, as in the case of the SINEs in the human mechanisms, such as exon shuffling, gene regulation genome (The International Human Genome Sequencing through their own promoter regions, and insertional Consortium, 2001). The number of occurrences of a given mutations by direct insertion into genes (Kidwell and TE will depend on its activity in the genome, but also on Lisch, 2001). These various evolutionary implications and the species analyzed. In the Drosophila genome, the most the presence of coding regions in some TEs can lead to frequently occurring TE does not exceed hundreds of confusion in gene annotation, and can also complicate the copies (Kaminker et al., 2002; Lerat et al., 2003) except the process of genome assembly (Tang, 2007), which makes it newly described DINE-1 elements that display thousand particularly crucial to be able to annotate and classify TEs of non-autonomous copies (Kapitonov and Jurka, 2003). correctly in genome sequences. This means that the number of occurrence that can be Heredity Transposable element detection tools E Lerat 522 expected in a genome is not constant. This impacts the curated consensus sequences of repeats from various parameters of a program, which will usually have to be eukaryotic organisms. The most extensively used library- adapted to suit the organisms being analyzed. One last based program is REPEATMASKER (Smit et al., 1996–2004). problem that computational approaches have to deal The program was originally designed to mask repeats in with is the high cost in terms of calculation and memory sequences to facilitate further investigation of aspects size when analyzing very large genomes containing high such as assembly and gene detection. It has become occurrences of repeats. a gold standard for any search for repeats and TEs in Another problematic issue concerning TEs, once they genomes. The program performs a similarity search have been detected, is to classify them into families and based on local alignments using one of the search subfamilies. It is quite easy to identify the main classes of engines CROSSMATCH or AB-BLAST.

Identifying Repeats and Transposable Elements in Sequenced Genomes: How to ﬁnd Your Way Through the Dense Forest of Programs

Epstein-Barr Virus Recombinants from Overlapping Cosmid Fragments

"Evolutionary Emergence of Genes Through Retrotransposition"

Interfering with Retrotransposition by Two Types of CRISPR Effectors

Complete Article

Retroviral Insertions Into a Herpesvirus Are Clustered At

An Essential Gene for Fruiting Body Initiation in the Basidiomycete Coprinopsis Cinerea Is Homologous to Bacterial Cyclopropane Fatty Acid Synthase Genes

A Species-Specific Retrotransposon Drives a Conserved Cdk2ap1 Isoform Essential for Preimplantation Development

B2 and ALU Retrotransposons Are Self-Cleaving Ribozymes Whose Activity Is Enhanced by EZH2

Human Immunodeficiency Virus Type 1 Two-Long Terminal Repeat

LINE1 Inhibition (With Nucleoside Reverse Transcriptase Inhibitors)

The Evolutionary Life History of P Transposons: from Horizontal Invaders to Domesticated Neogenes

Multiple Trans-Splicing Events Are Required to Produce a Mature Nadl Transcript in a Plant Mitochondrion