Scanned by CamScanner Scanned by CamScanner Scanned by CamScanner Scanned by CamScanner Scanned by CamScanner Scanned by CamScanner Scanned by CamScanner Scanned by CamScanner See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/221693700

Transposable Elements and Their Identification

Article in Methods in molecular biology (Clifton, N.J.) · February 2012 DOI: 10.1007/978-1-61779-582-4_12 · Source: PubMed

CITATIONS READS 23 1,412

4 authors, including:

Wojciech Makalowski Izabela Makalowska University of Münster Adam Mickiewicz University

309 PUBLICATIONS 6,703 CITATIONS 165 PUBLICATIONS 4,857 CITATIONS

SEE PROFILE SEE PROFILE

Some of the authors of this publication are also working on these related projects:

Environmental adaptation of amoeba on transcriptomic level View project

Vaccine View project

All content following this page was uploaded by Izabela Makalowska on 31 March 2015.

The user has requested enhancement of the downloaded file. Chapter 12 1

Transposable Elements and Their Identification 2

Wojciech Makałowski, Amit Pande, Valer Gotea 3 and Izabela Makałowska 4

Abstract 5

Most genomes are populated by thousands of sequences that originated from mobile elements. On the one 6 hand, these sequences present a real challenge in the process of genome analysis and annotation. On the 7 other hand, there are very interesting biological subjects involved in many cellular processes. Here, 8 we present an overview of transposable elements (TEs) biodiversity and their impact on genomic evolution. 9 Finally, we discuss different approaches to the TEs detection and analyses. 10

Key words: Transposable elements, Transposons, Mobile elements, Repetitive elements, Genome 11 analysis, Genome evolution, Junk DNA, RepeatMasker, TinT, TEclass 12

1. Introduction 13

Most eukaryotic genomes contain huge numbers of repetitive 14 elements. This phenomenon was discovered by Waring and Britten 15 almost a half century ago using reassociation studies (1, 2). It turned 16 out that most of these repetitive elements originated in transposable 17 elements (TEs) (3), though the repetitive fraction of a genome varies 18 significantly between different organisms, from 12% in Caenorhabditis 19 elegans (4)to50%inmammals(3), and more than 80% in some plants 20 (5). Covering such a significant fraction of a genome, it is not surprising 21 that TEs have a significant influence on the genome organization and 22 evolution. What once was called junk now is considered a treasure. 23 Although much progress has been achieved in understanding of a role 24 that TEs play in a host genome, we are still far from a full understanding 25 of the delicate evolutionary interplay between a host genome and the 26 invaders. They also pose a major challengeforthegenomiccommunity 27 at different levels, from their detection and classification to genome 28 assembly and genome annotation. Here, we present a brief natural 29 history of the TEs and discuss major techniques used in their analyses. 30

Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 1, Methods in Molecular Biology, vol. 855, DOI 10.1007/978-1-61779-582-4_12, # Springer Science+Business Media, LLC 2012 337 338 W. Makałowski et al.

2. Discovery of 31 Mobile Elements 32

33 TEs were discovered by Barbara McClintock during experiments 34 conducted in 1944 on . Since they influenced the activity of 35 some studied by McClintock, she named them controlling 36 elements. However, her discovery was met with less than enthusias- 37 tic reception by the genetic community. Her presentation at the 38 1951 Cold Spring Harbor Symposium was not understood and at 39 least not very well received (6). She had no better luck with her 40 follow-up publications (7–9) and after several years of frustration 41 decided not to publish on the subject for the next two decades. 42 Not for the first time in the history of science, an unappreciated 43 discovery was brought back to life after some other discovery has 44 been made. In this case, it was discovery of Insertional Elements in 45 bacteria by Szybalski group in early 1970s (10). In the original 46 paper, they wrote: “Genetic elements were found in higher organ- 47 isms which appear to be readily transposed from one to another site 48 in the genome. Such elements, identifiable by their controlling 49 functions, were described by McClintock in maize. It is possible 50 that they might be somehow analogous to the presently studied IS 51 insertions” (10). The importance of the McClintock’s original 52 work was eventually appreciated by the genetic community with 53 numerous awards including 14 honorary doctoral degrees and a 54 Nobel Prize in 1983 “for her discovery of mobile genetic elements” 55 (http://nobelprize.org/nobel_prizes/medicine/laureates/1983/). 56 Coincidently, at the same time as Szybalski “rediscovered” TEs, 57 Sozumu Ohno coined the term junk DNA that influenced genomic 58 field for decades (11). Ohno referred to so called noncoding 59 sequences or, to be more precise, to any piece of DNA that do 60 not code for a protein, which included all genomic pieces origi- 61 nated in transposons. The unfavorable picture of transposable and 62 transposed elements started to change in early 1990s when some 63 researchers noticed evolutionary value of these elements (12, 13). 64 With the wheel of fortune turning full circle and advances of 65 genome sciences, TE research is again focused on the role of mobile 66 elements played in the evolution of regulation (14, 15).

3. Transposons 67 Classification 68

3.1. Insertion 6973 The bacterial genome is composed of a core genomic backbone Sequences and 7074 decorated with a variety of multifarious functional elements. These Other Bacterial 7175 MGEs (mobile genetic elements) include bacteriophages, conjuga- Transposons 7276 tive transposons, integrons, unit transposons, composite transposons, 77 and Insertion Sequences (IS). Here, we elaborate upon the last class 78 of these elements, as they are most widely found and described (16). 12 Transposable Elements and Their Identification 339

Fig. 1. Schematic representation of insertional elements (IS). See text for detailed description.

The ISs were identified during studies of model genetic systems 79 by virtue of their capacity to generate mutations as a result of their 80 translocation (10). In-depth studies in antibiotic resistance and 81 transmissible plasmids revealed an important role for these mobile 82 elements in formation of resistance genes and promoting gene 83 capture. In particular, it was observed that several different ele- 84 ments were often clustered in “islands” within plasmid genomes 85 and served to promote plasmid integration and excision. 86 Although these elements sometimes generate beneficial muta- 87 tions, they may be considered genomic parasites as ISs code only for 88 the enzyme required for their own transposition (16). While an IS 89 element occupies a chromosomal location, it is inherited along with 90 its host’s native genes, so its fitness is closely tied to that of its host. 91 Consequently, ISs causing deleterious mutations that disrupt a 92 genomic mode or function are quickly eliminated from the popu- 93 lation. However, intergenically placed ISs have a higher chance to 94 be fixed in the population as likely they are neutral regarding 95 population’s fitness (17). 96 Insertion Sequences are generally compact (Fig. 1). They usu- 97 ally carry no other functions than those involved in their mobility. 98 These elements contain recombinationally active sequences which 99 define the boundary of the element, together with Tpase, an 100 enzyme, which processes these ends and whose gene usually 101 encompasses the entire length of the element (18). Majority of 102 ISs exhibit short terminal inverted-repeat sequences (IR) of length 103 10–40 bp. Several notable exceptions do exist, for example the IS91, 104 IS110, and IS200/605 families. 105 The IRs contains two functional domains (19). One is involved 106 in Tpase binding, and the other cleaves and transfers strand specific 107 reactions resulting in transposition. IS promoters are often posi- 108 tioned partially within the IR sequence upstream of the Tpase gene. 109 Binding sites for host-specific proteins are often located within 110 proximity to the terminal IRs and play a role in modulating trans- 111 position activity or Tpase expression (20). A general pattern for the 112 functional organization of Tpases has emerged from the limited 113 numbers analyzed. The N-terminal region contains sequence- 114 specific DNA binding activities of the proteins, while the catalytic 115 domain is often localized toward the C-terminal end (20). 116 Another common feature of IS elements is duplication of a 117 target site that results in short direct repeats (DRs) flanking the 118 IS (21). The length of the DR varies from 2 to 14 base pairs and is a 119 hallmark of a given element. Homologous recombination between 120 two IS elements can result in each having two different DRs (22). 121 340 W. Makałowski et al.

122 Insertion Sequences have been classified on the following basis 123 (1) similarities in genetic organization (arrangement of open 124 reading frames); (2) marked identities or similarities in their Tpases 125 (common domains or motifs); (3) similar features of their ends 126 (terminal IRs); and (4) fate of the nucleotide sequence of their 127 target sites (generation of a direct target duplication of determined 128 length). Based on the above rules, ISs are currently classified in 129 24 families (23) (Table 1). 130 IS elements influence gene expression by facilitating as pro- 131 moters, for instance potential 35 hexamers have been detected 132 within the terminal IR of many ISs. The list of elements which have 133 been demonstrated experimentally to carry functional 35 hexam- 134 ers is now extensive and includes IS21, IS30, IS257, IS2, IS911, 135 and IS982 in Lactococcus lactis (24). Other elements, having a 136 different mode of influencing the expression of neighboring genes 137 do so by endogenous transcription “escaping” the IS and traversing 138 the terminal IR, e.g., IS3, IS10, IS481, and IS982 in Escherichia 139 coli (25–29).

3.2. Eukaryotic 140142 The first TE classification system was proposed by Finnegan in Transposable Elements141143 1989 (30) and distinguished two classes of TEs characterized by 144 their transposition intermediate: RNA (class I or ) 145 or DNA (class II or DNA transposons). The transposition mecha- 146 nism of class I is commonly called “copy-and-paste” and that of 147 class II, “cut-and-paste.” In 2007, Wicker et al. (31) proposed 148 hierarchical classification based on TEs structural characteristics 149 and mode of replication (see Table 2 and Fig. 2). Below, we present 150 a brief overview of eukaryotic mobile elements that in general 151 follows the classification proposed by Wicker et al. (31).

3.2.1. Class I Mobile 152 As mentioned above, class I TEs transpose through an RNA inter- Elements 153 mediary. The RNA intermediate is transcribed from genomic DNA, 154 and then reverse-transcribed into DNA by a TE-encoded reverse 155 transcriptase (RT), followed by reintegration into a genome. Each 156 replication cycle produces one new copy, and as a result, class I 157 elements are the major contributors to the repetitive fraction in 158 large genomes (5, 32–34). Retrotransposons are divided into five 159 orders: LTR retrotransposons, DIRS-like elements, Penelope-like 160 elements (PLEs), LINEs (Long INterspersed Elements), and 161 SINEs (Short INterspersed Elements). This scheme is based on 162 the mechanistic features, organization and reverse transcriptase 163 phylogeny of these retroelements. Accidentally, the retrotranscip- 164 tase coded by an autonomous TE can reverse-transcribe another 165 RNA present in the cell, e.g., mRNA and produced a retrocopy 166 of it, which in most cases results in a . 167 The LTR retrotransposons, are characterized by the presence of 168 the Long Terminal Repeats (LTRs) ranging from several hundred 169 to several thousand base pairs. Both exogenous retroviruses and 12 Transposable Elements and Their Identification 341

Table 1 t1:1 Prokaryotic transposable elements as presented in the IS Finder database (120)

Typical size a Family range in bp size in bp IRs Number of ORFs t1:2

IS1 740–4,600 0–10 Y 1 or 2 t1:3 IS110 1,200–1,550 0 Y 1 t1:4 IS1380 1,550–2,000 4–5 Y 1 t1:5 IS1595 700–7,900 8 Y 1 t1:6 IS1634 1,500–2,000 5–6 Y 1 t1:7 IS200/IS605 600–2,000 0 N 1 or 2 t1:8 IS21 1,750–2,600 4–8 Y 2 t1:9 IS256 1,200–1,500 8–9 Y 1 t1:10 IS3 1,150–1,750 5 Y 2 t1:11 IS30 1,000–1,700 2–3 Y 1 t1:12 IS4 1,150–5,400 8–13 Y 1 or more t1:13 IS481 950–1,300 4–15 Y 1 t1:14 IS5 800–1,500 2–9 Y 1 or 2 t1:15 IS6 700–900 8 Y 1 t1:16 IS607 1,700–2,500 0 N 2 t1:17 IS630 1,000–1,400 2 Y 1 or 2 t1:18 IS66 1,350–3,000 8–9 Y 1 or more t1:19 IS701 1,400–1,550 4 Y 1 t1:20 IS91 1,500–2,000 0 N 1 t1:21 IS982 1,000 3–9 Y 1 t1:22 ISAs1 1,200–1,500 8–10 Y 1 t1:23 ISH3 1,225–1,500 4–5 Y 1 t1:24 ISL3 1,300–2,300 8 Y 1 t1:25 Tn3 over 3,000 0 Y More than 1 t1:26 aPresence (Y) or absence (N) of terminal inverted repeats t1:27

LTR retrotransposons contain a gag gene, that encodes a viral 170 particle coat, and a pol gene that encodes a reverse transcriptase, 171 ribonuclease H, and integrase, which provide the enzymatic 172 machinery for reverse transcription and integration into the host 173 genome. Reverse transcription occurs within the viral or viral-like 174 342 W. Makałowski et al.

t2:1 Table 2 Classification of eukaryotic transposable elements as proposed by Wicker et al. (31) t2:2 Class Order Superfamily Phylogenetic distribution t2:3 Class I (retrotransposons) LTR Copia Plants, Metazoans, Fungi t2:4 Gypsy Plants, Metazoans, Fungi t2:5 Bel-Pao Metazoans t2:6 Retrovirus Metazoans t2:7 ERV Metazoans t2:8 DIRS DIRS Plants, Metazoans, Fungi t2:9 Ngaro Metazoans, Fungi t2:10 VIPER Trypansosomes t2:11 PLE Penelope Plants, Metazoans, Fungi t2:12 LINE R2 Metazoans t2:13 RTE Metazoans t2:14 Jockey Metazoans t2:15 L1 Plants, Metazoans, Fungi t2:16 SINE tRNA Plants, Metazoans, Fungi t2:17 7SL Plants, Metazoans, Fungi t2:18 5S Metazoans t2:19 Class II (DNA transposons) TIR Tc1-Mariner Plants, Metazoans, Fungi t2:20 Subclass 1 hAT Plants, Metazoans, Fungi t2:21 Mutator Plants, Metazoans, Fungi t2:22 Merlin Metazoans t2:23 Metazoans, Fungi t2:24 P Plants, Metazoans t2:25 PiggyBac Metazoans t2:26 PIF-harbinger Plants, Metazoans, Fungi t2:27 CACTA Plants, Metazoans, Fungi t2:28 Crypton Crypton Fungi t2:29 Class II Subclass 2 Helitron Plants, Metazoans, Fungi t2:30 Maverick Maverick Metazoans, Fungi t2:31 Please note that SVAs and retrogenes are not included in this classification

175 particle (GAG) in the cytoplasm, and it is a multistep process (35). 176 Unlike LTR retrotransposons, exogenous retroviruses contain 177 an env gene, which encodes an envelope that facilitates their migra- 178 tion to other cells. Some LTR retrotransposons may contain rem- 179 nants of an env gene but their insertion capabilities are limited 180 to the originating genome (36). This would rather suggest that 181 they originated in exogenous retroviruses by losing the env gene. 182 However, there is evidence that suggests the contrary, given that 183 LTR retrotransposons can acquire the env gene and become infec- 184 tious entities (37). Currently, most of the LTR sequences (85%) are 185 found only as isolated LTRs, with the internal sequence being 186 lost most likely due to homologous recombination between flank- 187 ing LTRs (38, 39). Interestingly, LTR retrotransposons target their 188 reinsertion to specific genomic sites, often around genes, with 189 putative important functional implications for a host gene (36). 12 Transposable Elements and Their Identification 343

Fig. 2. Structures of eukaryotic mobile elements. Retrovisuses as representatives of LTR elements (a), DIRS (b), PLE (c), LINEs (d), SINEs (e), SVAs (f), retrogenes (g), “classical” autonomous DNA transposons (h), “classical” nonautonomous DNA transposons (i), Helitrons (j), and Mavericks (k). See text for detailed discussion.

Lander et al. estimate that 450,000 LTR copies make up about 190 8% of our genome (38). LTR retrotransposons inhabiting large 191 genomes, such as maize, wheat, or barley can contain thousands 192 of families. However, despite the diversity, very few families com- 193 prise most of the repetitive fraction in these large genomes. Notable 194 examples are Angela (wheat) (40), BARE1 (barley) (41), Opie 195 (maize) (42), and Retrosor6 (sorghum) (43). 196 344 W. Makałowski et al.

197 The DIRS order clusters structurally diverged group of 198 transposons that posses a tyrosine recombinase (YR) gene instead 199 of an integrase (INT) and do not form target site duplications 200 (TSDs). Their termini resemble either split direct repeats (SDR) 201 or inverted repeats. Such features indicate a different integration 202 mechanism than that of other class I mobile elements. DIRS were 203 discovered in the slime mold (Dictyostelium discoideum) genome 204 in early 1980s (44), and they are present in all major phyloge- 205 netic lineages including vertebrates (45). Recently, Piednoel and 206 Bonnivard have showed that they are common in hydrothermal 207 vent organisms (46). 208 Another order termed PLE has wide, though patchy, distribu- 209 tion from amoebae and fungi to vertebrates and copy number can 210 reach thousands per genome (47). Interestingly, no PLE sequences 211 have been found in mammalian genomes and apparently they were 212 lost from C. elegans genome (48). Although PLEs with an intact 213 ORF have been found in several genomes, including Ciona and 214 Dannio, the only transcriptionally active representative, Penelope,is 215 known from Drosophila virilis. It causes the hybrid dysgenesis 216 syndrome characterized by simultaneous mobilization of several 217 unrelated TE families in the progeny of dysgenic crosses. It seems 218 that Penelope invaded D. virilis quite recently and its invasive 219 potential was demonstrated in D. melanogaster (47). PLEs harbor 220 a single ORF that codes for a protein consisting of reverse tran- 221 scriptase (RT) and endonuclease (EN) domains. The PLE RT 222 domain more closely resembles telomerase than the RT from 223 LTRs or LINEs. The EN domain is related to GIY-YIG intron- 224 encoded endonucleases. Some PLE members also have LTR-like 225 sequences, which can be in a direct or an inverse orientation, and 226 have a functional intron (47). 227 LINEs do not have the long terminal repeats, have a poly-A 0 228 tail at the 3 end, and are flanked by the TSDs. They comprise about 229 21% of the human genome and among them L1 with about 230 850,000 copies is the most abundant and best described LINE 231 family. L1 is the only LINE still active in the human 232 genome (38). In the human genome, there are two other LINE- 233 like repeats, L2 and L3, distantly related to L1. A contrasting 234 situation has been noticed in the malaria mosquito Anopheles 235 gambiae, where around 100 divergent LINE families compose 236 only 3% of its genome. LINEs in plants, e.g., Cin4 in maize and 237 Ta11 in Arabidopsis thaliana seem rare as compared with LTR 238 retrotransposons. A full copy of mammalian L1 is about 6 kb 239 long, contains a PolII promoter, and two ORFs. The ORF1 240 codes for a non-sequence-specific RNA binding protein that con- 241 tains Zn-finger, leucine zipper, and coiled-coil motifs. The ORF1p 242 functions as chaperone for the L1 mRNA (49, 50). The second 243 ORF encodes an endonuclease, which makes a single stranded nick 244 in the genomic DNA, and a reverse transcriptase, which uses the 12 Transposable Elements and Their Identification 345 nicked DNA to prime reverse transcription of LINE RNA from 245 0 the 3 end. Reverse transcription is often unfinished, leaving 246 behind fragmented copies of LINE elements, hence most of the 247 L1-derived repeats are short, with an average size of 900 bp. LINEs 248 are part of the CR1 clade, which has members in various metazoan 249 species, including fruit fly, mosquito, zebrafish, pufferfish, turtle, 250 and chicken (51). Because they encode their own retrotrans- 251 position machinery, LINE elements are regarded as autonomous 252 retrotransposons. 253 SINEs evolved from RNA genes, such as 7SL, and tRNA genes. 254 By definition, they are short, up to 1,000 base pair long. They do 255 not encode their own retrotranscription machinery and are consid- 256 ered as nonautonomous elements and in most cases are mobilized 257 by the L1 machinery (52). The outstanding member of this class 258 from the human genome is the Alu repeat, which contains a cleav- 259 age site for the AluI restriction enzyme that gave its name (53). 260 With over a million copies in the human genome, Alu is probably 261 the most successful transposon in the history of life. Primate specific 262 Alu and its rodent relative B1 have limited phylogenetic distribution 263 suggesting their relatively recent origins. The Mammalian-wide 264 Interspersed Repeats (MIRs), by contrast, spread before eutherian 265 radiation, and their copies can be found in different mammalian 266 groups including marsupials and monotremes (54). 267 There are two special categories of retroposed elements worthy 268 of discussion but not included in general classification proposed by 269 Wicker et al. (31). SVA elements are unique primate elements due 270 to their composite structure. They are named after their main 271 components: SINE, VNTR (a variable number of tandem repeats) 272 and Alu. Usually, they contain the hallmarks of the retroposition, 273 i.e., they are flanked by TSDs and terminated by a poly(A) tail. 274 It seems that SVA elements are nonautonomous retrotransposons 275 mobilized by L1 machinery and they are thought to be transcribed 276 by RNA polymerase II. SVAs are transpositionally active causing 277 some human diseases (55). They originated less than 25 mil- 278 lion years ago and they form the youngest family 279 with about 3,000 copies in the human genome (56). Interestingly, 280 similarly to L1 elements, they can transduce downstream sequences 281 during their movement. It has been shown recently that about 282 53 kb of the human genomic sequences has been duplicated by 283 SVA-mediated transductions, including three independent duplica- 284 tions of the entire AMAC gene (57). 285 Another special group of retroposed sequences consists of 286 retro(pseudo)genes, which are products of reverse transcription of 287 a spliced (mature) mRNA. Hence, their characteristic features are as 288 follows: an absence of both 5-promoter sequence and introns, the 289 0 presence of flanking DRs and a 3 end-polyadenosine tract (58). 290 Processed , as sometimes retropseudogenes are called, 291 have been generated in vitro at a low frequency in the human HeLa 292 346 W. Makałowski et al.

293 cells via mRNA from a reporter gene (59). The source of the reverse 294 transcription machinery in humans and other vertebrates seems to 295 be active L1 elements (60). However, not all retroposed messages 296 have to end up as pseudogenes. About 20% of mammalian protein 297 encoding genes lack introns in their ORFs (61). It is conceivable 298 that many genes lacking introns arose by retroposition. Some genes 299 are known to be retroposed more often than others. For instance, 300 in the human genome there are over 2,000 retropseudogenes for 301 ribosomal proteins (62). A recent genome-wide study showed that 302 the human genome harbors about 20,000 pseudogenes, 72% of 303 which arose through retroposition (63). Interestingly, vast majority 304 (92%) of them are quite recent transpositions that occurred after 305 primate/rodent divergence (63). Some of the retroposed genes 306 may undergo quite complicated evolutionary path. An example 307 could be retrogene RNF13B, which replaced its own parental 308 gene in mammalian genomes. This retrocopy was duplicated in 309 primates and the evolution of this primate specific copy was accom- 310 panied by the exaptation of two TEs, Alu and L1, and intron gain 311 via changing a part of coding sequence into an intron, leading 312 to the origin of a functional, primate specific retrogene with two 313 splicing variants (64).

3.2.2. Class II Mobile 314 Class II elements move by a conservative cut-and-paste mechanism, Elements 315 the excision of the donor element is followed by its reinsertion 316 elsewhere in the genome. DNA transposons are abundant in bacte- 317 ria, where they are called insertion sequences (see Subheading 3.1 318 above), but are present in all phyla. Wicker et al. distinguished two 319 subclasses of DNA transposons based on the number of DNA 320 strands that are cut during transposition (31). 321 Classical “cut-and-paste” transposons belong to the subclass I. 322 They are characterized by terminal inverted repeats and encode a 323 transposase that binds near the inverted repeats and mediates 324 mobility. This process is not usually a replicative one, unless the 325 gap caused by excision is repaired using the sister chromatid. When 326 inserted at a new location, the transposon is flanked by small gaps, 327 which, when filled by host enzymes, cause duplication of the 328 sequence at the target site. The length of these TSDs is characteris- 329 tic for particular transposons. Wicker et al. (31) listed two orders 330 and ten superfamilies in this group of transposons, including rather 331 obscure Crypton TEs known only from fungal genomes (65). 332 To this subclass also belong MITEs—a heterogeneous small non- 333 autonomous elements (66), which in some genomes amplified 334 to thousands copies, e.g., Stowaway in the rice genome (67)or 335 Galluhop in the chicken genome (68). 336 Subclass II includes two orders of the TEs that, as those 337 from subclass I, do not form RNA intermediate. However, unlike 338 “classical” DNA transposons, they replicate without double-strand 339 cleavage. Helitrons replicate using a rolling-circle mechanism and 12 Transposable Elements and Their Identification 347

their insertion does not result in target site duplication (69). 340 They encode tyrosine recombinase along with some other proteins. 341 Helitrons were first described in plants but they are present also in 342 other phyla including fungi and mammals (70, 71). Mavericks are 343 large transposons that have been found in different eukaryotic 344 lineages excluding plants (72). They encode various number of 345 proteins that include DNA polymerase B and an integrase. Kapito- 346 nov and Jurka suggested that their life cycle includes a single strand 347 excision, followed by extrachromosomal replication and reintegra- 348 tion to a new location (73). 349

4. Identification 350 of Transposable 351 Elements 352 With ever-growing number of sequenced genomes from different 353 branches of tree of life, detection of TEs is getting increasingly 354 challenging. There are several different reasons why one would 355 like to analyze TEs and their “offsprings” left in a genome. On the 356 one hand, they are very interesting biological subjects to study for 357 instance genome structure, gene regulation, or genome evolution. 358 On the other hand, they might be considered just an annoying 359 genomic feature that makes genome sequencing and annotation 360 more difficult. In either case, TEs should be and are worthy to 361 study. However, it is not a simple task and requires different 362 approaches depending on the level of analysis. We will walk through 363 these different levels starting with “freshly” sequenced genome 364 without any annotation and discuss different methods and software 365 used for TEs analyses. In principle, we can imagine two scenarios: 366 in the first one, genomic or transcriptome, sequences are coming 367 from the species for which there is already some information about 368 transposon repertoire available, for instance related genome has 369 been previously characterized or TEs have been studied before. 370 In the second scenario, we have to deal with a completely unknown 371 genome or a genome with little information. In the previous case, 372 one can apply a range of techniques used in comparative genomics 373 or try to search specific libraries of transposons using “homology 374 search” approach. In the latter, called de novo approach, first we 375 need to find any repeats in a genome and then try to understand 376 their nature. In this approach, we will find any repeats not necessar- 377 ily transposons. There are many algorithms and even more software 378 that can be applied in both approaches. 379

4.1. De Novo There are several steps involved in the de novo characterization of 380383 Approaches to Finding transposons. First, we need to find all the repeats in a genome, then 381384 Repetitive Elements build a consensus of each family of related sequences, and finally 382385 classify detected sequences. For the first step, three groups of 386 348 W. Makałowski et al.

algorithms exist: k-mer approach, sequence self-comparison, and 387 periodicity approach. 388 In the k-mer approach sequences are scanned for overrepre- 389 sentation of string of certain length. The idea is that repeats that 390 belong to the same family are compositionally similar and 391 share some oligomers. If the repeats occur many times in a genome, 392 then those oligomers should be overrepresented. However, since 393 repeats and transposons in particular are not exactly the same, some 394 mismatches must be allowed when oligo frequencies are calculated. 395 The challenge is to determine optimal size of an oligo (k-mer) and 396 number of mismatches allowed. Most likely these parameters 397 should be different for different types of transposons, i.e., low 398 versus high copy number, old versus young transposons, and 399 transposon class. Several programs have been developed based on 400 k-mer idea using a suffix tree data structure including REPuter 401 (74, 75), Vmatch (Kurtz, unpublished; http://www.vmatch.de/), 402 and Repeat-match (76, 77). Another approach is to use fixed length 403 k-mers as seeds and extend those seeds to define repeat’s family as it 404 was implemented in ReAS (78), RepeatScout (79), and Tallymer 405 (80). Another interesting algorithm can be found in FORRepeats 406 software (81), which uses factor oracle data structure. It starts with 407 detection of exact oligomers in the analyzed sequences, following 408 with finding approximate repeats and their alignment. 409 The second group of programs developed for de novo detec- 410 tion of repeated sequences is using self-comparison approach. 411 Repeat Pattern Toolkit (82), RECON (83), PILER (84, 85), and 412 BLASTER (86) belong to this group. The idea is to use one of 413 the fast sequence similarity tools, e.g., BLAST (87) followed by 414 clustering search results. The programs differ in the search engine 415 for the initial step, though most are using some of the BLAST 416 algorithms, the clustering method, and heuristics of merging initial 417 hits into a prototype element. For instance, RECON (83), which 418 was developed for the repeat finding in unassembled sequence reads, 419 starts with an all-to-all comparison using WU-BLAST engine. Then, 420 single linkage clustering is applied to alignment results, which is 421 followed by construction of an undirected graph with overlapping. 422 The shortest sequence that contains connected images creates a 423 prototype element. However, this procedure might result in com- 424 posite elements. To avoid this, all the images are aligned to the 425 prototype element to detect potential illegitimate mergers and split 426 those at every point with a significant number of image ends. 427 PILER (84, 85) is using a different approach to find initial 428 clusters. Instead of BLAST, it uses PALS (Pairwise Alignment 429 of Long Sequences) (88) for the initial alignment. PALS records 430 only hit points and uses banded search of the defined maximum 431 distance to optimize its performance. To further improve perfor- 432 mance of the system PILER uses different heuristic for different 433 types of repeats, i.e., satellites, pseudosatellites, terminal repeats, 434 12 Transposable Elements and Their Identification 349

and interspersed repeats. Finally, a consensus sequence is generated 435 from a multiple sequence alignment of the defined family members. 436 Spectral Repeat Finder (89) uses completely different approach, 437 namely discrete Fourier transformation to identify periodicities pres- 438 ent in a sequence. The specific regions that contribute to a given 439 periodicity are located through a sliding window analysis. The peaks 440 in the power spectrum of the sequence represent candidate repetitive 441 elements. These candidates are used to seed a local alignment search 442 to identify similar elements and compute a consensus sequence 443 for the family. Although this method is most effective for tandem 444 repeats, because the signal worsens with the distance between ele- 445 ments, the authors claim that it can be used for the interspersed 446 repeats as well. The serious drawback of the method is its speed 447 which may be prohibitive for genome scale analyses. 448

4.2. Classification Once the consensus of a repetitive element has been constructed, it 449452 of Transposable can be subjected to further analyses. There are two major categories 450453 Elements of programs dealing with the issue of TEs classification: library or 451454 similarity based, and signature based. The latter approach is very 455 often used in specialized software, i.e., tailored for specific type of 456 TEs; however, some general tools also exist, e.g., TEclass (90). 457 The library approach is probably the most common approach 458 for TE classification. It is also very efficient and quite reliable as 459 long as good library of prototype sequences exist. In practice, it is 460 recommended approach in case when we analyze sequences from 461 well characterized genome or from a genome relatively closely 462 related to the well-studied one. For instance, since the human 463 genome is one of the best studied, any primate sequences can be 464 confidently analyzed using library approach. Most likely, the first 465 software using similarity based approach for repeat classification 466 was Censor developed by Jurka in early 1990s (91). It uses RepBase 467 (92, 93) as a reference collection and in its newest version BLAST as 468 a search engine (94). However, the most popular TE detection 469 software is RepeatMasker (RM) (http://www.repeatmasker.org). 470 Interestingly, RM is also using RepBase as a reference collection 471 and AB-BLAST, RM-BLAST, or cross-match as a search engine. 472 In both cases, original search hits are processed by a series of perl 473 scripts to determine structure of elements and classify them to one 474 of known TE families. Both Censor and RM can also use user- 475 provided libraries, including “third party” lineage specific libraries, 476 e.g., Plant Repeat Databases (95) or TREP (96). Over the 477 years, RM has become a standard tool in TE analyses and often its 478 output is used for more biologically oriented studies (see below). 479 The aforementioned programs have one important drawback, since 480 they are completely based on sequence similarity, they can detect 481 only TEs that has been previously described. However, similarity 482 search, like in many other bioinformatics tasks, should be the first 483 thing to do while analyzing repetitive elements. 484 350 W. Makałowski et al.

Signature based programs are searching for certain features that 485 characterize specific TEs, for example long terminal repeats (LTRs), 486 target site duplications (TSDs), or primer-binding sites (PBSs). 487 Since different types (families) of the elements are structurally 488 different, they require specific rules for their detection. Hence, 489 many of the programs that use signature based algorithms are 490 specific for a certain type of transposons. There is a number of 491 programs specialized in detection of LTR transposons, which are 492 based on similar methodology. They take into account several 493 structural features of LTR including size, distance 494 between paired LTRs and their similarity, the presence of TSDs, 495 presence of replication signals, i.e., the primer binding site and the 496 polypurine tract (PPTs). Some of the programs check also for ORFs 497 coding for the gag, pol, and env proteins. LTR_STRUC (97) was 498 one of the first programs based on this principle. It uses seed-and- 499 extend strategy to find repeats located within user-defined distance. 500 The candidate regions are extended based on pairwise alignment to 501 determine cognate LTRs’ boundaries. Putative full-length elements 502 are scored based on the presence of TSD, PBS, PPT, and reverse 503 transcriptase ORF. However, because of the heuristics described 504 above, LTR_STRUC is unable to find incomplete LTR transposons 505 and in particular solo LTRs. Another limitation of this program is 506 its Windows-only implementation that significantly prohibits auto- 507 mated large-scale analysis. Several other programs have been devel- 508 oped based on similar principles, e.g., LTR_par (98, 99), find_LTR 509 (100), LTR_FINDER (101), and LTRharvest (102). Lerat has 510 recently tested performance of these programs (103), and although 511 sensitivity of the methods was acceptable (between 40% and 98%), 512 it was in the expense of specificity, which was very poor. In several 513 cases, number of falsely assigned transposons exceeded number of 514 correctly detected ones. 515 Another group of transposons that have relatively conserved 516 structure are MITEs and Helitrons and several specialized pro- 517 grams were developed that take advantage of this information. 518 FINDMITE (104), MUST (105) are tailored for MITEs, while 519 HelitronFinder (106) and HelSearch (107) were developed to 520 detect Helitrons. 521 A further interesting approach to transposon classification was 522 implemented by Abrusan et al. (90) in the software package called 523 TEclass, which classifies unknown TE consensus sequences into 524 four categories, according to their mechanism of transposition: 525 DNA transposons, LTRs, LINEs, SINEs. The classification uses 526 support vector machines, random forests, learning vector quantiza- 527 tion, and predicts ORFs. Two complete sets of classifiers are built 528 using tetramers and pentamers, which are used in two separate 529 rounds of the classification. The software assumes that the analyzed 530 sequence represents a TE and the classification process is binary, 531 with the following steps: forward versus reverse sequence 532 12 Transposable Elements and Their Identification 351

orientation > DNA versus Retrotransposon > LTRs versus 533 nonLTRs (for retroelements) > LINEs versus SINEs (for nonLTR 534 repeats). If the different methods of classification lead to conflicting 535 results, TEclass reports the repeat either as unknown, or as the last 536 category where the classification methods agree. 537

4.3. Meta-analyses Recent years witnessed some attempt to create more complex, 538539 global analyses systems. One such a system is REPCLASS (108). 540 It consists of three classification modules: homology (HOM), 541 structure (STR), and target site duplication (TSD). Each module 542 can be run separately or in the pairwise manner and final step of 543 the analysis involves integration of the results delivered by each 544 module. There is one interesting novelty in the STR module, 545 namely implementation of tRNAscan-SE (109) to detect tRNA- 546 like secondary structure within the query sequence, one of the 547 signatures of many SINE families. The REPPET is another pipeline 548 for TE sequence analyses. It uses “classical” three-step approach 549 for de novo TE identification: self-alignment, clustering, and con- 550 sensus sequences generation. However, the pipeline is using a 551 spectrum of different methods at each step, followed by rigorous 552 TE classification step based on recently proposed classification 553 of TEs (110). 554 Another relatively unexplored territory of the TE analyses is 555 going beyond TE detection and classification. For instance, T-lex 556 (111) detects presence and/or absence of annotated individual TEs 557 in sequenced genomes using next-generation sequencing (NGS) 558 data and is especially useful for the population studies. When using 559 NGS data from multiple strains, T-lex also returns the frequency 560 estimate for each TE in the tested strains. To detect the presence, 561 T-lex finds reads that overlap the junctions of each TE with its 562 flanking region. The TE junction sequences and NGS data are 563 then converted into binary formats by Maq (112). Next, the 564 reads mapped on the TE junction sequences are used to build 565 contigs. If at least one contig spans the TE sequence, the TE is 566 defined as “present.” To detect the absence, T-lex finds reads that 567 span both flanking regions. The TE flanking regions are extracted 568 and concatenated. The reads are then mapped on the concatenated 569 sequence of the TEs using SHRIMP (113). T-lex selects the reads 570 spanning the two flanking regions of the TEs. If at least one read 571 that does not fully correspond to maps correctly 572 on both TE sides, T-lex classifies the TE as “absent.” Recently, 573 Hormozdari et al. have extended their VariationHunter (114) 574 algorithm to next-generation sequencing data. In the simulated 575 data test the algorithm was able to discover over 85% transposon 576 insertions (115). 577 The TinT (Transposition in Transposition) tool is applying 578 maximum likelihood model of TE insertion probability to estimate 579 relative age of TE families (116). In the first steps it takes RM 580 352 W. Makałowski et al.

output to detect nested retroposons. Then, it generates a data 581 matrix that is used by a probabilistic model to estimate chronology 582 and activity period of analyzed families. The method has recently 583 been applied to resolve the evolutionary history of galliformes 584 (117), marsupials (118), and lagomorphs (119). 585

5. Concluding 586 Remarks 587

588 Annoying junk for some, hidden treasure for others, TEs can be 589 hardly ignored. With their diversity and high copy number in most 590 of the genomes, they are not the easiest biological entities to 591 analyze. Nevertheless, recent years witnessed increase interest in

t3:1 Table 3 Selected resources for Transposable Elements discovery and analyses t3:2 Software Address t3:3 AB-BLAST http://www.advbiocomp.com/blast.html t3:4 ACLAME http://aclame.ulb.ac.be/ t3:5 BLASTER suite http://urgi.versailles.inra.fr/index.php/urgi/Tools/BLASTER t3:6 Censor http://www.girinst.org/censor/download.php t3:7 DROPOSON ftp://biom3.univ-lyon1.fr//pub/drosoposon/ t3:8 find_ltr http://darwin.informatics.indiana.edu/cgi-bin/evolution/ltr.pl t3:9 FORrepeats http://al.jalix.org/FORRepeats/ t3:10 HelitronFinder http://limei.montclair.edu/HT.html t3:11 HERVd http://herv.img.cas.cz/ t3:12 IRF http://tandem.bu.edu/irf/irf.download.html t3:13 IScan http://www.bioc.uzh.ch/wagner/software/IScan/ t3:14 IS Finder http://www-is.biotoul.fr/is.html t3:15 LTR_FINDER http://tlife.fudan.edu.cn/ltr_finder/ LTRdigest http://www.zbh.uni-hamburg.de/forschung/genominformatik/software/ t3:16 ltrdigest.html LTRharvest http://www.zbh.uni-hamburg.de/forschung/genominformatik/software/ t3:17 ltrharvest.html t3:18 LTR_STRUC http://www.mcdonaldlab.biology.gatech.edu/finalLTR.htm t3:19 LTR_MINER http://genomebiology.com/2004/5/10/R79/additional t3:20 LTR_par http://www.eecs.wsu.edu/~ananth/software.htm (continued) 12 Transposable Elements and Their Identification 353

Table 3 t3:21 (continued)

Software Address t3:22 mer-engine http://roma.cshl.org/mer-source.php t3:23 MGEScan-LTR http://darwin.informatics.indiana.edu/cgi-bin/evolution/daphnia_ltr.pl t3:24 MGEScan-nonLTR http://darwin.informatics.indiana.edu/cgi-bin/evolution/nonltr/nonltr.pl t3:25 microTranspoGene http://transpogene.tau.ac.il/microTranspoGene.html t3:26 Miropeats http://www.genome.ou.edu/miropeats.html t3:27 MITE-Hunter http://target.iplantcollaborative.org/mite_hunter.html t3:28 MUST http://csbl1.bmb.uga.edu/ffzhou/MUST/ t3:29 PILER http://www.drive5.com/piler/ t3:30 PLOTREP http://repeats.abc.hu/cgi-bin/plotrep.pl t3:31 RAP http://genomics.cribi.unipd.it/index.php/Rap_Repeat_Filter t3:32 REannotate http://www.bioinformatics.org/reannotate/index.html t3:33 ReAS ftp://ftp.genomics.org.cn/pub/ReAS/software/ t3:34 RECON http://selab.janelia.org/recon.html t3:35 RepSeek http://wwwabi.snv.jussieu.fr/public/RepSeek/ t3:36 RepeatFinder http://cbcb.umd.edu/software/RepeatFinder/ t3:37 RepeatGluer http://nbcr.sdsc.edu/euler/intro_tmp.htm t3:38 RepeatMasker http://www.repeatmasker.org/ t3:39 RepeatModeler http://www.repeatmasker.org/RepeatModeler.html t3:40 RepeatRunner http://www.yandell-lab.org/software/repeatrunner.html t3:41 RepeatScout http://repeatscout.bioprojects.org/ t3:42 repeat-match http://mummer.sourceforge.net/ t3:43 REPET http://urgi.versailles.inra.fr/index.php/urgi/Tools/REPET t3:44 RepMiner http://repminer.sourceforge.net/index.htm t3:45 REPuter http://bibiserv.techfak.uni-bielefeld.de/reputer/ t3:46 RetroBase http://biocadmin.otago.ac.nz/fmi/xsl/retrobase/home.xsl t3:47 RetroMap http://www.burchsite.com/bioi/RetroMapHome.html t3:48 RetroTector http://www.kvir.uu.se/RetroTector/RetroTectorProject.html t3:49 RJPrimers http://probes.pw.usda.gov/RJPrimers/index.html t3:50 RTAnalyzer http://www.riboclub.org/cgi-bin/RTAnalyzer/index.pl t3:51 SeqGrapheR http://www.biomedcentral.com/imedia/6784621823926090/supp4.gz t3:52 SMaRTFinder http://services.appliedgenomics.org/software/smartfinder/ t3:53 (continued) 354 W. Makałowski et al.

t3:54 Table 3 (continued) t3:55 Software Address t3:56 SoyTEdb http://www.soytedb.org t3:57 Spectral Repeat http://www.imtech.res.in/raghava/srf/ Finder t3:58 Tallymer http://www.zbh.uni-hamburg.de/Tallymer/ t3:59 TARGeT http://target.iplantcollaborative.org/ t3:60 TCF http://www.mssm.edu/labs/warbup01/paper/files.html t3:61 TEclass http://www.compgen.uni-muenster.de/tools/teclass/ t3:62 TE Displayer http://labs.csb.utoronto.ca/yang/TE_Displayer/ t3:63 TE nest http://www.plantgdb.org/prj/TE_nest/TE_nest.html t3:64 TESD http://pbil.univ-lyon1.fr/software/TESD/ t3:65 TinT http://www.bioinformatics.uni-muenster.de/tools/tint/ t3:66 T-lex http://petrov.stanford.edu/cgi-bin/Tlex_manual.html t3:67 TRANSPO http://alggen.lsi.upc.es/recerca/search/transpo/transpo.html t3:68 TranspoGene http://transpogene.tau.ac.il/ t3:69 Transposon Express http://www.swan.ac.uk/genetics/dyson/TExpress_home.htm t3:70 Transposon-PSI http://transposonpsi.sourceforge.net/ t3:71 TRAP http://www.coccidia.icb.usp.br/trap/tutorials/ t3:72 TRF http://tandem.bu.edu/trf/trf.html t3:73 TROLL http://finder.sourceforge.net/ t3:74 TSDfinder http://www.ncbi.nlm.nih.gov/CBBresearch/Landsman/TSDfinder/ t3:75 WikiPoson http://www.bioinformatics.org/wikiposon/doku.php t3:76 VariationHunter http://compbio.cs.sfu.ca/strvar.htm t3:77 Vmatch http://www.vmatch.de/ 12 Transposable Elements and Their Identification 355

TEs. On the one hand, we observe improvement in computational 592 593 tools specialized in TE analyses. Table 3 lists some of such tools 594 and more comprehensive list can be found at our Web site: http:// 595 www.bioinformatics.uni-muenster.de/ScrapYard/. On the other 596 hand, the improved tools and new technology enable biologists 597 to explore mobile elements even further, with more surprising 598 discoveries certainly lying ahead.

599 References

601 1. Britten, R. J., and Kohne, D. E. (1968) 13. Makalowski, W., Mitchell, G. A., and 645 602 Repeated sequences in DNA. Hundreds of Labuda, D. (1994) Alu sequences in the 646 603 thousands of copies of DNA sequences have coding regions of mRNA: a source of protein 647 604 been incorporated into the genomes of higher variability, Trends Genet 10, 188–193. 648 605 organisms, Science 161, 529–540. 14. Jordan, I. K., Rogozin, I. B., Glazko, G. V., and 649 606 2. Waring, M., and Britten, R. J. (1966) Koonin, E. V. (2003) Origin of a substantial 650 607 Nucleotide Sequence Repetition – a Rapidly fraction of human regulatory sequences from 651 608 Reassociating Fraction of Mouse DNA, transposable elements, Trends in 19, 652 609 Science 154, 791–794. 68–72. 653 610 3. Makalowski, W. (2001) The human genome 15. Thornburg, B. G., Gotea, V., and Maka- 654 611 structure and organization, Acta Biochim Pol lowski, W. (2006) Transposable elements as a 655 612 48, 587–598. significant source of transcription regulating 656 613 4. C. elegans Sequencing Consortium. (1998) signals, Gene 365, 104–110. 657 614 Genome sequence of the nematode C. 16. Mahillon, J., and Chandler, M. (1998) Inser- 658 615 elegans: a platform for investigating biology, tion sequences, Microbiol Mol Biol R 62, 659 616 Science 282, 2012–2018. 725–774. 660 617 5. SanMiguel, P.,Tikhonov, A., Jin, Y. K., Motch- 17. Nagy, Z., and Chandler, M. (2004) Regula- 661 618 oulskaia, N., Zakharov, D., Melake-Berhan, A., tion of transposition in bacteria, Res Micro- 662 619 Springer, P. S., Edwards, K. J., Lee, M., Avra- biol 155, 387–398. 663 620 mova, Z., and Bennetzen, J. L. (1996) Nested 18. Derbyshire, K. M., and Grindley, N. D. 664 621 retrotransposons in the intergenic regions of (1996) Cis preference of the IS903 transpo- 665 622 the maize genome, Science 274,765–768. sase is mediated by a combination of trans- 666 623 6. Keller, E. F. (1983) A feeling for the organ- posase instability and inefficient translation, 667 624 ism: the life and work of Barbara McClintock, Mol Microbiol 21, 1261–1272. 668 625 W.H. Freeman, San Francisco. 19. Ichikawa, H., Ikeda, K., Amemura, J., and 669 626 7. McClintock, B. (1950) The origin and behav- Ohtsubo, E. (1990) Two domains in the 670 627 ior of mutable loci in maize, Proc Natl Acad terminal inverted-repeat sequence of trans- 671 628 Sci U S A 36, 344–355. poson Tn3, Gene 86, 11–17. 672 629 8. McClintock, B. (1951) Chromosome Orga- 20. Maekawa, T., Amemura-Maekawa, J., and 673 630 nization and Genic Expression, Cold Spring Ohtsubo, E. (1993) DNA binding domains 674 631 Harb Sym 16, 13–47. in Tn3 transposase, Mol Gen Genet 236, 675 632 9. McClintock, B. (1956) Controlling Elements 267–274. 676 633 and the Gene, Cold Spring Harb Sym 21, 21. Weinert, T. A., Schaus, N. A., and Grindley, 677 634 197–216. N. D. F. (1983) Insertion-Sequence Dupli- 678 635 10. Malamy, M. H., Fiandt, M., and Szybalski, W. cation in Transpositional Recombination, 679 636 (1972) Electron microscopy of polar inser- Science 222, 755–765. 680 637 tions in the lac operon of Escherichia coli, 22. Turlan, C., and Chandler, M. (1995) IS1- 681 638 Mol Gen Genet 119, 207–222. Mediated Intramolecular Rearrangements – 682 639 11. Ohno, S. (1972) So much “junk” DNA in our Formation of Excised Transposon Circles 683 640 genome., In Brookhaven Symposia in Biology and Replicative Deletions, Embo Journal 14, 684 641 (Smith, H. H., Ed.), pp 366–370, Gordon & 5410–5421. 685 642 Breach, New York. 23. Chandler, M., and Mahillon, J. (2002) Inser- 686 643 12. Brosius, J. (1991) Retroposons – seeds of tion Sequences revisited, In Mobile DNA II 687 644 evolution, Science 251, 753. (Craig, N. L., Craigie, R., Gellert, M. & 688 356 W. Makałowski et al.

689 Lambowitz, A. M, Ed.), ASM, Washington, 36. Kazazian, H. H., Jr. (2004) Mobile elements: 746 690 DC. drivers of genome evolution, Science 303, 747 691 24. Reimmann, C., Moore, R., Little, S., Savioz, 1626–1632. 748 692 A., Willetts, N. S., and Haas, D. (1989) 37. Malik, H. S., Henikoff, S., and Eickbush, T. H. 749 693 Genetic-Structure, Function and Regulation (2000) Poised for contagion: evolutionary 750 694 of the Is21, Molecular origins of the infectious abilities of invertebrate 751 695 & General Genetics 215, 416–424. retroviruses, Genome Res 10, 1307–1318. 752 696 25. Derbyshire, K. M., Hwang, L., and Grindley, 38. Lander, E. S., Linton, L. M., Birren, B., 753 697 N. D. F. (1987) Genetic-Analysis of the Inter- Nusbaum, C., Zody, M. C., Baldwin, J., 754 698 action of the Insertion-Sequence Is903 Trans- Devon, K., Dewar, K., Doyle, M., FitzHugh, 755 699 posase with Its Terminal Inverted Repeats, W., Funke, R., Gage, D., Harris, K., 756 700 Proceedings of the National Academy of Heaford, A., Howland, J., Kann, L., Lehoczky, 757 701 Sciences of the United States of America 84, J., LeVine, R., McEwan, P., McKernan, K., 758 702 8049–8053. Meldrim, J., Mesirov, J. P., Miranda, C., Mor- 759 703 26. Derbyshire, K. M., Kramer, M., and Grindley, ris, W., Naylor, J., Raymond, C., Rosetti, M., 760 704 N. D. (1990) Role of instability in the cis Santos, R., Sheridan, A., Sougnez, C., Stange- 761 705 action of the insertion sequence IS903 trans- Thomann, N., Stojanovic, N., Subramanian, 762 706 posase, Proc Natl Acad Sci U S A 87, A., Wyman, D., Rogers, J., Sulston, J., Ain- 763 707 4048–4052. scough, R., Beck, S., Bentley, D., Burton, J., 764 708 27. Huisman, O., Errada, P. R., Signon, L., and Clee, C., Carter, N., Coulson, A., Deadman, 765 709 Kleckner, N. (1989) Mutational analysis of R., Deloukas, P., Dunham, A., Dunham, I., 766 710 IS10’s outside end, EMBO J 8, 2101–2109. Durbin, R., French, L., Grafham, D., Gregory, 767 711 28. Johnson, R. C., and Reznikoff, W. S. (1983) S., Hubbard, T., Humphray, S., Hunt, A., 768 712 DNA-Sequences at the Ends of Transposon Jones, M., Lloyd, C., McMurray,A., Matthews, 769 713 Tn5 Required for Transposition, Nature 304, L., Mercer, S., Milne, S., Mullikin, J. C., Mun- 770 714 280–282. gall, A., Plumb, R., Ross, M., Shownkeen, R., 771 715 29. Zerbib, D., Prentki, P., Gamas, P., Freund, E., Sims, S., Waterston, R. H., Wilson, R. K., Hill- 772 716 Galas, D. J., and Chandler, M. (1990) Func- ier, L. W., McPherson, J. D., Marra, M. A., 773 717 tional organization of the ends of IS1: specific Mardis, E. R., Fulton, L. A., Chinwalla, A. T., 774 718 binding site for an IS 1-encoded protein, Mol Pepin, K. H., Gish, W. R., Chissoe, S. L., 775 719 Microbiol 4, 1477–1486. Wendl, M. C., Delehaunty, K. D., Miner, T. 776 777 720 30. Finnegan, D. J. (1989) Eukaryotic transposa- L., Delehaunty, A., Kramer, J. B., Cook, L. L., 778 721 ble elements and genome evolution, Trends Fulton, R. S., Johnson, D. L., Minx, P. J., 779 722 Genet 5, 103–107. Clifton, S. W., Hawkins, T., Branscomb, E., 780 723 Predki, P., Richardson, P., Wenning, S., Slezak, 31. Wicker, T., Sabot, F., Hua-Van, A., Bennet- 781 724 T., Doggett, N., Cheng, J. F., Olsen, A., Lucas, zen, J. L., Capy, P., Chalhoub, B., Flavell, A., 782 725 S., Elkin, C., Uberbacher, E., Frazier, M., Leroy, P., Morgante, M., Panaud, O., Paux, 783 726 E., SanMiguel, P., and Schulman, A. H. Gibbs, R. A., Muzny, D. M., Scherer, S. E., 784 727 (2007) A unified classification system for Bouck, J. B., Sodergren, E. J., Worley, K. C., 785 728 eukaryotic transposable elements, Nat Rev Rives, C. M., Gorrell, J. H., Metzker, M. L., 786 729 Genet 8, 973–982. Naylor, S. L., Kucherlapati, R. S., Nelson, D. L., Weinstock, G. M., Sakaki, Y., Fujiyama, A., 787 730 32. Han, J. S., and Boeke, J. D. (2005) LINE-1 Hattori, M., Yada, T., Toyoda, A., Itoh, T., 788 731 retrotransposons: modulators of quantity Kawagoe, C., Watanabe, H., Totoki, Y., Taylor, 789 732 and quality of mammalian gene expression? T., Weissenbach, J., Heilig, R., Saurin, W., Arti- 790 733 Bioessays 27, 775–784. guenave, F., Brottier, P., Bruls, T., Pelletier, E., 791 734 33. Kumar, A., and Bennetzen, J. L. (1999) Plant Robert, C., Wincker, P., Smith, D. R., Douc- 792 735 33 retrotransposons, Annu Rev Genet , ette-Stamm, L., Rubenfield, M., Weinstock, K., 793 736 479–532. Lee, H. M., Dubois, J., Rosenthal, A., Platzer, 794 737 34. Sabot, F., and Schulman, A. H. (2006) Para- M., Nyakatura, G., Taudien, S., Rump, A., 795 738 sitism and the retrotransposon life cycle in Yang, H., Yu, J., Wang, J., Huang, G., Gu, J., 796 739 plants: a hitchhiker’s guide to the genome, Hood, L., Rowen, L., Madan, A., Qin, S., 797 97 740 Heredity , 381–388. Davis, R. W., Federspiel, N. A., Abola, A. P., 798 741 35. Voytas, D. F., and Boeke, J. D. (2002) Ty1 Proctor, M. J., Myers, R. M., Schmutz, J., 799 742 and Ty5 of Saccharomyces cerevisiae, In Dickson, M., Grimwood, J., Cox, D. R., 800 743 Mobile DNA II (Craig, N. L., Craigie, R., Olson, M. V., Kaul, R., Shimizu, N., Kawasaki, 801 744 Gellert, M. & Lambowitz, A. M, Ed.), ASM, K., Minoshima, S., Evans, G. A., Athanasiou, 802 745 Washington, DC. 12 Transposable Elements and Their Identification 357

803 M., Schultz, R., Roe, B. A., Chen, F., Pan, H., 44. Zuker, C., and Lodish, H. F. (1981) Repeti- 859 804 Ramser, J., Lehrach, H., Reinhardt, R., tive DNA sequences cotranscribed with 860 805 McCombie, W. R., de la Bastide, M., Dedhia, developmentally regulated Dictyostelium dis- 861 806 N., Blocker, H., Hornischer, K., Nordsiek, G., coideum mRNAs, Proc Natl Acad Sci U S A 862 807 Agarwala, R., Aravind, L., Bailey, J. A., Bate- 78, 5386–5390. 863 808 man, A., Batzoglou, S., Birney, E., Bork, P., 45. Goodwin, T. J., and Poulter, R. T. (2001) The 864 809 Brown, D. G., Burge, C. B., Cerutti, L., DIRS1 group of retrotransposons, Mol Biol 865 810 Chen, H. C., Church, D., Clamp, M., Copley, Evol 18, 2067–2082. 866 811 R. R., Doerks, T., Eddy, S. R., Eichler, E. E., 46. Piednoel, M., and Bonnivard, E. (2009) 867 812 Furey, T. S., Galagan, J., Gilbert, J. G., Har- DIRS1-like retrotransposons are widely 868 813 mon, C., Hayashizaki, Y., Haussler, D., Herm- distributed among Decapoda and are particu- 869 814 jakob, H., Hokamp, K., Jang, W., Johnson, L. larly present in hydrothermal vent organisms, 870 815 S., Jones, T. A., Kasif, S., Kaspryzk, A., Ken- BMC Evol Biol 9, 86. 871 816 nedy, S., Kent, W. J., Kitts, P., Koonin, E. V., 47. Evgen’ev, M. B., and Arkhipova, I. R. (2005) 872 817 Korf, I., Kulp, D., Lancet, D., Lowe, T. M., Penelope-like elements – a new class of retro- 873 818 McLysaght, A., Mikkelsen, T., Moran, J. V., elements: distribution, function and possible 874 819 Mulder, N., Pollara, V. J., Ponting, C. P.,Schu- evolutionary significance, Cytogenet Genome 875 820 ler, G., Schultz, J., Slater, G., Smit, A. F., Res 110, 510–521. 876 821 Stupka, E., Szustakowski, J., Thierry-Mieg, 48. Arkhipova, I. R. (2006) Distribution and 877 822 D., Thierry-Mieg, J., Wagner, L., Wallis, J., phylogeny of Penelope-like elements in eukar- 878 823 Wheeler, R., Williams, A., Wolf, Y. I., Wolfe, yotes, Syst Biol 55, 875–885. 879 824 K. H., Yang, S. P.,Yeh, R. F., Collins, F., Guyer, 880 825 49. Martin, S. L., Cruceanu, M., Branciforte, D., M. S., Peterson, J., Felsenfeld, A., Wetter- Wai-Lun Li, P., Kwok, S. C., Hodges, R. S., 881 826 strand, K. A., Patrinos, A., Morgan, M. J., de and Williams, M. C. (2005) LINE-1 retro- 882 827 Jong, P., Catanese, J. J., Osoegawa, K., Shi- transposition requires the nucleic acid chaper- 883 828 zuya, H., Choi, S., and Chen, Y. J. (2001) one activity of the ORF1 protein, J Mol Biol 884 829 Initial sequencing and analysis of the human 348, 549–561. 885 830 genome, Nature 409,860–921. 50. Martin, S. L. (2010) Nucleic acid chaperone 886 831 39. Leib-Mosch, C., Haltmeier, M., Werner, T., properties of ORF1p from the non-LTR 887 832 Geigl, E. M., Brack-Werner, R., Francke, U., retrotransposon, LINE-1, RNA Biol 7, 888 833 Erfle, V., and Hehlmann, R. (1993) Genomic 706–711. 889 834 distribution and transcription of solitary 51. Kapitonov, V. V., and Jurka, J. (2003) Molec- 890 835 HERV-K LTRs, Genomics 18, 261–269. ular paleontology of transposable elements 891 836 40. Wicker, T., Stein, N., Albar, L., Feuillet, C., in the Drosophila melanogaster genome, 892 837 Schlagenhauf, E., and Keller, B. (2001) Anal- Proc Natl Acad Sci U S A 100, 6569–6574. 893 838 ysis of a contiguous 211 kb sequence in dip- 52. Kajikawa, M., and Okada, N. (2002) LINEs 894 839 loid wheat (Triticum monococcum L.) reveals 0 mobilize SINEs in the eel through a shared 3 895 840 multiple mechanisms of genome evolution, sequence, Cell 111, 433–444. 896 841 Plant J 26, 307–316. 53. Houck, C. M., Rinehart, F. P., and Schmid, 897 842 41. Vicient, C. M., Kalendar, R., Anamthawat- C. W. (1979) Ubiquitous Family of Repeated 898 843 Jonsson, K., and Schulman, A. H. (1999) DNA Sequences in the Human Genome, 899 844 Structure, functionality, and evolution of the Journal of Molecular Biology 132, 289–306. 900 845 BARE-1 retrotransposon of barley, Genetica 846 107, 53–63. 54. Jurka, J., Zietkiewicz, E., and Labuda, D. 901 (1995) Ubiquitous Mammalian-Wide Inter- 902 847 42. SanMiguel, P., Gaut, B. S., Tikhonov, A., spersed Repeats (Mirs) Are Molecular Fossils 903 848 Nakajima, Y., and Bennetzen, J. L. (1998) from the Mesozoic Era, Nucleic Acids 904 849 The paleontology of intergene retrotranspo- Research 23, 170–175. 905 850 sons of maize, Nat Genet 20, 43–45. 55. Ostertag, E. M., Goodier, J. L., Zhang, Y., 906 851 43. Peterson, D. G., Schulze, S. R., Sciara, E. B., and Kazazian, H. H., Jr. (2003) SVA elements 907 852 Lee, S. A., Bowers, J. E., Nagel, A., Jiang, N., are nonautonomous retrotransposons that 908 853 Tibbitts, D. C., Wessler, S. R., and Paterson, cause disease in humans, Am J Hum Genet 909 854 A. H. (2002) Integration of Cot analysis, 73, 1444–1451. 910 855 DNA cloning, and high-throughput 856 sequencing facilitates genome characteriza- 56. Wang, H., Xing, J., Grover, D., Hedges, D. J., 911 857 tion and gene discovery, Genome Res 12, and Han, K. (2005) SVA elements: a 912 858 795–807. 358 W. Makałowski et al.

913 hominid-specific retroposon family. J Mol Mardis, E. R., Wilson, R. K., Peterson, D. 970 914 Biol 354, 994–1007. G., Paterson, A. H., and Ivarie, R. (2005) 971 915 57.Xing,J.,Wang,H.,Belancio,V.P.,Cordaux, The repetitive landscape of the chicken 972 916 R., Deininger, P. L., and Batzer, M. A. genome, Genome Res 15, 126–136. 973 917 (2006) From the cover: eukaryotic transpo- 69. Kapitonov, V. V., and Jurka, J. (2001) Rolling- 974 918 sable elements and genome evolution special circle transposons in eukaryotes, Proc Natl 975 919 feature: emergence of primate genes by ret- Acad Sci U S A 98, 8714–8719. 976 920 rotransposon-mediated sequence transduc- 70. Hood, M. E. (2005) Repetitive DNA in the 977 921 tion,ProcNatlAcadSciUSA103, automictic fungus Microbotryum violaceum, 978 922 17608–17613. Genetica 124, 1–10. 979 923 58. Vanin, E. F. (1985) Processed pseudogenes: 71. Pritham, E. J., and Feschotte, C. (2007) Mas- 980 924 characteristics and evolution, Annu Rev Genet sive amplification of rolling-circle transposons 981 925 19, 253–272. in the lineage of the bat Myotis lucifugus, 982 926 59. Maestre, J., Tchenio, T., Dhellin, O., and Proc Natl Acad Sci U S A 104, 1895–1900. 983 927 Heidmann, T. (1995) mRNA retroposition 72. Pritham, E. J., Putliwala, T., and Feschotte, C. 984 928 in human cells: processed pseudogene forma- (2007) Mavericks, a novel class of giant trans- 985 929 tion, EMBO J 14, 6333–6338. posable elements widespread in eukaryotes 986 930 60. Esnault, C., Maestre, J., and Heidmann, T. and related to DNA viruses, Gene 390, 3–17. 987 931 (2000) Human LINE retrotransposons gen- 73. Kapitonov, V. V., and Jurka, J. (2006) Self- 988 932 erate processed pseudogenes, Nat Genet 24, synthesizing DNA transposons in eukaryotes, 989 933 363–367. Proc Natl Acad Sci U S A 103, 4540–4545. 990 934 61. Sakharkar, M. K., Kangueane, P., Petrov, D. 74. Kurtz, S., Choudhuri, J. V., Ohlebusch, E., 991 935 A., Kolaskar, A. S., and Subbiah, S. (2002) Schleiermacher, C., Stoye, J., and Giegerich, 992 936 SEGE: a database on ‘intronless/single R. (2001) REPuter: the manifold applications 993 937 exonic’ genes from eukaryotes, Bioinformat- of repeat analysis on a genomic scale, Nucleic 994 938 ics 18, 1266–1267. Acids Research 29, 4633–4642. 995 939 62. Zhang, Z., Harrison, P., and Gerstein, M. 75. Kurtz, S., and Schleiermacher, C. (1999) 996 940 (2002) Identification and analysis of over REPuter: fast computation of maximal repeats 997 941 2000 ribosomal protein pseudogenes in the in complete genomes, Bioinformatics 15, 998 942 human genome, Genome Res 12, 1466–1482. 426–427. 999 943 63. Torrents, D., Suyama, M., Zdobnov, E., and 76. Delcher, A. L., Kasif, S., Fleischmann, R. D., 1000 944 Bork, P. (2003) A genome-wide survey of Peterson, J., White, O., and Salzberg, S. L. 1001 945 human pseudogenes, Genome Res 13, (1999) Alignment of whole genomes, Nucleic 1002 946 2559–2567. Acids Research 27, 2369–2376. 1003 947 64. Szczes´niak, M. W., Ciomborowska, J., 77. Delcher, A. L., Phillippy, A., Carlton, J., and 1004 948 Nowak, W., Rogozin, I. B., and Makałowska, Salzberg, S. L. (2002) Fast algorithms for 1005 949 I. (2011) Primate and rodent specific intron large-scale genome alignment and compari- 1006 950 gains and the origin of retrogenes with splice son, Nucleic Acids Research 30, 2478–2483. 1007 951 28 variants, Mol Biol Evol , 33–38. 78. Li, R. Q., Ye, J., Li, S. G., Wang, J., Han, Y. J., 1008 952 65. Goodwin, T. J., Butler, M. I., and Poulter, R. Ye, C., Wang, J., Yang, H. M., Yu, J., Wong, 1009 953 T. (2003) Cryptons: a group of tyrosine- G. K. S., and Wang, J. (2005) ReAS: Recovery 1010 954 recombinase-encoding DNA transposons of ancestral sequences for transposable ele- 1011 955 from pathogenic fungi, Microbiology 149, ments from the unassembled reads of a 1012 956 3099–3109. whole genome shotgun, Plos Comput Biol 1013 957 66. Bureau, T. E., and Wessler, S. R. (1994) Stow- 1, 313–321. 1014 958 away: a new family of elements 79. Price, A. L., Jones, N. C., and Pevzner, P. A. 1015 959 associated with the genes of both monocoty- (2005) De novo identification of repeat 1016 960 ledonous and dicotyledonous plants, Plant families in large genomes, Bioinformatics 21, 1017 961 Cell 6, 907–916. I351-I358. 1018 962 67. Feschotte, C., Swamy, L., and Wessler, S. R. 80. Kurtz, S., Narechania, A., Stein, J. C., and 1019 963 (2003) Genome-wide analysis of mariner- Ware, D. (2008) A new method to compute 1020 964 like transposable elements in rice reveals K-mer frequencies and its application to anno- 1021 965 complex relationships with stowaway minia- tate large repetitive plant genomes, BMC 1022 966 ture inverted repeat transposable elements Genomics 9, 517. 1023 967 163 (MITEs), Genetics , 747–758. 81. Lefebvre, A., Lecroq, T., Dauchel, H., and 1024 968 68. Wicker, T., Robertson, J. S., Schulze, S. R., Alexandre, J. (2003) FORRepeats: detects 1025 969 Feltus, F. A., Magrini, V., Morrison, J. A., 12 Transposable Elements and Their Identification 359

1026 repeats on entire chromosomes and between 95. Ouyang, S., and Buell, C. R. (2004) The 1083 1027 genomes, Bioinformatics 19, 319–326. TIGR Plant Repeat Databases: a collective 1084 1028 82. Agrawal, P., and States, D. (1994) The Repeat resource for the identification of repetitive 1085 1029 Pattern Toolkit (RPT): analyzing the struc- sequences in plants, Nucleic Acids Research 1086 1030 ture and evolution of the C. elegans genome, 32, D360-D363. 1087 1031 Proc Int Conf Intell Syst Mol Biol 2,9. 96. Wicker, T., Matthews, D. E., and Keller, B. 1088 1032 83. Bao, Z. R., and Eddy, S. R. (2002) Automated (2002) TREP: a database for Triticeae repeti- 1089 1033 de novo identification of repeat sequence tive elements, Trends Plant Sci 7, 561–562. 1090 1034 families in sequenced genomes, Genome 97. McCarthy, E. M., and McDonald, J. F. 1091 1035 Research 12, 1269–1276. (2003) LTR_STRUC: a novel search and 1092 1036 84. Edgar, R. C. (2007) PILER-CR: Fast and identification program for LTR retrotranspo- 1093 1037 accurate identification of CRISPR repeats, sons, Bioinformatics 19, 362–367. 1094 1038 BMC Bioinformatics 8, 18. 98. Kalyanaraman, A., and Aluru, S. (2005) 1095 1039 85. Edgar, R. C., and Myers, E. W. (2005) PILER: Efficient algorithms and software for detection 1096 1040 identification and classification of genomic of full-length LTR retrotransposons, Proc 1097 1041 repeats, Bioinformatics 21,I152-I158. IEEE Comput Syst Bioinform Conf, 56–64. 1098 1042 86. Quesneville, H., Bergman, C. M., Andrieu, 99. Kalyanaraman, A., and Aluru, S. (2006) Effi- 1099 1043 O., Autard, D., Nouaud, D., Ashburner, M., cient algorithms and software for detection of 1100 1044 and Anxolabehere, D. (2005) Combined full-length LTR retrotransposons, J Bioin- 1101 1045 evidence annotation of transposable elements form Comput Biol 4, 197–216. 1102 1046 in genome sequences, Plos Comput Biol 1, 100. Rho, M., Choi, J. H., Kim, S., Lynch, M., and 1103 1047 166–175. Tang, H. (2007) De novo identification of 1104 1048 87. Altschul, S. F., Gish, W., Miller, W., Myers, E. LTR retrotransposons in eukaryotic genomes, 1105 1049 W., and Lipman, D. J. (1990) Basic Local Bmc Genomics 8, 90. 1106 1050 Alignment Search Tool, Journal of Molecular 101. Xu, Z., and Wang, H. (2007) LTR_FINDER: 1107 1051 Biology 215, 403–410. an efficient tool for the prediction of full- 1108 1052 88. Rasmussen, K., Stoye, J., and Myers, E. length LTR retrotransposons, Nucleic Acids 1109 1053 (2005) Efficient q-gram filters for finding all Res 35, W265–268. 1110 1054 e-matches over a given length, In RECOMB. 102. Ellinghaus, D., Kurtz, S., and Willhoeft, U. 1111 1055 89. Sharma, D., Issac, B., Raghava, G. P. S., and (2008) LTRharvest, an efficient and flexible 1112 1056 Ramaswamy, R. (2004) Spectral Repeat software for de novo detection of LTR retro- 1113 1057 Finder (SRF): identification of repetitive transposons, BMC Bioinformatics 9, 18. 1114 1058 sequences using Fourier transformation, 103. Lerat, E. (2010) Identifying repeats and 1115 1059 Bioinformatics 20, 1405–1412. transposable elements in sequenced gen- 1116 1060 90. Abrusan, G., Grundmann, N., DeMester, L., omes: how to find your way through the 1117 1061 and Makalowski, W. (2009) TEclass – a tool dense forest of programs, Heredity 104, 1118 1062 for automated classification of unknown 520–533. 1119 1063 eukaryotic transposable elements, Bioinfor- 104. Tu, Z. (2001) Eight novel families of minia- 1120 1064 matics 25, 1329–1330. ture inverted repeat transposable elements in 1121 1065 91. Jurka, J., Klonowski, P., Dagman, V., and the African malaria mosquito, Anopheles 1122 1066 Pelton, P. (1996) Censor – A program for gambiae, Proc Natl Acad Sci U S A 98, 1123 1067 identification and elimination of repetitive ele- 1699–1704. 1124 1068 ments from DNA sequences, Computers & 105. Chen, Y., Zhou, F., Li, G., and Xu, Y. (2009) 1125 1069 Chemistry 20, 119–121. MUST: a system for identification of minia- 1126 1070 92. Jurka, J. (2000) Repbase Update – a database ture inverted-repeat transposable elements 1127 1071 and an electronic journal of repetitive and applications to Anabaena variabilis and 1128 1072 elements, Trends in Genetics 16, 418–420. Haloquadratum walsbyi, Gene 436, 1–7. 1129 1073 93. Jurka, J., Kapitonov, V. V., Pavlicek, A., 106. Du, C., Caronna, J., He, L., and Dooner, H. K. 1130 1074 Klonowski, P., Kohany, O., and Walichiewicz, (2008) Computational prediction and molecu- 1131 1075 J. (2005) Repbase update, a database of eukary- lar confirmation of Helitron transposons in the 1132 1076 otic repetitive elements, Cytogenetic and maize genome, Bmc Genomics 9,51. 1133 1077 Genome Research 110,462–467. 107. Yang, L., and Bennetzen, J. L. (2009) Struc- 1134 1078 94. Kohany, O., Gentles, A. J., Hankus, L., and ture-based discovery and description of 1135 1079 Jurka, J. (2006) Annotation, submission and plant and animal Helitrons, Proc Natl Acad 1136 1080 screening of repetitive elements in Repbase: Sci U S A 106, 12832–12837. 1137 1081 RepbaseSubmitter and Censor, BMC Bioin- 108. Feschotte, C., Keswani, U., Ranganathan, N., 1138 1082 formatics 7, 474. Guibotsy, M. L., and Levine, D. (2009) 1139 360 W. Makałowski et al.

1140 Exploring Repetitive DNA Landscapes Using 115. Hormozdiari, F., Hajirasouliha, I., Dao, P., 1171 1141 REPCLASS, a Tool That Automates the Classi- Hach, F., Yorukoglu, D., Alkan, C., Eichler, 1172 1142 fication of Transposable Elements in Eukary- E. E., and Sahinalp, S. C. (2010) Next- 1173 1143 otic Genomes, Genome Biol Evol 1, 205–220. generation VariationHunter: combinatorial 1174 1144 109. Lowe, T. M., and Eddy, S. R. (1997) tRNAs- algorithms for transposon insertion discovery, 1175 1145 can-SE: A program for improved detection of Bioinformatics 26, i350-i357. 1176 1146 transfer RNA genes in genomic sequence, 116. Churakov, G., Grundmann, N., Kuritzin, A., 1177 1147 Nucleic Acids Research 25, 955–964. Brosius, J., Makalowski, W., and Schmitz, J. 1178 1148 110. Flutre, T., Duprat, E., Feuillet, C., and Ques- (2010) A novel web-based TinT application 1179 1149 neville, H. (2011) Considering Transposable and the chronology of the Primate Alu retro- 1180 1150 Element Diversification in De Novo Annota- poson activity, BMC Evol Biol 10, 376. 1181 1151 tion Approaches, Plos One 6, e16526. 117. Kriegs, J. O., Matzke, A., Churakov, G., 1182 1152 111. Fiston-Lavier, A. S., Carrigan, M., Petrov,D. A., Kuritzin, A., Mayr, G., Brosius, J., and 1183 1153 and Gonzalez, J. (2011) T-lex: a program for Schmitz, J. (2007) Waves of genomic hitchhi- 1184 1154 fast and accurate assessment of transposable ele- kers shed light on the evolution of gamebirds 1185 1155 ment presence using next-generation sequenc- (Aves: Galliformes), BMC Evol Biol 7, 190. 1186 1156 ing data, Nucleic Acids Res 39, e36. 118. Nilsson, M. A., Churakov, G., Sommer, M., 1187 1157 112. Li, H., Ruan, J., and Durbin, R. (2008) Tran, N. V., Zemann, A., Brosius, J., and 1188 1158 Mapping short DNA sequencing reads and Schmitz, J. (2010) Tracking marsupial evolu- 1189 1159 calling variants using mapping quality scores, tion using archaic genomic retroposon 1190 1160 Genome Research 18, 1851–1858. insertions, PLoS Biol 8, e1000436. 1191 1161 113. Rumble, S. M., Lacroute, P., Dalca, A. V., 119. Kriegs, J. O., Zemann, A., Churakov, G., 1192 1162 Fiume, M., Sidow, A., and Brudno, M. Matzke, A., Ohme, M., Zischler, H., Brosius, 1193 1163 (2009) SHRiMP: Accurate Mapping of J., Kryger, U., and Schmitz, J. (2010) Retro- 1194 1164 Short Color-space Reads, Plos Comput Biol poson insertions provide insights into deep 1195 1165 5, e1000386. lagomorph evolution, Mol Biol Evol 27, 1196 1197 1166 114. Hormozdiari, F., Alkan, C., Eichler, E. E., 2678–2681. 1167 and Sahinalp, S. C. (2009) Combinatorial 120. Siguier, P., Perochon, J., Lestrade, L., Mahil- 1198 1168 algorithms for structural variation detection lon, J., and Chandler, M. (2006) ISfinder: the 1199 1169 in high-throughput sequenced genomes, reference centre for bacterial insertion 1200 1170 Genome Research 19, 1270–1278. sequences, Nucleic Acids Res 34, D32–36. 1201

View publication stats Chapter 6

Transposable Elements: Classification, Identification, and Their Use As a Tool For Comparative Genomics

Wojciech Makałowski, Valer Gotea, Amit Pande, and Izabela Makałowska

Abstract

Most genomes are populated by hundreds of thousands of sequences originated from mobile elements. On the one hand, these sequences present a real challenge in the process of genome analysis and annotation. On the other hand, they are very interesting biological subjects involved in many cellular processes. Here we present an overview of transposable elements biodiversity, and we discuss different approaches to transpo- sable elements detection and analyses.

Key words Transposable elements, Transposons, Mobile elements, Repetitive elements, Genome analysis, Genome evolution

1 Introduction

Most eukaryotic genomes contain large numbers of repetitive sequences. This phenomenon was described by Waring and Britten a half century ago using reassociation studies [1, 2]. It turned out that most of these repetitive sequences originated in transposable elements (TEs) [3], though the repetitive fraction of a genome varies significantly between different organisms, from 12% in Cae- norhabditis elegans [4] to 50% in mammals [3], and more than 80% in some plants [5]. With such large contributions to genome sequences, it is not surprising that TEs have a significant influence on the genome organization and evolution. Although much prog- ress has been achieved in understanding the role TEs play in a host genome, we are still far from the comprehensive picture of the delicate evolutionary interplay between a host genome and the invaders. They also pose various challenges to the genomic com- munity, including aspects related to their detection and classifica- tion, genome assembly and annotation, genome comparisons, and mapping of genomic variants. They also pose various challenges to the genomic community, including aspects related to their detec- tion and classification, genome assembly and annotation, genome

Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910, https://doi.org/10.1007/978-1-4939-9074-0_6, © The Author(s) 2019 177 178 Wojciech Makałowski et al.

comparisons, and mapping of genomic variants. Here we present an overview of TE diversity and discuss major techniques used in their analyses.

2 Discovery of Mobile Elements

Transposable elements were discovered by Barbara McClintock during experiments conducted in 1944 on maize. Since they appeared to influence phenotypic traits, she named them controlling elements. However, her discovery was met with less than enthusiastic reception by the genetic community. Her presen- tation at the 1951 Cold Spring Harbor Symposium was not under- stood and at least not very well received [6]. She had no better luck with her follow-up publications [7–9] and after several years of frustration decided not to publish on the subject for the next two decades. Not for the first time in the history of science, an unap- preciated discovery was brought back to life after some other discovery has been made. In this case it was the discovery of insertion sequences (IS) in bacteria by Szybalski group in the early 1970s [10]. In the original paper they wrote: “Genetic elements were found in higher organisms which appear to be readily trans- posed from one to another site in the genome. Such elements, identifiable by their controlling functions, were described by McClintock in maize. It is possible that they might be somehow analogous to the presently studied IS insertions” [10]. The impor- tance of McClintock’s original work was eventually appreciated by the genetic community with numerous awards, including 14 hon- orary doctoral degrees and a Nobel Prize in 1983 “for her discovery of mobile genetic elements” (http://nobelprize.org/nobel_ prizes/medicine/laureates/1983/). Coincidently, at the same time as Szybalski “rediscovered” TEs, Susumu Ohno popularized the term junk DNA that influenced genomic field for decades [11], although the term itself was used already before [12, 13].1 Ohno referred to the so-called noncoding sequences or, to be more precise, to any piece of DNA that do not code for a protein, which included all genomic pieces originated in transposons. The unfavorable picture of transposable and trans- posed elements started to change in early 1990s when some researchers noticed evolutionary value of these elements [14, 15]. With the wheel of fortune turning full circle and advances of genome sciences, TE research is again focused on the role of mobile elements played in the evolution of gene regulation [16–23].

1 The historical background of the “junk DNA” term was recently discussed by Dan Graur in his excellent blog http://judgestarling.tumblr.com/post/64504735261/the-origin-of-the-term-junk-dna-a-historical Transposable Elements: Classification, Identification, and Their Use... 179

3 Transposons Classification

3.1 Insertion The bacterial genome is composed of a core genomic backbone Sequences and Other decorated with a variety of multifarious functional elements. These Bacterial Transposons include mobile genetic elements (MGEs) such as bacteriophages, conjugative transposons, integrons, unit transposons, composite trans- posons, and insertion sequences (IS). Here we elaborate upon the last class of these elements as they are most widely found and described [24]. The ISs were identified during studies of model genetic systems by virtue of their capacity to generate mutations as a result of their translocation [10]. In-depth studies in antibiotic resistance and transmissible plasmids revealed an important role for these mobile elements in formation of resistance genes and promoting gene capture. In particular, it was observed that several different ele- ments were often clustered in “islands” within plasmid genomes and served to promote plasmid integration and excision. Although these elements sometimes generate beneficial muta- tions, they may be considered genomic parasites as ISs code only for the enzyme required for their own transposition [24]. While an IS element occupies a chromosomal location, it is inherited along with its host’s native genes, so its fitness is closely tied to that of its host. Consequently, ISs causing deleterious mutations that disrupt a genomic mode or function are quickly eliminated from the popula- tion. However, intergenically placed ISs have a higher chance to be fixed in the population as they are likely neutral regarding popula- tion’s fitness [25]. ISs are generally compact (Fig. 1). They usually carry no other functions than those involved in their mobility. These elements contain recombinationally active sequences which define the boundary of the element, together with Tpase, an enzyme, which processes these ends and whose gene usually encompasses the entire length of the element [26]. Majority of ISs exhibit short terminal inverted-repeat sequences (IR) of length 10–40 bp. Sev- eral notable exceptions do exist, for example, the IS91, IS110, and IS200/605 families. The IRs contain two functional domains [27]. One is involved in Tpase binding; the other cleaves and transfers strand-specific reactions resulting in transposition. IS promoters are often posi- tioned partially within the IR sequence upstream of the Tpase gene. Binding sites for host-specific proteins are often located within proximity to the terminal IRs and play a role in modulating trans- position activity or Tpase expression [28]. A general pattern for the functional organization of Tpases has emerged from the limited numbers analyzed. The N-terminal region contains sequence- specific DNA binding activities of the proteins while the catalytic domain is often localized toward the C-terminal end [28]. 180 Wojciech Makałowski et al.

dr IRORF(s)ORF(s) IR dr

Fig. 1 Schematic representation of insertion sequences (IS). dr direct repeats, IR inverted repeats, ORF open reading frame

Another common feature of ISs is duplication of a target site that results in short direct repeats (DRs) flanking the IS [29]. The length of the direct repeat varies from 2 to 14 base pairs and is a hallmark of a given element. Homologous recombination between two IS elements can result in each having two different DRs [30]. ISs have been classified on the basis of (1) similarities in genetic organization (arrangement of open reading frames); (2) marked identities or similarities in their Tpases (common domains or motifs); (3) similar features of their ends (terminal IRs); and (4) fate of the nucleotide sequence of their target sites (generation of a direct target duplication of determined length). Based on the above rules, ISs are currently classified in 30 families (Table 1)[31].

3.2 Eukaryotic The first TE classification system was proposed by Finnegan in Transposable 1989 [32] and distinguished two classes of TEs characterized by Elements their transposition intermediate: RNA (class I or retrotransposons) or DNA (class II or DNA transposons). The transposition mecha- nism of class I is commonly called “copy and paste” and that of class II, “cut and paste.” In 2007 Wicker et al. [33] proposed hierarchi- cal classification based on TEs structural characteristics and mode of replication (see Table 2 and Fig. 2). Below we present a brief overview of eukaryotic mobile elements that in general follows this classification.

3.2.1 Class I: Mobile As mentioned above, class I TEs transpose through an RNA inter- Elements mediary. The RNA intermediate is transcribed from genomic DNA and then reverse-transcribed into DNA by a TE-encoded reverse transcriptase (RT), followed by reintegration into a genome. Each replication cycle produces one new copy, and as a result, class I elements are the major contributors to the repetitive fraction in large genomes. Retrotransposons are divided into five orders: LTR retrotransposons, DIRS-like elements, Penelope-like elements (PLEs), LINEs (long interspersed elements), and SINEs (short interspersed elements). This scheme is based on the mechanistic features, organization, and reverse transcriptase phylogeny of these retroelements. Accidentally, the retrotranscriptase coded by an autonomous TE can reverse-transcribe another RNA present in the cell, e.g., mRNA, and produce a retrocopy of it, which in most cases results in a pseudogene. The LTR retrotransposons are characterized by the presence of long terminal repeats (LTRs) ranging from several hundred to several thousand base pairs. Both exogenous retroviruses and LTR retrotransposons contain a gag gene that encodes a viral Transposable Elements: Classification, Identification, and Their Use... 181

Table 1 Prokaryotic transposable elements as presented in the IS Finder database [31]

Family Typical size range in bp Direct repeat size in bp IRsa Number of ORFs

IS1 740–4600 0–10 Y 1 or 2 IS110 1200–1550 0 Y 1 IS1182 1330–1950 0–60 Y 1 IS1380 1550–2000 4–5 Y 1 IS1595 700–7900 8 Y 1 IS1634 1500–2000 5–6 Y 1 IS200/IS605 600–2000 0 Y/N 1 or 2 IS21 1750–2600 4–8 Y 2 IS256 1200–1500 8–9 Y 1 IS3 1150–1750 5 Y 2 IS30 1000–1700 2–3 Y 1 IS4 1150–5400 8–13 Y 1 or more IS481 950–1300 4–15 Y 1 IS5 800–1500 2–9 Y 1 or 2 IS6 700–900 8 Y 1 IS607 1700–2500 0 N 2 IS630 1000–1400 2 Y 1 or 2 IS66 1350–3000 8–9 Y 1 or more IS701 1400–1550 4 Y 1 IS91 1500–2000 0 N 1 IS982 1000 3–9 Y 1 ISAs1 1200–1500 8–10 Y 1 ISAzo13 1250–2200 0–4 Y 1 ISH3 1225–1500 4–5 Y 1 ISH6 1450 8 Y ISL ISKra4 1400–2900 0–9 Y 1 or more ISL3 1300–2300 8 Y ISKra4 ISLre2 1500–2000 9 Y 1 Tn3 Over 3000 0 Y More than 1 ISNCY 1300–2400 0–12 Y/N 1 or 2 aPresence (Y) or absence (N) of terminal inverted repeats 182 Wojciech Makałowski et al.

Table 2 Classification of eukaryotic transposable elements as proposed by Wicker et al. [33]

Class Order Superfamily Phylogenetic distribution

Class I (retrotransposons) LTR Copia Plants, metazoans, fungi Gypsy Plants, metazoans, fungi Bel-Pao Metazoans Retrovirus Metazoans ERV Metazoans DIRS DIRS Plants, metazoans, fungi Ngaro Metazoans, fungi VIPER Trypanosomes PLE Penelope Plants, metazoans, fungi LINE R2 Metazoans RTE Metazoans Jockey Metazoans L1 Plants, metazoans, fungi SINE tRNA Plants, metazoans, fungi 7SL Plants, metazoans, fungi 5S Metazoans SVAa Primates Retrogenesa Plants, metazoans, fungi Class II (DNA transposons) TIR Tc1-Mariner Plants, metazoans, fungi Subclass 1 hAT Plants, metazoans, fungi Mutator Plants, metazoans, fungi Merlin Metazoans Transib Metazoans, fungi P Plants, metazoans PiggyBac Metazoans PIF-harbinger Plants, metazoans, fungi CACTA Plants, metazoans, fungi Crypton Crypton Fungi Class II (DNA transposons) Helitron Helitron Plants, metazoans, fungi Subclass 2 Maverick Maverick Metazoans, fungi Please note that SVAs and retrogenes are not included in that classification aNot included in the original Wicker classification

particle coat and a pol gene that encodes a reverse transcriptase, ribonuclease H, and an integrase, which provide the enzymatic machinery for reverse transcription and integration into the host genome. Reverse transcription occurs within the viral or viral-like particle (GAG) in the cytoplasm, and it is a multistep process [34]. Unlike LTR retrotransposons, exogenous retroviruses con- tain an env gene, which encodes an envelope that facilitates their migration to other cells. Some LTR retrotransposons may contain remnants of an env gene, but their insertion capabilities are limited to the originating genome [35]. This would rather suggest that they originated in exogenous retroviruses by losing the env gene. However, there is evidence that suggests the contrary, given that Transposable Elements: Classification, Identification, and Their Use... 183

Class I Retroviruses TSD TSD LTR gag pol envenv LTR LTR elements TSD TSD LTR gag pol LTR DIRS TSD TSD TIR gagggag YR TIR

PLE - Penelope-like elements TSD TSD LTR RT EN LTR

LINEs TSD TSD ORF1 ORF2 polyA

SINEs TSD TSD A B polyA SVAs TSD TSD CGCTCTn Alu-like VNTR SSINE-RINNE-R polyA

Retrogenes TSD TSD ORFORF polyA

Class II - subclass 1 “classical” autonomous DNA transposons TIR TIR TRANSPOSASETRAANSPOSAASE

“classical” nonautonomous DNA transposons TIR TIR

Class II - subclass 2 Helitrons TC HELICASE,HHEELICASE, (rpalrppaall) CTRR

Mavericks TDS TIR TIR TDS iinintnt ORFsOORFs polBpolB

Fig. 2 Structures of eukaryotic mobile elements. See text for detailed discussion

LTR retrotransposons can acquire the env gene and become infec- tious entities [36]. Presently, most of the LTR sequences (85%) in the human genome are found only as isolated LTRs, with the internal sequence being lost most likely due to homologous 184 Wojciech Makałowski et al.

recombination between flanking LTRs [37]. Interestingly, LTR retrotransposons target their reinsertion to specific genomic sites, often around genes, with putative important functional implica- tions for a host gene [35]. Lander et al. estimated that 450,000 LTR copies make up about 8% of our genome [38]. LTR retro- transposons inhabiting large genomes, such as maize, wheat, or barley, can contain thousands of families. However, despite the diversity, very few families comprise most of the repetitive fraction in these large genomes. Notable examples are Angela (wheat) [39], BARE1 (barley) [40], Opie (maize) [41], and Retrosor6 (sorghum) [42]. The DIRS order clusters structurally diverged group of trans- posons that possess a tyrosine recombinase (YR) gene instead of an integrase (INT) and do not form target site duplications (TSDs). Their termini resemble either split direct repeats (SDR) or inverted repeats. Such features indicate a different integration mechanism than that of other class I mobile elements. DIRS were discovered in the slime mold (Dictyostelium discoideum) genome in the early 1980s [43], and they are present in all major phylogenetic lineages including vertebrates [44]. It has been showed that they are also common in hydrothermal vent organisms [45]. Another order, termed Penelope-like elements (PLE), has wide, though patchy distribution from amoebae and fungi to vertebrates with copy number up to thousands per genome [46]. Interestingly, no PLE sequences have been found in mammalian genomes, and apparently they were lost from the genome of C. elegans [47]. Although PLEs with an intact ORF have been found in several genomes, including Ciona and Danio, the only transcrip- tionally active representative, Penelope, is known from Drosophila virilis. It causes the hybrid dysgenesis syndrome characterized by simultaneous mobilization of several unrelated TE families in the progeny of dysgenic crosses. It seems that Penelope invaded D. virilis quite recently, and its invasive potential was demonstrated in D. melanogaster [46]. PLEs harbor a single ORF that codes for a protein containing reverse transcriptase (RT) and endonuclease (EN) domains. The PLE RT domain more closely resembles telo- merase than the RT from LTRs or LINEs. The EN domain is related to GIY-YIG intron-encoded endonucleases. Some PLE members also have LTR-like sequences, which can be in a direct or an inverse orientation, and have a functional intron [46]. LINEs [48, 49] do not have LTRs; however, they have a poly-A tail at the 30 end and are flanked by the TSDs. They comprise about 21% of the human genome and among them L1 with about 850,000 copies is the most abundant and best described LINE family. L1 is the only LINE retroposon still active in the human genome [50]. In the human genome, there are two other LINE- like repeats, L2 and L3, distantly related to L1. A contrasting Transposable Elements: Classification, Identification, and Their Use... 185

situation has been noticed in the malaria mosquito Anopheles gam- biae, where around 100 divergent LINE families compose only 3% of its genome [51]. LINEs in plants, e.g., Cin4 in maize and Ta11 in Arabidopsis thaliana, seem rare as compared with LTR retro- transposons. A full copy of mammalian L1 is about 6 kb long and contains a PolII promoter and two ORFs. The ORF1 codes for a non-sequence-specific RNA binding protein that contains zinc fin- ger, leucine zipper, and coiled-coil motifs. The ORF1p functions as chaperone for the L1 mRNA [52, 53]. The second ORF encodes an endonuclease, which makes a single-stranded nick in the geno- mic DNA, and a reverse transcriptase, which uses the nicked DNA to prime reverse transcription of LINE RNA from the 30 end. Reverse transcription is often unfinished, leaving behind fragmen- ted copies of LINE elements; hence most of the L1-derived repeats are short, with an average size of 900 bp. LINEs are part of the CR1 clade, which has members in various metazoan species, including fruit fly, mosquito, zebrafish, pufferfish, turtle, and chicken [54]. Because they encode their own retrotransposition machinery, LINE elements are regarded as autonomous retrotransposons. SINEs [48, 49] evolved from RNA genes, such as 7SL and tRNA genes. By definition, they are short, up to 1000 base pair long. They do not encode their own retrotranscription machinery and are considered as nonautonomous elements and in most cases are mobilized by the L1 machinery [55]. The outstanding member of this class from the human genome is the Alu repeat, which contains a cleavage site for the AluI restriction enzyme that gave its name [56]. With over a million copies in the human genome, Alu is probably the most successful transposon in the history of life. Primate-specific Alu and its rodent relative B1 have limited phylo- genetic distribution suggesting their relatively recent origins. The mammalian-wide interspersed repeats (MIRs), by contrast, spread before eutherian radiation, and their copies can be found in differ- ent mammalian groups including marsupials and monotremes [57]. SVA elements are unique primate elements due to their composite structure. They are named after their main components: SINE, VNTR (a variable number of tandem repeats), and Alu [58]. Usually, they contain the hallmarks of the retroposition, i.e., they are flanked by TSDs and terminated by a poly(A) tail. It seems that SVA elements are nonautonomous retrotransposons mobilized by L1 machinery, and they are thought to be transcribed by RNA polymerase II. SVAs are transpositionally active and are responsible for some human diseases [59]. They originated less than 25 million years ago, and they form the youngest retrotransposon family with about 3000 copies in the human genome [58]. Retro(pseudo)genes are a special group of retroposed sequences, which are products of reverse transcription of a spliced (mature) mRNA. Hence, their characteristic features are an absence 186 Wojciech Makałowski et al.

of promoter sequence and introns, the presence of flanking direct repeats, and a 30-end polyadenosine tract [60]. Processed pseudo- genes, as sometimes retropseudogenes are called, have been gener- ated in vitro at a low frequency in the human HeLa cells via mRNA from a reporter gene [60]. The source of the reverse transcription machinery in humans and other vertebrates seems to be active L1 elements [61]. However, not all retroposed messages have to end up as pseudogenes. About 20% of mammalian protein-encoding genes lack introns in their ORFs [62]. It is conceivable that many genes lacking introns arose by retroposition. Some genes are known to be retroposed more often than others. For instance, in the human genome there are over 2000 retropseudogenes of ribosomal proteins [63]. A genome-wide study showed that the human genome harbors about 20,000 pseudogenes, 72% of which most likely arose through retroposition [64]. Interestingly, the vast majority (92%) of them are quite recent transpositions that occurred after primate/rodent divergence [64]. Some of the retro- posed genes may undergo quite complicated evolutionary paths. An example could be the RNF13B retrogene, which replaced its own parental gene in the mammalian genomes. This retrocopy was duplicated in primates, and the evolution of this primate-specific copy was accompanied by the exaptation of two TEs, Alu and L1, and intron gain via changing a part of coding sequence into an intron leading to the origin of a functional, primate-specific retro- gene with two splicing variants [65].

3.2.2 Class II: Mobile Class II elements move by a conservative cut-and-paste mechanism; Elements the excision of the donor element is followed by its reinsertion elsewhere in the genome. DNA transposons are abundant in bacte- ria, where they are called insertion sequences (see Subheading 3.1), but are present in all phyla. Wicker et al. distinguished two sub- classes of DNA transposons based on the number of DNA strands that are cut during transposition [33]. Classical “cut-and-paste” transposons belong to the subclass I, and they are classified as the TIR order. They are characterized by terminal inverted repeats (TIR) and encode a transposase that binds near the inverted repeats and mediates mobility. This process is not usually a replicative one, unless the gap caused by excision is repaired using the sister chromatid. When inserted at a new loca- tion, the transposon is flanked by small gaps, which, when filled by host enzymes, cause duplication of the sequence at the target site. The length of these TSDs is characteristic for particular transpo- sons. Nine superfamilies belong to the TIR order, including Tc1- Mariner, Merlin, Mutator, and PiggyBac. The second order Cryp- ton consists of a single superfamily of the same name. Originally thought to be limited to fungi [66], now it is clear that they have a wide distribution, including animals and heterokonts [67]. A Transposable Elements: Classification, Identification, and Their Use... 187

heterogeneous, small, nonautonomous group of elements MITEs also belong to the TIR order [68], which in some genomes ampli- fied to thousands of copies, e.g., Stowaway in the rice genome [69], Tourist in most bamboo genomes [70], or Galluhop in the chicken genome [71]. Subclass II includes two orders of TEs that, just as those from subclass I, do not form RNA intermediates. However, unlike “clas- sical” DNA transposons, they replicate without double-strand cleavage. Helitrons replicate using a rolling-circle mechanism, and their insertion does not result in the target site duplication [72]. They encode tyrosine recombinase along with some other proteins. Helitrons were first described in plants, but they are also present in other phyla, including fungi and mammals [73, 74]. Mavericks are large transposons that have been found in different eukaryotic lineages excluding plants [75]. They encode various numbers of proteins that include DNA polymerase B and an integrase. Kapitonov and Jurka suggested that their life cycle includes a single-strand excision, followed by extrachromosomal replication and reintegration to a new location [76].

4 Identification of Transposable Elements

With the ever-growing number of sequenced genomes from differ- ent branches of the tree of life, there are increasing TE research opportunities. There are several reasons why one would like to analyze TEs and their “offsprings” left in a genome. First of all, they are very interesting biological subjects to study genome struc- ture, gene regulation, or genome evolution. In some cases, they also make genome assembly and annotation quite challenging, especially with the current NGS technology that generates reads shorter than TEs. Nevertheless, TEs should be and are worthy to study. However, it is not a simple task and requires different approaches depending on the level of analysis. We will walk through these different levels starting with raw genome sequences without any annotation and discuss different methods and software used for TE analyses. In principle, we can imagine two scenarios: in the first one, genomic or transcriptome sequences are coming from a spe- cies for which there is already some information about the transpo- son repertoire, for instance, a related genome has been previously characterized or TEs have been studied before. In the second scenario, we have to deal with a completely unknown genome or a genome for which little information exists with regard to TEs. In the former case, one can apply a range of techniques used in comparative genomics or try to search specific libraries of transpo- sons using the “homology search” approach. In the latter, which is basically an approach to identify TEs de novo, first we need to find 188 Wojciech Makałowski et al.

any repeats in a genome and then attempt characterization and classification of newly identified repetitive sequences. In this approach, we will find any repeats, not necessarily transposons. There are many algorithms, and even more software, that can be applied in both approaches.

4.1 De Novo There are several steps involved in the de novo characterization of Approaches to Finding transposons. First, we need to find all the repeats in a genome, then Repetitive Elements build a consensus of each family of related sequences, and finally classify detected sequences. For the first step, three groups of algorithms exist: the k-mer approach, sequence self-comparison, and periodicity analysis. In the k-mer approach, sequences are scanned for overrepre- sentation of strings of certain length. The idea is that repeats that belong to the same family are compositionally similar and share some oligomers. If the repeats occur many times in a genome, then those oligomers should be overrepresented. However, since repeats and transposons in particular are not perfect copies of a certain sequence, some mismatches must be allowed when oligo frequen- cies are calculated. The challenge is to determine optimal size of an oligo (k-mer) and number of mismatches allowed. Most likely, these parameters should be different for different types of transpo- sons, i.e., low versus high copy number, old versus young transpo- sons, and those from different classes and families. Several programs have been developed based on the k-mer idea using a suffix tree data structure including REPuter [77, 78], Vmatch (Kurtz, unpub- lished; http://www.vmatch.de/), and Repeat-match [79, 80]. Another approach is to use fixed length k-mers as seeds and extend those seeds to define repeat’s family as it was imple- mented in ReAS [81], RepeatScout [82], and Tallymer [83]. Another interesting algorithm can be found in the FORRe- peats software [84], which uses factor oracle data structure [85]. It starts with detection of exact oligomers in the analyzed sequences, followed by finding approximate repeats and their alignment. The second group of programs developed for de novo detec- tion of repeated sequences is using self-comparison approach. Repeat Pattern Toolkit [86], RECON [87], PILER [88, 89], and BLASTER [90] belong to this group. The idea is to use one of the fast sequence similarity tools, e.g., BLAST [91], followed by clus- tering search results. The programs differ in the search engine for the initial step, though most are using some of the BLAST algo- rithms, the clustering method, and heuristics of merging initial hits into a prototype element. For instance, RECON [87], which was developed for the repeat finding in unassembled sequence reads, starts with an all-to-all comparison using WU-BLAST engine. Then, single-linkage clustering is applied to alignment results that is followed by construction of an undirected graph with overlap- ping. The shortest sequence that contains connected images Transposable Elements: Classification, Identification, and Their Use... 189

(aligned subsequences) creates a prototype element. However, this procedure might result in composite elements. To avoid this, all the images are aligned to the prototype element to detect potential illegitimate mergers and split those at every point with a significant number of image ends. PILER [88, 89] is using a different approach to find initial clusters. Instead of BLAST, it uses PALS (pairwise alignment of long sequences) for the initial alignment. PALS records only hit points and uses banded search of the defined maximum distance to optimize its performance. To further improve performance of the system, PILER uses different heuristics for different types of repeats, i.e., satellites, pseudosatellites, terminal repeats, and inter- spersed repeats. Finally, a consensus sequence is generated from a multiple sequence alignment of the defined family members. Dot matrix is a simple method to compare two biological sequences. The graphical output of such an analysis is called a dot- plot. Dotplots can be used to detect conserved domains, sequence rearrangements, RNA secondary structure, or repeated sequences. It compares every residue in one sequence to every residue in the other sequence or to every residue of the same sequence in the self- comparison mode. In the latter case, there will be a main diagonal line representing a perfect match and a number of short diagonal lines representing similar regions (red circles in Fig. 3). Interestingly, simple repeats appear as diamond shapes on a main diagonal line or short vertical and horizontal lines outside the main diagonal line (red squares in Fig. 3). The method was introduced to biological analyses almost a half century ago [92, 93]. However, the first easy-to-use software with a graphical interface, DOTTER, was developed much later [94]. The major problem of this approach is the time required for the dotplot calculation, which is of quadratic complexity. This proved to be prohibitive for comparison of the genome-size sequences. One of the solutions to this problem is using a word index for the fast identification of substrings. Gepard implements the suffix array data structure to improve the execution time [95]. It is written in Java, which makes it platform-independent. Gepard enables analyses of sequences at the mega-base level in the matter of seconds, and it takes about an hour to analyze the whole human chromosome I [95]. The example of the dotplot produced by the Gepard is presented in Fig. 3.

4.2 Transposable With constant improvement of sequencing technology associated Elements with decreasing sequencing cost, the number of new sequenced Determination in genomes is exploding. As of January 2019, there are more than NGS Data 7000 eukaryotic and almost 180,000 prokaryotic genomes publicly available (information retrieved on January 16, 2019, from https:// www.ncbi.nlm.nih.gov/genome/browse/). However, this comes with a price; most of the recently sequenced genomes, due to the short read sequencing technology, are available at various levels of 190 Wojciech Makałowski et al.

Fig. 3 Graphical output of the Gepard. A 30 kb fragment of mouse chromosome 12 was compared to itself. Similar sequences are represented by diagonal lines if both fragments are located on the same strains or by reverse diagonal lines if the fragments with significant similarity are located on opposite strands. Some of the examples are marked with the red circles. Simple repeats are represented by either diamond shapes on the main diagonal or horizontal and vertical lines. Some of the examples are marked with the red squares

“completeness” or assembly. For most non-model organisms, we are presented with draft assemblies of rather short contigs. More- over, these genomes usually are not very well annotated, with TEs not being on the annotation priority list. Unfortunately, genome annotation pipelines do not include TE annotation, focusing on protein-coding and RNA-coding genes. To fill the gap, a number of methods have been developed to detect repeats from short reads. Two algorithms dominate in attempts to determine repeats in NGS raw reads: clustering and k-mer. Transposome [96] and RepeatEx- plorer [97] employ the former approach, while RepARK [98], REPdenovo [99], and dnaPipeTE [100] utilize the latter one. Since NGS results in the relatively short reads, assembly of selected sequences into longer contigs representing TEs is required after initial clustering of the raw reads.

4.3 Population-Level Recent advances in sequencing technology and the sharp decrease Analyses of in sequencing costs allow genomic studies at population level. Transposable Although initially focused on human populations [101–103], Elements recent population studies of other species have been initiated as well [104, 105]. One of the common questions in such studies is Transposable Elements: Classification, Identification, and Their Use... 191

Actual genomic fragment with a TE insertion

a k b l c m d n e o f p g q h r i s j t

Reference genome - a TE missing

a k b l c m d n e o f p g q h r i s j t k l m n o p q r s t

Fig. 4 Detection of a TE insertion (polymorphic TE) from the NGS data. The upper panel shows real genomic sequence with a TE, which is not present in the reference genome (lower panel). Hypothetical discordant pair- reads (a, b, d, f, g, i, j, k, l, o, q, s, and t) have only one the pairs mapped to the reference genome, while the other would map to a consensus sequence of a TE. The hypothetical split reads (c, e, h, m, p, and r) will have part of the sequence mapped to the reference genome and the other to a TE consensus sequence

how much structural variation (SV) exists in different populations. TE insertions are responsible for about 25% of structural variants in human genomes [106]. In general, any tool designed for detection of SV should work for TE insertion analysis, but specialized soft- ware can take advantage of specific expectations related to inser- tions of TEs. Most of the SV-detection algorithms rely on paired- end reads and are based on discordant read pair mapping and/or split reads mapping (Fig. 4). A discordant pair read is defined as one that is inconsistent with the expected insert size in the library used for sequencing. For example, if the insert size of the library used for sequencing is 300 nt but the reads map to a reference genome within much larger distance or to two different chromosomes, such a pair is considered to be discordant. If, additionally, one of the reads maps to a TE, it might be an indication of a polymorphic TE. Usually some filtering is used to reduce a chance of false positives. These include minimum read number in the cluster mapped to a unique position, quality score of the reads, or consis- tency in reads orientation. However, the discordant read mapping cannot detect exact insertion position. Therefore another step is required that may include local assembly and split-read mapping. 192 Wojciech Makałowski et al.

A split read is defined as a read for which part of it maps uniquely to one position in the genome and the other part to another position. This is, for example, a very common feature of the mapping of RNA-seq data to eukaryotic genomes when reads span two exons. Split reads are being also observed if structural variants exist. In a case of a TE insertion, a part of the read will be mapped to a unique location and the rest to a TE in some other location or may not be mapped at all (Fig. 4). Different methods for structure variant detection return differ- ent results on the same data. Recently published benchmarking demonstrates that TE detection is not an exception [107, 108]. Ewing [107] compared TranspoSeq [109] with two other tools, Tea [110] and TraFIC [111], on the same data sets. Results were not very encouraging as in both comparisons there was a high fraction of insertions detected only by a single program [107]. Similar conclusion was drawn by Rishishwar et al. [108]ina benchmark of larger number of tools including MELT [106], Mobster [112], and RetroSeq [113]. It is clear that different soft- ware have different biases, and each one can produce a high number of false positives. It is recommended then to employ several pro- grams for high confidence results. Exhaustive tests run on real and simulated human genome data showed superior performance of MELT [106, 108]. TIPseqHunter is another tool developed to identify transposon insertion sites based on the transpose insertion profiling using next-generation sequencing [114]. It employs machine learning algorithm to ensure high precision and reliability. It is worth to note that all these tools were designed for short read sequencing methods. However, with current development of single-molecule long reads, sequencing technologies such as Pac- Bio and Oxford Nanopore may make these methods irrelevant and obsolete. Long reads should be of superior performance and make TE insertion detection relatively easy with more traditional aligners, such as MegaBLAST [115], BLAT [116], or LAST [117].

4.4 Comparative To understand the general pattern of TE insertions in different Genomics of TE genomes and evolutionary dynamics of TE families, a comparative Insertions approach is necessary. Although precomputed alignments of differ- ent genomes are publicly available, for example, the UCSC Genome Browser includes Multiz alignments of 100 vertebrate genomes [118], not many tools are available for such analyses. One of them is GPAC (genome presence/absence compiler) that creates a table of presence and absence of certain elements based on the precomputed multiple genomes alignment [119](http://bioin formatics.uni-muenster.de/tools/gpac/index.hbi). The tool is quite generic, but is well suited for the TE comparative analysis (see Fig. 5 for an example). Fig. 5 The output table of the GPAC software. Several Alu elements were analyzed for presence/absence in 11 primate species. The human genome was used a reference, and “hit coordinates” refer to that genome along with the information on the annotated elements in the hit region and a type of the region. For each genome, the presence (+) or absence (À) of the hit is presented. x/ denotes that only part of the original insertion (less than 20%) is present in a given genome, and ¼¼ indicates that more than 80% of the expected sequence is not alignable in a given locus. The optional phylogenetic tree constructed based on the obtained data is shown in the lower right corner 194 Wojciech Makałowski et al.

4.5 Classification of Once the consensus of a repetitive element has been constructed, it Transposable can be subjected to further analyses. There are two major categories Elements of programs dealing with the issue of TE classification: library or similarity-based and signature-based. The latter approach is very often used in specialized software, i.e., tailored for specific type of TEs. However, some general tools also exist, e.g., TEclass [120]. The library approach is probably the most common approach for TE classification. It is also very efficient and quite reliable as long as good libraries of prototype sequences exist. In practice, it is the recommended approach when we analyze sequences from well- characterized genomes or from a genome relatively closely related to a well-studied one. For instance, since the human genome is one of the best studied, any primate sequences can be confidently analyzed using the library approach. Most likely, the first software using the similarity-based approach for repeat classification was Censor developed by Jerzy Jurka in the early 1990s [121]. It uses RepBase [122] as a reference collection and BLAST as a search engine [91]. However, the most popular TE detection software is RepeatMasker (RM) (http://www.repeatmasker.org). Interest- ingly, RM is also using RepBase as a reference collection and AB-BLAST, RM-BLAST, or cross-match as a search engine. In both cases, original search hits are processed by a series of Perl scripts to determine the structure of elements and classify them to one of known TE families. Both Censor and RM also employ user- provided libraries, including “third-party” lineage-specific libraries, e.g., TREP [123]. Over the years, RepeatMasker has become a standard tool for TE analyses, and often its output is used for more biologically oriented studies (see below). The aforemen- tioned programs have one important drawback: since they are completely based on sequence similarity, they can detect only TEs that had been previously described. Nevertheless, similarity searches, like in many other bioinformatics tasks, should be the first approach for the analysis of repetitive elements. Signature-based programs are searching for certain features that characterize specific TEs, for example, long terminal repeats (LTRs), target site duplications (TSDs), or primer-binding sites (PBSs). Since different types (families) of elements are structurally different, they require specific rules for their detection. Hence, many of the programs that use signature-based algorithms are specific for certain type of transposons. There are a number of programs specialized in detection of LTR transposons, which are based on a similar methodology. They take into account several structural features of LTR retroposons including size, distance between paired LTRs and their similarity, the presence of TSDs, and the presence of replication signals, i.e., the primer-binding site and the polypurine tract (PPTs). Some of the programs check also for ORFs coding for the gag, pol, and env proteins. LTR_STRUC [124] was one of the first programs based on this principle. It uses Transposable Elements: Classification, Identification, and Their Use... 195

seed-and-extend strategy to find repeats located within user- defined distance. The candidate regions are extended based on the pairwise alignment to determine cognate LTRs’ boundaries. Putative full-length elements are scored based on the presence of TSD, PBS, PPT, and reverse transcriptase ORF. However, because of the heuristics described above, LTR_STRUC is unable to find incomplete LTR transposons and in particular solo LTRs. Another limitation of this program is its Windows-only implementation that significantly prohibits automated large-scale analysis. Several other programs have been developed based on similar principles, e.g., LTR_par [125], find_LTR [126], LTR_FINDER [127], and LTRharvest [128]. Lerat tested performance of these programs [129], and although sensitivity of the methods was acceptable (between 40% and 98%), it was at the expense of specificity, which was very poor. In several cases, the number of falsely assigned transposons exceeded the number of correctly detected ones. Another group of transposons that have a relatively conserved structure are MITEs and Helitrons. Several specialized programs were developed that take advantage of their specific structure. FINDMITE [130] and MUST [131] are tailored for MITEs, while HelitronFinder [132] and HelSearch [133] were developed for Helitron detection. A further interesting approach to transposon classification was implemented by Abrusan et al. [120] in the software package called TEclass, which classifies unknown TE consensus sequences into four categories, according to their mechanism of transposition: DNA transposons, LTRs, LINEs, and SINEs. The classification uses support vector machines, random forests, learning vector quantization, and predicts ORFs. Two complete sets of classifiers are built using tetramers and pentamers, which are used in two separate rounds of the classification. The software assumes that the analyzed sequence represents a TE and the classification process is binary, with the following steps: forward versus reverse sequence orientation > DNA versus retrotransposon > LTRs versus nonLTRs (for retroelements) > LINEs versus SINEs (for nonLTR repeats). If the different methods of classification lead to conflicting results, TEclass reports the repeat either as unknown or as the last category where the classification methods agree (http://bioinfor matics.uni-muenster.de/tools/teclass/index.hbi).

4.6 Pipelines Recent years witnessed some attempt to create more complex, global analyses systems. One such a system is REPCLASS [134]. It consists of three classification modules: homology (HOM), structure (STR), and target site duplication (TSD). Each module can be run separately or in the pairwise manner, whereas the final step of the analysis involves integration of the results delivered by each module. There is one interesting novelty in the STR module, namely, implementation of tRNAscan-SE [135]to 196 Wojciech Makałowski et al.

detect tRNA-like secondary structure within the query sequence, one of the signatures of many SINE families. The REPPET is another pipeline for TE sequence analyses. It uses “classical” three-step approach for de novo TE identification: self-alignment, clustering, and consensus sequences generation. However, the pipeline is using a spectrum of different methods at each step, followed by a rigorous TE classification step based on recently proposed classification of TEs [136]. Unfortunately, a complex implementation that makes installation and running the system rather difficult limits usage of the pipeline. The classification step seems to be unreliable as it may annotate lineage-specific TEs in wrong taxonomical lineages (Kouzel and Makalowski, unpublished data). There are other attempts to create comprehensive systems for “repeatome” analysis. One of them is dnaPipeTE developed for mosquito genomes’ analyses [100]. Interestingly, dnaPipeTE works on the raw NGS data, which makes the pipeline well suited for genomes with lower sequencing depth. The raw reads are first subjected to k-mer count on the sampled data. The sampling of the data to size less than 0.25Â of the genome is required to avoid clustering reads representing unique sequences. The determined repetitive reads are assembled into contigs using Trinity [137]. Although Trinity was originally developed for transcriptome assembly from RNA-seq data, it proves to be very useful for TEs assembly from short reads as it can efficiently determine consensus sequences of closely related transposons. In the next step, dnaPi- peTE annotates repeats using RepeatMasker with either built-in or user-defined libraries. This is probably the weakest point of the pipeline as it will not annotate any novel TEs, which have no similar sequences present in the provided libraries. It would be useful to complement this step with model-based or machine learning approaches (see Subheading 4.5). After contigs’ annotation, copy number of the TEs are estimated using BLAST algorithm [91]. Finally, sequence identity between an individual TE and its consensus sequence is used to determine the relative age of the TEs. The pipeline produces a number of output files including several graphs, i.e., pie chart with the relative proportion of the main repeat classes and graph with the number of base pairs aligned on each TE contig and TE age distribution. Overall, the dnaPipeTE is very efficient, outperforming, according to the authors, RepeatEx- plorer by severalfold [100].

4.7 Meta-analyses Most of the software developed are focused on the TE discovery and rarely offer more biological oriented analyses. Consequently, researchers interested in TE biology or using TE insertions as tools for another biological investigations need to utilize other resources. One of them is TinT (transposition in transposition), tool that applies maximum likelihood model of TE insertion probability to Transposable Elements: Classification, Identification, and Their Use... 197

estimate relative age of TE families [138](http://bioinformatics. uni-muenster.de/tools/tint/index.hbi). In the first steps, it takes RepeatMasker output to detect nested retroposons. Then, it gen- erates a data matrix that is used by a probabilistic model to estimate chronology and activity period of analyzed families. The method was applied to resolve the evolutionary history of galliformes [139], marsupials [140], lagomorphs [141], squirrel monkey [142], or elephant shark [143]. Another interesting application that takes advantage of TEs is their use for detecting signatures of positive selection [144], a central goal in the field of evolutionary biology. A typical research scenario for this application would be investigating whether a spe- cific TE fragment exapted into resident genomic features, such as proximal and distal enhancers or exons of spliced transcripts, has undergone accelerated evolution that could be indicative of gain of function events. In short, the test first requires the identification of all genomically interspersed TE fragments that are homolog to the TE segment of interest, which can be done through alignments with a family consensus sequence. Based on multi-species genome alignments, a second step involves identification of lineage-specific substitutions in every single homolog fragment, which are then consolidated into a distribution of lineage-specific substitutions that provides the expectation (null distribution) for a segment evolving largely without specific constraints (neutrally). A signifi- cantly higher number of lineage-specific substitutions observed in the TE fragment of interest compared to the null distribution could then be interpreted as a molecular signature of adaptive evolution. However, the possibility of confounding molecular mechanisms, such as GC-biased [145–147], needs to be eval- uated. We note that building the null distribution based only on data from intergenic regions, where transcription-coupled repair is absent, results in a more liberal estimate of the expected substitu- tions, which in turn leads to a more conservative estimate of the adaptive evolution. Additionally, building the null distribution requires the detection of many homolog fragments, which limits the applicability of the test to TE families with numerous members in a given genome. Prime examples would be human Alu or murine B1 SINEs. In theory, this test could also be used for detecting signatures of purifying selection by searching for fragments depleted of lineage-specific substitutions. However, the low level or complete lack of lineage-specific substitution is characteristic to many TE fragments, obscuring the effect of potential purifying forces. 198 Wojciech Makałowski et al.

5 Concluding Remarks

Annoying junk for some, hidden treasure for others, TEs can hardly be ignored [148]. With their diversity and high copy number in most of the genomes, they are not the easiest biological entities to analyze. Nevertheless, recent years witnessed increased interest in TEs. On the one hand, we observe improvement in computational tools specialized in TE analyses. Table 3 lists some of such tools and

Table 3 Selected resources for transposable elements discovery and analyses

Software Address

AB-BLAST http://www.advbiocomp.com/blast.html ACLAME http://aclame.ulb.ac.be/ BLASTER suite http://urgi.versailles.inra.fr/index.php/urgi/Tools/BLASTER Censor http://www.girinst.org/censor/download.php DOTTER http://sonnhammer.sbc.su.se/Dotter.html DROPOSON ftp://biom3.univ-lyon1.fr//pub/drosoposon/ find_ltr http://darwin.informatics.indiana.edu/cgi-bin/evolution/ltr.pl FINDMITE http://jaketu.biochem.vt.edu/dl_software.htm FORRepeats http://al.jalix.org/FORRepeats/ Gepard http://cube.univie.ac.at/gepard HelitronFinder http://limei.montclair.edu/HT.html HelSearch http://sourceforge.net/project/showfiles.php?group_id¼260708 HERVd http://herv.img.cas.cz/ IRF http://tandem.bu.edu/irf/irf.download.html LTR_FINDER http://tlife.fudan.edu.cn/ltr_finder/ LTR_MINER http://genomebiology.com/2004/5/10/R79/suppl/s7 LTR_par http://www.eecs.wsu.edu/~ananth/software.htm MGEScan-LTR http://darwin.informatics.indiana.edu/cgi-bin/evolution/daphnia_ltr.pl MGEScan-nonLTR http://darwin.informatics.indiana.edu/cgi-bin/evolution/nonltr/nonltr.pl microTranspoGene http://transpogene.tau.ac.il/microTranspoGene.html MITE-Hunter http://target.iplantcollaborative.org/mite_hunter.html PILER http://www.drive5.com/piler/ REannotate http://www.bioinformatics.org/reannotate/index.html ReAS ftp://ftp.genomics.org.cn/pub/ReAS/software/ RECON http://eddylab.org/software/recon/

(continued) Transposable Elements: Classification, Identification, and Their Use... 199

Table 3 (continued)

Software Address

RepSeek http://wwwabi.snv.jussieu.fr/public/RepSeek/ RepeatFinder http://cbcb.umd.edu/software/RepeatFinder/ RepeatMasker http://www.repeatmasker.org/ RepeatModeler http://www.repeatmasker.org/RepeatModeler/ RepeatRunner http://www.yandell-lab.org/software/repeatrunner.html Repeat-match http://mummer.sourceforge.net/ REPET http://urgi.versailles.inra.fr/index.php/urgi/Tools/REPET RepMiner http://repminer.sourceforge.net/index.htm REPuter http://bibiserv.techfak.uni-bielefeld.de/reputer/ RetroMap http://www.burchsite.com/bioi/RetroMapHome.html SMaRTFinder http://services.appliedgenomics.org/software/smartfinder/ SoyTEdb http://www.soytedb.org Spectral Repeat Finder http://www.imtech.res.in/raghava/srf/ T-lex http://petrov.stanford.edu/cgi-bin/Tlex.html Tallymer http://www.zbh.uni-hamburg.de/Tallymer/ TARGeT http://target.iplantcollaborative.org/ TEclass http://www.bioinformatics.uni-muenster.de/tools/teclass/ TE Displayer http://labs.csb.utoronto.ca/yang/TE_Displayer/ TE nest http://www.plantgdb.org/prj/TE_nest/TE_nest.html TESD http://pbil.univ-lyon1.fr/software/TESD/ TinT http://www.bioinformatics.uni-muenster.de/tools/tint/ TIPseqHunter https://github.com/fenyolab/TIPseqHunter TRANSPO http://alggen.lsi.upc.es/recerca/search/transpo/transpo.html TranspoGene http://transpogene.tau.ac.il/ Transposon-PSI http://transposonpsi.sourceforge.net/ TRAP http://www.coccidia.icb.usp.br/trap/tutorials/ TRF http://tandem.bu.edu/trf/trf.html TROLL http://finder.sourceforge.net/ TSDfinder http://www.ncbi.nlm.nih.gov/CBBresearch/Landsman/TSDfinder/ WikiPoson http://www.bioinformatics.org/wikiposon/doku.php VariationHunter http://compbio.cs.sfu.ca/software-variation-hunter Vmatch http://www.vmatch.de/ 200 Wojciech Makałowski et al.

the up-to-date list can be found at our web site: http://www. bioinformatics.uni-muenster.de/ScrapYard/. On the other hand, improved tools and new technologies enable biologists to explore new research avenues that might lead to novel, fascinating insights into the biology of mobile elements.

References

1. Waring M, Britten RJ (1966) Nucleotide 12. Aronson AI, Bolton ET, Britten RI, Cowie sequence repetition - a rapidly reassociating DB, Duerksen JD, McCarthy BJ, fraction of mouse DNA. Science 154 McQuillen K, Roberts RB (1960) Biophysics. (3750):791–794 In: Yearbook 59, vol 59. Carnegie Institution 2. Britten RJ, Kohne DE (1968) Repeated of Washington, Washington, pp 229–279 sequences in DNA. hundreds of thousands 13. Ehret CF, De Haller G (1963) Origin, devel- of copies of DNA sequences have been opment and maturation of organelles and incorporated into the genomes of higher organelle systems of the cell surface in Para- organisms. Science 161(841):529–540 mecium. J Ultrastruct Res 23(Suppl 6):1–42 3. Makalowski W (2001) The human genome 14. Brosius J (1991) Retroposons--seeds of evo- structure and organization. Acta Biochim lution. Science 251(4995):753 Pol 48(3):587–598 15. Makalowski W, Mitchell GA, Labuda D 4. C._elegans_Sequencing_Consortium (1998) (1994) Alu sequences in the coding regions Genome sequence of the nematode of mRNA: a source of protein variability. C. elegans: a platform for investigating biol- Trends Genet 10(6):188–193 ogy. Science 282(5396):2012–2018 16. Jordan IK, Rogozin IB, Glazko GV, Koonin 5. SanMiguel P, Tikhonov A, Jin YK, EV (2003) Origin of a substantial fraction of Motchoulskaia N, Zakharov D, Melake- human regulatory sequences from transposa- Berhan A, Springer PS, Edwards KJ, Lee M, ble elements. Trends Genet 19(2):68–72. Pii Avramova Z, Bennetzen JL (1996) Nested S0168-9525(02)00006-9 retrotransposons in the intergenic regions of 17. Thornburg BG, Gotea V, Makalowski W the maize genome. Science 274 (2006) Transposable elements as a significant (5288):765–768 source of transcription regulating signals. 6. Keller EF (1983) A feeling for the organism: Gene 365:104–110. https://doi.org/10. the life and work of Barbara McClintock. 1016/j.gene.2005.09.036. S0378-1119(05) W.H. Freeman, San Francisco 00653-0 [pii] 7. McClintock B (1950) The origin and behav- 18. Feschotte C (2008) Transposable elements ior of mutable loci in maize. Proc Natl Acad and the evolution of regulatory networks. Sci U S A 36(6):344–355 Nat Rev Genet 9(5):397–405. https://doi. 8. McClintock B (1951) Chromosome organi- org/10.1038/nrg2337 zation and genic expression. Cold Spring 19. Mita P, Boeke JD (2016) How retrotranspo- Harb Symp Quant Biol 16:13–47 sons shape genome regulation. Curr Opin 9. McClintock B (1956) Controlling elements Genet Dev 37:90–100. https://doi.org/10. and the gene. Cold Spring Harb Symp 1016/j.gde.2016.01.001 Quant Biol 21:197–216 20. Chuong EB, Elde NC, Feschotte C (2017) 10. Malamy MH, Fiandt M, Szybalski W (1972) Regulatory activities of transposable elements: Electron microscopy of polar insertions in the from conflicts to benefits. Nat Rev Genet 18 lac operon of Escherichia coli. Mol Gen Genet (2):71–86. https://doi.org/10.1038/nrg. 119(3):207–222 2016.139 11. Ohno S (1972) So much “junk” DNA in our 21. Franke V, Ganesh S, Karlic R, Malik R, genome. In: Smith HH (ed) Brookhaven Pasulka J, Horvat F, Kuzman M, Fulka H, symposia in biology, vol 23. Gordon & Cernohorska M, Urbanova J, Svobodova E, Breach, New York, pp 366–370 Ma J, Suzuki Y, Aoki F, Schultz RM et al Transposable Elements: Classification, Identification, and Their Use... 201

(2017) Long terminal repeats power evolu- 33. Wicker T, Sabot F, Hua-Van A, Bennetzen JL, tion of genes and gene expression programs Capy P, Chalhoub B, Flavell A, Leroy P, in mammalian oocytes and zygotes. Genome Morgante M, Panaud O, Paux E, Res 27(8):1384–1394. https://doi.org/10. SanMiguel P, Schulman AH (2007) A unified 1101/gr.216150.116 classification system for eukaryotic transposa- 22. Wang L, Rishishwar L, Marino-Ramirez L, ble elements. Nat Rev Genet 8(12):973–982. Jordan IK (2017) Human population-specific https://doi.org/10.1038/nrg2165. gene expression and transcriptional network nrg2165 [pii] modification with polymorphic transposable 34. Hughes SH (2015) Reverse transcription of elements. Nucleic Acids Res 45 retroviruses and LTR retrotransposons. (5):2318–2328. https://doi.org/10.1093/ Microbiol Spectr 3(2):MDNA3-0027-2014. nar/gkw1286 https://doi.org/10.1128/microbiolspec. 23. Venuto D, Bourque G (2018) Identifying MDNA3-0027-2014 co-opted transposable elements using com- 35. Kazazian HH Jr (2004) Mobile elements: dri- parative epigenomics. Develop Growth Differ vers of genome evolution. Science 303 60(1):53–62. https://doi.org/10.1111/ (5664):1626–1632. https://doi.org/10. dgd.12423 1126/science.1089670. 303/5664/1626 24. Mahillon J, Chandler M (1998) Insertion [pii] sequences. Microbiol Mol Biol Rev 62 36. Malik HS, Henikoff S, Eickbush TH (2000) (3):725–774 Poised for contagion: evolutionary origins of 25. Wilde C, Escartin F, Kokeguchi S, Latour- the infectious abilities of invertebrate retro- Lambert P, Lectard A, Clement JM (2003) viruses. Genome Res 10(9):1307–1318 Transposases are responsible for the target 37. Leib-Mosch C, Haltmeier M, Werner T, Geigl specificity of IS1397 and ISKpn1 for two EM, Brack-Werner R, Francke U, Erfle V, different types of palindromic units (PUs), Hehlmann R (1993) Genomic distribution Nucleic Acid Res 31(15):4345–4353 and transcription of solitary HERV-K LTRs. 26. Derbyshire KM, Grindley NDF (1996) Cis Genomics 18(2):261–269. https://doi.org/ preference of the IS903 transposase is 10.1006/geno.1993.1464 mediated by a combination of transposase 38. Lander ES, Linton LM, Birren B, instability and inefficient translation. Mol Nusbaum C, Zody MC, Baldwin J, Microbiol 21(6):1261–1272. https://doi. Devon K, Dewar K, Doyle M, FitzHugh W, org/10.1111/j.1365-2958.1996.tb02587.x Funke R, Gage D, Harris K, Heaford A, How- 27. Ichikawa H, Ikeda K, Amemura J, Ohtsubo E land J et al (2001) Initial sequencing and (1990) Two domains in the terminal analysis of the human genome. Nature 409 inverted-repeat sequence of transposon Tn3. (6822):860–921. https://doi.org/10.1038/ Gene 86(1):11–17 35057062 28. Maekawa T, Amemura-Maekawa J, Ohtsubo 39. Wicker T, Stein N, Albar L, Feuillet C, E (1993) DNA binding domains in Tn3 Schlagenhauf E, Keller B (2001) Analysis of transposase. Mol Gen Genet 236 a contiguous 211 kb sequence in diploid (2–3):267–274 wheat (Triticum monococcum L.) reveals 29. Weinert TA, Schaus NA, Grindley ND (1983) multiple mechanisms of genome evolution. Insertion sequence duplication in transposi- Plant J 26(3):307–316. tpj1028 [pii] tional recombination. Science 222 40. Vicient CM, Kalendar R, Anamthawat- (4625):755–765 Jonsson K, Schulman AH (1999) Structure, 30. Turlan C, Chandler M (1995) IS1-mediated functionality, and evolution of the BARE-1 intramolecular rearrangements: formation of retrotransposon of barley. Genetica 107 excised transposon circles and replicative dele- (1–3):53–63 tions. EMBO J 14(21):5410–5421 41. SanMiguel P, Gaut BS, Tikhonov A, 31. Siguier P, Gourbeyre E, Varani A, Nakajima Y, Bennetzen JL (1998) The pale- Ton-Hoang B, Chandler M (2015) Every- ontology of intergene retrotransposons of man’s guide to bacterial insertion sequences. maize. Nat Genet 20(1):43–45. https://doi. Microbiol Spectr 3(2):MDNA3-0030-2014. org/10.1038/1695 https://doi.org/10.1128/microbiolspec. 42. Peterson DG, Schulze SR, Sciara EB, Lee SA, MDNA3-0030-2014 Bowers JE, Nagel A, Jiang N, Tibbitts DC, 32. Finnegan DJ (1989) Eukaryotic transposable Wessler SR, Paterson AH (2002) Integration elements and genome evolution. Trends of Cot analysis, DNA cloning, and high- Genet 5(4):103–107 throughput sequencing facilitates genome characterization and gene discovery. Genome 202 Wojciech Makałowski et al.

Res 12(5):795–807. https://doi.org/10. retrotransposon, LINE-1. RNA Biol 7 1101/gr.226102. Article published online (6):706–711 before print in April 2002 54. Kapitonov VV, Jurka J (2003) Molecular pale- 43. Zuker C, Lodish HF (1981) Repetitive DNA ontology of transposable elements in the Dro- sequences cotranscribed with developmen- sophila melanogaster genome. Proc Natl Acad tally regulated Dictyostelium discoideum Sci U S A 100(11):6569–6574. https://doi. mRNAs. Proc Natl Acad Sci U S A 78 org/10.1073/pnas.0732024100 (9):5386–5390 55. Kajikawa M, Okada N (2002) LINEs mobi- 44. Goodwin TJ, Poulter RT (2001) The DIRS1 lize SINEs in the eel through a shared 30 group of retrotransposons. Mol Biol Evol 18 sequence. Cell 111(3):433–444. (11):2067–2082 S0092867402010413 [pii] 45. Piednoel M, Bonnivard E (2009) DIRS1-like 56. Houck CM, Rinehart FP, Schmid CW (1979) retrotransposons are widely distributed A ubiquitous family of repeated DNA among Decapoda and are particularly present sequences in the human genome. J Mol Biol in hydrothermal vent organisms. BMC Evol 132(3):289–306 Biol 9:86. https://doi.org/10.1186/1471- 57. Jurka J, Zietkiewicz E, Labuda D (1995) 2148-9-86 Ubiquitous mammalian-wide interspersed 46. Evgen’ev MB, Arkhipova IR (2005) repeats (MIRs) are molecular fossils from the Penelope-like elements - a new class of retro- mesozoic era. Nucleic Acids Res 23 elements: distribution, function and possible (1):170–175 evolutionary significance. Cytogenet Genome 58. Wang H, Xing J, Grover D, Hedges DJ, Han Res 110(1–4):510–521. https://doi.org/10. KD, Walker JA, Batzer MA (2005) SVA ele- 1159/000084984 ments: a hominid-specific retroposon family. J 47. Arkhipova IR (2006) Distribution and phy- Mol Biol 354(4):994–1007. https://doi. logeny of Penelope-like elements in eukar- org/10.1016/j.jmb.2005.09.085 yotes. Syst Biol 55(6):875–885. https://doi. 59. Ostertag EM, Goodier JL, Zhang Y, Kazazian org/10.1080/10635150601077683 HH (2003) SVA elements are nonautono- 48. Singer MF (1982) Highly repeated sequences mous retrotransposons that cause disease in in mammalian genomes. Int Rev Cytol humans. Am J Hum Genet 73 76:67–112 (6):1444–1451. https://doi.org/10.1086/ 49. Singer MF (1982) SINEs and LINEs: highly 380207 repeated short and long interspersed 60. Vanin EF (1985) Processed pseudogenes: sequences in mammalian genomes. Cell 28 characteristics and evolution. Annu Rev (3):433–434 Genet 19:253–272 50. Mills RE, Bennett EA, Iskow RC, Devine SE 61. Maestre J, Tchenio T, Dhellin O, Heidmann (2007) Which transposable elements are T (1995) mRNA retroposition in human active in the human genome? Trends Genet cells: processed pseudogene formation. 23(4):183–191. https://doi.org/10.1016/j. EMBO J 14:6333–6338 tig.2007.02.006 62. Kabza M, Ciomborowska J, Makalowska I 51. Biedler J, Tu Z (2003) Non-LTR retrotran- (2014) RetrogeneDB--a database of animal sposons in the African malaria mosquito, retrogenes. Mol Biol Evol 31 Anopheles gambiae: unprecedented diversity (7):1646–1648. https://doi.org/10.1093/ and evidence of recent activity. Mol Biol Evol molbev/msu139 20(11):1811–1825. https://doi.org/10. 63. Zhang Z, Harrison P, Gerstein M (2002) 1093/molbev/msg189. msg189 [pii] Identification and analysis of over 2000 ribo- 52. Martin SL, Cruceanu M, Branciforte D, somal protein pseudogenes in the human Wai-Lun Li P, Kwok SC, Hodges RS, Wil- genome. Genome Res 12:1466–1482 liams MC (2005) LINE-1 retrotransposition 64. Torrents D, Suyama M, Zdobnov E, Bork P requires the nucleic acid chaperone activity of (2003) A genome-wide survey of human the ORF1 protein. J Mol Biol 348 pseudogenes. Genome Res 13:2559–2567 (3):549–561. https://doi.org/10.1016/j. 65. Szczes´niak MW, Ciomborowska J, Nowak W, jmb.2005.03.003 Rogozin IB, Makałowska I (2011) Primate 53. Martin SL (2010) Nucleic acid chaperone and rodent specific intron gains and the origin properties of ORF1p from the non-LTR of retrogenes with splice variants. Mol Biol Evol 28:33–38 Transposable Elements: Classification, Identification, and Their Use... 203

66. Goodwin TJ, Butler MI, Poulter RT (2003) complete genomes. Bioinformatics 15 Cryptons: a group of tyrosine-recombinase- (5):426–427 encoding DNA transposons from pathogenic 78. Kurtz S, Choudhuri JV, Ohlebusch E, fungi. Microbiology 149. (Pt 11:3099–3109 Schleiermacher C, Stoye J, Giegerich R 67. Kojima KK, Jurka J (2011) Crypton transpo- (2001) REPuter: the manifold applications sons: identification of new diverse families and of repeat analysis on a genomic scale. Nucleic ancient domestication events. Mob DNA 2 Acids Res 29(22):4633–4642 (1):12. https://doi.org/10.1186/1759- 79. Delcher AL, Kasif S, Fleischmann RD, 8753-2-12 Peterson J, White O, Salzberg SL (1999) 68. Bureau TE, Wessler SR (1994) Stowaway: a Alignment of whole genomes. Nucleic Acids new family of inverted repeat elements asso- Res 27(11):2369–2376 ciated with the genes of both monocotyle- 80. Delcher AL, Phillippy A, Carlton J, Salzberg donous and dicotyledonous plants. Plant SL (2002) Fast algorithms for large-scale Cell 6(6):907–916. https://doi.org/10. genome alignment and comparison. Nucleic 1105/tpc.6.6.907. 6/6/907 [pii] Acids Res 30(11):2478–2483 69. Feschotte C, Swamy L, Wessler SR (2003) 81. Li RQ, Ye J, Li SG, Wang J, Han YJ, Ye C, Genome-wide analysis of mariner-like trans- Wang J, Yang HM, Yu J, Wong GKS, Wang J posable elements in rice reveals complex rela- (2005) ReAS: recovery of ancestral sequences tionships with stowaway miniature inverted for transposable elements from the unassem- repeat transposable elements (MITEs). bled reads of a whole genome shotgun. PLoS Genetics 163(2):747–758 Comput Biol 1(4):313–321. https://doi. 70. Zhou MB, Tao GY, Pi PY, Zhu YH, Bai YH, org/10.1371/Journal.Pcbi.0010043.Artn Meng XW (2016) Genome-wide characteri- E43 [pii] zation and evolution analysis of miniature 82. Price AL, Jones NC, Pevzner PA (2005) De inverted-repeat transposable elements novo identification of repeat families in large (MITEs) in moso bamboo (Phyllostachys het- genomes. Bioinformatics 21:I351–I358. erocycla). Planta 244(4):775–787. https:// https://doi.org/10.1093/Bioinformatics/ doi.org/10.1007/s00425-016-2544-0 Bti1018 71. Wicker T, Robertson JS, Schulze SR, Feltus 83. Kurtz S, Narechania A, Stein JC, Ware D FA, Magrini V, Morrison JA, Mardis ER, Wil- (2008) A new method to compute K-mer son RK, Peterson DG, Paterson AH, Ivarie R frequencies and its application to annotate (2005) The repetitive landscape of the large repetitive plant genomes. BMC Geno- chicken genome. Genome Res 15 mics 9:517. https://doi.org/10.1186/1471- (1):126–136. https://doi.org/10.1101/gr. 2164-9-517 2438004 84. Lefebvre A, Lecroq T, Dauchel H, Alexandre 72. Kapitonov VV, Jurka J (2001) Rolling-circle J (2003) FORRepeats: detects repeats on transposons in eukaryotes. Proc Natl Acad Sci entire chromosomes and between genomes. U S A 98:8714–8719 Bioinformatics 19(3):319–326. https://doi. 73. Hood ME (2005) Repetitive DNA in the org/10.1093/Bioinformatics/Btf843 automictic fungus Microbotryum violaceum. 85. Crochemore M, Ilie L, Seid-Hilmi E (2006) Genetica 124(1):1–10 Factor oracles. In: Ibarra OH, Yen H-C (eds) 74. Pritham EJ, Feschotte C (2007) Massive Implementation and application of automata. amplification of rolling-circle transposons in Springer, Berlin, pp 78–89 the lineage of the bat Myotis lucifugus. Proc 86. Agrawal P, States D (1994) The Repeat Pat- Natl Acad Sci U S A 104:1895–1900 tern Toolkit (RPT): analyzing the structure 75. Pritham EJ, Putliwala T, Feschotte C (2007) and evolution of the C. elegans genome. Mavericks, a novel class of giant transposable Proc Int Conf Intell Syst Mol Biol 2:9 elements widespread in eukaryotes and related 87. Bao ZR, Eddy SR (2002) Automated de novo to DNA viruses. Gene 390(1–2):3–17. identification of repeat sequence families in https://doi.org/10.1016/j.gene.2006.08. sequenced genomes. Genome Res 12 008. S0378-1119(06)00537-3 [pii] (8):1269–1276. https://doi.org/10.1101/ 76. Kapitonov VV, Jurka J (2006) Self- Gr.88502 synthesizing DNA transposons in eukaryotes. 88. Edgar RC, Myers EW (2005) PILER: identi- Proc Natl Acad Sci U S A 103:4540–4545 fication and classification of genomic repeats. 77. Kurtz S, Schleiermacher C (1999) REPuter: Bioinformatics 21:I152–I158. https://doi. fast computation of maximal repeats in org/10.1093/Bioinformatics/Bti1003 204 Wojciech Makałowski et al.

89. Edgar RC (2007) PILER-CR: fast and accu- repeatome with dnaPipeTE from raw geno- rate identification of CRISPR repeats. BMC mic reads and comparative analysis with the Bioinform 8:18. https://doi.org/10.1186/ yellow fever mosquito (Aedes aegypti). 1471-2105-8-18 Genome Biol Evol 7(4):1192–1205. 90. Quesneville H, Bergman CM, Andrieu O, https://doi.org/10.1093/gbe/evv050 Autard D, Nouaud D, Ashburner M, Anxola- 101. Genome of the Netherlands Consortium behere D (2005) Combined evidence annota- (2014) Whole-genome sequence variation, tion of transposable elements in genome population structure and demographic his- sequences. PLoS Comput Biol 1 tory of the Dutch population. Nat Genet 46 (2):166–175. Artn E22. https://doi.org/10. (8):818–825. https://doi.org/10.1038/ng. 1371/Journal.Pcbi.0010022 3021 91. Altschul SF, Gish W, Miller W, Myers EW, 102. 1000 Genomes Project Consortium, Lipman DJ (1990) Basic local alignment Auton A, Brooks LD, Durbin RM, Garrison search tool. J Mol Biol 215(3):403–410 EP, Kang HM, Korbel JO, Marchini JL, 92. Fitch WM (1969) Locating gaps in amino McCarthy S, McVean GA, Abecasis GR acid sequences to optimize the homology (2015) A global reference for human genetic between two proteins. Biochem Genet 3 variation. Nature 526(7571):68–74. https:// (2):99–108 doi.org/10.1038/nature15393 93. Gibbs AJ, McIntyre GA (1970) The diagram, 103. UK10K Consortium, Walter K, Min JL, a method for comparing sequences. Its use Huang J, Crooks L, Memari Y, McCarthy S, with amino acid and nucleotide sequences. Perry JR, Xu C, Futema M, Lawson D, Eur J Biochem 16(1):1–11 Iotchkova V, Schiffels S, Hendricks AE, 94. Sonnhammer EL, Durbin R (1995) A Danecek P et al (2015) The UK10K project dot-matrix program with dynamic threshold identifies rare variants in health and disease. control suited for genomic DNA and protein Nature 526(7571):82–90. https://doi.org/ sequence analysis. Gene 167(1–2):GC1–G10 10.1038/nature14962 95. Krumsiek J, Arnold R, Rattei T (2007) 104. Lack JB, Lange JD, Tang AD, Corbett-Detig Gepard: a rapid and sensitive tool for creating RB, Pool JE (2016) A thousand fly genomes: dotplots on genome scale. Bioinformatics 23 an expanded Drosophila genome nexus. Mol (8):1026–1028. https://doi.org/10.1093/ Biol Evol 33(12):3308–3313. https://doi. bioinformatics/btm039 org/10.1093/molbev/msw195 96. Staton SE, Burke JM (2015) Transposome: a 105. Lynch M, Gutenkunst R, Ackerman M, toolkit for annotation of transposable element Spitze K, Ye Z, Maruki T, Jia Z (2017) Popu- families from unassembled sequence reads. lation genomics of Daphnia pulex. Genetics Bioinformatics 31(11):1827–1829. https:// 206(1):315–332. https://doi.org/10.1534/ doi.org/10.1093/bioinformatics/btv059 genetics.116.190611 97. Novak P, Neumann P, Pech J, Steinhaisl J, 106. Gardner EJ, Lam VK, Harris DN, Chuang Macas J (2013) RepeatExplorer: a Galaxy- NT, Scott EC, Pittard WS, Mills RE, Gen- based web server for genome-wide characteri- omes Project Consortium, Devine SE zation of eukaryotic repetitive elements from (2017) The Mobile Element Locator Tool next-generation sequence reads. Bioinformat- (MELT): population-scale mobile element ics 29(6):792–793. https://doi.org/10. discovery and biology. Genome Res 27 1093/bioinformatics/btt054 (11):1916–1929. https://doi.org/10.1101/ gr.218032.116 98. Koch P, Platzer M, Downie BR (2014) RepARK--de novo creation of repeat libraries 107. Ewing AD (2015) Transposable element from whole-genome NGS reads. Nucleic detection from whole genome sequence Acids Res 42(9):e80. https://doi.org/10. data. Mob DNA 6:24. https://doi.org/10. 1093/nar/gku210 1186/s13100-015-0055-3 99. Chu C, Nielsen R, Wu Y (2016) REPdenovo: 108. Rishishwar L, Marino-Ramirez L, Jordan IK inferring de novo repeat motifs from short (2016) Benchmarking computational tools sequence reads. PLoS One 11(3):e0150719. for polymorphic transposable element detec- https://doi.org/10.1371/journal.pone. tion. Brief Bioinform. https://doi.org/10. 0150719 1093/bib/bbw072 100. Goubert C, Modolo L, Vieira C, 109. Helman E, Lawrence MS, Stewart C, ValienteMoro C, Mavingui P, Boulesteix M Sougnez C, Getz G, Meyerson M (2014) (2015) De novo assembly and annotation of Somatic retrotransposition in human cancer the Asian tiger mosquito (Aedes albopictus) revealed by whole-genome and exome Transposable Elements: Classification, Identification, and Their Use... 205

sequencing. Genome Res 24(7):1053–1063. CM, Lee BT, Karolchik D, Hinrichs AS, https://doi.org/10.1101/gr.163659.113 Haeussler M, Guruvadoo L, Navarro 110. Lee E, Iskow R, Yang L, Gokcumen O, Gonzalez J, Gibson D et al (2018) The Haseley P, Luquette LJ III, Lohr JG, Harris UCSC Genome Browser database: 2018 CC, Ding L, Wilson RK, Wheeler DA, Gibbs update. Nucleic Acids Res 46(D1): RA, Kucherlapati R, Lee C, Kharchenko PV D762–D769. https://doi.org/10.1093/ et al (2012) Landscape of somatic retrotran- nar/gkx1020 sposition in human cancers. Science 337 119. Noll A, Grundmann N, Churakov G, (6097):967–971. https://doi.org/10.1126/ Brosius J, Makalowski W, Schmitz J (2015) science.1222077 GPAC-genome presence/absence compiler: a 111. Tubio JMC, Li Y, Ju YS, Martincorena I, web application to comparatively visualize Cooke SL, Tojo M, Gundem G, Pipinikas multiple genome-level changes. Mol Biol CP, Zamora J, Raine K, Menzies A, Roman- Evol 32(1):275–286. https://doi.org/10. Garcia P, Fullam A, Gerstung M, Shlien A et al 1093/molbev/msu276 (2014) Mobile DNA in cancer. Extensive 120. Abrusan G, Grundmann N, DeMester L, transduction of nonrepetitive DNA mediated Makalowski W (2009) TEclass-a tool for by L1 retrotransposition in cancer genomes. automated classification of unknown eukary- Science 345(6196):1251343. https://doi. otic transposable elements. Bioinformatics 25 org/10.1126/science.1251343 (10):1329–1330. https://doi.org/10.1093/ 112. Thung DT, de Ligt J, Vissers LE, bioinformatics/btp084 Steehouwer M, Kroon M, de Vries P, Slag- 121. Jurka J, Klonowski P, Dagman V, Pelton P boom EP, Ye K, Veltman JA, Hehir-Kwa JY (1996) Censor - a program for identification (2014) Mobster: accurate detection of mobile and elimination of repetitive elements from element insertions in next generation DNA sequences. Comput Chem 20 sequencing data. Genome Biol 15(10):488. (1):119–121 https://doi.org/10.1186/s13059-014- 122. Bao W, Kojima KK, Kohany O (2015) 0488-x Repbase Update, a database of repetitive ele- 113. Keane TM, Wong K, Adams DJ (2013) Ret- ments in eukaryotic genomes. Mob DNA roSeq: transposable element discovery from 6:11. https://doi.org/10.1186/s13100- next-generation sequencing data. Bioinfor- 015-0041-9 matics 29(3):389–390. https://doi.org/10. 123. Wicker T, Matthews DE, Keller B (2002) 1093/bioinformatics/bts697 TREP: a database for Triticeae repetitive ele- 114. Tang Z, Steranka JP, Ma S, Grivainis M, ments. Trends Plant Sci 7(12):561–562. [pii] Rodic N, Huang CR, Shih IM, Wang TL, S1360-1385(02)02372-5 Boeke JD, Fenyo D, Burns KH (2017) 124. McCarthy EM, McDonald JF (2003) Human transposon insertion profiling: analy- LTR_STRUC: a novel search and identifica- sis, visualization and identification of somatic tion program for LTR retrotransposons. Bio- LINE-1 insertions in ovarian cancer. Proc informatics 19(3):362–367. https://doi.org/ Natl Acad Sci U S A 114(5):E733–E740. 10.1093/Bioinformatics/Btf878 https://doi.org/10.1073/pnas. 125. Kalyanaraman A, Aluru S (2006) Efficient 1619797114 algorithms and software for detection of full- 115. Chen Y, Ye W, Zhang Y, Xu Y (2015) High length LTR retrotransposons. J Bioinforma speed BLASTN: an accelerated MegaBLAST Comput Biol 4(2):197–216. search tool. Nucleic Acids Res 43 S021972000600203X [pii] (16):7762–7768. https://doi.org/10.1093/ 126. Rho M, Choi JH, Kim S, Lynch M, Tang H nar/gkv784 (2007) De novo identification of LTR retro- 116. Kent WJ (2002) BLAT--the BLAST-like transposons in eukaryotic genomes. BMC alignment tool. Genome Res 12 Genomics 8:90. https://doi.org/10.1186/ (4):656–664. https://doi.org/10.1101/gr. 1471-2164-8-90. 1471-2164-8-90 [pii] 229202. Article published online before 127. Xu Z, Wang H (2007) LTR_FINDER: an March 2002 efficient tool for the prediction of full-length 117. Kielbasa SM, Wan R, Sato K, Horton P, Frith LTR retrotransposons. Nucleic Acids Res 35 MC (2011) Adaptive seeds tame genomic (Web Server issue):W265–W268. https:// sequence comparison. Genome Res 21 doi.org/10.1093/nar/gkm286. gkm286 (3):487–493. https://doi.org/10.1101/gr. [pii] 113985.110 128. Ellinghaus D, Kurtz S, Willhoeft U (2008) 118. Casper J, Zweig AS, Villarreal C, Tyner C, LTRharvest, an efficient and flexible software Speir ML, Rosenbloom KR, Raney BJ, Lee 206 Wojciech Makałowski et al.

for de novo detection of LTR retrotranspo- genome. Nat Biotechnol 29(7):644–U130. sons. BMC Bioinform 9:18. https://doi.org/ https://doi.org/10.1038/nbt.1883 10.1186/1471-2105-9-18. 1471-2105-9- 138. Churakov G, Grundmann N, Kuritzin A, 18 [pii] Brosius J, Makalowski W, Schmitz J (2010) 129. Lerat E (2010) Identifying repeats and trans- A novel web-based TinT application and the posable elements in sequenced genomes: how chronology of the Primate Alu retroposon to find your way through the dense forest of activity. BMC Evol Biol 10:376. https://doi. programs. Heredity 104(6):520–533. org/10.1186/1471-2148-10-376. 1471- https://doi.org/10.1038/hdy.2009.165. 2148-10-376 [pii] hdy2009165 [pii] 139. Kriegs JO, Matzke A, Churakov G, 130. Tu Z (2001) Eight novel families of miniature Kuritzin A, Mayr G, Brosius J, Schmitz J inverted repeat transposable elements in the (2007) Waves of genomic hitchhikers shed African malaria mosquito, Anopheles gam- light on the evolution of gamebirds (Aves: biae. Proc Natl Acad Sci U S A 98 Galliformes). BMC Evol Biol 7:190. https:// (4):1699–1704. https://doi.org/10.1073/ doi.org/10.1186/1471-2148-7-190. 1471- pnas.041593198. 041593198 [pii] 2148-7-190 [pii] 131. Chen Y, Zhou F, Li G, Xu Y (2009) MUST: a 140. Nilsson MA, Churakov G, Sommer M, Tran system for identification of miniature NV, Zemann A, Brosius J, Schmitz J (2010) inverted-repeat transposable elements and Tracking marsupial evolution using archaic applications to Anabaena variabilis and Halo- genomic retroposon insertions. PLoS Biol 8 quadratum walsbyi. Gene 436(1–2):1–7. (7):e1000436. https://doi.org/10.1371/ https://doi.org/10.1016/j.gene.2009.01. journal.pbio.1000436 019. S0378-1119(09)00051-1 [pii] 141. Kriegs JO, Zemann A, Churakov G, 132. Du C, Caronna J, He L, Dooner HK (2008) Matzke A, Ohme M, Zischler H, Brosius J, Computational prediction and molecular Kryger U, Schmitz J (2010) Retroposon confirmation of Helitron transposons in the insertions provide insights into deep lago- maize genome. BMC Genomics 9:51. morph evolution. Mol Biol Evol 27 https://doi.org/10.1186/1471-2164-9-51. (12):2678–2681. https://doi.org/10.1093/ 1471-2164-9-51 [pii] molbev/msq162. msq162 [pii] 133. Yang L, Bennetzen JL (2009) Structure- 142. Baker JN, Walker JA, Vanchiere JA, Phillippe based discovery and description of plant and KR, St Romain CP, Gonzalez-Quiroga P, animal Helitrons. Proc Natl Acad Sci U S A Denham MW, Mierl JR, Konkel MK, Batzer 106(31):12832–12837. https://doi.org/10. MA (2017) Evolution of Alu subfamily struc- 1073/pnas.0905563106. 0905563106 [pii] ture in the Saimiri lineage of new world 134. Feschotte C, Keswani U, Ranganathan N, monkeys. Genome Biol Evol 9 Guibotsy ML, Levine D (2009) Exploring (9):2365–2376. https://doi.org/10.1093/ repetitive DNA landscapes using REPCLASS, gbe/evx172 a tool that automates the classification of 143. Luchetti A, Plazzi F, Mantovani B (2017) transposable elements in eukaryotic genomes. Evolution of two short interspersed elements Genome Biol Evol 1:205–220. https://doi. in Callorhinchus milii (Chondrichthyes, org/10.1093/Gbe/Evp023 Holocephali) and related elements in sharks 135. Lowe TM, Eddy SR (1997) tRNAscan-SE: a and the coelacanth. Genome Biol Evol 9(6). program for improved detection of transfer https://doi.org/10.1093/gbe/evx094 RNA genes in genomic sequence. Nucleic 144. Gotea V, Petrykowska HM, Elnitski L (2013) Acids Res 25(5):955–964 Bidirectional promoters as important drivers 136. Flutre T, Duprat E, Feuillet C, Quesneville H for the emergence of species-specific tran- (2011) Considering transposable element scripts. PLoS One 8(2):e57323. https://doi. diversification in de novo annotation org/10.1371/journal.pone.0057323 approaches. PLoS One 6(1):e16526. 145. Kostka D, Hubisz MJ, Siepel A, Pollard KS https://doi.org/10.1371/journal.pone. (2012) The role of GC-biased gene conver- 0016526 sion in shaping the fastest evolving regions of 137. Grabherr MG, Haas BJ, Yassour M, Levin JZ, the human genome. Mol Biol Evol 29 Thompson DA, Amit I, Adiconis X, Fan L, (3):1047–1057. https://doi.org/10.1093/ Raychowdhury R, Zeng QD, Chen ZH, molbev/msr279 Mauceli E, Hacohen N, Gnirke A, Rhind N 146. Capra JA, Hubisz MJ, Kostka D, Pollard KS, et al (2011) Full-length transcriptome assem- Siepel A (2013) A model-based analysis of bly from RNA-Seq data without a reference GC-biased gene conversion in the human and chimpanzee genomes. PLoS Genet 9(8): Transposable Elements: Classification, Identification, and Their Use... 207

e1003684. https://doi.org/10.1371/jour https://doi.org/10.1016/j.ygeno.2014.04. nal.pgen.1003684 001 147. Gotea V, Elnitski L (2014) Ascertaining 148. Makalowski W (2000) Genomic scrap yard: regions affected by GC-biased gene conver- how genomes utilize all that junk. Gene 259 sion through weak-to-strong mutational hot- (1–2):61–67. https://doi.org/10.1016/ spots. Genomics 103(5–6):349–356. S0378-1119(00)00436-4

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made. The images or other third party material in this chapter are included in the chapter’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. Plant Molecular Biology Manual Kl: 1-15, 1994. © 1994 Kluwer Academic Publishers. Printed in Belgium.

Gene tagging by endogenous transposons

WOLF-EKKEHARD LONNIG AND PETER HUUSER Max-Planck-Institut for ZuchtungsJorschung, Carl-von-Linne- Weg 10, 50829 Ko/n, Germany

Introduction

Gene tagging by transposons (transposon tagging) is one of the most useful methods to identify and isolate specific genetic loci of an organism. It was first applied in Drosophila [1] and then in several plant species, until recently most often and successfully in maize and Antirrhinum (for an overview of attempts in these plant species see [4, 12]). However, there should be many other good candidates for this approach as listed in [8]. The advantage of transposon mediated cloning over most other gene isolation methods is that, except for a mutant , no other information on the gene product or its function is required. The two major keys to the method of transposon tagging are: 1. that integration of a transposon (mobile DNA element) into a genetic locus often leads to a loss of its function whereas excision from that locus may result in partial or full restoration for its original tasks. (In a minority of the cases the tagged gene becomes ectopically expressed, transposon excision also leading to restoration of its original boundaries of function). 2. that excision and integration of a transposon result in a change in the physical size of the donor site as well as that of the new insertion site. As a consequence, the results of transposition become detectable as Restriction Fragment Length Polymorphisms (RFLPs) when the transposon is used as a molecular probe (Fig. 1). Two different approaches to isolate genetic loci with the help of trans po sons are based on these observations: targeted and non-targeted gene tagging. In this chapter the experimental details of these strategies are discussed as far as they apply and exploit endogenous transposons. It should also allow the experimenter to adapt the procedures outlined in the protocols to his/her own plant system.

Experimental design

Selecting the proper approach

As outlined in Fig. 2, non-targeted or random gene tagging is designed to identify the re-insertion of an active transposon at any new position in the genome where it causes a detectable mutation. The frequency of insertion

PMAN-Kl/l TRANSPOSON CAUSED GENETIC INSTABILITY

wUHyPE COLolR

SOMATIC GERMINAL EXCISION EXCISION / \ RFLP ANALYSIS 2 REVEALING: 3 4

NEW INSERT10N ~ = 1-- !lrrn$l .- MUTANTAUE\.E B - ... C

PMAN-Kl/2 See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/227009467

Transposon Tagging and Reverse Genetics

Chapter · January 2009 DOI: 10.1007/978-3-540-68922-5_11

CITATIONS READS 9 791

1 author:

A. Mark Settles University of Florida

393 PUBLICATIONS 4,475 CITATIONS

SEE PROFILE

Some of the authors of this publication are also working on these related projects:

Epigenetic Regulation of Seed Development View project

Function of RNA splicing factor Rbm48 in maize seed development View project

All content following this page was uploaded by A. Mark Settles on 20 May 2014.

The user has requested enhancement of the downloaded file. Chapter 11 Transposon Tagging and Reverse Genetics

A. Mark Settles

11.1 Introduction

Transposons are mobile genetic elements that can amplify themselves in a genome. Transposable elements were first discovered in maize (McClintock 1948) and have been found to exist in all organisms. Transposons have multiple modes of movement or transposition, which is used to group the elements into two major classes, based on having an RNA or a DNA intermediate during transposition (reviewed in Hua- Van et al. 2005). The maize genome contains examples of most known transposon families including (LTR) and non-LTR retrotransposons (class I), which use RNA intermediates, as well as class II DNA elements (Bruggmann et al. 2006). Maize class II elements tend to insert near or within genes (Bureau and Wessler 1992; Cowperthwaite et al. 2002; Fernandes et al. 2004; Kolkman et al. 2005; Kumar et al. 2005; McCarty et al. 2005; Settles et al. 2004). Transposition into genes can cause mutant , and transposons are used as endogenous mutagens. This chapter focuses on the use of maize DNA transposons in molecular genetics and functional genomics studies. With the exception of helitrons, DNA transposons share some common molecu- lar and genetic properties. DNA elements have terminal inverted repeats as well as autonomous and non-autonomous transposons (Hua-Van et al. 2005). Autonomous elements encode the genes required for transposition. Non-autonomous transposons contain sequences recognized by transposase proteins and can move only in the presence of an autonomous element. Non-autonomous elements either have mu- tations in transposase genes or have replaced them with other sequences. DNA transposons also create target site duplications at the site of insertion. The length of the duplication is specific to each family of element. The major families of

A. Mark Settles Horticultural Sciences Department and Plant Molecular and Cellular Biology Program, University of Florida, Gainesville, FL 32611-0690, USA e-mail: settles@ufl.edu

A.L. Kriz, B.A. Larkins (eds.), Molecular Genetic Approaches to Maize Improvement 143 Biotechnology in Agriculture and Forestry, Vol. 63 c Springer-Verlag Berlin Heidelberg 2009 144 A. Mark Settles maize transposons that have been used as mutagens include Activator and Dis- sociation (Ac/Ds), Enhancer/Suppressor-mutator (En/Spm), and Robertson’s Mu- tator (Mu). These families were identified because they cause unstable mutations (McClintock 1950, 1954; Peterson 1960; Robertson 1978). The instability of many transposon-induced alleles is due to excision events or epigenetic regulation of the transposons. The structures, mechanisms of transposition, and epigenetic regulation of Ac/Ds, En/Spm, and Mu are discussed in greater detail in multiple reviews (see Kunze and Weil 2002; Lisch 2002; Walbot and Rudenko 2002).

11.2 General Strategies for Transposon Tagging

Transposon mutagenesis is a central tool for current research in maize molecular genetics. A non-exhaustive database search for recent mutant gene cloning reports identified 20 papers (see Table 11.1). Transposon-tagged alleles played a central role in the proof of cloning for 90% of these studies, and conventional transposon- tagging strategies were used in 60% of the reports. Transposon tagging is a gene- cloning strategy that relies on the transposon to provide a DNA “tag” with a known sequence (see Fig. 11.1). The transposon sequence is used to identify DNA se- quences adjacent to the transposable element. The strategy was initially developed to clone the Drosophila white locus (Bingham et al. 1981). Fedoroff et al. (1984) adapted the method for Ac and cloned a tagged allele of the bronze1 (bz1) locus. To identify DNA adjacent to a transposon, the initial approach was to isolate λ phage clones that included the transposon tag. Phage libraries have continued to be useful for cloning transposon-tagged genes (e.g. McSteen et al. 2007; Vollbrecht et al. 2005), but PCR-based methods for amplifying sequences adjacent to endogenous maize transposons have been developed as well (Earp et al. 1990; Frey et al. 1998; Settles et al. 2004). Transposable elements are useful tags only if the sequences of the elements are known. The Ac, Ds, En/Spm, and Mu sequences were cloned using a method that is sometimes referred to as transposon trapping (Barker et al. 1984; Chomet et al. 1991; Fedoroff et al. 1983; Pohlman et al. 1984; Schwarz-Sommer et al. 1984). Previously cloned genes are used as “traps” to isolate transposon-tagged, mutant alleles. The known gene sequence is used to identify clones containing a mutant al- lele containing the transposon. Through transposon-induced alleles of cloned loci researchers have continued to identify new classes of transposons in the maize genome. For example, the first miniature inverted repeat transposable element and the first helitron insertions in maize were identified through mutants in the waxy (wx) and shrunken2 (sh2) loci, respectively (Bureau and Wessler 1992; Lal et al. 2003). Ac/Ds, En/Spm, and Mu exist in multiple copies within the maize genome. More- over, it is possible for transposons to induce mutations without tagging the locus of interest (e.g. Satoh-Nagasawa et al. 2006). Thus, it is imperative to collect addi- tional data about putatively tagged mutants to ensure that the cloned locus represents 11 Transposon Tagging and Reverse Genetics 145 et al. 2006b Kim et al. 2006 Reference Chung et al. 2007 Shi et al. 2005 McSteen et al. 2007 Hamant et al. 2005 Suzuki et al. 2006 Porch et al. 2006 Wen et al. 2005 Gutierrez-Marcos et al. 2007 Sturaro et al. 2005 Muszynski et al. 2006 Vollbrecht et al. 2005 Bommert et al. 2005 Taramino et al. 2007 Evans 2007 Satoh-Nagasawa et al. 2006 Chuck et al. 2007 reverse genetics Ching et al. 2006 reverse genetics Sawers et al. 2006 Mu Mu Confirmed with non-directed and directed tagging Bortiri Only non-tagged alleles identified Confirmed with directed tagging Confirmed with directed tagging Cloning approach Role for transposon mutagenesis Map-based Candidate gene Confirming alleles from Candidate gene Confirming alleles from Map-based Candidate gene None – transgenic approach Transposon tagging Non-directed mutagenesis Activation tagging Reference allele was transposon-tagged Transposon tagging Non-directed mutagenesis Transposon tagging Non-directed mutagenesis Transposon tagging Non-directed mutagenesis Transposon taggingTransposon tagging Non-directed mutagenesis Non-directed mutagenesis Transposon taggingTransposon tagging Non-directed mutagenesis Non-directed mutagenesis Transposon taggingTransposon tagging Directed tagging Directed and non-directed tagging Transposon tagging Directed tagging Transposon tagging Directed tagging Map-based Map-based Summary of mutant cloning papers since 2005 brittle stalk2 (bk2) ramosa2 (ra2) Oil yellow1 (Oy1) rootless for crown and seminal roots (rtcs) No transposon-tagged alleles ramosa3 (ra3) Mucronate (Mc) Confirmed with transposon-tagged alleles Corngrass1 (Cg1) Indeterminate gametophyte1 (ig1) Zea mays smu2 (zmsmu2) Zea mays shugoshin1 (zmsgo1) viviparous15 (vp15) viviparous10 (vp10) Roothairless1 (rth1) low phytic acid3 (lpa3) empty pericarp4 (emp4) barren infloresence2 (bif2) ramosa1 (ra1) thick tassel dwarf1 (td1) glossy1 (gl1) Table 11.1 Locus Conventional tagging delayed flowering1 (dlf1) 146 A. Mark Settles

Fig. 11.1 Schematic of a transposon-tagged mutation. DNA transposable elements insert into or near genes with a “cut and paste” type of mechanism. The resulting mutant is a fusion of the normal gene with the transposon sequence. The transposon sequence can then be used as a probe to screen phage or plasmid libraries. More frequently, the transposon sequence is used to anneal specific primers for I-PCR, AIMS, or TAIL-PCR the mutation of interest (reviewed in .Walbot 1992). Generally accepted criteria for proof of cloning include two types of data. First, the transposon tag and mutant phenotype need to be linked genetically. Second, either the mutant needs to be com- plemented with a transgene, or multiple mutant and/or revertant alleles of the locus need to be isolated and sequenced. In the past 10 years, there has been significant advancement in genetic resources and technologies, which has simplified cloning transposon-tagged mutants. These efforts have focused on three major areas: (1) directed tagging, (2) non-directed, saturation mutagenesis, and (3) reverse genetics resources.

11.3 Directed Tagging

Directed transposon tagging recovers mutations at a specific locus of interest. The conventional strategy for directed tagging is illustrated well by the cloning of opaque2 (o2) (Schmidt et al. 1987). Plants carrying active transposons are crossed with homozygous mutants for a reference, non-tagged allele (Fig. 11.2A). The F1 progeny are screened for mutant phenotypes. Obtaining multiple alleles of a locus requires a near-saturation level of mutagenesis. A near-saturation screen will re- quire between 30,000 and 500,000 progeny from tagging crosses, depending on the type of transposon used for mutagenesis. Novel alleles that show unstable or mu- table phenotypes are considered the best candidates for being transposon-tagged. 11 Transposon Tagging and Reverse Genetics 147

Fig. 11.2 Schematic of transposon mutagenesis strategies. a Directed tagging identifies transposon-induced alleles by crossing transposon-active plants with a reference allele of the mu- tation. The mutable alleles are separated from the reference allele by crossing the F1 to a standard line (hybrid, inbred, or tester). To identify a co-segregating transposon, the mutable allele is back- crossed into the standard line, and the backcrossed progeny are self-pollinated. b For non-directed tagging, transposon-active stocks are generally crossed to a standard line and the resulting progeny are self-pollinated. The self-pollinated families are screened for recessive mutants and segregating populations are generated by backcrossing into the standard line. Although both schematics show the transposon parent as a pollen parent, transposon mutagenesis can be completed with either a male or a female transposon parent 148 A. Mark Settles

These alleles are crossed to a standard inbred or hybrid to separate the mutable alle- les from the reference allele. Segregating populations of the mutable alleles are then screened by DNA gel blot or with PCR methods to identify transposon insertions that co-segregate with the mutable phenotype. The most common PCR method for co-segregation analysis is the amplification of insertion mutagenized sites (AIMS) protocol (Frey et al. 1998). AIMS is a modified amplified fragment length polymor- phism protocol that is specific for DNA adjacent to Mu transposons. Thermal asym- metric interlaced PCR (TAIL-PCR) has also been used to identify co-segregating Mu insertions (e.g. Chung et al. 2007; Porch et al. 2006). The co-segregating inser- tion is cloned by generating a size-selected phage/plasmid library or by a flanking sequence PCR method such as inverse-PCR (I-PCR) (Earp et al. 1990) or TAIL- PCR (Liu et al. 1995). Ac/Ds, En/Spm, and Mu have all been employed successfully in directed tagging experiments. These elements have different transposition characteristics. Ac/Ds el- ements are known to have a low mutation rate and an insertion preference for ge- netically linked sites (Dooner and Belachew 1989; Greenblatt 1984). En/Spm and Mu are generally used for tagging loci with unknown map locations. Mu has a high mutation rate associated with a high copy number of Mu elements (Robertson 1978; Walbot and Warren 1988). A comparison of early directed tagging experiments with Mu and En/Spm lines suggested that Mu lines give a 10-fold higher rate of novel alle- les (Walbot 1992). The high-copy nature of Mu lines makes subsequent segregation analysis more complicated. Also, Mu elements show very low rates of germinal ex- cision events and revertant alleles are difficult to recover (Brown et al. 1989; Levy et al. 1989; Schnable et al. 1989). Ac/Ds elements are useful for directed tagging when a mutant maps close to a characterized element. The propensity of Ac/Ds elements to transpose to linked sites enables the construction of a large allelic series once an Ac or Ds is positioned close to a locus of interest (Athma et al. 1992; Moreno et al. 1992). In addition, Ac/Ds tagged alleles give rise to germinal excision events that are frequently imprecise, leaving partial target site duplications. These “footprints” can generate weak alleles and even proteins with enhanced functional properties (Giroux et al. 1996; Wessler et al. 1986). Footprints are also used as supporting evidence that a tagged locus causes the mutant phenotype of interest. The historical limitation to utilizing Ac/Ds for directed tagging has been that only a small number of the transposons were at known map locations. Several research groups focused on developing Ac resources for directed tagging. Currently, there are nearly 170 Ac elements that are mapped to distributed locations throughout the genome (Auger and Sheridan 1999; Cowperthwaite et al. 2002; Dooner et al. 1994; Kolkman et al. 2005). These Ac stocks are in four collections generated by different genetic strategies. Each collection is propagated and moni- tored with distinct genetic markers. Dooner et al. (1994) mapped unlinked transpo- sitions from bz1-m2 using wx reciprocal translocations. These stocks are propagated with wx translocations to ensure that the Ac elements remain at the expected locus. Auger and Sheridan (1999) converted a series of inversion and balanced translo- cations with a Pericarp color1 allele tagged by Ac, P1-vv. The conversions are 11 Transposon Tagging and Reverse Genetics 149 recombinants of non-tagged alleles of P1 with P1-vv. The translocated or inverted DNA in each stock places the P1-vv allele (and thus the Ac element) in a linked po- sition to sites throughout the genome. More recently, Cowperthwaite et al. (2002) and Kolkman et al. (2005) generated a series of transpositions originating from Ac elements on chromosomes 1, 5, and 9. Flanking DNA from the transposed Ac el- ements was amplified and sequenced using I-PCR. The flanking sequences can be used to confirm that the Ac has not moved during the propagation of the stocks. In addition, these stocks have Ds markers to monitor Ac transposition. To use the Ac stocks for conventional tagging, the locus of interest needs to be mapped. One or more Ac’s that map close to the mutant can be ordered and crossed to a homozygous reference allele to screen for novel tagged alleles.

11.4 Non-directed Tagging

Directed tagging is limited to mutants that are non-essential for a plant to complete its life cycle. The general approach outlined above requires a homozygous mutant tester and detects tagged alleles via a mutant phenotype in the F1. If a mutant is lethal or infertile, a heterozygous tester could be generated. Tagged-alleles would be recovered at half the frequency due to the segregation of the mutation in the tester. However, most mutants would be lost in the first generation after the directed tagging crosses. Many lethal or infertile mutants have been cloned by transposon tagging. The tagged alleles of these mutants were identified from non-directed mu- tagenesis. Non-directed transposon-induced mutants are generated in a way that is simi- lar to other conventional mutagenesis approaches. Plants with active transposons are maintained through crosses to a reference genotype such as a hybrid, inbred, or transposon-activity tester (Fig. 11.2B). Progeny from these crosses contain a mutag- enized gamete and are equivalent to the M1 generation in conventional mutagenesis. Self-pollinations of the M1 yield M2 families segregating for recessive mutants. The recessive phenotypes can be identified through any screening approach practical for maize. Infertile and lethal mutations are propagated with heterozygous siblings from the specific M2 family. Co-segregating insertions are identified and cloned us- ing the same general strategies as described for directed tagging in Section 11.3. A near-saturation mutagenesis population requires a similar number of mutagenized gametes as in a directed tagging experiment, i.e. 30,000 (for highly active Mu)to 500,000 (for En/Spm) M2 families. M2 families are more laborious to generate and to screen than progeny of directed-tagging crosses. Consequently, the early focus of non-directed tagging ex- periments was on mutant classes that occur at a high frequency, such as seed and seedling lethal phenotypes (Cook and Miles 1988; Scanlon et al. 1994; Taylor et al. 1987). Relatively small transposon-tagging populations generally recover single al- leles of individual loci. A common approach for cloning “orphan” isolates is to screen for a tightly linked transposon insertion from the individual allele. To confirm 150 A. Mark Settles that a tagged locus causes the mutant phenotype, reverse genetics screens are used to recover additional alleles (e.g.Gutierrez-Marcos et al. 2007; Hamant et al. 2005; Wen et al. 2005). However, the single allele strategy carries the risk that the mutant may not be tagged. There are several approaches to ensure that a mutant locus identified in a non- directed tagging experiment can be cloned. First, near-saturation Mu populations have been developed for multiple genomics projects (Bensen et al. 1995; Fernandes et al. 2004; Martin et al. 2006; May et al. 2003; McCarty et al. 2005). If a mutant locus has a distinctive phenotype, forward genetic screens of these populations can identify several alleles to reduce the risk of recovering non-tagged alleles (e.g. Lid et al. 2002; Suzuki et al. 2006). Moreover, two functional genomics projects have generated near-saturation collections of non-photosynthetic and seed mutants. Both the photosynthesis mutant library (PML) and the UniformMu seed mutant collection contain thousands of visible mutants that represent hundreds of loci (McCarty et al. 2005; Stern et al. 2004). However, many seed and seedling lethal loci have simi- lar phenotypes. Identifying alleles of the same locus within these collections using conventional genetics requires secondary phenotypic screens, mapping the mutants, and extensive complementation tests (e.g. Scanlon et al. 1994). Second, there are several recent examples of a hybrid transposon-tagging and map-based approach to clone mutant loci (see Table 11.1). Map-based cloning is practical for maize as the physical map, genome sequence, and synteny relation- ships to rice have become better understood (reviewed in Bortiri et al. 2006a). A general strategy for this hybrid approach is to generate a segregating population by crossing the transposon-induced allele to a divergent inbred. The segregating popu- lation can be screened initially for a co-segregating transposon insertion. If a linked transposon is not identified, the same DNA samples can be used to map the mu- tant with molecular markers. Once an approximate map position is determined, the population is expanded to ∼1,000 meiotic products for fine mapping. Third, lethal mutations can be targeted by using regional tagging with Ac/Ds elements. An example of this approach is the tagging of the pink scuttellum1 (ps1) locus (Singh et al. 2003). An Ac element linked to a normal Ps1 allele was used to generate a non-directed tagging population. A Ds insertion in the R locus was used to report Ac dosage and select 400 linked transpositions from ∼50,000 M1 seed. Seven alleles of ps1 were found after self-pollinating the selected transpositions. The optimal tagging approach will be determined by several factors. Is the phe- notype sensitive to genetic background modifiers? Mutants that are sensitive to ge- netic background effects may be more difficult to map using a map-based cloning strategy. Is the mutant lethal or viable? Directed tagging can be employed for vi- able mutants. For lethals, are there many or just a few loci that give rise to similar phenotypes? It is easier to screen a near-saturation Mu population when the mutant phenotype is simple to score, and relatively rare mutants are more straight-forward to analyze with genetics. Finally, is the map position of the locus known? Tagging using Ac/Ds is practical only when one of the elements is at a closely linked site to the locus of interest. 11 Transposon Tagging and Reverse Genetics 151 11.5 Reverse Genetics Resources

Conventional transposon tagging is a forward genetics approach. Mutants are char- acterized due to their phenotypes, and the purpose of identifying tagged alleles is to understand the molecular cause of the phenotype. Transposons can also be used for reverse genetics screens, in which mutations are identified affecting a sequence using PCR. These mutants are analyzed for altered phenotypes to gain insight into the function of the sequence of interest. The maize research com- munity has developed a myriad of reverse genetics resources including multiple transposon-tagged collections. This section will focus on the transposon-tagging populations to discuss the advantages as well as the challenges that come with each resource. Chapter 12 discusses reverse genetics using chemical mutagenesis populations. A reverse genetics resource begins with a near-saturation collection of mutage- nized plants. A large population is necessary to ensure a reasonable chance that a mutation in any given sequence will be present. DNA is sampled from all of the mutagenized individuals for PCR. The specific mutations within the population can be identified on a locus-by-locus basis or systematically depending on the antici- pated demand for the particular resource. For the locus-by-locus approach, the DNA samples are pooled, typically in grids, for efficient screening of the population. To identify a mutant in a specific locus, the pooled DNA samples are screened by PCR with a primer for the gene of interest and a primer specific to the transposable ele- ment. Amplification products indicate an insertion. Corresponding row and column amplifications identify the plant that has the insertion allele. Plants harboring active transposons will also have somatic insertions in small sectors of the tissue sampled for DNA extractions. Somatic insertions are not inherited and lead to false posi- tive amplification during PCR screening. To limit the impact of somatic insertions, some reverse genetics projects sample from different leaves for row and column pools (Bensen et al. 1995; Fernandes et al. 2004). Other projects have used genetic inhibitors or genetic markers for transposon activity to select against somatic trans- position within the plants sampled for DNA (May et al. 2003; McCarty et al. 2005).

11.5.1 Single-Gene Screening Resources

Single-locus screens can be completed either by service facilities or by the individ- ual laboratory. Service facilities include the Trait Utility System for Corn (TUSC), Maize Targeted Mutagenesis (MTM), Biogemma’s Mu population, and the Mu re- sources at the University of Bristol. The TUSC facility is operated by Pioneer Hi- Bred International, Inc., and academic researchers need to establish a collaboration with the company to complete a screen. These collaborations are relatively straight- forward to develop, as evidenced by the steady rate of around two publications per year that report mutants identified using the TUSC service (e.g. Ching et al. 2006; Chung et al. 2007; Golubovskaya et al. 2006; Li et al. 2007). 152 A. Mark Settles

There are several issues to consider related to choice of a service facility. For example, mutants identified from the TUSC collection require a material transfer agreement (MTA) for distribution, while the MTM population was established as a public screening service to identify mutants that can be freely distributed (May et al. 2003). MTM screens are completed for a user fee, which can become expensive to an individual program when screening for mutations in multiple loci. Also, MTM is not a near-saturation mutagenesis collection and has a lower likelihood of recover- ing mutations than the TUSC population. The initial MTM screens found mutants for only 42% of the genes screened (May et al. 2003). Moreover, positive MTM screens typically recover a single allele (Martin et al. 2006; Sheehan et al. 2007). In contrast, TUSC screens generally recover two to three alleles. Biogemma has developed reverse genetics resources that were used in two recent mutant gene cloning reports (Gutierrez-Marcos et al. 2007; Martin et al. 2006). Similar to TUSC, a collaborative agreement is required for a screen and an MTA is required for distributing the mutant seed (Pascual Perez, personal communication). Although the Biogemma population is smaller than the TUSC and MTM popula- tions, it has a similar rate of success as MTM for finding at least one allele of the lo- cus of interest (∼50%). Unlike MTM, the two published reports using Biogemma’s population recovered multiple alleles for each locus. The Functional Genomics group at the University of Bristol has a free screening service (www.cerealsdb.uk.net). This population consists of 5,000 Mu-active plants arrayed into a grid. The one report that used this service recovered three insertion alleles for a K+-channel gene (Philippar et al. 2006). For researchers who would like to complete single-locus screens in their own laboratories, the Maize Gene Discovery Project developed a Mu population with a transgenic Mu element, RescueMu (Raizada et al. 2001). RescueMu contains a plasmid vector, and flanking DNA from the transposon can be recovered using plasmid rescue. RescueMu insertions can be screened by ordering plasmid libraries recovered from grids that contain ∼27,500 germinal transpositions (Fernandes et al. 2004). Due to the low number of transpositions, there is a significant chance that a mutation will not be found after completing a PCR screen. Although the RescueMu population is a challenge to use as a reverse genetics resource, the population is in an active Mu genetic background. The non-transgenic, endogenous Mu elements are useful for conventional transposon tagging (McSteen et al. 2007). However, it is necessary to obtain appropriate movement and release permits, as well as any spe- cific institutional authorizations, to propagate RescueMu lines due to the transgenic Mu elements in the population.

11.5.2 Flanking Sequence Tags and Reverse Genetics

Single-gene reverse genetics screens are equivalent to directed-tagging experiments. Mutations in a specific gene of interest can be recovered efficiently. However, each gene that researchers are interested in analyzing requires a separate screen. If a 11 Transposon Tagging and Reverse Genetics 153 reverse genetics resource will be used to analyze thousands of genes, it becomes more cost effective and faster to identify transposon-induced mutations using a sys- tematic approach. In Arabidopsis and rice, the approach has been to index insertion mutants using flanking sequence tags (FSTs). FSTs are sequenced DNA adjacent to a transposon or T-DNA tag. The sequence anchors the insertion site to a refer- ence genome, allowing gene disruptions to be identified in silico. Sequencing from several hundred thousand insertion sites recovers mutations in the vast majority of genes in the genome (Alonso et al. 2003; Rosso et al. 2003; Samson et al. 2004; Sessions et al. 2002). Maize FSTs are not as developed as Arabidopsis and rice resources. However, many of the same genetic resources discussed above have also been used to gener- ate FSTs. Most maize insertional mutagenesis resources utilize native transposable element systems. These transposons exist as part of the genome, and plants contain a mix of somatic, novel, and parental insertions. Thus, recovering unique, germinal insertions is more challenging from maize transposon-tagging populations than it is from rice or Arabidopsis T-DNA tagging populations. Amplifying native trans- poson insertion sites leads to redundant products that represent parental insertions shared by many plants within a mutagenized population. Also, plants with active transposons will have somatic, non-heritable insertions. Several groups have sequenced random samples of parental, novel, and somatic insertions. The functional genomics group at the University of Bristol amplified transposon insertions from Mu-active plants using a modified AFLP method (Hanley et al. 2000). Seven hundred and fifty FSTs resulted in 450 unique insertion sites. Only a small number of these insertions were tested for inheritance, leaving the specific fraction of somatic insertions unknown. MuTAIL-PCR has also been used to amplify insertions for 99 FSTs from a Mu population developed in China (Liu et al. 2006). These sequences identified 59 non-redundant insertion sites that were not tested for heritability. The Maize Gene Discovery Project took a large-scale shotgun approach and sequenced 191,717 RescueMu FSTs, resulting in 14,887 non- redundant Mu FSTs (Fernandes et al. 2004). Only 528 of the insertion sites were identified as germinal, based on the criteria that the insertion site was recovered from two independent DNA samples from the same plant. The Maize Endosperm Genomics Project used a somatic activity marker to en- sure that germinal FSTs are recovered (McCarty et al. 2005; Settles et al. 2004). A total of 37,595 FSTs were sequenced from the UniformMu population using MuTAIL-PCR to amplify the insertions. These FSTs identify over 1,900 non- redundant insertion sites (www.uniformmu.org). Heritability tests for 106 of the insertions gave no evidence of somatic FSTs and showed at least 89% are inherited, germinal insertions (Settles et al. 2007). FSTs from germinal Ac and Ds insertions have also been generated. For Ac inser- tions, DNA gel blots were screened to identify novel, germinal Ac insertions. After a subsequent gel extraction step, specific insertion sites were amplified with I-PCR (Cowperthwaite et al. 2002; Kolkman et al. 2005). A total of 115 FSTs have been sequenced from Ac and a similar approach is being used to generate FSTs from Ds elements with 916 FSTs available currently (see Table 11.2). 154 A. Mark Settles

Table 11.2 Summary of transposon-tagging reverse genetics resources Resource Website/contact FST sequence databases Ac/Ds http://www.plantgdb.org/prj/AcDsTagging/tool/blast.php RescueMu and UniformMu http://www.mutransposon.org/cgi-bin/MuBLAST.cgi UniformMu http://currant.hos.ufl.edu/mutail/ University of Bristol http://www.cerealsdb.uk.net/mudb.htm PCR screening services TUSC Robert Meeley ([email protected]) MTM http://mtm.cshl.org Biogemma Christophe Tatout ([email protected]) University of Bristol http://www.cerealsdb.uk.net/pcrscrn.htm

11.5.3 An Optimal Reverse Genetics Strategy?

With five PCR screening resources and seven FST resources, it is challenging to decide which resources are best to obtain mutations in a gene of interest. Since se- quence database searches are fast, BLAST searching the FST resources is an easy starting point. Most of the FSTs can be searched at two websites. All of the Ds FSTs and most of the Ac FSTs are at PlantGDB, and the UniformMu and RescueMu FSTs can be searched at www.mutransposon.org (see Table 11.2). A separate Unifor- mMu FST database includes both a BLAST server and analyzed FSTs to help users find non-redundant insertions and annotations for the insertions. The University of Bristol FSTs can be searched at the resource website. Germinal insertions in the RescueMu FSTs can be identified when multiple hits are recovered from columns and rows in the same grid. Both a column and a row hit is necessary to identify a specific RescueMu plant. For all of the other FST resources, a single hit is sufficient to identify the plant carrying the insertion. Combined, the Mu, Ac, and Ds FST resources represent less than 4,000 non- redundant insertion sites that are likely to be germinal. With these current sequences, a researcher will be lucky to find a match in a gene of interest. The ease of data- base searches makes these searches worth completing prior to initiating other re- verse genetics screens. If no FSTs are identified after database screening, the re- searcher needs to decide whether to collaborate with a company or pay for a public reverse genetics screen. The advantages of collaborating with a company are that there are no user fees and the probability of recovering mutant alleles is higher. However, the company will obtain some intellectual property rights to the bio- logical process under study and generally will require a company review of the manuscript describing the mutant alleles. A factor to consider for public resources is that smaller populations are less likely to identify a mutant in a gene of inter- est. The “do-it-yourself” RescueMu screen contains the fewest number of novel insertions. 11 Transposon Tagging and Reverse Genetics 155 11.6 Future Perspectives

Conventional transposon tagging is the predominant approach for current maize ge- netics research. As the maize genome is sequenced, map-based cloning will be eas- ier and is likely to become more common. However, transposon tagging will still play a central role in molecular genetics. Transposon-induced alleles give a differ- ent spectrum of mutant phenotypes than other mutagens and are useful in generating allelic series. Transposons also cause mutable or epigenetically regulated alleles that are useful for generating chimeric plants. Most importantly, transposon reverse ge- netics resources are likely to be critical for obtaining confirming alleles for forward genetics studies. A key challenge in making transposon reverse genetics resources more efficient is generating a saturated collection of FSTs with corresponding seed. Massively parallel sequencing technologies will help in reducing the cost of gener- ating FSTs. Lower costs per FST should allow sufficient numbers of FSTs to make database searches a standard method for recovering maize mutants in a gene of interest.

Acknowledgements I apologize to the authors of many relevant research articles that I was not able to highlight in this chapter due to space constraints. The author’s research is supported by grants from the NSF Plant Genome Research Project (DBI-0606607), the National Research Ini- tiative of the USDA-CSREES (grant no. 2007-35304-18439), the University of Florida Genetics Institute, and the Vasil-Monsanto Endowment.

References

Alonso JM, Stepanova AN, Leisse TJ, Kim CJ, Chen HM, Shinn P, Stevenson DK, Zimmerman J, Barajas P, Cheuk R, Gadrinab C, Heller C, Jeske A, Koesema E, Meyers CC, Parker H, Prednis L, Ansari Y, Choy N, Deen H, Geralt M, Hazari N, Hom E, Karnes M, Mulholland C, Ndubaku R, Schmidt I, Guzman P, Aguilar-Henonin L, Schmid M, Weigel D, Carter DE, Marchand T, Risseeuw E, Brogden D, Zeko A, Crosby WL, Berry CC, Ecker JR (2003) Genome-wide inser- tional mutagenesis of Arabidopsis thaliana. Science 301:653–657 Athma P, Grotewold E, Peterson T (1992) Insertional mutagenesis of the maize P gene by intra- genic transposition of Ac. Genetics 131:199–209 Auger DL, Sheridan WF (1999) Maize stocks modified to enhance the recovery of Ac-induced mutations. J Hered 90:453–459 Barker RF, Thompson DV, Talbot DR, Swanson J, Bennetzen JL (1984) Nucleotide sequence of the maize transposable element Mul. Nucleic Acids Res 12:5955–5967 Bensen RJ, Johal GS, Crane VC, Tossberg JT, Schnable PS, Meeley RB, Briggs SP (1995) Cloning and characterization of the maize An1 gene. Plant Cell 7:75–84 Bingham PM, Levis R, Rubin GM (1981) Cloning of DNA sequences from the White locus of Drosophila melanogaster by a novel and general method. Cell 25:693–704 Bommert P, Lunde C, Nardmann J, Vollbrecht E, Running M, Jackson D, Hake S, Werr W (2005) thick tassel dwarf1 encodes a putative maize ortholog of the Arabidopsis CLAVATA1 leucine- rich repeat receptor-like kinase. Development 132:1235–1245 Bortiri E, Jackson D, Hake S (2006a) Advances in maize genomics: the emergence of positional cloning. Curr Opin Plant Biol 9:164–171 156 A. Mark Settles

Bortiri E, Chuck G, Vollbrecht E, Rocheford T, Martienssen R, Hake S (2006b) ramosa2 encodes a LATERAL ORGAN BOUNDARY domain protein that determines the fate of stem cells in branch meristems of maize. Plant Cell 18:574–585 Brown WE, Robertson DS, Bennetzen JL (1989) Molecular analysis of multiple Mutator-derived alleles of the Bronze locus of maize. Genetics 122:439–445 Bruggmann R, Bharti AK, Gundlach H, Lai JS, Young S, Pontaroli AC, Wei FS, Haberer G, Fuks G, Du CG, Raymond C, Estep MC, Liu RY, Bennetzen JL, Chan AP, Rabinowicz PD, Quack- enbush J, Barbazuk WB, Wing RA, Birren B, Nusbaum C, Rounsley S, Mayer KFX, Messing J (2006) Uneven chromosome contraction and expansion in the maize genome. Genome Res 16:1241–1251 Bureau TE, Wessler SR (1992) Tourist: a large family of small inverted repeat elements frequently associated with maize genes. Plant Cell 4:1283–1294 Ching A, Dhugga KS, Appenzeller L, Meeley R, Bourett TM, Howard RJ, Rafalski A (2006) Brittle stalk 2 encodes a putative glycosylphosphatidylinositol-anchored protein that affects mechani- cal strength of maize tissues by altering the composition and structure of secondary cell walls. Planta 224:1174–1184 Chomet P, Lisch D, Hardeman KJ, Chandler VL, Freeling M (1991) Identification of a regulatory transposon that control the Mutator transposable element system in maize. Genetics 129:261– 270 Chuck G, Cigan AM, Saeteurn K, Hake S (2007) The heterochronic maize mutant Corngrass1 results from overexpression of a tandem microRNA. Nat Genet 39:544–549 Chung T, Kim CS, Nguyen HN, Meeley RB, Larkins BA (2007) The maize Zmsmu2 gene en- codes a putative RNA-splicing factor that affects protein synthesis and RNA processing during endosperm development. Plant Physiol 144:821–835 Cook WB, Miles D (1988) Transposon mutagenesis of nuclear photosynthetic genes in Zea mays. Photosynth Res 18:33–59 Cowperthwaite M, Park W, Xu ZN, Yan XH, Maurais SC, Dooner HK (2002) Use of the transposon Ac as a gene-searching engine in the maize genome. Plant Cell 14:713–726 Dooner HK, Belachew A (1989) Transposition pattern of the maize element Ac from the bz-m2(Ac) allele. Genetics 122:447–457 Dooner HK, Belachew A, Burgess D, Harding S, Ralston M, Ralston E (1994) Distribution of unlinked receptor-sites for transposed Ac elements from the bz-m2(Ac) allele in maize. Genetics 136:261–279 Earp DJ, Lowe B, Baker B (1990) Amplification of genomic sequences flanking transposable ele- ments in host and heterologous plants – a tool for transposon tagging and genome characteriza- tion. Nucleic Acids Res 18:3271–3279 Evans MMS (2007) The indeterminate gametophyte1 gene of maize encodes a LOB domain pro- tein required for embryo sac and leaf development. Plant Cell 19:46–62 Fedoroff N, Wessler S, Shure M (1983) Isolation of the transposable maize controlling elements Ac and Ds. Cell 35:235–242 Fedoroff NV, Furtek DB, Nelson OE (1984) Cloning of the Bronze locus in maize by a simple and generalizable procedure using the transposable controlling element Activator (Ac). Proc Natl Acad Sci USA 81:3825–3829 Fernandes J, Dong QF, Schneider B, Morrow DJ, Nan GL, Brendel V, Walbot V (2004) Genome- wide mutagenesis of Zea mays L.usingRescueMu transposons. Genome Biol 5:R82 Frey M, Stettner C, Gierl A (1998) A general method for gene isolation in tagging approaches: amplification of insertion mutagenised sites (AIMS). Plant J 13:717–721 Giroux MJ, Shaw J, Barry G, Cobb BG, Greene T, Okita T, Hannah LC (1996) A single gene mutation that increases maize seed weight. Proc Natl Acad Sci USA 93:5824–5829 Golubovskaya IN, Hamant O, Timofejeva L, Wang C-JR, Braun D, Meeley R, Cande WZ (2006) Alleles of afd1 dissect REC8 functions during meiotic prophase I. J Cell Sci 119: 3306–3315 Greenblatt IM (1984) A chromosome replication pattern deduced from pericarp phenotypes result- ing from movements of the transposable element, Modulator, in maize. Genetics 108:471–485 11 Transposon Tagging and Reverse Genetics 157

Gutierrez-Marcos JF, Dal Pra M, Giulini A, Costa LM, Gavazzi G, Cordelier S, Sellam O, Tatout C, Paul W, Perez P, Dickinson HG, Consonni G (2007) empty pericarp4 encodes a mitochondrion- targeted pentatricopeptide repeat protein necessary for seed development and plant growth in maize. Plant Cell 19:196–210 Hamant O, Golubovskaya I, Meeley R, Fiume E, Timofejeva L, Schleiffer A, Nasmyth K, Cande WZ (2005) A REC8-dependent plant Shugoshin is required for maintenance of centromeric cohesion during meiosis and has no mitotic functions. Curr Biol 15:948– 954 Hanley S, Edwards D, Stevenson D, Haines S, Hegarty M, Schuch W, Edwards KJ (2000) Iden- tification of transposon-tagged genes by the random sequencing of Mutator-tagged DNA frag- ments from Zea mays. Plant J 23:557–566 Hua-Van A, Le Rouzic A, Maisonhaute C, Capy P (2005) Abundance, distribution and dynamics of retrotransposable elements and transposons: similarities and differences. Cytogenet Genome Res 110:426–440 Kim CS, Gibbon BC, Gillikin JW, Larkins BA, Boston RS, Jung R (2006) The maize Mucronate mutation is a deletion in the 16-kDa gamma-zein gene that induces the unfolded protein re- sponse. Plant J 48:440–451 Kolkman JM, Conrad LJ, Farmer PR, Hardeman K, Ahern KR, Lewis PE, Sawers RJH, Lebejko S, Chomet P, Brutnell TP (2005) Distribution of Activator (Ac) throughout the maize genome for use in regional mutagenesis. Genetics 169:981–995 Kumar CS, Wing RA, Sundaresan V (2005) Efficient insertional mutagenesis in rice using the maize En/Spm elements. Plant J 44:879–892 Kunze R, Weil CF (2002) The hAT and CACTA superfamilies of plant transposons. In: Craig Nl, Craigie M, Gellert Am, Lambowitz Am (eds) Mobile DNA II. ASM Press, Washington, DC, pp 533–564 Lal SK, Giroux MJ, Brendel V, Vallejos CE, Hannah LC (2003) The maize genome contains a Helitron insertion. Plant Cell 15:381–391 Levy AA, Britt AB, Luehrsen KR, Chandler VL, Warren C, Walbot V (1989) Developmental and genetic-aspects of Mutator excision in maize. Dev Genet 10:520–531 Li J, Harper LC, Golubovskaya I, Wang CR, Weber D, Meeley RB, McElver J, Bowen B, Cande WZ, Schnable PS (2007) Functional analysis of maize RAD51 in meiosis and double-strand break repair. Genetics 176:1469–1482 Lid SE, Gruis D, Jung R, Lorentzen JA, Ananiev E, Chamberlin M, Niu XM, Meeley R, Nichols S, Olsen OA (2002) The defective kernel 1 (dek1) gene required for aleurone cell development in the endosperm of maize grains encodes a membrane protein of the calpain gene superfamily. Proc Natl Acad Sci USA 99:5460–5465 Lisch D (2002) Mutator transposons. Trends Plant Sci 7:498–504 Liu WT, Gao YJ, Teng F, Shi Q, Zheng YL (2006) Construction and genetic analysis of Mutator insertion mutant population in maize. Chinese Sci Bull 51:2604–2610 Liu YG, Mitsukawa N, Oosumi T, Whittier RF (1995) Efficient isolation and mapping of Ara- bidopsis thaliana T-DNA insert junctions by thermal asymmetric interlaced PCR. Plant J 8: 457–463 Martin A, Lee J, Kichey T, Gerentes D, Zivy M, Tatout C, Dubois F, Balliau T, Valot B, Davanture M, Terce-Laforgue T, Quillere I, Coque M, Gallais A, Gonzalez-Moro MB, Bethencourt L, Habash DZ, Lea PJ, Charcosset A, Perez P, Murigneux A, Sakakibara H, Edwards KJ, Hirel B (2006) Two cytosolic glutamine synthetase isoforms of maize are specifically involved in the control of grain production. Plant Cell 18:3252–3274 May BP, Liu H, Vollbrecht E, Senior L, Rabinowicz PD, Roh D, Pan XK, Stein L, Freeling M, Alexander D, Martienssen R (2003) Maize-targeted mutagenesis: a knockout resource for maize. Proc Natl Acad Sci USA 100:11541–11546 McCarty DR, Settles AM, Suzuki M, Tan BC, Latshaw S, Porch T, Robin K, Baier J, Avigne W, Lai JS, Messing J, Koch KE, Hannah LC (2005) Steady-state transposon mutagenesis in inbred maize. Plant J 44:52–61 McClintock B (1948) Mutable loci in maize. Carnegie Inst Wash Year B 47:155–169 158 A. Mark Settles

McClintock B (1950) The origin and behavior of mutable loci in maize. Proc Natl Acad Sci USA 36:344–355 McClintock B (1954) Mutations in maize and chromosomal aberrations in Neurospora. Carnegie Inst Wash Year B 56:415–429 McSteen P, Malcomber S, Skirpan A, Lunde C, Wu X, Kellogg E, Hake S (2007) Barren inflores- cence2 encodes a co-ortholog of the PINOID serine/threonine kinase and is required for organo- genesis during inflorescence and vegetative development in maize. Plant Physiol 144:1000– 1011 Moreno MA, Chen J, Greenblatt I, Dellaporta SL (1992) Reconstitutional mutagenesis of the maize P gene by short-range Ac transpositions. Genetics 131:939–956 Muszynski MG, Dam T, Li B, Shirbroun DM, Hou ZL, Bruggemann E, Archibald R, Ananiev EV, Danilevskaya ON (2006) Delayed flowering1 encodes a basic leucine zipper protein that mediates floral inductive signals at the shoot apex in maize. Plant Physiol 142:1523–1536 Peterson PA (1960) The pale green mutable system in maize. Genetics 45:115–133 Philippar K, Buchsenschutz K, Edwards D, Loffler J, Luthen H, Kranz E, Edwards KJ, Hedrich R (2006) The auxin-induced K+ channel gene Zmk1 in maize functions in coleoptile growth and is required for embryo development. Plant Mol Biol 61:757–768 Pohlman RF, Fedoroff NV, Messing J (1984) The nucleotide-sequence of the maize controlling element Activator. Cell 37:635–643 Porch TG, Tseung CW, Schmelz EA, Settles AM (2006) The maize Viviparous10/Viviparous13 locus encodes the Cnx1 gene required for molybdenum cofactor biosynthesis. Plant J 45:250– 263 Raizada MN, Nan GL, Walbot V (2001) Somatic and germinal mobility of the RescueMu transpo- son in transgenic maize. Plant Cell 13:1587–1608 Robertson DS (1978) Characterization of a mutator system in maize. Mut Res 51:21–28 Rosso MG, Li Y, Strizhov N, Reiss B, Dekker K, Weisshaar B (2003) An Arabidopsis thaliana T-DNA mutagenized population (GABI-Kat) for flanking sequence tag-based reverse genetics. Plant Mol Biol 53:247–259 Samson F, Brunaud V, Duchene S, De Oliveira Y, Caboche M, Lecharny A, Aubourg S (2004) FLAGdb(++): a database for the functional analysis of the Arabidopsis genome. Nucleic Acids Res 32:D347–D350 Satoh-Nagasawa N, Nagasawa N, Malcomber S, Sakai H, Jackson D (2006) A trehalose metabolic enzyme controls inflorescence architecture in maize. Nature 441:227–230 Sawers RJH, Viney J, Farmer PR, Bussey RR, Olsefski G, Anufrikova K, Hunter CN, Brutnell TP (2006) The maize Oil yellow1 (Oy1) gene encodes the I subunit of magnesium chelatase. Plant Mol Biol 60:95–106 Scanlon MJ, Stinard PS, James MG, Myers AM, Robertson DS (1994) Genetic analysis of 63 mutations affecting maize kernel development isolated from Mutator stocks. Genetics 136:281– 294 Schmidt RJ, Burr FA, Burr B (1987) Transposon tagging and molecular analysis of the maize regulatory locus Opaque-2. Science 238:960–963 Schnable PS, Peterson PA, Saedler H (1989) The Bz-Rcy allele of the Cy transposable element system of Zea mays contains a Mu-like element insertion. Mol Gen Genet 217:459–463 Schwarz-Sommer Z, Gierl A, Klosgen RB, Wienand U, Peterson PA, Saedler H (1984) The Spm (En) transposable element controls the excision of a 2-Kb DNA insert at the wx-m8 allele of Zea mays. EMBO J 3:1021–1028 Sessions A, Burke E, Presting G, Aux G, McElver J, Patton D, Dietrich B, Ho P, Bacwaden J, Ko C, Clarke JD, Cotton D, Bullis D, Snell J, Miguel T, Hutchison D, Kimmerly B, Mitzel T, Katagiri F, Glazebrook J, Law M, Goff SA (2002) A high-throughput Arabidopsis reverse genetics system. Plant Cell 14:2985–2994 Settles AM, Latshaw S, McCarty DR (2004) Molecular analysis of high-copy insertion sites in maize. Nucleic Acids Res 32:e54 Settles AM, Holding DR, Tan BC, Latshaw SP, Liu J, Suzuki M, Li L, O’Brien BA, Fajardo DS, Wroclawska E, Tseung CW, Lai JS, Hunter CT, Avigne WT, Baier J, Messing J, Hannah LC, 11 Transposon Tagging and Reverse Genetics 159

Koch KE, Becraft PW, Larkins BA, McCarty DR (2007) Sequence-indexed mutations in maize using the UniformMu transposon-tagging population. BMC Genomics 8 Sheehan MJ, Kennedy LM, Costich DE, Brutnell TP (2007) Subfunctionalization of PhyB1 and PhyB2 in the control of seedling and mature plant traits in maize. Plant J 49:338–353 Shi J, Wang H, Hazebroek J, Ertl DS, Harp T (2005) The maize low-phytic acid 3 encodes a myo-inositol kinase that plays a role in phytic acid biosynthesis in developing seeds. Plant J 42:708–719 Singh M, Lewis PE, Hardeman K, Bai L, Rose JKC, Mazourek M, Chomet P, Brutnell TP (2003) Activator mutagenesis of the pink scutellum1/viviparous7 locus of maize. Plant Cell 15:874– 884 Stern DB, Hanson MR, Barkan A (2004) Genetics and genomics of chloroplast biogenesis: maize as a model system. Trends Plant Sci 9:293–301 Sturaro M, Hartings H, Schmelzer E, Velasco R, Salamini F, Motto M (2005) Cloning and charac- terization of GLOSSY1, a maize gene involved in cuticle membrane and wax production. Plant Physiol 138:478–489 Suzuki M, Settles AM, Tseung CW, Li QB, Latshaw S, Wu S, Porch TG, Schmelz EA, James MG, McCarty DR (2006) The maize viviparous15 locus encodes the molybdopterin synthase small subunit. Plant J 45:264–274 Taramino G, Sauer M, Stauffer JL, Multani D, Niu X, Sakai H, Hochholdinger F (2007) The maize (Zea mays L.) RTCS gene encodes a LOB domain protein that is a key regulator of embryonic seminal and post-embryonic shoot-borne root initiation. Plant J 50:649–659 Taylor WC, Barkan A, Martienssen RA (1987) Use of nuclear mutants in the analysis of chloroplast development. Dev Genet 8:305–320 Vollbrecht E, Springer PS, Goh L, Buckler IV ES, Martienssen R (2005) Architecture of floral branch systems in maize and related grasses. Nature 436:1119–1126 Walbot V (1992) Strategies for mutagenesis and gene cloning using transposon tagging and T-DNA insertional mutagenesis. Annu Rev Plant Physiol Plant Mol Biol 43:49–82 Walbot V, Rudenko GN (2002) MuDR/Mu transposable elements in maize. In: Craig Nl, Craigie M, Gellert Am, Lambowitz Am (eds) Mobile DNA II. ASM Press, Washington, DC, pp 533–564 Walbot V, Warren C (1988) Regulation of Mu-element copy number in maize lines with an active or inactive Mutator transposable element system. Mol Gen Genet 211:27–34 Wen T-J, Hochholdinger F, Sauer M, Bruce W, Schnable PS (2005) The roothairless1 gene of maize encodes a homolog of sec3, which is involved in polar exocytosis. Plant Physiol 138:1637–1643 Wessler SR, Baran G, Varagona M, Dellaporta SL (1986) Excision of Ds produces WAXY proteins with a range of enzymatic activities. EMBO J 5:2427–2432 View publication stats