Interspersed Repeats and Other Mementos of Transposable Elements in Mammalian Genomes
Arian FA Smit
The bulk of the human genome is ultimately derived from transposable elements. Observations in the past year have led to some new and surprising ideas on functions and consequences of these elements and their remnants in our genome. The many new examples of human genes derived from single transposon insertions highlight the large contribution of selfish DNA to genomic evolution.

Addresses
Axys Pharmaceuticals, Inc., 11099 North Torrey Pines Road, Suite 160, La Jolla, California 92037-1029, USA; e-mail: [email protected]

Current Opinion in Genetics & Development 1999, 9:657–663
0959-437X/99/$ - see front matter © 1999 Elsevier Science Ltd. All rights reserved.

Abbreviations
ERV   endogenous retrovirus
hAT   hobo-Activator-Tam
LINE  long interspersed nuclear element
LTR   long terminal repeat
MIR   mammalian-wide interspersed repeat
ORF   open reading frame
SINE  short interspersed nuclear element
TIR   terminal inverted repeat
UTR   untranslated region

Introduction
In May 1999, 10% of the human genome had been sequenced, providing a suitable occasion to update an earlier survey of interspersed repeats in our genome [1] in the first part of this review. Recently characterized elements have shed some more light on the origin of these repeats. The characteristics of most but not all repeats fit the model that they are remnants of selfish DNA. In particular, indications of a possibly symbiotic relationship of SINEs with mammalian genomes are reviewed. Compared with those of mouse and other model organisms, human transposons are remarkably subdued, leading to speculations on host defense mechanisms. The final section of this review discusses new findings on the many ways in which transposable elements can contribute to genomic evolution.

Updated survey of human interspersed repeats
The observed density of the main classes of repeats in a total of 327 Mb of human genomic DNA, pigeon-holed by the GC richness of 50 kb fragments, is shown in Table 1. The average fraction of euchromatic DNA identifiably derived from transposable elements is over 42%, an increase from the 35% determined in 1996 due to improvements in the detection method [2] and consensus sequences, and to additions of human repeats in the RepBase database [3••].

Among the new additions are a series of consensus sequences for autonomous members of the MER1 family of mammalian DNA transposons [4]. These nine elements (Charlie1–8 [4], Cheshire [5]) and a tenth, more distantly related repeat (MER69) encode proteins related to the hobo-Activator-Tam (hAT) transposases. Besides these and the Tc1-family (5 Tigger and 2 mariner) fossils [4], a third, small group in our DNA is characterized by TTAA-flanking sites. One member (Looper [5]) encodes a protein that is 50% similar to the putative transposase of the TTAA-flanked piggyBac transposon in moths. Most reconstructed DNA transposons belong to the Charlie or Tigger groups, which are so far mammalian-specific. This is somewhat surprising, as the lack of intermediates between any related elements suggests that they all represent independent (lateral) introductions into the genome, whereas most types of eukaryotic DNA transposons may be able to transpose in mammalian cells [6•].

Almost all long terminal repeat (LTR) sequences in RepBase (211 in release 3.06) have now been linked to the three classes of mammalian endogenous retroviruses (ERVs) [4] (see [7,8] for classification); at least 2.3% of our autosomal DNA is derived from class I ERVs and non-autonomous associates (e.g. the MER4 group), 0.3% from class II elements, and 4.6% from foamy virus-like (ERV-L) elements. Three-quarters of the last fraction is taken by the non-autonomous MaLR elements; their relationship to ERV-L [1] is now sealed by the similarity of the single product of older MaLRs to the ERV-L gag protein [4]. On the Y chromosome, class I and II LTR elements are 2.5 times more frequent than on the autosomal and X chromosomes and outnumber the ERV-L elements, which occur with normal copy numbers (AFA Smit, unpublished data).

Non-LTR retrotransposons or LINEs are divided into 11 clades on the basis of protein phylogeny, like the ERV classification [9•]. Besides L1 (L1 clade), L2 (Jockey clade) and the ruminant-specific BovB or BDDF (RTE clade), two relatively low copy repeats (L3 and CR1_Hs) representing the CR1 clade have now been identified in human DNA [5]. L1 has two open reading frames (ORFs), the first encoding a protein that specifically binds the second in the transcript [10]. A new entry in RepBase, HAL1 [4], contains a single ORF that is very similar to ORF1 of ancient L1s, followed by an untranslated region and a poly(A) tail. As the ~20,000 HAL1 copies are at least as diverged (old) as the oldest known L1 subfamilies, the 'modern' L1 element may be a fusion product of HAL1 with an existing LINE, which would explain why the L1 ORF1 product is unrelated to products of other non-LTR retrotransposons [9•].

Table 1
Interspersed repeat density in different isochores of the human and mouse genome.
GC level                   <36%   36–38%   38–40%   40–42%   42–44%   44–46%   46–50%   50–54%     >54%
Human* (bp analyzed) A     23 Mb    41 Mb    44 Mb    39 Mb    33 Mb    27 Mb    34 Mb    20 Mb    11 Mb
                     X      6 Mb    12 Mb    13 Mb     9 Mb     4 Mb     2 Mb     2 Mb   0.4 Mb   0.7 Mb
Alu                  A      5.0%     6.7%     8.6%    11.4%    13.9%    16.6%    19.2%    22.2%    17.7%
                     X      4.0%     4.5%     5.6%     9.7%    12.8%    15.6%    16.2%    22.2%    13.6%
LINE1                A     20.0%    19.6%    16.9%    14.0%    11.3%     9.9%     7.4%     4.8%     3.0%
                     X     35.3%    33.2%    28.4%    18.6%    15.0%     9.0%     9.0%     6.1%     5.3%
MIR                  A      1.5%     1.8%     2.1%     2.3%     2.3%     2.9%     3.3%     3.3%     2.4%
                     X      1.5%     1.8%     2.3%     2.4%     2.0%     2.1%     2.2%     2.7%     1.2%
LINE2                A      3.1%     3.5%     3.7%     3.6%     3.1%     3.7%     3.6%     3.4%     2.7%
                     X      2.7%     3.3%     3.4%     3.9%     3.0%     4.2%     3.9%     2.9%     2.3%
LTR elements         A      6.8%     7.7%     7.8%     8.6%     8.9%     8.5%     6.5%     4.7%     2.9%
                     X      8.8%    10.7%     9.8%     9.2%     9.4%    10.5%     7.7%     2.7%     2.8%
DNA transposon       A      2.9%     3.0%     3.0%     3.1%     3.0%     2.8%     2.5%     1.8%     1.2%
                     X      2.7%     2.7%     2.7%     3.3%     2.8%     3.0%     2.1%     1.1%     1.0%
Total                A     39.4%    42.3%    42.1%    43.0%    42.7%    44.4%      43%      41%      31%
                     X     55.1%    56.2%    52.2%    47.1%    45.0%    44.3%      41%      38%      26%

Mouse                    <0.2 Mb  <0.2 Mb   1.2 Mb   1.9 Mb   0.9 Mb   1.5 Mb   3.0 Mb   1.6 Mb  <0.2 Mb
B1                             –        –     1.2%     1.6%     2.7%     4.4%     6.4%     6.9%        –
Other SINEs                    –        –     2.2%     3.0%     5.7%     7.4%     8.0%     9.6%        –
LINE1                          –        –    30.4%    25.4%    14.0%     7.7%     3.4%     1.4%        –
LTRs                           –        –     7.9%    11.0%    12.7%    10.6%     9.9%     3.6%        –
Total                          –        –    43.0%    42.2%    36.6%    31.6%    28.9%    20.3%        –

Analysis was performed with RepeatMasker (version 050599 default settings, RepBase version 3.06) on 327 Mb of non-redundant human genomic DNA downloaded from http://www.ncbi.nlm.nih.gov/genome/seq in June 1999 and 10 Mb of >40 kb long mouse genomic DNA entries in GenBank Release 112. *For the human genome, the data is split in autosomal (A) and X chromosomal (X) DNA. The GC level and repeat density of longer database entries were calculated and tallied for 50-kb non-overlapping windows. The opposite distribution patterns of LINE1 and SINEs (Alu, MIR, B1 and tRNA-derived SINEs in mouse) are discussed in the text. Note that after elimination of the Alu and L1 elements, the loci retain the same GC level (the differences in GC level mostly reflect distinct neutral substitution biases). The total interspersed repeat density is between 40 and 44% in all but the highest GC level isochores. In the latter all repeat classes are under-represented, presumably because of high gene density and the total repeat density would only be 26%, if it were not for a large contribution of repeat rich, GC-rich DNA from chromosome 19. The increased density of LTR elements on X is mostly due to a larger ratio of complete elements over solitary LTRs (a ratio even higher on Y).

Selfish or symbiotic elements?
Selfish elements that are long-term denizens of a genome should have been under selection for maximal propagation with minimal harm to the host. Thus, one expects their transpositional activity to be limited to the germline, and any target-site preference to be for relatively inert regions of the genome.

DNA and having high transcription levels in somatic cells under stress. Schmid [12••] put forth a compelling theory that can explain both observations. One argument is that the preference for GC-rich DNA may be linked to the striking hypomethylation of Alus in the male germline (suggesting a function in imprinting or sperm chromatin