
Master’s thesis

Tracking the evolution of mammalian wide interspersed repeats across the mammalian tree of life.

Author: Jakob Friedrich Strauß
First marker: Prof. Dr. Joachim Kurtz
Second marker: Prof. Dr. Wojciech Makałowski

Institute of Bioinformatics, WWU Münster

November 2, 2010

Abstract

This work focuses on the detection of mammalian wide interspersed repeats (MIRs) within the mammalian tree of life. MIRs belong to the class of SINEs (short interspersed repeats). The amplification of the ancient MIR is estimated to have taken place about 130 million years ago, just before the mammalian radiation, and the element became inactive soon afterwards. The main idea of this project is to cross detect MIR elements in orthologous loci between fully sequenced mammalian genomes. While a MIR sequence in an extant species may still have a strong resemblance to the ancient MIR element, orthologous sequences in other mammalian species may have diverged beyond recognition, e.g. in mammals with a high substitution rate such as rodents. But even those strongly diverged MIR elements may be detectable when the orthologous locus of a closely related species contains a less diverged element. We analyze in which gene families MIR elements have spread and distinguish between MIR occurrence within UTRs, introns and exons. Thus we hope to illuminate the contribution of MIR elements to mammalian evolution.

Contents

1 Background
1.1 Transposable Elements
1.2 Genomic impact of transposable elements
1.2.1 Long interspersed nuclear elements (LINEs)
1.2.2 Short interspersed nuclear elements (SINEs)
1.3 Mammalian wide interspersed repeats (MIRs)
1.3.1 MIR elements and the tree of life
1.4 Detection of transposable elements
1.5 Goals of this study

JF Strauß IOB, WWU Münster

2 Material and Methods
2.1 Cross species MIR identification in genome / genome alignments
2.2 MIR associated annotation features
2.3 MIR sequence site heterogeneity
2.4 Genomes
2.5 Software
2.6 Databases

3 Results and Discussion
3.1 Cross species MIR identification in genome / genome alignments
3.2 Species specific RepeatMasker library
3.2.1 MIR seed alignments
3.2.2 MIR site heterogeneity
3.3 MIR associated annotational features
3.4 MIR elements in the lizard and the bird genomes
3.5 Building profiles for MIR sequences

4 Conclusion

5 Outlook

References

6 Supplementary material

List of Figures

1 Multicolored corn cobs
2 Classification and hypothetical phylogeny of TEs
3 Schema of a long interspersed nuclear element
4 Schema of a short interspersed nuclear element
5 Schema of the MIR element
6 MIR alignment with Gln-tRNA and L2a tail
7 Phylogeny of Sauria SINEs, mammalian MIR and bird MIRs
8 TE detection tools
9 Concept of genome genome comparison
10 Approach of MIR reannotation
11 Phylogeny of analysed species
12 RepeatMasker: Example MIR alignment output
13 Phylogeny: MIR and MIR like sequences in RepBase
14 Reannotation of MIRs
15 Length histogram MIRs in human and mouse
16 MIR heterogeneity plot: baboon
17 MIR heterogeneity plot: human
18 MIR consensus plot: baboon
19 MIR consensus plot: human
20 Sequence site plot difference
21 Sequence site heterogeneity plots A
22 Sequence site heterogeneity plots B

List of Tables

1 List of genome versions
2 RepBase Update: Mammals
3 RepBase Update: Total
4 RepeatMasker annotation of MIR elements in the chosen mammals
5 BlastZ alignment blocks in the chosen mammals
6 New RM annotation of unannotated genomes with the small MIR library
7 MIR position overlap with selected genomic features


1 Background

1.1 Transposable Elements

Transposable elements (TEs) are stretches of genomic DNA that are able to transpose. Transposition here means a change in location and can, but does not necessarily need to, include an increase in copy number. TEs were first discovered and described by Barbara McClintock. In 1944 she started researching a position on maize chromosome 9 that she called dissociator, as the chromosome was prone to break at this position. In 1948 she found that the dissociator could actually change its position and induce stable mutations, knocking out pigmentation and resulting in multicolored maize cobs. This led to the first publication about transposons in 1950: The origin and behavior of mutable loci in maize [1]. For a long time the paradigm for TEs was that they are selfish and parasitic, littering the genome with copies of themselves [2,3]. This paradigm changed over time as more and more functions and roles could be assigned to TEs, such as large scale mutations, gene regulation, exon creation and exon shuffling (see section 1.2). The mechanism by which TEs transpose is used to classify them. Their features on the sequence level can be used to further classify them into subclasses. As of now three main classes of TEs are described: Class I, Class II and Class III elements. Class III elements are not a real class and are most of the time referred to as unclassified TEs.

Figure 1: This picture shows multicolored corn cobs. The multicoloring is the result of transposable elements inserting into pigmentation genes. (Source: Wikimedia:Asbestos/GFDL)

Class II elements are called DNA transposons. DNA transposons transpose by a cut and paste mechanism: they are cut out of their position and reintroduced at another position. Typical for DNA transposons are the inverted terminal repeats (ITRs), short repeats that flank the transposon. The free intermediate of a DNA transposon is DNA, hence the name. Increase in copy number happens slowly, as only certain events, like segmental duplication, duplication of the free DNA state and cellular repair mechanisms, can cause it. Usually DNA transposons encode proteins such as the transposase, which is used in the process of transposition for cutting out and reinserting the TE.

Class I elements are called retrotransposons. In general the insertion of a TE can be unspecific and random, within limitations of course, such as not being lethal, but as for LTR retrotransposons it can also show a distinct target site preference [4]. Retrotransposons are the most versatile class of TEs. All retrotransposons transpose by a copy and paste mechanism: they are transcribed to RNA and reverse transcribed into DNA. With this mechanism, each transposition results in the gain of a new copy. Retrotransposons can be divided into long terminal repeat (LTR) retrotransposons and non LTR retrotransposons. LTR retrotransposons usually encode many proteins such as gag, the reverse transcriptase (RT), RNase H (RH) and integrase. The LTRs of these transposons are direct repeats, as opposed to the inverted repeats of DNA transposons, and emerge as an artifact of the reverse transcription.

The subclass of non LTR retrotransposons consists of long interspersed nuclear elements (LINEs) and short interspersed nuclear elements (SINEs). While LINEs are autonomous retrotransposons, meaning that they encode the proteins needed for their transposition, SINEs are LINE dependent in that they recruit the LINE transposition machinery. Figure 2 shows how class I TEs, class II TEs and MITEs could have evolved. This figure suggests a phylogeny that could explain how retroviruses have evolved from LTR retrotransposons.

Class III, unclassified, transposons are called MITEs. Their mechanism of transposition is currently unknown. In figure 2 they are shown as descendants of DNA transposons and do not have their own class yet.
In the human genome about 45% of the genome sequence is reported to be composed of TEs, of which two thirds (30% of the genome in total) are LINEs and SINEs [5]. As a lot of very old TE sequences have probably diverged beyond recognition, it is likely that much more of the human genome is derived from TEs.
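The difference between the cut and paste mechanism of DNA transposons and the copy and paste mechanism of retrotransposons can be illustrated with a toy model. The following Python sketch is only an illustration under simplifying assumptions: the genome is a list of labels, the function names and the example labels are hypothetical, and no biological detail (target site choice, excision footprints) is modelled.

```python
def cut_and_paste(genome, source, target):
    """Move the element at index `source` to index `target`.

    `target` refers to the genome after the element has been excised.
    The copy number of the element stays the same.
    """
    genome = list(genome)          # work on a copy, leave the input untouched
    element = genome.pop(source)   # excision
    genome.insert(target, element) # reinsertion
    return genome


def copy_and_paste(genome, source, target):
    """Insert a new copy of the element at index `source` at index `target`.

    The original element stays in place, so the copy number grows by one.
    """
    genome = list(genome)
    genome.insert(target, genome[source])
    return genome


genome = ["geneA", "TE", "geneB", "geneC"]
moved = cut_and_paste(genome, 1, 3)
copied = copy_and_paste(genome, 1, 3)
print(moved.count("TE"), copied.count("TE"))  # prints: 1 2
```

In this toy model only the copy and paste variant changes the number of TE copies, which mirrors why retrotransposons can amplify quickly while DNA transposons increase in copy number slowly.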

[Figure 2 image: reproduction of Box 2, "Structure and nomenclature of transposable elements", from Nature 443, 5 October 2006, News & Views Feature.]

Figure 2: This figure shows a possible phylogeny of transposable elements and retroviruses. This figure was taken from [6]. © 2006 Nature Publishing Group

1.2 Genomic impact of transposable elements.

Genomic evolution cannot be explained solely by single nucleotide mutations and adaptation facilitated by positive and negative selection. Transposable elements may have contributed complex mutations that change, for example, gene structure or gene expression [7]. They are most probably the biggest driving force for genome expansion. About 45% of the human genome, for example, consists of transposable elements. It has been proposed that they should not be treated as junk, but rather as a scrapyard: repetitive DNA can influence surrounding genes, can be responsible for recombinational hotspots or can become part of coding sequence [8]. Some transposable elements, like MIR and L2 derived sequences, are rather conserved in intergenic regions, suggesting strong selective constraint [9]. Even the evolution and emergence of chromosomes themselves may have been impacted by TEs. SINEs such as CORE-SINEs, shown for platypus in the cited study, may have had a huge impact on chromosome evolution [10]. While a SINE like the Alu element can be argued to be too young for extensive exonisation, MIR elements, which are considerably older, can be found in protein sequences [11]. Disease associated cryptic exons, that is exonized sequences, are Alu and MIR enriched [12].

RNAs, for example from repetitive DNA, can be exapted: retroposition with acquisition of new function can happen [13]. RNA in general has a chance of being reinserted into the genome. Transposable elements in coding sequence are a rather common phenomenon [14]. Exons that contain SINEs of the Alu class are known to be alternatively spliced [15]. TEs can not only contribute to genome space or new coding sequence, they can also influence expression levels [16]. Transcription of TEs plays a key role in the transcription of the mammalian genome [17]. Association of TEs with gene sequences, for example prevalence in UTRs, heavily influences gene expression levels. The impact of TEs on the regulation of mammalian genes has been the subject of reviews [18]. Key functions of mobile DNA are exon shuffling, changes in cis-regulatory sites, horizontal transfer, cell fusions and whole genome duplications (WGDs) [19]. It has been hypothesized that all TEs share a common evolutionary history and that viruses like retroviruses have evolved from LTR retrotransposons [6].

1.2.1 Long interspersed nuclear elements (LINEs)

LINE elements are autonomous retrotransposons. They encode the proteins necessary for retrotransposition. Most LINE elements encode an endonuclease (en) and a reverse transcriptase (rt). They have an internal polymerase II promoter which is sufficient to trigger transcription. Once transcription is in progress the rt can recognize a nick in the tail region of the transcript, add a poly A tail and/or simple repeats and simultaneously start reverse transcription from RNA to DNA. The rt also takes part in reintegration of the free DNA LINE transcript [20, 21]. Figure 3 shows a schematic LINE element.

[Figure 3 schematic, 5' to 3': direct repeat, RNA Pol II promoter, Orf1 (Orf1 protein), Orf2 (endonuclease and reverse transcriptase), tail, direct repeat.]

Figure 3: This figure shows the schema of a long interspersed nuclear element (LINE). The LINE element is a transposable element with an internal polymerase II promoter. It usually has two open reading frames (orfs). While the function of the orf1 protein that is encoded by the first orf is unknown, the second orf encodes an endonuclease and the reverse transcriptase (rt) needed for retrotransposition. The tail region forms a nick that is recognized by the rt, which reverse transcribes the transcript and facilitates integration at the same time. The element is flanked by two direct repeats that are an artefact of the reverse transcription by the rt.

1.2.2 Short interspersed nuclear elements (SINEs)

SINE elements are very short retrotransposons, ranging in size from 100 to 600 nucleotides. The most common SINEs, t-RNA derived SINEs, are thought to have emerged by the integration of a reverse transcribed t-RNA into a LINE element, next to its tail [22]. While the LINE tail works as a recognition site for the LINE transposition machinery, the reverse transcribed t-RNA provides the newly emerged retrotransposon with an internal polymerase III (pol III) promoter. This is why t-RNA derived SINE elements are called LINE dependent retrotransposons: they do not encode the proteins that are needed for their own transposition and have to rely on the corresponding LINE element and its proteins being active. If the LINE element that a SINE element depends on stops being active, the SINE element loses its activity as well. This dependence of SINEs on LINEs implies that each known SINE element should have a LINE element with which it shares its tail region; this could be shown for a lot of SINEs [23, 24]. Figure 4 shows a schematic t-RNA derived SINE element. Alu elements (primates) or B1 elements (rodents) are thought to be derived from a 7SL RNA [25]. While Alu elements are probably LINE1 dependent, MIR elements were probably LINE2 dependent. The MIR element shares its tail (50 bp at the 3’ end) with an approximately 3 kb long LINE2 element named MIR2 [26].

[Figure 4 schematic, 5' to 3': direct repeat, tRNA-related region with RNA Pol III promoter (A box, B box), tRNA-unrelated region, LINE tail, direct repeat.]

Figure 4: This figure shows the schema of a tRNA derived short interspersed nuclear element (SINE). The SINE element is flanked by direct repeats and has an internal polymerase III promoter within the 5’ situated t-RNA related region. The A and the B box are structurally the two nonbinding leafs of the t-RNA secondary structure. After the t-RNA unrelated region a LINE tail, which is needed for recruiting the LINE retrotransposition machinery, follows.

Currently SINEs can be divided into four SINE superfamilies: Ceph-SINEs, CORE-SINEs, V-SINEs and DeuSINEs. The most recent, Ceph-SINEs, are only present in molluscs [27]. CORE-SINEs, which all share a common CORE sequence, can be hypothesized to have a common proto CORE-SINE origin [28, 29, 10].

1.3 Mammalian wide interspersed repeats (MIRs)

The MIR element was given its name in two publications in 1995. One publication was by Labuda et al. [30], the other by Smit and Riggs [26]. Smit and Riggs describe the MIR element as a t-RNA derived SINE element that has a high abundance within mammals. The MIR sequence is 260 bp long and consists of a t-RNA related sequence, a conserved core region in between and a tail region that aligns well with the tail of the B2 element from mouse. While Labuda et al. write that there are about 120000 copies of the MIR element in the human genome, Smit and Riggs estimated the MIR copy number to be around 300000. [Missing: Murnane 1995 [31] An example for alternative splicing within a MIR element? The origin of interspersed elements - follow up paper by Smit in 1996 [22].]

Labuda et al. propose a possible path for the evolution of the MIR element. They hypothesize that, as the CORE region is shared between many SINE families that have different LINE tails and range over the whole tree of life, the precursor of the MIR element could have been a short retrotransposon (CORE-SINE) which later recruited, in several independent events, different LINE tails to make use of their amplification mechanism [28]. Since this publication some researchers refer to all SINEs that share a t-RNA unrelated sequence between the t-RNA related sequence and the LINE tail as CORE-SINEs, the MIR element being one family of these. In the marsupials some CORE-SINEs might still be active [32].


Figure 5 shows a schematic MIR element. Figure 6 shows the RepBase Update MIR consensus sequence aligned to a t-RNA and an L2a tail. The MIR SINE shares its 50 bp long 3’ tail with a LINE2 element called MIR2 [26]. MIR elements are not evenly distributed in the human genome, as was shown for the human isochores [33], which implies some kind of selection pressure. Very early history of MIR discovery: [34, 35]. MIRs are present in human coding regions (also a Korotkov paper) [36]. MIRs as agents of mammalian evolution [37]. A study by Kondrashov et al. on conserved fragments of TEs in intergenic regions showed, by a comparative genomics approach, that MIR and L2 elements are common and conserved in orthologous intergenic regions of mouse and human [9]. While a comparative genomics approach shows that a lot of MIR elements, so called transient MIRs, are present in both human and mouse at the corresponding orthologous positions, we can see in table 4 that in the RepeatMasker annotation of human and mouse the human MIR count exceeds the mouse count by a factor of 5.

[Figure 5 schematic, 5' to 3': tRNA-related region with RNA Pol III promoter (A box, B box), Core-Domain (?), L2a LINE tail.]

Figure 5: Schema of a mammalian wide interspersed repeat (MIR). The MIR element starts 5’ with an approximately 50 bp t-RNA related region containing an internal polymerase III promoter and the A and B box of the t-RNA. At the 3’ end a LINE2 tail is situated; in between these two regions there is a conserved sequence called the CORE region.

[Figure 6 alignment; sequences as shown in the figure, numbers give start and end positions]
tRNA(gln1)_hs    1 TAGGATGTGGTGTAATGGGTGGCACGGAGAATTTTGGATTCTCAGGGATGAGTTCAAATCTCA 63
MIR_hs           1 ACAGTATAGCATAGTGGTTAAGAGCACGGACTCTGGAGCCAGACTGCCTGGGTTCGAATCCCGGCTCTGCCACTTAC 77
MIR_hs          78 TAGCTGTGTGACCTTGGGCAAGTTACTTAACCTCTCTGTGCCTCAGTTTCCTCATCTGTAAAATGGGGATAATAATAG 155
MIR_hs         156 TACCTACCTCATAGGGTTGTTGTGAGGATTAAATGAGTTAATACATGTAAAGCGCTTAGAACAGTGCCTGGCACATAG 233
L2A              1 AGCGCCTAGMACAGTGCCTGGCACATAG 28
MIR_hs         234 TAAGCGCTCAATAAATGTTGGTTATTA 260
L2A             29 TAGGCGCTCAATAAATATTTGTTGAATGAATGAAT 63

Figure 6: This figure shows the MIR consensus sequence, taken from RepBase Update, aligned to a Gln-tRNA and an L2a tail, also taken from RepBase Update. The t-RNA alignment is shown in blue with the A and B box colored in red, and the L2a tail alignment is shown in green.

In the repeat database RepBase Update, 12 sequences can be found that constitute either a subgroup of MIRs or are MIR like elements: MAR, MIR, MIR3, MIR3_MarsA, MIR3_MarsB, MIRb, MIR_Aves1, MIR_Aves2, MON1, THER1, THER2, THER2MD. RepeatMasker uses a modified version of the RepBase repeat library. Elements can be shortened or split into different instances of the same element for better coverage and better computation. In the case of the MIR element one additional sequence, MIRc, was introduced for better coverage in the RepBase library modified for RepeatMasker.

1.3.1 MIR elements and the tree of life

The MIR element is thought to have emerged and spread prior to the eutherian radiation and to have become inactive about 130 million years ago [30, 26]. The point of emergence is somewhat controversial. The key question is whether the MIR element is only present within mammals, or whether it is predominantly present within mammals but also present in a broader range of species. So can MIRs be found in all amniotes, or only in the Synapsida and not within the Sauropsida? RepBase Update already lists two MIR or MIR like elements that are only present in birds, MIR_Aves1 and MIR_Aves2. While in the mammalian genomes we can find up to nearly 600k MIR elements, only between 2 and 3k MIR_Aves(1,2) elements can be found within the two sequenced bird genomes [38]. The question remains unanswered whether the MIR_Aves(1,2) elements and the mammalian MIR have a common ancestor or have evolved independently. The avian genome is about one third smaller than the average mammalian genome [39]. This difference in genome size is mostly related to a loss of interspersed elements, which would explain the low copy number if bird and mammalian MIRs were related. Organ et al. explain the loss with a selection pressure on cell size in birds [40]. In 2000 Korotkov et al. published that the origin of the MIR element lies with the vertebrates. They showed that MIRs are present in reptiles, birds and mammals [41]. In 2006 a new family of interspersed repeats was described and named Sauria SINEs: VIN-SINEs, AFE-SINEs, POM-SINEs and ACA-SINEs [42, 43]. In 2007 Shedlock et al. reported, by running RepeatMasker on stretches of DNA several kb long from some Sauropsida, that MIRs are present in reptiles [44], but they may have had some doubts, as they refer to the RepeatMasker findings as MIR like elements.
In 2009 the Sauria SINEs that were described in 2006 were added to RepBase Update, and we can confirm that with an up to date RepeatMasker run all elements that were previously annotated as MIRs are now annotated as Sauria SINEs in the draft of the Anolis carolinensis genome. Only the MIR findings of Korotkov et al. cannot be ruled out. They ran their own algorithm for finding MIRs over GenBank and reported two MIR findings in Anolis sequences. These sequences are from a different Anolis species than the sequenced genome, and RepeatMasker annotates the MIR in one of these two sequences as well. So while it is highly unlikely that MIRs are present in the Sauropsida, the problem remains unsolved until the question whether bird MIRs are real MIRs is resolved. Figure 7 shows the relationship of the bird MIRs, the mammalian MIR and the Sauria SINEs.


[Figure 7 tree; tip labels: MIR_Aves1, MIR, MIR_Aves2, VINSINE, ACASINE, POMSINE, AFESINE; scale bar: 0.02.]

Figure 7: This figure shows the phylogenetic relationship between the MIR element, the two bird MIRs, MIR_Aves1 and MIR_Aves2, and the four Sauria-SINEs. Mammalian and bird MIRs are shown in red and the reptilian Sauria-SINEs in green. RepBase Update consensus sequences were used to infer an ML phylogeny with RAxML (1000 bootstrap replicates).

1.4 Detection of transposable elements.

Approaches for TE detection can in general be split into three main categories: library based approaches, signature based approaches and ab initio (or de novo) approaches.

Library based approaches compare the input data with a library of TE sequences. TEs are annotated based on sequence similarity to the library entries. The drawback of a library based approach is that it can only be as good as the library it depends on. RepeatMasker is the most prominent example of such an approach. Repbase Update is the underlying library that stores repeat definitions, usually representative sequences. Repbase itself is too big to be fully used in a RepeatMasker run, which is why the species has to be defined when giving input to RepeatMasker. Once RepeatMasker knows the species, it creates a subset of putative repeats that could occur within the chosen species. Every Repbase entry has taxonomic information; the MIR element, for example, is defined as a mammalian repeat and would thus be excluded in every species that is not a mammal. RepeatMasker then compares the repeat definitions from the library with the input data and annotates TEs. Library based approaches can be seen in the red cloud in figure 8.

In a signature based approach TEs are detected using distinct features of TEs, such as structure or composition. Such structures can be direct or inverted repeats, or protein domains of the encoded proteins. A signature based approach like find_ltr [45], for example, searches for LTR retrotransposons. An LTR retrotransposon is flanked by two direct repeat LTR sequences, between which several viral proteins can be found. So find_ltr searches for tandem repeats, compares every repeat with its up and downstream neighbor for significant homology, and when those are less than approximately 20 kb apart it checks for viral domains in between. LTR retrotransposons are thus annotated by their structure. In this example such an approach is able to identify all kinds of LTR retrotransposons, even unknown LTR retrotransposon families. Signature based approaches are good for identifying TEs that share easy to detect structural features. New classes of TEs cannot be found. Signature based approaches are presented in the blue cloud in figure 8.

In an ab initio or de novo approach TEs are detected by de novo identification of repetitive strings within the given sequences or by using genome genome comparisons. ReAS [46], for example, identifies repeats, defined as strings that occur repetitively within a genome, and produces a library of representative consensus sequences. These are repetitive sequences in general and need not necessarily be transposable elements. In theory such an approach would allow for identifying all existing repeats, but in reality repeats have to occur in a high copy number to be recognized. So new repeats and TEs can be identified, but not if they only occur in low copy number. Ab initio approaches can be seen in the green cloud in figure 8.

Several pipelines of TE detection tools have also been built; these can be found in the orange cloud in figure 8. There are several reviews regarding the topic of TE detection [47, 48, 49, 50]. The most recent review, by Lerat in 2009 [50], concludes that there is no single program that is sufficient to detect all possible repeats and that the best strategy is to combine approaches for a good repeat annotation.
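The signature based idea described above for LTR retrotransposons, namely looking for two direct repeats that lie within a limited distance of each other, can be sketched in a few lines of Python. This is a minimal illustration and not the algorithm implemented in find_ltr: it only finds exact direct repeats by k-mer seeding and extension, it omits the check for viral protein domains, and the function name, the parameters `k` and `max_dist` and the example sequence are hypothetical.

```python
from collections import defaultdict


def candidate_ltr_pairs(seq, k=8, max_dist=20000):
    """Return sorted (left_start, right_start, length) of exact direct repeats.

    Two copies of a k-mer seed that start at most `max_dist` apart are
    extended to the left and right while both copies stay identical and
    do not overlap.
    """
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)

    found = set()
    for starts in positions.values():
        # compare every seed occurrence with its downstream neighbor
        for left, right in zip(starts, starts[1:]):
            if not (k <= right - left <= max_dist):
                continue
            # extend the seed to the left
            while left > 0 and seq[left - 1] == seq[right - 1]:
                left -= 1
                right -= 1
            # extend the seed to the right without letting the copies overlap
            length = k
            while (right + length < len(seq)
                   and left + length < right
                   and seq[left + length] == seq[right + length]):
                length += 1
            found.add((left, right, length))
    return sorted(found)


# Toy sequence: flank + repeat + internal region + repeat + flank
toy = "AAC" + "TGTTAGCCAT" + "CGCGTATACG" + "TGTTAGCCAT" + "GGA"
print(candidate_ltr_pairs(toy, k=6, max_dist=100))  # prints: [(3, 23, 10)]
```

A real tool would additionally tolerate mismatches between the two repeats and inspect the region between them, for example for reverse transcriptase or integrase domains, before calling an LTR retrotransposon.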

[Figure 8: diagram grouping TE detection tools into four clouds (library based, signature based, ab initio, pipelines). Labels shown: MUST-MITE, MaskerAid, LTR_STRUC, HelitronFinder, Greedier, PLOTREP, RetroTector, MAK, SINEDR, LTR_FINDER, FINDMITE, RepeatMasker, Censor, LTRharvest, TSDfinder, find_ltr, Repeat Pattern Toolkit, LTR_par, RTAnalyzer, TRANSPO, Clustering, RepeatFinder, REPEATGLUER, RECON, PILER, DAWG-PAWS, Spectral repeat finder, BLASTER suite, RepeatRunner, REPuter, RepeatModeler, FORRepeats, P-Clouds, TCF, Repeat-match, ReAS, RetroPred, RepeatScout, RAP, REPET, TEnest, Vmatch, mer-engine, RepSeek, ReRep, REannotate, Tallymer, TARGeT.]

Figure 8: This diagram was created based on [50]. The red cloud shows library based TE detection approaches. The blue cloud shows signature based approaches. The green cloud shows ab initio approaches and the orange one pipelines of approaches.

In most studies that deal with MIR elements RepeatMasker was used to annotate the MIR element.

Detection of the MIR element with HMMs Hidden Markov models (HMMs) [51, 52] are statistical models that are nowadays commonly used in sequence detection. They are also widely used in speech recognition and weather forecasting. The probability of an event defined by a chain of states can be calculated given a statistical model, which in sequence detection is usually a chain of vectors of transition probabilities (for jumping between the vectors). In sequence detection such HMMs are usually restricted to one direction. A recent publication [53] showed that, while commonly used hidden Markov models are not able to detect highly diverged DNA sequences of TEs, it is still possible to utilize HMMs for this task. To show this, the authors implemented a conditioned version of the Baum-Welch algorithm that allows for the construction of DNA HMMs with low consensus. They even showed that this could in general work for the MIR element. They used the MIR element because it is the TE with the highest divergence and hence the lowest possible consensus for a seed alignment used for HMM building.
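To make the probability calculation mentioned above more concrete, the following is a minimal sketch of the forward algorithm for a generic HMM. The two-state model (a hypothetical AT-rich "repeat" state and a uniform "background" state) and all probabilities are invented for illustration; they are not the models from [53] or from this thesis.

```python
def forward_probability(observed, states, start_p, trans_p, emit_p):
    """Return P(observed | model), summed over all possible state paths."""
    # forward[s] = probability of the prefix read so far, ending in state s
    forward = {s: start_p[s] * emit_p[s][observed[0]] for s in states}
    for symbol in observed[1:]:
        forward = {
            s: emit_p[s][symbol]
            * sum(forward[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(forward.values())


# Hypothetical toy model, NOT a trained MIR model.
STATES = ("repeat", "background")
START = {"repeat": 0.5, "background": 0.5}
TRANS = {
    "repeat": {"repeat": 0.9, "background": 0.1},
    "background": {"repeat": 0.1, "background": 0.9},
}
EMIT = {
    "repeat": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}

if __name__ == "__main__":
    print(forward_probability("ATTA", STATES, START, TRANS, EMIT))
```

Profile HMMs as used for sequence detection have many more states (one set per alignment column), but the summation over paths follows the same pattern.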

Detection by other methods Korotkov (2000): Korotkov et al. implemented a Fortran 77 (f77) algorithm that specifically searches for MIR elements. They scanned GenBank with this tool and found MIR sequences in some vertebrate taxa besides mammals. Approach: They started with a set of 49 MIR sequences. They aligned these sequences with the known MIR consensus and filled in the gaps for all the sequences, so that they are all about 260 bp long. This alignment was used to weight each position of the MIR sequence. This weighting model (the 'weight function method') was used to test sequences for being MIRs. With the newly identified MIRs they built a new weight function on the basis of 2 × 10^4 MIR sequences. (The statistical significance was tested between the observed sequence and a Monte Carlo derived random DNA level.) Rudenko also states that MIR elements are present in the fish genomes; he further states that some of the fish repeat families originated from the MIR repeat. There were two follow-up publications in 2001. In the first one they state that they analyzed the human genome and found 254 MIR sequences in coding regions [54], which is in the same order of magnitude as the 107 loci that were found in human by Krull et al. in 2007 [11]. In the second one they analyzed human and stated that their method identifies the most MIR elements compared to previous methods [55]. [Both papers are in Russian and I’m still waiting for their results and the f77 program.]

1.5 Goals of this study

• The main goal of this study is to provide an improved annotation for MIR elements within several mammalian species.

– The first way to improve MIR annotation is to cross compare orthologous loci between the well annotated human genome and other mammalian genomes. In the human genome the standard for annotating MIR elements is RepeatMasker. For the standard MIR elements (MIR / MIRb / MIR3) the consensus sequences defining these repeats are derived from the human genome. By comparing the orthologous loci that correspond to the MIR positions in human, we try to detect far diverged MIR sequences in other mammalian genomes with a sensitive DNA comparison. This way we can annotate new MIR sequences that were previously not detected by RepeatMasker because the MIR sequences from the Repbase Update library are more adapted to human than to any other mammal.

– The second way is to take the annotated MIR sequences of a genome and to build new MIR consensus sequences that can then serve as new RepeatMasker MIR libraries and are not species specific but species adapted, thus annotating additional MIR sequences within the corresponding genome.

• The secondary goal is to look at the MIR element distribution within the mammalian genomes and their association with and/or contribution to functional genomic locations. We also want to look at the MIR sequence itself and its conservation.

– Functional genomic locations can in general mean any kind of genomic annotation. We want to look at how much MIR sequences contributed to genes and their parts: untranslated regions (UTRs), exons and introns.

2 Material and Methods

2.1 Cross species MIR identification in genome / genome alignments

The original MIR consensus, called MIR in Repbase, was constructed by calculating a consensus sequence on the basis of a MIR alignment. This way a sequence was generated that comes as close to the original MIR sequence as the information available from the chosen sequences allows. What can be identified as MIR is heavily mutated and diverged, and the probability is high that some of the genomic space has MIR origin but has diverged beyond recognition. This consensus sequence and the other MIR sequences MIRb, MIRc and MIR3 have all been reconstructed based on MIR sequences extracted from the human genome. If we assume only neutral evolution of the MIR sequences, the ancient MIR is the reconstructed theoretical last common ancestor for all mammals, even though it is reconstructed based on human sequences only. If all known mammals and all known MIR sequences could be taken into account for reconstructing the ancient MIR, the reconstruction would be more accurate. Different organisms have different mutation rates and thus, on the level of the MIR element, higher or lower divergence. The mutation rate in rodents, for example, is known to be higher than the mutation rate in human. If we assume that all the MIR sequences have no functional role at all and have just been randomly mutating for the past 130 million years, it will, depending on the mutation rate, be difficult to detect these MIR sequences in a fast evolving species. What we can utilize here is that the actual time of the splits that led to the different mammalian species is much more recent. Human and mouse, for example, split approximately 60 million years ago. So while in mouse it might be very difficult to identify MIR sequences by using the reconstructed ancient MIR, it might be worthwhile to look into a comparative genomics approach. Orthologous regions between two species can be used to identify MIR sequences if one of the two species has a good annotation of MIRs.
While a MIR element in mouse might not be identifiable by simple sequence similarity using the ancient MIR, we might conclude from the orthologous position in human, using sequence similarity, whether the particular sequence in mouse is MIR derived. Figure 9 shows this general principle: if the distance a is short enough to identify the MIR but the distance b is too long, c can still be short enough for a correct identification. Distance here means time in evolutionary units. [In both figures 9 and 10 the lines of the arrows have to be twice as thick as they are now. Figure 10 lacks a certain kind of elegance and has to be redone.]
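As a rough numerical illustration of this principle, the sketch below uses the Jukes-Cantor model to convert an evolutionary distance into an expected fraction of identical sites. The substitution rates are invented placeholders and not estimates from this study, and each lineage is assumed to keep one constant rate over the whole time span; only the 130 and 60 million year figures are taken from the text above.

```python
import math


def expected_identity(subs_per_site):
    """Expected fraction of identical sites under Jukes-Cantor."""
    return 0.25 + 0.75 * math.exp(-4.0 * subs_per_site / 3.0)


# Hypothetical neutral rates in substitutions per site per million years.
RATE_HUMAN = 0.001
RATE_MOUSE = 0.003
AGE_MIR = 130.0    # million years since MIR amplification (from the text)
AGE_SPLIT = 60.0   # million years since the human/mouse split (from the text)


def example_identities():
    """Expected identities along the distances a, b and c of Figure 9."""
    a = RATE_HUMAN * AGE_MIR                     # ancient MIR -> human copy
    b = RATE_MOUSE * AGE_MIR                     # ancient MIR -> mouse copy
    c = (RATE_HUMAN + RATE_MOUSE) * AGE_SPLIT    # human copy <-> mouse copy
    return {name: expected_identity(d) for name, d in (("a", a), ("b", b), ("c", c))}


if __name__ == "__main__":
    for name, identity in example_identities().items():
        print(name, round(identity, 3))
```

With these placeholder rates the human and mouse copies are expected to be more similar to each other (c) than the mouse copy is to the ancient MIR (b), which is the situation the cross species approach tries to exploit.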

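[Figure 9 diagram: the ancient MIR is connected to Species A by distance a and to Species B by distance b; Species A and Species B are connected by distance c.]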

Figure 9: This figure shows how the general concept of the cross species MIR identification approach works: if the split between two species (species A, B) is recent, we can look into the orthologous position, and a MIR element can be recognized even if the distance to the reconstructed ancient MIR element is too large for detection.

Identifying MIRs by comparing two species will from now on be called cross species identification. Figure 10 shows a flow chart of this approach. In this project the cross species identification is realized as follows: The most useful resource for doing this are genome-genome alignments. These are available from the UCSC genome browser database. These genome-genome alignments work similarly to a directed assembly. One genome is used to map the other genome onto it by building a chain of genomic fragments that aligns one genome to the other. We can treat these aligned fragments as orthologous regions. All genome-genome alignments used for this project have been made from the perspective of the human genome. The dataset used consists of 10 mammalian genomes mapped onto the human genome. We then took the RepeatMasker repeat annotation of the human genome build hg19, obtained from the RepeatMasker premasked genomes database. The first step is to identify those alignments of orthologous regions that contain MIR annotation. The second step is to extract the orthologous sequences. As the genome-genome alignments do not have the best overall alignment quality, these sub-alignments are then realigned with T-Coffee (see section 2.5), and a corrected similarity is calculated based on the Kimura 2-parameter model [56]. The human MIR sequences and the mapped putative MIR sequences are then stored in an SQL database, together with position and similarity information. This database can easily be used for further analysis. All steps of this approach have been automated with Perl, some with the open source module collection BioPerl. The MIR positions are used for further analysis (see section 2.2) to look into the general genomic impact of the MIR element. The MIR sequences are analyzed to see how conserved the MIR element actually is and whether there are parts, such as the CORE domain, that show some kind of conservation (see section 2.3).
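The Kimura 2-parameter correction mentioned above can be sketched as follows. This is an illustrative Python version and not the Perl code used in the project; the function name and the decision to skip gapped or ambiguous columns are assumptions made for this example.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = transitions = transversions = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # assumption: gaps and ambiguous bases are ignored
        compared += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    p = transitions / compared      # proportion of transitions
    q = transversions / compared    # proportion of transversions
    term1 = 1.0 - 2.0 * p - q
    term2 = 1.0 - 2.0 * q
    if term1 <= 0 or term2 <= 0:
        raise ValueError("sequences too diverged for the K2P correction")
    # d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q)
    return -0.5 * math.log(term1 * math.sqrt(term2))
```

The function raises an error when the observed differences exceed what the model can correct, which is a realistic case for strongly diverged MIR copies and would have to be handled by the calling code.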

[Figure 10 flow chart: Human genome → RepeatMasker → SQL table: MIR annotation; 10 mammalian genomes → UCSC pairwise alignments → SQL table: synteny blocks; both tables → SQL database → filter → subalignments → realignments (T-Coffee).]

Figure 10: General approach of MIR reannotation: MIR elements annotated in human are checked for presence at orthologous sites of other mammals; these positions are found via genome-genome alignments. An SQL database is built from the reannotated MIRs and the MIRs previously annotated with RepeatMasker, from which species specific consensus sequences are calculated for use as a species specific RepeatMasker library.

2.2 MIR associated annotation features

The MIR position track generated from the RepeatMasker run was combined with the MIR position track generated from the genome-genome comparison and compared against position tracks of other annotation features to determine the overlap and thus the MIR contribution to functional sequences. Position data for exons and introns were taken from Ensembl BioMart, and the position tracks were screened for overlap.
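Screening two position tracks for overlap can be sketched as shown below. The tuple layout (chromosome, start, end) with half-open coordinates and the function name are assumptions made for this example; the project itself used Perl scripts working on SQL tables.

```python
from collections import defaultdict


def overlapping_pairs(mirs, features):
    """Yield (mir, feature) pairs that share at least one base.

    Both inputs are lists of (chrom, start, end) tuples with half-open
    coordinates: start is included, end is excluded (assumption).
    """
    by_chrom = defaultdict(list)
    for feat in features:
        by_chrom[feat[0]].append(feat)
    for chrom in by_chrom:
        by_chrom[chrom].sort(key=lambda f: f[1])

    for mir in sorted(mirs):
        chrom, m_start, m_end = mir
        for feat in by_chrom.get(chrom, []):
            if feat[1] >= m_end:
                break  # features are sorted by start; none further can overlap
            if feat[2] > m_start:
                yield mir, feat
```

This simple scan is quadratic in the worst case; for whole genome tracks an interval tree or a sorted sweep over both tracks would probably be preferable.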

2.3 MIR sequence site heterogeneity

After species specific profiles were built for MIR, MIRb, MIRc and MIR3 for each species, the seed alignments were used to visualize the MIR sequence site heterogeneity. The former seed alignments, which were also used to build the test HMMs, were converted into ungapped alignments using the RepBase Update consensus sequence as reference. This was visualized with GNU R as a stacked barplot together with a consensus graph (each bar represents the base frequencies at the specific position within the RepBase consensus sequence).
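The per-position base frequencies behind such a barplot can be computed as in the following sketch. It assumes an ungapped alignment given as equal-length strings; the function names are hypothetical and the plotting step (done in R in this project) is omitted.

```python
from collections import Counter

BASES = "ACGT"


def base_frequencies(alignment):
    """Return one dict of base frequencies per alignment column."""
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same length")
    profile = []
    for column in zip(*(seq.upper() for seq in alignment)):
        # Characters other than A, C, G, T are ignored (assumption).
        counts = Counter(base for base in column if base in BASES)
        total = sum(counts.values())
        profile.append({b: (counts[b] / total if total else 0.0) for b in BASES})
    return profile


def conservation(profile):
    """Frequency of the most common base per position (consensus graph)."""
    return [max(column.values()) for column in profile]
```

The list returned by `base_frequencies` corresponds to the stacked bars, and `conservation` corresponds to the connected consensus graph.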

2.4 Genomes

Table 1 shows the genomes that were used in this study. Not all of these genomes have yet reached the same level of quality. Some genomes like human, mouse or dog have a rather high assembly quality; others like gorilla, platypus or hedgehog are mere drafts. Genome version numbers can give a hint as to the status of the genome project. Figure 11 shows the phylogenetic relationship between the eleven mammals whose genomes were used. This phylogeny is literature based, as it was generated with the Interactive Tree of Life (iTOL) project [57], and reflects the scientific consensus. [Source file for the tree is on the office computer, will increase font and make the tree distanceless later.]

Table 1: List of genome versions.

Species                               Release name   Version
Baboon    Papio hamadryas             papHam1        1.0   Nov. 2008
Bushbaby  Otolemur garnettii          otoGar1        1.0   Dec. 2006
Cow       Bos taurus                  bosTau4        4.0   Oct. 2007
Dog       Canis lupus familiaris      canFam2        2.0   May 2005
Elephant  Loxodonta africana          loxAfr3        3.0   Jul. 2009
Gorilla   Gorilla gorilla gorilla     gorGor1        1.0   Oct. 2008
Hedgehog  Erinaceus europaeus         eriEur1        1.0   Jun. 2006
Human     Homo sapiens                hg19           19.0  Feb. 2009
Mouse     Mus musculus                mm9            9.0   Jul. 2007
Opossum   Monodelphis domestica       monDom5        5.0   Oct. 2006
Platypus  Ornithorhynchus anatinus    ornAna1        1.0   Mar. 2007

[Figure 11 tree. Tip labels: Monodelphis_domestica, Mus_musculus, Gorilla_gorilla, Homo_sapiens, Papio_hamadryas, Otolemur_garnettii, Canis_lupus_familiaris, Bos_taurus, Erinaceus_europaeus, Loxodonta_africana, Ornithorhynchus_anatinus. Clade labels: Mammalia, Theria, Eutheria, Euarchontoglires, Primates, Homininae, Laurasiatheria.]

Figure 11: The phylogenetic relationship of the species used in this study. This tree was generated with the Interactive Tree of Life (iTOL).

2.5 Software

Blast Blast is a large software package that includes tools mainly used for sequence detection [58]. In this study the StandAloneBlast module of the BioPerl collection was used to quickly assign the type of the MIR element to the SQL database entries.

MULTIZ MULTIZ is a program that is used to calculate whole genome alignments [59]. The underlying algorithm TBA (threaded blockset aligner) can be used to align huge sequences to a reference genome. These sequences can be on the order of megabases. It is commonly used to assign orthologous regions between genomes. For this project genome-genome alignments from UCSC were used. For all of these alignments human was the reference genome.

Hmmer Hmmer is a suite for generating and using profile hidden Markov models (pHMMs) [60]. Hidden Markov models consist of a chain of linked states that contain transition probabilities (see [52, 61] for further explanation). An observed sequence can be judged by its probability of having been generated by the tested hidden Markov model. Databases such as Pfam make use of hidden Markov models [62].

RaxML RaxML is a maximum likelihood based tool for phylogenetic inference. It is the fastest maximum likelihood implementation to date and comes with a large set of useful functions such as mapping and comparing trees [63].

RepeatMasker RepeatMasker is a tool used to mask repeats. It relies on a large database of representative repeat sequences. It detects interspersed repeats as well as low complexity DNA sequences in a given sequence and masks them, either by converting the repeat to lower case or by substituting the repeat with Ns [64]. RepeatMasker itself does not give a score for judging the relevance or probability of the hits. Figure 12 shows what such a hit looks like within the alignment file of the RepeatMasker output. Though RepeatMasker can be run with custom repeat definitions, it uses the Repbase database for default repeat masking.

439 32.45 3.82 7.09 MIR_SINE/MIR 1 262 (0) + MIR SINE/MIR 7 260 (2)

MIR_SINE/MIR    1 TAGTGTAATGGTTCACAGTGCATATTTGGAAGACAGGCT--TTGAATGTG 48
                     ii  i     v v  ii iv i iv i  v   i  --i  i--  v
MIR#SINE/MIR    7 TAGCATAGTGGTTAAGAGCACGGACTCTGGAGCCAGACTGCCTGG--GTT 54

MIR_SINE/MIR   49 CAAATTCCATCTCTGCTGCTTTCTGTCTGTGTGTGTGACCTTCACCAAGT 98
                   i   i  iv      ii   v  iv       ----     viv
MIR#SINE/MIR   55 CGAATCCCGGCTCTGCCACTTACTAGCTGTGTG----ACCTTGGGCAAGT 100

MIR_SINE/MIR   99 TATTT----TCT----ACCTTGGTTTCCTCATTTGAAAGATGAGAATAGT 140
                    i  ----   ----i   ii          i  v  i   i i   i
MIR#SINE/MIR  101 TACTTAACCTCTCTGTGCCTCAGTTTCCTCATCTGTAAAATGGGGATAAT 150

MIR_SINE/MIR  141 GGCACTAGATTATTGGCACTGAATAATATTGGATTACTTTTAGCATTGAT 190
                  --- v   --  ii--i  ivv   -----  i  ii v v  v   i v
MIR#SINE/MIR  151 ---AATAG--TACC--TACCTCATA-----GGGTTGTTGTGAGGATTAAA 188

MIR_SINE/MIR  191 GGAATGAATATATTGAAAATGTTTAGAACAGTGCCTAGAGCACAATACAT 240
                  v  i v    i  vv   ii i              i vi  i i  vii
MIR#SINE/MIR  189 TGAGTTAATACATGTAAAGCGCTTAGAACAGTGCCTGGCACATAGTAAGC 238

MIR_SINE/MIR  241 ACTAAATATGTGTAAGATATTA 262
                  i  v    vi   vi v
MIR#SINE/MIR  239 GCTCAATAAATGTTGGTTATTA 260

Matrix = 25p43g.matrix
Transitions / transversions = 1.72 (50 / 29)
Gap_init rate = 0.03 (9 / 261), avg. gap size = 3.11 (28 / 9)

## Total Sequences: 1
## Total Length: 262
## Total NonMask ( excluding >20bp runs of N/X bases ): 262
## Total NonSub ( excluding all non ACGT bases ):262
RepeatMasker version open-3.2.9 , sensitive mode
run with blastp version 3.0SE-AB [2009-10-30] [linux26-x64-I32LPF64 2009-10-30T17:06:09]
RepBase Update 20090604, RM database version 20090604

Figure 12: This figure shows what the alignment output of RepeatMasker looks like. 'i' stands for transition, 'v' for transversion and '-' for gap. MIR#SINE/MIR is the human reference sequence from RepBase Update and MIR_SINE/MIR is a sequence from mouse that was annotated as a MIR element.

T-Coffee T-Coffee is a tool for calculating multiple sequence alignments for DNA or amino acid sequences [65]. It can also be used to include additional information in the alignment, such as structural information or alignments from other alignment methods, but only the very basic function was used in this project. One of the strengths of T-Coffee is its accuracy; one of its drawbacks is that it is one of the slowest tools and cannot handle huge amounts of sequence. This drawback did not matter here, because MIR sequences are short (around 260 bp) and all alignments made were alignments of two sequences.

Mafft Mafft is a multiple sequence alignment tool that is considerably faster than T-Coffee [66]. In terms of speed it outperforms T-Coffee by two orders of magnitude.

2.6 Databases

Ensembl: Biomart BioMart is an interface to a vast collection of sequence data and annotation tracks [67]. It is comparatively intuitive and easy to use and provides a Perl API, which makes it useful for programming automated tasks.

Repbase Update Repbase Update [68] is a database for repeats. It contains transposable elements, simple repeats and some microsatellites. The MIR sequences from Repbase Update were used in this study and the Repbase database was used as the default library for annotating transposable elements with RepeatMasker. Table 2 gives a quick overview of the Repbase Update content for mammals and table 3 gives a quick overview of the total Repbase Update content. Repbase is growing rapidly: while it contained 3600 sequences in 2005, it now (2010) consists of 5974 repeat definitions. Figure 13 shows the phylogenetic relationship of all MIR and MIR like elements in Repbase Update. This figure should give a quick overview of how the present MIR consensus sequences cluster within MIR like sequences.

[Figure 13 tree. Tip labels: MIRb, MIR, MON1, THER1, MAR1, MIR3, M3_MarsA, M3_MarsB, THER2_MD, THER2, MIRAves2, MIRAves1. Support values shown on branches: 66, 100, 100, 99, 57, 21, 98, 61, 85, 100. Scale bar: 0.07.]

Figure 13: This phylogeny is based on all sequences that are tagged as MIR or MIR like sequences in Repbase Update. The MON1 element is colored, as it is the youngest MIR like CORE-SINE. The phylogeny was inferred with maximum likelihood with 1000 bootstrap replicates.

Table 2: RepBase Update: Mammals.

Class of elements            All Mammals     Primates       Rodents        Cetartiodactyla   Carnivora
DNA transposons              289 (5)         102 (94)       8 (94)         2 (94)            3 (94)
Endogenous retrovirus        2572            545 (101)      609 (101)      126 (101)         65 (101)
LTR retrotransposon          76              16 (17)        33 (17)        0 (17)            0 (17)
Non-LTR retrotransposons     669 (20)        160 (66)       148 (66)       57 (65)           31 (96)
SINEs                        318             63 (13)        83 (13)        42 (12)           14 (12)
Total interspersed repeats   3630 ? (3715)   826 ? (1169)   803 ? (1146)   185 (527)         99 (441)

Data taken from issue 15.03 (April 12, 2009) of RepBase. (n) = plus n ancestral repeats. ? = RU count number differs from number of actual entries.

Table 3: RepBase Update: Total.

Class of elements            Vertebrates    Invertebrates*
DNA transposons              914            1015             927
Endogenous retrovirus        2959           2                3
LTR retrotransposon          614            903              2414
Non-LTR retrotransposons     1031           296              317
SINEs                        361 (364)      16               77
Total interspersed repeats   5902 (5979)    2259             3654

* Arthropoda: C. elegans, Nematostella vectensis, Schmidtea mediterranea

UCSC genome browser The UCSC (University of California at Santa Cruz) genome browser is an interface to a huge repository of genomes and all genome related information like annotation tracks or MULTIZ genome-genome alignments [69]. All genomes used in this study were downloaded from the UCSC genome browser, as were the genome-genome alignments used for the cross species MIR identification.

3 Results and Discussion

3.1 Cross species MIR identification in genome / genome alignments

Table 4 shows an overview of the current annotation of selected mammals with RepeatMasker. This table does not show all the genomes used in this work, but the mammalian premasked genomes from the RepeatMasker database. Table 5 shows the number of alignment blocks from the genome-genome alignments with human. The more blocks there are, the smaller the average orthologous region gets. This table shows all the genomes used in this work. For dog, gorilla, baboon, bushbaby and hedgehog no premasked genomes were available from the RepeatMasker database or from the UCSC genome browser. For these five genomes a small MIR library was built to annotate them. Table 6 shows the number of MIRs in these five genomes after the new annotation. [This table was wrong, have to make new table, table empty for now.]

Table 4: RepeatMasker annotation of MIR elements in the chosen mammals.

Name                                      MIR       MIRb      MIRc      MIR3      total
Cow            Bos taurus                 138,573   174,257   81,321    67,720    461,871
Cat            Felis catus                127,671   146,232   67,265    43,675    384,843
Human          Homo sapiens               175,609   225,162   103,490   90,841    595,102
Chimp          Pan troglodytes            195,578   223,887   102,544   68,570    590,579
Orangutan      Pongo pygmaeus             204,415   233,442   107,094   71,129    616,080
Rhesus monkey  Macaca mulatta             181,466   206,521   94,559    63,063    545,609
Mouse          Mus musculus               39,000    42,258    23,203    16,119    120,580
Rat            Rattus norvegicus          41,567    38,517    21,260    8,698     110,042
Elephant       Loxodonta africana         129,336   164,588   79,293    70,172    443,389
Opossum        Monodelphis domestica      22,915    42,457    187,697   205,888   458,957
Platypus       Ornithorhynchus anatinus   11,243    6,219     17,767    2,561     37,790

Table 5: BlastZ alignment blocks in the chosen mammals.

Name                                 MIRs in blocks   Human MIR coverage
Cow       Bos taurus                 360,915          0.61
Dog       Canis familiaris           424,051          0.71
Gorilla   Gorilla gorilla            394,604          0.66
Baboon    Papio hamadryas            548,238          0.92
Bushbaby  Otolemur garnettii         329,186          0.55
Mouse     Mus musculus               269,522          0.45
Hedgehog  Erinaceus europaeus        119,309          0.20
Elephant  Loxodonta africana         352,842          0.59
Opossum   Monodelphis domestica      27,038           0.04
Platypus  Ornithorhynchus anatinus   10,562           0.02

Table 6: New RM annotation of unannotated genomes with the small MIR library.

Name                            MIR   MIRb   MIRc   MIR3   total
Dog       Canis familiaris
Gorilla   Gorilla gorilla
Baboon    Papio hamadryas
Bushbaby  Otolemur garnettii
Hedgehog  Erinaceus europaeus

The human MIR coverage shows how many of the MIR sequences annotated in human can be found within the orthologous blocks from the BlastZ alignments. While in baboon 92% of the MIR elements are within the orthologous alignments, in platypus we can only find 2% of the MIR elements within the orthologous alignments. This can potentially have two reasons. The genome can be evolutionarily far away from the human genome, resulting in more and smaller orthologous blocks and in general fewer orthologous mappings. Another problem can of course be the genome quality in general. While the human and the mouse genome have had many releases and a huge quality increase since the draft, some of the other genomes, like platypus or hedgehog, are still drafts. So while dog is a high quality genome and we can expect the 424k MIRs in the orthologous blocks to be more or less the maximum of what we can find in orthologous regions, the 329k MIRs in the bushbaby blocks are probably an underestimate, as the low number probably has more to do with the genome quality. Figure 14 shows how many new MIR sequences could be found with the cross species comparison. The number of newly annotated MIRs ranges from about 39k in hedgehog up to around 118k in baboon and dog. As we only had 4% human MIR coverage in opossum and 2% human MIR coverage in platypus to begin with, we could expect to reannotate almost no new MIR sequences in these two genomes. In relative terms we got the biggest increase of MIR annotation in mouse with about 70% and the smallest increase in gorilla, baboon and elephant with about 20%. This shows that the general approach worked well and that there is a lot of potential to build better MIR models for all examined species.
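The coverage values in Table 5 appear to correspond to the number of human MIRs inside alignment blocks divided by the total number of MIRs annotated in human (the human total in Table 4). This reading is an assumption, as the calculation is not stated explicitly; the sketch below simply redoes the division for a few species.

```python
HUMAN_MIR_TOTAL = 595102  # total MIR annotations in human (Table 4)

# Human MIRs located inside alignment blocks (Table 5), selected species.
MIRS_IN_BLOCKS = {
    "Baboon": 548238,
    "Dog": 424051,
    "Mouse": 269522,
    "Hedgehog": 119309,
    "Platypus": 10562,
}


def human_mir_coverage(mirs_in_blocks, human_total=HUMAN_MIR_TOTAL):
    """Fraction of human MIR annotations that fall into alignment blocks."""
    return mirs_in_blocks / human_total


if __name__ == "__main__":
    for species, count in MIRS_IN_BLOCKS.items():
        print(f"{species:10s} {human_mir_coverage(count):.2f}")
```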

[Figure 14: two bar charts over the ten non-human genomes. Upper panel: percentage scale from 0% to 100%. Lower panel: absolute counts from 0 to 140,000.]

Figure 14: The upper diagram shows how many of the MIR sequences annotated via cross species comparison would also have been found by RepeatMasker. The lower diagram shows how many MIR sequences were newly annotated.

3.2 Species specific RepeatMasker library

3.2.1 MIR seed alignments

Figure 15 shows the length histogram of the annotated MIR sequences in mouse and human. We can see that only a few thousand sequences could be considered almost full length sequences. This is true for all examined organisms. The first step for building species specific MIR consensus sequences is to take MIR sequences, make an alignment and generate a consensus; in our case a simple majority rule consensus.

Figure 15: This figure shows a histogram of the length distribution of all MIR elements annotated in human and mouse. Within the histogram four sub-histograms are shown to show the amount of MIR (green), MIR3 (blue), MIRb (red), MIRc (black).

We randomly chose 1000 MIR sequences that were over 200 bp long from each organism and made a step by step alignment. As such a large alignment would be too big to calculate at once, we added the sequences to the alignment one by one. We did not differentiate between the four MIR types that the RepBase Update library uses, as it was not possible to assign these MIR types to the annotated MIR sequences. The emitted consensus sequences were identical to the original MIR consensus from human. Originally we wanted to build a species specific MIR library for RepeatMasker and reannotate the genomes with this library, to see how often the species specific consensus sequences would be preferred over the original MIR consensus. As all of the MIR sequences chosen for the alignments were either directly or indirectly identified via the original MIR consensus, we could expect the emitted consensus to be almost identical. Though the most frequent base is the same as in the original consensus, the base frequency for each position differs between the organisms.
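The sampling step and a simple majority rule consensus can be sketched as follows. The gap handling, the tie breaking and the function names are assumptions made for this example, and the alignment step itself is not shown.

```python
import random
from collections import Counter


def sample_long_sequences(sequences, n=1000, min_length=200, seed=None):
    """Randomly pick up to n sequences that are longer than min_length bp."""
    candidates = [s for s in sequences if len(s) > min_length]
    rng = random.Random(seed)
    return rng.sample(candidates, min(n, len(candidates)))


def majority_consensus(alignment, gap="-"):
    """Most frequent character per column; columns won by gaps are dropped."""
    consensus = []
    for column in zip(*alignment):
        # On ties the character seen first in the column wins (assumption).
        symbol, _ = Counter(column).most_common(1)[0]
        if symbol != gap:
            consensus.append(symbol)
    return "".join(consensus)
```

Because the consensus only keeps the most frequent base, two species can yield the same consensus string even when their per-position base frequencies differ.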

3.2.2 MIR site heterogeneity

Figure 16 shows a stacked barplot of the nucleotide probability for each position of the MIR consensus. The consensus plot (Figure 18) shows a clearer picture: it is a connected diagram in which the probability of the most conserved base is shown for each position. Figures 17 and 19 show the same diagrams for human.

Figure 16: This figure shows the base heterogeneity for the MIR element in baboon. This plot was calculated on the basis of 400 MIR sequences. (A,T,G,C) = (blue, red, green, yellow).

Figure 17: This figure shows the base heterogeneity for the MIR element in human. This plot was calculated on the basis of 400 MIR sequences. (A,T,G,C) = (blue, red, green, yellow).

Figure 18: This figure shows the conservation of the MIR element for baboon.

Figure 19: This figure shows the base pair conservation of the MIR element for human.

In figure 20 we can see a subtraction plot of the mouse and human MIR consensus plots. While the consensus sequence of human and mouse would be the same, the frequency of the most probable base at a certain position can be up to 15% lower or more than 20% higher. This would be more than enough variation for building species specific profiles, but unfortunately it is not enough variation for building species specific consensus sequences for RepeatMasker.

Figure 20: This figure shows the consensus difference between the human and the mouse MIR sequences. Above zero means mouse consensus is higher, below zero means human consensus is higher.

[Figure 21 panels: Human, Baboon, Bushbaby, Gorilla, Cow, Opossum, Dog]

Figure 21: This figure shows the sequence site heterogeneity plot for human, baboon, bushbaby, gorilla, cow, opossum and dog.

[Figure 22 panels: Platypus, Elephant, Hedgehog, Mouse]

Figure 22: This figure shows the sequence site heterogeneity plot for platypus, elephant, hedgehog and mouse.

Except for small differences, the sequence site heterogeneity plots for all organisms look the same. They only differ slightly in base frequency.

3.3 MIR associated annotational features

Table 7 shows the overlap of the MIR positions with genes of the eleven organisms. We differentiated between untranslated regions (UTRs), coding sequence and introns.

Table 7: MIR position overlap with selected genomic features.

Organism                              Gene (intron / exon / UTR)
Baboon    Papio hamadryas
Bushbaby  Otolemur garnettii
Cow       Bos taurus
Dog       Canis lupus familiaris
Elephant  Loxodonta africana
Gorilla   Gorilla gorilla gorilla
Hedgehog  Erinaceus europaeus
Human     Homo sapiens
Mouse     Mus musculus
Opossum   Monodelphis domestica
Platypus  Ornithorhynchus anatinus

[This is empty right now as I do not find gene overlap with the MIR positions. If this happens to be a simple bug in the comparison script, there will be values filled in on Friday.]

3.4 MIR elements in the lizard and the bird genomes

We annotated the lizard genome (Anolis carolinensis) with RepeatMasker twice: once with both the MIRs and the Sauria SINEs in the library, and once with the Sauria SINEs excluded. Without the Sauria SINEs, about 47k MIRs are annotated, which fits the number extrapolated by Shedlock et al. (45k to 47k) well. Shedlock et al. sequenced two megabases of the A. carolinensis genome and extrapolated the number of identified repeats to the complete genome. If the Sauria SINEs are included in the RepeatMasker run, not a single MIR element is annotated, and all elements previously annotated as MIR elements are annotated as Sauria SINEs instead. In total, about 180k Sauria SINEs were annotated. This might be a general problem for undescribed SINEs: as the 5’ part is tRNA-related and the LINE-related part can be shared between different SINE elements, undescribed SINEs can be misannotated if they share some sequence similarity with a described SINE.

No bird MIRs could be annotated in the lizard genome. The conserved bird MIR sequence does not share much sequence similarity with the mammalian MIR element (73%). Although we cannot exclude that the two are related, it is probable that the bird MIR, which certainly is a CORE-SINE, has an independent origin. If it had a common ancestor with the mammalian MIR, we should be able to find the MIR element in reptile genomes.

The MIR sequences in Anolis reported by Korotkov et al. [41, 54] are only two single instances. These two instances were run against the RepBase library, and one of them is recognized by RepeatMasker as a MIRb element. The Anolis sequences used by Korotkov et al. are not from the same Anolis species. As not a single MIR can be found in the draft of the A. carolinensis genome, it is highly unlikely that the identified MIRb sequence is a real MIR element. Even if the orthologous region is missing from the draft and a MIR-like sequence is present in the A. carolinensis genome, a single MIR-like sequence is not enough to make a case for an early vertebrate origin of the MIR element, as we would expect huge copy numbers in these organisms. This expectation holds, of course, only under the hypothesis that there is no selection pressure to get rid of the MIR copies.
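The extrapolation from a sequenced sample to the complete genome, as mentioned for Shedlock et al., is a simple proportional scaling. The following Python sketch only illustrates this arithmetic. The function name, the genome size and the observed copy count are hypothetical placeholder values and are not taken from Shedlock et al. or from this project.

```python
def extrapolate_copy_number(observed_copies: int, sampled_bp: int, genome_bp: int) -> float:
    """Scale a repeat count observed in a sequenced sample up to the whole genome.

    Assumes that the repeat is evenly distributed, so that the sample is
    representative of the genome (an assumption made for this sketch).
    """
    if sampled_bp <= 0:
        raise ValueError("sampled_bp must be positive")
    return observed_copies * genome_bp / sampled_bp


# Hypothetical example values (NOT from Shedlock et al.):
# 50 copies observed in a 2 Mb sample, genome size assumed to be 1.8 Gb.
estimate = extrapolate_copy_number(observed_copies=50,
                                   sampled_bp=2_000_000,
                                   genome_bp=1_800_000_000)
print(f"estimated genome-wide copies: {estimate:.0f}")
```

Because the estimate scales linearly with the observed count, a difference of a few copies in a small sample already shifts the genome-wide estimate by thousands of copies.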

3.5 Building profiles for MIR sequences

BLAST profiles with PSI-BLAST. The very first idea was to use reverse PSI-BLAST for improved MIR identification: build a large library of MIR profiles from already known RepeatMasker MIR sequences and use it as a database against which a given sequence is searched for MIR elements. Unfortunately, PSI-BLAST only works with proteins. While it was possible to rewrite a protein substitution matrix so that it acts as a nucleotide substitution matrix, the statistics behind PSI-BLAST did not produce any usable results, probably because of the four-letter alphabet.

Profile hidden Markov models and HMMER. HMMER 1, HMMER 2 and HMMER 3 all failed to function properly with DNA MIR profiles. For each of the three HMMER versions, profile hidden Markov models were built from 100 randomly chosen MIR sequences. These sequences were taken from a TE sequence database (TE DB) compiled from all TE sequences in the RepeatMasker alignment files for H. sapiens and M. musculus, which were downloaded from the RepeatMasker premasked genomes repository. HMMER 1 and HMMER 2 were not able to detect MIR sequences. While HMMER 2 has a clear focus on protein detection, HMMER 1 was developed with both DNA and amino acid sequences in mind. Probably the biggest problem is that the models are based on a highly diverged DNA alignment (around 30% divergence) with only a four-letter alphabet. HMMER 3 seemed promising at first, as the MIR models detected MIR sequences, and only MIR sequences, in the TE DB. Unfortunately, these models also annotated every stretch of coding sequence.
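The random selection of training sequences can be sketched in a few lines of Python. This is a minimal illustration and not the script used in this project. It assumes a FASTA-formatted TE database in which the family name (here "MIR") appears in the header line. The function names and the header convention are hypothetical.

```python
import random
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sample_family(records: Iterable[Tuple[str, str]], family: str = "MIR",
                  n: int = 100, seed: int = 0) -> List[Tuple[str, str]]:
    """Pick n random records whose header contains the family name.

    Assumption for this sketch: the family name is part of the header.
    If fewer than n records match, all matching records are returned.
    """
    pool = [(h, s) for h, s in records if family in h]
    if len(pool) <= n:
        return pool
    return random.Random(seed).sample(pool, n)


# Usage with a tiny in-memory example instead of a real TE database file.
example = ">MIR_1\nACGT\nAC\n>AluY_1\nGGGG\n>MIR_2\nTTTT\n>MIRb_3\nCCCC\n"
training_set = sample_family(read_fasta(example.splitlines()), n=2)
```

A fixed seed makes the selection reproducible, which helps when the same training set has to be used for several HMMER versions.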

Jackhmmer. Jackhmmer can be used in a PSI-BLAST-like way to produce MIR profiles from a library of MIR sequences. It iterates over the library, building profiles that are retrained with each iteration. Although Jackhmmer was originally intended for proteins, the general procedure worked well. However, as noted above, the HMMER 3 statistics did not perform well for a four-letter alphabet. Full DNA functionality has been announced for a future HMMER 3 release.
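The idea of iterative profile training can be illustrated with a toy example in Python. This sketch is not the algorithm implemented in Jackhmmer or PSI-BLAST. It assumes ungapped DNA sequences of equal length over the alphabet ACGT, uses a simple position-specific log-odds profile against a uniform background, and all function names and the threshold are hypothetical.

```python
import math
from typing import Dict, List, Set

ALPHABET = "ACGT"


def build_profile(seqs: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Build per-column log-odds scores (log2) against a uniform background."""
    profile = []
    for i in range(len(seqs[0])):
        counts = {base: pseudocount for base in ALPHABET}
        for seq in seqs:
            counts[seq[i]] += 1
        total = sum(counts.values())
        profile.append({b: math.log2((counts[b] / total) / 0.25) for b in ALPHABET})
    return profile


def score(profile: List[Dict[str, float]], seq: str) -> float:
    """Sum the column scores of a sequence (ungapped, same length as profile)."""
    return sum(column[base] for column, base in zip(profile, seq))


def iterative_search(seed: str, library: List[str], threshold: float,
                     max_iter: int = 5) -> Set[str]:
    """Retrain the profile on all hits until no new sequences are found."""
    members = {seed}
    for _ in range(max_iter):
        profile = build_profile(sorted(members))
        hits = {s for s in library if score(profile, s) >= threshold}
        if hits <= members:
            break
        members |= hits
    return members


library = ["ACGTACGA", "ACGTACCA", "TTTTTTTT"]
found = iterative_search("ACGTACGT", library, threshold=4.0)
```

In this toy library, the sequence with two mismatches to the seed scores below the threshold in the first round and is only picked up after the profile has been retrained on the sequence with one mismatch. This is the effect that makes iterative profiles attractive for strongly diverged elements.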

OrthoMCL. Using OrthoMCL clusters of all mammalian species in the RepeatMasker premasked database would have been attractive, as they would have provided a good mapping of orthologous genes. This would have allowed us to annotate MIR elements across organisms in which they had previously gone unannotated because they had diverged too far from the RepBase consensus. At the beginning of the project, the idea was to take all genes of all mammals present in the OrthoMCL database and retrieve the corresponding nucleotide sequences with 3 kb of upstream and downstream sequence. An alignment would then be calculated for each cluster of orthologous genes, and a search for the MIR element would be performed in it. As we could not use actual profiles for MIR detection in this approach, we simply used BLAST (bl2seq) to test it, comparing orthologous sequences that carry a MIR annotation in one species. This approach quickly became messy and was not very fruitful: aligning introns was a problem, as was the correct mapping of exons. It was carried out in full for mouse and human but revealed only a few cases of MIRs being cross-detected in close range of genes. Between mouse and human there were 4138 orthologous mappings, including 149 clusters that contained a MIR sequence, mostly in the upstream or downstream region. 7 MIR sequences could be cross-identified within human paralogs and 14 within mouse paralogs. 26 MIR sequences could be annotated from human to mouse and 27 the other way round.
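The extraction window of 3 kb around each gene and the distinction between upstream, downstream and gene body hits can be sketched as follows. This is a minimal Python illustration under stated assumptions (1-based inclusive coordinates, one interval per gene). The class and function names are hypothetical and do not correspond to the scripts used in this project.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Gene:
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def flanked_window(gene: Gene, chrom_length: int, flank: int = 3000) -> Tuple[int, int]:
    """Return the gene interval extended by the flank, clipped to the chromosome."""
    return max(1, gene.start - flank), min(chrom_length, gene.end + flank)


def classify_hit(gene: Gene, hit_start: int, hit_end: int, flank: int = 3000) -> str:
    """Classify a MIR hit as 'gene', 'upstream', 'downstream' or 'outside'."""
    if hit_end < gene.start - flank or hit_start > gene.end + flank:
        return "outside"
    if hit_start <= gene.end and hit_end >= gene.start:
        return "gene"
    left_of_gene = hit_end < gene.start
    # Upstream is the side of lower coordinates on the plus strand
    # and the side of higher coordinates on the minus strand.
    if left_of_gene == (gene.strand == "+"):
        return "upstream"
    return "downstream"


gene = Gene(start=10_000, end=20_000, strand="+")
window = flanked_window(gene, chrom_length=1_000_000)
location = classify_hit(gene, hit_start=8_000, hit_end=8_200)
```

Taking the strand into account matters here, because the same genomic position is upstream of a gene on the plus strand but downstream of a gene on the minus strand.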

4 Conclusion

The main approach, comparing orthologous positions between the human genome and other mammalian genomes, worked. On average, we could add 30 to 50% to the current MIR annotation. The approach of building species-specific MIR consensus sequences, in order to obtain a library that lets RepeatMasker annotate these additionally found MIR sequences automatically, did not work. This was essentially a second approach, as the initial idea of building species-specific hidden Markov models did not work for various reasons. We conclude that it is not possible to obtain a good MIR annotation with consensus sequences, mainly because the MIR element is too divergent. A sequence characterization such as a profile would be a much better choice, and there is hope that, with the conditioned Baum-Welch algorithm and the dynamic surgery algorithm, hidden Markov models for MIR detection can be built in the near future. There is essentially no gene overlap with the MIR sequences. So while the MIR element may contribute greatly to genome space, it is unlikely to be a major contributor to genes. It may still have contributed to evolution, but this is subject to further research.

5 Outlook

• The origin of the MIR element remains unknown. It could lie within the Synapsida or within the vertebrates as a whole. With more bird genomes, and with genomes of reptiles more closely related to the birds, it could be investigated whether the bird MIR element is likely to have an origin independent of the mammalian MIR element.
• MIR annotation could profit greatly from actual MIR profiles. MIR profile building with the conditioned Baum-Welch and dynamic surgery algorithms will be worth looking into as soon as a good implementation is available. HMMER 3 is also supposed to get full DNA functionality, so there are two promising candidates for hidden Markov modelling of MIRs.
• It would be worthwhile to repeat the cross-species MIR annotation with a set of genomes for which genome/genome alignments are available in all directions. From such a dataset, a network of connected orthologous blocks could be generated that would not only help to identify more MIR elements but should also be useful for other ancient, and thus hard to detect, repeats.
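The network of connected orthologous blocks proposed in the last point could be built by treating every alignment block as a node and every pairwise genome/genome alignment between two blocks as an edge. The following Python sketch groups such blocks with a union-find structure. It is only an illustration of the idea: the function name and the block identifiers of the form (genome, block id) are hypothetical, and no such dataset was available in this project.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def connected_blocks(links: Iterable[Tuple[Hashable, Hashable]]) -> List[Set[Hashable]]:
    """Group blocks that are linked, directly or indirectly, by pairwise alignments."""
    parent: Dict[Hashable, Hashable] = {}

    def find(node: Hashable) -> Hashable:
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups: Dict[Hashable, Set[Hashable]] = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


# Hypothetical pairwise alignments between blocks of different genomes.
links = [
    (("human", "b1"), ("mouse", "b7")),
    (("mouse", "b7"), ("dog", "b3")),
    (("human", "b2"), ("cow", "b9")),
]
components = connected_blocks(links)
```

In such a component, a MIR annotation in any one genome could be propagated to all other blocks of the component, even if no direct alignment to the annotated genome exists.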

References

[1] McClintock B: The origin and behavior of mutable loci in maize. Proc Natl Acad Sci U S A 1950, 36(6):344–355.
[2] Orgel LE, Crick FH: Selfish DNA: the ultimate parasite. Nature 1980, 284(5757):604–607.
[3] Doolittle WF, Sapienza C: Selfish genes, the phenotype paradigm and genome evolution. Nature 1980, 284(5757):601–603.

[4] Bushman FD: Targeting survival: integration site selection by retroviruses and LTR-retrotransposons. Cell 2003, 115(2):135–138.
[5] Lander et al., International Human Genome Sequencing Consortium: Initial sequencing and analysis of the human genome. Nature 2001, 409(6822):860–921.
[6] Biémont C, Vieira C: Genetics: junk DNA as an evolutionary force. Nature 2006, 443(7111):521–524.
[7] Finnegan DJ: Eukaryotic transposable elements and genome evolution. Trends Genet 1989, 5(4):103–107.
[8] Makalowski W: Genomic scrap yard: how genomes utilize all that junk. Gene 2000, 259(1-2):61–67.
[9] Silva JC, Shabalina SA, Harris DG, Spouge JL, Kondrashov AS: Conserved fragments of transposable elements in intergenic regions: evidence for widespread recruitment of MIR- and L2-derived sequences within the mouse and human genomes. Genet Res 2003, 82:1–18.
[10] Kirby PJ, Greaves IK, Koina E, Waters PD, Graves JAM: Core-SINE blocks comprise a large fraction of monotreme genomes; implications for vertebrate chromosome evolution. Chromosome Res 2007, 15(8):975–984.
[11] Krull M, Petrusma M, Makalowski W, Brosius J, Schmitz J: Functional persistence of exonized mammalian-wide interspersed repeat elements (MIRs). Genome Res 2007, 17(8):1139–1145.
[12] Vorechovsky I: Transposable elements in disease-associated cryptic exons. Hum Genet 2009.
[13] Brosius J: RNAs from all categories generate retrosequences that may be exapted as novel genes or regulatory elements. Gene 1999, 238:115–134.
[14] Lorenc A, Makalowski W: Transposable elements and vertebrate protein diversity. Genetica 2003, 118(2-3):183–191.
[15] Sorek R, Ast G, Graur D: Alu-containing exons are alternatively spliced. Genome Res 2002, 12(7):1060–1067.
[16] Pidpala OV, Iatsyshyna AP, Lukash LL: Human mobile genetic elements: structure, distribution and functional role. Tsitol Genet 2008, 42(6):69–81.
[17] Faulkner GJ, Kimura Y, Daub CO, Wani S, Plessy C, Irvine KM, Schroder K, Cloonan N, Steptoe AL, Lassmann T, Waki K, Hornig N, Arakawa T, Takahashi H, Kawai J, Forrest ARR, Suzuki H, Hayashizaki Y, Hume DA, Orlando V, Grimmond SM, Carninci P: The regulated retrotransposon transcriptome of mammalian cells. Nat Genet 2009, 41(5):563–571.
[18] Medstrand P, van de Lagemaat LN, Dunn CA, Landry JR, Svenback D, Mager DL: Impact of transposable elements on the evolution of mammalian gene regulation. Cytogenet Genome Res 2005, 110(1-4):342–352.
[19] Shapiro JA: Mobile DNA and evolution in the 21st century. Mobile DNA 2010, 1:4:14.

[20] Luan DD, Korman MH, Jakubczak JL, Eickbush TH: Reverse transcription of R2Bm RNA is primed by a nick at the chromosomal target site: a mechanism for non-LTR retrotransposition. Cell 1993, 72(4):595–605.
[21] Luan DD, Eickbush TH: RNA template requirements for target DNA-primed reverse transcription by the R2 retrotransposable element. Mol Cell Biol 1995, 15(7):3882–3891.
[22] Smit AF: The origin of interspersed repeats in the human genome. Curr Opin Genet Dev 1996, 6(6):743–748.
[23] Ohshima K, Okada N: SINEs and LINEs: symbionts of eukaryotic genomes with a common tail. Cytogenet Genome Res 2005, 110(1-4):475–490.
[24] Okada N, Hamada M, Ogiwara I, Ohshima K: SINEs and LINEs share common 3’ sequences: a review. Gene 1997, 205(1-2):229–243.
[25] Quentin Y: A master sequence related to a free left Alu monomer (FLAM) at the origin of the B1 family in rodent genomes. Nucleic Acids Res 1994, 22(12):2222–2227.
[26] Smit AF, Riggs AD: MIRs are classic, tRNA-derived SINEs that amplified before the mammalian radiation. Nucleic Acids Res 1995, 23:98–102.
[27] Akasaki T, Nikaido M, Nishihara H, Tsuchiya K, Segawa S, Okada N: Characterization of a novel SINE superfamily from invertebrates: "Ceph-SINEs" from the genomes of squids and cuttlefish. Gene 2010, 454(1-2):8–19.
[28] Gilbert N, Labuda D: CORE-SINEs: eukaryotic short interspersed retroposing elements with common sequence motifs. Proc Natl Acad Sci U S A 1999, 96(6):2869–2874.
[29] Munemasa M, Nikaido M, Nishihara H, Donnellan S, Austin CC, Okada N: Newly discovered young CORE-SINEs in marsupial genomes. Gene 2008, 407(1-2):176–185.
[30] Jurka J, Zietkiewicz E, Labuda D: Ubiquitous mammalian-wide interspersed repeats (MIRs) are molecular fossils from the mesozoic era. Nucleic Acids Res 1995, 23:170–175.
[31] Murnane JP, Morales JF: Use of a mammalian interspersed repetitive (MIR) element in the coding and processing sequences of mammalian genes. Nucleic Acids Res 1995, 23(15):2837–2839.
[32] Gilbert N, Labuda D: Evolutionary inventions and continuity of CORE-SINEs in mammals. J Mol Biol 2000, 298(3):365–377.
[33] Matassi G, Labuda D, Bernardi G: Distribution of the mammalian-wide interspersed repeats (MIRs) in the isochores of the human genome. FEBS Lett 1998, 439(1-2):63–65.
[34] Donehower LA, Slagle BL, Wilde M, Darlington G, Butel JS: Identification of a conserved sequence in the non-coding regions of many human genes. Nucleic Acids Res 1989, 17(2):699–710.

[35] Armour JA, Wong Z, Wilson V, Royle NJ, Jeffreys AJ: Sequences flanking the repeat arrays of human minisatellites: association with tandem and dispersed repeat elements. Nucleic Acids Res 1989, 17(13):4925–4935.
[36] Tulko JS, Korotkov EV, Phoenix DA: MIRs are present in coding regions of human genes. DNA Seq 1997, 8(1-2):31–38.
[37] Hughes DC: MIRs as agents of mammalian gene evolution. Trends Genet 2000, 16(2):60–62.
[38] Warren et al.: The genome of a songbird. Nature 2010, 464(7289):757–762.
[39] Ellegren H: The avian genome uncovered. Trends Ecol Evol 2005, 20(4):180–186.
[40] Organ CL, Shedlock AM, Meade A, Pagel M, Edwards SV: Origin of avian genome size and structure in non-avian dinosaurs. Nature 2007, 446(7132):180–184.
[41] Korotkov EV, Korotkova MA, Rudenko VM: MIR-family of repeats common for vertebrate genomes. Mol Biol (Mosk) 2000, 34(4):553–559.
[42] Kosushkin SA, Borodulina OR, Grechko VV, Kramerov DA: A new family of interspersed repeats from squamate reptiles. Mol Biol (Mosk) 2006, 40(2):378–382.
[43] Piskurek O, Austin CC, Okada N: Sauria SINEs: Novel short interspersed retroposable elements that are widespread in reptile genomes. J Mol Evol 2006, 62(5):630–644.
[44] Shedlock AM, Botka CW, Zhao S, Shetty J, Zhang T, Liu JS, Deschavanne PJ, Edwards SV: Phylogenomics of nonavian reptiles and the structure of the ancestral amniote genome. Proc Natl Acad Sci U S A 2007, 104(8):2767–2772.
[45] Rho M, Choi JH, Kim S, Lynch M, Tang H: De novo identification of LTR retrotransposons in eukaryotic genomes. BMC Genomics 2007, 8:90.
[46] Li R, Ye J, Li S, Wang J, Han Y, Ye C, Wang J, Yang H, Yu J, Wong GKS, Wang J: ReAS: Recovery of ancestral sequences for transposable elements from the unassembled reads of a whole genome shotgun. PLoS Comput Biol 2005, 1(4):e43.
[47] Rogozin IB, Mayorov VI, Lavrentieva MV, Milanesi L, Adkison LR: Prediction and phylogenetic analysis of mammalian short interspersed elements (SINEs). Brief Bioinform 2000, 1(3):260–274.
[48] Bergman CM, Quesneville H: Discovering and detecting transposable elements in genome sequences. Brief Bioinform 2007, 8(6):382–392.
[49] Saha S, Bridges S, Magbanua ZV, Peterson DG: Empirical comparison of ab initio repeat finding programs. Nucleic Acids Res 2008, 36(7):2284–2294.

[50] Lerat E: Identifying repeats and transposable elements in sequenced genomes: how to find your way through the dense forest of programs. Heredity 2009.
[51] Eddy SR: What is a hidden Markov model? Nat Biotechnol 2004, 22(10):1315–1316.
[52] Rabiner LR: A Tutorial on Hidden Markov Models and Selected Applications in speech recognition. Proc IEEE 1989, 77:257–286.
[53] Edlefsen PT, Liu JS: Transposon identification using profile HMMs. BMC Genomics 2010, 11 Suppl 1:S10.
[54] Korotkov EV, Korotkova MA: Presence of MIR-elements in the complete nucleotide sequence of human chromosome 22. Mol Biol (Mosk) 2001, 35(3):376–382.
[55] Chale MB, Korotkov EV: Evolution of MIR-repeats in coding regions of the human genome. Mol Biol (Mosk) 2001, 35(6):1023–1031.
[56] Kimura M: A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol 1980, 16(2):111–120.
[57] Letunic I, Bork P: Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation. Bioinformatics 2007, 23:127–128.
[58] Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402.
[59] Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AFA, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, Haussler D, Miller W: Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res 2004, 14(4):708–715.
[60] Eddy SR: A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLoS Comput Biol 2008, 4(5):e1000069.
[61] Durbin R, Eddy SR, Krogh A, Mitchison G: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press 1999.
[62] Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, Holm L, Sonnhammer ELL, Eddy SR, Bateman A: The Pfam protein families database. Nucleic Acids Res 2009.
[63] Stamatakis A, Ludwig T, Meier H: RAxML-III: a fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics 2005, 21(4):456–463.
[64] Smit A, Hubley R, Green P: RepeatMasker Open-3.0 1996-2004.
[65] Notredame C, Higgins DG, Heringa J: T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 2000, 302:205–217.

[66] Katoh K, Misawa K, Kuma K, Miyata T: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 2002, 30(14):3059–3066.
[67] Haider S, Ballester B, Smedley D, Zhang J, Rice P, Kasprzyk A: BioMart Central Portal: unified access to biological data. Nucleic Acids Res 2009, 37(Web Server issue):W23–W27.
[68] Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J: Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res 2005, 110(1-4):462–467.
[69] Rhead B, Karolchik D, Kuhn RM, Hinrichs AS, Zweig AS, Fujita PA, Diekhans M, Smith KE, Rosenbloom KR, Raney BJ, Pohl A, Pheasant M, Meyer LR, Learned K, Hsu F, Hillman-Jackson J, Harte RA, Giardine B, Dreszer TR, Clawson H, Barber GP, Haussler D, Kent WJ: The UCSC Genome Browser database: update 2010. Nucleic Acids Res 2010, 38(Database issue):D613–D619.

6 Supplementary material

The following supplementary material can be found on the disk attached to this document. If this document came as an online version without a DVD, please feel free to contact me at [email protected].

• figures/ All figures that were produced for this project.
• hmm/ Test hidden Markov models for HMMER 1, HMMER 2 and HMMER 3.
• latex/ LaTeX source of this document, including a BibTeX reference database.
• scripts/ All scripts that were used for this project. Scripts ending in pl are Perl scripts. Scripts ending in r are GNU R scripts.
• sql/ The SQL database containing the MIR annotations for all mammals that were produced in this project. For a description of how to use the database, see either sql/readme.txt or the table design of the SQL database in the Material and Methods part.
