Sequences Signature and Genome Rearrangements in Mitogenomes

Von der Fakultät für Mathematik und Informatik der Universität Leipzig angenommene

DISSERTATION

zur Erlangung des akademischen Grades

DOCTOR RERUM NATURALIUM (Dr. rer. nat.)

im Fachgebiet Informatik

vorgelegt

von Msc. Marwa Al Arab geboren am 27. Mai 1988 in Ayyat, Libanon

Die Annahme der Dissertation wurde empfohlen von:

1. Prof. Dr. Peter F. Stadler, Universität Leipzig 2. Prof. Dr. Burkhard Morgenstern, Universität Göttingen 3. Prof. Dr. Kifah R. Tout, Lebanese University

Die Verleihung des akademischen Grades erfolgt mit Bestehen der Verteidigung am 6.2.2018 mit dem Gesamtprädikat magna cum laude. Abstract

During the last decades, mitochondria and their DNA have become a hot topic of re- search due to their essential roles which are necessary for cells survival and pathology. In this study, multiple methods have been developed to help with the understanding of mitochondrial DNA and its evolution. These methods tackle two essential prob- lems in this area: the accurate annotation of protein-coding genes and mitochondrial genome rearrangements. Mitochondrial genome sequences are published nowadays with increasing pace, which creates the need for accurate and fast annotation tools that do not require manual intervention. In this work, an automated pipeline for fast de-novo annotation of mitochondrial protein-coding genes is implemented. The pipeline includes methods for enhancing multiple sequence alignment, detecting frameshifts and building pro- tein proles guided by phylogeny. The methods are tested on mitogenomes available in RefSeq, the comparison with reference annotations highlights the high quality of the produced annotations. Furthermore, the frameshift method predicted a large number of frameshifts, many of which were unknown. Additionally, an ecient partially-local alignment method to investigate genomic rearrangements in mitochondrial genomes is presented in this study. The method is novel and introduces a partially-local dynamic programming algorithm on three sequences around the breakpoint region. Unlike the existing methods which study the rearrangement at the genes order level, this method allows to investigate the rearrangement on the molecular level with nucleotides precision. The algorithm is tested on both articial data and real mitochondrial genomic sequences. Surprisingly, a large fraction of rearrangements involve the duplication of local sequences. Since the implemented approach only requires relatively short parts of genomic sequence around a breakpoint, it should be applicable to non-mitochondrial studies as well. Keywords: mitochondria, genome, annotation, protein, genes, rear- rangement, breakpoint, dynamic-programming, sequence alignment.

2 Acknowledgments

I would like to thank the people who helped me in the realization of this dissertation. Firstly I would like to thank Prof. Dr. Peter F. Stadler. As my supervisor, he guided me in my work and helped me nd solutions to move forward. I also would like to extend my thanks to Prof. Dr. Kifah Tout from the Lebanese University for his guidance, patience and support. I would like to express my gratitude to Dr. Matthias Bernt. Thanks to his invaluable assistance this work would not been accomplished. I would like to thank Prof. Dr. Mohammad Khalil and the Dean Prof. Dr. Fawaz Al Omar for leading the Doctoral school of the Lebanese University in the right direction. I also wish to thank my friends especially Halima for assisting me by doing the administrative tasks during my stay in Lebanon. Finally, I thank my husband, the members of my family and my family-in-law who have always been with me during my Ph.D., especially during my pregnancy and after labor. Without their help and encouragement I would have not been able to accomplish this work.

3 Contents

1 Introduction 11

2 Biological Background 16 2.1 Mitochondria Functions and Origin ...... 16 2.2 Mitochondrial DNA ...... 17 2.2.1 Protein-Coding Genes ...... 17 2.2.2 TRNA Genes ...... 19 2.2.3 Ribosomal RNA Genes ...... 21 2.3 Translation ...... 21 2.4 Genome Rearrangement and Breakpoints ...... 22

3 Technical Background 25 3.1 Sequence Alignment ...... 25 3.1.1 Pairwise Alignment ...... 26 3.1.2 Multiple Sequence Alignment ...... 28 3.2 Prole Hidden Markov Models ...... 29 3.3 Phylogenetic Tree ...... 30 3.4 Genome Annotation for Mitochondria ...... 32 3.4.1 Annotation of Protein-coding Genes ...... 32 3.4.2 Available Pipelines ...... 33 3.4.3 Shortcomings of Currently Available Pipelines ...... 33

4 Accurate Annotation for Protein-Coding Genes in Mitochondria 35

4 4.1 Overview of the Workow ...... 36 4.1.1 Data Sets ...... 37 4.1.2 Construction of Initial Models ...... 37 4.1.3 Enhancement of the Models ...... 38 4.1.4 Annotation ...... 45 4.1.5 Improvement of Start and Stop ...... 45 4.1.6 Benchmarking ...... 47 4.2 Quantitative Analysis ...... 47 4.2.1 Annotation Quality ...... 48 4.2.2 Start and Stop Positions ...... 51 4.2.3 Alignment Quality ...... 54 4.2.4 Frameshifts ...... 54 4.2.5 Annotation of mtDNA in Fungi ...... 60 4.3 Conclusion ...... 61

5 Semi Global Alignments of Breakpoints Regions and the Sequence Signatures of Mitochondrial Genome Rearrangements 64 5.1 Theory ...... 65 5.2 Parametrization of the Scoring Function ...... 72 5.3 Mitochondrial Rearrangement Data ...... 74 5.4 Quantitative Analysis ...... 77 5.5 Concluding Remarks ...... 81

6 Conclusion 83

A Protein Genes Annotaion 86 A.1 Overview on Results ...... 86 A.2 of New Frameshifts ...... 89

B Semi Global Alignments of Breakpoints Regions 91 B.1 Pairs Used in the Study ...... 91

5 List of Figures

2-1 Mitochondrial genome of Homo sapien (human) ...... 18 2-2 Example of a tRNA remolding event ...... 21 2-3 Elementary rearrangement events discussed for mitogenomes . . . . . 23

3-1 Rooted and unrooted tree ...... 31

4-1 Overview of the workow ...... 36 4-2 Distribution of bit score dierences ...... 39 4-3 Gaps in removed columns ...... 41 4-4 Removed columns by gene ...... 42 4-5 Taxonomy of removed rows ...... 43 4-6 Taxonomy of training and test data sets ...... 44 4-7 Alignment of atp8 in Ophagostomum columbianum with its closely related ...... 48 4-8 Example of true positive case...... 50 4-9 False positive case...... 51 4-10 False negative case...... 52 4-11 Dierences of predicted ends ...... 52 4-12 Distribution of the nad3 frameshift in Archosauria and Anapsida . . 57 4-13 Alternative interpretations of the same frameshift event ...... 58 4-14 Number of detected frameshifts per gene ...... 59

4-15 퐸-value of true positive hits and over-predicted hits ...... 62

5-1 Overview of the problem ...... 65

6 5-2 Alignment of 퐹 and 퐿 ...... 68 5-3 Alignment of 퐹 and 푅 ...... 69 5-4 Alignment of 퐿 and 푅 ...... 70 5-5 Traceback in 푆 ...... 71 5-6 Variation of overlap and gap sizes ...... 73 5-7 Breakpoint alignments TDRL examples ...... 76 5-8 Distribution of overlap sizes by CREx, age and overlap class . . . . . 79 5-9 Rearrangement of Hydromantes brunus ...... 80 5-10 Dierence in average overlap size between symmetric rearrangements 80

7 List of Tables

3.1 Dynamic programming matrix for the Needelman-Wunsch algorithm . 27

4.1 Root model results by gene ...... 40 4.2 Root model results by taxa ...... 42 4.3 Statistical Improvement ...... 53 4.4 Alignment quality measures ...... 55

5.1 Average overlap size of rearrangemets from literature ...... 78

A.1 Details about over predicted genes ...... 86 A.2 Taxonomy of new detected frameshifts ...... 89

B.1 Pairs used in the study ...... 91

8 Chapter 1

Introduction

Cells are the fundamental working units of living organisms. These organisms fall into two broad categories according to their cells components: Prokaryotes, single- celled organisms (bacteria and archaea) which do not include nucleus nor organelles; and eukaryotes, high level organisms with multiple cells (,plants and fungi) which possess a nucleus and organelles. Mitochondria, one of the major components of eukaryotic cells, are organelles with their own genetic material known as the pow- erhouse of the cell due to their key role in the production of cells energy. Furthermore mitochondria are engaged in substantial cellular processes such as cell aging and cell programmed death (apoptosis). During the last decades mitochondrial DNA (mtDNA) has become a hot topic for research, especially after the discovery of the high impact of mitochondrial dysfunc- tion on human health. Indeed many diseases have been linked to mtDNA damage or mutations such as cancer, diabetes, heart failure and neurodegenerative diseases[1]. Hence, the mitochondrial research has also become a substantial discipline in phar- maceutical experiments. Accordingly, a new strategy for cancer treatment has been developed kill tumor cells that are resistant to normal cell death process. In addition to other drugs which are manufactured to target the metabolic pathways inside mi- tochondria such as anti-inammatory, anti-oxidant, anti-ischemic and anti-obesity[1]. Besides the mentioned applications in biomedicine and pharmacology, the mtD- NA has become a powerful tool for drawing the evolutionary relationships among

9 organisms due to their availability and their fast rate of evolution compared to nu- clear DNA. Likewise, the rapid mutation of mtDNA has contributed to the creation of distinctions between ethnic groups. Therefore mitochondrial features have been employed in forensic genetics and the study of human origins and the history of their early migrations[2, 3]. The enormous growing wealth of mitochondrial genomic data, thanks to the re- duced cost of new sequencing technologies, represents an increasing challenge to inter- pret, analyze and derive relevant conclusions from these data. Thus, the development of computational methods and algorithms to enable the biologists to analyze and ex- tract useful information from these data has become increasingly substantial. The accurate annotation of mtDNA is the basis of any downstream analysis since the an- notated sequences are utilized for comparative analysis and for the annotation of new sequences. In addition, the accurate annotation is needed in the elds of mitochon- drial research such as phylogeny reconstruction, medicine and forensic studies. There are several tools that deal with the annotation of mtDNA, all of them use BLAST[4] to nd the protein-coding (PC) genes e.g. DOGMA [5],MITOS [6], Mitoannotator [7]. This strategy often do not consider the peculiarities of mitochondrial protein-coding genes (see 3.4.1). Along with the annotation of mtDNA, studying the events that occurred with- in the mitogenomes throughout the history help uncover phylogenetic relationships during species evolution. In particular genome rearrangements which have been fre- quently observed in animals mitochondria. Mitogenome rearrangements are usually analyzed at the level of signed permutations that represent the order and orientation of the genes. As a consequence, it is impossible to precisely reconstruct a tandem du- plication random loss (TDRL) rearrangement event from gene order information only since TDRL implies the duplication of the original genes then a random loss of parts of one of the copies. Thus it is necessary to analyze the underlying sequence data to detect duplication remnants. This raises the question whether another rearrangement types such as inversion or transposition leave characteristic sequences behind, which allows the distinction between rearrangement types independently of the gene order

10 data. This would provide more detailed and reliable data on the relative frequency of dierent rearrangement mechanisms. The determination of the breakpoints locations in relation to annotated genes can be done by the comparison of gene orders. Tools such as CREx[8] predict the putative type of rearrangement operations by assuming the most parsimonious rearrangement scenario. However, only a few approaches use sequence data for genome rearrangement analysis. Parsimonious rearrangement sce- narios w.r.t. cut-and-join operations and indels that minimize the dierence in the size distribution of the breakpoint regions from expected values are computed in [9]. For next generation sequencing data dierences in the alignments of the reads to a reference genome can be used to uncover rearrangements [10]. Alignments of synteny blocks can be extended into the breakpoint region in order to delineate its position as precisely as possible. This approach uses the sequence data, however, it does not handle duplication and its possible remnants. TRNA remolding is another particular mode of evolution that aects the tRNAs in mitochondria. A remolded tRNA is a tRNA that underwent a point mutation in its anticodon. The result is a functional tRNA with new identity (anticodon) that is closely related to the original one. This event is thought to be caused by a duplication of the original tRNA followed by a point mutation in the anticodon and later a loss of the original copy. The study of tRNA remolding in mitochondria is substantial due to i) its detection in a wide variety of animal taxa [11, 12, 13] and ii) the impact of remolding on tRNA annotation since all the available tools assign the tRNA identity based on the anticodon[14, 15]. Subsequently reliability issues in downstream analysis are induced such as in phylogenetic reconstructions and in the understanding of the evolution of genetic codes. Nevertheless, comprehensive inspection of tRNA remold- ing across animal species is still missing. A study of tRNA remolding in Metazoa is carried out based on new methods of remolding detection. The rst method extends the similarity based search and consider sequence and structural data of tRNAs in addition to statistical tests. The second method consists of phylogeny based analysis that uses maximum likelihood to detect candidates and ancestral de novo remolding events throughout the tree. The methods and results of this study are detailed in [16]

11 and in the following paper: Sahyoun AH, Hölzer M, Jühling F, Höner zu Siederdissen C, Al-Arab M, Tout K, Marz M, Middendorf M, Stadler PF, Bernt M. Towards a comprehensive picture of alloacceptor tRNA remolding in metazoan mitochondrial genomes. Nucleic acids research. 2015 Jul 30;43(16):8044-56. In this thesis, details about the contribution to two aspects related to mitochon- drial DNA and its evolution are represented:

1. Accurate annotation of protein-coding (PC) genes in mitochondria, which con- sists of set of methods that have been developed taking into consideration the peculiarity of PC genes in mitochondria. The pipeline uses automatically gen- erated phylogeny-aware prole hidden Markov models that are enhanced based on specic criteria. These methods are published in :

Al Arab M, zu Siederdissen CH, Tout K, Sahyoun AH, Stadler PF, Bernt M. Ac- curate annotation of protein-coding genes in mitochondrial genomes. Molecular phylogenetics and evolution. 2017 Jan 31;106:209-16.

Alexander Donath,Frank Jühling,Marwa Al Arab,Peter F. Stadler Martin Mid- dendorf, Matthias Bernt. Precise Annotation of Protein Coding Genes Bound- aries in Metazoan Mitochondrial Genomes. Molecular biology and evolution. (To be submitted).

2. A new partially local three-way alignment algorithm based on dynamic program- ming to align the breakpoint region in rearrangement in order to nd special signature to dierent rearrangement mechanisms. This work is published in:

Al Arab M, Bernt M, zu Siederdissen CH, Tout K, Stadler PF. Partially lo- cal three-way alignments and the sequence signatures of mitochondrial genome rearrangements. Algorithms for Molecular Biology. 2017 Aug 23;12(1):22.

Chapters Content The next two chapters in the thesis represent biological and technical background related to the subjects detailed in the following chapters along with reviews of already done work in these subjects. Chapter four depicts the set

12 of methods developed to accurately annotate PC genes in mtDNA and shows the obtained results along with the analysis and assessment of these results. Further- more a method for frameshifts detection based on HMM is explained and applied on metazoan data. In Chapter ve, we present a new algorithm for partially local alignment of breakpoint regions in genome rearrangement. Moreover we show the ability of the alignments to detect special signature for rearrangement made by dif- ferent mechanisms. Finally concluding remarks and future works are represented in chapter six.

13 Chapter 2

Biological Background

Mitochondria are double-membrane organelles, with their own genetic material[17], unique to the cells of animal, fungi and plants. In this chapter we present a biological background about the functions, origin, and DNA of mitochondria. Furthermore we give an overview about the particular mechanisms that characterize the evolution of mitochondrial genome (section 2.3 and 2.4).

2.1 Mitochondria Functions and Origin

The main function of mitochondria is the production of energy used by the cell and the entire organism through the generation of ATP supply. The production of ATP is accomplished during the cell respiration process. Mitochondria contribute to addi- tional fundemental functions such as the regulation of metabolism, cell-cycle control, cell development, antiviral responses and cell death [18, 19]. The origin of mitochondia is thought to be emerged more than 1.45 billion years ago, which is the age of the oldest eukaryotic micro-fossils [20]. The similarities with some prokaryotes supported the endosymbiotic origin of mitochondria. Thus, an an- cient nucleated cell engulfed a prokaryotic cell. The host cell assured protection to the engulfed cell, while the latter assured the energy production. Genome sequenc- ing and analyses of mitochondrial DNA have established that the origin of engulfed

14 prokaryote is a primitive bacteria (훼-proteobacteria) [21]. However the origin of the host eukariotic cell is still debated.

2.2 Mitochondrial DNA

Mitochondria contain its own DNA (mtDNA) with highly conserved structure and genes organization [22].

MtDNA in Metazoa The mtDNA in most metazoan organisms is relatively smal- l(about 16500 bp), circular and double stranded (heavy strand and low strand). The mtDNA in Metazoa comprises typically 13 protein coding genes, 22 transfer RNA (tRNA) and 2 ribosomal RNA (rRNA). The mitochondrial genes lack introns, with mostly small or absent intergenic regions [22]. However, in vertebrate cells a non coding region (D-loop) has been shown containing the origin of replication as well as the basic promoters of translation therefore it has been named the control region [23] see Fig. 2-1.

MtDNA in Fungi Fungal mitochondrial genome is either circular or linear molecule that ranges from 16 to 110 kbp in size. Usually fungal mtDNA encodes for 14-19 pro- teins, two rRNAs and 8-24 tRNAs. The variability shown in fungal mtDNA is likely associated with the existence of large intergenic regions including highly variable in- trons and sequence repeats.

2.2.1 Protein-Coding Genes

The protein-coding genes provide instructions for making enzymes. These enzymes catalyze the cell respiration process i.e the oxidation of nutrients to adenosine triphos- phate (ATP). The canonical protein-coding genes in animals mitochondria encode for four protein complexes which are all relatively conserved by negative selection which

15

99

Figure 2-1: Mitochondrial genome of Homo sapien (human)

16 eliminates deleterious mutations. However, the negative selection highly inuences some genes like cytochrome c and cytochrome b while this eect is moderated for ATP synthase [24].

Complex I: Contains seven subunits of NADH dehydrogenase (nad1, nad2, nad3, nad4, nad4l, nad5, nad6 ). These proteins play an important role in electron transfer and in NADH dehydrogenation. Nad1 gene sequence is the most conserved among the other genes in this complex while nad6 is the less conserved.[24].

Complex III : Cytochrome b (cob) is the only component of this complex. The cob protein is the part of the mitochondrial repsiratory chain which contributes to the ATP synthesis[25].

Complex IV: The complex IV encloses generally three subunits of cytochrome c (cox1, cox2 and cox3 ) which are relatively conserved throughout evolution[26]. Cy- tochrome c subunits encode for transmembrane proteins that reside in the mitochon- drial inner membrane and constitute the component of the respiratory chain that catalyzes the reduction of oxygen to water[25].

Complex V: In Metazoa ATP synthase encodes for two protein subunits (atp6 and atp8 ). Their function is the transformation of adenosine diphosphate (ADP) molecules to ATP [25].

2.2.2 TRNA Genes

The transfer ribonucleic acid (tRNAs) are RNAs that help in the translation of mes- senger RNA (mRNA) into amino acid sequence (protein). In Metazoa 22 tRNA are encoded in the mitochondria. For each of the 20 amino acids, only a single tRNA is present, except the tRNAs for leucine and serine which have two tRNAs each. Independent losses of tRNAs, which are compensated by tRNAs imported from

17 the nucleus [27], are frequent in a few clades of Metazoa, e.g., Cnidaria [28], and Chaetognatha [29]. Most tRNAs in organisms of all domains of life have a highly conserved cloverleaf structure. Mitochondrial tRNAs of Metazoa, however, frequent- ly exhibit aberrant structures. For example, lack of D-loops and/or T-loops[30], or complete arms [31, 32]. Furthermore both D- and T-arm have been reported lost for Enoplea[33], for an overview see[14].

TRNA Remolding Mitochondrial tRNAs exhibit a particular mode of evolution known as tRNA remolding or tRNA recruitment. The remolded tRNA harbor one or more point mutations at the level of the anticodon which change its identity[34]. If the accepted amino acid changed, the remolding is called alloacceptor, otherwise it is called isoacceptor. Remolding of tRNAs is likely caused by a duplication-loss mechanism. First, the tRNA is duplicated, then a point mutation occurs in the anticodon which changes the identity of the tRNA and nally the original copy which is identical to the remolded tRNA is lost [34, 35]. For an illustrated example see Fig. 2- 2. Since the mitochondrial tRNA remolding is frequently observed within genomes of Metazoa (e.g. [34]), the detection of these events is essential for accurate mtDNA annotation, rearrangement studies and phylogenetic reconstruction. For details about the remolding detection methods and results see[36].

18

Figure 2-2: Example of a tRNA remolding event. (a) tRNA- X is duplicated, a point mutation occurs to its anticodon triplet "acg" and change it to "aca" the anticodon triplet of tRNA-Y, and the secondary structure is slightly changed. (b) The original tRNA-Y is deleted through evolution

2.2.3 Ribosomal RNA Genes

Ribosomes are molecules formed of protein and ribosomal RNA. The ribosomes con- stitute the machinery of protein synthesis in every living cell[37, 38]. The ribosomes in mitochondria are formed of two subunits: large subunit, and small subunit. The small subunit selects the cognate tRNA for the translation of DNA message from nucleotide to aminoacids[37]. Whereas the large subunit catalyzes the synthesis of peptide bonds in order to polymerize the amino acids transferred by the tRNAs into a polypeptide chain[39].

2.3 Translation

The mitochondrial translation is the synthesis of protein in mitochondria. During this process the ribosome decode messenger RNAs (mRNA) using mitochondrial tRNAs

19 according to the mitochondrial genetic code and form the functional proteins used by the cells. The protein formation in mitochondria is characterized by the unusual genetic code[40], which may even involve a re-interpretation of the universal stop codon UAG translated as Tyr, e.g., in the sponge Clathrina clathrus[41] for a detailed review see[22, 42]. Furthermore, some of the protein genes are overlapping and, in many cases, part of the termination codons are not encoded but are generated post- transcriptionally by polyadenylation of the mRNAs [43]. Another characteristic is the presence of translation frameshifts in the open read- ing frame. The term frameshift (FS) designates a position-specic change of the frame in which the ribosome reads the mRNA sequence. Frameshifts are caused by specifc sequence and/or RNA secondary structure elements that program the ribosome to shift the translation in the upstream direction (programmed +1 frameshift) or in the downstream direction (programmed -1 frameshift)[44, 45]. Several frameshifts have been found in mtDNA of animals in dierent protein-coding genes. The majority of reported cases are a +1 frameshift at position 174 of the nad3 gene which has been described rst for nad3 in ostrich [46]. A frameshift at the same position (nad3 174) has been reported for several turtles and birds[47, 48, 49]. Whereas the African helmeted turtle (Pelomedusa subrufa) hosts a +1 frameshift at a dierent position nad3 135 and further ones in nad4l at positions 99 and 262[50]. Pancake tortoise (Malacochersus torneri ) has been found hosting an insertion in nad4 [49]. Further +1 frameshifts have been reported for other genes in species from various phyla, e.g., dierent sites in cytb of Polyarchis ants[51] and oyster[52], and cox3 and nad6 of glass sponge[53]. Recently, [54] have shown that a +1 frameshift at the 3' end of cox1 and nad6 lead to the use of the standard UAG stop codon.

2.4 Genome Rearrangement and Breakpoints

The genome of animal mitochondria are subject to frequent rearrangements of the gene order. However, there does not seem to exist a unique molecular mechanism for

20 A B C D E A B C D E A E C D E A B -D -C -B

Figure 2-3: Elementary rearrangement events discussed for mitogenomes. From left to right: inversion, transposition, inverse transposition, tandem duplication random loss. Pseudogenisation leading to eventual gene loss is indicated by symbols without borders. Adapted from [66] ○c Elsevier

that. Inversions [55] can be explained by inter-mitochondrial recombination [56, 57]. Similarly, transposition [58] and inverse transposition [59] may also be the result of nonhomologous recombination events [60, 61]. In a tandem duplication random loss (TDRL) event [62, 63], on the other hand, part of the mitogenome, which contains one or more genes, is duplicated in tandem; subsequently, one of the redundant copies of the genes is lost at random. Transpositions can also be explained by a TDRL mech- anism, and there is at least evidence that rearrangements involving the inversion of genes can be explained by a duplication-based mechanism, where the duplicate is inverted [64]. It remains an open question how variable the rates and the relative importance of dierent rearrangement mechanisms are over longer evolutionary time- scales and in dierent clades. While TDRLs leave a clearly identiable trace in the mitogenomic sequence, namely the usually rapidly decaying pseudogenized copies of redundant genes [64], little is known about the impact of other rearrangement mech- anisms. It has been observed, however, that lineages with frequent rearrangements also show elevated levels of nucleotide sequence variation [65]. Mitogenome rearrangements are usually analyzed at the level of signed permu- tations that represent the order and orientation of the genes, see Fig. 2-3. Inferred rearrangement scenarios, i.e., sequences of a minimal number of rearrangement oper- ations that explain the dierences between two gene orders, depend strongly on the operations that are considered [67]. While initially only inversions were considered, more recent algorithms have also been developed for shortest TDRL rearrangemen- t scenarios [68]. In contrast to inversions and transpositions, TDRLs are strongly asymmetric, i.e., when a single TDRL suces to obtain gene order 휋′ from 휋, mul- tiple TDRLs are needed to go from 휋′ back to 휋. Parsimonious TDRL scenarios

21 requiring more than one step are rarely unique [69] and usually a large number of TDRLs with dierent loss patterns are equivalent in circular genomes [70]. As a con- sequence, it is impossible to precisely reconstruct a TDRL rearrangement event from gene order information only. It is indispensable to analyze the underlying sequence data to detect duplication remnants. This begs the question whether reversals or transpositions also leave behind characteristic sequences that provide information on the rearrangement type regardless the analysis of gene order data. This would oer more detailed and more reliable data on the relative frequency of dierent rearrange- ment mechanisms. A breakpoint is the region between two consecutive genes in the reference genome that are not consecutive, or one of them is inverted, in the derived genome. It is easy to determine the approximate location of breakpoints in relation to annotated genes by the comparison of gene orders. Tools such as CREx [8] also infer the putative type of rearrangement operations by assuming a most parsimonious re- arrangement scenario. The exact localization of breakpoints in the genomic sequence is a necessary prerequisite for any detailed investigation into sequence patterns that might be associated with genome rearrangements. Currently, only a few approaches use sequence data for genome rearrangement analysis. Parsimonious rearrangement scenarios w.r.t. cut-and-join operations and indels that minimize the dierence in the size distribution of the breakpoint region- s from expected values are computed in [9]. For next generation sequencing data, dierences in the alignments of the reads to a reference genome can be used to un- cover rearrangements [10]. Alignments of synteny blocks can be extended into the breakpoint region in order to delineate its position as precisely as possible [71]. This approach does not handle duplications and their possible remnants.

22 Chapter 3

Technical Background

The computational biology techniques used in the rest of chapters are reviewed in this chapter. In particular, sequence alignment used for the breakpoints problem, in addition to prole alignment and phylogenetic trees which are used in the protein- coding genes annotation problem. Finally, in section 3.4 we review the available methods to annotate protein-coding genes in mitochondria.

3.1 Sequence Alignment

Sequence alignment is an essential task in computational biology. In particular se- quence alignment is used for inferring phylogenetic relationships and predicting pro- tein structure and function. Aligning sequences whether of DNA (alphabet of four letters) or protein (alphabet of 20 letters) permits to show the similarity among the sequences. This similarity is measured by a scoring scheme which rewards matches among characters and penalize mismatches and gaps (indels).

Dynamic Programming Dynamic programming (DP) is one of the fundamental algorithms in computational biology. DP is broadly applied in sequence alignment, RNA secondary structure prediction and gene prediction. DP algorithms solve com- plex optimization problems by dividing the problem to smaller sub-problems and the

23 optimal solutions of these sub-problems are reused to solve the overall problem [72]. The sequence alignment with DP algorithm provides an optimal exact alignment with polynomial time in number of sequences. Therefore, such alignments are only possible for a limited number of sequences with short length.

3.1.1 Pairwise Alignment

The pairwise alignment is the alignment of only two sequences in a way to maximize their similarities. The pairwise alignment can be global, local and semi-global or glocal.

Global Alignment The global alignment allows to nd the global similarity across the entire sequences. The Needlman-Wunsch (NW) algorithm [73] is the rst algorith- m which applies the dynamic programming approach to pairwise sequence alignment.

To align two sequences 퐴 and 퐵, a DP matrix 푀 is built with the size |퐴| × |퐵|. 푀 is lled based on the following recurrence with 훿 being the scoring function for the pairwise alignment :

푀푖−1,푗−1 + 훿(퐴푖, 퐵푗) ⎧ 푀 ’-’ (3.1) 푖,푗 = max ⎪푀푖−1,푗 + 훿(퐴푖, ) ⎪ ⎨⎪ 푀푖,푗−1 + 훿(’-’, 퐵푗) ⎪ ⎪ ⎪ For example let 퐴 be the sequence⎩퐴퐶퐺푇 푇 and 퐵 the sequence 퐶퐶퐴퐴푇 푇 . And the scoring function 훿 is +2 for 푚푎푡푐ℎ, −1 for 푚푖푠푚푎푡푐ℎ and −1 for a 푔푎푝. The dynamic programming matrix based on the NW algorithm is given in the Table 3.1. The backtracking starts at the end (bottom right) of the matrix and continues till the beginning (top left)of the DP matrix. Thus the optimal alignment is reconstructed from the back traced path. The optimal alignment with score of +3 is the following:

24 A ACG-TT B CCAATT

- A C G T T

- 0 -1 1-2 -3 -4 -5 C -1 -1 1 0 -1 -2 C -2 -2 1 0 -1 -2 A -3 0 0 0 -1 -2 A -4 -1 -1 -1 -1 -2 T -5 -2 -2 -2 1 1 T -6 -3 -3 -3 0 3 Table 3.1: Dynamic programming matrix for the Needelman-Wunsch algorithm

Local Alignment Aligning sequences locally is based on nding local similarity among sequences i.e. parts of the whole sequences with no penalty for the unaligned portions of the sequences. For example, alignment of DNA sequences where exons are similar while introns are not. The alignment can be extended to the whole sequences in case they are entirely similar. The Smith-Waterman algorithm[74] applies the dynamic programming to align two sequences locally. The adopted recurrence is slightly dierent from the NW recurrence:

0 ⎧ ⎪푀 −1 −1 + 훿(푆 , 푇 ) ⎪ 푖 ,푗 푖 푗 (3.2) 푀푖,푗 = max ⎪ ⎪ ’-’ ⎨⎪푀푖−1,푗 + 훿(푆푖, ) ’-’ ⎪푀푖,푗−1 + 훿( , 푇푗) ⎪ ⎪ ⎪ In Smith-Waterman algorithm the⎩ backtracking starts at the maximum score in

25 the DP matrix and stops at the rst occurrence of 0. The optimal local alignment of the two sequences 푆 and 푇 , where 푆 is the DNA sequence 퐺퐴퐴퐶퐺푇 퐴퐺퐺퐶퐺푇 퐴푇 and 푇 the sequence 퐴푇 퐴퐶푇 퐴퐶퐺퐺퐴퐺퐺퐺, is the following: S ACGGAGG T ACGTAGG

Semi-Global Alignment The semi-global alignment is a partially local alignment. The semi-global alignment is mainly used1 to nd the similarities between a sequence and a smaller sequence. Thus the insertions and deletions at the beginning/end of one of the sequences are not counted. Considering the alignment of a query sequence

푄 : 퐴퐶퐶퐶, to a reference sequence 퐹 : 퐴퐴퐴퐶퐶퐶, with free gaps at the beginning of 푄. The optimal alignment is the following. F AAACCC Q --ACCC

The recurrence in the semi-global alignment is similar to the NW algorithm. Free ends gaps, however, require initializing the1 rst row/column with zeros (free starting gaps). Furthermore, the backtracking starts at the maximum of the last row/column (free ending gaps).

3.1.2 Multiple Sequence Alignment

Multiple sequence alignment (MSA) involves the alignment of more than two se- quences. The sequence alignment with dynamic programming algorithms (see 3.1), provides an optimal exact alignment with polynomial time in number of sequences. Thus, such alignments are only possible for a limited number of sequences with short length. Therefore, heuristics and approximation methods have been applied in tools

26 such as MAFFT[75], CLUSTAL[76], PASTA[77] and HAlign[78]. Furthermore par- allel computing has been introduced to reduce the actual running time for large scale alignments[79]. However, Three-ways alignment using dynamic programming to yield an exact alignment for relatively small sequences remains computationally feasible.

3.2 Profile Hidden Markov Models

Annotating protein sequences requires rstly comparing the new query sequence to a collection of target sequences. A simple strategy consists of comparing the query sequence to all the sequences in the target database in a pairwise manner one by one. The most sensitive algorithms are the exact matching dynamic programming alignment Smith-Waterman[74] and Needleman-Wunsch[73]. Exact matching algo- rithms, however, are slow and computationally expensive. The speed problem has been solved by heuristic approaches. Tools based on this approach has been widely used in homology search (BLAST, FASTA[80]). Despite the high sensitivity in nding closely related sequences, the performance of pairwise methods drops when the query is remotely homologous to the other sequences in the database[81]. Furthermore the measures provided by this approach represent the similarity to just the best scoring alignment. Finding remotely homologous sequences with reliable and robust statisti- cal measures has been achieved by Hidden Markov Models (HMMs). In 1994, HMMs were applied to computational biology for proteins prole modeling[82]. Prole H- MMs are built from multiple sequence alignment of homologous sequences. Thus, the obtained model contains information about all the sequences in the alignment, which increase the ability to detect new remotely homologous sequences. In addition, prob- abilities are given at each column of the alignment for the variation of amino acids and inserts and deletions at this column. Therefore, conserved and non-conserved po- sitions can be distinguished thanks to the probabilistic HMM model[83, 84]. Despite the increase of sensitivity toward the detection of distantly related sequences, a main obstacle remained, which is the decrease of computational speed. This complicates

27 the performance of large-scale sequences analysis. The rst implementations of prole HMMS were at least 100 fold slower than BLAST. However, considerable advances in HMM algorithms and implementations, resulted from several optimization and accel- eration eorts[85, 86, 87, 88]. , have led to packages such as HMMER 3 with as much speed as BLAST and better sensitivity and sensibility regarding the search of distantly homologous sequences. Details about HMMER program are in the following.

HMMER HMMER is a software package which allows building probabilistic HMM from a multiple sequence alignment (hmmbuild). In addition HMMER allows to align a set of sequences to a model using hmalign. Furthermore HMMER allows the user to compare protein models to protein sequences in both ways. Therefore one can search a protein HMM model against protein sequences database (hmmsearch) and on the other hand scan a protein sequence against a database of HMM models (h- mmscan). Other programs included in the package are described in the user guide of the software[89].

3.3 Phylogenetic Tree

The analysis of relationships between sequences has become a major tool for deter- mining the history of organisms evolution. The phylogeny explains how species or other groups evolved from their common ancestors. A phylogenetic tree is a diagram that depicts the relationships among organisms or sequences.

Background The tree can be either rooted or unrooted Fig. 3.3. A rooted tree has a root node which is the most recent common ancestor for all the species in the tree. The tree branches represent the lineage between the ancestors and the descendants nodes. The nodes are either internal or external (leaves). Internal nodes represent the ancestor of the descendants nodes. While the leaves nodes represent the extant species. Two species or sequences are closely related if their lowest common ancestor is recent and are less closely related if their lowest common ancestor is less recent.

28

Figure 3-1: Rooted and unrooted tree

Phylogeny in History Grouping and classifying organisms were practiced since ancient times. Theorists, since a long time ago, tended to group animals and plants based on their visible characteristics and morphology. For example Aristote grouped animals based on their morphological aspects such as existence of wings, or number

of legs etc[90]. Later in the 18푡ℎ century, the Swedish botanist Carlous Linnaeus p- resented a hierarchical classication or taxonomy for groups and organisms such as species within genera, genera within orders, orders within classes and classes within kingdoms[91]. During the last 60 years and in the light of the advances in the molec- ular biology tools, the scientists started using the molecular data such as DNA and protein sequences from dierent species to study the phylogeny of these species[83].

MtDNA and Phylogeny Fast evolving sequences can be used to study the evo- lution of species separated up to millions of year. Whereas, older events need slowly evolved sequences to be studied. Both nuclear and molecular DNA have been used in phylogenetic construction. The mtDNA in general evolve in a faster pace than

29 the nuclear DNA. However, dierent loci in the mtDNA evolve in dierent rate, which makes the mtDNA suitable for phylogenitic studies in multiple time scales[90]. Protein-coding gene sequences in particular harbor characteristic signals which have been proved to be useful for resolving disputed phylogenetic relationships [92, 93], but the extraction of deep phylogenetic signals from protein-coding genes remains challenging [6].

NCBI Tree [94] The NCBI tree is a taxonomy tree, meant to be consistent with phylogeny, built based on the NCBI taxonomy database[94]. The database, which is available online for the users, is based on a large amount of existing sequences in Genbank, EMBL and DDBJ repositories. The database contains the scientic names of organisms as well as the taxonomic lineages of the available sequences. The power of this database comes from the continuous maintenance of the phylogenetic information by manual curation according to the latest ndings in the literature.

3.4 Genome Annotation for Mitochondria

Genome annotation allows to nd and locate the dierent features of the genome distributed throughout the genome sequence. These features include genes that code for protein, RNA genes, and the non coding regions. Given the rapid increase in number of sequenced mitogenomes, development of pipelines that are able to provide fast, automatic, accurate and reproducible annotations requiring little or no manual curation becomes imperative.

3.4.1 Annotation of Protein-coding Genes

Protein-coding genes in mitochondria are characterized by several peculiar features that complicate the genes nding and annotation.

30 ∙ The use of unusual genetic codes[95], which may even involve a re-interpretation of the universal stop codon UAG translated as Tyr, e.g., in the sponge Clathrina clathrus [41], for a detailed review see [22, 96]. ∙ The existence of overlap between genes [22]. ∙ Ill-dened 3' ends with truncated stop codons which are sometimes not entirely encoded but completed by RNA polyadenylation [97]. For an overview about incomplete stop codons see [98]. ∙ The use of non-canonical start codons, e.g., in cox1 of insect mitogenomes (e.g., UCG), see [99]. ∙ The presence of frameshifts in the open reading frame described in 2.3

3.4.2 Available Pipelines

A number of preliminary tools for annotating mitogenomes are available:

∙ DOGMA [5], a semi-automatic web tool for annotating mitogenomes and chloro- plast genomes. The ends of the genes are dened based on the NCBI genetic code tables. ∙ MITOS [6], a fully automated pipeline for annotating metazoan mtDNA. The start and stop of the genes are searched in the proximity of the ends found by blast. ∙ Mitoannotator [7], a fully automated pipeline for annotating sh mtDNA. The ends of genes are determined based on set of predened rules which t with the shes genome.

3.4.3 Shortcomings of Currently Available Pipelines

All the mentioned tools use BLAST[4] for annotating protein-coding genes. The im- plemented simple strategies have several shortcomings.

31 ∙ Inconsistency of Databases: Incorrect annotations in reference databases such as RefSeq [6] require either additional manual curation to obtain an unbiased high quality set of trusted query sequences for the blast search or automatic methods to cope with misannotated or biased queries at the level of the search results. A manual curation step of the database, as used e.g., in DOGMA and Mitoannotator, has the disadvantage that updates of the query database are labor intensive and therefore it is dicult to keep pace with the growth of the available data. So far all tools and databases that require any manual curation, such as MitoZoa [100], are not updated anymore or were not sustain- able. However, an automatic strategy is pursued in MITOS. Instead of curating the queries, the BLAST hits are aggregated and conicts are resolved by what essentially amounts to majority voting. The results of the current version of MITOS occasionally need manual curation of the start and stop positions of protein-coding genes [7, 101].

∙ Quality measures: The statistical measures for the quality of the annotations are either dicult to interpret (e.g., the aggregated e-values presented by MITOS) or do only represent the similarity with respect to a single sequence[102].

∙ Missed genes: Despite the high sensitivity in nding closely related sequences, the performance of pairwise methods drops when the query is remotely homolo- gous to the other sequences in the database[81]. Consequently, some very short and poorly conserved genes, in particular atp8, tend to be missed by BLAST- based approaches[102].

32 Chapter 4

Accurate Annotation for Protein-Coding Genes in Mitochondria

Mitochondrial genome sequences are available in large number and new sequences become published nowadays with increasing pace. Fast, automatic, consistent, and high quality annotations are a prerequisite for downstream analyses. In this chapter we present an automated pipeline for fast de-novo annotation of mitochondrial protein-coding genes. The annotation is based on enhanced phylogeny- aware hidden Markov models (HMMs). The pipeline builds taxon-specic enhanced multiple sequence alignments (MSA) of already annotated sequences and correspond- ing HMMs using an approximation of the phylogeny. The MSAs are enhanced by xing unannotated frameshifts, purging of wrong sequences, and removal of non- conserved columns from both ends. A comparison with reference annotations high- lights the high quality of the results. The frameshift correction method predicts a large number of frameshifts, many of which are unknown. A detailed analysis of the frameshifts in nad3 of the Archosauria-Testudines group is presented in 4.2.4.

33 Start

annotated sequences and phylogenetic tree

building initial models

correction of frameshifts

removal of non-fitting sequences

removal of non-conserved columns

new sequences annotation

End

Figure 4-1: Overview of the workow

4.1 Overview of the Workflow

The starting point is a collection of known mitochondrial protein-coding genes and a phylogenetic tree. An initial MSA and corresponding HMM is constructed for each protein-coding gene and each clade of the given phylogeny (details in Section 4.1.2). The constructed models for the root of the phylogeny, i.e., Metazoa, are enhanced by (i) the correction of unannotated frameshifts and (ii) the removal of poorly aligned sequences and poorly conserved columns from both ends of the MSA (Fig. 4-1).

34 4.1.1 Data Sets

The initial set of training data consists of the mitochondrial protein-coding genes

in RefSeq [103], more precisely the annotated CDS features. The data of the 3842 complete metazoan mtDNA sequences that are contained in RefSeq release 63 serve

as training data set, whereas the data of those 926 species that have been added in RefSeq release 69 aord a large collection of test data. For details on the taxonomy distribution of both training and test data see Fig. 4-6. The phylogeny-aware model building process described below requires a binary phylogenetic tree. We used the NCBI taxonomy database [104] (downloaded 06-Mar-2014) as starting point. Mul- tifurcations were replaced as in [36]: The branching pattern is obtained from the neighbor joining tree computed from the alignment of the nad5 gene sequences of one randomly selected representative of each child subtree of the multifurcation point. The nad5 gene was chosen because it has shown good performance for phylogeny reconstruction [105].

4.1.2 Construction of Initial Models

The given binary phylogenetic tree 푇 is used as a guide tree for the progressive construction of a multiple sequence alignment (MSA) for each node of 푇 . To this end we use HMMER version 3.1b2 [88]. For each leaf, i.e., for the species included in the initial set of sequences, a (trivial) HMM is constructed from the single input sequence

using hmmbuild. For the progressive step consider an interior node 푖 of 푇 and its two

children 푘 and 푙 for which the corresponding HMMs 퐻푘 and 퐻푙 and MSAs 퐴푘 and 퐴푙 are already available. We rst compare the set of sequences 푆푙 in the subtree below 푙 with 퐻푘 and correspondingly 푆푘 with 퐻푙 using hmmsearch and determine which of the two models scores better in these comparisons, i.e., which of the two models yields a better mean bit score. An MSA for the sequences in 푆푖, which contains the sequences of 푆푘 and 푆푙, is constructed on the basis of the better model using hmmalign. This is

35 implemented by using the better model to extend the alignment corresponding to the better model with the sequences of the other subtree. If, for instance, 퐻푘 scores better the sequences in 푆푙 are added to 퐴푘 using 퐻푘. Finally the model 퐻푖 is constructed from 퐴푖 with hmmbuild. Traversing 푇 in a bottom-up fashion results in an MSA and a corresponding HMM for each node. In particular, we obtain the most general model at the root of 푇 .

4.1.3 Enhancement of the Models

The construction of the root model (Metazoa) for each gene obtained from the previ- ous steps was guided by phylogeny. However artifacts were included in the alignments while building the initial models. The main cause is that the input sequences contain wrong or misannotated sequences, usually sequences that were annotated too short or too long. Another frequent problem are unannotated frameshifts. To cope with these eects we modify the MSA in two ways: (i) identify and correct unannotated frameshifts, (ii) remove poorly conserved sequences and poorly conserved columns at both ends of the alignment.

4.1.3.a Correction of Frameshifts

In order to identify frameshifts we construct all possible frameshifted variants of each gene sequence. Each variant is generated as the conceptual translation of a nucleotide gene sequence from the training set where a nucleotide at position 푗 is deleted. An iteration over all possible values of 푗 yields all variants. Furthermore also all variants are generated where two consecutive nucleotides are removed. This makes it possible to identify shifts of two nucleotides and, indirectly, also missing nucleotides. Of these frameshift-translations we retain only those that do not include an early stop codon, i.e a stop codon before the end of the amino acid sequence. These are scanned with the query against the root HMM using hmmsearch. Since part of the frameshifted sequence is translated in a wrong reading frame, this part will not t to the model

36 100 annotated

10

1

count 100 new

10

1 0 50 100 150 Δs

Figure 4-2: Distribution of bit score dierences (∆푠) of frameshifted and original sequence for annotated and newly detected putative frameshifts. The bold line at ∆푠 = 25 marks the chosen threshold. Count is in log scale and thus lead to a substantial decrease of the bit score. The Grubbs test [106] with

푝 ≤ 0.01 is employed to iteratively identify outliers from the distribution of the bit score dierences of the original and variant sequences. These are plausible candidates for sequences frameshifts.

The outlier test predicted 325 putative frameshift candidates. Certainly, not all of these outliers are real frameshifts, but signicant outliers are also possible for small bit score dierences (∆푠). Such instances are caused, for example, by sequences that are not homologous to the query at all. A comparison of the ∆푠 values of previously annotated cases and the novel candidates, see Figure 4-2, suggested a threshold value of ∆푠 = 25. At this level all previously annotated frameshifts are retained and the overwhelming majority of frameshift candidates with poor ∆푠 are rejected.

4.1.3.b Removal of Poorly Conserved Sequences and Columns

After the correction of the frameshifts, sequences that do not t in the alignment and weakly conserved ends of the MSA are pruned. The row removal is done with the method implemented by OD-Seq [107]. OD-Seq uses a gap based distance counting

37 gene equal ∆± UP OP dierent atp6 925 (0.99) 0 (0.00) 0 (0.00) 5 (0.01) 0 ( 0.00) atp8 874 (0.97) 0 (0.00) 16 (0.02) 12 (0.01) 1 ( 0.00) cob 924 (0.99) 0 (0.00) 0 (0.00) 14 (0.01) 0 ( 0.00) cox1 928 (0.99) 0 (0.00) 0 (0.00) 6 (0.01) 0 ( 0.00) cox2 926 (0.99) 0 (0.00) 0 (0.00) 5 (0.01) 0 ( 0.00) cox3 923 (0.99) 0 (0.00) 1 (0.00) 4 (0.00) 0 ( 0.00) nad1 927 (1.00) 0 (0.00) 0 (0.00) 4 (0.00) 0 ( 0.00) nad2 921 (0.99) 0 (0.00) 3 (0.00) 7 (0.01) 0 ( 0.00) nad3 982 (1.00) 0 (0.00) 0 (0.00) 2 (0.00) 0 ( 0.00) nad4 925 (0.99) 0 (0.00) 0 (0.00) 7 (0.01) 0 ( 0.00) nad4l 911 (0.99) 0 (0.00) 9 (0.01) 3 (0.00) 0 ( 0.00) nad5 933 (1.00) 0 (0.00) 0 (0.00) 4 (0.00) 0 ( 0.00) nad6 917 (0.98) 0 (0.00) 12 (0.01) 4 (0.00) 0 ( 0.00) Protein 12 016 (0.99) 0 (0.00) 41 (0.00) 77 (0.01) 1 ( 0.00) Table 4.1: Results of the enhanced model on Metazoa, for each gene. Equal represents a gene predicted by the model and the same gene is annotated in RefSeq. ∆± represents a gene with dierence in strand between the annotation and RefSeq. UP are under-predicted genes. OP over-predicted genes. Dierent are genes predicted within the coordinates of other genes in RefSeq the number of positions that have a gap in one sequence and not in the other. With this method rows are removed if they are outliers in the distribution of the mean gap distances with regard to the rest of the sequences. This test is implemented by a z-score threshold which was set to 3.5 standard deviations except for nad5 and cox3 genes where a value of 6 standard deviations was chosen because otherwise more than 1% of the sequences would have been removed. The majority of removed rows belongs to Mollusca, Arthropoda, Nematoda, and Tunicates. The percentage of removed rows in each taxon, however, is gene specic. It never exceeds 0.6% of the sequences (Fig. 4-5). Weakly conserved ends of the alignment are removed by removing all columns with a low posterior probability, by default 푝ˆ ≤ 0.4, proceeding inwards from both ends of the alignment until 푝ˆ > 0.4. This strategy removes up to 35% of the alignment (see Fig. 4-4). However, most of removed columns are largely composed of gaps Fig. 4-3; the average of gaps in removed columns is 92%. A nal model is built from the cleaned alignment.

38 1.00 ● ● ● ● ● ● ●

● ● ● 0.75 ● ● ● ● ●

● end 0.50

gene atp6 0.25 atp8 cob cox1 ● cox2 0.00 cox3 nad1

1.00 ● ● ● ● ● ● ● ● ● ● ● ● ● nad2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● nad3 ● ● ● ● ● ● ●

gaps in removed columns gaps in removed nad4 ● ● ● ● ● ● ● ● nad4l ● ● ● 0.75 nad5

● nad6 ● ● ● ●

● ● ● ● ● ● start 0.50 ●

● ● ● ●

● ● ● ● ● ● ● 0.25 ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● 0.00 ● ● ● cob atp6 atp8 cox1 cox2 cox3 nad1 nad2 nad3 nad4 nad5 nad6 nad4l gene

Figure 4-3: Fraction of gaps in removed columns from the start and the end of the alignments. The mean is 0.92, the median is 0.997

39 taxa equal ∆± UP OP dierent Annelida 129 (0.98) 0 (0.00) 2 (0.02) 0 (0.00) 0 ( 0.00) Anthozoa 139 (0.97) 0 (0.00) 2 (0.02) 1 (0.01) 0 ( 0.00) Arthropoda 3053 (0.99) 0 (0.00) 15 (0.01) 20 (0.01) 0 ( 0.00) Calicogorgia 13 (0.87) 0 (0.00) 1 (0.07) 1 (0.07) 0 ( 0.00) Cestoda 36 (1.00) 0 (0.00) 0 (0.00) 0 (0.00) 0 ( 0.00) Craniata 7679 (1.00) 0 (0.00) 1 (0.00) 37 (0.00) 0 ( 0.00) Demospongiae 26 (0.93) 0 (0.00) 2 (0.07) 0 (0.00) 0 ( 0.00) Eleutherozoa 91 (0.99) 0 (0.00) 0 (0.00) 1 (0.01) 0 ( 0.00) Mollusca 383 (0.96) 0 (0.00) 1 (0.00) 14 (0.04) 0 ( 0.00) Monogenea 11 (0.92) 0 (0.00) 1 (0.08) 0 (0.00) 0 ( 0.00) Nematoda 278 (0.93) 0 (0.00) 15 (0.05) 3 (0.01) 1 ( 0.00) Nemertea 91 (1.00) 0 (0.00) 0 (0.00) 0 (0.00) 0 ( 0.00) Trematoda 36 (1.00) 0 (0.00) 0 (0.00) 0 (0.00) 0 ( 0.00) Tunicata 51 (0.98) 0 (0.00) 1 (0.02) 0 (0.00) 0 ( 0.00) Table 4.2: Results of the enhanced model on Metazoa, for each taxa. Equal represents a gene predicted by the model and the same gene is annotated in RefSeq. ∆± represents a gene with dierence in strand between the annotation and RefSeq. UP are under-predicted genes. OP over-predicted genes. Dierent are genes predicted within the coordinates of other genes in RefSeq

20

15

type

end

10 start removed columns % removed

5

0

atp6 atp8 cob cox1 cox2 cox3 nad1 nad2 nad3 nad4 nad4l nad5 nad6 gene

Figure 4-4: Percentage of removed columns, from the start and the end of the align- ments, for each gene

40 0.105 0.052

2.0 0.158 gene

0.263 atp6

atp8 0.079 cob 1.5 cox1

0.525 cox2

cox3

nad1 0.236 1.0 0.158 nad2

removed rows % rows removed 0.079 0.026 nad3 0.053 0.131 0.026 0.577 0.158 nad4 0.105 0.053 0.262 nad4l 0.158 0.157 0.5 0.341 nad5 0.026 0.052 0.184 0.184 0.131 0.079 nad6 0.026 0.053 0.026 0.194 0.026 0.052 0.079 0.026 0.079 0.105 0.026 0.105 0.053 0.026 0.026 0.052 0.236 0.21 0.052 0.105 0.158 0.079 0.028 0.026 0.026 0.079 0.138 0.053 0.026 0.055 0.026 0.026 0.21 0.026 0.026 0.028 0.026 0.158 0.158 0.052 0.052 0.053 0.026 0.026 0.105 0.026 0.026 0.105 0.026 0.026 0.105 0.052 0.105 0.026 0.026 0.079 0.026 0.026 0.079 0.026 0.055 0.053 0.026 0.026 0.026 0.026 0.026 0.079 0.0 0.026 0.026 0.026 0.026 0.026 0.026 0.026 0.026 0.026 0.026 0.026 0.026 0.026 Rotifera Bryozoa Tunicata Cestoda Craniata Annelida Mollusca Calcarea Anthozoa Hydrozoa Trichoplax Nematoda Trematoda Cyclocoela Scyphozoa Arthropoda Sagittoidea unclassified Typhlocoela Monogenea Eleutherozoa Onychophora Rhabditophora Demospongiae Acanthocephala taxa

Figure 4-5: Taxonomic distribution of removed rows depicted by genes and taxonomic groups. The bars show the percentage of removed rows for the corresponding group. The colors represent the percentage of removed rows for the corresponding gene

41 1000

data

test count training

10 Rotifera Bryozoa Tunicata Cestoda Craniata Annelida Mollusca Cubozoa Calcarea Anthozoa Hydrozoa Nemertea Trichoplax Nematoda Tardigrada Trematoda Entoprocta Cyclocoela Scyphozoa Arthropoda Sagittoidea Pelmatozoa unclassified Typhlocoela Monogenea Calicogorgia Brachiopoda Xenoturbella Scalidophora Eleutherozoa Onychophora Hexactinellida Pterobranchia Acoelomorpha Enteropneusta Rhabditophora Demospongiae Acanthocephala Cephalochordata taxa

Figure 4-6: Taxonomy of training and test data sets in log scale

42 4.1.4 Annotation

The protein-coding gene annotations are obtained from the best hits of the search against the protein models. Therefore the annotation process starts with the trans- lation of the complete unannotated mitogenomes in all six reading frames. Each of the six conceptual translations is then scanned against the nal protein models with hmmscan. From the output the bit score and the coordinates are extracted for all hits with 퐸 ≤ 10−3. To accommodate known overlaps between mitochondrial genes [22], an 표푣푒푟푙푎푝 < 20% of the length of the shorter of two adjacent genes is tolerated in this step. For larger overlaps, the hit with larger 푒-value is discarded, in case of equality the hit with lowest bit score is discarded.

4.1.5 Improvement of Start and Stop

The ends of the annotated genes are rened based on the method implemented in MITOS[6] in addition to a probabilistic method implemented in MITOS2[108]. The rst method consists of searching for start and stop codons, dened by the NCBI genetic tables, in the proximity of the predicted ends. The latter is more sophisticated and takes into account the following inuences: i) the estimated distance to the start or end of the gene (훿), ii) the probability to be a start or stop codon (휑), iii) the probability of the resulting gene length (휆), iv) and the exclusion of annotated adjacent tRNAs.

Let the start and stop positions with respect to the sequence be 푠푠 and 푠푒 respec- tively. And 푞푠 and 푞푒 the corresponding positions with respect to the query model.

For each position 푝 (푠푠 ≤ 푝 ≤ 푠푒) the relative start position and relative stop

43 position are given by:

푞푒 − 푞푠 푟푠(푝) = 푞푠 + × (푝 − 푠푒) (4.1) 푠푒 − 푠푠 푞푒 − 푞푠 푟푒(푝) = (푙푞 − 푞푠) + × (푝 − 푠푒). (4.2) 푠푒 − 푠푠

Where 푙푞 is the length of the hit in the query sequence. The values 훿푠(푝) and 훿푒(푝) are the normalized relative start and stop position respectively.

휆(푙) is the probability to observe a gene with length 푙, where 푙 is the length obtained for chosen start and end positions. 푙 is calculated as 푝-value:

min(퐿 , 퐿 ) 휆(푙) = 2 × ≥ ≤ , (4.3) 퐿≥ + 퐿≤

Where 퐿≥ and 퐿≤ are the number of species in RefSeq63 that have the same genetic code and contain a gene of the same type that has a length of at least (re-

spectively at most) 푙.

The probability of a codon to be a start codon (휑푠(푝)) is assessed from the fre- quency of being used as a start codon in RefSeq63. To allow for annotation errors codons with frequency below 0.01 are not considered. Likewise, the probability of a codon to be a stop codon (휑푒(푝)) is calculated, but all codons that are annotated as inner codon with a frequency of at least 0.001 are ignored. The latter probabilities are also computed for incomplete stop codons (T and TA). For incomplete stop codons [109] the frequency is corrected by a malus, i.e., divided by three if one nucleotide is missing (incomplete stop codon: UA) and divided by 12 if two nucleotides are miss- ing (incomplete stop codon: U). The codon frequencies are determined separately for each set of species that share the same genetic code.

A pair (푠, 푒) of start and stop position that maximizes the product of the selection factors is chosen:

arg max 훿푠(푠) · 휑푠(푠) · 훿푒(푒) · 휑푒(푒) · 휆(푒 − 푠 + 1), (4.4) 푠∈푆,푒∈퐸

Where 휑 and 휆 are the values determined for the corresponding gene and genetic

44 code.

In order to reduce the number of pairs (푠, 푒) that need to be evaluated only positions 푝 in 푆 (respectively 퐸) are considered for which 휑푠(푝) > 0 (respectively

휑푒(푝) > 0).

4.1.6 Benchmarking

In order to benchmark our annotation we compared the obtained predictions with the annotation available in RefSeq69. For each of our predictions the feature of the

RefSeq annotation that overlaps most but by at least 10% is determined. A predicted gene is considered as equal if the overlapping pair is of the same gene, dierent if the pair consists of dierent genes, and over-predicted (OP) if no such overlap exists. Furthermore, annotated genes in RefSeq that are not included in any such pair are considered as under-predicted (UP). To assess the quality of the generated alignments after the improvement steps, we compare the generated alignments separately with the alignments before improve- ment, the alignments obtained by MAFFT[75], and UPP [110]. As quality measures we employed the average percent pairwise alignments identity (APPI), the most unrelat- ed pairwise identity (MUPI) as dened in the Alistat package [111], and the sum of pairs score implemented (SP) in MUSCLE [112]. For a pairwise (sub)alignment of the

MSA, denote by ℓ1 and ℓ2 the length of the two sequences and let 푞 be the number of identities in the alignment. Then APPI is the average of the ratio, 푞/푚푖푛(ℓ1, ℓ2), MUPI is the minimum of 푞 over the entire alignment, and SP is the sum of scores of all pairwise (sub)alignments in the MSA.

4.2 Quantitative Analysis

In order to evaluate the method, the obtained HMMs were used to annotate the mitochondrial protein-coding genes in the 926 species in RefSeq release 69 which

45 pppp p ppppp ppp E E E

Figure 4-7: Alignment of atp8 in Ophagostomum columbianum with its closely related species

were not included in the data set used to build the models (RefSeq release 63). On

an an Intel○R CoreTM2 Quad CPU Q9400 at 2.66GHz with 8 GB of memory the protein-coding genes can be annotated in less than 30 seconds in each of the 926 metazoan mitogenomes.

4.2.1 Annotation Quality

A comparison of our approach with RefSeq69 shows an agreement in 12013 out of 12132 (99%) cases. Of the remaining, 77 cases are over-predicted genes. As we will discuss below, at least 59 of them are true positives that are missing in the RefSeq annotations which leaves no more than 18 false positives. Additionally, there are 41 underpredicted genes, of which at least one is a true negative, 38 cases can be found by scanning the mitogenomes against more specic models (phylum, class, order, . . . ). Finally, there is a single case in which our annotation diers from RefSeq. This case is an atp8 in the strongylid worm Ophagostomum columbianum (NC_023933) that

is predicted in the 5' region of a cox1 in RefSeq. Despite a reasonable 푒-value of 6.5 × 10−5 this case is likely a false positive since an MSA with closely related species does not show conservation (see Fig. 4-7). Furthermore the known mitogenomes of Strongylida lack this gene. Although the enhanced models missed the annotation of

2 cases and over annotated 18 genes, the method corrected RefSeq in 60 cases.

46 Over-predicted genes Among the 77 OP, 53 cases have an 푒-value ≤ 10−7. A- mong these hits 35 cases were certainly true predictions, since corresponding gene or misc_feature entries are present in the GenBank les but CDS features were missing whereby they are ignored by our GenBank parser. In 14 of these cases the annotations contained hints to pseudogenes. In the other 17 cases the region is anno- tated as non-coding, i.e., no gene was annotated, by RefSeq. Since the mitogenomes are compact and the alignment of these genes to the closely related species based to the general HMMs showed good quality, see example in Fig. 4-8 , we are condent that these 17 cases are either pseudogenes or functional genes missed in the RefSeq annotation. The remaining high-scoring OP hit was the gene nad4l in the accession NC_024927. Here the product qualier of GenBank entry incorrectly reads NADH dehydrogenase subunit 2 instead of NADH dehydrogenase subunit 4l. Among the remaining 24 cases with an 푒-value > 10−7, six hits can be considered as true since they are either annotated by RefSeq as pseudogenes or they are annotated on the genome but no CDS is given.

Also note that, 29 of the OPs (all with 푒-values≤ 8.210−6) are clustered in a few taxa: 14 cases in Phasianus colchicus (NC_024152) which has been removed in recent releases of RefSeq and 15 cases in three unpublished so called minichromosomes of Liposcelis entomophila (NC_025503, NC_025504, and NC_025505).

In summary only 18 predictions remain to be considered as OP. The 18 OP cases are shorter than the homologous query sequences and are either located in an unanno- tated part of the genome or inside other features (control regions or D-loop). Multiple sequence alignments with closely related species support that these predictions are false positives. For details about OP by accession see A.1.

Thus more than 75% of the OP cases are errors or misannotations in the reference which highlights the advantage of our method to overcome inconsistencies and errors in the reference database.

Under-predicted genes The general models missed the prediction of 41 genes (UP genes). In one case, the cox3 gene in Loxioides bailleui (NC_025626), an interval

47 pfppp pf ppppp ppp 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 11

Figure 4-8: Alignment example of true positive hit in atp8 for the accession NC_025766. Alignments are displayed with ClustalX [76]

of only 18 nt is annotated in the reference; most likely this is a false positive in the reference Fig. 4-9. This accession has indeed been removed from the most recent

version of RefSeq. The remaining 40 UP genes are distributed among the four short mitochondrial genes: atp8 (16), nad6 (12), nad4l (9), and nad2 (3). For these cases we calculated the MSA of the UP amino acid sequences as given in the reference and closely related sequences with respect to the general model. The MSAs showed poor conservation in all cases (example in Fig. 4-10. Hence, the combination of small size and poor conservation seems to be the reason for missing these genes. We compared the length of the intersection of the UP genes with the consensus, to the average length of the intersection of the other sequences in the alignment with the consensus. In all 40 cases the RefSeq sequences have similar values as close relatives that are recognized by our approach, which indicates that they are false negatives of our approach. The majority of UP cases belong to fast evolving groups (19 Nematoda and 13 Chelicerata) although only one UP is only one in Tunicata and between zero and 2 in the remaining phyla (see Table 4.2). Likewise the UP cases are more prevalent in fast evolving genes i.e., atp8, nad6 and nad4l (see Table 4.1). However, when scanning the UP genes against more specic models (closer to the leaves) almost all (38) of these cases are found. Hence choosing models of lower levels, i.e., phylum,

48 pf5gppp p444444g07ff0ff5 pppgfpgg7fpf7g pgppg

1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 11

Figure 4-9: Alignment of cox3 for the accession NC_025626 with the closely related species. Alignments are displayed with ClustalX [76] class, or family, solves the fast evolving UP cases. Moreover other UP cases are caused by the overlap threshold mentioned in Section 4.1.4. The overlap problem can be solved by a column trimming strategy. Besides the considerable increase in run time from 30 sec with the root models to 743 sec with the family models. This can be solved by going to more specic models only in case of those species missing one of the mitochondrial genes. Note that, only a few more OP results from scanning more specic models (e.g., three when the family models are used) which can be eliminated by overlap with the non coding features of the mtDNA. For an overview on all the cases (equal, UP, OP and dierent) check Supplement 1 1.

4.2.2 Start and Stop Positions

To compare the two annotations we determined the dierences in start and stop po- sitions (∆푆 and ∆퐸) for all genes that are present both in RefSeq69 and in our annotation. As expected, our annotations are systematically shorter due to the prun- ing of noisy columns from the ends of the alignments. However, the largest average

∆푆 and ∆퐸 are 13.87 푛푡 and 12.92 푛푡 respectively, i.e., the dierence is less than 1https://drive.google.com/open?id=1j8bm71XZ-XUsL460PSqRtzf7z3yV8H3z

49 p7ppp pggggggff5ff57 pppppf ppp

A A A

A A A

A A A A A A AA

Figure 4-10: Alignment of nad2 for the accession NC_024096 with the closely related species. Alignments are displayed with ClustalX [76]

Figure 4-11: Dierences of predicted start (∆푆) and stop (∆퐸) positions with respect to RefSeq annotations. A positive dierence corresponds to a prediction that is longer than the reference annotation

50 d t # d t # before improvement after improvement 푑 == 0 ∀ 0.08 푑 == 0 ∀ 60.13 푑 ≤ 3 ∀ 6.73 푑 ≤ 3 ∀ 76.33 푑 ≤ 9 ∀ 20.04 푑 ≤ 9 ∀ 82.52 푑 ≤ 30 ∀ 85.38 푑 ≤ 30 ∀ 91.6

Table 4.3: The statistical improvement of the percentage of the genes where the maximum dierence (d) between annotated and predicted start and stop positions is less than or equal 0, 3, 9, and 30 bp, respectively

5 푎푎. The distributions of ∆푆 and ∆퐸 are shown in Figure 4-11. The dierences depend systematically on the gene in question. However, the percentage of columns

that is removed during the models enhancement phase is not consistent with ∆퐸 and ∆푆. For example, the start of nad6 is trimmed more than the start (see Fig. 4-4), yet ∆푆 is smaller than ∆퐸.

Improved Start and Stop The application of the method in Section 4.1.5 on the same RefSeq69 data showed a statistical improvement of the start and stop precision

comparing to the original prediction. The average ∆푆 and ∆퐸 are 8.18 푛푡 and 5.18 푛푡, i.e, an enhancement of 5.69 and 7.74 푛푡 respectively. However, the standard devia- tion remained almost unchanged (15.06 and 11.82 푛푡 for ∆푆 and ∆퐸 respectively). Furthermore, the percentage of the genes where the maximum dierence (d) between annotated and predicted start and stop positions is less than or equal 0, 3, 9, and 30 bp, respectively, is signicantly better with the enhancement method Table 4.3. In summary, the new method which consider the non-canonical start and stop codons in mitochondria allowed a better prediction for the genes boundaries compared to the initial predictions by the HMM models.

51 4.2.3 Alignment Quality

Let us rst consider the eect of the improvement steps. We observe that the APPI

increases in the overwhelming majority of genes between 1-4%, the MUPI gains are between 2-30%, and nally the SP score increases in all genes between 3-32 points, except for cox1 where the SP decreases by 5. This indicates that the procedure is successful and in particular manages to identify and subsequently correct (frameshifts) or remove (wrong) divergent sequences. Our enhanced alignments have either the same APPI as the alignments computed with MAFFT, or dier by 1% up or down. Again we gain with respect to measures that are sensitive to highly divergent sequences. The MUPI of MAFFT alignments is between 1-13% lower and the average SP is lower up to 77 (Table 4.4). Thus, in comparison with MAFFT the alignments of the majority of the genes have a higher quality with regard to the three measures. The comparison with UPP was possible for only one gene (atp6 ) since we were

not able to install the software on our 32-bit machine. The measures show again an advantage of our alignment in the three measures. A gain in APPI of 1%, a raise of 2% in MUPI, and a raise of 7 in SP score can be observed.

4.2.4 Frameshifts

The frameshift detection method was applied on the data from RefSeq release 63.

The method succeeded to detect all 200 frameshifts annotated in RefSeq2. In all these cases the location of the predicted frameshifts coincides with the RefSeq annotation. With three exceptions (which were not included in the RefSeq set) and one dierence,

we recovered also all frameshifts that are mentioned in the literature (85). Annotated FS positions dier only slightly between RefSeq and our annotation: median 0.5, mean deviation 2.1 ± 7.29 nt, well within the range of ambiguities explained below.

2Frameshift are indicated in RefSeq as “joined” parts of annotated CDS separated by 1 or 2 푛푡.

52 Table 4.4: Alignment quality measures for the initial, enhanced, and MAFFT align- ments. Shown are A: average percent pairwise alignments identity (APPI), M: most unrelated pair-wise identity (MUPI), and S: sum of pairs (SP). Initial Enhanced MAFFT AMSAMSAMS atp6 44 1 223 48 7 236 47 4 211 atp8 27 0 51 29 0 54 30 0 55 cytb 65 0 658 66 23 665 66 10 656 cox1 76 0 974 80 30 969 79 17 953 cox2 61 16 343 62 18 351 62 15 342 cox3 68 2 467 71 18 499 70 7 462 nad1 59 15 430 59 19 431 60 16 416 nad2 38 1 349 40 4 355 40 5 315 nad3 52 0 151 52 12 157 52 3 145 nad4 49 0 525 48 11 533 48 10 494 nad4l 41 0 77 43 3 81 43 2 73 nad5 44 6 633 44 12 641 44 9 564 nad6 30 1 121 31 3 124 30 1 89

The nad3 -174 frameshift found in many but not all Archosauria-Testudines [46, 48, 49] is predicted at the position reported by [47]. Other well-described examples include nad3 -135 in P. subrufa [50], nad4 in M. torneri [50], cytb in oyster [52], cox3 and nad6 of glass sponge [53]. The polyarchis ants are not included in RefSeq. The double frameshift in nad4l of P. subrufa [50] was not identied because our method only searches for cases with a single frameshift. We remark that one of two consecutive FSs in a single gene could still be reported by our method if no down-stream stop codon occurs. This is not the case in the P. subrufa example, however. Frameshifts that occur closer to the 3' end of the gene are harder to detect since the eect on the ∆푠 becomes negligible. Therefore, it is not surprising that we missed the −1 frameshifts at the very end of human cox1 and nad6 predicted by [54].

Frameshifts in Nad3 The nad3 frameshifts events are restricted to the Archosauria-

Testudines group (Figure 4-12). The frameshift at position 174 is widespread and is completely conserved in Palaeognathae. Whereas nad3 -174 disappears completely in Crocodilia, Passeriformes, and Pleurodira. Some references consider Squamata

53 as part of Archosauria [113], however, it does not harbor the frameshifts. In the remaining taxa, the nad3 -174 FS is conserved to a high degree, nevertheless it is absent in the minority of species of each group. Therefore as highlighted by [48], this process of translational frameshift has been either (i) originated as a single event at Archosauria-Testudines and then multiple losses occurred at dierent sites in the tree, or (ii) the frameshifts arose independently. Note that for the cases where no nad3 -174 frameshift was predicted also no candidate with ∆푠 < 25 is found and also the alignment shown in Figure 4-12 supports the conclusion that the frameshift is absent in these sequences. In [47, 48] positions and dierent mechanisms were suggested for the frameshift. Both assume stalling of the ribosome while trnL is bound to CUB with B designating not A, where the last base corresponds to position 172 (in the following all positions are reported w.r.t. our alignment) and the rst position of the next 0-frame codon (AGN) corresponds to position 175. According to [48] the mechanism for stalling is a 0-frame stop codon AGN, while according to [47] the cause is a stem-loop structure that starts at the AGN codon. [48] suggested that the frameshift is caused by either a re-pairing of the peptidyl site tRNA-Leu (Russel B) with the codon 1nt downstream (i.e., frame shift at position 172) or the occlusion of the rst position of the amino- acyl site (Russel C), which is the stop codon (i.e., a frameshift at position 175). The alternative explanation of [47] is that position 174 is ignored due to a +1 slippage or RNA editing. RNA editing as a possible cause was deemed unlikely by [47] and experimentally excluded by [48]. The three possibilities aect only two codons (Figure 4-13). Due to the ambiguities of the genetic code, i.e., wobble pairing and  in this case  two codon boxes that are translated to Leu, the conceptual translations that correspond to these alternatives are equal and yield the same ∆푠 value. Such ambiguities limit the precision with which the FS site can be located. Typically the ambiguous range covers only one or two codons. It is possible however, to construct even more ambiguous cases. In the theoretical worst case of a constant sequence such as CCCC... every position could be the FS site. It should be noted, however, that such ambiguities have no inuence

54 h S nldsoerno ersnaiesqec fec group each of sequence representative random one includes MSA The or circles) position (white 109 circle: absence (black the represent frameshifts Symbols of presence [104]. tree common taxonomy the NCBI of Distribution 4-12: Figure rs position cross , 124 LS P CEY FGD P LL G SA R PFFF S I R LAV 110. 120. 130. 140. 150..160 170. 180. Didelphinae .Tenme fseisi ahgopi hw tteleaves. the at shown is group each in species of number The ). 4 TCAAGTCC-CTATGAATGTGGATTTGATCC-TTTAGGATCAGC--ACGACTACCCTTTTCAATAAAATTTTTCCT-AGTAGCT Squamata 110 CTCTCCCC-ATACGAATGCGGGTTCGACCC-CCTAGGCACAGC--GCGACTCCCATTCTCACTTCGCTTCTTCCT-AGTGGCC Pleurodira 1 ATA TCA CCA- TAT GAA TGC GGA TTC AAC CCA CTTA GAA CCA GC--T TGA TTA CCA TTC TCA GTT CGA TTT TTC C-TT ATC GCA 5 CTA TCC CC-A TAC GAA TGC GGG TTT GAT CCA CT-A GAA TCT GC--T TGA CTA CCA TTT TCC ATT CGA TTC TTC C-AT GTA GCC 28 CCA ACC CC-A TAT GAG TGT GGA TTT GAC CCA CT-A GAA TCT GC--T CGT CTA CCA TTC TCA ATC CAA TTT TTC CCAT GTA GCA Testudinoidea 8 TTA TCC CC-A TAT GAA TGC GGC TTC GAC CCA TT-A AAA ACC AC--T CGC CCA CCA TTT TCA ATT CGA TTT TTC T-AT GTA GCA 1 TTA TCC CCCA TAT GAA TGC GGC TTC GAC CCA TT-A AAA ACC AC--T CGC CCA CCA TTT TCA ATT CGA TTT TTC T-AT GTA GCA

Anapsida 1 TTA TCC CC-A TAT GAA TGC GGC T-C GAC CCA TA-A AAG CCC CCCGT CGC CCA CCA TTT TCA ATT CGA TTT TTC T-AT GTA GCA Amniota nad3 Cheloniidae 3 TTA TCC CC-A TAT GAA TGT GGA TTC GAC CCA TT-A GAA TCA GC--T CGC CTA CCA TTC TCA ATC CGA TTC TTC CCAT GTA GCA 2 TTA TCC CC-A TAC GAA TGT GGA TTT GAC CCA CT-A GAA TCA GC--C CGT CTA CCA TTC TCA ATC CGA TTC TTC CT T- GTA GCA Trionychidae 7 CTA TCC CC-A TAC GAA TGT GGC TTT GAC CCA CT-T AAA ACC TT--C CGC CTA CCC TTC TCA ATC CGA TTT TTC CCAT GTA GCA Cryptodira 5 TTA TCC CC-A TAC GAA TGC GGA TTT GAC CCA TT-A AAG ACC GA--C --G CTA CCA TTT TCA ATC CG- -TT CTT CTAC GTA GCA Palaeognathae 12 CTC TCC CC-C TAT GAA TGC GGT TTT GAC CCC CT-A GGA TCT GC--T CGC CTC CCA TTC TCA ATC CGA TTC TTC CCAT GTA GCC Sauria rmsiti rhsui n npiao the on Anapsida and Archosauria in frameshift Anseriformes 12 CTA TCA CC-A TAC GAG TGC GGA TTT GAC CCA CT-T GGA TCT GC--T CGC CTA CCA TTC TCA ATC CGA TTC TTC CCAT GTA GCT 4 CTG TCA CC-G TAC GAA TGT GGA TTC GAC CCA CT-C GGA TCT GC--T CGA CTC CCA TTC TCA ATC CGA TTC TTC C-AT GTG GCT Charadriiformes 4 CTA TCC CC-T TAC GAA TGT GGA TTC GAC CCC CT-A GGA TCC GC--T CGA CTC CCA TTT TCA ATT CGT TTC TTC CCAT GTA GCA 55 2 CTA TCC CC-A TAC GAA TGC GGA TTT GAC CCG CT-C GGG TCT GC--C CGA CTC CCA TTC TCC ATC CGC TTT TTC CCT- GTA GCA

Aves Ardeiformes 9 CTA TCC CC-C TAC GAA TGT GGC TTC GAC CCG CT-C GGG TCC GC--C CGC CTT CCA TTC TCA ATT CGA TTC TTC CCAT GTA GCA 2 CTA TCC CC-C TAC GAA TGC GGC TTT GAC CCC CT-C GGC TCC GC--C CGC CTC CCA TTC TCA ATC CGA TTC TTC C-AT GTA GCA

174 Cuculiformes 1 CTA TCA CC-C TAT GAA TGT GGA TTT GAC CCA CT-A GGA TCA GC--C CGA CTC CCC TTC TCA ATC CGA TTC TTC CCAT GTA GCA 1 CTC TCC CC-A TAC GAA TGC GGC TTT GAC CCC CT-A GGA TCC GC--C CGA CTC CCA TTC TCT ATT CGA TTC TTC C-AT GTA GCC Falconiformes 5 CTA TCT CC-A TAC GAG TGT GGC TTC GAT CCT CT-T GGA TCT GC--A CGA CTC CCA TTC TCA ATC CGA TTC TTT CTAT GTA GCA

ls position plus: , 6 CTA TCG CC-A TAC GAA TGC GGC TTT GAC CCT CT-A GGA TCT GC--C CGA CTA CCA TTC TCA ATC CGA TTC TTC CT C- GTA GCC Neognathae Craciformes 40 CTA TCC CC-A TAC GAA TGT GGA TTT GAC CCC CT-A GGA TCA GC--C CGA CTC CCA TTC TCA ATC CGA TTC TTT CCAT GTA GCC 5 CTA TCA CC-A TAC GAA TGC GGA TTT GAC CCT CT-A GGA TCA GC--T CGA CTC CCA TTC TCA ATC CGA TTC TTC C-AT GTA GCT

Archosauria Gruiformes 19 CTT TCA CC-C TAC GAA TGC GGA TTT GAC CCA CT-G GGA TCC GC--C CGC CTC CCA TTC CCA ATC CGA TTC TTC CCAT GTA GCA 3 CTA TCC CC-T TAT GAG TGC GGA TTT GAC CCA CT-A GGT TCC GC--T CGC CTC CCC TTC TCC ATC CGA TTT TTC C-AT GTA GCC Passeriformes 80 CTC TCT CC-A TAC GAA TGC GGT TTC GAC CCC CT-G GGA TCA GC--C CGA CTA CCC TTC TCA ATC CGA TTC TTC CCT- GTA GCA Pelecaniformes 2 CTC TCA CC-A TAC GAA TGT GGC TTC GAC CCA CT-A GGA TCC GC--T CGA CTA CCA TTC TCA ATT CGA TTC TTC CCAT GTA GCC 1 CTA TCC CC-C TAC GAA TGC GGC TTC GAC CCC CT-T GGT TCC GC--A CGA CTT CCA TTT TCA ATT CGA TTC TTC C-AT GTG GCC Psittaciformes 12 CTC TCA CC-C TAT GAA TGC GGG TTT GAC CCA AT-A GGC TCT GC--C CAA CTG CCA TTT TCC ATT CGA TTC TTC CCAT GTA GCC 1 CTA TCA CC-T TAT GAA TGC GGA TTT GAC CCC CT-A GGA TCC GC--C CGC CTC CCA TTC TCC ATC CGA TTC TTC CT C- GTA GCT others 30 CTA TCC CC-A TAC GAA TGT GGC TTC GAT CCA CT-T GGC TCT GC--T CGA CTC CCA TTC TCA ATC CGA TTT TTC CCAT GTA GCA CTA TCC CC-C TAC GAA TGT GGA TTC GAC CCA CT-A Crocodilia 1 GGC TCC GC--T CGC CTT CCA TTC TCA ATC CGA TTC TTC C-AT GTA GCC 19 CTT TCA CCA- TAT GAA TGC GGA TTT GAC CCG CT-G GGA TCC GC--C CGC CTG CCA CTA TCT ATC CGC TTT TTC AT-A ATC GCT 134 qaeposition square ,

Figure 4-13: Alternative interpretations of the same frameshift event on our pipeline's ability to detect the frameshifted sequences, since it operates on the conceptually translated amino acid sequences  and these are by denition the same for all alternative FS locations. In order to nd out which of the three cases is more likely a multiple sequence alignment of the Archosauria+Testudines has been created with ClustalX [76]. In contrast to the previous studies outgroup data is included (Squamata and Didelphi- nae as a more distant group). The conservation pattern strongly favors the option suggested by [47]. The sequence around the fame shift position is nearly per- fectly conserved (consensus without position 174 is TTC CTA GTA). Also the frameshift position itself is well preserved being mostly C which constitute 86% of this column ignoring gaps. Moreover the T at position 173 is 100% conserved in all the species included in the alignment. Moreover at position 175 A is 91% conserved. A complete alignment can be found in the Supplement 2 3. Note that the position 174 is shifted to the position 181 due to other non-conserved insertions/deletions at earlier positions. Since the frameshift is in all groups more often present than absent an ancestral gain of the frameshift and infrequent loss seems to be a likely explanation. Anapsida nad3 hosts, in addition, a non conserved frameshift mutation at three dierent positions 109, 124 and 134. The nad3 -134 was annotated in RefSeq and described in [50]. The nad3 -109 (occurs in Cuora aurocapitata) has been reported in [6]. The translation in the open reading frame does not initiate an early stop codon, however the amino acid sequence translated in +1 frame is more similar to the consensus. The nad3 gene in Cuora aurocapitata exhibit a deletion at position

124 and two extra nucleotides at positions 144-145. 3https://drive.google.com/open?id=1mdkW7mPqzY5rOONJb6aLR0EJiuLnt0GY

56 Figure 4-14: Number of detected frameshifts (푙표푔10 scale) for dierent genes shown separately for new (new) and already annotated frameshifts (annotated)

To assess the frameshift detection method we compared our results for nad3 in the Archosauria-Testudines and Anapsida data sets to the results of MACSE on the same dataset [114]. All the frameshifts detected by our method were reported also by MACSE Supplement 34. However, for the FS at position 174, MACSE supports Russel C (Figure 4-13), i.e., an insertion at position 175. On the other hand two FS at the same positions were not detected with our method in the accessions NC_017839 and NC_022957 since these frameshifts would imply a stop codon before the end of the sequence. In addition, 68 frameshifts at the last position are predicted by MACSE. These are not detectable by our method because they do not result in a signicant score dierence ∆푠. The frameshifts are most likely spurious and are explained by incomplete stop codons [97].

New Frameshifts The un-annotated frameshifts (new), are spread over all protein- coding genes except cox2 (Figure 4-14). We found frameshifts in the genes: atp8, atp6, cox1, for which no frameshifts were previously annotated by RefSeq. These frameshifts are not phylogenetically conserved, see Appendix A.2. Therefore we cannot determine whether real frameshifts have occurred in dierent sites in the tree, or the reading frame change is caused by sequencing or annotation errors.

To summarize, we found 98.5% of the annotated frameshifts at nearly the same positions. Furthermore we found 36 frameshifts which were not annotated in RefSeq. 4https://drive.google.com/open?id=1jitcaRUf7PRM-MMSJZGRytwnkYXAnCNz

57 Those cases, if they are real frameshifts and not sequencing or annotation errors, shed light on the advantages of our method to automatically detect errors in the reference sequences.

4.2.5 Annotation of mtDNA in Fungi

The enhanced root models of Metazoa are used to annotate protein-coding genes of Fungi (only the 13 canonical protein-coding genes present in Metazoa are considered). The test set contains 96 fungal mitochondrial from the RefSeq release number 84. The comparison with the reference annotation shows an agreement of 1015 genes out of 1051 (97%). Of the remaining 260 cases 224 cases are over-predicted by our method. Additionally 36 cases are missed (under-predicted) 4.2.5.

Over-Predicted Genes Among the 224 over-predicted genes (OP) 21 cases have an 푒-value ≤ 10−7. These OPs are clustered in only 8 accessions NC_005789, NC_015893, NC_018100, NC_021373, NC_020430, NC_020331, NC_015991 and NC_005927. Af- ter a manual check of the Genbank les we nd that the corresponding gene or misc_feature entries are present in the GenBank les but CDS features were miss- ing whereby they are ignored by our GenBank parser. Thus, these predictions are certainly true positives. Note that it is impossible to construct a parser for the Gen- Bank les in RefSeq that considers all possible cases. If, for instance, misc_feature are included also annotated gene domains are included since they are additionally annotated as misc_feature.

Of the remaining 203 OP genes, 75% belong to the atp8 gene. Moreover, the distributions of scores for the atp8 gene do not show signicant dierence in both OP and equal cases. In contrast, for the rest of studied genes, the scores of equal hits are mostly higher than those of OP hits Fig. 4-15.

Under-Predicted Genes Out of 36 UP genes 34 are for the atp8 gene known for its short size and high variability between taxa. Overall the numbers of UP

58 and OP are low except in the case of the atp8 gene. Accordingly, ignoring atp8, the

precision or positive predictive value of the models is 95%, and the sensitivity is 99.6%. Therefore, metazoan models can be used to annotate the PC genes of Fungi with a high accuracy for all the common genes except the atp8 known for its shortness and poor conservation. Specic model built from fungal sequences is needed to annotate this gene in addition to other PC genes that are not found in Metazoa.

equal ∆± UP OP dierent

atp6 87 (0.93) 0 (0.00) 0 (0.00) 6 (0.07) 0 ( 0.00) atp8 46 (0.11) 1 (0.00) 34 (0.08) 153 (0.81) 0 ( 0.00) cob 87 (0.95) 0 (0.00) 1 (0.01) 4 (0.04) 0 ( 0.00) cox1 87 (0.97) 0 (0.00) 1 (0.01) 2 (0.02) 0 ( 0.00) cox2 87 (0.92) 1 (0.01) 0 (0.00) 7 (0.07) 0 ( 0.00) cox3 85 (0.95) 0 (0.00) 0 (0.00) 5 (0.05) 0 ( 0.00) nad1 76 (0.96) 0 (0.00) 0 (0.00) 3 (0.04) 0 ( 0.00) nad2 76 (0.91) 0 (0.00) 0 (0.00) 8 (0.09) 0 ( 0.00) nad3 78 (0.93) 0 (0.00) 0 (0.00) 6 (0.07) 0 ( 0.00) nad4 76 (0.94) 0 (0.00) 0 (0.00) 5 (0.06) 0 ( 0.00) nad4l 76 (0.87) 0 (0.00) 0 (0.00) 11 (0.13) 0 ( 0.00) nad5 77 (0.93) 0 (0.00) 0 (0.00) 6 (0.07) 0 ( 0.00) nad6 77 (0.91) 0 (0.00) 0 (0.00) 8 (0.09) 0 ( 0.00)

Protein 1015 ( 0.8) 2 (0.00) 36 (0.03) 224 (0.18) 0 ( 0.00)

4.3 Conclusion

In this paper an approach for the precise, automatic, and fast annotation of mitochon- drial protein-coding genes has been presented. The implemented methods overcome known problems in the annotation of these genes, i.e., unannotated frameshifts, and misannotated genes. To this end we developed a fully automated pipeline for anno-

59 atp6 atp8 cob cox1 cox2 cox3 nad1 nad2 nad3 nad4 nad4l nad5 nad6

1.00

0.75 equal 0.50

0.25

0.00

1.00 60 cumulative frequency cumulative

0.75 OP 0.50

0.25

0.00 0 0 0 0 0 0 0 0 0 0 0 0 0 50 50 50 50 50 50 50 50 50 50 50 50 50 100 150 200 250 150 200 250 100 150 200 250 100 150 200 250 100 150 200 250 100 150 200 250 100 150 200 250 100 150 200 250 100 150 200 250 100 150 200 250 100 150 200 250 100 100 150 200 250 100 150 200 250 score

Figure 4-15: Cumulative -value of true positive hits and over-predicted hits (OP) in scale 푒 − log10 tating mitochondrial protein-coding genes. The method creates taxon-specic Hidden Markov models and their corresponding alignments from a set of annotated sequences and an approximation of the phylogeny. The pipeline incorporates several methods to improve the quality of the alignments and thereby also the quality of the model. That is, purging of sequences, removal of non conserved columns from both ends of the alignments, and correction of frameshifts. The method to detect frameshifts has been applied to all metazoan mitochondrial protein-coding genes which resulted in a large number of detected frameshifts, many of which have been unknown. A reanal- ysis of the frameshift in nad3 of Archosauria-Testudines favors the position that was previously suggested by [47] instead of [48]. The presented methods and the generated models are available from the authors upon request. Future work is the precise annotation of the gene ends which is complicated by the peculiarities of the mitochondrial translations and integration with MITOS.

61 Chapter 5

Semi Global Alignments of Breakpoints Regions and the Sequence Signatures of Mitochondrial Genome Rearrangements

Genomic DNA frequently undergoes rearrangement of the gene order that can be localized by comparing the two DNA sequences. In mitochondrial genomes dierent mechanisms are likely at work, at least some of which involve the duplication of sequence around the location of the apparent breakpoints. We hypothesize that these dierent mechanisms of genome rearrangement leave distinctive sequence footprints. In order to study such eects it is important to locate the breakpoint positions with precision. In this chapter we dene a partially local sequence alignment problem that assumes

that following a rearrangement of a sequence 퐹 , two fragments 퐿, and 푅 are produced that may exactly t together to match 퐹 , leave a gap of deleted DNA between 퐿 and 푅, or overlap with each other. We show that this alignment problem can be solved by dynamic programming in cubic space and time. We apply the new method to evaluate rearrangements of animal mitogenomes and nd that a surprisingly large

62 F

L R

F

L R

Figure 5-1: Overview of the problem: We assume that the prex of 퐿 and the sux of 푅 is known to be homologous to the reference. It thus suces to consider a core region indicated by the black frame. In this region we consider alignments in which 퐿 and 푅 overlap (above) and alignments in which a gap remains between 퐿 and 푅 (below). In the rst case, the overlapping, hashed, region is scored as three-way alignment and deletions of a prex of 푅 and a sux of 퐿 do not contribute to the score. In the second case, the gap between 퐿 and 푅 is penalized, however. Dashed lines indicate the non-aligned parts of the sequences 퐿 and 푅, resp. fraction of these events involved local sequence duplication.

5.1 Theory

The exact localization of breakpoints in the genomic sequence is a necessary prereq- uisite for any detailed investigation into sequence patterns that might be associated with genome rearrangements. In the vicinity of a breakpoint, the sequence 퐹 of one genome, which we refer to as the reference, is contiguous. The homologous sequences in the other genome are split into two non-contiguous parts, which we call 퐿 and 푅. In practice, 퐹 contains two adjacent genes in the reference genome. The fragments 푅 and 퐿 each contain the homolog of one of them. Without loosing generality, we x the notation so that 푅 and 퐿 are homologous to the right and left part of 퐹 , respectively. If rearrangements always were a simple cut-and-paste operations, it

would suce to nd the concatenation of a sux of 푅 and a prex of 퐿 that best matches 퐹 . This simple model, however, does not account for TDRLs.

63 We therefore have to allow that 푅 and 퐿 partially overlap. It is also conceivable that recombination-based rearrangements lead to the deletion of a part of the ancestral sequence or the insertion of some nucleotides. To account for this possibility, we have to allow that part of 퐹 does not appear in either 푅 or 퐿. The pertinent situations are outlined in Fig. 5-1. The computational problem of pinpointing the breakpoint position and identifying potentially duplicated or deleted sequence fragments can be understood as a specic mixture of local and global alignments of the three sequences

퐹 , 푅, and 퐿, which we will formalize below. Three-way alignment algorithms have a long tradition in bioinformatics [115, 116], where they have been used as components in progressive multiple sequence alignment procedures [117, 118]. The combination of local and global alignments in multi-way alignments, however, has not received much attention. In the following section we show that an ecient dynamic programming algorithm can be devised for this task.

The alignment of the reference sequence 퐹 and its two osprings 퐿 and 푅 is global at the outer end (here terminal deletions are scored), but local toward the breakpoint region (here terminal deletions in 푅 and 퐿, resp., remain unscored. Although the problem is symmetric, we follow the usual algorithmic design of dynamic programming algorithms for sequence alignments and consider partial solutions that are restrictions to prexes of 퐹 , 퐿, and 푅. In the following we denote by 푚 := |퐹 |, 푛 := |퐿|, and

푝 := |푅| the respective length of the input sequences. Denote by 푆푖,푗,푘 the maximal score of an alignment of the prexes 퐹 [1..푖], 퐿[1..푗], and 푅[1..푘]. As usual, an index 0 refers to the empty prex. We restrict our attention to additive scores dened on the input alphabet augmented by the gap character (’-’). We write 훾(푎, 푏, 푐) for the score of the three-way parts and 휎(푎, 푏) for the pairwise part, i.e., regions in which a sux of 퐿 or a prex of 푅 remains unaligned. More details will be given at the end of this section.

The general step in the recursion for 푆 is essentially the same as for global three- way alignments [115, 116], with a single modication introduced by local alignment of the left end of 푅. As in the familiar Smith-Waterman algorithm [74] we have to add the possibility that 푅 is still un-aligned to the set of choices. This state is encoded

64 by 푘 = 0. Thus

푆푖−1,푗−1,푘−1 + 훾(퐹푖, 퐿푗, 푅푘) ⎧ ’-’ ⎪푆푖−1,푗−1,푘 + 훾(퐹푖, 퐿푗, ) ⎪ ⎪ ⎪ ’-’ ⎪푆푖−1,푗,푘−1 + 훾(퐹푖, , 푅푘) ⎪ ⎪ ⎪푆 −1 + 훾(퐹 , ’-’, ’-’) ⎪ 푖 ,푗,푘 푖 (5.1) 푆푖,푗,푘 = max ⎪ ⎪ ’-’ ⎨⎪푆푖,푗−1,푘−1 + 훾( , 퐿푗, 푅푘) ’-’ ’-’ ⎪푆푖,푗−1,푘 + 훾( , 퐿푗, ) ⎪ ⎪ ⎪ ’-’ ’-’ ⎪푆푖,푗,푘−1 + 훾( , , 푅푘) ⎪ ⎪ ⎪ ⎪푆푖,푗,0 , 푘 > 0 ⎪ ⎪ ⎪ The initialization at the edges and⎩ faces of three-dimensional matrix deserves some separate discussion. Entries of the form 푆푖,0,0 correspond to the insertion of a prex of 퐹 . Since gaps relative to 푅 are not scored but need to be paid for in the alignment with 퐿, we have 푆푖,0,0 = 푖×푔, where 푔 is the uniform gap score. By the same argument

푆0,푗,0 = 푗 × 푔. The attempt to place 푅 to the left of 퐿 and 퐹 incurs costs relative to both 퐿 and 퐹 , thus 푆0,0,푘 = 푘 × 2푔. The last case makes the alignment local w.r.t. the right end of 푅, i.e., allows penalty-free deletions of any prex of 푅. The alignment of 퐹 and 퐿 in the absence of 푅 is global towards the left, hence the face with 푘 = 0 (Fig. 5-2) is computed according to the Needleman-Wunsch algorithm [73], i.e.,

푆푖−1,푗−1,0 + 휎(퐹푖, 퐿푗) ⎧ 푆 ’-’ (5.2) 푖,푗,0 = max ⎪푆푖−1,푗,0 + 휎(퐹푖, ) ⎪ ⎨⎪ 푆푖,푗−1,0 + 휎(’-’, 퐿푗) ⎪ ⎪ ⎩⎪

65

Figure 5-2: Alignment of 퐹 and 퐿 in the absence of 푅 (dotted plane) according the equation 5.2

The 푗 = 0 plane describes alignments of 퐹 and 푅 before the beginning of 퐿. It also follows the Needleman-Wunsch scheme but has a more elaborate scoring considering triples of sequences since the insertions relative to 퐿 are explicitly penalized here:

푆푖−1,0,푘−1 + 훾(퐹푖, ’-’, 푅푘) ⎧ 푆 ’-’ ’-’ (5.3) 푖,0,푘 = max ⎪푆푖−1,0,푘 + 훾(퐹푖, , ) ⎪ ⎨⎪ 푆푖,0,푘−1 + 훾(’-’, ’-’, 푅푘) ⎪ ⎪ ⎩⎪

66

Figure 5-3: Alignment of 퐹 and 푅 in the absence of 퐿 (wavy plane) according the equation 5.3.

The 푖 = 0 plane, nally, describes the alignment of 퐿 and 푅 without 퐹 . Since the gaps are penalized at the beginning of 퐹 , the scoring function 훾 is used in this plane. The cells are lled according to the Needleman-Wunsch recursions in the following manner:

푆0,푗−1,푘−1 + 훾(’-’, 퐿푗, 푅푘) ⎧ 푆 ’-’ ’-’ (5.4) 0,푗,푘 = max ⎪푆0,푗−1,푘 + 훾( , 퐿푗, ) ⎪ ⎨⎪ 푆0,푗,푘−1 + 훾(’-’, ’-’, 푅푘) ⎪ ⎪ ⎩⎪

67

Figure 5-4: Alignment of 퐹 and 푅 in the absence of 퐿 (dotted plane) according the equation 5.4

The matrix 푆 computes not quite the solution to our problem however. Thus we introduce a two-dimensional scoring matrix 푀 that holds the optimal scores of an alignment of 퐹 and 푅 that continues after the end of the aligned part of 퐿, i.e., after the breakpoint position in 퐿. It satises the recursion

푀푖−1,푘−1 + 휎(퐹푖, 푅푘) ⎧ ’-’ ⎪푀푖−1,푘 + 휎(퐹푖, ) ⎪ ⎪ ⎪ ’-’ (5.5) 푀푖,푘 = max ⎪푀푖,푘−1 + 휎( , 푅푘) ⎪ ⎨⎪ max 푆푖,푗,푘 푗 ⎪ ⎪ ⎪max max 푆푖′,푗,0 ⎪ 푖′<푖 푗 ⎪ ⎪ ⎪ The rst three cases are the Needleman-Wunsch⎩ style extension of the pairwise align- ment. The next case corresponds to the situation that 퐿 ends with position 푗 in the three way alignment. In the nal case the three sequences do not overlap. Our model stipulates that gaps between the left end of the 퐹 퐿 alignment and the right

68 end of 퐹 푅 alignment are not penalized, hence there is no gap cost contribution for the interval 퐹 [푖′ + 1..푖] on the reference sequence. Since the alignment of 퐹 and 푅 is global at its right end, the traceback starts from the lower right corner of 푀, i.e., the entry 푀푚푝. If there is an overlap region, the traceback moves from 푀 to 푆 at an index triple (푖, 푗, 푘) with 푘 > 0. In the case of a gap region the transition is directly to the 퐹 퐿 surface, i.e., at 푘 = 0. The traceback terminates when it reaches 푆0,0,0 (see Fig. 5-5.

Figure 5-5: Traceback in the matrix 푆. In case of overlap the traceback starts at index triple (푖, 푗, 푘) (red line), if the alignment contains a gap instead, the traceback starts at (푖, 푗, 0)(blue line)

The algorithm uses 푂(푛3) space and time for input sequences of length 푛. To

achieve this time complexity, notice that 푚˜ 푖 := max푖′<푖 max푗 푆푖′,푗,0 can be pre-computed in quadratic time for each 푖 and memoized in a linear array. As a further optimization,

we may use 푚˜ 푖 = max(max푗 푆푖−1,푗,0, 푚푖−1). The scoring model must satisfy the usual constraints for local similarity based alignments: gap and mismatch scores must be negative, at least on average, as other- wise the option to remove free end gaps would never be taken. Furthermore, mismatch

69 scores must be larger than indel scores to avoid the proliferation of insertions followed by deletions or vice versa. Here we use the sum of pairs model

훾(푎, 푏, 푐) = (휎(푎, 푏) + 휎(푎, 푐) + 휎(푏, 푐)) /푤(푎, 푏, 푐) (5.6) with pairwise similarities 휎( .,. ) dened on the alphabet comprising the four nu- cleotides and the gap character ’-’. This scoring model satises the required con- straints, provided the pairwise scores 휎( .,. ) are suitable for pairwise local sequence alignments [119]. We use match and mismatch scores 휎(푎, 푎) = 훼 and 휎(푎, 푏) = 훽 for 푎 ≠ 푏 ∈ 풜 that are independent of the letter of the alphabet. The gap scores are also specied in a sequence-independent manner: 휎(푎, ’-’) = 휎(’-’, 푎) = 훿 and 휎(’-’, ’-’) = 0. The pairwise scoring scheme is thus specied by {훼, 훽, 훾}. In ad- dition we use a sum-of-pair weight 푤(푎, 푏, 푐) = 푊 if 푎, 푏, 푐 ∈ 풜 are all letters, and 푤(푎, 푏, 푐) = 1 if at least one of 푎, 푏, and 푐 is a gap character. This extra parame- ter modies the relative weight of the overlap region compared to the parts of the alignments in which only 퐿 or only 푅 is aligned to the reference 퐹 .

5.2 Parametrization of the Scoring Function

In order to test whether the DP scheme outlined in the previous section is indeed capable of determining the breakpoint location with sucient accuracy we used syn- thetic data. We separately prepared test data with a gap relative to 퐹 and test data with an overlap of 푅 and 퐿. A random sux or prex of the same length was ap- pended to the 푅 and 퐿 sequences, thereby complementing them to mimic a scenario in which input sequences comprise two annotation items anking the approximate location of the breakpoint. Both the designed gap and overlap lengths were xed at

10 nucleotides. We used both perfect sequences and sequences in which 퐿 and 푅 were mutated with a position-wise probability of 15% or 30%, respectively. In order to survey the inuence of dierent alignment parameters we scanned all

70 type: gap type: overlap 20 ● mut: 0

● W: 1 10 ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● −10 −20 20

● mut: 0 10 ● ● ● ● ● W: 2 ● ● ● ● 0 ● ● ● ● −10 ● ● ● ● −20 20

● ● ● ● ● ● ● ● mut: 0

● ● ● ● ● ● ● W: 3 10 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● ● ● ● ● −10 ● ● ● ● ● −20

20 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● mut: 0.15 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

10 ● ● ● ● ● ● ● ● ● ● ● ● W: 1 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● −10 ● −20 ● ● ● 20 ● ● ● mut: 0.15 ● ● ● ● ● 10 ● ● ● ● ● ● ● ● ● ● ● ● W: 2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −10 ● ● ● ●

Mean size −20 ● ● 20 mut: 0.15 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 10 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● W: 3 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −10 ● ● ● ● ● −20 ● ● ● ● ● 20 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● mut: 0.3 ● ● ● ● ● ●

● W: 1 10 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● ● ● ● −10 ● ● ● ● −20

20 ●

● ● ● mut: 0.3 ● ● ● ● ●

● ● W: 2 10 ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● −10 ● ● ● ● ● ● ● ● ● −20 ● ● ● ● 20 mut: 0.3 ● ● ● ● ● ● ● ●

● ● ● ● ● W: 3 10 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −10 ● ● ● ● ● ● ● ● ● ● ● ● ● ● −20 ● ● ● 1 −1 1 −1 −2 1 −1 −3 1 −2 −1 1 −2 1 −2 −3 1 −3 −1 1 −3 −2 1 −3 2 −1 2 −1 −2 2 −1 −3 2 −2 −1 2 −2 2 −2 −3 2 −3 −1 2 −3 −2 2 −3 3 −1 3 −1 −2 3 −1 −3 3 −2 −1 3 −2 3 −2 −3 3 −3 −1 3 −3 −2 3 −3 1 −1 1 −1 −2 1 −1 −3 1 −2 −1 1 −2 1 −2 −3 1 −3 −1 1 −3 −2 1 −3 2 −1 2 −1 −2 2 −1 −3 2 −2 −1 2 −2 2 −2 −3 2 −3 −1 2 −3 −2 2 −3 3 −1 3 −1 −2 3 −1 −3 3 −2 −1 3 −2 3 −2 −3 3 −3 −1 3 −3 −2 3 −3 scores

Figure 5-6: Variation of overlap and gap sizes in simulated data as a function of alignment parameters (score: match, mismatch, gap), sum of pairs weight 푊 , and sequence divergence mut between 퐹 and 푅 or 퐿, respectively. The ground truth is indicated by solid lines at −10 of for the gap scenario and +10 for the overlap case

combinations of match scores in {1, 2, 3} and gap/mismatch scores in {−1, −2, −3}. The overlap is dened as the number of alignment column from the rst to the last position at which all three sequences, 퐹 , 퐿, and 푅, are aligned. The size of the gap between or overlap of 푅 and 퐿 was estimated from a sample of 50 simulated instances for each parameter combination. In addition we investigate the eect of the sum of pairs weight parameter 푊 ∈ {1, 2, 3} We rst consider the test cases with overlap. In the absence of articial mutation the expected overlap size matches the expected value for 푊 = 1. For large values of 푊 the observed overlap is reduced to 5.9 ± 4.4. This is not unexpected since large values of 푊 down-weight the triple matches in the overlap region relative to indels in one of the sequences. The overlap size decreases with increasing sequence variation.

71 Again, larger values of 푊 aggravate the eect. For 푊 = 1 most pairwise scoring models yield good estimates of the true overlap size. Exceptions are models with s- mall relative match scores such as {1, −2, −2}, {1, −2, −3}, {1, −3, −2}, {1, −3, −3}, which produce overlaps that are too short or even gaps in the presence of elevated levels of sequence variation. If match scores are too large, overlaps tend to be over- estimated. This is the case for the scoring schemes {3, −1, −1}, {3, −2, −1}, and {3, −3, −1}. In the test cases with a gap between 퐿 and 푅 we nd that gap sizes do not depend strongly on the sequence divergence. In general, gap sizes tend to be underestimated, in particular in scoring functions with small gap penalties. This is explained by the fact that there is a non-negligible chance that the rst few nucleotides of the random sux of 푅 or the last few nucleotides of the random prex of 퐿 also match with 퐹 . Best results were obtained for {1, −1, −3}, {1, −1, −2} for 푊 = 1. Larger values of 푊 do not lead to improvements. Combining the results for the gapped and the overlapping examples, the scoring functions {1, −1, −3}, {1, −1, −2}, and possibly {2, −3, −3} perform best. Through- out the remainder of this contribution we use 푊 = 1 and the pairwise alignment scores {1, −1, −2}.

5.3 Mitochondrial Rearrangement Data

We apply our alignment method to a collection of 152 unique animal mitochondrial gene order pairs that contain exactly the 37 canonical genes: atp8, atp6, cob, cox1, cox2, cox3, nad1, nad2, nad3, nad4, nad4l, nad5, nad6, rrnL, rrnS, trnF, trnV, trnI, trnM, trnQ, trnW, trnA, trnY, trnN, trnC, trnS2, trnK, trnD, trnG, trnR, trnH, trnS1, trnL1, trnE, trnT, trnP, trnL2. The list of accession numbers was taken from [120] and the gene orders extracted from the annotations in RefSeq release 73 [103]. Rearrangement analysis for all pairs of unique gene orders has been done with CREx [8]. For our analysis only pairs of unique gene orders are considered that are separated

72 by a single rearrangement event, i.e., we retain 144 genome pairs of which 124 are predicted by CREx [8] as transpositions, 14 as inversions and six as TDRL. Since transposition and inversion are symmetric both directions are considered, i.e. 62 and 7 pairs in each direction, respectively. For the analysis of the breakpoint regions for each pair of unique gene orders a pair of representative sequences has been chosen that is most closely related according to the NCBI taxonomy database [104]. For each predicted breakpoint we extract the reference sequence from one genome and the two predicted fragments from the other, where the former genome exhibits the putative ancestral state of the rearrangement and the latter the putative derived state.

All sequences are retrieved from RefSeq release 73 [103]. The reference sequence 퐹 is the concatenation of the last 60 nt of the left gene, the intergenic region if available, and the rst 60 nt of the right gene. The left query 퐿 is formed of the last 60 nt of the left query gene and the intergenic region with respect to the next gene (previous gene in case of inversion). The right query 푅 is formed by the intergenic region with respect to the previous gene (next gene in case of inversion). In case of inversion the reverse complement of the corresponding query is used in the alignment. For each triple (퐹, 퐿, 푅) the alignment is computed as described above and the length of the overlap of 퐿 and 푅 or the length of the gap between 퐿 and 푅 is recorded. However, we exclude 191 alignments that contain stretches of intergenic region longer than 40 nt in the reference sequence and 푅 or 퐿 are very short i.e less than 10 nt 퐹 from the statistical analysis. In total, we removed 233 of 459 alignments. Most of these cases are due to annotation errors in RefSeq or because we did not consider the control region for the computation of the gene orders. A complete set of the corresponding breakpoint alignments is compiled in Supplement 41 and Supplement 52 in human and machine readable form, respectively. Furthermore, the rearrangement events considered here are compiled in Appendix B.1. We only included the 49 pairs of gene orders with symmetric rearrangements in the comparison of the average overlap sizes where the alignments for both choices of the

1https://drive.google.com/open?id=1Dl0QqVr12d3euQiQ5w5Lm_18KAbekkaS 2https://drive.google.com/open?id=1AkGPaGtszCGEdcD9hNPzJN_Wx27ji4oI

73 trnW trnA

R GCTAAATTAGACTAAGGACCTTCAAAGCCCAAAGCAGAAGTTAAAATCTTCTAATCTTTGAATAAGACTCGCAAGATTCTATCTAACATCCTTTGAATGCAACCCAAATACTTTAATTAAG- 121 trnW zG-TTA-TTAGACTAAGGACCTTCAAAGCCC}|TAAGCAGAAGTTAAAATCCTCTAATCTTTG{z--TTAGA------}|------{ 63 trnA ------ACCTTTAAAATCCTAAATA-AAGATAAA-T--T-TAATCTTTG--TAAGACCTGTAAGACTTTATCTTACATCTTATGAATGCAACCCAGACACTTTAATTAAGC 98

trnW | {z } trnA | {z } trnA trnN

R CAAGATTCTATCTAACATCCTTTGAATGCAACCCAAATACTTTAATTAAGCTAGAGCCTTACTAGACTAGCAGGTTTTTATCCTACAACCTTTTAGTTAACAGCTAAACACCCAAATCAA- 120 trnA zTAAGACTTTATCTTACATCTTATGAATGCA}|ACCCAGACACTTTAATTAAGCTAAGACCTT{zACTTGATTGGCAGACCCTTACCCCATAATA}|TTTTAATTAAC---T--ATACCCACATCT-{- 114 trnN ------CTCTTATGAA-CTAAAACCTTACTAGATTAGTAGGCTTTTACCCTACGATCTTTTAGTTAACAACTAAATACCCACATCTGG 81 trnA | {z } trnN | {z }

OL trnC

R CTTCTCCCGCTT-GTTGGGGGAAAAAAGCGGGAGAAGTCCCAGCAG-GAGTAATTCTGCTCTTTAAAATTTGCAATTTTATGTGCTAA-ACACCA 92 OL zCTTCTCCCGCTTTGTGGG}|GGGGGAAAAGCGGGAGAAG{zCCCCGGTAGAGAATTAT-CTGCTC--TGA}|AATTTGTAACT-CATATGCTTA--CACCA{ 89 trnC CTTCTCCCG-TT--TTGTGGG--AAAA-CGGTAGAAGCCCCGACAGAAAATTAT-CTGCTCTCTGAAATTTGCAATTTCATTTGCTTATACACCA 88

OL | {z } trnC | {z } trnY cox1

10. 20. 30. 40. 50. 60. 70. 80. 90. 100. 110. 120. R ACTTAAACCTTTATGATCGAGGCTACAACTCGACACCTTTTTTCGGACACCTTACCTGTGATAATCACTCGATGACTATATTCAACAAATCATAAAGATATTGGCACCCTTTATTT 120 trnY zA-TTAAACCTTTATAAATGAGACTACAG}|CTCACCACCTATTT-CGGTCACCTTATC{zTATAGTA-TTACCCGATAGCTCTTCTCAACA}|AAT-A-A-A-ATATT-GTACCCT-TATTT{ 113 cox1 ------ACACCTTACCTATGATAATTACCCGATGATTCTTCTCAACAAATCATAAAGATATTGGTACCCTATATTT 70 trnY | {z } cox1 | {z } trnNO L

R ----T-TTATCCTACAACCTTTTAGTTAACAGCTAAACACCCAAATCAATCTAGGCCTTAGTCTACTTCTCCCGCTT-GTTGGGGGAAAAAAGCGGGAGAAG- 96 trnN AGGCzTTTTACCCTACGATCTTTTAGTTAACAACT}|AAATACCC-A--C-ATCT-GGCCTTAATCTA{zCTTCTCCCGTTTTGTGGG}|------{- 78 OL ------TTACCCCATAATATTTTAATT-A-A-CT--ATACCC-A--C-ATCT-GGCCTTAATCTACTTCTCCCGCTTTGTGGGGGGGGAAAAGCGGGAGAAGC 87 trnN | {z } OL | {z }

Figure 5-7: Breakpoint alignments for the mitochondrial genome rearrangement sep- arating the reference gene order of Phaeognathus hubrichti (NC_006344) from Hy- dromantes brunus (NC_006345). This is a clear example of a TDRL. Alignments are displayed with TeXshade [121]

74 reference were retained. This data set comprises 3 inversions and 46 transpositions, corresponding to 186 alignments.

5.4 Quantitative Analysis

Fig. 5-8 summarizes the distribution of overlap sizes. Several patterns are clearly visible. First, there is a substantial fraction of alignments with long gaps. The main cause of these long gaps is a very low nucleic acid sequence similarity with an average of only 39% between the gene portions (60 nt from start or end). This explains 10 of the 15 cases. Some of the remaining cases are explained by annotation errors such as the misannotation of trnY in Luvarus imperialis. In four alignments the long gaps

are caused by the long intergenic region in reference sequence 퐹 . The distribution of the overlap sizes for the dierent types of rearrangement is

shown in Fig. 5-8(a). For TDRLs we expect to see overlaps of 푅 and 퐿 that are interpreted as remnants of the duplication event. Indeed, of the 44 alignments of breakpoint regions that are due to TDRL in the CREx data, 37 have a non-zero overlap. The mean overlap length is 7.5 nt, also including instances with a gap. In principle, the overlap can vanish due to a complete deletion of the duplicate and due to sequence divergence between the functional and the non-functional copy that eventually erases all detectable sequence similarity. The latter explanation is plausible only for species pairs with large phylogenetic distance. The is no a priori theoretical prediction for the expected overlap in the case of inversions and transpositions. Among the seven alignments classied as inversions by CREx there is no or only a marginal overlap of a few nucleotides. Note that overlaps of 2-3 nt can be easily explained by random matches rather than as a remnant of the genome rearrangement. Thus, our data indicates that the molecular mechanism generating inversions neither duplicates not deletes sequence in the breakpoint re- gion. Also, there seems to be no common sequence pattern associated with these breakpoints.

75 Table 5.1: The rearranged pairs with average overlap size ℓ¯ > 10 nt and their rear- rangement type as specied in the literature

Species Accessions ℓ¯ Literature Type Reference Maulisia mauli NC_011007 127 TDRL [64] Normichthys operosus NC_011009 Phaeognathus hubrichti NC_006344 no duplication [122] 66 Hydromantes brunus NC_006345 TDRL [123] Galaxias maculatus NC_004594 duplication 62 [64] Galaxiella nigrostriata NC_008448 deletion Jordanella floridae NC_011387 duplication 59 [124] Xenotoca eiseni NC_011381 deletion Olisthops cyanomelas NC_009061 27 duplication [125] Chlorurus sordidus NC_006355 insignis NC_022480 25 TDRL [126] platycephala NC_010199 Platax orbicularis NC_013136 20 annotation error trnY [127] Luvarus imperialis NC_009851 Ischikauia steenackeri NC_008667 20 annotation error trnI [128] Chanodichthys mongolicus NC_008683 Albula glossodonta NC_005800 20 TDRL [129] Pterothrissus gissu NC_005796 Galago senegalensis NC_012761 19 annotation error trnI [130] Otolemur crassicaudatus NC_012762 Batrachoseps wrighti NC_006333 duplication 19 [123] Batrachoseps attenuatus NC_006340 deletion Diplophos taenia NC_002647 19 TDRL [131] Chauliodus sloani NC_003159 Chlorurus sordidus NC_006355 duplication 18 [125] Xenotoca eiseni NC_011381 deletion Chauliodus sloani NC_003159 duplication 10 [132] Myctophum affine NC_003163 deletion

76 (a) (b) (c) 40 40 40

30 type 30 rank 30 class Inversion large TDRL family small Trans old TDRL 20 20 20 count count count

10 10 10

0 0 0 −50 0 50 −50 0 50 −50 0 50 overlap overlap overlap

Figure 5-8: Distribution of overlap sizes for the breakpoints types of pairs of animal species (a) as classied by CREx. (b) By species pairs are grouped by the phylogenetic age of their last common ancestor according to the NCBI taxonomy. (c) By the class to which the overlap belongs

Transpositions can be generated by TDRL or a recombination. The 177 align- ments of the breakpoint regions of gene orders separated by a transposition show an average overlap of 5.2 nt. Only for less than half of the alignments (83) an overlap ≥ 0 has been found. This can be explained in part by annotation errors. The main reason, however, is that both possible choice for the ancestral state, i.e., the reference sequence 퐹 are included for each pair. Only one of them is expected to generate an overlap (see below). Thus many of the cases that CREx predicts as transposition are generated by a TDRL on a molecular level. By design, CREx prefers to predict transpositions over TDRLs as the more parsimonious alternative if both are possi- ble explanations. Table 5.1 summarizes example cases from the literature where we detected large overlaps. All these rearrangements are explained by duplication mech- anism or TDRL in the literature. However three cases are not originally rearranged, but errors in the RefSeq annotation lead to spurious rearrangements. We consider Hydromantes brunus in some detail. In [122], the rearrangement in H. brunus is claimed to be without duplication based on an analysis of nucleotide skew to classify between rearrangement or partial genome duplication. The same rearrangement has been claimed to be a TDRL in [123]. The alignment of the corre- sponding breakpoints in Fig. 5-7 shows an average overlap of 74 nt which is a strong evidence of duplication mechanism. Thus we support the TDRL model [123] with regard to the molecular mechanism. Nevertheless our alignments show a dierent in-

77

Figure 5-9: Rearrangement of Hydromantes brunus. (A) Ancestral gene order, the underlined genes are duplicated. (B) Duplication and deletion. (C) Current gene order with genes remnants in gray boxes

4

3

type 2 Inversion

count Trans

1

0 0 20 40 60 size difference

Figure 5-10: Dierence in average overlap size between symmetric rearrangements in the two directions in absolute value ferred intermediate state, as shown in Fig. 5-9. In [123] the duplicated genes are: part of nad2, trnW, trnA,trnN, O퐿, trnC and trnY. We, however, suggest the duplication of trnW, trnA,trnN, O퐿, trnC, trnY and cox1. The absence of the remnant of nad2 and the presence of the cox1 remnant support our model. Although none of the inversions shows large overlaps, there is no clear association between the size of the overlap or gap with TDRL or transposition according to the CREx classication. In fact, at least a substantial part of the transpositions is clearly the result of a rearrangement mechanism that generates sizable duplication of mitogenomic sequence. To our surprise overlaps are also found for older rearrangements, see the Fig. 5- 8(b). From the largest overlaps shown in Tab. 5.1 at least three pairs are on order level

78 (J. floridae-X. eiseni, O. cyanomelas-C. sordidus, D. taenia-C. sloani). Thus, dele- tion is not always immediately and duplication remnants can be detected sometimes also for ancient rearrangements. Fig. 5-8(c) summarizes the overlap sizes grouped into three classes. The rst two correspond to breakpoints of symmetric rearrangements. Here we distinguish at

each breakpoint the choice of the reference 퐹 yields the larger or the smaller average overlap, respectively. 81% of the overlaps and 70% of the gaps fall into large and small class, respectively. The third class are the asymmetric rearrangements, i.e.,

TDRLs, which allow only one choice of 퐹 . The dierence in overlap size caused by the choice of the reference 퐹 is shown in Fig. 5-10 for the symmetric rearrangements. In 67% of the 49 cases the dierence is larger than 10 nucleotides (note that the one large dierence for an inversion is due to annotation errors). This relatively large dierence is due to an overlap caused by duplication in one genome, which turns to be a gap when this species is used as

a reference. The comparison of the alignments for both choices of 퐹 thus makes it possible to deduce the direction of the symmetric rearrangements, i.e., the ancestral and the derived gene order. It also shows the importance of the choice of the cor- rect reference sequence to allow to discriminate between rearrangements caused by a duplication mechanism and those that result from other mechanisms such as inver- sion and translocation. However, in phylogenetically old rearrangement events the sequences of duplicated genes may be totally degenerated; in these cases, it becomes dicult to specify the mechanism behind the rearrangement or its direction.

5.5 Concluding Remarks

We have introduced a specialized three-way alignment model to study the breakpoint regions of mitochondrial genome rearrangements in detail. We observed several un- expected features. In particular, a substantial fraction of rearrangements involves duplication of genomic DNA, many of which have not been recognized as TDRL-like

79 events. This in particular pertains to many events that have been classied as trans- positions. On the other hand, some apparent TDRL events do not produce overlaps. While it is possible that in the case of ancient events the genomic sequences have di- verged beyond the point where duplicated DNA is still recognizable, it is also possible that some of the ostensible TDRLs in fact correspond to possibly multiple rearrange- ments of other types. Clearly, further investigations into the individual cases will be necessary to resolve this issue completely. The present study at the very least adds to the evidence that multiple arrangement mechanisms are at work and indicates that their classication is by no means a trivial task. Reports of mitogenomic rearrangement are usually not suciently precise. In most publications the (potential) molecular mechanisms are not analyzed at all. The term translocation is often used to express only that genes have been moved rather than to designate a specic type of permutation or a molecular mechanism. The term TDRL is used more or less interchangeably for the resulting permutation and the molecular mechanism generating it; the published analysis usually is limited to the permutation. For the other rearrangement operations that are assumed in mitochondrial genomes, i.e., inversions, transpositions, and inverse transpositions, the mechanisms still await elucidation in most cases. In this paper we have presented a method that can help to uncover the molecular mechanisms of genome rearrangement. In conjunction with the available methods for genome rearrangement analysis [67] and the consideration of tRNA remolding [36], which can mimic rearrangements, the analyses of breakpoint sequences will signicantly improve our understanding of mitochondrial gene order evolution.

80 Chapter 6

Conclusion

In this work, two contributions to the study of mitochondrial DNA are presented. These are the accurate annotation of protein-coding genes in mitochondria and the sequence signature of genome rearrangements. The methods designed in both prob- lems are implemented and tested on wide set of mitochondrial data even though there are no constraints on applying the rearrangement methods on nuclear DNA since it is meaningfully applied only to a sequence interval of moderate size around a genomic breakpoint. In the rst contribution we developed a fully automated pipeline for precise, au- tomatic and fast annotation of protein-coding genes in mitochondria. The methods creates taxon-specic hidden Markov models from a set of annotated sequences and their phylogeny approximation. These models are improved by detecting and cor- recting the frameshifts in the alignments based on which the models are built in addition to removing the non tting sequences and columns from these alignments. Moreover, the prediction of the start and stop of the genes is enhanced based on a sophisticated method that takes into account several factors such as the exclusion of adjacent annotated tRNAs, the estimated distance to the start or end of the gene ... The methods were applied on RefSeq data and the analysis of the results showed a high accuracy compared the given annotations. Moreover, a number of errors in the training data extracted from RefSeq were found. Additionally new frameshifts that are not mentioned in the literature were identied in metazoan mtDNA.

81 The main drawback of the annotation method is the dependency on phylogenetic tree which could lessen the sensitivity of the models in case of misclassication of the training sequences. Furthermore, the bias in the quantity of sequences for dif- ferent taxonomic groups leads to higher accuracy for species belonging to the most standard mtDNAs i.e species such as vertebrates or insects. However, this biased is solved by the use of more specic HMMs. Likewise the same solution could be used for annotating i) fast evolving taxa such as Chelicerata and Tunicata, ii) and most variable genes such as atp8, nad6 and nad4l. In the second problem, a new algo- rithm to align the breakpoint region of a rearrangement is developed. The algorithm uses the dynamic programming approach to align the sequences of two consecutive genes in a reference genome to a left and right genes which are not consecutive in the query genome (breakpoint). A partially local three-way model is applied to study the breakpoint region. The application at the level of mtDNA proves that multiple rearrangement mechanisms are at work and indicates that their classication is by no means a trivial task. The analysis showed an unexpected fraction of rearrangements that involves duplication of parts of the DNA. Of these rearrangements many cases are mislabeled as transpositions events instead of TDRL-like events which involve duplication of the DNA parts. Thus this study covers the deciency of information about the molecular mechanisms behind the genome rearrangement events since in most publications the (potential) molecular mechanisms are not analyzed at all.

Future Work As we have already mentioned above, the annotation quality of fast evolving species and variable genes is relatively less accurate (higher number of false negatives) comparing to other species and other genes. The problematic cases can be solved by scanning only the missed genes against more specic models. Nevertheless, the more specic models are not used in general because of the higher run time this problem will be considered in the future work. On the other hand, the partially local alignment problem introduced in this s- tudy poses theoretical questions, which we hope to answer in forthcoming research. Most obviously, it suggests to consider multiple alignments in which, for each in-

82 put sequence and both of its ends, either a local or global alignment is requested. Given such a problem specication, how can one construct a corresponding dynamic programming algorithm, possibly with the additional constraint that the dynamic programming scheme also supports a probabilistic version. Beyond this question one may of course also ask whether there are interesting generalization of overlap align- ments and combinations of local, global, and overlap alignments. Furthermore, the alignment method introduced here is not limited to mitochondrial genomes. Since it is meaningfully applied only to a sequence interval of moderate size around a genomic breakpoint, it could also be applied to study the evolution of nuclear genomes and to investigate chromosomal rearrangements in a cancer research context.

83 Appendix A

Protein Genes Annotaion

A.1 Overview on Results tp = true positive, pseudo = pseudogene, cds = coding sequence (in genbank)

Table A.1: Details about over predicted genes accession gene Evalue copy start stop notes

NC-024152 cox1 4.60E-277 no 6566 8096 tp no cds NC-023382 nad5 7.00E-172 yes 13652 15155 pseudogene="unprocessed" NC-025504 cox1 2.90E-152 no 2902 2039 pseudo tp NC-024152 nad4 1.30E-146 no 11436 12513 tp no cds NC-024605 nad1 3.90E-132 no 12559 11654 tp no cds NC-024152 cox3 3.20E-124 no 9853 10612 tp no cds NC-024152 nad1 1.10E-115 no 4103 4910 tp no cds NC-023381 nad2 1.60E-109 no 11320 12331 not in RefSeq nothing in these coords NC-024751 nad2 9.20E-99 no 14318 15287 tp no cds NC-024260 atp6 1.70E-56 no 9451 8849 tp no cds NC-024152 nad2 1.70E-53 no 5597 6137 tp no cds NC-024152 nad2 1.10E-52 no 5154 5598 tp no cds

84 accession gene Evalue copy start stop notes

NC-025505 atp6 3.50E-50 no 1541 2159 pseudo .. Tp NC-025638 nad6 2.00E-45 no 14326 13832 not in RefSeq (nothing in these coords) NC-024045 nad2 4.50E-44 yes 4601 5045 not in RefSeq (nothing in these coords) NC-024045 nad1 4.60E-44 yes 3480 3798 not in RefSeq (nothing in these coords) NC-024152 cob 5.70E-42 no 15265 15628 tp no cds NC-025509 cox2 8.70E-39 yes 7524 7980 not in RefSeq (nothing in these coords) NC-025505 cox3 3.00E-38 no 2731 2444 pseudo tp NC-024152 nad6 2.00E-35 no 16622 16131 tp no cds NC-023381 nad2 3.10E-35 no 8137 8878 tp pseudo NC-024152 cob 6.70E-34 no 15557 15947 tp no cds NC-025509 nad4 1.90E-32 yes 4322 4793 not in RefSeq (nothing in these coords) NC-024152 cob 1.30E-31 no 15084 15282 tp no cds NC-025504 cox1 4.50E-28 no 5256 5496 pseud tp NC-024927 cox2 3.00E-26 yes 7662 7890 not in RefSeq nothing in theses coords NC-024152 nad4 3.80E-26 no 12508 12763 tp no cds NC-024152 nad4l 2.70E-25 no 11131 11404 NC-024152 cob 8.30E-25 no 14828 14990 NC-025504 cob 3.20E-21 no 10438 10286 pseudo tp NC-025633 cox1 1.70E-20 yes 18846 19050 not in RefSeq (nothing in these coords) NC-025504 cob 1.10E-19 no 8322 8149 pseudo tp NC-025509 nad5 4.20E-19 yes 2633 3248 not in RefSeq (nothing in these coords) NC-025504 cox2 1.90E-18 no 3092 3224 pseudo tp NC-025503 cox1 6.20E-18 yes 11502 11392 tp pseudo NC-025504 nad5 8.50E-18 no 5507 5633 pseudo tp NC-024941 nad6 2.40E-17 yes 17174 16734 not in RefSeq (control region) NC-024152 nad1 8.00E-17 no 4000 4111 tp no cds NC-025908 atp8 8.70E-16 no 7941 8106 tp no cds NC-024416 nad4 3.60E-15 yes 8405 8238 not in RefSeq (nothing in these coords)

85 accession gene Evalue copy start stop notes

NC-024927 nad4l 4.60E-15 no 23509 23749 in RefSeq tp NC-025755 cob 1.30E-14 yes 6853 6970 not in RefSeq (nothing in these coords) NC-025503 cox2 2.60E-14 yes 10766 10889 tp pseudo NC-024275 nad3 8.70E-14 no 7299 7605 in RefSeq tp NC-025505 cob 1.50E-13 no 4039 4129 pseudo tp NC-024202 nad3 2.90E-13 yes 6531 6681 not in RefSeq (nothing in these coords) NC-025509 cob 1.60E-11 yes 12695 12845 not in RefSeq (nothing in these coords) NC-025757 atp8 4.90E-10 no 6454 6613 not in RefSeq NC-023345 cox3 1.80E-09 yes 6068 6170 not in RefSeq (nothing in these coords) NC-025509 cob 5.10E-09 yes 42799 42623 not in RefSeq (nothing in these coords) NC-025504 nad4 8.70E-09 no 8779 8911 pseudo tp NC-025766 atp8 7.30E-08 no 5873 6035 not in RefSeq NC-024193 nad4 7.90E-08 yes 11936 12083 not in RefSeq (nothing in these coords) NC-025925 cob 9.70E-07 yes 16939 17056 tp duplication of 3¢ob NC-025921 cob 1.30E-06 yes 16756 16882 tp duplication of 3¢ob NC-025922 cob 3.70E-06 yes 16711 16897 tp duplication of 3¢ob NC-025504 atp6 8.20E-06 no 7277 7134 pseudo tp NC-024185 atp6 8.90E-06 yes 8809 8899 not in RefSeq (nothing in these coords) NC-025628 atp8 0.00016 yes 952 1135 control region NC-025504 nad2 0.00034 no 4094 4196 pseudo fn NC-023940 atp6 0.00038 yes 725 905 Dloop NC-025472 cox1 0.0005 yes 8326 8419 Dloop NC-025612 atp8 0.00059 yes 938 1028 control region NC-025504 cox3 0.00067 no 9265 9382 pseudo fn NC-025509 nad4 0.00068 yes 13252 13519 not in RefSeq (nothing in these coords) NC-024601 cob 0.00083 yes 2565 2676 l-rrna NC-024048 atp8 0.00093 yes 16504 16585 control region NC-023975 atp8 0.0019 yes 16315 16142 control region

86 accession gene Evalue copy start stop notes

NC-023460 nad6 0.0027 yes 13213 13130 not in RefSeq (nothing in these coords) NC-024927 nad4l 0.003 no 12997 13087 not in RefSeq (nothing in these coords) NC-024927 cox2 0.0032 yes 28489 28648 not in RefSeq (nothing in these coords) NC-025786 atp8 0.0033 yes 16622 16751 control region NC-025601 atp8 0.0035 yes 949 1027 control region NC-024524 atp8 0.0037 yes 16509 16692 control region NC-025602 atp8 0.004 yes 959 1034 control region NC-025751 nad5 0.0043 yes 2815 2678 repeat region NC-024927 atp8 0.0051 no 14655 14715 not in RefSeq (nothing in these coords)

A.2 Taxonomy of New Frameshifts

Table A.2: Taxonomy of new detected frameshifts

accession gene taxa

NC-022685 cox3 Pieridae NC-013275 atp6 Lucinoidea NC-014485 atp6 Mutillidae NC-016723 nad3 Anseriformes NC-001947 nad3 Pleurodira NC-014401 nad3 Geoemydidae NC-013251 nad3 Raphidioptera NC-009509 nad3 Geoemydidae NC-007440 nad3 Dicroglossidae NC-007896 nad4l Incirrata NC-011823 nad4 Gryllidae

87 accession gene taxa

NC-021747 nad4 Helicoidea NC-022957 nad4 Accipitridae NC-013474 nad4 Ascoidea NC-013604 nad5 Heliconiinae NC-011823 nad6 Gryllidae NC-005803 nad6 Elopiformes NC-001573 nad6 Xenopodinae NC-009458 cob Lithobiomorpha NC-019576 cob Fulgoroidea NC-017839 cob Chelonioidea NC-019645 cob Synodontidae NC-013879 nad1 Antennarioidei NC-007884 nad1 Ischnocera NC-021607 nad1 Sisoridae NC-021413 nad1 Crambidae NC-017839 nad1 Chelonioidea NC-007781 nad2 Buccinoidea NC-016865 nad2 Opomyzoidea NC-007229 atp8 Cobitidae NC-006133 cox1 Perloidea NC-016696 cox1 Conocephalinae NC-005781 cox1 Drosophilidae NC-022938 cox1 Parastacoidea NC-005780 cox1 Drosophilidae

88 Appendix B

Semi Global Alignments of Breakpoints Regions

B.1 Pairs Used in the Study

Table B.1: Pairs used in the study accession1 species1 accession2 species2

NC-005298 Saccopharynx lavenbergi NC-005298 Saccopharynx lavenbergi NC-008124 Bregmaceros nectabanus NC-008124 Bregmaceros nectabanus NC-023250 Solenaia carinatus NC-023250 Solenaia carinatus NC-005929 Cucumaria miniata NC-005929 Cucumaria miniata NC-008225 Ventrifossa garmani NC-008225 Ventrifossa garmani NC-013257 Ditaxis biseriata NC-013257 Ditaxis biseriata NC-009851 Luvarus imperialis NC-009851 Luvarus imperialis NC-013571 Eothenomys chinensis NC-013571 Eothenomys chinensis NC-022477 Kurtus gulliveri NC-022477 Kurtus gulliveri NC-011009 Normichthys operosus NC-011009 Normichthys operosus NC-015305 Odorrana ishikawae NC-015305 Odorrana ishikawae

89 accession1 species1 accession2 species2

NC-023228 Paraplagusia blochii NC-023228 Paraplagusia blochii NC-007012 Scleropages formosus NC-007012 Scleropages formosus NC-008778 Varanus niloticus NC-008778 Varanus niloticus NC-005796 Pterothrissus gissu NC-005796 Pterothrissus gissu NC-006355 Chlorurus sordidus NC-006355 Chlorurus sordidus NC-007689 Neogymnocrinus richeri NC-007689 Neogymnocrinus richeri NC-003163 Myctophum ane NC-003163 Myctophum ane NC-012762 Otolemur crassicaudatus NC-012762 Otolemur crassicaudatus NC-006321 Clymenella torquata NC-006321 Clymenella torquata NC-010777 Hypochilus thorelli NC-010777 Hypochilus thorelli NC-006335 Plethodon elongatus NC-006335 Plethodon elongatus NC-011381 Xenotoca eiseni NC-011381 Xenotoca eiseni NC-006345 Hydromantes brunus NC-006345 Hydromantes brunus NC-007440 Limnonectes fujianensis NC-007440 Limnonectes fujianensis NC-010199 Odontobutis platycephala NC-010199 Odontobutis platycephala NC-010199 Odontobutis platycephala NC-010199 Odontobutis platycephala NC-023228 Paraplagusia blochii NC-023228 Paraplagusia blochii NC-006340 Batrachoseps attenuatus NC-006340 Batrachoseps attenuatus NC-012762 Otolemur crassicaudatus NC-012762 Otolemur crassicaudatus NC-012688 Cephus cinctus NC-012688 Cephus cinctus NC-006355 Chlorurus sordidus NC-006355 Chlorurus sordidus NC-008975 Buergeria buergeri NC-008975 Buergeria buergeri NC-008797 Conus textile NC-008797 Conus textile NC-007781 Ilyanassa obsoleta NC-007781 Ilyanassa obsoleta NC-006340 Batrachoseps attenuatus NC-006340 Batrachoseps attenuatus NC-006335 Plethodon elongatus NC-006335 Plethodon elongatus NC-008975 Buergeria buergeri NC-008975 Buergeria buergeri NC-009061 Olisthops cyanomelas NC-009061 Olisthops cyanomelas

90 accession1 species1 accession2 species2

NC-005796 Pterothrissus gissu NC-005796 Pterothrissus gissu NC-005961 Rena humilis NC-005961 Rena humilis NC-003159 Chauliodus sloani NC-003159 Chauliodus sloani NC-008448 Galaxiella nigrostriata NC-008448 Galaxiella nigrostriata NC-009851 Luvarus imperialis NC-009851 Luvarus imperialis NC-007978 Synthliboramphus antiquus NC-007978 Synthliboramphus antiquus NC-022670 Brachyrhynchus hsiaoi NC-022670 Brachyrhynchus hsiaoi NC-004448 Alligator sinensis NC-004448 Alligator sinensis NC-006355 Chlorurus sordidus NC-006355 Chlorurus sordidus NC-011381 Xenotoca eiseni NC-011381 Xenotoca eiseni NC-023228 Paraplagusia blochii NC-023228 Paraplagusia blochii NC-010199 Odontobutis platycephala NC-010199 Odontobutis platycephala NC-014174 Brookesia decaryi NC-014174 Brookesia decaryi NC-015305 Odorrana ishikawae NC-015305 Odorrana ishikawae NC-008683 Chanodichthys mongolicus NC-008683 Chanodichthys mongolicus NC-008831 Tigriopus californicus NC-008831 Tigriopus californicus NC-004373 Carapus bermudensis NC-004373 Carapus bermudensis NC-015076 Jenkinsia sp. CL54 NC-015076 Jenkinsia sp. CL54 NC-009585 Ilisha elongata NC-009585 Ilisha elongata NC-012762 Otolemur crassicaudatus NC-012762 Otolemur crassicaudatus NC-008123 Sirembo imberbis NC-008123 Sirembo imberbis NC-012434 Myosotella myosotis NC-012434 Myosotella myosotis NC-010268 Aulorhynchus avidus NC-010268 Aulorhynchus avidus NC-006286 Bipes tridactylus NC-006286 Bipes tridactylus NC-011381 Xenotoca eiseni NC-011381 Xenotoca eiseni NC-006286 Bipes tridactylus NC-006286 Bipes tridactylus NC-008683 Chanodichthys mongolicus NC-008683 Chanodichthys mongolicus NC-009851 Luvarus imperialis NC-009851 Luvarus imperialis

91 accession1 species1 accession2 species2

NC-012689 Orussus occidentalis NC-012689 Orussus occidentalis NC-009683 Calotes versicolor NC-009683 Calotes versicolor NC-012571 Panonychus ulmi NC-012571 Panonychus ulmi NC-023228 Paraplagusia blochii NC-023228 Paraplagusia blochii NC-023222 Hemibagrus spilopterus NC-023222 Hemibagrus spilopterus NC-012463 Saldula arsenjevi NC-012463 Saldula arsenjevi NC-014703 Cervus canadensis songaricus NC-014703 Cervus canadensis songaricus NC-013571 Eothenomys chinensis NC-013571 Eothenomys chinensis

92 Bibliography

[1] Volkmar Weissig. Mitochondrial-targeted drug and dna delivery. Critical Re- views in Therapeutic Drug Carrier Systems, 20, 2003.

[2] Tomasz Grzybowski and Urszula Rogalla. Mitochondria in anthropology and forensic medicine. 2012.

[3] AM Bento, V Lopes, A Serra, H Afonso Costa, F Balsa, L Andrade, C Oliveira, L Batista, MJ Anjos, M Carvalho, et al. Forensic application of mitochondrial dna snps. Forensic Science International: Genetics Supplement Series, 2:215 216, 2009.

[4] Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. Basic local alignment search tool. Journal of molecular biology, 215:403410, 1990.

[5] Stacia K. Wyman, Robert K. Jansen, and Jerey L. Boore. Automatic an- notation of organellar genomes with DOGMA. Bioinformatics, 20:32523255, 2004.

[6] Matthias Bernt, Alexander Donath, Frank Jühling, Fabian Externbrink, Catherine Florentz, Guido Fritzsch, Jörn Pütz, Martin Middendorf, and Pe- ter F. Stadler. MITOS: Improved de novo metazoan mitochondrial genome an- notation. Molecular Phylogenetics and Evolution, 69:313319, 2013.

[7] Wataru Iwasaki, Tsukasa Fukunaga, Ryota Isagozawa, Koichiro Yamada, Ya- sunobu Maeda, Takashi P. Satoh, Tetsuya Sado, Kohji Mabuchi, Hirohiko Takeshima, Masaki Miya, and Mutsumi Nishida. MitoFish and MitoAnno- tator: a mitochondrial genome database of sh with an accurate and automatic annotation pipeline. Molecular Biology and Evolution, 30:25312540, 2013.

[8] Matthias Bernt, Daniel Merkle, Kai Rasch, Guido Fritzsch, Marleen Perseke, Detlef Bernhard, Martin Schlegel, Peter F. Stadler, and Martin Middendorf. CREx: Inferring genomic rearrangements based on common intervals. Bioinfor- matics, 23:29572958, 2007.

[9] Laurent Bulteau, Guillaume Fertin, and Eric Tannier. Genome rearrangements with indels in intergenes restrict the scenario space. BMC Bioinformatics, 17:225, 2016.

93 [10] Claire Lemaitre, Eric Tannier, Christian Gautier, and Marie-France Sagot. Pre- cise detection of rearrangement breakpoints in mammalian chromosomes. BMC bioinformatics, 9:286, 2008.

[11] Fabian Kilpert and Lars Podsiadlowski. The australian fresh water isopod (phreatoicidea: Isopoda) allows insights into the early mitogenomic evolution of isopods. Comparative Biochemistry and Physiology Part D: Genomics and Proteomics, 5:3644, 2010.

[12] Fabian Kilpert and Lars Podsiadlowski. The mitochondrial genome of the japanese skeleton shrimp caprella mutica (amphipoda: Caprellidea) reveals a unique gene order and shared apomorphic translocations with gammaridea. Mi- tochondrial DNA, 21:7786, 2010.

[13] Yoshinori Kumazawa, Saaya Miura, Chiemi Yamada, and Yasuyuki Hashiguchi. Gene rearrangements in gekkonid mitochondrial genomes with shu ing, loss, and reassignment of trna genes. BMC genomics, 15:930, 2014.

[14] Frank Jühling, Joern Pütz, Matthias Bernt, Alexander Donath, Martin Mid- dendorf, Catherine Florentz, and Peter F Stadler. Improved systematic trna gene annotation allows new insights into the evolution of mitochondrial trna structures and into the mechanisms of mitochondrial genome rearrangements. Nucleic acids research, 40:28332845, 2011.

[15] Todd M Lowe and Sean R Eddy. trnascan-se: a program for improved detection of transfer rna genes in genomic sequence. Nucleic acids research, 25:955, 1997.

[16] Abdullah Sahyoun. Computational investigations into the evolution of mito- chondrial genomes. 2014.

[17] William A. Wells. There's DNA in those organelles. The Journal of Cell Biology, 168:853, 2005.

[18] Heidi M McBride, Margaret Neuspiel, and Sylwia Wasiak. Mitochondria: more than just a powerhouse. Current biology: CB, 16:551560, 2006.

[19] Der-Fen Suen, Kristi L. Norris, and Richard J. Youle. Mitochondrial dynamics and apoptosis. Genes & Development, 22:15771590, 2008.

[20] Mentel M. Martin W. The origin of mitochondria. Nature Education, 3:58, 2010.

[21] Peter E Thorsness and Theodor Hanekamp. Mitochondria: Origin. 2003.

[22] D. R. Wolstenholme. Animal mitochondrial DNA: structure and evolution. International Review of Cytology, 141:173216, 1992.

[23] D. A. Clayton. Transcription of the Mammalian Mitochondrial Genome. Annual Review of Biochemistry, 53:573594, 1984.

94 [24] S Castellana, S Vicario, and C Saccone. Evolutionary patterns of the mito- chondrial genome in metazoa: exploring the role of mutation and selection in mitochondrial proteincoding genes. Genome biology and evolution, 3:1067 1079, 2011.

[25] UniProt Consortium et al. Uniprot: the universal protein knowledgebase. Nu- cleic acids research, 45:D158D169, 2017.

[26] Robert A Scott. Functional signicance of cytochrome c oxidase structure. Structure, 3:981986, 1995.

[27] Anne-Marie Duchêne, Claire Pujol, and Laurence Maréchal-Drouard. Import of tRNAs and aminoacyl-tRNA synthetases into mitochondria. Curr Genet, 55:118, 2009.

[28] CT Beagley, JL Macfarlane, GA Pont-Kingdon, R Okimoto, NA Okada, and DR Wolstenholme. Mitochondrial genomes of Anthozoa (Cnidaria). Prog Cell Res, 5:149153, 1995.

[29] Kevin G Helfenbein, H Matthew Fourcade, Rohit G Vanjani, and Jerey L Boore. The mitochondrial genome of paraspadella gotoi is highly reduced and reveals that chaetognaths are a sister group to protostomes. P Natl Acad Sci USA, 101:1063910643, 2004.

[30] Sharon Anderson, Alan T Bankier, Bart G Barrell, MH De Bruijn, Alan R Coulson, Jacques Drouin, IC Eperon, DP Nierlich, Bruce A Roe, Frederick Sanger, et al. Sequence and organization of the human mitochondrial genome. Nature, 290:457465, 1981.

[31] Paolo Arcari and George G Brownlee. The nucleotide sequence of a small (3s) seryl-trna (anticodon gcu) from beef heart mitochondria. Nucleic acids research, 8:52075212, 1980.

[32] MHL De Bruijn, PH Schreier, IC Eperon, BG Barrell, EY Chen, PW Arm- strong, JFH Wong, and BA Roe. A mammalian mitochondrial serine transfer rna lacking the “dihydrouridine” loop and stem. Nucleic acids research, 8:52135222, 1980.

[33] Frank Jühling, Joern Pütz, Catherine Florentz, and Peter F Stadler. Armless mitochondrial trnas in enoplea (nematoda). RNA biology, 9:11611166, 2012.

[34] P Cantatore, MN Gadaleta, M Roberti, C Saccone, and AC Wilson. Dupli- cation and remoulding of trna genes during the evolutionary rearrangement of mitochondrial genomes. Nature, 329:853855, 1987.

[35] Timothy A Rawlings, Timothy M Collins, and Rüdiger Bieler. Changing iden- tities: trna duplication and remolding within animal mitochondrial genomes. Proceedings of the National Academy of Sciences, 100:1570015705, 2003.

95 [36] Abdullah H. Sahyoun, Martin Hölzer, Frank Jühling, Christian Höner zu Siederdissen, Marwa Al-Arab, Kifah Tout, Manja Marz, Martin Middendorf, Peter F. Stadler, and Matthias Bernt. Towards a comprehensive picture of al- loacceptor trna remolding in metazoan mitochondrial genomes. Nucleic Acids Research, 43:80448056, 2015.

[37] T Martin Schmeing and V Ramakrishnan. What recent ribosome structures have revealed about the mechanism of translation. Nature, 461:12341242, 2009.

[38] Thomas A Steitz. A structural understanding of the dynamic ribosome machine. Nature Reviews Molecular Cell Biology, 9:242253, 2008.

[39] Poul Nissen, Jerey Hansen, Nenad Ban, Peter B Moore, and Thomas A Steitz. The structural basis of ribosome activity in peptide bond synthesis. Science, 289:920930, 2000.

[40] Himeno Hyouta, Masaki Haruhiko, Kawai Tatsushi, Ohta Takahisa, Kumagai Izumi, Miura Kin-ichiro, and Watanabe Kimitsuna. Unusual genetic codes and a novel gene structure for tRNASeragy in starsh mitochondria! DNA. Gene, 56:219230, 1987.

[41] D V Lavrov, W Pett, O Voigt, G Wörheide, L Forget, B F Lang, and E Kay- al. Mitochondrial DNA of Clathrina clathrus (Calcarea, Calcinea): six linear chromosomes, fragmented rRNAs, tRNA editing, and a novel genetic code. Molecular Biology and Evolution, 30:865880, 2013.

[42] Matthias Bernt, Christoph Bleidorn, Anke Braband, Johannes Dambach, Alexander Donath, Guido Fritzsch, Anja Golombek, Heike Hadrys, Frank Jühling, Karen Meusemann, Martin Middendorf, Bernhard Misof, Marleen Perseke, Lars Podsiadlowski, Björn von Reumont, Bernd Schierwater, Mar- tin Schlegel, Michael Schrödl, Sabrina Simon, Peter F. Stadler, Isabella Stöger, and Torsten H. Struck. A comprehensive analysis of bilaterian mitochondrial genomes and phylogeny. Molecular Phylogenetics and Evolution, 69:352364, 2013.

[43] Deanna Ojala, Julio Montoya, and Giuseppe Attardi. tRNA punctuation model of RNA processing in human mitochondria. Nature, 290:470474, 1981.

[44] P. J. Farabaugh. Programmed translational frameshifting. Microbiological Re- views, 30:507528, 1996.

[45] Jonathan D. Dinman. Programmed ribosomal frameshifting goes beyond virus- es: Organisms from all three kingdoms use frameshifting to regulate gene ex- pression, perhaps signaling a paradigm shift. Microbe, 1:521527, 2006.

[46] A. Harlid, A. Janke, and U. Arnason. The mtDNA sequence of the ostrich and the divergence between paleognathous and neognathous birds. Molecular Biology and Evolution, 14:754761, 1997.

96 [47] D. P. Mindell, M. D. Sorenson, and D. E. Dimche. An extra nucleotide is not translated in mitochondrial ND3 of some birds and turtles. Molecular Biology and Evolution, 15:15681571, 1998.

[48] R. David Russell and Andrew T. Beckenbach. Recoding of translation in turtle mitochondrial genomes: Programmed frameshift mutations and evidence of a modied genetic code. Journal of Molecular Evolution, 67:682695, 2008.

[49] James F. Parham, Chris R. Feldman, and Jerey L. Boore. The complete mito- chondrial genome of the enigmatic bigheaded turtle (Platysternon): description of unusual genomic features and the reconciliation of phylogenetic hypotheses based on mitochondrial and nuclear DNA. BMC Evolutionary Biology, 6:11, 2006.

[50] Rafael Zardoya and Axel Meyer. Complete mitochondrial genome suggests diapsid anities of turtles. Proceedings of the National Academy of Sciences of the United States of America, 95:1422614231, 1998.

[51] Andrew T. Beckenbach, Simon K. A. Robson, and Ross H. Crozier. Single nucleotide +1 frameshifts in an apparently functional mitochondrial cytochrome b gene in ants of the genus Polyrhachis. Journal of Molecular Evolution, 60:141 152, 2005.

[52] Coren A. Milbury and Patrick M. Ganey. Complete mitochondrial DNA se- quence of the eastern oyster Crassostrea virginica. Marine Biotechnology, 7:697 712, 2005.

[53] Rafael D. Rosengarten, Erik A. Sperling, Maria A. Moreno, Sally P. Leys, and Stephen L. Dellaporta. The mitochondrial genome of the hexactinellid sponge Aphrocallistes vastus: Evidence for programmed translational frameshifting. BMC Genomics, 9:33, 2008.

[54] Richard Temperley, Ricarda Richter, Sven Dennerlein, Robert N. Lightowlers, and Zoa M. Chrzanowska-Lightowlers. Hungry codons promote frameshifting in human mitochondrial ribosomes. Science, 327:301, 2010.

[55] M J Smith, D K Bameld, K Doteval, S Gorski, and D J Kowbel. Gene arrangement in sea star mitochondrial DNA demonstrates a major inversion event during echinoderm evolution. Gene, 76:181185, 1989.

[56] Mark Dowton and Nick J H Campbell. Intramitochondrial recombination  is it why some mitochondrial genes sleep around? Trends Ecol Evol, 16:269271, 2001.

[57] Sandra R. Bacman, Sion L. Williams, and Carlos T. Moraes. Intra- and inter- molecular recombination of mitochondrial dna after invivo induction of multiple double-strand breaks. Nucleic Acids Res, 37:42184226, 2009.

97 [58] J. L. Boore, T. M. Collins, D Stanton, L. L. Daehler, and W. M. Brown. Deduc- ing the pattern of arthropod phylogeny from mitochondrial dna rearrangements. Nature, 376:163165, 1995. [59] J. L. Boore, D. V. Lavrov, and W. M. Brown. Gene translocation links insects and crustaceans. Nature, 392:667668, 1998. [60] D H Lunt and B C Hyman. Animal mitochondrial DNA recombination. Nature, 387:247, 1997. [61] Casper Groth, Randi Føns Petersen, and Jure Pi²kur. Diversity in organization and the origin of gene orders in the mitochondrial DNA molecules of the genus Saccharomyces. Mol Biol Evol, 17:18331841, 2000. [62] C Moritz and W M Brown. Tandem duplications of D-loop and ribosomal RNA sequences in lizards mitochondrial DNA. Science, 233:14251427, 1986. [63] J. L. Boore. The duplication/random loss model for gene rearrangement exem- plified by mitochondrial genomes of deuterostome animals, volume 1 of Compu- tational Biology Series. Dordrecht, 2000. [64] Frank Jühling, Joern Pütz, Matthias Bernt, Alexander Donath, Martin Mid- dendorf, Catherine Florentz, and Peter F. Stadler. Improved systematic tRNA gene annotation allows new insights into the evolution of mitochondrial tRNA structures and into the mechanisms of mitochondrial genome rearrangements. Nucleic Acids Res., 40:28332845, 2012. [65] W Xu, D Jameson, B Tang, and P G Higgs. The relationship between the rate of molecular evolution and the rate of genome rearrangement in animal mitochondrial genomes. J Mol Evol, 63:375392, 2006. [66] Matthias Bernt, Anke Braband, Bernd Schierwater, and Peter F. Stadler. Ge- netic aspects of mitochondrial genome evolution. Mol. Phylog. Evol., 69:328 338, 2013. [67] G Fertin, A Labarre, I Rusu, E Tannier, and S Vialette. Combinatorics of genome rearrangements. Cambridge, MA, 2009. [68] Kamalika Chaudhuri, Kevin Chen, Radu Mihaescu, and Satish Rao. On the tandem duplication-random loss model of genome rearrangement. 2006. [69] Matthias Bernt, Kuan-Yu Chen, Ming-Chiang Chen, An-Chiang Chu, Daniel Merkle, Hung-Lung Wang, Kun-Mao Chao, and Martin Middendorf. Finding all sorting tandem duplication random loss operations. Journal of Discrete Algorithms, 9:3248, 2011. [70] Tom Hartmann, An-Chiang Chu, Martin Middendorf, and Matthias Bern- t. Combinatorics of tandem duplication random loss mutations on circular genomes. IEEE/ACM Transactions on Computational Biology and Bioinfor- matics, 2016. accepted.

98 [71] Paul Medvedev, Monica Stanciu, and Michael Brudno. Computational methods for discovering structural variation with next-generation sequencing. Nature methods, 6:S13S20, 2009.

[72] Ion Mandoiu and Alexander Zelikovsky. Bioinformatics algorithms: techniques and applications, volume 3. 2008.

[73] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the aminoacid sequences of two proteins. J. Mol. Biol., 48:443 452, 1970.

[74] Temple F. Smith, , and Michael S. Waterman. Identication of common molec- ular subsequences. J Mol Biol, 147:195197, 1981.

[75] Kazutaka Katoh and Daron M. Standley. MAFFT multiple sequence alignmen- t software version 7: Improvements in performance and usability. Molecular Biology and Evolution, 30, 2013.

[76] Mark A Larkin, Gordon Blackshields, NP Brown, R Chenna, Paul A McGet- tigan, Hamish McWilliam, Franck Valentin, Iain M Wallace, Andreas Wilm, Rodrigo Lopez, et al. Clustal W and Clustal X version 2.0. Bioinformatics, 23:29472948, 2007.

[77] Siavash Mirarab, Nam Nguyen, Sheng Guo, Li-San Wang, Junhyong Kim, and Tandy Warnow. Pasta: ultra-large multiple sequence alignment for nucleotide and amino-acid sequences. Journal of Computational Biology, 22:377386, 2015.

[78] Quan Zou, Qinghua Hu, Maozu Guo, and Guohua Wang. Halign: Fast multiple similar dna/rna sequence alignment based on the centre star strategy. Bioin- formatics, 31:24752481, 2015.

[79] Xi Chen, Chen Wang, Shanjiang Tang, Ce Yu, and Quan Zou. Cmsa: a het- erogeneous cpu/gpu computing system for multiple similar rna/dna sequence alignment. BMC bioinformatics, 18:315, 2017.

[80] William R Pearson and David J Lipman. Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences, 85:24442448, 1988.

[81] Bin Qian and Richard A Goldstein. Detecting distant homologs using phylo- genetic tree-based hmms. Proteins: Structure, Function, and Bioinformatics, 52:446453, 2003.

[82] Anders Krogh, Michael Brown, I Saira Mian, Kimmen Sjölander, and David Haussler. Hidden markov models in computational biology: Applications to protein modeling. Journal of molecular biology, 235:15011531, 1994.

[83] Richard Durbin, Sean R Eddy, Anders Krogh, and Graeme Mitchison. Biological sequence analysis: probabilistic models of proteins and nucleic acids. 1998.

99 [84] Johannes Söding. Protein homology detection by hmmhmm comparison. Bioinformatics, 21:951960, 2005. [85] Sean R Eddy. A probabilistic model of local sequence alignment that simplies statistical signicance estimation. PLoS Comput Biol, 4:e1000069, 2008. [86] Sean R Eddy. Accelerated prole hmm searches. PLoS Comput Biol, 7:e1002195, 2011. [87] L Steven Johnson, Sean R Eddy, and Elon Portugaly. Hidden markov mod- el speed heuristic and iterative hmm search procedure. BMC bioinformatics, 11:431, 2010. [88] Robert D Finn, Jody Clements, and Sean R Eddy. Hmmer web server: in- teractive sequence similarity searching. Nucleic acids research, page gkr367, 2011. [89] Hmmer user guide. http:/http://hmmer.org/documentation.html. Ac- cessed: 2017-06-12. [90] John C Avise. Evolutionary pathways in nature: a phylogenetic approach. 2006. [91] Carl Linnaeus. Systema Naturae. 1758. [92] R. Zardoya and A. Meyer. Phylogenetic performance of mitochondrial protein- coding genes in resolving relationships among vertebrates. Molecular Biology and Evolution, 13:933942, 1996. [93] Sarah J. Bourlat, Claus Nielsen, Andrew D. Economou, and Maximilian J. Telford. Testing the new animal phylogeny: a phylum level molecular analysis of the animal kingdom. Molecular Phylogenetics and Evolution, 49:2331, 2008. [94] Scott Federhen. The ncbi taxonomy database. Nucleic acids research, 40:D136 D143, 2011. [95] Himeno Hyouta, Masaki Haruhiko, Kawai Tatsushi, Ohta Takahisa, Kumagai Izumi, Miura Kin-ichiro, and Watanabe Kimitsuna. Unusual genetic codes and a novel gene structure for tRNA푆푒푟 in starsh mitochondrial DNA. Gene, 퐴퐺푌 56:219230, 1987. [96] Matthias Bernt, Anke Braband, Bernd Schierwater, and Peter F. Stadler. Ge- netic aspects of mitochondrial genome evolution. Molecular Phylogenetics and Evolution, 69:328338, 2013. [97] Giuseppe Attardi. Mitochondrial Biogenesis and Genetics. 1996. [98] Takashi Nagaike, Tsutomu Suzuki, Takayuki Katoh, and Takuya Ueda. Hu- man mitochondrial mRNAs are stabilized with polyadenylation regulated by mitochondria-specic poly(A) polymerase and polynucleotide phosphorylase. The Journal of Biological Chemistry, 280:1972119727, 2005.

100 [99] James B Stewart and Andrew T Beckenbach. Characterization of mature mi- tochondrial transcripts in drosophila, and the implications for the tRNA punc- tuation model in arthropods. Gene, 445:4957, 2009.

[100] Renato Lupi, Paolo D'Onorio de Meo, Ernesto Picardi, Mattia D'Antonio, Daniele Paoletti, Tiziana Castrignanó, Graziano Pesole, and Carmela Gissi. MitoZoa: A curated mitochondrial genome database of metazoans for compar- ative genomics studies. Mitochondrion, 10:192199, 2010.

[101] Stephen L Cameron. How to sequence and annotate insect mitochondrial genomes for systematic and comparative genomics research. Systematic En- tomology, 39:400411, 2014.

[102] Marwa Al Arab, Christian Höner zu Siederdissen, Kifah Tout, Abdullah H Sahyoun, Peter F Stadler, and Matthias Bernt. Accurate annotation of protein- coding genes in mitochondrial genomes. Molecular phylogenetics and evolution, 106:209216, 2017.

[103] K D Pruitt, T Tatusova, and D R Maglott. NCBI reference sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research, 33:D501D504, 2005.

[104] D A Benson, I Karsch-Mizrachi, D J Lipman, J Ostell, and E W Sayers. Gen- Bank. Nucleic Acids Research, 24:D2631, 2008.

[105] Justin C. Havird and Scott R. Santos. Performance of single and concatenated sets of mitochondrial genes at inferring metazoan relationships relative to full mitogenome data. PLoS ONE, 9:e84080, 2014.

[106] Frank E. Grubbs. Sample criteria for testing outlying observations. The Annals of Mathematical Statistics, 21:2758, 1950.

[107] Peter Jehl, Fabian Sievers, and Desmond G. Higgins. OD-seq: outlier detection in multiple sequence alignments. BMC Bioinformatics, 16:269, 2015.

[108] M. Bernt. MITOS Web Server.

[109] Deanna Ojala, Julio Montoya, and Giuseppe Attardi. tRNA punctuation model of RNA processing in human mitochondria. Nature, 290:470474, 1981.

[110] N P Nguyen, S Mirarab, K Kumar, and T Warnow. Ultra-large alignments using phylogeny-aware proles. Genome Biology, 16:124, 2015.

[111] S R Eddy. Prole hidden Markov models. Bioinformatics, 14:755763, 1998.

[112] Robert C. Edgar. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32:17921797, 2004.

101 [113] Tree of Life web project. Squamata. Lizards and snakes. http://tolweb. orgSquamata/14933/1996.01.01inTheTreeofLifeWebProject,http: //tolweb.org/, 1996. Version 01 January 1996.

[114] Vincent Ranwez, Sébastien Harispe, Frédéric Delsuc, and Emmanuel J. P. Douzery. MACSE: Multiple Alignment of Coding SEquences Accounting for Frameshifts and Stop Codons. PLOS ONE, 6:e22594, 2011.

[115] O. Gotoh. Alignment of three biological sequences with an ecient traceback procedure. J. theor. Biol., 121:327337, 1986.

[116] A. S. Konagurthu, J. Whisstock, and P. J. Stuckey. Progressive multiple align- ment using sequence triplet optimization and three-residue exchange costs. J. Bioinf. Comp. Biol., 2:719745, 2004.

[117] M Hirosawa, M Hoshida, M Ishikawa, and T Toya. MASCOT: multiple align- ment system for protein sequences based on three-way dynamic programming. Comput Appl Biosci, 9:161167, 1993.

[118] Matthias Kruspe and Peter F. Stadler. Progressive multiple sequence align- ments from triplets. BMC Bioinformatics, 8:254, 2007.

[119] Ken Nguyen, Xuan Guo, and Pan Yi. Multiple Biological Sequence Alignment: Scoring Functions, Algorithms and Applications. Hoboken, NJ, 2016.

[120] Matthias Bernt. Gene order rearrangement methods for the reconstruction of phylogeny. PhD thesis, Fakultät für Mathematik und Informatik der Univer- sität Leipzig, 2010.

[121] Eric Beitz. TeXshade: shading and labeling of multiple sequence alignments using LaTeX2e. Bioinformatics, 16:135139, 2000.

[122] Miguel M. Fonseca, Elsa Froufe, and D. James Harris. Mitochondrial gene rear- rangements and partial genome duplications detected by multigene asymmetric compositional bias analysis. Journal of Molecular Evolution, pages 654661, 2006.

[123] Rachel Lockridge Müller and Jerey L. Boore. Molecular mechanisms of exten- sive mitochondrial gene rearrangement in plethodontid salamanders. Molecular Biology and Evolution, 22:21042112, 2005.

[124] Andrey Tatarenkov, Felix Mesak, and John C Avise. Complete mitochondrial genome of a self-fertilizing sh kryptolebias marmoratus (cyprinodontiformes, rivulidae) from orida. Mitochondrial DNA, pages 12, 2015.

[125] Kohji Mabuchi, Masaki Miya, Takashi P. Satoh, Mark W. Westneat, and Mut- sumi Nishida. Gene rearrangements and evolution of tRNA pseudogenes in the mitochondrial genome of the parrotsh (teleostei: Perciformes: Scaridae). Journal of Molecular Evolution, 59:287297, 2004.

102 [126] J-S Ki, S-O Jung, D-S Hwang, Y-M Lee, and J-S Lee. Unusual mitochondrial genome structure of the freshwater goby odontobutis platycephala: rearrange- ment of trnas and an additional non-coding region. Journal of Biology, 73:414428, 2008.

[127] Yusuke Yamanoue, Masaki Miya, Keiichi Matsuura, Naoki Yagishita, Kohji Mabuchi, Harumi Sakai, Masaya Katoh, and Mutsumi Nishida. Phylogenetic position of tetraodontiform shes within the higher teleosts: Bayesian inferences based on 44 whole mitochondrial genome sequences. Molecular phylogenetics and evolution, 45:89101, 2007.

[128] K Saitoh, T Sado, RL Mayden, N Hanzawa, K Nakamura, M Nishida, and M Miya. Mitogenomic evolution and interrelationships of the (: Ostariophysi): The rst evidence toward resolution of higher- level relationships of the world’s largest freshwater sh clade based on 59 w- hole mitogenome sequences. Journal of Molecular Evolution, 63:826841, 2006.

[129] Jun G. Inoue, Masaki Miya, Katsumi Tsukamoto, and Mutsumi Nishida. Com- plete mitochondrial dna sequence of conger myriaster (teleostei: Anguilli- formes): Novel gene order for vertebrate mitochondrial genomes and the phylo- genetic implications for anguilliform families. Journal of Molecular Evolution, 52:311320, 2001.

[130] Atsushi Matsui, Felix Rakotondraparany, Isao Munechika, Masami Hasegawa, and Satoshi Horai. Molecular phylogeny and evolution of prosimians based on complete sequences of mitochondrial dnas. Gene, 441:5366, 2009.

[131] Takashi P. Satoh, Masaki Miya, Kohji Mabuchi, and Mutsumi Nishida. Struc- ture and variation of the mitochondrial genome of shes. BMC Genomics, 17:719, 2016.

[132] Jan Y Poulsen, Ingvar Byrkjedal, Endre Willassen, David Rees, Hirohiko Takeshima, Takashi P Satoh, Gento Shinohara, Mutsumi Nishida, and Masaki Miya. Mitogenomic sequences and evidence from unique gene rearrangements corroborate evolutionary relationships of myctophiformes (neoteleostei). BMC Evolutionary Biology, 13:111, 2013.

103 Selbständigkeitserklärung Hiermit erkläre ich, die vorliegende Dissertation selbständig und ohne unzuläs- sige fremde Hilfe angefertigt zu haben. Ich habe keine anderen als die ange- führten Quellen und Hilfsmittel benutzt und sämtliche Textstellen, die wörtlich oder sinngemäÿ aus veröentlichten oder unveröentlichten Schriften entnom- men wurden, und alle Angaben, die auf mündlichen Auskünften beruhen, als solche kenntlich gemacht. Ebenfalls sind alle von anderen Personen bereitgestell- ten Materialen oder erbrachten Dienstleistungen als solche gekennzeichnet.

Leipzig, den 12.11.2017 ......

I Curriculum Vitae

Personal Information Name Marwa Al Arab Address East street, Kalmoun, Tripoli Lebanon Email [email protected] Tel 0096170251121

Education 2013-2017 Ph.D. student, Bioinformatics group, Leipzig university. Doctoral school for sciences and technologies, Lebanese University.

2011-2012 Msc. Bioinformatics, Lebanese university.

Master thesis Breakpoints detection in Mitochondrial DNA.

2007-2010 Bachelor in Computer Information Systems, Lebanese University.

Work Experience 2014-2016 Teaching Algorithm in Bioinformatics, Data mining lab and programming for Bioinformatics Lebanese University.

2010 Quality assurance and programming at Intelligile Tripoli.

Technical Skills Programming Python, Java, C++, R, Matlab, PHP, Perl, HTML.

Operating systems Linux, Windows.

Others Latex, SVN, mySQL, XML.

Languages Arabic Native speaker English Fluent French Fluent German Beginner

II