The Eukaryotic ITS2 Database – A workbench for modelling RNA sequence-structure evolution *** Die Eukaryotische ITS2 Datenbank - Eine Plattform zur Modellierung von RNA Sequenzstruktur Evolution

Doctoral thesis for a doctoral degree at the Graduate School of Life Sciences, Julius-Maximilians-Universit¨atW¨urzburg

PhD thesis submitted by: Christian Koetschan Place of birth: W¨urzburg

March 15, 2012

Submitted on: ......

Office stamp

Members of the Promotionskomitee:

Chairperson: Prof. Dr. Matthias Frosch

Primary Supervisor: Prof. Dr. J¨orgSchultz

Supervisor (Second): Prof. Dr. Thomas Dandekar

Supervisor (Third): Prof. Dr. Dietmar Seipel

Additional supervisors:

Group Leader: Dr. Tobias M¨uller

Group Leader: Dr. Matthias Wolf

Date of Public Defence: ......

Date of receipt of Certificates: ......

5

Affidavid / Eidesstattliche Erkl¨arung

English:

I hereby confirm that my thesis entitled ”The Eukaryotic ITS2 Database – A workbench for modelling RNA sequence-structure evolution” is the result of my own work. I did not receive any help or support from commercial consul- tants. All sources and / or materials applied are listed and specified in the thesis.

Furthermore, I confirm that this thesis has not yet been submitted as part of another examination process neither in identical nor in similar form.

Signature: ( Christian Koetschan ) Place, Date

Deutsch:

Hiermit erkl¨are ich an Eides statt, die Dissertation Die Eukaryotische ITS2 ” Datenbank - Eine Plattform zur Modellierung von RNA Sequenzstruktur Evo- lution“ eigenst¨andig, d.h. insbesondere selbst¨andig und ohne Hilfe eines kom- merziellen Promotionsberaters, angefertigt und keine anderen als die von mir angegebenen Quellen und Hilfsmittel verwendet zu haben.

Ich erkl¨are außerdem, dass die Dissertation weder in gleicher noch in ¨ahnlicher Form bereits in einem anderen Prufungsverfahren¨ vorgelegen hat.

Unterschrift: ( Christian Koetschan ) Ort, Datum

7

Acknowledgements

Many people have supported me during the last years at the Faculty of Bioin- formatics, and made work feel like pleasure. First of all, I want to thank all members of the Department for a great and enjoyable time during my studies and the time of my PhD. Further, I want to thank my mentor Prof. Dr. J¨orgSchultz, for the guidance of my PhD thesis, inspiring discussions in the group meetings, and for main- taining a comfortable working environment. Also I thank Prof. Dr. Thomas Dandekar and Prof Dr. Dietmar Seipel for supporting me as members of the Graduate School supervisor committee during the last years. I owe a debt of gratitude to Dr. Tobias M¨ullerfor spending so much time and contributing with his profound knowledge and advice. Thank you for the time-consuming discussions – even exceeding the working hours, suggestions and mathematical assistance in many cases! I really appreciate all this effort, which improved my thesis ”significantly” – *** without any doubts or further needed statistical tests! I would also like to thank Dr. Matthias Wolf, Dr. Torben Friedrich, Dr. Frank F¨orsterand Thomas Hackl for their collaborative teamwork and assistance, an- swers and explanations to biological questions and discussions – even until 5 am in the morning. Special Thanks go to Daniela Beißer, Gaby Wangorsch, Santosh Nilla and Astrid Fieselmann. Without extensive mountain-biking tours, badminton, speedminton, slack-lining and regular walks to the chocolate vending machine, time would have passed only half as fast and I would have missed some of the most pleasant moments in my life. Thank you for the great time, I will definitely miss it!

Finally I want to thank Desislava Boyanova and my family who supported me especially through the time of writing and during my studies, for which I owe them the biggest ”Thank you”!

TABLE OF CONTENTS 9

Table of Contents

1 Summary / Zusammenfassung 15 1.1 English ...... 15 1.2 Deutsch ...... 17

2 Introduction 19

3 Optimization of length modelling topologies for HMMs 23 3.1 Introduction ...... 23 3.2 Methods ...... 24 3.2.1 Hidden Markov Models ...... 24 3.2.2 Maximum likelihood ...... 25 3.2.3 Method of moments ...... 26 3.2.4 Goodness of fit and model choice ...... 26 3.3 Results ...... 27 3.3.1 Length modelling – topologies ...... 27 3.3.2 Application on biological examples ...... 30 3.4 Discussion ...... 34

4 Sequence-structure phylogeny 37 4.1 Introduction ...... 37 4.2 Tree inference with the R-package treeforge ...... 37 4.2.1 Sequence-structure coding alphabets ...... 38 4.2.2 Substitution models for phylogenetic reconstructions . . 39 Jukes Cantor ...... 39 ITS2 ...... 40 GTR...... 40 4.3 Calculating evolutionary rates and scoring sequences ...... 41 4.4 Reconstruction of the chlorophyceaen tree with different coding alphabets ...... 44 4.5 Discussion ...... 50 TABLE OF CONTENTS 10

5 Generating ITS2 sequence-structure data 51 5.1 Introduction ...... 51 5.2 Database creation and taxonomic tree storage (A) ...... 51 5.3 Sequence download and indexing (B) ...... 52 5.4 HMM annotation (C) ...... 55 5.5 Secondary structure prediction (D) ...... 55 5.6 Partial structure prediction (E) ...... 57 5.7 General overview of files and folders ...... 58 5.8 Discussion ...... 58

6 Technical implementation of the ITS2 workbench and the ITS2 database 61 6.1 Introduction ...... 61 6.2 The frontend of the ITS2 workbench ...... 61 6.2.1 AJAX and XHR with Web 2.0 ...... 62 6.2.2 The ExtJS JavaScript framework ...... 62 6.2.3 Web 2.0 data formats ...... 65 6.2.4 Frontend files and locations ...... 67 6.3 The backend of the ITS2 workbench ...... 67 6.3.1 Catalyst: A Perl-based Web development framework . . 67 Model View Controller architecture ...... 68 RESTful programming paradigm and CRUD ...... 70 The Template Toolkit ...... 71 6.3.2 The PostgreSQL database including PL/pgSQL . . . . . 72 Database Schema of the ITS2 database ...... 72 Database Schema of the ITS2 workbench – admin interface 74 6.3.3 Apache and Fcgid communication ...... 75 6.3.4 Backend files and locations ...... 75 6.4 ITS2 workbench installation script ...... 76 6.5 Discussion ...... 76 TABLE OF CONTENTS 11

7 The ITS2 workbench 77 7.1 Introduction ...... 77 7.2 Overview ...... 78 7.3 Creating a sequence-structure dataset ...... 79 7.3.1 Datasets based on the ITS2 database ...... 79 7.3.2 Datasets from own sequences ...... 82 The ITS2 annotation tool ...... 82 Prediction of secondary structure (Predict) ...... 84 Homology Modelling of structure (Model) ...... 85 7.4 Managing the dataset ...... 87 7.5 Analysing the dataset ...... 88 7.5.1 Alignment on sequence or sequence and structure . . . . 89 7.5.2 Tree calculation with the Neighbour Joining algorithm . 90 7.6 Additional tools ...... 91 7.6.1 Motif detection in ITS2 sequences ...... 91 7.6.2 BLAST on sequence and structure ...... 93 7.7 The ITS2 admin interface ...... 94 7.8 Discussion ...... 95

8 ITS2 application on large scale-data - automated reconstruction of the Green Algal Tree of Life 97 8.1 Indroduction ...... 97 8.2 Materials and Methods ...... 97 8.3 Results ...... 98 The class of ...... 98 The class of ...... 99 The class of ...... 99 The phylum ...... 99 8.4 Discussion ...... 101

9 A video tutorial for the ITS2 workbench 102 9.1 Short summary ...... 102 TABLE OF CONTENTS 12

10 Conclusion and outlook 104

11 References 106

12 List of Figures 127

13 List of Tables 129

14 List of Abbreviations 130

15 Annex 132 15.A Phylogenetic tree of class Chlorophyceae ...... 132 15.B Phylogenetic tree of class Trebouxiophyceae ...... 133 15.C Phylogenetic tree of class Ulvophyceae ...... 134 15.D Sequence-structure alphabets ...... 135 DNA...... 135 RNA10 ...... 135 RNA12 ...... 135 RNA16 ...... 136 15.E Sequence and sequence-structure score matrices on datasets137 15.F Sequence and sequence-structure score matrices on genus datasets138 15.G PL/pgSQL functions of the ITS2 database ...... 139 15.H Views of the ITS2 database ...... 140 15.I Aggregate functions of the ITS2 database ...... 140 15.J Operators of the ITS2 database ...... 140 15.K Triggers of the ITS2 database ...... 140 15.L ITS2 data generation – folders ...... 141 15.MITS2 data generation – scripts and modules ...... 141 15.N JavaScript files of the ITS2 workbench frontend ...... 142 15.O Templates of the Template Toolkit ...... 143 15.P Controller of the ITS2 workbench backend ...... 144 15.Q Modules of the ITS2 workbench backend ...... 145

16 Publications 146 TABLE OF CONTENTS 13

17 Curriculum Vitae 148

Summary / Zusammenfassung 15

1 Summary / Zusammenfassung

1.1 English:

During the past years, the internal transcribed spacer 2 (ITS2) was established as a commonly used molecular phylogenetic marker for the . Its fast evolving sequence is predestinated for the use in low-level phylogenetics. How- ever, the ITS2 also consists of a very conserved secondary structure. This enables the discrimination between more distantly related species. The combi- nation of both in a sequence-structure based analysis increases the resolution of the marker and enables even more robust tree reconstructions on a broader taxonomic range. But, performing such an analysis required the application of different programs and databases making the use of the ITS2 non trivial for the typical biologist. To overcome this hindrance, I have developed the ITS2 Workbench, a com- pletely web-based tool for automated phylogenetic sequence-structure anal- yses using the ITS2 (http://its2.bioapps.biozentrum.uni-wuerzburg.de). The development started with an optimization of length modelling topologies for Hidden Markov Models (HMMs), which were successfully applied on a sec- ondary structure prediction model of the ITS2 marker. Here, structure is predicted by considering the sequences’ composition in combination with the length distribution of different helical regions. Next, I integrated HMMs into the sequence-structure generation process for the delineation of the ITS2 within a given sequence. This re-implemented pipeline could more than double the number of structure predictions and reduce the runtime to a few days. To- gether with further optimizations of the homology modelling process I can now exhaustively predict secondary structures in several iterations. These modifi- cations currently provide 380,000 annotated sequences including 288,000 struc- ture predictions. To include these structures in the calculation of alignments and phylogenetic trees, I developed the R-package ”treeforge”. It generates sequence-structure alignments on up to four different coding alphabets. For the first time also structural bonds were considered in alignments, which required 1.1 English 16 the estimation of new scoring matrices. Now, the reconstruction of Maximum Parsimony, Maximum Likelihood as well as Neighbour Joining trees on all four alphabets requires just a few lines of code. The package was used to resolve the controversial chlorophyceaen dataset and could be integrated into future ver- sions of the ITS2 workbench. The platform is based on a modern, feature-rich Web 2.0 user interface equipped with the latest AJAX and Web-service tech- nologies. It performs HMM-based sequence annotation, structure prediction by energy minimization or homology modelling, alignment calculation and tree reconstruction on a flexible data pool that repeats calculations according to data changes. Further, it provides sequence motif detection to control annota- tion and structure prediction and a sequence-structure based BLAST search, which facilitates the taxon sampling process. All features and the usage of the ITS2 workbench are explained in a video tutorial. However, the work- bench bears some limitations regarding the size of datasets. This is caused mainly due to the immense computational power needed for such extensive calculations. To demonstrate the validity of the approach also for large-scale analyses, a fully automated reconstruction of the Chlorophyta (Green Algal) Tree of Life was performed. The successful application of the marker even on large datasets underlines the capabilities of ITS2 sequence-structure analysis and suggests its utilization on further datasets. The ITS2 workbench provides an excellent starting point for such endeavours. 1.2 Deutsch 17

1.2 Deutsch:

In den vergangenen Jahren etablierte sich der Marker internal transcribed ” spacer 2”(ITS2) zu einem h¨aufig genutzten Werkzeug in der molekularen Phy- logenetik der Eukaryoten. Seine schnell evolvierende Sequenz eignet sich be- stens fur¨ den Einsatz in niedrigeren phylogenetischen Ebenen. Die ITS2 faltet jedoch auch in eine sehr konservierte Sekund¨arstruktur. Diese erm¨oglicht die Unterscheidung weit entfernter Arten. Eine Kombination aus beiden in einer Sequenzstrukturanalyse verbessert die Aufl¨osung des Markers und erm¨oglicht die Rekonstruktion von robusteren B¨aumen auf h¨oherer taxonomischer Breite. Jedoch war die Durchfuhrung¨ solch einer Analyse, die die Nutzung unterschied- lichster Programme und Datenbanken vorraussetzte, fur¨ den klassischen Bio- logen nicht einfach durchfuhrbar.¨ Um diese Hurde¨ zu umgehen, habe ich den ITS2 Workbench“ entwickelt, eine im Internet nutzbare Arbeitsplattform zur ” automatisierten sequenzstrukturbasierten phylogenetischen Analyse basierend auf der ITS2 (http://its2.bioapps.biozentrum.uni-wuerzburg.de). Die Entwick- lung begann mit der L¨angenoptimierung unterschiedlicher Hidden Markov ” Model“ (HMM)-Topologien, die erfolgreich auf ein Modell zur Sequenzstruk- turvorhersage der ITS2 angewandt wurden. Hierbei wird durch die Analyse von Sequenzbestandteilen in Kombination mit der L¨angenverteilung verschie- dener Helixregionen die Struktur vorhergesagt. Anschließend konnte ich HMMs auch bei der Sequenzstrukturgenerierung einsetzen um die ITS2 innerhalb ei- ner gegebenen Sequenz zu lokalisieren. Dieses neu implementierte Verfahren verdoppelte die Anzahl vorhergesagter Strukturen und verkurzte¨ die Laufzeit auf wenige Tage. Zusammen mit weiteren Optimierungen des Homologiemodel- lierungsprozesses kann ich nun ersch¨opfend Sekund¨arstrukturen in mehreren Interationen vorhersagen. Diese Optimierungen liefern derzeit 380.000 anno- tierte Sequenzen einschließlich 288.000 Strukturvorhersagen. Um diese Struk- turen fur¨ die Berechnung von Alignments und phylogenetischen B¨aumen zu verwenden hab ich das R-Paket treeforge“ entwickelt. Es erm¨oglicht die Ge- ” nerierung von Sequenzstrukturalignments auf bis zu vier unterschiedlich ko- dierten Alphabeten. Damit k¨onnen erstmals auch strukturelle Basenpaarungen 1.2 Deutsch 18 in die Alignmentberechnung mit einbezogen werden, die eine Sch¨atzung neuer Scorematrizen vorraussetzten. Das R-Paket erm¨oglicht zus¨atzlich die Rekon- struktion von Maximum Parsimony“, Maximum Likelihood“ und Neigh- ” ” ” bour Joining“ B¨aumen auf allen vier Alphabeten mittels weniger Zeilen Pro- grammcode. Das Paket wurde eingesetzt, um die noch umstrittene Phylogenie der chlorophyceae“ zu rekonstruieren und k¨onnte in zukunftigen¨ Versionen ” des ITS2 workbench verwendet werden. Die ITS2 Plattform basiert auf einer modernen und sehr umfangreichen Web 2.0 Oberfl¨ache und beinhaltet neu- ste AJAX und Web-Service Technologien. Sie umfasst die HMM basierte Se- quenzannotation, Strukturvorhersage durch Energieminimierung bzw. Homo- logiemodellierung, Alignmentberechnung und Baumrekonstruktion basierend auf einem flexiblen Datenpool, der Anderungen¨ am Datensatz automatisch aktualisiert. Zus¨atzlich wird eine Detektion von Sequenzmotiven erm¨oglicht, die zur Kontrolle von Annotation und Strukturvorhersage dienen kann. Eine BLAST basierte Suche auf Sequenz- und Strukturebene bietet zus¨atzlich eine Vereinfachung des Taxonsamplings. Alle Funktionen sowie die Nutzung der ITS2 Webseite sind in einer kurzen Videoanleitung dargestellt. Die Plattform l¨asst jedoch nur eine bestimmte Gr¨oße von Datens¨atzen zu. Dies liegt vor allem an der erheblichen Rechenleistung, die bei diesen Berechnungen ben¨otigt wird. Um die Funktion dieses Verfahrens auch auf großen Datenmengen zu demon- strieren, wurde eine voll automatisierte Rekonstruktion des Grunalgenbaumes¨ (Chlorophyta) durchgefuhrt.¨ Diese erfolgreiche, auf dem ITS2 Marker basie- rende Studie spricht fur¨ die Sequenz-Strukturanalyse auf weiteren Daten in der Phylogenetik. Hier bietet der ITS2 Workbench den idealen Ausgangspunkt. Introduction 19

2 Introduction

From ”On the origin of species” to molecular biology

What is a species? Are same looking individuals also belonging to the same species? Which one evolved first? These questions have given food for thought to biologist for several centuries and different theses were raised. But even to- day – more than 150 years after Charles Darwins (* 12.2.1809; „ 19.4.1882) ”On the Origin of Species” (Darwin, 1859), the answers are not always trivial to give. Whereas at first, species were distinguished by their morphological features, one of the most recognized statements on the species concept was introduced by Ernst Mayr (* 5.7.1904; „ 3.2.2005) in 1942 in his book Mayr (1942): ”Species are groups of actually or potentially interbreeding natural populations, which are reproductively isolated from other such groups.” This statement ”gained almost universal acceptance because it explained concisely the role of the species in biology” (Mayr, 1942). Nevertheless, a discrimination is not very easy, especially in difficult cultivatable or solitary habitats. Fur- thermore it leaves the question unanswered regarding asexual individuals. But even so, large progress has been made after Darwin’s sketch of the probably first phylogenetic tree. With the emergence of computational biology, large scale sequence analysis, exponentially growing databases and high-trough-put sequencing, today tons of data became available in a relatively short time. Since the studies of Watson and Crick on the structure of DNA: ”It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material” (Watson and Crick, 1953), it is clear where genetic information is coded. Introduction 20

The molecular phylogenetic pipeline for automated tree reconstruc- tion

However, the question arose on how and where to find comparable regions inside these differing genomes. This led to the assignment of phylogenetic markers. A phylogenetic marker is a sequence fragment of same origin, avail- able in a large taxonomic unit with differing nucleic manifestations for each species. Based on these fragments, an alignment – containing columns with equal and distinct positions can be created. First, these were constructed by hard and exhausting manual work, nowadays tough, the alignment algorithms of MUSCLE (Edgar, 2004a,b) and ClustalW2 (Thompson et al., 1994; Larkin et al., 2007) simplify this task. From the alignment, it is only one more step to a molecular phylogeny. Saitou and Nei (1987); Gascuel (1997) proposed a method called Neighbour Joining (NJ) to calculate a tree very quickly. His method counts the number of differences in the alignment-columns for each sequence and computes a distance based on the results. Competing treeing methods like maximum parsimony (Camin and Sokal, 1965) and maximum likelihood (Felsenstein, 1981, 1985, 2004) are not less popular and widely ap- plied in this field.

The ITS2 – a predestinated marker for tree inference

However, a phylogenetic tree reconstruction strongly relies on the quality of the underlying marker. This is firstly, its exact annotation, and secondly, the range of its resolution and sequence variability. To the latter, a plethora of debates had been started which resulted in the use of specific (e.g. matK, rbcL, rpoC1, psbA-trnH), mitochondrial (e.g. CO1) or ribosomal (e.g. ITS1/2, 5.8S) markers (Moniz and Kaczmarska, 2009, 2010; Chen et al., 2010; Yao et al., 2010). Where chloroplast markers perform reasonably well on plants, they can’t be applied on animal species. Here, CO1 provides good results instead (Astrin et al., 2006; Kitahara et al., 2010; Smith et al., 2008). This discrepancy Introduction 21 can be explained by the fact that fast evolving markers suit best for low-level phylogenies with a high sequence variability needed to distinguish between two closely related species. In the case of high-level phylogenies instead, a lower variability is preferred to distinguish between far related species. Here, fast sequence variations would cause a high level of noise. Our work focuses on the ribosomal RNA cistron (Figure 1), especially the In- ternal Transcribed Spacer 2 (ITS2). The rRNA cistron contains genes forming the smaller (18S) and larger (28S, 5.8S and 5S) subunit of the ribosome in eukaryotes. Located in between are the genetic spacers ITS1 and ITS2. Ribo- somes occur in prokaryotes as well as eukaryotes for the synthesis of proteins during translation, however the ITS2, which exact function is not unravelled yet is only available in the latter.

Figure 1: A graphical view of the ribosomal RNA cistron. It consists of the 18S small subunit (SSU), ITS1, 5.8S, ITS2, 28S large subuni(LSU).

At first, only sequence data was accessible to distinguish between different species (Baldwin et al., 1995; Powers et al., 1997; Suh et al., 1993) and it turned out that the range of this marker is often too weak to cover also the discrimination in higher – order or family levels (Coleman, 2003). Interestingly, the ITS2 folds into a secondary structure which revealed a common, very con- served core throughout all eukaryotes (Coleman, 2003; Schultz et al., 2005; Mai and Coleman, 1997). With the conserved structure enabling discrimination at higher ranks, also the accuracy and robustness of trees increased (Keller et al., 2010; Telford et al., 2005). This finding quickly led to the integration of sec- ondary structure into phylogenetic databases like the ITS2-DB (Schultz et al., 2006; Selig et al., 2008; Koetschan et al., 2010). With an exact Hidden Markov Model annotation of the conserved flanking regions 5.8S and 28S of the ITS2 Introduction 22

(Keller et al., 2009), a huge amount of highly reliable individual secondary structures became available. Due to the advent of sequence-structure analysis software, the calculation of alignments (Seibel et al., 2006, 2008; Bauer et al., 2007; Siebert and Backofen, 2005) and reconstruction of phylogenetic trees (Wolf et al., 2008) could benefit from both integrated features.

An integrated platform for ITS2-based phylogenies

The presence of such a large set of phylogenetic tools requires a broad knowl- edge for their meaningful application and can often be time-consuming. Espe- cially after two decades of controversy in taxon sampling (Nabhan and Sarkar, 2011), one has to be prepared that calculations might result in repetitions. Further, erroneous sequences like false classifications often become visible just after the final tree has been produced. To overcome this time-consuming pro- cess, the ITS2 workbench has been developed. This working suite unifies all formerly required stand-alone tools for a sequence-structure based analysis online and follows the work flow proposed by Schultz and Wolf (2009). It re- quires no additional installations and integrates an updated version of the ITS2 database with nearly 300,000 sequences and structures. Beside an interactive way of adding or removing sequence-structures at different stages, quality con- trol mechanisms have been implemented allowing an automated repetition of previous calculations, as soon as changes to the dataset have been made. In a nutshell, the workbench remains compatible to the former software, how- ever additionally providing very fast insights into a phylogeny by a complete automation of a predefined pipeline and producing reliable results within just a few mouse clicks. Optimization of length modelling topologies for HMMs 23

3 Optimization of length modelling topologies for HMMs

3.1 Introduction

To achieve high quality predictions, the first part – the annotation of the marker, is already a crucial step. Here, the ITS2 marker benefits from its very conserved neighbouring genes. These can be targeted by a simple Hidden Markov Model approach. HMMs – first emerged in the IT and mainly used for speech recognition (Rabiner, 1989; Juang and Rabiner, 1991), nowadays play an important role in software for spam deobfuscation (Lee and Ng, 2005), image processing (Willsky, 2002; Rossi et al., 2010) or in bioinformatics ap- plications. To the latter, the work of Jean Eddy and the development of the HMMER suite (Eddy, 1998, 2008) had a big influence, which simplified the in- tegration and distribution of HMMs in a large variety of programs. Especially when it comes to the automated detection of specific regions or motifs inside a strand of typically Nucleotides, Aminoacids or Proteins, HMMs benefit from their statistical capabilities. Being successfully applied in Genescan (Lukashin and Borodovsky, 1998) for the detection of genes, the HMMER webserver for interactive sequence similarity search (Finn et al., 2011), the prediction of in- teraction sites by Friedrich et al. (2006) or signal peptide (Juncker et al., 2003; Schneider and Fechner, 2004; Zhang and Wood, 2003; K¨allet al., 2004) and transmembrane predictions (Sonnhammer et al., 1998; Tusn´adyand Simon, 1998; Krogh et al., 2001; Kahsay et al., 2005; Martelli et al., 2002; Liu et al., 2003; Bagos et al., 2004) are just some of the highlights in this field. Regarding the ITS2, the exact annotation of its sequence is of major importance for an accurate structure prediction (Keller et al., 2009). Interestingly, the ITS2 also consists of different motifs (Koetschan et al., 2010) and structural regions, such as stems, loops or inter-helical areas. This formed the idea whether – based on the composition of the ITS2 sequence, a full structure prediction can be achieved. In the following study, HMMs were applied to identify start and 3.2 Methods 24 end regions of the ITS2, as well as to model a complete secondary structure. However, with different structural regions, a large variety of the underlying source data is coherent, especially when regarding the length distribution of sequences. This makes it difficult to elicit optimal performance using state of the art software like the HMMER suite. Typically, an HMM consists of several states, each connected by transitions. The retention time within one state containing a self-transition then usually follows a geometric law (Durbin et al., 1998). But this must not automatically reflect the length distribution of source data, e.g. loop or stems inside the ITS2. A bell-shaped length dis- tribution e.g. cannot be modelled by only one self-transitional state, but by a sequential replication of those (Durbin et al., 1998), which would promise a better HMM-based prediction. In the following, HMM-based optimization methods are described by adapting the number of states with its probabilities of an HMM to the length distribution of the underlying data according to the publications of Koetschan (2008); Friedrich et al. (2010); Friedrich (2009). In comparison to the diploma thesis of Koetschan (2008), the publication (Friedrich et al., 2010) contains a full validation of optimisation methods on artificial test scenarios, confirmed by statistical tests, cross validation and re- ceiver operating characteristic (ROC) curves. Further a performance estima- tion of ML (Maximum Likelihood) and MM (Method of Moments) was newly included. Beside a novel calculation method for ITS2 secondary structure er- ror predictions, the HMM optimization was successfully applied on various interesting biological sequences and motifs in a quick screening.

3.2 Methods

3.2.1 Hidden Markov Models

In a few words, an HMM can be explained as a network of nodes or states

Q = {q1, . . . , qm}, described in more detail by Durbin et al. (1998). These

States qi, qj are typically connected to each other by a transition probability

τij. Thus, all outgoing transitions of a state sum to one. States may emit an 3.2 Methods 25

alphabet of symbols O = {ω1, . . . , ωn}, or can even be silent. To add emis- sion and transition probabilities to an HMM, different learning algorithms are known. Supervised learning, e.g. relies on state path and emission values to calculate the according probabilities. When states paths are unknown, the Baum-Welch (Baum et al., 1970; Welch, 2003) training estimates this infor- mation by a maximum likelihood approach. Further important algorithms like Viterbi (Viterbi, 1967; Forney, 1973) and posterior decoding (Durbin et al., 1998) are able to provide at least one best state path through the model.

3.2.2 Maximum likelihood

The likelihood L to the density function p(X|Θ) with the random variable X and the unknown parameters Θ of a given a data set of size N is described by

N Y p(X|Θ) = p(xi|Θ) = L(Θ|x1, . . . , xN ). (3.1) i=1

To estimate the unknown parameters Θ, the set of parameters Θˆ which maxi- mizes the overall likelihood is calculated

ˆ Θ = argmax L(Θ|x1, . . . , xN ). (3.2) Θ The parameters maximizing the likelihood equal the parameters maximizing the log-likelihood function. By transferring calculations to the log-space, the product of 3.1 is replaced by a sum, which avoids calculations with very high numbers.

L(Θ|x1, . . . , xN ) = log(L(Θ|x1, . . . , xN )). (3.3)

Here, log(L(Θ|x1, . . . , xN )) is described as the log-likelihood of a model with parameters Θ. When maximizing the log-likelihood, its estimates Θi (Θ =

Θ1,..., Θn) are derived by locating the zero point of partial derivatives of this function (Johnson et al., 2006) δ L(Θ1,..., Θm|x1, . . . , xN ) = 0, i = 1, . . . , m. (3.4) δ Θi 3.2 Methods 26

3.2.3 Method of moments

With this method, the moments of a function are used as parameters to de- scribe a distribution. This idea was first published by Pearson (1902) and gives an analytical approach for parameter estimation.

For a random variable X the moments of a function can be calculated by  P k  x p(x) if X is discrete,  k  x E X = +∞ (3.5) R k  x f(x) dx if X is continuous.  −∞ Here, p(x) is referred to as the probability mass function, f(x) as the proba- bility density function (PDF) of x. Then,

M(t) = E(etX ) (3.6) is the according moment generating function of X.

When differentiating M(t) at t = 0, the kth differentiation corresponds to the kth moment dk M(t)| = E[XketX ] | = E(Xk). (3.7) dt t=0 t=0 Now, parameters can be estimated by replacing moments by sample moments and resolving the equations

n 1 X E[Xk] = xk = xk, (3.8) n i i=1 xk =x, ¯ x2,..., xk := sample moment k.

3.2.4 Goodness of fit and model choice

The likelihood of a model can be regarded under the influence of model param- eters and its complexity. A measure for model choice which punishes model complexity by the number of parameters on a logarithmic scale is the BIC criteria by Schwarz (1978). It is calculated according to

BIC = −2L(θ, x) + k ln(n). (3.9) 3.3 Results 27

L(θ, x) here denotes the likelihood of the model where k equals the number of estimated parameters and n is the sample size. Estimators which don’t compute a likelihood can be directly compared by determining the L1 distance of two discrete distributions x and y n X d1(x, y) = ||x − y||1 := |xi − yi|. (3.10) i=1

3.3 Results

3.3.1 Length modelling – topologies

Each self-transitional state of an HMM is coupled to a specific holding time while resting in the self-transition. Thus, at each (re)-visit, the length of the state path is incremented by one symbol resulting from this specific state. Sequences, emitted by this state then follow a geometric length distribution (Durbin et al., 1998). To model even bell-shaped source data (Figure 3), in the following, three length modelling topologies are described in a nested com- plexity. A state, which becomes substituted by several chained-linked states in the next complexity level is further called a macro state (MS).

The p macro state emits geometrically length distributed sequences (geo(p)). The likelihood func- tion for these sequence lengths x1, . . . , xn is given by n Y xi−1 L(p|x1, . . . , xn) = p(1 − p) . (3.11) i=1 To estimate p, the log-likelihood of the equation above is then maximized

pˆ = argmax L(p|x1, . . . , xn) (3.12) p n Y = argmax log( p(1 − p)xi−1) p i=1 n X = argmax n log p + (xi − 1) log(1 − p). p i=1 By solving the equation δL(p|x , . . . , x ) 1 n = n − Pn (x − 1) 1 = 0 (3.13) dp p i=1 i 1−p 3.3 Results 28

p macro state rp macro state srp macro state

geo (p) nbin (r, p) gnbin (s, r, p)

2 1 x¯ 2(x2−x¯ ) pˆMM =p ˆML = pˆMM = pˆMM = x¯ x¯+x2 x¯2−x2−2¯x3+3¯xx2−x3 2 2 2 4 x¯2 4(x2−x¯ )(−2¯x x2+x2 +¯x ) rˆMM = rˆMM = x¯+x2 (¯x2−x2−2¯x3+3¯xx2−x3)(x2−x¯2−2¯x3+3¯xx2−x3) 2 ˆ −x¯2x2+¯xx2−x¯3−x¯x3+2x2 ΘML = sˆMM = x2−x¯2−2¯x3+3¯xx2−x3

ˆ argmax L (Θ|x1, . . . , xn) ΘML = argmax L (Θ|x1, . . . , xn) Θ Θ

Θ = (r, p) Θ = (s, r, p) y t y y t t ensi D ensi ensi D D

Length Length Length

Figure 2: Regarding the graphical representation of p, rp and srp macro states, the nested topology becomes visible easily. For rp, states are chained in a series of r repetitions. For the shifted srp model, s fixed copies lacking self transitions are prepended to the rp macro state. For each topology, estimators of distribution parameters s,r and p are given by maximum likelihood and the method of moments. Finally, an exemplary length distribution is depicted below each macro state.

1 towards p, the ML estimatorp ˆ = x¯ is received. This p macro state repre- sents the less complex form of length modelling and is concordant to a single self-transitive state. In Figure 2, the p macro state (left) as well as its ML and MM estimators are depicted together with a rough shape of the geometric distribution. 3.3 Results 29

The rp macro state emits negative binomial length distributed sequences (nbin(r, p)). As one might be aware of, the geometric distribution is contained in the negative bi- nomial distribution when setting the parameter r to one. Thus, the rp macro state allows to model a geometric length distribution or the bell shape typically for the negative binomial distribution, regulated by the parameter r. Trans- ferred to HMMs, r specifies the number of repetitive, chain-linked copies (see the middle part of Figure 2) of a self-transitional state with a holding time in- fluenced by p. As the explicit mathematical calculation of an ML estimator for this case is very demanding, here the MM estimator is explained instead. The method of moments – as the name implies makes use of central and empirical moments to form an unbiased estimator. However for ML, still the maximal likelihood can be calculated by varying distribution parameters in a discrete

1 2 space. The first and second moment, expectation (µnbin) and variance (µnbin) can be described by the distributions parameters r and p as r 1 − p µ1 = , µ2 = r . (3.14) nbin p nbin p2 By exchanging theoretical with empirical moments, r 1 − p x¯ = , x2 = r (3.15) p p2 and resolving both equations to r and p x¯ x¯2 pˆ = , rˆ = . (3.16) x2 +x ¯ x2 +x ¯ both parameter estimators are obtained. Floating point estimations of r were rounded to the next integer value for a smooth integration into HMMs.

The srp macro state emits shifted negative binomial length distributed sequences (gnbin(s, r, p)). This distribution is similar to the negative binomial distribution, except that an extension of exactly s characters or states increases the length of a se- quence and so, shifts the distribution on the x-axis. Here, the srp macro state 3.3 Results 30 is equal to the rp macro state when setting the shift – s to zero. The moment generating function

t(r+s) r t t −r Mnbin(t) = e p (1 − e + e p) . (3.17) was used to generate MM estimators for the parameters s, r and p similar to the rp macro state.

In the following, a short table summarizes macro state specific information on moments and PDFs for all three model variants.

geometric negative binomial generalized negative binomial

macro state p rp srp

n−1 n−1 r n−r n−s−1 r n−s−r PDF p(1 − p) r−1 p (1 − p) r−1 p (1 − p)

1 r r E[X] p p p + s

1−p 1−p 1−p V ar[X] p2 r p2 r p2

Table 1: This table summarizes properties of different modelling topologies by providing information about the modelled length distribution, PDF, variance and expectation value.

3.3.2 Application on biological examples

Screening biological sequences in terms of length distributions revealed a vari- ety of data sets following a thoroughly negative binomial law. Here, exemplary the length distribution of 232 β-sheet core regions from transmembrane pro- teins received from the TMPDB database (Figure 3 A) (Ikeda et al., 2003), signal peptides (Figure 3 B), 3’-UTR sequences of C. elegans from the UTRome database (Figure 3 C) (Mangone et al., 2008) and opening stem regions of the ITS2 from helix 3 obtained from Asteraceae and the ITS2 database (Figure 3 D) (Schultz et al., 2006; Selig et al., 2008; Koetschan et al., 2010) are illus- trated. Each histogram represents the length distribution of data (black line), 3.3 Results 31 the estimated parameters including the corresponding distribution modelled by each macro state (dotted lines) as well as BIC and L1 distance values. In addition, the optimal model choice is marked by a star. Figure 3 A shows that rp and srp macro states are capable of modelling the transmembrane β sheets length distribution far better than the p macro state. Here, ML and MM estimators propose quite similar values. In Figure 3 B, ML and MM estimators similarly propose 8 to 11 copies for the rp macro state to model the bell shape of signal peptides. Additionally, one can see the mismatch to length distributed data modelled by the p macro state. K¨allet al. (2004) also modelled signal peptides by breaking the topology into several chained structural regions. Due to the increasing number of sequenced genomes, the accurate identifica- tion of prominent parts like introns/exons, intergenic regions, splice sites or untranslated flanking regions gets further attention. Interestingly, the next example of 3’-UTR sequences (Figure 3 C) does not show a bell-shaped curve. Here, clearly a geometric distributed topology was preferred by all estimators. Thus, these data could be modelled best by a simple p macro state. Figure 3 D shows slight differences between ML and MM estimates for the srp macro state on ITS2 secondary structure data. However, one can see that a bell shape is favoured from a geometric distribution model. This is also un- derlined when regarding BIC and L1 distance.

ITS2 secondary structure prediction Focusing on this data, a more complex HMM was developed modelling not only helical parts but the whole structural conformation of the ITS2. Therefore, different structural parts with varying base compositions were inspected on their length distribution and parameters were estimated according to the pre- viously introduced length modelling topologies (Chapter 3.3.1). According to the conserved core structure of the ITS2 (Coleman, 2003; Schultz et al., 2005), these parts were mainly categorized by opening/closing stem regions, loops or unpaired helix connecting regions for all four helices in the structure (Figure 4 3.3 Results 32

transmembrane β-sheets signal peptides 0.14 0.15 A srp Θ{ 0, 6, 0.484} BIC: 1227.04 B srp Θ{ 3, 8, 0.391} BIC: 15463.38 rp Θ{ 6, 0.484} BIC: 1221.59 rp Θ{ 11, 0.469} BIC: 15486.63 p Θ{ 0.081} BIC: 1618.79 0.12 p Θ{ 0.043} BIC: 20776.37 srp (MM) Θ{ −1, 7, 0.533} L1: 0.25 srp (MM) Θ{ 18, 1, 0.141} L1: 0.36 rp (MM) Θ{ 6, 0.508} L1: 0.3 rp (MM) Θ{ 9, 0.392} L1: 0.28 0.10 0.10 0.08 0.06 Rel. frequency Rel. frequency 0.05 0.04 0.02 0.00 0.00 0 5 10 15 20 25 30 35 0 20 40 60 80 100 Length Length

3’-UTR sequences 0.006 srp Θ{ 0, 1, 0.005} BIC: 166998.0 ITS2 stem 3 sequences rp Θ{ 1, 0.005} BIC: 166988.6 srp Θ{ 0, 18, 0.59} BIC: 28843.04 C p Θ{ 0.005} BIC: 166979.1 D 0.25 rp Θ{ 18, 0.59} BIC: 28834.47 0.005 srp (MM)Θ{116, 1, 0.002} L1: 1 p Θ{ 0.033} BIC: 46592.09 rp (MM) Θ{ 1, 0.004} L1: 0.33 srp (MM) Θ{ 28, 1, 0.113} L1: 0.96 rp (MM) Θ{ 20, 0.653} L1: 0.66 0.20 0.004 0.15 0.003 0.10 0.002 Rel. frequency Rel. frequency 0.05 0.001 0.00 0.000 1 5 10 50 100 500 1000 5000 0 10 20 30 40 50 60 70 Length Length

Figure 3: This Figure exhibits four different histograms, presenting the length distribution of 232 β-sheet core regions (A), signal peptides (B), 3’-UTR se- quences of C. elegans (C) and opening stem regions from helix 3 of ITS2 sequences of family Asteraceae (D). Each graph shows estimated parameters by ML and MM for the macro states p, rp and srp. Further, optimal BIC and L1 distance values are highlighted by a star. When regarding the bell shapes of Figures A,B and D and comparing them to all three model topologies, one can see clearly a preference for the rp macro state. Where data is less bell shape distributed (C), the conventional HMM topology seems to perform best in this case. illustrates the according macro states). Results of ML and MM estimations are highlighted in Table 2 for p - and rp macro states. Here, the preferred model topology is marked bold. As one can see, in most cases the optimized rp macro state is favoured by a better BIC (ML) or L1 distance (MM). In the following, a ten-fold 90/10 cross validation on structure predictions of the p and rp macro state was performed on an asteraceaen ITS2 data set. The log ratio between errors made by p and rp macro states in comparison to reference structures of 3.3 Results 33

Figure 4: This HMM (right) reflects a mapping of structural regions from Cen- taurea alba ITS2 secondary structure (left), here synonymously representing the conserved core structure of this spacer. Opening and closing stem regions (”Stem X.1”/”Stem X.2”), unpaired loops and inter helix regions are modelled by different macro states. These sometimes may be skipped and were not al- ways present in the training data set of Asteraceae. All states except first and second loop states are self-transitive. This guarantees loops with a length of at least three nucleotides.

the ITS2 database are visualized in Figure 5. Here, one can clearly see the im- proved accuracy of the rp macro state resulting from the HMM optimization. However, also some (blue) regions exist where the estimator favoured a model which turned out to be rather suboptimal. The significance of this topology optimization was further documented by a Wilcoxon rank sum test - proving a significant error reduction averaged over all cross fold estimations of 34,25% for the rp model. Here, all ten p-values were very significant and smaller than 2.26 × 10−4. 3.4 Discussion 34

MM MM ML ML State rMM , pMM L1 (p) L1 (rp) rML, pML BICp BICrp

Start (2, 0.62) 1.37 1.20 (1, 0.24) 14866.57 14875.33 Stem 1.1 (19, 0.89) 2.00 0.80 (19, 0.88) 26262.90 11962.26 Loop 1.3 (2, 0.87) 1.14 1.14 (1, 0.46) 9694.14 9702.91 Stem 1.2 (22, 0.93) 2.00 1.13 (21, 0.90) 26598.26 11159.77 Z1 (2, 0.87) 1.25 1.29 (1, 0.38) 11304.23 11313.00 Stem 2.1 (13, 0.98) 2.00 0.78 (13, 0.98) 22871.83 4286.04 Loop 2.3 (1, 0.75) 0.49 0.49 (1, 0.68) 5893.03 5901.80 Stem 2.2 (13, 0.98) 2.00 0.78 (13, 0.98) 22872.92 4303.24 Z2 (4, 0.88) 1.81 0.96 (3, 0.69) 15138.06 9136.57 Stem 3.1 (38, 0.99) 2.00 0.95 (37, 0.97) 29828.17 7933.07 Loop 3.3 (3, 1.00) 1.59 0.59 (2, 0.66) 12310.04 7853.56 Stem 3.2 (43, 0.96) 2.00 0.77 (43, 0.95) 30942.04 11300.02 Z3 (11, 0.91) 1.93 1.14 (9, 0.77) 22013.25 11110.74 Stem 4.1 (8, 0.92) 2.00 1.03 (8, 0.95) 19777.04 5853.66 Loop 4.3 (2, 1.00) 1.79 0.36 (1, 0.50) 8939.28 8948.05 Stem 4.2 (8, 0.93) 2.00 1.11 (8, 0.96) 19726.27 5433.01 Stop (4, 0.79) 1.50 1.07 (3, 0.59) 16245.67 11259.91

Table 2: This Table illustrates different macro states of an HMM modelling ITS2 secondary structure together with corresponding ML and MM parameter estimates. In addition, BIC and L1 distance represent optimal model choice and best values are marked bold. One can clearly see a favour for the rp macro state topology which models, inter alia, bell-shaped length distributions.

3.4 Discussion

In this section, three HMM topologies were introduced modelling different length distributions. The first method, the p macro state equals to a conven- tional self transitional HMM state emitting geometrically distributed sequence lengths. The final optimized srp macro state is not only capable of modelling geometrically distributed data but also models bell-shaped negative binomial length distributions even in a shifted form. This allows to describe and model data more accurately in terms of HMMs. Further, no changes in decoding 3.4 Discussion 35

Reference Structure

Start StemLoop 1.1 Loop 1.1 Loop 1.2Stem 1.3Z1 1.2 StemLoop 2.1 Loop 2.1 Loop 2.2 Stem 2.3 Z2 2.2StemLoop 3.1 Loop 3.1 Loop 3.2 Stem 3.3 Z3 3.2StemLoop 4.1 Loop 4.1 Loop 4.2 Stem 4.3 Stop 4.2 Start Stem 1.1 8 Loop 1.1 Loop 1.2 Loop 1.3 6 Stem 1.2 Z1 4 Stem 2.1 Loop 2.1 Loop 2.2 2 Loop 2.3 errorlog ratio Stem 2.2 Z2 0 Stem 3.1 Loop 3.1 Loop 3.2 −2 Loop 3.3 Stem 3.2 −4 Z3 Stem 4.1 Predicted structure by simple and optimised HMMs structure by Predicted Loop 4.1 −6 Loop 4.2 Loop 4.3 Stem 4.2 −8 Stop

Figure 5: This heat map illustrates structural regions resulting from a ten fold cross validation test between p and rp macro states, modelling ITS2 secondary structure from Asteraceae. Here, the log error ratio of p and rp macro states in comparison to a high quality reference structure from the ITS2 database is highlighted in different colors. Reddish marked fields indicate a better, less error-prone modelling of the rp macro state, while blueish fields mark structural positions where a simple p macro state topology returned more accurate predictions.

or training algorithms have to be made, as this optimization changes just the HMMs topology - thus can be integrated into present applications with a mini- mum effort. To receive optimal model parameters, two estimators, the method of moments and the maximum likelihood method were derived. The latter is known to be asymptotically efficient, however sometimes is still very time con- suming. In contrast to ML, the method of moments is able to provide all estimates in a relatively quick time. 3.4 Discussion 36

To underline the performance of topology optimizations, applications on artifi- cial data (not shown) as well as on biological examples were evaluated. Figure 3 points out the relevance to actual data by providing several biological ex- amples implicating a bell-shaped length distribution from current databases. The further reviewed/examined novel method for secondary structure predic- tion from ITS2 sequences revealed a statistically very significant performance gain when modelling structure by an optimized HMM topology. Although this method does not provide valid base pairings at the moment, it could be further augmented to a level of support vector machines for stochastic context- free grammars which might be capable of retaining base pairings by emitting both binding nucleotides of a helix within one atomic step. In conclusion, this method allows an easy design of new, as well as adaptation of current established HMM topologies by providing two estimators which fit the HMM topology to the underlying length distribution of data. Bilmes (2004) even describes bimodal HMM topologies which leaves room for more accurate predictions and complex length distributions. Sequence-structure phylogeny 37

4 Sequence-structure phylogeny

4.1 Introduction

Having the necessary requirements to annotate and predict a structure, the next steps in a sequence-structure based phylogeny are the alignment and the tree reconstruction. A set of tools was developed that deals already with sequence-structure based alignments (Seibel et al., 2006, 2008; Bauer et al., 2007; Siebert and Backofen, 2005), however different alphabets are imaginable for this task. For the reconstruction of phylogenetic trees instead, only one distance based tree inference method (ProfDistS) (Wolf et al., 2008) is cur- rently able to handle sequence-structure data. In the following, an R-package (treeforge) was created which provides Neighbour Joining (Saitou and Nei, 1987; Gascuel, 1997), Maximum Parsimony (Camin and Sokal, 1965) and Max- imum Likelihood (Felsenstein, 1981, 1985, 2004) tree reconstruction on four different sequence-structure alphabets. Further, different substitution models like jukes cantor (Jukes and Cantor, 1969), GTR (Tavar´e,1986) - as well as a marker specific ITS2 model (M¨ulleret al., 2002; Wolf et al., 2005a) are avail- able. To cope with the newly developed sequence-structure alphabets, new scoring matrices were evaluated and integrated into treeforge, resulting in a rapid prototyping R-package for phylogenetic analysis.

4.2 Tree inference with the R-package treeforge

The treeforge R package enables a user to align sequence and sequence-structure data in FASTA (Ncbi, 2007) or xFASTA (xFASTA enhances FASTA by an ad- ditional line containing structure information in bracket dot bracket notation below a sequence) format either with the alignment algorithm of ClustalW2 (Thompson et al., 1994; Larkin et al., 2007) or MUSCLE (Edgar, 2004a,b). Additionally, a various number of different sequence-structure coding alpha- bets (Chapter 4.2.1) are available. For phylogenetic tree reconstruction, the following three methods are implemented and available for all types of alpha- 4.2 Tree inference with the R-package treeforge 38 bets: Maximum Parsimony, Neighbour-Joining using the BIONJ algorithm (Gascuel, 1997) and Maximum Likelihood. For these methods, different sub- stitution models are available. Finally, trees can easily be visualized or ex- ported to standard Newick format (Felsenstein et al., 1986). The R-package, including a vignette explaining the basic usage of the package is available on CD with this dissertation.

4.2.1 Sequence-structure coding alphabets

Four different alphabets to model sequence or sequence-structure data were defined. The alphabet ’DNA’ merely codes the four base nucleotides A,C,G,T. Any other IUPAC characters are translated into the ambiguity character N, except U - which also codes for T. DNA is the only alphabet that doesn’t contain any structure information.

’RNA10’ is the most sparse alphabet containing sequence and structure in- formation together. It codes the four base nucleotides A,C,G,T in combination with an unpaired structure, plus the structural bonds between A ↔ TT ↔ A, C ↔ GG ↔ C and G ↔ TT ↔ G. Additionally, Uracil and other IUPAC characters are handled equivalently to the DNA alphabet.

An alphabet that doesn’t consider horizontal dependencies between struc- tural bonds is ’RNA12’. Here, each of the four base nucleotides A,C,G,T is coupled with one of the three structural conformations ’.’,’(’ and ’)’. This results in 12 base substitution plus some additional ones for Uracil and the other less specific IUPAC characters. This alphabet is best evaluated and implemented in the standalone software 4SALE (Seibel et al., 2006, 2008), ProfDistS (Wolf et al., 2008) and the ITS2 workbench.

’RNA16’ is the alphabet containing most, even though redundant infor- mation and was first described by Smith et al. (2004). Similar to ’RNA10’, horizontal dependencies between structural bonds are encoded. In addition to ’RNA10’, here each side of a bond is coded by a different character. This 4.2 Tree inference with the R-package treeforge 39 results in the four unpaired structural base substitutions A,C,G,T plus the six ones from ’RNA10’ - A ↔ TT ↔ A, C ↔ GG ↔ C and G ↔ TT ↔ G- each one coded twice for the 3’ and the 5’ side of a secondary structure. Ad- ditionally, further IUPAC characters as well as Uracil are treated equivalently to the other alphabets.

4.2.2 Substitution models for phylogenetic reconstructions

For each phylogenetic tree reconstruction method, different evolutionary mod- els are available. These models specify different substitution rates for alphabet characters and are represented by a substitution matrix Q which fulfils the fol- lowing requirements: X Qii = − Qij (4.1) i6=j πQ = 0 (4.2)

πiqij = πjqji (4.3) with π = {π1 . . . πn} representing the stationary distribution of the alphabet.

The Jukes Cantor (JC) rate matrix is one of the simplest substitution models and assumes an equal appearance of nucleotides as well as an equal distribution of nucleotide mutations (Jukes and Cantor, 1969). Thus, a rate matrix describing the rate qij which is a point mutation from character i into character j can be easily described by:   −(n − 1)λ . . . λ    . .. .  Q =  . . .  (4.4)   λ . . . −(n − 1)λ Only one free parameter λ controls the overall substitution rate. When cali- brated to 1 PEM (Percent of Expected Mutations),

n X − 0.01 = πiQii = −(n − 1)λ (4.5) i=1 1 ⇒ λ = . (4.6) (n − 1)100 4.2 Tree inference with the R-package treeforge 40

JC relies on a symmetric substitution matrix and thus is time reversible. In more detail, a mutation from A → T occurs with the same rate as a mutation from T → A. A distance based on this model is described by

n − 1 n − 1 d = − log (1 − p) (4.7) jc n e n where p is the proportion of sites with different characters. In addition to an uncorrected hamming distance matrix, Juces Cantor also regards silent (synonymous A → T → A) mutations in his correction model.

ITS2 is a substitution model which is explicitly based on the according marker (M¨ulleret al., 2002; Wolf et al., 2005a). Thus, distributions for base pair mutations and nucleotide occurrences were pre-calculated on a set of re- liable ITS2 sequences.

GTR, the General Time Reversible model similar to JC assumes a symmetric substitution matrix and thus is called time reversible (Tavar´e,1986). It is the most general model possible and in contrast to JC, here, the rate between different substitution pairs, as well as the nucleotide frequencies can vary. Exemplary, Q is shown for a four character alphabet

−(x1 + x2 + x3) x1 x2 x3  π1x1 π1x1 −( + x4 + x5) x4 x5 π2 π2 Q = π1x2 π2x4 π1x2 π2x4 (4.8).  −( + + x6) x6  π3 π3 π3 π3 π x π x π x π x π x π x 1 3 2 5 3 6 −( 1 3 + 2 5 + 3 6 ) π4 π4 π4 π4 π4 π4

Due to the large number of free parameters (in this case four base frequencies and six substitution rates except one substitution which serves as normaliza- tion constant for the others and is equal to one), a PAM (Percent of Accepted Mutations) distance and parameters of the GTR model are typically calculated by a maximum likelihood approach. In addition to JC, ITS2 and GTR, for the maximum likelihood tree calcula- tion, a more advanced JC IG, ITS2 IG and GTR IG substitution model also 4.3 Calculating evolutionary rates and scoring sequences 41 allows to model invariable sites (I) and rate heterogeneity of the substitution process (G) by allowing up to four different speeds of mutation rates.

Finally, bootstrap support is available for all of the methods and models.

4.3 Calculating evolutionary rates and scoring sequences

A score matrix is used to score the similarity of two nucleotides, amino acids, proteins or encoded sequence-structure tuples and is applied in programs for sequence alignment or BLAST (Altschul et al., 1997, 1990), etc.. It goes back to the calculations by Dayhoff et al. (1978), which first defined a scoring ma- trix for amino acid similarities and the work of Henikoff and Henikoff (1992) and the BLOSUM62 matrix. Typically, the main diagonal - scoring identical characters gets a positive, other diagonals a more negative score (exemplary, the tuple ’A.’ ↔ ’A(’ should be scored higher than ’A.’ ↔ ’T(’). A very simple score matrix is the identity matrix, scoring equal positions with 1, others with 0.

But as nucleotides and evolutionary rates are not equally distributed in nature, M¨ulleret al. (2002); Wolf et al. (2005a) have developed a specific score matrix specifically for the marker ITS2. In this study, the method was further applied on the newly developed alphabets ’RNA10’ and ’RNA16’, and reimplemented for the old alphabets ’RNA12’ and ’DNA’.

The whole process was repeated iteratively, meaning a first calculated score- matrix served as input for new alignments in the next iteration until

old new |Sij − Sij | < . (4.9)

Gap-costs were set to (0,0) according to 4SALE. The dataset was based on 400/1200 manually inspected sequences and structures of the same genus/species taken from the CBC analysis of M¨ulleret al. (2007). 4.3 Calculating evolutionary rates and scoring sequences 42

Detailed estimation process:

MSSA

| | 1 푆푆푆 | 푁 퐾 푁 퐷 퐴 퐼 퐷 ⋯ 2 푆푆푆 | 푁 푄 푁 퐷 퐸 퐼 퐷 ⋯ 3 푆푆푆 | 푁 푄 푁 퐷 퐸 퐼 퐷 ⋯ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋯ 푆푆푆푛 푁 푄 푁 퐷 퐸 퐼 퐷 ⋯

Distance matrix MSSA | 0 , , ,

, 01 2 1, 3 1, | 푑푆푆푆 푆푆푆 푑푆푆푆 푆푆푆 ⋯ 푑푆푆푆 푆푆푆� 푆푆푆1 푁 퐾 푁 퐷 퐴 퐼 퐷 ⋯ , = 2, 1 , 02 3 2, | 푑푆푆푆 푆푆푆 푑푆푆푆 푆푆푆 ⋯ 푑푆푆푆 푆푆푆� 2 3 1 3 2 3 푆푆푆 | 푁 푄 푁 퐷 퐸 퐼 퐷 ⋯ 퐷푛 푛 푑푆푆푆 푆푆푆 푑푆푆푆 푆푆푆 ⋯ 푑푆푆푆 푆푆푆� 3 , , , 0 푆푆푆 푁 푄 푁 퐷 퐸 퐼 퐷 ⋯ ⋮ ⋮ ⋮ ⋱ ⋮ | 푛 1 푛 2 푛 3 푑푆푆푆 푆푆푆 푑푆푆푆 푆푆푆 푑푆푆푆 푆푆푆 ⋯ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋯ 푆푆푆푛 푁 푄 푁 퐷 퐸 퐼 퐷 ⋯

Neighbor Joining tree EMP Relative rate matrix | 0 ( ) ( ) ( ) 1 ( ) 0 ( ) ( ) 푆푆푆 | 푁 퐾 푁 퐷 퐴 퐼 퐷 ⋯ 푟 푁→푄 푟 푁→퐻 ⋯ 푟 푁→퐿 , = ( ) ( ) 0 ( ) ↓ ↓ ↓ ↓ ↓ ↓ ↓ 푟 푄→푁 푟 푄→퐻 ⋯ 푟 푄→퐿 2 푆푆푆 | 푁 푄 푁 퐷 퐸 퐼 퐷 ⋯ 푚 푚 퐻→푁 퐻→푄 퐻→퐿 푅 푟 푟 ⋯ 푟 0 ↓ ↓ ↓ ↓ ↓ ↓ ↓ ( ) ( ) ( ) ⋮ ⋮ ⋮ ⋱ ⋮ Seq1 Seq2 Seq3 Seqn 푆푆푆3 푁 푄 푁 퐷 퐸 퐼 퐷 ⋯ | 푟 퐿→푁 푟 퐿→푄 푟 퐿→퐻 ⋯ ⋯ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ⋮ | ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋯ ↓ ↓ ↓ ↓ ↓ ↓ ↓ 푛 푆푆푆 푁 푄 푁 퐷 퐸 퐼 퐷 ⋯ Score matrix

( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) 푠 푁→푁 푠 푁→푄 푠 푁→퐻 ⋯ 푠 푁→퐿 = ( ) ( ) ( ) ( ) Stationary distribution , 푠 푄→푁 푠 푄→푄 푠 푄→퐻 ⋯ 푠 푄→퐿 푚 푚 푠 퐻→푁 푠 퐻→푄 푠 퐻→퐻 ⋯ 푠 퐻→퐿 푆 ( ) ( ) ( ) ( ) = ( N , Q , H , , L) ⋮ ⋮ ⋮ ⋱ ⋮ 푠 퐿→푁 푠 퐿→푄 푠 퐿→퐻 ⋯ 푠 퐿→퐿 п п п п ⋯ п Figure 6: Estimation of a score matrix based on an evolutionary markov pro- cess. From a multiple sequence-structure alignment, a distance matrix and neighbour joining tree are calculated. This tree together with the alignment serves as input for calculating an EMP. The EMP models point mutations in a column of the alignment and uses the tree distances for time calibration. From this EMP, a stationary distribution of the alphabet, as well as the relative rate matrix are derived. They are prerequisites for the score matrix calculation.

ˆ The estimation starts with the generation of Multiple Sequence-Structure Algnments (MSSAs) for the species and genus datasets with a specific alphabet. For these alignments, the identity matrix or a known matrix from literature is suitable.

ˆ After the alignment, a distance matrix for each sequence tupel is cal- culated by counting and normalizing the differences in both sequences. 4.3 Calculating evolutionary rates and scoring sequences 43

From this matrix, a Neighbour Joining tree is calculated easily.

ˆ This tree, together with the MSSA is needed for modelling an Evolution- ary Markov Process (EMP) (Dayhoff et al., 1978; M¨uller,2001; M¨uller and Vingron, 2000; M¨ulleret al., 2002; Schliep, 2011) along the columns of the alignment. The EMP allows to describe the development at one fixed position of a sequence, influenced by point mutations under evolu- tionary circumstances. The distance of two sequences in the tree serves as parameter for describing the speed of one sequence mutating into another. This further allows to calculate a rate for each possible substi- tution by calibrating the time to one mutation at every 100 positions in a sequence (1 PEM).

First, a transition-matrix P (t) describes the time discrete markov chain with transitions of character i mutating into character j. Further, the condition of the process is assumed to be ergodic, which means that

π = πP (t). (4.10)

All outgoing transition probabilities must be positive and sum to one

X pi,j = 1 (4.11) j

pij > 0. (4.12)

The rate matrix Q is then defined as the change of P in an infinitely small time period

P (t) − I lim = Q (4.13) t→∞ t ⇒ P (t) = etQ. (4.14)

The estimation of the EMP is implemented in the R-package ’phangorn’ by Schliep (2011). 4.4 Reconstruction of the chlorophyceaen tree with different coding alphabets 44

ˆ With the EMP, a relative rate matrix R, which is basically similar to Q but excludes base frequencies, and the stationary distribution finally are the last requirements for calculating the score matrix

qij = πirij (4.15) M(t) S(t) = log( ) (4.16) M(∞) with M(t) = diag(π)etQ (4.17)

S(t) is derived according to the log-odds formula by Dayhoff et al. (1978). M(T ) is the matrix corresponding to the distribution of character substi- tutions with t = 50 PAM. From both, genus and species datasets, only the medians were used to build one overall genus/species relative rate matrix and one overall genus/species base frequency distribution which served as input for the score matrix calculation.

Finally, the score matrix is visualized in the bubble plots of Figures 40 and 41. Here, red dots indicate a positive, green dots a negative score. The size of each bubble corresponds to the value of its absolute number. Thus, the main diagonal axis typically scores equal positions with a positive score. Other diagonals are scored more negatively, depending on the number of exchanges which occurred for the specific tuples. When comparing species and genus matrices, one might easily recognize a difference in the G-T rates for example. This might give motivation to further analyse the differences between species and genus specific evolutionary rates.

4.4 Reconstruction of the chlorophyceaen tree with dif- ferent coding alphabets

Based on the newly estimated score matrices for each alphabet, a data set of Chlorophyta, equal to Buchheim et al. (2011b) was used for phylogenetic tree reconstruction of the chlorophyceaen class of green algae with treeforge. This phylogeny is known to be not easy reconstructible due to several jumping clades (Keller et al., 2008). In total, twelve trees with 100 bootstrap replicas were 4.4 Reconstruction of the chlorophyceaen tree with different coding alphabets 45 reconstructed resulting from the combination of the three methods mp phylo, ml phylo and dist phylo and the four alphabets ’DNA’, ’RNA10’, ’RNA12’ and ’RNA16’. Gap-costs were set to (0,0) while using the model GTR IG for the ML reconstruction, the model GTR for the BIONJ tree. The GTR IG model resulted as the optimal choice from a model-test (AIC) between all implemented substitution models (data not shown). The reconstruction of a tree with treeforge just requires a few lines of code: alignment <- file2phy(file="Chlorophyta.xfasta",type=’xfasta’, alphabet=’RNA10’,go=0,ge=0) fitobj <- ml_phylo(phydat=alignment,nbs=100,model="GTR_IG") ml_tree_rna10 <- fitobj@tree

To compare the final twelve trees they were visualized by a multidimensional scaling (MDS) plot (Figure 7). Here, the topological differences between mul- tiple trees become visible on the two main axes x and y by calculating an all-against-all distance matrix and transferring this information onto two axes. A clustering on tree reconstruction methods can be clearly seen on the y-axis. However, this method visualizes the distances between branches weighted equally which makes it difficult to compare the main topological differences on higher ranks. Therefore, the trees were sorted by their alliances of the main clades which are the OCC group consisting of the orders Oedogoniales, and Chaetophorales, the CW group of and the DO group of . Additionally, two Trebouxiophyceaen sequences (Parachlorella beijerinckii, Chlorella sorokiniana) and the outgroup of Ulvophyceae (Acrochaete sp. 6 BER 2007, Ulva laetevirens) where included in the dataset. In the following, the alphabet ’RNA16’ with methods ML and BIONJ and the alphabet ’RNA10’ with method MP reconstructed a tree with a similar topology to Buchheim et al. (2011b); Brouard et al. (2010); Turmel et al. (2008). For example the MP tree of the alphabet ’RNA10’ is visualized below (Figure 8). It consists of the DO group (Sphaeropleales) and the CW group (Chlamy- domonadales) as a monophyletic sister group with a high bootstrap support of 4.4 Reconstruction of the chlorophyceaen tree with different coding alphabets 46

Maximum MP DNA parsimony MP RNA12MP RNA16 MP RNA10

ML RNA12 ML RNA16

Maximum ML RNA10 likelihood

ML DNA

DIST RNA16 DIST RNA12

BIONJ DIST RNA10

DIST DNA

Figure 7: A multidimensional scaling of the alphabets ’DNA’, ’RNA10’, ’RNA12’ and ’RNA16’ in combination with the tree reconstruction methods MP, ML and BIONJ is visualized above. Here, similar methods cluster on the y-axis. 4.4 Reconstruction of the chlorophyceaen tree with different coding alphabets 47

157695730|Acrochaete sp. 6−BER−2007|method ann 2|method str 1 Ulvophyceae 219525733|Ulva laetevirens|method ann 5|method str 3 207366649|Parachlorella beijerinckii|method ann 2|method str 2 100 Trebouxiophyceae 207366656|Chlorella sorokiniana|method ann 2|method str 2 62183529|Oedogonium cardiacum|method ann 1|method str 1 75 62183531|Bulbochaete rectangularis var. hiloensis|method ann 1|method str 2 99 Oedogoniales 71482662|Oedogonium nodulosum|method ann 1|method str 2 93 98 77176662|Oedogonium undulatum|method ann 1|method str 1

54 OCC-Clade 62183535|Oedogonium oblongum|method ann 1|method str 2

81 337263254|Draparnaldia plumosa|method ann 1|method str 2|pr|312004118 100 337263257|Stigeoclonium helveticum|method ann 1|method str 2|pr|312004118 Chaetopeltidales 98 312004118|Uronema sp. CCAP 335 92 99 337263252|Uronema belkae|method ann 1|method str 2|pr|312004118 98 337263256|Schizomeris leibleinii|method ann 1|method str 3|mod|337263258 100 337263258|Aphanochaete magna|method ann 1|method str 2|pr|71482662 83 337263253|Chaetopeltis orbicularis|method ann 1|method str 2|pr|62183524 Chaeotphorales 100 337263259|Hormotilopsis tetravacuolaris|method ann 1|method str 2|mod|337263255 95 337263255|Hormotilopsis gelatinosa|method ann 1|method str 2|pr|62183524 2627285|Haematococcus droebakensis|method ann 2|method str 3 100 145587829|Dunaliella salina|method ann 1|method str 1 95 71482601|Dunaliella parva|method ann 3|method str 1 82 2039270|Volvulina steinii|method ann 5|method str 3 CW 83 4007511|Yamagishiella unicocca|method ann 2|method str 1 Chlamydomonadales

91 82 2039259|Volvox dissipatrix|method ann 5|method str 3 70 2039265|Volvox rousseletii|method ann 5|method str 3 68 14091704|Pandorina morum|method ann 2|method str 2 93 19569568|Eudorina elegans|method ann 2|method str 2 100 96 19569569|Eudorina unicocca|method ann 3|method str 1 3192578|Gonium octonarium|method ann 3|method str 1 Chlorophyceae 56 6066290|Gonium quadratum|method ann 3|method str 1 87 3192594|Gonium pectorale|method ann 5|method str 3 64 2627279|Chlamydomonas komma|method ann 3|method str 1 85 111073399|Chlamydomonas typica|method ann 2|method str 2 99 100 111073392|Chlamydomonas petasus|method ann 2|method str 1 100 111073402|Chlamydomonas incerta|method ann 2|method str 1 88 111073415|Chlamydomonas reinhardtii|method ann 2|method str 1 169930318|Atractomorpha porcata|method ann 1|method str 2 100 169930314|Sphaeroplea annulina|method ann 2|method str 2 65 169930312|Ankyra judayi|method ann 3|method str 1 12055733|Scenedesmus arcuatus var. platydiscus|method ann 3|method str 2 100 50812884|Stauridium tetras|method ann 1|method str 2 100 50812878|Pediastrum braunii|method ann 1|method str 2 79 95 59891785|Sorastrum spinulosum|method ann 5|method str 3 83 59891779|Pediastrum boryanum var. cornutum|method ann 5|method str 3 DO 96 50812858|Hydrodictyon patenaeforme|method ann 1|method str 2 8559891774|Hydrodictyon africanum|method ann 5|method str 3 6059891781|Pediastrum duplex|method ann 5|method str 3

99 Sphaeropleales 100 59891775|Hydrodictyon reticulatum|method ann 5|method str 3 138754193|Desmodesmus elegans|method ann 1|method str 2 159135732|Desmodesmus communis|method ann 2|method str 2 93 80 12055737|Scenedesmus quadricauda|method ann 3|method str 2

76 12055748|Scenedesmus longus|method ann 3|method str 2 80 159135727|Desmodesmus opoliensis|method ann 1|method str 2 100 159135731|Desmodesmus pleiomorphus|method ann 1|method str 2 99 97 12055740|Desmodesmus bicellularis|method ann 3|method str 2 78 12055736|Scenedesmus abundans|method ann 3|method str 2 4972016|Enallax acutiformis|method ann 3|method str 2 4972017|Scenedesmus pectinatus|method ann 3|method str 2 98 100 37727741|Scenedesmus regularis|method ann 3|method str 1 67 4972015|Scenedesmus raciborskii|method ann 3|method str 2 6625510|Scenedesmus acuminatus|method ann 1|method str 2 73 73 4972041|Tetradesmus wisconsinensis|method ann 3|method str 2 99 89515386|Scenedesmus obliquus|method ann 1|method str 2 8912055731|Scenedesmus dimorphus|method ann 3|method str 2 100 17385571|Scenedesmus basiliensis|method ann 3|method str 2

Figure 8: This tree of Chlorophyceae was calculated by maximum parsimony on the alphabet ’RNA10’. It resolves the DO and CW group as monophyletic sister groups with a high bootstrap support of 99. Both are further in alliance with the OCC clade and a bootstrap support of 82. This topology has a similar conformation to the published phylogeny of Buchheim et al. (2011b); Brouard et al. (2010); Turmel et al. (2008). 4.4 Reconstruction of the chlorophyceaen tree with different coding alphabets 48

99. Both are in alliance with the OCC clade with a lower bootstrap support of 82. In this clade, Chaetopeltidales and Chaeotphorales are resolved as sister group with a high bootstrap support of 98.

Similarly, the alphabet ’RNA16’ with method MP, the alphabet ’RNA10’ with method ML, and the ’DNA’ alphabet with BIONJ algorithm reconstructed trees resolving a slight difference to the previous reconstructed tree when regarding the upper ranks. Here, as example the MP tree of the alphabet ’RNA16’ is represented for this group in Figure 9. In this case, the OCC group forms a sister clade with the CW group of Chlamy- domonadales, albeit the bootstrap support is not very robust for all three trees at this position (79 for MP with ’RNA16’, 53 for ML with ’RNA10’ and 28 for the ’DNA’ BIONJ tree). For the method ML with alphabet ’RNA10’, the OCC clade was resolved in congruence to the first group of trees, classifying the order of Oedogoniales as sister group to the orders of Chaetopeltidales and Chaetophorales, however with a low bootstrap support of 64. For the other two methods – MP with alphabet ’RNA16’ and BIONJ with alphabet ’DNA’ these clades were resolved differently.

Similar to the topology above, the alphabet ’RNA12’ for the methods ML, MP and BIONJ resolved trees with the OCC clade being a sister group to the CW group of Chlamydomonadales, and both being in alliance with the DO group of Sphaeropleales, however the group of Ankyra judayi, Sphaeroplea annulina and Atractomorpha porcata jumps between the clades of OCC (MP, BIONJ) and CW (ML).

The remaining methods, MP and ML with alphabet ’DNA’, as well as BIONJ with alphabet ’RNA10’ mostly did not resolve the OCC clade as monophyletic group. Further, all trees are available on CD with this dissertation. 4.4 Reconstruction of the chlorophyceaen tree with different coding alphabets 49

219525733|Ulva laetevirens|method ann 5|method str 3 Ulvophyceae 157695730|Acrochaete sp. 6−BER−2007|method ann 2|method str 1 207366656|Chlorella sorokiniana|method ann 2|method str 2 93 Trebouxiophyceae 207366649|Parachlorella beijerinckii|method ann 2|method str 2 169930312|Ankyra judayi|method ann 3|method str 1 100 99 169930314|Sphaeroplea annulina|method ann 2|method str 2 83 169930318|Atractomorpha porcata|method ann 1|method str 2 12055733|Scenedesmus arcuatus var. platydiscus|method ann 3|method str 2 99 50812884|Stauridium tetras|method ann 1|method str 2

60 100 50812878|Pediastrum braunii|method ann 1|method str 2 98 59891785|Sorastrum spinulosum|method ann 5|method str 3 78 59891779|Pediastrum boryanum var. cornutum|method ann 5|method str 3 DO 9759891781|Pediastrum duplex|method ann 5|method str 3 90 50812858|Hydrodictyon patenaeforme|method ann 1|method str 2 95 7659891774|Hydrodictyon africanum|method ann 5|method str 3 99 93 Sphaeropleales 59891775|Hydrodictyon reticulatum|method ann 5|method str 3 138754193|Desmodesmus elegans|method ann 1|method str 2 159135732|Desmodesmus communis|method ann 2|method str 2 100 12055737|Scenedesmus quadricauda|method ann 3|method str 2

97 98 12055748|Scenedesmus longus|method ann 3|method str 2 90 159135727|Desmodesmus opoliensis|method ann 1|method str 2 97 12055740|Desmodesmus bicellularis|method ann 3|method str 2 86 12055736|Scenedesmus abundans|method ann 3|method str 2 69 46 159135731|Desmodesmus pleiomorphus|method ann 1|method str 2 4972016|Enallax acutiformis|method ann 3|method str 2 37727741|Scenedesmus regularis|method ann 3|method str 1 99 100 4972017|Scenedesmus pectinatus|method ann 3|method str 2 79 77 4972015|Scenedesmus raciborskii|method ann 3|method str 2

94 6625510|Scenedesmus acuminatus|method ann 1|method str 2

17385571|Scenedesmus basiliensis|method ann 3|method str 2 Chlorophyceae 100100 12055731|Scenedesmus dimorphus|method ann 3|method str 2 96 4972041|Tetradesmus wisconsinensis|method ann 3|method str 2 92 89515386|Scenedesmus obliquus|method ann 1|method str 2 337263253|Chaetopeltis orbicularis|method ann 1|method str 2|pr|62183524 100 337263259|Hormotilopsis tetravacuolaris|method ann 1|method str 2|mod|337263255 Chaeotphorales 100 337263255|Hormotilopsis gelatinosa|method ann 1|method str 2|pr|62183524 77176662|Oedogonium undulatum|method ann 1|method str 1 89 100 62183529|Oedogonium cardiacum|method ann 1|method str 1 OCC-Clade 81 62183531|Bulbochaete rectangularis var. hiloensis|method ann 1|method str 2 Oedogoniales 70 71482662|Oedogonium nodulosum|method ann 1|method str 2 59 87 62183535|Oedogonium oblongum|method ann 1|method str 2 337263256|Schizomeris leibleinii|method ann 1|method str 3|mod|337263258 100 337263258|Aphanochaete magna|method ann 1|method str 2|pr|71482662 80 337263257|Stigeoclonium helveticum|method ann 1|method str 2|pr|312004118 100 Chaetopeltidales 77 337263254|Draparnaldia plumosa|method ann 1|method str 2|pr|312004118 87 337263252|Uronema belkae|method ann 1|method str 2|pr|312004118 87 312004118|Uronema sp. CCAP 335 2627285|Haematococcus droebakensis|method ann 2|method str 3 100 71482601|Dunaliella parva|method ann 3|method str 1 100 145587829|Dunaliella salina|method ann 1|method str 1 2039265|Volvox rousseletii|method ann 5|method str 3 CW

98 2039270|Volvulina steinii|method ann 5|method str 3 Chlamydomonadales 100 88 2039259|Volvox dissipatrix|method ann 5|method str 3 67 4007511|Yamagishiella unicocca|method ann 2|method str 1 87 14091704|Pandorina morum|method ann 2|method str 2 62 19569568|Eudorina elegans|method ann 2|method str 2 77 100 19569569|Eudorina unicocca|method ann 3|method str 1 111073399|Chlamydomonas typica|method ann 2|method str 2 78 111073392|Chlamydomonas petasus|method ann 2|method str 1 100 111073415|Chlamydomonas reinhardtii|method ann 2|method str 1 98 87 111073402|Chlamydomonas incerta|method ann 2|method str 1 2627279|Chlamydomonas komma|method ann 3|method str 1 100 3192578|Gonium octonarium|method ann 3|method str 1 94 6066290|Gonium quadratum|method ann 3|method str 1 87 3192594|Gonium pectorale|method ann 5|method str 3

Figure 9: This tree of chlorophyceae was calculated by maximum parsimony on the alphabet ’RNA16’. In contrast to Figure 8, here the CW clade forms a sister group to the OCC clade, albeit with a lower bootstrap support of 77. Both are then allied to the DO group with a bootstrap support of 79. 4.5 Discussion 50

4.5 Discussion

Compared to Buchheim et al. (2011b); Brouard et al. (2010); Turmel et al. (2008), the alphabets ’RNA10’ and ’RNA16’ in most cases reconstructed plau- sible trees, partially with only small differences and sometimes even better bootstrap support. However, Buchheim et al. (2011b) reconstructed his trees’ topology based on the alphabet ’RNA12’, which provided a few more topo- logical differences here. This might be due to the absence of hand-crafted optimizations in the alignment, or the mixture of directly folded and homol- ogy modelled structures, which was needed for a full automation during this study. Interestingly, ’RNA10’ and ’RNA16’ resolved this tree topology with- out manual artefact corrections. This suggests a smaller sensitivity to artefacts when using these alphabets, which mainly differ from ’RNA12’ by the addi- tional coding of the horizontal dependencies between structural bonds. But, as this is the first evaluation of both alphabets and even performed on a compli- cated dataset, one should further evaluate these alphabets on a more certain phylogeny or simulated data to confirm their performance in comparison to ’RNA12’ or ’DNA’. A differentiation based on treeing methods, which was first proposed in Figure 7, doesn’t seem to influence the higher ranks. These are still uncertain in literature and hard to resolve in a robust way. However, a more detailed analysis of the phylum Chlorophyta is given in Chapter 8. The new scoring and rate matrices were explicitly calculated separately for species and genus based alignments. Although just the latter ones were included into treeforge, one might use the matrices of both types in future analysis to detect discrepancies between the evolutionary rates in species and genus. Finally, this handy package covers all important treeing methods including a large va- riety of models, together with the major aligning algorithms. It comes with four different sequence-structure alphabets, leaving room for a large variety of further analysis on sequence and secondary structure. Generating ITS2 sequence-structure data 51

5 Generating ITS2 sequence-structure data

5.1 Introduction

With sequence annotation, secondary structure prediction, alignment and tree reconstruction methods, now the algorithmic framework is complete to run a full phylogenetic sequence-structure analysis. Nevertheless, nucleotide se- quences are stored in large sequence databases like GenBank (Benson et al., 2011). GenBank updates and shares sequence data with other international collaborating databases like EMBL (Kulikova et al., 2007) or DDBJ (Tateno et al., 2002). Although ITS2 sequences are partially annotated here, this in- formation is not as accurate as needed for a precise secondary structure pre- diction. Some annotations include imprecise location descriptors, like ”in be- tween”, ”before” or ”uncertain” in GenBank data files. These were ignored by previous database versions (Schultz et al., 2006; Selig et al., 2008), which ran a less accurate BLAST based annotation approach. Therefore, a new pipeline was developed generating a larger quantity of reliable sequence-structure data (Koetschan et al., 2010). This is based on ideas of the previous versions of the ITS2 database with about 110,000 structures. However, the new work- flow incorporates HMM-based sequence annotation, includes more databases, performs an iterative homology modelling process and makes use of several smaller additions and speed improvements which result currently in about 380,000 sequences in total including 288,000 structure predictions.

5.2 Database creation and taxonomic tree storage (A)

The basic workflow of the pipeline is visualized in Figure 10. It starts with the creation of an emtpy ITS2 database according to the database schema in Figure 16. For storing taxonomic information, e.g. a representation of the taxonomic tree of life, the database contains the table ’taxons’. This is filled by obtaining data from the NCBI database (Federhen, 2011). The taxonomy database is a curated set of names and classifications which are 5.3 Sequence download and indexing (B) 52 available per FTP download and updated every two hours. To store the tree, a data structure in form of a nested set is used.

5.3 Sequence download and indexing (B)

The acquisition of sequence data is visualized in part B of Figure 10. Genbank holds a set of sub-databases which are available per FTP download. Searching for ITS2 sequences, we process the following ones listed in Table 3.

gbenv Environmental samples gbinv Invertebrates gbmam Other mammals gbpln Plants gbpri Primates gbrod Rodents gbsts Sequence tagged sites gbuna Unannotated gbvrt Other vertebrates nc Complete genomic molecules including genomes, chromosomes, organelles, plasmids daily Daily updates, unsorted

Table 3: This partial overview of GenBank sub-databases includes all that are used in Figure 10 B for sequence download. Additionally, daily sequences are added to this data set as well.

In a next step these sequences in form of GenBank flat files are indexed. This is necessary as the download contains several 100,000 zipped GenBank files which are piped into one big datafile containing several hundred GB. While downloading this data, its length, GI and offset are simultaneously stored in the database (table ’gb indexes’). By additionally indexing the ’gi’ column of the PostgreSQL table, this method allows to find and read a complete GenBank entry from such a large data file just within milliseconds. Subsequently, files are prepared for HMM annotation. By making use of the index, GenBank flat files are read and further parsed. For this purpose, a new GenBank parser was developed. Compared to the BioPerl toolkit (Stajich et al., 2002), it allows the extraction of relevant information in about 1/3rd of the time. Finally, sequences are stored in split FASTA formatted files in preparation of the HMM annotation. For a Hidden Markov Model based de- 5.3 Sequence download and indexing (B) 53

Create empty ITS2 database & A fill taxonomic information (table taxons) B GenBank NCBI database Nucleotide release + daily seq. database

Is a selected sub‐database NCBI search with or a daily specific search‐string

download sequence

GI not in Download and index index indexing

Create splitted FASTA Download GenBank Sequence and file from GenBank data files & add to index

Write features, sequences tables to ITS2 DB –only GB annotation information from NCBI‐search

Create HMM HMM C databases

Run HMM annotation

on cluster annotation

HMM hit HMM hit yes on both at least sides one side

no / GB annot

Write features, sequences to ITS2 no / MIXED annot DB (HMM, MIXED, GB annotations) prediction Secondary

D ITS2 sequence yes Sequences has imprecise GB annotation for BLAST annotation

no

Fold sequences on cluster & write (DIRECT structure structures to ITS2 DB

yes

Homology modelling of sequences Less unfolded &

on cluster; Write structures to ITS2 sequences HM) than before DB no prediction structure Partial E Perform a BLAST‐based annotation

Homology modelling of partial sequences on cluster; Write partial structures to ITS2 DB

Figure 10: Flowchart explaining data generation of ITS2 sequences and sec- ondary structures. First, the database is filled with a taxonomic tree (A). In the next step, sequences are retrieved from Genbank and indexed for further processing (B). Afterwards, HMM annotation and secondary structure predic- tion are the crucial steps in the workflow (C,D). Finally, partial structures are predicted (E). 5.3 Sequence download and indexing (B) 54 tection, the flanking regions of the ITS2 are indispensable. Hence, available 5.8S and 28S strands are added to the ITS2, if its annotation is imprecise (a list of possible values and examples is given in Table 4).

EXACT (5..100) BEFORE (<5..100) AFTER (>5..100) WITHIN ((5.10)..100) BETWEEN (99ˆ100) UNCERTAIN (99.?100)

Table 4: This Table lists all possible values that an annotated sequence might contain in GenBank data files. Additionally, a short example is given on the right.

Beside the download of GenBank databases, a file download based on a marker specific search terms is performed additionally, targeting the complete nu- cleotide database of NCBI. In the following Table 5 a list of specific search terms is given. For future database versions, which might not only incorporate the marker ITS2, the search is performed for all subunits of the rRNA cistron.

internal+transcribed+spacer ITS1 ITS+1 internal+transcribed+spacer+1 ITS-1 ITS 1 ITS2 ITS+2 internal+transcribed+spacer+2 ITS-2 ITS 2 5.8S 5.8+S 5.8-S 5.8 S 28S 28+S 28-S 28 S

Table 5: Marker specific search terms used for NCBI query against the nu- cleotide database.

Once a list containing GIs is retrieved from the NCBI search, a match to the index is performed. Only files which are not indexed yet are fully downloaded afterwards and written to the ITS2 database. Here, it should be mentioned that it may be quite possible that same features with same GI but different annotation become added to the database during the whole precess. 5.4 HMM annotation (C) 55

5.4 HMM annotation (C)

Having downloaded all necessary sequence information, now profile HMM databases need to be generated. These rely on manually curated sets of 5.8S and 28S bordering regions of the ITS2 with a length of 25 nucleotides from Keller et al. (2009). Both form a very conserved hybridizing stem with typically two free nucleotides. Sequence sets are available for ’Eukaryota’, ’Diptera’, ’’, ’Fungi’ and ’Metazoa’. Based on those and their reverse complements, in total ten HMMs are created for usage in HMMER2 (Eddy, 1998). Applied on the split FASTA sequences, all hits are evaluated and sequences are sorted into three bins of ’GenBank’ only, ’Mixed’ or fully ’HMM’ annotated. Here we trust in particular HMM annotated hits and prefer them before a mixed or GenBank annotation. During evaluation, the ITS2 from all scoring combinations is extracted and folded. If the resulting structure proves to be a valid ITS2, it is tagged. From all tagged foldings, the HMM annotated ones with lowest energy are again preferred before mixed or GenBank annotations. This seemingly complicated process ensures that a maximum of fold features is retrieved in the next step.

5.5 Secondary structure prediction (D)

After annotation, all sequences with preference for HMM, then mixed and fi- nally Genbank annotation are loaded from the database and folded via Unafold (Markham and Zuker, 2008) by energy minimization. This resulted in about 100,000 directly folded structures during the last database updates. Based on the folded sequences, structures are iteratively homology modelled (Wolf et al., 2005a) until the number of unfolded sequences remains constant. Here, some constraints have to be set to retain a certain level of quality:

ˆ All four helices must be modelled with at least 75% of transfer for each helix. 5.5 Secondary structure prediction (D) 56

ˆ The overall length must not be shorter than in the 0.1 % quantile of directly folded sequences.

ˆ The length of unpaired sequence ends must not be shorter than in the 99.0 % quantile.

ˆ The number of N-characters contained in a sequence must not be larger than in the 99.9 % quantile.

The process of homology modelling (Wolf et al., 2005a; Selig et al., 2008) is explained by a small example in Figure 11. It starts with two sequences where only one secondary structure is known. As the term homology suggests, the sequences are aligned first, to visualize homologue positions. These are marked in light green (Figure 11). Based on those conforming columns, the structure is transferred from the template. However, there might appear the case that structural bonds (’(’ or ’)’) occur at positions where template and target nu- cleotides differ (Figure 11 blue or red marked positions). In this circumstance, structure is transferred only if the corresponding target nucleotides can form a valid base-pairing (blueish marked positions). All other/uncertain positions 20 brackets target become unpaired (’.’). This results in a transfer of 24 brackets template 100 = 83, 33 % for the first helix. Finally, a post-folding process closes remaining bulges – resulting from the gaps, as example (not illustrated in the Figure). 5.6 Partial structure prediction (E) 57

>111073401 Chlamydomonas reinhardtii AATACTCGCCCTACTCCAACACGTTTGGAGCAAGAGCGGAC .....(((.(((.((((((....)).))))..)).)))).. >111073391 Chlamydomonas petasus AATACTCGCTCCCCCATTCCCCTCCTCTTTGGGGGCGAATG

Sequence-based alignment 1st AATACTCGCCC-----TACTCCAACACGTTTGGAGC------AAGAGCGGAC helix .....(((.((-----(.((((((....)).)))).------.)).)))).. AATACTCGCTCCCCCATTCCCCTCCTCTTTGGGGGCGAATGGGAGAACGGAC .....(((.((.....(.((((...... ))))...... )).))))..

Sequence-structure of Chlamydomonas petasus AATACTCGCTCCCCCATTCCCCTCCTCTTTGGGGGCGAATGGGAGAACGGAC .....(((.((.....(.((((...... ))))...... )).))))..

Figure 11: This Figure illustrates the homology modelling (Wolf et al., 2005a) process of the first helix between the template Chlamydomonas reinhardtii and the target sequence Chlamydomonas petasus. Green columns indicate homo- logue positions where structure information is transferred directly. Blueish columns allow a structure transfer, however nucleotides between template and target sequence are not equal here. Red and black positions mark columns where no structural transfer is possible.

Notice: Target sequences which obtained a homology modelled structure may become re-annotated by the templates if their annotation method is HMM on that side, and the target sequences’ method is not.

5.6 Partial structure prediction (E)

Partial structures are mainly classified in types. The first type is yielded from a BLAST-based annotation. This annotation method performs a tar- get sequence-based BLAST search against all fold structures already in the database. The best hit then serves as template for homology modelling. As BLAST is only a ”local alignment” search tool, this annotation method is not very reliable. Hence, this type of structures is classified as partial structure, although it might contain four helical regions. The second type are structures 5.7 General overview of files and folders 58 remaining from yet unfolded sequences where only two concatenated helices could be transferred with at least 75 % per helix. These are typically missing larger fragments of the ITS2 and should not be included into standard phy- logenetic analyses. They might be useful for studies on specific helices only, however should be handled with special care. Having completed this task, the database is nearly ready for use. By running the internal PostgreSQL function ’makeproductionready()’ (Table 7) some fi- nal clean-ups are performed, e.g. the deletion of additional markers which were already integrated for future database versions, the pre-calculation of struc- ture counts for each taxonomic rank or the deletion of redundantly annotated features using the previously explained priority rules.

5.7 General overview of files and folders

The database generation follows a set of Perl and bash scripts which are located under ’rRNA DB/db/trunk’ and follow the sketch visualized in Figure 10. The ’generate.pl’ – directly located in this directory basically calls numbered scripts placed in the ’scripts’ directory. A more detailed overview of folders and scripts is given in the appendix (Chapter 15.L, 15.M).

5.8 Discussion

In this study, two main objectives were faced: (i) a larger number of secondary structures, and (ii) a better quality of annotations leading to more accurate structure predictions. As sequence databases grow daily, there is a natural increase within each update. However, with the ability to annotate sequences by ourselves, we do not have to rely just on search terms and fault-prone sequence tags any more. This achievement enabled the inclusion of a large set of relevant sequence databases which can be scanned using HMMs. Of course, HMMs also do not cover a 100 % rate of detections, thus the NCBI search is still present in the pipeline. Next, the topic of secondary structure prediction was addressed. We want to 5.8 Discussion 59 rely on state-of-the-art folding algorithms like energy minimization, which is close to the ITS2’s chemical folding process. Therefore, the folding algorithm itself was left unchanged. However, one could further enhance the homology modelling step. Having predicted new secondary structures in one iteration, these could serve as templates for further iterations as well. It turned out that this procedure must be controlled very strictly to avoid reduction of sequence lengths with increasing number of iterations. Now, the majority of structures are modelled in the first 3 rounds. The improved prediction quality of structures mainly originates from the ac- curacy of the new HMM-based annotation approach. By allowing also mixed annotations for sequences where a flanking region is cut or unpredictable by the HMM, we gain the maximum performance out of this method and could sig- nificantly increase the number of direct folds. However, one can argue whether direct folds lead to better phylogenetic predictions compared to homology mod- elled structures. It seems that the answer to this question is data-dependent and must be addressed in each case individually, even tough ”homology mod- elling tends to dampen the influence of artefacts (...) and thus yields more robust tree topologies” (Markert et al., 2012). Finally, our database is a cen- tral place to go when working on ITS2-based phylogenetics, and so far the only database that incorporates secondary structure predictions for this marker.

Technical implementation of the ITS2 workbench and the ITS2 database 61

6 Technical implementation of the ITS2 work- bench and the ITS2 database

6.1 Introduction

Having now tools available for ITS2 annotation, a large database of already predicted ITS2 sequences with their secondary structures as well as the al- gorithms to calculate a sequence-structure based phylogeny, the last missing part is an environment for making this pipeline available to the public in a simple and intuitive way. Where first only stand-alone software was able to deal with secondary structure, and a user had to manually download data from the ITS2 database, import and export it into and from several tools, today the ITS2 workbench provides a self-contained working suite, online and easy to use. To implement such high usability, a complete redesign of the outmoded CGI scripts had to take place. The ITS2 workbench is now equipped with the Perl based web development framework Catalyst which runs on a stand-alone Apache web server in combination with a PostgreSQL database server and handles requests from a modern JavaScript frontend.

6.2 The frontend of the ITS2 workbench

The frontend itself was developed using the JavaScript Framework ExtJS in combination with HTML and CSS. JavaScript is a scripting language which is applied on the client side, mainly implemented in web browsers and was used in the beginning for updating DOM structures inside an HTML document or for showing simple user notifications. It allows an object orientated, imperative as well as functional programming style. During the last 6 years, JavaScript has developed from its shadowy existence into a prominent programming language for interactive website design and stands side by side with the modern terms of Web 2.0 and Asynchronous JavaScript and XML (AJAX). 6.2 The frontend of the ITS2 workbench 62

6.2.1 AJAX and XHR with Web 2.0

The basic concept behind AJAX is an additional AJAX-engine (Figure 12 bottom) between client and server. This engine is typically based on JavaScript and becomes initialized when the pages loads for the first time. The engine itself cares about page rendering as well as interactive communication with a server. Compared to the classical web application model (Figure 12 top), this has a major advantage: A user initiating an event by performing a mouse-click for example, does not generate an HTTP-request that loads a complete website again (which would require the user to just wait), but instead triggers the AJAX-engine to load only necessary parts from the server. While data transmission takes place, the AJAX-engine is fully operational and could render further information on the website, start another data transmission or response to a different user interaction. Further positive side-effects are a reduced traffic overhead, thus a faster re- sponse from the server and a more elegant programming style which brings the website one step closer to a desktop-like environment but on the Web. This development initiated a race in speed optimization (e.g. DOM inserts) of JavaScript engines between the major browser manufactures, and further led to a battle in the development of JavaScript frameworks. These orientated on desktop GUIs, which resulted in the implementation of typical design styles like layout-containers, panels, event handlers or even drawing-suites.

6.2.2 The ExtJS JavaScript framework

In the following, a short description of open source JS frameworks, available in 2009 is given (Table 6), where finally ExtJS was chosen for implementing the ITS2 workbench. ExtJS provides the functionality to extend from objects, which are typically based on the class ’Component’. This class supports automated creation, rendering and destruction of objects. Using the ’ComponentManager’, one can easily access a component by its ’id’-property. A ’Container’ class directly 6.2 The frontend of the ITS2 workbench 63

Figure 12: The classical web application model on the top clearly visualizes the time passing until a server responds with data transmission after a user in- teraction (typically a page click) took place. When regarding the AJAX model on the bottom, the time of a server response can be reduced by eliminating redundant traffic. Further, the rendering of some components can already take place before a data transfer from the server is completed. 6.2 The frontend of the ITS2 workbench 64

Dojo toolkit several widgets; less organized, documented, structured jQuery very good concept; focussed on events and animation Prototype & Scriptaculous focus on animations; no GUI options Yahoo / YUI GUI elements and animation; weak documentation ExtJS Very rich GUI; animations and clear API documentation

Table 6: Short overview of major public licensed JavaScript frameworks avail- able in 2009. extended from ’Component’ supports the possibility to add items using the ’items’-property. By applying this concept to different layouts, components can be nested easily. An alternative way of placing a component however, is to render it directly into a DOM element. When specifying the property ’renderTo’ this task is executed automatically by ExtJS on component initialization. A simple example shows this concept for the ”Cited by” page of the workbench by using a panel in combination with an XMLHttpRequest. HTML:

Example Citations

JavaScript: var citationspanel = new Ext.Panel({ id:’citedby’, bodyStyle: ’padding:15px’, title:’Cited by’, iconCls:’icon-citedby’, 6.2 The frontend of the ITS2 workbench 65

autoLoad:{url:’ttvis’,method’POST’,params:{page:’citedby’}}, renderTo:’myCitations’ });

These lines create a panel, render it to the div-element with the id ’myCita- tions’ and automatically invoke an XHR request to the ’ttvis’ controller (Figure 13). Data content (the citations) is finally delivered from the server after a few milliseconds. Typically, POST requests are created that enable the transfer of variables as message content. The controller ’ttvis’ is explained later in the appendix, Table 17.

Figure 13: Visualisation of a data request (XHR) in Google Chrome, initiated by ExtJS. The server’s response is received after a few milliseconds.

However, data received from XHR requests must not always be rendered on a website immediately. Therefore, ExtJS provides a substantial element inside its framework: The ’Store’ class which is integrated into several components. A ’store’ allows to collect data in records, to filter and to modify them. Typically, components e.g. grids load data into a store and display them in specific columns or formats. To exchange data with the ’store’, a specific encoding must be applied. This is implemented by different data readers/writers that handle, for example, XML or JSON (JavaScript Object Notation) encoded data content.

6.2.3 Web 2.0 data formats

To transfer data from the server to the client, former applications were based on the SOAP networking protocol. This allows to deliver XML in a SOAP 6.2 The frontend of the ITS2 workbench 66 envelope using XHRs which influenced the definition of the term AJAX. The main benefits of this protocol are a well-defined standard, several available parsers and writers in different programming languages as well as an easy validation using a DOM or XML Schema Definitions (XSD). Some of the major drawbacks are a large traffic overhead and a longer development cycle for setting up definition and validation files. Here, the JSON format benefits from a very sparse and easy declaration, however lacks any automated validation and specifications before and after transferring data. In the following, an extract from a JSON transfer loading sequence information into a data-grid is given (sequences and structures are shortened):

{"totalCount":3,"jsonresponse":[

{"energy":"-31.6","sequence":"TCTGC","method_str":"2","acc":"AB190265", "group":"Alveolata","gi":"52546182","structure":".....", "method_ann":"3","specname":"Symbiodinium sp. Kokubu1a"},

{"energy":"-31.6","sequence":"TCTGC","method_str":"2","acc":"AB190276", "group":"Alveolata","gi":"52546193","structure":".....", "method_ann":"3","specname":"Symbiodinium sp. Sakurajima2c"},

{"energy":"-35.1","sequence":"TTTCA","method_str":"2","acc":"AB190280", "group":"Alveolata","gi":"52546197","structure":".....", "method_ann":"2","specname":"Symbiodinium sp. Amami1a"}

]}

The JSON notation which is the JavaScript object notation allows to obtain a JavaScript object by just applying the ’eval()’ function on the JSON string. JSON data is usually transferred by XMLHttpRequests. To not be misleading, XHR was formerly developed for Microsofts Internet Explorer and enables data transfer for all kinds of HTTP requests in combination with JavaScript. Also plain text or HTML code can be delivered. 6.3 The backend of the ITS2 workbench 67

6.2.4 Frontend files and locations

The workbench frontend is organized under ’Rrna cistron db/root’, see Figure 14. Here, the ’index.html’ is the direct entry point for the ITS2 workbench and is rendered by the ’root’ controller. All necessary JavaScript and CSS files are directly loaded from the ’index.html’.

Figure 14: Important files and folders of the frontend of the ITS2 workbench.

As soon as JavaScript and CSS files are transferred, the file ’applayout.js’ provides the basic border layout on startup. This integrates a large set of additional functions and objects. In the appendix, a short overview of those files and their functionalities is given (Table 15):

6.3 The backend of the ITS2 workbench

6.3.1 Catalyst: A Perl-based Web development framework

Catalyst is a MVC (Model View Controller) Web development framework based on the programming language Perl. Compared to the old ITS2 Web interface, this framework enables a structured organization of files and a sepa- ration of programming, visualization and database logic. Using this standard, it follows the principles of Don’t Repeat Yourself (DRY) and is fully object orientated. It further comes with a set of CPAN hosted plugins and runs on all major Web servers. 6.3 The backend of the ITS2 workbench 68

The MVC architecture of Catalyst allows a flexible program design includ- ing reusable components, an organized structure and programming logic. The ’Model’ represents a full database scheme by encapsulating each table into an own Perl object. When fetching data from the database, the correspond- ing Perl objects can be returned, providing easy access to the equally named columns. The ITS2 workbench provides two database models. One represents the sequence-structure database (see also Figure 16), a second one represents the administration database depicted in Figure 17. A ’View’ instead is re- sponsible for the data encoding when sending data back from the server to the client. This format is typically one of the previously described Web 2.0 data formats in chapter 6.2.3. The ITS2 workbench currently provides a JSON, a file download and a Template Toolkit (TT) view. Finally, a ’Controller’ con- nects model and view and further delegates all necessary actions. By calling a website in a specific way: http://name.tld/controller/function a controller, and even function calls provided by the controller are accessible for the client (see also the chapter about the RESTful programming paradigm 6.3.1). This enables a very flexible way of programming and eases the devel- opment of reusable code components. 6.3 The backend of the ITS2 workbench 69

Figure 15: The Figure illustrates a typical data request in the ITS2 workbench, established to the MVC Web development framework Catalyst. The XHR request from the client side causes a controller to contact the PostgreSQL database if necessary. Finally, information is rendered by a specific view, and transferred back to the client in a view-dependent format. The client then further processes its received data.

Figure 15 visualizes the basic concept behind the MVC architecture. A client request is first accepted by the controller. It decides whether a database con- nection is required, and in such case uses the database model to fetch relevant entries. The view finally regulates how or in which format data has to be 6.3 The backend of the ITS2 workbench 70 encoded before it is sent back to the client, where it can then be further pro- cessed. With the XHR request of course, beside the URL for the controller, also further information in form of variables can be delivered by GET or POST.

The RESTful programming paradigm and CRUD: Regarding the con- troller in more detail, a common implementation is the Restful design. The term representational state transfer (REST) was first defined by Fielding (2000) in his dissertation in 2000. Although it is not standardised, it be- came integrated into most today’s web development frameworks in different ways. The basic principle behind this concept is to access or modify the rep- resentation of a specific resource, referenced by an URI. To interact with this resource, only two things have to be known. This is first its identifier, typically in a form of: http://example.com/resources or http://example.com/resources/item and secondly the action to be performed. These actions, often go in coherence with basic database operations. These can be summarized under the term CRUD, standing for:

Create (INSERT DB statement, HTTP transfer by POST request) Read (SELECT DB statement, HTTP transfer by GET request) Update (UPDATE DB statement, HTTP transfer by PUT request) Delete (DELETE DB statement, HTTP transfer by DELETE request)

The database actions are triggered by corresponding HTTP POST, GET, PUT and DELETE requests. Especially the ITS2 administration interface is a typi- cal implementation of RESTful CRUD web services. It allows to create, show, modify, or delete information on the admin interface for managing citations, news, staff-entries and more. 6.3 The backend of the ITS2 workbench 71

Template Toolkit: Having finished all calculations by the controller, dif- ferent views are available in Catalyst that send information back to the client. Beside the common formats XML and JSON (see Chapter 6.2.3), Catalyst provides a special view for transferring parts of a website in form of templates, the Template Toolkit view. TT is a very frequently used module in the Perl community. It provides a way to separate website design from programming logic. A template is a sketch of code, usually HTML which provides the com- plete design and layout of parts of a website. Beside a replacement of variables with corresponding values, TT can also provide basic programming logic like loops, conditions and further simple constructs that ease the design process. In the following a short extract of the ’Cited by’ template is given as exam- ple. It produces an unordered list containing information about citations like authors, title, journal etc.:

Cited by

[% IF ttcitations %]

    [% FOREACH citation IN ttcitations %]
  • [% citation.authors %] ([% citation.pubyear %]):
    [% citation.title %]
    [% citation.journal %]
  • [% END %]
[% ELSE %]

No citations found!

[% END %]

All templates are located in the templates folder, see Figure 14. Further, a short description of each template is given in the appendix (Table 16). 6.3 The backend of the ITS2 workbench 72

6.3.2 The PostgreSQL database including PL/pgSQL

PostgreSQL is an object-relational database providing typical relational database mechanisms in combination with object orientated extends like the inheritance of tables. With its Procedural Language/PostgreSQL Structured Query Lan- guage (PL/pgSQL), it allows to declare stored procedures, using variables, functions, conditions and loops. PL/pgSQL code can be called directly from SQL statements as well as from triggers. The ITS2 database implements many reasonable PL/pgSQL functions, views and triggers which are listed in details in the appendix (see Tables 7, 8 and 11). Regarding the databases itself, the workbench connects to two separate ones. The first one is filled during the update procedure in section 5 and provides all information related to ITS2 sequences, structures, taxonomy information etc.. The second database is filled using the ITS2 admin interface and provides basic information about the website like staff, citations or RSS feeds.

Database Schema of the ITS2 database To be conform with the Catalyst naming-conventions, all table names end with a plural ’s’. Indexed columns are marked with a white symbol in Figure 16. Indices on primary keys are auto generated by PostgreSQL. alignments Stores the ’neelde’-alignment (EMBOSS package) of homology modelled sequence-structures. Includes model and template (transferred) in- formation as well as alignment statistics. annotation methods Holds the annotation methods ’HMM’ (both sides HMM hit), ’MIXED’ (only one side HMM hit, other side GB entry), ’GB’ (both sides GB entry), ’HM’ (re-annotation by homology modelling) and ’BLAST’ (BLAST search based annotation). elements Refers to the basic elements of the rRNA cistron which are: 18S, ITS1, 5.8S, ITS2 and 28S. 6.3 The backend of the ITS2 workbench 73

alignments

id INTEGER sequences runs_id INTEGER features features_id_model INTEGER id_gi INTEGER features_id_transferred INTEGER id INTEGER taxons_id INTEGER sequence_model TEXT sequences_id_gi INTEGER runs_id INTEGER sequence_transferred TEXT runs_id INTEGER definition TEXT structure_model TEXT elements_id INTEGER accession TEXT structure_transferred TEXT annotation_methods_id INTEGER score DOUBLE start_pos INTEGER length INTEGER end_pos INTEGER gaps INTEGER start_pos_total_reannot_shift TEXT taxons identity INTEGER end_pos_total_reannot_shift TEXT id INTEGER mismatches INTEGER sequence TEXT parent_id INTEGER transfer1 DOUBLE start_models_id INTEGER name TEXT transfer2 DOUBLE stop_models_id INTEGER rank TEXT transfer3 DOUBLE start_evalue DOUBLE lft INTEGER transfer4 DOUBLE stop_evalue DOUBLE rgt INTEGER count_its2_sequence INTEGER models count_its2_direct INTEGER count_its2_homology INTEGER id INTEGER count_its2_partial INTEGER description TEXT count_its2_genbank INTEGER

annotation_methods

structures id INTEGER short_name TEXT runs id INTEGER description TEXT features_id INTEGER id INTEGER runs_id INTEGER started TIMESTAMP structure_methods_id INTEGER elements stopped TIMESTAMP structure TEXT description TEXT id INTEGER energy DOUBLE version TEXT iteration INTEGER short_name TEXT motif_uggu_start INTEGER full_name TEXT gb_indexes motif_uggu_stop INTEGER structure_methods motif_aaa_start INTEGER id_gi INTEGER motif_aaa_stop INTEGER id INTEGER oset BIGINT motif_uu_start INTEGER short_name TEXT length INTEGER motif_uu_stop INTEGER description TEXT active BOOLEAN

Diagram: Created: DbWrench Design: its2-database 26 Feb 2012 - 10:37:39 PM its2_dev

Figure 16: Database Schema of the ITS2 database. features Stores sequence information, element type, and annotation specific information. gb indexes This table only exists while the database is being filled (see Chapter 5). It holds the GIs index position and length for a fast extraction of GenBank flat files from the index data file. Although GIs are unique, they might not be unique in this table due to the inclusion of daily GenBank up- dates. However ’old’ GI references are flagged as inactive. models A representation of models that were used for ITS2 annotation. These can be for GB/MIXED annotation: ’Genbank’ or ’Genbank imprecise’; for MIXED/HMM annotation: ’Diptera’, ’Eukaryota’, ’Fungi’, ’Metazoa’ or ’Viridiplantae’. runs This table contains information about the database version, a short description as well as the duration of the update process. sequences Holds information about a sequence, though not a specific feature or a sequence data itself! It stores mainly its GI, Accession and definition. 6.3 The backend of the ITS2 workbench 74 structure methods The structure methods of the ITS2 can be ’DIRECT’ for direct folds, ’HM’ for homology modelled or ’PARTIAL’ for incomplete structures. structures Saves the structure of a sequence itself in bracket-dot-bracket notation. Further energy and HM iteration is stored if available. This table also contains fields for sequence or sequence/structure motifs. However the detection of motifs is not automated yet. taxons This table stores the taxonomic tree in form of a nested set. Further, for each taxonomy, the number of sequence counts depending on the structure method are available. These were added for performance speed-ups only.

Database Schema of the ITS2 workbench – admin interface The schema of the ITS2 admin database (Figure 17) mainly contains static not connected tables. This is information about citations, RSS feeds or staff as example.

citations its2ofmonths socialnetworkinfos

id INTEGER id INTEGER id INTEGER authors TEXT name TEXT rsstitle TEXT pubdate DATE gi INTEGER rsslink TEXT title TEXT description TEXT rssdescription TEXT journal TEXT reference TEXT rsspubdate DATE voliss TEXT frontpicturepath TEXT twitterdescription TEXT publicationlink TEXT frontpicturecitation TEXT twitterkeywords TEXT isgrouppub BOOLEAN hoverpicturepath TEXT facebookdescription TEXT ishighlightpaper BOOLEAN showupdate DATE

staffs fundings pressreleases

id INTEGER id INTEGER id INTEGER firstname TEXT firstname TEXT magazine TEXT lastname TEXT lastname TEXT pubdate DATE description TEXT description TEXT title TEXT isalumni BOOLEAN isactive BOOLEAN link TEXT

news links supplements

id INTEGER id INTEGER id INTEGER release TEXT name TEXT description TEXT releasedate DATE link TEXT description TEXT description TEXT

Diagram: Created: DbWrench Design: its2-admin-database 26 Feb 2012 - 10:45:05 PM its2_admin_dev

Figure 17: Database Schema of the ITS2 admin database. 6.3 The backend of the ITS2 workbench 75

6.3.3 Apache and Fcgid communication

To run the ITS2 workbench on an Apache web server, Catalyst provides two methods located in the ’Rrna cistron db/script’ directory: CGI and FastCGI. Beside both options, Catalyst also provides its own web and debugging server, which is not recom- mended in production environments. FastCGI – the performance optimized version of CGI – provides an intelligent caching mechanism for CGI programs and allows Catalyst to handle even parallel requests in a quick way and with- out reloading Perl interpreters, etc. on each new request. The workbench runs a maximum of 10 parallel instances and allows an execution time of 1000 seconds per process.

6.3.4 Backend files and locations

The workbench backend is organized under ’Rrna cistron db/lib/Rrna cistron db’, see Figure 18. Here, controller, models and views are located in their corre- sponding folders. In addition, the folder ITS2 contains specific Perl modules for parsing and validating data, or for small encapsulated algorithms.

Figure 18: Important files and folders of the backend of the ITS2 workbench.

A detailed overview of all controller and specific modules is given in the ap- pendix (see Table 17, 18). 6.4 ITS2 workbench installation script 76

6.4 ITS2 workbench installation script

To install the ITS2 database with all relevant dependencies and Perl modules, an ITS2 installer (Figure 19) was developed. It distinguishes mainly between two installation types. One for a production environment on a webserver, and a second one for developer use. All output messages are logged to the file ’its2 install.log’ during installation. For specific information about the installation process, please follow the guided steps provided by the installer itself, and read the tech-documents in the SVN ’INSTALL’ folder.

Figure 19: The ITS2 workbench installer in this Figure, provides an installation for production environments, as well as for developers.

6.5 Discussion

The ITS2 workbench, technically speaking, is equipped with the latest web technologies. This is especially its feature-rich JavaScript user interface and the web services provided by the Catalyst web development framework. However, when focussing on the backend, Catalyst clearly plays a role as an outsider in the community. It lacks the extensive support that professional platforms like Ruby on Rails or Oracles Java EE in combination with Hibernate and Spring can provide. Nevertheless, Catalyst is so far the most advanced web development framework for Perl, which has been established as a commonly used programming language in the field of Bioinformatics. With a large variety of plugins, it provides all basic features needed for typical web services. The ITS2 workbench 77

7 The ITS2 workbench

7.1 Introduction

Having solved all technical aspects, the ITS2 workbench (Koetschan et al., 2012) is finally available at http://its2.bioapps.biozentrum.uni-wuerzburg.de. The website unifies all necessary tools that are required for a first ITS2 based phylogenetic analysis on sequence and structure. This ideally starts with the creation of a dataset – the taxon sampling, and ends with the calculation of a phylogenetic tree. However, this process is not always as straightforward as one would assume. During the analysis of data, especially when regarding the alignment, or even worse – after having calculated a ”final” tree, one might realize that something went wrong. This might be a surprising classification of a sequence inside an inappropriate looking clade, long branches between closely related species or a sequence that doesn’t fit into the alignment at all. When narrowing down these issues, it often turns out that a false taxonomic classification, wrong annotation or sequencing error has caused the artefact. Nevertheless, the whole time-consuming analysis has to be repeated. The ITS2 workbench assists here with a finesse. All sequence-structures that form the taxon sampling are accessible from a pool at any time. As soon as one detects an error in the alignment or tree, the erroneous sequence-structure can be deleted from the pool, and previously performed calculations are automatically repeated (see Figure 20). This also affects later added sequence-structures. Together with additional tools that complete the pipeline of Schultz and Wolf (2009), the workbench offers a quick glance into an ITS2 based sequence- structure phylogeny and hereby simplifies the time-consuming tasks of taxon sampling. 7.2 Overview 78

Figure 20: This Figure illustrates the process of taxon sampling using the ITS2 workbench. It starts with own sequence-structures or ones retrieved from the ITS2 database. These are added to a data pool which is the starting point for an alignment. By deleting sequences from the alignment, they are deleted from the pool as well, and a new alignment is recalculated automatically. The same procedure is applied when further adding sequences to the pool, or deleting them from the tree. Having repeated these steps until no further error-prone sequence-structures are visible, the taxon sampling is finished.

7.2 Overview

When regarding the website (Figure 21), the workbench is organized in a border-layout style and divided into header, west and center panel. The header enables access to the ITS2 database during all times using the live search function and provides links to basic information around the website. The west panel instead contains typical tools around the marker, like the HMM annotation, its secondary structure prediction or sequence motifs visualization. Further, it shows an accordion menu, guiding through the basic workflow of creating, managing and analysing a dataset. On the bottom left, the pool receives sequence and structure information from different locations of the website per drag & drop. Finally, a large tab panel in the center of the website shows all relevant information and allows to close or switch between relevant 7.3 Creating a sequence-structure dataset 79 tabs during the whole work process.

Figure 21: The ITS2 workbench is divided into header, west and center panel. The header contains a live search and custom information about the website as well as citations. The west panel gives access to typical marker-specific tools and shows the basic workflow for sequence-structure analysis with the data pool on the bottom left. The center panel allows several tabs to be opened simultaneously while working with the website.

7.3 Creating a sequence-structure dataset

7.3.1 Datasets based on the ITS2 database

Typically, biologists aiming for a phylogenetic analysis, mainly a phylogenetic tree, start with the creation of a dataset. This means, they are interested in a taxonomic group and want to search for sequences, fitting into the taxon sampling. As prominently visible on the top, this can easily be accessed by the search function. The search accepts one search term, or a comma sepa- rated list. This list may contain GeneInfo Identifiers, Accession numbers or 7.3 Creating a sequence-structure dataset 80 even several taxonomic names. To facilitate the input of taxonomic names, a live search is performed, completing the name based on a prefix matching of the first entered letters. Once the selection is finished and the search button is pressed, a large set of information is fetched from the database and becomes visible in the center panel. This is by default the searched group, the Gene- Info Identifier, the taxonomic name, the structure prediction method and of course sequence and structure itself. Depending on the radio buttons beside the search field, different categories are available that return sequences only, directly folded and homology modelled structures or all folded ones including partials. Furthermore, less frequently needed information such as the annota- tion method or sequence motifs can be displayed at any time by expanding the columns of the grid view. But performing a search based on pre-known search terms is only one of the ways to begin the creation of a dataset. Beside the specific search function, a user might also want to browse through the tree of life on the left. This will give an overview of different groups avail- able in the ITS2 database, while showing their number of sequences/structures in brackets next to the taxonomic name. By switching between the previously mentioned radio buttons for different structural types in the header, the tree itself changes its color and structure according to the available sequences. This allows to browse through sequences only, or to include partial structures in the tree. Finally, by clicking on a taxonomic name, a new tab (like after the direct search) is opened in the center panel. Now, the user is able to view sequences and structures. The next step is adding them to a dataset. Therefore, a data pool is included at the website on the bottom left. The data pool is able to store sequences, structures as well as alignments or trees. To add sequence- structures, a user just needs to drag and drop columns from a grid onto the pool (see Figure 22). 7.3 Creating a sequence-structure dataset 81

Figure 22: Using the live search function on the top, a user can easily scan for organisms. In addition, a taxonomic tree on the left enables the user to browse through all eukaryotes contained in the database. Clicked nodes inside the tree or searched organisms are presented in a new tab in the center panel. Here, specific information for each sequence is visible, such as its annotation methods, folding energy, Accession number, etc.. In order to create a dataset, these sequences can be added to the data pool by a simple selection in the grid, and a drag & drop onto the pool on the bottom left.

Per default, sequences are added including their secondary structures. If these are not available, just an unpaired structure is added to avoid conflicts when creating the alignment. Of course, sequence-structures can also be included by using the context menu which appears by a right click on a grid column. 7.3 Creating a sequence-structure dataset 82

7.3.2 Datasets from own sequences

Figure 23: The accordion panel on the left side of the window allows to create, manage and analyse a dataset. To create a dataset, own sequences or ones from the database may be used.

Beside including sequences from the database directly, the ”Create dataset” frame also offers to add own data. However, when importing user data, this might not be annotated yet or may lack a secondary structure. Therefore, a user possessing both, an annotated sequence and a secondary structure can use an upload form to directly add data to the pool, which is checked for consistency, before it is added. But in case that the structure is missing, or sequences are not yet annotated, a few more steps are required.

The ITS2 ”Annotation” tool: Having sequenced own data, or obtained a complete rRNA cistron, the HMM annotation tool assists the user to locate the exact borders of the ITS2. 7.3 Creating a sequence-structure dataset 83

Figure 24: Results from an HMM based ITS2 sequence annotation are illus- trated here. One can easily recognize the 5.8S motif identified on the left side of the ITS2 sequence. Further, the hybridizing proximal stems of both HMM motifs are visualized as control.

Therefore, a sequence is pasted into the sequence editor. After an automated validation of the sequence content and format, the green check mark appears on the left. Now the users may choose a model based on the sequence origin. Five different models are available for ’Eukaryota’, ’Fungi’, ’Diptera’, ’Viridiplantae’ or ’Metazoa’ data. Additionally, choice boxes for e-value cut-off and minimum ITS2 length allow to reduce the probability of false hits during annotation. After pressing the ”Annotate” button, the results are displayed below (see Fig- ure 24). The flanking borders of the ITS2, the 5.8S and 28S motif, are depicted in blue and red. However, the annotation without these motifs (each contain- ing 25 nucleotides) is not possible. To control the process, both hybridizing stems are displayed which should contain the typical free nucleotides on both stem sides. In case of inverse strand directions or no results, a user has also the possibility to analyse the reverse complement of the sequence by enabling 7.3 Creating a sequence-structure dataset 84 the appropriate check box. With the HMM annotated ITS2 sequence, the next step towards a sequence-structure based phylogeny is the secondary structure prediction. The user clicks with the right button on the ”add” - symbol, and transfers the sequence to the structure prediction tool.

Easy prediction of secondary structure – The ”Predict” tool: There are two different types of structure prediction. To facilitate the decision for the user, again, the annotated sequence just needs to be copied into the sequence editor. Then, the best method for structure prediction is chosen automati- cally. First, a prediction based on energy minimization is performed. If this is unsuccessful, an automated sequence-based BLAST search starts against the whole ITS2 database with the predefined e-value cut-off listed below the input field. According to the given number of best hits, results from the homology modelling step are displayed in a grid (Figure 26).

Figure 25: This page shows the secondary structure prediction on the ITS2 workbench. Having pasted a sequence into the editor, it might either be folded directly or become homology modelled. Here, the results of homology mod- elling are visible inside a grid. Green rows indicate a high quality prediction, red ones a low quality prediction with at least one helix below 75 %. 7.3 Creating a sequence-structure dataset 85

This grid is sorted by e-value and lists information about helix transfer and the taxonomic name of the template in green or red. Here, green rows indicate a high quality prediction, red rows indicate a prediction with at least one helix below 75 %. For more details, each row can be expanded, and shows the secondary structure, alignment statistics and helical transfer. For low quality models, the BLAST identified template structure is visible and coloured with green and gray. The green dots indicate equal sequence and structure positions between template and target. Having identified an appropriate template, the modelled sequence-structure can now be added to the data pool by a simple drag & drop or via the context menu.

Homology Modelling of structure – the ”Model” tool: Homology modelling of a structure using the ”Model” tool is quite similar to the pre- viously explained prediction tool, however it has a major difference. Whereas in the previous section a template was unknown and identified by a BLAST search, here the user has the possibility to specify an own template. Fur- ther, not only one, but even several templates in combination with multiple secondary structures are imaginable as model input. The algorithm then au- tomatically chooses the best out of all possible combinations. If wished, also the best consensus template can be calculated. 7.3 Creating a sequence-structure dataset 86

Figure 26: The Figure shows a homology modelled structure on the ITS2 workbench. High quality predictions are marked green, low quality predic- tions are coloured red. By expanding a row, the secondary structure of the target sequence becomes visible. Additionally, further information such as the alignment is available.

Beside these extensions, further advanced options are available for the more ex- perienced user. These are alignment specific gap costs, different score matrices as well as adjustable thresholds for high quality models. The visualization of results is similar to the ”Predict” tool. A grid separates high quality (green) from low quality (red) predictions. Each grid row can be expanded to view the modelling details (Figure 26). High quality predictions can further be selected automatically for download, or for the transfer into the pool. 7.4 Managing the dataset 87

7.4 Managing the dataset

Having created a dataset, the next step allows to manage it. This means visual inspection, correction or deletion of sequence-structures, alignment or tree (Figure 27).

Figure 27: Managing a dataset in the ITS2 workbench means to view structures in the pool, see an alignment or to delete a tree for example. However, not all possibilities are available at any time.

By now, only sequence-structures have been added to the pool. These can be displayed by a click on the loupe, or the pool itself. When a structure is avail- able, the folded conformation of the ITS2 is illustrated in small pictograms (Figure 28). Otherwise, icons for sequence only or partial structures are avail- able. With a left click on an image, it becomes highlighted, and information about sequence motifs, energy, etc. appears at the bottom. To be able to deal with big datasets, a filter option allows to keep track of data. It permits to fo- cus on sequences matching specific search-criteria like species name, GeneInfo Identifier, energy, structure or annotation method, respectively. 7.5 Analysing the dataset 88

Figure 28: The ITS2 workbench allows to manage sequence-structures, and illustrates the structural conformation in small overview pictures. By apply- ing a filter, specific structures can be focussed. When marking a structure, information about sequence motifs or the annotation method becomes visible in a small panel below. Further, structures can be deleted or transferred to different tools in other tabs.

Finally, sequences can be marked by a mouse click in combination with the control key. A right click on a marked sequence then shows a context menu enabling the deletion of structures from the pool, or their transfer to other tools in different tabs.

7.5 Analysing the dataset

Having managed the dataset and maybe already deleted some structures by manual inspection in view of the taxon sampling, the next step in a phyloge- netic analysis is the creation of an alignment, and further the calculation of a phylogenetic tree (see Figure 29). 7.5 Analysing the dataset 89

Figure 29: When analysing a dataset on the ITS2 workbench, different options are available. If the pool contains secondary structure information, a sequence or sequence-structure based alignment can be calculated. Afterwards, the option for calculating a Neighbour Joining tree becomes highlighted.

7.5.1 Alignment on sequence or sequence and structure

The alignment can be performed on sequence only, or, if secondary structure is included in the dataset, on both characteristics of the marker. Not visi- ble to the user, ITS2 specific gap costs and score matrices (see Chapter 4.3) are integrated during the alignment process. Having finished calculating, the alignment opens in a new tab (Figure 30). This shows the sequence head- ers on the left, the sequence alignment itself on the top, and if available the secondary structure alignment on the bottom. By clicking on a nucleotide or structural bond, the corresponding ones become highlighted with a red circle. Further, columns can be rearranged by drag & drop to compare sequences or structures directly. Having identified an erroneous sequence or structure in the alignment, this can be deleted directly with a right click in the context menu. Naturally, the user is asked whether he wants the alignment to be recalcu- lated afterwards. Finally, the alignment can be stored for the use in external programs, or the study can be continued with the calculation of a tree. 7.5 Analysing the dataset 90

Figure 30: In this Figure, a sequence-structure alignment of the ITS2 work- bench is presented. The alignment of sequence information is visible on the top, whereas the structure alignment can be found at the bottom. Corre- sponding bonds can be highlighted with a simple click and are marked with red circles. To remove a sequence from the alignment and thereby also from the pool, a context menu is available.

7.5.2 Tree calculation with the Neighbour Joining algorithm

The tree icon is visible on the ”Analyze” tab, as soon as the alignment is cre- ated. Currently, the user can only calculate a tree based on the fast Neighbour Joining algorithm using ProfDistS (Wolf et al., 2008). The tree opens in a new tab (Figure 31), and allows zooming and scrolling to different nodes. Here, a user aiming for a good taxon sampling can easily inspect the tree, and delete a node just by opening the context menu. Further, a re-rooting of the tree at each node is possible. As soon as the dataset has changed, calculations are re- peated automatically when switching between tabs. Although the calculation of the tree is primarily offered as first impression on the dataset, a download in Newick format is available. It must be stated however, that this tree is by no 7.6 Additional tools 91 means a publication ready tree. Therefore, further treeing methods should be taken into consideration, and at least a bootstrap analysis would be required.

Figure 31: This graphic shows a tree calculated on the ITS2 workbench. With a click on a node, the context menu allows to delete or re-root this position. The tree can be scaled and moved inside the panel, and is available for download in Newick format.

7.6 Additional tools

7.6.1 Motif detection in ITS2 sequences

Beside the typical pipeline for the creation of a taxon sampling, the ITS2 workbench offers some additional tools around the marker. One is the ITS2 sequence-motif detection. Sequence motifs are conserved parts or regions which occur in a large proportion of a dataset, here, inside the ITS2. Although these are not of major importance for phylogenetic analyses, they can assist in the review of sequence annotation or secondary structure prediction. Three basic sequence motifs were identified with the motif search tool MEME (Bailey et al., 2009), based on the dataset of Keller et al. (2009): 7.6 Additional tools 92

ˆ U-U mismatch (helix II, left),

ˆ U-U mismatch (helix II, right) with AAA (btw. helix II and III),

ˆ UGGU (helix III, 5’ side)

Together with the two free nucleotides of the hybridizing stem regions 5.8S and 28S, five characteristics are available that ensure the correctness of a se- quence annotation. The motif prediction on the workbench follows the typical procedure. A sequence is pasted into the sequence editor, or transferred to the ”Motif” tool from a different tab. When providing additional structure information, the motifs become highlighted in their folded conformation later.

Figure 32: The motif search tool on the ITS2 workbench detects three se- quence motifs. When entering additional structure information, the motifs are highlighted in a small coloured pictogram. This is ideal for reviewing the correctness of a folded structure.

However, the presence of a secondary structure is no prerequisite for motif detection. Two choice boxes for e-value threshold and model (possible models are ’Fungi’ and ’Viridiplantae’) are available to configure the HMM based motif search. Having clicked on the search button, the sequence is scanned 7.6 Additional tools 93 for all three motifs, and results are displayed in a detailed table below (see Figure 32). If, in addition to sequence, also structure was added as input, the according sequence motifs are now coloured accordingly when opening the structure view.

7.6.2 BLAST on sequence and structure

A tool that completes the taxon sampling process is the newly developed sequence-structure BLAST. The ITS2 sequence is a fast evolving marker, and thus provides a high resolution on lower taxonomic groups. Thus, a BLAST search on sequence only, will mainly detect closely related species. For a well balanced taxon sampling instead, also more distantly related species are needed. The sequence-structure BLAST search fulfils this requirement by com- bining sequence and secondary structure into its search. Furthermore, this en- ables the identification of distantly related species inside the ITS2 workbench. When opening the according tab, one can choose between sequence only or sequence-structure BLAST and paste the sequence-structures into the editor. After the search has finished, results are available in new tabs, one for each sequence. Here, all relevant information about score, e-value or coverage are visualized inside a grid. With a double click, a row expands and the alignment becomes visible (Figure 33). 7.7 The ITS2 admin interface 94

Figure 33: This Figure shows the results of a sequence-structure BLAST search on the ITS2 workbench. This BLAST type allows to detect also more distantly related species due to its integration of secondary structure into the search. With a double click on a row, the full alignment appears on the screen.

Finally, the alignment rows can also be transferred into the pool per drag & drop, whereas buttons on the top allow to store selected sequences or even the complete BLAST output in different formats.

7.7 The ITS2 admin interface

The last tool completing the ITS2 workbench is its administration interface. This website allows to manage most dynamic contents like citations (Figure 34) in simple, editable grids. 7.8 Discussion 95

Figure 34: This graphic shows the ITS2 workbench administration website that allows to manage dynamic contents of the ITS2 workbench. On the left, a list of topics is available, which can be edited directly inside the grid rows in the center of the website. As example, the citations page is depicted.

On the left side, a list of manageable topics is available. A click on those, opens the corresponding table in the center. Here, rows can be added, deleted or updated easily using the grid-editor. Changed values are quickly synchronized with the database and become visible on the ITS2 workbench immediately.

7.8 Discussion

During the past ten years, ITS2 sequence-structure phylogeny has been discov- ered as an interesting concept that improves phylogenetic analyses (Coleman, 2003; Schultz et al., 2005; Telford et al., 2005; M¨uller et al., 2007; Keller et al., 2010), and was applied in several studies (more than 150 ITS2 database cita- tions). Following the proposed workflow of Schultz and Wolf (2009), however, required the use of a large variety of different programs (Seibel et al., 2006; Wolf et al., 2008; Larkin et al., 2007; Page, 1996; Wolf et al., 2005b) in combi- nation with the ITS2 database. Further, these tools needed to be maintained for a variety of different operating systems. The ITS2 workbench, is so far the 7.8 Discussion 96

first website, that embeds the complete phylogenetic pipeline for a sequence- structure based analysis. However, it also bears some limitations. Having created an alignment though, one might find out that some sequence parts or structural bonds need to be corrected. Such editing is currently impossi- ble in the workbench. Here, it was decided in favour for a full automation and reproducibility. Nevertheless, one could think about implementing a few semi-automated features like the cropping of borders for frayed alignments. Another aspect is the calculation of the tree: currently, only the Neighbour Joining method is provided, which runs in a relatively short time. But even here, the number of taxa had to be limited and no bootstrap support is given. This can be disregarded when concentrating on a taxon sampling, but for more advanced tree calculation, one has to download the alignment, and run stan- dalone treeing software like ProfDistS. However, there exists no publication about an MP or ML based sequence-structure treeing program yet. A solution to this problem could be the integration of the newly developed R-package described in Chapter 4.2. Although it is not evaluated on large scale-data yet, it provided promising results on the chlorophyceaen dataset (Chapter 4.4). By running this package with parallel bootstraps on a cluster, it might be able to provide trees for MP, ML and BIONJ even in a reasonably short time. ITS2 application on large scale-data - automated reconstruction of the Green Algal Tree of Life 97

8 ITS2 application on large scale-data - automated reconstruction of the Green Al- gal Tree of Life

8.1 Indroduction

Continuing the discussion towards a large-scale analysis, one might question beside the technical hindrances, the potential of the ITS2 itself (Alvarez and Wendel, 2003; Sang, 2002) for a successful application on larger taxonomic units. This might be due to the known genetic heterogeneity of the ITS2 which resulted in a large discussion over several years (Alvarez and Wendel, 2003; Sang, 2002; Bezzhonova and Goryacheva, 2008; Wang and Yao, 2005; Nickrent et al., 1994; Powers et al., 1997; Feliner and Rossello, 2007; Wolf and Schultz, 2009) or the questioned range of this marker when applied on such big datasets. Nevertheless, the increasing number of published manuscripts using ITS/ITS2 as phylogenetic marker (Feliner and Rossello, 2007) and the proposal of ITS/ITS2 as a DNA barcode in several recent studies (Li et al., 2011; Chen et al., 2010; Yao et al., 2010; Gao et al., 2010; Sass et al., 2007) underlines its potential, especially when incorporating both of its features into one’s analysis. To further demonstrate the automated workflow of (Schultz and Wolf, 2009) on large datasets, the phylogenetic tree of Chlorophyta (green algea) with about 2270 taxa was reconstructed automatically, using techniques, equivalently im- plemented in the ITS2 workbench. This study, published by Buchheim et al. (2011a) included the creation of datasets, alignments, and a short technical chapter which were part of this thesis.

8.2 Materials and Methods

The analysis was based on ITS2 sequences and structures obtained from the ITS2 database v3.0 (2009/09/29). For all available Chlorophyceaen (591), 8.3 Results 98

Trebouxiophyceae (741) and Ulvophyceae (938) samples, a sequence struc- ture based global multiple alignment was generated using 4SALE v1.5 and ClustalW2 with an ITS2 sequence-structure specific scoring matrix. Out of each alignment, a class-specific tree was calculated by Profile Neighbor Joining (PNJ) and ProfDistS v0.9.8 using a General Time Reversible (GTR) substi- tution model. Here, in each case Micromonas (Prasinophyceae) was added as outgroup to the data set before. Additionally, a global Chlorophyta tree con- taining all (2270) taxa from the class specific trees was calculated accordingly. All trees were finally rooted and visualized by FigTree v1.2.3. For bootstrap support, manual profiles were set in ProfDistS with the aid of Cartoon2Profile (http://profdist.bioapps.biozentrum.uni-wuerzburg.de/cgi- bin/index.php?section=cart2prof) for the most important clades inside the trees. Cartoon2Profile simplifies profile definition by exporting cartoons from FigTree into a profiles definition file, compatible to ProfDistS. For online vi- sualization, all three trees of each class were concatenated and visualized by HyperGeny, a hyperbolic tree browser and are available at: http://hypertree.bioapps.biozentrum.uni-wuerzburg.de. Using conventional hardware, a 2GHz computer took less than one hour for calculating each class-specific alignment and about 3.5h for the calculation of the whole Chlorophyta alignment. Determining the bootstrap support took approximately 10 minutes for each tree.

8.3 Results

The class of Chlorophyceae (Figure 37) shows the three orders of Oedo- goniales, Sphaeropleales and Chlamydomonadales/Volvocales. Oedogoniales are categorized by the genera of Oedogonium, Bulbochaete and Oedocladium. Sphaerophleales, grouped into Desmodesmus, Scenedesmus, Atractomorpha and Sphaeroplea have a good bootstrap support of (94%) and were placed as a monophyletic unit. Chlamydomonadales/Volvocales consist of Chlamy- domonas, Yamagishiella, Pandorina, Eudorina, Astrephomene, Gonium, Pha- cotus and Dunaliella. In this case, Chlamydomonadales were separaded into 8.3 Results 99 two distinct groups (Chlamydomonadales I and Chlamydomonadales II), albeit with low bootstrap support.

The class of Trebouxiophyceae (Figure 38) contains the three clades, Microthamniales I (Trebouxia alliance), Microthamniales II (Asterochloris al- liance) and the with the genera of Chlorella, Parachlorella, Coc- comyxa, Micractinium and Didymogenes. Each clade has a high bootstrap sup- port of 99% (Microthamniales I), 94% (Microthamniales II) and 96% (Chlorel- lales).

The class of Ulvophyceae (Figure 39) is resolved with the four clades of , /, Ulvales I and Ulvales II. Bryopsidales consist of the orders and and show a high bootstrap sup- port of 92%. The Urospora/Acrosiphonia clade is supported by 79%. Ulvales I consist of the taxa Bolbocoelon, Blidingia, , Umbraulva, Acrochaete and Ulva I, whereas Ulvales II reveal a second ulvean group – Ulva II. Ulvales I and Ulvales II have low bootstrap support with the latter forming a sister group to Urospora/Acrosiphonia.

The phylum Chlorophyta (Figure 35) reveals the three classes of green al- gae described above, however in some aspects there are differences compared to the class-based analysis. Per default, the class-based analysis treats each class as monophyletic, the phylum-level analysis though questions this assumption slightly. Oedogoniales are grouped together with Chlorellales III (Coccomyxa) as sister to Ulvales I (Urospora/Acrosiphonia). Further, Sphaeropleales II (Sphaeropleaceae) are in alliance with Chlorellales I (Chlorella, Parachlorella, Micractinium, Didymogenes, Diacanthos, Closteriopsis, Actinastrum, Dictyosphaerium, Auxenochlorella, Lobosphaeropsis), Chlorellales II (Pseu- dochlorella, Koliella) and Microthamniales II. Additionally, Sphaeropleales I (Desmodesmus, Scenedesmus) are resolved as sister group to Ulvales I. Re- garding all these taxa, the Chlamydomonadales is classified as a monophyletic sister group. 8.3 Results 100

The Trebouxiophyceae as well as the Ulvophyceae form four non-monophyletic clades. For the Trebouxiophyceae, these are Microthammniales I, Microtham- niales II, Chlorellales III and the group of Microthamniales II, Chlorellales I and Chlorellales II. For the Ulvophyceae, these are the Bryopsidales II (Caulerpa), the Ulvales I and the group of Ulvales, Urospora/Acrosiphonia and Bryopsi- dales I (Halimeda).

Chlamydomonadales Sphaeropleales I

Microthamniales (I)

49 Ulvales I

Prasinophyceae 89 (outgroup)

Chlorellales (I) Bryopsidales II

Sphaeropleales II Chlorellales II ‚Botryococcus + Dictychloropsis‘ I

Microthamniales II

‚Botryococcus + Dictychloropsis‘ II

Microthamniales (III) Bryopsidales I Ulvales II

‚Interfilum‘ Oedogoniales Chlorellales III ‚Urospora‘/‚Acrosiphonia‘

Figure 35: Profile Neighbour joining tree calculated on the phylum Chloro- phyta (with 100 bootstrap replicas) using all ITS2 sequences and structures available from the ITS2 database version 3.0 (2009/09/29). The image is taken from Buchheim et al. (2011a). 8.4 Discussion 101

8.4 Discussion

Regarding the reconstructed trees of the classes Chlorophyceae, Trebouxio- phyceae and Ulvophyceae, they are basically in accordance with several pub- lished manuscripts using a variety of different markers, for example 18S rRNA (Wolf et al., 2002; Hepperle et al., 2000; Lewis and Flechtner, 2004; Nakada et al., 2008), rbcL (Nozaki et al., 2000; Loughnane et al., 2008; Zechman, 2003; Nozaki et al., 2003) or atpB (Buchheim et al., 2010; Nozaki et al., 2000, 2003). However, some differences exist to other studies, for example when regarding the Chlorophyceae (Buchheim et al., 2001, 2002). Figure 37 places Chlamydomonadales as a basal, paraphyletic unit, which differs from both manuscripts, where Oedogoniales, Chaetophorales and / or Chaetopeltidales were reconstructed at this position. The differences might be due to a weak support in those datasets, variations in the taxon sampling – e.g. during time of analysis there existed no ITS2 data for Chaetopeltidales / Chaetophorales, or differences in rooting the outgroup. Further, discrepancies become visible when comparing the phylum-level analysis to reconstructions on class-level. Here, for example, the group of Chlamydomonadales is resolved as mono- phyletic in the phylum-level tree (Figure 35), but compared to the class-level analysis (Figure 37), Chlamydomonadales II forms a sister group to the groups of Chlamydomonadales I, Oedogoniales and Sphaeropleales, albeit with a low bootstrap support of 47 and 57, respectively. Beside these small irregularities, the ITS2 marker unravelled a large phylogeny from species to phylum-level in a full automated pipeline and without manual intervention. Taking into account also the short calculation period, this suggests an application of ITS2 sequence-structure analysis on similarly large datasets. A video tutorial for the ITS2 workbench 102

9 A video tutorial for the ITS2 workbench

9.1 Short summary

Finally, the ITS2 workbench is in use and accessible for the public. However, the inexperienced user might need some guidance on the way to the phyloge- netic tree. Therefore, a short movie about the ITS2 workbench was created (Merget et al., 2012). It starts with a general introduction about the ITS2 marker, followed by a detailed explanation of the implemented pipeline. Part of this thesis was the 3D animated creation of a growing phylogenetic tree (Figure 36 top right).

Figure 36: This illustration shows four different captures from the ITS2 work- bench movie. On the top left, a 3D reconstruction of the ITS2 is visible. On the top right, a growing phylogenetic tree is depicted. The illustrations on the bottom show the website of the workbench with the ITS2 annotation tool on the left, and a calculated tree on the right. 9.1 Short summary 103

The small movie is available at: http://www.jove.com/video/3806/the-its2-database. Conclusion and outlook 104

10 Conclusion and outlook

During the last years, the development of sequencing technologies was pushed towards the 4th generation and high-throughput sequencing became available. This progress guarantees the exponential growth of databases like GenBank for the years to come. However, with the raising number of sequences, the need for exact annotation, sorting and management of data becomes fundamental. Large databases like GenBank, EMBL or DDBJ are not capable of provid- ing this service on such immense datasets accurately. Thus, the development of topic-related sub-databases is obvious. When focussing on sequence phy- logeny, different databases exist for a variety of markers (Pruesse et al., 2007; Cole et al., 2009; Wang et al., 2009; Koetschan et al., 2010). However, the ITS2 workbench is - to our knowledge - the only database for ITS2 sequence- structure phylogeny. Its data storage and accuracy was significantly improved during this study by the inclusion of HMM sequences annotation and an ex- haustive secondary structure prediction pipeline which more than doubled the number of structure predictions. But this is not even half of what the work- bench can provide. Whereas at the beginning of this work, a large variety of tools were needed for a full sequence-structure analysis, today all these meth- ods are integrated into the ITS2 workbench. This has the major advantage that time-consuming installations, data import and export along with simple handling errors could be reduced to a clear pipeline easily followed just with a few mouse-clicks on the Web. Therefore, a large number of different Web- services had to be implemented providing data for the AJAX driven Web 2.0 interface. Although the workbench currently bears some limitations, such as its repertoire of treeing methods, solutions like the new treeforge R-package are nearly ready for use. Beside a transfer of the currently established sequence- structure phylogeny to methods like MP and ML, it consists of newly developed sequence-structure alphabets, that performed reasonably well during first eval- uations. However, there still remains the challenge in computing large-scale phylogenetic trees on bigger datasets online. Recently developments towards Conclusion and outlook 105 cloud computing on large clusters and computing farms could provide solutions here. In addition, Web design is pushing towards the further development of JavaScript, like Google’s DART, local file storage mechanisms like the W3C draft of the Localstorage API (Hickson, 2011), or new browser databases such as the W3C draft of IndexDB (Mehta et al., 2011). C/C++ to JavaScript cross-compilers like Emscripten even today bring C/C++ programs to the web, just by running JavaScript. This suggests that in future, also the C++ algorithms of ClustalW2 or treeing software could be compiled for running in the Web browser using local CPU and memory resources. With the capability of calculating larger datasets, and further optimizations of the phylogenetic pipeline, in future, on might obtain a broader coverage in phylogenetic analy- ses. Reconsidering the initial questions ”What is a species?”, ”Are same looking individuals also belonging to the same species?”, ”Which one evolved first?” we might ask whether these are answered now? One has to admit, not completely – or the critic would say, not really. But one thing has drastically changed, and this work has definitely pushed the level one step further into this direction. Whereas just a few years ago, researchers had to undertake tedious manual work, analyse species with their morphological features in thousands of hours, to just gain an idea on how few fellows are related, today, with the use of the ITS2 workbench, this cumbersome work has been reduced to a minimum of effort and just requires few mouse-clicks. References 106

11 References

S F Altschul, W Gish, W Miller, E W Myers, and D J Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, 1990. ISSN 00222836. doi: 10.1006/jmbi.1990.9999. URL http://www.ncbi.nlm. nih.gov/pubmed/2231712.

S F Altschul, T L Madden, A A Sch¨affer,J Zhang, Z Zhang, W Miller, and D J Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389–3402, 1997. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid= 146917&tool=pmcentrez&rendertype=abstract.

I Alvarez and J F Wendel. Ribosomal ITS sequences and plant phylogenetic inference. Molecular Phylogenetics and Evolution, 29(3):417–434, 2003. URL http://linkinghub.elsevier.com/retrieve/pii/S1055790303002082.

Jonas J Astrin, Bernhard A Huber, Bernhard Misof, and Cornelya F C Klutsch. Molecular taxonomy in pholcid spiders (Pholcidae, Araneae): evaluation of species identification methods using CO1 and 16S rRNA. Zoologica Scripta, 35(5):441–457, 2006. ISSN 03003256. doi: 10.1111/j.1463-6409. 2006.00239.x. URL http://www.blackwell-synergy.com/doi/abs/10. 1111/j.1463-6409.2006.00239.x.

P G Bagos, T D Liakopoulos, I C Spyropoulos, and S J Hamodrakas. A Hidden Markov Model method, capable of predicting and discriminat- ing beta-barrel outer membrane proteins. BMC Bioinformatics, 5:29, 2004. ISSN 14712105. doi: 10.1186/1471-2105-5-291471-2105-5-29. URL http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db= PubMed&dopt=Citation&list_uids=15070403.

Timothy L Bailey, Mikael Boden, Fabian A Buske, Martin Frith, Charles E Grant, Luca Clementi, Jingyuan Ren, Wilfred W Li, and William S Noble. MEME Suite: tools for motif discovery and searching. Nucleic Acids Re- 107

search, 37(Web Server issue):W202–W208, 2009. URL http://www.ncbi. nlm.nih.gov/pubmed/19458158.

B G Baldwin, M J Sanderson, J M Porter, M F Wojciechowski, C S Campbell, and M J Donoghue. The ITS region of nuclear ribosomal DNA: a valuable source of evidence on angiosperm phylogeny. Annals of the Missouri Botan- ical Garden, 82(2):247–277, 1995. ISSN 00266493. doi: 10.2307/2399880. URL http://www.jstor.org/stable/2399880?origin=crossref.

Markus Bauer, Gunnar W Klau, and Knut Reinert. Accurate multiple sequence-structure alignment of RNA sequences using combinatorial opti- mization. BMC Bioinformatics, 8(1471-2105 (Electronic) LA - ENG PT - JOURNAL ARTICLE):271, 2007. URL http://dx.doi.org/10.1186/ 1471-2105-8-271.

L E Baum, T Petrie, G Soules, and N Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1):164–171, 1970. ISSN 00034851. doi: 10.1214/aoms/1177697196. URL http://www.jstor.org/ stable/2239727.

Dennis a Benson, Ilene Karsch-Mizrachi, Karen Clark, David J Lipman, James Ostell, and Eric W Sayers. GenBank. Nucleic acids research, 40(December 2011):48–53, December 2011. ISSN 1362-4962. doi: 10.1093/nar/gkr1202. URL http://www.ncbi.nlm.nih.gov/pubmed/22144687.

O V Bezzhonova and I I Goryacheva. Intragenomic heterogene- ity of rDNA internal transcribed spacer 2 in Anopheles messeae (Diptera: Culicidae). Journal of Medical Entomology, 45(3):337–341, 2008. URL http://www.ingentaconnect.com/content/esa/jme/2008/ 00000045/00000003/art00002.

J A Bilmes. What HMMs can’t do. In Invited paper and lecture ATR Work- shop. Citeseer, 2004. URL http://citeseerx.ist.psu.edu/viewdoc/ download?doi=10.1.1.66.6329&rep=rep1&type=pdf. 108

Jean-Simon Brouard, Christian Otis, Claude Lemieux, and Monique Turmel. The exceptionally large chloroplast genome of the green alga Floydiella terrestris illuminates the evolutionary history of the Chlorophyceae. Genome biology and evolution, 2:240–256, 2010. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid= 2997540&tool=pmcentrez&rendertype=abstract.

Mark A Buchheim, Eugenia A Michalopulos, and Julie A Buchheim. PHY- LOGENY OF THE CHLOROPHYCEAE WITH SPECIAL REFERENCE TO THE SPHAEROPLEALES: A STUDY OF 18S AND 26S rDNA DATA. Journal of Phycology, 37(5):819–835, 2001. ISSN 00223646. doi: 10.1046/j.1529-8817.2001.00162.x. URL http://doi.wiley.com/10.1046/ j.1529-8817.2001.00162.x.

Mark A Buchheim, Julie A Buchheim, Tracy Carlson, and Paul Kugrens. Phy- logeny of Lobocharacium (Chlorophyceae) and allies: A study of 18S and 26S rDNA data. Journal of Phycology, 38(2):376–383, 2002. ISSN 00223646. URL http://www.refdoc.fr/Detailnotice?idarticle=9623255.

Mark A Buchheim, Andrea E Kirkwood, Julie A Buchheim, Bindhu Verghese, and William J Henley. Hypersaline Soil Supports a Diverse Community of Dunaliella (Chlorophyceae)1. Journal of Phycology, 46(5):1038–1047, 2010. ISSN 00223646. doi: 10.1111/j.1529-8817.2010.00886.x. URL http://doi. wiley.com/10.1111/j.1529-8817.2010.00886.x.

Mark A Buchheim, Alexander Keller, Christian Koetschan, Frank F¨orster, Benjamin Merget, and Matthias Wolf. Internal Transcribed Spacer 2 (nu ITS2 rRNA) Sequence-Structure Phylogenetics: Towards an Automated Re- construction of the Green Algal Tree of Life. PLoS ONE, 6(2):10, 2011a. URL http://dx.plos.org/10.1371/journal.pone.0016931.

Mark A Buchheim, Danica M Sutherland, Tina Schleicher, Frank F¨orster, and Matthias Wolf. Phylogeny of Oedogoniales, Chaetophorales and Chaetopeltidales (Chlorophyceae): inferences from sequence-structure anal- 109

ysis of ITS2. Annals of Botany, 2011b. ISSN 10958290. doi: 10.1093/aob/ mcr275. URL http://www.ncbi.nlm.nih.gov/pubmed/22028463.

Joseph H Camin and Robert R Sokal. A Method for Deducing Branching Sequences in Phylogeny. Evolution, 19(3):311, 1965. ISSN 00143820. doi: 10.2307/2406441. URL http://www.jstor.org/stable/2406441?origin= crossref.

Shilin Chen, Hui Yao, Jianping Han, Chang Liu, Jingyuan Song, Linchun Shi, Yingjie Zhu, Xinye Ma, Ting Gao, Xiaohui Pang, Kun Luo, Ying Li, Xiwen Li, Xiaocheng Jia, Yulin Lin, and Christine Leon. Validation of the ITS2 Region as a Novel DNA Barcode for Identifying Medicinal Plant Species. PLoS ONE, 5(1):8, 2010. URL http://www.ncbi.nlm.nih.gov/pubmed/ 20062805.

J R Cole, Q Wang, E Cardenas, J Fish, B Chai, R J Farris, A S Kulam- Syed-Mohideen, D M McGarrell, T Marsh, G M Garrity, and J M Tiedje. The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Research, 37(Database issue):D141–D145, 2009. URL http://www.ncbi.nlm.nih.gov/pubmed/19004872.

A Coleman. ITS2 is a double-edged tool for evolutionary compar- isons. Trends in Genetics, 19(7):370–375, 2003. ISSN 01689525. doi: 10. 1016/S0168-9525(03)00118-5. URL http://linkinghub.elsevier.com/ retrieve/pii/S0168952503001185.

Charles Darwin. On the Origin of Species, volume 5. John Murray, 1859. ISBN 1551113376. doi: 10.1038/005318a0. URL http://www.ias.ac.in/ resonance/February2009/p204-208.pdf.

M O Dayhoff, R M Schwartz, and B C Orcutt. A model of evolutionary change in proteins. Atlas of protein sequence and structure, 5(Suppl 3):345– 352, 1978. doi: 10.1.1.145.4315. URL http://citeseerx.ist.psu.edu/ viewdoc/summary?doi=10.1.1.145.4315. 110

R Durbin, S Eddy, A Krogh, and G Mitchison. Biological Se- quence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998. ISBN 0521629713. URL http://eisc.univalle.edu.co/cursos/web/material/750068/1/ 6368030-Durbin-Et-Al-Biological-Sequence-Analysis-CUP-2002-No-OCR. pdf.

S R Eddy. Profile hidden Markov models. Bioinformatics, 14(9):755–763, 1998.

Sean R Eddy. A probabilistic model of local sequence alignment that simpli- fies statistical significance estimation. PLoS Comput Biol, 4(5):e1000069, 2008. doi: 10.1371/journal.pcbi.1000069. URL http://dx.doi.org/10. 1371/journal.pcbi.1000069.

Robert C Edgar. MUSCLE: a multiple sequence alignment method with re- duced time and space complexity. BMC Bioinformatics, 5(1):113, 2004a. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid= 517706&tool=pmcentrez&rendertype=abstract.

Robert C Edgar. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5):1792–1797, 2004b. URL http://www.ncbi.nlm.nih.gov/pubmed/15034147.

Scott Federhen. The NCBI Taxonomy database. Nucleic Acids Research, pages 1–8, 2011. ISSN 13624962. doi: 10.1093/nar/gkr1178. URL http: //www.ncbi.nlm.nih.gov/pubmed/22139910.

Gonzalo Nieto Feliner and Josep A Rossello. Better the devil you know? Guidelines for insightful utilization of nrDNA ITS in species-level evolu- tionary studies in plants. Molecular Phylogenetics and Evolution, 44(2): 911–919, 2007. URL http://www.sciencedirect.com/science/article/ B6WNH-4N2D2TM-2/2/b61d728843cc71cc3916f51e9c7562df.

J Felsenstein. Evolutionary trees from DNA sequences: a maximum likelihood 111

approach. Journal of Molecular Evolution, 17(6):368–376, 1981. URL http: //www.ncbi.nlm.nih.gov/pubmed/7288891.

J Felsenstein. Confidence limits on phylogenies: an approach using the boot- strap. Evolution, 39(4):783–791, 1985. ISSN 00143820. doi: 10.2307/ 2408678. URL http://www.jstor.org/stable/2408678.

J Felsenstein. Inferring Phylogenies, volume 266. Sinauer Associates, 2004. ISBN 0878931775. URL http://www.ncbi.nlm.nih.gov/pubmed/ 11675604.

J Felsenstein, J Archie, W H E Day, W Maddinson, C Meacham, F J Rohlf, and D Swofford. The Newick tree format, 1986. URL http://evolution. genetics.washington.edu/phylip/newicktree.html.

Roy Thomas Fielding. Architectural Styles and the Design of Network- based Software Architectures. PhD thesis, University of California, Irvine,

2000. URL http://www.ics.uci.edu/~fielding/pubs/dissertation/ top.htm.

Robert D Finn, Jody Clements, and Sean R Eddy. HMMER web server: interactive sequence similarity searching. Nucleic Acids Research, 39(Web Server issue):W29–W37, 2011. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid= 3125773&tool=pmcentrez&rendertype=abstract.

G D Forney. The viterbi algorithm. Proceedings of the IEEE, 61(3):268– 278, 1973. ISSN 00189219. doi: 10.1109/PROC.1973.9030. URL http: //ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1450960.

Torben Friedrich. New statistical Methods of Genome-Scale Data Analysis in Life Science - Applications to enterobacterial Diagnostics, Meta-Analysis of Arabidopsis thaliana Gene Expression and functional Sequence Annotation. PhD thesis, University of W¨urzburg,2009. 112

Torben Friedrich, Birgit Pils, Thomas Dandekar, J¨orgSchultz, and Tobias M¨uller.Modelling interaction sites in protein domains with interaction pro- file hidden Markov models. Bioinformatics, 22(23):2851–2857, 2006. URL http://www.ncbi.nlm.nih.gov/pubmed/17000753.

Torben Friedrich, Christian Koetschan, and Tobias M¨uller. Optimisation of HMM topologies enhances DNA and protein sequence modelling. Stat Appl Genet Mol Biol, 9(1):Article 6, 2010. doi: 10.2202/1544-6115.1480. URL http://dx.doi.org/10.2202/1544-6115.1480.

Ting Gao, Hui Yao, Jingyuan Song, Chang Liu, Yingjie Zhu, Xinye Ma, Xi- aohui Pang, Hongxi Xu, and Shilin Chen. Identification of medicinal plants in the family Fabaceae using a potential DNA barcode ITS2. Journal of Ethnopharmacology, 130(1):116–121, 2010. URL http://www.ncbi.nlm. nih.gov/pubmed/20435122.

O Gascuel. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Molecular Biology and Evolution, 14(7):685– 95, 1997. ISSN 07374038. URL http://www.ncbi.nlm.nih.gov/pubmed/ 9254330.

S Henikoff and J G Henikoff. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sci- ences of the United States of America, 89(22):10915–10919, 1992. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid= 50453&tool=pmcentrez&rendertype=abstract.

Dominik Hepperle, Eberhard Hegewald, and L Krienitz. PHYLOGENETIC POSITION OF THE ( CHLOROPHYTA ) 1 The com- plete 18S rRNA gene sequences of three Oocystis A . Braun species ( Oocystaceae ) and three other chlorococcal algae , Tetrachlorella alter- nans ( G . M . Smith ) Kor  Okada ( Scenedesmacea. Direct, 595(3): 590–595, 2000. URL http://onlinelibrary.wiley.com/doi/10.1046/j. 1529-8817.2000.99184.x/full. 113

Ian Hickson. Web Storage, 2011. URL http://dev.w3.org/html5/ webstorage/.

Masami Ikeda, Masafumi Arai, Toshikatsu Okuno, and Toshio Shimizu. TM- PDB: a database of experimentally-characterized transmembrane topolo- gies. Nucleic Acids Research, 31(1):406–409, 2003. URL http://www.nar. oupjournals.org/cgi/doi/10.1093/nar/gkg018.

Norman L Johnson, Chapel Hill, and North Carolina. Univariate Discrete Distributions Univariate Discrete Distributions ( 3rd ed. ), by Norman L. Johnson , Adrienne W. Kemp , and Samuel Kotz , Hoboken, NJ : Wiley , 2005 , ISBN 0-471-27246-9 , xix + 646 pp., 125.00 . Technometrics, 48(3): 450–450, 2006. ISSN 00401706. doi: 10.1198/tech.2006.s421. URL http: //pubs.amstat.org/doi/abs/10.1198/tech.2006.s421.

B H Juang and L R Rabiner. Hidden Markov Models for Speech Recognition. Technometrics, 33(3):251, 1991. ISSN 00401706. doi: 10.2307/1268779. URL http://www.jstor.org/stable/1268779?origin=crossref.

T H Jukes and C R Cantor. Evolution of protein molecules. In H N Munro, editor, Mammalian Protein Metabolism, volume 3 of Mammalian protein metabolism, chapter 24, pages 21–132. Academic Press, 1969. URL http: //www.citeulike.org/group/1390/article/768582.

Agnieszka S Juncker, Hanni Willenbrock, Gunnar Von Heijne, Sø ren Brunak, Henrik Nielsen, and Anders Krogh. Prediction of lipoprotein signal pep- tides in Gram-negative bacteria. Protein Science, 12(8):1652–1662, 2003. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid= 2323952&tool=pmcentrez&rendertype=abstract.

Robel Y Kahsay, Guang Gao, and Li Liao. An improved hidden Markov model for transmembrane protein detection and topology prediction and its applications to complete genomes. Bioinformatics, 21(9):1853–1858, 2005. URL http://www.ncbi.nlm.nih.gov/pubmed/15691854. 114

Lukas K¨all,Anders Krogh, and Erik L L Sonnhammer. A combined transmem- brane topology and signal peptide prediction method. Journal of Molecular Biology, 338(5):1027–1036, 2004. URL http://www.ncbi.nlm.nih.gov/ pubmed/15111065.

Alexander Keller, Tina Schleicher, Frank F¨orster, Benjamin Ruderisch, Thomas Dandekar, Tobias M¨uller,and Matthias Wolf. ITS2 data cor- roborate a monophyletic chlorophycean DO-group (Sphaeropleales). BMC Evolutionary Biology, 8:218, 2008. URL http://www.ncbi.nlm.nih.gov/ pubmed/18655698.

Alexander Keller, Tina Schleicher, J¨orgSchultz, Tobias M¨uller, Thomas Dan- dekar, and Matthias Wolf. 5.8S-28S rRNA interaction and HMM-based ITS2 annotation. Gene, 430(1-2):50–7, 2009. ISSN 18790038. doi: 10.1016/j.gene. 2008.10.012. URL http://www.ncbi.nlm.nih.gov/pubmed/19026726.

Alexander Keller, Frank F¨orster, Tobias M¨uller, Thomas Dan- dekar, J¨org Schultz, and Matthias Wolf. Including RNA sec- ondary structures improves accuracy and robustness in reconstruc- tion of phylogenetic trees. Biology Direct, 5(1):4, 2010. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid= 2821295&tool=pmcentrez&rendertype=abstract.

Marcelo V Kitahara, Stephen D Cairns, Jaros law Stolarski, David Blair, and David J Miller. A Comprehensive Phylogenetic Anal- ysis of the (Cnidaria, Anthozoa) Based on Mito- chondrial CO1 Sequence Data. PLoS ONE, 5(7):9, 2010. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid= 2900217&tool=pmcentrez&rendertype=abstract.

Christian Koetschan. Enhanced length modelling methods for Hidden Markov Models. PhD thesis, University of W¨urzburg,2008.

Christian Koetschan, Frank F¨orster,Alexander Keller, Tina Schleicher, Ben- jamin Ruderisch, Roland Schwarz, Tobias M¨uller, Matthias Wolf, and 115

J¨orgSchultz. The ITS2 Database III—sequences and structures for phy- logeny. Nucleic Acids Research, 38(Database issue):D275–D279, 2010. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid= 2808966&tool=pmcentrez&rendertype=abstract.

Christian Koetschan, Thomas Hackl, Tobias M¨uller, Matthias Wolf, Frank F¨orster,and J¨orgSchultz. ITS2 database IV: Interactive taxon sampling for internal transcribed spacer 2 based phylogenies. Molecular phylogenetics and evolution, February 2012. ISSN 1095-9513. doi: 10.1016/j.ympev.2012. 01.026. URL http://www.ncbi.nlm.nih.gov/pubmed/22366368.

A Krogh, B Larsson, G Von Heijne, and E L Sonnhammer. Predicting trans- membrane protein topology with a hidden Markov model: application to complete genomes. Journal of Molecular Biology, 305(3):567–580, 2001. URL http://www.ncbi.nlm.nih.gov/pubmed/11152613.

Tamara Kulikova, Ruth Akhtar, Philippe Aldebert, Nicola Althorpe, Mikael Andersson, Alastair Baldwin, Kirsty Bates, Sumit Bhattacharyya, Lawrence Bower, Paul Browne, Matias Castro, Guy Cochrane, Karyn Duggan, Ruth Eberhardt, Nadeem Faruque, Gemma Hoad, Carola Kanz, Charles Lee, Rasko Leinonen, Quan Lin, Vincent Lombard, Rodrigo Lopez, Dar- iusz Lorenc, Hamish McWilliam, Gaurab Mukherjee, Francesco Nardone, Maria Pilar Garcia Pastor, Sheila Plaister, Siamak Sobhany, Peter Stoehr, Robert Vaughan, Dan Wu, Weimin Zhu, and Rolf Apweiler. EMBL Nucleotide Sequence Database in 2006. Nucleic Acids Research, 35 (Database issue):D16–20, 2007. ISSN 13624962. doi: 10.1093/nar/gkl913. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid= 1897316&tool=pmcentrez&rendertype=abstract.

M A Larkin, G Blackshields, N P Brown, R Chenna, P A McGettigan, H McWilliam, F Valentin, I M Wallace, A Wilm, R Lopez, J D Thomp- son, T J Gibson, and D G Higgins. Clustal W and Clustal X version 2.0. 116

Bioinformatics, 23(21):2947–2948, 2007. URL http://www.ncbi.nlm.nih. gov/pubmed/17846036.

Honglak Lee and Andrew Y Ng. Spam Deobfuscation using a Hidden Markov Model. English, 2005. URL http://citeseerx.ist.psu.edu/viewdoc/ download?doi=10.1.1.79.5218&rep=rep1&type=pdf.

Louise A Lewis and Valerie R Flechtner. Cryptic Species of Scenedesmus (Chlorophyta) From Desert Soil Communities of Western North America. Journal of Phycology, 40(6):1127–1137, 2004. ISSN 00223646. doi: 10. 1111/j.1529-8817.2004.03235.x. URL http://doi.wiley.com/10.1111/j. 1529-8817.2004.03235.x.

De-Zhu Li, Lian-Ming Gao, Hong-Tao Li, Hong Wang, Xue-Jun Ge, Jian-Quan Liu, Zhi-Duan Chen, Shi-Liang Zhou, Shi-Lin Chen, Jun-Bo Yang, Cheng- Xin Fu, Chun-Xia Zeng, Hai-Fei Yan, Ying-Jie Zhu, Yong-Shuai Sun, Si-Yun Chen, Lei Zhao, Kun Wang, Tuo Yang, and Guang-Wen Duan. Comparative analysis of a large dataset indicates that internal transcribed spacer (ITS) should be incorporated into the core barcode for seed plants. Proceedings of the National Academy of Sciences of the United States of America, pages 1104551108–, 2011. ISSN 10916490. doi: 10.1073/pnas.1104551108. URL http://www.pnas.org/cgi/content/abstract/1104551108v1.

Qi Liu, Yi-Sheng Zhu, Bao-Hua Wang, and Yi-Xue Li. A HMM-based method to predict the transmembrane regions of beta-barrel membrane proteins. Computational Biology and Chemistry, 27(1):69–76, 2003. URL http:// www.ncbi.nlm.nih.gov/pubmed/12798041.

Ciar N J Loughnane, Lynne M McIvor, Fabio Rindi, Dagmar B Sten- gel, and Michael D Guiry. Morphology, rbcL phylogeny and dis- tribution of distromatic Ulva (Ulvophyceae, Chlorophyta) in Ireland and southern Britain. Phycologia, 47(4):416–429, 2008. doi: 10. 2216/07-61.1. URL http://www.phycologia.org/perlserv/?request= get-abstract&doi=10.2216/PH07-61.1. 117

A V Lukashin and M Borodovsky. GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res, 26(4):1107–1115, February 1998.

J C Mai and A W Coleman. The internal transcribed spacer 2 exhibits a common secondary structure in green algae and flowering plants. Journal of Molecular Evolution, 44(3):258–271, 1997. URL http://www.ncbi.nlm. nih.gov/pubmed/9060392.

Marco Mangone, Philip MacMenamin, Charles Zegar, Fabio Piano, and Kristin C Gunsalus. UTRome.org: a platform for 3’UTR biology in C. elegans. Nucleic Acids Research, 36(Database issue):D57–D62, 2008. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid= 2238901&tool=pmcentrez&rendertype=abstract.

S M Markert, T M¨uller,C Koetschan, T Friedl, and M Wolf. ‘Y’ Scenedesmus (Chlorophyta, Chlorophyceae): the internal transcribed spacer 2 rRNA sec- ondary structure re-revisited. Plant biology, 2012. doi: 10.1111/j.1438-8677. 2012.00576.x.

Nicholas R Markham and Michael Zuker. UNAFold: software for nucleic acid folding and hybridization. Methods In Molecular Biology Clifton Nj, 453(1):3–31, 2008. URL http://www.springerlink.com/index/10.1007/ 978-1-60327-429-6.

Pier Luigi Martelli, Piero Fariselli, Anders Krogh, and Rita Casadio. A sequence-profile-based HMM for predicting and discriminating beta barrel membrane proteins. Bioinformatics, 18 Suppl 1(NIL):S46–S53, 2002. URL http://www.ncbi.nlm.nih.gov/pubmed/12169530.

Ernst Mayr. Systematics and the origin of species from the viewpoint of a zo- ologist, volume 13 of Columbia biological series ... No. XIII. Columbia Uni- versity Press, 1942. ISBN 0674862503. URL http://www.amazon.ca/exec/ obidos/redirect?tag=citeulike09-20&path=ASIN/0674862503. 118

Nikunj Mehta, Jonas Sicking, Eliot Graff, Andrei Popescu, and Orlow Jeremy. Indexed Database API, 2011. URL http://www.w3.org/TR/IndexedDB/.

Benjamin Merget, Christian Koetschan, Thomas Hackl, Frank F¨orster, Thomas Dandekar, Tobias M¨uller, J¨org Schultz, and Matthias Wolf. The ITS2 Database. Journal of Visualized Experiments, 61:e3806, 2012. doi: 10.3791/3806. URL http://www.jove.com/video/3806/ the-its2-database.

M´onicaB J Moniz and Irena Kaczmarska. Barcoding diatoms: Is there a good marker? Molecular ecology resources, 9 Suppl s1(s1):65–74, 2009. doi: 10.1111/j.1755-0998.2009.02633.x. URL http://www.ncbi.nlm.nih.gov/ pubmed/21564966.

M´onicaB J Moniz and Irena Kaczmarska. Barcoding of diatoms: nuclear encoded ITS revisited. Protist, 161(1):7–34, 2010. URL http://www.ncbi. nlm.nih.gov/pubmed/19674931.

T M¨ullerand M Vingron. Modeling amino acid replacement. Journal of computational biology a journal of computational molecular biology, 7(6): 761–776, 2000. URL http://www.ncbi.nlm.nih.gov/pubmed/11382360.

Tobias M¨uller. Modellierung von Proteinevolution. PhD thesis, 2001.

Tobias M¨uller,Rainer Spang, and Martin Vingron. Estimating amino acid substitution models: a comparison of Dayhoff’s estimator, the resolvent ap- proach and a maximum likelihood method. Molecular Biology and Evolution, 19(1):8–13, 2002. URL http://www.ncbi.nlm.nih.gov/pubmed/11752185.

Tobias M¨uller,Nicole Philippi, Thomas Dandekar, J¨orgSchultz, and Matthias Wolf. Distinguishing species. Rna New York Ny, 13(9):1469–1472, 2007. URL http://www.ncbi.nlm.nih.gov/pubmed/17652131.

Ahmed Ragab Nabhan and Indra Neil Sarkar. The impact of taxon sam- pling on phylogenetic inference: a review of two decades of controversy. Briefings in Bioinformatics, pages bbr014–, 2011. ISSN 14774054. doi: 119

10.1093/bib/bbr014. URL http://bib.oxfordjournals.org/content/ early/2011/03/23/bib.bbr014.abstract.

Takashi Nakada, Kazuharu Misawa, and Hisayoshi Nozaki. Molecular system- atics of Volvocales (Chlorophyceae, Chlorophyta) based on exhaustive 18S rRNA phylogenetic analyses. Molecular Phylogenetics and Evolution, 48 (1):281–291, 2008. URL http://www.sciencedirect.com/science?_ob= ArticleURL&_udi=B6WNH-4S3G3X7-1&_user=10&_rdoc=1&_fmt=&_orig= search&_sort=d&_docanchor=&view=c&_searchStrId=1081008629&_ rerunOrigin=scholar.google&_acct=C000050221&_version=1&_ urlVersion=0&_userid=10&md5=e16db5ad61e3a7dfcb93791912e903fb.

Ncbi. FASTA format description, 2007. URL http://www.ncbi.nlm.nih. gov/blast/fasta.shtml.

Daniel L Nickrent, Kevin P Schuette, and Ellen M Starr. A molecular phy- logeny of IArceuthobium (Viscaceae) based on nuclear ribosomal DNA in- ternal transcribed spacer sequences. American Journal of Botany, 81 IS - (9):1149–1160, 1994. URL http://www.jstor.org/stable/2445477.

H Nozaki, K Misawa, T Kajita, M Kato, S Nohara, and M M Watanabe. Origin and evolution of the colonial volvocales (Chlorophyceae) as inferred from multiple, chloroplast gene sequences. Molecular phylogenetics and evolution, 17(2):256–68, November 2000. ISSN 1055-7903. doi: 10.1006/mpev.2000. 0831. URL http://www.ncbi.nlm.nih.gov/pubmed/11083939.

Hisayoshi Nozaki, Osami Misumi, and Tsuneyoshi Kuroiwa. Phylogeny of the quadriflagellate Volvocales (Chlorophyceae) based on chloroplast multigene sequences. Molecular Phylogenetics and Evolution, 29(1):58–66, 2003. URL http://linkinghub.elsevier.com/retrieve/pii/S1055790303000897.

R D M Page. TreeView: An application to display phylogenetic trees on personal computers. Computer applications in the biosciences CABIOS, 12 (4):357–358, 1996. URL http://eprints.gla.ac.uk/394/. 120

Karl Pearson. on the Systematic Fitting of Curves To Observations and Measurements. Biometrika, 1(3):265–303, 1902. ISSN 00063444. doi: 10.2307/2331540. URL http://biomet.oxfordjournals.org/cgi/doi/ 10.1093/biomet/1.3.265.

T O Powers, T C Todd, A M Burnell, P C B Murray, C C Flem- ing, A L Szalanski, B A Adams, and T S Harris. The rDNA Internal Transcribed Spacer Region as a Taxonomic Marker for Nematodes. Journal of Nematology, 29(4):441–450, 1997. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid= 2619808&tool=pmcentrez&rendertype=abstract.

Elmar Pruesse, Christian Quast, Katrin Knittel, Bernhard M Fuchs, Wolfgang Ludwig, J¨orgPeplies, and Frank Oliver Gl¨ockner. SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Research, 35(21):7188–7196, 2007. URL http://www.ncbi.nlm.nih.gov/pubmed/17947321.

L R Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. 77(2):257–286, 1989. doi: 10.1109/5.18626.

Lorenzo Rossi, Jacob Chakareski, Pascal Frossard, and Stefania Colonnese. Proceedings of 2010 IEEE 17th International Conference on Image Process- ing A NON-STATIONARY HIDDEN MARKOV MODEL OF MULTIVIEW VIDEO TRAFFIC. Signal Processing, pages 2921–2924, 2010.

N Saitou and M Nei. The neighbor-joining method: a new method for recon- structing phylogenetic trees. Molecular Biology and Evolution, 4(4):406–425, 1987. URL http://www.ncbi.nlm.nih.gov/pubmed/18343690.

Tao Sang. Utility of low-copy nuclear gene sequences in plant phylogenet- ics. Critical Reviews in Biochemistry and Molecular Biology, 37(3):121–147, 2002. URL http://www.ncbi.nlm.nih.gov/pubmed/12139440. 121

Chodon Sass, Damon P Little, Dennis Wm Stevenson, and Chelsea D Specht. DNA Barcoding in the Cycadales: Testing the Potential of Proposed Bar- coding Markers for Species Identification of Cycads. PLoS ONE, 2(11):9, 2007. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi? artid=2063462&tool=pmcentrez&rendertype=abstract.

Klaus Peter Schliep. phangorn: phylogenetic analysis in R. Bioinformat- ics, 27(4):592–593, 2011. URL http://www.ncbi.nlm.nih.gov/pubmed/ 21169378.

Gisbert Schneider and Uli Fechner. Advances in the prediction of protein targeting signals. Proteomics, 4(6):1571–1580, 2004. URL http://www. ncbi.nlm.nih.gov/pubmed/15174127.

J¨orgSchultz and Matthias Wolf. ITS2 Sequence-Structure Analysis in Phy- logenetics: A How-to Manual for Molecular Systematics. Molecular Phy- logenetics and Evolution, 52(2):520–523, 2009. ISSN 10959513. doi: 10. 1016/j.ympev.2009.01.008. URL http://www.ncbi.nlm.nih.gov/pubmed/ 19640446.

J¨orgSchultz, Stefanie Maisel, Daniel Gerlach, Tobias M¨uller,and Matthias Wolf. A common core of secondary structure of the internal transcribed spacer 2 (ITS2) throughout the Eukaryota. Rna New York Ny, 11(4):361– 364, 2005. URL http://www.pubmedcentral.nih.gov/articlerender. fcgi?artid=1370725&tool=pmcentrez&rendertype=abstract.

J¨orgSchultz, Tobias M¨uller,Marco Achtziger, Philipp N Seibel, Thomas Dan- dekar, and Matthias Wolf. The internal transcribed spacer 2 database—a web server for (not only) low level phylogenetic analyses. Nucleic Acids Research, 34(Web Server issue):W704–W707, 2006. URL http://dx.doi. org/10.1093/nar/gkl129.

G Schwarz. Estimating the dimension of a model. Annals of Statistics, 6 (2):461–464, 1978. ISSN 00905364. doi: 10.2307/2958889. URL http: //www.jstor.org/stable/2958889. 122

Philipp N Seibel, Tobias M¨uller, Thomas Dandekar, J¨org Schultz, and Matthias Wolf. 4SALE – A tool for synchronous RNA sequence and sec- ondary structure alignment and editing. BMC Bioinformatics, 7(1):498, 2006. URL http://www.ncbi.nlm.nih.gov/pubmed/17101042.

Philipp N Seibel, Tobias M¨uller, Thomas Dandekar, and Matthias Wolf. Synchronous visual analysis and editing of RNA sequence and secondary structure alignments using 4SALE. BMC research notes, 1:91, 2008. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid= 2587473&tool=pmcentrez&rendertype=abstract.

Christian Selig, Matthias Wolf, Tobias M¨uller,Thomas Dandekar, and J¨org Schultz. The ITS2 Database II: homology modelling RNA structure for molecular systematics. Nucleic Acids Research, 36(Database issue):D377– D380, 2008. URL http://dx.doi.org/10.1093/nar/gkm827.

Sven Siebert and Rolf Backofen. MARNA: multiple alignment and consen- sus structure prediction of RNAs based on sequence structure comparisons. Bioinformatics, 21(16):3352–3359, 2005. URL http://www.ncbi.nlm.nih. gov/pubmed/15972285.

Andrew D Smith, Thomas W H Lui, and Elisabeth R M Tillier. Empirical models for substitution in ribosomal RNA. Molecular Biology and Evolution, 21(3):419–427, 2004. URL http://dx.doi.org/10.1093/molbev/msh029.

M Alex Smith, Nikolai A Poyarkov, and Paul D N Hebert. DNA BAR- CODING: CO1 DNA barcoding amphibians: take the chance, meet the challenge. Molecular ecology resources, 8(2):235–246, 2008. doi: 10.1111/j. 1471-8286.2007.01964.x. URL http://www.blackwell-synergy.com/doi/ abs/10.1111/j.1471-8286.2007.01964.x.

E L Sonnhammer, G Von Heijne, and A Krogh. A hidden Markov model for predicting transmembrane helices in protein sequences. Proceedings Interna- tional Conference on Intelligent Systems for Molecular Biology ISMB Inter- 123

national Conference on Intelligent Systems for Molecular Biology, 6(NIL): 175–182, 1998. URL http://www.ncbi.nlm.nih.gov/pubmed/9783223.

J E Stajich, D Block, K Boulez, S E Brenner, S A Chervitz, C Dagdigian, G Fu- ellen, J G Gilbert, I Korf, H Lapp, H Lehvaslaiho, C Matsalla, C J Mungall, B I Osborne, M R Pocock, P Schattner, M Senger, L D Stein, E Stupka, M D Wilkinson, and E Birney. The Bioperl toolkit: Perl modules for the life sciences. Genome Research, 12(10):1611–1618, 2002. ISSN 10889051. doi: 10.1101/gr.361602.1. URL http://dx.doi.org/10.1101/gr.361602.

Y B Suh, L B Thien, H E Reeve, and E A D A S E P Zimmer. MOLECULAR EVOLUTION AND PHYLOGENETIC IMPLICATIONS OF INTERNAL TRANSCRIBED SPACER SEQUENCES OF RIBOSOMAL DNA IN WIN- TERACEAE. American Journal of Botany, 80(9):1042–1055 ST – MOLEC- ULAR EVOLUTION AND PHYLOGENET, 1993. ISSN 00029122. URL http://www.jstor.org/stable/2445752.

Y Tateno, T Imanishi, S Miyazaki, K Fukami-Kobayashi, N Saitou, H Sug- awara, and T Gojobori. DNA Data Bank of Japan (DDBJ) for genome scale research in life science. Nucleic Acids Research, 30(1):27–30, 2002. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid= 99140&tool=pmcentrez&rendertype=abstract.

S Tavar´e.Some probabilistic and statistical problems in the analysis of DNA sequences. Lectures on Mathematics in the Life Sciences, 17(17):57–86, 1986. URL http://books.google.com/books?hl=en&lr=&id= 8aI1phhOKhgC&oi=fnd&pg=PA57&dq=Some+probabilistic+ and+statistical+problems+in+the+analysis+of+DNA+sequences& ots=rlLEaJIcPl&sig=ScWJwRxj6YEd85wxsVoMSq7XNY8.

Maximilian J Telford, Michael J Wise, and Vivek Gowri-Shankar. Consid- eration of RNA secondary structure significantly improves likelihood-based estimates of phylogeny: examples from the bilateria. Molecular Biology 124

and Evolution, 22(4):1129–1136, 2005. URL http://discovery.ucl.ac. uk/10796/.

JD Thompson, DG Higgins, and TJ Gibson. ClustalW: improving the sensitiv- ity of progressive multiple sequence alignment through sequence weighting, posistion-specific gap penalites and weight matrix choice. Nucleic Acids Research, 22:4673*4680, 1994.

Monique Turmel, Jean-Simon Brouard, C´edricGagnon, Christian Otis, and Claude Lemieux. Deep Division in the Chlorophyceae (Chlorophyta) Revealed By Chloroplast Phylogenomic Analyses. Journal of Phycol- ogy, 44(3):739–750, 2008. ISSN 00223646. doi: 10.1111/j.1529-8817. 2008.00510.x. URL http://blackwell-synergy.com/doi/abs/10.1111/ j.1529-8817.2008.00510.x.

G E Tusn´adyand I Simon. Principles governing amino acid composition of integral membrane proteins: application to topology prediction. Journal of Molecular Biology, 283(2):489–506, 1998. URL http://www.ncbi.nlm. nih.gov/pubmed/9769220.

A Viterbi. Error bounds for convolutional codes and an asymptotically op- timum decoding algorithm. 13(2):260–269, 1967. doi: 10.1109/TIT.1967. 1054010.

D-M Wang and Y-J Yao. Intrastrain internal transcribed spacer heterogeneity in Ganoderma species. Canadian Journal of Microbiology, 51(2):113–121, 2005. URL http://www.ncbi.nlm.nih.gov/pubmed/16091769.

Norman Wang, Alison R Sherwood, Akira Kurihara, Kimberly Y Conklin, Thomas Sauvage, and Gernot G Presting. The Hawaiian Algal Database: a laboratory LIMS and online resource for biodiversity data. BMC Plant Bi- ology, 9(1):117, 2009. URL http://www.biomedcentral.com/1471-2229/ 9/117. 125

J D Watson and F H Crick. Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature, 171(4356):737–738, 1953. URL http: //www.ncbi.nlm.nih.gov/pubmed/17804965.

Lloyd R Welch. Hidden Markov Models and the Baum-Welch Algorithm. IEEE Information Theory Society Newsletter, 53(4):1,10–13, 2003. URL http://www.itsoc.org/publications/nltr/it_dec_03final.pdf.

Alan S Willsky. Multiresolution Markov models for signal and image process- ing. Electrical Engineering, 90(8):1396–1458, 2002. ISSN 00189219. doi: 10.1109/JPROC.2002.800717. URL http://ieeexplore.ieee.org/xpls/ abs_all.jsp?arnumber=1037568.

M Wolf, M Buchheim, E Hegewald, L Krienitz, and D Hepperle. Phy- logenetic position of the Sphaeropleaceae (Chlorophyta). Plant System- atics and Evolution, 230(3-4):161–171, 2002. ISSN 03782697. doi: 10. 1007/s006060200002. URL http://www.springerlink.com/openurl.asp? genre=article&id=doi:10.1007/s006060200002.

Matthias Wolf and Jorg Schultz. ITS Better Than Its Reputation. Science E- Letter, 2009. URL http://www.sciencemag.org/content/325/5941/682. full/reply#sci_el_12692.

Matthias Wolf, Marco Achtziger, J¨orgSchultz, Thomas Dandekar, and To- bias M¨uller. Homology modeling revealed more than 20,000 rRNA in- ternal transcribed spacer 2 (ITS2) secondary structures. RNA, 11(11): 1616–1623, 2005a. ISSN 13558382. doi: 10.1261/rna.2144205. URL http://dx.doi.org/10.1261/rna.2144205.

Matthias Wolf, Joachim Friedrich, Thomas Dandekar, and Tobias M¨uller. CBCAnalyzer: inferring phylogenies based on compensatory base changes in RNA secondary structures. In Silico Biology, 5(3):291–294, 2005b. URL http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd= Retrieve&db=PubMed&dopt=Citation&list_uids=15996120. 126

Matthias Wolf, Benjamin Ruderisch, Thomas Dandekar, J¨orgSchultz, and Tobias M¨uller. ProfDistS: (profile-) distance based phylogeny on sequence– structure alignments. Bioinformatics, 24(20):2401–2402, 2008. ISSN 13674811. doi: 10.1093/bioinformatics/btn453. URL http://www.ncbi. nlm.nih.gov/pubmed/18723521.

Hui Yao, Jingyuan Song, Chang Liu, Kun Luo, Jianping Han, Ying Li, Xiao- hui Pang, Hongxi Xu, Yingjie Zhu, Peigen Xiao, and Shilin Chen. Use of ITS2 Region as the Universal DNA Barcode for Plants and Animals. PLoS ONE, 5(10):9, 2010. URL http://dx.plos.org/10.1371/journal.pone. 0013102.

Frederick W Zechman. PHYLOGENY OF THE DASYCLADALES ( CHLOROPHYTA , ULVOPHYCEAE ) BASED ON ANALYSES OF RU- BISCO LARGE SUBUNIT ( rbc L ) GENE SEQUENCES. Journal of Phycology, 39(4):819–827, 2003. ISSN 00223646. doi: 10.1046/j.1529-8817. 2003.02183.x. URL http://doi.wiley.com/10.1046/j.1529-8817.2003. 02183.x.

Zemin Zhang and William I Wood. A profile hidden Markov model for signal peptides generated by HMMER. Bioinformatics, 19(2):307–308, 2003. URL http://www.ncbi.nlm.nih.gov/pubmed/12538263. List of Figures 127

12 List of Figures

1 WB: Basic layout ...... 21 2 Graphical illustration of HMM length modelling topologies . . . 28 3 Length distribution and ML/MP estimations on biological data 32 4 HMM modelling ITS2 secondary structure ...... 33 5 Heat map of ITS2 structural regions resulting from p and rp optimizations ...... 35 6 Sketch of score matrix estimation ...... 42 7 Multidimensional scaling of different tree reconstruction meth- ods and alphabets ...... 46 8 Chlorophyceaean tree calculated by MP on ’RNA10’ ...... 47 9 Chlorophyceaean tree calculated by MP on ’RNA16’ ...... 49 10 Flowchart explaining major steps of sequence-structure data generation ...... 53 11 Homology Modelling process ...... 57 12 Classical and AJAX-based Web application model ...... 63 13 XHR data request logged in Google Chrome ...... 65 14 Important files and folders of the frontend of the ITS2 workbench 67 15 AJAX data requests in combination with a Catalyst backend . . 69 16 Database Schema of the ITS2 database...... 73 17 Database Schema of the ITS2 admin database...... 74 18 Important files and folders of the backend of the ITS2 workbench 75 19 The ITS2 workbench install script ...... 76 20 WB: Taxon sampling process ...... 78 21 WB: Basic layout ...... 79 22 WB: Sequence search and adding sequences to the pool . . . . . 81 23 WB: Creating a dataset ...... 82 24 WB: ITS2 sequence annotation ...... 83 25 WB: Secondary structure prediction (Predict) ...... 84 26 WB: Homology Modelling of structures (Model) ...... 86 128

27 WB: Managing a dataset ...... 87 28 WB: Sequence-structure view of the pool ...... 88 29 WB: Analysing a dataset ...... 89 30 WB: Sequence-structure alignment ...... 90 31 WB: Neighbour Joining tree calculation ...... 91 32 WB: Motif search ...... 92 33 WB: Sequence-structure BLAST search ...... 94 34 WB: Admin interface ...... 95 35 Phylogenetic tree of phylum Chlorophyta ...... 100 36 Movie about the ITS2 workbench ...... 102 37 Phylogenetic tree of class Chlorophyceae ...... 132 38 Phylogenetic tree of class Trebouxiophyceae ...... 133 39 Phylogenetic tree of class Ulvophyceae ...... 134 40 Sequence and sequence-structure score matrices on species datasets137 41 Sequence and sequence-structure score matrices on genus datasets138 List of Tables 129

13 List of Tables

1 Properties of p,rp and srp macro states ...... 30 2 Comparison of ML and MP macro state estimates ...... 34 3 Relevant GenBank databases for sequence-structure generation . 52 4 Annotation types of GenBank flat files ...... 54 5 Marker specific search terms on NCBI ...... 54 6 Public JavaScript frameworks in 2009 ...... 64 7 PL/pgSQL function of the ITS2 database ...... 139 8 Views of the ITS2 database ...... 140 9 PL/pgSQL aggregate function of the ITS2 database ...... 140 10 Operators of the ITS2 database ...... 140 11 Triggers of the ITS2 database ...... 140 12 Folder overview for the ITS2 data generation ...... 141 13 Scripts overview for the ITS2 data generation ...... 141 14 Modules overview for the ITS2 data generation ...... 141 15 JavaScript files of the ITS2 workbench ...... 142 16 Template of the ITS2 workbench ...... 143 17 Controller of the ITS2 workbench ...... 144 18 Modules of the ITS2 workbench ...... 145 List of Abbreviations 130

14 List of Abbreviations

A...... Adenine AIC ...... Akaike’s Information Criterion AJAX ...... Asynchronous JavaScript and XML BLAST ...... Basic Local Alignment Search Tool C...... Cytosine CD ...... Compact Disc CGI ...... Common Gateway Interface CRUD ...... Create Read Update Delete CSS ...... Cascading Style Sheets DB ...... Database DOM ...... Document Object Model DRY ...... Don’t Repeat Yourself FCGI ...... Fast Common Gateway Interface FTP ...... File Transfer Protocol G...... Guanine GB ...... Gigabyte GI ...... GenInfo Identifier GTR ...... General Time Reversible GUI ...... Graphical User Interface HMM ...... Hidden Markov Model HTML ...... Hypertext Markup Language HTTP ...... Hypertext Transfer Protocol ITS1 ...... Internal Transcribed Spacer 1 ITS2 ...... Internal Transcribed Spacer 2 IUPAC ...... International Union of Pure and Applied Chemistry JC ...... Juces Cantor JS ...... JavaScript JSON ...... JavaScript Object Notation LSU ...... Large Subunit List of Abbreviations 131

MDS ...... Multidimensional Scaling ML ...... Maximum Likelihood MS ...... Macro State MSSA ...... Muliple Sequence Structure Alignment MVC ...... Model View Controller NCBI ...... National Center for Biotechnology Information NJ ...... Neighbor Joining PAM ...... Percent of Accepted Mutations PDF ...... Probability Density Function PEM ...... Percent of Expected Mutations Perl ...... Practical Extraction and Report Language PL/pgSQL . . . Procedural Language/PostgreSQL Structured Query Language PNJ ...... Profile Neighbor Joining REST ...... Representational State Transfer RNA ...... Ribonucleic Acid ROC ...... Receiver Operating Characteristics rRNA ...... ribosomal Ribonucleic Acid RSS ...... Really Simple Syndication SOAP ...... originally: Simple Object Access Protocol SQL ...... Structured Query Language SSU ...... Small Subunit SVN ...... Subversion T...... Thymine TT ...... Template Toolkit U...... Uracil XML ...... Extensible Markup Language XSD ...... XML Schema Definition YUI ...... Yahoo User Interface Annex 132

15 Annex

15.A Phylogenetic tree of class Chlorophyceae

Sphaeropleales

‚Scenedesmus‘ ‚Desmodesmus‘

138754279 Desmodesmus lunatus

89515342 Desmodesmus sp.

89515391 Desmodesmus sp.

46241852 Scenedesmus bajacalifornicus 12055733 Scenedesmus arcuatus var

1 89515392 Desmodesmus sp. 89515379 Desmodesmus sp. 89515393 Desmodesmus sp.

1

3 89515378 Desmodesmus sp. 1

1387 3 1387542

8

3 138754247 Desmodesmus costato-granulatus

8 89515374 Desmodesmus sp. s

7

89515373 Desmodesmus sp. 8 89515372 Desmodesmus sp. 89515359 Desmodesmus sp. 7

5

7

5

4

89515356 Desmodesmus89515358 sp. Desmodesmus sp. 5

89515351 Desmodesmus sp. NDem9/21T 5 4

2 nicu

4

4 2

‚Sphaeropleaceae‘ 1 2 W

2 0

d

3

3 2

d

89515368 Scenedesmus sp. NDem8/18T 4 9 33 Desmodesm

1

-

7 - 6 s

D 1

T

-

D

u

S

D e

3 t D

S

e

/

4

s

a

e

8

e s 6

2

m

/ s

1

m s s m

/

m 190151812 Scenedesmus sp. BR8a 2 138754274 Desmodesmus elegans o m

a 4972020 Desmodesmus serratus a

8

o

t

138754187 Desmodesmus elegans 138754269 Desmodesmus elegans s

d

I u 138754264 Desmodesmus elegans o 138754193 Desmodesmus elegans s o d

138754253 Desmodesmus138754183 elegans Desmodesmus138754259 elegans Desmodesmus elegans a e

. s 89515361 Scenedesmus sp. q t d

a

d T e

I

190151813 Scenedesmus sp. BR8b t

s 138754200 Desmodesmus elegans p s u 89515363 Desmodesmus sp. e

I s e ow8/18P-14W s 240148061 sp. EM-2009 t

.

89515360 Scenedesmus sp. i 89515362 Desmodesmus sp. m

s a 240382685 Scenedesmaceae sp. EM-2009 T s

.

m r s a r 4972021 Desmodesmus denticulatus p

ow6/16T m t

p 89515347 Desmodesmus sp. m u s a

l s

l 89515346 Desmodesmus sp. u

m s b

s u

u u

u u u s s a olisicus

s s

c g s m s s 46241851 Scenedesmus dissociatus u Desmodesmus fennicus u

c ularis

u

o

s e

u c . p q

o c c r

m

T T s 4972042 T e W

o s . rectangularis

m o o s

s m EH25 ow10/1 T ow10/1 t ow10/1 d s mus regularis

s

s a a t s s

s ow10/1 e a r

o u -26W t

T e t t t t munis

a e d l a o T t ow10/1 T a

d . platydiscus T o m m 12055737 Scenedesmus quadricauda t d ow8/18T o u 43348981 Desmodesmus maximus t - ow6/16T t

o

o ow8/18P-13W o - nis var o g s s

o

T T g - m s

- - r e e

g m us cuneatus win8/18P-2d g ow6/3T ow8/18P-1d r g a s m u 90992795 90992794 Desmodesmus fennicus . muzzanensis 1T d 164510421 Desmodesmus fen -10W 1T a

1T s 164510422 Desmodesmus fennicus T r D communis r r n s e aldevei

a m 1T o a a n

etradesmus wisconsinensis var e m

4972041 u e ow8/18P-4 7 n -17W D n -3W u s 1T -6W n

D m T l

9 -1W u D a

u u l e

6 s 4624185846241859 Scenedesmus Scenedesmus deserticola deserticola a 46241856 Scenedesmus deserticola 23394014 Desmodesmus perforatus 3 l

-10W t

8 46241850 Scenedesmus rotundus l 12055742 Desmodesmus perforatus commu -12W -17W l d a

6 9 e a u t smus a

5

u 46241854 Scenedesmus46241855 bajacalifornicus Scenedesmus deserticola 9 o -1 t 9 6625510 Scenedesmus acuminatus 3 t s t D smus sp. ET14 u 1

u s 4972014 Scenedesmus acuminatus u 3 ow6/16T s sp. NIOO-MV3

2 5

m

1d s 46241857 Scenedesmus deserticola 5 T is s T s 5 2

4 1 -10W s ow8/18P-20W 9

1

8

5 T -1 5 e

T 8

5 Oedogoniales 2 etradesmus wisconsinensis smodesmus com 7 ow6/3T T 9 ensis 1W D

89515388 Scenedesmus sp. Mary9/21BT 9 T 4 e 1 Desmodesmus sp. 8 ow8/18T 8 desmus communis

ow8/18P-25W 8 5 9 T 3 6625532 Scenedesmus obliquus D Desmodesmus communis T 7 2 ow8/18T 1 190151815 Scenedesmus sp. BR2 ow8/18T 138754218 Desmodes 9 Scenedesmus 8 2 . ibera

3 6625508 Scenedesmus6625531 Scenedesmus acutus12055731 obliquus Scenedesmus dimorphus 4 . iberaensis 37727739 Sc 17385571 Scenedesmus basiliensis -9W 94024 Desmodesmus tropicus 1 37727738 Scenedesmus arcuatus var 5 . iberaensis 6625530 Scenedesmus obliquus 7 89515386 Scenedesmus6625509 obliquus Scenedesmus naegelii -25W 138754289 Desmodesmus reg 89515385 Desmodes 8 89515384 Desmodesmus cuneatus esmus arthrodesmiformis 233 15913571 tus var 136054956 Scenedesmus sp. 6625507 Scenedesmus acutus -23W 3 -5W 1 1205573 1 Desmodesmus hystrix 40287876 210139134 169930318 159135720159135732 Desmode Desmodesmus communis 159135719 Desmodesmus communis smus arthrodesmiformis 169930314 473318Sphaeroplea Spermatozopsis annulina similis 159135710 Desmode desmus sp. Hegewald 1987-51 91221289 Oedocladium prescottii enedesmus arcuatus var 055745 Scenedesmu 169930316 Sphaeroplea annulina 12055732 Enallax acutiformis . reginae 159135717 Desmodesmus communis 40288339 Desmodesmus 40288344 Desmo 40288340 Desmodesmus hystrix 4972016 Enallax acutiformis 12 89515341 Desmodesmus sp. WT 15055107 Desmodesmus pirkolleimo 497201537727741 Scenedesmus4972018 Scenedesmus Scenedesmus raciborskii regularis pectinatus 40288342 Desmodesmus communis89515340 var Desmodesmus 159135716sp. 89515375 Desmodesmus Desmodesmus sp. EH1 pirkollei 4972017 Scenedesmus pectinatus 1205574689515369 Desmodesmus Desmodesmus159135715 hystrix Desmodesmus hystrix4972022 Desmodesmus sp. ET8a lefevrei var -2W-2d 4028834 12055734 Desmodesmus1T communis Atractomorpha porcata 89515345 Desmodesmus sp. smus perforatus var 89515344 Desmodesmus sp.Desmodesmus NDem6/3P-3d arthrodesmiformis 1699303123998637727740 1 Scenede 4972024 Desmod ow10/1 62183526 Oedogonium calliandrum T 62183527 Oedogonium foveolatum 159135714 Desmodesmus sp. EH32 6218353162183529 Bulbochaete Oedogonium rectangularis cardiacum var 15216662 Desmodesmus pannonicus . simplex 89574209 Oedogonium sp. L3 4972023 Des 17 Scenedesmus obtusus 89515352 Desmodesmus arthrodesmiformis 62183525 Oedogonium donnellii -19W 8951535389515355 Desmode Desmodesmus arthrodesmiform 89515395 Desmodesmus asymmetricus-2008c . simplex Ankyra judayi 89515354 T 71482661 Oedogonium cylindrosporum Antarctic 2339401523394016 Desmodesmus Desmodesmus perforatus perforatus . platydiscus 23394017 Desmodesmus perforatus esmus sp. . arcuatus 2339401823394013 Desmodesmus Desmodesmus perfora maximus smodesmus sp. Mary6/3T 62183530 Oedogonium angustistomum 23394019 Desmodesmus perforatus var 4972019 Desmodesmusmodesmus sp. bicellularis CL1 89574208 Oedogonium subplagiostomumsmus hindakii 23394020 Desmode . simplex 1205574389515367 Scenedesmus Desmodesmus jovais asymmetricus 23394023 Desmodesmus tropicus 2339402523394021 Desmodesmus Desmodesmus tropicus tropicus 23394022 Desmodesmus tropicus . turskensis 87245052 Desmodesmus tropicus . turskensis 159135721 Desmodesmus sp. ET27b . turskensis 89515370 De . kantonensis 62183524 Oedogonium borisianum 1591357138951537115620765 Desmodesmus Desmod Des sp. ET2 T-2008b 40288330 Desmodesmus komarekii 62183535 Oedogonium oblongum 2055740 Desmodesmus bicellularis . hiloensis 1 ‚Chlamydomonas I‘ 62183528 Oedogonium geniculatum 15913571212055738 Desmodesmus Scenedesmus sp. EH21 ecornis 89574214 Oedogonium pakistanense 16979801989515376 40288332Desmodesmus Desmodesmus Desmodesmus sp. MA bicellularis subspicatus var 8957421 -8W 6218353489574212 Oedogonium Oedogonium crispum77176661 brevicingulatum Oed 40288337 Desmodesmus subspicatus var 1T-15W 89574210 Oedogonium sp. L4 12055735 Desmodesmus subspicatus 89574213 Oedogonium sp. M3 15913572212055736 Desmodesmus Scenedesmus sp. ET51 abundans -9W 71482662 Oedogonium1 Oedogoniumnod sp. L7 Tow10/1 62183533 Oe Tow6/16T 714826646218 Oedogonium3532 Oedogonium subdissimile vaucherii 40288329 Desmodesmus subspicatus var ow6/16T 6066301 V 40288331 Desmodesmus multivariabilis T -16W 7717666277176660 Oedogonium Oedogonium undulatum brevicingulatum 40288338 Desmodesmus169798018 multivariabilis Desmodesmusow8/18P-3W sp. MA ogonium tenerum T -2d 40288336 Desmodesmus multivariabilis var olvox714 africanus dogonium 89515343 Desmodesmus multivariabilis var Tow6/16T 82663 Oedogonium emin 40288343 Desmodesmus multivariabilis var 2627273 Chlamydomon253749356 V 4028833540288334 Desmodesmus Desmodesmus pleiomorphus pleiomorphus var 111073393 Chlamydomonas asymmetrica 56122680 Desmodesmus pleiomorphus fragile T-2008a 229258273111073394 Chlamydomonas Chlamydomonas sp. HTL-10 minuta ulosum 159135731 Desmodesmus pleiomorphus olv 4028833389515383 Desmodesmus Desmodesmus pleiomorphus sp. ox africanus 89515382 Desmodesmus sp. 253749359229258272 V Chlamydomonas111073396 sp. HTL-DChlamydomonas pygmaea 89515381 Desmodesmus sp. as asymmetrica 94 8951534812055747 Desmodesmus Scenedesmus sp. sp. NIOO-MV5 197097041 Eudorina unicocca ens 89515350 Desmodesmus sp. 111073398 Chlamydomonas sp. UTEX 796 89515349 Desmodesmus sp. Itas6/3T 197097042 Eudorina 14091692unicocca olvox 159135728 Desmodesmus sp. ET53 1409114091704 barberi Pandorina morum 159135729159135730 Desmodesmus Desmodesmus sp. sp.EH42 EH69 4007496 Eudorina elegans 703 Pandorina morum 12055741169798015 Desmodesmus Desmodesmus opoliensis sp. MA Yamagishiella unicocca 159135725 Desmodesmus opoliensis -31W 14091691 159135738 Desmodesmus opoliensis ‚Yamagishiella‘ 159135726 Desmodesmus opoliensisow6/16T 400751 T 40075101 Yamagishiella unicocca 159135727 Desmodesmus opoliensis . subalternans 4007509 Yamagishiella unicocca 159135724 Desmodesmus sp. ET3 Yam 38491399 Desmodesmus komarekii . subalternans 4007508Y am agishiella unicocca 57 12055748 Scenedesmus longus . subalternans Y agishiella unicocca 89515380159135723 Desmodesmus Desmodesmus komarekii sp. EH36 amagishiella unicocca 89515366 Desmodesmus sp. 12055744 Desmodesmus pannonicus 1409168140916897 159135733 Desmodesmus armatus var 89515339 Desmodesmus armatus var 14091712 Pandorina morum14091688Yamagishiella Yamagishiella unicocca unicocca 89515338 Desmodesmus armatus var 400750714091690 Y 159135737 Desmodesmus armatus Y amagishi 159135735 Desmodesmus armatus 4007522 Pandorin 4007523 Pandorinaamagishiella morumYamagishiell unicoccaella unicocca 47 159135736 Desmodesmus armatus 4007524 Pandorina morum a unicocca 159135734 Desmodesmus armatus 4007514a morum Pandorina morum 14091702 Pandori 4007530 Pandorina morum Prasinophyceae 4007518 Pandorina morum na morum

4007516 Pandorina mo 14091709 Pandorinarum mor 65427932 Micromonas pusilla 6542793065427931 MicromonasMicromonas pusillapusilla 65427934 Micromonas pusilla 14165183 14091701Pandorina Pandorina morum morumum 654279351097 Micromonas pusilla 65427933 Micromonas pusilla 468654279261 65427940 Micromonas Micromonas pusilla pusilla 65427923 Micromonas pusilla (outgroup) 4007533 Pandori 226523057 Micromonas sp. RCC299 14165187 Pandorina morum na morum 4007519 Pandorina morum 65427927 Micromonas pusilla 14165186 Pandorina morum 6542793865427937 Micromonas pusilla 14165185 Pandorina morum 6542793665427941 Micromonas Micromonas pusilla pusilla 4007515 Pandorina morum 6542793965427922 Micromonas pusilla pusilla 14091708 Pandorina morum 65427928 Micromonas 6542792465427929 Micromonas Micromonas pusilla pusilla 4007520 Pandorina morum 65427925 Micromonas pusilla 205361369 Dunaliella salina 33469503 Pandorina morum 145587829 Dunaliella salina 33333776 Dunaliella salina 14165188 Pandorina morum 205371718 Dunaliella sp. 14029196 Pandorina colemaniae 71482598 Dunalie ABRIINW U1/1 71482596 Dunaliellalla salina 4007513 Pandorina morum 71482597 Dunaliella salina 14091699 Pandorina morum 71482593 Dunaliella salina 71482594 Duna 14091697 Pandorina morum 47499297 liellaDunaliella salina s 14091698 Pandorina morum 87047576254838316 Dunaliella Dunaliella bardawil salina 14091696 Pandorina morum 46250926 Dunaliella salina alina ‚Pandorina‘ 71482595 Dunaliella salina 6066295 Pandorina morum 33333772 Dunaliella salina 4007529 Pandorina morum 33333774 Dunaliella salina 14091705 Pandorina morum 16596840 Dunaliella bardawil 14091706 Pandorina morum 2645739 Dunaliella sp. 006 6066293 Pandorina morum 20537171916596837 Dunaliella Dunaliella sp. salina 14091707 Pandorina morum 87047612 Dunaliella sp. SPMO 601-1 8704757458339343 Dunaliella Dunaliella v 4007528 Pandorina morum olvulina boldii 87047 3758539 Pandorina morum 87047572 Dunaliella sp. BSF1ABRIINW U2/1 14091685 V 8704765731 D 4106858 Pandorina morum 99 8704761 unaliella sp. BSF3iridis 8704 1 Dunaliella sp. sp. SPMO BSF2 600-1 4007527 Pandorina morum 8704761476130 DuDunaliellaDunaliellanaliella sp.sp. SPMOSPSPMOM 980625-IEBP3 14091694 Pandorina morum 87047605 Dunaliella sp. SPMO 202-4 4007525 Pandorina morum 14091695 Pandorina morum 4007526 Pandorina morum 3192574 Pandorina morum 61200913 Dunaliella viridis 3192575 Pandorinaolvulina compacta morum 61200 87047589 O 300-5 87047600914 16596834 DunaliellaDunaliella Dunaliella sp.virid SPMO viridis 201-2 14091686 V 5657859787047597 Dunaliella Dunaliella viridisDunaliella sp. SPMO viridis 200-2 1659684587047599 Dunaliella sp. SPMO 200-8 4007512 Pandorina19569572 morum Eudorina elegans 19569575 Eudorina elegans Dunaliell is 4007504 Eudorina19569578 illinoisensis Eudorina elegans 197290927 Dunaliella sp. 19569577 Eudorina elegans 87047582145587828 Dunaliella parva Dunaliella salina 19569576 Eudorina elegans 165968364907309116596835 Dunaliella Dunaliella Dunaliellaa salina parva salina pseudosalina 47499296 Dunaliella salina 1956957019569571 Eudorina Eudorina sp. China-6sp. Hono 87047594 Dunaliella sp. SPMO 1 19569573 Eudorina elegans 870 87047606 Dunaliella sp. SPMO 207-3 195695744007503 Eudorina Eudorina elegans elegans 8704760247595 Duna Dunaliellaliella sp. SPMO 201-4ABRIINW M1/1 87047603 Dunaliella sp. SPMO 201-5 87047598 Dunaliella sp. SPMO 200-3 71482602 Dunaliella sp. 71482601 Dunaliella parva 71482600 8704Dunaliell87047601 Dunaliella sp. SPMO 201-3 71482599 Dunaliella bardawil 195695806066298 Eudorina Eudorina sp. Chile-3elegans 76097092 Dunaliella87047604 bioculata Dunaliella sp. SPMOsp. SPMO 201-6 1 77539932 Dunaliella7596 quartolecta Dunaliella sp. SPMO 128-2 77539931 Dunaliella polymorpha 19569583 Eudorina sp. Suz 77955899 Dunaliella tertiolecta 6066297 Eudorina elegans 51035302 Dunaliella minuta 50952902 Dunal 12-3 4007505 Eudorina elegans 5 12-4 gans 5103530459792 Dunaliella sp. SAG19.6 51035303 Dunaliella maritima 19569579 Eudorina sp. Seekonk 47933783 Dunaliella tertiolecta a primolecta 16596841 hd10 40075324007531 Eudorina Eudorina elegans unicocca 16596843 Dunaliella09 Dunaliella parva tertiolecta 19569581 Eudorina elegans 16596846 Dunaliella peircei 87047587 Dunaliella tertiolecta 4007498 Eudorina elegans 8704758856578596 Dunaliella Dunaliella tertiolecta tertiolecta 87047583 6066299197097044 Eudorina illinoisensisEudorina peripheralis 87047584 Dunaliella2627284 sp. Dunaliella CCMP1641 tertiolectaiella primolecta 19569568 Eudorina elegans 87047585 Dunaliella sp. CCMP1923 4007506 Pleodorina195695694007495 indica Eudorina Eudorina unicocca elegans 87047577 Dunaliella bioculataDunaliella tertiolecta 4007494 Eudorina ele 16596842 Dunaliella tertiolecta 197097043 Eudorina4007493 peripheralis Eudorina elegans 16596844 Dunaliella parva 19569582 Eudorina elegans 145587830 Dunaliella tertiolecta Dunaliella primolecta ‚Dunaliella‘

6066288 Paulschulziaubernaculifera pseudovolvox 151573027 Dunaliella salina 111073395 Chlamydomonas gyrus 16596838 Dunaliella salina 87047578 Dunaliella sp. CCMP367 197290646 Dunaliella sp. 16596839 Dunaliella salina 33333770 Dunaliella salina 87047575 Chlorosarcinopsis gelatinosa ‚Eudorina‘ 8704760987047607 Dunaliella Dunaliella sp. sp. SPMO SPMO 300-4 210-3 2645740 Pyrobotrys stellata Astrephomene g 111073389 Chlamydomonas ulvaensis 2645745 Chlamydomonas allensworthii 16596847 Dunaliella lateralis 2645746 Chlamydomonas allensworthii 2627282 Chlamydomonas allensworthii AstrephomeneAstrephomene gubernaculifera gubernaculifera 87047581 Halosarcinochlamys cherokeensis Astrephomene perforata 1327444613274447 Chlamydomonas Chlamydomonas allensworthii allensworthii 87047593 Dunaliella sp. SPMO 1 2627276 Chlamydomonas gelatinosa 87047580 Dunaliella lateralis13274452 Chlamydomonas allensworthii 8704759087047592 Dunaliella Dunaliella sp. FL1 sp. SPMO 1 111073399 Chlamydomonas typica 18025111 Phacotus lenticularis 180251 87047591 Dunaliella sp. SPMO 109-1 2982760 18025109 Phacotus lenticularis 87047608 Dunaliella sp. SPMO 21 2627278 Chlamydomonas sphaeroides 18025121 Phacotus lenticularis 13274451 Chlamydomonas allensworthii 1 18025120 Phacotus lenticularis 29827572982752 2627275 Lobochlamys culleus 2982761 8 Astrephomene gubernaculifera 141399295107925 Chlamydomonas Chloromonas hedleyi playfairii 0 Astrephomene gubernaculifera 180251 132744481327444913274450 Chlamydomonas Chlamydomonas Chlamydomonas allensworthii allensworthii allensworthii 87047586 Dunaliella sp. CCMP220 AstrephomeneAstrephomeneAstrephomene gubernaculifera gubernaculifera gubernaculifera 2 5 180251 10 Phacotus lenticularis

Astrephomene gubernaculifera 1 18025125 Phacotus lenticularis 180251 18025124 Phacotus lenticularis 18025123 Phacotus lenticularis 18025108 Phacotus sp. UTEX LB 236 2645753 Chlamydomonas reinhardtii 37 Gonium sociale 0 2982756 Astrephomene gubernaculifera 2645754 Lobochlamys culleus ABRIINW M1/2 1 180251 180251 180251 180251 2982755 18025122 Phacotus lenticularis 19 Phacotus lenticularis 2982753 18025137 Phacotus lenticularis 2982758 W 18025136 Phacotus lenticularis 2982754 18025135 Phacotus lenticularis 18025134 Phacotus lenticularis 18025133 Phacotus lenticularis 18025132 Phacotus lenticularis 29784 18025129 Phacotus18025130 lenticularis Phacotus lenticularis 1

1 1 1 1 18025131 Phacotus lenticularis 18 Phacotus lenticularis 2982759 as incerta i 2627274 Chlamydomonas callosa s 17 Phacotus lenticularis

8 180251

8 8 8 einhardtii 8 3192576 l o 18025126 Phacotus lenticularis 0

0 0 0 0 18025128 Phacotus lenticularis 2978436 Gonium sociale 18025127 Phacotus lenticularis a u 2

2 2 2 13 Phacotus lenticularis 2 12 Phacotus lenticularis 2978435 Gonium sociale s 15 Phacotus lenticularis t 14 Phacotus lenticularis

c 5

5 5 5 1073390 Chlamydomonas parallestriata 5 2978431 Gonium sacculiferum u 1 itreochlamys ordinata a 2978432 Gonium sacculiferum h 1

1 1 1 1 1 c n

i

i 0

0 0 0 0

r e

a a

l 16 Phacotus lenticularis

4

6 2 5 3

e l

s

l Chlamydomonadales I p

a

a 12-2

o

P

P P P P

i 2978434 Gonium sociale f. sociale p 12-1

2627279 Chlamydomonas komma h

p

h g

h h h onas reinhardtii h a

p

l

a n

a a a a a

s 1073404 Chlamydomonas debaryana s

c o 1-2

c c c c 111073388 V 1 n a

s

o

o o o o 1 p orale c n

u

t s

t t t t 1073403 Chlamydomonas debaryana t

t m u

u u u o u 1 o

o u s s

s s s 2645742 Chlamydomonas cribrum 1 s n

111073397 Chlamydomonas marvanii i t 111073391 Chlamydomonas petasus m c

m r

073402 Chlamydomonas incerta i a s s s s s 2978433 Gonium sociale f. sacculiferum o c 44829162 Chlamydomon a 6598327 Chlamydomonas reinhardtii u o 1 r

p l p p p atum t p a d 1 h 5031531 Chlamydomonas reinhardtii d

h h

h h h a h 1 y

P r a

c a

a a a a 2645749 Chlamydomonas r

d u m

o e

e e e e

7 111073392 Chlamydomonas petasus 1 Chlamydomonas reinhardtii i a q a

r r r r l r

0 2978438 Gonium octonarium g

i i i distell i i 31925319257877 Gonium Gonium octonarium octonarium u 111073400 Chlamydomonas parallestriata c c c i h c c 1110734051073408 Chlamydomonas Chlamydomonas reinhardtii reinhardtii sp. XJU-3 3192586 Gonium3192592 pectorale Gonium pectorale 1

r q m

u u u 92582 Gonium pectorale u 1110734091 Chlamydomonas reinhardtii u

5 3192579 Gonium pectorale C 1 3192594 Gonium pectorale u s s s s s i 3192580 Gonium pectorale 2 111073407 Chlamydomonas reinhardtii 2978440 Gonium4206399 pectorale Gonium pectorale 3192585 Gonium pectorale 3192581 Gonium pectorale m 31 9

n 111073406 Chlamydomonas reinhardtii 0

u

111073415 Chlamydom 3192593 Gonium pectorale 7

i o 45752 Chlamydomonas reinhardtii 8 111073414 Chlamydomonas reinhardtii 3192589 Gonium pectorale

5

n 11107341 1 2978439 Gonium pectorale G

111073410 Chlamydomonas reinhardtii 7 26 3192583 Gonium pectorale 3192584 Gonium pectorale 111073413 Chlamydomonas reinhardtii o

4 111073412 Chlamydomonas2645748 Chlamydomonas reinhardtii reinhardtii 3192591 Gonium pectorale 4

G 2645750 Chlamydomonas reinhardtii 6066289 Gonium pectorale 0 2645751 Chlamydomonashlamydomonas zebra reinhardtii 3192588 Gonium pect 8 111073401 Chlamydomonas reinhardtii 3192590 Gonium pectorale 7 3 1 192587 Gonium pectorale

8 8 9 3 144512 Spon

1 1

Chlamydomonas sp. Rt621

9

7

20379182 Gonium quadratum 6066290 Gonium quadratum 3 2037 Chlamydomonadales II

0 2645756 Chlamydomonas sp. Chile J 039217 Gonium vi

6066292 Gonium viridistellatum 2 2645755 C 2

monas reinhardtii

6066291 Gonium quadratum 2645738 ‚Astrephomene‘ 221267836 Chlamydomonas sp. XJU-36 221327584 Chlamydomonas

2645747 Chlamydo

14165184 Pandorina morum

‚Chlamydomonas III‘ ‚Chlamydomonas II‘

0.2 ‚Gonium‘ ‚Phacotus‘

Figure 37: Profile Neighbour joining tree calculated on the class of Chloro- phyceae (with 100 bootstrap replicas) using all ITS2 sequences and structures available from the ITS2 database version 3.0 (2009/09/29). The image was taken from Buchheim et al. (2011a). 15.B Phylogenetic tree of class Trebouxiophyceae 133

15.B Phylogenetic tree of class Trebouxiophyceae

‚Trebouxia‘

9754932

9754931 Microthamniales I 9754930

t 9754934

n

o

i

T 9754935 b

rebouxia Tjamesii letharii sp. 3 203284910 uncultured o 203284902 uncultured 20328491 t 203284912 uncultured 203284909 uncultured rebouxia jamesii letharii sp. 2 203284913 uncultured T rebouxia corticola

9754933 o rebouxia jamesii letharii sp. 1 T 76262957 uncultured h rebouxia photobiont

76262929 uncultured T p T

rebouxia photobiont rebouxia jamesii letharii sp. 5 t sp. RCOM1

a T n 9 9 i 9 9 76262950 uncultured 9 9 9 rebouxia higginsiae 9 90265752

T x o 0 0 0 0 0 0 0 T 0 76262952 uncultured i

rebouxia jamesii letharii sp. 6 u

2 2 2 2 2 2 2 2 b

1 uncultured o

6 6 6 6 6 6 6 6 T o rebouxia corticola t b

a 5 a 5 a 5 5 a 5 a 5 a 5 5

l l rebouxia jamesii letharii sp. 4 l l l l rebouxia photobiont T e

7 7 7 7 90265761 7 7 7 7

o o o o o r 203284905 uncultured 37650998 o rebouxia photobiont T O5375B

4 4 5 5 5 5 203284904 uncultured 4 c c 5 c rebouxia photobiont c c rebouxia photobiont 37651004 c rebouxia photobiont sp. RG1 T T

i i 9754927 i 203284903 uncultured i i i

7 8 7 3 Y 6 9 t t 9 t 1 t t t pho T T T

r r 203284901 uncultured r r r

d

T T T T 9754936 T T T T o o T o o o 37651001 e 37651000 37650999

r r r r r r r 90265760 c c r c r c c rebouxia corticola

e e 22022960 e e

e e e e red u rebouxia photobiont rebouxia photobiont

b t b 175712 b b 9754928 a a b a b a a b b l

i i i i i ia cor T rebouxia sp. Mayrhofer9943 T

o o o o 3765101 o 255653887 o T o o x x u x x x T x 76262945 uncultured T T u u 37651005 u u 23307638 u rebouxia photobiont T u rebouxia photobiont u u c u u 76262938 uncultured u T T u u rebouxia photobiont u T 151

rebouxia photobiont x 76262937 uncultured T x x x x T rebouxia photobiont x rebouxia T x x n

o o o rebouxia photobiont o o ouxia sp. rebouxia jamesii vulpinae T rebouxia photobiont o

i i i i rebouxia jamesii i nt i T i i T

rebouxia photobiont a u 58866273 uncultured a a a b b a b a b b rebouxia jamesii a a b rebouxia usneae 58866262 T rebouxia corticola rebouxia corticola rebouxia photobiont rebouxia corticola

e e e e e T e T

c 0 c 37651006 c corti rebouxia usneae rebouxia photobiont rebouxia jamesii c c T T r r 58866272 uncultured c r c r T reb 76262956 uncultured r r T rebouxia photobiont rebouxia photobiont T

o o T o 7 T o o T 76262958 uncultured o T o T T 37651013 T T T T T T 37651014 rebouxia jamesii anii

T T 76262951 uncultured 37651010 rebouxia jamesii

r 22022958 r 37651007 rebouxia jamesii r 2 rebouxia photobiont 1 r r 22022961 r r

rebouxia jamesii t t t tobiont t t 6 5 37651009 rebouxia jamesii t 4 t 3 2 1 6 i i i i 37651012 i i i

c c 9754929 c T c 9844900 c 37651003 c c

203284897 uncultured c 4 4 37651008 4 4 4 4 6

o o o o T o rebouxia jamesii o T o T o

7 7 18076826 7 7 7 7 866261 203284896 uncultured 22022948 8

l l l l rebouxia suecica l l rebouxia jamesii l rebouxia jamesii l

a 5 5 T a 5 22022947 a a 5 a 5 203284895 uncul a 5 58866266 uncultured 22022946 T a a 8 203284894 uncultured T rebouxia photobiont 6 6 6 18076825 6 6 6 58 rebouxia photobiont 5 58866274 uncultured 203284893 uncultured 5918304 ia pho T rebouxia photobiont 58866271 uncultured 5918257

2 2 2 2 203284892 uncultured 6465956 T 2 2 58866269 uncultu rebouxia photobiont 1545975 90265750

0 0 90265762 0 t 0 203284891 uncultured 0 0 T rebouxia jamesii 22022957 T 90265754 9 T 9 T 9 203284890 uncultured 9 T 9 9 22022956 T rebouxia jamesii 203284889 uncultured 22022953 rebouxiarebouxia jamesii jamesiiT T rebouxia jamesii T rebouxia jamesii 22022954 T rebouxia jamesii rebouxia photobiont 189182643 uncultured T T rebouxiarebouxia jamesii jamesii rebouxia rebouxiaphotobiont photobiont T rebouxia jamesii 52 uncultured rebouxia photobiont 189182644 uncultured T rebouxia photobiontT rebouxia jamesii rebouxia jamesii rebouxia photobio T T rebouxia photobiont157272518 uncultured T rebouxia photobiontT reboux 157272519 uncultured T rebouxia photobiont T T rebouxia jamesii T 189182647 uncultured 23307636 T rebouxia jamesii T 23307635 T rebouxia jamesii rebouxia showm T rebouxia jamesii 23307634 T rebouxia jamesii 189182653 unculturedT T rebouxia jamesii 23307637 T rebouxia photobiont rebouxia showmanii rebouxia photobiontT T 76262920 uncultured tured rebouxia photobiontT 18918762629152674 uncultured uncultured T rebouxia jamesii T rebouxia jamesii 189182673 uncultured T rebouxia jamesii 1891826 rebouxia jamesii cultured asymmetrica obiont 189182672 uncultured rebouxia photobiont ultured157272522 uncultured rebouxia sp. C-18 T rinkaus450 T rebouxia photobiont T T T rebouxia photobiontT rebouxiaT photobiont T T rebouxia jamesii rebouxia photobiont 74100300 uncultured203284907 uncultured rebouxia photobiontrebouxia jamesii rebouxiarebouxia photobiontT photobiont 74100297 uncultured T rebouxia photobiontrebouxia jamesii T 74 203284906 uncultured T rebouxia photobiontT T rebouxiaT photobion 74100303 uncultured 9754940 T 203284908 uncultured 22022962 T rebouxia photobiontrebouxia jamesii 255653885 rebouxia photobiont 100298 uncultured T r rebouxia photobiontia photobiont T ebouxia photobiont rebouxia T 22901829 rebouxia photobiont rebouxia asymmetricaT 23307633 T rebouxia photobiont T rebouxia asymmetrica rebouxia photobiont 4916 unc T 74100299 uncultured255653878 T T rebouxia photobiont rebouxia photobiont 74100302 uncultured 18412474 uncultured1841247122022955 uncultured 1841841247212473 un uncultured rebouxia asymmetricarebouxiaT phot 1 1 reboux 22022959 T 1 1 T T 74100301 uncultured T rebouxia photobiontT rebouxia sp. 98.003B2rebouxia sp. T rebouxia photobiont 203284914 uncultured rebouxia jamesii 203284915 uncultured rebouxia photobiont T rebouxia photobiont 20328 T rebouxia photobiont 17385553 rebouxia photobiontT T T rebouxia photobiont T T T rebouxia photobiont T rebouxia photobiontT rebouxia sp. Ri2 1 uncultured T rebouxia photobiont rebouxia jamesii T rebouxiarinkaus439 photobiont T T rebouxia photobiont T T T rebouxia photobiont 2749705527497057 T rebouxiaT T photobiontrebouxia brindabellaeT 27497056 rebouxia photobiont rebouxia photobiont T rebouxia photobiont rebouxia photobiontT rebouxiaT photobiont T rebouxia photobiont T rebouxia photobiont T rebouxiarebouxia photobiont photobiont rebouxia jamesii 9757428 T rebouxia photobiont rebouxiaT photobiont T rebouxia photobiont rebouxia jamesii 76262933 uncultured T rebouxia photobiont 255653877 5918303 9844715 T rebouxia photobiont 255653886 7575572776262949 uncultured uncultured T 76262926 uncultured T 7626294 T rebouxia photobiontrebouxiarebouxia photobiont photobiont 75755738 unculturedT rebouxia photobiont 76262932 uncultured T rebouxiaT photobiont rebouxia photobiont 747 uncultured rebouxia photobiont sp. gbA19 T rebouxiarebouxia photobiontT photobiont la 75755743 unculturedT 7575572975755725 uncultured uncultur9 uncultured ed T T 75755742 uncultured376510025918302rebouxia photobiont T rebouxia photobiont rebouxia sp. T 75755746 uncultured T rebouxia photobiont 75755745 T T T 75755 rebouxia photobiontrebouxiarebouxia photobiont photobiont 75755744 uncultured rebouxiarebouxia simplex australis T T T rebouxia photobiont 7 T 757557496061716 uncultured0 uncultured T 755755748 uncultured T rebouxia jamesii 75755751 uncultured rebouxia jamesii 567150656715068 uncultured 75755724755752 uncultured uncultured 5671506756715072 uncultured uncultured 75755723 uncultureduncultured 567150705671507110808566 uncultured uncultured rebouxia photobiont 75755741 uncultured T T 1 9844705 rebouxia decolorans T rebouxiarebouxia photobiont photobiont 75755737757555755735732 uncultured uncultured unculture d rebouxiarebouxia decoloransrebouxi arboricorebouxiaa decolorans decolorans 75755740 uncultured T 7 rebouxiaT decoloransT rebouxiaT decolorans rebouxia photobiont 7575573675755733 uncultured uncultured T rebouxiaT Tdecolorans 75 T 75755734755731 uncultured uncultured T rebouxiaT arboricola 755739 uncultured T rebouxia photobiont 7575572675755730 uncultured uncultured T rebouxia photobiont 75 9444888994448677 uncultured uncultured T rebouxia decolorans rebouxia photobiont 7575575018412470 uncultured uncultured T rebouxia decolorans decolorans 76262940 uncultured Trebouxia photobiont 1 1841246994448891 uncultured uncultured T Treb 1 rebouxia decolorans 76262942 uncultured T 94448893 uncultured T rebouxiarebouxia decolorans decoloransrebouxiarebouxia photobiont photobiont T rebouxia photobiont 50428761444870694448717 uncultured 9444873494448723 T rebouxia decoloransT T 187469929 rebouxiouxia photobiont 9 94448722944487492330763294448686 uncultured94448735 uncultured T T rebouxia decolorans 9844895 Trebouxia photobiont 9444889094448895 uncultured 94448738 unculturedTrebouxia rebouxia decoloransrebouxia photobiont 75 187469930 Trebouxia photobiont 18699039 T rebouxiarebouxiaT decolorans decoloransT 755728 uncultured Trebouxia photobiont 94448713 T T Trebouxia photobiont 56715063 uncultu Trebouxia photobionta photobiont 9444875694448780 uncultured T 94448759 rebouxia sp. TMayhofer13.740rebouxia incrustata 944487619444875894448733 9844735T rebouxiaTrebouxia photobiont incrustata 94448757 rebouxiarebouxia decolorans photobiont Trebouxia photobiont 9444876294448755 rebouxia T decoloransTTrebouxia photobiont 9844886 9444876094448707 T red 94448697 uncultured TrebouxiaTrebouxia photobiont incrustata 9444890094448678 uncultured uncrebouxiaultured arboricola Trebouxia photobiont 7626293594448894 uncultured uncultured 9757429Tre T 94448888 uncultured 18699048 9844881 Trebouxia9754938 sp.bouxia Mayrhofer13.747 photob 9444875294448687 uncultured 94448705 uncu 9754937 9844710 22022963 9444874194448685 Trebouxia uncultured decolorans 17385554 2202295122022950 Trebouxiarebouxia decolorans decolorans Trebouxia gigantea T rebouxia decoloransTrebouxia photobiont T TrebouxiaT impressaTrebouxiaTreb sp. Mayrhofer13.749Trebouxia gigantea 18699042 TTrebouxia decolorans 9844910 T rebouxia photobiontrebouxia flava Trebouxia sp. B-8 76262921 unculturedTrebouxia decolorans ltured T Trebrebouxia Tphotobiontrebouxia impressa22022949ouxia Tsp.Trebouxia sp. B-9iont 94448699 unculturedTrebouxiarebouxia decolorans decolorans rebouxiaoux photobiont reboux 9449444876648764 T rebouxia decolorans 9844690TrebouxiaT reboux s ia photobiontia photobiont 94448765 T 94448892 unculturedTrebouxia photobiont T ia sp. B-10 94448773 rebouxia decolorans T rinkaus356a 9444877094448774 T Trebouxia photobiont 9844905 Trebouxiap. Mayrhofer13.702 sp. Friedl10.1994 rebouxia sp. B-7 94448768 Trebouxiarebouxia arboricola arboricola 76262919 uncultured 9444873194448730 T Trebouxia decolorans Trebouxia9844670 sp. Mayrhofer13.723 94448679 unculturedTrebouxiaTrebouxia arboricola arboricola 94448721 Trebouxia photobiont 18699060 T 59182555881947 rebouxia sp. Pa2 94448719Trebouxia decolorans 18699049TrebouxiaT photobiont 944487181869904018699041 uncultured Trebouxia decoloransrebouxia photobiont 18699047 rebouxia impressa 76262924rebouxia uncultured decolorans T T Trebouxia decoloransTrebouxia photobiont 1869904 Trebouxia impressa rebouxia decoloransTrebouxia photobiont 25565388118699045 Trebouxia impressa 9444869494448739 T Trebouxia photobiont 94448898 6 Trebouxia impressa 94448680 unculturedTrebouxia decolorans 94448704 TrebouxiaT gelatinosa 94448703255653879 94448902 uncultured Treb rebouxia impressa 94448903 5918306 T ouxi 94448689 unculturedTrebouxia decolorans rebouxiaa gelatinosa gelatinosa 9444886694448887 unculturedTrebouxia decolorans Trebouxia gelat 94448750 unculturedTrebouxiarebouxia decoloransdecoloransTrebouxia photobiont 3476170 18699043 Trebouxia photobiont 9444874894448746 unculturedTTrebouxia decoloransTrebouxia photobiont 3 uncultured9754 T inosa 94448772 Trebouxia decoloransTrebouxia photobiont 941rebouxia impres 9444877194448775 T Trebouxia sp. UC 94448769 rebouxia photobiontsa 94448767 rebouxia decolorans 94448725 unculturedTTrebouxia decolorans 5886626494448729 unculturedunculturedTrebouxia decolorans 94448726 Trebouxiarebouxia decoloransdecolorans 9444872894448727 Trebouxia decolorans 76262948 uncultured 944487431 TTrebouxiaTrebouxia decolorans decolorans 94448688 uncultured 944487449444871 94448712 uncultured 9444871594448716T rebouxiarebouxia rebouxia sp. decolorans Friedl10.1994 arboricola Trebouxia photobiont T Trebouxia decolorans 94448901 T 984464518699038 Prasinophyceae 94448720 76262925 un Tr Trebouxiarebouxia arboricola photobiont 94448747Trebouxia decolorans 76262918 unculturecultured94448782 deboux ia photobiont TrebouTxiarebouxia photobiont arboricol 94448683 65427936 Micromonas pusilla 94448682 Trebouxia photobiont 94448754 uncultured94448701Trebouxia arboricola a 65427930 Micromonas pusilla65427927 Micromonas pusilla 76262953 uncultured T 6542793165427935 Micromonas65427938 pusilla Micromonas pusilla 189182 Trebouxiarebouxia photobiont arboricola 65427937 Micromonas pusilla 76262955 uncultured621 uncultured T rebouxia photobiont 1097 Micromonas pusilla 189182616 unculturedTrebouxia Trebou photobiontxia photobiont 4681 189182615 uncultured Trebouxia photobiont (outgroup) 76262922 uncultured Tr 65427926 Micromonas pusilla 189182604 unculturedT rebouxiaebouxia photobiontphotobiont 6542794165427939 Micromonas pusilla Trebouxia photobiont 65427922 Micromonas pusilla 189182609 uncultured Trebouxia photobiont 226523057 Micromonas 65427932sp. RCC299 Micromonas pusilla 189182614 uncultured Trebou 65427940 Micromonas pusilla6542793465427933 Micromonas pusilla 189182608 uncultured Trebouxiaxia photobiontphotobiont 65427923 Micromonas pusilla 94448776 T 65427928 Micromonas pusilla 94448696 Trebouxiarebouxia arboricolaarboricola 65427929 Micromonas pusilla 22901827 Trebouxia sp. Lr 6542792465427925 Micromonas Micromonas pusilla pusilla 189182648 uncultured Trebouxia photobiont 207366646 Chlorellaceae sp. TP-2008 22901828 Trebouxia sp. Ri1 207366647 Chlorellaceae sp. TP-2008 ed Trebouxia photobiont 207366648 Chlorellaceae sp. CCAP 189182611 uncultur Trebouxia photobiont 283/3 189182645 uncultured94448702 Trebouxia arboricola 99 207366653 Chlorella vulgaris 94448700 Trebouxia arboricola 207366654 Chlorella vulgaris 94448753 Trebouxia arboricola 94448710 Trebouxia arboricola 121489578 Chlorella vulgaris 94448709 TrebouxiaTrebouxia arboricola arboricola 121489577 Chlorella vulgaris 94448763 Trebouxia arboricola 51095190 Chlorella 94448745 Trebouxia arboricola vulgaris 94448736 rebouxia arboricola 51095191 Chlorella vulgaris 94448681 TTrebouxia arboricola 51095188 Chlorella vulgaris 94448781 Trebouxiarebouxia photobiont arboricola 51095189 Chlorella vu T 51095179 Chlorella vulgarislgaris 94448695 uncultured94448724Trebouxia arboricola 51095180 Chlorella vulgaris 94448737 Trebouxiarebouxia arboricola arboricola 94448698 T Trebouxia photobiont 51095181 Chlorella vulgaris 94448740 Trebouxia photobiont 5109518 Trebouxia photobiont 510951872 Chlorella vulgarivulg 189182617 unculturedTrebouxia photobiont aris Trebouxia photobiont 51095171 Chlorella vulgaris 189182639 uncultured Trebouxia photobiont 51095174 Chlorella vulgariss 18918261094448778 uncultured uncultured Trebouxia photobiont 51095175 Chlorel Trebouxia arboricola 189182654 unculturedTrebouxia Trebouxia photobiont arboricola 51095172 Chlo 189182650 uncultured Trebouxia arboricola 51 la vulgaris 189182649 uncultured25565387694448691 Trebouxia arboricola 1 095173 Chlorellarella vulgav 94448692 51095176 Chlorella vuulgaris 94448708 uncultured94448684 Trebouxia photobiont 5109 Trebouxia arboricola 510951835177 ChlorellaChlorella vvulgarisris Trebouxia Trebouxiarebouxia photobiontTrebouxia arboricola arboricola sp. D-1 lgaris 94448690 T 207366655 Chlorella vulgaris ‚Chlorella‘ rebouxia photobiont 94448693 uncultured94448742 22022952Trebouxia T rebouxiaphotobiontTrebouxiarebouxia photobiont photobiontphotobiont 121489579 Chlorella vulgarisulga 94448732 T Trebouxia photobiontTTrebouxia photobiont 207366636 Chlo ris 94448779 uncultured Trebouxia photobiont 51095184 Chlorella vulgaris Trebouxiarebouxia photobiontphotobiont rebouxia photobiont T rebouxia photobiont 51095185 rella vulgaris 129278864 unculturedT TTrebouxia photobiont 51095186 Ch 76262944189182627 uncultured uncultured Trebouxia photobiont Chlorella vulgaris 189182633189182632 uncultured uncultured189182626 uncultured Trebouxiarebouxia photobiontphotobiont 510951783767452 Chlor4ella vulgaris 189182625 uncultured T rebouxia photobiont lorella vulgaris 189182624 uncultured TTrebouxiarebouxia photobiont photobiont Chlorella vulgaris 189182623 uncultured T Trebouxiarebouxia photobiontphotobiont 76262943 uncultured 189182622 uncultured TTrebouxia photobiont 158828583 Chlorella207366637 sp. SAG Lobosphaeropsis 21 lobophora 189182664 uncultured Trebouxiarebouxia photobiontphotobiont 68449774 Chlorella37674521 sp. NC64A 157272521 uncultured TTrebouxia photobiont 74355995 Chlorella sp. MRBG1 157272520 uncultured Trebouxiarebouxia photobiontphotobiont Lobosphaeropsis lobophora 157272523 uncultured T rebouxia photobiont 20736664420736 Didymogenes palatina 15727252476262954 uncultured uncultured TTrebouxia photobiont 189182629 uncultured Trebouxiarebouxia photobiontphotobiont 37674531 Didymogenes6643 Didymogenes palat37674525207366651 anomala Closteriops Closteriopsis acicularis 76262931189182618 uncultured uncultured T rebouxia photobiont 1-6 189182670 uncultured TTrebouxia photobiont 45181485 Chlorella37674530 sp. Itas Didymogenes 2/24 S-12w anomala 189182671 uncultured Trebouxiarebouxia photobiontphotobiont 45181484 Chlorella sp. Itas 2/24 S-1 189182637 uncultured T rebouxia photobiont 45181486 Chlorella sp. Itas 2/24 S-1 189182634 uncultured T rebouxiarebouxia photobiont photobiont 37674528 Dictyosphaerium pulchellum 158325121 Chlorella pyrenoidosa 189182636 uncultured TT rebouxia photobiont 45181487 Chlorella sp. Itas 6/3 M-1d 189182630 uncultured T rebouxia photobiont 207366658 Chlorella sp. CCAP is acicularis 189182631 uncultured T rebouxia photobiont 76262914 uncultured TTrebouxiarebouxiarebouxia photobiont photobiont sp. M6162 189182628 uncultured T TTrebouxia sp. M6146 ina 189182681 uncultured Trebouxia jamesii 189182680 uncultured Trebouxia photobiont 207366641 Micractinium sp. SAG 72.80 189182638 uncultured rebouxia photobiont 37674534 Micr207366642 Micractinium sp. SAG 48.93 189182640 uncultured TTrebouxiarebouxia photobiont photobiont 37674533 Micractinium pusillum 189182641 uncultured 98447009844695 T rebouxia photobiont 1w 18918260576262927 uncultured uncultured T rebouxiarebouxia photobiont photobiont TT rebouxia photobiont w 18918261376262946 uncultured uncultured T rebouxia photobiont 94 207366640 Micractinium pusillum 21 255653884 rebouxia photobiont 189182607 uncultured TT rebouxia photobiont 37674532a Micractinium pusillum1/90 189182666 uncultured T rebouxia photobiont ctinium sp. SAG 72.80 T rebouxia photobiontrebouxia photobiont 189182665 uncultured T rebouxiarebouxia photobiont photobiont 189182659 uncultured TT rebouxiaT photobiont 96 189182667 uncultured T rebouxia photobiont 207366671 Micractinium pusillum T rebouxiarebouxia photobiont photobiont 61 207366669121489581 Micractinium Micractinium pusillum pusillu T 189182679 uncultured T 207366668207366670 Micractinium Micractinium pusillum pusillum 207366667 Micractinium pusillum ‚Didymogenes‘ 189182675189182676 uncultured uncultured 207366665 Micractinium pusillum 189182677 uncultured 189182678 uncultured 207366666 Mic 18918263576262923 uncultured uncultured 207366663 Micractinium pusillum 189182646 uncultured 207366664 Micractinium pusillum 189182642 uncultured 207366662 Micractinium pusillum 189182657 uncultured 189182656 uncultured 121489580 Micractinium pusillum 18918265876262947 uncultured uncultured 121489582 Micractiniumractinium pusillum pusillum 189182663 uncultured 189182660 189182651uncultured uncultured m 189182662 uncultured 158325122 18918266158866263 uncultured uncultured 158325120 Chlo 189182612 uncultured 221706506207366659 Chlorella Micractiniumsp. USTB-01 sp.207366660 CCAP Micractinium sp. CCAP 1583251 207366674207366639 Micractinium Micractinium sp. CCAP sp. CCAP 207366673207366661121489583 Micractinium Micractinium Micractinium sp. CCA sp.P CCApusillumP 51095192 Chlorella vulgaris 207366677 MicractiniumAuxenochlor sp. CCAP 19 Chlorella pyrenoidosa 207366656 Chlorellarella pyrenoid sorokiniana207366672osa Micractinium sp. CCAP 207366657 Chlorella sorokinian 207366638 Chlorella sorokiniana ella protothecoides 37674526 Diacanthos37674523 belenophorus Chlorella sorokiniana 207366675 Micractiniu 37674522 Chlorella sorokiniana 248/16 rebouxia photobiont T 37674529207366676 Dic Micractinium169798016 sp. Chlorella sp. MA 248/7 rebouxia824 photobiont 21 23 248/13 T Y 1/92 21 1/1 207366679 Y658 207366678 1/1 248/14 1F 207366680 248/2 tyosphaerium sp. W a ‚Micractinium‘ m sp. Actinastrum hantzschii Actinastrum hantzschii TKF1 207366650 Parachlorella kessleri rebouxia glomerata 207366681 ParachlorellaTP-2008a kessleri T 161338477 Koliella sempervirens62997326 Chlorella vulgarisActinastrum hantzschii 189182606 uncultured 6 1 27526737 Koliella sempervirens37674536 Parachlorella kessleri T 189182669 uncultured 352 -2008a rella sp. TP-2008a photobiontrebouxia MN400irregularis T Asterochloris photobiont R olf 2000-1 Asterochloris photobiont MN610rebouxia irregularis 207366649 Parachlorella beijerinckii Asterochloris photobiont TR rebouxia irregularis 187469844 T 51863533 58866277 Asterochloris photobiont MN750 biont MN354 1 51863526 94483229 51863528 37674535 Parachlorella beijerinckii 94483230 51863527 Asterochloris photobiont MN360 51863529 1 5886627694483224 51863531 157678523 Asterochloris photobiont MN288 51863524 oto 94483228 Asterochloris photobiont MN459 5886627551863523 51863522 Asterochloris 217885 Pseudochlorella sp. CCAP 20147523 Chlo 255653883 loris photobiont MN749 Y101 51863532 5385421351863521 51863525

187469845 rebouxia excentrica 34559472 Coccomyxa27526738 Koliella sp. C14 longiseta 5385421 58866278 187469846 83265433 34559468 Coccomyxa sp. C9 53854215 T 1 1 Asterochloris photobiont R Asterochloris photobiont 34559469 Coccomyxa sp. C19

rochloris photobiont MN274 7 53854220 Asterochloris photobiont MN751 58866279 AsterochlorisAsterochloris photobiont R photobiont R Asterochloris photobiont R 34559461 Coccomyxa sp. C6 94483220 51863543 Asterochloris photobiont R Asterochloris photobiont R Asterochloris photobiont MN433 94483221 1 Asterochloris photobiont R 53854222 ochloris photobiont MN945 Asterochloris photobiont Asterochloris photobiont R 51863540 Asterochloris photobiont R Asterochloris photobiont R 51863530 34559457 Coccomyxa sp. C18 1 1718 53854221 Asteroch 1 1 Asterochloris sp. PS-2007c 94483219 8 Asterochloris photobiont R Asterochloris photobiont CI2 Asterochloris photobiont MN350 1 AsterochlorisAsterochloris photobiont photobiont R 34559466 Coccomyxa sp. C5 1 7 94483222 rochloris photobiont MN550 8 8

Aste 5 8 34559465 Coccomyxa sp. C4

5 1 Asterochloris photobiont R 7 7 Chlorellales Asterochloris photobiont MN356 157 a

0 Asterochloris photobiont R Asterochloris photobiont MN944 7

Asterochloris photobiont MN 7 123692639 Pseudochlorella pyrenoidosa ont L1 8 4 4 Aster c 34559456 Coccomyxa sp. C12

4

8

Asterochloris ph 8 i 9 4 34559463 Coccomyxa sp. C15

Asterochloris photobiont MN94 6 1 5

t 6 6 5 rebouxia erici otobiont L15 tica Asterochloris photobiontAsterochloris MN61 photobiont L54 07e a 3 Aste Y1 7 2 Y967 2 6 7 Asterochloris photobiont CLA1 5 0

T n 0 Asterochloris photobiont MN826 9 9 chloris photobiont L3 n Y666 n Y765 2 157678546 4 1 9 53854210 1 Asterochloris photobiont S1

187469839Asterochloris sp. PS-2007a 8 7 8 o 8 o 8 8 Y581 53854214 Y836 a 1 1 34559464 Coccomyxa sp. C2 1 i

ph i t R 8 i Asterochloris photobiont 34559460 Coccomyxa sp. C13 A 5 6 photobiont 158 7 34559470 Coccomyxa sp. C20 7 53854212 l 6 6 s 51863541 b

Asterochloris photobiont S3 b 34559467 Coccomyxa sp. C7 6 Asterochloris photobiont R Y Y Y Asterochloris photobiont R Asterochloris photobiont R 34559459 Coccomyxa sp. C1 53854218 4 Asterochloris photobiont R 7 7

s 34559471 Coccomyxa sp. C16 6 2 0 Asterochloris photobiont R 4228 rebouxia magna a 34559462 Coccomyxa sp. C8

o 34559458 Coccomyxa sp. C10 53854219 o 1 187469843 t Asterochloris photobiont R

R R 8 R 8 i t

T

53854216 c PS-2007e c Asterochloris photobiont R A e A A A

Astero t t 53854225 t 53854217 A 5

y A 3854227 . y 53854226 a r s s 53854223 s s

n n i n 5

o s 5 Asterochloris sp. PS-2007b h Asterochloris photobiont S4 h 6 53854224 s t t Asterochloris sp. PS-2007b t t x 5385 o o loris sp. PS-2007d o 2 e

c e t e e

Asterochloris sp. PS-2007b p i i i p

t 53854229 1 157678549 e

u e h r r 51863542 Asterochloris photobiont R r r

b b b A 6 o r 255653880 s o o s o o r 21

l

Asterochloris photobiont L6 i is sp 3265435 i o

o o o 187469950 Asterochloris photobiont 142 hotobiont R o o s c Asterochloris photobiont L36 r Y r c c r c t t 8 b t Asterochloris photobiont L55 c

r t Asterochloris photobiont L16 c h h h h sterochloris photobiont L9 o 1/1A o Asterochloris photobiont R o o e o R e Asterochloris photobiont L78 i

l h

l

Asterochlori s h Asterochloris photobiont L7 r l l l l

h h A h o Asterochloris t r o o o

h l h Asterochloris photobiont 153 T l o o

p p p o p n r c r r r c 37595483 Coccomyxa subellipsoidea Asterochloris photobiont L10

sterochloris photobiont 187469858Asterochloris Asterochloris photobiont sp. PS-2007dP1 r i c 157678535 i Asterochloris photobiont L12 Asterochloris sp. PS-2007d i i Y91 Asterochloris photobiont h r Y1212 Asteroch 7 s o o s s s o Y976 83265436 Asterochloris sp. PS-2007d i s s Asterochloris photobiont L4 s 183206220 Interfilum paradoxum Asterochloris photobi i Y921 A i s h

r i i Asterochloris photobiont L63 i r Y807 y s

187469840 4 Y1099 Y838 183206221 Interfilum sp. SAG 2101

r r r Y999 b s s 187469842 s s l Y1225 Asterochloris photobiont L30 e Y798 e 51863538 c 183206225 Interfilum paradoxum Asterochloris photobiont LOMA o s 187469841 8

t Asterochloris photobiont 157 t 158325123 Pseudochlorella sp. CCAP o o p p o ‚Parachlorella‘ p p o p Asterochloris photobiont R Y1231 Y941

l o l l 1 183206240 Interfilum sp. SAG 2100 p t r 9

s

s

h . . . . Asterochloris sp. PS-2007e i

h b Asterochloris sp. PS-2007e h Y1004 Y918 rochloris photobiont R . Asterochloris sp. PS-2007e 157678525 Asterochloris sp. PS-20 o s Asterochloris sp. PS-2007e 6 Asterochlo 157678533 A o L A P L 157678529 P c c

i

P h 4 183206241 Interfilum sp. SAG 2101

o

69850 t E p 157678526 E o S o S

5

9

o 157678542 p 7 S n

r 193807318 Botryococcus braunii r 157678541 Asterochloris photobion h

157678537 P 183206235 Interfilum sp. SAG 36.88 P

Asterochloris photobiont R 7 5 157678544 157678538 b - - 44844332 Botryococcus braunii 183206223 Interfilum183206224 sp. SAG Interfilum 2102 sp. SAG 2147 8 ebouxia glomerata t

e e - Asterochloris p s o 157678543 Aste 2 2

i r 8 t i t 8 i 2

Asterochloris photobiont L70 c 183206242 Interfilum sp. SAG 2100 1 3 o 1 Asterochloris photobiont R r t 157678539 0 0 57678536 Asterochloris photobiont R 219813145 s s T 0 157678540 157678557187469849 9 0 o 187469851 a Y1010 219812565 6 1874 n 0 o 157678551 1 187469848 0 0 Y968

5 Asteroch l 6 Y1242 A 157678545 A 0 b Y1240

2 Y762 t 469857 7 7

8

h 4 219814185 uncultured 157678528 7 i Y1 157678550 3537 8 7 1 7469856 6 o 157678553 f f Y1

c

1

7 f

8 219812739 uncultured 2 4 n 2 193807316 Botryococcus braunii

o

7 219813701 uncultured 187469854 8 163 187469855 3 2 0 18 r 2 t 187 187469852 160 187469853 1 1

5 3 e 3 L 94483225 t 5186 6 8 1 8 193807323 Botryococcus braunii 51863535 51863534 s

483223 5 4 8 4 37595484 Coccomyxa rayssiae

51863536 157678531 51863539 A 5 4 4 193807324 Botryococcus braunii Y242

94 2 9 4 9

4

5

3 193807322 Botryococcus braunii

6

8 Y650

1

5

T rebouxiophyceae T T rebouxiophyceae rebouxiophyceae 37595482 Coccomyxa peltigerae 21 ‚Coccomyxa‘

1/1A 193807321 Botryococcus braunii

44844331 Botryococcus braunii

44844333 Botryococcus braunii 189038677 Dictyochloropsis symbiontica

193807325 Botryococcus braunii 193807320 Botryococcus braunii Microthamniales II ‚Interfilum‘ 0.3 ‚Asterochloris‘ ‚Botryococcus + Dictychloropsis‘

Figure 38: Profile Neighbour joining tree calculated on the class of Trebouxio- phyceae (with 100 bootstrap replicas) using all ITS2 sequences and structures available from the ITS2 database version 3.0 (2009/09/29). The image was taken from Buchheim et al. (2011a). 15.C Phylogenetic tree of class Ulvophyceae 134

15.C Phylogenetic tree of class Ulvophyceae

Bryopsidales

195976347 Cladophora sp. FL-2008 195976348 Cladophora sp. FL-2008 3820845 Caulerpa mexicana 1 1 10006337 Caulerpa racemosa 10006335 Caulerpa racemosa

ra

i

i

a x

s u

y

o

a r h

a a a a

s r u

p

s ourouxii s s

y e o

o

o o o f

r h i

m

c v m p m m

a a . turbinata-uvifera a e o e e l

l lamourouxii . lam r . turbinata-uvife . c c c .

c . turbinata-uvifera m c 111117050 Caulerpa sertularioides r

a 111117061 a a 111117063 Caulerpa1111 sertularioides . 1111 .

r a 111117069 Caulerpa sertularioides18028142 Caulerpa sertularioides r r a 111117070 Caulerpa sertularioides r 111 f

111117060 Caulerpa sertularioides

v . 111117062 Caulerpa sertularioides . . . laetevirens 3820847 Caulerpa mexicana 111117053 Caul a

r m 21436559 Caulerpa racemosa f. macrophysa 1111 r r 111117059 Caulerpa sertularioides a

1 21436555 Caulerpa racemosa var 111117058 C 1 1 v 21436558 Caulerpa racemosa f. macrophysa 1 111117056 Caulerpa sertularioides 1 1 a a . 111117055 Caulerpa sertularioides a a

s

1 111117052 Caulerpa sertularioides 1 1 17068 Caulerpa sertularioides 1 f 111116844 Caulerpa prolifera 1 1111 17064 Caulerpa sertularioides 1 1 s v 111117073 Caulerpa sertularioides v v a 1 o 111117072 Caulerpa sertularioides 1 a a 1 17057 Caulerpa sertularioides 1

1 1 1 . laetevirens o 111117065 Caulerpa sertularioides 1 111116841 Caulerpa prolifera a 111116840 Caulerpa prolifera s 1 111116842 Caulerpa prolifera a 1 s . turbinata-uvifera 1 111117067 Caulerpa sertularioides 1

1 1 a 111117054 Caul 1 17051 Caulerpa sertularioides a a m 1 s 1 1 1 s o 1 1 o 2 1 1 s a 111117074 Caulerpa sertularioides1 s s m osa

13925477 Caulerpa prolifera 1 1 e o

1111 5 5 17071 Caulerpa sertularioides o 111116849 Caulerpa racemosa 5 6 1 13925476 Caulerpa prolifera 4 s 5 5 o 1 o o e 1 m 1 m c

17066 Caulerpa sertularioides 8 8

8 8 9 c 10006328 Caulerpa racemosa 8 8 1 m

1 m e emosa e a

m Caulerpa sertularioides m m 5 6

5 4 4 r a 17049 Caulerpa sertularioides 5 5 6 e c acemosa e c

e e e r

9 0

3 8 3 16850 Caulerpa racemosa 7 6 c

8 c a a

c a c c

r r 1 a 3820848 Caulerpa mexicana a 3 a C C

C C a p 3820846 Caulerpa mexicana a a C C r

r

7 r r r r p 9 a aulerpa sertularioides a

a rpa racemosa var a

a a r

a a 5 e a a p 13925488 Caulerpa prolifera p l u e u a a a C u u e

u u r r erpa sertularioides p

p

C u l l 111117089 Caulerpa serrulata p p p l l r

l l r

e e e mosa a e e e r r r e e l l 1 a

a e

e u r r

r r e l e e r r l u 1 u lerpa racemosa

p p l l l p p u C p p l 1117084 Caulerpa serrulata u u rpa racemosa a e a

a a u erpa sertularioides u u a a l a a . lamourouxii

e a

a 9 r 1 Caulerpa racemosa var

a a a C Caulerpa rac C

p

a Caulerpa racemosa a

r a p 1 a a C pa racemosa

C

p

C a C C s s 6 rolifer s 6 5 s s

lerpa racemosa

a

h h 8

h 7 5 4 h h 4

1 p 0 2 aulerpa racemosa

m m 1

m 4 5 acemosa 8 p m m 6

6 r 6 6 ‚Caulerpa‘ 5 7090 Caulerpa racemosa o 17097 Caulerpa r 8 6 6 r 8

5 5 5 e e

e

e e o 4 l 6 3 1 8 6 17102 Caulerpa racemosa 6 6 i a a a 37719538 Caulerpa racemosa var a a a f l 6

1 1 16848 Caulerpa racemosa 4 37575120 Caulerpa racemosa var 1 17100 Cau i 2

e 3 3 3 d 2143654 d 17098 Caulerpa racemosa d f 21436544 Caul d d

1 21436543 Caul 1 8 21436542 Caulerpa racemosa var 1 1 1 e

r 4 4 4 1 i i 1111

i

i i 1 1 i i a 8 1 i 2 r i i

1 1 1 11111 111117093 7 Caulerpa racemosa 1 111117094 Caulerpa racemos 1 . clavifera a 111117095 Caulerpa racem 1 111117096 2 28848897 Caulerpa racemosa var

2 111117105 Caulerpa racemosa 2 2

1 111 1 111 28848896 Caulerpa peltata 21951974 Halimeda gigas 111117101 Caulerpa racemosa 111 37719537 Caulerpa peltata peltata 111117104 Caulerpa28848895 racemosa Caulerpa peltata37719536 Caulerpa peltata 32365526 Halimeda minima 111117099 Caulerpa racemo Caulerpa r 2143654613925485 Caulerpa Caulerpa racemosa racemosa10006334 Caule 561 1392548213925480 Caulerpa Caulerpa racemosa10006332 racemosa Caulerpa racemosa 21951973 Halimeda macrophysa 28848898 Caulerpa racemosa var139254831 Caulerpa racemosa 13925479 C111116826 Caulerpa racemosa 37719539 Caulerpa racemosa var 1 10006333 Caulerpa racemosa 21436550 Caulerpa racemosa10006331 Cau 561 1 1 006326 Caulerpa racemosa 561 17712 Halimeda cuneata 21436551 Cauler 10006338 Caulerpa racemosa 21951970 Halimeda21951968 tuna Halimeda lacunalis 111116828 Caulerpa1 racemosa 5626669356266694 Halimeda56266691 Halimeda56 1cuneata Halimeda cuneata cuneata 32365484 Halimeda21951971 lacunalis Halimeda discoidea 111116825 Caulerpa1000632 race 3236549332365483 Halimeda Halimeda tuna 32365491 lacunalis Halimeda discoidea 10 17708 Halimeda17718 cuneataHalimeda discoidea 11111682421436548 Caulerpa1 1 Caulerparacemosa21436553 racemosa Caulerpa racemosa 561 561 21951977 Halimeda taenicola 16827 Caulerpa racemosa 561 55501477 Halimeda taenicola 32365490 Halimeda discoidea 10006324 Caulerpa racemosa 17707 Halimeda cuneata 32365489 Halimeda discoidea 1 56156266690 Halimeda cuneata 1 21436549 Caulerpa racemosa 21951972 Halimeda tuna 17714 Halimeda17705 Halimedacunea cuneata 55501478 Halimeda tuna 111116822 a taxifolia 561 17706 Halimeda cuneata 55501476 Halimeda taenicola 21436552 Caulerpa111116823 racemosa Caulerpa racemosa 32365492 Halimeda tuna 111 561561 17713 Halimeda cuneata ulerpa taxifolia 17720 Halimeda tuna 1772217721 Halimeda Halimeda tuna tuna 2143654710006330 Caulerpa Caulerpa racemosa racemosa 1 10006336 Caulerpa racemosa6830 Caulerpa racemosa 1 1000632521436545 Caulerpa Caulerpa racemosa racemosa var 21436554100063291 Caulerpa Caulerpa racemosa racemosa 6831 Caulerpa cupressoides axifolia 21951985 Halimeda minima 1 2143655716829 Caulerpa Caulerpa racemosa 085 Caulerpa serrulata 11111 9 Caulerpa taxifolia 21951975 Halimeda magnidisca 1111 32365500 21951978Halimeda Halimedamagnidisca cuneata 11111 3236552432365499 Halimeda 32365510Halimeda minima Halimedamagnidisca cuneata ifolia 111111111708317 Caulerpa serrulata 21951986 Halimeda minima 1111170861111 Caulerpa17075 serrulataCa 288645211111 Caulerpa1707611 17078serrulata Caulerpa Caulerpa taxifolia taxifolia 1111117077 Caulerp 4965991349659914 Halimeda Halimeda minima minima 1 1117080 Caulerpa taxifolia a taxifolia olia 111111707 f 1 165583 Caulerpa taxifolia 32365523 Halimeda minima t 111117081165582001031 Caulerpa Caulerpa Caulerpa taxifolia taxifolia taxifolia ia a 21 21 32365552 Halimeda distorta 18 1 Caulerpa taxifolia 17644225182016 Caulerpa Caulerpa taxifolia taxifolia 111820261 Caulerpa taxifolia 111820211 Caulerpa t 11182010182013 Caulerpa Caulerpa taxifolia taxifolia 32365554 Halimeda distorta3236552221951987 Halimeda Halimeda minima minima 111820231 Caulerpa taxifolia 32365556 Halimeda323655 distorta1 1 182028 Caulerpa taxifolia 3236555732365555 Halimeda Halimeda hederacea distorta32365512 Halimeda gracilis 1118201 1 32365553 Halimeda distorta 1 3236554432365551 Halimed Halimedaa distorta distorta2195198121951979 Halimeda Halimeda gracilis gracilis 1118202782022 Caulerpa Caulerpa taxifolia taxifolia 32365513 Halimeda gracilis 11182025 Caulerpa tax 32365543 Halimeda 32365547 Halimeda distorta 11182024 Caulerp 32365550 Halimeda distorta 111 32365549 Halimeda distorta 1 Halimeda gracilis 11182029 Caulerpa taxifolia 11182015182017 Caulerpa Caulerpa taxifolia taxi 32365548 Halimeda distorta 111820141 Caulerpa taxifolia taxifolia 49659909 Halimeda distorta 1 a 2132365545951990 Halimed Halimeda distorta 11182012 Caulerpa taxifolia 32365546 Halimeda distorta 1118202013925506 Caulerpa Caulerpa taxifolia taxifolia 32365540 Halimeda opuntia 49659910 Halimeda distorta 18001030165578 Caulerpa Caulerpa taxifol taxifolia 2195 49659912 Halimeda distorta 35187999 Caulerpa taxifolia 2113925489 Caulerpa taxifolia folia 35187994 Caulerpa taxifolia 219519891991 Halimeda Halimeda hederacea opuntia 35187989 Caulerpa taxifolia 21449848 Caulerpa taxifolia ifolia 32365538 Halimeda opuntia 35187995449849 Caulerpa Caulerpa taxifolia taxifolia 55501482 Halimeda opuntia 17644226 Caulerpa taxifolia 32365541 Halimeda32365531 opuntia Halimeda opuntia 21 32365535 Halimeda opuntia 12289321417644224 Caulerpa Caulerpa taxifolia 32365542 Halimeda aopuntia distorta 32365536 Halimeda opuntia496599496599151 181983 Caulerpa taxifolialerpa taxifolia 1118201811 Caulerpa taxifolia 11181982 Caulerpa taxifoli 11181988181987 Caulerpa Caulerpa taxifolia taxifolia 11 olia 1 Halim 11181985 Caulerpa taxifolia Halimeda opuntia 11181181974973 Caulerpa Caulerpa taxi taxifolia 11 181962 Caulerpa tax 11 eda opuntia 11181965 Cau 32365534 11181970 Caulerpa taxifolia 4965990832365539 Halimeda opuntia 11181967 Caulerpa taxifolia 11181968 Caulerpa taxifolia 2193236553751988 Halimeda Halimeda opuntia opuntia 21449847 Caulerpa taxifolia 3236552932365527 Halimeda Halimeda opuntia 21449845 Caulerpa taxifolia Halimeda opuntia 17644219 Caulerpa taxifolia 17644220 Caulerpa taxifolia ‚Halimeda‘ 32365530 Hali Halimeda opuntia 17644216 Caulerpa taxifolia 32365533 Halimeda opuntia 11182004 Caulerpa taxif 32365532 Halimeda opuntia 21449850 Caulerpa taxifolia 11181998 Caulerpa taxifolia 32365528 Halimeda opuntia 11122893215182000 Caulerpa Caulerpa taxifolia taxifolia 17644222 Caulerpa taxifolia 11182002 Caulerpa taxifolia 21951966 Halimedam crypticaopuntia 11182006 Caulerpa taxifolia eda opuntia 11182003182007 Caulerpa Caulerpa taxifolia taxifolia 11 11182008 Caulerpa taxifolia 55501472 Halimeda micronesica 11182005 Caulerpa taxifolia 55501471555014 Hal70 Halimeda micronesica 38208661182019182009 Caulerpa Caulerpa Caulerpa taxifolia taxifolia taxifolia 21951967 Halimeda fragili 1 11 11181972181975 Caulerpa Caulerpa taxifolia taxifolia 11 21449846 Caulerpa taxifolia 35187993 Caulerpa taxifolia imeda micronesic 17644217 Caulerpa taxifolia 56181 17644223 Caulerpa taxifolia 56181 18001032 Caulerpa taxifolia 56181 193 Halimeda incrassata 111181995181996 Caulerpa Caulerpa taxifolia taxifolia 56181 192 Halimeda incrassata 1 3236547756181 190 Halimeda Halimeda incrassata incrassata 24943174 Caulerpa taxifolia 32365474 191 Halimedaa incrassatas 11181990 Caulerpa taxifolia 56181 32365476 189Halimeda Halimeda incrassata incrassata 11181993 Caulerpa taxifolia 56181 111819941181997 Caulerpa Caulerpa taxifolia taxifolia 183 Halimeda56181 incrassata 1 32365475188 Halime Ha Halimeda incrassata 11181971 Caulerpa taxifolia 21951963 Halimeda181 Halimeda incrassata incrassat 21449844 Caulerpa taxifolia 56181 11181989 Caulerpa taxifolia 56181 limeda incrassata 11181992 Caulerpa taxifolia 32365472186 Halimeda incrassat 187 Halimeda incrassata 13925508 Caulerpa taxifolia 5 Halimeda incrassata 13925491 Caulerpa taxifolia 561861811 da incrassata 111140018182001 Caulerpa Caulerpa taxifolia taxifolia 21951964185 Halimeda Halimeda incrassata incrassata 1 184 Halimeda incrassata 11181961 Caulerpa taxifolia 5618 a 11181991 Caulerpa taxifolia 323654731 Halimeda incrassata 34447108165580 CaulerpaCaulerpa taxifoliataxifolia 55501469182 Halimeda macroloba 21 32365467 HHalimeda macroloba 13925486 Caulerpa taxifolia 32365468 Halimedaalimeda macrincrassataoloba 35188000 Caulerpa taxifolia 32365469 Halimeda macrolob a 13925474 Caulerpa taxifolia 13925502 Caulerpa taxifolia 5550146832365470 Halimeda Halimeda macroloba macroloba 13925503 Caulerpa taxifolia 21951962 Halimeda macroloba 13925496 Caulerpa taxifolia 32365465 Halimeda macroloba 13925498 Caulerpa taxifolia 32365466 Halimed 1392550013925505 Caulerpa Caulerpa taxifolia taxifolia 32365471 Halimeda macroloba 35187990 Caulerpa taxifolia 2195195832365464 Halimeda Halimeda cylindrac macroloba 3820865 Caulerpa taxifolia 3236545032365452 Halimeda Halimeda cylindracea cylindracea a 11181980 Caulerpa taxifolia 56181 17644218 Caulerpa taxifolia 32365449 Halimeda cylindraceaa macroloba 11181999 Caulerpa taxifolia 32365453180 Halimeda Halimeda cylind cylindracea 3954794 Caulerpa taxifolia 21165579 Caulerpa taxifolia 32365451 Halimeda cylindr 18001033 Caulerpa taxifolia 3820864 Caulerpa taxifolia 2195195556181206 Halimeda incrassata ea 17644221 Caulerpa taxifolia 56181204 Halimeda incrassata 35187991 Caulerpa taxifolia 56181 Halimeda incrassatracea 3820863 Caulerpa taxifolia acea 35187997 Caulerpa taxifolia 56181194 Halimeda incrassata 35187998 Caulerpa taxifolia 56181195 Halimeda incrassata 35187996 Caulerpa taxifolia 56181197 Halimeda incrassataa 11181969 Caulerpa taxifolia 56181203196 Halimeda incrassata 11181963 Caulerpa taxifolia 56181 11181964 Caulerpa taxifolia 56181201198 Halimeda Halimeda incrass incrassata 11181966 Caulerpa taxifolia 56181200 Halimeda incrassata 11181976 Caulerpa taxifolia 32365444 Hali 11181977 Caulerpa taxifolia 32365441 Halimeda incrassata 11181981 Caulerpa taxifolia 32365443 Halimedameda incrassataincrassataata 11181978 Caulerpa taxifolia 5618561812021 Halimeda 13925494 Caulerpa taxifolia 199 H 35187992 Caulerpa taxifolia 32365442alimeda Halimeda incrassata i incrassata 21165581 Caulerpa taxifolia 5618122132365446 Halimeda Ha simulansncrassata 11181984 Caulerpa taxifolia 56181224 Halimedalimeda simulans simulans 11181986 Caulerpa taxifolia 56181223 Halimeda simulans 18001034 Caulerpa taxifolia 56181222 Halimeda simulans 13925492 Caulerpa taxifolia 56181226 Halimeda sim 11181979 Caulerpa taxifolia 56181225 Halimeda simulans 32365445 Halimeda simulans 32365447 Halime ulans 32365448 Hal da simulans 21951957 Halimedaimeda sisimul 56181220 Halimeda simulansans 56181219 Halimeda mulans 56181205 Halimedasimulans incrassata Prasinophyceae 56181207 Halimeda melanes 56181208 H 56181210 Halimedaalimeda melanesicamelanesicaica 56181209 Halimeda melanesica 5618121 323654403 Halimeda Halimeda monile monile 21951956 Halimeda monile 5618121 5618121 2 Halimeda monile 6542793065427940654279316542793565427926 1097Micromonas Micromonas Micromonas Micromonas pusilla pusilla pusilla pusilla 1 Halimeda monile 65427938654279374681 Micromonas pusilla (outgroup) 56181218 Halimeda monile 6542793665427927 MicromonasMicromonas pusillapusilla 56181217 Halimeda monile 2265230576542794165427939 Micromonas Micromonas pusilla sp. RCC299 56181216 Halimeda monile 6542793265427934654279223 MicromonasMicromon Micromonasas pusilla pusilla 56181215 Halimeda monile 65427928665427935427923 Micromonas Micromonas pusilla pusilla 55501467 Halimeda borneensis 654279296542792565427924 Micromonas Micromonas pusilla pusilpusillala 134254397 Ulva pertusa 56181173 Halimeda borneensis 134254395134254396134254398 Ulva pertusapertusa 56181171 Halimeda borneensis 113425440134254399 Ulva pertusapertusa 56181172 Halimeda borneensis 134254394134254400 UlvaUlva perpertusatusa 56181178 Halimeda borneensis 134254388134254313425439392 UlvaUlva pertpertusa 56181179 Halimeda borneensis 134254390134254389 Ulva pertusape usa 32365457 Halimeda borneensis 92 134254291131342543425438791 Ulva pertusartusa 32365458 Halimeda borneensis 134254292134254318 Ulva pertusaertusa 32365456 Halimeda borneensis 134254315134254316134254317 Ulva UlUlva lactuca pertusasp. NY00 32365462 Halimeda borneensis 13425431342543121 Ulvava lalactuca sp. NY004 1342543131342543141 Ulva lactucactuca 9 32365459175 HalimedaHalimeda borneensisborneensis 134254310134254308 Ulva lactuca 56181174 Halimeda borneensis 134254306134254307134254309 UlvaUlva lactucalactlactuca 56181177 Halimeda borneensis 134254304134254305 Ulva lactucala 56181176 Halimeda borneensis 31442259 Ulvaria fusca 134254301134254302134254303 UlvaUUlva lactucalactucalactuca 56181 1 Acrochaete sp. 5-BER-2007 61 134254300 Ulva lactucauca 32365461 Halimeda borneensis 157695712 134254319134254320134254321 UlvaUlvalva l armoricanaarmoricarmoricanactuca ana 21951961 Halimeda borneensis 15769571 13425432813425432948947074894705 Ulva UlvaUlva flexuosa muscoidemuscoidesactuca 55501466 Halimeda borneensis 134254340 Ulva sp. NY1 32365463 Halimeda borneensis 134254344134254341 Ulva sp. NY126NY1 21951960 Halimeda stuposa 157695760 Bolbocoleon piliferum 134254407134254326134254327 Ulva linza Ulva compressa 157695753 Bolbocoleon piliferum 98 70 1342544101342544091342544081342543231342543241342 Ulva54325 linzalin UlvaUlvUlvaa compressacompressa 3236546032365454 Halimeda Halimeda borneensis borneensis 156629290 Bolbocoleon piliferum 134254403134254444061134254322 Ulva Ulva linza linza Ulva compressa 32365455 Halimeda borneensis 157695754157695755 Bolbocoleon piliferum 134254404134254405134254402 UlvUlvaa linzalinza s 157695758 Bolbocoleon piliferum 134213425429613425429754294 UlvaUlvaUlva Ulva linzalinzalinza linza 1917 157695757157695759 Bolbocoleon piliferum 134254299134254293134254295 UUlva1Ulvalva Ulva linzalinza linza mpressa 222457887222457889134254298 UlvaUlva proliflinzaeraza 29335815 Halimeda cylindracea 157695762 Bolbocoleon sp. BER-2007 22245788222457886 Ulva prolifera 222457879222457881222457884222457891222457876222457882222457878 Ulva UlvaUlva Ulva proliferaUlva proliferaprolifera Ulva Ulvaprolifera prolifera prolifera prolifera 79 2224578222457885222457880 6689903 UlvaUlva Ulva proliferapro lproliferaUlvalinza procera 157695764157695765 epi/endophyte epi/endophyte Ulvales Ulvales66576158 sp. sp. 1-BER-2007 1-BER-2007 Blidingia dawsonii 13222457222457888 U 66576159 Blidingia dawsonii 1342543341342543331342543364254335 Ulva Ulva UlvaUlva3 Ulva procera procera proceraprocera 156629291157695766 epi/endophyte Ulvales sp. 1-BER-2007 134254339134254337134254332 Ulva UlvaUlva procera proceraprocera 5834515 Blidingia chadefaudii 134254338134254331877 Ulva90 Ulva Ulva proceraintestinalis prolifera prolifera 157695763156629292 epi/endophyte epi/endophyte Ulvales Ulvales sp. 1-BER-2007 sp. 2-BER-2007 134254330 Ulva intestinalisprolifera 157695767 epi/endophyte Ulvales sp. 2-BER-2007 1576957893331116103501668501 epi/endophyte Ulvales undulatum sp. 6-BER-2008 157695768 epi/endophyte Ulvales sp. 2-BER-2007 157695788111610349ulotrichales Ulothrixepi/endophyte sp.l vaX3 proliferaUlvales sp. 6-BER-2008 157695781576915769578744845786419310245 epi/endophyteepi/endophyte uncultu Protomonostromaepi/endophyte Ulvales Ulvales Ulvales undulatum sp. sp. sp. 6-BER-2008 6-BER-2008 6-BER-2008 YG2 111610348 Ulothrix11604461 Ulva sp. Gloeotilopsis intestinalis X1 ifera planctonica 18032152 Blidingia minima 44409524 Urospora15459641545963 wormskioldii Protoderma Gloeotilopsis sa paucicellularisrcinoidea 157695775 epi/endophyte Ulvales18032151 sp. 4-BER-2007 Blidingia minima 44409527444095224440953144409517111610347 UrosporaUrospora Urospora Urospora neglectawormskio neglecta sp. 18032153 Blidingia minima . blyttii 4440951555977967444095264440952544409529 Urospora UrosporaUrospora 55977966 UrosporaUrospora penicilliformis neglectaneglecta sp. neglectaChlorothrix U122 sp. C167 156629294157695774 epi/endophyteepi/endophyte UlvalesUlvales sp.sp. 4-BER-2007 4019196 4440951444409521 Urospora Urospora penicilliformis wormskioldii 157695776 epi/endophyte Ulvales1803215418032155 sp.1099 4-BER-2007 Blidingia BlidingiaMonostroma minima minima grevillei 44409516427418695690383 Urospora penicilliformis 157695771 epi/endophyte Ulvales333 11sp.1032101 3-BER-2007 Monostroma32967694 angicavagrevillei32964998 Percursaria Ulva percursa taeniata 432432401919511 157695773 epi/endophyte Ulvales33311030 sp. 3-BER-2007Monostroma arcticum 156629293 epi/endophyte Ulvales33311028 sp. 3-BER-2007Monostroma arcticum31442269 Ulva flexuosa 33311103 Monostroma grevillei var 157695769157695770 epi/endophyte epi/endophyte Ulvales Ulvales3331 sp. sp. 3-BER-2007 3-BER-2007 185186 Capsosiphonred Urospora groenlandicus 333180321601203366535524626 Monostroma Monostroma Monostroma31442287 nitidum 210076555grevillei31442266 arcticum Ulva reticulata Ulva sp.flexuosa L Acrosiphonia coalita 20336652 Monostroma grevillei 2886452328864522Acrosiphonia Caulerpa Caulerpa cupressoides cupressoides sp. Sk5 583482731442262 Umbraulva 5834550Umbraulva31442268 olivascens Ulva amamiensis sp.Ulva SY0107 flexuosa AcrosiphoniaAcrosiphoniaAcrosiphoniaAcrosiphonia coalita sp. arcta arcta 3144226766899106689909 Ulva UlvaUlva flexuosa linzalinza ‚Bolbocoelon‘ 32967688 Umbraulva olivascens6689908 Ulva linzaYG1 3331 32967695 Ulvaria obscura32967683 32967684var Ulva sp. Ulva WTU344779 californica 5834828 Ulva sp. SSO0102 AB58 1026 Monostroma nitidum 5834546 Ulva prolifera 6689902 Ulva prolifera ldii 5834547 Ulva prolifera 5524623 Ulva clathrata AH/Sk/BI/Bm 1576957502661009 Ulva Ulva prolifera prolifera 28864532 Caulerpa webbiana 156629287552462432967687 Ulva Ulva prolifera Ulvaintestinalis lobata 157695749210076554 Ulva Ulvaprolifera sp. L 5834824 Ulva rigida 32967689 Ulva rigida 1 5834822 Ulva scandinavica noi 2132260117652874 sp. LM98 5834821 Ulva fenestrata 157695665 . vahlii 157695699 21322602156629279 1 Ulva muscoides 31442260 Umbraulva japonica 314422813144228231442284 Ulva Ulva Ulvascandinavica armoricana armoricana ctuca 15662927531442261 Umbraulva japonica 157695691156629276 157695658 157695692 32967690 Ulva sp. WTU344835 157695690157695689 157695686157695688157695687 157695660157695656 157695657 31442283 Ulva armoricana 157695683 157695646 157695694157695695 157695693 157695647 157695718 5834549 Ulva muscoides 157695675 157695650 157695715 157695643 1 157695713156629280 157695652157695644 157695661157695655 157695717 157695654 4894706 Ulva muscoides 1 157695659 157695716157695706 157695664157695653157695648 4894704 Ulva muscoides 5 157695710157695714 ‚Blidingia‘ 489471 5 1 156629278156629283157695730 Acrochaete viridis 7 157695701 lva la 157695672 156629277 4894712 Ulva muscoides 157695674 157695666157695673 157695671157695663 157695702 157695736 1 1 1 1576956 1 157695696157695742 4894713 Ulva muscoides 7 157695697 5 157695705 157695737 489471041529329 Ulva muscoides Ulva ohnoi sa 6 157695734157695738 7

7 5 5 5 5 489470941529328 Ulva muscoides Ulva ohnoi 6 157695740157695651 5834823 Ulva scandinavica 7 157695709157695707157695708 157695733 7

9 41529324 Ulva ohnoi 7 156629282 Acrochaete viridis

7 157695703 0

0 7 7 7 7 489470841529325 Ulva muscoides Ulva ohnoi 9 157695698 Acrochaete sp. 2-BER-2007 6 ‚Ulva II‘ 157695735 sp. HT4 0 157695700 5 157695741 Acrochaete sp. 1-BER-2007 41529326 Ulva ohnoi 0 157695704

0 0 156629286 0 6 6 6 6 41529327 Ulva oh 5

9 Acrochaete sp. 4-BER-2007 0

6 0

2 0 157695732 Acrochaete sp. 1-BER-2007 2 9 9 9 9 6

5 Acrochaete sp. 1-BER-2007

2 -

7 5524620 Ulva rigida - 2 156629281 Acroc

2 157695729 5 5 5 5 - 7 157695731 Acrochaete sp. 1-BER-2007 - 6 156629285 Acrochaete sp. 1-BER-2007 -

8 Acrochaete sp. 1-BER-2007

Ulva R AcrochaeteAcrochaete sp. sp. 1-BER-2007 1-BER-2007 5834825 Ulva taeniata R Acrochaete sp. 3-BER-2007 6 6 6 6 Acrochaete sp. 3-BER-2007 p. WTU344819 7 Acrochaete sp. 3-BER-2007 Acrochaete sp. 1-BER-2007 AcrochaeteAcrochaete sp. 3-BER-2007 sp. 3-BER-2007 AcrochaeteAcrochaete sp. sp.1-BER-2007 1-BER-2007 Acrochaete repens Acroc 8 Acrochaete sp. 3-BER-2007 R va linza Acrochaete sp. 3-BER-2007 Acrochaete sp. 1-BER-2007 Acrochaete repens Acrochaete sp. 3-BER-2007 5524619 Ulva lactuca R Acrochaete sp. 3-BER-2007 Acrochaete sp. 1-BER-2007 5834815 U Acrochaete repens Acrochaete sp. 3-BER-2007 Acrochaete sp. 1-BER-2007 AcrochaeteAcrochaete sp. 3-BER-2007 sp. 3-BER-2007 157695720 Acrochaete repens

R Acrochaete sp. 3-BER-2007 Acrochaete repens AcrochaeteAcrochaete sp. 5-BER-2007 repens E A 6 6 7 6 6 Acrochaete repens E Acrochaete sp. 4-BER-2007 s AcrochaeteAcrochaete sp. 1-BER-2007sp. 1-BER-2007 5524617 Ulva scandinavica 2 Acrochaete sp. 1-BER-2007 A Acrochaete sp. 4-BER-2007 5834826 Ulva pertusa E 5524614 Ulva scandinavica E AcrochaeteAcrochaete sp. 6-BER-2007sp. 6-BER-2007Acrochaete viridis E Acrochaete sp. 4-BER-2007 9 7 Ulva sp. HT7 a 2 0 8 B c Acrochaete sp. 4-BER-2007 5524616 Ulva scandinavica B 157695719 AcrochaeteAcrochaete sp. 4-BER-2007 leptochaete Acrochaete viridis haete sp. 1-BER-2007 c Acrochaete sp. 4-BER-2007 A 157695721 B - Acrochaete viridis - B 157695722 Acrochaete sp. 4-BER-2007 Acrochaete sp. 2-BER-2007 Acrochaete sp. 2-BER-2007 Acrochaete viridis Acrochaete sp. 2-BER-2007 157695723 r Acrochaete sp. 2-BER-2007 Acrochaete Acrochaetesp. 2-BER-2007 sp. 2-BER-2007 B 157695727157695728 157695724 Acrochaete sp. 4-BER-2007 AcrochaeteAcrochaete viridis sp. 1-BER-2007 - A A 5524615 Ulva scandinavica A A A 157695725 - 156629284 31442288 Ulva spinulosa r Acrochaete viridis

o

5524618 Ulva scandinavica 3 - 32967685 Ulva fasciata32967681 Ulva linza 3 c Acrochaete viridis haete sp. 2-BER-2007 o AcrochaeteAcrochaeteAcrochaete sp. sp. 4-BER-2007 sp. 4-BER-2007 4-BER-2007 31442286 Ulva fasciata 3 5834548 Ulva flexuosa 67693 Ulva stenophylla752 Ulva linza 3 c c c c c Acrochaete sp. 4-BER-2007 31442285 Ulva fasciata lis c Acrochaete sp. 4-BER-2007 Acrochaete viridis . 3 r Acrochaete viridis

.

c

1908514 Ulva sp. RZ1 Acrochaete Acrochaetesp. 4-BER-2007 sp. 4-BER-2007 . 5524622 Ulva linza o r . r Acrochaete leptochaete is r r r

h

p .

p

h o o 5834814 Ulva lactuca o o o s s Acrochaete viridis s

s p c

p

a i i i s

i p 21 32932967680 Ulva tanneri s l l l 157695751 Ulva linza a c c c c Acrochaete repens c l s h Acrochaete sp. 6-BER-2007 208343776 195984436 Ulva linza 5834816 Ulva lactuca s Acrochaete viridis

e

31442270 Ulva linza s Acrochaete repens

h h a a a h h h 31442271 Ulva linza e a e

e a

31442277 Ulva pertusa t t e 31442273 Ulva lactuca t 3144227831442276 Ulva Ulva pertusa pertusa e a a n n n a a a t 32967691 Ulva sp. WTU344801 31442274 Ulva lactuca e n 31442275 Ulva pertusa e t 5834545 Ulva intestinaloides e 5524621 Ulva rigida i i t 31442280 Ulva pertu i 157695 i e

e t

31442279 Ulva pertusa e Acrochaete heteroclada inalis 32967686 Ulva fenestrata t t 219932445 Ul e e ssa e e e 208343777 5834820 Ulva californica t

e t

e

31442272 Ulva arasakii s

a e e stinalis s s a e s t 199584165 Ulva prolifera t t t 255504593 Ulva sp. DYL-2009 t s

a

a

r p e e e e e Acrochaete heteroclada h a Acrochaete heteroclada 32967692 Ulv h est e e 32967682 Ulva procera e p AcrochaeteAcrochaete heteroclada heteroclada h Acrochaete heteroclada t t t s Acrochaete repens h

t . Acrochaete repensAcrochaete heteroclada 208343774 Ulva sp. HT1 c Acrochaete heteroclada 208343775 Ulva sp. HT3 c h

. s s s s s

c p n n n c n 3

a intestinalis o c i i i o i p p p p p 3

o r intestinalis . 199599917 Ulva prolifera r o

-

o

98 07Ulva Ulva compressa compressa r . . r . . . -

c B

a c r a a a 3 a

‚Monostroma‘ B c

c 2 2 2 2 2

Ulva intestinalis v v v c v va intestinalis - va intestinalis A E A l l l l

E B A - - - - -

A

A

R B B B B B

4

U U U U 1

R E

6

0

E 8 5 E E E E

- 26609994102629 Ulva Ulva compressa compressa 8

R 0 7 - 1 8 9

2 2660926610 8

6 8

6 2 R R R R R

5 5 4 6

0 6 4102630 Ulva compressa - 5834544 Ulva compressa 5 6 5834543 Ulva compressa 5 0

2 - - - - - 9907 Ulva intestinalis 5 6 6 6

0 5

9 5 2 2 9 2 2 2 45598251 Ulva 2660988compressa Ulva intestinalis 0

45598249 Ulva intestinalis 0 6 6 6 9 2660990 Ulva intestina 7 5834819 Ulva pseudocurvata 2660986 Ulva intestinalis 9

6 9 0 0 6 0 0 0 266099526609972660996 Ulva Ulva intestinalis Ulva intestinalis intestinalis 7 45598250 Ulva compressa 2660989 Ulva intestinalis 0 5834818 Ulva pseudocurvata 3 6 3 3 5834817 Ulva pseudocurvata 583454226609852660994 Ulva Ulva intestinalis Ulv intestinalis5834541 Ulva intestinalis2660993 6 7 6 Ulvales II 2660992 Ulva intestinalis 0 0 26609846685013 Ulva Ulva intestinal intestinalis 7 0 0 0 60987 Ulva intestinalis 668 7 6689906 Ulva intestinalis 3 3 3 7 31442263 Ulva compressa 7 1597234 Ulva intestinalis 5 7 7 7 2661003 Ulva compressa 31442264 Ulva intestinalis 5 7 7 7 157695747 Ulva sp. BER-2007 6685012 Ulva intestinalis 0 0 5 455982461 Ulva intestinalis 5

1 5 6685010 Ulva compressa 157695746 Ulva sp. BER-2007 1 2661004 Ulva comp 26 660991 Ulva intestinalis 2 1 45598248 Ulva57695743 intestinalis Ulva inte 20 2 2033664 64251252661002 Ulva Ulva compressa compressa 31442265 Ulv 1 2661005 Ulva compressa 45598247 Ul ‚Umbraulva‘ 1 2661006 Ulva compressa 2 6685014 Ulva intestinalis1566292881 Ul 6685015 Ulva intestinalis 157695744 Ulva intestinalis 157695745 Ulva intest

Ulvales I

‚Urospora‘ ‚Acrosiphonia‘ ‚Ulva I‘

‚Acrochaete‘ 0.3

Figure 39: Profile Neighbour joining tree calculated on the class of Ulvophyceae (with 100 bootstrap replicas) using all ITS2 sequences and structures available from the ITS2 database version 3.0 (2009/09/29). The image was taken from Buchheim et al. (2011a). 15.D Sequence-structure alphabets 135 - - - ) N/else C ↔ ↔ ( N/else ) G T ↔ ↔ ( U ) G T ↔ ↔ ( T U ) P ↔ ↔ ( G - - - T ) P ) ↔ ↔ Z ( G else/N ) C ) L U F ↔ ↔ ) L T ( G ) G G ) G M ↔ ↔ ) C C ( C ) R A A ) S ↔ ↔ ( ( Y U else/N ) A I ( U S ↔ ↔ ( T I ( T ) U ( E G L ↔ ↔ ( ( C A D ( T ) A A - - L ↔ ↔ ( A . X N else/N . else/N N X . U K . T U U D . T K . T T T D . H G . G G N G . C C C Q . coding sequence and structure with 10 letters coding sequence and structure with 12 letters C R coding sequence only with 4 letters . A A A N . A A Compatible treeforge input format: fasta,xfasta Compatible treeforge input format: xfasta Compatible treeforge input format: xfasta 15.D Sequence-structure alphabets DNA RNA10 RNA12 Sequence Substitution Sequence Structure Substitution Sequence Structure Substitution 15.D Sequence-structure alphabets 136 - - - else/N else/N ( ) V ↔ ↔ W else/N else/N G G I ( ) T ↔ ↔ U U G G I ( ) T ↔ ↔ T T U U ( ) P G ↔ ↔ G G T T ( ) P G ↔ ↔ G G C C ( ) F E ↔ ↔ G G G G ( ) Q M ↔ ↔ C C A A ( ) S H ↔ ↔ U U . A A X ( ) S else/N H ↔ ↔ T T . U D U U . T D ( ) L C ↔ ↔ A A . N G T T . C R ( ) L C ↔ ↔ . A A A A coding sequence and structure with 16 letters Sequence Structure Substitution Sequence Structure Substitution Sequence Structure Substitution Compatible treeforge input format: xfasta RNA16 15.E Sequence and sequence-structure score matrices on species datasets 137

15.E Sequence and sequence-structure score matrices on species datasets

Score matrix DNA species Score matrix RNA10 species

A. ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ● A ● ● ● ● C.

G. ● ● ● ● ● ● ● ● ● ●

T. ● ● ● ● ● ● ● ● ● ● C ● ● ● ● A()T ● ● ● ● ● ● ● ● ● ●

T()A ● ● ● ● ● ● ● ● ● ●

● Nucleotides G ● ● ● C()G ● ● ● ● ● ● ● ● ● ●

G()C ● ● ● ● ● ● ● ● ● ●

● ● ● ● Sequence−structure tuples/pairings T ● ● ● G()T ● ● ● ● ● ● ●

T()G ● ● ● ● ● ● ● ● ● ●

A C G T A. C. G. T. A()T C()G G()T

Nucleotides Sequence−structure tuples/pairings

Score matrix RNA12 species Score matrix RNA16 species

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● A. A. ● ● ● ● ● ● ● ● ● C. ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● C. ● ● ● ● ● ● ● ● ● ● ● ● G. ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ● ● G. ● ● ● ● ● T. ● ● ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ● T. ● ● ● ● ● ● ● ● ● ● ● ● A(T ● ● ● ● ● ● T(A ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● A( ● ● ● ● ● ● ● ● ● ● ● ● C(G ● ● ● ● ● ● ● ●● ● ● ● ● ●● ●

● ● ● ● ● ● ● ● ● ● ● ● C( ● ● ● G(C ● ● ● ●● ● ● ● ●● ● ● ●

● ● ● ● ● ● ● ● ● ● G( ● ● ● ● ● ● ● ● ● ● G(T ● ● ● ● ● ●● ● T(G ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● T( ● ● ● ● ● ● ● ● ● ● ● ● A)T ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ● ● ● ● A) ● T)A ● ● ● ● ● ●● ● ● ● ● ●● ● Sequence−structure tuples

● ● ● ● ● ● ● C) ● ● ● ● ● ● ● ● ● ● C)G ● ● ● ● ●● ● ● ● ● ● Sequence−structure tuples/pairings G)C ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● G) ● ● ● ● ● ● ● ● G)T ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● T) ● T)G ● ● ● ● ● ●● ● ● ● ● ● ●

A. C. G. T. A( C( G( T( A) C) G) T) A. G. A(T C(G G(T A)T C)G G)T

Sequence−structure tuples Sequence−structure tuples/pairings

Figure 40: The following bubble plots visualize four different score matrices calculated on the alphabets ’DNA’, ’RNA10’, ’RNA12’ and ’RNA16’. The calculations were based on about 1200 manually inspected species alignments taken from M¨ulleret al. (2007). Red dots indicate a positive, green dots a negative score. The size of each dot corresponds to the absolute size of its representing number. Thus, the main diagonal typically shows red, positive scoring bubbles for matching positions. Other diagonals show mostly a neg- ative score, depending on the number of exchanges which occurred for the specific tuples. 15.F Sequence and sequence-structure score matrices on genus datasets 138

15.F Sequence and sequence-structure score matrices on genus datasets

Score matrix DNA genus Score matrix RNA10 genus

A. ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ● A ● ● ● ● C.

G. ● ● ● ● ● ● ● ● ● ●

T. ● ● ● ● ● ● ● ● ● ● C ● ● ● ●

A()T ● ● ● ● ● ● ● ● ● ●

T()A ● ● ● ● ● ● ● ● ● ●

Nucleotides G ● ● ● ● C()G ● ● ● ● ● ● ● ● ● ●

G()C ● ● ● ● ● ● ● ● ● ●

● ● Sequence−structure tuples/pairings ● ● T ● G()T ● ● ● ● ● ● ● ●

T()G ● ● ● ● ● ● ● ● ● ●

A C G T A. C. G. T. A()T C()G G()T

Nucleotides Sequence−structure tuples/pairings

Score matrix RNA12 genus Score matrix RNA16 genus

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● A. A. ● ● ● ● ● ● ● ● ● C. ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● C. ● ● ● ● ● ● ● ● ● ● ● ● G. ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● G. ● ● ● ● ● ● ● T. ● ● ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● T. ● ● ● ● ● ● ● ● A(T T(A ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● A( ● ● ● ● ● ● ● ● ● ● ● ● C(G ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● C( ● ● ● ● ● G(C ● ● ● ● ●● ● ● ● ●● ● ● ●

● ● ● ● ● ● ● ● ● ● ● G( ● ● ● ● ● ● ● ● ● ● G(T ● ● ● ● ● ● ● T(G ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● T( ● ● ● ● ● ● ● ● ● ● ● ● A)T ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ● ● ● ● ● A) ● T)A ● ● ● ● ● ● ● ● ● ● ● ● ● Sequence−structure tuples

● ● C) ● ● ● ● ● ● ● ● ● ● ● ● C)G ● ● ● ● ● ● ●● ● ● ● ● ● ● Sequence−structure tuples/pairings G)C ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● G) ● ● ● ● ● ● ● ● ● ● ● ● G)T ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● T) ● ● T)G ● ● ● ● ● ● ● ● ● ● ● ● ●

A. C. G. T. A( C( G( T( A) C) G) T) A. G. A(T C(G G(T A)T C)G G)T

Sequence−structure tuples Sequence−structure tuples/pairings

Figure 41: The following bubble plots visualize four different score matrices calculated on the alphabets ’DNA’, ’RNA10’, ’RNA12’ and ’RNA16’. The cal- culations were based on about 400 manually inspected genus alignments taken from M¨ulleret al. (2007). Red dots indicate a positive, green dots a negative score. The size of each dot corresponds to the absolute size of its representing number. Thus, the main diagonal typically shows red, positive scoring bub- bles for matching positions. Other diagonals show mostly a negative score, depending on the number of exchanges which occurred for the specific tuples. 15.G PL/pgSQL functions of the ITS2 database 139

15.G PL/pgSQL functions of the ITS2 database

getGiForTaxon(text) Returns all GIs for a case insensitive taxonomic name getGiNameRankForTaxon(text) Returns all GIs, names and ranks for a case insensitive taxonomic name getLineage(int4) Returns the taxonomic lineage to a given taxid getLineageGI(int4) Returns the taxonomic lineage to a given GI createViews() Creates all views specified in 15.H getITS2OptimalFolds() Returns fold ITS2 features in the priority order of HMM, MIXED, GB, HM, BLAST: > | <fid> |[ | <fid> |] getLastITS2OptimalFolds() Returns fold ITS2 features in the priority order of HMM, MIXED, GB, HM, BLAST from the last homology modelling iteration: > | <fid> |[ | <fid> |] getITS2OptimalUnfolds() Returns unfold ITS2 features in the priority order of HMM, MIXED, GB, HM, BLAST: > | <fid> | | | |[ | <fid> | | | |] updateTaxonCountsRecursive() This method updates all parent ids up to the root (not the root itself) with new taxon sums makeProductionReady() ˆ Deletes all features which are not of element ITS2 ˆ Deletes fold features which are redundant in the priority order of HMM, MIXED, GB, HM, BLAST ˆ Deleting annotations which are in conflict with fold features in the priority order (for fold) of DIRECT, HM, PARTIAL ˆ Deletes unfold features which are redundant in the priority order of HMM, MIXED, GB, HM, BLAST ˆ Drops table gb indexes; this table is not needed in production envi- ronment ˆ Creates triggers: taxonCountsChanged, featureCountsChanged, structureCountsChanged ˆ Updates taxonomy tree with structure counts public.textcat null(text, text) Concatenates strings containing NULL pg grant(TEXT, TEXT, TEXT, TEXT) Grants rights so user, e.g. select pg grant(’userreadonly ’,’select’,’%’,’public’); select pg grant(’userall ’,’select,insert,update,delete’,’%’,’public’); pg revoke(TEXT, TEXT, TEXT, TEXT) Revokes rights from user pg drop() Drops all tables

Table 7: Overview of PL/pgSQL functions available in the ITS2 database 15.H Views of the ITS2 database 140

15.H Views of the ITS2 database

ViewITS2Folds Returns gi, fid, seq, str, ann, start lt stop, modstart, modstop, iter from all folded ITS2 features including duplicate annotations ViewITS2Unfolds Returns gi, fid, seq, str, ann, start lt stop, modstart, modstop from all unfolded ITS2 features including duplicate annotations ViewITS2OptimalAnnotated Returns gi, fid, seq, start lt stop, modstart, modstop from all ITS2 features in the priority order of HMM, MIXED, GB, HM, BLAST ViewITS2OptimalFolds Returns gi, fid, seq, str, ann, start lt stop, modstart, modstop, iter from all folded ITS2 features in the priority order of HMM, MIXED, GB, HM, BLAST ViewITS2OptimalUnfolds Returns gi, fid, seq, str, ann, start lt stop, modstart, modstop from all unfolded ITS2 features in the priority order of HMM, MIXED, GB, HM, BLAST

Table 8: Overview of views available in the ITS2 database

15.I Aggregate functions of the ITS2 database

textcat all Concatenates grouped strings containing NULL

Table 9: Overview of PL/pgSQL aggregate functions available in the ITS2 database

15.J Operators of the ITS2 database

||+ Concatenates VALUE ||+ NULL to VALUE

Table 10: Overview of PL/pgSQL operators in the ITS2 database

15.K Triggers of the ITS2 database

taxonCountsChanged AFTER UPDATE ON taxons, updates taxon counts recursive featureCountsChanged AFTER INSERT OR DELETE ON features, updates taxon counts recursive structureCountsChanged AFTER INSERT OR DELETE ON structures, updates taxon counts recursive

Table 11: Overview of triggers, set in the ITS2 database 15.L ITS2 data generation – folders 141

15.L ITS2 data generation – folders

sql SQL-scripts creating and deleting the database with all functions, etc. lib Perl modules and different interfaces scripts All scripts executed for data generation, invoked by generate.pl hmms HMMER2 HMM database and source sequence files for HMM generation log Log-directory during the generation process tmp Temporary files backup PostgreSQL database backups from different stages bin Binary files like EMBOSS package, HMMER, Unafold, etc.

Table 12: Overview of folders for ITS2 database generation located under: ’rRNA DB/db/trunk’

15.M ITS2 data generation – scripts and modules

generate.pl Basic entry point for data generation 01-mk-pg-database.pl An empty PostgreSQL database is created 02-fill-taxonomy-tree.pl The taxonomy tree is downloaded and written to the database 03-download-genbank.pl Download of important GenBank sub-databases 04-mk-genbank-index.pl Index of GenBank files is created 05-parse-ncbi-search.pl GenBank entries retrieved from NCBI-search are parsed 06-split-genbank.pl GenBank databases are split for HMM annotation 07-write-ncbi-search.pl Features and sequences from NCBI-search are written to the database 08 - 09 HMM annotation 10 Direct folding of sequences 11-* Homology modelling 12-* BLAST-based annotation 13-* Partial structure prediction

Table 13: Overview of scripts for ITS2 database generation located under: ’rRNA DB/db/trunk/scripts’

DbInterface.pm Module for interfacing with a database, provides methods for e.g. writing features, sequences, receiving ITS2 optimal folds or unfolded features GbInterface.pm A comprehensive module for interfacing with GenBank data files; these can be read directly from the index file; contains accessors for obtaining sequence, lineage, gi, features and more Index.pm A module for creating, appending and reading from an index file; basically usable for any kind of files; independent from GenBank, just requires a specific file separator FastaParser.pm Robust parser for FASTA format, parses also 60 line-break FASTA GbParser.pm An extremely speed optimized parser for GenBank data files NCBISearch.pm Module for NCBI queries and downloads Blast.pm Module for BLAST search and parsing Fold.pm Module for folding sequences using UNAFold Nussinov.pm Module for post-folding homology modelled structures HMM.pm Parsing of various HMMER2 annotated features ITS2check.pm A module that tests whether the secondary structure of an ITS2 seems to be valid globals.pm This module contains all paths, constants and settings used during the update process

Table 14: Overview of Perl modules for ITS2 database generation located under: ’rRNA DB/db/trunk/lib’ 15.N JavaScript files of the ITS2 workbench frontend 142

15.N JavaScript files of the ITS2 workbench frontend

accordion analyse Visualization of panel in the accordion under ’Analyse’ tab accordion create Visualization of panel in the accordion under ’Create’ tab accordion manage Visualization of panel in the accordion under ’Manage’ tab accordion own seq Visualization of panel in the accordion under ’Create’ tab under ’add own data’ button accordion taxbrowser Visualization of panel in the accordion under ’Create’ tab under ’from database’ button analyse alignmentvis2 Visualization of panel for showing a reduced and speed optimized version of the alignment analyse alignmentvis Visualization of panel for showing the full version of alignment analyse treevis Visualization of panel for showing trees applayout Direct JavaScript entry point, configures viewport with a border layout blast Visualization of panel when performing a BLAST search COPYING License information create annotate Visualization of panel under ’Create’ tab for own sequence annotation create known Visualization of panel under ’Create’ tab for own sequence upload create model Visualization of panel under ’Create’ tab for own sequence homology modelling create motif Visualization of panel under ’Create’ tab for own sequence motif search create predict Visualization of panel under ’Create’ tab for own sequence-structure prediction create seqvis Visualization of panel under ’Create’ tab and live-search for database seq-str search devel Visualization of panel for bug tracker - visible only on the development web server its2 classes Collection of general classes, specific to the ITS2, e.g. seq-str parser, etc. its2 SequenceEditor Class for ITS2 sequence-structure input main accordion Visualization of accordion panel on the west side of the border layout main dragdrop Drag & Drop management for the whole web interface main examples ITS2 data examples for different components, mostly loaded into ’SequenceEditor’ main functions Basic unspecific functions used in the whole workbench e.g. error and alert messaging main handler Basic event handlers of the workbench, specific event handling is included in each file main header Visualization of the header menu in the north of the border layout main interactions Functions for interaction of different components like sequence transfer between panels main overrides Some basic overrides and extensions of the ExtJS framework main pool Visualization of panel handling the pool and sequence-structure drag & drops main tools Visualization of panel containing the tools section above the accordion on the west side manage poolvis Visualization of panel for managing the data pool containing sequences and structures

Table 15: A short description of the major JavaScript files in the frontend of the ITS2 workbench. 15.O Templates of the Template Toolkit 143

15.O Templates of the Template Toolkit

about Visualization of ’About’ page (currently hidden) accordionAnalyse Visualization of panel in the accordion under ’Analyse’ tab accordionManage Visualization of panel in the accordion under ’Manage’ tab analysetreevis Visualization of panel for phylogenetic tree annotate Visualization of annotation results from ’Annotate’ tab citation Visualization of ’Citations’ page (currently hidden) citedby Visualization of ’Cited by’ page contact Visualization of ’Contact’ page createannotate Visualization of panel in ’Annotate’ tab createknown Visualization of panel in ’Own sequences’ upload tab createmodelHM Visualization of homology modelling results from ’Model’ tab createmodel Visualization of panel in ’Model’ tab createmotif Visualization of panel in ’Motif’ tab createpredictDF Visualization of direct fold results from ’Predict’ tab createpredictHM Visualization of homology modelling results from ’Predict’ tab createpredict Visualization of panel in ’Predict’ tab erroralert Visualization of error messages error Visualization of error message when template is not found flowchart Visualization of ’Flow chart’ page funding Visualization of ’Funding’ page (currently hidden) fun Visualization of ’Fun’ page grouppubbib Visualization of ’Group Publications (BibTeX)’ page grouppub Visualization of ’Group Publications’ page helpannotate Visualization of help page showing usage information about ’Annotate’ helpblast Visualization of help page showing usage information about ’Blast’ helpmodel Visualization of model page showing usage information about ’Model’ helpmotif Visualization of motif page showing usage information about ’Motif’ helppredict Visualization of predict page showing usage information about ’Predict’ helpsequenceeditor Visualization of sequence editor page showing usage information about sequence editor highlight Visualization of ’Highlight papers’ page howtocite Visualization of ’How to cite us’ page links Visualization of ’Links’ page (currently hidden) motif Visualization of motif search results from ’Motif’ tab showdetails Visualization of details page after sequence-structure search staff Visualization of ’Staff’ page statistics Visualization of ’Statistics’ page (currently hidden) supplements Visualization of ’Supplements’ page (currently hidden) updates Visualization of ’Updates’ page (currently hidden) usagepolicy Visualization of ’Usage policy’ page whatsnew Visualization of ’What’s new’ page

Table 16: A short description of the major template files of the ITS2 work- bench. 15.P Controller of the ITS2 workbench backend 144

15.P Controller of the ITS2 workbench backend

alignment Creates a sequence or sequence-structure based alignment from all sequences in the pool; stores the alignment in the pool alvis Reads alignment from pool and prepares it for highlighting open/closing brackets of the secondary structure (this step is integrated in the backend for performance speed-ups) annotate Provides HMM sequence annotation; further calculates images of stem hybridization and returns newly annotated ITS2 positions blast Performs blast search on either sequence only, sequence-structure or sequence-structure database without partials; allows to cancel a Blast search or to download results devel Provides submission and retrieval of bugs to an XML file livesearch Provides a quick live search on taxonomic names and returns the number of available structures; also provides a method for fetching sequence and structure records for specific GIs managedataset Allows to manage the pool by removing all sequences, all alignments, all trees or even the whole pool marker Gives the short name of all markers stored in the ITS2 database model Runs a full homology modelling on template and target sequences; supports multiple templates, multiple structures and the option to find one best consensus template motif Performs motif search on sequences and returns start and stop position for each motif; if secondary structure is available, an SVG-file with coloured motifs is created pool Provides the possibility to add/remove sequence-structure entries or complete taxa to the pool; also returns information such as number of elements in the pool poolvis Returns all information available for sequences inside the pool; further calculates corresponding images of secondary structures; or exports the sequence part of the pool predict Folds a sequence, if folding fails in the web frontend, automatically a homology modelling step is performed by using the best hits of a prepended Blast search. Root The Root controller deletes expired sessions, creates a new session and redirects to the index.html file seqdownload A controller which handles the database download of selected/all sequences, alignments, trees when data is not fully available in the frontend yet seqvis Provides all details visible after a sequence search, either by the live-search option or from the taxonomic tree session Restores/saves a (previous) session by uploading/downloading a stored pool.xml file; also provides a method to add own sequences to a pool session-file taxbrowser Returns the taxonomic tree – visible on the left side of the workbench, including sequence/structure counts; tree Reads alignment, calculates tree, writes tree to pool treevis Visualization of a tree stored in the pool; uses Newick Utilities to create an SVG-tree which is then parsed by the ExtJS in the frontend. If specified, the tree is also (re)rooted here. ttvis Responsible for the visualization of all templates defined by the Template Toolkit

Table 17: A short description of controller integrated in the backend of the ITS2 workbench. 15.Q Modules of the ITS2 workbench backend 145

15.Q Modules of the ITS2 workbench backend

Alignment.pm Creates MUSCLE or ClustalW2-based alignment Base.pm Contains basic paths and variables for the backend Errors.pm Provides text for error messages Fold.pm Folds and validates sequences/structures Format.pm Parses FASTA, XFASTA, XXFASTA, RAW, XRAW, XXRAW format Hmm.pm Parses HMMER2 output HomologyModeling.pm Provides the Homology Modelling ITS2Check.pm Checks correctness of folded ITS2 sequences ITS2Motifs.pm Annotates ITS2 motifs Pool.pm Provides methods to create a pool, add and remove sequences, alignments or trees StructurePlot.pm Provides colouring of different structures Translateprot.pm ’RNA12’ encoding alphabet for sequence and structure TransSeqStructPseudoProt.pm ’RNA12’ encoding alphabet for sequence and structure Tree.pm Module for tree calculation and re-rooting with Newick Utilities

Table 18: A short description of modules integrated in the backend of the ITS2 workbench. Publications 146

16 Publications

Markert S. M., M¨ullerT., Koetschan C., Friedl T., and M. Wolf. “Y” Scenedesmus (Chlorophyta, Chlorophyceae): the internal transcribed spacer 2 (ITS2) rRNA secondary structure re-revisited. Plant Biology, (in press), 2012.

B. Merget, Koetschan, C., T. Hackl, F. F¨orster,T. Dandekar, T. M¨uller, J. Schultz, and M. Wolf. The ITS2 Database. Journal of Visualized Exper- iments, 61:e3806, Mar 2012.

Koetschan, C., T. Hackl, M¨ullerT., Wolf M., F¨orsterF., and J. Schultz. ITS2 database IV: Interactive taxon sampling for internal transcribed spacer 2 based phylogenies. Molecular Phylogenetics and Evolution, Epub ahead of print, Feb 2012.

M. A. Buchheim, A. Keller, Koetschan, C., F. F¨orster,B. Merget, and M. Wolf. Internal transcribed spacer 2 (nu ITS2 rRNA) sequence-structure phylogenetics: towards an automated reconstruction of the green algal tree of life. PLoS ONE, 6:e16931, Feb 2011.

Koetschan, C., F. F¨orster, A. Keller, T. Schleicher, B. Ruderisch, R. Schwarz, T. M¨uller,M. Wolf, and J. Schultz. The ITS2 Database III– sequences and structures for phylogeny. Nucleic Acids Res., 38:D275–279, Jan 2010.

T. Friedrich, Koetschan, C., and T. M¨uller. Optimisation of HMM topologies enhances DNA and protein sequence modelling. Stat Appl Genet Mol Biol, 9:6, Jan 2010. 147 Curriculum Vitae 148

17 Curriculum Vitae

in

version removed

page electronic Curriculum Vitae 149

in

version removed

page electronic Curriculum Vitae 150

in

version removed

page electronic