PROTISTAN SPECIES AND THEIR DIVERSITY FROM A MOLECULAR PERSPECTIVE

A dissertation presented by

Angela Schena

to The Department of Biology

In partial fulfillment of the requirements for the degree of Doctor of Philosophy in the field of Biology

Northeastern University Boston, Massachusetts February 2012

PROTISTAN SPECIES AND THEIR DIVERSITY FROM A MOLECULAR PERSPECTIVE

by Angela Schena

ABSTRACT OF DISSERTATION

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Biology in the Graduate School of Science of Northeastern University, February 2012

ABSTRACT

Traditionally, protists have been described based on their morphology using what is often referred to as alpha taxonomy approaches. However, today it is all but certain that these approaches have not revealed the real scale of protistan diversity. The two main reasons are the current uncultivability of most microbial eukaryotes, which often makes direct observations impossible, and the lack of a comprehensive concept of species. Today, molecular (beta) taxonomy based on comparisons of DNA sequences is increasingly used to bypass the limitations of alpha taxonomy. The challenge is that the two approaches are typically used separately, and with different units of diversity. For example, it is not known to what extent the genetic distance between two taxa corresponds to morphological differences between them, and it is not clear whether morphologically defined species cluster as phylogenetically distinct genetic groups of rRNA gene sequences (Operational Taxonomic Units, or OTUs). In the end, there is little understanding of whether traditionally defined species and OTUs comprise identical, similar, or entirely different populations. Since traditional morphology and molecular taxonomy will likely both be used for the foreseeable future, it is important to understand what a morphologically defined species means in terms of gene sequence variability among the cells composing this species.

Here we use marine ciliates as model representatives of protists to investigate the level of intra- and interspecies heterogeneity in the most widely used genetic marker, the 18S rRNA gene.

Using single-cell analysis and molecular cloning, we show that OTUs comprising 18S rRNA gene sequences that share ≥99% sequence similarity correspond well to species as defined by alpha taxonomy. Therefore, at least in ciliates, there appears to be a level of genetic variability in 18S rRNA gene sequences that could be used as a proxy for morphologically defined species.

Merging alpha and beta taxonomy is very convenient for protistan diversity studies as this opens a way to assess this diversity faster and more objectively. Capitalizing on this, we surveyed diversity in several marine habitats, and statistically estimated the total ciliate richness in these habitats. The resulting throughput compares favorably to, and at times exceeds, what would have been achieved by more traditional alpha taxonomy approaches.

ACKNOWLEDGEMENTS

I would like to acknowledge and thank the people who helped me throughout my project and made my work possible.

I thank my husband, my family, especially my father, and my friends, who have always supported and motivated me by any means necessary. I would not have completed my studies without them.

I am especially grateful to my advisor, Slava Epstein, a true mentor who gave me the guidance and the confidence I needed and stimulated me with his thought-provoking conversations.

I would like to thank my committee members for their support and helpful suggestions: Virginia Edgcomb, Veronica Godoy-Carter, Edward Jarroll, Steven Vollmer.

I thank all the members of the Epstein Lab, past and present, for the many insightful discussions about science and life. I especially thank Sun Hee Hong for her friendship and her teachings, and Bill Orsi, Tine Hohmann and Annette Bollmann for their help during my field trips.

I am indebted to Thorsten Stoeck for his secondary structure model and his words of molecular wisdom.

I thank Chesley Leslin and Nathan Cahoon for guiding me through the unfamiliar territory of bioinformatics, and John Bunge for providing the CatchAll software.

I would also like to mention the people of the Estación de Investigaciones Marinas de Margarita (Margarita Island, Venezuela) for their kindness and help during my field work.

DEDICATION

This dissertation is dedicated to my mother Maria and my son Luca, my past and my future.

TABLE OF CONTENTS

Abstract 3

Acknowledgments 5

Dedication 6

Table of Contents 7

List of Figures 8

List of Tables 10

Chapter 1. Introduction 11

Chapter 2. Material and Methods 20

Chapter 3. The rRNA gene sequence variability and the species molecular signature 27

Chapter 4. Species diversity and richness predictions on small local scale 44

Chapter 5. Conclusions 66

References 69

LIST OF FIGURES

Figure 1. Geleid species used as model organisms in this study. A: G. simplex. B: G. fossata. C: G. swedmarki (courtesy of R. Droste and E. Murphy). 16

Figure 2. Maximum likelihood (ML) tree of 18S rRNA gene sequences showing the phylogenetic position of the geleids within the class Karyorelictea. The first number at the nodes represents bootstrap values (percentage out of 1000 replicates) for ML and the second the posterior probability values of the Bayesian analysis. Bootstrap values over 50% are shown. Geleia simplex is strain 0.2Nah. The class Heterotrichea was used as outgroup. 17

Figure 3. Phylogenetic conservation map and the nucleotide exchange superimposed onto the G. simplex consensus SSU rRNA secondary structure. Number of sequences = 1960. Nucleotide categories: ACGU, 98+% conserved; acgu, 90-98% conserved; acgu, <90% conserved. Letters in bold with green arrows show a position where at least one of the 20 sequences has a nucleotide different from the consensus. Numbers in brackets: numbers of sequences which show a nucleotide different from the consensus. Letters in brackets: type of nucleotide exchange. 30

Figure 4. Phylogenetic novelty of the OTUs retrieved from Nahant. (A) Distribution of the similarity to the CEM and CIM. (B) Novelty pattern of the environmental OTUs. Red circles represent the similarity to the CEM and the CIM for each OTU. 46

Figure 5. Phylogenetic novelty of the OTUs retrieved from Canada. (A) Distribution of the similarity to the CEM and CIM. (B) Novelty pattern of the environmental OTUs. Red circles represent the similarity to the CEM and CIM for each OTU. 47

Figure 6. Phylogenetic novelty of the OTUs retrieved from Greenland. (A) Distribution of the similarity to the CEM and CIM. (B) Novelty pattern of the environmental OTUs. Red circles represent the similarity to the CEM and CIM for each OTU. 48

Figure 7. Phylogenetic novelty of the OTUs retrieved from Venezuela. (A) Distribution of the similarity to the CEM and CIM. (B) Novelty pattern of the environmental OTUs. Red circles represent the similarity to the CEM and CIM for each OTU. 48

Figure 8. (Page 50) ML phylogenetic tree of 18S rDNA clones retrieved from Nahant. The first number at the nodes represents the bootstrap value (percentage out of 1000 replicates) for ML and the second the posterior probability value of the Bayesian analysis. Black dots indicate the nodes with 100% bootstrap/1 Bayesian probability values. Nahant OTUs appear in red. Phy = Phyllopharyngea. Olig = . 50

Figure 9. (Page 52) ML phylogenetic tree of 18S rDNA clones retrieved from Canada. The first number at each node represents the bootstrap value (percentage out of 1000 replicates) for ML and the second the posterior probability value of the Bayesian analysis. Black dots indicate nodes with 100% bootstrap/1 Bayesian probability values. Canada OTUs appear in red. 52

Figure 10. (Page 54) ML phylogenetic tree of 18S rDNA clones retrieved from Venezuela. The first number at each node represents the bootstrap value (percentage out of 1000 replicates) for ML and the second the posterior probability value of the Bayesian analysis. Black dots indicate nodes with 100% bootstrap/1 Bayesian probability values. Venezuela OTUs appear in red. 54

Figure 11. (Page 56) ML phylogenetic tree of 18S rDNA clones retrieved from Greenland. The first number at each node represents the bootstrap value (percentage out of 1000 replicates) for ML and the second the posterior probability value of the Bayesian analysis. Black dots indicate nodes with 100% bootstrap/1 Bayesian probability values. Greenland OTUs appear in red. 56

Figure 12. Taxonomic affiliation of the OTUs from the four studied locations. 58

Figure 13. The number of OTUs (at 99% sequence similarity) shared among the four clone libraries. 59

Figure 14. Frequency count distributions of OTUs in three clone libraries (Nahant, Canada, Venezuela), with the best-fitted parametric curve model (red) compared to three other competing parametric models. 62

LIST OF TABLES

Table 1. Classification of the phylum Ciliophora. 15

Table 2. 18S rRNA gene sequence similarities among different Geleia species (direct sequencing). 18

Table 3. 18S rRNA gene sequence similarities between several populations of G. simplex. The sequences from Nah 0.0m, 0.2m, 0.6m, 200m are from samples collected in 2002, spatially separated by 0.0, 0.2, 0.6, and 200 m respectively. Data from Droste 2003. (Nah = Nahant, MA; WH = Woods Hole, MA) 19

Table 4. Primers used in the PCR amplifications of ciliate 18S rRNA gene. In the primer names (F) indicates the forward direction and (R) indicates the reverse direction; in the primer sequences R* is an A or a G. 22

Table 5. Variability of G. simplex 18S rRNA gene sequences within a population and single cells. SD, Standard Deviation. 27

Table 6. Nucleotide composition of the G. simplex consensus sequence. 31

Table 7. Nucleotide composition of 18S rRNA in the different nucleotide categories and the number of mismatches between cells in a 20-cell collection and within single cells. 33

Table 8. Variability of G. simplex 18S rRNA gene sequences within a population and single cells. SD, Standard Deviation. 36

Table 9. Variability of G. swedmarki and G. fossata 18S rRNA gene sequences within a population and single cells as it appears after gene amplification using different polymerases. SD, Standard Deviation. 37

Table 10. Variability of P. tetraurelia 18S rRNA gene sequences within a population and single cells. SD, Standard Deviation. 38

Table 11. Number of clones and OTUs obtained for each clone library (based on 99% sequence similarity). 45

Table 12. The average similarity (%) of the OTUs to the closest environmental match (CEM) and the closest previously described/cultured groups match (CIM), with standard errors (SE). 46

Table 13. Estimates of total species richness of ciliate communities in three locations, with their Standard Errors (SE) for the Single Exponential model. 63

Chapter 1: Introduction

1.1 Protistan diversity and its challenges

With a great level of morphological and ultrastructural complexity and their wide distribution, protists are possibly the most abundant and diverse group among the eukaryotes (Patterson 1999). Yet, despite their importance, the real scale of protistan diversity is still under debate (Patterson 1999; Corliss 2001; Cavalier-Smith 1998).

Until recently, the systematics of microbial eukaryotes has relied upon cultivation and/or hand collection of specimens, and microscopic analysis of morphological characters. These conventional methods are, however, laborious and require large numbers of cells (often hundreds) to achieve acceptable results. Moreover, the presence of cryptic species and the current uncultivability of most microbial eukaryotes (Moreira and López-García 2002) often make direct observations impossible and contribute to the controversy over protistan taxonomy and how to detect and classify protistan species (Caron et al. 2004 and 2009). One of the main obstacles to improving protistan classification remains the lack of a comprehensive concept of species (Schlegel and Meisterfeld 2003), which can encompass their morphological, physiological, genetic and ecological differences. The widely accepted “biological species” concept (Mayr 1942) defines species in terms of reproductive isolation of natural populations. However, it is fairly difficult to apply this concept to the numerous asexual or inbreeding protistan species. Ecologists, on the other hand, advocate the use of an “ecological species” concept to describe organisms that share the same ecological niche (Finlay et al. 1996; Finlay 2004). However, there is still no consensus on objective criteria for defining such “ecospecies”.

The creation of a comprehensive species concept is of primary importance and can help to answer several key questions related to the diversity and distribution of protistan species.

Currently, cells that share a range of morphological characters are considered members of the same species (“morphospecies”). Today it has become evident that the extent of protistan diversity can hardly be estimated in this way. Because physiologically different cells may have similar morphologies, and likewise dissimilar-looking cells may share important aspects of ecology and behavior, the concept of morphospecies does not allow the resolution of many questions, for example in biogeography. A heated debate continues over whether protists are essentially ubiquitous but not rich in species (Finlay and Fenchel 1999), or else very diverse and mostly endemic (Foissner 1999). Without a practical way to differentiate members of one protistan species from another, such arguments may never be resolved.

These uncertainties as well as the limitations of the alpha taxonomy techniques have significantly held back the pace of discovery of new organisms.

1.2 Molecular taxonomy and its benefits

Conversely, a cultivation-independent molecular (beta) taxonomy based on comparing DNA sequences is now routinely applied to study protistan diversity, as it provides many advantages that help overcome the limitations of the more traditional approach (Caron et al. 2009). In fact, in the last decade alone, a significant number of molecular surveys using SSU rRNA gene sequences have revealed an unexpectedly high level of protistan diversity in a wide range of environments (Epstein and López-García 2008). These sequences belong to old and novel protistan lineages and, strikingly, some of them form large clades with no known cultured representatives. These works have changed our notion of protistan diversity, as they have allowed the exploration of previously inaccessible environments. However, although our knowledge of protistan richness has vastly improved, many important questions remain unanswered (Stoeck and Stock 2010).

The problem is that traditional and new approaches are normally used separately, each with its own, often very disparate, combination of characters. Doubts remain about how to relate genetic distance to morphological differences between species, and about whether morphologically defined species cluster as groups of gene sequences (OTUs).

1.3 Research aims

It is not known to what extent rRNA gene sequences correspond to the accepted “morphospecies” groups. The issue of sequence variability within a genus, a species or a single organism is important to the understanding of the boundaries of species. Is there genetic variation within morphospecies, within populations of the same morphospecies or even within a single cell? How significant is this variation? Does there exist a molecular signature of a protistan species, and what sort of variation in this signature is characteristic of intra-specific heterogeneity as opposed to inter-specific differences?

This study aims to address some of these unresolved questions to understand what 18S rRNA gene sequence variability means for morphologically defined species, in an attempt to reconcile traditional and molecular taxonomy.

The approach we took to achieve our goals is as follows:

• In Chapter 3, we investigate the level of intra- and interspecies variability in the 18S rRNA gene sequence and the possibility that a species defined by alpha taxonomy also has a unique molecular signature.

• In Chapter 4, armed with the understanding of molecular variation within morphospecies, we apply the rRNA approach to the study of ciliate diversity on a small scale.

1.4 Model species and background

We focused our attention on ciliates because they represent a large, monophyletic and arguably the best-known protistan phylum. These organisms are characterized by the presence of cilia at some stage of their life cycle and by the presence of two different kinds of nuclei: a somatic nucleus (macronucleus) and a “germline” one (micronucleus). Ciliates include free-living as well as symbiotic forms. They are globally distributed and their unique features make them easily detectable in the environment. Ciliate classification has been revised many times in the past.

According to a well-established classification (Adl et al. 2005), ciliates are divided into 11 different classes (Table 1). Note, however, that the new class Cariacotrichea was recently reported (Orsi et al. 2011), demonstrating that the quest for new taxa is ongoing and ciliate taxonomy is still evolving.

Table 1. Classification of the phylum Ciliophora.

Subphylum               Classes                    Representative genera

Postciliodesmatophora   Karyorelictea              …, Geleia
                        Heterotrichea              …
Intramacronucleata      Spirotrichea               …, Oxytricha
                        Armophorea                 …, Nyctotherus
                        Litostomatea               …, Lacrymaria
                        Phyllopharyngea            Dysteria, Chilodonella
                        Nassophorea                Orthodonella, …
                        Colpodea                   Bursaria, …
                        Prostomatea                …
                        Plagiopylea                Lechriopyla, …
                        Oligohymenophorea          …
                        Cariacotrichea nov. cl.    Cariacothrix nov. gen.

Over 3500 ciliate species have been described to date (Adl et al. 2007). However, based on the number of unknown DNA sequences found in environmental samples, Adl et al. (2007) estimate the number of species yet to be discovered to be at least an order of magnitude higher.

The ciliate species used here as model organisms belong to the class Karyorelictea, family Geleiidae (Figure 1). They generally have a long, vermiform and flattened appearance, and include important and abundant marine benthic interstitial species (Carey 1992).

Figure 1. Geleid species used as model organisms in this study. A: G. simplex. B: G. fossata. C: G. swedmarki (courtesy of R. Droste and E. Murphy).

To date, public-access databases contain the 18S rRNA gene sequences of only a few members of this group. This is mainly because these organisms are very delicate to handle and it has proven almost impossible to obtain or maintain pure cultures of these species (Andreoli et al. 2009). Therefore they represent good model organisms for the use of culture-independent molecular approaches.

Using a few 18S rRNA gene sequences retrieved from GenBank, we reconstructed the relationships between some members of the family Geleiidae (Figure 2).


Figure 2. Maximum likelihood (ML) tree of 18S rRNA gene sequences showing the phylogenetic position of the geleids within the class Karyorelictea. The first number at the nodes represents bootstrap values (percentage out of 1000 replicates) for ML and the second the posterior probability values of the Bayesian analysis. Bootstrap values over 50% are shown. Geleia simplex is strain 0.2Nah. The class Heterotrichea was used as outgroup.

The resulting phylogenetic tree agrees with the latest proposed phylogeny of Karyorelictea (Andreoli et al. 2009, Gao et al. 2010).

In the past, only a preliminary assessment of rRNA gene sequence variation at different taxonomic levels within Geleiidae species had been achieved (Droste 2003). Droste (2003) separately amplified the 18S rRNA gene of environmental populations of G. simplex, G. fossata, G. swedmarki and Geleia sp. and directly sequenced the PCR products, which were up to 1500 base pairs (bp) long, nearly the full length of the gene. Droste (2003) used the resulting rRNA sequences to assess the inter-specific variability within the genus Geleia: sequence similarity between species ranged from 97.4% to 98.8% (Table 2).

Table 2. 18S rRNA gene sequence similarities among different Geleia species (direct sequencing).

                G. simplex    G. fossata    G. swedmarki

G. fossata      98.4%
G. swedmarki    98.8%         98.2%
Geleia sp.      97.8%         97.4%         97.6%

Subsequently, Droste (2003) investigated the possible intra-specific variability among environmental populations of G. simplex. Direct sequencing of PCR products of G. simplex’ 18S rDNA amplification (Table 3) resulted in essentially identical sequences (with the exception of a single nucleotide difference in one sequence) even when the organisms were collected from habitats separated in space (>100 km) and time (years 2000, 2001, 2002). Note that the DNA for each amplification was extracted from populations of hand-collected organisms.

Table 3. 18S rRNA gene sequence similarities between several populations of G. simplex. The sequences from Nah 0.0m, 0.2m, 0.6m, 200m are from samples collected in 2002, spatially separated by 0.0, 0.2, 0.6, and 200 m respectively. Data from Droste 2003. (Nah = Nahant, MA; WH = Woods Hole, MA)

            Nah 0.0m    Nah 0.2m    Nah 0.6m    Nah 200m    WH 2001

Nah 0.2m    99.9
Nah 0.6m    99.9        100
Nah 200m    99.9        100         100
WH 2001     99.9        100         100         100
Nah 2000    99.9        100         100         100         100

The results obtained seem to indicate that there is essentially no intra-specific variability in the G. simplex 18S rRNA genes. It is important to note, however, that Droste (2003) did not clone the PCR-amplified 18S rRNA genes, nor did she address the variability at the single-cell level. Therefore this study might not have fully characterized the extent of intra-specific variability. For instance, direct sequencing of PCR products produces an average, or consensus, sequence, and this might have masked intra-specific heterogeneity due to the presence of multiple copies of 18S rRNA genes in eukaryotic organisms. We decided to examine this possibility, as described below.
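The masking effect of a consensus sequence can be illustrated with a minimal Python sketch. This is an illustration only and not part of the analyses in this study: the sequences are hypothetical toy data, and the helper names `majority_consensus` and `variable_positions` are introduced here for the example. The sketch shows that a majority consensus of aligned gene copies can be identical to the dominant copy even when minority variants are present.

```python
from collections import Counter

def majority_consensus(seqs):
    """Return the most common base at each aligned position."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

def variable_positions(seqs):
    """Return 0-based positions where the aligned sequences disagree."""
    return [i for i, col in enumerate(zip(*seqs)) if len(set(col)) > 1]

# Hypothetical aligned rRNA gene copies (toy data, not from this study).
copies = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGAACGTAC",  # minority variant at position 3
    "ACGTACGTCC",  # minority variant at position 8
]

consensus = majority_consensus(copies)   # what a consensus read reports
hidden = variable_positions(copies)      # heterogeneity the consensus hides
print(consensus, hidden)
```

Cloning individual amplicons before sequencing corresponds to inspecting each element of `copies` separately rather than only `consensus`.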

Chapter 2: Materials and Methods

2.1 Sampling

2.1.1 Locations.

Ciliates were collected manually from marine intertidal flat sediments in several locations along the Western Atlantic coast. The first location was Nahant beach, close to the Marine Science Center of Northeastern University, MA (42°26’ N, 70°56’ W). The second location was Saint Andrews, Bay of Fundy, Canada (45°4’ N, 67°3’ W), together with two other sites at 1 km (to the west) and 55 km (to the east) distance, respectively. The third location was Disko Island, Greenland (70°7’ N, 53°24’ W). The fourth location was Margarita Island, Venezuela (10°56’ N, 63°54’ W), where samples were collected from three different sites: intertidal flats in El Yaque and Parguito beaches, and La Restinga Lagoon national park. At each location, 7 cm sediment cores were collected at low tide.

2.1.2 Cell collection.

Ciliates were extracted from the sediments by a modified Uhlig's technique (Uhlig et al. 1973) as described by Small (1992). Samples were examined under a Zeiss Stemi 2000C dissecting microscope equipped for bright and dark field observations. The ciliates were collected manually using glass micropipettes and identified after fixation in Bouin's fixative and staining with Protargol stain (Silver proteinate, Cellpoint, Rockville, Maryland) as described in Montagnes and Lynn (1987).

The four morphospecies from the family Geleiidae used in the heterogeneity experiments, Geleia simplex Faure-Fremiet 1951, G. fossata Kahl 1933, G. swedmarki Dragesco 1954 and Parduczia orbis (Faure-Fremiet 1951) Dragesco 1999, were each collected manually and separately. Cells were washed two to three times in filter-sterilized seawater and processed for molecular work as described below.

2.2 Molecular methods

2.2.1 DNA/RNA extraction.

Organisms belonging to different ciliate morphospecies were collected and processed individually or in groups of 30-100 cells per collection. Genomic DNA was extracted using the DNeasy Tissue Kit (Qiagen, Valencia, CA). Some cells were collected and processed for total RNA extraction using the RNeasy Micro Kit (Qiagen). Other cells were collected individually from environmental samples, washed, placed directly into PCR tubes and processed for single-cell PCR amplification without the DNA extraction step.

2.2.2 PCR amplification.

Eukaryotic and universal primers (Table 4) were used to amplify the 18S rRNA genes (first reaction). An additional semi-nested PCR step (second reaction) using combinations of the aforementioned primers was performed whenever the first amplification failed to produce visible bands after gel electrophoresis.

Table 4. Primers used in the PCR amplifications of ciliate 18S rRNA gene. In the primer names (F) indicates the forward direction and (R) indicates the reverse direction; in the primer sequences R* is an A or a G.

Reaction           Primer name    Sequence (5’-3’)           Reference

First reaction     EukA (F)       AACCTGGTTGATCCTGCCAGT      (Medlin et al. 1988)
                   EukB (R)       TGATCCTTCTGCAGGTTCACCAC    (Medlin et al. 1988)
Second reaction    Euk528F (F)    CGGTAATTCCAGCTCC           (Elwood et al. 1985)
                   EukB (R)       TGATCCTTCTGCAGGTTCACCAC    (Medlin et al. 1988)

First reaction     Euk528F (F)    CGGTAATTCCAGCTCC           (Elwood et al. 1985)
                   EukB (R)       TGATCCTTCTGCAGGTTCACCAC    (Medlin et al. 1988)
Second reaction    Euk528F (F)    CGGTAATTCCAGCTCC           (Elwood et al. 1985)
                   U1391R (R)     GGGCGGTGTGTACAAR*GR*G      (Lane 1991)
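As a brief illustration of the degenerate notation in Table 4, the following Python sketch lists the concrete oligonucleotides encoded by a primer in which R stands for A or G. The function name `expand_primer` is hypothetical and introduced only for this example; the primer string is U1391R from Table 4 with the asterisks removed.

```python
from itertools import product

# Degenerate code used in Table 4: R stands for A or G.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG"}

def expand_primer(primer):
    """List every concrete sequence encoded by a degenerate primer."""
    choices = [IUPAC[base] for base in primer]
    return ["".join(combo) for combo in product(*choices)]

# U1391R from Table 4, written without the asterisks that mark R.
u1391r_variants = expand_primer("GGGCGGTGTGTACAARGRG")
for variant in u1391r_variants:
    print(variant)
```

With two degenerate positions, the primer corresponds to a mixture of four distinct 19-nucleotide sequences.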

For PCR amplification, HotStart Taq DNA Polymerase (Qiagen) or PfuUltra High-Fidelity DNA Polymerase (Stratagene, La Jolla, CA) was used.

RT-PCR (reverse transcriptase PCR amplification) was performed on RNA extraction samples using the QIAGEN OneStep RT-PCR Kit (Qiagen) or the AccuScript High Fidelity RT-PCR Kit (Stratagene). PCR-amplified samples were purified when necessary with the QIAquick PCR Purification Kit or the QIAEX II Gel Extraction Kit (Qiagen).

2.2.3 Cloning and sequencing.

The Taq-amplified sequences were cloned with the TA Cloning Kit or the TOPO TA Cloning Kit (Invitrogen, Carlsbad, CA). The Pfu-amplified samples were cloned with the Zero Blunt TOPO PCR Cloning Kit (Invitrogen). The clones obtained were sequenced commercially at the University of Maine DNA Sequencing Facility (Orono, ME), SeqWright (Houston, TX) and Agencourt (Beverly, MA).

2.3 Evaluation of sequence data

2.3.1 Sequence editing and alignment.

Sequences were edited manually using BioEdit (Hall 1999) and checked for potential chimeras using the Bellerophon Chimera Check (Huber et al. 2004). The sequences were then aligned using ClustalX (Thompson et al. 1997). All the sequences were compared to the GenBank databases using the BLAST (Altschul et al. 1997) search tools to determine their approximate phylogenetic affiliation.

2.3.2 Secondary rRNA structure.

Sequences from 20 G. simplex clones were aligned to obtain a consensus sequence. Individual G. simplex sequences and the consensus sequence were aligned in ARB (Ludwig et al. 2004) to more than 5000 pre-aligned eukaryotic sequences. A secondary structure of the consensus was constructed based on the 18S secondary structure model of Saccharomyces cerevisiae. To analyze regions of high, medium, and low degree of conservation, the consensus secondary structure was superimposed onto the model secondary structure of the RDP project (Wuyts et al. 2004), in which the phylogenetic conservation of 1939 eukaryotes representative of all principal eukaryotic lineages is identified. Sequences from the test morphospecies were compared to the G. simplex consensus secondary structure to assess their degree of conservation.
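The idea of sorting positions into high, medium, and low conservation can be sketched in a few lines of Python. This is a simplified illustration, not the ARB or database workflow used here: it assumes an existing alignment of equal-length sequences, uses hypothetical toy data, and adopts the 98% and 90% boundaries from the legend of Figure 3 (how values falling exactly on a boundary are assigned is an assumption).

```python
from collections import Counter

def conservation_category(column):
    """Classify one alignment column by the frequency of its commonest base."""
    top_count = Counter(column).most_common(1)[0][1]
    fraction = top_count / len(column)
    if fraction >= 0.98:
        return "high"      # 98+% conserved
    if fraction >= 0.90:
        return "medium"    # 90-98% conserved
    return "low"           # <90% conserved

def categorize_alignment(seqs):
    """Return one conservation category per aligned position."""
    return [conservation_category(col) for col in zip(*seqs)]

# Toy alignment of 50 three-base sequences (hypothetical data).
alignment = ["ACG"] * 40 + ["ACT"] * 6 + ["AGT"] * 4
categories = categorize_alignment(alignment)
print(categories)
```

In this toy alignment the first position is invariant, the second is shared by 92% of the sequences, and the third by 80%.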

2.3.3 Sequence clustering.

Cloned sequences of organisms from different locations were clustered into OTUs based on the percentage of sequence similarity as described in Hong et al. (2006). This was achieved by first making all possible pairwise sequence alignments using ClustalW at default settings (Thompson et al. 1994) and calculating percentage sequence similarities, followed by clustering of the sequences into OTUs using the unweighted pair group method with average linkage (mean linkage), as implemented in the OC clustering program (Siddiqui et al. 2001). The OTU grouping was checked manually to verify that all OTUs were assembled at the desired cutoff level. For each OTU, the sequence least different from the others (centroid) was chosen as representative of the group.
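The clustering logic described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the ClustalW/OC pipeline itself: similarities are computed on sequences assumed to be already aligned to equal length, the data are hypothetical toy sequences, and the names `percent_identity`, `cluster_otus`, `centroid`, and `mutate` are introduced only for this example.

```python
from itertools import combinations

def percent_identity(a, b):
    """Percent of aligned positions at which two equal-length sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

def cluster_otus(seqs, cutoff=99.0):
    """Average-linkage agglomerative clustering at a similarity cutoff."""
    names = list(seqs)
    sim = {frozenset(p): percent_identity(seqs[p[0]], seqs[p[1]])
           for p in combinations(names, 2)}
    clusters = [[n] for n in names]

    def avg_sim(c1, c2):
        total = sum(sim[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the pair of clusters with the highest mean similarity.
        (i, j), best = max(
            (((i, j), avg_sim(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < cutoff:
            break  # no remaining pair is similar enough to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def centroid(cluster, seqs):
    """Member with the highest summed identity to the other members."""
    if len(cluster) == 1:
        return cluster[0]
    return max(cluster, key=lambda n: sum(
        percent_identity(seqs[n], seqs[m]) for m in cluster if m != n))

def mutate(seq, positions):
    """Toy helper: substitute the base at each listed position."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    return "".join(swap[b] if i in positions else b for i, b in enumerate(seq))

# Hypothetical 200-base clones: b differs from a at 1 site, c at 10 sites.
base = "ACGT" * 50
seqs = {
    "clone_a": base,
    "clone_b": mutate(base, {0}),
    "clone_c": mutate(base, set(range(10, 110, 10))),
}
otus = cluster_otus(seqs, cutoff=99.0)
print(otus)
```

At a 99% cutoff the two clones that differ at a single site fall into one OTU, while the clone that differs at ten sites forms its own OTU; lowering `cutoff` merges progressively more divergent sequences.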

2.3.4 Novelty analysis.

For each environmental clone, the similarity values to the closest environmental match (CEM), excluding clones from this study, and to the closest taxonomically identified match (CIM) were recorded using BLAST searches. The values were used to create histograms and dispersion plots showing the level of novelty on the basis of sequence similarity to the CIM and CEM.

2.3.5 Phylogenetic analyses.

Original environmental sequences were compared to those in GenBank using BLAST analysis and the closest ciliate matches together with a few other representative ciliate sequences were used to construct phylogenetic trees.

The phylogenetic methods used were Maximum Likelihood and Bayesian analyses. Phylogenetic analyses were conducted using the GARLI (Zwickl 2006), RAxML (Stamatakis 2006) and MrBayes v3.2.1 (Ronquist and Huelsenbeck 2003) programs on the CIPRES Portal (Miller et al. 2010). The program JModelTest (Posada 2008, Guindon and Gascuel 2003) was used to choose, among 88 possible models of DNA substitution, the evolutionary model that best fit each dataset. The relative stability of the Maximum Likelihood tree topologies was assessed with 1000 bootstrap replicates using heuristic searches. A majority-rule Maximum Likelihood consensus tree was generated from the 1000 bootstrap trees using the SumTrees program (Sukumaran and Holder 2009). Bayesian analyses were conducted with posterior probability support values calculated using four chains/two runs and running 10 million generations for each alignment. Trees were sampled every 1000th generation. The first 25% of sampled trees were considered ‘burn-in’ trees and were discarded. A 50% majority rule consensus of the remaining trees was used to calculate posterior probability values.
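The burn-in and majority-rule steps can be made concrete with a small Python sketch. It is a conceptual illustration only and does not reproduce MrBayes or SumTrees: each sampled tree is represented simply as a set of clades, the example trees are hypothetical, and the function names `clade_support` and `majority_rule` are introduced here. The sampling arithmetic assumes one tree per 1000 generations per run and ignores the starting tree.

```python
from collections import Counter

def clade_support(sampled_trees, burnin_fraction=0.25):
    """Posterior support of each clade after discarding burn-in samples.

    Each tree is given as a set of clades; each clade is a frozenset of
    taxon names.
    """
    n_burnin = int(len(sampled_trees) * burnin_fraction)
    kept = sampled_trees[n_burnin:]
    counts = Counter(clade for tree in kept for clade in tree)
    return {clade: n / len(kept) for clade, n in counts.items()}

def majority_rule(support, threshold=0.5):
    """Keep clades found in more than `threshold` of the retained trees."""
    return {c: p for c, p in support.items() if p > threshold}

# Sampling arithmetic for the settings described above (per run).
n_samples = 10_000_000 // 1000                    # sampled trees
n_retained = n_samples - int(n_samples * 0.25)    # trees kept after burn-in

# Hypothetical chain of 8 sampled four-taxon trees (toy data).
ab, cd, ac = frozenset("AB"), frozenset("CD"), frozenset("AC")
chain = [{ac}] * 2 + [{ab, cd}] * 5 + [{ac}]
support = clade_support(chain)
consensus_clades = majority_rule(support)
print(n_samples, n_retained, sorted(support.values()))
```

In the toy chain the first two samples are discarded as burn-in, and the clades (A,B) and (C,D) each appear in five of the six retained trees, so both enter the 50% majority-rule consensus.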

2.3.6 Estimation of the total number of OTUs.

The total number of OTUs (observed and unobserved) in each sample was estimated by applying a set of statistical models to the frequency counts (number of times a certain OTU was observed in the clone library) at 99% sequence similarity level.

There are two main families of methods for richness estimation: parametric and nonparametric. For a given sample, the parametric methods essentially fit a parametric distribution to the observed frequency counts by maximum likelihood, and project this curve to obtain the number of unobserved OTUs. Nonparametric methods involve no parameters and normally utilize simple equations. We used the program CatchAll (Bunge 2011) to compute five widely used nonparametric estimators (Good-Turing, Chao1, ACE, ACE1, and Chao-Bunge) and eight parametric estimators (Poisson; negative binomial; inverse Gaussian, Pareto and lognormal-mixed Poisson; and mixtures of one, two, or three geometrics), which were run at every possible right-truncation point of the frequency-count data, that is, omitting outliers (highly abundant taxa in the sample) as previously described (Hong et al. 2009, Hong et al. 2006, Jeon et al. 2008, Jeon et al. 2006). The software selects the best of each model as well as the “best-of-the-best” based on low standard errors, confidence intervals, and high goodness-of-fit estimations.
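To give a sense of how a simple nonparametric estimator works on frequency counts, the following Python sketch computes the commonly used bias-corrected form of Chao1. It is an illustration only: it does not reproduce CatchAll, its model selection, or its standard errors, and the clone library below is hypothetical toy data. The function names are introduced here for the example.

```python
from collections import Counter

def frequency_counts(otu_sizes):
    """Map j -> number of OTUs observed exactly j times in the library."""
    return Counter(otu_sizes)

def chao1(otu_sizes):
    """Bias-corrected Chao1 estimate of total (observed + unobserved) OTUs."""
    f = frequency_counts(otu_sizes)
    s_obs = len(otu_sizes)       # number of observed OTUs
    f1 = f.get(1, 0)             # singletons
    f2 = f.get(2, 0)             # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

# Hypothetical clone library: number of clones assigned to each OTU.
library = [1, 1, 1, 1, 2, 2, 3, 5, 10]
estimate = chao1(library)
print(len(library), estimate)
```

Here 9 OTUs are observed, 4 of them once and 2 of them twice, so the estimate adds 4 × 3 / (2 × 3) = 2 unobserved OTUs for a total of 11.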

Chapter 3: The rRNA gene sequence variability and the species molecular signature

This chapter describes the analysis of rRNA sequence variability in environmental populations of marine ciliates, based on a single-cell approach and molecular cloning of the 18S rRNA genes. The results allowed us to identify a threshold value of sequence identity that we propose to use as a practical means of discrimination among morphospecies. We describe the experiments in their chronological order and present findings as they accumulated, sometimes changing our initial ideas, and even prompting a re-evaluation of earlier conclusions. The end of the chapter reconciles the sometimes contradictory results presented at the beginning.

We focused our attention on populations of G. simplex first. In September 2003, 43 cells of G. simplex were collected from Nahant (Massachusetts, USA) beach sediments. The genomic DNA of the samples was extracted, and two successive PCRs were performed to amplify a fragment of their 18S rRNA gene (1044 bp). The sequences of eight clones were compared. The average similarity among the clones was 99.37%, ranging from 98.8 to 99.8% (Table 5).

Table 5. Variability of G. simplex 18S rRNA gene sequences within a population and single cells. SD, Standard deviation.

Samples          Cells collected  Clones sequenced  Average identity (SD)  Similarity range
G. simplex 2003  43               8                 99.37% (0.26)          98.9-99.8%
G. simplex 2003  1                4                 99.2% (0.45)           98.8-100%
G. simplex 2003  1                4                 99.1% (0.35)           98.8-99.6%
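The summary statistics reported for each set of clones (average similarity, standard deviation and range) can be derived from all pairwise comparisons of aligned sequences. The sketch below shows one way to do this; it assumes the clones are already aligned to equal length and counts every aligned column equally, which may differ from the alignment software and gap treatment actually used. The function names and the toy sequences are hypothetical.

```python
from itertools import combinations
from statistics import mean, stdev

def percent_identity(a, b):
    """Percent identical positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def similarity_summary(clones):
    """Mean, SD, minimum and maximum over all pairwise identities."""
    values = [percent_identity(a, b) for a, b in combinations(clones, 2)]
    return mean(values), stdev(values), min(values), max(values)

# Hypothetical 10-nt toy clones, not data from this study
clones = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAC"]
avg, sd, low, high = similarity_summary(clones)
```

With n clones there are n(n-1)/2 pairwise values, so eight clones yield 28 comparisons and four clones yield six.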

The intra-population variability observed was equivalent to the difference between different morphospecies as determined by Droste (2003) (i.e., G. simplex and G. swedmarki, Table 2). We decided to find the source of this unusual variability.

It is possible that the heterogeneity observed was not among the different cells in the population but rather among the different rRNA gene copies within each cell. It is known that ciliates (as well as many other eukaryotic organisms) carry multiple copies of the rDNA clusters (Torres-Machorro et al. 2010 and references therein), and are among the few taxa that can exhibit thousands of copies in a single cell (Prescott 1994).

Testing this possibility required single-cell analyses. Using a combination of two PCR reactions, including a semi-nested one, we successfully amplified the 18S rRNA gene from two different cells separately with direct single-cell PCR (for primers see Table 4). Four clones were obtained for each single cell amplified. The average similarity between clones from the same cell was 99.1% and 99.2%. The similarity ranges were 98.2-100% and 98.8-99.6%, respectively (Table 5). We were puzzled by this high within-cell heterogeneity because it exceeded differences between different species (1.2% value, Table 2). One possible explanation was that all the observed differences were in fact artifactual, caused by amplification or sequencing errors. We sought to check this possibility, which could be done in several different ways.

3.1 Variability might be a result of sequencing errors

To verify this, nine randomly chosen clones were re-sequenced. Eight clones showed 100% similarity (out of 1044 bp) with the original sequence, while only one clone showed a single nucleotide difference from the original (data not shown). We concluded that sequencing errors were an unlikely explanation of the 18S rRNA gene sequence heterogeneity reported above and ruled out differences between sequencing runs.

3.2 Variability might be a result of PCR amplification mistakes

Three randomly chosen clones were PCR-amplified again and re-cloned. For each original clone three random “new” clones were sequenced and each showed 100% sequence similarity to the original clone sequence (data not shown). This was a strong indication that PCR amplification, at least when the template was the plasmid DNA, was not responsible for the observed rRNA gene heterogeneity.

3.3 Should amplification/sequencing errors be random?

The above evidence notwithstanding, we checked the idea of the artifactual nature of 18S rRNA gene sequence heterogeneity within and among populations of ciliates in another way. We hypothesized that while real differences between two sequences would be expected to be largely limited to variable regions of the gene, sequencing/amplification mistakes were likely to be distributed randomly over the length of the amplified fragments. Therefore, we examined the distribution of the rRNA gene sequence differences detected within and among single cells, mapped on the SSU rRNA secondary structure. An SSU rRNA gene secondary structure variability map (see Chapter 2 for details) was constructed by our collaborator Dr. Thorsten Stoeck (University of Kaiserslautern, Germany) based on the consensus sequence (Figure 3) from 20 G. simplex clones.

Figure 3. Phylogenetic conservation map and the nucleotide exchange superimposed onto the G. simplex consensus SSU rRNA secondary structure. Number of sequences = 1960. Nucleotide categories: ACGU - 98+% conserved; acgu - 90-98% conserved; acgu - <90% conserved. Letters in bold with green arrows show a position where at least one of the 20 sequences has a nucleotide different from the consensus. Numbers in brackets: numbers of sequences which show a nucleotide different from the consensus. Letters in brackets: type of nucleotide exchange.

Depending on the degree of conservation, the 18S rRNA nucleotides were divided into three categories: highly conserved (>98%), conserved (90-98%) and variable (<90%). The frequency and position of mismatches observed among the 20 G. simplex sequences are presented in Table 6. Apart from 8 insertions/deletions, a total of 118 mismatches were mapped at 107 different positions. A disproportionate number of mismatches were in the variable (<90%) regions (Figure 3 and Table 6), consistent with the idea that they were not generated by a random process.
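The claim of non-random placement can be checked informally with a chi-square goodness-of-fit calculation on the region sizes and mismatch counts reported in Table 6. This is a sketch, not an analysis performed in this study: it assumes that mismatches are independent of one another and that, under the null hypothesis, each nucleotide position is equally likely to carry a mismatch.

```python
import math

# (number of nucleotides, number of mismatches) per category, from Table 6
regions = {
    "98+% conserved": (293, 20),
    "90-98% conserved": (262, 15),
    "<90% conserved": (489, 83),
}

total_nt = sum(n for n, _ in regions.values())   # 1044
total_mm = sum(m for _, m in regions.values())   # 118

chi2 = 0.0
for name, (n_nt, observed) in regions.items():
    # Count expected if mismatches fell uniformly along the fragment
    expected = total_mm * n_nt / total_nt
    chi2 += (observed - expected) ** 2 / expected

# Three categories give 2 degrees of freedom, for which the chi-square
# survival function has the closed form exp(-x / 2).
p_value = math.exp(-chi2 / 2)
```

Under these assumptions the statistic is far above the 5% critical value for two degrees of freedom (5.99), in line with the observation that mismatches are concentrated in the variable regions.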

Table 6. Nucleotide composition of the G. simplex consensus sequence.

Nucleotide categories  Number of nucleotides  Number of mismatches  Fraction of all nucleotides representing mismatches
98+% conserved         293                    20                    6.8%
90-98% conserved       262                    15                    5.75%
<90% conserved         489                    83                    16.9%

The degree of such within-cell variability remained surprising, and we conducted an additional test. In 2004, a population of 20 cells of G. simplex was collected from Nahant beach sediments. Their DNA was extracted, the 18S rRNA genes were amplified and cloned, and 30 clones from the resulting library were sequenced. The sequence analyses confirmed the earlier finding of high within-population rRNA gene sequence variability: even when the four most divergent clones were discarded, the 26 remaining clones still showed an average similarity of only 99.27% (Standard Deviation is 0.28), with the similarity among the clones ranging from 98.5% to 100%. Notably, the consensus sequence of these clones again matched the one obtained by Droste (2003) by sequencing PCR products directly (without cloning). Using the SSU rRNA secondary structure, we mapped the position of mismatches and normalized their frequencies to the lengths of highly conserved, conserved, and variable regions of the gene. About 79% of the mismatches were in the variable region, representing 30% of the total number of nucleotides in that region. This contributed to a growing understanding that gene copies within a single cell, as well as within cells from the same population, might indeed be different – as much as those from different morphospecies. An implication of this is that rRNA molecules themselves, and thus ribosomes and possibly their performance, might be different within a single cell, and so we sought to determine whether this was the case by studying the rRNA molecules.

3.4 rRNA gene sequence variability should translate into differences between rRNA molecules.

To verify this hypothesis, we decided to analyze rRNA sequences by obtaining cDNA copies of the 18S rRNA genes. Twenty cells of G. simplex were collected from Nahant beach sediments, the total RNA was extracted and one-step RT-PCR amplification was performed. After cloning the amplification products, 23 clones were sequenced. The average similarity of the clones was 99.78% (Standard Deviation is 0.13), with the similarity between pairs of clones ranging from 99.2 to 100%.

From the same samples, the RNA from two single G. simplex cells was extracted, and a one-step RT-PCR followed by a semi-nested PCR was performed, using Taq polymerase in both cases. One cell yielded 32 clones (two clones were excluded due to a 28 bp long deletion in their sequence) and the other 39 clones. The average similarity between the clones was 99.6% (Standard Deviation is 0.22) and 99.54% (Standard Deviation is 0.21), respectively, with similarity ranges of 99-100% and 98.9-100%, higher than observed in similar experiments that used rRNA gene sequences. The consensus sequence was still 100% similar to the G. simplex sequence obtained by Droste (2003).

Following the above logic, we mapped the mismatches observed in the cDNA on the secondary structure of the 18S rRNA molecule. Given our observations on rRNA gene sequence mismatches, we were surprised to see that those on the rRNA fell mostly into the conserved regions (Table 7).

Table 7. Nucleotide composition of 18S rRNA in the different nucleotide categories and the number of mismatches between cells in a 20-cell collection and within single cells.

                                 Mismatches distribution   Fraction of all nucleotides representing mismatches
Sample         Number of clones  <90%   90%-98%   +98%     <90%    90%-98%   +98%
20 cells       23                12     6         9        2.45%   2.29%     3.07%
Single cell 1  32                30     18        18       6.13%   6.87%     6.14%
Single cell 2  39                49     21        19       10%     8%        6.48%

In conclusion, the work with the cDNA showed some variability in the rRNA molecules within and among the cells, but this variability fell short of expectations based on the much larger differences detected among the rRNA genes. We found no functional explanation for this finding, and thus re-examined the assumptions of the tests based on mapping the mismatches on the eukaryotic rRNA secondary structure. Specifically, we questioned whether PCR mistakes should indeed be randomly distributed. We realized that the process of in vitro amplification of rRNA genes is not necessarily different from in vivo DNA polymerase activity. The latter does produce rRNA sequences some parts of which are better conserved than others. It is possible that rRNA copies with mistakes in the conserved regions are selected against, but it is also possible that DNA polymerase makes fewer mistakes in such regions in the first place (or that they are repaired differently). Should that be the case, and should Taq polymerase behavior in vitro mimic the in vivo processes, the resulting mistakes would not have to be distributed randomly, as we assumed in the beginning. An ultimate test of this is to use a high-fidelity polymerase other than Taq in the PCR.

3.5 The effect of different polymerases

Taq polymerase is the enzyme of choice in most PCR amplification reactions because it is inexpensive and provides adequate results most of the time. Nevertheless, it does not have proofreading capabilities, and it is well known that the average error rate of Taq DNA polymerase is higher than that of some other thermostable polymerases (Cline et al. 1996). We decided to check the previous results with PCR amplification using the proofreading Pfu polymerase, which according to the manufacturer (Stratagene/Agilent Technologies) has an error rate of only 4.3 x 10^-7 mutations per base pair per duplication, versus two errors every 10^5 nucleotides for Taq (Qiagen HotStart Taq was used in this study).
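To give a sense of scale, the quoted error rates can be turned into an expected number of polymerase-introduced substitutions per cloned fragment. The sketch below reads the two figures as 4.3 x 10^-7 (Pfu) and 2 x 10^-5 (Taq) errors per base per duplication, and uses a simple linear approximation (rate x length x number of duplications). The number of duplications is a hypothetical value chosen for illustration, since the effective number of doublings in the two-step PCRs is not stated here.

```python
def expected_errors(error_rate, length_bp, doublings):
    """Expected substitutions per final molecule under a linear approximation:
    error rate (per base per duplication) x length x number of duplications."""
    return error_rate * length_bp * doublings

TAQ_RATE = 2e-5    # two errors every 10^5 nucleotides, read as per duplication
PFU_RATE = 4.3e-7  # manufacturer's figure quoted in the text
LENGTH_BP = 1044   # length of the amplified 18S rRNA gene fragment
DOUBLINGS = 30     # hypothetical; not a value reported in this study

taq_errors = expected_errors(TAQ_RATE, LENGTH_BP, DOUBLINGS)
pfu_errors = expected_errors(PFU_RATE, LENGTH_BP, DOUBLINGS)
fold_difference = taq_errors / pfu_errors
```

Whatever number of duplications is assumed, the ratio between the two enzymes depends only on their error rates, so Taq is expected to introduce roughly 46 times more errors per clone than Pfu under this approximation.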

The DNA extract from 43 cells of G. simplex (2003) was PCR-amplified using Pfu polymerase, followed by cloning the amplicons. The 22 resulting clones were sequenced. The length of the clones was 1008bp. They were on average 99.96% (Standard Deviation is 0.057) similar to each other and the pair-wise similarity ranged from 99.8 to 100%. The consensus sequence matched the G. simplex sequence from Droste (2003).

Similarly, we PCR-amplified the DNA extract from 20 cells of G. simplex (2004) twice using Pfu polymerase. We obtained 33 clones (one highly divergent clone was excluded) that were later sequenced. Over a length of 1044 bp, the average similarity between the clones was 99.97% (Standard Deviation is 0.07), ranging from 99.7 to 100%. Also in this case the consensus matched the original sequence from Droste (2003).

Furthermore, in 2006 we successfully extracted genomic DNA from a single cell of G. simplex collected from Nahant beach sediment samples. Using a combination of two PCRs (including a semi-nested one) we amplified the 18S rRNA gene using Pfu polymerase. The average similarity between the 46 sequences obtained was 99.98% (Standard Deviation is 0.04), ranging from 99.8 to 100%. The consensus matched the original sequence from Droste (2003).

The conclusion was that the apparent degree of intra- and intercellular 18S rRNA gene sequence variability depended heavily on the DNA polymerase used for DNA amplification: the use of Taq polymerase resulted in much higher (apparent) variability, whereas the use of Pfu polymerase erased that variability. Irrespective of their position on the rRNA secondary structure, the overwhelming majority of gene sequence mismatches detected seem to be due to mistakes originating with Taq polymerase. According to these data, it now appears that the majority of the intra-specific sequence variability we had observed up to that point was indeed created during the PCR amplification and did not reflect real differences within and among the cells, which in fact were minimal.

To complete the assessment of DNA polymerase mistakes, we decided to locate the differences among the Pfu-amplified clones on the SSU 18S rRNA conservation map for G. simplex. In the case of the 2003 samples, only three mismatches were found in the moderately conserved regions (90-98% conserved) and two in the variable regions (<90% conserved). For the G. simplex collected in 2004, only three mismatches were found, one in each conservation category (98+% conserved; 90-98% conserved; <90% conserved). For the G. simplex collected in 2006, only four mismatches were found in the sequences from one cell, all of them in the <90% conserved region and at a single nucleotide position. Interestingly, two mismatches were shared between one or more clones obtained from different DNA extracts. They were all in the <90% conserved region.

Following the rationale of the previous experiments, the rRNA gene amplification results were checked using Pfu polymerase for the amplification. Since the previous RNA extracts had been used up in the earlier Taq amplification attempts, we collected more cells of G. simplex from Nahant beach sediments. The RNA from populations of 28 cells, 39 cells and from two single cells was extracted. As shown in Table 8, the average similarity was high in all cases.

Table 8. Variability of G. simplex 18S rRNA gene sequences within a population and single cells. SD, Standard Deviation.

Sample               Cells collected  Clones sequenced  Average similarity (SD)  Similarity range  Length
RNA extract          28               22                99.86% (0.12)            99.6-100%         902bp
RNA extract          39               24                99.94% (0.10)            99.3-100%         902bp
RNA extract cell #1  1                18                99.9% (0.11)             99.7-100%         902bp
RNA extract cell #2  1                21                99.89% (0.11)            99.7-100%         902bp

We located the positions of the mismatches in the 18S rRNA conservation map for G. simplex. Interestingly, we found a smaller number of mismatches, and they were evenly scattered throughout the sequence, appearing to be randomly distributed.

To confirm the results of the comparative study of Taq and Pfu polymerases, we conducted the same type of molecular analysis using two other members of the genus Geleia: G. swedmarki and G. fossata. The organisms were collected from Nahant beach sediments, their genomic DNA was extracted, and the 18S rRNA gene was amplified using the two different polymerases, Taq and Pfu (Table 9). G. swedmarki showed a broader similarity range and a lower overall average similarity among its sequences than G. fossata. Nevertheless, the consensus generated from the clones matched the sequences of the two species obtained by Droste (2003) by direct sequencing of PCR products (100% for G. fossata and 99.89% for G. swedmarki).

Table 9. Variability of G. swedmarki and G. fossata 18S rRNA gene sequences within a population and single cells as it appears after gene amplification using different polymerases. SD, Standard Deviation.

Sample            Cells  Clones  Average similarity (SD)  Similarity range  Sequence length  Consensus
G. swedmarki Taq  47     39      98.98% (0.87)            96.4-100%         1008             99.89%
G. swedmarki Pfu  47     43      99.24% (0.78)            97.9-100%         1008             99.89%
G. fossata Taq    32     22      99.44% (0.28)            98.5-100%         1008             100%
G. fossata Pfu    32     15      100% (0)                 100%              1008             100%

The data for these species also show how the apparent gene sequence variability produced by Taq polymerase decreases, or disappears altogether, with the switch to Pfu.

To provide further evidence that the intra- and intercellular variability in 18S rRNA gene sequences we detected might not be limited to a single species or genus, some partial data on Paramecium tetraurelia strain d4-2 (ATCC), a species from a different class of ciliates (Oligohymenophorea), were obtained. We extracted the genomic DNA from 30 cells of P. tetraurelia and amplified the 18S rRNA gene using only Taq polymerase. We obtained 40 sequences ca. 550 base pairs long. The average similarity among the sequences was 99.57%. We also performed single-cell PCR on a number of P. tetraurelia cells and partially amplified the 18S rRNA gene of 12 of them. Among those, we obtained more than two clones for three of them and calculated the sequence similarity (Table 10).

Table 10. Variability of P. tetraurelia 18S rRNA gene sequences within a population and single cells. SD, Standard Deviation.

Sample       Cells  Clones  Average similarity (SD)  Similarity range
DNA extract  30     39      99.58% (0.28)            99.4-100%
cell #1      1      5       99.46% (0.17)            99.1-99.6%
cell #2      1      6       99.75% (0.12)            99.6-100%
cell #3      1      7       99.12% (0.46)            98.4-100%

To confirm these results, we compared five P. tetraurelia 18S rRNA gene sequences obtained from GenBank (accession numbers: EF502045, X03772, AF149979, AY102613, AB252009). These sequences were deposited in the database at different times by different groups of researchers. The average similarity between those sequences for the full length (about 1700 base pairs) of the gene was a surprising 98.93% (Standard Deviation is 0.8) with a similarity range of 97.8-100%. The average similarity for the part of the 18S rRNA gene that matched the position of our sequences (from about position 500 to 1700) was instead 98.89% (Standard Deviation is 0.8). P. tetraurelia DNA could not be amplified using Pfu polymerase.

3.6 Discussion

Since there is no general agreement on the correspondence between DNA sequences and the traditional species, environmental sequences are usually grouped into OTUs on the basis of 18S rRNA gene sequence similarity values. Many different values, from 95% up to 99% (sometimes within the same paper), have been used to identify species-level OTUs in culture-independent environmental surveys. This is not a minor issue, as the value chosen to discriminate between OTUs can dramatically affect the richness estimates in a given sample. We are aware of only a handful of works (Caron et al. 2009, Nebel et al. 2010) that directly addressed the question of what level of 18S rRNA gene identity corresponds to species. Caron et al. suggested 95% as a rather conservative cutoff value, and implemented it in a computer-based tool that practically separates OTUs of microbial eukaryotes. Nebel et al. focused more specifically on ciliates. While the authors found no consistent value to delineate ciliate species, they nevertheless proposed to use 98% sequence similarity as a practical means to identify species. In our view, both proposals are rather arbitrary as they do not take into account the degree of intracellular, intercellular, interspecies, intraspecies, and intrageneric variability of 18S rRNA gene sequences. The central focus of this research is to investigate this variability and on this basis relate molecularly defined OTUs to traditional taxa.
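The sensitivity of OTU counts to the chosen cutoff can be illustrated with a toy example. The sketch below uses a simple greedy, order-dependent clustering in which a sequence joins the first existing centroid it matches at or above the threshold. This is an illustrative stand-in, not the clustering procedure described in Chapter 2, and the sequences are synthetic.

```python
def percent_identity(a, b):
    """Percent identical positions between two aligned, equal-length sequences."""
    return 100 * sum(x == y for x, y in zip(a, b)) / len(a)

def count_otus(seqs, threshold):
    """Greedy clustering: a sequence founds a new OTU only if it matches
    no existing centroid at or above the threshold (in percent)."""
    centroids = []
    for s in seqs:
        if not any(percent_identity(s, c) >= threshold for c in centroids):
            centroids.append(s)
    return len(centroids)

def mutate(seq, n):
    """Return seq with its first n bases replaced by 'C' (toy divergence)."""
    return "C" * n + seq[n:]

# Synthetic 100-nt sequences differing from the first by 0, 1, 2, 4 and 8 bases
base = "A" * 100
seqs = [mutate(base, n) for n in (0, 1, 2, 4, 8)]

richness = {t: count_otus(seqs, t) for t in (99, 98, 95)}
```

In this toy set the same five sequences fall into four OTUs at 99%, three at 98% and two at 95%, showing how strongly the cutoff alone can change the apparent richness.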

In some ciliate species we detected different levels of intracellular and intercellular variability in 18S rRNA gene sequences. This variability between individual copies of the gene is masked when PCR amplification products of the 18S rRNA gene are sequenced directly, but is revealed if these products are cloned. It appears to be greater in rRNA genes and lower in rRNA copies. Part – but not all – of this variation may be an artifact, since the use of proofreading Pfu polymerase lowers the degree of the apparent intracellular variability. Nonetheless, some of this variability probably exists in both rRNA genes and rRNA, prompting a question of its functional role.

The nature of this variability is puzzling. The rRNA genes are members of a multigene family and in eukaryotic cells are organized in tandem repeats and present in multiple copies. A high degree of sequence similarity is expected since these copies evolve under concerted evolution (Elder and Turner 1995). According to this model, these genes do not evolve independently, and the organism retains homogeneous copies of these genes throughout its genome (Elder and Turner 1995). The most evident result is that the copies within a species are more similar to each other than to copies from organisms of different species (Liao 1999).

There have been reports of exceptions to this rule. Coexisting intracellular variations of 18S rRNA gene sequence were found in Plasmodium and other Apicomplexans (Rooney 2004), flatworms (Carranza et al. 1996) and higher organisms like sturgeons (Krieger et al. 2006). Recent studies also demonstrated population heterogeneity within free-living microbial eukaryotes (see Pfandl et al. 2009 and examples therein). However, there is a difference between the detection of the genes in the DNA and their expression in the RNA. In the case of Plasmodium, different “classes” of 18S sequences are found in different phases of the parasite’s life cycle (Li et al. 1997). In the case of the flatworm Dugesia, different 18S rRNA genes are present but only one is expressed (Carranza et al. 1996). Similarly, only one 18S rRNA variant in a sturgeon species represents the overwhelming majority of the transcripts (Krieger and Fuerst 2004).

If the concerted evolution model applies to G. simplex, its 18S rRNA gene sequences should be extremely similar (as it appears from our data) but should also be equally expressed in RNA transcripts (our data show that this is not the case). On the other hand, it has been reported that usually only a part of the rRNA genes is expressed in an organism at any given time (Weider et al. 2005).

It is in principle possible that some mutated variants of an rRNA gene are maintained in the organism as non-expressed pseudogenes. However, this is an unlikely situation in ciliates at least for the macronuclear copies as they are produced every time the cell goes through conjugation. Pseudogenes may of course be in the micronucleus but there are only very few of those per micronucleus (one in Tetrahymena, for example), and these cannot account for a large number of variants we observed in single-cell experiments.

Further detailed analysis of the 18S rDNA and rRNA sequences and secondary structures would be necessary to determine whether the intra-specific sequence variability is an artifact of the methodology in its entirety or really exists. Whatever the answer, we believe the real question of relevance to this study is whether it really influences the outcome of assigning organisms to different clusters (species, genera, etc.), or is too small to have an effect. Even more importantly, is there a least common denominator of the slightly different sequences observed within single cells and their populations, and can this denominator serve as a unique molecular signature of such cells?

We believe that such a signature can be found by creating a consensus sequence of several clones obtained from one or several cells. The consensus by definition masks the single sequence differences (real or artificial) uncovered by the cloning and matches nearly 100% the sequence obtained by direct sequencing of the PCR products.

For every set of sequences from DNA or RNA extracts we created the consensus sequence. Our data show that it always corresponded to the sequences obtained by sequencing the PCR products directly. So the consensus sequence from any combination of gene copies is invariable, irrespective of whether these are copies from one cell or a field population of cells. This consensus sequence may serve as a molecular signature of the ciliate species.
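The masking effect of a consensus can be sketched with a simple majority-rule calculation over aligned clones. The example assumes the clones are already aligned to equal length; the function name and the toy sequences are hypothetical, and the consensus-building software actually used may apply different rules (for example for ties or gaps).

```python
from collections import Counter

def consensus(clones):
    """Majority-rule consensus of aligned, equal-length sequences.
    Ties are resolved in favor of the base encountered first."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*clones)
    )

# Hypothetical toy data: three clones each carry one private substitution
reference = "ACGTACGTACGT"
clones = [
    "TCGTACGTACGT",  # substitution at position 1
    "ACGTACCTACGT",  # substitution at position 7
    "ACGTACGTACGA",  # substitution at position 12
    "ACGTACGTACGT",  # identical to the reference
]

signature = consensus(clones)
```

Because each substitution occurs in only one clone, it is outvoted at its position and the consensus equals the underlying sequence, regardless of whether the substitutions are real or introduced during amplification.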

Provided such a signature exists and is invariable, not only for the species studied here but also for others, it may be possible to identify a cutoff value of sequence similarity to separate molecular species. If the level of inter-specific sequence variation exceeds the intra-specific level, then the intermediate level of sequence similarity could serve as a convenient threshold value that can define ciliate species, at least from a practical perspective. The sequence data show that this value, at least for some species, is approximately 1% 18S rRNA gene sequence divergence: cells diverging by more than 1% could be considered as belonging to different species, while cells diverging by less than 1% would fall into the same species. The threshold value of 1% sequence divergence is therefore proposed here as a practical measure for identifying ciliate species from the molecular perspective.

It is worth noting that, as we show below, when populations of G. fossata were collected in Greenland one year apart (2002 and 2003), the 18S rRNA gene sequences retrieved from the samples were somewhat different. The consensus sequence from G. fossata collected in 2002 was 99.6% similar to the one obtained by Droste (2003). According to our proposed criterion, these represented the same species. However, the consensus from G. fossata collected in 2003 was only 98.5% similar to the one obtained by Droste (2003), indicating species-level differences from the cells we collected in Nahant. Interestingly, cells from all populations concerned were sufficiently similar to be considered the G. fossata morphospecies. Moreover, some ciliates collected later in Venezuela (see below) appeared to the collector’s eyes similar to G. fossata but showed an 18S rRNA gene consensus sequence differing by as much as 4% from the G. fossata of Droste (2003). If, as we propose, 1% 18S rRNA gene sequence divergence defines species boundaries, then G. fossata collected in Greenland in 2002 and again in 2003, G. fossata collected in Venezuela, and G. fossata collected locally in Nahant are all different molecular species, albeit the same morphospecies. This argues strongly that similar morphology might not be indicative of molecular similarity. Similar morphology could be the result of convergent evolution, as different organisms can occupy similar ecological niches, while molecular similarity is the measure of the evolutionary divergence of species (Caron 2009) and is therefore more relevant when assessing the phylogenetic distance between species.
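The proposed criterion amounts to a simple comparison of consensus-sequence divergence against the 1% threshold. The sketch below applies it to the similarity values to the Droste (2003) G. fossata sequence quoted above; the function name is hypothetical, and treating a divergence of exactly 1% as the same species is an assumption, since the text only specifies the cases above and below the threshold.

```python
SPECIES_THRESHOLD = 1.0  # percent divergence proposed in the text

def same_molecular_species(percent_similarity, threshold=SPECIES_THRESHOLD):
    """True when divergence (100 - similarity) does not exceed the threshold.
    Exactly 1% is treated as the same species here (an assumption)."""
    return (100.0 - percent_similarity) <= threshold

# Consensus similarity to the Droste (2003) G. fossata sequence
examples = {
    "Greenland 2002": 99.6,
    "Greenland 2003": 98.5,
    "Venezuela": 96.0,  # reported as differing by as much as 4%
}

calls = {name: same_molecular_species(sim) for name, sim in examples.items()}
```

Under this rule only the Greenland 2002 consensus falls within the same molecular species as the reference sequence, while the Greenland 2003 and Venezuela consensus sequences do not.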

Chapter 4: Species diversity and richness predictions on small local scale

In many cases, especially when several species are present in difficult-to-access environments, it is simply not feasible to retrieve large numbers of specimens for each species to achieve acceptable taxonomic identification, and thus conduct a proper species inventory by classical means. However, species do need to be practically separated and unambiguously identified for any surveying, monitoring, or biogeographic effort. Molecular methods offer a promising alternative for microbial diversity studies. Below, we present data from four molecular surveys conducted in different geographical areas as an example of their practical application.

4.1 Clone libraries

We obtained environmental samples from marine intertidal flat sediments in four different locations along the Western Atlantic Ocean coast: Greenland, Canada, U.S.A. (Nahant, MA) and Venezuela. In all of these locations, except for Greenland, the cells were randomly collected. The cells from the Greenland samples were collected and tentatively divided into species-level groups by gross morphological characters, as they appeared to the collector’s eye.

We created four independent 18S rDNA clone libraries (Table 11). The clones from each location were then grouped into OTUs based on 99% sequence similarity value (Table 11).

Table 11. Number of clones and OTUs obtained for each clone library (based on 99% sequence similarity).

Location      Clones  OTUs
USA (Nahant)  231     39
Canada        456     66
Venezuela     352     53
Greenland     59      47

4.2 Novelty of environmental sequences

We first compared ciliate OTUs to the sequences in the NCBI database using BLAST to establish their approximate phylogenetic position. Centroid sequences (see Chapter 2 for details) were used as representatives of each OTU. We proceeded with assessing the novelty of the environmental sequences based on the degree of their divergence from the closest relatives in GenBank (both from previously described/cultured groups, and from environmental sequences obtained in other molecular surveys). This kind of comparison does not identify the specific taxonomic position of the OTUs, only the novelty of the newly detected phylogenetic lineages. Exceptions are lineages exhibiting high similarities (96-100%) to sequences already deposited in GenBank, since this places the newly detected OTUs within already existing taxa. The OTUs that show a lower degree of similarity might represent rare or novel higher-level clades of uncertain position.

In our analyses, the degree of novelty for ciliate OTUs from Venezuela and Greenland appears to be higher than for ciliates from the other locations, since their average similarity to the closest environmental match (CEM) and the closest previously described/cultured groups match (CIM) is lower (Table 12).

Table 12. The average similarity (%) of the OTUs to the closest environmental match (CEM) and the closest previously described/cultured groups match (CIM), with standard errors (SE).

Location   % CEM (SE)    % CIM (SE)
Nahant     95.92 (0.47)  96.97 (0.44)
Canada     96.68 (0.29)  97.18 (0.24)
Venezuela  93.54 (0.49)  95.69 (0.57)
Greenland  94.12 (0.45)  95.52 (0.45)

Indeed, most of the OTUs retrieved from Nahant are ≥96% similar to sequences present in GenBank (Figure 4 A), with only three sequences showing a similarity to CEM and CIM ≤90% (Figure 4 B, lower left area of the dispersion plot).

Figure 4. Phylogenetic novelty of the OTUs retrieved from Nahant. (A) Distribution of the similarity to the CEM and CIM. (B) Novelty pattern of the environmental OTUs. Red circles represent the similarity to the CEM and CIM for each OTU.

The same is characteristic of the OTUs from Canada (Figure 5). Most of the individual data points are localized in the upper right corner of the graph (Figure 5B), indicating high similarity to the CEM and the CIM.

Figure 5. Phylogenetic novelty of the OTUs retrieved from Canada. (A) Distribution of the similarity to the CEM and CIM. (B) Novelty pattern of the environmental OTUs. Red circles represent the similarity with the CEM and CIM for each OTU.

In the Greenland sample, the OTU similarities to the CEM and CIM are scattered along the axes (Figure 6). In general these OTUs are more similar to the CIM than to the CEM.

Figure 6. Phylogenetic novelty of the OTUs retrieved from Greenland. (A) Distribution of the similarity to the CEM and CIM. (B) Novelty pattern of the environmental OTUs. Red circles represent the similarity with the CEM and CIM for each OTU.

The ciliate OTUs from Venezuela also show a wider distribution along the axes (Figure 7), with more sequences less similar to the CEM than to the CIM.

Figure 7. Phylogenetic novelty of the OTUs retrieved from Venezuela. (A) Distribution of the similarity to the CEM and CIM. (B) Novelty pattern of the environmental OTUs. Red circles represent the similarity with the CEM and CIM for each OTU.

These findings are not surprising given the amount of previous sampling along the northwestern Atlantic coast compared with the other two regions.

4.3 Molecular phylogeny

To assess the phylogenetic position of the observed OTUs, we performed maximum likelihood (ML) analyses for each sampling location (Figures 8 - 11) and applied Bayesian inference to confirm tree topology and node support. Each tree included centroids representing each OTU, as well as selected reference sequences characterizing the major ciliate classes, retrieved from the GenBank database. We also included a group of environmental clones labeled “CARH” (Stoeck et al. 2003) from the novel ciliate class Cariacotrichea (Orsi et al. 2011), recently discovered in an anoxic marine basin off the coast of Venezuela, to highlight possible molecular similarities with the OTUs we recovered in the vicinity.

Figure 8. ML phylogenetic tree of 18S rDNA clones retrieved from Nahant. The first number at each node represents the bootstrap value (percentage out of 1000 replicates) for ML and the second the posterior probability value of the Bayesian analysis. Black dots indicate nodes with 100% bootstrap/1 Bayesian probability values. Nahant OTUs appear in red. Phy = Phyllopharyngea. Olig = Oligohymenophorea.

Figure 9. ML phylogenetic tree of 18S rDNA clones retrieved from Canada. The first number at each node represents the bootstrap value (percentage out of 1000 replicates) for ML and the second the posterior probability value of the Bayesian analysis. Black dots indicate nodes with 100% bootstrap/1 Bayesian probability values. Canada OTUs appear in red.

Figure 10. ML phylogenetic tree of 18S rDNA clones retrieved from Venezuela. The first number at each node represents the bootstrap value (percentage out of 1000 replicates) for ML and the second the posterior probability value of the Bayesian analysis. Black dots indicate nodes with 100% bootstrap/1 Bayesian probability values. Venezuela OTUs appear in red.

Figure 11. ML phylogenetic tree of 18S rDNA clones retrieved from Greenland. The first number at each node represents the bootstrap value (percentage out of 1000 replicates) for ML and the second the posterior probability value of the Bayesian analysis. Black dots indicate nodes with 100% bootstrap/1 Bayesian probability values. Greenland OTUs appear in red.

All OTUs identified using our approach group into class-level clades coinciding with traditionally defined ciliate classes. This demonstrates that, at least at some level, the molecularly defined OTUs and morphologically defined classes do not contradict each other and describe overall ciliate taxonomy in a broadly similar manner. Interestingly, in a single notable exception, some of the environmental clones from Venezuela (Figure 10) form a group of OTUs (labeled “Novel group”) that does not cluster with any of the established classes. These OTUs cluster with the CARH sequences instead. They may very well represent a new taxon within the novel class Cariacotrichea.

4.4 Taxonomic composition and OTUs biogeography

The class-level taxonomic composition of the four ciliate communities as inferred from the ML trees is shown in Figure 12.

Figure 12. Taxonomic affiliation of the OTUs from the four studied locations.

The OTUs retrieved belong to 10 out of the 12 ciliate classes (no representatives of the Armophorea and Colpodea classes were detected in this study). Spirotrichea OTUs are abundant in the Canada and Nahant samples. Karyorelictea are more abundant in the Venezuela and Greenland ones. The Venezuelan sample appears to be the most diverse among the four, with OTUs belonging to all 10 classes, followed by Nahant (8 classes), Canada (7 classes) and Greenland (6 classes). This is likely because we collected samples from three separate sites with slightly different sediment profiles, which might harbor differently adapted species.

To investigate how the clone library composition reflected species biogeography, the OTU distributions in the four locations were compared. After combining all the 18S rDNA clones from the different libraries into a single dataset and clustering the sequences into OTUs sharing >99% similarity, the overlap of clone composition among the resulting OTUs appeared minimal (Figure 13). In fact, only one OTU was shared among all the locations. This OTU belongs to a group of ciliates within the family Trachelocercidae of the class Karyorelictea, which comprises species with worldwide distribution. Only two OTUs contained clones obtained from three locations; these clustered with the cosmopolitan genera Spirostomum and Diophrys. Just 19 OTUs contained clones from two different locations; these were similar to species or clones found globally. All the others were found at a single location only.

Figure 13. The number of OTUs (at 99% sequence similarity) shared among the four clone libraries (Canada, Nahant, Greenland and Venezuela).

This suggests that the ciliate OTU composition at each sampling location is largely unique, indicating possible endemicity of at least some of the taxa detected.
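The overlap tally behind a diagram such as Figure 13 can be sketched in a few lines. This is only an illustration of the bookkeeping, assuming each OTU has already been assigned the set of locations that contributed clones to it; the function name `overlap_counts` and the toy OTU identifiers are hypothetical and are not the data of this study.

```python
from collections import Counter

def overlap_counts(otu_sites):
    """Tally OTUs by the number of locations in which they were found.

    otu_sites maps an OTU identifier to the set of locations whose clone
    libraries contributed at least one clone to that OTU.
    """
    return Counter(len(sites) for sites in otu_sites.values())

# Hypothetical toy data, NOT the OTUs recovered in this study.
toy_otus = {
    "otu_a": {"Nahant", "Canada", "Greenland", "Venezuela"},
    "otu_b": {"Nahant", "Canada"},
    "otu_c": {"Venezuela"},
    "otu_d": {"Greenland"},
    "otu_e": {"Canada", "Venezuela", "Greenland"},
}

counts = overlap_counts(toy_otus)
for n_sites in sorted(counts, reverse=True):
    print(f"OTUs found at {n_sites} location(s): {counts[n_sites]}")
```

OTUs counted under a single location are the candidates for the possible endemicity discussed above, although with under-sampled libraries an absence from a library is not evidence of absence from the site.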

4.5 Total species richness estimation

Typical molecular surveys of environmental eukaryotes, similar to the ones conducted here, although informative, offer no more than a partial snapshot of the species present at the time of collection, mainly due to under-sampling. This is characteristic of any study, whether molecular or traditional, but few attempts have been made to estimate the degree of under-sampling, and thus the community’s true richness. A more comprehensive picture could be achieved by the use of statistical models, which can estimate the total species richness from a sample of it.

There are two main classes of statistical procedures to infer the number of taxa present in a population given the sampled taxa: parametric and non-parametric. Coverage-based non-parametric estimators have been extensively applied to calculate the total species richness in environmental populations. However, they are more accurate when a large portion of the population has been sampled, and have a strong tendency to underestimate the total richness when only a fraction of it is actually captured by, e.g., a clone library (Bunge and Barger 2008).
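To make the non-parametric idea concrete, the sketch below computes Good's sample coverage and the Chao1 richness estimate from a table of frequency counts (the number of OTUs observed exactly k times). Chao1 is used here only as a familiar non-parametric lower-bound estimator; it is an assumption of this illustration and not necessarily one of the estimators referred to above. The frequency counts are hypothetical.

```python
def goods_coverage(freq_counts):
    """Good's coverage estimate, 1 - f1 / n.

    freq_counts maps k -> f_k, the number of OTUs observed exactly k times;
    n is the total number of clones.
    """
    n = sum(k * f for k, f in freq_counts.items())
    return 1.0 - freq_counts.get(1, 0) / n

def chao1(freq_counts):
    """Chao1 estimate of total richness from singletons and doubletons."""
    s_obs = sum(freq_counts.values())
    f1 = freq_counts.get(1, 0)
    f2 = freq_counts.get(2, 0)
    if f2 > 0:
        return s_obs + f1 * f1 / (2.0 * f2)
    # Bias-corrected form, used when no doubletons were observed.
    return s_obs + f1 * (f1 - 1) / 2.0

# Hypothetical frequency counts: 10 singletons, 5 doubletons, and so on.
toy_counts = {1: 10, 2: 5, 3: 2, 5: 1}
print(f"Good's coverage: {goods_coverage(toy_counts):.3f}")
print(f"Chao1 estimate:  {chao1(toy_counts):.1f}")
```

Because the estimate depends only on the rarest classes, it behaves as a lower bound, which is consistent with the tendency to underestimate richness in poorly sampled libraries noted above.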

Recently, parametric estimators based on frequency distributions have become more common since the first applications by Hong et al. (2006) and Jeon et al. (2006). These parametric finite-mixture models appear to fit best the high diversity observed in almost every molecular environmental survey reported until now (Hong et al. 2006) and carry a higher degree of flexibility. In fact, as mixtures, they have components that best fit abundant species and others that best fit rare ones, allowing for a good approximation of real-life species frequency curves (Bunge 2011). Therefore, we decided to apply the parametric statistical models to the observed frequencies of OTUs from three locations to predict their total ciliate species richness. These methods were implemented using the CatchAll software package (Bunge 2011). The dataset from Greenland was excluded from the statistical estimations since the sampling procedure was not random.
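The simplest member of this parametric family, and the one reported as best-fitting in Table 13, is the single-exponential-mixed Poisson model. The sketch below is a textbook version of that model and not a reproduction of CatchAll: it assumes that clone counts per species are Poisson with exponentially distributed rates, so that counts follow a geometric distribution, and it fits the zero-truncated form to all observed counts by maximum likelihood. CatchAll additionally chooses a frequency cutoff and reports standard errors, so its numbers will generally differ. The function name and the toy data are hypothetical.

```python
def single_exponential_estimate(freq_counts):
    """Richness estimate under an exponential-mixed Poisson (geometric) model.

    freq_counts maps k -> f_k, the number of OTUs observed exactly k times.
    Returns (theta, total), where theta is the fitted mean of the exponential
    mixing distribution and total is the estimated number of species,
    including the unobserved ones.
    """
    s_obs = sum(freq_counts.values())
    n = sum(k * f for k, f in freq_counts.items())
    mean_count = n / s_obs
    if mean_count <= 1.0:
        raise ValueError("all OTUs are singletons; the estimate is unbounded")
    # MLE for the zero-truncated geometric: success probability 1 / mean_count,
    # which corresponds to theta = mean_count - 1.
    theta = mean_count - 1.0
    p_zero = 1.0 / (1.0 + theta)  # probability that a species goes unobserved
    total = s_obs / (1.0 - p_zero)
    return theta, total

# Hypothetical frequency counts, not the data of this study.
toy_counts = {1: 10, 2: 5, 3: 2, 5: 1}
theta, total = single_exponential_estimate(toy_counts)
print(f"theta = {theta:.3f}, estimated total richness = {total:.1f}")
```

Under these assumptions the estimate reduces to S_obs * m / (m - 1), where m is the mean number of clones per observed OTU, which shows why libraries dominated by singletons (m close to 1) yield large and uncertain predictions.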

The CatchAll program uses OTU frequency counts as input data and returns richness predictions (the number of species predicted to exist, including the ones empirically observed), comparing several parametric and non-parametric statistical models and selecting the “best” one based on associated standard errors, confidence intervals, and goodness-of-fit estimations. Figure 14 shows the data from three of the environmental clone libraries with the best-fitted parametric model.
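The frequency counts mentioned above are simply a "frequency of frequencies" table: for each k, the number of OTUs represented by exactly k clones. A minimal sketch of how such a table can be derived from per-OTU clone counts follows. The helper name and the toy abundances are hypothetical, and the two-column text layout printed at the end is an assumption about a convenient representation that should be checked against the CatchAll documentation before use.

```python
from collections import Counter

def frequency_counts(otu_abundances):
    """Return sorted (k, f_k) pairs: f_k OTUs were represented by exactly k clones."""
    tally = Counter(a for a in otu_abundances if a > 0)
    return sorted(tally.items())

# Hypothetical number of clones per OTU in one library.
toy_abundances = [1, 1, 1, 1, 2, 2, 3, 7]

table = frequency_counts(toy_abundances)
for k, f_k in table:
    print(f"{k},{f_k}")
```

The sum of the second column is the observed richness, and the sum of k times f_k is the total number of clones, which provides a quick consistency check before any model is fitted.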

Figure 14. Frequency count distributions of OTUs in three clone libraries (Nahant, Canada, Venezuela), with the best-fitted parametric curve model (red) compared to three other competing parametric models.

The shape of the curves in Figure 14 is such that their upward left projection to the Y axis value of zero gives the estimate of the total number of species present in the samples, including the observed ones. Table 13 summarizes the statistical estimations for the three clone libraries.

Table 13. Estimates of total species richness of ciliate communities in three locations, with their standard errors (SE) for the Single Exponential model.

Location     Observed OTUs    Expected OTUs    SE
Nahant       39               50.6*            4.5
Canada       66               124.6*           16.8
Venezuela    53               131.8*           30.7

* The best-fitting parametric model is the Single Exponential for all three datasets.

Except for the Nahant clone library, the predicted richness is roughly two to two and a half times the number of OTUs detected in the study (Table 13). This implies that at least as many ciliate species as we detected remain undetected in these environmental populations, having been missed by our sampling approaches. To the best of our knowledge, these are the first estimates of the degree of under-sampling in ciliate molecular surveys.
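The degree of under-sampling implied by Table 13 can be worked out directly from the observed and expected OTU counts. The short sketch below does only this arithmetic; the numbers are copied from Table 13, while the function name and the choice of summary quantities are ours and purely illustrative.

```python
# Observed and expected OTU counts copied from Table 13.
TABLE_13 = {
    "Nahant": (39, 50.6),
    "Canada": (66, 124.6),
    "Venezuela": (53, 131.8),
}

def sampling_summary(observed, expected):
    """Return (fraction detected, expected-to-observed ratio, OTUs still undetected)."""
    return observed / expected, expected / observed, expected - observed

for site, (observed, expected) in TABLE_13.items():
    fraction, ratio, missing = sampling_summary(observed, expected)
    print(f"{site}: {fraction:.0%} detected, {ratio:.2f}x expected/observed, "
          f"about {missing:.1f} OTUs undetected")
```

These point values ignore the standard errors in Table 13, which are large for Canada and Venezuela, so the fractions should be read as rough indications only.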

4.6 Discussion

The 18S rRNA approach is presently the method of choice to study protistan species diversity and biogeography (Caron et al. 2012). Large environmental surveys are now made easy by high-throughput tools such as pyrosequencing. These offer a scale of coverage impossible to achieve by traditional identification techniques, which necessitate large sample sizes, excellent cell preservation and taxonomic expertise. However, morphology still plays a major role in the identification of protistan organisms. The difficulty is to build a useful link between the morphological and the genetic information.

Our work contributes to building this link and offers a practical way to discriminate morphological entities based on their molecular signature. Our small-scale environmental surveys relied on gross morphology for initial cell collection and on 18S rRNA gene information for OTU separation. As we argued in Chapter 3, the 1% sequence divergence cutoff value produces OTUs closely mimicking species in alpha taxonomy terms, bridging molecular and traditional surveys/inventories. The four such surveys we conducted in various parts of the world allow for a comparison of the power of more traditional and more modern approaches to studying ciliate diversity. In our view, this comparison reveals four major advantages of the latter.

1. The molecular species surveys sample richness randomly since the principal source material is community DNA. This opens a way of estimating relative species abundances using the frequency of their 18S rRNA signatures in clone libraries. Traditional approaches rely on hand collection of mass species followed by their identification, but this provides only a binary registration of their presence/absence.

2. Since the rRNA approach allows for reconstruction of species frequencies, statistical tools can be used to estimate the degree of under-sampling. These tools predict what the total community richness is and how many species have been missed, and suggest the amount of effort necessary to capture these missing species as well. Surveys conducted using traditional approaches do not offer this opportunity.

3. Time and effort required to survey >50% of ciliate species composition using molecular tools (Table 13) make such surveys practical. In the tested communities, this coverage was achieved with 39 to 66 species detected. To the best of our knowledge, there have been few to no taxonomic surveys that used proper identification methods such as silver impregnation and achieved similar coverage. This is because no community we know offers more than just a handful, let alone dozens, of species abundant enough for the traditional methods to be practical.

4. rRNA surveys are easily comparable among different sites (e.g., Figures 12 and 13). Taxonomic surveys are not, principally because similar-looking specimens may represent different species, producing at times a possibly misleading picture of cosmopolitan distribution of a “species” whereas in reality this “species” may be no more than a conglomerate of similar-looking convergent adaptations.

In the end, the rRNA approach recovers protistan, or at least ciliate, diversity in a way that is more comprehensive and requires less effort. We would like to emphasize, though, that this is only so if a correspondence between molecular species and morphospecies has been uncovered. This correspondence, established in Chapter 3, is necessary and sufficient to translate alpha taxonomy into beta taxonomy terms, and vice versa, leading to an increased understanding of protistan diversity and its distribution in time and space.

Chapter 5: Conclusions

5.1 The contribution of our work

Classification of protists has been a daunting task for over two centuries, due to the tremendous variability in their shape, size and ecophysiology. Good taxonomic identification has been achieved using a combination of cultivation methods, microscopic observation and morphological characters. However traditional techniques and the “morphospecies” concept are increasingly deemed insufficient to define protistan taxa (Caron 2009, McManus and Katz 2009).

That is why in recent years scientists have turned to DNA sequence data to help uncover the real extent of protistan diversity. However, there is not yet a defined and accepted link between the genetic information and the traditional species description (Caron 2009, Caron et al. 2012).

Our work aimed at filling the gap between the two. We started with well-defined ciliate morphospecies and performed an extensive analysis of their genetic variability to investigate the species boundaries. Our results allowed us to discriminate species in environmental populations while taking into account intraspecific variation, and to present the rationale behind our proposed 1% sequence divergence cutoff value for practical species separation. Equipped with this tool, we discovered novel ciliate OTUs in our local-scale molecular surveys, some of which appear to be related to a recently described new class. Moreover, with the application of more effective statistical procedures, we estimated the local species richness, emphasizing the level of under-sampling among natural ciliate populations.

5.2 Future possibilities

Methodological limitations notwithstanding, morphology is still the “gold standard” for protistan classification. However, the recent exponential increase in sequencing and computational power seems to lead toward a molecular taxonomy (Caron et al. 2012). We do recognize the advantages of molecular methods, especially considering the fast “extinction rate” of the alpha taxonomists (McManus and Katz 2009), but like others we advocate for a more comprehensive species concept for protists, which includes morphological, molecular and physiological information. Whether or not the debate over the species concept will be resolved depends greatly on our ability to reconcile the morphological description with sequence information and to agree on the acceptable level of genetic variability within species. More work is needed to investigate the extent of intraspecific variability in different taxa. This will allow generalized inference about species boundaries, with the goal of getting closer to a novel taxonomic approach to identify protists.

Studies like the present one are already applying PCR techniques to single cells in order to correlate morphotypes with genotypes in protists (Auinger et al. 2008, Kim et al. 2011).

Once a better method to define protistan species is at hand, we will be able to tackle the problem of assessing protistan diversity.

Studies of protistan diversity are central to understanding how species richness and community composition affect different ecosystems. Knowing how ecosystems are organized, we will be able to monitor and predict their response to the rapid environmental changes we are facing nowadays.

Pairing molecular techniques with powerful statistical methods will help unveil the “unseen”, that is, what still lies ahead of us in terms of the real extent of protistan diversity.

REFERENCES

Adl, S.M., Leander, B.S., Simpson, A.G., Archibald, J.M., Anderson, O.R., Bass, D., Bowser, S.S., Brugerolle, G., Farmer, M.A., Karpov, S., Kolisko, M., Lane, C.E., Lodge, D.J., Mann, D.G., Meisterfeld, R., Mendoza, L., Moestrup, Ø., Mozley-Standridge, S.E., Smirnov, A.V., Spiegel, F. 2007. Diversity, nomenclature, and taxonomy of protists. Syst Biol. 56: 684-689.

Adl, M.S., Simpson, A.G.B., Farmer, M. A., Andersen, R.A., Anderson, O.R., Barta, J.R, Bowser, S.S., Brugerolle, G., Fensome, R. A., Fredericq, S., James, T.Y., Karpov, S., Kugrens, P., Krug, J., Lane, C., Lewis, L.A., Lodge, J., Lynn, D.H., Mann, D.G., McCourt, R.M., Mendoza, L., Moestrup, Ø., Mozley-Standridge, S.E., Nerad, T.A., Shearer, C.A., Smirnov, A.V., Spiegel, F., and Taylor, F.J.R. 2005. The new higher level classification of eukaryotes with emphasis on the taxonomy of protists. J. Euk. Microbiol. 52: 399–451.

Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J.H., Zhang, Z., Miller, W., Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402.

Andreoli, I., Mangini L., Ferrantini F., Santangelo G., Verni F., Petroni G. 2009. Molecular phylogeny of unculturable Karyorelictea (Alveolata, Ciliophora). Zool. Scripta 38: 651-662.

Auinger, B.M., Pfandl, K. and Boenigk, J. 2008. Improved Methodology for Identification of Protists and Microalgae from Plankton Samples Preserved in Lugol’s Iodine Solution: Combining Microscopic Analysis with Single-Cell PCR. Appl Environ Microbiol 74: 2505–2510.

Bunge, J. 2011. Estimating the number of species with CatchAll. Forthcoming in Proceedings of the Pacific Symposium on Biocomputing 2011.

Bunge, J., Barger, K. 2008. Parametric models for estimating the number of classes. Biometrical Journal 50: 971-982.

Carey, P.G. 1992. Marine interstitial ciliates. An illustrated key. Natural History Museum Publications, Chapman and Hall, London.

Caron, D.A. 2009. Past President’s Address: Protistan Biogeography: Why All The Fuss? J. Eukaryot. Microbiol. 56: 105–112.

Caron, D.A., Countway, P.D., Brown, M.V. 2004. The growing contributions of molecular biology and immunology to protistan ecology: molecular signatures as ecological tools. J. Euk. Microbiol. 51: 38-48.

Caron, D.A., Countway, P.D., Savai, P., Gast, R.J., Schnetzer, A., Moorthi, S.D., Dennett, M.R., Moran, D.M., Jones, A.C. 2009. Defining DNA-based Operational Taxonomic Units for microbial-eukaryote ecology. Appl Environ Microbiol 75: 5797–5808.

Caron, D.A., Countway, P.D., Jones, A.C., Kim, D.Y., Schnetzer, A. 2012. Marine Protistan Diversity. Annu. Rev. Mar. Sci. 4: 6.1–6.27.

Carranza, S., Giribet, G., Ribera, C., Baguna, J., Riutort, M. 1996. Evidence that two types of 18S rRNA coexist in the genome of Dugesia (Schmidtea) mediterranea (Platyhelminthes, Turbellaria, Tricladida). Mol. Biol. Evol. 13: 824–832.

Cavalier-Smith, T. 1998. A revised six-kingdom system of life. Biol. Rev. Camb. Phil. Soc. 73: 203-266.

Cline, J., Braman, J.C. and Hogrefe, H.H. 1996. PCR fidelity of Pfu DNA polymerase and other thermostable DNA polymerases. Nucleic Acids Res 24: 3546-3551.

Corliss, J.O. April 2001. Protists systematics. In: Encyclopedia of Life Sciences. John Wiley and Sons, Ltd: Chichester.

Droste, R. 2003. Defining protistan species at the molecular level: 18S rRNA approach. M.S. thesis, Northeastern University, Boston.

Elder, J.F., Turner, B.J. 1995. Concerted evolution of repetitive DNA sequences in eukaryotes. Q. Rev. Biol. 70: 297–320.

Elwood, H.J., Olsen, G.J., Sogin, M.L. 1985. The small-subunit ribosomal RNA gene sequences from the hypotrichous ciliates Oxytricha nova and Stylonychia pustulata. Mol Biol Evol 2: 399–410.

Epstein, S., López-García, P. 2008. “Missing” protists: a molecular prospective. Biodivers. Conserv. 17: 261-276.

Finlay, B.J. 2004. Protist taxonomy: an ecological perspective. Phil. Trans. R. Soc. Lond. B 359: 599-610.

Finlay, B.J. and Fenchel, T. 1999. Divergent perspectives on protist species richness. Protist 150: 229–233.

Finlay, B.J., Corliss, J.O., Esteban, G., Fenchel, T. 1996. Biodiversity at the microbial level: the number of free-living ciliates in the biosphere. Q. Rev. Biol. 71: 221-237.

Foissner, W. 1999. Protist diversity: Estimates of the near-imponderable. Protist 150: 363–368.

Gao, S., Strüder-Kypke, M., Al-Rasheid, K.A.S., Lin, X., Song, W. 2010. Molecular phylogeny of three ambiguous ciliate genera: Kentrophoros, Trachelolophos and Trachelotractus (Alveolata, Ciliophora). Zool. Scripta 39: 305-313.

Guindon, S., Gascuel, O. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 52: 696-704.

Hall, T.A. 1999. BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucl. Acids. Symp. Ser. 41: 95-98.

Hong, S.H., Bunge J., Leslin C., Jeon S.O., Epstein S.S. 2009. Polymerase chain reaction primers miss half of rRNA microbial diversity. ISME J 3: 1365-1373.

Hong, S.H., Bunge, J., Jeon, S.O., and Epstein, S.S. 2006. Predicting microbial species richness. Proc Natl Acad Sci 103: 117-122.

Huber, T., Faulkner, G., Hugenholtz, P. 2004. Bellerophon: a program to detect chimeric sequences in multiple sequence alignments. Bioinformatics 20: 2317-2319.

Jeon, S.O., Bunge J., Leslin C., Stoeck T., Hong S.H., Epstein S.S. 2008. Environmental rRNA inventories miss over half of protistan diversity. BMC Microbiol 8: 222.

Jeon, S.O., Bunge J., Stoeck T., Barger K.J., Hong S.H., Epstein S.S. 2006. Synthetic statistical approach reveals a high degree of richness of microbial eukaryotes in an anoxic water column. Appl Environ Microbiol 72: 6578-6583.

Kim, S.J., Choi, J.K., Ryu, S. and Min, G.S. 2011. Single-cell PCR on protargol-impregnated euplotid ciliates: a combined approach of morphological and molecular taxonomy. Anim Cells Syst 15: 251-258.

Krieger, J. and Fuerst, P.A. 2004. Characterization of nuclear 18S rRNA gene sequence diversity and expression in an individual lake sturgeon (Acipenser fulvescens). J. Appl. Ichthyol. 20: 433–439.

Krieger, J., Hett, A.K., Fuerst, P.A., Birstein, V.J. and Ludwig, A. 2006. Unusual Intraindividual Variation of the Nuclear 18S rRNA Gene is Widespread Within the Acipenseridae. J Hered 97: 218–225.

Lane, D.J. 1991. 16S/23S rRNA sequencing. In: Stackebrandt E, Goodfellow M, eds. Nucleic acid techniques in bacterial systematics. Chichester, UK: John Wiley and Sons. pp 115–175.

Li, J., Gutell, R.R., Damberger, S.H., Wirtz, R.A., Kissinger, J.C., Rogers, M.J., Sattabongkot, J., McCutchan, T.F. 1997. Regulation and trafficking of three distinct 18 S ribosomal RNAs during development of the malaria parasite. J. Mol. Biol. 269: 203-213.

Liao, D. 1999. Concerted evolution: molecular mechanism and biological implications. Am J Hum Genet. 64: 24-30.

Ludwig, W., Strunk, O., Westram, R., Richter, L., Meier, H., Yadhukumar, Buchner, A., Lai, T., Steppi, S., Jobb, G., Forster, W., Brettske, I., Gerber, S., Ginhart, A.W., Gross, O., Grumann, S., Hermann, S., Jost, R., Konig, A., Liss, T., Lussmann, R., May, M., Nonhoff, B., Reichel, B., Strehlow, R., Stamatakis, A., Stuckmann, N., Vilbig, A., Lenke, M., Ludwig, T., Bode, A., and Schleifer, K.H. 2004. ARB: a software environment for sequence data. Nucleic Acids Res, 32: 1363-71.

Mayr, E. 1942. Systematics and the Origin of Species. Columbia Univ. Press, New York.

McManus, G.B. and Katz, L.A. 2009. Molecular and morphological methods for identifying plankton: what makes a successful marriage? J. Plankton Res. 31: 1119-1129.

Medlin, L., Elwood, H.J., Stickel, S., Sogin, M.L. 1988. The characterization of enzymatically amplified eukaryotic 16S-like rRNA-coding regions. Gene 71: 491–499.

Miller, M.A., Pfeiffer, W., Schwartz, T. 2010. Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees. In Proceedings of the Gateway Computing Environments Workshop (GCE) pp: 1-8.

Montagnes, D.J.S., Lynn, D.H. 1987. A Quantitative Protargol Stain (QPS) for ciliates: a method description and test of its quantitative nature. Marine Microbial Food Webs 2: 83-93.

Moreira, D. and López-García, P. 2002. The molecular ecology of microbial eukaryotes unveils a hidden world. Trends Microbiol. 10: 31-38.

Nebel, M., Pfabel, C., Stock, A., Dunthorn, M., Stoeck, T. 2010. Delimiting operational taxonomic units for assessing ciliate environmental diversity using small-subunit rRNA gene sequences. Environmental Microbiology Reports 3: 154–158.

Orsi, W., Edgcomb, V., Faria, J., Foissner, W., Fowle, W.H., Hohmann, T., Suarez, P., Taylor, C., Taylor, G.T., Vdacny, P., Epstein, S.S. 2011. Class Cariacotrichea, a novel ciliate taxon from the anoxic Cariaco Basin, Venezuela. Int J Syst Evol Microbiol. Published online ahead of print August 12.

Patterson, D.J. 1999. The diversity of Eukaryotes. Am. Nat. 154 (Suppl.): S96-S124.

Schlegel, M. and Meisterfeld, R. 2003. The species problem in protozoa revisited. Europ. J. Protistol. 39: 349-355.

Pfandl, K., Chatzinotas, A., Dyal, P., Boenigk, J. 2009. SSU rRNA gene variation resolves population heterogeneity and ecophysiological differentiation within a morphospecies (Stramenopiles, Chrysophyceae). Limnol. Oceanogr. 54: 171–181.

Posada, D. 2008. jModelTest: Phylogenetic Model Averaging. Mol. Biol. Evol. 25: 1253-1256.

Prescott, D.M. 1994. The DNA of Ciliated Protozoa. Microbiol Rev 58: 233-267.

Ronquist, F. and Huelsenbeck, J. P. 2003. MRBAYES 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19: 1572-1574.

Rooney, A.P. 2004. Mechanisms Underlying the Evolution and Maintenance of Functionally Heterogeneous 18S rRNA Genes in Apicomplexans. Mol. Biol. Evol. 21: 1704–1711.

Siddiqui, A.S., Dengler, U. and Barton, G.J. 2001. 3Dee: a database of protein structural domains. Bioinformatics 17: 200–201.

Small, E.B. 1992. A simple method for obtaining concentrated populations of protists from sediments. In: Lee JJ, Soldo AT (eds.) Protocols in Protozoology. Society for Protozoology, Allen Press, Lawrence, KS, pp. B-4.1 – B-4.2.

Stamatakis, A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22: 2688–2690.

Stoeck, T. and Stock, A. 2010. The protistan gap in the eukaryotic tree of life. Palaeodiversity 3: Supplement 151-154.

Stoeck, T., Taylor, G.T. and Epstein, S.S. 2003. Novel eukaryotes from the permanently anoxic Cariaco Basin (Caribbean Sea). Appl. Environ. Microbiol. 69: 5656–5663.

Sukumaran, J. and Holder, M.T. 2009. SumTrees: Summarization of Split Support on Phylogenetic Trees. Version 1.0.2. Part of the DendroPy Phylogenetic Computation Library Version 2.6.1.

Thompson, J.D., Higgins, D.G. and Gibson, T.J. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl. Acids Res. 22: 4673–4680.

Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F. and Higgins, D.G. 1997. The ClustalX windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucl. Acids Res. 25: 4876-4882.

Torres-Machorro, A.L., Hernández, R., Cevallos, A.M., López-Villaseñor, I. 2010. Ribosomal RNA genes in eukaryotic microorganisms: witnesses of phylogeny? FEMS Microbiol Rev 34: 59–86.

Uhlig, G., Thiel, H., Gray J.S. 1973. The quantitative separation of meiofauna: a comparison of methods. Helgoländer Wiss. Meeresunters 25: 173-195.

Weider, L.J., Elser, J.J., Crease, T.J., Mateos, M., Cotner, J.B. and Markow, T.A. 2005. The functional significance of ribosomal (r)DNA variation: Impacts on the evolutionary ecology of organisms. Annu. Rev. Ecol. Evol. Syst. 36: 219–42.

Wuyts, J., Perriere, G. and Van de Peer, Y. 2004. The European ribosomal RNA database. Nucl. Acids Res. 32: D101-D103.

Zwickl, D.J. 2006. Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion. Ph.D. dissertation, The University of Texas at Austin.
