Developing online tools for metagenomic analysis and SNP detection using Nanopore sequencing data

Master

01.06.2019 Felix Manske, 407800 Department of Biology Research group: Comparative Institute of University of Münster Supervisors: Dr. Francesco Catania; Prof. Dr. Wojciech Makalowski

Table of contents

Abbreviations
Acknowledgements
Abstract

1. Introduction
1.1. Metagenomics
1.1.1. Applications of metagenomics
1.1.2. Missing consensus for a prokaryotic species definition
1.1.3. Metaprofiling analysis of microbial communities
1.1.4. Topical challenges in metagenomics
1.1.5. Development of MetaG was motivated by topical challenges in the field
1.2. Nucleotide Polymorphisms
1.2.1. SNPs: markers of genetic variability
1.2.2. Identification and analysis of SNPs using next-generation sequencing (NGS)
1.2.3. Challenges in SNP studies
1.2.4. The polymorphism calculation in NanoPipe: robust and easy to use

2. Materials and Methods
2.1. Design of MetaG
2.1.1. Database preparations
2.1.2. Alignment
2.1.3. Post processing
2.1.4. Taxonomic assignment
2.1.5. Output of MetaG
2.1.6. Implementation as a web tool and standalone program
2.1.7. Creating custom databases
2.1.7.1. Amplicon databases
2.1.7.1.1. RDP 16S 28S
2.1.7.1.2. MTX
2.1.7.2. The database for pathogens: PATRIC
2.2. Performance evaluation and cross-algorithm comparisons
2.2.1. General considerations
2.2.2. Read simulation
2.2.2.1. Simulation of nanopore reads
2.2.2.2. Simulation of Illumina reads
2.2.3. Defining standard parameters for MetaG
2.2.4. Competing classifiers
2.2.4.1. Parallel-META 3.4.4
2.2.4.2. QIIME 2
2.2.4.3. RDP Classifier
2.2.5. Cross-algorithm comparisons and visualizations
2.2.6. Evaluating the quality of the simulated data
2.3. Design of the SNP analysis in NanoPipe
2.3.1. NanoPipe, a bioinformatics pipeline featuring SNP analysis
2.3.2. Workflow of the analysis
2.3.3. Filtering ambiguous SNP candidates
2.3.4. Providing metadata for polymorphic sites
2.3.4.1. Joint probability of a SNP
2.3.4.2. Connecting previous and topical observations
2.3.4.3. The alignment quality as a measure of a SNP's reliability
2.3.5. Display of results from the polymorphism analysis
2.4. Benchmarking of nanopore sequencing and NanoPipe SNP analysis

3. Results
3.1. Performance evaluation of MetaG
3.1.1. Statistical evaluation and comparison to competitors
3.1.1.1. Simulated nanopore sequencing
3.1.1.2. Simulated Illumina MiSeq sequencing
3.1.2. Analysis of a novel bacterium
3.1.3. Reanalysis of a nanopore mock sample
3.2. Performance of the polymorphism detection in NanoPipe
3.2.1. The effect of sequencing error on the SNP analysis
3.2.2. Recovery of real SNPs by nanopore sequencing and NanoPipe analysis

4. Discussion
4.1. Metaprofiling analysis
4.1.1. In silico performance evaluation
4.1.2. Analysis of a novel bacterium
4.1.3. Reanalysis of a nanopore mock sample
4.1.4. Evaluation of the performance and practical use of MetaG
4.2. SNP detection in NanoPipe
4.2.1. Detection of known SNPs
4.2.2. Current state and future improvements

5. References

6. Supplemental figures

Declaration of Academic Integrity


Abbreviations

ac: Alignment score cutoff
ANI: Average nucleotide identity
API: Application programming interface
ASV: Amplicon sequence variant
BA: Sample containing bacteria and archaea
bash: Bourne-again shell
BFA: Sample containing bacteria, archaea and fungi
bp: Basepair
cc: Confidence cutoff
COI: Cytochrome c oxidase subunit I
CPU: Central processing unit
DNA: Deoxyribonucleic acid
DRα: Major histocompatibility complex class II gene for the α-chain
DRβ: Major histocompatibility complex class II gene for the β-chain
ec: E-value cutoff
EF-G: Elongation factor G gene
EF-Tu: Elongation factor Tu gene
FN(s): False negative(s)
FP(s): False positive(s)
GB: Gigabyte
genomosp.: Species characterized only by genomic analyses
h: Hour
HSP70: Heat shock protein 70
HTTP: Hypertext transport/transfer protocol
indel: Insertion, deletion
ITS: Internal transcribed spacers
json: JavaScript Object Notation
KEGG: Kyoto Encyclopedia of Genes and Genomes
lca: Lowest common ancestor
ld: Linkage disequilibrium
LGT: Lateral gene transfer
LSU: Large subunit of ribosome
LSU11: Fungal LSU training set 11
MB: Megabyte
MCC: Matthews correlation coefficient
MHC: Major histocompatibility complex
n.d.: No date
NA: Not applicable
NGS: Next-generation sequencing
nt: Nucleotide
OTU: Operational taxonomic unit
PCR: Polymerase chain reaction
perfect (sample): Sample simulated with no sequencing error
p-error: Error probability
pH: Power of hydrogen
PREC: Precision
R*, e.g. R9: Flowcell version of nanopore sequencer, e.g. flowcell version 9
R9 2D (sample): Sample simulated with sequencing errors derived from E. coli sequencing on flowcell R9 using 2D reads
RadA: Radiation-sensitive gene A
RAM: Random-access memory
RBR: Relative binding ratio
RecA: Recombinase A gene
RFLP: Restriction fragment length polymorphism
RNA: Ribonucleic acid
rrn: Ribosomal ribonucleic acid operon
rRNA: Ribosomal ribonucleic acid
S: Svedberg
sec: Second
SENS: Sensitivity
sh: Bourne shell
SNP: Single nucleotide polymorphism
SNV: Single nucleotide variant
sp.: Species
SPEC: Specificity
SRA: Sequence Read Archive
SSU: Small subunit of ribosome
subsp.: Subspecies
TI: Transition
TN(s): True negative(s)
TP(s): True positive(s)
TV: Transversion
UNITE: UNITE Fungal ITS trainset 07-04-2014
URL: Uniform/universal resource locator
WARCUP2: Warcup Fungal ITS trainset 2
WGM: Whole-genome metagenomics
WIMP: What's in my Pot?
1D: One-dimensional (read)
2D: Two-dimensional (read)


Acknowledgements

I want to express my gratitude to Dr. Francesco Catania and to Prof. Dr. Wojciech Makalowski for giving me the chance to work on two topical and very interesting projects. I could always rely on their guidance for my projects. In addition, they contributed helpful comments on earlier versions of this thesis. I enjoyed working in Prof. Dr. Wojciech Makalowski’s lab. The good atmosphere and productive working environment in the group made learning quite easy. I want to thank Norbert Grundmann, who was always happy to show me how computer scientists work and think. Without him, the web interfaces would still only exist on paper and MetaG would not have its current name. I am also grateful to Dr. Victoria Shabardina for her advice and helpful comments on the SNP analysis in NanoPipe. Additionally, I thank Tabea Kischka, Reza Halabian, Wolfgang Garbers, Jonas Bohn and Marten Kellner for sharing ideas and/or promoting enjoyable tea times. I am grateful to my whole family for supporting me and making this work possible.


Abstract

The twenty-first century has witnessed a rapid development of sequencing technologies. Although the market is dominated by Illumina short-read sequencing, third-generation technologies allow sequencing of much longer, single nucleic acid molecules. However, this comes at the expense of much lower sequencing accuracy. Thus, new algorithms for sequencing data analysis are required. Here, I present two algorithms developed with nanopore sequencing in mind.

The first algorithm was developed for metagenomic analyses and was implemented in the MetaG software. MetaG identifies taxa from targeted rRNA gene sequencing. It also predicts pathogens and antibiotic resistances, which makes it applicable in medical settings. The program is available as a command line version and as a web implementation. Compared with other commonly used metagenomic analysis programs, MetaG performed at a high level and substantially improved genus and species identifications.

The second algorithm analyses single nucleotide polymorphisms (SNPs), which are of major interest to cancer research, animal breeding, ecology, and forensics. This algorithm was implemented in the NanoPipe web server, which was developed for the analysis of nanopore MinION sequencing data. The algorithm uses a combination of several parameters for SNP validation; taken together, these estimators proved to be powerful. Slight modifications to existing standards and the use of a new parameter were shown to provide promising options for future improvement.

Nanopore sequencing technology offers the possibility to perform analyses directly in the field. Since both algorithms presented here have a local implementation, analyses may also be performed in remote regions without an internet connection. The web implementation opens the analyses to researchers with weak computational setups. In addition, the software development focused on ease of use. Thus, researchers and medical practitioners without high-level computer expertise can perform their analyses.


1. Introduction

1.1. Metagenomics

As will be shown in the following section, the analysis of the microbial community structure has a wide range of applications. Traditionally, taxa within the community were cultured for identification. For example, Mycobacterium moriokaense was defined as a distinct species by DNA-DNA hybridization and by several traits involving growth in culture as well as resistance to and metabolism of selected compounds (Tsukamura, Yano, and Imaeda 1986). In DNA-DNA hybridization assays, the DNA from different species is denatured. The complementary strands from the different species then reassociate into hybrid molecules. The ratio of hybrid molecule formation (relative binding ratio, RBR) or the thermal stability (ΔTm) of the hybrids may be used as measures of genomic similarity (reviewed in: Rosselló-Mora and Amann 2001). In terms of RBR, the recommended species definition requires a value of at least 70 % (Wayne et al. 1987). Still, the analysis requires high technical skill and is time-consuming (reviewed in: Stackebrandt and Ebers 2006; Rosselló-Mora and Amann 2001).

Bacterial identification by culture-dependent methods is expected to miss a significant portion of taxa: in soil samples a maximum of 1 % of all bacteria could be cultured (reviewed in: Torsvik, Sørheim, and Goksøyr 1996). A possible solution lay in the advent of omics analysis to classify microorganisms. It started in 1977, when the use of rRNA genes for the identification of organisms was proposed (Woese and Fox 1977). The subsequent movement can be termed the “ of [m]icrobiology into [m]etagenomics” (Escobar-Zepeda, de León, and Sanchez-Flores 2015).

The term metagenomics (literally “beyond the genome” (Gilbert and Dupont 2011)) has been widely used to describe the analysis of the microbial community structure (reviewed in: Segata et al. 2013). However, the field actually has several branches and sub-branches depending on the analyzed molecule: when analyzing DNA, researchers can focus on the whole genome or on selected marker genes. It has been argued that only the former should be called metagenomics (Handelsman et al. 1998; Quince et al. 2017). The latter has been referred to as metaprofiling (Escobar-Zepeda, de León, and Sanchez-Flores 2015) and marker gene analysis (Knight et al. 2018).

For metaprofiling, the 18S rRNA gene has been commonly used to assess the eukaryotic community (e.g.: Scanlan and Marchesi 2008; Kataoka et al. 2017; Kettner et al. 2019; Reboul et al. 2019). In the same way, internal transcribed spacers (ITS) and 28S rRNA genes have been used to analyze fungal communities (reviewed in: Edwards et al. 2017; Schoch et al. 2012), and 16S rRNA genes to analyze bacterial communities (reviewed in: Fox et al. 1980). Viruses do not have such a universal gene, as demonstrated for phages (Rohwer and Edwards 2002). However, there are some conserved genes within limited subgroups (reviewed in: Breitbart and Rohwer 2005).

Using DNA is one option for community analysis. Alternatively, there are metatranscriptomics, which analyzes RNA (reviewed in: Bashiardes, Zilberman-Schapira, and Elinav 2016), and metaproteomics, which analyzes protein sequences (reviewed in: Wilmes and Bond 2006). The following sections, however, will focus on the use of DNA data for metagenomic classifications. For the sake of simplicity, the whole process of analyzing DNA to assess the microbial community will be termed metagenomics. This includes analyses of marker genes and of the whole genome. Where a distinction is necessary due to the different nature of the analyses, the terms metaprofiling and whole-genome metagenomics (WGM) will be used.

1.1.1. Applications of metagenomics

Microorganisms populate a broad range of environments, from very low (e.g.: Korzhenkov et al. 2019) to very high pH values (e.g.: Suzuki, Nealson, and Ishii 2018). They live at deep-sea hydrothermal vents (reviewed in: Poli et al. 2017) and in the human gut (reviewed in: Thursby and Juge 2017). The outer environment shapes the community, and this relationship has been researched using metagenomic approaches. Selected examples of extreme habitats are presented in the following: in groundwater contaminated with heavy metals, the community consisted of just 13 operational taxonomic units (see 1.1.2. for definition) adapted to the challenging environment (Hemme et al. 2010). Using the example of the Deepwater Horizon oil leakage, it was shown that, aside from the contamination by oil, the temperature drop at great depth and natural gas shaped the community (Redmond and Valentine 2012). This correlation can be used to assess the status of a habitat. Smith and colleagues developed a machine-learning approach using targeted 16S rRNA gene sequencing. Their goal was to use the bacterial composition as a biomarker (Smith et al. 2015). The algorithm was trained on groundwater contaminated with uranium and nitrate. The program could predict the level of contamination and the pH from a given microbial sample (Smith et al. 2015). In addition, the model was trained on 16S rRNA microarray data from sites affected by the Deepwater Horizon accident: contaminated, not contaminated and no longer contaminated sites could be reliably distinguished (Smith et al. 2015).

Aside from being environmental markers, microorganisms can also shape their environment by removing contamination (bioremediation) (reviewed in: Lovley 2003). For example, bacteria were used to remove soluble uranium from contaminated groundwater. For that, researchers used emulsified vegetable oil as a biostimulant to increase the growth of remediating bacteria (Gihring et al. 2011). The attempt was successful: the community structure shifted several times to high abundances of few bacterial species as shown by metaprofiling. Besides, the uranium concentration in the water was reduced within several months (Gihring et al. 2011).

Biostimulation is also used to screen the environment for enzymes relevant to the industry: a hormone-sensitive lipase was isolated from permafrost soil incubated with olive oil (Petrovskaya et al. 2016). In addition, a novel chitinase was found in agricultural soil spiked with chitin (Cretoiu et al. 2015). Another chitinase identified by metagenomics was shown to act as a fungicide against several plant pathogens (Hjort et al. 2014), highlighting the importance of this enzyme for plant protection (reviewed in: Herrera-Estrella and Chet 1999). Overall, 332 enzymes relevant to the industry were obtained via metagenomics in a three-year period (Berini et al. 2017). The wide application of biocatalysts is mainly limited by their availability (Schmid et al. 2001).

In 1996, almost all amino acid variations throughout the BPN’ subtilisin had been protected by patents (reviewed in: Maurer 2004). Thus, metagenomic screening to mine novel enzymes is also motivated by the need to avoid legal issues.


Apart from the aforementioned utility of microorganisms, they also have a major impact on human health: for example, gut bacteria produce vitamins essential for humans (reviewed in: LeBlanc et al. 2013) and influence the intestinal barrier, as shown in mice (Natividad et al. 2012). Despite these benefits, infections, namely of the lower respiratory tract, ranked fourth among the leading causes of death worldwide in 2016¹. Thus, metagenomics has recently been used more extensively in health care (reviewed in: Forbes et al. 2018). In a clinical setting, metagenomics helps to identify rare pathogens, as shown in 2013. At that time, Cyclospora cayetanensis led to an outbreak of gastrointestinal infections in the USA. In the early stages and prior to conventional methods, C. cayetanensis was detected using the BioFire FilmArray (reviewed in: Buss et al. 2013). Adding to the problem of rare pathogens, exact clinical diagnosis is complicated by infections with ambiguous symptoms, as can be seen from the following case. An immunodeficient patient with bacterial meningitis remained undiagnosed during a four-month period of repeated hospitalization. During this period, the symptoms worsened and the patient was finally placed in a medically induced coma. Illumina sequencing revealed an infection with Leptospira sp. After administration of an appropriate antibiotic, the patient rapidly recovered (Wilson et al. 2014).

In a clinical setting, the speed of pathogen detection is essential. The timeframe for clinical diagnosis ranges from 1 hour (point-of-care tests) up to 132 hours for more elaborate testing (Deurenberg et al. 2017). The runtime of metagenomic approaches lies within this clinical timeframe and continues to improve: in a recent study, real-time nanopore sequencing took only six hours, compared with 20 hours for RNA sequencing on the MiSeq (Greninger et al. 2015). This highlights the efficiency of metagenomic testing under these challenging conditions.

1.1.2. Missing consensus for a prokaryotic species definition

Literature about metagenomics often states that operational taxonomic units (OTUs) were found. Microbiologists use this term to describe taxa at several degrees of similarity, rather than using the rank name (reviewed in: Goodrich et al. 2014). This is necessary, since it is unclear if, in the context of microbiology, species truly exist (reviewed in: Doolittle and Papke 2006).

A prokaryotic species was defined as "a monophyletic and genomically coherent cluster of individual organisms that show a high degree of overall similarity with respect to many independent characteristics, and is diagnosable by a discriminative phenotypic property" (Rosselló-Mora and Amann 2001). A reformulation of the species concept that is more useful for bioinformatics is: "a prokaryotic species is considered to be a group of strains (including the type strain) that are characterized by a certain degree of phenotypic consistency, showing 70 % of DNA–DNA binding and over 97 % of 16S ribosomal RNA (rRNA) gene-sequence identity" (Gevers et al. 2005). The theoretical concept is at least partially challenged by the following observations. First, lateral gene transfer (LGT) leads to OTUs with a significant sequence similarity, but a distinct phenotype, e.g. pathogenicity (reviewed in: Doolittle and Papke 2006). Second, it was shown that, at 94 % or more average nucleotide identity (ANI), 5 to 35 % of an individual organism's genes were often not shared with other members of the same species (Konstantinidis and Tiedje 2005).

1 http://www.who.int/healthinfo/global_burden_disease/GHE2016_Deaths_WBInc_2000_2016.xls; accessed 26.05.2019.

Several models have been proposed as a more accurate definition of species. The ecotype model assumes that populations are asexual; thus, a beneficial trait may only spread together with its genome. This happens in distinct ecological niches and leads to near-clonal ecotypes. The process is referred to as periodic selection. Drift may also add to this or act on its own. In addition, there are several branches of this model, which introduce further mechanisms (reviewed in: Cohan and Perry 2007). However, recombination was shown to occur, in some cases at high frequencies, e.g. in Helicobacter pylori (Falush et al. 2001). This challenges the (universal) application of the ecotype model.

Especially in the case of (high) recombination, the biological species concept proposed by Ernst Mayr is applicable (Dykhuizen and Green 1991; reviewed in: Riley and Lizotte-Waniewski 2009): individuals within a species should, at least, have the potential of sexual reproduction. Importantly, this reproduction must be a barrier towards other groups (Mayr 1942). However, LGT was proposed to increase interspecies recombination between microorganisms thereby removing the reproductive isolation of species (Nesbø, Dlutek, and Doolittle 2006).

Another approach divides genes into a core set (housekeeping genes) and a set of additional genes promoting adaptation (niche adaptive genes). At the level of species, the core is shared by most, if not all members. The niche adaptive set is strain specific and will be subject to LGT more readily (Reeves and Lan 1996). The hypothesis can also be used to underpin the biological species concept (Riley and Lizotte-Waniewski 2009). While no consensus has been reached so far, it is still common practice to use terms like species in order to assign traits to groups of organisms or to refer to certain levels of phylogenetic relatedness (reviewed in: Goodrich et al. 2014).

1.1.3. Metaprofiling analysis of microbial communities

In the following, the general workflow of a metaprofiling study will be presented. The focus lies on the computational analyses. However, first, the DNA of interest needs to be isolated, separated from contaminants, replicated and sequenced. See Thomas, Gilbert, and Meyer (2012) for a thorough review of this topic. The raw reads have to be base-called and should be pre-filtered for optimal results (Kunin et al. 2008). Commonly used steps include cutting amplification artifacts (e.g.: primers, barcodes), removing low quality regions and ambiguous targets, correcting sequencing errors, adjusting reads to a common length, clustering similar reads (de-replication) and filtering chimeras (reviewed in: Jünemann et al. 2017).
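A few of the pre-filtering steps listed above can be illustrated with a minimal Python sketch. It is not taken from any of the cited tools: the function names, the exact-match primer trimming and the cutoffs (min_len, max_ambiguous) are assumptions chosen for illustration, and quality trimming, error correction and chimera filtering are deliberately left out.

```python
from collections import Counter


def trim_primer(read: str, primer: str) -> str:
    """Remove an exact primer match from the start of a read.

    Assumption: no mismatches are tolerated; real tools allow some.
    """
    return read[len(primer):] if read.startswith(primer) else read


def prefilter(reads, primer, min_len=10, max_ambiguous=0):
    """Trim the primer, drop short or ambiguous reads, and de-replicate.

    Returns a Counter that maps each unique sequence to its copy number.
    """
    kept = []
    for read in reads:
        seq = trim_primer(read.upper(), primer)
        if len(seq) < min_len:
            continue  # too short after trimming
        if seq.count("N") > max_ambiguous:
            continue  # too many ambiguous bases
        kept.append(seq)
    return Counter(kept)  # de-replication: identical reads are collapsed


# Hypothetical toy reads; the primer "ACGT" is made up.
reads = [
    "ACGTTTGACCATGGCATTAC",
    "ACGTTTGACCATGGCATTAC",  # exact duplicate of the first read
    "ACGTTTGANCATGGCATTAC",  # contains an ambiguous base
    "ACGTTTG",               # too short after primer removal
]
unique = prefilter(reads, primer="ACGT")
```

In this toy input, only the two identical reads survive and are collapsed into a single unique sequence with a copy number of two.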

In contrast to WGM approaches, targeted sequencing requires amplification of targets (reviewed in: Jünemann et al. 2017). For that, universal primers are designed and used in a polymerase chain reaction (PCR). Primer choice depends on the targeted gene and the domain of interest. For example, high-resolution primers for bacteria might not be ideal for archaea (Klindworth et al. 2013). Another aspect to consider is that the sequencing technology influences the ideal amplicon length (Klindworth et al. 2013). Subsequently, the amplified targets are sequenced. An example workflow can be found in the primer evaluation study by Klindworth and colleagues (Klindworth et al. 2013). However, alternative technologies, such as microarrays, have emerged. They are faster and cheaper than traditional sequencing approaches, but come at the expense of reduced functionality, especially regarding novel taxa (DeSantis et al. 2007).

In 1993, Carl Woese and Gary J. Olsen discussed the ideal properties of a universal marker (Olsen and Woese 1993) after Carl Woese had proposed the idea of using rRNA for this purpose (Woese and Fox 1977). At that time, no decision had been reached for a universal candidate. The 16S rRNA gene was proposed as the best candidate, as it is conserved in bacteria and archaea and is sufficiently long to carry enough evolutionary information (reviewed in: Olsen and Woese 1993). In addition, it carries several structural motifs, which lessens the effect of a single large-scale structural mutation on the information carried by the whole molecule (reviewed in: Olsen and Woese 1993). Importantly, due to functional constraints, it mutates slowly enough to resolve distant phylogenies. Still, some regions evolve at higher speed, making the gene a nearly universal marker (reviewed in: Woese 1987). The analysis is further eased by the fact that nucleotide substitutions are favored over other mutations (reviewed in: Olsen and Woese 1993). Even in 1993, the 16S rRNA gene had been extensively studied and indications for its superior performance had been found (reviewed in: Olsen and Woese 1993).

However, there is no magic bullet for all organisms and neither is there the one and only true marker (Olsen and Woese 1993). For fungi, the 28S rRNA gene and internal transcribed spacers (ITS) are widely used (reviewed in: Edwards et al. 2017; Schoch et al. 2012). Other marker genes include the elongation factors Tu (EF-Tu) and G (EF-G), the heat shock protein 70 (HSP70), RecA and RadA (Venter et al. 2004).

After sequencing, the basic steps of quality control and pre-processing are applied as outlined above. Subsequently, reads are clustered by similarity into operational taxonomic units by one of three approaches: the first approach is to assign DNA reads to taxa by comparison to a reference database (closed-reference OTUs) (reviewed in: Goodrich et al. 2014) using alignments (e.g.: Jing et al. 2017), placement onto phylogenetic trees (e.g.: Matsen, Kodner, and Armbrust 2010) or assessment of the subsequence (so-called word) structure (e.g.: Wang et al. 2007). Well-known reference databases include SILVA (Glöckner et al. 2017), Greengenes (McDonald et al. 2012) and RDP (Cole et al. 2014). The databases focus on different marker genes: SILVA focuses on 16S, 18S, 23S and 28S rRNA (Glöckner et al. 2017), Greengenes on 16S rRNA (McDonald et al. 2012) and RDP on bacterial/archaeal 16S rRNA and fungal 28S rRNA². Sequences that are sufficiently similar to a reference sequence in the database are placed into a bin. Sequences that fail to match any reference are excluded (reviewed in: Callahan, McMurdie, and Holmes 2017).

In the second approach, sequences are clustered by similarity between the reads themselves (de novo OTUs) (reviewed in: Goodrich et al. 2014; Callahan, McMurdie, and Holmes 2017). Optionally, de novo OTUs may also be assigned to a reference taxonomy: from each de novo OTU, one representative or all sequences can be analyzed. In the latter case, a consensus taxonomy is obtained for each bin. This lessens the effect of wrong classifications (Schloss and Westcott 2011).

Closed-reference and de novo methods may also be combined in the third approach: open-reference OTUs use the closed reference clustering first. Reads which do not match significantly are then clustered de novo (reviewed in: Goodrich et al. 2014; Callahan, McMurdie, and Holmes 2017).
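The three clustering approaches can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the procedure of any cited program: similarity is computed as the fraction of identical positions between equal-length, pre-aligned sequences (real tools use alignments, tree placement or word statistics), de novo clustering is a simple greedy scheme, and all names, sequences and the toy threshold of 0.9 are hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions (assumes equal-length, pre-aligned sequences)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closed_reference(reads, references, threshold=0.97):
    """Assign each read to its most similar reference; return (assignments, unmatched)."""
    assignments, unmatched = {}, []
    for read in reads:
        best_taxon, best_id = None, 0.0
        for taxon, ref in references.items():
            score = identity(read, ref)
            if score > best_id:
                best_taxon, best_id = taxon, score
        if best_taxon is not None and best_id >= threshold:
            assignments.setdefault(best_taxon, []).append(read)
        else:
            unmatched.append(read)  # a pure closed-reference workflow discards these
    return assignments, unmatched


def de_novo(reads, threshold=0.97):
    """Greedy clustering: a read joins the first OTU whose seed is similar enough."""
    otus = []  # list of (seed sequence, member reads)
    for read in reads:
        for seed, members in otus:
            if identity(read, seed) >= threshold:
                members.append(read)
                break
        else:
            otus.append((read, [read]))  # the read founds a new OTU
    return otus


def open_reference(reads, references, threshold=0.97):
    """Closed-reference assignment first, then de novo clustering of the remainder."""
    assigned, unmatched = closed_reference(reads, references, threshold)
    return assigned, de_novo(unmatched, threshold)


# Hypothetical toy data: 10-nt "marker genes", one mismatch equals 90 % identity.
references = {"taxon_A": "AAAAAAAAAA", "taxon_B": "CCCCCCCCCC"}
reads = ["AAAAAAAAAA", "AAAAAAAAAT", "GGGGGGGGGG", "GGGGGGGGGT", "TTTTTTTTTT"]
assigned, novel = open_reference(reads, references, threshold=0.9)
```

With these toy inputs, two reads are binned under taxon_A, while the three reads without a sufficiently similar reference form two de novo OTUs. Note that the outcome of the greedy step depends on the read order, which is one reason why de novo OTUs are dataset specific.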

2 https://rdp.cme.msu.edu/misc/rel10info.jsp#release11; accessed 14.03.2019.

The methods apply similarity thresholds that are thought to cluster sequences for a common rank, e.g. 97 % similarity for a common species (reviewed in: Goodrich et al. 2014). Effectively, this should cluster reads of 16S rRNA gene paralogs and strains from the same species. However, the paralog dissimilarity may exceed the expected limits (Acinas et al. 2004). Besides, 16S rRNA genes from distinct species may be too similar to be put into separate OTUs (Kawamura et al. 1995).

Consequently, the common thresholds have been subject to debate: the threshold for species discrimination was proposed in 1994 to limit DNA reassociation studies, the gold standard of species definition, to promising cases (Stackebrandt and Goebel 1994). However, it was shown that this value was too low (Stackebrandt and Ebers 2006). This was also confirmed by a recent study, which proposed a threshold of at least 99 % similarity for species level discrimination using full-length 16S rRNA gene sequences (R. C. Edgar 2018).

OTUs also offer the possibility to bin sequences amplified from the same template, even if they contain variation due to sequencing and amplification errors (Huse et al. 2010). However, this is largely dependent on the number of sequencing errors (Kunin et al. 2010) and the clustering algorithm used (Huse et al. 2010). Thus, an unexpectedly high OTU number may be used as a performance indicator of an analysis workflow (e.g.: Kunin et al. 2010; Huse et al. 2010).

Apart from OTUs, sequences may be clustered by amplicon sequence variants (ASVs). They are independent of marker gene databases and the changes within. Unlike de novo OTUs, ASVs are not dataset specific. However, they depend on the analyzed locus (reviewed in: Callahan, McMurdie, and Holmes 2017).

Researchers are often interested in comparing communities. These could be distinct communities or the same community at different time points. The diversity between samples is called the β-diversity. The α-diversity, in contrast, describes the community structure of a single sample (reviewed in: Lozupone and Knight 2005).
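To make the two notions concrete, the following Python sketch computes one common measure of each kind. The choice of measures is an assumption made for illustration, since the text above does not prescribe any: the Shannon index serves as an α-diversity measure and the Bray-Curtis dissimilarity as a β-diversity measure. The OTU names and counts are hypothetical.

```python
import math


def shannon_index(counts):
    """Alpha diversity: Shannon index (natural logarithm) of one sample."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample contains no reads")
    return -sum((n / total) * math.log(n / total) for n in counts.values() if n > 0)


def bray_curtis(a, b):
    """Beta diversity: Bray-Curtis dissimilarity between two samples (0 = identical)."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        raise ValueError("both samples are empty")
    shared = sum(min(a.get(taxon, 0), b.get(taxon, 0)) for taxon in set(a) | set(b))
    return 1 - 2 * shared / total


# Hypothetical OTU tables (read counts per OTU) of two samples.
sample_1 = {"OTU_1": 50, "OTU_2": 50}
sample_2 = {"OTU_1": 50, "OTU_3": 50}

alpha_1 = shannon_index(sample_1)       # two equally abundant OTUs: ln(2)
beta_12 = bray_curtis(sample_1, sample_2)  # half of the reads are shared: 0.5
```

Both samples have the same α-diversity here, although they share only one of their OTUs, which is why the two measures are usually reported together.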

In a conservative sense, assigning taxonomy to reads and optionally assessing diversity are the final steps of the analysis. This conservative approach has found wide application. Selected examples include assessing the microbial community in a mouse gut (Shin et al. 2016) and in water of the Dead Sea (Jacob et al. 2017), as well as attempts to use the community structure for forensic analysis, e.g. to estimate the time of death (Finley et al. 2016). However, marker gene analysis is also useful in the WGM approach: an estimate of the community structure gives hints for the appropriate sequencing depth. For a complex community with low abundance of individual species, the sequencing depth needs to be high to obtain sufficient coverage for the genome of interest (reviewed in: Allen and Banfield 2005).

Researchers have also used metaprofiling for predictive functional profiling (reviewed in: Knight et al. 2018). For that, 16S rRNA gene sequences from samples were compared to reference genomes. Genes from sufficiently similar organisms were considered as candidate genes, i.e. genes that may also have been present in the sample (Okuda et al. 2012). Candidates had to be sufficiently abundant within the genomes orthologous to the individual marker gene sequences (Okuda et al. 2012). This approach was shown to produce simulated metagenomes (so-called virtual metagenomes) that were similar to the real ones (Okuda et al. 2012). The gene functions were queried in the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Okuda et al. 2012). Functional prediction has been implemented in various software packages (e.g.: Langille et al. 2013; Aßhauer et al. 2015; Jun et al. 2015). Notably, the role of evolution between the query organism and its database relatives has also been modeled (Langille et al. 2013).

Functional prediction has been frequently applied, for example to monitor temporal changes in a biofilm (Okuda et al. 2012), dietary adaptations in the human gut (De Filippo et al. 2017) and sewage sludge processing (J. Gao et al. 2016). All in all, it combines the benefits of targeted sequencing, i.e. speed and low cost (reviewed in: Knight et al. 2018), with the analytical power of WGM in which the functions can be directly predicted on the reads (reviewed in: Thomas, Gilbert, and Meyer 2012). However, the accuracy of the method is largely dependent on the availability of closely-related metagenomes and a simple query community (Okuda et al. 2012). Besides, 16S rRNA gene similarity is not necessarily correlated with genomic similarity (reviewed in: Knight et al. 2018). After all, the results are just indications that should be validated by experimental observations.

While the taxonomic resolution in metaprofiling is lower than in WGM approaches (reviewed in: Knight et al. 2018), over 80 % of bacterial isolates could be identified to the species level (Mignard and Flandrois 2006). However, three factors have a particular impact on the analysis results: read length, choice of the variable region and the choice of the assignment program (Mizrahi-Man, Davenport, and Gilad 2013). Short read length often hinders the sequencing of the complete 16S rRNA gene. Thus, single variable regions of the gene are sequenced, which can be described as obtaining the "marker of a marker of genomic diversity" (Schloss 2010). Unfortunately, results depend strongly on the chosen region: longer variable regions perform comparably to full-length 16S rRNA gene sequences, while shorter ones deviate significantly (Schloss 2010). In addition, variable regions V3 and V4 were found to yield better classifications than other regions using a Naïve Bayesian classifier (Mizrahi-Man, Davenport, and Gilad 2013). Similarly, the choice of the variable region influences both the taxonomic resolution and the taxa found (Meisel et al. 2016).

Adding to biases due to study design, two other flaws have been proposed for the marker gene analysis. First, the PCR reaction has been subject to debate: with increasing PCR cycle number (M. T. Suzuki and Giovannoni 1996; Bonnet et al. 2002), annealing temperature, and mismatches in universal primers (Sipos et al. 2007), more biases were introduced into the analysis. The second flaw is due to features of the molecule itself. This will be briefly addressed using the example of the 16S rRNA gene: a large-scale study has shown that only ca. 13 % of the analyzed bacteria had a single copy of the whole rrn operon, whereas ca. 65 % of archaea had a single copy. Variation in copy number also occurred between strains of single bacterial species (Acinas et al. 2004). This finding implies that the number of detected 16S rRNA genes belonging to a single taxon is not necessarily correlated to its actual cellular abundance. However, specialized databases allow for copy number corrections of organisms in the database (e.g.: Stoddard et al. 2015). Less than half of the bacteria under study containing multiple 16S rRNA gene copies had identical ones. The same applied to ca. 25 % of archaea (Acinas et al. 2004). However, most disagreements were limited to 1 % sequence difference. One species showed particularly high sequence difference, but lower divergence towards other species in the same genus. This led the authors to suspect horizontal gene transfer (Acinas et al. 2004).


However, the biases introduced by marker gene analysis have been researched more thoroughly than those which occur during WGM processing (Knight et al. 2018). The approach has been used successfully in many studies, as outlined above. Apart from that, low price and high speed compared to WGM (reviewed in: Knight et al. 2018) add to the popularity of metaprofiling. The method is especially appropriate for samples with host contamination and samples with few organisms (Knight et al. 2018).

1.1.4. Topical challenges in metagenomics

Microbiologists and medical doctors often face the challenge that the structure of the observed microbiome does not (only) relate to the factor of interest. Instead, there is a variety of different influences. For example, the dietary lifestyle leads to long-term effects on the microbial community in the human gut (Wu et al. 2011). Besides, factors such as medication (Forslund et al. 2015) and sexual preference (Noguera-Julian et al. 2016) influence the microbial community. Apart from long-term community shifts, fluctuations of the microbiome are also suspected indicators of disease, as shown for inflammatory bowel disease (Halfvarson et al. 2017).

The effect of single factors on the microbial community can be elucidated by statistical testing. This, in turn, requires proper biological replicates. However, the field of metagenomics appears to lack the appreciation for that factor: in 2009, 82 % of all sequencing studies in five selected journals did not perform replicates (Prosser 2010). A non-exhaustive list of reasons comprises the high price for sequencing and the confusion of technical and biological replicates. The importance of universal statistical rules was often underestimated, given the advances in technology (Prosser 2010).

Again, owing partially to the high costs (reviewed in: Kunin et al. 2008), microeukaryotes, especially protists, are often excluded from metagenomic studies (reviewed in: Escobar-Zepeda, de León, and Sanchez-Flores 2015). Besides, results for this group come at a much higher expense of workload: eukaryotic genomes are much larger than prokaryotic ones and contain a higher portion of non-coding DNA (reviewed in: Gilbert and Dupont 2011; Escobar-Zepeda, de León, and Sanchez-Flores 2015). Splice variants further complicate functional annotation (Escobar-Zepeda, de León, and Sanchez-Flores 2015). Besides, reference databases are highly skewed towards selected groups, such as plants or fungi (reviewed in: Gilbert and Dupont 2011).

As outlined above, a study’s outcome depends strongly on the experimental procedures. Missing standard procedures hinder an effective comparison between studies (Escobar-Zepeda, de León, and Sanchez-Flores 2015). However, standard procedures have been proposed including guidelines for amplification and submission of sequences and metadata (Yilmaz et al. 2011).3

Still, there are challenges related to the analysis process itself. Briefly, sequencing errors have been shown to artificially increase the number of taxa recovered from a sample by metaprofiling, unless strict quality control is in place (Kunin et al. 2010). As outlined above, short read length also affects the metagenomic analysis. With the progress in sequencing technology, these issues will become less significant. However, challenges like contamination (Tanner et al. 1998; Mukherjee et al. 2015) and the formation of chimeric reads (in metaprofiling) (reviewed in: v. Wintzingerode, Göbel, and Stackebrandt 1997) or contigs (in WGM) (reviewed in: Kunin et al. 2008) are yet to be solved.

3 http://press.igsb.anl.gov/earthmicrobiome/protocols-and-standards/; accessed 19.03.2019.

1.1.5. Development of MetaG was motivated by topical challenges in the field

As outlined above, metaprofiling is still widely used either in conjunction with WGM approaches or as a standalone method. However, the bioinformatic analysis is often a challenge of its own. Most often, programs place high demands on computational resources and on the technical skill of the researcher (reviewed in: Escobar-Zepeda, de León, and Sanchez-Flores 2015). Thus, there is a need for easy-to-use software. Naturally, not all laboratories can afford to buy hardware that is able to cope with these demands. Thus, software should have few requirements or be virtually independent from these constraints.

With these requirements in mind, MetaG was developed. It can be run online on a provided server, so the main bottleneck for users is a good internet connection. In cases where this cannot be ensured or users want to analyze confidential data, a standalone version can be downloaded and installed manually. The software was designed to be easy to use. This is ensured by a comprehensive documentation (for online and local version) and strong standard parameters. More experienced users can adjust a variety of parameters according to their needs. Pre-clustering of OTUs was not natively supported, due to the ongoing debate about artificial thresholds and the concept itself (see above). However, experienced users may create OTUs themselves and, for example, only perform calculations on a representative sequence.

MetaG is not a strictly conservative assignment program, i.e. it does not only perform the assignment of taxa to reads. Rather, it also offers a probabilistic follow-up analysis by detecting potential human pathogens with antibiotic resistance phenotypes. Despite being developed for metaprofiling data, technically, WGM data may also be analyzed. This offers a broad field of application to potential users.


1.2. Nucleotide Polymorphisms

Organisms within a population frequently show variation in certain characteristics (traits). This variation is referred to as polymorphisms in population genetics. The groundwork for this phenotypic variation lies at the gene level: genes themselves may also be polymorphic (reviewed in: Brooker 2009). This is also due to changes at the level of individual nucleotides: single nucleotide polymorphisms (SNPs) are responsible for the majority of total differences between humans (reviewed in: Brooker 2009). A recent study on healthy Caucasian adults found an average of ca. 3.3 million SNPs per genome. Only an average of ca. 420,000 insertions/deletions (indels) per genome was detected (Shen et al. 2013). The frequency of a SNP is vital: by definition, a SNP must occur in at least 1 % of the population (reviewed in: Hartl and Jones 2009).

As for genes, the term allele is used to describe the different characteristics, i.e. nucleotide changes, of a SNP: in diploid organisms, a SNP is homozygous, if it is identical on both homologous chromosomes and heterozygous if it differs (reviewed in: Hartl and Jones 2009). However, the term should not lead to the belief that SNPs only occur in genes (reviewed in: Hartl and Jones 2009). In fact, the study presented above found most SNPs in intergenic regions (Shen et al. 2013). The single nucleotide changes often have only two alleles (are biallelic) due to the low mutation frequency and a bias favoring transitional nucleotide changes over transversions (reviewed in: Vignal et al. 2002).

Dependent on their position and practical use (see next section), the nucleotide changes can be grouped in five categories (reviewed in: Griffiths et al. 2008). If a single gene has the potential to alter the phenotype, a SNP may or may not contribute to this change. In the latter case, it is a silent mutation. SNPs are also present in intergenic regions and within polygenes, i.e. multiple genes that interact to form a phenotype. Last but not least, there are restriction fragment length polymorphisms (RFLPs). In that case, an allele may alter the binding of a restriction enzyme to its site within a genic or intergenic region (reviewed in: Griffiths et al. 2008).

Apart from SNPs and the aforementioned indels, there are more sources of genomic polymorphisms which are frequently found in humans, namely inversions and copy number variations (Tuzun et al. 2005). Still, the following work will concentrate on the single nucleotide polymorphisms.

1.2.1. SNPs: markers of genetic variability

If SNPs do not directly contribute to a phenotype or disease, they may be used as markers: SNPs can occur in clusters (so-called haplotypes) that are inherited together more often than expected by chance. Unlinked clusters would be broken by crossing-over (reviewed in: Griffiths et al. 2008). Thus, the SNPs are in linkage disequilibrium (ld). Importantly, other genetic elements of interest, like genes, inside the cluster are also included in the ld (reviewed in: Griffiths et al. 2008).

Observing that a certain haplotype is associated with a disease implies, in the simplest case, that the gene causing the disease is located within the cluster. Likewise, polygenes can be found in multiple haplotypes associated with a certain (degree of a) trait (reviewed in: Griffiths et al. 2008). Depending on the initial hypothesis, e.g. the gene of interest is on a defined chromosome, this will narrow down the search area by several orders of magnitude (reviewed in: Griffiths et al. 2008). The same is true for RFLPs. However, the detection of the RFLP is done by digesting the DNA with a restriction enzyme. One allele of the polymorphism may disable a certain cleavage site, whereas the other does not. Fragment length and abundance give hints on the allele and whether the individual is homo- or heterozygous (reviewed in: Brooker 2009). An example of this is the association study of an RFLP in the Hpa I restriction site indicating sickle cell disease (Kan and Dozy 1978).

Most SNP markers are observed to have two alleles, i.e. they are biallelic (reviewed in: Vignal et al. 2002). Thus, a higher number of individual SNPs is required to get the same resolution, i.e. the same number of distinct types, as with multiallelic markers. However, the biallelic nature limits major detection errors to the confusion of the two alleles (reviewed in: Vignal et al. 2002). Consequently, efforts are made to identify novel SNPs for use in agriculture: for example, as markers for favorable traits in breeding (e.g.: Chen et al. 2016), parental testing in herds and tracing of (contaminated) animal products through the chain of production (e.g.: Heaton et al. 2014).

SNPs also have a value as a subject of additional analyses in forensics: the applications include the identification of an individual (suspect or victim) based on material found at a crime scene (e.g.: Dixon et al. 2005) and the identification of relatives (e.g.: Tomas et al. 2010). If the suspect remains unidentified, further applications can narrow the number of candidates by predicting the ethnic background (e.g.: Phillips et al. 2007). On top of that, the analysis of SNPs offers a prediction of the general appearance. Polymorphisms have been identified which, for example, indicate skin color (Lamason et al. 2005) and predict brown and blue eyes (Walsh et al. 2011). Due to the interplay of genes with the environment, polygene encoded traits and aging, it seems unlikely to soon receive a portrait from DNA polymorphism analysis alone (Kayser and Schneider 2009; Butler 2012).

In ecology, SNPs may be used to assess the genetic diversity of a population. A recent study on albacore tuna (Thunnus alalunga) used SNPs to assess the impact of overfishing on the genetic diversity of the population (Laconcha et al. 2015). From a microbiology or health care perspective, SNPs can aid with finding the source of disease outbreaks as performed for cholera in Haiti (Hendriksen et al. 2011).

So far, the selected examples presented in this section dealt with SNPs that are inheritable. However, somatic mutations, e.g. single nucleotide variants (SNVs), have strong implications on cancer (reviewed in: Raphael et al. 2014). Detection of specific mutations is already in use to individually adjust cancer treatment (e.g.: Herbst et al. 2016).


1.2.2. Identification and analysis of SNPs using next-generation sequencing (NGS)

A recent study stated that "two of the most significant tasks [for next-generation sequencing] include alignment to a reference genome and detection of single nucleotide polymorphisms" (Mielczarek and Szyda 2016). This section will summarize the workflow of an NGS-based SNP study with a focus on the bioinformatics analysis. Briefly, after the sequencing, the raw reads are base-called and quality controlled (reviewed in: Altmann et al. 2012). The next step is crucial for SNP calling: the DNA reads are aligned to a reference database. Studies have examined the impact of different alignment programs and found that the choice significantly impacted the number of (correctly) called SNPs (Hamada et al. 2011; Altmann et al. 2012; Clevenger et al. 2015). Naturally, the choice of the corresponding parameters also impacts the success of the study: settings should be strict enough to avoid the generation of spurious SNPs. Still, settings must also allow a certain degree of sequence ambiguity which is represented by the true SNPs (Altmann et al. 2012). However, the degree of true variation is specific to the organism (Keane et al. 2011; The 1000 Genomes Project Consortium 2010) and the specific sequence under study: for example, the major histocompatibility complex (MHC) class II gene DRβ has 323 alleles, whereas DRα has two (Janeway et al. 2001).

As the choice of optimal parameters is far from self-evident, alignments should be double checked by the user. This includes evaluating the number of aligned reads (mapping statistics) and manually inspecting (parts of) the alignment. The verified alignments should then be grouped by their start on the respective targets (reviewed in: Altmann et al. 2012). If a PCR-based sequencing strategy has been used, it is common practice to exclude alignments of similar reads on the assumption that they are artifacts. However, reads could also just be flagged as duplicates (reviewed in: Altmann et al. 2012). Duplicate removal resolves artificial peaks in sequencing depth which, in heterozygotes, may skew the allelic abundances at a spot (R. Li et al. 2009). However, PCR artifacts skew SNPs only to a limited extent (H. Li 2014). Apart from spurious reads, spurious alignments of the same read to different regions of the target should be removed (reviewed in: Altmann et al. 2012). To improve the quality of alignments, users may perform a realignment around indels (reviewed in: Altmann et al. 2012).

As a last step of alignment preprocessing, the quality scores for each base given by the sequencing platform may be recalibrated. This is because it was shown that the instrument-given scores could be further improved (R. Li et al. 2009; Brockman et al. 2008). Using Illumina data as an example, a recalibration was proposed which takes into account the quality scores given by the instrument, the cycle number and the type of substitution (R. Li et al. 2009).

Next, SNP calling and genotype calling can be performed, which can be seen as virtually the same when analyzing a single organism (reviewed in: Nielsen et al. 2011). The analysis may start from reads or from an assembled genome. However, the assembled genome has lost the coverage data per position. Thus, artifacts cannot be identified by the observed coverage (reviewed in: Olson et al. 2015). In the following, I will focus on read-based approaches.

Software solutions are often assigned to one of two groups: the "classical" approach using fixed cutoffs and a probabilistic approach (reviewed in: Nielsen et al. 2011). In the "classical" approach, reads are first filtered by quality. Then the allele counts at each position are evaluated by their relative frequencies. For example, if the non-reference allele (i.e. the SNP) has an abundance of at least 20 %, the genotype is heterozygous. If the abundance is greater than 80 % the genotype is homozygous for the SNP. The same applies to the reference allele (Harismendy et al. 2009). This approach is most reliable at a coverage of at least 20. However, minor adaptations to the thresholds presented above yielded reliable results also for lower coverage (Hedges et al. 2009).
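The fixed-cutoff logic described above can be illustrated with a minimal Python sketch. The function name, the labels and the handling of low coverage are hypothetical choices made for illustration; only the 20 % and 80 % thresholds and the coverage of 20 are taken from the text.

```python
def call_genotype(ref_count, alt_count, het_min=0.20, hom_min=0.80, min_coverage=20):
    """Classify one site from allele counts using fixed cutoffs.

    Returning "low_coverage" below min_coverage is an assumption of this
    sketch; the text only states that the approach is most reliable at a
    coverage of at least 20.
    """
    coverage = ref_count + alt_count
    if coverage < min_coverage:
        return "low_coverage"
    alt_fraction = alt_count / coverage
    if alt_fraction > hom_min:
        return "hom_alt"   # homozygous for the SNP allele
    if alt_fraction >= het_min:
        return "het"       # both alleles are sufficiently abundant
    return "hom_ref"       # reference allele exceeds 80 %
```

For example, 15 reference and 15 non-reference observations would be classified as heterozygous, while 2 versus 28 would be classified as homozygous for the SNP.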

Still, at lower coverage, the probabilistic model is expected to perform better in calling heterozygotes (reviewed in: Nielsen et al. 2011). Additionally, no quality filtering needs to be performed, so no important information is lost by too strict cutoffs (reviewed in: Nielsen et al. 2011). In the following, the algorithm is explained using the example of a study applying Bayes' theorem (R. Li et al. 2009). Briefly, the genotype at each position was chosen by a cutoff dependent on the probability of observing the genotype given the data. It could be calculated from the prior probability of observing the genotype and the probability of the data given the genotype. The former was calculated based on the sample, reference genome and the expected frequency of transitions vs. transversions. The latter was influenced by the alleles observed, their quality scores and abundances, and the sequencing cycle (R. Li et al. 2009). The prior probability is often designed in a way to penalize heterozygotes following the logic that more heterozygotes than homozygotes are spurious (reviewed in: Nielsen et al. 2011).
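To make the relationship between prior, likelihood and posterior more tangible, the following Python sketch computes genotype posteriors for a single biallelic site. It is a strongly simplified illustration and not the model of R. Li et al. (2009): the likelihood only uses Phred quality scores (ignoring sequencing cycle and substitution type), the priors are passed in by the caller, and all names are hypothetical.

```python
import math

# Expected fraction of non-reference alleles for each diploid genotype.
ALT_FRACTION = {"hom_ref": 0.0, "het": 0.5, "hom_alt": 1.0}

def genotype_posteriors(bases, quals, ref, alt, priors):
    """Return P(genotype | data) for one site.

    bases: observed bases; quals: Phred scores (assumed > 0);
    priors: dict mapping each genotype in ALT_FRACTION to its prior.
    """
    log_post = {}
    for genotype, alt_frac in ALT_FRACTION.items():
        log_lik = 0.0
        for base, q in zip(bases, quals):
            err = 10 ** (-q / 10)  # Phred score to error probability
            p_ref = 1 - err if base == ref else err / 3
            p_alt = 1 - err if base == alt else err / 3
            # A read stems from either allele in proportion to the genotype.
            log_lik += math.log((1 - alt_frac) * p_ref + alt_frac * p_alt)
        log_post[genotype] = math.log(priors[genotype]) + log_lik
    # Normalize in log space to avoid numerical underflow.
    peak = max(log_post.values())
    total = sum(math.exp(v - peak) for v in log_post.values())
    return {g: math.exp(v - peak) / total for g, v in log_post.items()}
```

With a prior that penalizes heterozygotes, a site is only called heterozygous if the observed alleles outweigh that penalty, which mirrors the reasoning given above.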

So far, the workflow dealt with the processing of one sample at a time. However, the processing of multiple samples may aid with the calculation of the prior probabilities (reviewed in: Nielsen et al. 2011). Information about positions in linkage disequilibrium can also be used to rate previously called SNPs (e.g.: Le and Durbin 2011). As for the alignment algorithms, the choice of a SNP caller influences the results (Altmann et al. 2012; Clevenger et al. 2015).

After the calling, polymorphic candidate sites may be filtered. Li (H. Li 2014) proposed the use of six filters: the filters applied to regions of low complexity (1) or those with artificially high coverage (2). Sites were filtered if the polymorphic alleles at a site were too seldom (3). The cutoff was also applied strand-specifically (4). Candidates that were significantly associated to one of the strands, as revealed by the Fisher's exact test, were also filtered (5). Finally, SNPs were filtered based on their quality (6).
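Filter 5 relies on Fisher's exact test applied to a 2×2 table of allele counts per strand. The following Python sketch computes a two-sided p-value from the hypergeometric distribution using only the standard library. The function name and the table layout (reference/alternative allele by forward/reverse strand) are assumptions made for illustration; the significance cutoff used by H. Li (2014) is not reproduced here.

```python
from math import comb

def strand_bias_p_value(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Two-sided Fisher's exact test on a 2x2 strand/allele table."""
    row_ref = ref_fwd + ref_rev
    row_alt = alt_fwd + alt_rev
    col_fwd = ref_fwd + alt_fwd
    total = row_ref + row_alt

    def table_probability(x):
        # Hypergeometric probability of x reference reads on the forward strand.
        return comb(row_ref, x) * comb(row_alt, col_fwd - x) / comb(total, col_fwd)

    observed = table_probability(ref_fwd)
    lowest = max(0, col_fwd - row_alt)
    highest = min(row_ref, col_fwd)
    # Sum all tables that are at most as likely as the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(
        table_probability(x)
        for x in range(lowest, highest + 1)
        if table_probability(x) <= observed * (1 + 1e-9)
    )
```

A small p-value indicates that the alternative allele is concentrated on one strand, which is the pattern this filter is meant to catch.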

Ambiguity in the low complexity regions was not caused by PCR alone, but also by the local realignment attempted by the algorithms. Thus, the associated filter 1 was the most effective under study (H. Li 2014). Exceptionally high read depth was seen as an indicator for paralogs and copy number variations that only existed in the sample. These artifacts increased the number of heterozygotes (H. Li 2014). In scenarios with limited read depth and/or non-uniform coverage, the author proposed to use a Hardy-Weinberg filter instead (H. Li 2014). Filters 3-6 were highly dependent on the individual analysis outline (H. Li 2014). Finally, performing the analysis with more than one caller and filtering all but the intersection, followed by the application of further filters (as outlined above), may provide additional confidence for the calls (H. Li 2014).

Software solutions exist that help the user to filter SNP results and to derive further information. The most important functions of ANNOVAR, for example, are filtering SNPs based on information from (public) databases and providing metadata such as information on important structural elements in the vicinity of the SNP and the functional consequences of the mutation (K. Wang, Li, and Hakonarson 2010). Note, however, that the software is not exclusive to SNP data (K. Wang, Li, and Hakonarson 2010).

1.2.3. Challenges in SNP studies

Systematic errors, e.g. the specificity of a SNP to the analysis pipeline, are more likely to bias the analysis than random errors (reviewed in: Olson et al. 2015). The former may be caused by a variety of factors, a discussion of which is beyond the scope of this section. The latter may be overcome given enough data (reviewed in: Olson et al. 2015). However, the polymorphism analysis presents a range of other challenges arising from the biological context.

Not all organisms have the same number of chromosome sets. In plants, for example, polyploidy was shown to be frequent among angiosperms (Masterson 1994). Polyploid organisms are defined as having more than two chromosome sets (reviewed in: Leitch and Bennett 1997). This has a significant impact on the analysis: first, it influences the sequencing depth. If all alleles should be observed with a probability of 95 %, the estimated needed coverage almost doubles between diploids and tetraploids (Griffin, Robin, and Hoffmann 2011). Second, polyploids do not show a uniform rate of divergence between their chromosome sets. Rather, the general trend depends on the type of polyploidy (reviewed in: Clevenger et al. 2015).

There are autoploid plants in which the chromosome set was subject to intraspecies duplication. In alloploid plants, the duplication arose from hybridization of species (reviewed in: Glover, Redestig, and Dessimoz 2016). Homoeologs (also: homeologs) are former homologs which were reunited in alloploids (reviewed in: Glover, Redestig, and Dessimoz 2016). Thus, the homoeologous SNPs need to be separated from allelic SNPs, for example in marker analysis (Clevenger et al. 2015).

On the contrary, bacteria are monoploid (reviewed in: Snustad and Simmons 2010). However, the SNP analysis in these organisms is complicated by occasionally high mutation rates leading to low SNP allele frequencies (reviewed in: Olson et al. 2015). This requires adaptations to existing algorithms although they might have been already validated on organisms from other branches of the tree of life (Olson et al. 2015). Diverse but not clonal bacteria may profit from comparison to multiple reference genomes (reviewed in: Olson et al. 2015).

In cancer research, the portfolio of challenges is further expanded: research focuses on so-called driver mutations which actually cause cancer. The difficult task is to discern drivers from passenger mutations, which are unassociated somatic and germline mutations (reviewed in: Raphael et al. 2014). Additionally, a tumor may not be purely clonal but consist of several subclones (e.g.: Pantou et al. 2005; Campbell et al. 2008).

Several approaches exist to cope with these issues (reviewed in: Raphael et al. 2014): a mutation is likely a driver mutation if it is repeatedly observed with this condition. This indication may be supported by functional profiling for non-silent SNVs and cluster or pathway identification (reviewed in: Raphael et al. 2014). Still, further research on non-coding SNVs is needed (reviewed in: Raphael et al. 2014). The importance of this can be estimated from the results of a study on British individuals. It was shown that diseases, such as coronary artery disease or type 2 diabetes, were also associated with non-coding SNPs (The Wellcome Trust Case Control Consortium 2007). Still, rare SNVs present challenges to statistical association testing (reviewed in: Z. Yang and Bielawski 2000).

1.2.4. The polymorphism calculation in NanoPipe: robust and easy to use

The software NanoPipe was designed to enable researchers with or without experience in computational analysis to produce comprehensive results from their nanopore sequencing data (Shabardina et al. 2019). Prior to the start of my thesis, the polymorphism analysis consisted of raw nucleotide counts without any metadata. To fit the concept of NanoPipe, the polymorphism analysis was designed to provide an intuitive and clearly arranged output. Besides, the presentation of metadata should enable users to perform in-depth analyses. For example, the quality of each SNP call was approximated and links to public databases for known SNPs were made (Shabardina et al. 2019).

Above, two major competing styles of algorithms for SNP analyses were presented: the conservative and the probabilistic approach. Both differ in the number of assumptions made (reviewed in: Nielsen et al. 2011) and the conservative approach is sometimes regarded as old-fashioned (e.g.: Altmann et al. 2012). However, it was proposed that SNP callers have to be revalidated, especially if they are used for other clades in the tree of life (Olson et al. 2015). This is expected to be especially true for callers that make more (elaborate) assumptions. Derived from the general concept of NanoPipe, the analysis should be open to researchers from diverse fields (Shabardina et al. 2019). This precluded pre-validation of each and every case by the developers. The credo of easy use (Shabardina et al. 2019) also ruled out forcing users to validate the program on their own analyses. Thus, the SNP calling was designed in the conservative way to be as robust as possible. However, probabilistic calculations were made to compare SNP calls at the same site and also across the analysis. The results were provided as metadata (Shabardina et al. 2019).

Common SNP analyses, consisting of combinations of aligners and SNP callers, vary broadly in their demands on the computational setup. Besides, they are often limited to selected operating systems (reviewed in: Mielczarek and Szyda 2016). Not all laboratories can afford to buy and/or maintain these systems. Thus, the integration of the SNP calling into a web interface that can be accessed from any operating system and hardware enables more researchers to conduct their analyses (Shabardina et al. 2019). The only choke point may be the internet connection needed to submit the input data. Therefore, NanoPipe and its polymorphism calculation are also available as stand-alone versions (Shabardina et al. 2019). This also allows experienced users to customize the program and to use more computational power for each calculation.


2. Materials and Methods

2.1. Design of MetaG

Figure 1: Simplified workflow of MetaG with the most relevant files (boxes), stages and programs.

The main workflow of MetaG is depicted in Figure 1. The analysis started with query reads in fasta format from a metaprofiling sample. These were aligned to a marker gene database by LAST 963 (Kiełbasa et al. 2011). Since one query could have multiple alignments with different database entries, only relevant alignments were retained during post processing. The alignments for each query were then used to assign the respective taxonomy information. This information was obtained from a separate database taxonomy file. The assignment was then looked up in PATRIC (Wattam et al. 2017) to highlight pathogenic species. The assignments of all queries and the pathogenic metadata were visualized in separate interactive html graphics using KronaTools 2.7 (Ondov, Bergman, and Phillippy 2011).

To make the algorithm more understandable, functions were distributed over a range of modules. MetaG has also been fitted with a web interface by Norbert Grundmann. Details of the web implementation are beyond the scope of the following sections.

2.1.1. Database preparations

The first step of the metaprofiling analysis was the alignment. Before the alignment could be calculated, the database sequences in fasta format had to be indexed once by LASTdb 963 (Frith, Hamada, and Horton 2010) using standard settings. This created a database in LAST format which could be reused for other alignments against the same reference database.

lastdb [dbPath][dbName] [db.fasta]

The alignments profited from optimal parameters and an optimal substitution matrix. These were specific for the sequencing platform and genome, due to patterns like AT content (Hamada et al. 2017). Thus, LAST-TRAIN 963 (Hamada et al. 2017) was used with standard settings to obtain optimal alignment matrixes and parameters for the analyses (see also section 2.2.3.).

last-train [dbPath][dbName] [query.fasta] > [matrix.mat]

LAST-TRAIN aligned query and database using an X-drop algorithm. The expected numbers of gap openings, extensions and nucleotide substitutions were then used to rate the alignments. Ambiguous alignments were removed by LAST-SPLIT (Frith and Kawaguchi 2015). This step was repeated with varying parameters until only non-significant improvements were made (Hamada et al. 2017).

Although users were not required to use LAST-TRAIN in the MetaG pipeline, it was advisable to train LAST on the specific analysis to improve the alignment results (Hamada et al. 2017). However, users of MetaG without a previously calculated alignment were required to provide some kind of substitution matrix and/or LAST parameters for the analysis: these could be self-developed or the predefined standards (see section 2.2.3.).

2.1.2. Alignment

The calculated substitution matrix and/or LAST parameters were used to align the database and the query using LAST 963.

lastal [dbPath][dbName] [query.fasta] > [alignment.maf]

LAST used adaptive seeds for the alignment (Kiełbasa et al. 2011): the initial step was to obtain the shortest seed (stretch of bases) in the query that matched a database index less often than a given frequency threshold. The seed was then extended to build a full alignment (Kiełbasa et al. 2011). The length of adaptive seeds was variable, in contrast to fixed-length seeds (Kiełbasa et al. 2011). The latter were used in popular aligners like BLAST (Altschul et al. 1997), but offered less sensitive results with longer calculation time (Kiełbasa et al. 2011).
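The idea of an adaptive seed can be illustrated with a naive Python sketch. It is not how LAST is implemented: LAST works on an index of the database, whereas this sketch simply counts substring occurrences. The function names and the inclusive treatment of the threshold (max_hits) are assumptions made for illustration.

```python
def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    count = 0
    position = text.find(pattern)
    while position != -1:
        count += 1
        position = text.find(pattern, position + 1)
    return count

def adaptive_seed(query, database, start, max_hits):
    """Return the shortest query substring beginning at start that occurs
    in the database at least once and at most max_hits times, else None."""
    for end in range(start + 1, len(query) + 1):
        seed = query[start:end]
        hits = count_occurrences(database, seed)
        if hits == 0:
            return None  # extending further can never produce a match
        if hits <= max_hits:
            return seed  # rare enough to serve as a seed
    return None
```

In this sketch, a seed in a repetitive region automatically becomes longer until it is sufficiently rare, whereas a seed in a unique region stays short.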

To filter ambiguous alignments from the LAST output, LAST-SPLIT 963 was used. Ambiguous alignments contained a query sequence that aligned to multiple reference sequences with equal quality. These alignments had a high probability that each base was erroneously aligned (Frith and Kawaguchi 2015). Setting the -m parameter made it possible to discard alignments whose error probability was above the given threshold. The filtered alignment was returned in MAF format (-fMAF) and no split alignments were performed (-n).

last-split -fMAF -n -m[error threshold] [alignment.maf] > [filtered.maf]

Splitting would have been useful in an alignment with two highly similar regions separated by a dissimilar region. Depending on the length and similarity of the three regions, it would have been feasible to split the alignment into single ones: one for each of the two highly similar regions (Frith and Kawaguchi 2015). This was not applicable here, since only highly conserved amplicon regions were considered.

The MetaG workflow did not use LAST and LAST-SPLIT separately; rather, the LAST alignments were directly processed by LAST-SPLIT. This saved disc space, as MAF files with many ambiguous alignments tended to be very large (see also section 2.2.3.).

2.1.3. Post processing

The initial stage of post processing was obtaining the query IDs from the input read file. For that, each fasta header was split by whitespace and the first part was used as the query ID. Therefore, the first part of the header had to be unique. Otherwise, information was lost.

The assignments of queries to individual database IDs in the alignments were also loaded into memory. Each individual alignment in MAF format had three lines: the first line started with an a and provided the algorithm with the e-value (E) and alignment score (score). The following two lines started with an s. From the first s-line, the algorithm obtained the database ID; from the second one, it obtained the query ID. For each alignment, e-value, alignment score and database ID were saved in an array. This array was assigned to the respective query ID. Since a query could have multiple alignments, the individual sub-arrays were stored in a global array for the individual query ID.
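The loading step described above can be illustrated with a short Python sketch. It is not the Perl code used in MetaG; the function name parse_maf and the example records are hypothetical, and it assumes that the a-line carries score=... and E=... fields and that the first s-line names the database entry while the second names the query, as described in the text.

```python
from collections import defaultdict


def parse_maf(lines):
    """Collect (e_value, score, database_id) tuples per query ID.

    Assumption: each alignment block has one a-line with score= and E=
    fields, followed by two s-lines (database first, query second).
    """
    hits = defaultdict(list)
    score = e_value = None
    s_names = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "a":
            # a-line: whitespace-separated key=value pairs
            pairs = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
            score = float(pairs["score"])
            e_value = float(pairs["E"])
            s_names = []
        elif fields[0] == "s" and score is not None:
            s_names.append(fields[1])
            if len(s_names) == 2:
                db_id, query_id = s_names
                hits[query_id].append((e_value, score, db_id))
                score = None  # ignore further lines until the next a-line
    return dict(hits)


# Hypothetical two-alignment example for one read
example = """a score=100 EG2=1e-10 E=1e-20
s db1 0 10 + 100 ACGTACGTAC
s read1 0 10 + 10 ACGTACGTAC

a score=80 EG2=1e-5 E=1e-8
s db2 5 10 + 200 ACGTACGTAC
s read1 0 10 + 10 ACGTACGTAC
""".splitlines()

alignments = parse_maf(example)
```

Keeping one list of tuples per query ID mirrors the "global array of sub-arrays" layout, so the later filtering steps can work on one query at a time.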

To assign taxonomy to each read, the algorithm needed to link the database IDs to the corresponding taxonomies. Loading the whole database fasta file into memory would have been inefficient, as the individual sequences were not needed at this point of the analysis. Thus, a separate taxonomy file was created. It displayed all IDs, each in an individual line. Each line was formatted as follows: it started with the database ID, followed by a semicolon and ended with the semicolon-separated taxonomy. The taxonomy had to have 10 ranks: domain, phylum, class, subclass, order, suborder, family, genus, species and strain. If one rank was unknown, it was set to 0.
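As an illustration of the taxonomy file format, the following Python sketch parses one line into its ten ranks. The function name and the example line are hypothetical; only the semicolon-separated layout, the rank order and the use of 0 for unknown ranks are taken from the description above.

```python
RANKS = ("domain", "phylum", "class", "subclass", "order",
         "suborder", "family", "genus", "species", "strain")


def parse_taxonomy_line(line):
    """Split 'ID;domain;...;strain' into the ID and a rank-to-taxon mapping."""
    db_id, *taxa = line.rstrip("\n").split(";")
    if len(taxa) != len(RANKS):
        raise ValueError(f"expected {len(RANKS)} ranks, found {len(taxa)}")
    return db_id, dict(zip(RANKS, taxa))


# Hypothetical entry; unknown ranks are given as 0
entry_id, lineage = parse_taxonomy_line(
    "S0001;Bacteria;Firmicutes;Bacilli;0;Bacillales;0;"
    "Bacillaceae;Bacillus;Bacillus subtilis;168"
)
```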

In an ideal scenario, each query read would have had a perfect alignment with just one database ID. However, in the real-world scenario this was quite rare, due to errors during sequencing and/or alignment. Therefore, the alignments needed to be filtered to find the most likely matches. The stored alignments were filtered by an e-value cutoff (from here on: ec) and alignment score cutoff (from here on: ac). Alignments with an e-value larger than ec were discarded. To do this efficiently, the sub-arrays in each global array (i.e., per query ID) were sorted by the e-value. Then, only the sub-arrays with an e-value smaller than or equal to ec were retained by a binary search. A binary search was an efficient method to search in a sorted array. It limited the number of necessary comparisons, which decreased the runtime (Chadha, Misal, and Mokashi 2014). I implemented an adapted version of the modified binary search algorithm by Chadha and colleagues (Chadha, Misal, and Mokashi 2014).

Consider the hypothetical example in Figure 2: the task was to find all e-values in the array which were smaller than or equal to the query e-3. First, it was verified that any matching element could be found in the array. For that, the query needed to be larger than or equal to the first element, e-7 (see also Figure 2, [1]). Subsequently, an element in the middle was analyzed (see also Figure 2, [2]): it had to be smaller than or equal to the query. This was true; however, the next element had to be bigger than the query to get all matching elements in the array. This was not the case here, so the search index was increased by 50 %. The element at this position was bigger than the query, i.e. it was a mismatch to the initial task (see also Figure 2, [3]). The algorithm now checked every element between the last match e-4 and the mismatch e-2 (see also Figure 2, [4]) to find the match with the highest array index.

In this case, this was e-3. The whole array up to and including e-3 was returned. In this example, the algorithm needed four steps with five comparisons. Note that step two actually included two comparisons. A conventional linear search would have analyzed all elements, resulting in eight comparisons. The difference between conventional and binary search becomes more pronounced for longer arrays.

Figure 2: Illustration of the binary array search as implemented in MetaG. Comparisons of query >= array element (individual boxes) are represented by arrows with numbers indicating the order. The hypothetical query is e-3.
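The effect of the e-value filter can be sketched in Python with the standard bisect module. This is a plain textbook binary search and not the adapted algorithm of Chadha and colleagues used in MetaG; the function name and the eight example e-values are assumptions chosen to resemble Figure 2.

```python
from bisect import bisect_right


def filter_by_evalue(alignments, ec):
    """Keep all (e_value, score, db_id) tuples with e_value <= ec."""
    # Sort by e-value so that a binary search can locate the cut position
    ordered = sorted(alignments, key=lambda hit: hit[0])
    e_values = [hit[0] for hit in ordered]
    # Index of the first element that is strictly larger than the cutoff
    cut = bisect_right(e_values, ec)
    return ordered[:cut]


# Hypothetical array of eight alignments with increasing e-values
hits = [(10.0 ** -k, 100.0, f"db{k}") for k in range(7, -1, -1)]
kept = filter_by_evalue(hits, 1e-3)
```

With these values, the five alignments from 1e-7 up to and including 1e-3 are retained, matching the outcome of the worked example.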

Applying the ec removed whole query IDs from the results if they failed to meet the threshold. However, the ac was not able to remove query IDs. It was a relative threshold and depended on the best alignment score for each query ID: the retained sub-arrays meeting the ec were sorted by the alignment score to obtain the maximum score. The maximum was subsequently multiplied by the ac. The result was the absolute alignment score threshold for that query ID. Alignments with a score lower than the absolute threshold were removed. An ac of one therefore retained only the top-scoring alignments, but always at least one alignment per query ID.
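The relative score threshold can be written down in a few lines of Python. The sketch below is illustrative only; the function name and the example scores are hypothetical, and the tuples follow the (e-value, score, database ID) layout described earlier.

```python
def filter_by_score(alignments, ac):
    """Keep alignments whose score reaches ac times the best score."""
    if not alignments:
        return []
    best = max(score for _, score, _ in alignments)
    threshold = best * ac  # absolute threshold for this query ID
    return [hit for hit in alignments if hit[1] >= threshold]


# Hypothetical alignments of one query that already passed the ec
hits = [(1e-20, 100.0, "db1"), (1e-15, 80.0, "db2"), (1e-10, 40.0, "db3")]
half = filter_by_score(hits, 0.5)  # absolute threshold: 50.0
top = filter_by_score(hits, 1.0)   # only the top-scoring alignment
```

Because the threshold is derived from the best score of the same query, at least the best alignment always survives, which is why the ac cannot remove a query ID.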

2.1.4. Taxonomic assignment

To report a query ID as coming from a defined lineage, each query and each rank were checked for a common taxon of all database hits. In the following, the algorithm will be explained for one query and one rank only, since the steps were the same for each rank and query.

The very first step was to check whether the query was present in any of the filtered alignments from the previous section. Some queries might not have been aligned by LAST, or were filtered out by applying the ec. These queries were reported as UNMATCHED in the per-query output of MetaG. Besides, they were accounted for in the output statistics of calc.LOG.txt (see section 2.1.5. for more details).

Each query passing this condition was then analyzed at each of the ten ranks given in the database taxonomy. The taxon name was extracted from the database taxonomy using the respective identifier. The abundance of each single taxon at the given rank was calculated from the filtered database matches (see also previous section). Besides, the mean alignment score and e-value for each taxon were calculated.

Often, multiple taxa were reported for each query. Two selected strategies to solve this issue are briefly presented in the following: the first one was used by the taxonomic classifier MEGAN (Huson et al. 2007). MEGAN chose the lowest common ancestor (lca) of all query hits. This means that only the last rank at which all database hits revealed a common taxon was reported (Huson et al. 2007).

The other approach was to assign the most abundant taxon. The advantages and disadvantages of both methods become apparent, if one imagines the reference database as a tree, with the nodes being taxa at a given rank. If a query is assigned to the lca of all matches (see Figure 3, A: blue dot), all nodes belonging to candidate taxa will be true positives (TPs), all others will be false positives (FPs) (see Figure 3, A) (Fosso et al. 2018). If a lower internal node is chosen (see Figure 3, B: blue dot), all nodes below that node are TPs, if they belong to one of the candidate taxa, else they are FPs. Nodes under other internal nodes at the same rank are false negatives (FNs), if they belong to candidate taxa, else they are true negatives (TNs) (see Figure 3, B). If the assignment to the internal node is shifted to a node below, more FPs and also TPs will be lost and the number of TNs and FNs will be increased (Fosso et al. 2018). This influences statistical parameters as discussed in the following sections.

Figure 3: Assignment of a query ID to a node of the reference taxonomy (blue dot). Candidate taxa from the filtered alignments are depicted in grey, other taxa are white. A: Assignment of the query ID to the lowest common ancestor of all candidates; B: Assignment of the query ID to a lower internal node (more specific rank). Both parts of the figure were obtained and modified from Fosso et al. (Fosso et al. 2018).

MetaG chose the most abundant taxon for a rank. If two taxa were equally abundant, the taxon with the lower e-value or higher alignment score was assigned. However, if two taxa were also equal in these metrics, this rank and all following ranks were unassigned. For the next rank, only taxa from the same lineage were allowed as candidates. Compared to the lca approach, this will decrease FPs at the expense of FNs. However, this simple strategy was also problematic: imagine ten database entries matching a query. At the domain level, seven supported a common taxon. The taxon was assigned and at the phylum level, four of the seven remaining IDs supported a taxon. This process continued until only very few of the initial matches supported the most specific taxon. In order to give information about the statistical support for the most abundant taxon, a confidence parameter was introduced.
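The majority decision with tie-breaking at a single rank might be sketched as follows in Python. This is an illustration under assumptions, not the MetaG source: the function name is hypothetical, ties are broken by the mean e-value first and the mean alignment score second (the text does not fix this order), and None stands for an unassigned rank.

```python
from collections import defaultdict
from statistics import fmean


def pick_taxon(hits):
    """hits: list of (taxon, e_value, score) for one query at one rank.

    Returns the most abundant taxon, or None if the best two candidates
    cannot be distinguished (assumed tie-break order: e-value, then score).
    """
    if not hits:
        return None
    grouped = defaultdict(list)
    for taxon, e_value, score in hits:
        grouped[taxon].append((e_value, score))

    def key(taxon):
        e_values, scores = zip(*grouped[taxon])
        # More hits first, then lower mean e-value, then higher mean score
        return (-len(e_values), fmean(e_values), -fmean(scores))

    ranked = sorted(grouped, key=key)
    if len(ranked) > 1 and key(ranked[0]) == key(ranked[1]):
        return None  # complete tie: this rank stays unassigned
    return ranked[0]
```

For the next rank, the same function would be called only with hits whose lineage agrees with the taxon chosen here.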

The confidence was a two-part parameter. The first part taken into account was the abundance of the matched taxa. The second aspect was the quality of the alignments supporting the taxa. To calculate the total confidence for the most abundant taxon, the taxa were sorted by e-value (low to high), alignment score (high to low) and abundance (high to low). Starting with the e-value, the three parameters were checked to find the one that was able to distinguish the first and the last taxon from the sorted taxa. This parameter was then used as a measure of quality. Subsequently, weights were assigned according to the quality of each taxon. The weight was four, for all taxa showing the maximum quality, two, for the best third excluding the very best taxa, one, for the following third, and zero, for the worst third. The weight for each taxon was then multiplied by the abundance of the taxon. The weighted abundance of the most abundant taxon added to its total confidence, while any other weighted abundance was subtracted from the total confidence. The total confidence was then divided by the number of taxa with a weight above zero. For convenience, the result was expressed as a number between zero and one, bad and good confidence, respectively. There were two major scenarios that prevented the calculations presented above: if there was only one candidate taxon for a query ID, the confidence was automatically one. Also, if the most abundant taxon had twice as many supporting alignments as the second most abundant taxon, the most abundant taxon immediately got a confidence of one.

If the analyzed rank was not the domain, the confidence value of a rank was multiplied with the confidence for the previous rank. This mirrored the decreasing confidence, i.e. loss of TPs, when lower internal nodes in the reference taxonomy were assigned by a majority-decision (see also Figure 3, B).

The confidence was used to provide additional information for the user as the average confidence for each called taxon. However, it also influenced the assignment itself. MetaG used a provided confidence cutoff (from here on: cc) for the assignment. If the confidence dropped below the threshold, the current and all following ranks were unmatched.
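The interplay of the rank-wise multiplication and the confidence cutoff can be illustrated with a minimal Python sketch. The function name and the example values are hypothetical, and the per-rank confidences are taken as given rather than computed.

```python
def apply_confidence_cutoff(rank_confidences, cc):
    """rank_confidences: (rank, taxon, confidence) tuples, domain first.

    Multiplies each confidence with that of the previous ranks and stops
    at the first rank whose cumulative confidence drops below cc.
    """
    assigned = []
    cumulative = 1.0
    for rank, taxon, confidence in rank_confidences:
        cumulative *= confidence
        if cumulative < cc:
            break  # this rank and all following ranks stay unmatched
        assigned.append((rank, taxon, cumulative))
    return assigned


# Hypothetical per-rank confidences for one query
lineage = [("domain", "Bacteria", 1.0),
           ("phylum", "Firmicutes", 0.5),
           ("class", "Bacilli", 0.5)]
result = apply_confidence_cutoff(lineage, 0.3)
```

In this example the class is not assigned, because its cumulative confidence of 0.25 lies below the cutoff of 0.3.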

Species names that did not start with un (like uncultured and unclassified) were looked up in the taxonomy of the pathogen database PATRIC (see also section 2.1.7.2.). PATRIC provided the whole lineage down to the strain level with information about the host and antibiotic resistances. Candidate pathogens were only searched for by the species name, as preliminary experiments indicated almost ambiguous strain assignments (data not shown). To enable searches at the species level, relative host and antibiotic resistance abundances were calculated for each queried species in PATRIC. For that, the number of each single host and antibiotic resistance phenotype of a single species was divided by the number of respective strains in PATRIC. The result was multiplied by the respective number of species calls in MetaG. This enabled the output of all potential host and resistance information for a species as a probability, while maintaining the proportion of the called taxa.
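The scaling of host and resistance information to the number of species calls can be made concrete with a small Python sketch. The function name and the example counts are hypothetical; the calculation follows the description above (count per phenotype, divided by the number of strains, multiplied by the species calls).

```python
from collections import Counter


def expected_phenotypes(strain_phenotypes, species_calls):
    """strain_phenotypes: one (host, resistance) tuple per PATRIC strain.

    Returns the share of the species calls attributed to each
    host/resistance combination.
    """
    n_strains = len(strain_phenotypes)
    if n_strains == 0:
        return {}
    counts = Counter(strain_phenotypes)
    return {phenotype: count / n_strains * species_calls
            for phenotype, count in counts.items()}


# Hypothetical species with four strains in PATRIC and 20 calls in MetaG
strains = [("Human", "Resistant")] * 3 + [("Cattle", "Susceptible")]
shares = expected_phenotypes(strains, 20)
```

The shares add up to the number of species calls, so the proportions of the called taxa are preserved in the pathogen output.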

2.1.5. Output of MetaG

Apart from the LAST alignment file, MetaG created six output files. Two of them were used to create the interactive diagrams with KronaTools 2.7: calc.VIS.txt and calc.PATHO.txt. calc.VIS.txt contained the taxonomic assignments to the identifiable reads. The taxonomy for each identifiable read was tab-delimited and quotation marks were escaped by ". Each line was preceded by the individual taxon's abundance. In calc.VIS.txt, KronaTools summed up the numbers for identical taxa and presented the output as an interactive html5 graphic (Ondov, Bergman, and Phillippy 2011). The interactive searchable pie chart allowed viewing all taxa of a sample in a single diagram without sacrificing visual clarity. By default, taxa were colored by abundance relative to the chosen root. The coloration got lighter with more specific ranks (Ondov, Bergman, and Phillippy 2011). The graphic was saved as calc.VIS.html.

calc.PATHO.txt contained the output of querying PATRIC (see also previous section). In calc.PATHO.txt, the abundance of taxa was already summed up during the assignment: each host resistance phenotype was on a single line. The abundance, host, resistance phenotype and species name were tab-delimited. Quotation marks in the species names were escaped by ". The results were visualized by KronaTools 2.7 and saved as calc.PATHO.html.

The two remaining files were calc.LIN.txt and calc.LOG.txt. The first contained the full annotation for each read by MetaG in a fasta-like format. The header was the read ID preceded by a >. The taxonomic annotation followed: each rank and its assignment were on a single line. This was complemented by the number of filtered alignments supporting the assignment at the individual rank. Besides, the total number of filtered alignments was given in brackets. Reads that could not be assigned at all were reported as No match for [read ID] and chosen cutoff without the preceding >.

calc.LOG.txt contained the total statistics of calls by MetaG. It provided the total read count, the numbers of reads matched and removed and why the reads were removed: because of the LAST settings or because of the ec. This was followed by the total number of assignments to each taxon, split by ranks. Reads that could not be assigned to a rank were reported as UNMATCHED for this and all following ranks. The average confidence for each taxon (except UNMATCHED) was given in brackets, with a short identifier of the mean's type: [w] was the Williams' mean (Williams 1937), [g] was the geometric mean, [a] was the arithmetic mean and [h] was the harmonic mean (mean types reviewed in: Clapham and Nicholson 2014). The choice of the mean depended on the user input (see next section).
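The four mean types can be compared with a short Python sketch based on the standard statistics module. It is not taken from MetaG: the function names are hypothetical, and the Williams' mean is assumed here in its common form, i.e. the geometric mean of the values increased by one, minus one.

```python
import math
from statistics import fmean, geometric_mean, harmonic_mean


def williams_mean(values):
    """Assumed definition: exp(mean(log(x + 1))) - 1."""
    logs = [math.log(v + 1.0) for v in values]
    return math.exp(fmean(logs)) - 1.0


# Identifiers as used in calc.LOG.txt
MEANS = {"w": williams_mean, "g": geometric_mean,
         "a": fmean, "h": harmonic_mean}


def average_confidence(confidences, mean_type="a"):
    """Average the confidences of one taxon with the chosen mean type."""
    return MEANS[mean_type](confidences)
```

Note that statistics.geometric_mean may reject a confidence of exactly zero, depending on the Python version, whereas the Williams' mean tolerates zeros because of the added one.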

2.1.6. Implementation as a web tool and standalone program

MetaG was implemented as a command line tool (also: standalone) that was designed to be used intuitively. In the standalone version, a Bourne shell (sh) script formed the interface between the user, LAST and/or LAST-SPLIT, the Perl v5.26.1 assignment and helper scripts and KronaTools. In the web version⁴, KronaTools was customized by Norbert Grundmann to adapt the design of the output graphs to the webpage design. The source code of MetaG's standalone version can be found in the digital appendix (see D01).

The standalone version was tested on Ubuntu 18.04.2 LTS and Ubuntu 16.04.6 LTS. Its help menu gave an overview of all parameters and their valid values. A quick guide explained format conventions, preprocessing steps and also the rationale behind the most important parameters. This overview was designed to ease the use for inexperienced users as well. Most cutoffs and parameters were fully customizable.

The user was free to use any fasta file as a query file. Besides, custom databases could be supplied, as long as they were correctly formatted. The shell script allowed for the adjustment of the substitution matrix and LAST parameters. Within the matrix, LAST parameters could be specified. Either single separate parameters or the matrix needed to be given. However, if both were given, the parameters inside the matrix were not considered⁵. The user could also influence the calculation time by determining the number of CPU cores used by LAST. This also implied that users with limited hardware could tailor the computational burden to their needs. As outlined in 2.1.2., LAST-SPLIT could be optionally used to remove ambiguous alignments.

For the steps of post processing of alignments and the taxonomic assignment, e-value, alignment score and confidence cutoff were fully customizable. The output was visualized by KronaTools (see also previous section). The user could choose the method to calculate the average confidence for each taxon: Williams', geometric, arithmetic or harmonic mean were used in calc.LOG.txt according to the user’s choice (see also above). Besides, all other output files discussed in the previous section could be obtained. The user had to supply four environment variables: the path to LAST (LASTAL), LAST-SPLIT (LASTSPLIT), to the script performing the taxonomy-calling (GETMETA) and the path to KronaTools (KRONA). The variables could be set in the Bourne or Bourne-again shell (bash) using export [VARIABLE]="[path]". Note that the variables had to be set in the terminal window that MetaG was executed in. Additionally, LAST and LAST-SPLIT had to be version 963 or higher. In that version, LAST-SPLIT reported the e-values of the filtered alignments (first in version 961)⁶, if used with the options as in MetaG. Besides, MetaG was developed to be compatible with this version.

4 http://www.bioinformatics.uni-muenster.de/tools/metag//index.hbi?; accessed 24.05.2019.
5 http://last.cbrc.jp/doc/lastal.html; accessed 05.03.2019.

Generally, the most time-consuming step in the analysis was creating the alignment. However, since the alignment parameters were already tailored to the data when LAST-TRAIN was used, users might only want to alter the taxonomic assignment in a reanalysis. Relevant parameters for this were the e-value, alignment score, confidence cutoff or mean type. Therefore, it was most efficient to use the pre-calculated alignment and repeat only the taxonomy calling with the new parameters. The user could supply the alignment using the --no_align flag. If the flag was set, only the path to the taxonomy file (-ltax), pathogen database (-pdbPath), the taxonomy-calling cutoffs (-e, -ac, -cc) and the method to calculate the average confidence for each taxon (-m) were needed. Besides, the query fasta file (-q) needed to be specified. The existence and version of LAST and LAST-SPLIT were not checked in this case. Only the environment variables GETMETA and KRONA were needed. The presented shortcut also implied that users might supply alignments calculated by their algorithm of choice. However, the files had to be in MAF format.

6 http://last.cbrc.jp/last/index.cgi/rev/2063a5a7d7ac; accessed 05.03.2019.

Figure 4: Section of the web form used to submit a sample to MetaG.

The implementation as a web tool was performed by Norbert Grundmann. It wrapped the shell script into a graphical user interface (see also Figure 4). The web version allowed fasta and fastq files. However, fastq files were internally transformed to fasta prior to the calculations. Options could be applied by choosing parameters from a drop-down menu, or entering a value, depending on the individual option. Users were provided with complete standard workflows for the RDP and MTX database (see next section) and Illumina MiSeq and nanopore sequencing, respectively (see also Figure 4, Predefined). Each parameter had a short description. Users could optionally enter their email addresses to get notified when the job was finished (see also Figure 4, Email). An optional job title could be supplied to ease the post analysis for the users (see also Figure 4, Title). By sharing the job ID, the analysis output could be sent to other users. Detailed usage information was supplied on a single website: a technical summary was written by Norbert Grundmann. Besides, a tutorial from uploading the files to analyzing the output was provided.

The results of the web-request are depicted in Figure 5. To get to their results, users had to enter their request ID in the respective field of the submit form (see also Figure 4, ID). If they had provided their email addresses, direct links were sent.

Figure 5: Section of the results form of MetaG displaying the Info output (top). Clicking on the fields labeled with the letters A to E reveals the output in the respectively labeled boxes. A is the Summary output of the found taxa, here at domain level with a minimum abundance of 10. B and C are sections of the KronaTools graphs containing all taxa and pathogenic bacteria/archaea, respectively. D shows the Download section, where all output files can be obtained. The Rerun site depicted in E offers the option to restart the taxon calling only. The graphic contains subsections and a few elements within have been removed for enhanced visual clarity.

When accessing their results, users first saw the Info output (see also Figure 5, top). It gave information about the runtime and the selected parameters for the analysis. Underneath the runtime, the total number of reads in the input file was displayed. Further, the number of reads assigned with a taxon and the number of reads removed by the chosen settings were displayed. Users were provided with a summary of the found taxa in two ways: by a table and by a graph.

A searchable summary table displayed the number of supporting reads for each taxon, together with the average confidence and the abbreviation of the method used for average calculation. It could be filtered by rank and taxon abundance (see also Figure 5, A). Searches were allowed in regular expression notation. The most important expressions for users included: ^[term] matched only the beginning of taxa names. [term]$ showed the inverse behavior. Besides, ^[A,B] matched terms starting with A or B. One way to look for taxa ending with a digit was to search for \d$. This could be useful at the strain level. Experienced users should note that the full regular expression syntax was not supported. All searches were case-insensitive.
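The listed search expressions behave as follows when tried with Python's re module. This only illustrates the patterns; the regular expression engine of the web tool is not specified in the text and may differ in details, and the taxon names below are hypothetical examples.

```python
import re


def search_taxa(pattern, taxa):
    """Return all taxa matching the pattern, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [taxon for taxon in taxa if regex.search(taxon)]


taxa = ["Bacillus", "Archaea", "Escherichia coli K12", "Lactobacillus"]

starts_with_bac = search_taxa(r"^bac", taxa)     # beginning of the name
ends_with_lus = search_taxa(r"lus$", taxa)       # end of the name
starts_a_or_b = search_taxa(r"^[A,B]", taxa)     # first letter A or B
ends_with_digit = search_taxa(r"\d$", taxa)      # e.g. strain designations
```

In Python's syntax the comma inside ^[A,B] is itself part of the character set, so the pattern would also match a name starting with a comma; ^[AB] is the stricter equivalent.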

The summary was visualized by an interactive graph created with KronaTools (see also Figure 5, B). Another graph depicted the search for pathogenic bacteria and archaea in PATRIC (see also Figure 5, C).

From the Download section (see also Figure 5, D), users could obtain the taxonomic assignment for each read (Reads, internally: calc.LIN.txt) and for the whole sample, including the read statistics (Summary, internally: calc.LOG.txt). Besides, the two graphs for display of the summary and the pathogenic organisms (Summary Graph, internally: calc.VIS.html, and Pathogen Graph, internally: calc.PATHO.html, respectively) could be downloaded. The files used to create the charts were provided as Summary Data (internally: calc.VIS.txt) and Pathogen Data (internally: calc.PATHO.txt), respectively. The properties of the files were already explained in more detail in the previous section.

As outlined previously, users could also choose to repeat the taxonomy calling, only. For that the previously calculated alignment was used internally and selected parameters could be adjusted using the Rerun section (see also Figure 5, E). This corresponded to setting the --no_align flag in the standalone version (see above).

2.1.7. Creating custom databases

2.1.7.1. Amplicon databases

2.1.7.1.1. RDP 16S 28S

The following description of RDP applies to release 11 as described by Cole et al. (Cole et al. 2014): RDP contained sequences of rRNA genes from bacteria/archaea and fungi, respectively. Most of the sequences were incomplete, for example due to missing ends (Cole et al. 2014). 16S rRNA gene sequences most often stemmed from uncultured microorganisms (Cole et al. 2014).

The majority of sequences were received from the International Nucleotide Sequence Database Collaboration (Nakamura et al. 2013) (Cole et al. 2014). The corresponding organism names were looked up in Bacterial Nomenclature Up-to-Date⁷ (Cole et al. 2014). Sequences were filtered for length and quality and 16S rRNA genes were additionally controlled for chimeras (Cole et al. 2014). The fungal taxonomy was curated manually; in release 11 it was based on Liu et al. (Liu et al. 2012) (Cole et al. 2014). However, the bacterial and archaeal taxonomies were developed by automated classifications. They integrated Bergey's Trust⁸, the List of Prokaryotic Names with Standing in Nomenclature (Parte 2014), topical publications and the All-Species Living Tree Project (Munoz et al. 2011) with sequence similarity (Cole et al. 2014). The lowest officially supported rank was the genus (Cole et al. 2014).

A recent study suggested that RDP had fewer annotation errors than either Greengenes or SILVA (R. Edgar 2018). Therefore, RDP was preferred over the latter databases. Besides, the last release of Greengenes was in 2013⁹. Thus, it was significantly older than RDP: version 11.5, which was used here, was released in 2016¹⁰. This release comprised ca. 3.4 million 16S and ca. 130,000 28S rRNA genes¹⁰.

For the following steps, semicolons in the headers of the RDP fasta files for bacteria, archaea and fungi, respectively, were replaced by commas. The sequences were converted to uppercase. The files followed the same general structure. The header line contained an ID and the taxonomic placement. The taxonomic placement was split in two parts. One started with "Lineage=" and contained taxa from root to genus. Each taxon was immediately followed by the corresponding rank name. The lineage information was extracted for all ranks, excluding the root. Ranks were skipped if the taxon name started with unclassified_. The other taxonomy part contained a merged representation of species and strain information. Although in some cases they were split by a semicolon, this was not always the case. Besides, there existed a variety of different non-informative species names, such as unclassified bacterium, unclassified [genus] sp.. In order to allow species level resolution, a common name was given to the non-informative species names. Besides, species and strain information were split. To understand the importance of these two aspects, consider a query read which matched unclassified bacterium XY and bacterium Z from the same genus. In this hypothetical example, the genus would be the lca. If the two entries were adjusted to unclassified strain XY and unclassified strain Z, the lca would be at the species level. Note that no information was lost. MetaG did not apply a strict lca algorithm. Still, an lca got the highest confidence value, influencing the downstream assignments via the confidence cutoff.

7 https://www.dsmz.de/; accessed 07.03.2019.
8 www.bergeys.org; accessed 07.03.2019.
9 http://greengenes.secondgenome.com/downloads; accessed 01.03.2019.
10 https://rdp.cme.msu.edu/misc/rel10info.jsp#release11_history; accessed 01.03.2019.

To adjust the entries, a set of rules was defined which coped well with many but not all cases. The custom RDP database containing bacteria, archaea and fungi had ca. 3.5 million entries. This implied the presence of special cases that were not covered by the following non-exhaustive rule set. Overly specific rules would have hindered the application to future releases of RDP with potentially altered taxonomic nomenclature. However, native support by the database would have been the best option.

Species-strain information was analyzed word by word. If a word was considered to be strain information, all following words were also considered to be strain information. The following criteria defined a strain word: a strain word contained characters other than letters or commas (e.g.: TH3 in bacterium TH3). Capital letters at any position in the word, apart from the very first, also indicated strain words. Besides, any information after sp. was regarded as strain information (e.g.: Y0018 in Acidimicrobium sp. Y0018). In this case, the word sp. was also not included in the species name. Any word that did not match the strain criteria was considered to be a species word. Full species and strain names were obtained by merging the candidate words. The full species name was not allowed to start with a lower-case letter. In that case, the full species and strain name were merged and saved as the full strain name. This conserved important information about the properties of the microorganisms (e.g.: acidophilic bacterium) but did not compromise the species-level resolution. The species was subsequently set to unclassified.

A species was also defined as unclassified if the full name consisted only of one word. The species was set to uncultured if any species word matched that term. Strain words were identified as outlined above. The taxonomy was extracted and saved together with the respective ID to a separate taxonomy file. ID and lineage information were separated by a semicolon. Each entry was on a single line. The taxonomy was required to have 10 ranks per entry, separated by semicolons: domain, phylum, class, subclass, order, suborder, family, genus, species, and strain. Ranks that were missing in the individual lineages were reported as unknown, i.e. 0.
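The rule set for separating species and strain words can be sketched in Python. This is a simplified reading of the rules and not the script used to build the database; the function names are hypothetical. Three points are assumptions: the word sp. is dropped entirely, the uncultured rule is applied before the lower-case rule, and an empty strain is reported as 0.

```python
def is_strain_word(word):
    """Apply the two character-based strain criteria to a single word."""
    # Characters other than letters or commas mark a strain word (e.g. TH3)
    if any(not (ch.isalpha() or ch == ",") for ch in word):
        return True
    # Capital letters anywhere but the first position also mark strain words
    return any(ch.isupper() for ch in word[1:])


def split_species_strain(name):
    """Return (species, strain) for a merged species-strain entry."""
    species_words, strain_words = [], []
    in_strain = False
    for word in name.split():
        if not in_strain and word == "sp.":
            in_strain = True  # everything after sp. is strain information
            continue          # assumption: sp. itself is dropped
        if in_strain or is_strain_word(word):
            in_strain = True
            strain_words.append(word)
        else:
            species_words.append(word)
    species = " ".join(species_words)
    strain = " ".join(strain_words)
    if "uncultured" in species_words:
        species = "uncultured"
    elif species and species[0].islower():
        # Keep descriptive words (e.g. acidophilic bacterium) in the strain
        strain = " ".join(species_words + strain_words)
        species = "unclassified"
    elif len(species_words) < 2:
        species = "unclassified"
    return species, strain or "0"
```

With this sketch, bacterium TH3 becomes species unclassified with strain bacterium TH3, and Acidimicrobium sp. Y0018 becomes species unclassified with strain Y0018, in line with the examples in the text.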

The taxonomy and database fasta files for bacteria, archaea and fungi, respectively, were merged. As the resulting taxonomy file (see D02) was much smaller than the merged database fasta file, it could be processed faster and increased the overall assignment speed of MetaG. For the use in MetaG, the database fasta file was subsequently processed using LASTdb (see section 2.1.1.).

2.1.7.1.2. MTX

The MTX database was the custom database of the metagenomic classification tool METAXA version 2.2 (Bengtsson-Palme et al. 2015). MTX from METAXA version 2 was shown to substantially improve results of metaprofiling analyses (Escobar-Zepeda et al. 2018). MTX was manually curated by the METAXA team and comprised a consensus from several databases: SILVA, MITOZOA, Greengenes, CRW and GenBank (Bengtsson-Palme et al. 2015). The database was obtained from the software's database directory. It consisted of separate files for cytochrome c oxidase subunit I (COI), small subunit (SSU) and large subunit (LSU) rRNA genes. Each of the sub-databases consisted of a sequence file in blast format and a taxonomy file. The latter contained translations from database IDs to the taxonomy of the entries.

To use MTX in other tools, the blast databases for LSU and SSU sequences were transformed into fasta format using blastdbcmd (Camacho et al. 2009). The resulting fasta files of SSU and LSU were merged. The same was done for the taxonomy files. However, the merged taxonomy file had to be extensively post-processed, as it contained entries ranging from seven up to nine taxonomic ranks. The number of ranks had to be equal, so that MetaG and other non-proprietary algorithms could identify taxa of the same rank.

To solve this issue, the whole taxonomy was forced to a level of 7 ranks by keeping only the first three and last four ranks (counting from the species) of each entry. In a previous step, entries ending with multiple unknown taxa had been removed. Taxa called 0 were introduced for subclass and suborder at the fourth and sixth position of the taxonomy, respectively. This was only important for MetaG. For usability in MetaG, empty rank information was replaced by 0. The usage of MTX by other algorithms was not expected to be affected by these modifications.

The next two steps were expected to improve the performance of algorithms working with the database. Especially entries for chloroplasts and mitochondria were lacking the genus, but a full species name was given. Thus, I decided to use the first word from the species name as the genus name.

Next, the individual species names were further processed. In most cases, they were a merged representation of the species name and some additional information, such as strain, subspecies or common name. To resolve this confusion, the first two words of each species name remained at this rank, while all following were considered strain information. If the last species word was sp., the species name was empty, i.e. 0. The transformation of MTX was less extensive than that of RDP, as it lacked many of the uninformative species names. The rule set for transforming MTX was also less elaborate than for RDP to keep the peculiarity of the database.
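The MTX transformations described in this section can be combined into one illustrative Python function. It is a sketch under assumptions rather than the original script: the function name and example lineages are hypothetical, the steps are applied in one pass, and the removal of entries ending with multiple unknown taxa is not covered.

```python
def normalize_mtx(ranks):
    """Map an MTX lineage of 7 to 9 ranks (species last) to 10 ranks."""
    if not 7 <= len(ranks) <= 9:
        raise ValueError("expected between seven and nine ranks")
    ranks = [rank if rank else "0" for rank in ranks]  # empty rank -> 0
    # Keep the first three and the last four ranks
    domain, phylum, klass = ranks[:3]
    order, family, genus, species_entry = ranks[-4:]
    words = species_entry.split() if species_entry != "0" else []
    # Missing genus: use the first word of the species name
    if genus == "0" and words:
        genus = words[0]
    # First two words stay at the species rank, the rest becomes the strain
    species_words, strain_words = words[:2], words[2:]
    if not species_words or species_words[-1] == "sp.":
        species = "0"
    else:
        species = " ".join(species_words)
    strain = " ".join(strain_words) or "0"
    # Subclass and suborder are not available in MTX and are set to 0
    return [domain, phylum, klass, "0", order, "0",
            family, genus, species, strain]


# Hypothetical chloroplast entry lacking the genus
example = normalize_mtx(["Eukaryota", "Viridiplantae", "Streptophyta",
                         "Poales", "Poaceae", "", "Zea mays chloroplast"])
```

The returned list follows the rank order domain, phylum, class, subclass, order, suborder, family, genus, species and strain that MetaG expects.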

From the merged database fasta file, only IDs that were also given in the modified taxonomy file were retained. The resulting 10 levels of taxonomy were ready for use by QIIME 2 (Bolyen et al. 2018). For the use in MetaG, ID and taxonomy had to be separated by a semicolon and not by tab (see also D03). Additionally, the database fasta file had to be processed using LASTdb (see section 2.1.1.).


2.1.7.2. The database for pathogens: PATRIC

To find out whether the supplied samples contained pathogens, the database PATRIC (Wattam et al. 2017) was used. Since 2009, PATRIC has stored omics data of pathogenic bacteria using the NCBI taxonomy (Wattam et al. 2014) and now also includes archaea (Antonopoulos et al. 2017). With the advent of antibiotic resistance, associated metadata has become increasingly important. Therefore, the database stored additional data related to the genomes, for example host name and antibiotic resistance (Wattam et al. 2014). As of 2017, more than 104,000 genomes were stored in PATRIC, and 15,000 of them were annotated with an antibiotic resistance (Antonopoulos et al. 2017).

To integrate the identification of pathogenic bacteria and the assessment of the resistance status into MetaG, I combined the genome_metadata and genome_lineage files from PATRIC's FTP server¹¹. Using the genome ID as the identifier, the taxonomy information of the latter file was assigned to the host name and antibiotic resistances from the former. The taxonomy of PATRIC was formatted similarly to RDP: it was forced to have 10 ranks. As subclass and suborder were not given, they were set as unknown (0). PATRIC did not separate species and strain names, so I separated them myself under the following assumptions. Species names should consist of two words; any additional information was considered strain information. The first of the two words must not start with a lower-case letter or a number. If it did, the whole species name was considered to be strain information and the species was set as unclassified. In the special case that the name started with uncultured, the species was set as uncultured. Besides, the second word was not allowed to be bacterium or undefined and was not allowed to end with a dot (e.g., sp. or genomosp.); otherwise, the species was considered unclassified and any word after the first two words was assigned to the strain. The resulting taxonomy file can be found in the digital appendix (see D04).
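The assumptions for separating species and strain names in PATRIC can be summarized in a short Python sketch. The function name split_patric_name is hypothetical, and two details are assumptions not stated above: one-word names are treated as unclassified, and for uncultured entries the full name is kept as strain information.

```python
# Minimal sketch of the rule set; function name and edge cases are assumptions.
def split_patric_name(name):
    """Return (species, strain) for a combined PATRIC species/strain name."""
    words = name.split()
    if not words:
        return "unclassified", ""
    first = words[0]
    # Special case: names starting with "uncultured".
    if first == "uncultured":
        return "uncultured", " ".join(words)
    # A lower-case letter or a number at the start marks strain information.
    if first[0].islower() or first[0].isdigit():
        return "unclassified", " ".join(words)
    second = words[1] if len(words) > 1 else ""
    strain = " ".join(words[2:])
    # Uninformative second words: missing, "bacterium", "undefined", or "sp."-like.
    if second in ("", "bacterium", "undefined") or second.endswith("."):
        return "unclassified", strain
    return f"{first} {second}", strain
```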

Unfortunately, PATRIC, MTX and RDP were based on different taxonomies (see above). The overlap between PATRIC and the two other databases was evaluated at the species and strain level to assess if comparing entries in the different databases would be reasonable. Any uncultured or unclassified species were not considered. To avoid unnecessary taxonomy clashes, the comparison at the species level was case-insensitive and did not consider non-word characters, such as brackets, quotation marks and spaces.

The comparison was applied in a stricter fashion at the strain level, since lower and uppercase letters, as well as other non-alphabetic characters, were essential in classifying strains. The pathogen database shared 67,462 species with RDP, of which 1,635 were also identical at the strain level. 1,117 species could not be matched. This result was similar to the comparison with MTX: 66,624 species were shared; 2,246 strains were common. However, 1,955 species were not common. In the light of these results, it was reasonable to query taxa assigned by RDP and MTX in PATRIC. However, only species names were queried, as the numbers of common strains were low. Besides, the strain resolution of MetaG was low (data not shown).
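The relaxed comparison at the species level can be sketched in Python. The helper names species_key and shared_species are hypothetical; the sketch assumes that a non-word character is anything matched by the regular expression \W.

```python
import re

# Minimal sketch; helper names are hypothetical.
SKIPPED = ("uncultured", "unclassified")


def species_key(name):
    """Lower-case a species name and drop brackets, quotes, spaces and the like."""
    return re.sub(r"\W", "", name).lower()


def shared_species(names_a, names_b):
    """Return the normalized species keys present in both databases."""
    keys_a = {species_key(n) for n in names_a if n not in SKIPPED}
    keys_b = {species_key(n) for n in names_b if n not in SKIPPED}
    return keys_a & keys_b
```

At the strain level, the stricter comparison corresponds to intersecting the unmodified names instead of the normalized keys.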

11 ftp://ftp.patricbrc.org/; accessed 29.01.2019.

2.2. Performance evaluation and cross-algorithm comparisons

2.2.1. General considerations

Metagenomic classifiers have been evaluated by measures such as sensitivity (SENS), specificity (SPEC), precision (PREC) and the Matthews correlation coefficient (MCC) (Matthews 1975) (e.g.: Lindgreen, Adair, and Gardner 2016; Escobar-Zepeda et al. 2018). For that, the problem of identifying reads was abstracted as a binary problem, i.e. a read was either classified correctly or not. However, this required knowledge about the origin of each read, that is, to which taxon it belonged or whether it was an artifact (Lindgreen, Adair, and Gardner 2016; Escobar-Zepeda et al. 2018). Therefore, reads were simulated (see next section).

A read could be assigned to one of four different classes (Lindgreen, Adair, and Gardner 2016; Escobar-Zepeda et al. 2018): a read from a certain taxon that was assigned to the same taxon was a true positive (TP). If it was assigned to another taxon, it was a false positive (FP). If it was not assigned, it was a false negative (FN). An artifact that was assigned to any taxon was an FP. If it was not assigned, it was a true negative (TN).

The statistical measures applied for the ranking of each algorithm were defined as follows:
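In their standard form, which is assumed here and is consistent with the four read classes above, the measures are:

```latex
\begin{align}
\mathrm{SENS} &= \frac{TP}{TP + FN} \\
\mathrm{SPEC} &= \frac{TN}{TN + FP} \\
\mathrm{PREC} &= \frac{TP}{TP + FP} \\
\mathrm{MCC}  &= \frac{TP \cdot TN - FP \cdot FN}
                      {\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\end{align}
```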

The definition of the MCC was obtained from Matthews (Matthews 1975). The metric has been proposed as an overall performance indicator for classifications (Matthews 1975). The coefficient ranges between -1 and 1: if all predictions are wrong, the MCC is -1; if all are correct, it is 1. If it is zero, the classification is no better than a random one (Matthews 1975).
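The four read classes and the measures derived from them can be expressed compactly in code. The following Python sketch is illustrative only: the function names are hypothetical, None stands for an artifact (as the true taxon) or for a missing assignment, and setting the MCC to 0 for a vanishing denominator is a common convention assumed here.

```python
from math import sqrt


def classify_read(true_taxon, assigned_taxon):
    """Return "TP", "FP", "FN" or "TN" for a single read."""
    if true_taxon is None:  # the read is an artifact
        return "TN" if assigned_taxon is None else "FP"
    if assigned_taxon is None:  # a real read that was not assigned
        return "FN"
    return "TP" if assigned_taxon == true_taxon else "FP"


def metrics(tp, fp, tn, fn):
    """Compute SENS, SPEC, PREC and MCC from the four class counts."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "SENS": ratio(tp, tp + fn),
        "SPEC": ratio(tn, tn + fp),
        "PREC": ratio(tp, tp + fp),
        # Assumed convention: MCC is 0 if the denominator is 0.
        "MCC": (tp * tn - fp * fn) / denominator if denominator else 0.0,
    }
```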

All metagenomic analysis programs were run either on a server with 16 cores (Intel® Xeon® CPU X7350; clock speed: 2.93 GHz) and 64 GB RAM or on a second server with 64 cores (AMD Opteron™ Processor 6378; clock speed: 2.40 GHz) and 512 GB RAM. The servers used the Ubuntu 16.04.6 LTS operating system.


2.2.2. Read simulation

Figure 6: Phylogeny of the taxa used for read simulation, shown at three ranks: domain or kingdom, family and species. The latter includes a strain and subspecies name, if applicable. Note that lineages with only one ancestor were collapsed. The degree of relationship is visualized by shared inner areas, not individual order. The recently discovered Xylanibacillus composti K13 [genus novum, species novum] (Kukolya et al. 2018) is marked by a black star. The graphic was obtained by using KronaTools 2.7.

For the calculation of error statistics, an in silico sample was created. As viruses lack a common marker (see section 1.1.), my analysis focused on archaea, bacteria and fungi. The test set consisted of 27 species from the three groups with varying degrees of relatedness (from here on: BFA) (see Figure 6 and also Table 1). All sequences in the test set were obtained from the NCBI Nucleotide database (Sayers et al. 2019). For archaea and fungi, three sequences each were obtained, two of which shared a common family. In a subset of 16 of all 21 bacterial sequences, two shared a common genus, four a common family and eight a common phylum. Additionally, four sequences stemmed from four different phyla (see Figure 6). Xylanibacillus composti strain K13 was only recently described and was classified as genus novum and species novum (Kukolya et al. 2018). It was included to account for the fast pace of microorganism discovery, as well as for the incompleteness of databases.

Table 1: NCBI Nucleotide database accession and version of specimen used for in silico sequencing.

Sequence name                                   Accession.version
Methanoregula formicica strain SMSP             NR_102441.1
Pyrobaculum aerophilum strain IM2               NR_102764.2
Thermoproteus tenax strain Kra 1                NR_044683.1
Escherichia coli                                J01859.1
Escherichia fergusonii strain ATCC 35469        NR_074902.1
Salmonella enterica subsp. enterica strain LT2  NR_074910.1
Salmonella bongori strain NCTC 12419            NR_074888.1
Neisseria gonorrhoeae strain NCTC 8375          NR_026079.2
Neisseria meningitidis strain M1027             NR_104946.1
Kingella denitrificans                          L06166.1
Kingella kingae strain ATCC 23330               NR_042976.1
Bacillus velezensis strain FZB42                NR_075005.2
Bacillus subtilis subsp. subtilis strain 168    NR_102783.2
Geobacillus kaustophilus strain BGSC 90A1       NR_115285.2
Geobacillus thermoleovorans strain BGSC 96A1    NR_115286.2
Paenibacillus polymyxa strain DSM 36            NR_117729.2
Paenibacillus dendritiformis strain T168        NR_042861.1
Brevibacillus agri strain DSM 6348              NR_040983.1
Brevibacillus nitrificans strain DA2            NR_112926.1
Porphyromonas gingivalis strain JCM 12257       NR_113086.1
Corynebacterium glutamicum strain ATCC 13032    NR_041817.1
Anaerolinea thermophila strain UNI-1            NR_074383.1
Acaryochloris marina strain MBIC11017           NR_074407.1
Xylanibacillus composti strain K13              NR_159902.1
Aspergillus oryzae RIB40                        XR_002735721.1
Penicillium expansum                            AF003359.1
Cryptococcus neoformans strain CN7              MF580733.1

All species were chosen so that their GC content lay within an interval of 90 % around the GC content of E. coli. This was done to avoid simulation artifacts: the coverage of Illumina reads depended on the GC content of the targeted region, not just on the content of the read (Speed and Benjamini 2012). PCR is one prominent error source during library preparation of, for example, GC-rich DNA, but modified protocols improved the outcomes (Aird et al. 2011). Still, considering the GC content of the whole genome, as done here, took the worst case into account.

The GC content also influenced 16S rRNA gene sequencing using the nanopore flowcell R7.3: this became obvious in terms of significant influences on mismatch rate and coverage and a trend for differential GC recovery in the reads (Benítez-Páez, Portune, and Sanz 2016). However, the range of GC contents was narrow (Benítez-Páez, Portune, and Sanz 2016). Still, nanopore sequencing improved genome assembly of short reads also in the presence of GC bias for the latter techniques (Goldstein et al. 2019). In any case, sequences were chosen within a narrow range of GC content to lessen the impact of the GC bias on the comparison of both techniques.

From the list of candidate species, preferably full-length 16S rRNA gene sequences for bacteria and archaea and 28S rRNA gene sequences for fungi were obtained from the NCBI Nucleotide database (see also Table 1). Sequences were not allowed to be shorter than 1,400 nucleotides (nt) and were not allowed to have a significant number of unidentified nucleotides (N).
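The selection criteria of this section (GC content, minimum length, few unidentified nucleotides) can be checked with a few lines of Python. This is a sketch, not the procedure used here: the function names are hypothetical, and the threshold max_n_fraction is an assumed value, since no exact limit for a significant number of N is given above.

```python
# Minimal sketch; function names and max_n_fraction are assumptions.
def gc_content(sequence):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = sequence.upper()
    canonical = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / canonical if canonical else 0.0


def passes_filter(sequence, min_length=1400, max_n_fraction=0.01):
    """Accept sequences of at least min_length nt with few unidentified bases."""
    seq = sequence.upper()
    if len(seq) < min_length:
        return False
    return seq.count("N") / len(seq) <= max_n_fraction
```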

From the obtained sequences, positive and negative reads were simulated using an error profile specific to the sequencing technology. Positive reads could still be aligned to their reference sequences (mother sequences), while negative reads could not be aligned anymore by the simulation tool.

2.2.2.1. Simulation of nanopore reads

Nanopore reads were simulated using NanoSim-H version 1.1.0.4 (C. Yang et al. 2017; Břinda and Yang, n.d.) with the pre-computed error profile for E. coli 2D sequencing on flowcell chemistry R9. The seed number was the default: 42. Using the same error profile and seed number, reads could be exactly replicated. The maximum read length was set to 1,415 nt to meet the properties of sequences obtained from NCBI (see also above). Apart from the --circular setting, all other settings were left at default values. The simulated read length was consistent with the average read length observed by a recent study using flowcell chemistry R9 (Krishnakumar et al. 2018). Each 16S or 28S rRNA gene sequence was separately sequenced in silico with 400 reads, of which 24 were simulated as true negatives. The simulation results were output as single fasta files, which were subsequently concatenated.

nanosim-h -p 'ecoli_R9_2D' -n 400 --circular --max-len 1415

2.2.2.2. Simulation of Illumina reads

The nanopore device has been on the market for almost five years (release for early-access users in 2014 (reviewed in: Deamer, Akeson, and Branton 2016)). Still, it is used in the minority of all metagenomics studies: querying PubMed (Bethesda (MD): National Library of Medicine (US) 1946) with the Entrez facility (Schuler et al. 1996) yielded 20 results for metagenomics and nanopore, 78 for metagenomics and solid (Applied Biosystems SOLiD) and 388 for metagenomics and illumina. The results indicated that the Illumina technology has most often been used in metagenomics. The most widely used Illumina device is the MiSeq (133 matches). I chose to test the performance of MetaG and the competing classifiers (see below) on the Illumina MiSeqv3, as it was the most recent version of the MiSeq supported by ART (Huang et al. 2012) version ART-MountRainier-2016-06-05. ART was a simulation tool for reads from next-generation sequencing. It provided several pre-computed error profiles for the most abundant sequencing platforms and also allowed for custom error training (Huang et al. 2012). The resulting reads had an individual simulated quality string and were given in fastq format (Huang et al. 2012). Reads could be single-, paired-end or mate-pair reads¹². Of the 133 PubMed matches for metagenomics and MiSeq, 7 also matched paired-end and 1 matched mate-pair. None matched the term single-reads (also: single reads). However, I assumed that a significant proportion of the 125 MiSeq studies that matched neither paired-end nor mate-pair were carried out with single reads. All PubMed queries in this section were performed on 23.01.2019.

400 single reads (-f) of 250 nt length (-l) were simulated for the amplicon sequencing (-amp) of each NCBI sequence (see also Table 1) using the error profile of Illumina MiSeqv3 (-ss MSv3). The N masking was turned off (-nf 0), since the standard settings produced no reads for G. thermoleovorans strain BGSC 96A1 and K. denitrificans. The seed number was the default, which is dependent on the system time. Applying the same seed number and settings would have given identical results.

art_illumina -nf 0 -ss MSv3 -amp -na -q -i [sample.fasta] -l 250 -f 400 -o [out]

Unlike NanoSim-H, ART provided no true negative reads. Therefore, 24 reads of each NCBI sequence were shuffled. The individual fastq files were transformed to fasta format so that the simulations from both sequencing technologies were comparable. Subsequently, reads were shuffled with a modified version of the shuffled_fasta.pl script by Escobar-Zepeda and colleagues (Escobar-Zepeda et al. 2018), which was available online¹³.

I implemented a parameter to shuffle only the first 24 reads in each fasta file. Besides, the automated output path was adjusted. In line with the authors’ settings¹⁴, the kmer-size was 10 and the step-size was 2 for each fasta file. Shuffled and native MiSeq reads from all NCBI sequences were merged into a single fasta file and used for further analysis.
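The conversion from fastq to fasta format that made both simulations comparable can be sketched in Python. The function name fastq_to_fasta is hypothetical, and the sketch assumes unwrapped fastq records of exactly four lines each; it is not the tool used in this work.

```python
# Minimal sketch; assumes four-line fastq records (header, sequence, "+", quality).
def fastq_to_fasta(fastq_lines):
    """Yield fasta lines (header, sequence) from an iterable of fastq lines."""
    lines = [line.rstrip("\n") for line in fastq_lines]
    for i in range(0, len(lines) - 3, 4):
        header, sequence = lines[i], lines[i + 1]
        if not header.startswith("@"):
            raise ValueError(f"malformed fastq header: {header!r}")
        # The quality string is dropped; only ID and sequence are kept.
        yield ">" + header[1:]
        yield sequence
```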

2.2.3. Defining standard parameters for MetaG

As outlined in section 2.1.6., many of the parameters used for the taxonomic assignment in MetaG were fully customizable. On the one hand, this offered experienced users the chance to improve their results by simple parameter adjustment. On the other hand, users with basic expertise in bioinformatics and metagenomics might be unable to cope with the opportunities of MetaG. Thus, standard parameters were defined.

For that, MetaG was run with different combinations of e-value, alignment score and confidence cutoff. Every possible parameter combination within the limits was automatically checked using a custom Perl v5.26.1 script making use of the Algorithm::Loops¹⁵ module version 1.032. MetaG analyzed all four possible alignments of the nanopore and Illumina BFA samples and the databases RDP and MTX. Each alignment of an individual sample and database varied by the LAST-SPLIT option -m: 0.9, 0.95 and none. The optimal LAST parameters and matrix for each sample and database combination had been previously obtained by using LAST-TRAIN.

12 https://www.niehs.nih.gov/research/resources/software/biostatistics/art/index.cfm; accessed 10.04.2019.
13 https://github.com/Ales-ibt/Metagenomic-benchmark/tree/master/bin/16SrRNAamplicon; accessed 28.01.2019.
14 https://github.com/Ales-ibt/Metagenomic-benchmark/blob/master/datasets_16SrRNA/datasets_description.txt; accessed 28.01.2019.
15 https://metacpan.org/pod/Algorithm::Loops; accessed 28.01.2019.

At the genus level, I compared the expected taxonomy to the observed taxonomy. From the comparison, I derived the sensitivity, specificity, precision and the MCC of the respective assignments. For each -m, sample and database combination, the optimal parameters were chosen by manually examining the output. Parameter combinations with high values for MCC, sensitivity, specificity and precision, respectively, were considered as candidates. Additionally, candidates had to be "reasonable": that is, parameters with higher alignment score and confidence cutoffs were preferred over parameters with just a strict e-value cutoff. This followed the rationale that standard parameters defined in this way would perform well in a broader range of cases. For each simulated sample and database, the different -m parameters were compared. Generally, setting -m to 0.95 was preferred over not setting it at all. This was done for performance reasons. Alignments post-processed by LAST-SPLIT were significantly smaller than native LAST alignments. For example, the native alignment file of the nanopore BFA sample against RDP was 7.5 GB. However, processing the alignment with -m 0.95 and -m 0.90 led to alignments of 38.5 MB and 34.1 MB, respectively. Real-world data was expected to give even larger native alignments because of increased read numbers. Huge alignments drastically increased the calculation time while simultaneously increasing the main memory demand of MetaG.
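The exhaustive parameter search was implemented in Perl with Algorithm::Loops; the same idea can be sketched in Python with itertools.product. The function names, the dictionary keys and the evaluate callback are hypothetical, and ranking by MCC alone is a simplification: in this work the candidates were examined manually and across all four measures.

```python
from itertools import product


def parameter_grid(e_values, alignment_scores, confidences):
    """Return every combination of the three cutoffs as a list of dicts."""
    return [
        {"e": e, "ac": ac, "cc": cc}
        for e, ac, cc in product(e_values, alignment_scores, confidences)
    ]


def rank_candidates(grid, evaluate, top=5):
    """Sort parameter sets by MCC (highest first) for manual inspection.

    `evaluate` is supplied by the caller and must return a dict with an
    "MCC" entry for one parameter set.
    """
    scored = [(evaluate(params)["MCC"], params) for params in grid]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [params for _, params in scored[:top]]
```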

Finally, a consensus of alignment score, confidence and e-value cutoff was obtained for both RDP and MTX and an individual sample. Thus, the consensus only depended on the sequencing technology. The derived optimal workflow and parameters for the simulated test sets were:

• Simulated nanopore set: Use LAST-TRAIN on dataset against the desired database and use the obtained optimal LAST parameters and matrix. The optimal LAST parameters and matrices for the following analyses can be found in the digital appendix files D05 and D06. Choose a -m of 0.95 for LAST-SPLIT. Choose -e -3, -ac 1, -cc 1.

• Simulated MiSeq set: Use LAST-TRAIN on dataset against the desired database and use the obtained optimal LAST parameters and matrix. The optimal LAST parameters and matrices for the following analyses can be found in the digital appendix files D07 and D08. Do not use LAST-SPLIT. Choose -e -11, -ac 1, -cc 1.

The standards for the web interface contain all of the above parameters. The files are specific for each sequencing technology and database. They are listed as D09 to D12 in the digital appendix.


2.2.4. Competing classifiers

In order to assess the quality of MetaG, it was compared to other algorithms. In a recent study on metaprofiling data sequenced by Illumina, the following software solutions outperformed competitors at genus level: Parallel-META v2.4.1 and QIIME v1.9.1. The authors evaluated the programs' performance with various databases. Both programs performed best with the MTX database (Escobar-Zepeda et al. 2018). Because of their peak performance, current versions of QIIME and Parallel-META were evaluated on nanopore and Illumina data. The RDP Classifier was distributed by the RDP database team (Q. Wang et al. 2007). Thus, it was expected to perform on par with other algorithms using RDP and was chosen for the comparison.

2.2.4.1. Parallel-META 3.4.4

Parallel-META 3 had been announced as a fast and memory-efficient metagenomic classification tool (Jing et al. 2017). It offered support for amplicon and whole-genome shotgun sequencing. In the latter case, the amplicons were predicted and extracted (Jing et al. 2017). The markers were then aligned to a custom database using Bowtie2 (Jing et al. 2017). Apart from making classifications, the program offered in-depth sample analyses, such as assessment of α- and β-diversity and functional analysis (Jing et al. 2017). At the time of writing, its current version was 3.4.4¹⁶.

Parallel-META could only be tested using its native database. Escobar-Zepeda and colleagues (Escobar-Zepeda et al. 2018) used Parallel-META v2.4.1 with a patch to support the use of other databases. I contacted the publishers of Parallel-META to find out whether it was possible to use other databases in the latest version or whether I could be provided with version 2.4.1. However, I did not receive any reply.

The included database allowed for the choice between 16S rRNA and 18S rRNA genes. Therefore, I removed all 28S rRNA gene sequences from my input fasta files. The 16S rRNA gene sequences from bacteria and archaea were classified using:

PM-parallel-meta -t 4 -f F -e [mode] -D B -r [input.fasta] -o [outpath]

-D specified the database, the functional analysis (-f) was set to false and the alignment mode -e was specified. The latter ranged between zero and three. The help menu of the software stated that the sensitivity increased with the mode number. All alignment modes were used for the analyses. The functional analysis was disabled, since only the taxonomic annotations were rated.

16 http://bioinfo.single-cell.cn/parallel-meta.html; accessed 21.01.2019.

2.2.4.2. QIIME 2

QIIME 2 was a metagenomic framework with several plugins that allowed for visualizations, quality control and analysis (Bolyen et al. 2018). The modular structure led to a faster pace in development, as improvements to the software could be made by multiple labs at the same time (Bolyen et al. 2018).

I used QIIME 2 as a command line tool with version 2018.11.0. The import of data into QIIME 2 was more complex than the import into the other tools: because of its modular structure, all input data needed to be in a special format, the so-called artifact. This was a zipped folder containing the input data together with additional QIIME data. The latter data was continuously updated throughout the analysis and kept track of the workflow and data integrity (Bolyen et al. 2018). The simulated sequences (SampleData[Sequences]) were imported as follows:

qiime tools import --type SampleData[Sequences] --input-path [input] --output-path [output]

For the database sequences and taxonomies, the --type was adjusted to FeatureData[Sequence] and FeatureData[Taxonomy], respectively. The simulated reads were considered to be a single sample (already demultiplexed). Quality filtering based on quality scores could not be performed, as the simulated reads were in fasta format. No denoising or chimera filtering was performed. Chimeric reads arise during PCR and consist of two amplification targets (Shuldiner, Nirula, and Roth 1989; Meyerhans, Vartanian, and Wain-Hobson 1990). This was not expected in the simulated dataset, as reads were simulated for one reference sequence at a time and no artificial PCR was performed. QIIME 2 uses DADA2 (Callahan et al. 2016) and Deblur (Amir et al. 2017) for denoising. According to current advice by the QIIME 2 service team, neither of the tools was readily applicable to nanopore data¹⁷. For comparability, this step was also skipped for the simulated Illumina reads.

In a next step, identical sequences were collapsed (dereplicated) using qiime vsearch dereplicate-sequences. The outputs were FeatureTable[Frequency] and FeatureData[Sequence] artifacts. The FeatureData[Sequence] artifact was mandatory for the subsequent taxonomic classification. However, by dereplicating the simulated reads, some of the read identifiers were merged and thus lost. IDs were crucial for the evaluation of the classification performance, as each ID had an expected taxonomy.

For that reason, the fasta file inside FeatureData[Sequence] was replaced with a correctly formatted fasta file containing the native reads. The replacement file had a single-line header followed by the corresponding sequence in the subsequent lines. Each sequence line had at most 80 characters. The zipped artifact was not extracted during the replacement, because unzipping and re-zipping was recognized by QIIME 2 and prevented the downstream analysis. Rather, the internal folder structure was mimicked and the fasta file containing the reads in the artifact was updated using:

zip -r artifact [folderStructure][replacement.fa]
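The layout of the replacement fasta file (single-line header, sequence lines of at most 80 characters) can be produced with a short Python function. The function name format_fasta and the representation of reads as (ID, sequence) pairs are assumptions for illustration.

```python
# Minimal sketch; function name and input layout are hypothetical.
def format_fasta(records, width=80):
    """Format (read_id, sequence) pairs as fasta text with wrapped sequences."""
    lines = []
    for read_id, sequence in records:
        lines.append(f">{read_id}")
        # Split the sequence into chunks of at most `width` characters.
        lines.extend(
            sequence[start:start + width]
            for start in range(0, len(sequence), width)
        )
    return "\n".join(lines) + "\n"
```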

17 https://forum.qiime2.org/t/analysis-of-fastq-files/3177; accessed 21.01.2019.

The FeatureData[Sequence] reads could then be classified by three different approaches: local alignment, global alignment or a kmer approach. Local alignment was carried out using the BLAST+ (Camacho et al. 2009) implementation of QIIME 2.

qiime feature-classifier classify-consensus-blast --i-query [FeatureData[Sequence]] --i-reference-reads [FeatureData[Sequence]] --i-reference-taxonomy [FeatureData[Taxonomy]]

All settings were left at default, the most important for the classification were: the maximum number of hits per query (10, --p-maxaccepts), the minimum identity of a match to the query (0.8, --p-perc-identity), e-value threshold (0.001, --p-evalue), the strand of the reference to align to (both, --p-strand) and the minimum percentage for the consensus assignment (0.51, --p-min-consensus).

Global alignment using the vsearch (Rognes et al. 2016) implementation was attempted with default settings, but aborted after over 110 hours of calculations without results. The kmer approach consisted of two separate steps: first, the implemented Naïve Bayesian classifier (Pedregosa et al. 2011; Bokulich et al. 2018) needed to be trained on the database and its taxonomy.

qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads [FeatureData[Sequence]] --i-reference-taxonomy [FeatureData[Taxonomy]]

The classifier was trained on the custom MTX database using standard settings. Training on the custom RDP database was attempted, but not successful. The number of sequences in RDP triggered a bug. An OTU-clustering of RDP was not attempted. After training, the simulated reads were classified using the pre-calculated training_profile and standard parameters in the Scikit-learn (Pedregosa et al. 2011) implementation of QIIME 2 (Bokulich et al. 2018): most importantly, the confidence threshold was 0.7 (--p-confidence) and the read orientation was automatically detected (--p-read-orientation).

qiime feature-classifier classify-sklearn --i-reads [FeatureData[Sequence]] --i-classifier [training_profile]

2.2.4.3. RDP Classifier

The RDP Classifier was a supervised Naïve Bayesian classifier, which used 8-character words to infer the taxonomy for a given query (Q. Wang et al. 2007). The algorithm provided confidence estimates by performing bootstrapping at the genus level. Confidence estimates of broader ranked taxa were the total of those on the genus level within the given clade (Q. Wang et al. 2007). The orientation of the query reads was automatically detected (Q. Wang et al. 2007).

One way to access the algorithm was by its web interface¹⁸. At the time of writing, the online classifier incorporated the 16S rRNA training set 16¹⁸ and three fungal databases: Fungal LSU training set 11 (Liu et al. 2012), Warcup Fungal ITS trainset 2 (Deshpande et al. 2016) and the UNITE Fungal ITS trainset 07-04-2014 (Q. Wang and Cole 2014).

18 https://rdp.cme.msu.edu/classifier/classifier.jsp; accessed 26.02.2019.

Since none of the four databases was able to simultaneously detect bacteria, archaea and fungi, I split both the nanopore and the MiSeq sample. First, the individual subsets containing bacteria and archaea were analyzed using the 16S rRNA training set 16. Next, the individual fungal subsets were analyzed using the Fungal LSU training set 11 (from here on: LSU11), the Warcup Fungal ITS trainset 2 (from here on: WARCUP2) and the UNITE Fungal ITS trainset 07-04-2014 (from here on: UNITE), respectively. During the subsequent analyses, the results from the fungi-free and fungi-only subsets were merged (see also next section) to obtain results as expected by analyzing the full samples. The online version of the classifier did not allow for adjusting parameters that would have influenced the taxa calling. The annotation for each read was obtained from the fixrank output file: if one taxon within the lineage containing domain, phylum, class, order, family and genus was missing, the classification and confidence of the subsequent rank were reported¹⁹. Note that the RDP Classifier using the 16S rRNA gene database or LSU11 only made assignments up to and including the genus in the fixrank output file.

2.2.5. Cross-algorithm comparisons and visualizations

The performance of MetaG and the established classifiers was evaluated on the read level. As the reads of the simulated samples of both sequencing technologies had a known origin, the classifications of each read could be compared to the expected outcome. The reference taxonomy for each read and for each rank (from domain to species) was obtained from the NCBI Taxonomy Database (Sayers et al. 2019). TPs, FPs, TNs and FNs were assigned as described in section 2.2.1. MCC, sensitivity, precision and specificity for all comparisons were plotted using R 3.4.4 (R Core Team 2018) and the packages tidyverse 1.2.1 (Wickham 2017) and reshape2 1.4.3 (Wickham 2007).

First, each individual program was evaluated on the BA and BFA sample, respectively. With that, I aimed to find the most preferable parameter combination for each program and each of the two samples. Thus, MCC, sensitivity, specificity and precision were noted for each sample, rank, database and parameter combination of the individual program, respectively. For MetaG, the previously defined standard parameters were used for RDP and MTX. Parallel-META 3 was compared across all four alignment modes with its custom 16S rRNA gene database. It was only used on the reduced test set lacking fungal sequences and nanopore and MiSeq sequencing, respectively.

The QIIME 2 and RDP classifier workflows were individually compared across confidence cutoffs. Cutoffs of 0, 0.5 and 1 were used. QIIME 2 using BLAST was run with standard parameters and RDP and MTX, respectively. Due to bugs in the Naïve Bayesian classifier of QIIME 2, it was only run with MTX, using standard settings. The RDP Classifier was run with all of its four custom databases (see also previous section). A simultaneous analysis of fungi and bacteria/archaea was not possible in the online version of the classifier (see also section 2.2.4.3.). Thus, for the same confidence cutoff, the numbers of TPs, FPs, TNs and FNs for the analyses of bacteria/archaea and fungi were added. The statistical parameters were recalculated on the new totals.
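Merging the separate RDP Classifier analyses amounts to adding the class counts before any measure is calculated. The Python sketch below illustrates this with hypothetical function names; sensitivity serves as the example measure, and the other measures are recalculated on the new totals in the same way.

```python
from collections import Counter


def merge_counts(*subsets):
    """Add the TP, FP, TN and FN counts of several sub-analyses."""
    total = Counter()
    for counts in subsets:
        total.update(counts)  # Counter.update adds counts instead of replacing
    return total


def sensitivity(counts):
    """Recalculate the sensitivity on a set of (merged) class counts."""
    return counts["TP"] / (counts["TP"] + counts["FN"])
```

Calculating the measure on the pooled counts weights each subset by its number of reads, which is not the same as averaging the measures of the two subsets.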

19 https://rdp.cme.msu.edu/classifier/class_help.jsp#download; accessed 26.02.2019.

From the plots, the confidence cutoff or alignment mode with the best MCC was determined for nanopore and MiSeq sequencing of BA and BFA, respectively. This was done for each program and database. MetaG had been compared using either MTX or RDP. Still, both databases were chosen for the comparisons.

Algorithms using their chosen settings were then compared to each other by plotting MCC, sensitivity, precision and specificity for MiSeq and nanopore sequencing and the BA and BFA sample, respectively. The classifications of the BA sample and MiSeq and nanopore sequencing, respectively, were also used to analyze the specific placement of genus novum and species novum X. composti strain K13 (Kukolya et al. 2018). This was done for the most abundant classifications at family, genus and species level, respectively. The results were compared to the expected taxonomy (Kukolya et al. 2018).

2.2.6. Evaluating the quality of the simulated data

In order to show the general consistency of simulated data and sequenced data, sequences of a mock community were analyzed. The community was the ZymoBIOMICS Microbial Community DNA Standard D6306 sequenced with nanopore chemistry R9.4.1 (Cuscó et al. 2018). It was available from the Sequence Read Archive (SRA) (Leinonen et al. 2011) using the run ID SRR8029984 (Cuscó et al. 2018). Amongst others, the study authors performed a taxonomic classification of the full-length 16S rRNA gene against the NCBI database using the What's in my Pot (WIMP) (Juul et al. 2015) workflow. They compared the results against the species' abundances given by the manufacturer²⁰ (Cuscó et al. 2018).

I chose to compare MetaG using the MTX database, QIIME 2-BLAST using the RDP database and Parallel-META 3 to the study's results. I chose the MTX database for MetaG, because it showed a superior MCC at species level of the simulated BA sample. It was run using the previously defined standard settings for nanopore sequencing (see section 2.2.3.). The specific QIIME 2 workflow was chosen, as, at species level, its MCC was second best after the MetaG workflows. Parallel-META 3 was selected because of its high speed and accuracy. It was run in the strictest alignment mode (three) as outlined in section 2.2.4.1.

The settings of QIIME 2 were changed to cope with the requirements of a real sample. Most importantly, the formation of chimeras could not be excluded anymore, as done for the simulated sets. Besides, knowing the taxonomy of each read was no longer mandatory. Thus, sequences could be clustered to improve the calculation time. Accordingly, no manipulation of intermediate results (see section 2.2.4.2.) was performed. After the initial steps outlined for the analysis of the simulated nanopore data (see section 2.2.4.2.), I chose to use de novo clustering (99 % identity) on the sequences after dereplication. Subsequently, chimeras in the reads were predicted de novo using qiime vsearch uchime-denovo (Rognes et al. 2016) with default settings. The clustered FeatureTable [Frequency] and FeatureData [Sequence] were strictly filtered: the resulting nonchimera.qza artifact was used for the taxonomic assignment. The following BLAST search against RDP was carried out, as described for the simulated sets.

20 https://files.zymoresearch.com/protocols/_d6305_d6306_zymobiomics_microbial_community_dna_standard.pdf; accessed 24.04.2019.

The ZymoBIOMICS Microbial Community DNA Standard D6306 included eight bacteria and two fungi20 (see also Table 2). Only the bacterial abundances, measured as the 16S rRNA gene composition, were considered. The values given by the manufacturer were plotted against those obtained by the reference study (Cuscó et al. 2018). The study's results were derived from its Figure 3 c) (Cuscó et al. 2018). The abundances obtained by MetaG using MTX and Parallel-META 3 using alignment mode 3 were also included in the figure. The QIIME 2 analysis had to be stopped without results. Absolute taxa calls by MetaG and Parallel-META 3 were transformed to relative abundances considering the individual total number of matched reads as the whole. To the best of my knowledge, Cuscó and colleagues also only considered assigned reads (Cuscó et al. 2018).

Table 2: Species and the relative abundances of their 16S rRNA genes in the ZymoBIOMICS Microbial Community DNA Standard, as reported by the manufacturer20.

Species                     Relative abundance 16S rRNA gene (%)
Pseudomonas aeruginosa      4.2
Escherichia coli            10.1
Salmonella enterica         10.4
Lactobacillus fermentum     18.4
Enterococcus faecalis       9.9
Staphylococcus aureus       15.5
Listeria monocytogenes      14.1
Bacillus subtilis           17.4
Saccharomyces cerevisiae    NA
Cryptococcus neoformans     NA

To estimate the relative abundances of all unexpected taxa, the sum of all relative abundances of the expected taxa was calculated for each algorithm. The abundance of the unexpected taxa equaled the difference between the sum and a total of one. Bar plots of the classifications were created using R 3.4.4 and tidyverse 1.2.1.
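The conversion of absolute taxa calls to relative abundances and the estimate for the unexpected taxa can be sketched in Python 3 as follows. The function names and the read counts are hypothetical; the plots themselves were created in R as stated above.

```python
def relative_abundances(counts):
    """Convert absolute taxa calls to fractions of all matched reads."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}

def unexpected_fraction(rel, expected):
    """Abundance of unexpected taxa: one minus the sum over the expected taxa."""
    return 1.0 - sum(rel.get(taxon, 0.0) for taxon in expected)

# Hypothetical matched-read counts of one classifier
calls = {"Escherichia coli": 50, "Bacillus subtilis": 30, "Vibrio cholerae": 20}
rel = relative_abundances(calls)
expected_taxa = ["Escherichia coli", "Bacillus subtilis"]
print(unexpected_fraction(rel, expected_taxa))
```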


2.3. Design of the SNP analysis in NanoPipe

2.3.1. NanoPipe, a bioinformatics pipeline featuring SNP analysis

The SNP calling and analysis presented in the following sections were integrated into NanoPipe (Shabardina et al. 2019). NanoPipe was available as a web tool21 and standalone program22 (Shabardina et al. 2019). It was developed as an intuitive program with strong default settings and also targeted wet-lab scientists. Given a target and sequencing reads, alignments were performed using LAST (Shabardina et al. 2019). The most important outputs for the user were: alignments, corresponding statistics, mapping of queries, mapping statistics and the SNP analysis (Shabardina et al. 2019).

The SNP analysis depended on the calculation of a consensus sequence from the alignment which was based on the majority rule. However, if the coverage at a position was less than ten, a gap was introduced (Shabardina et al. 2019).
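A minimal Python 3 sketch of the majority rule with the coverage threshold is given below. It assumes that the coverage is the sum of the four nucleotide counts and that ties are resolved in favor of the first nucleotide listed; neither detail is specified here, and the function name is hypothetical.

```python
def consensus_base(counts, min_coverage=10):
    """Return the most frequent nucleotide, or a gap if the coverage is too low."""
    coverage = sum(counts.values())
    if coverage < min_coverage:
        return "-"
    # max() returns the first key with the highest count (assumed tie rule)
    return max(counts, key=counts.get)

print(consensus_base({"A": 6, "C": 3, "G": 1, "T": 0}))  # coverage 10
print(consensus_base({"A": 6, "C": 2, "G": 1, "T": 0}))  # coverage 9
```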

Nucleotide abundances and occurrences, as well as gaps, were reported in files structured by the target IDs. These files did not report every target position, but only those within regions that query reads often mapped to with high quality. The files were used for the SNP analysis and will be referred to as candidate files in the following sections.

2.3.2. Workflow of the analysis

The main script in the SNP analysis was nanopipe_calc_polymorphism.py which was written in Python 2.7.15rc1. It communicated with NanoPipe, combined data from different SNP analyses and coordinated the generation of structured results. For that, it received the species ID of the analyzed organism (-s) to provide further information about previously detected SNPs at the candidate positions. For this purpose, dbSNP (Sherry et al. 2001) and PlasmoDB (Aurrecoechea et al. 2009) were queried (Shabardina et al. 2019). Besides, it could optionally analyze the alignment quality around a SNP, given -q (see also workflow in Figure 7). In NanoPipe, -q was always set (Shabardina et al. 2019). The script needed to be called in the same directory where the results of the upstream analyses had been stored. All polymorphism scripts have been tested on Ubuntu 18.04.2 LTS and Ubuntu 16.04.6 LTS.

21 http://www.bioinformatics.uni-muenster.de/tools/nanopipe2/index.hbi?lang=en; accessed 01.03.2019.
22 https://github.com/IOB-Muenster/NanoPipe2; accessed 01.03.2019.

Figure 7: The workflow of the polymorphism analysis. It starts at nucleotide counts obtained from high quality regions during alignment. Spurious SNP candidates were filtered and the remainder was rated based on the mutation type: transitions or transversions (Shabardina et al. 2019). Optionally, candidates were compared to PlasmoDB or dbSNP (ID annotation) and/or the quality in the surroundings was assessed. Regardless of the chosen options, the results were displayed in a table format (Shabardina et al. 2019).

2.3.3. Filtering ambiguous SNP candidates

The polymorphism candidates were structured into separate candidate files by NanoPipe which were called calc.nuccount.ID. ID was a cipher for the respective target ID. The encodings were resolved by NanoPipe in a configuration file (calc.tidmap) that was specific for each analysis. The individual candidate files contained information on the target ID, as well as tab-delimited information on the polymorphism candidates: position on the target, counts of the four nucleotides, the consensus and the target nucleotide and the number of mapped reads showing a gap at that position. In order to remove spurious candidates, several custom filtering steps were applied to the positions (labeled: one to four):

(1) The consensus nucleotide of all reads at a candidate position was not allowed to be a gap or non-DNA character (Shabardina et al. 2019).
(2) Besides, the nucleotide of the target had to have an abundance of at most 80 % of the total nucleotide abundance at a position.
(3) Hence, each putative SNP had to have an abundance of 20 % or more (Shabardina et al. 2019).
(4) The maximum coverage for a target ID was calculated and individual positions with a coverage of less than 30 % of the maximum were removed (Shabardina et al. 2019).
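The four filtering steps can be sketched in Python 3 as shown below. This is not the code of nanopipe_calc_polymorphism.py: the dictionary layout, the function name and the assumption that the coverage equals the sum of the four nucleotide counts (gaps excluded) are illustrative only.

```python
DNA = {"A", "C", "G", "T"}

def filter_candidates(candidates, max_target=0.80, min_snp=0.20, min_cov=0.30):
    """Return (position, {SNP nucleotide: relative abundance}) for kept positions."""
    coverages = [sum(c["counts"].values()) for c in candidates]
    max_coverage = max(coverages, default=0)
    kept = []
    for cand, cov in zip(candidates, coverages):
        if cand["consensus"] not in DNA:            # step (1)
            continue
        if cov == 0 or cov < min_cov * max_coverage:  # step (4)
            continue
        target = cand["target"]
        if cand["counts"].get(target, 0) / cov > max_target:  # step (2)
            continue
        snps = {nt: n / cov for nt, n in cand["counts"].items()
                if nt != target and n / cov >= min_snp}       # step (3)
        if snps:
            kept.append((cand["position"], snps))
    return kept

# Hypothetical candidate positions of one target ID
example = [
    {"position": 1, "counts": {"A": 70, "C": 0, "G": 30, "T": 0}, "consensus": "A", "target": "A"},
    {"position": 2, "counts": {"A": 95, "C": 0, "G": 5, "T": 0}, "consensus": "A", "target": "A"},
    {"position": 3, "counts": {"A": 50, "C": 0, "G": 50, "T": 0}, "consensus": "-", "target": "A"},
    {"position": 4, "counts": {"A": 10, "C": 10, "G": 0, "T": 0}, "consensus": "A", "target": "A"},
]
print(filter_candidates(example))
```

In this example only position 1 passes: position 2 fails steps (2) and (3), position 3 fails step (1) and position 4 fails step (4).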


2.3.4. Providing metadata for polymorphic sites

2.3.4.1. Joint probability of a SNP

Even though the candidates were strictly filtered, artifacts might have introduced artificial SNPs. In an attempt to provide the user with a confidence for each SNP at a position, the likelihood of its occurrence was calculated (Shabardina et al. 2019).

The DNA nucleotides are grouped into pyrimidines and purines and thus there are two major types of mutations: a transition (TI) is a base change within either pyrimidines or purines, whereas a transversion (TV) leads to a change between the groups. If mutations were completely random, transversions would be twice as likely as transitions (reviewed in: Wakeley 1996). However, when homologous sequences are compared, transitions occur more often than transversions. The magnitude of and the phenomenon itself are, amongst others, dependent on the selection strength, codon position, gene, organism and method of calculation (reviewed in: Wakeley 1996).

This observation was transformed into a simplified model: transitions were assumed to be twice as likely as transversions in all samples (TI/TV = 2) (Shabardina et al. 2019). This was in line with a study on healthy Caucasian adults which found an average ratio of 2.13 for the SNPs (Shen et al. 2013). If a transition could transform the target nucleotide to the SNP candidate, the relative abundance of the candidate was multiplied by two. In the case of a transversion, the relative abundance remained unchanged. The weighted abundance was called the joint probability of a SNP, and the sum at each position was one (Shabardina et al. 2019). However, the probability calculations were not used to filter any SNPs (Shabardina et al. 2019). The metadata gave the user a rough estimate of the likelihood of a specific SNP at a position with multiple candidates (Shabardina et al. 2019). Note, however, that likelihoods between positions cannot be compared. For example, a single valid base change at one position may get the same joint probability as a single spurious base change at another position.
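The weighting can be illustrated with a short Python 3 sketch. It assumes that the weighted abundances of the nucleotides passed to the function are normalized so that they sum to one, and that a nucleotide identical to the target keeps its weight unchanged; how the original script treats the target nucleotide is not specified here, and the function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(a, b):
    """True if a and b differ but belong to the same nucleotide group."""
    return a != b and ({a, b} <= PURINES or {a, b} <= PYRIMIDINES)

def joint_probabilities(target, abundances, ti_weight=2.0):
    """Weight transitions by TI/TV = 2 and normalize the result to a sum of one."""
    weighted = {nt: ab * (ti_weight if is_transition(target, nt) else 1.0)
                for nt, ab in abundances.items()}
    total = sum(weighted.values())
    return {nt: w / total for nt, w in weighted.items()} if total else {}

# Target A with two equally abundant candidates: G (transition), C (transversion)
print(joint_probabilities("A", {"G": 0.2, "C": 0.2}))
```

With equal raw abundances, the transition receives twice the joint probability of the transversion (2/3 versus 1/3).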

2.3.4.2. Connecting previous and topical observations

With the fast increase in data, interlacing the information has become increasingly important. Connections between reported and just analyzed SNPs have been implemented for human and Plasmodium falciparum data (Shabardina et al. 2019). The species were indicated by the parameters -s human and -s plasf given to the Python script, respectively. Other organisms were not supported (Shabardina et al. 2019).

PlasmoDB was used to query SNP candidates of P. falciparum (Shabardina et al. 2019). PlasmoDB is a database for genomic, transcriptomic and proteomic data, as well as metadata, of the Plasmodium parasites causing malaria. Data had been collected for multiple life stages of selected species and for multiple strains (Aurrecoechea et al. 2009).

A local database was built by obtaining SNP positions for P. falciparum 3D7 and the chromosomes 1 to 14, individually starting at base position 0. The read frequency threshold was set to 80 %, the minor allele frequency was zero and the percentage of isolates with a base call was 20 %. The reported positions for the individual chromosomes were sorted and saved into separate files. Each file consisted of five tab-delimited columns: SNP ID, major allele, major allele frequency, minor allele and minor allele frequency. The suffix of the SNP ID provided information about the location. For example: NGS_SNP.Pf3D7_12_v3.8 was a SNP at position 8 of chromosome 12 (note the bold print).

For each target ID, the observed positions were compared to those reported in the database. If a match was found, the database ID, major and minor allele and the respective frequencies were extracted. Numbers of matched SNPs were reported in the logs. Target IDs were required to be in a similar format as the entries in PlasmoDB: they had to start with Pf3D7_ followed by two digits and end with _v3.
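The ID conventions described above can be expressed as a small Python 3 sketch. The regular expression is a literal reading of the stated format, and the function names as well as the dictionary used as a local database are hypothetical.

```python
import re

# Target IDs: start with Pf3D7_, followed by two digits, and end with _v3
TARGET_ID = re.compile(r"^Pf3D7_\d{2}.*_v3$")

def snp_location(snp_id):
    """Split an ID like NGS_SNP.Pf3D7_12_v3.8 into (chromosome ID, position)."""
    prefix, position = snp_id.rsplit(".", 1)
    chromosome = prefix.split(".", 1)[1]
    return chromosome, int(position)

def match_candidates(observed_positions, database):
    """Return the database entries for all observed positions that were reported."""
    return {pos: database[pos] for pos in observed_positions if pos in database}

# Hypothetical database for one chromosome: position -> (ID, alleles, frequencies)
local_db = {8: ("NGS_SNP.Pf3D7_12_v3.8", "A", 0.9, "G", 0.1)}
print(snp_location("NGS_SNP.Pf3D7_12_v3.8"))
print(match_candidates([5, 8], local_db))
```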

dbSNP was used to query human SNP candidates (Shabardina et al. 2019). dbSNP was a database of nucleotide variations with associated metadata (Sherry et al. 2001). Variation data was not subject to any filtering regarding allele count or clinical significance (Sherry et al. 2001). Recently, modifications to the database were introduced and it now maintains only human data.23,24 The database was interconnected with other NCBI databases and provided links to several external databases for more specific analyses (Sherry et al. 2001). As a consequence, having the specific ID for an observed polymorphism was a gateway for NanoPipe users to more detailed analyses.

To provide the users with the most topical information and to avoid the computational challenges of huge files, dbSNP's application programming interface (API)25 was used. Only target IDs which indicated chromosomes and transcripts were queried in the API. More specifically, chromosomes were allowed to be a one or two-digit number, X or Y, respectively. Transcript IDs had to start with A, N or X; followed by one of: C, G, T, W, Z, M or R. The two characters had to be followed by an underscore.
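The rules for queryable target IDs translate into two regular expressions. The following Python 3 sketch assumes that chromosome IDs consist only of the number or letter (without a prefix such as chr) and that only the first three characters of a transcript ID are checked; the names are hypothetical.

```python
import re

# One- or two-digit number, X or Y
CHROMOSOME = re.compile(r"^(?:\d{1,2}|X|Y)$")
# A, N or X; then one of C, G, T, W, Z, M or R; then an underscore
TRANSCRIPT = re.compile(r"^[ANX][CGTWZMR]_")

def is_queryable(target_id):
    """True if the target ID looks like a chromosome or a transcript ID."""
    return bool(CHROMOSOME.match(target_id) or TRANSCRIPT.match(target_id))

for tid in ("7", "X", "NM_005228.5", "chrUn", "123"):
    print(tid, is_queryable(tid))
```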

The requests were sent to the VCF services of the API using a vcf table. It consisted of five tab-delimited columns: chromosome, position, ID (containing a placeholder for the rs- ID), reference (i.e.: target nucleotide) and alternative (i.e.: observed nucleotide). Each row was only allowed to have one observed nucleotide.

I did not query the target IDs one at a time. Instead, candidates for all target IDs in the analysis were processed together. To receive any reported SNP at the observed positions, each position for each target ID was reported on three different lines: the target nucleotide was the same, but the observed nucleotide was one of the three remaining nucleotides. The POST request was analyzed for assembly GCF_000001405.38. 40,000 rows were sent at once. If the server returned an HTTP or URL error, a second try was performed after 30 seconds. If at least one of the errors occurred again, db:error was reported for every candidate position. If no error occurred, dbSNP returned the query table. If the SNP had been previously reported, the ID placeholder was replaced by an rs-ID. The rs-ID, target and observed alleles were saved for every chromosome and candidate position. IDs with the same name and alleles were reported only once. If applicable, the next blocks of 40,000 rows each were processed. Between each block, 0.5 seconds were added to the calculation time to stay within the requirements of the API. Checks for HTTP and URL errors were performed at the start of the calculations for each block. API specific errors were printed to the same destination as the status reports (see section 2.3.5.).

23 https://ncbiinsights.ncbi.nlm.nih.gov/2017/05/09/phasing-out-support-for-non-human-genome-organism-data-in-dbsnp-and-dbvar/; accessed 06.03.2019.
24 https://ncbiinsights.ncbi.nlm.nih.gov/2017/07/07/dbsnp-redesign-supports-future-data-expansion/; accessed 06.03.2019.
25 https://api.ncbi.nlm.nih.gov/variation/v0/; accessed 06.03.2019.
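The construction of the query table, the splitting into blocks and the retry logic of the dbSNP querying can be sketched in Python 3 as follows. The sketch contains no network code: send is a hypothetical callable that posts one block and raises OSError (the base class of the HTTP and URL errors in Python 3) on failure. The placeholder "." for the rs-ID is likewise an assumption.

```python
import time

def vcf_rows(candidates, placeholder="."):
    """Build three rows per (chromosome, position, target nucleotide) candidate."""
    rows = []
    for chrom, pos, ref in candidates:
        for alt in "ACGT":
            if alt != ref:  # one observed nucleotide per row
                rows.append("\t".join([chrom, str(pos), placeholder, ref, alt]))
    return rows

def blocks(rows, size=40000):
    """Yield consecutive blocks of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

def query_all(rows, send, retry_wait=30, pause=0.5, sleep=time.sleep):
    """Send all blocks; return None if a block fails twice (caller reports db:error)."""
    results = []
    for block in blocks(rows):
        try:
            results.extend(send(block))
        except OSError:
            sleep(retry_wait)  # second try after the waiting time
            try:
                results.extend(send(block))
            except OSError:
                return None
        sleep(pause)  # stay within the request limits of the API
    return results

print(vcf_rows([("7", 100, "A")]))
```

Passing sleep as a parameter keeps the sketch testable without actually waiting.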

For each target ID, the number of matches was reported in a log file. The log also reported artifacts for both PlasmoDB and dbSNP. These covered cases in which the positions returned by the database search did not match the respective query positions. This should not have been the case and was only reported for the purpose of internal debugging. Thus, the artifact counts were without any biological significance.

2.3.4.3. The alignment quality as a measure of a SNP's reliability

If given the parameter -q, nanopipe_calc_polymorphism.py was also able to analyze the alignment quality around an individual SNP (Shabardina et al. 2019). The calculations were performed for positions of all target IDs at once. They were outsourced to the Perl v5.26.1 script nanopipe_qualreg.pl for better performance. Both scripts communicated using the JavaScript Object Notation (JSON). The Perl script used the non-proprietary module JSON::Tiny26 version 0.58.

The calculations started by loading the LAST alignment MAF file into memory. Each alignment included a line indicating the alignment quality for each base. The quality was not given in percent, but was encoded by a symbol27. For each target ID, I saved all alignment start positions with an array of end positions and according alignment quality strings.

For each target ID, arrays of observed SNPs and alignment starts were created. For each observed position, the data of alignments with a start smaller than or equal to the SNP was obtained via a binary array search (see also section 2.1.3.). If the alignment end was equal to or larger than the SNP, the alignment was considered for the calculation of the average alignment error. For that, the quality symbols within a region of at most ten nucleotides before and after the SNP were extracted from each of the alignments. If the SNP was too close to one of the alignment ends, fewer nucleotides were extracted (Shabardina et al. 2019). The average error probability around a SNP was calculated from the extracted symbols of all alignments containing the respective polymorphism. For that, the symbols were converted to error probabilities (p-error) (Shabardina et al. 2019). The two steps were combined in the following formula modified from Shabardina et al. (Shabardina et al. 2019):

In the formula, i is the sum of the number of extracted symbols over all alignments that contain the SNP (Shabardina et al. 2019) and symb is an individual quality symbol. The average error ranges between zero and one, with one indicating a highly ambiguous region (Shabardina et al. 2019).
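The search for overlapping alignments and the averaging of the error probabilities can be sketched in Python 3 as shown below. This is not the code of nanopipe_qualreg.pl. The sketch assumes that a quality symbol encodes the error probability as 10^(-(ASCII code - 33)/10), that each quality string holds one symbol per target position (no gaps), that positions are inclusive and that the symbol at the SNP itself is part of the window. All names are hypothetical.

```python
from bisect import bisect_right

def symbol_error(symbol):
    """Convert one quality symbol to an error probability (assumed encoding)."""
    return 10 ** (-(ord(symbol) - 33) / 10)

def average_error(snp, alignments, flank=10):
    """Average p-error around a SNP; alignments are (start, end, quality) sorted by start."""
    starts = [start for start, _, _ in alignments]
    errors = []
    # Binary search: only alignments starting at or before the SNP are candidates
    for start, end, quality in alignments[:bisect_right(starts, snp)]:
        if end < snp:
            continue  # alignment ends before the SNP
        offset = snp - start
        # Fewer symbols are extracted if the SNP is close to an alignment end
        window = quality[max(0, offset - flank):offset + flank + 1]
        errors.extend(symbol_error(s) for s in window)
    return sum(errors) / len(errors) if errors else None

# Hypothetical alignments: '+' encodes an error of 0.1, '!' an error of 1.0
alns = [(0, 4, "+++++"), (3, 6, "!!!!")]
print(average_error(2, alns), average_error(4, alns))
```

The number of collected symbols corresponds to i in the description above, so the result is the mean error probability over all extracted symbols.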

26 https://metacpan.org/pod/JSON::Tiny; accessed 06.03.2019.
27 http://last.cbrc.jp/doc/last-split.html; accessed 05.03.2019.

2.3.5. Display of results from the polymorphism analysis

The results of the analyses were written to separate output files depending on the individual target ID. The output files had the same name as the candidate files, but with the suffix .poly. This was done by the Python script, if no analysis of the alignment quality had been performed. Otherwise, the Perl script wrote the files. The output files were tab-delimited tables. Each row contained a single SNP position. The results display of the web tool followed the same structure as the output files. The following columns were displayed: the position and four columns indicating the joint probability of each nucleotide at the position (Shabardina et al. 2019). If the probability was 0, the nucleotide was marked with a hyphen. The next column was the target nucleotide, followed by the matches in dbSNP or PlasmoDB, if applicable (Shabardina et al. 2019). A new feature since the official release of NanoPipe was that each SNP match was provided with a direct link to the respective online database entry. The link was created by the nanopipe_calc_polymorphism.py script, if no subsequent quality analysis was performed. Otherwise, it was created by nanopipe_qualreg.pl. The link was accessible to the user via the web interface, created by Norbert Grundmann. Note that this feature was exclusive to the online implementation. Links were not provided in the standalone application. The source code of the SNP scripts for the standalone version and online implementation of NanoPipe, respectively, can be found in the digital appendix (see D13 and D14).

Positions that could be queried in the databases, but had no matches, showed an empty field. If the organism was human or P. falciparum, but the target ID could not be queried, every SNP for that target ID was assigned N/A. Likewise, if an HTTP or URL error occurred during the dbSNP querying, db:error was assigned. If the organism was not human or P. falciparum, the column was hidden. The alignment quality was reported for each position (Shabardina et al. 2019). If the alignment quality analysis was disabled, for example by users of the standalone version, the column was not displayed. The last four columns contained the raw nucleotide counts (Shabardina et al. 2019).

In the web tool, error messages (see also above) and status reports for the main calculation steps were reported in internal log files. Only status reports were shown to the user during calculation time. In the standalone version, all messages were printed to the terminal, by default.

2.4. Benchmarking of nanopore sequencing and NanoPipe SNP analysis

NanoPipe and the included SNP module had been extensively tested to detect SNPs of high biological importance (Shabardina et al. 2019). Due to this, I decided to test the joint SNP prediction power of nanopore sequencing followed by the polymorphism analysis in NanoPipe. For that, I retrieved the E. coli reference genome from the NanoPipe targets.

To assess the influence of sequencing errors alone, reads of the unmodified target genome were simulated as error-free or using the error profile for E. coli 2D sequencing on flowcell chemistry R9. Simulations were carried out using NanoSim-H. The number of simulated reads was 100,000 in all cases. Reads had to have a length between 1000 and 2000 nt. Thus, the expected coverage of the ca. 4.7 million bp E. coli genome was at least ca. 21. Apart from --circular, the other settings were left at default values. This also included the seed number, which was 42. Using the same error profile and seed number, reads could be exactly replicated.

nanosim-h -n 100000 --circular --perfect --min-len 1000 --max-len 2000

or

nanosim-h -p 'ecoli_R9_2D' -n 100000 --circular --min-len 1000 --max-len 2000
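The coverage estimate stated for these simulations follows from the number of reads, the read length limits and the genome size. A short Python 3 calculation, using the rounded genome size of 4.7 million bp given above:

```python
n_reads = 100_000
min_len, max_len = 1000, 2000   # nt, as set by --min-len and --max-len
genome_size = 4_700_000         # bp, rounded E. coli genome size

# Expected coverage = number of reads * read length / genome size
min_coverage = n_reads * min_len / genome_size
max_coverage = n_reads * max_len / genome_size
print(round(min_coverage, 1), round(max_coverage, 1))
```

The lower bound of ca. 21 corresponds to the case in which every read has the minimum length of 1000 nt.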

Sources of artifacts in the datasets without simulated SNPs were thoroughly analyzed in order to prevent these with updates to the SNP algorithm. The analysis for each position included: assessing the average alignment error as provided by NanoPipe and analyzing the overall nucleotide count (coverage) and relative abundance of the SNP nucleotide. The sum of all SNPs was analyzed for a possible skew of mutations towards transitions or transversions.

Using a custom Perl v5.26.1 script, SNPs were simulated at random positions on the genome. 100 positions were simulated as biallelic, i.e. the target nucleotide and a mutation were present at a ratio of ca. 67:33 or vice versa. Sites were evenly split between the two different ratios. Using NanoSim-H, nanopore reads were simulated as outlined above, using no error profile at all (--perfect) and using the error profile ecoli_R9_2D, respectively.

In the sets containing the simulated SNPs, the following classifications were made: a detected SNP was a TP, if the algorithm detected a SNP at the expected position, regardless of abundance and nucleotide. Expected SNPs that were not in the output were considered FN. A position that was not simulated as polymorphic and was not falsely identified as such, was a TN. Positions that were called, but not expected, were contributing to the FP count. With these classes, sensitivity, precision, specificity and MCC of the analyses were assessed (see also section 2.2.1.). Besides, the total ratio of transitions and transversions (TI/TV) was calculated. At the default threshold of NanoPipe (20 % SNP allele abundance (Shabardina et al. 2019)), the average and maximum p-error were analyzed for abnormalities.
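The classification of positions can be expressed with set operations. The following Python 3 sketch assumes that every genome position that is neither expected nor called counts as a TN; the function name and the example positions are hypothetical.

```python
def classify_positions(expected, called, genome_length):
    """Count TP, FN, FP and TN from expected and called SNP positions."""
    expected, called = set(expected), set(called)
    tp = len(expected & called)   # expected and detected, regardless of nucleotide
    fn = len(expected - called)   # expected, but not in the output
    fp = len(called - expected)   # called, but not expected
    tn = genome_length - tp - fn - fp
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}

# Hypothetical example on a genome of 1000 positions
print(classify_positions({10, 20, 30}, {20, 30, 40, 50}, 1000))
```

Because nearly all genome positions are TNs, the specificity stays close to one even with many FP calls, which is one reason to also consider precision and MCC.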

Reduction of FP calls was attempted by increasing the minimum relative abundance of a SNP nucleotide at each position from 20 % to 22, 25, 30 and 35 %, respectively. The maximum relative abundances of the target nucleotide were adjusted to reach a total of 100 %, respectively. For selected parts of the above analyses, bar plots were created using R 3.4.4 and tidyverse 1.2.1.

Subsequently, the improved cutoff was tested for its ability to recover the representative SNPs of P. falciparum (only Pf3D7_07_v3) and cell line H1975 (only chromosome seven) presented by Shabardina and colleagues (Shabardina et al. 2019). The input data of the two study cases was obtained online28. The polymorphism calculation was run either on a server with 64 cores (AMD Opteron™ Processor 6378; clock speed: 2.40 GHz) and 512 GB RAM or on a desktop PC with 8 cores (Intel® Core™ i7-6700; clock speed: 3.40 GHz) and 15.5 GB RAM. The server used the Ubuntu 16.04.6 LTS operating system, and the desktop PC used Ubuntu 18.04.2 LTS.

28 http://bioinformatics.uni-muenster.de/share/NanoPipe_test_data/; accessed 24.04.2019.

3. Results

3.1. Performance evaluation of MetaG

3.1.1. Statistical evaluation and comparison to competitors

3.1.1.1. Simulated nanopore sequencing

To compare the performance of MetaG against the competing classifiers, nanopore sequencing was simulated. Since Parallel-META 3 could not analyze 28S rRNA gene sequences (see section 2.2.4.1.), two separate samples were created. One sample contained only 16S rRNA gene sequences from bacteria and archaea (BA), the other additionally contained 28S rRNA gene sequences from fungi (BFA). The online RDP Classifier could only separately analyze 16S and 28S rRNA gene sequences (see section 2.2.4.3.). Therefore, it was tested on a set containing only 28S rRNA gene sequences and subsequently on the BA set. The results were then merged to estimate its performance for the BFA set. Note that the RDP Classifier only made assignments up to and including the genus in the fixrank output file of the analyses using the 16S rRNA gene database and/or LSU11.

At first, classifiers were separately tested for their performance on BA and BFA. For that, classifiers were run with different settings and the sensitivity, specificity, precision and MCC of the classifications for each rank were analyzed. For the sake of visual clarity, only plots of the MCC comparisons can be found in the supplemental data (see Figures S1-S13). The MCC was used as a general performance indicator of an algorithm. An MCC of one indicated perfect classifications, the opposite was true for minus one (Matthews 1975).

Table 3 shows the individual classifiers, varied setting and the setting chosen for an inter-algorithm comparison. This was often the optimal setting as revealed by the MCC. For MetaG, the database was varied between MTX and RDP. Even though MTX clearly outperformed RDP in the BFA sample (see Figure S7), both databases were chosen for BA and BFA (see Table 3). QIIME 2 Blast and the QIIME 2 Classifier analyses showed similar performance of the confidence cutoffs 0 and 0.5 (see Figures S3-5 and S8-10). I chose to continue with the stricter cutoff, 0.5 (see Table 3).


Table 3: Choice of the best settings for each classifier and the in silico nanopore-sequenced BA and BFA sample, respectively. For each classifier, results from varying one specific setting (Variable setting) were analyzed to choose a subset (Chosen) for inter-algorithm comparison. This was often the best subset as identified by MCC. The Figure column indicates the MCC plot of the individual comparison.

Sample  Program                         Variable setting    Chosen    Figure
BA      MetaG                           Database            MTX; RDP  S1
BA      Parallel-META 3                 Alignment mode      3         S2
BA      QIIME 2 Blast MTX               Confidence cutoff   0.5       S3
BA      QIIME 2 Blast RDP               Confidence cutoff   0.5       S4
BA      QIIME 2 Classifier MTX          Confidence cutoff   0.5       S5
BA      RDP Classifier 16S              Confidence cutoff   0.5       S6
BFA     MetaG                           Database            MTX; RDP  S7
BFA     QIIME 2 Blast MTX               Confidence cutoff   0.5       S8
BFA     QIIME 2 Blast RDP               Confidence cutoff   0.5       S9
BFA     QIIME 2 Classifier MTX          Confidence cutoff   0.5       S10
BFA     RDP Classifier 16S + LSU11      Confidence cutoff   1         S11
BFA     RDP Classifier 16S + UNITE      Confidence cutoff   0.5       S12
BFA     RDP Classifier 16S + WARCUP2    Confidence cutoff   0.5       S13

All classifiers with their chosen settings were then compared to each other using the aforementioned statistical measures. For the sake of visual clarity, only the findings for the MCC are shown in the main text. The remaining metrics can be found in the appendix (see Figures S14-S19).

The MCC of all algorithms analyzing the nanopore BA sample decreased towards the class and especially the order level. The values reached a local maximum at the family level before dropping again (from here on: wave-like pattern) (see Figure 8). A similar trend was observed for precision and specificity (see Figure S15 and S16). The measure indicated that MetaG, Parallel-META 3 and QIIME 2 Blast using RDP performed almost equally well from domain to order (see Figure 8). At family level, Parallel-META 3 outperformed all others. From the genus on, MetaG took the lead (see Figure 8). The worst algorithm was the QIIME 2 Classifier with the MTX database (see Figure 8). Using RDP was beneficial over using MTX for identifications at domain, phylum, class and family level (see Figure 8). All programs performed worst when identifying species. MetaG using MTX was the only classifier with a positive MCC at species level (see Figure 8). With RDP, it still performed better than the others. At species level, the MCC for the RDP Classifier was undefined (see also Figure 8).


Figure 8: MCC from domain to species for classifiers using their chosen settings to analyze the BA sample. The sample was subject to simulated nanopore sequencing. 16S is the 16S rRNA training set 16 of the RDP Classifier. The MCC of the RDP classifier was undefined at species level.

Next, the nanopore BFA sample was analyzed. Parallel-META 3 was not suited for the evaluation of the BFA sample (see section 2.2.4.1.). Thus, it was removed from the comparison. The RDP Classifier was split into three different workflows according to the three different fungal databases used in conjunction with the 16S rRNA gene database.

The most obvious difference between the BFA and BA sample was that the overall MCC was lower in BFA (see Figure 8 and 9). In the BFA sample, sensitivity, precision, specificity and MCC were higher for algorithms using MTX than for algorithms using RDP. This pattern was common for domain to order (see Figures S17-S19; 9). At the three broadest ranks, the inverse had been observed for the specificity and the MCC of BA (see Figures S16 and 8). Contrary to what the BA results suggested, QIIME 2 BLAST using RDP performed worse than with MTX at order and species level (see Figures 8 and 9).


Figure 9: MCC from domain to species for classifiers using their chosen settings to analyze the BFA sample. The sample was subject to simulated nanopore sequencing. 16S is the 16S rRNA training set 16 of the RDP Classifier. The MCC for the RDP Classifier with the 16S and LSU11 databases was undefined at species level.

The MCC of most algorithms in BFA followed a wave-like pattern similar to that in BA. However, the local maximum at the family level was absent or less pronounced for MetaG and QIIME 2 Blast, respectively (see Figures 8 and 9). This is in line with the overall observations of diminished local maxima at this rank for precision and specificity in BFA vs. BA, respectively (see Figure S15 and S18; S16 and S19).

Again, the overall worst classifications were calculated by the QIIME 2 Classifier using the MTX database (see Figures 8 and 9). The three RDP Classifier workflows performed better than the QIIME 2 Classifier. Interestingly, LSU11 performed worse than UNITE and WARCUP2 at most ranks, except family and genus (see Figure 9). At species level, the MCC for LSU11 was undefined (see also Figure 9). Over most ranks, MetaG using the MTX database was the best classifier. However, MetaG using RDP performed marginally better at family level (see Figure 9). The superiority of MetaG over all ranks was not observed in the BA sample (see Figures 8 and 9). As for BA, all algorithms classifying species showed a negative MCC, except for MetaG MTX (see Figures 8 and 9).

3.1.1.2. Simulated Illumina MiSeq sequencing

A second analysis was performed on simulated MiSeq sequencing data. It was carried out as described for nanopore sequencing. Again, RDP and MTX were chosen for inter-algorithm comparisons of MetaG (see Table 4). The alignment modes one, two and three showed comparable performance for Parallel-META 3. However, alignment mode three was chosen for subsequent analysis, due to a slight performance gain (see Figure S21). As for the nanopore sample, QIIME 2 Blast and the QIIME 2 Classifier showed equally good performance for confidence cutoffs of 0 and 0.5 (see Figures S22-S24 and S27-S29). Thus, the stricter cutoff was chosen (see Table 4). Note that the RDP Classifier only made assignments up to and including the genus in the fixrank output file of the analyses using the 16S rRNA gene database and/or LSU11.

Most settings remained identical between in silico nanopore and MiSeq sequencing. Only the RDP Classifier 16S + LSU11 and the RDP Classifier 16S + WARCUP2 showed best performance at lower and higher confidence cutoffs, respectively, compared to nanopore sequencing (see Tables 3 and 4).

Table 4: Choice of the best settings for each classifier and the in silico MiSeq-sequenced BA and BFA sample, respectively. For each classifier, results from varying one specific setting (Variable setting) were analyzed to choose a subset (Chosen) for inter-algorithm comparison. This was often the best subset as identified by MCC. The Figure column indicates the MCC plot of the individual comparison.

| Sample | Program | Variable setting | Chosen | Figure |
|---|---|---|---|---|
| BA | MetaG | Database | MTX; RDP | S20 |
| BA | Parallel-META 3 | Alignment mode | 3 | S21 |
| BA | QIIME 2 Blast MTX | Confidence cutoff | 0.5 | S22 |
| BA | QIIME 2 Blast RDP | Confidence cutoff | 0.5 | S23 |
| BA | QIIME 2 Classifier MTX | Confidence cutoff | 0.5 | S24 |
| BA | RDP Classifier 16S | Confidence cutoff | 0.5 | S25 |
| BFA | MetaG | Database | MTX; RDP | S26 |
| BFA | QIIME 2 Blast MTX | Confidence cutoff | 0.5 | S27 |
| BFA | QIIME 2 Blast RDP | Confidence cutoff | 0.5 | S28 |
| BFA | QIIME 2 Classifier MTX | Confidence cutoff | 0.5 | S29 |
| BFA | RDP Classifier 16S + LSU11 | Confidence cutoff | 0.5 | S30 |
| BFA | RDP Classifier 16S + UNITE | Confidence cutoff | 0.5 | S31 |
| BFA | RDP Classifier 16S + WARCUP2 | Confidence cutoff | 1 | S32 |

The classifiers were subsequently compared by using sensitivity, precision, specificity and MCC of the classifications. Only the chosen settings (see Table 4) were used for the inter-algorithm comparison on the BA and BFA sample, respectively. For the sake of visual clarity, only the MCC plots are shown. Plots for the other metrics can be found in the appendix (see Figures S33-S38).

The overall performance of the programs in the BA set using MiSeq sequencing was indicated by the MCC. MetaG, QIIME 2 Blast and Parallel-META 3 performed almost equally from domain to class (see Figure 10). Consistent with BA and nanopore sequencing, Parallel-META 3 dominated family identifications, whereas MetaG MTX performed exceptionally well at genus and species level (see Figures 8 and 10). The MCC of Parallel-META 3 at genus level increased and the MCC of MetaG RDP decreased relative to nanopore sequencing of BA. Thus, MetaG was not always better at classifying the genus, unlike in nanopore sequencing of the BA sample (see Figures 8 and 10). The advantage of RDP over MTX in QIIME 2 Blast, as observed for BA nanopore sequencing, was dampened here over most ranks. The MCC of QIIME 2 Blast RDP remained the same in both samples. The QIIME 2 Classifier with the MTX database recovered substantially with MiSeq sequencing (see Figures 8 and 10). Overall, the only negative MCC values were reached at species level. MetaG MTX never showed negative values (see Figure 10). The RDP Classifier showed significantly better performance for the BA sample and nanopore sequencing at domain and phylum level. At family and genus level, it improved using MiSeq data (see Figures 8 and 10). At species level, however, its MCC was undefined (see Figure 10). Overall, the wave-like pattern was visible (see Figure 10). It was supported by the precision and specificity patterns of most algorithms (see Figures S34 and S35).

Figure 10: MCC from domain to species for classifiers using their chosen settings to analyze the BA sample. The sample was subject to simulated MiSeq sequencing. 16S is the 16S rRNA training set 16 of the RDP Classifier. The MCC for the RDP Classifier was undefined at species level.

The analysis conducted for the BA sample was also applied to the BFA sample. When the MCC of MiSeq sequencing of BFA was compared to the MCC of nanopore sequencing of BFA, several trends emerged. In the MiSeq sample compared to the nanopore sample, the MCC of MetaG using MTX increased at genus and species level (see Figures 9 and 11). However, the MCC of MetaG decreased when using RDP at these ranks (see Figures 9 and 11). The performance of QIIME 2 Blast RDP increased over most ranks in the MiSeq set (see Figures 9 and 11). This increase also held true for all ranks analyzed by the QIIME 2 Classifier. It changed from purely negative MCCs in the nanopore BFA sample to only one negative MCC at the species level of the MiSeq sample (see Figures 9 and 11). Besides, the performance of the RDP Classifier using LSU11 and WARCUP2, respectively, was better for most ranks compared to these workflows in the nanopore sample. This trend was inverse for the UNITE database. Still, LSU11, WARCUP2 and UNITE were substantially better at genus level predictions in the MiSeq sample (see Figures 9 and 11). However, the MCC for WARCUP2 in the MiSeq set was undefined at the species level. This had not been the case in the nanopore BFA sample. The MCC for species identifications of LSU11 was undefined in both sets (see Figures 9 and 11).

From the MCC of the BFA MiSeq sample, it became apparent that, overall, WARCUP2 was the best fungal database. LSU11 performed worse and UNITE showed the lowest overall MCC (see Figure 11). In the nanopore BFA sample, LSU11 had shown the worst performance of all fungal databases (see Figure 9).

In the nanopore BFA sample, MetaG MTX had been the algorithm of choice for identifications at most ranks. Here, QIIME 2 Blast MTX had significantly closed the performance gap (see Figures 9 and 11). However, MetaG MTX still showed peak performance at the genus and species level (see Figures 9 and 11). In both samples, MTX overall yielded higher MCCs than RDP. In MetaG, the performance gap between both databases was notably extended from order to species of the MiSeq BFA sample (see Figures 9 and 11).

Figure 11: MCC from domain to species for classifiers using their chosen settings to analyze the BFA sample. The sample was subject to simulated MiSeq sequencing. 16S is the 16S rRNA training set 16 of the RDP Classifier. The MCC for the RDP Classifier with the 16S database supplemented with LSU11 and WARCUP2, respectively, was undefined at species level.

The MCC of MetaG and QIIME 2 Blast was reduced for all ranks in the MiSeq BFA sample relative to the MiSeq BA sample (see Figures 10 and 11). Overall, the differences between RDP and MTX in both algorithms were most pronounced from domain to order in the MiSeq BFA compared to the MiSeq BA sample. From family to species, however, both algorithms showed the same general database patterns in both samples (see Figures 10 and 11). Overall, the QIIME 2 Classifier performed significantly worse in the BFA than in the BA sample sequenced with the MiSeq. This strong trend held true for all ranks, except genus and species (see Figures 10 and 11). MetaG MTX had also been the algorithm of choice for species and genus identifications in the MiSeq BA sample (see Figures 10 and 11). The general wave-like pattern was observed in MiSeq BA and BFA and nanopore BA and BFA (see Figures 8-11). Precision and specificity of the MiSeq BFA sample showed the wave-like pattern. However, the local maximum was mostly either at family or genus level (see Figures S37 and S38).

3.1.2. Analysis of a novel bacterium

In order to assess the performance of the programs on novel taxa, X. composti strain K13 was included in the simulations. It has only recently been identified as a genus novum and species novum (Kukolya et al. 2018). The classifications of its simulated reads were analyzed in the BA sample and simulated nanopore and MiSeq sequencing, respectively. All algorithms were considered only with the respective settings chosen for the program comparison on the two datasets (see also Tables 3 and 4). Only the most abundant identifications of each workflow at family, genus and species level, respectively, were considered for the following analysis (see also section 2.2.5.).
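The selection of the most abundant identification per rank can be illustrated with a minimal sketch. It is not the code used for the analysis; the function name `most_abundant_taxon`, the convention of storing unclassified reads as `None` and the example labels are assumptions made for illustration only.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def most_abundant_taxon(assignments: Iterable[Optional[str]]) -> Tuple[str, int]:
    """Return the most frequent label at one rank and its supporting read count.

    Reads without a label (None or an empty string) are counted as "NA",
    mirroring the NA entries in the tables of this section.
    """
    counts = Counter(label if label else "NA" for label in assignments)
    if not counts:
        raise ValueError("no assignments given")
    return counts.most_common(1)[0]


# Hypothetical genus-level labels for five reads (not data from the thesis).
example = ["Paenibacillus", "Paenibacillus", None, "Thermobacillus", "Paenibacillus"]
print(most_abundant_taxon(example))
```

In this sketch, ties are resolved in favor of the label that was encountered first, which is the behavior of `Counter.most_common`.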

For simulated nanopore sequencing, most algorithms identified the majority of reads as coming from the family Paenibacillaceae and the genus Paenibacillus (see Table 5). Algorithms using the RDP database classified the family as Paenibacillaceae 1 (see Table 5). Only the RDP Classifier and the QIIME 2 Classifier using the MTX database left most of X. composti's reads unclassified (see Table 5). The majority of reads was not assigned to a species. In contrast, MetaG in conjunction with RDP identified a majority of reads as the species Paenibacillaceae bacterium (see Table 5).

Table 5: Most abundant taxa identified for reads simulating nanopore R9 2D sequencing on X. composti strain K13. Reads have been classified using the programs’ chosen workflows for the inter-algorithm comparison of the BA sample. The programs’ settings were those chosen for the nanopore-sequenced dataset. Taxa and supporting numbers of reads are displayed at family, genus and species level, respectively. NA marks unclassified taxa or taxa without a name. The total number of analyzed reads was 400 of which 24 were simulated as negative controls.

| Program | Family | Genus | Species |
|---|---|---|---|
| Parallel-META 3 | Paenibacillaceae [362] | Paenibacillus [291] | NA [369] |
| QIIME 2 Blast MTX | Paenibacillaceae [337] | Paenibacillus [327] | NA [353] |
| QIIME 2 Blast RDP | Paenibacillaceae 1 [345] | Paenibacillus [292] | NA [295] |
| QIIME 2 Classifier MTX | NA [225] | NA [236] | NA [376] |
| MetaG MTX | Paenibacillaceae [366] | Paenibacillus [336] | NA [271] |
| MetaG RDP | Paenibacillaceae 1 [369] | Paenibacillus [358] | Paenibacillaceae bacterium [330] |
| RDP Classifier | NA [239] | NA [314] | NA [400] |

Table 6 shows the dominant classifications of reads from X. composti using simulated MiSeq sequencing. At first glance, identifications based on nanopore and MiSeq sequencing differed (see Tables 5 and 6). Although the QIIME 2 Classifier using MTX again did not identify the majority of reads, the number of unidentified reads increased (see Tables 5 and 6). This was also true for the genus level identifications of the RDP Classifier. Interestingly, the program identified the family as Paenibacillaceae 1 (see Table 6). Using nanopore sequencing, most reads had not been classified at the family level (see Table 5). For MetaG, the number of reads assigned to all taxa increased when using RDP. This trend did not hold true at all ranks when it used the MTX database. However, the number of unassigned reads at the species level increased substantially (see Tables 5 and 6).

Table 6: Most abundant taxa identified for reads simulating Illumina MiSeqv3 sequencing on X. composti strain K13. Reads have been classified using the programs’ chosen workflows for the inter-algorithm comparison of the BA sample. The programs’ settings were those chosen for the MiSeq-sequenced dataset. Taxa and supporting numbers of reads are displayed at family, genus and species level, respectively. NA marks unclassified taxa or taxa without a name. The total number of analyzed reads was 400 of which 24 were simulated as negative controls.

| Program | Family | Genus | Species |
|---|---|---|---|
| Parallel-META 3 | Paenibacillaceae [337] | Thermobacillus [337] | NA [368] |
| QIIME 2 Blast MTX | Bacillaceae [278] | NA [352] | NA [354] |
| QIIME 2 Blast RDP | Paenibacillaceae 1 [376] | NA [400] | uncultured bacterium [372] |
| QIIME 2 Classifier MTX | NA [264] | NA [400] | NA [400] |
| MetaG MTX | Paenibacillaceae [373] | Paenibacillus [334] | NA [377] |
| MetaG RDP | Paenibacillaceae 1 [376] | Paenibacillus [376] | Paenibacillaceae bacterium [376] |
| RDP Classifier | Paenibacillaceae 1 [245] | NA [399] | NA [400] |

Reads classified by QIIME 2 Blast and the MTX and RDP database, respectively, lost resolution on the genus level. Besides, Bacillaceae was the most frequent family for the MTX database and MiSeq sequencing as opposed to Paenibacillaceae for nanopore sequencing (see Tables 5 and 6). The analysis using the RDP database, however, classified most reads at the species level, although this had not been possible in nanopore sequencing. Besides, Parallel-META 3 switched classifications at the genus level (see Tables 5 and 6).

3.1.3. Reanalysis of a nanopore mock sample

After analysis of simulated sequencing data, MetaG, Parallel-META 3 and QIIME 2 Blast using the RDP database were used on a sample with known abundance of taxa. The sample had been sequenced in vitro. I had access to the study authors' classifications of the sequencing results (Cuscó et al. 2018) and to the expected abundances of taxa within the sample, as given by the manufacturer20 (see Table 2).

The classification of reads by QIIME 2 Blast using the RDP database was not successful, as the program showed an abnormal run time. The analysis was stopped after over 1000 hours (ca. six weeks). Thus, QIIME 2 Blast RDP was not considered in the following comparison.

Figure 12: Relative abundances of the eight bacteria in the ZymoBIOMICS Microbial Community DNA Standard D6306 as sequenced by the study of Cuscó and colleagues (Cuscó et al. 2018). The abundances were defined by 16S rRNA gene abundance. The values expected by the manufacturer20 (Expected) are compared to the classifications of the study (Ref. study). Besides, classifications by MetaG using the MTX database and Parallel-META 3 using its native database are shown. Organisms that were identified, but not expected by the manufacturer are represented by OTHER. The abundances of taxa, except for Expected, are scaled to the individual total number of assigned reads. To the best of my knowledge, the study authors also only considered assigned reads (Cuscó et al. 2018).

From Figure 12, it became apparent that the 16S rRNA gene classifications of MetaG using the MTX database best matched the expected abundances given by the manufacturer. Only in the case of L. fermentum, the analysis conducted by the study authors was closer to the expectation. In any other case, the authors underestimated the abundances of the expected taxa and overestimated the abundances of unexpected taxa to a greater extent than MetaG (see Figure 12). Parallel-META 3 performed worse than both MetaG and the study. However, in the case of S. aureus, Parallel-META 3 outperformed the classifications by the study, but not by MetaG (see Figure 12).
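The scaling of abundances to the total number of assigned reads, as used for Figure 12, can be sketched as follows. This is an illustration under assumptions, not the script behind the figure: the function name `scale_to_assigned`, the label `"NA"` for unassigned reads and the read counts are hypothetical.

```python
from typing import Dict


def scale_to_assigned(read_counts: Dict[str, int],
                      unassigned_label: str = "NA") -> Dict[str, float]:
    """Convert read counts per taxon into percentages of all assigned reads.

    Reads stored under `unassigned_label` are excluded from the total,
    so the returned percentages sum to 100.
    """
    assigned = {taxon: n for taxon, n in read_counts.items()
                if taxon != unassigned_label}
    total = sum(assigned.values())
    if total == 0:
        raise ValueError("no assigned reads")
    return {taxon: 100.0 * n / total for taxon, n in assigned.items()}


# Hypothetical read counts (not data from the thesis).
counts = {"L. fermentum": 30, "S. aureus": 50, "OTHER": 20, "NA": 100}
print(scale_to_assigned(counts))
```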

3.2. Performance of the polymorphism detection in NanoPipe

3.2.1. The effect of sequencing error on the SNP analysis

Artificial SNPs were also expected to arise from sequencing error. In order to test NanoPipe for this error source and to ultimately lessen its impact, in silico nanopore sequencing was performed on two samples using NanoSim-H. One sample showed no error (perfect sample) and served as the control. The other was simulated using R9 2D sequencing (see also section 2.4). A total of 12,838,863 errors were introduced into 100,000 reads which were simulated with the error profile ecoli_R9_2D. Errors were considered regardless of error type, following the logic that any error might have introduced artificial SNPs. Since the coverage was expected to be 21, errors were not expected to occur only at unique positions. Setting the total number of introduced errors in relation to the expected coverage showed that ca. 611,374 errors could technically have had a coverage of 21.
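The estimate of ca. 611,374 follows from dividing the total number of introduced errors by the expected coverage. A short sketch reproduces the arithmetic; the variable and function names are chosen for illustration only.

```python
# Figures taken from the paragraph above.
total_errors = 12_838_863
expected_coverage = 21


def errors_at_full_coverage(errors: int, coverage: int) -> float:
    """Upper bound on the number of positions at which every covering read
    could carry an error, assuming the errors were stacked perfectly."""
    return errors / coverage


positions = errors_at_full_coverage(total_errors, expected_coverage)
print(f"{positions:,.0f}")
```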

In the following analyses, the E. coli target of NanoPipe was used as query and target. The SNP analysis of reads simulated with no error profile resulted in no SNP detection. However, the analysis of the reads with the applied error profile (R9 2D reads) yielded 568 spurious biallelic SNPs. In order to ultimately prevent spurious SNPs, in-depth analyses for common properties of the artifacts were performed. First, the alignment quality in the surroundings of the SNPs was analyzed. The average error, as provided by NanoPipe, was ca. 0.0024 (highest: 0.0500). The average nucleotide count (coverage) at the spurious positions was ca. 21 (highest: 34). The relative abundance of the SNP nucleotide was ca. 22 % on average, with a maximum of ca. 46 %. The vast majority of SNPs resulted from transitions, rather than transversions, of the target nucleotide (481 vs. 87). This led to a TI/TV of ca. 5.53.
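For illustration, the TI/TV can be computed from a list of substitutions as sketched below. The code is not taken from NanoPipe; the function names are assumptions, and the example list merely encodes the 481 transitions and 87 transversions reported above.

```python
from typing import List, Optional, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """A transition exchanges a purine for a purine or a pyrimidine for a
    pyrimidine; every other substitution is a transversion."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("reference and alternative nucleotide are identical")
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv(substitutions: List[Tuple[str, str]]) -> Optional[float]:
    """Ratio of transitions to transversions; None if there is no transversion."""
    ti = sum(1 for ref, alt in substitutions if is_transition(ref, alt))
    tv = len(substitutions) - ti
    return ti / tv if tv else None


# 481 transitions (A>G) and 87 transversions (A>C) stand in for the 568 SNPs.
snps = [("A", "G")] * 481 + [("A", "C")] * 87
print(round(ti_tv(snps), 2))
```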

3.2.2. Recovery of real SNPs by nanopore sequencing and NanoPipe analysis

Subsequently, 100 SNPs were simulated at random positions on the E. coli target genome of NanoPipe. All simulated SNPs were biallelic: they could either show the target nucleotide or a random non-target nucleotide. The non-target nucleotide had a frequency of 33 % for one half of the SNPs and 67 % for the other half. Simulated nanopore sequencing was performed on the sample in two different ways. First, the sample was simulated without any sequencing errors (perfect sample). Second, the pre-computed error profile of E. coli sequenced with flowcell chemistry R9 using 2D reads was applied. Both sequenced samples were analyzed in NanoPipe using the default polymorphism abundance threshold: the non-target allele at a position had to have an abundance of not less than 20 % (Shabardina et al. 2019). In subsequent analyses, this threshold was further increased based on the findings presented in the previous section. This was done in order to obtain more accurate results. The performances of the SNP analyses were evaluated by using sensitivity, precision, specificity and MCC.
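The abundance threshold can be pictured with a minimal sketch of a cutoff-based call at a single position. It only illustrates the rule stated above (non-target allele abundance of not less than the cutoff) and is not NanoPipe's implementation; the function `call_biallelic_snp` and the nucleotide counts are hypothetical.

```python
from typing import Dict, Optional, Tuple


def call_biallelic_snp(counts: Dict[str, int], target: str,
                       cutoff: float = 0.20) -> Optional[Tuple[str, float]]:
    """Report the most abundant non-target nucleotide at one position if its
    relative abundance is not less than `cutoff`; otherwise return None."""
    total = sum(counts.values())
    non_target = {base: n for base, n in counts.items()
                  if base != target and n > 0}
    if total == 0 or not non_target:
        return None
    alt, alt_count = max(non_target.items(), key=lambda item: item[1])
    frequency = alt_count / total
    return (alt, frequency) if frequency >= cutoff else None


# Hypothetical position with a coverage of 21: 14 reads show the target
# nucleotide A and 7 reads show G (abundance of one third).
position = {"A": 14, "G": 7}
print(call_biallelic_snp(position, target="A", cutoff=0.20))
print(call_biallelic_snp(position, target="A", cutoff=0.35))
```

With these counts, the position is reported at the default cutoff of 20 % but not at 35 %.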

The most obvious difference between the perfect and R9 2D sample was the number of FPs. These were SNPs that were identified at positions not expected from the set of the 100 simulated SNPs. At default and increased thresholds, the perfect dataset did not show any FPs. Using the default SNP cutoff, the R9 2D sample, however, had over 1,700 FPs (see Figure 13, 20 %). That was roughly threefold compared to the analysis without simulated SNPs (see previous section). Increasing the threshold by two percentage points led to an artifact count comparable to the one without simulated SNPs (see Figure 13, 22 % and previous section). By increasing the cutoff, FPs were further depleted (see Figure 13).

Figure 13: FP SNP identifications by NanoPipe on nanopore reads simulated without sequencing error (perfect) or with a pre-computed error profile for E. coli sequencing using flowcell chemistry R9 and 2D reads (R9 2D). The cutoff is the minimum abundance of a non-target nucleotide at a given position, so that a SNP is called. Perfect reads show no FPs.

The total ratio of transitions to transversions for sample R9 2D decreased with stricter cutoffs. Besides, the TI/TV for R9 2D at a cutoff of 20 % was roughly sixfold that of the perfect sample (see Figure 14). The decrease of the TI/TV in the R9 2D sample resembled the general trend observed for the FPs (see Figures 13 and 14). However, the TI/TV remained almost constant when the cutoff was changed from 20 to 22 % (see Figure 14). This was not expected from the FP pattern (see Figure 13). The ratio for the perfect sample remained almost constant over all cutoffs (see Figure 14).

Figure 14: Total TI/TV for SNP identifications by NanoPipe on nanopore reads simulated without sequencing error (perfect) or with a pre-computed error profile for E. coli sequencing using flowcell chemistry R9 and 2D reads (R9 2D). The cutoff is the minimum abundance of a non-target nucleotide at a given position, so that a SNP is called.

TPs, i.e. identified SNP positions that were expected from the simulations, showed the same general trend in both samples. By increasing the cutoff, TPs were lost (see Figure 15). Although the perfect and R9 2D samples showed comparable TP counts at 20 % and 35 %, the TP calls in R9 2D were depleted more evenly over the thresholds (see Figure 15). Neither the perfect sample nor R9 2D recovered all expected SNPs (TP below 100; see Figure 15).

Figure 15: TP SNP identifications by NanoPipe on nanopore reads simulated without sequencing error (perfect) or with a pre-computed error profile for E. coli sequencing using flowcell chemistry R9 and 2D reads (R9 2D). The cutoff is the minimum abundance of a non-target nucleotide at a given position, so that a SNP is called.

It is worth noting that at a cutoff of 35 %, simulated SNPs with an allelic abundance of 33 % should have been completely removed. However, 9 and 13 of these SNPs remained in the R9 2D and perfect sample, respectively (data not shown).

With stricter cutoffs and a loss of TPs (see also Figure 15), the number of FNs increased in both samples (data not shown). This was the inverse of the pattern identified for the TPs (see above). The number of TNs was very large compared to the number of FPs. Thus, the TN graphs for both samples hardly showed any trend (data not shown). Examination of the raw data revealed the following patterns: there were no FPs in the perfect sample, thus the number of TNs was always at the maximum. In the R9 2D sample, the number of TNs increased with stricter cutoffs and depletion of FPs (see also Figure 13).

The p-errors for the SNPs reported by NanoPipe in the perfect and R9 2D sample at a cutoff of 20 % were low. The average p-error was ca. 0.0004 (maximum: 0.0035) and roughly 0.0020 (maximum: 0.0748) for the perfect and R9 2D sample, respectively. The low p-errors in both samples were congruent with the observation on reads without simulated SNPs, but with sequencing error (see previous section).

The high recovery of expected SNPs by NanoPipe at a cutoff of 20 % also implied a high sensitivity of 90 % and 89 % in the perfect and R9 2D sample, respectively. The sensitivity was the ratio of recovered expected SNPs to all expected SNPs. It showed the same trend (data not shown) as shown for the TPs in Figure 15. The specificity was the ratio of TNs to the total number of TNs and FPs. Reflecting the trend in FPs (see Figure 13), the specificity increased with stricter cutoffs for sample R9 2D. However, the total change was minute, because the specificity was already ca. 99.96 % at a cutoff of 20 %. The perfect sample did not show any FP, so the specificity was always one (data not shown). For the same reason, the precision in the perfect sample was always one (see also Figure 16). The precision was the ratio of identified expected SNPs to all identified SNPs. Due to the initially large number of FPs in R9 2D, the precision was ca. 5 % at a cutoff of 20 % (see Figures 13 and 16). With increasing cutoff and decreasing number of FPs (see Figure 13), the precision increased up to ca. 87 % at a cutoff of 35 % (see Figure 16).
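The three ratios defined above can be written down as a short sketch. The function names are chosen for illustration. Of the counts, only the 89 recovered out of 100 simulated SNPs follow from the text; the FP count of 1,700 is a rounded stand-in for the "over 1,700" FPs reported at a cutoff of 20 % and serves as an assumption.

```python
from typing import Optional


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None if the ratio is undefined."""
    return numerator / denominator if denominator else None


def sensitivity(tp: int, fn: int) -> Optional[float]:
    """Recovered expected SNPs relative to all expected SNPs."""
    return _ratio(tp, tp + fn)


def precision(tp: int, fp: int) -> Optional[float]:
    """Identified expected SNPs relative to all identified SNPs."""
    return _ratio(tp, tp + fp)


def specificity(tn: int, fp: int) -> Optional[float]:
    """TNs relative to the total number of TNs and FPs."""
    return _ratio(tn, tn + fp)


# R9 2D sample at a cutoff of 20 %: 89 of 100 simulated SNPs recovered.
print(sensitivity(tp=89, fn=11))
# Illustrative FP count of 1,700 (assumption, see lead-in).
print(round(precision(tp=89, fp=1700), 3))
# Without any FP, precision equals one, as in the perfect sample.
print(precision(tp=90, fp=0))
```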

Figure 16: Precision of SNP identifications by NanoPipe on nanopore reads simulated without sequencing error (perfect) or with a pre-computed error profile for E. coli sequencing using flowcell chemistry R9 and 2D reads (R9 2D). The cutoff is the minimum abundance of a non-target nucleotide at a given position, so that a SNP is called.

The overall performance of the SNP calling in NanoPipe in conjunction with nanopore sequencing was given by the MCC, which is depicted in Figure 17. The MCC of the perfect sample was always higher than that of the R9 2D sample: at a cutoff of 20 %, the MCC of the perfect sample was nearly fivefold that of R9 2D (see Figure 17). The difference decreased with an increasing cutoff to about 0.05 at a cutoff of 35 % (see Figure 17). On the one hand, this was due to the increase of the R9 2D sample’s MCC up to the 30 % threshold. On the other hand, the MCC of the perfect sample decreased with stricter cutoffs (see Figure 17).

Figure 17: MCC of SNP identifications by NanoPipe on nanopore reads simulated without sequencing error (perfect) or with a pre-computed error profile for E. coli sequencing using flowcell chemistry R9 and 2D reads (R9 2D). The cutoff is the minimum abundance of a non-target nucleotide at a given position, so that a SNP is called.

A cutoff of 25 % for the non-target allele was chosen as the most promising candidate for further analyses. This was due to its high reduction of FPs in combination with a relatively low loss of TPs (see Figures 13 and 15). The cutoff was tested for its ability to detect the SNPs in the P. falciparum and cell line H1975 study cases by Shabardina and colleagues (Shabardina et al. 2019).

All SNP identifications of P. falciparum aligned against Pf3D7_07_v3 (version: 2013-03-01) (Shabardina et al. 2019) could be recovered at the increased threshold (data not shown). For the study case of the human cell line H1975, I considered only chromosome seven, because its SNPs had been exemplified in the NanoPipe publication (Shabardina et al. 2019). On this chromosome, however, the SNPs at the positions 55173958 and 55192839 were lost (data not shown). Only the latter was also found in dbSNP as rs376176117 (Shabardina et al. 2019).

4. Discussion

4.1. Metaprofiling analysis

4.1.1. In silico performance evaluation

To analyze the performance of MetaG and other established metagenomic classifiers, I simulated in silico nanopore and MiSeq sequencing. This was performed on a sample containing 16S rRNA gene sequences from bacteria and archaea. Another sample was complemented with 28S rRNA gene sequences from fungi. Four statistical measures were collected for each sample and sequencing approach, respectively. See also section 2.2.1. for further information on these measures.

The MCC has been proposed as an overall performance indicator (Matthews 1975). A case in point for this, in my analysis, is the RDP Classifier with the 16S rRNA training set 16. It did not provide species resolution for the BA samples. Thus, its sensitivity was zero (see Figures S14 and S33). However, because it did not classify any sequence to the species level, it could not make any FP calls at this rank. Therefore, its specificity for species identifications was maximal (see Figures S16 and S35). Due to the lack of any species identifications, its precision was undefined (see Figures S15 and S34). Accordingly, its MCC was undefined for species identifications (see Figures 8 and 10). Thus, the MCC provided a brief summary of the RDP Classifier’s performance. Note that the RDP Classifier only made assignments for the species in the workflows using the UNITE and WARCUP2 databases. The MCC mostly reflected the wave-like pattern of precision and specificity (see Figures S15; S16; S18; S19; S34; S35; S37; S38 and 8-11). Thus, the MCC was a valuable summary statistic for the algorithm comparison. Accordingly, the focus of this section will be the MCC comparisons.
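The case of the RDP Classifier at species level can be reproduced with a minimal sketch of the MCC (Matthews 1975). It returns `None` whenever the denominator is zero, which is how "undefined" is represented here; this convention, the function name and the read counts in the example are assumptions for illustration and not the evaluation code of this work.

```python
import math
from typing import Optional


def mcc(tp: int, fp: int, tn: int, fn: int) -> Optional[float]:
    """Matthews correlation coefficient from a 2x2 confusion matrix.

    Returns None if any marginal sum is zero, because the coefficient
    is then undefined (division by zero).
    """
    denominator = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    if denominator == 0:
        return None
    return (tp * tn - fp * fn) / math.sqrt(denominator)


# Hypothetical classifier that never assigns a species: no TP and no FP.
# Sensitivity is zero, specificity is maximal, precision and MCC are undefined.
print(mcc(tp=0, fp=0, tn=24, fn=376))

# Hypothetical perfect and fully inverted classifications for comparison.
print(mcc(tp=50, fp=0, tn=50, fn=0))
print(mcc(tp=0, fp=50, tn=0, fn=50))
```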

When comparing the MCC of the nanopore and MiSeq samples, some common trends were apparent. First, the MCC of the BFA identifications was reduced compared to the BA identifications (see Figures 8-11). This was because of an overall reduction in sensitivity and reductions of precision and specificity for some programs (see Figures S14-S19; S33-S38). The trends indicated a lowered rate of TPs compared to FNs and FPs and a lowered rate of TNs compared to FPs (see also section 2.2.1.). This implied a lowered rate of correct identifications for reads simulated as positives (alignable) and for reads simulated as negatives (unalignable or shuffled). Second, the MTX database was preferable over RDP in most cases of BFA (see Figures 9 and 11) which was mainly caused by preferable precision and specificity of the former database (see Figures S18; S19; S37 and S38). The increased precision and specificity indicated a reduction of FP calls relative to TPs and TNs, respectively.

Generally, all four analyses showed a drop in MCC from domain to species. Most algorithms then showed a local maximum at the family level. Subsequently, the MCC declined towards the species level (see Figures 8-11). Taking MetaG as an example, this trend was mainly caused by a loss of specificity and precision in the MiSeq and nanopore samples (see Figures S15; S16; S18; S19; S34; S35; S37 and S38). This general pattern, which I referred to as wave-like, was most likely due to naming inconsistencies between the databases used for the analysis and the expected taxonomy given by the NCBI database. In a recent study, the authors focused on phylum and genus predictions to avoid naming inconsistencies (Lindgreen, Adair, and Gardner 2016). A first impression of the possible extent of such an error can be inferred from the comparison of PATRIC to MTX and RDP, respectively (see section 2.1.7.2.).

I expect that this pattern would not have held true if each algorithm had had its own expected taxonomy. The individual taxonomy would have needed to be consistent with the individual algorithm’s database nomenclature. However, this was too time-consuming to implement in this comparison. Thus, the NCBI taxonomy was used in the hope of providing equal chances to the algorithms.

The general decline in MCC from domain to species (see Figures 8-11), however, was real. For one part, sequencing errors might have obscured differences in the 16S rRNA gene sequences that would have enabled the programs to assign the correct species. For the other part, it has been argued that metaprofiling of the 16S rRNA gene can yield ambiguous classifications. It was demonstrated that partial 16S rRNA gene copies within one genome could differ. However, species of a genus could also have identical partial sequences (Větrovský and Baldrian 2013). As outlined in section 1.1.3., these results were expected to vary with the (length of) the variable region.

Another aspect was the incompleteness of databases. A significant portion of the bacteria in nature was not found in the databases (Louca et al. 2019) (see also next section). If the entry for the exact organism was missing, it could not be assigned. However, this was most relevant at low ranks, e.g. the strain, and for certain groups: for example, the database completeness was worse for archaea (Louca et al. 2019). Because of the high loss of MCC at the species level (see Figures 8-11), assessing the MCC for the strains was not attempted. Preliminary experiments on MetaG showed rather arbitrary assignments (data not shown). Seeing that MetaG was the best algorithm at species classifications (see Figures 8-11), it was reasonable to assume that this trend was also true for the competing programs.

The loss of classification power in the BFA compared to the BA sample can also be explained by incomplete databases. I assessed the number of fungi in both RDP and MTX. As MTX performed better classifications in BFA than RDP (see Figures 9 and 11), I hypothesized that its sampling of fungi was denser. This was not the case, as indicated by the number of fungal sequences. These were 125,572 in RDP and 4,938 in MTX. Another aspect might have been that uncultured organisms had been removed from MTX and the database had been extensively curated (Bengtsson-Palme et al. 2015). Since matches to taxa without a scientific name would, in my case, always have been FPs, having a higher proportion of well characterized sequences could have improved the classification results. Indeed, MTX included no fungal species which matched unclassified or uncultured. RDP matched 19,634 and 22,586 fungi for both groups, respectively.

Regarding the fungal databases, LSU11 was the worst database by MCC in the nanopore BFA sample (see Figure 9). This was mainly because it showed an inferior sensitivity (see Figure S17). The best fungal database was WARCUP2 which also performed best in MiSeq BFA (see Figures 9 and 11). In the latter set, it outperformed the other two databases by precision and specificity (see Figures S37 and S38).

The type of algorithm itself also influenced the analysis. In most cases, the alignment-based algorithms, i.e. MetaG, QIIME 2 Blast and Parallel-META 3, performed better than the kmer-based approaches, i.e. the QIIME 2 Classifier and the RDP Classifier (see Figures 8-11). It has been argued previously that the assumptions of kmer-based approaches were problematic (X. Gao et al. 2017): first, algorithms assume independent kmers. Besides, kmer order is assumed not to matter (X. Gao et al. 2017). However, sequence order matters. In biology, several kmers may also be interdependent, e.g. because of a common function (X. Gao et al. 2017). Adding to these issues, the length of the kmers is also important for a correct classification (X. Gao et al. 2017). These challenges might have been reflected by the analysis results.

In the nanopore samples, the QIIME 2 Classifier was the overall worst algorithm. The RDP Classifier performed better (see Figures 8 and 9). This introduces the next aspect influencing the classifications: the sequencing technology. Switching from nanopore to MiSeq sequencing improved the MCC of the QIIME 2 Classifier substantially (see Figures 8-11). In the nanopore samples, the QIIME 2 Classifier showed a low sensitivity at the specific ranks and a low specificity at broad ranks (see Figures S14; S16; S17 and S19). Its precision also contributed to a bad ranking by MCC (see Figures S15 and S18). When switching to MiSeq, both sensitivity and precision improved substantially (see Figures S14; S15; S17; S18; S33; S34; S36 and S37). This indicated that the program was better at correctly identifying positive reads using MiSeq data: there were fewer FPs and FNs relative to TPs. The classifications of shuffled reads were not major drivers of the improvement (see specificity; Figures S16; S19; S35 and S38).

Another prominent example of the improved power with the short-read technology was QIIME 2 Blast MTX (see Figures 8-11). Like Parallel-META 3 and QIIME 2 Blast RDP, it was mostly a good alternative to MetaG in the MiSeq BA sample (see Figure 10). Importantly, QIIME 2 Blast MTX also performed similarly to MetaG at most ranks in the MiSeq BFA sample (see Figure 11). In the nanopore BFA sample, however, MetaG MTX had shown the best performance, mainly due to superior sensitivity (see Figures 9 and S17). These aspects indicated that not all algorithms could handle the tradeoff between longer read length and increased error rate for nanopore sequencing. MetaG's performance could also be attributed to the training of alignment parameters for both technologies using LAST-TRAIN (see also section 2.1.1.). Interestingly, the MCC of MetaG MTX at genus and species level profited from the change to MiSeq sequencing in the BFA sample (see Figures 9 and 11). This was mainly because FP calls were reduced (see precision and specificity; Figures S18; S19; S37 and S38). In the BA sample analyzed with MetaG MTX, MiSeq sequencing was mainly preferable over nanopore at the genus level (see Figures 8 and 10). The MCC of Parallel-META 3 increased for all classifications, except the species, when using MiSeq instead of nanopore data. This was most pronounced at the family and genus level (see Figures 8 and 10). Reducing error thus had a positive effect on the classifications of more specific taxa. This pattern was striking, as it indicated that the loss of information by read length (see also section 1.1.3.) could be smaller than the loss of information by sequencing error.

However, the switch to MiSeq was not beneficial at every rank. In the BA sample, for example, the RDP Classifier performed better with nanopore sequencing, but only at broader ranks (see Figures 8 and 10). The performance gain was mainly caused by the specificity (see Figures S16 and S35). MiSeq sequencing provided better performance at the more specific ranks (see Figures 8 and 10). All in all, the database, the type of the analysis algorithm, the sequencing technology, the analyzed rank and the organisms in the sample were of major importance for the outcomes of the analyses.


4.1.2. Analysis of a novel bacterium

Evidence suggests that the numbers of bacteria in the metaprofiling databases do not reflect the numbers of bacteria in nature: clustering 16S rRNA gene sequences at 97 % similarity, OTUs in SILVA’s non-redundant release 132 and RDP’s release 11 represented 29 and 42 % of the total estimated bacterial OTUs, respectively (Louca et al. 2019). Naturally, researchers will also wish to analyze samples containing novel bacteria. Therefore, I chose to assess the prediction power of MetaG and its competitors in a realistic scenario. The programs’ task was to analyze X. composti K13. The strain had given rise to a completely new genus and species within the Paenibacillaceae (Kukolya et al. 2018). It had already been isolated in 2008, but was only classified in 2018 (Kukolya et al. 2018). The strain’s genome comprised four 16S rRNA gene copies. The copies were most similar to Paenibacillus nanensis MX2-3 (93.8 %) (Kukolya et al. 2018). The novel bacterium was classified by nucleotide and culture-dependent analysis (Kukolya et al. 2018). Analyses of novel organisms have been previously simulated, for example by leave-one-out experiments. These are manipulations of the database (e.g.: Q. Wang et al. 2007). Here, I performed the analysis using a real novel bacterium.

Due to the recent classification, neither genus, nor species and strain were present in the databases of the analyzed algorithms (data not shown). Therefore, the most accurate classification would have been to assign the reads as belonging to the Paenibacillaceae without assigning genus and species. I assessed the classifications of reads from X. composti K13 during the program comparisons on the BA samples. This was done for in silico nanopore and MiSeq sequencing, respectively. For the sake of clarity, only the most abundant taxon at family, genus and species level, respectively, was analyzed.

In the nanopore sample, most programs classified the reads as coming from the correct family. However, the programs then also assigned the genus Paenibacillus (see Table 5). The results are in line with those found by Kukolya and coworkers (Kukolya et al. 2018). They claimed that the found “similarity values [were] much lower than the suggested 95 % threshold for the delineation of a new genus” (Kukolya et al. 2018). However, the most similar strain from the genus Paenibacillus displayed a 16S rRNA gene similarity of ca. 94 % (Kukolya et al. 2018). As the authors used Sanger sequencing (Kukolya et al. 2018) (minimal error rate ca. 0.001 % (reviewed in: Shendure and Ji 2008)), it was reasonable that more error-prone sequencing technologies (Dohm et al. 2008; Jain et al. 2018) would have obscured this difference. However, the QIIME 2 Classifier and the RDP Classifier did not assign any family or lower rank to most reads (see Table 5). This was not desired, as the family could have been identified given the databases. For the former workflow, this can be seen by looking at the other algorithms using MTX in Table 5. The 16S rRNA training set 16 of the RDP Classifier also contained the correct family, but not the correct genus (data not shown). The fasta file of the training set was available online29.

In the nanopore sample, only MetaG using RDP also assigned “Paenibacillaceae bacterium” as a species (see Table 5). This reflects the different nature of RDP and MTX. The latter was extensively curated, also to remove uncultured bacteria (Bengtsson-Palme et al. 2015).

29 https://datapacket.dl.sourceforge.net/project/rdp-classifier/RDP_Classifier_TrainingData/RDPClassifier_16S_trainsetNo16_rawtrainingdata.zip; accessed 13.05.2019.

There was a trend towards more unclassified reads in the MiSeq BA sample relative to the nanopore sample (see Tables 5 and 6). Besides, the RDP Classifier now identified the correct family (see also Tables 5 and 6). It made a desirable classification, as no genus or species was assigned (see also Table 6). The QIIME 2 Blast analyses lost their erroneous genus predictions (see also Tables 5 and 6). This could indicate a better adaptation of the algorithms to the peculiarities of the MiSeq technology. Besides, the error rate in MiSeq sequencing is lower than in nanopore sequencing (Dohm et al. 2008; Jain et al. 2018), which enables stricter alignments. However, this trend for improved classifications on MiSeq data was not universal: QIIME 2 Blast MTX now assigned the wrong family and Parallel-META 3 classified reads as coming from another wrong genus (see also Tables 5 and 6). Interestingly, QIIME 2 Blast RDP identified an uncultured bacterium within the Paenibacillaceae which had not been placed into a genus (see Table 6). Technically, it was possible that the novel bacterium had already been sequenced and placed into the database, albeit with a wrong classification. Due to time constraints and the number of uncultured bacteria without a genus in the Paenibacillaceae, this possibility was not further investigated.

MetaG made the same identifications, compared to nanopore sequencing. However, the number of reads assigned to the most abundant taxon varied (see Tables 5 and 6). The different results of MetaG and the QIIME 2 workflows using the same databases (see Tables 5 and 6) illustrated the dependence of the results on the analysis algorithm.

Overall the results indicated that some algorithms were better adapted to the error rate of MiSeq sequencing. This general pattern had also been observed in the in silico analyses (see previous section). However, the performance on the whole simulated data could not directly predict the performance in classifying the novel bacterium: Parallel-META 3's overall identifications of family and genus, for example, profited from the switch of nanopore BA to MiSeq BA (see Figures 8 and 10). Still, it assigned an even less likely genus to the novel bacterium in the MiSeq BA sample (see Table 6). Besides, the QIIME 2 Classifier improved majorly when exchanging nanopore for MiSeq reads (see Figures 8 and 10). Still, the classifier did not assign the majority of reads produced by both technologies in the analysis of the novel taxon (see Tables 5 and 6). However, the RDP Classifier improved at the family level in the MiSeq BA relative to the nanopore BA sample (see Figures 8 and 10). Likewise, it only identified the correct family of the novel bacterium in the MiSeq sample (see Tables 5 and 6).

As hypothesized in the previous section, partial worsening of MiSeq results could have been caused by the shorter read length compared to nanopore sequencing. Thus, (parts of) variable regions instead of the whole 16S rRNA gene needed to be analyzed. Algorithms appeared to have coped differently with the tradeoff between read length and error rate, but most assigned too specific or too unspecific taxa.


4.1.3. Reanalysis of a nanopore mock sample

The nanopore sequencing of the ZymoBIOMICS Microbial Community DNA Standard D6306 by Cuscó and coworkers (Cuscó et al. 2018) offered the possibility to test MetaG on a realistic sample. This analysis was similar to the in silico analysis, because there was a known abundance of taxa. However, there were several differences: first, the origin of a particular read was unknown in the mock sample. Only the total abundance of taxa in the sample could be rated. Second, the DNA had to be extracted from the cells and was subsequently amplified. This was expected to introduce realistic artifacts, e.g. in the form of contamination and amplification artifacts (see also sections 1.1.3. and 1.1.4.). Third, the DNA had to be sequenced. This was also simulated in silico. However, a simulation algorithm can only approximate the in vitro process. Thus, having data that was generated (and not simulated) was very important at the end of algorithm development. After all, the above biases will later also influence data of users.

MetaG was subject to three benchmarks. The abundance of called taxa was compared to the manufacturer’s expectations20 and the study results (Cuscó et al. 2018). Also, I aimed to compare the results of Parallel-META 3 and QIIME 2 Blast RDP to the output of MetaG. I chose Parallel-META 3 and QIIME 2 Blast due to their performance in the nanopore BA comparison. The specific workflow of each algorithm was chosen according to its performance at the species level in the nanopore BA sample. However, QIIME 2 Blast RDP ran for ca. six weeks (CPU time; 1 thread) without generating results. Therefore, I aborted the analysis. The query reads had been clustered as 99 % OTUs. In hindsight, clustering at 97 % could have been attempted to reduce the input data. Besides, RDP was much bigger than MTX. Still, the observed runtime at 99 % OTUs was inconveniently long. This was especially true given the time Parallel-META 3 and MetaG MTX needed to classify the sample: ca. 12 h (4 threads) and less than 19 h (4 threads), respectively. All timings were CPU times.

The analysis revealed that classifications by MetaG MTX were closest to the manufacturer’s expectations. Only in the case of L. fermentum, the study’s results were more similar to the expected ones (see Figure 12). Parallel-META 3 performed worse than the study’s workflow in nearly all cases, except in the case of S. aureus. All algorithms overestimated the number of unexpected bacteria (see Figure 12). Still, MetaG was closest to the expected abundance of 0 % (see Figure 12). The detection of other taxa could have been caused by preparation, amplification and sequencing artifacts. Besides, contamination could have played a role. Contamination could, on the one hand, have arisen by the processing of the sample. On the other hand, the sample also contained fungi20. Ambiguous binding of the bacterial primers could thus have introduced contamination.

From the bioinformatics perspective, the taxa could have also appeared as a result of alignment errors or due to biased taxon abundances in the databases, leading to preferential detection of selected taxa. Likewise, the study’s authors suggested database issues, missing resolution of the 16S rRNA gene and sequencing errors as challenges for species-level identifications. Besides, amplification issues were thought to play a role (Cuscó et al. 2018). However, the results showed that the algorithms and databases majorly influenced the outcome. After all, all algorithms were influenced by the same biases arising from the wet-lab processing. Nevertheless, they performed significantly differently (see Figure 12). All in all, the superior classifications of MetaG over Parallel-META 3 at species level are in line with the in silico analysis of the nanopore BA sample (see Figure 8).

Variations of the copy numbers of the 16S rRNA genes within the genomes could be excluded as causes of the discrepancies. Neither Cuscó and coworkers (Cuscó et al. 2018), nor MetaG or Parallel-META 3, in the mode described in section 2.2.4.1, corrected copy numbers. The former did not state this explicitly, but they compared their observed taxon abundances to the uncorrected abundances given by the manufacturer20 (Cuscó et al. 2018). In any case, copy number corrections of the 16S rRNA gene would not explain the large number of unexpected taxa (see also Figure 12).

The discrepancies between the predictions of algorithms were most likely due to different levels of adaptation to the nanopore technology. MetaG showed the best adaptation. This was striking given the fact that the study used the WIMP workflow (Cuscó et al. 2018) and WIMP is specifically adapted to the nanopore technology (Juul et al. 2015). The bad performance of WIMP could be due to its design for WGM data (Juul et al. 2015). MetaG, for example, implicitly assumed that all reads were either from the rRNA gene or were artifacts. However, it was possible that the parameters chosen by the study's authors were not optimal.

The superior performance of MetaG was likely also due to the training of LAST alignment parameters on the nanopore technology by LAST-TRAIN (see section 2.1.1.). Importantly, MetaG’s standard parameters were robust and could be used for the in silico analysis using R9 2D reads and the in vitro analysis of the mock sample using R9.4.1 and 1D reads (Cuscó et al. 2018).

4.1.4. Evaluation of the performance and practical use of MetaG

To the best of my knowledge, of all algorithms compared to MetaG, only the RDP Classifier had an online implementation. However, such an implementation is essential for users with limited computational resources. Some programs had only local versions with command line interfaces. These may present significant challenges to users with limited computational skills. Thus, MetaG was developed with a web interface. A standalone version was created for users without an internet connection for submitting read files. Thus, analysis can be performed directly on the spot, even in remote locations. The nanopore technology is also suitable to be used in the field (e.g.: Quick et al. 2016; Castro-Wallace et al. 2017). Thus, MetaG’s standalone version could expand the utility of the device by enabling subsequent analysis of the generated data. Besides, a standalone version enables the analysis of, for example, confidential data.

MetaG was supplemented with thorough documentation. Still, it was brief enough to allow users to proceed with their analyses without lengthy studying of the manual. The major concepts of the software were included in the documentation, which will enable users to adapt the settings of MetaG to their needs.

Usage was further eased by standard settings for nanopore and MiSeq reads. These two technologies represent third and next-generation sequencing, respectively (reviewed in: Metzker 2010; Garrido-Cardenas et al. 2017). Thus, the standard settings will cover most of the demands by researchers in the field of metagenomics. The standards were shown to be robust enough to allow high-performance analyses of in silico and in vitro samples (see preceding sections). In the case of the nanopore technology, standards were suitable across flowcell versions and sequencing approaches: the in silico sample contained 2D reads from flowcell chemistry R9, the mock sample consisted of 1D reads from flowcell chemistry R9.4.1 (Cuscó et al. 2018).

Overall, MetaG performed on par with the established programs. Some of these programs had been recommended by a previous study (Escobar-Zepeda et al. 2018). However, MetaG (using MTX) performed substantially better for genus and species classifications (see Figures 8-11). This can be partially attributed to the training of alignment parameters using LAST-TRAIN (see section 2.1.1.). Besides, RDP and MTX were manually curated to obtain a better species resolution (see section 2.1.7.1.). This could have improved the results compared to the algorithms using their custom databases. In the case of nanopore sequencing of the sample containing bacteria, archaea and fungi, MetaG using MTX outperformed its competitors at all ranks (see Figure 9). A test set containing the three domains of life was expected to be realistic for research on patient samples or ecological samples. On actual sequencing data, MetaG performed overall better than all other analyzed algorithms, including the WIMP workflow, as used by Cuscó and coworkers (Cuscó et al. 2018) (see Figure 12). Besides, MetaG’s runtime was reasonable.

When a bacterium that had only recently been described as genus novum and species novum was analyzed, most programs tended to assign too specific or too few reasonable ranks. MetaG was no exception. When using RDP, it even assigned a species (see Tables 5 and 6). The performance of MetaG MTX was better, as it assigned the most similar genus, but no species (see Tables 5 and 6). The best performance was reached by the RDP Classifier. Using MiSeq reads it assigned the correct family as the most specific taxon (see Table 6).

I expect that the wrong classifications of novel taxa will become insignificant with further expansion of well-curated metaprofiling databases, as this implies fewer novel taxa. Until then, MetaG MTX will be a good choice. Its identifications were the most robust to the choice between MiSeq and nanopore sequencing. Besides, the assignments of too specific taxa were limited to the genus (see Tables 5 and 6).

MetaG also provided metadata in the form of pathogen detection and antibiotic resistance status. However, these features were only predictive: strain resolution is mandatory for definitive assignments (see also section 2.1.7.2.), and the strain resolution achieved here was insufficient (see also section 4.1.1.). Naturally, this metadata will be especially useful in a clinical setting.

All analysis results could be obtained in text format or visualized by graphs. The comprehensive display of results, the virtual independence of the computational setup, the high performance and the ease of use will make MetaG a serious alternative for anyone interested in metaprofiling analysis. Unlike Parallel-META 3 (see section 2.2.4.1.), MetaG allows users to include their own databases, as long as they are correctly formatted (see also section 2.1.7.).

In the future, the performance on WGM data should be assessed. An optional copy number correction of marker genes and additional diversity analyses will also be interesting to the users. Besides, analysis of viruses would be preferable. This, however, will most likely present challenges, due to the lack of a universal viral marker gene (see section 1.1.). Another necessary improvement for the standalone version of MetaG will be that users can load a predefined standard workflow. Until now, users have only been able to directly load their optimal LAST parameters and matrix (see section 2.1.6.). However, users should also be able to create configuration files for their favorite workflow in MetaG. These files could be stored in a special folder and define, for example, the environment variables and locations of databases. Users could then choose to simply load their configuration file together with a read file. This would make the use of the standalone version even more convenient. A similar feature has already been integrated into the web implementation (see section 2.1.6.).

In the field of metagenomics, some researchers have apparently lost the appreciation for statistical replicates (Prosser 2010). Thus, I feel compelled to highlight that the various tests of MetaG’s performance on in vitro and in silico samples, as well as on the novel bacterium, were only first, but strong, indications. Once the software is made available to other researchers, I expect that their experimental results will underpin the findings presented above.


4.2. SNP detection in NanoPipe

4.2.1. Detection of known SNPs

The evaluation of SNP calls by NanoPipe was performed by aligning its E. coli target against itself. The analysis was different from the study cases of P. falciparum and cell line H1975 in the NanoPipe publication (Shabardina et al. 2019) because all true SNPs were known. Thus, the in silico data made it straightforward to assess the causes of artifacts and to find ways to improve the SNP calling.

In the study cases, the results were rated based on previous detections (Shabardina et al. 2019). These, in turn, could have already missed or misclassified polymorphisms. However, the study cases were performed on actual sequencing data. Thus, both attempts to verify the results complemented each other.

The evaluation of SNP calls was carried out using two different workflows: one used the unmodified E. coli target sequence and the other used the sequence with 100 artificial SNPs. Both samples were subject to in silico sequencing with and without sequencing error, respectively. The perfect sample without sequencing error served as a control. It was possible that SNP artifacts had been introduced by alignment of the created reads. However, in the case of the unmodified E. coli genome, no FPs were detected (see section 3.2.1.). This implied that the effect of read creation and subsequent alignment on the SNP calls was negligible at a SNP allele cutoff of 20 %.

R9 2D sequencing of the unmodified genome, however, produced 568 erroneous SNPs out of the expected 611,374 errors which could technically have had a coverage of 21 (see section 3.2.1.). Thus, ca. 99.9 % of these errors did not influence the SNP output. However, the expected allele frequency of these errors was unclear.
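The percentage follows directly from the two counts given above. The short calculation below reproduces it; it takes the counts at face value and, as a simplifying assumption, treats every erroneous SNP as a single error. The variable names are illustrative.

```python
# Counts reported in section 3.2.1.
erroneous_snps = 568
expected_errors = 611_374

# Share of expected errors that surfaced as an erroneous SNP call.
share_called = erroneous_snps / expected_errors
print(f"{share_called:.4%}")      # 0.0929%
print(f"{1 - share_called:.1%}")  # 99.9%
```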

The introduction of artifacts will bias downstream analyses of users. Thus, even though the vast majority of sequencing errors did not lead to FP SNP identifications, properties of the FPs were analyzed. This was first done on the sample without simulated SNPs. The aim was to find common patterns which would further improve the filtering of SNPs in NanoPipe.

The first obvious metric which could have served as an error indicator was the p-error of each SNP. It gave the average alignment error within the vicinity of the polymorphism (Shabardina et al. 2019). The alignment error was thought to reveal the amount of sequencing errors (Shabardina et al. 2019). However, the error was at most 0.05 (see section 3.2.1.). This indicated that sequencing errors did not occur in dense clusters, which would have influenced the average alignment error to a greater extent. The average error calculation was limited to a range of, at most, 10 nucleotides upstream and downstream of the individual SNP (Shabardina et al. 2019). Thus, sequencing errors could indeed have been clustered, but with greater distances. Following this line of thought, the range for the error calculation could have been expanded. However, SNP clusters will also increase their respective alignment errors (Shabardina et al. 2019). Thus, increasing the calculation range would have further complicated the interpretation of the average alignment error. For that reason, it was not attempted.


The average nucleotide coverage at FP positions corresponded to the expected coverage of 21. Thus, even though the maximum coverage was as high as 34 (see section 3.2.1.), the overall coverage was not a feasible indicator of SNP artifacts. Next, the allelic abundances of FPs were examined. It became apparent that these were ca. 22 % on average (see section 3.2.1.). Thus, increasing the threshold for the minimum allelic abundance of a SNP to at least 22 % would have removed a significant portion of FPs. This was subsequently attempted in the sample with the simulated SNPs. However, this was no magic bullet, as the maximum abundance of an FP was as high as 46 % (see section 3.2.1.).
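The effect of the minimum allelic abundance can be sketched as a simple filter. This is an illustration and not NanoPipe's implementation: the record layout, the function name `filter_by_allele_frequency` and the candidate values are hypothetical, and the allelic abundance is assumed to be the number of reads supporting the alternative nucleotide divided by the coverage at that position.

```python
def filter_by_allele_frequency(candidates, min_frequency):
    """Keep candidates whose alternative-allele share reaches min_frequency."""
    kept = []
    for position, alt_reads, coverage in candidates:
        if coverage > 0 and alt_reads / coverage >= min_frequency:
            kept.append(position)
    return kept


# Hypothetical candidates: (position, reads with alternative base, coverage).
candidates = [(101, 5, 24), (202, 5, 21), (303, 10, 22), (404, 7, 21)]

print(filter_by_allele_frequency(candidates, 0.20))  # [101, 202, 303, 404]
print(filter_by_allele_frequency(candidates, 0.22))  # [202, 303, 404]
```

A candidate such as the third one, with an abundance of ca. 45 %, passes any cutoff in this range. This mirrors the observation that a stricter threshold alone cannot remove all FPs.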

The average ratio of transitions and transversions was ca. 5.53 for the whole genome (see section 3.2.1.). As there are twice as many random possibilities for transversions as for transitions, the expected purely random TI/TV would have been 0.5 (reviewed in: Wakeley 1996). Note that the ratio which was observed here was purely technical in nature. As discussed previously, the TI/TV varies between genes, codon positions and also organisms (reviewed in: Wakeley 1996). However, query and target of the polymorphism analysis were identical. Thus, the ratio was solely influenced by the sequencing error and/or the alignment.
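The random expectation of 0.5 can be verified by enumerating all possible substitutions. The sketch below is illustrative; the helper names are hypothetical, and it assumes that every substitution is equally likely.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """A transition exchanges purine for purine or pyrimidine for pyrimidine."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv(substitutions):
    """Ratio of transitions to transversions for (ref, alt) pairs."""
    transitions = sum(is_transition(ref, alt) for ref, alt in substitutions)
    transversions = len(substitutions) - transitions
    if transversions == 0:
        raise ValueError("TI/TV is undefined without transversions")
    return transitions / transversions


# All 12 ordered substitutions between the four nucleotides:
# 4 transitions (A<->G, C<->T) and 8 transversions.
all_substitutions = list(permutations("ACGT", 2))
print(ti_tv(all_substitutions))  # 0.5
```

The same function can be applied to a list of called SNPs, given as pairs of reference and alternative nucleotide, to obtain an observed ratio for comparison with this baseline.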

Next, the sample with the 100 simulated SNPs was analyzed. These had, in equal parts, allelic abundances of 33 and 67 %. Analysis of the perfect sequencing indicated that alignment errors due to read simulation did not introduce FP SNPs, even in the presence of real SNPs (see also section 3.2.2.). On the one hand, the real SNPs could have hindered alignment in some regions. On the other hand, the number of simulated SNPs was minuscule given the size of the genome (see also section 2.4.).

Interestingly, R9 2D sequencing of the sample introduced over 1,700 FPs at a cutoff of 20 %. This was surprising, as the number of FPs in the R9 2D sample without simulated SNPs was only ca. one third of this (see section 3.2.1. and Figure 13, 20 %). The average alignment error of the detected SNPs was increased compared to the perfect sequencing of the same sample and the R9 2D sequencing of the sample without simulated SNPs (see sections 3.2.1. and 3.2.2.). This indicated that sequencing errors in conjunction with real SNPs raised the rate of misalignments. Misalignments could also have been responsible for the already mentioned inflation of FPs at a cutoff of 20 %.

As the results of the sample without simulated SNPs indicated a reduction of FPs by increasing the allele frequency threshold (see section 3.2.1.), the cutoff was increased for the samples containing simulated SNPs. As expected, increasing the cutoff to 22 % led to a drastic reduction of FPs in the R9 2D sample. The number of FPs was almost on par with the one initially reported in the R9 2D sample without simulated SNPs (see section 3.2.1. and Figure 13, 22 %). With an increasing threshold in the R9 2D sample with the simulated SNPs, the number of FPs was further depleted (see Figure 13). However, the number of TPs also decreased (see Figure 15).

As the minimum allele frequency of the simulated SNPs was 33 %, all SNPs should have been detected up to a threshold of 30 %. Unexpectedly, this was never the case: neither in the perfect, nor in the R9 2D sequencing (see Figure 15). This supports the hypothesis that, indeed, alignment errors and the in silico sequencing itself influenced the SNP detection: for example, some SNPs may have been lost due to insufficient read depth in the sequencing simulations. SNPs with an expected allele frequency of 33 % were expected to be depleted at a cutoff of 35 %. In line with this, R9 2D and perfect sequencing produced the same number of TP calls at this cutoff (see Figure 15). However, some of the low-allelic SNPs were retained by both sequencing procedures (see section 3.2.2.). This was most likely caused by an unexpected read depth due to alignment or read creation. However, sequencing errors also played a role, as can be seen from the differential patterns of TP reduction in both samples (see Figure 15). The two error sources might have led to an increased allele frequency of the SNP. As outlined above, the depletion of expected SNPs could have followed the same, but inverse, mechanism.

The TI/TV of the perfect sequencing was similar to the random expectation of 0.5 (reviewed in: Wakeley 1996) (see also Figure 14). Slight deviations from this expectation were most likely the result of the SNP simulation itself. The R9 2D sequencing showed a substantially higher TI/TV, which decreased with an increasing cutoff. Notably, the ratio was almost constant when increasing the cutoff from 20 to 22 % (see Figure 14). However, with this increase, most FPs were lost (see Figure 13). If one follows the idea that the initial inflation of FPs at the cutoff of 20 % compared to the R9 2D sample without simulated SNPs (see section 3.2.1. and Figure 13) was due to alignment errors, then the TI/TV was relatively unaffected by these errors. However, the ratio would have been a good indicator for the sequencing error.

The overall performance evaluation by the MCC showed a tradeoff between high recovery of expected SNPs and the detection of FPs in the R9 2D sample. Although the sensitivity was high at lower cutoffs, the precision was low (see section 3.2.2. and Figure 16). With increasing precision towards stricter cutoffs, the MCC grew (see Figures 16 and 17). It reached its maximum at a cutoff of 30 % (see Figure 17). From this cutoff onward, the decrease of FPs could not compensate for the loss of TPs.
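The tradeoff can be illustrated with the standard definition of the MCC. The sketch below uses that textbook formula; the function name and the confusion-matrix counts are hypothetical and do not reproduce the values behind Figures 16 and 17.

```python
from math import sqrt


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # common convention when a row or column sum is zero
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts: a lenient cutoff keeps all TPs but many FPs,
# a stricter cutoff loses a few TPs and removes most FPs.
lenient = mcc(tp=100, fp=1700, fn=0, tn=1_000_000)
strict = mcc(tp=90, fp=10, fn=10, tn=1_001_690)
print(lenient < strict)  # True
```

In this toy case, the gain in precision outweighs the loss in sensitivity. Once almost no FPs remain, every further lost TP can only lower the coefficient.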

The impact of FPs on the MCC can also be seen in the perfect sample. This had an almost perfect MCC, even at the lowest cutoff (see Figure 17). As there were no FPs in the perfect sample (see Figure 13), the decrease of the MCC with increasing cutoffs directly reflected the loss of TPs (see Figures 15 and 17).

4.2.2. Current state and future improvements

The analysis presented above has opened new perspectives for improving the results of the polymorphism calculation in NanoPipe. A high TI/TV has occurred together with high FP counts (see Figures 13 and 14). Most likely, the ratio was directly related to the sequencing and not the alignment errors (see above). Still, the exact cause-effect relationship remained unclear.

Excluding SNP calls by the TI/TV ratio was problematic. As discussed in the previous section, the exact ratio depends on the biological context. This context was absent from these experiments. However, users will be confronted with it in their everyday analyses. Thus, the overall TI/TV ratio per target could be displayed as metadata. Users would then have to put the observed TI/TV ratio into its theoretical context: on the one hand, a ratio that is higher than expected may indicate a high number of spurious SNP calls on a target. On the other hand, it may indicate a biological pattern. In line with the former argument, Altmann and coworkers compared the TI/TV ratios of results from different SNP-calling algorithms to the expected one (Altmann et al. 2012). However, the ratio will not reveal the individual FPs, but it could serve as an indicator to make the overall allele threshold stricter. Still, probabilistic approaches (see also section 1.2.2.) may use the expected values for transitions and transversions to derive an individual classification (R. Li et al. 2009).
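A per-target TI/TV ratio of the kind suggested here could be derived as in the following minimal Python sketch. It is not part of NanoPipe; the input format (tuples of target, reference base and alternative base) and the function names are assumptions made for illustration. Calls involving bases other than A, C, G and T are skipped, and the ratio is reported as `None` when no transversion was observed.

```python
from collections import defaultdict

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)


def ti_tv_per_target(calls):
    """Return {target: TI/TV ratio} for an iterable of (target, ref, alt)."""
    counts = defaultdict(lambda: [0, 0])  # target -> [transitions, transversions]
    for target, ref, alt in calls:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt or ref not in BASES or alt not in BASES:
            continue  # not a substitution between unambiguous bases
        counts[target][0 if is_transition(ref, alt) else 1] += 1
    # The ratio is undefined (None) when a target has no transversions.
    return {t: (ti / tv if tv else None) for t, (ti, tv) in counts.items()}


# Hypothetical SNP calls on two hypothetical targets.
example_calls = [
    ("chr7", "A", "G"),  # transition
    ("chr7", "C", "T"),  # transition
    ("chr7", "A", "C"),  # transversion
    ("chr1", "G", "A"),  # transition
]
ratios = ti_tv_per_target(example_calls)
```

The resulting ratios would still have to be compared with the value expected for the biological context at hand, as argued above.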

The polymorphism analysis in NanoPipe already includes metadata that should reveal artifacts (Shabardina et al. 2019). At an individual site, the joint probabilities of SNP candidates indicate the likelier SNP (see also section 2.3.4.1.) (Shabardina et al. 2019). This could not be applied here, since simulated and erroneous SNPs were strictly biallelic (data not shown). The average alignment error was inconspicuous for the majority of artifacts in the sample without simulated SNPs and R9 2D sequencing (see section 3.2.1.). In the sample with the simulated SNPs and R9 2D sequencing, it rose (see sections 3.2.1. and 3.2.2.). Even though the error was still minuscule, it indicated a slightly decreasing alignment quality. Still, I expect that filtering by this small p-error would lose a substantial number of TPs when analyzing real data.

NanoPipe also provided a framework for accurate SNP calling. As discussed in section 1.2.2., alignments used for SNP calling should be thoroughly investigated: spurious alignments should be removed and mapping of reads should be visually examined (reviewed in: Altmann et al. 2012). NanoPipe already included these features by removing spurious reads with LAST-SPLIT, except for analyses on the predefined human transcriptome target, and by displaying mapping statistics (Shabardina et al. 2019).

It is also important to note that the number of FPs needs to be put into perspective. The false-positive rate of SNP identifications should be below 20 % (reviewed in: Clevenger et al. 2015). The false-positive rate, which could be calculated from the specificity (clarified by: Suojanen 1999), was below 0.1 % for any tested cutoff in the R9 2D sample with the simulated SNPs. Thus, a reduction of FPs should not be forced at the expense of TPs, although it might be tempting to propose an allele threshold of 30 % to maximize the MCC (see Figure 17).
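The relation between specificity and false-positive rate used above is simply that the false-positive rate equals one minus the specificity. A minimal Python sketch with hypothetical counts (not the values measured in this work) shows why a large number of true negatives keeps the rate small even when FPs are present:

```python
def specificity(tn: int, fp: int) -> float:
    """Fraction of truly non-polymorphic sites that were not called as SNPs."""
    return tn / (tn + fp)


def false_positive_rate(tn: int, fp: int) -> float:
    """False-positive rate, derived as 1 - specificity."""
    return 1.0 - specificity(tn, fp)


# Hypothetical counts: 50 FPs among 100,000 non-polymorphic sites.
fpr = false_positive_rate(tn=99_950, fp=50)
fpr_percent = 100.0 * fpr  # roughly 0.05 %
```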

In this experiment, the minimum abundance of the simulated SNPs was high. Thus, the data were relatively robust against stricter cutoffs. However, low-frequency mutations were proposed to have an impact on disease: it was shown that a significant portion of these were non-synonymous (Shen et al. 2013). NanoPipe targeted users from diverse fields (Shabardina et al. 2019), who will not only be interested in high-frequency mutations. Thus, users should be able to adjust the allelic cutoff in future versions of NanoPipe. Still, a standard threshold will be needed to fulfill the credo of easy use (Shabardina et al. 2019). Based on the trends observed, the standard threshold should be increased to 25 %. This increase is expected to retain many true SNPs but to remove the vast majority of FPs (see Figures 13 and 15, R9 2D). The re-analyses of NanoPipe's study cases for the human cell line H1975 and P. falciparum with the higher cutoff showed that most SNPs were retained (see section 3.2.2.). However, in the human sample, two SNPs on chromosome seven were lost. One of these was also reported in dbSNP (see section 3.2.2.), although this did not verify that the SNP was actually present in the analyzed sample. Overall, the high recovery of SNPs from the study cases indicated that raising the allelic threshold to 25 % would not be problematic. Further tests will be needed to validate the benefits of increasing the standard allelic threshold, but the preliminary results were promising. Users working with older flowcell versions and targeting SNPs with high allelic frequencies will profit from raising the allelic threshold beyond the proposed standard.
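A user-adjustable allelic cutoff with a standard value of 25 % could look like the following minimal Python sketch. It does not reproduce NanoPipe's implementation: the record fields (`pos`, `alt_reads`, `depth`), the function name and the inclusive comparison are assumptions made for illustration.

```python
def filter_by_allele_frequency(candidates, threshold=0.25):
    """Keep SNP candidates whose alternative-allele frequency reaches the threshold.

    `candidates` is an iterable of dicts with the hypothetical keys
    'alt_reads' (reads supporting the alternative allele) and 'depth'
    (all reads covering the site). Sites without coverage are dropped.
    """
    kept = []
    for candidate in candidates:
        depth = candidate["depth"]
        if depth > 0 and candidate["alt_reads"] / depth >= threshold:
            kept.append(candidate)
    return kept


# Hypothetical candidates, not taken from the study cases.
candidates = [
    {"pos": 101, "alt_reads": 30, "depth": 100},  # 30 %, kept
    {"pos": 202, "alt_reads": 22, "depth": 100},  # 22 %, removed at 25 %
    {"pos": 303, "alt_reads": 25, "depth": 100},  # 25 %, kept (inclusive)
]
kept_default = filter_by_allele_frequency(candidates)
kept_lenient = filter_by_allele_frequency(candidates, threshold=0.20)
```

Exposing `threshold` as a parameter with a default mirrors the argument above: the standard value keeps the tool easy to use, while users interested in low-frequency variants can lower it.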

5. References

Acinas, Silvia G., Luisa A. Marcelino, Vanja Klepac-Ceraj, and Martin F. Polz. 2004. “Divergence and Redundancy of 16S RRNA Sequences in Genomes with Multiple Rrn Operons.” Journal of Bacteriology 186 (9): 2629–35. https://doi.org/10.1128/JB.186.9.2629-2635.2004.

Aird, Daniel, Michael G. Ross, Wei-Sheng Chen, Maxwell Danielsson, Timothy Fennell, Carsten Russ, David B. Jaffe, Chad Nusbaum, and Andreas Gnirke. 2011. “Analyzing and Minimizing PCR Amplification Bias in Illumina Sequencing Libraries.” Genome Biology 12 (2): R18. https://doi.org/10.1186/gb-2011-12-2-r18.

Allen, Eric E., and Jillian F. Banfield. 2005. “Community Genomics in Microbial Ecology and Evolution.” Nature Reviews Microbiology 3 (June): 489–98. https://doi.org/10.1038/nrmicro1157.

Altmann, André, Peter Weber, Daniel Bader, Michael Preuß, Elisabeth B. Binder, and Bertram Müller-Myhsok. 2012. “A Beginners Guide to SNP Calling from High-Throughput DNA-Sequencing Data.” Human Genetics 131 (10): 1541–54. https://doi.org/10.1007/s00439-012-1213-z.

Altschul, S. F., T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. “Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs.” Nucleic Acids Research 25 (17): 3389–3402. https://doi.org/10.1093/nar/25.17.3389.

Amir, Amnon, Daniel McDonald, Jose A. Navas-Molina, Evguenia Kopylova, James T. Morton, Zhenjiang Zech Xu, Eric P. Kightley, et al. 2017. “Deblur Rapidly Resolves Single-Nucleotide Community Sequence Patterns.” Edited by Jack A Gilbert. MSystems 2 (April): e00191-16. https://doi.org/10.1128/mSystems.00191-16.

Antonopoulos, Dionysios A., Rida Assaf, Ramy Karam Aziz, Thomas Brettin, Christopher Bun, Neal Conrad, James J. Davis, et al. 2017. “PATRIC as a Unique Resource for Studying Antimicrobial Resistance.” Briefings in Bioinformatics, July. https://doi.org/10.1093/bib/bbx083.

Aßhauer, Kathrin P., Bernd Wemheuer, Rolf Daniel, and Peter Meinicke. 2015. “Tax4Fun: Predicting Functional Profiles from Metagenomic 16S RRNA Data.” Bioinformatics 31 (17): 2882–84. https://doi.org/10.1093/bioinformatics/btv287.

Aurrecoechea, Cristina, John Brestelli, Brian P. Brunk, Jennifer Dommer, Steve Fischer, Bindu Gajria, Xin Gao, et al. 2009. “PlasmoDB: A Functional Genomic Database for Malaria Parasites.” Nucleic Acids Research 37 (Issue suppl_1): D539-543. https://doi.org/10.1093/nar/gkn814.

Bashiardes, Stavros, Gili Zilberman-Schapira, and Eran Elinav. 2016. “Use of Metatranscriptomics in Microbiome Research.” Bioinformatics and Biology Insights 10 (January). https://doi.org/10.4137/BBI.S34610.

Bengtsson-Palme, Johan, Martin Hartmann, Karl Martin Eriksson, Chandan Pal, Kaisa Thorell, Dan Göran Joakim Larsson, and Rolf Henrik Nilsson. 2015. “METAXA2: Improved Identification and Taxonomic Classification of Small and Large Subunit RRNA in Metagenomic Data.” Molecular Ecology Resources 15 (6): 1403–14. https://doi.org/10.1111/1755-0998.12399.

Benítez-Páez, Alfonso, Kevin J. Portune, and Yolanda Sanz. 2016. “Species-Level Resolution of 16S RRNA Gene Amplicons Sequenced through the MinION™ Portable Nanopore Sequencer.” GigaScience 5 (1): s13742–016–0111–z. https://doi.org/10.1186/s13742-016-0111-z.

Berini, Francesca, Carmine Casciello, Giorgia Letizia Marcone, and Flavia Marinelli. 2017. “Metagenomics: Novel Enzymes from Non-Culturable Microbes.” FEMS Microbiology Letters 364 (21): fnx211. https://doi.org/10.1093/femsle/fnx211.

Bethesda (MD): National Library of Medicine (US). 1946. “PubMed [Internet].” https://www.ncbi.nlm.nih.gov/pubmed/. Accessed 23.01.2019.

Bokulich, Nicholas A., Benjamin D. Kaehler, Jai Ram Rideout, Matthew Dillon, Evan Bolyen, Rob Knight, Gavin A. Huttley, and Gregory J. Caporaso. 2018. “Optimizing Taxonomic Classification of Marker-Gene Amplicon Sequences with QIIME 2’s Q2-Feature-Classifier Plugin.” Microbiome 6: 90. https://doi.org/10.1186/s40168-018-0470-z.

Bolyen, Evan, Jai Ram Rideout, Matthew R. Dillon, Nicholas A. Bokulich, Christian C. Abnet, Gabriel A. Al-Ghalith, Harriet Alexander, et al. 2018. “QIIME 2: Reproducible, Interactive, Scalable, and Extensible Microbiome Data Science.” PeerJ Preprints 6: e27295v2. https://doi.org/10.7287/peerj.preprints.27295v2.

Bonnet, Régis, Antonia Suau, Joël Doré, Glenn R. Gibson, and Matthew D. Collins. 2002. “Differences in RDNA Libraries of Faecal Bacteria Derived from 10- and 25-Cycle PCRs.” International Journal of Systematic and Evolutionary Microbiology 52: 757–63. https://doi.org/10.1099/00207713-52-3-757.

Breitbart, Mya, and Forest Rohwer. 2005. “Here a Virus, There a Virus, Everywhere the Same Virus?” Trends in Microbiology 13 (6): 278–84. https://doi.org/10.1016/j.tim.2005.04.003.

Břinda, Karel, and Chen Yang. n.d. “NanoSim-H (Version 1.1.0.4).” Zenodo. https://doi.org/10.5281/zenodo.1341249.

Brockman, William, Pablo Alvarez, Sarah Young, Manuel Garber, Georgia Giannoukos, William L. Lee, Carsten Russ, Eric S. Lander, Chad Nusbaum, and David B. Jaffe. 2008. “Quality Scores and SNP Detection in Sequencing-by-Synthesis Systems.” Genome Research 18 (5): 763–70. https://doi.org/10.1101/gr.070227.107.

Brooker, Robert J. 2009. Genetics: Analysis & Principles. Edited by Patrick E. Reidy and Lisa A. Bruflodt. 3rd ed. New York, USA: McGraw-Hill.

Buss, Sarah N., Roxanne Alter, Peter C. Iwen, and Paul D. Fey. 2013. “Implications of Culture-Independent Panel-Based Detection of Cyclospora Cayetanensis.” Journal of Clinical Microbiology 51 (11): 3909. https://doi.org/10.1128/JCM.02238-13.

Butler, John M. 2012. “Single Nucleotide Polymorphisms and Applications.” In Advanced Topics in Forensic DNA Typing: Methodology, edited by John M. Butler, 347–69. Academic Press. https://doi.org/10.1016/B978-0-12-374513-2.00012-9.

Callahan, Benjamin J., Paul J. McMurdie, and Susan P. Holmes. 2017. “Exact Sequence Variants Should Replace Operational Taxonomic Units in Marker-Gene Data Analysis.” The ISME Journal 11: 2639–43. https://doi.org/10.1038/ismej.2017.119.

Callahan, Benjamin J., Paul J. McMurdie, Michael J. Rosen, Andrew W. Han, Amy Jo A. Johnson, and Susan P. Holmes. 2016. “DADA2: High-Resolution Sample Inference from Illumina Amplicon Data.” Nature Methods 13 (May): 581–83. https://doi.org/10.1038/nmeth.3869.

Camacho, Christiam, George Coulouris, Vahram Avagyan, Ning Ma, Jason Papadopoulos, Kevin Bealer, and Thomas L. Madden. 2009. “BLAST+: Architecture and Applications.” BMC Bioinformatics 10: 421. https://doi.org/10.1186/1471-2105-10-421.

Campbell, Peter J., Erin D. Pleasance, Philip J. Stephens, Ed Dicks, Richard Rance, Ian Goodhead, George A. Follows, Anthony R. Green, P. Andy Futreal, and Michael R. Stratton. 2008. “Subclonal Phylogenetic Structures in Cancer Revealed by Ultra-Deep Sequencing.” Proceedings of the National Academy of Sciences 105 (35): 13081–86. https://doi.org/10.1073/pnas.0801523105.

Castro-Wallace, Sarah L., Charles Y. Chiu, Kristen K. John, Sarah E. Stahl, Kathleen H. Rubins, Alexa B. R. McIntyre, Jason P. Dworkin, et al. 2017. “Nanopore DNA Sequencing and Genome Assembly on the International Space Station.” Scientific Reports 7: 18022. https://doi.org/10.1038/s41598-017-18364-0.

Chadha, Ankit R., Rishikesh Misal, and Tanaya Mokashi. 2014. “Modified Binary Search Algorithm.” International Journal of Applied Information Systems 7 (2): 37–40. http://arxiv.org/abs/1406.1677.

Chen, Qingshan, Xinrui Mao, Zhanguo Zhang, Rongsheng Zhu, Zhengong Yin, Yue Leng, Hongxiao Yu, et al. 2016. “SNP-SNP Interaction Analysis on Soybean Oil Content under Multi-Environments.” PLOS ONE 11 (9): e0163692. https://doi.org/10.1371/journal.pone.0163692.

Clapham, Christopher, and James Nicholson. 2014. “Mean.” In The Concise Oxford Dictionary of Mathematics, 5th ed. Oxford University Press. http://www.oxfordreference.com/view/10.1093/acref/9780199679591.001.0001/acref-9780199679591-e-1791.

Clevenger, Josh, Carolina Chavarro, Stephanie A. Pearl, Peggy Ozias-Akins, and Scott A. Jackson. 2015. “Single Nucleotide Polymorphism Identification in Polyploids: A Review, Example, and Recommendations.” Molecular Plant 8 (6): 831–46. https://doi.org/10.1016/j.molp.2015.02.002.

Cohan, Frederick M., and Elizabeth B. Perry. 2007. “A Systematics for Discovering the Fundamental Units of Bacterial Diversity.” Current Biology 17 (10): R373–86. https://doi.org/10.1016/j.cub.2007.03.032.

Cole, James R., Qiong Wang, Jordan A. Fish, Benli Chai, Donna M. McGarrell, Yanni Sun, C. Titus Brown, Andrea Porras-Alfaro, Cheryl R. Kuske, and James M. Tiedje. 2014. “Ribosomal Database Project: Data and Tools for High Throughput RRNA Analysis.” Nucleic Acids Research 42 (D1): D633–42. https://doi.org/10.1093/nar/gkt1244.

Cretoiu, Mariana Silvia, Francesca Berini, Anna Maria Kielak, Flavia Marinelli, and Jan Dirk van Elsas. 2015. “A Novel Salt-Tolerant Chitobiosidase Discovered by Genetic Screening of a Metagenomic Library Derived from Chitin-Amended Agricultural Soil.” Applied Microbiology and Biotechnology 99 (19): 8199–8215. https://doi.org/10.1007/s00253-015-6639-5.

Cuscó, Anna, Carlotta Catozzi, Joaquim Viñes, Armand Sanchez, and Olga Francino. 2018. “Microbiota Profiling with Long Amplicons Using Nanopore Sequencing: Full-Length 16S RRNA Gene and Whole Rrn Operon [Version 1; Referees: 2 Approved, 3 Approved with Reservations].” F1000Research 7 (1755). https://doi.org/10.12688/f1000research.16817.1.

Deamer, David, Mark Akeson, and Daniel Branton. 2016. “Three Decades of Nanopore Sequencing.” Nature Biotechnology 34 (5): 518–24. https://doi.org/10.1038/nbt.3423.

DeSantis, Todd Z., Eoin L. Brodie, Jordan P. Moberg, Ingrid X. Zubieta, Yvette M. Piceno, and Gary L. Andersen. 2007. “High-Density Universal 16S RRNA Microarray Analysis Reveals Broader Diversity than Typical Clone Library When Sampling the Environment.” Microbial Ecology 53 (3): 371–83. https://doi.org/10.1007/s00248-006-9134-9.

Deshpande, Vinita, Qiong Wang, Paul Greenfield, Michael Charleston, Andrea Porras-Alfaro, Cheryl R. Kuske, James R. Cole, David J. Midgley, and Nai Tran-Dinh. 2016. “Fungal Identification Using a Bayesian Classifier and the Warcup Training Set of Internal Transcribed Spacer Sequences.” Mycologia 108 (1): 1–5. https://doi.org/10.3852/14-293.

Deurenberg, Ruud H., Erik Bathoorn, Monika A. Chlebowicz, Natacha Couto, Mithila Ferdous, Silvia García-Cobos, Anna M. D. Kooistra-Smid, et al. 2017. “Application of next Generation Sequencing in Clinical Microbiology and Infection Prevention.” Journal of Biotechnology 243: 16–24. https://doi.org/10.1016/j.jbiotec.2016.12.022.

Dixon, L. A., C. M. Murray, E. J. Archer, A. E. Dobbins, P. Koumi, and P. Gill. 2005. “Validation of a 21-Locus Autosomal SNP Multiplex for Forensic Identification Purposes.” Forensic Science International 154 (1): 62–77. https://doi.org/10.1016/j.forsciint.2004.12.011.

Dohm, Juliane C., Claudio Lottaz, Tatiana Borodina, and Heinz Himmelbauer. 2008. “Substantial Biases in Ultra-Short Read Data Sets from High-Throughput DNA Sequencing.” Nucleic Acids Research 36 (16): e105. https://doi.org/10.1093/nar/gkn425.

Doolittle, W. Ford, and R. Thane Papke. 2006. “Genomics and the Bacterial Species Problem.” Genome Biology 7 (9): 116. https://doi.org/10.1186/gb-2006-7-9-116.

Dykhuizen, D. E., and L. Green. 1991. “Recombination in Escherichia Coli and the Definition of Biological Species.” Journal of Bacteriology 173 (22): 7257–68. https://doi.org/10.1128/jb.173.22.7257-7268.1991.

Edgar, Robert. 2018. “Taxonomy Annotation and Guide Tree Errors in 16S RRNA Databases.” PeerJ 6 (e5030). https://doi.org/10.7717/peerj.5030.

Edgar, Robert C. 2018. “Updating the 97% Identity Threshold for 16S Ribosomal RNA OTUs.” Bioinformatics 34 (14): 2371–75. https://doi.org/10.1093/bioinformatics/bty113.

Edwards, Joan E., Robert J. Forster, Tony M. Callaghan, Veronika Dollhofer, Sumit S. Dagar, Yanfen Cheng, Jongsoo Chang, et al. 2017. “PCR and Omics Based Techniques to Study the Diversity, Ecology and Biology of Anaerobic Fungi: Insights, Challenges and Opportunities.” Frontiers in Microbiology. https://doi.org/10.3389/fmicb.2017.01657.

Escobar-Zepeda, Alejandra, Elizabeth Ernestina Godoy-Lozano, Luciana Raggi, Lorenzo Segovia, Enrique Merino, Rosa María Gutiérrez-Rios, Katy Juarez, Alexei F. Licea-Navarro, Liliana Pardo-Lopez, and Alejandro Sanchez-Flores. 2018. “Analysis of Sequencing Strategies and Tools for Taxonomic Annotation: Defining Standards for Progressive Metagenomics.” Scientific Reports 8: 1–13. https://doi.org/10.1038/s41598-018-30515-5.

Escobar-Zepeda, Alejandra, Arturo Vera-Ponce de León, and Alejandro Sanchez-Flores. 2015. “The Road to Metagenomics: From Microbiology to DNA Sequencing Technologies and Bioinformatics.” Frontiers in Genetics 6 (348). https://doi.org/10.3389/fgene.2015.00348.

Falush, Daniel, Christian Kraft, Nancy S. Taylor, Pelayo Correa, James G. Fox, Mark Achtman, and Sebastian Suerbaum. 2001. “Recombination and Mutation during Long-Term Gastric Colonization by Helicobacter Pylori: Estimates of Clock Rates, Recombination Size, and Minimal Age.” Proceedings of the National Academy of Sciences 98 (26): 15056–61. https://doi.org/10.1073/pnas.251396098.

Filippo, Carlotta De, Monica Di Paola, Matteo Ramazzotti, Davide Albanese, Giuseppe Pieraccini, Elena Banci, Franco Miglietta, Duccio Cavalieri, and Paolo Lionetti. 2017. “Diet, Environments, and Gut Microbiota. A Preliminary Investigation in Children Living in Rural and Urban Burkina Faso and Italy.” Frontiers in Microbiology 8 (1979). https://doi.org/10.3389/fmicb.2017.01979.

Finley, Sheree J., Jennifer L. Pechal, M. Eric Benbow, B. K. Robertson, and Gulnaz T. Javan. 2016. “Microbial Signatures of Cadaver Gravesoil During Decomposition.” Microbial Ecology 71 (3): 524–29. https://doi.org/10.1007/s00248-015-0725-1.

Forbes, Jessica D., Natalie C. Knox, Christy Lynn Peterson, and Aleisha R. Reimer. 2018. “Highlighting Clinical Metagenomics for Enhanced Diagnostic Decision-Making: A Step Towards Wider Implementation.” Computational and Structural Biotechnology Journal 16: 108–20. https://doi.org/10.1016/j.csbj.2018.02.006.

Forslund, Kristoffer, Falk Hildebrand, Trine Nielsen, Gwen Falony, Emmanuelle Le Chatelier, Shinichi Sunagawa, Edi Prifti, et al. 2015. “Disentangling Type 2 Diabetes and Metformin Treatment Signatures in the Human Gut Microbiota.” Nature 528 (December): 262–66. https://doi.org/10.1038/nature15766.

Fosso, Bruno, Graziano Pesole, Francesc Rosselló, and Gabriel Valiente. 2018. “Unbiased Taxonomic Annotation of Metagenomic Samples.” Journal of Computational Biology 25 (3): 348–60. https://doi.org/10.1089/cmb.2017.0144.

Fox, G. E., E. Stackebrandt, R. B. Hespell, J. Gibson, J. Maniloff, T. A. Dyer, R. S. Wolfe, et al. 1980. “The Phylogeny of Prokaryotes.” Science 209 (4455): 457–63. https://doi.org/10.1126/science.6771870.

Frith, Martin C., Michiaki Hamada, and Paul Horton. 2010. “Parameters for Accurate Genome Alignment.” BMC Bioinformatics. https://doi.org/10.1186/1471-2105-11-80.

Frith, Martin C., and Risa Kawaguchi. 2015. “Split-Alignment of Genomes Finds Orthologies More Accurately.” Genome Biology 16: 106. https://doi.org/10.1186/s13059-015-0670-9.

Gao, Jing, Guoji Liu, Hongping Li, Li Xu, Lili Du, and Bo Yang. 2016. “Predictive Functional Profiling Using Marker Gene Sequences and Community Diversity Analyses of Microbes in Full-Scale Anaerobic Sludge Digesters.” Bioprocess and Biosystems Engineering 39 (7): 1115–27. https://doi.org/10.1007/s00449-016-1588-7.

Gao, Xiang, Huaiying Lin, Kashi Revanna, and Qunfeng Dong. 2017. “A Bayesian Taxonomic Classification Method for 16S RRNA Gene Sequences with Improved Species-Level Accuracy.” BMC Bioinformatics 18: 247. https://doi.org/10.1186/s12859-017-1670-4.

Garrido-Cardenas, Jose Antonio, Federico Garcia-Maroto, Jose Antonio Alvarez-Bermejo, and Francisco Manzano-Agugliaro. 2017. “DNA Sequencing Sensors: An Overview.” Sensors 17 (3): 588. https://doi.org/10.3390/s17030588.

Gevers, Dirk, Frederick M. Cohan, Jeffrey G. Lawrence, Brian G. Spratt, Tom Coenye, Edward J. Feil, Erko Stackebrandt, et al. 2005. “Re-Evaluating Prokaryotic Species.” Nature Reviews Microbiology 3 (August): 733–39. https://doi.org/10.1038/nrmicro1236.

Gihring, Thomas M., Gengxin Zhang, Craig C. Brandt, Scott C. Brooks, James H. Campbell, Susan Carroll, Craig S. Criddle, et al. 2011. “A Limited Microbial Consortium Is Responsible for Extended Bioreduction of Uranium in a Contaminated Aquifer.” Applied and Environmental Microbiology 77 (17): 5955–65. https://doi.org/10.1128/AEM.00220-11.

Gilbert, Jack A., and Christopher L. Dupont. 2011. “Microbial Metagenomics: Beyond the Genome.” Annual Review of Marine Science 3 (1): 347–71. https://doi.org/10.1146/annurev-marine-120709-142811.

Glöckner, Frank Oliver, Pelin Yilmaz, Christian Quast, Jan Gerken, Alan Beccati, Andreea Ciuprina, Gerrit Bruns, et al. 2017. “25 Years of Serving the Community with Ribosomal RNA Gene Reference Databases and Tools.” Journal of Biotechnology 261: 169–76. https://doi.org/10.1016/j.jbiotec.2017.06.1198.

Glover, Natasha M., Henning Redestig, and Christophe Dessimoz. 2016. “Homoeologs: What Are They and How Do We Infer Them?” Trends in Plant Science 21 (7): 609–21. https://doi.org/10.1016/j.tplants.2016.02.005.

Goldstein, Sarah, Lidia Beka, Joerg Graf, and Jonathan L. Klassen. 2019. “Evaluation of Strategies for the Assembly of Diverse Bacterial Genomes Using MinION Long-Read Sequencing.” BMC Genomics 20: 23. https://doi.org/10.1186/s12864-018-5381-7.

Goodrich, Julia K., Sara C. Di Rienzi, Angela C. Poole, Omry Koren, William A. Walters, J. Gregory Caporaso, Rob Knight, and Ruth E. Ley. 2014. “Conducting a Microbiome Study.” Cell 158 (2): 250–62. https://doi.org/10.1016/j.cell.2014.06.037.

Greninger, Alexander L., Samia N. Naccache, Scot Federman, Guixia Yu, Placide Mbala, Vanessa Bres, Doug Stryke, et al. 2015. “Rapid Metagenomic Identification of Viral Pathogens in Clinical Samples by Real-Time Nanopore Sequencing Analysis.” Genome Medicine 7: 99. https://doi.org/10.1186/s13073-015-0220-9.

Griffin, Philippa C., Charles Robin, and Ary A. Hoffmann. 2011. “A Next-Generation Sequencing Method for Overcoming the Multiple Gene Copy Problem in Polyploid, Applied to Poa Grasses.” BMC Biology 9: 19. https://doi.org/10.1186/1741-7007-9-19.

Griffiths, Anthony J. F., Susan R. Wessler, Richard C. Lewontin, and Sean B. Carroll. 2008. Introduction to Genetic Analysis. Edited by Jerry Correa, Susan Moran, Janie Chan, Amy Peltier, Lindsay Lovier, Ted Szczepanski, and Mary Louise Byrd. 9th ed. New York, USA: W.H. Freeman and Company.

Halfvarson, Jonas, Colin J. Brislawn, Regina Lamendella, Yoshiki Vázquez-Baeza, William A. Walters, Lisa M. Bramer, Mauro D’Amato, et al. 2017. “Dynamics of the Human Gut Microbiome in Inflammatory Bowel Disease.” Nature Microbiology 2 (17004). https://doi.org/10.1038/nmicrobiol.2017.4.

Hamada, Michiaki, Yukiteru Ono, Kiyoshi Asai, and Martin C. Frith. 2017. “Training Alignment Parameters for Arbitrary Sequencers with LAST-TRAIN.” Bioinformatics 33 (6): 926–28. https://doi.org/10.1093/bioinformatics/btw742.

Hamada, Michiaki, Edward Wijaya, Martin C. Frith, and Kiyoshi Asai. 2011. “Probabilistic Alignments with Quality Scores: An Application to Short-Read Mapping toward Accurate SNP/Indel Detection.” Bioinformatics 27 (22): 3085–92. https://doi.org/10.1093/bioinformatics/btr537.

Handelsman, Jo, Michelle R. Rondon, Sean F. Brady, Jon Clardy, and Robert M. Goodman. 1998. “Molecular Biological Access to the Chemistry of Unknown Soil Microbes: A New Frontier for Natural Products.” Chemistry & Biology 5 (10): R245-249.

Harismendy, Olivier, Pauline C. Ng, Robert L. Strausberg, Xiaoyun Wang, Timothy B. Stockwell, Karen Y. Beeson, Nicholas J. Schork, et al. 2009. “Evaluation of next Generation Sequencing Platforms for Population Targeted Sequencing Studies.” Genome Biology 10 (3): R32. https://doi.org/10.1186/gb-2009-10-3-r32.

Hartl, Daniel L., and Elizabeth W. Jones. 2009. Genetics: Analysis of Genes and Genomes. Edited by Shoshanna Goldberg, Dean W. DeChambeau, Molly Steinbach, Caroline Perry, and Rachel Rossi. 7th ed. Sudbury, USA: Jones and Bartlett Publishers.

Heaton, Michael P., Kreg A. Leymaster, Theodore S. Kalbfleisch, James W. Kijas, Shannon M. Clarke, John McEwan, Jillian F. Maddox, et al. 2014. “SNPs for Parentage Testing and Traceability in Globally Diverse Breeds of Sheep.” PLOS ONE 9 (4): e94851. https://doi.org/10.1371/journal.pone.0094851.

Hedges, Dale J., Dan Burges, Eric Powell, Cherylyn Almonte, Jia Huang, Stuart Young, Benjamin Boese, et al. 2009. “Exome Sequencing of a Multigenerational Human Pedigree.” PLOS ONE 4 (12): e8232. https://doi.org/10.1371/journal.pone.0008232.

Hemme, Christopher L., Ye Deng, Terry J. Gentry, Matthew W. Fields, Liyou Wu, Soumitra Barua, Kerrie Barry, et al. 2010. “Metagenomic Insights into Evolution of a Heavy Metal-Contaminated Groundwater Microbial Community.” The ISME Journal 4 (February): 660–72. https://doi.org/10.1038/ismej.2009.154.

Hendriksen, Rene S., Lance B. Price, James M. Schupp, John D. Gillece, Rolf S. Kaas, David M. Engelthaler, Valeria Bortolaia, et al. 2011. “Population Genetics of Vibrio Cholerae from Nepal in 2010: Evidence on the Origin of the Haitian Outbreak.” Edited by David Relman. MBio 2 (4): e00157-11. https://doi.org/10.1128/mBio.00157-11.

Herbst, Roy S., Paul Baas, Dong-Wan Kim, Enriqueta Felip, José L. Pérez-Gracia, Ji-Youn Han, Julian Molina, et al. 2016. “Pembrolizumab versus Docetaxel for Previously Treated, PD-L1-Positive, Advanced Non-Small-Cell Lung Cancer (KEYNOTE-010): A Randomised Controlled Trial.” The Lancet 387 (10027): 1540–50. https://doi.org/10.1016/S0140-6736(15)01281-7.

Herrera-Estrella, Alfredo, and Ilan Chet. 1999. “Chitinases in Biological Control.” EXS 87: 171–84.

Hjort, Karin, Ilaria Presti, Annelie Elväng, Flavia Marinelli, and Sara Sjöling. 2014. “Bacterial Chitinase with Phytopathogen Control Capacity from Suppressive Soil Revealed by Functional Metagenomics.” Applied Microbiology and Biotechnology 98 (6): 2819–28. https://doi.org/10.1007/s00253-013-5287-x.

Huang, Weichun, Leping Li, Jason R. Myers, and Gabor T. Marth. 2012. “ART: A next-Generation Sequencing Read Simulator.” Bioinformatics 28 (4): 593–94. https://doi.org/10.1093/bioinformatics/btr708.

Huse, Susan M., David Mark Welch, Hilary G. Morrison, and Mitchell L. Sogin. 2010. “Ironing out the Wrinkles in the Rare Biosphere through Improved OTU Clustering.” Environmental Microbiology 12 (7): 1889–98. https://doi.org/10.1111/j.1462-2920.2010.02193.x.

Huson, Daniel H., Alexander F. Auch, Ji Qi, and Stephan C. Schuster. 2007. “MEGAN Analysis of Metagenomic Data.” Genome Research 17 (3): 377–86. https://doi.org/10.1101/gr.5969107.

Jacob, Jacob H., Emad I. Hussein, Muhamad Ali K. Shakhatreh, and Christopher T. Cornelison. 2017. “Microbial Community Analysis of the Hypersaline Water of the Dead Sea Using High-Throughput Amplicon Sequencing.” MicrobiologyOpen 6 (5): e00500. https://doi.org/10.1002/mbo3.500.

Jain, Miten, Sergey Koren, Karen H. Miga, Josh Quick, Arthur C. Rand, Thomas A. Sasani, John R. Tyson, et al. 2018. “Nanopore Sequencing and Assembly of a Human Genome with Ultra-Long Reads.” Nature Biotechnology 36 (January): 338–45. https://doi.org/10.1038/nbt.4060.

Janeway, Charles A. Jr., Paul Travers, Mark Walport, and Mark J. Shlomchik. 2001. “5-12. The Protein Products of MHC Class I and Class II Genes Are Highly Polymorphic.” In Immunobiology: The Immune System in Health and Disease, edited by Penelope Austin, Eleanor Lawrence, Sarah Gibbs, Mark Ditzel, Emma Hunt, Michael Morales, and Len Cegielka, 5th ed. New York, USA: Garland Publishing. https://www.ncbi.nlm.nih.gov/books/NBK27156/.

Jing, Gongchao, Zheng Sun, Honglei Wang, Yanhai Gong, Shi Huang, Kang Ning, Jian Xu, and Xiaoquan Su. 2017. “Parallel-META 3: Comprehensive Taxonomical and Functional Analysis Platform for Efficient Comparison of Microbial Communities.” Scientific Reports 7 (40371). https://doi.org/10.1038/srep40371.

Jun, Se-Ran, Michael S. Robeson, Loren J. Hauser, Christopher W. Schadt, and Andrey A. Gorin. 2015. “PanFP: Pangenome-Based Functional Profiles for Microbial Communities.” BMC Research Notes 8 (479). https://doi.org/10.1186/s13104-015-1462-8.

Jünemann, Sebastian, Nils Kleinbölting, Sebastian Jaenicke, Christian Henke, Julia Hassa, Johanna Nelkner, Yvonne Stolze, et al. 2017. “Bioinformatics for NGS-Based Metagenomics and the Application to Biogas Research.” Journal of Biotechnology 261: 10–23. https://doi.org/10.1016/j.jbiotec.2017.08.012.

Juul, Sissel, Fernando Izquierdo, Adam Hurst, Xiaoguang Dai, Amber Wright, Eugene Kulesha, Roger Pettett, and Daniel J. Turner. 2015. “What’s in My Pot? Real-Time Species Identification on the MinION™.” BioRxiv 030742 (January). https://doi.org/10.1101/030742.

Kan, Yuet Wai, and Andrée M. Dozy. 1978. “Polymorphism of DNA Sequence Adjacent to Human β-Globin Structural Gene: Relationship to Sickle Mutation.” Proceedings of the National Academy of Sciences of the United States of America 75 (11): 5631–35. www.jstor.org/stable/68688.

Kataoka, Takafumi, Haruyo Yamaguchi, Mayumi Sato, Tsuyoshi Watanabe, Yukiko Taniuchi, Akira Kuwata, and Masanobu Kawachi. 2017. “Seasonal and Geographical Distribution of Near-Surface Small Photosynthetic Eukaryotes in the Western North Pacific Determined by Pyrosequencing of 18S RDNA.” FEMS Microbiology Ecology 93 (2): fiw229. https://doi.org/10.1093/femsec/fiw229.

Kawamura, Y., X. G. Hou, F. Sultana, H. Miura, and T. Ezaki. 1995. “Determination of 16S RRNA Sequences of Streptococcus Mitis and Streptococcus Gordonii and Phylogenetic Relationships among Members of the Genus Streptococcus.” International Journal of Systematic Bacteriology 45 (2): 406–8. https://doi.org/10.1099/00207713-45-2-406.

Kayser, Manfred, and Peter M. Schneider. 2009. “DNA-Based Prediction of Human Externally Visible Characteristics in Forensics: Motivations, Scientific Challenges, and Ethical Considerations.” Forensic Science International: Genetics 3 (3): 154–61. https://doi.org/10.1016/j.fsigen.2009.01.012.

Keane, Thomas M., Leo Goodstadt, Petr Danecek, Michael A. White, Kim Wong, Binnaz Yalcin, Andreas Heger, et al. 2011. “Mouse Genomic Variation and Its Effect on Phenotypes and Gene Regulation.” Nature 477 (September): 289–94. https://doi.org/10.1038/nature10413.

Kettner, Marie Therese, Sonja Oberbeckmann, Matthias Labrenz, and Hans-Peter Grossart. 2019. “The Eukaryotic Life on Microplastics in Brackish Ecosystems.” Frontiers in Microbiology 10: 538. https://doi.org/10.3389/fmicb.2019.00538.

Kiełbasa, Szymon M., Raymond Wan, Kengo Sato, Paul Horton, and Martin C. Frith. 2011. “Adaptive Seeds Tame Genomic Sequence Comparison.” Genome Research 21: 487–93. https://doi.org/10.1101/gr.113985.110.

Klindworth, Anna, Elmar Pruesse, Timmy Schweer, Jörg Peplies, Christian Quast, Matthias Horn, and Frank Oliver Glöckner. 2013. “Evaluation of General 16S Ribosomal RNA Gene PCR Primers for Classical and Next-Generation Sequencing-Based Diversity Studies.” Nucleic Acids Research 41 (1): e1. https://doi.org/10.1093/nar/gks808.

Knight, Rob, Alison Vrbanac, Bryn C. Taylor, Alexander Aksenov, Chris Callewaert, Justine Debelius, Antonio Gonzalez, et al. 2018. “Best Practices for Analysing Microbiomes.” Nature Reviews Microbiology 16: 410–22. https://doi.org/10.1038/s41579-018-0029-9.

Konstantinidis, Konstantinos T., and James M. Tiedje. 2005. “Genomic Insights That Advance the Species Definition for Prokaryotes.” Proceedings of the National Academy of Sciences of the United States of America 102 (7): 2567–72. https://doi.org/10.1073/pnas.0409727102.

Korzhenkov, Aleksei A., Stepan V. Toshchakov, Rafael Bargiela, Huw Gibbard, Manuel Ferrer, Alina V. Teplyuk, David L. Jones, Ilya V. Kublanov, Peter N. Golyshin, and Olga V. Golyshina. 2019. “Archaea Dominate the Microbial Community in an Ecosystem with Low-to-Moderate Temperature and Extreme Acidity.” Microbiome 7 (1): 11. https://doi.org/10.1186/s40168-019-0623-8.

Krishnakumar, Raga, Anupama Sinha, Sara W. Bird, Harikrishnan Jayamohan, Harrison S. Edwards, Joseph S. Schoeniger, Kamlesh D. Patel, Steven S. Branda, and Michael S. Bartsch. 2018. “Systematic and Stochastic Influences on the Performance of the MinION Nanopore Sequencer across a Range of Nucleotide Bias.” Scientific Reports 8 (1): 1–13. https://doi.org/10.1038/s41598-018-21484-w.

Kukolya, József, Ildikó Bata-Vidács, Szabina Luzics, Erika Tóth, Zsuzsa Kéki, Peter Schumann, András Táncsics, István Nagy, Ferenc Olasz, and Ákos Tóth. 2018. “Xylanibacillus Composti Gen. Nov., Sp. Nov., Isolated from Compost.” International Journal of Systematic and Evolutionary Microbiology 68 (3): 698–702. https://doi.org/10.1099/ijsem.0.002523.

Kunin, Victor, Alex Copeland, Alla Lapidus, Konstantinos Mavromatis, and Philip Hugenholtz. 2008. “A Bioinformatician’s Guide to Metagenomics.” Microbiology and Molecular Biology Reviews 72 (4): 557–78. https://doi.org/10.1128/MMBR.00009-08.

Kunin, Victor, Anna Engelbrektson, Howard Ochman, and Philip Hugenholtz. 2010. “Wrinkles in the Rare Biosphere: Pyrosequencing Errors Can Lead to Artificial Inflation of Diversity Estimates.” Environmental Microbiology 12 (January): 118–23. https://doi.org/10.1111/j.1462-2920.2009.02051.x.

Laconcha, Urtzi, Mikel Iriondo, Haritz Arrizabalaga, Carmen Manzano, Pablo Markaide, Iratxe Montes, Iratxe Zarraonaindia, et al. 2015. “New Nuclear SNP Markers Unravel the Genetic Structure and Effective Population Size of Albacore Tuna (Thunnus Alalunga).” PLOS ONE 10 (6): e0128247. https://doi.org/10.1371/journal.pone.0128247.

Lamason, Rebecca L., Manzoor-Ali P. K. Mohideen, Jason R. Mest, Andrew C. Wong, Heather L. Norton, Michele C. Aros, Michael J. Jurynec, et al. 2005. “SLC24A5, a Putative Cation Exchanger, Affects Pigmentation in Zebrafish and Humans.” Science 310 (5755): 1782–86. https://doi.org/10.1126/science.1116238.

Langille, Morgan G. I., Jesse Zaneveld, J. Gregory Caporaso, Daniel McDonald, Dan Knights, Joshua A. Reyes, Jose C. Clemente, et al. 2013. “Predictive Functional Profiling of Microbial Communities Using 16S RRNA Marker Gene Sequences.” Nature Biotechnology 31 (August): 814–21. https://doi.org/10.1038/nbt.2676.

Le, Si Quang, and Richard Durbin. 2011. “SNP Detection and Genotyping from Low-Coverage Sequencing Data on Multiple Diploid Samples.” Genome Research 21 (June): 952–60. https://doi.org/10.1101/gr.113084.110.

LeBlanc, Jean Guy, Christian Milani, Graciela Savoy de Giori, Fernando Sesma, Douwe van Sinderen, and Marco Ventura. 2013. “Bacteria as Vitamin Suppliers to Their Host: A Gut Microbiota Perspective.” Current Opinion in Biotechnology 24 (2): 160–68. https://doi.org/10.1016/j.copbio.2012.08.005.

Leinonen, Rasko, Hideaki Sugawara, Martin Shumway, and on behalf of the International Nucleotide Sequence Database Collaboration. 2011. “The Sequence Read Archive.” Nucleic Acids Research 39 (suppl_1): D19–21. https://doi.org/10.1093/nar/gkq1019.

Leitch, Ilia J., and Michael D. Bennett. 1997. “Polyploidy in Angiosperms.” Trends in Plant Science 2 (12): 470–76. https://doi.org/10.1016/S1360-1385(97)01154-0.

Li, Heng. 2014. “Toward Better Understanding of Artifacts in Variant Calling from High-Coverage Samples.” Bioinformatics 30 (20): 2843–51. https://doi.org/10.1093/bioinformatics/btu356.

96

Li, Ruiqiang, Yingrui Li, Xiaodong Fang, Huanming Yang, Jian Wang, Karsten Kristiansen, and Jun Wang. 2009. “SNP Detection for Massively Parallel Whole-Genome Resequencing.” Genome Research 19 (June): 1124–32. https://doi.org/10.1101/gr.088013.108. Lindgreen, Stinus, Karen L. Adair, and Paul P. Gardner. 2016. “An Evaluation of the Accuracy and Speed of Metagenome Analysis Tools.” Scientific Reports 6: 19233. https://doi.org/10.1038/srep19233. Liu, Kuan-Liang, Andrea Porras-Alfaro, Cheryl R. Kuske, Stephanie A. Eichorst, and Gary Xie. 2012. “Accurate, Rapid Taxonomic Classification of Fungal Large-Subunit RRNA Genes.” Applied and Environmental Microbiology 78 (5): 1523–33. https://doi.org/10.1128/AEM.06826-11. Louca, Stilianos, Florent Mazel, Michael Doebeli, and Laura Wegener Parfrey. 2019. “A Census-Based Estimate of Earth’s Bacterial and Archaeal Diversity.” PLOS Biology 17 (2): e3000106. https://doi.org/10.1371/journal.pbio.3000106. Lovley, Derek R. 2003. “Cleaning up with Genomics: Applying Molecular Biology to Bioremediation.” Nature Reviews Microbiology 1: 35–44. https://doi.org/10.1038/nrmicro731. Lozupone, Catherine, and Rob Knight. 2005. “UniFrac: A New Phylogenetic Method for Comparing Microbial Communities.” Applied and Environmental Microbiology. https://doi.org/10.1128/AEM.71.12.8228-8235.2005. Masterson, Jane. 1994. “Stomatal Size in Fossil Plants: Evidence for Polyploidy in Majority of Angiosperms.” Science 264 (5157): 421–24. https://doi.org/10.1126/science.264.5157.421. Matsen, Frederick A., Robin B. Kodner, and E. Virginia Armbrust. 2010. “Pplacer: Linear Time Maximum-Likelihood and Bayesian Phylogenetic Placement of Sequences onto a Fixed Reference Tree.” BMC Bioinformatics 11 (538). https://doi.org/10.1186/1471- 2105-11-538. Matthews, B .W. 1975. “Comparison of the Predicted and Observed Secondary Structure of T4 Phage Lysozyme.” Biochimica et Biophysica Acta (BBA) - Protein Structure 405 (2): 442–51. 
https://doi.org/10.1016/0005-2795(75)90109-9. Maurer, Karl-Heinz. 2004. “Detergent Proteases.” Current Opinion in Biotechnology 15 (4): 330–34. https://doi.org/10.1016/j.copbio.2004.06.005. Mayr, Ernst. 1942. Systematics and the Origin of Species. New York City, USA: Columbia University Press. McDonald, Daniel, Morgan N. Price, Julia Goodrich, Eric P. Nawrocki, Todd Z. DeSantis, Alexander Probst, Gary L. Andersen, Rob Knight, and Philip Hugenholtz. 2012. “An Improved Greengenes Taxonomy with Explicit Ranks for Ecological and Evolutionary Analyses of Bacteria and Archaea.” The ISME Journal 6 (3): 610–18. https://doi.org/10.1038/ismej.2011.139. Meisel, Jacquelyn S., Geoffrey D. Hannigan, Amanda S. Tyldsley, Adam J. SanMiguel, Brendan P. Hodkinson, Qi Zheng, and Elizabeth A. Grice. 2016. “Skin Microbiome Surveys Are Strongly Influenced by Experimental Design.” Journal of Investigative Dermatology 136 (5): 947–56. https://doi.org/10.1016/j.jid.2016.01.016. Metzker, Michael L. 2010. “Sequencing Technologies - the next Generation.” Nature Reviews Genetics 11: 31–46. https://doi.org/10.1038/nrg2626. Meyerhans, Andreas, Jean-Pierre Vartanian, and Simon Wain-Hobson. 1990. “DNA Recombination during PCR.” Nucleic Acids Research 18 (7): 1687–91. https://doi.org/10.1093/nar/18.7.1687. Mielczarek, M., and J. Szyda. 2016. “Review of Alignment and SNP Calling Algorithms for Next-Generation Sequencing Data.” Journal of Applied Genetics 57: 71–79. https://doi.org/10.1007/s13353-015-0292-7. Mignard, S., and J. P. Flandrois. 2006. “16S RRNA Sequencing in Routine Bacterial Identification: A 30-Month Experiment.” Journal of Microbiological Methods 67 (3): 574– 81. https://doi.org/10.1016/j.mimet.2006.05.009.

97

Mizrahi-Man, Orna, Emily R. Davenport, and Yoav Gilad. 2013. “Taxonomic Classification of Bacterial 16S RRNA Genes Using Short Sequencing Reads: Evaluation of Effective Study Designs.” PLOS ONE 8 (1): e53608. https://doi.org/10.1371/journal.pone.0053608. Mukherjee, Supratim, Marcel Huntemann, Natalia Ivanova, Nikos C. Kyrpides, and Amrita Pati. 2015. “Large-Scale Contamination of Microbial Isolate Genomes by Illumina PhiX Control.” Standards in Genomic Sciences 10 (March): 18. https://doi.org/10.1186/1944- 3277-10-18. Munoz, Raúl, Pablo Yarza, Wolfgang Ludwig, Jean Euzéby, Rudolf Amann, Karl-Heinz Schleifer, Frank Oliver Glöckner, and Ramon Rosselló-Móra. 2011. “Release LTPs104 of the All-Species Living Tree.” Systematic and Applied Microbiology 34 (3): 169–70. https://doi.org/10.1016/j.syapm.2011.03.001. Nakamura, Yasukazu, Guy Cochrane, Ilene Karsch-Mizrachi, and International Nucleotide Sequence Database Collaboration. 2013. “The International Nucleotide Sequence Database Collaboration.” Nucleic Acids Research 41 (Database issue): D21–24. https://doi.org/10.1093/nar/gks1084. Natividad, Jane M. M., Valerie Petit, Xianxi Huang, Giada de Palma, Jennifer Jury, Yolanda Sanz, Dana Philpott, Clara L. Garcia Rodenas, Kathy D. McCoy, and Elena F. Verdu. 2012. “Commensal and Probiotic Bacteria Influence Intestinal Barrier Function and Susceptibility to Colitis in Nod1-/-; Nod2-/- Mice.” Inflammatory Bowel Diseases 18 (8): 1434–46. https://doi.org/10.1002/ibd.22848. Nesbø, Camilla L., Marlena Dlutek, and W. Ford Doolittle. 2006. “Recombination in Thermotoga: Implications for Species Concepts and Biogeography.” Genetics 172 (2): 759–69. https://doi.org/10.1534/genetics.105.049312. Nielsen, Rasmus, Joshua S. Paul, Anders Albrechtsen, and Yun S. Song. 2011. “Genotype and SNP Calling from Next-Generation Sequencing Data.” Nature Reviews Genetics 12: 443–51. https://doi.org/10.1038/nrg2986. 
Noguera-Julian, Marc, Muntsa Rocafort, Yolanda Guillén, Javier Rivera, Maria Casadellà, Piotr Nowak, Falk Hildebrand, et al. 2016. “Gut Microbiota Linked to Sexual Preference and HIV Infection.” EBioMedicine 5: 135–46. https://doi.org/10.1016/j.ebiom.2016.01.032. Okuda, Shujiro, Yuki Tsuchiya, Chiho Kiriyama, Masumi Itoh, and Hisao Morisaki. 2012. “Virtual Metagenome Reconstruction from 16S RRNA Gene Sequences.” Nature Communications 3 (1203). https://doi.org/10.1038/ncomms2203. Olsen, Gary J., and Carl R. Woese. 1993. “Ribosomal RNA : A Key to Phylogeny.” The FASEB Journal 7 (1): 113–23. Olson, Nathan D., Steven P. Lund, Rebecca E. Colman, Jeffrey T. Foster, Jason W. Sahl, James M. Schupp, Paul Keim, Jayne B. Morrow, Marc L. Salit, and Justin M. Zook. 2015. “Best Practices for Evaluating Single Nucleotide Variant Calling Methods for Microbial Genomics.” Frontiers in Genetics 6: 235. https://doi.org/10.3389/fgene.2015.00235. Ondov, Brian D., Nicholas H. Bergman, and Adam M. Phillippy. 2011. “Interactive Metagenomic Visualization in a Web Browser.” BMC Bioinformatics 12: 385. https://doi.org/10.1186/1471-2105-12-385. Pantou, Dimitra, Helen Rizou, Haroula Tsarouha, Anastasia Pouli, Kostas Papanastasiou, Marina Stamatellou, Theoni Trangas, Nikos Pandis, and Georgia Bardi. 2005. “Cytogenetic Manifestations of Multiple Myeloma Heterogeneity.” Genes Chromosomes Cancer 42 (January): 44–57. https://doi.org/10.1002/gcc.20114. Parte, Aidan C. 2014. “LPSN—List of Prokaryotic Names with Standing in Nomenclature.” Nucleic Acids Research 42 (D1): D613–16. https://doi.org/10.1093/nar/gkt1111. Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, et al. 2011. “Scikit-Learn: Machine Learning in Python.” Journal of Machine Learning Research 12 (November): 2825–30. http://dl.acm.org/citation.cfm?id=1953048.2078195.

98

Petrovskaya, Lada E., Ksenia A. Novototskaya-Vlasova, Elena V. Spirina, Ekaterina V. Durdenko, Galina Yu Lomakina, Maria G. Zavialova, Evgeny N. Nikolaev, and Elizaveta M. Rivkina. 2016. “Expression and Characterization of a New Esterase with GCSAG Motif from a Permafrost Metagenomic Library.” FEMS Microbiology Ecology 92 (5). https://doi.org/10.1093/femsec/fiw046. Phillips, C., A. Salas, J. J. Sánchez, M. Fondevila, A. Gómez-Tato, J. Álvarez-Dios, M. Calaza, et al. 2007. “Inferring Ancestral Origin Using a Single Multiplex Assay of Ancestry-Informative Marker SNPs.” Forensic Science International: Genetics 1 (3–4): 273–80. https://doi.org/10.1016/j.fsigen.2007.06.008. Poli, Annarita, Ilaria Finore, Ida Romano, Alessia Gioiello, Licia Lama, and Barbara Nicolaus. 2017. “Microbial Diversity in Extreme Marine Habitats and Their Biomolecules.” Microorganisms. https://doi.org/10.3390/microorganisms5020025. Prosser, James I. 2010. “Replicate or Lie.” Environmental Microbiology 12 (7): 1806–10. https://doi.org/10.1111/j.1462-2920.2010.02201.x. Quick, Joshua, Nicholas J Loman, Sophie Duraffour, Jared T Simpson, Ettore Severi, Lauren Cowley, Joseph Akoi Bore, et al. 2016. “Real-Time, Portable Genome Sequencing for Ebola Surveillance.” Nature 530 (February): 228. https://doi.org/10.1038/nature16996. Quince, Christopher, Alan W. Walker, Jared T. Simpson, Nicholas J. Loman, and Nicola Segata. 2017. “Shotgun Metagenomics, from Sampling to Analysis.” Nature Biotechnology 35 (9): 833–44. https://doi.org/doi:10.1038/nbt.3935. R Core Team. 2018. “R: A Language and Environment for Statistical Computing.” Vienna, Austria: R Foundation for Statistical Computing. https://www.r-project.org/. Raphael, Benjamin J., Jason R. Dobson, Layla Oesper, and Fabio Vandin. 2014. “Identifying Driver Mutations in Sequenced Cancer Genomes: Computational Approaches to Enable Precision Medicine.” Genome Medicine 6: 5. https://doi.org/10.1186/gm524. 
Reboul, Guillaume, David Moreira, Paola Bertolino, Alexandra Maria Hillebrand-Voiculescu, and Purificación López-García. 2019. “Microbial Eukaryotes in the Suboxic Chemosynthetic Ecosystem of Movile Cave, Romania.” Environmental Microbiology Reports, April. https://doi.org/10.1111/1758-2229.12756. Redmond, Molly C., and David L. Valentine. 2012. “Natural Gas and Temperature Structured a Microbial Community Response to the Deepwater Horizon Oil Spill.” Proceedings of the National Academy of Sciences 109 (50): 20292 LP – 20297. https://doi.org/10.1073/pnas.1108756108. Reeves, P. R., and R. Lan. 1996. “Gene Transfer Is a Major Factor in Bacterial Evolution.” Molecular Biology and Evolution 13 (1): 47–55. https://doi.org/10.1093/oxfordjournals.molbev.a025569. Riley, Margaret A., and Michelle Lizotte-Waniewski. 2009. “Population Genomics and the Bacterial Species Concept.” In Horizontal Gene Transfer: Genomes in Flux, edited by Maria Boekels Gogarten, Johann Peter Gogarten, and Lorraine C. Olendzenski, 367– 77. Totowa, NJ: Humana Press. https://doi.org/10.1007/978-1-60327-853-9_21. Rognes, Torbjørn, Tomáš Flouri, Ben Nichols, Christopher Quince, and Frédéric Mahé. 2016. “VSEARCH: A Versatile Open Source Tool for Metagenomics.” Edited by Tomas Hrbek. PeerJ 4: e2584. https://doi.org/10.7717/peerj.2584. Rohwer, Forest, and Rob Edwards. 2002. “The Phage Proteomic Tree: A Genome-Based Taxonomy for Phage.” Journal of Bacteriology 184 (16): 4529–35. https://doi.org/10.1128/JB.184.16.4529-4535.2002. Rosselló-Mora, Ramon, and Rudolf Amann. 2001. “The Species Concept for Prokaryotes.” FEMS Microbiology Reviews 25 (1): 39–67. https://doi.org/10.1111/j.1574- 6976.2001.tb00571.x. Sayers, Eric W., Richa Agarwala, Evan E. Bolton, J. Rodney Brister, Kathi Canese, Karen Clark, Ryan Connor, et al. 2019. “Database Resources of the National Center for Biotechnology Information.” Nucleic Acids Research 47 (D1): D23–28. https://doi.org/10.1093/nar/gky1069.

99

Scanlan, Pauline D., and Julian R. Marchesi. 2008. “Micro-Eukaryotic Diversity of the Human Distal Gut Microbiota: Qualitative Assessment Using Culture-Dependent and - Independent Analysis of Faeces.” The ISME Journal 2 (12): 1183–93. https://doi.org/10.1038/ismej.2008.76. Schloss, Patrick D. 2010. “The Effects of Alignment Quality, Distance Calculation Method, Sequence Filtering, and Region on the Analysis of 16S RRNA Gene-Based Studies.” PLOS Computational Biology 6 (7): e1000844. https://doi.org/10.1371/journal.pcbi.1000844. Schloss, Patrick D., and Sarah L. Westcott. 2011. “Assessing and Improving Methods Used in Operational Taxonomic Unit-Based Approaches for 16S RRNA Gene Sequence Analysis.” Applied and Environmental Microbiology 77 (10): 3219–26. https://doi.org/10.1128/AEM.02810-10. Schmid, A., J. S. Dordick, B. Hauer, A. Kiener, M. Wubbolts, and B. Witholt. 2001. “Industrial Biocatalysis Today and Tomorrow.” Nature 409 (January): 258–68. https://doi.org/10.1038/35051736. Schoch, Conrad L., Keith A. Seifert, Sabine Huhndorf, Vincent Robert, John L. Spouge, C. Andre Levesque, Wen Chen, and Fungal Barcoding Consortium. 2012. “Nuclear Ribosomal Internal Transcribed Spacer (ITS) Region as a Universal DNA Barcode Marker for Fungi.” Proceedings of the National Academy of Sciences of the United States of America 109 (16): 6241–46. https://doi.org/10.1073/pnas.1117018109. Schuler, Gregory D., Jonathan A. Epstein, Hitomi Ohkawa, and Jonathan A. Kans. 1996. “[10] Entrez: Molecular Biology Database and Retrieval System.” In Methods in Enzymology, 266:141–62. Academic Press. https://doi.org/10.1016/S0076- 6879(96)66012-1. Segata, Nicola, Daniela Boernigen, Timothy L. Tickle, Xochitl C. Morgan, Wendy S. Garrett, and . 2013. “Computational Meta’omics for Microbial Community Studies.” Molecular Systems Biology 9 (666). https://doi.org/10.1038/msb.2013.22. Shabardina, Victoria, Tabea Kischka, Felix Manske, Norbert Grundmann, Martin C. 
Frith, Yutaka Suzuki, and Wojciech Makałowski. 2019. “NanoPipe—a Web Server for Nanopore MinION Sequencing Data Analysis.” GigaScience 8 (2): giy169. https://doi.org/10.1093/gigascience/giy169. Shen, Hui, Jian Li, Jigang Zhang, Chao Xu, Yan Jiang, Zikai Wu, Fuping Zhao, et al. 2013. “Comprehensive Characterization of Human Genome Variation by High Coverage Whole-Genome Sequencing of Forty Four Caucasians.” PLOS ONE 8 (4): e59494. https://doi.org/10.1371/journal.pone.0059494. Shendure, Jay, and Hanlee Ji. 2008. “Next-Generation DNA Sequencing.” Nature Biotechnology 26 (October): 1135–45. https://doi.org/10.1038/nbt1486. Sherry, S .T., M.-H. Ward, M. Kholodov, J. Baker, L. Phan, E. M. Smigielski, and K. Sirotkin. 2001. “dbSNP: The NCBI Database of Genetic Variation.” Nucleic Acids Research 29 (1): 308–11. https://doi.org/10.1093/nar/29.1.308. Shin, Jongoh, Sooin Lee, Min-Jeong Go, Sang Yup Lee, Sun Chang Kim, Chul-Ho Lee, and Byung-Kwan Cho. 2016. “Analysis of the Mouse Gut Microbiome Using Full-Length 16S RRNA Amplicon Sequencing.” Scientific Reports 6: 29681. https://doi.org/10.1038/srep29681. Shuldiner, Alan R., Ajay Nirula, and Jesse Roth. 1989. “Hybrid DNA Artifact from PCR of Closely Related Target Sequences.” Nucleic Acids Research 17 (11): 4409. https://doi.org/10.1093/nar/17.11.4409. Sipos, Rita, Anna J. Székely, Márton Palatinszky, Sára Révész, Károly Márialigeti, and Marcell Nikolausz. 2007. “Effect of Primer Mismatch, Annealing Temperature and PCR Cycle Number on 16S RRNA Gene-Targetting Bacterial Community Analysis.” FEMS Microbiology Ecology 60 (2): 341–50. https://doi.org/10.1111/j.1574-6941.2007.00283.x. Smith, Mark B., Andrea M. Rocha, Chris S. Smillie, Scott W. Olesen, Charles Paradis, Liyou Wu, James H. Campbell, et al. 2015. “Natural Bacterial Communities Serve as Quantitative Geochemical Biosensors.” Edited by Steven E Lindow. MBio 6 (3): e00326- 15. https://doi.org/10.1128/mBio.00326-15.

100

Snustad, Peter D., and Michael J. Simmons. 2010. “DNA and the Molecular Structure of Chromosomes.” In PRINCIPLES OF GENETICS, edited by Ashley Hager, 5th ed., 210– 43. John Wiley & Sons (Asia) Pte Ltd. Speed, Terence P., and Yuval Benjamini. 2012. “Summarizing and Correcting the GC Content Bias in High-Throughput Sequencing.” Nucleic Acids Research 40 (10): e72. https://doi.org/10.1093/nar/gks001. Stackebrandt, Erko, and Jonas Ebers. 2006. “Taxonomic Parameters Revisited: Tarnished Gold Standards.” Microbiology Today 33 (nov06): 152–55. Stackebrandt, Erko, and Brett M. Goebel. 1994. “Taxonomic Note: A Place for DNA-DNA Reassociation and 16S RRNA Sequence Analysis in the Present Species Definition in Bacteriology.” International Journal of Systematic and Evolutionary Microbiology 44: 846–49. https://doi.org/10.1099/00207713-44-4-846. Stoddard, Steven F., Byron J. Smith, Robert Hein, Benjamin R. K. Roller, and Thomas M. Schmidt. 2015. “RrnDB: Improved Tools for Interpreting RRNA Gene Abundance in Bacteria and Archaea and a New Foundation for Future Development.” Nucleic Acids Research 43 (D1): D593–98. https://doi.org/10.1093/nar/gku1201. Suojanen, James N. 1999. “False False Positive Rates.” New England Journal of Medicine 341 (July): 131. https://doi.org/10.1056/NEJM199907083410217. Suzuki, Marcelino T., and Stephen J. Giovannoni. 1996. “Bias Caused by Template Annealing in the Amplification of Mixtures of 16S RRNA Genes by PCR.” Applied and Environmental Microbiology 62 (2): 625–30. Suzuki, Shino, Kenneth H. Nealson, and Shun’ichi Ishii. 2018. “Genomic and In-Situ Transcriptomic Characterization of the Candidate Phylum NPL-UPL2 From Highly Alkaline Highly Reducing Serpentinized Groundwater.” Frontiers in Microbiology 9 (3141). https://doi.org/10.3389/fmicb.2018.03141. Tanner, Michael A., Brett M. Goebel, Michael A. Dojka, and Norman R. Pace. 1998. 
“Specific Ribosomal DNA Sequences from Diverse Environmental Settings Correlate with Experimental Contaminants.” Applied and Environmental Microbiology 64 (8): 3110–13. http://aem.asm.org/content/64/8/3110.abstract. The 1000 Genomes Project Consortium. 2010. “A Map of Human Genome Variation from Population-Scale Sequencing.” Nature 467 (October): 1061–73. https://doi.org/10.1038/nature09534. The Wellcome Trust Case Control Consortium. 2007. “Genome-Wide Association Study of 14,000 Cases of Seven Common Diseases and 3,000 Shared Controls.” Nature 447 (June): 661–78. https://doi.org/10.1038/nature05911. Thomas, Torsten, Jack Gilbert, and Folker Meyer. 2012. “Metagenomics - a Guide from Sampling to Data Analysis.” Microbial Informatics and Experimentation 2 (3). https://doi.org/10.1186/2042-5783-2-3. Thursby, Elizabeth, and Nathalie Juge. 2017. “Introduction to the Human Gut Microbiota.” Biochemical Journal 474 (11): 1823 LP – 1836. https://doi.org/10.1042/BCJ20160510. Tomas, Carmen, Juan J. Sanchez, Jose Aurelio Castro, Claus Børsting, and Niels Morling. 2010. “Forensic Usefulness of a 25 X-Chromosome Single-Nucleotide Polymorphism Marker Set.” Transfusion 50 (October): 2258–65. https://doi.org/10.1111/j.1537- 2995.2010.02696.x. Torsvik, V., R. Sørheim, and J. Goksøyr. 1996. “Total Bacterial Diversity in Soil and Sediment Communities—A Review.” Journal of Industrial Microbiology 17 (3–4): 170– 78. https://doi.org/10.1007/BF01574690. Tsukamura, Michio, Ikuya Yano, and Tamotsu Imaeda. 1986. “Mycobacterium Moriokaense Sp. Nov., a Rapidly Growing, Nonphotochromogenic Mycobacterium.” International Journal of Systematic Bacteriology 36 (2): 333–38. https://doi.org/10.1099/00207713- 36-2-333. Tuzun, Eray, Andrew J. Sharp, Jeffrey A. Bailey, Rajinder Kaul, V. Anne Morrison, Lisa M. Pertz, Eric Haugen, et al. 2005. “Fine-Scale Structural Variation of the Human Genome.” Nature Genetics 37 (May): 727–32. https://doi.org/10.1038/ng1562.

101

Venter, J .Craig, Karin Remington, John F. Heidelberg, Aaron L. Halpern, Doug Rusch, Jonathan A. Eisen, Dongying Wu, et al. 2004. “Environmental Genome Shotgun Sequencing of the Sargasso Sea.” Science 304 (5667): 66–74. https://doi.org/10.1126/science.1093857. Větrovský, Tomáš, and Petr Baldrian. 2013. “The Variability of the 16S RRNA Gene in Bacterial Genomes and Its Consequences for Bacterial Community Analyses.” PLOS ONE 8 (2): e57923. https://doi.org/10.1371/journal.pone.0057923. Vignal, Alain, Denis Milan, Magali SanCristobal, and André Eggen. 2002. “A Review on SNP and Other Types of Molecular Markers and Their Use in Animal Genetics.” Genetics Selection Evolution 34: 275–305. https://doi.org/10.1186/1297-9686-34-3-275. Wakeley, John. 1996. “The Excess of Transitions among Nucleotide Substitutions: New Methods of Estimating Transition Bias Underscore Its Significance.” Trends in Ecology and Evolution 11 (4): 158–62. https://doi.org/10.1016/0169-5347(96)10009-4. Walsh, Susan, Fan Liu, Kaye N. Ballantyne, Mannis van Oven, Oscar Lao, and Manfred Kayser. 2011. “IrisPlex: A Sensitive DNA Tool for Accurate Prediction of Blue and Brown Eye Colour in the Absence of Ancestry Information.” Forensic Science International: Genetics 5 (3): 170–80. https://doi.org/10.1016/j.fsigen.2010.02.004. Wang, Kai, Mingyao Li, and Hakon Hakonarson. 2010. “ANNOVAR: Functional Annotation of Genetic Variants from High-Throughput Sequencing Data.” Nucleic Acids Research 38 (16): e164. https://doi.org/10.1093/nar/gkq603. Wang, Qiong, and Jim R. Cole. 2014. “Comparison of Three Fugal ITS Reference Sets.” RDP TECHNICAL REPORT. https://rdp.cme.msu.edu/download/posters/fungalITSreport_062014.pdf. Accessed 03.04.2019. Wang, Qiong, George M. Garrity, James M. Tiedje, and James R. Cole. 2007. “Naïve Bayesian Classifier for Rapid Assignment of RRNA Sequences into the New Bacterial Taxonomy.” Applied and Environmental Microbiology 73 (16): 5261–67. https://doi.org/10.1128/AEM.00062-07. 
Wattam, Alice R., David Abraham, Oral Dalay, Terry L. Disz, Timothy Driscoll, Joseph L. Gabbard, Joseph J. Gillespie, et al. 2014. “PATRIC, the Bacterial Bioinformatics Database and Analysis Resource.” Nucleic Acids Research 42 (D1): D581–91. https://doi.org/10.1093/nar/gkt1099. Wattam, Alice R., James J. Davis, Rida Assaf, Sébastien Boisvert, Thomas Brettin, Christopher Bun, Neal Conrad, et al. 2017. “Improvements to PATRIC, the All-Bacterial Bioinformatics Database and Analysis Resource Center.” Nucleic Acids Research 45 (D1): D535–42. https://doi.org/10.1093/nar/gkw1017. Wayne, L.G., D.J. Brenner, R.R. Colwell, P.A.D. Grimont, O. Kandler, M.I. Krichevsky, L.H. Moore, et al. 1987. “Report of the Ad Hoc Committee on Reconciliation of Approaches to Bacterial Systematics.” International Journal of Systematic Bacteriology 37 (4): 463– 64. https://doi.org/10.1099/00207713-37-4-463. Wickham, Hadley. 2007. “Reshaping Data with the Reshape Package.” Journal of Statistical Software 21 (12): 1–20. https://doi.org/10.18637/jss.v021.i12. ———. 2017. “Tidyverse: Easily Install and Load the ‘Tidyverse.’” https://cran.r- project.org/package=tidyverse. Williams, C. B. 1937. “THE USE OF LOGARITHMS IN THE INTERPRETATION OF CERTAIN ENTOMOLOGICAL PROBLEMS.” Annals of Applied Biology 24 (May): 404– 14. https://doi.org/10.1111/j.1744-7348.1937.tb05042.x. Wilmes, Paul, and Philip L Bond. 2006. “Metaproteomics: Studying Functional Gene Expression in Microbial Ecosystems.” Trends in Microbiology 14 (2): 92–97. https://doi.org/10.1016/j.tim.2005.12.006. Wilson, Michael R., Samia N. Naccache, Erik Samayoa, Mark Biagtan, Hiba Bashir, Guixia Yu, Shahriar M. Salamat, et al. 2014. “Actionable Diagnosis of Neuroleptospirosis by Next-Generation Sequencing.” New England Journal of Medicine 370 (25): 2408–17. https://doi.org/10.1056/NEJMoa1401268.

102

Wintzingerode, Friedrich v., Ulf B. Göbel, and Erko Stackebrandt. 1997. “Determination of Microbial Diversity in Environmental Samples: Pitfalls of PCR-Based RRNA Analysis.” FEMS Microbiology Reviews 21 (3): 213–29. https://doi.org/10.1111/j.1574- 6976.1997.tb00351.x. Woese, Carl R. 1987. “Bacterial Evolution.” Microbiological Reviews 51 (2): 221–71. Woese, Carl R., and George E. Fox. 1977. “Phylogenetic Structure of the Prokaryotic Domain: The Primary Kingdoms.” Proceedings of the National Academy of Sciences 74 (11): 5088–90. https://doi.org/10.1073/pnas.74.11.5088. Wu, Gary D., Jun Chen, Christian Hoffmann, Kyle Bittinger, Ying-Yu Chen, Sue A. Keilbaugh, Meenakshi Bewtra, et al. 2011. “Linking Long-Term Dietary Patterns with Gut Microbial Enterotypes.” Science 334 (6052): 105–8. https://doi.org/10.1126/science.1208344. Yang, Chen, Justin Chu, René L. Warren, and Inanç Birol. 2017. “NanoSim: Nanopore Sequence Read Simulator Based on Statistical Characterization.” GigaScience 6 (4): 1– 6. https://doi.org/10.1093/gigascience/gix010. Yang, Ziheng, and Joseph P. Bielawski. 2000. “Statistical Methods for Detecting Molecular Adaptation.” Trends in Ecology & Evolution 15 (12): 496–503. https://doi.org/10.1016/S0169-5347(00)01994-7. Yilmaz, Pelin, Renzo Kottmann, Dawn Field, Rob Knight, James R. Cole, Linda Amaral- Zettler, Jack A. Gilbert, et al. 2011. “Minimum Information about a Marker Gene Sequence (MIMARKS) and Minimum Information about Any (x) Sequence (MIxS) Specifications.” Nature Biotechnology 29 (May): 415–20. https://doi.org/10.0.4.14/nbt.1823.

103

6. Supplemental figures

Figure S1: MCC from domain to species for MetaG using the MTX or RDP database on the BA sample. The sample was subject to simulated nanopore sequencing.

Figure S2: MCC from domain to species for Parallel-META 3 using the alignment modes (align) zero to three on the BA sample. The sample was subject to simulated nanopore sequencing.

Figure S3: MCC from domain to species for QIIME 2 Blast using the MTX database on the BA sample. The sample was subject to simulated nanopore sequencing. Results were filtered by the confidence cutoffs (conf) 0, 0.5 and 1, respectively.

Figure S4: MCC from domain to species for QIIME 2 Blast using the RDP database on the BA sample. The sample was subject to simulated nanopore sequencing. Results were filtered by the confidence cutoffs (conf) 0, 0.5 and 1, respectively.

Figure S5: MCC from domain to species for the QIIME 2 Classifier using the MTX database on the BA sample. The sample was subject to simulated nanopore sequencing. Results were filtered by the confidence cutoffs (conf) 0, 0.5 and 1, respectively. The MCC for the confidence cutoff of 1 was not defined.

Figure S6: MCC from domain to species for the RDP Classifier analyzing the BA sample with its 16S rRNA gene database. The sample was subject to simulated nanopore sequencing. Results were filtered by the confidence cutoffs (conf) 0, 0.5 and 1, respectively. The MCC was not defined for the confidence cutoff of 0 or, at any cutoff, for species-level assignments.

Figure S7: MCC from domain to species for MetaG using the MTX or RDP database on the BFA sample. The sample was subject to simulated nanopore sequencing.

Figure S8: MCC from domain to species for QIIME 2 Blast using the MTX database on the BFA sample. The sample was subject to simulated nanopore sequencing. Results were filtered by the confidence cutoffs (conf) 0, 0.5 and 1, respectively.

Figure S9: MCC from domain to species for QIIME 2 Blast using the RDP database on the BFA sample. The sample was subject to simulated nanopore sequencing. Results were filtered by the confidence cutoffs (conf) 0, 0.5 and 1, respectively.

Figure S10: MCC from domain to species for the QIIME 2 Classifier using the MTX database on the BFA sample. The sample was subject to simulated nanopore sequencing. Results were filtered by the confidence cutoffs (conf) 0, 0.5 and 1, respectively. The MCC for the confidence cutoff of 1 was not defined.

Figure S11: MCC from domain to species for the RDP Classifier using its 16S rRNA training set 16 in conjunction with LSU11 on the BFA sample. The sample was subject to simulated nanopore sequencing. Results were filtered by the confidence cutoffs (conf) 0, 0.5 and 1, respectively. The MCC was not defined for the confidence cutoff of 0 or, at any cutoff, for species-level assignments.

Figure S12: MCC from domain to species for the RDP Classifier using its 16S rRNA training set 16 in conjunction with UNITE on the BFA sample. The sample was subject to simulated nanopore sequencing. Results were filtered by the confidence cutoffs (conf) 0, 0.5 and 1, respectively. The MCC for the confidence cutoff of 0 was defined only at the species level.

Figure S13: MCC from domain to species for the RDP Classifier using its 16S rRNA training set 16 in conjunction with WARCUP2 on the BFA sample. The sample was subject to simulated nanopore sequencing. Results were filtered by the confidence cutoffs (conf) 0, 0.5 and 1, respectively. The MCC for the confidence cutoff of 0 was defined only at the species level. The MCC for the confidence cutoff of 1 was not defined at the species level.

Figure S14: Sensitivity from domain to species for classifiers using their chosen settings to analyze the BA sample. The sample was subject to simulated nanopore sequencing. 16S is the 16S rRNA training set 16 of the RDP Classifier.

Figure S15: Precision from domain to species for classifiers using their chosen settings to analyze the BA sample. The sample was subject to simulated nanopore sequencing. 16S is the 16S rRNA training set 16 of the RDP Classifier. The precision at species level was undefined for the RDP Classifier.

Figure S16: Specificity from domain to species for classifiers using their chosen settings to analyze the BA sample. The sample was subject to simulated nanopore sequencing. 16S is the 16S rRNA training set 16 of the RDP Classifier.

Figure S17: Sensitivity from domain to species for classifiers using their chosen settings to analyze the BFA sample. The sample was subject to simulated nanopore sequencing. 16S is the 16S rRNA training set 16 of the RDP Classifier.

Figure S18: Precision from domain to species for classifiers using their chosen settings to analyze the BFA sample. The sample was subject to simulated nanopore sequencing. 16S is the 16S rRNA training set 16 of the RDP Classifier. The precision for the RDP Classifier with the 16S and LSU11 databases was undefined at species level.

Figure S19: Specificity from domain to species for classifiers using their chosen settings to analyze the BFA sample. The sample was subject to simulated nanopore sequencing. 16S is the 16S rRNA training set 16 of the RDP Classifier.

Figure S20: MCC from domain to species for MetaG using the MTX or RDP database on the BA sample. The sample was subject to simulated MiSeq sequencing.

Figure S21: MCC from domain to species for Parallel-META 3 using the alignment modes (align) zero to three on the BA sample. The sample was subject to simulated MiSeq sequencing.

Figure S22: MCC from domain to species for QIIME 2 Blast using the MTX database on the BA sample. The sample was subject to simulated MiSeq sequencing. Results were filtered by the confidence cutoffs (conf) 0, 0.5 and 1, respectively.

Figure S23: MCC from domain to species for QIIME 2 Blast using the RDP database on the BA sample. The sample was subject to simulated MiSeq sequencing. Results were filtered by the confidence cutoffs (conf) 0, 0.5 and 1, respectively.

Figure S24: MCC from domain to species for the QIIME 2 Classifier using the MTX database on the BA sample. The sample was subject to simulated MiSeq sequencing. Results were filtered by the confidence cutoffs (conf) 0, 0.5 and 1, respectively. The MCC for the confidence cutoff of 1 was not defined.

Figure S25: MCC from domain to species for the RDP Classifier analyzing the BA sample with its 16S rRNA training set 16. The sample was subject to simulated MiSeq sequencing. Results were filtered by the confidence cutoffs (conf) 0, 0.5 and 1, respectively. The MCC for the confidence cutoff of 0 was not defined.

Figure S26: MCC from domain to species for MetaG using the MTX or RDP database on the BFA sample. The sample was subject to simulated MiSeq sequencing.

Figure S27: MCC from domain to species for QIIME 2 Blast using the MTX database on the BFA sample. The sample was subject to simulated MiSeq sequencing. Results were filtered by the confidence cutoffs (conf) 0, 0.5 and 1, respectively.

Figure S28: MCC from domain to species for QIIME 2 Blast using the RDP database on the BFA sample. The sample was subject to simulated MiSeq sequencing. Results were filtered by the confidence cutoffs (conf) 0, 0.5 and 1, respectively.

Figure S29: MCC from domain to species for the QIIME 2 Classifier using the MTX database on the BFA sample. The sample was subject to simulated MiSeq sequencing. Results were filtered by the confidence cutoffs (conf) 0, 0.5 and 1, respectively. The MCC for the confidence cutoff of 1 was not defined.

Figure S30: MCC from domain to species for the RDP Classifier using its 16S rRNA training set 16 in conjunction with LSU11 on the BFA sample. The sample was subject to simulated MiSeq sequencing. Results were filtered by the confidence cutoffs (conf) 0, 0.5 and 1, respectively. The MCC was not defined for the confidence cutoff of 0 or for any species assignments.

Figure S31: MCC from domain to species for the RDP Classifier using its 16S rRNA training set 16 in conjunction with UNITE on the BFA sample. The sample was subject to simulated MiSeq sequencing. Results were filtered by the confidence cutoffs (conf) 0, 0.5 and 1, respectively. The MCC for the confidence cutoff 0 was only defined at the species level. The MCC for the confidence cutoff 1 was not defined at the species level.

Figure S32: MCC from domain to species for the RDP Classifier using its 16S rRNA training set 16 in conjunction with WARCUP2 on the BFA sample. The sample was subject to simulated MiSeq sequencing. Results were filtered by the confidence cutoffs (conf) 0, 0.5 and 1, respectively. The MCC for the confidence cutoff 0 was only defined at the species level. The MCC for the confidence cutoff 1 was not defined at the species level.

Figure S33: Sensitivity from domain to species for classifiers using their chosen settings to analyze the BA sample. The sample was subject to simulated MiSeq sequencing. 16S is the 16S rRNA training set 16 of the RDP Classifier.

Figure S34: Precision from domain to species for classifiers using their chosen settings to analyze the BA sample. The sample was subject to simulated MiSeq sequencing. 16S is the 16S rRNA training set 16 of the RDP Classifier. The precision for the RDP Classifier was undefined at species level.

Figure S35: Specificity from domain to species for classifiers using their chosen settings to analyze the BA sample. The sample was subject to simulated MiSeq sequencing. 16S is the 16S rRNA training set 16 of the RDP Classifier.

Figure S36: Sensitivity from domain to species for classifiers using their chosen settings to analyze the BFA sample. The sample was subject to simulated MiSeq sequencing. 16S is the 16S rRNA training set 16 of the RDP Classifier.

Figure S37: Precision from domain to species for classifiers using their chosen settings to analyze the BFA sample. The sample was subject to simulated MiSeq sequencing. 16S is the 16S rRNA training set 16 of the RDP Classifier. The precision for the RDP Classifier with the 16S database supplemented with LSU11 and WARCUP2, respectively, was undefined at species level.

Figure S38: Specificity from domain to species for classifiers using their chosen settings to analyze the BFA sample. The sample was subject to simulated MiSeq sequencing. 16S is the 16S rRNA training set 16 of the RDP Classifier.

Declaration of Academic Integrity

I hereby confirm that this thesis on “Developing online tools for metagenomic analysis and SNP detection using Nanopore sequencing data” is solely my own work and that I have used no sources or aids other than the ones stated. All passages in my thesis for which other sources, including electronic media, have been used, be it direct quotes or content references, have been acknowledged as such and the sources cited.

______(date and signature of student)

I agree to have my thesis checked in order to rule out potential similarities with other works and to have my thesis stored in a database for this purpose.

______(date and signature of student)
