Comparative analysis of urban microbiomes using machine learning algorithms

MASTER'S THESIS

FOR THE ATTAINMENT OF THE ACADEMIC DEGREE MASTER OF SCIENCE IN ENGINEERING

AT FH CAMPUS WIEN UNIVERSITY OF APPLIED SCIENCES, MASTER'S PROGRAMME BIOINFORMATICS

Submitted by: René Lenz
Student ID: c1810542004

Principal supervisor: FH-Prof. Mag. Dr. Alexandra Graf

Submission date: 19.11.2020

FH Campus Wien University of Applied Sciences / Fachbereich Bioengineering

Declaration:

I declare that this thesis was written by myself and that I used no aids other than those stated, nor did I make use of any other unauthorised help. I affirm that I have not previously submitted this thesis topic in any form as an examination paper, either in Austria or abroad (to an assessor for appraisal). I further affirm that the copies I have submitted (printed and electronic) are identical.

19.11.2020
Date / Signature

Abstract

Metagenomic research is capable of creating insights into the microbial composition of samples. With the help of machine learning methods it is, to a certain degree, possible to assign unknown samples to their origin. Combining these two approaches also opens up new fields for forensics. This thesis investigates whether samples collected in public transport systems around the globe can be assigned to their individual cities of origin and whether key species can be identified. The Whole Genome Sequencing data are taxonomically assigned and the abundance of the species found is determined. The data stem from 2016 and 2017; 4 cities from 2016 and 9 cities from 2017 are analysed. Predictive models using Random Forests, K-nearest Neighbors and Support Vector Machines are developed. The most accurate model in each case was provided by the Random Forest: > 98% for 2016, > 80% for 2017. The work shows that, as expected, the accuracy of the prediction models decreases with the number of cities. However, the Random Forest models have identified a number of species that have the potential to be studied more thoroughly. In the end, the analysis of the mystery samples did not work as well as expected; here there is still much potential for improvement.


Acknowledgements

First of all, I would like to thank my supervisors Alexandra and Theresa for their professional support throughout the entire period of this master's thesis and already during my studies. Many thanks for the pleasant and productive working atmosphere! My next thanks go to my parents Helene and Daniel, as well as their partners Gerd and Ulli, who never gave up hope that something would become of me some day. I would also like to warmly thank Jennifer. You truly took excellent care of my physical and emotional well-being and always stood by me with help and advice. Thank you for that! Many thanks to Eva and Alex for the "drink tanks" disguised as "think tanks". Quite a few good ideas were born, and destroyed, there. "Your code is wrong on so many levels that I barely know where to start." - Thanks, Perffizienz!


Contents

1 Introduction
  1.1 Machine Learning
    1.1.1 Learning Methods
    1.1.2 PCA - Principal Component Analysis
    1.1.3 Hierarchical Clustering
    1.1.4 RF - Random Forest
    1.1.5 KNN - k-Nearest neighbors
    1.1.6 Support Vector Machines
  1.2 Metagenomics
  1.3 Background
    1.3.1 MetaSub
    1.3.2 CAMDA

2 Aim of this Work

3 Materials and Methods
  3.1 Data origin
  3.2 Data availability
  3.3 Workflow
    3.3.1 From WGS data to species abundance
    3.3.2 Ecological assignment
    3.3.3 Abundance analysis and Machine learning
  3.4 Software
  3.5 Hardware

4 Results
  4.1 CSD17
    4.1.1 PCA analysis
    4.1.2 Hierarchical Clustering
    4.1.3 Random Forest analysis
    4.1.4 K-nearest neighbors analysis
    4.1.5 Support Vector Machines analysis
    4.1.6 Accuracy comparison
  4.2 CSD16
    4.2.1 PCA analysis
    4.2.2 Hierarchical Clustering
    4.2.3 Random Forest analysis
    4.2.4 K-nearest neighbors analysis
    4.2.5 Support Vector Machines analysis
    4.2.6 Accuracy comparison
  4.3 Matching species of Ilorin and New York (CSD16 vs. CSD17)
  4.4 Mystery samples

5 Discussion

6 Glossary


List of Figures

1  General workflow
2  CSD17 Species per data set
3  CSD17 PCA example
4  CSD17 Hierarchical Clustering
5  CSD17 RF example 1
6  CSD17 RF example 2
7  CSD17 variable importance plot
8  CSD17 KNN plots
9  CSD17 SVM plots
10 CSD16 Species per data set
11 CSD16 PCA example
12 CSD16 Hierarchical Clustering
13 CSD16 RF example 1
14 CSD16 RF example 2
15 CSD16 variable importance plot
16 CSD16 KNN plots
17 CSD16 SVM plots

List of Tables

1  Cities participating at the CSDs
2  Mystery sample translation
3  Data partitioning
4  Used R software
5  CSD17 explained variance
6  CSD17 RF error table
7  CSD17 most important variables
8  CSD17 Parameter optimization
9  CSD17 model accuracy incl. train sets
10 CSD17 model accuracy incl. test sets
11 CSD16 explained variance
12 CSD16 RF error table
13 CSD16 most important variables
14 CSD16 Parameter optimization
15 CSD16 model accuracy incl. train sets
16 CSD16 model accuracy incl. test sets
17 No. of matching species of CSD17 vs. CSD16
18 Matching species of CSD17 vs. CSD16
19 Mystery samples right prediction
20 Evaluation of the mystery samples


1 Introduction

1.1 Machine Learning

Machine learning has been an important part of data analysis since the beginning of the computer age. In the 50s and 60s of the last century, when computers appeared on a larger scale, algorithms were developed or known approaches were refined (Kononenko, 2001). One of the cornerstones was Bayesian statistics (Bayes' Theorem, 18th century). Further milestones were the method of least squares (Legendre, 1805), Markov chains (Hayes, 2013) and the Turing machine (Turing, 1950). The idea behind machine learning is to make computers work in a similar way to the human brain. Nowadays, so-called Neural Networks (NN) imitate the way information is processed by neurons by passing data from an input layer (of neurons) on to hidden layers and finally to the output layer (Lancashire et al., 2009). There is always training data, from which the algorithm tries to recognize certain rules, processes and patterns. Conclusions are then drawn from these findings; the computer learns similarly to humans. Tom M. Mitchell, a pioneer in the field of machine learning, described this approach as follows:

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E."

There are three important methods of learning, namely "supervised learning", "unsupervised learning" and "semisupervised learning". In fact, depending on the characteristics, there are even finer distinctions, but these can be roughly divided into the above categories. Information that can be obtained from the models includes clustering, classification, regression, detection of anomalies or recommendation engines1. Cluster analysis is based on unsupervised learning and groups the data according to their similarity. Methods are Hierarchical Clustering (HC), centroid-based clustering, distribution-based clustering or density clustering. If the task is to assign data to different classes, this is called classification2. Applied methods are decision trees, Random Forest (RF), Support Vector Machines (SVM), K-nearest neighbors (KNN) or logistic regression, just to name a few. Relating a dependent variable to one or more independent variables is called regression. Regression analysis can be divided into linear and non-linear regression. One must be careful when interpreting the results, as a regression makes relations visible but does not establish causality (Freedman, 2009).

1 https://www.innoarchitech.com/blog/machine-learning-an-in-depth-non-technical-guide, accessed: 08.11.2020
2 https://www.stat.berkeley.edu/~s133/Class1a.html, accessed: 08.11.2020


Recommendation engines are often used in the marketing industry to analyse user habits and thus offer customised advertising. Most of the named methods and fundamental principles can also be adapted to bioinformatics questions. Examples of frequently used algorithms in bioinformatics are Principal Component Analysis (PCA), RF, KNN or SVM, all of which occur in this work. Classifications in general can achieve high accuracy by using machine learning methods. The algorithms perform the classification on the basis of either mathematical models, heuristic models or random models (Filipović, 2017).

1.1.1 Learning Methods

Supervised learning

In supervised learning the algorithm is trained with well-known training data, i.e. the class membership is already known. This makes it possible to tune the parameters in such a way that the result of the model best fits the expected output. Based on this experience, conclusions can be drawn about unknown data and the performance of a model can be positively influenced. Examples are classification and regression3.

Unsupervised learning

Unsupervised learning involves non-categorized data. The objective of the algorithm is to find groupings and categorize the input data. This enables the model to work out its own solutions and thus to discover previously unknown patterns. Complex algorithms allow the solution of difficult tasks. Examples are cluster analyses3.

Semisupervised learning

Semisupervised learning deals with labeled as well as unlabeled data. Labeled samples are often rare and not easily produced. In such a case, the mix of data can help to keep costs low and still provide enough data for analysis.

1.1.2 PCA - Principal Component Analysis

PCA is a standard tool in modern data analysis in diverse fields from neuroscience to computer graphics because it is a simple, non-parametric method for extracting relevant information from confusing data sets (Shlens, 2014). Especially with biological data, it is often the case that the number of variables per sample far exceeds the number of samples. Presenting such data visually is difficult in practice. This is where PCA comes into play. It reduces the dimensionality of the data and makes them more illustrative. A mathematical algorithm identifies the directions in the data along which the greatest variation occurs - the principal components. These directions of maximal variation allow us to represent a sample with just a few features. This makes it possible to visually compare different samples in a simple way and to identify similarities or differences, on the basis of which the samples can then be grouped. A so-called biplot is used to visualize the original data in a new coordinate system spanned by the first two PCs (Ringnér, 2008). A measure of interest can be the number of Principal Components (PCs) one needs to explain a certain proportion of variance.

3 https://www.guru99.com/supervised-vs-unsupervised-learning.html, accessed: 24.08.2020
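As a minimal numerical sketch of this idea (assuming NumPy is available; the tiny data matrix is made up), the principal components can be obtained from the singular value decomposition of the centered data, and the proportion of variance each PC explains follows from the singular values:

```python
import numpy as np

# Toy data: 6 samples x 4 features (e.g. species abundances); purely illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
X[:, 0] = 2 * X[:, 1] + rng.normal(scale=0.1, size=6)  # two correlated features

# Center the data; the right singular vectors of the centered matrix are the PCs.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = s**2 / np.sum(s**2)  # proportion of variance per principal component
scores = Xc @ Vt.T               # samples projected onto the PCs (biplot coordinates)
print(explained)
```

The cumulative sum of `explained` answers the question raised above: how many PCs are needed to explain a given proportion of the variance.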

1.1.3 Hierarchical Clustering

HC is a greedy algorithm that creates a hierarchy of partitions of the data, visualized as dendrograms (Nielsen, 2016). It is often used to display biological interrelationships. The clustering can be performed in two different ways. Agglomerative Clustering considers each data point to be its own cluster at the start and builds the dendrogram bottom-up by always joining the two most similar clusters. The second approach, called Divisive Clustering, is a top-down method starting with one cluster containing all data points. The clustering is performed recursively by always partitioning the data into the two least similar clusters.4 To determine the differences between the individual observations, different distance measures can be used, e.g. Euclidean distance, Manhattan distance, Maximum distance, Mahalanobis distance etc. The data are then clustered using linkage methods that calculate the distances between the clusters. Applied methods are Complete Linkage, Ward's Linkage, Single Linkage, Centroid Linkage or Average Linkage.5 The y-axis of the dendrogram represents the similarity of the clusters. For interpretation and for a meaningful grouping of the clusters, concise differences in similarity can be used. This is also called "cutting". It can be advantageous to cut the dendrogram in different ways to achieve a meaningful result.6

4 https://towardsdatascience.com/hierarchical-clustering-agglomerative-and-divisive-explained-342e6b20d710, accessed: 01.11.2020
5 https://lzpdatascience.wordpress.com/2019/11/17/distance-measures-and-linkage-methods-in-hierarchical-clustering/, accessed: 07.11.2020
6 https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/multivariate/how-to/cluster-observations/interpret-the-results/all-statistics-and-graphs/dendrogram/, accessed: 01.11.2020
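The bottom-up (agglomerative) variant described above can be sketched in a few lines of plain Python; the 2-D points, the choice of complete linkage and the Euclidean metric are arbitrary illustrative assumptions:

```python
import math

def complete_linkage(c1, c2, points):
    # Complete linkage: cluster distance = largest pairwise point distance.
    return max(math.dist(points[i], points[j]) for i in c1 for j in c2)

def agglomerative(points, k):
    # Start bottom-up with every point as its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        # Merge the pair of clusters with the smallest linkage distance.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = complete_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
print(agglomerative(points, 2))  # the outlier (20, 20) stays in its own cluster
```

Stopping at k clusters corresponds to "cutting" the dendrogram at a fixed height; recording the merge distances instead would yield the dendrogram itself.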


1.1.4 RF - Random Forest

Random forests belong to the supervised learning algorithms and have been used for classification and regression problems for about two decades. The name "random forest" comes from the fact that the algorithm grows a certain number of random and uncorrelated decision trees. In classification, each of these trees votes for a class, and the class with the most votes is used for the final classification. The RF trains relatively quickly because a single decision tree can be built quickly and the computation time increases linearly with the number of trees. This makes it an efficient method for processing large amounts of data, i.e. many samples with many features. The algorithm works as follows: N bootstrap samples are drawn. If M is the number of features, m ≪ M features are randomly selected in each node of the tree, and from these a split feature is selected, e.g. by minimising entropy. The tree is not pruned, but fully grown. To classify an input, it is evaluated in each tree and the sample is assigned to the class with the most votes (Breiman, 2001).
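The three ingredients named above (bootstrap samples, a random feature subset of size m ≪ M per split, and a majority vote) can be illustrated with the following pure-Python sketch; to stay short it grows one-split decision stumps instead of full trees, and all data are made up:

```python
import random
from collections import Counter

def gini(labels):
    # Gini impurity of a set of class labels.
    counts = Counter(labels)
    return 1 - sum((c / len(labels)) ** 2 for c in counts.values())

def best_stump(X, y, feat_idx):
    # Pick the (feature, threshold) split minimising weighted Gini impurity.
    best = None
    for f in feat_idx:
        for t in sorted({row[f] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:  # degenerate bootstrap sample: fall back to majority class
        maj = Counter(y).most_common(1)[0][0]
        return (feat_idx[0], float("inf"), maj, maj)
    return best[1:]

def train_forest(X, y, n_trees=25, m=1):
    random.seed(42)
    forest = []
    for _ in range(n_trees):
        idx = [random.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
        feats = random.sample(range(len(X[0])), m)               # m << M features
        forest.append(best_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict(forest, row):
    # Each stump votes; the class with the most votes wins.
    votes = [l if row[f] <= t else r for f, t, l, r in forest]
    return Counter(votes).most_common(1)[0][0]

X = [[1, 0], [2, 1], [8, 7], [9, 8], [1, 1], [9, 9]]
y = ["A", "A", "B", "B", "A", "B"]
forest = train_forest(X, y)
print(predict(forest, [1.5, 0.5]), predict(forest, [8.5, 8.0]))
```

A real random forest grows unpruned full trees and re-samples the feature subset at every node; this stub only mirrors the voting scheme.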

1.1.5 KNN - k-Nearest neighbors

KNN is a non-parametric method used in machine learning (Altman, 1992). It is based on pattern recognition and can be used for classification as well as for regression. When classifying samples, the algorithm assigns the sample to the class determined by a plurality vote of its closest neighbours in the feature space (Grbić et al., 2016). "k" is the number of nearest neighbors and therefore has to be a positive integer. The distance metric can be one of many, like the Euclidean distance, the Manhattan distance or the Minkowski distance. Using the raw values of the data may not achieve the best result. To deal with distance metrics it is common to normalize the values; applying the algorithm to normalized data generally yields better solutions (Piryonesi & El-Diraby, 2020). To obtain proper results it is crucial to choose an optimal value for k. Small values of k are more likely to be affected by noisy data, whereas larger values may cause the boundaries between the classes to become less distinct.
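A minimal sketch of this procedure in plain Python, including the normalization step recommended above (here min-max scaling; the two features and city labels are invented):

```python
import math
from collections import Counter

def fit_min_max(X):
    # Per-feature minima and maxima for rescaling to [0, 1].
    lo = [min(col) for col in zip(*X)]
    hi = [max(col) for col in zip(*X)]
    return lo, hi

def normalize(row, lo, hi):
    return [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]

def knn_predict(X, y, query, k=3):
    # Plurality vote among the k training points closest to the query.
    dists = sorted((math.dist(row, query), label) for row, label in zip(X, y))
    return Counter(label for _, label in dists[:k]).most_common(1)[0][0]

# Invented training data: two features on very different scales.
X = [[180, 0.2], [175, 0.3], [160, 5.0], [158, 4.8]]
y = ["city_A", "city_A", "city_B", "city_B"]
lo, hi = fit_min_max(X)
Xn = [normalize(row, lo, hi) for row in X]

query = normalize([162, 4.5], lo, hi)
print(knn_predict(Xn, y, query, k=3))
```

Without the normalization step, the first feature (range about 22) would dominate the second (range about 4.8) in the Euclidean distance.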

1.1.6 Support Vector Machines

SVM, also called support-vector networks (Cortes & Vapnik, 1995), is a term for algorithms which use supervised learning for classification and regression problems. To separate and classify the data in some feature space into different regions, the model generates hyperplanes. These hyperplanes can be lines or curves. The algorithm tries to maximise the distance between two classes, i.e. to maximise the margin7. If the data to be classified are non-linear, this can be handled using the kernel trick. The function used for the calculation is then, for example, a radial kernel function. An important parameter is the soft margin parameter C. The parameter is usually selected by a grid search with exponentially growing sequences of C, followed by a cross-validation which finds the best value of C to be implemented in the final model (Hsu et al., 2003).
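As a rough, purely didactic sketch: a linear soft-margin SVM trained by sub-gradient descent stands in for a production solver, and a coarse grid over C replaces the full cross-validated grid search described above (the data are made up and linearly separable):

```python
def train_linear_svm(X, y, C=1.0, epochs=500, lr=0.01):
    # Sub-gradient descent on the soft-margin objective
    #   0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i * (w . x_i + b)),  y_i in {-1, +1}
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # inside the margin: hinge loss is active
                w = [wj - lr * (wj - C * yi * xj) for wj, xj in zip(w, xi)]
                b += lr * C * yi
            else:           # outside the margin: only the regulariser acts
                w = [wj * (1 - lr) for wj in w]
    return w, b

def accuracy(model, X, y):
    w, b = model
    return sum(yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) > 0
               for xi, yi in zip(X, y)) / len(y)

X = [[1, 1], [2, 1], [1, 2], [6, 6], [7, 6], [6, 7]]
y = [-1, -1, -1, 1, 1, 1]

# Grid search over exponentially spaced values of C; keep the best result.
best = max((accuracy(train_linear_svm(X, y, C=c), X, y), c)
           for c in [0.01, 0.1, 1, 10])
print(best)
```

In practice the kernelized formulation and a proper cross-validation over C (and the kernel parameter) would be used instead of training and evaluating on the same toy set.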

1.2 Metagenomics

The usual procedure to analyze the diversity of species from environmental samples used to be to sequence cultivated and later isolated species. However, it is far from possible to cover all species in this way, as many organisms that occur in the environment cannot be cultivated under simple laboratory conditions. Even among the culture-dependent species, it has become apparent that some species are ubiquitous, while others are bound to special ecosystems. With this method, however, about 99% of bacteria and archaea remain undiscovered, as comparisons with culture-independent experiments show (Hugenholtz et al., 1998).

Early metagenomic studies included the analysis of seawater, in which more than 5,000 viruses were found using environmental shotgun sequencing (Breitbart et al., 2002), or the analysis of DNA samples from mine drainage water, which revealed whole genomes of some bacteria and archaea that previously could not be cultured (Tyson et al., 2004) (Hugenholtz, 2002).

The main approaches to analysing microbial communities' metagenomes are Whole Genome Sequencing (WGS) and 16S Ribosomal ribonucleic acid (rRNA) sequencing. With the help of 16S rRNA, prokaryotes can be identified very well, as it only occurs in these organisms. Its composition of highly conserved regions and variable regions makes it so meaningful in the identification of species. Its length of about 1,500 base pairs also makes it cost efficient in terms of computing power and, of course, in terms of money. WGS is based on shotgun metagenomics. This gives us a big advantage over 16S rRNA, namely the detection of all organisms, including new ones, in the sample. Accordingly, however, the computation is more complex. One of the most important technical advances that make it possible to analyse the metagenome so precisely is Next Generation Sequencing (NGS). NGS is a collective term for various high-throughput sequencing methods. Established methods are Illumina sequencing, Roche 454 sequencing and Ion Torrent Proton/PGM sequencing. These developments led to a rapid increase in metagenome studies. Inspired by the Human Genome Project (Collins et al., 2003), there is the Human Microbiome Project in this field. Its main goals are to use NGS techniques to characterize the human microbiome, to study interactions between the human microbiome and health issues and to provide technical approaches and data to make such studies broadly realizable (Peterson et al., 2009). The soil microbiome is also a focus of metagenomics. Exploring the genomes of the soil microflora has its beginnings in the 90s (Handelsman et al., 1998). Plants are known to live in symbiosis with their surrounding microbes and are even able to shape their microbial environment to increase health and fertility (Chaparro et al., 2012). The interactions between soil organisms and plants to increase fruit yield can also be studied with the mentioned methods (Yurgel et al., 2019). Another application of metagenomics is in the field of forensics. Biological samples from crime scenes often do not contain enough DNA to analyse. With its high-throughput technologies, metagenomics can help to gain insights into such samples (Dileep et al., 2020). Deoxyribonucleic acid (DNA) has long been used to make predictions about the ancestry and appearance of perpetrators and victims in criminal cases. By including the microbiome, it is possible to draw better conclusions about the parties involved in a crime, as the microbiome of a person or place is also unique (Mason-Buck et al., 2020).

7 https://datascienceplus.com/radial-kernel-support-vector-classifier/, accessed: 25.10.2020

1.3 Background

Bioinformatics is a scientific discipline capable of combining computational algorithms and "omics" data to address scientific questions. The focus here lies on predictions and simulations. One of the major challenges in many studies in this area is that the number of samples is often relatively small, while the features are numerous. And this is exactly where machine learning can help.

Structural prediction of proteins is a good example of the necessity of machine learning in biological settings. Although structures such as the alpha helix and the beta sheet have been known since the early 1950s, accurate predictions of protein structure were hardly feasible until a few years ago. With the help of huge amounts of data from growing databases and the inclusion of known templates, predictions can now be made with an accuracy of 84%. Machine learning algorithms are able to process and classify a large number of features with known training data (Yang et al., 2018). Another field of application of machine learning is research on rare diseases, which collectively are far less rare than the name suggests: there are over 6,000 of them, affecting 3.5-5.9% of the population. Between 2010 and 2019 there have been over 200 publications on rare diseases, with a steady annual increase. During the same period, machine learning publications have also increased in similar proportion. In particular, methods such as Artificial Neural Networks, Support Vector Machines and Ensemble Learning were used, but decision trees and clustering were also applied (Schaefer et al., 2020). Metabolic modelling, more precisely constraint-based and kinetic modelling, uses bioinformatic tools to describe the relationship between the metabolism and the phenotype of cells, tissues and organisms (Volkova et al., 2020). To find the minimal set of reactions necessary to operate a certain pathway, elementary flux mode analysis finds its application (Klamt et al., 2017). When it comes to the analysis of DNA sequences, one fundamental method applied is the Hidden Markov Model (HMM). This stochastic model is easy to use and requires only small amounts of training data (Sasikumar & Kalpana, 2016).

The powerful combination of bioinformatics and machine learning is now also finding its way into metagenome studies. One finds the genetic material of living beings everywhere, or at least fragments of it, be it DNA or Ribonucleic acid (RNA). As a result of the further development and refinement of technical tools in recent decades for the classification and analysis of biological material, it is now possible to examine the entire genetic material of a sample. This information can in turn be used to create a genetic "fingerprint" of the sample. Computer-based methods can then be used to try to assign these "fingerprints" to their place of origin. It is well known that the microbiome has an influence on our health (Peterson et al., 2009). So it is not surprising that special attention is paid to the urban microbiome in particular, as a very large number of people live there in a confined space.
The Metagenomics & Metadesign of Subways & Urban Biomes (MetaSUB) International Consortium is specifically concerned with the microbial communities of subways and urban space. One goal is to use the knowledge gained from sampling, data analysis, visualization and characterization to positively influence the development and design of artificially created habitats8. This cooperation offers the possibility to work with the publicly available data. Furthermore, one can take part in an annual competition to assign unknown samples ("mystery samples") to their origin9. This competition is part of the Critical Assessment Of Massive Data Analysis (CAMDA) conference. These prerequisites have driven the analyses of recent years.

8 http://metasub.org/about, accessed: 24.08.2020
9 http://camda.info, accessed: 24.08.2020


1.3.1 MetaSub

As already discussed, the MetaSUB Consortium is concerned with research on the microbiome and metagenome in urban space. This field is so interesting because more than 55% of the world's population now lives in or near cities (United Nations, 2018). This creates interactions between people and the microbes surrounding them. In artificially created environments the microbiome is composed of different species than in rural areas (Neiderud, 2015). Some diseases, such as allergies, are more common in densely populated areas (Nicolaou et al., 2005). It is therefore clear that the city as a habitat has an influence on human health, but so far very little is understood about the microbial relationships in the background. The MetaSUB Consortium is convinced that the categorisation of urban species opens up a field of research that will allow a better understanding of the above-mentioned interactions between humans and the urban microbiome. For this reason, the consortium was established in 2015, and in order to achieve useful results, uniform sampling and processing procedures were developed that are globally feasible. The samples were taken from places where people are in contact with urban surfaces, such as the surfaces of public transport systems. City sampling days have been carried out between 2016 and 2020, preceded by a pilot study in 2015 and 2016. During the sampling days of 2016-2018, 4,728 samples were collected in 60 cities worldwide (Danko et al., 2020). For projects of this magnitude, which span the entire globe and involve many employees, scientists and laboratories, the experiment must be precisely designed. A uniform design was written down in the Consortium's inaugural report (The MetaSUB International Consortium, 2016). There, in addition to the ethical, social and legal framework, the procedure for data collection and analysis was defined and standards were implemented. The samples are taken using swabs. The heterogeneous composition of viruses, bacteria, fungi and other organisms requires a protocol that makes the analysis as accurate as possible. Several influences must be taken into account, e.g. pH value, temperature, physical state of the sample, salt content and others.

Summary of the guidelines for sampling and analysis

During this inaugural meeting, sampling guidelines were established. Nylon swabs are used to avoid contamination with cotton DNA. Samples are kept on ice during transport to avoid a shift in the abundance of the species due to different generation times. They are stored at -20°C or below. A workplace cleaned with ethanol or bleach is required. There are two ways to extract DNA: (1) direct extraction of DNA by dissolving the bacterial cells, or (2) indirect extraction by separating bacteria and other organic material followed by DNA extraction. For the actual sequencing the gold standard, the Illumina HiSeq platform with 150 bp paired-end reads, is used. Furthermore, a positive control with bacterial reference genomes and known metagenome samples is recommended. These can be obtained from the Genome Reference Consortium (GRC) or the US National Institute of Standards and Technology (NIST).

There are also some challenges to be overcome in terms of bioinformatics. The focus here is not on defined guidelines but on further development and improvement in terms of standardisation, reproducibility, access and sharing of data, and innovation. A crucial part of the analysis is how to deal with potential new findings and the diversity of such metagenomic samples. To avoid confusing genuine new discoveries with results of technical deviations, it is important to standardize computational and empirical protocols. Strategies pursued here are, on the one hand, the creation and integration of reference data sets of the different sequencing technologies for benchmarking (Li et al., 2014) and, on the other, the introduction of well analysed control samples. The standards created with these reference sets provide the basis for interpreting results gathered from many people working around the world.

Since 52% of researchers think science is facing a significant reproducibility crisis and 38% think that there is at least a slight crisis, according to a survey by Baker and Penny (2016), it is obvious that the MetaSUB Consortium is also addressing this issue. One needs to distinguish between steps in the laboratory or during sampling and the computational steps. Creating transparent workflows, implementing code sharing and using common data formats and data sharing portals are the cornerstones in solving these issues. The points listed above may lead to new insights, new techniques and overall to further innovation in this field, which shall be preserved and shared. Visualizing the data shall provide quick insight for people interested in this field, including an ability to look at results, metadata and milestones of this research, and the data should be published uniformly.

To meet legal, ethical and social acceptance criteria, all data, publications and correspondences are publicly available. The data is shared with local authorities, who even have a "Right to First Review" before any of it is published. The interpretation of results of such sensitive data, data concerning public and highly frequented spaces, is crucial to avoid misunderstanding in the public. Even though most of the species found are harmless, pathogenic species are always present. In this case it is important to put the results in relation to each other in order not to unsettle the population.

1.3.2 CAMDA

CAMDA is a conference track at the Intelligent Systems for Molecular Biology (ISMB)10. This conference deals with the challenges of the intelligent exploitation of big amounts of data. Each year CAMDA offers the possibility to participate in several data analysis challenges. In 2020 the challenges are, according to the CAMDA homepage11:

• The Metagenomic Geolocation Challenge presents thousands of city micro- biome profiles in the context of climate data. Construct multi-source microbiome fingerprints and predict the ecological niche or exact geographical origin of mys- tery samples.

• The Hi-Res Cancer Data Integration Challenge presents clinical and matched molecular profiles, with read level data for individual genomes. Explore non- standard genomes and splicing events for better prognosis.

• The CMap Drug Safety Challenge presents clinical toxicity results, cellular and molecular responses to hundreds of drugs. Compare and integrate a range of cell line assays to predict the severity of the drug induced liver injury in humans.

This work deals with the Metagenomic Geolocation Challenge. A workflow will be es- tablished to identify key species, according to their abundances in the samples, with common machine learning methods. The program should then be able to assign mys- tery samples of cities to their origin as long as the city to be identified is already known to the model.

Former CAMDA approaches concerning the Metagenomic Geolocation Challenge

As already mentioned in the section above, Harris et al. (2019) made a similar attempt to identify mystery samples based on abundances in combination with machine learning methods. Their results show that abundance-based methods can definitely lead to accurate models. Trying to find the bacterial fingerprint of different cities from 16S gene profiles, Walker et al. (2018) were able to differentiate three cities via PCA. A large proportion of the variance was explained by the first three PCs. The levels

10 https://www.iscb.org/ismb2020, accessed: 26.10.2020
11 http://camda.info/, accessed: 24.08.2020

they were focusing on were "Order", "Family", and "Genus". Analysing the variance of the samples reveals different bacterial compositions across the cities. Zhu et al. (2019) differentiated the microbiome on the basis of the functional profile of the microbes. The accuracy of the model reaches 63%, or 90% when based on a balanced training and test set. The functional annotation was done with "mi-faser" (Zhu et al., 2018), an algorithm that is optimized to map reads to the molecular functions encoded by the read-correspondent genes. The openly accessible CAMDA data is also used in studies not competing in the challenge. An NN is used by Zhou et al. (2019) to predict microbial communities at unsampled locations. The network is called "MetaMLAnn" and works with metadata like subway lines, known microbial composition patterns and sampling material types. They show that the NN outperforms other classifiers and enables relatively precise predictions of the microbial communities.


2 Aim of this Work

A number of criminal cases remain unsolved due to the lack of working hypotheses concerning the identity of the criminals who committed the crimes. Although DNA samples collected at the crime scene may result in a good DNA profile, investigators cannot determine the culprit if the DNA reference is missing from their databases. Thus, it would be of advantage if DNA-based forensic methods were able to reliably narrow down the number of potential suspects, by creating a physiological profile as well as by providing an environmental context of the subject's past whereabouts. The sub-field of Forensic DNA Phenotyping - the umbrella term for predictive DNA methods that provide tools to limit the pool of suspects - can basically be split into two groups: the first takes into consideration only the culprit's DNA and its phenotyping, whereas the second covers the microbiome print of the crime scene. This work is based on bioinformatic analysis methods, which essentially combine metagenomics and machine learning and thus should contribute to a better understanding of our environment. This work is concerned with the assignment of metagenomic data to places, or more precisely to the respective city from which the samples originate. It is therefore a kind of forensic processing of metagenomic data. In order to classify the samples, the raw data undergoes various quality and trimming steps, as well as a taxonomic classification. The data is then trimmed down to bacterial species only. We expect the best results from this selection because bacterial species are better represented in databases and the available data is more reliable compared to other microorganisms. An attempt is made to assign the samples to their origin on the basis of the abundances of certain species. The investigations are to reveal whether there are certain key species by which the samples can be assigned. Conversely, it should then be possible to assign unknown samples to their origin.
A better understanding of the microbial composition will result in a higher resolution of the microbial composition of urban spaces which can then be used for forensic purposes.


3 Materials and Methods

3.1 Data origin

In the course of the global City Sampling Day, organised by the MetaSUB Consortium, samples are taken annually on a certain day from specific public places, such as airports and highly frequented public transport systems like buses and subways. These samples are then sequenced and analysed. The data in this work were taken from different surfaces in subways in 2016 (City sampling day 2016 (CSD16)) and 2017 (City sampling day 2017 (CSD17)). Between 15 and 50 samples per city were collected (see Table 1). Samples from both years, however, are not available for all cities: of the 24 cities, 5 have samples from both years, CSD17 includes samples from 17 cities and CSD16 includes samples from 11 cities. These data are publicly available, but in this case they were obtained from CAMDA, with the exception of the CSD17 data from Vienna. They are transmitted in the form of .fasta files containing paired-end reads. Cities used for the analysis are marked in bold in Table 1, while the year from which the data are selected is marked in green. "n.a." in the "CSD" columns stands for "not applicable" and means that no samples were taken in the respective city in the year in question. New York City will be referred to as New York and Kyiv will be referred to as Kiev in this work. Although the Stockholm data set contains 50 samples, it is not used for further analysis, because the species abundances are disproportionately low compared to the others, so it was decided to discard them. Otherwise, only cities that come close to the maximum number of collected samples are used for the analysis. The maximum number of samples per sampling day and city is 50. In addition, the data from the CSD17 of Vienna are analysed to show how well the analysis works for just a few samples.


Table 1: Cities participating in the CSDs and no. of collected samples

City               Acronym  CSD16  CSD17
Stockholm          ARN      n.a.   50
Barcelona          BCN      38     n.a.
Berlin             BER      41     n.a.
Denver             DEN      23     22
Doha               DOH      50     15
Fairbanks          FAI      48     n.a.
Hong Kong          HKG      n.a.   49
Incheon            ICN      n.a.   50
Kyiv               IEV      n.a.   49
Ilorin             ILR      47     50
Kuala Lumpur       KUL      n.a.   30
London             LCY      n.a.   37
Lisbon             LIS      19     n.a.
New York City      NYC      49     50
Omaha              OFF      26     n.a.
Sao Paulo          SAO      n.a.   29
Santiago de Chile  SCL      26     n.a.
Sendai             SDJ      n.a.   32
San Francisco      SFO      n.a.   29
Singapore          SGP      n.a.   48
Taipei             TPE      n.a.   50
Tokyo              TYO      25     50
Vienna             VIE      n.a.   17
Zürich             ZRH      n.a.   33

After quality control and alignment the data is taxonomically assigned. This is done with Kraken (Wood & Salzberg, 2014), a software tool for fast assignment of taxonomic labels to metagenomic DNA sequences.


Table 2: Mystery sample translation list. HongKongM1 to HongKongM13 refer to Hong Kong as sample origin. KievM1 to KievM13 refer to Kiev as sample origin. TaipeiM1 to TaipeiM10 refer to Taipei as sample origin, TokyoM1 to TokyoM11 refer to Tokyo as sample origin and ViennaM1 to ViennaM10 refer to Vienna as sample origin.

Hong Kong:
0V1JrUr3qPfY  HongKongM1
7mm7F8ebohvI  HongKongM2
CAov7ffceNbk  HongKongM3
QMsho8YSU54N  HongKongM4
a0pTYjAUGvLb  HongKongM5
bhDkIhhbsn63  HongKongM6
hB7CJ4Jbvo9o  HongKongM7
hYlkiU9h5dAk  HongKongM8
hiBLpdVbSkOh  HongKongM9
klKodXilNzwr  HongKongM10
skgN1DPY3tfx  HongKongM11
vRfWohxWD0Hk  HongKongM12
yVuBUyFA3BaG  HongKongM13

Kiev:
0Ae1dHGe0DyY  KievM1
3CtKjudk0Wt2  KievM2
D92WwTPqqCN9  KievM3
EA3qOpmxgoy9  KievM4
MNjYO4DHYxuY  KievM5
PEIQYIGgdqRR  KievM6
PaJxnodc0WTC  KievM7
TmoWLQAjWb1V  KievM8
VxRfsZ5CEipk  KievM9
d0z9i0tIWKPV  KievM10
mUWBc0gzBK0r  KievM11
skcahSj00d0n  KievM12
uEFSltwT1KjW  KievM13

Taipei:
16nKqloPqYql  TaipeiM1
5wCWl46pJ3MX  TaipeiM2
B9TwLHfoYhG2  TaipeiM3
ONPzTKdGTXuB  TaipeiM4
VJcXyLAgaaSw  TaipeiM5
fSsPSaDvTDFo  TaipeiM6
jYCnTTvl48gm  TaipeiM7
mxb4uDYFG1bv  TaipeiM8
w6s878eMcnQa  TaipeiM9
xLuyUbkUQnQD  TaipeiM10

Tokyo:
6YwsuU6EHlT0  TokyoM1
Begi5ms9HQE6  TokyoM2
N6HxsNCeOpCu  TokyoM3
O3Byrj7fOgpd  TokyoM4
RXA8KOX22npi  TokyoM5
VoFWzgsqkR2W  TokyoM6
Vvw1JWzmLMWr  TokyoM7
gfjfi52jRl7W  TokyoM8
r2nmgQkfTTVo  TokyoM9
sUMZQLn3yBSB  TokyoM10
syatOMVYSSCU  TokyoM11

Vienna:
0c85YsqS2gpP  ViennaM1
GygV1WRqtyU8  ViennaM2
KNBBwa0bAzXI  ViennaM3
W0t7uUvktyH9  ViennaM4
Wx6qDJOH7Bkz  ViennaM5
ZsSiR4ZE2JJs  ViennaM6
c1WayUd4P4wo  ViennaM7
dR4u8mbA6LVO  ViennaM8
tRmUyVw4oXy6  ViennaM9
zci9FX1R8dn4  ViennaM10

After the models are created, they are tested with the mystery samples listed in table 2. The chosen 57 samples all originate from cities known to the model.

3.2 Data availability

• The data of this work was obtained via the CAMDA Challenge 202012

• R-Scripts compiled for the analysis: https://github.com/LenzRene/Thesis

3.3 Workflow

3.3.1 From WGS data to species abundance

The .fasta zip files were further processed as indicated in the workflow described in figure 1. The first phases comprise quality assurance steps, starting with FastQC13, a quality control tool for high throughput sequence data. This step is followed by Trimmomatic (Bolger et al., 2014), which is specifically designed to trim technical sequences, such as adapter sequences, from NGS paired-end read data.

Trimmomatic parameters:

• trim leading: 25
• trim trailing: 25
• trim sliding window: 4
• trim sliding qual: 20
• trim minlength: 80
• trim headcrop: 10
• trim illumina clip: TruSeq3-PE-2

Illumina clipping parameters:

• trim seed mismatches: 2
• trim palindrome clip threshold: 20
• trim simple clip threshold: 30
• trim minAdapterLength: 0

12 http://camda.info/, accessed: 24.08.2020
13 http://www.bioinformatics.babraham.ac.uk/projects/fastqc/, accessed 07.10.2020

Then the data is aligned with Bowtie (Langmead et al., 2009). This software is designed to align short DNA sequences quickly and efficiently. After quality control and alignment the data is taxonomically assigned. This is done with Kraken 2 (Wood et al., 2019). Kraken is a software tool for fast assignment of taxonomic labels to metagenomic DNA sequences using a k-mer based approach. The further development of Kraken (Wood & Salzberg, 2014) into Kraken 2 makes the application more sensitive while consuming less memory. The database used for the assignment is RefSeq (Pruitt et al., 2005), a curated, non-redundant public database of nucleotide and protein sequences of genomes, transcripts and proteins with corresponding feature and bibliographic annotation (Pruitt et al., 2005). The RefSeq database was filtered for all Eukaryota and Prokaryota except for primates. After the taxonomic assignment the data set is supplemented with the frequencies. Bracken (Lu et al., 2017) is applied for this purpose. Bracken uses the taxonomic assignments made by Kraken 2, along with information about the genomes themselves, to estimate abundance at the species level, the genus level, or above (Wood & Salzberg, 2014).

3.3.2 Ecological assignment

In order to reduce the diversity of the data, the focus of the study is on bacterial species. Kraken (Wood & Salzberg, 2014) showed a lower specificity for eukaryotic species, so those assignments were not as reliable. One advantage is that there are large databases with well described information on bacterial species. For this purpose, an extract from the Bacterial Diversity Metadatabase (BacDive) (Reimer et al., 2019) is used, which contains only the mentioned species. The "Isolation Sources" application available on the homepage (https://bacdive.dsmz.de/, accessed on 21 April 2020) was used for this purpose; it makes it possible to filter all available bacterial strains. This list in csv format contains 92,104 BacDive IDs. The file is processed in RStudio (RStudio Team, 2020) for further analysis and reduced to the individual species. This leaves 13,778 species.
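The reduction from strain-level BacDive IDs to a list of unique species can be sketched in base R. This is a toy illustration on synthetic data with hypothetical column names, not the actual BacDive export (which holds 92,104 strain entries):

```r
# Toy stand-in for the BacDive strain export (hypothetical column names)
bacdive <- data.frame(
  ID      = 1:5,
  Species = c("Kocuria rosea", "Kocuria rosea",
              "Pseudomonas stutzeri", "Micrococcus luteus",
              "Micrococcus luteus"),
  stringsAsFactors = FALSE
)

# Collapse the strain list to the distinct species names
species_list <- unique(bacdive$Species)
length(species_list)  # 3 distinct species in this toy example
```

Applied to the real export, the same `unique()` step is what reduces the 92,104 strain entries to the 13,778 species mentioned in the text.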


Figure 1: General workflow (WGS data → quality control: FastQC, Trimmomatic, Bowtie → taxonomic assignment: Kraken 2, Bracken → ecological assignment: BacDive → abundance analysis: PCA & HC, RF, KNN, SVM → identification of key species → mystery sample analysis)

3.3.3 Abundance analysis and Machine learning

After the abundances of the species have been determined for the WGS data with the help of specialised bioinformatics tools and the extract from BacDive is available, the further analysis of these data is performed in RStudio. The used versions of R (R Core Team, 2020) and RStudio as well as the versions of the integrated packages are listed in table 4. In the first step of the R analysis the data is loaded and converted into an editable format. A table is created containing the abundance data of all species from all cities. This table still contains all species that were taxonomically assignable. In the next step this table is matched against the BacDive table, so that only bacterial species remain in each sample of each city. Subsequently, the individual data sets with the top abundances of the species in the respective cities are formed from this table. For this purpose, the median of the abundances of each species is calculated over all samples of a city. The median is used because it is more robust against outliers than the mean. Based on this value, the "Top 10", "Top 20", "Top 30" and "Top 100" species of each city are determined. Each "Top" data set is then divided into three data sets. "Top 10" contains all 10 top abundant species of all cities. "Top 10_1" contains only unique species and "Top 10_2" contains all species that occur at least twice in the cities in the "Top 10". The same applies to the other data sets ("Top 20", "Top 30", "Top 100").
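The per-city median and "Top N" selection described above can be sketched in base R on toy data (synthetic abundance matrix; the real tables hold thousands of species):

```r
# Toy abundance matrix for one city: samples in rows, species in columns
set.seed(1)
abund <- matrix(rpois(5 * 4, lambda = 20), nrow = 5,
                dimnames = list(paste0("sample", 1:5),
                                c("sp_A", "sp_B", "sp_C", "sp_D")))

# Median abundance of each species over all samples of the city;
# the median is used because it is robust against outliers
city_median <- apply(abund, 2, median)

# The N most abundant species of that city (here N = 2 for the toy data)
top_n <- names(sort(city_median, decreasing = TRUE))[1:2]
```

Repeating this per city and taking the union of the per-city "Top N" lists yields the merged "Top 10" to "Top 100" data sets.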

Table 3: Partitioning of train sets and test sets

       City       No. of samples for training  No. of samples for testing
CSD17  Hong Kong  33                           16
       Incheon    33                           17
       Kiev       33                           16
       Ilorin     32                           16
       New York   33                           17
       Singapore  32                           16
       Taipei     33                           17
       Tokyo      33                           16
       Vienna     11                           5
CSD16  Doha       29                           15
       Fairbanks  32                           16
       Ilorin     31                           15
       New York   32                           16

This classification is followed by a PCA analysis and a hierarchical clustering of each data set. For further analysis using machine learning methods, the data sets are divided into training sets and test sets (table 3); the training set consists of two thirds of the data and the test set of one third. To put the results of the machine learning in relation to each other, three methods are applied to the data sets: Random Forest, K-nearest neighbors and Support Vector Machines. All models are created using the caret package.
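The two-thirds split can be sketched in base R; the sample count below is illustrative, and the exact per-city numbers in table 3 may differ by one depending on rounding:

```r
# Sketch of a two-thirds / one-third split for one city (toy count n = 49)
set.seed(42)
n <- 49
train <- sample(seq_len(n), size = floor(2 * n / 3))  # ~ two thirds
test  <- setdiff(seq_len(n), train)                   # remaining third

length(train)  # 32
length(test)   # 17
```

In practice caret's `createDataPartition()` can produce such splits while keeping the class distribution balanced.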

Random Forest analysis

The model is created with a 5-fold cross validation. The best value of the parameter "mtry", which is the number of variables randomly sampled as candidates at each split, is also determined. The "mtry" selected for the final model is the one which provides the highest accuracy. For each data set three values of "mtry" were tested, always starting with 2. The other two values depend on the number of predictors (in this case the number of species): the second value is no. of predictors / 2 + 1, or, if the number of predictors is odd, no. of predictors / 2 + 0.5. The third value equals the number of predictors.
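Both the even and the odd rule for the second value reduce to floor(p/2) + 1, so the grid can be written as a small helper. The caret calls in the comments are a sketch of how such a grid could be passed, not the thesis code itself:

```r
# The three mtry candidates described in the text:
# 2, floor(p / 2) + 1, and p, where p = number of predictors (species)
mtry_grid <- function(p) {
  c(2, floor(p / 2) + 1, p)
}

mtry_grid(33)  # odd p:  2, 17, 33  (33 / 2 + 0.5 = 17)
mtry_grid(52)  # even p: 2, 27, 52  (52 / 2 + 1   = 27)

# Possible use with caret (x, y hypothetical predictor/label objects):
# ctrl <- caret::trainControl(method = "cv", number = 5)
# rf_fit <- caret::train(x, y, method = "rf", trControl = ctrl,
#                        tuneGrid = expand.grid(mtry = mtry_grid(ncol(x))))
```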

K-nearest neighbors analysis

The KNN model is set up with a 10-fold repeated cross-validation with three repeats, i.e. three complete sets of folds are computed. In order to find the best value for "k", the number of neighbors taken into account for classification, values ranging from 5 to 23 in steps of 2 are tested. The "k" chosen is the one with the highest accuracy.
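The candidate grid for "k" and the resampling scheme read as follows; the commented caret calls are a sketch with hypothetical x/y objects:

```r
# k candidates described above: 5, 7, ..., 23 (10 values)
k_grid <- seq(5, 23, by = 2)
k_grid

# Possible use with caret (x, y hypothetical predictor/label objects):
# ctrl <- caret::trainControl(method = "repeatedcv", number = 10, repeats = 3)
# knn_fit <- caret::train(x, y, method = "knn", trControl = ctrl,
#                         tuneGrid = data.frame(k = k_grid))
```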

Support Vector Machines analysis

The SVM model is set up with a radial basis function kernel, tuning the parameters "sigma" and "C". "Sigma" is the kernel parameter. The best cost parameter "C" is the one which provides the highest accuracy for the model. "C" is tested 10 times, starting with C = 0.5; the value is then doubled with each cycle, which finally results in C = 128.
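The doubling cost grid can be generated as below. Note that doubling from 0.5 up to 128 produces nine values, so either the stated count or the endpoint is approximate; the grid here follows the stated start and end. The commented caret call is a sketch with a hypothetical sigma estimate:

```r
# Cost candidates: start at 0.5 and double each cycle up to 128
C_grid <- 0.5 * 2^(0:8)
C_grid  # 0.5, 1, 2, 4, 8, 16, 32, 64, 128

# Possible use with caret (sigma_est a hypothetical kernel-width estimate,
# e.g. from kernlab::sigest; x, y hypothetical predictor/label objects):
# svm_fit <- caret::train(x, y, method = "svmRadial",
#                         tuneGrid = expand.grid(sigma = sigma_est, C = C_grid))
```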

Identify key species and mystery sample analysis

Using the best prediction model, i.e. the model with the highest accuracy based on the training and test data, the species that have the most influence on the model are determined. This will help to determine whether a group of key species can be identified. The analysis of the mystery samples will also be executed based on the best prediction model.

3.4 Software

The analysis of the abundance data is done with R (R Core Team, 2020) in RStudio (RStudio Team, 2020). Software packages and versions are listed in table 4.

Table 4: This table contains the version information about the software used in RStudio

Programming language     Ref.                         Version
R                        R Core Team (2020)           3.6.3

Development environment  Ref.                         Version
RStudio                  RStudio Team (2020)          1.2.5033

R-package                Ref.                         Version
caret                    Kuhn (2020)                  6.0.85
dendextend               Galili (2015)                1.13.4
factoextra               Kassambara and Mundt (2020)  1.0.7
ggfortify                Tang et al. (2016)           0.4.8
ggplot2                  Wickham (2016)               3.2.1
ggpubr                   Kassambara (2020)            0.2.5
nlme                     Pinheiro et al. (2020)       3.1.145
RColorBrewer             Neuwirth (2014)              1.1-2
stringr                  Wickham (2019)               1.4.0
tidyverse                Wickham et al. (2019)        1.3.0

3.5 Hardware

The hardware used was:

• Fujitsu LIFEBOOK A557, Core i5-7200U, 8GB RAM, 256GB SSD

• FH-Server

All computations were made on the bioinformatics server of the University of Applied Sciences, Vienna, which runs a Linux Debian operating system and provides 64 logical cores and 378 GB of random access memory (RAM), as well as on the Fujitsu LIFEBOOK A557. Docker containers were run on Docker version 17.05.0-ce (API version 1.29) based on a Debian GNU/Linux 8 (jessie) operating system.


4 Results

As is very often the case with biological data, the number of samples is relatively small while the features are all the more numerous and diverse. In order to nevertheless achieve a meaningful result, this study concentrates on those cities that collected the most samples on the City Sampling Days (CSDs). The maximum number of samples per city is 50. The eight cities with the most samples, namely Hong Kong, Incheon, Kiev, Ilorin, New York City, Singapore, Taipei and Tokyo, all participating in CSD17, were chosen for the analysis; the ninth was Vienna, in order to have a comparison of how the predictions work with few samples. In the case of Vienna the number of samples is 16 (table 1). The same approach was also applied to the CSD16 data, which resulted in four cities chosen for further analysis (table 1).

4.1 CSD17

To find key species, the data sets have to be processed. The focus is on the bacteria that are most common in the respective city. A complete list of all species and their abundances from all samples is matched against a list of bacteria from BacDive (Reimer et al., 2019). The resulting list contains only bacterial species. To sort the data by frequency, the median frequency over all samples is calculated for each city. The median is more robust to outliers, which originate from previous processes (sampling, sequencing, analysis, etc.), than the mean would be. The median occurrence of the species is then used to filter out those species that occur most frequently in the respective city. In this way, four data sets are created for each of the nine cities, with the 10, 20, 30 and 100 most frequent species.

The next step is to merge the data of the individual cities. Thus the nine times four data sets are condensed into four sets with the 10, 20, 30 and 100 most common species of all cities. These sets are called "Top 10", "Top 20", "Top 30" and "Top 100", according to the number of species they contain. In order to be able to compare whether and what effects it has when the same species appear among the most abundant species in several cities, additional data sets are generated. In the case of the "Top 10" set, which contains all species from all cities, the sets "Top 10_1" and "Top 10_2" are also generated. "Top 10_1" includes only those species that appear just once in the top 10 of all cities. "Top 10_2" likewise draws on the species in "Top 10", but keeps only those that appear in the top 10 of at least two cities. Following the same criteria, the corresponding data sets for "Top 20", "Top 30" and "Top 100" are created with the endings "_1" and "_2". The number of species per data set is shown in figure 2.

Figure 2: Species per data set from the CSD17

The number of species in "Top 10" corresponds to the sum of the species in "Top 10_1" and "Top 10_2". This applies analogously to the other data sets. In the "Top 10" there are a total of 33 species, of which 18 occur once in all cities and 15 occur in several cities. In the "Top 20" there are 52 species, of which 16 are unique and 36 occur more than once. In the "Top 30" there are a total of 80 species, of which 29 are unique and 51 occur more than once. In the "Top 100" there are a total of 207 species, 52 of which are unique and 155 occur more than once. "Top 10_1" contains more species than "Top 10_2" and is therefore the exception, since for all other data sets the sets ending in "_2" contain more species. The number of species that occur in several cities increases continuously from "Top 10_2" to "Top 100_2".

4.1.1 PCA analysis

The previously generated data is scaled in order to perform a PCA of each data set. The data points are then plotted against the first two PCs. A clear separation based on the principal components would be visible as individual, clearly separated clusters.
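The scaled PCA and the "PCs needed to explain a given share of variance" measure used below can be sketched with base R's prcomp on toy data:

```r
# Toy data: 30 samples x 10 species
set.seed(7)
X <- matrix(rnorm(30 * 10), nrow = 30)

# PCA on scaled data (unit variance per variable)
pca <- prcomp(X, scale. = TRUE)

# Cumulative proportion of variance explained by the first PCs
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Number of PCs needed to explain more than 50% of the variance
n_pc_50 <- which(cum_var > 0.5)[1]
```

The same `cum_var` computation, applied per data set, yields the PC counts reported in table 5.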


Figure 3: PCA analysis of "Top 10" and "Top 100"

In figure 3 one can see that in the biplots of the "Top 10" data, as well as in the PCA of the "Top 100" data, a large part of the data points is concentrated in one area. In the "Top 100" data, the point cloud is somewhat more dispersed than in the "Top 10" data. Some samples from New York seem to stand out from the rest when looking at the "Top 10". In the "Top 100", samples from New York, Hong Kong and Taipei separate themselves from the rest. To explain more than 25%, 50% and 75% of the variance in the "Top 10" data, one needs to include the first 2, 6 and 14 PCs (table 5). In order to explain the same percentages of variance for the "Top 100" data it is necessary to take 2, 8 and 28 PCs into account.

Table 5: List of PCs needed to explain a certain percentage of variance

Data set   >25%  >50%  >75%
Top 10      2     7    14
Top 10_1    3     6    11
Top 10_2    2     4     7
Top 20      3     8    18
Top 20_1    2     5     8
Top 20_2    2     6    14
Top 30      3     8    21
Top 30_1    2     5    11
Top 30_2    2     7    17
Top 100     2     8    28
Top 100_1   2     6    17
Top 100_2   2     8    25

Table 5 shows how many PCs are required to explain a certain percentage of the variance for each data set, namely >25%, >50% and >75%. For all data sets, two to three PCs are sufficient to explain more than 25%. Four to eight PCs are sufficient to explain more than 50%; "Top 10_2" is the only data set that requires only four PCs to reach 50%, and none of the others manage with such a small number of PCs. To describe more than 75% of the variance, more than 10 PCs are needed for each data set, except again for "Top 10_2", for which seven PCs suffice, and "Top 20_1", for which eight PCs suffice. All other data sets require between 11 ("Top 10_1" and "Top 30_1") and 28 ("Top 100") PCs.

4.1.2 Hierarchical Clustering

The HC is performed in R using the hclust() function. Agglomerative clustering is performed with the Euclidean distance as distance measure, and the linkage method used is complete linkage. To add the colored bar to the dendrogram (see figure 4), the dendextend package is used.
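The clustering step can be sketched with base R's stats functions on toy data (the dendextend coloring is omitted here):

```r
# Toy data: 12 samples x 4 species
set.seed(3)
X <- matrix(rnorm(12 * 4), nrow = 12)

# Agglomerative clustering: Euclidean distance, complete linkage
hc <- hclust(dist(X, method = "euclidean"), method = "complete")

# Cut the dendrogram into three clusters
# (cutree(hc, h = ...) would instead cut at a given height,
# as done in the text at "2e+05")
clusters <- cutree(hc, k = 3)
table(clusters)
```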

Figure 4: Hierarchical clustering of "Top 10"

Figure 4 indicates that single clusters of cities are not clearly distinguishable. On the left side there is a cluster consisting mostly of New York data points. Around the boundary between the second and the last third of the x-axis a cluster of Taipei samples is visible, but it does not really stand out. Cutting the dendrogram at "2e+05" results in three clusters, one of which, the middle cluster, is the already mentioned cluster of mostly New York samples. To the left there is a small cluster consisting of New York, Hong Kong and Singapore samples. The right cluster is by far the biggest, comprising more than 90% of the samples.


4.1.3 Random Forest analysis

The first approach to constructing a reliable prediction model uses random forest. For this purpose, the data is divided into a training set and a test set. The training set consists of two thirds of the data, the test set of one third (see table 3). The exact number varies between 32 and 33 samples for the training set and 16 and 17 samples for the test set, except for Vienna, where the training set contains 11 samples and the test set 5 samples. The implementation of the random forests is done with the R package caret. A model is created for each individual data set. To improve the model, a 5-fold cross-validation is always performed. This procedure is used to estimate how well a predictive model will perform on new data. The final model indicates, among other things, how high the error of the individual classes is and how it changes over the course of the calculation. Figures 5 and 6 show two of these curves.

Figure 5: The random forest analysis of the "Top 20" data set shows the lowest OOB estimate of error rate

Both plots show the trends in the error rate of each class, in this case each city. The error rate is plotted against the number of trees that the model has formed. In addition to the curves of the nine cities there is an additional curve (blue), the OOB error. This indicates the mean error in predicting each training sample by using only those trees that do not contain this particular training sample. Figure 5 shows the lowest value for the OOB error, namely 21.4 %, whereas figure 6 shows one of the highest values for the OOB error, namely 32.9 %.


Table 6 contains the error rates of all classes and all data sets. Additionally, the OOB error and the mean error of all classes per data set are listed. The smallest mean error is 25.19% in the data set "Top 10", the largest mean error is 35.78% in the data set "Top 10_2". In order to compare the effects on the error rate of keeping all species in the set, keeping only the unique species, or keeping only those that occur in at least two cities, table 6 is highlighted in colour. As already explained, the top 10 data, as well as the top 20, top 30 and top 100, split up into three data sets. The background color of the individual class error cells indicates whether the actual value of the error rate is the highest (red), lowest (green) or middle (yellow) value in the triplet of data sets in question. For the top 10 data and Ilorin this means that Ilorin/"Top 10" is yellow, because it lies between Ilorin/"Top 10_1", which is the lowest value and therefore green, and Ilorin/"Top 10_2", which is the highest value and therefore red. If two results are the same, both are marked green if the third has a higher error rate. Conversely, two equal error rates are marked red if the third has a lower error rate. If all three values are the same, all are highlighted in yellow. In the case of the top 20 data and Ilorin, for example, this means that both the error rate for "Top 20" and that for "Top 20_1" are marked green, because in both cases the error rate is 9.67% and therefore lower than for "Top 20_2" with 19.35%, which is therefore marked red.

Figure 6: The random forest analysis of the "Top 100_1" data set shows one of the highest OOB estimates of error rate

The data sets containing all species have the lowest value for the error rate 16 times, the medium value 16 times and the highest value 4 times. The sets with the unique species (ending in "_1") have the lowest value 8 times, the medium value 5 times and the highest value 23 times. The sets without the unique species (ending in "_2") have the lowest value 20 times, the medium value 7 times and the highest value 9 times. The lowest values, 20 in number, are thus most common in the group that excludes unique species from the analysis. The group containing all species shows the most medium values, namely 16. Most of the highest values, namely 23, are found in the group where the analysis was performed with the unique species only.

Table 6: Random forest error table CSD17

City        Top 10  Top 10_1  Top 10_2  Top 20  Top 20_1  Top 20_2
Hong Kong   0.2647  0.3529    0.2352    0.2352  0.2352    0.2352
Ilorin      0.0967  0.0645    0.3225    0.0967  0.0967    0.1935
Incheon     0.2727  0.5758    0.3939    0.2727  0.5152    0.1818
Kiev        0.3636  0.4848    0.5757    0.4242  0.4545    0.3636
New York    0.1176  0.1176    0.1176    0.1176  0.2059    0.1176
Singapore   0.4375  0.5313    0.5625    0.5625  0.7188    0.6250
Taipei      0.0313  0.0313    0.1563    0.0313  0.0625    0.0313
Tokyo       0.1818  0.2424    0.2727    0.2121  0.3030    0.1515
Vienna      0.5000  0.3333    0.5833    0.5000  0.5833    0.5000
mean error  0.2519  0.3037    0.3578    0.2725  0.3528    0.2666
OOB         0.248   0.321     0.321     0.241   0.336     0.259

City        Top 30  Top 30_1  Top 30_2  Top 100  Top 100_1  Top 100_2
Hong Kong   0.2059  0.3235    0.1470    0.2352   0.3529     0.2058
Ilorin      0.1290  0.0967    0.1612    0.0967   0.1290     0.0967
Incheon     0.2424  0.4545    0.2121    0.3333   0.3030     0.3030
Kiev        0.3636  0.5455    0.4545    0.4545   0.5152     0.3636
New York    0.1471  0.1471    0.1176    0.1176   0.0882     0.0882
Singapore   0.5313  0.7500    0.5313    0.5625   0.5938     0.5000
Taipei      0.0625  0.1875    0.0938    0.0938   0.1250     0.0625
Tokyo       0.2424  0.2121    0.2424    0.2121   0.3333     0.1818
Vienna      0.4167  0.5000    0.4167    0.5000   0.7500     0.5833
mean error  0.2601  0.3574    0.2641    0.2896   0.3545     0.2650
OOB         0.252   0.343     0.266     0.277    0.329      0.241

Another measure that emerges from the random forest analysis is the variable importance. In figure 7 and table 7 the class-specific values are averaged without weighting. Figure 7 shows graphically the influence of the 10 most important variables of "Top 30" and "Top 100_1". In the diagram on the right, containing the data of "Top 100_1", one can see that there is only a small reduction in variable importance; none of the variables (species) stands out from the others in terms of importance. The diagram on the left, containing the data of "Top 30", shows that the first variable is clearly set apart from the rest in terms of importance. This means that this variable has a much greater influence on the model.

Figure 7: Variable importance plots of the 10 most important variables

Table 7 lists the ten most important variables of each data set ranked according to their importance. One species can appear in this table a maximum of 8 times, namely in each of the 4 sets with all species, and either in the set with the unique species or in the one without the unique species. Two species reach the maximum appearance in this table: Kocuria rosea and Microbacterium aurum. Three species appear 7 times: Pseudomonas stutzeri, Ornithinimicrobium flavum and Cupriavidus metallidurans. Other species occur less frequently in this table.
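A hedged sketch of how such a ranking can be produced: scikit-learn's impurity-based `feature_importances_` is used here as a stand-in, whereas the class-specific, unweighted-averaged importance described above points to a permutation-style measure (as in R's randomForest). The placeholder species names and data are assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic abundance matrix with placeholder species names
X, y = make_classification(n_samples=300, n_features=30, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
species = [f"species_{i}" for i in range(X.shape[1])]  # hypothetical names

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Rank variables by importance and keep the ten best, mirroring the
# "10 most important variables" listings of table 7.
order = np.argsort(rf.feature_importances_)[::-1]
top10 = [(species[i], float(rf.feature_importances_[i])) for i in order[:10]]
```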


Table 7: 10 most important variables of each data set ranked according to their importance

Top 10                        Top 10_1                      Top 10_2
Kocuria.rosea                 Pseudomonas.stutzeri          Kocuria.rosea
Pseudomonas.stutzeri          Microbacterium.aurum          Cutibacterium.acnes
Ornithinimicrobium.flavum     Ornithinimicrobium.flavum     Comamonas.testosteroni
Microbacterium.aurum          Dermacoccus.nishinomiyaensis  Kocuria.palustris
Dermacoccus.nishinomiyaensis  Kocuria.indica                Micrococcus.luteus
Geodermatophilus.obscurus     Geodermatophilus.obscurus     Cupriavidus.metallidurans
Cupriavidus.metallidurans     Sphingobium.yanoikuyae        Paracoccus.yeei
Acinetobacter.schindleri      Methylorubrum.extorquens      Stenotrophomonas.maltophilia
Kocuria.indica                Bradyrhizobium.japonicum      Moraxella.osloensis
Comamonas.testosteroni        Acinetobacter.schindleri      Bradyrhizobium.diazoefficiens

Top 20                        Top 20_1                      Top 20_2
Kocuria.rosea                 Massilia.oculi                Kocuria.rosea
Pseudomonas.stutzeri          Pseudomonas.putida            Pseudomonas.stutzeri
Kocuria.flava                 Ornithinimicrobium.flavum     Cupriavidus.metallidurans
Microbacterium.aurum          Microbacterium.aurum          Geodermatophilus.obscurus
Ornithinimicrobium.flavum     Aquabacterium.olei            Dermacoccus.nishinomiyaensis
Acinetobacter.indicus         Kocuria.flava                 Comamonas.testosteroni
Geodermatophilus.obscurus     Friedmanniella.luteola        Kocuria.indica
Massilia.oculi                Xanthomonas.campestris        Modestobacter.marinus
Pseudoxanthomonas.suwonensis  Enterobacter.cloacae          Cutibacterium.acnes
Pseudomonas.putida            Pseudoxanthomonas.suwonensis  Micrococcus.luteus

Top 30                        Top 30_1                      Top 30_2
Acinetobacter.indicus         Kocuria.flava                 Pseudomonas.stutzeri
Corynebacterium.glutamicum    Ornithinimicrobium.flavum     Microbacterium.aurum
Cupriavidus.metallidurans     Brevundimonas.vesicularis     Kocuria.rosea
Microbacterium.aurum          Enterobacter.cloacae          Dermacoccus.nishinomiyaensis
Kocuria.rosea                 Nakamurella.multipartita      Cupriavidus.metallidurans
Rothia.mucilaginosa           Acinetobacter.indicus         Massilia.oculi
Dermacoccus.nishinomiyaensis  Corynebacterium.aurimucosum   Rothia.mucilaginosa
Blastococcus.saxobsidens      Pseudoxanthomonas.suwonensis  Kocuria.indica
Enterobacter.cloacae          Sphingopyxis.macrogoltabida   Blastococcus.saxobsidens
Massilia.oculi                Brevundimonas.naejangsanensis Geodermatophilus.obscurus

Top 100                       Top 100_1                     Top 100_2
Kocuria.rosea                 Pseudomonas.tolaasii          Rhodococcus.fascians
Kocuria.flava                 Acinetobacter.indicus         Acinetobacter.schindleri
Pseudomonas.stutzeri          Cupriavidus.pauculus          Ornithinimicrobium.flavum
Ornithinimicrobium.flavum     Tsukamurella.tyrosinosolvens  Microbacterium.aurum
Comamonas.aquatica            Dietzia.psychralcaliphila     Cupriavidus.metallidurans
Microbacterium.aurum          Corynebacterium.xerosis       Kocuria.rosea
Cupriavidus.metallidurans     Pantoea.agglomerans           Comamonas.aquatica
Dietzia.lutea                 Microbacterium.lemovicicum    Pseudomonas.fluorescens
Modestobacter.marinus         Tessaracoccus.flavus          Pseudomonas.stutzeri
Massilia.oculi                Microbacterium.chocolatum     Corynebacterium.kroppenstedtii


To put the random forest model in relation to others, the next step is a K-nearest neigh- bors analysis.

4.1.4 K-nearest neighbors analysis

The K-nearest neighbors algorithm is a non-parametric method in pattern recognition that can be used for classification problems. A data point is assigned to a class based on its k nearest neighbors, where k is the number of adjacent data points considered. The process of optimising the parameter k is shown in figure 8. The graph in the upper left corner of figure 8 shows the course for "Top 10", "Top 10_1" and "Top 10_2". The same applies to "Top 20" at the top right, "Top 30" at the bottom left and "Top 100" at the bottom right.
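The optimisation of k can be sketched as a grid search under repeated stratified cross-validation; the grid of odd k values, the repeat count and the synthetic data are assumptions, not the thesis setup:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the abundance matrix
X, y = make_classification(n_samples=300, n_features=30, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

# Repeated, stratified 5-fold CV over a small grid of odd k values;
# the k with the highest mean accuracy is selected.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": [5, 7, 9, 11, 13]},
                    scoring="accuracy", cv=cv)
grid.fit(X, y)
best_k = grid.best_params_["n_neighbors"]
```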

Figure 8: KNN accuracy of repeated cross-validation versus neighbors

Among the top 10 species "Top 10" achieves the highest accuracy of 0.548 at k=5, followed by "Top 10_1" with 0.509 at k=9. "Top 10_2" achieves a value of 0.472 at k=9. Among the top 20 species "Top 20" achieves the highest accuracy of 0.551 at k=5, followed by "Top 20_2" with 0.514 at k=5. "Top 20_1" achieves a value of 0.495 at k=5. Among the top 30 species "Top 30" achieves the highest accuracy of 0.521 at k=5, followed by "Top 30_2" with 0.500 at k=5. "Top 30_1" achieves a value of 0.475 at k=9. Among the top 100 species "Top 100" achieves the highest accuracy of 0.565 at k=5, followed by "Top 100_2" with 0.527 at k=5. "Top 100_1" achieves a value of 0.517 at k=5. In all four cases the data set containing all species has the highest accuracy, and it always reaches its highest value at k=5. "Top 10_1", "Top 10_2" and "Top 30_1" perform best with k=9; the nine other data sets share the same optimized value, namely k=5.

4.1.5 Support Vector Machines analysis

The last machine learning method used is SVM. The model separates the classes by a decision boundary and then assigns the samples to the resulting categories. In searching for the best possible separation, the margin between the classes can be adjusted, as can the penalty for samples that lie within the margin. The cost parameter defines how strongly samples inside the margin contribute to the overall error. The process of optimisation of the cost parameter in dependence of the accuracy of the model is shown in figure 9. The graph in the upper left corner of figure 9 shows the course for "Top 10", "Top 10_1" and "Top 10_2". The same applies to "Top 20" at the top right, "Top 30" at the bottom left and "Top 100" at the bottom right.
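The cost optimisation can be sketched the same way, scanning C over powers of two from 2 to 128 as in the figures; the RBF kernel and the synthetic data are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in for the abundance matrix
X, y = make_classification(n_samples=300, n_features=30, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

# Larger C penalizes margin violations more heavily (narrower margin);
# the grid matches the C range scanned in figure 9.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [2 ** i for i in range(1, 8)]},
                    scoring="accuracy", cv=cv)
grid.fit(X, y)
best_C = grid.best_params_["C"]
```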

Figure 9: SVM accuracy versus cost

Among the top 10 species "Top 10" achieves the highest accuracy of the repeated cross-validation of 0.590 at C=8, followed by "Top 10_2" with 0.582 at C=64. "Top 10_1" achieves a value of 0.508 at C=128. Among the top 20 species "Top 20_2" achieves the highest accuracy of 0.604 at C=32, followed by "Top 20" with 0.602 at C=128. "Top 20_1" achieves a value of 0.531 at C=8. Among the top 30 species "Top 30_2" achieves the highest accuracy of 0.578 at C=64, followed by "Top 30" with 0.570 at C=8. "Top 30_1" achieves a value of 0.510 at C=64. Among the top 100 species "Top 100" achieves the highest accuracy of 0.603 at C=32, followed by "Top 100_2" with 0.596 at C=32. "Top 100_1" achieves a value of 0.538 at C=32.

4.1.6 Accuracy comparison

In this section a comparison is made of the accuracy of the three prediction models. Table 8 shows the optimized parameters of the single models. For each machine learn- ing model one parameter was optimized. In the case of RF it is "mtry", for KNN it is "k" and for SVM the parameter optimized is "C".

Table 8: The optimized parameters "mtry" for RF, "k" for KNN and "C" for SVM. These parameters are optimized during the model development. The value that leads to the highest accuracy is selected.

           Top 10  Top 10_1  Top 10_2  Top 20  Top 20_1  Top 20_2  Top 30  Top 30_1  Top 30_2  Top 100  Top 100_1  Top 100_2
mtry (RF)    2        2         8        2        2         2        80       2        26         2         2         78
k (KNN)      5        9         9        5        5         5         5       9         5         5         5          5
C (SVM)      8      128        64      128        8        32         8      64        64        32        32         32

Table 9 contains the accuracy of the models and the according Kappa value with the training data and cross-validation. "Kappa" or "Cohen's Kappa" is a robust measure of inter-rater reliability; the robustness comes from the fact that the possibility of a random match is also taken into account. Interpretation proposed by Landis and Koch (1977): κ < 0 = poor agreement, 0–0.20 = slight agreement, 0.21–0.40 = fair agreement, 0.41–0.60 = moderate agreement, 0.61–0.80 = substantial agreement and 0.81–1.00 = (almost) perfect agreement. The RF model has a maximum calculated accuracy of 0.763 for the "Top 100" data set; the lowest accuracy is 0.631 for the "Top 10_2" data set. The RF model consistently has the highest accuracy compared to the other two models.
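Cohen's Kappa can be computed directly from the observed agreement and the chance agreement implied by the marginal label frequencies; a minimal self-contained implementation:

```python
def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between two labelings, corrected for
    the agreement expected by chance from the marginal frequencies."""
    n = len(y_true)
    labels = sorted(set(y_true) | set(y_pred))
    # Observed agreement p_o
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Expected chance agreement p_e from the two marginal distributions
    p_e = sum((sum(t == c for t in y_true) / n) *
              (sum(p == c for p in y_pred) / n) for c in labels)
    return (p_o - p_e) / (1.0 - p_e)
```

For example, `cohens_kappa([0, 0, 1, 1], [0, 0, 1, 0])` gives (0.75 - 0.5) / (1 - 0.5) = 0.5, moderate agreement on the Landis-Koch scale.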


Table 9: Accuracy of machine learning models based on the training data. The accuracy values and the according Kappa values are always representing the highest value reached by optimizing parameters of the used methods. In the case of RF the parameter optimized is "mtry", for KNN the "k" is optimized and for SVM the "C"-parameter is chosen for the optimum output.

       Top 10            Top 10_1          Top 10_2
       accuracy  Kappa   accuracy  Kappa   accuracy  Kappa
RF     0.734     0.698   0.683     0.641   0.631     0.582
KNN    0.548     0.488   0.509     0.443   0.472     0.401
SVM    0.590     0.535   0.508     0.442   0.582     0.525

       Top 20            Top 20_1          Top 20_2
       accuracy  Kappa   accuracy  Kappa   accuracy  Kappa
RF     0.741     0.706   0.663     0.618   0.730     0.694
KNN    0.551     0.491   0.495     0.428   0.514     0.448
SVM    0.602     0.548   0.531     0.468   0.604     0.550

       Top 30            Top 30_1          Top 30_2
       accuracy  Kappa   accuracy  Kappa   accuracy  Kappa
RF     0.730     0.694   0.682     0.639   0.755     0.722
KNN    0.521     0.457   0.475     0.406   0.500     0.433
SVM    0.570     0.512   0.510     0.445   0.578     0.521

       Top 100           Top 100_1         Top 100_2
       accuracy  Kappa   accuracy  Kappa   accuracy  Kappa
RF     0.741     0.706   0.673     0.628   0.763     0.731
KNN    0.565     0.489   0.517     0.454   0.527     0.464
SVM    0.603     0.549   0.538     0.475   0.596     0.540

Behind the RF model, the SVM model has a higher accuracy than the KNN model in 11 of 12 cases. The only exception is the data set "Top 10_1", although the values here are very close. In 11 cases the difference in accuracy between the RF model and the second-best model is greater than 0.1; only for "Top 10_2" is the difference smaller. The highest accuracy of the KNN model is 0.565, which is still less than the lowest value of the RF model. 0.604 is the highest accuracy of the SVM models and therefore still below the lowest value of the RF model.

Table 10 indicates the accuracy of the prediction models when the test data is included. Additionally, the 95% Confidence Interval (CI) and the Kappa value are displayed. The accuracy of the RF model including test data is always higher than without test data (see tables 9 & 10). The accuracy of the KNN model including test data is always higher than that without test data, with one exception ("Top 30_1"). The accuracy of the SVM model including test data is higher in 10 out of 12 cases, the exceptions being "Top 30_1" and "Top 100_2".

Table 10: Accuracy of machine learning models based on the test data

     Top 10                         Top 10_1                       Top 10_2
     accuracy  95% CI        Kappa  accuracy  95% CI        Kappa  accuracy  95% CI        Kappa
RF   0.793     0.714, 0.858  0.764  0.704     0.619, 0.779  0.663  0.756     0.674, 0.825  0.722
KNN  0.578     0.490, 0.662  0.519  0.526     0.438, 0.612  0.460  0.541     0.453, 0.627  0.476
SVM  0.630     0.542, 0.711  0.579  0.563     0.475, 0.648  0.504  0.622     0.535, 0.704  0.571

     Top 20                         Top 20_1                       Top 20_2
     accuracy  95% CI        Kappa  accuracy  95% CI        Kappa  accuracy  95% CI        Kappa
RF   0.793     0.714, 0.858  0.764  0.689     0.604, 0.766  0.646  0.800     0.723, 0.864  0.772
KNN  0.593     0.505, 0.676  0.538  0.526     0.438, 0.612  0.461  0.578     0.490, 0.662  0.519
SVM  0.681     0.596, 0.759  0.638  0.585     0.497, 0.669  0.528  0.652     0.565, 0.732  0.604

     Top 30                         Top 30_1                       Top 30_2
     accuracy  95% CI        Kappa  accuracy  95% CI        Kappa  accuracy  95% CI        Kappa
RF   0.807     0.731, 0.870  0.781  0.711     0.627, 0.786  0.671  0.778     0.698, 0.845  0.747
KNN  0.607     0.520, 0.690  0.554  0.467     0.380, 0.554  0.396  0.578     0.490, 0.662  0.522
SVM  0.622     0.535, 0.704  0.571  0.496     0.409, 0.584  0.429  0.644     0.558, 0.725  0.596

     Top 100                        Top 100_1                      Top 100_2
     accuracy  95% CI        Kappa  accuracy  95% CI        Kappa  accuracy  95% CI        Kappa
RF   0.793     0.714, 0.858  0.764  0.719     0.636, 0.792  0.680  0.800     0.723, 0.864  0.772
KNN  0.622     0.535, 0.704  0.571  0.533     0.446, 0.620  0.470  0.585     0.497, 0.669  0.528
SVM  0.630     0.542, 0.711  0.579  0.541     0.453, 0.627  0.479  0.585     0.497, 0.669  0.529

The RF model has a maximum calculated accuracy of 0.807 for the "Top 30" data set and the lowest accuracy is 0.689 for the "Top 20_1" data set. The RF model always has the highest accuracy compared to the other two models. Following the RF model, the SVM model has a higher accuracy than the KNN model in 11 of 12 cases. The only exception is the data set "Top 100_2", where the accuracy is the same for both models, namely 0.585.
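If the 95% CIs in table 10 are exact binomial (Clopper-Pearson) intervals — the kind reported by R's caret::confusionMatrix — they can be recomputed from the number of correctly classified test samples. The counts below are hypothetical, purely for illustration:

```python
from scipy.stats import beta

def clopper_pearson(correct, n, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for a binomial
    proportion, via quantiles of the beta distribution."""
    lo = beta.ppf(alpha / 2, correct, n - correct + 1) if correct > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, correct + 1, n - correct) if correct < n else 1.0
    return float(lo), float(hi)

# Hypothetical example: 107 of 135 test samples classified correctly
lo, hi = clopper_pearson(107, 135)
```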

4.2 CSD16

The analysis of the data from CSD16 is analogous to the analysis of the data from CSD17. Data from CSD16 are available for 11 cities. Again, only those cities are used for the analysis that are close to the maximum number of samples, i.e. 50. The cities selected are Doha (50 samples), Fairbanks (48 samples), Ilorin (47 samples) and New York (49 samples) (see table 1). The analysis of CSD16 therefore only includes 4 cities, significantly fewer than in the CSD17 analysis. Here, too, the focus is on bacterial species that can be selected based on the list from BacDive (Reimer et al., 2019). The species with the highest abundance are identified in the same way as for CSD17. The data from the individual cities are merged and the 10, 20, 30 and 100 most common species are determined. The data sets are named analogously to the data sets of CSD17.

Figure 10: Species per dataset from the CSD16

The number of species per data set is shown in figure 10. The number of species in "Top 10" again corresponds to the sum of the species in "Top 10_1" and "Top 10_2"; this applies analogously to the other data sets. In the "Top 10" there are a total of 23 species, of which 11 occur in only one city and 12 occur in several cities. In the "Top 20" there are 44 species, of which 24 are unique and 20 occur more often than once. In the "Top 30" there are a total of 66 species, of which 37 are unique and 29 occur more than once. In the "Top 100" there are a total of 234 species, 140 of which are unique and 94 occur more than once. In contrast to CSD17, CSD16 contains fewer species in the "Top 10", "Top 20" and "Top 30" (see tables 10 and 2). Only in the "Top 100" does CSD16 contain more species than CSD17 (234 versus 207). It is also noticeable that, except for the top 10 species, the fraction of unique species is always larger than the fraction of species that occur at least twice. This is also in contrast to the data of CSD17.
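The split into unique ("_1") and shared ("_2") species can be sketched by counting in how many cities' top lists each species occurs; the city lists below are hypothetical toy data, not the CSD16 rankings:

```python
from collections import Counter

# Hypothetical per-city top-species lists (the real lists come from
# the ranked abundance data).
top_species = {
    "Doha":      ["Pseudomonas.stutzeri", "Kocuria.rosea", "Bacillus.circulans"],
    "Fairbanks": ["Pseudomonas.stutzeri", "Moraxella.osloensis", "Salmonella.enterica"],
    "Ilorin":    ["Kocuria.rosea", "Micrococcus.luteus", "Escherichia.coli"],
}

# Count in how many cities' lists each species appears.
counts = Counter(s for species in top_species.values() for s in species)

all_species = set(counts)                           # "Top N"   (union)
unique = {s for s, c in counts.items() if c == 1}   # "Top N_1" (one city)
shared = {s for s, c in counts.items() if c > 1}    # "Top N_2" (several cities)
```

By construction the "Top N" set is the disjoint union of the "_1" and "_2" sets, which is why their species counts add up in figure 10.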


4.2.1 PCA analysis

Also for these data, a PCA is performed first, which plots the data points as a function of the first two PCs. Figure 11 illustrates the PCA for the data sets "Top 10_2" and "Top 30_1". The data points of "Top 10_2" are mainly distributed along the axis of the second Principal Component (PC2): towards the negative end of this axis almost exclusively data points of Ilorin can be seen, while the data points of the remaining cities lie in the positive range. Some data points of New York are spread along the axis of the first Principal Component (PC1).

Figure 11: PCA analysis of "Top 10_2" and "Top 30_1" from the CSD16

In table 11 one can see that for "Top 10_2" 2, 5 and 8 PCs are required to explain more than 25%, 50% and 75% of the variance, respectively. The distribution of the data points of "Top 30_1" (figure 11) shows again that the Ilorin data points are distributed along the axis of PC2, mainly in the negative range below -0.1. Fairbanks and New York are drawn out along both PC axes. Doha is also isolated from the rest of the data. For "Top 30_1", table 11 shows that 3, 7 and 15 PCs are needed to explain more than 25%, 50% and 75% of the variance.
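The PC counts in table 11 follow from the cumulative explained-variance ratio; a sketch via SVD on synthetic correlated data (the thresholds match the table, the data do not):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy correlated data standing in for the abundance matrix
X = rng.normal(size=(60, 30)) @ rng.normal(size=(30, 30))

# PCA via SVD of the centered matrix; squared singular values are
# proportional to the variance explained by each PC.
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
explained = s ** 2 / np.sum(s ** 2)
cumulative = np.cumsum(explained)

# Smallest number of PCs whose cumulative share reaches each threshold
pcs_needed = {t: int(np.searchsorted(cumulative, t) + 1)
              for t in (0.25, 0.50, 0.75)}
```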


Table 11: List of PCs needed to explain a certain percentage of variance

Variance explained   >25%   >50%   >75%
Top 10                 3      6     11
Top 10_1               2      4      6
Top 10_2               2      5      8
Top 20                 3      8     16
Top 20_1               3      6     11
Top 20_2               2      5     10
Top 30                 3      8     19
Top 30_1               3      7     15
Top 30_2               3      6     13
Top 100                2      7     18
Top 100_1              2      5     14
Top 100_2              3      8     17

Table 11 shows how many PCs are required to explain a certain percentage of the variance for each data set, namely >25%, >50% and >75%. For all data sets, two to three PCs are sufficient to explain more than 25%, and four to eight PCs are sufficient to explain more than 50%. So far this matches the results for the CSD17 data. The number of PCs needed to explain more than 75% of the variance lies between 6 and 19. For the "Top 10_1" data set only 6 PCs are required to exceed 75%, while "Top 30" needs the highest number of PCs, namely 19.

4.2.2 Hierarchical Clustering

Figure 12: Hierarchical clustering of "Top 10"

The HC for the CSD16 data is performed in the same manner as for the CSD17 data. The resulting dendrogram is visualized in figure 12. There are several clusters, most of which consist of samples from one city. In the middle is a large cluster containing almost only samples from Doha. Clusters of Ilorin can be seen on both the right and the left.


New York and Fairbanks are also grouped together repeatedly, but these groups are more widely distributed along the entire x-axis. Cutting the dendrogram at "5.0e+06" results in five clusters. The first two from the left contain only Fairbanks samples, the third consists of Ilorin samples. The fourth cluster comprises about 80% of the data, with samples from each city. The fifth is a cluster of Ilorin samples only.
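Cutting a dendrogram into a fixed number of clusters can be sketched with SciPy; the Ward linkage and the toy data are assumptions, not the thesis settings:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Toy data: three shifted groups of samples standing in for cities
X = np.vstack([rng.normal(loc=m, size=(20, 10)) for m in (0.0, 3.0, 6.0)])

# Agglomerative clustering; Z encodes the dendrogram merges.
Z = linkage(X, method="ward")

# Cut the tree so that at most five flat clusters remain.
labels = fcluster(Z, t=5, criterion="maxclust")
n_clusters = len(set(labels))
```

To cut at a fixed height instead — as done above at "5.0e+06" — `fcluster(Z, t=height, criterion="distance")` returns the clusters formed below that merge distance.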

4.2.3 Random Forest analysis

Again the first machine learning method applied is the random forest. The data sets are split up according to table 3. The size of the training sets is between 29 and 32 samples, test sets contain 15 to 16 samples. One model is created for each individual data set. To improve the model, a 5-fold cross-validation is always performed. This procedure is used to estimate how well a predictive model will work when new data is involved. The final model indicates, among other things, how high the error of the individual classes is and how it changes during the course of the calculation. Figures 13 and 14 show two of these curves.
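The roughly two-thirds/one-third, per-city-balanced split described above can be sketched with a stratified split; the sizes and data are synthetic stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 4 balanced classes ("cities") of 45 samples each
X, y = make_classification(n_samples=180, n_features=30, n_informative=10,
                           n_classes=4, n_clusters_per_class=1,
                           random_state=0)

# Stratified split: one third held out for testing, so each class keeps
# its training/test proportion, mirroring the per-city sizes above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=1 / 3, stratify=y, random_state=0)
```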

Figure 13: The random forest analysis of the "Top 10_1" data set shows the highest OOB estimate of error rate

Both plots show the trends in the error rate of each class, in this case each city. The error rate is plotted against the number of trees that the model has formed. In addition to the curves of the four cities there is an additional curve (blue), the OOB error.


Figure 13 shows the highest value for the OOB error, namely 12.9%, whereas figure 14 shows the lowest value, namely 7.9%. The individual class errors of the "Top 10_1" data range from 3.23% for Ilorin to 16.13% for New York. The individual class errors of the "Top 20_1" data range from 0% for Ilorin to 15.63% for Fairbanks. Table 12 shows all individual class error rates for all data sets of the CSD16 data, as well as the mean error of the classes per data set and the OOB error. The mean error ranges from 5.57% for "Top 30_1" to 12.75% for "Top 30_2". Ilorin's class error is 0 in four approaches; its highest error is 6.45%. Doha achieves 0% once; its highest value is 6.67%. The class error rate for Fairbanks lies between 9.38% and 21.88%. New York's class error rate ranges from 6.45% to 22.58%. The colour coding of the individual cells is to be understood in the same way as for the CSD17 data. Data sets containing all species result in the lowest class errors 6 times, medium errors 6 times and the highest values 4 times. Data sets containing unique species result in the lowest class errors 8 times, medium errors 3 times and the highest values 5 times. Data sets excluding unique species result in the lowest class errors 2 times, medium errors 4 times and the highest values 10 times.

Figure 14: The random forest analysis of the "Top 20_1" data set shows the lowest OOB estimate of error rate


Table 12: Random forest error table CSD16

            Top 10   Top 10_1  Top 10_2  Top 20   Top 20_1  Top 20_2
Doha        0        0.0667    0.0333    0.0333   0.0333    0.0333
Fairbanks   0.1563   0.2188    0.1563    0.1250   0.1563    0.1875
Ilorin      0        0.0323    0.0323    0.0645   0         0.0645
New York    0.1613   0.1613    0.1290    0.1613   0.0968    0.1935
mean error  0.0794   0.1197    0.0877    0.0960   0.0716    0.1197
OOB         0.105    0.129     0.113     0.089    0.079     0.113

            Top 30   Top 30_1  Top 30_2  Top 100  Top 100_1 Top 100_2
Doha        0.0333   0.0333    0.0333    0.0333   0.0333    0.0667
Fairbanks   0.1563   0.1250    0.2188    0.1563   0.0938    0.1875
Ilorin      0.0323   0         0.0323    0        0.0645    0.0323
New York    0.1290   0.0645    0.2258    0.1935   0.1613    0.1935
mean error  0.0877   0.0557    0.1275    0.0958   0.0882    0.1200
OOB         0.097    0.089     0.113     0.121    0.121     0.121

Figure 15 and table 13 show the specific importance values and important species. The left diagram in figure 15 shows the 10 most important variables of "Top 10_1". Bacillus circulans, Pseudomonas balearica, Staphylococcus epidermidis and Acinetobacter pittii are clearly set apart from the other variables. The right diagram shows the 10 most important variables of "Top 20_1". The decrease in importance is not as steep as for "Top 10_1"; still, three species stand out: Chryseobacterium indoltheticum, Pseudomonas balearica and Dermacoccus nishinomiyaensis.

Figure 15: Variable importance plots of the 10 most important variables


Table 13 lists the ten most important variables of each data set ranked according to their importance. One species can appear in this table a maximum of 8 times, namely in each of the 4 sets with all species, and either in the set with the unique species or in the one without the unique species. Two species reach the maximum appearance in this table: Acinetobacter schindleri and Pseudomonas balearica. Three species appear 7 times: Pseudomonas stutzeri, Bacillus circulans and Moraxella osloensis. Other species occur less frequently in this table.


Table 13: 10 most important variables of each data set ranked according to their importance

Top 10                          Top 10_1                        Top 10_2
Pseudomonas.stutzeri            Bacillus.circulans              Pseudomonas.stutzeri
Pseudomonas.balearica           Pseudomonas.balearica           Acinetobacter.schindleri
Acinetobacter.schindleri        Staphylococcus.epidermidis      Enterobacter.hormaechei
Staphylococcus.epidermidis      Acinetobacter.pittii            Kocuria.rosea
Bacillus.circulans              Pseudomonas.fluorescens         Moraxella.osloensis
Acinetobacter.pittii            Salmonella.enterica             Acinetobacter.baumannii
Pseudomonas.fluorescens         Enterobacter.cloacae            Cutibacterium.acnes
Salmonella.enterica             Staphylococcus.hominis          Acinetobacter.lwoffii
Moraxella.osloensis             Escherichia.coli                Pseudomonas.resinovorans
Kocuria.rosea                   Pseudomonas.aeruginosa          Micrococcus.luteus

Top 20                          Top 20_1                        Top 20_2
Clostridium.botulinum           Chryseobacterium.indoltheticum  Pseudomonas.stutzeri
Chryseobacterium.indoltheticum  Pseudomonas.balearica           Acinetobacter.schindleri
Acinetobacter.schindleri        Dermacoccus.nishinomiyaensis    Pseudomonas.fluorescens
Bacillus.circulans              Clostridium.botulinum           Moraxella.osloensis
Stenotrophomonas.rhizophila     Stenotrophomonas.rhizophila     Staphylococcus.epidermidis
Pseudomonas.balearica           Bacillus.circulans              Kocuria.rosea
Pseudomonas.fluorescens         Massilia.oculi                  Acinetobacter.pittii
Pseudomonas.stutzeri            Corynebacterium.jeikeium        Acinetobacter.radioresistens
Moraxella.osloensis             Enterobacter.cloacae            Micrococcus.luteus
Acinetobacter.pittii            Acinetobacter.nosocomialis      Acinetobacter.baumannii

Top 30                          Top 30_1                        Top 30_2
Clostridium.botulinum           Pseudomonas.balearica           Pseudomonas.stutzeri
Chryseobacterium.indoltheticum  Clostridium.botulinum           Acinetobacter.schindleri
Lactococcus.lactis              Dermacoccus.nishinomiyaensis    Staphylococcus.epidermidis
Acinetobacter.schindleri        Lactococcus.lactis              Acinetobacter.pittii
Bacillus.circulans              Bacillus.circulans              Moraxella.osloensis
Pseudomonas.balearica           Chryseobacterium.indoltheticum  Pseudomonas.fluorescens
Deinococcus.wulumuqiensis       Corynebacterium.jeikeium        Kocuria.rosea
Moraxella.osloensis             Acinetobacter.indicus           Salmonella.enterica
Stenotrophomonas.rhizophila     Stenotrophomonas.rhizophila     Massilia.oculi
Pseudomonas.stutzeri            Pseudomonas.xanthomarina        Acinetobacter.radioresistens

Top 100                         Top 100_1                       Top 100_2
Dermacoccus.nishinomiyaensis    Chryseobacterium.indoltheticum  Pseudomonas.balearica
Pseudomonas.balearica           Clostridium.botulinum           Lactococcus.lactis
Deinococcus.wulumuqiensis       Clostridium.argentinense        Acinetobacter.schindleri
Variovorax.paradoxus            Pseudomonas.moraviensis         Moraxella.osloensis
Acinetobacter.schindleri        Deinococcus.wulumuqiensis       Pseudomonas.stutzeri
Clostridium.argentinense        Bacillus.flexus                 Stenotrophomonas.rhizophila
Enterobacter.cloacae            Bacillus.circulans              Dermacoccus.nishinomiyaensis
Chryseobacterium.indoltheticum  Clostridium.pasteurianum        Staphylococcus.epidermidis
Staphylococcus.haemolyticus     Acinetobacter.soli              Enterobacter.ludwigii
Bacillus.circulans              Octadecabacter.temperatus       Pseudomonas.chlororaphis

4.2.4 K-nearest neighbors analysis

The K-nearest neighbors analysis for the CSD16 data is performed with the same parameters as for the CSD17 data. The process of optimising the parameter k is shown in figure 16; the number of neighbors is plotted against the accuracy of the repeated cross-validation. The graph in the upper left corner of figure 16 shows the course for "Top 10", "Top 10_1" and "Top 10_2". The same applies to "Top 20" at the top right, "Top 30" at the bottom left and "Top 100" at the bottom right.

Figure 16: KNN accuracy of repeated cross-validation versus neighbors from the CSD16

Among the top 10 species "Top 10" achieves the highest accuracy of 0.699 at k=5, followed by "Top 10_2" with an accuracy of 0.651 at k=5 and "Top 10_1" with an accuracy of 0.592 at k=13. Among the top 20 species "Top 20_1" achieves the highest accuracy of 0.788 at k=5, followed by "Top 20" with an accuracy of 0.743 at k=5 and "Top 20_2" with an accuracy of 0.636 at k=5. Among the top 30 species "Top 30_1" achieves the highest accuracy of 0.777 at k=5, followed by "Top 30" with an accuracy of 0.749 at k=5 and "Top 30_2" with an accuracy of 0.670 at k=7. Among the top 100 species "Top 100_2" achieves the highest accuracy of 0.725 at k=5, followed by "Top 100_1" with an accuracy of 0.709 at k=5 and "Top 100" with an accuracy of 0.678 at k=5. The data sets containing all species (red line) reach the highest accuracy values in the top 10 set; those containing unique species reach the highest values in the top 20 and top 30 sets, and the top 100's highest value occurs in the set excluding unique species. The optimized k=5 holds for 10 out of 12 sets; the exceptions are "Top 10_1" with k=13 and "Top 30_2" with k=7.

4.2.5 Support Vector Machines analysis

For the CSD16 data, a Support Vector Machine model is likewise derived for each data set to evaluate the best separation possible between the classes. The process of optimisation of the cost parameter in dependence of the accuracy of the repeated cross-validation of the model is shown in figure 17. The graph in the upper left corner of figure 17 shows the course for "Top 10", "Top 10_1" and "Top 10_2". The same applies to "Top 20" at the top right, "Top 30" at the bottom left and "Top 100" at the bottom right.

Figure 17: SVM accuracy of repeated cross-validation versus cost from the CSD16

Among the top 10 species "Top 10" achieves the highest accuracy of 0.728 at C=64, followed by "Top 10_2" with 0.658 at C=2. "Top 10_1" achieves a value of 0.578 at C=64. Among the top 20 species "Top 20" achieves the highest accuracy of 0.723 at C=8, followed by "Top 20_1" with 0.687 at C=4. "Top 20_2" achieves a value of 0.668 at C=16. Among the top 30 species "Top 30_1" achieves the highest accuracy of 0.699 at C=8, followed by "Top 30" with 0.678 at C=4. "Top 30_2" achieves a value of 0.650 at C=128. Among the top 100 species "Top 100_2" achieves the highest accuracy of 0.695 at C=8, followed by "Top 100" with 0.678 at C=16. "Top 100_1" achieves a value of 0.638 at C=8.

4.2.6 Accuracy comparison

In this section we compare the accuracy and the according Kappa values of the three prediction models. Table 15 contains the accuracy of the models with the training data and the cross-validation. Table 14 shows the optimized parameters of the single models. For each machine learning model one parameter was optimized: in the case of RF it is "mtry", for KNN it is "k" and for SVM the parameter optimized is "C".

Table 14: The optimized parameters "mtry" for RF, "k" for KNN and "C" for SVM. These parameters are optimized during the model development. The value that leads to the highest accuracy is selected.

           Top 10  Top 10_1  Top 10_2  Top 20  Top 20_1  Top 20_2  Top 30  Top 30_1  Top 30_2  Top 100  Top 100_1  Top 100_2
mtry (RF)    2        2         2       23        2        11        34       2        15         2        71         94
k (KNN)      5       13         5        5        5         5         5       5         7         5         5          5
C (SVM)     64       64         2        8        4        16         4       8       128        16         8          8

According to table 15 the RF model has a maximum calculated accuracy of 0.936 for the "Top 20_1" data set and the lowest accuracy is 0.871 for "Top 10_1". The RF model always has the highest accuracy compared to the other two models. The accuracy of the KNN model ranges from 0.788 for "Top 20_1" to 0.592 for "Top 10_1". The accuracy of the SVM model varies between 0.728 for "Top 10" and 0.578 for "Top 10_1".


Table 15: Accuracy of machine learning models based on the training data. The accuracy values and the according Kappa values are always representing the highest value reached by optimizing parameters of the used methods. In the case of RF the parameter optimized is "mtry", for KNN the "k" is optimized and for SVM the "C"-parameter is chosen for the optimum output.

       Top 10            Top 10_1          Top 10_2
       accuracy  Kappa   accuracy  Kappa   accuracy  Kappa
RF     0.919     0.892   0.871     0.827   0.904     0.872
KNN    0.669     0.559   0.592     0.455   0.651     0.536
SVM    0.728     0.636   0.578     0.438   0.658     0.545

       Top 20            Top 20_1          Top 20_2
       accuracy  Kappa   accuracy  Kappa   accuracy  Kappa
RF     0.904     0.850   0.936     0.915   0.885     0.847
KNN    0.743     0.657   0.788     0.718   0.636     0.516
SVM    0.723     0.631   0.687     0.581   0.668     0.558

       Top 30            Top 30_1          Top 30_2
       accuracy  Kappa   accuracy  Kappa   accuracy  Kappa
RF     0.920     0.893   0.929     0.905   0.895     0.860
KNN    0.749     0.666   0.777     0.703   0.670     0.559
SVM    0.678     0.559   0.699     0.597   0.650     0.534

       Top 100           Top 100_1         Top 100_2
       accuracy  Kappa   accuracy  Kappa   accuracy  Kappa
RF     0.904     0.872   0.896     0.861   0.903     0.871
KNN    0.678     0.571   0.709     0.613   0.725     0.634
SVM    0.678     0.571   0.638     0.519   0.695     0.592

Table 16 shows the accuracy of the prediction models on the test data. The RF model reaches a maximum accuracy of 0.984 for the "Top 30_2" data and a minimum of 0.903 for "Top 10_1". The accuracy on the test data is consistently higher than on the training data, and the RF model again always has the highest accuracy of the three models. The accuracy of the KNN model varies between 0.903 for "Top 30" and 0.565 for "Top 10_1"; that of the SVM model ranges between 0.806 for "Top 30" and 0.613 for "Top 10_1".


Table 16: Accuracy of machine learning models based on the test data. The 95% confidence interval (CI) refers to the accuracy.

               Top 10                       Top 10_1                     Top 10_2
       accuracy  95% CI        Kappa  accuracy  95% CI        Kappa  accuracy  95% CI        Kappa
RF        0.968  0.888, 0.996  0.957     0.903  0.801, 0.964  0.871     0.968  0.888, 0.996  0.857
KNN       0.839  0.723, 0.902  0.785     0.565  0.433, 0.690  0.425     0.726  0.598, 0.831  0.634
SVM       0.694  0.563, 0.804  0.593     0.613  0.481, 0.734  0.486     0.758  0.633, 0.858  0.677

               Top 20                       Top 20_1                     Top 20_2
       accuracy  95% CI        Kappa  accuracy  95% CI        Kappa  accuracy  95% CI        Kappa
RF        0.968  0.888, 0.996  0.957     0.952  0.865, 0.990  0.935     0.968  0.888, 0.996  0.957
KNN       0.871  0.761, 0.943  0.828     0.758  0.633, 0.858  0.678     0.694  0.563, 0.804  0.591
SVM       0.710  0.581, 0.818  0.614     0.694  0.563, 0.804  0.591     0.694  0.563, 0.804  0.588

               Top 30                       Top 30_1                     Top 30_2
       accuracy  95% CI        Kappa  accuracy  95% CI        Kappa  accuracy  95% CI        Kappa
RF        0.952  0.865, 0.990  0.935     0.952  0.865, 0.990  0.935     0.984  0.913, 1.000  0.978
KNN       0.903  0.801, 0.964  0.871     0.774  0.650, 0.871  0.699     0.758  0.633, 0.858  0.678
SVM       0.806  0.686, 0.896  0.741     0.774  0.650, 0.871  0.698     0.661  0.530, 0.777  0.544

               Top 100                      Top 100_1                    Top 100_2
       accuracy  95% CI        Kappa  accuracy  95% CI        Kappa  accuracy  95% CI        Kappa
RF        0.968  0.888, 0.996  0.957     0.919  0.822, 0.973  0.892     0.952  0.865, 0.990  0.935
KNN       0.806  0.686, 0.896  0.742     0.694  0.563, 0.804  0.593     0.871  0.761, 0.943  0.828
SVM       0.742  0.615, 0.845  0.658     0.677  0.547, 0.791  0.572     0.758  0.633, 0.858  0.677
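The confidence intervals in table 16 are binomial intervals for a proportion (correct test predictions out of all test samples), presumably the exact interval reported by caret's confusionMatrix. The closely related Wilson score interval can be computed by hand; this sketch is an approximation for illustration, not the thesis's actual computation, and gives values close to (but not identical with) the exact intervals in the table.

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion, e.g. test-set accuracy."""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

# 60 of 62 test samples correct: accuracy ~0.968, as for the RF "Top 10" model.
lo, hi = wilson_ci(60, 62)
```

For 60/62 the Wilson interval is roughly 0.89 to 0.99, in line with the exact 0.888 to 0.996 reported for the 0.968 accuracies in table 16. The width of these intervals also shows how much uncertainty a test set of 62 samples leaves.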

All in all, the models built on the CSD16 data, which contain 4 cities, are more accurate than those built on the CSD17 data, which contain 9 cities.

4.3 Matching species of Ilorin and New York (CSD16 vs. CSD17)

Because the analysed cities were selected according to the number of samples (except for Vienna), only Ilorin and New York appear in this section: these are the only two cities in which a high number of samples was collected in both years.

Table 17: Number and percentage of species matching of CSD16 and CSD17

                 Ilorin                          New York
         matching species    %          matching species    %
Top 10          2          20.00 %             5          50.00 %
Top 20          7          35.00 %             8          40.00 %
Top 30         10          33.00 %            15          50.00 %
Top 100        21          21.00 %            57          57.00 %
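The matching counts in table 17 are simple set intersections of the Top-N species lists of the two sampling years. A sketch with hypothetical Top 5 lists (the species names here are illustrative, not the actual Top 5 of either city):

```python
def overlap(top_a, top_b):
    """Number and percentage of species shared by two Top-N lists of equal length."""
    shared = set(top_a) & set(top_b)
    return len(shared), 100 * len(shared) / len(top_a)

# Hypothetical Top 5 lists for one city in two sampling years.
y2016 = ["P. stutzeri", "A. schindleri", "M. putida", "E. coli", "S. enterica"]
y2017 = ["P. stutzeri", "K. rosea", "E. coli", "M. osloensis", "S. enterica"]

n, pct = overlap(y2016, y2017)  # 3 shared species -> 60.0 %
```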

Table 17 indicates how many matching species are found in Ilorin and New York between the two years and the percentage they represent in the respective data set. In Ilorin, the number of matching species in each data set is lower than in New York. The exact species concerned are listed in table 18. In Ilorin, Acinetobacter and Pseudomonas species stand out; Acinetobacter and Pseudomonas species can also be found in New York, along with some Kocuria and Microbacterium species.

Table 18: Species matching between the two years of sampling, listed alphabetically per city. In the original colour version, the matches of the Top 10 are highlighted in blue; the Top 20 matches comprise the blue and yellow species, the Top 30 matches the blue, yellow and green species, and the Top 100 matches refer to the whole list.

Ilorin: Achromobacter xylosoxidans, Acinetobacter baumannii, Acinetobacter haemolyticus, Acinetobacter indicus, Acinetobacter johnsonii, Acinetobacter lwoffii, Acinetobacter schindleri, Comamonas aquatica, Comamonas testosteroni, Enterobacter cloacae, Escherichia coli, Klebsiella pneumoniae, Massilia putida, Micrococcus luteus, Pseudomonas mendocina, Pseudomonas oryzihabitans, Pseudomonas putida, Pseudomonas stutzeri, Salmonella enterica, Stenotrophomonas maltophilia

New York: Achromobacter xylosoxidans, Acinetobacter baumannii, Acinetobacter johnsonii, Acinetobacter lwoffii, Actinomyces oris, Agrococcus carbonis, Blastococcus saxobsidens, Brevundimonas diminuta, Brevundimonas naejangsanensis, Brevundimonas vesicularis, Comamonas testosteroni, Corynebacterium jeikeium, Corynebacterium ureicelerivorans, Cutibacterium acnes, Cutibacterium granulosum, Dermacoccus nishinomiyaensis, Enterobacter hormaechei, Escherichia coli, Friedmanniella luteola, Geodermatophilus obscurus, Janibacter indicus, Klebsiella pneumoniae, Kocuria flava, Kocuria indica, Kocuria palustris, Kocuria rhizophila, Kocuria rosea, Kocuria turfanensis, Marmoricola scoriae, Massilia lutea, Massilia oculi, Massilia putida, Methylobacterium brachiatum, Microbacterium aurum, Microbacterium foliorum, Microbacterium oxydans, Micrococcus luteus, Modestobacter marinus, Moraxella osloensis, Ornithinimicrobium flavum, Paracoccus yeei, Pseudomonas aeruginosa, Pseudomonas fluorescens, Pseudomonas putida, Pseudomonas resinovorans, Pseudomonas stutzeri, Ramlibacter tataouinensis, Rathayibacter festucae, Rhodococcus fascians, Rhodopseudomonas palustris, Salmonella enterica, Sphingobium yanoikuyae, Sphingomonas melonis, Staphylococcus aureus, Stenotrophomonas maltophilia, Variovorax paradoxus, Xanthomonas campestris

4.4 Mystery samples

Tables 19 and 20 show the outcome of the mystery sample analysis. The models correctly predict 16% to 25% of the samples. They incorrectly assign 37% to 65% of the samples to New York, a city that does not appear among the mystery samples.

Table 19: Absolute numbers and percentages of correctly predicted mystery samples, and absolute numbers and percentages of samples incorrectly predicted as New York.

                         Top 10  Top 10_1  Top 10_2  Top 20  Top 20_1  Top 20_2  Top 30  Top 30_1  Top 30_2  Top 100  Top 100_1  Top 100_2
Right predictions             9        10        12      12        14        11      13        14        12       12         13         13
% right predictions         16%       18%       21%     21%       25%       19%     23%       25%       21%      21%        23%        23%
Predicted as New York        35        37        26      33        24        34      26        28        31       34         27         21
% New York predictions      61%       65%       46%     58%       42%       60%     46%       49%       54%      60%        47%        37%
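The two percentages in table 19 are computed per model from the list of true origins and the list of predicted cities. A minimal sketch with a hypothetical five-sample mystery set (invented values, not rows of table 20):

```python
def prediction_summary(origins, predicted, bias_city="New_York"):
    """Percentage of correct predictions and percentage assigned to one city."""
    right = sum(o == p for o, p in zip(origins, predicted))
    biased = sum(p == bias_city for p in predicted)
    n = len(origins)
    return round(100 * right / n), round(100 * biased / n)

# Hypothetical mini mystery set: true origins vs. one model's predictions.
origins   = ["Hong_Kong", "Tokyo", "Kiev", "Taipei", "Vienna"]
predicted = ["Hong_Kong", "New_York", "New_York", "Taipei", "New_York"]

pct_right, pct_ny = prediction_summary(origins, predicted)  # -> 40, 60
```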


Table 20: Evaluation of the mystery samples. The column "Origin" states the city from which each sample originates.

Origin Top 10 Top 10_1 Top 10_2 Top 20 Top 20_1 Top 20_2 Top 30 Top 30_1 Top 30_2 Top 100 Top 100_1 Top 100_2 Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Incheon Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York Hong_Kong New_York New_York Hong_Kong Hong_Kong New_York Hong_Kong Ilorin New_York Hong_Kong Hong_Kong Ilorin Ilorin Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Incheon Hong_Kong Hong_Kong New_York New_York New_York New_York Ilorin New_York New_York New_York New_York New_York New_York New_York Hong_Kong Kiev New_York Hong_Kong Hong_Kong Kiev Hong_Kong New_York Kiev Hong_Kong Kiev Hong_Kong Kiev Hong_Kong New_York New_York New_York Hong_Kong Hong_Kong New_York Hong_Kong New_York New_York Hong_Kong Hong_Kong Hong_Kong Hong_Kong New_York New_York New_York Hong_Kong Hong_Kong Hong_Kong New_York Hong_Kong Hong_Kong New_York Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Tokyo Tokyo Hong_Kong Tokyo Hong_Kong Tokyo Hong_Kong Hong_Kong Tokyo Hong_Kong Hong_Kong Hong_Kong Kiev Tokyo Tokyo Hong_Kong Tokyo Tokyo Tokyo New_York Tokyo Tokyo Tokyo Tokyo Tokyo Kiev Singapore Tokyo Hong_Kong Hong_Kong Incheon Singapore Singapore Tokyo Singapore Singapore Incheon Singapore Kiev Hong_Kong New_York Hong_Kong New_York Incheon New_York New_York New_York Hong_Kong New_York 
New_York Hong_Kong Kiev New_York New_York Hong_Kong New_York New_York New_York New_York New_York New_York New_York New_York New_York Kiev Tokyo Tokyo Tokyo Tokyo Tokyo Tokyo Tokyo Tokyo Tokyo Tokyo Tokyo Tokyo Kiev Tokyo Incheon Tokyo Tokyo Kiev Tokyo Kiev Kiev Tokyo Hong_Kong Incheon Tokyo Kiev New_York New_York Ilorin New_York New_York New_York New_York New_York New_York New_York New_York Ilorin Kiev Tokyo Tokyo Tokyo Tokyo Tokyo Tokyo Tokyo Tokyo Tokyo Tokyo Tokyo Tokyo Kiev New_York New_York New_York Ilorin Ilorin New_York New_York Ilorin New_York New_York Ilorin Ilorin Kiev New_York New_York New_York New_York Hong_Kong New_York Incheon Taipei New_York Hong_Kong Taipei Hong_Kong Kiev Hong_Kong Hong_Kong Hong_Kong Hong_Kong Incheon Hong_Kong Hong_Kong Incheon Hong_Kong Hong_Kong Incheon Hong_Kong Kiev New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York Kiev Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Taipei New_York New_York Ilorin New_York New_York New_York Hong_Kong New_York Hong_Kong New_York Hong_Kong Hong_Kong Taipei New_York New_York Taipei New_York Taipei New_York Taipei Taipei New_York New_York Taipei Hong_Kong Taipei New_York New_York New_York New_York Ilorin New_York Ilorin Ilorin New_York New_York Ilorin Hong_Kong Taipei Hong_Kong Hong_Kong Ilorin Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Hong_Kong Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Taipei Hong_Kong Taipei Hong_Kong Hong_Kong Taipei Hong_Kong Hong_Kong Taipei Hong_Kong Hong_Kong Taipei Taipei Taipei New_York New_York Ilorin New_York Ilorin New_York New_York New_York New_York New_York New_York Ilorin Taipei Taipei Taipei Hong_Kong Hong_Kong Taipei Hong_Kong Taipei Taipei Taipei Taipei Hong_Kong Taipei Taipei New_York New_York New_York New_York New_York New_York New_York New_York 
New_York New_York New_York New_York Taipei Hong_Kong New_York Hong_Kong New_York New_York New_York Hong_Kong Hong_Kong New_York New_York Hong_Kong Hong_Kong Tokyo New_York New_York New_York New_York Ilorin New_York Ilorin New_York New_York New_York New_York New_York Tokyo New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York Tokyo New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York Tokyo New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York Tokyo New_York New_York Ilorin New_York Ilorin New_York Ilorin Ilorin New_York New_York New_York Ilorin Tokyo New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York Tokyo New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York Tokyo New_York New_York New_York New_York New_York New_York Hong_Kong Hong_Kong New_York New_York New_York New_York Tokyo New_York Ilorin Ilorin New_York Ilorin New_York Ilorin Ilorin New_York New_York Ilorin Ilorin Tokyo New_York New_York Hong_Kong New_York New_York New_York Hong_Kong New_York Hong_Kong New_York New_York Hong_Kong Tokyo New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York Vienna New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York Vienna New_York New_York New_York New_York Taipei Ilorin Hong_Kong New_York Hong_Kong New_York New_York Hong_Kong Vienna New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York Vienna Vienna Vienna New_York Vienna Vienna Vienna Vienna Vienna Vienna Vienna Vienna Vienna Vienna New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York Vienna New_York New_York Vienna New_York New_York New_York 
New_York New_York New_York New_York New_York New_York Vienna New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York Vienna New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York Vienna New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York Vienna New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York New_York


5 Discussion

In order to find key species by which different cities can be distinguished from each other, WGS data from the urban microbiome were analysed in this thesis using machine learning methods, and the most abundant bacterial species were selected. The focus lies on bacterial species because a classification of frequently occurring species in the urban microbiome has shown that more than 1,100 species occur in the majority of the samples (> 70%). This composition consists mainly of bacteria and only a very small proportion of eukaryotes; viruses and archaea were not found in this group. This may be because considerably more data are available on prokaryotes than on eukaryotes. In addition, the annotation involves DNA fragments, which excludes RNA-based viruses from the outset (The MetaSUB International Consortium, 2016).

The two data sets studied (CSD16 and CSD17) differ both in the cities covered and in the number of cities studied. Since New York and Ilorin are the only cities that appear in both years and are also analysed in this thesis, only these two cities can be compared across both years. The number of matching species over both years is smaller for Ilorin than for New York, but even the highest match is only 57% (see table 17). The matching species mostly consist of Acinetobacter species, as well as Kocuria and Pseudomonas species.

For the CSD16 data, the PCA (Figure 11) already shows the formation of clusters of the four cities studied. Walker et al. (2018) have already pointed out that a small number of cities can be easily distinguished using this method, though not at species level but at taxonomically higher levels. In contrast, the PCA of the CSD17 data does not yield clear clusters, although there is a slight dispersion of the data points in some cases. The HC shows a similar picture: the four cities from 2016 form some clearly recognisable clusters (Figure 12).
Doha in particular stands out with a large cluster. The nine cities from 2017 do not form clusters (Figure 4).

To achieve better differentiation of the data, predictive models were created using machine learning to show how well individual samples can be assigned to the city from which they originate. The known data were split into training and test sets (Table 3). The models created with the training data show that the accuracy of the prediction is highest for RF; SVM works better than KNN in most cases. This applies to the data from both years. When the models are fed with the test data, the results are similar to those with the training data, but slightly higher. The search for key species therefore focuses on the results of the RF.

The CSD17 models have mean errors of about 25% to almost 36% (table 6). Individual

cities rarely have a value below 10%. The OOB error varies between 24% and 34%. The greatest accuracy achieved by the RF model with the test data is 80% for the "Top 30" data (table 10). The RF predictions work worst with the unique species. The accuracy of the data with all species and of the data with species occurring in at least two cities is always similarly high; this can be of interest when the data set needs to be kept as small as possible, for computational or other reasons. A look at the variable importance shows that individual species rarely stand out, but that a relatively large number of species are of similar importance for the model. There are no species that clearly stand out, but some are very often found among the 10 most important (Table 7). The most important species seem to be Kocuria rosea, Pseudomonas stutzeri, Ornithinimicrobium flavum, Microbacterium aurum, Dermacoccus nishinomiyaensis and Cupriavidus metallidurans.

The CSD16 models have mean errors of about 5.6% to almost 13% (table 12); Ilorin and Doha partially have 0% errors. The OOB error varies between about 8% and 13%. The greatest accuracy achieved by the RF model with the test data is 98.4% for the "Top 30_2" data (table 16). Again, the RF model performs worst with the unique species, and omitting the unique species works just as well as using all data. The variable importance also clearly shows that mostly a few species make up the model (Table 13). The most important species of this data set seem to be Pseudomonas stutzeri, Acinetobacter schindleri, Bacillus circulans, Pseudomonas balearica and Moraxella osloensis.

It must be mentioned, however, that no clear patterns can be deduced from the influential species.
The CSD16 data seem to be more affected by the human microbiome than the CSD17 data, but this may be because the sampling protocols were improved for CSD17, so those samples are less contaminated with the human microbiome. In general, water-associated species, species associated with polluted environments (various types of pollution) and bacteria associated with plants were part of the list. In 2016, Clostridium botulinum and C. argentinense stand out; the two are genetically similar and probably often confused, but both produce a neurotoxin, can be found in spoiled canned food (at least C. botulinum) and are both used for the production of Botox. However, the classification into habitats is generally difficult, because organisms are usually assigned to the places from which they were originally isolated, while further knowledge about their habitat preferences or geographical distribution is lacking.

During the analysis of the mystery samples carried out with RF, it turned out that the assignment works much worse than the models suggest. The accuracy of the prediction is 16 to 25% (Table 19). The most correct classifications were found for Hong Kong (Table 20); not a single sample from Tokyo was correctly assigned, and for the other cities

there were also very few hits. What is remarkable is that most of the wrongly predicted samples are assigned to New York: 37 to 65% of all samples were classified as New York. One possible explanation for this bias is that all samples were sequenced in New York and therefore share a certain technological bias. The weak performance may also be due to the fact that there are already very big differences in abundance values between samples from the same city (high intra-city variance).

Outlook

To get more accurate results, further parameter optimization of the RF model seems to be a good choice; one could test the parameter "mtry" with a wider range of values to improve the model. The inclusion of further machine learning methods such as NN also seems sensible. It is also possible to combine these methods and thus benefit from their respective strengths, as Melcher et al. (2015) demonstrated. KNN and SVM do not appear to be appropriate for this kind of data, although there is definitely potential for improvement there too. Logarithmising, standardising or dichotomising the data could also bring gains in performance. To find better classifiers it can also be advantageous to consider not only accuracy and Kappa, but also other metrics such as precision, recall or the F1-score, which is a function of precision and recall.

How meaningful the analysis is, and especially the time period in which the data are reliable, depends strongly on how much the microbiome of a city changes. In this study, the species compositions in the individual data sets of New York and Ilorin are considered (tables 17 and 18). For Ilorin, the species are quite different in the two years, while for New York they are more similar; however, even here the overlap is no greater than 57%. Meanwhile there are data from 5 CSDs that can be compared (The MetaSUB International Consortium, 2016). These data should be analysed to see how big the differences are on average. It is also likely that seasonal differences have a greater influence on the microbial composition of a city than annual ones, and the weather on the sampling day can also have an influence. These differences can also be compared with other metadata, for example whether the variance of the species depends on climate zones or continents.
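The additional metrics mentioned above can all be read off a per-class confusion matrix. A short sketch (the 3-city confusion matrix is invented for illustration):

```python
def precision_recall_f1(confusion, cls):
    """Per-class precision, recall and F1 from a confusion matrix
    (rows: true class, cols: predicted class)."""
    tp = confusion[cls][cls]
    predicted = sum(row[cls] for row in confusion)   # column sum: predicted as cls
    actual = sum(confusion[cls])                     # row sum: truly cls
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical 3-city confusion matrix.
cm = [[8, 1, 1],
      [2, 6, 2],
      [0, 1, 9]]

p, r, f = precision_recall_f1(cm, 0)  # city 0: precision 0.8, recall 0.8, F1 0.8
```

Unlike overall accuracy, these per-class values would expose exactly the kind of one-city bias seen in the mystery sample analysis: a class that attracts many wrong predictions shows low precision even when overall accuracy looks acceptable.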

Implementing global projects

The MetaSUB Consortium has established some guidelines in advance of their global sampling and analysis in order to standardise the experiments worldwide. Despite

these efforts, the data differ in quality and quantity. Table 1 shows that the maximum number of 50 available metagenome samples is rarely reached. This of course introduces a certain error when comparing the different cities: some are overrepresented, others underrepresented, and it would be better to keep the data balanced. The different laboratories that process the samples could also have a major influence on the results. A uniform system for evaluating the in-house contamination of each laboratory is one option to reduce this error; one would have to subtract a kind of "blank" sample from the result of each laboratory. The specification of sequencing methods should definitely be seen in a positive light. For further bioinformatic analysis it would also be appropriate to offer prefabricated pipelines, perhaps in containers, depending on the desired parameters (e.g. abundance). These should, in accordance with the MetaSUB Consortium, be made publicly available. All these measures can have a positive effect on the reproducibility and accuracy of the results.

Ethical aspects

Sensitive data and their analysis imply a great ethical responsibility. The way the MetaSUB Consortium deals with these data seems appropriate. By clarifying with local authorities and allowing them to be the first to assess the data, an attempt is made to create a consensus between administration and science (The MetaSUB International Consortium, 2016). In this way contradictory conclusions can be avoided. Also very important for such publications is the medium in which they are published and who the target audience is. If, for example, there are indications of pathogenic microorganisms, the facts must be presented truthfully and adapted to the level of danger. This may seem relatively trivial at first, but one must admit that the natural and technical sciences do not always provide unambiguous results and are always afflicted with certain errors. Simply looking at the same data from a different angle can lead to different results. As an honest expert in such a discipline, one must be careful not to degenerate into a political instrument. This is the task of each individual and of the MetaSUB Consortium.

If one now looks specifically at the development of metagenome analyses in the direction of forensics, it would be important to define in advance precise limits as to when conclusions are considered to be really certain. These limits must then be constantly evaluated and adjusted in order to take appropriate action.


6 Glossary

BacDive The Bacterial Diversity Metadatabase

CAMDA Critical Assessment Of Massive Data Analysis

CI Confidence Interval

CSD City sampling day

CSD16 City sampling day 2016

CSD17 City sampling day 2017

DNA Deoxyribonucleic acid

GRC Genome Reference Consortium

HC Hierarchical Clustering

HMM Hidden Markov Model

ISMB Intelligent Systems for Molecular Biology

KNN K-nearest neighbors

MetaSUB Metagenomics & Metadesign of Subways & Urban Biomes

NGS Next Generation Sequencing

NIST National Institute of Standards and Technology

NN Neural Networks

OOB Out-of-Bag error

PC Principal Component

PC1 first Principal Component

PC2 second Principal Component

PCA Principal Component Analysis

RF Random Forest


RNA Ribonucleic acid

rRNA Ribosomal ribonucleic acid

SVM Support Vector Machines

WGS Whole Genome Sequencing


References

Altman, N. S. (1992). An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3), 175–185. https://doi.org/10.1080/00031305.1992.10475879

Baker, M., & Penny, D. (2016). Is there a reproducibility crisis? Nature, 533(7604), 452–454. https://doi.org/10.1038/533452A

Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics, 30(15), 2114–2120. https://doi.org/10.1093/bioinformatics/btu170

Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32. https://doi.org/10.1023/A:1010933404324

Breitbart, M., Salamon, P., Andresen, B., Mahaffy, J. M., Segall, A. M., Mead, D., Azam, F., & Rohwer, F. (2002). Genomic analysis of uncultured marine viral communities. Proceedings of the National Academy of Sciences of the United States of America, 99(22), 14250–14255. https://doi.org/10.1073/pnas.202488399

Chaparro, J. M., Sheflin, A. M., Manter, D. K., & Vivanco, J. M. (2012). Manipulating the soil microbiome to increase soil health and plant fertility. Biology and Fertility of Soils, 48(5), 489–499. https://doi.org/10.1007/s00374-012-0691-4

Collins, F. S., Morgan, M., & Patrinos, A. (2003). The human genome project: Lessons from large-scale biology. Science, 300(5617), 286–290. https://doi.org/10.1126/science.1084564

Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273–297. https://doi.org/10.1007/BF00994018

Danko, D. C., Bezdan, D., Afshinnekoo, E., Ahsanuddin, S., Alicea, J., Bhattacharya, C., Bhattacharyya, M., Blekhman, R., Butler, D. J., Castro-Nallar, E., Cañas, A. M., Chatziefthimiou, A. D., Rei Chng, K., Coil, D. A., Syndercombe Court, D., Crawford, R. W., Desnues, C., Dias-Neto, E., Donnellan, D., . . . Mason, C. (2020). Global genetic cartography of urban metagenomes and anti-microbial resistance. bioRxiv. https://doi.org/10.1101/724526

Dileep, D., Ramesh, A., Sojan, A., Dhanjal, D. S., Harinder, K., & Kaur, A. (2020). Metagenomics: Techniques, applications, challenges and opportunities. Springer Singapore.

Filipović, V. (2017). Optimization, classification and dimensionality reduction in biomedicine and bioinformatics. Biologia Serbica, 39(1), 83–98. https://doi.org/10.5281/zenodo.827099

Freedman, D. A. (2009). Statistical models: Theory and practice. Cambridge University Press.

Galili, T. (2015). dendextend: An R package for visualizing, adjusting, and comparing trees of hierarchical clustering. Bioinformatics. https://doi.org/10.1093/bioinformatics/btv428

Grbić, M., Kartelj, A., Matić, D., & Filipović, V. (2016). Improving 1NN strategy for classification of some prokaryotic organisms. Book of abstracts, Belgrade Bioinformatic Conference (BelBI).

Handelsman, J., Rondon, M. R., Brady, S. F., Clardy, J., & Goodman, R. M. (1998). Molecular biological access to the chemistry of unknown soil microbes: A new frontier for natural products. Chemistry and Biology, 5(10). https://doi.org/10.1016/S1074-5521(98)90108-9


Harris, Z. N., Dhungel, E., Mosior, M., & Ahn, T. H. (2019). Massive metagenomic data analysis using abundance-based machine learning. Biology Direct, 14(12), 13. https://doi.org/10.1186/s13062-019-0242-0

Hayes, B. (2013). First links in the Markov chain. American Scientist, 101(2), 92. https://doi.org/10.1511/2013.101.92

Hsu, C.-W., Chang, C.-C., & Lin, C.-J. (2003). A practical guide to support vector classification. https://doi.org/10.1177/02632760022050997

Hugenholtz, P., Goebel, B. M., & Pace, N. R. (1998). Erratum: Impact of culture-independent studies on the emerging phylogenetic view of bacterial diversity (Journal of Bacteriology (1998) 180:18 (4765–4774)). Journal of Bacteriology, 180, 6793. https://doi.org/10.1128/jb.180.24.6793-6793.1998

Hugenholtz, P. (2002). Exploring prokaryotic diversity in the genomic era. Genome Biology, 3(2), 1–8. https://doi.org/10.1186/gb-2002-3-2-reviews0003

Kassambara, A. (2020). ggpubr: 'ggplot2' based publication ready plots [R package version 0.2.5]. https://CRAN.R-project.org/package=ggpubr

Kassambara, A., & Mundt, F. (2020). factoextra: Extract and visualize the results of multivariate data analyses [R package version 1.0.7]. https://CRAN.R-project.org/package=factoextra

Klamt, S., Regensburger, G., Gerstl, M. P., Jungreuthmayer, C., Schuster, S., Mahadevan, R., Zanghellini, J., & Müller, S. (2017). From elementary flux modes to elementary flux vectors: Metabolic pathway analysis with arbitrary linear flux constraints. PLOS Computational Biology, 13(4), 1–22. https://doi.org/10.1371/journal.pcbi.1005409

Kononenko, I. (2001). Machine learning for medical diagnosis: History, state of the art and perspective. Artificial Intelligence in Medicine, 23(1), 89–109. https://doi.org/10.1016/S0933-3657(01)00077-X

Kuhn, M. (2020). caret: Classification and regression training [R package version 6.0-85]. https://CRAN.R-project.org/package=caret

Lancashire, L. J., Lemetre, C., & Ball, G. R. (2009). An introduction to artificial neural networks in bioinformatics - Application to complex microarray and mass spectrometry datasets in cancer studies. Briefings in Bioinformatics, 10(3), 315–329. https://doi.org/10.1093/bib/bbp012

Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174. http://www.jstor.org/stable/2529310

Langmead, B., Trapnell, C., Pop, M., & Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3). https://doi.org/10.1186/gb-2009-10-3-r25

Li, S., Tighe, S. W., Nicolet, C. M., Grove, D., Levy, S., Farmerie, W., Viale, A., Wright, C., Schweitzer, P. A., Gao, Y., Kim, D., Boland, J., Hicks, B., Kim, R., Chhangawala, S., Jafari, N., Raghavachari, N., Gandara, J., Garcia-Reyero, N., . . . Mason, C. E. (2014). Multi-platform assessment of transcriptome profiling using RNA-seq in the ABRF next-generation sequencing study. Nature Biotechnology, 32(9), 915–925. https://doi.org/10.1038/nbt.2972

Lu, J., Breitwieser, F. P., Thielen, P., & Salzberg, S. L. (2017). Bracken: Estimating species abundance in metagenomics data. PeerJ Computer Science, 3, e104. https://doi.org/10.7717/peerj-cs.104


Mason-Buck, G., Graf, A., Elhaik, E., Robinson, J., Pospiech, E., Oliveira, M., Moser, J., Lee, P. K. H., Githae, D., Ballard, D., Bromberg, Y., Casimiro-Soriguer, C. S., Dhungel, E., Ahn, T.-H., Kawulok, J., Loucera, C., Ryan, F., Walker, A. R., Zhu, C., . . . Labaj, P. (2020). DNA based methods in intelligence - Moving towards metagenomics. Preprints. https://search.proquest.com/docview/2413932290?accountid=42404

Melcher, M., Scharl, T., Spangl, B., Luchner, M., Cserjan, M., Bayer, K., Leisch, F., & Striedner, G. (2015). The potential of random forest and neural networks for biomass and recombinant protein modeling in Escherichia coli fed-batch fermentations. Biotechnology Journal, 10(11), 1770–1782. https://doi.org/10.1002/biot.201400790

Neiderud, C. J. (2015). How urbanization affects the epidemiology of emerging infectious diseases. Infection Ecology & Epidemiology, 5(1). https://doi.org/10.3402/iee.v5.27060

Neuwirth, E. (2014). RColorBrewer: ColorBrewer palettes [R package version 1.1-2]. https://CRAN.R-project.org/package=RColorBrewer

Nicolaou, N., Siddique, N., & Custovic, A. (2005). Allergic disease in urban and rural populations: Increasing prevalence with increasing urbanization. Allergy: European Journal of Allergy and Clinical Immunology, 60(11), 1357–1360. https://doi.org/10.1111/j.1398-9995.2005.00961.x

Nielsen, F. (2016). Introduction to HPC with MPI for data science. Springer.

Peterson, J., Garges, S., Giovanni, M., McInnes, P., Wang, L., Schloss, J. A., Bonazzi, V., McEwen, J. E., Wetterstrand, K. A., Deal, C., Baker, C. C., Di Francesco, V., Howcroft, T. K., Karp, R. W., Lunsford, R. D., Wellington, C. R., Belachew, T., Wright, M., Giblin, C., . . . Guyer, M. (2009). The NIH Human Microbiome Project. Genome Research, 19(12), 2317–2323. https://doi.org/10.1101/gr.096651.109

Pinheiro, J., Bates, D., DebRoy, S., Sarkar, D., & R Core Team. (2020). nlme: Linear and nonlinear mixed effects models [R package version 3.1-145]. https://CRAN.R-project.org/package=nlme

Piryonesi, S. M., & El-Diraby, T. E. (2020). Role of data analytics in infrastructure asset management: Overcoming data size and quality problems. Journal of Transportation Engineering, Part B: Pavements, 146(2), 1–15. https://doi.org/10.1061/JPEODX.0000175

Pruitt, K. D., Tatusova, T., & Maglott, D. R. (2005). NCBI Reference Sequence (RefSeq): A curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research, 33(Database issue), 501–504. https://doi.org/10.1093/nar/gki025

R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/

Reimer, L. C., Vetcininova, A., Carbasse, J. S., Söhngen, C., Gleim, D., Ebeling, C., & Overmann, J. (2019). BacDive in 2019: Bacterial phenotypic data for high-throughput biodiversity analysis. Nucleic Acids Research, 47(D1), D631–D636. https://doi.org/10.1093/nar/gky879

Ringnér, M. (2008). What is principal component analysis? Nature Biotechnology, 26(3), 303–304. https://doi.org/10.1038/nbt0308-303

RStudio Team. (2020). RStudio: Integrated development environment for R. RStudio, PBC, Boston, MA. http://www.rstudio.com/


Sasikumar, R., & Kalpana, V. (2016). Hidden Markov Model in Biological Sequence Analysis – A Systematic Review. International Journal of Scientific and Innovative Mathematical Research, 4(3), 1–7. https://doi.org/10.20431/2347-3142.0403001

Schaefer, J., Lehne, M., Schepers, J., Prasser, F., & Thun, S. (2020). The use of machine learning in rare diseases: A scoping review. Orphanet Journal of Rare Diseases, 15. https://doi.org/10.1186/s13023-020-01424-6

Shlens, J. (2014). A Tutorial on Principal Component Analysis. arXiv:1404.1100. https://arxiv.org/abs/1404.1100

Tang, Y., Horikoshi, M., & Li, W. (2016). ggfortify: Unified interface to visualize statistical result of popular R packages. The R Journal, 8. https://journal.r-project.org/

The MetaSUB International Consortium. (2016). The Metagenomics and Metadesign of the Subways and Urban Biomes (MetaSUB) International Consortium inaugural meeting report. Microbiome, 4(1), 24. https://doi.org/10.1186/s40168-016-0168-z

Tyson, G. W., Chapman, J., Hugenholtz, P., Allen, E. E., Ram, R. J., Richardson, P. M., Solovyev, V. V., Rubin, E. M., Rokhsar, D. S., & Banfield, J. F. (2004). Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature, 428(6978), 37–43. https://doi.org/10.1038/nature02340

United Nations. (2018). World Urbanization Prospects: The 2018 Revision.

Volkova, S., Matos, M. R., Mattanovich, M., & de Mas, I. M. (2020). Metabolic modelling as a framework for metabolomics data integration and analysis. Metabolites, 10(8), 1–27. https://doi.org/10.3390/metabo10080303

Walker, A. R., Grimes, T. L., Datta, S., & Datta, S. (2018). Unraveling bacterial fingerprints of city subways from microbiome 16S gene profiles. Biology Direct, 13(1). https://doi.org/10.1186/s13062-018-0215-8

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org

Wickham, H. (2019). stringr: Simple, consistent wrappers for common string operations [R package version 1.4.0]. https://CRAN.R-project.org/package=stringr

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., . . . Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wood, D. E., Lu, J., & Langmead, B. (2019). Improved metagenomic analysis with Kraken 2. Genome Biology, 20(1), 1–13. https://doi.org/10.1186/s13059-019-1891-0

Wood, D. E., & Salzberg, S. L. (2014). Kraken: Ultrafast metagenomic sequence classification using exact alignments. Genome Biology, 15(3). https://doi.org/10.1186/gb-2014-15-3-r46

Yang, Y., Gao, J., Wang, J., Heffernan, R., Hanson, J., Paliwal, K., & Zhou, Y. (2018). Sixty-five years of the long march in protein secondary structure prediction: The final stretch? Briefings in Bioinformatics, 19(3), 482–494. https://doi.org/10.1093/bib/bbw129

Yurgel, S. N., Nearing, J. T., Douglas, G. M., & Langille, M. G. I. (2019). Metagenomic functional shifts to plant induced environmental changes. Frontiers in Microbiology, 10, 1682. https://doi.org/10.3389/fmicb.2019.01682


Zhou, G., Jiang, J. Y., Ju, C. J., & Wang, W. (2019). Prediction of microbial communities for urban metagenomics using neural network approach. Human Genomics, 13. https://doi.org/10.1186/s40246-019-0224-4

Zhu, C., Miller, M., Lusskin, N., Mahlich, Y., Wang, Y., Zeng, Z., & Bromberg, Y. (2019). Fingerprinting cities: Differentiating subway microbiome functionality. Biology Direct, 14(1), 19. https://doi.org/10.1186/s13062-019-0252-y [Note: functional analysis of 310 metagenomes; 30 known-unknown mystery samples from known cities, an unknown set of 36, and a mixed set of 16 samples.]

Zhu, C., Miller, M., Marpaka, S., Vaysberg, P., Rühlemann, M. C., Wu, G., Heinsen, F. A., Tempel, M., Zhao, L., Lieb, W., Franke, A., & Bromberg, Y. (2018). Functional sequencing read annotation for high precision microbiome analysis. Nucleic Acids Research, 46(4), 1–11. https://doi.org/10.1093/nar/gkx1209
