The Woolf Classifier Building Pipeline: A Machine Learning Tool for Predicting Function

Anna Farrell-Sherman

Submitted in Partial Fulfillment of the Prerequisite for Honors in Computer Science under the advisement of Eni Mustafaraj and Vanja Klepac-Ceraj

May 2019

© 2019 Anna Farrell-Sherman

Abstract

Proteins are the machinery that allow cells to grow, reproduce, communicate, and create multicellular organisms. For all of their importance, scientists still have a hard time understanding the function of a protein based on its sequence alone. For my honors thesis in computer science, I created a machine learning tool that can predict the function of a protein based solely on its amino acid sequence. My tool gives scientists a structure in which to build a k Nearest Neighbor or random forest classifier to distinguish between proteins that can and cannot perform a given function. Using default Min-Max scaling and the Matthews Correlation Coefficient for accuracy assessment, the Woolf Pipeline is built with simplified choices to guide users to success.

Acknowledgments

There are so many people who made this thesis possible.

First, thank you to my wonderful advisors, Eni and Vanja, who never gave up on me, and always pushed me to try my hardest.

Thank you also to the other members of my committee, Sohie Lee, Shikha Singh, and Rosanna Hertz for supporting me through this process.

To Kevin, Sophie R, Sophie E, and my sister Phoebe, thank you for reading my drafts, and advising me when the going got tough.

To all my wonderful friends, who were always there with encouraging words, warm hugs, and many congratulations, I cannot thank you enough. Jocelyn, Hershel, Fiona, Anneli, Lydia, Linda, and Claire, I could not have done this without all of you.

And finally, thank you to my parents, who have always believed in me more than I could have thought possible, and have shown me how much I am capable of.

A Note on the Name Woolf

The Woolf Classifier Building Pipeline is named after the lesbian feminist author Virginia Woolf. She is known for her use of stream of consciousness, and is considered one of the most important 20th century modernist authors. The tribute to her here is a reminder that science alone can never capture all of the wonders of life.

Table of Contents

Abstract
Acknowledgments
A Note on the Name Woolf

List of Figures

List of Tables

1 Introduction

2 Anaerobic Manganese Oxidation: A Case Study
  2.1 Manganese Oxides in the Geologic Record
  2.2 Biological Mn(II) Oxidation
  2.3 Genes and Proteins

3 The Field of Machine Learning
  3.1 What is Machine Learning?
  3.2 Classification in Machine Learning
    3.2.1 The Class Distribution Problem
    3.2.2 The Multi-Class Problem
  3.3 Algorithm Choice
    3.3.1 Naïve Bayes
    3.3.2 Support Vector Machine (SVM)
    3.3.3 Neural Nets
    3.3.4 k Nearest Neighbors (kNN)
    3.3.5 Decision Trees and Random Forests
  3.4 Machine Learning in Biology

4 The Woolf Classifier Building Pipeline
  4.1 Introduction
  4.2 How is the Woolf Pipeline Different?
  4.3 Pipeline Composition
    4.3.1 Input
    4.3.2 Creating a Feature Table
    4.3.3 Model Creation with Default Settings
    4.3.4 Changing the Pipeline Parameters
    4.3.5 Error Evaluation
    4.3.6 Prediction
    4.3.7 Implementation
  4.4 Design Decisions
    4.4.1 Why the Command Line?
    4.4.2 Machine Learning Features
    4.4.3 Scalar Operations
    4.4.4 Accuracy Metrics
    4.4.5 Algorithm Choices
  4.5 When are Woolf Classifiers Useful

5 Model Building with the Woolf Pipeline
  5.1 Introduction
  5.2 β-Lactamases for Woolf Model Testing
  5.3 Evaluating Algorithm Implementations
    5.3.1 Accuracy Metrics
    5.3.2 Scaling Types
    5.3.3 Hyperparameter Values
  5.4 Detecting Anaerobic Manganese Oxidation Genes
    5.4.1 kNN Model
    5.4.2 Random Forest Model
    5.4.3 Prediction of Green Lake Manganese Oxide Genes
  5.5 Conclusion

6 Conclusions and Future Work
  6.1 The Success of the Woolf Classifier Building Pipeline
  6.2 Improvements to the Woolf Tool
  6.3 Future Applications

7 References

Appendices
  Appendix A: User Manual
  Appendix B: Manganese Oxidizing Peroxidases
  Appendix C: Non-Manganese Oxidizing Peroxidases

List of Figures

2.1 Manganese Oxides and the Build-up of Oxygen in Earth’s Atmosphere
2.2 Anaerobic Manganese Oxide Producing Bacterial Communities in Fayetteville Green Lake
3.3.1 Bayes’ Rule
3.3.2 Support Vector Machine in Two Dimensions
3.3.3 Neural Network Diagram
3.3.4 kNN Classification in Two Dimensions
3.3.5 A Random Forest with Three Trees
4.2 Genome to Protein
4.3.1 FASTA and FASTQ Format
4.3.2 The Woolf Classifier Building Pipeline
4.4.2 The Twenty Amino Acids
5.1 The 4 β-Lactamase Comparisons Used to Test the Woolf Classification Pipeline
5.3.1 Three Accuracy Metrics to Evaluate Woolf Classifiers
5.3.2 Effect of Scaling on kNN-based Model MCC
5.3.3 Effect of Hyperparameters on MCC


List of Tables

3.2.1 The Confusion Matrix
4.3.4 User Specified Parameters
4.4.3 Possible Scaler Types
4.4.4 Possible Accuracy Metrics
5.4.3 Predicted Manganese Binding Function of Green Lake Peroxidases

1 Introduction

Until recently, scientists did not know that bacteria were capable of producing manganese oxides without oxygen. In fact, manganese oxide minerals have been used to date both the formation of Earth’s modern oxygenated atmosphere and the evolution of oxygenic photosynthesis, based on the idea that these minerals could not form without oxygen in the atmosphere [1, 2]. Yet, despite these claims, recent research has discovered bacterial communities capable of forming manganese oxides without oxygen [3]. This discovery has the potential to change our understanding of the early Archean world. However, the formation of these minerals is still a mystery. We have yet to identify any gene that could enable bacteria to produce them without oxygen.

The function of genes can be predicted with algorithms used to annotate genomes after whole genome sequencing, and proven with laboratory experiments that show chemically how proteins work within cells [4, 5]. We have full genome sequences of both of the manganese oxide-producing bacteria, but because anaerobic manganese oxidation is a novel process, there are no known protein sequences to which we can compare the new sequences. The functional gene annotation produced by current algorithms helps narrow down protein types, but none of the annotations are specific enough to identify genes that might be involved in manganese oxidation. Without a way to narrow down the possibilities, it is difficult to know where to start with laboratory experiments. Machine learning has the potential to provide a solution: a more specific computational tool to differentiate between specific protein functions.

To be successful, such a tool would need to be more precise than the current protein differentiation algorithms and at the same time, require less training data. This requires simpler machine learning algorithms, more careful treatment of the data to avoid adding bias, and the ability to biologically interpret the results on the level of individual genes. The biological background of the project will be discussed in Chapter Two. The machine learning required for such a tool is complex, and is explored in detail in Chapter Three, as a prelude to the main work of this thesis: the creation of a machine learning tool like the one described above, called the Woolf Classifier Building Pipeline.

The Woolf Pipeline, described in Chapter Four, allows researchers to collect sequence data on genes with and without a given function, choose either a k Nearest Neighbor (kNN) or random forest classification algorithm, and predict the functionality of novel protein sequences. Users can select the type of scaling used on the input dataset, the accuracy metric used to assess the classification ability of the models, and give hyperparameter ranges to the tool that will be tested with cross-validation. The Woolf Pipeline user manual will guide users through these decisions to make sure that the pipeline is accessible even to researchers who do not have an extensive machine learning background (Appendix A).

An example set of models created by the Woolf Pipeline is presented in Chapter Five. These models show that Woolf-built classifiers are effective at differentiating between classes of β-lactamase antibiotic resistance genes with both kNN and random forest based classification. Both algorithms are feature based; they model the classes using percentage composition for each of the 20 amino acids and protein length for a total of 21 machine learning features. The default parameters are tuned for the problem of protein differentiation; Min-Max scaling for the kNN algorithm combats the sparsity of the amino acid composition data, and the MCC accuracy metric deals with the small dataset size and unbalanced class distributions. All four example classifiers with default Min-Max scaling achieved Matthews correlation coefficient (MCC) accuracies of over 0.8, better than any previously published model [6].
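
The 21-feature representation described above (percentage composition of each of the 20 standard amino acids, plus protein length) can be sketched in a few lines. This is an illustrative reconstruction, not the Woolf Pipeline's own code; the function name and example sequence are invented for the demonstration.

```python
# Sketch of a 21-feature protein representation: the fractional composition
# of each of the 20 standard amino acids, plus the sequence length.
# Illustrative only -- not the Woolf Pipeline's actual implementation.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def feature_vector(sequence):
    """Return [frac(A), frac(C), ..., frac(Y), length] for a protein."""
    seq = sequence.upper()
    length = len(seq)
    composition = [seq.count(aa) / length for aa in AMINO_ACIDS]
    return composition + [length]

features = feature_vector("MKTAYIAKQR")  # a made-up short sequence
print(len(features))  # 21 features per protein
```

Because the composition features are fractions that sum to one while length is an unbounded integer, a scaler such as Min-Max is needed before distance-based algorithms like kNN can treat the features comparably.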

Most broadly, this thesis is an investigation into the practicality of using machine learning to create simple classification models to understand the function of biological proteins. Through careful study of the machine learning theory behind classification problems, I have created the Woolf Pipeline as a simple, usable tool with real world applications for the study of biological proteins. This thesis gives an overview of a specific biological motivation for such a machine learning tool, the background of the field of classification in machine learning, the specifics of how the tool works, and proof that the tool can be used to produce functional machine learning models.

2 Anaerobic Manganese Oxidation: A Case Study

2.1 Manganese Oxides in the Geologic Record

It has long been thought that the oxidation of manganese into manganese oxide minerals can only occur in the presence of oxygen (Figure 2.1) [7-9]. Today’s atmosphere, in which oxygen makes up a fifth of the total molecules, contains plenty of oxygen for manganese oxide formation; however, this was not the case throughout Earth’s history [9]. Sometime between 2.5 and 2.0 billion years ago, oxygen began to accumulate in Earth’s atmosphere, increasing from 0% to 21% in what geologists call the Great Oxidation Event (GOE) [10]. Because scientists thought that manganese oxide minerals required oxygen to form, the presence of manganese oxide has been used to date the evolution of the oxygenic photosynthesis that produced Earth’s oxygen [11]. While this has been a useful tool for studying the evolution of photosynthesis, recent results call this precept into question [3].

The first piece of evidence suggesting that manganese oxides can form without oxygen was the discovery of manganese oxides in 3 billion year old Archean Eon rock cores from South Africa [12]. How did these molecules form? Did the GOE occur earlier than we thought? Could these molecules have somehow formed without oxygen? While some researchers have cited these rocks as evidence for oxygen before the currently accepted start of the GOE [2], a simpler explanation might be that these oxides formed without oxygen. Manganese ions, particularly ones that have already lost some electrons, have an extremely high redox potential, meaning it takes a lot of energy to add or remove any further electrons [9]. Oxygen has a strong tendency to pull electrons from other atoms, and so can effectively oxidize manganese under the correct pH conditions, but the redox potential of Mn is high enough that even this process does not happen very often on its own: most natural Mn oxides are formed aerobically by bacteria and fungi [9]. Perhaps these bacteria and fungi at some point in evolutionary history also had the ability to oxidize manganese anaerobically, and produced manganese oxide minerals before the GOE.

The hypothesis of a biological anaerobic Mn(II) oxidizing system is attractive in a few respects. First, as discussed above, an organism capable of anaerobic Mn(II) oxidization would explain the presence of manganese oxides in Archean Eon rocks, which formed before the known existence of oxygen in the atmosphere. Second, such a wide variety of microbes show an ability to oxidize manganese aerobically that genes with this capability, or the closely related capability of anaerobic manganese oxidation, were likely present in some early ancestor of many present day bacterial species, perhaps from before the GOE [13]. Third, this hypothesis is attractive because it fits with previous hypotheses surrounding the debate on the origin of photosynthesis: it has been suggested that an Mn-centered photosystem was a precursor to the present-day water-splitting version [14-16]. Because the current water-oxidizing photosystem

relies on a Mn-bearing complex, it makes sense that the protein ancestor to the current complex used Mn(II) as an electron donor [17]. Such an ancient manganese oxidation system could have produced the first Mn oxides anaerobically, and then evolved as the rest of life did once oxygen was plentiful.

Figure 2.1 Manganese Oxides and the Build-up of Oxygen in Earth’s Atmosphere. The red and green lines respectively represent the upper and lower estimates for oxygen levels in Earth’s atmosphere. The build-up of oxygen in Earth’s atmosphere began ~2.5 billion years ago [11], around 500 million years after the rocks containing manganese oxides formed in South Africa [12]. Base chart from [11], manganese dendrite image from [18].

2.2 Biological Mn(II) Oxidation

Work preceding this thesis in Professor Vanja Klepac-Ceraj’s lab at Wellesley College and Professor Tanja Bosak’s lab at MIT has found evidence of anaerobic manganese oxidation by bacterial communities. The communities were discovered in Green Lake, in Fayetteville, New York, a lake often studied as a modern-day mimic of conditions in Archean Oceans [19]. The lake is permanently stratified, meaning it contains distinct layers that do not mix, including a large section from 21m to the lake bottom at 45m that is completely anoxic [19]. The lake is also unusual in its high manganese concentration, which peaks around 60,000 nM at 21.5m, the same depth in the water column where most of the microorganisms can be found [19]. This unusual environment, with anoxic photosynthetic zones rich in microorganisms and manganese, made Green Lake an ideal location to preserve ancient manganese oxidizing biological behavior: the behavior our lab has observed.


Figure 2.2: Anaerobic Manganese Oxide Producing Bacterial Communities in Fayetteville Green Lake. (A) Green Lake is located in upstate New York in the northeast United States. The star indicates the location of sample collection. (B) The lake’s dissolved oxygen content (D.O.) falls sharply in the upper depths to zero by 20m. The microbial plate is located just under the oxygenated region, and coincides with the peak manganese concentration [19]. (C) The microbial diversity of biofilms inoculated from lake samples and grown for two weeks, analyzed with 16S ribosomal gene Illumina sequencing. Chlorobium is the most abundant taxon (unpublished). (D) The biofilm mass, as measured by crystal violet assay of the samples grown in (C), was highest in the light with manganese (unpublished).

To identify the organisms involved in this manganese oxidizing process, cultures of water from Green Lake were grown under completely anaerobic conditions mimicking those of the Archean Earth. The community that grew under these conditions was able to oxidize manganese, and contained Chlorobium, Paludibacter, Acholeplasma, Geobacter, Desulfomicrobium, Clostridium, Acetobacterium, and several other bacteria species at lower

abundances (Figure 2.2). The manganese oxidation observed was light-dependent, indicating that photosynthesis was required, and of these organisms, the only photosynthetic one was the green sulfur bacterium Chlorobium limicola (Figure 2.2). To our surprise, however, cultures of pure C. limicola did not produce manganese oxides. Only co-cultures containing both C. limicola and the metal-oxidizing Geobacter lovleyi were able to produce Mn oxide minerals, suggesting that Mn(II) oxidation requires interactions between more than one microorganism. The next step in the research process is to determine how these bacteria perform this oxidation.

2.3 Genes and Proteins

To determine how an organism performs a particular process is not an easy task. The instructions for running a cell are encoded in its DNA, which is used as a blueprint to create the protein machinery that performs all cellular functions [20]. Studying these proteins requires long and complex laboratory-based experiments. However, the advent of new sequencing technologies has made it possible to determine the complete sequence of an organism’s DNA, which codes for all the proteins it can produce [21]. Once the DNA sequence is obtained, it is possible to identify potential gene-coding regions, translate those regions into amino acid protein code, and predict their function [22].

Even though this process can be done computationally, it is not trivial to decipher the genetic code and determine protein function [23]. Simple comparisons of letter sequences are not enough: small mutations in the DNA sequence can drastically alter a gene’s function if they change the physical properties of an amino acid, or not change it at all if the amino acid folds into a region of the protein that is non-essential [24]. Only very small portions of genes end up forming the binding sites that allow proteins to interact with other molecules. Therefore, it can be hard to even identify where those binding sites are, much less what they bind. Each of these challenges on its own would be difficult to model computationally, and the combined challenge is quite complex [23].

The complexity of the issue has led to the development of numerous computational tools, which will be discussed in detail in Chapter 3. What they all have in common is that they represent an attempt to use algorithmic logic to solve the complex problem of identifying genes according to function. They work by transferring annotations from known proteins to novel ones based on sequence similarity [25], often employing complex hierarchical terminology labeling systems [4, 5]. The technique works by first representing proteins as a set of features, then creating model feature sets representative of various classes, and determining which model set best matches any newly identified proteins, which can finally be assigned to that class [26].

While these tools are extremely useful to label genes in unknown genomic sequence data, they fall short when asked to determine specific functions of genes that are less well characterized in databases. For example, there are no known genes that can oxidize manganese anaerobically as described in this chapter. Particular peroxidase and multicopper oxidase genes have been found that can perform the oxidation aerobically, but both gene families are large and sometimes have the ability to oxidize compounds other than manganese. Thus, current computational tools cannot tell which individual genes bind manganese and which do not [13]. The algorithms are too general, and the models they create are too large to determine function with such accuracy.

In the next chapter I will discuss the background of machine learning within computational biology in order to introduce a new potential way to study genes, the proteins they code for, and the functions of the cells that contain them. I hope to show that binary protein discrimination tools, which learn to differentiate between only two functional categories, may be a better method for the hardest protein function differentiation problems than tools that attempt to assign every protein to a category. This technique is much more specific than the general classification method in that it deals with only one particular functional ability of any given cell. However, what it loses in breadth may be gained in depth: binary classifiers tend to be more accurate, leading to the identification of new genes in a given class.

3 The Field of Machine Learning

3.1 What is Machine Learning?

Machine learning is a field in the discipline of computer science that investigates the problem of teaching computing machines to learn. Learning is the process of improving knowledge or ability by study, instruction, or experience. Computing machines cannot yet learn with the depth that humans can, but the objective of machine learning is to understand how computer learning and human learning can teach us about each other. Researchers in machine learning attempt to study the “spectrum of learning methods” employed by humans, and to “endow computers with the same ability” [27]. In 1983, Carbonell and colleagues broke the field into three major branches that are still useful categories for thinking about the types of studies in machine learning:

Type 1 – Practical Studies: the attempt to teach a computer how to improve its performance on a particular task

Type 2 – Cognitive Studies: using computer simulation of human cognition to learn more about how humans think

Type 3 – Theoretical Studies: investigating learning algorithms to explore the boundaries of what types of learning are possible [28].

Each of these study types pushes the boundaries of computer science by investigating the relationship between human learning and computer learning, but with different focuses. Machine learning for text categorization, for example, is a Type 1 practical task that involves teaching a computer to learn the differences between types of texts [29]. Computer modeling of human emotion, on the other hand, is a Type 2 study aimed at understanding humans through computer simulation [30]. While some studies may fall into only one of these categories, many are complicated multifaceted projects that encompass more than one. For example, facial recognition is a practical task (study type 1) which would provide concrete practical benefits if it could be mastered by a computer [31]. It is also, however, surprisingly difficult for computers to match a human’s innate ability to identify faces in our surroundings. This difficulty reveals interesting aspects of human biology (type 2), and exploring different approaches to facial recognition has led to the creation of new algorithms that have expanded our idea of what is theoretically possible for computers to do (type 3) [32]. Through these types of complicated studies, we learn more about computers and logical thinking, while studying ourselves and how we think.

3.2 Classification in Machine Learning

While there are many different types of machine learning, one of the most common, and the one investigated in this thesis, is classification. Supervised classification, in which an algorithm learns to differentiate pre-labeled instances of classes, is likely the most common task set to machine learning tools [33]. These types of problems appear in language processing, facial recognition, data mining, and all sorts of other applications, including biological ones. The basic idea is to show an algorithm various instances of a particular type, let the algorithm learn the differences between instances in several marked classes, and then give the algorithm new instances of unknown class for which it can make predictions to determine where they belong. This type of task is ideal for a computer because it requires extensive memory and computational power, and can scale to much larger problem sets than classification done by humans alone.
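
The learn-then-predict loop just described can be illustrated with a toy nearest-neighbor rule: assign a new instance to the class of its closest labeled training instance. The data points and function names here are invented for illustration; real classifiers use far richer features and algorithms.

```python
# Toy illustration of supervised classification: learn from labeled
# instances, then predict the class of an unlabeled point from its
# nearest labeled neighbor.

def predict(train, point):
    """Return the label of the training instance closest to `point`."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(train, key=lambda item: sq_dist(item[0], point))
    return label

# Labeled training instances: (feature vector, class label)
train = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((5.0, 5.0), "B")]

print(predict(train, (1.1, 1.0)))  # "A" -- closest neighbors are class A
print(predict(train, (4.8, 5.2)))  # "B"
```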

There are two main issues still discussed in the field of classification that affect the efficiency of classification algorithms: the class distribution problem and the multi-class problem. Both are discussed in detail below.

3.2.1 The Class Distribution Problem

Class distribution refers to the proportion of instances belonging to each class in a dataset. Class distribution can be balanced, with an even number of instances in each class, but in most real-life cases, including most biological ones, classes are not balanced. In fact, the costs of creating an ideal large, balanced, well-sampled dataset are so high that researchers rarely come across a problem with data distributed the way that algorithms are designed to handle [34]. This creates a number of issues with trusting predictions made by these algorithms, issues which can be easily overlooked in the excitement of a high prediction accuracy score [35].

To understand the dangers, consider this example. Assume we are trying to predict the presence of a rare gene, and we have a dataset with 10,000 organisms to learn from. This dataset seems huge by biological standards, but if we have 9,500 organisms that do not have the gene, and 500 samples that do, we quickly run into a problem. By default, most classification algorithms are designed to iteratively converge on classifier parameters that maximize the accuracy of the classifier: that is, parameters that correctly assign the most training instances to the correct class [36]. In our example, this is an issue. Because 95% of the organisms do not contain the gene, all the classifier has to do is predict that no organism will ever contain the gene to achieve a 95% accuracy rate. Without careful planning, classifiers such as this could have disastrous implications for research.
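
The trap in the example above can be made concrete: on 9,500 negatives and 500 positives, a classifier that always predicts "no gene" scores 95% accuracy while never finding a single positive instance. A minimal sketch:

```python
# The imbalance trap: 9,500 organisms without a rare gene, 500 with it.
# A classifier that trivially predicts "no gene" for everything still
# achieves 95% accuracy -- while finding zero positives.

labels = [0] * 9500 + [1] * 500    # 1 = organism carries the rare gene
predictions = [0] * len(labels)    # always predict "no gene"

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
found_positives = sum(p == 1 and y == 1
                      for p, y in zip(predictions, labels))

print(accuracy)         # 0.95 -- looks impressive
print(found_positives)  # 0 -- but no positives are ever identified
```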

In this example, the issue stemmed from the fact that we used a natural class distribution for our training:

Natural Distribution: Natural distributions contain proportions of each class approximately equal to the proportion of instances of that class found in the real world. This type of distribution weights the classifier towards the larger classes biasing it to predict them with higher accuracy.

Natural distributions are intuitive, because it makes sense that a classifier should classify more common classes more frequently; however, they do not always produce the best classifier [34]. More commonly, techniques are used to create a balanced distribution:

Balanced Distribution: Balanced distributions contain an equal, or approximately equal amount of data in each class. This type of distribution gives the classifier the same amount of data from which to learn each class.

There are two main ways balanced distributions are created. The first is under-sampling: randomly removing instances of the majority class from the training data until you have an equal number in each set [37]. This is attractive because it maintains the integrity of the data, however, it can also severely limit the amount of data available to learn from [34]. Recalling the rare gene example from above, if we used an under-sampling method we would only be able to use data from about 500 of our 9,500 organisms without the gene. In some cases this would mean throwing away too much of the data. The second option is over-sampling, in which instances of the minority class are randomly resampled to create a larger dataset [37]. This solves the imbalance problem without throwing away any data but can make the training time exceedingly long by increasing the size of the dataset and can increase the chances of overfitting the minority class because data has been duplicated [34]. Various augmentations to these schemes have been proposed [38], as well as ways to combine them [39], however, similar limitations apply.
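
The two balancing strategies above can be sketched with simple random sampling: under-sampling discards majority instances, while over-sampling resamples the minority with replacement. This is an illustrative sketch (function names invented), not a production resampling routine.

```python
# Sketch of the two balancing strategies: random under-sampling of the
# majority class, and random over-sampling (resampling with replacement)
# of the minority class. Illustrative only.
import random

def under_sample(majority, minority, seed=0):
    """Shrink the majority class to the size of the minority class."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + list(minority)

def over_sample(majority, minority, seed=0):
    """Resample the minority class (with replacement) up to majority size."""
    rng = random.Random(seed)
    return list(majority) + rng.choices(minority, k=len(majority))

majority = list(range(9500))          # e.g. organisms without the gene
minority = list(range(9500, 10000))   # organisms with the gene

print(len(under_sample(majority, minority)))  # 1000 -- much data discarded
print(len(over_sample(majority, minority)))   # 19000 -- duplicates added
```

The printed sizes show the trade-off in the text: under-sampling throws away 9,000 majority instances, while over-sampling nearly doubles the dataset by duplicating minority instances, increasing both training time and the risk of overfitting.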

One such way to combine both over and under sampling is to create a distribution between natural and balanced. For example, adding 50% to the percentage composition of each class and then dividing by two is a way to calculate new percentages for each class that creates a distribution halfway between balanced and natural [35]. This approach over-samples one class and under-samples the other, which reduces the negative bias of each technique, but ends up modifying more of the original data.
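
The halfway calculation described above amounts to averaging each class's natural percentage with a balanced 50%. A small worked sketch, using the 95%/5% rare-gene example:

```python
# The halfway distribution [35]: average each class's natural percentage
# with 50% to get a distribution midway between natural and balanced.

def halfway(natural_pct):
    """Map a class's natural percentage to its halfway-distribution share."""
    return (natural_pct + 50) / 2

print(halfway(95))  # 72.5 -- majority class is under-sampled
print(halfway(5))   # 27.5 -- minority class is over-sampled
```

Note that the two outputs still sum to 100%, so the result is a valid class distribution sitting exactly between natural (95/5) and balanced (50/50).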

Changing the class distribution gets to the root of the issue with the accuracy metric, but sometimes, particularly when the dataset is small, there is no good way to change the distribution of the data [40]. A different approach to solve the issues that arise when calculating accuracy on an imbalanced dataset is to change the measure of accuracy in the first place [36]. This approach

is one of the most common ways to deal with class imbalance in small datasets, which suffer from manipulation of the instances themselves, but can still be classified well using a performance measure other than accuracy [34]. The most common form of alternative performance measure is referred to as cost-sensitive learning, which updates the concept of accuracy to give bigger penalties to classification mistakes on the minority class [41]. These metrics make use of the confusion matrix (Table 3.2.1), a table of measures that can provide insight into how well a classifier is doing on each class in a dataset [40].

Table 3.2.1 The Confusion Matrix. The confusion matrix is a combination of four values that assesses how well a classifier is predicting the positive class. The number of true positives (TP) is the number of actually positive class instances that were predicted to be positive. False positives (FP) are negative class instances mis-predicted to be positive. True negatives (TN) and false negatives (FN) are the same but for the negative class.

                      Predicted Positive     Predicted Negative
Actually Positive     True Positive (TP)     False Negative (FN)
Actually Negative     False Positive (FP)    True Negative (TN)

The confusion matrix gives four values for interpreting how well a binary classifier performs on the positive class. Accuracy, intuitively the percentage of correctly classified instances, can be defined from the confusion matrix as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

This definition highlights how accuracy combines information from both classes into one measure. The confusion matrix keeps them separate, allowing researchers to compare how well the classifier works on the negative class separately from how well it works on the positive class [40]. The limitation of this representation is that by keeping all the values separate, it becomes cumbersome to compare values across classifiers [42]. To solve this issue, numerous combination measures have been created, each with a different emphasis, useful for different problems [43]. Two of the most intuitive, and useful in cases when it is important to correctly identify all the true positives and prevent false positives, are precision (P) and recall (R) [40], which are defined below:

P = TP / (TP + FP)        R = TP / (TP + FN)

Recall is the proportion of actually positive instances that are correctly identified as positive, while precision is the proportion of predicted positive instances that are actually positive [43]. These two measures are powerful at identifying the success of a classifier on a single class even when class distributions are imbalanced; however, two measures are not much better for comparison than one [40]. The most common way to combine the two is by taking their harmonic mean, called the f measure [44]. When the mean is balanced, it is called the f1 measure, and is defined as below:

f1 = 2(P × R) / (P + R)

This singular metric is harder to interpret but can easily be compared across classifiers [40]. Because the f1 measure does not take TN into account, it focuses on only the positive class, and is best for classification cases when the accuracy on a single class is most important.
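
The three positive-class measures defined above can be computed directly from the confusion-matrix counts. A minimal sketch, using a made-up classifier that finds 400 of 500 positives with 100 false positives:

```python
# Precision, recall, and f1 from confusion-matrix counts, as defined above.

def precision(tp, fp):
    """Fraction of predicted positives that are actually positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that are predicted positive."""
    return tp / (tp + fn)

def f1(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Hypothetical classifier: 400 of 500 positives found, 100 false positives.
print(precision(400, 100))   # 0.8
print(recall(400, 100))      # 0.8
print(f1(400, 100, 100))     # 0.8
```

Because none of these formulas uses TN, a classifier can score well on all three while misclassifying the negative class badly, which is exactly the bias toward the positive class discussed in the text.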

To get around the issue of biasing towards a single class, while still being applicable to small biological datasets that suffer from adjustments to sample balance, Matthews proposed an alternative way to combine the values of the confusion matrix [45]. The measure has come to be known as the Matthews Correlation Coefficient (MCC) and can be calculated from the confusion matrix as shown below:

MCC = (TP × TN - FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

The measure ranges from -1 to 1, with 1 being perfect classification, and -1 being perfect inverse classification [46]. Because the measure takes into account both true positives and true negatives, even though its interpretation is less intuitive than other measures, it has been considered the best way to combine the elements of the confusion matrix in an unbiased way, and has become popular in computational biology [46].
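The MCC can likewise be computed directly from the four confusion-matrix counts. The following Python function is an illustrative sketch (with a conventional fallback of 0 when the denominator vanishes, an assumption of this sketch rather than part of the original definition):

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Perfect classification scores 1.0; perfect inverse classification scores -1.0.
print(mcc(10, 10, 0, 0), mcc(0, 0, 10, 10))
```

Because both TP and TN appear in the numerator, a high MCC requires good performance on both classes at once.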

The idea of each of these accuracy metrics is to prevent the accuracy measure from favoring one class while another is severely underrepresented [36]. They still have inherent biases in one direction or another [43], but by understanding these biases, researchers can choose the one that best fits their problem [40]. In many cases, particularly when datasets are small, choosing an alternative accuracy metric such as the MCC can be a much less biased way to overcome class imbalance than manually “fixing” the imbalances.

In cases when these metrics are not enough, it is also possible to modify the algorithm itself. This technique is less common in computational biology, so I will not go into it in detail here, but it could be a promising option in the future. The idea is to create an unbiased algorithm, or to use an ensemble of different algorithms whose biases compensate for each other [36]. This has the advantage of being specific to a particular problem, so it can be fine-tuned to work as well as possible; however, its particularity also limits its applications to the highly variable world of biology. Looking into various algorithms quickly becomes a swim in an alphabet soup of options tailored to different problems. In a single example, extreme learning machine (ELM) neural networks can be augmented to become weighted ELM, class-specific cost regulation ELM (CCS-ELM), boosting weighted ELM (BWELM), regularized weighted circular complex valued ELM (RWCCELM), ensemble weighted ELM (EWELM), and class-specific ELM (CS-ELM) [47]. While these often have the potential to perform well, they must be selected carefully for the task at hand, and are hard to generalize.

3.2.2 The Multi-Class Problem

When the goal of a classification task is to divide data into two discrete classes, it is called a binary classification task. The field of machine learning has put forth many extremely powerful algorithms for this type of classification [48]; however, not all tasks are by nature binary. Protein classification, for example, usually attempts to divide proteins into multiple functional classes, which can number in the thousands depending on the set of gene function classes used [4]. The challenge of performing multiclass classification with algorithms that were built for, or at least work better in, binary classification has been addressed in many different ways [49].

Because binary classification is fundamentally easier than the multiclass version of the same problem, many algorithms for multiclass classification solve the problem through some combination of binary methods. There are two usual ways that this is done:

1. One vs. All: This classification scheme creates one binary classifier for each class that can discriminate that particular class from the rest of the data, i.e. the ith classifier, f_i, is trained to recognize items in class i as positive instances while data from all the other classes are used as negative instances. A novel instance is classified as belonging to the class whose classifier gave the highest percentage certainty [50].

2. One vs. One (All vs. All): This scheme builds one binary classifier for each pair of classes in the dataset, disregarding all instances not in one of the classes in the given pair. Classifier f_ij discriminates between instances of class i and class j. Multiple methods for combining the outputs of each classifier have been proposed; the most common is a voting system in which each classifier votes for one of its two classes and the class with the most votes is assigned [49].
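Both combination schemes are available as off-the-shelf wrappers in Scikit-learn, the library used later in this thesis. The sketch below is an illustration on a standard three-class toy dataset rather than protein data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # three-class toy dataset

# One vs. All: one binary classifier per class.
ova = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# One vs. One: one binary classifier per pair of classes, n(n-1)/2 in total.
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(len(ova.estimators_), len(ovo.estimators_))  # 3 classifiers vs. 3 pairs
```

For three classes the two schemes happen to build the same number of classifiers; with ten classes, One vs. All would build 10 while One vs. One would build 45.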

The differences between one vs. all (OvA) and one vs. one (OvO, also called all vs. all) vary depending on the type of underlying binary classifier, the independence between the classifiers, and the type of problem being solved [49].

In an attempt to improve on these two systems, some researchers have proposed combination methods that consider information about the already created binary classifiers while creating each additional classifier. These algorithms can be divided into two categories:

1. Single Machine Approaches: This approach constrains the problem to a single optimization step that trains all of the classifiers at once [51].

2. Error Correcting Output Codes (ECOCs): This approach calculates the error and correlation between binary classifiers as it combines them with a goal of reducing overall error [52].

While individual studies have found that OvO, Single Machine, and ECOC methods perform better than the most simplistic OvA approach [51-53], one recent review paper made a strong argument that this may not be the case [54]. Another review offers evidence that the failings of OvA strategies can be fixed rather simply if the underlying binary classifiers are sufficiently well trained [49].

With so much debate in the field on the proper way to solve the multiclass classification problem, one point is clear throughout the literature: binary classification is easier than the multiclass version [48-52, 54]. In response, this thesis aims to avoid the pitfalls of the multiclass problem by avoiding the situation of multiple classes to begin with. Instead of changing the algorithm to better suit the question, the biological question can be reframed in a way that allows the algorithms to work their best. In this case, this means that instead of trying to classify all the unknown proteins in a given organism, I will try to categorize the presence or absence of specific biological functions. This approach changes a messy multiclass problem into a simple binary classification.

3.3 Algorithm Choice

In addition to outlining a general approach to classification, researchers looking to build a tool must choose an algorithm that fits the needs of their classifier. Virtually every machine learning algorithm has at some point been applied to the task of protein differentiation, with varying degrees of success [5]. This section looks at some of those algorithms, with their advantages and disadvantages.

3.3.1 Naïve Bayes

Naïve Bayes is a simple algorithm that uses Bayes’ Rule (Figure 3.3.1) to calculate the probability that each data point belongs to each class given a set of features. Naïve Bayes is limited by the fact that Bayes’ rule assumes independence between the features, and this is almost never the case [55]. For example, in the case of protein differentiation with amino acid composition as features, switching one amino acid for another will increase the value of one feature and decrease another; each feature affects the values of the others, therefore they are not independent. Despite this limitation, however, the simplicity of Naïve Bayes means it is often used as a baseline comparison algorithm in studies looking at new prediction techniques [56]. If a model performs better than Naïve Bayes, it can be assumed that the data does not follow normal distributions in all classes, and the model must be capturing an interesting pattern in the data.

Pros
- Works well with very little training data
- Fast and efficient computation
- Determines the importance of each variable

Cons
- Assumes independence between features
- Assumptions about distributions can limit the types of correlations observable

Example: Feng et al. used a Naïve Bayes model to identify phage virion proteins based on amino acid and di-peptide amino acid composition [57]. With all 420 features they reached 75% accuracy, and showed that using feature selection can improve the accuracy, slimming down the model to only include 38 features, and achieving 79% final accuracy [57].
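As an illustration of the algorithm (not a reproduction of Feng et al.'s model), Scikit-learn's GaussianNB can be trained on synthetic stand-ins for amino-acid-composition features; the data below is randomly generated and purely hypothetical:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Hypothetical stand-in for amino-acid-composition features:
# 40 "proteins" x 20 composition percentages, with two shifted classes.
X = np.vstack([rng.normal(5.0, 1.0, (20, 20)),
               rng.normal(6.0, 1.0, (20, 20))])
y = np.array([0] * 20 + [1] * 20)

model = GaussianNB().fit(X, y)
print(model.predict(X[:1]))  # class predicted for the first instance
```

GaussianNB models each feature with a per-class normal distribution, which is exactly the distributional assumption discussed above.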

Figure 3.3.1 Bayes’ Rule. The Naïve Bayes algorithm follows Bayes’ rule to calculate the probability of a class given some data based on the data and class distributions.

3.3.2 Support Vector Machine (SVM)

Support vector machines (SVMs) plot each datum in multidimensional space where each dimension is one feature calculated from the data. The algorithm then finds the boundary that separates the classes with the widest possible margin. The training points that lie closest to this boundary, and therefore determine its position, are called support vectors, hence the name support vector machine. Because the boundary is a hyperplane, it is always linear in the space in which it is computed, but because the data can be mapped into a higher-dimensional space and the boundary projected back down, complex boundary functions are also possible to compute.

Pros
- Versatile to differently sized classes (including multi-dimensions)
- Hard to overfit

Cons
- Requires a large sample size
- Hard to interpret
- Data can require significant preprocessing to achieve good results
- Irrelevant features can significantly slow down training

Example: Kumar et al. used a support vector machine to predict β-lactamase functionality and type based on protein sequence [6]. They achieved an accuracy of 89% using amino acid composition to detect functionality, and an accuracy of 60% (Type A), 83% (Type B), 69% (Type C) and 69% (Type D) when classifying each of the four types separately [6].
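A comparable SVM can be sketched in Scikit-learn. The synthetic data below merely stands in for protein feature vectors, and a scaler is included because, as noted above, SVMs often need significant preprocessing:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data standing in for protein feature vectors.
X, y = make_classification(n_samples=100, n_features=20, random_state=0)

# SVMs are sensitive to feature scale, so pair them with a scaler.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X, y)
print(svm.score(X, y))  # training accuracy
```

The `kernel="rbf"` argument is what allows the linear hyperplane to become a complex boundary in the original feature space.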

Figure 3.3.2 Support Vector Machine in Two Dimensions. Support vector machines form a linear boundary between classes. This two dimensional representation can be scaled up into multidimensional space and then reduced back down to incorporate more features and more complex decision boundaries. In this example, the new purple data point would be classified as blue because it falls on the blue side of the boundary.

3.3.3 Neural Nets

Neural networks, or neural nets, are popular algorithms in machine learning for use in particularly difficult learning tasks [58]. They create a network of nodes and edges running from input feature values as starting nodes to prediction nodes at the end of the network. Each edge applies a weight that can be tuned during training to manipulate the input so that the network gives the correct prediction.

Pros
- Handles multi-dimensions and continuous features well
- Can solve difficult learning tasks

Cons
- Very hard to interpret
- Requires a large sample size
- Irrelevant features make training slow and sometimes completely impractical

Example: Clark et al. used a neural-network-based program that incorporates the interdependencies between GO terms, a common protein function identification system, to transfer functional annotations from already annotated protein sequences to novel ones [22]. Their results showed that the multi-output framework of neural nets helped to improve the accuracy of protein function annotation transfer, particularly when the proteins had high sequence identity [22].

Figure 3.3.3 Neural Network Diagram. While neural networks can be much more complicated, this diagram represents the basic idea, with a sequence of input nodes, hidden layers and then a set of output nodes. Arrows between nodes represent functions that combine the values from the layer before into the current layer. In classification, the output is the probability of belonging to a particular class [59]. The example here has two output nodes that would each report a probability that the input belonged to the orange or blue class.

3.3.4 k Nearest Neighbors (kNN)

The k nearest neighbor algorithm is one of the simplest learning algorithms. kNNs calculate the distance between a novel instance and each datum in the training set, then assign the novel instance to the class in which the majority of its k “nearest neighbors”, or closest matched data points, are found. k is an example of a hyperparameter that must be tuned for any given dataset. This process is flexible enough to work with an arbitrarily large number of features, always predicting class by calculating the distance in the multidimensional space created by those features. As a lazy learning algorithm, a kNN stores information about each instance in the training data set, and performs most of the expensive calculations when a new datum is classified by comparing the novel instance with the stored information [60]. The lazy-learning aspect of the algorithm means less computational time when the model is created, but more time for each prediction. This limits the usability of kNNs for large-scale classification-style protein function tools [61], but makes them a good candidate for discrimination when researchers have a smaller number of candidate genes and want to know whether they belong to a group.
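In Scikit-learn, which is used later in this thesis, the algorithm takes only a few lines. The data here is synthetic and k is fixed at an arbitrary 5 for illustration, rather than tuned as it should be in practice:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for a protein feature table.
X, y = make_classification(n_samples=60, n_features=20, random_state=0)

# k is a hyperparameter; a fixed k=5 is used here purely for illustration.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(X[:1]))  # majority vote among the 5 nearest training points
```

Note that `fit` here mostly just stores the training data; the distance calculations happen inside `predict`, which is the lazy-learning behavior described above.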

Pros
- Hard to overfit
- Works well on small datasets
- Decision boundary can take any form

Cons
- Large storage requirement
- Sensitive to choice of features
- Lacks a principled way to choose k

Example: Horton and Nakai used a kNN to predict the cellular localization of yeast and E. coli proteins with 60% and 86% accuracy respectively [62]. Lan et al. used a kNN to predict protein function using information about sequence similarity, protein-protein interactions, and gene expression [63]. They assessed their models using Term Area Under the Curve (TermAUC), with a value of 0.848 for their best classifier [63].


Figure 3.3.4 kNN Classification in Two Dimensions. In two dimensions, a kNN algorithm plots all the data points and classifies a new instance into the class held by the majority of the new instance’s k nearest neighbors. In the example above, the two-feature 3-NN algorithm depicted would classify the new purple instance as blue. This process can be scaled up to an arbitrary number of features plotted in an arbitrary number of dimensions.

3.3.5 Decision Trees and Random Forests

In contrast to the lazy-learning computation style of kNNs, decision trees are considered eager-learning algorithms because they perform the majority of the expensive computational steps in setting up the tree, and can quickly classify new instances. Setting up a decision tree requires calculating features based on the instances, classifying the instances based on those features, and determining which features split the data in the most efficient way possible [33]. To create a decision tree, the algorithm iteratively calculates the best split between values in the training data, creating a tree of “rules” describing the data set [64]. When a new datum is tested, the rules can be followed down the branches of the tree to a leaf that predicts the class of the datum.

A random forest is created by building many such decision trees each with a subset of features from the master set, and then averaging the class prediction of all the trees in the forest. The hyper-parameters of random forests include the number of trees in the forest, the number of branches allowed in each tree, and the smallest number of samples allowed at any ending leaf. Random forests are often used in computational biology [65].
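The hyperparameters listed above map directly onto arguments of Scikit-learn's RandomForestClassifier. The values below are arbitrary illustrations on synthetic data, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data standing in for a protein feature table.
X, y = make_classification(n_samples=100, n_features=20, random_state=0)

# The hyperparameters named above correspond to these arguments:
forest = RandomForestClassifier(
    n_estimators=50,      # number of trees in the forest
    max_depth=5,          # limit on branching in each tree
    min_samples_leaf=10,  # smallest number of samples allowed at a leaf
    random_state=0,
).fit(X, y)
print(forest.predict(X[:1]))  # averaged vote across the 50 trees
```

Each tree in the fitted forest also sees a random subset of features at each split, which is what decorrelates the trees and makes averaging effective.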


Pros
- Understandability: it is easy to see which training features resulted in any given outcome. This is a little harder with random forests than with single decision trees, but still possible.
- Works when sequences are dissimilar
- Works well for classification

Cons
- No single method exists to find the “best” pruning or split values
- It is sometimes possible to improve decision trees by adding conjunction, negation, or disjunction logical operators, but usually trees are univariate
- Needs a lot of training data

Example: Jiang et al. used a random-forest-based model to classify real and pseudo micro-RNA precursors. They achieved accuracy rates ranging from 88-96% and MCC values from 0.77 to 0.94 depending on the set of features used in training [66].

Figure 3.3.5 A Random Forest with Three Trees. This example random forest creates three subsets of features from which different decision trees are created. Each tree can give its own prediction for the class of a new instance which can then be averaged to predict the class. In this case the new instance would be classified as blue.

3.4 Machine Learning in Biology

Machine learning is a very large field with many applications, but one significant one is computational biology. The combination of the ever-increasing amount of biological data available to researchers and the power of machine learning tools in finding patterns in this data has led to the increasing application of machine learning to biological problems [67, 68]. Of the three types of machine learning study discussed in Section 3.1, the field of computational

biology is focused mainly on Type 1 studies: using machine learning (as well as other computational techniques) to solve practical problems in the field of biology.

One major challenge in computational biology is the identification of the function of genomic data: either DNA, RNA, or protein sequences [69]. This is the type of problem discussed in Chapter Two: trying to use machine learning to discriminate between and classify proteins that have, or do not have, a given function. Current methods to address this problem classify proteins into many different categories, unlike the proposed solution discussed in this thesis, but it is important to consider briefly how these algorithms work.

Due to the complexity of the protein classification problem, the algorithms proposed to solve it vary considerably. The input data can be sequence alone, protein-protein interaction data, or gene expression data [63]. Some methods incorporate a high degree of biological information into their pipelines, such as Busa-Fekete et al. in their TreeInsert and TreNN algorithms that incorporate phylogenetic information into the protein classification problem [70]. Other methods rely on relatively few computational features, for example kNN classifiers that use amino acid composition as the sole set of computational features [19]. A few employ even more complicated algorithms to consider the hierarchical information captured by functional classification systems like the Gene Ontology (GO) [26]. In all cases, the algorithms use basic protein sequence data to compare known proteins to new ones in order to transfer annotations of protein function.

The classification process is useful because it allows for the fast processing of many proteins into biologically relevant categories, but there are known issues. For example, the class distribution problem (Section 3.2.1) can be particularly tricky when there are not very many experimentally proven examples of some protein types, while other proteins are extremely well studied. In addition, different algorithms often have different levels of success on different types of proteins [6], so no single classification tool has been built that can create discriminatory models for any given binary differentiation task.

In my thesis I have explored how to create such a binary protein function differentiation tool. Because binary protein discrimination tools do not learn an overarching set of classification categories, they can be more specific with less training data than traditional models [23]. While this quality limits the situations in which the tool is applicable, it can also be more accurate, leading to the identification of new genes in a given class. This type of model has been created for specific types of protein functions, such as Tung and Ho’s discriminatory model to identify ubiquitination sites [71], and Kumar et al.’s SVM to identify β-lactamase genes [6]; however, as of now, I know of no general purpose algorithmic tool that can build models trained to discriminate between any two given protein classes.

4 The Woolf Classifier Building Pipeline

4.1 Introduction

The previous two chapters outlined what machine learning is, and how it can be applied to biological problems like protein function differentiation. This chapter introduces a new tool called the Woolf Classifier Building Pipeline.1 The Woolf tool guides researchers through the process of creating a machine learning model that can distinguish which members of a protein class have a particular function, exactly like the case presented in Chapter Two. The point of the Woolf approach is to use a simple binary predictor to learn the subtle differences in protein sequence that determine the specific function of a protein in a given class. Such a technique is missing from the general classification scheme, often leading to issues when precise protein function information is required, especially on less well-known classes [72].

4.2 How is the Woolf Pipeline Different?

The Woolf Pipeline’s binary classification approach allows it to solve different problems than usual protein differentiation tools. Most tools are categorical: they learn the distinguishing features of many different protein categories, and then sort new proteins into those categories [72]. This type of protein function differentiation is useful in confronting the huge datasets produced in genome sequencing experiments because they give researchers an initial overview of the types of genes present in genetic sequences [73]. Raw data, which can be millions of letters long, are sorted into predicted protein generating regions, which are then used as input to categorical protein function differentiation tools to predict general classes for each predicted protein (Figure 4.2) [73].

These types of tools work in many different ways: some transfer protein annotations based on sequence similarity, like the well-known BLAST algorithm, while others look for known functional motifs that could indicate the overall function of a given protein, and still others build clusters of similar proteins under the assumption that proteins with similar biological features will group together and be more likely to have similar functions [4, 74]. The most common protein function classification algorithm is HMMer, which uses hidden Markov models to create a Markov chain representation of each protein class, and then compares new sequences to each representation to determine the best annotation [75]. Despite their differences in approach, what all of these models have in common is their generality; these models are able to take any protein coding sequence and transfer the best matching annotation from previously

1 A Note on Terminology: The Woolf Classifier Building Pipeline is a computational tool that produces binary classification models called Woolf Classifiers, which model the presence versus absence of a particular protein function in a given protein class.

known examples. Woolf Classifiers are different: they look for a single function. Given a set of proteins in a particular protein class, a Woolf Classifier can learn to distinguish between proteins that have a given function and those that do not.

[Figure 4.2 diagram: bacterial cells yield DNA, which undergoes whole genome sequencing, protein-coding region prediction, and protein function annotation.]

Figure 4.2 Genome to Gene to Protein. The genome contains both protein coding regions and non-protein coding regions. Whole genome sequencing reads sections of DNA a few hundred base pairs long, which can be combined into a full DNA sequence. DNA is translated into protein sequence through a direct three letter code. Protein function annotation algorithms can be used to predict the function of these regions; however, these initial guesses are often not very specific. More specific protein annotations are provided later through experimental techniques [76].

4.3 Pipeline Composition

The Woolf Classifier Building Pipeline is shown in Figure 4.3.2. It comprises two basic components. First, raw protein sequence data is converted into feature tables giving the percentage amino acid composition and protein length for every instance. Second, the pipeline trains either a kNN or random forest algorithm to differentiate between the two types of proteins identified in the input phase. The goal is to provide a basic setup for machine learning

that is tuned to the problem of binary protein function identification, but is still flexible enough to be tailored to a specific protein classification problem.

4.3.1 Input

The Woolf Pipeline uses raw amino acid sequence data as input. The FASTA/FASTQ file types have become the ubiquitous formats for such sequences (Figure 4.3.1) [77, 78]. The Woolf tool does not require the quality information stored in FASTQ files, so the pipeline is set up to deal only with FASTA files, which are smaller and faster to work with. Researchers with FASTQ files can do a simple conversion using a variety of online tools, although they may wish to filter out low quality sequences first [79].

[Figure 4.3.1 diagram: side-by-side examples of DNA and amino acid sequences in FASTA and FASTQ format.]

Figure 4.3.1 FASTA and FASTQ format. FASTA and FASTQ files are used to store genetic sequence data. Files can contain one or more sequence entries, each started by a special character, “>” for FASTA, “@” for FASTQ (blue). The character is followed by a text description of the sequence, often containing a unique ID number (red). All following lines until the next special character are considered sequence data, which can either be DNA letter code, or single letter amino acid code. In each case, one letter corresponds to one biological unit. In FASTA files, the sequence is followed by another “>” character indicating a new sequence. FASTQ files contain additional quality information from the sequencing procedure which is separated from the sequence data in the file by a “+” symbol (green). Conversion is possible between DNA and Amino acid files, and FASTQ can be converted to FASTA by stripping the quality data, but FASTA cannot be converted to FASTQ (arrows).
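Parsing FASTA input of this form requires only a few lines of code. The minimal reader below is an illustrative sketch, not the Woolf Pipeline's own parser:

```python
def read_fasta(path):
    """Parse a FASTA file into a {description: sequence} dictionary (minimal sketch)."""
    records, name, parts = {}, None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):          # ">" starts a new record
                if name is not None:
                    records[name] = "".join(parts)
                name, parts = line[1:], []    # text after ">" is the description
            elif line:                        # all other lines are sequence data
                parts.append(line)
    if name is not None:
        records[name] = "".join(parts)
    return records
```

Calling `read_fasta` on a file with multiple records returns one entry per description, with the sequence lines of each record concatenated together.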


4.3.2 Creating a Feature Table

The first step in the Woolf Classifier Building Pipeline is to create feature tables from the input sequence data. Raw amino acid sequences from the FASTA files are analyzed and converted into comma separated value (CSV) files containing a binary class feature table ready for machine learning (Figure 4.3.2, Step 1). This command can take any number of FASTA files, but users need to indicate which files should be part of the positive and negative class based on the protein function they are studying. In the example discussed in Chapter Two, this means splitting a known class of manganese oxide genes, such as peroxidases, into two sets: one known to bind manganese, and one known not to bind manganese.

The feature table created by this command contains the percentage amino acid composition and length of each protein, which are the most common features used in feature-based protein differentiation algorithms [80, 81]. One criticism of this feature set is its failure to account for the order of amino acids within a protein, leading to proposed alternatives such as Chou’s pseudo-amino acid composition, which adds several features to the basic 20 amino acid percentages to account for sequence order [6, 82]. This approach has been shown to improve prediction accuracy, especially with respect to predicting where proteins function within the cell [82]; however, the additional features complicate the model. The extra complication hinders models trained on small datasets, and helps more with broader differentiation than with the specific functional annotations investigated here, so the Woolf Pipeline does not currently support it. Because the program is written using the popular BioSeq Python package built for dealing with biological sequence data, additional features, including Chou’s pseudo-amino acid composition, can be added easily [83].
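Computing the feature set itself is simple. The sketch below is an illustration rather than the pipeline's exact code, deriving the 20 composition percentages and the length for a single sequence:

```python
def composition_features(seq):
    """Percentage composition of the 20 standard amino acids, plus sequence length."""
    amino_acids = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes
    n = len(seq)
    row = {aa: 100.0 * seq.count(aa) / n for aa in amino_acids}
    row["length"] = n
    return row

print(composition_features("MELPNIMHPVAK")["length"])  # 12
```

Applying this function to every sequence in both input classes, and adding a class label column, yields exactly the kind of binary-class CSV feature table the pipeline trains on.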


Figure 4.3.2 The Woolf Classifier Building Pipeline. Woolf Classifiers can be built using a few simple commands on the command line in 6 steps. (1) Input FASTA files are converted into a feature table based on length and amino acid composition before (2) being used to train a model with either a kNN or random forest algorithm. The default parameters lead to an initial report of accuracy and best parameters (3), which can then be fine-tuned as inputs to the command (4). Users can also ask to see a list of the protein sequences misclassified by the best scoring classifier in each model (5) and predict the classes of unknown proteins (6).

4.3.3 Model Creation with Default Settings

After building a feature table, the second step in the Woolf Pipeline is to train a Woolf Classifier (Figure 4.3.2, Steps 2-4). This step is built to prevent the most common mistakes non-expert machine learning scientists make when creating machine learning tools [35] while still leaving room for classifiers to be fine-tuned to a particular task. The Woolf Classifier is built in the following stages:

1) Feature Scaling To ensure that features are evenly weighted in classification, feature scaling processes can be applied to the feature table so that every feature falls within a comparable range [84]. This is important for the kNN algorithm, but not necessary for random forests (Section 4.4.3). The default Woolf Pipeline scaling is a simple Min-Max scaling, which scales the training set data to fall between 0 and 1 [85].


2) Train / Validation / Test Splitting To prevent the model from over-fitting itself to the full dataset while determining hyperparameters, and to ensure that there is unlearned data reserved for testing, it is important to split the input data into training, validation, and test datasets [35]. In the Woolf Pipeline this is done through cross-validation using the Scikit-learn GridSearchCV class, which searches for hyperparameters over a user-provided number of splits n (Figure 4.3.2, Step 4).

3) Classification The Woolf Pipeline currently supports two simple classification algorithms, kNNs and random forests (Section 3.3). Both are optimized through GridSearchCV testing over a range of user-provided hyperparameter values. Having a built-in cross-validation-based hyperparameter testing step forces users to follow machine learning best practices [35].

4) Accuracy Evaluation During and after training, the accuracy of the classifier is calculated to improve results on the training set, and to provide evidence supporting the power of the classifier on new data. The accuracy of a model can be assessed with many different metrics (Section 3.2.1 and Section 4.4.4), which can be selected by the user of the Woolf tool. The default is Matthews Correlation Coefficient (MCC).

5) Manual Review At the end of the process, the Woolf script prints out metrics allowing for the evaluation of the model by the user (Figure 4.3.2, Step 3). If the desired accuracy or other performance metrics are not met, the model can be updated and refined as necessary (Figure 4.3.2, Step 4).
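The stages above can be approximated in a few lines of Scikit-learn, mirroring the defaults described (Min-Max scaling, cross-validated hyperparameter search, and MCC scoring). This is an illustrative sketch on synthetic data standing in for a 21-column feature table, not the pipeline's own implementation:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for a feature table (20 compositions + length).
X, y = make_classification(n_samples=80, n_features=21, random_state=0)

# Scaling + classification in one pipeline so scaling is re-fit per CV fold.
pipe = Pipeline([("scale", MinMaxScaler()), ("knn", KNeighborsClassifier())])
search = GridSearchCV(
    pipe,
    {"knn__n_neighbors": list(range(1, 21))},  # hyperparameter range to test
    scoring=make_scorer(matthews_corrcoef),    # MCC as the accuracy metric
    cv=5,                                      # number of cross-validation folds
).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Putting the scaler inside the pipeline is important: it prevents information from the validation folds leaking into the scaling step, one of the common mistakes the Woolf design guards against.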

4.3.4 Changing the Pipeline Parameters

Once users understand how to use the tool and see their first results with the default parameters, they can adjust the inputs to the Woolf Pipeline to improve the accuracy of their predictions. There are six user-specified inputs that can be changed on the command line when a Woolf Classifier is created. Table 4.3.4 lists the six input flags, their default values, and how to format the input on the command line to change each parameter.


Table 4.3.4 User Specified Parameters

Feature scaling type: default Min-Max; options StandardScaler, MaxAbsScaler; flag −�.
Cross-validation folds: default 5; options such as 10 or 20; flag −�.
Accuracy metric: default MCC; options accuracy, f1; flag −�.
Algorithm hyperparameter ranges:
  Number of neighbors (kNN only): default range from 1 to 20; input formats "5", "1-20", or "1-30,5"; flag −�.
  Number of trees (random forest only): default range from 1 to 20; input formats "5", "1-20", or "1-30,5"; flag −�.
  Minimum instances per leaf (random forest only): default 10, 15, 20, 25, 30; input formats "10", "10-50", or "10-30,5"; flag −�.

As discussed in Chapter 3, there are many possible values for each of these parameters, so both the defaults and the options to which each parameter can be changed have been chosen specifically for the problem of protein classification (See Section 4.4.3, 4.4.4, and 4.4.5).

4.3.5 Error Evaluation

The Woolf Pipeline can return the sequences in the training set on which the model makes incorrect judgements (using the −� command). Because of the cross-validation step in model creation, every instance in the training dataset can be evaluated in the final model. The script returns the mistakes as two lists of instances: one of proteins misclassified as positive (but actually negative) and one of proteins misclassified as negative (but actually positive). At this point the user is not concerned with the percentage correct, but with the attributes of the sequences that were misclassified, so receiving a list of sequences allows them to investigate which aspects of those sequences make them hard to predict. Once researchers can pull out the sequences that were identified incorrectly, they can analyze those instances themselves and compare them to new data they want to classify.
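One way such error lists can be produced (an assumed approach, since the Woolf source is not shown here) is with cross-validated predictions over the whole training set; the sequence IDs below are invented stand-ins for FASTA headers:

```python
# Sketch: use cross-validation predictions to list every training
# instance the model gets wrong (illustrative, not the Woolf source).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=21, random_state=1)
ids = [f"seq_{i}" for i in range(len(y))]   # stand-ins for FASTA headers

# Each instance is predicted by a model that never trained on it.
pred = cross_val_predict(KNeighborsClassifier(n_neighbors=3), X, y, cv=5)

# Misclassified as positive (actually negative), and vice versa.
false_pos = [ids[i] for i in range(len(y)) if pred[i] == 1 and y[i] == 0]
false_neg = [ids[i] for i in range(len(y)) if pred[i] == 0 and y[i] == 1]

print(false_pos)
print(false_neg)
```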

4.3.6 Prediction

The final phase of the Woolf Pipeline is predicting the class of new, unknown genes (using the −� command). Users create a feature table as they did in the first step and give it to their newly built model to determine whether the new protein sequences perform the studied function. This option also allows users to predict the function of test data that was held out of the training set to get a more accurate measure of how well the model works.
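A minimal sketch of this phase (assuming a scikit-learn pipeline internally, which the thesis does not state explicitly): the fitted scaler and classifier from training are reused on the new feature table.

```python
# Sketch: reuse the fitted scaler + classifier on a new feature table.
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X_train, y_train = make_classification(n_samples=120, n_features=21,
                                       random_state=2)
X_new, _ = make_classification(n_samples=5, n_features=21, random_state=3)

model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_train, y_train)

# One 0/1 label per unknown sequence: does it perform the studied function?
print(model.predict(X_new))
```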

4.3.7 Implementation

The pipeline is implemented as a set of two Python 3 scripts run on the command line. The machine learning components were built with the Python package Scikit-learn [86]. The Python package BioPython was used for the processing of the FASTA input files [87].

4.4 Design Decisions

Careful thought was put into each decision made in creating this tool so that non-experts in machine learning could use it to the greatest effect. This section gives more detail on the reasoning behind these decisions, including why the tool runs on the command line, why length and amino acid composition were chosen as features, the motivation behind the choice of default parameters, and why each optional parameter value was included.

4.4.1 Why the Command Line?

A good interface is necessary for the success of any tool, and its design must consider not only the practical aspect of what the tool produces, but also the complexities of how people understand the task the tool supports [88]. In the field of biology, the command line is a widely used interface well-known for being simple yet powerful [89]. Most computational biology tools are built first on the command line, and only after they become mainstream do they get graphical user interfaces (GUIs), which usually build the command line requests from a user-filled form [90].

The reason GUI interfaces are developed later is that they often lose some of the simple, efficient, and building-block like functionality of the pure command line [91]. Command line tools are favored in biology because they are simple, efficient, and can fit into pipelines with other command line tools without significant work rearranging them on the part of the user [92]. Commonly used computational biology tools that run in this manner include the sequence alignment tool MUSCLE [93], the phylogenetic tree building tool PhyML [94], the sequence search tool BLAST [74], the network compilation tool Facile [91], the multiple sequence alignment tool MAFFT [95], and the sequence processing pipeline EMBOSS [92]. QIIME, which processes raw sequence data, and Mothur, a tool for comparing microbial communities,

both rely heavily on the ability of command line tools to allow modular pipeline style use [96, 97]. The command line interface on all these tools provides a common starting place for computational biologists to learn, keeps users from having to learn new interfaces for every tool, and allows them to be combined into multifunctional pipelines.

While the Woolf tool might in the future need a GUI interface for users unfamiliar with the command line, the initial users of the Woolf tool will likely be scientists who already use command line tools to process their sequence data, and so will appreciate being able to use the Woolf Pipeline on the command line in sequence with those other tools.

4.4.2 Machine Learning Features

Woolf Classifiers are built using amino acid percentage composition and length as the 21 features given to algorithms for classification. These features are common in protein classification algorithms because it is the properties of the amino acids that give a protein its function (Figure 4.4.2) [6]. The one downside to amino acid percentage composition is that it does not take into account the order of the amino acids within the sequence. Order can be tricky to measure because the complex folding that proteins undergo can bring distant regions of the protein together [98]; however, some researchers have suggested alternate ways to calculate percentage composition features that take order into account [6].
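The 21 features can be computed in a few lines of plain Python. This is an illustrative sketch only; `feature_vector` is not a function from the Woolf source:

```python
# Sketch: 20 amino acid percentage compositions plus sequence length.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def feature_vector(seq: str) -> list:
    """Return [%A, %C, ..., %Y, length] for one protein sequence."""
    length = len(seq)
    composition = [seq.count(aa) / length for aa in AMINO_ACIDS]
    return composition + [length]

features = feature_vector("MKVLAAGG")
print(len(features))   # 21 features in total
print(features[-1])    # the last feature is the raw length, 8
```

Note the length feature is on a very different scale from the composition fractions, which is what motivates the scaling step discussed in Section 4.4.3.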

The most popular way to incorporate amino acid order into machine learning tools for protein prediction is Chou's pseudo-amino acid composition [82]; however, there are a few reasons it is not ideal for the Woolf Pipeline. First, it has been shown to improve prediction particularly among distantly related proteins [6], but the Woolf Pipeline looks at closely related proteins to get more specific annotations. In addition, Chou's method adds to the total number of features. A ratio of 10 instances per feature is considered best for machine learning [35], implying that a classifier built with the Woolf Tool's 21 features should be given at least roughly 200 example instances. This is already a lot of instances for more obscure protein types, so adding more features is not desirable. Because the Python package used to build the pipeline does include Chou's composition [87], it could easily be added to future versions.

Figure 4.4.2 The Twenty Amino Acids. Proteins are chains of amino acids encoded by a genetic sequence. Each amino acid has different properties that affect the function of the protein it helps create. This chart displays the molecular structure of each amino acid, gives some chemical properties, and displays the three-letter DNA codes that translate to each amino acid. Image from [99].

4.4.3 Scaler Operations

Scaling is the process of modifying the center and range of the data in each feature. It is used to modify input data distributions to meet the assumptions most algorithms make about their input data [100]. Random forests are tree based and do not require scaling; with algorithms like kNNs, however, scaling the data prevents features with different ranges from unduly influencing the prediction [100]. Scaling might not be necessary if only percentage amino acid composition were used as features, but the length of the protein is also included, and as it is on a significantly different range, it could unbalance the classifier [35]. The Woolf Pipeline provides several different scaler options for different types of datasets (Table 4.4.3).

The Woolf Pipeline default is the Min-Max scaler. This is because the feature data provided has the potential to be sparse but is never negative, and likely will not have outliers. Max-Abs would be useful if there were negative data, but because that is not possible with percentage composition and length as features, it is only included as a comparison with Min-Max. The Standard Scaler might be an alternative if sparsity is not a concern, but it usually will be, because some amino acids are rare and might not be present in some proteins. The Robust Scaler might be useful in cases where a researcher knows that a few of their instance sequences are very different from the rest and does not want those sequences to unduly influence the classifier.

Table 4.4.3 Possible Scaler Types. All implementations come from the Scikit-learn preprocessing package [101]. Additional scaling and normalization options could be added by an advanced user but are less applicable to amino acid composition features and so are not included here. For the random forest algorithm, or if the user wishes not to scale the data, they can supply the value "None."

StandardScaler: scales each feature to zero mean and unit variance. Suggested for general use.
MinMaxScaler: scales each feature to a range between 0 and 1. Suggested for sparse data, possible zeros in the data, or small standard deviations.
MaxAbsScaler: scales each feature to a range between -1 and 1. Suggested for sparse data, possible negative data, or small standard deviations.
RobustScaler: scales using alternative center and range metrics that are robust to outliers. Suggested for data with outliers.
None: no scaling. Suggested for approximately normally distributed data in similar ranges, or for comparison to other methods.

4.4.4 Accuracy Metrics

The accuracy metric is used iteratively as the classifier is trained to converge on the classifier parameters with the best results. Different measures emphasize different types of accuracy and can be beneficial in different circumstances [40]. A detailed description of these metrics is given in Chapter 3; the options available in the Woolf Pipeline are shown in Table 4.4.4. The default option is the Matthews Correlation Coefficient (MCC), which has been shown to be good for small unbalanced datasets where both the positive and negative classes are important [45, 46].

Table 4.4.4 Possible Accuracy Metrics. All implementations come from the Scikit-learn metrics package [101].

accuracy: percentage of instances classified correctly, (TP + TN) / (TP + TN + FP + FN). Suggested for balanced class distributions of instances.
recall: proportion of actually positive instances that are correctly identified as positive, TP / (TP + FN). Suggested when the most important result is to identify all the positive cases.
precision: proportion of predicted positive instances that are actually positive, TP / (TP + FP). Suggested when it is important to make sure all the predicted positives are really positive.
f1: harmonic mean of recall and precision, 2(P × R) / (P + R). Suggested when both recall and precision are important.
MCC: combination of all terms from the confusion matrix, (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)). Suggested for small datasets in which both positive and negative classes are important.
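As a concrete check of these formulas, the corresponding Scikit-learn implementations can be run on a toy prediction with TP = 3, TN = 3, FP = 1, FN = 1 (an illustrative example, not data from the thesis):

```python
# Sketch: evaluating one set of toy predictions with all five metrics.
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score)

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]   # TP=3, FN=1, TN=3, FP=1

print(accuracy_score(y_true, y_pred))     # (TP+TN)/total = 6/8 = 0.75
print(recall_score(y_true, y_pred))       # TP/(TP+FN) = 3/4 = 0.75
print(precision_score(y_true, y_pred))    # TP/(TP+FP) = 3/4 = 0.75
print(f1_score(y_true, y_pred))           # 2PR/(P+R) = 0.75
print(matthews_corrcoef(y_true, y_pred))  # 8/16 = 0.5
```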

4.4.5 Algorithm Choices

There are many different options for classification algorithms, just a few of which are described in Chapter Three [33]. The ones chosen for the Woolf Pipeline follow the principle that simple algorithms are better than more complex ones, particularly if they can provide some intuitive insight into the results [35]. kNNs were chosen for their simplicity and their ability to deal well with small datasets (N < 200) like those for obscure proteins. The large storage requirements and long classification times of kNNs are not issues on small datasets [61], so they are not a concern in the Woolf Pipeline. The random forest is a little more demanding in terms of dataset size, but has been shown to work extremely well for classification [102], so it is included for larger datasets where a more powerful algorithm might be useful. Users can try both algorithms to see which gives them more useful biological interpretations of their data.
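For reference, the two supported algorithms map onto Scikit-learn classes as sketched below; the hyperparameter values here are illustrative, not the Woolf defaults:

```python
# Sketch: the two Woolf algorithm choices as scikit-learn estimators.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=150, n_features=21, random_state=4)

knn = KNeighborsClassifier(n_neighbors=3)   # k: number of neighbors
forest = RandomForestClassifier(
    n_estimators=9,        # t: number of trees
    min_samples_leaf=10,   # l: minimum instances per leaf
    random_state=4,
)

for model in (knn, forest):
    model.fit(X, y)
    print(type(model).__name__, round(model.score(X, y), 2))
```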

4.5 When are Woolf Classifiers Useful?

Due to the differences between Woolf Classifiers and general protein function differentiation algorithms, Woolf Classifiers are useful in particular circumstances. General protein function annotation algorithms like HMMer are good at annotating protein categories with many experimentally identified training examples, but can make mistakes on less well studied protein classes [103]. In addition, many protein function databases have accumulated mistaken annotations due to large scale functional annotation projects without enough manual checking of results [104]. Results are not checked properly because of the experimental cost of most protein function identification experiments [105]. Woolf Classifiers are designed to fill the gap; because they are more specific, they take more careful human input than large-scale genome annotation, but do not have as high time and monetary costs as complex laboratory experiments. Woolf Classifiers have an advantage because (1) they are specific, targeted, and smaller scale than full genome annotation algorithms, and (2) they do not require as much input data. This should result in better prediction power on protein classes with fewer known examples. Chapter Five will investigate the legitimacy of this claim through two real-life examples.

5 Model Building with the Woolf Pipeline

5.1 Introduction

The purpose of the Woolf Classifier Building Tool is to provide researchers with a simple, intuitive window into the power of machine learning without requiring an extensive machine learning background. To this end, the tool was created to be easy to use, with defaults that help avoid common machine learning mistakes, while still providing the flexibility to model many different types of protein functions. To test how well the tool works, this chapter highlights several comparisons of the different accuracy measures and scaling types, as well as how varying hyperparameter ranges and cross-validation folds affect the models created by the tool. The chapter then finishes with an example application of the tool to the biological problem outlined in Chapter Two.

5.2 β-Lactamases for Woolf Model Testing

To test the functionality of the Woolf model in a well-defined setting, this section describes models made with genetic sequences for the antibiotic resistance gene β-lactamase. This gene is used by a wide variety of bacteria to evade the toxic effect of β-lactam type antibiotics such as penicillin, ampicillin, other penicillin family antibiotics, and cephalosporins [106]. Penicillin, the first drug ever developed to treat bacterial infection, and penicillin-like drugs have been used to treat a wide variety of bacterial diseases since the 1930s [107]. Their long use history has allowed bacteria to evolve a number of different strategies to avoid β-lactams' toxic effects [107]. These strategies are encoded in the bacteria's DNA as a diverse set of antibiotic resistance genes, called β-lactamases, that are classified into 4 types based on the way they function [108]. It is these 4 types that make the β-lactamase gene an interesting example for the Woolf model.

In this chapter, β-lactamases will be used to explain how a Woolf model can be used to classify protein function based on amino acid sequence alone, and to explain some of the design decisions that went into choosing particular parameters as defaults. Woolf models always produce binary classifiers, so the 4 types represented in β-lactamase gene datasets provide 10 possible binary comparisons between various types. In each case, the dataset representing one of the types is considered the positive class, while the negative class is made up of either a single alternate class or the sum of all the other classes (Figure 5.1). There are 6 one vs. one comparisons and 4 one vs. all. Each of these 10 binary comparisons provides different problems with different challenges, making the β-lactamase example a diverse test case for the Woolf model. This section will focus on 4 comparisons: A vs. Not A, A vs. B, A vs. C, and A vs. D. These four comparisons keep the A class constant while providing different challenges for each model by varying the negative class.


A vs. Not A: This is the most diverse comparison, and the most balanced: 2468 non-Type A instances of varying type against 2296 instances of Type A.

A vs. B: This comparison is the most distinct, because type B β-lactamases require divalent zinc ions for hydrolysis, unlike types A, C, and D, which use serine [109].

A vs. C: This is the most balanced of the three one vs. one comparisons, with 2296 sequences compared to 1598.

A vs. D: This comparison is fairly imbalanced, with 2296 vs. 616 sequences, and contains two types with similar functionality, so it will challenge the model to learn a difficult differentiation with non-ideal data.

Figure 5.1 The 4 β-Lactamase Class Comparisons Used to Test the Woolf Classification Pipeline. Each positive class (Type A, n = 2296; Type B, n = 254; Type C, n = 1598; Type D, n = 616) could be compared against a single alternate type or against all other types combined (Not Type A, Not Type B, Not Type C, Not Type D). The chosen comparisons (4 out of 10 possible) are highlighted in grey. The number of sequences in each type is given in parentheses.

5.3 Evaluating Algorithm Implementations

5.3.1 Accuracy Metrics

As explained in Chapter 4, the default parameters for a Woolf Classifier were chosen to focus models built in the Woolf Pipeline on the problem of binary protein differentiation, while allowing the flexibility to become fine-tuned to a specific problem. Models were trained with these default parameters to differentiate between the four Type A comparisons using three of the accuracy metrics provided by the Woolf Pipeline: accuracy, f1-measure, and Matthews Correlation Coefficient (MCC) (Figure 5.3.1).

[Figure 5.3.1 panels: bar charts of percentage accuracy (A), f1-measure (B), and Matthews Correlation Coefficient (C) for each comparison, with the optimized hyperparameter values (k for kNN; t and l for random forest) at the base of each column and the Kumar et al. SVM score shown alongside for comparison.]

Figure 5.3.1 Three Accuracy Metrics to Evaluate Woolf Classifiers. Woolf Classifiers were built using either a kNN (orange) or random forest (blue) algorithm with the Woolf defaults for scaling (Min-Max) and cross-validation (5-fold). Accuracy metrics are percentage accuracy (A), f1-measure (B), and Matthews Correlation Coefficient (C). Each accuracy metric is compared to scores reported by Kumar et al. (2015) in their study of SVMs (Section 3.3.2) for the same comparisons (green) [6]. The hyperparameter ranges tested were number of neighbors (k) from 1-20, number of trees (t) from 1-15 (odd values only), and minimum samples per leaf (l) from 10-30 (every third value). The numbers at the base of each column indicate the optimized hyperparameter settings.

Accuracy, the f1-measure, and MCC each emphasize a different aspect of the classifier, resulting in different scores and different comparisons across the models (Figure 5.3.1). Accuracy scores tend to be about 8 percentage points higher (and f1-measure about 0.07 higher) in the one vs. one comparisons than in the one vs. all comparison, and both metrics seem to distinguish A vs B and A vs D a little better than A vs C. This divide is interesting because Type A, with 2296 samples, has more instances than any other type: there are only 254 Type B samples, 1598 Type C samples, and 616 Type D samples.

In these models, the different biases inherent in the accuracy calculation and the f1-measure calculation affect the results in a similar way. Accuracy is biased towards the larger class, in these cases Class A, while the f1-measure is biased towards the positive class, in this case also Class A. The two comparisons with the highest accuracy were A vs. B and A vs. D, the ones with the most class size imbalance. When the positive class is also the most abundant, the f1-measure based models are biased in a similar way to accuracy.

This hypothesis is further supported by the fact that the pattern breaks down with MCC. MCC scores are about 0.1 lower on A vs. Not A than on the other comparisons, possibly because there is more variation that the model has to learn, but MCC-trained models perform well on A vs. C, on which the accuracy and f1-measure based models made more mistakes (Figure 5.3.1). The MCC-trained model generally predicted least well on A vs. B. This could be an indication that because the MCC is built to accurately measure performance on small unbalanced datasets [46], the lower MCC score reflects the inherent difficulty of building such a classifier, in contrast to the inflated high scores offered by the accuracy and f1-measure models.

Another interesting observation is the pattern in the best hyperparameter values. Despite differences in how well they classified, across all three accuracy metrics the models landed on very similar hyperparameter values (k=2,4,3,3 t≈9,9,9,13 l=10,10,13,10 for A vs Not A, A vs B, A vs C, and A vs D respectively). This suggests that the hyperparameter choice is based more on the inherent differences in the classification task than on the accuracy metric used to train the model. While there was a little variation between models in the optimized number of trees per random forest, in the case of kNNs all the accuracy metrics suggested the same k values for each comparison. Higher values of k generally indicate that there is a clearer separation between the classes, and in this case low k values were favored in the kNN models, particularly for the A vs Not A comparison, where there was the greatest variation in the negative class. The highest k value was chosen for the A vs. B comparison, which matches the biological observation that Type B is the most distinct from Types A, C, and D because it uses a different co-factor for hydrolysis [109].

However, beyond all the differences in model scores, the overall pattern is clear: the Woolf-built classifiers perform extremely well. Neither kNNs nor random forests perform

clearly better than the other, but both are well above the scores reported by Kumar et al. for their SVM-based β-lactamase classifier, which scored only a 60% accuracy and a 0.18 MCC [6], compared to Woolf Classifiers that were all over 90% accurate, and MCC-trained models with MCC over 0.75. These numbers were generated with default values, indicating that the Woolf tool itself is well equipped to create protein differentiation models out-of-the-box without much modification. This is the goal of the tool: to provide an easy way to classify new genes and uncover potential hypotheses for researchers to investigate further.

5.3.2 Scaling Types

[Figure 5.3.2 panel: bar chart of MCC for the four comparisons (A vs Not A, A vs B, A vs C, A vs D) under each scaling option (None, MinMaxScaler, StandardScaler, MaxAbsScaler, RobustScaler), with the optimal k (values between 1 and 4) shown above each column.]

Figure 5.3.2 Effect of Scaling on kNN-Based Model MCC. Woolf Classifiers were built using data scaled in five different ways: no scaling, Min-Max Scaling, Standard Scaling, Max-Abs Scaling, and Robust Scaling. All models used a kNN algorithm, MCC as an accuracy metric, and 5-fold cross-validation. Reported MCCs are the highest across k values tested from 1-20, with the optimal k indicated at the top of each column.

There are four scaler types available to scale the data before use in kNN based models. Figure 5.3.2 compares models based on these four scalers with models created without scaling. Scaling improves the MCC score of all the models by about 0.1, but the choice of scaler does not vary the score by more than ~0.02. All of the models performed better than the 0.18 MCC score achieved in the latest published work [6], so comparisons between these scores are relative, but if any of the scalers were revealing patterns in the input data they should perform better by at least a few points. If the dataset contained more outliers, for example, there should be a benefit to using the Robust Scaler. If the data were sparser, the Min-Max and Max-Abs Scalers would show improvement over the Standard Scaler. One interesting difference is the k values for each distribution. Without scaling, most of the k values are extremely low, signifying that the features did not do a good job of spreading the data. The highest k values were for the Min-Max Scaler, indicating that this scaler did a good job of influencing the spread of the features to make the

classes distinct. This observation supports the choice of the Min-Max Scaler as the default for the Woolf Classifier Building tool.

5.3.3 Hyperparameter Values

[Figure 5.3.3 panels: MCC (plotted from 0.7 to 1.0) for the four comparisons (A vs Not A, A vs B, A vs C, A vs D) as a function of the number of neighbors k (A), the number of trees t (B), and the minimum samples per leaf (C).]

Figure 5.3.3 Effect of Hyperparameters on MCC. Woolf Classifiers were built using Min-Max Scaling, MCC-based accuracy calculation, and 5-fold cross-validation. Reported MCCs are for all hyperparameters tested. kNN models (A) were tested across k values from 1-20. Random forest models were tested using 1-13 trees with a minimum of 10 samples per leaf (B), and 10-30 minimum samples per leaf with 13 trees (C).

The effect of hyperparameter values on the MCC of default Woolf models is shown in Figure 5.3.3. While there is not much variation, probably due to the large size of the training data, there are a few interesting trends. In all the kNN based models, the accuracy of the model drops as k increases, but while most fall by only 0.01-0.03 in MCC, the A vs B comparison drops by about 0.08. This could be because these two classes are more distinct, so there is more space in the multi-feature plane between the two categories, but once k reaches higher numbers that space can be crossed and the results become less accurate. For random forest algorithms, the Woolf model lets users adjust two hyperparameters. Increasing the number of trees and decreasing the minimum samples per leaf generally increases the accuracy of the model, but users must keep in mind that a model with too many trees, or one with very few samples per leaf, could easily become overfitted, learning the training data without any power to predict on test data. The slope of the line of increasing accuracy with more trees tends to decrease sharply around 5-7 trees, suggesting that this could be a good range in which to keep models.

5.4 Detecting Anaerobic Manganese Oxidation Genes

The results highlighted in Section 5.3 give some justification for the default parameters chosen for the Woolf Classifier Building Pipeline; however, β-lactamase genes are already well studied, and previous authors have come up with ways to differentiate between them computationally [6, 74, 110]. To test the efficacy of the Woolf tool on novel datasets, Woolf Classifiers were built to address the biological question described in Chapter Two: identifying genes in C. limicola or G. lovleyi that might allow these two organisms to anaerobically produce manganese oxides when grown together.

The classifier built to address this question was created based on the family of manganese oxidation genes known as peroxidases. Peroxidases are an extremely broad class, only some of which bind and oxidize manganese as part of their function [111]. The dataset created contains 31 genes: 11 manganese peroxidases (Appendix B) and 20 non-manganese binding peroxidases (Appendix C). Of the 11 manganese peroxidases, 9 have been experimentally proven to bind manganese. Of the non-manganese peroxidases, 5 have direct evidence that they cannot bind manganese; of the other 15, all but 2 have experimental evidence of other types of peroxidase activity, and the last 2 are well annotated computationally and curated manually (see Appendix C for detailed information on the gene database). All 31 have annotated sequences in the online UniProt database, which is available to download for free [112].

Because this dataset is extremely small, some adjustments needed to be made from the Woolf defaults in order to create good models of the data. The optimal kNN and random forest models created with this dataset are described below.

5.4.1 kNN Model

The kNN model created from this peroxidase dataset is described below. Because of the small size of the dataset, there are not enough instances to test k values as high as the default 20, but as shown in Section 5.3.3 this should not be an issue for accuracy, as smaller k values tend to work better.

Inputs
  Algorithm: kNN
  Accuracy Metric: MCC
  Scaling: Min-Max Scaler
  Cross Validation Folds: 5
  Tested k Values: 1-5

Output
  MCC of Best Classifier: 0.71
  Best k Value: 1
  Misclassified Training Sequences: None

This model seems to be a good measure of peroxidase separation. The MCC value is high for such a small dataset, indicating good separation between the classes, and the final model misclassified none of the training sequences. The k value of 1 is expected, as it has been previously shown that 1NN is usually best for datasets with fewer than 100 instances [61]. Unfortunately, the lack of misclassified instances does not provide an opportunity to learn about which aspects of the sequences influence classification in either class, but the addition of more instances into the dataset, either for training or as a test dataset for prediction, could help add to the biological information.

5.4.2 Random Forest Model

The random forest algorithm is tricky to use on such a small dataset (Section 3.3.5). However, with careful attention to the input parameters to keep the number of hyperparameter validation tests low enough, a model can be created as shown below.

Inputs
  Algorithm: Random Forest
  Accuracy Metric: MCC
  Scaling: Min-Max Scaler
  Cross Validation Folds: 5
  Tested Minimum Samples per Leaf Values: 3-5

  Tested # of Trees: 3

Output
  MCC of Best Classifier: 0.66
  Best Minimum Samples per Leaf: 4
  Best # of Trees: 3
  Misclassified Training Sequences: Gene Dyp2

The MCC for the random forest model is lower than for the kNN model, but is still relatively high compared to MCCs for other published models, which can be as low as 0.18 [6]. Because of the small dataset, it was impossible to test a large number of hyperparameter values, as each additional split required more training data. For this reason, the model was only tested with 3 trees and minimum leaf sizes ranging from 3-6. Both numbers are low compared to the models presented earlier in this chapter: smaller datasets require both fewer trees to prevent over-fitting and smaller leaf sizes to allow enough flexibility in the algorithm. Users are guided through this decision process around hyperparameter choices in the user manual (Appendix A).

The one misclassified sequence from this classifier is extremely interesting. The Dyp2 gene (UniProt K7N5M8) can oxidize manganese, but is considered a multifunctional dye peroxidase because it can also perform non-manganese oxidizing peroxidase activity [113]. The random forest-based Woolf Classifier put this sequence into the non-manganese-binding category even though it has manganese binding functionality. This mistake, observable because of the error evaluation feature of the Woolf Pipeline, suggests that the model may be picking up on real biological features of these proteins to make its decisions.

5.4.3 Prediction of Green Lake Manganese Oxide Genes

As described in Chapter Two, the point of this peroxidase comparison model was to predict whether any peroxidases from C. limicola or G. lovleyi could possibly bind and oxidize manganese. To address this question, a set of novel peroxidase genes was generated from published genome sequences of these two organisms found on the Integrated Microbial Genomes (IMG) online database [102]. This dataset consisted of 13 genes: 6 from G. lovleyi and 7 from C. limicola. All were labeled as some kind of peroxidase by the high-level protein function annotation algorithm on the IMG server. These genes are a first set of candidates that might be able to oxidize manganese anaerobically if they are able to bind manganese.
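Loading such a candidate gene set from a FASTA file is the first practical step. Below is a minimal pure-Python reader, a simplified stand-in for the Biopython parsing the pipeline actually uses [87]; the sequences shown are invented.

```python
# Minimal FASTA reader (a stand-in for Biopython's SeqIO.parse).
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, seq = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)

# Invented example records; a real file would hold the 13 IMG-annotated genes.
example = """>geneA putative peroxidase
MKTAYIAKQR
>geneB putative peroxidase
MLSDEQ
KRRW"""
genes = dict(read_fasta(example))
print(len(genes))  # 2
```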

Of the 13 genes, none were predicted to be manganese binding by the kNN classifier, but two were predicted to be manganese binding by the random forest classifier (one gene from each of the organisms). The C. limicola thiol peroxidase gene (UniProt A0A124GAN8) is functionally annotated based on preliminary sequence data. The related gene in E. coli (UniProt P0A862) is better studied, and has been shown to act in the cytoplasm and periplasm of cells, as well as acting as a lipid peroxidase to inhibit bacterial membrane oxidation [114]. The G. lovleyi gene is a glutathione peroxidase (UniProt A0A0D5N9D7), and like the gene from C. limicola its function is only inferred computationally. Glutathione peroxidase genes are found in organisms across kingdoms, including animals, plants, fungi, and bacteria [112]. Both of these genes will be targets for future study of possible manganese-oxidizing peroxidases.
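The prediction step itself amounts to applying a trained classifier to feature vectors for the new, unlabeled genes. The sketch below uses synthetic stand-in data rather than the real sequence features; the hyperparameters echo the random forest model reported earlier in this chapter.

```python
# Sketch of predicting function for unlabeled candidate genes. Synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X_train = rng.random((40, 21))
y_train = rng.integers(0, 2, 40)   # 1 = manganese binding (invented labels)
X_new = rng.random((13, 21))       # 13 candidate Green Lake genes

clf = RandomForestClassifier(n_estimators=3, min_samples_leaf=4, random_state=0)
clf.fit(X_train, y_train)
predictions = clf.predict(X_new)   # one predicted class per candidate gene
print(predictions)
```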

Table 5.4.3 Predicted Manganese Binding Function of Green Lake Peroxidases.

Gene Name                    | Species     | kNN Result | Random Forest Result
-peroxidase                  | G. lovleyi  | -          | -
Glutathione peroxidase       | G. lovleyi  | -          | Manganese Binding
Thiol peroxidase             | G. lovleyi  | -          | -
Alkylhydroperoxidase         | G. lovleyi  | -          | -
Tpx thiol peroxidase         | G. lovleyi  | -          | -
katG catalase/peroxidase HPI | G. lovleyi  | -          | -
cytochrome-c peroxidase      | C. limicola | -          | -
thiol peroxidase             | C. limicola | -          | Manganese Binding
katG catalase/peroxidase HPI | C. limicola | -          | -
katG catalase-peroxidase     | C. limicola | -          | -
non- chloroperoxidase        | C. limicola | -          | -
Thiol peroxidase             | C. limicola | -          | -
                             | C. limicola | -          | -

5.5 Conclusion

This chapter has outlined real-life examples of how a Woolf model can be applied to biological data. The multiple classifiers built to differentiate between β-lactamase genes in Section 5.2 provide evidence supporting the selection of the default Woolf parameters. Section 5.3 shows how these defaults work on a real example dataset, and the ways in which they might have to be tweaked by users who are constrained by small dataset sizes. Overall, this chapter has shown that the Woolf model can be successfully used to build binary classifiers in a way that is usable for scientists interested in protein function.

6 Conclusions and Future Work

6.1 The Success of the Woolf Classifier Building Pipeline

The Woolf Classifier Building Pipeline presented in this thesis is a step toward filling the gap in protein function differentiation. Large-scale protein annotation programs like HMMer can provide initial, wide-sweeping protein annotations [75], and laboratory experiments will continue to establish the exact function of particular genes [115], but the evidence presented here suggests that Woolf Classifiers might be able to span the space between these two methods for investigating protein function.

On the datasets presented in this thesis, Woolf Classifiers were able to achieve high-accuracy functional predictions of the presence or absence of a given protein function. With larger datasets, like the β-lactamase gene set, Woolf Classifiers achieved an MCC above 0.9: 0.7 points above the prediction ability of any published model to date [6]. The β-lactamase models also provided justification for MCC and Min-Max scaling as default parameters. On smaller datasets, like the manganese oxidation example that motivated this thesis, the prediction scores fell between 0.6 and 0.7, lower than on the larger β-lactamase example but still above other published models, and, perhaps more importantly, the model was able to detect some real biological differences between the samples.
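For reference, the MCC quoted throughout is computed directly from a confusion matrix. A toy example with eight invented predictions (3 true positives, 3 true negatives, 1 false positive, 1 false negative):

```python
# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
# Here: (3*3 - 1*1) / sqrt(4*4*4*4) = 8/16 = 0.5
from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
print(round(matthews_corrcoef(y_true, y_pred), 2))  # 0.5
```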

These results indicate that the Woolf model can likely help researchers identify putative genes, and generate new hypotheses in a wide variety of fields related to protein function identification. While both the examples here focused on bacteria, the basic concept of protein function differentiation based on amino acid sequence is relevant across all living organisms, and such a tool could provide a useful perspective into problems as disparate as finding T-box transcription factors important for the proper development of the immune system [116], and determining why some regions of the ocean have high nutrient content but low biomass by investigating the metabolism genes of phytoplankton [117].

6.2 Improvements to the Woolf Tool

Although the Woolf tool is successful, it could still be improved in both functionality and usability. Feature selection is likely the biggest shortcoming: for a tool aimed at small datasets, the current version of the Woolf Classifier Building Pipeline uses a large number of features. As shown in Section 5.3, the current Woolf tool has difficulty producing reliable results on extremely small datasets, and one possible reason is the number of features used in the current version. The recommended feature-to-instance ratio is about 1/10, which would mean that a 21-feature Woolf model would ideally have at least 210 instances of training data [35]. To reduce the number of features, various methods of feature reduction have been

proposed and employed with good success on a number of problems [57, 118]. Feature reduction could help with the classification of small datasets.

Implementing feature selection in the model would not be difficult, and could provide more information to users about the biology behind the models. Scikit-learn, the Python package providing the machine learning foundation for the Woolf tool, has a feature selection package [119] that could be used to remove features with lower classification power, and possibly increase the performance of the pipeline on small datasets. Because the features pruned away would be the least informative, their removal could help scientists identify amino acids that are important for the protein function of interest.
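One possible shape for this improvement, sketched with scikit-learn's feature_selection module [119] on synthetic data; this is not the pipeline's code, and a real implementation would tune k and the scoring function.

```python
# Univariate feature selection: keep the k most informative of 21 features.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(3)
X = rng.random((50, 21))
y = rng.integers(0, 2, 50)

selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
X_reduced = selector.transform(X)
print(X_reduced.shape)         # (50, 5)
print(selector.get_support())  # boolean mask over the 21 original features
```

The mask from `get_support()` is what would let scientists see which amino acid features survived pruning.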

Once feature selection is implemented, a few more straightforward additions could also help improve the accuracy of the classifier. For instance, Chou’s Pseudo Amino Acid Composition could be used as the feature set instead of the basic combination of percentage amino acid composition and length [82]. Chou’s composition was not used in this version of the Woolf Pipeline in part because it adds more features, which would decrease classification performance on small datasets (Section 4.4.3). The addition of feature selection, however, would get around the issue of additional features, and could increase the accuracy of the predictions by focusing on the features that matter most.
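For comparison, the current feature set described above (the percentage of each of the 20 standard amino acids plus sequence length, 21 features in all) can be sketched in a few lines; this is an illustrative reconstruction, not the pipeline's featurization code.

```python
# The 21-dimensional basic feature set: 20 amino acid percentages + length.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def featurize(seq):
    """Return 20 amino-acid fractions followed by sequence length."""
    n = len(seq)
    return [seq.count(aa) / n for aa in AMINO_ACIDS] + [n]

features = featurize("MKTAYIAKQR")
print(len(features))  # 21
```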

A separate bias in the current pipeline is the tendency of accuracy and the f1-measure to favor larger positive classes (Section 5.2.1). Currently, the intrinsic properties of these two accuracy metrics hide this bias in a single score. One way to investigate further would be to build the models twice, once with the functional class as the positive class and once with the functional class as the negative class, and compare the f1-measure scores. This technique has been used in the past to determine whether classifiers are biased towards one class over another [40]. Another possibility would be to balance the instance distribution in the input datasets. Such tests were out of the scope of this thesis because most biological projects using the Woolf tool probably will not have enough data for distribution manipulation; however, with the β-lactamase gene set, it would be possible to manipulate the distributions of the input classes in various ways to see the effect of the distribution types. This information would be essential for modifying the tool to prevent this type of bias on smaller datasets.
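The proposed class-swap check could look like the following sketch, which scores the same toy predictions twice, once with each class treated as positive; a large gap between the two f1-measures suggests the metric favors the bigger class. The labels here are invented.

```python
# Score identical predictions with each class as the positive label.
from sklearn.metrics import f1_score

y_true = [1, 1, 1, 1, 1, 1, 0, 0]   # imbalanced: 6 positive, 2 negative
y_pred = [1, 1, 1, 1, 1, 0, 0, 1]

f1_pos = f1_score(y_true, y_pred, pos_label=1)
f1_neg = f1_score(y_true, y_pred, pos_label=0)
print(round(f1_pos, 2), round(f1_neg, 2))  # the gap reveals the class bias
```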

6.3 Future Applications

The additions to the Woolf Pipeline discussed in Section 6.2 would all help to refine the tool’s function, and would broaden the scope of cases in which the Woolf Pipeline is helpful, but the pipeline does not need those additions to be applied. With more rigorous testing on multiple datasets, the Woolf Pipeline has the potential to become a more broadly

used protein analysis tool. These features can then be added as they become important to the scientists using the tool for analysis. All of the code, including the instruction manual for this pipeline, will be publicly available on GitHub2 for common use, allowing researchers to use the Woolf Pipeline to bridge the protein function analysis divide and gain insights into biological protein function in a way not possible before.

2 https://github.com/afarrellsherman/Woolf

7 References

1 Crowe, S.A., Jones, C., Katsev, S., Magen, C., O'Neill, A.H., Sturm, A., Canfield, D.E., Haffner, G.D., Mucci, A., Sundby, B., and Fowle, D.A.: ‘Photoferrotrophs thrive in an Archean Ocean analogue’, Proc. Natl. Acad. Sci. USA, 2008, 105, (41), pp. 15938-15943 2 Planavsky, N.J., Asael, D., Hofmann, A., Reinhard, C.T., Lalonde, S.V., Knudsen, A., Wang, X., Ossa Ossa, F., Pecoits, E., Smith, A.J.B., Beukes, N.J., Bekker, A., Johnson, T.M., Konhauser, K.O., Lyons, T.W., and Rouxel, O.J.: ‘Evidence for oxygenic photosynthesis half a billion years before the Great Oxidation Event’, Nat Geosci, 2014, 7, (4), pp. 283-286 3 Daye, M., Klepac-Ceraj, V., Pajusalu, M., Rowland, S., Beukes, N., Tamura, N., Fournier, G., and Bosak, T.: ‘Light-Driven Anaerobic Microbial Oxidation of Manganese’, Science, In Review 4 Fetrow, J.S., and Babbitt, P.C.: ‘New computational approaches to understanding molecular protein function’, PLoS Comput Biol, 2018, 14, (4), pp. e1005756 5 Radivojac, P., Clark, W.T., Oron, T.R., Schnoes, A.M., Wittkop, T., Sokolov, A., Graim, K., Funk, C., Verspoor, K., Ben-Hur, A., Pandey, G., Yunes, J.M., Talwalkar, A.S., Repo, S., Souza, M.L., Piovesan, D., Casadio, R., Wang, Z., Cheng, J., Fang, H., Gough, J., Koskinen, P., Toronen, P., Nokso-Koivisto, J., Holm, L., Cozzetto, D., Buchan, D.W., Bryson, K., Jones, D.T., Limaye, B., Inamdar, H., Datta, A., Manjari, S.K., Joshi, R., Chitale, M., Kihara, D., Lisewski, A.M., Erdin, S., Venner, E., Lichtarge, O., Rentzsch, R., Yang, H., Romero, A.E., Bhat, P., Paccanaro, A., Hamp, T., Kassner, R., Seemayer, S., Vicedo, E., Schaefer, C., Achten, D., Auer, F., Boehm, A., Braun, T., Hecht, M., Heron, M., Honigschmid, P., Hopf, T.A., Kaufmann, S., Kiening, M., Krompass, D., Landerer, C., Mahlich, Y., Roos, M., Bjorne, J., Salakoski, T., Wong, A., Shatkay, H., Gatzmann, F., Sommer, I., Wass, M.N., Sternberg, M.J., Skunca, N., Supek, F., Bosnjak, M., Panov, P., Dzeroski, S., Smuc, T., Kourmpetis, Y.A., van Dijk, A.D., ter 
Braak, C.J., Zhou, Y., Gong, Q., Dong, X., Tian, W., Falda, M., Fontana, P., Lavezzo, E., Di Camillo, B., Toppo, S., Lan, L., Djuric, N., Guo, Y., Vucetic, S., Bairoch, A., Linial, M., Babbitt, P.C., Brenner, S.E., Orengo, C., Rost, B., Mooney, S.D., and Friedberg, I.: ‘A large-scale evaluation of computational protein function prediction’, Nat. Methods, 2013, 10, (3), pp. 221-227 6 Kumar, R., Srivastava, A., Kumari, B., and Kumar, M.: ‘Prediction of beta-lactamase and its class by Chou's pseudo-amino acid composition and support vector machine’, J. Theor. Biol., 2015, 365, pp. 96-103 7 Liang, M.C., Hartman, H., Kopp, R.E., Kirschvink, J.L., and Yung, Y.L.: ‘Production of hydrogen peroxide in the atmosphere of a Snowball Earth and the origin of oxygenic photosynthesis’, Proceedings of the National Academy of Sciences, 2006, 103, (50), pp. 18896-18899 8 Oze, C., Sleep, N.H., Coleman, R.G., and Fendorf, S.: ‘Anoxic oxidation of chromium’, 2016, 44, (7), pp. 543-546 9 Tebo, B.M., Bargar, J.R., Clement, B.G., Dick, G.J., Murray, K.J., Parker, D., Verity, R., and Webb, S.M.: ‘BIOGENIC MANGANESE OXIDES: Properties and Mechanisms of Formation’, Annual Review of Earth and Planetary Sciences, 2004, 32, (1), pp. 287-328 10 Berkner, L.V., and Marshall, L.C.: ‘On the Origin and Rise of Oxygen Concentration in the Earth's Atmosphere’, J Atmos Sci, 1965, 22, (3), pp. 225-261

11 Kump, L.R.: ‘The rise of atmospheric oxygen’, Nature, 2008, 451, (7176), pp. 277 12 Johnson, J.E., Webb, S.M., Thomas, K., Ono, S., Kirschvink, J.L., and Fischer, W.W.: ‘Manganese-oxidizing photosynthesis before the rise of cyanobacteria’, Proc. Natl. Acad. Sci. USA, 2013, 110, (28), pp. 11238-11243 13 Tebo, B.M., Johnson, H.A., McCarthy, J.K., and Templeton, A.S.: ‘Geomicrobiology of manganese (II) oxidation’, Trends Microbiol., 2005, 13, (9), pp. 421-428 14 Allen, J.F., and Martin, W.: ‘Out of thin air’, Nature, 2007, 445, pp. 610 15 Dismukes, G.C., Klimov, V.V., Baranov, S.V., Kozlov, Y.N., DasGupta, J., and Tyryshkin, A.: ‘The origin of atmospheric oxygen on Earth: the innovation of oxygenic photosynthesis’, P Natl Acad Sci USA, 2001, 98, (5), pp. 2170-2175 16 Zubay, G.L.: ‘Origins of life on the earth and in the cosmos’ (Academic Press, c2000, 1996, 2nd ed edn. 1996) 17 Fischer, W.W., Hemp, J., and Johnson, J.E.: ‘Manganese and the Evolution of Photosynthesis’, Origins Life Evol B, 2015, 45, (3), pp. 351-357 18 http://minerals.gps.caltech.edu/FILES/DENDRITE/Index.html 19 Havig, J.R., McCormick, M.L., Hamilton, T.L., and Kump, L.R.: ‘The behavior of biologically important trace elements across the oxic/euxinic transition of meromictic Fayetteville Green Lake, New York, USA’, Geochim. Cosmochim. Acta, 2015, 165, pp. 389-406 20 Li, G.-W., and Xie, X.S.: ‘Central dogma at the single-molecule level in living cells’, Nature, 2011, 475, (7356), pp. 308 21 Liu, L., Li, Y., Li, S., Hu, N., He, Y., Pong, R., Lin, D., Lu, L., and Law, M.: ‘Comparison of next-generation sequencing systems’, BioMed Research International, 2012, 2012 22 Clark, W.T., and Radivojac, P.: ‘Analysis of protein function and its prediction from amino acid sequence’, Wiley, 2011, 79, (7), pp. 
2086-2096 23 Das, S., Sillitoe, I., Lee, D., Lees, J.G., Dawson, N.L., Ward, J., and Orengo, C.A.: ‘CATH FunFHMMer web server: protein functional annotations using functional family assignments’, Nucleic Acids Res., 2015, 43, (W1), pp. W148-153 24 Loewe, L.: ‘Genetic mutation’, Nat Education, 2008, 1, (1), pp. 113 25 Prakash, A., Jeffryes, M., Bateman, A., and Finn, R.D.: ‘The HMMER Web Server for Protein Sequence Similarity Search’, Curr Protoc Bioinformatics, 2017, 60, pp. 3 15 11- 13 15 23 26 Pandey, G., Myers, C.L., and Kumar, V.: ‘Incorporating functional inter-relationships into protein function prediction algorithms’, 2009, 10, (1), pp. 142 27 Dutton, D.M., and Conroy, G.V.: ‘A Review of Machine Learning’, Knowledge of Engineering Review, 1996, 12:4, pp. 341-367 28 Carbonell, J.G., Michalski, R.S., and Mitchell, T.M.: ‘An Overview of Machine Learning’ (Elsevier, 1983), pp. 3-23 29 Sebastiani, F.: ‘Machine learning in automated text categorization’, ACM Computing Surveys, 2002, 34, (1), pp. 1-47 30 Oren, T., and Ghasem-Aghaee, N.: ‘Personality representation processable in fuzzy logic for human behavior simulation’, in Editor (Ed.)^(Eds.): ‘Book Personality representation processable in fuzzy logic for human behavior simulation’ (Citeseer, 2003, edn.), pp. 11- 18

31 Mohammed, A.A., Minhas, R., Jonathan Wu, Q.M., and Sid-Ahmed, M.A.: ‘Human face recognition based on multidimensional PCA and extreme learning machine’, 2011, 44, (10-11), pp. 2588-2597 32 Zhao, W., Chellappa, R., Phillips, P.J., and Rosenfeld, A.: ‘Face Recognition: A Literature Survey’, ACM Computing Surveys, 2003, 35, (4), pp. 399-458 33 Kotsiantis, S.B., Zaharakis, I.D., and Pintelas, P.E.: ‘Machine learning: a review of classification and combining techniques’, Artif Intel Review, 2007, 26, (3), pp. 159-190 34 Weiss, S., Van Treuren, W., Lozupone, C., Faust, K., Friedman, J., Deng, Y., Xia, L.C., Xu, Z.Z., Ursell, L., Alm, E.J., Birmingham, A., Cram, J.A., Fuhrman, J.A., Raes, J., Sun, F., Zhou, J., and Knight, R.: ‘Correlation detection strategies in microbial data sets vary widely in sensitivity and precision’, ISME J, 2016, 10, (7), pp. 1669-1681 35 Chicco, D.: ‘Ten quick tips for machine learning in computational biology’, BioData Min, 2017, 10, (1) 36 Ortigosa-Hernandez, J., Inza, I., and Lozano, J.A.: ‘Towards Competitive Classifiers for Unbalanced Classification Problems: A Study on the Performance Scores’, 2016 37 Japkowicz, N.: ‘Learning from Imbalanced Data Sets: A Comparison of Various Strategies’, AAAI Technical Report, 2000, WS-00-05, pp. 10-15 38 Kubat, M., and Matwin, S.: ‘Addressing the Curse of Imbalanced Training Sets: One-Sided Selection’, Proceedings of the Fourteenth International Conference on Machine Learning, 1997 39 Chawla, N.V., Bowyer, K.W., Hall, L.O., and Kegelmeyer, W.P.: ‘SMOTE: Synthetic Minority Over-sampling Technique’, J Artif Intell Res, 2002, 16, pp. 341-378 40 Japkowicz, N., and Shah, M.: ‘Evaluating Learning Algorithms: A Classification Perspective’ (Cambridge University Press, 2011. 
2011) 41 Liu, X.-Y., and Zhou, Z.-H.: ‘The influence of class imbalance on cost-sensitive learning: An empirical study.’, in Editor (Ed.)^(Eds.): ‘Book The influence of class imbalance on cost-sensitive learning: An empirical study.’ (2006, edn.), pp. 970-974 42 Ortigosa-Hernández, J., Inza, I., and Lozano, J.A.: ‘Measuring the class-imbalance extent of multi-class problems’, Pattern Recogn Lett, 2017, 98, pp. 32-38 43 Powers, D.M.: ‘Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation’, 2011 44 Goutte, C., and Gaussier, E.: ‘A probabilistic interpretation of precision, recall and F- score, with implication for evaluation’, in Editor (Ed.)^(Eds.): ‘Book A probabilistic interpretation of precision, recall and F-score, with implication for evaluation’ (Springer, 2005, edn.), pp. 345-359 45 Matthews, B.W.: ‘Comparison of the predicted and observed secondary structure of T4 phage lysozyme’, Biochimica et Biophysica Acta (BBA)-Protein Structure, 1975, 405, (2), pp. 442-451 46 Boughorbel, S., Jarray, F., and El-Anbari, M.: ‘Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric’, PLoS One, 2017, 12, (6), pp. e0177678 47 Raghuwanshi, B.S., and Shukla, S.: ‘Class-specific extreme learning machine for handling binary class imbalance problem’, Neural Networks, 2018, 105, pp. 206-217 48 Lorena, A.C., De Carvalho, A.C.P.L.F., and Gama, J.M.P.: ‘A review on the combination of binary classifiers in multiclass problems’, Artificial Intelligence Review, 2008, 30, (1- 4), pp. 19-37

49 García-Pedrajas, N., and Ortiz-Boye, D.: ‘An empirical study of binary classifier fusion methods for multiclass classification’, Inform Fusion, 2010, 12, pp. 111-130 50 Har-Peled, S., Roth, D., and Zimak, D.: ‘Constraint Classification: A New Approach to Multiclass Classification’, in Numao, N.C.-B.M., and Reischuk, R. (Eds.): ‘Algorithmic Learning Theory’ (Springer Berlin Heidelberg, 2002), pp. 365-379 51 Weston, J., and Watkins, C.: ‘Support Vector Machines for Multi-Class Pattern Recognition’, European Symposium on Artificial Neural Networks, 1999, 1999 Proceedings, pp. 219-222 52 Dietterich, T.G., and Bakiri, G.: ‘Solving Multiclass Learning Problems via Error-Correcting Output Codes’, J Artif Intell Res, 1995, 2, pp. 263-286 53 Hastie, T., and Tibshirani, R.: ‘Classification by Pairwise Coupling’, NIPS-10, The 1997 Conference on Advances in Neural Information Processing Systems, 1998, pp. 507–513 54 Rifkin, R., and Klautau, A.: ‘In Defense of One-Vs-All Classification’, J Mach Learn Res, 2004, 5, pp. 101-141 55 Kotsiantis, S.B., and Pintelas, P.E.: ‘Increasing the classification accuracy of simple bayesian classifier’, in Editor (Ed.)^(Eds.): ‘Book Increasing the classification accuracy of simple bayesian classifier’ (Springer, 2004, edn.), pp. 198-207 56 Lacey, A.: ‘Supervised Machine Learning in Bioinformatics: Protein Classification’, Swansea University, 2014 57 Feng, P.M., Ding, H., Chen, W., and Lin, H.: ‘Naive Bayes classifier with feature selection to identify phage virion proteins’, Comput Math Methods Med, 2013, 2013, pp. 530696 58 Sutskever, I., Vinyals, O., and Le, Q.V.: ‘Sequence to sequence learning with neural networks’, in Editor (Ed.)^(Eds.): ‘Book Sequence to sequence learning with neural networks’ (2014, edn.), pp. 3104-3112 59 Zhang, G.P.: ‘Neural networks for classification: a survey’, IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews), 2000, 30, (4), pp. 
451-462 60 Yong, Z., Youwen, L., and Shixiong, X.: ‘An improved KNN text classification algorithm based on clustering’, J Compt, 2009, 4, (3), pp. 230-237 61 Wettschereck, D., Aha, D.W., and Mohri, T.: ‘A review and empirical evaluation of feature weighting methods for a class of lazy learning algorithms’, Artificial Intelligence Review, 1997, 11, (1-5), pp. 273-314 62 Horton, P., and Nakai, K.: ‘Better Prediction of Protein Cellular Localization Sites with the k Nearest Neighbor Classifier’, ISMB Proceedings: American Association for Artificial Intelligence, 1997 63 Lan, L., Djuric, N., Guo, Y., and Vucetic, S.: ‘MS-kNN: protein function prediction by integrating multiple data sources’, BMC Bioinformatics, 2013, 14 Suppl 3 64 Liaw, A., and Wiener, M.: ‘Classification and regression by randomForest’, R news, 2002, 2, (3), pp. 18-22 65 Boulesteix, A.L., Janitza, S., Kruppa, J., and König, I.R.: ‘Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics’, Wires Data Min Knowl, 2012, 2, (6), pp. 493-507 66 Jiang, P., Wu, H., Wang, W., Ma, W., Sun, X., and Lu, Z.: ‘MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features’, Nucleic Acids Res., 2007, 35, (suppl_2), pp. W339-W344

67 Tarca, A.L., Carey, V.J., Chen, X.-w., Romero, R., and Drăghici, S.: ‘Machine learning and its applications to biology’, PLoS Comput Biol, 2007, 3, (6), pp. e116 68 Noble, D.: ‘The rise of computational biology’, Nat Rev Mol Cell Bio, 2002, 3, (6), pp. 459-463 69 Tzanis, G., Berberidis, C., Alexandridou, A., and Vlahavas, I.: ‘Improving the Accuracy of Classifiers for the Prediction of Translation Initiation Sites in Genomic Sequences’, in Bozanis, P., and Houstis, E.N. (Eds.): ‘Advances in Informatics’ (Springer, 2005), pp. 426- 70 Busa-Fekete, R., Kocsor, A., and Pongor, S.: ‘Tree-Based Algorithms for Protein Classification’, in Kelemen, A., Abraham, A., and Chen, Y. (Eds.): ‘Computational Intelligence in Bioinformatics’ (Springer, 2008), pp. 164-182 71 Tung, C.-W., and Ho, S.-Y.: ‘Computational identification of ubiquitylation sites from protein sequences’, BMC Bioinformatics, 2008, 9, (1), pp. 310 72 Leslie, C., Eskin, E., and Noble, W.S.: ‘The Spectrum Kernel: A String Kernel for SVM Protein Classification’, in Editor (Ed.)^(Eds.): ‘Book The Spectrum Kernel: A String Kernel for SVM Protein Classification’ (2002, edn.), pp. 564- 73 Fleischmann, R.D., Adams, M.D., White, O., Clayton, R.A., Kirkness, E.F., Kerlavage, A.R., Bult, C.J., Tomb, J.-F., Dougherty, B.A., Merrick, J.M., McKenney, K., Sutton, G., FitzHugh, W., Fields, C., Gocayne, J.D., Scott, J., Shirley, R., Liu, L.-I., Glodek, A., Kelley, J.M., Weidman, J.F., Phillips, C.A., Spriggs, T., Hedblom, E., Cotton, M.D., Utterback, T.R., Hanna, M.C., Nguyen, D.T., Saudek, D.M., Brandon, R.C., Fine, L.D., Fritchman, J.L., Fuhrmann, J.L., Geoghagen, N.S.M., Gnehm, C.L., McDonald, L.A., Small, K.V., Fraser, C.M., Smith, H., and Venter, J.C.: ‘Whole-Genome Random Sequencing and Assembly of Haemophilus Influenzae Rd’, Science, 1995, 269, (5223), pp. 
496-512 74 Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J.: ‘Gapped BLAST and PSI-BLAST: a new generation of protein database search programs’, Nucleic Acids Res., 1997, 25, pp. 3389-3402 75 Eddy, S.R.: ‘Multiple alignment using hidden Markov models’, in Editor (Ed.)^(Eds.): ‘Book Multiple alignment using hidden Markov models’ (1995, edn.), pp. 114-120 76 Pruitt, K.D., Harrow, J., Harte, R.A., Wallin, C., Diekhans, M., Maglott, D.R., Searle, S., Farrell, C.M., Loveland, J.E., Ruef, B.J., Hart, E., Suner, M.M., Landrum, M.J., Aken, B., Ayling, S., Baertsch, R., Fernandez-Banet, J., Cherry, J.L., Curwen, V., Dicuccio, M., Kellis, M., Lee, J., Lin, M.F., Schuster, M., Shkeda, A., Amid, C., Brown, G., Dukhanina, O., Frankish, A., Hart, J., Maidak, B.L., Mudge, J., Murphy, M.R., Murphy, T., Rajan, J., Rajput, B., Riddick, L.D., Snow, C., Steward, C., Webb, D., Weber, J.A., Wilming, L., Wu, W., Birney, E., Haussler, D., Hubbard, T., Ostell, J., Durbin, R., and Lipman, D.: ‘The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes’, 2009, 19, (7), pp. 1316-1323 77 Cock, P.J.A., Fields, C.J., Goto, N., Heuer, M.L., and Rice, P.M.: ‘The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants’, Nucleic Acids Res., 2010, 38, (6), pp. 1767-1771 78 Shen, W., Le, S., Li, Y., and Hu, F.: ‘SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation’, PLoS One, 2016, 11, (10), pp. e0163962 79 http://sequenceconversion.bugaco.com/converter/biology/sequences/, accessed April 5th 2019

80 Cedano, J., Aloy, P., Pérez-Pons, J.A., and Querol, E.: ‘Relation between amino acid composition and cellular location of proteins’, J. Mol. Biol., 1997, 266, (3), pp. 594-600 81 Nakashima, H., and Nishikawa, K.: ‘Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies’, J. Mol. Biol., 1994, 238, (1), pp. 54-61 82 Chou, K.-C.: ‘Prediction of Protein Cellular Attributes Using Pseudo-Amino Acid Composition’, Proteins, 2001, 43, pp. 246–255 83 Li, H.: ‘Using the BioSeqClass Package’, in Editor (Ed.)^(Eds.): ‘Book Using the BioSeqClass Package’ (Shanghai Institutes for Biological Sciences, 2018, edn.), pp. 84 Bollegala, D.: ‘Dynamic feature scaling for online learning of binary classifiers’, Knowl-Based Syst, 2017, 129, pp. 97-105 85 https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html, accessed 2019 86 Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E.: ‘Scikit-learn: Machine Learning in Python’, J Mach Learn Res, 2011, 12, pp. 2825-2830 87 Cock, P.J.A., Antao, T., Chang, J.T., Chapman, B.A., Cox, C.J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., and De Hoon, M.J.L.: ‘Biopython: freely available Python tools for computational molecular biology and bioinformatics’, Bioinformatics, 2009, 25, (11), pp. 1422-1423 88 Norman, D.A., and Draper, S.W.: ‘User centered system design: New perspectives on human-computer interaction’ (CRC Press, 1986. 1986) 89 Seemann, T.: ‘Ten recommendations for creating usable bioinformatics command line software’, GigaScience, 2013, 2, (1), pp. 15 90 Carver, T., and Bleasby, A.: ‘The design of Jemboss: a graphical user interface to EMBOSS’, Bioinformatics, 2003, 19, (14), pp. 
1837-1843 91 Siso-Nadal, F., Ollivier, J.F., and Swain, P.S.: ‘Facile: a command-line network compiler for systems biology’, BMC systems biology, 2007, 1, (1), pp. 36 92 Rice, P., Longden, I., and Bleasby, A.: ‘EMBOSS: the European molecular biology open software suite’, Trends Genet., 2000, 16, (6), pp. 276-277 93 Edgar, R.C.: ‘MUSCLE: a multiple sequence alignment method with reduced time and space complexity’, BMC Bioinformatics, 2004, 5, (1), pp. 113 94 Guindon, S., Dufayard, J.-F., Lefort, V., Anisimova, M., Hordijk, W., and Gascuel, O.: ‘New Algorithms and Methods to Estimate Maximum-Likelihood Phylogenies: Assessing the Performance of PhyML 3.0’, Syst. Biol., 2010, 59, (3), pp. 307-321 95 Katoh, K., and Standley, D.M.: ‘MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability’, Mol. Biol. Evol., 2013, 30, (4), pp. 772- 780 96 Caporaso, J.G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F.D., Costello, E.K., Fierer, N., Pena, A.G., Goodrich, J.K., and Gordon, J.I.: ‘QIIME allows analysis of high-throughput community sequencing data’, Nat. Methods, 2010, 7, (5), pp. 335 97 Schloss, P.D., Westcott, S.L., Ryabin, T., Hall, J.R., Hartmann, M., Hollister, E.B., Lesniewski, R.A., Oakley, B.B., Parks, D.H., and Robinson, C.J.: ‘Introducing mothur:

open-source, platform-independent, community-supported software for describing and comparing microbial communities’, Appl. Environ. Microbiol., 2009, 75, (23), pp. 7537-7541 98 Dobson, C.M.: ‘Protein folding and misfolding’, Nature, 2003, 426, (6968), pp. 884 99 https://www.compoundchem.com/2014/09/16/aminoacids/, accessed 2019 100 Aksoy, S., and Haralick, R.M.: ‘Feature normalization and likelihood-based similarity measures for image retrieval’, Pattern Recogn Lett, 2001, 22, pp. 563-582 101 https://scikit-learn.org/stable/modules/preprocessing.html, accessed 2019 102 Almlöf, J.C., Alexsson, A., Imgenberg-Kreuz, J., Sylwan, L., Bäcklin, C., Leonard, D., Nordmark, G., Tandre, K., Eloranta, M.-L., Padyukov, L., Bengtsson, C., Jönsen, A., Dahlqvist, S.R., Sjöwall, C., Bengtsson, A.A., Gunnarsson, I., Svenungsson, E., Rönnblom, L., Sandling, J.K., and Syvänen, A.-C.: ‘Novel risk genes for systemic lupus erythematosus predicted by random forest classification’, Sci Rep - UK, 2017, 7, (1) 103 Melvin, I., Ie, E., Weston, J., Noble, W.S., and Leslie, C.: ‘Multi-class protein classification using adaptive codes’, J Mach Learn Res, 2007, 8, (Jul), pp. 1557-1581 104 Linial, M.: ‘How incorrect annotations evolve – the case of short ORFs’, Trends Biotechnol., 2003, 21, (7), pp. 298-300 105 Alberts, B., Johnson, A., and Lewis, J.: ‘ Analyzing Protein Structure and Function’: ‘Mol Biol Cell’ (Garland Science, 2002) 106 Holten, K., and Onusko, E.: ‘Appropriate Prescribing of Oral Beta-Lactam Antibiotics’, Am. Fam. Physician, 2000, 62, (3), pp. 611-620 107 Kong, K.-F., Schneper, L., and Mathee, K.: ‘Beta-lactam antibiotics: from antibiosis to resistance and bacteriology’, APMIS, 2010, 118, (1), pp. 1-36 108 Shaikh, S., Fatima, J., Shakil, S., Rizvi, S.M.D., and Kamal, M.A.: ‘Antibiotic resistance and extended spectrum beta-lactamases: Types, epidemiology and treatment’, Saudi J Biol Sci, 2015, 22, (1), pp. 
90-101 109 Bush, K., and Jacoby, G.A.: ‘Updated functional classification of β-lactamases’, Antimicrob Agents Ch, 2010, 54, (3), pp. 969-976 110 Lee, D., Das, S., Dawson, N.L., Dobrijevic, D., Ward, J., and Orengo, C.: ‘Novel Computational Protocols for Functionally Classifying and Characterising Serine Beta-Lactamases’, PLoS Comput Biol, 2016, 12, (6), pp. e1004926 111 Gettemy, J.M., Ma, B., Alic, M., and Gold, M.H.: ‘Reverse transcription-PCR analysis of the regulation of the manganese peroxidase gene family’, Appl. Environ. Microbiol., 1998, 64, (2), pp. 569-574 112 Consortium, U.: ‘UniProt: a hub for protein information’, Nucleic Acids Res., 2014, 43, (D1), pp. D204-D212 113 Brown, M.E., Barros, T., and Chang, M.C.: ‘Identification and characterization of a multifunctional dye peroxidase from a lignin-reactive bacterium’, ACS Chem Biol, 2012, 7, (12), pp. 2074-2081 114 Baker, L.M., and Poole, L.B.: ‘Catalytic Mechanism of Thiol Peroxidase from Escherichia coli: Sulfenic Acid Formation and Overoxidation of Essential CYS61’, J. Biol. Chem., 2003, 278, (11), pp. 9203-9211 115 Wertz, I.E., O'rourke, K.M., Zhou, H., Eby, M., Aravind, L., Seshagiri, S., Wu, P., Wiesmann, C., Baker, R., and Boone, D.L.: ‘De-ubiquitination and ubiquitin domains of A20 downregulate NF-κB signalling’, Nature, 2004, 430, (7000), pp. 694

54 116 Finotto, S., Neurath, M.F., Glickman, J.N., Qin, S., Lehr, H.A., Green, F.H., Ackerman, K., Haley, K., Galle, P.R., and Szabo, S.J.: ‘Development of spontaneous airway changes consistent with human asthma in mice lacking T-bet’, Science, 2002, 295, (5553), pp. 336-338 117 Morel, F.M., Rueter, J.G., and Price, N.M.: ‘Iron nutrition of phytoplankton and its possible importance in the ecology of ocean regions with high nutrient and low biomass’, Oceanography, 1991, 4, (2), pp. 56-61 118 Guyon, I., and Elisseeff, A.: ‘An introduction to variable and feature selection’, J Mach Learn Res, 2003, 3, (Mar), pp. 1157-1182 119 https://scikit-learn.org/stable/modules/feature_selection.html2019

Appendices

Appendix A: User Manual

Woolf Model User Guide

In this tutorial, you will go through the steps to build a Woolf model classifier that can classify a β-lactamase gene as either Class A or not Class A (i.e., Class B, C, or D).

What is a β-lactamase?

β-lactamases are used by a wide variety of bacteria to evade the toxic effects of β-lactam antibiotics, including penicillin-family antibiotics and cephalosporins [106]. As the first drugs developed to treat bacterial infection, penicillin and penicillin-like drugs have been used to treat a wide variety of bacterial diseases since the 1930s [107]. Their long history of use has allowed bacteria to evolve a number of different strategies to avoid β-lactams' toxic effects [107]. These strategies are encoded in bacterial DNA as a diverse set of antibiotic resistance genes, called β-lactamases, that are classified into four types based on the way they function [108]. The classes are labeled Class A, Class B, Class C, and Class D.

Files Provided

• FASTA amino acid sequence files
  o ClassA.fasta
  o ClassBCD.fasta
  o ClassA_test.fasta
  o ClassBCD_test.fasta
  o unknownClass.fasta
• Python Scripts
  o featureCSVfromFASTA.py
  o trainWoolf.py

Dependencies

The Woolf Classifier Building Pipeline has several dependencies you will want to install into your Python environment. The commands for installation and links to the documentation are provided below. Note that it is best practice to install Python packages inside a virtual environment (see https://packaging.python.org/guides/installing-using-pip-and-virtualenv/).

Python 3
You will need Python 3 rather than Python 2.7. If you already have 2.7, you can install Python 3 alongside it by creating a virtual environment.
Install from: https://realpython.com/installing-python/


Biopython
Install with: pip install biopython
Documentation: https://biopython.org/wiki/Packages

Scikit-Learn
Install with: pip install -U numpy scipy scikit-learn
Documentation: https://scikit-learn.org/0.16/install.html

Pandas
Install with: pip install pandas
Documentation: https://pandas.pydata.org/pandas-docs/stable/install.html

Argparse
Only required if your Python version is older than 3.2; argparse is included in the standard library from 3.2 onward.
Install with: pip install argparse
Documentation: https://pypi.org/project/argparse/

Sys and Ast
sys and ast are part of the Python standard library and should not need manual installation.

NOTE: Using the Command Line

The scripts to build a Woolf Model run on the command line. The basic structure of a command to run a script is:

$ python scriptName.py arguments -option

To get help for any command, run it with the help option:

$ python scriptName.py -h

Overview: Steps in Building a Woolf Model

Woolf Classifiers can be built using a few simple commands on the command line in 6 steps.

1. Input FASTA files are converted into a feature table based on length and amino acid composition.
2. The feature table is used to train a model with either a kNN or random forest algorithm.
3. The default parameters lead to an initial report of accuracy and best parameters.
4. Those parameters can then be fine-tuned as inputs to the command.
5. Users can ask to see a list of the protein sequences misclassified by the best-scoring classifier in each model.
6. Users can predict the classes of unknown proteins.


STEP 1: Creating the feature tables

The FASTA files provided in the tutorial have already been split into positive (Class A) and negative (Class B, C, and D) sets, with test data reserved for a final accuracy measure, plus a set of unknown sequences. Use featureCSVfromFASTA.py to create three CSV-based feature tables: one for training, one final one for testing after you have made all the modifications to the data, and one unlabeled table for prediction.

Training:

$ python featureCSVfromFASTA.py --binary -c AvsNotA -f CSVfolder -pf ClassA.fasta -nf ClassBCD.fasta

This command creates a binary feature table (not a prediction table, because it has class labels) that uses the Class A sequences as the positive class and the Class B, C, and D sequences as the negative class.
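The rows of that feature table are simple to picture: each sequence contributes its length plus the relative frequency of each of the 20 amino acids, followed by its class label. A minimal pure-Python sketch of this featurization (illustrative only; featureCSVfromFASTA.py is the real implementation, and the helper names here are made up):

```python
# Sketch of the length + amino acid composition featurization used by the pipeline.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def featurize(seq):
    """Return [length, frac_A, frac_C, ...] for one amino acid sequence."""
    length = len(seq)
    return [length] + [seq.count(aa) / length for aa in AMINO_ACIDS]

def feature_rows(fasta_text, label):
    """Parse a FASTA string into (id, features..., label) rows."""
    rows = []
    for record in fasta_text.strip().split(">")[1:]:
        header, *lines = record.splitlines()
        seq_id = header.split()[0]          # accession is the first header token
        seq = "".join(lines)
        rows.append([seq_id] + featurize(seq) + [label])
    return rows

fasta = ">WP_000000001.1 example\nMKKLLAG\n>WP_000000002.1 example\nGGGAAA\n"
rows = feature_rows(fasta, label=1)  # label 1 = positive class (Class A)
print(rows[0][1])  # sequence length of the first record -> 7
```

Each composition fraction sums to 1 across the 20 amino acids, so the one feature on a very different scale is the raw length, which is why scaling (Step 4) matters.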

Testing: $ python featureCSVfromFASTA.py --binary -c AvsNotA_TEST -f CSVfolder -pf ClassA_test.fasta -nf ClassBCD_test.fasta

Prediction: $ python featureCSVfromFASTA.py --predict -c AvsNotA_UNKNOWN -f CSVfolder -pf unknownClass.fasta

After each command you should get confirmation from the command line that the file created was the type of table you intended, and that it has been saved to the folder with the filename you indicated. Check that the three feature table files are where you expect them to be, and that they contain feature data, before continuing.

STEP 2: Running Model with Default Parameters

Now that you have your feature tables, you can build a model to differentiate between Class A and non-Class A beta-lactamases.

First build a classifier using a kNN algorithm trained on the training feature table:

$ python trainWoolf.py -k CSVfolder/AvsNotA.csv

The output should look like this:

Building kNN Woolf Model...
Training Model...
~~~~~~ RESULTS ~~~~~~
Score of best classifier: 0.832644774106317
Standard deviation of best score: 0.02979380521786959
Best Params:{'clf__n_neighbors': 1}
~~~~~~ ~~~~~~

This indicates that the model achieved an MCC score of about 0.83 using k=1 as the algorithm parameter. You can get more information about the model by running the command with the -v option. The results should look like this:

Building kNN Woolf Model...
Cross-validation Folds: 5
Scoring Metric: MCC
Scaler type: MinMaxScaler
Training Model...
~~~~~~ RESULTS ~~~~~~
Score of best classifier: 0.832644774106317
Best Params:{'clf__n_neighbors': 1}
Range of classifier scores across hyperparameters:
Max: 0.832644774106317
Min: 0.8097974586532646
Range of training scores across hyperparameters:
Max: 1.0
Min: 0.919719472833318
~~~~~~ ~~~~~~
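Conceptually, this training step corresponds to a scikit-learn pipeline (scaler + classifier) tuned by grid search with MCC scoring, matching the defaults shown in the verbose output. A sketch on synthetic stand-in data; the Woolf scripts' internal details may differ:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature table (real input: length + 20 AA fractions).
X, y = make_classification(n_samples=200, n_features=21, random_state=0)

pipe = Pipeline([("scaler", MinMaxScaler()),   # default Min-Max scaling
                 ("clf", KNeighborsClassifier())])

grid = GridSearchCV(pipe,
                    param_grid={"clf__n_neighbors": list(range(1, 21))},  # default 1-20
                    scoring=make_scorer(matthews_corrcoef),               # default MCC
                    cv=5)                                                 # default 5 folds
grid.fit(X, y)

print(grid.best_score_)   # cross-validated MCC of the best k
print(grid.best_params_)  # best k, e.g. {'clf__n_neighbors': ...}
```

The `clf__n_neighbors` naming in the printed "Best Params" comes from this pipeline-step prefix convention in scikit-learn.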

STEP 3: Accuracy Metric Evaluation

There is no rule about what makes an accuracy metric “good,” but each time you get results from the Woolf model, you can do some reasoning about how well your classifier is working.

Classifiers that are doing well will generally:

• Have accuracy measures above:
  o 50% accuracy
  o 0.5 f1-score
  o 0.3 MCC
• Have k values over 1/100 of the number of instances
  o Ex: if you have 200 instances, k should be at least 2
  o Greater values of k indicate better separation between the classes.
• Have fewer trees than the maximum argument provided
• Have more instances per leaf than the minimum argument provided

However, the main goal of the Woolf Pipeline is to provide useful, hypothesis-generating, biologically relevant insight into protein sequences, not to create the best possible computational model, so no definitive number will be able to tell you if a classifier is "good enough."

STEP 4: Modifying Default Parameters

The parameters defined in Table 4.1 can be specified by the user to improve the classification power of Woolf Classifiers.

Table 4.1 User-specified parameters

Parameter | Default | Option Flag | How to Format Input
Feature scaling type | Min-Max Scaling | -s | StandardScaler; MaxAbsScaler
Cross-validation folds | 5 | -c | 10; 20
Accuracy metric | MCC | -a | accuracy; f1
Algorithm hyperparameter ranges:
  Number of neighbors (kNN only) | range from 1 to 20 | -n | 5; 1-20; 1-30,5
  Number of trees (random forest only) | range from 1 to 20 | (see -h) | 5; 1-20; 1-30,5
  Minimum instances per leaf (random forest only) | 10, 15, 20, 25, 30 | (see -h) | 10; 10-50; 10-30,5

Hyperparameter Ranges

These options control the hyperparameter values tested by the Woolf Pipeline and optimized in your model. If you have a small dataset, all the values will need to be smaller. k values will self-regulate and get smaller with smaller datasets, but you will need to impose upper and lower limits for the number of trees and the minimum samples per leaf in a random forest model to prevent overfitting. As a general rule of thumb, try to have 20 instances per tree, and a minimum leaf size of approximately 1/20 of the data. However, as stated above, the goal is biological relevance, not machine learning perfection, so there is no hard and fast rule.
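Those rules of thumb are easy to compute up front from the dataset size. A small sketch (the helper name is made up; the 20-instances-per-tree and 1/20-of-the-data leaf-size heuristics come from the paragraph above):

```python
def suggest_rf_limits(n_instances):
    """Rough random forest hyperparameter limits from dataset size,
    per the heuristics above: ~20 instances per tree, and a minimum
    leaf size of about 1/20 of the data."""
    max_trees = max(1, n_instances // 20)
    min_leaf = max(1, n_instances // 20)
    return max_trees, min_leaf

print(suggest_rf_limits(200))  # -> (10, 10)
print(suggest_rf_limits(40))   # small dataset: fewer trees, smaller leaves -> (2, 2)
```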

For example, to create a kNN model that tests the k values of 1, 3, 5, and 7, you could use:

$ python trainWoolf.py -k CSVfolder/AvsNotA.csv -n 1-7,2

Cross-Validation Folds

To create a model with 10 rather than 5 cross-validation folds, run the command like this:

$ python trainWoolf.py -k CSVfolder/AvsNotA.csv -c 10

Note that the number of cross validation folds must be greater than 1 so that the data can be split at least once. In general, greater numbers of cross-validation folds take longer to run, but give better estimates of the model’s accuracy. If the number of folds is too high for the dataset, you may get an error because there is not enough training data in each split to train the model.
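The fold-count trade-off is easy to see by computing the split sizes directly. A sketch assuming a 60-instance training set (the helper is hypothetical, not part of the pipeline):

```python
def fold_sizes(n_instances, n_folds):
    """(train, test) sizes for each split in n_folds-fold cross-validation."""
    if n_folds < 2:
        raise ValueError("need at least 2 folds to split the data")
    base, extra = divmod(n_instances, n_folds)
    # Some folds get one extra instance when the division is not even.
    sizes = [base + 1] * extra + [base] * (n_folds - extra)
    return [(n_instances - s, s) for s in sizes]

print(fold_sizes(60, 5))   # every split trains on 48 and tests on 12
print(fold_sizes(60, 10))  # more folds: train on 54, but test on only 6 each
```

More folds mean more training data per split (better accuracy estimates) but smaller test sets and longer runtimes, which is exactly the failure mode described above when the fold count is too high for the dataset.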

Scaler Type

Scaling is the process of modifying the center and range of the data in each feature. It is used to modify input data distributions to meet the assumptions most algorithms make about their input data [100]. Random forests are tree-based and do not require scaling; with algorithms like kNN, however, scaling the data prevents features with different ranges from unduly influencing the prediction [100].
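To see why this matters for kNN, note that a raw length feature (hundreds of residues) would swamp the composition features (fractions below 1) in the distance computation. A minimal sketch of what min-max scaling does to one feature column (pure Python for clarity; the pipeline itself uses scikit-learn scalers):

```python
def min_max_scale(column):
    """Rescale one feature column to the range [0, 1], as MinMaxScaler does."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

lengths = [265, 290, 310, 350]        # raw sequence lengths (hundreds)
ala_frac = [0.08, 0.11, 0.06, 0.09]   # alanine composition fractions (< 1)

scaled_lengths = min_max_scale(lengths)
scaled_ala = min_max_scale(ala_frac)

# After scaling, both features span [0, 1] and contribute comparably to a
# kNN distance; unscaled, length differences would dominate entirely.
print(scaled_lengths[0], scaled_lengths[-1])  # -> 0.0 1.0
```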

Table 4.2 shows the range of different scaler types and when they might be useful.

Table 4.2 Possible Scaler Types

Scaler Name | Function | Suggested Use Cases
StandardScaler | Scales each feature to zero mean and unit variance | General use
MinMaxScaler | Scales each feature to a range between 0 and 1 | Sparse data; possible zeros in data; small standard deviations
MaxAbsScaler | Scales each feature to a range between -1 and 1 | Sparse data; possible negative data; small standard deviations
RobustScaler | Scales using alternative center and range metrics that are robust to outliers | Data with outliers
None | No scaling | Approximately normally distributed data in similar ranges; comparison to other methods

To change the scaler type from the default MinMaxScaler to the StandardScaler, use this command:

$ python trainWoolf.py -k CSVfolder/AvsNotA.csv -s StandardScaler

To remove scaling when creating a random forest model, use this command:

$ python trainWoolf.py -f CSVfolder/AvsNotA.csv -s None

You should see results like this:

Building Random Forest Woolf Model...
Training Model...
~~~~~~ RESULTS ~~~~~~
Score of best classifier: 0.8285663419177125
Best Params:{'clf__min_samples_leaf': 13, 'clf__n_estimators': 11}

Accuracy Metric

There are five options for assessing the accuracy of your classifier. These metrics are used at each stage within cross-validation and reported at the end of training. The default option is the Matthews Correlation Coefficient (MCC), which has been shown to work well for small, unbalanced datasets where both the positive and negative classes are important.

To use percentage accuracy as the accuracy metric instead of the default MCC, use this command:


$ python trainWoolf.py -k CSVfolder/AvsNotA.csv -v -a accuracy

All possible accuracy metrics are described in Table 4.3.

Table 4.3 Possible Accuracy Metrics. All implementations come from the scikit-learn preprocessing package [101].

Metric | Description | Function | Suggested Use Cases
accuracy | Percentage of instances classified correctly | (TP + TN) / (TP + TN + FP + FN) | Balanced class distributions of instances
recall | Proportion of actually positive instances that are correctly identified as positive | TP / (TP + FN) | When the most important result is to identify all the positive cases
precision | Proportion of predicted positive instances that are actually positive | TP / (TP + FP) | When it is important to make sure all the predicted positives are really positive
f1 | Harmonic mean of recall and precision | 2(P × R) / (P + R) | When both recall and precision are important
MCC | Combination of all terms from the confusion matrix | (TP × TN - FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)) | Small datasets in which both positive and negative classes are important
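The trade-offs in Table 4.3 are easiest to see with concrete counts. A pure-Python sketch using a made-up unbalanced confusion matrix (not data from this tutorial):

```python
import math

def metrics(tp, tn, fp, fn):
    """Compute the five accuracy metrics of Table 4.3 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = ((tp * tn - fp * fn) /
           math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return accuracy, recall, precision, f1, mcc

# Unbalanced example: 10 positives, 90 negatives; classifier finds only half
# of the positives but gets almost all negatives right.
acc, rec, prec, f1, mcc = metrics(tp=5, tn=85, fp=5, fn=5)
print(round(acc, 2))  # -> 0.9: accuracy looks strong despite missing half the positives
print(round(mcc, 2))  # -> 0.44: MCC is much lower, exposing the weak positive class
```

This gap between accuracy and MCC on unbalanced counts is why MCC is the pipeline's default.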

STEP 5: Listing Misclassified Proteins

To determine which proteins are misclassified by your final model, run the script again with the -e option. Assuming your final model was a kNN trained with percentage accuracy and 4-10 as the possible k values, you would run the following command:

$ python trainWoolf.py -k CSVfolder/AvsNotA.csv -a accuracy -n 4-10 -e

You should get results that look like this:

Building kNN Woolf Model...
Training Model...
~~~~~~ RESULTS ~~~~~~

Score of best classifier: 0.9090318860447215
Best Params:{'clf__n_neighbors': 4}
~~~~~~ ~~~~~~
Listing misclassified instances
misclassified as positive class: ['EIT76073.1', 'WP_121940274.1', 'SFE08607.1', 'KUK79721.1', 'SDU95206.1', 'SEE70256.1', 'SEE30393.1', 'WP_013151481.1', 'WP_101502916.1', 'WP_012895319.1', 'WP_026824192.1', 'WP_128836872.1', 'WP_127568596.1', 'WP_123152110.1', 'WP_008504272.1', 'WP_011497575.1', 'WP_012831211.1', 'WP_013752449.1', 'WP_012287432.1', 'WP_011759294.1', 'WP_086947865.1', 'AEK44086.1']
misclassified as negative class: ['WP_129745509.1', 'WP_129745937.1', 'WP_129749158.1', 'WP_129654766.1', 'WP_007481284.1', 'WP_129584969.1', 'WP_128916907.1', 'WP_128919506.1', 'WP_128943966.1', 'WP_128945668.1', 'WP_128946536.1', 'WP_128911376.1', 'WP_124394870.1', 'WP_128836226.1', 'WP_128836873.1', 'WP_128837474.1', 'WP_128797655.1', 'WP_128795559.1', 'WP_128617379.1', 'WP_115702137.1', 'WP_127566621.1', 'WP_126411272.1', 'WP_126337303.1', 'WP_126404864.1', 'WP_126634890.1', 'WP_124325291.1', 'WP_126411131.1', 'WP_126167828.1', 'WP_125469785.1', 'WP_125148997.1', 'WP_124114574.1', 'WP_124114133.1', 'WP_123939366.1', 'WP_123291711.1', 'WP_123637641.1', 'WP_123679477.1', 'WP_123657470.1', 'WP_123591130.1', 'WP_123438002.1', 'WP_123069295.1', 'WP_122443884.1', 'WP_122497446.1', 'WP_121211065.1', 'WP_115300326.1', 'WP_120215311.1', 'WP_120218024.1', 'WP_119700268.1', 'WP_118763952.1', 'WP_034241719.1', 'WP_117176239.1', 'WP_117395343.1', 'WP_117174207.1', 'WP_116675890.1', 'WP_116612552.1', 'WP_115181656.1', 'WP_115653569.1', 'WP_115327361.1', 'WP_115320774.1', 'WP_115272997.1', 'WP_115297885.1', 'WP_115222632.1', 'WP_115303624.1', 'WP_115241962.1', 'WP_114980594.1', 'WP_115175906.1', 'WP_114889885.1', 'WP_114910097.1', 'WP_055392689.1', 'WP_058032294.1', 'WP_008620049.1', 'WP_008624220.1', 'WP_076381892.1', 'WP_008546521.1', 'WP_023978354.1', 'WP_012225639.1']

The output lists the proteins in the training data that were either misclassified as Class A but actually belong to one of the other classes (misclassified as positive), or were misclassified as not Class A but in fact are (misclassified as negative). These barcodes can be used to find the original sequences for further analysis.

STEP 6: Predicting New Proteins

The final step in the Woolf Classifier Building Pipeline is to actually predict the function of new proteins. This is done with the -p option. Before you do, it is useful to get a final accuracy measure with data that has never been through the model until now; this is done with the same option flag.


To test a model with the test data feature table you made in step one, use the following command:

$ python trainWoolf.py -k CSVfolder/AvsNotA.csv -p CSVfolder/AvsNotA_TEST.csv

You should see results like this:

Building kNN Woolf Model...
Training Model...
~~~~~~ RESULTS ~~~~~~
Score of best classifier: 0.832644774106317
Best Params:{'clf__n_neighbors': 1}
~~~~~~ ~~~~~~
Predicting novel instances
{'WP_013188475.1': 1, 'WP_001931474.1': 1, 'WP_015058868.1': 1, 'WP_032277257.1': 1, 'WP_011091028.1': 1, 'WP_000239590.1': 1, 'WP_025368620.1': 1, 'WP_053444694.1': 1, 'WP_001617865.1': 1, 'WP_000027057.1': 1, 'WP_015387340.1': 1, 'WP_003015755.1': 1, 'WP_003015518.1': 1, 'WP_004197546.1': 1, 'WP_002904004.1': 1, 'WP_012477595.1': 1, 'WP_130451746.1': 1, 'WP_130333879.1': 1, 'WP_066599917.1': 1, 'WP_035895532.1': 1, 'WP_005068900.1': 1, 'WP_000352430.1': 1, 'WP_003634596.1': 1, 'WP_000874931.1': 1, 'WP_001100753.1': 1, 'WP_004179754.1': 1, 'WP_013188473.1': 1, 'WP_015058867.1': 1, 'WP_004199234.1': 1, 'WP_000733283.1': 1, 'WP_000733271.1': 1, 'WP_004176269.1': 1, 'WP_129609116.1': 1, 'WP_110123529.1': 1, 'WP_116721879.1': 1, 'ORN50155.1': 1, 'WP_075986622.1': 1, 'WP_044662084.1': 1, 'WP_020835015.1': 1, 'WP_000733276.1': 1, 'WP_063864653.1': 1, 'WP_002164538.1': 1, 'WP_002101021.1': 1, 'WP_039493569.1': 0, 'WP_039496288.1': 0, 'WP_004201164.1': 0, 'WP_039469786.1': 0, 'WP_000742473.1': 0, 'WP_129757379.1': 0, 'WP_129749067.1': 0, 'WP_129653498.1': 0, 'WP_023408309.1': 0, 'WP_000778180.1': 0, 'WP_001367937.1': 0, 'WP_001299465.1': 0, 'WP_001339114.1': 0, 'WP_001317579.1': 0, 'WP_000976514.1': 0, 'WP_001523751.1': 0, 'WP_001339477.1': 0, 'WP_001352591.1': 0, 'WP_024225500.1': 0, 'WP_001361488.1': 0, 'WP_012139762.1': 0, 'WP_001336292.1': 0, 'WP_001460207.1': 0, 'WP_001376670.1': 0, 'WP_005053920.1': 0, 'WP_005114784.1': 0, 'WP_045149331.1': 0, 'WP_005111907.1': 0, 'WP_001300820.1': 0, 'WP_009667650.1': 0, 'WP_013850585.1': 0, 'WP_006081706.1': 0, 'WP_013169234.1': 0, 'WP_009177260.1': 0, 'WP_004714767.1': 0, 'WP_015950416.1': 0, 'WP_012587601.1': 0, 'WP_012314241.1': 0, 'WP_012018474.1': 0, 'WP_011622707.1': 0, 'WP_011626203.1': 0, 'WP_011716980.1': 0, 'WP_011919325.1': 0, 
'WP_011846786.1': 0, 'WP_001531742.1': 0, 'WP_001659360.1': 0, 'WP_001711023.1': 0, 'WP_001681940.1': 0, 'WP_001417211.1': 0, 'WP_009585129.1': 0, 'WP_007675122.1': 0, 'WP_007761109.1': 0, 'WP_014007498.1': 0, 'WP_003499687.1': 0, 'WP_013691409.1': 0, 'WP_013761358.1': 0, 'WP_006384871.1': 0, 'WP_013394930.1': 0, 'WP_006220425.1': 0, 'WP_013343304.1': 0, 'WP_013091659.1': 0, 'WP_006052187.1': 0, 'WP_012250756.1': 0, 'WP_012337063.1': 0, 'WP_085779089.1': 0,

'WP_020305829.1': 0, 'WP_012052554.1': 0, 'WP_085116644.1': 0, 'WP_071849985.1': 1, 'WP_048624885.1': 0, 'WP_046101847.1': 0, 'WP_041023628.1': 0, 'WP_045489586.1': 0, 'WP_035261877.1': 0, 'WP_045460298.1': 0, 'WP_025373271.1': 0, 'WP_024092339.1': 0, 'WP_023653734.1': 0, 'WP_022562602.1': 0, 'WP_020786480.1': 0, 'WP_020443915.1': 0, 'WP_020296865.1': 0, 'WP_020288732.1': 0, 'WP_020302602.1': 0, 'WP_018611413.1': 0, 'WP_008912959.1': 0, 'WP_013812748.1': 0, 'WP_013592279.1': 0, 'WP_013509122.1': 0, 'WP_006686288.1': 0, 'WP_012145001.1': 0, 'WP_011505503.1': 0, 'WP_074072297.1': 0, 'WP_060769150.1': 0, 'WP_047213261.1': 0, 'WP_006578319.1': 0, 'WP_082929673.1': 0, 'WP_065948248.1': 0, 'WP_065872869.1': 0, 'WP_065880648.1': 0, 'WP_065890887.1': 0, 'WP_065928025.1': 0, 'WP_065895410.1': 0, 'WP_065909522.1': 0, 'WP_065885292.1': 0, 'WP_065878451.1': 0, 'WP_065869409.1': 0, 'WP_065503970.1': 0, 'WP_054452744.1': 0, 'WP_064597896.1': 0, 'WP_061555327.1': 0, 'WP_082806966.1': 0, 'WP_061541403.1': 0, 'WP_033757684.1': 0, 'WP_038492143.1': 0, 'WP_035730667.1': 0, 'WP_060419126.1': 0, 'WP_067436871.1': 0, 'WP_059182486.1': 0, 'WP_058021331.1': 0, 'WP_056784645.1': 0, 'WP_054882571.1': 0, 'WP_046237244.1': 0, 'WP_071840539.1': 0, 'WP_049601150.1': 0, 'WP_060844168.1': 0, 'WP_060840246.1': 0, 'WP_033057352.1': 0, 'WP_050680469.1': 0, 'WP_050534676.1': 0, 'WP_050540198.1': 0, 'WP_049607872.1': 0, 'WP_072089562.1': 0, 'WP_049616313.1': 0, 'WP_072082310.1': 0, 'WP_048227211.1': 0, 'WP_048236297.1': 0, 'WP_048222344.1': 0, 'WP_046855210.1': 0, 'WP_050086780.1': 0, 'WP_057620914.1': 0, 'WP_044459295.1': 0, 'WP_071841543.1': 0, 'WP_035564089.1': 0, 'WP_031376719.1': 0, 'WP_065814057.1': 0, 'WP_043018158.1': 0, 'WP_032679412.1': 0, 'WP_033638425.1': 0, 'WP_024013480.1': 0, 'WP_016656628.1': 0, 'WP_015700263.1': 0, 'WP_005122150.1': 0, 'WP_004923631.1': 0, 'WP_038489723.1': 0, 'WP_033732716.1': 0, 'WP_044550790.1': 0, 'WP_037962530.1': 0, 'WP_032887178.1': 0, 'WP_047597435.1': 0, 
'WP_025339397.1': 0, 'WP_005330568.1': 0, 'WP_010458642.1': 0, 'WP_007965416.1': 0, 'WP_007742644.1': 0, 'WP_020826741.1': 0, 'WP_020724617.1': 0, 'WP_016492932.1': 0, 'WP_016151399.1': 0, 'WP_016149438.1': 0, 'WP_007748831.1': 0, 'WP_013366957.1': 0, 'WP_013357894.1': 0, 'WP_013203868.1': 0, 'WP_074064204.1': 0, 'WP_047429863.1': 0, 'WP_016542230.1': 0, 'WP_016162495.1': 0, 'WP_009508908.1': 0, 'WP_007480812.1': 0, 'WP_065506251.1': 0, 'WP_081279377.1': 0, 'WP_081277750.1': 0, 'WP_061518155.1': 0, 'WP_061524383.1': 0, 'WP_061516136.1': 0, 'WP_062573657.1': 0, 'WP_058708125.1': 0, 'WP_058701878.1': 0, 'WP_047954938.1': 0, 'WP_029307868.1': 0, 'WP_057430086.1': 0, 'WP_057397006.1': 0, 'WP_054423529.1': 0, 'WP_053010118.1': 0, 'WP_072077720.1': 0, 'WP_050135449.1': 0, 'WP_050144425.1': 0, 'WP_072136784.1': 0, 'WP_072186958.1': 0, 'WP_072078661.1': 0, 'WP_044420034.1': 0, 'WP_044390253.1': 0, 'WP_042568790.1': 0, 'WP_047715209.1': 0, 'WP_071841717.1': 0, 'WP_072089849.1': 0, 'WP_072078020.1': 0, 'WP_050088644.1': 0, 'WP_072088286.1': 0, 'WP_049607427.1': 0, 'WP_072081726.1': 0, 'WP_050073671.1': 0, 'WP_048286820.1': 0, 'WP_048273893.1': 0, 'WP_048284614.1': 0, 'WP_047355585.1': 0, 'WP_045882124.1': 0, 'WP_045792035.1': 0, 'WP_045269678.1': 0, 'WP_044293428.1': 0,

'WP_042560048.1': 0, 'WP_042086213.1': 0, 'WP_020928562.1': 0, 'WP_004910271.1': 0, 'WP_047414966.1': 0, 'WP_047500028.1': 0, 'WP_039340300.1': 0, 'WP_037140816.1': 0, 'WP_035225479.1': 0, 'WP_024474547.1': 0, 'WP_038409596.1': 0, 'WP_036966456.1': 0, 'WP_036959629.1': 0, 'WP_036974156.1': 0, 'WP_036977846.1': 0, 'WP_036948496.1': 0, 'WP_025416832.1': 0, 'WP_025395172.1': 0, 'WP_023533520.1': 0, 'WP_022625635.1': 0, 'WP_021491540.1': 0, 'WP_019692735.1': 0, 'WP_020431495.1': 0, 'WP_016500007.1': 0, 'WP_004637248.1': 0, 'WP_004262799.1': 0, 'WP_045902232.1': 0, 'WP_060460871.1': 0, 'WP_060442301.1': 0, 'WP_060431700.1': 0, 'WP_060455639.1': 0, 'WP_060437608.1': 0, 'WP_060432983.1': 0, 'WP_060438709.1': 0, 'WP_060418500.1': 0, 'WP_050880653.1': 0, 'WP_050918486.1': 0, 'WP_050162987.1': 0, 'WP_050158303.1': 0, 'WP_050130984.1': 0, 'WP_050160138.1': 0, 'WP_050321664.1': 0, 'WP_050322708.1': 0, 'WP_050335261.1': 0, 'WP_048324983.1': 0, 'WP_047360713.1': 0, 'WP_048233629.1': 0, 'WP_039570340.1': 0, 'WP_043179244.1': 0, 'WP_034622813.1': 0, 'WP_038448376.1': 0, 'WP_025207685.1': 0, 'WP_025328859.1': 0, 'WP_016453823.1': 0, 'WP_016453249.1': 0, 'WP_013983315.1': 0, 'WP_011946603.1': 0, 'WP_034196516.1': 0, 'WP_034188536.1': 0, 'WP_034040414.1': 0, 'WP_014475645.1': 0, 'WP_020914447.1': 0, 'WP_057099948.1': 0, 'WP_057992932.1': 0, 'WP_057037937.1': 0, 'WP_039124095.1': 0, 'WP_044780371.1': 0, 'WP_050460055.1': 0, 'WP_032071425.1': 0, 'WP_031955431.1': 0, 'WP_032017448.1': 0, 'WP_032046023.1': 0, 'WP_031972720.1': 0, 'WP_031991294.1': 0, 'WP_032019111.1': 0, 'WP_032026533.1': 0, 'WP_031950449.1': 0, 'WP_032039758.1': 0, 'WP_033853219.1': 0, 'WP_032045193.1': 0, 'WP_032034494.1': 0, 'WP_032061328.1': 0, 'WP_032070286.1': 0, 'WP_032015622.1': 0, 'WP_032037742.1': 0, 'WP_032027613.1': 0, 'WP_032009364.1': 0, 'WP_042757256.1': 0, 'WP_031636917.1': 0, 'WP_034324889.1': 0, 'WP_011860820.1': 0, 'AOH51342.1': 0, 'WP_013279374.1': 0, 'WP_063862740.1': 0, 'WP_063862735.1': 0, 
'WP_063862730.1': 0, 'WP_063862729.1': 0, 'WP_063862728.1': 0, 'WP_063862727.1': 0, 'WP_063862726.1': 0, 'WP_063862725.1': 0, 'WP_063862724.1': 0, 'WP_063862723.1': 0, 'WP_063862722.1': 0, 'WP_063862721.1': 0, 'WP_063862719.1': 0, 'WP_063862420.1': 0, 'WP_063862718.1': 0, 'WP_063862408.1': 0, 'WP_063862402.1': 0, 'WP_047023509.1': 0, 'WP_057625895.1': 0, 'WP_057631141.1': 0, 'WP_057627739.1': 0, 'WP_071447731.1': 0, 'EAB9198241.1': 0, 'WP_063086404.1': 0, 'WP_001439279.1': 0, 'WP_001350795.1': 0}
Score on test data: 0.987116201732058

The overall accuracy score is the last number printed.

Finally, to run the model to predict your unknown proteins, use the following command. Remember that AvsNotA_UNKNOWN.csv is a feature table file you generated in step 1.

$ python trainWoolf.py -k CSVfolder/AvsNotA.csv -p CSVfolder/AvsNotA_UNKNOWN.csv

The results should look like this:


Building kNN Woolf Model...
Training Model...
~~~~~~ RESULTS ~~~~~~
Score of best classifier: 0.832644774106317
Best Params:{'clf__n_neighbors': 1}
~~~~~~ ~~~~~~
Predicting novel instances
{'WP_050067789.1': 0, 'WP_021577770.1': 0, 'WP_023224609.1': 1, 'WP_017442020.1': 1, 'WP_001630205.1': 0, 'WP_001631316.1': 1, 'WP_002852433.1': 0, 'WP_002856956.1': 0, 'WP_000830775.1': 0, 'WP_001317977.1': 0, 'WP_001082962.1': 1, 'WP_000830777.1': 0, 'WP_000830773.1': 0, 'WP_000059908.1': 0, 'WP_057991827.1': 0, 'WP_000673293.1': 0, 'WP_002776717.1': 0, 'WP_000817293.1': 0, 'WP_001082975.1': 1, 'WP_001208005.1': 1, 'WP_023216810.1': 1, 'WP_023993659.1': 0, 'WP_001667037.1': 0, 'WP_000673298.1': 0, 'WP_001208011.1': 1, 'WP_000673287.1': 0, 'WP_001520983.1': 0, 'WP_001082970.1': 1, 'WP_002865991.1': 0, 'WP_020899073.1': 0, 'WP_000830757.1': 0, 'WP_001327042.1': 0, 'WP_020837858.1': 1, 'WP_000188069.1': 0, 'WP_001082979.1': 1, 'WP_022645952.1': 0, 'WP_001299057.1': 0, 'WP_001931474.1': 1, 'WP_023259161.1': 0, 'WP_111742248.1': 1}

And that is it! You have built a Woolf Classifier that can predict whether novel β-lactamases are Class A or not Class A. Using the barcodes provided by the -p option, you can further study these sequences using any conventional experimental or computational technique.

Appendix B: Manganese-Oxidizing Peroxidases

Gene Name | UniProt Entry | Description | Organism | Experimental Evidence | Citing Paper
MnP | Q02567 | Manganese peroxidase 1 | Phanerochaete chrysosporium (White-rot fungus) (Sporotrichum pruinosum) | y | https://bmcbiotechnol.biomedcentral.com/track/pdf/10.1186/s12896-017-0338-5
dyp2 | K7N5M8 | Multifunctional dye peroxidase DyP2 | Amycolatopsis sp. (strain ATCC 39116 / 75iv2) | y | https://www.ncbi.nlm.nih.gov/pubmed/23054399
 | P19136 | Manganese peroxidase H4 | Phanerochaete chrysosporium (White-rot fungus) (Sporotrichum pruinosum) | y | https://www.ncbi.nlm.nih.gov/pubmed/2760033
mnp2 | Q70LM3 | Manganese peroxidase 2 | Phlebia radiata (White-rot fungus) | y | https://link.springer.com/article/10.1007/s002530050764
mnp3 | Q96TS6 | Manganese peroxidase 3 | Phlebia radiata (White-rot fungus) | y | https://link.springer.com/article/10.1007/s002530050764
 | P78733 | Manganese peroxidase H3 | Phanerochaete chrysosporium (White-rot fungus) (Sporotrichum pruinosum) | y | https://www.ncbi.nlm.nih.gov/pubmed/1592808
vpl2 | O94753 | Versatile peroxidase VPL2 | Pleurotus eryngii (Boletus of the steppes) | y | https://www.ncbi.nlm.nih.gov/pubmed/9987124
vps1 | Q9UVP6 | Versatile peroxidase VPS1 | Pleurotus eryngii (Boletus of the steppes) | n | https://www.ncbi.nlm.nih.gov/pubmed/9987124
vpl1 | Q9UR19 | VPL1 | Pleurotus eryngii (Boletus of the steppes) | n | https://www.ncbi.nlm.nih.gov/pubmed/9987124
mofA | P71431 | MofA protein | Leptothrix discophora | y | https://www.tandfonline.com/doi/abs/10.1080/01490459709378037

Appendix C: Non-Manganese-Oxidizing Peroxidases

Gene Name | UniProt Entry | Description | Organism | Experimental Evidence | Citing Paper
dyp1 | I2DBY1 | Dye-decolorizing peroxidase AauDyP1 | Auricularia auricula-judae (Judas ear fungus) (Tremella auricula-judae) | y | https://www.ncbi.nlm.nih.gov/pubmed/25153532
EPX | P11678 | Eosinophil peroxidase | Homo sapiens (Human) | y | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC145361/
GPX2 | P18283 | Glutathione peroxidase 2 | Homo sapiens (Human) | | https://www.ncbi.nlm.nih.gov/pubmed/21873635
Epx | P49290 | Eosinophil peroxidase | Mus musculus (Mouse) | y | https://www.ncbi.nlm.nih.gov/pubmed/18694936
GPX3 | P22352 | Glutathione peroxidase 3 | Homo sapiens (Human) | y | https://www.ncbi.nlm.nih.gov/pubmed/1897960
mlt-7 | Q23490 | Peroxidase mlt-7 | Caenorhabditis elegans | y | https://www.ncbi.nlm.nih.gov/pubmed/19406744
gpx1 | O59858 | Glutathione peroxidase-like | Schizosaccharomyces pombe (strain 972 / ATCC 24843) (Fission yeast) | y | https://www.ncbi.nlm.nih.gov/pubmed/10455235
Gpx1 | P11352 | Glutathione peroxidase 1 | Mus musculus (Mouse) | y | https://www.ncbi.nlm.nih.gov/pubmed/21420488
poxN1 | Q9XIV8 | Peroxidase N1 | Nicotiana tabacum (Common tobacco) | y | https://www.ncbi.nlm.nih.gov/pubmed/10364388
katG | P9WIE5 | Catalase-peroxidase | Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) | y | https://www.ncbi.nlm.nih.gov/pubmed/9006925
PER59 | Q39034 | Peroxidase 59 | Arabidopsis thaliana (Mouse-ear cress) | y | https://www.ncbi.nlm.nih.gov/pubmed/10713531
PER22 | P24102 | Peroxidase 22 | Arabidopsis thaliana (Mouse-ear cress) | n |
katG | P13029 | Catalase-peroxidase | Escherichia coli (strain K12) | y | https://www.ncbi.nlm.nih.gov/pubmed/374409
katG | Q939D2 | Catalase-peroxidase | Burkholderia pseudomallei (strain K96243) | y | https://www.ncbi.nlm.nih.gov/pubmed/15280362
katG | P73911 | Catalase-peroxidase | Synechocystis sp. (strain PCC 6803 / Kazusa) | y | https://www.ncbi.nlm.nih.gov/pubmed/10543446
LPOA | P06181 | Ligninase H8 | Phanerochaete chrysosporium (White-rot fungus) (Sporotrichum pruinosum) | y | https://www.ncbi.nlm.nih.gov/pubmed/2303054
msp1 | B0BK71 | Dye-decolorizing peroxidase msp1 | Mycetinis scorodonius (Garlic mushroom) (Marasmius scorodonius) | y | https://www.uniprot.org/citations/23111597
LnP | C0IW58 | Low-redox potential peroxidase | Taiwanofungus camphoratus (Poroid brown-rot fungus) (Antrodia camphorata) | y | https://www.ncbi.nlm.nih.gov/pubmed/19202090
dyp1 | I2DBY2 | Dye-decolorizing peroxidase | Exidia glandulosa (Black witch's butter) | y | https://www.ncbi.nlm.nih.gov/pubmed/23111597
dyp1 | I2DBY3 | Dye-decolorizing peroxidase | Mycena epipterygia (Yellow-stemmed mycena) | y | https://www.ncbi.nlm.nih.gov/pubmed/23111597
