The Woolf Classifier Building Pipeline: A Machine Learning Tool for Predicting Protein Function

Anna Farrell-Sherman

Submitted in Partial Fulfillment of the Prerequisite for Honors in Computer Science under the advisement of Eni Mustafaraj and Vanja Klepac-Ceraj

May 2019

© 2019 Anna Farrell-Sherman

Abstract

Proteins are the machinery that allows cells to grow, reproduce, communicate, and create multicellular organisms. For all of their importance, scientists still have a hard time understanding the function of a protein based on its sequence alone. For my honors thesis in computer science, I created a machine learning tool that can predict the function of a protein based solely on its amino acid sequence. My tool gives scientists a structure in which to build a k Nearest Neighbor or random forest classifier to distinguish between proteins that can and cannot perform a given function. With Min-Max scaling as the default and the Matthews Correlation Coefficient for accuracy assessment, the Woolf Pipeline offers simplified choices that guide users to success.

Acknowledgments

There are so many people who made this thesis possible. First, thank you to my wonderful advisors, Eni and Vanja, who never gave up on me, and always pushed me to try my hardest. Thank you also to the other members of my committee, Sohie Lee, Shikha Singh, and Rosanna Hertz, for supporting me through this process. To Kevin, Sophie R, Sophie E, and my sister Phoebe, thank you for reading my drafts, and advising me when the going got tough. To all my wonderful friends, who were always there with encouraging words, warm hugs, and many congratulations, I cannot thank you enough. Jocelyn, Hershel, Fiona, Anneli, Lydia, Linda, and Claire, I could not have done this without all of you. And finally, thank you to my parents, who have always believed in me more than I could have thought possible, and have shown me how much I am capable of.

A Note on the Name Woolf

The Woolf Classifier Building Pipeline is named after the lesbian feminist author Virginia Woolf. She is known for her use of stream of consciousness, and is considered one of the most important 20th century modernist authors. The tribute to her here is a reminder that science alone can never capture all of the wonders of life.

Table of Contents

Abstract
Acknowledgments
A Note on the Name Woolf
List of Figures
List of Tables
1 Introduction
2 Anaerobic Manganese Oxidation: A Case Study
  2.1 Manganese Oxides in the Geologic Record
  2.2 Biological Mn(II) Oxidation
  2.3 Genes and Proteins
3 The Field of Machine Learning
  3.1 What is Machine Learning?
  3.2 Classification in Machine Learning
    3.2.1 The Class Distribution Problem
    3.2.2 The Multi-Class Problem
  3.3 Algorithm Choice
    3.3.1 Naïve Bayes
    3.3.2 Support Vector Machine (SVM)
    3.3.3 Neural Nets
    3.3.4 k Nearest Neighbors (kNN)
    3.3.5 Decision Trees and Random Forests
  3.4 Machine Learning in Biology
4 The Woolf Classifier Building Pipeline
  4.1 Introduction
  4.2 How is the Woolf Pipeline Different?
  4.3 Pipeline Composition
    4.3.1 Input
    4.3.2 Creating a Feature Table
    4.3.3 Model Creation with Default Settings
    4.3.4 Changing the Pipeline Parameters
    4.3.5 Error Evaluation
    4.3.6 Prediction
    4.3.7 Implementation
  4.4 Design Decisions
    4.4.1 Why the Command Line?
    4.4.2 Machine Learning Features
    4.4.3 Scalar Operations
    4.4.4 Accuracy Metrics
    4.4.5 Algorithm Choices
  4.5 When are Woolf Classifiers Useful
5 Model Building with the Woolf Pipeline
  5.1 Introduction
  5.2 β-Lactamases for Woolf Model Testing
  5.3 Evaluating Algorithm Implementations
    5.3.1 Accuracy Metrics
    5.3.2 Scaling Types
    5.3.3 Hyperparameter Values
  5.4 Detecting Anaerobic Manganese Oxidation Genes
    5.4.1 kNN Model
    5.4.2 Random Forest Model
    5.4.3 Prediction of Green Lake Manganese Oxide Genes
  5.5 Conclusion
6 Conclusions and Future Work
  6.1 The Success of the Woolf Classifier Building Pipeline
  6.2 Improvements to the Woolf Tool
  6.3 Future Applications
7 References
Appendices
  Appendix A: User Manual
  Appendix B: Manganese Oxidizing Peroxidases
  Appendix C: Non-Manganese Oxidizing Peroxidases

List of Figures

2.1 Manganese Oxides and the Build-up of Oxygen in Earth’s Atmosphere
2.2 Anaerobic Manganese Oxide Producing Bacterial Communities in Fayetteville Green Lake
3.3.1 Bayes’ Rule
3.3.2 Support Vector Machine in Two Dimensions
3.3.3 Neural Network Diagram
3.3.4 kNN Classification in Two Dimensions
3.3.5 A Random Forest with Three Trees
4.2 Genome to Gene to Protein
4.3.1 FASTA and FASTQ format
4.3.2 The Woolf Classifier Building Pipeline
4.4.2 The Twenty Amino Acids
5.1 The 4 β-Lactamase Comparisons Used to Test the Woolf Classification Pipeline
5.3.1 Three Accuracy Metrics to Evaluate Woolf Classifiers
5.3.2 Effect of Scaling on kNN-based Model MCC
5.3.3 Effect of Hyperparameters on MCC

List of Tables

3.2.1 The Confusion Matrix
4.3.4 User Specified Parameters
4.4.3 Possible Scaler Types
4.4.4 Possible Accuracy Metrics
5.4.3 Predicted Manganese Binding Function of Green Lake Peroxidases

1 Introduction

Until recently, scientists did not know that bacteria were capable of producing manganese oxides without oxygen.
In fact, manganese oxide minerals have been used to date both the formation of Earth’s modern oxygenated atmosphere and the evolution of oxygenic photosynthesis, based on the idea that these minerals could not form without oxygen in the atmosphere [1, 2]. Yet, despite these claims, recent research has discovered bacterial communities capable of forming manganese oxides without oxygen [3]. This discovery has the potential to change our understanding of the early Archean world. However, the formation of these minerals is still a mystery. We have yet to identify any gene that could enable bacteria to produce them without oxygen. The function of genes can be predicted with algorithms used to annotate genomes after whole genome sequencing, and proven with laboratory experiments that show chemically how proteins work within cells [4, 5]. We have full genome sequences of both the manganese oxide producing bacteria, but because anaerobic manganese oxidation is a novel process, there are no known protein sequences to which we can compare the new sequences. The functional gene annotation produced by current algorithms helps narrow down protein types, but none of the annotations are specific enough to identify genes that might be involved in manganese oxidation. Without a way to narrow down the possibilities, it is difficult to know where to start with laboratory experiments. Machine learning has the potential to provide a solution: a more specific computational tool to differentiate between specific protein functions. To be successful, such a tool would need to be more precise than the current protein differentiation algorithms and at the same time, require less training data. This requires simpler machine learning algorithms, more careful treatment of the data to avoid adding bias, and the ability to biologically interpret the results on the level of individual genes. The biological background of the project will be discussed in Chapter Two. 
The machine learning required for such a tool is complex, and is explored in detail in Chapter Three, as a prelude to the main work of this thesis: the creation of a machine learning tool like the one described above, called the Woolf Classifier Building Pipeline. The Woolf Pipeline, described in Chapter Four, allows researchers to collect sequence data on genes with and without a given function, choose either a k Nearest Neighbor (kNN) or random forest classification algorithm, and predict the functionality of novel protein sequences. Users can select the type of scaling applied to the input dataset and the accuracy metric used to assess the classification ability of the models, and can give the tool hyperparameter ranges to be tested with cross-validation. The Woolf Pipeline user manual guides users through these decisions so that the pipeline is accessible even to researchers who do not have an extensive machine learning background (Appendix A).

An example set of models created by the Woolf Pipeline is presented in Chapter Five. These models show that Woolf-built classifiers are effective at differentiating between classes of β-lactamase antibiotic resistance genes with both kNN and random forest based classification. Both algorithms are feature based; they model the classes using amino acid percentage composition for each of the 20 amino acids and protein length, for a total of 21 machine learning features. The default parameters are tuned for the problem of protein differentiation: Min-Max scaling for the kNN algorithm combats the sparsity of the amino acid composition data, and the Matthews correlation coefficient (MCC) accuracy metric deals with the small dataset size and unbalanced class distributions. All four example classifiers with default Min-Max scaling achieved MCC values of over 0.8, better than any previously published model [6].
Most broadly, this thesis is an investigation into the practicality of using machine learning to create simple classification models to understand the function of biological proteins.
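
To make the 21-feature representation, Min-Max scaling, and the MCC concrete, the following is a minimal Python sketch, not the Woolf Pipeline's own code. The function names (`sequence_features`, `min_max_scale`, `mcc`), the amino acid ordering, and the choice to express composition on a 0 to 100 scale are assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt

# One-letter codes for the 20 standard amino acids (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_features(seq):
    """Return 21 features: percent composition per amino acid, then length."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    length = len(seq)
    # Non-standard residue codes (e.g. X) count toward length only.
    composition = [100.0 * counts[aa] / length for aa in AMINO_ACIDS]
    return composition + [float(length)]


def min_max_scale(rows):
    """Rescale each feature column to [0, 1]; constant columns map to 0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [
        [(v - lo) / span if span else 0.0
         for v, lo, span in zip(row, lows, spans)]
        for row in rows
    ]


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

In this sketch, a feature table would be built by calling `sequence_features` on every training sequence and passing the resulting rows through `min_max_scale`, so that protein length (often in the hundreds) does not dominate the distance calculations used by kNN. The MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, which is why it remains informative when the two classes are unbalanced.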