Exploring Machine Learning Methods for Nuclear Export Sequence Identification
A Major Qualifying Project Report
Submitted to the Faculty of the
WORCESTER POLYTECHNIC INSTITUTE
In partial fulfillment of the requirements for the
Degree of Bachelor of Science

Erin Conneilly

Advisors:
Professor Destin Heilman (CBC)
Professor Rodica Neamtu (CS)

Abstract

The goal of this project is to design and implement a user-friendly machine learning tool that can be applied to the classification of polypeptides to find functional nuclear export sequences (NESs). This tool incorporates an API that takes advantage of support vector machines and can be expanded to include other models. Because NESs have been found to have consistent structure, structural data is incorporated into the model to increase confidence. This report is accompanied by a manual that instructs users on how to use the tool.

Table of Contents

Abstract
Table of Contents
Table of Figures
Table of Tables
Acknowledgements
Introduction
    Project Motivation
    Problem Statement
Background
    Section 1: Intracellular Transport
    Section 2: Nuclear Export Sequence Structure
    Section 3: Existing Prediction Methods
    Section 4: Machine Learning Techniques
        4.1 Recurrent Neural Networks (RNNs)
        4.2 Nuclear Export Sequences as Time Series
        4.3 Autoregressive Integrated Moving Average Model (ARIMA)
        4.4 One-Class Support Vector Machines (OC SVM)
        4.5 Scikit-Learn
Tool Overview and Functionality
    Section 1: API Functionality
    Section 2: The Command Line Tool
    Section 3: The Web Application Tool
Methodology
    Section 1: Data Selection
    Section 2: Model Metrics and Goals
    Section 3: Testing Procedure
        3.1 Finding the Best Parameters
        3.2 Finding the Best Train Set
Case Study
Future Work
References
Appendix A: API Manual
    Getting Started
    Running the API
Appendix B: Web Application Manual
    Getting Started
    Running the Web Application

Table of Figures

Figure 1: Cellular transport cycles
Figure 2: Visualization of PKI-type NESs docked in the hydrophobic cleft of CRM1
Figure 3: Hyperplane example
Figure 4: Linearly and nonlinearly separable data
Figure 5: General workflow for API and related tools
Figure 6: Snapshot of command line tool that shows the test options
Figure 7: Example of a summary of test results for a self test
Figure 8: A screen capture of the main page of the web application
Figure 9: A screen capture of the web application where a test was run
Figure 10: A screen capture of the web application prediction page
Figure 11: ClustalW alignment of extracted sequences
Figure 12: A WebLogo created by compiling the sequences collected
Figure 13: Example of a confusion matrix
Figure 14: Schematic representation of CAV-VP3
Figure 15: Snapshot of web application prediction page for CAV-VP3
Figure 16: Snapshot of web application prediction page for PCV1-VP3
Figure 17: NESbase prediction algorithm applied to PCV

Table of Tables

Table 1: The test set used in consensus tests
Table 2: The test set used in mixNES tests
Table 3: The test set used in mixWithRandom tests
Table 4: The test set used in mixWithAll tests
Table 5: Results of parameter testing
Table 6: The set of randomly generated sequences used for testing
Table 7: Top five results of different train sizes tested against ‘mixWithRandom’
Table 8: Top five results of different train sizes with all but one randomly generated
Table 9: Results of test to find the best train set of 17 consensus NESs
Table 10: A summary of unexpected and failing results for a bias test
Table 11: Results of each non consensus sequence added to the best set of 17
Table 12: Results of each non consensus added to the set of 23 consensus
Table 13: The best train set of 17 consensus conforming sequences
Acknowledgements

I would like to express deep thanks to Professor Destin Heilman and Professor Rodica Neamtu for advising this project. This interdisciplinary computer science and biochemistry project would not have been possible without their flexibility and willingness to learn about a subject completely outside not only their department, but their comfort zone. Thank you to Professor Heilman, who pushed me to create the most useful and relevant tool I could and assisted with testing. Thank you to Professor Neamtu, who constantly gave candid feedback and reminded me to keep the scope of the project under control. I would also like to thank Fareya Ikram, Natalie Bloniarz, and Lucas Sacherer. These WPI students and alumni gave crucial advice about implementation and user-friendly design.

Introduction

Project Motivation

Protein functions are defined not just by the sequence of amino acids they are composed of, but by their 3-dimensional structure. A protein can be understood from its tertiary and quaternary structure, as well as from which conformations are energetically favorable. The affinity of protein-protein and protein-ligand interactions is controlled by the strength of intermolecular forces and by how well the components fit together; thus structure and conformational changes control the speed and efficiency of protein-catalyzed reactions. Because every level of protein structure is so important to understanding these interactions, determining these structures has long been a target of biochemical research.

Experimental discovery of these structures through methods like X-ray crystallography is expensive and time consuming. Current methods of crystallization and structural imaging simply do not work for some proteins. Attempts have been made to predict protein structures from sequence, but it is a complex problem. The folding of amino acids into a protein is affected by too many variables to calculate individually.
Modelling every amino acid and its conformation explicitly is beyond the reach of even the newest computing technology. When a problem is simply too big to calculate directly, data science and machine learning can be employed to reduce the computational load and sometimes to reach a better solution. Instead of calculating the energy of amino acid interactions and torsion angles, a machine learning approach is based on patterns found in the data. Already categorized, or labelled, data gives the model a ground truth from which it can begin to form a pattern for recognizing inputs. This has led to heuristics and machine learning methods that use amino acid sequence data to categorize and predict the shape those peptides will take in vivo.

Problem Statement

One area that lacks accurate structural data and predictability is nuclear export sequences. Currently, experimentally verified nuclear export sequences are well known and publicly available in a database called NESbase. This database does not have any structural information associated with its sequences. Functional NESs are difficult to identify, and efforts to come up with a predictive model have suffered from high rates of false positives and false negatives. These models have been based mostly on sequence data; however, functional NESs may resemble one another more closely in the structures they form than in their raw sequences. Given the importance of secondary, tertiary, and quaternary structure in the function of proteins, it follows that sequences with similar properties lend themselves to repeatable structures that interact with transport proteins. Taking a structural approach in creating a predictive model extends more recent research that has used conformational fit to help identify protein-ligand interactions.
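To make the limitation of sequence-only identification concrete, the following is a minimal sketch of a purely pattern-based scan. It assumes the commonly cited class 1a consensus spacing, Φ-x(3)-Φ-x(2)-Φ-x-Φ, with Φ taken to be L, I, V, F, or M; neither the pattern nor the residue set is defined in this section, and the function name and example sequences are hypothetical. It is not the method used by the tool described in this report.

```python
import re

# Assumption: hydrophobic residues allowed at the key (Phi) positions.
HYDROPHOBIC = "LIVFM"

# Assumed class 1a spacing, Phi-x(3)-Phi-x(2)-Phi-x-Phi, wrapped in a
# lookahead so that overlapping candidates are all reported.
NES_PATTERN = re.compile(
    rf"(?=([{HYDROPHOBIC}].{{3}}[{HYDROPHOBIC}].{{2}}[{HYDROPHOBIC}].[{HYDROPHOBIC}]))"
)


def find_candidate_nes(sequence: str) -> list:
    """Return (0-based start, 10-residue segment) for every pattern match."""
    sequence = sequence.upper()
    return [(m.start(), m.group(1)) for m in NES_PATTERN.finditer(sequence)]


if __name__ == "__main__":
    # Illustrative PKI-like segment; not taken from the report's data sets.
    print(find_candidate_nes("NSNELALKLAGLDINK"))
    # An arbitrary string with the same spacing also matches.
    print(find_candidate_nes("LAAALAALAL"))
```

Any string with the right hydrophobic spacing matches, whether or not the segment is exposed or adopts a shape that a transport protein can bind. A scan of this kind therefore cannot separate functional NESs from look-alikes, which is the gap that a structural approach is meant to address.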
A more reliable, systematic method of identifying and classifying NESs could make the discovery of new NESs easier, allow biochemists to create new NESs, and aid the understanding of cellular transport mechanisms. The proposed project is an investigation of a few machine learning based structural prediction methods and their relevance to classifying nuclear export sequences. The aim of this project is threefold: to explore modeling and machine learning tools that can be used to classify NESs, to investigate the improvement of NES classification through structural prediction, and to create an API that utilizes open-source tools to create and test models that help to identify sequences as functional NESs. This API will be complemented by a user interface that makes creating and using models clear and easy for users without a computer science background.

Background

Section 1: Intracellular Transport

A key feature of eukaryotic cells that differentiates them from prokaryotic cells is their subdivision into a variety of cellular compartments. This division is accomplished through a complex series of membrane-enclosed organelles that carry out specialized functions within the cell [2]. These specialized organelles often require function-specific conditions within their membranes. A notable case of this is the lysosome, which contains digestive enzymes. The enzymes housed by the lysosome perform optimally under acidic conditions with a pH of about 5, which is maintained by H+ pumps in the membrane [2]. The separation of these enzymes from the rest of the cell not only provides the unique environment needed for the digestive enzymes to function properly, it also prevents integral parts of the cell from being unintentionally harmed. Another such membrane-enclosed organelle is the nucleus, which holds the genetic information safely away from the rest of the cell and controls gene expression [2].
Carefully controlled transport to and from the nucleus is crucial to the survival and proliferation of the cell. In order for genes to be expressed properly, genetic controls like transcription factors and organizational proteins like histones that are constructed in the cytosol need to be imported into the nucleus.