Exploring Machine Learning Methods for Nuclear Export Sequence Identification

A Major Qualifying Project Report Submitted to the Faculty Of the WORCESTER POLYTECHNIC INSTITUTE In partial fulfillment of the requirements for the Degree of Bachelor of Science

______Erin Conneilly

Advisors:

______Professor Destin Heilman (CBC)

______Professor Rodica Neamtu (CS)

Abstract

The goal of this project is to design and implement a user-friendly machine learning tool that can be applied to the classification of polypeptides to find functional nuclear export sequences (NESs). This tool incorporates an API that takes advantage of support vector machines and can be expanded to include other models. Because NESs have been found to have consistent structure, structural data is incorporated into the model to increase confidence. This report is accompanied by a manual that instructs users on how to use the tool.

Table of Contents

Abstract
Table of Contents
Table of Figures
Table of Tables
Acknowledgements
Introduction
    Project Motivation
    Problem Statement
Background
    Section 1: Intracellular Transport
    Section 2: Nuclear Export Sequence Structure
    Section 3: Existing Prediction Methods
    Section 4: Machine Learning Techniques
        4.1 Recurrent Neural Networks (RNNs)
        4.2 Nuclear Export Sequences as Time Series
        4.3 Autoregressive Integrated Moving Average Model (ARIMA)
        4.4 One-Class Support Vector Machines (OC SVM)
        4.5 Scikit-Learn
Tool Overview and Functionality
    Section 1: API Functionality
    Section 2: The Command Line Tool
    Section 3: The Web Application Tool
Methodology
    Section 1: Data Selection
    Section 2: Model Metrics and Goals
    Section 3: Testing Procedure
        3.1 Finding the Best Parameters
        3.2 Finding the Best Train Set
Case Study
Future Work
References
Appendix A: API Manual
    Getting Started
    Running the API
Appendix B: Web Application Manual
    Getting Started
    Running the Web Application

Table of Figures

Figure 1: Cellular transport cycles
Figure 2: Visualization of PKI-type NESs docked in the hydrophobic cleft of CRM1
Figure 3: Hyperplane example
Figure 4: Linearly and nonlinearly separable data
Figure 5: General workflow for API and related tools
Figure 6: Snapshot of command line tool that shows the test options
Figure 7: Example of a summary of test results for a self test
Figure 8: A screen capture of the main page of the web application
Figure 9: A screen capture of the web application where a test was run
Figure 10: A screen capture of the web application prediction page
Figure 11: ClustalW alignment of extracted sequences
Figure 12: A WebLogo created by compiling the sequences collected
Figure 13: Example of a confusion matrix
Figure 14: Schematic representation of CAV-VP3
Figure 15: Snapshot of web application prediction page for CAV-VP3
Figure 16: Snapshot of web application prediction page for PCV1-VP3
Figure 17: NESbase prediction algorithm applied to PCV

Table of Tables

Table 1: The test set used in consensus tests
Table 2: The test set used in mixNES tests
Table 3: The test set used in mixWithRandom tests
Table 4: The test set used in mixAll tests
Table 5: Results of parameter testing
Table 6: The set of randomly generated sequences used for testing
Table 7: Top five results of different train sizes tested against 'mixWithRandom'
Table 8: Top five results of different train sizes with all but one randomly generated
Table 9: Results of test to find the best train set of 17 consensus NESs
Table 10: A summary of unexpected and failing results for a bias test
Table 11: Results of each non consensus sequence added to the best set of 17
Table 12: Results of each non consensus added to the set of 23 consensus
Table 13: The best train set of 17 consensus conforming sequences

Acknowledgements

I would like to express deep thanks to Professor Destin Heilman and Professor Rodica Neamtu for advising this project. This interdisciplinary computer science and biochemistry project would not have been possible without their flexibility and willingness to learn about a subject outside not only their own departments but also their comfort zones. Thank you to Professor Heilman, who pushed me to create the most useful and relevant tool I could and assisted with testing. Thank you to Professor Neamtu, who constantly gave candid feedback and reminded me to keep the scope of the project under control. I would also like to thank Fareya Ikram, Natalie Bloniarz, and Lucas Sacherer. These WPI students and alumni gave crucial advice about implementation and user-friendly design.

Introduction

Project Motivation

Protein functions are defined not just by the sequence of amino acids they are composed of, but by their 3-dimensional structure. A protein's behavior can be understood from its tertiary and quaternary structure, as well as from which conformations are energetically favorable. The affinity of protein-protein and protein-ligand interactions is controlled by the strength of intermolecular forces and how well the components fit together; thus structure and conformational changes control the speed and efficiency of protein-catalyzed reactions. Because every level of protein structure is so important to understanding these interactions, determining these structures has long been a target of biochemical research.

Experimental discovery of these structures through methods like X-ray crystallography is expensive and time consuming, and current methods of crystallization and structural imaging simply do not work for some proteins. Attempts have been made to predict protein structures from sequence, but it is a complex problem. The folding of amino acids into a protein is affected by too many variables to calculate individually; modelling each amino acid and its conformation is beyond even the newest computational technology.

In the realm of computers, when a problem is simply too big to calculate directly, data science and machine learning can be employed to reduce the computational load and sometimes to arrive at a more optimal solution. Instead of calculating the energy of amino acid interactions and torsion angles, a machine learning approach is based on patterns found in the data. Using already categorized, or labelled, data gives the model a base truth from which to begin forming a pattern for recognizing inputs. This has led to the use of heuristics and machine learning to use data about the amino acid sequence to categorize and predict the shape those peptides will take in vivo.

Problem Statement

One area that is lacking in accurate structural data and predictability is nuclear export sequences. Currently, experimentally verified nuclear export sequences are well known and publicly available in a database called NESbase. This database does not have any structural information associated with its sequences. Functional NESs are difficult to identify, and efforts to build predictive models have produced high rates of false positives and false negatives. These models have mostly been based on sequence data; however, there may be a stronger connection between the structures NESs form than between their raw sequences. Given the importance of secondary, tertiary, and quaternary structure in the function of proteins, it follows that sequences with similar properties lend themselves to repeatable structures that interact with transport proteins. Taking a structural approach in creating a predictive model extends more recent research that has used conformational fit to help identify protein-ligand interactions. A more reliable, systematic method of identification and classification of NESs could make the discovery of new NESs easier, allow biochemists to create new NESs, and aid the understanding of cellular transport mechanisms. The proposed project is an investigation of a

few machine learning-based structural prediction methods and their relevance to classifying nuclear export sequences.

The aim of this project is threefold: to explore modeling and machine learning tools that can be used to classify NESs, to investigate the improvement of NES classification through structural prediction, and to create an API that utilizes open-source tools to create and test models that help identify sequences as functional NESs. This API will be complemented by a user interface that makes model creation and use clear and accessible for users without a computer science background.

Background

Section 1: Intracellular Transport

A key feature of eukaryotic cells that differentiates them from prokaryotic cells is their subdivision into a variety of cellular compartments. This division is accomplished through a complex series of membrane-enclosed organelles that carry out specialized functions within the cell [2]. These specialized organelles often require function-specific conditions within the walls of their membrane. A notable case of this is the lysosome, which contains digestive enzymes. The enzymes housed by the lysosome perform optimally under acidic conditions with a pH of about 5, which is maintained by H+ pumps in the membrane [2]. The separation of these enzymes from the rest of the cell not only provides the unique environment needed for the digestive enzymes to function properly, it also prevents integral parts of the cell from being unintentionally harmed.

Another such membrane-enclosed organelle is the nucleus, which holds the genetic information safely away from the rest of the cell and controls gene expression [2]. Carefully controlled transport to and from the nucleus is crucial to the survival and proliferation of the cell. In order for genes to be expressed properly, genetic controls like transcription factors and organizational proteins like histones that are constructed in the cytosol need to be imported into the nucleus. For much of the RNA created by these genetic controls to be of any use to the cell, it must be exported from the nucleus for protein synthesis or more specialized uses. Some macromolecules, like ribosomes, require multiple passes between the nucleus and cytosol to complete their construction [6].

Transport is made even more difficult by molecules having to cross not one, but two membranes. The nucleus is separated from the cytosol by a double membrane called the nuclear envelope. This double membrane is contiguous with the endoplasmic reticulum. With the exception of small molecules, up to about 40 kDa, passage through this membrane is mediated by nuclear pore complexes (NPCs) [15]. NPCs are made up of nuclear pores, which traverse both the inner and outer membranes of the nuclear envelope, as well as small associated proteins called nucleoporins. Peripheral nucleoporins on both the nuclear and cytoplasmic sides provide a further element of selectivity to the complex. With such heavy regulation, macromolecules over 40 kDa are unable to diffuse through and must be transported across the NPC [5]. Even smaller proteins like histones (20 - 30 kDa) are often actively carried across the NPC [6]. Unlike transport mechanisms for most cellular compartments, nuclear pore complexes do not require proteins transported through them to unfold. Carrier proteins called importins and exportins associate with the macromolecules to be transported into, and out of, the nucleus respectively. To do work in the cell that cannot be accomplished via diffusion, there must be energy released during this process. This energy

comes from the cleavage of phosphoanhydride bonds in guanosine triphosphate (GTP) by a protein called Ran, as shown in Figure 1.

Figure 1: Transport cycles maintained by Ran, Importins (Imp), Exportins (Exp), and NTF2. [6]

Ran is found in two forms, GTP bound and guanosine diphosphate (GDP) bound. Ran-GTP associates with exportins that are already associated with cargo molecules and importins, transports them out of the nucleus, then hydrolyzes the GTP, becoming Ran-GDP, to release the proteins and cargo in the cytosol. Once in the cytosol, the importin associates with a cargo molecule and transports it into the nucleus, and the exportin reenters on its own. A Ran-GDP-specific importer called NTF2 associates with Ran-GDP and shuttles Ran back into the nucleus to be replenished with GTP by the Ran guanine nucleotide exchange factor (Ran-GEF) [5]. The cyclical Ran pathway facilitates the balance of macromolecules entering and leaving the nucleus. Importins and exportins use target sequences to associate with the appropriate cargo molecules [2]. These signals indicate whether the molecule is destined for the cytosol or the nucleus. Some of these labels are not directly readable by importins and exportins, but instead bind adapter molecules that bridge the gap to the carrier proteins [6].

Section 2: Nuclear Export Sequence Structure

Nuclear export sequences (NESs) are localization signals that are recognized by exportins, while nuclear localization signals (NLSs) are their counterparts that are recognized by importins. Experimentally confirmed NESs have been collected into a publicly available database called NESbase. NESbase was used to create a consensus pattern in an attempt to define nuclear export sequences and make them easier to identify. The leucine-rich pattern L-x(2,3)-[LIVFM]-x(2,3)-L-x-[LI] (where 'x' can be any amino acid, the numbers in parentheses indicate how many of them there can be, and the amino acids in brackets indicate that one of them occupies that position) was widely accepted. However, this pattern only fits about 60% of the sequences in NESbase, highlighting the pattern's lack of sensitivity [23]. More research was done on the binding of NESs to CRM1, the canonical exportin, to understand how one protein can bind to sequences that have such variability. Güttler et al. focused on the structure of PKI- and Rev-type nuclear export sequences. It was found that the NES regions of PKI-type proteins form an amphipathic α-helix that fits hydrophobic residues into hydrophobic pockets in CRM1.

Figure 2: Visualization of PKI-type NESs docked in the hydrophobic cleft of CRM1 [7]

Rev-type NESs do not bind in this way; instead, short spaces between hydrophobic residues create an extended structure that serves as an alternative way to dock to CRM1 and bind to the five hydrophobic pockets [7]. These two classes of NES structure appear to extend across sequences. A large portion of NES secondary structures that have been empirically identified take short α-helix and loop conformations, akin to the PKI-type, while many of the remaining sequences form long helices and β-strands, a possible extended form [23].

Section 3: Existing Prediction Methods

Researchers behind papers like Sequence and Structural Analyses of Nuclear Export Signals in the NESdb Database use proprietary structure prediction software to model and calculate the energy of proteins [23]. These projects are based on a variety of prediction methods, but many rely on variations of energy minimization or on templates to start from. An energy minimization relies on the principle of free energy, which indicates that the polypeptide chain will fold in such a way that its free energy is minimized. In other words, a conformation will be taken that allows each bond in the molecule the most freedom to rotate and vibrate. Energy minimizations calculate the free energy of different models and move forward with the model that produces the minimum free energy. Predictive methods that start from a template use a protein with a similar sequence and begin modeling from that protein's structure.

One hybrid project is a popular and versatile software suite called Rosetta. It began as a structure prediction software, but has grown into a tool to solve common macromolecular problems including ligand docking, enzyme design, and macromolecular complex prediction. The core functionality of Rosetta is energy minimization through matching fragments of proteins that are similar to sections of the desired protein. The ab initio predictions are achieved through a series of steps that utilize a fragment library as starting points and apply energy minimization functions from there. The first few steps begin with Monte Carlo sampling, or repeated random sampling, of larger fragments, then cycle through to find the minimum energy and score the accuracy of the result [13]. To account for the exact nature of the scoring functions, the models can also be relaxed to check for matches that are close to, but not identical to, a model that may score much higher [13].

The models that come from Rosetta and programs like it, along with empirically determined structures, form the basis for other research. The classification of whole proteins or protein fragments is a problem that is often aided by the use of structural data. Helix caps, which tend to be found at the ends of helical sections of proteins, are often difficult to identify due to their more varied shapes. They fold into 'loops' and 'turns', whereas most peptide chains tend to fold into more repetitive structures like α-helices and β-sheets. Mullane et al. used well-known machine learning techniques to improve the identification and classification of helix caps. They extracted torsion angles from empirically proven 3-D structures and added context with values like hydrophobicity and charge. This information was used in a support-vector clustering (SVC) model and in a recurrent neural network (RNN) that employed long short-term memory (LSTM) nodes to decide whether or not a given sequence of amino acids was a helix cap. While the SVC model was not very accurate, the RNN had a peak accuracy of about 87% [17]. These types of studies demonstrate a proof of concept for using machine learning to solve complex biochemical problems.

Section 4: Machine Learning Techniques

Machine learning is a broad term that encompasses methods that allow computer algorithms to improve with pattern recognition and the addition of data. Many of these techniques and models are based heavily in statistics. Machine learning is used with sets of data, typically large ones, that need to be interpreted, separated into groups, or used for prediction of related data. The applications of these mathematical models are far-reaching, from

market trend prediction to evidence of climate change, from traffic prediction to understanding genomic data. Models that 'learn' do so through training on data input. This data can be supervised, with the data labelled as belonging to a group, or unsupervised, with the data having no labels [14]. The model then tries to fit the training data, or training set, picking up patterns that separate the given groups or can be used to separate the data into a specified number of groups. The fit model is then tested for correctness on a different set of data, called a test set, that is either labelled for immediate results, or unlabelled and experimentally proven or disproven. Machine learning models are trained and tested on sets of data that are, ideally, generated in the same way [14]. Since biological data, especially those derived from DNA, are entirely built on sequences, they allow powerful pattern detection algorithms and models to give reasonable, and often accurate, predictions and classifications. Machine learning models have frequently been used to annotate genomes based on examples of gene structure, predicting the location of promoters, transcription start sites, splice sites, etc. [14]. A particularly useful technique for breaking complex systems into simpler computations is the neural network [10]. Once a problem is broken up in this way, the weight of each smaller calculation can be altered to affect the output. Using these mutable weights, a neural network can be tuned to produce accurate results at high computation speed [10].

4.1 Recurrent Neural Networks (RNNs)

Neural networks are widely used for pattern recognition across disciplines. They are well-suited to multivariate time series, like the folding of amino acid sequences. In the case of models dealing with proteins, simply finding patterns in the sequence will likely not be enough. The units in the sequence (amino acids) interact with other units that are not direct, or even close, neighbors, so simple pattern detection cannot capture the protein's true behavior. In cases like these, some of the most successful models have been recurrent neural networks (RNNs). While traditional neural networks simply pass data through a node and apply a function to it, RNNs have feedback connections that incorporate temporal information from the node's previous output [16]. In this way, RNNs are able not only to learn from, but also to use, previous data in the series to make contextualized decisions. This context improves the performance of RNNs on long-term predictions when compared to other types of neural networks, while they still perform well on shorter-term problems [16]. RNNs can handle a large input set well and are a good choice for a problem affected by as many variables as protein folding.

4.2 Nuclear Export Sequences as Time Series

In some modelling methods, data is represented as a time series. A time series is a set of data points arranged in some kind of sequential order, usually over time [1]. Time series can be univariate or multivariate, characterized by whether each data point or record represents one (uni) or multiple (multi) variables [1]. Time series data can also be continuous, with values for every time, or discrete, with values at intervals. Though time series are typically measured over time, polypeptide chains can also be viewed as time series. The chain is a sequence where instead of seconds or years, an amino

10 acid’s position in the protein is the ‘time’. Due to the nature of peptides, the sequence is a discrete time series. The use of different features of each amino acid in the model adds more variables to each record, making the sequence a multivariate time series.

4.3 Autoregressive Integrated Moving Average Model (ARIMA)

Another method that handles multivariate time series and has strong predictive power is the Autoregressive Integrated Moving Average Model, usually called ARIMA. So long as the time series has some sort of pattern and isn't just white noise, ARIMA can be used. The base ARIMA method does not require repeated trends, or seasonality, to predict accurately. The idea of this method is to create a linear regression model based on past, or lag, values. An error factor is added based on the difference between the actual past values and the forecasted values when the equation is applied to predict known values [3]. The name is descriptive, with each letter or two relating directly to a term that describes the pieces involved in creating the equation.

In order for a linear regression model to work, the time series needs to be stationary. This just means that its statistical properties, like mean, variance, etc., are constant throughout the time series. To convert a nonstationary series into a stationary one, differencing is often used. Differencing is subtracting the previous value from the current value, which can be done as many times as is needed to make the time series stationary. The number of times the series needs to be differenced is called the degree of differencing, otherwise known as the I, or integrated, term [3]. The AR, or autoregressive, term is also called the lag order. It references the number of previous values, or lags, to use in the prediction [3]. In a pure AR model, each variable's prediction would depend only on its own past values. The moving average window, or MA term, is the piece that incorporates the error of the prediction between past values by referencing the number of lag errors to use in the forecast [3]. In a pure MA model, each variable's prediction would be based solely on the error between a set number of past values.

The moving average window is the term that will affect how far back the prediction will look to relate an amino acid to its predecessors. Without a large window, the ARIMA model may struggle with some of the relationships between peptides that are far apart in sequence, but actually fold closely. Though there may be limitations in the ARIMA model when looking at larger sequences, NESs are much smaller than an entire protein. This model's incorporation of past values and error allows for the possibility of structure prediction. With a somewhat large window to look at the whole NES at once, this model can take each amino acid into account for the prediction.
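
As a concrete illustration of the (p, d, q) terms described above, the sketch below fits an ARIMA model to a toy series using the statsmodels library. This library is not part of the project's tool, and the series and chosen order are made up purely for illustration.

    # Illustrative only: ARIMA(p, d, q) with statsmodels (not part of the tool).
    from statsmodels.tsa.arima.model import ARIMA

    # Toy univariate series standing in for one per-residue feature
    # (e.g., hydropathy) measured along a short peptide chain.
    series = [1.8, -3.5, 4.2, -0.4, 3.8, -3.5, 2.8, -1.6, 3.8, 4.5]

    # order=(2, 1, 1): lag order (AR) = 2, degree of differencing (I) = 1,
    # moving average window (MA) = 1.
    fitted = ARIMA(series, order=(2, 1, 1)).fit()

    # Forecast the next two values in the series.
    print(fitted.forecast(steps=2))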

4.4 One-Class Support Vector Machines (OC SVM)

Departing from structure prediction and moving into classification, one-class support vector machines (OC SVMs) are extremely well-suited for deciding whether data belongs to a group or not. Traditional SVMs, given a set of labelled data, will find a hyperplane that separates the groups of data. The data is represented as points in multidimensional space, with each dimension representing a feature of the data. The two or more classes are separated by creating a hyperplane that maximizes the margin between itself and the data points of each group. This results in a hyperplane that is equidistant from the groups of data being separated, creating a line of distinction between them. A hyperplane is a representation of space that is one dimension less than the overall space being discussed. For example, in two dimensions, a one-dimensional line is a hyperplane. Another example is that in three

dimensions, a two-dimensional plane is a hyperplane. Examples of hyperplanes in three dimensions are shown in Figure 3.

Figure 3: Hyperplanes that all pass through a common point. [9]

With one class, the SVM separates the data from the origin and maximizes a similar margin between the data and the origin [21]. A parameter called nu is the upper bound on the proportion of the data that can be on the wrong side of the hyperplane, and a parameter called gamma determines the impact of each data point on the hyperplane's fit. The data is represented as points in space with the dimension specified by the number of features. If the group cannot be sectioned off by drawing a linear hyperplane as seen in Figure 4 A, a dimension is added and another attempt is made [21]. This happens until a straight hyperplane is possible. This hyperplane is then projected, or reduced, down to fit the constraints of the original dimensions, as seen in Figure 4 B, and represents the separation [21].

Figure 4: (A) A linearly separable data set (B) A nonlinearly separable data set [19]

One of the goals of this project is to explore whether this technique can be used to decide whether or not a sequence is a nuclear export sequence. NESs have well documented similarities in amino acid sequence and structure. These similarities, especially those relating to structure, can be used as features for the SVM to use in the creation of a hyperplane. The

features will be the following traits of each amino acid: hydropathy, isoelectric point, number of pKas, molecular weight, propensity to be in an α-helix, and propensity to be in a β-sheet. Using OC SVMs, known NESs can be used as the group for fitting the model. The model should then be able to distinguish between NESs and sequences that do not fit into the group.
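
The sketch below shows this idea in miniature with Scikit-Learn's OneClassSVM: each 10-residue sequence is flattened into one feature vector and the model is fit on known NESs only. The per-residue feature values and the tiny lookup table are placeholders for illustration, not the tables actually used by the tool.

    # Minimal sketch of the one-class SVM idea; feature values are placeholders.
    import numpy as np
    from sklearn.svm import OneClassSVM

    # Hypothetical lookup: residue -> (hydropathy, isoelectric point, number of
    # pKas, molecular weight, alpha-helix propensity, beta-sheet propensity).
    FEATURES = {
        "L": (3.8, 5.98, 2, 131.2, 1.21, 1.30),
        "E": (-3.5, 3.22, 3, 147.1, 1.51, 0.37),
        # ... one entry per amino acid in the real tool ...
    }
    DEFAULT = (0.0,) * 6  # fallback so this sketch runs with a partial table

    def encode(seq):
        """Flatten a 10-residue sequence into a single 60-value feature vector."""
        return np.concatenate([FEATURES.get(aa, DEFAULT) for aa in seq])

    # Known NESs (from Table 1) form the single training class.
    train = np.array([encode(s) for s in
                      ["LESNLRELQI", "LENNLRELQI", "LTKRIDSLPL",
                       "NLREISQLGI", "KLNEISKLGI"]])

    # gamma="auto" corresponds to 1 / (number of features).
    model = OneClassSVM(kernel="rbf", gamma="auto", nu=0.2).fit(train)

    # predict() returns +1 for sequences inside the learned group (candidate
    # NESs) and -1 for outliers.
    print(model.predict([encode("LESNLRELQI"), encode("AAAAAAAAAA")]))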

4.5 Scikit-Learn

Scikit-Learn is an open-source Python module that creates user-friendly ways to interact with implementations of popular machine learning algorithms [18]. Its license, BSD, is minimally restrictive and makes it accessible to research and commercial endeavors. In addition, the module is open to external contributions, inviting improvement and community-driven changes. The module is built with few dependencies, chiefly NumPy and SciPy, making it easy to incorporate into a project and distribute [18]. The implemented algorithms that Scikit-Learn has available are grouped into classification, regression, clustering, dimensionality reduction, model selection, and preprocessing on the main documentation website. More specifically, some of the machine learning algorithms included in the module are support vector machines, nearest neighbors, and spectral clustering. Nearest neighbors is a simpler algorithm, using the distance between the data to be classified and the training data to make a decision, while spectral clustering is more complex, with data or a graph being separated into groups based on the edges connecting them. The model selection tools include a grid search feature to test parameters with a scoring interface that has pre-built scoring functions like recall, precision, and f1-score, but also leaves the option of creating a custom one. The grid search feature and support vector machine implementation will be used to create an easy-to-distribute tool to identify potentially functional NESs from amino acid sequences.

Tool Overview and Functionality

This tool consists of three main parts:

1. An API that adapts Scikit-Learn tools to the NES classification problem
2. An interactive script that wraps the API (command line)
3. A locally hosted web-based application (GUI)

This section will give an overview of the functionality of these pieces.

Section 1: API Functionality

To implement the OCSVM models, I created an API using Python 3 to adapt Scikit-Learn tools to the NES classification problem. The API includes functions to 1) convert sequence files into feature sets, 2) create a model from a training set, 3) test the model against a test set of sequences, with the option to save the results, and 4) use the model to predict NESs in a protein sequence. The feature set is made up of physical characteristics of amino acids that can be measured quantitatively. The features used in each model are hydropathy, isoelectric point, molecular mass, number of pKas, propensity to be in an α-helix, and propensity to be in a

β-sheet. The only input file required to create the model is a sequence file for training the model. To run a custom test, another sequence file and a matching label file are required. The figure below gives a general workflow for the use of the API, as well as the tools built from it.

Figure 5: General workflow for API and related tools.

To create the sequence files that are used for both train and test sets, I used Microsoft Excel. I entered each 10-amino-acid sequence into a cell in the first column, then saved the file as a CSV instead of an Excel workbook. I used a similar process for the label files that are required for test sets. I again used Microsoft Excel, but put the labels corresponding to each sequence in the test set in cells in the first row, with the first label corresponding to the first sequence and so on. These were also saved as CSV files.
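
For illustration, the file layout described above might look like the following sketch, assuming the files are read without header rows; the API's actual file-reading code is not reproduced here.

    # Hypothetical file contents and one way to read them with pandas.
    import pandas as pd

    # train_sequences.csv -- one 10-residue sequence per row, first column:
    #   LESNLRELQI
    #   LENNLRELQI
    #   LTKRIDSLPL
    sequences = pd.read_csv("train_sequences.csv", header=None)[0].tolist()

    # test_labels.csv -- labels on a single row, one column per test sequence,
    # 1 for an NES and -1 for any other sequence:
    #   1,1,-1
    labels = pd.read_csv("test_labels.csv", header=None).iloc[0].tolist()

    print(sequences, labels)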

Section 2: The Command Line Tool

To make the API easier to use, a wrapper was created in the form of a command line tool. This tool was written as a Python 3 script and requires all relevant files to be in the same directory as the API and tool. The tool depends on several packages to run properly: Python (>= 3.3), NumPy (>= 1.6.1), SciPy (>= 0.9), Pandas, and Scikit-Learn. Instructions for installing Python can be found at https://www.python.org/downloads/. Pip3 can be used to install

most of the other dependencies by running the command: pip3 install numpy scipy pandas. To install Scikit-Learn, run the command: pip3 install scikit-learn. To create custom train and test sets, users can follow the procedure I outlined above. Running the tool requires that the user is in the same directory as the tool when they run the script using the command: python3 ocsvmTool.py. Upon starting, the tool prompts the user for a training set, formatted as specified above. The tool then offers more details about the parameters used in model creation for inexperienced users. The user is then prompted for parameters and provided with defaults. The defaults (kernel=rbf, gamma=1/n_features, nu=0.2) were chosen based on the results of exhaustive grid search testing. Once the model has been created, the user is given a list of available tests, as shown in Figure 6. These testing features include built-in test sets and an option to test the model on a custom test set.

Figure 6: Snapshot of command line tool that shows the test options.

The first testing option is called 'self'. This uses the model to predict the labels of the train set used to build the model, training and testing on the same sequences. This works as a check to assess how well the model has fit to the train set. The second option is called 'consensus' because the built-in test set it uses is made up entirely of NESs that match the consensus pattern. This set is shown in Table 1 and is used to determine if the model can predict NESs that follow the strict consensus pattern.

LESNLRELQI LENNLRELQI LTKRIDSLPL NLREISQLGI KLNEISKLGI

KLECLKSLDL LKMILRLLQI LICSLQSLII LVVFNRGLIL LALKFAGLDL

Table 1: The test set used in consensus tests.

The third option is ‘mixNES’. This test set (shown in Table 2) is made up of 10 NESs that match the consensus pattern and 10 NESs that do not follow the pattern (non consensus). The mixed NES test checks that the model can predict both consensus and non consensus following NESs correctly.


DLRTLQQLFL YLGEILRLAL ILRDFFELRL QLVSFQKLKL LITFINALKL

LESNLRELQI LENNLRELQI LTKRIDSLPL NLREISQLGI KLNEISKLGI

EKEIRKIFSI LLSAVKLLCM IMTLTQLLAL VDDLRLDILL HQLLEVLLAL

ILSLEVKLHL ILSLEAKLNL IVMELDTLEV SQALSIALQV LATYMESMRL

Table 2: The test set used in mixNES tests. The top two rows contain the NESs that conform to the consensus pattern; the bottom two rows contain the nonconforming NESs.

The fourth option is ‘mixWithRandom’. This test uses a test set of 10 NESs that follow the consensus pattern and 10 randomly generated sequences. The set shown in Table 3 is used to see how well the model can differentiate between the most formulaic NESs and noise (the random sequences).

LKDFLKELNI LQKKLEELEL LALKLAGLDI QLPPLERLTL LAAEFRHLQL

LMKLKESLEL LCVRFFGLDL LSSHFQELSI DLRTLQQLFL YLGEILRLAL

LSVDWGKDIF EPLHTGPAIV TKQPCGPGAE NVWNPAHDDL IFKQRQKTWD

AQMKITWECH TMHGCVEPPD FQHYDMVKHA NEPQVYLFHC HNYIILVYDA

Table 3: The test set used in mixWithRandom tests. The top two rows contain the NESs that conform to the consensus pattern; the bottom two rows contain randomly generated sequences.

The fifth option is ‘mixAll’, which uses a test set of five NESs that follow the consensus pattern, five NESs that do not conform to the pattern, and five randomly generated sequences. This test is used to understand the ability of the model to separate NESs from noise, regardless of whether they fit the consensus pattern. The sequences in this test set can be found in Table 4.

MEGCVSNLMV RNLFSQTLSL IVHSLENLSL FTDLFDYLPL ILRLGSNLSL

PDWTIEPFMA FQMFMCDHLM HMCRVAGNKQ RNGLSHDVVC VRQNWRTLAI

LQKKLEELEL LALKLAGLDI QLPPLERLTL LAAEFRHLQL LMKLKESLEL

Table 4: The test set used in mixAll tests. The top row contains the NESs that do not conform to the consensus pattern, the middle row contains randomly generated sequences, and the bottom row contains the conforming NESs.

The sixth option is ‘all’, which runs all of the previous five tests in one command for convenience. The final option, ‘custom’, allows the user to use a test set of their own creation.

When this option is chosen, the user is prompted for a file containing the test sequences formatted in the same way as the training input. Additionally, the user is prompted for a file containing the true labels for their test set as a CSV formatted file with the labels on one line separated by commas. The labels are '1' for an NES, and '-1' for any other sequence. After selecting a dataset to test the model with, a summary of the results is printed, as shown in Figure 7. The true labels are displayed, followed by the predicted labels. The values traditionally in a confusion matrix are shown: true positives, true negatives, false positives, and false negatives. These are followed by the model's recall, precision, and accuracy when applied to the test set.

Figure 7: Example of a summary of test results for a self test.

To use the prediction feature of the tool, a user runs the 'predict' option. The tool then prompts the user to enter an amino acid sequence. The sequence is broken up for analysis using a 10 amino acid window. The sequences predicted to be functional NESs by the model are listed on screen accompanied by their start and stop positions in the entered sequence. The user is then prompted with their options again and can repeat this as many times as they would like. To exit the tool, the user selects the 'exit' option.
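
A rough sketch of how such a 10-residue window could be applied is shown below. It assumes the window moves one residue at a time and that positions are reported 1-based, and it reuses the model and encode function from the earlier sketch; the actual tool's implementation may differ.

    # Hypothetical sliding-window prediction over a full protein sequence.
    def predict_nes(model, encode, protein, window=10):
        hits = []
        for start in range(len(protein) - window + 1):
            chunk = protein[start:start + window]
            if model.predict([encode(chunk)])[0] == 1:  # +1 = predicted NES
                # Record 1-based start and stop positions in the full sequence.
                hits.append((chunk, start + 1, start + window))
        return hits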

Section 3: The Web Application Tool

To make this API more accessible to those unfamiliar with tools like the command line, a web application was created using Flask, a micro web framework written in Python. This framework was used along with HTML to create a user interface that is more intuitive to use than the command line tool. An explanation similar to that of the command line tool is present at the top of the home page. After the explanation, there is an HTML form to fill out to create and test the model, as seen in Figure 8.


Figure 8: A screen capture of the main page of the web application.


Figure 8 cont’d

The form allows the user to upload a CSV formatted sequence file for the train set, choose their model's parameters, select any tests they'd like to run, and predict NESs in an amino acid sequence of at least 10 residues. The parameters are filled by default with the optimal parameters found via grid search, as in the command line tool. The testing options are the same; however, the 'all' option is removed, since each test to be run is checked off before the user submits the request. When the 'custom' option is selected, the user must upload the CSV formatted test sequence file and test label file. After the user has made their selections, they click the submit button at the bottom of the page. A summary is generated for each of the selected tests, as seen in Figure 9. The accuracy is shown as a percentage, followed by the categories traditionally in a confusion matrix: true positives, true negatives, false positives, and false negatives. The sequences from the test set that fall into each category are listed under descriptive names.
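
For readers unfamiliar with Flask, a page like this is typically driven by a small amount of routing code. The sketch below is illustrative only; the route, form fields, and template file names are assumptions, not the actual application's code.

    # Minimal Flask sketch in the spirit of the web application described above.
    from flask import Flask, request, render_template

    app = Flask(__name__)

    @app.route("/", methods=["GET", "POST"])
    def index():
        if request.method == "POST":
            train_file = request.files["train_set"]   # uploaded CSV of NESs
            nu = float(request.form.get("nu", 0.2))   # model parameter from the form
            # ... build the model, run the selected tests, collect results ...
            return render_template("results.html", nu=nu)
        return render_template("index.html")

    if __name__ == "__main__":
        app.run(debug=True)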

Figure 9: A screen capture of the web application where a mixWithRandom test was run.

When the model is used to predict likely functional NESs, the results screen lists the predicted sequences as 10 amino acid chunks, as seen in Figure 10. Each sequence is accompanied by its start and stop positions in relation to the full entered sequence.

Figure 10: A screen capture of the web application prediction page.

Methodology

Section 1: Data Selection

To create a model that recognizes NESs, I needed a large dataset of labeled data for training the model. This labeled data needed to include only confirmed, functional NESs. To compile this dataset, I looked for sequences in peer-reviewed papers that centered around the analysis or discovery of nuclear export sequences in biochemically focused journals. I extracted 120 sequences from five sources and removed any duplicates between them. Sequences 1-7 are from influenza viral proteins [4]. Sequences 8-11 are generally accepted NESs of PLC-δ1 and 12-14 are canonical sequences [24]. Sequences 15-87 were extracted from NESbase indirectly and are a collection of experimentally proven NESs [23]. Sequences 88-116 are DUB cNESs that tested positive for nuclear export activity [20]. Sequences 117-120 are NESs that were mutated to emulate PKI- and Rev-type [7].

The structure of an SVM requires that each compared feature set be the same length. This is because the algorithm uses each feature as a dimension for the sequence being evaluated. An SVM cannot construct a hyperplane if the number of dimensions for each data point varies. Since the transformation of a sequence into features is directly proportional to its length, the sequence length needed to be uniform. According to the literature, typical lengths of NESs range from 8 to 15 amino acids [23]. I chose to use a sequence length of 10 amino acids to limit the introduction of unrelated amino acids when using NESs on the shorter side of the range, and to prevent removing as much as half of the amino acids in the longer NESs. Sequences longer than 10 amino acids were truncated from the left to preserve the same portion of each sequence. The sequences were shown to be highly comparable through ClustalW alignment, as seen in Figure 11.


Figure 11: The extracted sequences as grouped by a ClustalW alignment. The sequences have been colored on a hydropathy scale, with the most hydrophobic residues being red and the most hydrophilic being blue, using NCBI Multiple Sequence Alignment Viewer.


Figure 11 cont’d

The experimentally proven NESs were categorized into two groups, consensus and non consensus, for ease of train set creation and performance tracking when testing the model. Even though they were separated, both groups were labeled as NESs. The point of categorizing the NESs in this way is that sequences following a more easily discernible pattern, the consensus sequences, will likely be learned and predicted by the model more efficiently. Creating a model specifically for consensus sequences first, then adding the less predictable non consensus sequences, allows me to see the effect each sequence has on the model. I chose the consensus pattern from the literature, as it was consistent across publications about nuclear export sequences [12][23].

The sequence dataset was sorted into the two groups using a custom python script that made use of a regular expression based on this pattern. Figure 12 shows that the sequences as a group were leucine rich, especially in the third-from-last position, as predicted by the consensus pattern. Despite many of the compiled sequences sharing some key features with the canonical pattern, only 23 sequences followed the strict consensus sequence.
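
A sorting script of this kind might look roughly like the sketch below, which translates the consensus pattern L-x(2,3)-[LIVFM]-x(2,3)-L-x-[LI] into a Python regular expression; the project's actual script may differ in its details.

    # Sketch: sort 10-residue sequences by the canonical consensus pattern.
    import re

    CONSENSUS = re.compile(r"L.{2,3}[LIVFM].{2,3}L.[LI]")

    def is_consensus(seq):
        """True if the sequence contains the leucine-rich consensus pattern."""
        return CONSENSUS.search(seq) is not None

    sequences = ["LESNLRELQI", "EKEIRKIFSI"]   # one conforming, one not
    consensus = [s for s in sequences if is_consensus(s)]
    non_consensus = [s for s in sequences if not is_consensus(s)]
    print(consensus, non_consensus)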


Figure 12: A WebLogo created by compiling the sequences collected. The size of the letter indicates how frequently the corresponding amino acid was found to be in that position over the entire dataset.

Section 2: Model Metrics and Goals

To assess and improve a model, I need metrics to measure progress toward a goal. Choosing a metric to judge improvement requires me to understand the data and which measures are most important in the context of the problem. I am using a One-Class SVM (OCSVM) model to classify amino acid sequences as NESs or non-NESs because of its ability to identify one group, while classifying anything that varies too much as outside the group. I trained the OCSVM on datasets of NESs so that it identifies functional NESs as belonging to the group and categorizes anything else as outside the group, i.e., not an NES. A positive result in the One-Class SVM model represents a sequence being identified as an NES, while a negative result indicates a sequence that is not an NES.

An SVM can be assessed by accuracy, the number of sequences correctly classified divided by the total number tested. However, accuracy alone can be misleading in an imbalanced problem that has a disproportionate number of instances of one case. In the NES problem, the vast majority of sequences in a genome are not NESs and are completely unrelated, while only a tiny proportion are NESs. For example, if you are screening for a disease that only 1 in 1000 people have and a model assumes that no one has it, the accuracy would be 0.999. Even though the accuracy is high, the point of screening for the disease is missed and the person that needs treatment wouldn't be found; thus, accuracy alone is not a good metric for an imbalanced problem. To mitigate this, classification models can also be evaluated through metrics that center around a confusion matrix, shown in Figure 13, which categorizes the data into four groups: true positives, false positives, true negatives, and false negatives [11].


Figure 13: Example of a confusion matrix [11]

Recall is the proportion of true positives to the sum of true positives and false negatives, or the percentage of data points that should have been identified as positive that were correctly identified. Precision is the proportion of true positives to the sum of true positives and false positives, or the percentage of identified positives that were correctly identified. Both recall and precision are combined in an f1 score, which is the harmonic mean of the recall and precision scores [11]. The way the f1 score is calculated gives equal weight to both metrics and finds a balance that maximizes them both without detriment to either.
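
The arithmetic behind these metrics is simple enough to show directly; the counts below are invented solely to illustrate the formulas.

    # Metrics from the four confusion-matrix counts (illustrative values).
    tp, fp, tn, fn = 8, 2, 9, 1

    recall = tp / (tp + fn)              # share of true NESs that were found
    precision = tp / (tp + fp)           # share of predicted NESs that are real NESs
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

    print(recall, precision, accuracy, f1)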

In the OCSVM model, the NESs make up the training data, making NESs the identifiable (positive) group. A false positive therefore misidentifies a sequence with no export function as a functional NES, which is a highly undesirable outcome. A false negative is an NES that was not identified, which, while undesirable, is more acceptable. NESs are much less abundant than sequences that have no localization function, which makes them difficult and expensive to locate, so it is vital to keep the set of sequences classified as NESs as free of non-NESs as possible. On the other hand, if an NES is misclassified as a non-NES, it may still be found experimentally. Since false negatives and false positives carry different costs in the context of classifying NESs, I chose recall as the metric to maximize at the potential cost of precision. As recall and precision are not being weighted equally, the f1 score will not be used. The training goals for the model are to maximize recall while maintaining a precision that is better than that of a random model.

Section 3: Testing Procedure

3.1 Finding the Best Parameters

To begin the search for the best predictive model, I needed to find suitable parameters for the support vector machine. Because the parameters may behave differently with very different train set sizes, training sets of 5 (small), 10 (medium), and 20 (large) sequences were compiled from the collected sequences. In the preliminary testing,

the training sets were made up entirely of strict consensus NESs to match the performance of other NES predictors that use the canonical pattern exclusively. The sizes are representative of roughly a quarter, half, and all, respectively, of the NESs that fit the strict consensus sequence. The small and medium sets are disjoint, but both are combined in the large set. Each set was used to train a model with combinations of varied parameters. Nu, the upper bound on the proportion of samples on the wrong side of the model's hyperplane, was tested between 0.1 and 0.9 in 0.1 intervals. This spread of values tests whether the model performs best with a minimal proportion (10%) of values allowed outside their true group, a substantial proportion (90%), or something in between. Gamma, the measure of impact of each feature value, was tested at 1/(number of features), 0.001, and 0.0001. The gamma proportional to the number of features had each value making an impact not exceeding its fair share compared to the rest, while the other two gamma values tested whether smaller adjustments would be more effective. The recall, precision, and accuracy were collected for each combination through a grid search applied using Scikit-Learn. Due to the data being made up entirely of NESs, the precision was 1.0 for each test. In addition, the accuracy was the same as the recall because the train and test set were identical. The test sets were kept the same to standardize the results for comparison of the train sets to find the best one. The best recall values for each training set size can be seen in Table 5.

Train Size Recall Gamma Nu

5 (small) 0.8 1/features 0.2 - 0.4

10 (medium) 0.8 0.0001 0.2 - 0.4

20 (large) 0.9 1/features 0.1

Table 5: Results of parameter testing.

The best recall score was 0.9 with the largest train size, a gamma of 1/(number of features), and a nu of 0.1. The overall results suggest that a smaller nu and a larger train size tend to give higher recall. A scaled gamma produced more of the best recall values (2 of 3), but did not produce the best recall in every test. Based on this test, the parameters chosen for use throughout testing were a gamma of 1/(number of features) and a nu of 0.2; a sketch of how such a parameter sweep can be run is shown below.
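
One way to run a sweep like this with Scikit-Learn is sketched below, using ParameterGrid and recall_score; the project's actual grid search code is not reproduced in this report, so this setup is an assumption. Because the training set contains only NESs, the true label for every sequence is 1.

    # Sketch of a parameter sweep over nu and gamma for a one-class SVM.
    import numpy as np
    from sklearn.model_selection import ParameterGrid
    from sklearn.metrics import recall_score
    from sklearn.svm import OneClassSVM

    X_train = np.random.rand(20, 60)   # placeholder: 20 sequences x 60 features
    y_true = np.ones(20)               # every training sequence is an NES

    grid = ParameterGrid({
        "nu": [round(0.1 * i, 1) for i in range(1, 10)],   # 0.1 .. 0.9
        "gamma": ["auto", 0.001, 0.0001],                  # 'auto' = 1 / n_features
    })

    best_score, best_params = -1.0, None
    for params in grid:
        model = OneClassSVM(kernel="rbf", **params).fit(X_train)
        score = recall_score(y_true, model.predict(X_train))  # equals accuracy here
        if score > best_score:
            best_score, best_params = score, params

    print(best_params, best_score)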

3.2 Finding the Best Train Set

The size of the train set also has a large impact on the predictive power of the model. In many situations, the more training data, the better the model. However, in some cases an overabundance of training causes the model to overfit to the point of inaccuracy. To find the optimal size for a train set, I tested sets from size 2 to 23 against two different test sets. The train sets were again made up of only consensus pattern NESs. Each train set was tested against the built-in 'mixWithRandom' set (see Table 3) to check for consensus NES recognition and ability to correctly categorize random sequences, as well as a set of 10 randomly generated

amino acid sequences (see Table 6) with one consensus NES for a more accurate simulation of the problem.

LSVDWGKDIF EPLHTGPAIV TKQPCGPGAE NVWNPAHDDL IFKQRQKTWD

AQMKITWECH TMHGCVEPPD FQHYDMVKHA NEPQVYLFHC HNYIILVYDA

Table 6: The set of randomly generated sequences used for testing with only one NES at a time.

The results of the best five tests for each test set are shown in Table 7 and Table 8 below.

Table 7: Top five results of different train sizes (2-23) tested against ‘mixWithRandom’ test set. The top result is highlighted.

Table 8: Top five results of different train sizes (2-23) tested against the test set with all but one being randomly generated. The top three results are highlighted.

A train size of 17 was optimal in the first test, and was just shy of being one of the best performers of the second test. I concluded that, at least for completely consensus train sets, a size of 17 sequences was the best option for NES recognition. I then moved to determining the best set of 17 sequences to use as the train set. Since the train set was taken from the 23 total consensus NESs, a limited number of sets can be created. I created five other train sets from the consensus NESs and tested those five alongside the original set of 17 with the ‘mixWithRandom’ set (see Table 3). The results of the testing can be seen in Table 9.


Table 9: Results of test to find the best train set of 17 consensus NESs. The best and the worst scoring sets are highlighted.

I concluded that the best train set of 17 consensus NESs was the original set I had used for testing. To ensure that earlier results were not biased by overlapping train and test sets, I tested the best set of 17 sequences against the randomly generated set (see Table 6) with each of the 23 consensus sequences added separately. Seventeen of the sequences are used in the train set, so those would be expected to produce correct predictions, while the predictions for the six sequences that are not in the train set would be based only on what the model has learned. A summary of the unexpected results of this test can be seen in Table 10.

Table 10: A summary of unexpected and failing results for the bias test for the best set of 17 sequences. The entries highlighted in red were predicted incorrectly despite being in the train set, the entries highlighted in green were predicted correctly, but not in the train set, and the entries in black were not predicted correctly, but were not in the train set.

This test illustrates that the earlier test was not biased by the single consensus sequence chosen for testing. The next step in finding the best training set was to begin adding NESs that do not conform to the consensus pattern to the train set. Ten different nonconforming NESs were added, one at a time, to the best set of 17 conforming sequences and tested against the 'mixWithRandom' set (see Table 3). This was done to see whether the addition of a nonconforming sequence would harm the performance of the model when predicting conforming sequences. The results of adding each sequence are shown in Table 11.

Table 11: Results of each non consensus conforming sequence added to the best set of 17 conforming sequences and tested against the ‘mixWithRandom’ set. The best performing sequence is highlighted.

The best non conforming sequence to add appears to be number 6, as it does not interfere with the model’s ability to predict conforming sequences. To see if the model’s performance would be less affected by the addition of a non conforming sequence with a larger set of conforming sequences, I tested the set of all 23 conforming sequences in the same way as the best set of 17. The same non conforming sequences were added one at a time and tested against the ‘mixWithRandom’ set (see Table 3). The results of this test can be seen in Table 12.


Table 12: Results of each non consensus conforming sequence added to the set of 23 conforming sequences and tested against the ‘mixWithRandom’ set. The best performing sequences are highlighted.

The addition of the nonconforming sequences was not as detrimental to the performance of the model trained with the set of 23 conforming sequences as to that trained with the best set of 17. Additionally, the model based on the train set of 23 gave the best performance observed so far when either nonconforming sequence 6 or 7 was added. The larger train set may be more robust, but more testing is needed on this topic. The best training set made up of only consensus pattern conforming NESs is of size 17 and can be seen in Table 13.

LKDFLKELNI LQKKLEELEL LALKLAGLDI QLPPLERLTL LAAEFRHLQL LMKLKESLEL

LCVRFFGLDL LSSHFQELSI DLRTLQQLFL YLGEILRLAL ILRDFFELRL QLVSFQKLKL

LITFINALKL LESNLRELQI LENNLRELQI LTKRIDSLPL NLREISQLGI

Table 13: The best train set of 17 consensus conforming sequences.

Case Study

To assess the viability of this tool, I ran a case study on a protein from a virus called porcine circovirus type 1 (PCV1) that is actively being studied for use in medical applications: specifically, the protein encoded by a gene called VP3. A close relative of PCV1 is chicken anemia virus (CAV), which also has a VP3 gene with a similar function [8]. CAV-VP3, unlike PCV1-VP3, has a well-characterized NES at amino acids 97-105, as seen in Figure 14. This relationship was used as a positive control to tune the model to the viral family being analyzed.


Figure 14: Schematic representation of CAV-VP3 showing the locations of a leucine rich sequence, two nuclear localization signals, and one nuclear export sequence. [22]

The best train set of 17 consensus conforming sequences (Table 13) was used as the training set. The default nu of 0.2 was found to predict too many sequences. Once the nu was changed to 0.6, the NES area was correctly predicted, as seen in Figure 15.

Figure 15: Snapshot of web application prediction page for CAV-VP3.

The tool predicted an NES in five 10-amino-acid windows centered around the actual NES. The window created by the tool's prediction extends two amino acids before the actual first residue of the NES and three amino acids after the actual end of the NES. This indicates that the tool predicted a functional NES in the positive control within three amino acids on either end. The same model made to evaluate the control was used to predict NESs on the PCV1-VP3 polypeptide. The results can be seen in Figure 16 below.

Figure 16: Snapshot of web application prediction page for PCV1-VP3.

Since the NES(s) in PCV1-VP3 have not been confirmed, the results are compared to those of an existing prediction algorithm. The same polypeptide was run through the NESbase prediction algorithm and the results are shown below in Figure 17.


Figure 17: NESbase prediction algorithm applied to PCV, predicting two NESs (using 0.5 as the threshold). One at amino acids 42-49, and one at amino acids 134-139. [8]

As can be seen on inspection of Figures 16 and 17, both predictions cover the same sections of amino acids. Both algorithms predict two NESs, one at around amino acids 40-50, and another between amino acid 130 and amino acid 150. The high degree of agreement between the two methods demonstrates the effectiveness of this tool. Some advantages of the tool I've created are the clearly defined start and stop positions of predicted sequences, which make the prediction easy to read, and the ability to be tuned to protein families for more specific searches.

Future Work

In this project, I’ve created a tool for use in a biochemical lab that employs OCSVMs, but there is room to expand further. This API and tool could be extended to make use of other machine learning algorithms, such as recurrent neural networks, to potentially improve predictive power. More amino acid features could be tested, especially in the realm of steric effects, to add data that informs protein folding. From the perspective of the user, the tool could implement a feature to allow the use of FASTA files in place of CSV files, as well as the ability to predict on multiple polypeptides at a time. The area with the most room for improvement is the training set and the dataset in general. I compiled 120 experimentally proven NESs from peer-reviewed papers, but machine learning methods tend to work best on much larger datasets. The way papers have been and are currently published is not conducive to compiling large datasets, as the actual sequences are embedded in text or figures, or simply referenced and not included. NESbase is a start toward compiling NES data, but it is unwieldy to extract data from. During this project, an attempt was made to scrape its HTML for sequences, but it was not possible to isolate the NESs from the full protein sequences at that time. Better access to larger datasets would likely drastically improve the capabilities of this tool.
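As a starting point for the FASTA feature mentioned above, the sketch below shows one way such a conversion could work. It assumes the tool accepts a CSV with one sequence per row (the exact column layout expected by trainTest.csv may differ), uses only the Python standard library, and the file names in the usage example are hypothetical.

import csv

def fasta_to_csv(fasta_path, csv_path):
    # Collect each FASTA record's sequence and write it as one row of the CSV.
    sequences, current = [], []
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):      # a header line starts a new record
                if current:
                    sequences.append("".join(current))
                    current = []
            else:
                current.append(line)
    if current:                           # flush the final record
        sequences.append("".join(current))
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        for seq in sequences:
            writer.writerow([seq])

# Example: fasta_to_csv("pcv1_vp3.fasta", "pcv1_vp3.csv")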

References

[1] Adhikari, R., & Agrawal, R. K. (n.d.). An Introductory Study on Time Series Modeling and Forecasting.

[2] Alberts, B. (2014). Chapter 15: Intracellular Compartments and Protein Transport. Essential Cell Biology (4th ed., pp. 488-497). New York: Garland Science.

[3] Brownlee, J. (2019). A Gentle Introduction to the Box-Jenkins Method for Time Series Forecasting. Retrieved from https://machinelearningmastery.com/gentle-introduction-box-jenkins-method-time-series-forecasting/

[4] Chutiwitoonchai, N., Kakisaka, M., Yamada, K., & Aida, Y. (2014). Comparative Analysis of Seven Viral Nuclear Export Signals (NESs) Reveals the Crucial Role of Nuclear Export Mediated by the Third NES Consensus Sequence of Nucleoprotein (NP) in Influenza A Virus Replication. PLoS ONE, 9(8), e105081. https://doi.org/10.1371/journal.pone.0105081

[5] Dominko, T. (2019). CH4190: Regulation of Gene Expression, week 5, session 1 [PowerPoint slides].

[6] Görlich, D., & Kutay, U. (1999). Transport Between the Cell Nucleus and the Cytoplasm. Annual Review of Cell and Developmental Biology, 15, 607–660. doi: 10.1146/annurev.cellbio.15.1.607

[7] Güttler, T., Madl, T., Neumann, P., Deichsel, D., Corsini, L., Monecke, T., … Görlich, D. (2010). NES consensus redefined by structures of PKI-type and Rev-type nuclear export signals bound to CRM1. Nature Structural and Molecular Biology, 17(11), 1367–1376. https://doi.org/10.1038/nsmb.1931

[8] Hough, K. P., Rogers, A. M., Zelic, M., Paris, M., & Heilman, D. W. (2015). Transformed cell-specific induction of apoptosis by porcine circovirus type 1 viral protein 3. Journal of General Virology, 96(2), 351–359. https://doi.org/10.1099/vir.0.070284-0

[9] Hyperplane—Wolfram Language Documentation. (n.d.). Retrieved January 11, 2020, from https://reference.wolfram.com/language/ref/Hyperplane.html?view=all

[10] Libbrecht, M. W., & Noble, W. S. (2015, May 18). Machine learning applications in genetics and genomics. Nature Reviews Genetics. Nature Publishing Group. https://doi.org/10.1038/nrg3920

[11] Koehrsen, W. (2018). Beyond accuracy: precision and recall. Towards Data Science. https://towardsdatascience.com/beyond-accuracy-precision-and-recall-3da06bea9f6c.

[12] Kosugi, S., Hasebe, M., Tomita, M., & Yanagawa, H. (2008). Nuclear export signal consensus sequences defined using a localization-based yeast selection system. Traffic, 9(12), 2053–2062. https://doi.org/10.1111/j.1600-0854.2008.00825.x

[13] Leaver-Fay, A. (2019). Rosetta Bootcamp February 2019 [PowerPoint slides]. Retrieved from https://drive.google.com/drive/folders/121EJnKcjG6qRiNdDCwVpOOgsmuquSA0f

[14] Libbrecht, M. W., & Noble, W. S. (2015, May 18). Machine learning applications in genetics and genomics. Nature Reviews Genetics. Nature Publishing Group. https://doi.org/10.1038/nrg3920

[15] Marfori, M., Mynott, A., Ellis, J. J., Mehdi, A. M., Saunders, N. F. W., Curmi, P. M., … Kobe, B. (2011, September). Molecular basis for specificity of nuclear import and prediction of nuclear localization. Biochimica et Biophysica Acta - Molecular Cell Research. https://doi.org/10.1016/j.bbamcr.2010.10.013

[16] Miljanovic, M. (n.d.). Comparative analysis of Recurrent and Finite Impulse Response Neural Networks in Time Series Prediction.

[17] Mullane, S., Chen, R., Vemulapalli, S. V., Draizen, E. J., Wang, K., Mura, C., & Bourne, P. E. (2019). Machine Learning for Classification of Protein Helix Capping Motifs. 2019 Systems and Information Engineering Design Symposium (SIEDS). doi: 10.1109/sieds.2019.8735646

[18] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., … Duchesnay, É. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12. Retrieved from http://scikit-learn.sourceforge.net.

[19] Raschka, S. (2014). Naive Bayes and Text Classification. Retrieved from http://sebastianraschka.com/Articles/2014_naive_bayes_1.html

[20] Rodríguez, J. A. (2012). A global survey of CRM1-dependent nuclear export sequences in the human deubiquitinase family. Biochem. J., 441, 209–217. https://doi.org/10.1042/BJ20111300

[21] Vlasveld, R. (2013). Introduction to One-class Support Vector Machines. Retrieved from http://rvlasveld.github.io/blog/2013/07/12/introduction-to-one-class-support-vector-machines/

[22] Wang, Y., Song, X., Gao, H., et al. (2017). C-terminal region of apoptin affects chicken anemia virus replication and virulence. Virology Journal, 14, 38. https://doi.org/10.1186/s12985-017-0713-9

[23] Xu, D., Farmer, A., Collett, G., Grishin, N. V., & Chook, Y. M. (2012). Sequence and structural analyses of nuclear export signals in the NESdb database. Molecular Biology of the Cell, 23(18), 3677–3693. https://doi.org/10.1091/mbc.E12-01-0046

[24] Yamaga, M., Fujii, M., Kamata, H., Hirata, H., & Yagisawa, H. (1999). Phospholipase C-δ1 contains a functional nuclear export signal sequence. Journal of Biological Chemistry, 274(40), 28537–28541. https://doi.org/10.1074/jbc.274.40.28537

Appendix A: API Manual

Getting Started

To use the API directly, or the command line wrapper, you will need access to a command line terminal. For simplicity, these instructions are written for a Linux system. To begin, open a terminal window.

The API has some dependencies that need to be installed on your system for it to run properly. These are Python (>= 3.3), NumPy (>= 1.6.1), SciPy (>= 0.9), Pandas, and Scikit-Learn. Instructions for installing Python can be found here: https://www.python.org/downloads/. Most distributions of Python come with pip, a Python package management system, which makes installing the rest of the packages simple. To install NumPy, SciPy, and Pandas, you can use this command:

pip3 install numpy scipy pandas

The use of pip3 just means that the packages will be available when using python3. The last thing you need to run the simplest wrapper of the API is scikit-learn. To install this, you’ll make use of a more general package management system called apt-get that usually comes with a Linux environment. Running this command in your terminal should get you up and running:

sudo apt-get install python-sklearn

Now you should be ready to download the API and its wrapper. To download the files you need, go to https://github.com/emconneilly/mqp/tree/master/demo and download the demo folder.

Running the API

To run the API, you’ll need to go into the demo directory you downloaded. To do this, use the cd command followed by the path you saved the folder to. An example of this would be:

cd Downloads/demo/

Once in the correct folder, you should be able to start up the command line tool by running the command:

python3 ocsvmTool.py

From there, you’ll be prompted by the tool to give a train set, parameters, and which tests you’d like to run with your model.

The tool explains the parameters you can change and provides defaults for them. To use the default value, simply hit enter instead of making a choice, and you will be taken to the next parameter. Once your parameters have been chosen, your model is built and can be tested. You can choose from some built-in tests, or create your own.

The built-in tests run your model against pre-made test sets that are in the demo folder you downloaded. A sample training set is also provided in the demo folder; it is called trainTest.csv. This can be used to figure out how the API works and as a model for how to set up your own train set. If you choose to test your model against a custom test input, you’ll have to enter another file. There is a sample of this as well: the sample test set is called customTest.csv and the accompanying label file that is required is called customTestLabels.csv. The labels are needed to compare the model prediction against the actual values of the test set to give metrics on your model’s recall, precision, and accuracy. Additionally, other example sequence and label files are included to help you learn the tool. All sequence files match a label file with the same name, but with an ‘L’ appended before the file extension.
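For reference, the metrics the tool reports can be reproduced with scikit-learn's own scoring functions. The sketch below is a minimal example, assuming labels and predictions are encoded as +1 (functional NES) and -1 (not an NES); the label and prediction values shown are hypothetical, not taken from any file shipped with the tool.

from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 1, -1, -1, 1, -1]   # hypothetical labels from a *L.csv label file
y_pred = [1, -1, -1, -1, 1, 1]   # hypothetical model output for the same sequences

print("accuracy: ", accuracy_score(y_true, y_pred))    # correct predictions / total
print("precision:", precision_score(y_true, y_pred))   # true NESs / predicted NESs
print("recall:   ", recall_score(y_true, y_pred))      # true NESs / actual NESs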

If you create your own train or test sets, you’ll have to make sure to add them to the folder with the tool before running it.

Appendix B: Web Application Manual

Getting Started

To use the web application for the OCSVM modeling API, you need to download the executable file. The Windows executable file and accompanying README can be found here: https://github.com/emconneilly/mqp/releases/tag/4.20.2020

Running the Web Application

To run the application, you’ll need to run the executable. This creates a website hosted on your computer, so only you will be able to access the application. Open your web browser and go to the URL localhost:5000. From there, you’ll be prompted by the app to upload a train set, enter parameters, select which tests you’d like to run with your model, and enter sequences you’d like to make a prediction from.


The tool explains the parameters you can change and provides defaults for them. To use the default value, simply leave that parameter box as is. You can choose from some built-in test sets, or create your own. A sample training set is also provided in the examples folder; it is called trainTest.csv. This can be used to figure out how the API works and as a model for how to set up your own train set. If you choose to test your model against a custom test input, you’ll have to enter another file. There is a sample of this as well: the sample test set is called con23.csv and the accompanying label file that is required is called con23L.csv. The labels are needed to compare the model prediction against the actual values of the test set to give metrics on your model’s recall, precision, and accuracy.

Figure 9: A screen capture of the web application where a mixWithRandom test was run.

To use the model to predict likely functional NESs, simply paste the sequence into the box at the bottom of the main page and deselect any tests. The results screen lists the predicted sequences as 10-amino-acid chunks, with each sequence indicating its start and stop positions relative to the full entered sequence.
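To make those start and stop positions easier to interpret, the sketch below shows how a full polypeptide can be broken into overlapping 10-amino-acid chunks with 1-based positions. The one-residue step size and the example sequence are assumptions made for illustration, not necessarily the exact windowing the tool uses internally.

def windows(sequence, size=10):
    # Yield 1-based start/stop positions and the 10-residue chunk at each position.
    for start in range(len(sequence) - size + 1):
        yield start + 1, start + size, sequence[start:start + size]

for start, stop, chunk in windows("MIRLPITHAALKDFLKELNIQWERTY"):
    print(f"{start:>3}-{stop:<3} {chunk}")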

For prediction, it is best practice to use a positive control in the same or a similar protein family to create a suitable model. An example of this can be found in the Case Study section of this report. If this tool is being used frequently, look for the ‘uploads’ folder in the same directory as the executable. This is where the tool stores the files you give it. Clear it out every so often to ensure that it doesn’t eat into your computer’s storage.
